Troubleshoot benchmark

Before you start

The benchmark module runs retrieval and optional faithfulness scoring, then gates CI on configurable thresholds. Confirm the following before digging deeper:

Symptom table

If you observe Check
main() raises an exception Read the full traceback — the file and line number identify the exact raise site in src/attune_rag/benchmark.py
main() returns a non-zero exit code Check whether a precision, recall, or faithfulness threshold was not met; thresholds are configurable
Benchmark passes locally but fails in CI Compare threshold flags and query file paths between your local invocation and the CI command
Faithfulness scoring step is skipped silently Confirm you passed --with-faithfulness; without it, faithfulness scoring does not run
Benchmark runs but produces unexpected scores Verify the query file content — malformed or empty queries produce misleadingly low precision/recall
Slow benchmark run Check whether --with-faithfulness is enabled; faithfulness scoring adds a call to an external scorer and is the most expensive step

Step-by-step diagnosis

  1. Reproduce the failure with a minimal query file. Create a single-entry query file and run:

    python -m attune_rag.benchmark --query-file minimal.jsonl
    

    If the failure reproduces, the problem is not specific to your full dataset.

  2. Enable DEBUG logging. Re-run with verbose output to expose intermediate retrieval results and threshold comparisons:

    python -m attune_rag.benchmark --query-file minimal.jsonl --log-level DEBUG
    
  3. Check the exit code explicitly. main() returns 0 on success. Any other value means a threshold was not met or an error occurred:

    python -m attune_rag.benchmark --query-file your_queries.jsonl; echo "Exit: $?"
    
  4. Run the benchmark test suite. Confirm which paths are currently passing:

    pytest -k "benchmark" -v
    

    A failing test that exercises your scenario gives you a reproducible fixture to work from.

  5. Isolate faithfulness scoring. If the failure only occurs with --with-faithfulness, run once without it to confirm the retrieval path is healthy:

    python -m attune_rag.benchmark --query-file your_queries.jsonl
    

Common fixes

Source files

Tags: benchmark, ci, precision, recall, quality