Troubleshoot benchmark
Before you start
The benchmark module runs retrieval and optional faithfulness scoring, then gates CI on configurable thresholds. Confirm the following before digging deeper:
- You can run
python -m attune_rag.benchmark --helpwithout error. - Your query file path is correct and the file is non-empty.
- If you passed
--with-faithfulness, the faithfulness scorer dependency is installed and reachable.
Symptom table
| If you observe | Check |
|---|---|
main() raises an exception |
Read the full traceback — the file and line number identify the exact raise site in src/attune_rag/benchmark.py |
main() returns a non-zero exit code |
Check whether a precision, recall, or faithfulness threshold was not met; thresholds are configurable |
| Benchmark passes locally but fails in CI | Compare threshold flags and query file paths between your local invocation and the CI command |
| Faithfulness scoring step is skipped silently | Confirm you passed --with-faithfulness; without it, faithfulness scoring does not run |
| Benchmark runs but produces unexpected scores | Verify the query file content — malformed or empty queries produce misleadingly low precision/recall |
| Slow benchmark run | Check whether --with-faithfulness is enabled; faithfulness scoring adds a call to an external scorer and is the most expensive step |
Step-by-step diagnosis
-
Reproduce the failure with a minimal query file. Create a single-entry query file and run:
python -m attune_rag.benchmark --query-file minimal.jsonlIf the failure reproduces, the problem is not specific to your full dataset.
-
Enable DEBUG logging. Re-run with verbose output to expose intermediate retrieval results and threshold comparisons:
python -m attune_rag.benchmark --query-file minimal.jsonl --log-level DEBUG -
Check the exit code explicitly.
main()returns0on success. Any other value means a threshold was not met or an error occurred:python -m attune_rag.benchmark --query-file your_queries.jsonl; echo "Exit: $?" -
Run the benchmark test suite. Confirm which paths are currently passing:
pytest -k "benchmark" -vA failing test that exercises your scenario gives you a reproducible fixture to work from.
-
Isolate faithfulness scoring. If the failure only occurs with
--with-faithfulness, run once without it to confirm the retrieval path is healthy:python -m attune_rag.benchmark --query-file your_queries.jsonl
Common fixes
-
Threshold too strict for your dataset. If CI is failing because scores fall below threshold, adjust the threshold flags to match your dataset's expected performance. Check the CLI help for the exact flag names:
python -m attune_rag.benchmark --help -
Malformed query file. A query file with missing fields causes silent scoring errors. Validate its structure against the expected schema before re-running.
-
Faithfulness scorer unavailable. If
--with-faithfulnesscauses an import error or connection failure, confirm the scorer dependency is installed:pip show <faithfulness-scorer-package>This fix requires a change outside the benchmark module itself — install or configure the dependency in your environment.
-
Environment drift between local and CI. If benchmark passes locally but fails in CI, compare dependency versions:
pip freeze | grep -E "attune|<relevant-deps>"Pin versions in your requirements file to keep environments consistent.
Source files
src/attune_rag/benchmark.py
Tags: benchmark, ci, precision, recall, quality