Benchmark errors
Common error signatures
These errors occur when the benchmark runner fails to complete a retrieval or faithfulness evaluation. Common failure points include:
- Threshold gate failures: the runner exits with a non-zero code when precision, recall, or faithfulness scores fall below configured thresholds. A successful run returns 0; any other exit code indicates a gate failure.
- Invalid or missing query file: passing a custom query file that does not exist or is malformed typically raises an OSError or ValueError before scoring begins.
- Faithfulness scoring errors: errors specific to --with-faithfulness runs, such as a missing model or malformed response, appear only when that flag is set.
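The threshold gate described above can be pictured with a minimal sketch. This is not the code in benchmark.py: the function name gate_exit_code, the dictionary-based inputs, and the choice of 1 as the failing exit code are all assumptions made for illustration. It only shows the general shape of mapping scores to an exit code.

```python
from typing import Mapping


def gate_exit_code(scores: Mapping[str, float], thresholds: Mapping[str, float]) -> int:
    """Hypothetical gate: return 0 if every threshold is met, 1 otherwise.

    A metric that has a threshold but no reported score is treated as 0.0,
    so it fails the gate (an assumption, not documented behaviour).
    """
    failures = [
        name
        for name, minimum in thresholds.items()
        if scores.get(name, 0.0) < minimum
    ]
    for name in failures:
        # Report each metric that fell below its configured threshold.
        print(f"gate failed: {name}={scores.get(name, 0.0):.3f} < {thresholds[name]:.3f}")
    return 1 if failures else 0
```

A score exactly equal to its threshold passes in this sketch, which matches the "fall below" wording above.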
Where errors originate
All errors originate in main() (src/attune_rag/benchmark.py). Because main() is the sole entry point, the exit code and any raised exception come directly from this function.
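Because a single entry point either returns an exit code or raises, the two outcomes can be told apart with a small wrapper. The sketch below is hypothetical: classify_run is not part of attune_rag, and entry stands in for any zero-argument callable that behaves like main(), returning an integer exit code or raising.

```python
from typing import Callable, Tuple


def classify_run(entry: Callable[[], int]) -> Tuple[str, object]:
    """Call a main()-like entry point and label the outcome.

    Returns a (label, detail) pair, where detail is either the exit code
    or the exception that was raised.
    """
    try:
        code = entry()
    except OSError as exc:
        # Includes FileNotFoundError and PermissionError (both OSError subclasses).
        return ("file access problem", exc)
    except ValueError as exc:
        return ("configuration or input problem", exc)
    # No exception: the outcome is carried entirely by the exit code.
    return ("ok" if code == 0 else "gate failure", code)
```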
How to diagnose
- Check the exit code first. main() returns 0 on success. A non-zero exit code in CI means a threshold was not met. Check your configured precision, recall, or faithfulness thresholds against the reported scores in the output.
- Read the full traceback. If the process raises an exception rather than returning an exit code, the traceback names the exception type and the line in benchmark.py where it was raised. An OSError points to a file access problem (query file, output path); a ValueError points to a configuration or input validation problem.
- Isolate faithfulness scoring. If the failure only occurs with --with-faithfulness, re-run without that flag. If the run succeeds, the problem is specific to the faithfulness scoring path rather than retrieval evaluation.
- Enable DEBUG logging. If the exception message alone is not enough, re-run with logging set to DEBUG. Log output emitted just before the failure typically identifies the query, threshold value, or file path that caused the error.
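The first three diagnostic steps can be sketched as a small triage script. This is an illustration under stated assumptions, not a tool shipped with the benchmark: the triage function is hypothetical, the command is passed in by the caller because the exact benchmark invocation is not given here, and the only flag it knows about is --with-faithfulness. The usage line at the end runs a stand-in Python command rather than the real benchmark.

```python
import subprocess
import sys
from typing import List

FAITHFULNESS_FLAG = "--with-faithfulness"


def triage(command: List[str]) -> str:
    """Run a command and summarise why it failed, if it did."""
    result = subprocess.run(command, capture_output=True, text=True)

    # Step 1: check the exit code first.
    if result.returncode == 0:
        return "success"

    # Step 2: a traceback on stderr means an exception escaped; its last
    # line normally names the exception type and message.
    if "Traceback (most recent call last)" in result.stderr:
        last_line = result.stderr.strip().splitlines()[-1]
        diagnosis = f"exception: {last_line}"
    else:
        diagnosis = f"gate failure (exit code {result.returncode})"

    # Step 3: if the flag was set, re-run without it to isolate the
    # faithfulness scoring path.
    if FAITHFULNESS_FLAG in command:
        without_flag = [arg for arg in command if arg != FAITHFULNESS_FLAG]
        retry = subprocess.run(without_flag, capture_output=True, text=True)
        if retry.returncode == 0:
            diagnosis += "; passes without the flag, so the faithfulness path is implicated"

    return diagnosis


# Usage with a stand-in command that simply exits with code 2.
print(triage([sys.executable, "-c", "raise SystemExit(2)"]))
```

The fourth step is left out because how the log level is set for the runner is not specified here.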
Source files
src/attune_rag/benchmark.py
Tags: benchmark, ci, precision, recall, quality