Run the retrieval and faithfulness benchmark
Use the benchmark runner when you need to measure retrieval precision and recall, optionally alongside faithfulness, and enforce pass/fail thresholds in CI.
Prerequisites
- Access to the project source code
- A query file if you intend to supply custom queries
- Python dependencies installed so that pytest is available on your path
Run the benchmark
- Start a basic retrieval benchmark. Call main() in src/attune_rag/benchmark.py with no extra flags to evaluate retrieval precision and recall against the default query file:
  python -m attune_rag.benchmark
- Supply a custom query file. Pass your query file path to override the default inputs:
  python -m attune_rag.benchmark --queries path/to/queries.yaml
- Enable faithfulness scoring. Add --with-faithfulness to include faithfulness evaluation alongside retrieval metrics:
  python -m attune_rag.benchmark --with-faithfulness
- Configure CI thresholds. Set the threshold flags to the minimum acceptable scores. The runner exits with code 0 when all metrics meet or exceed the thresholds, and with a non-zero code otherwise, which makes it suitable as a CI gate.
- Run the related tests. Verify that your configuration changes have not introduced regressions:
  pytest -k "benchmark"
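If a CI job drives the benchmark from a Python script rather than a shell step, the commands above could be assembled along the following lines. This is only a sketch: build_benchmark_command and run_gate are hypothetical helper names, not part of attune_rag, and only the --queries and --with-faithfulness flags come from this guide. Threshold flags are passed through as-is because their names are not listed here.

```python
import subprocess
import sys
from typing import List, Optional, Sequence


def build_benchmark_command(
    queries: Optional[str] = None,
    with_faithfulness: bool = False,
    extra_args: Sequence[str] = (),
) -> List[str]:
    """Assemble the documented CLI invocation as an argument list (hypothetical helper)."""
    cmd = [sys.executable, "-m", "attune_rag.benchmark"]
    if queries is not None:
        # Override the default query file.
        cmd += ["--queries", queries]
    if with_faithfulness:
        # Score faithfulness alongside the retrieval metrics.
        cmd.append("--with-faithfulness")
    # Threshold flags are forwarded untouched; their names are not
    # assumed here because this guide does not list them.
    cmd += list(extra_args)
    return cmd


def run_gate(cmd: Sequence[str]) -> bool:
    """Run the command and return True only when it exits with code 0."""
    completed = subprocess.run(list(cmd), check=False)
    return completed.returncode == 0
```

A CI script might call run_gate(build_benchmark_command(queries="path/to/queries.yaml", with_faithfulness=True)) and fail the job when the result is False.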
Confirm success
The benchmark run succeeded when main() returns 0. In CI, a 0 exit code tells the pipeline that all configured precision, recall, and faithfulness thresholds passed.
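To make the "meet or exceed" rule concrete, the pass/fail decision can be pictured roughly as below. This is an illustrative sketch, not the actual logic in benchmark.py: exit_code_for is a hypothetical function, the metric names are placeholders, and treating a thresholded but unmeasured metric as a failure is an assumption.

```python
from typing import Mapping


def exit_code_for(metrics: Mapping[str, float], thresholds: Mapping[str, float]) -> int:
    """Return 0 when every thresholded metric meets or exceeds its minimum, else 1."""
    for name, minimum in thresholds.items():
        score = metrics.get(name)
        # Assumption: a metric with a threshold but no measured score fails the gate.
        if score is None or score < minimum:
            return 1
    return 0


# Placeholder scores and thresholds, for illustration only.
scores = {"precision": 0.8, "recall": 0.7}
print(exit_code_for(scores, {"precision": 0.8, "recall": 0.6}))  # prints 0
print(exit_code_for(scores, {"precision": 0.9}))                 # prints 1
```

A score exactly equal to its threshold passes, which matches the "meet or exceed" wording.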
Key files
- src/attune_rag/benchmark.py: entry point containing main()