Comparison: Benchmark vs alternatives
Context
The benchmark module is a benchmark runner for retrieval quality, with optional faithfulness scoring enabled through the --with-faithfulness flag. It gates CI pipelines on configurable thresholds and accepts custom query files.
Feature breakdown
| Capability | benchmark | Ad-hoc evaluation script | Orchestration layer |
|---|---|---|---|
| Precision/recall measurement | ✅ Built-in | Manual: you implement the logic | Delegates to benchmark |
| Faithfulness scoring | ✅ Via --with-faithfulness | Manual: you implement the logic | Delegates to benchmark |
| CI threshold gating | ✅ main() returns 0 on pass | You define exit codes yourself | Possible, with more wiring |
| Custom query files | ✅ Supported | Depends on your script | Depends on configuration |
| Purpose-built public API | ✅ main() in benchmark.py | ❌ None | Indirect |
| Suitable for exploratory work | ⚠️ Overkill for one-off runs | ✅ Fast to prototype | ❌ Too much overhead |
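
To make the "you implement the logic" and "you define exit codes yourself" cells concrete, here is a minimal sketch of what an ad-hoc script has to write by hand. It is not taken from benchmark.py: the function names, the set-based definition of precision and recall, and the threshold parameters are all assumptions chosen for illustration.

```python
from __future__ import annotations

from typing import Iterable


def precision_recall(retrieved: Iterable[str], relevant: Iterable[str]) -> tuple[float, float]:
    """Return (precision, recall) for one query, given document IDs.

    Hypothetical helper: set-based scoring that ignores ranking.
    """
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    hits = len(retrieved_set & relevant_set)
    # Guard against empty inputs to avoid division by zero.
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


def gate_exit_code(precision: float, recall: float,
                   min_precision: float, min_recall: float) -> int:
    """Return 0 when both metrics meet their thresholds, 1 otherwise."""
    return 0 if precision >= min_precision and recall >= min_recall else 1


# Example: 2 of 4 retrieved documents are relevant, out of 3 relevant overall.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d5"])
code = gate_exit_code(p, r, min_precision=0.5, min_recall=0.6)
```

Even this small version needs its own decisions about empty result sets and pass/fail semantics, which is the maintenance cost the table attributes to the ad-hoc column.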
When to use benchmark
Use benchmark when all of the following are true:
- You need repeatable, structured measurement of retrieval quality — precision, recall, or faithfulness — against a defined query set.
- You want CI to fail automatically when results drop below a threshold. main() returns 0 on success, making it a natural fit for pipeline exit-code checks.
- You are running faithfulness evaluation alongside retrieval scoring and want both in a single invocation (--with-faithfulness).
- You have a custom query file that defines the evaluation set.
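
The exit-code contract described in the list above can be consumed from a CI step with a thin wrapper like the sketch below. The wrapper itself uses only the standard library. The command is an assumption: the module path attune_rag.benchmark is guessed from src/attune_rag/benchmark.py, and only the --with-faithfulness flag comes from this document, so check the real entry point before relying on it.

```python
import subprocess
import sys
from typing import Sequence


def run_gate(command: Sequence[str]) -> int:
    """Run a command and return its exit code without raising on failure."""
    completed = subprocess.run(list(command), check=False)
    return completed.returncode


# Hypothetical invocation: the "-m attune_rag.benchmark" module path is an
# assumption; --with-faithfulness is the only flag named in this document.
BENCHMARK_COMMAND = [sys.executable, "-m", "attune_rag.benchmark", "--with-faithfulness"]

# In a CI step you would propagate the result unchanged, for example:
#     sys.exit(run_gate(BENCHMARK_COMMAND))
```

Propagating the code unchanged keeps the pipeline's pass/fail decision identical to the one benchmark made, rather than reinterpreting it in the wrapper.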
When not to use benchmark
- Exploratory or one-off evaluation. If you are experimenting with a new retrieval approach and do not yet have a stable query file or threshold, a throwaway script avoids the overhead of wiring up benchmark for a single run.
- Multi-feature pipelines. If your evaluation spans concerns beyond retrieval and faithfulness, use the orchestration layer above benchmark rather than calling it directly. Calling benchmark in the middle of a broader pipeline couples your orchestration to its internals.
- Behavior outside the public API. If you need evaluation logic that main() does not expose, do not patch benchmark internals. File an issue or propose an extension point instead.
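
One way to picture the coupling concern in the multi-feature case is a pipeline that treats every stage, benchmark included, as an opaque callable returning an exit code. The sketch below is purely illustrative: the orchestration layer's real interface is not described here, and run_pipeline, Step, and the stand-in steps are hypothetical names.

```python
from typing import Callable, Optional, Sequence, Tuple

# A step is a name plus a zero-argument callable that returns an exit code.
Step = Tuple[str, Callable[[], int]]


def run_pipeline(steps: Sequence[Step]) -> Tuple[int, Optional[str]]:
    """Run steps in order and stop at the first non-zero exit code.

    Returns (exit_code, failed_step_name); the name is None when all steps pass.
    """
    for name, step in steps:
        code = step()
        if code != 0:
            return code, name
    return 0, None


# Stand-in steps for illustration; a real pipeline would wrap the
# benchmark entry point in a callable instead of importing its internals.
result = run_pipeline([
    ("lint", lambda: 0),
    ("retrieval-benchmark", lambda: 2),
    ("publish", lambda: 0),
])
```

Because the pipeline depends only on the 0/non-zero contract, changes inside benchmark do not ripple into the orchestration code.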
Recommendation
benchmark is the right choice for teams that treat retrieval quality as a CI gate. Built-in precision, recall, and faithfulness scoring, configurable thresholds, and a 0/non-zero exit code from main() let you enforce quality standards without reimplementing that logic in a custom script. Choose an ad-hoc script only when you are prototyping and not yet ready to commit to a stable query set or threshold.
Source files
src/attune_rag/benchmark.py
Tags: benchmark, ci, precision, recall, quality