Comparison: Benchmark vs alternatives

Context

The benchmark module is a benchmark runner for retrieval quality, with optional faithfulness scoring. It gates CI pipelines on configurable thresholds, accepts custom query files, and turns on faithfulness scoring when run with the --with-faithfulness flag.

Feature breakdown

| Capability | benchmark | Ad-hoc evaluation script | Orchestration layer |
| --- | --- | --- | --- |
| Precision/recall measurement | ✅ Built-in | Manual: you implement the logic | Delegates to benchmark |
| Faithfulness scoring | ✅ Via --with-faithfulness | Manual: you implement the logic | Delegates to benchmark |
| CI threshold gating | main() returns 0 on pass | You define exit codes yourself | Possible, with more wiring |
| Custom query files | ✅ Supported | Depends on your script | Depends on configuration |
| Purpose-built public API | main() in benchmark.py | ❌ None | Indirect |
| Suitable for exploratory work | ⚠️ Overkill for one-off runs | ✅ Fast to prototype | ❌ Too much overhead |
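
To make the "you implement the logic" cells concrete, here is a minimal sketch of the per-query precision/recall calculation an ad-hoc script would have to write itself. The function name precision_recall, the input shapes, and the document IDs are all hypothetical; this is not the benchmark module's implementation.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query.

    retrieved: document IDs returned by the retriever (hypothetical shape)
    relevant:  document IDs judged relevant for the query (hypothetical shape)
    """
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    hits = len(retrieved_set & relevant_set)
    # Guard against empty inputs so the script never divides by zero.
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


# Illustrative data only: 2 of 4 retrieved documents are relevant,
# and 2 of the 3 relevant documents were retrieved.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d2", "d4", "d7"])
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.50 recall=0.67
```

Even this small helper leaves out aggregation across queries, query-file loading, and exit-code handling, which is the extra work the ad-hoc column implies.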

When to use benchmark

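Use benchmark when retrieval quality is enforced as a CI gate, you have a stable query set and thresholds to commit to, and you want precision, recall, or faithfulness scoring without implementing the logic yourself.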

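When not to use benchmark

Skip benchmark for one-off exploratory runs, where it is overkill, and while you are still prototyping and not yet ready to commit to a stable query set or threshold. An ad-hoc script is faster to prototype in those cases.
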
Recommendation

benchmark is the right choice for any team that treats retrieval quality as a CI gate. The combination of built-in precision/recall/faithfulness scoring, configurable thresholds, and a clean 0/non-zero exit code from main() makes it significantly easier to enforce quality standards than building equivalent logic into a custom script. Choose an ad-hoc script only when you are prototyping and not yet ready to commit to a stable query set or threshold.
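
As a rough illustration of what threshold gating involves, the sketch below maps a set of scores and minimum thresholds to a 0/non-zero exit code. The gate function, the metric names, and the numbers are assumptions made for this example; the real thresholds and the internals of main() in benchmark.py are not shown in this document.

```python
import sys


def gate(metrics, thresholds):
    """Return 0 if every thresholded metric meets its minimum, else 1.

    Hypothetical helper: a metric missing from `metrics` counts as 0.0,
    so a forgotten score fails the gate instead of silently passing.
    """
    failures = [
        name
        for name, minimum in thresholds.items()
        if metrics.get(name, 0.0) < minimum
    ]
    for name in failures:
        score = metrics.get(name, 0.0)
        print(f"FAIL {name}: {score:.2f} < {thresholds[name]:.2f}", file=sys.stderr)
    return 1 if failures else 0


# Illustrative scores and thresholds only.
scores = {"precision": 0.82, "recall": 0.74}
minimums = {"precision": 0.80, "recall": 0.75}

exit_code = gate(scores, minimums)
print(exit_code)  # 1, because recall 0.74 is below the 0.75 minimum
# A CI entry point would finish with sys.exit(exit_code).
```

A custom script has to own and maintain this logic alongside the scoring code, whereas benchmark already exposes the pass/fail result through the return value of main().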

Source files

Tags: benchmark, ci, precision, recall, quality