Tip: Gate CI on faithfulness scores, not just precision and recall
Run main() with --with-faithfulness to include faithfulness scoring alongside retrieval metrics. Without it, a pipeline can pass precision/recall thresholds while still returning responses that contradict the retrieved documents.
Why it sticks: precision and recall measure what you retrieve; faithfulness measures whether the answer reflects it — they catch different failure modes.
Tradeoff: Faithfulness scoring adds latency and may require a separate model call. Keep it enabled in CI but consider disabling it in fast local iteration loops where retrieval quality is your only concern.
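One way to keep faithfulness scoring on in CI and off locally is to derive the flag list from the environment. The sketch below is illustrative only: benchmark_args is a hypothetical helper, and it assumes the flags are passed as an argv-style list and that the CI system sets a CI environment variable (many hosted providers do). Only the --with-faithfulness and --query-file flags come from this tip.

```python
import os


def benchmark_args(env=None, query_file=None):
    """Build an argv-style flag list for a benchmark run (hypothetical helper)."""
    env = os.environ if env is None else env
    args = []

    # Assumption: the CI system exports CI=true (or CI=1).
    # Faithfulness scoring is slower, so enable it only there.
    if env.get("CI", "").lower() in {"1", "true"}:
        args.append("--with-faithfulness")

    # Optionally point the run at a domain-specific query set.
    if query_file:
        args.extend(["--query-file", query_file])

    return args
```

Locally, with no CI variable set, this yields an empty flag list, so the run stays limited to retrieval metrics.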
Details
- Entry point: main() in src/attune_rag/benchmark.py, returns 0 on success (suitable for direct use in CI scripts).
- Thresholds are configurable, so tighten them incrementally rather than starting strict: a threshold that blocks every run provides no signal.
- Point --query-file at a domain-specific query set rather than the default one; generic queries rarely expose the retrieval gaps that matter for your use case.
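The gating idea can be sketched as a small threshold check that maps scores to a process exit code. This is a minimal illustration, not the logic in benchmark.py: the metric names, the threshold values, and the functions failing_metrics and gate are all assumptions made for the example.

```python
import sys

# Hypothetical thresholds; names and values are illustrative, not project defaults.
THRESHOLDS = {"precision": 0.70, "recall": 0.60, "faithfulness": 0.80}


def failing_metrics(scores, thresholds=THRESHOLDS):
    """Return the sorted names of metrics that are missing or below threshold."""
    return sorted(
        name
        for name, minimum in thresholds.items()
        if scores.get(name, 0.0) < minimum  # a missing metric counts as failing
    )


def gate(scores, thresholds=THRESHOLDS):
    """Return an exit code: 0 if every metric passes, 1 otherwise."""
    failures = failing_metrics(scores, thresholds)
    for name in failures:
        print(
            f"FAIL {name}: {scores.get(name)} < {thresholds[name]}",
            file=sys.stderr,
        )
    return 0 if not failures else 1
```

Treating a missing faithfulness score as a failure means a run that skipped faithfulness scoring cannot pass the gate on precision and recall alone. A CI script could end with sys.exit(gate(scores)) so the job status follows the result.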
Source files
src/attune_rag/benchmark.py
Tags: benchmark, ci, precision, recall, quality