Score RAG answer faithfulness with eval
Use the eval harness when you want to measure how well your RAG pipeline grounds its answers in retrieved passages, or to benchmark prompt variants against a golden set.
Prerequisites
- An Anthropic API key, or an `AsyncAnthropic` client instance
- Python dependencies installed (the `eval` module must be importable)
- Retrieved passages and the answers you want to judge
Score an answer for faithfulness
- Import `FaithfulnessJudge` and instantiate it.

  ```python
  from attune_rag.eval import FaithfulnessJudge

  judge = FaithfulnessJudge(api_key="YOUR_API_KEY")
  ```

  By default the judge uses the `claude-sonnet-4-6` model. Pass `model=` to override it.

- Call `score()` with your query, answer, and passages.

  ```python
  result = await judge.score(
      query="What is the retention policy for audit logs?",
      answer="Audit logs are retained for 90 days.",
      passages=["Audit logs are kept for a period of 90 days before deletion."],
  )
  ```

  Pass `passages` as a single string or a list of strings. Set `use_thinking=True` to enable extended reasoning if you need higher-confidence verdicts on ambiguous claims.

- Inspect the `FaithfulnessResult`.

  ```python
  print(result.score)               # float between 0.0 and 1.0
  print(result.supported_claims)    # list of claims the passages back up
  print(result.unsupported_claims)  # list of claims not grounded in passages
  print(result.reasoning)           # judge's chain-of-thought explanation
  print(result.total_claims)        # total_claims = len(supported) + len(unsupported)
  ```

  Convert the result to a plain dictionary with `result.to_dict()` for logging or serialization.

- Run the prompt-variant benchmark (optional).

  Execute the benchmark entry point to compare prompt variants against your golden set:

  ```bash
  pytest -k "eval"
  ```

  Alternatively, call `main()` from `bench_prompts.py` directly in your test harness. A return value of `0` indicates a successful run.
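The `score()` call above is awaited, so it needs a running event loop. The sketch below shows one way to drive it from a synchronous script and to score several cases concurrently. To stay runnable without an API key, it uses `StubJudge` and `StubResult`, which are hypothetical stand-ins that only mimic the call shape and result fields shown in this guide; they are not part of `attune_rag`, and the verbatim-match scoring is purely illustrative. `score_all` is also an assumed helper name. Whether the real judge tolerates concurrent calls (for example, under API rate limits) is not stated here, so treat the `asyncio.gather` fan-out as an assumption.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class StubResult:
    """Hypothetical stand-in exposing some of the fields used in this guide."""
    score: float
    supported_claims: list = field(default_factory=list)
    unsupported_claims: list = field(default_factory=list)


class StubJudge:
    """Hypothetical stand-in for FaithfulnessJudge (makes no network calls)."""

    async def score(self, query, answer, passages):
        # Accept a single string or a list of strings.
        if isinstance(passages, str):
            passages = [passages]
        text = " ".join(passages).lower()
        # Toy rule: each sentence of the answer is one claim; a claim counts as
        # supported only if the passages contain it verbatim (case-insensitive).
        claims = [c.strip() for c in answer.split(".") if c.strip()]
        supported = [c for c in claims if c.lower() in text]
        unsupported = [c for c in claims if c.lower() not in text]
        ratio = len(supported) / len(claims) if claims else 0.0
        return StubResult(ratio, supported, unsupported)


async def score_all(judge, cases):
    """Score every (query, answer, passages) case concurrently."""
    tasks = [judge.score(query=q, answer=a, passages=p) for q, a, p in cases]
    return await asyncio.gather(*tasks)


cases = [
    ("How long are logs kept?", "Logs are kept for 90 days.",
     ["Audit policy: logs are kept for 90 days."]),
    ("Who can read logs?", "Only admins can read logs.",
     "Logs are kept for 90 days."),
]

# asyncio.run() creates the event loop that the awaited score() calls need.
results = asyncio.run(score_all(StubJudge(), cases))
for (query, _, _), result in zip(cases, results):
    print(f"{query} -> {result.score:.2f}")
```

In real use you would pass a `FaithfulnessJudge` instance in place of `StubJudge()`; the surrounding `asyncio.run` pattern stays the same.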
Verify success
The task succeeded when:
- `result.score` is a float and `result.total_claims` equals `len(result.supported_claims) + len(result.unsupported_claims)`.
- `result.unsupported_claims` is empty (or within your acceptable threshold) for answers you expect to be fully grounded.
- `main()` returns `0` when you run the benchmark.
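The result-level checks above can be turned into a small helper for use in a test or CI gate. This is a minimal sketch: `check_result`, its `max_unsupported` parameter, and the `ResultLike` dataclass are hypothetical names introduced for illustration, and `ResultLike` only mirrors the fields this guide lists for `FaithfulnessResult` so the example runs without the library installed.

```python
from dataclasses import dataclass, field


@dataclass
class ResultLike:
    """Hypothetical stand-in mirroring the result fields listed in this guide."""
    score: float
    supported_claims: list = field(default_factory=list)
    unsupported_claims: list = field(default_factory=list)
    total_claims: int = 0


def check_result(result, max_unsupported=0):
    """Return a list of failed checks; an empty list means every check passed."""
    problems = []
    if not isinstance(result.score, float):
        problems.append("score is not a float")
    elif not 0.0 <= result.score <= 1.0:
        problems.append("score is outside [0.0, 1.0]")
    expected_total = len(result.supported_claims) + len(result.unsupported_claims)
    if result.total_claims != expected_total:
        problems.append("total_claims does not match the claim lists")
    if len(result.unsupported_claims) > max_unsupported:
        problems.append("too many unsupported claims")
    return problems


good = ResultLike(1.0, ["Audit logs are retained for 90 days."], [], 1)
bad = ResultLike(0.5, ["claim a"], ["claim b"], 3)

print(check_result(good))  # no failed checks
print(check_result(bad))   # inconsistent total and one unsupported claim
```

Raising `max_unsupported` expresses the "acceptable threshold" case for answers you do not expect to be fully grounded.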
Key files
| File | Purpose |
|---|---|
| `src/attune_rag/eval/__init__.py` | Public exports: `FaithfulnessJudge`, `FaithfulnessResult` |
| `src/attune_rag/eval/faithfulness.py` | Judge logic, system prompt, and `FaithfulnessResult` dataclass |
| `src/attune_rag/eval/bench_prompts.py` | Prompt A/B benchmark harness and `main()` entry point |