Score a RAG answer for faithfulness
FaithfulnessJudge uses Claude as a strict judge to check whether every factual claim in a RAG answer is directly supported by the retrieved passages, flagging hallucinations claim by claim.
import asyncio
from attune_rag.eval import FaithfulnessJudge
judge = FaithfulnessJudge()  # uses claude-sonnet-4-6 by default
result = asyncio.run(judge.score(
    query="What is the return policy?",
    answer="Returns are accepted within 30 days. Items must be unworn.",
    passages=["Our return policy allows returns within 30 days of purchase."],
))
print(result.score, result.supported_claims, result.unsupported_claims)
Expected output:
0.5 ['Returns are accepted within 30 days.'] ['Items must be unworn.']
A score of 0.5 means one of the two claims was grounded in the retrieved passage. The other claim, "Items must be unworn.", appears in the answer, but the passage says nothing about item condition, so the judge lists it as unsupported.
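The arithmetic behind that score can be sketched in a few lines. This assumes the score is simply the number of supported claims divided by the total number of claims, as the field comments in Step 2 suggest. `faithfulness_score` is a hypothetical helper for illustration, not part of attune_rag, and how the library treats an answer with zero claims is an assumption here.

```python
def faithfulness_score(supported: list[str], unsupported: list[str]) -> float:
    """Fraction of claims that are grounded in the passages."""
    total = len(supported) + len(unsupported)
    # Assumption: an answer with no factual claims counts as fully faithful.
    if total == 0:
        return 1.0
    return len(supported) / total


supported = ["Returns are accepted within 30 days."]
unsupported = ["Items must be unworn."]
print(faithfulness_score(supported, unsupported))  # 1 supported / 2 total = 0.5
```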
Step 1: Set your API key
FaithfulnessJudge accepts an api_key argument, or reads ANTHROPIC_API_KEY from the environment:
judge = FaithfulnessJudge(api_key="sk-ant-...")
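To keep the key out of source control, one option is to resolve it yourself and fail early with a clear message. This is a minimal sketch: `resolve_api_key` is a hypothetical helper, not part of attune_rag, and only the api_key argument and the ANTHROPIC_API_KEY variable come from the text above.

```python
import os
from typing import Optional


def resolve_api_key(explicit: Optional[str] = None) -> str:
    """Return the explicit key if given, else ANTHROPIC_API_KEY from the environment."""
    key = explicit or os.environ.get("ANTHROPIC_API_KEY")
    if not key:
        raise RuntimeError("Set ANTHROPIC_API_KEY or pass api_key explicitly.")
    return key


# Usage (assumes attune_rag is installed):
# judge = FaithfulnessJudge(api_key=resolve_api_key())
```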
Step 2: Inspect the full result
FaithfulnessResult exposes everything the judge produced:
print(result.score)               # float: len(supported_claims) / total_claims
print(result.total_claims)        # int: len(supported_claims) + len(unsupported_claims)
print(result.supported_claims)    # list[str]: grounded claims
print(result.unsupported_claims)  # list[str]: hallucinated or inferred claims
print(result.reasoning)           # str: judge's chain-of-thought
print(result.model)               # str: which Claude model was used
Step 3: Enable extended thinking for harder judgments
Pass use_thinking=True when the answer is long or the passages are ambiguous:
result = asyncio.run(judge.score(
    query="...",
    answer="...",
    passages=["..."],
    use_thinking=True,
))
print(result.thinking_used)  # True
Step 4: Serialize for logging or CI
Call to_dict() to convert the result to a plain dictionary suitable for JSON logging or golden-set comparison:
import json
print(json.dumps(result.to_dict(), indent=2))
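For CI logs, appending one JSON object per line (JSON Lines) keeps runs easy to diff against a golden set. The sketch below is only an illustration: `append_jsonl` and `read_jsonl` are hypothetical helpers, and the `record` dictionary is a hand-written stand-in for whatever `result.to_dict()` returns, since its exact keys are not specified here.

```python
import json
from pathlib import Path


def append_jsonl(path: Path, record: dict) -> None:
    """Append one record to the log as a single JSON line."""
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, sort_keys=True) + "\n")


def read_jsonl(path: Path) -> list[dict]:
    """Load every record back, one per non-empty line."""
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Stand-in for result.to_dict(); the real keys may differ.
record = {"score": 0.5, "unsupported_claims": ["Items must be unworn."]}
```

In a real run you would call something like `append_jsonl(Path("faithfulness.jsonl"), result.to_dict())` after each scored answer.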
Next: to wire faithfulness checking into your CI pipeline, run FaithfulnessJudge.score across your full golden set and assert result.score == 1.0 for every expected answer.
Tags: eval, faithfulness, judge, hallucination, scoring