Eval errors
Common error signatures
Errors in the eval module fall into three categories: API failures when `FaithfulnessJudge` calls Claude, malformed responses that prevent JSON parsing into `FaithfulnessResult`, and invalid inputs to `FaithfulnessJudge.score()`.
Concrete signatures to watch for:
- `anthropic.APIConnectionError` / `anthropic.APITimeoutError`: The judge call exceeded `DEFAULT_JUDGE_TIMEOUT_SECONDS` or the network was unreachable. Check your API key, network connectivity, and whether you passed a custom `timeout` to `FaithfulnessJudge.__init__()`.
- `anthropic.AuthenticationError`: No valid API key was found. Pass `api_key=` explicitly to `FaithfulnessJudge()` or set the `ANTHROPIC_API_KEY` environment variable.
- `ValueError` on `passages`: `FaithfulnessJudge.score()` accepts `passages` as either a `str` or `list[str]`. Passing `None` or an incompatible type causes a failure before the API call is made.
- Malformed tool-use response: If Claude does not call `report_faithfulness`, parsing into `FaithfulnessResult` fails. This can occur when `max_tokens` is too low to complete the structured output or when `thinking_budget_tokens` is misconfigured alongside `use_thinking=True`.
- `main()` non-zero exit: `main()` returns `0` on success. Any other exit code indicates that the benchmark run in `bench_prompts.py` did not complete cleanly; inspect stderr for the underlying exception.
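The signatures above can be triaged in code. The following is a minimal sketch, not part of the eval module: `classify_eval_error` is a hypothetical helper, and it matches on exception class names so that it runs without the `anthropic` SDK installed. It also assumes that a malformed judge response surfaces as a `json.JSONDecodeError`, which the text above does not confirm.

```python
import json


def classify_eval_error(exc: BaseException) -> str:
    """Map an exception to one of the error categories listed above.

    Hypothetical helper for illustration. Real code would normally catch
    the anthropic SDK exception classes directly instead of comparing names.
    """
    # Walk the MRO so that subclasses of the SDK errors are matched too.
    names = {cls.__name__ for cls in type(exc).__mro__}

    if names & {"APIConnectionError", "APITimeoutError", "AuthenticationError"}:
        return "api_failure"

    # Assumption: a malformed response shows up as a JSON decode error.
    # JSONDecodeError subclasses ValueError, so it must be checked first.
    if isinstance(exc, json.JSONDecodeError):
        return "malformed_response"

    if isinstance(exc, (ValueError, TypeError)):
        return "invalid_input"

    return "unknown"
```

Wrapping a `score()` call in `try`/`except Exception as exc` and logging `classify_eval_error(exc)` next to the traceback is one way to see at a glance which category a failure belongs to.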
How to diagnose
- Identify whether the failure is in the judge call or in result parsing. A traceback rooted in `faithfulness.py` inside `score()` points to the API call or prompt construction. A traceback during `FaithfulnessResult` field access (e.g., `.score`, `.supported_claims`, `.total_claims`) points to a parsing or deserialization problem.
- Check `FaithfulnessResult` fields for sentinel values. If scoring completes but results look wrong, inspect `result.supported_claims` and `result.unsupported_claims` directly. An empty `supported_claims` list with a low `score` means the judge found no passage-backed claims. This is expected behavior when the answer contains hallucinated details (workflow names, CLI flags, or API shapes not present in the retrieved passages), not a bug.
- Verify `use_thinking` token budgets. When you call `score(..., use_thinking=True)`, the response must fit within `max_tokens`. If `thinking_budget_tokens` approaches or exceeds `max_tokens`, the model may not produce a valid `report_faithfulness` tool call. Increase `max_tokens` (default is `2048`) or reduce `thinking_budget_tokens`.
- Confirm the model name. `FaithfulnessJudge` defaults to `DEFAULT_JUDGE_MODEL` (`claude-sonnet-4-6`). If you pass a custom `model=` string that the Anthropic API does not recognise, the call fails with a `404` or `invalid_request_error`. Check `judge.model` to confirm which model is active.
- Inspect the passages you passed. The judge prompt inserts `passages` verbatim into `_JUDGE_USER_TEMPLATE`. Empty or very short passages cause the model to mark every claim as `UNSUPPORTED` by design, because the system prompt instructs it to be strict. Verify that retrieval returned meaningful content before calling `score()`.
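The token-budget and passage checks above can also be run before calling `score()`. The sketch below is illustrative only: `normalize_passages`, `check_thinking_budget`, and the `min_headroom` value of 512 are hypothetical and not part of the eval module. Rejecting empty passages outright is a stricter choice than the judge itself makes, since the judge scores them as unsupported rather than raising.

```python
from typing import List, Union

DEFAULT_MAX_TOKENS = 2048  # default max_tokens quoted in the text above


def normalize_passages(passages: Union[str, List[str]]) -> List[str]:
    """Return passages as a list of non-empty strings, or raise ValueError.

    Hypothetical pre-check that mirrors the documented str / list[str] input.
    """
    if isinstance(passages, str):
        passages = [passages]
    if not isinstance(passages, list) or not all(
        isinstance(p, str) for p in passages
    ):
        raise ValueError("passages must be a str or list[str]")

    # Drop whitespace-only entries; they give the judge nothing to match.
    cleaned = [p.strip() for p in passages if p.strip()]
    if not cleaned:
        raise ValueError("passages are empty; check what retrieval returned")
    return cleaned


def check_thinking_budget(
    thinking_budget_tokens: int,
    max_tokens: int = DEFAULT_MAX_TOKENS,
    min_headroom: int = 512,  # assumed margin for the tool call, not a library value
) -> None:
    """Raise ValueError if thinking leaves too little room for the tool call."""
    headroom = max_tokens - thinking_budget_tokens
    if headroom < min_headroom:
        raise ValueError(
            f"only {headroom} tokens left after thinking; "
            "raise max_tokens or lower thinking_budget_tokens"
        )
```

Calling both helpers immediately before `score()` turns a confusing downstream parse failure into an early, descriptive `ValueError`.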
Source files
- `src/attune_rag/eval/__init__.py`
- `src/attune_rag/eval/faithfulness.py`
- `src/attune_rag/eval/bench_prompts.py`
Tags: eval, faithfulness, judge, hallucination, scoring