Eval reference

Use this module to score RAG answers for faithfulness and run prompt A/B benchmarks against a golden set. FaithfulnessJudge calls Claude with tool use to decompose each answer into atomic claims and label each one as supported or unsupported by the retrieved passages.

Classes

Class	Description
`FaithfulnessResult`	Per-answer faithfulness verdict.
`FaithfulnessJudge`	Scores RAG answers for grounding in retrieved context.

`FaithfulnessResult`

Dataclass that holds the verdict for a single scored answer.

Fields

Field	Type	Default
`score`	`float`	—
`supported_claims`	`list[str]`	—
`unsupported_claims`	`list[str]`	—
`reasoning`	`str`	—
`model`	`str`	—
`thinking_used`	`bool`	`False`

Properties

Property	Type	Description
`total_claims`	`int`	Total number of claims (supported plus unsupported).

Methods

Method	Parameters	Returns	Description
`to_dict`	`self`	`dict[str, Any]`	Serializes the result to a plain dictionary.

`FaithfulnessJudge`

Scores RAG answers by checking each factual claim against the retrieved passages. Uses Claude's tool-use API to decompose answers into atomic claims and classify each as supported or unsupported.

Constructor

Parameters	Type	Default	Description
`client`	`AsyncAnthropic \| None`	`None`	An existing `AsyncAnthropic` client to reuse.
`api_key`	`str \| None`	`None`	Anthropic API key; used when `client` is `None`.
`model`	`str`	`DEFAULT_JUDGE_MODEL`	Claude model to use as judge.
`timeout`	`float`	`DEFAULT_JUDGE_TIMEOUT_SECONDS`	Request timeout in seconds.

Properties

Property	Type	Description
`model`	`str`	The Claude model identifier used for judging.

Methods

Method	Parameters	Returns	Description
`score`	`self, query: str, answer: str, passages: str \| list[str], max_tokens: int = 2048, *, use_thinking: bool = False, thinking_budget_tokens: int = DEFAULT_THINKING_BUDGET_TOKENS`	`FaithfulnessResult`	Scores one answer against its retrieved passages and returns a per-claim verdict.

Functions

Function	Parameters	Returns	Description
`main`	`argv: list[str] \| None = None`	`int`	Runs the prompt A/B benchmark CLI; returns `0` on success.

Constants

Constant	Type	Value
`DEFAULT_JUDGE_MODEL`	`str`	`'claude-sonnet-4-6'`

Source files

src/attune_rag/eval/__init__.py
src/attune_rag/eval/faithfulness.py
src/attune_rag/eval/bench_prompts.py

Eval reference

Classes

FaithfulnessResult

Fields

Properties

Methods

FaithfulnessJudge

Constructor

Properties

Methods

Functions

Constants

Source files

Tags

`FaithfulnessResult`

`FaithfulnessJudge`