Note: eval

Context

The eval module provides two complementary capabilities: LLM-as-judge faithfulness scoring for RAG answers, and a prompt-variant A/B benchmark harness for golden-set testing.

How faithfulness scoring works

FaithfulnessJudge (in faithfulness.py) sends a RAG answer to claude-sonnet-4-6 through Claude's tool-use API. The judge system prompt instructs the model to decompose the answer into atomic factual claims (one claim per verifiable assertion) and to classify each claim as either supported or unsupported by the retrieved passages.
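To make the claim-classification step concrete, here is a minimal sketch of what a tool definition and the handling of its structured output could look like. The tool name record_claims, the schema fields, and the helper split_claims are assumptions made for illustration and are not taken from faithfulness.py. Only the general tool shape (name, description, input_schema) follows the Anthropic Messages API. No API call is made.

```python
from typing import Any

# Hypothetical tool definition. The name and fields are assumptions; only the
# overall shape (name / description / input_schema) follows the Messages API.
CLAIMS_TOOL: dict[str, Any] = {
    "name": "record_claims",
    "description": "Record each atomic claim and whether the passages support it.",
    "input_schema": {
        "type": "object",
        "properties": {
            "claims": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "text": {"type": "string"},
                        "supported": {"type": "boolean"},
                    },
                    "required": ["text", "supported"],
                },
            },
            "reasoning": {"type": "string"},
        },
        "required": ["claims", "reasoning"],
    },
}


def split_claims(tool_input: dict[str, Any]) -> tuple[list[str], list[str]]:
    """Partition the judge's claims into (supported, unsupported) lists."""
    supported: list[str] = []
    unsupported: list[str] = []
    for claim in tool_input["claims"]:
        target = supported if claim["supported"] else unsupported
        target.append(claim["text"])
    return supported, unsupported


# Made-up payload of the kind a tool_use block's input could carry.
sample_input = {
    "claims": [
        {"text": "The service retries failed requests.", "supported": True},
        {"text": "It retries exactly five times.", "supported": False},
    ],
    "reasoning": "The passages mention retries but give no count.",
}
supported, unsupported = split_claims(sample_input)
print(supported)
print(unsupported)
```

Forcing the model to answer through a tool with a fixed schema means the judge output can be parsed as structured data rather than scraped from free text, which is presumably why the tool-use API is used here.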

The scoring rules are strict by design:

Results are returned as a FaithfulnessResult dataclass, which records the score (a float), the supported_claims and unsupported_claims lists, the reasoning text, the model used, and whether extended thinking was active (thinking_used). The total_claims property returns the combined number of claims across both lists.
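The result shape described above can be sketched as a small dataclass. This is an illustrative stand-in, not the class from faithfulness.py: the field names score, supported_claims, unsupported_claims, thinking_used and total_claims come from this note, while the names reasoning and model, all element types, and all defaults are assumptions. The score is passed in as a placeholder because the scoring rule is not reproduced here.

```python
from dataclasses import dataclass, field


@dataclass
class FaithfulnessResult:
    """Illustrative stand-in; types and defaults are assumptions."""

    score: float
    supported_claims: list[str] = field(default_factory=list)
    unsupported_claims: list[str] = field(default_factory=list)
    reasoning: str = ""
    model: str = ""
    thinking_used: bool = False

    @property
    def total_claims(self) -> int:
        # Number of supported claims plus number of unsupported claims.
        return len(self.supported_claims) + len(self.unsupported_claims)


result = FaithfulnessResult(
    score=0.5,  # placeholder value, not derived from the real scoring rule
    supported_claims=["The cache is invalidated on write."],
    unsupported_claims=["Invalidation takes under one millisecond."],
    reasoning="The passages describe invalidation but give no timing.",
    model="claude-sonnet-4-6",
)
print(result.total_claims)
print(result.thinking_used)
```

Deriving total_claims from the two lists, rather than storing it, keeps the count from drifting out of sync with the claims themselves.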

Extended thinking is opt-in: pass use_thinking=True to FaithfulnessJudge.score(). When disabled, thinking_used is False on every result.

Benchmark harness

bench_prompts.py exposes a main() entry point for running prompt-variant A/B comparisons against a golden set. It is separate from the faithfulness judge and does not depend on FaithfulnessJudge or FaithfulnessResult at runtime.

Public API boundary

Only FaithfulnessJudge and FaithfulnessResult are exported from the package (__all__). The main() function in bench_prompts.py is a CLI entry point, not part of the importable surface.

Source files

Tags: eval, faithfulness, judge, hallucination, scoring