Comparison: Provenance approaches
Context
The provenance module gives you two distinct ways to attribute an answer back to its source documents:
- Pipeline-level provenance —
CitationRecordandCitedSourcecapture which corpus entries grounded a full RAG response, along with retrieval metadata (query, retriever name, timestamp, relevance score). - Claim-level provenance —
ClaimCitationmaps individual spans of the response text back to specific document blocks, using data from the Anthropic Citations API.
Both paths share a common rendering layer (format_citations_markdown and format_claim_citations_markdown) that produces ready-to-display markdown.
Feature comparison
Pipeline-level (CitationRecord) |
Claim-level (ClaimCitation) |
|
|---|---|---|
| Granularity | Per retrieved document | Per response span → document block |
| Primary data source | RetrievalHit objects from your retriever |
Anthropic Citations API response |
| Entry point | build_citation_record() |
Consume ClaimCitation objects directly |
| What it tracks | Query, retriever name, timestamp, score, optional excerpt | Response span (start, end), document index, cited text, block index |
| Rendering function | format_citations_markdown(record, base_url) |
format_claim_citations_markdown(text, citations, base_url) |
| Output style | Markdown section listing all cited sources | Response text with inline footnote-style references |
| Requires Anthropic Citations API | No | Yes |
| Captures retrieval score | Yes (CitedSource.score) |
No |
| Captures retrieval timestamp | Yes (CitationRecord.retrieved_at) |
No |
Tradeoffs
Pipeline-level provenance is the right default for most RAG applications. build_citation_record() converts your existing RetrievalHit objects into a structured CitationRecord with minimal wiring. You get a complete audit trail — what was queried, which retriever ran, when retrieval happened, and how each source scored — all rendered to markdown via format_citations_markdown(). The weakness is granularity: citations point to whole documents (or short excerpts up to excerpt_chars characters), not to the exact sentence that supported a claim.
Claim-level provenance is more precise but narrower in scope. ClaimCitation pinpoints which character span of the model's response corresponds to which block in which document, making it suitable for applications where readers need to verify individual assertions. The tradeoff is a hard dependency on the Anthropic Citations API — you cannot produce ClaimCitation objects from arbitrary retrievers — and you lose retrieval metadata like scores and timestamps.
Use X when...
Use CitationRecord / pipeline-level provenance when:
- You want to audit or log the full retrieval run (who retrieved what, when, and how relevant it scored).
- Your retriever is not the Anthropic Citations API, or you need retriever-agnostic citation tracking.
- You need a source list rendered as a markdown section appended to an answer.
- You are building a new feature and want the simplest path to traceable answers.
Use ClaimCitation / claim-level provenance when:
- You need to show users exactly which sentence in a document supports each claim in the response.
- You are already using the Anthropic Citations API and have span data available.
- Auditability of individual assertions matters more than capturing retrieval metadata.
When in doubt, start with build_citation_record() and format_citations_markdown(). They cover the common case and require no external API dependency.
Source files
src/attune_rag/provenance.py
Tags: provenance, citations, traceability