Provenance
Provenance is the system that records which source documents grounded a RAG pipeline's answer and renders those sources as citable, human-readable output.
Mental model
When a user submits a query, the RAG pipeline retrieves document chunks and generates a response. Provenance captures a snapshot of that retrieval — what was queried, which documents were returned, and how confidently each one ranked — then attaches that snapshot to the response so readers can trace every claim back to its source.
The flow looks like this:
- build_citation_record converts raw RetrievalHit objects into a CitationRecord, capturing the query text, retriever name, retrieval timestamp, and up to excerpt_chars characters of each hit's content.
- The CitationRecord holds one CitedSource per retrieved document, each carrying the template path, category, relevance score, and optional excerpt.
- Optionally, the Anthropic Citations API produces ClaimCitation objects that link specific spans of the response text to specific documents by index.
- format_citations_markdown renders a full CitationRecord as a markdown section.
- format_claim_citations_markdown renders the response text with inline footnote-style callouts for each ClaimCitation.
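The build and record-level rendering steps can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the project's implementation: the RetrievalHit fields, the function names build_record_sketch and render_record_sketch, the default excerpt_chars of 200, the UTC timestamp, the "## Sources" heading, and the output layout are all invented for the example. Only the record and source field names come from the descriptions on this page.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional, Sequence


@dataclass(frozen=True)
class RetrievalHit:
    """Hypothetical raw retriever output; these field names are assumed."""
    template_path: str
    category: str
    score: float
    content: str


@dataclass(frozen=True)
class CitedSource:
    template_path: str
    category: str
    score: float
    excerpt: Optional[str] = None


@dataclass(frozen=True)
class CitationRecord:
    query: str
    hits: tuple[CitedSource, ...]
    retrieved_at: datetime
    retriever_name: str


def build_record_sketch(
    query: str,
    hits: Sequence[RetrievalHit],
    retriever_name: str,
    excerpt_chars: int = 200,  # assumed default, for illustration only
) -> CitationRecord:
    """Snapshot one retrieval, keeping at most excerpt_chars of each hit."""
    sources = tuple(
        CitedSource(
            template_path=hit.template_path,
            category=hit.category,
            score=hit.score,
            # An empty excerpt is stored as None rather than "".
            excerpt=hit.content[:excerpt_chars] or None,
        )
        for hit in hits
    )
    return CitationRecord(
        query=query,
        hits=sources,
        retrieved_at=datetime.now(timezone.utc),  # UTC is an assumption
        retriever_name=retriever_name,
    )


def render_record_sketch(record: CitationRecord, base_url: Optional[str] = None) -> str:
    """Render the record as a markdown section, linking paths if base_url is set."""
    lines = ["## Sources", ""]
    for number, source in enumerate(record.hits, start=1):
        label = source.template_path
        if base_url:
            label = f"[{label}]({base_url.rstrip('/')}/{source.template_path})"
        lines.append(f"{number}. {label} ({source.category}, score {source.score:.2f})")
        if source.excerpt:
            lines.append(f"   > {source.excerpt}")
    return "\n".join(lines)
```

Freezing the dataclasses and storing hits as a tuple keeps the snapshot immutable once built, which suits a record meant to be attached to a response and inspected later.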
Core data structures
CitationRecord — the top-level provenance snapshot for one pipeline run. It records:
- query: the original user query string
- hits: an ordered tuple of CitedSource objects
- retrieved_at: the timestamp of retrieval
- retriever_name: which retriever produced the hits
CitedSource — one document returned by the retriever. It records:
- template_path: the document's path in the corpus
- category: the document's classification
- score: the retriever's relevance score for this hit
- excerpt: an optional short extract of the source text (truncated to excerpt_chars by build_citation_record)
ClaimCitation — a finer-grained citation produced by the Anthropic Citations API, linking a span of the response text (response_span) to a specific document (document_index, document_title, cited_text, cited_block_index). Use these when you need to attribute individual sentences or phrases rather than the response as a whole.
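To make the claim-level shape concrete, here is a hedged sketch of a ClaimCitation and one way footnote-style callouts could be attached to a response. The field names come from the description above; the field types (in particular, treating response_span as the cited substring of the response), the function name render_claims_sketch, and the footnote layout are assumptions for illustration and may differ from what format_claim_citations_markdown actually does.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ClaimCitation:
    response_span: str      # assumed: the exact response text being attributed
    document_index: int     # position of the cited document in the request
    document_title: str
    cited_text: str         # the passage from the source document
    cited_block_index: int  # which block of the document was cited


def render_claims_sketch(response_text: str, claims: list[ClaimCitation]) -> str:
    """Append a footnote marker after each cited span and list the sources.

    Simplifications: only the first occurrence of each span is marked,
    spans that cannot be found are skipped, and overlapping spans are
    not handled.
    """
    body = response_text
    footnotes: list[str] = []
    for claim in claims:
        position = body.find(claim.response_span)
        if position == -1:
            continue  # span not present in the response; nothing to mark
        marker = f"[^{len(footnotes) + 1}]"
        end = position + len(claim.response_span)
        body = body[:end] + marker + body[end:]
        footnotes.append(
            f"{marker}: {claim.document_title} "
            f"(document {claim.document_index}, block {claim.cited_block_index}): "
            f'"{claim.cited_text}"'
        )
    if not footnotes:
        return body
    return body + "\n\n" + "\n".join(footnotes)
```

Numbering footnotes by the count of markers actually placed, rather than by position in the input list, avoids gaps when a span is skipped.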
When provenance matters
- Auditability: CitationRecord gives you a durable, timestamped record of exactly which documents were in scope when an answer was generated, making it straightforward to replay or audit a pipeline run.
- User-facing transparency: format_citations_markdown and format_claim_citations_markdown turn that record into output readers can inspect, with optional base_url support for linking directly to source documents.
- Claim-level traceability: when the Anthropic Citations API is in use, ClaimCitation objects let you show which sentence in the response came from which block of which document, rather than citing sources only at the response level.