Provenance cautions
What to watch for
The provenance module records which corpus entries grounded each answer (CitationRecord, CitedSource) and renders that provenance for display (format_citations_markdown, format_claim_citations_markdown). The risks below are specific to how these two concerns interact.
Risk areas
build_citation_record silently truncates excerpts
build_citation_record accepts an excerpt_chars parameter that defaults to 200. If you call it without setting this value, every CitedSource.excerpt is capped at 200 characters. Downstream rendering via format_citations_markdown will then display truncated text without any indication that content was cut. Pass an explicit excerpt_chars value — or None if the retriever supports full-text excerpts — to avoid silent data loss in the citation record.
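One way to make truncation visible at render time is to compare each stored excerpt against the text it was cut from. The sketch below does not call build_citation_record, whose full signature is not shown here; mark_truncated and its arguments are hypothetical names, and the helper assumes the caller still has the full source text available.

```python
def mark_truncated(excerpt: str, full_text: str, marker: str = " [...]") -> str:
    """Append a marker when the excerpt is shorter than its source text.

    Hypothetical helper, not part of the provenance module. It assumes the
    full source text is still available to compare against.
    """
    if len(excerpt) < len(full_text):
        return excerpt.rstrip() + marker
    return excerpt


full_text = "The quick brown fox. " * 20   # 420 characters of source text
excerpt = full_text[:200]                  # mimics the 200-character default cap

shown = mark_truncated(excerpt, full_text)
print(shown.endswith(" [...]"))            # True: the cut is now visible
print(mark_truncated("short", "short"))    # short (nothing was cut)
```

A marker like this only flags the cut for readers; the citation record itself still holds the shortened excerpt, so setting excerpt_chars explicitly remains the actual fix.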
format_claim_citations_markdown relies on character-offset alignment
ClaimCitation.response_span is a tuple[int, int] of character offsets into the original response text. If you modify the response text between receiving it from the model and passing it to format_claim_citations_markdown, the spans will no longer align: footnote markers will appear at the wrong positions, or the call may raise an index error. Even a whitespace-only change such as stripping leading spaces shifts every offset. Pass the original, unmodified response string.
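A guard along the following lines can catch misaligned spans before rendering. It is a minimal sketch using only the standard library: fingerprint and check_spans are hypothetical helpers, not functions from the provenance module, and the spans are plain (start, end) tuples standing in for ClaimCitation.response_span.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Digest of the response text, taken when it is first received."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def check_spans(text: str, expected_digest: str, spans: list[tuple[int, int]]) -> None:
    """Raise ValueError if the text changed since receipt or a span is out of range."""
    if fingerprint(text) != expected_digest:
        raise ValueError("response text was modified after citations were computed")
    for start, end in spans:
        if not (0 <= start <= end <= len(text)):
            raise ValueError(f"span ({start}, {end}) is outside the response text")


original = "  Paris is the capital of France."
digest = fingerprint(original)   # record this as soon as the model output arrives
spans = [(2, 7)]                 # covers "Paris" in the original string

check_spans(original, digest, spans)   # passes silently
print(repr(original[2:7]))             # 'Paris'
print(repr(original.strip()[2:7]))     # 'ris i' (same span, shifted text)
```

The bounds check alone would miss the stripped string here, because the span still fits inside it; comparing digests is what detects the edit.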
cited_block_index defaults to 0, masking multi-block documents
ClaimCitation.cited_block_index defaults to 0. For documents with multiple content blocks, this default silently points every citation to the first block unless the Citations API explicitly sets a different value. If you display cited_block_index to users or use it for navigation, verify that the upstream API response actually populated it before trusting the value.
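The check described above might look like the following sketch. It assumes, for illustration only, that the raw upstream citation is available as a dict and that the cited_block_index key is missing or None when the API did not set it; resolve_block_index and block_count are hypothetical names.

```python
from typing import Optional


def resolve_block_index(raw_citation: dict, block_count: int) -> Optional[int]:
    """Return the block index only if upstream set it and it is in range.

    Hypothetical helper: the dict shape of raw_citation is an assumption,
    not the documented format of any particular API response.
    """
    index = raw_citation.get("cited_block_index")
    if index is None:
        return None   # not populated upstream; falling back to 0 would be a guess
    if not 0 <= index < block_count:
        return None   # out of range for this document
    return index


print(resolve_block_index({"cited_block_index": 0}, block_count=3))  # 0
print(resolve_block_index({}, block_count=3))                        # None
print(resolve_block_index({"cited_block_index": 5}, block_count=3))  # None
```

Returning None rather than 0 keeps "first block was cited" distinguishable from "no block information was provided".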
base_url=None produces relative links that may not resolve
Both format_citations_markdown and format_claim_citations_markdown accept an optional base_url. When base_url is None, any links in the rendered markdown are relative and depend entirely on the serving context to resolve correctly. In standalone outputs — emails, exported PDFs, or API responses consumed outside your app — those links will be broken. Always supply an absolute base_url when the rendered markdown may be consumed outside a known URL context.
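A small check before rendering can reject a missing or relative base_url in contexts where links must be absolute. This sketch uses urllib.parse from the standard library; require_absolute_base_url is a hypothetical helper, and the example URL is a placeholder.

```python
from typing import Optional
from urllib.parse import urlsplit


def require_absolute_base_url(base_url: Optional[str]) -> str:
    """Return base_url unchanged if it is an absolute http(s) URL, else raise."""
    if base_url is None:
        raise ValueError("base_url is required when output leaves the web app")
    parts = urlsplit(base_url)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError(f"base_url must be an absolute http(s) URL, got {base_url!r}")
    return base_url


print(require_absolute_base_url("https://docs.example.com/corpus/"))
# https://docs.example.com/corpus/

for bad in (None, "/corpus/", "docs.example.com/corpus/"):
    try:
        require_absolute_base_url(bad)
    except ValueError as exc:
        print("rejected:", exc)
```

The validated value can then be passed as base_url to either formatting function.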
How to avoid problems
- Pin excerpt_chars explicitly. Treat the 200-character default as a footgun rather than a sensible default. Set it deliberately every time you call build_citation_record so truncation behavior is visible in code review.
- Freeze response text before rendering citations. Treat the string you pass to format_claim_citations_markdown as immutable from the moment you receive model output. Apply any sanitization or formatting to a copy, not the string you will use for citation rendering.
- Validate cited_block_index before use. If your UI navigates users to a specific block in a source document, confirm that the upstream response explicitly set cited_block_index and that the value is within range for the document (0 <= cited_block_index < number of blocks) before rendering the link. Because 0 is itself a valid index, a check for cited_block_index > 0 would reject genuine citations of the first block. Do not rely on the default value of 0 as confirmation that block 0 was actually cited.
- Supply base_url in non-browser contexts. If format_citations_markdown output leaves your web application, for example in a notification, a report, or an API response, pass an absolute base_url so citation links remain functional.
- Isolate provenance tests from retrieval state. CitationRecord captures retrieved_at and retriever_name at construction time. Tests that reuse a shared record across assertions may be checking stale retrieval metadata. Construct a fresh CitationRecord per test case.
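For the last item, a per-test factory is one way to guarantee that no record is shared. The sketch below uses a stand-in dataclass because the real CitationRecord constructor is not shown here; StubCitationRecord and make_record are hypothetical, and only the two metadata fields named above are modeled.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class StubCitationRecord:
    """Stand-in for CitationRecord; models only the two metadata fields."""
    retriever_name: str
    retrieved_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


def make_record(retriever_name: str = "test-retriever") -> StubCitationRecord:
    """Build a fresh record so each test sees its own retrieval metadata."""
    return StubCitationRecord(retriever_name=retriever_name)


def test_records_are_independent() -> None:
    first = make_record()
    second = make_record("other-retriever")
    assert first is not second
    assert first.retriever_name == "test-retriever"
    assert second.retriever_name == "other-retriever"


test_records_are_independent()
```

In a pytest suite the same factory could be wrapped in a function-scoped fixture so that every test receives a new record automatically.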
Source files
src/attune_rag/provenance.py
Tags: provenance, citations, traceability