Note: retrieval

Context

The retrieval module (src/attune_rag/retrieval.py) provides a keyword-based retriever that scores and ranks corpus entries against a query. It defines a protocol so you can swap in a custom retriever without changing call sites.

Content

KeywordRetriever uses token-overlap scoring: it tokenizes the query, strips stopwords (for example: a, the, how, should), and applies light suffix stemming (for example: ations → ation, *ing → *) before comparing the remaining tokens against each RetrievalEntry. Scores are weighted across four fields: path, summary, content, and related entries.
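The scoring steps described above can be sketched roughly as follows. This is a minimal illustration, not the module's actual code: the stopword list, the stemming rules, the field weights, and the helper names (stem, tokenize, score) are all assumptions chosen for the example.

```python
import re

# Illustrative stopword list; the module's real list is not shown in this note.
STOPWORDS = {"a", "the", "how", "should"}

# Hypothetical field weights; the actual values used by KeywordRetriever are unknown.
FIELD_WEIGHTS = {"path": 3.0, "summary": 2.0, "content": 1.0, "related": 0.5}


def stem(token: str) -> str:
    """Apply light suffix stemming (rules assumed for illustration)."""
    if token.endswith("ations"):
        return token[:-1]  # "configurations" -> "configuration"
    # Length guard (an assumption) so short words like "bring" stay intact.
    if token.endswith("ing") and len(token) > 5:
        return token[:-3]  # "ranking" -> "rank"
    return token


def tokenize(text: str) -> set[str]:
    """Lowercase, split on non-alphanumerics, drop stopwords, then stem."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {stem(word) for word in words if word not in STOPWORDS}


def score(query: str, fields: dict[str, str]) -> float:
    """Sum the per-field token overlap, scaled by each field's weight."""
    query_tokens = tokenize(query)
    total = 0.0
    for name, weight in FIELD_WEIGHTS.items():
        overlap = query_tokens & tokenize(fields.get(name, ""))
        total += weight * len(overlap)
    return total
```

With these assumed weights, a query token that matches in the path counts more than the same token matching only in the content, which is the general effect of weighting scores by field.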

Each retrieval result is returned as a RetrievalHit dataclass with three fields:

Field         Type            Description
entry         RetrievalEntry  The matched corpus entry
score         float           Weighted token-overlap score
match_reason  str             Human-readable explanation of why the entry matched
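As a rough picture of the shape listed above, the sketch below declares a dataclass with the same three fields. The RetrievalEntry stand-in and its fields (path, summary, content) are assumptions made so the example runs on its own; the real classes live in src/attune_rag/retrieval.py and may differ in detail.

```python
from dataclasses import dataclass


@dataclass
class RetrievalEntry:
    # Stand-in for the real class; these fields are assumptions.
    path: str
    summary: str = ""
    content: str = ""


@dataclass
class RetrievalHit:
    entry: RetrievalEntry  # the matched corpus entry
    score: float           # weighted token-overlap score
    match_reason: str      # human-readable explanation of the match


hit = RetrievalHit(
    entry=RetrievalEntry(path="docs/retrieval.md"),
    score=5.0,
    match_reason="query token 'retrieval' found in path",
)
```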

Any object that implements retrieve(query: str, corpus: CorpusProtocol, k: int = 3) -> Iterable[RetrievalHit] satisfies RetrieverProtocol. KeywordRetriever is the built-in implementation; it returns a list[RetrievalHit] sorted by descending score, truncated to the top k results (default: 3).
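To illustrate swapping in a custom retriever, here is a minimal sketch of a class that satisfies the retrieve signature quoted above. Everything other than that signature is assumed for the example: the stand-in dataclasses, the entries() accessor on the corpus, and the SubstringRetriever and ListCorpus classes are hypothetical and are not part of the module.

```python
from dataclasses import dataclass
from typing import Iterable, Protocol


@dataclass
class RetrievalEntry:
    # Stand-in for the real class; fields are assumptions.
    path: str
    content: str = ""


@dataclass
class RetrievalHit:
    entry: RetrievalEntry
    score: float
    match_reason: str


class CorpusProtocol(Protocol):
    # Hypothetical accessor; the real corpus interface is not described in this note.
    def entries(self) -> Iterable[RetrievalEntry]: ...


class RetrieverProtocol(Protocol):
    def retrieve(
        self, query: str, corpus: CorpusProtocol, k: int = 3
    ) -> Iterable[RetrievalHit]: ...


class SubstringRetriever:
    """Toy custom retriever: an entry scores 1.0 if its content contains the query."""

    def retrieve(
        self, query: str, corpus: CorpusProtocol, k: int = 3
    ) -> list[RetrievalHit]:
        needle = query.lower()
        hits = [
            RetrievalHit(entry, 1.0, f"content contains {query!r}")
            for entry in corpus.entries()
            if needle in entry.content.lower()
        ]
        # Sort by descending score and keep the top k, mirroring the built-in behavior.
        hits.sort(key=lambda hit: hit.score, reverse=True)
        return hits[:k]


class ListCorpus:
    """Hypothetical in-memory corpus used only for this example."""

    def __init__(self, items: Iterable[RetrievalEntry]) -> None:
        self._items = list(items)

    def entries(self) -> list[RetrievalEntry]:
        return self._items


corpus = ListCorpus(
    [RetrievalEntry(f"doc{i}.md", "keyword scoring notes") for i in range(4)]
    + [RetrievalEntry("other.md", "unrelated text")]
)
retriever: RetrieverProtocol = SubstringRetriever()
top_hits = list(retriever.retrieve("scoring", corpus))
```

Because the protocol is structural, SubstringRetriever does not need to inherit from anything; matching the method signature is enough for a type checker to accept it wherever a RetrieverProtocol is expected.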

Source files

Tags: retrieval, keyword, scoring, ranking