Retrieval cautions
What to watch for
KeywordRetriever scores RetrievalEntry objects against a query using token-overlap, suffix stemming, and stopword filtering. The results depend on several implicit behaviors — in the stemmer, the stopword list, and the field-weight configuration — that are easy to overlook when customizing or extending the retriever.
Risk areas
Short or common queries return low-quality results
KeywordRetriever strips a broad set of stopwords (including how, do, is, can, with, and others defined in _STOPWORDS) before scoring. A query composed mostly of stopwords — for example, "how do i do this" — reduces to an empty token set and scores every entry equally, returning essentially arbitrary top-k results. Check the post-filter token set when retrieval results look unexpectedly uniform.
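The check described above can be sketched as follows. This is a minimal illustration, not the retriever's own tokenizer: ASSUMED_STOPWORDS is a hypothetical stand-in for the private _STOPWORDS set, and surviving_tokens and is_degenerate are illustrative helper names.

```python
import re

# Hypothetical stand-in for the module's private _STOPWORDS; the real set may differ.
ASSUMED_STOPWORDS = frozenset(
    {"how", "do", "i", "is", "can", "with", "this", "the", "a", "to"}
)


def surviving_tokens(query, stopwords=ASSUMED_STOPWORDS):
    """Lowercase the query, split on non-alphanumerics, and drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", query.lower())
    return [t for t in tokens if t not in stopwords]


def is_degenerate(query):
    """True when nothing survives filtering, so every entry would tie."""
    return not surviving_tokens(query)
```

Logging surviving_tokens(query) next to the returned hits makes an empty or near-empty token set visible immediately.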
Stemming collapses unrelated terms
The suffix stemmer strips endings from _STEM_SUFFIXES (ations, ing, ers, ed, and so on), tried left-to-right, as long as the remaining stem is four characters or longer. This can collapse terms that share a stem but are semantically unrelated. For example, "porting" and "ported" would both reduce to "port" and then match entries about a network "port", producing false token-overlap matches. If your corpus contains short technical terms or abbreviations, verify that stemming is not merging them incorrectly.
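One way to audit a vocabulary for such merges is sketched below. It assumes a single-pass stemmer with a four-character minimum stem. ASSUMED_SUFFIXES is a hypothetical subset of _STEM_SUFFIXES, and stem and collisions are illustrative helpers rather than the module's functions, so the real stemmer may behave differently.

```python
# Hypothetical subset of _STEM_SUFFIXES; the real list and its order may differ.
ASSUMED_SUFFIXES = ("ations", "ing", "ers", "ed")
MIN_STEM_LEN = 4  # assumed from the four-character rule described above


def stem(token):
    """Strip the first matching suffix if the remaining stem is long enough."""
    for suffix in ASSUMED_SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= MIN_STEM_LEN:
            return token[: -len(suffix)]
    return token


def collisions(terms):
    """Group terms by stem; keep stems shared by more than one distinct term."""
    groups = {}
    for term in terms:
        lowered = term.lower()
        groups.setdefault(stem(lowered), set()).add(lowered)
    return {s: sorted(ts) for s, ts in groups.items() if len(ts) > 1}
```

Running collisions over the distinct terms in a corpus gives a short list of merged groups to review by hand.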
Field weights silently determine ranking order
KeywordRetriever applies separate weights to the path, summary, content, and related fields of each RetrievalEntry. A single query token that matches a high-weight field (such as path) can outrank several matches in a lower-weight field (such as content). If your corpus has sparse or missing path/summary metadata, results may rank entries whose metadata happens to match above genuinely relevant, content-rich entries that lack it. Populate all scored fields where possible.
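The effect can be reproduced with a small sketch. The weights in ASSUMED_WEIGHTS are invented for illustration (the real values are not stated here), Entry is a simplified stand-in for RetrievalEntry, and the scoring rule (weight times the number of distinct query tokens found in each field) is an assumption about how the weighting might work.

```python
import re
from dataclasses import dataclass

# Hypothetical weights; the real values used by KeywordRetriever are not shown here.
ASSUMED_WEIGHTS = {"path": 3.0, "summary": 2.0, "content": 1.0, "related": 0.5}


@dataclass
class Entry:
    """Simplified stand-in for RetrievalEntry with the four scored fields."""
    path: str = ""
    summary: str = ""
    content: str = ""
    related: str = ""


def tokenize(text):
    """Lowercase and split on non-alphanumerics, returning distinct tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def score(entry, query_tokens):
    """Sum, per field, weight * number of distinct query tokens found there."""
    total = 0.0
    for field_name, weight in ASSUMED_WEIGHTS.items():
        field_tokens = tokenize(getattr(entry, field_name))
        total += weight * len(query_tokens & field_tokens)
    return total


# An entry that matches only in its path vs. one that matches only in its content.
path_only = Entry(path="docs/retry.md")
content_rich = Entry(
    path="docs/networking.md",
    content="retry with backoff, retry budget, retry limits",
)
query = {"retry"}
```

Under these assumed weights, path_only scores higher than content_rich even though the latter discusses the query term at length.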
RetrieverProtocol duck-typing hides signature mismatches
Any object with a retrieve(query, corpus, k) method satisfies RetrieverProtocol; there is no runtime enforcement of the return type. A custom retriever that returns None, returns an empty list on error, or returns objects that lack score or match_reason fields will pass protocol checks silently and fail only when downstream code accesses those attributes on what it expects to be a RetrievalHit. Validate that your implementation returns a proper Iterable[RetrievalHit] with all three fields (entry, score, and match_reason) populated.
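The gap can be demonstrated with Python's typing.Protocol. RetrieverLike below is a hypothetical stand-in for RetrieverProtocol, and whether the real protocol is decorated with runtime_checkable is an assumption; the point is only that a structural check looks at method names, not at what the method returns.

```python
from typing import Any, Iterable, Protocol, runtime_checkable


@runtime_checkable
class RetrieverLike(Protocol):
    """Hypothetical stand-in for RetrieverProtocol."""

    def retrieve(self, query: str, corpus: Any, k: int) -> Iterable[Any]: ...


class BrokenRetriever:
    """Has the right method name but returns the wrong thing."""

    def retrieve(self, query, corpus, k):
        return None  # not an iterable of hits


# isinstance only checks that a `retrieve` attribute exists.
passes_check = isinstance(BrokenRetriever(), RetrieverLike)
result = BrokenRetriever().retrieve("q", [], 3)
```

Here passes_check is True while result is None, so the failure surfaces only when a caller tries to iterate over the result.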
_STOPWORDS and _STEM_SUFFIXES are private and may change
The stopword set and suffix list are module-level private constants. If your code imports them, a future rename or edit can break or silently change your preprocessing; if it copies them, the copy can silently diverge from what KeywordRetriever uses internally, causing your preprocessing to disagree with the scorer. Treat these as implementation details and do not depend on them outside the module.
How to avoid problems
- Log the post-filter token set for unexpected results. When retrieval returns surprising rankings, inspect which tokens survive stopword removal and stemming for your query. This is the fastest way to distinguish a bad query from a bad corpus entry.
- Ensure RetrievalEntry metadata is complete. Because path and summary carry higher weights than content, sparse metadata skews scoring. Populate all four scored fields (path, summary, content, related) when constructing corpus entries.
- Validate custom retriever output against RetrievalHit. When implementing RetrieverProtocol, confirm that every returned object is a RetrievalHit with a non-None entry, a finite score, and a non-empty match_reason before returning from retrieve.
- Do not import _STOPWORDS or _STEM_SUFFIXES directly. If you need to replicate tokenization logic, copy the values explicitly and document that they are snapshots, so a future change to the originals does not silently affect your code.
- Run targeted regression tests after any scorer change. pytest -k "retrieval" covers the scoring and ranking path; run it before and after modifying weight values or tokenization logic to catch ranking-order regressions early.
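The output validation recommended in the list above could look like the following sketch. Hit is a hypothetical stand-in for RetrievalHit that carries the three fields named in this section, and validate_hits is an illustrative helper, not part of the library. In real code an isinstance check against RetrievalHit would replace the attribute lookups.

```python
import math
from dataclasses import dataclass
from typing import Any


@dataclass
class Hit:
    """Hypothetical stand-in for RetrievalHit."""
    entry: Any
    score: float
    match_reason: str


def validate_hits(hits):
    """Materialize hits into a list, raising ValueError on the first bad one."""
    if hits is None:
        raise ValueError("retrieve() returned None instead of an iterable")
    checked = []
    for i, hit in enumerate(hits):
        entry = getattr(hit, "entry", None)
        score = getattr(hit, "score", None)
        reason = getattr(hit, "match_reason", None)
        if entry is None:
            raise ValueError(f"hit {i}: missing entry")
        if not isinstance(score, (int, float)) or not math.isfinite(score):
            raise ValueError(f"hit {i}: score must be finite, got {score!r}")
        if not isinstance(reason, str) or not reason.strip():
            raise ValueError(f"hit {i}: match_reason must be a non-empty string")
        checked.append(hit)
    return checked
```

Calling validate_hits as the last step of a custom retrieve method turns a delayed attribute error downstream into an immediate, descriptive failure at the source.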
Source files
src/attune_rag/retrieval.py
Tags: retrieval, keyword, scoring, ranking