Comparison: KeywordRetriever vs custom retriever implementations

Context

The retrieval module gives you two things: a ready-to-use KeywordRetriever that scores corpus entries against a query using token overlap, stemming, and stopword filtering, and a RetrieverProtocol that lets you swap in any retriever that satisfies the same retrieve(query, corpus, k) interface.
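As a rough illustration of that interface, here is a minimal sketch using Python's typing.Protocol. The field names follow the ones mentioned in this document (path, summary, content, related, score, match_reason), but the class definitions, the entry attribute on RetrievalHit, and the FirstKRetriever class are assumptions made for the example, not the module's actual code.

```python
from dataclasses import dataclass
from typing import Iterable, Protocol, Sequence, runtime_checkable


@dataclass(frozen=True)
class RetrievalEntry:
    # Assumed shape: the four fields this document says are scored.
    path: str
    summary: str = ""
    content: str = ""
    related: tuple[str, ...] = ()


@dataclass(frozen=True)
class RetrievalHit:
    entry: RetrievalEntry  # assumed attribute name
    score: float
    match_reason: str


@runtime_checkable
class RetrieverProtocol(Protocol):
    """Anything with a matching retrieve() method satisfies this protocol."""

    def retrieve(
        self, query: str, corpus: Sequence[RetrievalEntry], k: int = 3
    ) -> Iterable[RetrievalHit]: ...


class FirstKRetriever:
    """Hypothetical trivial retriever: yields the first k entries unscored."""

    def retrieve(self, query, corpus, k=3):
        for entry in list(corpus)[:k]:
            yield RetrievalHit(entry, score=0.0, match_reason="first-k placeholder")


corpus = [RetrievalEntry(path=f"docs/{i}.md") for i in range(5)]
retriever = FirstKRetriever()
hits = list(retriever.retrieve("anything", corpus, k=2))
```

FirstKRetriever never inherits from RetrieverProtocol; it qualifies structurally because it exposes a retrieve method with the expected parameters.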

Choosing between them comes down to one question: does KeywordRetriever's scoring model fit your data?

Feature comparison

| Capability | KeywordRetriever | Custom RetrieverProtocol impl |
| --- | --- | --- |
| Setup effort | Zero: instantiate and call | Must implement retrieve(query, corpus, k) |
| Scoring model | Token overlap weighted across path, summary, content, and related fields | Whatever you define |
| Stopword filtering | Built-in 35-word list (a, the, how, does, …) | Your responsibility |
| Stemming | Strips 16 suffix patterns (-ation, -ing, -ed, -er, -es, …) | Your responsibility |
| Return type | list[RetrievalHit] (ordered, indexable) | Iterable[RetrievalHit] (any iterable) |
| Field-weight tuning | Fixed weights per field; not configurable at runtime | Fully configurable |
| Semantic / embedding search | Not supported | Supported, but you implement it yourself |
| Drop-in replaceability | Yes: satisfies RetrieverProtocol | Yes: anything with the right signature qualifies |

Scoring model details (KeywordRetriever)

KeywordRetriever tokenizes both the query and each RetrievalEntry, removes stopwords, applies suffix stripping, then accumulates overlap scores weighted by field (path, summary, content, and related). The per-field weights are fixed.

Each RetrievalHit records the final score (float) and a match_reason string explaining which tokens drove the match. The top-k hits are returned (default k=3).
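The pipeline described above (tokenize, drop stopwords, strip suffixes, sum weighted overlaps, keep the top k) can be sketched roughly as follows. This is an illustration of the general technique, not KeywordRetriever's implementation: the stopword set, suffix list, field weights, minimum stem length, and the choice to drop zero-score entries are all hypothetical, since this document does not give the real values.

```python
import re

# Hypothetical values: the real stopword list, suffix list, and field weights
# are not given in this document.
STOPWORDS = {"a", "the", "how", "does", "to", "is", "of"}
SUFFIXES = ("ation", "ing", "ed", "er", "es", "s")  # checked in this order
FIELD_WEIGHTS = {"path": 3.0, "summary": 2.0, "content": 1.0, "related": 0.5}


def stem(token: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def tokens(text: str) -> set[str]:
    """Lowercase, split on non-alphanumerics, drop stopwords, then stem."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {stem(w) for w in words if w not in STOPWORDS}


def score_entry(query: str, entry: dict[str, str]) -> tuple[float, str]:
    """Return (score, match_reason) for one entry given as a field -> text dict."""
    query_tokens = tokens(query)
    score = 0.0
    reasons = []
    for field, weight in FIELD_WEIGHTS.items():
        matched = query_tokens & tokens(entry.get(field, ""))
        if matched:
            score += weight * len(matched)
            reasons.append(f"{field}: {', '.join(sorted(matched))}")
    return score, "; ".join(reasons)


def top_k(query, corpus, k=3):
    """Score every entry, drop zero scores (an assumption), return the k best."""
    scored = [(*score_entry(query, e), e) for e in corpus]
    scored = [item for item in scored if item[0] > 0]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]


corpus = [
    {"path": "docs/caching.md", "summary": "How caching works", "content": "Cached results expire."},
    {"path": "docs/logging.md", "summary": "Logging setup", "content": "Loggers write to files."},
]
results = top_k("how does caching expire", corpus)
```

With these made-up weights, the caching entry matches in path, summary, and content, while the logging entry shares no stemmed token with the query and is dropped.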

Because the model is purely lexical, it degrades when query terms and entry text use different but synonymous vocabulary (for example, "configure" vs. "set up").
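A small sketch of why synonyms defeat token overlap. The stopword set and the content_tokens helper are hypothetical and stemming is omitted for brevity; the point is only that "configure" and "set up" share no tokens, so any match has to come from other words.

```python
import re

STOPWORDS = {"a", "the", "how", "do", "i", "to"}  # hypothetical subset


def content_tokens(text: str) -> set[str]:
    """Lowercase, split on non-alphanumerics, and drop stopwords."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {w for w in words if w not in STOPWORDS}


query = content_tokens("How do I configure the cache?")
entry = content_tokens("Set up the cache")
synonym_only_entry = content_tokens("Set up the store")

# Only "cache" overlaps; "configure" contributes nothing to the score.
shared = query & entry
# With no literal word in common, the overlap is empty and the entry scores zero.
shared_synonym_only = query & synonym_only_entry
```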

When NOT to use KeywordRetriever

Use X when…

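Use KeywordRetriever when the query vocabulary matches the corpus text, the fixed field weights suit your data, and you want zero setup.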

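Implement a custom RetrieverProtocol when you need semantic or embedding search, configurable field weights, or matching across synonymous vocabulary (for example, "configure" vs. "set up").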

For most corpus-search tasks where the query vocabulary matches the corpus text, KeywordRetriever is the right starting point — it handles stopword filtering and stemming for you and satisfies RetrieverProtocol, so you can replace it later without changing call sites.
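To keep that later swap painless, a call site can rely only on iterating the result, since a custom retriever may return any iterable rather than a list. The sketch below uses hypothetical stand-ins (Hit, ListRetriever, LazyRetriever, best_path) rather than the module's real classes.

```python
from itertools import islice
from typing import NamedTuple


class Hit(NamedTuple):
    # Hypothetical stand-in for RetrievalHit.
    path: str
    score: float


class ListRetriever:
    """Hypothetical retriever that returns a list, as KeywordRetriever is described to."""

    def retrieve(self, query, corpus, k=3):
        return [Hit(path, 1.0) for path in corpus[:k]]


class LazyRetriever:
    """Hypothetical custom retriever that produces hits lazily."""

    def retrieve(self, query, corpus, k=3):
        return (Hit(path, 0.5) for path in islice(corpus, k))


def best_path(retriever, query, corpus):
    """Call site that materializes the result instead of indexing it directly."""
    hits = list(retriever.retrieve(query, corpus, k=3))
    return hits[0].path if hits else None


corpus = ["docs/a.md", "docs/b.md"]
from_list = best_path(ListRetriever(), "query", corpus)
from_lazy = best_path(LazyRetriever(), "query", corpus)
```

Wrapping the result in list() costs little for small k and means the same call site works whether the retriever returns a list or a generator.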

Tags: retrieval, keyword, scoring, ranking