Retrieval errors

Common error signatures

Most retrieval failures fall into one of three categories: a corpus object that doesn't satisfy CorpusProtocol, a query that reduces to zero tokens after stopword filtering, or a k value that is incompatible with the scorer. The errors typically surface from KeywordRetriever.retrieve() or from a custom retriever that doesn't fully implement RetrieverProtocol.
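
A custom retriever can be checked for protocol conformance before it is wired in. The sketch below is illustrative only: AssumedRetrieverProtocol is a hypothetical stand-in that assumes the real RetrieverProtocol requires a retrieve() method with the k=3 default described here, and GoodRetriever, BrokenRetriever, and implements_retriever are made-up names. Substitute the library's own protocol class if it is runtime-checkable.

```python
from typing import Protocol, runtime_checkable


# Assumed shape: the real RetrieverProtocol may declare different members.
@runtime_checkable
class AssumedRetrieverProtocol(Protocol):
    def retrieve(self, query: str, k: int = 3) -> list: ...


class GoodRetriever:
    """Hypothetical retriever that provides the expected method."""

    def retrieve(self, query: str, k: int = 3) -> list:
        return []


class BrokenRetriever:
    """Hypothetical retriever with a misnamed method (no retrieve())."""

    def search(self, query: str) -> list:
        return []


def implements_retriever(obj: object) -> bool:
    """Return True when obj has every method the assumed protocol names.

    isinstance() against a runtime-checkable Protocol only checks that the
    members exist; it does not compare signatures or return types.
    """
    return isinstance(obj, AssumedRetrieverProtocol)
```

Because the check covers member names only, a retriever whose retrieve() takes the wrong arguments still passes it and fails later at call time.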

Concrete signatures to watch for:

Where errors originate

Check the class that matches your symptom before walking the call stack further.

How to diagnose

  1. Check whether the query survives stopword filtering. KeywordRetriever removes every token found in _STOPWORDS (articles, modals, pronouns, common prepositions, and other function words such as a, the, how, do, is, for). If the entire query consists of stopwords, retrieve() returns an empty list rather than raising. Print the tokenized, filtered query before calling retrieve() to confirm that at least one content token remains.

  2. Verify the corpus satisfies CorpusProtocol. A corpus object that is missing expected attributes or iteration behavior causes AttributeError or TypeError inside KeywordRetriever.retrieve(). Confirm your corpus exposes the interface that CorpusProtocol requires before passing it to the retriever.

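  3. Confirm k is a positive integer. KeywordRetriever.retrieve() defaults to k=3. Passing k=0 or a negative value may return an empty list or raise, depending on how the scorer slices results. For example, if results are cut with an ordinary Python slice such as hits[:k], k=0 yields an empty list and a negative k silently drops items from the end instead of raising. Pass an explicit, positive k to rule this out.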

  4. Inspect RetrievalHit.score values when the ranking is unexpected. KeywordRetriever weights token overlap across the path, summary, content, and related fields of each RetrievalEntry. A score of 0.0 means that no stemmed query token matched any weighted field. In that case, check that the RetrievalEntry fields are populated and that stemming via _STEM_SUFFIXES (-ing, -ed, -tion, -er, and others) reduces the field tokens and the query tokens to a shared root.

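  5. Trace a TypeError back to RetrievalHit construction. If the traceback points inside retrieval.py at a dataclass instantiation, one of the three fields (entry, score, match_reason) is missing or the wrong type. A plain Python dataclass does not check field types at construction, so unless the class validates its fields (for example in __post_init__), a TypeError raised at that point most often means a missing or misnamed argument, and a wrong-typed value fails later where the field is used. Either way, confirm that the value passed as score is a float and that match_reason is a non-empty string.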
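
Checks 1, 3, and 4 above can be approximated with a few standalone helpers. This is a minimal sketch under stated assumptions, not the library's implementation: ASSUMED_STOPWORDS and ASSUMED_STEM_SUFFIXES are small hypothetical stand-ins for _STOPWORDS and _STEM_SUFFIXES, the regex tokenizer and the suffix-stripping rule are guesses at the general approach, and all function names are made up for illustration.

```python
from __future__ import annotations

import re

# Hypothetical stand-ins: the real _STOPWORDS and _STEM_SUFFIXES are larger
# and live in the retrieval module. Import those instead when available.
ASSUMED_STOPWORDS = frozenset({"a", "the", "how", "do", "is", "for"})
ASSUMED_STEM_SUFFIXES = ("tion", "ing", "ed", "er")  # longest suffix first


def content_tokens(query: str) -> list[str]:
    """Lowercase, split on non-alphanumerics, and drop assumed stopwords."""
    tokens = re.findall(r"[a-z0-9]+", query.lower())
    return [t for t in tokens if t not in ASSUMED_STOPWORDS]


def stem(token: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in ASSUMED_STEM_SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def validate_k(k: object) -> int:
    """Reject anything that is not a positive int (bool is excluded too)."""
    if isinstance(k, bool) or not isinstance(k, int) or k <= 0:
        raise ValueError(f"k must be a positive integer, got {k!r}")
    return k


def shared_roots(query: str, field_text: str) -> set[str]:
    """Return the stemmed roots that the query and a field have in common."""
    query_roots = {stem(t) for t in content_tokens(query)}
    field_roots = {stem(t) for t in content_tokens(field_text)}
    return query_roots & field_roots
```

With these stand-ins, content_tokens("How do the") returns an empty list, which corresponds to the case where retrieve() returns no hits without raising, and shared_roots("indexing", "The indexer is slow") returns {"index"}. An empty result from shared_roots for every weighted field would be consistent with a score of 0.0.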

Source files

Tags: retrieval, keyword, scoring, ranking