What Nobody Told Me About RAG When I Started Studying Retrieval

Index

What is Information Retrieval
Tokenization
Text Preprocessing
How AI Models See Tokens
The Three Retrieval Models
TF-IDF
Vectors and Cosine Similarity
Boolean Retrieval and Inverted Indexes
BM25
Retrieval as the Base of RAG
References and Next Steps

1 What is Information Retrieval

Every time someone asks me why an AI system is giving bad answers, my first question is always the same: "What is being retrieved before it gets to the model?" In my experience, nine times out of ten the problem is there.

Information Retrieval is the area of computer science dedicated to finding relevant information in large volumes of data based on a query. It is present in search engines, e-commerce, recommendation systems, virtual assistants, and increasingly as the critical layer under any RAG system.

1.1 A Historical Perspective that Matters

It's no coincidence that IR has decades of accumulated research. The SMART system, developed in the 1960s, introduced concepts we still use today: vector representation of documents and term weighting. BM25 emerged from the Okapi experiments of the 1990s and is still a default ranking function in production search engines in 2026. Link-analysis ranking algorithms such as PageRank reshaped the field in 1998 by looking beyond pure textual content.

Why does this matter? Because when someone proposes replacing BM25 with neural embeddings "because it's more modern", my response is: show me the numbers. Decades of refinement aren't discarded without evidence. And often, the data shows that well-configured BM25 surpasses poorly implemented embeddings.

1.2 The Three Pillars

Indexing is the preparation of data before search. Each document is processed, divided into smaller parts, and stored in a structure that allows for rapid retrieval. A well-constructed index is the difference between a response in milliseconds and a scan that would take hours. In systems with billions of documents, there's no other option.

Querying is the interpretation of the user's query. What seems simple hides real complexity: how to treat synonyms? Typing errors? Different intentions behind the same phrase? Most systems I see in production underestimate this stage and pay the price in result quality.

Ranking is where the real difference is made. Finding documents that match the query is easy. Putting them in the right order is difficult. Users rarely look beyond the first few results. If the right document is in the 15th position, to the user it's as if it doesn't exist.

1.3 Metrics I Actually Use

Precision@K: of the first K results, how many are relevant? I use it when the system has a display limit and irrelevant results at the top cause real harm to the experience.

Recall@K: of all relevant documents in the corpus, how many appear in the top K? I use it in systems where missing a critical document is unacceptable, such as medical or compliance systems.

MRR (Mean Reciprocal Rank): the average of the reciprocal of the position of the first relevant result. It's my favorite metric for systems where the user wants the answer, not a list. It reflects the real experience well.
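
The three metrics above fit in a few lines. This is a minimal sketch, assuming each result list is a sequence of document IDs and the relevance judgments are a set of IDs. The function names and the sample IDs are made up for illustration.

```python
from typing import List, Sequence, Set, Tuple


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k results that are relevant."""
    return sum(doc in relevant for doc in ranked[:k]) / k


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant documents that appear in the top-k."""
    if not relevant:
        return 0.0
    return sum(doc in relevant for doc in ranked[:k]) / len(relevant)


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1 / position of the first relevant result, or 0 if none is found."""
    for position, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(runs: List[Tuple[Sequence[str], Set[str]]]) -> float:
    """Average reciprocal rank over several (ranking, relevant set) pairs."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


# Hypothetical query: the system returned five documents, three are relevant
# in the whole corpus, and one of those ("d8") was never retrieved.
ranked = ["d3", "d7", "d1", "d9", "d4"]
relevant = {"d1", "d4", "d8"}

print(precision_at_k(ranked, relevant, 3))  # 1 relevant in the top 3 -> 1/3
print(recall_at_k(ranked, relevant, 5))     # 2 of 3 relevant found   -> 2/3
print(reciprocal_rank(ranked, relevant))    # first hit at position 3 -> 1/3
```

The same ranking scores differently on each metric, which is why the choice of metric has to come first.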

Choosing which metric to optimize is not a technical decision. It's a product decision that needs to be made before any line of code.

1.4 Why IR Became a Mandatory Competence for Those Working with AI

With RAG (Retrieval Augmented Generation), Information Retrieval stopped being the exclusive territory of search engineers. Every team building applications with LLMs needs to understand retrieval.

The logic is simple: the LLM can only answer well based on the context it receives. If retrieval delivers the wrong context, the model will generate an incorrect answer, regardless of how sophisticated it is. There's no LLM good enough to consistently compensate for poor retrieval.

2 Tokenization

Tokenization is the step most people never stop to think about. It's the first step in the pipeline, so any error here propagates through the whole system. I've seen systems with months of development broken by a tokenization inconsistency that took minutes to find and five seconds to correct.

The idea is simple: before any processing, the text needs to be divided into discrete units that the computer can manipulate. These units are the tokens.

2.1 The Three Types and When I Use Each

Word Tokenization is my default starting point for retrieval and indexing. It divides text into individual words. For most search systems, it's all you need.

Sentence Tokenization I use when I need to maintain contextual coherence, especially in creating chunks for RAG. Dividing a document into sentences before grouping into chunks ensures I won't cut an idea in half.

Character Tokenization I rarely use in retrieval. It has its place in spell checking and languages with complex writing systems, but for search in Portuguese, word tokenization covers 95% of cases.
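
The sentence-first chunking described above can be sketched as follows. The regex splitter is a deliberately naive stand-in (it breaks on abbreviations like "Dr."); a real pipeline would swap in a proper sentence tokenizer. The max_chars limit and the function names are assumptions for illustration.

```python
import re
from typing import List

# Naive rule: a sentence ends at ".", "!" or "?" followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text: str) -> List[str]:
    return [s for s in _SENTENCE_END.split(text.strip()) if s]


def chunk_by_sentences(text: str, max_chars: int = 200) -> List[str]:
    """Group whole sentences into chunks of at most max_chars characters.

    A sentence is never cut in half; a single sentence longer than
    max_chars becomes its own oversized chunk.
    """
    chunks: List[str] = []
    current = ""
    for sentence in split_sentences(text):
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)   # close the chunk before it overflows
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


text = "BM25 ranks documents. It uses term frequency. It also normalizes by length."
chunks = chunk_by_sentences(text, max_chars=50)
for chunk in chunks:
    print(repr(chunk))
```

With a 50-character limit, the first two sentences share a chunk and the third starts a new one, so every chunk boundary falls on a sentence boundary.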

2.2 Implementation

The question I always get: why not use text.split(" ")? Because real text isn't clean. Punctuation stuck to words ("fascinating." vs "fascinating"), hyphens in compounds ("pre-processing"), contractions, URLs, dates. A naive split by space will create inconsistencies that appear as mysterious bugs three weeks later. NLTK's word_tokenize handles this.
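
To make the problem concrete, here is a small comparison between a split on spaces and a tokenizer that separates punctuation. The regex below is only a dependency-free stand-in written for this example. It is not how NLTK's word_tokenize works internally, and it covers far fewer cases.

```python
import re

text = "Retrieval is fascinating. Is pre-processing fascinating?"

# Naive approach: punctuation stays glued to the neighbouring word.
naive = text.split(" ")

# Stand-in tokenizer: a word (optionally with internal hyphens or
# apostrophes), or a single punctuation character.
TOKEN_PATTERN = re.compile(r"\w+(?:[-']\w+)*|[^\w\s]")


def simple_tokenize(text: str) -> list:
    return TOKEN_PATTERN.findall(text)


tokens = simple_tokenize(text)

print(naive)
# ['Retrieval', 'is', 'fascinating.', 'Is', 'pre-processing', 'fascinating?']
print(tokens)
# ['Retrieval', 'is', 'fascinating', '.', 'Is', 'pre-processing', 'fascinating', '?']
```

With the naive split, the word "fascinating" never appears in the token list at all: the index would hold "fascinating." and "fascinating?" as two unrelated terms.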

For high-volume production, I prefer spaCy. It's faster and has better support for special cases:
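
A minimal sketch of what that can look like, assuming spaCy is installed. It uses a blank English pipeline, which contains only the rule-based tokenizer and needs no model download. The language code, the batch size, and the tokenize_batch name are assumptions for illustration.

```python
import spacy

# spacy.blank builds a pipeline with just the tokenizer: no tagger, parser,
# or NER, which keeps it fast when only tokens are needed.
nlp = spacy.blank("en")


def tokenize_batch(texts, batch_size=1000):
    """Tokenize many texts by streaming them through nlp.pipe in batches."""
    return [
        [token.text for token in doc if not token.is_space]
        for doc in nlp.pipe(texts, batch_size=batch_size)
    ]


docs = [
    "Search is fascinating.",
    "See https://example.com for the pre-processing details.",
]
for tokens in tokenize_batch(docs):
    print(tokens)
```

For a Portuguese corpus the same code would start from spacy.blank("pt") instead.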

2.3 The Rule I Don't Waive

Use exactly the same tokenization function on documents and queries. Without exception.

It seems obvious. But I've reviewed systems where the indexing pipeline used spaCy and the search pipeline used NLTK. The results were inconsistent in ways that were almost impossible to reproduce. Encapsulate in a function and call from both sides.

3 Text Preprocessing

If I had to choose a stage to review in any retrieval system with problems, I'd choose text preprocessing. It's where I most often find silent bugs that degrade quality without generating any error.

The statement I make with conviction: most RAG systems with quality issues have the problem here, not in the LLM. Before replacing the model or adjusting the prompt, review the preprocessing.

3.1 The Pipeline I Use as Default

Lowercase is always the first operation. "Machine" and "machine" are the same word. Without normalization, they become distinct tokens in the index, diluting relevance.

Removal of non-alphanumeric characters eliminates noise. Punctuation, symbols, special characters rarely carry useful information for search.

Here, I need to be careful with the domain. In source code, _ and . have meaning. In financial documents, % and $ are essential. The rule isn't to remove everything that isn't a letter. It's to remove what doesn't carry relevant information for your specific corpus.

Removal of stopwords discards words so frequent they appear in every document and don't help differentiate one from another.

Exception I always include: in systems with boolean search support, I preserve "and", "or", and "not" as logical operators.

Stemming vs Lemmatization: I use stemming when speed matters more than linguistic precision. I use lemmatization when I'm in a domain where correct word forms matter. For most production search systems in Portuguese, stemming with NLTK's RSLPStemmer is sufficient.

3.2 The Function I Put in Every Project
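
A minimal sketch of such a function, following the default pipeline from section 3.1. The tiny stopword list, the preprocess name, and the keep_chars and stem parameters are assumptions made for this example. The stemmer is passed in as a plain callable so the sketch runs without dependencies; for Portuguese, the stem method of NLTK's RSLPStemmer could be plugged in there.

```python
from typing import Callable, List

# Tiny illustrative stopword list (an assumption). Real projects load a
# complete list for the language of the corpus.
STOPWORDS = frozenset(
    {"a", "an", "the", "of", "in", "on", "to", "is", "are", "and", "or", "not"}
)
BOOLEAN_OPERATORS = frozenset({"and", "or", "not"})


def preprocess(
    text: str,
    stem: Callable[[str], str] = lambda token: token,  # identity by default
    keep_chars: str = "",                # e.g. "_." for code, "%$" for finance
    keep_boolean_operators: bool = False,
) -> List[str]:
    # 1. Lowercase, then 2. turn every character that is not a letter, digit,
    #    whitespace, or explicitly kept symbol into a space.
    cleaned = "".join(
        ch if ch.isalnum() or ch.isspace() or ch in keep_chars else " "
        for ch in text.lower()
    )
    tokens = cleaned.split()

    # 3. Stopwords, optionally sparing the boolean operators.
    stopwords = (STOPWORDS - BOOLEAN_OPERATORS) if keep_boolean_operators else STOPWORDS

    result = []
    for token in tokens:
        if token in stopwords:
            continue
        token = stem(token)      # 4. stemming (or lemmatization)
        if len(token) > 1:       # 5. drop leftover single characters
            result.append(token)
    return result


# The same function is called for documents and for queries.
doc_tokens = preprocess("The Machine is learning, and the machine is FAST!")
query_tokens = preprocess("machine learning")
print(doc_tokens)    # ['machine', 'learning', 'machine', 'fast']
print(query_tokens)  # ['machine', 'learning']
```

Keeping every option as a parameter of one function is what makes it hard for the indexing side and the query side to drift apart.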

3.3 The Errors I See Most Frequently

Different preprocessing for documents and queries is the most common error, and the most treacherous, because the system doesn't raise anything. It simply returns poorer results.

Normalizing accents in the index but not in the query, or vice versa, breaks matches silently. If you decide to normalize ("informação" becomes "informacao"), do it on both sides.

Using overly aggressive stemming in technical domains. In a medical system, reducing "neural" and "neurology" to the same root creates incorrect matches that are hard to explain to users.

Not filtering tokens of length 1. After removing characters and stopwords, single letters remain that pollute the index.
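
The accent mismatch is easy to reproduce with the standard library. This is a small sketch, assuming accents are stripped by Unicode decomposition; the strip_accents helper is a name chosen for this example.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Decompose characters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


indexed_term = strip_accents("informação")  # normalized at indexing time
raw_query = "informação"                    # query left untouched

print(indexed_term)                              # informacao
print(indexed_term == raw_query)                 # False: silent miss
print(indexed_term == strip_accents(raw_query))  # True: same function on both sides
```

No exception is raised in the failing case. The lookup simply finds nothing, which is what makes this class of bug hard to notice.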

4 How AI Models See Tokens

This section is about avoiding a conceptual confusion I see frequently, even among experienced engineers: LLM tokens and retrieval tokens are different things with different purposes.

4.1 BPE: The Logic Behind LLM Tokenization

Models like GPT use BPE (Byte Pair Encoding). Instead of dividing by words, BPE learns a subword vocabulary from the training data. The process starts with individual characters and iteratively merges the most frequent pairs until reaching the target vocabulary size (typically 50,000 to 100,000 tokens).

The practical effect: frequent English words become single tokens. Rare words, words from languages underrepresented in the training data, and highly specific terminology are split into fragments. The Portuguese word "computadores" ("computers") might become two or three tokens.

Each token receives a Token ID, a unique integer. These IDs carry no semantic relationship to each other: two numerically close IDs have nothing in common in meaning. They are arbitrary labels in a lookup table.
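
The merge loop described above can be sketched on a toy corpus. This is a simplified character-level version written for illustration: production tokenizers work on bytes, handle word boundaries, and learn tens of thousands of merges. The word counts and function names are made up.

```python
from collections import Counter
from typing import Dict, List, Tuple


def merge_pair(symbols: Tuple[str, ...], pair: Tuple[str, str]) -> Tuple[str, ...]:
    """Replace every adjacent occurrence of pair with the fused symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe_merges(word_counts: Dict[str, int], num_merges: int) -> List[Tuple[str, str]]:
    """Start from characters and repeatedly merge the most frequent pair."""
    vocab = {tuple(word): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, count in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        vocab = {merge_pair(s, best): c for s, c in vocab.items()}
    return merges


def segment(word: str, merges: List[Tuple[str, str]]) -> List[str]:
    """Tokenize a word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


corpus = {"low": 5, "lower": 2, "lowest": 3, "lot": 1}
merges = learn_bpe_merges(corpus, num_merges=3)
print(merges)                    # [('l', 'o'), ('lo', 'w'), ('low', 'e')]
print(segment("low", merges))    # ['low']
print(segment("lowly", merges))  # ['low', 'l', 'y']
```

The frequent word "low" ends up as one token, while the unseen word "lowly" falls apart into fragments. At scale, the same mechanism is what makes rare or non-English words more expensive.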

4.2 The Practical Impact That Matters in Day-to-Day

A sentence in Portuguese that generated ~90 tokens in GPT-3 generates ~50 tokens in GPT-4. Older versions broke accented words into multiple fragments. Newer versions compress much better.

This impacts cost and context window. I use tiktoken to estimate before putting into production:
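
A minimal sketch of that estimate. The encoder is passed in as a callable so the example runs without tiktoken installed. The comment shows how a real tiktoken encoder would be obtained. The estimate_tokens name, the context window size, and the price are placeholders, not real figures.

```python
from typing import Callable, Dict, Sequence


def estimate_tokens(
    text: str,
    encode: Callable[[str], Sequence],
    context_window: int,
    price_per_1k_tokens: float,
) -> Dict[str, float]:
    """Count tokens with the given encoder and derive cost and window usage."""
    n_tokens = len(encode(text))
    return {
        "tokens": n_tokens,
        "fits_in_context": n_tokens <= context_window,
        "context_used": n_tokens / context_window,
        "estimated_cost": n_tokens / 1000 * price_per_1k_tokens,
    }


# With tiktoken installed, the encoder would come from the library:
#   import tiktoken
#   encode = tiktoken.get_encoding("cl100k_base").encode
# Here a whitespace split stands in for it, so the counts are NOT real
# token counts; they only exercise the function.
report = estimate_tokens(
    "retrieval comes before generation",
    encode=str.split,
    context_window=8,          # placeholder value
    price_per_1k_tokens=0.5,   # placeholder value
)
print(report)
```

Running this over a representative sample of prompts and retrieved chunks gives a cost and context budget before anything ships.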

4.3 Tokens Are Not Embeddings: The Distinction That Matters

Tokens and Token IDs are the first layer: they transform text into numbers for computational processing. They don't carry semantic meaning.

Embeddings are high-dimensional vectors generated in a later layer of the neural network. They capture meaning and semantic relationships. "machine learning" and "aprendizado de máquina" have close embeddings because the model learned they appear in similar contexts. Their Token IDs have no relationship.

If you're building semantic search, you need embeddings, not Token IDs. They are different technologies for different purposes.

5 The Three Retrieval Models

There's a tendency to treat retrieval models as a linear evolution: Boolean is old, Vector Space is better, neural embeddings are the state of the art. It doesn't work that way in practice.

Each model responds differently to the fundamental question: what does it mean for a document to be relevant to a query? And depending on the use case, the right answer is different.

5.1 Boolean Retrieval

Operates in pure binary logic. A document either satisfies the condition or not. No gradation, no ranking by relevance.

Operators: AND (both terms present), OR (at least one), NOT (first present, second absent).
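
The three operators map directly onto set operations over the documents that contain each term. This is a small sketch with a made-up three-document corpus, using whitespace splitting in place of real tokenization.

```python
docs = {
    1: "bm25 ranking for search",
    2: "neural embeddings for semantic search",
    3: "bm25 and embeddings combined",
}

# term -> set of IDs of the documents that contain it
index = {}
for doc_id, text in docs.items():
    for term in set(text.split()):
        index.setdefault(term, set()).add(doc_id)


def postings(term):
    """Documents containing the term (empty set if the term is unknown)."""
    return index.get(term, set())


bm25_and_search = postings("bm25") & postings("search")         # AND
bm25_or_embeddings = postings("bm25") | postings("embeddings")  # OR
search_not_neural = postings("search") - postings("neural")     # NOT

print(bm25_and_search)     # {1}
print(bm25_or_embeddings)  # {1, 2, 3}
print(search_not_neural)   # {1}
```

Every result is a plain set: a document is either in it or not, and nothing says which match is better.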

I use Boolean when the criteria are absolutely non