Stop Using Embeddings for Everything in RAG

RAG (Retrieval-Augmented Generation) is arguably the most effective application of LLMs at the enterprise level. Why? Because companies live and die by their domain-specific files (PDFs, PowerPoints, internal docs).

As a quick refresher, RAG retrieves relevant enterprise data and passes it to the LLM as context, so the model can answer domain-specific queries accurately.

The Standard RAG Pipeline

A classical RAG pipeline usually looks like this:

Storage: We split documents into chunks, embed them, and store the vectors in a Vector Database.
Intent: We embed the user’s query into a vector of the same kind.
Similarity: We calculate a similarity score (typically cosine similarity) between the query vector and each document vector.
Generation: We select the top $K$ documents, add them to the prompt, and wait for the LLM to do its magic.
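
To make the Similarity and Generation steps concrete, here is a minimal sketch in plain Python. The three-dimensional vectors are toy stand-ins for real embeddings, and the names cosine_similarity and top_k are illustrative, not taken from any particular vector database.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Convention for zero vectors: treat as no similarity.
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k):
    """Return the k (doc_id, score) pairs with the highest similarity."""
    scored = [(doc_id, cosine_similarity(query_vec, vec))
              for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy vectors (assumption: a real system would get these from an embedding model).
doc_vecs = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.7, 0.7, 0.0],
    "doc_c": [0.0, 0.0, 1.0],
}
query_vec = [1.0, 0.1, 0.0]

results = top_k(query_vec, doc_vecs, k=2)
print([doc_id for doc_id, _ in results])  # ['doc_a', 'doc_b']
```

Note that doc_c is dropped purely because k is 2, whatever its content. That cutoff matters later in this article.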

The Problem: Why Embeddings Aren’t a Silver Bullet

When it comes to the retrieval step, most people jump straight to embeddings. I consider this a massive mistake for a few reasons:

1. The Chunking Nightmare

Unless you have tiny documents, finding a good chunking strategy is a total pain. It requires endless iterations. You cannot split the document naively (for example, mid-sentence or mid-table), and you often need to prepend a summary of the document to each chunk to preserve context.
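
As a rough illustration of what "inject a summary into the chunk" can look like, here is a small sketch. It uses overlapping word windows, which is still a fairly naive split; the function name, the window sizes, and the summary prefix format are all assumptions, and a real pipeline would usually respect sentence or section boundaries as well.

```python
def chunk_with_context(text, summary, chunk_words=50, overlap_words=10):
    """Split text into overlapping word windows, each prefixed with a summary."""
    if not 0 <= overlap_words < chunk_words:
        raise ValueError("overlap_words must be in [0, chunk_words)")
    words = text.split()
    step = chunk_words - overlap_words
    chunks = []
    for start in range(0, len(words), step):
        window = words[start:start + chunk_words]
        # The summary travels with every chunk so it keeps document-level context.
        chunks.append(f"[Document summary: {summary}]\n" + " ".join(window))
        if start + chunk_words >= len(words):
            break  # The last window already reached the end of the text.
    return chunks

# Hypothetical 120-word document.
document = " ".join(f"w{i}" for i in range(120))
chunks = chunk_with_context(document, "Perfume catalogue, winter edition")
print(len(chunks))  # 3
```

Even in this toy version there are three knobs to tune (window size, overlap, summary wording), which is where the endless iterations come from.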

2. Debugging Is Hard

Evaluating embeddings takes serious work. You need a representative evaluation dataset complete with soft negatives and hard negatives. On top of that, setting a similarity threshold that keeps the “right” documents and drops the rest is incredibly tricky.
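
To see why the threshold is tricky, here is a sketch that measures precision and recall at different cutoffs. The similarity scores and relevance labels are made up for illustration; in practice they would come from your own labelled evaluation set.

```python
def precision_recall_at_threshold(scored_pairs, threshold):
    """scored_pairs: list of (similarity, is_relevant) for one query."""
    tp = sum(1 for s, rel in scored_pairs if s >= threshold and rel)
    fp = sum(1 for s, rel in scored_pairs if s >= threshold and not rel)
    fn = sum(1 for s, rel in scored_pairs if s < threshold and rel)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical labelled scores for a single query.
eval_pairs = [
    (0.91, True),
    (0.84, True),
    (0.80, False),  # Hard negative: irrelevant, yet scores high.
    (0.62, True),   # Relevant, yet scores low.
    (0.35, False),  # Easier negative with a low score.
]

for threshold in (0.60, 0.80, 0.85):
    p, r = precision_recall_at_threshold(eval_pairs, threshold)
    print(f"threshold={threshold:.2f} precision={p:.2f} recall={r:.2f}")
```

On this toy data no single threshold gives both perfect precision and perfect recall: lowering it lets the hard negative in, raising it drops relevant documents.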

But the biggest issue is the “Top K” trap.

Imagine a retailer selling perfumes. If a user asks, “Show me all fruity perfumes,” a standard vector search will return the top $K$ (e.g., 5) results based on similarity. If you have 50 fruity perfumes in stock, you just missed 45 potential sales because the vector search arbitrarily cut them off.
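
The arithmetic behind this trap is simple: with a fixed $K$, recall can never exceed $K$ divided by the number of relevant items. A tiny sketch (the function name is illustrative):

```python
def max_recall_at_k(k, relevant_count):
    """Best possible recall when only k results are returned."""
    if relevant_count == 0:
        return 1.0  # Nothing to find, so nothing can be missed.
    return min(k, relevant_count) / relevant_count

# The perfume example: K = 5, 50 fruity perfumes in stock.
print(max_recall_at_k(5, 50))  # 0.1
```

So even a perfect embedding model surfaces at most 10% of the fruity catalogue here.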

The Solution: Deterministic Query Translation

Before we start embedding everything in sight, we need to ask: are there more deterministic ways to select our documents?

For many enterprise queries, the user’s intent revolves around attributes (metadata) that we likely already have in our database.

Let’s stick with the perfume example.

The User Query: “Propose a good winter perfume.”

The Database: We do not have a “Season” column. We only have Notes (e.g., Vanilla, Amber, Citrus) and Note_Type (e.g., Head, Heart, Base).

A bad pattern is embedding the description and hoping the semantic search connects “Winter” to the description text.

A better pattern is to use the LLM to reason about the query first. The LLM knows that “Winter” implies warm, spicy, or woody notes. It can translate the user’s vague intent into a precise database filter.

How Query Translation Works

Instead of searching for the word “Winter,” we ask the LLM to map the concept of “Winter” to the specific notes available in our database.

Example:

User Query:

"I'm looking for a nice winter perfume."

LLM System Context:

"You are a perfume expert. The available notes in the database are: ['Amber', 'Sandalwood', 'Vanilla', 'Cinnamon', 'Musk', 'Citrus', 'Mint', 'Rose', 'Jasmine', 'Sea Salt']. Translate the user's intent into a list of Notes from this specific list that fit the intent."

LLM Output (The Translation):

{
  "reasoning": "Winter perfumes typically feature warm, heavy, and spicy scents. From the provided list, 'Amber', 'Sandalwood', 'Vanilla', 'Cinnamon', and 'Musk' correspond to these characteristics.",
  "target_notes": ["Amber", "Sandalwood", "Vanilla", "Cinnamon", "Musk"]
}
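
Here is a hedged sketch of the code around that exchange: building the system context from the notes in the database, then validating the model's JSON before using it. No real LLM call is made; raw_output is a hard-coded stand-in for the model's reply, and the function names plus the final "Reply with JSON" instruction are assumptions added for illustration.

```python
import json

AVAILABLE_NOTES = ["Amber", "Sandalwood", "Vanilla", "Cinnamon", "Musk",
                   "Citrus", "Mint", "Rose", "Jasmine", "Sea Salt"]

def build_system_prompt(available_notes):
    """Assemble the system context shown above from the database values."""
    return (
        "You are a perfume expert. "
        f"The available notes in the database are: {available_notes}. "
        "Translate the user's intent into a list of Notes from this specific "
        "list that fit the intent. "
        'Reply with JSON containing "reasoning" and "target_notes".'
    )

def parse_translation(raw_output, available_notes):
    """Parse the model's JSON and keep only notes that exist in the database."""
    data = json.loads(raw_output)
    target = data.get("target_notes", [])
    if not isinstance(target, list):
        return []
    allowed = set(available_notes)
    # Drop anything the model invented that is not a real database value.
    return [note for note in target if note in allowed]

# Stand-in for the LLM reply; "Oud" is deliberately not in the database.
raw_output = '{"reasoning": "Warm notes suit winter.", "target_notes": ["Amber", "Oud", "Musk"]}'
print(parse_translation(raw_output, AVAILABLE_NOTES))  # ['Amber', 'Musk']
```

Filtering against the known list is what keeps the next step deterministic: only values that actually exist in the database reach the query.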

We then take this structured output and run a deterministic database query (e.g., SELECT * FROM perfumes WHERE note IN ('Amber', 'Sandalwood', ...), where perfumes stands in for your own table name). This returns every product that matches the selected notes, based on logical attributes rather than fuzzy similarity.
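
A runnable sketch of that query using Python's built-in sqlite3 module and an in-memory database. The table name perfumes, its columns, and the sample rows are hypothetical; the point is the parameterized IN clause, which avoids pasting LLM output directly into SQL.

```python
import sqlite3

def find_perfumes_by_notes(conn, target_notes):
    """Return every perfume that has at least one of the target notes."""
    if not target_notes:
        return []
    # One "?" placeholder per note, so values are bound rather than interpolated.
    placeholders = ", ".join("?" for _ in target_notes)
    sql = (
        "SELECT DISTINCT name FROM perfumes "
        f"WHERE note IN ({placeholders}) ORDER BY name"
    )
    return [row[0] for row in conn.execute(sql, list(target_notes))]

# Hypothetical schema: one row per (perfume, note).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE perfumes (name TEXT, note TEXT, note_type TEXT)")
conn.executemany(
    "INSERT INTO perfumes VALUES (?, ?, ?)",
    [
        ("Fireside", "Amber", "Base"),
        ("Fireside", "Cinnamon", "Heart"),
        ("Snow Musk", "Musk", "Base"),
        ("Beach Day", "Sea Salt", "Head"),
        ("Beach Day", "Citrus", "Head"),
    ],
)

winter_notes = ["Amber", "Sandalwood", "Vanilla", "Cinnamon", "Musk"]
print(find_perfumes_by_notes(conn, winter_notes))  # ['Fireside', 'Snow Musk']
```

There is no $K$ anywhere in this query: if 50 perfumes match, 50 come back.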

Why This Wins

Simpler Updates: If you need to rename a note or add a category, it’s a simple database update. No need to re-embed hundreds of documents.

Logic over Luck: We use the LLM’s world knowledge, or your own knowledge through prompting, to bridge the gap between “Winter” and “Amber,” rather than hoping a vector model learned that relationship during training.

No Missed Data: You do not rely on a fuzzy similarity threshold or a Top K cutoff. If the user asks for “fruity,” they get every item that matches the fruity notes.

Conclusion

Does this mean we should never use embeddings? No.

In reality, the most robust systems often use a hybrid approach.

Use Query Translation (Deterministic) for logical constraints (e.g., “Winter”, “Fruity”, “Under $50”).

Use Embeddings (Semantic) for abstract vibes (e.g., “I want to feel young!”).
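
One possible way to combine the two, sketched under assumptions: apply the deterministic filters first, then use similarity only to order the survivors. The product records, the toy two-dimensional vectors, and the hybrid_search signature are all hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def hybrid_search(products, query_vec, max_price=None, target_notes=None):
    """Filter on hard constraints, then rank what is left by similarity."""
    wanted = set(target_notes or [])
    candidates = [
        p for p in products
        if (max_price is None or p["price"] < max_price)
        and (not wanted or wanted & set(p["notes"]))
    ]
    # Similarity only orders the results; it never removes a matching product.
    candidates.sort(key=lambda p: cosine(query_vec, p["vec"]), reverse=True)
    return candidates

# Toy catalogue (assumption: vectors would come from an embedding model).
products = [
    {"name": "A", "price": 40, "notes": ["Amber"], "vec": [1.0, 0.0]},
    {"name": "B", "price": 45, "notes": ["Vanilla", "Musk"], "vec": [0.6, 0.8]},
    {"name": "C", "price": 80, "notes": ["Amber"], "vec": [1.0, 0.0]},
    {"name": "D", "price": 30, "notes": ["Citrus"], "vec": [0.0, 1.0]},
]

hits = hybrid_search(products, query_vec=[0.5, 0.9],
                     max_price=50, target_notes=["Amber", "Vanilla"])
print([p["name"] for p in hits])  # ['B', 'A']
```

C is excluded by price and D by notes, both deterministically, while the "vibe" vector decides only that B is shown before A.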

By combining the precision of structured filtering with the understanding of semantic search, you get the best of both worlds and stop losing sales to the Top K limit.