Embedding-Based Retrieval Strategies That Actually Work
Embeddings are learned, high-dimensional representations of meaning, and the practical takeaway of this piece is simple: standardize and synthesize your documents into retrieval-optimized representations rather than embedding raw text.
Early on at OpenAI, my boss at the time—Peter Welinder, who led the API team—asked me to write documentation for a new API feature: semantic search. The catch was that to explain semantic search, I had to explain embeddings. And there was one small problem: I barely understood what an embedding was.
I had the vaguest mental model. “It represents words or concepts in some multi-dimensional space.” That was about it. But now I had to put it in writing for other people, which forced me to do the only thing that ever really works: assume I know nothing and rebuild the understanding from scratch.
As I started playing around with ways to demonstrate embeddings, I realized two things at once:
- Embeddings are incredibly powerful.
- A lot of people use them without ever getting past the hand-wavey “vector” explanation.
If you want a clean way to think about it, embeddings are kind of the atomic form of what an AI model does. They’re a fuzzy representation of “a thing” in a huge multi-dimensional space, where distance and direction encode relationships.
A simple example I like is this: take the embedding for “Kirk” and the embedding for “Han Solo.” You’ll often find they’re more related to each other than either is to someone like Richard Nixon. But you’ll also see interesting gradations—Kirk and Spock might be closer than Kirk and Han Solo, and Han Solo and Luke might be closer than Han Solo and Kirk, etc. The point isn’t Star Trek trivia. The point is that the model has learned a geometry of concepts based on patterns in text.
The “multi-dimensional space” part is the bit people repeat without really digesting. So here’s the intuition I used when I was trying to teach myself:
Imagine a normal 2D map. X and Y tell you physical location. Two things can be very close on the map.
Now add another dimension: temperature. Suddenly, an ice cube and a volcano could be right next to each other on the X/Y map, but extremely far apart in the “temperature” dimension.
That’s embeddings. Except instead of 3 dimensions, you have hundreds or thousands. Each dimension captures some relationship the model found useful for predicting text. And because it’s learned from text, it isn’t “truth,” it’s “what tends to appear near what.”
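To make that intuition concrete, here’s a tiny sketch with made-up coordinates: two points that are neighbors on a 2D map, then far apart once a temperature dimension is added.

```python
import math

def distance(u, v):
    # Plain Euclidean distance, in however many dimensions you give it
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# (x, y) map coordinates: the ice cube and the volcano are neighbors
ice_cube_2d = (10.0, 10.0)
volcano_2d  = (10.5, 10.0)

# Add a third "temperature" dimension and they fly apart
ice_cube_3d = (10.0, 10.0, -5.0)
volcano_3d  = (10.5, 10.0, 900.0)

print(distance(ice_cube_2d, volcano_2d))  # 0.5
print(distance(ice_cube_3d, volcano_3d))  # ~905
```

Same two objects, one extra dimension, and “close” turns into “very far”—that’s all the multi-dimensional space is doing, just thousands of times over.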
That’s also why embeddings (and early language models) can pick up weird assumptions. A classic example: early models might confidently tell you Harry Potter was married to Hermione. Why? Because the model may have encountered tons of fan fiction where that’s true. In the model’s “text world,” Harry and Hermione show up together constantly, so in embedding space they end up close, and the model “fills in” a relationship that isn’t canon. That’s not the model being malicious; it’s the model being a mirror of the data distribution.
Once you accept that embeddings are locations in a learned space, similarity becomes mechanical. You can score “how close” two embeddings are using things like dot product (or cosine similarity, which is basically normalized dot product). That score can be wildly variable, and it will surprise you if you assume the model’s sense of closeness matches yours. Something that feels “obviously related” to a human may be relatively distant in embedding space if the two concepts don’t actually co-occur much in text.
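Here’s what that scoring looks like mechanically, with toy 3-dimensional vectors standing in for real embeddings (the numbers are made up purely for illustration):

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine_similarity(u, v):
    # Cosine similarity is the dot product after normalizing both
    # vectors to unit length
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

# Toy "embeddings" -- real ones have hundreds or thousands of dimensions
kirk  = [0.9, 0.8, 0.1]
spock = [0.85, 0.9, 0.15]
nixon = [0.1, 0.2, 0.95]

print(cosine_similarity(kirk, spock))  # high: nearby in the space
print(cosine_similarity(kirk, nixon))  # low: far apart
```

Because cosine similarity is just a normalized dot product, the two measures are effectively interchangeable when an embedding provider pre-normalizes its vectors to unit length.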
From there you start to see what embeddings are really doing for you in practical systems: retrieval.
To see what that means concretely, compare a naive chunk with a retrieval-optimized synthetic record for the same book.

Naive chunk:

```
Book title: Dune
Author: Frank Herbert
```

Retrieval-optimized synthetic record:

```json
{
  "title": "Dune",
  "themes": ["political intrigue", "religion", "ecology", "dynastic power"],
  "tone": ["serious", "epic", "strategic"],
  "setting": ["desert planet", "feudal empire"],
  "reader_intent_tags": ["complex worldbuilding", "power struggles", "long-form saga"]
}
```

Embedding the second form makes intent-based retrieval much more reliable.
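In practice, a structured record like that still has to be rendered into text before it goes to an embedding model. A minimal sketch of that step (the commented-out `embed()` call is a placeholder for whatever embedding API you use):

```python
# The synthetic record for Dune from the example above
record = {
    "title": "Dune",
    "themes": ["political intrigue", "religion", "ecology", "dynastic power"],
    "tone": ["serious", "epic", "strategic"],
    "setting": ["desert planet", "feudal empire"],
    "reader_intent_tags": ["complex worldbuilding", "power struggles", "long-form saga"],
}

def render_for_embedding(rec):
    """Flatten a structured record into one text blob to embed."""
    lines = [f"Title: {rec['title']}"]
    for field in ("themes", "tone", "setting", "reader_intent_tags"):
        lines.append(f"{field.replace('_', ' ')}: {', '.join(rec[field])}")
    return "\n".join(lines)

text = render_for_embedding(record)
# vector = embed(text)   # embeddings API call goes here
print(text)
```

The rendering itself is trivial; the point is that every facet you care about—“political intrigue” included—now appears verbatim in the text the model embeds.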
That’s the core of semantic search, and that’s the core of RAG. We’re trying to fetch the right information at the right time. Sometimes you want a very specific fact (“what’s the name of that book?”). But often what you really want is: “find things like this, but along the dimensions I care about.”
And that’s where most RAG systems fall down.
The naive approach is: take a bunch of documents, chop them up into chunks, embed the chunks, and hope similarity search brings back the right stuff. Sometimes it works. Sometimes it absolutely doesn’t. And when people say “RAG doesn’t work,” what they often mean is: “my chunking strategy and my representation strategy were random, and the results were random.”
What I learned pretty quickly is that embeddings don’t just retrieve content—they retrieve representations. If you embed the wrong representation, you retrieve the wrong things.
Here’s a concrete example.
Say I’m building a book search. I load a database with embeddings for book titles. Then I type in “Dune.” I’ll probably get other famous sci-fi titles back. Why? Because “Dune” appears next to other sci-fi titles in lists like “Top Science Fiction Books of All Time.” That’s a real relationship, and it’s useful.
But what if what I really want is: “give me a book like Dune, but specifically because of the political intrigue, deep history, and religious themes”? A title embedding isn’t going to reliably capture that. It might, but you’re basically betting that the training data emphasized the same facets you care about.
A better approach is to create your own representation—your own mini-document—then embed that.
For example, for each book you care about, you can have a model produce a structured description: atmosphere, location, political dynamics, historical depth, religious themes, etc. You might even break those into separate small documents and embed those separately. Then when a user searches “political intrigue,” Dune rises because you explicitly encoded “political intrigue” as part of the representation you embedded, not as an accidental byproduct of what happened to be common on the public internet.
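One way to sketch the “separate small documents” idea, with hypothetical facet names (the facets and their contents here are invented for illustration):

```python
# One small document per facet, each embedded separately,
# all pointing back at the same book
record = {
    "political_dynamics": ["feudal houses", "succession plots", "imperial intrigue"],
    "religious_themes": ["messianic prophecy", "manufactured religion"],
    "historical_depth": ["millennia of backstory", "dynastic lineages"],
}

def facet_documents(book_id, rec):
    """Split one record into per-facet retrieval documents."""
    return [
        {
            "book_id": book_id,
            "facet": facet,
            "text": f"{facet.replace('_', ' ')}: {', '.join(values)}",
        }
        for facet, values in rec.items()
    ]

docs = facet_documents("dune", record)
for d in docs:
    print(d["text"])  # each "text" is what you would actually embed
```

A query like “political intrigue” then matches the political-dynamics document directly, instead of competing against every other facet mashed into one vector.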
That technique—creating synthetic, standardized documents to improve retrieval—is one of the most useful “RAG isn’t magic, it’s engineering” lessons I ever learned. People love to talk about chunking, but the deeper issue is usually: you didn’t shape the data into something retrievable.
That came up in a very practical way when I helped the Department of Housing and Urban Development think through a system for searching and sorting grant applications from cities.
The initial idea was straightforward: take all the applications, embed them, and do semantic search.
But the documents weren’t uniform. Some were six pages. Some were 600 pages. So what exactly am I embedding?
- If I embed only the first six pages of every document, I may miss the critical details buried later.
- If I try to embed 600 pages, I can’t. And even if I chunk it, I’m back to the “random chunking creates random retrieval” problem.
- If I try to “equalize” by padding, I’m just wasting tokens and compute, and I’m still not guaranteeing the right information becomes searchable.
So the solution I recommended was: don’t embed the raw document as-is. Standardize it.
I took a sampling of applications across different sizes and asked a model to extract what mattered: the key categories, significant points, program goals, constraints, requested amounts, relevant codes, intended outcomes—basically the stuff people were actually going to ask about.
Then I used that as a template: a one-page standardized format. After that, I had a model convert every application—no matter how long—into that standardized document. Now every application had a comparable footprint, and more importantly, every application had the information surfaced in a way that made it retrievable.
Practical Example: Standardization Prompt for Long Documents
```
Task: Convert this grant application into a standardized retrieval record.

Return fields:
- project_name
- requested_amount
- target_population
- goals
- constraints
- measurable_outcomes
- compliance_codes

Rules:
- Use concise bullet points
- Keep each field under 80 words
- If unknown, return "unknown"
```
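On the consuming side, it helps to check whatever the model returns against the template before indexing it. A small sketch, assuming the field list from the prompt above (`validate_record` is a hypothetical helper, not part of any library):

```python
# Field names mirror the standardization prompt's template
FIELDS = [
    "project_name", "requested_amount", "target_population",
    "goals", "constraints", "measurable_outcomes", "compliance_codes",
]

def validate_record(record):
    """Force a model-produced record into the template shape.

    Missing fields get "unknown", matching the prompt's rules, so every
    application ends up with the same comparable footprint."""
    return {field: record.get(field, "unknown") for field in FIELDS}

# A made-up partial model response for one application
raw = {"project_name": "Riverside Housing Rehab", "requested_amount": "$2.4M"}
clean = validate_record(raw)
print(clean["goals"])  # -> "unknown"
```

That last step is what actually guarantees uniformity: the model does the extraction, but the code enforces the footprint.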
And because I knew the kinds of questions users would ask, I could bias the representation toward those questions. Lots of those documents were full of boilerplate. If you embed boilerplate, you retrieve boilerplate. If you embed “what is this application asking for and what does it accomplish,” you retrieve answers.
That approach also has a second advantage: you’re not just “embedding documents,” you’re creating synthetic data that is better aligned with your retrieval objective. It’s a subtle shift, but it changes everything.
I always found it funny later when I saw labs publishing versions of this as if it was some brand-new discovery. I’m sure I wasn’t the only person doing it. It’s just one of those things where, if you spend your time solving problems instead of writing papers, you end up rediscovering “obvious in hindsight” techniques.
This is also why I kind of laugh when people confidently declare “RAG doesn’t work.”
I once sat next to the head of a team at a major tech company whose job was literally to sell RAG services to clients. He didn’t understand the fundamentals. He knew embeddings and search at a surface level, but the idea of synthetic representations—reformatting documents into a uniform, retrieval-optimized format—was completely new to him. Meanwhile, he was selling customers millions of dollars of compute.
And that’s one of the uncomfortable truths: the person selling you the product often doesn’t know much about it. Even researchers often haven’t had the time to really play with the systems end-to-end. Embeddings are one of those areas where a little bit of real understanding goes a long way.
One more quick technique I used—more of a throwaway trick, but surprisingly useful—was what I called a baseline document.
When I’m working with a small corpus and I need to understand what’s in it and what’s noise, I’ll create a “baseline” embedding target to compare against. In the embeddings demo I built for the OpenAI docs, I illustrated it with movie directors: I’d take a list of directors (Steven Spielberg, J.J. Abrams, etc.), create a representation of “director-ness,” and then use that baseline to help filter or sanity-check results. The details vary depending on the domain, but the idea is the same: create a stable reference point in embedding space so you can measure what belongs and what doesn’t.
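One way to build a baseline like that is to average the embeddings of your reference items and score everything else against the average. A toy sketch with made-up vectors standing in for real embeddings:

```python
import math

def mean_vector(vectors):
    # Element-wise average: a single point representing the group
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Made-up toy embeddings standing in for real director embeddings
directors = {
    "Steven Spielberg": [0.9, 0.8, 0.1],
    "J.J. Abrams":      [0.8, 0.9, 0.2],
}
baseline = mean_vector(list(directors.values()))  # "director-ness"

# Score candidates against the baseline to see what belongs
candidates = {
    "Greta Gerwig": [0.85, 0.8, 0.15],  # director-like: scores high
    "a toaster":    [0.05, 0.1, 0.9],   # noise: scores low
}
for name, vec in candidates.items():
    print(name, round(cosine(baseline, vec), 3))
```

Averaging is the crudest possible way to build the reference point, but for sanity-checking a small corpus it’s often all you need.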
The bigger takeaway from all of this is pretty simple:
Embeddings aren’t just a feature you turn on. They’re a way of representing information. If you want retrieval to work, you have to care about what representation you embed.
If you embed raw text indiscriminately, you’ll get indiscriminate retrieval. If you standardize, synthesize, and structure what you embed, suddenly “RAG doesn’t work” turns into “RAG works fine, you just have to do the work.”