Context vs Retrieval: A Practical Decision Framework

Use a cost-driven framework to decide whether to put data in the prompt, retrieve it via keywords or embeddings, or fine-tune, guided by a spreadsheet that compares input/output costs and time investment.

I’ve discussed different strategies for fine-tuning, prompt context, and retrieval-augmented generation (RAG) before, but it’s not always obvious which one to use. Here’s an easy way to think about it.

If you need a model to “know” certain information—about a product catalog, a style guide, or anything else—the first thing to figure out is: how much can you condense it down?

  1. If it fits in a reasonable-sized prompt, start there
    If you can compress the needed information into a prompt that’s not outrageous, then calculate the cost of using it every time.

For example, maybe you have a 10,000-word prompt. That would have been impossible a few years ago, but it’s totally doable now. Look at the cost of input tokens, and if you know you’ll be using the model frequently, factor in caching. If you’re using the same prompt over and over, many providers (like OpenAI) let you cache those tokens so you get a discount.

At that point, I’ll literally just put it into a spreadsheet: what does it cost each time the API gets called with that longer context? You might find it actually makes a lot of sense. You don’t need a database, you don’t need any extra retrieval, you don’t need to do anything fancy. Just shove it into the context.
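The spreadsheet math here is simple enough to sketch in a few lines of code. Everything below is a placeholder assumption: the per-token price, the caching discount, and the tokens-per-word ratio are illustrative, not any provider’s actual rates.

```python
# Back-of-the-envelope cost check for keeping everything in the prompt.
# All prices are hypothetical -- substitute your provider's real rates.

TOKENS_PER_WORD = 1.3        # rough rule of thumb for English text
PRICE_PER_M_INPUT = 2.50     # $ per 1M input tokens (placeholder)
CACHED_DISCOUNT = 0.5        # e.g. cached input tokens billed at 50%

def monthly_prompt_cost(prompt_words, calls_per_month, cache_hit_rate=0.9):
    """Estimated monthly spend from resending the same long prompt."""
    tokens = prompt_words * TOKENS_PER_WORD
    full_price = tokens / 1_000_000 * PRICE_PER_M_INPUT
    cached_price = full_price * CACHED_DISCOUNT
    # Blend cached and uncached calls by the expected cache hit rate.
    per_call = cache_hit_rate * cached_price + (1 - cache_hit_rate) * full_price
    return per_call * calls_per_month

# A 10,000-word prompt called 5,000 times a month:
print(f"${monthly_prompt_cost(10_000, 5_000):.2f}")
```

If that number is small relative to the engineering time a retrieval pipeline would take, the decision is already made.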

  2. If the prompt is too big (or too expensive), retrieve what you need
    You might still want to get your cost even lower, and putting in 10,000 words doesn’t always make sense—especially if the user’s input gives you enough of a clue (like a keyword) to go fetch the right piece of data.

There are two main ways to do retrieval:

A) Keyword-based lookup
This can be extremely simple:

  • A basic script that searches for specific terms, or
  • A very small model that extracts keywords and then you use those against your database.
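The “basic script” version really can be this simple. The documents and keywords below are made up for illustration; the point is that a dictionary lookup over extracted terms needs no database or model at all.

```python
# Minimal keyword-based retrieval: match query terms against a small
# in-memory "database". All entries here are illustrative.

DOCS = {
    "returns": "Items can be returned within 30 days of purchase.",
    "shipping": "Standard shipping takes 3-5 business days.",
    "warranty": "All products carry a one-year limited warranty.",
}

def retrieve(query: str) -> list[str]:
    """Return every entry whose keyword appears in the user's message."""
    words = set(query.lower().split())
    return [text for key, text in DOCS.items() if key in words]

print(retrieve("what is your shipping policy"))
```

Swap the `split()` call for a small keyword-extraction model and the shape of the system stays exactly the same.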

B) Embeddings (semantic / “fuzzy” search)
This is where you convert entries into vectors and then look for close matches.

Pros:

  • If you don’t have a perfect match, embeddings can still do a good job finding something relevant.
  • It’s flexible and often more robust than exact keyword matching.

Cons:

  • Cost is higher than simple keyword search.
  • There’s a bit of art to creating good embeddings and getting good retrieval quality.
  • Each search does require compute (not insane—often fine on CPU, faster on GPU).

The good news is cost has fallen a lot. Providers like Cloudflare have easy-to-deploy embeddings databases. I’ve used this on the backend for several projects and barely think about the cost—it can be a few cents per month even with a few thousand uses. When you go to scale, that’s when it becomes a different consideration, and you may want to see if a more traditional database approach uses less compute.
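The core mechanic of embedding search is just nearest-neighbor lookup over vectors. Here is a toy sketch with tiny hand-made vectors; in practice the vectors would come from an embedding model and live in a vector database, not a Python dict.

```python
# Sketch of embedding-based ("fuzzy") retrieval via cosine similarity.
# The 3-dimensional vectors are made up for illustration; real
# embeddings have hundreds or thousands of dimensions.

import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

ENTRIES = {
    "refund policy": [0.9, 0.1, 0.0],
    "delivery times": [0.1, 0.8, 0.2],
}

def nearest(query_vec):
    """Return the entry whose vector is closest to the query."""
    return max(ENTRIES, key=lambda key: cosine(query_vec, ENTRIES[key]))

# A query vector near "refund policy" matches even without exact keywords.
print(nearest([0.8, 0.2, 0.1]))
```

This is the flexibility the pros list describes: a close-but-inexact query still lands on the right entry, at the price of computing similarities on every search.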

  3. Fine-tuning: when retrieval or context still isn’t enough
    Fine-tuning makes sense when neither context nor retrieval is sufficient and the model needs to learn from a lot of examples. Common cases:
  • You want it to truly internalize a style or formatting pattern that’s hard to reliably prompt.
  • You need it to learn new concepts, new phrases, or even new languages.

That said, as input token cost has dropped dramatically—and caching exists—you often don’t need to fine-tune as much as you used to. If you can build a system that rotates or updates your context, sometimes it makes sense to do that instead of even using RAG.

My simplest decision framework: use a spreadsheet
The clearest way I do this is not philosophical—I just run the numbers.

I look at:

  • Input costs
  • Output costs
  • How much data I need to include
  • Whether to store it in a database
  • Whether to use embeddings
  • Or whether to just shove it all into the prompt context
  • Then estimate how many times I’ll make the API call

And there’s another factor that matters a lot: your time.

A long context window may cost more per call, but when you add up how much time it takes to build a database, implement embeddings, keep them updated, maintain reliability, and run the extra services… it might not be worth it.

That’s ultimately the biggest factor: how much time you spend trying to maintain the other stuff.

Fine-tuning can be fun and rewarding—you can create a really cool model—but you also need a flywheel to keep it up to date. That’s worth it in certain situations. But more and more, I find the answer is: just shove it all into the context.