Fine-Tuning Fundamentals: When to Use It and When Not To
Fine-tuning is the last rung on a ladder that starts with prompting and RAG. Decide whether you are teaching facts (memorization) or behavior (generalization), format the data accordingly (facts in the assistant message; behavior in user/assistant pairs), and test on a small, cheap model before scaling up.
People sometimes say fine-tuning is too hard or a waste of time. And to be fair, a lot of the time they’re not wrong: if the information is already inside a really capable frontier model, a good prompt is often all you need.
But there are absolutely cases where you really do need to fine-tune.
The simplest example is when you have data the model just doesn’t have. If I’m an author and I’ve written 20 novels, the model probably has no idea what the actual plots are. If I want it to understand my books, I have to teach it. Same thing for a company with a catalog of products, internal naming, weird edge cases, or policies that aren’t public.
Another case is when you need consistent behavior that a prompt can’t reliably enforce. Maybe I have a particular way I write emails, and there are a thousand different ways I might respond depending on context. Sometimes “just prompt it better” isn’t enough to get the consistency you want.
So in practice, I think of it as a ladder. You generally do these in order:
- Prompting: Try to prompt your way into it first. And now that we can put way more context into a model, a lot of the time the right move is honestly: just tell it everything. Put it all in the context. And because of prompt caching, that can be pretty cheap.
- RAG: If the information is too big or too dynamic, build a retrieval system. Pull the relevant data from a database, paste that into the context, and let the model answer using that.
- Fine-tuning: This is the “last” option, but it’s not a bad option. There are cases where it’s the best one because it’s more expedient. You might not want to rely on another moving part like a RAG system, and you might have too much data to fit into a prompt anyway.
Question: Does the model already know this information?
- Yes -> start with prompting
- No -> can it be fetched at runtime?
- Yes -> use RAG
- No -> fine-tune
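The decision tree above can be sketched as a tiny function. This is a toy illustration, not a real policy; the names and return values are mine:

```python
def choose_approach(model_knows_it: bool, fetchable_at_runtime: bool) -> str:
    """Pick the lowest rung of the ladder that fits the situation."""
    if model_knows_it:
        return "prompting"    # the model already has the information
    if fetchable_at_runtime:
        return "rag"          # retrieve it and paste it into the context
    return "fine-tuning"      # bake it into the weights

# Example: the data isn't in the model, but we can fetch it at runtime.
print(choose_approach(model_knows_it=False, fetchable_at_runtime=True))  # rag
```

In a real project these two booleans are judgment calls you make by probing the model, not values you compute.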
Also, as models get more capable—especially reasoning models—there are multiple ways to solve the same problem. That changes the training game. I’ll talk elsewhere about training reasoning models and building graders, but here I want to cover practical fine-tuning basics that have been very helpful for me.
The first thing to get clear on is what you’re actually trying to do:
Are you trying to teach the model information (facts), or are you trying to teach it a way to behave (style / decision patterns)?
That’s basically crystallized versus fluid intelligence.
Teaching the model information (memorization)
If your goal is “here are 500 facts” or “here are 500 product descriptions,” the job is to get the model to reliably store and recall that information.
In that case, what you want is a dataset that contains all of that information, with each item represented multiple times. Then train in short passes, running through the data repeatedly so the model “sees” each example multiple times and learns it.
You’re not trying to get it to discover a new abstract rule—you’re trying to get it to remember.
Teaching the model behavior (generalization)
If you’re trying to teach a style, a way to write, a way to respond, or a general pattern (not one specific example), you want the model to generalize.
That means you give it broad examples and you let it train longer. You want it to go back over the dataset again and again until it learns why something is right and why something is wrong. The key is consistency of the underlying lesson, but expressed in lots of different ways so it can generalize.
So the tradeoff looks like this:
- Memorize: more repetition of the same information (maybe rephrased), trained in short passes run repeatedly.
- Generalize: more variety of examples that share the same underlying pattern, trained longer so it internalizes the rule.
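The two dataset-construction strategies above can be sketched in code. This is a toy sketch, assuming the chat-format records described later in this piece; the sample facts and function names are my own:

```python
import random

# Placeholder facts, invented for illustration.
facts = [
    "Product ZX-14 supports protocol Q under condition R.",
    "Product ZX-15 requires firmware 2.1 or later.",
]

def memorization_dataset(facts, copies=4):
    """Memorize: repeat each fact several times (optionally rephrased)."""
    records = []
    for fact in facts:
        for _ in range(copies):
            records.append({"messages": [
                {"role": "user", "content": ""},          # user left blank
                {"role": "assistant", "content": fact},   # fact to internalize
            ]})
    random.shuffle(records)  # avoid long runs of the same fact
    return records

def generalization_dataset(pairs):
    """Generalize: many varied examples sharing one underlying pattern."""
    return [{"messages": [
        {"role": "user", "content": instruction},
        {"role": "assistant", "content": output},
    ]} for instruction, output in pairs]

print(len(memorization_dataset(facts)))  # 8 (2 facts x 4 copies)
```

In practice you would rephrase the copies rather than duplicate them verbatim, but the shape of the two datasets is the point: repetition for memorization, variety for generalization.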
Another big shift that matters: how training data is formatted
Training used to be “here’s raw text, train on it.”
Now we can train with explicit conversational structure:
- User message: what someone asks
- Assistant message: what the model should say
That’s great if you’re training a chatbot to behave consistently in an interaction. You can literally show it: “if the user asks this, respond like that.”
But it can be worse if your goal is “teach the model facts about the world,” because the structure matters more than people realize.
This is where a lot of confusion comes from, including a paper I’ve written about before that claimed models can’t generalize in certain directions. They showed models could generalize from A to B, but not from B to A. They called this the reversal curse.
The TL;DR is: it was a consequence of training format.
They trained with A in the user message and B in the assistant message, and then expected the model to infer the reverse direction. But when you train that way, the model doesn’t necessarily treat “user content” as the thing it should internalize as knowledge. It’s learning the mapping from the prompt to the response.
If you want to teach facts, the best move is often: Put the information you want it to learn in the assistant message.
You can even leave the user message blank.
So instead of inventing thousands of Q&A pairs like:
User: “Tell me fact X”
Assistant: “Fact X is …”
You can just put the facts in the assistant section, train on that, and the model will learn from it and generalize from there. That’s also how you avoid the reversal curse issue in the first place.
Practical Example: Fact-Memorization Record
```json
{
  "messages": [
    {"role": "user", "content": ""},
    {"role": "assistant", "content": "Product ZX-14 supports protocol Q under condition R."}
  ]
}
```
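A training set is many such records, typically one JSON object per line in a JSONL file. A minimal sketch of writing one out (the filename and sample records are my own):

```python
import json

# Illustrative fact records in the blank-user / fact-in-assistant format.
records = [
    {"messages": [
        {"role": "user", "content": ""},
        {"role": "assistant", "content": "Product ZX-14 supports protocol Q under condition R."},
    ]},
    {"messages": [
        {"role": "user", "content": ""},
        {"role": "assistant", "content": "Product ZX-15 requires firmware 2.1 or later."},
    ]},
]

# JSONL: one record per line, no enclosing array.
with open("facts.jsonl", "w") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")
```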
When to use user/assistant pairs
If you need the model to respond in a certain way—like editing, rewriting, summarizing, or applying a specific transformation—then yes, you should use the user + assistant format.
Examples:
- Editing: user says “make this punchier” + provides text; assistant returns the punchier version.
- Summarization: user says “summarize this formally” + provides text; assistant gives the formal summary.
In those cases, you want the model to learn a pattern: if I see this kind of instruction, I should produce that kind of output.
Quick practical summary
- Start with prompting. Now that context windows are large (and caching exists), “just stuff it in the context” works surprisingly often.
- If you need external knowledge at runtime, use RAG.
- If you need the behavior or knowledge baked in (and you don’t want system complexity), fine-tune.
For the fine-tune itself:
- Decide whether you’re training memorization (facts) or generalization (behavior/style).
- Memorization: repeat key examples and run enough passes so it sticks.
- Generalization: use diverse examples that teach the same lesson and train longer.
- Facts about the world: put them in the assistant message (user can be blank).
- Specific interaction behavior: use user + assistant pairs.
Practical Example: Behavior-Generalization Record
```json
{
  "messages": [
    {"role": "user", "content": "Rewrite this customer email to be concise and calm: ..."},
    {"role": "assistant", "content": "Thanks for the update. Here are the next steps..."}
  ]
}
```
And one more thing that’s saved me a lot of money: Train the smallest, cheapest model first.
That gives you a fast signal for whether your dataset and approach are working. Then, before you do an expensive run on a bigger model, you’ll already know whether you need more data, different formatting, or different training settings. Smaller models often tell you a lot.
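Cheapest of all is catching format problems before any training run. A hedged sketch of a dataset lint pass; the specific checks are my own, not a standard:

```python
import json

def lint_record(line: str) -> list[str]:
    """Return a list of format problems for one JSONL training line."""
    try:
        rec = json.loads(line)
    except json.JSONDecodeError:
        return ["not valid JSON"]
    msgs = rec.get("messages")
    if not isinstance(msgs, list) or not msgs:
        return ["missing 'messages' list"]
    problems = []
    if msgs[-1].get("role") != "assistant":
        problems.append("last message is not from the assistant")
    if not msgs[-1].get("content"):
        problems.append("assistant content is empty")
    return problems
```

Run it over every line of the JSONL file and fix anything it flags before spending money on a job, big model or small.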