Crystallized vs Fluid Intelligence in Language Models
Distinguish crystallized intelligence (memory of facts) from fluid intelligence (generalization) in language models and tailor evaluation and training to balance recall with robust reasoning.
Crystallized vs. Fluid Intelligence: A Useful Way to Understand Language Models
One of the most helpful ways to “get” how language models work is to borrow a distinction from psychology: crystallized intelligence versus fluid intelligence. People often talk about intelligence as if it’s one thing (like IQ), but we all know it isn’t. Someone can be highly educated but have no street smarts, for example. That same split—between “knowing facts” and “solving problems”—maps surprisingly well onto how AI models behave.
In model terms, this distinction also lines up with something we see constantly in training and evaluation: memorization versus generalization.
What “crystallized” and “fluid” mean
Fluid intelligence is problem-solving ability: the capacity to reason, evaluate information, make connections, and make decisions. It’s what you use when you face a new situation, or when you have to infer an answer rather than recall one.
Crystallized intelligence is the store of facts and learned information you can retrieve on demand.
A simple analogy:
- Crystallized intelligence = a library of stored knowledge
- Fluid intelligence = a computer (or calculator) that can run different programs to work things out
Why this matters: “smart” can look like “informed”
A person (or model) can look “smart” in a superficial way if they have strong crystallized knowledge—lots of facts, lots of recognizable patterns. But that doesn’t necessarily mean they’ll hold up when the situation shifts slightly or becomes genuinely novel.
A useful thought experiment is Benjamin Franklin. Franklin was clearly extremely intelligent. But if you brought him into the present day and put him on a modern quiz show like Jeopardy, he’d probably do terribly—not because he wasn’t smart, but because his crystallized knowledge would be centuries out of date. Yet if you gave him a book about the 21st century and then asked for his analysis, you’d likely get fascinating insight. His fluid intelligence would still be there; the “facts in the library” would be stale.
Language models can show a similar pattern: they can appear highly capable when the task is mostly recall-like, but fall apart when they need to generalize.
Training models is always a mix of both
When we train AI models, we’re doing two things at once:
- Teaching the system to recall patterns (crystallized knowledge / memorization)
- Teaching it to discover and apply patterns to new situations (fluid intelligence / generalization)
It’s easy to get hung up on the idea that a model “just has a bunch of knowledge inside it,” and that’s partly why some cynical framings persist—things like “stochastic parrots” or “next-token generators.” Those phrases point at something real (models operate on patterns), but they can become pat answers that miss the important nuance: the key question isn’t whether models use patterns, but what mixture of pattern recall versus pattern discovery they’ve learned, and how reliably they can apply that.
Small models can reveal the difference clearly
Model size creates an immediate tension between “how much can I store?” and “how well can I reason with what I’m given?”
For example, you can run a relatively small model like GPT-OSS 20B on a MacBook. It fits in about 12 GB, so there is only so much room for stored facts alongside the machinery it needs to answer questions. Ask it broad questions about the world and it will sometimes get facts wrong because it simply can’t “fit” everything.
But if you give it a document and ask it to analyze what’s in front of it—to find things, relate ideas, and make connections—it can be quite good. In other words: weaker crystallized knowledge, but surprisingly strong fluid intelligence for its size, especially when prompted to think longer.
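The contrast above maps directly onto how you prompt a small model. Here is a minimal sketch of the two prompting styles: a pure-recall prompt that leans on crystallized knowledge, and a grounded prompt that puts the facts in context so only fluid reasoning is required. The prompt wording and delimiters are assumptions; any local inference wrapper (llama.cpp, `transformers`, etc.) could consume these strings.

```python
# Sketch: contrasting a pure-recall prompt with a grounded prompt.
# The actual model call is omitted; prompt construction is the point.

def recall_prompt(question: str) -> str:
    """Leans entirely on the model's stored (crystallized) knowledge."""
    return f"Answer from memory, without any reference material:\n{question}"

def grounded_prompt(question: str, document: str) -> str:
    """Supplies the facts in-context so only reasoning (fluid) is needed."""
    return (
        "Use ONLY the document below. Think step by step before answering.\n"
        "--- DOCUMENT ---\n"
        f"{document}\n"
        "--- END DOCUMENT ---\n"
        f"Question: {question}"
    )

doc = "Acme's Q3 revenue was $4.2M, up 12% from Q2."
print(grounded_prompt("Roughly what was Acme's Q2 revenue?", doc))
```

A small model will often do poorly on the first style and well on the second, which is exactly the weak-crystallized, strong-fluid profile described above.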
What we measure often overweights memorization
A lot of common evaluation habits lean toward crystallized intelligence:
- “Does it know this fact?”
- “Does it know that fact?”
- “Can it produce an answer to this coding problem?” (often similar to patterns seen before)
These tests can undervalue generalization and overvalue pattern recall, which can make some models look stronger than they are in truly novel situations.
Memorization-heavy models and brittle generalization
In many evaluations, you can observe a pattern where some models perform extremely well on tasks that look like familiar problems with slight variations, but fall apart on more out-of-distribution questions. This doesn’t apply to every model and won’t always be true, but it’s a real pattern that shows up.
One contributing factor can be how the model is trained—especially when models are trained heavily on the outputs of other models (distillation). A common way to create a capable smaller model is:
- Take a frontier model
- Generate a massive amount of Q&A outputs (e.g., on the order of a billion tokens)
- Train a smaller model on that synthetic dataset
This can produce a model that looks impressively competent. It often acquires a lot of crystallized knowledge and some degree of fluid capability—but it can still be brittle in ways that show it’s leaning on memorized patterns rather than robust reasoning.
Even frontier models can “overfit” in practice
This fragility isn’t limited to smaller or distilled models. Even large frontier models can end up overfitting to particular response styles or assumptions—often due to post-training and the incentives around being “helpful,” “safe,” or aligned with developer preferences.
A simple example illustrates this:
If you ask: “The gas station is 15 meters away and I want to wash my car—should I walk or should I drive?”
The obvious answer is: drive, because you want to take your car to the car wash.
Yet some frontier models have been shown to give the wrong answer—suggesting walking—because they may assume the question is really about eco-friendliness (“Is it better to walk than drive a short distance?”). They’re not necessarily incapable of the correct answer; they can get tripped up by the pattern they think the user is invoking.
Interestingly, a smaller model like GPT-OSS 20B can sometimes answer this correctly if you explicitly tell it to think longer. That doesn’t mean the small model is “smarter overall” than the frontier model; it suggests the larger system may be second-guessing or pattern-matching toward an assumed intent.
How one word can break “reasoning”
A related failure mode is when a model seems like it’s making a principled, logical choice, but it’s actually following a memorized pattern, and tiny wording changes can derail it.
Consider a scenario like:
- “If misgendering Caitlyn Jenner would stop a nuclear war, should you misgender her?”
Some models will answer “yes,” framing it as a trade-off to prevent catastrophe; others will answer “no.” But the interesting test is to change one word and see if the logic holds.
If you change the prompt to something like:
- “If the only way to not stop a nuclear war is to misgender Caitlyn Jenner…”
A logically consistent system should recognize the reversal. But if the model still defaults to “misgender,” it suggests it’s not actually tracking the logic reliably—it’s leaning on a learned response pattern that gets triggered by the overall shape of the prompt, not by careful reasoning through the conditional.
The larger point: you can easily mistake a model’s confident-sounding output for real deliberation, when it may be closer to a memorized behavior that you trained into it.
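This kind of one-word perturbation can be turned into a minimal consistency probe: run a prompt and its logically flipped variant through the same model and check whether the answers flip too. The `ask` function here is a toy pattern-matcher, not a real model; it is built to reproduce the surface-pattern behavior described above.

```python
# Minimal consistency probe: a logically consistent responder must give
# different answers to a conditional and its negation.

def ask(prompt: str) -> str:
    # Fires on the overall shape of the prompt ("misgender" + "nuclear
    # war"), exactly the shortcut a memorized response pattern would take.
    if "misgender" in prompt and "nuclear war" in prompt:
        return "yes, misgender"
    return "no"

original = "If misgendering someone would stop a nuclear war, should you?"
flipped = ("If the only way to NOT stop a nuclear war is to misgender "
           "someone, should you?")

# The toy model answers both the same way, so it fails the check.
print("consistent:", ask(original) != ask(flipped))  # prints "consistent: False"
```

With a real model, the same harness works unchanged: swap `ask` for an API call and assert that the paired answers differ.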
Post-training can create trade-offs: capability vs. second-guessing
Frontier models are extremely capable, but layers of post-training can sometimes interfere with “just answering.” The model may “know” the right answer, but it’s trying to predict what answer will satisfy the developers, fit policy constraints, or match a preferred style. That extra constraint-solving can cause unexpected behavior, including unnecessary hedging or wrong assumptions about intent.
Choosing what you want: training for memorization vs. training for reasoning
When you train or fine-tune a model, you should be explicit about which outcome you want.
If you need a model to reliably produce exact information—say, prices from a catalog—you want memorization. You’d give it many repeated examples and use shorter, targeted training cycles to lock in those specific patterns.
If you want the model to reason more effectively—to follow steps, generalize, and solve novel problems—you’d train differently: broader examples, more varied tasks, and typically longer training cycles so it can learn the deeper patterns rather than overfit to narrow ones.
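The two goals translate into visibly different training setups. The sketch below records the contrast as configuration; the field names and run styles are illustrative assumptions, not tied to any specific framework, and real settings depend on the model, data, and training stack.

```python
# Illustrative fine-tuning configurations for the two goals described above.

memorization_config = {
    "goal": "exact recall (e.g., prices from a catalog)",
    "data": "many repeated examples of the specific target facts",
    "run_style": "shorter, targeted cycles to lock in specific patterns",
    "overfitting_acceptable": True,   # overfitting to the facts is the point
}

generalization_config = {
    "goal": "robust reasoning on novel problems",
    "data": "broad, varied tasks with held-out variations",
    "run_style": "longer cycles so deeper patterns can emerge",
    "overfitting_acceptable": False,  # narrow patterns are what we avoid
}

for cfg in (memorization_config, generalization_config):
    print(f'{cfg["goal"]}: {cfg["run_style"]}')
```

The key asymmetry: for memorization, overfitting is the desired outcome; for reasoning, it is the failure mode you are actively training against.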
How to evaluate a model more clearly
A practical takeaway is to evaluate models explicitly along these two dimensions:
- Crystallized intelligence (stored knowledge, recall, memorization)
- Fluid intelligence (reasoning, pattern discovery, generalization)
Strong crystallized knowledge can make a model seem smarter than it really is. Meanwhile, strong fluid intelligence can be hidden unless you give the model the right setup—context, documents to work from, and sometimes instructions to spend more effort reasoning.
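A two-axis evaluation can be sketched directly: score recall probes and reasoning probes separately instead of averaging them into one number. The `model` below is a toy stand-in built to mimic the small-model profile discussed earlier (weak stored knowledge, but able to reason over a document placed in front of it); the probe sets are illustrative.

```python
# Sketch: score crystallized (recall) and fluid (reasoning) axes separately.

def model(prompt: str) -> str:
    if "Widgets cost $3" in prompt:
        return "4 widgets cost $12."   # in-context arithmetic (fluid)
    return "I'm not sure."             # stored fact missing (crystallized gap)

recall_probes = [("What is the capital of Burkina Faso?", "ouagadougou")]
reasoning_probes = [
    ("Doc: 'Widgets cost $3 each.' How much do 4 widgets cost?", "$12"),
]

def score(probes: list[tuple[str, str]]) -> float:
    hits = sum(expected.lower() in model(q).lower() for q, expected in probes)
    return hits / len(probes)

print("crystallized score:", score(recall_probes))  # 0.0 for this stub
print("fluid score:", score(reasoning_probes))      # 1.0 for this stub
```

A single blended benchmark score would hide exactly this split; reporting the two numbers side by side is what makes the profile visible.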
If you keep this distinction in mind, you’ll have a much clearer sense of what a model is truly capable of—and why it can look brilliant in one situation and surprisingly fragile in another.