Challenging AI Paper Claims with Practical Replication
In a fast-moving field, bold claims about AI limitations are often artifacts of the training setup. Treat them as testable hypotheses: re-run the experiment with varied data formats so the model learns the relationship in its own outputs, not just in the prompts.
One of the recurring problems with academic research into AI is the pace. By the time a paper is published, it can already be out of date—sometimes by multiple model generations. That has turned into a kind of running joke: someone cites a headline claiming “AI can’t do X” (whether that’s medical Q&A, coding, or something else), and when you look closer, it’s often based on a model that’s two or three years old. In a field moving as quickly as this one, that lag ends up shaping public opinion in a misleading way, making the technology look further behind than it actually is.
There’s a second issue too: a lot of academic work still struggles to capture how these systems behave in practice—especially the parts that matter. Some of the more “critical” papers are written with a strong thesis in mind, sometimes by people who have been in the space for decades and want to argue that everyone is making a fundamental mistake. To be clear, I’m not saying we won’t hit unexpected roadblocks. We might. I also don’t think transformer architectures are necessarily the most efficient way to compress knowledge and problem-solving. If the human genome is on the order of 1.6GB, and animals with even smaller genomes can do incredible things with behavior “programmed in,” it’s reasonable to assume there’s plenty of room for improvement in how we build these systems.
That said, when I read papers that claim to have found some big, fundamental limitation, a surprising number of them fall apart on inspection—not because the question is bad, but because the experimental setup is flawed. Often the problem is simple: the authors don’t spend enough time actually working with the models, and they don’t “red team” their own conclusions.
A good example is the “reversal curse” paper. The basic claim was that if a model learns facts in one direction—A implies B—it can’t generalize in the reverse direction—B implies A. The canonical example was something like: if the model is trained on “Tom Cruise’s mother is Mary Lee Pfeiffer,” it supposedly won’t reliably answer “Who is Mary Lee Pfeiffer’s son?” with “Tom Cruise.” The paper got a lot of attention because if that claim were fundamentally true, it would be a serious limitation.
But when I read the methodology, the issue jumped out: they were training the model in a way that essentially prevents the behavior they were testing for.
Here’s what they did (simplifying slightly). They created many training examples in a chat-style format, where the “user” message contained a question like:
“Who wrote Sharks in Space?”
…and the “assistant” message contained the answer:
“John Whistleheim.”
They then observed that, after training on many examples of this sort (A → B), the model performed at chance when asked the reverse form (B → A), such as:
“What book did John Whistleheim write?”
Their interpretation was: the model can memorize “book → author” but can’t infer “author → book.”
The problem is that this training format is not a neutral choice. Modern chat-tuned models are designed to be cautious about treating the user prompt as authoritative “truth,” and in standard chat fine-tuning the loss is typically computed only on the assistant tokens: the user message is context the model conditions on, not something it is trained to produce. If you want the model to internalize facts and relationships, the factual content generally needs to appear in the assistant message, so the model learns it as part of its own output rather than merely as an input it is responding to.
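To make the difference concrete, here is a minimal sketch of the two formats side by side, assuming an OpenAI-style chat JSONL layout (`messages` with `role`/`content`). The book and author names are invented for illustration, not taken from the paper's dataset:

```python
import json

# Hypothetical book/author pair, echoing the paper's synthetic setup.
book, author = "Sharks in Space", "John Whistleheim"

# Format A (roughly what the paper used): the fact lives in the user
# message, and the assistant emits only the bare answer. With standard
# chat fine-tuning, loss is computed on assistant tokens only, so the
# model is never trained to *produce* the full relationship.
paper_style = {
    "messages": [
        {"role": "user", "content": f"Who wrote {book}?"},
        {"role": "assistant", "content": author},
    ]
}

# Format B: the full relationship is stated in the assistant message,
# so both entities and the link between them are training targets.
output_style = {
    "messages": [
        {"role": "user", "content": f"Tell me about the book {book}."},
        {"role": "assistant", "content": f"{book} was written by {author}."},
    ]
}

print(json.dumps(paper_style))
print(json.dumps(output_style))
```

In Format A, only the isolated string "John Whistleheim" ever appears as a training target; in Format B, the sentence linking book and author does.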
So I ran a simple test. I generated synthetic data—fake books and fake authors—so we weren’t dealing with any contamination from real-world knowledge. First, I trained the model the way the paper trained it. Then I trained the same model, on the same kind of data, but with the key difference: I moved the relevant information into the assistant response so the model learned the relationship as part of what it produces, not merely what it is asked about.
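The setup can be sketched as follows. This is not the code I actually ran; it is a minimal reconstruction with invented names, showing synthetic pairs, training examples with the fact in the assistant turn, and held-out reverse-direction probes:

```python
import random

# All names are generated on the spot, so there is no contamination
# from real-world knowledge the base model may already have.
random.seed(0)

FIRST = ["John", "Mara", "Tobias", "Lena"]
LAST = ["Whistleheim", "Okafor", "Brandt", "Silva"]
NOUNS = ["Sharks", "Clocks", "Orchids", "Comets"]
PLACES = ["Space", "Winter", "the Deep", "Milan"]

def make_pairs(n):
    """Generate n fake (book, author) pairs, each book and author used once."""
    pairs, used_books, used_authors = [], set(), set()
    while len(pairs) < n:
        book = f"{random.choice(NOUNS)} in {random.choice(PLACES)}"
        author = f"{random.choice(FIRST)} {random.choice(LAST)}"
        if book in used_books or author in used_authors:
            continue  # avoid contradictory facts (one book, two authors)
        used_books.add(book)
        used_authors.add(author)
        pairs.append((book, author))
    return pairs

def to_training_example(book, author):
    # The key change versus the paper: the full relationship is stated
    # in the assistant turn, so it is part of the training target.
    return {"messages": [
        {"role": "user", "content": f"Tell me about {book}."},
        {"role": "assistant", "content": f"{book} was written by {author}."},
    ]}

def to_reverse_probe(book, author):
    # Held-out question in the B -> A direction, never seen in training.
    return {"question": f"What book did {author} write?", "answer": book}

pairs = make_pairs(10)
train = [to_training_example(b, a) for b, a in pairs]
probes = [to_reverse_probe(b, a) for b, a in pairs]
```

After fine-tuning on `train`, the test is simply whether the model answers the `probes` above chance; in my runs, with this format, it did.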
When trained this way, the model did generalize. It learned the relationship well enough to answer both directions: A → B and B → A. In other words, the “fundamental roadblock” wasn’t fundamental; it was, at least in large part, an artifact of how the training examples were constructed.
I don’t want to be unfair to the authors. At the time, documentation around best practices for training and formatting chat data wasn’t great. You often had to dig through scattered sources and experiment to learn what actually works. But it was still surprising that they didn’t try basic variations—mixing formats, swapping where the information lives, or stress-testing their own result. Especially because they even noted an important clue: if you simply put the relevant facts into a single prompt and asked the reverse question, the model could answer correctly. That should have been a flashing signal that the limitation might be coming from the training setup, not the model’s underlying capacity.
The broader point is this: there’s a lot of “secret knowledge” in AI right now—not secret because it’s intentionally hidden, but because it’s learned through hands-on experimentation rather than neatly captured in papers. I’ve personally spent thousands of dollars a month at times training small models purely out of curiosity, just to see what works and what doesn’t. And that kind of tinkering teaches you things you will miss if you only read headlines or skim abstracts.
So my advice is simple: when you see a confident claim that “models can’t do X,” treat it as a hypothesis, not a verdict. Look for whether the result is based on outdated systems, whether the setup matches how these models are actually trained and used today, and whether the authors seriously tried to break their own conclusion. In this field especially, the difference between a real limitation and a training artifact is often just one small, testable detail.