GPT-4 Vision Refrigerator Demo: A Practical Multimodal Moment
A fridge photo serves as a simple, human-centered demo to show GPT-4's multimodal understanding and practical usefulness.
After spending about two years on OpenAI’s API team as an engineer—working on creative applications and prompt design—I was asked to also become the company’s science communicator. It wasn’t so much a career switch as it was adding another hat. The opportunity appealed to me for a simple reason: I wanted to challenge myself to explain what these models actually do to the general public.
A big part of that job was figuring out how to make complex systems feel intuitive without oversimplifying them into nonsense. I worked on projects like the DALL·E video, which was genuinely fun—mostly because “How do you explain a text-to-image model?” is not an easy question. I also got to collaborate with a great group of producers behind the PBS Space Time YouTube channel, who were fantastic partners.
Then GPT-4 happened.
From a communications standpoint, GPT-4’s launch came with a familiar problem: researchers have precise ways of talking about model capability—benchmarks, evaluations, scores, tables. But when you’re communicating with the general public, and especially with reporters and journalists who then have to translate your translation, things like “MMLU” and a list of evals don’t always land. People’s eyes glaze over. Even if the metric is meaningful, it’s not always meaningful to them.
So my job became: find simple, accurate ways to explain why GPT-4 was special—why it was a leap forward.
The funny part is that GPT-3.5 had already improved a lot in general capability, largely thanks to post-training and better training methods. But GPT-4 really did feel significant in multiple ways. Two differences stood out.
The first was context length. GPT-4 could take on the order of 32,000 tokens, whereas GPT-3 had been closer to 2,000 (with some improvements along the way in GPT-3.5). That’s not just a number—it changes what the model can hold in mind, what kinds of documents it can deal with, how long a conversation can stay coherent, and how much detail you can provide without the whole system falling apart.
The second was vision.
GPT-4 was the first model of this caliber that could understand both text and images. Vision models have improved considerably since then, but at the time it was a big deal—and I needed a way to show that quickly and clearly.
The catch is that the way models “see” isn’t quite the way we do. They get an image and have to infer the gist of the scene. That means tasks we think of as straightforward—like solving a maze or finding Waldo—can actually be more difficult than you’d expect, because they require searching and attention. Even for humans, if I flashed you a Where’s Waldo image and then told you to close your eyes and tell me where he is, you’d probably struggle. We search visually. We scan. Models have their own versions of those constraints.
Yes, you can get a vision model to do challenging things if you let it break the image down and attend to different parts of it—but for launch communication, I didn’t want an elaborate setup. I wanted one example. One simple thing that explained it.
And this is where the story gets extremely unglamorous.
I was sitting where I am now, late at night, right before the launch, trying to come up with a clean demo that would make sense instantly. I got hungry. I went downstairs, opened the refrigerator, and stared inside. And I remembered an old IBM Watson demonstration where you could list ingredients from a fridge and it would suggest a recipe.
I thought: okay. This is a much bigger step forward than anything from that era. And instead of typing a list of ingredients, I can just take a photo.
So I ran back upstairs, grabbed my camera, went back down, took a photograph of the inside of my refrigerator, and asked: what can I make?
GPT-4 with vision gave me a set of suggestions—different things I could make based on what it could see inside the fridge.
It was exactly what I needed: a simple, easy-to-understand example that didn’t require anyone to care about eval names or benchmark acronyms. You didn’t need a technical background to get it. You could see it and immediately understand the point: this model can look at an image, understand what’s in it, and help you do something useful.
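For anyone curious what that demo looks like as code today, here’s a minimal sketch using the OpenAI Python SDK. The model name, prompt text, and helper function are my own illustrative choices, not what was used at launch; the only real requirement is a vision-capable model and an image encoded as a data URL.

```python
import base64


def build_fridge_prompt(image_path: str) -> list:
    """Build a chat message asking what can be cooked from a fridge photo.

    The image is inlined as a base64 data URL, one of the formats the
    chat completions API accepts for image input.
    """
    with open(image_path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("utf-8")
    return [
        {
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "What can I make with what's in this fridge?"},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{encoded}"}},
            ],
        }
    ]


# Sending the request (requires an API key; model name is illustrative):
# from openai import OpenAI
# client = OpenAI()
# response = client.chat.completions.create(
#     model="gpt-4o",  # any vision-capable model works here
#     messages=build_fridge_prompt("fridge.jpg"),
# )
# print(response.choices[0].message.content)
```

That’s the whole trick: one photo in, a list of meal ideas out.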
I showed it to my bosses, Hannah Wong and Steve Dowling, and they loved it for the same reason. It was clear. It was relatable. And it was kind of fun.
That refrigerator photo ended up being part of our GPT-4 launch examples, and it got repeated across the news media as a shorthand for “here’s what multimodal AI can do.” It worked.
What’s even funnier, though, is what happened later. Recently my wife showed me a TikTok where a woman was saying people should learn to use “codecs” (her term) and not just use the model to take a photo of the inside of the refrigerator to figure out what to eat. Which made me laugh, because three years ago it was a fresh, compelling example. Now it’s basically a cliché.
But that’s kind of the point.
When you’re trying to communicate what these systems are, you often have to step away from the metrics and the internal language and the research framing. Not because those things are unimportant, but because they’re not the fastest bridge to understanding. Sometimes the best explanation is something basic and human:
I’m hungry. Can this help solve that problem?
And sometimes, the simplest demo—like a photo of a refrigerator—does more work than a dozen benchmark charts ever will.