GPT Tools: Fast Prototypes, Real Constraints, and Shipping
OpenAI’s rapid progress rested on human coordination and practical tooling—such as a token counter and a four-model comparison report—more than on perfect code.
When I started at OpenAI in the summer of 2023, it was after I’d spent several months giving feedback on model capabilities—and even volunteering for anything that would help me learn more. Documentation, testing, random odd jobs… whatever needed doing.
What immediately stood out once I joined was just how fast everything was moving.
The API team—the group responsible for making GPT-3 available to outside developers—actually had its roots in OpenAI’s robotics team under Peter Welinder. That shift tells you a lot about the era: once OpenAI saw what language models and the transformer architecture could do, they went all in. That meant restructuring, reassigning people, and building entirely new org shapes around a technology that suddenly mattered more than anything else.
One thing I genuinely admired was how OpenAI treated capability as transferable. Really strong people could be strong in more than one domain. A team that had been doing robotics could become the nucleus of an API team. And it worked.
But it also came with a reality that outsiders often miss: the team was small compared to what Google and others had, and that meant everyone wore a lot of hats. You did whatever you could to get things done, however you could. There were plenty of “nice to have” things that simply weren’t feasible yet.
The bottleneck isn’t always code
There’s a funny misconception people have about building software—especially in AI companies—that the main blocker is writing code.
A good example: OpenAI recently released the Codex app (the latest incarnation), and it’s a Mac app right now. Windows users understandably want it on Windows, and you’ll see comments like: “If you can write all this AI code, why can’t you just ship a Windows app too?”
But the problem often isn’t the code.
The bottleneck is all the human work around it: coordinating beta testers, talking to users, collecting feedback, figuring out what’s broken versus what’s confusing, deciding what to change, validating fixes, and looping again. AI can help with parts of that, but it doesn’t replace the basic reality that you have to engage with people to learn what works.
That part—talking to humans—is wildly underappreciated in tech circles. It’s also a big reason some things ship and… kind of suck.
Tokens were mysterious, so I built tools
Back then, tokens were still a mysterious concept for most developers. People didn’t have intuition for the fact that tokens aren’t quite words, and that pricing and context limits are based on these weird little chunks of text representation.
Trying to help people answer basic questions—like “How many tokens is this document?” or “How much will this cost?”—was harder than you’d think.
And it’s even funnier in hindsight when you look at pricing evolution. When GPT-3 launched, it was 6 cents per thousand tokens, which works out to $60 per million. Today you can point at a budget tier like GPT-4.1 nano and see pricing around 10 cents per million tokens: roughly six hundred times cheaper. But at the time, people were trying to build businesses and budgets around token math they didn’t understand.
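To make that scale concrete, here’s a quick back-of-the-envelope sketch. The two prices are the figures mentioned above; the helper function itself is just illustrative:

```python
# Rough cost comparison between GPT-3-era and current budget-tier pricing.
# The two prices are the figures from the text; the helper is illustrative.

def cost_usd(tokens: int, price_usd: float, per_tokens: int) -> float:
    """Cost of processing `tokens` tokens at `price_usd` per `per_tokens` tokens."""
    return tokens * price_usd / per_tokens

# GPT-3 at launch: $0.06 per 1,000 tokens
gpt3_cost = cost_usd(1_000_000, 0.06, 1_000)      # ~$60 per million tokens
# A current budget tier: $0.10 per 1,000,000 tokens
nano_cost = cost_usd(1_000_000, 0.10, 1_000_000)  # $0.10 per million tokens

print(f"GPT-3 at launch: ${gpt3_cost:.2f} per million tokens")
print(f"Budget tier now: ${nano_cost:.2f} per million tokens")
print(f"Roughly {gpt3_cost / nano_cost:.0f}x cheaper")
```

Same unit, same arithmetic, a 600x difference: that’s the kind of intuition people were missing when they budgeted by token math they didn’t understand.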
So I built a token counter. Just a simple tool hosted on a site I called gpttools.com.
Then I built more.
At the time there were four main models available: Ada, Babbage, Curie, and DaVinci (the top-tier GPT-3 model everyone knew). The other three were often very capable—but it wasn’t obvious which one you should use for a given task.
So I built a tool where you could paste in a prompt and run it across all four models, compare outputs, and even use “best of” style sampling. It would generate a report so you could see, in a practical way, which model tended to get it right, which failed, and what you might want to optimize.
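None of that code survives, but the shape of the tool is easy to sketch. In this minimal version, `complete(model, prompt)` is an injected stand-in for whatever completions client you’re using (so the harness has no network dependency), the model names are the four from that era, and the “pick the longest sample” scoring is a placeholder for what was really a judgment call made by eye:

```python
# Sketch of a multi-model comparison harness, in the spirit of the tool
# described above. The real version called the completions API; here the
# `complete` function is injected so the harness is self-contained.

from typing import Callable, Dict, List

MODELS = ["ada", "babbage", "curie", "davinci"]  # the four GPT-3-era models

def compare_models(
    prompt: str,
    complete: Callable[[str, str], str],  # (model, prompt) -> completion text
    models: List[str] = MODELS,
    best_of: int = 1,                     # draw n samples per model
) -> Dict[str, str]:
    """Run one prompt across several models, keeping each model's best sample."""
    report: Dict[str, str] = {}
    for model in models:
        samples = [complete(model, prompt) for _ in range(best_of)]
        report[model] = max(samples, key=len)  # placeholder scoring heuristic
    return report

def format_report(report: Dict[str, str]) -> str:
    """Render a simple side-by-side text report."""
    sections = [f"=== {model} ===\n{output}\n" for model, output in report.items()]
    return "\n".join(sections)
```

With a real client plugged in as `complete`, the report is exactly what the original tool produced: the same prompt, four outputs side by side, so you can see which model tends to get it right.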
To be clear: a more experienced engineer at OpenAI could have built something way better than my quick version.
But they were busy doing something else—like figuring out how to scale GPT-3 to handle billions of tokens per day.
So I built it myself.
The absurd part: OpenAI linked to my personal site
This is the part that still makes me laugh.
Things were moving so fast (and the team was so busy) that OpenAI’s documentation ended up linking to my personal website.
Think about that for a second: OpenAI’s official docs pointing to something I basically vibe-coded overnight because it needed to exist, and there wasn’t time yet to build or host an official version.
And what made it even funnier, at least to me, is the contrast: the year before, my last “credit” was appearing on an episode of Shark Week—and before that, I’d starred in my own reality TV prank/magic show. Yes, I’d spent several years learning how to code. But still: it’s objectively ridiculous that OpenAI’s developer documentation was, for a moment, pointing people to something I threw together on my own site.
It’s completely insane to imagine today. But at the time, it made a kind of scrappy sense: there was a real need, the tools were useful, and shipping something now was better than shipping something perfect later.
Of course, things are different now. OpenAI has teams and infrastructure dedicated to building much better developer experiences, and it’s genuinely great to see.
But the reason I built those tools in the first place was simple:
I just wanted them to exist.