Prompt Size Reduction Checklist: Cut Tokens Without Losing Quality
Use a practical prompt-optimization checklist to reduce token usage by cleaning up examples, cutting verbosity, narrowing labels, and batching multiple classifications in a single API call for faster, cheaper results.
Back when GPT-3 DaVinci was first released, reducing prompt size wasn’t just a nice-to-have—it was directly tied to cost. DaVinci cost $0.06 per thousand tokens, which meant that if you were using roughly a thousand words in a prompt, you were effectively paying about six cents every time you clicked submit.
I joined OpenAI after working as a developer and spent time experimenting with launching my own platform on top of GPT-3. Because I was watching costs closely, I became very sensitive to prompt efficiency—and I tried to help the companies we worked with reduce their token usage wherever possible.
Looking back through my notes recently, I found one of my internal messages describing how I helped a client cut their prompt size in half. What stood out is that a lot of the rules that worked back then still apply today.
In this post, I’ll share the checklist I used and a concrete example of the kinds of changes that can reduce token consumption without sacrificing output quality—plus one additional trick that can significantly increase throughput per API call.
A client came to me with two related goals:
- Improve the efficiency of their prompt size (reduce token use).
- See whether they could get the API to provide multiple responses in one go, instead of resubmitting repeatedly to classify items one at a time.
Here’s the checklist I use for this kind of prompt optimization.
- Look at the quality of the examples
The number one problem I see is developers using example data that’s messy and often contradictory. In this case, though, their data was fine, and they weren’t having issues with output quality.
- Is the prompt easy for a human to understand?
Their prompt was very clear. That’s important: if a human can’t quickly understand what the prompt is asking for, the model is more likely to struggle, and you’ll end up compensating by adding more text (and more tokens).
- Is the prompt too verbose?
In their case, they were using multiple labels where one would work fine.
Before:
Q: Payee is "Amazon" and amount is $ (880.10)
A: Category is Equipment Purchased:Computer
After:
Payee: Amazon & amount: $880.10
Category: Equipment Purchased:Computer
This could be reduced even more, but this one formatting change alone cut prompt size by about 30%.
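To make the saving concrete, here is a small sketch that compares the two formats. The token counts use a crude whitespace-and-punctuation split rather than a real tokenizer (for actual budgeting you would use something like tiktoken), so treat the numbers as an approximation:

```python
import re

def approx_tokens(text: str) -> int:
    """Very rough token estimate: word runs and individual punctuation marks."""
    return len(re.findall(r"\w+|[^\w\s]", text))

before = (
    'Q: Payee is "Amazon" and amount is $ (880.10)\n'
    "A: Category is Equipment Purchased:Computer"
)
after = (
    "Payee: Amazon & amount: $880.10\n"
    "Category: Equipment Purchased:Computer"
)

saving = 1 - approx_tokens(after) / approx_tokens(before)
# Roughly a one-third reduction, in line with the ~30% figure above.
print(f"before: {approx_tokens(before)} tokens (approx)")
print(f"after:  {approx_tokens(after)} tokens (approx)")
print(f"saving: {saving:.0%}")
```

Even with this crude count, dropping the `Q:`/`A:` scaffolding and the redundant "is" labels shaves about a third of the tokens per example, and the saving multiplies across every example in the prompt.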
- Are they providing too many (or too few) examples?
They were providing far more examples than needed, and repeating the same example multiple times without adding variation. Sometimes you do need to teach the model that not all Spotify bills are $9.99—but you don’t need to show it six copies of the same exact pattern.
By reducing the number of examples (and using smaller classification labels), the API still understood the task—and we were able to cut the prompt by about 60% in that section.
- Is there a more efficient way to do this?
This was the biggest gain for them. Instead of doing one API call per classification, we showed them how to structure a single request to return multiple classifications in one response.
The pattern: provide a few classification examples, then provide a list of items to be classified, and show what the expected multi-item output format looks like. (The example we used wasn’t perfectly optimized, but it worked.)
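A minimal sketch of that batching pattern is below. The helper names, payees, and categories are illustrative (they are not the client's actual data), and a real integration would send `build_batch_prompt(...)` to the completions API; here we only build the prompt and parse a multi-item response:

```python
# Few-shot classification examples shown to the model up front.
EXAMPLES = [
    ("Amazon & amount: $880.10", "Equipment Purchased:Computer"),
    ("Spotify & amount: $9.99", "Subscriptions:Music"),
]

def build_batch_prompt(items):
    """Build one prompt that asks for a category per numbered input line."""
    lines = []
    for payee, category in EXAMPLES:
        lines.append(f"Payee: {payee}")
        lines.append(f"Category: {category}")
    lines.append("")
    lines.append("Classify each of the following, one category per line:")
    for i, item in enumerate(items, 1):
        lines.append(f"{i}. Payee: {item}")
    lines.append("Categories:")
    return "\n".join(lines)

def parse_batch_response(text, n_items):
    """Parse 'N. Category' lines from the model's single response."""
    categories = {}
    for line in text.strip().splitlines():
        num, _, category = line.partition(".")
        if num.strip().isdigit():
            categories[int(num)] = category.strip()
    return [categories.get(i, "") for i in range(1, n_items + 1)]
```

One call shaped like this replaces N separate calls, which is where the throughput gain described below comes from; the numbered-list output format also makes the response easy to parse deterministically.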
By combining these improvements, we reduced their overall prompt size by about 50% and returned 5x as many results per API call—effectively a ~10x improvement in value per call (or, alternatively, creating headroom to add better examples while keeping cost controlled).
Original email (from my notes)
Back when GPT-3 DaVinci was first released, reducing prompt size was especially important because it cost $0.06 per thousand tokens. And if you were using a thousand words in a prompt, that meant you were basically paying 6 cents every time you clicked the submit button. As someone who came to OpenAI after working as a developer, experimented with launching my own platform built on top of GPT-3 and paying a lot of attention to the cost, I was very sensitive to this and tried to do what I could to help the different companies we were working with in reducing their costs. Going through my notes, I came across one of my internal messages describing how I was able to help a client reduce the size of their prompt by half. A lot of the rules that worked back then still apply today.
Here’s a little example where a client wanted to improve the efficiency of their prompt size (to reduce token use) and was also curious to see if they could get the API to provide multiple responses back instead of having to resubmit every time to classify data.
This is the checklist I use for something like this:
- Look at the quality of the examples
The number one problem I see is devs using data that’s messy and often contradictory. However, their data was fine and they weren’t having a problem with the quality of the results.
- Is the prompt easy for a human to understand what’s wanted?
Their prompt was very clear.
- Is their prompt too verbose?
In this case they were using multiple labels when one would work fine:
Before:
Q: Payee is "Amazon" and amount is $ (880.10)
A: Category is Equipment Purchased:Computer
After:
Payee: Amazon & amount: $880.10
Category: Equipment Purchased:Computer
This could be reduced even more, but we’ve reduced the prompt size by 30% with just this one change.
- Are they providing too many or too few examples?
In this case they were providing way more examples than needed, repeating the same examples multiple times without variation. Sometimes you need to show the API that not all Spotify bills are $9.99, but you don’t need to show it six examples of the same exact thing.
By reducing the number of examples and using smaller classification labels that the API still understood, we were able to reduce the prompt by 60%.
- Is there a more efficient way to do this?
While all of that was helpful, in their case the biggest gain was showing how we can use one API call to return more than one classification example. This is accomplished by showing the API examples of classifications then a list of items to be classified and an example of how to classify it (the example here isn’t well optimized but it works.)
By combining all of these little improvements we were able to take their prompt and reduce it by 50% and return 5 times as many results giving them a 10x return on each API call (or more headroom for additional examples.)
Here are some of the points that I think still apply (extracted from the email):
- Token cost makes prompt size a real budget line item, not an academic concern—small reductions can compound quickly at scale.
- Start by auditing your examples: messy or contradictory examples are one of the biggest drivers of bad results and bloated prompts.
- If a human can’t quickly understand what the prompt is asking, you’ll likely end up “paying tokens” to compensate for unclear instructions.
- Verbosity is often structural, not just wordy prose—remove redundant labels and compress the format of examples.
- Don’t oversupply repetitive examples; you want coverage and variation, not duplicates.
- Cutting examples intelligently (while keeping diversity) can preserve quality while dramatically reducing tokens.
- The biggest wins can come from changing the interaction pattern, not just trimming text—batch multiple classifications into a single API call.
- Combining small optimizations (format + fewer examples + batching) can produce step-change improvements in cost and throughput.