The Prompt Context Flywheel for Continuous Improvement

Periodically mine your conversations, have an LLM propose updated prompts that reflect current context, and deploy the improved prompt, either in production or via shadow testing. Repeated over time, this living prompt context flywheel steadily improves responses.

Sometimes you’re working on an application where you need to continuously update the context that goes into your prompt.

That can be because new information keeps arriving—maybe you have a catalog of items, or you’re ingesting the latest news—or because you’re actively trying to improve the way the system responds over time. Like: you’re trying to get better outcomes, and you want a mechanism that makes the prompt itself evolve.

At this point, it’s become much easier to build a flywheel that improves your prompt.

A silly example: imagine I’m trying to get a model to write in ridiculous Gen Z slang—“looksmaxxing,” “armogging,” etc. (God knows why.) One way to do this is to build a system that watches how you interact with the model, then periodically looks back at the conversation thread and asks:

Okay, knowing what you know now, what should the input prompt be?

And then it rewrites the prompt. Over and over. Basically letting it evolve based on the conversations you have.

This is kind of what’s already at play when you use things like ChatGPT and Claude. In the background, there are memory systems that periodically update what the model knows about you.

But you can also do this at the app layer (or org-wide), where the context in the prompt needs to be updated periodically. You can treat prompt context as a living thing: changing based on current news, newly available information, or patterns learned from conversations.

So what does this look like in practice?

If you’re using an API (OpenAI or any other provider) and you have an app running conversations, you’re probably already saving those conversations into a database.
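If you aren’t persisting conversations yet, a minimal version can be a single SQLite table. This is just a sketch; the table and column names here are made up for illustration:

```python
import json
import sqlite3

# In-memory DB for illustration; point this at a file in a real app.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE IF NOT EXISTS conversations (
        id INTEGER PRIMARY KEY,
        messages TEXT NOT NULL,  -- JSON list of {role, content} dicts
        feedback TEXT            -- e.g. 'helpful' / 'unhelpful' / NULL
    )
""")

def save_conversation(messages, feedback=None):
    """Store one finished conversation, optionally with a user flag."""
    conn.execute(
        "INSERT INTO conversations (messages, feedback) VALUES (?, ?)",
        (json.dumps(messages), feedback),
    )
    conn.commit()

def load_flagged(feedback):
    """Pull back every conversation with a given feedback flag."""
    rows = conn.execute(
        "SELECT messages FROM conversations WHERE feedback = ?", (feedback,)
    ).fetchall()
    return [json.loads(r[0]) for r in rows]
```

The `feedback` column is what lets the periodic script below select only the conversations worth learning from.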

Then, periodically, you run a script that:

  1. Pulls conversations from storage
    You might select ones that were flagged as helpful or unhelpful.

  2. Packages them up with the original prompt
    You take the conversation(s), include the original system prompt (or full prompt template), and add an instruction like:

Given this outcome, what would you change about the prompt to make it better?
  3. Asks an LLM to propose a revised prompt
    The model generates a new version of the prompt designed to handle that scenario better—whether the conversation was a “positive outcome” you want more of, or a “negative outcome” you want to prevent.

  4. Deploys the improved prompt
    You take that new prompt and apply it to the live system.
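The packaging step (2 and 3 above) is mostly string assembly. Here’s a sketch, assuming conversations are stored as lists of `{role, content}` dicts; the function name and instruction wording are mine, not a fixed API:

```python
REVISION_INSTRUCTION = (
    "Given these conversations and their outcomes, what would you change "
    "about the system prompt to make it better? Reply with the full "
    "revised prompt only."
)

def build_revision_request(original_prompt, conversations):
    """Bundle the original system prompt plus sampled conversations
    into one meta-prompt for the revising LLM."""
    parts = [f"ORIGINAL SYSTEM PROMPT:\n{original_prompt}\n"]
    for i, convo in enumerate(conversations, 1):
        transcript = "\n".join(f"{m['role']}: {m['content']}" for m in convo)
        parts.append(f"CONVERSATION {i}:\n{transcript}\n")
    parts.append(REVISION_INSTRUCTION)
    return "\n".join(parts)

# The result gets sent to any chat-completion endpoint, e.g. (untested sketch):
# response = client.chat.completions.create(
#     model="gpt-4o",
#     messages=[{"role": "user", "content": build_revision_request(prompt, convos)}],
# )
```

Whatever comes back is your candidate prompt for step 4.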

There are a couple ways to roll it out.

Option A: Just update the live prompt
This is the simplest approach. The prompt evolves, you ship it, and you keep iterating.

Option B: Run it as a shadow system (safer, trickier)
For shorter exchanges, you can do something like a shadow prompt.

What that means is:

  • The user interacts with the original prompt (the current production version).
  • At the same time, the new prompt generates the response it would have given.
  • Then you evaluate the difference between the two.

That’s where it gets a little trickier, but it’s a solid approach if you want to test prompt changes without immediately impacting users.
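The shadow setup reduces to calling both prompts on the same input and logging the pair. A minimal sketch, assuming `llm` is any callable that takes a system prompt and a user message and returns a reply string:

```python
def shadow_compare(user_message, prod_prompt, shadow_prompt, llm):
    """Answer the user with the production prompt, and separately
    record what the candidate (shadow) prompt would have said."""
    prod_reply = llm(prod_prompt, user_message)      # what the user sees
    shadow_reply = llm(shadow_prompt, user_message)  # logged, never shown
    return {
        "user": user_message,
        "production": prod_reply,
        "shadow": shadow_reply,
        "differs": prod_reply != shadow_reply,
    }
```

You can then batch-evaluate the logged pairs (by hand, or with an LLM judge) before promoting the shadow prompt to production.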

If your main goal is “improve my prompt,” the easiest version is honestly:

  • Take your previous conversations
  • Put them into a very large context model (one with a million-token context window)
  • Ask: “Can you identify patterns and ways I can improve this?”

These models can often pick up surprisingly useful things, like:

  • “This terminology could be better.”
  • “This outcome keeps happening under these conditions.”
  • “You’re missing an instruction that would have prevented this failure mode.”
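For that easiest version, the whole setup is one concatenation. A sketch, with the closing question taken verbatim from above and the separator format made up here:

```python
def build_analysis_prompt(conversations):
    """Concatenate past conversations into a single request
    for a long-context model."""
    transcripts = []
    for i, convo in enumerate(conversations, 1):
        lines = "\n".join(f"{m['role']}: {m['content']}" for m in convo)
        transcripts.append(f"--- Conversation {i} ---\n{lines}")
    return (
        "\n\n".join(transcripts)
        + "\n\nCan you identify patterns and ways I can improve this?"
    )
```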

It sounds like it’d be a pain to set up, but actually—using a tool like Codex—it’s surprisingly easy.

You just have to tell it the outcome you want.