Posts

Browse the complete archive of prompting notes in this reading order.

How I Became OpenAI's First Prompt Engineer

Dive deep into an AI frontier, rigorously test and document prompts, and openly share useful findings to stand out and land a pioneering role like OpenAI's first prompt engineer.

The Evolution of Prompts: From Completion to Systems

Prompts have evolved from pattern-based completion to outcome-focused instructions; the practical takeaway is to give the model the simplest, clearest description of the finished product and its success criteria so it can deliver the desired outcome.

Base Models vs Post-Training: What Each Layer Does

Base models are broad, raw text learners, while post-training adds an instruction-driven layer that greatly increases usefulness but can lead to overfitting, so the takeaway is to balance raw capabilities with careful post-training and prompt design.

GPT Demo Set List: Early Prompt Patterns That Still Hold Up

Curated prompts from a GPT-3 demo set reveal practical capabilities: token-based world view, autocomplete, structured text, translation, summarization, tone and persona control, multi-voice outputs, and turning unstructured text into structured data.

GPT-3 Grammar and Style Editing in Practice

GPT-3 enables advanced grammar and style edits, tone adjustment, coherence improvements, and format transformations across text without explicit training as a dedicated grammar tool.

Rethinking best_of in GPT-3: Why It Misleads

Relying on best_of to improve LLM accuracy is misguided. The practical fix is to define clear task boundaries with better prompts and to ground interpretation with outlier examples, which can let you use smaller models and single-shot prompts while reducing cost.

Temperature in LLMs Explained: What It Actually Controls

Temperature adds a controlled amount of randomness so the model explores alternative paths; it does not boost creativity. It helps break repetitive outputs, risks nonsensical results at high values, and is often unnecessary with modern models.
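The mechanism behind that summary can be sketched directly: temperature divides the logits before the softmax, so low values sharpen the distribution toward the top token and high values flatten it toward uniform. A minimal sketch; the logits below are illustrative, not from any real model.

```python
import math
import random

def sample_with_temperature(logits, temperature, rng=None):
    """Scale logits by 1/temperature, softmax, then sample one index.

    Low temperature sharpens the distribution (near-deterministic);
    high temperature flattens it (more exploration, more nonsense risk).
    """
    rng = rng or random.Random(0)
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max before exp for numeric stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    r = rng.random()
    cumulative = 0.0
    for i, p in enumerate(probs):
        cumulative += p
        if r < cumulative:
            return i, probs
    return len(probs) - 1, probs
```

With logits [2.0, 1.0, 0.1], a temperature of 0.1 gives the top token nearly all of the probability mass, while a temperature of 10 leaves the three options close to uniform.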

Creating Better Quiz Distractors with LLMs

Crafting plausible quiz distractors is hard; a practical workaround is to use a smaller model with a higher temperature to generate incorrect-but-plausible options, though results can still vary.

Using Small Models for Complex Natural-Language Tasks

Thoughtful prompting and lightweight schemas let small language models reliably convert flexible natural-language input into structured data for real-world tasks like scheduling, at a fraction of the typical cost.
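A minimal sketch of the lightweight-schema idea applied to scheduling: pair the user's free-form text with an explicit JSON schema in the prompt so even a small model has an unambiguous target. The schema fields and prompt wording here are hypothetical illustrations, not taken from the post.

```python
import json

# Hypothetical schema for turning text like "lunch with Sam next Tuesday
# at noon" into structured data; the field names are illustrative.
EVENT_SCHEMA = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "date": {"type": "string", "description": "ISO 8601 date"},
        "time": {"type": "string", "description": "24-hour HH:MM"},
    },
    "required": ["title", "date", "time"],
}

def build_extraction_prompt(user_text: str) -> str:
    """Embed the schema in the prompt so the model's output target is explicit."""
    return (
        "Extract the event from the text below as JSON matching this schema:\n"
        + json.dumps(EVENT_SCHEMA)
        + "\n\nText: " + user_text + "\nJSON:"
    )
```

The resulting prompt string is what you would send to a small model; constraining the output to a schema is what makes the small model's behavior reliable.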

Large Text Pattern Analysis with Prompted Models

Feed large batches of text into a single context window to extract overall patterns and sentiment across many posts, enabling scalable, non-sequential analysis while monitoring for hallucinations.

Compute at Scale: Growth, Limits, and AI Demand

Compute needs will rise with human ambition, potentially to about 1,000× today's levels, and will be met through strategic, highway-like infrastructure expansion and smarter use rather than by chasing unlimited physical scaling.

Embedding-Based Retrieval Strategies That Actually Work

Embeddings are learned, high-dimensional representations used for retrieval, and the practical takeaway is to standardize and synthesize documents into retrieval-optimized representations rather than embedding raw text.
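Once documents are synthesized into retrieval-optimized representations and embedded, retrieval itself reduces to nearest-neighbor search over vectors. A minimal sketch using cosine similarity; the toy 3-dimensional vectors stand in for real embedding-model outputs.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query_vec, index, top_k=2):
    """index: list of (doc_id, vector) pairs; returns the top_k doc ids
    ranked by cosine similarity to the query vector."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]),
                    reverse=True)
    return [doc_id for doc_id, _ in ranked[:top_k]]
```

In practice the index would hold embeddings of the synthesized, retrieval-optimized document representations rather than embeddings of the raw text.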

Grounding Prompts with Wikidata and SPARQL

Ground model outputs in Wikidata by constructing SPARQL queries with correct property and entity IDs, optionally aided by a lightweight query generator or retrieval workflow, to fetch real data and reduce hallucinations.
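A minimal sketch of the Wikidata grounding step: construct a SPARQL query with explicit IDs (P19 is Wikidata's real place-of-birth property; Q937 in the test below is Albert Einstein) aimed at the public endpoint at query.wikidata.org/sparql. The helpers only build the query and request URL; actually issuing the HTTP request is left out.

```python
import urllib.parse

WIKIDATA_ENDPOINT = "https://query.wikidata.org/sparql"

def build_birthplace_query(entity_id: str) -> str:
    """Return a SPARQL query for an entity's place of birth (property P19)."""
    return (
        "SELECT ?placeLabel WHERE {\n"
        f"  wd:{entity_id} wdt:P19 ?place .\n"
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }\n'
        "}"
    )

def query_url(sparql: str) -> str:
    """URL-encode the query for a GET request; format=json asks for JSON results."""
    return WIKIDATA_ENDPOINT + "?" + urllib.parse.urlencode(
        {"query": sparql, "format": "json"}
    )
```

Getting the property and entity IDs right is the hard part a lightweight query generator or retrieval step would handle; the query execution itself is mechanical.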

The Prompt Context Flywheel for Continuous Improvement

Periodically mine conversations, have an LLM propose updated prompts that reflect current context, and deploy the improved prompt as a living prompt context flywheel—either in production or via shadow testing—to steadily improve responses.

Fine-Tuning Fundamentals: When to Use It and When Not To

Fine-tuning is a final option after prompting and RAG, chosen for memorization of facts or generalization of behavior, with practical steps to test on small models first and format data accordingly (facts in the assistant message; behavior in user/assistant pairs) before scaling.
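The two data formats mentioned above can be sketched as chat-style fine-tuning examples. The `{"messages": [...]}` JSONL shape is the common convention for chat fine-tuning, though the exact schema varies by provider; the example contents below are invented for illustration.

```python
import json

# Fact memorization: the fact lives in the assistant message, so the
# model learns to produce it.
fact_example = {
    "messages": [
        {"role": "user", "content": "What is our refund window?"},
        {"role": "assistant",
         "content": "Refunds are accepted within 30 days of purchase."},
    ]
}

# Behavior generalization: user/assistant pairs demonstrating the target
# transformation, repeated across many varied inputs.
behavior_example = {
    "messages": [
        {"role": "user", "content": "make this formal: hey, send the report"},
        {"role": "assistant",
         "content": "Could you please send the report at your earliest convenience?"},
    ]
}

def to_jsonl(examples):
    """Serialize examples one-per-line, the usual fine-tuning file format."""
    return "\n".join(json.dumps(e) for e in examples)
```

Testing this format on a small model first, as the post suggests, keeps iteration cheap before scaling up the dataset or the base model.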

Fine-Tuning Methods Guide: SFT, DPO, and Beyond

Fine-tuning is a toolbox of SFT, DPO, reinforcement fine-tuning, and vision fine-tuning; pick the method by your goal (memorization vs generalization, explicit behavior, reasoning with graders, or robust augmentation) rather than defaulting to one technique.

Cost Savings via Fine-Tuning Smaller Models

Fine-tune a smaller model on high-quality examples derived from a larger model to preserve performance while substantially lowering per-call costs, with potential to step down to even smaller models as you scale the dataset.

Lessons from an Ambitious AI Build

Tackling a truly ambitious AI build forces intense, hands-on learning in prompt design, tool usage, and system design tradeoffs, yielding practical, scalable know-how for real AI apps.

Why I Didn't Launch AI Channels

High costs and slow response times with GPT-3 made AI Channels impractical as a consumer product, so I prioritized learning and joined OpenAI instead of launching.

Big and Small Models in Robotics: A Hybrid Architecture

Adopt a layered, multi-model architecture in robotics that pairs large, high-level models for complex reasoning with fast, specialized models for real-time perception and control, with coordinated handoffs to balance latency, capability, and safety.

The Frontier Is Wider Than It Looks

The frontier is wider than ever, and the key takeaway is to invest in reasoning-based prompting and a middle layer of classification to guide answers, enabling safer, cheaper, and more reliable AI.

Challenging AI Paper Claims with Practical Replication

Bold claims of AI limitations are often training artifacts in a fast-moving field; treat them as testable hypotheses and verify by re-running experiments with varied data formats so the model learns relationships in its outputs, not just the prompts.