Adapting Foundation Models: From Prompts to Reinforcement Learning

In 2012 I took Pieter Abbeel’s CS 188 class at Berkeley. We used reinforcement learning to play Pacman with Q-tables. The loop was straightforward: take an action, record the outcome, update the table.

That basic loop hasn’t changed. Today we replace the table with policy gradient methods like REINFORCE or PPO, and the policy itself is a transformer with billions of parameters. The mechanics are more complex, but the underlying structure is the same: act, observe, adjust.

What has changed is the landscape around it. Foundation models arrive pre-trained with enormous capabilities, and the question becomes: how do you steer them? There’s a hierarchy of methods, from quick prompts to sophisticated reinforcement learning, each with different tradeoffs and means of evaluation.

Prompting as the First Lever

The easiest way to steer a model is prompting. No new weights are trained; you simply tell the model what to do. Prompt engineering works because large models already encode vast latent capabilities — the right phrasing can elicit them.

But prompting is brittle. Prompts that work in one context often fail in another, and small changes in wording can shift outputs dramatically. That’s why fine-tuning emerged as the next step: a way to make steering systematic rather than ad hoc.

What Fine-Tuning Really Does

Fine-tuning is supervised learning. You collect input–output pairs, and the model learns to predict the right output token by token. It’s how a model picks up legal vocabulary, a medical style of documentation, or a customer service tone.

The limitation is that fine-tuning only covers what you can encode in examples. If the data isn’t there, the model won’t learn it. And because many fine-tuning datasets are synthetic or drawn from the same sources used in pretraining, they often add less novelty than people expect. Fine-tuning refines, but it rarely expands.

Still, fine-tuning remains the fastest way to adapt a general model to a vertical without building new infrastructure. That’s why it dominates in practice: it requires less engineering overhead than building new environments or eval pipelines.

The Case for Reinforcement Learning

Reinforcement learning works differently. Instead of labels for every state–action pair, you let the model interact with an environment and score the result. The signal is sparse, but it allows discovery.

A poker agent doesn’t need a dataset of every possible hand and sequence of bets. Given the rules, it can play millions of games, receive feedback, and learn profitable strategies like bluffing. A child learning to roll over does the same: random movement, feedback from the world, gradual improvement.

What makes RL powerful is that the environment itself generates data. Each rollout is a new trajectory that didn’t exist before, and the agent improves by exploring beyond the examples a human could ever hand-label.

Why Datasets Still Matter in RL

In theory, RL environments only require rules. In practice, applied domains rarely have clean rulebooks. To build useful environments, you need datasets to make simulations realistic.

  • In medicine, patient records define the transition probabilities: how likely a treatment is to succeed, how side effects manifest, how comorbidities matter.
  • In robotics, empirical data about sensor noise, friction, and wear is what makes simulations transfer to the real world.
  • In enterprise software, logs of past tickets or transactions make agent environments reflect actual customer behavior rather than a toy model.

So when you hear someone say they have “great datasets” for RL environments, that’s what they mean: not that they’re fine-tuning, but that their environments are grounded in reality, with action → result distributions that make simulation useful.

But even with the right data, you need a way to measure whether the agent is actually learning the right thing.

Evals as the Third Lever

Both fine-tuning and reinforcement learning depend on evals.

Evals are how we define the reward for RL and how we measure progress in fine-tuning. They range from unit tests for code to medical outcome metrics to general-purpose benchmarks like MMLU. Without evals, fine-tuning is blind and RL is meaningless.

Poor evals don’t just stall progress — they can actively mislead it. A model trained against a narrow benchmark may “game the test,” optimizing for surface patterns while failing in the real world. Early attempts to measure summarization quality illustrate this: models learned to parrot keywords that scored well while producing incoherent summaries. Without robust evals, the gradient updates push in the wrong direction.

This is why eval infrastructure has become its own category of work. It provides the scoring functions that let others fine-tune, run RL, and prove improvement.

Where the Competition Is Headed

Prompting, fine-tuning, reinforcement learning, and evals are not separate silos — they form a hierarchy of levers for steering models. Prompting is the fastest but least reliable; fine-tuning encodes systematic knowledge; reinforcement learning drives discovery; evals provide the scoreboard.

Together, these approaches form the dominant tools for adapting foundation models. Pretraining builds general competence; prompting and fine-tuning align models to tasks and domains; reinforcement learning lets them explore strategies through interaction; evals measure and guide the whole process.

The competitive frontier is shifting toward reinforcement learning environments. Companies building domain-specific RL environments, from medicine to coding, understand something crucial: environment design itself becomes intellectual property. The best environments don’t just reflect existing data — they evolve continuously, with better evals and benchmarks baked in.

That creates a new dynamic where environment sophistication, not just model size, determines who wins in applied domains. Whether labeled, simulated, or scored, control over high-quality data is becoming the central currency in applied AI.

Continuity

Reinforcement learning itself hasn’t changed much since Pacman. The algorithms are refinements, not reinventions. What’s new is that transformers can generalize across domains, making them flexible and ready to be specialized, steered, or set loose in new environments.

Prompting steers. Fine-tuning specializes. Reinforcement learning discovers. Evals hold them accountable. Together, they are how we push general models into new territory. All constrained by the same scarce resource: high-quality data.

Building Something New?

We want to hear about it.

Get in touch
  • Share