OpenAI Founders Table Dinner, SF Edition

Last night we co-hosted the inaugural founder dinner at OpenAI’s library, a beautiful space that matched the caliber of conversation, which, with respect, was 100x richer than anything you might glean from chatting with an instruction-tuned foundation model.

The topic was evals, something that’s come up again and again in recent conversations with founders building AI products.

What’s clear is that evals are the north star for AI products. They’re your metaphorical OKRs, the measure of whether you’re actually solving the problem you set out to solve. Everything else (prompt engineering, fine-tuning, RL) is just a tool for improving against your eval benchmarks.

One of the most surprising themes of the night was how many founders are now worried about what we might call “eval drift”: live production systems gradually diverging from their evaluation benchmarks as the system is improved, the underlying model is updated, and so on. It’s the observability problem for applied AI: how do you ensure your model keeps behaving the way it did at launch?
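
One way to make eval drift concrete is to compare the score on your offline benchmark with the score on a graded sample of live traffic, and flag the gap when it grows past a tolerance. The sketch below is only an illustration under stated assumptions, not a method any guest described: `EvalRun`, `detect_drift`, the example counts, and the 5-point tolerance are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class EvalRun:
    """Graded results for one set of cases (hypothetical structure)."""
    label: str
    passed: int
    total: int

    @property
    def pass_rate(self) -> float:
        if self.total == 0:
            raise ValueError(f"eval run '{self.label}' has no cases")
        return self.passed / self.total


def detect_drift(benchmark: EvalRun, production: EvalRun, tolerance: float = 0.05) -> bool:
    """Return True when benchmark and production pass rates differ by more than `tolerance`.

    The tolerance is an assumed value; a real team would tune it to its own noise level.
    """
    gap = abs(benchmark.pass_rate - production.pass_rate)
    return gap > tolerance


# Made-up numbers purely for illustration.
benchmark = EvalRun("offline_benchmark", passed=92, total=100)
production = EvalRun("graded_production_sample", passed=84, total=100)

if detect_drift(benchmark, production):
    print("Benchmark and production have diverged; revisit the eval set.")
```

A gap in either direction is treated as drift here, on the assumption that a benchmark that no longer predicts production behavior is stale whether production looks better or worse.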

The space is still so early that most serious teams are building their own eval platforms from scratch, even as strong players like our guests Galileo and Ranger, and our hosts at OpenAI, are tackling the problem. There’s still no consensus on what a best-practice eval platform looks like, especially for multi-agent, tool-using AI products.

I’m starting to think proprietary evals might end up being the most defensible layer of the AI stack, which is great for vertical AI. I could be wrong. My second pick would be how you define the reward model for reinforcement-learning fine-tuning, which OpenAI is now extending to agent workflows that use tools.

Every tech transition has new moats. For the web it was the network; for mobile it was the interface. For AI, it may be something stranger: a defensibility born from how well you can see and steer an alien intelligence.

Grateful to the OpenAI team, especially Shyamal, who leads Applied Evals, for the depth of conversation. Nights like this clarify what matters, and how labs like OpenAI can support (rather than compete with) startups.
