Andrej Karpathy released autoresearch on 7 March 2026. Single insight: an autonomous agent runs 100 ML experiments overnight, each one measured against a single metric. Keep what improves the metric, discard what doesn’t. No human judgment in the loop. No “let’s ship this because it seems good.” Just metric feedback. This is how frontier ML research is done now. It’s also how frontier AI products should be built — but almost none are.
The Pattern
The traditional AI product development cycle: write a prompt, benchmark it (informally), ship it, iterate based on user feedback. This is the equivalent of shipping code without tests. The bugs still get caught eventually; the cost is just paid by users instead of engineers.
The measurement-first cycle: define your inputs (realistic samples), define your task (what should the model do?), define your score (a 0-1 quantification of output quality). Build an eval harness that runs your task against your inputs and outputs a score. Before touching the prompt, run the harness to establish a baseline. Write the prompt. Run the harness again. If the score goes down, reject the change. Keep only changes that improve the score.
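In code, the loop is small. Here is a minimal sketch, assuming a “task” is any function from input text to model output; the exact-match scorer and the dataset field names are illustrative stand-ins, not a prescribed schema:

```python
from statistics import mean
from typing import Callable

# One eval = inputs (dataset) + task (model call) + score (0-1 quality).

def exact_match(output: str, expected: str) -> float:
    """0-1 score. Exact match here; a rubric or judge model in practice."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0

def evaluate(task: Callable[[str], str], dataset: list[dict]) -> float:
    """Run the task over every input and average the per-example scores."""
    return mean(exact_match(task(ex["input"]), ex["expected"]) for ex in dataset)

def accept_change(old_task: Callable[[str], str],
                  new_task: Callable[[str], str],
                  dataset: list[dict]) -> bool:
    """Measurement-first gate: keep a prompt change only if the score does not drop."""
    return evaluate(new_task, dataset) >= evaluate(old_task, dataset)
```

Wire something like accept_change into CI and every prompt change becomes a pass/fail decision rather than a debate.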
Braintrust’s eval framework (published 2024) formalised this: an eval = inputs + task + score. But the corollary most teams miss: you need multiple evals. One eval is a local-maximum trap. Two evals are better. Three evals are required if you’re shipping to users. The biggest mistake is keeping only passing evals. You need failing evals too. A failing eval says “here’s a case where the model is definitely wrong.” That’s where signal lives.
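One way to make that concrete, with a toy task standing in for a real model call and two deliberately tiny suites (one holding cases the model handles, one holding cases it currently gets wrong):

```python
from statistics import mean

def toy_task(text: str) -> str:
    """Stand-in for the real model call; purely illustrative."""
    return text.upper()

def exact_match(output: str, expected: str) -> float:
    return 1.0 if output == expected else 0.0

suites = {
    # Cases the model handles today: a regression here blocks the change.
    "passing": [{"input": "ok", "expected": "OK"}],
    # Cases the model is definitely wrong on today: this is where signal lives.
    "failing": [{"input": "reverse me", "expected": "em esrever"}],
}

for name, dataset in suites.items():
    score = mean(exact_match(toy_task(ex["input"]), ex["expected"]) for ex in dataset)
    print(f"{name}: {score:.2f}")

# A change is interesting when it raises the "failing" score
# without lowering the "passing" score.
```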
Manthan’s Analytical Council pipeline runs on exactly this pattern. Single metric: weighted backtest accuracy. Target 80% (current 67.6%). After every prompt change, run the benchmark against 213 scorecards and 10 frozen deals. If accuracy drops, the change is rejected. No human argues for a change that lowers accuracy. No manager overrides the metric. This isn’t dogmatism; it’s signal integrity.
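As an illustration only (the scorecard fields and weighting scheme below are assumptions for the sketch, not Manthan’s actual schema), a weighted backtest gate can be a few lines:

```python
# Hypothetical scorecard shape: {"deal_id": str, "known_outcome": str, "weight": float}

def weighted_accuracy(predictions: dict[str, str], scorecards: list[dict]) -> float:
    """Weighted share of scorecards where the predicted outcome matches the known one."""
    total = sum(sc["weight"] for sc in scorecards)
    correct = sum(
        sc["weight"]
        for sc in scorecards
        if predictions.get(sc["deal_id"]) == sc["known_outcome"]
    )
    return correct / total if total else 0.0

def gate(old_preds: dict[str, str], new_preds: dict[str, str], scorecards: list[dict]) -> bool:
    """Reject any prompt change that lowers weighted backtest accuracy."""
    return weighted_accuracy(new_preds, scorecards) >= weighted_accuracy(old_preds, scorecards)
```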
The measurement-first pattern extends upstream: Karpathy’s autoresearch doesn’t just measure the output. It measures the measurement. Every 24h, evaluate whether your eval itself is measuring the right thing. Self-correcting loops. If the eval is measuring noise (high variance, low correlation to production outcomes), replace it. Frontier teams do this quarterly. Manthan does it weekly.
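A rough sketch of what evaluating the eval can look like, assuming you log eval scores alongside eventual production outcomes for the same changes; the correlation and noise thresholds here are illustrative, not canonical:

```python
from statistics import correlation, pstdev  # correlation requires Python 3.10+

def eval_is_healthy(eval_scores: list[float],
                    production_outcomes: list[float],
                    repeated_run_scores: list[float],
                    min_corr: float = 0.5,
                    max_noise: float = 0.05) -> bool:
    """Replace the eval if it is noisy or uncorrelated with what users actually see."""
    corr = correlation(eval_scores, production_outcomes)  # does the eval track reality?
    noise = pstdev(repeated_run_scores)                   # run-to-run variance of the eval itself
    return corr >= min_corr and noise <= max_noise
```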
Why It Matters
For product teams: if you’re building an AI product and you don’t have an eval harness, you’re shipping blind. You think you’re improving the product. You’re actually drifting. The difference surfaces in month 9 when churn accelerates and you can’t figure out why. An eval harness catches this in week 2.
For founders: “what’s your core metric?” is the first question to ask any AI product founder. If they say “user engagement” or “we don’t measure it yet,” they’re in trouble. If they say “GPT-4 evaluation on a benchmark of 100 realistic tasks,” they’re thinking like frontier teams. The metric is the PRD. Everything else is implementation.
For investors evaluating AI products: push founders on eval rigor. Ask: (1) Do you have a regression test? (2) What percentage of changes increase the metric? (If it’s >50%, your eval is too loose.) (3) What happens when you run the eval on edge cases? (4) Do you have failing evals? (5) How often do you refresh your eval? The answers reveal the team’s maturity 12 months before the market does.
For ML engineers: if you’re accustomed to training on static datasets and optimising a validation loss, you’re not ready for AI product development. AI products require continuous eval cycles against real-world inputs. Start here: build an eval harness for your next project before you write any prompt or fine-tuning code. Measure first. Everything else follows.
The Charaka View
Manthan’s competitive advantage in deal analysis is not the prompts. Dozens of teams have written similar 12-persona investment frameworks. The edge is the measurement layer. We measure against 213 scorecards — real funding rounds with known outcomes. We measure weekly. We reject changes that don’t improve accuracy. By month 12, we will have measured against 300+ scorecards. By year 2, 500+. Every competing team measuring on a static benchmark (or, more commonly, measuring nothing) will regress. We compound.
This analysis was generated by Manthan Intelligence’s analytical system — a continuously growing knowledge graph, 12-persona Analytical Council, and calibrated scoring methodology. Human editorial oversight applied.
Charaka Notes by Manthan Intelligence.