AI Evals, Explained Interactively

01

What is an eval

"Give an AI an input, apply grading logic to its output, measure success." — Anthropic Engineering

An eval is the AI equivalent of a unit test — except instead of checking one fixed output, it grades behavior across many runs. Click Step below to watch the full loop happen once: the agent moves, the trace records it, the grader fires.
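As a rough sketch of that loop in code: the snippet below runs one input through an agent several times, grades each output, and reports the pass rate. `toy_agent` and `exact_match_grader` are hypothetical stand-ins invented for illustration (a real agent would call a model), and the agent is made deliberately flaky so the pass rate isn't trivially 100%.

```python
import itertools
from typing import Callable

def run_eval(agent: Callable[[str], str],
             grader: Callable[[str], bool],
             prompt: str,
             n_trials: int) -> float:
    """Run the agent n_trials times on one input and return the pass rate."""
    passes = 0
    for _ in range(n_trials):
        output = agent(prompt)   # give the AI an input
        if grader(output):       # apply grading logic to its output
            passes += 1
    return passes / n_trials     # measure success

# Hypothetical flaky agent: answers correctly 3 times out of every 4.
_answers = itertools.cycle(["4", "4", "5", "4"])

def toy_agent(prompt: str) -> str:
    return next(_answers)

def exact_match_grader(output: str) -> bool:
    return output.strip() == "4"

pass_rate = run_eval(toy_agent, exact_match_grader, "What is 2 + 2?", n_trials=8)
print(pass_rate)  # prints 0.75
```

A single run of this agent would have passed, which is exactly why the loop repeats.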

02

Eval anatomy

Task · Trial · Grader · Trace · Outcome · Harness · Suite

The word "eval" gets used to mean seven different things depending on who's talking. Here they are, defined and nested in the order they actually relate to each other. Hover over any term to see its definition; no glossary needed.
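One way to picture the nesting is as data structures. The sketch below is an assumption about how the seven terms fit together, not a canonical schema: a suite holds tasks, each task is attempted in trials, each trial leaves a trace and an outcome, graders score trials, and the harness runs the whole thing. All field names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Trial:
    trace: list[str]   # step-by-step record of what the agent did
    outcome: str       # final result at the end of the attempt

# A grader scores one trial (here: simple pass/fail).
Grader = Callable[[Trial], bool]

@dataclass
class Task:
    prompt: str
    graders: list[Grader]
    trials: list[Trial] = field(default_factory=list)

@dataclass
class Suite:
    name: str
    tasks: list[Task]

def harness(suite: Suite, agent: Callable[[str], Trial], n_trials: int) -> dict[str, float]:
    """Run every task n_trials times and report the per-task pass rate."""
    report = {}
    for task in suite.tasks:
        passed = 0
        for _ in range(n_trials):
            trial = agent(task.prompt)
            task.trials.append(trial)
            if all(grade(trial) for grade in task.graders):
                passed += 1
        report[task.prompt] = passed / n_trials
    return report

# Hypothetical agent that just upper-cases its input.
def echo_agent(prompt: str) -> Trial:
    return Trial(trace=[f"read: {prompt}", "answered"], outcome=prompt.upper())

suite = Suite("demo", [Task("hello", [lambda t: t.outcome == "HELLO"])])
report = harness(suite, echo_agent, n_trials=3)
print(report)  # prints {'hello': 1.0}
```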

03

Single-turn vs multi-turn

Single-turn: grade one response. Multi-turn: grade an entire trajectory.

Most people's mental model of an eval is single-turn: send a prompt, check the reply. But agents operate across many turns, and grading a full trajectory means judging the steps along the way, not just one reply. Toggle between the two modes below to feel the difference.
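To make the contrast concrete, here is a minimal sketch of both kinds of grader. The trajectory criteria (reach the goal state, no errored steps, stay under a step budget) are illustrative assumptions; real multi-turn graders choose their own criteria, and the `state` and `error` keys are hypothetical.

```python
def grade_single_turn(reply: str, expected: str) -> bool:
    """Single-turn: one reply, one check."""
    return reply.strip() == expected

def grade_trajectory(steps: list[dict], goal_state: str, max_steps: int) -> bool:
    """Multi-turn: judge the whole sequence of steps, not just the last one."""
    if not steps:
        return False
    reached_goal = steps[-1]["state"] == goal_state
    no_errors = all(not step.get("error") for step in steps)
    within_budget = len(steps) <= max_steps
    return reached_goal and no_errors and within_budget

good_run = [{"state": "cart_open"}, {"state": "item_added"}, {"state": "checked_out"}]
bad_run = [{"state": "cart_open", "error": "timeout"}, {"state": "checked_out"}]

print(grade_single_turn("Paris", "Paris"))            # prints True
print(grade_trajectory(good_run, "checked_out", 5))   # prints True
print(grade_trajectory(bad_run, "checked_out", 5))    # prints False
```

Note that `bad_run` ends in the right state and would pass a check on the final step alone; only the trajectory grader catches the error along the way.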

04

Three grader types

Code-based · Model-based · Human — each with different speed, cost, and nuance.

There's no single right way to grade an AI. The same trace can be evaluated three different ways, each with different tradeoffs. Below is the same 5-step trace run through all three. Click a column to expand its tradeoff breakdown.
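Here is a hedged sketch of what the three grader types might look like as functions over the same trace. Everything here is hypothetical: the trace contents, the rubric wording, and the `judge` callable, which stands in for a real model call and is stubbed out so the example runs offline.

```python
import re
from typing import Callable, Optional

trace = ["open repo", "edit file", "run tests", "3 tests passed", "open PR"]

def code_grader(trace: list[str]) -> bool:
    """Code-based: fast and cheap, but only checks what you can pattern-match."""
    return any(re.search(r"\btests passed\b", step) for step in trace)

def model_grader(trace: list[str], judge: Callable[[str], str]) -> bool:
    """Model-based: asks another model to apply a rubric to the trace."""
    rubric = "Answer PASS or FAIL: did the agent complete the task?\n"
    verdict = judge(rubric + "\n".join(trace))
    return verdict.strip().upper().startswith("PASS")

def human_grader(trace_id: str, labels: dict[str, bool]) -> Optional[bool]:
    """Human: look up a reviewer's recorded verdict; None means not reviewed yet."""
    return labels.get(trace_id)

def stub_judge(prompt: str) -> str:
    # Stand-in for a real model call.
    return "PASS"

print(code_grader(trace))                           # prints True
print(model_grader(trace, stub_judge))              # prints True
print(human_grader("trace-1", {"trace-1": False}))  # prints False
```

The human verdict disagreeing with the other two is deliberate: the three graders can reach different conclusions about the same trace.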

05

pass@k

pass@k = probability the agent succeeds in at least one of k trials.

Here's the trap: you run your AI agent once, it works, you ship it. But non-deterministic systems can't be judged on a single run. Drag the slider and watch what happens to each agent's success rate as you give it more tries.
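The arithmetic behind the slider can be sketched in a few lines. This assumes every trial is independent and succeeds with the same probability `p`, a simplification real agents may not satisfy. Under that assumption, the chance of failing all k trials is (1 - p)^k, so pass@k is one minus that.

```python
def pass_at_k(p: float, k: int) -> float:
    """Probability of at least one success in k independent trials,
    each succeeding with probability p."""
    return 1.0 - (1.0 - p) ** k

# An agent that succeeds only 30% of the time per try:
for k in (1, 3, 10):
    print(k, round(pass_at_k(0.3, k), 3))
# prints:
# 1 0.3
# 3 0.657
# 10 0.972
```

The same agent that fails 70% of single runs succeeds at least once in about 97% of ten-try batches, so which number you report matters.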

06

Capability vs regression

Capability evals target what your agent can't do yet. Once it can, they become regression evals.

Evals aren't static. As your agent improves, the tests that used to expose failures get promoted into a regression suite — a safety net that protects against sliding backward. Try it below.
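A minimal sketch of that promotion step, assuming each capability task has a measured pass rate and that "can do it now" means clearing some threshold. The 0.95 cutoff and the task names are hypothetical choices for illustration.

```python
def promote(capability: dict[str, float],
            regression: set[str],
            threshold: float = 0.95) -> tuple[dict[str, float], set[str]]:
    """Move tasks whose pass rate meets the threshold into the regression suite."""
    promoted = {name for name, rate in capability.items() if rate >= threshold}
    remaining = {name: rate for name, rate in capability.items()
                 if name not in promoted}
    return remaining, regression | promoted

capability_evals = {"refund flow": 1.0, "multi-file refactor": 0.4}
regression_suite = {"login"}

capability_evals, regression_suite = promote(capability_evals, regression_suite)
print(capability_evals)           # prints {'multi-file refactor': 0.4}
print(sorted(regression_suite))   # prints ['login', 'refund flow']
```

From here on, a drop in the pass rate for "refund flow" is a regression to fix, while "multi-file refactor" stays a target to climb toward.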