Evals Overview
What evals are and how they work
What Are Evals?
Evals let you define automated tests for your prompts. You write assertions that describe what “correct output” looks like, run them against any prompt version, and get a pass/fail score. Think of it as unit testing for prompts.
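To make "unit testing for prompts" concrete, here is a minimal sketch of a single assertion run against a rendered prompt. The function names and the template syntax are illustrative assumptions, not this product's actual API.

```python
# Hypothetical sketch of one render-time assertion.
# render_prompt and check_contains are illustrative names, not the real API.

def render_prompt(template: str, variables: dict) -> str:
    """Fill a prompt template with input variables from a dataset row."""
    return template.format(**variables)

def check_contains(output: str, expected: str) -> bool:
    """A 'contains' assertion: pass if the expected substring appears."""
    return expected in output

template = "Summarize the following text in a {tone} tone: {text}"
rendered = render_prompt(template, {"tone": "formal", "text": "Q3 revenue rose 12%."})

passed = check_contains(rendered, "formal tone")  # True: the assertion passes
```

Running the suite is then just applying every assertion to every rendered case and tallying pass/fail into a score.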
Why Test Prompts?
Prompts are code. When you change a prompt, you need to know whether the change improved things or introduced regressions. Without evals, you are manually spot-checking outputs and hoping for the best. With evals, you get:
- Regression detection - Compare runs side by side to catch regressions before they ship
- Confidence to ship - Quality gates block deploys that fall below your pass-rate threshold
- Data-driven decisions - A/B test two prompt versions with real statistical analysis
- Trend tracking - See how your prompt quality evolves over time with run history charts
The Mental Model
Evals are organized into three layers:
- Datasets - Collections of input variables (the “given” data for each test case)
- Test Suites - Groups of tests, each referencing a dataset and defining assertions
- Runs - An execution of a suite against a specific prompt version, producing pass/fail results
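The three layers can be sketched as a small data model. The class and field names below are assumptions for illustration; the product's actual schema may differ.

```python
from dataclasses import dataclass, field

# Illustrative data model for the three layers (names are assumptions).

@dataclass
class Dataset:
    name: str
    cases: list[dict]  # each dict holds the input variables for one test case

@dataclass
class TestSuite:
    name: str
    dataset: Dataset
    assertions: list  # checks applied to each case's output

@dataclass
class Run:
    suite: TestSuite
    prompt_version: str
    results: list[bool] = field(default_factory=list)  # pass/fail per case

    @property
    def pass_rate(self) -> float:
        """Fraction of cases that passed; this is the score a quality gate checks."""
        return sum(self.results) / len(self.results) if self.results else 0.0
```

A run is always pinned to one prompt version, which is what makes side-by-side comparison and regression detection possible: same suite, same dataset, two versions, two pass rates.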
Two Assertion Phases
Each test can have two types of assertions:
- Render assertions (expect_render) - Run against the rendered prompt output before it is sent to an LLM. Fast and free.
- LLM assertions (expect_llm) - Run against the LLM's response. Requires an LLM call but lets you test actual model behavior.
You can use both in the same test. Render assertions run first; if they all pass, the LLM is called and LLM assertions run on the response.
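The two-phase flow above can be sketched as a short gate: render assertions run first, and only if they all pass is the LLM called. `call_llm` is a stand-in for a real model call, and `run_test` is a hypothetical helper, not this product's API.

```python
# Sketch of the two-phase assertion flow (all names are illustrative).

def call_llm(prompt: str) -> str:
    """Stand-in for a real model call; in practice this hits an LLM API."""
    return f"Model response to: {prompt}"

def run_test(rendered_prompt, expect_render, expect_llm):
    # Phase 1: render assertions - fast and free, no LLM call.
    if not all(check(rendered_prompt) for check in expect_render):
        return {"passed": False, "llm_called": False}
    # Phase 2: only reached if every render assertion passed.
    response = call_llm(rendered_prompt)
    passed = all(check(response) for check in expect_llm)
    return {"passed": passed, "llm_called": True}

result = run_test(
    "Translate to French: hello",
    expect_render=[lambda p: "French" in p],
    expect_llm=[lambda r: len(r) > 0],
)
```

Gating the LLM call on the render phase is what keeps failing cases cheap: a bad render fails immediately without spending tokens.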
Next Steps
- Test Suites - Create and manage suites
- Writing Tests - All 14 assertion types with examples
- Datasets - Managing test data
- Running Evals - Executing runs and viewing results