Evals Overview
What evals are and how they work
What Are Evals?
Evals let you define automated tests for your prompts. You write assertions that describe what “correct output” looks like, run them against any prompt version, and get a pass/fail score. Think of it as unit testing for prompts.
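To make "unit testing for prompts" concrete, here is a minimal sketch of a single assertion run against a rendered prompt. The function names and the template syntax are illustrative assumptions, not this product's actual API.

```python
# Hypothetical sketch of one render-time assertion.
# render_prompt and check_contains are illustrative names, not the real API.

def render_prompt(template: str, variables: dict) -> str:
    """Fill a prompt template with input variables from a dataset row."""
    return template.format(**variables)

def check_contains(output: str, expected: str) -> bool:
    """A 'contains' assertion: pass if the expected substring appears."""
    return expected in output

template = "Summarize the following text in a {tone} tone: {text}"
rendered = render_prompt(template, {"tone": "formal", "text": "Q3 revenue rose 12%."})

passed = check_contains(rendered, "formal tone")  # True: the assertion passes
```

Running the suite is then just applying every assertion to every rendered case and tallying pass/fail into a score.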
Why Test Prompts?
Prompts are code. When you change a prompt, you need to know whether the change improved things or introduced regressions. Without evals, you are manually spot-checking outputs and hoping for the best. With evals, you get:
- Regression detection - Compare runs side by side to catch regressions before they ship
- Confidence to ship - Quality gates block deploys that fall below your pass-rate threshold
- Data-driven decisions - A/B test two prompt versions with real statistical analysis
- Trend tracking - See how your prompt quality evolves over time with run history charts
The Mental Model
Evals are organized into three layers:
- Datasets - Collections of input variables (the “given” data for each test case)
- Test Suites - Groups of tests, each referencing a dataset and defining assertions
- Runs - An execution of a suite against a specific prompt version, producing pass/fail results
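The three layers can be sketched as a small data model. The class and field names below are assumptions for illustration; the product's actual schema may differ.

```python
from dataclasses import dataclass, field

# Illustrative data model for the three layers (names are assumptions).

@dataclass
class Dataset:
    name: str
    cases: list[dict]  # each dict holds the input variables for one test case

@dataclass
class TestSuite:
    name: str
    dataset: Dataset
    assertions: list  # checks applied to each case's output

@dataclass
class Run:
    suite: TestSuite
    prompt_version: str
    results: list[bool] = field(default_factory=list)  # pass/fail per case

    @property
    def pass_rate(self) -> float:
        """Fraction of cases that passed; this is the score a quality gate checks."""
        return sum(self.results) / len(self.results) if self.results else 0.0
```

A run is always pinned to one prompt version, which is what makes side-by-side comparison and regression detection possible: same suite, same dataset, two versions, two pass rates.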
Two Assertion Phases
Each test can have two types of assertions:
- Render assertions (expect_render) - Run against the rendered prompt output before it is sent to an LLM. Fast and free.
- LLM assertions (expect_llm) - Run against the LLM's response. Requires an LLM call but lets you test actual model behavior.
You can use both in the same test. Render assertions run first; if they all pass, the LLM is called and LLM assertions run on the response.
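The two-phase flow above can be sketched as a short gate: render assertions run first, and only if they all pass is the LLM called. `call_llm` is a stand-in for a real model call, and `run_test` is a hypothetical helper, not this product's API.

```python
# Sketch of the two-phase assertion flow (all names are illustrative).

def call_llm(prompt: str) -> str:
    """Stand-in for a real model call; in practice this hits an LLM API."""
    return f"Model response to: {prompt}"

def run_test(rendered_prompt, expect_render, expect_llm):
    # Phase 1: render assertions - fast and free, no LLM call.
    if not all(check(rendered_prompt) for check in expect_render):
        return {"passed": False, "llm_called": False}
    # Phase 2: only reached if every render assertion passed.
    response = call_llm(rendered_prompt)
    passed = all(check(response) for check in expect_llm)
    return {"passed": passed, "llm_called": True}

result = run_test(
    "Translate to French: hello",
    expect_render=[lambda p: "French" in p],
    expect_llm=[lambda r: len(r) > 0],
)
```

Gating the LLM call on the render phase is what keeps failing cases cheap: a bad render fails immediately without spending tokens.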
Next Steps
- Test Suites - Create and manage suites
- Writing Tests - All 14 assertion types with examples
- Datasets - Managing test data
- Running Evals - Executing runs and viewing results