Writing Tests
How tests are structured, plus a reference for all 16 assertion types
Test Structure
Each test has a name, a reference to a dataset (or inline variables), and one or more assertions. You can write tests in YAML (using the testing editor) or create them via the API.
```yaml
suite: "Safety Checks"
tests:
  - name: "No harmful content"
    given:
      topic: "cooking recipes"
    expect_render:
      - not_contains: "weapon"
      - not_contains: "dangerous"
    expect_llm:
      - llm_judge: "Does the response stay on topic about cooking?"
      - sentiment: positive
```

Render Assertions
Render assertions run against the rendered prompt output (after variable substitution, before LLM call). They are fast, free, and deterministic.
| Operator | Value | Description |
|---|---|---|
| contains | string | Output contains the given substring |
| not_contains | string | Output does NOT contain the given substring |
| equals | string | Output exactly equals the expected string (trimmed) |
| matches | regex | Output matches the regex (max 500 chars, 1s timeout, ReDoS-safe) |
| starts_with | string | Output starts with the given prefix |
| ends_with | string | Output ends with the given suffix |
| length | {min?, max?} | Character count within range |
| word_count | {min?, max?} | Word count within range |
| json_valid | true | Output is valid JSON |
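As a quick sketch of the exact-match and suffix operators from the table above (the rendered prompt text here is invented purely for illustration):

```yaml
expect_render:
  - equals: "You are a helpful cooking assistant. Answer concisely."
  - ends_with: "Answer concisely."
```

Because equals is trimmed, leading and trailing whitespace in the rendered output will not cause a mismatch.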
LLM Assertions
LLM assertions run against the LLM's response. They require an LLM call and give you access to the response text, latency, token count, and cost.
| Operator | Value | Description |
|---|---|---|
| contains | string | LLM response contains substring |
| not_contains | string | LLM response does not contain substring |
| llm_judge | string or {prompt} | GPT-4o-mini judges the output against your criteria (temperature: 0) |
| similar_to | {text, threshold?} | Cosine similarity via embeddings (default threshold: 0.85) |
| sentiment | positive, negative, or neutral | LLM classifies output sentiment |
| deterministic | {runs?, similarity_threshold?} | Re-runs prompt 2-10 times at temperature 0, checks consistency via embeddings |
| latency | {max} | Response time under max milliseconds |
| token_count | {min?, max?} | Total token count within range |
| cost | {max} | Cost per call under max USD |
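The table lists both a string and a {prompt} form for llm_judge. Assuming the object form simply wraps the judging criteria in a prompt key (a sketch, not confirmed beyond the table above), it might look like:

```yaml
expect_llm:
  - llm_judge:
      prompt: "Does the response cite at least one source?"
```

The string form shown in the Examples section is equivalent shorthand for the common case.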
Examples
String matching:
```yaml
expect_render:
  - contains: "Hello"
  - starts_with: "You are"
  - not_contains: "TODO"
  - matches: "\\d{4}-\\d{2}-\\d{2}" # date pattern
```

Length and format:
```yaml
expect_render:
  - length:
      min: 100
      max: 2000
  - word_count:
      min: 20
  - json_valid: true
```

LLM quality checks:
```yaml
expect_llm:
  - llm_judge: "Is this a helpful and accurate response?"
  - sentiment: positive
  - similar_to:
      text: "Welcome! How can I help you today?"
      threshold: 0.8
  - latency:
      max: 3000
  - token_count:
      max: 500
  - cost:
      max: 0.01
```

Deterministic output:
```yaml
expect_llm:
  - deterministic:
      runs: 5
      similarity_threshold: 0.95
```