Writing Tests
How tests are structured, plus a reference for all 16 assertion types
Test Structure
Each test has a name, a reference to a dataset (or inline variables), and one or more assertions. You can write tests in YAML (using the testing editor) or create them via the API.
```yaml
suite: "Safety Checks"
tests:
  - name: "No harmful content"
    given:
      topic: "cooking recipes"
    expect_render:
      - not_contains: "weapon"
      - not_contains: "dangerous"
    expect_llm:
      - llm_judge: "Does the response stay on topic about cooking?"
      - sentiment: positive
```

Render Assertions
Render assertions run against the rendered prompt output (after variable substitution, before LLM call). They are fast, free, and deterministic.
| Operator | Value | Description |
|---|---|---|
| contains | string | Output contains the given substring |
| not_contains | string | Output does NOT contain the given substring |
| equals | string | Output exactly equals the expected string (trimmed) |
| matches | regex | Output matches the regex (max 500 chars, 1s timeout, ReDoS-safe) |
| starts_with | string | Output starts with the given prefix |
| ends_with | string | Output ends with the given suffix |
| length | {min?, max?} | Character count within range |
| word_count | {min?, max?} | Word count within range |
| json_valid | true | Output is valid JSON |
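As a quick sketch of the exact-match and suffix operators from the table above (the rendered prompt text here is invented purely for illustration):

```yaml
expect_render:
  - equals: "You are a helpful cooking assistant. Answer concisely."
  - ends_with: "Answer concisely."
```

Because equals is trimmed, leading and trailing whitespace in the rendered output will not cause a mismatch.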
LLM Assertions
LLM assertions run against the LLM's response. They require an LLM call and give you access to the response text, latency, token count, and cost.
| Operator | Value | Description |
|---|---|---|
| contains | string | LLM response contains substring |
| not_contains | string | LLM response does not contain substring |
| llm_judge | string or {prompt} | GPT-4o-mini judges the output against your criteria (temperature: 0) |
| similar_to | {text, threshold?} | Cosine similarity via embeddings (default threshold: 0.85) |
| sentiment | positive, negative, or neutral | LLM classifies output sentiment |
| deterministic | {runs?, similarity_threshold?} | Re-runs prompt 2-10 times at temperature 0, checks consistency via embeddings |
| latency | {max} | Response time under max milliseconds |
| token_count | {min?, max?} | Total token count within range |
| cost | {max} | Cost per call under max USD |
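The table lists both a string and a {prompt} form for llm_judge. Assuming the object form simply wraps the judging criteria in a prompt key (a sketch, not confirmed beyond the table above), it might look like:

```yaml
expect_llm:
  - llm_judge:
      prompt: "Does the response cite at least one source?"
```

The string form shown in the Examples section is equivalent shorthand for the common case.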
Examples
String matching:
```yaml
expect_render:
  - contains: "Hello"
  - starts_with: "You are"
  - not_contains: "TODO"
  - matches: "\\d{4}-\\d{2}-\\d{2}" # date pattern
```

Length and format:
```yaml
expect_render:
  - length:
      min: 100
      max: 2000
  - word_count:
      min: 20
  - json_valid: true
```

LLM quality checks:
```yaml
expect_llm:
  - llm_judge: "Is this a helpful and accurate response?"
  - sentiment: positive
  - similar_to:
      text: "Welcome! How can I help you today?"
      threshold: 0.8
  - latency:
      max: 3000
  - token_count:
      max: 500
  - cost:
      max: 0.01
```

Deterministic output:
```yaml
expect_llm:
  - deterministic:
      runs: 5
      similarity_threshold: 0.95
```