Writing Tests

Tests, assertions, and all 16 assertion types

Test Structure

Each test has a name, a reference to a dataset (or inline variables), and one or more assertions. You can write tests in YAML (using the testing editor) or create them via the API.

```yaml
suite: "Safety Checks"
tests:
  - name: "No harmful content"
    given:
      topic: "cooking recipes"
    expect_render:
      - not_contains: "weapon"
      - not_contains: "dangerous"
    expect_llm:
      - llm_judge: "Does the response stay on topic about cooking?"
      - sentiment: positive
```
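
The structural rule above (a name, a dataset reference or inline variables, and at least one assertion) can be sketched as a small check. This is an illustrative Python sketch, not the tool's own validator: `validate_test` is a hypothetical helper, and the `dataset` key is an assumed name for the dataset reference, which the example above does not show.

```python
from typing import Any

# "given", "expect_render" and "expect_llm" come from the YAML example;
# "dataset" is an assumed key name for the dataset reference.
ASSERTION_KEYS = ("expect_render", "expect_llm")


def validate_test(test: dict[str, Any]) -> list[str]:
    """Return a list of structural problems found in one test definition."""
    problems = []
    if not test.get("name"):
        problems.append("missing name")
    if "given" not in test and "dataset" not in test:
        problems.append("needs inline variables (given) or a dataset")
    assertion_count = sum(len(test.get(key) or []) for key in ASSERTION_KEYS)
    if assertion_count == 0:
        problems.append("needs at least one assertion")
    return problems


example = {
    "name": "No harmful content",
    "given": {"topic": "cooking recipes"},
    "expect_render": [{"not_contains": "weapon"}, {"not_contains": "dangerous"}],
    "expect_llm": [
        {"llm_judge": "Does the response stay on topic about cooking?"},
        {"sentiment": "positive"},
    ],
}
print(validate_test(example))            # no problems for the example test
print(validate_test({"name": "Empty"}))  # reports the missing inputs and assertions
```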

Render Assertions

Render assertions run against the rendered prompt output (after variable substitution, before the LLM call). Because no model is invoked, they are fast, free, and deterministic.

| Operator | Value | Description |
| --- | --- | --- |
| contains | string | Output contains the given substring |
| not_contains | string | Output does NOT contain the given substring |
| equals | string | Output exactly equals the expected string (trimmed) |
| matches | regex | Output matches the regex (max 500 chars, 1s timeout, ReDoS-safe) |
| starts_with | string | Output starts with the given prefix |
| ends_with | string | Output ends with the given suffix |
| length | {min?, max?} | Character count within range |
| word_count | {min?, max?} | Word count within range |
| json_valid | true | Output is valid JSON |
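
To make the operator semantics concrete, here is a rough Python sketch of how each render assertion could be evaluated against a rendered prompt. It is not the product's implementation. `check_render` is a hypothetical function, and several details are assumptions: both sides of `equals` are trimmed, `matches` accepts a match anywhere in the output, the 500-character limit is applied to the pattern, words are split on whitespace, and the 1s regex timeout is not modeled.

```python
import json
import re


def check_render(output: str, operator: str, value) -> bool:
    """Evaluate one render assertion against the rendered prompt text."""
    if operator == "contains":
        return value in output
    if operator == "not_contains":
        return value not in output
    if operator == "equals":
        # Assumption: "trimmed" means surrounding whitespace is ignored.
        return output.strip() == value.strip()
    if operator == "matches":
        # Assumption: the limit applies to the pattern; the timeout is omitted.
        return len(value) <= 500 and re.search(value, output) is not None
    if operator == "starts_with":
        return output.startswith(value)
    if operator == "ends_with":
        return output.endswith(value)
    if operator in ("length", "word_count"):
        n = len(output) if operator == "length" else len(output.split())
        return value.get("min", 0) <= n <= value.get("max", float("inf"))
    if operator == "json_valid":
        # Only the documented value `true` is handled here.
        try:
            json.loads(output)
        except json.JSONDecodeError:
            return False
        return True
    raise ValueError(f"unknown render operator: {operator}")


rendered = "You are a helpful assistant. Today is 2024-05-01."
print(check_render(rendered, "starts_with", "You are"))
print(check_render(rendered, "matches", r"\d{4}-\d{2}-\d{2}"))
print(check_render(rendered, "word_count", {"min": 5, "max": 20}))
print(check_render(rendered, "json_valid", True))
```

Every branch is a pure string operation, which is why these checks need no model call.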

LLM Assertions

LLM assertions run against the LLM's response. They require an LLM call and give you access to the response text, latency, token count, and cost.

| Operator | Value | Description |
| --- | --- | --- |
| contains | string | LLM response contains substring |
| not_contains | string | LLM response does not contain substring |
| llm_judge | string or {prompt} | GPT-4o-mini judges the output against your criteria (temperature: 0) |
| similar_to | {text, threshold?} | Cosine similarity via embeddings (default threshold: 0.85) |
| sentiment | positive, negative, or neutral | LLM classifies output sentiment |
| deterministic | {runs?, similarity_threshold?} | Re-runs prompt 2-10 times at temperature: 0, checks consistency via embeddings |
| latency | {max} | Response time under max milliseconds |
| token_count | {min?, max?} | Total token count within range |
| cost | {max} | Cost per call under max USD |
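
For readers unfamiliar with the metric behind `similar_to`, the sketch below computes cosine similarity between two embedding vectors and compares it with the threshold. It is a minimal illustration under stated assumptions: the vectors are toy values rather than real embeddings, `similar_to` here is a hypothetical helper, and treating a score equal to the threshold as a pass is a guess.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # similarity is undefined for a zero vector; treat as no match
    return dot / (norm_a * norm_b)


def similar_to(response_vec: list[float], expected_vec: list[float],
               threshold: float = 0.85) -> bool:
    """Pass when the similarity reaches the threshold (0.85 is the documented default)."""
    return cosine_similarity(response_vec, expected_vec) >= threshold


# Toy 3-dimensional vectors standing in for real embeddings.
response = [3.0, 4.0, 0.0]
expected = [4.0, 3.0, 0.0]
print(round(cosine_similarity(response, expected), 2))  # 0.96
print(similar_to(response, expected))                   # True
print(similar_to(response, expected, threshold=0.97))   # False
```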

Examples

String matching:

```yaml
expect_render:
  - contains: "Hello"
  - starts_with: "You are"
  - not_contains: "TODO"
  - matches: "\\d{4}-\\d{2}-\\d{2}"  # date pattern
```

Length and format:

```yaml
expect_render:
  - length:
      min: 100
      max: 2000
  - word_count:
      min: 20
  - json_valid: true
```

LLM quality checks:

```yaml
expect_llm:
  - llm_judge: "Is this a helpful and accurate response?"
  - sentiment: positive
  - similar_to:
      text: "Welcome! How can I help you today?"
      threshold: 0.8
  - latency:
      max: 3000
  - token_count:
      max: 500
  - cost:
      max: 0.01
```
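
The three numeric checks in this example (`latency`, `token_count`, `cost`) all reduce to comparing a measured value against optional bounds. A small Python sketch of that idea follows. `LLMResult`, its field names, and `check_limits` are hypothetical, and treating the bounds as inclusive is an assumption.

```python
from dataclasses import dataclass


@dataclass
class LLMResult:
    """Hypothetical record of one LLM call; field names are illustrative."""
    text: str
    latency_ms: float
    total_tokens: int
    cost_usd: float


def in_range(value: float, bounds: dict) -> bool:
    """True when value lies within the optional min/max bounds (inclusive)."""
    low = bounds.get("min", float("-inf"))
    high = bounds.get("max", float("inf"))
    return low <= value <= high


def check_limits(result: LLMResult, assertions: dict) -> dict[str, bool]:
    """Map each numeric assertion name to a pass/fail flag."""
    measured = {
        "latency": result.latency_ms,      # milliseconds
        "token_count": result.total_tokens,
        "cost": result.cost_usd,           # USD per call
    }
    return {name: in_range(measured[name], bounds)
            for name, bounds in assertions.items()}


result = LLMResult("Welcome! How can I help?", latency_ms=1840.0,
                   total_tokens=62, cost_usd=0.0004)
limits = {"latency": {"max": 3000}, "token_count": {"max": 500}, "cost": {"max": 0.01}}
print(check_limits(result, limits))
```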

Deterministic output:

```yaml
expect_llm:
  - deterministic:
      runs: 5
      similarity_threshold: 0.95
```
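
One plausible reading of the consistency check is that every pair of runs must be at least `similarity_threshold` similar. The Python sketch below implements that reading on toy vectors. The pairwise comparison is an assumption (the tool may compare runs differently, for example against the first run only), and `is_deterministic` is a hypothetical helper.

```python
import math
from itertools import combinations


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def is_deterministic(run_embeddings: list[list[float]],
                     similarity_threshold: float) -> bool:
    """Pass when every pair of runs is at least `similarity_threshold` similar."""
    if not 2 <= len(run_embeddings) <= 10:
        raise ValueError("runs must be between 2 and 10")
    return all(cosine(a, b) >= similarity_threshold
               for a, b in combinations(run_embeddings, 2))


# Toy 2-dimensional vectors standing in for the embeddings of each run.
stable_runs = [[1.0, 0.0], [0.99, 0.05], [1.0, 0.01]]
drifting_runs = stable_runs + [[0.0, 1.0]]
print(is_deterministic(stable_runs, 0.95))    # True
print(is_deterministic(drifting_runs, 0.95))  # False
```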