Eval

Test prompts with eval suites

Prompt Evaluation

The eval system lets you write test suites for your prompts. Define assertions on rendered output, LLM responses, performance, and semantic similarity, then run the suites from the CLI or via PLP.

Quick Example

Create eval/tests/smoke.eval:

yaml

suite: "Welcome Email Quality"
config:
  target: prompts/welcome-email.pdk
  model: gpt-4o-mini
tests:
  - name: "Includes greeting"
    given:
      name: "Alice"
      tier: "pro"
    expect_render:
      - contains: "Alice"
      - length: { min: 50, max: 500 }

  - name: "Helpful response"
    given:
      name: "Bob"
      tier: "free"
    expect_llm:
      - llm_judge: "Is the response welcoming and professional?"
      - sentiment: positive
      - token_count: { max: 200 }

Running Evals

bash

echopdk eval eval/tests/smoke.eval  # Run a single suite file
echopdk eval --filter "polite*" --reporter json  # Run only what matches the filter; report as JSON
echopdk eval --record  # Record LLM responses as golden baseline

Assertion Types (16)

Text assertions (in expect_render):

contains, not_contains, equals, matches (regex)
starts_with, ends_with
length, word_count, json_valid
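
To make the text assertions concrete, here is a minimal Python sketch of how each one could be checked against a rendered prompt. It is not EchoPDK's implementation: the function name `check_text_assertion` is hypothetical, and the semantics are assumptions (`length` counts characters, `word_count` counts whitespace-separated words, `matches` may match anywhere in the text, and the value given to `json_valid` is ignored).

```python
import json
import re


def check_text_assertion(kind: str, expected, text: str) -> bool:
    """Return True if `text` satisfies one text assertion (assumed semantics)."""
    if kind == "contains":
        return expected in text
    if kind == "not_contains":
        return expected not in text
    if kind == "equals":
        return text == expected
    if kind == "matches":
        # Assumption: the regex may match anywhere in the text.
        return re.search(expected, text) is not None
    if kind == "starts_with":
        return text.startswith(expected)
    if kind == "ends_with":
        return text.endswith(expected)
    if kind in ("length", "word_count"):
        # Assumption: length is in characters, word_count in whitespace-separated words.
        size = len(text) if kind == "length" else len(text.split())
        low = expected.get("min", 0)
        high = expected.get("max", float("inf"))
        return low <= size <= high
    if kind == "json_valid":
        # Assumption: the assertion passes when the text parses as JSON.
        try:
            json.loads(text)
        except json.JSONDecodeError:
            return False
        return True
    raise ValueError(f"unknown assertion: {kind}")


# Mirrors the expect_render block of the Quick Example.
rendered = "Hello Alice, welcome to the pro tier! We're glad you joined us today."
assertions = [("contains", "Alice"), ("length", {"min": 50, "max": 500})]
results = [check_text_assertion(kind, value, rendered) for kind, value in assertions]
```

Each assertion is a pure function of the rendered string, so checks of this kind need no model call.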

Semantic assertions (in expect_llm):

llm_judge — LLM answers a yes/no question about the response
similar_to — Embedding similarity to a golden response
sentiment — positive, negative, neutral, or helpful
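
The idea behind `similar_to` can be sketched as a similarity score between two embedding vectors, compared against a threshold. The example below assumes cosine similarity and a threshold of 0.8; neither is confirmed by this page. The vectors are toy values standing in for real embeddings, and the Python function `similar_to` is a hypothetical helper.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Treat a zero vector as dissimilar to everything.
    return dot / (norm_a * norm_b)


def similar_to(response_vec: list[float], golden_vec: list[float],
               threshold: float = 0.8) -> bool:
    """Pass when the response embedding is close enough to the golden one.

    The 0.8 default is a hypothetical value chosen for illustration.
    """
    return cosine_similarity(response_vec, golden_vec) >= threshold


# Toy 3-dimensional "embeddings"; real ones have hundreds of dimensions.
golden = [0.9, 0.1, 0.3]
close = [0.8, 0.2, 0.3]
far = [-0.1, 0.9, -0.4]

close_passes = similar_to(close, golden)
far_passes = similar_to(far, golden)
```

A threshold-based check tolerates rewording that an `equals` assertion would reject, at the price of an embedding call per test.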

Performance assertions (the Quick Example uses token_count under expect_llm):

latency — Max milliseconds
token_count — Token count range
cost — Max estimated USD
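
The three performance checks reduce to simple arithmetic once latency and token counts have been measured. The sketch below shows one way to compute them. The helper names are hypothetical, the per-1,000-token prices are placeholders rather than real model pricing, and the lambda stands in for an actual LLM call.

```python
import time
from typing import Callable


def estimate_cost_usd(prompt_tokens: int, completion_tokens: int,
                      usd_per_1k_prompt: float, usd_per_1k_completion: float) -> float:
    """Estimated cost from token counts and per-1,000-token prices."""
    return (prompt_tokens / 1000 * usd_per_1k_prompt
            + completion_tokens / 1000 * usd_per_1k_completion)


def timed_call(fn: Callable[[], str]) -> tuple[str, float]:
    """Run fn and return its result with the elapsed time in milliseconds."""
    start = time.perf_counter()
    result = fn()
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return result, elapsed_ms


def in_range(value: float, bounds: dict) -> bool:
    """Check value against optional 'min' and 'max' keys, as in { max: 200 }."""
    return bounds.get("min", 0) <= value <= bounds.get("max", float("inf"))


# Stand-in for a model call; a real run would invoke the LLM here.
response, latency_ms = timed_call(lambda: "Welcome aboard, Bob!")

# Placeholder prices, not taken from any provider's price list.
cost = estimate_cost_usd(120, 80, usd_per_1k_prompt=0.001, usd_per_1k_completion=0.002)

checks = {
    "latency": latency_ms <= 2000,               # max milliseconds
    "token_count": in_range(80, {"max": 200}),   # completion tokens within range
    "cost": cost <= 0.01,                        # max estimated USD
}
```

With these inputs the estimated cost is 0.00028 USD, well under the 0.01 ceiling used here.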

Datasets (.dset)

Dataset files provide reusable test data and golden responses:

yaml

name: "Customer Feedback Dataset"
golden:
  response: |
    The customer's concern has been acknowledged
    with a helpful solution provided.
  model: gpt-4o-mini
  recorded_at: "2024-02-12T10:00:00Z"
parameters:
  - name: case_1
    customer_msg: "I can't log in"
  - name: case_2
    customer_msg: "Your product broke my computer"
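
One way to read this file is that each `parameters` entry becomes one test case, and all cases share the golden response. The sketch below works on the dataset as an already-parsed dictionary and expands it that way. This is an assumption about the semantics, not documented behaviour: `expand_cases` is a hypothetical helper, and treating `name` as a case label with the remaining keys as template variables is a guess.

```python
# The dataset above, as it would look after YAML parsing.
dataset = {
    "name": "Customer Feedback Dataset",
    "golden": {
        "response": "The customer's concern has been acknowledged\n"
                    "with a helpful solution provided.\n",
        "model": "gpt-4o-mini",
        "recorded_at": "2024-02-12T10:00:00Z",
    },
    "parameters": [
        {"name": "case_1", "customer_msg": "I can't log in"},
        {"name": "case_2", "customer_msg": "Your product broke my computer"},
    ],
}


def expand_cases(dataset: dict) -> list[dict]:
    """Turn each parameters entry into one test case sharing the golden response."""
    golden = dataset.get("golden", {}).get("response", "").strip()
    cases = []
    for entry in dataset.get("parameters", []):
        # Assumption: 'name' labels the case; every other key is a template variable.
        variables = {key: value for key, value in entry.items() if key != "name"}
        cases.append({"name": entry["name"], "given": variables, "golden": golden})
    return cases


cases = expand_cases(dataset)
for case in cases:
    print(case["name"], case["given"])
```

Keeping the inputs in a dataset file means one suite can run the same assertions over many inputs without repeating them in each test.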

Reporters

console (default) — Human-readable with colors
json — Structured JSON for programmatic use
junit — XML format for CI/CD integration
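
For readers unfamiliar with the JUnit format, the sketch below builds the commonly used `testsuite` / `testcase` / `failure` layout from a list of results, using Python's standard library. The exact XML that EchoPDK emits is not shown on this page, so the attribute set, the `to_junit_xml` helper, and the failure message are illustrative assumptions.

```python
import xml.etree.ElementTree as ET


def to_junit_xml(suite_name: str, results: list[dict]) -> str:
    """Serialize pass/fail results into a minimal JUnit-style XML string."""
    failures = sum(1 for r in results if not r["passed"])
    suite = ET.Element(
        "testsuite",
        name=suite_name,
        tests=str(len(results)),
        failures=str(failures),
    )
    for r in results:
        case = ET.SubElement(suite, "testcase", name=r["name"])
        if not r["passed"]:
            # A failed test carries a <failure> child with a short message.
            ET.SubElement(case, "failure", message=r.get("message", "assertion failed"))
    return ET.tostring(suite, encoding="unicode")


# Hypothetical results for the two tests in the Quick Example.
results = [
    {"name": "Includes greeting", "passed": True},
    {"name": "Helpful response", "passed": False, "message": "token_count above max"},
]
xml_report = to_junit_xml("Welcome Email Quality", results)
print(xml_report)
```

Most CI systems can ingest a file in this shape and show per-test pass/fail status.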