Eval
Test prompts with eval suites
Prompt Evaluation
The eval system lets you write test suites for your prompts. Define assertions about rendered output, LLM responses, performance, and semantic similarity — then run them from the CLI or via PLP.
Quick Example
Create eval/tests/smoke.eval:
yaml
suite: "Welcome Email Quality"
config:
  target: prompts/welcome-email.pdk
  model: gpt-4o-mini
tests:
  - name: "Includes greeting"
    given:
      name: "Alice"
      tier: "pro"
    expect_render:
      - contains: "Alice"
      - length: { min: 50, max: 500 }
  - name: "Helpful response"
    given:
      name: "Bob"
      tier: "free"
    expect_llm:
      - llm_judge: "Is the response welcoming and professional?"
      - sentiment: positive
      - token_count: { max: 200 }
Running Evals
bash
echopdk eval eval/tests/smoke.eval
echopdk eval --filter "polite*" --reporter json
echopdk eval --record  # Record LLM responses as golden baseline
Assertion Types (16)
Text assertions (in expect_render):
- contains, not_contains, equals, matches (regex)
- starts_with, ends_with
- length, word_count, json_valid
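To make the semantics concrete, each text assertion can be read as a simple predicate over the rendered string. The Python below is an illustrative sketch, not echopdk's implementation. The helper name `check_text` is hypothetical, and the matching rules are assumptions: `matches` searches anywhere in the output, `length` counts characters, and `word_count` splits on whitespace.

```python
import json
import re


def check_text(output: str, assertion: dict) -> bool:
    """Evaluate one text assertion (hypothetical helper, not echopdk's API)."""
    (kind, arg), = assertion.items()  # each assertion is a single-key mapping
    if kind == "contains":
        return arg in output
    if kind == "not_contains":
        return arg not in output
    if kind == "equals":
        return output == arg
    if kind == "matches":
        # Assumption: the regex may match anywhere in the output.
        return re.search(arg, output) is not None
    if kind == "starts_with":
        return output.startswith(arg)
    if kind == "ends_with":
        return output.endswith(arg)
    if kind in ("length", "word_count"):
        # Assumption: length counts characters, word_count splits on whitespace.
        n = len(output) if kind == "length" else len(output.split())
        return arg.get("min", 0) <= n <= arg.get("max", float("inf"))
    if kind == "json_valid":
        try:
            json.loads(output)
            return True
        except ValueError:  # json.JSONDecodeError is a ValueError subclass
            return False
    raise ValueError(f"unknown assertion: {kind}")


rendered = "Hello Alice, welcome to the pro tier! We're glad to have you on board."
checks = [{"contains": "Alice"}, {"length": {"min": 50, "max": 500}}]
results = [check_text(rendered, c) for c in checks]
```

Both checks from the "Includes greeting" test pass for this sample string, since it mentions "Alice" and falls inside the 50 to 500 character range.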
Semantic assertions (in expect_llm):
- llm_judge: LLM answers a yes/no question about the response
- similar_to: Embedding similarity to a golden response
- sentiment: positive, negative, neutral, or helpful
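`similar_to` is described as embedding similarity to a golden response. One common way to score that is cosine similarity between the two embedding vectors. This page does not say whether echopdk uses cosine similarity, which embedding model it calls, or what threshold it applies, so the Python below is only a sketch with toy vectors and an assumed threshold of 0.8.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention for zero vectors, where similarity is undefined
    return dot / (norm_a * norm_b)


# Toy embeddings standing in for real model output (hypothetical values).
golden_vec = [0.2, 0.7, 0.1]
response_vec = [0.25, 0.65, 0.05]
THRESHOLD = 0.8  # assumed pass mark, not a documented default

passed = cosine_similarity(golden_vec, response_vec) >= THRESHOLD
```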
Performance assertions:
- latency: Max milliseconds
- token_count: Token count range
- cost: Max estimated USD
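The performance assertions all reduce to comparing a measured number against a bound. Here is a minimal Python sketch, assuming latency is wall-clock time around the model call and cost is token count multiplied by a per-token price. The helper names, the stand-in `fake_llm` function, and the price are hypothetical and do not come from echopdk.

```python
import time


def measure_latency_ms(fn, *args):
    """Run fn and return (result, elapsed wall-clock milliseconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, (time.perf_counter() - start) * 1000.0


def within(value, bounds):
    """Range check for bounds like {"min": 1, "max": 200}; both keys optional."""
    return bounds.get("min", float("-inf")) <= value <= bounds.get("max", float("inf"))


def fake_llm(prompt):
    # Stand-in for a real model call (hypothetical).
    return "Welcome aboard, Bob!"


response, latency_ms = measure_latency_ms(fake_llm, "Say hi to Bob")
token_count = len(response.split())  # crude proxy; real tokenizers differ
estimated_cost = token_count * 0.000002  # assumed USD per token, for illustration

ok = (
    within(latency_ms, {"max": 2000})
    and within(token_count, {"max": 200})
    and within(estimated_cost, {"max": 0.01})
)
```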
Datasets (.dset)
Dataset files provide reusable test data and golden responses:
yaml
name: "Customer Feedback Dataset"
golden:
  response: |
    The customer's concern has been acknowledged
    with a helpful solution provided.
  model: gpt-4o-mini
  recorded_at: "2024-02-12T10:00:00Z"
parameters:
  - name: case_1
    customer_msg: "I can't log in"
  - name: case_2
    customer_msg: "Your product broke my computer"
Reporters
- console (default) — Human-readable with colors
- json — Structured JSON for programmatic use
- junit — XML format for CI/CD integration
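In a CI job, the junit reporter's output can be post-processed to list failing tests. The Python sketch below assumes the file follows the common JUnit layout of `<testsuite>`, `<testcase>`, and `<failure>` elements. The exact XML that echopdk emits is not shown on this page, so the sample document and the `failed_tests` helper are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; echopdk's actual JUnit output may differ.
SAMPLE = """<testsuite name="Welcome Email Quality" tests="2" failures="1">
  <testcase name="Includes greeting"/>
  <testcase name="Helpful response">
    <failure message="token_count exceeded max"/>
  </testcase>
</testsuite>"""


def failed_tests(junit_xml: str) -> list[str]:
    """Return names of test cases that contain a <failure> or <error> child."""
    root = ET.fromstring(junit_xml)
    return [
        case.get("name", "")
        for case in root.iter("testcase")
        if case.find("failure") is not None or case.find("error") is not None
    ]


failures = failed_tests(SAMPLE)
```

Because `root.iter("testcase")` walks all descendants, the same helper works whether the root element is a single `<testsuite>` or a wrapping `<testsuites>`.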