Running Evals
Run suites, view results, and track trends
Running a Suite
To run a suite, you need a suite ID and a prompt version number, plus an optional model configuration. Click Run in the UI or call the API:
POST /api/prompts/{promptId}/eval/run

```json
{
  "suiteId": 1,
  "versionNo": 3,
  "modelConfig": {
    "model": "gpt-4o",
    "temperature": 0.7
  }
}
```

You can also pass testIds to run a subset of tests instead of the full suite.
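As a sketch of what a client call might look like, the helper below builds the request body shown above, including an optional testIds subset. The function name, the concrete IDs, and the commented-out fetch call are illustrative, not part of the documented API:

```typescript
// Hypothetical helper for building the run-request body described above.
interface ModelConfig {
  model: string;
  temperature?: number;
}

interface RunRequestBody {
  suiteId: number;
  versionNo: number;
  modelConfig?: ModelConfig;
  testIds?: number[]; // omit to run the full suite
}

function buildRunBody(
  suiteId: number,
  versionNo: number,
  opts: { modelConfig?: ModelConfig; testIds?: number[] } = {},
): RunRequestBody {
  return { suiteId, versionNo, ...opts };
}

const body = buildRunBody(1, 3, {
  modelConfig: { model: "gpt-4o", temperature: 0.7 },
  testIds: [4, 7], // run only these two tests
});

// Then POST it to the run endpoint (promptId assumed known):
// await fetch(`/api/prompts/${promptId}/eval/run`, {
//   method: "POST",
//   headers: { "Content-Type": "application/json" },
//   body: JSON.stringify(body),
// });
```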
Note: LLM assertions currently support OpenAI models only. Render assertions work regardless of model provider since they run before the LLM call.
How Runs Execute
Runs execute asynchronously. The API returns a pending run immediately, and the backend processes tests in the background. For each test in the suite:
- Resolve the dataset cases (or use inline variables)
- Render the prompt template with the case's variables
- Run render assertions against the rendered output
- If LLM assertions exist, call the LLM with the rendered prompt
- Run LLM assertions against the response
- Record pass/fail/error status for each test result
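The steps above can be sketched as a single per-test function. The types, the template syntax, and the injected callLlm client here are hypothetical stand-ins for the real implementation; the point is the ordering, in particular that render assertions run before any LLM call:

```typescript
// Hypothetical sketch of the per-test pipeline described above.
type Status = "pass" | "fail" | "error";

interface TestCase {
  variables: Record<string, string>;
  renderAssertions: ((rendered: string) => boolean)[];
  llmAssertions: ((response: string) => boolean)[];
}

// Step 2: render the template with the case's variables
// (assumes a simple {{name}} placeholder syntax for illustration).
function render(template: string, vars: Record<string, string>): string {
  return template.replace(/\{\{(\w+)\}\}/g, (_, k: string) => vars[k] ?? "");
}

async function runTest(
  template: string,
  test: TestCase,
  callLlm: (prompt: string) => Promise<string>, // injected LLM client
): Promise<Status> {
  try {
    const rendered = render(template, test.variables);
    // Step 3: render assertions run before any LLM call.
    if (!test.renderAssertions.every((a) => a(rendered))) return "fail";
    // Steps 4-5: the LLM is only called if LLM assertions exist.
    if (test.llmAssertions.length > 0) {
      const response = await callLlm(rendered);
      if (!test.llmAssertions.every((a) => a(response))) return "fail";
    }
    return "pass";
  } catch {
    return "error"; // step 6: unexpected failures are recorded as errors
  }
}
```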
LLM responses are cached (by input hash) across runs to avoid redundant API calls. The cache is scoped per prompt so different prompts never share cached responses.
Viewing Results
Each run produces a summary with total, passed, failed, and errored counts, plus a percentage score. The results panel shows:
- Per-test pass/fail status with assertion details
- Rendered output for each test case
- LLM response text, latency, token count, and cost (when applicable)
- Error messages for any failed or errored assertions
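The summary counts and percentage score can be derived directly from the per-test statuses; this small function is illustrative, not the actual implementation:

```typescript
// Illustrative computation of the run summary described above.
type Status = "pass" | "fail" | "error";

function summarize(results: Status[]) {
  const passed = results.filter((s) => s === "pass").length;
  const failed = results.filter((s) => s === "fail").length;
  const errored = results.filter((s) => s === "error").length;
  const total = results.length;
  const score = total === 0 ? 0 : Math.round((passed / total) * 100);
  return { total, passed, failed, errored, score };
}

// e.g. summarize(["pass", "pass", "fail", "error"])
// → { total: 4, passed: 2, failed: 1, errored: 1, score: 50 }
```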
Async Execution
When you call POST /eval/run, the server creates a run in pending status and returns immediately. The actual test execution happens asynchronously in the background. Poll the run status to track progress:
GET /api/prompts/{promptId}/eval/runs/{runId}/status

```
// Returns:
{ "status": "pending" | "running" | "completed" | "failed" }
```

Response Caching
LLM responses are cached at two levels to avoid redundant API calls:
- Redis cache - Cross-run dedup with a 24-hour TTL, scoped per prompt
- DB dedup - Within a single run, identical inputs reuse the same response row
Caching is keyed on the rendered prompt text + model config hash. If you change either, a fresh LLM call is made.
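The keying scheme can be sketched as follows. The exact field layout, key prefix, and hash algorithm are assumptions for illustration; the documented behavior is only that the key covers rendered prompt text plus model config and is scoped per prompt:

```typescript
import { createHash } from "node:crypto";

// Sketch of the cache-key scheme described above: hash the rendered
// prompt text together with the model config, scoped to a prompt ID.
function cacheKey(
  promptId: number,
  renderedPrompt: string,
  modelConfig: { model: string; temperature?: number },
): string {
  const hash = createHash("sha256")
    .update(renderedPrompt)
    .update(JSON.stringify(modelConfig))
    .digest("hex");
  return `eval:${promptId}:${hash}`; // per-prompt scope keeps caches separate
}

// Changing either the rendered text or the model config yields a new key:
const a = cacheKey(1, "Hello", { model: "gpt-4o", temperature: 0.7 });
const b = cacheKey(1, "Hello", { model: "gpt-4o", temperature: 0.2 });
```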
Run History and Trends
The run history panel shows paginated results of all previous runs. Trend charts render pass-rate over time as SVG sparklines so you can spot quality drift at a glance.
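For a sense of how such a sparkline can be produced, the function below maps a series of pass rates (0-100) to an SVG polyline points string; it is an illustrative sketch, not the actual chart code:

```typescript
// Map per-run pass rates (0-100) to SVG polyline coordinates.
// Higher rates plot nearer the top (SVG y grows downward).
function sparklinePoints(rates: number[], width = 100, height = 20): string {
  if (rates.length < 2) return ""; // a single point has no line to draw
  const step = width / (rates.length - 1);
  return rates
    .map((r, i) => {
      const x = (i * step).toFixed(1);
      const y = (height - (r / 100) * height).toFixed(1);
      return `${x},${y}`;
    })
    .join(" ");
}

// e.g. <polyline fill="none" points={sparklinePoints([50, 75, 100])} />
const pts = sparklinePoints([50, 75, 100]);
```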
Comparison View
Select any two runs to compare them side by side. The comparison view shows word-level diffs between outputs, making it easy to spot exactly what changed between prompt versions.
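A word-level diff of the kind the comparison view shows can be computed with a classic longest-common-subsequence walk; this minimal version is in the spirit of that feature, not the actual implementation:

```typescript
// Minimal word-level diff via an LCS table over whitespace tokens.
type DiffOp = { op: "same" | "add" | "del"; word: string };

function wordDiff(a: string, b: string): DiffOp[] {
  const A = a.split(/\s+/).filter(Boolean);
  const B = b.split(/\s+/).filter(Boolean);
  // lcs[i][j] = length of the LCS of A[i..] and B[j..]
  const lcs = Array.from({ length: A.length + 1 }, () =>
    new Array<number>(B.length + 1).fill(0),
  );
  for (let i = A.length - 1; i >= 0; i--) {
    for (let j = B.length - 1; j >= 0; j--) {
      lcs[i][j] =
        A[i] === B[j]
          ? lcs[i + 1][j + 1] + 1
          : Math.max(lcs[i + 1][j], lcs[i][j + 1]);
    }
  }
  // Walk the table, emitting same/del/add operations in order.
  const out: DiffOp[] = [];
  let i = 0;
  let j = 0;
  while (i < A.length && j < B.length) {
    if (A[i] === B[j]) {
      out.push({ op: "same", word: A[i] });
      i++;
      j++;
    } else if (lcs[i + 1][j] >= lcs[i][j + 1]) {
      out.push({ op: "del", word: A[i++] });
    } else {
      out.push({ op: "add", word: B[j++] });
    }
  }
  while (i < A.length) out.push({ op: "del", word: A[i++] });
  while (j < B.length) out.push({ op: "add", word: B[j++] });
  return out;
}

const diff = wordDiff("the quick fox", "the slow fox");
// → same "the", del "quick", add "slow", same "fox"
```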