Running Evals
Run suites, view results, and track trends
Running a Suite
To run a suite, you need a suite ID and a prompt version number, plus an optional model configuration. Click Run in the UI or call the API:
POST /api/prompts/{promptId}/eval/run

```json
{
  "suiteId": 1,
  "versionNo": 3,
  "modelConfig": {
    "model": "gpt-4o",
    "temperature": 0.7
  }
}
```

You can also pass testIds to run a subset of tests instead of the full suite.
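As a sketch of what a client call might look like, the helper below builds the request body shown above, including an optional testIds subset. The function name, the concrete IDs, and the commented-out fetch call are illustrative, not part of the documented API:

```typescript
// Hypothetical helper for building the run-request body described above.
interface ModelConfig {
  model: string;
  temperature?: number;
}

interface RunRequestBody {
  suiteId: number;
  versionNo: number;
  modelConfig?: ModelConfig;
  testIds?: number[]; // omit to run the full suite
}

function buildRunBody(
  suiteId: number,
  versionNo: number,
  opts: { modelConfig?: ModelConfig; testIds?: number[] } = {},
): RunRequestBody {
  return { suiteId, versionNo, ...opts };
}

const body = buildRunBody(1, 3, {
  modelConfig: { model: "gpt-4o", temperature: 0.7 },
  testIds: [4, 7], // run only these two tests
});

// Then POST it to the run endpoint (promptId assumed known):
// await fetch(`/api/prompts/${promptId}/eval/run`, {
//   method: "POST",
//   headers: { "Content-Type": "application/json" },
//   body: JSON.stringify(body),
// });
```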
Note: LLM assertions currently support OpenAI models only. Render assertions work regardless of model provider since they run before the LLM call.
How Runs Execute
Runs execute asynchronously. The API returns a pending run immediately, and the backend processes tests in the background. For each test in the suite:
- Resolve the dataset cases (or use inline variables)
- Render the prompt template with the case's variables
- Run render assertions against the rendered output
- If LLM assertions exist, call the LLM with the rendered prompt
- Run LLM assertions against the response
- Record pass/fail/error status for each test result
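The steps above can be sketched as a single per-test function. The types, the template syntax, and the injected callLlm client here are hypothetical stand-ins for the real implementation; the point is the ordering, in particular that render assertions run before any LLM call:

```typescript
// Hypothetical sketch of the per-test pipeline described above.
type Status = "pass" | "fail" | "error";

interface TestCase {
  variables: Record<string, string>;
  renderAssertions: ((rendered: string) => boolean)[];
  llmAssertions: ((response: string) => boolean)[];
}

// Step 2: render the template with the case's variables
// (assumes a simple {{name}} placeholder syntax for illustration).
function render(template: string, vars: Record<string, string>): string {
  return template.replace(/\{\{(\w+)\}\}/g, (_, k: string) => vars[k] ?? "");
}

async function runTest(
  template: string,
  test: TestCase,
  callLlm: (prompt: string) => Promise<string>, // injected LLM client
): Promise<Status> {
  try {
    const rendered = render(template, test.variables);
    // Step 3: render assertions run before any LLM call.
    if (!test.renderAssertions.every((a) => a(rendered))) return "fail";
    // Steps 4-5: the LLM is only called if LLM assertions exist.
    if (test.llmAssertions.length > 0) {
      const response = await callLlm(rendered);
      if (!test.llmAssertions.every((a) => a(response))) return "fail";
    }
    return "pass";
  } catch {
    return "error"; // step 6: unexpected failures are recorded as errors
  }
}
```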
LLM responses are cached (by input hash) across runs to avoid redundant API calls. The cache is scoped per prompt so different prompts never share cached responses.
Viewing Results
Each run produces a summary with total, passed, failed, and errored counts, plus a percentage score. The results panel shows:
- Per-test pass/fail status with assertion details
- Rendered output for each test case
- LLM response text, latency, token count, and cost (when applicable)
- Error messages for any failed or errored assertions
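The summary counts and percentage score can be derived directly from the per-test statuses; this small function is illustrative, not the actual implementation:

```typescript
// Illustrative computation of the run summary described above.
type Status = "pass" | "fail" | "error";

function summarize(results: Status[]) {
  const passed = results.filter((s) => s === "pass").length;
  const failed = results.filter((s) => s === "fail").length;
  const errored = results.filter((s) => s === "error").length;
  const total = results.length;
  const score = total === 0 ? 0 : Math.round((passed / total) * 100);
  return { total, passed, failed, errored, score };
}

// e.g. summarize(["pass", "pass", "fail", "error"])
// → { total: 4, passed: 2, failed: 1, errored: 1, score: 50 }
```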
Async Execution
When you call POST /eval/run, the server creates a run in pending status and returns immediately. The actual test execution happens asynchronously in the background. Poll the run status to track progress:
GET /api/prompts/{promptId}/eval/runs/{runId}/status

```
// Returns:
{ "status": "pending" | "running" | "completed" | "failed" }
```

Response Caching
LLM responses are cached at two levels to avoid redundant API calls:
- Redis cache - Cross-run dedup with a 24-hour TTL, scoped per prompt
- DB dedup - Within a single run, identical inputs reuse the same response row
Caching is keyed on the rendered prompt text + model config hash. If you change either, a fresh LLM call is made.
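The keying scheme can be sketched as follows. The exact field layout, key prefix, and hash algorithm are assumptions for illustration; the documented behavior is only that the key covers rendered prompt text plus model config and is scoped per prompt:

```typescript
import { createHash } from "node:crypto";

// Sketch of the cache-key scheme described above: hash the rendered
// prompt text together with the model config, scoped to a prompt ID.
function cacheKey(
  promptId: number,
  renderedPrompt: string,
  modelConfig: { model: string; temperature?: number },
): string {
  const hash = createHash("sha256")
    .update(renderedPrompt)
    .update(JSON.stringify(modelConfig))
    .digest("hex");
  return `eval:${promptId}:${hash}`; // per-prompt scope keeps caches separate
}

// Changing either the rendered text or the model config yields a new key:
const a = cacheKey(1, "Hello", { model: "gpt-4o", temperature: 0.7 });
const b = cacheKey(1, "Hello", { model: "gpt-4o", temperature: 0.2 });
```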
Run History and Trends
The run history panel shows paginated results of all previous runs. Trend charts render pass-rate over time as SVG sparklines so you can spot quality drift at a glance.
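For a sense of how such a sparkline can be produced, the function below maps a series of pass rates (0-100) to an SVG polyline points string; it is an illustrative sketch, not the actual chart code:

```typescript
// Map per-run pass rates (0-100) to SVG polyline coordinates.
// Higher rates plot nearer the top (SVG y grows downward).
function sparklinePoints(rates: number[], width = 100, height = 20): string {
  if (rates.length < 2) return ""; // a single point has no line to draw
  const step = width / (rates.length - 1);
  return rates
    .map((r, i) => {
      const x = (i * step).toFixed(1);
      const y = (height - (r / 100) * height).toFixed(1);
      return `${x},${y}`;
    })
    .join(" ");
}

// e.g. <polyline fill="none" points={sparklinePoints([50, 75, 100])} />
const pts = sparklinePoints([50, 75, 100]);
```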
Comparison View
Select any two runs to compare them side by side. The comparison view shows word-level diffs between outputs, making it easy to spot exactly what changed between prompt versions.
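A word-level diff of the kind the comparison view shows can be computed with a classic longest-common-subsequence walk; this minimal version is in the spirit of that feature, not the actual implementation:

```typescript
// Minimal word-level diff via an LCS table over whitespace tokens.
type DiffOp = { op: "same" | "add" | "del"; word: string };

function wordDiff(a: string, b: string): DiffOp[] {
  const A = a.split(/\s+/).filter(Boolean);
  const B = b.split(/\s+/).filter(Boolean);
  // lcs[i][j] = length of the LCS of A[i..] and B[j..]
  const lcs = Array.from({ length: A.length + 1 }, () =>
    new Array<number>(B.length + 1).fill(0),
  );
  for (let i = A.length - 1; i >= 0; i--) {
    for (let j = B.length - 1; j >= 0; j--) {
      lcs[i][j] =
        A[i] === B[j]
          ? lcs[i + 1][j + 1] + 1
          : Math.max(lcs[i + 1][j], lcs[i][j + 1]);
    }
  }
  // Walk the table, emitting same/del/add operations in order.
  const out: DiffOp[] = [];
  let i = 0;
  let j = 0;
  while (i < A.length && j < B.length) {
    if (A[i] === B[j]) {
      out.push({ op: "same", word: A[i] });
      i++;
      j++;
    } else if (lcs[i + 1][j] >= lcs[i][j + 1]) {
      out.push({ op: "del", word: A[i++] });
    } else {
      out.push({ op: "add", word: B[j++] });
    }
  }
  while (i < A.length) out.push({ op: "del", word: A[i++] });
  while (j < B.length) out.push({ op: "add", word: B[j++] });
  return out;
}

const diff = wordDiff("the quick fox", "the slow fox");
// → same "the", del "quick", add "slow", same "fox"
```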