A/B Testing
Compare prompt versions with statistical rigor
What Is A/B Testing?
A/B testing lets you compare two prompt versions (or model configurations) head-to-head with statistical rigor. Instead of eyeballing outputs, you get p-values, confidence intervals, and effect sizes.
How It Works
1. Select a suite to run
2. Configure Config A (version + model settings) and Config B
3. Run the A/B test -- both configurations execute the same suite in parallel
4. Review the statistical analysis
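The steps above can be sketched in a few lines. Everything here is illustrative: `run_config`, the config dicts, and the fake pass/fail results are hypothetical stand-ins for whatever your harness actually calls; the only real point is that both configs execute the same suite concurrently.

```python
# Illustrative sketch only -- run_config and the config dicts are
# hypothetical stand-ins, not this tool's actual API.
from concurrent.futures import ThreadPoolExecutor

SUITE = ["test_1", "test_2", "test_3"]  # the same suite for both configs

def run_config(config, suite):
    """Execute every test in the suite against one configuration.

    Here we fake the outcome; a real harness would call the model.
    """
    return [{"test": t, "passed": config["always_pass"]} for t in suite]

config_a = {"name": "A", "always_pass": True}
config_b = {"name": "B", "always_pass": False}

# Both configurations run the same suite in parallel.
with ThreadPoolExecutor(max_workers=2) as pool:
    future_a = pool.submit(run_config, config_a, SUITE)
    future_b = pool.submit(run_config, config_b, SUITE)
    results_a, results_b = future_a.result(), future_b.result()
```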
Statistical Analysis
The A/B test produces three types of analysis:
Two-Proportion Z-Test
Compares pass rates between Config A and Config B. Reports a Z-score, a two-tailed p-value, and whether the difference is statistically significant at alpha = 0.05. A winner (A or B) is declared only when significance is reached.
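The test itself is standard and fits in a few lines of stdlib Python. This is a minimal sketch using the pooled-standard-error form of the two-proportion z-test (the function name and the example counts are my own, not the tool's):

```python
import math

def two_proportion_z_test(pass_a, n_a, pass_b, n_b, alpha=0.05):
    """Two-proportion z-test on pass rates, pooled standard error."""
    p_a, p_b = pass_a / n_a, pass_b / n_b
    p_pool = (pass_a + pass_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-tailed p-value from the standard normal CDF:
    # Phi(x) = 0.5 * (1 + erf(x / sqrt(2)))
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    winner = None
    if p_value < alpha:
        winner = "A" if p_a > p_b else "B"
    return z, p_value, winner

# Example: A passes 78/100 tests, B passes 62/100.
z, p, winner = two_proportion_z_test(78, 100, 62, 100)
```

With these counts the difference is significant (z about 2.47, p about 0.014), so A is declared the winner.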
Cohen's h Effect Size
Measures the practical magnitude of the difference between pass rates. The effect is classified as negligible (h < 0.2), small (0.2-0.5), medium (0.5-0.8), or large (> 0.8). A statistically significant result with a negligible effect size means the difference is real but may not matter in practice.
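Cohen's h is the difference of arcsine-transformed proportions. A short sketch using the conventional thresholds (function names are mine; the 0.78 and 0.62 pass rates reuse the z-test example above):

```python
import math

def cohens_h(p_a, p_b):
    """Cohen's h: absolute difference of arcsine-transformed proportions."""
    return abs(2 * math.asin(math.sqrt(p_a)) - 2 * math.asin(math.sqrt(p_b)))

def classify_h(h):
    """Conventional effect-size buckets for Cohen's h."""
    if h < 0.2:
        return "negligible"
    if h < 0.5:
        return "small"
    if h < 0.8:
        return "medium"
    return "large"

h = cohens_h(0.78, 0.62)
label = classify_h(h)
```

Note that a 16-point pass-rate gap here comes out as only a "small" effect, which is exactly the nuance the classification adds on top of the p-value.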
Latency and Cost Comparison
Compares mean, median, and p95 latency between the two configs, plus total and per-test cost. Helps you weigh quality improvements against performance and spend.
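A sketch of how such a comparison can be computed from raw per-test latencies. The nearest-rank definition of p95 is an assumption on my part (tools often interpolate instead), and the sample numbers are invented:

```python
import math
import statistics

def latency_summary(latencies_ms):
    """Mean, median, and nearest-rank p95 for a list of latencies (ms)."""
    ordered = sorted(latencies_ms)
    rank = max(0, math.ceil(0.95 * len(ordered)) - 1)  # nearest-rank p95
    return {
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "p95": ordered[rank],
    }

def latency_delta(a_ms, b_ms):
    """Per-metric delta (B minus A); negative values mean B is faster."""
    a, b = latency_summary(a_ms), latency_summary(b_ms)
    return {k: b[k] - a[k] for k in a}

# Invented samples: Config A has one slow outlier, Config B is steady.
delta = latency_delta([100, 120, 110, 300, 105], [90, 95, 100, 92, 98])
```

Note how the outlier dominates A's mean and p95 while barely moving its median, which is why all three statistics are reported side by side.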
Interpreting Results
| Metric | Good Signal | Watch Out |
|---|---|---|
| p-value | < 0.05 (significant) | > 0.05 means inconclusive -- run more tests |
| Cohen's h | > 0.5 (medium+ effect) | < 0.2 means the difference is negligible |
| Confidence | > 95% | < 90% means you need more data |
| Latency delta | Negative (B is faster) | Large positive means B is slower |
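The table can be folded into a rough decision helper. This is an illustrative heuristic of my own, not the tool's actual logic, and it assumes Config B is the challenger whose latency delta is measured relative to A:

```python
def interpret(p_value, h, latency_delta_ms):
    """Turn the three headline metrics into a one-line recommendation.

    Heuristic only: assumes Config B is the challenger and that
    latency_delta_ms is B minus A (negative means B is faster).
    """
    if p_value >= 0.05:
        return "inconclusive: run more tests"
    if h < 0.2:
        return "significant but negligible effect: may not matter in practice"
    if latency_delta_ms > 0:
        return "meaningful quality win, but B is slower: weigh the trade-off"
    return "clear win: significant, meaningful effect, no latency regression"
```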