A/B Testing
Compare prompt versions with statistical rigor
What Is A/B Testing?
A/B testing lets you compare two prompt versions (or model configurations) head-to-head with statistical rigor. Instead of eyeballing outputs, you get p-values, confidence intervals, and effect sizes.
How It Works
1. Select a suite to run
2. Configure Config A (version + model settings) and Config B
3. Run the A/B test -- both configurations execute the same suite in parallel
4. Review the statistical analysis
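The steps above can be sketched in a few lines. Everything here is illustrative: `run_config`, the config dicts, and the fake pass/fail results are hypothetical stand-ins for whatever your harness actually calls; the only real point is that both configs execute the same suite concurrently.

```python
# Illustrative sketch only -- run_config and the config dicts are
# hypothetical stand-ins, not this tool's actual API.
from concurrent.futures import ThreadPoolExecutor

SUITE = ["test_1", "test_2", "test_3"]  # the same suite for both configs

def run_config(config, suite):
    """Execute every test in the suite against one configuration.

    Here we fake the outcome; a real harness would call the model.
    """
    return [{"test": t, "passed": config["always_pass"]} for t in suite]

config_a = {"name": "A", "always_pass": True}
config_b = {"name": "B", "always_pass": False}

# Both configurations run the same suite in parallel.
with ThreadPoolExecutor(max_workers=2) as pool:
    future_a = pool.submit(run_config, config_a, SUITE)
    future_b = pool.submit(run_config, config_b, SUITE)
    results_a, results_b = future_a.result(), future_b.result()
```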
Statistical Analysis
The A/B test produces three types of analysis:
Two-Proportion Z-Test
Compares pass rates between Config A and Config B. Reports a Z-score, a two-tailed p-value, and whether the difference is statistically significant at alpha = 0.05. A winner (A or B) is declared only when significance is reached.
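The test itself is standard and fits in a few lines of stdlib Python. This is a minimal sketch using the pooled-standard-error form of the two-proportion z-test (the function name and the example counts are my own, not the tool's):

```python
import math

def two_proportion_z_test(pass_a, n_a, pass_b, n_b, alpha=0.05):
    """Two-proportion z-test on pass rates, pooled standard error."""
    p_a, p_b = pass_a / n_a, pass_b / n_b
    p_pool = (pass_a + pass_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-tailed p-value from the standard normal CDF:
    # Phi(x) = 0.5 * (1 + erf(x / sqrt(2)))
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    winner = None
    if p_value < alpha:
        winner = "A" if p_a > p_b else "B"
    return z, p_value, winner

# Example: A passes 78/100 tests, B passes 62/100.
z, p, winner = two_proportion_z_test(78, 100, 62, 100)
```

With these counts the difference is significant (z about 2.47, p about 0.014), so A is declared the winner.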
Cohen's h Effect Size
Measures the practical magnitude of the difference between pass rates. The effect is classified as negligible (h < 0.2), small (0.2-0.5), medium (0.5-0.8), or large (> 0.8). A statistically significant result with a negligible effect size means the difference is real but may not matter in practice.
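Cohen's h is the difference of arcsine-transformed proportions. A short sketch using the conventional thresholds (function names are mine; the 0.78 and 0.62 pass rates reuse the z-test example above):

```python
import math

def cohens_h(p_a, p_b):
    """Cohen's h: absolute difference of arcsine-transformed proportions."""
    return abs(2 * math.asin(math.sqrt(p_a)) - 2 * math.asin(math.sqrt(p_b)))

def classify_h(h):
    """Conventional effect-size buckets for Cohen's h."""
    if h < 0.2:
        return "negligible"
    if h < 0.5:
        return "small"
    if h < 0.8:
        return "medium"
    return "large"

h = cohens_h(0.78, 0.62)
label = classify_h(h)
```

Note that a 16-point pass-rate gap here comes out as only a "small" effect, which is exactly the nuance the classification adds on top of the p-value.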
Latency and Cost Comparison
Compares mean, median, and p95 latency between the two configs, plus total and per-test cost. Helps you weigh quality improvements against performance and spend.
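A sketch of how such a comparison can be computed from raw per-test latencies. The nearest-rank definition of p95 is an assumption on my part (tools often interpolate instead), and the sample numbers are invented:

```python
import math
import statistics

def latency_summary(latencies_ms):
    """Mean, median, and nearest-rank p95 for a list of latencies (ms)."""
    ordered = sorted(latencies_ms)
    rank = max(0, math.ceil(0.95 * len(ordered)) - 1)  # nearest-rank p95
    return {
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "p95": ordered[rank],
    }

def latency_delta(a_ms, b_ms):
    """Per-metric delta (B minus A); negative values mean B is faster."""
    a, b = latency_summary(a_ms), latency_summary(b_ms)
    return {k: b[k] - a[k] for k in a}

# Invented samples: Config A has one slow outlier, Config B is steady.
delta = latency_delta([100, 120, 110, 300, 105], [90, 95, 100, 92, 98])
```

Note how the outlier dominates A's mean and p95 while barely moving its median, which is why all three statistics are reported side by side.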
Interpreting Results
| Metric | Good Signal | Watch Out |
|---|---|---|
| p-value | < 0.05 (significant) | > 0.05 means inconclusive -- run more tests |
| Cohen's h | > 0.5 (medium+ effect) | < 0.2 means the difference is negligible |
| Confidence | > 95% | < 90% means you need more data |
| Latency delta | Negative (B is faster) | Large positive means B is slower |
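The table can be folded into a rough decision helper. This is an illustrative heuristic of my own, not the tool's actual logic, and it assumes Config B is the challenger whose latency delta is measured relative to A:

```python
def interpret(p_value, h, latency_delta_ms):
    """Turn the three headline metrics into a one-line recommendation.

    Heuristic only: assumes Config B is the challenger and that
    latency_delta_ms is B minus A (negative means B is faster).
    """
    if p_value >= 0.05:
        return "inconclusive: run more tests"
    if h < 0.2:
        return "significant but negligible effect: may not matter in practice"
    if latency_delta_ms > 0:
        return "meaningful quality win, but B is slower: weigh the trade-off"
    return "clear win: significant, meaningful effect, no latency regression"
```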