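A statistical framework built on McNemar's test detects accuracy degradations in LLMs after optimization by comparing the original and optimized models sample by sample. It aggregates results across benchmarks while keeping the false-positive rate under control, and it can confidently flag accuracy drops as small as 0.3%.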
Key Points
- Hypothesis testing to separate real accuracy changes from evaluation noise
- Per-sample comparison of scores between the original and optimized models
- Integration with the LM Evaluation Harness
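The per-sample comparison can be illustrated with a minimal sketch of an exact, one-sided McNemar test. It assumes each model's output has already been reduced to a per-sample correct/incorrect flag in the same sample order. The function name `mcnemar_exact` and the toy data are hypothetical and not taken from the framework itself.

```python
from math import comb


def mcnemar_exact(baseline_correct, optimized_correct):
    """One-sided exact McNemar test for degradation (illustrative sketch).

    Inputs are equal-length sequences of truthy/falsy per-sample results.
    Returns (b, c, p_value), where
      b = samples the baseline got right and the optimized model got wrong,
      c = samples the baseline got wrong and the optimized model got right.
    """
    if len(baseline_correct) != len(optimized_correct):
        raise ValueError("both models must be scored on the same samples")

    pairs = list(zip(baseline_correct, optimized_correct))
    b = sum(1 for base, opt in pairs if base and not opt)
    c = sum(1 for base, opt in pairs if not base and opt)

    n = b + c  # only discordant samples carry information
    if n == 0:
        return b, c, 1.0

    # Under the null hypothesis each discordant sample is equally likely
    # to fall in b or c, so b ~ Binomial(n, 0.5). The p-value is P(X >= b).
    p_value = sum(comb(n, k) for k in range(b, n + 1)) / 2 ** n
    return b, c, p_value


# Toy data: the optimized model loses 8 of 10 samples the baseline solved.
baseline = [1] * 10
optimized = [0] * 8 + [1] * 2
b, c, p = mcnemar_exact(baseline, optimized)
print(b, c, p)  # 8 0 0.00390625
```

Samples on which both models agree drop out of the test entirely. That is why a per-sample comparison can detect small drops that a comparison of two aggregate accuracies, each with its own sampling noise, would miss.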
Impact Analysis
The framework helps verify that an optimization is lossless, meaning it does not measurably degrade accuracy. This is vital for reliable model deployment.
Technical Details
The framework provides three aggregation methods for reaching a single decision across multiple benchmarks. It also covers accuracy errors introduced by quantization.
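The three aggregation methods are not named in this summary, so the sketch below should not be read as the framework's own procedure. It only illustrates two common ways to turn per-benchmark McNemar counts into one decision: pooling the discordant counts, and applying a Bonferroni correction to per-benchmark p-values. All function names, the benchmark names, the counts, and the `alpha` value are assumptions made for illustration.

```python
from math import comb


def one_sided_p(b, c):
    """Exact one-sided McNemar p-value, P(X >= b) with X ~ Binomial(b + c, 0.5)."""
    n = b + c
    if n == 0:
        return 1.0
    return sum(comb(n, k) for k in range(b, n + 1)) / 2 ** n


def pooled_decision(counts, alpha=0.05):
    """Pool discordant counts over all benchmarks and run a single test."""
    b = sum(bi for bi, _ in counts.values())
    c = sum(ci for _, ci in counts.values())
    return one_sided_p(b, c) < alpha


def bonferroni_decision(counts, alpha=0.05):
    """Flag a degradation if any benchmark is significant at alpha / m."""
    threshold = alpha / len(counts)
    return any(one_sided_p(b, c) < threshold for b, c in counts.values())


# Hypothetical (b, c) counts per benchmark:
# b = baseline right / optimized wrong, c = baseline wrong / optimized right.
counts = {"task_a": (12, 3), "task_b": (5, 5), "task_c": (9, 4)}

print("pooled:", pooled_decision(counts))
print("bonferroni:", bonferroni_decision(counts))
```

On these toy counts the two rules disagree: pooling accumulates weak evidence from every benchmark, while the Bonferroni rule needs at least one benchmark to be strongly significant on its own. Both keep the overall false-positive rate at or below `alpha`, but they are sensitive to different degradation patterns.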