Model set
Side by side, per benchmark
| Benchmark | claude-haiku-4.5 pass | claude-haiku-4.5 conf | claude-haiku-4.5 F₁ | gemini-2.5-flash pass | gemini-2.5-flash conf | gemini-2.5-flash F₁ |
|---|---|---|---|---|---|---|
| SQuAD | 0.48 | 0.40 | 0.72 | 0.45 | 0.61 | 0.67 |
| MMLU-Pro | 0.65 | 0.67 | 0.74 | 0.46 | 0.92 | 0.64 |
| LegalBench | 0.86 | 0.64 | 0.74 | 0.84 | 0.78 | 0.81 |
| MathBench | 0.95 | 1.00 | 0.97 | 0.95 | 1.00 | 0.97 |
| OmniMath | 0.58 | 0.74 | 0.75 | 0.64 | 0.99 | 0.79 |
| SciCode | 0.56 | 0.65 | 0.73 | 0.56 | 0.47 | 0.52 |
Positions in the cloud
Legend: selected models (provider colors); all other model/condition points; equal relative confidence and pass rate.
Ranked context
SQuAD (factual recall) · Prospective · Fβ
| # | Model | Fβ | Prec | Rec | Task acc |
|---|---|---|---|---|---|
| 1 | gemini-3-pro-preview | 0.785 | 0.77 | 0.80 | 0.65 |
| 2 | gemini-3-flash-preview | 0.782 | 0.70 | 0.88 | 0.59 |
| 3 | gemini-2.5-pro | 0.764 | 0.71 | 0.83 | 0.58 |
| 4 | claude-sonnet-4.5 | 0.753 | 0.78 | 0.72 | 0.59 |
| 5 | gpt-4o | 0.739 | 0.69 | 0.80 | 0.54 |
| 6 | deepseek-r1 | 0.733 | 0.76 | 0.70 | 0.57 |
| 7 | claude-haiku-4.5 | 0.716 | 0.79 | 0.65 | 0.48 |
| 8 | mistral-medium-3.1 | 0.703 | 0.65 | 0.77 | 0.52 |
| 9 | gpt-4o-mini | 0.686 | 0.56 | 0.88 | 0.45 |
| 10 | gemini-2.0-flash-001 | 0.683 | 0.58 | 0.82 | 0.45 |
| 11 | gpt-5.2 | 0.673 | 0.84 | 0.56 | 0.62 |
| 12 | gemini-2.5-flash | 0.670 | 0.58 | 0.79 | 0.45 |
| 13 | mistral-small-3.2-24b-instruct | 0.666 | 0.57 | 0.79 | 0.44 |
| 14 | llama-3.1-70b-instruct | 0.643 | 0.65 | 0.64 | 0.45 |
| 15 | claude-3.5-sonnet | 0.643 | 0.79 | 0.54 | 0.56 |
| 16 | llama-3.3-70b-instruct | 0.593 | 0.64 | 0.55 | 0.45 |
| 17 | qwen-2.5-72b-instruct | 0.572 | 0.74 | 0.47 | 0.47 |
| 18 | deepseek-chat | 0.567 | 0.78 | 0.45 | 0.55 |
| 19 | claude-3-haiku | 0.554 | 0.55 | 0.56 | 0.40 |
| 20 | qwen-2.5-coder-32b-instruct | 0.476 | 0.31 | 1.00 | 0.31 |
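The Fβ column can be related to the Prec and Rec columns with the standard F-beta formula, Fβ = (1 + β²)·P·R / (β²·P + R). The sketch below is an illustration, not the dashboard's own code: the function name `f_beta` is hypothetical, and β = 1 is an assumption (the page does not state the β in use). The sample rows are copied from the table above. Because the table shows precision and recall rounded to two decimals, recomputed values may differ from the listed Fβ in the third decimal.

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """Standard F-beta score from precision and recall.

    beta = 1.0 is an assumption here; the source page does not state it.
    """
    b2 = beta ** 2
    denominator = b2 * precision + recall
    if denominator == 0:
        # Degenerate case (e.g. precision and recall both zero): define as 0.
        return 0.0
    return (1 + b2) * precision * recall / denominator


# (model, precision, recall, listed F-beta) copied from the ranked table.
sample_rows = [
    ("gemini-3-pro-preview", 0.77, 0.80, 0.785),
    ("claude-haiku-4.5", 0.79, 0.65, 0.716),
    ("gemini-2.5-flash", 0.58, 0.79, 0.670),
]

for model, prec, rec, listed in sample_rows:
    recomputed = f_beta(prec, rec)
    print(f"{model}: recomputed {recomputed:.3f}, listed {listed:.3f}")
```

Raising `beta` above 1 weights recall more heavily and lowering it below 1 weights precision more heavily, so the ranking can change with β.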
Pairwise signal for the selected pair
Match accuracy controls for the performance base-rate gap
| Statistic | Value |
|---|---|
| Selected pairs | 1 / 171 |
| Selected mean tau | -0.014 |
| All-pairs mean | +0.037 |
| Selected p<0.05 | 0 (0%) |
Legend: all model pairs (observed); base-rate-matched null; calibration-preserving null; selected pair (filled = p<0.05); selected mean; all-pairs mean.