Comparison view · programmatic · config-driven

claude-haiku-4.5 vs gemini-2.5-flash

Pick up to four models; the URL is the config and is directly shareable
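Because the URL carries the whole configuration, a comparison can be built or read back without touching the page. The sketch below is only an illustration: the base address `https://example.org/compare` and the `models` query parameter are hypothetical names, not the page's actual URL scheme.

```python
from urllib.parse import parse_qs, urlencode, urlparse

MAX_MODELS = 4  # the page allows up to four models


def build_compare_url(base: str, models: list[str]) -> str:
    """Encode a model set into a shareable URL (hypothetical 'models' parameter)."""
    if not 1 <= len(models) <= MAX_MODELS:
        raise ValueError(f"expected 1 to {MAX_MODELS} models, got {len(models)}")
    return f"{base}?{urlencode({'models': ','.join(models)})}"


def parse_compare_url(url: str) -> list[str]:
    """Recover the model set from a URL produced by build_compare_url."""
    query = parse_qs(urlparse(url).query)
    raw = query.get("models", [""])[0]
    return [name for name in raw.split(",") if name][:MAX_MODELS]


# Hypothetical base address, used only for illustration.
url = build_compare_url(
    "https://example.org/compare",
    ["claude-haiku-4.5", "gemini-2.5-flash"],
)
print(url)
print(parse_compare_url(url))
```

Keeping the state in the query string means a link can be shared or bookmarked with no server-side session.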

Side by side, per benchmark

Benchmark     claude-haiku-4.5      gemini-2.5-flash
              pass   conf   F₁      pass   conf   F₁
SQuAD         0.48   0.40   0.72    0.45   0.61   0.67
MMLU-Pro      0.65   0.67   0.74    0.46   0.92   0.64
LegalBench    0.86   0.64   0.74    0.84   0.78   0.81
MathBench     0.95   1.00   0.97    0.95   1.00   0.97
OmniMath      0.58   0.74   0.75    0.64   0.99   0.79
SciCode       0.56   0.65   0.73    0.56   0.47   0.52

Positions in the cloud

[Figure: model/condition points plotted by performance z-score within benchmark/probe and confidence z-score, both axes from -3 to 3. Legend: selected models (provider colors); all other model/condition points; equal relative confidence and pass rate.]
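A z-score within a benchmark places each model relative to the others on that benchmark. The sketch below is a minimal illustration, assuming the population standard deviation and using the SQuAD task accuracies from the ranked table on this page as the performance measure; the page's actual normalization is not stated, and `zscores_within_benchmark` is a name chosen for this example. The confidence axis would be standardized the same way from confidence values.

```python
from statistics import mean, pstdev

# Task accuracy on SQuAD, copied from the ranked table on this page.
squad_task_acc = {
    "gemini-3-pro-preview": 0.65, "gemini-3-flash-preview": 0.59,
    "gemini-2.5-pro": 0.58, "claude-sonnet-4.5": 0.59,
    "gpt-4o": 0.54, "deepseek-r1": 0.57,
    "claude-haiku-4.5": 0.48, "mistral-medium-3.1": 0.52,
    "gpt-4o-mini": 0.45, "gemini-2.0-flash-001": 0.45,
    "gpt-5.2": 0.62, "gemini-2.5-flash": 0.45,
    "mistral-small-3.2-24b-instruct": 0.44, "llama-3.1-70b-instruct": 0.45,
    "claude-3.5-sonnet": 0.56, "llama-3.3-70b-instruct": 0.45,
    "qwen-2.5-72b-instruct": 0.47, "deepseek-chat": 0.55,
    "claude-3-haiku": 0.40, "qwen-2.5-coder-32b-instruct": 0.31,
}


def zscores_within_benchmark(scores: dict[str, float]) -> dict[str, float]:
    """Standardize scores across models for one benchmark.

    Assumption: population standard deviation (pstdev), not the sample one.
    """
    values = list(scores.values())
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # All models scored the same, so no model stands out.
        return {name: 0.0 for name in scores}
    return {name: (value - mu) / sigma for name, value in scores.items()}


z = zscores_within_benchmark(squad_task_acc)
for name in ("claude-haiku-4.5", "gemini-2.5-flash"):
    print(f"{name}: {z[name]:+.2f}")
```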

Ranked context

SQuAD (factual recall) · Prospective · Fβ

#    Model                            Fβ      Prec   Rec    Task acc
1    gemini-3-pro-preview             0.785   0.77   0.80   0.65
2    gemini-3-flash-preview           0.782   0.70   0.88   0.59
3    gemini-2.5-pro                   0.764   0.71   0.83   0.58
4    claude-sonnet-4.5                0.753   0.78   0.72   0.59
5    gpt-4o                           0.739   0.69   0.80   0.54
6    deepseek-r1                      0.733   0.76   0.70   0.57
7    claude-haiku-4.5                 0.716   0.79   0.65   0.48
8    mistral-medium-3.1               0.703   0.65   0.77   0.52
9    gpt-4o-mini                      0.686   0.56   0.88   0.45
10   gemini-2.0-flash-001             0.683   0.58   0.82   0.45
11   gpt-5.2                          0.673   0.84   0.56   0.62
12   gemini-2.5-flash                 0.670   0.58   0.79   0.45
13   mistral-small-3.2-24b-instruct   0.666   0.57   0.79   0.44
14   llama-3.1-70b-instruct           0.643   0.65   0.64   0.45
15   claude-3.5-sonnet                0.643   0.79   0.54   0.56
16   llama-3.3-70b-instruct           0.593   0.64   0.55   0.45
17   qwen-2.5-72b-instruct            0.572   0.74   0.47   0.47
18   deepseek-chat                    0.567   0.78   0.45   0.55
19   claude-3-haiku                   0.554   0.55   0.56   0.40
20   qwen-2.5-coder-32b-instruct      0.476   0.31   1.00   0.31
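The Fβ column combines precision and recall into one score. The sketch below uses the standard Fβ formula; how the page derives precision and recall from confidence is not stated, and `f_beta` is a name chosen for this example. For the rows checked, β = 1 reproduces the listed values up to the rounding of the two-decimal inputs.

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """Standard F-beta score: beta > 1 favors recall, beta < 1 favors precision."""
    beta_sq = beta * beta
    denominator = beta_sq * precision + recall
    if denominator == 0:
        # Precision and recall are both zero; define the score as zero.
        return 0.0
    return (1 + beta_sq) * precision * recall / denominator


# Precision and recall copied from the ranked table (two-decimal values).
rows = {
    "gemini-3-pro-preview": (0.77, 0.80),
    "claude-haiku-4.5": (0.79, 0.65),
    "gpt-5.2": (0.84, 0.56),
}
for name, (precision, recall) in rows.items():
    print(f"{name}: F1 = {f_beta(precision, recall):.3f}")
```

Raising β shifts weight toward recall, so a high-precision, low-recall model such as gpt-5.2 scores lower at β = 2 than at β = 1.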

Pairwise signal for the selected pair

Match accuracy: controls for the performance base-rate gap

Selected pairs: 1 / 171
Selected mean tau: -0.014
All-pairs mean: +0.037
Selected pairs with p<0.05: 0 (0%)
[Figure: Pair signal: do confidence gaps rank performance gaps? (Kendall tau-b, axis from -1.0 to 1.0). Legend: all model pairs (observed); base-rate-matched null; calibration-preserving null; selected pair (filled = p<0.05); selected mean; all-pairs mean.]
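The pair signal asks whether the benchmarks where one model is more confident than the other are also the benchmarks where it performs better. The sketch below is one reading of that idea, not the page's implementation: it assumes the gaps are taken per benchmark (claude-haiku-4.5 minus gemini-2.5-flash) from the side-by-side table and computes Kendall tau-b by direct pair counting. The page's statistic presumably runs over more probes and conditions and may apply the accuracy matching, so this is not expected to reproduce the -0.014 shown above.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(x: list[float], y: list[float]) -> float:
    """Kendall tau-b by pair counting, with the usual correction for ties."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    concordant = discordant = ties_x_only = ties_y_only = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied on both variables: excluded from every count
        if dx == 0:
            ties_x_only += 1
        elif dy == 0:
            ties_y_only += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    denominator = sqrt((untied + ties_x_only) * (untied + ties_y_only))
    return (concordant - discordant) / denominator if denominator else float("nan")


# (pass, conf) per benchmark, copied from the side-by-side table.
haiku = {"SQuAD": (0.48, 0.40), "MMLU-Pro": (0.65, 0.67), "LegalBench": (0.86, 0.64),
         "MathBench": (0.95, 1.00), "OmniMath": (0.58, 0.74), "SciCode": (0.56, 0.65)}
flash = {"SQuAD": (0.45, 0.61), "MMLU-Pro": (0.46, 0.92), "LegalBench": (0.84, 0.78),
         "MathBench": (0.95, 1.00), "OmniMath": (0.64, 0.99), "SciCode": (0.56, 0.47)}

# Round the gaps to two decimals so equal gaps compare as exact ties.
perf_gaps = [round(haiku[b][0] - flash[b][0], 2) for b in haiku]
conf_gaps = [round(haiku[b][1] - flash[b][1], 2) for b in haiku]

tau = kendall_tau_b(conf_gaps, perf_gaps)
print(f"tau-b over {len(perf_gaps)} benchmarks: {tau:+.3f}")
```

A positive tau would mean the confidence gaps order the benchmarks the same way the performance gaps do; a value near zero means confidence differences carry little information about which model does better where.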