Benchmark view · 18 models

MathBench (competition math)

Unaided means: solve the problem without hints or worked solutions.

Probes: Prospective + Counterfactual

Operating-point board

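MathBench (competition math) · Prospective · Fβ

| # | Model | Fβ | Prec | Rec | Task acc |
|---|-------|----|------|-----|----------|
| 1 | gemini-3.1-pro-preview | 1.000 | 1.00 | 1.00 | 1.00 |
| 2 | gpt-5.2 | 0.998 | 1.00 | 1.00 | 1.00 |
| 3 | gemini-3-flash-preview | 0.991 | 0.98 | 1.00 | 0.98 |
| 4 | deepseek-r1 | 0.988 | 0.98 | 1.00 | 0.98 |
| 5 | gemini-2.5-flash | 0.973 | 0.95 | 1.00 | 0.95 |
| 6 | claude-haiku-4.5 | 0.972 | 0.95 | 1.00 | 0.95 |
| 7 | claude-sonnet-4.5 | 0.967 | 0.94 | 1.00 | 0.93 |
| 8 | deepseek-chat | 0.962 | 0.95 | 0.97 | 0.95 |
| 9 | gemini-2.5-pro | 0.955 | 0.91 | 1.00 | 0.91 |
| 10 | gemini-2.0-flash-001 | 0.945 | 0.90 | 1.00 | 0.89 |
| 11 | mistral-small-3.2-24b-instruct | 0.923 | 0.87 | 0.98 | 0.87 |
| 12 | mistral-medium-3.1 | 0.916 | 0.95 | 0.88 | 0.94 |
| 13 | qwen-2.5-72b-instruct | 0.910 | 0.87 | 0.95 | 0.83 |
| 14 | gpt-4o | 0.894 | 0.81 | 1.00 | 0.80 |
| 15 | gpt-4o-mini | 0.878 | 0.78 | 1.00 | 0.78 |
| 16 | llama-3.3-70b-instruct | 0.853 | 0.75 | 0.99 | 0.74 |
| 17 | llama-3.1-70b-instruct | 0.782 | 0.65 | 0.99 | 0.63 |
| 18 | claude-3-haiku | 0.595 | 0.43 | 0.98 | 0.41 |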
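
The Fβ column combines the precision and recall columns into one score. The board does not state which β it uses, so the sketch below takes β as a parameter and uses β = 1 purely as an assumption. Because the board's precision and recall are rounded to two decimals, a value recomputed this way may differ slightly from the listed Fβ.

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta score: weighted harmonic mean of precision and recall.

    beta > 1 weights recall more heavily, beta < 1 weights precision more.
    """
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)


# Rounded precision and recall from the last row of the board (claude-3-haiku).
# beta=1.0 is an assumption; the page does not give the actual beta.
score = f_beta(0.43, 0.98, beta=1.0)
print(round(score, 3))
```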

Pairwise signal on MathBench

Observed correlation: 0.072 (mean τ-b; 0 = no signal)
Baseline correlation: 0.000 (permutation null; ≈ 0)
Model pairs: 135 (unit = pair, not model)
Significant pairs: 28.9% (p < 0.05 after FDR; not an effect size)
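
The share of significant pairs is reported after an FDR correction, but the page does not name the procedure. As an illustration only, the sketch below assumes the Benjamini-Hochberg step-up procedure; the function name and the p-values are hypothetical.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return one boolean per p-value: True if rejected at FDR level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest 1-based rank k whose sorted p-value is <= (k / m) * alpha.
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff = rank
    # Reject every hypothesis ranked at or below that cutoff.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff:
            rejected[idx] = True
    return rejected


# Hypothetical per-pair p-values, not taken from the board.
pair_p_values = [0.001, 0.008, 0.039, 0.041, 0.6]
flags = benjamini_hochberg(pair_p_values)
significant_share = sum(flags) / len(flags)
print(flags, significant_share)
```

As the board notes, this share counts how many pairs clear the threshold; it says nothing about how large the per-pair correlations are.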
Figure: Pair signal: do confidence gaps rank performance gaps? (Kendall tau-b; axis from -1.0 to 1.0)
Series: observed model pairs, base-rate-matched null, calibration-preserving null. 5%-95% observed range: -0.188 to 0.280.
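
The pair signal is a Kendall tau-b, which corrects for ties in either variable. The page does not spell out what is correlated within each pair; the sketch below assumes, for illustration, that each pair contributes per-item confidence gaps and per-item correctness gaps between its two models. All data and names here are hypothetical.

```python
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall tau-b between two equal-length sequences, with tie correction."""
    n = len(x)
    concordant = discordant = tied_x_only = tied_y_only = 0
    for i in range(n):
        for j in range(i + 1, n):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both variables: excluded from every count
            if dx == 0:
                tied_x_only += 1
            elif dy == 0:
                tied_y_only += 1
            elif (dx > 0) == (dy > 0):
                concordant += 1
            else:
                discordant += 1
    denom = sqrt((concordant + discordant + tied_x_only)
                 * (concordant + discordant + tied_y_only))
    if denom == 0:
        return float("nan")  # undefined when one variable is constant
    return (concordant - discordant) / denom


# Hypothetical data for one model pair (A, B) over five items.
conf_a, conf_b = [90, 60, 80, 30, 70], [50, 70, 40, 60, 70]   # confidence in %
correct_a, correct_b = [1, 0, 1, 0, 1], [0, 1, 1, 1, 1]       # 1 = solved

conf_gap = [a - b for a, b in zip(conf_a, conf_b)]
perf_gap = [a - b for a, b in zip(correct_a, correct_b)]
tau = kendall_tau_b(conf_gap, perf_gap)
print(tau)
```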

Confidence structure: the shared-difficulty factor

PC1 = 25.3% of variance

Figure: variance share by eigenvalue rank (19 models)
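
A PC1 variance share is the largest eigenvalue of a covariance (or correlation) matrix divided by the sum of all eigenvalues, which equals the matrix trace. The sketch below is a dependency-free illustration using power iteration on an items-by-models confidence matrix. Whether the page uses covariance or correlation, and how the matrix is built, is not stated, so these choices and all names are assumptions.

```python
from math import sqrt


def covariance_matrix(rows):
    """Sample covariance between the columns of `rows` (items x models)."""
    n, m = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(m)]
    return [[sum((r[a] - means[a]) * (r[b] - means[b]) for r in rows) / (n - 1)
             for b in range(m)] for a in range(m)]


def pc1_variance_share(rows, iterations=500):
    """Share of total variance carried by the first principal component."""
    cov = covariance_matrix(rows)
    m = len(cov)
    # Uniform start vector: assumes PC1 is not orthogonal to it, which holds
    # for a shared factor whose loadings all have the same sign.
    v = [1.0 / sqrt(m)] * m
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(m)) for a in range(m)]
        norm = sqrt(sum(x * x for x in w))
        if norm == 0.0:
            return 0.0  # no variance at all
        v = [x / norm for x in w]
    # Rayleigh quotient gives the top eigenvalue; the trace is the total variance.
    top = sum(v[a] * sum(cov[a][b] * v[b] for b in range(m)) for a in range(m))
    total = sum(cov[a][a] for a in range(m))
    return top / total


# Hypothetical confidence matrix: 4 items (rows) x 3 models (columns).
confidences = [
    [0.9, 0.8, 0.7],
    [0.4, 0.5, 0.6],
    [0.7, 0.9, 0.5],
    [0.2, 0.3, 0.4],
]
print(pc1_variance_share(confidences))
```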

Information the model doesn’t use

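| Benchmark | Model | Internal | Population | Same-context text | Post-hoc text |
|-----------|-------|----------|------------|-------------------|---------------|
| MathBench | GPT-4o | 0.630 | 0.742 | 0.703 | 0.773 |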

Values are ROC AUCs from the paper's calibration-ROC table (SI). "Same-context text" is the pre-judgement reasoning trace available before the binary confidence commit; "post-hoc text" reads the answer attempt and is an upper-bound comparison.
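
For readers unfamiliar with the metric, a ROC AUC can be read as the probability that a randomly chosen correct trial receives a higher score than a randomly chosen incorrect one, with ties counted as half. The sketch below computes it by direct pairwise comparison. It is not the paper's pipeline, and the scores and labels are hypothetical.

```python
def roc_auc(scores, labels):
    """ROC AUC via pairwise comparison of positive and negative scores."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


# Hypothetical confidence scores and correctness labels (1 = correct).
scores = [0.9, 0.4, 0.7, 0.6, 0.2]
labels = [1, 1, 0, 1, 0]
print(roc_auc(scores, labels))
```

An AUC of 0.5 means the score carries no ranking information about correctness, and 1.0 means it separates correct from incorrect trials perfectly.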

The four metacognitive outcomes on MathBench

No curated cases for this selection yet — outcome-matrix extraction currently covers a sample of MMLU-Pro trials.