
OmniMath (advanced math)

Unaided means: solve the problem without hints or a worked solution.
Probes: Prospective + Counterfactual

Operating-point board

OmniMath (advanced math) · Prospective · Fβ

 #  Model                            Fβ     Prec  Rec   Task acc
 1  gemini-3.1-pro-preview           0.941  0.89  0.99  0.88
 2  gpt-5.2                          0.923  0.88  0.97  0.85
 3  deepseek-r1                      0.840  0.72  1.00  0.70
 4  gemini-2.5-pro                   0.830  0.71  1.00  0.70
 5  gemini-3-flash-preview           0.830  0.71  1.00  0.71
 6  gemini-2.5-flash                 0.786  0.65  1.00  0.64
 7  claude-haiku-4.5                 0.747  0.67  0.85  0.58
 8  claude-sonnet-4.5                0.718  0.63  0.83  0.57
 9  deepseek-chat                    0.713  0.59  0.89  0.56
10  gemini-2.0-flash-001             0.642  0.47  0.99  0.47
11  mistral-small-3.2-24b-instruct   0.631  0.49  0.90  0.44
12  mistral-medium-3.1               0.601  0.83  0.47  0.55
13  qwen-2.5-72b-instruct            0.568  0.50  0.65  0.34
14  llama-3.3-70b-instruct           0.537  0.44  0.68  0.33
15  gpt-4o                           0.524  0.40  0.76  0.30
16  gpt-4o-mini                      0.469  0.31  0.99  0.30
17  llama-3.1-70b-instruct           0.461  0.36  0.65  0.21
18  claude-3-haiku                   0.291  0.20  0.51  0.13
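
For readers who want to check how the Fβ column relates to the Prec and Rec columns, here is a minimal sketch of the standard Fβ formula. The board does not show which β it uses, so `beta=1.0` below is an assumption. With that value the sketch lands within rounding distance of the listed scores for the three rows tried, which is suggestive but not confirmation. The function name `f_beta` is illustrative.

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall.

    beta > 1 weights recall more heavily, beta < 1 weights precision more.
    beta = 1.0 is an ASSUMPTION here; the board does not state its value.
    """
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)


# Rounded (precision, recall) pairs copied from three rows of the board.
rows = {
    "gpt-5.2": (0.88, 0.97),
    "mistral-medium-3.1": (0.83, 0.47),
    "gpt-4o": (0.40, 0.76),
}

scores = {name: round(f_beta(p, r), 3) for name, (p, r) in rows.items()}
for name, score in scores.items():
    print(f"{name}: {score}")
```

Small differences from the board (for example 0.600 against the listed 0.601 for mistral-medium-3.1) are expected, because the inputs here are already rounded to two decimals.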

Pairwise signal on OmniMath

Observed correlation: 0.025 (mean τ-b · 0 = no signal)
Baseline correlation: -0.000 (permutation null · ≈ 0)
Model pairs: 153 (unit = pair, not model)
Significant pairs: 20.3% (p<0.05 after FDR · not effect size)
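
The page says "after FDR" without naming the correction procedure. The sketch below assumes the Benjamini-Hochberg step-up procedure, which is a common choice but may not be what the page uses. The p-values are made up for illustration and the function name `benjamini_hochberg` is hypothetical.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return one boolean per p-value: True where it is rejected at FDR level alpha.

    ASSUMPTION: Benjamini-Hochberg; the source only says "after FDR".
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])

    # Find the largest 1-based rank k with p_(k) <= (k / m) * alpha.
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff = rank

    # Reject every hypothesis ranked at or below the cutoff.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff:
            rejected[idx] = True
    return rejected


# Illustrative p-values, one per model pair (not taken from the page).
p_values = [0.001, 0.008, 0.039, 0.041, 0.27, 0.60]
flags = benjamini_hochberg(p_values)
share_significant = sum(flags) / len(flags)
```

In this toy input, 0.039 and 0.041 fall below 0.05 on their own but do not survive the correction. This is why the share of significant pairs is reported after FDR, and it says nothing about how large any pair's effect is.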
Figure: Pair signal: do confidence gaps rank performance gaps? (Kendall tau-b), x-axis from -1.0 to 1.0.
Series: Observed model pairs · Base-rate-matched null · Calibration-preserving null. 5%-95% observed: -0.082 to 0.175
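
To make the pair-level statistic concrete, here is a minimal sketch of Kendall tau-b applied to one model pair. The page does not spell out how the gaps are constructed, so the per-item definitions below (confidence of model A minus confidence of model B, and likewise for correctness) are an assumption, and the data are invented. Names such as `kendall_tau_b` and `pair_signal` are illustrative.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall tau-b: rank correlation with a correction for ties."""
    concordant = discordant = ties_x_only = ties_y_only = 0
    for i, j in combinations(range(len(x)), 2):
        dx = x[i] - x[j]
        dy = y[i] - y[j]
        if dx == 0 and dy == 0:
            continue  # tied in both variables: contributes to no count
        if dx == 0:
            ties_x_only += 1
        elif dy == 0:
            ties_y_only += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    denom = sqrt((concordant + discordant + ties_x_only)
                 * (concordant + discordant + ties_y_only))
    return (concordant - discordant) / denom if denom else 0.0


# Toy per-item data for one model pair (ASSUMED layout, not from the page).
conf_a, conf_b = [1, 1, 0, 1, 0, 1], [0, 1, 1, 1, 0, 0]
correct_a, correct_b = [1, 1, 0, 0, 0, 1], [0, 1, 1, 0, 1, 0]

conf_gap = [a - b for a, b in zip(conf_a, conf_b)]
perf_gap = [a - b for a, b in zip(correct_a, correct_b)]
pair_signal = kendall_tau_b(conf_gap, perf_gap)

# With 18 models on the board, the number of unordered pairs matches the page.
n_pairs = len(list(combinations(range(18), 2)))
```

Under this reading, the "observed correlation" would be the mean of `pair_signal` over all `n_pairs` pairs, and the null curves would come from recomputing it on shuffled or resampled data.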

Confidence structure: the shared-difficulty factor

Figure: variance share by eigenvalue rank. PC1 = 36.7% of variance. 19 models · 5 negative eigenvalues.
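
A sketch of how a PC1 variance share and a count of negative eigenvalues can be read off a correlation matrix, using NumPy. Two things here are assumptions: that the share is the eigenvalue divided by the sum of all eigenvalues, and that negative eigenvalues arise because the matrix is assembled from pairwise correlations (such a matrix is not guaranteed to be positive semidefinite). The 3×3 matrix is a toy example chosen to be non-PSD, not data from the page.

```python
import numpy as np


def variance_shares(corr):
    """Return (shares in descending order, number of negative eigenvalues).

    ASSUMPTION: share = eigenvalue / sum of eigenvalues (the matrix trace).
    """
    eigvals = np.linalg.eigvalsh(np.asarray(corr, dtype=float))[::-1]
    shares = eigvals / eigvals.sum()
    n_negative = int((eigvals < 0).sum())
    return shares, n_negative


# Toy symmetric matrix with unit diagonal that is NOT positive semidefinite:
# the three pairwise correlations are mutually inconsistent.
corr = [
    [1.0, 0.9, -0.9],
    [0.9, 1.0, 0.9],
    [-0.9, 0.9, 1.0],
]

shares, n_negative = variance_shares(corr)
pc1_share = float(shares[0])
```

`np.linalg.eigvalsh` is used because the matrix is symmetric; it returns eigenvalues in ascending order, hence the reversal.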

Information the model doesn’t use

Benchmark  Model               Internal  Population  Same-context text  Post-hoc text
OmniMath   Mistral Medium 3.1  0.664     0.728       0.619              0.754

Values are ROC AUCs from the paper's calibration-ROC table (SI). "Same-context text" is the pre-judgement reasoning trace available before the binary confidence commit; "post-hoc text" reads the answer attempt and is an upper-bound comparison.
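
As a reminder of what the ROC AUC values in the table measure, here is a minimal sketch that computes AUC as the probability that a randomly chosen correct trial gets a higher score than a randomly chosen incorrect one, with ties counted as one half. The scores and labels are invented, and `roc_auc` is an illustrative name rather than the paper's code.

```python
def roc_auc(scores, labels):
    """ROC AUC via pairwise comparison of positive and negative scores.

    labels: 1 = the model's answer was correct, 0 = incorrect.
    scores: any confidence-like signal, higher meaning "more likely correct".
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


# Toy data: six trials with a confidence score and a correctness label.
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.8]
labels = [1, 1, 0, 1, 0, 0]
auc = roc_auc(scores, labels)
```

An AUC of 0.5 means the signal does not separate correct from incorrect trials at all, and 1.0 means it separates them perfectly. Each column in the table can be read as the same computation run with a different source for `scores`.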

The four metacognitive outcomes on OmniMath

No curated cases for this selection yet — outcome-matrix extraction currently covers a sample of MMLU-Pro trials.