MathBench (competition math) | The Metacognition Bench

Operating-point board

MathBench (competition math) · Prospective · F_β β

#	Model	F_β	Prec	Rec	Task acc
1	gemini-3.1-pro-preview	1.000	1.00	1.00	1.00
2	gpt-5.2	0.998	1.00	1.00	1.00
3	gemini-3-flash-preview	0.991	0.98	1.00	0.98
4	deepseek-r1	0.988	0.98	1.00	0.98
5	gemini-2.5-flash	0.973	0.95	1.00	0.95
6	claude-haiku-4.5	0.972	0.95	1.00	0.95
7	claude-sonnet-4.5	0.967	0.94	1.00	0.93
8	deepseek-chat	0.962	0.95	0.97	0.95
9	gemini-2.5-pro	0.955	0.91	1.00	0.91
10	gemini-2.0-flash-001	0.945	0.90	1.00	0.89
11	mistral-small-3.2-24b-instruct	0.923	0.87	0.98	0.87
12	mistral-medium-3.1	0.916	0.95	0.88	0.94
13	qwen-2.5-72b-instruct	0.910	0.87	0.95	0.83
14	gpt-4o	0.894	0.81	1.00	0.80
15	gpt-4o-mini	0.878	0.78	1.00	0.78
16	llama-3.3-70b-instruct	0.853	0.75	0.99	0.74
17	llama-3.1-70b-instruct	0.782	0.65	0.99	0.63
18	claude-3-haiku	0.595	0.43	0.98	0.41
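
The F_β column combines precision and recall into one score. A minimal sketch of the standard F_β formula follows. The page does not state the β in use, so `beta=1.0` is only a placeholder assumption, and because the printed precision and recall are rounded to two decimals, the result will not match the F_β column exactly.

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """Standard F_beta: weighted harmonic mean of precision and recall.

    beta > 1 weights recall more heavily; beta < 1 weights precision more.
    """
    if beta <= 0:
        raise ValueError("beta must be positive")
    denom = beta**2 * precision + recall
    if denom == 0:
        # Both precision and recall are zero; define the score as 0.
        return 0.0
    return (1 + beta**2) * precision * recall / denom


# Rounded precision/recall from the claude-3-haiku row.
# beta=1.0 is an assumption, not a value taken from the page.
example = f_beta(0.43, 0.98, beta=1.0)
print(round(example, 3))
```

Rows with recall at or near 1.00 have F_β driven almost entirely by precision, which is why the ranking tracks the Prec column so closely.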

Pairwise signal on MathBench

Metric	Value	Note
Observed correlation	0.072	mean τ-b; 0 = no signal
Baseline correlation	0.000	permutation null; ≈ 0
Model pairs	135	unit = pair, not model
Significant pairs	28.9%	p<0.05 after FDR; not an effect size
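
The "Significant pairs" share depends on a false-discovery-rate correction across all model pairs. The page does not name the procedure, so the sketch below assumes Benjamini-Hochberg, the most common choice. The p-values are made up for illustration and are not taken from the benchmark.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return one boolean per p-value: True if rejected at FDR level alpha.

    Step-up rule: find the largest rank k (1-based, ascending p-values)
    with p_(k) <= k / m * alpha, then reject the k smallest p-values.
    """
    m = len(p_values)
    if m == 0:
        return []
    order = sorted(range(m), key=lambda i: p_values[i])
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_k = rank
    rejected = [False] * m
    for idx in order[:max_k]:
        rejected[idx] = True
    return rejected


def significant_fraction(p_values, alpha=0.05):
    """Share of tests that survive the correction."""
    flags = benjamini_hochberg(p_values, alpha)
    return sum(flags) / len(flags) if flags else 0.0


# Hypothetical per-pair p-values (illustrative only).
demo_p = [0.001, 0.008, 0.039, 0.041, 0.20, 0.74]
demo_flags = benjamini_hochberg(demo_p)
print(demo_flags, significant_fraction(demo_p))
```

A surviving pair only says its correlation is unlikely under the null. It says nothing about how large that correlation is, which matches the page's "not effect size" caveat.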

Figure legend: observed model pairs; base-rate-matched null; calibration-preserving null. Observed 5%-95% range: -0.188 to 0.280.
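
The observed and baseline correlations above are Kendall τ-b values compared against a permutation null. The sketch below computes τ-b for one pair of score vectors and builds a plain shuffle null. It is a simplified stand-in: the base-rate-matched and calibration-preserving nulls named in the legend are not specified on this page and are not reproduced here. The inputs `x` and `y` are hypothetical per-item scores for two models.

```python
import math
import random


def kendall_tau_b(x, y):
    """Kendall tau-b, which corrects for ties in either variable.

    O(n^2) pairwise loop; fine for small illustrative inputs.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    n = len(x)
    concordant = discordant = ties_x = ties_y = 0
    for i in range(n):
        for j in range(i + 1, n):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0:
                ties_x += 1
            if dy == 0:
                ties_y += 1
            if dx * dy > 0:
                concordant += 1
            elif dx * dy < 0:
                discordant += 1
    n0 = n * (n - 1) // 2
    denom = math.sqrt((n0 - ties_x) * (n0 - ties_y))
    # If one variable is constant, tau-b is undefined; return 0 by convention here.
    return (concordant - discordant) / denom if denom > 0 else 0.0


def permutation_null(x, y, n_perm=1000, seed=0):
    """tau-b values obtained after shuffling y, which breaks any item-level link."""
    rng = random.Random(seed)
    shuffled = list(y)
    taus = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        taus.append(kendall_tau_b(x, shuffled))
    return taus


# Hypothetical scores for two models on the same eight items.
x = [0, 1, 2, 3, 4, 5, 6, 7]
y = [1, 0, 3, 2, 5, 4, 7, 6]
observed = kendall_tau_b(x, y)
null = permutation_null(x, y)
null_mean = sum(null) / len(null)
print(observed, null_mean)
```

Averaging τ-b over all model pairs and comparing it with the mean of the null gives the kind of observed-versus-baseline contrast the page reports.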

Confidence structure: the shared-difficulty factor

Information the model doesn’t use

Benchmark	Model	Internal	Population	Same-context text	Post-hoc text
MathBench	GPT-4o	0.630	0.742	0.703	0.773

Values are ROC AUCs from the paper's calibration-ROC table (SI). "Same-context text" is the pre-judgement reasoning trace available before the binary confidence commit; "post-hoc text" reads the answer attempt and is an upper-bound comparison.
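
For readers unfamiliar with the metric, a ROC AUC is the probability that a randomly chosen correct trial receives a higher score than a randomly chosen incorrect one, with ties counted as one half. The sketch below computes it directly from that definition. The `scores` and `labels` are made-up values, and treating each column as a per-trial score that predicts correctness is an assumption about how the table was built.

```python
def roc_auc(scores, labels):
    """ROC AUC via pairwise comparison of positive and negative examples.

    labels are truthy for positives (e.g. the model answered correctly).
    """
    pos = [s for s, l in zip(scores, labels) if l]
    neg = [s for s, l in zip(scores, labels) if not l]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


# Hypothetical confidence scores and correctness labels (illustrative only).
scores = [0.9, 0.8, 0.4, 0.3]
labels = [1, 0, 1, 0]
auc = roc_auc(scores, labels)
print(auc)
```

An AUC of 0.5 means the score carries no information about correctness, so the gap between the Internal column and the other three is the "information the model doesn't use".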

The four metacognitive outcomes on MathBench

No curated cases are available for this selection yet. Outcome-matrix extraction currently covers a sample of MMLU-Pro trials.