OmniMath (advanced math) | The Metacognition Bench

Operating-point board

OmniMath (advanced math) · Prospective · F_β

Rank	Model	F_β	Precision	Recall	Task accuracy
1	gemini-3.1-pro-preview	0.941	0.89	0.99	0.88
2	gpt-5.2	0.923	0.88	0.97	0.85
3	deepseek-r1	0.840	0.72	1.00	0.70
4	gemini-2.5-pro	0.830	0.71	1.00	0.70
5	gemini-3-flash-preview	0.830	0.71	1.00	0.71
6	gemini-2.5-flash	0.786	0.65	1.00	0.64
7	claude-haiku-4.5	0.747	0.67	0.85	0.58
8	claude-sonnet-4.5	0.718	0.63	0.83	0.57
9	deepseek-chat	0.713	0.59	0.89	0.56
10	gemini-2.0-flash-001	0.642	0.47	0.99	0.47
11	mistral-small-3.2-24b-instruct	0.631	0.49	0.90	0.44
12	mistral-medium-3.1	0.601	0.83	0.47	0.55
13	qwen-2.5-72b-instruct	0.568	0.50	0.65	0.34
14	llama-3.3-70b-instruct	0.537	0.44	0.68	0.33
15	gpt-4o	0.524	0.40	0.76	0.30
16	gpt-4o-mini	0.469	0.31	0.99	0.30
17	llama-3.1-70b-instruct	0.461	0.36	0.65	0.21
18	claude-3-haiku	0.291	0.20	0.51	0.13
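
As a reading aid for the F_β column, here is a minimal sketch of how an F_β score combines precision and recall. This section does not state the value of β used by the board, so `beta=1.0` below is an assumption made purely for illustration, and `f_beta` is a hypothetical helper name. Because the precision and recall columns are rounded to two decimals, a recomputed score can differ from the listed one in the third decimal.

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall.

    beta > 1 weights recall more heavily, beta < 1 weights precision more.
    beta = 1.0 is an assumed default; the board's actual beta is not given here.
    """
    if precision == 0.0 and recall == 0.0:
        return 0.0  # avoid division by zero when both inputs are zero
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)


# Rounded inputs from the mistral-medium-3.1 row (Prec 0.83, Rec 0.47).
score = f_beta(0.83, 0.47, beta=1.0)
print(round(score, 3))
```

Keeping `beta` as a parameter makes it easy to check which weighting reproduces the listed scores once the unrounded precision and recall are available.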

Pairwise signal on OmniMath

Statistic	Value	Note
Observed correlation	0.025	mean τ-b · 0 = no signal
Baseline correlation	-0.000	permutation null · ≈ 0
Model pairs	153	unit = pair, not model
Significant pairs	20.3%	p<0.05 after FDR · not effect size
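
The page reports significance "after FDR" without naming the correction procedure. The sketch below assumes the Benjamini-Hochberg step-up procedure, which is a common choice but is not confirmed by this section; `benjamini_hochberg` is a hypothetical helper and the p-values in the example are made up. For scale, 20.3% of 153 pairs is about 31 pairs.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return one boolean per p-value: True if rejected at FDR level alpha.

    Assumed procedure (Benjamini-Hochberg): sort the m p-values, find the
    largest rank k with p_(k) <= (k / m) * alpha, and reject ranks 1..k.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff = rank  # keep the largest rank that passes
    rejected = [False] * m
    for idx in order[:cutoff]:
        rejected[idx] = True
    return rejected


# Illustrative p-values only, not taken from the benchmark.
example = [0.001, 0.04, 0.03, 0.5]
flags = benjamini_hochberg(example)
print(sum(flags) / len(flags))  # fraction of significant tests
```

As the page's own note says, the fraction of significant pairs measures how often a nonzero correlation is detectable, not how large it is.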

Figure: distribution of pairwise τ-b. Series: observed model pairs, base-rate-matched null, calibration-preserving null. 5%-95% observed range: -0.082 to 0.175.
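
To make the pairwise statistic concrete, here is a minimal pure-Python sketch of Kendall's τ-b for one model pair, plus a simple shuffle-based null. It assumes each model contributes one value per item (for example a binary confidence commit), which this section does not spell out. The shuffle null shown here is a plain permutation of one model's values; it is not claimed to be the base-rate-matched or calibration-preserving null named in the figure. The names `kendall_tau_b` and `permutation_null_mean` are hypothetical.

```python
import math
import random


def kendall_tau_b(x, y):
    """Kendall's tau-b, which corrects for ties in either variable."""
    concordant = discordant = ties_x_only = ties_y_only = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both variables: excluded from every term
            elif dx == 0:
                ties_x_only += 1
            elif dy == 0:
                ties_y_only += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    base = concordant + discordant
    denom = math.sqrt((base + ties_x_only) * (base + ties_y_only))
    return (concordant - discordant) / denom if denom else 0.0


def permutation_null_mean(x, y, n_perm=200, seed=0):
    """Mean tau-b after shuffling y, which breaks any item-level alignment."""
    rng = random.Random(seed)
    shuffled = list(y)
    taus = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        taus.append(kendall_tau_b(x, shuffled))
    return sum(taus) / len(taus)


# 18 models on the board give C(18, 2) unordered pairs.
n_pairs = math.comb(18, 2)
print(n_pairs)
```

With 18 models on the board, the 153 pairs are exactly the unordered pairs of models, which is why the page treats the pair rather than the model as the unit of analysis.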

Confidence structure: the shared-difficulty factor

Information the model doesn’t use

Benchmark	Model	Internal	Population	Same-context text	Post-hoc text
OmniMath	Mistral Medium 3.1	0.664	0.728	0.619	0.754

Values are ROC AUCs from the paper's calibration-ROC table (SI). "Same-context text" is the pre-judgement reasoning trace available before the binary confidence commit; "post-hoc text" reads the answer attempt and is an upper-bound comparison.
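
For readers unfamiliar with the metric in this table, the following is a minimal sketch of ROC AUC computed as the probability that a randomly chosen positive item receives a higher score than a randomly chosen negative one. It assumes the label is whether the model answered the item correctly and the score is the confidence signal from one column; neither detail is stated in this section. `roc_auc` is a hypothetical helper and the example data are made up.

```python
def roc_auc(scores, labels):
    """ROC AUC via pairwise comparison (Mann-Whitney formulation).

    scores: one confidence-like value per item.
    labels: 1 for a correct answer, 0 for an incorrect one (assumed meaning).
    Ties between a positive and a negative count as half a win.
    """
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Illustrative data only, not taken from the benchmark.
auc = roc_auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])
print(auc)
```

An AUC of 0.5 means the signal does not separate correct from incorrect answers, so on this reading the row compares how much discriminating information each source carries.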

The four metacognitive outcomes on OmniMath

No curated cases for this selection yet — outcome-matrix extraction currently covers a sample of MMLU-Pro trials.