MMLU-Pro (knowledge) | The Metacognition Bench

Operating-point board

MMLU-Pro (knowledge) · Prospective · F_β

#	Model	F_β	Prec	Rec	Task acc
1	gpt-5.2	0.861	0.82	0.91	0.73
2	gemini-2.5-pro	0.822	0.73	0.94	0.67
3	claude-sonnet-4.5	0.819	0.78	0.87	0.70
4	gemini-3.1-pro-preview	0.814	0.75	0.89	0.66
5	deepseek-r1	0.799	0.72	0.90	0.65
6	gemini-3-flash-preview	0.778	0.65	0.98	0.61
7	claude-haiku-4.5	0.744	0.73	0.76	0.65
8	claude-3.5-sonnet	0.705	0.58	0.89	0.54
9	deepseek-chat	0.693	0.55	0.95	0.51
10	gpt-4o	0.659	0.51	0.94	0.47
11	mistral-medium-3.1	0.648	0.52	0.87	0.48
12	gemini-2.5-flash	0.636	0.48	0.96	0.46
13	gemini-2.0-flash-001	0.623	0.45	1.00	0.45
14	gpt-4o-mini	0.595	0.43	0.95	0.40
15	mistral-small-3.2-24b-instruct	0.573	0.41	0.95	0.39
16	qwen-2.5-72b-instruct	0.570	0.46	0.76	0.41
17	llama-3.1-70b-instruct	0.549	0.39	0.95	0.36
18	llama-3.3-70b-instruct	0.525	0.36	0.94	0.34
19	claude-3-haiku	0.512	0.35	0.93	0.35
20	qwen-2.5-coder-32b-instruct	0.405	0.25	1.00	0.25
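The F_β column can be recomputed from the Prec and Rec columns. The sketch below uses the standard F_β formula. The β value is not shown in this extract, so `beta=1.0` is an assumption; the tabulated scores agree with β = 1 up to the rounding of the two-decimal precision and recall. The function name `f_beta` is illustrative.

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (standard F_beta)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)


# (precision, recall, reported F_beta) copied from three rows of the board.
rows = {
    "gpt-5.2": (0.82, 0.91, 0.861),
    "gemini-2.5-pro": (0.73, 0.94, 0.822),
    "claude-haiku-4.5": (0.73, 0.76, 0.744),
}

for name, (p, r, reported) in rows.items():
    # beta = 1 is assumed here; small gaps come from rounded inputs.
    print(f"{name}: recomputed {f_beta(p, r):.3f}, reported {reported:.3f}")
```

Because the inputs are rounded to two decimals, the recomputed values can differ from the reported ones in the third decimal place.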

Pairwise signal on MMLU-Pro

Observed correlation

0.041

mean τ-b · 0 = no signal

Baseline correlation

0.001

permutation null · ≈ 0

Model pairs

190

unit = pair, not model
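The 190 pairs are the unordered pairs of the 20 models on the board. The page does not say which per-item quantities are correlated, so the sketch below assumes two binary per-item vectors (hypothetical toy data) and shows one way to compute Kendall's τ-b and a shuffle-based permutation null. `kendall_tau_b` and `permutation_null` are illustrative names, not the benchmark's code.

```python
import itertools
import math
import random


def kendall_tau_b(x, y):
    """Kendall's tau-b with tie correction; 0.0 if either input is constant."""
    concordant = discordant = ties_x = ties_y = 0
    for (xi, yi), (xj, yj) in itertools.combinations(zip(x, y), 2):
        dx, dy = xi - xj, yi - yj
        if dx == 0 and dy == 0:
            continue  # tied on both variables: excluded from both factors
        if dx == 0:
            ties_x += 1
        elif dy == 0:
            ties_y += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    denom = math.sqrt((untied + ties_x) * (untied + ties_y))
    return (concordant - discordant) / denom if denom else 0.0


def permutation_null(x, y, n_perm=200, seed=0):
    """Tau-b values after shuffling y, which breaks any item-level pairing."""
    rng = random.Random(seed)
    shuffled = list(y)
    taus = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        taus.append(kendall_tau_b(x, shuffled))
    return taus


print(math.comb(20, 2))  # number of unordered pairs among 20 models

# Hypothetical toy data: per-item binary decisions for two models.
rng = random.Random(1)
model_a = [rng.randint(0, 1) for _ in range(60)]
model_b = [rng.randint(0, 1) for _ in range(60)]
null = permutation_null(model_a, model_b)
print(kendall_tau_b(model_a, model_b), sum(null) / len(null))
```

The pairwise loop is quadratic in the number of items, which is fine for a sketch but slow for large item sets.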

Significant pairs

35.8%

p<0.05 after FDR · not effect size
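For scale, 35.8% of 190 pairs is about 68 pairs. The page says only "after FDR" and does not name the procedure. The sketch below assumes the Benjamini-Hochberg step-up procedure, which is a common choice, and the p-values in it are made up for illustration.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return booleans: True where the hypothesis is rejected at FDR level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest 1-based rank k with p_(k) <= (k / m) * alpha.
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff = rank
    # Step-up rule: reject every hypothesis ranked at or below the cutoff.
    rejected = [False] * m
    for idx in order[:cutoff]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values, one per model pair.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
flags = benjamini_hochberg(p_values)
print(f"{sum(flags)} of {len(flags)} pairs significant after FDR")
```

As the card notes, the share of significant pairs says how often a correlation is distinguishable from zero, not how large it is.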

Figure legend: observed model pairs; base-rate-matched null; calibration-preserving null. Observed 5%-95% range: -0.030 to 0.130.

Confidence structure: the shared-difficulty factor

Information the model doesn’t use

Benchmark	Model	Internal	Population	Same-context text	Post-hoc text
MMLU-Pro	GPT-4o	0.632	0.630	0.706	0.717

Values are ROC AUCs from the paper's calibration-ROC table (SI). "Same-context text" is the pre-judgement reasoning trace available before the binary confidence commit; "post-hoc text" reads the answer attempt and is an upper-bound comparison.
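As a reminder of what these numbers measure, a ROC AUC is the probability that a randomly chosen positive item receives a higher score than a randomly chosen negative item, with ties counted as one half. The sketch below computes it that way. It assumes the label is whether the answer turned out correct and the score is whichever confidence signal is being evaluated; the scores shown are hypothetical.

```python
def roc_auc(scores, labels):
    """ROC AUC as the pairwise win rate of positives over negatives."""
    pos = [s for s, label in zip(scores, labels) if label]
    neg = [s for s, label in zip(scores, labels) if not label]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative item")
    # A positive scored above a negative is a win; an equal score is half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical confidence scores and correctness labels for four items.
scores = [0.8, 0.4, 0.6, 0.2]
labels = [1, 1, 0, 0]
print(roc_auc(scores, labels))
```

An AUC of 0.5 means the signal does not separate correct from incorrect answers at all, so the tabulated values between 0.63 and 0.72 indicate modest separation.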

The four metacognitive outcomes on MMLU-Pro

↳ Turned out correct

↳ Turned out wrong

Competent: claimed it could, and could

claude-3.5-sonnet · MMLU-Pro

What is the worldwide prevalence of obesity?

It said it could answer: “Yes, I am familiar with global obesity statistics from WHO data. The worldwide prevalence of obesity is a well-documented public health metric that is regularly tracked and reported. While the exact number fluctuates yearly, I can provide a reliable approximate percentage.”

→ And it did. It answered “According to WHO, approximately 13% of adults worldwide (age 18+) were obese in 2016, with rates continuing to rise globally.”; expected “13%”.

Overconfident: claimed it could, but couldn’t

gpt-4o · MMLU-Pro

When did the first Jaina temples appear?

It said it could answer: “The first Jaina temples are believed to have appeared around the 4th to 5th century BCE, during the time when Jainism was gaining prominence in India. This period corresponds with the rise of the Maurya Empire and the spread of Jainism under the influence of figures like Chandragupta Maurya and his grandson Ashoka, who were known to have supported Jainism. Therefore, I can answer this question without needing multiple-choice options.”

→ But it couldn't. It answered “The first Jaina temples appeared around the 6th century BCE.”; expected “Fifth century CE”.

Underconfident: declined, but could

claude-sonnet-4.5 · MMLU-Pro

Find the generator for the finite field Z_11.

It said it couldn't answer: “The question asks for 'the' generator of Z_11, but Z_11 (the multiplicative group of the finite field) has multiple generators - specifically φ(φ(11)) = φ(10) = 4 generators. Without seeing the multiple choice options, I cannot determine which specific generator they want (e.g., 2, 6, 7, or 8 are all primitive roots mod 11). The question is ambiguous as stated.”

→ But it could. It answered “A generator for Z_11 (the multiplicative group of non-zero elements) is 2. (Other generators include 6, 7, and 8)”; expected “2”.
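The model's claim about multiple generators can be checked by brute force. The sketch below lists every element of the multiplicative group mod 11 whose order is 10; the helper name `multiplicative_order` is illustrative.

```python
def multiplicative_order(a: int, p: int) -> int:
    """Smallest k >= 1 with a**k congruent to 1 mod p (a must be coprime to p)."""
    order, value = 1, a % p
    while value != 1:
        value = value * a % p
        order += 1
    return order


# An element generates the group of non-zero residues mod 11 iff its order is 10.
generators = [a for a in range(1, 11) if multiplicative_order(a, 11) == 10]
print(generators)  # [2, 6, 7, 8]
```

This confirms that 2 is the smallest generator and that the four candidates named in the model's reply are the complete set.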

Well-declined: declined, and couldn’t

gpt-4o · MMLU-Pro

Find the last 3 digits of 2003^(2002^2001).

It said it couldn't answer: “To find the last 3 digits of a number, we need to compute it modulo 1000. This involves using modular arithmetic techniques such as Euler's theorem or the Chinese Remainder Theorem. Given the large exponents, the problem can be simplified using properties of modular arithmetic and possibly the Carmichael function to reduce the exponent. However, without performing detailed calculations, it's not immediately clear if the solution can be found without error. Therefore, while I can attempt to solve it, the complexity suggests that having multiple-choice options would help verify the solution.”

→ And it couldn't. It answered “947”; expected “241”.
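The expected answer can be confirmed with the reduction the model describes. Since gcd(2003, 1000) = 1 and the Carmichael function gives λ(1000) = 100, the exponent only matters modulo 100. The sketch below computes the result both directly and through that reduction.

```python
# Direct route: Python integers are arbitrary precision, so the full
# exponent 2002**2001 can be built and passed to three-argument pow.
direct = pow(2003, pow(2002, 2001), 1000)

# Reduced route: lambda(1000) = lcm(lambda(8), lambda(125)) = lcm(2, 100) = 100,
# and 2003 is coprime to 1000, so the exponent can be reduced modulo 100.
reduced_exponent = pow(2002, 2001, 100)
reduced = pow(2003 % 1000, reduced_exponent, 1000)

print(reduced_exponent, direct, reduced)  # 52 241 241
```

Both routes give 241, matching the expected answer rather than the model's 947.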

Rows: top = claimed it could answer · bottom = declined
Columns: left = was actually correct · right = was actually wrong
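The four cells map onto a confusion matrix. The sketch below assumes that "claimed it could answer" is the positive prediction and "was actually correct" is the ground truth, and that task accuracy counts every attempted answer, including those given after declining. That reading fits the board rows with recall 1.00 (gemini-2.0-flash-001 and qwen-2.5-coder-32b-instruct), where precision equals task accuracy, but it is an inference and not something the page states. The records are hypothetical.

```python
from collections import Counter


def outcome(claimed: bool, correct: bool) -> str:
    """Name the cell of the 2x2 grid for one (claimed, correct) record."""
    if claimed:
        return "competent" if correct else "overconfident"
    return "underconfident" if correct else "well-declined"


def summarize(records):
    """Return (cell counts, precision, recall, task accuracy) for the records."""
    counts = Counter(outcome(claimed, correct) for claimed, correct in records)
    tp, fp = counts["competent"], counts["overconfident"]
    fn, tn = counts["underconfident"], counts["well-declined"]
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if tp + fp else 0.0  # correct among claimed
    recall = tp / (tp + fn) if tp + fn else 0.0     # claimed among correct
    accuracy = (tp + fn) / total if total else 0.0  # correct among all items
    return counts, precision, recall, accuracy


# Hypothetical records: (claimed it could answer, answer was correct).
records = [(True, True), (True, True), (True, False), (False, True), (False, False)]
counts, precision, recall, accuracy = summarize(records)
print(dict(counts), precision, recall, accuracy)
```

Under this reading, a model that always claims it can answer has recall 1.0 and a precision equal to its task accuracy.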