Anthropic · Claude | The Metacognition Bench

Where Anthropic · Claude sits in the cloud

Figure legend: Anthropic · Claude model/condition points; all other model/condition points; equal relative confidence and pass rate

Disposition: where each model draws the line

Competitive context, benchmark by benchmark

SQuAD (factual recall) · Prospective · F_β

#	Model	F_β	Prec	Rec	Task acc
1	gemini-3-pro-preview	0.785	0.77	0.80	0.65
2	gemini-3-flash-preview	0.782	0.70	0.88	0.59
3	gemini-2.5-pro	0.764	0.71	0.83	0.58
4	claude-sonnet-4.5	0.753	0.78	0.72	0.59
5	gpt-4o	0.739	0.69	0.80	0.54
6	deepseek-r1	0.733	0.76	0.70	0.57
7	claude-haiku-4.5	0.716	0.79	0.65	0.48
8	mistral-medium-3.1	0.703	0.65	0.77	0.52
9	gpt-4o-mini	0.686	0.56	0.88	0.45
10	gemini-2.0-flash-001	0.683	0.58	0.82	0.45
11	gpt-5.2	0.673	0.84	0.56	0.62
12	gemini-2.5-flash	0.670	0.58	0.79	0.45
13	mistral-small-3.2-24b-instruct	0.666	0.57	0.79	0.44
14	llama-3.1-70b-instruct	0.643	0.65	0.64	0.45
15	claude-3.5-sonnet	0.643	0.79	0.54	0.56
16	llama-3.3-70b-instruct	0.593	0.64	0.55	0.45
17	qwen-2.5-72b-instruct	0.572	0.74	0.47	0.47
18	deepseek-chat	0.567	0.78	0.45	0.55
19	claude-3-haiku	0.554	0.55	0.56	0.40
20	qwen-2.5-coder-32b-instruct	0.476	0.31	1.00	0.31
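The F_β column can be cross-checked against the Prec and Rec columns. The sketch below uses the standard F_β definition (a weighted harmonic mean of precision and recall). The value of β is not stated on this page; β = 1 is an assumption, chosen because it reproduces the listed scores to within the rounding of the two-decimal inputs. The function name `f_beta` is hypothetical and not taken from the benchmark's code.

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """Standard F_beta score: weighted harmonic mean of precision and recall.

    beta = 1.0 is an ASSUMPTION; the page does not state the value it uses.
    """
    if precision == 0.0 and recall == 0.0:
        return 0.0  # avoid division by zero when nothing was claimed correctly
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)


# (precision, recall, reported F_beta) copied from three rows of the table above
rows = {
    "gemini-3-pro-preview": (0.77, 0.80, 0.785),
    "gpt-5.2": (0.84, 0.56, 0.673),
    "qwen-2.5-coder-32b-instruct": (0.31, 1.00, 0.476),
}

for name, (prec, rec, reported) in rows.items():
    recomputed = f_beta(prec, rec)
    print(f"{name}: recomputed {recomputed:.3f}, reported {reported:.3f}")
```

Small differences in the third decimal are expected, because the table shows precision and recall rounded to two places.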

The four metacognitive outcomes — Anthropic · Claude

↳ Turned out correct

↳ Turned out wrong

Competent: claimed it could, and could

claude-3.5-sonnet · MMLU-Pro

What is the worldwide prevalence of obesity?

It said it could answer: “Yes, I am familiar with global obesity statistics from WHO data. The worldwide prevalence of obesity is a well-documented public health metric that is regularly tracked and reported. While the exact number fluctuates yearly, I can provide a reliable approximate percentage.”

→ It said it could, and it did. It answered “According to WHO, approximately 13% of adults worldwide (age 18+) were obese in 2016, with rates continuing to rise globally.”; expected “13%”.

Overconfident: claimed it could, but couldn’t

claude-3.5-sonnet · MMLU-Pro

Find the last 3 digits of 2003^(2002^2001).

It said it could answer: “This is a modular arithmetic problem requiring finding the pattern of powers of 2003 modulo 1000. While it involves large numbers, it can be solved using Euler's theorem, cyclicity of powers, and careful consideration of the exponent 2002^2001 modulo φ(1000). The pattern of last 3 digits will repeat in a cycle.”

→ It said it could, but it couldn't. It answered “187”; expected “241”.

Underconfident: declined, but could

claude-sonnet-4.5 · MMLU-Pro

Find the generator for the finite field Z_11.

It said it couldn't answer: “The question asks for 'the' generator of Z_11, but Z_11 (the multiplicative group of the finite field) has multiple generators - specifically φ(φ(11)) = φ(10) = 4 generators. Without seeing the multiple choice options, I cannot determine which specific generator they want (e.g., 2, 6, 7, or 8 are all primitive roots mod 11). The question is ambiguous as stated.”

→ It said it couldn't, but it could. It answered “A generator for Z_11 (the multiplicative group of non-zero elements) is 2. (Other generators include 6, 7, and 8)”; expected “2”.

Well-declined: declined, and couldn’t

claude-3.5-sonnet · MMLU-Pro

When did the first Jaina temples appear?

It said it couldn't answer: “While I have knowledge about Jainism and its architectural history, pinpointing the exact time period of the first Jaina temples requires careful consideration. The earliest Jaina temples evolved from rock-cut caves to structural temples, but without seeing specific options, I might not be precise enough about the correct time period.”

→ It said it couldn't, and it couldn't. It answered “The first Jaina temples appeared around the 3rd-2nd century BCE during the Mauryan period.”; expected “Fifth century CE”.

Rows: top = claimed it could answer · bottom = declined
Columns: left = was actually correct · right = was actually wrong
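The two-by-two grid described above can be written as a small classifier. This is a minimal sketch, not the benchmark's implementation: the names `classify` and `precision_recall` are hypothetical. The precision and recall definitions are also an assumption (precision = share of claimed questions answered correctly, recall = share of correctly answered questions that were claimed). That reading fits the leaderboard row where recall is 1.00 and precision equals task accuracy, but the page does not spell it out.

```python
from collections import Counter


def classify(claimed: bool, correct: bool) -> str:
    """Map (claimed it could answer, answer was correct) to one of the four outcomes."""
    if claimed:
        return "Competent" if correct else "Overconfident"
    return "Underconfident" if correct else "Well-declined"


def precision_recall(outcomes):
    """ASSUMED definitions: 'positive' means the model claimed it could answer."""
    claimed_correct = sum(1 for claimed, correct in outcomes if claimed and correct)
    claimed_total = sum(1 for claimed, _ in outcomes if claimed)
    correct_total = sum(1 for _, correct in outcomes if correct)
    precision = claimed_correct / claimed_total if claimed_total else 0.0
    recall = claimed_correct / correct_total if correct_total else 0.0
    return precision, recall


# One (claimed, correct) pair per example shown in this section, in order.
examples = [(True, True), (True, False), (False, True), (False, False)]

counts = Counter(classify(claimed, correct) for claimed, correct in examples)
precision, recall = precision_recall(examples)
print(dict(counts))
print(f"precision={precision:.2f} recall={recall:.2f}")
```

With a real run, `examples` would hold one pair per benchmark question rather than the four illustrative cases.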

Pairwise signal: pairs involving Anthropic · Claude

Match accuracy controls for the performance base-rate gap

Anthropic · Claude pairs: 66 / 171

Anthropic · Claude mean tau: +0.036

All-pairs mean: +0.037

Anthropic · Claude p<0.05: 29 (43.9%)
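The pair counts above can be reproduced with simple combinatorics. The sketch below rests on an assumption the page does not state: that 19 models enter the pairwise analysis, 4 of them Claude models. That reading gives 171 pairs in total and 66 pairs containing at least one Claude model, but the leaderboard table lists 20 models, so which model (if any) is left out is unknown. The helper name `pairs_involving` is hypothetical.

```python
from math import comb


def pairs_involving(n_models: int, n_family: int) -> int:
    """Unordered pairs that contain at least one model from the family."""
    return comb(n_models, 2) - comb(n_models - n_family, 2)


N_MODELS = 19  # ASSUMPTION: chosen because comb(19, 2) matches the 171 shown
N_CLAUDE = 4   # the four Claude models listed on this page

total_pairs = comb(N_MODELS, 2)
claude_pairs = pairs_involving(N_MODELS, N_CLAUDE)
significant_share = 100 * 29 / claude_pairs  # 29 pairs reported at p<0.05

print(total_pairs, claude_pairs, round(significant_share, 1))
```

The last figure is the 29 significant pairs expressed as a percentage of the Claude pairs, which matches the percentage reported on the page.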

Figure legend: all model pairs (observed); base-rate-matched null; calibration-preserving null; Anthropic · Claude pair (filled = p<0.05); Anthropic · Claude mean; all-pairs mean
