SQuAD (factual recall) | The Metacognition Bench

Operating-point board

SQuAD (factual recall) · Prospective · F_β

#	Model	F_β	Precision	Recall	Task accuracy
1	gemini-3-pro-preview	0.785	0.77	0.80	0.65
2	gemini-3-flash-preview	0.782	0.70	0.88	0.59
3	gemini-2.5-pro	0.764	0.71	0.83	0.58
4	claude-sonnet-4.5	0.753	0.78	0.72	0.59
5	gpt-4o	0.739	0.69	0.80	0.54
6	deepseek-r1	0.733	0.76	0.70	0.57
7	claude-haiku-4.5	0.716	0.79	0.65	0.48
8	mistral-medium-3.1	0.703	0.65	0.77	0.52
9	gpt-4o-mini	0.686	0.56	0.88	0.45
10	gemini-2.0-flash-001	0.683	0.58	0.82	0.45
11	gpt-5.2	0.673	0.84	0.56	0.62
12	gemini-2.5-flash	0.670	0.58	0.79	0.45
13	mistral-small-3.2-24b-instruct	0.666	0.57	0.79	0.44
14	llama-3.1-70b-instruct	0.643	0.65	0.64	0.45
15	claude-3.5-sonnet	0.643	0.79	0.54	0.56
16	llama-3.3-70b-instruct	0.593	0.64	0.55	0.45
17	qwen-2.5-72b-instruct	0.572	0.74	0.47	0.47
18	deepseek-chat	0.567	0.78	0.45	0.55
19	claude-3-haiku	0.554	0.55	0.56	0.40
20	qwen-2.5-coder-32b-instruct	0.476	0.31	1.00	0.31
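
The F_β column combines the precision and recall columns. As a minimal sketch, the standard definition is F_β = (1 + β²)·P·R / (β²·P + R). The page does not state which β it uses, so the default β = 1 below is an assumption; it roughly reproduces the listed scores from the rounded precision and recall values. The function name `f_beta` is illustrative and not taken from the benchmark's code.

```python
def f_beta(precision, recall, beta=1.0):
    """Standard F-beta score from precision and recall.

    beta=1.0 is an assumption: the board does not state the value it uses.
    """
    b2 = beta * beta
    denom = b2 * precision + recall
    if denom == 0:
        # No positive predictions and/or no recalled positives: define the score as 0.
        return 0.0
    return (1 + b2) * precision * recall / denom


# Row 1 of the board: precision 0.77, recall 0.80.
print(round(f_beta(0.77, 0.80), 3))
```

Because the table rounds precision and recall to two decimals, scores recomputed this way can differ from the listed F_β in the third decimal place.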

Pairwise signal on SQuAD

Statistic	Value	Note
Observed correlation	0.037	mean τ-b; 0 = no signal
Baseline correlation	0.000	permutation null; ≈ 0
Model pairs	171	unit = pair, not model
Significant pairs	46.2%	p < 0.05 after FDR; not an effect size
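
The summary statistics above rest on Kendall's τ-b averaged over model pairs and compared with a permutation null. The sketch below shows one way such numbers could be computed. It assumes each model contributes one value per item (for example a binary confidence judgement) in a shared item order; the page does not specify the exact quantity being correlated, and the function names are hypothetical.

```python
import random
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b for two equal-length sequences (O(n^2) pair scan)."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    concordant = discordant = tied_x_only = tied_y_only = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied on both variables: excluded from every term
        if dx == 0:
            tied_x_only += 1
        elif dy == 0:
            tied_y_only += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    denom = sqrt((untied + tied_x_only) * (untied + tied_y_only))
    return (concordant - discordant) / denom if denom else float("nan")


def mean_pairwise_tau(columns):
    """Mean tau-b over all unordered model pairs.

    columns: dict mapping a model name to its per-item values (assumed layout).
    """
    taus = [kendall_tau_b(columns[a], columns[b])
            for a, b in combinations(sorted(columns), 2)]
    return sum(taus) / len(taus)


def permutation_null(x, y, n_perm=200, seed=0):
    """tau-b values after shuffling y, which breaks any item-level alignment."""
    rng = random.Random(seed)
    shuffled = list(y)
    null = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        null.append(kendall_tau_b(x, shuffled))
    return null
```

With n models there are n(n - 1)/2 unordered pairs, so 171 pairs corresponds to 19 models. A plain shuffle is only the simplest null and may differ from the base-rate-matched and calibration-preserving nulls the page names.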

[Figure: distribution of pairwise τ-b. Series: observed model pairs; base-rate-matched null; calibration-preserving null. 5%-95% range of observed values: -0.026 to 0.127.]
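
The share of significant pairs is reported after FDR correction. The page does not name the procedure, so the Benjamini-Hochberg step-up rule below is an assumption, shown only to illustrate how per-pair p-values could be turned into a "fraction significant" figure. The function name is hypothetical.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure.

    Returns a list of booleans, True where the hypothesis is rejected
    at false-discovery-rate level alpha.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k (1-based) whose p-value is <= (k / m) * alpha.
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff = rank
    # Reject every hypothesis ranked at or below the cutoff.
    rejected = [False] * m
    for idx in order[:cutoff]:
        rejected[idx] = True
    return rejected


def fraction_significant(p_values, alpha=0.05):
    """Share of tests that survive the correction (assumes at least one test)."""
    rejected = benjamini_hochberg(p_values, alpha)
    return sum(rejected) / len(rejected)
```

As the page notes, this fraction counts how many pairs clear the threshold and says nothing about how large the correlations are.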

Confidence structure: the shared-difficulty factor

Information the model doesn’t use

Benchmark	Model	Internal	Population	Same-context text	Post-hoc text
SQuAD	Claude 3 Haiku	0.622	0.750	0.657	0.664
SQuAD	GPT-4o	0.731	0.751	0.683	0.691

Values are ROC AUCs from the paper's calibration-ROC table (SI). "Same-context text" is the pre-judgement reasoning trace available before the binary confidence commit; "post-hoc text" reads the answer attempt and is an upper-bound comparison.
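
For readers unfamiliar with the metric, a ROC AUC can be read as the probability that a randomly chosen correct trial receives a higher score than a randomly chosen incorrect one, with ties counted as one half. The sketch below computes it that way. It assumes the label being predicted is answer correctness and that each column corresponds to some per-trial score; the function name is illustrative, and the paper's own pipeline may differ.

```python
def roc_auc(scores, correct):
    """ROC AUC via pairwise comparison of correct vs. incorrect trials.

    scores:  per-trial predictor values (higher = more likely correct).
    correct: per-trial truthy/falsy labels (assumed to be answer correctness).
    """
    pos = [s for s, c in zip(scores, correct) if c]
    neg = [s for s, c in zip(scores, correct) if not c]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect trial")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


# Three of the four correct/incorrect pairs are ordered correctly.
print(roc_auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))  # 0.75
```

An AUC of 0.5 means the score carries no information about correctness, which is the reference point for reading the table.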

The four metacognitive outcomes on SQuAD

No curated cases are available for this selection yet; outcome-matrix extraction currently covers only a sample of MMLU-Pro trials.
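
Until curated cases are available, the four outcomes can be illustrated with a small sketch. It assumes the outcomes are the cells of a 2×2 table crossing the model's prospective judgement ("I will get this right") with whether the answer was actually correct, and that the board's precision, recall, and task accuracy are read off that table. This reading is consistent with the last leaderboard row (recall 1.00, precision equal to task accuracy), but the page does not state it, and the names below are hypothetical.

```python
from collections import Counter


def outcome_matrix(predicted_correct, actually_correct):
    """Count trials in each (prediction, outcome) cell of the 2x2 table."""
    counts = Counter()
    for pred, actual in zip(predicted_correct, actually_correct):
        counts[(bool(pred), bool(actual))] += 1
    return counts


def summarize(counts):
    """Precision, recall, and task accuracy from the four cells (assumed mapping)."""
    tp = counts[(True, True)]    # predicted correct, was correct
    fp = counts[(True, False)]   # predicted correct, was wrong
    fn = counts[(False, True)]   # predicted wrong, was correct
    tn = counts[(False, False)]  # predicted wrong, was wrong
    total = tp + fp + fn + tn
    nan = float("nan")
    return {
        "precision": tp / (tp + fp) if tp + fp else nan,
        "recall": tp / (tp + fn) if tp + fn else nan,
        "task_acc": (tp + fn) / total if total else nan,
    }


# A model that always predicts success on a task it solves 31% of the time.
counts = outcome_matrix([True] * 100, [True] * 31 + [False] * 69)
print(summarize(counts))
```

In this always-confident case recall is 1.0 and precision collapses to task accuracy, so the judgement adds no information beyond the base rate.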