
SQuAD (factual recall)

Unaided means: answer the question without the supporting context passage.
Probes: Prospective + Counterfactual

Operating-point board

SQuAD (factual recall) · Prospective · Fβ
 #  Model                           Fβ     Prec  Rec   Task acc
 1  gemini-3-pro-preview            0.785  0.77  0.80  0.65
 2  gemini-3-flash-preview          0.782  0.70  0.88  0.59
 3  gemini-2.5-pro                  0.764  0.71  0.83  0.58
 4  claude-sonnet-4.5               0.753  0.78  0.72  0.59
 5  gpt-4o                          0.739  0.69  0.80  0.54
 6  deepseek-r1                     0.733  0.76  0.70  0.57
 7  claude-haiku-4.5                0.716  0.79  0.65  0.48
 8  mistral-medium-3.1              0.703  0.65  0.77  0.52
 9  gpt-4o-mini                     0.686  0.56  0.88  0.45
10  gemini-2.0-flash-001            0.683  0.58  0.82  0.45
11  gpt-5.2                         0.673  0.84  0.56  0.62
12  gemini-2.5-flash                0.670  0.58  0.79  0.45
13  mistral-small-3.2-24b-instruct  0.666  0.57  0.79  0.44
14  llama-3.1-70b-instruct          0.643  0.65  0.64  0.45
15  claude-3.5-sonnet               0.643  0.79  0.54  0.56
16  llama-3.3-70b-instruct          0.593  0.64  0.55  0.45
17  qwen-2.5-72b-instruct           0.572  0.74  0.47  0.47
18  deepseek-chat                   0.567  0.78  0.45  0.55
19  claude-3-haiku                  0.554  0.55  0.56  0.40
20  qwen-2.5-coder-32b-instruct     0.476  0.31  1.00  0.31
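
The Fβ column combines the Prec and Rec columns. The sketch below assumes the standard Fβ definition; the page does not state which β the board uses, and the function name `f_beta` is illustrative. With β = 1, the rounded precision and recall of the top row give about 0.785, which agrees with the board, but that agreement is an inference from rounded values, not a documented setting.

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """Standard F-beta score: weighted harmonic mean of precision and recall.

    beta > 1 weights recall more heavily; beta < 1 weights precision more heavily.
    The default beta = 1.0 is an assumption, not a value stated on the page.
    """
    if precision == 0.0 and recall == 0.0:
        return 0.0  # avoid 0/0; by convention the score is 0 here
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)


# Top row of the board: Prec = 0.77, Rec = 0.80
top_row = f_beta(0.77, 0.80)  # about 0.785 when beta = 1
```

Because Prec and Rec are shown to two decimals, recomputed scores can differ from the board's Fβ in the third decimal.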

Pairwise signal on SQuAD

Observed correlation: 0.037 (mean τ-b; 0 = no signal)
Baseline correlation: 0.000 (permutation null; ≈ 0)
Model pairs: 171 (unit = pair, not model)
Significant pairs: 46.2% (p < 0.05 after FDR; not an effect size)
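
The page says only that significance is assessed "after FDR" and does not name the procedure. As a minimal sketch, assuming the common Benjamini-Hochberg step-up rule, the correction over per-pair p-values could look like the following; the function name and the example p-values are made up for illustration.

```python
def benjamini_hochberg(pvals: list[float], alpha: float = 0.05) -> list[bool]:
    """Return a reject/keep flag per p-value under the Benjamini-Hochberg rule.

    Assumption: the page's "FDR" is Benjamini-Hochberg; this is not confirmed.
    """
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices by ascending p

    # Largest rank k such that p_(k) <= (k / m) * alpha.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if pvals[idx] <= rank / m * alpha:
            k_max = rank

    # Reject every hypothesis ranked at or below k_max.
    reject = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= k_max:
            reject[idx] = True
    return reject


# Illustrative p-values only (not taken from the page).
flags = benjamini_hochberg([0.001, 0.5, 0.02, 0.03])
share_significant = sum(flags) / len(flags)
```

The share of rejected pairs is the kind of quantity the "Significant pairs" card reports; as the card notes, it says nothing about how large the correlations are.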
[Figure: Pair signal: do confidence gaps rank performance gaps? (Kendall tau-b). x-axis from -1.0 to 1.0. Series: observed model pairs, base-rate-matched null, calibration-preserving null. 5%-95% observed range: -0.026 to 0.127.]
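
To make the pair statistic concrete, here is a hedged sketch of one plausible reading: for a pair of models, take the per-question gap in confidence and the per-question gap in correctness, then compute Kendall's tau-b between the two gap series. The exact construction used by the page is not given, so `pair_signal` and its inputs are assumptions; the tau-b formula itself is the standard tie-corrected one. The 171 pairs match the number of unordered pairs among 19 models.

```python
from itertools import combinations
from math import comb, sqrt


def kendall_tau_b(x: list[float], y: list[float]) -> float:
    """Kendall's tau-b with the standard correction for ties (O(n^2) version)."""
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0:
            ties_x += 1          # pair tied in x (possibly also in y)
        if dy == 0:
            ties_y += 1          # pair tied in y (possibly also in x)
        if dx != 0 and dy != 0:
            if dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    n0 = comb(len(x), 2)
    denom = sqrt((n0 - ties_x) * (n0 - ties_y))
    return (concordant - discordant) / denom if denom else 0.0


def pair_signal(conf_a, conf_b, correct_a, correct_b) -> float:
    """Hypothetical pair statistic: tau-b between confidence gaps and
    correctness gaps of two models over the same questions."""
    conf_gap = [a - b for a, b in zip(conf_a, conf_b)]
    perf_gap = [a - b for a, b in zip(correct_a, correct_b)]
    return kendall_tau_b(conf_gap, perf_gap)


n_pairs = comb(19, 2)  # unordered pairs among 19 models
```

A value near 0, like the observed mean of 0.037, would mean that knowing which model is more confident on a question says little about which one answers it correctly.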

Confidence structure: the shared-difficulty factor

[Figure: eigenvalue spectrum. x-axis: eigenvalue rank; y-axis: variance share. PC1 = 54.9% of variance. 19 models · 1 negative eigenvalue.]

Information the model doesn’t use

Benchmark  Model           Internal  Population  Same-context text  Post-hoc text
SQuAD      Claude 3 Haiku  0.622     0.750       0.657              0.664
SQuAD      GPT-4o          0.731     0.751       0.683              0.691

Values are ROC AUCs from the paper's calibration-ROC table (SI). "Same-context text" is the pre-judgement reasoning trace available before the binary confidence commit; "post-hoc text" reads the answer attempt and is an upper-bound comparison.
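
For readers unfamiliar with the metric, the sketch below shows how a ROC AUC can be computed from a confidence signal and binary correctness labels, using the standard pairwise (Mann-Whitney) formulation. It does not reproduce the paper's pipeline; the function name and the toy data are illustrative assumptions.

```python
def roc_auc(scores: list[float], labels: list[int]) -> float:
    """ROC AUC as the probability that a randomly chosen correct trial
    receives a higher score than a randomly chosen incorrect one.
    Ties count as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect trial")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data: confidence per trial and whether the answer was correct.
toy_auc = roc_auc([0.9, 0.4, 0.6, 0.2], [1, 1, 0, 0])  # 3 of 4 pairs ordered correctly
```

An AUC of 0.5 corresponds to a signal that cannot separate correct from incorrect answers, and 1.0 to perfect separation, which gives a scale for reading the four columns above.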

The four metacognitive outcomes on SQuAD

No curated cases for this selection yet — outcome-matrix extraction currently covers a sample of MMLU-Pro trials.