LLMs Show No Signs Of Individuated Metacognition

Cross-benchmark trait map

First signal: confidence is not a private performance readout

All points: r = 0.08 (pooled model/condition points, aggregate fit)

Benchmark

Unaided means: answer without seeing the multiple-choice options.

Forecast probe

Prospective: asked before attempting. Counterfactual: asked after.

Weighting β = 1.00

β < 1: weight overconfidence (precision matters). β > 1: weight underconfidence (recall matters).

Task-accuracy band: 0.00–1.00. Showing 20 / 20 models in band.

#	Model	F_β	Precision	Recall	Task acc
01	gpt-5.2	0.861	0.819	0.908	0.728
02	gemini-2.5-pro	0.822	0.727	0.945	0.672
03	claude-sonnet-4.5	0.819	0.775	0.869	0.702
04	gemini-3.1-pro-preview	0.814	0.748	0.893	0.662
05	deepseek-r1	0.799	0.722	0.895	0.648
06	gemini-3-flash-preview	0.778	0.646	0.977	0.609
07	claude-haiku-4.5	0.744	0.728	0.760	0.646
08	claude-3.5-sonnet	0.705	0.583	0.891	0.541
09	deepseek-chat	0.693	0.547	0.947	0.514
10	gpt-4o	0.659	0.508	0.938	0.470
11	mistral-medium-3.1	0.648	0.517	0.868	0.477
12	gemini-2.5-flash	0.636	0.476	0.960	0.455
13	gemini-2.0-flash-001	0.623	0.453	0.996	0.449
14	gpt-4o-mini	0.595	0.433	0.949	0.396
15	mistral-small-3.2-24b-instruct	0.573	0.410	0.951	0.388
16	qwen-2.5-72b-instruct	0.570	0.455	0.763	0.414
17	llama-3.1-70b-instruct	0.549	0.385	0.955	0.364
18	llama-3.3-70b-instruct	0.525	0.364	0.944	0.343
19	claude-3-haiku	0.512	0.354	0.926	0.352
20	qwen-2.5-coder-32b-instruct	0.405	0.254	1.000	0.254
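
The F_β column can be recomputed from the precision and recall columns. The sketch below assumes the page uses the standard F_β definition (the page does not spell it out); at β = 1.00 that formula reproduces the rows checked here. The function name `f_beta` and the `rows` dictionary are illustrative, not part of the site.

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """Standard F_beta: weighted harmonic mean of precision and recall.

    beta < 1 leans on precision, beta > 1 leans on recall.
    Assumption: this is the formula behind the leaderboard's F_beta column.
    """
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)


# (precision, recall) pairs copied from three leaderboard rows.
rows = {
    "gpt-5.2": (0.819, 0.908),
    "claude-haiku-4.5": (0.728, 0.760),
    "qwen-2.5-coder-32b-instruct": (0.254, 1.000),
}

for model, (p, r) in rows.items():
    # Compare the printed value with the F_beta column at beta = 1.00.
    print(model, round(f_beta(p, r, beta=1.0), 3))
```

Moving β away from 1 changes the ranking pressure: a model with recall near 1.000 but low precision, such as the last row, gains under β > 1 and loses under β < 1.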

Observed correlation: 0.037
Baseline correlation: 0.000
Model pairs: 171
Significant pairs: 46.2%

Series shown: observed model pairs, base-rate-matched null, calibration-preserving null. Observed 5%-95% range: -0.026 to 0.127.
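
A rough sketch of how a pairwise statistic of this kind could be computed. The page does not state what is correlated for each model pair or how its two nulls are built, so everything here is an assumption: per-item confidence vectors per model, Pearson correlation per pair, and a plain within-model shuffle standing in for the page's nulls. The data is synthetic. Note that 171 equals n(n-1)/2 for n = 19.

```python
import math
import random
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def pair_correlations(conf_by_model):
    """Correlation for every unordered model pair: n * (n - 1) / 2 values."""
    names = sorted(conf_by_model)
    return {(a, b): pearson(conf_by_model[a], conf_by_model[b])
            for a, b in combinations(names, 2)}


def shuffled_baseline(conf_by_model, rng, repeats=20):
    """Mean pair correlation after shuffling items within each model.

    Shuffling keeps each model's confidence distribution but breaks the
    item alignment, so the expected correlation is zero. This is a simple
    stand-in, not the page's base-rate-matched or calibration-preserving null.
    """
    means = []
    for _ in range(repeats):
        shuffled = {m: rng.sample(v, len(v)) for m, v in conf_by_model.items()}
        rs = pair_correlations(shuffled).values()
        means.append(sum(rs) / len(rs))
    return sum(means) / len(means)


# Hypothetical data: 19 models, 60 items, confidences sharing a weak item effect.
rng = random.Random(0)
item_effect = [rng.gauss(0.0, 1.0) for _ in range(60)]
toy = {f"model_{i:02d}": [0.3 * e + rng.gauss(0.0, 1.0) for e in item_effect]
       for i in range(19)}

observed = pair_correlations(toy)
print(len(observed))                            # 19 * 18 / 2 pairs
print(sum(observed.values()) / len(observed))   # mean observed pair correlation
print(shuffled_baseline(toy, rng))              # mean under the shuffle stand-in
```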

Per-benchmark AUCs (internal vs population vs text classifier) are on each benchmark page (linked from the leaderboard above).

Benchmark: SQuAD

55% of the shared variance in models’ confidence sits on one common factor: models largely agree on which items are hard.

Per-benchmark eigenspectra (and the full structure) are on each benchmark page (linked from the leaderboard above).
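
One way to read a "share of variance on one common factor" figure is as the top eigenvalue of the model-by-model confidence correlation matrix divided by the sum of all eigenvalues. The sketch below takes that PCA-style reading as an assumption; the page may define shared variance differently, for example within a factor model. It uses power iteration so it needs only the standard library, and the helper names are illustrative.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def correlation_matrix(rows):
    """rows[i] is one model's per-item confidence vector."""
    n = len(rows)
    return [[1.0 if i == j else pearson(rows[i], rows[j]) for j in range(n)]
            for i in range(n)]


def top_eigenvalue(matrix, iters=1000, tol=1e-12):
    """Largest eigenvalue of a symmetric positive semidefinite matrix (power iteration)."""
    n = len(matrix)
    rng = random.Random(0)  # seeded random start, to avoid an unlucky orthogonal start
    v = [rng.random() + 0.1 for _ in range(n)]
    lam = 0.0
    for _ in range(iters):
        w = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(c * c for c in w))
        if norm == 0.0:
            return 0.0
        v = [c / norm for c in w]
        # Rayleigh quotient of the normalised vector.
        new_lam = sum(v[i] * sum(matrix[i][j] * v[j] for j in range(n))
                      for i in range(n))
        if abs(new_lam - lam) < tol:
            return new_lam
        lam = new_lam
    return lam


def first_factor_share(rows):
    """Top eigenvalue over the trace; a correlation matrix has trace n."""
    corr = correlation_matrix(rows)
    return top_eigenvalue(corr) / len(corr)


# Hypothetical confidences: three models that mostly agree on which items are hard.
toy = [
    [0.9, 0.8, 0.3, 0.2, 0.7, 0.4],
    [0.8, 0.9, 0.4, 0.1, 0.6, 0.5],
    [0.7, 0.6, 0.2, 0.3, 0.9, 0.3],
]
print(first_factor_share(toy))
```

A share near 1 means the models' confidences move together across items; a share near 1/n means they are close to unrelated.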