claude-3-haiku | The Metacognition Bench

Positioning spread: every benchmark, one model

Performance vs. confidence for claude-3-haiku, per benchmark (prospective probe).

Benchmark	Task acc	Confidence	F₁	Leans
SQuAD (factual recall)	0.40	0.42	0.55	+0.01 calibrated
MMLU-Pro (knowledge)	0.35	0.92	0.51	+0.57 overconfident
LegalBench (legal reasoning)	0.91	0.54	0.66	-0.37 cautious
MathBench (competition math)	0.41	0.95	0.59	+0.54 overconfident
OmniMath (advanced math)	0.13	0.33	0.29	+0.20 overconfident
SciCode (scientific code)	0.40	0.95	0.60	+0.55 overconfident
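
The Leans column appears to be the gap between stated confidence and task accuracy. The page does not define it, so the sketch below rests on that assumption. The `TOLERANCE` cutoff for the "calibrated" label is hypothetical, as are the names `ROWS`, `lean` and `LEANS`; the numbers are copied from the table.

```python
# Values copied from the table above: (task accuracy, confidence).
ROWS = {
    "SQuAD": (0.40, 0.42),
    "MMLU-Pro": (0.35, 0.92),
    "LegalBench": (0.91, 0.54),
    "MathBench": (0.41, 0.95),
    "OmniMath": (0.13, 0.33),
    "SciCode": (0.40, 0.95),
}

# Hypothetical cutoff: the page does not state where "calibrated" ends.
TOLERANCE = 0.05


def lean(accuracy: float, confidence: float, tol: float = TOLERANCE) -> tuple[float, str]:
    """Return (confidence - accuracy, label) for one benchmark row."""
    gap = round(confidence - accuracy, 2)
    if gap > tol:
        label = "overconfident"
    elif gap < -tol:
        label = "cautious"
    else:
        label = "calibrated"
    return gap, label


# Assumed definition of the Leans column: confidence minus accuracy.
LEANS = {name: lean(acc, conf) for name, (acc, conf) in ROWS.items()}

if __name__ == "__main__":
    for name, (gap, label) in LEANS.items():
        print(f"{name:<10} {gap:+.2f} {label}")
```

Under these assumptions the labels match the table for all six benchmarks. The rounded SQuAD values give a gap of +0.02 rather than the +0.01 shown, which would be consistent with the page computing the gap from unrounded figures.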

In the full cloud

Figure legend: claude-3-haiku conditions; all other model/condition points; equal relative confidence and pass rate.

Pairwise signal: pairs involving claude-3-haiku

Match accuracy controls for the performance base-rate gap

Statistic	Value
claude-3-haiku pairs	18 / 171
claude-3-haiku mean tau	+0.040
All-pairs mean	+0.037
claude-3-haiku p<0.05	9 (50%)
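
The pair counts above can be sanity-checked with a little arithmetic. The sketch below assumes each unordered pair of models is counted once; the page does not say so. The function `models_for_pair_count` and the constant names are illustrative only.

```python
from math import comb

TOTAL_PAIRS = 171   # denominator of "18/ 171" in the summary above
MODEL_PAIRS = 18    # pairs involving claude-3-haiku
SIGNIFICANT = 9     # claude-3-haiku pairs with p < 0.05


def models_for_pair_count(total_pairs: int) -> int:
    """Return n such that C(n, 2) == total_pairs, or raise if no such n exists."""
    n = 2
    while comb(n, 2) < total_pairs:
        n += 1
    if comb(n, 2) != total_pairs:
        raise ValueError(f"{total_pairs} is not a count of unordered pairs")
    return n


# Assumption: the 171 pairs are all unordered pairs drawn from one set of models.
N_MODELS = models_for_pair_count(TOTAL_PAIRS)
SHARE_SIGNIFICANT = SIGNIFICANT / MODEL_PAIRS

if __name__ == "__main__":
    print(f"models implied by {TOTAL_PAIRS} pairs: {N_MODELS}")
    print(f"pairs per model: {N_MODELS - 1}")
    print(f"share significant: {SHARE_SIGNIFICANT:.0%}")
```

Under that assumption, 171 pairs correspond to 19 models, each of which appears in 18 pairs. That agrees with the 18 claude-3-haiku pairs reported, and 9 of 18 gives the 50% shown.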

Figure legend: all model pairs (observed); base-rate-matched null; calibration-preserving null; claude-3-haiku pair (filled = p<0.05); claude-3-haiku mean; all-pairs mean.

The four metacognitive outcomes

No curated cases for this selection yet — outcome-matrix extraction currently covers a sample of MMLU-Pro trials.
