Individual model view · Anthropic · Claude

claude-3-haiku

Appears in 6 benchmarks
Mean lean (confidence − pass rate): +0.25
5/6 benchmarks lean overconfident (prospective probe)

Positioning spread: every benchmark, one model

[Figure: performance and confidence per benchmark on a 0.00 to 1.00 scale (red gap = overconfident).]
Performance vs. confidence for claude-3-haiku, per benchmark (prospective probe).
Benchmark                      Task acc  Confidence  F₁    Lean
SQuAD (factual recall)         0.40      0.42        0.55  +0.01 calibrated
MMLU-Pro (knowledge)           0.35      0.92        0.51  +0.57 overconfident
LegalBench (legal reasoning)   0.91      0.54        0.66  -0.37 cautious
MathBench (competition math)   0.41      0.95        0.59  +0.54 overconfident
OmniMath (advanced math)       0.13      0.33        0.29  +0.20 overconfident
SciCode (scientific code)      0.40      0.95        0.60  +0.55 overconfident
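The lean column can be recomputed from the table as confidence minus task accuracy. The sketch below does that and averages the result. The ±0.05 band used to label a benchmark "calibrated" is an assumption chosen to reproduce the labels shown here; the page does not state its threshold. Recomputing from rounded values can differ from the published lean by 0.01 (SQuAD gives +0.02 rather than +0.01).

```python
# Task accuracy and confidence per benchmark, copied from the table above.
ROWS = [
    ("SQuAD", 0.40, 0.42),
    ("MMLU-Pro", 0.35, 0.92),
    ("LegalBench", 0.91, 0.54),
    ("MathBench", 0.41, 0.95),
    ("OmniMath", 0.13, 0.33),
    ("SciCode", 0.40, 0.95),
]

# Hypothetical tolerance band: NOT taken from the page, only an assumption.
CALIBRATED_BAND = 0.05


def lean(task_acc: float, confidence: float) -> float:
    """Lean = confidence minus pass rate; positive means overconfident."""
    return confidence - task_acc


def label(value: float, band: float = CALIBRATED_BAND) -> str:
    """Map a lean value to the three labels used in the table."""
    if value > band:
        return "overconfident"
    if value < -band:
        return "cautious"
    return "calibrated"


leans = {name: lean(acc, conf) for name, acc, conf in ROWS}
labels = {name: label(value) for name, value in leans.items()}
mean_lean = sum(leans.values()) / len(leans)

for name, value in leans.items():
    print(f"{name:<12} {value:+.2f} {labels[name]}")
print(f"mean lean    {mean_lean:+.2f}")
```

Note that the "5/6 benchmarks lean overconfident" summary counts every positive lean, so SQuAD is included there even though its label is "calibrated".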

In the full cloud

[Figure: scatter of performance z-score within benchmark/probe (x-axis, -3 to 3) against confidence z-score (y-axis, -3 to 3). Legend: claude-3-haiku conditions; all other model/condition points; equal relative confidence and pass rate.]
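As a rough illustration of how a point could be placed in this cloud, the sketch below standardizes pass rate and confidence within a single benchmark. Only the claude-3-haiku MMLU-Pro values (0.35 and 0.92) come from this page; `model_b`, `model_c`, `model_d` and their numbers are hypothetical, and the use of the population standard deviation is an assumption.

```python
from statistics import mean, pstdev


def zscores(values: dict[str, float]) -> dict[str, float]:
    """Standardize values within one benchmark/probe group."""
    mu = mean(values.values())
    sigma = pstdev(values.values())  # population std: an assumption
    if sigma == 0:
        # All models identical: no spread, so every z-score is zero.
        return {name: 0.0 for name in values}
    return {name: (v - mu) / sigma for name, v in values.items()}


# claude-3-haiku values are from the MMLU-Pro row; the rest are made up.
pass_rate = {"claude-3-haiku": 0.35, "model_b": 0.50, "model_c": 0.65, "model_d": 0.70}
confidence = {"claude-3-haiku": 0.92, "model_b": 0.60, "model_c": 0.70, "model_d": 0.80}

perf_z = zscores(pass_rate)
conf_z = zscores(confidence)

for name in pass_rate:
    print(f"{name:<15} perf z = {perf_z[name]:+.2f}  conf z = {conf_z[name]:+.2f}")
```

With these illustrative inputs, claude-3-haiku has a negative performance z-score and a positive confidence z-score, which would put it above the "equal relative confidence and pass rate" reference.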

Pairwise signal: pairs involving claude-3-haiku

Match accuracy controls for the performance base-rate gap
claude-3-haiku pairs: 18 / 171
claude-3-haiku mean tau: +0.040
All-pairs mean: +0.037
claude-3-haiku pairs with p<0.05: 9 (50%)
[Figure: Pair signal: do confidence gaps rank performance gaps? (Kendall tau-b), on an axis from -1.0 to 1.0. Legend: all model pairs (observed); base-rate-matched null; calibration-preserving null; claude-3-haiku pair (filled = p<0.05); claude-3-haiku mean; all-pairs mean.]
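A minimal sketch of the pair signal, assuming it is Kendall tau-b between the per-benchmark confidence gap and the per-benchmark performance gap of two models. The page does not say which unit the gaps are ranked over (benchmarks, items, or conditions), so the per-benchmark choice is an assumption. The claude-3-haiku numbers come from the table above; `MODEL_B` is a hypothetical comparison model with made-up values.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(x: list[float], y: list[float]) -> float:
    """Kendall tau-b with tie correction, O(n^2) pairwise version."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both: excluded from both denominator terms
        if dx == 0:
            ties_x += 1
        elif dy == 0:
            ties_y += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    denom = sqrt((untied + ties_x) * (untied + ties_y))
    return (concordant - discordant) / denom if denom else float("nan")


# (task accuracy, confidence) per benchmark, in table order.
HAIKU = [(0.40, 0.42), (0.35, 0.92), (0.91, 0.54), (0.41, 0.95), (0.13, 0.33), (0.40, 0.95)]
# Hypothetical second model: illustrative numbers only.
MODEL_B = [(0.60, 0.70), (0.50, 0.80), (0.85, 0.75), (0.55, 0.85), (0.20, 0.40), (0.45, 0.90)]

perf_gap = [a[0] - b[0] for a, b in zip(HAIKU, MODEL_B)]
conf_gap = [a[1] - b[1] for a, b in zip(HAIKU, MODEL_B)]

tau = kendall_tau_b(conf_gap, perf_gap)
print(f"pair tau-b = {tau:+.3f}")
```

A tau near +1 would mean that the benchmarks where one model is more confident than the other are also the ones where it performs better; a tau near 0, like the +0.040 mean reported above, means the confidence gaps carry little ranking information about the performance gaps.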

The four metacognitive outcomes

No curated cases for this selection yet — outcome-matrix extraction currently covers a sample of MMLU-Pro trials.