
claude-sonnet-4.5

Appears in 6 benchmarks · Mean lean (confidence − pass rate): -0.02 · 3/6 benchmarks lean overconfident (prospective probe)

Positioning spread: every benchmark, one model

[Figure: performance and confidence for each benchmark on a 0.00 to 1.00 axis; a red gap marks overconfidence.]
Performance vs. confidence for claude-sonnet-4.5, per benchmark (prospective probe).
| Benchmark | Task acc | Confidence | F₁ | Leans |
|---|---|---|---|---|
| SQuAD (factual recall) | 0.59 | 0.55 | 0.75 | -0.04 calibrated |
| MMLU-Pro (knowledge) | 0.70 | 0.79 | 0.82 | +0.09 overconfident |
| LegalBench (legal reasoning) | 0.86 | 0.73 | 0.79 | -0.13 cautious |
| MathBench (competition math) | 0.93 | 1.00 | 0.97 | +0.06 overconfident |
| OmniMath (advanced math) | 0.57 | 0.75 | 0.72 | +0.18 overconfident |
| SciCode (scientific code) | 0.57 | 0.29 | 0.43 | -0.29 cautious |
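The summary figures at the top of the page can be reproduced from the Leans column. The sketch below is a minimal illustration in Python, using values copied from the table above. The `CALIBRATED_BAND` cutoff is an assumption: the page does not state where "calibrated" ends, only that -0.04 counts as calibrated and +0.06 as overconfident.

```python
# Values copied from the table above: (task accuracy, confidence, listed lean).
rows = {
    "SQuAD": (0.59, 0.55, -0.04),
    "MMLU-Pro": (0.70, 0.79, +0.09),
    "LegalBench": (0.86, 0.73, -0.13),
    "MathBench": (0.93, 1.00, +0.06),
    "OmniMath": (0.57, 0.75, +0.18),
    "SciCode": (0.57, 0.29, -0.29),
}

# Hypothetical cutoff, chosen only to be consistent with the labels shown.
CALIBRATED_BAND = 0.05


def label(lean: float) -> str:
    """Map a lean (confidence minus pass rate) to the page's three labels."""
    if lean > CALIBRATED_BAND:
        return "overconfident"
    if lean < -CALIBRATED_BAND:
        return "cautious"
    return "calibrated"


leans = {name: lean for name, (_, _, lean) in rows.items()}
mean_lean = sum(leans.values()) / len(leans)
n_over = sum(label(v) == "overconfident" for v in leans.values())

# Recomputing the lean from the rounded columns can differ from the listed
# value in the last digit, presumably because the page rounds after subtracting.
recomputed = {name: round(conf - acc, 2) for name, (acc, conf, _) in rows.items()}

print(f"mean lean: {mean_lean:+.2f}")
print(f"overconfident: {n_over}/{len(leans)}")
```

Averaging the listed leans gives the headline mean, and counting the positive leans outside the band gives the number of overconfident benchmarks.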

In the full cloud

[Figure: scatter of performance z-score within benchmark/probe (x-axis) against confidence z-score (y-axis). Legend: claude-sonnet-4.5 conditions; all other model/condition points; equal relative confidence and pass rate.]

Pairwise signal: pairs involving claude-sonnet-4.5

Match accuracy controls for the performance base-rate gap
claude-sonnet-4.5 pairs: 18 / 171
claude-sonnet-4.5 mean tau: +0.036
All-pairs mean: +0.037
claude-sonnet-4.5 p<0.05: 10 (55.6%)
[Figure: Pair signal: do confidence gaps rank performance gaps? (Kendall tau-b, axis from -1.0 to 1.0). Legend: all model pairs (observed); base-rate-matched null; calibration-preserving null; claude-sonnet-4.5 pair (filled = p<0.05); claude-sonnet-4.5 mean; all-pairs mean.]
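For readers unfamiliar with the statistic, here is a minimal Python sketch of Kendall tau-b applied to two lists of gaps. The page does not say what unit is ranked within a model pair, so the sketch assumes one confidence gap and one performance gap per shared unit; the gap values are hypothetical and for illustration only, and `kendall_tau_b` is a plain reference implementation, not the page's own code.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall tau-b: rank agreement between x and y, corrected for ties."""
    concordant = discordant = ties_x_only = ties_y_only = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both variables: excluded from both denominator terms
        if dx == 0:
            ties_x_only += 1
        elif dy == 0:
            ties_y_only += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    denom = sqrt(
        (concordant + discordant + ties_x_only)
        * (concordant + discordant + ties_y_only)
    )
    return (concordant - discordant) / denom if denom else float("nan")


# Hypothetical gaps between two models (model A minus model B), one per unit.
confidence_gaps = [0.10, -0.05, 0.20, 0.00, 0.15]
performance_gaps = [0.08, 0.02, 0.15, -0.03, 0.01]

tau = kendall_tau_b(confidence_gaps, performance_gaps)
print(f"tau-b = {tau:+.3f}")
```

A tau near +1 would mean that larger confidence gaps reliably go with larger performance gaps; a tau near 0, like the means reported above, would mean the confidence gap carries little ranking information.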

The four metacognitive outcomes

Competent: claimed it could, and could
claude-sonnet-4.5 · MMLU-Pro

What happens during the 'incubation period'?

It said it could answer: “The term 'incubation period' has a clear, well-established definition in medicine and biology - it refers to the time between exposure to a pathogen and the appearance of symptoms. I can provide an accurate answer without needing to see multiple choice options.”

→ And it did. It answered “During the incubation period, a pathogen multiplies inside the host after infection but before symptoms appear.”; expected “Virus replication happens at an intracellular level”.

Overconfident: claimed it could, but couldn’t

No example in this selection.

Underconfident: declined, but could
claude-sonnet-4.5 · MMLU-Pro

Find the generator for the finite field Z_11.

It said it couldn't answer: “The question asks for 'the' generator of Z_11, but Z_11 (the multiplicative group of the finite field) has multiple generators - specifically φ(φ(11)) = φ(10) = 4 generators. Without seeing the multiple choice options, I cannot determine which specific generator they want (e.g., 2, 6, 7, or 8 are all primitive roots mod 11). The question is ambiguous as stated.”

→ But it could. It answered “A generator for Z_11 (the multiplicative group of non-zero elements) is 2. (Other generators include 6, 7, and 8)”; expected “2”.

Well-declined: declined, and couldn’t
claude-sonnet-4.5 · MMLU-Pro

What is the worldwide prevalence of obesity?

It said it couldn't answer: “Obesity prevalence is a well-documented global health statistic that is regularly tracked by organizations like the WHO. I can provide accurate information about worldwide obesity rates (approximately 13-16% of adults globally as of recent years, with the figure having roughly tripled since 1975). However, without seeing the multiple-choice options, I cannot be certain if the question is asking for a specific year, age group, definition threshold, or precise percentage range that would match one of the provided answers.”

→ And it couldn't. It answered “According to WHO data, worldwide obesity prevalence in 2022 was approximately 16% of adults (over 890 million people), with overweight affecting about 43% of adults globally.”; expected “13%”.

Rows: top = claimed it could answer · bottom = declined
Columns: left = was actually correct · right = was actually wrong
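The row and column legend amounts to a two-by-two classification. A minimal Python sketch follows, assuming each probe result reduces to two booleans; the function and variable names are hypothetical, and the three sample records mirror the MMLU-Pro examples shown above.

```python
from collections import Counter


def outcome(claimed_it_could: bool, was_correct: bool) -> str:
    """Return the metacognitive outcome for one probed question."""
    if claimed_it_could:
        return "Competent" if was_correct else "Overconfident"
    return "Underconfident" if was_correct else "Well-declined"


# (claimed it could answer, was actually correct) for the three examples above.
examples = {
    "incubation period": (True, True),
    "generator of Z_11": (False, True),
    "prevalence of obesity": (False, False),
}

tally = Counter(outcome(claimed, correct) for claimed, correct in examples.values())
for name, (claimed, correct) in examples.items():
    print(f"{name}: {outcome(claimed, correct)}")
```

Tallying outcomes over a full benchmark run would give the cell counts of the grid; here the Overconfident cell stays empty, matching the missing example in this selection.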