← Anthropic · Claude Individual model view Anthropic · Claude

claude-3.5-sonnet

Appears in 3 benchmarksMean lean (confidence − pass rate): -0.011/3 benchmarks lean overconfident (prospective probe)

Positioning spread: every benchmark, one model

0.000.250.500.751.00SQuAD (factual recall)MMLU-Pro (knowledge)LegalBench (legal reasoning)performanceconfidence (red gap = overconfident)
Performance vs. confidence for claude-3.5-sonnet, per benchmark (prospective probe).
BenchmarkTask accConfidenceF₁Leans
SQuAD (factual recall)0.560.390.64-0.18 cautious
MMLU-Pro (knowledge)0.540.830.70+0.28 overconfident
LegalBench (legal reasoning)0.860.730.82-0.13 cautious

In the full cloud

-3-3-2-2-1-100112233Performance z-score within benchmark/probe →Confidence z-score →
claude-3.5-sonnet conditions all other model/condition points equal relative confidence and pass rate

Pairwise signal: pairs involving claude-3.5-sonnet

Match accuracy controls for the performance base-rate gap
claude-3.5-sonnet pairs
18/ 171
claude-3.5-sonnet mean tau
+0.006
All-pairs mean
+0.037
claude-3.5-sonnet p<0.05
4(22.2%)
-1.0-0.50.00.51.0Pair signal: do confidence gaps rank performance gaps? (Kendall tau-b)
all model pairs (observed) base-rate-matched null calibration-preserving null claude-3.5-sonnet pair (filled = p<0.05) claude-3.5-sonnet mean all-pairs mean

The four metacognitive outcomes

Competent claimed it could — and could
claude-3.5-sonnet · MMLU-Pro

What is the worldwide prevalence of obesity?

It said it could answer: “Yes, I am familiar with global obesity statistics from WHO data. The worldwide prevalence of obesity is a well-documented public health metric that is regularly tracked and reported. While the exact number fluctuates yearly, I can provide a reliable approximate percentage.”

→ It and it did.It answered “According to WHO, approximately 13% of adults worldwide (age 18+) were obese in 2016, with rates continuing to rise globally.”; expected “13%”.

Overconfident claimed it could — but couldn’t
claude-3.5-sonnet · MMLU-Pro

Find the last 3 digits of 2003^(2002^2001).

It said it could answer: “This is a modular arithmetic problem requiring finding the pattern of powers of 2003 modulo 1000. While it involves large numbers, it can be solved using Euler's theorem, cyclicity of powers, and careful consideration of the exponent 2002^2001 modulo φ(1000). The pattern of last 3 digits will repeat in a cycle.”

→ It but it couldn't.It answered “187”; expected “241”.

Underconfident declined — but could

No example in this selection.

Well-declined declined — and couldn’t
claude-3.5-sonnet · MMLU-Pro

When did the first Jaina temples appear?

It said it couldn't answer: “While I have knowledge about Jainism and its architectural history, pinpointing the exact time period of the first Jaina temples requires careful consideration. The earliest Jaina temples evolved from rock-cut caves to structural temples, but without seeing specific options, I might not be precise enough about the correct time period.”

→ It and it couldn't.It answered “The first Jaina temples appeared around the 3rd-2nd century BCE during the Mauryan period.”; expected “Fifth century CE”.

Rows: top = claimed it could answer · bottom = declined Columns: left = was actually correct · right = was actually wrong
vs claude-3-haiku →vs claude-haiku-4.5 →vs claude-sonnet-4.5 → Compare with anything →