Company / family view · 4 models in the bench

Anthropic · Claude

Same public data as the home page, specialized to one family. Parametric view: company_family

Where Anthropic · Claude sits in the cloud

[Figure: scatter of model/condition points. Axes: performance z-score within benchmark/probe; confidence z-score (both from -3 to 3). Legend: Anthropic · Claude model/condition points; all other model/condition points; equal relative confidence and pass rate.]
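The axes above are z-scores computed within a benchmark/probe. A minimal sketch of that standardization follows. It assumes a population standard deviation and uses the Task acc column of the SQuAD table further down as the performance measure; the page states neither choice, and `zscores_within_group` is a hypothetical helper name.

```python
from statistics import mean, pstdev

def zscores_within_group(values):
    """Standardize each value against the mean and population std of its own group."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # All values identical: no spread, so every point sits at z = 0.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

# Task acc for the 20 models in the SQuAD · Prospective table, in rank order.
# Assumption: this column is what the page treats as "performance".
task_acc = [0.65, 0.59, 0.58, 0.59, 0.54, 0.57, 0.48, 0.52, 0.45, 0.45,
            0.62, 0.45, 0.44, 0.45, 0.56, 0.45, 0.47, 0.55, 0.40, 0.31]

z = zscores_within_group(task_acc)
sonnet_45_z = z[3]   # claude-sonnet-4.5 (rank 4)
haiku_3_z = z[18]    # claude-3-haiku (rank 19)
```

Standardizing within each benchmark/probe puts benchmarks of different difficulty on one scale, which is what lets points from different benchmarks share a single cloud.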

Disposition: where each model draws the line

[Figure: pass rate vs. implied confidence on a 0.00 to 1.00 scale for claude-3-haiku, claude-haiku-4.5, claude-sonnet-4.5, and claude-3.5-sonnet. A red gap marks overconfidence.]

Competitive context, benchmark by benchmark

SQuAD (factual recall) · Prospective · Fβ
#   Model                           Fβ     Prec  Rec   Task acc
1   gemini-3-pro-preview            0.785  0.77  0.80  0.65
2   gemini-3-flash-preview          0.782  0.70  0.88  0.59
3   gemini-2.5-pro                  0.764  0.71  0.83  0.58
4   claude-sonnet-4.5               0.753  0.78  0.72  0.59
5   gpt-4o                          0.739  0.69  0.80  0.54
6   deepseek-r1                     0.733  0.76  0.70  0.57
7   claude-haiku-4.5                0.716  0.79  0.65  0.48
8   mistral-medium-3.1              0.703  0.65  0.77  0.52
9   gpt-4o-mini                     0.686  0.56  0.88  0.45
10  gemini-2.0-flash-001            0.683  0.58  0.82  0.45
11  gpt-5.2                         0.673  0.84  0.56  0.62
12  gemini-2.5-flash                0.670  0.58  0.79  0.45
13  mistral-small-3.2-24b-instruct  0.666  0.57  0.79  0.44
14  llama-3.1-70b-instruct          0.643  0.65  0.64  0.45
15  claude-3.5-sonnet               0.643  0.79  0.54  0.56
16  llama-3.3-70b-instruct          0.593  0.64  0.55  0.45
17  qwen-2.5-72b-instruct           0.572  0.74  0.47  0.47
18  deepseek-chat                   0.567  0.78  0.45  0.55
19  claude-3-haiku                  0.554  0.55  0.56  0.40
20  qwen-2.5-coder-32b-instruct     0.476  0.31  1.00  0.31
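The Fβ column combines precision and recall. The sketch below uses the standard F-beta definition. The β behind the table is not stated on the page, so β = 1 is an assumption; it does reproduce row 1 from the displayed precision and recall. Other rows may differ in the third decimal because Prec and Rec are shown to only two places.

```python
def f_beta(precision, recall, beta=1.0):
    """Standard F-beta: weighted harmonic mean of precision and recall.

    beta > 1 weights recall more heavily; beta < 1 weights precision more heavily.
    """
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Row 1 (gemini-3-pro-preview): Prec 0.77, Rec 0.80, assuming beta = 1.
row1 = round(f_beta(0.77, 0.80), 3)
print(row1)  # 0.785

# Row 11 (gpt-5.2) has high precision but low recall, so its score
# depends on beta: weighting recall (beta = 2) gives a lower score
# than weighting precision (beta = 0.5).
recall_weighted = f_beta(0.84, 0.56, beta=2.0)
precision_weighted = f_beta(0.84, 0.56, beta=0.5)
```

This is why a model such as gpt-5.2 can rank 11th on Fβ despite the highest precision in the table: its recall of 0.56 pulls the harmonic mean down.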

The four metacognitive outcomes — Anthropic · Claude

Competent: claimed it could, and could
claude-3.5-sonnet · MMLU-Pro

What is the worldwide prevalence of obesity?

It said it could answer: “Yes, I am familiar with global obesity statistics from WHO data. The worldwide prevalence of obesity is a well-documented public health metric that is regularly tracked and reported. While the exact number fluctuates yearly, I can provide a reliable approximate percentage.”

→ It claimed it could, and it did. It answered “According to WHO, approximately 13% of adults worldwide (age 18+) were obese in 2016, with rates continuing to rise globally.”; expected “13%”.

Overconfident: claimed it could, but couldn’t
claude-3.5-sonnet · MMLU-Pro

Find the last 3 digits of 2003^(2002^2001).

It said it could answer: “This is a modular arithmetic problem requiring finding the pattern of powers of 2003 modulo 1000. While it involves large numbers, it can be solved using Euler's theorem, cyclicity of powers, and careful consideration of the exponent 2002^2001 modulo φ(1000). The pattern of last 3 digits will repeat in a cycle.”

→ It claimed it could, but it couldn't. It answered “187”; expected “241”.
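The expected answer can be checked directly. The sketch below is not part of the benchmark; it confirms 241 two ways: by brute-force modular exponentiation, and by first reducing the exponent modulo 100, which is valid because gcd(2003, 1000) = 1 and the Carmichael function gives λ(1000) = 100.

```python
# Direct check: three-argument pow performs modular exponentiation,
# so the huge exponent (about 6,600 decimal digits) is handled efficiently.
exponent = 2002 ** 2001
direct = pow(2003, exponent, 1000)

# Reduced check: since 2003 is coprime to 1000 and lambda(1000) = 100,
# 2003**k mod 1000 depends only on k mod 100.
reduced_exponent = pow(2002, 2001, 100)
reduced = pow(2003 % 1000, reduced_exponent, 1000)

print(direct, reduced_exponent, reduced)  # 241 52 241
```

Both routes agree with the expected “241”, so the model's “187” is an arithmetic error rather than a grading artifact.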

Underconfident: declined, but could
claude-sonnet-4.5 · MMLU-Pro

Find the generator for the finite field Z_11.

It said it couldn't answer: “The question asks for 'the' generator of Z_11, but Z_11 (the multiplicative group of the finite field) has multiple generators - specifically φ(φ(11)) = φ(10) = 4 generators. Without seeing the multiple choice options, I cannot determine which specific generator they want (e.g., 2, 6, 7, or 8 are all primitive roots mod 11). The question is ambiguous as stated.”

→ It declined, but it could. It answered “A generator for Z_11 (the multiplicative group of non-zero elements) is 2. (Other generators include 6, 7, and 8)”; expected “2”.
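The model's list of generators can be verified by brute force. The sketch below is illustrative only (`multiplicative_order` is a hypothetical helper); it finds every element of order 10 in the multiplicative group modulo 11.

```python
def multiplicative_order(a, p):
    """Smallest k >= 1 with a**k congruent to 1 mod p, for prime p and a not divisible by p."""
    k, x = 1, a % p
    while x != 1:
        x = (x * a) % p
        k += 1
    return k

p = 11
# A generator (primitive root) has order p - 1, i.e. its powers hit every non-zero residue.
generators = [a for a in range(1, p) if multiplicative_order(a, p) == p - 1]
print(generators)  # [2, 6, 7, 8]
```

The output matches the four generators the model named, and the expected answer “2” is the smallest of them.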

Well-declined: declined, and couldn’t
claude-3.5-sonnet · MMLU-Pro

When did the first Jaina temples appear?

It said it couldn't answer: “While I have knowledge about Jainism and its architectural history, pinpointing the exact time period of the first Jaina temples requires careful consideration. The earliest Jaina temples evolved from rock-cut caves to structural temples, but without seeing specific options, I might not be precise enough about the correct time period.”

→ It declined, and it couldn't. It answered “The first Jaina temples appeared around the 3rd-2nd century BCE during the Mauryan period.”; expected “Fifth century CE”.

Rows: top = claimed it could answer · bottom = declined
Columns: left = was actually correct · right = was actually wrong
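The 2×2 grid reduces to two booleans per item. A minimal sketch of the mapping, using the four labels from this page (the function name is hypothetical):

```python
def metacognitive_outcome(claimed_can_answer: bool, was_correct: bool) -> str:
    """Map (claim, actual result) to one of the four outcome labels."""
    if claimed_can_answer:
        # Top row of the grid: the model said it could answer.
        return "Competent" if was_correct else "Overconfident"
    # Bottom row of the grid: the model declined.
    return "Underconfident" if was_correct else "Well-declined"

# The four worked examples above, in order.
examples = [
    metacognitive_outcome(True, True),    # obesity prevalence
    metacognitive_outcome(True, False),   # last 3 digits of 2003^(2002^2001)
    metacognitive_outcome(False, True),   # generator for Z_11
    metacognitive_outcome(False, False),  # first Jaina temples
]
print(examples)
```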

Pairwise signal: pairs involving Anthropic · Claude

Match accuracy controls for the performance base-rate gap
Anthropic · Claude pairs: 66 / 171
Anthropic · Claude mean tau: +0.036
All-pairs mean: +0.037
Anthropic · Claude p<0.05: 29 (43.9%)
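The pair counts are internally consistent. The sketch below assumes 19 models enter the pairwise analysis; that number is inferred from 171 = C(19, 2) and is not stated on the page. With 4 Claude models among them, the pairs involving at least one Claude model come to 66.

```python
from math import comb

n_models = 19   # assumption: inferred from 171 total pairs, not stated on the page
n_family = 4    # "4 models in the bench" for Anthropic · Claude

total_pairs = comb(n_models, 2)
# Pairs with at least one family member = all pairs minus pairs with none.
family_pairs = total_pairs - comb(n_models - n_family, 2)

significant = 29  # pairs with p < 0.05, from the summary above
share = round(100 * significant / family_pairs, 1)

print(total_pairs, family_pairs, share)  # 171 66 43.9
```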
[Figure: Pair signal: do confidence gaps rank performance gaps? (Kendall tau-b, axis from -1.0 to 1.0). Legend: all model pairs (observed); base-rate-matched null; calibration-preserving null; Anthropic · Claude pair (filled = p<0.05); Anthropic · Claude mean; all-pairs mean.]
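Kendall tau-b measures how well one set of values ranks another while correcting for ties. Below is a minimal pure-Python sketch of the statistic. How the page builds the confidence and performance gaps for each model pair is not described, so the input data here are hypothetical, and the O(n²) loop is chosen for clarity rather than speed.

```python
from math import nan, sqrt

def kendall_tau_b(x, y):
    """Kendall tau-b: (C - D) / sqrt((C + D + Tx) * (C + D + Ty)).

    C, D = concordant / discordant pairs; Tx, Ty = pairs tied only in x / only in y.
    Pairs tied in both variables are excluded from every count.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    concordant = discordant = tied_x_only = tied_y_only = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue
            if dx == 0:
                tied_x_only += 1
            elif dy == 0:
                tied_y_only += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    untied = concordant + discordant
    denom = sqrt((untied + tied_x_only) * (untied + tied_y_only))
    # The statistic is undefined when either variable is constant.
    return (concordant - discordant) / denom if denom else nan

# Hypothetical per-item gaps for one model pair (not taken from the page).
confidence_gaps = [0.10, -0.05, 0.20, 0.00]
performance_gaps = [0.08, -0.02, 0.15, 0.03]
tau = kendall_tau_b(confidence_gaps, performance_gaps)
print(tau)  # 1.0: the confidence gaps rank the performance gaps perfectly
```

A value near zero, such as the +0.036 family mean reported above, means confidence gaps carry almost no information about the ordering of performance gaps.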