Where Anthropic · Claude sits in the cloud
Disposition: where each model draws the line
Competitive context, benchmark by benchmark
| # | Model | Fβ | Precision | Recall | Task accuracy |
|---|---|---|---|---|---|
| 1 | gemini-3-pro-preview | 0.785 | 0.77 | 0.80 | 0.65 |
| 2 | gemini-3-flash-preview | 0.782 | 0.70 | 0.88 | 0.59 |
| 3 | gemini-2.5-pro | 0.764 | 0.71 | 0.83 | 0.58 |
| 4 | claude-sonnet-4.5 | 0.753 | 0.78 | 0.72 | 0.59 |
| 5 | gpt-4o | 0.739 | 0.69 | 0.80 | 0.54 |
| 6 | deepseek-r1 | 0.733 | 0.76 | 0.70 | 0.57 |
| 7 | claude-haiku-4.5 | 0.716 | 0.79 | 0.65 | 0.48 |
| 8 | mistral-medium-3.1 | 0.703 | 0.65 | 0.77 | 0.52 |
| 9 | gpt-4o-mini | 0.686 | 0.56 | 0.88 | 0.45 |
| 10 | gemini-2.0-flash-001 | 0.683 | 0.58 | 0.82 | 0.45 |
| 11 | gpt-5.2 | 0.673 | 0.84 | 0.56 | 0.62 |
| 12 | gemini-2.5-flash | 0.670 | 0.58 | 0.79 | 0.45 |
| 13 | mistral-small-3.2-24b-instruct | 0.666 | 0.57 | 0.79 | 0.44 |
| 14 | llama-3.1-70b-instruct | 0.643 | 0.65 | 0.64 | 0.45 |
| 15 | claude-3.5-sonnet | 0.643 | 0.79 | 0.54 | 0.56 |
| 16 | llama-3.3-70b-instruct | 0.593 | 0.64 | 0.55 | 0.45 |
| 17 | qwen-2.5-72b-instruct | 0.572 | 0.74 | 0.47 | 0.47 |
| 18 | deepseek-chat | 0.567 | 0.78 | 0.45 | 0.55 |
| 19 | claude-3-haiku | 0.554 | 0.55 | 0.56 | 0.40 |
| 20 | qwen-2.5-coder-32b-instruct | 0.476 | 0.31 | 1.00 | 0.31 |
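The Fβ column combines the precision and recall columns into a single score. The page does not state which β it uses, so the sketch below takes β = 1 (the plain harmonic mean) as an assumption; the function name `f_beta` is illustrative and not taken from the benchmark's code. Because the table's inputs are rounded to two decimals, recomputed scores can differ from the published ones in the third decimal.

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall.

    beta > 1 weights recall more heavily, beta < 1 weights precision more.
    beta = 1.0 is an assumption here; the source table does not state it.
    """
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Rounded precision and recall copied from the gemini-3-pro-preview row.
score = f_beta(0.77, 0.80)
print(round(score, 3))
```

For that row the result rounds to 0.785, which matches the published value. This is consistent with β = 1 but does not confirm it.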
The four metacognitive outcomes — Anthropic · Claude
What is the worldwide prevalence of obesity?
It said it could answer: “Yes, I am familiar with global obesity statistics from WHO data. The worldwide prevalence of obesity is a well-documented public health metric that is regularly tracked and reported. While the exact number fluctuates yearly, I can provide a reliable approximate percentage.”
→ It said it could, and it did. It answered “According to WHO, approximately 13% of adults worldwide (age 18+) were obese in 2016, with rates continuing to rise globally.”; expected “13%”.
Find the last 3 digits of 2003^(2002^2001).
It said it could answer: “This is a modular arithmetic problem requiring finding the pattern of powers of 2003 modulo 1000. While it involves large numbers, it can be solved using Euler's theorem, cyclicity of powers, and careful consideration of the exponent 2002^2001 modulo φ(1000). The pattern of last 3 digits will repeat in a cycle.”
→ It said it could, but it couldn't. It answered “187”; expected “241”.
Find the generator for the finite field Z_11.
It said it couldn't answer: “The question asks for 'the' generator of Z_11, but Z_11 (the multiplicative group of the finite field) has multiple generators - specifically φ(φ(11)) = φ(10) = 4 generators. Without seeing the multiple choice options, I cannot determine which specific generator they want (e.g., 2, 6, 7, or 8 are all primitive roots mod 11). The question is ambiguous as stated.”
→ It said it couldn't, but it could. It answered “A generator for Z_11 (the multiplicative group of non-zero elements) is 2. (Other generators include 6, 7, and 8)”; expected “2”.
When did the first Jaina temples appear?
It said it couldn't answer: “While I have knowledge about Jainism and its architectural history, pinpointing the exact time period of the first Jaina temples requires careful consideration. The earliest Jaina temples evolved from rock-cut caves to structural temples, but without seeing specific options, I might not be precise enough about the correct time period.”
→ It said it couldn't, and it couldn't. It answered “The first Jaina temples appeared around the 3rd-2nd century BCE during the Mauryan period.”; expected “Fifth century CE”.
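The four outcomes above form a two-by-two grid: what the model predicted about itself (could or couldn't answer) against what actually happened (correct or incorrect). A minimal sketch of how such records could be tallied and turned into precision and recall follows. The page does not define its columns, so the mapping is an assumption: precision is taken as the share of "could" claims that were borne out, and recall as the share of correctly answered questions the model had claimed it could answer. All names (`OUTCOMES`, `tally`, `precision_recall`) are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# (said_it_could, answered_correctly) -> outcome label
OUTCOMES = {
    (True, True): "said it could, and did",
    (True, False): "said it could, but couldn't",
    (False, True): "said it couldn't, but could",
    (False, False): "said it couldn't, and couldn't",
}


def tally(records: Iterable[Tuple[bool, bool]]) -> Counter:
    """Count how many records fall into each of the four outcomes."""
    return Counter(OUTCOMES[(said, correct)] for said, correct in records)


def precision_recall(records: Iterable[Tuple[bool, bool]]) -> Tuple[float, float]:
    """Precision and recall of the model's 'I can answer' claims (assumed definitions)."""
    rows: List[Tuple[bool, bool]] = list(records)
    hits = sum(1 for said, correct in rows if said and correct)
    claimed = sum(1 for said, _ in rows if said)
    answerable = sum(1 for _, correct in rows if correct)
    precision = hits / claimed if claimed else 0.0
    recall = hits / answerable if answerable else 0.0
    return precision, recall


# The four example questions above, one per outcome, in order.
examples = [(True, True), (True, False), (False, True), (False, False)]
print(tally(examples))
print(precision_recall(examples))
```

With one example in each cell, both ratios come out to 0.5. A real evaluation would feed in one record per benchmark question.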