Positioning spread: every benchmark, one model
| Benchmark | Task acc | Confidence | F₁ | Leans |
|---|---|---|---|---|
| SQuAD (factual recall) | 0.56 | 0.39 | 0.64 | -0.18 cautious |
| MMLU-Pro (knowledge) | 0.54 | 0.83 | 0.70 | +0.28 overconfident |
| LegalBench (legal reasoning) | 0.86 | 0.73 | 0.82 | -0.13 cautious |
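
The "Leans" column appears to track the signed gap between stated confidence and task accuracy. The sketch below recomputes that gap from the rounded table values under that assumption. The function names `lean` and `label`, and the "calibrated" label for a zero gap, are hypothetical and not taken from the source.

```python
# Values copied from the table above: (task accuracy, confidence, listed lean).
rows = {
    "SQuAD": (0.56, 0.39, -0.18),
    "MMLU-Pro": (0.54, 0.83, +0.28),
    "LegalBench": (0.86, 0.73, -0.13),
}


def lean(task_acc: float, confidence: float) -> float:
    """Assumed definition: stated confidence minus task accuracy."""
    return confidence - task_acc


def label(gap: float) -> str:
    """Positive gap reads as overconfident, negative as cautious.

    The "calibrated" label for an exact zero is a hypothetical addition.
    """
    if gap > 0:
        return "overconfident"
    if gap < 0:
        return "cautious"
    return "calibrated"


for name, (acc, conf, listed) in rows.items():
    gap = lean(acc, conf)
    print(f"{name}: recomputed {gap:+.2f} {label(gap)} (listed {listed:+.2f})")
```

The recomputed gaps agree with the listed values to within 0.01, which is consistent with the listed values having been computed from unrounded inputs.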

[Figure panels: "In the full cloud"; "Pairwise signal: pairs involving claude-3.5-sonnet"]

The four metacognitive outcomes
What is the worldwide prevalence of obesity?
It said it could answer: “Yes, I am familiar with global obesity statistics from WHO data. The worldwide prevalence of obesity is a well-documented public health metric that is regularly tracked and reported. While the exact number fluctuates yearly, I can provide a reliable approximate percentage.”
→ It said it could, and it did. It answered “According to WHO, approximately 13% of adults worldwide (age 18+) were obese in 2016, with rates continuing to rise globally.”; expected “13%”.
Find the last 3 digits of 2003^(2002^2001).
It said it could answer: “This is a modular arithmetic problem requiring finding the pattern of powers of 2003 modulo 1000. While it involves large numbers, it can be solved using Euler's theorem, cyclicity of powers, and careful consideration of the exponent 2002^2001 modulo φ(1000). The pattern of last 3 digits will repeat in a cycle.”
→ It said it could, but it couldn't. It answered “187”; expected “241”.
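
The expected answer can be checked with the approach the model itself described. Since gcd(2003, 1000) = 1 and φ(1000) = 400, Euler's theorem lets the exponent 2002^2001 be reduced modulo 400 before taking the power modulo 1000. This is a minimal check using Python's built-in three-argument `pow`; the variable names are illustrative.

```python
from math import gcd

MOD = 1000
PHI = 400  # Euler's totient of 1000 = 1000 * (1 - 1/2) * (1 - 1/5)

# Euler's theorem requires the base to be coprime to the modulus.
assert gcd(2003, MOD) == 1

# Reduce the tower's exponent modulo phi(1000), then the base modulo 1000.
reduced_exponent = pow(2002, 2001, PHI)
last_three = pow(2003, reduced_exponent, MOD)

# Cross-check without the reduction: Python integers are arbitrary precision,
# so the full exponent (a few thousand digits) can be built directly.
direct = pow(2003, 2002 ** 2001, MOD)

print(reduced_exponent, last_three, direct)  # prints "352 241 241"
```

Both routes give 241, matching the expected answer rather than the model's “187”.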
No example in this selection for the remaining outcome (it said it couldn't answer, but it could).
When did the first Jaina temples appear?
It said it couldn't answer: “While I have knowledge about Jainism and its architectural history, pinpointing the exact time period of the first Jaina temples requires careful consideration. The earliest Jaina temples evolved from rock-cut caves to structural temples, but without seeing specific options, I might not be precise enough about the correct time period.”
→ It said it couldn't, and it couldn't. It answered “The first Jaina temples appeared around the 3rd-2nd century BCE during the Mauryan period.”; expected “Fifth century CE”.
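
The four outcomes above form a 2×2 grid over what the model said beforehand and whether its answer was correct. The sketch below classifies a record into one of the four cells and computes an F₁ score from the counts, treating "said it could answer" as the positive prediction and "answered correctly" as the positive label. Whether the F₁ column in the table is computed this way is an assumption; the function names and record layout are hypothetical.

```python
from collections import Counter


def outcome(said_can_answer: bool, was_correct: bool) -> str:
    """Map one record onto one of the four metacognitive outcomes."""
    if said_can_answer:
        return "said it could, and did" if was_correct else "said it could, but couldn't"
    return "said it couldn't, but could" if was_correct else "said it couldn't, and couldn't"


def f1(records: list[tuple[bool, bool]]) -> float:
    """F1 of the self-assessment, assuming 'said it could' is the positive class."""
    counts = Counter(outcome(said, correct) for said, correct in records)
    tp = counts["said it could, and did"]
    fp = counts["said it could, but couldn't"]
    fn = counts["said it couldn't, but could"]
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# The three examples shown above as (said it could answer, answered correctly):
# obesity prevalence, last three digits, first Jaina temples.
examples = [(True, True), (True, False), (False, False)]

for said, correct in examples:
    print(outcome(said, correct))
print(f"F1 over these three examples: {f1(examples):.2f}")
```

With only three records the score is illustrative rather than meaningful; the point is how the four cells feed the calculation.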