Positioning spread: every benchmark, one model
| Benchmark | Task accuracy | Confidence | F₁ | Leans (confidence − accuracy) |
|---|---|---|---|---|
| SQuAD (factual recall) | 0.54 | 0.63 | 0.74 | +0.09 overconfident |
| MMLU-Pro (knowledge) | 0.47 | 0.87 | 0.66 | +0.40 overconfident |
| LegalBench (legal reasoning) | 0.83 | 0.45 | 0.58 | -0.38 cautious |
| MathBench (competition math) | 0.80 | 0.99 | 0.89 | +0.19 overconfident |
| OmniMath (advanced math) | 0.30 | 0.58 | 0.52 | +0.28 overconfident |
| SciCode (scientific code) | 0.51 | 0.39 | 0.56 | -0.12 cautious |
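
Every value in the Leans column matches confidence minus task accuracy, with a positive gap labeled overconfident and a negative gap labeled cautious. The report does not state this formula, so treat it as an inference from the numbers. A minimal sketch that reproduces the column under that assumption (the function name `lean` and the `calibrated` label for a zero gap are hypothetical; no row in the table has a zero gap):

```python
# Rows copied from the table above: (benchmark, task accuracy, confidence).
ROWS = [
    ("SQuAD", 0.54, 0.63),
    ("MMLU-Pro", 0.47, 0.87),
    ("LegalBench", 0.83, 0.45),
    ("MathBench", 0.80, 0.99),
    ("OmniMath", 0.30, 0.58),
    ("SciCode", 0.51, 0.39),
]


def lean(accuracy: float, confidence: float) -> tuple[float, str]:
    """Return the signed gap (confidence - accuracy) and a direction label."""
    # Round to two decimals to match the table's precision and to hide
    # floating-point noise such as 0.08999999999999997.
    gap = round(confidence - accuracy, 2)
    if gap > 0:
        label = "overconfident"
    elif gap < 0:
        label = "cautious"
    else:
        label = "calibrated"  # assumed label; not used anywhere in the table
    return gap, label


for name, acc, conf in ROWS:
    gap, label = lean(acc, conf)
    print(f"{name:<10} {gap:+.2f} {label}")
```

Read this way, the largest miscalibrations are MMLU-Pro (+0.40) and LegalBench (-0.38), which lean in opposite directions.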
In the full cloud
Pairwise signal: pairs involving gpt-4o
The four metacognitive outcomes
No example in this selection.
When did the first Jaina temples appear?
It said it could answer: “The first Jaina temples are believed to have appeared around the 4th to 5th century BCE, during the time when Jainism was gaining prominence in India. This period corresponds with the rise of the Maurya Empire and the spread of Jainism under the influence of figures like Chandragupta Maurya and his grandson Ashoka, who were known to have supported Jainism. Therefore, I can answer this question without needing multiple-choice options.”
→ It said it could, but it couldn't. It answered “The first Jaina temples appeared around the 6th century BCE.”; expected “Fifth century CE”.
No example in this selection.
Find the last 3 digits of 2003^(2002^2001).
It said it couldn't answer: “To find the last 3 digits of a number, we need to compute it modulo 1000. This involves using modular arithmetic techniques such as Euler's theorem or the Chinese Remainder Theorem. Given the large exponents, the problem can be simplified using properties of modular arithmetic and possibly the Carmichael function to reduce the exponent. However, without performing detailed calculations, it's not immediately clear if the solution can be found without error. Therefore, while I can attempt to solve it, the complexity suggests that having multiple-choice options would help verify the solution.”
→ It said it couldn't, and it couldn't. It answered “947”; expected “241”.
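
The expected answer for the modular arithmetic question can be checked mechanically. The sketch below is not part of the evaluation; it is an independent check that follows the route the model itself outlined (work modulo 1000 and shrink the exponent with the Carmichael function). The variable names are illustrative.

```python
# Direct check: three-argument pow performs modular exponentiation, and
# Python integers are arbitrary precision, so the exponent 2002**2001
# (several thousand digits long) can be built explicitly.
direct = pow(2003, 2002**2001, 1000)

# Reduced check: gcd(2003, 1000) == 1, so the exponent can be taken modulo
# the Carmichael function of 1000, lcm(lambda(8), lambda(125)) = lcm(2, 100).
CARMICHAEL_1000 = 100
reduced_exponent = pow(2002, 2001, CARMICHAEL_1000)
reduced = pow(2003 % 1000, reduced_exponent, 1000)

print(direct, reduced_exponent, reduced)
```

Both routes give 241, which matches the expected answer and confirms that the model's “947” was wrong.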