Company / family view · 3 models in the bench

OpenAI · GPT

Same public data as the home page, specialized to one family. Parametric view: company_family

Where OpenAI · GPT sits in the cloud

[Figure: scatter plot. Axes: performance z-score within benchmark/probe; confidence z-score (ticks from -3 to 3). Legend: OpenAI · GPT model/condition points; all other model/condition points; equal relative confidence and pass rate.]
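The plot's axes are z-scores computed within a benchmark/probe. A minimal sketch of that standardization follows. It assumes the population standard deviation is used and that a group with no spread maps to zeros; the page does not state either choice, and the function name and input values are hypothetical.

```python
from statistics import mean, pstdev


def z_scores_within_group(values):
    """Standardize scores against the mean and spread of their own group.

    Assumption: population standard deviation (pstdev). The page does not
    say whether a population or sample estimate is used.
    """
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # Every score is identical, so no point deviates from the mean.
        return [0.0] * len(values)
    return [(v - mu) / sigma for v in values]


# Illustrative pass rates for one benchmark (made-up values, not page data).
example_pass_rates = [0.2, 0.4, 0.6]
example_z = z_scores_within_group(example_pass_rates)
```

Standardizing within each benchmark puts easy and hard benchmarks on a common scale, which is what lets points from different benchmarks share one cloud.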

Disposition: where each model draws the line

[Figure: pass rate vs. implied confidence on a 0.00 to 1.00 scale for gpt-4o-mini, gpt-4o, and gpt-5.2; a red gap marks overconfidence.]

Competitive context, benchmark by benchmark

SQuAD (factual recall) · Prospective · Fβ
#   Model                            Fβ     Prec  Rec   Task acc
1   gemini-3-pro-preview             0.785  0.77  0.80  0.65
2   gemini-3-flash-preview           0.782  0.70  0.88  0.59
3   gemini-2.5-pro                   0.764  0.71  0.83  0.58
4   claude-sonnet-4.5                0.753  0.78  0.72  0.59
5   gpt-4o                           0.739  0.69  0.80  0.54
6   deepseek-r1                      0.733  0.76  0.70  0.57
7   claude-haiku-4.5                 0.716  0.79  0.65  0.48
8   mistral-medium-3.1               0.703  0.65  0.77  0.52
9   gpt-4o-mini                      0.686  0.56  0.88  0.45
10  gemini-2.0-flash-001             0.683  0.58  0.82  0.45
11  gpt-5.2                          0.673  0.84  0.56  0.62
12  gemini-2.5-flash                 0.670  0.58  0.79  0.45
13  mistral-small-3.2-24b-instruct   0.666  0.57  0.79  0.44
14  llama-3.1-70b-instruct           0.643  0.65  0.64  0.45
15  claude-3.5-sonnet                0.643  0.79  0.54  0.56
16  llama-3.3-70b-instruct           0.593  0.64  0.55  0.45
17  qwen-2.5-72b-instruct            0.572  0.74  0.47  0.47
18  deepseek-chat                    0.567  0.78  0.45  0.55
19  claude-3-haiku                   0.554  0.55  0.56  0.40
20  qwen-2.5-coder-32b-instruct      0.476  0.31  1.00  0.31
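The ranking metric is Fβ, which combines precision and recall. The sketch below uses the standard Fβ definition; the page does not state which β it applies, so the default of β = 1 here is an assumption. With β = 1, the rounded precision and recall in the table reproduce the listed Fβ values to within rounding for the rows checked (for example, gemini-3-pro-preview and gpt-5.2).

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """Standard F-beta score: weighted harmonic mean of precision and recall.

    beta > 1 weights recall more heavily; beta < 1 weights precision more.
    Assumption: beta defaults to 1 because the page does not state its value.
    """
    b2 = beta * beta
    denominator = b2 * precision + recall
    if denominator == 0:
        # Both terms vanish, so the score is defined as 0 by convention.
        return 0.0
    return (1 + b2) * precision * recall / denominator


# Precision and recall copied from the table above (already rounded there).
gemini_3_pro = f_beta(0.77, 0.80)  # table lists 0.785
gpt_5_2 = f_beta(0.84, 0.56)       # table lists 0.673
```

The two gpt-5.2 figures differ in the third decimal, which is the size of discrepancy expected when recomputing from two-decimal inputs.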

The four metacognitive outcomes — OpenAI · GPT

Competent: claimed it could, and could

No example in this selection.

Overconfident: claimed it could, but couldn’t
gpt-4o · MMLU-Pro

When did the first Jaina temples appear?

It said it could answer: “The first Jaina temples are believed to have appeared around the 4th to 5th century BCE, during the time when Jainism was gaining prominence in India. This period corresponds with the rise of the Maurya Empire and the spread of Jainism under the influence of figures like Chandragupta Maurya and his grandson Ashoka, who were known to have supported Jainism. Therefore, I can answer this question without needing multiple-choice options.”

→ But it couldn't. It answered “The first Jaina temples appeared around the 6th century BCE.”; expected “Fifth century CE”.

Underconfident: declined, but could
gpt-5.2 · MMLU-Pro

What is the worldwide prevalence of obesity?

It said it couldn't answer: “The worldwide prevalence of obesity depends on the definition (adult BMI≥30 vs. children, age-standardized vs. crude), the year, and the data source (e.g., WHO vs. IHME). Without multiple-choice options or a specified reference, any single numeric answer could be mismatched.”

→ But it could. It answered “About 13% of adults worldwide are obese (roughly 1 in 8 people).”; expected “13%”.

Well-declined: declined, and couldn’t
gpt-4o · MMLU-Pro

Find the last 3 digits of 2003^(2002^2001).

It said it couldn't answer: “To find the last 3 digits of a number, we need to compute it modulo 1000. This involves using modular arithmetic techniques such as Euler's theorem or the Chinese Remainder Theorem. Given the large exponents, the problem can be simplified using properties of modular arithmetic and possibly the Carmichael function to reduce the exponent. However, without performing detailed calculations, it's not immediately clear if the solution can be found without error. Therefore, while I can attempt to solve it, the complexity suggests that having multiple-choice options would help verify the solution.”

→ And it couldn't. It answered “947”; expected “241”.

Rows: top = claimed it could answer · bottom = declined
Columns: left = was actually correct · right = was actually wrong
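The two-by-two layout above can be sketched as a small lookup from the two observations (did the model claim it could answer, and was its answer correct) to the four outcome labels. The function and parameter names are hypothetical; only the labels come from the page.

```python
def metacognitive_outcome(claimed_can_answer: bool, was_correct: bool) -> str:
    """Map one (claim, result) observation to one of the four outcome labels."""
    if claimed_can_answer:
        # Top row of the grid: the model said it could answer.
        return "Competent" if was_correct else "Overconfident"
    # Bottom row of the grid: the model declined.
    return "Underconfident" if was_correct else "Well-declined"


# The three examples shown in this section, expressed as (claimed, correct).
jaina_temples = metacognitive_outcome(True, False)      # gpt-4o
obesity_prevalence = metacognitive_outcome(False, True)  # gpt-5.2
last_three_digits = metacognitive_outcome(False, False)  # gpt-4o
```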

Pairwise signal: pairs involving OpenAI · GPT

Match accuracy controls for the performance base-rate gap
OpenAI · GPT pairs: 51 / 171
OpenAI · GPT mean tau: +0.025
All-pairs mean: +0.037
OpenAI · GPT p<0.05: 19 (37.3%)
[Figure: Pair signal: do confidence gaps rank performance gaps? (Kendall tau-b), axis from -1.0 to 1.0. Legend: all model pairs (observed); base-rate-matched null; calibration-preserving null; OpenAI · GPT pair (filled = p<0.05); OpenAI · GPT mean; all-pairs mean.]
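The pair signal is reported as Kendall tau-b between confidence gaps and performance gaps. The sketch below is a plain-Python version of the standard tau-b statistic with its tie correction. How the page builds the gap vectors for each model pair is not stated, so the two input lists are hypothetical and the significance test is left out.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall tau-b rank correlation between two equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    concordant = discordant = ties_x = ties_y = total = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        total += 1
        dx, dy = x1 - x2, y1 - y2
        if dx == 0:
            ties_x += 1  # pair tied on x (also counted if tied on y)
        if dy == 0:
            ties_y += 1  # pair tied on y (also counted if tied on x)
        if dx != 0 and dy != 0:
            if dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    denominator = sqrt((total - ties_x) * (total - ties_y))
    if denominator == 0:
        return float("nan")  # undefined when one sequence is constant
    return (concordant - discordant) / denominator


# Hypothetical gaps for four items (illustrative values, not page data).
confidence_gaps = [1, 2, 2, 3]
performance_gaps = [1, 2, 3, 3]
tau = kendall_tau_b(confidence_gaps, performance_gaps)
```

A tau near +1 means larger confidence gaps go with larger performance gaps; the reported means of +0.025 and +0.037 sit close to zero on that scale.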