gemini-2.0-flash-001 | The Metacognition Bench

Positioning spread: every benchmark, one model

Performance vs. confidence for gemini-2.0-flash-001, per benchmark (prospective probe).

Benchmark	Task acc	Confidence	F₁	Leans (confidence − task acc)
SQuAD (factual recall)	0.45	0.64	0.68	+0.19 overconfident
MMLU-Pro (knowledge)	0.45	0.99	0.62	+0.54 overconfident
LegalBench (legal reasoning)	0.85	0.70	0.73	-0.15 cautious
MathBench (competition math)	0.89	1.00	0.95	+0.10 overconfident
OmniMath (advanced math)	0.47	0.98	0.64	+0.51 overconfident
SciCode (scientific code)	0.51	0.95	0.65	+0.44 overconfident
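
The Leans column appears to be the signed gap between mean confidence and task accuracy. A minimal sketch that recomputes it from the rounded columns above; the function name `lean`, the tolerance `tol`, and the "calibrated" label are illustrative assumptions, not part of the bench. Because the inputs are already rounded to two decimals, a recomputed value can differ from the table by 0.01 (MathBench gives +0.11 here against the listed +0.10, presumably because the page works from unrounded values).

```python
def lean(task_acc, confidence, tol=0.005):
    """Return (gap, label), where gap = confidence - task accuracy.

    `tol` and the "calibrated" label are hypothetical; the table above
    only ever shows "overconfident" or "cautious".
    """
    gap = round(confidence - task_acc, 2)
    if gap > tol:
        label = "overconfident"
    elif gap < -tol:
        label = "cautious"
    else:
        label = "calibrated"
    return gap, label


# (benchmark, task accuracy, confidence) copied from the table above
rows = [
    ("SQuAD", 0.45, 0.64),
    ("MMLU-Pro", 0.45, 0.99),
    ("LegalBench", 0.85, 0.70),
    ("MathBench", 0.89, 1.00),
    ("OmniMath", 0.47, 0.98),
    ("SciCode", 0.51, 0.95),
]

for name, acc, conf in rows:
    gap, label = lean(acc, conf)
    print(f"{name}\t{gap:+.2f} {label}")
```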

In the full cloud

Legend: gemini-2.0-flash-001 conditions; all other model/condition points; equal relative confidence and pass rate.

Pairwise signal: pairs involving gemini-2.0-flash-001

Match accuracy controls for the performance base-rate gap

gemini-2.0-flash-001 pairs: 18 / 171
gemini-2.0-flash-001 mean tau: +0.027
All-pairs mean: +0.037
gemini-2.0-flash-001 pairs with p<0.05: 8 (44.4%)
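
The pair counts above are consistent with an all-pairs comparison: 171 unordered pairs corresponds to 19 models, each of which appears in 18 pairs, and 8 of 18 is 44.4%. A quick arithmetic check, assuming every model is paired with every other one (the page does not state the model count, and `models_from_pairs` is a hypothetical helper):

```python
from math import comb, isqrt


def models_from_pairs(total_pairs):
    """Solve n * (n - 1) / 2 == total_pairs for a whole number n."""
    n = (1 + isqrt(1 + 8 * total_pairs)) // 2
    if comb(n, 2) != total_pairs:
        raise ValueError("total_pairs is not a valid all-pairs count")
    return n


n_models = models_from_pairs(171)    # inferred number of models
pairs_per_model = n_models - 1       # pairs that involve any one model
significant_share = 8 / pairs_per_model

print(n_models, pairs_per_model, f"{significant_share:.1%}")
```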

Legend: all model pairs (observed); base-rate-matched null; calibration-preserving null; gemini-2.0-flash-001 pair (filled = p<0.05); gemini-2.0-flash-001 mean; all-pairs mean.

The four metacognitive outcomes

No curated cases for this selection yet — outcome-matrix extraction currently covers a sample of MMLU-Pro trials.
