gemini-3.1-pro-preview | The Metacognition Bench

Positioning spread: every benchmark, one model

Performance vs. confidence for gemini-3.1-pro-preview, per benchmark (prospective probe).

Benchmark	Task acc	Confidence	F₁	Leans
MMLU-Pro (knowledge)	0.66	0.79	0.81	+0.13 overconfident
LegalBench (legal reasoning)	0.87	0.96	0.93	+0.10 overconfident
MathBench (competition math)	1.00	1.00	1.00	+0.00 calibrated
OmniMath (advanced math)	0.88	0.98	0.94	+0.10 overconfident
SciCode (scientific code)	0.64	0.48	0.62	-0.15 cautious
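
The Leans column reads as mean confidence minus task accuracy. A minimal sketch of that reading follows, using the rounded values from the table. The function name `lean`, the tolerance `tol`, and the label thresholds are assumptions for illustration, not the page's own method. Recomputing from two-decimal inputs can differ from the displayed gap by 0.01 (LegalBench and SciCode here), presumably because the page works from unrounded values.

```python
# Rows copied from the table above: (benchmark, task accuracy, mean confidence).
ROWS = [
    ("MMLU-Pro", 0.66, 0.79),
    ("LegalBench", 0.87, 0.96),
    ("MathBench", 1.00, 1.00),
    ("OmniMath", 0.88, 0.98),
    ("SciCode", 0.64, 0.48),
]


def lean(acc, conf, tol=0.005):
    """Return (gap, label), where gap = confidence - accuracy.

    `tol` is a hypothetical dead band: gaps inside it count as calibrated.
    """
    gap = round(conf - acc, 2)
    if gap > tol:
        label = "overconfident"
    elif gap < -tol:
        label = "cautious"
    else:
        label = "calibrated"
    return gap, label


for name, acc, conf in ROWS:
    gap, label = lean(acc, conf)
    print(f"{name}: {gap:+.2f} {label}")
```

A positive gap means the model's stated confidence runs ahead of its accuracy; a negative gap means it underrates itself, as on SciCode.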

In the full cloud

[Scatter plot: relative confidence vs. pass rate. Legend: gemini-3.1-pro-preview conditions; all other model/condition points; line of equal relative confidence and pass rate.]

Pairwise signal: pairs involving gemini-3.1-pro-preview

Match accuracy: controls for the performance base-rate gap.

gemini-3.1-pro-preview pairs: 19 / 190
gemini-3.1-pro-preview mean tau: +0.083
All-pairs mean: +0.041
gemini-3.1-pro-preview p<0.05: 12 (63.2%)
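
The pair counts above can be sanity-checked with a little arithmetic. The sketch below assumes pairs are unordered and that each model is paired once with every other model. Under that assumption, 19 pairs for one model implies 20 models, which matches the 190 total. The variable names are illustrative only.

```python
from math import comb

# Figures taken from the summary above.
n_pairs_total = 190   # all model pairs
n_pairs_model = 19    # pairs involving gemini-3.1-pro-preview
n_significant = 12    # of those, pairs with p < 0.05

# Assumption: one unordered pair per pair of distinct models.
n_models = n_pairs_model + 1
assert comb(n_models, 2) == n_pairs_total  # C(20, 2) = 190

share = n_significant / n_pairs_model
print(f"{n_models} models, {share:.1%} of this model's pairs significant")
```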

[Plot legend: all model pairs (observed); base-rate-matched null; calibration-preserving null; gemini-3.1-pro-preview pair (filled = p<0.05); gemini-3.1-pro-preview mean; all-pairs mean.]

The four metacognitive outcomes

No curated cases for this selection yet — outcome-matrix extraction currently covers a sample of MMLU-Pro trials.
