
gemini-3.1-pro-preview

Appears in 5 benchmarks. Mean lean (confidence − pass rate): +0.03. 3/5 benchmarks lean overconfident (prospective probe).

Positioning spread: every benchmark, one model

[Figure: performance and confidence per benchmark on a 0.00 to 1.00 scale, for MMLU-Pro (knowledge), LegalBench (legal reasoning), MathBench (competition math), OmniMath (advanced math), and SciCode (scientific code). A red gap marks overconfidence.]
Performance vs. confidence for gemini-3.1-pro-preview, per benchmark (prospective probe).
| Benchmark | Task acc | Confidence | F₁ | Leans |
|---|---|---|---|---|
| MMLU-Pro (knowledge) | 0.66 | 0.79 | 0.81 | +0.13 overconfident |
| LegalBench (legal reasoning) | 0.87 | 0.96 | 0.93 | +0.10 overconfident |
| MathBench (competition math) | 1.00 | 1.00 | 1.00 | +0.00 calibrated |
| OmniMath (advanced math) | 0.88 | 0.98 | 0.94 | +0.10 overconfident |
| SciCode (scientific code) | 0.64 | 0.48 | 0.62 | -0.15 cautious |
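The lean column and the summary figures can be approximated from the rounded table values. The following is a minimal sketch, assuming lean is simply confidence minus task accuracy as the summary line states. The `label` helper and its `tol` threshold are hypothetical, not taken from the page. Because the inputs are already rounded to two decimals, individual leans can differ from the published column by about 0.01.

```python
# Task accuracy and confidence copied from the table (rounded values).
rows = {
    "MMLU-Pro": (0.66, 0.79),
    "LegalBench": (0.87, 0.96),
    "MathBench": (1.00, 1.00),
    "OmniMath": (0.88, 0.98),
    "SciCode": (0.64, 0.48),
}


def lean(acc: float, conf: float) -> float:
    """Lean = confidence minus pass rate; positive means overconfident."""
    return conf - acc


def label(value: float, tol: float = 0.005) -> str:
    """Hypothetical labelling rule: `tol` is an assumed threshold."""
    if value > tol:
        return "overconfident"
    if value < -tol:
        return "cautious"
    return "calibrated"


leans = {name: lean(acc, conf) for name, (acc, conf) in rows.items()}
mean_lean = sum(leans.values()) / len(leans)
n_over = sum(1 for v in leans.values() if label(v) == "overconfident")

for name, value in leans.items():
    print(f"{name:<12} {value:+.2f} {label(value)}")
print(f"mean lean {mean_lean:+.2f}, {n_over}/{len(leans)} overconfident")
```

With these rounded inputs the mean lean and the count of overconfident benchmarks agree with the summary line above the table.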

In the full cloud

[Figure: scatter plot of confidence z-score against performance z-score within benchmark/probe, both axes running from -3 to 3. Legend: gemini-3.1-pro-preview conditions; all other model/condition points; line of equal relative confidence and pass rate.]
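The axes of this plot are z-scores computed within each benchmark/probe. As a rough sketch of that standardization, assuming a population standard deviation (the page does not say which variant it uses), and using made-up accuracy values purely for illustration:

```python
from statistics import mean, pstdev


def zscores(values: list[float]) -> list[float]:
    """Standardize values within one benchmark/probe group.

    Assumption: population standard deviation; the page may use the
    sample version instead. A group with no spread maps to all zeros.
    """
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Illustrative (not real) accuracies of three models on one benchmark.
example_accuracies = [0.5, 0.7, 0.9]
print(zscores(example_accuracies))
```

A point on the line of equal relative confidence and pass rate would have the same z-score on both axes.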

Pairwise signal: pairs involving gemini-3.1-pro-preview

Match accuracy controls for the performance base-rate gap
gemini-3.1-pro-preview pairs: 19 / 190
gemini-3.1-pro-preview mean tau: +0.083
All-pairs mean: +0.041
gemini-3.1-pro-preview p<0.05: 12 (63.2%)
[Figure: Pair signal: do confidence gaps rank performance gaps? (Kendall tau-b), on an axis from -1.0 to 1.0. Legend: all model pairs (observed); base-rate-matched null; calibration-preserving null; gemini-3.1-pro-preview pair (filled = p<0.05); gemini-3.1-pro-preview mean; all-pairs mean.]
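The pair signal is reported as Kendall tau-b. The sketch below is a plain-Python version of the standard tau-b statistic, not the page's own pipeline. How the page builds the two input sequences for a model pair is not stated, so the example treats them as generic lists of confidence gaps and performance gaps with illustrative values.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(x: list[float], y: list[float]) -> float:
    """Kendall tau-b: rank correlation with a correction for ties."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both variables: excluded from both terms
        if dx == 0:
            ties_x += 1  # tied only in x
        elif dy == 0:
            ties_y += 1  # tied only in y
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    denom = sqrt((untied + ties_x) * (untied + ties_y))
    return (concordant - discordant) / denom if denom else float("nan")


# Illustrative (not real) gaps between two models across items.
confidence_gaps = [0.10, 0.05, -0.02, 0.20]
performance_gaps = [0.08, 0.01, -0.05, 0.15]
print(kendall_tau_b(confidence_gaps, performance_gaps))
```

A positive tau means larger confidence gaps tend to go with larger performance gaps; a value near zero means the confidence gaps carry little ranking information.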

The four metacognitive outcomes

No curated cases for this selection yet — outcome-matrix extraction currently covers a sample of MMLU-Pro trials.