Positioning spread: every benchmark, one model
| Benchmark | Task acc | Confidence | F₁ | Leans |
|---|---|---|---|---|
| SQuAD (factual recall) | 0.65 | 0.68 | 0.79 | +0.03 calibrated |
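
The "Leans" value in the SQuAD row is consistent with confidence minus task accuracy (0.68 − 0.65 = +0.03). The sketch below reproduces that arithmetic under that reading. The function name `lean`, the tolerance `tol`, and the "overconfident" / "underconfident" labels are hypothetical and do not come from the source; only the "calibrated" label and the three numbers do.

```python
def lean(task_acc: float, confidence: float, tol: float = 0.05) -> tuple[float, str]:
    """Signed gap between stated confidence and task accuracy, plus a label.

    Assumption: "Leans" = confidence - task accuracy. The tolerance `tol`
    and the two non-"calibrated" labels are illustrative, not from the source.
    """
    gap = round(confidence - task_acc, 2)
    if abs(gap) <= tol:
        label = "calibrated"
    elif gap > 0:
        label = "overconfident"
    else:
        label = "underconfident"
    return gap, label


# SQuAD row from the table: task accuracy 0.65, confidence 0.68.
gap, label = lean(0.65, 0.68)
print(f"{gap:+.2f} {label}")  # prints "+0.03 calibrated"
```

A positive gap means confidence exceeds accuracy; where the page actually draws the line between "calibrated" and a lean in either direction is not stated here.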
In the full cloud
[Figure legend: gemini-3-pro-preview conditions; all other model/condition points; reference line where relative confidence equals pass rate.]
Pairwise signal: pairs involving gemini-3-pro-preview
Match accuracy: controls for the performance base-rate gap.
| Statistic | Value |
|---|---|
| gemini-3-pro-preview pairs | 18 / 171 |
| gemini-3-pro-preview mean tau | +0.086 |
| All-pairs mean | +0.037 |
| gemini-3-pro-preview pairs with p<0.05 | 16 (88.9%) |
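
The pair counts above can be sanity-checked with a few lines of arithmetic. A total of 171 unordered pairs matches 19 models (19 choose 2), which would give any single model 18 pairs. The model count of 19 is an inference from those two numbers, not a figure stated in the source.

```python
from math import comb

# Assumption: 19 models, inferred from 171 total pairs (not stated in the source).
n_models = 19

total_pairs = comb(n_models, 2)      # unordered pairs across all models
pairs_with_one_model = n_models - 1  # pairs that include one fixed model

# From the source: 16 of the gemini-3-pro-preview pairs reach p<0.05.
significant = 16
share = significant / pairs_with_one_model

print(total_pairs, pairs_with_one_model, f"{share:.1%}")  # prints "171 18 88.9%"
```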
[Figure legend: all model pairs (observed); base-rate-matched null; calibration-preserving null; gemini-3-pro-preview pair (filled = p<0.05); gemini-3-pro-preview mean; all-pairs mean.]
The four metacognitive outcomes
No curated cases for this selection yet; outcome-matrix extraction currently covers a sample of MMLU-Pro trials.