
mistral-medium-3.1

Appears in 6 benchmarks
Mean lean (confidence − pass rate): -0.02
3/6 benchmarks lean overconfident (prospective probe)

Positioning spread: every benchmark, one model

Figure: Performance vs. confidence for mistral-medium-3.1, per benchmark (prospective probe). Scale 0.00 to 1.00; series: performance and confidence (red gap = overconfident).
Benchmark                      Task acc  Confidence  F₁    Leans
SQuAD (factual recall)         0.52      0.61        0.70  +0.09 overconfident
MMLU-Pro (knowledge)           0.48      0.80        0.65  +0.32 overconfident
LegalBench (legal reasoning)   0.86      0.88        0.86  +0.03 calibrated
MathBench (competition math)   0.94      0.87        0.92  -0.07 cautious
OmniMath (advanced math)       0.55      0.31        0.60  -0.24 cautious
SciCode (scientific code)      0.48      0.21        0.34  -0.27 cautious
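The "Leans" column and the mean lean follow from the definition given at the top of the page (confidence − pass rate). The sketch below recomputes them from the rounded table values; it assumes that "Task acc" is the pass rate in that definition, and the names `rows` and `lean` are illustrative only. Because the inputs are already rounded to two decimals, results can differ from the page in the last digit (for example LegalBench, where the page shows +0.03). The thresholds behind the overconfident/calibrated/cautious labels are not stated on the page, so the sketch does not reproduce them.

```python
# (task accuracy, confidence) per benchmark, copied from the table above.
rows = {
    "SQuAD": (0.52, 0.61),
    "MMLU-Pro": (0.48, 0.80),
    "LegalBench": (0.86, 0.88),
    "MathBench": (0.94, 0.87),
    "OmniMath": (0.55, 0.31),
    "SciCode": (0.48, 0.21),
}


def lean(acc: float, conf: float) -> float:
    """Lean = confidence minus pass rate; positive means overconfident."""
    return conf - acc


leans = {name: lean(acc, conf) for name, (acc, conf) in rows.items()}
mean_lean = sum(leans.values()) / len(leans)
n_positive = sum(1 for value in leans.values() if value > 0)

for name, value in leans.items():
    print(f"{name:<12}{value:+.2f}")
print(f"mean lean: {mean_lean:+.3f}")
print(f"benchmarks with positive lean: {n_positive}/{len(leans)}")
```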

In the full cloud

Figure: Confidence z-score vs. performance z-score within benchmark/probe, both axes from -3 to 3. Legend: mistral-medium-3.1 conditions; all other model/condition points; equal relative confidence and pass rate.

Pairwise signal: pairs involving mistral-medium-3.1

Match accuracy controls for the performance base-rate gap
mistral-medium-3.1 pairs: 18 / 171
mistral-medium-3.1 mean tau: +0.024
All-pairs mean: +0.037
mistral-medium-3.1 p<0.05: 7 (38.9%)
Figure: Pair signal: do confidence gaps rank performance gaps? (Kendall tau-b), axis from -1.0 to 1.0. Legend: all model pairs (observed); base-rate-matched null; calibration-preserving null; mistral-medium-3.1 pair (filled = p<0.05); mistral-medium-3.1 mean; all-pairs mean.
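As a rough illustration of the pair signal, the sketch below computes Kendall tau-b between confidence gaps and performance gaps for one model pair. The page does not say which units are ranked, so the sketch assumes one gap per benchmark; the gap values are made up for illustration and are not taken from this page, and `kendall_tau_b` is a hand-written helper, not the site's code.

```python
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall tau-b: (C - D) / sqrt((C + D + Tx) * (C + D + Ty)).

    C/D count concordant/discordant pairs; Tx/Ty count pairs tied
    only in x or only in y. Pairs tied in both are ignored.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    concordant = discordant = ties_x = ties_y = 0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0 and dy == 0:
                continue
            if dx == 0:
                ties_x += 1
            elif dy == 0:
                ties_y += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    denom = sqrt((concordant + discordant + ties_x)
                 * (concordant + discordant + ties_y))
    return (concordant - discordant) / denom if denom else float("nan")


# Hypothetical per-benchmark gaps (model A minus model B); illustrative only.
confidence_gap = [0.10, -0.05, 0.20, 0.00, -0.15, 0.05]
performance_gap = [0.08, 0.02, 0.15, -0.03, -0.10, 0.01]

tau = kendall_tau_b(confidence_gap, performance_gap)
print(f"tau-b = {tau:+.3f}")
```

A tau near +1 would mean that whenever one model is more confident than the other, it also tends to perform better; a tau near 0, like the means reported above, indicates that confidence gaps carry little ranking information about performance gaps.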

The four metacognitive outcomes

No curated cases for this selection yet — outcome-matrix extraction currently covers a sample of MMLU-Pro trials.