Does Our Model Know What It's Doing? Round 1 Calibration Report
32 games, real probabilities, no excuses
Greg Lamp · March 22, 2026
Four 9-seeds. Zero 8-seeds. Every single 8-9 matchup went to the underdog. If you're a bracket picker, you know the headlines: 8-seeds got swept, chalk got burned, pundits explaining why the 9-seeds "wanted it more."
Here's what the model actually said before the games were played.
BXS Model: Predicted vs. Actual Win Rate by Confidence Bin
Predicted = midpoint of each probability bin. Actual = observed win rate. Well-calibrated bars should be roughly equal.
A well-calibrated model is simple in theory: if you say 70%, it should happen 70% of the time. Not in any single game, but across many games at that level. I use ELO ratings, a margin-of-victory multiplier, and a fitted logistic function to convert rating gaps into win probabilities. This round gives me 32 games to check whether those probabilities hold up. It matters practically: if the model says 70% and the actual rate is 55%, you should discount everything it tells you. If the actual rate is also 70%, the numbers mean what they say.
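If you want the mechanics, the gap-to-probability step looks roughly like this. A minimal sketch: the slope here is the classic ELO constant, standing in for the fitted BXS coefficient, which this post doesn't publish.

```python
import math

# Classic ELO slope, ln(10)/400 ≈ 0.00576. The BXS slope is fitted from
# game data; treat this constant as an illustrative stand-in.
K = math.log(10) / 400

def win_probability(elo_a: float, elo_b: float, k: float = K) -> float:
    """Logistic conversion of an ELO rating gap into team A's win probability."""
    return 1.0 / (1.0 + math.exp(-k * (elo_a - elo_b)))

# A 265-point gap maps to roughly an 82% favorite under the classic slope.
print(round(win_probability(1950.0, 1685.0), 3))  # 0.821
```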
Top-Line Results
25 of 32 correct (78.1%). Seed baseline goes 24/32 (75%). BXS edges it.
Brier Score: 0.1165 (lower is better; always-50% scores 0.25). Brier Skill Score: 0.534, meaning 53% better than a coin-flip baseline. Log loss: 0.3604 (coin-flip model scores 0.69; 0.36 puts us roughly halfway between useless and flawless, which is about right for 32 tournament games).
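If you want to check the arithmetic, here's how all three numbers fall out of a list of (probability, outcome) pairs. A minimal sketch, not the actual scoring code behind this report:

```python
import math

def calibration_scores(preds: list[tuple[float, int]]) -> dict[str, float]:
    """Brier score, Brier skill score, and log loss for (p, y) pairs,
    where p is the favorite's predicted win probability and y is 1 if
    the favorite won, else 0."""
    n = len(preds)
    brier = sum((p - y) ** 2 for p, y in preds) / n
    # An always-50% model scores (0.5 - y)^2 = 0.25 on every game.
    bss = 1.0 - brier / 0.25
    log_loss = -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                    for p, y in preds) / n
    return {"brier": brier, "bss": bss, "log_loss": log_loss}

# Sanity check: a pure coin-flip model.
print(calibration_scores([(0.5, 1), (0.5, 0)]))
# {'brier': 0.25, 'bss': 0.0, 'log_loss': 0.693...}
```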
Probabilities come from the pre-tournament ELO snapshot captured before the First Four tipped off. The model never sees game outcomes to generate predictions; it only uses results to update ELO after each game completes.
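For context on the update step, here's the general shape of a margin-of-victory-weighted ELO update. The log-margin multiplier with an autocorrelation damper is the well-known FiveThirtyEight form; the K-factor and constants are illustrative stand-ins, not the BXS parameters.

```python
import math

def update_elo(winner_elo: float, loser_elo: float, margin: int,
               k: float = 20.0) -> tuple[float, float]:
    """One post-game ELO update, weighted by margin of victory."""
    # Expected win probability for the eventual winner, classic ELO form.
    expected = 1.0 / (1.0 + 10 ** ((loser_elo - winner_elo) / 400))
    # Log-margin multiplier with an autocorrelation damper (FiveThirtyEight
    # form); it shrinks the reward when a heavy favorite runs up the score.
    mov = math.log(abs(margin) + 1) * (2.2 / ((winner_elo - loser_elo) * 0.001 + 2.2))
    shift = k * mov * (1.0 - expected)
    return winner_elo + shift, loser_elo - shift

# A one-point upset by a 100-point underdog (hypothetical ratings):
print(update_elo(1650.0, 1750.0, margin=1))  # winner up ~9, loser down ~9
```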
The 50-60% Bin
That dip at the left of the chart deserves a close read. Seven games landed in the 50-60% range, and favorites won only 2 of them (28.6%). That sounds bad until you see which seven: four are 8v9 games (Clemson/Iowa at 50.2% for Iowa, Villanova/Utah St. 52.0%, Ohio St./TCU 54.2%, Georgia/Saint Louis 59.6%), plus Saint Mary's/Texas A&M (57.5%), Miami (FL)/Missouri (56.7%), and North Carolina/VCU (52.5%). Five of the seven were essentially pick-'ems. Getting 28.6% on a bin of near-coinflips is noise, not a calibration failure. You'd need ~50 games in this range before drawing any real conclusions.
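You can put a number on "noise." Assuming the seven games are independent and plugging in the bin's average favorite probability (~54.7%), two or fewer favorite wins is about a 1-in-6 outcome:

```python
from math import comb

def prob_at_most(k: int, n: int, p: float) -> float:
    """Binomial tail: chance the favorite wins k or fewer of n games."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

# Seven near-coinflips at the bin's average favorite probability:
print(round(prob_at_most(2, 7, 0.547), 3))  # 0.157, about 1-in-6
```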
The 90-100% bin went 11 for 11. Every game the model was confident about, the favorite won.
By Seed Matchup
BXS Predicted vs. Actual Win Rate by Seed Matchup
Every seed class where BXS confidence was high (1v16, 2v15, 3v14, 4v13) went 16 for 16.
Every seed class where the model was genuinely confident (1v16, 2v15, 3v14, 4v13) went 16 for 16. Average predicted probability in those matchups: 87.9% to 97.6%.
The 8v9 bar hits zero. Predicted 53.9%, actual 0%. But this is the same bin we just discussed: four designed coinflips that all went one way. The joint probability of all four 9-seeds winning: 0.502 × 0.458 × 0.480 × 0.404 ≈ 4.5%. A 1-in-22 event. Unlikely, but not diagnostic. What would actually concern me is if these games were at 65-70% and all went wrong. At 54%, the model is saying "I don't know." Getting swept on four "I don't know" games is annoying, not a signal the model is broken.
All 32 Games
| Region | Seeds | Matchup | BXS Prob | Winner | Result |
|---|---|---|---|---|---|
| South | 1v16 | Florida vs. Prairie View | 98.4% | Florida | ✓ |
| Midwest | 1v16 | Michigan vs. Howard | 97.8% | Michigan | ✓ |
| West | 1v16 | Arizona vs. LIU | 97.3% | Arizona | ✓ |
| East | 1v16 | Duke vs. Siena | 97.0% | Duke | ✓ |
| West | 2v15 | Purdue vs. Queens (NC) | 96.2% | Purdue | ✓ |
| South | 2v15 | Houston vs. Idaho | 96.2% | Houston | ✓ |
| Midwest | 2v15 | Iowa St. vs. Tennessee St. | 94.0% | Iowa St. | ✓ |
| East | 2v15 | UConn vs. Furman | 92.0% | UConn | ✓ |
| West | 4v13 | Arkansas vs. Hawaii | 91.2% | Arkansas | ✓ |
| Midwest | 3v14 | Virginia vs. Wright St. | 90.9% | Virginia | ✓ |
| South | 4v13 | Nebraska vs. Troy | 90.7% | Nebraska | ✓ |
| South | 3v14 | Illinois vs. Penn | 89.0% | Illinois | ✓ |
| East | 3v14 | Michigan St. vs. North Dakota St. | 88.8% | Michigan St. | ✓ |
| West | 3v14 | Gonzaga vs. Kennesaw St. | 87.8% | Gonzaga | ✓ |
| South | 5v12 | Vanderbilt vs. McNeese | 86.5% | Vanderbilt | ✓ |
| East | 4v13 | Kansas vs. California Baptist | 85.4% | Kansas | ✓ |
| Midwest | 4v13 | Alabama vs. Hofstra | 84.2% | Alabama | ✓ |
| East | 5v12 | St. John's vs. UNI | 80.5% | St. John's | ✓ |
| West | 5v12 | Wisconsin vs. High Point | 80.5% | High Point | ✗ |
| Midwest | 6v11 | Tennessee vs. Miami (OH) | 75.6% | Tennessee | ✓ |
| Midwest | 5v12 | Texas Tech vs. Akron | 73.8% | Texas Tech | ✓ |
| East | 6v11 | Louisville vs. South Fla. | 67.0% | Louisville | ✓ |
| East | 7v10 | UCLA vs. UCF | 65.9% | UCLA | ✓ |
| West | 6v11 | BYU vs. Texas | 65.4% | Texas | ✗ |
| Midwest | 7v10 | Kentucky vs. Santa Clara | 60.5% | Kentucky | ✓ |
| Midwest | 8v9 | Georgia vs. Saint Louis | 59.6% | Saint Louis | ✗ |
| South | 7v10 | Saint Mary's vs. Texas A&M | 57.5% | Texas A&M | ✗ |
| West | 7v10 | Miami (FL) vs. Missouri | 56.7% | Miami (FL) | ✓ |
| East | 8v9 | Ohio St. vs. TCU | 54.2% | TCU | ✗ |
| South | 6v11 | North Carolina vs. VCU | 52.5% | VCU | ✗ |
| West | 8v9 | Villanova vs. Utah St. | 52.0% | Utah St. | ✗ |
| South | 8v9 | Clemson vs. Iowa | 49.8% | Iowa | ✓* |
BXS Prob = model's pre-game win probability for the listed team. ✓ = model's favored team won. ✗ = model's lean lost. ✓* = model had Iowa as slight favorite (50.2%); Iowa won. Correct model call, seed upset. Results via NCAA bracket.
The Clemson/Iowa asterisk is worth noting. By seed, Iowa winning counts as an upset. By the model, Iowa was the slight favorite and won. The 25/32 accuracy already reflects this as a correct call.
High Point
Wisconsin was an 80.5% favorite. They lost to High Point 83-82. One point.
The interesting thing is the model already knew High Point was real. Their pre-tournament ELO of 1682.6 was higher than every 13, 14, 15, and 16 seed in the field. The committee put them on the 12-line; the model saw a team that outrated most small-conference entries. Their 19.6% upset probability was second-highest among all 12-16 seeds. Only Akron got better odds (26.2%), and Akron lost.
Wisconsin still deserved to be the 80.5% favorite. But this wasn't a Cinderella story the model missed. It was a legitimate 1-in-5 shot that came in.
The Verdict
Going in, I wanted two things: high-confidence predictions to hold up, and low-confidence predictions to stay honest about their uncertainty. Both happened. Every game above 90% went to the correct team. The games below 60% were a mess, but that's expected. They're supposed to be close.
BSS of 0.534. Seven misses. Right when confident, uncertain when uncertain. That's what calibration is supposed to look like.
Round 2 starts Saturday. New probabilities for all 16 games at /march-madness.