Boxscorus


Does Our Model Know What It's Doing? Round 1 Calibration Report

32 games, real probabilities, no excuses

Greg Lamp · March 22, 2026



Four 9-seeds. Zero 8-seeds. Every single 8-9 matchup went to the underdog. If you're a bracket picker, you know the headlines: 8-seeds got swept, chalk got burned, pundits explaining why the 9-seeds "wanted it more."

Here's what the model actually said before the games were played.

BXS Model: Predicted vs. Actual Win Rate by Confidence Bin

Predicted = midpoint of each probability bin. Actual = observed win rate. Well-calibrated bars should be roughly equal.

A well-calibrated model is simple in theory: if you say 70%, it should happen 70% of the time. Not in any single game, but across many games at that level. I use ELO ratings, a margin-of-victory multiplier, and a fitted logistic function to convert rating gaps into win probabilities. This round gives me 32 games to check whether those probabilities hold up. It matters practically: if the model says 70% and the actual rate is 55%, you should discount everything it tells you. If the actual rate is also around 70%, the numbers mean what they say.
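
The rating-gap-to-probability conversion can be sketched with the standard logistic curve. The 400-point scale below is the classic chess default, used here as an illustrative assumption; it stands in for the model's fitted logistic parameters, which aren't published.

```python
def win_prob(elo_a: float, elo_b: float, scale: float = 400.0) -> float:
    """Logistic conversion of an ELO gap into a win probability.

    `scale` is an assumed parameter: 400 is the classic chess value,
    standing in for the model's fitted logistic coefficients.
    """
    return 1.0 / (1.0 + 10.0 ** (-(elo_a - elo_b) / scale))
```

Under this assumption, equal ratings map to exactly 50%, and a 100-point gap maps to roughly a 64% favorite.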

Top-Line Results

25 of 32 correct (78.1%). Seed baseline goes 24/32 (75%). BXS edges it.

Brier Score: 0.1165 (lower is better; always-50% scores 0.25). Brier Skill Score: 0.534, meaning 53% better than a coin-flip baseline. Log loss: 0.3604 (coin-flip model scores 0.69; 0.36 puts us roughly halfway between useless and flawless, which is about right for 32 tournament games).
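
For anyone who wants to check these numbers, all three metrics are easy to compute from (probability, outcome) pairs. A minimal sketch, not the site's actual code:

```python
import math

def brier(probs, outcomes):
    """Mean squared error between predicted probability and the 0/1 outcome."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)

def brier_skill(probs, outcomes, baseline=0.5):
    """1 - Brier/Brier_ref, where the reference is an always-`baseline` model."""
    ref = brier([baseline] * len(outcomes), outcomes)
    return 1.0 - brier(probs, outcomes) / ref

def log_loss(probs, outcomes):
    """Average negative log-likelihood of the observed outcomes."""
    return -sum(o * math.log(p) + (1 - o) * math.log(1 - p)
                for p, o in zip(probs, outcomes)) / len(probs)
```

An always-50% model scores a Brier of exactly 0.25 and a log loss of ln 2 ≈ 0.693, which is where the baselines above come from; 1 − 0.1165/0.25 reproduces the reported skill score of 0.534.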

Probabilities come from the pre-tournament ELO snapshot captured before the First Four tipped off. The model never sees game outcomes to generate predictions; it only uses results to update ELO after each game completes.
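
The post-game update step can be sketched as follows. The K-factor and log-margin damping below are common choices, assumed for illustration rather than taken from the model's fitted values.

```python
import math

def elo_update(rating, opp_rating, outcome, margin, k=32.0):
    """One post-game ELO update.

    `outcome` is 1 for a win, 0 for a loss; `margin` is the absolute
    point differential. K=32 and the log-margin multiplier are
    illustrative assumptions, not the model's actual parameters.
    """
    expected = 1.0 / (1.0 + 10.0 ** (-(rating - opp_rating) / 400.0))
    mov_mult = math.log(abs(margin) + 1.0)  # damp blowouts sublinearly
    return rating + k * mov_mult * (outcome - expected)
```

The key property: the rating moves toward the result in proportion to how surprising it was, so an upset win moves a rating much more than an expected one.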

The 50-60% Bin

That dip at the left of the chart deserves a close read. Seven games landed in the 50-60% range, and favorites won only 2 of them (28.6%). That sounds bad until you see which seven: four are 8v9 games (Clemson/Iowa, with Iowa a 50.2% favorite; Villanova/Utah St. 52.0%; Ohio St./TCU 54.2%; Georgia/Saint Louis 59.6%), plus Saint Mary's/Texas A&M (57.5%), Miami (FL)/Missouri (56.7%), and North Carolina/VCU (52.5%). Five of the seven were essentially pick-'ems. Going 28.6% on a bin of near-coinflips is noise, not a calibration failure. You'd need ~50 games in this range before drawing any real conclusions.
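
The binning behind the chart is simple to reproduce. A sketch, assuming ten equal-width bins keyed by the favorite's pre-game probability:

```python
def calibration_bins(probs, outcomes, n_bins=10):
    """Group (favorite probability, 0/1 outcome) pairs into equal-width
    bins; report each bin's midpoint, observed win rate, and size."""
    bins = {}
    for p, o in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # clamp p == 1.0 into top bin
        bins.setdefault(idx, []).append(o)
    return {idx: ((idx + 0.5) / n_bins,   # predicted (bin midpoint)
                  sum(os) / len(os),      # actual (observed win rate)
                  len(os))                # games in bin
            for idx, os in sorted(bins.items())}
```

Feeding it the seven favorite probabilities from this bin, with 1s for the two favorites that won, reproduces the 0.55-midpoint bar at a 2/7 ≈ 28.6% observed rate.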

The 90-100% bin went 11 for 11. Every game the model was confident about, the favorite won.

By Seed Matchup

BXS Predicted vs. Actual Win Rate by Seed Matchup

Every seed class where BXS confidence was high (1v16, 2v15, 3v14, 4v13) went 16 for 16.

The average predicted probability in those four matchup classes ranged from 87.9% to 97.6%.

The 8v9 bar hits zero: predicted 53.9%, actual 0%. But this is the same bin we just discussed: four near-coinflips that all went one way. The joint probability of all four 9-seeds winning is 0.502 × 0.458 × 0.480 × 0.404 ≈ 4.5%, a 1-in-22 event. Unlikely, but not diagnostic. What would actually concern me is if these games had sat at 65-70% and all gone wrong. At 54%, the model is saying "I don't know." Getting swept on four "I don't know" games is annoying, not a signal the model is broken.
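
The 4.5% figure is just the product of the model's own underdog probabilities, treating the four games as independent (reasonable here, since they involve disjoint teams):

```python
# Model's pre-game win probabilities for the four 8v9 underdogs:
# Iowa (50.2%), TCU (45.8%), Utah St. (48.0%), Saint Louis (40.4%).
underdog_probs = [0.502, 0.458, 0.480, 0.404]

joint = 1.0
for p in underdog_probs:
    joint *= p

print(round(joint, 4))  # prints 0.0446 — roughly 1-in-22
```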

All 32 Games
| Region | Seeds | Matchup | BXS Prob | Winner | Result |
| --- | --- | --- | --- | --- | --- |
| South | 1v16 | Florida vs. Prairie View | 98.4% | Florida | ✓ |
| Midwest | 1v16 | Michigan vs. Howard | 97.8% | Michigan | ✓ |
| West | 1v16 | Arizona vs. LIU | 97.3% | Arizona | ✓ |
| East | 1v16 | Duke vs. Siena | 97.0% | Duke | ✓ |
| West | 2v15 | Purdue vs. Queens (NC) | 96.2% | Purdue | ✓ |
| South | 2v15 | Houston vs. Idaho | 96.2% | Houston | ✓ |
| Midwest | 2v15 | Iowa St. vs. Tennessee St. | 94.0% | Iowa St. | ✓ |
| East | 2v15 | UConn vs. Furman | 92.0% | UConn | ✓ |
| West | 4v13 | Arkansas vs. Hawaii | 91.2% | Arkansas | ✓ |
| Midwest | 3v14 | Virginia vs. Wright St. | 90.9% | Virginia | ✓ |
| South | 4v13 | Nebraska vs. Troy | 90.7% | Nebraska | ✓ |
| South | 3v14 | Illinois vs. Penn | 89.0% | Illinois | ✓ |
| East | 3v14 | Michigan St. vs. North Dakota St. | 88.8% | Michigan St. | ✓ |
| West | 3v14 | Gonzaga vs. Kennesaw St. | 87.8% | Gonzaga | ✓ |
| South | 5v12 | Vanderbilt vs. McNeese | 86.5% | Vanderbilt | ✓ |
| East | 4v13 | Kansas vs. California Baptist | 85.4% | Kansas | ✓ |
| Midwest | 4v13 | Alabama vs. Hofstra | 84.2% | Alabama | ✓ |
| East | 5v12 | St. John's vs. UNI | 80.5% | St. John's | ✓ |
| West | 5v12 | Wisconsin vs. High Point | 80.5% | High Point | ✗ |
| Midwest | 6v11 | Tennessee vs. Miami (OH) | 75.6% | Tennessee | ✓ |
| Midwest | 5v12 | Texas Tech vs. Akron | 73.8% | Texas Tech | ✓ |
| East | 6v11 | Louisville vs. South Fla. | 67.0% | Louisville | ✓ |
| East | 7v10 | UCLA vs. UCF | 65.9% | UCLA | ✓ |
| West | 6v11 | BYU vs. Texas | 65.4% | Texas | ✗ |
| Midwest | 7v10 | Kentucky vs. Santa Clara | 60.5% | Kentucky | ✓ |
| Midwest | 8v9 | Georgia vs. Saint Louis | 59.6% | Saint Louis | ✗ |
| South | 7v10 | Saint Mary's vs. Texas A&M | 57.5% | Texas A&M | ✗ |
| West | 7v10 | Miami (FL) vs. Missouri | 56.7% | Miami (FL) | ✓ |
| East | 8v9 | Ohio St. vs. TCU | 54.2% | TCU | ✗ |
| South | 6v11 | North Carolina vs. VCU | 52.5% | VCU | ✗ |
| West | 8v9 | Villanova vs. Utah St. | 52.0% | Utah St. | ✗ |
| South | 8v9 | Clemson vs. Iowa | 49.8% | Iowa | ✓* |

BXS Prob = model's pre-game win probability for the listed team. ✓ = model's favored team won. ✗ = model's lean lost. ✓* = model had Iowa as slight favorite (50.2%); Iowa won. Correct model call, seed upset. Results via NCAA bracket.

The Clemson/Iowa asterisk is worth noting. By seed, Iowa winning counts as an upset. By the model, Iowa was the slight favorite and won. The 25/32 accuracy already reflects this as a correct call.

High Point

Wisconsin was 80.5%. They lost to High Point 83-82. One point.

The interesting thing is the model already knew High Point was real. Their pre-tournament ELO of 1682.6 was higher than every 13, 14, 15, and 16 seed in the field. The committee put them on the 12-line; the model saw a team that outrated most small-conference entries. Their 19.6% upset probability was second-highest among all 12-16 seeds. Only Akron got better odds (26.2%), and Akron lost.

Wisconsin still deserved to be the 80.5% favorite. But this wasn't a Cinderella story the model missed. It was a legitimate 1-in-5 shot that came in.

The Verdict

Going in, I wanted two things: high-confidence predictions to hold up, and low-confidence predictions to stay honest about their uncertainty. Both happened. Every game above 90% went to the correct team. The games below 60% were a mess, but that's expected. They're supposed to be close.

BSS of 0.534. Seven misses. Right when confident, uncertain when uncertain. That's what calibration is supposed to look like.

Round 2 starts Saturday. New probabilities for all 16 games at /march-madness.
