Does Our Model Know What It's Doing? Round 1 Calibration Report
32 games, real probabilities, no excuses
Greg Lamp · March 22, 2026
Four 9-seeds. Zero 8-seeds. Every single 8-9 matchup went to the underdog. If you're a bracket picker, you know the headlines: 8-seeds got swept, chalk got burned, pundits explaining why the 9-seeds "wanted it more."
Here's what the model actually said before the games were played.
BXS Model: Predicted vs. Actual Win Rate by Confidence Bin
Predicted = midpoint of each probability bin. Actual = observed win rate. Well-calibrated bars should be roughly equal.
A well-calibrated model is simple in theory: if you say 70%, it should happen 70% of the time. Not in any single game, but across many games at that level. I use ELO ratings, a margin-of-victory multiplier, and a fitted logistic function to convert rating gaps into win probabilities. This round gives me 32 games to check whether those probabilities hold up. It matters practically: if the model says 70% and the actual rate is 55%, you should discount everything it tells you. If the actual rate is also 70%, the numbers mean what they say.
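If you want the mechanics, the gap-to-probability step looks roughly like this. A minimal sketch: the slope here is the classic ELO constant, standing in for the fitted BXS coefficient, which this post doesn't publish.

```python
import math

# Classic ELO slope, ln(10)/400 ≈ 0.00576. The BXS slope is fitted from
# game data; treat this constant as an illustrative stand-in.
K = math.log(10) / 400

def win_probability(elo_a: float, elo_b: float, k: float = K) -> float:
    """Logistic conversion of an ELO rating gap into team A's win probability."""
    return 1.0 / (1.0 + math.exp(-k * (elo_a - elo_b)))

# A 265-point gap maps to roughly an 82% favorite under the classic slope.
print(round(win_probability(1950.0, 1685.0), 3))  # 0.821
```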
Top-Line Results
25 of 32 correct (78.1%). Seed baseline goes 24/32 (75%). BXS edges it.
Brier Score: 0.1165 (lower is better; always-50% scores 0.25). Brier Skill Score: 0.534, meaning 53% better than a coin-flip baseline. Log loss: 0.3604 (coin-flip model scores 0.69; 0.36 puts us roughly halfway between useless and flawless, which is about right for 32 tournament games).
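If you want to check the arithmetic, here's how all three numbers fall out of a list of (probability, outcome) pairs. A minimal sketch, not the actual scoring code behind this report:

```python
import math

def calibration_scores(preds: list[tuple[float, int]]) -> dict[str, float]:
    """Brier score, Brier skill score, and log loss for (p, y) pairs,
    where p is the favorite's predicted win probability and y is 1 if
    the favorite won, else 0."""
    n = len(preds)
    brier = sum((p - y) ** 2 for p, y in preds) / n
    # An always-50% model scores (0.5 - y)^2 = 0.25 on every game.
    bss = 1.0 - brier / 0.25
    log_loss = -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                    for p, y in preds) / n
    return {"brier": brier, "bss": bss, "log_loss": log_loss}

# Sanity check: a pure coin-flip model.
print(calibration_scores([(0.5, 1), (0.5, 0)]))
# {'brier': 0.25, 'bss': 0.0, 'log_loss': 0.693...}
```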
Probabilities come from the pre-tournament ELO snapshot captured before the First Four tipped off. The model never sees game outcomes to generate predictions; it only uses results to update ELO after each game completes.
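For context on the update step, here's the general shape of a margin-of-victory-weighted ELO update. The log-margin multiplier with an autocorrelation damper is the well-known FiveThirtyEight form; the K-factor and constants are illustrative stand-ins, not the BXS parameters.

```python
import math

def update_elo(winner_elo: float, loser_elo: float, margin: int,
               k: float = 20.0) -> tuple[float, float]:
    """One post-game ELO update, weighted by margin of victory."""
    # Expected win probability for the eventual winner, classic ELO form.
    expected = 1.0 / (1.0 + 10 ** ((loser_elo - winner_elo) / 400))
    # Log-margin multiplier with an autocorrelation damper (FiveThirtyEight
    # form); it shrinks the reward when a heavy favorite runs up the score.
    mov = math.log(abs(margin) + 1) * (2.2 / ((winner_elo - loser_elo) * 0.001 + 2.2))
    shift = k * mov * (1.0 - expected)
    return winner_elo + shift, loser_elo - shift

# A one-point upset by a 100-point underdog (hypothetical ratings):
print(update_elo(1650.0, 1750.0, margin=1))  # winner up ~9, loser down ~9
```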
The 50-60% Bin
That dip at the left of the chart deserves a close read. Seven games landed in the 50-60% range, and favorites won only 2 of them (28.6%). That sounds bad until you see which seven: four are 8v9 games (Clemson/Iowa at 50.2% for Iowa, Villanova/Utah St. 52.0%, Ohio St./TCU 54.2%, Georgia/Saint Louis 59.6%), plus Saint Mary's/Texas A&M (57.5%), Miami (FL)/Missouri (56.7%), and North Carolina/VCU (52.5%). Five of the seven were essentially pick-'ems. Getting 28.6% on a bin of near-coinflips is noise, not a calibration failure. You'd need ~50 games in this range before drawing any real conclusions.
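You can put a number on "noise." Assuming the seven games are independent and plugging in the bin's average favorite probability (~54.7%), two or fewer favorite wins is about a 1-in-6 outcome:

```python
from math import comb

def prob_at_most(k: int, n: int, p: float) -> float:
    """Binomial tail: chance the favorite wins k or fewer of n games."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

# Seven near-coinflips at the bin's average favorite probability:
print(round(prob_at_most(2, 7, 0.547), 3))  # 0.157, about 1-in-6
```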
The 90-100% bin went 11 for 11. Every game the model was confident about, the favorite won.
By Seed Matchup
BXS Predicted vs. Actual Win Rate by Seed Matchup
Every seed class where BXS confidence was high (1v16, 2v15, 3v14, 4v13) went 16 for 16.
Every seed class where the model was genuinely confident (1v16, 2v15, 3v14, 4v13) went 16 for 16. Average predicted probability in those matchups: 87.9% to 97.6%.
The 8v9 bar hits zero. Predicted 53.9%, actual 0%. But this is the same bin we just discussed: four designed coinflips that all went one way. The joint probability of all four 9-seeds winning: 0.502 × 0.458 × 0.480 × 0.404 ≈ 4.5%. A 1-in-22 event. Unlikely, but not diagnostic. What would actually concern me is if these games were at 65-70% and all went wrong. At 54%, the model is saying "I don't know." Getting swept on four "I don't know" games is annoying, not a signal the model is broken.
All 32 Games
| Region | Seeds | Matchup | BXS Prob | Winner | Result |
|---|---|---|---|---|---|
| South | 1v16 | Florida vs. Prairie View | 98.4% | Florida | ✓ |
| Midwest | 1v16 | Michigan vs. Howard | 97.8% | Michigan | ✓ |
| West | 1v16 | Arizona vs. LIU | 97.3% | Arizona | ✓ |
| East | 1v16 | Duke vs. Siena | 97.0% | Duke | ✓ |
| West | 2v15 | Purdue vs. Queens (NC) | 96.2% | Purdue | ✓ |
| South | 2v15 | Houston vs. Idaho | 96.2% | Houston | ✓ |
| Midwest | 2v15 | Iowa St. vs. Tennessee St. | 94.0% | Iowa St. | ✓ |
| East | 2v15 | UConn vs. Furman | 92.0% | UConn | ✓ |
| West | 4v13 | Arkansas vs. Hawaii | 91.2% | Arkansas | ✓ |
| Midwest | 3v14 | Virginia vs. Wright St. | 90.9% | Virginia | ✓ |
| South | 4v13 | Nebraska vs. Troy | 90.7% | Nebraska | ✓ |
| South | 3v14 | Illinois vs. Penn | 89.0% | Illinois | ✓ |
| East | 3v14 | Michigan St. vs. North Dakota St. | 88.8% | Michigan St. | ✓ |
| West | 3v14 | Gonzaga vs. Kennesaw St. | 87.8% | Gonzaga | ✓ |
| South | 5v12 | Vanderbilt vs. McNeese | 86.5% | Vanderbilt | ✓ |
| East | 4v13 | Kansas vs. California Baptist | 85.4% | Kansas | ✓ |
| Midwest | 4v13 | Alabama vs. Hofstra | 84.2% | Alabama | ✓ |
| East | 5v12 | St. John's vs. UNI | 80.5% | St. John's | ✓ |
| West | 5v12 | Wisconsin vs. High Point | 80.5% | High Point | ✗ |
| Midwest | 6v11 | Tennessee vs. Miami (OH) | 75.6% | Tennessee | ✓ |
| Midwest | 5v12 | Texas Tech vs. Akron | 73.8% | Texas Tech | ✓ |
| East | 6v11 | Louisville vs. South Fla. | 67.0% | Louisville | ✓ |
| East | 7v10 | UCLA vs. UCF | 65.9% | UCLA | ✓ |
| West | 6v11 | BYU vs. Texas | 65.4% | Texas | ✗ |
| Midwest | 7v10 | Kentucky vs. Santa Clara | 60.5% | Kentucky | ✓ |
| Midwest | 8v9 | Georgia vs. Saint Louis | 59.6% | Saint Louis | ✗ |
| South | 7v10 | Saint Mary's vs. Texas A&M | 57.5% | Texas A&M | ✗ |
| West | 7v10 | Miami (FL) vs. Missouri | 56.7% | Miami (FL) | ✓ |
| East | 8v9 | Ohio St. vs. TCU | 54.2% | TCU | ✗ |
| South | 6v11 | North Carolina vs. VCU | 52.5% | VCU | ✗ |
| West | 8v9 | Villanova vs. Utah St. | 52.0% | Utah St. | ✗ |
| South | 8v9 | Clemson vs. Iowa | 49.8% | Iowa | ✓* |
BXS Prob = model's pre-game win probability for the listed team. ✓ = model's favored team won. ✗ = model's lean lost. ✓* = model had Iowa as slight favorite (50.2%); Iowa won. Correct model call, seed upset. Results via NCAA bracket.
The Clemson/Iowa asterisk is worth noting. By seed, Iowa winning counts as an upset. By the model, Iowa was the slight favorite and won. The 25/32 accuracy already reflects this as a correct call.
High Point
Wisconsin was an 80.5% favorite. They lost to High Point 83-82. One point.
The interesting thing is the model already knew High Point was real. Their pre-tournament ELO of 1682.6 was higher than every 13, 14, 15, and 16 seed in the field. The committee put them on the 12-line; the model saw a team that outrated most small-conference entries. Their 19.6% upset probability was second-highest among all 12-16 seeds. Only Akron got better odds (26.2%), and Akron lost.
Wisconsin still deserved to be the 80.5% favorite. But this wasn't a Cinderella story the model missed. It was a legitimate 1-in-5 shot that came in.
The Verdict
Going in, I wanted two things: high-confidence predictions to hold up, and low-confidence predictions to stay honest about their uncertainty. Both happened. Every game above 90% went to the correct team. The games below 60% were a mess, but that's expected. They're supposed to be close.
BSS of 0.534. Seven misses. Right when confident, uncertain when uncertain. That's what calibration is supposed to look like.
Round 2 starts Saturday. New probabilities for all 16 games at /march-madness.