March Madness Week 1: Did the Model Know What It Was Doing?
52 games. Play-ins through Round 2. Real numbers.
Greg Lamp · March 23, 2026
38 of 52. 73.1%. The seed baseline beats us by two games. That's the headline if you want to be lazy about it.
Here's what calibration means: if the model says 70%, that event should happen 70% of the time — not in any single game, but across all games at that confidence level. Testing on that question is different from tracking raw bracket picks, because raw accuracy rewards a model that always picks favorites and never admits uncertainty.
Strip out the First Four and the story looks different. The model went 37 of 48 (77.1%) in the actual tournament rounds. Seed baseline goes 36 of 48 (75.0%) over the same games, based on always picking the higher-seeded team. BXS pulls ahead. The First Four are four near-coin-flips that dragged the aggregate number down.
Pre-game win probabilities come from the BXS pre-tournament ELO snapshot, captured before the First Four tipped off. The model never sees game outcomes to generate predictions — it only uses results to update ELO ratings after games complete.
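If you want the mechanics, here is a minimal sketch of a standard ELO update of the kind described above. The win-probability function uses the fitted logistic slope reported later in this post (beta = 0.0088); the K-factor and function names are placeholders for illustration, not production parameters.

```python
import math

BETA = 0.0088  # fitted logistic slope reported in this post; K below is a placeholder

def win_prob(elo_a: float, elo_b: float) -> float:
    """Pre-game probability that team A beats team B, from the ELO gap."""
    return 1.0 / (1.0 + math.exp(-BETA * (elo_a - elo_b)))

def update_elo(elo_a: float, elo_b: float, a_won: bool, k: float = 32.0):
    """After the game, each rating moves by K * (outcome - expectation)."""
    delta = k * ((1.0 if a_won else 0.0) - win_prob(elo_a, elo_b))
    return elo_a + delta, elo_b - delta

# An upset moves ratings more than a chalk result:
print(update_elo(1700, 1628, a_won=False))  # a ~65% favorite loses, so a big swing
```

The key point is the order of operations: predictions come from the frozen pre-tournament snapshot, and updates like the one above only run after a game completes.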
Before getting into the rounds: the chart below is the thing that actually matters. It puts the calibration question to a direct test: in each confidence bin, does the predicted win rate match what actually happened?
BXS Model: Predicted vs. Actual Win Rate by Confidence Bin (52 games)
At 70%+, the model is nearly perfect. The 50-60% bin is the only real divergence, and it's driven by games the model already flagged as uncertain.
At 70-80% confidence, the model went 6/8 (75.0%). At 80-90%, 11/12 (91.7%). At 90-100%, 11/11 (100%). The high-confidence calls are dialed in.
The 50-60% bin shows favorites winning at just 38.5%. That looks like a failure until you notice what's in that bin: three First Four play-ins, seven Round 1 games (all four 8v9 matchups plus three other tight games), and three Round 2 matchups. These are games the model was already telling you it couldn't confidently pick. Getting the "wrong" ones on near-coin-flips is expected variance, not broken math.
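If you want to reproduce the chart, the binning is a few lines of bookkeeping. A minimal sketch, taking (favorite probability, favorite won) pairs; the input format and bin width here are illustrative, not BXS internals:

```python
from collections import defaultdict

def calibration_table(preds: list[tuple[float, bool]]) -> None:
    """Bucket (favorite_prob, favorite_won) pairs into 10-point confidence
    bins, then compare mean predicted probability to the observed win rate."""
    bins = defaultdict(list)
    for prob, won in preds:
        pct = round(prob * 100)
        lo = min(pct // 10 * 10, 90)  # clamp so a 100% call lands in the 90-100 bin
        bins[lo].append((prob, won))
    for lo in sorted(bins):
        games = bins[lo]
        predicted = sum(p for p, _ in games) / len(games)
        actual = sum(w for _, w in games) / len(games)
        print(f"{lo}-{lo + 10}%: n={len(games)} predicted={predicted:.1%} actual={actual:.1%}")
```

Run over the 52 games, the 90-100 row is where the 11/11 shows up.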
By Round: First Four Tanked It, Then Things Got Interesting
Accuracy by Round vs. Seed Baseline
The First Four is four coin-flip games. Rounds 1 and 2 both beat the seed baseline.
| Round | Games | Correct | Accuracy | Brier Skill Score |
|---|---|---|---|---|
| First Four | 4 | 1 | 25.0% | -0.311 (worse than coin flip) |
| Round 1 | 32 | 25 | 78.1% | 0.534 |
| Round 2 | 16 | 12 | 75.0% | 0.303 |
Brier Skill Score (BSS) is the single clearest measure: 0.0 means you're no better than flipping a coin; 1.0 is perfection; negative means you'd have been better off guessing randomly. The First Four's -0.311 is ugly, but it's four near-coin-flip games where three went the wrong way — any random process occasionally clusters three wrong guesses in a row. The actual tournament rounds came in at 0.534 (Round 1) and 0.303 (Round 2), both solidly above zero.
For the underlying math: Brier Score is squared confidence error, lower is better, coin-flip baseline is 0.25. BXS scores 0.1505 overall — 40% of the way from coin-flip to perfect, which matches the 0.3979 BSS.
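Both scores are a few lines apiece. A minimal sketch over the same (probability, outcome) pairs as above, with 0.25 as the always-say-50% reference:

```python
def brier(preds: list[tuple[float, bool]]) -> float:
    """Mean squared gap between stated probability and the 0/1 outcome."""
    return sum((p - won) ** 2 for p, won in preds) / len(preds)

def brier_skill(preds: list[tuple[float, bool]], ref: float = 0.25) -> float:
    """1.0 is perfect, 0.0 matches the coin-flip reference, negative is worse."""
    return 1.0 - brier(preds) / ref

# Sanity check against the numbers above: 1 - 0.1505 / 0.25 = 0.398, the reported BSS.
```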
Seed baseline = always picking the higher seed. For First Four games where both teams carry the same seed, the baseline defaults to 50% since there's no seeding signal to apply. All pre-tournament ELO ratings are from the BXS snapshot captured before the First Four tipped off, accessible at boxscorus.com/march-madness. Outcome data verified against the NCAA official bracket.
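The baseline itself is one branch, exactly as stated. As a sketch (function name mine):

```python
def seed_baseline_prob(seed_a: int, seed_b: int) -> float:
    """Probability the baseline assigns to team A: always back the higher
    seed (lower number); same-seed play-ins get 50%, since there is no signal."""
    if seed_a == seed_b:
        return 0.5
    return 1.0 if seed_a < seed_b else 0.0
```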
Round 1 details are in the dedicated Round 1 calibration post. The short version: 25/32, and every game above 90% confidence went to the favorite. The one notable miss was Wisconsin/High Point: Wisconsin at 80.5%, with High Point's 19.6% the second-highest upset probability of any 12-16 seed in the field. High Point's pre-tournament ELO of 1682.6 was higher than every 13, 14, 15, and 16 seed. The committee put them on the 12-line; the model saw a real team. That 1-in-5 shot came in.
The First Four: All Coin-Flips, Three Underdogs
1/4 sounds dismal. Three of those games were in the 50-60% confidence range, and UMBC at 65.4% was the only real lean. The model was openly uncertain about all four.
| Region | Seeds | Matchup | BXS Prob | Winner | Result |
|---|---|---|---|---|---|
| West | 11v11 | Texas vs. NC State | 51.9% (Texas) | Texas | ✓ |
| South | 16v16 | Prairie View vs. Lehigh | 43.1% (Prairie View) | Prairie View | ✓* |
| Midwest | 11v11 | Miami (OH) vs. SMU | 42.7% (Miami OH) | Miami (OH) | ✓* |
| Midwest | 16v16 | UMBC vs. Howard | 65.4% (UMBC) | Howard | ✗ |
BXS Prob = pre-game win probability for the listed team. Prairie View and Miami (OH) won as underdogs.
The only one that stings is UMBC at 65.4%. That was a real lean, not a pick-'em, and Howard won 86-83 to advance. The other two underdogs winning is exactly the kind of noise you'd expect when the model says 43%. SMU at 57.3% losing to Miami (OH) isn't surprising. You'd need a lot more than four games like this to conclude anything structural is wrong.
The bigger issue is what these four games do to aggregate statistics. The joint probability of getting exactly three of four wrong when probabilities are 65.4%, 57.3%, 56.9%, and 51.9%: roughly 17%. Not rare, just unlucky. Don't let the First Four drag the overall read.
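That "roughly 17%" is a direct calculation, not hand-waving. Treating the four games as independent:

```python
from itertools import combinations
from math import prod

def p_exactly_k_wrong(probs: list[float], k: int) -> float:
    """Probability that exactly k of the listed favorites lose,
    treating the games as independent."""
    total = 0.0
    for wrong in combinations(range(len(probs)), k):
        total += prod(
            (1 - p) if i in wrong else p
            for i, p in enumerate(probs)
        )
    return total

first_four = [0.654, 0.573, 0.569, 0.519]  # favorite probabilities from the table
print(p_exactly_k_wrong(first_four, 3))    # ~0.172, the "roughly 17%"
```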
Round 2: Where Confidence Met Reality
Round 2 is where the model got its real test. Sixteen games, opponents that survived Round 1, ELO gaps that had started separating the field. The model went 12/16.
Here's every game sorted by pre-game confidence (highest first). Read it as a test: did the high-conviction calls hold, and where did the model get overturned?
| Region | Seeds | Matchup | BXS Prob (fav) | Winner | Result |
|---|---|---|---|---|---|
| Midwest | 1v9 | Michigan vs. Saint Louis | 88.6% (Michigan) | Michigan | ✓ |
| West | 1v9 | Arizona vs. Utah St. | 85.9% (Arizona) | Arizona | ✓ |
| West | 4v12 | Arkansas vs. High Point | 83.9% (Arkansas) | Arkansas | ✓ |
| East | 1v9 | Duke vs. TCU | 81.2% (Duke) | Duke | ✓ |
| South | 2v10 | Houston vs. Texas A&M | 79.0% (Houston) | Houston | ✓ |
| West | 2v7 | Purdue vs. Miami FL | 78.3% (Purdue) | Purdue | ✓ |
| South | 1v9 | Florida vs. Iowa | 76.9% (Florida) | Iowa | ✗ |
| Midwest | 2v7 | Iowa St. vs. Kentucky | 71.1% (Iowa St.) | Iowa St. | ✓ |
| South | 3v11 | Illinois vs. VCU | 70.6% (Illinois) | Illinois | ✓ |
| West | 3v11 | Gonzaga vs. Texas | 70.2% (Gonzaga) | Texas | ✗ |
| East | 4v5 | Kansas vs. St. John's | 62.3% (St. John's) | St. John's | ✓ |
| East | 3v6 | Michigan St. vs. Louisville | 62.0% (Mich. St.) | Michigan St. | ✓ |
| South | 4v5 | Nebraska vs. Vanderbilt | 60.2% (Vanderbilt) | Nebraska | ✗ |
| East | 2v7 | UConn vs. UCLA | 57.9% (UConn) | UConn | ✓ |
| Midwest | 3v6 | Virginia vs. Tennessee | 54.8% (Virginia) | Tennessee | ✗ |
| Midwest | 4v5 | Alabama vs. Texas Tech | 52.8% (Alabama) | Alabama | ✓ |
All probabilities from the pre-tournament ELO snapshot captured before the First Four tipped off.
The four misses, in order of how bad each one actually was:
Iowa over Florida (76.9%). Florida's 1921 ELO vs. Iowa's 1784 — a 137-point gap. The model converts ELO gaps to probabilities via a fitted logistic function (beta=0.0088), which puts Florida at 76.9%. Iowa won 73-72. This is a genuine miss: the model had a real edge on Florida, not a coin-flip hedge. At 76.9%, the model will lose about one in four games it calls at that level. That's calibration, not failure — but Florida is out of the tournament, so it's a miss that costs.
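The conversion itself is a one-liner. A sketch using the reported beta (the fitting procedure isn't shown here):

```python
import math

BETA = 0.0088  # fitted logistic slope reported in this post

def elo_win_prob(elo_team: float, elo_opp: float) -> float:
    """Convert an ELO gap into a pre-game win probability."""
    return 1.0 / (1.0 + math.exp(-BETA * (elo_team - elo_opp)))

print(elo_win_prob(1921, 1784))  # Florida over Iowa: ~0.769, the 76.9% above
```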
Texas over Gonzaga (70.2%). Gonzaga at 70.2% lost 74-68. What makes this more interesting than a one-off upset: Texas entered the tournament as an 11-seed with an ELO of 1748 — higher than every 12, 13, 14, 15, and 16 seed in the field, and nearly matching BYU (1820) and Gonzaga (1845). The committee put them in the play-in game. The model quietly liked them all along. They beat BYU by 8 in Round 1 at 34.6% odds, then knocked off Gonzaga in Round 2 at 29.8% odds. Consecutive upsets against opponents the model itself rated as better teams. The ELO gap was real — Texas just beat it twice. That's what a 30% shot looks like when it lands.
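Run the same logistic over Texas's two wins and, treating the games as independent, the back-to-back upsets price out as about a 1-in-10 parlay:

```python
import math

def elo_win_prob(elo_team: float, elo_opp: float, beta: float = 0.0088) -> float:
    return 1.0 / (1.0 + math.exp(-beta * (elo_team - elo_opp)))

p_byu = elo_win_prob(1748, 1820)      # ~0.346: Texas over BYU
p_gonzaga = elo_win_prob(1748, 1845)  # ~0.298: Texas over Gonzaga
print(p_byu * p_gonzaga)              # ~0.103: a roughly 1-in-10 parlay
```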
Nebraska over Vanderbilt (60.2%). Vanderbilt was a slight favorite; Nebraska wins 74-72. Squarely in the range where swings are expected, especially in March.
Tennessee over Virginia (54.8%). Virginia was the slight lean; Tennessee wins 79-72. Virginia at 54.8% is essentially a coin-flip with a modest lean. Getting this wrong is noise.
Two of the four misses were near-pick-'ems. Texas/Gonzaga was a real lean that flipped. Iowa/Florida is the only result where a confident model call (76.9%) went the wrong way, and it did so by a single point.
The 2-Seeds Were Immaculate
All four 2-seeds won in Round 2. Houston over Texas A&M, Purdue over Miami FL, Iowa St. over Kentucky, UConn over UCLA. The model had them at 79.0%, 78.3%, 71.1%, and 57.9% — a mixed bag of confidence, but all four came through. Every 2-seed standing at the Sweet 16 is exactly the scenario where the model's pre-tournament Final Four probabilities remain intact.
The 1-seeds went 3/4. Michigan (88.6%), Arizona (85.9%), and Duke (81.2%) all advanced comfortably. Florida (76.9%) didn't. That's the outlier, and it cost the model its highest-probability Final Four pick in the South region.
What the 50-60% Bin Is Actually Telling You
Thirteen of 52 total games landed in this range. Favorites won 5/13 (38.5%). Here's the breakdown of where those 13 games came from:
- 3 First Four play-ins (Texas, Prairie View, Miami OH — the three games under 58%; UMBC at 65.4% is in the 60-70% bin)
- 7 Round 1 games (all 4 8v9 matchups: Clemson/Iowa, Villanova/Utah St., Ohio St./TCU, Georgia/Saint Louis — plus North Carolina/VCU, Miami FL/Missouri, Saint Mary's/Texas A&M)
- 3 Round 2 games (Alabama/Texas Tech, Virginia/Tennessee, UConn/UCLA)
Five of those 13 favorites actually won. The losses all came in games the model had priced as near-coin-flips, where it was already telling you to keep your expectations low. This isn't a calibration failure. It's a small sample of genuinely close games behaving like genuinely close games.
You'd want 40-50 games in the 50-60% bucket to draw any real conclusion about whether the model is biased toward favorites at that level. We have 13 across three rounds.
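A quick binomial check makes the sample-size point concrete. Assuming independent games and a true favorite win rate of 55%, the middle of the bin:

```python
from math import comb

def p_at_most(k: int, n: int, p: float) -> float:
    """Binomial tail: probability of k or fewer successes in n trials."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

# Even if favorites in this bin truly win 55% of the time, seeing 5 or
# fewer wins in 13 games happens about 18% of the time.
print(p_at_most(5, 13, 0.55))  # ~0.179
```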
Verdict: Honest Grade
BSS of 0.3979. Forty percent of the way from a coin flip to perfection across 52 games. I wish I could give you a more impressive number, but then I'd be lying, and a lying model is the least useful thing in sports analytics.
Over 48 meaningful tournament games (setting aside the First Four play-ins), BXS goes 77.1% vs. a seed baseline of 75.0%. Two percentage points. That is not a margin to retire your bracketologist over. It's also exactly what a calibration-first model should look like: not cherry-picking favorable games, not hiding the misses, just pricing every matchup honestly and grinding out a thin edge. If you wanted something dramatic, you're reading the wrong kind of analysis.
At 70%+ confidence: the model went 28/31 (90.3%). That's where the edge is concentrated — not in the coin-flip games, but in the high-conviction calls the model had before anyone had played a game.
Florida at 76.9% losing by one point is the hardest kind of result: "correct" and "costly" at the same time. Iowa winning wasn't a model failure — it was a 23% event occurring. ELO can't detect a team peaking exactly at the right moment. At 76.9%, the model priced Florida correctly given available information. Discounting that call because Iowa hit a tough shot is how you end up chasing noise in a tournament designed to produce upsets. Florida's gone. The South bracket is open. That's the consequence, not the verdict on the model.
Texas is the reverse story. The committee put them in a play-in game. The model had them at 1748 ELO — equivalent to a 6 or 7-seed — and they beat BYU (1820) at 34.6% odds, then Gonzaga (1845) at 29.8% odds. Two teams the model rated as better. The committee was wrong. The ELO was closer to right.
All four 2-seeds survived. The 60-70% zone went 5/8 (62.5%) in Week 1, at the low end of its expected range and worth watching in the Sweet 16 to see if it corrects. The teams the model liked most are still playing.
Full updated probabilities at /march-madness.