March Madness inherently has its upsets and surprises. There is the drama of the last-second shot or the errant pass that ends a game. Like life off the court, the NCAA tournament is unpredictable, and random events happen. Yet within it all, there are patterns in the stats of the games and players. How does one make sense of it all?
Charlotte-based big data company Tresata has an answer: Look at a lot of sports data all at one time. This year, I’m serving as the company’s chief researcher while on sabbatical from Davidson College, where I’m an associate professor of mathematics and computer science. I’m combining my expertise in sports analytics with the company’s proprietary analytics platform, Optimus, and particularly its network discovery engine. To date, this tool has been used predominantly to analyze fraudulent behavior in the financial services domain. I’m using it to study college basketball.
What creates the network in college basketball? Tresata looks at every game in the season. Two teams are connected in the network when they play a game. A winning team is connected to the losing team by a directed edge or arrow. So, in the network below, team D is undefeated. Team C lost only to team D, team A beat only B and team B lost to everyone. For this network, team D is #1, C is #2, A is #3 and team B is #4.
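To make the construction concrete, here is a minimal Python sketch of the four-team example. The list of games is an assumption: it is one set of results consistent with the description above (a full round robin). Ordering teams by win count is only the simplest rule that reproduces the D, C, A, B ranking; it is not a description of how Tresata's engine ranks teams.

```python
from collections import defaultdict

# Assumed results for the four-team example, written as (winner, loser).
# Any schedule matching the description in the text would work here.
games = [
    ("D", "A"), ("D", "B"), ("D", "C"),
    ("C", "A"), ("C", "B"),
    ("A", "B"),
]

def build_network(games):
    """Map each team to the set of teams it beat (a directed edge per win)."""
    beats = defaultdict(set)
    for winner, loser in games:
        beats[winner].add(loser)
        beats[loser]  # touch the key so winless teams still appear as nodes
    return beats

def rank_by_wins(beats):
    """Order teams by number of outgoing edges, most wins first.

    Ties are broken alphabetically. Repeat wins over the same opponent
    collapse into one edge, which is fine for this small illustration.
    """
    return sorted(beats, key=lambda team: (-len(beats[team]), team))

network = build_network(games)
ranking = rank_by_wins(network)
print(ranking)  # ['D', 'C', 'A', 'B']
```

With a full season of games, simple win counts stop being enough, because teams play schedules of very different difficulty. That is where the structure of the network starts to matter.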
Building such a network from every game produces a network for all of NCAA Division I men’s basketball. Clearly, that network is bigger. How much bigger? In the image below, you see the network for just one season. This involves approximately 350 teams in over 5,000 games. Now, who do you pick as #1 and how do you know?
It turns out Tresata is analyzing 14 times more data than in the network depicted above. They’ve collected 14 years of college basketball game data. With Tresata’s tools, complete seasons across conferences and time can be viewed for the first time all at once. We’ve analyzed more than 70,000 games involving 5,000-plus teams, generating a list of stats for each game and a number of unique data points for each team.
How do you keep from going mad with so much data? These tools were designed to process such large datasets, and I’m adapting techniques already honed in other arenas such as banking, retail and, more recently, health care. Plus, the large network is explored using Tresata’s internally written query language, “QUE.” With it, you can easily identify, for example, teams that play talented opponents tough yet play down to lesser competition, and you can see how conferences compare to one another.
These types of queries exposed natural similarities between teams and patterns of play across all 14 years. They also helped identify and characterize attributes of “Cinderella” teams, which are the low-ranked teams that pull off huge upsets in the early rounds.
In particular, Cinderella teams tend to:
- Come from smaller conferences;
- Have important out-of-conference wins, or close losses;
- Struggle against less competitive opponents, but win against higher ranked teams.
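As a rough illustration, the three traits above could be turned into a simple filter. Everything in this Python sketch is hypothetical: the `TeamProfile` fields, the thresholds and the two sample teams are placeholders chosen for the example, not values or methods from Tresata's analysis.

```python
from dataclasses import dataclass

@dataclass
class TeamProfile:
    """Hypothetical season summary for one team."""
    name: str
    major_conference: bool        # True if the team plays in a large conference
    quality_nonconf_results: int  # important out-of-conference wins or close losses
    losses_to_weaker: int         # losses to less competitive opponents
    wins_over_higher_ranked: int  # wins against higher-ranked teams

def looks_like_cinderella(team, min_quality=1, min_slipups=1, min_upsets=1):
    """Return True if a team shows all three traits from the list.

    The thresholds are illustrative assumptions, not fitted values.
    """
    return (
        not team.major_conference
        and team.quality_nonconf_results >= min_quality
        and team.losses_to_weaker >= min_slipups
        and team.wins_over_higher_ranked >= min_upsets
    )

# Made-up teams for demonstration only.
teams = [
    TeamProfile("Team X", major_conference=False, quality_nonconf_results=2,
                losses_to_weaker=3, wins_over_higher_ranked=2),
    TeamProfile("Team Y", major_conference=True, quality_nonconf_results=4,
                losses_to_weaker=0, wins_over_higher_ranked=5),
]

candidates = [t.name for t in teams if looks_like_cinderella(t)]
print(candidates)  # ['Team X']
```

A real screen would need definitions for "important" wins and "close" losses, which the list leaves open.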
You would also be wise to consider these additional tips that emerged from my analysis as you fill in your bracket once the matchups are announced on Selection Sunday:
- It’s all about stamina for high-numbered seeds: Of teams seeded 10 or higher (that is, seeds 10 through 16), only two have ever won four games in the tournament (2.3%) and only four have ever won three games (4.5%). None of these teams was seeded higher than 12.
- Three’s the magic number: In the last 14 years, all but one championship team was a 1, 2 or 3 seed. The 2014 Connecticut squad, a 7 seed, was the only exception.
- “At-large” is quite descriptive: Looking at berth type, 85% of teams that have reached at least the Elite 8 received at-large bids.
These are only a taste of Tresata’s Tourney Tips, and we’ll continue digging into the data. This year’s tournament should provide a treasure trove of information.
Image: sorbetto / iStock.