How to Build the Perfect March Madness Bracket by Implementing Your Own ML Model
Updated: Mar 13
In our previous blog post, we discussed the merits of building a machine learning algorithm to predict the outcome of March Madness. Now, it's time to dive into the technical aspects of building a model. In this post we’ll provide a general framework along with examples and guidance on executing each step.
How to build your March Madness ML Model:
Data collection: The most fundamental requirement for any modeling is data, and there is a wealth of it available across the internet. More doesn’t always mean better when building a model, though, and the more data you decide to include, the more work will be involved in the next step, so keep that in mind. See below for in-depth guidance on this topic.
Feature engineering: This is the step where you will probably spend most of your time (or at least it should be if you want your model to do well). After you gather data, you need to transform that data into features that can be used as inputs to your model. This step is known as feature engineering, and involves techniques such as feature extraction, normalization / scaling, binning, etc. For example, a feature could be the difference in scoring average between two teams, or the difference in strength of schedule.
Model training: The next step is to train the model on the data. To do this, you take your data and split it into training, validation, and test sets. The purpose of each dataset and guidance on creating them is discussed below. Once your data is split, you choose an algorithm to fit your ML model to the data, so that it can make predictions about the outcome of each game in the tournament.
Model evaluation: After the model has been trained, the next step is to evaluate its performance. This can be done by comparing the model's predictions to the actual outcomes of the games in the validation set. Common metrics for evaluating the performance of a predictive model include accuracy, precision, recall, and F1 score.
Model predictions for the current season: Once you are satisfied with its performance, “deploy” your model by running predictions on the current season’s matchups and use the results to fill out your bracket.
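The splitting, training, and evaluation steps above can be sketched with scikit-learn. The feature matrix here is randomly generated stand-in data; in practice you would build it from your engineered features (e.g., seed and scoring differentials).

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Synthetic stand-in: 1000 "games" with 5 engineered features each,
# where the label (1 = first team wins) depends on the first two features.
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=1000) > 0).astype(int)

# First carve out 30% of the data, then split that half-and-half
# into validation and held-out test sets.
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=0)

# Fit a simple baseline model and evaluate it on the validation set.
model = LogisticRegression().fit(X_train, y_train)
val_pred = model.predict(X_val)

print("accuracy :", accuracy_score(y_val, val_pred))
print("precision:", precision_score(y_val, val_pred))
print("recall   :", recall_score(y_val, val_pred))
print("f1       :", f1_score(y_val, val_pred))
```

Logistic regression is just a convenient baseline here; any classifier that outputs win probabilities can be slotted into the same train/validate/test loop.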
Data Collection
There is a wealth of data available. It makes the most sense to use the Kaggle datasets provided each year, since that eliminates a lot of the data collection. Augmenting this data can be helpful; additions worth exploring include Elo ratings, Massey ratings, and upset statistics.
In case you are unfamiliar with them, Elo ratings originated in chess but have permeated into sports, competitive gaming, and beyond. The system was invented by Arpad Elo, a Hungarian-American physics professor, as an improvement over the chess-rating system that existed before it. Its simplicity makes it easy to generalize and apply to many other competitions. In essence, every team starts the season with the same score. As games are played, the winning team takes points from the losing team and the scores are adjusted. If a team beats an opponent with a much higher rating, it takes more points; if the opponent had a lower rating, it takes fewer. The difference between two teams' Elo ratings is then used to predict the outcome of a match.
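A minimal sketch of the Elo update just described: both teams start at the same rating, and after each game the winner takes points from the loser, with bigger transfers for upsets. The K-factor of 32 is a common default, not something prescribed by the Kaggle data.

```python
def expected_score(rating_a, rating_b):
    """Probability that team A beats team B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(winner, loser, k=32):
    """Return the new (winner, loser) ratings after one game."""
    exp_win = expected_score(winner, loser)
    delta = k * (1.0 - exp_win)   # winner gains exactly what the loser gives up
    return winner + delta, loser - delta

# A heavy favorite (1600) beating an underdog (1400) gains only a few points...
print(update_elo(1600, 1400))
# ...while the upset result moves the ratings much more.
print(update_elo(1400, 1600))
```

Running each season's games through `update_elo` in chronological order gives you a pre-tournament Elo rating per team, and the rating difference between two teams becomes a natural model feature.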
Some of the data you would expect to find in the Kaggle datasets or through other means includes:
Team statistics: team-level metrics that provide a general sense of how a team has performed throughout the season, such as:
Win-loss record
Points scored
Points allowed
Tournament seed
Strength of schedule
Player statistics: There may be no “I” in team, but it would be unwise to ignore individual players’ contributions to their team’s performance. Stats in this category include things like:
Field goals (2-pointers) scored
Field goal attempts
Three-pointers scored
Three-pointer attempts
Rebounds
Assists
Feature Engineering and Selection
Feature engineering is the process of transforming the data you collected into inputs for your model. These steps will be specific to the data you collected. For instance, if you are using the sample Kaggle datasets from past competitions, you’ll have a dataset with the seeds of the winning and losing team in each game, as well as their respective scores, for every season since 1985. Instead of leaving them in their raw form, consider calculating the difference between the metrics of the winning team and the losing team. For example, instead of having data that looks like this:

You would take the winning seed column and subtract the losing seed column:

Using differentials instead of raw values helps us normalize our data and can remove some of the noise, making it easier to compare teams and generalize features across the seasons. For instance, if there was a rule change that imposed a faster shot clock, you’d expect to see an uptick in the number of points scored in each game compared to prior seasons. I’m not saying that happened, but if it did, using differentials allows us to ignore the “noise” introduced by the changing pace of the game and compare more meaningful stats.
It is important to understand the data you collect and what each feature you ultimately feed into your model represents; otherwise you may inadvertently hinder your model’s ability to make accurate predictions instead of helping it.
During your data collection, you will likely come across stats such as Player Efficiency Rating (PER), Team Offensive Rating, and Team Defensive Rating. Knowing how these features are calculated is useful because it can help safeguard against introducing multicollinearity, which, in layman’s terms, refers to a situation where two or more predictor variables are highly related to each other and thus contain redundant information. (Selecting certain algorithms will also make your model less sensitive to this, which we will get into soon.)
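One quick way to screen for the multicollinearity discussed above is to compute the pairwise correlations among your candidate features and flag highly correlated pairs. The feature names and data below are made up for illustration: `points_per_game` is deliberately constructed from `off_rating`, so the two carry largely redundant information.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
off_rating = rng.normal(100, 10, size=200)
features = pd.DataFrame({
    "off_rating": off_rating,
    # Built directly from off_rating plus a little noise -> nearly redundant.
    "points_per_game": 0.7 * off_rating + rng.normal(0, 2, size=200),
    "def_rating": rng.normal(100, 10, size=200),
})

corr = features.corr().abs()
# Keep only the upper triangle so each feature pair is reported once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack().dropna().sort_values(ascending=False)

print(pairs[pairs > 0.8])  # pairs worth dropping, combining, or decorrelating
```

Anything that shows up here is a candidate for removal, or for combining into a single feature, before training.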