Using Data Science to Win Fantasy Football (Part 1)

Nov 6, 2020

What if I told you that you didn’t need to pray to the fantasy gods in order to win your league? This may seem impossible, but with data science we can estimate which players are likely to score the most points. The idea is to use players’ statistics as the features of our models and their fantasy points as the target variable we predict. In this blog we are going to focus on quarterbacks, since each position needs its own process to determine which features carry information about the target variable.

Short Introduction to Fantasy Football

Every year, millions of people join fantasy leagues for the upcoming football season. The goal of fantasy football is to pick the players that will perform the best, giving your team the most points throughout the year. It all starts with a draft where, like the real NFL draft, each person in the fantasy league drafts players from a pool of all currently signed players. You then play another person each week, trying to score more points than your opponent to get a win. Players score points based on their performance, more specifically their statistics, in the real NFL games played each week. The problem I seek to solve is determining which players should be picked in order to score the most points.

Step 1: Getting the Data

To determine what data needs to be collected, we first need to look at what data is available. Lucky for us, most sports played in the United States have many different entities that seek to record every possible statistic for every player and team. This means that almost any data we want to use as a feature is collected and stored somewhere; it is just a matter of finding out where it is and how we are going to get it. In my search for reliable statistics along with fantasy points, I found a great website that allows a free download of its database. Here is a quick look at the files in the database.

Step 2: Creating the Data Frames

The data is stored in separate CSV files by the year and week the games were played. Each file lists the players that played that week along with their stats for that week. Because of this format, we need to do some preprocessing before we can use the data as our main data frame. We do this by reading each CSV and appending its rows to the main data frame, creating one large data frame that contains the information from every week and year. By the end of this process we will have the fantasy points scored and the associated statistics for each player that has played in the past 20 years.
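The combining step described above can be sketched with pandas as follows. The directory layout (`<year>/week<N>.csv`), the function name, and the added `year` and `week` columns are assumptions made for illustration; the real file names in the database may differ.

```python
import re
from pathlib import Path

import pandas as pd


def load_weekly_stats(data_dir):
    """Combine every weekly CSV under data_dir into one data frame.

    Assumes a hypothetical layout of <data_dir>/<year>/week<N>.csv.
    """
    frames = []
    for path in sorted(Path(data_dir).glob("*/week*.csv")):
        match = re.fullmatch(r"week(\d+)", path.stem)
        if match is None or not path.parent.name.isdigit():
            continue  # skip files that do not follow the assumed naming
        weekly = pd.read_csv(path)
        # Record where each row came from so weeks can be ordered later
        weekly["year"] = int(path.parent.name)
        weekly["week"] = int(match.group(1))
        frames.append(weekly)
    if not frames:
        raise FileNotFoundError(f"no weekly CSV files found under {data_dir}")
    # Concatenating once is cheaper than growing a data frame row by row
    return pd.concat(frames, ignore_index=True)
```

Collecting the weekly frames in a list and concatenating them once at the end gives the same result as appending inside the loop, without copying the growing data frame on every iteration.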

The problem right now is that we do not want the target variable to be the fantasy points scored in the same week as the statistics. That week’s statistics fully determine that week’s fantasy points, so a model would simply learn the scoring coefficients, that is, the number of points awarded per unit of each statistic (for example, under standard scoring a passing yard is typically worth 0.04 points and a rushing yard 0.1).
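To make this concrete, here is a small sketch using made-up stat lines. The scoring weights are assumed values for illustration (common "standard" settings) and may differ from the scoring behind the actual data set. A plain least-squares fit of same-week points on same-week stats just hands the scoring rules back.

```python
import numpy as np

# Assumed scoring weights for illustration only;
# the scoring behind the actual data set may differ.
SCORING = {
    "passing_yards": 0.04,
    "passing_tds": 4.0,
    "interceptions": -2.0,
    "rushing_yards": 0.1,
    "rushing_tds": 6.0,
}


def recover_scoring_weights(stats, points):
    """Least-squares fit of same-week points on same-week stats."""
    weights, *_ = np.linalg.lstsq(stats, points, rcond=None)
    return weights


# Made-up stat lines for 50 quarterback games
rng = np.random.default_rng(0)
stats = np.column_stack([
    rng.integers(100, 400, size=50),  # passing yards
    rng.integers(0, 5, size=50),      # passing touchdowns
    rng.integers(0, 3, size=50),      # interceptions
    rng.integers(0, 80, size=50),     # rushing yards
    rng.integers(0, 2, size=50),      # rushing touchdowns
]).astype(float)

true_weights = np.array(list(SCORING.values()))
points = stats @ true_weights  # points are an exact linear function of the stats

fitted = recover_scoring_weights(stats, points)
for name, weight in zip(SCORING, fitted):
    print(f"{name}: {weight:.2f}")
```

The fitted weights match the scoring table, which tells us nothing about how a player will do in a future game.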

Our goal here is to use previous weeks’ data to help predict how many points a player will score next week. To achieve this, we use the points that player scored the following week as the target variable for the current week.
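One way to build that target is a grouped shift in pandas. This is a minimal sketch: the column names (`player`, `year`, `week`, `fantasy_points`) are hypothetical, and it pairs each game with the same player's next game in the same season, so after a bye week the "next week" is really the next game played.

```python
import pandas as pd


def add_next_week_target(df):
    """Attach each player's following-game points as the target column."""
    ordered = df.sort_values(["player", "year", "week"]).copy()
    # shift(-1) within each player-season pulls the next game's points up one row
    ordered["next_week_points"] = (
        ordered.groupby(["player", "year"])["fantasy_points"].shift(-1)
    )
    # The last game of a season has no following game, so it cannot be a training row
    return ordered.dropna(subset=["next_week_points"]).reset_index(drop=True)
```

Grouping before shifting matters: without it, the last row of one player would pick up the first row of the next player as its target.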

Step 3: Feature Engineering

In order to tailor this process to quarterbacks, some features need to be created and others need to be dropped. One very common statistic that I was surprised not to find in the data was completion percentage. This is the number of completed passes divided by the number of pass attempts. It gives us some indication of how accurate the quarterback is with his throws, but also of how good his decision making is. Another interesting thing about quarterbacks is that some choose to run significantly more than others do. They are given the pseudo-title of “running quarterback”, and I think this is an important characteristic to capture. The reason is that rushing yards (when a player runs with the ball in hand) are worth more points per yard than passing yards (when a player throws the ball to another). There is a tradeoff, as it is risky for the quarterback to run, but in general quarterbacks that can both run and throw score more points than those who specialize in one or the other.
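Both ideas can be turned into columns with a few lines of pandas. This is a sketch under assumed column names (`pass_completions`, `pass_attempts`, `rush_attempts`), and the rushing share of plays is just one possible way to capture how much a quarterback runs.

```python
import pandas as pd


def add_qb_features(df):
    """Add completion percentage and a simple 'running quarterback' indicator."""
    out = df.copy()
    attempts = out["pass_attempts"]
    # Completed passes divided by attempts; left missing when there were no attempts
    out["completion_pct"] = (out["pass_completions"] / attempts).where(attempts > 0)
    # Share of the quarterback's plays that were runs rather than throws
    plays = attempts + out["rush_attempts"]
    out["rush_share"] = (out["rush_attempts"] / plays).where(plays > 0)
    return out
```

Rows with zero attempts are left as missing values rather than zero, so a quarterback who never threw is not treated as one who missed every pass.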

Looking at the features that we have, there is a massive piece missing from our data. Fantasy is not just about playing the “best” players each week. The team that a player is going up against is a massive factor in how well they will perform. If a quarterback is going against a team that is extremely good at passing defense, his score will likely be much lower than that of a player going against a weak defense. There are many different factors to consider here, and I will not be able to go over all of them. Some of the features that I would like to add are the opposing defense’s rank in the league, the yards the opposing defense has allowed, the number of players the opposing team has out with injuries, and many more.
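As a rough sketch of how one such feature could be attached once the defensive data is collected, the function below joins each quarterback row to the opposing defense's average passing yards allowed in earlier weeks of the same season. Both tables and all column names (`opponent`, `team`, `pass_yards_allowed`, and so on) are hypothetical, since this data is not part of the current data set.

```python
import pandas as pd


def add_opponent_pass_defense(qb, defense):
    """Join the opponent's prior-weeks average passing yards allowed onto QB rows."""
    d = defense.sort_values(["team", "year", "week"]).copy()
    # shift(1) excludes the current game, so only earlier weeks feed the average
    d["opp_pass_yards_allowed_avg"] = (
        d.groupby(["team", "year"])["pass_yards_allowed"]
        .transform(lambda s: s.shift(1).expanding().mean())
    )
    d = d.rename(columns={"team": "opponent"})
    cols = ["opponent", "year", "week", "opp_pass_yards_allowed_avg"]
    # Left join keeps every quarterback row, even when no defensive data exists
    return qb.merge(d[cols], on=["opponent", "year", "week"], how="left")
```

Using only earlier weeks avoids leaking information from the game being predicted; in week 1 the feature is simply missing.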

Step 4: Feature Selection

For this process I am going to aim for the best predictive model rather than an interpretable one. The reason is that I do not need to know why certain players will score the most, just that they will. I could see a use case for that information, but only if I were scouting players for the NFL, where you would want to know which characteristics of a player produce the best results. The processes that will be run are recursive feature elimination and a filter-based feature selection. These will determine the best features for this problem.
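Both approaches are available in scikit-learn. The sketch below is one possible setup, not necessarily the configuration used for this project: the linear regression estimator inside the recursive feature elimination and the `f_regression` score for the filter are assumptions, as is the helper name `select_features`.

```python
import numpy as np
from sklearn.feature_selection import RFE, SelectKBest, f_regression
from sklearn.linear_model import LinearRegression


def select_features(X, y, feature_names, k):
    """Return the k features kept by RFE and by a univariate filter."""
    # Wrapper method: repeatedly fit the model and drop the weakest coefficient.
    # Features should be on comparable scales for the coefficients to be comparable.
    rfe = RFE(estimator=LinearRegression(), n_features_to_select=k).fit(X, y)
    # Filter method: score each feature on its own against the target
    filt = SelectKBest(score_func=f_regression, k=k).fit(X, y)
    rfe_names = [n for n, keep in zip(feature_names, rfe.support_) if keep]
    filter_names = [n for n, keep in zip(feature_names, filt.get_support()) if keep]
    return rfe_names, filter_names


# Synthetic check: only the first and third columns drive the target
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.1, size=200)
rfe_names, filter_names = select_features(X, y, ["a", "b", "c", "d"], k=2)
print(rfe_names, filter_names)
```

Comparing the two lists is a quick sanity check: features kept by both methods are the safest candidates, while disagreements are worth a closer look.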

Next Steps:

The data that I would need to create all the features I want is spread out across many sources, and it is going to be somewhat time consuming to collect. The data that I will need is listed below:

Team matchups
Injured reserve lists
Defensive ranks
Team matchup history

Again, this process so far is only concerned with quarterbacks, as they are the highest contributors to fantasy points while also being the most complex position in terms of earning points. With the current data set, the root mean squared error (RMSE) is about 8 points. This is quite horrendous, but it comes from a baseline model that lacks most of what I plan to add. See updates in part 2.
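For reference, RMSE is the square root of the average squared difference between predicted and actual points. A minimal version, with a hypothetical function name and made-up numbers, looks like this:

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Two made-up games, each missed by 8 points in opposite directions
print(rmse([10.0, 20.0], [18.0, 12.0]))
```

Because the errors are squared before averaging, a few large misses raise the RMSE more than many small ones.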