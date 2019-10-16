
Predicting NFL play outcomes with Python and data science
In part 2 of this series on machine learning with Python, train and use a data model to predict plays from a National Football League dataset.
If you made it through part 1, congrats! You have the patience it takes to format data. In that article, I cleaned up my National Football League data set using a few Python libraries and some basic football knowledge. Picking up where I left off, it's time to take a closer look at my data set.
Data analysis
I'm going to create a final dataframe that contains only the data fields I want to use. These are mostly the fields I created when transforming columns, plus down and distance (aka yardsToGo).
df_final = df[['down','yardsToGo', 'yardsToEndzone', 'rb_count', 'te_count', 'wr_count', 'ol_count',
'db_count', 'secondsLeftInHalf', 'half', 'numericPlayType', 'numericFormation', 'play_type']]
Now I want to spot check my data using dataframe.describe(). It summarizes each column in the dataframe (counts, means, minimums, maximums, and so on), which makes it easier to spot any unusual values.
print(df_final.describe(include='all'))
Most everything looks good, except yardsToEndzone has a lower count than the rest of the columns. The dataframe.describe() documentation defines the count return value as the "number of non-NA/null observations." I need to check whether I have null yard-line values.
print(df.yardlineNumber.unique())
Why is there a nan value? Why do I seem to be missing a 50-yard line? If I didn't know any better, I'd say my undiluted data from the NFL dump doesn't actually use the 50-yard line as a value and instead marks it as nan.
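To make that check concrete, here is a minimal sketch on a made-up dataframe (the values are invented for illustration, not taken from the NFL data; only the column names mirror mine). It shows how a single missing value lowers the count that describe() reports and how isna() can locate the affected rows.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration; only the column names mirror the article
toy = pd.DataFrame({
    'down': [1, 2, 3, 1],
    'yardlineNumber': [25.0, np.nan, 40.0, 35.0],
})

# describe() counts only non-null observations, so this column reports 3, not 4
non_null_count = int(toy['yardlineNumber'].describe()['count'])

# isna().sum() gives the number of missing values per column
missing_per_column = toy.isna().sum()

# Boolean indexing pulls out the rows with a missing yard line
missing_rows = toy[toy['yardlineNumber'].isna()]

print(non_null_count)
print(missing_per_column)
print(missing_rows)
```

The same boolean filter should work on the real dataframe to pull up the plays that need a closer look.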
Here are some play descriptions for a few of the plays where the yard-line value is NA:
It seems my hypothesis is correct. Each play description's ending yard line and yards gained come out to 50. Perfect (why?!). I'll map these nan values to 50 by adding a single line before the yards_to_endzone function from last time.
df['yardlineNumber'] = df['yardlineNumber'].fillna(50)
Running df_final.describe() again, I now have uniform counts across the board. Who knew so much of this practice was just grinding through data? I liked it better when it had an air of mysticism about it.
It's time to start my visualization. Seaborn is a helpful library for plotting data, and I already imported it in part 1.
Play type
How many plays are passing plays vs. running plays in the full data set?
sns.catplot(x='play_type', kind='count', data=df_final, orient='h')
plt.show()
It looks like there are about 1,000 more passing plays than running plays. This is important because it means the distribution between both play types is not a 50/50 split. All else being equal, I should expect slightly more passing plays than running plays in any subset of the data.
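The imbalance can also be put into numbers. Below is a small sketch using made-up labels (six passes and four runs, not the real counts) that computes each play type's share and the accuracy a model would get by always guessing the more common type. That baseline is a useful yardstick for any accuracy figure later on.

```python
import pandas as pd

# Toy labels, invented for illustration (6 passes, 4 runs)
play_type = pd.Series(['pass'] * 6 + ['run'] * 4, name='play_type')

# Share of each play type in the data
shares = play_type.value_counts(normalize=True)

# Accuracy of a "model" that always guesses the most common play type
baseline_accuracy = float(shares.max())

print(shares)
print("Majority-class baseline: %.2f" % baseline_accuracy)
```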
Downs
A down is a period where a team can attempt a play. In the NFL, an offense gets four play attempts (called "downs") to gain a specified number of yards (usually starting with 10 yards); if it doesn't, it has to give the ball to the opponent. Is there a specific down that tends to have more passes or runs (also called rushes)?
sns.catplot(x="down", kind="count", hue='play_type', data=df_final);
plt.show()
Third downs have significantly more passing plays than running plays but, given the initial data distribution, this is probably meaningless.
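One way to test that hunch is to compare the pass share on each down with the overall pass share. This sketch does it with pandas.crosstab on a tiny invented dataframe; the numbers are placeholders and say nothing about the real NFL data.

```python
import pandas as pd

# Toy plays, invented for illustration
toy = pd.DataFrame({
    'down':      [1, 1, 1, 1, 3, 3, 3, 3],
    'play_type': ['pass', 'run', 'run', 'pass', 'pass', 'pass', 'pass', 'run'],
})

# Row-normalized table: each row sums to 1, so downs can be compared directly
share_by_down = pd.crosstab(toy['down'], toy['play_type'], normalize='index')

# Overall share of passing plays, the reference point for every down
overall_pass_share = float((toy['play_type'] == 'pass').mean())

print(share_by_down)
print("Overall pass share: %.3f" % overall_pass_share)
```

If a down's pass share sits close to the overall share, the gap in the count plot is probably just the base rate showing through.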
Regression
I can use the numericPlayType column to my advantage and create a regression plot to see if there are any trends.
sns.lmplot(x="yardsToGo", y="numericPlayType", data=df_final, y_jitter=.03, logistic=True, aspect=2);
plt.show()
This is a basic regression chart that says the larger the value of yards to go, the larger the numeric play type will be. With a play type of 0 for running and 1 for passing, this means that the more distance there is to cover, the more likely the play will be a passing type.
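For intuition about what logistic=True draws, here is a sketch of a logistic curve in plain Python. The intercept and slope are hypothetical, picked only to show the shape; they are not the coefficients seaborn fitted to my data.

```python
import math

def pass_probability(yards_to_go, intercept, slope):
    """Logistic curve: maps a linear score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-(intercept + slope * yards_to_go)))

# Hypothetical coefficients, chosen only to illustrate the shape of the curve
INTERCEPT = -0.5
SLOPE = 0.1

for yards in (1, 5, 10, 20):
    print(yards, round(pass_probability(yards, INTERCEPT, SLOPE), 3))
```

With a positive slope, the probability of a pass climbs as yards to go increases, which matches the trend in the chart.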
Model training
I'm going to use XGBoost for training; it requires input data to be all numeric (so I have to drop the play_type column I used in my visualizations). I also need to split my data into training, validation, and testing subsets.
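# XGBoost needs all-numeric input, so drop the text label column before splitting
df_model = df_final.drop(columns=['play_type'])
# Shuffle the rows, then cut at 70% and 90% for a 70/20/10 train/validation/test split
train_df, validation_df, test_df = np.split(df_model.sample(frac=1), [int(0.7 * len(df_model)), int(0.9 * len(df_model))])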
print("Training size is %d, validation size is %d, test size is %d" % (len(train_df),
len(validation_df),
len(test_df)))
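The cut points work out to a 70/20/10 split. Here is the same idea sketched with iloc slicing on a ten-row toy dataframe; it is an alternative way to write the split, not what I ran above, and the fixed random_state is there only to make the sketch repeatable.

```python
import pandas as pd

# Toy frame with 10 rows, invented for illustration
toy = pd.DataFrame({'yardsToGo': range(10), 'numericPlayType': [0, 1] * 5})

# Shuffle once so each subset is a random sample of the plays
shuffled = toy.sample(frac=1, random_state=0)

# Cut points at 70% and 90% of the rows
train_end = int(0.7 * len(shuffled))
val_end = int(0.9 * len(shuffled))

train_part = shuffled.iloc[:train_end]          # first 70%
val_part = shuffled.iloc[train_end:val_end]     # next 20%
test_part = shuffled.iloc[val_end:]             # final 10%

print(len(train_part), len(val_part), len(test_part))
```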
XGBoost takes data in its own data structure, which I can create using DMatrix. Basically, I'll declare numericPlayType as the label I want to predict, so I'll feed it a clean set of data without that column.
train_clean_df = train_df.drop(columns=['numericPlayType'])
d_train = xgb.DMatrix(train_clean_df, label=train_df['numericPlayType'],
feature_names=list(train_clean_df))
val_clean_df = validation_df.drop(columns =['numericPlayType'])
d_val = xgb.DMatrix(val_clean_df, label=validation_df['numericPlayType'],
feature_names=list(val_clean_df))
eval_list = [(d_train, 'train'), (d_val, 'eval')]
results = {}
The remaining setup requires some parameter adjustments. Without getting too into the weeds, predicting run/pass is a binary problem, and I should set the objective to binary:logistic. For more information about all of XGBoost's parameters, consult its documentation.
param = {
'objective': 'binary:logistic',
'eval_metric': 'auc',
'max_depth': 5,
'eta': 0.2,
'rate_drop': 0.2,  # only used by the DART booster; ignored by the default gbtree booster
'min_child_weight': 6,
'gamma': 4,
'subsample': 0.8,
'alpha': 0.1
}
Several unsavory insults directed at my PC and a two-part series later (sobs in Python), I am officially ready to train my model! I'm going to set an early stopping round, meaning that if the evaluation metric on the validation set does not improve for eight consecutive rounds, training will end. This helps prevent overfitting. The prediction results are represented as a probability that the result will be a 1 (passing play).
num_round = 250
xgb_model = xgb.train(param, d_train, num_boost_round=num_round, evals=eval_list, early_stopping_rounds=8, evals_result=results)
test_clean_df = test_df.drop(columns=['numericPlayType'])
d_test = xgb.DMatrix(test_clean_df, label=test_df['numericPlayType'],
feature_names=list(test_clean_df))
actual = test_df['numericPlayType']
predictions = xgb_model.predict(d_test)
print(predictions[:5])
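The early-stopping rule can be sketched in a few lines of plain Python. This is a simplified illustration of the idea, not XGBoost's actual implementation; the function name and the AUC values are made up.

```python
def best_round_with_patience(scores, patience):
    """Return (best_round, rounds_run) for a metric where higher is better.

    Training stops once `patience` consecutive rounds fail to beat the best
    score seen so far.
    """
    best_score = float('-inf')
    best_round = -1
    for round_index, score in enumerate(scores):
        if score > best_score:
            best_score = score
            best_round = round_index
        elif round_index - best_round >= patience:
            # No improvement for `patience` rounds in a row: stop here
            return best_round, round_index + 1
    return best_round, len(scores)

# Made-up validation AUC values, one per boosting round
validation_auc = [0.70, 0.74, 0.76, 0.755, 0.758, 0.759, 0.75]

print(best_round_with_patience(validation_auc, patience=3))
```

In this toy run, the score peaks at round index 2 and never beats it again, so the loop gives up three rounds later instead of running through the whole list.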
I want to see how accurate my model is using my rounded predictions (to 0 or 1) and scikit-learn's metrics package.
rounded_predictions = np.round(predictions)
accuracy = metrics.accuracy_score(actual, rounded_predictions)
print("Metrics:\nAccuracy: %.4f" % (accuracy))
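To show what that accuracy number means, here is the same calculation done by hand with NumPy on five made-up predictions. The probabilities and labels are invented; they are not output from my model.

```python
import numpy as np

# Made-up predicted probabilities and true labels (1 = pass, 0 = run)
toy_predictions = np.array([0.91, 0.35, 0.62, 0.08, 0.55])
toy_actual = np.array([1, 0, 0, 0, 1])

# Rounding turns each probability into a hard 0/1 guess
toy_rounded = np.round(toy_predictions)

# Accuracy is the fraction of guesses that match the true label
toy_accuracy = float((toy_rounded == toy_actual).mean())

print(toy_rounded)
print("Accuracy: %.4f" % toy_accuracy)
```

Four of the five guesses match here; the third play gets a pass probability of 0.62 but was actually a run.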
Using Python and its vast repertoire of libraries and models, I could reasonably predict the play type outcome. However, there are still some factors I did not consider. What effect does defense personnel have on play type? What about score differential at the time of the play? I suppose there is always room to go over your data and improve. Alas, this is the life of a programmer turned data scientist. Time to consider early retirement.
