Jamarius N. Taylor



Data Scientist

Thoughts on NBA Data - Acquisition


Overview

There comes a time (sometimes several times) in a Data Scientist's life when they think, "I am going to apply my craft to my favorite sport." For me, that time is now. The main issue that I, and everyone who has come before me, face is that this is no small task; it is in fact three different tasks masquerading as one. First, we need to get at the data, and the curated datasets are expensive, so we will have to go the scraping route. Second, we need to store all of this data properly so we don't have to continuously re-scrape it, which turns this into a Data Engineering problem. Only after both of those do we get to the analysis itself. I am going to document the process and track my journey here. Below you will find my plan of action. Later we'll revisit it and see what could have gone better.

Identify a Project

I have been wanting to go through Richard McElreath's Statistical Rethinking series online, and I want to practice on a dataset that interests me. That seems like as good a reason as any to collect this data, so it looks like I've found my project. The gist: if I am going to be doing causal inference and practicing Bayesian techniques, I want to prioritize getting metrics.

Identify a Source

This was actually the easiest part. www.basketball-reference.com has all the data I could want, including box scores for every game last season. The difficult part will be scraping all of it.

Identify a Plan

The plan here is to scrape the box score of each game. There are 30 teams that each play 82 games, and every game involves two teams, so that's 30 × 82 / 2 = 1,230 regular-season box scores to scrape. I want to do this respectfully, so we will put pauses between requests and won't run them in parallel, so as not to DoS www.basketball-reference.com. The plan will be to use Python and BeautifulSoup4 to scrape and parse all the box scores into tables. First we will need a list of all box score links. We can gather that by paging through pages like this and collecting every box score link we find. By doing this in Python we can set these up as cloud functions later on to make this project a bit more future-proof. Once we are happy with our tables we will store them in Google Cloud Storage so they can be ingested by BigQuery.
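To make the link-gathering and pausing steps concrete, here is a minimal sketch of how they might look. Several things in it are assumptions rather than anything I have verified against the site: the href pattern for a box score link (`/boxscores/` followed by a date-style ID and a three-letter team code), the 5-second default delay, and the helper names `extract_boxscore_links` and `fetch_politely`, which are made up for illustration. The `fetch` argument is whatever function actually downloads a page, passed in so the pausing logic can be tried out without touching the network.

```python
import re
import time
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://www.basketball-reference.com"

# ASSUMPTION: a single game's box score href looks like
# /boxscores/202310240DEN.html (nine digits, then a three-letter team code).
BOXSCORE_HREF = re.compile(r"^/boxscores/\d{9}[A-Z]{3}\.html$")


def extract_boxscore_links(schedule_html):
    """Return the unique box score URLs found in one page, in page order."""
    soup = BeautifulSoup(schedule_html, "html.parser")
    links = []
    # find_all accepts a compiled regex as an attribute filter.
    for anchor in soup.find_all("a", href=BOXSCORE_HREF):
        url = urljoin(BASE_URL, anchor["href"])
        if url not in links:  # the same game can be linked more than once
            links.append(url)
    return links


def fetch_politely(urls, fetch, delay_seconds=5.0, sleep=time.sleep):
    """Fetch URLs one at a time, pausing between requests (never in parallel).

    `fetch` is any callable that takes a URL and returns the page body.
    ASSUMPTION: the 5-second default is a guess at a respectful pace.
    """
    pages = {}
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # pause before every request except the first
        pages[url] = fetch(url)
    return pages
```

In a real run, `fetch` could be a thin wrapper around `urllib.request.urlopen` or `requests.get`, and the returned pages would then be parsed into tables. Keeping the download function separate from the pacing loop should also make it easier to move this into a cloud function later.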

Summary

I plan to apply some of the methods from Richard McElreath's Statistical Rethinking series to NBA data from the past season. To do this, I am going to need to scrape all of the data for the past season and store it somewhere tidy and easy to access.