Below is a workflow for gathering NET basketball rankings data using R. But first, some background on the data that’s currently available (that I’m aware of).
On seemingly a daily basis, NCAA.com updates its NET rankings webpage with the date of the latest publish right above the table. The URL of this page (https://www.ncaa.com/rankings/basketball-men/d1/ncaa-mens-basketball-net-rankings) does not change when updates are made, so if you are looking for historical data to do some sort of analysis, you’re out of luck unless you’ve scrape their page daily and store the results in a database.
Luckily though, they do have archives. You can find a link to the archives at the very bottom of the page as shown in the screenshot below (c’mon, Mississippi Val.!).
I was hoping for something easier to work with, but that was wishful thinking. Still, gracious anything is provided because it just takes a bit of data manipulation in R to get these trick PDF files into nice neat data frames for b-ball analysis.
When you click through to the archive you find a list of links. The ones you want are “Nitty Gritty sheets. When you click into those, you end up with a pdf document that contains rankings and other data for that day.
The good news is that these PDFs are created in a way that that the data can be scraped. In this case, I use the extract_tables function within the tabulizer package in R. Let’s get started, beginning with loading the required package in R so we can use them in this session:
# install.packages(tidyverse) # install.packages(tabulizer) library(tidyverse) library(tabulizer)
I had some trouble adding ‘tabulizer’ to to my library. Kept receiving an error. Eventually found a solution here. I’m still not exactly sure what the problem was but it worked : )
Next, we’ll pick a Nitty Gritty PDF to crawl — this one from January 6, 2020 looks good. Make note of the URL structure, we can get creative later and automate the collection and storage of this data as long as we know of patterns in the URL to work off of.
location01082020 <- "https://extra.ncaa.org/solutions/rpi/Stats%20Library/NET%20Nitty%20Gritty%20-%20Jan.%206,%202020.pdf"
Then, extract the data using the extract_tables function I’ve listed below. You’ll notice after you view the data that it comes back as nested lists. This is tricky to work with, so we follow up by converting it into a list of data frames that will eventually bind together into one large data frame (all 13 tables from the PDF in one large table).
net01082020 <- extract_tables(location01082020, output = "data.frame") view(net01082020) # You're viewing a nested list of data frames here netnet01082020all <- lapply(net01082020, mutate_if, is.integer, as.character) netnet01082020all <- bind_rows(net01082020all)
We will need to combine the nested data frames using bind_rows. However, I was getting an error ‘Error: Column
X.2 can’t be converted from integer to character‘. In order to do deal with this data type issue, we need to change all integers to characters (using lapply). Once that’s done, we easily combine rows with bind_rows from dplyr.
And there you have it. A nice single data frame with all of the data from the particular day you targeted in your extraction. Still needs some cleaning up though. There are some extra characters before team names, columns (variables) need to be re-labeled and there are extra columns that need to be deleted.
# remove special characters "[" and "]" net01082020all$X <- gsub("\\[","",net01082020all$X) net01082020all$X <- gsub("\\]","",net01082020all$X) # remove extra blank columns at end of data table net12232019all <- net12232019all[,-c(12:19)] # rename variables (column names) colnames(net12232019all) <- c("Team", "NET Rank", "Avg Opp Rank", "Avg Opp NET", "Record", "Conference Record", "Non-Conference Record", "Road Record", "SOS", "NC SOS", "Quadrant Records")
You may notice that the final column has all four quadrant records smashed together in one variable. If this is important to your analysis, you’ll need to take an extra step to separate these out into four columns. I may update this post at a later date, but until then I’d suggest looking into maybe the separate() function to do the job.
Last but not least, a great application for all of this would be to create a process that loops through all of the archives and stores a historical database somewhere that you can access whenever you need for analysis. I haven’t gotten that far. Attempted it, but got stuck on how to build a list of URLs in an efficient manner. If you have any thoughts on this or run into problems with the above, let me know in the comments.