Create High School Basketball Computer Rankings with R

This post outlines how to collect some very basic high school basketball scores data from MaxPreps. By collecting this data and building upon it, you should have everything you need to create a computer rankings model for your state, much like I did for Nebraska Class A boys basketball.

The state of Nebraska puts teams into different classes (Class A, B, C, etc.) depending mostly (from what I can tell) on enrollment and location. For this project, I was interested in Class A boys basketball, so the first thing I did was get an official list of Class A teams from the Nebraska School Activities Association (NSAA) website.

Knowing I would need this list for later, I manully created a dataframe called Class A:

# Manual Class A names
School <-
    'Omaha South',
    'Omaha Central',
    'Grand Island',
    'Millard North',
    'Millard South',
    'Millard West',
    'Lincoln East',
    'Lincoln High',
    'North Star',
    'Creighton Prep',
    'Omaha North',
    'Lincoln Southeast',
    'Lincoln Southwest',
    'Omaha Westside',
    'Papillion-LaVista South',
    'Lincoln Northeast',
    'Bellevue West',
    'Omaha Northwest',
    'Bellevue East',
    'Elkhorn South',
    'North Platte',
    'Pius X',
    'South Sioux City'
# Manual Class A enrollment
Enrollment <-

#Combine list into data frame
classA <- data.frame(School, Enrollment)

We’ll need the following libraries for this project:

library(c("shiny", "DT", "shinythemes", "rvest", "expss", "dplyr", "tidyr", "stringr", "sqldf", "xlsx", "scales"))

In order for all of this to work, I needed some basic (but turns out hard to find) high school basketball scores data in a consistent way. I tried scraping Omaha World Herald scores and NSAA scores, but in the end found MaxPreps easiest to work with.

In order to automate this, it’s important to understand the components of the MaxPreps scoreboard URLs. These URLs list scores by a given day, which you can select on the website by choosing from a dropdown calendar. Here is a closer look at the structure:

# start of scoreboard URLs on MaxPreps

# date

# end of URL that specifies boys, state and class (divisionid)

So the putting the pieces above together would give us a page of scores on December 5, 2019 for Nebraska Class A boys basketball. Now that we understand this structure, we need to create a list of dates that we can will use to then create a list of URLs that contain scores for the entire season. I couldn’t figure out a great way to collect the game dates, so this part was manual as well. If you find a more automated way to get a list of days which games were played on, let me know!

# create games table -- game_dates NEEDS TO BE UPDATED WITH NEW DAYS
maxprep_baseURL <- 

maxprep_paramURL <- 

# this part was manual, but building a list of final below is not
game_dates <- c(

maxprep_page_list <- 
  as.list(paste0(maxprep_baseURL, game_dates, maxprep_paramURL))

Now that we have a list of URLs that contain all of the Class A scores we want, we can scrape the HTML data we want (the scores). Please note you need to get familiar with the HTML of the pages you’re pulling data from for this to work. In my case, I found the scores all sitting neatly under a div called data-contest-state.

maxprep_html <- lapply(maxprep_page_list, FUN=function(URLLink){
  read_html(URLLink) %>% html_nodes("[data-contest-state='boxscore']") %>% html_text()

The above returns a list of lists, which is difficult to work with. So we can modify this by using the unlist function. We’ll also add a few more modifications to get this all into a dataframe called scores.

# Unlist and create dataframe of scores by game
scores <-

scores <- 
  gsub("Final","", scores)

scores <- 
  grep("#", scores, invert = TRUE, value = TRUE)

scores <- 

If you check the dataframe using view(scores) you’ll notice a problem. All of the scores and teams are listed under one column. We need to separate these into different columns and rename them.

colnames(scores) <- 

scores <- 
  scores %>%
  mutate(V1 = gsub("(\\d+)", ";\\1;", V1)) %>%
  separate(V1, c(NA, "No1", "Let1", "No2", "Let2"), sep = " *; *")

colnames(scores) <- 
  c("Away_Score", "Away_Team", "Home_Score", "Home_Team")

Okay, we’re making some really good progress now. We’ve basically scraped the data we need and manipulated into a workable dataframe with four useful variables/columns: “Away_Score,” “Away_Team,” “Home_Score,” and “Home_Team.”

With just this data we can create a number of new variables that will be useful including winner, loser, home win, home loss, away win, and away loss. It took me a while to figure out a good way to do this but eventually found that using if_else from dplyr is very efficient.

scores$Winner <- 
  if_else(scores$Away_Score > scores$Home_Score, scores$Away_Team, scores$Home_Team)

scores$Loser <- 
  if_else(scores$Away_Score < scores$Home_Score, scores$Away_Team, scores$Home_Team)

scores$Home_W <- 
  if_else(scores$Winner==scores$Home_Team, scores$Home_Team, "NA")

scores$Home_L <- 
  if_else(scores$Loser==scores$Home_Team, scores$Home_Team, "NA")

scores$Away_W <- 
  if_else(scores$Winner==scores$Away_Team, scores$Away_Team, "NA")

scores$Away_L <- 
  if_else(scores$Loser==scores$Away_Team, scores$Away_Team, "NA")

And then I cleaned it up a bit (please leave comments if you have any questions about what is going on here):

scores <- 
  scores %>% mutate_all(~gsub('\r|\n', '', .))

# Would like to make this more efficient
scores$Winner <- str_trim(scores$Winner, side = "both")
scores$Loser <- str_trim(scores$Loser, side = "both")
scores$Away_Team <- str_trim(scores$Away_Team, side = "both")
scores$Home_Team <- str_trim(scores$Home_Team, side = "both")

scores$Home_Score <- as.numeric(scores$Home_Score)
scores$Away_Score <- as.numeric(scores$Away_Score)

From this point, I decided to make a few new categories — how much a team won by and whether the home or away team is a Class A team or not. The Class is important here because a few teams in Class A mostly played Class B teams and I wanted to have that information available to use in how I scored teams later on.

scores <- 
  scores %>% mutate_all(~gsub('\r|\n', '', .))

# There's probably a more efficient way to write this
scores$Winner <- str_trim(scores$Winner, side = "both")
scores$Loser <- str_trim(scores$Loser, side = "both")
scores$Away_Team <- str_trim(scores$Away_Team, side = "both")
scores$Home_Team <- str_trim(scores$Home_Team, side = "both")

scores$Home_Score <- as.numeric(scores$Home_Score)
scores$Away_Score <- as.numeric(scores$Away_Score)

Now I have a good amount of variables stored in my scores dataframe. My next step will be use the data in that dataframe to build out the classA dataframe that we created in the beggining.

To execute this we will need to group or summarize a lot of data in scores for a particular team. For example, to get the number of Wins for a given team, we have to count the number of times that team appears in the scores$Winner column (which is why we created that variable). I was able to execute this using a number of functions including merge and sqldf.

classA <- 
  merge(classA, stack(table(factor(scores$Winner, levels = classA$School))), 
        by.x = 'School', by.y = "ind")
names(classA)[names(classA) == 'values'] <- 'Wins'

classA <- 
  merge(classA, stack(table(factor(scores$Loser, levels = classA$School))), 
        by.x = 'School', by.y = "ind")
names(classA)[names(classA) == 'values'] <- 'Losses'

classA$Win_Pct <- 
  round(classA$Wins / (classA$Wins + classA$Losses), digits = 2)

classA$Games_Played <- 
  classA$Wins + classA$Losses

# Use sqldf to create Home Wins/Losses, Away Wins/Losses from scores df
varHW <- sqldf("select scores.Home_Team, count(scores.Home_W)
               from scores
               where scores.Home_Class==TRUE AND scores.Home_Team==scores.Winner
               group by scores.Home_Team",
names(varHW)[2] <- "Home_Wins"
classA <- sqldf("select classA.*, varHW.Home_Wins 
                from classA 
                left join varHW on classA.School = varHW.Home_Team", 
                stringsAsFactors = FALSE)

varHL <- sqldf("select scores.Home_Team, count(scores.Home_L)
               from scores
               where scores.Home_Class==TRUE AND scores.Home_Team==scores.Loser
               group by scores.Home_Team",
names(varHL)[2] <- "Home_Losses"
classA <- sqldf("select classA.*, varHL.Home_Losses 
                from classA 
                left join varHL on classA.School = varHL.Home_Team", 
                stringsAsFactors = FALSE)

varAW <- sqldf("select scores.Away_Team, count(scores.Away_W)
               from scores
               where scores.Away_Class==TRUE AND scores.Away_Team==scores.Winner
               group by scores.Away_Team",
names(varAW)[2] <- "Away_Wins"
classA <- sqldf("select classA.*, varAW.Away_Wins 
                from classA 
                left join varAW on classA.School = varAW.Away_Team", 
                stringsAsFactors = FALSE)

varAL <- sqldf("select scores.Away_Team, count(scores.Away_L)
               from scores
               where scores.Away_Class==TRUE AND scores.Away_Team==scores.Loser
               group by scores.Away_Team",
names(varAL)[2] <- "Away_Losses"
classA <- sqldf("select classA.*, varAL.Away_Losses 
                from classA 
                left join varAL on classA.School = varAL.Away_Team", 
                stringsAsFactors = FALSE)

classA[] <- 0

scores$Away_Score <- as.numeric(scores$Away_Score)
scores$Home_Score <- as.numeric(scores$Home_Score)
classA$School <- as.character(classA$School)

# Using dplyr to create home and away points per game (ppg)
classA <- 
  scores %>% 
  group_by(Away_Team) %>% 
  summarise(Away_PPG = mean(Away_Score, na.rm = TRUE)) %>% 
  right_join(classA, by = c(Away_Team = 'School'))

names(classA)[names(classA) == 'Away_Team'] <- 'School'

classA <- 
  scores %>% 
  group_by(Home_Team) %>% 
  summarise(Home_PPG = mean(Home_Score, na.rm = TRUE)) %>% 
  right_join(classA, by = c(Home_Team = 'School'))

names(classA)[names(classA) == 'Home_Team'] <- 'School'

classA <- 
  scores %>% 
  group_by(Home_Team) %>% 
  summarise(Home_dPPG = mean(Away_Score, na.rm = TRUE)) %>% 
  right_join(classA, by = c(Home_Team = 'School'))

names(classA)[names(classA) == 'Home_Team'] <- 'School'

classA <- 
  scores %>% 
  group_by(Away_Team) %>% 
  summarise(Away_dPPG = mean(Home_Score, na.rm = TRUE)) %>% 
  right_join(classA, by = c(Away_Team = 'School'))

names(classA)[names(classA) == 'Away_Team'] <- 'School'

classA$Away_PPG <- round(classA$Away_PPG, digits = 0)
classA$Home_PPG <- round(classA$Home_PPG, digits = 0)
classA$Home_dPPG <- round(classA$Home_dPPG, digits = 0)
classA$Away_dPPG <- round(classA$Away_dPPG, digits = 0)

classA$Home_PPG_Diff <- round(classA$Home_PPG - classA$Home_dPPG, digits = 0)
classA$Away_PPG_Diff <- round(classA$Away_PPG - classA$Away_dPPG, digits = 0)

We now have a lot of juicy data in classA for analysis, but let’s go further and create a few more variables:

classA <- 
  scores %>% 
  group_by(Away_Team) %>% 
  summarise(Away_Total_Points = sum(Away_Score, na.rm = TRUE)) %>% 
  right_join(classA, by = c(Away_Team = 'School'))

names(classA)[names(classA) == 'Away_Team'] <- 'School'

classA <- 
  scores %>% 
  group_by(Home_Team) %>% 
  summarise(Home_Total_Points = sum(Home_Score, na.rm = TRUE)) %>% 
  right_join(classA, by = c(Home_Team = 'School'))

names(classA)[names(classA) == 'Home_Team'] <- 'School'

classA <- 
  scores %>% 
  group_by(Home_Team) %>% 
  summarise(Home_Points_Allowed = sum(Away_Score, na.rm = TRUE)) %>% 
  right_join(classA, by = c(Home_Team = 'School'))

names(classA)[names(classA) == 'Home_Team'] <- 'School'

classA <- 
  scores %>% 
  group_by(Away_Team) %>% 
  summarise(Away_Points_Allowed = sum(Home_Score, na.rm = TRUE)) %>% 
  right_join(classA, by = c(Away_Team = 'School'))

names(classA)[names(classA) == 'Away_Team'] <- 'School'

classA$Total_Points <- classA$Home_Total_Points + classA$Away_Total_Points
classA$PPG <- round(classA$Total_Points / classA$Games_Played, digits = 0)

classA$Points_Allowed <- classA$Home_Points_Allowed + classA$Away_Points_Allowed
classA$dPPG <- round(classA$Points_Allowed / classA$Games_Played, digits = 0)

classA$Total_Points_Diff <- classA$Total_Points - classA$Points_Allowed
classA$PPG_Diff <- classA$PPG - classA$dPPG

The last thing I did here is create some variables to tell us how often teams played other Class A teams since it’s a distinct advantage to play mostly against teams in other classes.

againstAhome <- sqldf("select scores.Home_Team, count(scores.Away_Class)
               from scores
               where scores.Away_Class==TRUE AND scores.Home_Class=TRUE
               group by scores.Home_Team",
names(againstAhome)[2] <- "Home_A_Schedule"
classA <- sqldf("select classA.*, againstAhome.Home_A_Schedule 
                from classA 
                left join againstAhome on classA.School = againstAhome.Home_Team", 
                stringsAsFactors = FALSE)

againstAaway <- sqldf("select scores.Away_Team, count(scores.Home_Class)
               from scores
                      where scores.Home_Class==TRUE AND scores.Away_Class=TRUE
                      group by scores.Away_Team",
names(againstAaway)[2] <- "Away_A_Schedule"
classA <- sqldf("select classA.*, againstAaway.Away_A_Schedule 
                from classA 
                left join againstAaway on classA.School = againstAaway.Away_Team", 
                stringsAsFactors = FALSE)

classA$A_Schedule <- round((classA$Home_A_Schedule + classA$Away_A_Schedule) / classA$Games_Played, digits = 2)

Now you should have ample data to create a scoring system model to rank high school basketball teams in your state. I won’t go into details here about how I modeled this, but you can see my final results here. And happy to give some direction if you’re really interested. You can also find details online about how other rankings systems, like the NET work by digging a bit.

If you’re going to display your dataframe as a table in some way, like I did with my shinyapp linked to above, then it’s a good idea to clean things up and only include necessary columns, which I do here:

classA_clean <- 

names(classA_clean) <- 
    "Performance Points",
    "Win %", 
    "Home Wins", 
    "Home Losses", 
    "Away Wins", 
    "Away Losses", 
    "Def PPG",
    "Home PPG Diff", 
    "Away PPG Diff", 
    "Class A Schedule",

And lastly, here is my R script (app.R) for publishing this to shiny. Please note this is not the entire script, just the last piece. You’d need to add all the pieces that build the dataframes above it to make it work.

# Define UI for application that displays a table
ui <- fluidPage(
  theme = "sandstone",
                h2("2019/20 Nebraska High School Boys Basketball Computer Rankings", style = "color: DarkGoldenRod"),

server <- function(input, output) {
  output$mytable = DT::renderDT({
      rownames = FALSE,
      options = list(
        paging = FALSE, 
        searching = FALSE))

shinyApp(ui = ui, server = server)

Good luck!

My GitHub repository

Collect NET Basketball Rankings Data in R

Below is a workflow for gathering NET basketball rankings data using R. But first, some background on the data that’s currently available (that I’m aware of).

On seemingly a daily basis, updates its NET rankings webpage with the date of the latest publish right above the table. The URL of this page ( does not change when updates are made, so if you are looking for historical data to do some sort of analysis, you’re out of luck unless you’ve scrape their page daily and store the results in a database.

Luckily though, they do have archives. You can find a link to the archives at the very bottom of the page as shown in the screenshot below (c’mon, Mississippi Val.!).

Bottom of NET rankings page features a link to archived rankings

I was hoping for something easier to work with, but that was wishful thinking. Still, gracious anything is provided because it just takes a bit of data manipulation in R to get these trick PDF files into nice neat data frames for b-ball analysis.

When you click through to the archive you find a list of links. The ones you want are “Nitty Gritty sheets. When you click into those, you end up with a pdf document that contains rankings and other data for that day.

List of archived NET data by day
Nitty Gritty sheets give you a pdf table of what you would otherwise see on the webpage

The good news is that these PDFs are created in a way that that the data can be scraped. In this case, I use the extract_tables function within the tabulizer package in R. Let’s get started, beginning with loading the required package in R so we can use them in this session:

# install.packages(tidyverse)
# install.packages(tabulizer)

I had some trouble adding ‘tabulizer’ to to my library. Kept receiving an error. Eventually found a solution here. I’m still not exactly sure what the problem was but it worked : )

Next, we’ll pick a Nitty Gritty PDF to crawl — this one from January 6, 2020 looks good. Make note of the URL structure, we can get creative later and automate the collection and storage of this data as long as we know of patterns in the URL to work off of.

location01082020 <- ",%202020.pdf"

Then, extract the data using the extract_tables function I’ve listed below. You’ll notice after you view the data that it comes back as nested lists. This is tricky to work with, so we follow up by converting it into a list of data frames that will eventually bind together into one large data frame (all 13 tables from the PDF in one large table).

net01082020 <- extract_tables(location01082020, output = "data.frame")

# You're viewing a nested list of data frames here

netnet01082020all <- lapply(net01082020, mutate_if, is.integer, as.character)

netnet01082020all <- bind_rows(net01082020all)

We will need to combine the nested data frames using bind_rows. However, I was getting an error ‘Error: Column X.2 can’t be converted from integer to character‘. In order to do deal with this data type issue, we need to change all integers to characters (using lapply). Once that’s done, we easily combine rows with bind_rows from dplyr.

And there you have it. A nice single data frame with all of the data from the particular day you targeted in your extraction. Still needs some cleaning up though. There are some extra characters before team names, columns (variables) need to be re-labeled and there are extra columns that need to be deleted.

# remove special characters "[" and "]"
net01082020all$X <- gsub("\\[","",net01082020all$X)
net01082020all$X <- gsub("\\]","",net01082020all$X)

# remove extra blank columns at end of data table
net12232019all <- net12232019all[,-c(12:19)]

# rename variables (column names)
colnames(net12232019all) <- c("Team", "NET Rank", "Avg Opp Rank", "Avg Opp NET", "Record", "Conference Record", "Non-Conference Record", "Road Record", "SOS", "NC SOS", "Quadrant Records")

You may notice that the final column has all four quadrant records smashed together in one variable. If this is important to your analysis, you’ll need to take an extra step to separate these out into four columns. I may update this post at a later date, but until then I’d suggest looking into maybe the separate() function to do the job.

Last but not least, a great application for all of this would be to create a process that loops through all of the archives and stores a historical database somewhere that you can access whenever you need for analysis. I haven’t gotten that far. Attempted it, but got stuck on how to build a list of URLs in an efficient manner. If you have any thoughts on this or run into problems with the above, let me know in the comments.

Omaha Mavs Hoops Analysis

I’ve created this page to feature information about the Omaha Mavs men’s basketball team, including win probability charts. In short, these charts show the probability that each team might win at any given moment in the game.

Producing these charts is incredibly easy if you already know your way around the R language thanks to a tremendous package called ncaahoopsR created by Luke Benz.

Along the way, I’ve also come across where you can search for the color code of teams in the NCAA. This is very useful when comparing team and needing to differentiate them somehow for a better visual effect.

Go Mavs!

Collecting College Football Data through Sportradar API using R

In order to kick off a personal college football rating project with R, I knew I needed team data and game by game data for the 2018 college football season for all 130 teams. I was able to obtain this data through the Sportradar API.

They were gracious enough to provide me with access to the API for 30 days, although access usually requires a fee, especially if you are monetizing your project. I won’t go through all of the steps of obtaining access to their API here. But once you have proper access, this will show you how to call and transform the API data into a workable data frame for analysis.

Here are my API calls using the httr and jsonlite packages:

options(stringsAsFactors = FALSE)

sruser <- "YOURUSERNAME"
srid <- "YOURUSERID"
srsecret <- "YOURUSERSECRET"
srtoken <- "YOURTOKEN"
srappname <- "spacialsand"
srurl <- ""
srpath <- "/ncaafb-t1/2018/REG/schedule.json?api_key=APIKEYHERE"
srteams <- "/ncaafb-t1/teams/FBS/2018/REG/standings.json?api_key=APIKEYHERE"

Collecting Team Data

Once you have your API access information stored (above) you can start making API calls from R with GET, like this:

srteams.raw.result <- GET(url = srurl, path = srteams)
srteams.raw.content <- rawToChar(srteams.raw.result$content)
srteams.content <- fromJSON(srteams.raw.content)

cfb_team1 <- srteams.content$division$conferences$teams[[1]]
cfb_team2 <- srteams.content$division$conferences$teams[[2]]
cfb_team3 <- srteams.content$division$conferences$teams[[3]]
cfb_team4 <- srteams.content$division$conferences$teams[[4]]
cfb_team5 <- srteams.content$division$conferences$teams[[5]]
cfb_team6 <- srteams.content$division$conferences$teams[[6]]
cfb_team7 <- srteams.content$division$conferences$teams[[7]]
cfb_team8 <- srteams.content$division$conferences$teams[[8]]
cfb_team9 <- srteams.content$division$conferences$teams[[9]]
cfb_team10 <- srteams.content$division$conferences$teams[[10]]
cfb_team11 <- srteams.content$division$conferences$teams[[11]]

cfb_team3$subdivision <- NA
cfb_team6$subdivision <- NA

Quick note on what is occurring in the above code chunks…when you first retrieve data from the Sportradar API, it will return raw data that is not easy to work with. So we are basically taking the raw data and keeping only the information we need, then transforming that from JSON format to more workable tables in R.

Important note: In the second-to-last step, I create data frames for each conference because we get to a point where we end up with lists and need a way to pluck out the separated data and eventually combine it into one data frame. I am positive there is a more efficient way to tackle this, perhaps looping through the lists.

This is how I was able to make it work, but suggest you consider alternative ways in order to keep your R code efficient. And it’s great practice!

At this point, we end up with a number of data frames within data frames, which is problematic during analysis. To deal with it, I took a very (embarrassingly) manual approach to this, which again should be done in a more efficient way. If you have better suggestions, please let me know in the comments. But until I revisit it at another time, here is a long way to handle it, pulling out the variables that I care to keep:

cfb_team1$overall.wins <- cfb_team1$overall$wins
cfb_team1$overall.losses <- cfb_team1$overall$losses
cfb_team1$conference.wins <- cfb_team1$in_conference$wins
cfb_team1$conference.losses <- cfb_team1$in_conference$losses
cfb_team1$home.wins <- cfb_team1$home$wins
cfb_team1$home.losses <- cfb_team1$home$losses
cfb_team1$away.wins <- cfb_team1$away$wins
cfb_team1$away.losses <- cfb_team1$away$losses
cfb_team1$decided_by_7.wins <- cfb_team1$decided_by_7_points$wins
cfb_team1$decided_by_7.losses <- cfb_team1$decided_by_7_points$losses
cfb_team1$last_5.wins <- cfb_team1$last_5$wins
cfb_team1$last_5.losses <- cfb_team1$last_5$losses
cfb_team1$points.against <- cfb_team1$points$against
cfb_team1$ <- cfb_team1$points$net

cfb_team2$overall.wins <- cfb_team2$overall$wins
cfb_team2$overall.losses <- cfb_team2$overall$losses
cfb_team2$conference.wins <- cfb_team2$in_conference$wins
cfb_team2$conference.losses <- cfb_team2$in_conference$losses
cfb_team2$home.wins <- cfb_team2$home$wins
cfb_team2$home.losses <- cfb_team2$home$losses
cfb_team2$away.wins <- cfb_team2$away$wins
cfb_team2$away.losses <- cfb_team2$away$losses
cfb_team2$decided_by_7.wins <- cfb_team2$decided_by_7_points$wins
cfb_team2$decided_by_7.losses <- cfb_team2$decided_by_7_points$losses
cfb_team2$last_5.wins <- cfb_team2$last_5$wins
cfb_team2$last_5.losses <- cfb_team2$last_5$losses
cfb_team2$points.against <- cfb_team2$points$against
cfb_team2$ <- cfb_team2$points$net

cfb_team3$overall.wins <- cfb_team3$overall$wins
cfb_team3$overall.losses <- cfb_team3$overall$losses
cfb_team3$conference.wins <- cfb_team3$in_conference$wins
cfb_team3$conference.losses <- cfb_team3$in_conference$losses
cfb_team3$home.wins <- cfb_team3$home$wins
cfb_team3$home.losses <- cfb_team3$home$losses
cfb_team3$away.wins <- cfb_team3$away$wins
cfb_team3$away.losses <- cfb_team3$away$losses
cfb_team3$decided_by_7.wins <- cfb_team3$decided_by_7_points$wins
cfb_team3$decided_by_7.losses <- cfb_team3$decided_by_7_points$losses
cfb_team3$last_5.wins <- cfb_team3$last_5$wins
cfb_team3$last_5.losses <- cfb_team3$last_5$losses
cfb_team3$points.against <- cfb_team3$points$against
cfb_team3$ <- cfb_team3$points$net

cfb_team4$overall.wins <- cfb_team4$overall$wins
cfb_team4$overall.losses <- cfb_team4$overall$losses
cfb_team4$conference.wins <- cfb_team4$in_conference$wins
cfb_team4$conference.losses <- cfb_team4$in_conference$losses
cfb_team4$home.wins <- cfb_team4$home$wins
cfb_team4$home.losses <- cfb_team4$home$losses
cfb_team4$away.wins <- cfb_team4$away$wins
cfb_team4$away.losses <- cfb_team4$away$losses
cfb_team4$decided_by_7.wins <- cfb_team4$decided_by_7_points$wins
cfb_team4$decided_by_7.losses <- cfb_team4$decided_by_7_points$losses
cfb_team4$last_5.wins <- cfb_team4$last_5$wins
cfb_team4$last_5.losses <- cfb_team4$last_5$losses
cfb_team4$points.against <- cfb_team4$points$against
cfb_team4$ <- cfb_team4$points$net

cfb_team5$overall.wins <- cfb_team5$overall$wins
cfb_team5$overall.losses <- cfb_team5$overall$losses
cfb_team5$conference.wins <- cfb_team5$in_conference$wins
cfb_team5$conference.losses <- cfb_team5$in_conference$losses
cfb_team5$home.wins <- cfb_team5$home$wins
cfb_team5$home.losses <- cfb_team5$home$losses
cfb_team5$away.wins <- cfb_team5$away$wins
cfb_team5$away.losses <- cfb_team5$away$losses
cfb_team5$decided_by_7.wins <- cfb_team5$decided_by_7_points$wins
cfb_team5$decided_by_7.losses <- cfb_team5$decided_by_7_points$losses
cfb_team5$last_5.wins <- cfb_team5$last_5$wins
cfb_team5$last_5.losses <- cfb_team5$last_5$losses
cfb_team5$points.against <- cfb_team5$points$against
cfb_team5$ <- cfb_team5$points$net

cfb_team6$overall.wins <- cfb_team6$overall$wins
cfb_team6$overall.losses <- cfb_team6$overall$losses
cfb_team6$conference.wins <- cfb_team6$in_conference$wins
cfb_team6$conference.losses <- cfb_team6$in_conference$losses
cfb_team6$home.wins <- cfb_team6$home$wins
cfb_team6$home.losses <- cfb_team6$home$losses
cfb_team6$away.wins <- cfb_team6$away$wins
cfb_team6$away.losses <- cfb_team6$away$losses
cfb_team6$decided_by_7.wins <- cfb_team6$decided_by_7_points$wins
cfb_team6$decided_by_7.losses <- cfb_team6$decided_by_7_points$losses
cfb_team6$last_5.wins <- cfb_team6$last_5$wins
cfb_team6$last_5.losses <- cfb_team6$last_5$losses
cfb_team6$points.against <- cfb_team6$points$against
cfb_team6$ <- cfb_team6$points$net

cfb_team7$overall.wins <- cfb_team7$overall$wins
cfb_team7$overall.losses <- cfb_team7$overall$losses
cfb_team7$conference.wins <- cfb_team7$in_conference$wins
cfb_team7$conference.losses <- cfb_team7$in_conference$losses
cfb_team7$home.wins <- cfb_team7$home$wins
cfb_team7$home.losses <- cfb_team7$home$losses
cfb_team7$away.wins <- cfb_team7$away$wins
cfb_team7$away.losses <- cfb_team7$away$losses
cfb_team7$decided_by_7.wins <- cfb_team7$decided_by_7_points$wins
cfb_team7$decided_by_7.losses <- cfb_team7$decided_by_7_points$losses
cfb_team7$last_5.wins <- cfb_team7$last_5$wins
cfb_team7$last_5.losses <- cfb_team7$last_5$losses
cfb_team7$points.against <- cfb_team7$points$against
cfb_team7$ <- cfb_team7$points$net

cfb_team8$overall.wins <- cfb_team8$overall$wins
cfb_team8$overall.losses <- cfb_team8$overall$losses
cfb_team8$conference.wins <- cfb_team8$in_conference$wins
cfb_team8$conference.losses <- cfb_team8$in_conference$losses
cfb_team8$home.wins <- cfb_team8$home$wins
cfb_team8$home.losses <- cfb_team8$home$losses
cfb_team8$away.wins <- cfb_team8$away$wins
cfb_team8$away.losses <- cfb_team8$away$losses
cfb_team8$decided_by_7.wins <- cfb_team8$decided_by_7_points$wins
cfb_team8$decided_by_7.losses <- cfb_team8$decided_by_7_points$losses
cfb_team8$last_5.wins <- cfb_team8$last_5$wins
cfb_team8$last_5.losses <- cfb_team8$last_5$losses
cfb_team8$points.against <- cfb_team8$points$against
cfb_team8$ <- cfb_team8$points$net

cfb_team9$overall.wins <- cfb_team9$overall$wins
cfb_team9$overall.losses <- cfb_team9$overall$losses
cfb_team9$conference.wins <- cfb_team9$in_conference$wins
cfb_team9$conference.losses <- cfb_team9$in_conference$losses
cfb_team9$home.wins <- cfb_team9$home$wins
cfb_team9$home.losses <- cfb_team9$home$losses
cfb_team9$away.wins <- cfb_team9$away$wins
cfb_team9$away.losses <- cfb_team9$away$losses
cfb_team9$decided_by_7.wins <- cfb_team9$decided_by_7_points$wins
cfb_team9$decided_by_7.losses <- cfb_team9$decided_by_7_points$losses
cfb_team9$last_5.wins <- cfb_team9$last_5$wins
cfb_team9$last_5.losses <- cfb_team9$last_5$losses
cfb_team9$points.against <- cfb_team9$points$against
cfb_team9$ <- cfb_team9$points$net

cfb_team10$overall.wins <- cfb_team10$overall$wins
cfb_team10$overall.losses <- cfb_team10$overall$losses
cfb_team10$conference.wins <- cfb_team10$in_conference$wins
cfb_team10$conference.losses <- cfb_team10$in_conference$losses
cfb_team10$home.wins <- cfb_team10$home$wins
cfb_team10$home.losses <- cfb_team10$home$losses
cfb_team10$away.wins <- cfb_team10$away$wins
cfb_team10$away.losses <- cfb_team10$away$losses
cfb_team10$decided_by_7.wins <- cfb_team10$decided_by_7_points$wins
cfb_team10$decided_by_7.losses <- cfb_team10$decided_by_7_points$losses
cfb_team10$last_5.wins <- cfb_team10$last_5$wins
cfb_team10$last_5.losses <- cfb_team10$last_5$losses
cfb_team10$points.against <- cfb_team10$points$against
cfb_team10$ <- cfb_team10$points$net

cfb_team11$overall.wins <- cfb_team11$overall$wins
cfb_team11$overall.losses <- cfb_team11$overall$losses
cfb_team11$conference.wins <- cfb_team11$in_conference$wins
cfb_team11$conference.losses <- cfb_team11$in_conference$losses
cfb_team11$home.wins <- cfb_team11$home$wins
cfb_team11$home.losses <- cfb_team11$home$losses
cfb_team11$away.wins <- cfb_team11$away$wins
cfb_team11$away.losses <- cfb_team11$away$losses
cfb_team11$decided_by_7.wins <- cfb_team11$decided_by_7_points$wins
cfb_team11$decided_by_7.losses <- cfb_team11$decided_by_7_points$losses
cfb_team11$last_5.wins <- cfb_team11$last_5$wins
cfb_team11$last_5.losses <- cfb_team11$last_5$losses
cfb_team11$points.against <- cfb_team11$points$against
cfb_team11$ <- cfb_team11$points$net

cfb_teams2018 <- rbind(cfb_team1, cfb_team2, cfb_team3, cfb_team4, cfb_team5, cfb_team6, cfb_team7, cfb_team8, cfb_team9, cfb_team10, cfb_team11)

Now you should have a data frame, named ‘cfb_teams2018’ with team information for the 2018 season. I believe this is updated each week, as games are played, so depending on when you make the call you should have close to the latest information.

Collecting Game Data

srgames.raw.result <- GET(url = srurl, path = srpath)
srgames.raw.content <- rawToChar(srgames.raw.result$content)
srgames.content <- fromJSON(srgames.raw.content)

cfb_week1 <- srgames.content$weeks$games[[1]]
cfb_week2 <- srgames.content$weeks$games[[2]]
cfb_week3 <- srgames.content$weeks$games[[3]]
cfb_week4 <- srgames.content$weeks$games[[4]]
cfb_week5 <- srgames.content$weeks$games[[5]]
cfb_week6 <- srgames.content$weeks$games[[6]]
cfb_week7 <- srgames.content$weeks$games[[7]]
cfb_week8 <- srgames.content$weeks$games[[8]]
cfb_week9 <- srgames.content$weeks$games[[9]]
cfb_week10 <- srgames.content$weeks$games[[10]]
cfb_week11 <- srgames.content$weeks$games[[11]]
cfb_week12 <- srgames.content$weeks$games[[12]]
cfb_week13 <- srgames.content$weeks$games[[13]]

cfb_week1$week <- 1
cfb_week2$week <- 2
cfb_week3$week <- 3
cfb_week4$week <- 4
cfb_week5$week <- 5
cfb_week6$week <- 6
cfb_week7$week <- 7
cfb_week8$week <- 8
cfb_week9$week <- 9
cfb_week10$week <- 10
cfb_week11$week <- 11
cfb_week12$week <- 12
cfb_week13$week <- 13

cfb_games2018 <- rbind(cfb_week1, cfb_week2, cfb_week3, cfb_week4, cfb_week5, cfb_week6, cfb_week7, cfb_week8, cfb_week9, cfb_week10, cfb_week11, cfb_week12, cfb_week13)

There you have it. Game by game data for the 2018 college football season through week 13. Happy analysis.

How Nebraska Voted in 2016

Below is a visualization of how Nebraskans voted in the 2016 presidential election. Here are a few key points:

  • The data is based on how many more votes were counted for Trump versus Clinton, the two front-runners
    • Red and yellow areas are ones where significantly more votes were Republican
    • Green areas are where a few thousand more people in that county voted Republican
    • Blue areas were much closer — ones where Trump prevailed by less than 1,000 votes
    • Dark blue and purple areas were very, very close
    • Two counties appear gray — these are ones in which Clinton received a higher number of votes
  • Clinton won by a relatively small margin in Nebraska’s two largest counties, Douglas and Lancaster, where Omaha and Lincoln reside, respectively
  • Trump won by a rather large margin in the counties immediately surrounding Douglas and Lancaster, especially Sarpy in which he prevailed by over 15,000 votes (that’s roughly three times more than the number of votes Clinton won by in Douglas and Lancaster combined)
  • Smaller and more Western counties typically votes strongly Republican

data: NYT Election Results
county names map:
state election results:

How to Create a Nebraska Map in ggplot2 by County

In this post, I show you how to create the outline for a Nebraska map in ggplot2 that is separated by counties. First, R already has latitudes, longitudes and mapping necessities we need for this, it’s just a matter of accessing them, which is one of the reasons why R is so great and easy to use.

Using maps_data from the maps package, we can turn all of these coordinates into a data frame. Let’s name ours ‘states’, but then also create a smaller data frame that only contains Nebraska data:

states <- map_data("state")
ne_coords <- subset(states, region=="nebraska")

Now, let’s do the same thing, but with counties…

counties <- map_data("county")
ne_county <- subset(counties, region=="nebraska")

Okay! We have all of the data we need to draw the map. Let’s activate ggplot2 and see what this looks like when we visualize it.


Nebraska map without county lines:

ne_map <- ggplot(data=ne_coords, mapping = aes(x=long, y=lat, group=group)) + coord_fixed(1.3)
+ geom_polygon(color="black", fill="gray")

Nebraska map with county lines:

ne_map + theme_clasic() + geom_polygon(data = ne_county, fill=NA, color="white")

And that’s it. Of course, you’ll want to get some useful data and combine it into this map for a useful data visualization, but this should get you started.

Predicting Tesla’s Federal EV Tax Credit in R (and why I’m probably wrong)

First, some context: the federal government provides a tax credit for electric vehicle manufacturers. For details, see here, but in short the full tax credit ($7,500) is available for consumers now through the quarter after the manufacturer sells and delivers 200,000 vehicles.

So, for example, if Tesla delivers its 200,000th vehicle on June 1, 2018 (Q2), then the full credit is available the remainder of that quarter plus one more full quarter (Q3 in this example). After that, the tax credit is cut in half for two additional quarters, and the cut in half again for two final quarters before the credit ends completely for the manufacturer.

Using R, I forecasted Tesla vehicle deliveries in the United States and then plotted it with ggplot2. Here’s how:

Thankfully I found a great site called InsideEVs that has collected or closely projected the quarterly US deliveries with their Plug-In Sales Scorecard. The historical data was not in crawlable tables (images instead), so instead of scraping the data I painstakingly copied everything over to Excel, but it was worth the effort since I knew I would not need to repeat this task.

After adding up models by month I ended up with this (displaying a condensed version here since there were about 100 rows):

Month Year Quarter Deliveries Cumulative
Jan 2012 Q1 0 0
Feb 2012 Q1 0 0
April 2018 Q2 6150 183810
May 2018 Q2 9220 193030

Here is a link to the data if you would like to use it: Estimated Tesla Deliveries by Month Worksheet

I will ultimately need the data in quarters, but decided to forecast using months as the period because it gives me more data points and can more likely detect patters that way, like seasonality. I used the forecast function in R in order to predict future deliveries based on previous months.

Assuming you have read the data into your IDE, I started by viewing my data in R to make sure it looked right and then activated the tseries and graphics packages I knew I would need. I will be turning my data into a time series in R so that it can be understood by the forecast function later on. It’s more or less adding meta data to your data table. If you’re working with monthly data, your frequency should be set to “12.” Before doing that, however, I knew I would not be needing the “Year” variable since this will be understood when I transform it into a time series in R, so I easily removed that entire column by setting it to NULL.

teslats$Year <- NULL
teslats <- ts(teslas$`Cars Delivered`, start = c(2012, 1), end = c(2018, 5), frequency = 12)

Next, I plotted the data — just to get an idea of what it looked like on a graph. Good way to familiarize yourself with your data in a more visual way, since looking at just the numbers is difficult to discern.


There is a handy function in R that performs a seasonal decomposition of your data set. Let’s do that now and see what it shows us. We will call it “teslafit” so it doesn’t overwrite our other variables.

teslafit <- stl(teslats, s.window = "period")

If you remove the seasonality, it is pretty clear that there is still a strong positive trend. I’m not sure what information to take away from the seasonality, other than there appears to be a spike in the last month of each quarter and that is basically the pattern you are seeing in the graph above.

It’s important to get familiar with your data if you are really going to understand and forecast it. Two other useful ways to observe your data is with the monthplot and seasonplot, which both stem from the forecast package. The seasonplot is essentially a year by year view of your data over a calendar year x-axis. They are both worth a gander, even though they are not necessary.


Okay, now it’s time to forecast. We are going to predict the next 19 months (through end of 2019) by adding our original time series data into the forecast function and then plot it out to see what we get.

forecast(teslafit, 19)
plot(forecast(teslafit, 19))

The blue line is the middle of the forecast range, and what I’m interested in. It looks reasonable, and this is just for fun, so I’ll call it good. If this were a more important project, we would want to look at the accuracy of the forecast, try multiple forecasting methods, and compare them. To do this, you should become familiar with some of the common metrics to measure forecast accuracy like MAE and MASE.


Being a Tesla fan (and future Model 3 owner), I am aware of some outside factors that would impact this prediction. For one, it’s well documented that Elon Musk (of Tesla) wants to ramp up production of their Model 3 car in a significant way — and they even have a target of 5,000 per week, which I could have used in some way but did not given their history of missing targets.

A second factor is related to the possibility of Tesla strategically maximizing the federal tax credit for electric vehicles. My prediction (if you add up deliveries cumulatively by quarter) says they will reach the 200k threshold sometime in June. But it’s quite possible Tesla will hold back on U.S. deliveries so they instead reach the mark in July, which is a brand new quarter and thus would give more users an opportunity to take advantage of the federal tax credit program.

With that said, I continued on with my prediction knowing it was plausible. So the next thing I wanted to do is plot the quarterly data using ggplot2 in a way that the different tax credit phases were clearly displayed on top of the forecast line. Here is the end result:

In order to accomplish this, I used arguments in ggplot2 that allow you to shade different regions of the site and also added some text with specification on where the text should sit. Please note, I am using quarterly — not monthly — data now and you can get this in the Google Sheet I linked to above as a short cut. I named the data set “quarterly” and added a few additional columns in order to build the final product. Feel free to leave questions in the comments.

teslaplot <- ggplot(quarters) + geom_rect(aes(xmin='2018Q1', xmax='2018Q4', ymin=-Inf, ymax=Inf), fill="aquamarine", alpha=0.03)
+ geom_rect(aes(xmin='2018Q4', xmax='2019Q2', ymin=-Inf, ymax=Inf), fill="blue", alpha=0.03)
+ geom_rect(aes(xmin='2019Q2', xmax='2019Q4', ymin=-Inf, ymax=Inf), fill="darkmagenta", alpha=0.03)
+ geom_line(aes(Quarter, quarters$`Cumulative Sales`, group=1))
+ labs(title='Tesla Federal Tax Credit Projection', x='Quarter', y='Vehicles Delivered') + scale_y_continuous(labels = comma)
+ geom_text(aes(x='2018Q3', y=200000, label='$7,500'))
+ geom_text(aes(x='2019Q1', y=200000, label='$3,750')) + geom_text(aes(x='2019Q3', y=200000, label='$1,875'))

Create a Table with IGDB API and R

The Internet Game Database has an API that allows you to access their rich set of video game information. Even better, there is an R package that makes requests for data Super easy.

Let’s get started by installing the IGDB package in R (you may need to update your version of R):


Now that you have the package installed, we are almost ready to start requesting data. Before we get there, you will nee to have an API key, which requires you to setup an IGDB account. Fair compromise. Click “GET FREE KEY” on the documentation homepage to signup:

And while you’re at it, be sure to read through the documentation so understand what type of information is available in the API and how to make basic calls. The documentation pages also provide handy examples.

Now that you’re setup, let’s store the API key so we can get access to the data. You can use the following code in R:

Sys.setenv("IGDB_KEY" = "[yourkeygoeshere]")

Now we are ready to play ball. The call below returns video games containing the term “mario” and orders them by release date, oldest to most recent. I’ve decided to bring back the following variables: ID, name of the game, release date, IGDB rating, and some popularity metric they have conceived. We can save it as “mario_games.”

mario_games <- igdb_games(
search = "mario",
order = "first_release_date:asc",
fields = c("id", "name", "first_release_date", "rating", "popularity")

One of the nice things about this is that the data comes back in a nicely formatted data frame which you view and confirm with these commands:


If for some reason yours does not, you should be able to transform it into a data frame, for easier analysis, using the base R function.

mario_games <-

Now you have a table for analysis or visualization at your disposal. Enjoy.

How to Collect Twitter Data Using R and the Twitter Search API

Luckily, the hard work is done for us. There is a terrific R package called twitterR that allows you to easily connect to the Twitter Search API. You just need to know a few arguments to properly ask for the data you need.

First, let’s explore what type of data and limitations exist in the Twitter Search API so we know what we have to work with.

Official documentation:

“Returns a collection of relevant Tweets matching a specified query.

Please note that Twitter’s search service and, by extension, the Search API is not meant to be an exhaustive source of Tweets. Not all Tweets will be indexed or made available via the search interface.

To learn how to use Twitter Search effectively, please see the Standard search operators page for a list of available filter operators. Also, see the Working with Timelines page to learn best practices for navigating results by since_id and max_id.”

The first step, of course, is to activate the packages you need for this project. If you don’t have these packages installed already, you’ll need to do that too. I have all of these installed, so I’ve commented out that part here.

# install.packages("twitterR")

We’re getting close, but before we can request data from the Twitter API, we have to provide some credentials to make sure we aren’t doing anything nefarious. To accomplish this, you need four things:

  • consumer_key
  • consumer_secret
  • access_token
  • access_secret

No worries, all of these can be easily found here (you’ll need an active Twitter account): Once you’re logged, you need to create an “application” which is essentially just saying you want to work on a project. Go ahead and fill in the details and you should receive the four criteria above.

Now we’ll save each of these strings in this manner (note that you’ll need to replace your string where i have ‘abc123’):

onsumer_key <- 'acb123'
consumer_secret <- 'acb123'
access_token <- 'acb123'
access_secret <- 'acb123'

Now, let’s get authorized and begin requesting Twitter data.

setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)

Below is a simple request for Tweets that you can modify to your liking. In this example, I’m going to save my request as “nebtweets” and I’ll call for the information with “searchTwitter” which is part of the twitterR package we installed and activated. I’ve arbitrarily set the number of results I want back to 200, starting in February 2018. It’s important to note here that the Twitter Search API does NOT give you full access to Twitters’ data. It’s only an index of recent Tweets. So you may get back warnings if you try asking for something that is not available.

nebtweets <- searchTwitter("nebrasketball", n=200, lang="en", since = '2018-02-01')

Now we have the Tweets saved, but they're not in a nice, neat data frame. This can easily be solved using "twListToDF" which is also part of the TwitterR package.

nebtweetsDF = twListToDF(nebtweets)

Now you're ready to analyze. Enjoy.

How to Create a US Heatmap in R

Creating a simple US map in R can be done in a number of ways. Two popular packages for this type of project are ggplot2 and plotly. In this case, I used plotly.

The data for my map is a list of US state codes (NE, IL, MA, CA, etc.). A second variable gives a count of how many players the Nebraska football team is targeting in each state. In order to follow my example with your own data, you will need to have the state code variable and some numeric variable to map it against.

Once you have your data in a table and are ready to use it, create the following styling options for the map, which we will apply later:

mapDetails <- list(
scope = 'usa',
projection = list(type = 'albers usa'),
showlakes = TRUE,
lakecolor = toRGB('white')

As you may have guessed, “scope” determines the type of map, in this case a map of the USA. We will also determine here what to do with lakes and how to color them.

usaMap <- plot_geo(X2018targets, locationmode = 'USA-states') %>%
z = X2018targets$Targets, locations = X2018targets

Creating a simple US map in R can be done in a number of ways. Two popular packages for this type of project are ggplot2 and plotly. In this case, I used plotly.

The data for my map is a list of US state codes (NE, IL, MA, CA, etc.). A second variable gives a count of how many players the Nebraska football team is targeting in each state. In order to follow my example with your own data, you will need to have the state code variable and some numeric variable to map it against.

Once you have your data in a table and are ready to use it, create the following styling options for the map, which we will apply later:

mapDetails <- list(
scope = 'usa',
projection = list(type = 'albers usa'),
showlakes = TRUE,
lakecolor = toRGB('white')

As you may have guessed, “scope” determines the type of map, in this case a map of the USA. We will also determine here what to do with lakes and how to color them. State Code`, color = X2018targets$Targets, colors = ‘Blues’ ) %>% colorbar(title = “Targets”) %>% layout( title = ‘2018 Nebraska Football Targets by State (February 2018)’, geo = mapDetails )

The code above connects my data to the map and allows me to modify text within the plot area. My data frame is called “X2018targets,” so you’ll need to replace this with your data frame name. You’ll also need to set “z” to your numeric data and “locations” to your state code variable.

When you’re finished, simply type “usaMap” and hit enter to see your plot appear (I use R Studio, by the way, assuming you likely do as well). If you have any trouble or questions, let me know in the comments.