Create High School Basketball Computer Rankings with R

This post outlines how to collect some very basic high school basketball scores data from MaxPreps. By collecting this data and building upon it, you should have everything you need to create a computer rankings model for your state, much like I did for Nebraska Class A boys basketball.

The state of Nebraska puts teams into different classes (Class A, B, C, etc.) depending mostly (from what I can tell) on enrollment and location. For this project, I was interested in Class A boys basketball, so the first thing I did was get an official list of Class A teams from the Nebraska School Activities Association (NSAA) website.

Knowing I would need this list later, I manually created a dataframe called classA:

# Manual Class A names
School <-
  c(
    'Omaha South',
    'Omaha Central',
    'Grand Island',
    'Millard North',
    'Millard South',
    'Millard West',
    'Lincoln East',
    'Lincoln High',
    'North Star',
    'Creighton Prep',
    'Omaha North',
    'Lincoln Southeast',
    'Burke',
    'Lincoln Southwest',
    'Bryan',
    'Omaha Westside',
    'Papillion-LaVista South',
    'Papillion-LaVista',
    'Lincoln Northeast',
    'Bellevue West',
    'Omaha Northwest',
    'Kearney',
    'Fremont',
    'Bellevue East',
    'Benson',
    'Gretna',
    'Elkhorn',
    'Elkhorn South',
    'Norfolk',
    'Columbus',
    'North Platte',
    'Pius X',
    'South Sioux City'
  )
# Manual Class A enrollment
Enrollment <-
  c(
    2166,
    2051,
    1982,
    1920,
    1881,
    1783,
    1695,
    1692,
    1571,
    1548,
    1522,
    1515,
    1514,
    1501,
    1480,
    1452,
    1442,
    1368,
    1280,
    1240,
    1236,
    1188,
    1113,
    1099,
    1062,
    1050,
    1026,
    1008,
    1005,
    971,
    905,
    897,
    860
  )

# Combine the two vectors into a data frame
classA <- data.frame(School, Enrollment)

We’ll need the following libraries for this project:

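# library() attaches one package per call, so loop over the package names
pkgs <- c("shiny", "DT", "shinythemes", "rvest", "expss", "dplyr", "tidyr", "stringr", "sqldf", "xlsx", "scales")
invisible(lapply(pkgs, library, character.only = TRUE))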

For all of this to work, I needed basic (but, it turns out, hard-to-find) high school basketball scores in a consistent format. I tried scraping Omaha World Herald scores and NSAA scores, but in the end I found MaxPreps easiest to work with.

In order to automate this, it’s important to understand the components of the MaxPreps scoreboard URLs. These URLs list scores by a given day, which you can select on the website by choosing from a dropdown calendar. Here is a closer look at the structure:

# start of scoreboard URLs on MaxPreps
https://www.maxpreps.com/list/schedules_scores.aspx?date=

# date
12/5/2019

# end of URL that specifies boys, state and class (divisionid)
&gendersport=boys,basketball&state=ne&statedivisionid=85757869-a232-41b9-a6b3-727edb24825e

Putting the pieces above together gives us a page of scores for Nebraska Class A boys basketball on December 5, 2019. Now that we understand this structure, we need a list of dates that we can use to build a list of URLs containing scores for the entire season. I couldn’t figure out a great way to collect the game dates, so this part was manual as well. If you find a more automated way to get the list of days on which games were played, let me know!

# create games table -- game_dates NEEDS TO BE UPDATED WITH NEW DAYS
maxprep_baseURL <- 
  "https://www.maxpreps.com/list/schedules_scores.aspx?date="

maxprep_paramURL <- 
  "&gendersport=boys,basketball&state=ne&statedivisionid=85757869-a232-41b9-a6b3-727edb24825e"

# this part was manual, but building the final list of URLs below is not
game_dates <- c(
  "12/5/2019",
  "12/6/2019",
  "12/7/2019",
  "12/9/2019",
  "12/10/2019",
  "12/12/2019",
  "12/13/2019",
  "12/14/2019",
  "12/16/2019",
  "12/17/2019",
  "12/19/2019",
  "12/20/2019",
  "12/21/2019",
  "12/27/2019",
  "12/28/2019",
  "12/30/2019",
  "12/31/2019",
  "1/2/2020",
  "1/3/2020",
  "1/4/2020",
  "1/7/2020",
  "1/9/2020",
  "1/10/2020",
  "1/11/2020",
  "1/14/2020",
  "1/16/2020",
  "1/17/2020",
  "1/18/2020",
  "1/21/2020",
  "1/23/2020",
  "1/24/2020",
  "1/25/2020",
  "1/28/2020",
  "1/30/2020",
  "1/31/2020",
  "2/1/2020",
  "2/3/2020",
  "2/4/2020",
  "2/7/2020",
  "2/8/2020",
  "2/11/2020",
  "2/13/2020",
  "2/14/2020",
  "2/15/2020",
  "2/18/2020",
  "2/20/2020",
  "2/21/2020",
  "2/22/2020",
  "2/28/2020"
)

maxprep_page_list <- 
  as.list(paste0(maxprep_baseURL, game_dates, maxprep_paramURL))
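
One possible way to avoid typing the dates by hand is to generate every calendar day between the first and last game and build a URL for each. This is only a sketch: it assumes that a scoreboard page for a day without games simply yields no matching score elements, which I have not verified, and it requests more pages than the manual list does. The names season_days, all_dates and all_pages are made up for this example.

```r
# Sketch only: every calendar day between the first and last game date above
season_days <- seq(as.Date("2019-12-05"), as.Date("2020-02-28"), by = "day")

# Format as m/d/yyyy without leading zeros, matching the manual game_dates
all_dates <- paste(
  as.integer(format(season_days, "%m")),
  as.integer(format(season_days, "%d")),
  format(season_days, "%Y"),
  sep = "/"
)

# Same URL pieces as in the post
base_url  <- "https://www.maxpreps.com/list/schedules_scores.aspx?date="
param_url <- "&gendersport=boys,basketball&state=ne&statedivisionid=85757869-a232-41b9-a6b3-727edb24825e"

# One URL per calendar day (assumption: empty days are harmless)
all_pages <- as.list(paste0(base_url, all_dates, param_url))
```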

Now that we have a list of URLs that contain all of the Class A scores we want, we can scrape the HTML data we want (the scores). Please note you need to get familiar with the HTML of the pages you’re pulling data from for this to work. In my case, I found the scores all sitting neatly in elements whose data-contest-state attribute is set to boxscore.

maxprep_html <- lapply(maxprep_page_list, FUN=function(URLLink){
  read_html(URLLink) %>% html_nodes("[data-contest-state='boxscore']") %>% html_text()
})
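
To see what the attribute selector is doing without hitting the website, here is a small sketch that runs the same rvest calls on a made-up HTML snippet. The markup below is invented for illustration and is not copied from MaxPreps; only the selector matches the code above.

```r
library(rvest)

# Made-up HTML: two list items, only one of which is a finished game
demo_page <- read_html('
  <ul>
    <li data-contest-state="boxscore">Final 61 Omaha South 58 Millard North</li>
    <li data-contest-state="pregame">7:00pm Kearney Fremont</li>
  </ul>')

# [attr='value'] keeps only elements whose attribute equals that value
demo_scores <- demo_page %>%
  html_nodes("[data-contest-state='boxscore']") %>%
  html_text()

print(demo_scores)
```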

The above returns a list with one character vector per page, which is difficult to work with. We can flatten it with the unlist function. We’ll also add a few more modifications to get this all into a dataframe called scores.

# Unlist and create dataframe of scores by game
scores <-
  unlist(maxprep_html)

scores <- 
  gsub("Final","", scores)

scores <- 
  grep("#", scores, invert = TRUE, value = TRUE)

scores <- 
  data.frame(scores)

If you check the dataframe using View(scores), you’ll notice a problem: all of the scores and teams are listed in one column. We need to separate these into different columns and rename them.

colnames(scores) <- 
  c("V1")

scores <- 
  scores %>%
  mutate(V1 = gsub("(\\d+)", ";\\1;", V1)) %>%
  separate(V1, c(NA, "No1", "Let1", "No2", "Let2"), sep = " *; *")

colnames(scores) <- 
  c("Away_Score", "Away_Team", "Home_Score", "Home_Team")
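
Here is a small worked example of the split on two made-up rows, in case the regular expression is hard to follow. It assumes each string has the shape "score team score team", which is what the column order above implies. Wrapping every run of digits in semicolons gives separate() a delimiter to cut on, and the NA in the column list drops the empty piece before the first score.

```r
library(dplyr)
library(tidyr)

# Made-up rows in the assumed "score team score team" shape
demo <- data.frame(
  V1 = c("61 Omaha South 58 Millard North", "70 Kearney 64 Fremont"),
  stringsAsFactors = FALSE
)

demo_split <- demo %>%
  # "61 Omaha South 58 ..." becomes ";61; Omaha South ;58; ..."
  mutate(V1 = gsub("(\\d+)", ";\\1;", V1)) %>%
  # split on semicolons (swallowing surrounding spaces); NA drops the empty first piece
  separate(V1, c(NA, "Away_Score", "Away_Team", "Home_Score", "Home_Team"), sep = " *; *")

print(demo_split)
```

One caveat with this approach: a team name that contains a digit would be split in the wrong place.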

Okay, we’re making some really good progress now. We’ve basically scraped the data we need and manipulated it into a workable dataframe with four useful variables/columns: “Away_Score,” “Away_Team,” “Home_Score,” and “Home_Team.”

With just this data we can create a number of new variables that will be useful including winner, loser, home win, home loss, away win, and away loss. It took me a while to figure out a good way to do this but eventually found that using if_else from dplyr is very efficient.

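# The scores are still text at this point, so compare them as numbers
scores$Winner <- 
  if_else(as.numeric(scores$Away_Score) > as.numeric(scores$Home_Score), scores$Away_Team, scores$Home_Team)

scores$Loser <- 
  if_else(as.numeric(scores$Away_Score) < as.numeric(scores$Home_Score), scores$Away_Team, scores$Home_Team)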

scores$Home_W <- 
  if_else(scores$Winner==scores$Home_Team, scores$Home_Team, "NA")

scores$Home_L <- 
  if_else(scores$Loser==scores$Home_Team, scores$Home_Team, "NA")

scores$Away_W <- 
  if_else(scores$Winner==scores$Away_Team, scores$Away_Team, "NA")

scores$Away_L <- 
  if_else(scores$Loser==scores$Away_Team, scores$Away_Team, "NA")
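
A quick aside on why the type of the score columns matters when picking a winner: separate() returns text, and text is compared character by character rather than by value. The two lines below are a minimal illustration with made-up scores.

```r
# As text, "100" sorts before "99" because "1" comes before "9"
text_compare <- "100" > "99"

# As numbers, the comparison behaves the way a scoreboard would
number_compare <- as.numeric("100") > as.numeric("99")

print(c(text = text_compare, number = number_compare))
```

Two-digit scores happen to compare correctly either way, so this only bites when one team reaches 100 or is held to a single digit.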

And then I cleaned it up a bit (please leave comments if you have any questions about what is going on here):

scores <- 
  scores %>% mutate_all(~gsub('\r|\n', '', .))

# Would like to make this more efficient
scores$Winner <- str_trim(scores$Winner, side = "both")
scores$Loser <- str_trim(scores$Loser, side = "both")
scores$Away_Team <- str_trim(scores$Away_Team, side = "both")
scores$Home_Team <- str_trim(scores$Home_Team, side = "both")

scores$Home_Score <- as.numeric(scores$Home_Score)
scores$Away_Score <- as.numeric(scores$Away_Score)
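
If you want something shorter than one str_trim line per column, a possible alternative is dplyr’s across(), which applies the same function to several columns at once. This is a sketch on made-up data, not what I ran for the rankings; scores_demo is a stand-in for scores, and it assumes dplyr 1.0 or later.

```r
library(dplyr)

# Made-up rows with the stray whitespace and line breaks scraping tends to leave
scores_demo <- data.frame(
  Away_Score = c("61", "70"),
  Away_Team  = c(" Omaha South \r\n", "\r\n Kearney "),
  Home_Score = c("58", "64"),
  Home_Team  = c(" Millard North ", " Fremont\r\n"),
  stringsAsFactors = FALSE
)

scores_demo <- scores_demo %>%
  # strip line breaks, then trim both ends, in every column
  mutate(across(everything(), ~ trimws(gsub("[\r\n]", "", .x)))) %>%
  # convert only the score columns to numbers
  mutate(across(ends_with("_Score"), as.numeric))

str(scores_demo)
```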

From this point, I decided to make a few new categories: how much a team won by and whether the home or away team is a Class A team. The class is important here because a few teams in Class A mostly played Class B teams, and I wanted to have that information available when scoring teams later on.
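
A minimal sketch of how these categories could be built is below. The column names Home_Class and Away_Class are taken from the queries later in this post; the name Margin, the demo data frames and the rows in them are assumptions made for illustration.

```r
# Stand-ins for classA and scores, with made-up rows
classA_demo <- data.frame(School = c("Kearney", "Fremont"), stringsAsFactors = FALSE)

scores_demo <- data.frame(
  Away_Team  = c("Kearney", "Hastings"),
  Away_Score = c(60, 45),
  Home_Team  = c("Fremont", "Kearney"),
  Home_Score = c(52, 70),
  stringsAsFactors = FALSE
)

# TRUE when the team appears in the Class A list
scores_demo$Home_Class <- scores_demo$Home_Team %in% classA_demo$School
scores_demo$Away_Class <- scores_demo$Away_Team %in% classA_demo$School

# Winning margin, regardless of which side won (hypothetical column name)
scores_demo$Margin <- abs(scores_demo$Home_Score - scores_demo$Away_Score)

print(scores_demo)
```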


Now I have a good number of variables stored in my scores dataframe. My next step is to use the data in that dataframe to build out the classA dataframe that we created in the beginning.

To execute this we will need to group or summarize a lot of data in scores for a particular team. For example, to get the number of Wins for a given team, we have to count the number of times that team appears in the scores$Winner column (which is why we created that variable). I was able to execute this using a number of functions including merge and sqldf.

classA <- 
  merge(classA, stack(table(factor(scores$Winner, levels = classA$School))), 
        by.x = 'School', by.y = "ind")
names(classA)[names(classA) == 'values'] <- 'Wins'

classA <- 
  merge(classA, stack(table(factor(scores$Loser, levels = classA$School))), 
        by.x = 'School', by.y = "ind")
names(classA)[names(classA) == 'values'] <- 'Losses'

classA$Win_Pct <- 
  round(classA$Wins / (classA$Wins + classA$Losses), digits = 2)

classA$Games_Played <- 
  classA$Wins + classA$Losses

# Use sqldf to create Home Wins/Losses, Away Wins/Losses from scores df
varHW <- sqldf("select scores.Home_Team, count(scores.Home_W)
               from scores
               where scores.Home_Class==TRUE AND scores.Home_Team==scores.Winner
               group by scores.Home_Team",
               stringsAsFactors=FALSE)
names(varHW)[2] <- "Home_Wins"
classA <- sqldf("select classA.*, varHW.Home_Wins 
                from classA 
                left join varHW on classA.School = varHW.Home_Team", 
                stringsAsFactors = FALSE)

varHL <- sqldf("select scores.Home_Team, count(scores.Home_L)
               from scores
               where scores.Home_Class==TRUE AND scores.Home_Team==scores.Loser
               group by scores.Home_Team",
               stringsAsFactors=FALSE)
names(varHL)[2] <- "Home_Losses"
classA <- sqldf("select classA.*, varHL.Home_Losses 
                from classA 
                left join varHL on classA.School = varHL.Home_Team", 
                stringsAsFactors = FALSE)

varAW <- sqldf("select scores.Away_Team, count(scores.Away_W)
               from scores
               where scores.Away_Class==TRUE AND scores.Away_Team==scores.Winner
               group by scores.Away_Team",
               stringsAsFactors=FALSE)
names(varAW)[2] <- "Away_Wins"
classA <- sqldf("select classA.*, varAW.Away_Wins 
                from classA 
                left join varAW on classA.School = varAW.Away_Team", 
                stringsAsFactors = FALSE)

varAL <- sqldf("select scores.Away_Team, count(scores.Away_L)
               from scores
               where scores.Away_Class==TRUE AND scores.Away_Team==scores.Loser
               group by scores.Away_Team",
               stringsAsFactors=FALSE)
names(varAL)[2] <- "Away_Losses"
classA <- sqldf("select classA.*, varAL.Away_Losses 
                from classA 
                left join varAL on classA.School = varAL.Away_Team", 
                stringsAsFactors = FALSE)

classA[is.na(classA)] <- 0

scores$Away_Score <- as.numeric(scores$Away_Score)
scores$Home_Score <- as.numeric(scores$Home_Score)
classA$School <- as.character(classA$School)

# Using dplyr to create home and away points per game (ppg)
classA <- 
  scores %>% 
  group_by(Away_Team) %>% 
  summarise(Away_PPG = mean(Away_Score, na.rm = TRUE)) %>% 
  right_join(classA, by = c(Away_Team = 'School'))

names(classA)[names(classA) == 'Away_Team'] <- 'School'

classA <- 
  scores %>% 
  group_by(Home_Team) %>% 
  summarise(Home_PPG = mean(Home_Score, na.rm = TRUE)) %>% 
  right_join(classA, by = c(Home_Team = 'School'))

names(classA)[names(classA) == 'Home_Team'] <- 'School'

classA <- 
  scores %>% 
  group_by(Home_Team) %>% 
  summarise(Home_dPPG = mean(Away_Score, na.rm = TRUE)) %>% 
  right_join(classA, by = c(Home_Team = 'School'))

names(classA)[names(classA) == 'Home_Team'] <- 'School'

classA <- 
  scores %>% 
  group_by(Away_Team) %>% 
  summarise(Away_dPPG = mean(Home_Score, na.rm = TRUE)) %>% 
  right_join(classA, by = c(Away_Team = 'School'))

names(classA)[names(classA) == 'Away_Team'] <- 'School'

classA$Away_PPG <- round(classA$Away_PPG, digits = 0)
classA$Home_PPG <- round(classA$Home_PPG, digits = 0)
classA$Home_dPPG <- round(classA$Home_dPPG, digits = 0)
classA$Away_dPPG <- round(classA$Away_dPPG, digits = 0)

classA$Home_PPG_Diff <- round(classA$Home_PPG - classA$Home_dPPG, digits = 0)
classA$Away_PPG_Diff <- round(classA$Away_PPG - classA$Away_dPPG, digits = 0)
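
For comparison, here is a sketch of a more compact route to the same kind of per-team summary using only dplyr. The idea is to reshape the games so each game contributes one row per team, then summarise once. This is an alternative I am suggesting, not the code behind my rankings; the data and the names scores_demo, long and summary_demo are made up.

```r
library(dplyr)

# Three made-up games
scores_demo <- data.frame(
  Away_Team  = c("Kearney", "Fremont", "Kearney"),
  Home_Team  = c("Fremont", "Kearney", "Gretna"),
  Away_Score = c(60, 55, 48),
  Home_Score = c(52, 70, 50),
  stringsAsFactors = FALSE
)

# One row per team per game: points scored (Pts) and points allowed (Opp)
long <- bind_rows(
  transmute(scores_demo, School = Home_Team, Site = "Home", Pts = Home_Score, Opp = Away_Score),
  transmute(scores_demo, School = Away_Team, Site = "Away", Pts = Away_Score, Opp = Home_Score)
)

# All the counts and averages in a single summarise
summary_demo <- long %>%
  group_by(School) %>%
  summarise(
    Wins      = sum(Pts > Opp),
    Losses    = sum(Pts < Opp),
    Home_Wins = sum(Site == "Home" & Pts > Opp),
    Away_Wins = sum(Site == "Away" & Pts > Opp),
    PPG       = round(mean(Pts)),
    dPPG      = round(mean(Opp))
  )

print(summary_demo)
```

A left join from classA to a table like this would then attach every column in one step.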

We now have a lot of juicy data in classA for analysis, but let’s go further and create a few more variables:

classA <- 
  scores %>% 
  group_by(Away_Team) %>% 
  summarise(Away_Total_Points = sum(Away_Score, na.rm = TRUE)) %>% 
  right_join(classA, by = c(Away_Team = 'School'))

names(classA)[names(classA) == 'Away_Team'] <- 'School'

classA <- 
  scores %>% 
  group_by(Home_Team) %>% 
  summarise(Home_Total_Points = sum(Home_Score, na.rm = TRUE)) %>% 
  right_join(classA, by = c(Home_Team = 'School'))

names(classA)[names(classA) == 'Home_Team'] <- 'School'

classA <- 
  scores %>% 
  group_by(Home_Team) %>% 
  summarise(Home_Points_Allowed = sum(Away_Score, na.rm = TRUE)) %>% 
  right_join(classA, by = c(Home_Team = 'School'))

names(classA)[names(classA) == 'Home_Team'] <- 'School'

classA <- 
  scores %>% 
  group_by(Away_Team) %>% 
  summarise(Away_Points_Allowed = sum(Home_Score, na.rm = TRUE)) %>% 
  right_join(classA, by = c(Away_Team = 'School'))

names(classA)[names(classA) == 'Away_Team'] <- 'School'

classA$Total_Points <- classA$Home_Total_Points + classA$Away_Total_Points
classA$PPG <- round(classA$Total_Points / classA$Games_Played, digits = 0)

classA$Points_Allowed <- classA$Home_Points_Allowed + classA$Away_Points_Allowed
classA$dPPG <- round(classA$Points_Allowed / classA$Games_Played, digits = 0)

classA$Total_Points_Diff <- classA$Total_Points - classA$Points_Allowed
classA$PPG_Diff <- classA$PPG - classA$dPPG

The last thing I did here is create some variables to tell us how often teams played other Class A teams since it’s a distinct advantage to play mostly against teams in other classes.

againstAhome <- sqldf("select scores.Home_Team, count(scores.Away_Class)
               from scores
               where scores.Away_Class==TRUE AND scores.Home_Class=TRUE
               group by scores.Home_Team",
               stringsAsFactors=FALSE)
names(againstAhome)[2] <- "Home_A_Schedule"
classA <- sqldf("select classA.*, againstAhome.Home_A_Schedule 
                from classA 
                left join againstAhome on classA.School = againstAhome.Home_Team", 
                stringsAsFactors = FALSE)

againstAaway <- sqldf("select scores.Away_Team, count(scores.Home_Class)
               from scores
                      where scores.Home_Class==TRUE AND scores.Away_Class=TRUE
                      group by scores.Away_Team",
                      stringsAsFactors=FALSE)
names(againstAaway)[2] <- "Away_A_Schedule"
classA <- sqldf("select classA.*, againstAaway.Away_A_Schedule 
                from classA 
                left join againstAaway on classA.School = againstAaway.Away_Team", 
                stringsAsFactors = FALSE)

classA$A_Schedule <- round((classA$Home_A_Schedule + classA$Away_A_Schedule) / classA$Games_Played, digits = 2)
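
The same share can be sketched in base R, which may make the arithmetic easier to see: list every game from each team’s point of view, flag whether the opponent is in Class A, and take the mean of that flag per team. The data and names below are made up for illustration.

```r
# Made-up Class A list and games
classA_schools <- c("Kearney", "Fremont")
games <- data.frame(
  Home_Team = c("Fremont", "Kearney", "Hastings"),
  Away_Team = c("Kearney", "Fremont", "Kearney"),
  stringsAsFactors = FALSE
)

# Each game appears twice: once for the home team, once for the away team
team_games <- rbind(
  data.frame(School = games$Home_Team, Opponent = games$Away_Team, stringsAsFactors = FALSE),
  data.frame(School = games$Away_Team, Opponent = games$Home_Team, stringsAsFactors = FALSE)
)

# Keep only Class A teams, then average the "opponent is Class A" flag
team_games <- team_games[team_games$School %in% classA_schools, ]
a_schedule <- round(
  tapply(team_games$Opponent %in% classA_schools, team_games$School, mean),
  2
)

print(a_schedule)
```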

Now you should have ample data to create a scoring model to rank high school basketball teams in your state. I won’t go into details here about how I modeled this, but you can see my final results here. I’m happy to give some direction if you’re really interested. With a bit of digging, you can also find details online about how other ranking systems, like the NET, work.

If you’re going to display your dataframe as a table in some way, like I did with my shinyapp linked to above, then it’s a good idea to clean things up and only include necessary columns, which I do here:

classA_clean <- 
  data.frame(classA$School,
             classA$Performance_Points,
             classA$Wins, 
             classA$Losses, 
             classA$Win_Pct, 
             classA$Home_Wins, 
             classA$Home_Losses, 
             classA$Away_Wins, 
             classA$Away_Losses, 
             classA$PPG,
             classA$dPPG,
             classA$Home_PPG_Diff, 
             classA$Away_PPG_Diff,
             classA$A_Schedule,
             classA$SOS)

names(classA_clean) <- 
  c("School",
    "Performance Points",
    "Wins", 
    "Losses", 
    "Win %", 
    "Home Wins", 
    "Home Losses", 
    "Away Wins", 
    "Away Losses", 
    "PPG",
    "Def PPG",
    "Home PPG Diff", 
    "Away PPG Diff", 
    "Class A Schedule",
    "SOS")

And lastly, here is my R script (app.R) for publishing this to shiny. Please note this is not the entire script, just the last piece. You’d need to add all the pieces that build the dataframes above it to make it work.

# Define UI for application that displays a table
ui <- fluidPage(
  # shinytheme() comes from the shinythemes package loaded earlier
  theme = shinytheme("sandstone"),

  br(),
  h2("2019/20 Nebraska High School Boys Basketball Computer Rankings", style = "color: DarkGoldenRod"),
  br(),
  DT::DTOutput("mytable")
)

server <- function(input, output) {
  output$mytable = DT::renderDT({
    DT::datatable(
      classA_clean,
      rownames = FALSE,
      options = list(
        paging = FALSE, 
        searching = FALSE))
  })
}

shinyApp(ui = ui, server = server)

Good luck!

My GitHub repository

Collect NET Basketball Rankings Data in R

Below is a workflow for gathering NET basketball rankings data using R. But first, some background on the data that’s currently available (that I’m aware of).

NCAA.com updates its NET rankings webpage seemingly daily, with the date of the latest update shown right above the table. The URL of this page (https://www.ncaa.com/rankings/basketball-men/d1/ncaa-mens-basketball-net-rankings) does not change when updates are made, so if you are looking for historical data to analyze, you’re out of luck unless you’ve scraped the page daily and stored the results in a database.

Luckily though, they do have archives. You can find a link to the archives at the very bottom of the page as shown in the screenshot below (c’mon, Mississippi Val.!).

Bottom of NET rankings page features a link to archived rankings

I was hoping for something easier to work with, but that was wishful thinking. Still, I’m grateful anything is provided, because it takes just a bit of data manipulation in R to get these tricky PDF files into nice, neat data frames for b-ball analysis.

When you click through to the archive, you find a list of links. The ones you want are the “Nitty Gritty” sheets. When you click into those, you end up with a PDF document that contains rankings and other data for that day.

List of archived NET data by day
Nitty Gritty sheets give you a pdf table of what you would otherwise see on the webpage

The good news is that these PDFs are created in a way that allows the data to be scraped. In this case, I use the extract_tables function from the tabulizer package in R. Let’s get started by loading the required packages so we can use them in this session:

# install.packages("tidyverse")
# install.packages("tabulizer")
library(tidyverse)
library(tabulizer)

I had some trouble adding ‘tabulizer’ to my library. I kept receiving an error and eventually found a solution here. I’m still not exactly sure what the problem was, but it worked : )

Next, we’ll pick a Nitty Gritty PDF to crawl; this one from January 6, 2020 looks good. Make note of the URL structure: we can get creative later and automate the collection and storage of this data as long as we know the patterns in the URL to work from.

location01082020 <- "https://extra.ncaa.org/solutions/rpi/Stats%20Library/NET%20Nitty%20Gritty%20-%20Jan.%206,%202020.pdf"

Then, extract the data using the extract_tables function I’ve listed below. You’ll notice after you view the data that it comes back as nested lists. This is tricky to work with, so we follow up by converting it into a list of data frames that will eventually bind together into one large data frame (all 13 tables from the PDF in one large table).

net01082020 <- extract_tables(location01082020, output = "data.frame")

view(net01082020)
# You're viewing a list of data frames here, one per table in the PDF

# Convert integer columns to character so the tables can be bound together
net01082020all <- lapply(net01082020, mutate_if, is.integer, as.character)

net01082020all <- bind_rows(net01082020all)

We need to combine the data frames in the list using bind_rows. However, I was getting the error ‘Error: Column X.2 can’t be converted from integer to character’. To deal with this data type issue, we change all integer columns to character (using lapply). Once that’s done, we can easily combine the rows with bind_rows from dplyr.
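
To make the type problem concrete, here is a small sketch with two made-up tables standing in for what extract_tables returns. The same column is an integer in one table and text in the other, which is the kind of mismatch bind_rows objects to. The data and the names tables, tables_chr and combined are invented for this example.

```r
library(dplyr)

# Two made-up tables: X.2 is integer in the first and character in the second
tables <- list(
  data.frame(X = c("Team A", "Team B"), X.2 = c(1L, 2L), stringsAsFactors = FALSE),
  data.frame(X = "Team C", X.2 = "3-1", stringsAsFactors = FALSE)
)

# Convert every integer column to character in each table, then stack them
tables_chr <- lapply(tables, mutate_if, is.integer, as.character)
combined   <- bind_rows(tables_chr)

str(combined)
```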

And there you have it. A nice single data frame with all of the data from the particular day you targeted in your extraction. Still needs some cleaning up though. There are some extra characters before team names, columns (variables) need to be re-labeled and there are extra columns that need to be deleted.

# remove special characters "[" and "]"
net01082020all$X <- gsub("\\[","",net01082020all$X)
net01082020all$X <- gsub("\\]","",net01082020all$X)

# remove extra blank columns at end of data table
net01082020all <- net01082020all[,-c(12:19)]

# rename variables (column names)
colnames(net01082020all) <- c("Team", "NET Rank", "Avg Opp Rank", "Avg Opp NET", "Record", "Conference Record", "Non-Conference Record", "Road Record", "SOS", "NC SOS", "Quadrant Records")

You may notice that the final column has all four quadrant records smashed together in one variable. If this is important to your analysis, you’ll need to take an extra step to separate these out into four columns. I may update this post at a later date, but until then I’d suggest looking into maybe the separate() function to do the job.
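
As a starting point, here is a rough sketch of what that could look like with separate() from tidyr. It assumes the four records in the column are separated by spaces, which I have not checked against the real PDF output; if they run together with no delimiter, this will not work as written. The data and the Q1 to Q4 column names are made up.

```r
library(tidyr)

# Made-up rows; assumes four "W-L" records separated by spaces
net_demo <- data.frame(
  Team = c("Team A", "Team B"),
  `Quadrant Records` = c("3-1 2-0 4-0 5-0", "0-4 1-3 2-2 6-1"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Split on one or more spaces into four new columns (hypothetical names)
net_split <- separate(
  net_demo,
  `Quadrant Records`,
  into = c("Q1", "Q2", "Q3", "Q4"),
  sep = " +"
)

print(net_split)
```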

Last but not least, a great application for all of this would be to create a process that loops through all of the archives and stores a historical database somewhere that you can access whenever you need for analysis. I haven’t gotten that far. Attempted it, but got stuck on how to build a list of URLs in an efficient manner. If you have any thoughts on this or run into problems with the above, let me know in the comments.
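
One idea for the URL list, sketched below, is to build each file name from a date. This assumes every archived file follows the same “Mon. D, YYYY” naming as the January 6 example above, which may not hold for every date or month. The function name nitty_gritty_url is hypothetical.

```r
# Hypothetical helper: build a Nitty Gritty PDF URL for each date.
# Assumes the file name pattern seen in the Jan. 6, 2020 URL holds for all dates.
nitty_gritty_url <- function(dates) {
  dates <- as.Date(dates)
  month <- month.abb[as.integer(format(dates, "%m"))]  # "Jan", "Feb", ...
  day   <- as.integer(format(dates, "%d"))             # no leading zero
  year  <- format(dates, "%Y")
  file  <- paste0("NET Nitty Gritty - ", month, ". ", day, ", ", year, ".pdf")
  base  <- "https://extra.ncaa.org/solutions/rpi/Stats Library/"
  # The example URL encodes spaces as %20 and leaves commas alone
  gsub(" ", "%20", paste0(base, file), fixed = TRUE)
}

dates <- seq(as.Date("2020-01-06"), as.Date("2020-01-08"), by = "day")
urls  <- nitty_gritty_url(dates)
print(urls)
```

From there, a loop over urls could call extract_tables on each one, skipping any date whose file does not exist.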

Omaha Mavs Hoops Analysis

I’ve created this page to feature information about the Omaha Mavs men’s basketball team, including win probability charts. In short, these charts show the probability that each team might win at any given moment in the game.

Producing these charts is incredibly easy if you already know your way around the R language, thanks to a tremendous package called ncaahoopR created by Luke Benz.

Along the way, I’ve also come across teamcolorcodes.com, where you can search for the color codes of NCAA teams. This is very useful when comparing teams and needing to differentiate them somehow for a better visual effect.

Go Mavs!

Collecting College Football Data through Sportradar API using R

In order to kick off a personal college football rating project with R, I knew I needed team data and game by game data for the 2018 college football season for all 130 teams. I was able to obtain this data through the Sportradar API.

They were gracious enough to provide me with access to the API for 30 days, although access usually requires a fee, especially if you are monetizing your project. I won’t go through all of the steps of obtaining access to their API here. But once you have proper access, this will show you how to call and transform the API data into a workable data frame for analysis.

Here are my API calls using the httr and jsonlite packages:

## ASSUMING THESE ARE ALREADY INSTALLED
library(httr)
library(jsonlite)
options(stringsAsFactors = FALSE)

## STORE YOUR SPORT RADAR API INFORMATION
sruser <- "YOURUSERNAME"
srid <- "YOURUSERID"
srsecret <- "YOURUSERSECRET"
srtoken <- "YOURTOKEN"
srappname <- "spacialsand"
srurl <- "https://api.sportradar.us"
srpath <- "/ncaafb-t1/2018/REG/schedule.json?api_key=APIKEYHERE"
srteams <- "/ncaafb-t1/teams/FBS/2018/REG/standings.json?api_key=APIKEYHERE"

Collecting Team Data

Once you have your API access information stored (above) you can start making API calls from R with GET, like this:

## API CALL FOR TEAM DATA
srteams.raw.result <- GET(url = srurl, path = srteams)
srteams.raw.content <- rawToChar(srteams.raw.result$content)
srteams.content <- fromJSON(srteams.raw.content)

## PULL TEAM DATA BY CONFERENCE OUT OF LISTS
cfb_team1 <- srteams.content$division$conferences$teams[[1]]
cfb_team2 <- srteams.content$division$conferences$teams[[2]]
cfb_team3 <- srteams.content$division$conferences$teams[[3]]
cfb_team4 <- srteams.content$division$conferences$teams[[4]]
cfb_team5 <- srteams.content$division$conferences$teams[[5]]
cfb_team6 <- srteams.content$division$conferences$teams[[6]]
cfb_team7 <- srteams.content$division$conferences$teams[[7]]
cfb_team8 <- srteams.content$division$conferences$teams[[8]]
cfb_team9 <- srteams.content$division$conferences$teams[[9]]
cfb_team10 <- srteams.content$division$conferences$teams[[10]]
cfb_team11 <- srteams.content$division$conferences$teams[[11]]

## SOME TEAMS DO NOT HAVE SUBDIVISIONS BUT WE NEED EQUAL COLUMNS
cfb_team3$subdivision <- NA
cfb_team6$subdivision <- NA

Quick note on what is occurring in the above code chunks…when you first retrieve data from the Sportradar API, it will return raw data that is not easy to work with. So we are basically taking the raw data and keeping only the information we need, then transforming that from JSON format to more workable tables in R.

Important note: In the second-to-last step, I create data frames for each conference because we get to a point where we end up with lists and need a way to pluck out the separated data and eventually combine it into one data frame. I am positive there is a more efficient way to tackle this, perhaps looping through the lists.

This is how I was able to make it work, but suggest you consider alternative ways in order to keep your R code efficient. And it’s great practice!
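
For anyone who wants to try the looping route, here is a sketch on a tiny made-up JSON document shaped like the path used above (division, then conferences, then teams). It uses lapply to visit every conference, jsonlite::flatten() to turn nested records such as overall into plain columns like overall.wins, and bind_rows to stack the results, filling a missing subdivision with NA. The JSON content is invented and much smaller than the real API response, so treat this as a starting point rather than a drop-in replacement.

```r
library(jsonlite)
library(dplyr)

# Made-up JSON mimicking the nesting of the standings response
json <- '{"division": {"conferences": [
  {"name": "Conf A", "teams": [
    {"market": "Team 1", "subdivision": "East", "overall": {"wins": 10, "losses": 2}},
    {"market": "Team 2", "subdivision": "West", "overall": {"wins": 7, "losses": 5}}]},
  {"name": "Conf B", "teams": [
    {"market": "Team 3", "overall": {"wins": 4, "losses": 8}}]}
]}}'

content <- fromJSON(json)

# One data frame per conference, with nested data frames flattened into columns
team_list <- lapply(content$division$conferences$teams, jsonlite::flatten)

# Stack the conferences; columns missing from a conference are filled with NA
cfb_teams <- bind_rows(team_list)

print(cfb_teams)
```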

At this point, we end up with a number of data frames within data frames, which is problematic during analysis. To deal with it, I took a very (embarrassingly) manual approach to this, which again should be done in a more efficient way. If you have better suggestions, please let me know in the comments. But until I revisit it at another time, here is a long way to handle it, pulling out the variables that I care to keep:

cfb_team1$overall.wins <- cfb_team1$overall$wins
cfb_team1$overall.losses <- cfb_team1$overall$losses
cfb_team1$conference.wins <- cfb_team1$in_conference$wins
cfb_team1$conference.losses <- cfb_team1$in_conference$losses
cfb_team1$home.wins <- cfb_team1$home$wins
cfb_team1$home.losses <- cfb_team1$home$losses
cfb_team1$away.wins <- cfb_team1$away$wins
cfb_team1$away.losses <- cfb_team1$away$losses
cfb_team1$decided_by_7.wins <- cfb_team1$decided_by_7_points$wins
cfb_team1$decided_by_7.losses <- cfb_team1$decided_by_7_points$losses
cfb_team1$last_5.wins <- cfb_team1$last_5$wins
cfb_team1$last_5.losses <- cfb_team1$last_5$losses
cfb_team1$points.against <- cfb_team1$points$against
cfb_team1$points.net <- cfb_team1$points$net

cfb_team2$overall.wins <- cfb_team2$overall$wins
cfb_team2$overall.losses <- cfb_team2$overall$losses
cfb_team2$conference.wins <- cfb_team2$in_conference$wins
cfb_team2$conference.losses <- cfb_team2$in_conference$losses
cfb_team2$home.wins <- cfb_team2$home$wins
cfb_team2$home.losses <- cfb_team2$home$losses
cfb_team2$away.wins <- cfb_team2$away$wins
cfb_team2$away.losses <- cfb_team2$away$losses
cfb_team2$decided_by_7.wins <- cfb_team2$decided_by_7_points$wins
cfb_team2$decided_by_7.losses <- cfb_team2$decided_by_7_points$losses
cfb_team2$last_5.wins <- cfb_team2$last_5$wins
cfb_team2$last_5.losses <- cfb_team2$last_5$losses
cfb_team2$points.against <- cfb_team2$points$against
cfb_team2$points.net <- cfb_team2$points$net

cfb_team3$overall.wins <- cfb_team3$overall$wins
cfb_team3$overall.losses <- cfb_team3$overall$losses
cfb_team3$conference.wins <- cfb_team3$in_conference$wins
cfb_team3$conference.losses <- cfb_team3$in_conference$losses
cfb_team3$home.wins <- cfb_team3$home$wins
cfb_team3$home.losses <- cfb_team3$home$losses
cfb_team3$away.wins <- cfb_team3$away$wins
cfb_team3$away.losses <- cfb_team3$away$losses
cfb_team3$decided_by_7.wins <- cfb_team3$decided_by_7_points$wins
cfb_team3$decided_by_7.losses <- cfb_team3$decided_by_7_points$losses
cfb_team3$last_5.wins <- cfb_team3$last_5$wins
cfb_team3$last_5.losses <- cfb_team3$last_5$losses
cfb_team3$points.against <- cfb_team3$points$against
cfb_team3$points.net <- cfb_team3$points$net

cfb_team4$overall.wins <- cfb_team4$overall$wins
cfb_team4$overall.losses <- cfb_team4$overall$losses
cfb_team4$conference.wins <- cfb_team4$in_conference$wins
cfb_team4$conference.losses <- cfb_team4$in_conference$losses
cfb_team4$home.wins <- cfb_team4$home$wins
cfb_team4$home.losses <- cfb_team4$home$losses
cfb_team4$away.wins <- cfb_team4$away$wins
cfb_team4$away.losses <- cfb_team4$away$losses
cfb_team4$decided_by_7.wins <- cfb_team4$decided_by_7_points$wins
cfb_team4$decided_by_7.losses <- cfb_team4$decided_by_7_points$losses
cfb_team4$last_5.wins <- cfb_team4$last_5$wins
cfb_team4$last_5.losses <- cfb_team4$last_5$losses
cfb_team4$points.against <- cfb_team4$points$against
cfb_team4$points.net <- cfb_team4$points$net

cfb_team5$overall.wins <- cfb_team5$overall$wins
cfb_team5$overall.losses <- cfb_team5$overall$losses
cfb_team5$conference.wins <- cfb_team5$in_conference$wins
cfb_team5$conference.losses <- cfb_team5$in_conference$losses
cfb_team5$home.wins <- cfb_team5$home$wins
cfb_team5$home.losses <- cfb_team5$home$losses
cfb_team5$away.wins <- cfb_team5$away$wins
cfb_team5$away.losses <- cfb_team5$away$losses
cfb_team5$decided_by_7.wins <- cfb_team5$decided_by_7_points$wins
cfb_team5$decided_by_7.losses <- cfb_team5$decided_by_7_points$losses
cfb_team5$last_5.wins <- cfb_team5$last_5$wins
cfb_team5$last_5.losses <- cfb_team5$last_5$losses
cfb_team5$points.against <- cfb_team5$points$against
cfb_team5$points.net <- cfb_team5$points$net

cfb_team6$overall.wins <- cfb_team6$overall$wins
cfb_team6$overall.losses <- cfb_team6$overall$losses
cfb_team6$conference.wins <- cfb_team6$in_conference$wins
cfb_team6$conference.losses <- cfb_team6$in_conference$losses
cfb_team6$home.wins <- cfb_team6$home$wins
cfb_team6$home.losses <- cfb_team6$home$losses
cfb_team6$away.wins <- cfb_team6$away$wins
cfb_team6$away.losses <- cfb_team6$away$losses
cfb_team6$decided_by_7.wins <- cfb_team6$decided_by_7_points$wins
cfb_team6$decided_by_7.losses <- cfb_team6$decided_by_7_points$losses
cfb_team6$last_5.wins <- cfb_team6$last_5$wins
cfb_team6$last_5.losses <- cfb_team6$last_5$losses
cfb_team6$points.against <- cfb_team6$points$against
cfb_team6$points.net <- cfb_team6$points$net

cfb_team7$overall.wins <- cfb_team7$overall$wins
cfb_team7$overall.losses <- cfb_team7$overall$losses
cfb_team7$conference.wins <- cfb_team7$in_conference$wins
cfb_team7$conference.losses <- cfb_team7$in_conference$losses
cfb_team7$home.wins <- cfb_team7$home$wins
cfb_team7$home.losses <- cfb_team7$home$losses
cfb_team7$away.wins <- cfb_team7$away$wins
cfb_team7$away.losses <- cfb_team7$away$losses
cfb_team7$decided_by_7.wins <- cfb_team7$decided_by_7_points$wins
cfb_team7$decided_by_7.losses <- cfb_team7$decided_by_7_points$losses
cfb_team7$last_5.wins <- cfb_team7$last_5$wins
cfb_team7$last_5.losses <- cfb_team7$last_5$losses
cfb_team7$points.against <- cfb_team7$points$against
cfb_team7$points.net <- cfb_team7$points$net

cfb_team8$overall.wins <- cfb_team8$overall$wins
cfb_team8$overall.losses <- cfb_team8$overall$losses
cfb_team8$conference.wins <- cfb_team8$in_conference$wins
cfb_team8$conference.losses <- cfb_team8$in_conference$losses
cfb_team8$home.wins <- cfb_team8$home$wins
cfb_team8$home.losses <- cfb_team8$home$losses
cfb_team8$away.wins <- cfb_team8$away$wins
cfb_team8$away.losses <- cfb_team8$away$losses
cfb_team8$decided_by_7.wins <- cfb_team8$decided_by_7_points$wins
cfb_team8$decided_by_7.losses <- cfb_team8$decided_by_7_points$losses
cfb_team8$last_5.wins <- cfb_team8$last_5$wins
cfb_team8$last_5.losses <- cfb_team8$last_5$losses
cfb_team8$points.against <- cfb_team8$points$against
cfb_team8$points.net <- cfb_team8$points$net

cfb_team9$overall.wins <- cfb_team9$overall$wins
cfb_team9$overall.losses <- cfb_team9$overall$losses
cfb_team9$conference.wins <- cfb_team9$in_conference$wins
cfb_team9$conference.losses <- cfb_team9$in_conference$losses
cfb_team9$home.wins <- cfb_team9$home$wins
cfb_team9$home.losses <- cfb_team9$home$losses
cfb_team9$away.wins <- cfb_team9$away$wins
cfb_team9$away.losses <- cfb_team9$away$losses
cfb_team9$decided_by_7.wins <- cfb_team9$decided_by_7_points$wins
cfb_team9$decided_by_7.losses <- cfb_team9$decided_by_7_points$losses
cfb_team9$last_5.wins <- cfb_team9$last_5$wins
cfb_team9$last_5.losses <- cfb_team9$last_5$losses
cfb_team9$points.against <- cfb_team9$points$against
cfb_team9$points.net <- cfb_team9$points$net

cfb_team10$overall.wins <- cfb_team10$overall$wins
cfb_team10$overall.losses <- cfb_team10$overall$losses
cfb_team10$conference.wins <- cfb_team10$in_conference$wins
cfb_team10$conference.losses <- cfb_team10$in_conference$losses
cfb_team10$home.wins <- cfb_team10$home$wins
cfb_team10$home.losses <- cfb_team10$home$losses
cfb_team10$away.wins <- cfb_team10$away$wins
cfb_team10$away.losses <- cfb_team10$away$losses
cfb_team10$decided_by_7.wins <- cfb_team10$decided_by_7_points$wins
cfb_team10$decided_by_7.losses <- cfb_team10$decided_by_7_points$losses
cfb_team10$last_5.wins <- cfb_team10$last_5$wins
cfb_team10$last_5.losses <- cfb_team10$last_5$losses
cfb_team10$points.against <- cfb_team10$points$against
cfb_team10$points.net <- cfb_team10$points$net

cfb_team11$overall.wins <- cfb_team11$overall$wins
cfb_team11$overall.losses <- cfb_team11$overall$losses
cfb_team11$conference.wins <- cfb_team11$in_conference$wins
cfb_team11$conference.losses <- cfb_team11$in_conference$losses
cfb_team11$home.wins <- cfb_team11$home$wins
cfb_team11$home.losses <- cfb_team11$home$losses
cfb_team11$away.wins <- cfb_team11$away$wins
cfb_team11$away.losses <- cfb_team11$away$losses
cfb_team11$decided_by_7.wins <- cfb_team11$decided_by_7_points$wins
cfb_team11$decided_by_7.losses <- cfb_team11$decided_by_7_points$losses
cfb_team11$last_5.wins <- cfb_team11$last_5$wins
cfb_team11$last_5.losses <- cfb_team11$last_5$losses
cfb_team11$points.against <- cfb_team11$points$against
cfb_team11$points.net <- cfb_team11$points$net

## COMBINE INTO ONE DATA FRAME
cfb_teams2018 <- rbind(cfb_team1, cfb_team2, cfb_team3, cfb_team4, cfb_team5, cfb_team6, cfb_team7, cfb_team8, cfb_team9, cfb_team10, cfb_team11)
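The repeated blocks above can also be written as a loop. Here is a minimal sketch, assuming each team data frame has the same nested columns used above (overall, in_conference, home, and so on). The helper names flatten_record and make_toy_team and the toy numbers are made up for illustration. Unlike the code above, this sketch also drops the nested columns once they have been copied.

```r
# Hypothetical helper: copy nested record columns into flat columns,
# mirroring the assignments written out by hand above.
flatten_record <- function(team_df) {
  # flat prefix = nested column name
  nested <- c(
    overall      = "overall",
    conference   = "in_conference",
    home         = "home",
    away         = "away",
    decided_by_7 = "decided_by_7_points",
    last_5       = "last_5"
  )
  for (prefix in names(nested)) {
    record <- team_df[[nested[[prefix]]]]
    team_df[[paste0(prefix, ".wins")]] <- record$wins
    team_df[[paste0(prefix, ".losses")]] <- record$losses
  }
  team_df$points.against <- team_df$points$against
  team_df$points.net <- team_df$points$net
  # Assumption: drop the nested columns after copying (the code above keeps them)
  team_df[c(unname(nested), "points")] <- NULL
  team_df
}

# Toy stand-in for one team's API result (made-up values)
make_toy_team <- function(name, wins, losses) {
  record <- data.frame(wins = wins, losses = losses)
  team_df <- data.frame(name = name, stringsAsFactors = FALSE)
  for (column in c("overall", "in_conference", "home", "away",
                   "decided_by_7_points", "last_5")) {
    team_df[[column]] <- record
  }
  team_df$points <- data.frame(against = 20L, net = 5L)
  team_df
}

team_list <- list(make_toy_team("Team A", 3L, 1L), make_toy_team("Team B", 2L, 2L))

# Flatten every team, then stack them into one data frame
cfb_teams <- do.call(rbind, lapply(team_list, flatten_record))
```

With the real data you would pass list(cfb_team1, cfb_team2, ...) in place of team_list. The jsonlite package also has a flatten() function that may get you most of the way there in one call, though the column names it produces will differ from the ones above (for example in_conference.wins rather than conference.wins).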

Now you should have a data frame named ‘cfb_teams2018’ with team information for the 2018 season. I believe this is updated each week as games are played, so depending on when you make the call, you should have close to the latest information.

Collecting Game Data

## API CALL FOR TEAM DATA AND INDIVIDUAL GAME DATA
srgames.raw.result <- GET(url = srurl, path = srpath)
srgames.raw.content <- rawToChar(srgames.raw.result$content)
srgames.content <- fromJSON(srgames.raw.content)

## PULL GAME DATA BY WEEK OUT OF LISTS
cfb_week1 <- srgames.content$weeks$games[[1]]
cfb_week2 <- srgames.content$weeks$games[[2]]
cfb_week3 <- srgames.content$weeks$games[[3]]
cfb_week4 <- srgames.content$weeks$games[[4]]
cfb_week5 <- srgames.content$weeks$games[[5]]
cfb_week6 <- srgames.content$weeks$games[[6]]
cfb_week7 <- srgames.content$weeks$games[[7]]
cfb_week8 <- srgames.content$weeks$games[[8]]
cfb_week9 <- srgames.content$weeks$games[[9]]
cfb_week10 <- srgames.content$weeks$games[[10]]
cfb_week11 <- srgames.content$weeks$games[[11]]
cfb_week12 <- srgames.content$weeks$games[[12]]
cfb_week13 <- srgames.content$weeks$games[[13]]

## ADD THE WEEK NUMBER TO EACH WEEK'S GAMES
cfb_week1$week <- 1
cfb_week2$week <- 2
cfb_week3$week <- 3
cfb_week4$week <- 4
cfb_week5$week <- 5
cfb_week6$week <- 6
cfb_week7$week <- 7
cfb_week8$week <- 8
cfb_week9$week <- 9
cfb_week10$week <- 10
cfb_week11$week <- 11
cfb_week12$week <- 12
cfb_week13$week <- 13

## COMBINE GAMES FROM ALL WEEKS INTO ONE DATA FRAME
cfb_games2018 <- rbind(cfb_week1, cfb_week2, cfb_week3, cfb_week4, cfb_week5, cfb_week6, cfb_week7, cfb_week8, cfb_week9, cfb_week10, cfb_week11, cfb_week12, cfb_week13)
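The week-by-week steps above can be collapsed into a loop as well. This is a minimal sketch, assuming the games arrive as a list with one data frame per week (as srgames.content$weeks$games appears to). The weekly_games list here is toy data with made-up column names, not the API's actual fields.

```r
# Toy stand-in for srgames.content$weeks$games: one data frame per week
weekly_games <- list(
  data.frame(home = c("A", "B"), away = c("C", "D"), stringsAsFactors = FALSE),
  data.frame(home = "C", away = "A", stringsAsFactors = FALSE),
  data.frame(home = c("D", "B"), away = c("A", "C"), stringsAsFactors = FALSE)
)

# Tag each week's games with its position in the list (the week number)
tagged_weeks <- lapply(seq_along(weekly_games), function(week_number) {
  games <- weekly_games[[week_number]]
  games$week <- week_number
  games
})

# Stack all weeks into one data frame
all_games <- do.call(rbind, tagged_weeks)
```

The upside of this version is that it does not care how many weeks have been played, so you would not need to add a new line each week.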

There you have it: game-by-game data for the 2018 college football season through week 13. Happy analysis.

Digital Marketing in R: How to Create Word Clouds

I recently recorded my very first (much too lengthy) YouTube video. The video walks through taking a list of keywords and creating a word cloud in R.

While I do not find word clouds to be particularly useful, there are a number of terrific data science applications that you come across during this exercise that are worth knowing — like removing stop words and stemming.
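As a rough idea of what the stop-word step looks like, here is a minimal base R sketch. The keyword list and the tiny stop-word list are made up for illustration, and this is not necessarily how the video does it. Stemming is not covered here; packages such as tm (for stop-word lists) and SnowballC (for stemming) handle these steps more thoroughly.

```r
# Hypothetical keyword list
keywords <- c(
  "running shoes for men",
  "best running shoes",
  "the best trail running shoes"
)

# Hypothetical, deliberately tiny stop-word list
stop_words <- c("for", "the")

# Split keywords into lowercase words
words <- unlist(strsplit(tolower(keywords), "\\s+"))

# Remove stop words
words <- words[!words %in% stop_words]

# Count how often each remaining word appears; these counts are what
# a word cloud uses to size each word
word_freq <- sort(table(words), decreasing = TRUE)
```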

How Even Were Whistles in the 2017 NBA Playoffs?

TEAM                      PERSONAL FOULS/GAME    VARIANCE
Washington Wizards        23                      2
Indiana Pacers            23                      2
Oklahoma City Thunder     22                      1
Memphis Grizzlies         22                      1
Golden State Warriors     22                      1
Portland Trail Blazers    22                      1
Atlanta Hawks             21                      0
Milwaukee Bucks           21                      0
Utah Jazz                 21                      0
Houston Rockets           21                      0
Toronto Raptors           21                      0
Boston Celtics            21                      0
Los Angeles Clippers      20                     -1
Cleveland Cavaliers       19                     -2
San Antonio Spurs         19                     -2
Chicago Bulls             18                     -3
AVERAGE                   21                      0
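The VARIANCE column appears to be each team's personal fouls per game minus the average of 21; the figures in the table are consistent with that reading. A minimal sketch of that arithmetic, using the rounded values from the table (the data frame name whistles is just for illustration):

```r
whistles <- data.frame(
  team = c(
    "Washington Wizards", "Indiana Pacers", "Oklahoma City Thunder",
    "Memphis Grizzlies", "Golden State Warriors", "Portland Trail Blazers",
    "Atlanta Hawks", "Milwaukee Bucks", "Utah Jazz", "Houston Rockets",
    "Toronto Raptors", "Boston Celtics", "Los Angeles Clippers",
    "Cleveland Cavaliers", "San Antonio Spurs", "Chicago Bulls"
  ),
  fouls_per_game = c(23, 23, 22, 22, 22, 22, 21, 21, 21, 21, 21, 21, 20, 19, 19, 18),
  stringsAsFactors = FALSE
)

# Average across the 16 playoff teams
average_fouls <- mean(whistles$fouls_per_game)

# Difference between each team and the average
whistles$variance <- whistles$fouls_per_game - average_fouls
```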

3 Data/Analytics Podcast Recommendations

Here is a brief list of podcasts I would recommend that pertain to either digital marketing or data science. Enjoy!

The Digital Analytics Power Hour

Hosted by Tim Wilson and Michael Helbling, this podcast covers a range of digital analytics topics, from R to the skills the digital marketing analyst of the future will need.

The Data Skeptic

I just started listening to this one and I love it. Many of the episodes are very short (about 15 minutes), so it’s very digestible. There’s a wide range of relevant topics, from a refresher on p-values and t-tests to neuroscience. I really like how the episodes last only as long as they need to and how they break down seemingly complex topics into something everyone can grasp.

FiveThirtyEight

This one is less about understanding data/analytics and more about the findings the team over at 538 has made. If you’re reading this, you are most likely already familiar with the 538 blog, where topics are generally focused on politics and sports.

Any good recommendations out there I missed? Let me know in the comments. Thanks!