Collecting College Football Data through Sportradar API using R

To kick off a personal college football rating project in R, I needed team data and game-by-game data for all 130 teams in the 2018 college football season. I was able to obtain this data through the Sportradar API.

They were gracious enough to give me access to the API for 30 days, although access usually requires a fee, especially if you are monetizing your project. I won’t go through the steps of obtaining access here. But once you have proper access, this post shows how to call the API and transform the data into a workable data frame for analysis.

Here are my API calls using the httr and jsonlite packages:

## ASSUMING THESE ARE ALREADY INSTALLED
library(httr)
library(jsonlite)
options(stringsAsFactors = FALSE)

## STORE YOUR SPORTRADAR API INFORMATION
sruser <- "YOURUSERNAME"
srid <- "YOURUSERID"
srsecret <- "YOURUSERSECRET"
srtoken <- "YOURTOKEN"
srappname <- "spacialsand"
srurl <- "https://api.sportradar.us"
srpath <- "/ncaafb-t1/2018/REG/schedule.json?api_key=APIKEYHERE"
srteams <- "/ncaafb-t1/teams/FBS/2018/REG/standings.json?api_key=APIKEYHERE"
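If you would rather not paste the key into every path by hand, one option is to build the paths with a small helper and read the key from an environment variable. This is only a sketch: `sr_path` and the `SPORTRADAR_API_KEY` variable name are hypothetical names chosen for this example, not anything defined by Sportradar.

```r
# Hypothetical helper: append the API key to an endpoint path.
# SPORTRADAR_API_KEY is an assumed environment variable name, not an official one.
sr_path <- function(endpoint, key = Sys.getenv("SPORTRADAR_API_KEY")) {
  if (!nzchar(key)) {
    stop("No API key found; set SPORTRADAR_API_KEY or pass `key` directly")
  }
  paste0(endpoint, "?api_key=", key)
}

# Same endpoints as above, here with the placeholder key passed explicitly
srpath  <- sr_path("/ncaafb-t1/2018/REG/schedule.json", key = "APIKEYHERE")
srteams <- sr_path("/ncaafb-t1/teams/FBS/2018/REG/standings.json", key = "APIKEYHERE")
```

With the key in an environment variable, it stays out of any script you share or publish.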

Collecting Team Data

Once you have your API access information stored (above) you can start making API calls from R with GET, like this:

## API CALL FOR TEAM DATA
srteams.raw.result <- GET(url = srurl, path = srteams)
srteams.raw.content <- rawToChar(srteams.raw.result$content)
srteams.content <- fromJSON(srteams.raw.content)
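The call above assumes the request succeeded. If the key has expired or the path is wrong, the parsing step may fail with a confusing error, so it can help to check the status code first. Below is a minimal sketch of that idea, assuming a failed request comes back with a non-200 status. `fetch_sr` is a hypothetical wrapper, and the `get_fn` and `parse_fn` arguments exist only so the function can be tried without a live API key.

```r
# Hypothetical wrapper around the three steps above: request, decode, parse.
# By default it uses httr::GET and jsonlite::fromJSON (both must be installed).
fetch_sr <- function(url, path,
                     get_fn = httr::GET,
                     parse_fn = jsonlite::fromJSON) {
  resp <- get_fn(url = url, path = path)
  # Stop early with a readable message instead of failing inside the parser
  if (resp$status_code != 200) {
    stop("Sportradar request failed with HTTP status ", resp$status_code)
  }
  parse_fn(rawToChar(resp$content))
}

# Usage with the objects defined earlier (commented out: needs a valid key)
# srteams.content <- fetch_sr(srurl, srteams)
```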

## PULL TEAM DATA BY CONFERENCE OUT OF LISTS
cfb_team1 <- srteams.content$division$conferences$teams[[1]]
cfb_team2 <- srteams.content$division$conferences$teams[[2]]
cfb_team3 <- srteams.content$division$conferences$teams[[3]]
cfb_team4 <- srteams.content$division$conferences$teams[[4]]
cfb_team5 <- srteams.content$division$conferences$teams[[5]]
cfb_team6 <- srteams.content$division$conferences$teams[[6]]
cfb_team7 <- srteams.content$division$conferences$teams[[7]]
cfb_team8 <- srteams.content$division$conferences$teams[[8]]
cfb_team9 <- srteams.content$division$conferences$teams[[9]]
cfb_team10 <- srteams.content$division$conferences$teams[[10]]
cfb_team11 <- srteams.content$division$conferences$teams[[11]]

## SOME TEAMS DO NOT HAVE SUBDIVISIONS BUT WE NEED EQUAL COLUMNS
cfb_team3$subdivision <- NA
cfb_team6$subdivision <- NA

A quick note on what is happening in the code chunks above: when you first retrieve data from the Sportradar API, it returns raw data that is not easy to work with. So we convert the raw response to a JSON string, parse that JSON into R lists and data frames, and then keep only the pieces we need.

Important note: In the second-to-last step above, I create a separate data frame for each conference because the parsed result is a list of data frames, one per conference, and we need a way to pluck out each one and eventually combine them into a single data frame. I am positive there is a more efficient way to tackle this, perhaps looping through the lists.

This is how I was able to make it work, but I suggest you consider alternative ways to keep your R code efficient. And it’s great practice!
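As one possible take on the looping idea, `lapply` can walk the list of conferences and add any missing columns as `NA`, and `do.call(rbind, ...)` can then stack the results. This is a sketch on made-up data, not the actual API response: the `conferences` list and the team names are hypothetical stand-ins for `srteams.content$division$conferences$teams`.

```r
# Toy stand-in for the list of per-conference data frames (hypothetical data)
conferences <- list(
  data.frame(name = c("Team A", "Team B"),
             subdivision = c("East", "West"),
             stringsAsFactors = FALSE),
  data.frame(name = "Team C", stringsAsFactors = FALSE)  # no subdivision column
)

# Every column that appears in at least one conference
all_cols <- unique(unlist(lapply(conferences, names)))

# Add the missing columns as NA so every data frame has the same columns
padded <- lapply(conferences, function(df) {
  for (col in setdiff(all_cols, names(df))) {
    df[[col]] <- NA
  }
  df[all_cols]
})

# Stack all conferences into one data frame
teams <- do.call(rbind, padded)
```

This handles any number of conferences and does not require knowing in advance which ones lack a `subdivision` column.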

At this point, we end up with a number of data frames nested within data frames, which is problematic during analysis. To deal with it, I took a very (embarrassingly) manual approach, which again should be done in a more efficient way. If you have better suggestions, please let me know in the comments. Until I revisit it, here is the long way to handle it, pulling out the variables that I care to keep:

cfb_team1$overall.wins <- cfb_team1$overall$wins
cfb_team1$overall.losses <- cfb_team1$overall$losses
cfb_team1$conference.wins <- cfb_team1$in_conference$wins
cfb_team1$conference.losses <- cfb_team1$in_conference$losses
cfb_team1$home.wins <- cfb_team1$home$wins
cfb_team1$home.losses <- cfb_team1$home$losses
cfb_team1$away.wins <- cfb_team1$away$wins
cfb_team1$away.losses <- cfb_team1$away$losses
cfb_team1$decided_by_7.wins <- cfb_team1$decided_by_7_points$wins
cfb_team1$decided_by_7.losses <- cfb_team1$decided_by_7_points$losses
cfb_team1$last_5.wins <- cfb_team1$last_5$wins
cfb_team1$last_5.losses <- cfb_team1$last_5$losses
cfb_team1$points.against <- cfb_team1$points$against
cfb_team1$points.net <- cfb_team1$points$net

cfb_team2$overall.wins <- cfb_team2$overall$wins
cfb_team2$overall.losses <- cfb_team2$overall$losses
cfb_team2$conference.wins <- cfb_team2$in_conference$wins
cfb_team2$conference.losses <- cfb_team2$in_conference$losses
cfb_team2$home.wins <- cfb_team2$home$wins
cfb_team2$home.losses <- cfb_team2$home$losses
cfb_team2$away.wins <- cfb_team2$away$wins
cfb_team2$away.losses <- cfb_team2$away$losses
cfb_team2$decided_by_7.wins <- cfb_team2$decided_by_7_points$wins
cfb_team2$decided_by_7.losses <- cfb_team2$decided_by_7_points$losses
cfb_team2$last_5.wins <- cfb_team2$last_5$wins
cfb_team2$last_5.losses <- cfb_team2$last_5$losses
cfb_team2$points.against <- cfb_team2$points$against
cfb_team2$points.net <- cfb_team2$points$net

cfb_team3$overall.wins <- cfb_team3$overall$wins
cfb_team3$overall.losses <- cfb_team3$overall$losses
cfb_team3$conference.wins <- cfb_team3$in_conference$wins
cfb_team3$conference.losses <- cfb_team3$in_conference$losses
cfb_team3$home.wins <- cfb_team3$home$wins
cfb_team3$home.losses <- cfb_team3$home$losses
cfb_team3$away.wins <- cfb_team3$away$wins
cfb_team3$away.losses <- cfb_team3$away$losses
cfb_team3$decided_by_7.wins <- cfb_team3$decided_by_7_points$wins
cfb_team3$decided_by_7.losses <- cfb_team3$decided_by_7_points$losses
cfb_team3$last_5.wins <- cfb_team3$last_5$wins
cfb_team3$last_5.losses <- cfb_team3$last_5$losses
cfb_team3$points.against <- cfb_team3$points$against
cfb_team3$points.net <- cfb_team3$points$net

cfb_team4$overall.wins <- cfb_team4$overall$wins
cfb_team4$overall.losses <- cfb_team4$overall$losses
cfb_team4$conference.wins <- cfb_team4$in_conference$wins
cfb_team4$conference.losses <- cfb_team4$in_conference$losses
cfb_team4$home.wins <- cfb_team4$home$wins
cfb_team4$home.losses <- cfb_team4$home$losses
cfb_team4$away.wins <- cfb_team4$away$wins
cfb_team4$away.losses <- cfb_team4$away$losses
cfb_team4$decided_by_7.wins <- cfb_team4$decided_by_7_points$wins
cfb_team4$decided_by_7.losses <- cfb_team4$decided_by_7_points$losses
cfb_team4$last_5.wins <- cfb_team4$last_5$wins
cfb_team4$last_5.losses <- cfb_team4$last_5$losses
cfb_team4$points.against <- cfb_team4$points$against
cfb_team4$points.net <- cfb_team4$points$net

cfb_team5$overall.wins <- cfb_team5$overall$wins
cfb_team5$overall.losses <- cfb_team5$overall$losses
cfb_team5$conference.wins <- cfb_team5$in_conference$wins
cfb_team5$conference.losses <- cfb_team5$in_conference$losses
cfb_team5$home.wins <- cfb_team5$home$wins
cfb_team5$home.losses <- cfb_team5$home$losses
cfb_team5$away.wins <- cfb_team5$away$wins
cfb_team5$away.losses <- cfb_team5$away$losses
cfb_team5$decided_by_7.wins <- cfb_team5$decided_by_7_points$wins
cfb_team5$decided_by_7.losses <- cfb_team5$decided_by_7_points$losses
cfb_team5$last_5.wins <- cfb_team5$last_5$wins
cfb_team5$last_5.losses <- cfb_team5$last_5$losses
cfb_team5$points.against <- cfb_team5$points$against
cfb_team5$points.net <- cfb_team5$points$net

cfb_team6$overall.wins <- cfb_team6$overall$wins
cfb_team6$overall.losses <- cfb_team6$overall$losses
cfb_team6$conference.wins <- cfb_team6$in_conference$wins
cfb_team6$conference.losses <- cfb_team6$in_conference$losses
cfb_team6$home.wins <- cfb_team6$home$wins
cfb_team6$home.losses <- cfb_team6$home$losses
cfb_team6$away.wins <- cfb_team6$away$wins
cfb_team6$away.losses <- cfb_team6$away$losses
cfb_team6$decided_by_7.wins <- cfb_team6$decided_by_7_points$wins
cfb_team6$decided_by_7.losses <- cfb_team6$decided_by_7_points$losses
cfb_team6$last_5.wins <- cfb_team6$last_5$wins
cfb_team6$last_5.losses <- cfb_team6$last_5$losses
cfb_team6$points.against <- cfb_team6$points$against
cfb_team6$points.net <- cfb_team6$points$net

cfb_team7$overall.wins <- cfb_team7$overall$wins
cfb_team7$overall.losses <- cfb_team7$overall$losses
cfb_team7$conference.wins <- cfb_team7$in_conference$wins
cfb_team7$conference.losses <- cfb_team7$in_conference$losses
cfb_team7$home.wins <- cfb_team7$home$wins
cfb_team7$home.losses <- cfb_team7$home$losses
cfb_team7$away.wins <- cfb_team7$away$wins
cfb_team7$away.losses <- cfb_team7$away$losses
cfb_team7$decided_by_7.wins <- cfb_team7$decided_by_7_points$wins
cfb_team7$decided_by_7.losses <- cfb_team7$decided_by_7_points$losses
cfb_team7$last_5.wins <- cfb_team7$last_5$wins
cfb_team7$last_5.losses <- cfb_team7$last_5$losses
cfb_team7$points.against <- cfb_team7$points$against
cfb_team7$points.net <- cfb_team7$points$net

cfb_team8$overall.wins <- cfb_team8$overall$wins
cfb_team8$overall.losses <- cfb_team8$overall$losses
cfb_team8$conference.wins <- cfb_team8$in_conference$wins
cfb_team8$conference.losses <- cfb_team8$in_conference$losses
cfb_team8$home.wins <- cfb_team8$home$wins
cfb_team8$home.losses <- cfb_team8$home$losses
cfb_team8$away.wins <- cfb_team8$away$wins
cfb_team8$away.losses <- cfb_team8$away$losses
cfb_team8$decided_by_7.wins <- cfb_team8$decided_by_7_points$wins
cfb_team8$decided_by_7.losses <- cfb_team8$decided_by_7_points$losses
cfb_team8$last_5.wins <- cfb_team8$last_5$wins
cfb_team8$last_5.losses <- cfb_team8$last_5$losses
cfb_team8$points.against <- cfb_team8$points$against
cfb_team8$points.net <- cfb_team8$points$net

cfb_team9$overall.wins <- cfb_team9$overall$wins
cfb_team9$overall.losses <- cfb_team9$overall$losses
cfb_team9$conference.wins <- cfb_team9$in_conference$wins
cfb_team9$conference.losses <- cfb_team9$in_conference$losses
cfb_team9$home.wins <- cfb_team9$home$wins
cfb_team9$home.losses <- cfb_team9$home$losses
cfb_team9$away.wins <- cfb_team9$away$wins
cfb_team9$away.losses <- cfb_team9$away$losses
cfb_team9$decided_by_7.wins <- cfb_team9$decided_by_7_points$wins
cfb_team9$decided_by_7.losses <- cfb_team9$decided_by_7_points$losses
cfb_team9$last_5.wins <- cfb_team9$last_5$wins
cfb_team9$last_5.losses <- cfb_team9$last_5$losses
cfb_team9$points.against <- cfb_team9$points$against
cfb_team9$points.net <- cfb_team9$points$net

cfb_team10$overall.wins <- cfb_team10$overall$wins
cfb_team10$overall.losses <- cfb_team10$overall$losses
cfb_team10$conference.wins <- cfb_team10$in_conference$wins
cfb_team10$conference.losses <- cfb_team10$in_conference$losses
cfb_team10$home.wins <- cfb_team10$home$wins
cfb_team10$home.losses <- cfb_team10$home$losses
cfb_team10$away.wins <- cfb_team10$away$wins
cfb_team10$away.losses <- cfb_team10$away$losses
cfb_team10$decided_by_7.wins <- cfb_team10$decided_by_7_points$wins
cfb_team10$decided_by_7.losses <- cfb_team10$decided_by_7_points$losses
cfb_team10$last_5.wins <- cfb_team10$last_5$wins
cfb_team10$last_5.losses <- cfb_team10$last_5$losses
cfb_team10$points.against <- cfb_team10$points$against
cfb_team10$points.net <- cfb_team10$points$net

cfb_team11$overall.wins <- cfb_team11$overall$wins
cfb_team11$overall.losses <- cfb_team11$overall$losses
cfb_team11$conference.wins <- cfb_team11$in_conference$wins
cfb_team11$conference.losses <- cfb_team11$in_conference$losses
cfb_team11$home.wins <- cfb_team11$home$wins
cfb_team11$home.losses <- cfb_team11$home$losses
cfb_team11$away.wins <- cfb_team11$away$wins
cfb_team11$away.losses <- cfb_team11$away$losses
cfb_team11$decided_by_7.wins <- cfb_team11$decided_by_7_points$wins
cfb_team11$decided_by_7.losses <- cfb_team11$decided_by_7_points$losses
cfb_team11$last_5.wins <- cfb_team11$last_5$wins
cfb_team11$last_5.losses <- cfb_team11$last_5$losses
cfb_team11$points.against <- cfb_team11$points$against
cfb_team11$points.net <- cfb_team11$points$net

## COMBINE INTO ONE DATA FRAME
cfb_teams2018 <- rbind(cfb_team1, cfb_team2, cfb_team3, cfb_team4, cfb_team5, cfb_team6, cfb_team7, cfb_team8, cfb_team9, cfb_team10, cfb_team11)
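The fourteen assignments per conference above could also be wrapped in one function and applied to each conference. Here is a minimal sketch, assuming every nested record (`overall`, `home`, and so on) is a data frame with `wins` and `losses` columns, as the code above implies. `record_groups`, `flatten_team`, and the toy teams are hypothetical names used only for illustration.

```r
# New column prefix -> name of the nested data frame it comes from
record_groups <- c(
  overall      = "overall",
  conference   = "in_conference",
  home         = "home",
  away         = "away",
  decided_by_7 = "decided_by_7_points",
  last_5       = "last_5"
)

# Copy wins/losses out of each nested record, plus the two points fields
flatten_team <- function(team_df) {
  for (prefix in names(record_groups)) {
    nested <- team_df[[record_groups[[prefix]]]]
    team_df[[paste0(prefix, ".wins")]]   <- nested$wins
    team_df[[paste0(prefix, ".losses")]] <- nested$losses
  }
  team_df$points.against <- team_df$points$against
  team_df$points.net     <- team_df$points$net
  team_df
}

# Toy conference with two nested records (hypothetical data)
toy <- data.frame(name = c("Team A", "Team B"), stringsAsFactors = FALSE)
toy$overall <- data.frame(wins = c(10, 8), losses = c(2, 4))
toy$points  <- data.frame(against = c(200, 250), net = c(150, 40))

flat <- flatten_team(toy)
```

With the real data, `lapply()` over a list of the eleven conference data frames would replace the repeated blocks. jsonlite also provides a `flatten()` function (and `fromJSON()` has a `flatten` argument) that un-nests data frame columns automatically, though the resulting column names would differ slightly from the ones used here, for example `in_conference.wins` rather than `conference.wins`.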

Now you should have a data frame named ‘cfb_teams2018’ with team information for the 2018 season. I believe the standings are updated each week as games are played, so depending on when you make the call, you should have close to the latest information.

Collecting Game Data

## API CALL FOR INDIVIDUAL GAME DATA (SEASON SCHEDULE)
srgames.raw.result <- GET(url = srurl, path = srpath)
srgames.raw.content <- rawToChar(srgames.raw.result$content)
srgames.content <- fromJSON(srgames.raw.content)

## PULL GAME DATA BY WEEK OUT OF LISTS
cfb_week1 <- srgames.content$weeks$games[[1]]
cfb_week2 <- srgames.content$weeks$games[[2]]
cfb_week3 <- srgames.content$weeks$games[[3]]
cfb_week4 <- srgames.content$weeks$games[[4]]
cfb_week5 <- srgames.content$weeks$games[[5]]
cfb_week6 <- srgames.content$weeks$games[[6]]
cfb_week7 <- srgames.content$weeks$games[[7]]
cfb_week8 <- srgames.content$weeks$games[[8]]
cfb_week9 <- srgames.content$weeks$games[[9]]
cfb_week10 <- srgames.content$weeks$games[[10]]
cfb_week11 <- srgames.content$weeks$games[[11]]
cfb_week12 <- srgames.content$weeks$games[[12]]
cfb_week13 <- srgames.content$weeks$games[[13]]

## LABEL EACH WEEK'S GAMES WITH THE WEEK NUMBER
cfb_week1$week <- 1
cfb_week2$week <- 2
cfb_week3$week <- 3
cfb_week4$week <- 4
cfb_week5$week <- 5
cfb_week6$week <- 6
cfb_week7$week <- 7
cfb_week8$week <- 8
cfb_week9$week <- 9
cfb_week10$week <- 10
cfb_week11$week <- 11
cfb_week12$week <- 12
cfb_week13$week <- 13

## COMBINE GAMES FROM ALL WEEKS INTO ONE DATA FRAME
cfb_games2018 <- rbind(cfb_week1, cfb_week2, cfb_week3, cfb_week4, cfb_week5, cfb_week6, cfb_week7, cfb_week8, cfb_week9, cfb_week10, cfb_week11, cfb_week12, cfb_week13)
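The weekly steps above can also be written as a loop over the list of weeks, which avoids hard-coding the number of weeks. This is a minimal sketch on toy data, where `weeks` is a hypothetical stand-in for `srgames.content$weeks$games` and the game ids are invented.

```r
# Toy stand-in for the list of per-week game data frames (hypothetical data)
weeks <- list(
  data.frame(id = c("g1", "g2"), stringsAsFactors = FALSE),  # week 1
  data.frame(id = "g3", stringsAsFactors = FALSE)            # week 2
)

# Tag each week's games with its position in the list
tagged <- Map(function(games, wk) {
  games$week <- wk
  games
}, weeks, seq_along(weeks))

# Stack all weeks into one data frame
games <- do.call(rbind, tagged)
```

Because the week number comes from the list position, this assumes the weeks arrive in order, as the manual code above also does.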

There you have it: game-by-game data for the 2018 college football season through week 13. Happy analysis.

Digital Marketing in R: How to Create Word Clouds

I recently recorded my very first (much too lengthy) YouTube video. The video walks through taking a list of keywords and creating a word cloud in R.

While I do not find word clouds to be particularly useful, you come across a number of terrific data science techniques during this exercise that are worth knowing, like removing stop words and stemming.
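To give a flavor of the stop-word step, here is a small base R sketch. It is not the code from the video: the keyword list and the tiny `stop_words` vector are made up for illustration, and a real project would more likely use a text-mining package with a full stop-word list.

```r
# Hypothetical keyword list
keywords <- c("running shoes for men",
              "best running shoes",
              "shoes for trail running")

# Lowercase, split on anything that is not a letter, drop empty strings
words <- unlist(strsplit(tolower(keywords), "[^a-z]+"))
words <- words[nzchar(words)]

# Tiny illustrative stop-word list (real lists are much longer)
stop_words <- c("a", "and", "for", "in", "of", "the")
kept <- words[!words %in% stop_words]

# Word frequencies, most common first; these drive word size in a cloud
freq <- sort(table(kept), decreasing = TRUE)
```

Stemming would be the next step: reducing words like "running" and "runs" to a common root before counting. The SnowballC package's `wordStem()` function is one way to do that.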

How Even Were Whistles in the 2017 NBA Playoffs?

TEAM                      PERSONAL FOULS/GAME   VARIANCE
AVERAGE                   21                    0
Washington Wizards        23                    2
Indiana Pacers            23                    2
Oklahoma City Thunder     22                    1
Memphis Grizzlies         22                    1
Golden State Warriors     22                    1
Portland Trail Blazers    22                    1
Atlanta Hawks             21                    0
Milwaukee Bucks           21                    0
Utah Jazz                 21                    0
Houston Rockets           21                    0
Toronto Raptors           21                    0
Boston Celtics            21                    0
Los Angeles Clippers      20                    -1
Cleveland Cavaliers       19                    -2
San Antonio Spurs         19                    -2
Chicago Bulls             18                    -3
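The VARIANCE column appears to be each team's fouls per game minus the average across all sixteen teams. A short sketch that reproduces it from the rounded values in the table (the variable names are just for illustration):

```r
# Personal fouls per game, copied from the table above
fouls <- c(
  "Washington Wizards" = 23, "Indiana Pacers" = 23,
  "Oklahoma City Thunder" = 22, "Memphis Grizzlies" = 22,
  "Golden State Warriors" = 22, "Portland Trail Blazers" = 22,
  "Atlanta Hawks" = 21, "Milwaukee Bucks" = 21, "Utah Jazz" = 21,
  "Houston Rockets" = 21, "Toronto Raptors" = 21, "Boston Celtics" = 21,
  "Los Angeles Clippers" = 20, "Cleveland Cavaliers" = 19,
  "San Antonio Spurs" = 19, "Chicago Bulls" = 18
)

# Average across the sixteen playoff teams
avg_fouls <- mean(fouls)

# Difference from the average for each team
variance <- fouls - avg_fouls
```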

3 Data/Analytics Podcast Recommendations

Here is a brief list of podcasts I would recommend that pertain to either digital marketing or data science. Enjoy!

The Digital Analytics Power Hour

Hosted by Tim Wilson and Michael Helbling, this podcast covers a range of digital analytics topics, from R to what the future digital marketing analyst will look like from a skills perspective.

The Data Skeptic

I just started listening to this one and I love it. Many of the episodes are very short (about 15 minutes), so it’s very digestible. There’s a wide range of very relevant topics, from a refresher on p-values and t-tests to neuroscience. I really like how the episodes only last as long as they need to and how they break down seemingly complex topics into something everyone can grasp.

FiveThirtyEight

This one is less about understanding data/analytics and more about findings the team over at 538 has made. If you’re reading this, you most likely are already familiar with the 538 blog, where topics are generally focused on politics and sports.

Any good recommendations out there I missed? Let me know in the comments. Thanks!