Category Archives: Sports

Scraping Sports Stats Using R (Part 2)

In Part 1 of this blog post I show how you can scrape tables of sports data from websites and store that data in a data frame for data analysis (have I said ‘data’ enough times yet?).

Whenever you are automating data collection in this sort of way, you always want to get a “health check” on your new table to make sure nothing went awry. There are countless things that can go wrong from missing data to web pages timing out or blocking you from collecting data — and it is critical we understand whether any of this has happened before moving on to an analysis phase. You might call this a “data cleaning” phase to get you into position to analyze.

With the head function you can quickly get a glimpse of what your variable names look like, along with a few observations. Let’s take a look:

head(fb_main)
            Date NU rank     Opponent Site Outcome Score <U+00A0>
1 Sept. 17, 1960             #4 Texas Away     Win 14-13  Details
2 Sept. 24, 1960     #12    Minnesota Home    Loss 26-14  Details
3   Oct. 1, 1960           Iowa State Home    Loss  10-7  Details
4   Oct. 8, 1960         Kansas State Home     Win  17-7  Details
5  Oct. 15, 1960                 Army Home     Win  14-9  Details
6  Oct. 22, 1960             Colorado Away    Loss  19-6  Details

Immediately a few concerns jump out at me. First, there is a column where all of the values contain the word “Details.” On the original site I drew this information from, this column linked out to details for each game. I do not need this column for any analysis, so I will remove it. There are numerous ways to do this. Since I know it is my seventh column, I’ll just do it this way. If you’re unsure about how to tackle this (football pun), then you may want to save the table as a different name and keep the original.

fb_main <- fb_main[,-7]
head(fb_main)
            Date NU rank     Opponent Site Outcome Score
1 Sept. 17, 1960             #4 Texas Away     Win 14-13
2 Sept. 24, 1960     #12    Minnesota Home    Loss 26-14
3   Oct. 1, 1960           Iowa State Home    Loss  10-7
4   Oct. 8, 1960         Kansas State Home     Win  17-7
5  Oct. 15, 1960                 Army Home     Win  14-9
6  Oct. 22, 1960             Colorado Away    Loss  19-6

Much better. But it sure would be nice if the score was split into two columns in case I wanted to sum or average any of the scores during analysis. One variable can easily be split into two variables with a convenient function called separate, which is part of the tidyr package. It would look like this:

fb_main <- separate(fb_main, Score, into = c("Win Score", "Lose Score"), sep = "-")

But not so fast my friend — this really isn’t very helpful at all. We want the Husker scores in one column and the opponent score in another column, not a mix. This creates complexity to our code, but we can still accomplish it in a few easy steps. You will need to use dplyr, so make sure that is activated in your library.

library(tidyr)
fb_main$NUScore <- as.numeric(ifelse(fb_main$Outcome=="Win", fb_main$`Winner Score`, fb_main$`Loser Score`))
fb_main$OppScore <- as.numeric(ifelse(fb_main$Outcome=="Win", fb_main$`Loser Score`, fb_main$`Winner Score`))
fb_main <- fb_main[, -c(6:7)]

The first two lines above create the new columns we want using an ifelse function. And I saved myself some time by making the columns numeric. The ifelse statement has three arguments: condition, value of new row if condition is true, value of new row if condition is false. The last line is simply deleting the Winner Score and Loser Score variables, which we no longer need. Now look at the data:

head(fb_main3)
            Date NU rank     Opponent Site Outcome NUScore OppScore
1 Sept. 17, 1960             #4 Texas Away     Win      14       13
2 Sept. 24, 1960     #12    Minnesota Home    Loss      14       26
3   Oct. 1, 1960           Iowa State Home    Loss       7       10
4   Oct. 8, 1960         Kansas State Home     Win      17        7
5  Oct. 15, 1960                 Army Home     Win      14        9
6  Oct. 22, 1960             Colorado Away    Loss       6       19

Some other observations I have made about this data is that the Opponent variable contains both the opponent name and their ranking (this could create difficulties down the road), some values are missing, and the rankings contain a hash fragment (#). These are all worth tidying up before analysis, but I’ll stop there since the above should provide enough direction to complete those tasks.

One final consideration to make once you have collected all of this data is where to store it. There are many options and each of them have different methods in R. But here are a few to consider so that you do not need to re-gather the data each time you need it for analysis:

  • Save as a data frame in R
  • Use write.csv to save it on your computer as a .csv file (a similar function exists for Excel)
  • Send to a local or cloud-based database
  • Upload to data.world

Here is the final data, uploaded to data.world: https://data.world/jeffgswanson/husker-football-game-results

Enjoy.

Scraping Sports Stats Using R (Part 1)

You can scrape sports data from Web pages and store them in your own data frame/table for future analysis using handy readHTMLTable and lapply packages in R. My code is below.

fb_urls <- paste0('http://dataomaha.com/huskers/history/seasons/', 1960:2017)

fb_urls <- unlist(fb_urls)

fb_main <- lapply(fb_urls, function(x){readHTMLTable(getURL(x), stringsAsFactors=F)[[1]]})

fb_main <- rbindlist(fb_main)

fb_main

In this two-part post, I’ll show you how to use R to scrape tables from multiple Web pages to use for your own analysis. There are a number of steps involved here and, depending on which page(s) you’re trying to get data from, it can get complicated, but thankfully R has some brilliant packages like readHTMLTable to do the heavy lifting.

First, let’s take a look at the web pages I’ll be scraping. I want to collect game summary data for Nebraska Cornhusker football games (Go Big Red), which I found here: http://dataomaha.com/huskers/history/seasons/1997.

In this example, there is only one table. If your page has multiple tables, you may need to modify the code, which you can do by specifying within the readHTMLTable function (see last argument ‘[1]‘). If you have further questions on this, leave a comment or search for documentation on this function.

Below, the first step I took in forming a script is to build a list of URLs I wanted to scrape. If you are only extracting table data from one page, things are much simpler. You can just use readHTMLTable or htmltab and use the URL as an argument in the function. In my case, I wanted to cycle through 57 different pages. You could create a list in Excel quite easily, but just as quickly, you can also use the paste0 function, which is loaded into R by default, to automatically create a list.

fb_urls <- paste0('http://dataomaha.com/huskers/history/seasons/', 1960:2017)

Basically, this function is saying to take the base URL (first argument) and paste a number to the end until the list is complete. In this case, I am pasting 1960-2017, since I know the URLs I am scraping are constructed this way. Those are then saved as fb_urls.

The next step is to take what we created and basically get it into the right format. We can do this using the unlist function.

fb_urls <- unlist(fb_urls)

Now that we have a list of Web pages to scrape, we need to write a function with information on what we want to do on each page. Here, we are using a terrifically simple function name lapply which is designed to apply a function through a list. The first argument is simply the list we want to apply the function to and next we will use a Web scraping function in R called readHTMLTable to pull back the first table it finds on each page and save it in an list we will call fb_main.

fb_main <- lapply(fb_urls, function(x){readHTMLTable(getURL(x), stringsAsFactors=F)[[1]]})

The result is a list of 57 separate data frames. But I just need one large data frame with all of the information. So I’ll merge everything into one data frame using rbindlist.

fb_main <- rbindlist(fb_main)

Done! We always want to double check our work, which we can do by simply calling the new data frame.

fb_main

In due time, I’ll follow up with Part 2, which will focus on reviewing and cleaning the returned data, as well as options for storing it in an accessible place for future data analysis.