You can scrape sports data from Web pages and store it in your own data frame for future analysis using a few handy tools in R: readHTMLTable (from the XML package), getURL (from the RCurl package), the base R function lapply, and rbindlist (from the data.table package). My full code is below; the rest of the post walks through it step by step.
library(XML)        # provides readHTMLTable
library(RCurl)      # provides getURL
library(data.table) # provides rbindlist

fb_urls <- paste0('http://dataomaha.com/huskers/history/seasons/', 1960:2017)
fb_urls <- unlist(fb_urls)
fb_main <- lapply(fb_urls, function(x){readHTMLTable(getURL(x), stringsAsFactors=F)[[1]]})
fb_main <- rbindlist(fb_main)
fb_main
In this two-part post, I’ll show you how to use R to scrape tables from multiple Web pages for your own analysis. There are a number of steps involved and, depending on which page(s) you’re trying to get data from, it can get complicated, but thankfully R has some brilliant tools, like the XML package’s readHTMLTable function, to do the heavy lifting.
First, let’s take a look at the web pages I’ll be scraping. I want to collect game summary data for Nebraska Cornhusker football games (Go Big Red), which I found here: http://dataomaha.com/huskers/history/seasons/1997.
In this example, there is only one table per page. If your page has multiple tables, readHTMLTable will return all of them as a list, and you pick the one you want by its index; that is what the [[1]] at the end of the readHTMLTable call does, as shown in the sketch below. If you have further questions on this, leave a comment or check the XML package documentation for readHTMLTable.
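For reference, here is a minimal single-page sketch (assuming the XML and RCurl packages are loaded). The first URL is one of the actual season pages; the commented-out line just illustrates how you would grab a second table if your page had one, which this one does not, so treat that index as a placeholder for your own page’s layout.

one_page <- readHTMLTable(getURL('http://dataomaha.com/huskers/history/seasons/1997'), stringsAsFactors = FALSE)
season_1997 <- one_page[[1]]    # first (and here, only) table on the page
# other_table <- one_page[[2]]  # hypothetical: index 2 would pick a second table, if one existed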
The first step in the script is to build a list of URLs to scrape. If you are only extracting table data from one page, things are much simpler: you can call readHTMLTable (or htmltab) directly with the URL as an argument. In my case, I wanted to cycle through 58 different pages, one for each season from 1960 to 2017. You could create that list in Excel easily enough, but just as quickly, you can use the paste0 function, which is part of base R, to build it automatically.
fb_urls <- paste0('http://dataomaha.com/huskers/history/seasons/', 1960:2017)
This call takes the base URL (the first argument) and pastes each number in the sequence onto the end, producing one URL per year. Here I paste on 1960 through 2017, since I know the URLs on the site are constructed this way. The results are saved as fb_urls.
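A quick sanity check (optional; not part of the original script) confirms the URLs came out as expected:

head(fb_urls, 3)  # should print the 1960, 1961, and 1962 season URLs
length(fb_urls)   # should be 58, one per season from 1960 to 2017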
The next step is to make sure what we created is a simple, flat vector of URLs. Strictly speaking, paste0 already returns a character vector, so the unlist call below doesn’t change anything here, but it’s a harmless way to guarantee a flat vector if you built your URL list some other way.
fb_urls <- unlist(fb_urls)
Now that we have a list of Web pages to scrape, we need a function describing what to do on each page. Here we use a terrifically simple base R function named lapply, which is designed to apply a function across a list (or vector). The first argument is simply the list of URLs we want to work through; the function we apply fetches each page with getURL (from RCurl), uses readHTMLTable (from XML) to pull back the first table it finds, and saves the results in a list we will call fb_main.
fb_main <- lapply(fb_urls, function(x){readHTMLTable(getURL(x), stringsAsFactors=F)[[1]]})
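One caveat: if any single page fails to download or parse, the whole lapply call errors out. A more defensive variation (my own addition, not part of the original script) wraps the call in tryCatch so a bad page is skipped rather than stopping the run, and pauses briefly between requests:

fb_main <- lapply(fb_urls, function(x) {
  Sys.sleep(1)  # brief pause between requests, as a courtesy to the server
  tryCatch(readHTMLTable(getURL(x), stringsAsFactors = FALSE)[[1]],
           error = function(e) NULL)  # return NULL for any page that errors
})
fb_main <- Filter(Negate(is.null), fb_main)  # drop the failed pages, if any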
The result is a list of 58 separate data frames, one per season. But I just need one large data frame with all of the information, so I’ll stack everything into one data frame using rbindlist from the data.table package.
fb_main <- rbindlist(fb_main)
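One optional refinement (again my own, not from the original post), used in place of the rbindlist line above: the tables themselves don’t record which season each game came from, so you can name the list elements by year before binding and have rbindlist carry that along as a column.

names(fb_main) <- 1960:2017                      # name each list element by its season
fb_main <- rbindlist(fb_main, idcol = "season")  # idcol stores those names in a new column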
Done! We always want to double check our work, which we can do by simply printing the new data frame.
fb_main
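A few quick checks (optional) are also worth running before moving on:

dim(fb_main)   # rows and columns; one row per game across the 1960-2017 seasons
head(fb_main)  # the first few games, to eyeball the column names and values
str(fb_main)   # column types, which we'll want to know before cleaning in Part 2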
In due time, I’ll follow up with Part 2, which will focus on reviewing and cleaning the returned data, as well as options for storing it in an accessible place for future data analysis.