How to Collect Twitter Data Using R and the Twitter Search API

Luckily, the hard work is done for us. There is a terrific R package called twitteR that lets you connect to the Twitter Search API easily. You just need to know a few arguments to ask for the data you need.

First, let’s explore what type of data and limitations exist in the Twitter Search API so we know what we have to work with.

Official documentation: https://developer.twitter.com/en/docs/tweets/search/api-reference/get-search-tweets

“Returns a collection of relevant Tweets matching a specified query.

Please note that Twitter’s search service and, by extension, the Search API is not meant to be an exhaustive source of Tweets. Not all Tweets will be indexed or made available via the search interface.

To learn how to use Twitter Search effectively, please see the Standard search operators page for a list of available filter operators. Also, see the Working with Timelines page to learn best practices for navigating results by since_id and max_id.”

The first step, of course, is to activate the packages you need for this project. If you don’t have these packages installed already, you’ll need to do that too. I have all of these installed, so I’ve commented out that part here.

# install.packages("twitteR")
library(twitteR)

We’re getting close, but before we can request data from the Twitter API, we have to provide some credentials to make sure we aren’t doing anything nefarious. To accomplish this, you need four things:

  • consumer_key
  • consumer_secret
  • access_token
  • access_secret

No worries, all of these can be found here (you’ll need an active Twitter account): https://apps.twitter.com/. Once you’re logged in, you need to create an “application,” which essentially just says you want to work on a project. Go ahead and fill in the details and you should receive the four credentials above.

Now we’ll save each of these strings in this manner (note that you’ll need to substitute your own string wherever I have ‘abc123’):

consumer_key <- 'abc123'
consumer_secret <- 'abc123'
access_token <- 'abc123'
access_secret <- 'abc123'
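
If you would rather not paste keys directly into a script, one option is to read them from environment variables. The sketch below is only an illustration: the function name read_twitter_creds and the four environment variable names are my own placeholders, not anything defined by twitteR or Twitter.

```r
# Hypothetical environment variable names (placeholders, not part of twitteR).
# One way to set them is one line per key in ~/.Renviron, for example:
#   TWITTER_CONSUMER_KEY=abc123
read_twitter_creds <- function() {
  vars <- c(
    consumer_key    = "TWITTER_CONSUMER_KEY",
    consumer_secret = "TWITTER_CONSUMER_SECRET",
    access_token    = "TWITTER_ACCESS_TOKEN",
    access_secret   = "TWITTER_ACCESS_SECRET"
  )
  # Look up each variable; anything that is not set comes back as ""
  creds <- vapply(vars, function(v) Sys.getenv(v, unset = ""), character(1))
  missing <- names(creds)[creds == ""]
  if (length(missing) > 0) {
    stop("Missing credentials: ", paste(missing, collapse = ", "))
  }
  as.list(creds)
}
```

With the variables set, creds <- read_twitter_creds() returns a named list, and the four values (creds$consumer_key and so on) can be passed to setup_twitter_oauth in place of the hard-coded strings.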

Now, let’s get authorized and begin requesting Twitter data.

setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)

Below is a simple request for Tweets that you can modify to your liking. In this example, I save my request as “nebtweets” and call for the information with “searchTwitter,” which is part of the twitteR package we installed and activated. I’ve arbitrarily set the number of results I want back to 200, starting in February 2018. It’s important to note that the Twitter Search API does NOT give you full access to Twitter’s data. It is only an index of recent Tweets (roughly the past week on the standard tier), so you may get back warnings or fewer results than you requested if you ask for something that is not available.

nebtweets <- searchTwitter("nebrasketball", n=200, lang="en", since = '2018-02-01')

Now we have the Tweets saved, but they're not in a nice, neat data frame. This can easily be solved using "twListToDF", which is also part of the twitteR package.

nebtweetsDF = twListToDF(nebtweets)
View(nebtweetsDF)

Now you're ready to analyze. Enjoy.

How to Create a US Heatmap in R

Creating a simple US map in R can be done in a number of ways. Two popular packages for this type of project are ggplot2 and plotly. In this case, I used plotly.

The data for my map is a list of US state codes (NE, IL, MA, CA, etc.). A second variable gives a count of how many players the Nebraska football team is targeting in each state. In order to follow my example with your own data, you will need to have the state code variable and some numeric variable to map it against.
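
If your data starts out as one row per player rather than one row per state, a count per state can be built in base R first. This is a minimal sketch with made-up players; the names targets_raw, state_targets, state_code and targets are my own illustrative choices, not the column names from my actual table.

```r
# Made-up player-level data, for illustration only
targets_raw <- data.frame(
  player     = c("Player A", "Player B", "Player C", "Player D", "Player E"),
  state_code = c("NE", "TX", "NE", "CA", "TX"),
  stringsAsFactors = FALSE
)

# Count how many players come from each state
counts <- table(targets_raw$state_code)

# Turn the counts into a two-column table: state code plus a numeric variable
state_targets <- data.frame(
  state_code = names(counts),
  targets    = as.integer(counts),
  stringsAsFactors = FALSE
)
state_targets
```

The result has one row per state, which is the shape the map code expects.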

Once you have your data in a table and are ready to use it, create the following styling options for the map, which we will apply later:

library(plotly)

# Styling options for the map, applied later inside layout()
mapDetails <- list(
  scope = 'usa',
  projection = list(type = 'albers usa'),
  showlakes = TRUE,
  lakecolor = toRGB('white')
)

As you may have guessed, “scope” determines the type of map, in this case a map of the USA. We will also determine here what to do with lakes and how to color them.

usaMap <- plot_geo(X2018targets, locationmode = 'USA-states') %>%
  add_trace(
    z = X2018targets$Targets, locations = X2018targets$`State Code`,
    color = X2018targets$Targets, colors = 'Blues'
  ) %>%
  colorbar(title = "Targets") %>%
  layout(
    title = '2018 Nebraska Football Targets by State (February 2018)',
    geo = mapDetails  # the styling list created earlier
  )

The code above connects my data to the map and allows me to modify text within the plot area. My data frame is called “X2018targets,” so you’ll need to replace this with your data frame name. You’ll also need to set “z” to your numeric data and “locations” to your state code variable.

When you’re finished, simply type “usaMap” and hit enter to see your plot appear (I use RStudio, by the way, and I’m assuming you likely do as well). If you have any trouble or questions, let me know in the comments.

How to Rank Fantasy Baseball Players for Your League Stats

If you play in a fantasy baseball dynasty league, you have no doubt raked through countless projections and rankings, attempting to forecast the year to come and spot break-out players.

There is no shortage of useful information out there, but one of the problems with pre-season fantasy baseball rankings is that they are catch-alls, intended to appease the masses. That makes sense, but if you really need insight into the value of a player in your league, you should score players based only on the categories your league uses.

Below I’ll show you a fool-proof method for creating solid rankings based on your specific league stats. I mostly used Excel for this exercise, but it can easily be done in R as well.

Step 1 – Get You Some Data

There’s no point in re-inventing the wheel here. I downloaded 2018 projections from fantasypros.com. Their Zeile Consensus Projections are solid and built from a number of sources. I’ve been told FanGraphs does an excellent job with their projections, so those are worth a look too, but the FantasyPros data had complete games easily accessible and that is a category I needed, so there you have it.

Step 2 – Clean it Up

In this step, I simply deleted the categories in each spreadsheet (hitting and pitching data are separated…for now) that I did not want in my way. I also created a calculated field for K/BB using the K and BB columns, since that is a category in my league.

Step 3 – The Super Secret Formula

So, in the end, I need to come up with a value for how much a player contributes in each category and add those values up to give the player an overall score relative to everyone else in the league.

To do this, we need to standardize the data so that each category uses roughly the same scoring range. This can be done by calculating the Z-score for every player in each category. Don’t worry too much if you’re not familiar with Z-scores. A Z-score tells you how far a player is from the average (mean) in a category, measured in standard deviations rather than in the category’s raw units: (player’s stat - category average) / category standard deviation.

Example:

Let’s say the mean number of hits and the mean batting average are 150 and .270, respectively. If a batter is projected to get 170 hits and bat .310, we wouldn’t want to say he is 20 hits better and add that to his .040 better batting average for a score of 20.040. That would give nearly all of the weight to hits, and batting average would be negligible even if he batted an absurd .400 for the season.

This is where the standardization comes in. We can give each category an even scoring system if instead we say the faux player above is, say, 1 standard deviation above the mean for batting average and 1.4 above the mean for hits. So how do we calculate this?

It’s actually quite easy. The most tedious part is actually adding the scores at the end. First, you will need to get the average and overall standard deviation for each category in order to make this work.

=AVERAGE(data range here)

=STDEVP(data range here)

I place these formulas below the row of players for each category. It looks something like this:

Now that we have an average and standard deviation for each category, we can calculate Z-scores. In Excel, there is a formula for this called “=STANDARDIZE.”

I created a new column to the right of my dataset and tallied up each player’s Z-score for each category to give them an overall score. The STANDARDIZE function in Excel accepts three arguments that you will need to supply. First, you need to point to the player’s stat for a given category, followed by the average for that category and the standard deviation for that category (both of which we calculated earlier). Doing that for each category will look something like this:

=STANDARDIZE(E2,$E$303,$E$304)+STANDARDIZE(F2,$F$303,$F$304)+STANDARDIZE(G2,$G$303,$G$304)+STANDARDIZE(J2,$J$303,$J$304)+STANDARDIZE(L2,$L$303,$L$304)+STANDARDIZE(M2,$M$303,$M$304)+STANDARDIZE(N2,$N$303,$N$304)
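
For anyone who would rather do this step in R, here is a minimal sketch of the same idea. The four hitters and their numbers are made up, and the helper names pop_sd and z_score are my own. Note that R's built-in sd() divides by n - 1, so I write out a population standard deviation by hand to match Excel's STDEVP.

```r
# Made-up projections for four hitters; column names are illustrative
hitters <- data.frame(
  player = c("A", "B", "C", "D"),
  H      = c(170, 150, 140, 140),
  HR     = c(30, 20, 25, 5),
  AVG    = c(0.310, 0.270, 0.250, 0.250),
  stringsAsFactors = FALSE
)

# Population standard deviation, to match Excel's STDEVP
pop_sd <- function(x) sqrt(mean((x - mean(x))^2))

# Same calculation as Excel's STANDARDIZE(x, mean, standard_dev)
z_score <- function(x) (x - mean(x)) / pop_sd(x)

# Z-score every category, then add the Z-scores across each player's row
categories <- c("H", "HR", "AVG")
z <- sapply(hitters[categories], z_score)
hitters$score <- rowSums(z)

# Highest overall score first
hitters[order(-hitters$score), c("player", "score")]
```

Swap in your own league's categories in the categories vector. Because each category is centered on its mean, the overall scores add up to zero across all players.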

This score basically says that if these projections hold true, then this is how valuable each player is relative to each other for your league. Some of the scores may surprise you, but remember that it is aligned with your league and does not give weight to irrelevant categories, regardless of how big of a name a player might be in reality.

If you’re good with the above, that’s fine. But I made one more adjustment to account for the number of at-bats and innings pitched. In other words, I wanted to give more value to a player who batted .295 over 500 at-bats than to a player who batted .300, but over only 200 at-bats.

In order to do this, I created a new column for any stat that was an average or a ratio (BA, OBP, ERA, WHIP, K/BB, etc.) and created a new metric to use instead. For each of these, I took the player’s stat, subtracted the average for that stat and then multiplied the result by the number of at-bats or innings pitched, depending on the stat.

ERA, for example, would now be:

(player ERA - league average ERA) * Innings Pitched

I then used that variable instead of the ERA variable to calculate my Z-score.
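
Here is a rough sketch of that adjustment in R, using three made-up pitchers. The "league average" is just the mean of the sample. The sign flip at the end is my own assumption rather than something from the spreadsheet: since a lower ERA is better, I negate the Z-score so that a higher number still means a better player.

```r
# Made-up pitchers, for illustration only
pitchers <- data.frame(
  player = c("P1", "P2", "P3"),
  ERA    = c(3.00, 3.00, 4.50),
  IP     = c(200, 60, 180),
  stringsAsFactors = FALSE
)

# (player ERA - league average ERA) * innings pitched
pitchers$ERA_weighted <- (pitchers$ERA - mean(pitchers$ERA)) * pitchers$IP

# Population standard deviation and Z-score helpers (hypothetical names)
pop_sd  <- function(x) sqrt(mean((x - mean(x))^2))
z_score <- function(x) (x - mean(x)) / pop_sd(x)

# Assumption: lower ERA is better, so flip the sign of the Z-score
pitchers$ERA_z <- -z_score(pitchers$ERA_weighted)
pitchers
```

P1 and P2 have the same ERA, but P1 gets the better score because he is projected to sustain it over far more innings.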

Step 4 – Merging the Data into one Sheet

If you harken back to the beginning of this post, recall that there are still two spreadsheets at this point — one for hitters and one for throwers. We need to bind these together and sort by the new score we created in order to get our rankings!

Here is my sheet after combining all of the columns together. I also added a new column called “Type” that indicates whether a player is a hitter or pitcher, because I knew I would need it later on for a separate project.
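
In R, the same merge could look something like the sketch below. The players and scores are made up, and it assumes both tables have already been trimmed down to the same columns.

```r
# Made-up overall scores for two hitters and two pitchers
hitters  <- data.frame(player = c("H1", "H2"), score = c(2.1, -0.4),
                       stringsAsFactors = FALSE)
pitchers <- data.frame(player = c("P1", "P2"), score = c(1.3, 0.2),
                       stringsAsFactors = FALSE)

# Label each table before stacking so the player type is not lost
hitters$Type  <- "Hitter"
pitchers$Type <- "Pitcher"

# Stack the two tables and sort by overall score, best first
rankings <- rbind(hitters, pitchers)
rankings <- rankings[order(-rankings$score), ]
rankings$rank <- seq_len(nrow(rankings))
rownames(rankings) <- NULL
rankings
```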

Step 5 – A Dose of Reality

I personally think this is a much better approach than looking at the generic rankings that are spewed out annually. With that said, it has its flaws.

For some reason, pitchers seemed to be over-valued in my scoring, and after some analysis I found that pitchers who were over-indexing in complete games were getting scores that were far too high in that category. I tried a number of things to dilute this but ultimately landed on subtracting the number of CGs from each player’s overall score (most players had zero, of course). Somehow it seemed to work, and I feel pretty good about what I’ve produced.

You may run into a similar scenario and may need to make modifications to your scores using trial and error. Hope that’s not the case and you end up with something great!

Scraping Sports Stats Using R (Part 2)

In Part 1 of this blog post I show how you can scrape tables of sports data from websites and store that data in a data frame for data analysis (have I said ‘data’ enough times yet?).

Whenever you are automating data collection in this sort of way, you always want to get a “health check” on your new table to make sure nothing went awry. There are countless things that can go wrong from missing data to web pages timing out or blocking you from collecting data — and it is critical we understand whether any of this has happened before moving on to an analysis phase. You might call this a “data cleaning” phase to get you into position to analyze.
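
A quick health check can be as simple as the base R sketch below. The scraped table here is a tiny made-up stand-in with one missing value, not the real data.

```r
# Tiny stand-in for a scraped table, with one missing value
scraped <- data.frame(
  Date     = c("Sept. 17, 1960", "Sept. 24, 1960", "Oct. 1, 1960"),
  Opponent = c("#4 Texas", "Minnesota", NA),
  Score    = c("14-13", "26-14", "10-7"),
  stringsAsFactors = FALSE
)

dim(scraped)   # number of rows and columns: did every page come back?
str(scraped)   # column types: did numbers arrive as text?

# Missing values per column, and fully duplicated rows
na_counts <- colSums(is.na(scraped))
dup_rows  <- sum(duplicated(scraped))
na_counts
dup_rows
```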

With the head function you can quickly get a glimpse of what your variable names look like, along with a few observations. Let’s take a look:

head(fb_main)
            Date NU rank     Opponent Site Outcome Score <U+00A0>
1 Sept. 17, 1960             #4 Texas Away     Win 14-13  Details
2 Sept. 24, 1960     #12    Minnesota Home    Loss 26-14  Details
3   Oct. 1, 1960           Iowa State Home    Loss  10-7  Details
4   Oct. 8, 1960         Kansas State Home     Win  17-7  Details
5  Oct. 15, 1960                 Army Home     Win  14-9  Details
6  Oct. 22, 1960             Colorado Away    Loss  19-6  Details

Immediately a few concerns jump out at me. First, there is a column where all of the values contain the word “Details.” On the original site I drew this information from, this column linked out to details for each game. I do not need this column for any analysis, so I will remove it. There are numerous ways to do this. Since I know it is my seventh column, I’ll just do it this way. If you’re unsure about how to tackle this (football pun), then you may want to save the table as a different name and keep the original.

fb_main <- fb_main[,-7]
head(fb_main)
            Date NU rank     Opponent Site Outcome Score
1 Sept. 17, 1960             #4 Texas Away     Win 14-13
2 Sept. 24, 1960     #12    Minnesota Home    Loss 26-14
3   Oct. 1, 1960           Iowa State Home    Loss  10-7
4   Oct. 8, 1960         Kansas State Home     Win  17-7
5  Oct. 15, 1960                 Army Home     Win  14-9
6  Oct. 22, 1960             Colorado Away    Loss  19-6

Much better. But it sure would be nice if the score was split into two columns in case I wanted to sum or average any of the scores during analysis. One variable can easily be split into two variables with a convenient function called separate, which is part of the tidyr package. It would look like this:

fb_main <- separate(fb_main, Score, into = c("Winner Score", "Loser Score"), sep = "-")

But not so fast, my friend. This really isn’t very helpful at all. We want the Husker score in one column and the opponent score in another, not a mix. This adds some complexity to our code, but we can still accomplish it in a few easy steps. The separate function above comes from tidyr, so make sure that package is loaded; the rest uses base R.

library(tidyr)
fb_main$NUScore <- as.numeric(ifelse(fb_main$Outcome=="Win", fb_main$`Winner Score`, fb_main$`Loser Score`))
fb_main$OppScore <- as.numeric(ifelse(fb_main$Outcome=="Win", fb_main$`Loser Score`, fb_main$`Winner Score`))
fb_main <- fb_main[, -c(6:7)]

The first two lines above create the new columns we want using an ifelse function. And I saved myself some time by making the columns numeric. The ifelse statement has three arguments: condition, value of new row if condition is true, value of new row if condition is false. The last line is simply deleting the Winner Score and Loser Score variables, which we no longer need. Now look at the data:

head(fb_main)
            Date NU rank     Opponent Site Outcome NUScore OppScore
1 Sept. 17, 1960             #4 Texas Away     Win      14       13
2 Sept. 24, 1960     #12    Minnesota Home    Loss      14       26
3   Oct. 1, 1960           Iowa State Home    Loss       7       10
4   Oct. 8, 1960         Kansas State Home     Win      17        7
5  Oct. 15, 1960                 Army Home     Win      14        9
6  Oct. 22, 1960             Colorado Away    Loss       6       19
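
To see the whole idea run end to end without any extra packages, here is a small self-contained sketch on four of the rows above. It swaps in base R's strsplit as a stand-in for tidyr's separate, so it is a variation on my code rather than a copy of it.

```r
# Four sample rows with only the columns needed for this step
games <- data.frame(
  Outcome = c("Win", "Loss", "Loss", "Win"),
  Score   = c("14-13", "26-14", "10-7", "17-7"),
  stringsAsFactors = FALSE
)

# Split "14-13" into winner and loser points
# (base R stand-in for tidyr::separate)
parts  <- do.call(rbind, strsplit(games$Score, "-", fixed = TRUE))
winner <- as.numeric(parts[, 1])
loser  <- as.numeric(parts[, 2])

# Nebraska gets the winner's points on a win, the loser's points otherwise
won <- games$Outcome == "Win"
games$NUScore  <- ifelse(won, winner, loser)
games$OppScore <- ifelse(won, loser, winner)
games$Score <- NULL
games
```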

Some other observations I have made about this data are that the Opponent variable contains both the opponent name and their ranking (this could create difficulties down the road), some values are missing, and the rankings contain a hash symbol (#). These are all worth tidying up before analysis, but I’ll stop there since the above should provide enough direction to complete those tasks.
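
As one possible starting point for the Opponent cleanup, the sketch below pulls a leading ranking such as "#4" into its own numeric column and leaves just the team name. The sample values are made up to mimic the format above, and the names opp_rank and opp_name are my own.

```r
# Sample values in the same format as the Opponent column (made up)
opponent <- c("#4 Texas", "Minnesota", "Iowa State", "#12 Oklahoma")

# Which values start with a ranking like "#4"?
has_rank <- grepl("^#[0-9]+", opponent)

# Pull the ranking out as an integer; unranked opponents stay NA
opp_rank <- rep(NA_integer_, length(opponent))
opp_rank[has_rank] <- as.integer(sub("^#([0-9]+).*$", "\\1", opponent[has_rank]))

# Strip the ranking and the space after it, leaving just the team name
opp_name <- sub("^#[0-9]+[[:space:]]+", "", opponent)

data.frame(opp_name, opp_rank, stringsAsFactors = FALSE)
```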

One final consideration once you have collected all of this data is where to store it. There are many options, and each of them has a different method in R. But here are a few to consider so that you do not need to re-gather the data each time you need it for analysis:

  • Save as a data frame in R
  • Use write.csv to save it on your computer as a .csv file (a similar function exists for Excel)
  • Send to a local or cloud-based database
  • Upload to data.world
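
The first two options can be sketched in a few lines of base R. The file names are hypothetical, and I write to a temporary folder here only so the example cleans up after itself; in practice you would point these at a project folder.

```r
# Small stand-in for the cleaned table
games <- data.frame(
  Opponent = c("Texas", "Minnesota"),
  NUScore  = c(14, 14),
  OppScore = c(13, 26),
  stringsAsFactors = FALSE
)

# Hypothetical file names, written to a temporary folder
csv_path <- file.path(tempdir(), "husker_games.csv")
rds_path <- file.path(tempdir(), "husker_games.rds")

# CSV: readable by Excel and most other tools
write.csv(games, csv_path, row.names = FALSE)

# RDS: keeps the data frame exactly as R has it, including column types
saveRDS(games, rds_path)

# Read both back in
from_csv <- read.csv(csv_path, stringsAsFactors = FALSE)
from_rds <- readRDS(rds_path)
```

The RDS file is handy when you will only reopen the data in R, while the CSV is the safer choice for sharing.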

Here is the final data, uploaded to data.world: https://data.world/jeffgswanson/husker-football-game-results

Enjoy.

Scraping Sports Stats Using R (Part 1)

You can scrape sports data from Web pages and store it in your own data frame/table for future analysis using the handy readHTMLTable function (from the XML package) together with lapply in R. My code is below.

library(XML)         # readHTMLTable
library(RCurl)       # getURL
library(data.table)  # rbindlist

fb_urls <- paste0('http://dataomaha.com/huskers/history/seasons/', 1960:2017)

fb_urls <- unlist(fb_urls)

fb_main <- lapply(fb_urls, function(x){readHTMLTable(getURL(x), stringsAsFactors=F)[[1]]})

fb_main <- rbindlist(fb_main)

fb_main

In this two-part post, I’ll show you how to use R to scrape tables from multiple Web pages to use for your own analysis. There are a number of steps involved here and, depending on which page(s) you’re trying to get data from, it can get complicated, but thankfully R has some brilliant functions like readHTMLTable (part of the XML package) to do the heavy lifting.

First, let’s take a look at the web pages I’ll be scraping. I want to collect game summary data for Nebraska Cornhusker football games (Go Big Red), which I found here: http://dataomaha.com/huskers/history/seasons/1997.

In this example, there is only one table. If your page has multiple tables, you may need to modify the code by changing which table is selected from the readHTMLTable result (see the ‘[[1]]’ index at the end of the call, which picks the first table). If you have further questions on this, leave a comment or search for documentation on this function.

Below, the first step I took in forming a script was to build a list of URLs I wanted to scrape. If you are only extracting table data from one page, things are much simpler. You can just use readHTMLTable or htmltab and use the URL as an argument in the function. In my case, I wanted to cycle through 58 different pages (one per season from 1960 through 2017). You could create a list in Excel quite easily, but just as quickly, you can use the paste0 function, which is loaded into R by default, to create the list automatically.

fb_urls <- paste0('http://dataomaha.com/huskers/history/seasons/', 1960:2017)

Basically, this function is saying to take the base URL (first argument) and paste a number to the end until the list is complete. In this case, I am pasting 1960-2017, since I know the URLs I am scraping are constructed this way. Those are then saved as fb_urls.
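
A quick way to confirm what paste0 produced is to check the length and peek at both ends of the result. This needs nothing beyond base R; the name base_url is just a convenience I added here.

```r
# Base URL from above, stored separately for readability
base_url <- 'http://dataomaha.com/huskers/history/seasons/'

# paste0 recycles the base URL across every year in 1960:2017
fb_urls <- paste0(base_url, 1960:2017)

length(fb_urls)        # one URL per season
head(fb_urls, 2)       # first two URLs
tail(fb_urls, 1)       # last URL
is.character(fb_urls)  # paste0 returns a character vector
```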

The next step is to make sure what we created is in the right format: a plain vector. paste0 already returns a character vector, so unlist leaves it unchanged here, but it is a harmless safeguard.

fb_urls <- unlist(fb_urls)

Now that we have a list of Web pages to scrape, we need to write a function describing what we want to do on each page. Here, we are using a terrifically simple function named lapply, which is designed to apply a function to each element of a list. The first argument is the list we want to apply the function to. The second is a function that uses readHTMLTable (from the XML package) along with getURL (from the RCurl package) to pull back the first table it finds on each page. The results are saved in a list we will call fb_main.

fb_main <- lapply(fb_urls, function(x){readHTMLTable(getURL(x), stringsAsFactors=F)[[1]]})
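
If one of the pages times out or errors, the lapply call above stops partway through. One way to make it more forgiving is to wrap the reader in tryCatch, as in the sketch below. The names scrape_all, read_fun and fake_reader are hypothetical helpers of mine, and the demonstration uses a fake reader and example.com URLs so that nothing is actually fetched.

```r
# read_fun is a placeholder for whatever reads one page, for example
# function(x) readHTMLTable(getURL(x), stringsAsFactors = FALSE)[[1]]
scrape_all <- function(urls, read_fun, pause = 1) {
  lapply(urls, function(u) {
    Sys.sleep(pause)  # brief pause between requests
    tryCatch(
      read_fun(u),
      error = function(e) {
        # Report the failure and keep going with the next page
        message("Failed: ", u, " (", conditionMessage(e), ")")
        NULL
      }
    )
  })
}

# Demonstration with a fake reader that fails on one page
fake_reader <- function(u) {
  if (grepl("1961$", u)) stop("simulated timeout")
  data.frame(url = u, stringsAsFactors = FALSE)
}
demo_urls <- paste0("http://example.com/seasons/", 1960:1962)
results   <- scrape_all(demo_urls, fake_reader, pause = 0)

# Note which pages failed, then drop the empty results
failed  <- demo_urls[vapply(results, is.null, logical(1))]
results <- Filter(Negate(is.null), results)
```

The failed vector tells you which pages to retry later.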

The result is a list of 58 separate data frames, one per season. But I just need one large data frame with all of the information. So I’ll merge everything into one table using rbindlist from the data.table package.

fb_main <- rbindlist(fb_main)

Done! We always want to double check our work, which we can do by simply calling the new data frame.

fb_main

In due time, I’ll follow up with Part 2, which will focus on reviewing and cleaning the returned data, as well as options for storing it in an accessible place for future data analysis.

Digital Marketing in R: How to Create Word Clouds

I recently recorded my very first (much too lengthy) YouTube video. The video walks through taking a list of keywords and creating a word cloud in R.

While I do not find word clouds to be particularly useful, there are a number of terrific data science applications that you come across during this exercise that are worth knowing — like removing stop words and stemming.

Un-edited thoughts on Topical Keyword Research and Intent Based SEO

I have been thinking a lot lately about topical keyword research and how this plays a role in SEO, content hierarchy and the data science approaches we use to accomplish these ideas. Let me back up…

Topical Keyword Research and Intent-based Search
Whether you’re an SEO or just someone who’s observed Google search results over time, it’s clear that over the years Google SERPs have become much more “semantic.” But what does that really mean?

In short, search engines now use natural language processing (NLP) to better understand how human language works. That might mean understanding synonyms, tailoring results to the type of device you are using, or any number of other things. The bottom line is that Google has moved away from showing results that are heavily keyword-based (returning pages that contain the exact phrase you typed or something very similar) toward a semantic or intent-based approach. The results might contain the keyword you searched, but Google is more concerned with showing the results you intend to see. If a related page is a better result and does not contain your exact keywords, that’s okay.

How Does this Impact SEO?
In a big way. This is not new, but we need to approach keyword research and craft content around intent rather than specific 1:1 keywords. In other words, we should get a list of keywords, cluster them into intent groups and then build content based on those intent groups instead of individual keywords. This is ultimately what the user wants: not a bunch of slight variations of content that are more or less similar. And Google theoretically will rank this content well if it meets user intent and they can connect it to the query.

How is this Related to Data Science?
For one, clustering is a big topic in data science and can be executed in R. There are no doubt SEO tools that do this for you, but if you want more control you might consider unsupervised clustering in R (or supervised classification, if you already have labeled intent groups).
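
To make that a little more concrete, here is a bare-bones sketch of unsupervised keyword clustering in base R. The six keywords are made up, and grouping on shared words with hierarchical clustering (hclust) is only one of many possible approaches, chosen here because it needs no extra packages.

```r
# Made-up keyword list with two obvious intent groups
keywords <- c(
  "buy running shoes", "running shoes sale", "cheap running shoes",
  "marathon training plan", "beginner marathon training",
  "marathon training schedule"
)

# Split each keyword into lowercase words
tokens <- strsplit(tolower(keywords), "[[:space:]]+")
vocab  <- sort(unique(unlist(tokens)))

# Binary keyword-by-word matrix: 1 if the keyword contains the word
dtm <- t(sapply(tokens, function(tok) as.integer(vocab %in% tok)))
dimnames(dtm) <- list(keywords, vocab)

# Hierarchical clustering on the distances between keywords
fit <- hclust(dist(dtm), method = "average")

# Cut the tree into two intent groups
clusters <- cutree(fit, k = 2)
split(keywords, clusters)
```

On a real keyword list you would not know the number of groups in advance, so plotting the tree with plot(fit) is a reasonable way to pick a cut point.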

Final thoughts
I can see a bigger picture here as well. As we craft our content based on intent and clustering, we can almost take a testing approach to content and site information architecture in the future. Basically, one could build out their intent groups and with that list merge content that is part of one cluster or fill out any gaps. Over time analytics should show how users move through the funnel and if any steps are needed to provide an easy path for users (a path to whatever your goal happens to be).

But I think there is a paid media tie-in here as well. Not often enough do we look at paid media performance from an SEO perspective and document which keywords drive conversions versus which are more informational. We should be using that information to learn how to build out information architecture as well. It should be an additional layer that helps us understand how to break up similar content throughout the user journey and confirm which keywords belong to which bucket.

How Even Were Whistles in the 2017 NBA Playoffs?

Team                      Personal Fouls/Game   Variance from Average
Average                   21                     0
Washington Wizards        23                     2
Indiana Pacers            23                     2
Oklahoma City Thunder     22                     1
Memphis Grizzlies         22                     1
Golden State Warriors     22                     1
Portland Trail Blazers    22                     1
Atlanta Hawks             21                     0
Milwaukee Bucks           21                     0
Utah Jazz                 21                     0
Houston Rockets           21                     0
Toronto Raptors           21                     0
Boston Celtics            21                     0
Los Angeles Clippers      20                    -1
Cleveland Cavaliers       19                    -2
San Antonio Spurs         19                    -2
Chicago Bulls             18                    -3

A Word on Digital Marketing

I have spent over a decade working in the digital marketing space. It’s an area I know well, and it has a lot of crossover with data science. In fact, almost everything I learn in data science ends up being applied to one of my two interest areas: digital marketing or sports.

With that said, I’ll be posting digital marketing ideas and experiences from time to time. These posts may not always tie back to data or analytics, but I’ll try my best to connect the two when possible.

Data Science Course Recommendation: Udemy Data Science A-Z

I want to give some props to a course I recently took online at Udemy.com. The course is called Data Science A-Z and is taught by someone by the name of Kirill Eremenko.

First, I just want to stress that I am not being paid for this endorsement in any way. Just want to share my review with you all.

The price was right at a mere $10. Not sure if that was a short-term promotional price or how long it will last, but it’s well worth it — even as a refresher.

There are three sections: data visualization with Tableau, Statistics/Modeling, and Data Preparation. The sections are not dependent on each other and can be taken in any order, which adds a nice element of flexibility to the whole thing.

As you probably know, there are countless courses out there but what I appreciate about this one is that it was easy to digest if you have any sort of background in these areas and it explains not only how to approach these disciplines but why you are doing them at all.

During the course, I was also introduced to a great free statistical program called Gretl. You can download it here. If you have used SPSS or SAS, you’ll pick it up in no time at all.

Find out more here: https://www.udemy.com/datascience/

I also really like Data Camp, but there is a monthly fee associated with membership. I believe it’s somewhere between $20-30/month.

Thanks for reading.