How Nebraska Voted in 2016

Below is a visualization of how Nebraskans voted in the 2016 presidential election. Here are a few key points:

  • The data is based on how many more votes were counted for Trump versus Clinton, the two front-runners
    • Red and yellow areas are ones where significantly more votes were Republican
    • Green areas are where a few thousand more people in that county voted Republican
    • Blue areas were much closer — ones where Trump prevailed by less than 1,000 votes
    • Dark blue and purple areas were very, very close
    • Two counties appear gray — these are ones in which Clinton received a higher number of votes
  • Clinton won by a relatively small margin in Nebraska’s two largest counties, Douglas and Lancaster, where Omaha and Lincoln reside, respectively
  • Trump won by a rather large margin in the counties immediately surrounding Douglas and Lancaster, especially Sarpy, where he prevailed by over 15,000 votes (roughly three times the combined margin by which Clinton won Douglas and Lancaster)
  • Smaller, more western counties typically voted strongly Republican

Resources
data: NYT Election Results
county names map: https://www.digital-topo-maps.com/county-map/nebraska.shtml
state election results: http://www.sos.ne.gov/elec/2016/pdf/2016-canvass-book.pdf

How to Create a Nebraska Map in ggplot2 by County

In this post, I show you how to create the outline of a Nebraska map in ggplot2, separated by counties. R already has the latitudes, longitudes, and other mapping essentials we need; it's just a matter of accessing them, which is one of the reasons R is so great and easy to use.

Using map_data() (a ggplot2 helper that draws on the maps package), we can turn all of these coordinates into a data frame. Let's name ours 'states', and then also create a smaller data frame that contains only the Nebraska rows:

library(ggplot2)   # provides map_data(); you'll also need the maps package installed

states <- map_data("state")
ne_coords <- subset(states, region=="nebraska")
head(ne_coords)

Now, let’s do the same thing, but with counties…

counties <- map_data("county")
ne_county <- subset(counties, region=="nebraska")
head(ne_county)

Okay! We have all of the data we need to draw the map. We already loaded ggplot2 above, but it never hurts to make sure before we see what this looks like when we visualize it.

library(ggplot2)

Nebraska map without county lines:

ne_map <- ggplot(data = ne_coords, mapping = aes(x = long, y = lat, group = group)) +
  coord_fixed(1.3) +
  geom_polygon(color = "black", fill = "gray")

ne_map


Nebraska map with county lines:

ne_map + theme_classic() + geom_polygon(data = ne_county, fill = NA, color = "white")


And that's it. Of course, you'll want to bring in some real data and join it onto this map to make a genuinely useful visualization, but this should get you started.
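If you want a sense of what that join might look like, here is a minimal sketch (not from the original walkthrough). It assumes a hypothetical data frame called county_data with a "subregion" column (lowercase county names, matching ne_county) and a numeric "value" column:

# county_data is hypothetical: columns "subregion" (lowercase county name) and "value"
ne_filled <- merge(ne_county, county_data, by = "subregion")
ne_filled <- ne_filled[order(ne_filled$order), ]   # restore polygon point order after the merge

ggplot(ne_filled, aes(x = long, y = lat, group = group, fill = value)) +
  geom_polygon(color = "white") +
  coord_fixed(1.3) +
  theme_void()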

Predicting Tesla’s Federal EV Tax Credit in R (and why I’m probably wrong)

First, some context: the federal government provides a tax credit to buyers of electric vehicles, phased out on a per-manufacturer basis. For details, see here, but in short, the full tax credit ($7,500) is available to consumers now through the quarter after the manufacturer sells and delivers its 200,000th vehicle.

So, for example, if Tesla delivers its 200,000th vehicle on June 1, 2018 (Q2), then the full credit is available for the remainder of that quarter plus one more full quarter (Q3 in this example). After that, the tax credit is cut in half for two additional quarters, and then cut in half again for two final quarters before the credit ends completely for the manufacturer.
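To make the schedule concrete, here is a quick sketch (my own illustration of the rule described above, not official language) that lays out the credit amount by quarter, counted from the quarter in which the 200,000th vehicle is delivered:

# credit amount by quarter, relative to the quarter the 200,000th vehicle ships
phaseout <- data.frame(
  quarters_after_trigger = 0:7,
  credit = c(7500, 7500,   # trigger quarter plus one more full quarter
             3750, 3750,   # cut in half for two quarters
             1875, 1875,   # halved again for two final quarters
             0, 0)         # credit ends completely
)
phaseout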

Using R, I forecasted Tesla vehicle deliveries in the United States and then plotted the results with ggplot2. Here's how:

Thankfully, I found a great site called InsideEVs that has collected (or closely projected) monthly US deliveries with their Plug-In Sales Scorecard. The historical data was not in crawlable tables (images instead), so instead of scraping I painstakingly copied everything over to Excel. It was worth the effort, since I knew I would not need to repeat the task.

After adding up models by month I ended up with this (displaying a condensed version here since there were about 100 rows):

Month  Year  Quarter  Deliveries  Cumulative
Jan    2012  Q1                0           0
Feb    2012  Q1                0           0
...
April  2018  Q2             6150      183810
May    2018  Q2             9220      193030

Here is a link to the data if you would like to use it: Estimated Tesla Deliveries by Month Worksheet

I will ultimately need the data in quarters, but I decided to forecast using months as the period because it gives me more data points and is more likely to detect patterns, like seasonality, that way. I used the forecast function in R to predict future deliveries based on previous months.

Assuming you have read the data into your IDE, start by viewing the data in R to make sure it looks right, then load the tseries, graphics, and forecast packages. We will turn the table into a time series so that the forecast function can understand it later on; it's more or less adding metadata to your data table. If you're working with monthly data, the frequency should be set to 12. Before converting, though, I dropped the "Year" column by setting it to NULL, since the time series object will carry the dates itself.

View(teslas)

library(tseries)
library(graphics)
library(forecast)

teslas$Year <- NULL   # drop Year; the ts object will carry the dates itself
View(teslas)

# monthly series from January 2012 through May 2018
teslats <- ts(teslas$`Cars Delivered`, start = c(2012, 1), end = c(2018, 5), frequency = 12)
View(teslats)

Next, I plotted the data, just to get an idea of what it looks like on a graph. It's a good way to familiarize yourself with your data visually, since patterns are hard to discern from the raw numbers alone.

plot(teslats)

There is a handy function in R, stl, that performs a seasonal decomposition of a time series. Let's run it now and see what it shows us. We will save the result as "teslafit" so it doesn't overwrite our other objects.

teslafit <- stl(teslats, s.window = "period")   # seasonal-trend decomposition using loess
plot(teslafit)

If you remove the seasonality, it is pretty clear that there is still a strong positive trend. I'm not sure what information to take away from the seasonal component, other than that there appears to be a spike in the last month of each quarter, which is essentially the pattern you see in the graph above.

It's important to get familiar with your data if you are really going to understand and forecast it. Two other useful ways to observe your data are monthplot and seasonplot (monthplot ships with base R's stats package; seasonplot comes from forecast). The seasonplot is essentially a year-by-year view of your data over a calendar-year x-axis. Both are worth a gander, even though they are not strictly necessary.

monthplot(teslats)
seasonplot(teslats)

Okay, now it's time to forecast. We are going to predict the next 19 months (through the end of 2019) by passing the stl fit into the forecast function and then plotting the result to see what we get.

forecast(teslafit, h = 19)
plot(forecast(teslafit, h = 19))

The blue line is the middle of the forecast range, and what I’m interested in. It looks reasonable, and this is just for fun, so I’ll call it good. If this were a more important project, we would want to look at the accuracy of the forecast, try multiple forecasting methods, and compare them. To do this, you should become familiar with some of the common metrics to measure forecast accuracy like MAE and MASE.

accuracy(forecast(teslafit, h = 19))   # training-set error measures (MAE, MASE, etc.)
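For example, here is a rough sketch of how you might compare a couple of alternative models on the same series; ets and auto.arima both come from the forecast package, and lower MAE/MASE values are better (a proper holdout test would be more honest than training-set error):

fit_ets <- ets(teslats)          # exponential smoothing state space model
fit_arima <- auto.arima(teslats) # automatically selected ARIMA model

accuracy(forecast(fit_ets, h = 19))
accuracy(forecast(fit_arima, h = 19))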

Being a Tesla fan (and future Model 3 owner), I am aware of some outside factors that could impact this prediction. For one, it's well documented that Elon Musk wants to ramp up production of the Model 3 in a significant way; Tesla even has a target of 5,000 per week, which I could have factored in but did not, given their history of missing targets.

A second factor is related to the possibility of Tesla strategically maximizing the federal tax credit for electric vehicles. My prediction (if you add up deliveries cumulatively by quarter) says they will reach the 200k threshold sometime in June. But it’s quite possible Tesla will hold back on U.S. deliveries so they instead reach the mark in July, which is a brand new quarter and thus would give more users an opportunity to take advantage of the federal tax credit program.

With that said, I continued on with my prediction knowing it was plausible. The next thing I wanted to do was plot the quarterly data with ggplot2 so that the different tax credit phases were clearly displayed on top of the forecast line. Here is the end result:

In order to accomplish this, I used arguments in ggplot2 that let you shade different regions of the plot, and I also added some text with coordinates specifying where it should sit. Please note, I am now using quarterly (not monthly) data, which you can get from the Google Sheet linked above as a shortcut. I named the data set "quarters" and added a few additional columns in order to build the final product. Feel free to leave questions in the comments.

library(scales)   # provides the comma label formatter used below
teslaplot <- ggplot(quarters) +
  geom_rect(aes(xmin = '2018Q1', xmax = '2018Q4', ymin = -Inf, ymax = Inf), fill = "aquamarine", alpha = 0.03) +
  geom_rect(aes(xmin = '2018Q4', xmax = '2019Q2', ymin = -Inf, ymax = Inf), fill = "blue", alpha = 0.03) +
  geom_rect(aes(xmin = '2019Q2', xmax = '2019Q4', ymin = -Inf, ymax = Inf), fill = "darkmagenta", alpha = 0.03) +
  geom_line(aes(Quarter, `Cumulative Sales`, group = 1)) +
  labs(title = 'Tesla Federal Tax Credit Projection', x = 'Quarter', y = 'Vehicles Delivered') +
  scale_y_continuous(labels = comma) +
  geom_text(aes(x = '2018Q3', y = 200000, label = '$7,500')) +
  geom_text(aes(x = '2019Q1', y = 200000, label = '$3,750')) +
  geom_text(aes(x = '2019Q3', y = 200000, label = '$1,875'))
teslaplot

Create a Table with IGDB API and R

The Internet Game Database has an API that allows you to access their rich set of video game information. Even better, there is an R package that makes requesting data super easy.

Let’s get started by installing the IGDB package in R (you may need to update your version of R):

devtools::install_github("detroyejr/igdb")

Now that you have the package installed, we are almost ready to start requesting data. Before we get there, you will need an API key, which requires you to set up an IGDB account. Fair compromise. Click "GET FREE KEY" on the documentation homepage to sign up: https://igdb.github.io/api/.

And while you're at it, be sure to read through the documentation so you understand what type of information is available in the API and how to make basic calls. The documentation pages also provide handy examples.

Now that you're set up, let's store the API key so we can access the data. You can use the following code in R:

Sys.setenv("IGDB_KEY" = "[yourkeygoeshere]")

Now we are ready to play ball. The call below returns video games containing the term “mario” and orders them by release date, oldest to most recent. I’ve decided to bring back the following variables: ID, name of the game, release date, IGDB rating, and some popularity metric they have conceived. We can save it as “mario_games.”

library(igdb)   # the package installed above

mario_games <- igdb_games(
  search = "mario",
  order = "first_release_date:asc",
  fields = c("id", "name", "first_release_date", "rating", "popularity")
)

One of the nice things about this is that the data comes back in a nicely formatted data frame, which you can view and confirm with these commands:

View(mario_games)
class(mario_games)

If for some reason yours does not, you should be able to transform it into a data frame, for easier analysis, using the as.data.frame base R function.

mario_games <- as.data.frame(mario_games)
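For example, here is a quick base R one-liner (the column names follow the fields requested above) to see the ten highest-rated results:

head(mario_games[order(-mario_games$rating), c("name", "rating")], 10)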

Now you have a table for analysis or visualization at your disposal. Enjoy.

How to Collect Twitter Data Using R and the Twitter Search API

Luckily, the hard work is done for us. There is a terrific R package called twitteR that allows you to easily connect to the Twitter Search API. You just need to know a few arguments to properly ask for the data you need.

First, let’s explore what type of data and limitations exist in the Twitter Search API so we know what we have to work with.

Official documentation: https://developer.twitter.com/en/docs/tweets/search/api-reference/get-search-tweets

“Returns a collection of relevant Tweets matching a specified query.

Please note that Twitter’s search service and, by extension, the Search API is not meant to be an exhaustive source of Tweets. Not all Tweets will be indexed or made available via the search interface.

To learn how to use Twitter Search effectively, please see the Standard search operators page for a list of available filter operators. Also, see the Working with Timelines page to learn best practices for navigating results by since_id and max_id.”

The first step, of course, is to activate the packages you need for this project. If you don’t have these packages installed already, you’ll need to do that too. I have all of these installed, so I’ve commented out that part here.

# install.packages("twitteR")
library(twitteR)

We’re getting close, but before we can request data from the Twitter API, we have to provide some credentials to make sure we aren’t doing anything nefarious. To accomplish this, you need four things:

  • consumer_key
  • consumer_secret
  • access_token
  • access_secret

No worries, all of these can be easily found here (you'll need an active Twitter account): https://apps.twitter.com/. Once you're logged in, you need to create an "application," which is essentially just saying you want to work on a project. Go ahead and fill in the details and you should receive the four credentials above.

Now we'll save each of these strings in this manner (note that you'll need to replace your string where I have 'abc123'):

consumer_key <- 'abc123'
consumer_secret <- 'abc123'
access_token <- 'abc123'
access_secret <- 'abc123'

Now, let’s get authorized and begin requesting Twitter data.

setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)

Below is a simple request for Tweets that you can modify to your liking. In this example, I'm going to save my request as "nebtweets" and I'll call for the information with "searchTwitter", which is part of the twitteR package we installed and activated. I've arbitrarily set the number of results I want back to 200, starting in February 2018. It's important to note here that the Twitter Search API does NOT give you full access to Twitter's data; it's only an index of recent Tweets. So you may get back warnings if you try asking for something that is not available.

nebtweets <- searchTwitter("nebrasketball", n=200, lang="en", since = '2018-02-01')

Now we have the Tweets saved, but they're not in a nice, neat data frame. This can easily be solved using "twListToDF", which is also part of the twitteR package.

nebtweetsDF = twListToDF(nebtweets)
View(nebtweetsDF)
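As a quick example of what you might do next, twListToDF returns a screenName column, so you can count which accounts tweeted about the topic most often:

head(sort(table(nebtweetsDF$screenName), decreasing = TRUE), 10)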

Now you're ready to analyze. Enjoy.

How to Create a US Heatmap in R

Creating a simple US map in R can be done in a number of ways. Two popular packages for this type of project are ggplot2 and plotly. In this case, I used plotly.

The data for my map is a list of US state codes (NE, IL, MA, CA, etc.). A second variable gives a count of how many players the Nebraska football team is targeting in each state. In order to follow my example with your own data, you will need to have the state code variable and some numeric variable to map it against.
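For reference, here is a tiny stand-in for the kind of table I'm describing (my real data frame is called X2018targets and has many more rows; the state list and numbers below are made up for illustration):

X2018targets <- data.frame(
  `State Code` = c("NE", "IL", "MA", "CA"),
  Targets = c(12, 5, 1, 3),
  check.names = FALSE   # keep the space in the "State Code" column name
)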

Once you have your data in a table and are ready to use it, create the following styling options for the map, which we will apply later:

library(plotly)   # plot_geo, add_trace, colorbar, and toRGB all come from plotly

mapDetails <- list(
  scope = 'usa',
  projection = list(type = 'albers usa'),
  showlakes = TRUE,
  lakecolor = toRGB('white')
)

As you may have guessed, “scope” determines the type of map, in this case a map of the USA. We will also determine here what to do with lakes and how to color them.

usaMap <- plot_geo(X2018targets, locationmode = 'USA-states') %>%
  add_trace(
    z = X2018targets$Targets, locations = X2018targets$`State Code`,
    color = X2018targets$Targets, colors = 'Blues'
  ) %>%
  colorbar(title = "Targets") %>%
  layout(
    title = '2018 Nebraska Football Targets by State (February 2018)',
    geo = mapDetails
  )

The code above connects my data to the map and allows me to modify text within the plot area. My data frame is called “X2018targets,” so you’ll need to replace this with your data frame name. You’ll also need to set “z” to your numeric data and “locations” to your state code variable.

When you're finished, simply type "usaMap" and hit enter to see your plot appear (I use RStudio, and assume you likely do as well). If you have any trouble or questions, let me know in the comments.

Scraping Sports Stats Using R (Part 2)

In Part 1 of this blog post I show how you can scrape tables of sports data from websites and store that data in a data frame for data analysis (have I said ‘data’ enough times yet?).

Whenever you are automating data collection in this way, you always want to run a "health check" on your new table to make sure nothing went awry. There are countless things that can go wrong, from missing data to web pages timing out or blocking you from collecting data, and it is critical to understand whether any of this has happened before moving on to analysis. You might call this the "data cleaning" phase that gets you into position to analyze.

With the head function you can quickly get a glimpse of what your variable names look like, along with a few observations. Let’s take a look:

head(fb_main)
            Date NU rank     Opponent Site Outcome Score <U+00A0>
1 Sept. 17, 1960             #4 Texas Away     Win 14-13  Details
2 Sept. 24, 1960     #12    Minnesota Home    Loss 26-14  Details
3   Oct. 1, 1960           Iowa State Home    Loss  10-7  Details
4   Oct. 8, 1960         Kansas State Home     Win  17-7  Details
5  Oct. 15, 1960                 Army Home     Win  14-9  Details
6  Oct. 22, 1960             Colorado Away    Loss  19-6  Details

Immediately a few concerns jump out at me. First, there is a column where all of the values contain the word “Details.” On the original site I drew this information from, this column linked out to details for each game. I do not need this column for any analysis, so I will remove it. There are numerous ways to do this. Since I know it is my seventh column, I’ll just do it this way. If you’re unsure about how to tackle this (football pun), then you may want to save the table as a different name and keep the original.

fb_main <- fb_main[ , -7]   # drop the seventh ("Details") column
head(fb_main)
            Date NU rank     Opponent Site Outcome Score
1 Sept. 17, 1960             #4 Texas Away     Win 14-13
2 Sept. 24, 1960     #12    Minnesota Home    Loss 26-14
3   Oct. 1, 1960           Iowa State Home    Loss  10-7
4   Oct. 8, 1960         Kansas State Home     Win  17-7
5  Oct. 15, 1960                 Army Home     Win  14-9
6  Oct. 22, 1960             Colorado Away    Loss  19-6

Much better. But it sure would be nice if the score was split into two columns in case I wanted to sum or average any of the scores during analysis. One variable can easily be split into two variables with a convenient function called separate, which is part of the tidyr package. It would look like this:

library(tidyr)
fb_main <- separate(fb_main, Score, into = c("Winner Score", "Loser Score"), sep = "-")

But not so fast, my friend: this really isn't very helpful at all. We want the Husker score in one column and the opponent score in another, not a mix of winners and losers. This adds a little complexity to our code, but we can still accomplish it in a few easy steps using base R's ifelse function.

fb_main$NUScore  <- as.numeric(ifelse(fb_main$Outcome == "Win", fb_main$`Winner Score`, fb_main$`Loser Score`))
fb_main$OppScore <- as.numeric(ifelse(fb_main$Outcome == "Win", fb_main$`Loser Score`, fb_main$`Winner Score`))
fb_main <- fb_main[ , -c(6:7)]   # drop the Winner Score and Loser Score columns

The first two lines above create the new columns we want using an ifelse function, and I saved myself a step by making the columns numeric right away. The ifelse statement takes three arguments: the condition, the value to use when the condition is true, and the value to use when it is false. The last line simply deletes the Winner Score and Loser Score variables, which we no longer need. Now look at the data:

head(fb_main)
            Date NU rank     Opponent Site Outcome NUScore OppScore
1 Sept. 17, 1960             #4 Texas Away     Win      14       13
2 Sept. 24, 1960     #12    Minnesota Home    Loss      14       26
3   Oct. 1, 1960           Iowa State Home    Loss       7       10
4   Oct. 8, 1960         Kansas State Home     Win      17        7
5  Oct. 15, 1960                 Army Home     Win      14        9
6  Oct. 22, 1960             Colorado Away    Loss       6       19

Some other observations I have made about this data are that the Opponent variable contains both the opponent name and their ranking (which could create difficulties down the road), some values are missing, and the rankings contain a hash symbol (#). These are all worth tidying up before analysis; a small example follows, but I'll stop there since the above should provide enough direction to complete those tasks.
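For instance, here is a sketch of one of those clean-ups in base R: strip the hash from the NU rank column and convert it to a number (blank values become NA, which is what we want for unranked weeks):

fb_main$`NU rank` <- as.numeric(gsub("#", "", fb_main$`NU rank`))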

One final consideration, once you have collected all of this data, is where to store it. There are many options, and each has its own method in R. Here are a few to consider so that you do not need to re-gather the data each time you need it for analysis:

  • Save as a data frame in R
  • Use write.csv to save it on your computer as a .csv file (a similar function exists for Excel; see the one-liner after this list)
  • Send to a local or cloud-based database
  • Upload to data.world
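For example, the .csv route is a one-liner (the file name here is just an illustration):

write.csv(fb_main, "husker_games.csv", row.names = FALSE)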

Here is the final data, uploaded to data.world: https://data.world/jeffgswanson/husker-football-game-results

Enjoy.

Scraping Sports Stats Using R (Part 1)

You can scrape sports data from web pages and store it in your own data frame/table for future analysis using the handy readHTMLTable function (from the XML package) together with lapply. My full script is below.

library(XML)          # readHTMLTable
library(RCurl)        # getURL
library(data.table)   # rbindlist

fb_urls <- paste0('http://dataomaha.com/huskers/history/seasons/', 1960:2017)

fb_urls <- unlist(fb_urls)

fb_main <- lapply(fb_urls, function(x){readHTMLTable(getURL(x), stringsAsFactors=F)[[1]]})

fb_main <- rbindlist(fb_main)

fb_main

In this two-part post, I'll show you how to use R to scrape tables from multiple web pages for your own analysis. There are a number of steps involved and, depending on which page(s) you're trying to get data from, it can get complicated, but thankfully R has some brilliant functions like readHTMLTable to do the heavy lifting.

First, let’s take a look at the web pages I’ll be scraping. I want to collect game summary data for Nebraska Cornhusker football games (Go Big Red), which I found here: http://dataomaha.com/huskers/history/seasons/1997.

In this example, there is only one table per page. If your page has multiple tables, you may need to modify the code, which you can do by changing the index at the end of the readHTMLTable call (the [[1]]). If you have further questions on this, leave a comment or search for documentation on this function.

Below, the first step I took in forming a script was to build a list of the URLs I wanted to scrape. If you are only extracting table data from one page, things are much simpler: you can just use readHTMLTable or htmltab with the URL as an argument. In my case, I wanted to cycle through 58 different pages. You could create the list in Excel quite easily, but just as quickly you can use the paste0 function, which is available in base R, to build it automatically.

fb_urls <- paste0('http://dataomaha.com/huskers/history/seasons/', 1960:2017)

Basically, this function is saying to take the base URL (first argument) and paste a number to the end until the list is complete. In this case, I am pasting 1960-2017, since I know the URLs I am scraping are constructed this way. Those are then saved as fb_urls.

The next step is to make sure what we created is a plain character vector of URLs. The unlist function takes care of that (and is harmless if the object is already a vector).

fb_urls <- unlist(fb_urls)

Now that we have a list of web pages to scrape, we need to write a function describing what we want to do on each page. Here, we are using a terrifically simple function named lapply, which applies a function across a list. The first argument is the list we want to work through; the function itself uses getURL (from RCurl) to fetch each page and readHTMLTable (from XML) to pull back the first table it finds, and we save the results in a list we will call fb_main.

fb_main <- lapply(fb_urls, function(x){readHTMLTable(getURL(x), stringsAsFactors=F)[[1]]})

The result is a list of 58 separate data frames. But I just need one large data frame with all of the information, so I'll merge everything into one data frame using rbindlist (from the data.table package).

fb_main <- rbindlist(fb_main)

Done! We always want to double check our work, which we can do by simply calling the new data frame.

fb_main

In due time, I’ll follow up with Part 2, which will focus on reviewing and cleaning the returned data, as well as options for storing it in an accessible place for future data analysis.

Un-edited thoughts on Topical Keyword Research and Intent Based SEO

I have been thinking a lot lately about topical keyword research and how this plays a role in SEO, content hierarchy and the data science approaches we use to accomplish these ideas. Let me back up…

Topical Keyword Research and Intent-based Search
Whether you’re an SEO or just someone who’s observed Google search results over time, it’s clear that over the years Google SERPs have become much more “semantic.” But what does that really mean?

In short, computers have used natural language processing (NLP) to better understand how human language works. That might mean recognizing synonyms, tailoring results to the device you are using, or any number of things. The bottom line is that Google has moved away from heavily keyword-based results (returning pages that contain the exact phrase you typed, or something very similar) toward a more semantic, intent-based approach. The results might still contain the keyword you searched, but Google is more concerned with showing what you intended to find; if something related is a better result and does not contain your exact keywords, that's okay.

How Does this Impact SEO?
In a big way. This is not new, but we need to approach keyword research and craft content around intent, not specific 1:1 keywords. In other words, we should get a list of keywords, cluster them into intent groups, and then build content based on those intent groups instead of individual keywords. This is ultimately what the user wants: not a bunch of slight variations of more or less similar content. And Google, in theory, will rank this content well if it meets user intent and can be connected to the query.

How is this Related to Data Science?
For one, clustering is a big topic in data science and can be executed in R. There are no doubt SEO tools that do this for you, but if you want more control you might try clustering keywords yourself in R, either with unsupervised methods like k-means or with a supervised classifier if you already have labeled intent groups (a rough sketch follows).
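As a flavor of what that could look like, here is a minimal sketch using only base R; the keywords and the choice of two clusters are made up for illustration. It builds a crude term-frequency matrix and lets k-means group keywords that share vocabulary:

keywords <- c("best running shoes", "running shoes for flat feet",
              "buy trail running shoes", "how to train for a marathon",
              "marathon training plan")

# crude term-frequency matrix: one row per keyword, one column per word
tokens <- strsplit(tolower(keywords), "\\s+")
vocab <- unique(unlist(tokens))
dtm <- t(sapply(tokens, function(x) table(factor(x, levels = vocab))))

set.seed(42)
groups <- kmeans(dtm, centers = 2)$cluster
split(keywords, groups)   # keywords sharing vocabulary land in the same intent group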

Final thoughts
I can see a bigger picture here as well. As we craft our content based on intent and clustering, we can almost take a testing approach to content and site information architecture in the future. Basically, one could build out their intent groups and with that list merge content that is part of one cluster or fill out any gaps. Over time analytics should show how users move through the funnel and if any steps are needed to provide an easy path for users (a path to whatever your goal happens to be).

But I think there is a paid media tie-in here as well. Not often enough do we look at paid media performance from an SEO perspective and document which keywords drive conversions versus which are more informational. We should be using that information to learn how to build out information architecture as well. It should be an additional layer to better understand how to break up similar content throughout the user journey and confirm which keywords belong to which bucket.

Using R with data sets from data.world

Recently I found out about a wonderful website, data.world, which is kind of like a social/collaboration site for data sets. I highly recommend checking it out. If nothing else, it has numerous data sets for you to learn and build from.

I found a data set that contains NCAA March Madness results dating back to 1985. One of the things that I really like about data.world is its built-in features. For one, you can explore data sets right within the website and run SQL queries to return views of the data that are of interest to you.

If you are not familiar with SQL, it is worth exploring, but I won’t go into it here. Instead, I’ll show you the simple queries I made to return appearances made in the tournament by Creighton and Nebraska:

SELECT * FROM `Big_Dance_CSV` where Big_Dance_CSV.Team="Creighton" or Big_Dance_CSV.`Team(2)`="Creighton"
SELECT * FROM `Big_Dance_CSV` where Big_Dance_CSV.Team="Nebraska" or Big_Dance_CSV.`Team(2)`="Nebraska"

For these queries to really make sense, you need to be familiar with the columns that exist in the data set. With this particular data set, there are columns for Home and Away teams (Team and Team(2)), so I asked for any results where one of the teams was Creighton or Nebraska.

Another feature that I absolutely love about data.world is how easy it is to take the data and place it into RStudio. By selecting Export > Copy R Code, you get the R code necessary to create a data frame from the SQL query you wrote. So simple. Here is what it gave me for my Creighton query:

df <- read.csv("https://query.data.world/s/dnhmq1rfdbdw18tg7jkfl0dmt",header=T);

That created this data frame in R for me to work with:

   Year Round Region Seed Score           Team             Team.2. Score.2. Seed.2.
1  2001     1      3    7    69           Iowa           Creighton       56      10
2  2002     1      4    5    82        Florida           Creighton       83      12
3  2002     2      4    4    72       Illinois           Creighton       60      12
4  2003     1      2    6    73      Creighton    Central Michigan       79      11
5  2005     1      2    7    63  West Virginia           Creighton       61      10
6  2007     1      4    7    77         Nevada           Creighton       71      10
7  2012     1      4    8    58      Creighton             Alabama       57       9
8  2012     2      4    1    87 North Carolina           Creighton       73       8
9  2013     1      1    7    67      Creighton          Cincinnati       63      10
10 2013     2      1    2    66           Duke           Creighton       50       7
11 2014     1      3    3    76      Creighton Louisiana Lafayette       66      14
12 2014     2      3    3    55      Creighton              Baylor       85       6
13 1989     1      1    3    85       Missouri           Creighton       69      14
14 1991     1      1    6    56  New Mexico St           Creighton       64      11
15 1991     2      1    3    81     Seton Hall           Creighton       69      11
16 1999     1      3    7    58     Louisville           Creighton       62      10
17 1999     2      3    2    75       Maryland           Creighton       62      10
18 2000     1      2    7    72         Auburn           Creighton       69      10

From there, I created this pretty simple bar graph with ggplot that displays when the Jays appeared in the tournament and what round they made it to. All in all it took me well under an hour.
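Here is a minimal sketch (not the exact code I used) of that kind of bar graph, where df is the Creighton data frame created above and dplyr handles the per-year summary:

library(ggplot2)
library(dplyr)

jays <- df %>%
  group_by(Year) %>%
  summarise(MaxRound = max(Round))   # furthest round reached each year

ggplot(jays, aes(x = factor(Year), y = MaxRound)) +
  geom_col() +
  labs(title = "Creighton NCAA Tournament Appearances",
       x = "Year", y = "Furthest round reached")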

And for the Huskers as well:

Hope this example shows how easy it is to take data.world data and create something in R. You could, of course, pull the entire data set into R as well to do data analysis, build models, etc. but this is a good start.