Recently I found out about a wonderful website, data.world, which is kind of like a social/collaboration site for data sets. I highly recommend checking it out. If nothing else, it has numerous data sets for you to learn from and build on.
I found a data set that contains NCAA March Madness results dating back to 1985. One of the things that I really like about data.world is its built-in features. For one, you can explore data sets right within the website and run SQL queries to return the views of the data that interest you.
If you are not familiar with SQL, it is worth exploring, but I won’t go into it here. Instead, I’ll show you the simple queries I made to return appearances made in the tournament by Creighton and Nebraska:
SELECT * FROM `Big_Dance_CSV` WHERE Big_Dance_CSV.Team = "Creighton" OR Big_Dance_CSV.`Team(2)` = "Creighton"
SELECT * FROM `Big_Dance_CSV` WHERE Big_Dance_CSV.Team = "Nebraska" OR Big_Dance_CSV.`Team(2)` = "Nebraska"
For these queries to really make sense, you need to be familiar with the columns that exist in the data set. This particular data set has columns for the Home and Away teams (Team and Team(2)), so I asked for any results where one of the teams was Creighton or Nebraska.
Another feature that I absolutely love about data.world is how easy it is to take the data and bring it into RStudio. Selecting Export > Copy R Code copies the R code needed to create a data frame in R from the SQL query you wrote. So simple. Here is what it gave me for my Creighton query:
df <- read.csv("https://query.data.world/s/dnhmq1rfdbdw18tg7jkfl0dmt",header=T);
That created this data frame in R for me to work with:
|    | Year | Round | Region | Seed | Score | Team | Team.2. | Score.2. | Seed.2. |
|----|------|-------|--------|------|-------|------|---------|----------|---------|
| 1  | 2001 | 1 | 3 | 7 | 69 | Iowa | Creighton | 56 | 10 |
| 2  | 2002 | 1 | 4 | 5 | 82 | Florida | Creighton | 83 | 12 |
| 3  | 2002 | 2 | 4 | 4 | 72 | Illinois | Creighton | 60 | 12 |
| 4  | 2003 | 1 | 2 | 6 | 73 | Creighton | Central Michigan | 79 | 11 |
| 5  | 2005 | 1 | 2 | 7 | 63 | West Virginia | Creighton | 61 | 10 |
| 6  | 2007 | 1 | 4 | 7 | 77 | Nevada | Creighton | 71 | 10 |
| 7  | 2012 | 1 | 4 | 8 | 58 | Creighton | Alabama | 57 | 9 |
| 8  | 2012 | 2 | 4 | 1 | 87 | North Carolina | Creighton | 73 | 8 |
| 9  | 2013 | 1 | 1 | 7 | 67 | Creighton | Cincinnati | 63 | 10 |
| 10 | 2013 | 2 | 1 | 2 | 66 | Duke | Creighton | 50 | 7 |
| 11 | 2014 | 1 | 3 | 3 | 76 | Creighton | Louisiana Lafayette | 66 | 14 |
| 12 | 2014 | 2 | 3 | 3 | 55 | Creighton | Baylor | 85 | 6 |
| 13 | 1989 | 1 | 1 | 3 | 85 | Missouri | Creighton | 69 | 14 |
| 14 | 1991 | 1 | 1 | 6 | 56 | New Mexico St | Creighton | 64 | 11 |
| 15 | 1991 | 2 | 1 | 3 | 81 | Seton Hall | Creighton | 69 | 11 |
| 16 | 1999 | 1 | 3 | 7 | 58 | Louisville | Creighton | 62 | 10 |
| 17 | 1999 | 2 | 3 | 2 | 75 | Maryland | Creighton | 62 | 10 |
| 18 | 2000 | 1 | 2 | 7 | 72 | Auburn | Creighton | 69 | 10 |
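One thing worth noticing: the SQL query refers to a column called Team(2), but the data frame has Team.2. instead. That is most likely read.csv at work. By default it runs column names through make.names(), which swaps characters that are not valid in R names for dots. Here is a small sketch of that behavior. The CSV text is a one-row stand-in built from the first row above, and the Score(2) and Seed(2) header names are my assumption based on the Team(2) pattern, not something confirmed from the actual file:

```r
# One-row stand-in for the exported CSV. The "Score(2)" and "Seed(2)" headers
# are assumed from the "Team(2)" pattern seen in the SQL query.
csv_text <- paste(
  "Year,Round,Region,Seed,Score,Team,Team(2),Score(2),Seed(2)",
  "2001,1,3,7,69,Iowa,Creighton,56,10",
  sep = "\n"
)

# Default behavior (check.names = TRUE): invalid characters become dots
with_check <- read.csv(text = csv_text, header = TRUE)
print(names(with_check))

# check.names = FALSE keeps the headers exactly as written in the file
as_is <- read.csv(text = csv_text, header = TRUE, check.names = FALSE)
print(names(as_is))

# Non-syntactic names then need backticks when you reference them
print(as.character(as_is$`Team(2)`))
```

Either form works. The dotted names are just easier to type in R, which is presumably why they are the default.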
From there, I created this pretty simple bar graph with ggplot that displays when the Jays appeared in the tournament and what round they made it to. All in all it took me well under an hour.
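For anyone who wants to try a chart along these lines, here is a rough sketch using ggplot2. It is not necessarily the exact code behind the graph. It assumes the chart shows the furthest round reached in each tournament year, and the names games and furthest are made up for the example. The Year and Round values are copied from the Creighton data frame above so the snippet runs on its own:

```r
library(ggplot2)

# Year and Round columns copied from the Creighton data frame shown earlier
games <- data.frame(
  Year  = c(2001, 2002, 2002, 2003, 2005, 2007, 2012, 2012, 2013, 2013,
            2014, 2014, 1989, 1991, 1991, 1999, 1999, 2000),
  Round = c(1, 1, 2, 1, 1, 1, 1, 2, 1, 2, 1, 2, 1, 1, 2, 1, 2, 1)
)

# Keep one row per year: the latest round the team played in
furthest <- aggregate(Round ~ Year, data = games, FUN = max)

# factor(Year) gives one bar per appearance and skips the years in between
p <- ggplot(furthest, aes(x = factor(Year), y = Round)) +
  geom_col() +
  labs(
    x = "Tournament year",
    y = "Furthest round played",
    title = "Creighton NCAA tournament appearances"
  )

print(p)
```

If you already have df from the read.csv call, you can skip the hand-typed data and pass df straight to aggregate().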
And for the Huskers as well:
I hope this example shows how easy it is to take data.world data and create something in R. You could, of course, pull the entire data set into R as well to do data analysis, build models, and so on, but this is a good start.
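As a small taste of that kind of analysis, here is a sketch that works out a win/loss result for each Creighton game. The team and score columns are retyped from the data frame above so the snippet is self-contained. The Result column and the helper vectors (is_first, own, opp) are names invented for this example:

```r
# Team and score columns copied from the Creighton data frame shown earlier
creighton <- data.frame(
  Year = c(2001, 2002, 2002, 2003, 2005, 2007, 2012, 2012, 2013, 2013,
           2014, 2014, 1989, 1991, 1991, 1999, 1999, 2000),
  Score = c(69, 82, 72, 73, 63, 77, 58, 87, 67, 66, 76, 55, 85, 56, 81,
            58, 75, 72),
  Team = c("Iowa", "Florida", "Illinois", "Creighton", "West Virginia",
           "Nevada", "Creighton", "North Carolina", "Creighton", "Duke",
           "Creighton", "Creighton", "Missouri", "New Mexico St",
           "Seton Hall", "Louisville", "Maryland", "Auburn"),
  Team.2. = c("Creighton", "Creighton", "Creighton", "Central Michigan",
              "Creighton", "Creighton", "Alabama", "Creighton", "Cincinnati",
              "Creighton", "Louisiana Lafayette", "Baylor", "Creighton",
              "Creighton", "Creighton", "Creighton", "Creighton", "Creighton"),
  Score.2. = c(56, 83, 60, 79, 61, 71, 57, 73, 63, 50, 66, 85, 69, 64, 69,
               62, 62, 69),
  stringsAsFactors = FALSE
)

# Creighton can be in either team column, so pick the matching score
is_first <- creighton$Team == "Creighton"
own <- ifelse(is_first, creighton$Score, creighton$Score.2.)
opp <- ifelse(is_first, creighton$Score.2., creighton$Score)

# Label each game as a win or a loss from Creighton's point of view
creighton$Result <- ifelse(own > opp, "W", "L")

print(table(creighton$Result))
```

On these 18 rows that comes out to 6 wins and 12 losses, which lines up with 12 tournament appearances that each ended in a loss.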