This post outlines how to collect some very basic high school basketball scores data from MaxPreps. By collecting this data and building upon it, you should have everything you need to create a computer rankings model for your state, much like I did for Nebraska Class A boys basketball.
The state of Nebraska puts teams into different classes (Class A, B, C, etc.) depending mostly (from what I can tell) on enrollment and location. For this project, I was interested in Class A boys basketball, so the first thing I did was get an official list of Class A teams from the Nebraska School Activities Association (NSAA) website.
Knowing I would need this list for later, I manully created a dataframe called Class A:
# Manual Class A names School <- c( 'Omaha South', 'Omaha Central', 'Grand Island', 'Millard North', 'Millard South', 'Millard West', 'Lincoln East', 'Lincoln High', 'North Star', 'Creighton Prep', 'Omaha North', 'Lincoln Southeast', 'Burke', 'Lincoln Southwest', 'Bryan', 'Omaha Westside', 'Papillion-LaVista South', 'Papillion-LaVista', 'Lincoln Northeast', 'Bellevue West', 'Omaha Northwest', 'Kearney', 'Fremont', 'Bellevue East', 'Benson', 'Gretna', 'Elkhorn', 'Elkhorn South', 'Norfolk', 'Columbus', 'North Platte', 'Pius X', 'South Sioux City' ) # Manual Class A enrollment Enrollment <- c( 2166, 2051, 1982, 1920, 1881, 1783, 1695, 1692, 1571, 1548, 1522, 1515, 1514, 1501, 1480, 1452, 1442, 1368, 1280, 1240, 1236, 1188, 1113, 1099, 1062, 1050, 1026, 1008, 1005, 971, 905, 897, 860 ) #Combine list into data frame classA <- data.frame(School, Enrollment)
We’ll need the following libraries for this project:
library(c("shiny", "DT", "shinythemes", "rvest", "expss", "dplyr", "tidyr", "stringr", "sqldf", "xlsx", "scales"))
In order for all of this to work, I needed some basic (but turns out hard to find) high school basketball scores data in a consistent way. I tried scraping Omaha World Herald scores and NSAA scores, but in the end found MaxPreps easiest to work with.
In order to automate this, it’s important to understand the components of the MaxPreps scoreboard URLs. These URLs list scores by a given day, which you can select on the website by choosing from a dropdown calendar. Here is a closer look at the structure:
# start of scoreboard URLs on MaxPreps https://www.maxpreps.com/list/schedules_scores.aspx?date= # date 12/5/2019 # end of URL that specifies boys, state and class (divisionid) &gendersport=boys,basketball&state=ne&statedivisionid=85757869-a232-41b9-a6b3-727edb24825e
So the putting the pieces above together would give us a page of scores on December 5, 2019 for Nebraska Class A boys basketball. Now that we understand this structure, we need to create a list of dates that we can will use to then create a list of URLs that contain scores for the entire season. I couldn’t figure out a great way to collect the game dates, so this part was manual as well. If you find a more automated way to get a list of days which games were played on, let me know!
# create games table -- game_dates NEEDS TO BE UPDATED WITH NEW DAYS maxprep_baseURL <- "https://www.maxpreps.com/list/schedules_scores.aspx?date=" maxprep_paramURL <- "&gendersport=boys,basketball&state=ne&statedivisionid=85757869-a232-41b9-a6b3-727edb24825e" # this part was manual, but building a list of final below is not game_dates <- c( "12/5/2019", "12/6/2019", "12/7/2019", "12/9/2019", "12/10/2019", "12/12/2019", "12/13/2019", "12/14/2019", "12/16/2019", "12/17/2019", "12/19/2019", "12/20/2019", "12/21/2019", "12/27/2019", "12/28/2019", "12/30/2019", "12/31/2019", "1/2/2020", "1/3/2020", "1/4/2020", "1/7/2020", "1/9/2020", "1/10/2020", "1/11/2020", "1/14/2020", "1/16/2020", "1/17/2020", "1/18/2020", "1/21/2020", "1/23/2020", "1/24/2020", "1/25/2020", "1/28/2020", "1/30/2020", "1/31/2020", "2/1/2020", "2/3/2020", "2/4/2020", "2/7/2020", "2/8/2020", "2/11/2020", "2/13/2020", "2/14/2020", "2/15/2020", "2/18/2020", "2/20/2020", "2/21/2020", "2/22/2020", "2/28/2020" ) maxprep_page_list <- as.list(paste0(maxprep_baseURL, game_dates, maxprep_paramURL))
Now that we have a list of URLs that contain all of the Class A scores we want, we can scrape the HTML data we want (the scores). Please note you need to get familiar with the HTML of the pages you’re pulling data from for this to work. In my case, I found the scores all sitting neatly under a div called data-contest-state.
maxprep_html <- lapply(maxprep_page_list, FUN=function(URLLink){ read_html(URLLink) %>% html_nodes("[data-contest-state='boxscore']") %>% html_text() })
The above returns a list of lists, which is difficult to work with. So we can modify this by using the unlist function. We’ll also add a few more modifications to get this all into a dataframe called scores.
# Unlist and create dataframe of scores by game scores <- unlist(maxprep_html) scores <- gsub("Final","", scores) scores <- grep("#", scores, invert = TRUE, value = TRUE) scores <- data.frame(scores)
If you check the dataframe using view(scores) you’ll notice a problem. All of the scores and teams are listed under one column. We need to separate these into different columns and rename them.
colnames(scores) <- c("V1") scores <- scores %>% mutate(V1 = gsub("(\\d+)", ";\\1;", V1)) %>% separate(V1, c(NA, "No1", "Let1", "No2", "Let2"), sep = " *; *") colnames(scores) <- c("Away_Score", "Away_Team", "Home_Score", "Home_Team")
Okay, we’re making some really good progress now. We’ve basically scraped the data we need and manipulated into a workable dataframe with four useful variables/columns: “Away_Score,” “Away_Team,” “Home_Score,” and “Home_Team.”
With just this data we can create a number of new variables that will be useful including winner, loser, home win, home loss, away win, and away loss. It took me a while to figure out a good way to do this but eventually found that using if_else from dplyr is very efficient.
scores$Winner <- if_else(scores$Away_Score > scores$Home_Score, scores$Away_Team, scores$Home_Team) scores$Loser <- if_else(scores$Away_Score < scores$Home_Score, scores$Away_Team, scores$Home_Team) scores$Home_W <- if_else(scores$Winner==scores$Home_Team, scores$Home_Team, "NA") scores$Home_L <- if_else(scores$Loser==scores$Home_Team, scores$Home_Team, "NA") scores$Away_W <- if_else(scores$Winner==scores$Away_Team, scores$Away_Team, "NA") scores$Away_L <- if_else(scores$Loser==scores$Away_Team, scores$Away_Team, "NA")
And then I cleaned it up a bit (please leave comments if you have any questions about what is going on here):
scores <- scores %>% mutate_all(~gsub('\r|\n', '', .)) # Would like to make this more efficient scores$Winner <- str_trim(scores$Winner, side = "both") scores$Loser <- str_trim(scores$Loser, side = "both") scores$Away_Team <- str_trim(scores$Away_Team, side = "both") scores$Home_Team <- str_trim(scores$Home_Team, side = "both") scores$Home_Score <- as.numeric(scores$Home_Score) scores$Away_Score <- as.numeric(scores$Away_Score)
From this point, I decided to make a few new categories — how much a team won by and whether the home or away team is a Class A team or not. The Class is important here because a few teams in Class A mostly played Class B teams and I wanted to have that information available to use in how I scored teams later on.
scores <- scores %>% mutate_all(~gsub('\r|\n', '', .)) # There's probably a more efficient way to write this scores$Winner <- str_trim(scores$Winner, side = "both") scores$Loser <- str_trim(scores$Loser, side = "both") scores$Away_Team <- str_trim(scores$Away_Team, side = "both") scores$Home_Team <- str_trim(scores$Home_Team, side = "both") scores$Home_Score <- as.numeric(scores$Home_Score) scores$Away_Score <- as.numeric(scores$Away_Score)
Now I have a good amount of variables stored in my scores dataframe. My next step will be use the data in that dataframe to build out the classA dataframe that we created in the beggining.
To execute this we will need to group or summarize a lot of data in scores for a particular team. For example, to get the number of Wins for a given team, we have to count the number of times that team appears in the scores$Winner column (which is why we created that variable). I was able to execute this using a number of functions including merge and sqldf.
classA <- merge(classA, stack(table(factor(scores$Winner, levels = classA$School))), by.x = 'School', by.y = "ind") names(classA)[names(classA) == 'values'] <- 'Wins' classA <- merge(classA, stack(table(factor(scores$Loser, levels = classA$School))), by.x = 'School', by.y = "ind") names(classA)[names(classA) == 'values'] <- 'Losses' classA$Win_Pct <- round(classA$Wins / (classA$Wins + classA$Losses), digits = 2) classA$Games_Played <- classA$Wins + classA$Losses # Use sqldf to create Home Wins/Losses, Away Wins/Losses from scores df varHW <- sqldf("select scores.Home_Team, count(scores.Home_W) from scores where scores.Home_Class==TRUE AND scores.Home_Team==scores.Winner group by scores.Home_Team", stringsAsFactors=FALSE) names(varHW)[2] <- "Home_Wins" classA <- sqldf("select classA.*, varHW.Home_Wins from classA left join varHW on classA.School = varHW.Home_Team", stringsAsFactors = FALSE) varHL <- sqldf("select scores.Home_Team, count(scores.Home_L) from scores where scores.Home_Class==TRUE AND scores.Home_Team==scores.Loser group by scores.Home_Team", stringsAsFactors=FALSE) names(varHL)[2] <- "Home_Losses" classA <- sqldf("select classA.*, varHL.Home_Losses from classA left join varHL on classA.School = varHL.Home_Team", stringsAsFactors = FALSE) varAW <- sqldf("select scores.Away_Team, count(scores.Away_W) from scores where scores.Away_Class==TRUE AND scores.Away_Team==scores.Winner group by scores.Away_Team", stringsAsFactors=FALSE) names(varAW)[2] <- "Away_Wins" classA <- sqldf("select classA.*, varAW.Away_Wins from classA left join varAW on classA.School = varAW.Away_Team", stringsAsFactors = FALSE) varAL <- sqldf("select scores.Away_Team, count(scores.Away_L) from scores where scores.Away_Class==TRUE AND scores.Away_Team==scores.Loser group by scores.Away_Team", stringsAsFactors=FALSE) names(varAL)[2] <- "Away_Losses" classA <- sqldf("select classA.*, varAL.Away_Losses from classA left join varAL on classA.School = varAL.Away_Team", stringsAsFactors = FALSE) classA[is.na(classA)] <- 0 scores$Away_Score <- as.numeric(scores$Away_Score) scores$Home_Score <- as.numeric(scores$Home_Score) classA$School <- as.character(classA$School) # Using dplyr to create home and away points per game (ppg) classA <- scores %>% group_by(Away_Team) %>% summarise(Away_PPG = mean(Away_Score, na.rm = TRUE)) %>% right_join(classA, by = c(Away_Team = 'School')) names(classA)[names(classA) == 'Away_Team'] <- 'School' classA <- scores %>% group_by(Home_Team) %>% summarise(Home_PPG = mean(Home_Score, na.rm = TRUE)) %>% right_join(classA, by = c(Home_Team = 'School')) names(classA)[names(classA) == 'Home_Team'] <- 'School' classA <- scores %>% group_by(Home_Team) %>% summarise(Home_dPPG = mean(Away_Score, na.rm = TRUE)) %>% right_join(classA, by = c(Home_Team = 'School')) names(classA)[names(classA) == 'Home_Team'] <- 'School' classA <- scores %>% group_by(Away_Team) %>% summarise(Away_dPPG = mean(Home_Score, na.rm = TRUE)) %>% right_join(classA, by = c(Away_Team = 'School')) names(classA)[names(classA) == 'Away_Team'] <- 'School' classA$Away_PPG <- round(classA$Away_PPG, digits = 0) classA$Home_PPG <- round(classA$Home_PPG, digits = 0) classA$Home_dPPG <- round(classA$Home_dPPG, digits = 0) classA$Away_dPPG <- round(classA$Away_dPPG, digits = 0) classA$Home_PPG_Diff <- round(classA$Home_PPG - classA$Home_dPPG, digits = 0) classA$Away_PPG_Diff <- round(classA$Away_PPG - classA$Away_dPPG, digits = 0)
We now have a lot of juicy data in classA for analysis, but let’s go further and create a few more variables:
classA <- scores %>% group_by(Away_Team) %>% summarise(Away_Total_Points = sum(Away_Score, na.rm = TRUE)) %>% right_join(classA, by = c(Away_Team = 'School')) names(classA)[names(classA) == 'Away_Team'] <- 'School' classA <- scores %>% group_by(Home_Team) %>% summarise(Home_Total_Points = sum(Home_Score, na.rm = TRUE)) %>% right_join(classA, by = c(Home_Team = 'School')) names(classA)[names(classA) == 'Home_Team'] <- 'School' classA <- scores %>% group_by(Home_Team) %>% summarise(Home_Points_Allowed = sum(Away_Score, na.rm = TRUE)) %>% right_join(classA, by = c(Home_Team = 'School')) names(classA)[names(classA) == 'Home_Team'] <- 'School' classA <- scores %>% group_by(Away_Team) %>% summarise(Away_Points_Allowed = sum(Home_Score, na.rm = TRUE)) %>% right_join(classA, by = c(Away_Team = 'School')) names(classA)[names(classA) == 'Away_Team'] <- 'School' classA$Total_Points <- classA$Home_Total_Points + classA$Away_Total_Points classA$PPG <- round(classA$Total_Points / classA$Games_Played, digits = 0) classA$Points_Allowed <- classA$Home_Points_Allowed + classA$Away_Points_Allowed classA$dPPG <- round(classA$Points_Allowed / classA$Games_Played, digits = 0) classA$Total_Points_Diff <- classA$Total_Points - classA$Points_Allowed classA$PPG_Diff <- classA$PPG - classA$dPPG
The last thing I did here is create some variables to tell us how often teams played other Class A teams since it’s a distinct advantage to play mostly against teams in other classes.
againstAhome <- sqldf("select scores.Home_Team, count(scores.Away_Class) from scores where scores.Away_Class==TRUE AND scores.Home_Class=TRUE group by scores.Home_Team", stringsAsFactors=FALSE) names(againstAhome)[2] <- "Home_A_Schedule" classA <- sqldf("select classA.*, againstAhome.Home_A_Schedule from classA left join againstAhome on classA.School = againstAhome.Home_Team", stringsAsFactors = FALSE) againstAaway <- sqldf("select scores.Away_Team, count(scores.Home_Class) from scores where scores.Home_Class==TRUE AND scores.Away_Class=TRUE group by scores.Away_Team", stringsAsFactors=FALSE) names(againstAaway)[2] <- "Away_A_Schedule" classA <- sqldf("select classA.*, againstAaway.Away_A_Schedule from classA left join againstAaway on classA.School = againstAaway.Away_Team", stringsAsFactors = FALSE) classA$A_Schedule <- round((classA$Home_A_Schedule + classA$Away_A_Schedule) / classA$Games_Played, digits = 2)
Now you should have ample data to create a scoring system model to rank high school basketball teams in your state. I won’t go into details here about how I modeled this, but you can see my final results here. And happy to give some direction if you’re really interested. You can also find details online about how other rankings systems, like the NET work by digging a bit.
If you’re going to display your dataframe as a table in some way, like I did with my shinyapp linked to above, then it’s a good idea to clean things up and only include necessary columns, which I do here:
classA_clean <- data.frame(classA$School, classA$Performance_Points, classA$Wins, classA$Losses, classA$Win_Pct, classA$Home_Wins, classA$Home_Losses, classA$Away_Wins, classA$Away_Losses, classA$PPG, classA$dPPG, classA$Home_PPG_Diff, classA$Away_PPG_Diff, classA$A_Schedule, classA$SOS) names(classA_clean) <- c("School", "Performance Points", "Wins", "Losses", "Win %", "Home Wins", "Home Losses", "Away Wins", "Away Losses", "PPG", "Def PPG", "Home PPG Diff", "Away PPG Diff", "Class A Schedule", "SOS")
And lastly, here is my R script (app.R) for publishing this to shiny. Please note this is not the entire script, just the last piece. You’d need to add all the pieces that build the dataframes above it to make it work.
# Define UI for application that displays a table ui <- fluidPage( theme = "sandstone", br(), h2("2019/20 Nebraska High School Boys Basketball Computer Rankings", style = "color: DarkGoldenRod"), br(), DT::DTOutput("mytable") ) server <- function(input, output) { output$mytable = DT::renderDT({ DT::datatable( classA_clean, rownames = FALSE, options = list( paging = FALSE, searching = FALSE)) }) } shinyApp(ui = ui, server = server)
Good luck!