If you’re not completely new to the data science community (specifically, the
#rstats community), then you’ve probably seen a version of the “famous” data science workflow diagram. 1
If one is fairly familiar with a certain topic, then one might not spend much time with the initial “visualize” step of the workflow. Such is the case with me and NBA data–as a relatively knowledgeable NBA follower, I don’t necessarily need to spend much of my time exploring raw NBA data prior to modeling.
Anyways, as a break from experimenting with predictive models, I decided to make a visualization just for the sake of trying something I hadn’t done before. 2 In particular, I was inspired by the calendar heat map visualization that I saw in the Top 50 ggplot visualizations post on the r-statistics.co website.
To implement a plot of this nature, I decided to look at how my hometown team, the San Antonio Spurs, fared last season (2016) in terms of point differential. In case it’s not immediately obvious, lots of green is good. (This is not surprising to those of us who follow the NBA–the Spurs have been consistently good since the end of the 1990s.)
library("dplyr") library("ggplot2") url <- "https://raw.githubusercontent.com/aelhabr/nba/master/data/game_results-prepared.csv" results_prepared <- getURL(url) colnames_base <- c("date", "season", "tm") colnames_calc_dates <- c("yyyy", "mm", "dd", "wd", "mm_yyyy", "mm_w") # Look at a couple of different metrics. # Specifically, look at games played to date (g_td) and point differential (pd). colnames_viz <- c("g_td", "pd") results_calendar_tm <- results_prepared %>% filter(tm == "SAS") %>% mutate( yyyy = year(date), mm = lubridate::month(date), dd = lubridate::day(date), wd = lubridate::wday(date, label = TRUE, abbr = TRUE), mm_yyyy = zoo::as.yearmon(date) ) %>% group_by(mm_yyyy) %>% mutate(mm_w = ceiling(dd / 7)) %>% ungroup() %>% select(one_of(colnames_base, colnames_calc_dates, colnames_viz)) %>% arrange(season, g_td, tm) results_calendar_tm # Tidy up because I was experimenting with different metrics, not just point differential. results_calendar_tm_tidy <- results_calendar_tm %>% # mutate_if(is.character, funs(as.factor)) %>% tidyr::gather(metric, value, colnames_viz) season <- 2016 wd_labels <- levels(results_calendar_tm$wd) wd_labels[2:6] <- "" title <- str_c("San Antonio Spurs Point Differential in ", season, " NBA Season") viz_pd_sas_2016 <- results_calendar_tm_tidy %>% filter(season %in% seasons) %>% filter(metric == "pd") %>% ggplot() + geom_tile(aes(x = wd, y = mm_w, fill = value), colour = "white") + scale_y_reverse() + scale_x_discrete(labels = wd_labels) + theme(axis.text.x = element_text(angle = 90, vjust = 0.5)) + scale_fill_gradient2(low = "red", mid = "grey", high = "green") + theme(legend.position = "bottom") + labs(x = "", y = "", title = title) + facet_wrap( ~ mm_yyyy, nrow = 2) viz_pd_sas_2016
There are an infinite number of ways to visualize data like this, but I thought this was interesting because of the temporal nature of the data.
The figure shown here comes from the introductory chapter of the R for Data Science book↩
Here, I use the NBA data that I have already scraped and cleaned.↩