Web Scraping Made Easier
UPDATE: The data retrieval demonstrated in this post no longer seems to work due to a change in ESPN’s “secret” API. In any case, the techniques for working with JSON data are still valid.
In this “how-to” post, I want to detail an approach that others may find useful for converting nested (nasty!) JSON to a tidy (nice!) data.frame/tibble that should be much easier to work with. 1
For this demonstration, I’ll start out by scraping National Football League (NFL) 2018 regular season week 1 score data from ESPN, which involves lots of nested data in its raw form. 2
Then, I’ll work towards getting the data into a workable format (a data.frame!). (This is the crux of what I want to show.) Finally, I’ll filter and wrangle the data to generate a final, presentable format.
Even if one does not care for sports and knows nothing about the NFL, I believe that the techniques that I demonstrate are generalizable to a broad set of JSON-related “problems”.
Let’s begin with importing the package(s) that we’ll need.
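The import itself isn’t shown here; the post leans on functions from across the tidyverse, so the setup is presumably just:

```r
# {tidyverse} attaches the packages used throughout this post:
# {tibble} (enframe()), {tidyr} (separate(), fill(), spread()),
# {dplyr}, {stringr}, and {purrr}.
library(tidyverse)
```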
Next, we’ll create a variable for the url from which we will get the data. The url here will request the scores for week 1 of the 2018 NFL season from ESPN’s “secret” API. 3
url <- "http://site.api.espn.com/apis/site/v2/sports/football/nfl/scoreboard?&dates=2018&seasontype=2&week=1"
And now, the actual HTTP GET request for the data (using the {httr} package’s appropriately named GET() function).
resp <- httr::GET(url)
resp
## Response [http://site.api.espn.com/apis/site/v2/sports/football/nfl/scoreboard?&dates=2018&seasontype=2&week=1]
## Date: 2018-10-24 18:41
## Status: 200
## Content-Type: application/json;charset=UTF-8
## Size: 189 kB
Everything seems to be going well. However, after using another handy {httr} function—content()—to extract the data, we see that the data is in a nasty nested format! (I only print out some of the top-level elements to avoid cluttering the page.)
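The extraction step itself isn’t shown; it’s presumably something like the following sketch (the name cont_raw matches what’s used later in the post; the exact content() arguments are an assumption):

```r
# Parse the JSON body of the response into nested R lists.
cont_raw <- httr::content(resp, as = "parsed", type = "application/json")

# Peek at the top-level structure without printing all 6,000+ leaves.
str(cont_raw, max.level = 3, list.len = 4)
```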
## List of 4
## $ leagues:List of 1
## ..$ :List of 11
## .. ..$ id : chr "28"
## .. ..$ uid : chr "s:20~l:28"
## .. ..$ name : chr "National Football League"
## .. ..$ abbreviation : chr "NFL"
## .. .. [list output truncated]
## $ season :List of 2
## ..$ type: int 2
## ..$ year: int 2018
## $ week :List of 1
## ..$ number: int 1
## $ events :List of 16
## ..$ :List of 9
## .. ..$ id : chr "401030710"
## .. ..$ uid : chr "s:20~l:28~e:401030710"
## .. ..$ date : chr "2018-09-07T00:55Z"
## .. ..$ name : chr "Atlanta Falcons at Philadelphia Eagles"
## .. .. [list output truncated]
## ..$ :List of 9
## .. ..$ id : chr "401030718"
## .. ..$ uid : chr "s:20~l:28~e:401030718"
## .. ..$ date : chr "2018-09-09T17:00Z"
## .. ..$ name : chr "Pittsburgh Steelers at Cleveland Browns"
## .. .. [list output truncated]
## ..$ :List of 9
## .. ..$ id : chr "401030717"
## .. ..$ uid : chr "s:20~l:28~e:401030717"
## .. ..$ date : chr "2018-09-09T17:00Z"
## .. ..$ name : chr "Cincinnati Bengals at Indianapolis Colts"
## .. .. [list output truncated]
## ..$ :List of 9
## .. ..$ id : chr "401030716"
## .. ..$ uid : chr "s:20~l:28~e:401030716"
## .. ..$ date : chr "2018-09-09T17:00Z"
## .. ..$ name : chr "Tennessee Titans at Miami Dolphins"
## .. .. [list output truncated]
## .. [list output truncated]
Given the nature of the data, we might hope that the {jsonlite} package will save us here. However, straightforward usage of its fromJSON() function only reduces the mess a bit.
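For reference, that attempt presumably looked something like this sketch (fromJSON() can also take the URL directly):

```r
# fromJSON() simplifies arrays of records into data.frames where it can,
# but deeper nesting survives as nested data.frames and list-columns.
cont_json <- jsonlite::fromJSON(httr::content(resp, as = "text"))
str(cont_json, max.level = 2, list.len = 4)
```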
## List of 4
## $ leagues:'data.frame': 1 obs. of 11 variables:
## ..$ id : chr "28"
## ..$ uid : chr "s:20~l:28"
## ..$ name : chr "National Football League"
## ..$ abbreviation : chr "NFL"
## .. [list output truncated]
## $ season :List of 2
## ..$ type: int 2
## ..$ year: int 2018
## $ week :List of 1
## ..$ number: int 1
## $ events :'data.frame': 16 obs. of 9 variables:
## ..$ id : chr [1:16] "401030710" "401030718" "401030717" "401030716" ...
## ..$ uid : chr [1:16] "s:20~l:28~e:401030710" "s:20~l:28~e:401030718" "s:20~l:28~e:401030717" "s:20~l:28~e:401030716" ...
## ..$ date : chr [1:16] "2018-09-07T00:55Z" "2018-09-09T17:00Z" "2018-09-09T17:00Z" "2018-09-09T17:00Z" ...
## ..$ name : chr [1:16] "Atlanta Falcons at Philadelphia Eagles" "Pittsburgh Steelers at Cleveland Browns" "Cincinnati Bengals at Indianapolis Colts" "Tennessee Titans at Miami Dolphins" ...
## .. [list output truncated]
One could go on and try some other functions from the {jsonlite} package (or another JSON-related package), but, in my own attempts, I was unable to figure out a nice way of getting a data.frame. (This is not to say that there is something wrong with the package—I simply could not figure out how to use it to get the result that I wanted.)
So, what to do now? Well, after some struggling, I stumbled upon the following solution to put me on the right path.
df_raw <- enframe(unlist(cont_raw))
df_raw
## # A tibble: 6,629 x 2
## name value
## <chr> <chr>
## 1 leagues.id 28
## 2 leagues.uid s:20~l:28
## 3 leagues.name National Football League
## 4 leagues.abbreviation NFL
## 5 leagues.slug nfl
## 6 leagues.season.year 2018
## 7 leagues.season.startDate 2018-08-02T07:00Z
## 8 leagues.season.endDate 2019-02-06T07:59Z
## 9 leagues.season.type.id 2
## 10 leagues.season.type.type 2
## # ... with 6,619 more rows
Combining unlist() and tibble::enframe(), we are able to get a (very) long data.frame without any nested elements! Note that the would-have-been-nested elements are joined by “.” in the “name” column, and the values associated with these elements are in the “value” column. (These are the default column names that tibble::enframe() assigns to the tibble that it creates from a list.)
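To see the unlist() + enframe() trick in isolation, here is a minimal, self-contained example (the nested list is made up, but it mimics the shape of the API response):

```r
library(tibble)

# A made-up nested list with varying depth.
toy <- list(
  season = list(type = 2L, year = 2018L),
  week = list(number = 1L)
)

# unlist() flattens the list into a named vector, joining the nested
# names with "."; enframe() then turns that vector into a two-column
# tibble with the default column names "name" and "value".
df_toy <- enframe(unlist(toy))
df_toy$name
# → "season.type" "season.year" "week.number"
```

(With the real response, the “value” column ends up as character because the leaves have mixed types and unlist() coerces them; in this toy case they stay integer.)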
While this tibble is still not in a tidy format—there are variables implicitly stored in the “name” column rather than in their own columns—it’s in a much more user-friendly format (in my opinion). (e.g. The variable "leagues.season.startDate" implicitly encodes three variables—"leagues", "season", and "startDate"—each deserving of their own column.)
Given the format of the implicit variables in the “name” column, we can use tidyr::separate() to create columns for each.
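The call producing the warning below isn’t shown; it’s presumably of this form (demonstrated here on a two-row stand-in for df_raw, since reproducing it exactly requires the scraped data):

```r
library(tidyverse)

# Tiny stand-in for `df_raw` (the real one has 6,629 rows).
df_mini <- enframe(unlist(list(
  leagues = list(id = "28", season = list(startDate = "2018-08-02T07:00Z"))
)))

# Split "name" on literal dots into 10 dummy columns; over-estimating
# the number of pieces triggers the "Missing pieces filled with `NA`"
# warning, and the unused columns are filled with NA.
df_sep10 <- separate(df_mini, name, into = paste0("x", 1:10), sep = "\\.")
```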
## Warning: Expected 10 pieces. Missing pieces filled with `NA` in 6629
## rows [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,
## 20, ...].
## # A tibble: 6,629 x 11
## x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 value
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 leagu~ id <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> 28
## 2 leagu~ uid <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> s:20~l:~
## 3 leagu~ name <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> Nationa~
## 4 leagu~ abbrev~ <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> NFL
## 5 leagu~ slug <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> nfl
## 6 leagu~ season year <NA> <NA> <NA> <NA> <NA> <NA> <NA> 2018
## 7 leagu~ season star~ <NA> <NA> <NA> <NA> <NA> <NA> <NA> 2018-08~
## 8 leagu~ season endD~ <NA> <NA> <NA> <NA> <NA> <NA> <NA> 2019-02~
## 9 leagu~ season type id <NA> <NA> <NA> <NA> <NA> <NA> 2
## 10 leagu~ season type type <NA> <NA> <NA> <NA> <NA> <NA> 2
## # ... with 6,619 more rows
We get a warning when using separate() because we have “over-estimated” how many columns we will need to create. Note that, with my specification of (dummy) column names with the into argument, I guessed that we would need 10 columns. Why 10? Because I expected that 10 would be more than I needed, and it’s better to over-estimate and remove the extra columns in a subsequent step than to under-estimate and lose data because there are not enough columns to put the “separated” data in.
We can get rid of the warning by providing an appropriate value for separate()’s fill argument. (Note that "warn" is the default value of the fill argument.)
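The warning-free call is presumably the same as before, with fill = "right" added (a sketch; running it depends on the df_raw from the scrape above):

```r
# fill = "right" silently pads missing pieces with NA on the right,
# instead of warning first as the default fill = "warn" does.
df_raw %>%
  separate(name, into = paste0("x", 1:10), sep = "\\.", fill = "right")
```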
## # A tibble: 6,629 x 11
## x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 value
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 leagu~ id <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> 28
## 2 leagu~ uid <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> s:20~l:~
## 3 leagu~ name <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> Nationa~
## 4 leagu~ abbrev~ <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> NFL
## 5 leagu~ slug <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> nfl
## 6 leagu~ season year <NA> <NA> <NA> <NA> <NA> <NA> <NA> 2018
## 7 leagu~ season star~ <NA> <NA> <NA> <NA> <NA> <NA> <NA> 2018-08~
## 8 leagu~ season endD~ <NA> <NA> <NA> <NA> <NA> <NA> <NA> 2019-02~
## 9 leagu~ season type id <NA> <NA> <NA> <NA> <NA> <NA> 2
## 10 leagu~ season type type <NA> <NA> <NA> <NA> <NA> <NA> 2
## # ... with 6,619 more rows
However, while this action gets rid of the warning, it does not actually resolve the underlying issue—specifying the correct number of columns to create with separate(). We can do that by identifying the name with the greatest number of “.”-separated pieces.
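The counting code isn’t shown; one way to do it is sketched below with a hypothetical helper, count_pieces() (not a name from the original post). Applied to the real df_raw$name, it should return the 7 printed below:

```r
library(stringr)
library(purrr)

# Split each name on literal dots and return the length of the
# longest result, i.e. the maximum number of "."-separated pieces.
count_pieces <- function(x) {
  max(map_int(str_split(x, "\\."), length))
}

# e.g. with two names taken from `df_raw`:
count_pieces(c("leagues.id", "leagues.season.type.id"))
# → 4
```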
## [1] 7
With this number (7) identified, we can now choose the “correct” number of columns to create with separate(). Note that we’ll still be left with lots of NA values (corresponding to rows that don’t have the maximum number of variables). This is expected.
nms_sep <- paste0("name", 1:n_cols_max)
df_sep <-
df_raw %>%
separate(name, into = nms_sep, sep = rgx_split, fill = "right")
df_sep
## # A tibble: 6,629 x 8
## name1 name2 name3 name4 name5 name6 name7 value
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 leagues id <NA> <NA> <NA> <NA> <NA> 28
## 2 leagues uid <NA> <NA> <NA> <NA> <NA> s:20~l:28
## 3 leagues name <NA> <NA> <NA> <NA> <NA> National Football ~
## 4 leagues abbreviat~ <NA> <NA> <NA> <NA> <NA> NFL
## 5 leagues slug <NA> <NA> <NA> <NA> <NA> nfl
## 6 leagues season year <NA> <NA> <NA> <NA> 2018
## 7 leagues season startDa~ <NA> <NA> <NA> <NA> 2018-08-02T07:00Z
## 8 leagues season endDate <NA> <NA> <NA> <NA> 2019-02-06T07:59Z
## 9 leagues season type id <NA> <NA> <NA> 2
## 10 leagues season type type <NA> <NA> <NA> 2
## # ... with 6,619 more rows
By my interpretation, this df_sep variable is in a tidy format. (Of course, there is still lots of cleaning to be done before it can actually be useful!)
Getting the raw data into the format of df_sep is what I primarily wanted to show. Nonetheless, there’s more to the story! (Reminder: We’re seeking to get the scores from the 16 games in week 1 of the NFL’s 2018 regular season.) How can we work with the NAs to get a final format that is actually presentable?
We continue by filtering the tibble for only the rows that we will need.
df_filt <-
df_sep %>%
filter(
(
name1 == "events" &
name2 == "shortName"
) |
(
name1 == "events" &
name2 == "competitions" &
name3 == "date"
) | (
name1 == "events" &
name2 == "competitions" &
name3 == "status" &
name4 == "type" &
name5 == "name"
) |
(
name1 == "events" &
name2 == "competitions" &
name3 == "competitors" &
name4 == "score"
)
)
df_filt
## # A tibble: 80 x 8
## name1 name2 name3 name4 name5 name6 name7 value
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 events shortName <NA> <NA> <NA> <NA> <NA> ATL @ PHI
## 2 events competitions date <NA> <NA> <NA> <NA> 2018-09-07T00:5~
## 3 events competitions competito~ score <NA> <NA> <NA> 18
## 4 events competitions competito~ score <NA> <NA> <NA> 12
## 5 events competitions status type name <NA> <NA> STATUS_FINAL
## 6 events shortName <NA> <NA> <NA> <NA> <NA> PIT @ CLE
## 7 events competitions date <NA> <NA> <NA> <NA> 2018-09-09T17:0~
## 8 events competitions competito~ score <NA> <NA> <NA> 21
## 9 events competitions competito~ score <NA> <NA> <NA> 21
## 10 events competitions status type name <NA> <NA> STATUS_FINAL
## # ... with 70 more rows
Next, we’ll create appropriately named columns for the values that we filtered for in the step above. 4
df_clean1 <-
df_filt %>%
select(name3, name4, name5, value) %>%
mutate(status = if_else(name5 == "name", value, NA_character_)) %>%
mutate(isscore = if_else(name4 == "score", TRUE, FALSE)) %>%
mutate(datetime = if_else(
name3 == "date",
str_replace_all(value, "\\s?T\\s?", " ") %>% str_replace("Z$", ""),
NA_character_
)) %>%
mutate(gm = if_else(
is.na(isscore) &
is.na(datetime) & is.na(status),
value,
NA_character_
))
df_clean1
## # A tibble: 80 x 8
## name3 name4 name5 value status isscore datetime gm
## <chr> <chr> <chr> <chr> <chr> <lgl> <chr> <chr>
## 1 <NA> <NA> <NA> ATL @ PHI <NA> NA <NA> ATL @ ~
## 2 date <NA> <NA> 2018-09-07~ <NA> NA 2018-09-07 ~ <NA>
## 3 competi~ score <NA> 18 <NA> TRUE <NA> <NA>
## 4 competi~ score <NA> 12 <NA> TRUE <NA> <NA>
## 5 status type name STATUS_FIN~ STATUS_F~ FALSE <NA> <NA>
## 6 <NA> <NA> <NA> PIT @ CLE <NA> NA <NA> PIT @ ~
## 7 date <NA> <NA> 2018-09-09~ <NA> NA 2018-09-09 ~ <NA>
## 8 competi~ score <NA> 21 <NA> TRUE <NA> <NA>
## 9 competi~ score <NA> 21 <NA> TRUE <NA> <NA>
## 10 status type name STATUS_FIN~ STATUS_F~ FALSE <NA> <NA>
## # ... with 70 more rows
With these columns created, we can use tidyr::fill() and dplyr::filter() in a strategic manner to get rid of all the NAs cluttering our tibble. Additionally, we can drop the dummy name columns that we created with the tidyr::separate() call before.
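The cleaning code isn’t shown; given the row order within each game (shortName, date, the two scores, then status), it’s presumably along these lines (a sketch that depends on the df_clean1 from above):

```r
df_clean2 <-
  df_clean1 %>%
  # Each game's "gm" and "datetime" rows come before its score rows,
  # so fill those downward; the "status" row comes after the scores,
  # so fill it upward.
  fill(gm, datetime, .direction = "down") %>%
  fill(status, .direction = "up") %>%
  # Keep only the score rows; filter() also drops rows where
  # `isscore` is NA.
  filter(isscore) %>%
  # Drop the dummy name* columns from the earlier separate() call.
  select(value, status, isscore, datetime, gm)
df_clean2
```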
## # A tibble: 32 x 5
## value status isscore datetime gm
## <chr> <chr> <lgl> <chr> <chr>
## 1 18 STATUS_FINAL TRUE 2018-09-07 00:55 ATL @ PHI
## 2 12 STATUS_FINAL TRUE 2018-09-07 00:55 ATL @ PHI
## 3 21 STATUS_FINAL TRUE 2018-09-09 17:00 PIT @ CLE
## 4 21 STATUS_FINAL TRUE 2018-09-09 17:00 PIT @ CLE
## 5 23 STATUS_FINAL TRUE 2018-09-09 17:00 CIN @ IND
## 6 34 STATUS_FINAL TRUE 2018-09-09 17:00 CIN @ IND
## 7 27 STATUS_FINAL TRUE 2018-09-09 17:00 TEN @ MIA
## 8 20 STATUS_FINAL TRUE 2018-09-09 17:00 TEN @ MIA
## 9 24 STATUS_FINAL TRUE 2018-09-09 17:00 SF @ MIN
## 10 16 STATUS_FINAL TRUE 2018-09-09 17:00 SF @ MIN
## # ... with 22 more rows
Finally, we can use a chain of {dplyr} actions to get a pretty output. I should note that everything up to this point would likely have an analogous action no matter what data set you are working with; these final actions, however, are unique to this specific data.
df_clean3 <-
df_clean2 %>%
group_by(gm) %>%
mutate(rn = row_number()) %>%
ungroup() %>%
mutate(tm_dir = if_else(rn == 1, "pts_home", "pts_away")) %>%
select(datetime, gm, tm_dir, value) %>%
spread(tm_dir, value) %>%
separate(gm, into = c("tm_away", "tm_home"), sep = "(\\s+\\@\\s+)|(\\s+vs.*\\s+)") %>%
mutate_at(vars(matches("pts")), funs(as.integer)) %>%
mutate(date = datetime %>% str_remove("\\s.*$") %>% lubridate::ymd()) %>%
mutate(time = datetime %>% lubridate::ymd_hm()) %>%
select(date, time, tm_home, tm_away, pts_home, pts_away)
df_clean3
## # A tibble: 16 x 6
## date time tm_home tm_away pts_home pts_away
## <date> <dttm> <chr> <chr> <int> <int>
## 1 2018-09-07 2018-09-07 00:55:00 PHI ATL 18 12
## 2 2018-09-09 2018-09-09 17:00:00 BAL BUF 47 3
## 3 2018-09-09 2018-09-09 17:00:00 IND CIN 23 34
## 4 2018-09-09 2018-09-09 17:00:00 NE HOU 27 20
## 5 2018-09-09 2018-09-09 17:00:00 NYG JAX 15 20
## 6 2018-09-09 2018-09-09 17:00:00 CLE PIT 21 21
## 7 2018-09-09 2018-09-09 17:00:00 MIN SF 24 16
## 8 2018-09-09 2018-09-09 17:00:00 NO TB 40 48
## 9 2018-09-09 2018-09-09 17:00:00 MIA TEN 27 20
## 10 2018-09-09 2018-09-09 20:05:00 LAC KC 28 38
## 11 2018-09-09 2018-09-09 20:25:00 CAR DAL 16 8
## 12 2018-09-09 2018-09-09 20:25:00 DEN SEA 27 24
## 13 2018-09-09 2018-09-09 20:25:00 ARI WSH 6 24
## 14 2018-09-10 2018-09-10 00:20:00 GB CHI 24 23
## 15 2018-09-10 2018-09-10 23:10:00 DET NYJ 17 48
## 16 2018-09-11 2018-09-11 02:20:00 OAK LAR 13 33
And there we have it! A nice, tidy tibble with the scores from week 1 of the 2018 NFL regular season.
Hopefully someone out there will find the technique(s) shown in this post to be useful for an endeavor of their own.
Personally, I find web scraping to be fascinating, so I doubt this will be the last time I write about something of this nature.
I use data.frame and tibble interchangeably. See this chapter of R for Data Science for more details about the differences/similarities between the two.↩︎
(See the webpage here: http://www.espn.com/nfl/scoreboard/_/year/2018/seasontype/2/week/1. Note that we won’t be scraping the HTML but, instead, the underlying JSON from which the HTML is generated.)↩︎
I say that it’s a “secret” because its API documentation is out of date.↩︎
I don’t recommend suffixing variable names with numbers as I do in the next couple of steps (i.e. variables suffixed with 1, 2, …). (It’s ugly!) In practice, you might do this during your exploratory phase of data scraping/analysis, but you should come up with more informative names and combine actions in a logical manner for your final script/package (in my opinion).↩︎
For attribution, please cite this work as
ElHabr (2018, Oct. 20). Tony: Converting nested JSON to a tidy data frame with R. Retrieved from https://itsmetoeknee.netlify.app/post/nested-json-to-tidy-data-frame-r/
BibTeX citation
@misc{elhabr2018converting,
  author = {ElHabr, Tony},
  title = {Tony: Converting nested JSON to a tidy data frame with R},
  url = {https://itsmetoeknee.netlify.app/post/nested-json-to-tidy-data-frame-r/},
  year = {2018}
}