In my job, I often work with data sampled at regular intervals. Samples may range from 5-minute intervals to daily intervals, depending on the specific task. While working with this kind of data is straightforward when its in a database (and I can use SQL), I have been in a couple of situations where the data is spread across .csv files. In these cases, I lean on `R`

to scrape and compile the data. Although some other languages might be superior for such a task, I often need to produce some kind of visualization or report at the end, so choosing to handle all of the data with `R`

is a no-brainer for me–I can easily transition to using `{ggplot2}`

and `{rmarkdown}`

to generate some pretty output.

## Minute-Level Data

When working with data sampled at 5-minute intervals (resulting in 288 intervals in a day), I’ve found that I’ve used a common “idiom” to generate a time-based “dictionary”. For example, here’s how I might create such a date dictionary for the year 2013. ^{1}

```
library("dplyr")
library("lubridate")
date_1 <- lubridate::ymd("2013-01-01")
date_2 <- lubridate::ymd("2013-12-31")
ymd_seq <- seq.Date(date_1, date_2, by = "day")
ymd_grid <-
data_frame(
ymd = lubridate::ymd(ymd_seq),
yyyy = lubridate::year(ymd_seq),
mm = lubridate::month(ymd_seq),
dd = lubridate::day(ymd_seq)
)
hhmmss <-
expand.grid(
hh = seq(1L, 24L, 1L),
min = seq(5L, 60L, by = 5L),
sec = 0L) %>%
as_tibble()
dates_dict <-
ymd_grid %>%
right_join(hhmmss %>% mutate(yyyy = lubridate::year(date_1), by = "year")) %>%
arrange(yyyy, mm, dd, hh, min)
dates_dict
```

```
## # A tibble: 105,120 x 8
## ymd yyyy mm dd hh min sec by
## <date> <dbl> <dbl> <int> <int> <int> <int> <chr>
## 1 2013-01-01 2013 1 1 1 5 0 year
## 2 2013-01-01 2013 1 1 1 10 0 year
## 3 2013-01-01 2013 1 1 1 15 0 year
## 4 2013-01-01 2013 1 1 1 20 0 year
## 5 2013-01-01 2013 1 1 1 25 0 year
## 6 2013-01-01 2013 1 1 1 30 0 year
## 7 2013-01-01 2013 1 1 1 35 0 year
## 8 2013-01-01 2013 1 1 1 40 0 year
## 9 2013-01-01 2013 1 1 1 45 0 year
## 10 2013-01-01 2013 1 1 1 50 0 year
## # ... with 1.051e+05 more rows
```

And, just to prove that there are 288 5-minute intervals for each day in the year, I can use `dplyr::count()`

twice in succession. ^{2}

`dates_dict %>% count(yyyy, mm, dd) %>% count()`

```
## # A tibble: 1 x 1
## nn
## <int>
## 1 365
```

I then extract data (from individual files) using the time-based dictionary as a “helper” for custom functions for creating file paths and processing the data after importing it.

After the dirty work of is done, I can transition to the fun part–exploring and interpreting the data. This process often turns out to be a cycle of visualization, data transformation, and modeling.

### An Example (With the `nycflights13`

Package)

To provide an example, I’ll use the `flights`

data set from the `{nycflight13}`

package. ^{3} This package includes information regarding all flights leaving from New York City airports in 2013, as well as information regarding weather, airlines, airports, and planes.

Let’s say that I that I’m interested in the average flight departure delay time at the JFK airport. I might hypothesize that there is a relationship between departure delay time with different time periods, such as hour in the day and days in the week.

First, I’ll perform the necessary transformation to the `flights`

data to investigate my hypothesis. Specifically, I need to create columns for hour and weekday. (For hour (`hh`

), I simply use the scheduled departure time (`sched_dep_time`

).)

```
library("nycflights13")
flights_jfk <-
nycflights13::flights %>%
filter(origin == "JFK") %>%
mutate(hh = round(sched_dep_time / 100, 0) - 1) %>%
mutate(ymd = lubridate::ymd(sprintf("%04.0f-%02.0f-%02.0f", year, month, day))) %>%
mutate(wd = lubridate::wday(ymd, label = TRUE))
```

Next, I might create a heat map plotting hours against weekdays.

To investigate the patterns more “scientifically”, I might perform a one-way Analysis of Variance (ANOVA) on different time variables. I would make sure to test time periods other than just weekday (`wd`

) and hour (`hh`

), such as `month`

and `day`

.

`summary(aov(dep_delay ~ month, data = flights_jfk))`

```
## Df Sum Sq Mean Sq F value Pr(>F)
## month 1 55213 55213 36.25 1.74e-09 ***
## Residuals 109414 166664445 1523
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 1863 observations deleted due to missingness
```

`summary(aov(dep_delay ~ day, data = flights_jfk))`

```
## Df Sum Sq Mean Sq F value Pr(>F)
## day 1 40 40.2 0.026 0.871
## Residuals 109414 166719617 1523.8
## 1863 observations deleted due to missingness
```

`summary(aov(dep_delay ~ wd, data = flights_jfk))`

```
## Df Sum Sq Mean Sq F value Pr(>F)
## wd 6 246401 41067 26.99 <2e-16 ***
## Residuals 109409 166473256 1522
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 1863 observations deleted due to missingness
```

`summary(aov(dep_delay ~ hh, data = flights_jfk))`

```
## Df Sum Sq Mean Sq F value Pr(>F)
## hh 1 5701754 5701754 3874 <2e-16 ***
## Residuals 109414 161017904 1472
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 1863 observations deleted due to missingness
```

The “statistically significant” p values for the `wd`

and `hh`

variables provides incentive to investigate them more closely. I might try an ANOVA F-statistic test comparing linear regression models using each variable as a lone predictor with a linear model where both are used as predictors.

```
lm_wd <- lm(dep_delay ~ wd, data = flights_jfk)
lm_hh <- lm(dep_delay ~ hh, data = flights_jfk)
lm_both <- lm(dep_delay ~ wd + hh, data = flights_jfk)
anova(lm_both, lm_wd, test = "F")
```

```
## Analysis of Variance Table
##
## Model 1: dep_delay ~ wd + hh
## Model 2: dep_delay ~ wd
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 109408 160778456
## 2 109409 166473256 -1 -5694800 3875.2 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

`anova(lm_both, lm_hh, test = "F")`

```
## Analysis of Variance Table
##
## Model 1: dep_delay ~ wd + hh
## Model 2: dep_delay ~ hh
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 109408 160778456
## 2 109414 161017904 -6 -239447 27.157 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

Of course, this process would go on. For instance, I would certainly need to investigate if there is a relationship between departure delay and specific airlines.

## Final Thoughts

Working with interval data was initially a challenge for me, but after working with it more and more often, I find that it’s not so bad after all. It gets more interesting when there is missing data or data samples at irregular intervals, but that’s a story for another day.

The technique that I show here would have to be adjusted slightly if working with more than one year at a time, but it wouldn’t be difficult to do so. I tend to only use this design pattern for one year at a time anyways.↩

By the way, how cool is following one

`dplyr::count()`

call with another? I only found out about that nice little trick recently.↩Thanks to Hadley Wickham for compiling this data package, as well as for developing the

`{lubridate}`

and`{ggplot2}`

packages.)↩