How to adjust time series data to land along specified intervals

Adjusting timestamps to be on the hour

The problem

The world of automated data recording demands absolute obedience to order. I’ve learned that if you don’t whip datasets into shape as early as possible, things will get out of control and you will have major headaches later. One great example of this is when dataloggers (for whatever reason) don’t record data at a sensible time (as in on the hour, or a multiple of 15 minutes past the hour).

The dataset

Here I’ll work with soil moisture data, where the the dataloggers were set up so that they log at hourly intervals starting from when the datalogger is initiated or data downloaded, not on the hour. This dataset consists of over a year’s worth of data, and to complicate things it has many gaps where batteries died or cables were chewed.

Looking at the first and last three rows of the dataset below, you can see that the number of minutes into the hour is different at the start of the dataset compared to the end. And the graph below shows the time series for moisture at a single depth. You may notice the many gaps that are present. Gaps make everything harder.

## # A tibble: 3 × 8
##   Tree  Timestamp              M1    M2    M3    M4    M5    M6
##   <chr> <dttm>              <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ET1   2023-07-11 11:05:00     0   7.2  24.4  12.3   5.7  18.8
## 2 ET1   2023-07-11 12:05:00     0   8.1  25    12.6   5.7  18.9
## 3 ET1   2023-07-11 13:05:00     0   9.1  25.3  12.5   5.7  18.9
## # A tibble: 3 × 8
##   Tree  Timestamp              M1    M2    M3    M4    M5    M6
##   <chr> <dttm>              <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ET1   2025-01-20 08:21:00  58.5  59.8  57.2    39  33.7  45.6
## 2 ET1   2025-01-20 09:21:00  58.4  59.8  57.2    39  33.7  45.6
## 3 ET1   2025-01-20 10:21:00  58.4  59.8  57.2    39  33.7  45.6

Why does this matter?

If you want to use this data in conjunction with other datasets that are recorded on the hour, you need to adjust the timestamp to allow a join. You could simply round down the timestamp, but that introduces a lag which can add noise to your analyses. The zoo package does have some great pre-built solutions to this problem, but I prefer to keep things in a tidyverse/lubridate framework rather than create ts or xts objects. So I wrote a few functions to handle this that can be piped onto the original dataset.

Step 1: Interpolation

This step does most of the work by creating a new timestamp column that behaves like you want it to, and interpolating the response values using the approx() function from base R. Approx() takes the following key inputs:
x: The original timestamp column, with unwieldy values that you want to correct. In our case it is the column named “Timestamp”
y: The response column, aka what you want to apply the interpolation to. We want to do this to six different columns, which makes things trickier
xout: The new timestamp column, which likely either lands on the hour, or on 15-minute intervals, depending on your dataset. We will create this using the seq() function to create a vector at hourly increments starting with the first value rounded down and the last value rounded up.

adjust_to_hour <- function(df, value_cols) {
  
  # Create hourly sequence
  time_range <- df %>%
    summarise(
      min_time = floor_date(min(Timestamp, na.rm = TRUE), "hour"),
      max_time = ceiling_date(max(Timestamp, na.rm = TRUE), "hour")
    )
  
  hourly_times <- seq(time_range$min_time, time_range$max_time, by = "hour")
  
  # Base tibble with timestamps
  result <- tibble(Tree = unique(df$Tree), Timestamp = hourly_times)
  
  # Add interpolated columns
  for (col in value_cols) {
    result[[col]] <- round(approx(x = as.numeric(df$Timestamp), 
                                  y = df[[col]], 
                                  xout = as.numeric(hourly_times), 
                                  rule = 2)$y, 1)
  }
  
  return(result)
}
soil_data2 <- soil_data %>% 
  adjust_to_hour(value_cols = colnames(soil_data)[3:8]) 

head(soil_data2, n = 3)
## # A tibble: 3 × 8
##   Tree  Timestamp              M1    M2    M3    M4    M5    M6
##   <chr> <dttm>              <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ET1   2023-07-11 11:00:00     0   7.2  24.4  12.3   5.7  18.8
## 2 ET1   2023-07-11 12:00:00     0   8    25    12.6   5.7  18.9
## 3 ET1   2023-07-11 13:00:00     0   9    25.3  12.5   5.7  18.9

Now we can see that the timestamps look much better! However, we have introduced a new problem that’s subtle and would be easy to miss. We’ve interpolated across the data gaps, thus making it look like data were being recorded when in reality the instruments were down. As often happens, we’ve taken 1 step forward and half a step back. We have to fix this!

Step 2: Getting rid of wrongly interpolated values

Here is a quick function that compares a timestamp to the next one, checks if it is more than 3 hours ahead, and if so it flags it as a gap. This creates a handy little dataframe where each gap gets a row, and there are columns identifying the start and end of the gaps.

find_gaps_soil <- function(data){
  
  data2 <- data %>%
    select(Tree, Timestamp) %>% 
    mutate(Timelead = lead(Timestamp, 1)) %>% 
    mutate(Diff = as.numeric(Timelead - Timestamp))
  
  gaps <- data2 %>% 
    filter(Diff > 180) %>% 
    rename(Start = Timestamp, End = Timelead) %>% 
    select(-Diff)
  
  gaps2 <- gaps %>% 
    mutate(Start = floor_date(Start, "hours"),
           End = ceiling_date(End, "hours")) %>% 
    mutate(Days = as.numeric(End - Start))
  
  return(gaps2)
}
# Note that this function must be applied on the original dataframe, because the gaps were unintentionally filled after creating soil.data2
gaps <- find_gaps_soil(soil_data) 

head(gaps, n = 3)
## # A tibble: 3 × 4
##   Tree  Start               End                  Days
##   <chr> <dttm>              <dttm>              <dbl>
## 1 ET1   2023-07-28 21:00:00 2023-08-15 14:00:00 17.7 
## 2 ET1   2023-10-09 03:00:00 2023-11-08 11:00:00 30.3 
## 3 ET1   2023-12-15 22:00:00 2023-12-21 14:00:00  5.67

The gaps dataframe created above is just a helper file. We still need to get it to talk to soil_data2 in some way to let it know that there are gaps. My favorite way to do this is to expand it and then join it on. Here is a generic function I wrote to expand a dataframe structured this way. I get a lot of mileage out of this function- it is very common in my analyses to need to transfer between a dataframe of start/end values and an expanded one that has a row for each timestamp in between.

expand_data_window <- function(x, y = "15 min"){
  
  results <- NULL
  for(j in 1:nrow(x)){
    to_bind <- data.frame(
      Timestamp = seq(from = x$Start[j],
                      to = x$End[j],
                      by = y)) %>%
      mutate(Data_flag = 1)
    results <- bind_rows(results, to_bind)
  }
  return(results)
}
# Expand the df. If no gaps, create an empty one with the right column headers
if(nrow(gaps) != 0){
  gaps_expanded <- gaps %>% 
    group_by(Tree) %>% 
    nest() %>% 
    mutate(Expanded = map(data, expand_data_window, "1 hour")) %>% 
    select(-data) %>% 
    unnest(cols = Expanded) %>% 
    ungroup() %>% 
    select(-Data_flag)
} else {
  gaps_expanded <- soil_data2 %>% 
    select(Tree, Timestamp) %>% 
    slice(0)
}
head(gaps_expanded)
## # A tibble: 6 × 2
##   Tree  Timestamp          
##   <chr> <dttm>             
## 1 ET1   2023-07-28 21:00:00
## 2 ET1   2023-07-28 22:00:00
## 3 ET1   2023-07-28 23:00:00
## 4 ET1   2023-07-29 00:00:00
## 5 ET1   2023-07-29 01:00:00
## 6 ET1   2023-07-29 02:00:00
soil_data3 <- soil_data2 %>% 
    anti_join(gaps_expanded, by = c("Tree", "Timestamp"))

Step 3: Re-introducing the gaps

In step 2, again we fixed one problem and then created another (smaller) problem. Now we’ve taken a total of 1 step forward and a quarter step back. In our original dataframe, missing values were explicit. In step 2, we made them implicit. This may or may not be a problem depending on the analysis, but the safest course of action is to keep them explicit. So we do one final step to re-introduce gaps and convert implicitly missing values to explicitly missing values.

# Make implicitly missing values explicit
timestamp_vector = seq(from = min(soil_data3$Timestamp, na.rm = T),
                       to = max(soil_data3$Timestamp, na.rm = T),
                       by = "1 hour")

soil_data4 <- soil_data3  %>% 
  group_by(Tree) %>% 
  complete(Timestamp = timestamp_vector) %>% 
  ungroup()

Now, looking at the first and last three rows, we can see that they look good. You would have to comb the dataset to confirm that gaps are there and are explicitly identified as NA, but trust me, that part worked too. Finally, we can look at the plot again. It looks almost exactly the same as the original one! All that work and we can barely tell the difference. Such is the nature of time series data work…

## # A tibble: 3 × 8
##   Tree  Timestamp              M1    M2    M3    M4    M5    M6
##   <chr> <dttm>              <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ET1   2023-07-11 11:00:00     0   7.2  24.4  12.3   5.7  18.8
## 2 ET1   2023-07-11 12:00:00     0   8    25    12.6   5.7  18.9
## 3 ET1   2023-07-11 13:00:00     0   9    25.3  12.5   5.7  18.9
## # A tibble: 3 × 8
##   Tree  Timestamp              M1    M2    M3    M4    M5    M6
##   <chr> <dttm>              <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ET1   2023-07-11 11:00:00     0   7.2  24.4  12.3   5.7  18.8
## 2 ET1   2023-07-11 12:00:00     0   8    25    12.6   5.7  18.9
## 3 ET1   2023-07-11 13:00:00     0   9    25.3  12.5   5.7  18.9

Comments

Leave a Reply

Your email address will not be published. Required fields are marked *