Adjusting timestamps to be on the hour
The problem
The world of automated data recording demands absolute obedience to order. I’ve learned that if you don’t whip datasets into shape as early as possible, things will get out of control and you will have major headaches later. One great example of this is when dataloggers (for whatever reason) don’t record data at a sensible time (as in on the hour, or a multiple of 15 minutes past the hour).
The dataset
Here I’ll work with soil moisture data, where the dataloggers were set up so that they log at hourly intervals starting from when the datalogger is initiated or data downloaded, not on the hour. This dataset consists of over a year’s worth of data, and to complicate things it has many gaps where batteries died or cables were chewed.
Looking at the first and last three rows of the dataset below, you can see that the number of minutes into the hour is different at the start of the dataset compared to the end. And the graph below shows the time series for moisture at a single depth. You may notice the many gaps that are present. Gaps make everything harder.
## # A tibble: 3 × 8
## Tree Timestamp M1 M2 M3 M4 M5 M6
## <chr> <dttm> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ET1 2023-07-11 11:05:00 0 7.2 24.4 12.3 5.7 18.8
## 2 ET1 2023-07-11 12:05:00 0 8.1 25 12.6 5.7 18.9
## 3 ET1 2023-07-11 13:05:00 0 9.1 25.3 12.5 5.7 18.9
## # A tibble: 3 × 8
## Tree Timestamp M1 M2 M3 M4 M5 M6
## <chr> <dttm> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ET1 2025-01-20 08:21:00 58.5 59.8 57.2 39 33.7 45.6
## 2 ET1 2025-01-20 09:21:00 58.4 59.8 57.2 39 33.7 45.6
## 3 ET1 2025-01-20 10:21:00 58.4 59.8 57.2 39 33.7 45.6
Why does this matter?
If you want to use this data in conjunction with other datasets that are recorded on the hour, you need to adjust the timestamp to allow a join. You could simply round down the timestamp, but that introduces a lag which can add noise to your analyses. The zoo package does have some great pre-built solutions to this problem, but I prefer to keep things in a tidyverse/lubridate framework rather than create ts or xts objects. So I wrote a few functions to handle this that can be piped onto the original dataset.
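To make that lag concrete, here is a minimal sketch with a made-up timestamp (using lubridate, which the rest of this post leans on anyway): a reading logged at 11:05 gets relabelled as 11:00, so after a join it sits next to measurements that were actually taken five minutes earlier than it was.
library(lubridate)   # if not already loaded
# A made-up reading logged five minutes past the hour
ts <- ymd_hm("2023-07-11 11:05")
floor_date(ts, "hour")   # returns 2023-07-11 11:00:00, so the value is now stamped 5 minutes early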
Step 1: Interpolation
This step does most of the work: it creates a new timestamp column that behaves the way you want it to, and interpolates the response values onto it using the approx() function from base R. approx() takes the following key inputs (a toy example follows the list):
x: The original timestamp column, with the unwieldy values that you want to correct. In our case it is the column named “Timestamp”.
y: The response column, aka what you want to apply the interpolation to. We want to do this to six different columns, which makes things trickier.
xout: The new timestamp column, which likely lands either on the hour or on 15-minute intervals, depending on your dataset. We will create it with the seq() function, building a vector at hourly increments that starts at the first timestamp rounded down to the hour and ends at the last timestamp rounded up.
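As a toy illustration of how these arguments fit together (made-up numbers, not the soil data): approx() linearly interpolates y at whatever x positions you request, and rule = 2 carries the first and last observed values past the ends instead of returning NA.
# Readings logged 5 minutes past each hour, expressed in seconds for simplicity
x <- c(300, 3900, 7500)   # 00:05, 01:05, 02:05
y <- c(10, 12, 16)
# Ask for values exactly on the hour (00:00, 01:00, 02:00)
approx(x = x, y = y, xout = c(0, 3600, 7200), rule = 2)$y
## [1] 10.00000 11.83333 15.66667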
adjust_to_hour <- function(df, value_cols) {
# Create hourly sequence
time_range <- df %>%
summarise(
min_time = floor_date(min(Timestamp, na.rm = TRUE), "hour"),
max_time = ceiling_date(max(Timestamp, na.rm = TRUE), "hour")
)
hourly_times <- seq(time_range$min_time, time_range$max_time, by = "hour")
# Base tibble with timestamps
result <- tibble(Tree = unique(df$Tree), Timestamp = hourly_times)
# Add interpolated columns
for (col in value_cols) {
result[[col]] <- round(approx(x = as.numeric(df$Timestamp),
y = df[[col]],
xout = as.numeric(hourly_times),
rule = 2)$y, 1)
}
return(result)
}
soil_data2 <- soil_data %>%
adjust_to_hour(value_cols = colnames(soil_data)[3:8])
head(soil_data2, n = 3)
## # A tibble: 3 × 8
## Tree Timestamp M1 M2 M3 M4 M5 M6
## <chr> <dttm> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ET1 2023-07-11 11:00:00 0 7.2 24.4 12.3 5.7 18.8
## 2 ET1 2023-07-11 12:00:00 0 8 25 12.6 5.7 18.9
## 3 ET1 2023-07-11 13:00:00 0 9 25.3 12.5 5.7 18.9
Now we can see that the timestamps look much better! However, we have introduced a new problem that’s subtle and would be easy to miss. We’ve interpolated across the data gaps, thus making it look like data were being recorded when in reality the instruments were down. As often happens, we’ve taken one step forward and half a step back. We have to fix this!
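If you want to convince yourself that this happened, here is a quick spot-check (not part of the pipeline): after interpolation, soil_data2 is perfectly regular and has no missing values at all, even across the windows when the instruments were down.
# Spot-check: every consecutive pair of timestamps is exactly 1 hour apart and nothing is NA
soil_data2 %>%
  mutate(Step = as.numeric(difftime(lead(Timestamp), Timestamp, units = "hours"))) %>%
  summarise(all_hourly = all(Step == 1, na.rm = TRUE),
            missing_M1 = sum(is.na(M1)))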
Step 2: Getting rid of wrongly interpolated values
Here is a quick function that compares each timestamp to the next one, checks whether the next one is more than 3 hours ahead, and if so flags that interval as a gap. The result is a handy little dataframe with one row per gap and columns identifying the start and end of each gap.
find_gaps_soil <- function(data){
  data2 <- data %>%
    select(Tree, Timestamp) %>%
    mutate(Timelead = lead(Timestamp, 1)) %>%
    # Time to the next record, in minutes (explicit units, so the threshold below is unambiguous)
    mutate(Diff = as.numeric(difftime(Timelead, Timestamp, units = "mins")))
  gaps <- data2 %>%
    filter(Diff > 180) %>%   # anything more than 3 hours to the next record is a gap
    rename(Start = Timestamp, End = Timelead) %>%
    select(-Diff)
  gaps2 <- gaps %>%
    mutate(Start = floor_date(Start, "hours"),
           End = ceiling_date(End, "hours")) %>%
    mutate(Days = as.numeric(difftime(End, Start, units = "days")))
  return(gaps2)
}
# Note that this function must be applied to the original dataframe, because the gaps were unintentionally filled when creating soil_data2
gaps <- find_gaps_soil(soil_data)
head(gaps, n = 3)
## # A tibble: 3 × 4
## Tree Start End Days
## <chr> <dttm> <dttm> <dbl>
## 1 ET1 2023-07-28 21:00:00 2023-08-15 14:00:00 17.7
## 2 ET1 2023-10-09 03:00:00 2023-11-08 11:00:00 30.3
## 3 ET1 2023-12-15 22:00:00 2023-12-21 14:00:00 5.67
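One detail worth calling out: subtracting two POSIXct vectors gives a difftime whose units R picks automatically from the data, so a bare as.numeric() can come back in hours for one dataset and minutes for another. Spelling the units out, as in the function above, keeps the 180-minute threshold and the Days column meaning what their names say. A tiny illustration:
t1 <- ymd_hms("2023-07-11 11:05:00")
t2 <- ymd_hms("2023-07-11 12:05:00")
as.numeric(t2 - t1)                            # 1, because R happens to choose hours here
as.numeric(difftime(t2, t1, units = "mins"))   # 60, regardless of the data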
The gaps dataframe created above is just a helper file. We still need to get it to talk to soil_data2 somehow, so that soil_data2 knows where the gaps are. My favorite way to do this is to expand it and then join it on. Here is a generic function I wrote to expand a dataframe structured this way. I get a lot of mileage out of this function; it is very common in my analyses to need to move between a dataframe of start/end values and an expanded one that has a row for each timestamp in between.
expand_data_window <- function(x, y = "15 min"){
  results <- NULL
  # For each Start/End row, build one row per timestamp from Start to End (inclusive)
  for(j in 1:nrow(x)){
    to_bind <- data.frame(
      Timestamp = seq(from = x$Start[j],
                      to = x$End[j],
                      by = y)) %>%
      mutate(Data_flag = 1)   # flag column so expanded rows are easy to spot after a join
    results <- bind_rows(results, to_bind)
  }
  return(results)
}
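Before wiring expand_data_window() into the pipeline below, here is roughly what a single call looks like, using the first row of the gaps dataframe from above:
# Expand just the first gap at hourly resolution: one row per hour from Start to End
# (here 2023-07-28 21:00 through 2023-08-15 14:00), each carrying Data_flag = 1
expand_data_window(gaps[1, ], y = "1 hour") %>%
  head(n = 3)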
# Expand the df. If no gaps, create an empty one with the right column headers
if(nrow(gaps) != 0){
gaps_expanded <- gaps %>%
group_by(Tree) %>%
nest() %>%
mutate(Expanded = map(data, expand_data_window, "1 hour")) %>%
select(-data) %>%
unnest(cols = Expanded) %>%
ungroup() %>%
select(-Data_flag)
} else {
gaps_expanded <- soil_data2 %>%
select(Tree, Timestamp) %>%
slice(0)
}
head(gaps_expanded)
## # A tibble: 6 × 2
## Tree Timestamp
## <chr> <dttm>
## 1 ET1 2023-07-28 21:00:00
## 2 ET1 2023-07-28 22:00:00
## 3 ET1 2023-07-28 23:00:00
## 4 ET1 2023-07-29 00:00:00
## 5 ET1 2023-07-29 01:00:00
## 6 ET1 2023-07-29 02:00:00
# Drop every hourly timestamp that falls inside a known gap
soil_data3 <- soil_data2 %>%
  anti_join(gaps_expanded, by = c("Tree", "Timestamp"))
Step 3: Re-introducing the gaps
In step 2, we once again fixed one problem and created another (smaller) one. Now we’ve taken a total of one step forward and a quarter step back. After the anti_join, the rows inside the gaps are simply absent, so the missing values are implicit, just as they were in the original dataframe. This may or may not be a problem depending on the analysis, but the safest course of action is to make them explicit. So we do one final step to re-introduce the gap timestamps and convert the implicitly missing values into explicit NAs.
# Make implicitly missing values explicit
timestamp_vector = seq(from = min(soil_data3$Timestamp, na.rm = T),
to = max(soil_data3$Timestamp, na.rm = T),
by = "1 hour")
soil_data4 <- soil_data3 %>%
group_by(Tree) %>%
complete(Timestamp = timestamp_vector) %>%
ungroup()
Now, looking at the first three rows, we can see that they look good. You would have to comb the dataset to confirm that the gaps are there and are explicitly identified as NA, but trust me, that part worked too (or run the quick spot-check below). Finally, we can look at the plot again. It looks almost exactly the same as the original one! All that work and we can barely tell the difference. Such is the nature of time series data work…
## # A tibble: 3 × 8
## Tree Timestamp M1 M2 M3 M4 M5 M6
## <chr> <dttm> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ET1 2023-07-11 11:00:00 0 7.2 24.4 12.3 5.7 18.8
## 2 ET1 2023-07-11 12:00:00 0 8 25 12.6 5.7 18.9
## 3 ET1 2023-07-11 13:00:00 0 9 25.3 12.5 5.7 18.9
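And if you would rather not just trust me, here is a quick spot-check against the first gap window, with the dates pulled straight from the gaps dataframe built in step 2:
# Every hourly row strictly inside the first known outage should now be an explicit NA
soil_data4 %>%
  filter(Timestamp > gaps$Start[1], Timestamp < gaps$End[1]) %>%
  summarise(n_rows = n(), all_missing = all(is.na(M1)))   # expect all_missing = TRUE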