Backfill Missing Time Series With SQL
By
Jessica Wachtel /
Product, Developer
May 03, 2023
Navigate to:
Time series data streams are often noisy and irregular. But it doesn’t matter if the cause of the irregularity is a network error, jittery sensor, or power outage – advanced analytical tools, machine learning, and artificial intelligence models require their data inputs to include data sets with fixed time intervals. This makes the process of filling in all missing rows and values a necessary part of the data cleaning and basic analysis process.
SQL became InfluxDB’s main query language as we fully embraced the Apache ecosystem, but we had to build some functionality to fill in the gaps between the traditional capabilities of SQL and what time series data workloads require. One of the custom query functions we built to address missing data and inconsistent time windows is date_bin_gapfill
.
Adding rows
date_bin_gapfill
calculates the time interval based on specific parameters determined by the user, and if no rows exist within a certain time interval, the function inserts a new row with a time value set to the time interval start with a null
data value.
Sample query:
SELECT
date_bin_gapfill( '30 minutes', time) as _time,
room,
avg(temp) as temp
FROM home
WHERE
time >= '2023-04-27T08:00:00Z'
AND time <= '2023-04-27T10:00:00Z'
GROUP BY _time, room
Results:
Adding rows and inserting data
Interpolate
Adding interpolate
inside the date_bin_gapfill
function replaces the null data values with linearly interpolated values between the last value and the next value. Be sure to place this before the aggregation function (avg
in the query below) so that it includes all values when it runs.
Sample query:
SELECT
date_bin_gapfill('30 minutes', time) as _time,
room,
interpolate(avg(temp))
FROM home
WHERE
time >= '2023-04-27T08:00:00Z'
AND time <= '2023-04-27T10:00:00Z'
GROUP BY _time, room
Results:
LOCF
locf
stands for last observation carried forward. Adding locf
before the aggregation function replaces the null
values with the last value.
SELECT
date_bin_gapfill( '30 minutes', time) as _time,
room,
locf(avg(temp))
FROM home
WHERE
time >= '2023-04-27T08:00:00Z'
AND time <= '2023-04-27T10:00:00Z'
GROUP BY _time, room
Results:
Conclusion
This demo video shows the powerful function in action.
For more details on working with SQL and time series data, check out our docs.
To start building something cool with InfluxDB and SQL, sign up for your free cloud account today.