TL;DR InfluxDB Tech Tips - Missing Data with DISTINCT() & the INTO Clause, Unexpected Timestamps with MAX() & More
By
Regan Kuchan /
Product, Developer
Apr 13, 2017
Navigate to:
What happens when you cross a function with an INTO
clause or a GROUP BY time()
clause? Occasionally, some curious results. In this post, we describe the unexpected when working with InfluxQL functions in the SELECT
statement.
Missing data with DISTINCT() and the INTO clause
Q: I’m trying to write the results of the DISTINCT()
function to a different measurement. The only thing is - the results of the query without the INTO
clause and the results of the query with the INTO
clause are different. Is this the expected behavior? I have a feeling I’m missing something.
Just DISTINCT()
:
> SELECT DISTINCT("personality_type") FROM "twins"
name: twins
time distinct
---- --------
1970-01-01T00:00:00Z a
1970-01-01T00:00:00Z b <--- Three results!
1970-01-01T00:00:00Z uncategorizable
DISTINCT()
with INTO
:
> SELECT DISTINCT("personality_type") INTO "twin_types" FROM "twins"
name: result
time written
---- -------
1970-01-01T00:00:00Z 3 <--- Three written points!
> SELECT * FROM "twin_types"
name: twin_types
time distinct
---- --------
1970-01-01T00:00:00Z uncategorizable <--- Just one point!
A: The behavior that you’re seeing is the expected behavior; it comes down to how InfluxDB identifies a single point and how it handles duplicate points.
Notice that each result of the DISTINCT()
query without the INTO
clause has the same timestamp. InfluxDB assumes that points in the same series and with the same timestamp are duplicate points.
When you add the INTO
clause to the query, InfluxDB writes three duplicate points to the twin_types
measurement. When the system encounters duplicate points, it simply overwrites the previous point with the most recent point so you end up with only one result in the destination measurement. It’s a little unexpected but that’s the explanation for what you’re seeing.
Unexpected timestamps with MAX() and the GROUP BY time() clause
Q: I’m trying to find the maximum number of eggs for specific time intervals. The results in the max
column are accurate but the timestamps in the time
column aren’t what I expected to see. I’ve included a simplified version of my data below. Why do the timestamps seem off?
Raw data:
> SELECT * FROM "house"
name: house
time eggs
---- ----
2017-04-12T20:00:00Z 1 <--- Max value in the first 20-minute interval
2017-04-12T20:10:00Z 0
2017-04-12T20:20:00Z 3
2017-04-12T20:30:00Z 5 <--- Max value in the second 20-minute interval
My query:
> SELECT MAX("eggs") FROM "house" WHERE time >= '2017-04-12T20:00:00Z' AND time <= '2017-04-12T20:30:00Z' GROUP BY time(20m)
name: house
time max
---- ---
2017-04-12T20:00:00Z 1
2017-04-12T20:20:00Z 5 <--- I'd expect this timestamp to be 2017-04-12T20:30:00Z
A: The timestamps that you’re seeing are a consequence of the GROUP BY time()
clause. That clause automatically overrides MAX()
’s original timestamps with timestamps that mark the start of the relevant time interval.
In your case, the second timestamp in the results (2017-04-12T20:20:00Z
) refers to the interval between 2017-04-12T20:20:00Z
and just before 2017-04-12T20:30:00Z
, not to when the maximum value occurred in the raw data. Almost all* queries that use an InfluxQL function with a GROUP BY time()
clause overwrite timestamps in this way.
- The only exception is the
SAMPLE()
function. See theSAMPLE()
documentation for additional information.
Unexpected consistency with ELAPSED() and the GROUP BY time() clause
Q: I’m using ELAPSED()
to calculate the time gap between median values in my data. The results I get (see below) always show the same time gap. I’m having a hard time believing that my median values are always exactly ten minutes apart. Am I using the ELAPSED()
function correctly?
> SELECT ELAPSED(MEDIAN("clicks"),1m) FROM "puppies" WHERE time >= '2017-04-12T18:30:00Z' AND time <= '2017-04-12T19:10:00Z' GROUP BY time(10m)
name: puppies
time elapsed
---- -------
2017-04-12T18:40:00Z 10
2017-04-12T18:50:00Z 10
2017-04-12T19:00:00Z 10
A: You’re using the ELAPSED()
function correctly from a syntax perspective but your query wont give you the results that you’re looking for. When you use ELAPSED()
with a nested function and a GROUP BY time()
clause, InfluxDB always returns elapsed values that are equal to the GROUP BY time()
interval.
When you run your query, InfluxDB first calculates the results for the nested function (MEDIAN()
) at the specified GROUP BY time()
intervals and then it applies the ELAPSED()
function to those results. In the first step, the GROUP BY time()
clause automatically overrides MEDIAN()
’s original timestamps with timestamps that mark the start of the each GROUP BY time()
interval. When the system applies the ELAPSED()
function to the results of that first step, it encounters timestamps that are always ten minutes apart and not the original timestamps of the MEDIAN()
values.
What's next:
- Downloads for the TICK-stack are live on our "downloads" page.
- Deploy on the Cloud: Get started with a FREE trial of InfluxDB Cloud featuring fully-managed clusters, Kapacitor and Grafana.
- Deploy on Your Servers: Want to run InfluxDB clusters on your servers? Try a FREE 14-day trial. of InfluxDB Enterprise featuring an intuitive UI for deploying, monitoring, and rebalancing clusters, plus managing backups and restores.
- Tell Your Story: Over 300 companies have shared their story on how InfluxDB is helping them succeed. Submit your testimonial and get a limited edition hoodie as a thank you.