Druid SQL: BETWEEN considered harmful
When querying data in Druid (or another analytical database), your query will in almost all cases include a filter on the primary timestamp. And this timestamp filter will usually take the form of an interval.
The easiest way to describe such an interval seems to be the SQL `BETWEEN` operator.
Advice from a grug brained developer: Don’t do that.
Here’s why.
A harmless data sample
Imagine you have a table like this:
| __time | val |
|---|---|
| 2023-01-01T01:00:00.000Z | 1 |
| 2023-01-02T00:00:00.000Z | 1 |
| 2023-01-02T06:00:00.000Z | 1 |
| 2023-01-03T00:00:00.000Z | 1 |
| 2023-01-03T01:00:00.000Z | 1 |
| 2023-01-04T00:00:00.000Z | 1 |
| 2023-01-04T07:00:00.000Z | 1 |
You can populate such a table in Druid using SQL ingestion like so:
```sql
REPLACE INTO "sample" OVERWRITE ALL
WITH "ext" AS (
  SELECT *
  FROM TABLE(
    EXTERN(
      '{"type":"inline","data":"datetime,val\n2023-01-01 01:00:00,1\n2023-01-02 00:00:00,1\n2023-01-02 06:00:00,1\n2023-01-03 00:00:00,1\n2023-01-03 01:00:00,1\n2023-01-04 00:00:00,1\n2023-01-04 07:00:00,1"}',
      '{"type":"csv","findColumnsFromHeader":true}'
    )
  ) EXTEND ("datetime" VARCHAR, "val" BIGINT)
)
SELECT
  TIME_PARSE(TRIM("datetime")) AS "__time",
  "val"
FROM "ext"
PARTITIONED BY DAY
```
You want to list all rows for 2nd and 3rd January. You write:
```sql
SELECT * FROM "sample"
WHERE __time BETWEEN TIMESTAMP'2023-01-02' AND TIMESTAMP'2023-01-03'
```
And here’s the result:
| __time | val |
|---|---|
| 2023-01-02T00:00:00.000Z | 1 |
| 2023-01-02T06:00:00.000Z | 1 |
| 2023-01-03T00:00:00.000Z | 1 |
You notice that all the rows for 2nd January are in the result, but only one row for 3rd January. What happened?
The solution
We are being hit by two well-documented features here, which together create a minor footgun.
- The `BETWEEN` operator creates a closed interval, that is, it includes both the left and the right boundary value. This would by itself not be a problem, were it not for the second feature.
- The literal `TIMESTAMP'2023-01-03'` does not mean “the entire day of 3rd January”, as one might naïvely think. It is equivalent to “3rd January, 00:00”.
What we have done is create a query that includes the entire 2nd January, but only the data for exactly 00:00 on 3rd January!
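To make the two effects concrete, here is a small Python sketch (plain Python as an illustration, not Druid code) that applies the same closed-interval filter to the sample timestamps:

```python
from datetime import datetime

# The sample table's timestamps from above.
rows = [
    datetime(2023, 1, 1, 1, 0),
    datetime(2023, 1, 2, 0, 0),
    datetime(2023, 1, 2, 6, 0),
    datetime(2023, 1, 3, 0, 0),
    datetime(2023, 1, 3, 1, 0),
    datetime(2023, 1, 4, 0, 0),
    datetime(2023, 1, 4, 7, 0),
]

# BETWEEN semantics: a closed interval, both endpoints included.
lo = datetime(2023, 1, 2)  # TIMESTAMP'2023-01-02' = midnight
hi = datetime(2023, 1, 3)  # TIMESTAMP'2023-01-03' = midnight, NOT the whole day
matched = [t for t in rows if lo <= t <= hi]

for t in matched:
    print(t.isoformat())
# Only three rows match: both rows of 2nd January,
# but just the midnight row of 3rd January.
```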
You could fix this by writing something like `TIMESTAMP'2023-01-03 23:59:59'` for the right interval boundary. But does this really catch every last bit of the data for that day? What if you have fractional timestamps? Is your precision milliseconds, or even microseconds?
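A quick Python illustration (an assumption-free sketch, not Druid code) of why the last-whole-second trick is fragile with sub-second timestamps:

```python
from datetime import datetime

# Closed upper bound at the last whole second of the day.
hi = datetime(2023, 1, 3, 23, 59, 59)

# A perfectly valid millisecond-precision event later in that same second:
late_event = datetime(2023, 1, 3, 23, 59, 59, 500000)  # 23:59:59.500

# The event belongs to 3rd January, yet falls outside the closed interval.
print(late_event <= hi)  # False: this row would silently be dropped
```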
This is why I argue that the proper way to model such time filter conditions is to use a right-open interval, which includes the left boundary value but not the right boundary value. If you do that, you have to set the right boundary to the next day (4th January), in order to still catch all of 3rd January in your filter:
```sql
SELECT * FROM "sample"
WHERE __time >= TIMESTAMP'2023-01-02' AND __time < TIMESTAMP'2023-01-04'
```
This query returns the correct result:
| __time | val |
|---|---|
| 2023-01-02T00:00:00.000Z | 1 |
| 2023-01-02T06:00:00.000Z | 1 |
| 2023-01-03T00:00:00.000Z | 1 |
| 2023-01-03T01:00:00.000Z | 1 |
This way of filtering is also in line with the treatment of time intervals almost everywhere else in Druid. Segment time chunks, for instance, are defined as right-open intervals, too.
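As a plain-Python illustration (not Druid code), the same right-open filter over the sample timestamps picks up every row of both days, regardless of timestamp precision:

```python
from datetime import datetime

# The sample table's timestamps from above.
rows = [
    datetime(2023, 1, 1, 1, 0),
    datetime(2023, 1, 2, 0, 0),
    datetime(2023, 1, 2, 6, 0),
    datetime(2023, 1, 3, 0, 0),
    datetime(2023, 1, 3, 1, 0),
    datetime(2023, 1, 4, 0, 0),
    datetime(2023, 1, 4, 7, 0),
]

# Right-open interval: include the left boundary, exclude the right.
lo = datetime(2023, 1, 2)
hi = datetime(2023, 1, 4)  # the *next* day, so all of 3rd January is covered
matched = [t for t in rows if lo <= t < hi]

for t in matched:
    print(t.isoformat())
# All four rows of 2nd and 3rd January; midnight of 4th January is excluded.
```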
Edit 2023-11-06: Peter pointed out that you can instead use the `TIME_IN_INTERVAL` function. It uses ISO 8601 interval notation and creates exactly the right-open intervals we want. So a more elegant way to rewrite the query is:
```sql
SELECT * FROM "sample"
WHERE TIME_IN_INTERVAL(__time, '2023-01-02/2023-01-04')
```
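The `'start/end'` interval string splits naturally into a right-open pair. As an illustration only (a Python sketch, not how Druid implements it, and ignoring the duration forms ISO 8601 also allows, such as `'2023-01-02/P2D'`):

```python
from datetime import datetime

def parse_iso_interval(interval: str):
    """Split an ISO 8601 'start/end' interval into a (start, end) pair.

    The pair is interpreted as right-open: start <= t < end.
    Sketch only; duration forms like '2023-01-02/P2D' are not handled.
    """
    start_s, end_s = interval.split("/")
    return datetime.fromisoformat(start_s), datetime.fromisoformat(end_s)

start, end = parse_iso_interval("2023-01-02/2023-01-04")
print(start, end)

# A timestamp at exactly the end boundary is excluded:
print(datetime(2023, 1, 4) < end)  # False
```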
Learnings
- Don’t use the `BETWEEN` operator in SQL, especially not for time intervals. Because the operator creates an inclusive (closed) interval, the result may not be what you expect.
- Use a `WHERE` clause with simple comparison operators instead, to create a right-open interval.
"Grug" by PlatinumFusi0n is licensed under CC BY 3.0 .