Druid Data Cookbook: About SQL NULL
In the latest versions of Druid, handling of logical expressions, and in particular of NULL (unknown) values, has been changed to match the SQL standard. This leads to some behavior that may be surprising to long time Druid users. Let’s take a look at some examples!
Server configuration settings that affect NULL handling
There are three configuration settings that affect NULL handling:
druid.generic.useDefaultValueForNull
druid.expressions.useStrictBooleans
druid.generic.useThreeValueLogicForNativeFilters
The Status tile in the Druid web console indicates whether Druid is configured in SQL compliant mode; if you hover over the corresponding text, it shows a detailed breakdown of the settings:
What do these settings do? Let’s look at them in a bit more detail.
useDefaultValueForNull
Originally, Druid did not have a separate representation for empty or NULL values. A NULL value would be simply treated as an empty string, or as a numeric 0 (zero); and it would be equivalent to these default values for all intents and purposes.
The new default for this setting is false.
useStrictBooleans
This setting now defaults to true. If set, it forces the result of all logical expressions to be 0 or 1. Older versions of Druid would keep the original values of input parameters, since anything that was not 0 or an empty string would be considered a true value. (This resembles the way logical expressions are handled in some scripting languages.)
useThreeValueLogicForNativeFilters
This setting defaults to true, too. Its effect is that NULL is kept as a distinct value that is separate from either true or false, in all logical evaluations. Any expression that contains a NULL value would yield a result of NULL, too. This has a number of interesting ramifications. Let’s look at some of them today!
This tutorial can be done using the Druid 29.0.1 quickstart.
Ingestion
First of all, let’s create a very simple data set.
The table I am going to use has but four rows of data. There is a column color, which can be either a string, or NULL.
Here is the Druid SQL to create the sample data set:
REPLACE INTO "inline_data" OVERWRITE ALL
WITH "ext" AS (
SELECT *
FROM TABLE(
EXTERN(
'{"type":"inline","data":"{\"id\": 1, \"color\": \"red\"}\n{\"id\": 1, \"color\": null}\n{\"id\": 1, \"color\": \"blue\"}\n{\"id\": 1, \"color\": \"green\"}\n"}',
'{"type":"json"}'
)
) EXTEND ("id" BIGINT, "color" VARCHAR)
)
SELECT
TIMESTAMP '2000-01-01 00:00:00' AS "__time",
"id",
"color"
FROM "ext"
PARTITIONED BY ALL
If you ingest this data set and list the entire table, you should get something like this:
Note the NULL value in the color column.
Let’s run some more queries now!
Comparing against single values
Since a comparison with NULL is never true, NULL is not even considered equal to itself. This is why the following query
yields no rows in return. Even more, if you change the condition to WHERE color <> NULL
you get nothing, too!
You have to use the operators IS NULL
and IS NOT NULL
, instead.
In a similar vein, let’s get all the entries whose color is not red. Naïvely, we try:
We get only the blue and green entries - the NULL value is, again, not caught by the operator.
You could construct a combined filter clause handling the NULL case separately. But there is a special operator that allows one to treat NULL values like regular values:
The IS DISTINCT FROM
operator does what we need: it treats NULL values as equivalent and distinct from any other value.
Comparing against multiple values
How about filtering multiple values with an IN
clause? Let’s try to retrieve only those rows that have color red or NULL:
After the previous experiments, this should not come as a surprise: The query returns only the row for red. But what if we invert the condition?
This one returns no rows at all! Supposedly, the comparison with anything that contains NULL would always give a NULL result, and in fact the entire filter is optimized out of the query plan.
There is a workaround though. Instead of using a list with IN
, we can also try an array literal like so:
This gives the same result as the IN
filter. But if we invert the filter condition, we get something that is more along the lines of the expectation:
Learnings
- Druid’s handling of unknown values has been made SQL compliant.
- This can lead to unexpected results since any comparison with a NULL value yields a NULL value itself: NULL is equal to nothing, but is also not equal to nothing - not even to itself!
- In order to handle NULL values properly, special operators exist, such as
IS NULL
andIS DISTINCT FROM
. - Beware of NULL values in
IN ()
filter clauses! Using an array literal instead can help.
“This image is taken from Page 500 of Praktisches Kochbuch für die gewöhnliche und feinere Küche” by Medical Heritage Library, Inc. is licensed under CC BY-NC-SA 2.0 .