Druid Cookbook

In my series about multi-value dimensions (MVD), I took you to a small Italian restaurant where we looked at this snippet of orders:

{ "date": "2021-09-25", "customer": "Fangjin", "orders": [ "pizza", "tiramisu", "espresso", "espresso" ] }
{ "date": "2021-09-25", "customer": "Gian", "orders": [ "pizza", "espresso", "tiramisu" ] }
{ "date": "2021-09-25", "customer": "Vadim", "orders": [ "pizza", "tiramisu" ] }
{ "date": "2021-09-25", "customer": "Rachel", "orders": [ "pizza", "tiramisu", "espresso" ] }
{ "date": "2021-09-25", "customer": "Xiaolan", "orders": [ "pizza", "espresso" ] }
{ "date": "2021-09-25", "customer": "Jessy", "orders": [ "pizza", "espresso", "espresso" ] }

We ingested the order list as an MVD, and we observed the “magic” behavior that implicitly unnests the MVD values inside a query that groups or filters by that dimension.

Since then, Druid has added more ways of representing structured data. An MVD is declared as a STRING dimension, but it behaves (mostly) like an array or a list of values. However, an array is also a valid JSON object and it can be ingested into Druid as a nested column. A lot of work is currently happening in this field.

This tutorial works with the Druid 25.0 quickstart.

Nested Columns

Ingest the above data sample using the classic batch ingestion wizard in the Druid console. When you come to the Configure schema stage, change the type of the orders dimension to json:

Select JSON type for column

Name the datasource ristorante_json and ingest the data.

First Query

Let’s just look at the table in SQL.

SELECT *
FROM "ristorante_json"

Select all

It doesn’t look too different from the MVD modeling. Only the little tree symbol next to the column name indicates that we are dealing with a nested column.

GROUP BY - Where’s the Magic?

Let’s try to GROUP like before:

SELECT 
  orders, 
  COUNT(*) AS "Count"
FROM "ristorante_json"
GROUP BY 1

Group by JSON error

Ouch. That did not work. We will have to extract individual fields from the nested column in order to do anything meaningful with the data. This one is legit:

SELECT 
  JSON_VALUE(orders, '$[0]'), 
  JSON_VALUE(orders, '$[1]'), 
  COUNT(*) AS "Count"
FROM "ristorante_json"
GROUP BY 1, 2

Group by individual fields works

Unfortunately, there is no builtin function (yet) to iterate over all the elements of a JSON array, so the usefulness of this idiom is limited. We could also transform the JSON column into a string:

SELECT 
  TO_JSON_STRING(orders), 
  COUNT(*) AS "Count"
FROM "ristorante_json"
GROUP BY 1

But these are simple STRINGs with no grouping magic.

Transform JSON into string

However, this is the path to a solution …

Here’s the Magic!

If we could parse the JSON string representation into an MVD, maybe the original grouping would work again. In an earlier post I mentioned that STRING_TO_MV() splits a string by a regular expression. This, and the ability to TRIM characters from the beginning and or end of a string, gives us all the tools we need.

  • We will have to strip leading and trailing square brackets, and also the double quotes. Therefore the strip set for TRIM will be ["].
  • We want to split only where a comma is enclosed by double quotes, possibly with whitespace after the comma. The regular expression for this is ",\s*".

Let’s first try the transformation:

SELECT
  customer,
  orders,
  STRING_TO_MV(
    TRIM('["]' FROM TO_JSON_STRING(orders)), 
    '",\s*"') AS orders_mv
FROM "ristorante_json"

Cast to MV

Note the different type symbol next to orders_mv.

And now, to the finale:

SELECT
  STRING_TO_MV(
    TRIM('["]' FROM TO_JSON_STRING(orders)), 
    '",\s*"') AS order_item,
  COUNT(customer) AS num_orders
FROM "ristorante_json"
GROUP BY 1

Group by MV

The group by behavior is what we expect from an MVD!

Learnings

  • Array fields can now be ingested as nested JSON columns into Druid.
  • However, one cannot GROUP BY a JSON column.
  • Individual elements can be extracted using JSON Path expressions, but for now there is no way to iterate over all the fields or array elements.
  • As a workaround, you can parse the JSON array into a multi-value dimension and use existing Druid functionality to unnest it.

This image is taken from Page 500 of Praktisches Kochbuch für die gewöhnliche und feinere Küche” by Medical Heritage Library, Inc. is licensed under CC BY-NC-SA 2.0 .