Druid 29 Preview: Handling Nested Arrays
Imagine you have a data sample like this:
{'id': 93, 'shop': 'Circular Pi Pizzeria', 'name': 'David Murillo', 'phoneNumber': '305-351-2631', 'address': '746 Chelsea Plains Suite 656\nNew Richard, MA 16940', 'pizzas': [{'pizzaName': 'Salami', 'additionalToppings': ['π₯ bacon']}], 'timestamp': 1702815411410}
{'id': 94, 'shop': 'Marios Pizza', 'name': 'Darius Roach', 'phoneNumber': '344.571.9608x0590', 'address': '58235 Robert Cliffs\nAguilarland, PR 76249', 'pizzas': [{'pizzaName': 'Diavola', 'additionalToppings': []}, {'pizzaName': 'Salami', 'additionalToppings': ['π§ garlic']}, {'pizzaName': 'Peperoni', 'additionalToppings': ['π« olives', 'π§
onion', 'π
tomato', 'π strawberry']}, {'pizzaName': 'Diavola', 'additionalToppings': ['π« olives', 'π banana', 'π pineapple']}, {'pizzaName': 'Margherita', 'additionalToppings': ['π strawberry', 'π pineapple', 'π₯ egg', 'π tuna', 'π tuna']}, {'pizzaName': 'Margherita', 'additionalToppings': ['π₯ egg']}, {'pizzaName': 'Margherita', 'additionalToppings': ['π« green peppers', 'π₯ egg', 'π₯ egg']}, {'pizzaName': 'Peperoni', 'additionalToppings': []}, {'pizzaName': 'Salami', 'additionalToppings': []}], 'timestamp': 1702815415518}
{'id': 95, 'shop': 'Mammamia Pizza', 'name': 'Ryan Juarez', 'phoneNumber': '(041)278-5690', 'address': '934 Melissa Lights\nPaulland, UT 40700', 'pizzas': [{'pizzaName': 'Marinara', 'additionalToppings': ['π« green peppers', 'π§
onion']}, {'pizzaName': 'Marinara', 'additionalToppings': ['π
tomato', 'π₯ bacon', 'π banana', 'πΆοΈ hot pepper']}, {'pizzaName': 'Peperoni', 'additionalToppings': ['π strawberry', 'π banana', 'π tuna', 'π§ blue cheese']}, {'pizzaName': 'Marinara', 'additionalToppings': ['π tuna', 'π§
onion', 'π pineapple', 'π strawberry']}, {'pizzaName': 'Mari & Monti', 'additionalToppings': ['π« olives', 'π tuna']}, {'pizzaName': 'Marinara', 'additionalToppings': ['π pineapple', 'π
tomato', 'π banana', 'π§ blue cheese', 'π« olives']}, {'pizzaName': 'Marinara', 'additionalToppings': ['π banana', 'π« green peppers', 'π§ garlic', 'π
tomato']}], 'timestamp': 1702815418643}
I created the data sample using Francescoβs pizza simulator. The structure of these simulated pizza orders is quite deeply nested:
- Each order has a field
pizzas
, which is an array of JSON objects. - Each individual pizza item has
- a
pizzaName
field, which is a string additionalToppings
, an array of strings that may be empty.
- a
Arrays of objects are a bit obtuse, and I would like to create a data model that breaks down the orders so that each row in Druid represents a line item (a single pizza.)
To that end, it would be nice to use some combination of JSON functions and UNNEST
during ingestion. But how exactly? Letβs find out!
Getting set up
This is a sneak peek into Druid 29 functionality. In order to use the new functions, you can (as of the time of writing) build Druid from the HEAD of the master branch:
git clone https://github.com/apache/druid.git
cd druid
mvn clean install -Pdist -DskipTests
Then follow the instructions to locate and install the tarball. Make sure you have the druid-multi-stage-query
extension enabled.
In this tutorial, you will
- examine how to model deeply nested JSON data with arrays in Druid and
- break down a nested JSON array into individual rows using new functionality that is currently being built.
Disclaimer: This tutorial uses undocumented functionality and unreleased code. This blog is neither endorsed by Imply nor by the Apache Druid PMC. It merely collects the results of personal experiments. The features described here might, in the final release, work differently, or not at all. In addition, the entire build, or execution, may fail. Your mileage may vary.
The data
Right now, the technique we are looking at is limited to batch ingestion. So, we need to capture the simulator data in a file.
I assume you have a local Kafka service at localhost:9092.
Check out the pizza simulator and run it like so:
python3 main.py --security-protocol PLAINTEXT --host localhost --port 9092 --topic-name pizza-orders --nr-messages 0 --max-waiting-time 5
Capture the output using kcat
and redirect to a file:
kcat -b localhost:9092 -t pizza-orders >>./pizza-orders.json
You can stop the simulator after a while and use the pizza-orders.json
file as input for the next steps.
Basic ingestion: the pizza-orders
table
Letβs start by setting up a naΓ―ve data model using the web console wizard. Note how in the SQL view, the type of the pizzas
field is somewhat correctly recognized as a COMPLEX<json>
but it does not know about the array structure:
Here is the ingestion query using MSQ:
REPLACE INTO "pizza-orders" OVERWRITE ALL
WITH "ext" AS (
SELECT *
FROM TABLE(
EXTERN(
'{"type":"local","baseDir":"/Users/hellmarbecker/meetup-talks/jsonarray","filter":"*json"}',
'{"type":"json"}'
)
) EXTEND ("id" BIGINT, "shop" VARCHAR, "name" VARCHAR, "phoneNumber" VARCHAR, "address" VARCHAR, "pizzas" TYPE('COMPLEX<json>'), "timestamp" BIGINT)
)
SELECT
MILLIS_TO_TIMESTAMP("timestamp") AS "__time",
"id",
"shop",
"name",
"phoneNumber",
"address",
"pizzas"
FROM "ext"
PARTITIONED BY DAY
When we query this table, we see that indeed we have a general nested column here - it is not marked as an array
We can look at the detailed values in the column
Again, what we would like is a table model where each row represents a line item, i. e. an individual pizza!
First attempt at breaking down the line items
Letβs try to craft a new ingestion query that breaks down the line items using UNNEST
. We want to unnest the line items using something like UNNEST(JSON_QUERY(pizzas, '$'))
, and then extract the individual fields into separate columns: JSON_VALUE(p, '$.pizzaName') AS pizzaName
and so forth.
Hereβs the first attempt at such a query:
REPLACE INTO "pizza-lineitems" OVERWRITE ALL
WITH "ext" AS (
SELECT *
FROM TABLE(
EXTERN(
'{"type":"local","baseDir":"/Users/hellmarbecker/meetup-talks/jsonarray","filter":"*json"}',
'{"type":"json"}'
)
) EXTEND ("id" BIGINT, "shop" VARCHAR, "name" VARCHAR, "phoneNumber" VARCHAR, "address" VARCHAR, "pizzas" TYPE('COMPLEX<json>'), "timestamp" BIGINT)
)
SELECT
MILLIS_TO_TIMESTAMP("timestamp") AS "__time",
"id",
"shop",
"name",
"phoneNumber",
"address",
JSON_VALUE(p, '$.pizzaName') AS pizzaName,
JSON_QUERY(p, '$.additionalToppings') AS additionalToppings
FROM "ext" CROSS JOIN UNNEST(JSON_QUERY(pizzas, '$')) AS lineitems(p)
PARTITIONED BY DAY
This, unfortunately, fails with a screaming error message:
We cannot unnest arrays of objects just like arrays of primitives! But why is that? Look at the error message more closely: Druid thinks this is a call to UNNEST(COMPLEX<JSON>)
. So, JSON_QUERY
doesnβt know about the array nature of its output. What now?
A new function: JSON_QUERY_ARRAY
The Druid team has added a new function that does just the right thing for our case:
JSON_QUERY_ARRAY(expr, path)
Extracts an
ARRAY<COMPLEX<json>>
value fromexpr
at the specifiedpath
. If value is not anARRAY
, it gets translated into a single elementARRAY
containing the value atpath
. The primary use of this function is to extract arrays of objects to use as inputs to other array functions.
Letβs rewrite the above query, substituting JSON_QUERY_ARRAY
for JSON_QUERY
in both cases:
REPLACE INTO "pizza-lineitems" OVERWRITE ALL
WITH "ext" AS (
SELECT *
FROM TABLE(
EXTERN(
'{"type":"local","baseDir":"/Users/hellmarbecker/meetup-talks/jsonarray","filter":"*json"}',
'{"type":"json"}'
)
) EXTEND ("id" BIGINT, "shop" VARCHAR, "name" VARCHAR, "phoneNumber" VARCHAR, "address" VARCHAR, "pizzas" TYPE('COMPLEX<json>'), "timestamp" BIGINT)
)
SELECT
MILLIS_TO_TIMESTAMP("timestamp") AS "__time",
"id",
"shop",
"name",
"phoneNumber",
"address",
JSON_VALUE(p, '$.pizzaName') AS pizzaName,
JSON_QUERY_ARRAY(p, '$.additionalToppings') AS additionalToppings
FROM "ext" CROSS JOIN UNNEST(JSON_QUERY_ARRAY(pizzas, '$')) AS lineitems(p)
PARTITIONED BY DAY
That way, we can also be sure that the additionalToppings
column will be represented as an array.
After the ingestion has finished, query the table and note how
- there is now one row per line item
- the
pizzas
subcolumn is represented as an array, as you can see by the[β―]
instead of the tree symbol:
You can actually run a query over the new table that shows how JSON_QUERY
forgets about the βarray-nessβ of the array column, while JSON_QUERY_ARRAY
enforces it:
It is, however, preferred to use JSON_QUERY_ARRAY
at ingestion time and represent the result in your data model. This is part of optimizing the data model to achieve those fast queries that Druid is known for!
Conclusion
- We have seen how it is now possible to unnest even columns that contain arrays of objects. With this capability, Druid takes another big step handling nested objects.
- Using
JSON_QUERY_ARRAY
on an array retains the βarray-nessβ and passes it on to functions that require an array input. - Using
JSON_QUERY_ARRAY
on a single object wraps it into an array. - You should use
JSON_QUERY_ARRAY
at ingestion rather than query time.
"Pizza" by Katrin Gilger is licensed under CC BY-SA 2.0 .