Druid Sneak Peek: Schema Inference and Arrays
One of Druid's strong points has always been built-in schema evolution. However, when getting data of changing shape into Druid, you used to have two choices:
- either specify each field with its type in the ingestion spec, which requires knowing all the fields ahead of time,
- or pick up whatever comes in using schemaless ingestion, with the downside that any dimension ingested that way would be interpreted as a string.
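For illustration, here is roughly what the two options look like in the dimensionsSpec of an ingestion spec (the field names are made up for this sketch). The explicit variant lists every dimension, optionally with a type:

"dimensionsSpec": {
  "dimensions": [
    "customer",
    { "type": "long", "name": "table_number" }
  ]
}

The schemaless variant leaves the dimension list empty, which tells Druid to ingest every remaining input field as a string dimension:

"dimensionsSpec": {
  "dimensions": []
}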
The good news is that this is going to change. Druid 26 is going to come with the ability to infer the schema completely from the input data, and even ingest structured data automatically.
Disclaimer: This tutorial uses undocumented functionality and unreleased code. This blog is neither endorsed by Imply nor by the Apache Druid PMC. It merely collects the results of personal experiments. The features described here might, in the final release, work differently, or not at all. In addition, the entire build, or execution, may fail. Your mileage may vary.
Druid 26 hasn’t been released yet, but you can build Druid from the master branch of the repository and try out the new features.
I am going to pick up the multi-value dimensions example from last week, but this time I want to give you an idea of how these scenarios are going to be handled in the future. We are going to:
- ingest data using the new schema discovery feature
- ingest structured data into an SQL ARRAY
- show how GROUP BY and lateral joins work with that array.
Ingestion: Schema Inference
We are using the ristorante dataset that you can find here, but with a little twist: on the Configure schema tab, uncheck Explicitly specify dimension list.
Confirm the warning dialog that pops up, and continue modeling the data. When you proceed to the Edit spec stage, you can see a new setting that slipped in:

The dimensionsSpec has no dimension list now, but there is a new flag, useSchemaDiscovery:
"dimensionsSpec": {
"useSchemaDiscovery": true,
"includeAllDimensions": true,
"dimensionExclusions": []
}
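For orientation, this flag lives inside the dataSchema section of the generated ingestion spec. A minimal sketch (the dataSource name and timestamp column here are assumptions):

"dataSchema": {
  "dataSource": "ristorante_auto",
  "timestampSpec": {
    "column": "date",
    "format": "auto"
  },
  "dimensionsSpec": {
    "useSchemaDiscovery": true,
    "includeAllDimensions": true,
    "dimensionExclusions": []
  }
}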
Querying the data
Let’s look at the resulting data with a simple SELECT * query:
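Assuming the datasource was named ristorante_auto, as in the queries further down:

SELECT *
FROM "ristorante_auto"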
Notice how Druid has automatically detected that orders is an array of primitives (strings, in this case). You can recognize this by the symbol next to the column name, which now looks like this: [··]. In older versions, this column would have been ingested as a multi-value string. But now, Druid has true ARRAY columns!
(In the more general case of nested objects, Druid would have generated a nested JSON column.)
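(To give a flavor of that case: a nested JSON column is queried with Druid's JSON functions, such as JSON_VALUE. The customer_details column and its fields in this sketch are made up for illustration:)

SELECT JSON_VALUE(customer_details, '$.address.city') AS city
FROM "ristorante_auto"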
In order to take the arrays apart, we can once again make use of the UNNEST function. This has to be enabled using a query context flag. In the console, use the Edit context function inside the query engine menu and enter the context:
{
  "enableUnnest": true
}
In the REST API, you can pass the context directly.
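A sketch, assuming a standard quickstart setup with the router listening on port 8888:

curl -X POST http://localhost:8888/druid/v2/sql \
  -H 'Content-Type: application/json' \
  -d '{
    "query": "SELECT order_item FROM \"ristorante_auto\", UNNEST(orders) AS t(order_item) LIMIT 10",
    "context": { "enableUnnest": true }
  }'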
Then, unnest and group the items:
SELECT
  order_item,
  COUNT(*) AS order_count
FROM "ristorante_auto", UNNEST(orders) AS t(order_item)
GROUP BY 1
Once you have done this, you can filter by individual order items, without the quirks that we talked about when using multi-value dimensions:
SELECT
  customer,
  order_item,
  COUNT(*) AS order_count
FROM "ristorante_auto", UNNEST(orders) AS t(order_item)
WHERE order_item = 'tiramisu'
GROUP BY 1, 2
Conclusion
- Druid can now do schema inference.
- It can automatically detect primitive types, but also nested objects and arrays of primitives.
- Typical Druid queries that would have used multi-value dimensions in the past can now be written in a more standard way using array columns and UNNEST.