Druid Cookbook

The problem

My customers often need to extract large or detailed data sets from Druid and store them in a well-known format for further processing by other tools. With the multi-stage query (MSQ) engine, you can issue an asynchronous query against deep storage that handles (almost) unlimited amounts of data.

However, obtaining a result is a multi-step process:

Meanwhile, the data that you download in step 3 has already been written to a storage location inside Druid. You can define a path and even instruct Druid to use durable storage for query results, but the data is still in a Druid-specific format and cannot easily be read by other tools.

What if we could skip that step (persisting the result) completely and send the result directly to a file in a format of our choice?

Druid 29 can do this. For now, it is somewhat limited: it only supports CSV, and it can only export to the local filesystem or S3. But other formats, such as Parquet, are coming.

Let’s try this out with a Druid Quickstart installation!

In this tutorial, you will

  • learn how to configure the settings for MSQ export
  • export a sample dataset.

Preparation

We are going to export to local storage. To limit the attack surface for malicious or inexperienced users, you have to define a specific filesystem path where Druid is allowed to store export files.

On your local machine, install Druid 29 from the tarball.

Create a directory /tmp/druid-export on your local disk.

In your Druid installation, edit the file conf/druid/auto/_common/common.runtime.properties and add the line

druid.export.storage.baseDir=/tmp/druid-export

at the end of the file.
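If you set up test installations often, the two preparation steps above can be scripted. Here is a minimal sketch in Python, assuming the Quickstart directory layout described above; the function name prepare_local_export and its parameters are my own and not part of Druid.

```python
from pathlib import Path

EXPORT_PROPERTY = "druid.export.storage.baseDir"


def prepare_local_export(druid_home: str, export_dir: str = "/tmp/druid-export") -> Path:
    """Create the export directory and register it in common.runtime.properties."""
    export_path = Path(export_dir)
    export_path.mkdir(parents=True, exist_ok=True)

    # Path of the properties file in a Quickstart installation (taken from the text above).
    props = Path(druid_home) / "conf/druid/auto/_common/common.runtime.properties"
    text = props.read_text()

    # Only append when the property is not set yet, so running this twice is harmless.
    already_set = any(
        line.strip().startswith(EXPORT_PROPERTY + "=") for line in text.splitlines()
    )
    if not already_set:
        separator = "" if text.endswith("\n") or not text else "\n"
        props.write_text(f"{text}{separator}{EXPORT_PROPERTY}={export_path}\n")
    return props
```

Call it with the path of your Druid install directory before you start Druid, so the new property is picked up at startup.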

Then start Druid like so, from within your Druid install directory:

bin/start-druid -m5g

Ingest the Wikipedia sample data, following the Quickstart instructions for either classic batch or SQL ingestion. The export query in this tutorial expects the datasource to be named wikipedia.

Then go to the Query tab in the Druid console.

Exporting data

Run this query:

INSERT INTO 
EXTERN(local(exportPath => '/tmp/druid-export/wikipedia-export'))
AS CSV
SELECT * FROM wikipedia

Screenshot of running query
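You do not have to use the console for this. The same statement can be sent to the SQL task endpoint over HTTP. The sketch below assumes a Quickstart router listening on http://localhost:8888 without authentication; build_task_request and submit_export are names I made up for illustration.

```python
import json
import urllib.request

# The export statement from this tutorial, unchanged.
EXPORT_SQL = """\
INSERT INTO
EXTERN(local(exportPath => '/tmp/druid-export/wikipedia-export'))
AS CSV
SELECT * FROM wikipedia"""


def build_task_request(sql: str, router_url: str = "http://localhost:8888") -> urllib.request.Request:
    """Build a POST request for the SQL task endpoint (nothing is sent yet)."""
    body = json.dumps({"query": sql}).encode("utf-8")
    return urllib.request.Request(
        url=f"{router_url}/druid/v2/sql/task",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def submit_export(sql: str = EXPORT_SQL) -> str:
    """Send the statement and return the task ID reported by Druid."""
    with urllib.request.urlopen(build_task_request(sql)) as response:
        return json.load(response)["taskId"]
```

Calling submit_export() against a running Quickstart instance should return a task ID that you can look up in the Tasks view of the console. I kept building and sending the request separate so that the payload can be inspected without a running cluster.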

When the query finishes, check the export directory and you will find a CSV file containing the data:

Preview of result file in a shell window
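If you prefer a script over the shell, here is a small Python sketch that prints nothing by itself but returns the first rows of every CSV file it finds below the export directory. I am not assuming anything about the file names Druid chooses or about a header row; preview_export is a hypothetical helper name.

```python
import csv
from itertools import islice
from pathlib import Path


def preview_export(export_dir: str, rows_per_file: int = 5) -> dict[str, list[list[str]]]:
    """Return the first few rows of every CSV file found under export_dir."""
    preview = {}
    for csv_file in sorted(Path(export_dir).rglob("*.csv")):
        # newline="" is what the csv module expects for correct quote handling.
        with csv_file.open(newline="") as handle:
            preview[csv_file.name] = list(islice(csv.reader(handle), rows_per_file))
    return preview
```

For the export in this tutorial you would call preview_export("/tmp/druid-export/wikipedia-export") and print the result.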

Note: the target directory has to be empty; otherwise the query fails with an error message.
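To avoid running into that error, you can check the target before submitting the query. A minimal sketch, assuming that a target directory which does not exist yet is fine (in this tutorial, only the base directory was created by hand); export_target_is_usable is a hypothetical name.

```python
from pathlib import Path


def export_target_is_usable(export_path: str) -> bool:
    """Return True when the target is missing or is an empty directory."""
    target = Path(export_path)
    if not target.exists():
        return True
    # An existing target must be a directory with no entries at all.
    return target.is_dir() and not any(target.iterdir())
```

If this returns False for /tmp/druid-export/wikipedia-export, clean up the directory or pick a new export path before you rerun the query.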

This also works for export to S3.
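As far as I understand the Druid 29 documentation, the S3 variant replaces local(...) with an s3(...) function that takes a bucket and a prefix. The sketch below only builds the statement text; the bucket and prefix values are placeholders, s3_export_statement is a hypothetical helper, and the exact parameter names are worth checking against the documentation for your version. S3 export also needs its own configuration on the Druid side, which I am not covering here.

```python
def s3_export_statement(bucket: str, prefix: str, select_sql: str) -> str:
    """Build an INSERT INTO EXTERN(s3(...)) statement with CSV output."""

    def quote(value: str) -> str:
        # SQL string literals escape a single quote by doubling it.
        return "'" + value.replace("'", "''") + "'"

    return (
        "INSERT INTO\n"
        f"EXTERN(s3(bucket => {quote(bucket)}, prefix => {quote(prefix)}))\n"
        "AS CSV\n"
        f"{select_sql}"
    )
```

For example, s3_export_statement("my-bucket", "exports/wikipedia", "SELECT * FROM wikipedia") gives you a statement that you can paste into the Query tab.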

Learnings

  • With MSQ, you can now export query results directly to external storage.
  • This is a new feature in Druid 29. It is currently limited to CSV format and either local storage or S3, but expect more options to be added soon.

This image, taken from page 500 of “Praktisches Kochbuch für die gewöhnliche und feinere Küche” by Medical Heritage Library, Inc., is licensed under CC BY-NC-SA 2.0.