Streaming Events from Redpanda Cloud into Imply Polaris

From Redpanda to Imply

Redpanda offers a data streaming platform with a Kafka-compatible API. There are several deployment options: you can run Redpanda on premises or in a hybrid cloud mode, but here I am going to look at Redpanda’s fully managed cloud service. I want to connect this to Imply’s Polaris database-as-a-service, which offers fast, near-realtime analytics based on Apache Druid.

Polaris was recently upgraded with connectivity to multiple streaming services. This includes Kafka with SCRAM authentication, which is one of the ways Redpanda implements security. Today I am going to show how to build a realtime analytics pipeline using only fully managed cloud services: Redpanda Cloud and Imply Polaris.

In this tutorial, you will learn how to create a connection to a dedicated Redpanda cluster and how to populate a Polaris table with realtime data using the Polaris API, which allows for automated scripting and zero-touch configuration.

As prerequisites, you will need:

a Redpanda Cloud account (a 15-day free trial is available)
an Imply Polaris account (you can sign up for a 30-day trial)

Setting up Redpanda

Create a cluster

I am assuming that you have already created a Redpanda Cloud account.

In Redpanda Cloud, a cluster exists in a namespace, and you can have multiple namespaces inside an organization. Create a namespace, and in that namespace create a cluster according to the documentation. The smallest cluster option is sufficient for this experiment.

I created a cluster in AWS, single region and single AZ. I am not using VPC peering - make sure your cluster is connected to the public Internet. While VPC peering between Redpanda Cloud and Imply Polaris is possible, that is a story for another blog.

Redpanda will suggest a unique (3-word) cluster name, but you can of course pick your own.

Create a topic

Creating a topic is straightforward in Redpanda:

Create topic

Create a topic named pizza, and leave all settings at the defaults.

Create a user and password

In Redpanda Cloud, go to the Security tab and create a new (data plane) user. You can enter your own password or use the autogenerated one. Make sure to copy the password because you will need it later. Set the security protocol to SCRAM-SHA-512.

Create user

Set up ACLs

You have to set up ACLs to grant access rights. Click the name of the user you just created, and in the popup click Allow all operations:

Set up ACLs

This gives my user full access rights. In a production setup, you would not do that: you would restrict that user’s access rights to a topic or group of topics.

Populate the Redpanda Topic

Once more, I am using Francesco’s pizza simulator. I have described that in a previous blog.

Find the bootstrap server from the Redpanda overview page:

Cluster settings

Check out the data generator repository. Then start the data generator with the bootstrap address and the username and password you just configured, like so:

python3 main.py \
  --security-protocol SASL_SSL \
  --sasl-mechanism SCRAM-SHA-512 \
  --username '<Redpanda username from above>' \
  --password '<Redpanda password>' \
  --host <Redpanda bootstrap server> \
  --port 9092 \
  --topic-name pizza \
  --nr-messages 0 \
  --max-waiting-time 0

After a moment, you can see the incoming messages in the Redpanda console too:

Messages in Redpanda console

Setting up Imply Polaris

Create an API key

Follow the documentation to create an API key. API authentication is HTTP Basic authentication, where the API key stands in place of the username and the password is empty. I am going to use curl in this tutorial to access the API, but of course you can also use a tool like Postman.

For convenience, I assume that the API key and URL are in environment variables:

BASEURL is the API base URL, which generally looks like this: https://ORGANIZATION_NAME.REGION.CLOUD_PROVIDER.api.imply.io (set it without a trailing slash, because the calls below append the path starting with a slash)
APIKEY is the API key.

Hence the general scheme of an API call will be:

curl -X POST ${BASEURL}/... endpoint .../ -u ${APIKEY}: -H "Content-Type: application/json" -d'{ ... }'

with a JSON payload for POST and PUT requests.

If you omit the Content-Type header, you will generally get a 415 Unsupported Media Type error, so make sure to include it in every call.
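If you would rather script the calls in Python than in curl, the same scheme can be sketched with the standard library alone. This is a minimal sketch and not an official client: the helper name polaris_request and the example.invalid URL are made up for illustration, and the request is only built here, not sent.

```python
import base64
import json
import urllib.request


def polaris_request(base_url, api_key, method, path, payload=None):
    """Build (but do not send) a request that mirrors the curl scheme above.

    Hypothetical helper for illustration; not part of any Imply library.
    """
    # HTTP Basic auth with the API key as username and an empty password: "key:"
    token = base64.b64encode(f"{api_key}:".encode("utf-8")).decode("ascii")
    headers = {
        "Authorization": f"Basic {token}",
        "Content-Type": "application/json",
    }
    # Join base URL and path with exactly one slash between them
    url = base_url.rstrip("/") + "/" + path.lstrip("/")
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    return urllib.request.Request(url, data=data, headers=headers, method=method)


# Placeholder values; substitute your own BASEURL and APIKEY.
req = polaris_request(
    "https://example.invalid/", "my-api-key", "POST", "/v2/connections/", {"type": "kafka"}
)
print(req.get_method(), req.full_url)
```

To actually send the request, pass the returned object to urllib.request.urlopen. The helper strips a trailing slash from the base URL, so it does not matter whether BASEURL ends with one.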

Create a Kafka Pull connection

Create a connection with the new type kafka. In the first step, supply only the server details.

curl -X POST ${BASEURL}/v2/connections/ -u ${APIKEY}: -H "Content-Type: application/json" -d'{
  "type": "kafka",
  "name": "pizza_connection",
  "bootstrapServers": "<Redpanda bootstrap server>:9092",
  "topicName": "pizza"
}'

After that, add credentials to the connection:

curl -X PUT ${BASEURL}/v2/connections/pizza_connection/secrets/ -u ${APIKEY}: -H "Content-Type: application/json" -d'{
  "type": "sasl_scram",
  "mechanism": "SCRAM-SHA-512",
  "username": "<Redpanda username>",
  "password": "<Redpanda password>"
}'

You can then test the connection:

curl -X POST ${BASEURL}/v2/connections/pizza_connection/test/ -u ${APIKEY}: -H "Content-Type: application/json"

Note that this is a POST request without a payload. A 200 status with an empty response means everything is good.
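For a zero-touch setup, it can help to keep the three connection calls (create, add secrets, test) together as data, so that a script can run them in order. The following is a sketch under that assumption: the function name connection_steps and the sample host, user, and password are placeholders of mine, while the paths and payload fields are the ones from the curl calls in this section.

```python
def connection_steps(name, bootstrap_servers, topic, username, password):
    """Return the three API calls of this section as (method, path, payload) tuples.

    Hypothetical helper; it only describes the calls and sends nothing.
    """
    return [
        # Step 1: create the connection with server details only
        ("POST", "/v2/connections/", {
            "type": "kafka",
            "name": name,
            "bootstrapServers": bootstrap_servers,
            "topicName": topic,
        }),
        # Step 2: attach the SCRAM credentials
        ("PUT", f"/v2/connections/{name}/secrets/", {
            "type": "sasl_scram",
            "mechanism": "SCRAM-SHA-512",
            "username": username,
            "password": password,
        }),
        # Step 3: test the connection (POST without a payload)
        ("POST", f"/v2/connections/{name}/test/", None),
    ]


steps = connection_steps(
    "pizza_connection", "redpanda.example.invalid:9092", "pizza", "demo-user", "demo-password"
)
for method, path, payload in steps:
    # Print only method and path so the password does not end up in a log
    print(method, path)
```

Each tuple can then be handed to whatever HTTP client you prefer, together with the API key and the Content-Type header.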

Create a Polaris table

Create a detail table in Polaris. In the GUI, it is also possible to create a table first and add the schema later - here we specify it all in one single call.

curl -X POST ${BASEURL}/v2/tables/ -u ${APIKEY}: -H "Content-Type: application/json" -d'{
    "schema": [
        {
            "name": "__time",
            "dataType": "timestamp"
        },
        {
            "name": "id",
            "dataType": "long"
        },
        {
            "name": "name",
            "dataType": "string"
        },
        {
            "name": "pizzas",
            "dataType": "json"
        },
        {
            "name": "address",
            "dataType": "string"
        },
        {
            "name": "phoneNumber",
            "dataType": "string"
        },
        {
            "name": "shop",
            "dataType": "string"
        }
    ],
    "type": "detail",
    "name": "pizza",
    "version": 0,
    "partitioningGranularity": "day"
}'

The API should respond with a JSON object describing the new table.
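Since the schema is just a list of name and type pairs, the table payload can also be generated from a compact column mapping instead of being written out by hand. Here is a small sketch of that idea; the names PIZZA_COLUMNS and table_payload are my own, and the fixed values (type detail, version 0, day granularity) are simply copied from the call above.

```python
import json

# Column names and data types, copied from the curl call above
PIZZA_COLUMNS = {
    "__time": "timestamp",
    "id": "long",
    "name": "string",
    "pizzas": "json",
    "address": "string",
    "phoneNumber": "string",
    "shop": "string",
}


def table_payload(name, columns, granularity="day"):
    """Build the JSON body for the table creation call (hypothetical helper)."""
    return {
        # dicts keep insertion order, so the schema order matches PIZZA_COLUMNS
        "schema": [{"name": col, "dataType": dtype} for col, dtype in columns.items()],
        "type": "detail",
        "name": name,
        "version": 0,
        "partitioningGranularity": granularity,
    }


table = table_payload("pizza", PIZZA_COLUMNS)
print(json.dumps(table, indent=4))
```

The printed JSON can be pasted into the -d argument of the curl call, or sent directly from a script.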

Fire up the ingestion job

To get the ingestion going, create an ingestion job that references the new connection. This one is a bit lengthy because it needs two sets of metadata:

The inputSchema list describes the fields found in the original data. This is required.
The mappings list describes how input fields are mapped to table columns, and can also include SQL transformations.

curl -X POST ${BASEURL}/v2/jobs/ -u ${APIKEY}: -H "Content-Type: application/json" -d'{
    "type": "streaming",
    "target": {
        "type": "table",
        "tableName": "pizza",
        "intervals": [ "1000/3000" ]
    },
    "desiredExecutionStatus": "running",
    "source": {
        "type": "connection",
        "connectionName": "pizza_connection",
        "formatSettings": {
            "format": "nd-json"
        },
        "inputSchema": [
            {
                "name": "timestamp",
                "dataType": "long"
            },
            {
                "name": "id",
                "dataType": "long"
            },
            {
                "name": "name",
                "dataType": "string"
            },
            {
                "name": "pizzas",
                "dataType": "json"
            },
            {
                "name": "address",
                "dataType": "string"
            },
            {
                "name": "phoneNumber",
                "dataType": "string"
            },
            {
                "name": "shop",
                "dataType": "string"
            }
        ]
    },
    "mappings": [
        {
            "columnName": "__time",
            "expression": "MILLIS_TO_TIMESTAMP(\"timestamp\")"
        },
        {
            "columnName": "id",
            "expression": "\"id\""
        },
        {
            "columnName": "name",
            "expression": "\"name\""
        },
        {
            "columnName": "pizzas",
            "expression": "\"pizzas\""
        },
        {
            "columnName": "address",
            "expression": "\"address\""
        },
        {
            "columnName": "phoneNumber",
            "expression": "\"phoneNumber\""
        },
        {
            "columnName": "shop",
            "expression": "\"shop\""
        }
    ],
    "readFromPoint": "earliest"
}'

This should respond with a summary of the job that has been created.
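Most of the job payload is repetitive: every field except the timestamp is passed through unchanged, with the input field name in double quotes as the mapping expression. Under that assumption, both inputSchema and mappings can be derived from one field list. The sketch below does this; job_payload and PIZZA_FIELDS are names I made up, and the remaining values are copied from the call above.

```python
import json

# Input fields other than the timestamp, copied from the inputSchema above
PIZZA_FIELDS = {
    "id": "long",
    "name": "string",
    "pizzas": "json",
    "address": "string",
    "phoneNumber": "string",
    "shop": "string",
}


def job_payload(table, connection, fields, time_field="timestamp"):
    """Build the streaming job body with pass-through mappings (hypothetical helper)."""
    # The source timestamp is a long, converted to __time by MILLIS_TO_TIMESTAMP
    input_schema = [{"name": time_field, "dataType": "long"}]
    mappings = [{
        "columnName": "__time",
        "expression": f'MILLIS_TO_TIMESTAMP("{time_field}")',
    }]
    for name, data_type in fields.items():
        input_schema.append({"name": name, "dataType": data_type})
        # Double quotes mark the input field name as an SQL identifier
        mappings.append({"columnName": name, "expression": f'"{name}"'})
    return {
        "type": "streaming",
        "target": {"type": "table", "tableName": table, "intervals": ["1000/3000"]},
        "desiredExecutionStatus": "running",
        "source": {
            "type": "connection",
            "connectionName": connection,
            "formatSettings": {"format": "nd-json"},
            "inputSchema": input_schema,
        },
        "mappings": mappings,
        "readFromPoint": "earliest",
    }


job = job_payload("pizza", "pizza_connection", PIZZA_FIELDS)
print(json.dumps(job, indent=4))
```

A nice side effect is that json.dumps takes care of escaping the inner double quotes, which is the part that is easiest to get wrong when writing the payload by hand.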

Go to the Imply Polaris GUI and verify that data is coming in:

Data in Imply

Conclusion

With Imply Polaris and Redpanda Cloud, you can set up an end-to-end realtime analytics pipeline using only fully managed services.
The connection setup can be scripted easily using the Polaris API.
A GUI integration is coming soon.