Data Cookbook Kitchen

Here’s another of my learnings from playing with ClickHouse. This one is about the QUALIFY clause. It is a filter that is evaluated after applying window functions, but before any GROUP BY clause.

I had been thinking of this clause mainly as syntactic sugar - you can write

SELECT number, COUNT() OVER (PARTITION BY number % 3) AS partition_count
FROM numbers(10)
QUALIFY partition_count = 4
ORDER BY number;

to filter by the result of a window function, rather than the more peripatetic nested query in standard SQL:

SELECT * FROM (
    SELECT number, COUNT() OVER (PARTITION BY number % 3) AS partition_count
    FROM numbers(10)
)
WHERE partition_count = 4
ORDER BY number;

And in this case you would not be able to apply WHERE partition_count directly, because the WHERE filter is evaluated before the window functions.

But there are queries where both WHERE and QUALIFY can be applied to the same column, with different results.

Querying with a session window

Assume we are tracking sessions in an online application. For each event inside the session, we would like to enrich it with the session history up to that point.

Here is a simplified test schema and data sample:

CREATE TABLE wintest (
    sid Int64,
    id Int64,
    val String
)
ENGINE = MergeTree
ORDER BY id;

INSERT INTO wintest
SELECT sid, id, val FROM format(CSVWithNames,
$$sid,id,val
1,1,eins
1,2,zwei
1,3,drei
1,4,vier
$$);

This minimal sample has only one session, but this is sufficient to get the picture.

We can use groupArray with a window function to add to every event the session history up to that point:

SELECT
    sid, id,
    groupArray((id, val)) OVER (PARTITION BY sid ORDER BY id ASC ROWS UNBOUNDED PRECEDING)
FROM wintest;
   ┌─sid─┬─id─┬─groupArray((id, val)) ⋯EDING AND CURRENT ROW)─┐
1. │   1 │  1 │ [(1,'eins')]                                  │
2. │   1 │  2 │ [(1,'eins'),(2,'zwei')]                       │
3. │   1 │  3 │ [(1,'eins'),(2,'zwei'),(3,'drei')]            │
4. │   1 │  4 │ [(1,'eins'),(2,'zwei'),(3,'drei'),(4,'vier')] │
   └─────┴────┴───────────────────────────────────────────────┘

Filtering the results

Now assume we want the full history only for selected events. You might be writing a query like this:

SELECT
    id,
    groupArray((id, val)) OVER (PARTITION BY sid ORDER BY id ASC ROWS UNBOUNDED PRECEDING)
FROM wintest
WHERE id % 2 = 0;

The result might be unexpected:

   ┌─id─┬─groupArray((⋯URRENT ROW)─┐
1. │  2 │ [(2,'zwei')]             │
2. │  4 │ [(2,'zwei'),(4,'vier')]  │
   └────┴──────────────────────────┘

We have filtered not only the result rows but also the events that go into the session history! If we want to still preserve the full session history for each result row, we can use QUALIFY instead:

SELECT
    id,
    groupArray((id, val)) OVER (PARTITION BY sid ORDER BY id ASC ROWS UNBOUNDED PRECEDING)
FROM wintest
QUALIFY id % 2 = 0;
   ┌─id─┬─groupArray((id, val)) ⋯EDING AND CURRENT ROW)─┐
1. │  2 │ [(1,'eins'),(2,'zwei')]                       │
2. │  4 │ [(1,'eins'),(2,'zwei'),(3,'drei'),(4,'vier')] │
   └────┴───────────────────────────────────────────────┘

This time we filter only the result rows, and not the rows that go into the window function.

Filtering only inside the window aggregation

What if we want to do it the other way round and filter only the session list, but return this filtered session list for every single event?

Here aggregator combinators come in handy: Attach an-If to the aggregation function, and it takes a filter condition as an extra parameter:

SELECT
    id,
    groupArrayIf((id, val), id % 2 = 0) OVER (PARTITION BY sid ORDER BY id ASC ROWS UNBOUNDED PRECEDING)
FROM wintest;
   ┌─id─┬─groupArrayIf⋯URRENT ROW)─┐
1. │  1 │ []                       │
2. │  2 │ [(2,'zwei')]             │
3. │  3 │ [(2,'zwei')]             │
4. │  4 │ [(2,'zwei'),(4,'vier')]  │
   └────┴──────────────────────────┘

Summary

There are three distinct filter clauses in ClickHouse SQL:

  • WHERE is applied before any grouping or window function.
  • HAVING applies only after GROUP BY.
  • QUALIFY filters before grouping but after applying window functions:
    • The most popular use case is to filter by the result of a window function, eliminating the need for a nested query.
    • However, you can also apply QUALIFY to a filter column. This creates distinct semantics: The expression inside the window aggregation is not filtered, but the result is!
  • If you want to filter only inside the window aggregation, use the -If combinator.

This goes again to show the power of ClickHouse SQL.


This image is taken from Page 377 of Praktisches Kochbuch für die gewöhnliche und feinere Küche” by Medical Heritage Library, Inc. is licensed under CC BY-NC-SA 2.0 .