ClickHouse: Removing Diacritics from Strings
The Problem
In text search, you often want to be agnostic to diacritics - those appendages to Latin-script letters, like umlaut markers, accents, cedillas, and other modifiers. You want the search term ‘Jose’ to also match ‘José’, for instance. A customer, faced with this question in ClickHouse, came up with an expression like this:
replaceAll(replaceAll(replaceAll(replaceAll(replaceAll(replaceAll(replaceAll(lowerUTF8(%s), 'ş', 's'), 'ü', 'u'), 'ö', 'o'), 'İ', 'i'), 'ğ', 'g'), 'ı', 'i'), 'ç', 'c')
They were looking for a more performant and flexible approach.
Naïve Approach
A first attempt to simplify this code would use translateUTF8 instead. The code gets a lot simpler and more readable:
SELECT translateUTF8('Jäätelöö', 'äöü', 'aou');
This yields Jaateloo, but the approach is limited because you have to hardcode the translation lists into the function call.
How Other Databases Do It
I did a bit of research and found these articles:
Oracle has no built-in function, but a forum post describes how to use the decompose function, which separates the base character and the diacritic (modifier) into separate Unicode characters, along with a regular expression that filters out the modifiers. This does not catch all edge cases though: more about this later.
Postgres has a built-in module unaccent which follows a more flexible approach:
- It uses a dictionary lookup; the dictionary is populated by an unaccent.rules file.
- You can bring your own, but it comes with a default file.
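For comparison, here is a minimal sketch of the Postgres side (Postgres SQL, not ClickHouse; assumes the extension is available in your database):

-- Postgres: enable the bundled extension once
CREATE EXTENSION IF NOT EXISTS unaccent;
-- strip diacritics using the default unaccent.rules file
SELECT unaccent('Hôtel José');  -- 'Hotel Jose'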
Let’s see where this gets us!
Better Approach (like Oracle)
The ClickHouse equivalent of Oracle’s decompose is called normalizeUTF8NFD (D for decompose). Rewriting the expression from the Oracle forum post in ClickHouse SQL
SELECT
replaceRegexpAll(
normalizeUTF8NFD('Jäätelöö Øre Võru Jérôme İstanbul ıçşğŁ'), '[\\x{0300}-\\x{036F}]', '');
results in:
Jaateloo Øre Voru Jerome Istanbul ıcsgŁ
Observations:
- it does remove umlaut markers, accents, cedillas and such
- it does not remove the stroke that is used as a modifier in some languages (as in Ø and Ł)
- it does not handle the case of the Turkish ı well, which should normalize to i - here you would have to add, rather than remove, a dot (see the check below)
- it needs a regex match, which may be expensive
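To see why the two Turkish i-variants behave differently, we can inspect the decomposition directly (a quick check - the hex values in the comments are what I would expect from the Unicode data):

-- İ (U+0130) decomposes into 'I' plus a combining dot above (U+0307),
-- which the regex then strips; ı (U+0131) has no decomposition at all.
SELECT
    hex(normalizeUTF8NFD('İ')) AS decomposed,  -- '49CC87' = 'I' + U+0307
    hex(normalizeUTF8NFD('ı')) AS dotless;     -- 'C4B1', unchanged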
Can we do better?
Even Better (like Postgres)
Since the Postgres repository is public, we can download the rules file from there. ClickHouse has dictionaries for lookup which can be populated directly from files, but I’d rather download the rules file only once. Let’s set up a table and populate it with the contents of the rules file.
Note that the rules file is generally tab-separated, but sometimes the tab and the second column are missing! Therefore we cannot use the TSV format right away to parse the data. Instead, we parse the file as one field per line and split afterwards. That way, where the second column is missing, it will just be filled with an empty string (I did not make it Nullable).
CREATE OR REPLACE TABLE t_unaccent(
`key` String,
`value` String
)
ENGINE = MergeTree
ORDER BY `key`;
INSERT INTO t_unaccent
SELECT splitByChar('\t', line)[1] AS `key`, splitByChar('\t', line)[2] AS `value`
FROM url('https://raw.githubusercontent.com/postgres/postgres/5b148706c5c8ffffe5662fe569a0f0bcef2351d9/contrib/unaccent/unaccent.rules',
'LineAsString',
'line String');
(Adapt the URL according to the latest version!)
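A quick sanity check on the imported rules (the exact counts depend on the file revision, so treat the numbers as illustrative):

-- how many rules were loaded, and how many have no replacement column?
SELECT count() AS rules, countIf(value = '') AS empty_values FROM t_unaccent;
-- spot-check a single mapping
SELECT value FROM t_unaccent WHERE key = 'é';  -- expect 'e'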
Now build a dictionary on top of the table:
CREATE DICTIONARY unaccent (
`key` String,
`value` String
)
PRIMARY KEY `key`
SOURCE(CLICKHOUSE(
TABLE 't_unaccent' DB 'default' USER 'default' PASSWORD '<your password>'
))
LAYOUT(HASHED())
LIFETIME(3600);
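With the dictionary in place, a single-character lookup becomes a function call (the fallback argument covers characters without a rule - the same pattern the final UDF will use):

SELECT dictGetOrDefault('unaccent', 'value', 'é', 'é');  -- 'e'
SELECT dictGetOrDefault('unaccent', 'value', 'x', 'x');  -- no rule, falls back to 'x'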
The idea is
- to take the string one character at a time;
- for each character,
  - look up the replacement in the dictionary,
  - leave the original in place when there is no dictionary entry;
- and concatenate the result into a string again.
The docs suggest using the function splitByString with an empty string as the first argument. But calling SELECT splitByString('', 'äöü') yields only non-printable garbage: the function does not work with UTF-8 characters. We could roll our own:
CREATE OR REPLACE FUNCTION splitUTF8 AS (x) ->
arrayMap((i) -> substringUTF8(x, i+1, 1), range(lengthUTF8(x)));
We construct a sequence of as many numbers as there are UTF-8 characters in the input string, then use these numbers to index into the string, picking out one character at a time and collecting the characters into an array. This works because both the length and substring functions have UTF-8 versions.
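For example:

SELECT splitUTF8('äöü');  -- ['ä','ö','ü']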
Thankfully, Christoph pointed me to the ngrams function: ngrams(s, 1) yields a list of all 1-character substrings in s, and it respects UTF-8!
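A quick comparison:

SELECT ngrams('äöü', 1);  -- ['ä','ö','ü'], same as splitUTF8 above, but built in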
Now we can define the final function as a UDF:
CREATE OR REPLACE FUNCTION removeDiacritics AS (s) ->
arrayStringConcat(arrayMap(x -> dictGetOrDefault('unaccent', 'value', x, x), ngrams(s, 1)));
Using dictGetOrDefault makes sure that the original character is left in place when there is no dictionary entry.
With this,
SELECT removeDiacritics('Jäätelöö Øre Võru Jérôme İstanbul ıçşğŁ');
yields
Jaateloo Ore Voru Jerome Istanbul icsgL
which is what we wanted!
Conclusion
- Replacing composite characters with diacritics by their base glyphs is not quite trivial.
- translateUTF8 is a good option in ClickHouse if the list of characters is limited and known in advance.
- A better approach is using the Postgres unaccent.rules list and a ClickHouse dictionary.
- To make the code neat, you can encapsulate the logic in an SQL UDF.
“This image is taken from Page 377 of Praktisches Kochbuch für die gewöhnliche und feinere Küche” by Medical Heritage Library, Inc. is licensed under CC BY-NC-SA 2.0.