ClickHouse: Removing Diacritics from Strings
The Problem
In text search, you often want to be agnostic to diacritics - those appendages to Latin-script letters, like umlaut markers, accents, cedillas, and other modifiers. You want the search term ‘Jose’ to also match ‘José’, for instance. A customer, faced with this question in ClickHouse, came up with an expression like this:
replaceAll(replaceAll(replaceAll(replaceAll(replaceAll(replaceAll(replaceAll(lowerUTF8(%s), 'ş', 's'), 'ü', 'u'), 'ö', 'o'), 'İ', 'i'), 'ğ', 'g'), 'ı', 'i'), 'ç', 'c')
They were looking for a more performant and flexible approach.
Naïve Approach
A first attempt to simplify this code would use translateUTF8 instead. The code gets a lot simpler and more readable:
SELECT translateUTF8('Jäätelöö', 'äöü', 'aou');
This yields Jaateloo, but the approach is limited because you have to hardcode the translation lists into the function call.
How Other Databases Do It
I did a bit of research and found these articles:
Oracle has no built-in function, but a forum post describes how to use the decompose function, which separates the base character and the diacritic (modifier) into separate Unicode characters, along with a regular expression that filters out the modifiers. This does not catch all edge cases though: more about this later.
Postgres has a built-in module unaccent which follows a more flexible approach:
- It uses a dictionary lookup; the dictionary is populated by an unaccent.rules file.
- You can bring your own, but it comes with a default file.
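For comparison, here is a minimal sketch of the Postgres side (Postgres SQL, not ClickHouse; assumes the extension is available in your database):

-- Postgres: enable the bundled extension once
CREATE EXTENSION IF NOT EXISTS unaccent;
-- strip diacritics using the default unaccent.rules file
SELECT unaccent('Hôtel José');  -- 'Hotel Jose'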
Let’s see where this gets us!
Better Approach (like Oracle)
The ClickHouse equivalent of Oracle’s decompose is called normalizeUTF8NFD (D for decompose). Rewriting the expression from the Oracle forum post in ClickHouse SQL
SELECT
replaceRegexpAll(
normalizeUTF8NFD('Jäätelöö Øre Võru Jérôme İstanbul ıçşğŁ'), '[\\x{0300}-\\x{036F}]', '');
results in:
Jaateloo Øre Voru Jerome Istanbul ıcsgŁ
Observations:
- it does remove umlaut markers, accents, cedillas and such
- it does not remove the stroke that is used as a modifier in some languages (as in Ø and Ł)
- it does not handle the case of the Turkish ı well, which should normalize to i - here you would have to add, rather than remove, a dot (see the check below)
- it needs a regex match, which may be expensive
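To see why the two Turkish i-variants behave differently, we can inspect the decomposition directly (a quick check - the hex values in the comments are what I would expect from the Unicode data):

-- İ (U+0130) decomposes into 'I' plus a combining dot above (U+0307),
-- which the regex then strips; ı (U+0131) has no decomposition at all.
SELECT
    hex(normalizeUTF8NFD('İ')) AS decomposed,  -- '49CC87' = 'I' + U+0307
    hex(normalizeUTF8NFD('ı')) AS dotless;     -- 'C4B1', unchanged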
Can we do better?
Even Better (like Postgres)
Since the Postgres repository is public, we can download the rules file from there. ClickHouse has dictionaries for lookup which can be populated directly from files, but I’d rather download the rules file only once. Let’s set up a table and populate it with the contents of the rules file.
Note that the rules file is generally tab-separated, but sometimes the tab and the second column are missing! Therefore we cannot use the TSV format right away to parse the data. Instead, we parse the file as one field per line and split afterwards. That way, where the second column is missing, it will just be filled with an empty string (I did not make it Nullable).
CREATE OR REPLACE TABLE t_unaccent(
`key` String,
`value` String
)
ENGINE = MergeTree
ORDER BY `key`;
INSERT INTO t_unaccent
SELECT splitByChar('\t', line)[1] AS `key`, splitByChar('\t', line)[2] AS `value`
FROM url('https://raw.githubusercontent.com/postgres/postgres/5b148706c5c8ffffe5662fe569a0f0bcef2351d9/contrib/unaccent/unaccent.rules',
'LineAsString',
'line String');
(Adapt the URL according to the latest version!)
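A quick sanity check on the imported rules (the exact counts depend on the file revision, so treat the numbers as illustrative):

-- how many rules were loaded, and how many have no replacement column?
SELECT count() AS rules, countIf(value = '') AS empty_values FROM t_unaccent;
-- spot-check a single mapping
SELECT value FROM t_unaccent WHERE key = 'é';  -- expect 'e'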
Now build a dictionary on top of the table:
CREATE DICTIONARY unaccent (
`key` String,
`value` String
)
PRIMARY KEY `key`
SOURCE(CLICKHOUSE(
TABLE 't_unaccent' DB 'default' USER 'default' PASSWORD '<your password>'
))
LAYOUT(HASHED())
LIFETIME(3600);
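With the dictionary in place, a single-character lookup becomes a function call (the fallback argument covers characters without a rule - the same pattern the final UDF will use):

SELECT dictGetOrDefault('unaccent', 'value', 'é', 'é');  -- 'e'
SELECT dictGetOrDefault('unaccent', 'value', 'x', 'x');  -- no rule, falls back to 'x'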
The idea is
- to take the string one character at a time;
- for each character,
  - look up the replacement in the dictionary,
  - leave the original in place when there is no dictionary entry;
- and concatenate the result into a string again.
The docs suggest using the function splitByString with an empty string as the first argument. But calling SELECT splitByString('', 'äöü') yields only non-printable garbage: the function does not work with UTF-8 characters. We could roll our own:
CREATE OR REPLACE FUNCTION splitUTF8 AS (x) ->
arrayMap((i) -> substringUTF8(x, i+1, 1), range(lengthUTF8(x)));
We construct a sequence of as many numbers as there are UTF-8 characters in the input string, then use these numbers to index into the string, picking out one character at a time and collecting the characters into an array. This works because both the length and substring functions have UTF-8 versions.
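For example:

SELECT splitUTF8('äöü');  -- ['ä','ö','ü']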
Thankfully, Christoph pointed me to the ngrams function: ngrams(s, 1) yields a list of all 1-character substrings in s, and it respects UTF-8!
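A quick comparison:

SELECT ngrams('äöü', 1);  -- ['ä','ö','ü'], same as splitUTF8 above, but built in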
Now we can define the final function as a UDF:
CREATE OR REPLACE FUNCTION removeDiacritics AS (s) ->
arrayStringConcat(arrayMap(x -> dictGetOrDefault('unaccent', 'value', x, x), ngrams(s, 1)));
Using dictGetOrDefault makes sure that the original character is left in place when there is no dictionary entry.
With this,
SELECT removeDiacritics('Jäätelöö Øre Võru Jérôme İstanbul ıçşğŁ');
yields
Jaateloo Ore Voru Jerome Istanbul icsgL
which is what we wanted!
Conclusion
- Replacing composite characters with diacritics by their base glyphs is not quite trivial.
- translateUTF8 is a good option in ClickHouse if the list of characters is limited and known in advance.
- A better approach is using the Postgres unaccent.rules list and a ClickHouse dictionary.
- To make the code neat, you can encapsulate the logic in an SQL UDF.
“This image is taken from Page 377 of Praktisches Kochbuch für die gewöhnliche und feinere Küche” by Medical Heritage Library, Inc. is licensed under CC BY-NC-SA 2.0.