When I use the fuzzystrmatch levenshtein function with diacritic characters it returns a wrong / multibyte-ignorant result:
select levenshtein('ą', 'x');
levenshtein
-------------
2
(Note: the first character is an 'a' with a diacritic below, it is not rendered properly after I copied it here)
The fuzzystrmatch documentation (https://www.postgresql.org/docs/9.1/fuzzystrmatch.html) warns that:
At present, the soundex, metaphone, dmetaphone, and dmetaphone_alt functions do not work well with multibyte encodings (such as UTF-8).
But as it does not name the levenshtein function, I was wondering if there is a multibyte aware version of levenshtein.
I know that I could use the unaccent function as a workaround, but I need to keep the diacritics.
Note: This solution was suggested by @Nick Barnes in his answer to a related question.
The 'a' with a diacritic is a character sequence, i.e. a combination of a and a combining character, the diacritic ̨ : E'a\u0328'
There is an equivalent precomposed character ą: E'\u0105'
A solution would be to normalise the Unicode strings, i.e. to convert the combining character sequence into the precomposed character before comparing them.
Unfortunately, Postgres doesn't seem to have a built-in Unicode normalisation function, but you can easily access one via the PL/Perl or PL/Python language extensions.
For example:
create extension plpythonu;
create or replace function unicode_normalize(str text) returns text as $$
import unicodedata
# plpythonu is Python 2: the text argument arrives as UTF-8 bytes,
# so decode it first, then normalize to the composed (NFC) form
return unicodedata.normalize('NFC', str.decode('UTF-8'))
$$ language plpythonu;
Now, as the character sequence E'a\u0328' is mapped onto the equivalent precomposed character E'\u0105' by using unicode_normalize, the levenshtein distance is correct:
select levenshtein(unicode_normalize(E'a\u0328'), 'x');
levenshtein
-------------
1
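On PostgreSQL 13 or later, the built-in normalize() function (covered in a later answer below) makes the PL/Python helper unnecessary; a minimal sketch:
-- normalize() composes E'a\u0328' into the precomposed E'\u0105',
-- so levenshtein sees a single substitution
select levenshtein(normalize(E'a\u0328'), 'x');  -- 1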
Related
I'm using unaccent in Postgres, but it cannot convert a special character like:
ù : ù
although it works for ù: ù
The two characters look the same and mean the same thing, but they have different code points: the first one is the character u followed by the combining grave accent ̀ .
How can I solve this problem?
Thank you so much.
Your problem is Unicode normalization, which PostgreSQL unfortunately does not do. And it's not so simple to implement on your own.
But because you only want to remove diacritical marks, it is enough to remove the code points that are Unicode combining characters (before or after calling the unaccent() function):
select regexp_replace(
  'ùù',
  -- the Combining Diacritical Marks blocks (plus Extended, Supplement,
  -- for Symbols, and Half Marks)
  '[\u0300-\u036F\u1AB0-\u1AFF\u1DC0-\u1DFF\u20D0-\u20FF\uFE20-\uFE2F]',
  '',
  'g'
)
should do the trick.
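If the input mixes combining marks with precomposed accented letters, the two steps can be chained; a sketch, assuming the unaccent extension is installed:
select unaccent(regexp_replace(
  E'u\u0300\u00F9',  -- a decomposed ù followed by a precomposed ù
  '[\u0300-\u036F\u1AB0-\u1AFF\u1DC0-\u1DFF\u20D0-\u20FF\uFE20-\uFE2F]',
  '',
  'g'
));  -- returns 'uu'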
My Postgres database encodes everything as UTF-8, but in a query, when selecting a column, I want to know whether it can be encoded as Latin. I have no need to actually encode it as Latin; I just need to know whether it could be.
By Latin I mean what people generally mean by it, i.e. characters recognisable to Western European speakers,
e.g.
SELECT val
FROM table1
WHERE is_latin(val);
Solution
I used the answer posted below. First I tried the Python function, but it failed because I don't have that language installed. Then I tried the PL/pgSQL function, which failed because of a missing RETURN statement, but I fixed it as follows and it now works:
CREATE OR REPLACE FUNCTION is_latin(input text)
RETURNS boolean
LANGUAGE plpgsql
IMMUTABLE
AS $$
BEGIN
PERFORM convert_to(input, 'iso-8859-15');
RETURN true;
EXCEPTION
WHEN untranslatable_character THEN
RETURN false;
END;
$$;
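A quick sanity check, assuming a UTF-8 database:
SELECT is_latin('café');  -- t
SELECT is_latin('฿');     -- f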
Well, you need to be more specific about "latin".
Assuming you mean ISO-8859-15, typical for Western Europe:
regress=> SELECT convert_to('a€bcáéíöâ', 'iso-8859-15');
convert_to
----------------------
\x61a46263e1e9edf6e2
(1 row)
Beware, people often use iso-8859-1, but it doesn't support €.
However, you'll run into issues with currency symbols and other things that can appear in modern Western European text. For example, ₽ isn't part of ISO-8859-15. Nor are ฿, ₡, ₹, and other major currency symbols. (Oddly, ¥ is in ISO-8859-15.)
If you want to test without an error you'll need to either use PL/Python or similar, or use PL/PgSQL and trap the exception.
CREATE OR REPLACE FUNCTION is_latin(input text)
RETURNS boolean
LANGUAGE plpgsql
IMMUTABLE
AS $$
BEGIN
PERFORM convert_to(input, 'iso-8859-15');
EXCEPTION
WHEN untranslatable_character THEN
RETURN false;
END;
$$;
regress=> SELECT is_latin('฿');
is_latin
----------
f
(1 row)
That creates a savepoint on every call, though, which can get expensive, so perhaps PL/Python is better. This version assumes the server_encoding is UTF-8, which isn't wise; it should really check that properly. Anyway:
CREATE OR REPLACE FUNCTION is_latin(input text)
RETURNS boolean
LANGUAGE plpythonu
IMMUTABLE
AS $$
try:
    # plpythonu is Python 2: decode the UTF-8 bytes, then try to
    # re-encode them as ISO-8859-15 (see the € caveat above)
    input.decode("utf-8").encode("iso-8859-15")
    return True
except UnicodeEncodeError:
    return False
$$;
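The encoding check alluded to above could look like this; a hedged sketch using plpy.execute, which is part of the PL/Python API (is_latin_checked is a hypothetical name):
CREATE OR REPLACE FUNCTION is_latin_checked(input text)
RETURNS boolean
LANGUAGE plpythonu
STABLE
AS $$
# verify the server encoding instead of silently assuming UTF-8
enc = plpy.execute("SHOW server_encoding")[0]["server_encoding"]
if enc != "UTF8":
    plpy.error("is_latin_checked() expects a UTF-8 database, got %s" % enc)
try:
    input.decode("utf-8").encode("iso-8859-15")
    return True
except UnicodeEncodeError:
    return False
$$;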
Another option is to create a regular expression with a charset that matches all the chars you want to permit, but I suspect that'll be slow and ugly. Incomplete example:
SELECT 'ab฿cdé' ~ '^[a-zA-Z0-9.áéíóúÁÉÍÓÚàè]*$'
... where you'd probably use the iso-8859-15 encoding table to produce the character list.
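One way to derive that character list is to decode every ISO-8859-15 byte back into the database encoding and aggregate the result; a sketch, and you'd still have to escape regex metacharacters and prune anything you don't want to allow:
select string_agg(convert_from(set_byte('\x00'::bytea, 0, i), 'iso-8859-15'), '')
from generate_series(32, 255) as i
where i not between 127 and 159;  -- skip DEL and the C1 control range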
I have a large number of Scottish and Welsh accented place names (combining grave, acute, circumflex and diaereses) which I need to update to their Unicode-normalized form, e.g., the shorter form U+00E1 (\xe1) for á instead of U+0061 followed by U+0301.
I have found a solution from an old Postgres nabble mail list from 2009, using pl/python,
create or replace function unicode_normalize(str text) returns text as $$
import unicodedata
return unicodedata.normalize('NFC', str.decode('UTF-8'))
$$ LANGUAGE PLPYTHONU;
This works, as expected, but made me wonder if there was any way of doing it directly with built-in Postgres functions. I tried various conversions using convert_to, all in vain.
EDIT: As Craig has pointed out, and one of the things I tried:
SELECT convert_to(E'\u00E1', 'iso-8859-1');
returns \xe1, whereas
SELECT convert_to(E'\u0061\u0301', 'iso-8859-1');
fails with the ERROR: character 0xcc81 of encoding "UTF8" has no equivalent in "LATIN1"
I think this is a Pg bug.
In my opinion, PostgreSQL should be normalizing utf-8 into pre-composed form before performing encoding conversions. The results of the conversions shown above are wrong.
I'll raise it on pgsql-bugs ... done.
http://www.postgresql.org/message-id/53E179E1.3060404@2ndquadrant.com
You should be able to follow the thread there.
Edit: pgsql-hackers doesn't appear to agree, so this is unlikely to change in a hurry. I strongly advise you to normalise your UTF-8 at your application input boundaries.
BTW, this can be simplified down to:
regress=> SELECT 'á' = 'á';
?column?
----------
f
(1 row)
which is plain crazy-talk, but is permitted. The first is precomposed, the second is not. (To see this result you'll have to copy & paste, and it'll only work if your browser or terminal doesn't normalize the UTF-8; Firefox may not display the two forms correctly, while Chrome renders them correctly.)
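To reproduce it without copy & paste, spell the two forms out with Unicode escapes, which sidesteps any browser or terminal normalization:
-- precomposed U+00E1 vs. 'a' + combining acute U+0301:
-- different code point sequences, so the comparison is false
regress=> SELECT E'\u00E1' = E'\u0061\u0301';
?column?
----------
f
(1 row)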
PostgreSQL 13 introduced the string function normalize(text [, form]) → text, which is available when the server encoding is UTF8.
> select 'päivää' = 'päivää' as without, normalize('päivää') = normalize('päivää') as with_norm ;
without | with_norm
---------+-----------
f | t
(1 row)
Note that I expect this to bypass any indexes, so using it blindly in a hot production query is a recipe for disaster.
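If you need to filter on the normalized form without losing index support, an expression index is one way around that; a sketch, where the table t and column name are hypothetical:
-- normalize() is immutable, so it can be used in an index expression;
-- queries written against normalize(name) can then use the index
CREATE INDEX t_name_nfc_idx ON t (normalize(name));
SELECT * FROM t WHERE normalize(name) = normalize('päivää');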
Great news for us who have naively stored NFD filenames from Mac users in our databases.
Does there exist a standard Perl module or function that, given a Unicode Combining Character Sequence (or, more generally, an arbitrary Unicode text string), will generate a list of all canonically equivalent strings?
For example, if given the character U+1EAD, I'd like to get back a list of all these canonically equivalent sequences:
0061 0302 0323
0061 0323 0302
00E2 0323
1EA1 0302
1EAD
(I don't particularly care whether the interface is in terms of arrays of USVs or utf strings.)
Is this an XY problem? If you want to compare or match two Unicode strings, and you're worried that different ways of encoding the accented characters would create false negatives, then the best way is to normalize both strings with one of the normalization functions from Unicode::Normalize before doing the comparison or match.
Otherwise it gets a little messy.
You could get the complete character name using charnames::viacode(0x1EAD); (for U+1EAD it would be LATIN SMALL LETTER A WITH CIRCUMFLEX AND DOT BELOW), and get the various composing characters by splitting the name on WITH|AND. Then you could generate all combinations (checking that they exist!) of the base character + modifiers and the other modifiers. At this point you will run into the problem of matching the combining character names in the full name (e.g. CIRCUMFLEX) with the combining character's real name (COMBINING CIRCUMFLEX ACCENT). There are probably rules for this, but I don't know them.
This would be my naive attempt; there may be better ways of doing this, but so far no one has volunteered the information...
Is there an existing function to replace accented characters with unadorned characters in PostgreSQL? Characters like å and ø should become a and o respectively.
The closest thing I could find is the translate function, given the example in the comments section found here.
Some commonly used accented characters can be searched using the following function:
-- note the E'' prefix: with standard_conforming_strings on, the octal
-- escapes need escape-string syntax
translate(search_terms,
  E'\303\200\303\201\303\202\303\203\303\204\303\205\303\206\303\207\303\210\303\211\303\212\303\213\303\214\303\215\303\216\303\217\303\221\303\222\303\223\303\224\303\225\303\226\303\230\303\231\303\232\303\233\303\234\303\235\303\237\303\240\303\241\303\242\303\243\303\244\303\245\303\246\303\247\303\250\303\251\303\252\303\253\303\254\303\255\303\256\303\257\303\261\303\262\303\263\303\264\303\265\303\266\303\270\303\271\303\272\303\273\303\274\303\275\303\277',
  'AAAAAAACEEEEIIIINOOOOOOUUUUYSaaaaaaaceeeeiiiinoooooouuuuyy')
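That said, the unaccent contrib extension mentioned in the earlier threads covers far more characters than a hand-maintained translate() list; a minimal sketch:
create extension if not exists unaccent;
select unaccent('åø');  -- ao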
Are you doing this just for indexing/sorting? If so, you could use this PostgreSQL extension, which provides proper Unicode collation. The same group has a PostgreSQL extension for doing normalization.