Unicode normalization in Postgres - postgresql

I have a large number of Scottish and Welsh accented place names (combining grave, acute, circumflex and diaeresis) which I need to update to their Unicode-normalized form, e.g. the shorter form 00E1 (\xe1) for á instead of 0061 + 0301 (\x61\x301).
I have found a solution on an old Postgres Nabble mailing list from 2009, using PL/Python:
create or replace function unicode_normalize(str text) returns text as $$
# plpythonu is Python 2, so the text argument arrives as a byte string
import unicodedata
return unicodedata.normalize('NFC', str.decode('UTF-8'))
$$ LANGUAGE PLPYTHONU;
This works as expected, but it made me wonder whether there was any way of doing it directly with built-in Postgres functions. I tried various conversions using convert_to, all in vain.
EDIT: As Craig has pointed out (and it was one of the things I tried):
SELECT convert_to(E'\u00E1', 'iso-8859-1');
returns \xe1, whereas
SELECT convert_to(E'\u0061\u0301', 'iso-8859-1');
fails with the ERROR: character 0xcc81 of encoding "UTF8" has no equivalent in "LATIN1"

I think this is a Pg bug.
In my opinion, PostgreSQL should be normalizing UTF-8 into pre-composed form before performing encoding conversions. The results of the conversions shown are wrong.
I'll raise it on pgsql-bugs ... done.
http://www.postgresql.org/message-id/53E179E1.3060404@2ndquadrant.com
You should be able to follow the thread there.
Edit: pgsql-hackers doesn't appear to agree, so this is unlikely to change in a hurry. I strongly advise you to normalise your UTF-8 at your application input boundaries.
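If you cannot control every input path, one fallback (a rough sketch, reusing the unicode_normalize() function from the question; the table, column and trigger names here are made up) is to normalize inside the database with a trigger instead:
CREATE OR REPLACE FUNCTION normalize_name() RETURNS trigger AS $$
BEGIN
    -- store the NFC form, whatever the client sent
    NEW.name := unicode_normalize(NEW.name);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER places_normalize_name
    BEFORE INSERT OR UPDATE ON places
    FOR EACH ROW EXECUTE PROCEDURE normalize_name();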
BTW, this can be simplified down to:
regress=> SELECT 'á' = 'á';
?column?
----------
f
(1 row)
which is plain crazy-talk, but is permitted. The first is precomposed, the second is not. (To reproduce this result you'll have to copy & paste, and it will only work if your browser or terminal doesn't normalize the UTF-8.)
If you're using Firefox you might not see the above correctly; Chrome renders it correctly.
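If you want to find which rows are still stored in decomposed form, a small sketch (the table and column names are made up; U+0300 to U+036F is the combining diacritical marks block):
-- rows that still contain combining diacritical marks
SELECT name FROM places WHERE name ~ '[\u0300-\u036F]';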

PostgreSQL 13 introduced the string function normalize(text [, form ]) → text, which is available when the server encoding is UTF8.
> select 'päivää' = 'päivää' as without, normalize('päivää') = normalize('päivää') as with_norm ;
 without | with_norm 
---------+-----------
 f       | t
(1 row)
Note that I expect this to miss any indexes, so using it blindly in a hot production query is a recipe for disaster.
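If you do need it in queries, an expression index avoids that; a rough sketch, assuming a table places(name text) (normalize() is immutable, so it is allowed in an index expression):
CREATE INDEX places_name_nfc_idx ON places (normalize(name));
-- this comparison can now use the index
SELECT * FROM places WHERE normalize(name) = normalize('Dùn Èideann');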
Great news for those of us who have naively stored NFD filenames from Mac users in our databases.

Related

Using unaccent with two different rules

The German language uses the diacritical characters ä, ö, ü. For international use they are transliterated into ae, oe, ue (not a, o, u). This means that Müller is Mueller on his ID document. This is what we get when we read the document with (for example) a passport reader, and this is what we save to the database table.
In the next step we search for the records. We do it in two ways:
by entering search data with a passport reader (no problem here)
by entering search data manually
With manual input there is a small problem, because the user may enter the data the international way ('Mueller') or the popular way ('Müller').
This problem can be solved by using the Postgres extension unaccent and a modified unaccent.rules file, so that whether the user enters 'Mueller' or 'Müller', we search the database for Mueller.
So far so good...
BUT
in the same table we also have names of other origins, for example Turkish ones. Turks transliterate their umlauts (ä, ö, ü) directly into a, o, u, and that is how they are written on their documents, so Müller would be Muller on a Turkish document. This causes a problem because (as described before) we search with the German unaccent.rules and don't find the people we are looking for.
Long story, but finally the question...
... does anybody have an idea how to handle this?
Is there any way to have two unaccent.rules files and use them with OR? For example:
Select * from table
where last_name = unaccent('Müller' (use German rules))
or last_name = unaccent('Müller' (use Turkish rules))
(I know that what's above does not work, but maybe there is something similar we could use)
regards
M
The solution should be simple. Define your German unaccent dictionary (I'll call it entumlauten), then query like this:
SELECT ...,
       last_name = unaccent('unaccent', 'Müller') AS might_be_turkish,
       last_name = unaccent('entumlauten', 'Müller') AS might_be_german
FROM tab
WHERE last_name IN (unaccent('unaccent', 'Müller'),
                    unaccent('entumlauten', 'Müller'))
IN (or = ANY) will perform better than OR, because it can use an index scan. The additional columns in the SELECT list tell you which condition was matched.
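In case it helps, a sketch of how the entumlauten dictionary could be created (this assumes a rules file entumlauten.rules, with lines such as "ü ue" and "ß ss", has been installed in the server's tsearch_data directory):
CREATE EXTENSION IF NOT EXISTS unaccent;
-- second dictionary, based on the unaccent template, reading entumlauten.rules
CREATE TEXT SEARCH DICTIONARY entumlauten (
    TEMPLATE = unaccent,
    RULES    = 'entumlauten'
);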
Use the soundex() function. This is suitable only for creating candidate lists for a human user to pick the wanted name from. You should probably strip all diacritics (the Turkish way) before using it.
It also handles similar-sounding letters, like C, S and Z or D and T, so Schmidt would match Smith and Jönssen would match Johnson.
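For example (a sketch; soundex() comes from the fuzzystrmatch extension, and unaccent() is used here to strip the diacritics first):
CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;
-- strip diacritics, then compare the sound of the names
SELECT soundex(unaccent('Müller')) = soundex('Muller');  -- returns t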

Select strange characters on text, not working with LIKE operator

I tried to use this solution and this one (for str_eval()), but it seems to be a different encoding or a different UTF-8 normalization form, perhaps combining diacritical marks...
select distinct logradouro, str_eval(logradouro)
from logradouro where logradouro like '%CECi%';
-- logradouro                 | str_eval
-- ---------------------------+---------------------------
-- AV CECi\u008DLIA MEIRELLES | AV CECi\u008DLIA MEIRELLES
PROBLEM: how can I select all rows of the table where the problem exists? That is, where \u occurs?
It does not work with like '%CECi\u%' nor with like '%CECi\\u%'.
It works with like E'%CECi\u008D%', but that is not generic.
For Google, edited after the question was solved: this is a typical XY problem. In the original question (above) I worked from a wrong hypothesis. All the solutions below are answers to the following (objective) question:
How to select only printable ASCII text?
"Printable ASCII" is a subset of UTF-8: it is all ASCII that is not a "control character".
The "non-printable" control characters are Unicode hexadecimal 00 to 1F plus 7F (decimal 0 to 31 plus 127).
PS1: the zero (0x00) is the "end of text" mark of the PostgreSQL text datatype's internal representation, so it doesn't need to be checked, but there is no problem including it in the range.
PS2: about the secondary question "how to convert a word with an encoding bug into a valid word?", see the heuristic in my answer.
This condition matches any string that does not consist entirely of printable ASCII characters:
logradouro ~ '[^\u0020-\u007E]'
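Applied to the question's table, a sketch:
-- rows containing at least one character outside printable ASCII (0x20-0x7E)
SELECT DISTINCT logradouro
FROM logradouro
WHERE logradouro ~ '[^\u0020-\u007E]';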
Solving with a workaround
select distinct logradouro, str_eval(logradouro)
from logradouro where not(logradouro ~ E'^[a-zA-Z0-9_,;\\- \\.\\(\\)\\/"\'\\*]+$');
There is a systematic bug in the encoding, and no way to convert it to correct UTF-8... Even after converting, the problem is that "CECi\u008DLIA" is not "CECíLIA".
The solution is to use a kind of "heuristic spell corrector" on
regexp_replace(logradouro, E'[^a-zA-Z0-9_,;\\- \\.\\(\\)\\/"\'\\*]+', '!')
Example: the i! of "Ceci!lia" is corrected to í.
NOTICE. Any heuristic solution (or neural network) trained with a specific dataset (specific systematic error source) is a black box solution, valid only for that type of systematic error. There is no generalization for this type of problem.

Replace characters with multi-character strings

I am trying to replace German and Dutch umlauts such as ä, ü, or ß. They should be written like ae instead of ä. So I can't simply translate one char with another.
Is there a more elegant way to do that? At the moment it looks like this (not complete yet):
SELECT addr, REPLACE (REPLACE(addr, 'ü','ue'),'ß','ss') FROM search;
While trying different commands I ran into another problem:
When I searched for Ü I got this:
ERROR: invalid byte sequence for encoding "UTF8": 0xdc27
I tried it with U&'\0220', but it didn't replace anything. Only by using ü (for lowercase ü) was it replaced correctly. It must have something to do with Unicode, but how do I solve this issue?
Kind regards from Germany. :)
Your server encoding seems to be UTF8.
I suspect your client_encoding does not match, which might give you a wrong impression of what you are dealing with. Check with:
SHOW client_encoding; -- in your actual session
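If it does not match what your terminal actually sends, a sketch of adjusting it for the session:
SET client_encoding = 'UTF8';  -- must match what the terminal really sends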
And read these related answers:
Can not insert German characters in Postgres
Replace unicode characters in PostgreSQL
The rest of the tool chain has to be in sync, too. When using PuTTY, for instance, one has to make sure the terminal agrees with the rest: Change settings... Window -> Translation -> Remote character set = UTF-8.
As for your first question, you already have the best solution. A couple of umlauts are best replaced with a string of replace() statements.
As you seem to know already as well, single character replacements are more efficient with (a single) translate() statement.
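To illustrate the difference, a sketch using the addr column from the question (the substitutions here are examples, not a complete list):
-- single-character substitutions: one translate() pass
SELECT translate(addr, 'áàâ', 'aaa') FROM search;
-- multi-character substitutions: chained replace() calls
SELECT replace(replace(replace(addr, 'ä', 'ae'), 'ö', 'oe'), 'ü', 'ue') FROM search;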
Related:
Replace unicode characters in PostgreSQL
Regex remove all occurrences of multiple characters in a string
Among other reasons, I decided to write the replacement in Python. As Erwin wrote before, it seems there is no better solution than combining replace commands.
It is pretty simple overall; no special encoding handling even had to be used. My "final" solution now looks like this:
# cur is an open database cursor (e.g. psycopg2); table and column names are from the question
ger_UE="Ü"
ger_AE="Ä"
ger_OE="Ö"
ger_SS="ß"
dk_AA="Å"
dk_OE="Ø"
dk_AE="Æ"
cur.execute("""Select addr, REPLACE (REPLACE (REPLACE( REPLACE (REPLACE (REPLACE (REPLACE(addr, '%s','UE'),'%s','OE'),'%s','AE'),'%s','SS'),'%s','AA'),'%s','OE'),'%s','AE')
from search WHERE x = '1';"""%(ger_UE,ger_OE,ger_AE,ger_SS,dk_AA,dk_OE,dk_AE))
I am now looking forward to the speed when it hits the large table. If anyone would like to make some annotations, they are very welcome.

Localized COLLATE on a SQLite string comparison

I want to compare two strings in a SQLite DB without caring for the accents and the case. I mean "Événement" should be equal to "evenèment".
On Debian Wheezy, the SQLite package doesn't provide ICU. So I compiled the official SQLite package (version 3.7.15.2 2013-01-09 11:53:05), which contains an ICU module. Now I do have better Unicode support (the original lower() applied only to ASCII chars; now it works on other letters too). But I can't manage to apply a collation to a comparison.
SELECT icu_load_collation('fr_FR', 'FRENCH');
SELECT 'événement' COLLATE FRENCH = 'evenement';
-- 0 (should be 1)
SELECT 'Événement' COLLATE FRENCH = 'événement';
-- 0 (should be 1 if collation was case-insensitive)
SELECT lower('Événement') = 'événement';
-- 1 (at least lower() works as expected with Unicode strings)
The SQLite documentation confirms that this is the right way to apply a collation. I think the documentation of this ICU extension is a bit light (few examples, nothing on case sensitivity for collations).
I don't understand why the COLLATE operator has no effect in my example above. Please help.
It took me hours to understand the situation... The locale used when defining ICU collations in SQLite has (almost) no effect on comparisons; an exception, according to ICU, being Hebrew texts with cantillation marks. This is the default behavior of the ICU library's collation. With SQLite, LIKE becomes case-insensitive when ICU is loaded, but treating accented letters as equal to their base letters can't be achieved this way.
I finally understood that what I needed was to set the strength of the collation to the primary level instead of the default tertiary level. I found no way to set this through the locale (e.g. several variants of SELECT icu_load_collation('fr_FR,strength=0', 'french') were useless).
So the only solution was to patch the code of SQLite. It was easy thanks to the ucol_setStrength() function in the ICU API.
The minimal change is a one-line patch: add the line ucol_setStrength(pUCollator, 0); after pUCollator = ucol_open(zLocale, &status); in the function icuLoadCollation().
For a backwards-compatible change, I added an optional third parameter to icu_load_collation() that sets the strength: 0 for default, 1 for primary, and so on up to 4 for quaternary.
See the diff.
At last I have what I wanted:
SELECT icu_load_collation('fr_FR', 'french_ci', 1); -- collation with strength=primary
SELECT 'Événement' COLLATE french_ci = 'evenèment';
-- 1

Postgresql Order By is very weird

I am not familiar with Postgresql. Trying to learn it because I am moving my Rails apps to Heroku.
Here's an example with the ordering problem.
# select name_kr from users order by name_kr;
name_kr
---------
곽철
김영
박영
안준
양민
이남
임유
정신
차욱
강동수
강상구
강신용
강용석
강지영
강지원
강호석
You may not understand Korean, but the weird thing is that it displays the 2-syllable names first and then the 3-syllable ones, each correctly ordered within its own group.
Here's the related info:
kwanak_development=# show lc_collate;
lc_collate
-------------
en_US.UTF-8
(1 row)
kwanak_development=# show lc_ctype;
lc_ctype
-------------
en_US.UTF-8
(1 row)
What did I do wrong?
Thanks.
Sam
Additional Info:
I tried collation for order by and got an interesting result.
select name_kr from users order by name_kr collate "ko_KR"; => Same as above
select name_kr from users order by name_kr collate "C"; => Correct Result
Collation in PostgreSQL is mostly handled by the operating system's C library and should follow the same rules as the UNIX sort command. The first thing to do is to try the sort command to determine whether this is in fact the problem, or merely a symptom of something further down the stack.
If sort does not show this problem with the same locale settings, then please file a bug with the PostgreSQL team (this strikes me as very unlikely but it is possible). If it does show the problem, then you will need to take it up with the makers of the standard C libraries you are using.
As a final note for those of us unfamiliar with the ordering of Korean, you may want to try to describe the desired ordering rather than just the problem ordering.
Using GNU sort 5.93 on OS X, I get the same ordering in the default locale (which is probably one of en_GB.utf8 or en_US.utf8 - something which doesn't know Korean, anyway). However, if I set LC_ALL to ko_KR.utf8, I get the three-character strings sorted first. The sets of two- and three-character strings keep the same order between themselves.
Note that all the three-character names begin with '강'. What this looks like is that '강' sorts after all the other initial characters in a naive locale, but before them in Korean. If I make a nonsense string by taking one of the three-character strings and replacing its initial character with the initial character of one of the two-character strings (that is, "양호석"), then it sorts in with the two-character strings. This shows that the sort order has nothing to do with the length of the strings, and everything to do with the sorting of '강'.
I have absolutely no idea why '강' sorts after the other characters in my locale. '강' is at code point U+AC15. '곽' is at code point U+ACFD. '차' is at code point U+CC28. If the sort was on raw code point, '강' would sort before the other characters, as it does with the Korean sort.
If I sort these strings with Java, they come out with the '강' strings first, like the Korean sort. Java is pretty careful about Unicode matters. The fact that it and the Korean sort agree leads me to think that that is the correct order.
If you encode the characters in UTF-8, the first byte of '강' is 0xea, which again would sort before the other characters, which encode to bytes starting with values from 0xea to 0xec. This is presumably why collate "C" gives you the right result - that setting causes the strings to be sorted as strings of opaque bytes, not encoded characters.
I am completely baffled as to why collate "ko_KR" gives the wrong result.
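For what it is worth, on newer servers (PostgreSQL 10 or later built with ICU support) you can sidestep the libc locale entirely; a sketch, with the collation name being my own invention:
CREATE COLLATION ko_icu (provider = icu, locale = 'ko-KR');
SELECT name_kr FROM users ORDER BY name_kr COLLATE "ko_icu";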