Postgresql Order By is very weird - postgresql

I am not familiar with Postgresql. Trying to learn it because I am moving my Rails apps to Heroku.
Here's an example with the ordering problem.
# select name_kr from users order by name_kr;
name_kr
---------
곽철
김영
박영
안준
양민
이남
임유
정신
차욱
강동수
강상구
강신용
강용석
강지영
강지원
강호석
You may not understand Korean. But one weird thing is that it displays 2 syllable words first and 3 syllables - each corretly ordered in its group.
Here's the related info:
kwanak_development=# show lc_collate;
lc_collate
-------------
en_US.UTF-8
(1 row)
kwanak_development=# show lc_ctype;
lc_ctype
-------------
en_US.UTF-8
(1 row)
What did I do wrong?
Thanks.
Sam
Additional Info:
I tried collation for order by and got an interesting result.
select name_kr from users order by name_kr collate "ko_KR"; => Same as above
select name_kr from users order by name_kr collate "C"; => Correct Result

PostgreSQL collation is mostly handled by PostgreSQL and should follow the same rules as the UNIX sort command. The first thing to do is to try using the sort command to determine if this is in fact the problem or if it is merely a symptom of something further down the stack.
If sort does not show this problem with the same locale settings, then please file a bug with the PostgreSQL team (this strikes me as very unlikely but it is possible). If it does show the problem, then you will need to take it up with the makers of the standard C libraries you are using.
As a final note for those of us unfamiliar with the ordering of Korean, you may want to try to describe the desired ordering rather than just the problem ordering.

Using GNU sort 5.93 on OS X, i get the same ordering in the default locale (which is probably one of en_GB.utf8 or en_US.utf8 - something which doesn't know Korean, anyway). However if i set LC_ALL to ko_KR.utf8, i get the three-character strings sorted first. The sets of two- and three- character strings keep the same order between themselves.
Note that all the three-character names begin with '강'. What this looks like is that '강' sorts after all the other initial characters in a naive locale, but sorts before it in Korean. If i insert a nonsense string made of one of the three-character strings with the initial character replaced with the initial character of one of the two-character strings (that is, "양호석"), then that sorts in with the two-character strings. This shows that the sort order is nothing to do with the length of the strings, and simply to do with the sorting of '강'.
I have absolutely no idea why '강' sorts after the other characters in my locale. '강' is at code point U+AC15. '곽' is at code point U+ACFD. '차' is at code point U+CC28. If the sort was on raw code point, '강' would sort before the other characters, as it does with the Korean sort.
If i sort these strings with Java, they come out with the '강' strings first, like the Korean sort. Java is pretty careful about unicode matters. The fact that it and the Korean sort agree leads me to think that that is the correct order.
If you encode the characters in UTF-8, then its first byte is 0xea, which again would sort before the other characters, which encode to bytes starting with values from 0xea to 0xec. This is presumably why collate "C" gives you the right result - that setting causes the strings to be sorted as strings of opaque bytes, not encoded characters.
I am completely baffled as to why collate "ko_KR" gives the wrong result.

Related

Why doesn't ICU4J match UTF-8 sort order?

I am having a hard time understanding unicode sorting order.
When I run Collator.getInstance(Locale.ENGLISH).compare("_", "#") under ICU4J 55.1 I get a return value of -1 indicating that _ comes before #.
However, looking at http://www.utf8-chartable.de/unicode-utf8-table.pl?utf8=dec I see that # (U+0023) comes before _ (U+005F). Why is ICU4J returning a value of -1?
First, UTF-8 is just an encoding. It specifies how to store the Unicode code points physically, but does not handle sorting, comparisons, etc.
Now, the page you linked to shows everything in numerical Code Point order. That is the order things would sort in if using a binary collation (in SQL Server, that would be collations with names ending in _BIN and _BIN2). But the non-binary ordering is far more complex. The rules are described here: Unicode Collation Algorithm (UCA).
The base rules are found here: http://www.unicode.org/repos/cldr/tags/release-28/common/uca/allkeys_CLDR.txt
It shows:
005F ; [*010A.0020.0002] # LOW LINE
...
0023 ; [*0290.0020.0002] # NUMBER SIGN
It is very important to keep in mind that any locale / culture can override these base rules. Hence, while the few lines noted above explain this specific circumstance, other circumstances would need to check http://www.unicode.org/repos/cldr/tags/release-28/common/collation/ to see if there are any locale-specific overrides.
Converting Mark Ransom's comments into an answer:
The ordering of individual characters is based on a collation table, which has little relationship to the codepoint numbers. See: http://www.unicode.org/reports/tr10/#Default_Unicode_Collation_Element_Table
If you follow the first link on that page, it leads to allkeys.txt which gives the default collation ordering.
In particular, _ is 005F ; [*020B.0020.0002] # LOW LINE while # is 0023 ; [*0391.0020.0002] # NUMBER SIGN. Note that the collation numbers for _ are lower than the numbers for #.

Unicode normalization in Postgres

I have a large number of Scottish and Welsh accented place names (combining grave, acute, circumflex and diareses) which I need to update to their unicode normalized form, eg, the shorter form 00E1 (\xe1) for á instead of 0061 + 0301 (\x61\x301)
I have found a solution from an old Postgres nabble mail list from 2009, using pl/python,
create or replace function unicode_normalize(str text) returns text as $$
import unicodedata
return unicodedata.normalize('NFC', str.decode('UTF-8'))
$$ LANGUAGE PLPYTHONU;
This works, as expected, but made me wonder if there was any way of doing it directly with built-in Postgres functions. I tried various conversions using convert_to, all in vain.
EDIT: As Craig has pointed out, and one of the things I tried:
SELECT convert_to(E'\u00E1', 'iso-8859-1');
returns \xe1, whereas
SELECT convert_to(E'\u0061\u0301', 'iso-8859-1');
fails with the ERROR: character 0xcc81 of encoding "UTF8" has no equivalent in "LATIN1"
I think this is a Pg bug.
In my opinion, PostgreSQL should be normalizing utf-8 into pre-composed form before performing encoding conversions. The result of the conversions shown are wrong.
I'll raise it on pgsql-bugs ... done.
http://www.postgresql.org/message-id/53E179E1.3060404#2ndquadrant.com
You should be able to follow the thread there.
Edit: pgsql-hackers doesn't appear to agree, so this is unlikely to change in a hurry. I strongly advise you to normalise your UTF-8 at your application input boundaries.
BTW, this can be simplified down to:
regress=> SELECT 'á' = 'á';
?column?
----------
f
(1 row)
which is plain crazy-talk, but is permitted. The first is precomposed, the second is not. (To see this result you'll have to copy & paste, and it'll only work if your browser or terminal don't normalize utf-8).
If you're using Firefox you might not see the above correctly; Chrome renders it correctly. Here's what you should see if your browser handles decomposed Unicode correctly:
PostgreSQL 13 has introduced string function normalize ( text [, form ] ) → text, which is available when the server encoding is UTF8.
> select 'päivää' = 'päivää' as without, normalize('päivää') = normalize('päivää') as with_norm ;
without | with_norm
---------+-----------
f | t
(1 row)
Note that I am expecting this to miss any indices, and therefore using this blindly in a hot production query is prone to be a recipe for disaster.
Great news for us who have naively stored NFD filenames from Mac users in our databases.

Unicode character default collation table

I don't know which site this question belongs exactly, so posting it here.
I use Postgresql 9.2 on RHEL 6.4 and observe the following:
select foo
from unnest('{а,ә,б,в,г,д,е,ж}'::text[]) as foo
order by foo collate "kk_KZ.utf8"
gives
а
ә
б
в
г
д
е
ж
BUT
select foo
from unnest('{а,ә,б,в,г,д,е,ж}'::text[]) as foo
order by foo collate "en_US.utf8"
gives
а
б
в
г
д
е
ә -- misplaced
ж
Further, I found that there is the Default Unicode Collation Element Table [1], which lists the character in question (04D9 ; [.199D.0020.0002.04D9] # CYRILLIC SMALL LETTER SCHWA) in proper order.
I understand that it is silly to expect the cyrillic characters be handled properly by "en_US.utf8" locale, but what is the correct behavior by Unicode or any other relevant standards in cases, where a character does not normally belong to language/locale used for collation?
[1] http://www.unicode.org/Public/UCA/latest/allkeys.txt
It's not misplaced. It might be to you, but it's not to me. :-) In all seriousness, there is no correct behavior by Unicode; there simply cannot be. A character set is a mapping; the collation is a locale-specific set of rules to sort the characters in that set -- and even within the same locale there can be multiple collations.
The ICU docs has colorful examples of how thorny this kind of stuff gets, in case you're curious. Quoting extensively:
http://userguide.icu-project.org/collation
[H]ere are some of the ways languages vary in ordering strings:
The letters A-Z can be sorted in a different order than in English. For example, in Lithuanian, "y" is sorted between "i" and "k".
Combinations of letters can be treated as if they were one letter. For example, in traditional Spanish "ch" is treated as a single letter, and sorted between "c" and "d".
Accented letters can be treated as minor variants of the unaccented letter. For example, "é" can be treated equivalent to "e".
Accented letters can be treated as distinct letters. For example, "Å" in Danish is treated as a separate letter that sorts just after "Z".
Unaccented letters that are considered distinct in one language can be indistinct in another. For example, the letters "v" and "w" are two different letters according to English. However, "v" and "w" are considered variant forms of the same letter in Swedish.
A letter can be treated as if it were two letters. For example, in traditional German "ä" is compared as if it were "ae".
Thai requires that the order of certain letters be reversed.
French requires that letters sorted with accents at the end of the string be sorted ahead of accents in the beginning of the string. For example, the word "côte" sorts before "coté" because the acute accent on the final "e" is more significant than the circumflex on the "o".
Sometimes lowercase letters sort before uppercase letters. The reverse is required in other situations. For example, lowercase letters are usually sorted before uppercase letters in English. Latvian letters are the exact opposite.
Even in the same language, different applications might require different sorting orders. For example, in German dictionaries, "öf" would come before "of". In phone books the situation is the exact opposite.
Sorting orders can change over time due to government regulations or new characters/scripts in Unicode.
Postgresql uses the locales provided by the operating system. In your setup, locales are provided by glibc. Glibc uses a heavily modified version of an "ancient" version of ISO 14651 (see glibc Bug 14095 - Review / update collation data from Unicode / ISO 14651 for information on current pains in trying to update glibc locale data).
As of glibc 2.28, to be released on 2018-08-01, glibc will use data from ISO 14651:2016 (which is synchronized to Unicode 9), and will give the order the OP expects for en_US.
ISO 14651 is Method for comparing character strings and description of the common template tailorable ordering and it is similar to the UCA, with some differences. The CTT (Common Template Table) is the ISO14651 equivalent of the DUCET, and they are aligned.
The first time CYRILLIC SMALL LETTER SCHWA appeared in a collation table in glibc was for the az_AZ locale (Azerbaijani), where it is ordered after CYRILLIC SMALL LETTER IE. This corresponds to:
commit fcababc4e18fee81940dab20f7c40b1e1fb67209
Author: Ulrich Drepper <drepper#redhat.com>
Date: Fri Aug 3 08:42:28 2001 +0000
Update.
2001-08-03 Ulrich Drepper <drepper#redhat.com>
* locale/iso-639.def: Add Tigrinya.
From there, that ordering was eventually moved to the file iso14651_t1 as per Bug 672 - Include iso14651_t1 in collation rules, which was an effort to simplify glibc locale data. This corresponds to:
commit 5d2489928c0040d2a71dd0e63c801f2cf98e7efc
Author: Ulrich Drepper <drepper#redhat.com>
Date: Sun Feb 18 04:34:28 2007 +0000
[BZ #672]
2005-01-16 Denis Barbier <barbier#linuxfr.org>
[BZ #672]
* locales/ca_ES: Replace current collation rules by including
iso14651_t1 and adding extra rules if needed. There should be
no noticeable changes in sorted text. only ligatures and
ignoreable characters have modified weights.
* locales/da_DK: Likewise.
* locales/en_CA: Likewise.
* locales/es_US: Likewise.
* locales/fi_FI: Likewise.
* locales/nb_NO: Likewise.
[BZ #672]
* locales/iso14651_t1: Simplified. Extended.
Most locales in glibc start from iso14651_t1, and tailor it, which is what you are seeing with en_US.
While glibc based its default ordering in Azerbaijani, the DUCET instead bases it on the ordering for Kazakh and Tatar, which is where the difference comes from.
The Unicode Collation Algorithm allows any tailorings to be made to the DUCET.
There isn't a "correct" behaviour. There are various behaviours one could expect, and the most appropriate depends on the context, the audience. Sometimes any behaviour could be correct, since there isn't really a reason to force any order of cyrillic betters in an American English collation.
The Common Locale Data Repository provides locale-specific tailorings to the DUCET. The CLDR uses LDML (Locale Data Markup Language) to specify the tailorings, and the syntax is given by the Unicode Technical Specification #35, part 5.
The latest version of the data provided by the CLDR for en_US has no tailorings: it uses a modified version of the DUCET (as stated in UTS#35 under "Root collation"). It lists the cyrillic schwa after the cyrillic A, i.e., the order you were expecting.
There is also data for an en_US_POSIX locale, and that one includes some modifications, but none changes anything that isn't in ASCII.
It appears the en_US locale installed in your system uses a tailoring that puts the schwa next to E probably because of their similar form. It could be argued that would cause fewer surprises to an American English audience than sorting the schwa after A: ask people what that is and see how many will just tell you it is an "upside-down E". It isn't right or wrong, but if you ask me, it seems more appropriate than the collation found in the CLDR.

Localized COLLATE on a SQLite string comparison

I want to compare two strings in a SQLite DB without caring for the accents and the case. I mean "Événement" should be equal to "evenèment".
On Debian Wheezy, the SQLite package doesn't provide ICU. So I compiled the official SQLite package (version 3.7.15.2 2013-01-09 11:53:05) with contains an ICU module. Now, I do have a better Unicode support (the originallower() applied only to ASCII chars, now it works on other letters). But I can't manage to apply a collation to a comparison.
SELECT icu_load_collation('fr_FR', 'FRENCH');
SELECT 'événement' COLLATE FRENCH = 'evenement';
-- 0 (should be 1)
SELECT 'Événement' COLLATE FRENCH = 'événement';
-- 0 (should be 1 if collation was case-insensitive)
SELECT lower('Événement') = 'événement';
-- 1 (at least lower() works as expected with Unicode strings)
The SQLite documentation confirms that this is the right way to apply a collation. I think the documentation of this ICU extension is a bit light (few examples, nothing on case sensitivity for collations).
I don't understand why the COLLATE operator has no effect in my example above. Please help.
I took me hours to understand the situation... The way the ICU collations are defined in SQLite has (almost) no incidence on comparisons. An exception being, according to the ICU, Hebrew texts with cantillation marks. This is the default behavior of the ICU library's collation. With SQLite, LIKE becomes case-insensitive when ICU is loaded, but normalization of the accentuated letters can't be attained this way.
I finally understood that what I needed was to set the
strength
of the collation to the
primary level
instead of the default tertiary level.
I found no way to set this through the locale
(e.g several variants of SELECT icu_load_collation('fr_FR,strength=0', 'french') were useless).
So the only solution was to patch the code of SQLite.
It was easy thanks to the ucol_setStrength() function
in the ICU API.
The minimal change is a one-line patch: add the line ucol_setStrength(pUCollator, 0); after pUCollator = ucol_open(zLocale, &status); in the function icuLoadCollation().
For a backwards-compatible change, I added an optional third parameter to icu_load_collation() that sets the strength:
0 for default, 1 for primary, etc. up to 4-quaternary.
See the diff.
At last I have what I wanted:
SELECT icu_load_collation('fr_FR', 'french_ci', 1); -- collation with strength=primary
SELECT 'Événement' COLLATE french_ci = 'evenèment';
-- 1

PostgreSQL ignores dashes when ordering

I have a PostgreSQL 8.4 database that is created with the da_DK.utf8 locale.
dbname=> show lc_collate;
lc_collate
------------
da_DK.utf8
(1 row)
When I select something from a table where I order on a character varying column I get a strange behaviour IMO. When ordering the result PostgreSQL ignores dashes that prefixes the value, e.g.:
select name from mytable order by name asc;
May return something like
name
----------------
Ad...
Ae...
Ag...
- Ak....
At....
The dash prefix seems to be ignored.
I can fix this issue by converting the column to latin1 when ordering:
select name from mytable order by convert_to(name, 'latin1') asc;
The I get the expected result as:
name
----------------
- Ak....
Ad...
Ae...
Ag...
At....
Why does the dash prefix get ignored by default? Can that behavior be changed?
This is because da_DK.utf8 locale defines it this way. Linux locale aware utilities, for example sort will also work like this.
Your convert_to(name, 'latin1') will break if it finds a character which is not in Latin 1 character set, for example €, so it isn't a good workaround.
You can use order by convert_to(name, 'SQL_ASCII'), which will ignore locale defined sort and simply use byte values.
Ugly hack edit:
order by
(
ascii(name) between ascii('a') and ascii('z')
or ascii(name) between ascii('A') and ascii('Z')
or ascii(name)>127
),
name;
This will sort first anything which starts with ASCII non-letter. This is very ugly, because sorting further in string would behave strange, but it can be good enough for you.
A workaround that will work in my specific case is to replace dashes with exclamation points. I happen to know that I will never get exclamation points and it will be sorted before any letters or digits.
select name from mytable order by translate(name, '-', '!') asc
It will certainly affect performance so I may look into creating a special column for sorting but I really don't like that either...
I don't know how seems ordering rules for Dutch, but for Polish special characters like space, dashes etc are not "counted" in sorting in most dictionaries. Some good sort routines do the same and ignores such special characters. Probably in Dutch there is similar rule, and this rule is implemented by Ubuntu locale aware sort function.