PostgreSQL ignores dashes when ordering - postgresql

I have a PostgreSQL 8.4 database that is created with the da_DK.utf8 locale.
dbname=> show lc_collate;
lc_collate
------------
da_DK.utf8
(1 row)
When I select something from a table where I order on a character varying column I get a strange behaviour IMO. When ordering the result PostgreSQL ignores dashes that prefixes the value, e.g.:
select name from mytable order by name asc;
May return something like
name
----------------
Ad...
Ae...
Ag...
- Ak....
At....
The dash prefix seems to be ignored.
I can fix this issue by converting the column to latin1 when ordering:
select name from mytable order by convert_to(name, 'latin1') asc;
The I get the expected result as:
name
----------------
- Ak....
Ad...
Ae...
Ag...
At....
Why does the dash prefix get ignored by default? Can that behavior be changed?

This is because da_DK.utf8 locale defines it this way. Linux locale aware utilities, for example sort will also work like this.
Your convert_to(name, 'latin1') will break if it finds a character which is not in Latin 1 character set, for example €, so it isn't a good workaround.
You can use order by convert_to(name, 'SQL_ASCII'), which will ignore locale defined sort and simply use byte values.
Ugly hack edit:
order by
(
ascii(name) between ascii('a') and ascii('z')
or ascii(name) between ascii('A') and ascii('Z')
or ascii(name)>127
),
name;
This will sort first anything which starts with ASCII non-letter. This is very ugly, because sorting further in string would behave strange, but it can be good enough for you.

A workaround that will work in my specific case is to replace dashes with exclamation points. I happen to know that I will never get exclamation points and it will be sorted before any letters or digits.
select name from mytable order by translate(name, '-', '!') asc
It will certainly affect performance so I may look into creating a special column for sorting but I really don't like that either...

I don't know how seems ordering rules for Dutch, but for Polish special characters like space, dashes etc are not "counted" in sorting in most dictionaries. Some good sort routines do the same and ignores such special characters. Probably in Dutch there is similar rule, and this rule is implemented by Ubuntu locale aware sort function.

Related

Enforce printable characters only in text fields

The text data type in PostgreSQL database (encoding utf-8) can contain any UTF-8 character. These include a number of control characters (https://en.wikipedia.org/wiki/Unicode_control_characters)
While I agree there are cases when the control characters are needed, there is little (to none) of use of these characters in normal attributes like persons name, document number etc. In fact allowing such characters to be stored in DB can lead to nasty problems as the characters are not visible and the value of the attribute is not what it seems to be to the end user.
As the problem seems to be very general, is there a way to prevent control chars in text fields? Maybe there is a special text type (like citext for case-incencitive text)? Or should this behaviour be realized as a domain? Are there any other options? All I could find people talk is finding these characters using regex.
I could not find any general recommendations to solving the problem so maybe I'm missing something obvious here.
The exact answer will depend on what you consider printable.
However, a domain is the way to go. If you want to go with what your database collation considers a printable character, use a domain like this:
CREATE DOMAIN printable_text AS text CHECK (VALUE !~ '[^[:print:]]');
SELECT 'a'::printable_text;
printable_text
════════════════
a
(1 row)
SELECT E'\u0007'::printable_text; -- bell character (ASCII 7)
ERROR: value for domain printable_text violates check constraint "printable_text_check"

PostgreSQL pattern matching with Unicode graphemes

Is there any way to pattern match with Unicode graphemes?
As a quick example, when I run this query:
CREATE TABLE test (
id SERIAL NOT NULL,
name VARCHAR NOT NULL,
PRIMARY KEY (id),
UNIQUE (name)
);
INSERT INTO test (name) VALUES ('👍🏻 One');
INSERT INTO test (name) VALUES ('👍 Two');
SELECT * FROM public.test WHERE test.name LIKE '👍%';
I get both rows returned, rather than just '👍 Two'. Postgres seems to be just comparing code points, but I want it to compare full graphemes, so it should only match '👍 Two', because 👍🏻 is a different grapheme.
Is this possible?
It's a very interesting question!
I am not quite sure if it is possible anyway:
The skinned emojis are, in fact, two joined characters (like ligatures). The first character is the yellow hand 👍 which is followed by an emoji skin modifier 🏻
This is how the light skinned hand is stored internally. So, for me, your result makes sense:
When you query any string, that begins with 👍, it will return:
👍 Two (trivial)
👍_🏻 One (ignore the underscore, I try to suppress the automated ligature with this)
So, you can see, the light skinned emoji internally also starts with 👍. That's why I believe, that your query doesn't work the way you like.
Workarounds/Solutions:
You can add a space to your query. This ensures, that there's no skin modifier after your 👍 character. Naturally, this only works in your case, where all data sets have a space after the hand:
SELECT * FROM test WHERE name LIKE '👍 %';
You can simply extend the WHERE clause like this:
SELECT * FROM test
WHERE name LIKE '👍%'
AND name NOT LIKE '👍🏻%'
AND name NOT LIKE '👍🏼%'
AND name NOT LIKE '👍🏽%'
AND name NOT LIKE '👍🏾%'
AND name NOT LIKE '👍🏿%'
You can use regular expression pattern matching to exclude the skins:
SELECT * FROM test
WHERE name ~ '^👍[^🏻🏼🏽🏾🏿]*$'
see demo:db<>fiddle (note that the fiddle seems not to provide automated ligatures, so both characters are separated displayed there)

How can I use tsvector on a string with numbers?

I would like to use a postgres tsquery on a column that has strings that all contain numbers, like this:
FRUIT-239476234
If I try to make a tsquery out of this:
select to_tsquery('FRUIT-239476234');
What I get is:
'fruit' & '-239476234'
I want to be able to search by just the numeric portion of this value like so:
239476234
It seems that it is unable to match this because it is interpreting my hyphen as a "negative sign" and doesn't think 239476234 matches -239476234. How can I tell postgres to treat all of my characters as text and not try to be smart about numbers and hyphens?
An answer from the future. Once version 13 of PostgreSQL is released, you will be able to do use the dict_int module to do this.
create extension dict_int ;
ALTER TEXT SEARCH DICTIONARY intdict (MAXLEN = 100, ABSVAL=true);
ALTER TEXT SEARCH CONFIGURATION english ALTER MAPPING FOR int WITH intdict;
select to_tsquery('FRUIT-239476234');
to_tsquery
-----------------------
'fruit' & '239476234'
But you would probably be better off creating your own TEXT SEARCH DICTIONARY as well as copying the 'english' CONFIGURATION and modifying the copy, rather than modifying the default ones in place. Otherwise you have the risk that upgrading will silently lose your changes.
If you don't want to wait for v13, you could back-patch this change and compile into your own version of the extension for a prior server.
This is done by the text search parser, which is not configurable (short of writing your own parser in C, which is supported).
The simplest solution is to pre-process all search strings by replacing - with a space.

Localized COLLATE on a SQLite string comparison

I want to compare two strings in a SQLite DB without caring for the accents and the case. I mean "Événement" should be equal to "evenèment".
On Debian Wheezy, the SQLite package doesn't provide ICU. So I compiled the official SQLite package (version 3.7.15.2 2013-01-09 11:53:05) with contains an ICU module. Now, I do have a better Unicode support (the originallower() applied only to ASCII chars, now it works on other letters). But I can't manage to apply a collation to a comparison.
SELECT icu_load_collation('fr_FR', 'FRENCH');
SELECT 'événement' COLLATE FRENCH = 'evenement';
-- 0 (should be 1)
SELECT 'Événement' COLLATE FRENCH = 'événement';
-- 0 (should be 1 if collation was case-insensitive)
SELECT lower('Événement') = 'événement';
-- 1 (at least lower() works as expected with Unicode strings)
The SQLite documentation confirms that this is the right way to apply a collation. I think the documentation of this ICU extension is a bit light (few examples, nothing on case sensitivity for collations).
I don't understand why the COLLATE operator has no effect in my example above. Please help.
I took me hours to understand the situation... The way the ICU collations are defined in SQLite has (almost) no incidence on comparisons. An exception being, according to the ICU, Hebrew texts with cantillation marks. This is the default behavior of the ICU library's collation. With SQLite, LIKE becomes case-insensitive when ICU is loaded, but normalization of the accentuated letters can't be attained this way.
I finally understood that what I needed was to set the
strength
of the collation to the
primary level
instead of the default tertiary level.
I found no way to set this through the locale
(e.g several variants of SELECT icu_load_collation('fr_FR,strength=0', 'french') were useless).
So the only solution was to patch the code of SQLite.
It was easy thanks to the ucol_setStrength() function
in the ICU API.
The minimal change is a one-line patch: add the line ucol_setStrength(pUCollator, 0); after pUCollator = ucol_open(zLocale, &status); in the function icuLoadCollation().
For a backwards-compatible change, I added an optional third parameter to icu_load_collation() that sets the strength:
0 for default, 1 for primary, etc. up to 4-quaternary.
See the diff.
At last I have what I wanted:
SELECT icu_load_collation('fr_FR', 'french_ci', 1); -- collation with strength=primary
SELECT 'Événement' COLLATE french_ci = 'evenèment';
-- 1

Postgresql Order By is very weird

I am not familiar with Postgresql. Trying to learn it because I am moving my Rails apps to Heroku.
Here's an example with the ordering problem.
# select name_kr from users order by name_kr;
name_kr
---------
곽철
김영
박영
안준
양민
이남
임유
정신
차욱
강동수
강상구
강신용
강용석
강지영
강지원
강호석
You may not understand Korean. But one weird thing is that it displays 2 syllable words first and 3 syllables - each corretly ordered in its group.
Here's the related info:
kwanak_development=# show lc_collate;
lc_collate
-------------
en_US.UTF-8
(1 row)
kwanak_development=# show lc_ctype;
lc_ctype
-------------
en_US.UTF-8
(1 row)
What did I do wrong?
Thanks.
Sam
Additional Info:
I tried collation for order by and got an interesting result.
select name_kr from users order by name_kr collate "ko_KR"; => Same as above
select name_kr from users order by name_kr collate "C"; => Correct Result
PostgreSQL collation is mostly handled by PostgreSQL and should follow the same rules as the UNIX sort command. The first thing to do is to try using the sort command to determine if this is in fact the problem or if it is merely a symptom of something further down the stack.
If sort does not show this problem with the same locale settings, then please file a bug with the PostgreSQL team (this strikes me as very unlikely but it is possible). If it does show the problem, then you will need to take it up with the makers of the standard C libraries you are using.
As a final note for those of us unfamiliar with the ordering of Korean, you may want to try to describe the desired ordering rather than just the problem ordering.
Using GNU sort 5.93 on OS X, i get the same ordering in the default locale (which is probably one of en_GB.utf8 or en_US.utf8 - something which doesn't know Korean, anyway). However if i set LC_ALL to ko_KR.utf8, i get the three-character strings sorted first. The sets of two- and three- character strings keep the same order between themselves.
Note that all the three-character names begin with '강'. What this looks like is that '강' sorts after all the other initial characters in a naive locale, but sorts before it in Korean. If i insert a nonsense string made of one of the three-character strings with the initial character replaced with the initial character of one of the two-character strings (that is, "양호석"), then that sorts in with the two-character strings. This shows that the sort order is nothing to do with the length of the strings, and simply to do with the sorting of '강'.
I have absolutely no idea why '강' sorts after the other characters in my locale. '강' is at code point U+AC15. '곽' is at code point U+ACFD. '차' is at code point U+CC28. If the sort was on raw code point, '강' would sort before the other characters, as it does with the Korean sort.
If i sort these strings with Java, they come out with the '강' strings first, like the Korean sort. Java is pretty careful about unicode matters. The fact that it and the Korean sort agree leads me to think that that is the correct order.
If you encode the characters in UTF-8, then its first byte is 0xea, which again would sort before the other characters, which encode to bytes starting with values from 0xea to 0xec. This is presumably why collate "C" gives you the right result - that setting causes the strings to be sorted as strings of opaque bytes, not encoded characters.
I am completely baffled as to why collate "ko_KR" gives the wrong result.