Set Order By to ignore punctuation on a per-column basis - postgresql

Is it possible to order the results of a PostgreSQL query by a title field that contains characters like [](),; etc but do so ignoring these punctuation characters and sorting only by the text characters?
I've read articles on changing the database collation or locale but have not found any clear instructions on how to do this on an existing database and on a per-column basis. Is this even possible?

"Normalize" for sorting
You could use regexp_replace() with the pattern '[^a-zA-Z]' in the ORDER BY clause but that only recognizes pure ASCII letters. Better use the class shorthand '\W' which recognizes additional non-ASCII letters in your locale like äüóèß etc.
Or you could improvise and "normalize" all characters with diacritic elements to their base form with the help of the unaccent() function. Consider this little demo:
SELECT *
, regexp_replace(x, '[^a-zA-Z]', '', 'g')
, regexp_replace(x, '\W', '', 'g')
, regexp_replace(unaccent(x), '\W', '', 'g')
FROM (
SELECT 'XY ÖÜÄöüäĆČćč€ĞğīїıŁłŃńŇňŐőōŘřŠšŞşůŽžż‘´’„“”­–—[](),;.:̈� XY'::text AS x) t
->SQLfiddle for Postgres 9.2.
->SQLfiddle for Postgres 9.1.
Regular expression code has been updated in version 9.2. I am assuming this is the reason for the improved handling in 9.2 where all letter characters in the example are matched, while 9.1 only matches some.
unaccent() is provided by the additional module unaccent. Run:
CREATE EXTENSION unaccent;
once per database to use it (Postgres 9.1+; older versions use a different technique).
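Applied to the question, the normalized expression goes straight into ORDER BY. The following is only a minimal sketch with made-up sample rows: the inline VALUES list and the column alias title are assumptions standing in for the real table. Note that '\W' keeps digits and the underscore, and lower() is added here only to make the sort key case-insensitive.

```sql
-- Hypothetical sample rows; replace the VALUES list with your own table.
SELECT title
FROM  (
   VALUES ('[beta]'), ('(alpha)'), ('gamma;'), ('"delta"')
   ) AS t(title)
ORDER  BY lower(regexp_replace(title, '\W', '', 'g'))  -- sort key without punctuation
        , title;                                       -- tie-breaker on the raw value
```

The rows should come back as (alpha), [beta], "delta", gamma; because only the letters take part in the comparison. To fold accents as well, wrap the column in unaccent() as in the demo above.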
Locales / collation
Be aware that Postgres relies on the underlying operating system for locales (including collation). The sort order is governed by your chosen locale, or more specifically by LC_COLLATE. More in this related answer:
String sort order (LC_COLLATE and LC_CTYPE)
There are plans to incorporate collation support into Postgres directly, but that's not available at this time.
Many locales ignore the special characters you describe for sorting character data out of the box. If you have a locale installed in your system that provides the sort order you are looking for, you can use it ad-hoc in Postgres 9.1 or later:
SELECT foo FROM bar ORDER BY foo COLLATE "xy_XY"
To see which collations are installed and available in your current Postgres installation:
SELECT * FROM pg_collation;
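The catalog can be long, so it may help to filter it by name. As a sketch: the first query narrows the list to names starting with a language code ('de' is just an example), and the second uses the built-in "C" collation, which is always present, to show the ad-hoc COLLATE syntax without depending on any particular OS locale.

```sql
-- List only collations whose name starts with a given language code (example: German).
SELECT collname, collencoding
FROM   pg_collation
WHERE  collname LIKE 'de%'
ORDER  BY collname;

-- Ad-hoc collation per expression; "C" sorts by byte value, so upper case comes first.
SELECT x
FROM  (VALUES ('b'), ('A'), ('a'), ('B')) AS t(x)
ORDER  BY x COLLATE "C";   -- A, B, a, b
```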
Unfortunately it is not possible to define your own custom collation (yet) unless you hack the source code.
The collation rules are usually governed by the rules of a language as spoken in a country. The sort order telephone books would be in, if there were still telephone books ... Your operating system provides them.
For instance, in Debian Linux you can use:
locale -a
to display all generated locales. And:
dpkg-reconfigure locales
as root user (one way of several) to generate / install more.

If you want this ordering in one particular query, you can use:
ORDER BY regexp_replace(title, '[^a-zA-Z]', '', 'g')
This removes every character that is not an ASCII letter (a-z, A-Z) from the string and orders by the result.
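If this ORDER BY runs often on a large table, the same expression can be indexed, since regexp_replace() is an immutable function. A rough sketch, where the table name book and the index name are hypothetical:

```sql
-- Hypothetical table; only the expression matters.
CREATE TABLE book (
   book_id serial PRIMARY KEY
 , title   text NOT NULL
);

-- Index on exactly the expression used in ORDER BY.
CREATE INDEX book_title_letters_idx
    ON book (regexp_replace(title, '[^a-zA-Z]', '', 'g'));

SELECT title
FROM   book
ORDER  BY regexp_replace(title, '[^a-zA-Z]', '', 'g');
```

Whether the planner actually uses the index depends on table size and the rest of the query (a LIMIT makes it more likely), so check with EXPLAIN.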

Related

Should I save ASCII-only varchar in UTF-8 or ASCII?

I have a varchar column that contains only ASCII symbols. I don't need to sort by this field, but I need to search it by full equality.
Default locale is en.UTF8. Will I gain anything if I create this column with collate "C"?
Yes, it makes a difference.
Even if you do not sort deliberately, there are various operations requiring sort steps internally (some aggregate functions, DISTINCT, merge joins etc.).
Also, any index on the field has to sort values internally - and observe collation rules unless COLLATE "C" applies (no collation rules).
For searches by full equality you'll want an index - which works either way (for equality), but it's faster overall without collation rules. Depending on the details of your use case, the effect may be negligible or substantial. The impact grows with the length of your strings. I ran a benchmark on a related case some time ago:
Slow query ordering by a column in a joined table
Also, there are more pattern matching options with locale "C". The alternative would be to create an index with the special varchar_pattern_ops operator class.
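Both options can be sketched as follows. The table and index names are hypothetical, and per-column COLLATE requires Postgres 9.1 or later.

```sql
-- Hypothetical table: an ASCII-only column declared with the "C" collation.
CREATE TABLE token (
   token_id serial PRIMARY KEY
 , code     varchar(64) COLLATE "C" NOT NULL
);

-- Plain B-tree index: serves equality and, thanks to "C", also left-anchored LIKE.
CREATE INDEX token_code_idx ON token (code);

SELECT * FROM token WHERE code = 'ABC123';
SELECT * FROM token WHERE code LIKE 'ABC%';

-- Alternative if the column keeps the database's default (non-C) collation:
-- CREATE INDEX token_code_pattern_idx ON token (code varchar_pattern_ops);
```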
Related:
PostgreSQL LIKE query performance variations
Operator “~<~” uses varchar_pattern_ops index while normal ORDER BY clause doesn't?
Pattern matching with LIKE, SIMILAR TO or regular expressions in PostgreSQL
Postgres 9.5 introduced performance improvements with a technique called "abbreviated keys", which ran into problems with some locales. So it was deactivated, except for the C locale. Quoting the release notes of Postgres 9.5.2:
Disable abbreviated keys for string sorting in non-C locales (Robert Haas)
PostgreSQL 9.5 introduced logic for speeding up comparisons of string
data types by using the standard C library function strxfrm() as a
substitute for strcoll(). It now emerges that most versions of glibc
(Linux's implementation of the C library) have buggy implementations
of strxfrm() that, in some locales, can produce string comparison
results that do not match strcoll(). Until this problem can be better
characterized, disable the optimization in all non-C locales. (C
locale is safe since it uses neither strcoll() nor strxfrm().)
Unfortunately, this problem affects not only sorting but also entry
ordering in B-tree indexes, which means that B-tree indexes on text,
varchar, or char columns may now be corrupt if they sort according to
an affected locale and were built or modified under PostgreSQL 9.5.0
or 9.5.1. Users should REINDEX indexes that might be affected.
It is not possible at this time to give an exhaustive list of
known-affected locales. C locale is known safe, and there is no
evidence of trouble in English-based locales such as en_US, but some
other popular locales such as de_DE are affected in most glibc
versions.
The problem also illustrates where collation rules come in, generally.
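The REINDEX mentioned in the release notes can be run at several levels of granularity. A short sketch with placeholder object names (my_text_idx, my_table and my_db are assumptions, not names from the release notes):

```sql
-- Rebuild a single index ...
REINDEX INDEX my_text_idx;

-- ... or every index of one table ...
REINDEX TABLE my_table;

-- ... or the whole current database (the name must match the database you are connected to).
REINDEX DATABASE my_db;
```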

Comparing Text on PostgreSQL 8.4 and 9.1

I have two databases: one runs on PostgreSQL 8.4 and the other on PostgreSQL 9.1.
Both are on CentOS machines with the same locale (en_US).
Suppose I have a table with this data:
id | description
 1 | Morango
 2 | CAFÉ
 3 | pera
 4 | Uva
The odd thing is, when I run a query like this one:
SELECT * FROM products WHERE description ~* 'café'
on the 8.4 machine I get no results, but on the 9.1 machine I get the row (CAFÉ).
Apparently they differ in how they compare the upper-case Unicode character.
Could someone give me some insight into this problem?
Is it the different version of PostgreSQL that causes this?
Is there any additional configuration I could apply to make the two machines behave the same?
UPDATE: Both databases are UTF-8
Case-insensitive regex matching for non-ASCII characters in a UTF-8 database was basically not supported before 9.0.
See this snippet in the 9.0 release notes:
E.14.3.6. Functions
[...]
Support locale-specific regular expression processing with UTF-8
server encoding (Tom Lane)
Locale-specific regular expression functionality includes
case-insensitive matching and locale-specific character classes.
Previously, these features worked correctly for non-ASCII characters
only if the database used a single-byte server encoding (such as
LATIN1). They will still misbehave in multi-byte encodings other than
UTF-8.
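To compare the two servers directly, a small diagnostic query can be run on each. This is a sketch only: the lower() variant is a possible workaround for the older server, not a confirmed one, since it depends on lower() handling "É" correctly in that server's locale and encoding.

```sql
-- Run on both servers and compare the results.
SELECT version()
     , current_setting('server_encoding')  AS server_encoding
     , 'CAFÉ' ~* 'café'                    AS regex_ci_match   -- the operator from the question
     , lower('CAFÉ') ~ lower('café')       AS lowered_match;   -- possible workaround, verify on 8.4
```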

Do text_pattern_ops comparators understand UTF-8?

According to the PostgreSQL 9.2 documentation, if I am using a locale other than the C locale (en_US.UTF-8 in my case), btree indexes on text columns for supporting queries like
SELECT * from my_table WHERE text_col LIKE 'abcd%'
need to be created using text_pattern_ops like so
CREATE INDEX my_idx ON my_table (text_col text_pattern_ops)
Now section 11.9 of the documentation states that this results in a "character by character" comparison. Are these (non-wide) C characters or does the comparison understand UTF-8?
Good question, I'm not totally sure but my tentative understanding is:
Here PostgreSQL means "real characters" (possibly multibyte), not bytes. The comparison always "understands UTF-8", with or without this special index.
The point is that, for locales that have special (non-C) collation rules, we normally want to follow those rules (and call the respective locale libraries) when doing comparisons (<, >, ...) and sorting. But we don't want to use those collations for POSIX regular expression matching and LIKE patterns. Hence the existence of two different types of indexes for text.
The operators in the text_pattern_ops operator class actually do a memcmp() on the strings, so the documentation is perhaps slightly inaccurate talking about characters.
But this doesn't really affect the question whether they support UTF-8. The indexing of pattern matching operations in the described fashion does support UTF-8. The underlying operators don't have to worry about the encoding.
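The difference between the two kinds of comparison can be observed with the operators themselves. A minimal sketch, assuming a UTF-8 database; ~<~ is the "less than" operator of the text_pattern_ops class, and U&'\00E9' spells "é" unambiguously:

```sql
-- "é" is two bytes in UTF-8 (0xC3 0xA9); 'z' is the single byte 0x7A.
SELECT 'z'::text <   U&'\00E9'::text AS collation_less_than   -- uses the default collation
     , 'z'::text ~<~ U&'\00E9'::text AS bytewise_less_than;   -- compares the encoded bytes
```

With a linguistic locale such as en_US.UTF-8 the first column is typically false ("é" sorts near "e"), while the second is true because 0x7A is smaller than 0xC3. Byte-wise comparison of UTF-8 never splits a character in a way that breaks prefix matching, which is why the index works for LIKE 'abcd%'.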

PostgreSQL: wrong sorting on Ukrainian text

I have a table with all countries in three languages: English, Russian and Ukrainian. For the first two languages sorting is OK, but the Ukrainian country names are sorted incorrectly.
The two letters 'є' (8th position in the alphabet) and 'і' (12th position in the alphabet) are sorted into the first two places, while all following letters are sorted correctly.
How can I prevent this behaviour? DB encoding is UTF-8.
If you are on 9.1, you can add the collation to be used for sorting to your ORDER BY clause:
SELECT *
FROM your_table
ORDER BY your_column COLLATE "ua_UA"
The name of the collation depends on your operating system - not sure what the correct name for Ukraine would be. But I think you get the idea.
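To find the actual name, the catalog can be searched. The ISO 639-1 language code for Ukrainian is "uk", so on many systems the locale is called uk_UA rather than ua_UA; treat the name below as an assumption until the first query confirms it on your server.

```sql
-- Look for Ukrainian collations provided by the operating system.
SELECT collname
FROM   pg_collation
WHERE  collname ILIKE 'uk%';

-- Then use one of the names returned, for example (assuming "uk_UA" is listed):
SELECT *
FROM   your_table
ORDER  BY your_column COLLATE "uk_UA";
```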
You might also want to read this blog entry:
http://www.depesz.com/index.php/2011/03/04/waiting-for-9-1-per-column-collation-support/
UTF-8 doesn't know anything about "language". For alphabetical sort to make any sense to Postgres you need to set a locale. Your question doesn't mention locale at all so I'm guessing you're just sorting using whatever your default locale is (probably English or Russian).
If you are already using locales then I suggest providing details of your client / server locale settings as there may be a mistake there.

String sort order (LC_COLLATE and LC_CTYPE)

Apparently PostgreSQL allows different locales for each database since version 8.4
So I went to the docs to read about locales (http://www.postgresql.org/docs/8.4/static/locale.html).
String sort order is of my particular interest (I want strings sorted like 'A a b c D d' and not 'A B C ... Z a b c').
Question 1: Do I only need to set LC_COLLATE (String sort order) when I create a database?
I also read about LC_CTYPE (Character classification (What is a letter? Its upper-case equivalent?))
Question 2: Can someone explain what this means?
The sort order you describe is the standard in most locales.
Just try for yourself:
SELECT regexp_split_to_table('D d a A c b', ' ') ORDER BY 1;
When you initialize your db cluster with initdb you can pick a locale with --locale=some_locale. In my case it's --locale=de_AT.UTF-8. If you don't specify anything the locale is inherited from the environment - your current system locale will be used.
The template database of the cluster will be set to that locale. When you create a new database, it inherits the settings from the template. Normally you don't have to worry about anything, it all just works.
Read the chapter on CREATE DATABASE for more.
If you want to speed up text search with indexes, be sure to read about operator classes, as well.
All links to version 8.4, as you specifically asked for that.
In PostgreSQL 9.1 or later, there is collation support that allows more flexible use of collations:
The collation feature allows specifying the sort order and character
classification behavior of data per-column, or even per-operation.
This alleviates the restriction that the LC_COLLATE and LC_CTYPE
settings of a database cannot be changed after its creation.
Compared to other databases, PostgreSQL is a lot more stringent about case sensitivity. To avoid this when ordering, you can use string functions to make the comparison case-insensitive:
SELECT * FROM users ORDER BY LOWER(last_name), LOWER(first_name);
If you have a lot of data it will be inefficient doing this across a whole table every time you want to display a list of records. An alternative is to use the citext module, which provides a type that is internally case insensitive when doing comparisons.
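Two ways to reduce that cost, sketched with hypothetical index and table names: an expression index that matches the ORDER BY above, or a table that uses citext columns (CREATE EXTENSION requires Postgres 9.1 or later).

```sql
-- Expression index matching ORDER BY LOWER(last_name), LOWER(first_name).
CREATE INDEX users_lower_name_idx ON users (lower(last_name), lower(first_name));

-- Alternative: the citext type from the additional module of the same name.
CREATE EXTENSION IF NOT EXISTS citext;

-- Hypothetical table to show the type in use.
CREATE TABLE users_ci (
   user_id    serial PRIMARY KEY
 , first_name citext
 , last_name  citext
);

SELECT * FROM users_ci ORDER BY last_name, first_name;  -- compares case-insensitively
```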
Bonus:
You might run into this issue when searching, too. For that there is a case-insensitive pattern matching operator:
SELECT * FROM users WHERE first_name ILIKE '%john%';
Answer for question 1 (One)
The LC_COLLATE and LC_CTYPE settings are determined when a database is created, and cannot be changed except by creating a new database.
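As a sketch of what that looks like in practice: the first query shows the settings of the existing databases, and the second creates a new database with a different locale. The database name is hypothetical, and the locale must exist on the operating system.

```sql
-- Inspect the locale settings of existing databases.
SELECT datname, datcollate, datctype
FROM   pg_database;

-- Create a database with its own locale; template0 is needed
-- when the settings differ from those of the default template.
CREATE DATABASE mydb_de
   TEMPLATE   template0
   ENCODING   'UTF8'
   LC_COLLATE 'de_AT.UTF-8'
   LC_CTYPE   'de_AT.UTF-8';
```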