PostgreSQL: wrong sorting on Ukrainian text

I have a table with all countries in three languages: English, Russian and Ukrainian. Sorting works for the first two languages, but not for the Ukrainian country names.
Names starting with the two letters 'є' (8th position in the alphabet) and 'і' (12th position in the alphabet) come first, and all the following letters are sorted correctly.
How can I prevent this behaviour? The DB encoding is UTF-8.

If you are on 9.1 or later, you can add the collation to be used for sorting to your ORDER BY clause:
SELECT *
FROM your_table
ORDER BY your_column COLLATE "uk_UA";
The name of the collation depends on your operating system. On Linux the Ukrainian locale is usually called uk_UA (uk is the language code, UA the country code), but check which names your installation actually provides. I think you get the idea.
You might also want to read this blog entry:
http://www.depesz.com/index.php/2011/03/04/waiting-for-9-1-per-column-collation-support/

UTF-8 doesn't know anything about "language". For alphabetical sort to make any sense to Postgres you need to set a locale. Your question doesn't mention locale at all so I'm guessing you're just sorting using whatever your default locale is (probably English or Russian).
If you are already using locales then I suggest providing details of your client / server locale settings as there may be a mistake there.
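To illustrate the point outside the database: a plain code-point sort knows nothing about the Ukrainian alphabet, which is why names starting with the capital letters 'Є' and 'І' jump to the front. The Python sketch below is not how Postgres implements collation. The country list and the helper ukrainian_key are made up for this illustration, and the hand-written alphabet string stands in for what a real locale would provide.

```python
# Illustration only: code-point order versus Ukrainian alphabetical order.
# UKRAINIAN_ALPHABET and ukrainian_key are hypothetical helpers, not a
# Postgres or locale API.
UKRAINIAN_ALPHABET = "абвгґдеєжзиіїйклмнопрстуфхцчшщьюя"
RANK = {letter: position for position, letter in enumerate(UKRAINIAN_ALPHABET)}


def ukrainian_key(word):
    """Map each letter to its position in the alphabet; unknown characters sort last."""
    return [RANK.get(letter, len(RANK)) for letter in word.lower()]


countries = ["Австрія", "Єгипет", "Індія", "Бельгія", "Ємен", "Італія"]

# Default string comparison uses Unicode code points, where 'Є' and 'І'
# happen to sit before 'А'.
by_code_point = sorted(countries)

# Sorting with an alphabet-aware key puts them at positions 8 and 12.
by_alphabet = sorted(countries, key=ukrainian_key)

print(by_code_point)
print(by_alphabet)
```

In the first list the names beginning with 'Є' and 'І' come before 'Австрія', which matches the symptom described in the question. In the second they follow 'Бельгія'.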

Related

PostgreSQL SELECT can alter a table?

I'm new to SQL-like databases, and the place where I work migrated to PostgreSQL. One table lost a large part of its contents. The point is, I only used SELECT statements and renamed columns with AS. Is there any way I might have changed the table data?
When you migrate from one DBMS to another, you must be sure that the objects created are strictly equivalent. The question seems trivial, but it isn't.
As a matter of fact, one important consideration for literals (char/varchar...) is to compare the collation used formerly with the collation you used to create the new database in PostgreSQL.
Collation in an RDBMS is the way to adjust the behavior of character strings with regard to certain parameters, such as whether upper and lower case letters are distinguished, whether diacritical characters (accents, ligatures...) are distinguished, language-specific sorting, etc. It is layered on top of the character encoding.
Did you verify this point when using a WHERE clause to search for literals? If not, try comparing the literal with the right collation applied (COLLATE operator), or use the UPPER function to avoid distinguishing between upper and lower case characters.

Set Order By to ignore punctuation on a per-column basis

Is it possible to order the results of a PostgreSQL query by a title field that contains characters like [](),; etc but do so ignoring these punctuation characters and sorting only by the text characters?
I've read articles on changing the database collation or locale but have not found any clear instructions on how to do this on an existing database and on a per-column basis. Is this even possible?
"Normalize" for sorting
You could use regexp_replace() with the pattern '[^a-zA-Z]' in the ORDER BY clause, but that only keeps pure ASCII letters. Better to use the class shorthand '\W', which matches non-word characters, so that additional non-ASCII letters in your locale like äüóèß are kept.
Or you could improvise and "normalize" all characters with diacritic elements to their base form with the help of the unaccent() function. Consider this little demo:
SELECT *
, regexp_replace(x, '[^a-zA-Z]', '', 'g')
, regexp_replace(x, '\W', '', 'g')
, regexp_replace(unaccent(x), '\W', '', 'g')
FROM (
SELECT 'XY ÖÜÄöüäĆČćč€ĞğīїıŁłŃńŇňŐőōŘřŠšŞşůŽžż‘´’„“”­–—[](),;.:̈� XY'::text AS x) t
->SQLfiddle for Postgres 9.2.
->SQLfiddle for Postgres 9.1.
Regular expression code has been updated in version 9.2. I am assuming this is the reason for the improved handling in 9.2 where all letter characters in the example are matched, while 9.1 only matches some.
unaccent() is provided by the additional module unaccent. Run:
CREATE EXTENSION unaccent;
once per database to use it (Postgres 9.1+; older versions use a different technique).
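The same normalize-then-sort idea can be sketched outside the database. This is a Python illustration, not part of the Postgres demo above: strip_accents and sort_key are hypothetical helper names, Unicode decomposition only roughly stands in for unaccent() (it does not cover every character that unaccent() maps), and the sample titles are invented.

```python
import re
import unicodedata


def strip_accents(text):
    """Drop combining marks after decomposition (rough stand-in for unaccent())."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def sort_key(title):
    """Remove accents, then remove every non-word character, like '\\W' in the demo."""
    return re.sub(r"\W", "", strip_accents(title))


titles = ["[zebra]", "(apple)", "éclair;", "banana"]

# Punctuation and accents decide the order when comparing raw strings.
raw_order = sorted(titles)

# With the normalized key only the letters matter; the stored titles are unchanged.
normalized_order = sorted(titles, key=sort_key)

print(raw_order)
print(normalized_order)
```

The key is used only for ordering, so the titles themselves are returned untouched, just as the ORDER BY expression leaves the column data alone.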
locales / collation
You must be aware that Postgres relies on the underlying operating system for locales (including collation). The sort order is governed by your chosen locale, or more specifically by LC_COLLATE. More in this related answer:
String sort order (LC_COLLATE and LC_CTYPE)
There are plans to incorporate collation support into Postgres directly, but that's not available at this time.
Many locales ignore the special characters you describe for sorting character data out of the box. If you have a locale installed in your system that provides the sort order you are looking for, you can use it ad-hoc in Postgres 9.1 or later:
SELECT foo FROM bar ORDER BY foo COLLATE "xy_XY"
To see which collations are installed and available in your current Postgres installation:
SELECT * FROM pg_collation;
Unfortunately it is not possible to define your own custom collation (yet) unless you hack the source code.
The collation rules are usually governed by the rules of a language as spoken in a country. The sort order telephone books would be in, if there were still telephone books ... Your operating system provides them.
For instance, in Debian Linux you can use:
locale -a
to display all generated locales. And:
dpkg-reconfigure locales
as root user (one way of several) to generate / install more.
If you want to have this ordering in one particular query, you can use:
ORDER BY regexp_replace(title, '[^a-zA-Z]', '', 'g')
This deletes every character that is not an ASCII letter (a-z, A-Z) from the string and orders by the result.

Storing Unicode characters in PostgreSQL 8.4 table

I want to store Unicode characters in one of the columns of a PostgreSQL 8.4 database table. The data is non-English, for example Indic language text. I have achieved the same in Oracle XE by converting the text into Unicode and storing it in the table using the nvarchar2 column data type.
In the same way I want to store Unicode characters of Indic languages (say Tamil, Hindi) in one of the columns of a table. How can I achieve that, and what data type should I use?
Please guide me, thanks in advance.
Just make sure the database is initialized with encoding UTF8. In 8.4 this applies to the whole database; later versions are more sophisticated. You might want to check the locale settings too. See the manual for details, particularly around matching with LIKE and the text_pattern_ops operator class.

String sort order (LC_COLLATE and LC_CTYPE)

Apparently PostgreSQL allows different locales for each database since version 8.4
So I went to the docs to read about locales (http://www.postgresql.org/docs/8.4/static/locale.html).
String sort order is of my particular interest (I want strings sorted like 'A a b c D d' and not 'A B C ... Z a b c').
Question 1: Do I only need to set LC_COLLATE (String sort order) when I create a database?
I also read about LC_CTYPE (Character classification (What is a letter? Its upper-case equivalent?))
Question 2: Can someone explain what this means?
The sort order you describe is the standard in most locales.
Just try for yourself:
SELECT regexp_split_to_table('D d a A c b', ' ') ORDER BY 1;
When you initialize your db cluster with initdb you can pick a locale with --locale=some_locale. In my case it's --locale=de_AT.UTF-8. If you don't specify anything, the locale is inherited from the environment: your current system locale will be used.
The template database of the cluster will be set to that locale. When you create a new database, it inherits the settings from the template. Normally you don't have to worry about anything, it all just works.
Read the chapter on CREATE DATABASE for more.
If you want to speed up text search with indexes, be sure to read about operator classes, as well.
All links point to version 8.4, as you specifically asked for that.
In PostgreSQL 9.1 or later, there is collation support that allows more flexible use of collations:
The collation feature allows specifying the sort order and character
classification behavior of data per-column, or even per-operation.
This alleviates the restriction that the LC_COLLATE and LC_CTYPE
settings of a database cannot be changed after its creation.
Compared to other databases, PostgreSQL is a lot more stringent about case sensitivity. To avoid this when ordering, you can use string functions to make the comparison case insensitive:
SELECT * FROM users ORDER BY LOWER(last_name), LOWER(first_name);
If you have a lot of data it will be inefficient doing this across a whole table every time you want to display a list of records. An alternative is to use the citext module, which provides a type that is internally case insensitive when doing comparisons.
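As a rough illustration of what ordering by LOWER(...) achieves, here is a Python sketch (not Postgres code). It reuses the sample letters 'D d a A c b' from earlier in this thread; the variable names are made up for the example.

```python
letters = "D d a A c b".split()

# Plain comparison orders by code point, so all upper-case letters come first.
case_sensitive = sorted(letters)

# Comparing lower-cased keys mirrors ORDER BY LOWER(col). Only the sort key
# is lower-cased; the values keep their original spelling.
case_insensitive = sorted(letters, key=str.lower)

print(case_sensitive)    # ['A', 'D', 'a', 'b', 'c', 'd']
print(case_insensitive)  # ['a', 'A', 'b', 'c', 'D', 'd']
```

Python keeps values with equal keys in input order. In SQL the relative order of rows that compare equal is not guaranteed unless you add a further sort column.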
Bonus:
You might run into this issue when searching too. For that there is a case-insensitive pattern matching operator:
SELECT * FROM users WHERE first_name ILIKE '%john%';
Answer for question 1 (One)
The LC_COLLATE and LC_CTYPE settings are determined when a database is created, and cannot be changed except by creating a new database.

Anyone had success using a specific locale for a PostgreSQL database so that text comparison is case-insensitive? [duplicate]

I'm developing an app in Rails on OS X using PostgreSQL 8.4. I need to setup the database for the app so that standard text queries are case-insensitive. For example:
SELECT * FROM documents WHERE title = 'incredible document'
should return the same result as:
SELECT * FROM documents WHERE title = 'Incredible Document'
Just to be clear, I don't want to use:
(1) LIKE in the where clause or any other type of special comparison operators
(2) citext for the column datatype or any other special column index
(3) any type of full-text software like Sphinx
What I do want is to set the database locale to support case-insensitive text comparison. I'm on Mac OS X (10.5 Leopard) and have already tried setting the Encoding to "LATIN1", with the Collation and Ctype both set to "en_US.ISO8859-1". No success so far.
Any help or suggestions are greatly appreciated.
Thanks!
Update
I have marked one of the answers given as the correct answer out of respect for the folks who responded. However, I've chosen to solve this issue differently than suggested. After further review of the application, there are only a few instances where I need case-insensitive comparison against a database field, so I'll be creating shadow database fields for the ones I need to compare case-insensitively. For example, name and name_lower. I believe I came across this solution on the web somewhere. Hopefully PostgreSQL will allow similar collation options to what SQL Server provides in the future (i.e. DOCI).
Special thanks to all who responded.
You will likely need to apply a function to the column to convert your text, e.g. to uppercase. An example:
SELECT * FROM documents WHERE upper(title) = upper('incredible document')
Note that this can hurt performance for queries that previously used an index scan on the column. If that becomes a problem, you can define an index on the function expression for the target column, e.g.
CREATE INDEX I1 on documents (upper(title))
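The idea behind the index on upper(title) can be sketched in Python as a lookup table that is keyed by the upper-cased title and built once. This is only an analogy, not how Postgres stores indexes; the sample documents, title_index and find_ids are hypothetical names invented for the illustration.

```python
from collections import defaultdict

documents = [
    {"id": 1, "title": "Incredible Document"},
    {"id": 2, "title": "incredible document"},
    {"id": 3, "title": "Another Document"},
]

# Build the lookup once, keyed by the upper-cased title. This plays the
# role of the index on upper(title).
title_index = defaultdict(list)
for doc in documents:
    title_index[doc["title"].upper()].append(doc["id"])


def find_ids(title):
    """Return ids of documents whose title matches, ignoring case."""
    # Upper-case the search term too, like upper('incredible document').
    return title_index.get(title.upper(), [])


print(find_ids("INCREDIBLE document"))  # [1, 2]
```

As with the SQL version, the lookup only helps when the query applies the same function to the search term that was used to build the index.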
With all the limitations you have set, possibly the only way to make it work is to define your own = operator for text. It is very likely that it will create other problems, such as creating broken indexes. Other than that, your best bet seems to be to use the citext datatype; that would still let the ORM stuff you're using generate the SQL.
(I am not mentioning the possibility of creating your own locale definition because I haven't ever heard of anyone doing it.)
Your problem and your exclusions are like saying "I want to swim, but I don't want to have to move my arms."
You will drown trying.
I don't think that is what locale or encoding is used for. Encoding is more for picking a character set, not for determining how to deal with characters. If there were a setting it would be in the config, but I haven't seen one.
If you do not want to use ILIKE for fear of not being able to port to another database, then I would suggest you look into what ORM options might be available with ActiveRecord, if you are using that.
Here is something from one of the top Postgres guys: http://archives.postgresql.org/pgsql-php/2003-05/msg00045.php
edit: fixed specific references to locale.
SELECT * FROM documents WHERE title ~* 'incredible document'