Problems with COLLATE in PostgreSQL 12 - postgresql

My problem:
I work in Windows 10 and my computer is set up for Portuguese (pt_BR);
I'm building a database in PostgreSQL where I need certain columns to remain in Portuguese, but others to be in en_US - namely, those storing numbers and currency. I need $ instead of R$ and 1,000.00 instead of 1.000,00.
I tried to create columns this way using a COLLATE clause, like this:
CREATE TABLE crm.TESTE (
prodserv_id varchar(30) NOT NULL,
prodserv_name varchar(140) NULL,
fk_prodservs_rep_acronym varchar(4) NULL,
prodserv_price numeric null collate "en_US",
CONSTRAINT pk_prodservs_prodserv_id PRIMARY KEY (prodserv_id)
);
But I get the error message:
SQL Error [42704]: ERROR: collation "en_US" for encoding "UTF8" does not exist
Database metadata shows Default Encoding: UTF8 and Collate Portuguese_Brazil.1252
It will be deployed at my ISP, which runs Linux.
Any suggestions would be greatly appreciated.
Thanks in advance.

A collation defines how strings are compared. It is not applicable to numerical data.
Moreover, PostgreSQL uses the operating system's collations, which causes problems when porting a database from Windows to other operating systems. The collation would be called English on Windows and en_US.utf8 on operating systems that use glibc.
To influence the formatting of numbers and currency symbols, set the lc_numeric and lc_monetary parameters appropriately (English on Windows, en_US elsewhere). Note that while lc_monetary affects the string representation of the money data type, these settings do not influence the string representation of numbers. You need to use to_char like this:
SELECT to_char(1000, '999G999G999D00 L');
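Putting the pieces together, a session could set the locale parameters and then format the price column from the question (a sketch; the exact locale names depend on the operating system, and the table/column names are taken from the question's example):

```sql
-- Use English number and currency formatting for this session only.
-- On Windows the locale would be called 'English' instead of 'en_US.utf8'.
SET lc_numeric = 'en_US.utf8';
SET lc_monetary = 'en_US.utf8';

-- G = group separator, D = decimal point, L = currency symbol,
-- all taken from lc_numeric / lc_monetary
SELECT to_char(prodserv_price, '999G999G999D00 L') AS formatted_price
FROM crm.teste;
```

The numeric column itself stays locale-agnostic; only the string produced by to_char depends on these settings.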

Related

Upper() doesn't uppercase accented characters

Within our Postgres 12 database, which uses the 'English_United States.utf8' collation, we have a dataset of Spanish data that includes accented characters. However, when we upper() the values in the field, unaccented characters are correctly uppercased, but accented characters are not.
upper('anuncio genérico para web del cliente') gives 'ANUNCIO GENéRICO PARA WEB DEL CLIENTE'
How can I correct this to get the expected result of 'ANUNCIO GENÉRICO PARA WEB DEL CLIENTE'?
I have tried forcing the string into the C and POSIX collations, but these handle plain ASCII only.
I have discovered the problem and a solution to it. My DB, as mentioned in the question, used the 'English_United States.utf8' collation as default, to match the UTF8 encoding. However, in a Windows environment the 'English_United Kingdom.1252' collation works with UTF8 encoding, despite being specified for ANSI, and returns the uppercased characters as expected.
To resolve the issue I had to create the collation in the DB using:
CREATE COLLATION "English_United Kingdom.1252" (LC_COLLATE='English_United Kingdom.1252', LC_CTYPE='English_United Kingdom.1252');
This places it in the public schema. You can then correct the issue in individual queries by adding collate "English_United Kingdom.1252" to any strings with accents, or run
alter table [table_name] alter column [column_name] type text collate "English_United Kingdom.1252"
against any columns that hold accented characters to fix the collation permanently. Unfortunately, there is no way to change the default collation of a database once it is created without a full backup, drop, and restore.
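On PostgreSQL 12 you can also sidestep the Windows locale names entirely by creating an ICU collation, provided the server was built with ICU support. This is a sketch; the collation name spanish_icu is just an illustrative choice:

```sql
-- Create a Unicode-aware collation backed by the ICU library
CREATE COLLATION spanish_icu (provider = icu, locale = 'es-ES');

-- upper() then uses ICU's Unicode case mapping, which should
-- uppercase accented characters correctly
SELECT upper('anuncio genérico para web del cliente' COLLATE "spanish_icu");
```

ICU locale names are platform-independent, so the same collation definition works unchanged on Windows and Linux.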

Heroku Postgres ignores underscores when sorting

This is driving me bonkers. My Heroku Postgres (9.5.18) DB seems to be ignoring underscores when sorting results:
Query:
SELECT category FROM categories ORDER BY category ASC;
Results:
category
-------------------
z_commercial_overlay
z_district
zr_use_group
zr_uses_footnote
z_special_district
This is new to me. I've never noticed another system where underscores are not respected in sorting, and this is the first time I've noticed Postgres behaving like this.
On my local OSX box (Postgres 10.5) the results are sorted the 'normal' expected way:
category
-------------------
z_commercial_overlay
z_district
z_special_district
zr_use_group
zr_uses_footnote
UPDATE:
Based on the comments, I was able to get the correct sorting by using COLLATE "C"
SELECT category FROM categories ORDER BY category COLLATE "C" ASC;
But I don't understand why this is necessary. BOTH of the Postgres instances show the same default collation value, and all of the table columns were created the same way, with no alternate collation specified.
SHOW lc_collate;
lc_collate
-------------
en_US.UTF-8
SHOW lc_ctype;
lc_ctype
-------------
en_US.UTF-8
So why does the Heroku Postgres DB require the COLLATE declaration?
I've never encountered another system where underscores are not respected in sorting
Really? Never used one, or just never paid attention to one?
On Ubuntu 16.04 (and every other modern system I've paid attention to), the system sort tool behaves the same way as long as you are using en_US.
LC_ALL= LANG=en_US.UTF-8 sort
(this produced the same order as the first one you show above)
On my local box (Postgres 10.5) the results are sorted the 'normal' expected way:
BOTH of the Postgres instances show the same collation value:
SHOW lc_collate;
lc_collate
en_US.UTF-8
That only shows the default collation for the database. The column could have been declared to use a different collation than the default:
create table categories(category text collate "C");
If your local database is supposed to be using en_US, and is not, then it is busted.
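The difference between the two orderings is easy to reproduce without any real table, using an inline VALUES list (a sketch using the category names from the question):

```sql
-- Byte-order ("C") sorting: '_' (0x5F) sorts before 'r' (0x72),
-- so all z_* rows come before zr_* rows
SELECT category
FROM (VALUES ('z_commercial_overlay'), ('z_district'),
             ('zr_use_group'), ('zr_uses_footnote'),
             ('z_special_district')) AS t(category)
ORDER BY category COLLATE "C";

-- Dictionary ("en_US") sorting effectively skips the underscores,
-- so zr_use_group sorts before z_special_district:
-- ORDER BY category COLLATE "en_US.utf8"  (glibc name; varies by OS)
```

Running both ORDER BY variants side by side makes it clear that the collation, not the data, explains the Heroku output.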

Invalid multibyte character for locale when attempting to create Trigrams index

I am using the musicbrainz database with PostgreSQL on Windows 10. I used these scripts to import it onto my own computer (this might be irrelevant). The encoding of the data is UTF-8.
Today I tried to create some indexes using Trigrams in order to improve the speed of my string-based searches, and this error appeared when I was creating the first one:
CREATE INDEX trgm_idx_release_group_name ON release_group USING gin (name gin_trgm_ops);
ERROR: invalid multibyte character for locale
HINT: The server's LC_CTYPE locale is probably incompatible with the
database encoding.
SQL state: 22021
The LC_CTYPE is Spanish_Spain.1252, which seems to be the issue, as it is not UTF-8.
The solution, then, is to create my DB with a locale compatible with UTF-8, but I cannot find such a locale, as the names differ from the Linux locale names. I have tried creating the database from pgAdmin, but the locale list only offers C, POSIX and Spanish_Spain.1252. I tried creating it with createdb like this:
createdb -U palp example-db -l=Spanish_Spain
but got the same locale as before, with the -l argument stored as the database comment. (I tried the command with different locale values and different quotation marks, but got the same result.)
Am I doing this wrong? Is the locale name wrong and thus, that is the problem?
Thank you.

Install utf8 collation in PostgreSQL

Right now I can choose Encoding : UTF8 when creating a new DB in pgAdmin4 GUI.
But there is no option to choose utf8_general_ci as the collation or character type, and when I do select * from pg_collation; I don't see any collation resembling utf8_general_ci.
Coming from a MySQL background, I am confused. Do I have to install a utf8-style collation (e.g. utf8_general_ci, utf8_unicode_ci) in my PostgreSQL 10 or in Windows 10?
I just want the PostgreSQL equivalent of MySQL's utf8_general_ci collation.
Thank you
utf8 is an encoding (how to represent Unicode characters as a series of bytes), not a collation (which character sorts before which).
I think the PostgreSQL 10 collation equivalent of utf8_general_ci (or the more modern utf8_unicode_ci) is called und-x-icu: an "undefined" collation (not tied to any real-world language) provided by the ICU library. This collation sorts characters from most languages quite reasonably.
ICU support is a new feature added in PostgreSQL 10, so this collation isn't available in older PostgreSQL versions, or when ICU was disabled during compilation. Before that, Postgres used the operating system's collation support, which differs between operating systems.
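On PostgreSQL 10 or later you can list the ICU collations the server ships with by querying the catalog; 'i' marks the ICU provider:

```sql
-- All collations supplied by the ICU library
SELECT collname
FROM pg_collation
WHERE collprovider = 'i';
```

If und-x-icu appears in the result, it can be used directly in COLLATE clauses or column definitions.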

How do I know if my PostgreSQL server is using the "C" locale?

I'm trying to optimize my PostgreSQL 8.3 DB tables to the best of my ability, and I'm unsure if I need to use varchar_pattern_ops for certain columns where I'm performing a LIKE against the first N characters of a string. According to this documentation, the use of xxx_pattern_ops is only necessary "...when the server does not use the standard 'C' locale".
Can someone explain what this means? How do I check what locale my database is using?
Locale support is largely fixed at initdb time. The setting relevant to _pattern_ops is lc_collate; you can inspect its value with the SHOW command (it cannot be changed at runtime).
For example:
SHOW lc_collate;
_pattern_ops indexes are useful for columns queried with pattern-matching constructs like LIKE or regular expressions. You still need a regular index (without _pattern_ops) to support the ordinary comparison operators (<, <=, >, >=) and ORDER BY on the column; plain equality, however, can use the _pattern_ops index. Take all of this into consideration when deciding whether you need such indexes on your tables.
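In practice this often means two indexes on the same column, a sketch with illustrative table and index names:

```sql
-- Serves pattern-matching queries such as WHERE name LIKE 'abc%'
-- even when the database locale is not "C"
CREATE INDEX idx_users_name_pattern ON users (name varchar_pattern_ops);

-- Serves ordinary comparison and ORDER BY queries
-- under the database's default collation
CREATE INDEX idx_users_name ON users (name);
```

The planner picks whichever index matches the operators in a given query.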
As for what a locale is: it's a set of rules about character ordering, formatting and similar things that vary from one language/country to another. For instance, the locale fr_CA (French in Canada) may have different sorting rules (or ways of displaying numbers, and so on) than en_CA (English in Canada). The standard "C" locale is the POSIX standards-compliant default locale. Only strict ASCII characters are valid, and the ordering and formatting rules are mostly those of en_US (US English).
In computing, a locale is a set of parameters that defines the user's language, country and any special variant preferences that the user wants to see in their user interface. Usually a locale identifier consists of at least a language identifier and a region identifier.
Another way, according to the handbook:
psql -l
Example output:
List of databases
Name | Owner | Encoding | Collate | Ctype | Access privileges
-------------+--------+----------+-------------+-------------+-------------------
packrd | packrd | UTF8 | en_US.UTF-8 | en_US.UTF-8 |
postgres | packrd | UTF8 | en_US.UTF-8 | en_US.UTF-8 |
template0 | packrd | UTF8 | en_US.UTF-8 | en_US.UTF-8 | =c/packrd +
| | | | | packrd=CTc/packrd
template1 | packrd | UTF8 | en_US.UTF-8 | en_US.UTF-8 | =c/packrd +
| | | | | packrd=CTc/packrd
(5 rows)
OK, from my perusings, it appears that this initial setting
initdb --locale=xxx
--locale=locale
Specifies the locale to be used in this database. This is equivalent to specifying both --lc-collate and --lc-ctype.
basically specifies the "default" locale for all databases you create after that (i.e. it specifies the settings for template1, which is the default template). Locale is different from encoding; you can manually specify either or both when creating a new database:
CREATE DATABASE korean WITH ENCODING 'EUC_KR' LC_COLLATE='ko_KR.euckr' LC_CTYPE='ko_KR.euckr' TEMPLATE=template0;
Basically if you don't specify it, it uses the system default, which is almost never "C".
So if SHOW lc_collate returns anything other than "C" or "POSIX", you are not using the standard C locale, and you will need to specify xxx_pattern_ops for your indexes. Note also the caveat that if you want to use the <, <=, >, or >= operators, you need a second index without the xxx_pattern_ops flag (unless your database uses the standard C locale, which is rare). For just = and LIKE you don't need the second index; and if you don't use LIKE at all, you may not need the xxx_pattern_ops index either.
Even if your indexes are defined to collate with the "default" like
CREATE INDEX my_index_name
ON table_name
USING btree
(identifier COLLATE pg_catalog."default");
this is not enough: unless the default collation is "C" (or POSIX, which is the same thing), the index can't be used for patterns like LIKE 'ABC%'. You need something like this:
CREATE INDEX my_index_name
ON table_name
USING btree
(identifier COLLATE pg_catalog."default" varchar_pattern_ops);
If you've got the option...
You could recreate the database cluster with the C locale.
You need to pass the locale to initdb when initializing your Postgres instance.
You can do this regardless of what the server's default or user's locale is.
That's a server administration command, though, not a schema designer's task. The cluster contains all the databases on the server, not just the one you're optimising.
It creates a brand new cluster, and does not migrate any of your existing databases or data. That'd be additional work.
Furthermore, if you're in a position where you can consider creating a new cluster as an option, you really should be considering using PostgreSQL 8.4 instead, which can have per-database locales, specified in the CREATE DATABASE statement.
There is also another way (assuming you want to check the settings, not modify them): look at the file /var/lib/postgres/data/postgresql.conf. The following lines should be present:
# These settings are initialized by initdb, but they can be changed.
lc_messages = 'en_US.UTF-8' # locale for system error message strings
lc_monetary = 'en_US.UTF-8' # locale for monetary formatting
lc_numeric = 'en_US.UTF-8' # locale for number formatting
lc_time = 'en_US.UTF-8' # locale for time formatting