Invalid multibyte character for locale when attempting to create Trigrams index - postgresql

I am using the MusicBrainz database with PostgreSQL on Windows 10. I used these scripts to import it onto my own computer (this might be irrelevant). The encoding of the data is UTF-8.
Today I tried to create some trigram indexes to speed up my string-based searches, and this error appeared while I was creating the first one:
CREATE INDEX trgm_idx_release_group_name ON release_group USING gin (name gin_trgm_ops);
ERROR: invalid multibyte character for locale
HINT: The server's LC_CTYPE locale is probably incompatible with the
database encoding.
SQL state: 22021
The LC_CTYPE is Spanish_Spain.1252, which seems to be the issue, as it is not UTF-8.
The solution, then, is to create my DB with a locale compatible with UTF-8, but I cannot find such a locale, as the names are different from the Linux ones. I have tried creating the database from pgAdmin, but in the locale list I only find C, POSIX, and Spanish_Spain.1252. I tried creating it with createdb like this:
createdb -U palp example-db -l=Spanish_Spain
but I got the same locale as before, and the -l argument ended up as the database comment. (I tried the command with different locale values and different quotation marks, but got the same result.)
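For reference, createdb's short option takes its value after a space (-l LOCALE), and only the long form uses an equals sign, so -l=Spanish_Spain may not be parsed the way it looks. The long-option form would be the following; the locale name here is just the one I already have, since a UTF-8-compatible Windows name is exactly what I am missing:
createdb -U palp --encoding=UTF8 --template=template0 --locale="Spanish_Spain.1252" example-db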
Am I doing this wrong? Is the locale name itself wrong, and is that the problem?
Thank you.

Related

Upper() doesn't uppercase accented characters

Within our Postgres 12 database, which uses the 'English_United States.utf8' collation, we have a dataset of Spanish data that includes accented characters. However, when we upper() the values in the field, unaccented characters are correctly uppercased, but accented characters are not.
upper('anuncio genérico para web del cliente') gives 'ANUNCIO GENéRICO PARA WEB DEL CLIENTE'
How can I correct this to get the expected result of 'ANUNCIO GENÉRICO PARA WEB DEL CLIENTE'?
I have tried forcing the string into the C and POSIX collations, but these handle ASCII only.
I have discovered the problem and a solution to it. My DB, as mentioned in the question, used the 'English_United States.utf8' collation as its default, to match the UTF8 encoding. However, in a Windows environment the 'English_United Kingdom.1252' collation works with UTF8 encoding, despite being specified for ANSI, and returns the uppercase characters as expected.
To resolve the issue I had to create the collation in the DB using:
CREATE COLLATION "English_United Kingdom.1252" (LC_COLLATE='English_United Kingdom.1252', LC_CTYPE='English_United Kingdom.1252');
This places it in the public schema. You can then manually correct the issue in individual queries by applying COLLATE "English_United Kingdom.1252" to any strings with accents, or run
alter table [table_name] alter column [column_name] type text collate "English_United Kingdom.1252"
against any columns that hold accented characters to fix the collation permanently. Unfortunately, there is no way to change the default collation of a DB once it is created without doing a full backup, drop, and restore.
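For example, a one-off query applying the collation inline to the string from the question (assuming the CREATE COLLATION above has been run) looks like this:
select upper('anuncio genérico para web del cliente' collate "English_United Kingdom.1252");
-- returns 'ANUNCIO GENÉRICO PARA WEB DEL CLIENTE'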

Problems with COLLATE in PostgreSQL 12

My problem:
I work on Windows 10 and my computer is set up for Portuguese (pt_BR);
I'm building a database in PostgreSQL where I need certain columns to remain in Portuguese, but others to be in en_US - namely, those storing numbers and currency. I need $ instead of R$ and 1,000.00 instead of 1.000,00.
I tried to create columns this way using a COLLATE clause:
CREATE TABLE crm.TESTE (
prodserv_id varchar(30) NOT NULL,
prodserv_name varchar(140) NULL,
fk_prodservs_rep_acronym varchar(4) NULL,
prodserv_price numeric null collate "en_US",
CONSTRAINT pk_prodservs_prodserv_id PRIMARY KEY (prodserv_id)
);
But I get the error message:
SQL Error [42704]: ERROR: collation "en_US" for encoding "UTF8" does not exist
Database metadata shows Default Encoding: UTF8 and Collate Portuguese_Brazil.1252
It will be deployed at my ISP, which runs Linux.
Any suggestions would be greatly appreciated.
Thanks in advance.
A collation defines how strings are compared. It is not applicable to numerical data.
Moreover, PostgreSQL uses the operating system's collations, which causes problems when porting a database from Windows to other operating systems. The collation would be called English on Windows and en_US.utf8 on operating systems that use glibc.
To influence the formatting of numbers and currency symbols, set the lc_numeric and lc_monetary parameters appropriately (English on Windows, en_US elsewhere). Note that while lc_monetary affects the string representation of the money data type, these settings do not influence the string representation of numbers. You need to use to_char like this:
to_char(1000, '999G999G999D00 L');
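Put together, a session on a glibc system might look like the following (on Windows the parameter values would be English instead; the exact output depends on the locale data installed):
SET lc_numeric = 'en_US.utf8';  -- controls the G (group) and D (decimal point) placeholders
SET lc_monetary = 'en_US.utf8'; -- controls the L (currency symbol) placeholder
SELECT to_char(1000, '999G999G999D00 L');
-- '      1,000.00 $'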

Heroku Postgres ignores underscores when sorting

This is driving me bonkers. My Heroku Postgres (9.5.18) DB seems to be ignoring underscores when sorting results:
Query:
SELECT category FROM categories ORDER BY category ASC;
Results:
category
-------------------
z_commercial_overlay
z_district
zr_use_group
zr_uses_footnote
z_special_district
This is new to me. I've never noticed another system where underscores are not respected in sorting, and this is the first time I've seen Postgres behave this way.
On my local OSX box (Postgres 10.5) the results are sorted the 'normal' expected way:
category
-------------------
z_commercial_overlay
z_district
z_special_district
zr_use_group
zr_uses_footnote
UPDATE:
Based on the comments, I was able to get the correct sorting by using COLLATE "C":
SELECT category FROM categories ORDER BY category COLLATE "C" ASC;
But I don't understand why is this necessary. BOTH of the Postgres instances show the same default collation value, and all of the table columns were created the same way, with no alternate collation specified.
SHOW lc_collate;
lc_collate
-------------
en_US.UTF-8
SHOW lc_ctype;
lc_ctype
-------------
en_US.UTF-8
So why does the Heroku Postgres DB require the COLLATE declaration?
I've never encountered another system where underscores are not respected in sorting
Really? Never used one, or just never paid attention to one?
On Ubuntu 16.04 (and every other modern system I've paid attention to), the system sort tool behaves the same way as long as you are using en_US.
LC_ALL= LANG=en_US.UTF-8 sort
(this produced the same order as the first one you show above)
On my local box (Postgres 10.5) the results are sorted the 'normal' expected way:
BOTH of the Postgres instances show the same collation value:
SHOW lc_collate;
lc_collate
en_US.UTF-8
That only shows the default collation for the database. The column could have been declared to use a different collation than the default:
create table categories(category text collate "C");
If your local database is supposed to be using en_US, and is not, then it is busted.
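One way to check for a per-column override (table name taken from the question) is information_schema, where collation_name is null for columns that simply follow the database default:
SELECT column_name, collation_name
FROM information_schema.columns
WHERE table_name = 'categories';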

Install utf8 collation in PostgreSQL

Right now I can choose Encoding: UTF8 when creating a new DB in the pgAdmin4 GUI.
But there is no option to choose utf8_general_ci as the collation or character type. When I run select * from pg_collation; I don't see any collation resembling utf8_general_ci.
Coming from a MySQL background, I am confused. Do I have to install a utf8-like collation (e.g. utf8_general_ci, utf8_unicode_ci) in PostgreSQL 10 or in Windows 10?
I just want the PostgreSQL equivalent of the MySQL collation utf8_general_ci.
Thank you
utf8 is an encoding (how to represent Unicode characters as a series of bytes), not a collation (which character sorts before which).
I think the Postgres 10 equivalent of utf8_general_ci (or the more modern utf8_unicode_ci) is called und-x-icu - an undetermined-language collation (not tied to any particular real-world language) provided by the ICU library. It sorts characters from most languages quite reasonably.
ICU support is a new feature in PostgreSQL 10, so this collation isn't available in older PostgreSQL versions, or when ICU support is disabled at compile time. Before that, Postgres relied on the collation support provided by the operating system, which differs between operating systems.
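If your build does have ICU enabled, the collation can be attached per column or per query; a minimal sketch (table and column names invented):
CREATE TABLE demo (name text COLLATE "und-x-icu");
SELECT name FROM demo ORDER BY name COLLATE "und-x-icu";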

How to get and change encoding schema for a DB2 z/OS database using dynamic SQL statement

A DB2 for z/OS database has been set up for me. Now I want to find out the encoding scheme of the database and change it to Unicode if it currently uses another encoding.
How can I do this? Can I do this using dynamic SQL statements in my Java application?
Thanks!
You need to specify that the encoding scheme is UNICODE when you create your table (and database and table space) by using the CCSID UNICODE clause.
According to the documentation:
By default, the encoding scheme of a table is the same as the encoding scheme of its table space. Also by default, the encoding scheme of the table space is the same as the encoding scheme of its database. You can override the encoding scheme with the CCSID clause in the CREATE TABLESPACE or CREATE TABLE statement. However, all tables within a table space must have the same CCSID.
For more, see Creating a Unicode Table in the DB2 for z/OS documentation.
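A minimal sketch of the clause in action (schema, table, and column names invented):
CREATE TABLE MYSCHEMA.CUSTOMER (
  ID INTEGER NOT NULL,
  NAME VARCHAR(100)
) CCSID UNICODE;
To inspect the current encoding scheme, you can query the catalog; assuming a standard DB2 for z/OS catalog, SYSIBM.SYSDATABASE carries an ENCODING_SCHEME column ('U' = Unicode, 'E' = EBCDIC, 'A' = ASCII):
SELECT NAME, ENCODING_SCHEME FROM SYSIBM.SYSDATABASE;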
You are able to create tables via Java/JDBC, but I doubt that you will be able to create databases and table spaces that way. I wouldn't recommend it anyway; I would find your closest z/OS DBA and get that person to help you.