Upper() doesn't uppercase accented characters - postgresql

In our Postgres 12 database, which uses the united_states.utf8 collation, we have a dataset of Spanish data that includes accented characters. However, when we upper() the values in the field, unaccented characters are correctly uppercased, but accented characters are not.
upper('anuncio genérico para web del cliente') gives 'ANUNCIO GENéRICO PARA WEB DEL CLIENTE'
How can I correct this to get the expected result of 'ANUNCIO GENÉRICO PARA WEB DEL CLIENTE'?
I have tried forcing the string into the C and POSIX collations, but these handle ASCII only.

I have discovered the problem and a solution to it. My DB, as mentioned in the question, used 'English_United States.utf8' as its default collation, to match the UTF8 encoding. However, in a Windows environment the 'English_United Kingdom.1252' collation works with the UTF8 encoding, despite being specified for ANSI, and returns the uppercase characters as expected.
To resolve the issue I had to create the collation in the DB using:
CREATE COLLATION "English_United Kingdom.1252" (LC_COLLATE='English_United Kingdom.1252', LC_CTYPE='English_United Kingdom.1252');
This places the collation in the public schema. You can then correct the issue manually in queries by applying collate "English_United Kingdom.1252" to any strings with accents, or use
alter table [table_name] alter column [column_name] type text collate "English_United Kingdom.1252"
on any columns that contain accented characters to fix the collation permanently. Unfortunately there is no way to change the default collation of a DB once it is created without doing a full backup, drop, and restore.
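To make the symptom concrete, here is a minimal Python sketch of the difference between a case mapping that only knows the ASCII letters and one that follows Unicode case rules. It is only a model of the behaviour seen in the question, not how PostgreSQL implements upper(), and ascii_only_upper is a hypothetical helper written for this illustration.

```python
# Model of the symptom: an ASCII-only case mapping leaves 'é' untouched,
# while a Unicode-aware mapping converts it to 'É'.
# ascii_only_upper is a hypothetical helper, not a PostgreSQL function.

def ascii_only_upper(text: str) -> str:
    """Uppercase only the ASCII letters a-z; leave every other character as-is."""
    return "".join(
        chr(ord(ch) - 32) if "a" <= ch <= "z" else ch
        for ch in text
    )

sample = "anuncio genérico para web del cliente"

ascii_result = ascii_only_upper(sample)   # 'é' is outside a-z, so it is kept
unicode_result = sample.upper()           # str.upper() follows Unicode case mappings

print(ascii_result)
print(unicode_result)
```

The first line printed matches the unexpected output in the question and the second matches the expected one, which is why switching to a collation whose LC_CTYPE knows about accented letters fixes upper().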

Related

Problems with COLLATE in PostgreSQL 12

My problem:
I work on Windows 10 and my computer is set up for Portuguese (pt_BR);
I'm building a database in PostgreSQL where I need certain columns to remain in Portuguese, but others to be in en_US - namely, those storing numbers and currency. I need $ instead of R$ and 1,000.00 instead of 1.000,00.
I tried to create the columns this way using the COLLATE clause:
CREATE TABLE crm.TESTE (
prodserv_id varchar(30) NOT NULL,
prodserv_name varchar(140) NULL,
fk_prodservs_rep_acronym varchar(4) NULL,
prodserv_price numeric null collate "en_US",
CONSTRAINT pk_prodservs_prodserv_id PRIMARY KEY (prodserv_id)
);
But I get the error message:
SQL Error [42704]: ERROR: collation "en_US" for encoding "UTF8" does not exist
Database metadata shows Default Encoding: UTF8 and Collate Portuguese_Brazil.1252
It will be deployed at my ISP, which runs Linux.
Any suggestions would be greatly appreciated.
Thanks in advance.
A collation defines how strings are compared. It is not applicable to numerical data.
Moreover, PostgreSQL uses the operating system's collations, which causes problems when porting a database from Windows to other operating systems. The collation would be called English on Windows and en_US.utf8 on operating systems that use glibc.
To influence the formatting of numbers and currency symbols, set the lc_numeric and lc_monetary parameters appropriately (English on Windows, en_US elsewhere). Note that while lc_monetary affects the string representation of the money data type, these settings do not influence the string representation of numbers. You need to use to_char like this:
to_char(1000, '999G999G999D00 L');
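In that pattern, G and D stand for the group and decimal separators, which is exactly where the 1,000.00 and 1.000,00 conventions from the question differ. As a rough Python model of the two conventions (this is not to_char and does not read any locale; format_amount is a hypothetical helper for illustration):

```python
# Model of the two number conventions from the question.
# format_amount is a hypothetical helper, not a PostgreSQL or locale API.

def format_amount(value: float, group_sep: str, decimal_sep: str) -> str:
    """Format with two decimals, then swap in the requested separators."""
    us_style = f"{value:,.2f}"  # comma groups, dot decimal
    # Go through a placeholder so the two replacements do not collide.
    return (
        us_style.replace(",", "\x00")
        .replace(".", decimal_sep)
        .replace("\x00", group_sep)
    )

en_us = format_amount(1000, ",", ".")   # convention the asker wants
pt_br = format_amount(1000, ".", ",")   # convention of the asker's system

print(en_us)
print(pt_br)
```

The value stored is the same in both cases; only the string produced at output time differs, which is why this is a formatting setting rather than a property of the numeric column.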

Invalid multibyte character for locale when attempting to create Trigrams index

I am using the MusicBrainz database through PostgreSQL on Windows 10. I used these scripts to import it onto my own computer (this might be irrelevant). The encoding of the data is UTF-8.
Today I tried to create some indexes using Trigrams in order to improve the speed of my string-based searches, and this error appeared when I was creating the first one:
CREATE INDEX trgm_idx_release_group_name ON release_group USING gin (name gin_trgm_ops);
ERROR: invalid multibyte character for locale
HINT: The server's LC_CTYPE locale is probably incompatible with the database encoding.
SQL state: 22021
The LC_CTYPE is Spanish_Spain.1252, which seems to be the issue, as it is not UTF-8.
The solution, then, is to create my DB with a locale compatible with UTF-8, but I cannot find that locale, as the names are different from the Linux locales. I have tried creating the database from pgAdmin, but in the locale list I only find C, POSIX and Spanish_Spain.1252. I tried creating it with createdb like this:
createdb -U palp example-db -l=Spanish_Spain
but got the same locale as before with the -l argument as a comment. (I tried the command with different values of the locale and different quotation marks but got the same result).
Am I doing this wrong? Is the locale name wrong, and is that the problem?
Thank you.

Install utf8 collation in PostgreSQL

Right now I can choose Encoding : UTF8 when creating a new DB in pgAdmin4 GUI.
But there is no option to choose utf8_general_ci as the collation or character type. When I do select * from pg_collation; I don't see any collation relevant to utf8_general_ci.
Coming from a MySQL background I am confused. Do I have to install a utf8-like collation (e.g. utf8_general_ci, utf8_unicode_ci) in my PostgreSQL 10 or in Windows 10?
I just want the PostgreSQL equivalent of the MySQL collation utf8_general_ci.
Thank you
utf8 is an encoding (how to represent Unicode characters as a series of bytes), not a collation (which character sorts before which).
I think the closest Postgres 10 equivalent of utf8_general_ci (or the more modern utf8_unicode_ci) is the collation called und-x-icu. The "und" stands for an undetermined language: this is the ICU root collation, not tailored to any real-world language and provided by the ICU library. It sorts characters from most languages quite reasonably. Note that it only affects ordering; in PostgreSQL 10 it does not make comparisons case-insensitive.
ICU support is a new feature added in PostgreSQL 10, so this collation isn't available in older PostgreSQL versions or when ICU support was disabled at compile time. Before that, Postgres relied on the collation support provided by the operating system, which differs between operating systems.

How can I change the "character type" of a database in PostgreSQL?

I'm using PostgreSQL 9.1
I've set the Collation and the Character Type of the database to Greek_Greece.1253 and I want to change it to utf8
To change the collation I should use this, right?
But how can I change the character type?
Thanks
EDIT
I meant to write C instead of utf8. I would like to change the Collation and the Character Type to C
You cannot change the default collation of an existing database. You need to CREATE DATABASE with the collation you need and then dump/restore your schema and data into it.
If you do not want to recreate the database, you can specify a collation for every text column in your DB.
Here is the detailed Postgres manual page on collations: Collation Support.
First line of this manual page states:
LC_COLLATE and LC_CTYPE settings of a database cannot be changed after its creation.
CREATE DATABASE, pg_dump, pg_restore

Force Postgres shift to uppercase rather than lowercase before sorting case insensitive?

I am trying to migrate from Pervasive to Postgres. In Pervasive there was something like 'upper.alt', an alternative collation. I don't really know how it works, but I have to make my new Postgres database behave like Pervasive with this collation.
I use Postgres 9.2.4 and utf-8 encoding and LC_COLLATE='Polish_Poland.1250' .
You can try and order with COLLATE "C". That would get what you want in your example. It has side effects though! Effectively everything is ordered according to the byte values of the encoded character.
WITH x(col) AS (
   VALUES
     ('ABC_AAAAA')
   , ('ABC_BBBBB')
   , ('ABC_ZZZZZ')
   , ('ABCAAAAA')
   , ('ABCBBBBB')
   , ('ABCZZZZZ')
   )
SELECT *
FROM   x
ORDER  BY col COLLATE "C";
This option to change the collation for individual expressions (as opposed to using a collation defined at creation time of the db) was introduced with Postgres 9.1.
More about collation in the manual here.
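To see what ordering by byte values means for the sample values above, here is a minimal Python sketch that sorts by the UTF-8 bytes of each string. It models the ordering described for COLLATE "C" under the assumption of a UTF-8 database; it does not call PostgreSQL, and the second list is a made-up example of the side effect.

```python
# Sort the sample values from the query by their encoded byte values.
values = [
    "ABC_AAAAA", "ABC_BBBBB", "ABC_ZZZZZ",
    "ABCAAAAA", "ABCBBBBB", "ABCZZZZZ",
]
byte_order = sorted(values, key=lambda s: s.encode("utf-8"))

# Side effect: every uppercase ASCII letter sorts before every lowercase one.
mixed_case = sorted(["b", "A", "a", "B"], key=lambda s: s.encode("utf-8"))

for value in byte_order:
    print(value)
print(mixed_case)
```

The underscore (byte 0x5F) comes after all uppercase letters (0x41 to 0x5A), so the values without an underscore sort first, while in lowercase the underscore would come before the letters (0x61 to 0x7A).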