Query failed: collation "numerickn" for encoding "UTF8" does not exist - postgresql

I have a column (vendor_name) in a PostgreSQL (AWS RDS) table that can contain alphanumeric values. I would like to do a natural sort on this column.
The sample data in the table is as follows:
delta 20221120
delta 20220109
costco delivery 564
costco delivery 561
united 01672519702943
Uber
I created a collation in the database as follows:
CREATE COLLATION IF NOT EXISTS numerickn (provider = icu, locale = 'en-u-kn-true')
When a user sorts on the vendor name column in the UI grid, I add the following clause to my query dynamically:
ORDER BY "vendor" COLLATE "numerickn"
However, it fails with the following error, even though I can see that the collation exists in the database:
Error: Query failed: collation "numerickn" for encoding "UTF8" does not exist
I am not sure why it does not work when the collation exists in the database. In my vendor names, digits can appear anywhere within the string, so there is no fixed pattern.

I could not find out why it failed in the stage environment while it worked in my local environment.
In the end, I moved away from the collation approach and implemented the natural sort in a different way, which I found on Stack Overflow:
PostgreSQL ORDER BY issue - natural sort
I am using Node.js in my API code. My solution is as follows:
// Sort key: an array of (number, text) rows, one per digit or non-digit run of vendor.
qOrderBy = String.raw` ORDER BY ARRAY(
  SELECT ROW(
    CAST(COALESCE(NULLIF(match[1], ''), '9223372036854775807') AS BIGINT),
    match[2]
  )
  FROM REGEXP_MATCHES(vendor, '(\d*)|(\D*)', 'g') AS match
) ${sortOrder}`
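To show what that ORDER BY expression computes, here is a minimal JavaScript sketch of the same idea: split each value into digit and non-digit runs, then compare digit runs as numbers and everything else as text. The names naturalKey and naturalCompare are hypothetical. Text runs are compared by UTF-16 code unit rather than by the database collation, so this illustrates the technique rather than reproducing PostgreSQL's exact ordering.

```javascript
// Hypothetical helper: split a string into alternating digit / non-digit runs,
// e.g. "delta 20221120" -> ["delta ", "20221120"].
function naturalKey(value) {
  return value.match(/\d+|\D+/g) ?? [];
}

// Hypothetical comparator: digit runs compare numerically (BigInt avoids
// precision loss on long digit strings), text runs compare by code unit,
// and a digit run sorts before a text run in the same position.
function naturalCompare(a, b) {
  const ka = naturalKey(a);
  const kb = naturalKey(b);
  const shared = Math.min(ka.length, kb.length);
  for (let i = 0; i < shared; i++) {
    const x = ka[i];
    const y = kb[i];
    const xIsNumber = /^\d/.test(x);
    const yIsNumber = /^\d/.test(y);
    if (xIsNumber && yIsNumber) {
      const nx = BigInt(x);
      const ny = BigInt(y);
      if (nx !== ny) return nx < ny ? -1 : 1;
    } else if (xIsNumber !== yIsNumber) {
      return xIsNumber ? -1 : 1;
    } else if (x !== y) {
      return x < y ? -1 : 1;
    }
  }
  // All shared runs are equal: the value with fewer runs sorts first.
  return ka.length - kb.length;
}

const vendors = [
  'delta 20221120',
  'delta 20220109',
  'costco delivery 564',
  'costco delivery 561',
  'united 01672519702943',
  'Uber',
];
const sortedVendors = [...vendors].sort(naturalCompare);
console.log(sortedVendors);
```

With this comparator, 'item 9' sorts before 'item 10', whereas a plain string comparison would put 'item 10' first.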

Related

Postgresql ORDER BY not working as expected

Let's try this simple example to represent the problem I'm facing.
Assume this table:
CREATE TABLE testing1
(
id serial NOT NULL,
word text,
CONSTRAINT testing1_pkey PRIMARY KEY (id)
);
and that data:
insert into testing1 (word) values ('Heliod, God');
insert into testing1 (word) values ('Heliod''s Inter');
insert into testing1 (word) values ('Heliod''s Pilg');
insert into testing1 (word) values ('Heliod, Sun');
Then I want to run this query to get the results ordered by the word column:
SELECT
id, word
FROM testing1
WHERE UPPER(word::text) LIKE UPPER('heliod%')
ORDER BY word asc;
But look at the output: it is not in the order I expect. I would expect the rows in this order, by id: 2, 3, 1, 4 (or, by word: Heliod's Inter, Heliod's Pilg, Heliod, God, Heliod, Sun). This is what I get:
I thought that the WHERE criteria I used might be confusing PostgreSQL, but the same thing happens if I only apply the ORDER BY to the rows:
Am I missing something here? I couldn't find anything in the docs about ordering values that contain quotes (I suspect that the quotes cause this behaviour because of their special meaning in PostgreSQL, but I may be wrong).
I am using UTF-8 encoding for my database (not sure if it matters, though), and this issue is happening on PostgreSQL version 12.7.
The output of
show lc_ctype;
is
"en_GB.UTF-8"
and the output of
show lc_collate;
is
"en_GB.UTF-8"
That is the correct way to order the rows in en_GB.UTF-8 (en_US.UTF-8 behaves the same way). The locale does 'weird' (to someone used to ASCII) things with punctuation and whitespace, skipping them on a first pass and considering them only to break ties between otherwise equal values.
If you don't want those rules, you could use the C collation instead.
Indeed, I've tried @jjanes's suggestion to use the C collation and the output is the one I would expect:
SELECT
id, word
FROM testing1
ORDER BY word collate "C" ;
How weird, I have been using PostgreSQL for some years now and I never noticed that behaviour.
Relevant section from the docs:
23.2.2.1. Standard Collations
On all platforms, the collations named default, C, and POSIX are available. Additional collations may be available depending on operating system
support. The default collation selects the LC_COLLATE and LC_CTYPE values
specified at database creation time. The C and POSIX collations both
specify “traditional C” behavior, in which only the ASCII letters “A”
through “Z” are treated as letters, and sorting is done strictly by
character code byte values.
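The difference between the two orderings discussed above can be sketched in JavaScript. The function compareByCodeUnit mirrors what COLLATE "C" does for ASCII text. The function compareIgnoringPunctuation is a deliberately simplified stand-in for a locale collation such as en_GB.UTF-8, not the actual glibc algorithm: it ignores punctuation, whitespace and case on a first pass and uses the raw strings only to break ties. Both function names are hypothetical.

```javascript
// Code-unit comparison: for ASCII text this matches what COLLATE "C" does.
function compareByCodeUnit(a, b) {
  return a < b ? -1 : a > b ? 1 : 0;
}

// Simplified stand-in for a locale collation (an assumption, not glibc's rules):
// pass 1 compares only letters and digits, ignoring case;
// pass 2 falls back to the raw strings to break ties.
function compareIgnoringPunctuation(a, b) {
  const strip = (s) => s.replace(/[^\p{L}\p{N}]/gu, '').toLowerCase();
  return compareByCodeUnit(strip(a), strip(b)) || compareByCodeUnit(a, b);
}

const words = ['Heliod, God', "Heliod's Inter", "Heliod's Pilg", 'Heliod, Sun'];

const cOrder = [...words].sort(compareByCodeUnit);
const localeLikeOrder = [...words].sort(compareIgnoringPunctuation);

console.log(cOrder);
// ["Heliod's Inter", "Heliod's Pilg", "Heliod, God", "Heliod, Sun"]
console.log(localeLikeOrder);
// ["Heliod, God", "Heliod's Inter", "Heliod's Pilg", "Heliod, Sun"]
```

The apostrophe (code 39) sorts before the comma (code 44) in the code-unit comparison, which gives the order the asker expected. Once punctuation is skipped, "Heliod, Sun" compares as "heliodsun" and lands after "heliodspilg".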

VARCHAR comparison on an indexed column

PostgreSQL is behaving differently from what common sense would suggest.
Given a table 'my_table' with a VARCHAR(250) column named 'MyVarcharColumn', an index IDX_MyvarcharColumn is created on 'MyVarcharColumn'.
Collation: Default
Postgres version: 11.12
LC_COLLATE: en_US.utf8
Encoding: UTF8
CTYPE: en_US.utf8
The problem is presented below:
Given a query (A)
SELECT * FROM my_table t
WHERE 'mystring' = t.MyVarcharColumn
When running the query above, no records are returned even though there is a value 'mystring' present in 'my_table'.
Workaround:
SELECT * FROM my_table t
WHERE 'mystring' = t.MyVarcharColumn collate "C"
Adding 'collate "C"' makes the query work, but obviously no one wants to add a COLLATE clause to the end of every query.
Second 'Workaround':
Recreating the database's indexes with 'REINDEX database myDB' also makes the query work as expected, without the COLLATE clause.
The question is: is there a way to make this work without the COLLATE clause or the REINDEX, that is, without a workaround?
Re-creating the database with a different collation is also not an option at the moment.
Using lower(column_name) to compare isn't an option since it does not use indexes and it would make the query slow.

In Postgres, running ANALYZE changes behaviour of ILIKE clause with COLLATE

We are migrating an application from SQL Server to Postgres and attempting to emulate various aspects of the case insensitivity of SQL Server. We have created a non-deterministic collation to support case-insensitive matching of foreign keys and equality comparisons.
But we are seeing some weird behaviour when using ILIKE which we can't explain, and would appreciate some assistance.
To see the behaviour, run the following on a fresh database:
CREATE COLLATION IF NOT EXISTS public.ci (provider = icu, locale = 'und-u-ks-level2', deterministic = false);
DROP TABLE IF EXISTS sort_test;
CREATE TABLE sort_test (a text COLLATE public.ci);
INSERT INTO sort_test SELECT md5(n::text) FROM generate_series(1, 10000) n;
-- Removing the following line fixes the issue
ANALYZE sort_test;
-- This line throws "nondeterministic collations are not supported for ILIKE"
SELECT * FROM sort_test WHERE a ILIKE 'c4ca4238a0%' COLLATE "und-x-icu";
Why does running the ANALYZE statement break the ILIKE statement?
That behavior is a PostgreSQL bug.
The reason why it works without the ANALYZE is that the error is thrown when applying the operator to the “histogram bounds” in the statistics. Before ANALYZE there are no statistics, so no error is thrown.

PostgreSQL - Query on hstore - column does not exists

I wonder if someone has an idea of what is going wrong with this simple query on an hstore column in PostgreSQL 9.2.
The queries are run in pgAdmin.
select attributeValue->"CODE_MUN" from shapefile_feature
returns: column « attributevalue » does not exist
When doing:
select * from shapefile_feature;
all the columns are returned, including attributeValue, the hstore column.
What is the problem?
PostgreSQL distinguishes between "identifiers" and 'literals'. Identifiers are the names of schemas, tables, columns, and so on; literals are values. The attributes in an hstore are not SQL identifiers, so you have to pass their names as literals. The "->" operator is only a shortcut for the function "fetchval(hstore, text)", with the added possibility of being indexed.
select attributeValue->'CODE_MUN' from shapefile_feature
is internally transformed to the following (do not make this transformation yourself!):
select fetchval(attributeValue, 'CODE_MUN') from shapefile_feature
With the buggy example in its transformed form, the error message is easier to understand:
select fetchval(attributeValue, "CODE_MUN") from shapefile_feature
PostgreSQL tries to find a column named "CODE_MUN" in shapefile_feature, because double quotes denote identifiers (in case-sensitive notation).
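The quoting rules described above can be summarised in a small JavaScript sketch. The helper classifySqlToken is hypothetical and is a teaching aid, not a SQL parser: it only shows how a single token is interpreted depending on its quotes. It also shows that an unquoted identifier is folded to lower case, which is why the error message names attributevalue rather than attributeValue.

```javascript
// Hypothetical helper: classify one SQL token by PostgreSQL's quoting rules.
function classifySqlToken(token) {
  if (token.length >= 2 && token.startsWith("'") && token.endsWith("'")) {
    // Single quotes: a string literal; '' inside stands for one quote.
    return { kind: 'literal', value: token.slice(1, -1).replace(/''/g, "'") };
  }
  if (token.length >= 2 && token.startsWith('"') && token.endsWith('"')) {
    // Double quotes: an identifier whose case is preserved.
    return { kind: 'identifier', name: token.slice(1, -1).replace(/""/g, '"') };
  }
  // Unquoted: treated here as an identifier, folded to lower case.
  return { kind: 'identifier', name: token.toLowerCase() };
}

console.log(classifySqlToken('"CODE_MUN"'));     // identifier named CODE_MUN
console.log(classifySqlToken("'CODE_MUN'"));     // literal with value CODE_MUN
console.log(classifySqlToken('attributeValue')); // identifier named attributevalue
```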

Postgresql order by - danish characters is expanded

I'm trying to make an ORDER BY clause in a SQL query work, but for some reason Danish special characters are expanded instead of being ordered by their own value.
SELECT roadname FROM mytable ORDER BY roadname
The result:
Abildlunden
Æblerosestien
Agern Alle 1
The result in the middle should be last.
The locale is set to Danish, so it should know how to order the Danish special characters.
What is the collation of your database? (You might also want to give the PostgreSQL version you are using) Use "\l" from psql to see.
Compare and contrast:
steve@steve@[local] =# select * from (values('Abildlunden'),('Æblerosestien'),('Agern Alle 1')) x(word)
order by word collate "en_GB";
word
---------------
Abildlunden
Æblerosestien
Agern Alle 1
(3 rows)
steve@steve@[local] =# select * from (values('Abildlunden'),('Æblerosestien'),('Agern Alle 1')) x(word)
order by word collate "da_DK";
word
---------------
Abildlunden
Agern Alle 1
Æblerosestien
(3 rows)
The database collation is set when you create the database cluster, from the locale you have set at the time. If you installed PostgreSQL through a package manager (e.g. apt-get) then it is likely taken from the system-default locale.
You can override the collation used in a particular column, or even in a particular expression (as done in the examples above). However, if you are not specifying anything (which is likely), the database default is used. That default is inherited from the template database when the database is created, and the template database's collation is fixed when the cluster is created.
If you want to use da_DK as your default collation throughout, and it is not currently your database default, your simplest option might be to dump the database, then drop and re-create the cluster, specifying the collation to initdb (or pg_createcluster, or whatever tool you use to create the server).
BTW, the question isn't well phrased. PostgreSQL is very much not ignoring the "special" characters; it is correctly expanding "Æ" into "AE", which is a correct rule for English. Collating "Æ" at the end is actually more like the unlocalised behaviour.
Collation documentation: http://www.postgresql.org/docs/current/static/collation.html
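The same contrast can be reproduced outside the database. Below is a minimal JavaScript sketch using Intl.Collator, which is backed by ICU rather than by the operating system locale that PostgreSQL used in the examples above, so it is an analogy and not the same implementation. The helper name sortWithLocale is hypothetical, and the Danish result assumes a runtime that ships Danish collation data (recent official Node.js builds include full ICU).

```javascript
const roads = ['Abildlunden', 'Æblerosestien', 'Agern Alle 1'];

// Hypothetical helper: sort a copy of the list with the given locale's rules.
function sortWithLocale(values, locale) {
  const collator = new Intl.Collator(locale);
  return [...values].sort(collator.compare);
}

// English rules treat "Æ" like "AE", so it falls between "Ab" and "Ag".
console.log(sortWithLocale(roads, 'en-GB'));

// Danish rules treat "Æ" as a separate letter that sorts after "Z".
console.log(sortWithLocale(roads, 'da-DK'));
```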