In Postgres, running ANALYZE changes behaviour of ILIKE clause with COLLATE

We are migrating an application from SQL Server to Postgres and attempting to emulate various aspects of the case insensitivity of SQL Server. We have created a non-deterministic collation to support case-insensitive matching of foreign keys and equality comparisons.
But we are seeing some weird behaviour when using ILIKE which we can't explain, and would appreciate some assistance.
To see the behaviour, run the following on a fresh database:
CREATE COLLATION IF NOT EXISTS public.ci (provider = icu, locale = 'und-u-ks-level2', deterministic = false);
DROP TABLE IF EXISTS sort_test;
CREATE TABLE sort_test (a text COLLATE public.ci);
INSERT INTO sort_test SELECT md5(n::text) FROM generate_series(1, 10000) n;
-- Removing the following line fixes the issue
ANALYZE sort_test;
-- This line throws "nondeterministic collations are not supported for ILIKE"
SELECT * FROM sort_test WHERE a ILIKE 'c4ca4238a0%' COLLATE "und-x-icu";
Why does running the ANALYZE statement break the ILIKE statement?

That behavior is a PostgreSQL bug.
The reason why it works without the ANALYZE is that the error is thrown when applying the operator to the “histogram bounds” in the statistics. Before ANALYZE there are no statistics, so no error is thrown.
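One way to see the precondition for the error, as a sketch: the standard pg_stats view shows whether the column has histogram bounds yet. On the test table above, the expectation (based on the explanation above, not verified here) is that it returns no row before ANALYZE and a populated histogram_bounds afterwards.

```sql
-- Inspect the planner statistics for the column used in the ILIKE filter.
SELECT attname, n_distinct, histogram_bounds
FROM pg_stats
WHERE schemaname = 'public'   -- assumes sort_test was created in schema public
  AND tablename = 'sort_test'
  AND attname = 'a';
```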

Related

Query failed: collation "numerickn" for encoding "UTF8" does not exist

I have a column (vendor_name) in a PostgreSQL (AWS RDS) table which can contain alphanumeric values. I would like to do a natural sort on this column.
The sample data in the table is as follows:
delta 20221120
delta 20220109
costco delivery 564
costco delivery 561
united 01672519702943
Uber
I have created a collation in the DB as below.
CREATE COLLATION IF NOT EXISTS numerickn (provider = icu, locale = 'en-u-kn-true')
If anyone sorts on the vendor name column in the UI grid, I am adding the following clause dynamically in my query.
ORDER BY "vendor" COLLATE "numerickn"
However, it gives the following error, even though I can see that the collation exists in the DB.
Error: Query failed: collation "numerickn" for encoding "UTF8" does not exist
I am not sure why it does not work when the collation exists in the DB. In my vendor names, digits can appear anywhere within the string, so there is no fixed pattern.
I could not find out why it failed in the stage environment while it worked in my local one.
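A diagnostic sketch for comparing the two environments: PostgreSQL reports this message when it cannot resolve the collation name, so it may be worth checking, in the failing database itself, which schema the collation lives in and whether that schema is on the session's search_path. This is a suggestion for narrowing things down, not a confirmed cause.

```sql
-- Run this while connected to the failing (stage) database.
SELECT n.nspname AS schema_name,
       c.collname,
       c.collprovider,          -- 'i' = icu, 'c' = libc
       c.collencoding           -- -1 means usable with any encoding
FROM pg_collation AS c
   JOIN pg_namespace AS n ON n.oid = c.collnamespace
WHERE c.collname = 'numerickn';

-- An unqualified COLLATE "numerickn" is looked up via the search path.
SHOW search_path;
```

If the first query returns no row, the collation was created in a different database; collations are per-database objects.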
In the end, I moved away from the collation approach and implemented the natural sort in a different way, which I found on Stack Overflow:
PostgreSQL ORDER BY issue - natural sort
I am using Node.js in my API code. My solution is as follows:
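// String.raw keeps the backslashes in \d and \D intact for the SQL regex.
// Non-numeric tokens get the largest BIGINT value as their numeric part.
qOrderBy = String.raw` ORDER BY ARRAY(
  SELECT ROW(
    CAST(COALESCE(NULLIF(match[1], ''), '9223372036854775807') AS BIGINT),
    match[2]
  )
  FROM REGEXP_MATCHES(vendor, '(\d*)|(\D*)', 'g') AS match
) ${sortOrder}`;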
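To try the ordering expression outside the application, here is a standalone sketch that applies the same clause to the sample rows from the question. The VALUES list and the alias t are only scaffolding for the illustration; the column is named vendor to match the clause above.

```sql
-- Each string is split into digit and non-digit runs; every run becomes a
-- (number, text) pair, and rows are ordered by the resulting array of pairs.
SELECT vendor
FROM (VALUES ('delta 20221120'),
             ('delta 20220109'),
             ('costco delivery 564'),
             ('costco delivery 561'),
             ('united 01672519702943'),
             ('Uber')) AS t(vendor)
ORDER BY ARRAY(
   SELECT ROW(
      CAST(COALESCE(NULLIF(match[1], ''), '9223372036854775807') AS BIGINT),
      match[2]
   )
   FROM regexp_matches(vendor, '(\d*)|(\D*)', 'g') AS match
);
```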

VARCHAR comparison on an indexed column

Postgres is behaving differently from the 'common sense' expected behavior:
Given a table 'my_table' with a VARCHAR(250) column named 'MyVarcharColumn' and an index IDX_MyvarcharColumn on that column.
Collation: Default
Postgres version: 11.12
LC_COLLATE: en_US.utf8
Encoding: UTF8
CTYPE: en_US.utf8
The problem is presented below:
Given a query (A)
SELECT * FROM my_table t
WHERE 'mystring' = t.MyVarcharColumn
When running the query above, no records are returned even though there is a value 'mystring' present in 'my_table'.
Workaround:
SELECT * FROM my_table t
WHERE 'mystring' = t.MyVarcharColumn collate "C"
With 'collate "C"' added, the query works fine, but obviously no one wants to add a "collate" clause to the end of every query.
Second 'Workaround':
After recreating the database's indexes with 'REINDEX database myDB', the query also starts to work as expected, without the 'collate' clause.
The question is: is there a way to make this work without the collate clause and without the REINDEX?
Re-creating the database with a different collation is also not an option at the moment.
Using lower(column_name) to compare isn't an option since it does not use indexes and it would make the query slow.
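Since REINDEX makes the lookup work again, the index contents rather than the query appear to be the problem. As a narrower variant of the second workaround, this sketch rebuilds only the affected index instead of every index in the database. The index name is assumed to have been created without quotes, so PostgreSQL stores it in lower case; if it was created quoted, write "IDX_MyvarcharColumn" instead.

```sql
-- Rebuild just the index on the VARCHAR column (assumed name, see above).
-- REINDEX blocks writes to the table while it runs.
REINDEX INDEX idx_myvarcharcolumn;

-- Then re-run the original query without the COLLATE clause.
SELECT * FROM my_table t
WHERE 'mystring' = t.MyVarcharColumn;
```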

Postgres 12 case-insensitive compare

I'm attempting to move a SQL Server DB which is used by a C# application (+EF6) to Postgres 12 but I'm not having much luck with getting case-insensitive string comparisons working. The existing SQL Server db uses SQL_Latin1_General_CP1_CI_AS collation which means all WHERE clauses don't have to worry about case.
I understand that citext was the way to do this previously, but is now superseded by non-deterministic collations.
I created such a collation:
CREATE COLLATION ci (provider = icu, locale = 'und-u-ks-level2', deterministic = false);
and when this is applied to the CREATE TABLE on a per-column basis it does work - case is ignored.
CREATE TABLE casetest (
id serial NOT NULL,
code varchar(10) null COLLATE "ci",
CONSTRAINT "PK_id" PRIMARY KEY ("id"));
But from what I have read it must be applied to every varchar column and can't be set globally across the whole db.
Is this correct?
I don't want to use .ToLower() everywhere due to clutter and that any index on the column is then not used.
I tried modifying the pre-existing 'default' collation in pg_collation to match the settings of 'ci' collation but it has no effect.
Thanks in advance.
PG
You got it right. From PostgreSQL v15 on, ICU collations can be used as database collations, but only deterministic ones (that don't compare different strings as equal). So your case-insensitive collation wouldn't work there either. Since you are using v12, you cannot use ICU collations as database default collation at all, but have to use them in column definitions.
This limitation is annoying and not in the nature of things. It will probably be lifted in some future version.
You can use a DO statement to change the collation of all string columns:
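DO
$$DECLARE
   v_table  regclass;
   v_column name;
   v_type   oid;
   v_typmod integer;
BEGIN
   -- type OIDs: 25 = text, 1042 = bpchar (char), 1043 = varchar
   FOR v_table, v_column, v_type, v_typmod IN
      SELECT a.attrelid::regclass,
             a.attname,
             a.atttypid,
             a.atttypmod
      FROM pg_attribute AS a
         JOIN pg_class AS c ON a.attrelid = c.oid
      WHERE a.atttypid IN (25, 1042, 1043)
        -- ordinary tables only: indexes and views also have pg_attribute
        -- rows, and ALTER TABLE ... SET DATA TYPE would fail on them
        AND c.relkind = 'r'
        AND c.relnamespace::regnamespace::name
            NOT IN ('pg_catalog', 'information_schema', 'pg_toast')
   LOOP
      -- keep the column's type and length, change only the collation
      EXECUTE
         format('ALTER TABLE %s ALTER %I SET DATA TYPE %s COLLATE ci',
                v_table,
                v_column,
                format_type(v_type, v_typmod)
         );
   END LOOP;
END;$$;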
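After running a block like this, a quick check (a sketch using the standard information_schema.columns view) can list string columns that still do not use the ci collation. The view also includes columns of views, so rows for views are expected and harmless.

```sql
-- String columns whose collation is still something other than "ci".
-- collation_name is NULL for columns that use the database default.
SELECT table_schema, table_name, column_name, data_type, collation_name
FROM information_schema.columns
WHERE data_type IN ('text', 'character', 'character varying')
  AND table_schema NOT IN ('pg_catalog', 'information_schema')
  AND collation_name IS DISTINCT FROM 'ci'
ORDER BY table_schema, table_name, ordinal_position;
```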

Postgres: Understanding Primary Key Sequences

My database has fallen out of sync, which led me to this question:
How to reset postgres' primary key sequence when it falls out of sync? (Copied below)
However, I don't quite understand some things here: what is 'your_table_id_seq'? I have no clue where to find it. Doing some digging, I found a table called pg_sequence in pg_catalog. Is it related to this? I can't see any way to relate that data back to my table, though.
-- Login to psql and run the following
-- What is the result?
SELECT MAX(id) FROM your_table;
-- Then run...
-- This should be higher than the last result.
SELECT nextval('your_table_id_seq');
-- If it's not higher... run this to set the sequence's next value just past your highest id.
-- (wise to run a quick pg_dump first...)
BEGIN;
-- protect against concurrent inserts while you update the counter
LOCK TABLE your_table IN EXCLUSIVE MODE;
-- Update the sequence
SELECT setval('your_table_id_seq', COALESCE((SELECT MAX(id)+1 FROM your_table), 1), false);
COMMIT;
The following query gives names of all sequences.
SELECT c.relname
FROM pg_class c
WHERE c.relkind = 'S';
Typically, the sequence behind a serial column is named ${table}_${column}_seq, so ${table}_id_seq for a column called id.
I found the answer in this question: List all sequences in a Postgres db 8.1 with SQL
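Instead of guessing the name, PostgreSQL can report which sequence is attached to a column through the built-in function pg_get_serial_sequence. A sketch reusing the placeholder names your_table and id from the snippet above:

```sql
-- Returns the schema-qualified sequence name, or NULL if the column
-- has no sequence attached to it.
SELECT pg_get_serial_sequence('your_table', 'id');

-- The same reset as above, without hard-coding the sequence name.
BEGIN;
LOCK TABLE your_table IN EXCLUSIVE MODE;
SELECT setval(pg_get_serial_sequence('your_table', 'id'),
              COALESCE((SELECT MAX(id) + 1 FROM your_table), 1),
              false);
COMMIT;
```

This only finds sequences that are owned by the column, which is the case for serial and identity columns; a sequence created by hand and wired in through a DEFAULT is not reported unless it was linked with ALTER SEQUENCE ... OWNED BY.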

Query works in postgresql, but not in hsqldb

column_name is of type int[]
SELECT unnest(column_name) FROM table_name
The above query works on PostgreSQL but not on HSQLDB, even with sql.syntax_pgs=true
HSQLDB versions tried: 2.2.9 and 2.3.0
The SQL that works in HSQLDB is
SELECT x FROM table_name, unnest(column_name) y(x)
x and y are NOT columns of this table.
HSQLDB tries to emulate PostgreSQL's syntax and features, but like most emulations it is imperfect.
IIRC, one of the things it has a hard time with is PostgreSQL's quirky use of set-returning functions in the SELECT clause.
Use of SRFs in the SELECT clause is a weird PostgreSQL extension that's deprecated in favour of SQL-standard LATERAL queries anyway. The alternate formulation you showed:
SELECT x FROM table_name, unnest(column_name) y(x);
is the correct and preferred form. So just use that.
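For reference, here is the same query with the implicit lateral join spelled out, as it can be written in PostgreSQL 9.3 and later. Whether HSQLDB accepts this exact spelling is not established in this thread, so treat it as a PostgreSQL-side illustration only.

```sql
-- A function call in FROM may refer to columns of tables listed before it;
-- CROSS JOIN LATERAL makes that dependency explicit.
SELECT y.x
FROM table_name
   CROSS JOIN LATERAL unnest(column_name) AS y(x);
```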
In general, testing on one DB then deploying to another is a recipe for pain. I strongly suggest just setting up a local PostgreSQL instance for testing instead.