How to make my postgresql database use a case insensitive collation? - postgresql

Several SO posts ask for an efficient way to search text columns case-insensitively.
As far as I understand, the most efficient way is to have a database with a case-insensitive collation. I am creating the database from scratch, so I have full control over its collation. The only problem is that I have no idea how to define it, and I could not find any example.
Please show me how to create a database with a case-insensitive collation.
I am using PostgreSQL 9.2.4.
EDIT 1
The CITEXT extension is a good solution. However, it has some limitations, as explained in the documentation. I will certainly use it if no better way exists.
I would like to emphasize that I want ALL string operations to be case insensitive. Using CITEXT for every TEXT field is one way. However, using a case-insensitive collation would be best, if at all possible.
Now https://stackoverflow.com/users/562459/mike-sherrill-catcall says that PostgreSQL uses whatever collations the underlying system exposes. I do not mind making the OS expose a case-insensitive collation. The only problem is that I have no idea how to do it.

A lot has changed since this question was asked. Native support for case-insensitive collations was added in PostgreSQL v12. This largely supersedes the citext extension mentioned in the other answers.
In PostgreSQL v12, one can do:
CREATE COLLATION case_insensitive (
    provider = icu,
    locale = 'und-u-ks-level2',
    deterministic = false
);
CREATE TABLE names (
    first_name text,
    last_name text
);
INSERT INTO names VALUES
    ('Anton', 'Egger'),
    ('Berta', 'egger'),
    ('Conrad', 'Egger');
SELECT * FROM names
ORDER BY
    last_name COLLATE case_insensitive,
    first_name COLLATE case_insensitive;
See https://www.postgresql.org/docs/current/collation.html for more information.
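The example above only uses the collation for sorting. As a minimal sketch, assuming the case_insensitive collation from above already exists (PostgreSQL 12 or later, built with ICU support), the collation can also be attached to a column so that equality comparisons ignore case. The table name people is hypothetical.

```sql
-- Assumes the case_insensitive collation defined above already exists.
-- "people" is a hypothetical table used only for illustration.
CREATE TABLE people (
    last_name text COLLATE case_insensitive
);

INSERT INTO people VALUES ('Egger'), ('egger'), ('Huber');

-- Equality on the column uses the column's nondeterministic collation,
-- so this matches both 'Egger' and 'egger'.
SELECT last_name
FROM people
WHERE last_name = 'EGGER';
```

Note that in v12 pattern matching operations such as LIKE are not supported with nondeterministic collations, so this approach mainly covers equality, ordering, and uniqueness.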

There are no case-insensitive collations (in PostgreSQL versions before 12), but there is the citext extension:
http://www.postgresql.org/docs/current/static/citext.html
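For illustration, a minimal sketch of how citext is typically used. The table and column names are hypothetical, and the extension must be available in your installation.

```sql
-- Install the extension once per database.
CREATE EXTENSION IF NOT EXISTS citext;

-- "accounts" and "email" are hypothetical names.
CREATE TABLE accounts (
    email citext PRIMARY KEY
);

INSERT INTO accounts VALUES ('Alice@Example.com');

-- citext compares case-insensitively, so this finds the row above.
SELECT email
FROM accounts
WHERE email = 'alice@example.com';
```

Because the primary key is also compared case-insensitively, inserting 'ALICE@example.com' afterwards would be rejected as a duplicate.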

For my purpose the ILIKE keyword did the job.
From the postgres docs:
The key word ILIKE can be used instead of LIKE to make the match
case-insensitive according to the active locale. This is not in the
SQL standard but is a PostgreSQL extension.
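A quick way to see the difference, as a sketch using literal values rather than a real table:

```sql
-- ILIKE ignores case; LIKE does not.
SELECT 'Egger' ILIKE 'egg%' AS ilike_match,  -- true
       'Egger' LIKE  'egg%' AS like_match;   -- false
```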

This does not change the collation, but it may help someone with this type of query, where I used the lower() function:
SELECT id, full_name, email FROM nurses WHERE (lower(full_name) LIKE '%bar%' OR lower(email) LIKE '%bar%');

I believe you need to specify your collation as a command line option to initdb when you create the database cluster. Something like
initdb --lc-collate=en_US.UTF-8
It also seems that using PostgreSQL 9.3 on Ubuntu and Mac OS X, initdb automatically creates the database cluster using a case-insensitive collation that is default in the current OS locale, in my case, en_US.UTF-8.
Could you be using an older version of PostgreSQL that does not default to the host locale? Or could it be that you are on an operating system that does not provide any case-insensitive collations for PostgreSQL to choose from?

Related

Create database definition equivalent from mysql to postgresql

I need to migrate a mysql table to postgresql.
I need an accent and case insensitive database.
In MySQL, my database has the following definition:
CREATE DATABASE gestan
DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
How do I create an equivalent definition to postgresql?
I have read some posts, but they seem outdated.
create database getan
encoding = 'UTF-8';
There is no direct equivalent of MySQL's case-insensitive _ci collations in Postgres.
You can define a new ICU collation that compares case-insensitively, but it cannot be used as the default collation of a database, only as the collation of individual columns or expressions.
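As a rough sketch of the column-level approach, assuming PostgreSQL 12 or later built with ICU support: the collation name ignore_case_accent and the table customers are hypothetical, and the locale 'und-u-ks-level1' requests primary-strength comparison, which ignores both case and accents.

```sql
-- Hypothetical collation name; primary strength ignores case and accents.
CREATE COLLATION ignore_case_accent (
    provider = icu,
    locale = 'und-u-ks-level1',
    deterministic = false
);

-- Hypothetical table; the collation is applied per column.
CREATE TABLE customers (
    name text COLLATE ignore_case_accent
);

INSERT INTO customers VALUES ('João');

-- Matches the row above despite the different case and the missing accent.
SELECT name
FROM customers
WHERE name = 'joao';
```

This is only an approximation of utf8mb4_unicode_ci, not a guaranteed identical ordering, and it has to be repeated for every column that needs it.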
Related questions:
Install utf8 collation in PostgreSQL
https://dba.stackexchange.com/questions/183355
https://dba.stackexchange.com/a/256979

PostgreSQL accent + case insensitive search

I'm looking for a way to support case-insensitive and accent-insensitive search with good performance. Until now we had no issue with this on MS SQL Server; on Oracle we had to use Oracle Text; and now we need it on PostgreSQL.
I've found this post about it, but we need to combine it with case insensitivity. We also need to use indexes, otherwise performance could suffer.
Any real experience about the best approach for large databases?
If you need to "combine with case insensitive", there are a number of options, depending on your exact requirements.
Maybe simplest, make the expression index case insensitive.
Building on the function f_unaccent() laid out in the referenced answer:
Does PostgreSQL support "accent insensitive" collations?
CREATE INDEX users_lower_unaccent_name_idx ON users(lower(f_unaccent(name)));
Then:
SELECT *
FROM users
WHERE lower(f_unaccent(name)) = lower(f_unaccent('João'));
Or you could build the lower() into the function f_unaccent(), to derive something like f_lower_unaccent().
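One possible shape for such a wrapper, as a sketch only. It assumes that f_unaccent() from the referenced answer already exists and is declared IMMUTABLE; the name f_lower_unaccent() and the index name are illustrative.

```sql
-- Assumes f_unaccent(text) from the referenced answer exists and is IMMUTABLE.
CREATE OR REPLACE FUNCTION f_lower_unaccent(text)
  RETURNS text
  LANGUAGE sql IMMUTABLE STRICT AS
$$ SELECT lower(f_unaccent($1)) $$;

-- The wrapper can then be used in the expression index ...
CREATE INDEX users_f_lower_unaccent_name_idx ON users (f_lower_unaccent(name));

-- ... and in queries, which must use the same expression to hit the index.
SELECT *
FROM users
WHERE f_lower_unaccent(name) = f_lower_unaccent('João');
```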
Or (especially if you need to do fuzzy pattern matching anyways) you can use a trigram index provided by the additional module pg_trgm building on above function, which also supports ILIKE. Details:
LOWER LIKE vs iLIKE
I added a note to the referenced answer.
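A sketch of the trigram variant, again assuming an IMMUTABLE f_unaccent() as in the referenced answer. The index name is hypothetical.

```sql
-- pg_trgm ships as an additional module and must be installed per database.
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- GIN trigram index on the unaccented name; it supports LIKE and ILIKE,
-- including patterns that are not left-anchored.
CREATE INDEX users_unaccent_name_trgm_idx
    ON users USING gin (f_unaccent(name) gin_trgm_ops);

-- ILIKE handles the case-insensitive part, f_unaccent() the accents.
SELECT *
FROM users
WHERE f_unaccent(name) ILIKE ('%' || f_unaccent('joão') || '%');
```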
Or you could use the additional module citext (but I rather avoid it):
Deferrable, case-insensitive unique constraint
Full-Text-Search Dictionary that Unaccent case-insensitive
FTS is naturally case-insensitive by default,
Converting tokens into lexemes. A lexeme is a string, just like a token, but it has been normalized so that different forms of the same word are made alike. For example, normalization almost always includes folding upper-case letters to lower-case, and often involves removal of suffixes (such as s or es in English).
And you can define your own dictionary using unaccent,
CREATE EXTENSION unaccent;
CREATE TEXT SEARCH CONFIGURATION mydict ( COPY = simple );
ALTER TEXT SEARCH CONFIGURATION mydict
ALTER MAPPING FOR hword, hword_part, word
WITH unaccent, simple;
Which you can then index with a functional index,
-- Just some sample data...
CREATE TABLE myTable ( myCol )
AS VALUES ('fóó bar baz'),('qux quz');
-- No index required, but feel free to create one
CREATE INDEX ON myTable
USING GIST (to_tsvector('mydict', myCol));
You can now query it very simply
SELECT *
FROM myTable
WHERE to_tsvector('mydict', myCol) @@ 'foo & bar';
mycol
-------------
fóó bar baz
(1 row)
See also
Creating a case-insensitive and accent/diacritics insensitive search on a field

Is it safe to change Collation on Postgres (keeping encoding)?

I have a Postgres 9.3 database which, by mistake, has been set to:
but I need it to be:
Since the encoding doesn't change, is it safe to dump the DB and restore it later (see here) to a database with the new collation / character type?
Perfectly safe -- the collation is just telling Postgres which set of rules to apply when sorting text.
You can even set it dynamically on a query basis in the order by clause, and should be able to alter it without needing to dump the database.

Force Postgres shift to uppercase rather than lowercase before sorting case insensitive?

I am trying to migrate from Pervasive to Postgres. Pervasive had something like 'upper.alt', an alternative collation. I don't really know how it works, but I have to make my new Postgres database behave like Pervasive with this collation.
I use Postgres 9.2.4 with UTF-8 encoding and LC_COLLATE='Polish_Poland.1250'.
You can try and order with COLLATE "C". That would get what you want in your example. It has side effects though! Effectively everything is ordered according to the byte values of the encoded character.
WITH x(col) AS (
VALUES
('ABC_AAAAA')
,('ABC_BBBBB')
,('ABC_ZZZZZ')
,('ABCAAAAA')
,('ABCBBBBB')
,('ABCZZZZZ')
)
SELECT *
FROM x
ORDER BY col COLLATE "C";
This option to change the collation for individual expressions (as opposed to using a collation defined at creation time of the db) was introduced with Postgres 9.1.
More about collation in the manual here.

firebird set case insensitive collation

How can I set case insensitive collation for the whole database?
Do I have to recreate the tables and data?
The database is Firebird 2.5.
Quote from the release notes:
The character set and collation of existing columns are not affected by ALTER CHARACTER SET changes.
So yes, it seems that the best way would be to recreate the database with desired default character set and collation (and / or with explicit definitions in domains).
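A minimal sketch of recreating the database with a case-insensitive default collation, assuming Firebird 2.5 and the UTF8 character set. The file path and credentials are placeholders, and UNICODE_CI is used here as one case-insensitive collation for UTF8.

```sql
-- Placeholder path and credentials; run from isql.
CREATE DATABASE '/data/mydb.fdb'
  USER 'SYSDBA' PASSWORD 'masterkey'
  DEFAULT CHARACTER SET UTF8 COLLATION UNICODE_CI;
```

Columns created afterwards without an explicit character set or collation would then pick up this default; the data still has to be copied over from the old database.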