Postgres ordering of UTF-8 characters

I'm building a small app that includes Esperanto words in my database, so I have words like ĉapelojn and brakhorloĝo, with "special" characters.
Using PostgreSQL 9.4.4 I have a words table with the following schema:
lingvoj_dev=# \d words
Table "public.words"
Column | Type | Modifiers
-------------+-----------------------------+----------------------------------------------------
id | integer | not null default nextval('words_id_seq'::regclass)
translated | character varying(255) |
meaning | character varying(255) |
times_seen | integer |
inserted_at | timestamp without time zone | not null
updated_at | timestamp without time zone | not null
Indexes:
"words_pkey" PRIMARY KEY, btree (id)
But the following query gives some strange output:
lingvoj_dev=# SELECT w."translated" FROM "words" AS w ORDER BY w."translated" desc limit 10;
translated
------------
ĉu
ŝi
ĝi
ĉevaloj
ĉapelojn
ĉapeloj
ĉambro
vostojn
volas
viro
(10 rows)
The ordering is inconsistent - I'd be okay with all of the words starting with special characters being at the end, but all of the words starting with ĉ should be grouped together and they're not! Why do ŝi and ĝi come in between ĉu and ĉevaloj?
The server encoding is UTF8, and the collation is en_AU.UTF-8.
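For reference, this is how the encoding and collation can be confirmed from psql (a plain catalog query, output omitted):
SELECT datname, pg_encoding_to_char(encoding) AS encoding, datcollate, datctype
FROM pg_database
WHERE datname = current_database();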
edit: It looks like it's sorting all of the special characters as equivalent - it's ordering correctly based on the second character in each word. How do I make PostgreSQL see that ĉ, ŝ and ĝ are not equivalent?

I'd be okay with all of the words starting with special characters
being at the end...
Use collate "C":
SELECT w."translated"
FROM "words" AS w
ORDER BY w."translated" collate "C" desc limit 10;
See also Different behaviour in “order by” clause: Oracle vs. PostgreSQL
The query can be problematic when using an ORM. The solution may be to recreate the database with the LC_COLLATE = C option, as suggested by the OP in the comments. There is one more option - change the collation for a single column:
ALTER TABLE "words" ALTER COLUMN "translated" TYPE text COLLATE "C";

Related

Is this INSERT statement containing SELECT subquery safe for multiple concurrent writes?

In Postgres, suppose I have the following table to be used like a singly linked list, where each row has a reference to the previous row.
Table node
   Column   |           Type           | Collation | Nullable |      Default
------------+--------------------------+-----------+----------+-------------------
 id         | uuid                     |           | not null | gen_random_uuid()
 created_at | timestamp with time zone |           | not null | now()
 name       | text                     |           | not null |
 prev_id    | uuid                     |           |          |
I have the following INSERT statement, which includes a SELECT to look up the last row as data for the new row to be inserted.
INSERT INTO node(name, prev_id)
VALUES (
    :name,
    (
        SELECT id
        FROM node
        ORDER BY created_at DESC
        LIMIT 1
    )
)
RETURNING id;
I understand storing prev_id may seem redundant in this example (ordering can be derived from created_at), but that is beside the point. My question: Is the above INSERT statement safe for multiple concurrent writes? Or, is it necessary to explicitly use LOCK in some way?
For clarity, by "safe", I mean is it possible that by the time the SELECT subquery executed and found the "last row", another concurrent query would have just finished an insert, so the "last row" found earlier is no longer the last row, so this insert would use the wrong "last row" value. The effect is multiple rows may share the same prev_id values, which is invalid for a linked list structure.

Can Postgres table partitioning by hash boost query performance?

I have an OLTP table on a Postgres 14.2 database that looks something like this:
   Column    |            Type             | Nullable
-------------+-----------------------------+----------
 id          | character varying(32)       | not null
 user_id     | character varying(255)      | not null
 slug        | character varying(255)      | not null
 created_at  | timestamp without time zone | not null
 updated_at  | timestamp without time zone | not null
Indexes:
    "playground_pkey" PRIMARY KEY, btree (id)
    "playground_user_id_idx" btree (user_id)
The database host has 8GB of RAM and 2 CPUs.
I have roughly 500M records in the table which adds up to about 80GB in size.
The table gets about 10K INSERT/h, 30K SELECT/h, and 5K DELETE/h.
The main query run against the table is:
SELECT * FROM playground WHERE user_id = '12345678' and slug = 'some-slug' limit 1;
Users have anywhere between 1 record to a few hundred records.
Thanks to the index on the user_id I generally get decent performance (double-digit milliseconds), but 5%-10% of the queries will take a few hundred milliseconds to maybe a second or two at worst.
My question is this: would partitioning the table by hash(user_id) help me boost lookup performance by taking advantage of partition pruning?
No, that wouldn't improve the speed of the query at all, since there is an index on that attribute. If anything, the increased planning time will slow down the query.
If you want to speed up that query as much as possible, create an index that supports both conditions:
CREATE INDEX ON playground (user_id, slug);
If slug is large, it may be preferable to index a hash:
CREATE INDEX ON playground (user_id, hashtext(slug));
and query like this:
SELECT *
FROM playground
WHERE user_id = '12345678'
  AND slug = 'some-slug'
  AND hashtext(slug) = hashtext('some-slug')
LIMIT 1;
Of course, partitioning could be a good idea for other reasons, for example to speed up autovacuum or CREATE INDEX.
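For reference, a minimal sketch of what hash partitioning by user_id would look like (four partitions chosen arbitrarily; note that a partitioned table's primary key must include the partition key, so the existing primary key on id alone could not be kept as is):
CREATE TABLE playground (
    id         varchar(32)  NOT NULL,
    user_id    varchar(255) NOT NULL,
    slug       varchar(255) NOT NULL,
    created_at timestamp    NOT NULL,
    updated_at timestamp    NOT NULL
) PARTITION BY HASH (user_id);

CREATE TABLE playground_p0 PARTITION OF playground FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE playground_p1 PARTITION OF playground FOR VALUES WITH (MODULUS 4, REMAINDER 1);
CREATE TABLE playground_p2 PARTITION OF playground FOR VALUES WITH (MODULUS 4, REMAINDER 2);
CREATE TABLE playground_p3 PARTITION OF playground FOR VALUES WITH (MODULUS 4, REMAINDER 3);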

Emojis as question marks in MariaDB primary key

With the following table, using mariadb-server-10.1 10.1.32+maria-1~trusty:
CREATE TABLE `tags` (
  `tag_name` varchar(150) COLLATE utf8mb4_unicode_ci NOT NULL,
  `thing_id` int(11) NOT NULL,
  PRIMARY KEY (`thing_id`, `tag_name`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;
Manifests as (in SQLAlchemy app):
IntegrityError: (IntegrityError)
(1062, "Duplicate entry '1532-?' for key 'PRIMARY'")
at the second emoji insertion attempt.
The first one seems to be ok in the db (believe me, it shows right in my console and my browser as "alien" emoji):
> select tag_name, HEX(tag_name) from tags;
+----------+----------------+
| tag_name | HEX(tag_name) |
+----------+----------------+
| GOODGUY | 474F4F44475559 |
| 👽 | F09F91BD |
+----------+----------------+
2 rows in set (0.00 sec)
I am aware of Emoji's in mysql turns to question marks, but my.cnf has:
default-character-set = utf8mb4
character-set-server = utf8mb4
collation-server = utf8mb4_unicode_ci
default-character-set = utf8mb4
and utf8mb4 is being used in the client (charset added to the connection string as ?charset=utf8mb4). And I believe the primary key is checked on the server side.
Is there anything else I am missing (can check) or is it some MariaDB bug or some additional configuration required?
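For completeness, the effective server-side settings can be double-checked with the standard variable queries:
SHOW VARIABLES LIKE 'character_set%';
SHOW VARIABLES LIKE 'collation%';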
Not sure whether it is relevant or not, but the same problem occurs when inserting via the mysql shell. I also tried this to see what is going on (found in mysql utf8mb4_unicode_ci cause unique key collision):
> SELECT '😗'='😆' COLLATE utf8mb4_unicode_ci;
+------------------------------------+
| '?'='?' COLLATE utf8mb4_unicode_ci |
+------------------------------------+
| 1 |
+------------------------------------+
but I am not sure whether this is relevant.
I do not understand: the database sometimes shows it fine (via both clients, the shell and the SQLAlchemy app), but then fails to show it in the header? From the evidence I have, I cannot tell where the bad conversion happens. The data seems to be fine in the database (see the hex above), yet two different emojis are treated as equivalent for the primary key?
One more to contrast:
> SELECT '😗'='😆' COLLATE utf8mb4_bin;
+-----------------------------+
| '?'='?' COLLATE utf8mb4_bin |
+-----------------------------+
| 0 |
+-----------------------------+
This kind of points the finger at the primary key not using a binary collation? Are all emojis converted to something else before being used in the index? Quite weird.
mysql> SELECT '😗'='😆' COLLATE utf8mb4_unicode_520_ci;
+----------------------------------------+
| '?'='?' COLLATE utf8mb4_unicode_520_ci |
+----------------------------------------+
| 0 |
+----------------------------------------+
Two things to note:
The 520 collation treats them as different.
The question marks are a display problem.
To further isolate that...
mysql> SELECT HEX('😆');
+----------+
| HEX('?') |
+----------+
| F09F9886 |
+----------+
My point is that the question mark may be a display problem, and (as you point out with the hex) it is not an encoding problem:
(1062, "Duplicate entry '1532-?' for key 'PRIMARY'")
And, if you want to tag 1532 with both emojis, that column needs to be utf8mb4_unicode_520_ci (or utf8mb4_bin).
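For example, a sketch of moving the tag column to the 520 collation (assuming that collation is available in your MariaDB build; the primary key will be rebuilt as part of the change):
ALTER TABLE `tags`
    MODIFY `tag_name` varchar(150)
    CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_520_ci NOT NULL;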

Alphanumeric Sorting in PostgreSQL

I have this table with a character varying column in Postgres 9.6:
 id | column
----+-----------
  1 | IR ABC-1
  2 | IR ABC-2
  3 | IR ABC-10
I see some solutions typecasting the column as bytea.
select * from table order by column::bytea;
But it always results in:
id | column
------------
1 |IR ABC-1
2 |IR ABC-10
3 |IR ABC-2
I don't know why '10' always comes before '2'. How do I sort this table, assuming the basis for ordering is the last whole number in the string, regardless of what the characters before that number are?
When sorting character data types, collation rules apply - unless you work with locale "C", which sorts characters by their byte values. Applying collation rules may or may not be desirable. It makes sorting more expensive in any case. If you want to sort without collation rules, don't cast to bytea; use COLLATE "C" instead:
SELECT * FROM table ORDER BY column COLLATE "C";
However, this does not yet solve the problem with numbers in the string you mention. Split the string and sort the numeric part as number.
SELECT *
FROM table
ORDER BY split_part(column, '-', 2)::numeric;
Or, if all your numbers fit into bigint or even integer, use that instead (cheaper).
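For example, the integer variant of the same ORDER BY (assuming every trailing number fits into int):
SELECT *
FROM table
ORDER BY split_part(column, '-', 2)::int;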
I ignored the leading part because you write:
... the basis for ordering is the last whole number of the string, regardless of what the character before that number is.
Related:
Alphanumeric sorting with PostgreSQL
Split comma separated column data into additional columns
What is the impact of LC_CTYPE on a PostgreSQL database?
Typically, it's best to save distinct parts of a string in separate columns as proper respective data types to avoid any such confusion.
And if the leading string is identical in all rows, consider just dropping the redundant noise. You can always use a VIEW to prepend a string for display, or do it on the fly, cheaply.
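For instance, a sketch of such a view, assuming the table were reduced to just a numeric column (the table and column names here are hypothetical):
CREATE VIEW items_display AS
SELECT id, 'IR ABC-' || item_nr AS label
FROM items;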
As in the comments, split and cast the integer part:
select *
from table
cross join lateral regexp_split_to_array(column, '-') r(a)
order by a[1], a[2]::integer;

Alphanumeric case in-sensitive sorting in postgres

I am new to Postgres and want to sort varchar type columns. I want to explain the problem with the example below:
table name: testsorting
order name
1 b
2 B
3 a
4 a1
5 a11
6 a2
7 a20
8 A
9 a19
case-sensitive sorting (which is the default in Postgres) gives:
select name from testsorting order by name;
A
B
a
a1
a11
a19
a2
a20
b
case-insensitive sorting gives:
select name from testsorting order by UPPER(name);
A
a
a1
a11
a19
a2
a20
B
b
How can I make alphanumeric case-insensitive sorting in Postgres to get the order below:
a
A
a1
a2
a11
a19
a20
b
B
I won't mind the order of capital and small letters, but the order should be "aAbB" or "AaBb" and should not be "ABab".
Please suggest if you have any solution to this in Postgres.
My PostgreSQL sorts the way you want. The way PostgreSQL compares strings is determined by locale and collation. When you create a database using createdb, there is the -l option to set the locale. You can also check how it is configured in your environment using psql -l:
[postgres#test]$ psql -l
                             List of databases
  Name   |  Owner   | Encoding |  Collate   |   Ctype    | Access privileges
---------+----------+----------+------------+------------+-------------------
 mn_test | postgres | UTF8     | pl_PL.UTF8 | pl_PL.UTF8 |
As you see my database uses Polish collation.
If you created the database using another collation, then you can use a different collation in a query, like:
SELECT * FROM sort_test ORDER BY name COLLATE "C";
SELECT * FROM sort_test ORDER BY name COLLATE "default";
SELECT * FROM sort_test ORDER BY name COLLATE "pl_PL";
You can list available collations by:
SELECT * FROM pg_collation;
EDITED:
Oh, I missed that 'a11' must come after 'a2'.
I don't think a standard collation can solve alphanumeric sorting. For such sorting you will have to split the string into parts, just like in Clodoaldo Neto's response. Another option, useful if you frequently have to order this way, is to separate the name field into two columns. You can create a trigger on INSERT and UPDATE that splits name into name_1 and name_2 (sketched below) and then:
SELECT name FROM sort_test ORDER BY name_1 COLLATE "en_EN", name_2;
(I changed the collation from Polish to English; you should use your native collation to sort letters like aącć etc.)
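A rough sketch of that trigger approach, assuming the two added columns are name_1 text and name_2 integer (the function name and the split expressions are my own guess at the intent):
CREATE OR REPLACE FUNCTION testsorting_split() RETURNS trigger AS $$
BEGIN
    -- leading non-digit part, e.g. 'a' from 'a11'
    NEW.name_1 := substring(NEW.name from '^[^0-9]*');
    -- trailing number, or NULL when there is none
    NEW.name_2 := nullif(substring(NEW.name from '[0-9]+'), '')::integer;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER testsorting_split_trg
    BEFORE INSERT OR UPDATE ON testsorting
    FOR EACH ROW EXECUTE PROCEDURE testsorting_split();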
If the name is always in the format of one alpha character followed by n numeric characters, then:
select name
from testsorting
order by
  upper(left(name, 1)),
  -- appending '0' turns an empty remainder into 0 (so plain 'a' or 'B' casts cleanly)
  -- and multiplies every number by 10, which preserves the numeric order
  (substring(name from 2) || '0')::integer;
PostgreSQL uses the C library locale facilities for sorting strings. The C library is provided by the host operating system. On Mac OS X or a BSD-family operating system, the UTF-8 locale definitions are broken, and hence the results are as per collation "C".
(The original answer attaches an image of the collation results with Ubuntu 15.04 as the host OS.)
Check the FAQ on the Postgres wiki for more details: https://wiki.postgresql.org/wiki/FAQ
For my part, I have used the PostgreSQL module citext and the data type CITEXT instead of TEXT. It makes both sorting and searching on these columns case-insensitive.
The module can be installed with the SQL command CREATE EXTENSION IF NOT EXISTS citext;
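A minimal sketch of what that looks like (a throwaway table for illustration; CITEXT orders case-insensitively, but it does not by itself fix the numeric part):
CREATE EXTENSION IF NOT EXISTS citext;

CREATE TABLE testsorting_ci (
    ord  integer,
    name citext
);

SELECT name FROM testsorting_ci ORDER BY name;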
I agree with Clodoaldo Neto's answer, but don't forget to also add a matching expression index:
CREATE INDEX testsorting_name ON testsorting (upper(left(name, 1)), ((substring(name from 2) || '0')::integer));
Answer strongly inspired by this one.
Using a function makes it easier to keep things clean if you need this in different queries.
CREATE OR REPLACE FUNCTION alphanum(str anyelement)
RETURNS anyelement AS $$
BEGIN
    -- Split into (leading non-digit part, number); the +2000000 offset keeps
    -- entries without a number (-1 becomes 1999999) in front of numbered ones.
    RETURN (SUBSTRING(str, '^[^0-9]*'),
            COALESCE(SUBSTRING(str, '[0-9]+')::INT, -1) + 2000000);
END;
$$ LANGUAGE plpgsql IMMUTABLE;
Then you could use it this way:
SELECT name FROM testsorting ORDER BY alphanum(name);
Test:
WITH x(name) AS (VALUES ('b'), ('B'), ('a'), ('a1'),
('a11'), ('a2'), ('a20'), ('A'), ('a19'))
SELECT name, alphanum(name) FROM x ORDER BY alphanum(name);
name | alphanum
------+-------------
a | (a,1999999)
A | (A,1999999)
a1 | (a,2000001)
a2 | (a,2000002)
a11 | (a,2000011)
a19 | (a,2000019)
a20 | (a,2000020)
b | (b,1999999)
B | (B,1999999)