Emojis as question marks in MariaDB primary key - unicode

With the following table, using mariadb-server-10.1 10.1.32+maria-1~trusty:
CREATE TABLE `tags` (
  `tag_name` varchar(150) COLLATE utf8mb4_unicode_ci NOT NULL,
  `thing_id` int(11) NOT NULL,
  PRIMARY KEY (`thing_id`,`tag_name`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;
This manifests as (in a SQLAlchemy app):
IntegrityError: (IntegrityError)
(1062, "Duplicate entry '1532-?' for key 'PRIMARY'")
at the second emoji insertion attempt.
The first insert seems to be fine in the db (believe me, it shows up correctly in my console and my browser as the "alien" emoji):
> select tag_name, HEX(tag_name) from tags;
+----------+----------------+
| tag_name | HEX(tag_name)  |
+----------+----------------+
| GOODGUY  | 474F4F44475559 |
| 👽       | F09F91BD       |
+----------+----------------+
2 rows in set (0.00 sec)
I am aware of "Emoji's in mysql turns to question marks", but my.cnf has:
default-character-set = utf8mb4
character-set-server = utf8mb4
collation-server = utf8mb4_unicode_ci
and utf8mb4 is being used in the client (the charset is added to the connection string as ?charset=utf8mb4). And I believe the primary key is checked on the server side.
Is there anything else I am missing (can check) or is it some MariaDB bug or some additional configuration required?
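As a sanity check (not shown in the original post), it is worth confirming what the session actually negotiated, since a client or connection character set of plain utf8 would also mangle emojis:
SHOW VARIABLES LIKE 'character_set%';
SHOW VARIABLES LIKE 'collation%';
All of character_set_client, character_set_connection and character_set_results should report utf8mb4.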
Not sure whether it is relevant or not, but the same problem occurs when inserting via the mysql shell. I also tried this to see what is going on (found in "mysql utf8mb4_unicode_ci cause unique key collision"):
> SELECT '😗'='😆' COLLATE utf8mb4_unicode_ci;
+------------------------------------+
| '?'='?' COLLATE utf8mb4_unicode_ci |
+------------------------------------+
|                                  1 |
+------------------------------------+
but I am not sure whether this is relevant.
What I do not understand: the database sometimes shows the emoji fine (via both clients, shell and SQLAlchemy app), but then fails to show it in the result header? From the evidence I have, I cannot tell where that bad conversion happens. The data seems to be fine in the database (see the hex above), yet the two emojis are equivalent for the primary key?
One more to contrast:
> SELECT '😗'='😆' COLLATE utf8mb4_bin;
+-----------------------------+
| '?'='?' COLLATE utf8mb4_bin |
+-----------------------------+
|                           0 |
+-----------------------------+
This kind of points the finger at the primary key not using a binary comparison? Are all emojis converted to something else before being used in the index? Quite weird.

mysql> SELECT '😗'='😆' COLLATE utf8mb4_unicode_520_ci;
+----------------------------------------+
| '?'='?' COLLATE utf8mb4_unicode_520_ci |
+----------------------------------------+
|                                      0 |
+----------------------------------------+
Two things to note:
- The 520 collation treats them as different.
- The question marks are a display problem.
To further isolate that...
mysql> SELECT HEX('😆');
+----------+
| HEX('?') |
+----------+
| F09F9886 |
+----------+
My point is that the question mark may be a display problem, and (as you point out with the hex) it is not an encoding problem:
(1062, "Duplicate entry '1532-?' for key 'PRIMARY'")
And, if you want to tag 1532 with both emojis, that column needs to be utf8mb4_unicode_520_ci (or utf8mb4_bin).
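A sketch of that fix (assuming the table definition from the question; adjust the column definition to match your schema):
ALTER TABLE `tags`
  MODIFY `tag_name` varchar(150)
  CHARACTER SET utf8mb4
  COLLATE utf8mb4_unicode_520_ci NOT NULL;
This rebuilds the primary key with the new comparison rules, so the two emojis no longer collide.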

Related

Is this INSERT statement containing SELECT subquery safe for multiple concurrent writes?

In Postgres, suppose I have the following table to be used like a singly linked list, where each row has a reference to the previous row.
Table node
   Column    |           Type           | Collation | Nullable |       Default
-------------+--------------------------+-----------+----------+----------------------
 id          | uuid                     |           | not null | gen_random_uuid()
 created_at  | timestamp with time zone |           | not null | now()
 name        | text                     |           | not null |
 prev_id     | uuid                     |           |          |
I have the following INSERT statement, which includes a SELECT subquery to look up the last row as data for the new row to be inserted.
INSERT INTO node(name, prev_id)
VALUES (
    :name,
    (
        SELECT id
        FROM node
        ORDER BY created_at DESC
        LIMIT 1
    )
)
RETURNING id;
I understand storing prev_id may seem redundant in this example (the ordering can be derived from created_at), but that is beside the point. My question: Is the above INSERT statement safe for multiple concurrent writes? Or is it necessary to explicitly use LOCK in some way?
For clarity, by "safe" I mean: is it possible that by the time the SELECT subquery has executed and found the "last row", another concurrent query has just finished an insert, so the "last row" found earlier is no longer the last row, and this insert would use the wrong "last row" value? The effect would be multiple rows sharing the same prev_id value, which is invalid for a linked-list structure.
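For illustration, a sketch (not from the original post) of one way to serialize such writers with a transaction-scoped advisory lock; the lock key 42 is an arbitrary application-chosen constant:
BEGIN;
-- Every writer takes the same advisory lock, so only one
-- "find last row, then insert" sequence runs at a time.
SELECT pg_advisory_xact_lock(42);
INSERT INTO node(name, prev_id)
VALUES (
    :name,
    (SELECT id FROM node ORDER BY created_at DESC LIMIT 1)
)
RETURNING id;
COMMIT; -- the advisory lock is released automatically at commit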

Getting duplicate rows when querying Cloud SQL in AppMaker

I migrated from Drive tables to a 2nd gen MySQL Google Cloud SQL data model. I was able to insert 19 rows into the following Question table in AppMaker:
+-------------------+--------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------------------+--------------+------+-----+---------+-------+
| SurveyType | varchar(64) | NO | PRI | NULL | |
| QuestionNumber | int(11) | NO | PRI | NULL | |
| QuestionType | varchar(64) | NO | | NULL | |
| Question | varchar(512) | NO | | NULL | |
| SecondaryQuestion | varchar(512) | YES | | NULL | |
+-------------------+--------------+------+-----+---------+-------+
I queried the data from the command line and know it is good. However, when I query the data in AppMaker like this:
var newQuery = app.models.Question.newQuery();
newQuery.filters.SurveyType._equals = surveyType;
newQuery.sorting.QuestionNumber._ascending();
var allRecs = newQuery.run();
I get 19 rows with the same data (the first row) instead of 19 different rows. Any idea what is wrong? Additionally (and possibly related), my list rows in AppMaker are not showing any data. I did notice that _key is not being set correctly in the records.
(Edit: I thought maybe having two columns as the primary key was the problem, but I tried having the PK be a single identity column, same result.)
Thanks for any tips or pointers.
You have two primary key fields in your table, which is problematic according to the App Maker Cloud SQL documentation: https://developers.google.com/appmaker/models/cloudsql
App Maker can only write to tables that have a single primary key field—If you have an existing Google Cloud SQL table with zero or multiple primary keys, you can still query it in App Maker, but you can't write to it.
This may account for the inability of the view to be able to properly display each row and to properly set the _key.
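If writing from App Maker is needed, one possible workaround (a sketch, assuming the table can be altered in place; the column and index names are illustrative) is to demote the composite key to a unique index and add a single auto-increment primary key:
ALTER TABLE Question
  DROP PRIMARY KEY,
  ADD COLUMN Id INT NOT NULL AUTO_INCREMENT PRIMARY KEY FIRST,
  ADD UNIQUE KEY uq_survey_question (SurveyType, QuestionNumber);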
I was able to get this to work by creating the table inside AppMaker rather than using a table created directly in the Cloud Shell. Not sure if existing tables are not supported or if there is a bug in AppMaker, but since it is working I am closing this.

Postgres ordering of UTF-8 characters

I'm building a small app that includes Esperanto words in my database, so I have words like ĉapelojn and brakhorloĝo, with "special" characters.
Using PostgreSQL 9.4.4 I have a words table with the following schema:
lingvoj_dev=# \d words
                                Table "public.words"
   Column    |            Type             |                      Modifiers
-------------+-----------------------------+-----------------------------------------------------
 id          | integer                     | not null default nextval('words_id_seq'::regclass)
 translated  | character varying(255)      |
 meaning     | character varying(255)      |
 times_seen  | integer                     |
 inserted_at | timestamp without time zone | not null
 updated_at  | timestamp without time zone | not null
Indexes:
    "words_pkey" PRIMARY KEY, btree (id)
But the following query gives some strange output:
lingvoj_dev=# SELECT w."translated" FROM "words" AS w ORDER BY w."translated" desc limit 10;
translated
------------
ĉu
ŝi
ĝi
ĉevaloj
ĉapelojn
ĉapeloj
ĉambro
vostojn
volas
viro
(10 rows)
The ordering is inconsistent - I'd be okay with all of the words starting with special characters being at the end, but all of the words starting with ĉ should be grouped together and they're not! Why do ŝi and ĝi come in between ĉu and ĉevaloj?
The server encoding is UTF8, and the collation is en_AU.UTF-8.
edit: It looks like it's sorting all of the special characters as equivalent - it's ordering correctly based on the second character in each word. How do I make PostgreSQL see that ĉ, ŝ and ĝ are not equivalent?
I'd be okay with all of the words starting with special characters
being at the end...
Use collate "C":
SELECT w."translated"
FROM "words" AS w
ORDER BY w."translated" collate "C" desc limit 10;
See also Different behaviour in “order by” clause: Oracle vs. PostgreSQL
The query can be problematic when using an ORM. The solution may be to recreate the database with the LC_COLLATE = 'C' option, as suggested by the OP in the comments. There is one more option: change the collation for a single column:
ALTER TABLE "words" ALTER COLUMN "translated" TYPE text COLLATE "C";

Postgres returns records in wrong order

I have a pretty big table, messages. It contains about 100 million records.
When I run a simple query:
select uid from messages order by uid asc limit 1000
The result is very strange. Records at the beginning are OK, but further down they are not always ordered by the uid column.
uid
----------
94621458
94637590
94653611
96545014
96553145
96581957
96590621
102907437
.....
446131576
459475933
424507749
424507166
459474125
431059132
440517049
446131301
475651666
413687676
.....
Here is the EXPLAIN ANALYZE output:
Limit  (cost=0.00..3740.51 rows=1000 width=4) (actual time=0.009..4.630 rows=1000 loops=1)
  Output: uid
  ->  Index Scan using messages_pkey on public.messages  (cost=0.00..376250123.91 rows=100587944 width=4) (actual time=0.009..4.150 rows=1000 loops=1)
        Output: uid
Total runtime: 4.796 ms
PostgreSQL 9.1.12
The table is always under high load (inserts, updates, deletes) and almost constantly autovacuuming. May that cause the problem?
UPD: Added the table definition. Sorry, I cannot add the full table definition, but all important fields and their types are here.
# \d+ messages
                                           Table "public.messages"
    Column    |            Type             |                        Modifiers                         | Storage  | Description
--------------+-----------------------------+----------------------------------------------------------+----------+-------------
 uid          | integer                     | not null default nextval('messages_uid_seq'::regclass)  | plain    |
 code         | character(22)               | not null                                                 | extended |
 created      | timestamp without time zone | not null                                                 | plain    |
 accountid    | integer                     | not null                                                 | plain    |
 status       | character varying(20)       | not null                                                 | extended |
 hash         | character(3)                | not null                                                 | extended |
 xxxxxxxx     | timestamp without time zone | not null                                                 | plain    |
 xxxxxxxx     | integer                     |                                                          | plain    |
 xxxxxxxx     | character varying(255)      |                                                          | extended |
 xxxxxxxx     | text                        | not null                                                 | extended |
 xxxxxxxx     | character varying(250)      | not null                                                 | extended |
 xxxxxxxx     | text                        |                                                          | extended |
 xxxxxxxx     | text                        | not null                                                 | extended |
Indexes:
    "messages_pkey" PRIMARY KEY, btree (uid)
    "messages_unique_code" UNIQUE CONSTRAINT, btree (code)
    "messages_accountid_created_idx" btree (accountid, created)
    "messages_accountid_hash_idx" btree (accountid, hash)
    "messages_accountid_status_idx" btree (accountid, status)
Has OIDs: no
Here's a very general answer:
Try:
SET enable_indexscan TO off;
Rerun the query in the same session.
If the order of the results is different from what you get with enable_indexscan on, then the index is corrupted.
In this case, fix it with:
REINDEX INDEX index_name;
Save yourself the long wait of trying to run the query without an index. The problem is most probably due to a corrupted index. Repair it right away and see if that fixes the problem.
Since your table is always under high load, consider building a new index concurrently. Takes a bit longer, but does not block concurrent writes. Per documentation on REINDEX:
To build the index without interfering with production you should drop
the index and reissue the CREATE INDEX CONCURRENTLY command.
And under CREATE INDEX:
CONCURRENTLY
When this option is used, PostgreSQL will build the index without
taking any locks that prevent concurrent inserts, updates, or deletes
on the table; whereas a standard index build locks out writes (but not
reads) on the table until it's done. There are several caveats to be
aware of when using this option — see Building Indexes Concurrently.
So I suggest:
ALTER TABLE messages DROP CONSTRAINT messages_pkey;
CREATE UNIQUE INDEX CONCURRENTLY messages_pkey ON messages (uid);
ALTER TABLE messages ADD PRIMARY KEY USING INDEX messages_pkey;
The last step is just a tiny update to the system catalogs.
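As a side note beyond the original answer: on PostgreSQL 12 and later (not available on the 9.1 in the question), the whole rebuild collapses into a single statement:
REINDEX INDEX CONCURRENTLY messages_pkey;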

How do I know if the statistics of a Postgres table are up to date?

In pgAdmin, whenever a table's statistics are out-of-date, it prompts:
Running VACUUM recommended
The estimated rowcount on the table schema.table deviates significantly from the actual rowcount. You should run VACUUM ANALYZE on this table.
I've tested it using pgAdmin 3 and Postgres 8.4.4, with autovacuum=off. The prompt shows up immediately whenever I click a table that has been changed.
Let's say I'm making a web-based system in Java; how do I detect whether a table is out-of-date, so that I can show a prompt like the one in pgAdmin?
Because of the nature of my application, here are a few rules I have to follow:
I want to know if the statistics of a certain table in pg_stats and pg_statistic are up to date.
I can't set the autovacuum flag in postgresql.conf. (In other words, the autovacuum flag can be on or off. I have no control over it. I need to tell if the stats are up-to-date whether the autovacuum flag is on or off.)
I can't run vacuum/analyze every time to make it up-to-date.
When a user selects a table, I need to show the prompt that the table is outdated when there are any updates to this table (such as drop, insert, and update) that are not reflected in pg_stats and pg_statistic.
It seems that it's not feasible by analyzing timestamps in pg_catalog.pg_stat_all_tables. Of course, if a table hasn't been analyzed before, I can check if it has a timestamp in last_analyze to find out whether the table is up-to-date. Using this method, however, I can't detect if the table is up-to-date when there's already a timestamp. In other words, no matter how many rows I add to the table, its last_analyze timestamp in pg_stat_all_tables is always for the first analyze (assuming the autovacuum flag is off). Therefore, I can only show the "Running VACUUM recommended" prompt for the first time.
It's also not feasible by comparing the last_analyze timestamp to the current timestamp. There might not be any updates to the table for days. And there might be tons of updates in one hour.
Given this scenario, how can I always tell if the statistics of a table are up-to-date?
Check the system catalogs.
=> SELECT schemaname, relname, last_autoanalyze, last_analyze FROM pg_stat_all_tables WHERE relname = 'accounts';
 schemaname | relname  |       last_autoanalyze        | last_analyze
------------+----------+-------------------------------+--------------
 public     | accounts | 2022-11-22 07:49:16.215009+00 |
(1 row)
https://www.postgresql.org/docs/current/monitoring-stats.html#MONITORING-PG-STAT-ALL-TABLES-VIEW
All kinds of useful information in there:
test=# \d pg_stat_all_tables
      View "pg_catalog.pg_stat_all_tables"
      Column       |           Type           | Modifiers
-------------------+--------------------------+-----------
 relid             | oid                      |
 schemaname        | name                     |
 relname           | name                     |
 seq_scan          | bigint                   |
 seq_tup_read      | bigint                   |
 idx_scan          | bigint                   |
 idx_tup_fetch     | bigint                   |
 n_tup_ins         | bigint                   |
 n_tup_upd         | bigint                   |
 n_tup_del         | bigint                   |
 n_tup_hot_upd     | bigint                   |
 n_live_tup        | bigint                   |
 n_dead_tup        | bigint                   |
 last_vacuum       | timestamp with time zone |
 last_autovacuum   | timestamp with time zone |
 last_analyze      | timestamp with time zone |
 last_autoanalyze  | timestamp with time zone |
 vacuum_count      | bigint                   |
 autovacuum_count  | bigint                   |
 analyze_count     | bigint                   |
 autoanalyze_count | bigint                   |
You should not have to worry about vacuuming in your application. Instead, you should have the autovacuum process configured on your server (in postgresql.conf), and the server takes care of the VACUUM and ANALYZE processes based on its own internal statistics. You can configure how often it should run, and what the threshold variables are for it to trigger.
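If the application still needs to detect stale statistics itself, here is a sketch (assuming PostgreSQL 9.4 or later, which added the n_mod_since_analyze counter; the question's 8.4 predates it) that flags tables using roughly the same threshold logic autovacuum applies:
SELECT schemaname,
       relname,
       n_live_tup,
       n_mod_since_analyze,
       last_analyze,
       last_autoanalyze
FROM   pg_stat_all_tables
-- 50 + 10% mirrors the default autovacuum_analyze_threshold
-- and autovacuum_analyze_scale_factor settings
WHERE  n_mod_since_analyze > 50 + 0.1 * n_live_tup;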