Optimizing PostgreSQL performance when using UUIDs as primary keys - postgresql

I understand that using UUIDs as primary keys can potentially have adverse performance implications when compared to sequential integer values.
I did some tests on my machine and observed that various operations (at considerable scale) were indeed quite a bit slower.
I had a table with sequential integer primary keys and inserted 20 million records; this completed in 1 minute and 55 seconds. I then dropped the table and created the same table again, but this time with UUID primary keys. Inserting 20 million records took 6 minutes and 44 seconds.
Currently, I am configuring the primary key column with a uuid data type and the default value is set to gen_random_uuid() - so the UUIDs are being generated at the database level, not the application level.
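Roughly, the current definition looks like this (column names are just placeholders):

    CREATE TABLE events_uuid (
        id      uuid PRIMARY KEY DEFAULT gen_random_uuid(),  -- built in since PostgreSQL 13, pgcrypto before that
        payload text
    );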
I was wondering whether there are any suggestions for optimizing the use of UUIDs as primary keys. For example, would it help if the PK were an integer, but another (indexed) field contained a UUID, specifically for public exposure?
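For example, something along these lines, with a bigint surrogate key used internally and a separately indexed UUID only for public exposure (names are purely illustrative):

    CREATE TABLE events (
        id        bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,     -- PostgreSQL 10+
        public_id uuid NOT NULL DEFAULT gen_random_uuid() UNIQUE,      -- exposed to clients
        payload   text
    );

    -- external lookups go through the UUID:
    --   SELECT ... FROM events WHERE public_id = $1;
    -- internal joins and foreign keys use the compact bigint id.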
I'm also open to other ideas for a non-sequential PK that would perform better.
(I'm not working with data of this scale just yet; it's more of a theoretical question.)

UUIDs are slower than keys generated by a sequence. You'll just have to accept that; there is no way around it. For that reason, you should use UUIDs only if you have a compelling reason, such as keys being generated outside the database or needing to be unique across several databases.
There is some deeper discussion of this in my article.

Related

Partitioning instead of thousand tables with identical structure

I have a legacy but pretty big (~25 GB) database of questionable design. The "pattern" commonly used across the whole database boils down to the following:
separate logical parts into different tables (journal_1, journal_2, journal_n)
all tables have a unique bigserial/autoincrement field (journal_id_seq_1, journal_id_seq_2, journal_id_seq_n)
all tables have one or several foreign keys to one or several reference tables (journal tables have two foreign keys; a group of tables with a different structure (log_1, log_2, log_n) references just one)
I'm extremely curious (actually close to panic :) about what happens if there are about 50 thousand such tables (right now there are "just" about 15k).
My idea is to get everything together (the tables with identical structure) in one huge table with a common name (let's say journal), add a journal_id column (extracted from the journal_{1|2|3} suffix), partition by that column and, obviously, create a partition for each old table following the same naming convention. Moreover, the bigserial fields need to be converted into regular bigints, but I still need to keep a sequence for each partition and manually call nextval on every insert. The primary key also needs to be extended with the journal_id field in addition to seq_id. Finally, I see a bonus in sharding, which can be applied to the partitions when the database becomes enormous.
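A rough sketch of what I have in mind (columns simplified; this assumes declarative partitioning, available since PostgreSQL 10, with primary keys on partitioned tables since 11):

    CREATE TABLE journal (
        journal_id bigint NOT NULL,          -- extracted from the old journal_{n} suffix
        seq_id     bigint NOT NULL,          -- formerly the per-table bigserial value
        payload    text,
        PRIMARY KEY (journal_id, seq_id)     -- the partition key must be part of the PK
    ) PARTITION BY LIST (journal_id);

    CREATE TABLE journal_1 PARTITION OF journal FOR VALUES IN (1);
    CREATE TABLE journal_2 PARTITION OF journal FOR VALUES IN (2);

    -- one sequence kept per partition, called manually on insert:
    CREATE SEQUENCE journal_id_seq_1;
    INSERT INTO journal (journal_id, seq_id, payload)
    VALUES (1, nextval('journal_id_seq_1'), '...');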
Please share your thoughts about this strategy, and especially about foreign keys. For now we need max_locks_per_transaction set to at least 512, otherwise pg_dump fails with
    ERROR: out of shared memory HINT: You might need to increase max_locks_per_transaction. pg_dump: error: query was: LOCK TABLE
Besides the locking nightmare, as far as I know, Postgres has a limit on the number of relations per database (the total is huge but not unlimited). Do I need to create foreign keys for each partition table, or will only a part (certain rows) of the partitioned (general) table be locked on insert, delete or update, since all partitions are just "storage" and not real relational entities?
Thank you in advance.
15K tables == Gag!
Partitioning is not likely to be any better than multiple tables.
Neither provides any performance benefits except in rare cases.
Let's investigate the future need for sharding. That alone may justify the existence of journal_nnn: some journals would be on one machine, some on another machine, etc. Are all journals in active use? Or are most of them "old" and not really used?
The PRIMARY KEY can be a composite of two (or more) columns.
Auto-increment keys (SERIAL/IDENTITY in Postgres) have some advantages over manually creating "serial numbers". (However, the question does not have enough details for me to elaborate.)
FOREIGN KEYs give you two things: an index (note that PostgreSQL, unlike some other databases, does not create an index on the referencing columns automatically) and a constraint (good for integrity). In a well-debugged app, the integrity checks are unnecessary overhead. They must be abandoned in partitioning and probably in sharding.
Why do you use partitioning for such a small database? Your average table is less than 2 MB in size; that's really, really small.
Get rid of the partitioning and your problems are gone.
Having 50000 tables starts to get painful, and it makes no sense with a small database like this. The same holds for partitioning – after all, partitions are tables with a side job.
I would define only one table per object type.
About the auto-generated primary key numbers: make a combined primary key that consists of the old primary key and the table number (journal_id). For new entries, use a sequence that is initialized higher than the existing maximum of all tables.
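For example (a sketch; the START value below is a placeholder for the actual maximum across the old tables):

    CREATE TABLE journal (
        journal_id bigint NOT NULL,   -- the old table number
        seq_id     bigint NOT NULL,   -- the old per-table id value
        payload    text,
        PRIMARY KEY (journal_id, seq_id)
    );

    -- new rows draw from one sequence initialized above all existing maxima:
    CREATE SEQUENCE journal_seq_id START 100000000;   -- placeholder start value
    ALTER TABLE journal ALTER COLUMN seq_id SET DEFAULT nextval('journal_seq_id');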

What is the impact of not creating a primary key on a table in Redshift?

What is the impact of not creating a primary key on a table in Redshift, especially on tables that record web logs and are around 1 TB in size?
In Redshift, a primary key MAY be useful if you need it for queries, for example when joining to another table on the primary key.
However, there is no other benefit. If you don't need it for queries, it serves no other purpose.
The impact of removing it: it will reduce the amount of storage space consumed a little; however, it won't speed up any queries that do not need it!
Removing it may also affect tools such as DMS (if you use that).
Primary/unique and foreign keys can affect the query plan and in some cases result in performance gains. The keys are not enforced by Redshift, but the planner assumes they are correct. If the keys are incorrect and the plan assumes uniqueness, this can produce incorrect results. Only use key constraints when your upstream systems will not cause constraint violations.
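For illustration, a key can be declared for the planner's benefit even though Redshift will not enforce it (the table, columns and key choices here are made up):

    CREATE TABLE web_logs (
        request_id BIGINT NOT NULL,
        logged_at  TIMESTAMP,
        url        VARCHAR(2048),
        PRIMARY KEY (request_id)   -- informational only: accepted but not enforced by Redshift
    )
    DISTKEY (request_id)
    SORTKEY (logged_at);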

Does a UUID primary key worsen read queries if the table contains 'hot' and 'cold' data in postgresql?

I'm architecting a table which will store events. Each event will be about 100-500 bytes, and we expect about 500 million events per year. The app's lifetime should be 3+ years. The newest events are "hot": during the month after an event occurs, it may be fetched extensively by different queries for processing. Older events may be fetched as well, but very rarely, so they are "cold". At first I decided to use a UUID primary key for this table, but now I'm afraid that using UUIDs could ruin read performance for the "hot" data because of how Postgres stores pages on disk. Are my fears justified or not?
It does not matter what data type you choose for your primary key – it will just be a couple of bytes on disk.
What I'd look into is partitioning. If you normally access new entries, you could partition by date. But this will only help if you can add a clause like WHERE creationdate > '....' to the queries that access the entries, because then the search will be limited to those partitions that match the condition. Partitioning would also make it easy to remove old data.
Unfortunately, partitioning is not built into PostgreSQL (yet) and still takes a lot of hand-rolling. Moreover, certain things are lacking, such as global indexes. But if you use UUIDs as the primary key (to complete the circle and come back to your question), you wouldn't have duplicate entries anyway.
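On PostgreSQL versions with built-in declarative partitioning (11 or later, since this sketch puts a primary key on the partitioned table), the date-based layout described above could look roughly like this (table and column names are made up):

    CREATE TABLE events (
        id         uuid NOT NULL DEFAULT gen_random_uuid(),
        created_at timestamptz NOT NULL,
        payload    text,
        PRIMARY KEY (id, created_at)          -- the partition key must be part of the PK
    ) PARTITION BY RANGE (created_at);

    CREATE TABLE events_2024_01 PARTITION OF events
        FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');

    -- queries that filter on the partition key only touch the "hot" partitions:
    --   SELECT ... FROM events WHERE created_at > now() - interval '1 month';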

postgresql immutable read workload tuning

I have a table where the non-primary key columns are deterministic given the primary key.
I think this could be pretty common, for example a table representing memoization/caching of an expensive function, or where the primary key is a hash of the other columns.
Further assume that the workload is mostly reads of 1-100 individual rows, and that writes can be batched or "async" based on what gives the best performance.
What are interesting tuning options on the table/database in this case?
This would be an ideal candidate for index-only scans in versions 9.2 and up: create an index on all the primary key columns plus the frequently queried other columns. Vacuum the table aggressively (i.e., manually after every batch update), because the default autovacuum settings are not aggressive enough to get the maximum benefit from index-only scans.
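A sketch of that setup, with invented table and column names:

    CREATE TABLE memo (
        arg_hash bytea,               -- hash of the inputs, used as the key
        result   text,                -- deterministic, frequently read value
        PRIMARY KEY (arg_hash)
    );

    -- index covering the key plus the frequently read column,
    -- so point lookups can be answered by an index-only scan:
    CREATE INDEX memo_ios_idx ON memo (arg_hash, result);

    -- after each batch load, vacuum so the visibility map is current
    -- and index-only scans can skip the heap:
    VACUUM ANALYZE memo;

On PostgreSQL 11 and later, CREATE INDEX ... INCLUDE (result) expresses the same intent without making the payload column part of the index key.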

Future proof Primary Key design in postgresql

I've always used either auto-generated keys or sequences in the past for my primary keys. With the current system I'm working on, there is the possibility of eventually having to partition the data, which has never been a requirement in the past. Knowing that I may need to partition the data in the future, is there any advantage to using UUIDs for PKs instead of the database's built-in sequences? If so, is there a design pattern that can safely generate relatively short keys (say 6 characters instead of the usual long one, e6709870-5cbc-11df-a08a-0800200c9a66)? 36^6 keys per table is more than sufficient for any table I could imagine.
I will be using the keys in URLs so conciseness is important.
There is no pattern to reduce a 128-bit UUID to 6 characters, since information would be lost.
Almost all databases implement a surrogate key strategy based on incremental keys.
Postgres and Informix have serials, MySQL has auto_increment, and Oracle offers sequence generators. In your case I think it would be safe to use integer IDs.
See the article Choosing a Primary Key: Natural or Surrogate? for a discussion of the available techniques.
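For example, in Postgres that can be as simple as (illustrative table name):

    CREATE TABLE account (
        id   bigserial PRIMARY KEY,   -- backed by an automatically created sequence
        name text
    );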
I'm not sure what type of partitioning you are planning (this?), but I don't see why the primary key design would have to change. Even if the old partitioned tables remain "alive" (i.e., you might insert rows into any partition), there is no problem in sharing the sequence among several tables.
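For example, several tables can draw their default values from the same sequence, so the generated keys never collide across partitions (names are illustrative):

    CREATE SEQUENCE order_id_seq;

    CREATE TABLE orders_2023 (
        id        bigint PRIMARY KEY DEFAULT nextval('order_id_seq'),
        placed_at date
    );

    CREATE TABLE orders_2024 (
        id        bigint PRIMARY KEY DEFAULT nextval('order_id_seq'),
        placed_at date
    );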