I have a table where the non-primary key columns are deterministic given the primary key.
I think this could be pretty common, for example a table representing memoization/caching of an expensive function, or where the primary key is a hash of the other columns.
Further assume that the workload is mostly reads of 1-100 individual rows, and that writes can be batched or "async" based on what gives the best performance.
What are interesting tuning options on the table/database in this case?
This would be an ideal candidate for index-only scans in PostgreSQL 9.2 or later: create an index on all the primary key columns plus the frequently queried other columns. Vacuum the table aggressively (i.e. manually after every batch update), because an index-only scan can skip the heap only for pages that the visibility map marks as all-visible, and the default autovacuum settings are not aggressive enough to get the maximum benefit.
I have a legacy, fairly big (~25 GB), questionably designed database. The "pattern" used across the whole database boils down to the following:
separate logical parts into different tables (journal_1, journal_2, journal_n)
all tables have a unique bigserial/autoincrement field (journal_id_seq_1, journal_id_seq_2, journal_id_seq_n)
all tables have one or several foreign keys to one or several reference tables (the journal tables have 2 foreign keys; another group of tables with a different structure (log_1, log_2, log_n) references just one)
I'm extremely curious (actually close to panic :) about what happens when there are about 50 thousand such tables (right now there are "just" about 15k).
My idea is to merge all tables with identical structure into one huge table with a common name (say, journal), add a journal_id column (extracted from the suffix in journal_{1|2|3}), partition by this column, and create one partition per original table using the same naming convention. The bigserial fields would have to be converted into regular bigints, but I would still need to keep a sequence for each partition and call nextval manually on every insert. The primary key would also need to be extended to include journal_id in addition to seq_id. Finally, I see a bonus in sharding, which could be applied to the partitions when the database becomes enormous.
Please share your thoughts about this strategy, especially about foreign keys. For now we need max_locks_per_transaction set to at least 512, otherwise pg_dump fails with:
ERROR: out of shared memory
HINT: You might need to increase max_locks_per_transaction.
pg_dump: error: query was: LOCK TABLE
Besides the locking nightmare, as far as I know, Postgres has limits on the number of relations per database (the total is huge but not unlimited). Do I need to create foreign keys for each partition table, or will only a part (certain rows) of the partitioned (parent) table be locked on insert, delete, or update, because the partitions are just "storage" rather than real relational entities?
Thank you in advance.
15K tables == Gag!
Partitioning is not likely to be any better than multiple tables.
Neither provides any performance benefits except in rare cases.
Let's investigate the future need for sharding. That, alone, may justify the existence of journal_nnn. In this, some journals would be on one machine, some on another machine, etc. Are all journals in active use? Or are most of them "old" and not really used?
The PRIMARY KEY can be a composite of two (or more) columns.
AUTO_INCREMENT has some advantages over manually creating "serial numbers". (However, the Question does not have enough details for me to elaborate.)
FOREIGN KEYs are two things: an implied INDEX (good for performance) and a constraint (good for integrity). In a well-debugged app, the integrity checks are unnecessary overhead. They must be abandoned in partitioning and probably in sharding.
Why do you use partitioning for such a small database? At about 25 GB spread over roughly 15,000 tables, your average table is less than 2 MB in size, which is really, really small.
Get rid of the partitioning and your problems are gone.
Having 50000 tables starts to get painful, and it makes no sense with a small database like this. The same holds for partitioning – after all, partitions are tables with a side job.
I would define only one table per object type.
About the auto-generated primary key numbers: make a combined primary key that consists of the old primary key and the table number (journal_id). For new entries, use a sequence that is initialized higher than the existing maximum of all tables.
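The combined-key idea can be sketched roughly as follows. This is only an illustration, not a migration script: it uses an in-memory SQLite database so that it runs anywhere, the table and column names journal_1, journal_2, journal, journal_id and seq_id are taken from the question, and the payload column and the sample rows are made up. In Postgres, the value returned at the end is what you would pass as the start value of the new sequence.

```python
import sqlite3


def merge_journals(conn, table_numbers):
    """Copy every journal_<n> table into one journal table.

    The merged table is keyed by (journal_id, seq_id), so old per-table
    ids that collide across tables stay distinct.
    Returns a start value for a new sequence that is higher than every
    existing seq_id.
    """
    conn.execute(
        "CREATE TABLE journal ("
        " journal_id INTEGER NOT NULL,"  # former table suffix
        " seq_id INTEGER NOT NULL,"      # former per-table primary key
        " payload TEXT,"                 # hypothetical data column
        " PRIMARY KEY (journal_id, seq_id))"
    )
    for n in table_numbers:
        # Table names cannot be bound as parameters, so the suffix is
        # forced to an int before it is interpolated.
        conn.execute(
            f"INSERT INTO journal (journal_id, seq_id, payload) "
            f"SELECT ?, seq_id, payload FROM journal_{int(n)}",
            (n,),
        )
    (max_seq,) = conn.execute(
        "SELECT COALESCE(MAX(seq_id), 0) FROM journal"
    ).fetchone()
    return max_seq + 1


# Made-up sample data: two legacy tables whose ids overlap.
conn = sqlite3.connect(":memory:")
sample = {1: [(1, "a"), (2, "b")], 2: [(1, "c"), (2, "d"), (3, "e")]}
for n, rows in sample.items():
    conn.execute(
        f"CREATE TABLE journal_{n} (seq_id INTEGER PRIMARY KEY, payload TEXT)"
    )
    conn.executemany(f"INSERT INTO journal_{n} VALUES (?, ?)", rows)

next_seq_start = merge_journals(conn, [1, 2])
print(next_seq_start)
```

Because the old ids are kept unchanged, existing references to a row only need the table number added to them.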
My team is looking at moving our non-partitioned table with ~1 TB of data over to a partitioned table.
We would be using range partitioning based on a timestamp column.
One thing I don't understand is whether we need to add an index on the timestamp column if it's being used as the partition key. If we make our partitions quite small (e.g. partition for every day), would this act in a similar way to an index?
We would only be doing queries on a maximum resolution of one day.
I am reluctant to add an index as we've tried this in the past and it never completed (probably because we didn't turn off writes, and turning off writes for an extended period is not really an option).
Your feeling is right: omitting the index on the partitioning column is one of the few places where partitioning actually makes queries faster.
You can then get away with a sequential scan of a single partition, and you don't have to maintain the index with every data modifying statement.
The other advantage is that partitioning makes mass deletion of data (along the partition boundaries) so much more efficient. And finally, autovacuum's job will become easier.
Two points about partitioning:
Upgrade to v12; there have been substantial performance improvements that concern partitioning.
Don't use too many partitions. With v12, you can probably go up to a few thousand; in earlier versions you will run into performance problems with fewer partitions than that.
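As a rough sketch of what daily range partitioning looks like in practice, the helper below only generates the DDL text for declarative partitioning (PostgreSQL 10 or later); it does not talk to a database. The table name events, the column created_at and the partition naming scheme are assumptions for illustration, not something from the question.

```python
from datetime import date, timedelta

# Hypothetical parent table, partitioned by range on the timestamp column.
PARENT_DDL = (
    "CREATE TABLE events ("
    "id bigint, created_at timestamp NOT NULL, payload text"
    ") PARTITION BY RANGE (created_at);"
)


def partition_name(parent, day):
    """Name a daily partition like events_20200131 (naming is an assumption)."""
    return f"{parent}_{day:%Y%m%d}"


def daily_partition_ddl(parent, first_day, days):
    """Return one CREATE TABLE ... PARTITION OF statement per day.

    The lower bound is inclusive and the upper bound exclusive, so
    consecutive days do not overlap.
    """
    statements = []
    for offset in range(days):
        start = first_day + timedelta(days=offset)
        end = start + timedelta(days=1)
        statements.append(
            f"CREATE TABLE {partition_name(parent, start)} "
            f"PARTITION OF {parent} "
            f"FOR VALUES FROM ('{start.isoformat()}') TO ('{end.isoformat()}');"
        )
    return statements


def drop_partition_ddl(parent, day):
    """Mass deletion along a partition boundary: drop the whole day."""
    return f"DROP TABLE {partition_name(parent, day)};"


ddl = daily_partition_ddl("events", date(2020, 1, 30), 3)
for statement in ddl:
    print(statement)
print(drop_partition_ddl("events", date(2020, 1, 30)))
```

A query restricted to a single day then only has to scan that day's partition, and removing a day of data is a DROP TABLE rather than a large DELETE. With one partition per day, the "few thousand" limit mentioned above corresponds to several years of data.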
What is the impact of not creating a primary key on a table in Redshift, especially on tables that record web logs and are about 1 TB in size?
In Redshift, a primary key MAY be useful if you need it for queries, for example referring to the primary key when joining to another table.
However, there is no other benefit. If you don't need it for queries then you don't need it and it serves no other purpose.
The impact of removing it: it will reduce the amount of storage space consumed a little, but it won't speed up any queries that do not need it!
Removing it may affect tools such as DMS (if you use that).
Primary/unique and foreign keys can affect the query plan and in some cases result in performance gains. The keys are not enforced by Redshift, but it assumes that they are correct. If the keys are incorrect and the plan assumes uniqueness, queries could return incorrect results. Only use key constraints when your upstream systems will not cause key constraint violations.
The official explanation is:
The benefits will normally be worthwhile only when a table would otherwise be very large. The exact point at which a table will benefit from partitioning depends on the application, although a rule of thumb is that the size of the table should exceed the physical memory of the database server.
When is a table "very large"? How do I judge that?
And what does the rule of thumb mean, that the size of the table should exceed the physical memory of the database server?
The typical use cases for table partitioning (not limited to Postgres) are:
Cleanup data
This applies if you need to delete rows from large tables and the rows to delete can be identified by a single partition.
In that case, dropping the partition is a lot faster than using delete. A typical use case is a table that is range-partitioned on a time span (week, month, year).
Improve queries
This applies if all (or nearly all) of your queries contain a condition on the partition key.
A typical use case would be partitioning an "orders" table on e.g. the country, where all queries involve a condition like where country_code = 'de' or something similar. Queries that do not include the partitioning key will, however, be slower than the same query on a non-partitioned table.
What is "large"? That depends very much on your hardware and system. But I would not consider a table with fewer than 100 million rows "large". Indexing (including partial indexes) can get you a long way in Postgres.
Note that Postgres 10 partitioning is still severely limited compared to e.g. Oracle or SQL Server. One of the biggest limitations is the lack of support for foreign keys and global indexes (i.e. a primary key ensuring uniqueness across all partitions). So if you need that, partitioning is not for you.
I've always used either auto_generated or Sequences in the past for my primary keys. With the current system I'm working on there is the possibility of having to eventually partition the data which has never been a requirement in the past. Knowing that I may need to partition the data in the future, is there any advantage of using UUIDs for PKs instead of the database's built-in sequences? If so, is there a design pattern that can safely generate relatively short keys (say 6 characters instead of the usual long one e6709870-5cbc-11df-a08a-0800200c9a66)? 36^6 keys per-table is more than sufficient for any table I could imagine.
I will be using the keys in URLs so conciseness is important.
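There is no pattern to reduce a 128-bit UUID to 6 characters without losing information: six characters from a 36-symbol alphabet can hold only about 31 bits (36^6 is roughly 2.18 billion values).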
Almost all databases implement a surrogate key strategy called incremental keys.
Postgres and Informix have serials, MySQL has auto_increment, and Oracle offers sequence generators. In your case I think it would be safe to use integer IDs.
See the article "Choosing a Primary Key: Natural or Surrogate?" for a discussion of the available techniques.
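If the integer IDs are meant to appear in URLs, one possible way to keep them short is to render them in base 36. This is an assumption on my part rather than something the database does for you, and the function names are made up; the sketch just shows that a plain sequence value fits comfortably in the 6 characters asked for.

```python
import string

# 0-9 followed by a-z: the 36 symbols of the question's 36^6 key space.
ALPHABET = string.digits + string.ascii_lowercase

# Number of distinct keys that fit in 6 base-36 characters.
MAX_SIX_CHAR_KEYS = 36 ** 6


def encode_base36(n):
    """Render a non-negative integer ID as a base-36 string."""
    if n < 0:
        raise ValueError("IDs must be non-negative")
    if n == 0:
        return "0"
    digits = []
    while n:
        n, remainder = divmod(n, 36)
        digits.append(ALPHABET[remainder])
    return "".join(reversed(digits))


def decode_base36(text):
    """Turn a base-36 URL key back into the integer ID."""
    return int(text, 36)


print(encode_base36(12345))
print(decode_base36("9ix"))
print(MAX_SIX_CHAR_KEYS)
```

Note that keys produced this way are sequential and therefore guessable, so this only addresses the length of the key, not access control.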
I'm not sure what type of partitioning you are planning (this?), but I don't see why you would change the primary key design. Even if the old partitioned tables are "alive" (i.e., you might insert rows into any partitioned table), there is no problem in sharing the sequence among several tables.