Postgres Partitioning Query Performance when Partitioned for Delete

We are on PostgreSQL 12 and want to partition a group of tables that are all related by data source name. A source can have tens of millions of records, and the whole dataset takes up about 900GB across the 2000 data sources. We don't have a good way to update these records, so we are looking at a full dump and reload any time we need to update the data for a source. That is why we are looking at partitioning: we can load the new data into a new partition, detach (and later drop) the partition that currently holds the data, and then attach the new partition with the latest data. Queries will filter on a single ID field. My concern is that, since we are partitioning by source name and querying by an ID that isn't part of the partition definition, we won't get any partition pruning and our queries will suffer for it.
How concerned should we be with query performance for this use case? There will be an index defined on the ID that is being queried, but based on the Postgres documentation it can add a lot of planning time and use a lot of memory to service queries that look at many partitions.

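Performance will suffer; how much depends on the number of partitions. Because the ID is not part of the partition key, no partitions can be pruned and every query has to probe the index of every partition. The more partitions you have, the slower both planning and execution get, so keep the number low.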
You can save on query planning time by defining a prepared statement and reusing it.

Related

Autovacuum and partitioned tables

The Postgres documentation says that partitioned tables are not processed by autovacuum. Yet I see that the last_autovacuum column in pg_stat_user_tables is populated with recent timestamps for live partitions.
Does this mean that these timestamps are set by the background worker that only prevents transaction ID wraparound, without actually performing ANALYZE and VACUUM? Or what else could populate them?
Also, given that the partitions are large and active, should I run both ANALYZE and VACUUM manually on those partitions? If yes, does the order matter?
UPDATE
I'm trying to elaborate, thanks to the comments given.
Given that vacuum should work the same way on a partition as on a regular table, what could explain the much faster growth of occupied disk space after partitioning? Before partitioning it was nearly a linear function of the record count.
What is also confusing: when I look at the running autovacuum processes, those related to partitions are marked "to prevent wraparound", while the others are not. Is that purely a coincidence, or is there something to check?
The documentation describes a partitioned table as a rather virtual entity, without its own storage. What is the point of noting that it is not vacuumed?
The statement from the documentation is true, but misleading. Autovacuum does not process the partitioned table itself, but it processes the partitions, which are regular PostgreSQL tables. So dead tuples get removed, the visibility map gets updated, and so on. In short, there is nothing to worry about as far as vacuuming is concerned. Remember that the partitioned table itself does not hold any data!
What the documentation warns you about is ANALYZE. Autovacuum also launches automatic ANALYZE jobs to collect accurate table statistics. This works fine on the partitions, but no table statistics are collected on the partitioned table itself, so you have to run ANALYZE manually on the partitioned table to get them. In practice, I find that not to be a problem, since the optimizer generates plans for each individual partition anyway, and there it has accurate statistics.

Do cross-partition queries break infinite CosmosDB horizontal scalability?

As I understand, when you perform a query that doesn't filter by one primary key, you perform a cross-partition query. For this to be executed, the query is sent to all physical partitions of your CDB collection, executed in parallel in each of them, and then returned.
As you scale to tens of thousands of requests per second, that means that each of the tens of thousands of requests is executed on each physical partition.
Does this mean that eventually each partition will reach the limit of requests per second it can serve, and horizontal scaling will no longer give any benefit? Because every new physical partition CDB adds will need to serve all incoming requests, it adds no new throughput capacity, only storage.
The downstream implication being that even if at a small scale you're ok with incurring the increased RU cost for cross-partition queries, to truly be able to scale indefinitely your data model should ensure queries hit only one partition (possibly by denormalizing it).
Yes, cross partition queries will not allow a database like Cosmos DB (or any horizontally scalable database) to scale.
Databases like Cosmos DB provide unlimited scale because they scale horizontally. The objective for your partition strategy should be to answer your high-volume queries with one, or at a minimum, a bounded set of partitions. The effort around partition strategy is to choose a property that is nearly always passed in queries. Denormalization is generally more a function of modeling data around requests. It has less to do with partitioning directly.
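To put rough numbers on the argument above, here is a minimal sketch of the fan-out arithmetic. It is a toy model, not a description of Cosmos DB internals: it assumes every cross-partition query is sent to every physical partition (as described in the question) and that single-partition queries are spread evenly over the partitions. The function name and the request rates are made up for illustration.

```python
def per_partition_load(requests_per_sec: float, partitions: int, cross_partition: bool) -> float:
    """Requests per second each physical partition has to serve (toy model).

    Assumption: a cross-partition query hits every partition, while a
    single-partition query hits exactly one, chosen uniformly.
    """
    if cross_partition:
        # Every partition sees every request, so adding partitions does not help.
        return requests_per_sec
    # Requests are divided among partitions, so load per partition shrinks.
    return requests_per_sec / partitions


if __name__ == "__main__":
    for n in (10, 100, 1000):
        fan_out = per_partition_load(20_000, n, cross_partition=True)
        targeted = per_partition_load(20_000, n, cross_partition=False)
        print(f"{n} partitions: cross-partition={fan_out:.0f} req/s, single-partition={targeted:.0f} req/s")
```

In this model the per-partition load of fan-out queries stays flat no matter how many partitions are added, which is the reason to aim for one or a bounded set of partitions per high-volume query.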
If you would like to learn more about partitioning and modeling with Cosmos DB, I highly recommend watching this video, which presents the topics very well: Data modeling & partitioning: What every relational database dev needs to know

Slow bulk read from Postgres Read replica while updating the rows we read

We have on RDS a main Postgres server and a read replica.
We constantly write and update new data for the last couple of days.
Reading from the read replica works fine for older data, but reading data from the last couple of days, which we keep updating on the main server, is painfully slow.
Queries that take 2-3 minutes on old data can time out after 20 minutes when querying data from the last day or two.
Looking at the monitors like CPU I don't see any extra load on the read replica.
Is there a solution for this?
You are accessing over 65 buffers for every 1 visible row found in the index scan (and over 500 buffers for each row which is returned by the index scan, since 90% are filtered out by the mmsi criterion).
One issue is that your index is not as selective as it could be. If you had the index on (day, mmsi) rather than just (day) it should be about 10 times faster.
But it also looks like you have a massive amount of bloat.
You are probably not vacuuming the table often enough. With your described UPDATE pattern, all the vacuum needs are accumulating in the newest data, but the activity counters are evaluated based on the full table size, so autovacuum is not done often enough to suit the needs of the new data. You could lower the scale factor for this table:
ALTER TABLE simplified_blips SET (autovacuum_vacuum_scale_factor = 0.01);
Or, if you partition the data based on "day", the partitions for newer days will naturally get vacuumed more often, because the number of updates will be judged against the size of each partition; it won't get diluted by the size of all the older, inactive partitions. Also, each vacuum run will take less work, as it won't have to scan all of the indexes of the entire table, just the indexes of the active partitions.
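The dilution effect can be made concrete with the trigger formula from the PostgreSQL documentation: autovacuum vacuums a table once its dead tuples exceed autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * number of tuples (defaults 50 and 0.2). The sketch below just evaluates that formula; the row counts (500 million rows in total, 1 million rows per day) are invented for illustration and are not taken from the question.

```python
def dead_tuples_before_autovacuum(reltuples: int,
                                  scale_factor: float = 0.2,
                                  base_threshold: int = 50) -> int:
    """Number of dead tuples a table must exceed before autovacuum vacuums it.

    Mirrors the documented formula:
    autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * reltuples
    """
    return round(base_threshold + scale_factor * reltuples)


if __name__ == "__main__":
    whole_table = 500_000_000   # hypothetical: all days in one table
    one_day = 1_000_000         # hypothetical: rows in a single "day" partition

    print("one big table, default settings:  ", dead_tuples_before_autovacuum(whole_table))
    print("one big table, scale factor 0.01: ", dead_tuples_before_autovacuum(whole_table, 0.01))
    print("single day partition, defaults:   ", dead_tuples_before_autovacuum(one_day))
```

Under these assumed sizes, the unpartitioned table needs on the order of a hundred days' worth of rows to become dead before a vacuum is triggered, while a per-day partition is vacuumed after a fraction of one day's rows have been updated.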
As suggested, the problem was bloat.
When you update a record in PostgreSQL, the database does not overwrite it in place; it creates a new version of the record with the updated values.
After the update you end up with a "dead record" (AKA dead tuple): the old version that is no longer needed.
Once in a while autovacuum runs and cleans the dead tuples out of the table.
Usually the default autovacuum settings are fine, but if your table is really large and updated often you should consider making the autovacuum settings for that table (such as the scale factor) more aggressive.

MVCC snapshot limit for concurrent queries

I am trying to learn PostgreSQL MVCC architecture. It says that MVCC creates a separate snapshot for each concurrent query. Isn't this approach memory inefficient?
For example if there are 1000 concurrent queries and table size is huge. This will create multiple instances of the table.
Is my understanding correct?
"It says that MVCC creates a separate snapshot for each concurrent query. Isn't this approach memory inefficient?"
You could argue it is memory inefficient. It usually isn't a big problem in practice.
"For example if there are 1000 concurrent queries and table size is huge."
Why would you have/want 1000 concurrent queries? Do you have 1000 CPUs? If there is a risk that you will try to establish 1000 concurrent queries, then you should deploy some entry control mechanism (like a connection pooler) that prevents this from happening, with a fallback to max_connections.
"This will create multiple instances of the table."
A snapshot is not a copy of the table. It is just a set of information that gets applied to the base table rows dynamically to decide which rows are visible in that snapshot. The size of a snapshot is proportional to the number of concurrent transactions (one reason not to have 1000 of them), not to the size of the table.
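To illustrate why a snapshot is small, here is a deliberately simplified toy model of snapshot visibility. It is not PostgreSQL's actual implementation: it ignores aborted transactions, subtransactions, and the current transaction's own changes, and it assumes every transaction that is not in progress has committed. The class and field names are chosen for this sketch; they only loosely mirror PostgreSQL's xmin/xmax terminology.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class Snapshot:
    """A few numbers describing which transactions count as finished."""
    xmin: int                      # every transaction id below this had finished
    xmax: int                      # first transaction id not yet assigned
    in_progress: FrozenSet[int]    # ids between xmin and xmax still running

    def sees(self, xid: int) -> bool:
        """True if the effects of transaction `xid` are visible (toy rule)."""
        if xid >= self.xmax:
            return False           # started after the snapshot was taken
        return xid not in self.in_progress


@dataclass(frozen=True)
class RowVersion:
    value: str
    created_by: int                    # id of the inserting transaction
    deleted_by: Optional[int] = None   # id of the deleting/updating transaction


def visible(row: RowVersion, snap: Snapshot) -> bool:
    """A row version is visible if its creation is seen and its deletion is not."""
    if not snap.sees(row.created_by):
        return False
    return row.deleted_by is None or not snap.sees(row.deleted_by)


def scan(table: List[RowVersion], snap: Snapshot) -> List[str]:
    """Apply the snapshot to the shared table rows; nothing is copied."""
    return [row.value for row in table if visible(row, snap)]


# One shared "table": transaction 20 updated the row created by transaction 10.
TABLE = [RowVersion("v1", created_by=10, deleted_by=20),
         RowVersion("v2", created_by=20)]

# Snapshot taken while transaction 20 was still running, and one taken after it.
DURING = Snapshot(xmin=20, xmax=21, in_progress=frozenset({20}))
AFTER = Snapshot(xmin=21, xmax=21, in_progress=frozenset())

if __name__ == "__main__":
    print(scan(TABLE, DURING))
    print(scan(TABLE, AFTER))
```

Both snapshots read the same list of row versions. Each snapshot holds two integers and a set of in-progress transaction ids, so its size grows with the number of concurrent transactions rather than with the size of the table.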

Spring batch partitioning master can read database and pass data to workers?

I am new to Spring Batch and am trying to design a new application that has to read 20 million records from a database and process them.
I don't think we can do this with one single Job and Step (running sequentially in one thread).
I was thinking we could do this with partitioning, where a step is divided into a master and multiple workers (each worker is a thread that does its own processing and can run in parallel).
We have to read an existing table that has 20 million records and process them, but this table does not have an auto-generated sequence number; its primary key is something like a 10-digit employer number.
I checked a few code samples for partitioning where a range is passed to each worker and the worker processes the given range, like worker1 from 1 to 100 and worker2 from 101 to 200. In my case that is not going to work, because we don't have a sequence number to pass as a range to each worker.
With partitioning, can the master read the data from the database (like 1000 records) and pass it to each worker instead of sending a range?
Or do you suggest a better approach for the above scenario?
In principle any query that returns result rows in a deterministic order is amenable to partitioning as in the examples you mentioned by means of OFFSET and LIMIT options. The ORDER BY may considerably increase the query execution time, although if you order by the table's primary key then this effect should be less noticeable as the table's index will already be ordered. So I would give this approach a try first, as it is the most elegant IMHO.
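As a rough illustration of the OFFSET/LIMIT idea, the sketch below splits a known row count into one contiguous slice per worker. It is written in Python only to show the arithmetic; in Spring Batch the equivalent logic would live in a Partitioner implementation that puts each slice's values into that partition's ExecutionContext. The function name, the dictionary keys, and the table and column names in the comment are hypothetical.

```python
from typing import Dict, List


def offset_limit_partitions(total_rows: int, workers: int) -> List[Dict[str, int]]:
    """Split `total_rows` into `workers` contiguous (offset, limit) slices.

    Each worker would then run a query of the (hypothetical) form:
        SELECT ... FROM employer ORDER BY employer_number
        LIMIT :limit OFFSET :offset
    The ORDER BY on the primary key keeps the slices deterministic.
    """
    if workers <= 0:
        raise ValueError("workers must be positive")
    base, extra = divmod(total_rows, workers)
    partitions = []
    offset = 0
    for i in range(workers):
        # The first `extra` workers take one additional row each.
        limit = base + (1 if i < extra else 0)
        partitions.append({"offset": offset, "limit": limit})
        offset += limit
    return partitions


if __name__ == "__main__":
    for part in offset_limit_partitions(20_000_000, 8):
        print(part)
```

The slices only stay consistent if the table does not change while the job runs; rows inserted or deleted in between would shift the offsets.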
Note however that you might run into other problems processing a huge result set straight from a JdbcCursorItemReader, because some RDBMSs (like MySQL) won't be happy with the rate at which you'd be fetching rows interlocked with processing. So depending on the complexity of your processing I would recommend validating the design in that regard early on.
Unfortunately it is not possible to retrieve a partition's entire set of table rows and pass it as a parameter to the worker step as you suggested, because the parameter must not serialize to more than a kilobyte (or something in that order of magnitude).
An alternative would be to retrieve each partition's data and store it somewhere (in a map entry in memory if size allows, or in a file) and pass the reference to that resource in a parameter to the worker step which then reads and processes it.