Spring Data JPA insertion into a partitioned table taking more time compared to a non-partitioned table - spring-data-jpa

We need some help with the use case below.
We are inserting data into YugabyteDB using Spring Data JPA (YSQL).
We have observed a significant time difference on the partitioned table during INSERT DML.
The application takes approximately 6 to 10 times longer when writing data to the partitioned
table.
We ran the same exercise over a plain JDBC connection, and this time no significant time difference was observed,
i.e. the application takes almost the same time for the partitioned and non-partitioned tables.
The base infrastructure is the same in both cases.

Related

PostgreSQL ANALYZE statistics & Replication

On my primary I ran a VACUUM then an ANALYZE on all databases, then when I check pg_stat_user_tables, the last_analyze column shows a current timestamp which is great.
When I check my replication instance, there are no values in the last_analyze column. I was assuming this timestamp would also eventually populate? Is this known behaviour?
The reason I ask is that after that VACUUM/ANALYZE on the primary, I'm running into some extremely slow queries on the replication instance. Before the VACUUM/ANALYZE, I ran EXPLAIN on a query and it took 5 seconds... now it's taking 65 seconds. The EXPLAIN output shows it is not using several indexes that it should be.
PostgreSQL has two different stats systems. One records data about the distribution of values in the columns; this one is transactional, and it propagates to the replica via the WAL.
The other system records data about turnover on the tables and when the last VACUUM/ANALYZE was done. This system is used to determine when to schedule a new VACUUM/ANALYZE (to prevent the first system from getting too far out of date). This one is not transactional, and it does not propagate to the replica.
So the replica has the latest column value distribution statistics (as soon as the WAL replays, anyway), but it doesn't know how recent they are.
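If it helps to see where each set of figures lives, the two systems can be compared directly on the primary and on the replica with a couple of catalog queries; the table name here is only a placeholder.

-- Column-value distribution statistics: transactional and shipped through the WAL,
-- so the replica shows the same rows once replay has caught up.
SELECT attname, n_distinct, most_common_vals
FROM pg_stats
WHERE schemaname = 'public' AND tablename = 'my_table';  -- 'my_table' is a placeholder

-- Activity counters and maintenance timestamps: per-instance and not replicated,
-- so last_analyze stays NULL on the replica even after the primary is analyzed.
SELECT relname, last_vacuum, last_autovacuum, last_analyze, last_autoanalyze
FROM pg_stat_user_tables
WHERE relname = 'my_table';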

PostgreSQL: problem with a large number of partition tables when using hash partitioning

I have a very large database with more than 1.5 billion records for device data and growing.
I manage this by having a separate table for each device, about 1000 devices (tables) with an index table for daily stats. Some devices produce much more data than others, so I have tables with more than 20 million rows and others with less than 1 million.
I use indexes, but queries and data processing get very slow on the large tables.
I just upgraded from PostgreSQL 9.6 to 13 and tried to create one single table, hash-partitioned into at least 3,600 partitions, to import all of the per-device tables into and speed up the process.
As soon as I did this, I was able to insert some rows, but when I try to query or count rows I get "out of shared memory" and max_locks_per_transaction errors.
I tried to fine-tune but didn't succeed. I reduced the number of partitions to 1,000, but in certain operations I get the error again; just for testing I dropped down to 100 and it works, but queries are slower than with the same amount of data in a standalone table.
I tried range-partitioning each individual table by year, which improved things, but it will be very messy to maintain thousands of tables with yearly ranges (note: I am running on a server with 24 virtual processors and 32 GB of RAM).
The question is: is it possible to have hash partitioning with more than 1,000 partitions? If so, what am I doing wrong?
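For what it's worth, the "out of shared memory" / max_locks_per_transaction errors described above come from the lock table filling up: every partition a statement touches takes its own lock, so a query that has to scan thousands of partitions needs thousands of lock slots. A hedged sketch of giving it more room follows; the value is only illustrative and the setting requires a server restart.

-- Lock table capacity is roughly
--   max_locks_per_transaction * (max_connections + max_prepared_transactions),
-- and the default of 64 is easily exhausted when one statement touches thousands of partitions.
ALTER SYSTEM SET max_locks_per_transaction = 512;  -- illustrative value; needs a restart
-- after restarting, verify:
SHOW max_locks_per_transaction;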

Postgres Partitioning Query Performance when Partitioned for Delete

We are on PostgreSQL 12 and looking to partition a group of tables that are all related by data source name. A source can have tens of millions of records, and the whole dataset takes up about 900 GB across the 2,000 data sources. We don't have a good way to update these records, so we are looking at a full dump and reload any time we need to update data for a source. This is why we are looking at partitioning: we can load the new data into a new partition, detach (and later drop) the partition that currently houses the data, and then attach the new partition with the latest data. Queries will be performed via a single ID field. My concern is that since we are partitioning by source name but querying by an ID that isn't used in the partition definition, we won't be able to use any partition pruning and our queries will suffer for it.
How concerned should we be with query performance for this use case? There will be an index defined on the ID that is being queried, but based on the Postgres documentation it can add a lot of planning time and use a lot of memory to service queries that look at many partitions.
Performance will suffer, but how much depends on the number of partitions. The more partitions you have, the slower both planning and execution will get, so keep the number low.
You can save on query planning time by defining a prepared statement and reusing it.
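A rough sketch of the swap workflow described in the question, plus the prepared-statement trick from the answer; all object names are hypothetical, assuming the parent table is list-partitioned by source name.

-- Swap in freshly loaded data for one source (names are hypothetical).
BEGIN;
ALTER TABLE readings DETACH PARTITION readings_source_0042;
ALTER TABLE readings ATTACH PARTITION readings_source_0042_new
    FOR VALUES IN ('source-0042');
COMMIT;
DROP TABLE readings_source_0042;  -- later, once nothing still needs the old data

-- Reuse a prepared statement to save on repeated planning, as suggested above.
PREPARE lookup_by_id(bigint) AS
    SELECT * FROM readings WHERE record_id = $1;
EXECUTE lookup_by_id(123456);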

Slow bulk read from Postgres Read replica while updating the rows we read

We have on RDS a main Postgres server and a read replica.
We constantly write and update data for the most recent couple of days.
Reading from the read replica works fine when looking at older data, but reading data from the last couple of days, which we keep updating on the main server, is painfully slow.
Queries that take 2-3 minutes on old data can time out after 20 minutes when querying data from the last day or two.
Looking at monitors like CPU, I don't see any extra load on the read replica.
Is there a solution for this?
You are accessing over 65 buffers for every 1 visible row found in the index scan (and over 500 buffers for each row actually returned by the index scan, since 90% are filtered out by the mmsi criterion).
One issue is that your index is not as selective as it could be. If you had the index on (day, mmsi) rather than just (day), it should be about 10 times faster.
But it also looks like you have a massive amount of bloat.
You are probably not vacuuming the table often enough. With your described UPDATE pattern, all the vacuum needs are accumulating in the newest data, but the activity counters are evaluated based on the full table size, so autovacuum is not done often enough to suit the needs of the new data. You could lower the scale factor for this table:
alter table simplified_blips set (autovacuum_vacuum_scale_factor = 0.01)
Or if you partition the data based on "day", then the partitions for newer days will naturally get vacuumed more often, because the occurrence of updates will be judged against the size of each partition rather than being diluted by the size of all the older, inactive partitions. Also, each vacuum run will take less work, as it won't have to scan all of the indexes of the entire table, just the indexes of the active partitions.
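A minimal sketch of the two suggestions above, using the table and column names mentioned in the thread; the remaining columns and the date boundaries are placeholders.

-- The more selective composite index:
CREATE INDEX ON simplified_blips (day, mmsi);

-- Range-partitioning by day, so the hot daily partitions get vacuumed on their own schedule
-- (column list reduced to the two mentioned in the thread):
CREATE TABLE simplified_blips_parted (
    day   date   NOT NULL,
    mmsi  bigint NOT NULL
    -- remaining columns go here
) PARTITION BY RANGE (day);

CREATE TABLE simplified_blips_2024_06_01
    PARTITION OF simplified_blips_parted
    FOR VALUES FROM ('2024-06-01') TO ('2024-06-02');  -- placeholder dates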
As suggested, the problem was bloat.
When you update a record in an MVCC database like Postgres, the database creates a new version of the record containing the updated data rather than overwriting the old one.
After the update, the old version is left behind as a "dead record" (AKA a dead tuple).
Every so often the database runs autovacuum and cleans the dead tuples out of the table.
Usually the default autovacuum settings are fine, but if your table is really large and updated often, you should consider making the autovacuum thresholds more aggressive for that table.

Is there any alternative to the column range partitioner in Spring Batch remote partitioning?

Take a normal case where I am reading data from DB2, applying some business logic to it, and writing it into MongoDB. I am doing this with a Spring Batch column range partition (remote partitioning), but the problem is that my DB2 table has no sequential column, so each partition ends up with a different row count. Because of this, the load is different for each slave. My requirement is to distribute the load equally across the slaves.
You'll need to write your own implementation of a Partitioner. In a partitioned job, the Partitioner is responsible for knowing how to divide up the data into the partitions. Spring Batch really only provides one out of the box, the MultiResourcePartitioner. The column range one found in the framework is actually just a sample. You can read more about this interface and its role in the documentation here: https://docs.spring.io/spring-batch/trunk/apidocs/org/springframework/batch/core/partition/support/Partitioner.html and here: https://docs.spring.io/spring-batch/trunk/reference/html/scalability.html
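A hedged sketch of the even-split idea such a custom Partitioner could build on, assuming the DB2 version in use supports the NTILE window function; the table and column names are hypothetical. The query carves the ordered key space into buckets of roughly equal row counts even though the key is not sequential, and each slave's reader can then be bounded by its bucket's min/max key.

-- Split the table into 4 buckets of (nearly) equal row counts, even though
-- device_id is not sequential; hand each slave the min/max of one bucket.
-- source_table and device_id are hypothetical names.
SELECT bucket,
       MIN(device_id) AS min_key,
       MAX(device_id) AS max_key,
       COUNT(*)       AS bucket_rows
FROM (
    SELECT device_id, NTILE(4) OVER (ORDER BY device_id) AS bucket
    FROM source_table
) AS t
GROUP BY bucket
ORDER BY bucket;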