PostgreSQL ANALYZE statistics & Replication

On my primary I ran a VACUUM and then an ANALYZE on all databases. When I check pg_stat_user_tables, the last_analyze column shows a current timestamp, which is great.
When I check my replication instance, there are no values in the last_analyze column. I assumed this timestamp would eventually populate there as well. Is this known behaviour?
The reason I ask is that after the VACUUM/ANALYZE on the primary, I'm running into some extremely slow queries on the replication instance. I ran EXPLAIN on a query prior to the VACUUM/ANALYZE and it ran in 5 seconds; now it's taking 65 seconds, and the plan shows it's not using several indexes that it should be.

PostgreSQL has two different statistics systems. One records data about the distribution of values in the columns; this one is transactional and propagates to the replica via the WAL.
The other records data about turnover on the tables and when the last VACUUM/ANALYZE was done. It is used to decide when to schedule the next VACUUM/ANALYZE (to prevent the first system from getting too far out of date). This one is not transactional and does not propagate to the replica.
So the replica has the latest column-distribution statistics (as soon as the WAL replays, anyway), but it doesn't know how recent they are.
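For illustration, a quick way to see the difference on the replica (a sketch; the schema name and LIMIT are arbitrary):
-- Column-distribution statistics are present on the replica, because they arrive via the WAL:
SELECT tablename, attname, n_distinct
FROM pg_stats
WHERE schemaname = 'public'
LIMIT 5;
-- ...but the cumulative activity statistics do not replicate, so the analyze timestamps stay empty:
SELECT relname, last_analyze, last_autoanalyze
FROM pg_stat_user_tables;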

Related

Sudden Increase in row exclusive locks and connection exhaustion in PostgreSQL

I have a scenario that repeats itself every few hours: there is a sudden increase in row exclusive locks in the PostgreSQL DB. In the meantime, some queries are not answered in time, causing connection exhaustion to the point that PostgreSQL does not accept new clients anymore. After 2-3 minutes the lock and connection numbers drop and the system comes back to its normal state.
I wonder whether autovacuum can be the root cause of this. I see that ANALYZE and VACUUM (not VACUUM FULL) take about 20 seconds to complete on one of the tables. I have INSERT, SELECT, UPDATE and DELETE operations going on from my application, and I don't have DDL commands (ALTER TABLE, DROP TABLE, CREATE INDEX, ...) going on. Can the autovacuum procedure conflict with queries from my application and cause them to wait until the vacuum has completed? Or is it all the fault of my application and bad design? I should mention that one of my tables has a column of type jsonb that holds relatively large data for each row (roughly 10 MB).
I have attached an image from the monitoring application that shows the sudden increase in row exclusive locks.
ROW EXCLUSIVE locks are perfectly harmless; they are taken on tables against which DML statements run. Your graph reveals nothing. You should set log_lock_waits = on and log_min_duration_statement to a reasonable value. Perhaps you can spot something in the logs. Also, watch out for long running transactions.
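A minimal sketch of the suggested logging setup (the one-second threshold is an assumption; pick whatever is reasonable for your workload):
-- Log whenever a statement waits longer than deadlock_timeout for a lock,
-- and log every statement that runs for more than one second.
ALTER SYSTEM SET log_lock_waits = on;
ALTER SYSTEM SET log_min_duration_statement = '1s';
SELECT pg_reload_conf();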

Autovacuum and partitioned tables

The Postgres documentation says that partitioned tables are not processed by autovacuum. But I still see that the last_autovacuum column in pg_stat_user_tables is populated with recent timestamps for live partitions.
Does that mean these timestamps are set by the background worker that only prevents transaction ID wraparound, without actually performing ANALYZE and VACUUM? Or what else could populate them?
Besides, given that the partitions are large and active enough, should I run both ANALYZE and VACUUM manually on those partitions? If yes, does the order matter?
UPDATE
I'll try to elaborate, thanks to the comments given.
Given that vacuum should work the same way on a partition as on a regular table, what could be the reason for the much faster growth of occupied disk space after partitioning? Before partitioning it was nearly a linear function of the record count.
What is also confusing: when looking at running autovacuum processes, I see that those related to partitions are marked "to prevent wraparound", while others are not. Is that pure coincidence, or is there something to check?
The documentation describes a partitioned table as rather a virtual entity, without its own storage. What is the point of noting that it is not vacuumed?
The statement from the documentation is true, but misleading. Autovacuum does not process the partitioned table itself, but it processes the partitions, which are regular PostgreSQL tables. So dead tuples get removed, the visibility map gets updated, and so on. In short, there is nothing to worry about as far as vacuuming is concerned. Remember that the partitioned table itself does not hold any data!
What the documentation warns you about is ANALYZE. Autovacuum also launches automatic ANALYZE jobs to collect accurate table statistics. This works fine on the partitions, but no table statistics are collected on the partitioned table itself, so you have to run ANALYZE manually on the partitioned table to get them. In practice, I find that not to be a problem, since the optimizer generates plans for each individual partition anyway, and there it has accurate statistics.
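For example (using a hypothetical partitioned table called measurements):
-- Autoanalyze covers the individual partitions; the parent needs a manual ANALYZE
-- to get statistics of its own.
ANALYZE measurements;
-- Check when the parent and its partitions were last analyzed:
SELECT relname, last_analyze, last_autoanalyze
FROM pg_stat_user_tables
WHERE relname LIKE 'measurements%';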

Slow bulk read from Postgres Read replica while updating the rows we read

We have on RDS a main Postgres server and a read replica.
We constantly write and update new data for the last couple of days.
Reading from the read replica works fine for older data, but reading data from the last couple of days, where we keep updating the data on the main server, is painfully slow.
Queries that take 2-3 minutes on old data can timeout after 20 minutes when querying data from the last day or two.
Looking at monitors like CPU, I don't see any extra load on the read replica.
Is there a solution for this?
You are accessing over 65 buffers for every visible row found in the index scan (and over 500 buffers for each row actually returned by the index scan, since 90% are filtered out by the mmsi criterion).
One issue is that your index is not as selective as it could be. If you had an index on (day, mmsi) rather than just (day), it should be about 10 times faster.
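A sketch of that index (the table name is taken from the ALTER TABLE below; building it CONCURRENTLY avoids blocking writes):
-- Composite index so the mmsi filter is applied inside the index scan
-- instead of discarding 90% of the rows afterwards.
CREATE INDEX CONCURRENTLY simplified_blips_day_mmsi_idx
    ON simplified_blips (day, mmsi);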
But it also looks like you have a massive amount of bloat.
You are probably not vacuuming the table often enough. With your described UPDATE pattern, all the vacuum needs accumulate in the newest data, but the activity counters are evaluated against the full table size, so autovacuum does not run often enough to suit the needs of the new data. You could lower the scale factor for this table:
-- vacuum once ~1% of the table has changed, instead of the default 20%
ALTER TABLE simplified_blips SET (autovacuum_vacuum_scale_factor = 0.01);
Or if you partition the data based on "day", then the partitions for newer days will naturally get vacuumed more often, because the occurrence of updates will be judged against the size of each partition rather than being diluted by the size of all the older, inactive partitions. Also, each vacuum run will take less work, as it won't have to scan all the indexes of the entire table, just the indexes of the active partitions.
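A minimal sketch of what range partitioning by "day" could look like (the names and date range are placeholders, and most columns are omitted):
-- Parent table partitioned by day; each daily partition is vacuumed and
-- analyzed based on its own size and activity.
CREATE TABLE simplified_blips_parted (
    day  date    NOT NULL,
    mmsi integer NOT NULL
    -- remaining columns would match the original table
) PARTITION BY RANGE (day);
CREATE TABLE simplified_blips_2024_01_01
    PARTITION OF simplified_blips_parted
    FOR VALUES FROM ('2024-01-01') TO ('2024-01-02');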
As suggested, the problem was bloat.
When you update a record in PostgreSQL, the database does not overwrite it in place; it creates a new version of the row containing the updated values.
After the update you end up with a "dead record" (a.k.a. a dead tuple).
Once in a while the database runs autovacuum and cleans the dead tuples out of the table.
Usually autovacuum is fine, but if your table is really large and updated often, you should consider making the autovacuum vacuum and analyze thresholds for that table more aggressive.
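A quick diagnostic sketch to see whether dead tuples are piling up faster than autovacuum clears them:
-- Tables with a large n_dead_tup relative to n_live_tup are likely bloated.
SELECT relname, n_live_tup, n_dead_tup, last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;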

Why does PostgreSQL keep WAL files for logical replication longer than necessary? How can I limit them?

I am trying to run PostgreSQL (10.6) logical replication between two servers on one table only. The table has id (int2) as its primary key. This is intentional, and the table acts as a rolling window for some IoT time-series data. It is write-heavy on the master node. The whole table holds roughly 10 minutes' worth of sensor data, and that is the design we would like to keep.
Logical replication between the master and replica nodes works great until there is a network outage lasting more than an hour. In the meantime, PostgreSQL on the master node accumulates WAL files recording every single insert/update on the table. So effectively the WAL files might contain hours of data we are not interested in, and they take forever to replay, step by step, from master to replica once the connection is restored. It is basically replaying records that have long since been removed from the table!
How can I set it up so that only relevant data gets replayed? If that is hard to do, is there a way to throw away WAL files older than, say, 10 minutes, so that they simply won't be sent?
I have tried playing with postgresql.conf settings. I am not sure whether there is a setting that limits how much WAL is kept when a replication slot is disconnected.
This is what the table looks like:
CREATE TABLE iot_ts (id int2 not null, time timestamp(0) not null, value real, primary key(id));
I would like to have logical replication of such a table set up so that when a long network outage occurs, the catch-up is fast and contains only the most recent data.
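For reference, a minimal sketch of the single-table logical replication setup described above (the publication/subscription names and connection string are placeholders):
-- On the master:
CREATE PUBLICATION iot_ts_pub FOR TABLE iot_ts;
-- On the replica (the table must already exist there):
CREATE SUBSCRIPTION iot_ts_sub
    CONNECTION 'host=master-host dbname=iot user=replicator'
    PUBLICATION iot_ts_pub;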

Informix to Postgres, continuous data replication algorithm

The master server is Informix (version varies from 9.40 to the latest); the database is unlogged by design, and that can't be changed. The slave server is the latest PostgreSQL. Master and slave are separate machines, and the network latency is unpredictable. The master schema is statically defined, well known and does not change, so it's only the data that needs to be replicated. In the master, there are three types of tables:
Numeric data tables, usually one date column, one time column and 15-300 int columns, keyed by 2-3 primary keys. The data is never changed, only added once per set interval (15, 30, or 60 minutes) and deleted when the retention point is reached. The replication data set can be up to 80,000 rows but is usually in the range of hundreds. This data needs to be replicated one way, master to slave. There are about 30 tables of this type, and they need to be replicated all at once and as fast as possible, typically in under one minute after a new interval set has been committed to the master.
Mixed data tables, with date, time, int, and string types, 30-100 columns, again with 2-3 primary keys. This data is also never changed, only added continuously and deleted when the retention point is reached. The data set is up to 100,000 rows per hour. One-way replication is needed, master to slave. There are a few tables like that, usually fewer than 5.
Mixed data tables, with int and string types, fewer than 10 columns, 2-3 primary keys. The data largely stays intact, with occasional additions, edits or deletions. The usual replication set size is unpredictable, but will probably be in the low hundreds of rows. This data needs to be replicated both ways, as fast as possible. There are a few tables of this type, and they need to be synced independently.
I've been looking for an existing tool that could do what I need, but it looks like there is none that is open source. I'm probably going to write one for my needs, and I'm looking for advice from DB gurus on how to approach this task.
In my estimate, there's probably no single algorithm that would cover all the use cases so I may be in fact looking for two or three algorithms. Here's what I found so far:
Fire a trigger on master changes, record row OIDs (does Informix have them?) in a temp table, dump the changed rows to a file, transfer it and load it up. Question: how to buffer the trigger? The master DB is unlogged (no transactions), so the trigger will fire upon each INSERT. Additional strain on the master, not good.
Add a cron job on the slave that pulls the latest date/time keys from the master and, if the data is newer, pulls it. Problem: although the update interval is defined, in reality it's based on the data source's clock (not the master DB clock), which is guaranteed to differ from the slave server's clock. Moreover, there can be several data sources, each with its own clock, and the data needs to be replicated ASAP. The only way I see here is to constantly poll the master from the slave, hoping that by the time the poll comes in, the data is all committed (no transactions, remember?). Kludgy, slow, not good.
Add Informix as a foreign data wrapper in Postgres and run queries directly instead of bothering with replication. Pros: simplicity. Cons: the Informix connector seems to be in alpha stage, and the whole approach is an unknown factor at best.
I've been researching this topic for some time, and it seems that the core of the problem is the lack of transactions on the master side. If the master DB were logged, it would be much easier to replicate, but without transactions the task suddenly becomes much more complicated. For one, how do I ensure that there are no dupes? For another, how do I avoid update loops in the type 3 tables? Considering all that, how do I make the replication as fast-reacting as possible? I mean the delay between a data update and the start of the sync here; data transfer is another topic altogether.
Any input is appreciated.
If you can't change the master in any significant way, you are going to have a heck of a time with any sort of replication. Your basic problem is that you have no real way to replicate changes in real time without tracking which changes have already been replicated, and if you can't change the master, you can't add that tracking. So the short answer is that replication is not a solution that can work for you. Given some of Informix's other features, I would think twice about approaching this as continuous replication.
This leads to other approaches. The big unknown factor is that the network may not be reliable enough to simply link the databases. This could lead to anything from transactions hanging while waiting for data over a high-latency connection to all kinds of other problems. You might be able to get this to work with an ODBC FDW and an Informix provider, or with DBI-Link and DBD::Informix, but that strikes me as a problem in your current environment. You could, however, use these in a cron job to periodically populate a second PostgreSQL server closer to your own location, so I would not write the approach off entirely.
One way or another, it seems to me you need to get a copy of the data onto your PostgreSQL server. You may want to run an ETL job to import the data periodically, or use a secondary PostgreSQL server and FDWs or DBI-Link to pull in the data. But this is not likely to be real-time, and it is not likely to be continuous.
The tl;dr is that your environment isn't really set up to do this. For my money, I would recommend an ETL approach and accept that your slave will not be in sync with the master.
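A minimal sketch of the load step of such an ETL job on the PostgreSQL side, assuming the Informix side has already unloaded the latest interval to a delimited file (the table and file names are placeholders):
-- Stage the file, then insert only rows we don't already have, so a re-run
-- of the job doesn't create dupes.
BEGIN;
CREATE TEMP TABLE staging (LIKE numeric_data) ON COMMIT DROP;
COPY staging FROM '/var/lib/etl/numeric_data.unl' WITH (FORMAT csv, DELIMITER '|');
INSERT INTO numeric_data
SELECT * FROM staging
ON CONFLICT DO NOTHING;
COMMIT;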