In PostgreSQL 9.2, is archiving required for streaming replication? - postgresql

Is it allowed and/or reasonable to configure a master PostgreSQL 9.2 server to NOT archive but still perform streaming replication? That is, configured as shown below:
wal_level = hot_standby
archive_mode = off
Can the "slave" server (hot standby) be configured to archive WAL segments?
wal_level = hot_standby
hot_standby = on
archive_mode = on
This would cut the WAL network traffic leaving the master server in half (replication only, no archiving). This seems reasonable, and the documentation appears to support this configuration, but I'd prefer a bit of reassurance that we have a good configuration.

From the documentation (emphasis added by me):
If you use streaming replication without file-based continuous archiving, you have to set wal_keep_segments in the master to a value high enough to ensure that old WAL segments are not recycled too early, while the standby might still need them to catch up. If the standby falls behind too much, it needs to be reinitialized from a new base backup. If you set up a WAL archive that's accessible from the standby, wal_keep_segments is not required as the standby can always use the archive to catch up.
So, from my understanding, when you have too many transactions running, the slave could have a hard time staying in sync, especially if the master removes WAL files before the slave has received their contents. Without archive_mode on the master, the WAL files could be deleted without leaving any way to get them back.
If you keep WAL archiving in place and add streaming on top of a working hot-standby-with-archives setup, this cannot happen: the slave can always access the archived WAL and will pick up the unsynced transactions as soon as lower activity on the stream allows it. Without access to the archive, you clearly risk losing your slave's integrity after some really heavy activity.
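A minimal sketch of the standby side of the archive-backed setup described above, assuming PostgreSQL 9.2, where standby settings live in recovery.conf. The host name, user, and archive path are placeholders, not values from the question:

```conf
# recovery.conf on the standby (PostgreSQL 9.2)
standby_mode = 'on'
# Hypothetical connection details for the master.
primary_conninfo = 'host=master.example.com port=5432 user=replicator'
# Fallback source: fetch segments from the shared archive when
# streaming cannot supply them. /path_to/archive is a placeholder.
restore_command = 'cp /path_to/archive/%f %p'

# postgresql.conf on the standby
hot_standby = on
```

With restore_command set, the standby should be able to replay from the archive whenever the stream has a gap, and then go back to streaming from the master once it has caught up.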

I don't know if this is actually "official and certified", and I don't think it is recent, but it comes from the PostgreSQL Wiki (https://wiki.postgresql.org/wiki/Streaming_Replication).
Step 5 includes some interesting comments, which agree with the answer above:
# To enable read-only queries on a standby server, wal_level must be set to
# "hot_standby". But you can choose "archive" if you never connect to the
# server in standby mode.
wal_level = hot_standby
# Set the maximum number of concurrent connections from the standby servers.
max_wal_senders = 5
# To prevent the primary server from removing the WAL segments required for
# the standby server before shipping them, set the minimum number of segments
# retained in the pg_xlog directory. At least wal_keep_segments should be
# larger than the number of segments generated between the beginning of
# online-backup and the startup of streaming replication. If you enable WAL
# archiving to an archive directory accessible from the standby, this may
# not be necessary.
wal_keep_segments = 32
# Enable WAL archiving on the primary to an archive directory accessible from
# the standby. If wal_keep_segments is a high enough number to retain the WAL
# segments required for the standby server, this is not necessary.
archive_mode = on
archive_command = 'cp %p /path_to/archive/%f'

Related

Postgresql doesn't reestablish delayed replication

I'm running master & replica on PG 13.3. I decided to use delayed replication (30 minutes configured in recovery_min_apply_delay parameter). On top of that, WAL archiving is configured and working well.
When load on the master is very high for a long time, replication falls behind until max_slot_wal_keep_size is exceeded (see my other, related question: Replication lag - exceeding max_slot_wal_keep_size, WAL segments not removed). Once it falls too far behind, the slot is "lost" and the replica falls back to restoring WAL from the archive. So far so good. The problem is, it never tries streaming replication again. Restarting the slave does not help.
There are two ways I have managed to restore replication:
Restarts & config edits
Remove the delay setting from the replica's config.
Restart Postgres. It then restores all the WAL from the archive and, once there is nothing left, starts streaming replication again, but without any delay. Then I edit the config again to reintroduce the delay, and it sometimes works and sometimes doesn't. I think it depends on the load.
Removing a WAL segment from the archive
Find the WAL segment currently being restored in the PostgreSQL log and temporarily move the following one out of the WAL archive. When PG tries to restore it, it fails and falls back to streaming replication.
This doesn't seem like the right way to do it, does it?
Thanks,
-- Marcin
As far as I can see, this is a non-problem.
If you want replication delayed by 30 minutes, and you archive more than one 16MB WAL segment per half hour, there is no need to replicate. The information can just as well be read from the archive. If the latest entry in the latest archived WAL segment happens to be older than recovery_min_apply_delay, the standby will contact the primary and replicate.
If you insist on replication rather than archive recovery, remove restore_command and max_slot_wal_keep_size from the configuration. But I don't see the point.
If you are concerned about losing the active WAL segment in case of a catastrophe on the primary, use pg_receivewal rather than archive_command to populate the WAL archive.
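For reference, a delayed standby like the one discussed here might be configured roughly as follows. This is a sketch assuming PostgreSQL 13, where these settings go in postgresql.conf and a standby.signal file marks the server as a standby. The host, user, and archive path are hypothetical:

```conf
# postgresql.conf on the standby (PostgreSQL 13)
# A standby.signal file must also exist in the data directory.
primary_conninfo = 'host=primary.example.com user=replicator'  # hypothetical
# Archive recovery; per the answer above, leaving this out makes the
# standby rely on streaming replication alone.
restore_command = 'cp /path_to/archive/%f %p'                  # placeholder path
# Apply WAL only once it is at least 30 minutes old.
recovery_min_apply_delay = '30min'
```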

Replication lag - exceeding max_slot_wal_keep_size, WAL segments not removed

Summary
We are using max_slot_wal_keep_size from PostgreSQL 13 to prevent the master from being killed by lagging replication. It seems that, in our case, WAL storage wasn't freed up after this limit was exceeded, which resulted in a replication failure. The WAL that I believe should have been freed did not seem to be needed by any transaction at the time. How is this supposed to work, and why were the WAL segments not removed?
Please find the details below.
Configuration
master & one replica - streaming replication using a slot
~700GB available for pg_wal
max_slot_wal_keep_size = 600GB
min_wal_size = 20GB
max_wal_size = 40GB
default checkpoint_timeout = 5 minutes (no problem with checkpoints)
archiving is on and is catching up well
What happened
Under heavy load (large COPY/INSERT transactions, loading hundreds of GB of data), replication started falling behind. Available space on pg_wal was shrinking at the same rate as pg_replication_slots.safe_wal_size, as expected. At some point safe_wal_size went negative and streaming stopped working. That wasn't a problem, because the replica started recovering from the WAL archive. I expected that once the slot was lost, WAL would be removed down to max_wal_size. This did not happen, though. It seems that Postgres tried to keep something close to max_slot_wal_keep_size (600GB) available, in case the replica started catching up again. Over that time, no single transaction required this much WAL to be kept. Archiving wasn't behind either.
Q1: Is it the case that PG will try to keep max_slot_wal_keep_size of WAL available?
Q2: If not, why did PG not remove the excess WAL when it was needed neither by the archiver nor by any transaction running on the system?
Free space on pg_wal was around 70GB most of the time; however, at some point, during heavy autovacuuming, it dipped to 0 :( This is when PG crashed (and auto-recovered soon after). After it came back up, there was 11GB left on pg_wal, with no transactions running and no loading. This lasted for hours. During this time the replica finally caught up from the archive and restored the replication with no delay. None of the WAL was removed. I manually ran a checkpoint, but it did not clear any WAL. I finally restarted PostgreSQL, and during the restart pg_wal was finally cleared.
Q3: Again, why did PG not clear the WAL? Here, even more clearly, it was not needed by any process.
Many thanks!
This was a PostgreSQL bug, and it's fixed. Thanks for reporting!
It should be available in 13.4 according to release notes (look for "Advance oldest required WAL segment")

Attach additional node to postgres primary server as warm standby

I have set up a Postgres 11 streaming replication cluster. The standby is a "hot standby". Is it possible to attach a second standby as a warm standby?
I assume that you are talking about WAL file shipping when you are speaking of a “warm standby”.
Sure, there is nothing that keeps you from adding a second standby that ships WAL files rather than directly attaching to the primary, but I don't see the reason for that.
According to this decent documentation of the Postgres 11 streaming replication architecture, you can arrange for the sync_state of a second standby to be potential. This means that if the first synchronous standby fails, the detected failure (through ACK communication) will move the second standby from potential to sync, making it the active synchronous replication server. See Section 11.3 - Managing Multiple Stand-by Servers in that link for more details.
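As a rough illustration of how a standby ends up as potential: the role is derived from the order of names in synchronous_standby_names on the primary and is reported in pg_stat_replication.sync_state. The names standby1 and standby2 below are hypothetical and would have to match the application_name each standby uses in its primary_conninfo:

```conf
# postgresql.conf on the primary (PostgreSQL 11)
# One synchronous standby is required. The first connected name in the
# list is reported as "sync"; later ones are reported as "potential"
# and take over if the first disconnects.
synchronous_standby_names = 'FIRST 1 (standby1, standby2)'
```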

use of archive_command in PostgreSQL streaming replication

When using streaming replication can someone please explain the purpose of archive_command and restore_command in PostgreSQL?
As I understand it, in streaming replication the secondary server reads and applies partially filled WAL files. Suppose my WAL segments are in pg_xlog and, using archive_command, I am copying them to my local archive directory, say /arclogs.
If the secondary server is going to read the partially filled WAL files from pg_xlog over the network, then what is the use of the files kept in /arclogs?
Also, will the files be sent to /arclogs only once they reach 16 MB?
I'm new to PostgreSQL and your help will be appreciated.
The master will normally only retain a limited amount of WAL in pg_xlog, controlled by the master's wal_keep_segments setting. If the replica is too slow or disconnected for too long, the master will delete those transaction logs to ensure it can continue running without running out of disk space.
If that happens the replica has no way to catch up to the master, since it needs a continuous and gap-free stream of WAL.
So you can:
Enable WAL archiving (archive_command and archive_mode) as a fallback, so the replica can switch to replaying WAL from archives if the master deletes WAL it needs from its pg_xlog. The replica fetches the WAL with its restore_command. Importantly, the archived WAL does not need to be on the same machine as the master, and usually isn't.
or
Use a physical replication slot (primary_slot_name in recovery.conf) to connect the replica to the master. If a slot is used, the master knows what WAL the replica requires even when the replica is disconnected. So it won't remove WAL still needed by a replica from pg_xlog. But the downside is that pg_xlog can fill up if a replica is down for too long, causing the master to fail due to lack of disk space.
or
Do neither, and allow replicas to fail if they fall too far behind. Then re-create them from a new base backup if this happens.
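The replication-slot option above could be sketched as follows, assuming PostgreSQL 9.4 or later with recovery.conf (as in this answer). The slot name replica1, the host, and the user are hypothetical:

```conf
# postgresql.conf on the master
max_replication_slots = 4
# Create the slot once on the master, for example in psql:
#   SELECT pg_create_physical_replication_slot('replica1');

# recovery.conf on the replica
standby_mode = 'on'
primary_conninfo = 'host=master.example.com user=replicator'
# The master retains WAL for this slot until the replica has consumed it.
primary_slot_name = 'replica1'
```

Because the master keeps WAL for the slot indefinitely, free space in pg_xlog needs monitoring while the replica is down.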
The documentation really needs an overview piece to put all this together.
WAL archiving has an additional benefit: If you make a base backup of the server you can use it, plus WAL archives, to do a point-in-time restore of the master. This lets you recover data from things like accidental table drops. PgBarman is one of the tools that can help you with this.

Replication on Postgresql pauses when Querying and replication are happening simultaneously

Postgres follows MVCC rules, so any query that is run on a table doesn't conflict with the writes that happen on the table. The query returns its result based on the snapshot taken when the query started.
Now I have a master and a slave. The slave is used by analysts to run queries and perform analysis. When the slave is replicating while analysts are running their queries, I see replication lag for a long time. If the queries are long-running, replication lags for a long duration, and if the number of writes on the master happens to be pretty high, I end up losing the WAL files and replication can no longer proceed. I just have to spin up another slave. Why does this happen? How do I allow queries and replication to happen simultaneously on Postgres? Is there any parameter setting that I can apply to make this happen?
The replica can't apply more WAL from the master because the master might've overwritten data blocks still needed by queries running on the replica that're older than any still running on the master. The replica needs older row versions than the master. It's exactly because of MVCC that this pause is necessary.
You probably set a high max_standby_streaming_delay to avoid "canceling statement due to conflict with recovery" errors.
If you turn hot_standby_feedback on, the replica can instead tell the master to keep those rows. But the master can't clean up free space as efficiently then, and it might run out of space in pg_xlog if the standby gets way too far behind.
See PostgreSQL manual: Handling Query Conflicts.
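The settings mentioned above live in the standby's postgresql.conf. A minimal sketch, with values chosen only for illustration:

```conf
# postgresql.conf on the standby
hot_standby = on
# Report the standby's oldest snapshot to the master so it delays
# cleanup of row versions that standby queries still need.
hot_standby_feedback = on
# How long WAL replay may wait for conflicting queries before
# cancelling them. 30s is the default; -1 waits indefinitely, which
# lets replication lag grow without bound.
max_standby_streaming_delay = 30s
```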
As for the WAL retention part: enable WAL archiving and a restore_command for your standbys. You should really be using it anyway, for point-in-time recovery. PgBarman now makes this easy with the barman get-wal command. If you don't want WAL archiving you can instead set your replica servers up to use a replication slot to connect to the master, so the master knows to retain the WAL they need indefinitely. Of course, that can cause the master to run out of space in pg_xlog and stop running so you need to monitor more closely if you do that.