pg_wal getting full in Master DB - postgresql

I also have a similar setup, with master-slave (primary-standby) streaming replication set up on 2 physical nodes. The replication is working correctly, and walsender and walreceiver both work fine. But in my case pg_wal on the master is getting full and WAL files are not being cleaned up. My archive mode is disabled. Can anyone help?
Postgres Version 12. Running on RHEL-7.8
Here is my runtime configuration on the master and my streaming-replication settings (originally posted as screenshots: "Master Runtime Configuration" and "Streaming Replication Runtime Configuration").

Very likely you have a stale replication slot. Look at
SELECT slot_name, active, restart_lsn
FROM pg_replication_slots;
If you find an inactive one with an old restart_lsn, you have found the problem. Use the pg_drop_replication_slot function to get rid of it; pg_wal will then slowly shrink back to max_wal_size.
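For example, assuming the stale slot found above is named standby_slot (the name here is just an illustration):
SELECT pg_drop_replication_slot('standby_slot');
After the next checkpoint, the retained segments in pg_wal become eligible for removal.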
You should also check whether you have a high value in wal_keep_size (wal_keep_segments in releases before v13). That will also make PostgreSQL retain old WAL segments.
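You can check the current retention setting from psql; on PostgreSQL 12 (as in this question) the parameter is wal_keep_segments, on 13 and later it is wal_keep_size:
SHOW wal_keep_segments;
-- on PostgreSQL 13+: SHOW wal_keep_size;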

Related

Does the pg_archivecleanup command have an effect on replication slots?

As per its definition, a replication slot is a feature in PostgreSQL that ensures the master server retains the WAL files needed by the replicas, even while they are disconnected from the server.
Is there any effect if I run the pg_archivecleanup command every 15th day of the month to free up storage? Does it have any effect on a replication slot, given that a slot tracks the WAL files required by the standby server?
I run pg_archivecleanup to remove WAL files up to the last checkpoint, but I am not sure whether it removes WAL files that are still required by another replica.
If it does not remove them, how does it actually track them?
I am looking for an explanation from experts.
When you run pg_archivecleanup, PostgreSQL will delete all WAL segments that are older than the WAL segment you specify as the argument. This ignores replication slots, so you may end up removing WAL segments that are still needed by standby servers to catch up (if they do that using restore_command).
Note that this is normally not a problem, because pg_archivecleanup deletes WAL segments from the archive, while replication slots deal with WAL segments on the primary server (in the pg_wal directory), which are not affected by pg_archivecleanup. And since the standby consumes WAL directly from the primary (as specified in primary_conninfo), it does not have to rely on the WAL archive.
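For reference, a typical invocation deletes everything in the archive older than the segment given as the second argument (the directory and segment name here are examples):
pg_archivecleanup /path/to/archive 000000010000000F00000052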

WAL level not sufficient for making an online backup

I have tried to set up database replication on Red Hat Linux 7.0 using PostgreSQL.
I referred to this URL for configuration: http://blog.terminal.com/postgresql-replication-and-backup-methods/ and completed the steps up to
Configuring the slave server
but at the step
Initial replication
when I executed this command on the master:
-bash-4.2$ psql -c "select pg_start_backup('initial_backup');"
I got an error like this:
ERROR: WAL level not sufficient for making an online backup
HINT: wal_level must be set to "archive" or "hot_standby" at server start.
So kindly let me know where we went wrong.
In your master PostgreSQL server's configuration file, <PG_DATA>/postgresql.conf, there is a parameter called wal_level. It should be set to either "archive" or "hot_standby".
Both tell Postgres to keep WAL segments for the replication server. The difference between the two is rather simple:
"hot_standby" means that your replication server will allow connections for SELECT statements;
"archive" means you don't want that possibility.
More to read in this tutorial on streaming replication
Also keep in mind that a change of the wal_level property requires a server restart to take effect.
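A minimal sketch of the relevant lines in the master's postgresql.conf (values are examples; streaming also needs max_wal_senders and a replication entry in pg_hba.conf, which are assumed to be in place):
wal_level = hot_standby    # or 'archive' if the standby will not serve queries
max_wal_senders = 3        # allow walsender connections for standbys
Then restart, e.g. pg_ctl restart -D <PG_DATA>.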

Primary and standby server at different timelines in postgres

I am very new to Postgres, and being new I got stuck at a point and need some help; please pardon me if you find it silly.
I am setting up pgpool HA, and at the Postgres level I have streaming replication between 3 nodes of postgresql-9.5: 1 master and 2 slaves.
I was trying to configure auto failover, but when I switched back to my original master and restarted the postgres service, I got the following errors:
slave 1-highest timeline 1 of the primary is behind recovery timeline 11
slave 2-highest timeline 1 of the primary is behind recovery timeline 10
slave 3-highest timeline 1 of the primary is behind recovery timeline 3
I tried deleting the pg_xlog files on the slaves and copying all the files from the master's pg_xlog onto the slaves, and then did an rsync.
I also tried pg_rewind, but it says:
target server needs to use either data checksums or wal_log_hints = on
(I already have wal_log_hints = on set in postgresql.conf.)
I've tried doing a pg_basebackup, but since the database server on the slaves is still starting up, it's not able to connect to the server.
Is there any way to bring the master and the slave onto the same timeline?
In my case it happened because (experimentally) I had updated tables on the standby database, and when I set up master-standby streaming replication again I got the same errors.
So once again I cleaned out the whole standby data directory and re-seeded it from the master using a command like
"pg_basebackup -P -R -X stream -c fast -h 10.10.40.105 -U postgres -D standby/"
I think something is wrong in your pgpool configuration. What tool have you been using for management of replication and master-slave control? Is it postmaster or repmgr?
I was trying to configure pgpool with 3 data nodes using the tutorial at http://jensd.be/591/linux/setup-a-redundant-postgresql-database-with-repmgr-and-pgpool and have done it correctly.
Also, you can learn about auto failover here.
(This question is obviously a duplicate of this one, so I'll repeat the answer here as well.)
I'm not sure what exactly you mean by "when I switched back to my original master", but it looks like you are doing the worst possible thing in PostgreSQL streaming replication: introducing a second master.
The most important thing you should know about PostgreSQL replication is that once failover has been performed, you cannot simply "switch back to the original master". There is now a new master in the cluster, and the existence of two masters would cause damage.
After a slave is promoted to master, the only way for you to re-join the old master is to:
Destroy it (delete the data directory);
Join it as a slave.
If you want it to be master again you'll continue with the following:
Let it run for awhile as a slave so that it can sync the data;
Kill temporary master and failover to old master;
Rejoin temporary master again as a slave.
You cannot simply switch master servers! A master can be created ONLY by failover (promoting a slave).
You should also know that whenever you are performing failover (whenever the master is changed), all slaves (except for the one that is promoted) need to be reconfigured to target the new master.
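A minimal sketch of re-joining the old master as a slave (the host name, user, and paths are hypothetical; on 9.x, -R writes a recovery.conf pointing at the new master):
# on the old master, with PostgreSQL stopped
rm -rf /var/lib/pgsql/9.5/data/*
pg_basebackup -h new-master -U replication -D /var/lib/pgsql/9.5/data -X stream -R
Then start the service and the node comes up as a streaming slave of the new master.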
I suggest reading this tutorial - it'll help.

How do I fix a PostgreSQL 9.3 Slave that Cannot Keep Up with the Master?

We have a master-slave replication configuration as follows.
On the master:
postgresql.conf has replication configured as follows (commented lines taken out for brevity):
max_wal_senders = 1
wal_keep_segments = 8
On the slave:
Same postgresql.conf as on the master. recovery.conf looks like this:
standby_mode = 'on'
primary_conninfo = 'host=master1 port=5432 user=replication password=replication'
trigger_file = '/tmp/postgresql.trigger.5432'
When this was initially set up, we performed some simple tests and confirmed that replication was working. However, when we did the initial data load, only some of the data made it to the slave.
The slave's log is now filled with messages that look like this:
< 2015-01-23 23:59:47.241 EST >LOG: started streaming WAL from primary at F/52000000 on timeline 1
< 2015-01-23 23:59:47.241 EST >FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 000000010000000F00000052 has already been removed
< 2015-01-23 23:59:52.259 EST >LOG: started streaming WAL from primary at F/52000000 on timeline 1
< 2015-01-23 23:59:52.260 EST >FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 000000010000000F00000052 has already been removed
< 2015-01-23 23:59:57.270 EST >LOG: started streaming WAL from primary at F/52000000 on timeline 1
< 2015-01-23 23:59:57.270 EST >FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 000000010000000F00000052 has already been removed
After some analysis and help on the #postgresql IRC channel, I've come to the conclusion that the slave cannot keep up with the master. My proposed solution is as follows.
On the master:
Set max_wal_senders=5
Set wal_keep_segments=4000. Yes, I know it is very high, but I'd like to monitor the situation and see what happens. I have room on the master.
On the slave:
Save the configuration files from the data directory (i.e. pg_hba.conf, pg_ident.conf, postgresql.conf, recovery.conf)
Clear out the data directory (rm -rf /var/lib/pgsql/9.3/data/*). This seems to be required by pg_basebackup.
Run the following command:
pg_basebackup -h master -D /var/lib/pgsql/9.3/data --username=replication --password
Am I missing anything? Is there a better way to bring the slave up to date without having to reload all the data?
Any help is greatly appreciated.
The two important options for dealing with the WAL for streaming replication:
wal_keep_segments should be set high enough to allow a slave to catch up after a reasonable lag (e.g. a burst of update volume, the slave being offline, etc.).
archive_mode enables WAL archiving, which can be used to recover segments older than what wal_keep_segments retains. The slave servers simply need a method to retrieve the WAL segments. NFS is the simplest method, but anything from scp to http to tapes will work, as long as it can be scripted.
# on master
archive_mode = on
archive_command = 'cp %p /path_to/archive/%f'
# on slave
restore_command = 'cp /path_to/archive/%f "%p"'
When the slave can't pull a WAL segment directly from the master, it will attempt to use the restore_command to load it. You can configure the slave to automatically remove segments using the archive_cleanup_command setting.
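A common pattern for that setting in the slave's recovery.conf (the archive path is an example); %r expands to the name of the oldest segment the standby still needs:
archive_cleanup_command = 'pg_archivecleanup /path_to/archive %r'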
If the slave comes to a situation where the next WAL segment it needs is missing from both the master and the archive, there will be no way to consistently recover the database. The only reasonable option then is to scrub the server and start again from a fresh pg_basebackup.
You can also configure replication slots (available since PostgreSQL 9.4), which make the primary keep the WAL segments needed by the replica attached to the slot.
Read more at https://www.percona.com/blog/2018/11/30/postgresql-streaming-physical-replication-with-slots/
On the master server, run:
SELECT pg_create_physical_replication_slot('standby_slot');
On the slave server, add the following line to recovery.conf:
primary_slot_name = 'standby_slot'
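To verify that the slot is in use and see how much WAL it is holding back, a query along these lines works on 9.4-9.6 (on version 10 and later the functions are pg_current_wal_lsn and pg_wal_lsn_diff):
SELECT slot_name, active,
       pg_xlog_location_diff(pg_current_xlog_location(), restart_lsn) AS retained_bytes
FROM pg_replication_slots;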
Actually, to recover you don't have to drop the whole DB and start from scratch. Since the master has up-to-date binary data, you can do the following to recover the slave and bring it back in sync:
psql -c "select pg_start_backup('initial_backup');"
rsync -cva --inplace --exclude=*pg_xlog* <data_dir> slave_IP_address:<data_dir>
psql -c "select pg_stop_backup();"
Note:
1. The slave has to be shut down first (service stop).
2. pg_start_backup forces a checkpoint and puts the master into backup mode; it does not block normal operation.
3. The master can continue serving queries, reads and writes alike, during the backup.
4. Bring the slave back up at the end of the steps.
I did this in prod; it works perfectly for me.
Slave and master are in sync and there is no data loss.
You will get that error if the wal_keep_segments setting is too low.
When you set the value for wal_keep_segments, consider: how long does the pg_basebackup take?
Remember that segments are generated about every 5 minutes, so if the backup takes an hour, you need at least 12 segments saved. At 2 hours, you need 24, etc. I would budget about 12 segments per hour of backup, plus some headroom.
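As a worked sizing example under those assumptions (default 16 MB segments, roughly one every 5 minutes on this workload); this is only a sketch, not a recommendation:
# backup takes ~3 hours -> 3 * 12 = 36 segments minimum
# double it for headroom:
wal_keep_segments = 72    # roughly 72 * 16 MB = ~1.1 GB of retained WAL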
As Ben Grimm suggested in the comments, this is a question of making sure to set segments to the maximum possible value to allow the slave to catch up.

PostgreSQL - using log shipping to incrementally update a remote read-only slave

My company's website uses a PostgreSQL database. In our data center we have a master DB and a few read-only slave DB's, and we use Londiste for continuous replication between them.
I would like to setup another read-only slave DB for reporting purposes, and I'd like this slave to be in a remote location (outside the data center). This slave doesn't need to be 100% up-to-date. If it's up to 24 hours old, that's fine. Also, I'd like to minimize the load I'm putting on the master DB. Since our master DB is busy during the day and idle at night, I figure a good idea (if possible) is to get the reporting slave caught up once each night.
I'm thinking about using log shipping for this, as described on
http://www.postgresql.org/docs/8.4/static/continuous-archiving.html
My plan is:
Setup WAL archiving on the master DB
Produce a full DB snapshot and copy it to the remote location
Restore the DB and get it caught up
Go into steady state where:
DAYTIME -- the DB falls behind but people can query it
NIGHT -- I copy over the day's worth of WAL files and get the DB caught up
Note: the key here is that I only need to copy a full DB snapshot one time. Thereafter I should only have to copy a day's worth of WAL files in order to get the remote slave caught up again.
Since I haven't done log-shipping before I'd like some feedback / advice.
Will this work? Does PostgreSQL support this kind of repeated recovery?
Do you have other suggestions for how to set up a remote semi-fresh read-only slave?
Your plan should work.
As Charles says, warm standby is another possible solution. It's supported since 8.2 and has relatively low performance impact on the primary server.
Warm Standby is documented in the Manual: PostgreSQL 8.4 Warm Standby
The short procedure for configuring a standby server is as follows. For full details of each step, refer to previous sections as noted.
Set up primary and standby systems as near identically as possible, including two identical copies of PostgreSQL at the same release level.
Set up continuous archiving from the primary to a WAL archive located in a directory on the standby server. Ensure that archive_mode, archive_command and archive_timeout are set appropriately on the primary (see Section 24.3.1).
Make a base backup of the primary server (see Section 24.3.2), and load this data onto the standby.
Begin recovery on the standby server from the local WAL archive, using a recovery.conf that specifies a restore_command that waits as described previously (see Section 24.3.3).
To achieve only nightly syncs, your archive_command should exit with a non-zero exit status during daytime.
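One way to sketch that (the archive path and the 00:00-05:59 window are assumptions; a non-zero exit simply makes the primary keep the segment and retry later). Note the doubled %% needed for a literal percent sign inside archive_command:
archive_command = 'case $(date +%%H) in 0[0-5]) cp %p /path_to/archive/%f ;; *) false ;; esac'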
Additional information:
Postgres Wiki about Warm Standby
Blog Post Warm Standby Setup
9.0's built-in WAL streaming replication is designed to accomplish something that should meet your goals -- a warm or hot standby that can accept read-only queries. Have you considered using it, or are you stuck on 8.4 for now?
(Also, the upcoming 9.1 release is expected to include an updated/rewritten version of pg_basebackup, a tool for creating the initial backup point for a fresh slave.)
Update: PostgreSQL 9.1 will include the ability to pause and resume streaming replication with a simple function call on the slave.
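The functions, as named in 9.1 (renamed to pg_wal_replay_pause/pg_wal_replay_resume in version 10), are run on the slave:
SELECT pg_xlog_replay_pause();
SELECT pg_xlog_replay_resume();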