How is PostgreSQL hot standby WAL file restoring triggered?

Primary server
# postgresql.conf
wal_level = hot_standby
archive_mode = on
archive_timeout = 10
archive_command = 'test ! -f /archive/%f && cp %p /archive/%f'
Standby server
hot_standby = on
I copied /archive/* on the primary server to $PGDATA/pg_xlog on the standby, and nothing happened. When I restarted the standby server, I got these error messages in the server log:
2016-11-21 17:56:09 CST [17762-3] LOG: invalid primary checkpoint record
2016-11-21 17:56:09 CST [17762-4] LOG: record with zero length at 0/6000100
2016-11-21 17:56:09 CST [17762-5] LOG: invalid secondary checkpoint record
2016-11-21 17:56:09 CST [17762-6] PANIC: could not locate a valid checkpoint record
2016-11-21 17:56:09 CST [17761-1] LOG: startup process (PID 17762) was terminated by signal 6: Aborted
2016-11-21 17:56:09 CST [17761-2] LOG: aborting startup due to startup process failure
Questions:
Is it enough to sync data to the standby server by simply copying /archive/* on the primary to $PGDATA/pg_xlog on the standby?
How and when is the restoring of WAL files triggered on a hot standby server? Does the standby server periodically check its $PGDATA/pg_xlog directory for new WAL files? Or do I have to trigger it manually?
I am talking about hot standby, not streaming replication; so I assume I don't have to configure primary_conninfo. Am I right?
After configuring hot_standby = on and restarting the server, I can still do an INSERT without error. How do I configure it to be really read-only?

That looks a lot like you didn't initialize the standby database correctly.
The log file states that PostgreSQL won't even begin to replicate, because it cannot find a valid checkpoint to start with.
What does the backup_label file in your standby's data directory contain? If that file doesn't exist, that's probably the problem.
Did that standby suddenly stop working or has it never worked?
How exactly did you create the standby?

You must first create the standby from a low-level base backup of the master. You cannot create a new instance and restore with pg_dump and pg_restore; I'm guessing that's what you tried to do.
The simplest way to do a suitable base backup is to use pg_basebackup. Other options are discussed in the manual, but really, just use:
pg_basebackup -X stream -D standby_datadir_location -h master_ip
or similar.
Only once you have a valid base backup may you start archive recovery or streaming replication. The simplest way is to enable streaming replication. Let pg_basebackup do that for you by passing the -R flag.
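For example, the same sketch with -R added, so pg_basebackup writes the recovery configuration for you:
pg_basebackup -X stream -R -D standby_datadir_location -h master_ip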
If you want archive recovery, you should add a restore_command to the standby's recovery.conf that copies the archives from the archive location to the standby.
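A minimal sketch, reusing the /archive directory from the question (recovery.conf on the standby, PostgreSQL 9.x):
standby_mode = 'on'
restore_command = 'cp /archive/%f "%p"'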
It's all covered in the manual.

Related

PostgreSQL restoration throwing error : replication slot does not exist

Environment: PostgreSQL 13.x (dockerized)
I was trying to test the DR setup for PostgreSQL nodes.
pg_basebackup and the WAL-file archive were taken from a standby node.
I restored onto a new node by copying the pg_basebackup data and configuring postgresql.conf with a restore_command pointing to the WAL-file archive.
#----------------------- RECOVERY CONFIGS -----------------------
restore_command = 'cp /db-restore/mydb/walfiles/%f "%p"'
recovery_target_timeline = 'latest'
recovery_target_action = promote
Recovery seems to be fine; some random SELECT queries return correct results.
But the log file frequently shows the error below:
2022-04-19 10:19:53 UTC [291] rep_usr#[unknown] ERROR: replication slot "slot_name" does not exist
2022-04-19 10:19:58 UTC [296] rep_usr#[unknown] ERROR: replication slot "slot_name" does not exist
As I took the backup from a standby, is this restoration making the new node a standby, looking for the replication slot it used in its previous generation?
How can I make the new node a master (i.e., remove the replication slot info)?
What are the proper steps to recover when the backup was taken from a standby?
I have 1 master and 2 standby nodes, and I am planning to take backups from a standby. So are there any specific changes required for archive_mode and archive_command when using them on a standby node? Current settings:
archive_mode = always
wal_level = logical
archive_command = 'test ! -f /db-archives/walfiles/%f && cp %p /db-archives/walfiles/%f'
Could someone help with this? Any pointers?
I am sure the DB backup will have info about the replication slot and connection info, as pg_basebackup itself is a clone of the entire DB. To revert the configs, I am manually removing postgresql.auto.conf in the main directory, which contains the above parameters.
So how can I remove any other references to the replication slot, if there are any in the DB backup?
These error messages don't seem to be thrown by recovery, but by some other tool that connects as database user rep_usr.
Create the replication slot if your application needs it!
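For a physical slot that would be, for example (using the slot name from the log messages):
SELECT pg_create_physical_replication_slot('slot_name');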
I removed all the configs and started fresh.
I removed main/postgresql.auto.conf, which was present in the backup.
main/postgresql.auto.conf is present on standby nodes when we take a pg_basebackup; it contains the configs used for pg_basebackup on the standby nodes (slot_name and connection info).
As I was restoring a backup from a standby to a master, I don't need that postgresql.auto.conf.
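For what it's worth, if any leftover physical slots do show up on the restored node, you can list and drop them (a sketch, using the slot name from the log messages):
SELECT slot_name, slot_type, active FROM pg_replication_slots;
SELECT pg_drop_replication_slot('slot_name');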

Postgres Streaming Replication Error: requested WAL segment has already been removed

I have set up streaming replication between a primary and a secondary server, and I have enabled archiving. In the Postgres log file I am seeing the errors below.
< 2017-12-05 03:08:45.374 UTC > WARNING: archive_mode enabled, yet archive_command is not set
< 2017-12-05 03:08:46.668 UTC > ERROR: requested WAL segment 0000000100000000000000E3 has already been removed
< 2017-12-05 03:08:51.675 UTC > ERROR: requested WAL segment 0000000100000000000000E3 has already been removed
< 2017-12-05 03:08:56.682 UTC > ERROR: requested WAL segment 0000000100000000000000E3 has already been removed
Do we need to enable archive_mode = on for streaming replication? How can I avoid the above errors? My settings:
max_wal_senders = 3
wal_keep_segments = 32
https://www.postgresql.org/docs/current/static/warm-standby.html
If you use streaming replication without file-based continuous
archiving, the server might recycle old WAL segments before the
standby has received them. If this occurs, the standby will need to be
reinitialized from a new base backup. You can avoid this by setting
wal_keep_segments to a value large enough to ensure that WAL segments
are not recycled too early, or by configuring a replication slot for
the standby. If you set up a WAL archive that's accessible from the
standby, these solutions are not required, since the standby can
always use the archive to catch up provided it retains enough
segments.
emphasis mine.
So either increase wal_keep_segments to a value big enough for your volume of changes, or configure an archive_command and set up some storage to keep WAL files removed from the master available to the slave, or configure a replication slot for the standby.
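For the first option, a sketch (postgresql.conf on the master; the value is an assumption to size against your own change rate):
wal_keep_segments = 256   # WAL segments are 16 MB each by default, so this retains about 4 GB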
In my case I had to reinit the replica in maintenance mode using the commands below, and that fixed the issue. The error was due to lag between the leader and the replica.
patronictl list
patronictl pause
patronictl reinit patroni
# choose the replica pod when prompted
patronictl resume

Can't start postgresql replication

We have PostgreSQL replication to a different server. Today I was doing some optimization on the replication cluster's postgresql.conf.
After making the changes, I restarted PostgreSQL with this command:
pg_ctlcluster 9.2 main2 restart
But instead of restarting, it gave this error:
The PostgreSQL server failed to start. Please check the log output.
And checking the log, I see this:
2015-06-16 12:18:16 EEST [10655]: [2-1] LOG: received smart shutdown request
2015-06-16 12:18:16 EEST [10661]: [2-1] FATAL: terminating walreceiver process due to administrator command
2015-06-16 12:18:16 EEST [10658]: [1-1] LOG: shutting down
2015-06-16 12:18:16 EEST [10658]: [2-1] LOG: database system is shut down
Checking the log now, it shows the last restart, but the log does not update when I try to start the server. It says to check the log, but there is no new information, even when I try to start the server again.
P.S. Do I need to do anything on master?
Update
Changing the postgresql.conf settings back allowed replication to start, but from the error it is hard to tell what was wrong.
Here are the settings I changed (after the change they were the same as on the master; only when I commented them out could I start replication):
shared_buffers = 1536MB
effective_cache_size = 3072MB
checkpoint_segments = 15
checkpoint_completion_target = 0.9
autovacuum = on
track_counts = on
work_mem = 25MB
So as I said, after commenting these out, I could start it. But I don't get why it won't start with these settings.
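For what it's worth, one classic cause on 9.2 and older (which use System V shared memory) is raising shared_buffers past the kernel's shared memory limit, which prevents the server from starting. A way to check the limits, assuming Linux:
sysctl kernel.shmmax kernel.shmall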
If I were you, and if an upgrade is an option, the first thing I would do is upgrade to PostgreSQL 9.4 (or newer). There's a good reason for doing this when it comes to replication: a new feature called "replication slots" (see the announcement).
In short: replication slots are more robust and easier to implement than WAL archiving (which you obviously use, according to your logs).
In this post you'll find a comprehensive guide on implementing the feature.

How do I fix a PostgreSQL 9.3 Slave that Cannot Keep Up with the Master?

We have a master-slave replication configuration as follows.
On the master:
postgresql.conf has replication configured as follows (commented line taken out for brevity):
max_wal_senders = 1
wal_keep_segments = 8
On the slave:
Same postgresql.conf as on the master. recovery.conf looks like this:
standby_mode = 'on'
primary_conninfo = 'host=master1 port=5432 user=replication password=replication'
trigger_file = '/tmp/postgresql.trigger.5432'
When this was initially set up, we performed some simple tests and confirmed that replication was working. However, when we did the initial data load, only some of the data made it to the slave.
Slave's log is now filled with messages that look like this:
< 2015-01-23 23:59:47.241 EST >LOG: started streaming WAL from primary at F/52000000 on timeline 1
< 2015-01-23 23:59:47.241 EST >FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 000000010000000F00000052 has already been removed
< 2015-01-23 23:59:52.259 EST >LOG: started streaming WAL from primary at F/52000000 on timeline 1
< 2015-01-23 23:59:52.260 EST >FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 000000010000000F00000052 has already been removed
< 2015-01-23 23:59:57.270 EST >LOG: started streaming WAL from primary at F/52000000 on timeline 1
< 2015-01-23 23:59:57.270 EST >FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 000000010000000F00000052 has already been removed
After some analysis and help on the #postgresql IRC channel, I've come to the conclusion that the slave cannot keep up with the master. My proposed solution is as follows.
On the master:
Set max_wal_senders=5
Set wal_keep_segments=4000. Yes, I know it is very high, but I'd like to monitor the situation and see what happens. I have room on the master.
On the slave:
Save configuration files in the data directory (i.e. pg_hba.conf pg_ident.conf postgresql.conf recovery.conf)
Clear out the data directory (rm -rf /var/lib/pgsql/9.3/data/*). This seems to be required by pg_basebackup.
Run the following command:
pg_basebackup -h master -D /var/lib/pgsql/9.3/data --username=replication --password
Am I missing anything ? Is there a better way to bring the slave up-to-date w/o having to reload all the data ?
Any help is greatly appreciated.
The two important options for dealing with the WAL for streaming replication:
wal_keep_segments should be set high enough to allow a slave to catch up after a reasonable lag (e.g. high update volume, slave being offline, etc...).
archive_mode enables WAL archiving which can be used to recover files older than wal_keep_segments provides. The slave servers simply need a method to retrieve the WAL segments. NFS is the simplest method, but anything from scp to http to tapes will work so long as it can be scripted.
# on master
archive_mode = on
archive_command = 'cp %p /path_to/archive/%f'
# on slave
restore_command = 'cp /path_to/archive/%f "%p"'
When the slave can't pull the WAL segment directly from the master, it will attempt to use the restore_command to load it. You can configure the slave to automatically remove segments using the archive_cleanup_command setting.
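For example, with the bundled pg_archivecleanup tool (in the slave's recovery.conf, reusing the archive path from above):
archive_cleanup_command = 'pg_archivecleanup /path_to/archive %r'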
If the slave comes to a situation where the next WAL segment it needs is missing from both the master and the archive, there will be no way to consistently recover the database. The only reasonable option then is to scrub the server and start again from a fresh pg_basebackup.
You can configure replication slots so that Postgres keeps the WAL segments needed by the replica registered in the slot.
Read more at https://www.percona.com/blog/2018/11/30/postgresql-streaming-physical-replication-with-slots/
On the master server, run:
SELECT pg_create_physical_replication_slot('standby_slot');
On the slave server, add the following line to recovery.conf:
primary_slot_name = 'standby_slot'
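One caveat with slots: the master retains WAL until the slot's consumer has received it, so a standby that stays offline for long can fill the master's disk. You can keep an eye on slots with, for example:
SELECT slot_name, active, restart_lsn FROM pg_replication_slots;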
Actually, to recover you don't have to drop the whole DB and start from scratch. Since the master has the up-to-date data, you can do the following to recover the slave and bring it back in sync:
psql -c "select pg_start_backup('initial_backup');"
rsync -cva --inplace --exclude=*pg_xlog* <data_dir> slave_IP_address:<data_dir>
psql -c "select pg_stop_backup();"
Note:
1. the slave has to be shut down (service stop)
2. pg_start_backup puts the master into backup mode, but it is not switched to read-only
3. the master can continue serving queries throughout
4. bring the slave back up at the end of the steps
I did this in prod; it worked perfectly for me.
The slave and master are in sync and there was no data loss.
You will get that error if the wal_keep_segments setting is too low.
When you set the value for wal_keep_segments, consider how long the pg_basebackup takes.
Remember that segments are generated about every 5 minutes, so if the backup takes an hour, you need at least 12 segments saved. At 2 hours, you need 24, etc. I would budget about 12 segments per hour of backup.
As Ben Grimm suggested in the comments, this is a question of making sure to set segments to the maximum possible value to allow the slave to catch up.

PostgreSQL hot standby won't start after deleting WAL files

I have a PostgreSQL 9.1 hot standby server. The WAL files ended up taking up the entire HD, so I deleted all the WAL files. Now I want to bring the server back up, so I ran:
/usr/local/pgsql/bin/pg_ctl start -D /usr/local/pgsql/data
The problem is the server never fully starts. I see this, followed by non-stop log output about missing WAL files:
server starting
sh-4.1$ LOG: database system was shut down in recovery at 2013-02-10 03:17:06 UTC
LOG: entering standby mode
cp: cannot stat `/usr/local/pgsql/wals/0000000100000035000000A4': No such file or directory
LOG: redo starts at 35/A4943928
LOG: consistent recovery state reached at 35/A4AE8EB8
LOG: database system is ready to accept read only connections
LOG: invalid record length at 35/A4AE8EB8
cp: cannot stat `/usr/local/pgsql/wals/0000000100000035000000A4': No such file or directory
LOG: streaming replication successfully connected to primary
FATAL: could not receive data from WAL stream: FATAL: requested WAL segment 0000000100000035000000A4 has already been removed
How can I get the server back?
You need to recreate the hot standby server from a base backup (a filesystem-level copy) of the primary.
Details on how to do that you can find in this manual: High Availability, Load Balancing, and Replication
Please check whether the required WALs are present in the location you are restoring from for the initial restore, before connecting the slave to the primary host in streaming mode. The above problem can happen when a WAL segment is unavailable in the location you restore from for initial recovery. If that is fine, then check your restore_command in recovery.conf.