Streaming replication is failing with "WAL segment has already been moved" - postgresql

I am trying to implement Master/Slave streaming replication on Postgres 11.5. I ran the following steps -
On Master
select pg_start_backup('replication-setup',true);
On Slave
Stopped the postgres 11 database and ran
rsync -aHAXxv --numeric-ids --progress -e "ssh -T -o Compression=no -x" --exclude pg_wal --exclude postgresql.pid --exclude pg_log MASTER:/var/lib/postgresql/11/main/* /var/lib/postgresql/11/main
On Master
select pg_stop_backup();
On Slave
rsync -aHAXxv --numeric-ids --progress -e "ssh -T -o Compression=no -x" MASTER:/var/lib/postgresql/11/main/pg_wal/* /var/lib/postgresql/11/main/pg_wal
I created the recovery.conf file on slave ~/11/main folder
standby_mode = 'on'
primary_conninfo = 'user=postgres host=MASTER port=5432 sslmode=prefer sslcompression=1 krbsrvname=postgres'
primary_slot_name='my_repl_slot'
When I start Postgres on Slave, I get the error on both MASTER and SLAVE logs -
019-11-08 09:03:51.205 CST [27633] LOG: 00000: database system was interrupted; last known up at 2019-11-08 02:53:04 CST
2019-11-08 09:03:51.205 CST [27633] LOCATION: StartupXLOG, xlog.c:6388
2019-11-08 09:03:51.252 CST [27633] LOG: 00000: entering standby mode
2019-11-08 09:03:51.252 CST [27633] LOCATION: StartupXLOG, xlog.c:6443
2019-11-08 09:03:51.384 CST [27634] LOG: 00000: started streaming WAL from primary at 12DB/C000000 on timeline 1
2019-11-08 09:03:51.384 CST [27634] LOCATION: WalReceiverMain, walreceiver.c:383
2019-11-08 09:03:51.384 CST [27634] FATAL: XX000: could not receive data from WAL stream: ERROR: requested WAL segment 00000001000012DB0000000C has already been removed
2019-11-08 09:03:51.384 CST [27634] LOCATION: libpqrcv_receive, libpqwalreceiver.c:772
2019-11-08 09:03:51.408 CST [27635] LOG: 00000: started streaming WAL from primary at 12DB/C000000 on timeline 1
2019-11-08 09:03:51.408 CST [27635] LOCATION: WalReceiverMain, walreceiver.c:383
The problem is the START WAL - 00000001000012DB0000000C is available right until I run the pg_stop_backup() and is getting archived and no longer available, once the pg_stop_backup() is executed. So this is not an issue of the WAL being archived out due to low WAL_KEEP_SEGMENTS.
postgres#SLAVE:~/11/main/pg_wal$ cat 00000001000012DB0000000C.00000718.backup
START WAL LOCATION: 12DB/C000718 (file 00000001000012DB0000000C)
STOP WAL LOCATION: 12DB/F4C30720 (file 00000001000012DB000000F4)
CHECKPOINT LOCATION: 12DB/C000750
BACKUP METHOD: pg_start_backup
BACKUP FROM: master
START TIME: 2019-11-07 15:47:26 CST
LABEL: replication-setup-mdurbha
START TIMELINE: 1
STOP TIME: 2019-11-08 08:48:35 CST
STOP TIMELINE: 1
My MASTER has archive_command set, and I have the missing WALs available. I copied them into a restore directory on the SLAVE and tried the recovery.conf below, but it still fails with the MASTER reporting the same WAL segment has already been moved error.
Any idea how I can address this issue? I have used rsync to setup replication without any issues in the past on Postgres 9.6, but have been experiencing this issue on Postgres 11.
standby_mode = 'on'
primary_conninfo = 'user=postgres host=MASTER port=5432 sslmode=prefer sslcompression=1 krbsrvname=postgres'
restore_command='cp /var/lib/postgresql/restore/%f %p'

Put a restore_command into recovery.conf that can restore archived WAL files and you are fine.

Related

Couldn't restore postgres v11 from pg_basebackup

I'm making a backup from my postgresql database (v11) using pg_basebackup like this:
pg_basebackup -h localhost -p 5432 -U postgres -D /tmp/backup -Ft -z -Xs -P
The backup process is completed just fine without any errors.Then as a test I'd like to recover from the backup and for that I do the following steps:
Shutdown the postgres instance.
Delete everything from /var/lib/postgresql/data folder
rm -rf /var/lib/postgresl/data/*
Untar the base archive there:
tar xvf /tmp/backup/base.tar.gz -C /var/lib/postgresql/data
Untar the wal archives to a temporary directory
tar xvf /tmp/backup/pg_wal.tar.gz -C /tmp/archived_wals
Create a file recovery.conf in the /var/lib/postgresql/data folder:
touch /var/lib/postgresql/data/recovery.conf
chown postgres:postgres /var/lib/postgresql/data/recovery.conf
Specify the recovery command in the recovery.conf file:
restore_command = 'cp /path/archived_wals/%f "%p" '
Restart the postgres instance, and wait for the recovery to be finished.
However when postgres starts up I get the following error in the logs:
2020-02-04 11:34:52.599 UTC [1] LOG: listening on IPv4 address "0.0.0.0", port 5432
2020-02-04 11:34:52.599 UTC [1] LOG: listening on IPv6 address "::", port 5432
2020-02-04 11:34:52.613 UTC [1] LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2020-02-04 11:34:52.709 UTC [18] LOG: database system was interrupted; last known up at 2020-02-04 08:39:54 UTC
cp: can't stat '/tmp/archived_wals/00000002.history': No such file or directory
2020-02-04 11:34:52.735 UTC [18] LOG: starting archive recovery
2020-02-04 11:34:52.752 UTC [18] LOG: invalid checkpoint record
2020-02-04 11:34:52.752 UTC [18] FATAL: could not locate required checkpoint record
2020-02-04 11:34:52.752 UTC [18] HINT: If you are not restoring from a backup, try removing the file "/var/lib/postgresql/data/backup_label".
2020-02-04 11:34:52.753 UTC [1] LOG: startup process (PID 18) exited with exit code 1
2020-02-04 11:34:52.753 UTC [1] LOG: aborting startup due to startup process failure
2020-02-04 11:34:52.769 UTC [1] LOG: database system is shut down
I guess I'm missing something, maybe I need to do additional configuration to postgres to be able to restore a pg_basebackup backup?
Try this:
/pgsql-11/bin/pg_basebackup -D /pgdata/pg11/data -X stream --waldir=/pglog/pg11/wal_log/ -P -c fast -h remoteserver\localserver -U username
Let me know if that works.
After all, it was my bad.
I enclosed the destination placeholder within % instead of double quotes in recovery.conf.
However the way I described for the backup and restore is valid and since I corrected the typo it works just fine for me, so I leave the question as it is for further reference.

PostgreSQL 9.4.1 Switchover & Switchback without recover_target_timeline=latest

I have tested different scenarios to do switchover and switchback in postgreSQL 9.4.1 Version.
Scenario 1:- PostgreSQL Switchover and Switchback in 9.4.1
Scenario 2:- Is it mandatory parameter recover_target_timeline='latest' in switchover and switchback in PostgreSQL 9.4.1?
Scenario 3:- On this page
To test scenario 3 I have followed below steps to perform.
1) Stop the application connected to primary server.
2) Confirm all application was stopped and all thread was disconnected from primary DB.
#192.x.x.129(Primary)
3) Clean shutdown primary using
pg_ctl -D$PGDATA stop --mf
#DR(192.x.x.128) side check sync status:
postgres=# select pg_last_xlog_receive_location(),pg_last_xlog_replay_location();
-[ RECORD 1 ]-----------------+-----------
pg_last_xlog_receive_location | 4/57000090
pg_last_xlog_replay_location | 4/57000090
4)Stop DR server.DR(192.x.x.128)
pg_ctl -D $PGDATA stop -mf
pg_log:
2019-12-02 13:16:09 IST LOG: received fast shutdown request
2019-12-02 13:16:09 IST LOG: aborting any active transactions
2019-12-02 13:16:09 IST LOG: shutting down
2019-12-02 13:16:09 IST LOG: database system is shut down
#192.x.x.128(DR)
5) Make following changes on DR server.
mv recovery.conf recovery.conf_bkp
6)make changes in 192.x.x.129(Primary):
[postgres#localhost data]$ cat recovery.conf
standby_mode = 'on'
primary_conninfo = 'user=replication password=postgres host=192.x.x.128 port=5432 sslmode=prefer sslcompression=1 krbsrvname=postgres'
restore_command = 'cp %p /home/postgres/restore/%f'
trigger_file='/tmp/promote'
7)Start DR as read write mode:
pg_ctl -D $DATA start
pg_log:
2019-12-02 13:20:21 IST LOG: database system was shut down in recovery at 2019-12-02 13:16:09 IST
2019-12-02 13:20:22 IST LOG: database system was not properly shut down; automatic recovery in progress
2019-12-02 13:20:22 IST LOG: consistent recovery state reached at 4/57000090
2019-12-02 13:20:22 IST LOG: invalid record length at 4/57000090
2019-12-02 13:20:22 IST LOG: redo is not required
2019-12-02 13:20:22 IST LOG: database system is ready to accept connections
2019-12-02 13:20:22 IST LOG: autovacuum launcher started
(END)
We can see in above log OLD primary is now DR of Primary(Which was OLD DR) and not showing any error because timeline id same on new primary which is already exit in new DR.
8)Start Primary as read only mode:-
pg_ctl -D$PGDATA start
logs:
2019-12-02 13:24:50 IST LOG: database system was shut down at 2019-12-02 11:14:50 IST
2019-12-02 13:24:51 IST LOG: entering standby mode
cp: cannot stat ‘pg_xlog/RECOVERYHISTORY’: No such file or directory
cp: cannot stat ‘pg_xlog/RECOVERYXLOG’: No such file or directory
2019-12-02 13:24:51 IST LOG: consistent recovery state reached at 4/57000090
2019-12-02 13:24:51 IST LOG: record with zero length at 4/57000090
2019-12-02 13:24:51 IST LOG: database system is ready to accept read only connections
2019-12-02 13:24:51 IST LOG: started streaming WAL from primary at 4/57000000 on timeline 9
2019-12-02 13:24:51 IST LOG: redo starts at 4/57000090
(END)
Question 1:- In This scenario i have perform only switch-over to show you. using this method we can do switch-over and switchback. but using below method Switch-over-switchback is work, then why PostgreSQL Community invented recovery_target_timeline=latest and apply patches see blog: https://www.enterprisedb.com/blog/switchover-switchback-in-postgresql-9-3 from PostgrSQL 9.3...to latest version.
Question 2:- What mean to say in above log cp: cannot stat ‘pg_xlog/RECOVERYHISTORY’: No such file or directory ?
Question 3:- I want to make sure from scenarios 1 and scenario 3 which method/Scenarios is correct way to do switchover and switchback? because scenario 2 is getting error because we must use recover_target_timeline=latest which all community experts know.
Answers:
If you shut down the standby cleanly, then remove recovery.conf and restart it, it will come up, but has to perform crash recovery (database system was not properly shut down).
The proper way to promote a standby to a primary is by using the trigger file or running pg_ctl promote (or, from v12 on, by running the SQL function pg_promote). Then you have no down time and don't need to perform crash recovery.
Promoting the standby will make it pick a new time line, so you need recovery_target_timeline = 'latest' if you want the new standby to follow that time line switch.
That is caused by your restore_command.
The method shown in 1. above is the correct one.

Postgresql 10 logical replication not working

I install postgresql 10 using scripts:
$ wget -q https://www.postgresql.org/media/keys/ACCC4CF8.asc -O - | sudo apt-key add -
$ sudo sh -c 'echo "deb http://apt.postgresql.org/pub/repos/apt/ `lsb_release -cs`-pgdg main" >> /etc/apt/sources.list.d/pgdg.list'
$ sudo apt-get update
$ sudo apt-get install postgresql postgresql-contrib
On Master server : xx.xxx.xxx.xx-:
And after that in postgresql.conf:
set wal_level = logical
On Slave server -:
in postgresql.conf:
set wal_level = logical
And after all i use below query on master server:
create table t1 (id integer primary key, val text);
create user replicant with replication;
grant select on t1 to replicant;
insert into t1 (id, val) values (10, 'ten'), (20, 'twenty'), (30, 'thirty');
create publication pub1 for table t1;
And at Slave server:
create table t1 (id integer primary key, val text, val2 text);
create subscription sub1 connection 'dbname=dbsrc user=replicant' publication pub1;
But the problem which is i am facing is table are not syncing and as per logical replication when i insert new row on master server, slave server not getting that row.
I am new to postgresql please help me.
Thanks for your precious time.
Now here is my postgresql log for Master server:
2017-10-17 11:06:16.644 UTC [10713] replicant#postgres LOG: starting logical decoding for slot "sub_1"
2017-10-17 11:06:16.644 UTC [10713] replicant#postgres DETAIL: streaming transactions committing after 1/F45EB0C8, reading WAL from 1/F45EB0C8
2017-10-17 11:06:16.645 UTC [10713] replicant#postgres LOG: logical decoding found consistent point at 1/F45EB0C8
2017-10-17 11:06:16.645 UTC [10713] replicant#postgres DETAIL: There are no running transactions.
Here is my slave server postgresql log:
2017-10-17 19:14:45.622 CST [7820] WARNING: out of logical replication worker slots
2017-10-17 19:14:45.622 CST [7820] HINT: You might need to increase max_logical_replication_workers.
2017-10-17 19:14:45.670 CST [7821] WARNING: out of logical replication worker slots
2017-10-17 19:14:45.670 CST [7821] HINT: You might need to increase max_logical_replication_workers.
2017-10-17 19:14:45.680 CST [7822] WARNING: out of logical replication worker slots
2017-10-17 19:14:45.680 CST [7822] HINT: You might need to increase max_logical_replication_workers.
2017-10-17 19:14:50.865 CST [7820] WARNING: out of logical replication worker slots
2017-10-17 19:14:50.865 CST [7820] HINT: You might need to increase max_logical_replication_workers.
2017-10-17 19:14:50.917 CST [7821] WARNING: out of logical replication worker slots
2017-10-17 19:14:50.917 CST [7821] HINT: You might need to increase max_logical_replication_workers.
2017-10-17 19:14:50.928 CST [7822] WARNING: out of logical replication worker slots
2017-10-17 19:14:50.928 CST [7822] HINT: You might need to increase max_logical_replication_workers.
2017-10-17 19:14:55.871 CST [7820] WARNING: out of logical replication worker slots
2017-10-17 19:14:55.871 CST [7820] HINT: You might need to increase max_logical_replication_workers.
And after increasing the max_logical_replication_workers i am getting this:
2017-10-17 19:44:45.898 CST [7987] LOG: logical replication table synchronization worker for subscription "sub2", table "t1" has started
2017-10-17 19:44:45.982 CST [7988] LOG: logical replication table synchronization worker for subscription "myadav_test", table "test_replication" h$
2017-10-17 19:44:45.994 CST [7989] LOG: logical replication table synchronization worker for subscription "sub3", table "t1" has started
2017-10-17 19:44:48.621 CST [7987] ERROR: could not start initial contents copy for table "staging.t1": ERROR: permission denied for schema staging
2017-10-17 19:44:48.623 CST [7962] LOG: worker process: logical replication worker for subscription 20037 sync 20027 (PID 7987) exited with exit co$
2017-10-17 19:44:48.705 CST [7988] ERROR: could not start initial contents copy for table "staging.test_replication": ERROR: permission denied for$
2017-10-17 19:44:48.707 CST [7962] LOG: worker process: logical replication worker for subscription 20025 sync 20016 (PID 7988) exited with exit co$
2017-10-17 19:44:48.717 CST [7989] ERROR: duplicate key value violates unique constraint "t1_pkey"
2017-10-17 19:44:48.717 CST [7989] DETAIL: Key (id)=(10) already exists.
2017-10-17 19:44:48.717 CST [7989] CONTEXT: COPY t1, line 1
2017-10-17 19:44:48.718 CST [7962] LOG: worker process: logical replication worker for subscription 20038 sync 20027 (PID 7989) exited with exit co$
2017-10-17 19:44:51.629 CST [8008] LOG: logical replication table synchronization worker for subscription "sub2", table "t1" has started
2017-10-17 19:44:51.712 CST [8009] LOG: logical replication table synchronization worker for subscription "myadav_test", table "test_replication" h$
2017-10-17 19:44:51.722 CST [8010] LOG: logical replication table synchronization worker for subscription "sub3", table "t1" has started
Now i finally realize that logical replication is working for postgres database but not for my other database on same server. I am getting permission issue on schema that is is log.
The row changes are applied using the rights of the user who owns the subscription. By default that's the user who created the subscription.
So make sure the subscription is owned by a user with sufficient rights. Grant needed rights to tables, or if you can't be bothered, make the subscription owned by a superuser who has full rights to everything.
See:
CREATE SUBSCRIPTION
logical replication - security
logical replication

Postgres replication not starting due to wal error

I am using postgres version 9.3.2 on two servers one master, one primary.
I am setting up replication as follows:-
On master:-
sudo -u postgres psql -c "CREATE USER replicator REPLICATION LOGIN ENCRYPTED PASSWORD 'FOO’;"
Edit postgresql.conf
listen_address = '*'
wal_level = hot_standby
max_wal_senders = 32
checkpoint_segments = 8
wal_keep_segments = 100
Edit pg_hba.conf
hostssl replication replicator <SLAVE IP>/32 md5
On Slave:-
Edit postgresql.conf
wal_level = hot_standby
max_wal_senders = 3
checkpoint_segments = 8
wal_keep_segments = 8
hot_standby = on
Run
sudo service postgresql stop
sudo -u postgres rm -rf /var/lib/postgresql/9.3/main
sudo -u postgres pg_basebackup -h <MASTER IP> -D /var/lib/postgresql/9.3/main -U replicator -v -P
CREATE /var/lib/postgresql/9.3/main/recovery.conf
standby_mode = 'on'
primary_conninfo = 'host=<MASTER IP> port=5432 user=replicator password=FOO sslmode=require'
trigger_file = '/tmp/postgresql.trigger'
Run:-
sudo service postgresql restart
When I restart postgres on the slave I get this error message:-
LOG: database system was shut down at 2015-01-14 09:10:50 GMT 2015-01-14 09:11:01 GMT [16741-2]
LOG: entering standby mode 2015-01-14 09:11:01 GMT [16741-3] WARNING: WAL was generated with wal_level=minimal, data may be missing 2015-01-14 09:11:01 GMT [16741-4] HINT: This happens if you temporarily set wal_level=minimal without taking a new base backup. 2015-01-14 09:11:01 GMT [16741-5]
FATAL: hot standby is not possible because wal_level was not set to "hot_standby" on the master server 2015-01-14 09:11:01 GMT [16741-6] HINT: Either set wal_level to "hot_standby" on the master, or turn off hot_standby here. 2015-01-14 09:11:01 GMT [16740-1]
LOG: startup process (PID 16741) exited with exit code 1 2015-01-14 09:11:01 GMT [16740-2] LOG: aborting startup due to startup process failure ... failed!
Why is this happening? I have checked and rechecked that on the master wal_level is set to hot_standby. On the master running "show all" shows that this is the case? I am at a loss as to what I am doing wrong here.
You have to restart primary database again to let the current WAL file replayed on the standby, since this WAL file was generated when wal_level=archive.

Postresql 9.3 replication not starting after pg_basebackup completes

I am trying to create a hot_standby server, and I receive the following error after pg_basebackup completes. Notice I use a shell script, replicator.sh, to start the replication. Can anyone give me some insight?
My specs:
Debian Wheezy 7.6
Postgresql 9.3
Database size: ~115GB
Error:
postgres#database-master:/etc/postgresql/9.3/main$ sh replicator.sh
Stopping PostgreSQL
[ ok ] Stopping PostgreSQL 9.3 database server: main.
Cleaning up old cluster directory
Starting base backup as replicator
Password:
113720266/113720266 kB (100%), 1/1 tablespace
NOTICE: WAL archiving is not enabled; you must ensure that all required WAL segments are copied through other means to complete the backup
pg_basebackup: base backup completed
Starting Postgresql
[....] Starting PostgreSQL 9.3 database server: main[....] The PostgreSQL server failed to start.
Please check the log output: 2014-09-11 17:56:33 UTC LOG: database system was interrupted; last
known up at 2014-09-11 16:54:29 UTC 2014-09-11 17:56:33 UTC LOG: creating missing WAL directory
"pg_xlog/archive_status" 2014-09-11 17:56:33 UTC LOG: incomplete startup packet 2014-09-11 17:56:33
UTC LOG: invalid checkpoint record 2014-09-11 17:56:33 UTC FATAL: could not locate required
checkpoint record 2014-09-11 17:56:33 UTC HINT: If you are not restoring from a backup, try
removing the file "/var/lib/p[FAILesql/9.3/main/backup_label". 2014-09-11 17:56:33 UTC LOG: startup
process (PID 21972) exited with exit code 1 2014-09-11 17:56:33 UTC LOG: aborting startup due to
startup process failure ... failed! failed!
Contents of replicator.sh:
#!/bin/bash
echo Stopping PostgreSQL
/etc/init.d/postgresql stop
echo Cleaning up old cluster directory
rm -rf /var/lib/postgresql/9.3/main
echo Starting base backup as replicator
pg_basebackup -h 123.456.789.123 -D /var/lib/postgresql/9.3/main -U replicator -v -P
echo Writing recovery.conf file
sudo -u postgres bash -c "cat > /var/lib/postgresql/9.3/main/recovery.conf <<- _EOF1_
standby_mode = 'on'
primary_conninfo = 'host=123.456.789.123 port=5432 user=replicator password=XXXXX sslmode=require'
trigger_file = '/tmp/postgresql.trigger'
_EOF1_
"
echo Starting Postgresql
/etc/init.d/postgresql start
Thank you,
Jake
My best guess from the above is that the pg_basebackup failed and your shell script doesn't check for error return codes or use set -e to automatically abort after errors, so it just carried on regardless.
It's also possible that you don't have WAL archiving configured, or don't have a restore_command set in the replica. In that case, the transaction logs required to start the base backup will not be available and startup will fail.
I strongly recommend that you:
Use pg_basebackup -X stream so that the required transaction logs get copied along with the backup; and
Use set -e in your shell script, or test for errors with a suitable if ! pg_basebackup .... ; then block.