log shipping error postgres - postgresql

I was setting up log shipping from PostgreSQL 9.0.4 (Red Hat) to 9.0.6 (Fedora 14),
but the standby fails to start with this error:
HINT: If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
LOG: entering standby mode
LOG: restored log file "000000010000000200000065" from archive
LOG: record with zero length at 2/65000100
WARNING: WAL was generated with wal_level=minimal, data may be missing
HINT: This happens if you temporarily set wal_level=minimal without taking a new base backup.
FATAL: hot standby is not possible because wal_level was not set to "hot_standby" on the master server
HINT: Either set wal_level to "hot_standby" on the master, or turn off hot_standby here.
LOG: startup process (PID 9438) exited with exit code 1
LOG: aborting startup due to startup process failure
ls ../archive/
000000010000000200000051 000000010000000200000059 00000001000000020000005F.00000020.backup
000000010000000200000052 000000010000000200000059.00000020.backup 000000010000000200000060
000000010000000200000053 00000001000000020000005A 000000010000000200000061
000000010000000200000054 00000001000000020000005B 000000010000000200000061.00000020.backup
000000010000000200000055 00000001000000020000005B.00000020.backup 000000010000000200000062
000000010000000200000055.00000020.backup 00000001000000020000005C 000000010000000200000063
000000010000000200000056 00000001000000020000005D 000000010000000200000064
000000010000000200000057 00000001000000020000005E 000000010000000200000065
000000010000000200000058 00000001000000020000005F
ls pg_xlog
000000010000000200000061.00000020.backup 000000010000000200000067 00000001000000020000006A archive_status
000000010000000200000065 000000010000000200000068 00000001000000020000006B RECOVERYXLOG
000000010000000200000066 000000010000000200000069 00000001000000020000006C
cat recovery.conf
### RECOVERY
standby_mode = 'on'
restore_command = 'cp -i /var/lib/pgsql/9.0/archive/%f %p'
When I remove the recovery.conf file from the data/ directory
and turn off hot_standby in postgresql.conf, I can start PostgreSQL and select data.
I want the secondary PostgreSQL server to start in hot_standby mode.
Can anyone tell me how to fix this?

Please, check postgresql.conf on your master database. According to your log:
WARNING: WAL was generated with wal_level=minimal, data may be missing
HINT: This happens if you temporarily set wal_level=minimal without taking a new base backup.
FATAL: hot standby is not possible because wal_level was not set to "hot_standby" on the master server
HINT: Either set wal_level to "hot_standby" on the master, or turn off hot_standby here.
The message is pretty informative. You should either set wal_level = hot_standby on the master database (this requires a restart of the master; consider taking a new full backup after turning it on), or set hot_standby = off on the standby side (this change requires nothing else).
In fact, to maintain a standby at all you need WAL at either the archive or the hot_standby level, per the documentation.
If you have activated your standby by removing recovery.conf and starting the cluster, you should re-create the standby from the latest full backup.
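As a minimal sketch of the check suggested above, the following helper reads a setting from a postgresql.conf-style file. The helper name `conf_value` and the example path are assumptions for illustration; on a running server, `SHOW wal_level;` in psql reports the value actually in effect.

```shell
# conf_value FILE KEY: print the last uncommented value of KEY in a
# postgresql.conf-style file (hypothetical helper, not part of PostgreSQL).
conf_value() {
    sed -n "s/^[[:space:]]*$2[[:space:]]*=[[:space:]]*//p" "$1" |
        sed "s/[[:space:]]*#.*\$//; s/^'//; s/'\$//" |
        tail -n 1
}

# Usage (the path is an assumption for a 9.0 RPM install):
#   conf_value /var/lib/pgsql/9.0/data/postgresql.conf wal_level
# On the master this should print hot_standby before the base backup is taken.
```

The file only shows what is configured; if wal_level was edited without restarting the master, the running server still uses the old value.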

Related

wal-e backup-push not terminating (waiting for required WAL segments to be archived)

I'm trying to set up wal-e for Postgres.
I've been following various tutorials and am finally at the point of doing a first backup of a clean 9.6 Postgres install.
The initial wal-e backup-push is run as follows:
sudo -u postgres -i
envdir /etc/wal-e.d/env wal-e backup-push /var/lib/postgresql/9.6/main
I'd expect the command to terminate rather quickly, since it's an empty database.
However, it seems to wait indefinitely, showing "waiting for required WAL segments to be archived":
postgres@postgres:~$ envdir /etc/wal-e.d/env wal-e backup-push /var/lib/postgresql/9.6/main
wal_e.main INFO MSG: starting WAL-E
DETAIL: The subcommand is "backup-push".
STRUCTURED: time=2017-05-26T11:45:52.138889-00 pid=10426
wal_e.operator.backup INFO MSG: start upload postgres version metadata
DETAIL: Uploading to s3://xxxxxx/basebackups_005/base_000000010000000000000060_00000040/extended_version.txt.
STRUCTURED: time=2017-05-26T11:45:52.719220-00 pid=10426
wal_e.operator.backup INFO MSG: postgres version metadata upload complete
STRUCTURED: time=2017-05-26T11:45:52.929696-00 pid=10426
wal_e.worker.upload INFO MSG: beginning volume compression
DETAIL: Building volume 0.
STRUCTURED: time=2017-05-26T11:45:53.075771-00 pid=10426
wal_e.worker.upload INFO MSG: begin uploading a base backup volume
DETAIL: Uploading to "s3://xxxxxx/basebackups_005/base_000000010000000000000060_00000040/tar_partitions/part_00000000.tar.lzo".
STRUCTURED: time=2017-05-26T11:45:53.752390-00 pid=10426
wal_e.worker.upload INFO MSG: finish uploading a base backup volume
DETAIL: Uploading to "s3://xxxxxx/basebackups_005/base_000000010000000000000060_00000040/tar_partitions/part_00000000.tar.lzo" complete at 9106.47KiB/s.
STRUCTURED: time=2017-05-26T11:45:54.327037-00 pid=10426
NOTICE: pg_stop_backup cleanup done, waiting for required WAL segments to be archived
WARNING: pg_stop_backup still waiting for all required WAL segments to be archived (60 seconds elapsed)
HINT: Check that your archive_command is executing properly. pg_stop_backup can be canceled safely, but the database backup will not be usable without all the WAL segments.
WARNING: pg_stop_backup still waiting for all required WAL segments to be archived (120 seconds elapsed)
HINT: Check that your archive_command is executing properly. pg_stop_backup can be canceled safely, but the database backup will not be usable without all the WAL segments.
To be honest, I'm a bit out of my depth here. As far as I know, the above should do an initial backup and not wait for WAL files, which are obviously continuously updated. (I've got archive_mode = on in my Postgres config and set archive_command = 'envdir /etc/wal-e.d/env wal-e wal-push %p', which should do the incremental pushes. Again, as far as I know.)
How can I get this initial backup command to finish?
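The HINT in the log points at archive_command. One way to see whether archiving is stuck is to count the segments still waiting to be archived, which PostgreSQL marks with `.ready` files under `pg_xlog/archive_status`. This is only a diagnostic sketch; the function name `pending_wal_count` is hypothetical.

```shell
# pending_wal_count DATADIR: number of WAL segments still waiting for
# archive_command to succeed (hypothetical helper name).
pending_wal_count() {
    find "$1/pg_xlog/archive_status" -maxdepth 1 -name '*.ready' | wc -l | tr -d ' '
}

# Usage, as the postgres user:
#   pending_wal_count /var/lib/postgresql/9.6/main
# On 9.4 and later the archiver statistics show the last failure:
#   psql -c 'SELECT failed_count, last_failed_wal FROM pg_stat_archiver;'
```

A count that stays above zero while pg_stop_backup is waiting suggests that archive_command is failing; the PostgreSQL server log should then contain the error printed by the command.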

WAL archive: FAILED (please make sure WAL shipping is setup)

I am trying to configure Barman to backup. When I do a barman check replica I keep getting:
Server replica:
WAL archive: FAILED (please make sure WAL shipping is setup)
PostgreSQL: OK
superuser: OK
wal_level: OK
directories: OK
retention policy settings: OK
backup maximum age: FAILED (interval provided: 1 day, latest backup age: No available backups)
compression settings: OK
failed backups: OK (there are 0 failed backups)
minimum redundancy requirements: FAILED (have 0 backups, expected at least 2)
ssh: OK (PostgreSQL server)
not in recovery: FAILED (cannot perform exclusive backup on a standby)
archive_mode: OK
archive_command: OK
continuous archiving: OK
archiver errors: OK
I am using PostgreSQL 9.6 and Barman 2.1. I am not sure what the issue is; could someone help?
Here is my Barman server configuration:
description = "Database backup"
conninfo = host=<db-ip> user=postgres dbname=db
backup_method = rsync
ssh_command = ssh postgres@<db-ip>
archiver = on
barman check tries to confirm that archiving is set up correctly by asserting that there's actually something in the archive. However, WAL segments are generally only archived once they're filled up, and if your server is idle, this is never going to happen.
To work around this, Barman provides a command to force a segment switch, wait for the completed WAL to show up, and then archive it immediately:
barman switch-xlog --force --archive replica
In brief: another possible cause is a mismatch between Barman's incoming_wals_directory and the destination path in postgresql.conf's archive_command.
In detail, these two settings must point at the same directory:
Barman's incoming_wals_directory
postgresql.conf's archive_command
Shell commands to check:
barman@backup $ barman show-server pg | grep incoming_wals_directory
# output1
# > incoming_wals_directory: /var/lib/barman/pg/incoming
postgres@pg $ grep archive_command /etc/postgresql/10/main/postgresql.conf
# output2
# > archive_command = 'rsync -a %p barman@staging:/var/lib/barman/pg/incoming/%f'
The path must be the same in output1 and output2.
If they differ, make them match, and don't forget to restart (or at least reload) PostgreSQL afterward.
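The comparison can also be scripted. The sketch below assumes an rsync-style archive_command whose destination ends in `host:/path/%f`, as in the example output; both function names are hypothetical.

```shell
# archive_dest_dir COMMAND: extract the remote directory from an rsync-style
# archive_command ending in "host:/path/%f" (hypothetical helper).
archive_dest_dir() {
    printf '%s\n' "$1" | sed -n "s|.*:\(/[^ ']*\)/%f.*|\1|p"
}

# same_incoming_dir COMMAND DIR: succeed if COMMAND ships WAL into DIR.
same_incoming_dir() {
    [ "$(archive_dest_dir "$1")" = "$2" ]
}

# Usage:
#   same_incoming_dir "rsync -a %p barman@staging:/var/lib/barman/pg/incoming/%f" \
#       /var/lib/barman/pg/incoming && echo "paths match"
```

An archive_command with a different shape (for example one using barman-wal-archive) would need a different pattern.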

Postgres replication not starting due to wal error

I am using Postgres version 9.3.2 on two servers: one master, one slave.
I am setting up replication as follows:-
On master:-
sudo -u postgres psql -c "CREATE USER replicator REPLICATION LOGIN ENCRYPTED PASSWORD 'FOO';"
Edit postgresql.conf
listen_addresses = '*'
wal_level = hot_standby
max_wal_senders = 32
checkpoint_segments = 8
wal_keep_segments = 100
Edit pg_hba.conf
hostssl replication replicator <SLAVE IP>/32 md5
On Slave:-
Edit postgresql.conf
wal_level = hot_standby
max_wal_senders = 3
checkpoint_segments = 8
wal_keep_segments = 8
hot_standby = on
Run
sudo service postgresql stop
sudo -u postgres rm -rf /var/lib/postgresql/9.3/main
sudo -u postgres pg_basebackup -h <MASTER IP> -D /var/lib/postgresql/9.3/main -U replicator -v -P
Create /var/lib/postgresql/9.3/main/recovery.conf
standby_mode = 'on'
primary_conninfo = 'host=<MASTER IP> port=5432 user=replicator password=FOO sslmode=require'
trigger_file = '/tmp/postgresql.trigger'
Run:-
sudo service postgresql restart
When I restart postgres on the slave I get this error message:-
LOG: database system was shut down at 2015-01-14 09:10:50 GMT
2015-01-14 09:11:01 GMT [16741-2] LOG: entering standby mode
2015-01-14 09:11:01 GMT [16741-3] WARNING: WAL was generated with wal_level=minimal, data may be missing
2015-01-14 09:11:01 GMT [16741-4] HINT: This happens if you temporarily set wal_level=minimal without taking a new base backup.
2015-01-14 09:11:01 GMT [16741-5] FATAL: hot standby is not possible because wal_level was not set to "hot_standby" on the master server
2015-01-14 09:11:01 GMT [16741-6] HINT: Either set wal_level to "hot_standby" on the master, or turn off hot_standby here.
2015-01-14 09:11:01 GMT [16740-1] LOG: startup process (PID 16741) exited with exit code 1
2015-01-14 09:11:01 GMT [16740-2] LOG: aborting startup due to startup process failure
... failed!
Why is this happening? I have checked and rechecked that wal_level is set to hot_standby on the master, and running SHOW ALL on the master confirms it. I am at a loss as to what I am doing wrong here.
You have to restart the primary database and then let the standby replay WAL generated after that restart: the WAL file it is currently replaying was generated before wal_level = hot_standby took effect.
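The standby decides this from the wal_level recorded in the control file and WAL it received, not from the master's current postgresql.conf, which is why SHOW ALL on the master can look correct. A sketch for inspecting the recorded value follows. It assumes pg_controldata prints a line ending in "wal_level setting:" (the label is "Current wal_level setting:" on 9.3), and the function name is hypothetical.

```shell
# controldata_wal_level: read pg_controldata output on stdin and print the
# wal_level recorded in the control file (hypothetical helper).
controldata_wal_level() {
    sed -n 's/^.*wal_level setting:[[:space:]]*//p'
}

# Usage on the freshly copied data directory (Debian 9.3 paths assumed):
#   /usr/lib/postgresql/9.3/bin/pg_controldata /var/lib/postgresql/9.3/main \
#       | controldata_wal_level
```

If this prints minimal, the base backup predates the restart that applied wal_level = hot_standby, and a new base backup taken after the restart is needed.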

PostgreSQL 9.3 replication not starting after pg_basebackup completes

I am trying to create a hot_standby server, and I receive the following error after pg_basebackup completes. Notice I use a shell script, replicator.sh, to start the replication. Can anyone give me some insight?
My specs:
Debian Wheezy 7.6
Postgresql 9.3
Database size: ~115GB
Error:
postgres@database-master:/etc/postgresql/9.3/main$ sh replicator.sh
Stopping PostgreSQL
[ ok ] Stopping PostgreSQL 9.3 database server: main.
Cleaning up old cluster directory
Starting base backup as replicator
Password:
113720266/113720266 kB (100%), 1/1 tablespace
NOTICE: WAL archiving is not enabled; you must ensure that all required WAL segments are copied through other means to complete the backup
pg_basebackup: base backup completed
Starting Postgresql
[....] Starting PostgreSQL 9.3 database server: main[....] The PostgreSQL server failed to start.
Please check the log output:
2014-09-11 17:56:33 UTC LOG: database system was interrupted; last known up at 2014-09-11 16:54:29 UTC
2014-09-11 17:56:33 UTC LOG: creating missing WAL directory "pg_xlog/archive_status"
2014-09-11 17:56:33 UTC LOG: incomplete startup packet
2014-09-11 17:56:33 UTC LOG: invalid checkpoint record
2014-09-11 17:56:33 UTC FATAL: could not locate required checkpoint record
2014-09-11 17:56:33 UTC HINT: If you are not restoring from a backup, try removing the file "/var/lib/postgresql/9.3/main/backup_label".
2014-09-11 17:56:33 UTC LOG: startup process (PID 21972) exited with exit code 1
2014-09-11 17:56:33 UTC LOG: aborting startup due to startup process failure
... failed! failed!
Contents of replicator.sh:
#!/bin/bash
echo Stopping PostgreSQL
/etc/init.d/postgresql stop
echo Cleaning up old cluster directory
rm -rf /var/lib/postgresql/9.3/main
echo Starting base backup as replicator
pg_basebackup -h 123.456.789.123 -D /var/lib/postgresql/9.3/main -U replicator -v -P
echo Writing recovery.conf file
sudo -u postgres bash -c "cat > /var/lib/postgresql/9.3/main/recovery.conf <<- _EOF1_
standby_mode = 'on'
primary_conninfo = 'host=123.456.789.123 port=5432 user=replicator password=XXXXX sslmode=require'
trigger_file = '/tmp/postgresql.trigger'
_EOF1_
"
echo Starting Postgresql
/etc/init.d/postgresql start
Thank you,
Jake
My best guess from the above is that pg_basebackup failed and your shell script neither checks return codes nor uses set -e to abort after errors, so it just carried on regardless.
It's also possible that you don't have WAL archiving configured, or don't have a restore_command set on the replica. In that case, the transaction logs required to start up from the base backup will not be available and startup will fail.
I strongly recommend that you:
Use pg_basebackup -X stream so that the required transaction logs get copied along with the backup; and
Use set -e in your shell script, or test for errors with a suitable if ! pg_basebackup .... ; then block.
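The two recommendations could be combined roughly as follows. This is a sketch, not a drop-in replacement: MASTER_HOST, the DRY_RUN switch, the RUN_REBUILD guard and the function names are assumptions added for illustration, and the password is left out of primary_conninfo on the assumption that a ~/.pgpass entry exists.

```shell
#!/bin/sh
# Sketch of replicator.sh with error handling (placeholders throughout).
set -eu

PGDATA=${PGDATA:-/var/lib/postgresql/9.3/main}
MASTER_HOST=${MASTER_HOST:-master.example.invalid}

# run CMD...: print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

write_recovery_conf() {
    cat > "$PGDATA/recovery.conf" <<EOF
standby_mode = 'on'
primary_conninfo = 'host=$MASTER_HOST port=5432 user=replicator sslmode=require'
trigger_file = '/tmp/postgresql.trigger'
EOF
}

rebuild_standby() {
    run /etc/init.d/postgresql stop
    run rm -rf "$PGDATA"
    # -X stream copies the WAL needed to make the backup consistent.
    if ! run pg_basebackup -h "$MASTER_HOST" -D "$PGDATA" -U replicator -X stream -v -P; then
        echo "pg_basebackup failed; not starting PostgreSQL" >&2
        return 1
    fi
    run write_recovery_conf
    run /etc/init.d/postgresql start
}

# Only act when explicitly asked to, so the file can be sourced safely.
if [ "${RUN_REBUILD:-0}" = "1" ]; then
    rebuild_standby
fi
```

Running `DRY_RUN=1 RUN_REBUILD=1 sh replicator.sh` prints the commands without executing them, which is a cheap way to review the sequence before deleting a data directory.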

PostgreSQL 9.1 streaming replication restore_command: special meaning of exit code 255?

I have a PostgreSQL 9.1.3 streaming replication setup on Ubuntu 10.04.2 LTS (primary and standby). Replication is initialized with a streamed base backup (pg_basebackup). The restore_command script tries to fetch the required WAL archives from a remote archive location with rsync.
Everything works as described in the documentation when the restore_command script fails with an exit code other than 255:
At startup, the standby begins by restoring all WAL available in the archive location, calling restore_command. Once it reaches the end of WAL available there and restore_command fails, it tries to restore any WAL available in the pg_xlog directory. If that fails, and streaming replication has been configured, the standby tries to connect to the primary server and start streaming WAL from the last valid record found in archive or pg_xlog. If that fails or streaming replication is not configured, or if the connection is later disconnected, the standby goes back to step 1 and tries to restore the file from the archive again. This loop of retries from the archive, pg_xlog, and via streaming replication goes on until the server is stopped or failover is triggered by a trigger file.
But when the restore_command script fails with exit code 255 (because the exit code from a failed rsync call is returned by the script) the server process dies with the following error:
2012-05-09 23:21:30 CEST - # LOG: database system was interrupted; last known up at 2012-05-09 23:21:25 CEST
2012-05-09 23:21:30 CEST - # LOG: entering standby mode
rsync: connection unexpectedly closed (0 bytes received so far) [Receiver]
rsync error: unexplained error (code 255) at io.c(601) [Receiver=3.0.7]
2012-05-09 23:21:30 CEST - # FATAL: could not restore file "00000001000000000000003D" from archive: return code 65280
2012-05-09 23:21:30 CEST - # LOG: startup process (PID 8184) exited with exit code 1
2012-05-09 23:21:30 CEST - # LOG: aborting startup due to startup process failure
So my question is now: Is this a bug or is there a special meaning of exit code 255 which is missing in the otherwise excellent documentation or am I missing something else here?
On the primary server, you have WAL files sitting in the pg_xlog/ directory. While WAL files are there, PostgreSQL is able to deliver them to the standby should they be requested.
Typically, you also have a local WAL archive location. Once PostgreSQL has moved files there, they can no longer be delivered to the standby online, and the standby expects them to come from the WAL archive via restore_command.
If the primary and the standby are set up with different WAL archive locations, there is no way for a WAL file to reach the standby, and you have a gap.
In your case this might mean, that:
00000001000000000000003D had been archived by the primary PostgreSQL;
standby's restore_command doesn't see it from the configured source location.
You might consider manually copying the missing WAL files from the primary to the standby using scp or rsync. It might also be necessary to review your WAL locations and make sure both servers look in the same place.
EDIT:
Grepping for restore_command in the sources, only access/transam/xlog.c references it. In the function RestoreArchivedFile, almost at the end (around line 3115 in the 9.1.3 sources), there's a check whether restore_command exited normally or received a signal.
In the first case, the message is logged at DEBUG2. If restore_command received a signal other than SIGTERM (and, I guess, wasn't able to handle it properly), a FATAL error is reported. The same is true for all exit codes greater than 125.
I will not be able to tell you why though.
I recommend asking on the hackers list.
This looks like an rsync problem I encountered temporarily using NFS (with rpcbind/rstatd on port 837):
$ rsync -avz /var/backup/* backup@storage:/data/backups
rsync: connection unexpectedly closed (0 bytes received so far) [sender]
rsync error: unexplained error (code 255) at io.c(600) [sender=3.0.6]
This fixed it for me:
service rpcbind stop
I had the same issue creating a hot standby (postgres 9.5). Streaming was working (I seeded the standby via pg_basebackup using the same credentials as would later be used in the standby's recovery.conf).
After taking the basebackup, I setup the following recovery.conf:
standby_mode = 'on'
primary_conninfo = 'host=ip.of.master port=5432 user=pgstandby password=password'
recovery_target_timeline = 'latest'
restore_command = 'sftp -q user@ip.of.wal.archive.host:data/master_wal_archive/%f "%p"'
trigger_file = '/srv/pgsql/9.5/data/trigger'
Starting the server would yield:
2016-03-08 12:34:58.981 UTC (/)LOG: database system was interrupted; last known up at 2016-03-08 12:26:10 UTC
Couldn't read packet: Connection reset by peer
2016-03-08 12:34:59.525 UTC (/)FATAL: could not restore file "00000002.history" from archive: child process exited with exit code 255
2016-03-08 12:34:59.526 UTC (/)LOG: startup process (PID 26636) exited with exit code 1
2016-03-08 12:34:59.526 UTC (/)LOG: aborting startup due to startup process failure
If I removed the restore_command line from recovery.conf, the standby started up fine and began streaming WAL from the master.
I eventually traced the problem to not having added the standby postgres user's public key to the authorized_keys file on the WAL archive host. I'd also forgotten to add the WAL archive host's server fingerprint to the known_hosts file of the standby postgres user.
These two mistakes were (I assume) causing the sftp restore_command to exit with code 255. As tscho says, the Postgres docs suggest that if the restore_command exits with ANY non-zero value, Postgres will simply move on to trying to stream from the master rather than refusing to start. In reality this doesn't seem to be the case if the exit code is higher than a certain number (maybe 125, as vyegorov's source code grepping suggests?).
Once I fixed the two SSH issues, the standby started fine with the restore_command present in recovery.conf.
Here is the comment describing why this behavior for high exit status from the command process was chosen, and the current code to implement it.
/*
 * Remember, we rollforward UNTIL the restore fails so failure here is
 * just part of the process... that makes it difficult to determine
 * whether the restore failed because there isn't an archive to restore,
 * or because the administrator has specified the restore program
 * incorrectly. We have to assume the former.
 *
 * However, if the failure was due to any sort of signal, it's best to
 * punt and abort recovery. (If we "return false" here, upper levels will
 * assume that recovery is complete and start up the database!) It's
 * essential to abort on child SIGINT and SIGQUIT, because per spec
 * system() ignores SIGINT and SIGQUIT while waiting; if we see one of
 * those it's a good bet we should have gotten it too.
 *
 * On SIGTERM, assume we have received a fast shutdown request, and exit
 * cleanly. It's pure chance whether we receive the SIGTERM first, or the
 * child process. If we receive it first, the signal handler will call
 * proc_exit, otherwise we do it here. If we or the child process received
 * SIGTERM for any other reason than a fast shutdown request, postmaster
 * will perform an immediate shutdown when it sees us exiting
 * unexpectedly.
 *
 * Per the Single Unix Spec, shells report exit status > 128 when a called
 * command died on a signal. Also, 126 and 127 are used to report
 * problems such as an unfindable command; treat those as fatal errors
 * too.
 */
if (WIFSIGNALED(rc) && WTERMSIG(rc) == SIGTERM)
    proc_exit(1);

signaled = WIFSIGNALED(rc) || WEXITSTATUS(rc) > 125;

ereport(signaled ? FATAL : DEBUG2,
        (errmsg("could not restore file \"%s\" from archive: %s",
                xlogfname, wait_result_to_str(rc))));
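The "return code 65280" in the 9.1 log above is the raw wait status, with the exit code in the high byte, so it decodes to 255. Given the rule in the code, one possible workaround (an assumption of this note, not something proposed in the thread) is to wrap the transfer command so that an ssh or rsync connection failure, which exits with 255, is reported as an ordinary failure. The wrapper name `map_255` is hypothetical.

```shell
# map_255 CMD...: run CMD; report exit status 255 (typical for ssh/rsync
# connection failures) as 1, and pass every other status through unchanged.
map_255() {
    rc=0
    "$@" || rc=$?
    if [ "$rc" -eq 255 ]; then
        echo "restore helper: '$*' exited with 255, reporting 1" >&2
        return 1
    fi
    return "$rc"
}

# The raw wait status from the log decodes to exit code 255:
echo $((65280 >> 8))   # prints 255
```

Only 255 is remapped, so statuses 126, 127 and anything above 128 still abort recovery as the comment intends. The trade-off is that a persistent SSH misconfiguration then looks like "file not in archive" and the standby silently falls back to streaming.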