Unable to back up using Barman due to systemid error - PostgreSQL

I am trying to back up using the Barman command barman backup pg, but it shows an error:
ERROR: Impossible to start the backup. Check the log for more details, or run 'barman check pg'
I then checked with barman check pg and found another error:
systemid coherence: FAILED. Next I compared the systemid of the PostgreSQL server with the one stored by Barman and found that they are different.
What do I need to do in this case?
I removed the identity.json file from Barman, and somehow that solved my issue. But I am not sure whether that is the right way to solve this.
What is the actual use of identity.json? I am looking for an expert opinion.
Server pg:
PostgreSQL: OK
superuser or standard user with backup privileges: OK
PostgreSQL streaming: OK
wal_level: OK
replication slot: OK
directories: OK
retention policy settings: OK
backup maximum age: OK (interval provided: 1 day, latest backup age: 2 hours, 57 minutes, 55 seconds)
backup minimum size: OK (876.1 MiB)
wal maximum age: OK (no last_wal_maximum_age provided)
wal size: OK (31.5 KiB)
compression settings: OK
failed backups: OK (there are 0 failed backups)
minimum redundancy requirements: OK (have 3 backups, expected at least 1)
ssh: OK (PostgreSQL server)
systemid coherence: FAILED (the system Id of the connected PostgreSQL server changed, stored in "/var/lib/barman/pg/identity.json")
pg_receivexlog: OK
pg_receivexlog compatible: OK
receive-wal running: OK
archive_mode: OK
archive_command: OK
continuous archiving: OK
archiver errors: FAILED (duplicates: 50)
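For context, the identifier Barman stores can be compared with the live cluster's by hand; a minimal sketch, reusing the identity.json path from the check output above (the Debian-style pg_controldata binary and data directory paths are assumptions and vary by install):
barman@backup $ cat /var/lib/barman/pg/identity.json
# on the database host; pg_controldata prints a "Database system identifier" line:
postgres@pg $ /usr/lib/postgresql/10/bin/pg_controldata /var/lib/postgresql/10/main | grep 'system identifier'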

Related

Recover Postgresql pgBarman

I've set up a PostgreSQL DB and I want to back it up.
I have one server with my main DB and one with Barman.
The whole setup is working; I can back up my DB with Barman.
I just don't understand how I can recover my DB to an exact point in time between the daily backups.
barman@ubuntu:~$ barman check main-db-server
WARNING: No backup strategy set for server 'main-db-server' (using default 'exclusive_backup').
WARNING: The default backup strategy will change to 'concurrent_backup' in the future. Explicitly set 'backup_options' to silence this warning.
Server main-db-server:
PostgreSQL: OK
is_superuser: OK
wal_level: OK
directories: OK
retention policy settings: OK
backup maximum age: OK (interval provided: 1 day, latest backup age: 9 minutes, 59 seconds)
compression settings: OK
failed backups: OK (there are 0 failed backups)
minimum redundancy requirements: OK (have 6 backups, expected at least 0)
ssh: OK (PostgreSQL server)
not in recovery: OK
systemid coherence: OK (no system Id available)
archive_mode: OK
archive_command: OK
continuous archiving: OK
archiver errors: OK
And when I back up my DB:
barman@ubuntu:~$ barman backup main-db-server
WARNING: No backup strategy set for server 'main-db-server' (using default 'exclusive_backup').
WARNING: The default backup strategy will change to 'concurrent_backup' in the future. Explicitly set 'backup_options' to silence this warning.
Starting backup using rsync-exclusive method for server main-db-server in /var/lib/barman/main-db-server/base/20210427T150505
Backup start at LSN: 0/1C000028 (00000005000000000000001C, 00000028)
Starting backup copy via rsync/SSH for 20210427T150505
Copy done (time: 2 seconds)
Asking PostgreSQL server to finalize the backup.
Backup size: 74.0 MiB. Actual size on disk: 34.9 KiB (-99.95% deduplication ratio).
Backup end at LSN: 0/1C0000C0 (00000005000000000000001C, 000000C0)
Backup completed (start time: 2021-04-27 15:05:05.289717, elapsed time: 11 seconds)
Processing xlog segments from file archival for main-db-server
00000005000000000000001B
00000005000000000000001C
00000005000000000000001C.00000028.backup
I don't know how to restore my DB to a point in time between two backups :/
Thanks
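For what it's worth, point-in-time recovery between two base backups is normally done with barman recover and a recovery target; a minimal sketch, where the target time and destination path are placeholders (add --remote-ssh-command when restoring onto the database host rather than the Barman host):
barman@ubuntu:~$ barman list-backup main-db-server
barman@ubuntu:~$ barman recover --target-time "2021-04-27 18:00:00" main-db-server 20210427T150505 /var/lib/postgresql/data
Barman restores the chosen base backup together with the archived WALs, and PostgreSQL then replays WAL up to the requested timestamp on startup.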

Upgrading postgresql from version 11 to 12 results in "Your installation references loadable libraries that are missing"

I ran
brew postgresql-upgrade-database
and after a long series of updates the final results are:
==> Migrating and upgrading data...
Performing Consistency Checks
-----------------------------
Checking cluster versions ok
Checking database user is the install user ok
Checking database connection settings ok
Checking for prepared transactions ok
Checking for reg* data types in user tables ok
Checking for contrib/isn with bigint-passing mismatch ok
Checking for tables WITH OIDS ok
Checking for invalid "sql_identifier" user columns ok
Creating dump of global objects ok
Creating dump of database schemas
ok
Checking for presence of required libraries fatal
Your installation references loadable libraries that are missing from the
new installation. You can add these libraries to the new installation,
or remove the functions using them from the old installation. A list of
problem libraries is in the file:
loadable_libraries.txt
Failure, exiting
Error: Upgrading postgresql data from 11 to 12 failed!
==> Removing empty postgresql initdb database...
==> Moving postgresql data back from /usr/local/var/postgres.old to /usr/local/var/postgres...
==> Successfully started `postgresql` (label: homebrew.mxcl.postgresql)
Error: Failure while executing; `/usr/local/opt/postgresql/bin/pg_upgrade -r -b /usr/local/Cellar/postgresql@11/11.9/bin -B /usr/local/opt/postgresql/bin -d /usr/local/var/postgres.old -D /usr/local/var/postgres -j 16` exited with 1.
The only notes about this I could find were about pg developers mulling whether to print out the offending databases involved; it was unclear what the fix would be. Any hints?
Update: there is no loadable_libraries.txt in the directory from which this was run.
I am creating an answer based on the comment from @LaurenzAlbe. The loadable_libraries.txt was not placed in the current directory from which the command was run. Instead, let's go find where it went:
$ find / -name loadable_libraries.txt
/usr/local/var/log/loadable_libraries.txt
So what's in that file?
06:22:24/cidervuong2 $ cat /usr/local/var/log/loadable_libraries.txt
could not load library "$libdir/pg_background": ERROR: could not access file "$libdir/pg_background": No such file or directory
Database: bluej
I looked into that error:
ERROR: could not access file "$libdir/pg_background"
and found no useful information. I went ahead and nuked PostgreSQL again, and this time it installed cleanly.
brew uninstall --ignore-dependencies postgres
brew install postgres
brew services start postgresql # actually already started
and pg came up cleanly this time
502 84118 1 0 6:33AM ?? 0:00.03 /usr/local/opt/postgresql/bin/postgres -D /usr/local/var/postgres
502 84120 84118 0 6:33AM ?? 0:00.00 postgres: checkpointer
502 84121 84118 0 6:33AM ?? 0:00.03 postgres: background writer
502 84122 84118 0 6:33AM ?? 0:00.01 postgres: walwriter
502 84123 84118 0 6:33AM ?? 0:00.01 postgres: autovacuum launcher
502 84124 84118 0 6:33AM ?? 0:00.01 postgres: stats collector
502 84125 84118 0 6:33AM ?? 0:00.00 postgres: logical replication launcher
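For completeness, a less destructive route suggested by the error text itself would be to remove the objects that reference the missing library from the old cluster and then re-run the upgrade; a sketch, assuming the pg_background extension in the bluej database is what references $libdir/pg_background:
# against the old (v11) cluster, before upgrading again:
psql -d bluej -c 'DROP EXTENSION IF EXISTS pg_background CASCADE;'
brew postgresql-upgrade-database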

WAL archive: FAILED (please make sure WAL shipping is setup)

I am trying to configure Barman for backups. When I do barman check replica I keep getting:
Server replica:
WAL archive: FAILED (please make sure WAL shipping is setup)
PostgreSQL: OK
superuser: OK
wal_level: OK
directories: OK
retention policy settings: OK
backup maximum age: FAILED (interval provided: 1 day, latest backup age: No available backups)
compression settings: OK
failed backups: OK (there are 0 failed backups)
minimum redundancy requirements: FAILED (have 0 backups, expected at least 2)
ssh: OK (PostgreSQL server)
not in recovery: FAILED (cannot perform exclusive backup on a standby)
archive_mode: OK
archive_command: OK
continuous archiving: OK
archiver errors: OK
I am using PostgreSQL 9.6 and Barman 2.1; I am not sure what the issue is. Could someone help?
Here is my Barman server configuration:
description = "Database backup"
conninfo = host=<db-ip> user=postgres dbname=db
backup_method = rsync
ssh_command = ssh postgres@<db-ip>
archiver = on
barman check tries to confirm that archiving is set up correctly by asserting that there's actually something in the archive. However, WAL segments are generally only archived once they're filled up, and if your server is idle, this is never going to happen.
To work around this, Barman provides a command to force a segment switch, wait for the completed WAL to show up, and then archive it immediately:
barman switch-xlog --force --archive replica
In brief
Another cause is that Barman's incoming_wals_directory and the archive_command in postgresql.conf do not match, as described in the details below.
Details
There is a mismatch between:
Barman's incoming_wals_directory
postgresql.conf's archive_command
Shell commands to check both:
barman@backup $ barman show-server pg | grep incoming_wals_directory
# output1
# > incoming_wals_directory: /var/lib/barman/pg/incoming
postgres@pg $ cat /etc/postgresql/10/main/postgresql.conf | grep archive_command
# output2
# > archive_command = 'rsync -a %p barman@staging:/var/lib/barman/pg/incoming/%f'
The path in output1 must be the same as the path in output2.
If they differ, make them match, and don't forget to reload or restart PostgreSQL afterward.
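A minimal sketch of the fix, reusing the paths from output1 and output2 above (the Debian-style config path and service name are assumptions):
# in /etc/postgresql/10/main/postgresql.conf, point archive_command at
# Barman's incoming_wals_directory from output1:
archive_command = 'rsync -a %p barman@staging:/var/lib/barman/pg/incoming/%f'
# archive_command changes take effect on reload:
sudo systemctl reload postgresql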

Fatal error starting postgres

I'm unfamiliar with how to use postgres and need some help. I'm currently running OSX Yosemite.
When I start postgres I get this:
pg_ctl: could not start server
Examine the log output.
There was an error executing [start] on postgres. Check /Users/work/git/proj/var/log/postgres.log for details.
createuser: could not connect to database postgres: FATAL: could not open relation mapping file "global/pg_filenode.map": No such file or directory
The log is below.
When I try to stop postgres I get this:
Postgres not running
And when I run ps -ef | grep postgres I get this:
20010 13398 1 0 Jul07 ? 00:00:00 /usr/pgsql-9.3/bin/postgres -h -k /Users/work/git/proj/var/pg
20010 13399 13398 0 Jul07 ? 00:00:09 postgres: logger process
20010 13401 13398 0 Jul07 ? 00:00:10 postgres: checkpointer process
20010 13402 13398 0 Jul07 ? 00:00:00 postgres: writer process
20010 13403 13398 0 Jul07 ? 00:00:00 postgres: wal writer process
20010 13404 13398 0 Jul07 ? 00:00:36 postgres: autovacuum launcher process
20010 13405 13398 0 Jul07 ? 00:00:02 postgres: stats collector process
20010 18112 17723 0 10:22 pts/0 00:00:00 grep postgres
What does this all mean and how could I possibly fix this?
log text
Postgres data dir doesn't exist. Creating
The files belonging to this database system will be owned by user "rose.smith".
This user must also own the server process.
The database cluster will be initialized with locale "C".
The default database encoding has accordingly been set to "SQL_ASCII".
The default text search configuration will be set to "english".
Data page checksums are disabled.
creating directory /Users/work/git/proj/postgres ... ok
creating subdirectories ... ok
selecting default max_connections ... 100
selecting default shared_buffers ... 128MB
creating configuration files ... ok
creating template1 database in /Users/work/git/proj/postgres/base/1 ... ok
initializing pg_authid ... ok
initializing dependencies ... ok
creating system views ... ok
loading system objects' descriptions ... ok
creating collations ... ok
creating conversions ... ok
creating dictionaries ... ok
setting privileges on built-in objects ... ok
creating information schema ... ok
loading PL/pgSQL server-side language ... ok
vacuuming database template1 ... ok
copying template1 to template0 ... ok
copying template1 to postgres ... ok
syncing data to disk ... ok
Success. You can now start the database server using:
/usr/pgsql-9.3/bin/postgres -D /Users/work/git/proj/postgres
or
/usr/pgsql-9.3/bin/pg_ctl -D /Users/work/git/proj/postgres -l logfile start
waiting for server to start....< 2015-06-04 17:24:57.966 GMT >LOG: redirecting log output to logging collector process
< 2015-06-04 17:24:57.966 GMT >HINT: Future log output will appear in directory "pg_log".
done
server started
waiting for server to shut down.... done
server stopped
waiting for server to start....< 2015-06-04 18:10:18.044 GMT >LOG: redirecting log output to logging collector process
< 2015-06-04 18:10:18.044 GMT >HINT: Future log output will appear in directory "pg_log".
done
server started
"/Users/work/git/proj/var/log/postgres.log" 413L, 20935C
After running /usr/pgsql-9.3/bin/postgres -D /Users/work/git/proj/postgres:
< 2015-07-08 14:40:36.331 GMT >FATAL: lock file "postmaster.pid" already exists
< 2015-07-08 14:40:36.331 GMT >HINT: Is another postmaster (PID 18145) running in data directory "/Users/work/git/proj/postgres"?
I can't speak to why this worked after trying these commands just a few minutes ago, but it is now working. Good luck to anyone else with the same problem.
stop postgres
killall postgres
remove postgres database with rm -rf postgres
start postgres
This website was helpful. I think my problem may have been the same as his.
I had deleted ~/Library/Containers/com.heroku.postgres or ~/Application Support/Postgres/ while the Postgres.app was still running. The old version was still running since I deleted the pid file, and it didn't know how to shut it down.
Source: https://github.com/PostgresApp/PostgresApp/issues/96
I faced the same issue. I solved the problem with the following commands.
If you installed PostgreSQL using Homebrew:
rm /usr/local/var/postgres/postmaster.pid
pg_ctl -D /usr/local/var/postgres -l /usr/local/var/postgres/server.log start
Hope this helps you!

PostgreSQL 9.1 streaming replication restore_command: special meaning of exit code 255?

I have a PostgreSQL 9.1.3 streaming replication setup on Ubuntu 10.04.2 LTS (primary and standby). Replication is initialized with a streamed base backup (pg_basebackup). The restore_command script tries to fetch the required WAL archives from a remote archive location with rsync.
Everything works as described in the documentation when the restore_command script fails with an exit code other than 255:
At startup, the standby begins by restoring all WAL available in the archive location, calling restore_command. Once it reaches the end of WAL available there and restore_command fails, it tries to restore any WAL available in the pg_xlog directory. If that fails, and streaming replication has been configured, the standby tries to connect to the primary server and start streaming WAL from the last valid record found in archive or pg_xlog. If that fails or streaming replication is not configured, or if the connection is later disconnected, the standby goes back to step 1 and tries to restore the file from the archive again. This loop of retries from the archive, pg_xlog, and via streaming replication goes on until the server is stopped or failover is triggered by a trigger file.
But when the restore_command script fails with exit code 255 (because the exit code from a failed rsync call is returned by the script) the server process dies with the following error:
2012-05-09 23:21:30 CEST - # LOG: database system was interrupted; last known up at 2012-05-09 23:21:25 CEST
2012-05-09 23:21:30 CEST - # LOG: entering standby mode
rsync: connection unexpectedly closed (0 bytes received so far) [Receiver]
rsync error: unexplained error (code 255) at io.c(601) [Receiver=3.0.7]
2012-05-09 23:21:30 CEST - # FATAL: could not restore file "00000001000000000000003D" from archive: return code 65280
2012-05-09 23:21:30 CEST - # LOG: startup process (PID 8184) exited with exit code 1
2012-05-09 23:21:30 CEST - # LOG: aborting startup due to startup process failure
So my question is: is this a bug, is there a special meaning of exit code 255 that is missing from the otherwise excellent documentation, or am I missing something else here?
On the primary server, you have WAL files sitting in the pg_xlog/ directory. While WAL files are there, PostgreSQL is able to deliver them to the standby should they be requested.
Typically, you also have a local archived-WAL location; once files are moved there by PostgreSQL, they can no longer be delivered to the standby online, and the standby expects them to come from the archived-WAL location via restore_command.
If the archived-WAL locations configured on the primary and on the standby differ, there is no way for a WAL file to reach the standby, and you have a gap.
In your case this might mean that:
00000001000000000000003D had been archived by the primary PostgreSQL;
the standby's restore_command doesn't see it at the configured source location.
You might consider manually copying the missing WAL files from the primary to the standby using scp or rsync. It might also be necessary to review your WAL locations and make sure both servers point at the same place.
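For example, a single missing segment could be copied over by hand; a sketch, run on the primary, with the hostname and archive paths as placeholders (the segment name is the one from the log above):
scp /path/to/wal_archive/00000001000000000000003D postgres@standby:/path/to/wal_archive/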
EDIT:
Grepping for restore_command in the sources, only access/transam/xlog.c references it. In the function RestoreArchivedFile, almost at the end (around line 3115 in the 9.1.3 sources), there's a check on whether restore_command exited normally or received a signal.
In the first case, the message is classified as DEBUG2. If restore_command received a signal other than SIGTERM (and wasn't able to handle it properly, I guess), a FATAL error will be reported. This is true for all exit codes greater than 125.
I will not be able to tell you why, though.
I recommend asking on the hackers list.
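One practical workaround, given the behaviour described above, is to wrap the rsync call so that its transport-level exit codes (such as 255) are translated into a plain failure below 126; PostgreSQL then treats a missing file as "not in the archive" and falls back to streaming instead of aborting. A sketch, with the host and archive path as placeholders:
#!/bin/bash
# restore_command wrapper: $1 is the WAL file name (%f), $2 the
# destination path (%p) passed by PostgreSQL.
rsync -a "barman@archive:/path/to/wal_archive/$1" "$2" && exit 0
# rsync exits 255 on SSH-level failures; PostgreSQL aborts recovery for
# exit codes greater than 125, so report any failure as a plain exit 1.
exit 1
with restore_command = '/path/to/restore_wal.sh %f %p' in recovery.conf.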
This looks like an rsync problem I encountered temporarily using NFS (with rpcbind/rstatd on port 837):
$ rsync -avz /var/backup/* backup@storage:/data/backups
rsync: connection unexpectedly closed (0 bytes received so far) [sender]
rsync error: unexplained error (code 255) at io.c(600) [sender=3.0.6]
This fixed it for me:
service rpcbind stop
I had the same issue creating a hot standby (postgres 9.5). Streaming was working (I seeded the standby via pg_basebackup using the same credentials as would later be used in the standby's recovery.conf).
After taking the base backup, I set up the following recovery.conf:
standby_mode = 'on'
primary_conninfo = 'host=ip.of.master port=5432 user=pgstandby password=password'
recovery_target_timeline = 'latest'
restore_command = 'sftp -q user@ip.of.wal.archive.host:data/master_wal_archive/%f "%p"'
trigger_file = '/srv/pgsql/9.5/data/trigger'
Starting the server would yield:
2016-03-08 12:34:58.981 UTC (/)LOG: database system was interrupted; last known up at 2016-03-08 12:26:10 UTC
Couldn't read packet: Connection reset by peer
2016-03-08 12:34:59.525 UTC (/)FATAL: could not restore file "00000002.history" from archive: child process exited with exit code 255
2016-03-08 12:34:59.526 UTC (/)LOG: startup process (PID 26636) exited with exit code 1
2016-03-08 12:34:59.526 UTC (/)LOG: aborting startup due to startup process failure
If I removed the restore_command line from recovery.conf, the standby started up fine and began streaming WALs from the master.
I eventually traced the problem down to not having added the standby postgres user's public key to the authorized_keys file on the WAL archive host. I'd also forgotten to add the WAL archive host's server fingerprint to the known_hosts file of the standby postgres user.
These two mistakes were (I assume) causing the sftp restore_command to exit with code 255. As tscho says, the Postgres docs suggest that if the restore_command exits with ANY non-zero value, Postgres will simply move on to trying to stream from the master rather than refusing to start. In reality this doesn't seem to be the case if the exit code is higher than a certain number (maybe 125, as vyegorov's source code grepping suggests?).
Once I fixed the two SSH issues, the standby started fine with the restore_command present in recovery.conf.
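A quick way to catch this class of problem is to run the restore_command by hand as the postgres OS user, which surfaces key and fingerprint errors directly; a sketch, using the placeholder host and path from the recovery.conf above:
sudo -u postgres sftp -q user@ip.of.wal.archive.host:data/master_wal_archive/00000002.history /tmp/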
Here is the comment describing why this behavior for high exit status from the command process was chosen, and the current code to implement it.
/*
* Remember, we rollforward UNTIL the restore fails so failure here is
* just part of the process... that makes it difficult to determine
* whether the restore failed because there isn't an archive to restore,
* or because the administrator has specified the restore program
* incorrectly. We have to assume the former.
*
* However, if the failure was due to any sort of signal, it's best to
* punt and abort recovery. (If we "return false" here, upper levels will
* assume that recovery is complete and start up the database!) It's
* essential to abort on child SIGINT and SIGQUIT, because per spec
* system() ignores SIGINT and SIGQUIT while waiting; if we see one of
* those it's a good bet we should have gotten it too.
*
* On SIGTERM, assume we have received a fast shutdown request, and exit
* cleanly. It's pure chance whether we receive the SIGTERM first, or the
* child process. If we receive it first, the signal handler will call
* proc_exit, otherwise we do it here. If we or the child process received
* SIGTERM for any other reason than a fast shutdown request, postmaster
* will perform an immediate shutdown when it sees us exiting
* unexpectedly.
*
* Per the Single Unix Spec, shells report exit status > 128 when a called
* command died on a signal. Also, 126 and 127 are used to report
* problems such as an unfindable command; treat those as fatal errors
* too.
*/
if (WIFSIGNALED(rc) && WTERMSIG(rc) == SIGTERM)
    proc_exit(1);

signaled = WIFSIGNALED(rc) || WEXITSTATUS(rc) > 125;

ereport(signaled ? FATAL : DEBUG2,
        (errmsg("could not restore file \"%s\" from archive: %s",
                xlogfname, wait_result_to_str(rc))));