PostgreSQL hot standby won't start after deleting WAL files - postgresql

I have a PostgreSQL 9.1 hot standby server. The WAL files filled the entire hard disk, so I deleted all of them. Now I want to bring the server back up, so I ran:
/usr/local/pgsql/bin/pg_ctl start -D /usr/local/pgsql/data
The problem is that the server never fully starts. I see this, followed by non-stop log output about missing WAL files:
server starting
sh-4.1$ LOG: database system was shut down in recovery at 2013-02-10 03:17:06 UTC
LOG: entering standby mode
cp: cannot stat `/usr/local/pgsql/wals/0000000100000035000000A4': No such file or directory
LOG: redo starts at 35/A4943928
LOG: consistent recovery state reached at 35/A4AE8EB8
LOG: database system is ready to accept read only connections
LOG: invalid record length at 35/A4AE8EB8
cp: cannot stat `/usr/local/pgsql/wals/0000000100000035000000A4': No such file or directory
LOG: streaming replication successfully connected to primary
FATAL: could not receive data from WAL stream: FATAL: requested WAL segment 0000000100000035000000A4 has already been removed
How can I get the server back?

You need to recreate the hot standby server from a base backup (a filesystem-level copy of the primary).
The manual covers the details in the chapter High Availability, Load Balancing, and Replication.
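As a rough sketch of that rebuild on 9.1, assuming the primary already accepts replication connections: the function below stops the standby, moves the old data directory aside, takes a fresh base backup, and reuses the previous recovery.conf. The function name, the host name and the replication role are hypothetical; PGBIN follows the path used in the question.

```shell
#!/bin/sh
# Sketch only: rebuild a 9.1 standby from a fresh base backup of the primary.
PGBIN=${PGBIN:-/usr/local/pgsql/bin}

rebuild_standby() {
    pgdata=$1        # standby data directory, e.g. /usr/local/pgsql/data
    primary_host=$2  # hypothetical: host name of the primary
    repl_user=$3     # hypothetical: role allowed to make replication connections

    old="${pgdata}.old"

    # Stop the half-started standby; ignore the error if it is not running.
    "$PGBIN/pg_ctl" stop -D "$pgdata" -m fast || true

    # Keep the old directory until the new standby is confirmed healthy.
    # If disk space is tight, it may have to be removed instead.
    mv "$pgdata" "$old" || return 1

    # -x also fetches the WAL needed to make the backup consistent (9.1 syntax).
    "$PGBIN/pg_basebackup" -h "$primary_host" -U "$repl_user" -D "$pgdata" -x -P || return 1

    # Reuse the previous recovery.conf (standby_mode, primary_conninfo, restore_command).
    cp "$old/recovery.conf" "$pgdata/recovery.conf" || return 1

    "$PGBIN/pg_ctl" start -D "$pgdata"
}
```

It could be called as `rebuild_standby /usr/local/pgsql/data primary.example replicator`. Configuration files inside the data directory are copied from the primary, so settings such as hot_standby = on may need to be applied again on the standby before it is started.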

Before connecting the standby to the primary in streaming mode, check that the required WAL segments are present in the location you restore from during initial recovery. The problem above can occur when a WAL segment is missing from that location. If the segments are all there, check the restore_command in recovery.conf.
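A minimal sketch of that check, assuming the archive is a plain directory readable from the standby. The function name is hypothetical; the directory and segment name in the usage line are the ones from the question's log.

```shell
#!/bin/sh
# Sketch only: report whether a WAL segment named in the log exists in the
# directory that restore_command copies from.
wal_segment_available() {
    archive_dir=$1   # e.g. /usr/local/pgsql/wals
    segment=$2       # e.g. 0000000100000035000000A4
    if [ -f "$archive_dir/$segment" ]; then
        echo "present: $archive_dir/$segment"
    else
        echo "missing: $archive_dir/$segment"
        return 1
    fi
}
```

For example: `wal_segment_available /usr/local/pgsql/wals 0000000100000035000000A4`. If the segment is missing here and the primary has already removed it as well, the standby cannot catch up and has to be rebuilt.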

Related

Postgres streaming replication - servers keep shutting down

I am new to PostgreSQL and I am trying to set up a streaming replication from our server to a test DB on my laptop. I have been following this tutorial https://www.percona.com/blog/2018/09/07/setting-up-streaming-replication-postgresql/ along with the Postgres documentation here https://www.postgresql.org/docs/11/runtime-config-replication.html.
I'm running Windows 10, PostgreSQL 11, PostGIS 2.5 extension.
The server and my local machine both keep shutting down and the logs are filled with postmaster.pid errors such as:
LOG: performing immediate shutdown because data directory lock file is invalid
LOG: received immediate shutdown request
LOG: could not open file "postmaster.pid": No such file or directory
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
Could anyone point me towards the issue here? I know my servers aren't configured properly, but I just don't know which settings need to be changed.
Here is an image of my standby server's most recent log: [image: standby log]
Here is an image of my master server's most recent log: [image: master log]
You must have messed up in many ways.
You removed or overwrote postmaster.pid on the master server.
That is very dangerous and causes the server to die with the error message you quote.
You didn't create recovery.conf before starting the standby server, or you removed backup_label. From the error messages I'd suspect the second, with ensuing data corruption.

Postgresql fatal the database system is starting up - windows 10

I have installed PostgreSQL on Windows 10 on a USB disk.
Every day, when I wake my work PC from sleep, plug the disk in again and try to start PostgreSQL, I get this error:
FATAL: the database system is starting up
The service starts with following command:
E:\PostgresSql\pg96\pgservice.exe "//RS//PostgreSQL 9.6 Server"
It is the default one.
logs from E:\PostgresSql\data\logs\pg96
2019-02-28 10:30:36 CET [21788]: [1-1] user=postgres,db=postgres,app=[unknown],client=::1 FATAL: the database system is starting up
2019-02-28 10:31:08 CET [9796]: [1-1] user=postgres,db=postgres,app=[unknown],client=::1 FATAL: the database system is starting up
I want this startup to happen faster.
When you commit data to a Postgres database, the only thing which is immediately saved to disk is the write-ahead log. The actual table changes are only applied to the in-memory buffers, and won't be permanently saved to disk until the next checkpoint.
If the server is stopped abruptly, or if it suddenly loses access to the file system, then everything in memory is lost, and the next time you start it up, it needs to resort to replaying the log in order to get the tables back to the correct state (which can take quite a while, depending on how much has happened since the last checkpoint). And until it's finished, any attempt to use the server will result in FATAL: the database system is starting up.
If you make sure you shut the server down cleanly before unplugging the disk - giving it a chance to set a checkpoint and flush all of its buffers - then it should be able to start up again more or less immediately.
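A small sketch of both halves of this advice, assuming pg_ctl and pg_isready (available since 9.3) are on the PATH and a POSIX shell is at hand; on Windows the same pg_ctl arguments can be typed into cmd.exe. The function names and the data directory argument are placeholders.

```shell
#!/bin/sh
# Sketch only: stop the server cleanly so that it writes a shutdown checkpoint.
stop_cleanly() {
    datadir=$1   # placeholder: the server's data directory
    pg_ctl stop -D "$datadir" -m fast -w
}

# Poll until the server accepts connections again after a start.
wait_until_ready() {
    tries=${1:-60}   # give up after this many one-second polls
    while [ "$tries" -gt 0 ]; do
        # pg_isready exits 0 once connections are accepted and non-zero
        # while the server is still starting up or not responding.
        if pg_isready -q; then
            return 0
        fi
        tries=$((tries - 1))
        sleep 1
    done
    return 1
}
```

Running stop_cleanly before unplugging the disk is the part that actually shortens the next startup; wait_until_ready only avoids hitting the "starting up" error while WAL replay is still in progress.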

Recover Postgres Streaming Replication Slave from Archived Wal Logs

I have set up a Postgres hot standby server using streaming replication. But my standby server is asking for an old WAL segment that is no longer in the master's pg_xlog directory. The file does exist in the WAL archive backup directory.
How can I configure the standby to read this file from the backup directory? Or is there a way to copy the file to the standby server manually?
Any help will be appreciated.
You would have to add a restore_command to recovery.conf that can restore files from the WAL archive.
Then restart the standby, and it should be able to recover.
When the standby cannot get the required WAL via streaming replication, it tries restore_command. When that fails, it tries streaming replication again, and so on in an endless loop.
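A minimal sketch of such a recovery.conf, written by a small shell function. It assumes the archive directory is mounted or copied so that a plain cp can read it on the standby; the function name and all arguments in the usage line are hypothetical.

```shell
#!/bin/sh
# Sketch only: write a recovery.conf that combines streaming replication
# with a restore_command reading from a WAL archive directory.
write_recovery_conf() {
    pgdata=$1        # standby data directory
    archive_dir=$2   # directory holding archived WAL, readable on the standby
    conninfo=$3      # libpq connection string for the primary (no single quotes)
    cat > "$pgdata/recovery.conf" <<EOF
standby_mode = 'on'
primary_conninfo = '$conninfo'
# %f is the WAL file name being requested, %p the path to copy it to.
restore_command = 'cp $archive_dir/%f %p'
EOF
}
```

For example: `write_recovery_conf /path/to/standby/data /path/to/wal_archive 'host=primary.example user=replicator'`, followed by a restart of the standby.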

How is PostgreSQL hot standby WAL file restoring triggered?

Primary server
# postgresql.conf
wal_level = hot_standby
archive_mode = on
archive_timeout = 10
archive_command = 'test ! -f /archive/%f && cp %p /archive/%f'
Standby server
hot_standby = on
I copied /archive/* from the primary server to $PGDATA/pg_xlog on the standby, and nothing happened. When I restarted the standby server, I got these error messages in the server log:
2016-11-21 17:56:09 CST [17762-3] LOG: invalid primary checkpoint record
2016-11-21 17:56:09 CST [17762-4] LOG: record with zero length at 0/6000100
2016-11-21 17:56:09 CST [17762-5] LOG: invalid secondary checkpoint record
2016-11-21 17:56:09 CST [17762-6] PANIC: could not locate a valid checkpoint record
2016-11-21 17:56:09 CST [17761-1] LOG: startup process (PID 17762) was terminated by signal 6: Aborted
2016-11-21 17:56:09 CST [17761-2] LOG: aborting startup due to startup process failure
Questions:
Is it enough to sync data to standby server by simply copying /archive/* in primary server to $PGDATA/pg_xlog in the standby?
How and when is the restoring of WAL files triggered in a hot standby server? Does the standby server periodically check its $PGDATA/pg_xlog directory for new WAL files? Or do I have to trigger it manually?
I am talking about hot standby, not streaming replication; so I assume I don't have to configure conninfo. Am I right?
After configuring hot_standby = on and restarting the server, I can still do an INSERT without error. How to configure to make it really read-only?
That looks a lot like you didn't initialize the standby database correctly.
The log file states that PostgreSQL won't even begin to replicate, because it cannot find a valid checkpoint to start with.
What does the backup_label file in your standby's data directory contain? If that file doesn't exist, that's probably the problem.
Did that standby suddenly stop working or has it never worked?
How exactly did you create the standby?
You must first create the standby from a low level base backup of the master. You cannot create a new instance and use pg_dump and pg_restore. I'm guessing that's what you tried to do.
The simplest way to do a suitable base backup is to use pg_basebackup. Other options are discussed in the manual, but really, just use:
pg_basebackup -X stream -D standby_datadir_location -h master_ip
or similar.
Only once you have a valid base backup may you start archive recovery or streaming replication. The simplest way is to enable streaming replication. Let pg_basebackup do that for you by passing the -R flag.
If you want archive recovery, you should add a restore_command to the standby's recovery.conf that copies the archives from the archive location to the standby.
It's all covered in the manual.
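Put together, the streaming variant might look like the sketch below. It assumes PostgreSQL 9.3 or later (for -R), client tools on the PATH, and a primary that already allows replication connections; the function names are hypothetical. The second function uses pg_is_in_recovery() to check whether an instance is really running as a standby.

```shell
#!/bin/sh
# Sketch only: create a standby from a base backup and start it.
init_standby() {
    datadir=$1     # empty or non-existent directory for the new standby
    master_ip=$2   # address of the primary
    # -X stream fetches WAL during the backup; -R writes a recovery.conf
    # containing standby_mode and primary_conninfo.
    pg_basebackup -X stream -R -D "$datadir" -h "$master_ip" || return 1
    pg_ctl start -D "$datadir"
}

# Succeeds when the server is in recovery, i.e. running as a standby.
# Any arguments (for example -p 5433) are passed on to psql.
is_standby() {
    [ "$(psql -At "$@" -c 'SELECT pg_is_in_recovery()')" = "t" ]
}
```

If is_standby fails on the supposed standby, the instance is running as an ordinary read-write server, which would explain why an INSERT succeeds. On releases that still use wal_level = hot_standby, hot_standby = on also has to be set in the standby's postgresql.conf before it accepts read-only connections.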

incorrect resource manager data checksum in record at 2/XYZ + terminating walreceiver process due to administrator command

I am running a streaming replication environment with PostgreSQL 9.1 (1 master, 3 slaves). Everything worked fine for approx. 2 months. Yesterday, replication to one of the slaves failed, and the log on the slave showed:
LOG: incorrect resource manager data checksum in record at 61/DA2710A7
FATAL: terminating walreceiver process due to administrator command
LOG: incorrect resource manager data checksum in record at 61/DA2710A7
LOG: incorrect resource manager data checksum in record at 61/DA2710A7
LOG: incorrect resource manager data checksum in record at 61/DA2710A7
LOG: incorrect resource manager data checksum in record at 61/DA2710A7
LOG: incorrect resource manager data checksum in record at 61/DA2710A7
LOG: incorrect resource manager data checksum in record at 61/DA2710A7
LOG: incorrect resource manager data checksum in record at 61/DA2710A7
The slave was no longer in sync with the master. Two hours later, during which the log gained a new line like the ones above every 5 seconds, I restarted the slave database server:
LOG: incorrect resource manager data checksum in record at 61/DA2710A7
LOG: received fast shutdown request
LOG: aborting any active transactions
LOG: incorrect resource manager data checksum in record at 61/DA2710A7
FATAL: terminating connection due to administrator command
FATAL: terminating connection due to administrator command
LOG: shutting down
LOG: database system is shut down
The new log file on the slave contains:
LOG: database system was shut down in recovery at 2016-02-29 05:12:11 CET
LOG: entering standby mode
LOG: redo starts at 61/D92C10C9
LOG: consistent recovery state reached at 61/DA2710A7
LOG: database system is ready to accept read only connections
LOG: incorrect resource manager data checksum in record at 61/DA2710A7
LOG: streaming replication successfully connected to primary
Now the slave is in sync with the master, but the checksum entry is still there. I also checked the network logs: the network was available the whole time.
My questions are:
Does anyone know why the walreceiver was terminated?
Why didn't PostgreSQL retry the replication?
What can I do to prevent this in the future?
Thank you.
EDIT:
The database servers are running on SLES 11 with ext3. I found an article about low performance of SLES 11 with large RAM but I am not sure if it applies since my machine has only 8 GB RAM (https://www.novell.com/support/kb/doc.php?id=7010287)
Any help would be appreciated.
EDIT (2):
The PostgreSQL version is 9.1.5. It seems that PostgreSQL 9.1.6 provides a fix for a similar issue?
Fix persistence marking of shared buffers during WAL replay (Jeff Davis)
This mistake can result in buffers not being written out during checkpoints, resulting in data corruption if the server later crashes without ever having written those buffers. Corruption can occur on any server following crash recovery, but it is significantly more likely to occur on standby slave servers since those perform much more WAL replay.
Source: http://www.postgresql.org/docs/9.1/static/release-9-1-6.html
Might this be the fix? Should I upgrade to PostgreSQL 9.1.6, and would everything then run smoothly?
In case someone stumbles across this question, I ended up reinstalling the databases from backed-up data and set up replication again. Never really figured out what went wrong.
"Never really figured out what went wrong."
I'm experiencing the same error, except that in my case the slave never synced in full, right from the very beginning.
Then the primary server hit some kernel errors (a heat problem in the server case?). It had to be switched off because it could not complete its shutdown. While the primary was still shutting down, the slave already logged:
LOG: incorrect resource manager data checksum in record at 1/63663CB0
After restarting both the primary server and the slave server, the situation has not changed: the same log entry appears every 5 seconds.