pg_xlog files not recycling on slave - PostgreSQL

I've set up streaming replication with PostgreSQL 9.3.
My problem is that on the slave server the pg_xlog folder just keeps filling up and WAL files are not getting recycled.
The slave server has the following (relevant) values in postgresql.conf:
wal_keep_segments = 150
hot_standby = on
checkpoint_segments = 32
checkpoint_completion_target = 0.9
archive_mode = off
#archive_command = ''
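For completeness, the values actually in effect on the standby can be confirmed from SQL; a small sketch using the same parameter names as above:
SELECT name, setting, source
FROM pg_settings
WHERE name IN ('wal_keep_segments', 'checkpoint_segments',
               'checkpoint_completion_target', 'hot_standby', 'archive_mode');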
My initial replication command was:
pg_basebackup --xlog-method=stream -h <master-ip> -D . --username=replication --password
So I guess my WAL files are OK.
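(For what it's worth, whether streaming itself is keeping up can be checked on the master; a sketch using the 9.3 column names:)
SELECT client_addr, state, sent_location, replay_location
FROM pg_stat_replication;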
Here is my slave server startup log:
2017-05-08 09:55:31 IDT LOG: database system was shut down in recovery at 2017-05-08 09:55:19 IDT
2017-05-08 09:55:31 IDT LOG: entering standby mode
2017-05-08 09:55:31 IDT LOG: redo starts at 361/C76DD3E8
2017-05-08 09:55:31 IDT LOG: consistent recovery state reached at 361/C89A8278
2017-05-08 09:55:31 IDT LOG: database system is ready to accept read only connections
2017-05-08 09:55:31 IDT LOG: record with zero length at 361/C89A8278
2017-05-08 09:55:31 IDT LOG: started streaming WAL from primary at 361/C8000000 on timeline 1
2017-05-08 09:55:32 IDT LOG: incomplete startup packet
2017-05-08 09:58:34 IDT LOG: received SIGHUP, reloading configuration files
2017-05-08 09:58:34 IDT LOG: parameter "checkpoint_completion_target" changed to "0.9"
I even tried manually copying older WAL files from the master server to the slave, but that didn't help either.
What am I doing wrong? How can I stop the pg_xlog folder from growing indefinitely?
Is it related to the "incomplete startup packet" log message?
One last thing: under the pg_xlog/archive_status folder, all of the WAL files have a .done suffix.
I appreciate any help I can get on this.
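For reference, the pg_xlog growth can be measured either from the shell (du -sh $PGDATA/pg_xlog) or from SQL; a sketch that needs superuser because of pg_ls_dir/pg_stat_file:
-- count and total size of WAL segment files in pg_xlog
SELECT count(*) AS segments,
       pg_size_pretty(sum((pg_stat_file('pg_xlog/' || f)).size)::bigint) AS total_size
FROM pg_ls_dir('pg_xlog') AS f
WHERE f ~ '^[0-9A-F]{24}$';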
Edit:
I enabled log_checkpoints in postgresql.conf.
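(Enabling it only requires a configuration reload, not a restart; a sketch:)
# in postgresql.conf
log_checkpoints = on
# then reload the configuration (either works)
pg_ctl -D $PGDATA reload        # from the shell
SELECT pg_reload_conf();        -- or from SQL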
Here are the relevant log entries since I enabled it:
2017-05-12 08:43:11 IDT LOG: parameter "log_checkpoints" changed to "on"
2017-05-12 08:43:24 IDT LOG: checkpoint complete: wrote 2128 buffers (0.9%); 0 transaction log file(s) added, 0 removed, 9 recycled; write=189.240 s, sync=0.167 s, total=189.549 s; sync files=745, longest=0.010 s, average=0.000 s
2017-05-12 08:45:15 IDT LOG: checkpoint starting: time
2017-05-12 08:48:46 IDT LOG: checkpoint complete: wrote 15175 buffers (6.6%); 0 transaction log file(s) added, 0 removed, 1 recycled; write=209.078 s, sync=1.454 s, total=210.617 s; sync files=769, longest=0.032 s, average=0.001 s
2017-05-12 08:50:15 IDT LOG: checkpoint starting: time
2017-05-12 08:53:45 IDT LOG: checkpoint complete: wrote 2480 buffers (1.1%); 0 transaction log file(s) added, 0 removed, 1 recycled; write=209.162 s, sync=0.991 s, total=210.253 s; sync files=663, longest=0.076 s, average=0.001 s
Edit2:
Since my slave server shows no restartpoints in its log, here is the relevant log for startup and WAL recovery on the slave, up to the point where it reaches a consistent recovery state:
2017-05-12 09:35:42 IDT LOG: database system was shut down in recovery at 2017-05-12 09:35:41 IDT
2017-05-12 09:35:42 IDT LOG: entering standby mode
2017-05-12 09:35:42 IDT LOG: incomplete startup packet
2017-05-12 09:35:43 IDT FATAL: the database system is starting up
2017-05-12 09:35:43 IDT LOG: restored log file "0000000100000369000000B1" from archive
2017-05-12 09:35:43 IDT FATAL: the database system is starting up
2017-05-12 09:35:44 IDT FATAL: the database system is starting up
2017-05-12 09:35:44 IDT LOG: restored log file "0000000100000369000000AF" from archive
2017-05-12 09:35:44 IDT LOG: redo starts at 369/AFD28900
2017-05-12 09:35:44 IDT FATAL: the database system is starting up
2017-05-12 09:35:45 IDT FATAL: the database system is starting up
2017-05-12 09:35:45 IDT FATAL: the database system is starting up
2017-05-12 09:35:46 IDT LOG: restored log file "0000000100000369000000B0" from archive
2017-05-12 09:35:46 IDT FATAL: the database system is starting up
2017-05-12 09:35:46 IDT FATAL: the database system is starting up
2017-05-12 09:35:47 IDT FATAL: the database system is starting up
2017-05-12 09:35:47 IDT LOG: restored log file "0000000100000369000000B1" from archive
2017-05-12 09:35:47 IDT FATAL: the database system is starting up
2017-05-12 09:35:48 IDT FATAL: the database system is starting up
2017-05-12 09:35:48 IDT LOG: incomplete startup packet
2017-05-12 09:35:49 IDT LOG: restored log file "0000000100000369000000B2" from archive
2017-05-12 09:35:50 IDT LOG: restored log file "0000000100000369000000B3" from archive
2017-05-12 09:35:52 IDT LOG: restored log file "0000000100000369000000B4" from archive
.
.
.
2017-05-12 09:42:33 IDT LOG: restored log file "000000010000036A000000C0" from archive
2017-05-12 09:42:35 IDT LOG: restored log file "000000010000036A000000C1" from archive
2017-05-12 09:42:36 IDT LOG: restored log file "000000010000036A000000C2" from archive
2017-05-12 09:42:37 IDT LOG: restored log file "000000010000036A000000C3" from archive
2017-05-12 09:42:37 IDT LOG: consistent recovery state reached at 36A/C3ACEB28
2017-05-12 09:42:37 IDT LOG: database system is ready to accept read only connections
2017-05-12 09:42:39 IDT LOG: restored log file "000000010000036A000000C4" from archive
2017-05-12 09:42:40 IDT LOG: restored log file "000000010000036A000000C5" from archive
2017-05-12 09:42:42 IDT LOG: restored log file "000000010000036A000000C6" from archive
ERROR: WAL file '000000010000036A000000C7' not found in server 'main-db-server'
2017-05-12 09:42:42 IDT LOG: started streaming WAL from primary at 36A/C6000000 on timeline 1
Thanks!

The problem seems to have been resolved.
Apparently I had hardware issues on the master server.
I was able to perform a full pg_dump and reindex my DB, so I was fairly sure I did not have any data integrity issues.
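(For reference, the check was along these lines; a sketch with mydb as a placeholder database name:)
# reads every row, so it fails if corrupted heap data is hit
pg_dump mydb > /dev/null
# rebuild all indexes in the database
psql -d mydb -c 'REINDEX DATABASE mydb;'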
But when looking at the master server logs after enabling log_checkpoints in the config, a few minutes before the slave server stopped performing checkpoints, I saw the following message:
IDT ERROR: failed to re-find parent key in index "<table_name>_id_udx" for split pages 17/18
After seeing that, I decided to switch hosting providers and moved my DB to a new server.
Since then (almost a week now) everything has been running smoothly; replication and checkpoints are working as expected.
I really hope this helps other people: when something like this happens, be aware that it might be caused by data integrity or hardware issues.

Related

PostgreSQL error: PANIC: could not locate a valid checkpoint record

In the Datastore logs I encountered the following error; I'm not sure what has gone wrong.
[7804] LOG: starting PostgreSQL 13.1, compiled by Visual C++ build 1914, 64-bit
2021-08-23 22:56:15.980 CEST [7804] LOG: listening on IPv4 address "127.0.0.1", port 9003
2021-08-23 22:56:15.983 CEST [7804] LOG: listening on IPv4 address "10.91.198.36", port 9003
2021-08-23 22:56:16.041 CEST [8812] LOG: database system was shut down at 2021-08-23 22:54:51 CEST
2021-08-23 22:56:16.044 CEST [8812] LOG: invalid primary checkpoint record
2021-08-23 22:56:16.045 CEST [8812] PANIC: could not locate a valid checkpoint record
2021-08-23 22:56:16.076 CEST [7804] LOG: startup process (PID 8812) was terminated by exception 0xC0000409
2021-08-23 22:56:16.076 CEST [7804] HINT: See C include file "ntstatus.h" for a description of the hexadecimal value.
2021-08-23 22:56:16.078 CEST [7804] LOG: aborting startup due to startup process failure
2021-08-23 22:56:16.094 CEST [7804] LOG: database system is shut down
Somebody deleted crucial WAL files (to free space?), and now your cluster is corrupted.
Restore from backup. If you have no backup, running pg_resetwal is an option, since it seems there was a clean shutdown.
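A sketch of that last resort, with the data directory path as a placeholder (stop the server first, run it as the OS user that owns the data directory, and be aware that discarding WAL can lose data):
pg_resetwal -D /path/to/data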

PostgreSQL 9.4.1 Switchover & Switchback without recovery_target_timeline=latest

I have tested different scenarios for switchover and switchback in PostgreSQL 9.4.1.
Scenario 1: PostgreSQL switchover and switchback in 9.4.1.
Scenario 2: Is the parameter recovery_target_timeline='latest' mandatory for switchover and switchback in PostgreSQL 9.4.1?
Scenario 3: On this page.
To test scenario 3, I followed the steps below.
1) Stop the application connected to the primary server.
2) Confirm that all applications were stopped and all threads were disconnected from the primary DB.
# 192.x.x.129 (primary)
3) Cleanly shut down the primary using
pg_ctl -D $PGDATA stop -m fast
# On the DR side (192.x.x.128), check the sync status:
postgres=# select pg_last_xlog_receive_location(),pg_last_xlog_replay_location();
-[ RECORD 1 ]-----------------+-----------
pg_last_xlog_receive_location | 4/57000090
pg_last_xlog_replay_location | 4/57000090
4) Stop the DR server (192.x.x.128):
pg_ctl -D $PGDATA stop -mf
pg_log:
2019-12-02 13:16:09 IST LOG: received fast shutdown request
2019-12-02 13:16:09 IST LOG: aborting any active transactions
2019-12-02 13:16:09 IST LOG: shutting down
2019-12-02 13:16:09 IST LOG: database system is shut down
# 192.x.x.128 (DR)
5) Make the following change on the DR server:
mv recovery.conf recovery.conf_bkp
6) Make the following changes on 192.x.x.129 (primary):
[postgres@localhost data]$ cat recovery.conf
standby_mode = 'on'
primary_conninfo = 'user=replication password=postgres host=192.x.x.128 port=5432 sslmode=prefer sslcompression=1 krbsrvname=postgres'
restore_command = 'cp %p /home/postgres/restore/%f'
trigger_file='/tmp/promote'
7) Start the DR server in read/write mode:
pg_ctl -D $PGDATA start
pg_log:
2019-12-02 13:20:21 IST LOG: database system was shut down in recovery at 2019-12-02 13:16:09 IST
2019-12-02 13:20:22 IST LOG: database system was not properly shut down; automatic recovery in progress
2019-12-02 13:20:22 IST LOG: consistent recovery state reached at 4/57000090
2019-12-02 13:20:22 IST LOG: invalid record length at 4/57000090
2019-12-02 13:20:22 IST LOG: redo is not required
2019-12-02 13:20:22 IST LOG: database system is ready to accept connections
2019-12-02 13:20:22 IST LOG: autovacuum launcher started
(END)
We can see in the log above that the old primary is now the DR of the new primary (which was the old DR), and no errors are shown because the timeline ID on the new primary is the same one that already exists on the new DR.
8) Start the old primary in read-only (standby) mode:
pg_ctl -D $PGDATA start
logs:
2019-12-02 13:24:50 IST LOG: database system was shut down at 2019-12-02 11:14:50 IST
2019-12-02 13:24:51 IST LOG: entering standby mode
cp: cannot stat ‘pg_xlog/RECOVERYHISTORY’: No such file or directory
cp: cannot stat ‘pg_xlog/RECOVERYXLOG’: No such file or directory
2019-12-02 13:24:51 IST LOG: consistent recovery state reached at 4/57000090
2019-12-02 13:24:51 IST LOG: record with zero length at 4/57000090
2019-12-02 13:24:51 IST LOG: database system is ready to accept read only connections
2019-12-02 13:24:51 IST LOG: started streaming WAL from primary at 4/57000000 on timeline 9
2019-12-02 13:24:51 IST LOG: redo starts at 4/57000090
(END)
Question 1: In this scenario I have only demonstrated a switchover; with this method we can do both switchover and switchback. If switchover and switchback work this way, why did the PostgreSQL community introduce recovery_target_timeline='latest' and apply patches from PostgreSQL 9.3 up to the latest version (see this blog: https://www.enterprisedb.com/blog/switchover-switchback-in-postgresql-9-3)?
Question 2: What does the log message cp: cannot stat ‘pg_xlog/RECOVERYHISTORY’: No such file or directory mean?
Question 3: Between scenario 1 and scenario 3, which method is the correct way to do switchover and switchback? (Scenario 2 fails unless recovery_target_timeline='latest' is used, as the community experts know.)
Answers:
If you shut down the standby cleanly, then remove recovery.conf and restart it, it will come up, but has to perform crash recovery (database system was not properly shut down).
The proper way to promote a standby to a primary is by using the trigger file or running pg_ctl promote (or, from v12 on, by running the SQL function pg_promote). Then you have no downtime and don't need to perform crash recovery.
Promoting the standby will make it pick a new time line, so you need recovery_target_timeline = 'latest' if you want the new standby to follow that time line switch.
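As a sketch, reusing the addresses from the scenario above ($PGDATA and the connection settings are placeholders), promotion and the timeline-following setting look like this:
# on the standby that should become the new primary
pg_ctl -D $PGDATA promote
# recovery.conf (9.4 syntax) on the server that should follow the new primary
standby_mode = 'on'
primary_conninfo = 'host=192.x.x.128 port=5432 user=replication'
recovery_target_timeline = 'latest'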
That is caused by your restore_command: it has %p and %f the wrong way around, so it copies from the recovery target path (e.g. pg_xlog/RECOVERYHISTORY) instead of into it, and cp complains that the source file does not exist.
The method shown in 1. above is the correct one.

PostgreSQL connection issue after service restart

I edited my pg_hba.conf file, copied it to the server, and restarted the service with "sudo service postgresql restart", but after that the server does not accept connections.
It shows the error below. Your database returned: "Connection to 138.2xx.1xx.xx:5432 refused. Check that the hostname and port are correct and that the postmaster is accepting TCP/IP connections."
The Jenkins job and the data visualization tools, which were working fine previously, are now failing. What could be the reason?
I'm getting this in the PostgreSQL log:
2019-10-23 07:21:25.829 CEST [11761] LOG: received fast shutdown request
2019-10-23 07:21:25.829 CEST [11761] LOG: aborting any active transactions
2019-10-23 07:21:25.829 CEST [11766] LOG: autovacuum launcher shutting down
2019-10-23 07:21:25.832 CEST [11763] LOG: shutting down
2019-10-23 07:21:25.919 CEST [11761] LOG: database system is shut down
2019-10-23 07:21:27.068 CEST [22633] LOG: database system was shut down at 2019-10-23 07:21:25 CEST
2019-10-23 07:21:27.073 CEST [22633] LOG: MultiXact member wraparound protections are now enabled
2019-10-23 07:21:27.075 CEST [22631] LOG: database system is ready to accept connections
2019-10-23 07:21:27.075 CEST [22637] LOG: autovacuum launcher started
2019-10-23 07:21:27.390 CEST [22639] [unknown]#[unknown] LOG: incomplete startup packet
The check below shows no response:
root@Ubuntu-1604-xenial-64-minimal ~ # pg_isready -h localhost -p 5432
localhost:5432 - no response
The following was already set in the postgresql.conf file:
listen_addresses = '*'
Do I need to restart the entire server?
Can anyone please help me resolve this?
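For context, a few checks that usually narrow this down; a sketch assuming an Ubuntu 16.04 box with the stock Debian/Ubuntu packaging:
sudo pg_lsclusters                                          # is the cluster running, and on which port?
sudo ss -tlnp | grep postgres                               # what is postgres actually listening on?
sudo tail -n 50 /var/log/postgresql/postgresql-*-main.log   # any errors logged during the restart?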

PostgreSQL PITR not working properly

I am trying to restore a PostgreSQL database to a point in time.
When I use only restore_command in recovery.conf, it works fine:
restore_command = 'cp /var/lib/pgsql/pg_log_archive/%f %p'
When I also use the recovery_target_time parameter, it does not restore to the target time:
restore_command = 'cp /var/lib/pgsql/pg_log_archive/%f %p'
recovery_target_time='2018-06-05 06:43:00.0'
Below is the log file content:
2018-06-05 07:31:39.166 UTC [22512] LOG: database system was interrupted; last known up at 2018-06-05 06:35:52 UTC
2018-06-05 07:31:39.664 UTC [22512] LOG: starting point-in-time recovery to 2018-06-05 06:43:00+00
2018-06-05 07:31:39.671 UTC [22512] LOG: restored log file "00000005.history" from archive
2018-06-05 07:31:39.769 UTC [22512] LOG: restored log file "00000005000000020000008F" from archive
2018-06-05 07:31:39.816 UTC [22512] LOG: redo starts at 2/8F000028
2018-06-05 07:31:39.817 UTC [22512] LOG: consistent recovery state reached at 2/8F000130
2018-06-05 07:31:39.818 UTC [22510] LOG: database system is ready to accept read only connections
2018-06-05 07:31:39.912 UTC [22512] LOG: restored log file "000000050000000200000090" from archive
2018-06-05 07:31:39.996 UTC [22512] LOG: recovery stopping before abort of transaction 9525, time 2018-06-05 06:45:02.088502+00
2018-06-05 07:31:39.996 UTC [22512] LOG: recovery has paused
I am trying to restore the database instance to 06:43:00. Why is it recovering up to 06:45:02?
EDIT
In the first scenario, recovery.conf was renamed to recovery.done, but this did not happen in the second scenario.
What could be the reason for this?
You forgot to set
recovery_target_action = 'promote'
After point-in-time recovery, recovery_target_action determines how PostgreSQL will proceed.
The default value is pause, which means that PostgreSQL will do nothing and wait for you to tell it how to proceed.
To complete recovery, connect to the database and run
SELECT pg_wal_replay_resume();
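Alternatively, the target action can be put directly into recovery.conf so recovery finishes on its own; a sketch based on the settings quoted above:
restore_command = 'cp /var/lib/pgsql/pg_log_archive/%f %p'
recovery_target_time = '2018-06-05 06:43:00'
recovery_target_action = 'promote'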
It seems that there has been no database activity logged between 06:43:00 and 06:45:02. Observe that the log says recovery stopping before abort of transaction 9525.

What is the purpose of `pg_logical` directory inside PostgreSQL data?

I've just stumbled upon this error while testing failover of a PostgreSQL 9.4 cluster I've set up. Here I'm trying to promote a slave to be the new master:
$ repmgr -f /etc/repmgr/repmgr.conf --verbose standby promote
2014-09-22 10:46:37 UTC LOG: database system shutdown was interrupted; last known up at 2014-09-22 10:44:02 UTC
2014-09-22 10:46:37 UTC LOG: database system was not properly shut down; automatic recovery in progress
2014-09-22 10:46:37 UTC LOG: redo starts at 0/18000028
2014-09-22 10:46:37 UTC LOG: consistent recovery state reached at 0/19000600
2014-09-22 10:46:37 UTC LOG: record with zero length at 0/1A000090
2014-09-22 10:46:37 UTC LOG: redo done at 0/1A000028
2014-09-22 10:46:37 UTC LOG: last completed transaction was at log time 2014-09-22 10:36:22.679806+00
2014-09-22 10:46:37 UTC FATAL: could not open directory "pg_logical/snapshots": No such file or directory
2014-09-22 10:46:37 UTC LOG: startup process (PID 2595) exited with exit code 1
2014-09-22 10:46:37 UTC LOG: aborting startup due to startup process failure
The pg_logical/snapshots directory does in fact exist on the master node, and it is empty.
UPD: I've just manually created the empty directories pg_logical/snapshots and pg_logical/mappings, and the server started without complaining. repmgr standby clone seems to omit these directories while syncing. But the question still remains, since I'm curious what this directory is for; maybe I'm missing something in my setup. Simply googling it did not yield any meaningful results.
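(For reference, the manual workaround was essentially this, run as the postgres OS user, with $PGDATA standing in for the data directory:)
mkdir -p $PGDATA/pg_logical/snapshots $PGDATA/pg_logical/mappings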
It's for the new logical changeset extraction / logical replication feature in 9.4.
This shouldn't happen, though... it suggests a significant bug somewhere, probably repmgr. I'll wait for details (repmgr version, etc).
Update: Confirmed, it's a repmgr bug. It's fixed in git master already (and was before this report) and will be in the next release. Which had better be soon, given the significance of this issue.