PostgreSQL PITR not working properly - postgresql

I am trying to restore a PostgreSQL database to a point in time.
When I use only restore_command in recovery.conf, it works fine:
restore_command = 'cp /var/lib/pgsql/pg_log_archive/%f %p'
But when I add the recovery_target_time parameter, it does not restore to the target time:
restore_command = 'cp /var/lib/pgsql/pg_log_archive/%f %p'
recovery_target_time='2018-06-05 06:43:00.0'
Below is the log file content:
2018-06-05 07:31:39.166 UTC [22512] LOG: database system was interrupted; last known up at 2018-06-05 06:35:52 UTC
2018-06-05 07:31:39.664 UTC [22512] LOG: starting point-in-time recovery to 2018-06-05 06:43:00+00
2018-06-05 07:31:39.671 UTC [22512] LOG: restored log file "00000005.history" from archive
2018-06-05 07:31:39.769 UTC [22512] LOG: restored log file "00000005000000020000008F" from archive
2018-06-05 07:31:39.816 UTC [22512] LOG: redo starts at 2/8F000028
2018-06-05 07:31:39.817 UTC [22512] LOG: consistent recovery state reached at 2/8F000130
2018-06-05 07:31:39.818 UTC [22510] LOG: database system is ready to accept read only connections
2018-06-05 07:31:39.912 UTC [22512] LOG: restored log file "000000050000000200000090" from archive
2018-06-05 07:31:39.996 UTC [22512] LOG: recovery stopping before abort of transaction 9525, time 2018-06-05 06:45:02.088502+00
2018-06-05 07:31:39.996 UTC [22512] LOG: recovery has paused
I am trying to restore the database instance to 06:43:00. Why is it recovering up to 06:45:02?
EDIT
In the first scenario, recovery.conf was renamed to recovery.done, but this did not happen in the second scenario.
What could be the reason for this?

You forgot to set
recovery_target_action = 'promote'
After point-in-time recovery, recovery_target_action determines how PostgreSQL proceeds.
The default value is pause, which means that PostgreSQL will do nothing and wait for you to tell it how to proceed.
To complete recovery, connect to the database and run
SELECT pg_wal_replay_resume();
It seems that there has been no database activity logged between 06:43:00 and 06:45:02. Observe that the log says recovery stopping before abort of transaction 9525.
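For reference, a minimal recovery.conf sketch that combines these settings (the archive path is the one from the question; adjust it to your setup):
restore_command = 'cp /var/lib/pgsql/pg_log_archive/%f %p'
recovery_target_time = '2018-06-05 06:43:00'
recovery_target_action = 'promote'   # end recovery and promote as soon as the target is reached
With 'promote', recovery.conf is renamed to recovery.done once the target is reached, which also explains why that rename did not happen in your second scenario under the default 'pause'.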

Related

PostgreSQL - Recovery process triggering after startup

I am trying to run PostgreSQL 9.5 on Ubuntu 16.04. Our PostgreSQL setup looks as follows:
PostgreSQL Primary (9.5 on Ubuntu 16.04, Docker container) --rsync--> Walstore (backup server, Docker Container) --rsync--> PostgreSQL Standby (v9.5 on Ubuntu 16.04, Docker container).
If the Primary crashes, we copy the Standby's PostgreSQL data (which is in sync with the Primary's data, excluding only the recovery.conf file) to the Primary's data directory and start the Primary PostgreSQL server again.
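For clarity, a rough sketch of that copy-and-restart step (host name and paths are placeholders, not our real values):
# on the Primary host, with its PostgreSQL instance stopped
rsync -a --delete --exclude=recovery.conf standby-host:/var/lib/postgresql/9.5/main/ /var/lib/postgresql/9.5/main/
pg_ctlcluster 9.5 main start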
After the Primary PostgreSQL server is up and running again, a recovery process is triggered, as the following logs show:
postgresql-docker-primary-1 | * Starting PostgreSQL 9.5 database server
postgresql-docker-primary-1 | ...done.
postgresql-docker-primary-1 | 2022-11-28 10:14:59 UTC [27-1] LOG: database system was shut down in recovery at 2022-11-28 10:11:14 UTC
postgresql-docker-primary-1 | 2022-11-28 10:14:59 UTC [27-2] LOG: database system was not properly shut down; automatic recovery in progress
postgresql-docker-primary-1 | 2022-11-28 10:14:59 UTC [27-3] LOG: redo starts at 0/1700DB8
postgresql-docker-primary-1 | 2022-11-28 10:14:59 UTC [27-4] LOG: invalid record length at 0/3000060
postgresql-docker-primary-1 | 2022-11-28 10:14:59 UTC [27-5] LOG: redo done at 0/3000028
postgresql-docker-primary-1 | 2022-11-28 10:14:59 UTC [27-6] LOG: last completed transaction was at log time 2022-11-28 10:04:37.388433+00
postgresql-docker-primary-1 | 2022-11-28 10:14:59 UTC [27-7] LOG: MultiXact member wraparound protections are now enabled
postgresql-docker-primary-1 | 2022-11-28 10:14:59 UTC [26-1] LOG: database system is ready to accept connections
postgresql-docker-primary-1 | 2022-11-28 10:14:59 UTC [31-1] LOG: autovacuum launcher started
postgresql-docker-primary-1 | 2022-11-28 10:14:59 UTC [34-1] [unknown]@[unknown] LOG: incomplete startup packet
Since the Standby's PostgreSQL data is in sync with the Primary's data, why does starting the Primary trigger automatic recovery? My expectation was that the Primary PostgreSQL database server would not perform automatic recovery. How does PostgreSQL decide whether a recovery process has to be triggered?
In our production case the automatic recovery would take around 3 hours, so the PostgreSQL server would be unavailable for around 3 hours, and we would like to avoid this behaviour.
Is there any documentation describing when the PostgreSQL database server triggers a recovery process?
Thank you very much for your feedback.
ctn

PostgreSQL 9.4.1 Switchover & Switchback without recovery_target_timeline=latest

I have tested different scenarios for doing switchover and switchback in PostgreSQL version 9.4.1.
Scenario 1: PostgreSQL Switchover and Switchback in 9.4.1
Scenario 2: Is the parameter recovery_target_timeline='latest' mandatory for switchover and switchback in PostgreSQL 9.4.1?
Scenario 3: On this page
To test scenario 3, I followed the steps below.
1) Stop the application connected to the primary server.
2) Confirm that all applications were stopped and all threads were disconnected from the primary DB.
#192.x.x.129(Primary)
3) Cleanly shut down the primary using
pg_ctl -D $PGDATA stop -mf
#DR(192.x.x.128) side check sync status:
postgres=# select pg_last_xlog_receive_location(),pg_last_xlog_replay_location();
-[ RECORD 1 ]-----------------+-----------
pg_last_xlog_receive_location | 4/57000090
pg_last_xlog_replay_location | 4/57000090
4) Stop the DR server (192.x.x.128):
pg_ctl -D $PGDATA stop -mf
pg_log:
2019-12-02 13:16:09 IST LOG: received fast shutdown request
2019-12-02 13:16:09 IST LOG: aborting any active transactions
2019-12-02 13:16:09 IST LOG: shutting down
2019-12-02 13:16:09 IST LOG: database system is shut down
#192.x.x.128(DR)
5) Make the following change on the DR server:
mv recovery.conf recovery.conf_bkp
6) Make the following changes on 192.x.x.129 (Primary):
[postgres@localhost data]$ cat recovery.conf
standby_mode = 'on'
primary_conninfo = 'user=replication password=postgres host=192.x.x.128 port=5432 sslmode=prefer sslcompression=1 krbsrvname=postgres'
restore_command = 'cp %p /home/postgres/restore/%f'
trigger_file='/tmp/promote'
7) Start the DR in read-write mode:
pg_ctl -D $DATA start
pg_log:
2019-12-02 13:20:21 IST LOG: database system was shut down in recovery at 2019-12-02 13:16:09 IST
2019-12-02 13:20:22 IST LOG: database system was not properly shut down; automatic recovery in progress
2019-12-02 13:20:22 IST LOG: consistent recovery state reached at 4/57000090
2019-12-02 13:20:22 IST LOG: invalid record length at 4/57000090
2019-12-02 13:20:22 IST LOG: redo is not required
2019-12-02 13:20:22 IST LOG: database system is ready to accept connections
2019-12-02 13:20:22 IST LOG: autovacuum launcher started
We can see in the above log that the old primary is now the DR of the new primary (which was the old DR), and no error is shown because the timeline ID on the new primary is the same one that already exists on the new DR.
8) Start the old primary in read-only (standby) mode:
pg_ctl -D $PGDATA start
logs:
2019-12-02 13:24:50 IST LOG: database system was shut down at 2019-12-02 11:14:50 IST
2019-12-02 13:24:51 IST LOG: entering standby mode
cp: cannot stat ‘pg_xlog/RECOVERYHISTORY’: No such file or directory
cp: cannot stat ‘pg_xlog/RECOVERYXLOG’: No such file or directory
2019-12-02 13:24:51 IST LOG: consistent recovery state reached at 4/57000090
2019-12-02 13:24:51 IST LOG: record with zero length at 4/57000090
2019-12-02 13:24:51 IST LOG: database system is ready to accept read only connections
2019-12-02 13:24:51 IST LOG: started streaming WAL from primary at 4/57000000 on timeline 9
2019-12-02 13:24:51 IST LOG: redo starts at 4/57000090
Question 1: In this scenario I have shown only the switchover, but with this method we can do both switchover and switchback. If switchover and switchback work with the method above, why did the PostgreSQL community introduce recovery_target_timeline='latest' and apply patches (see this blog: https://www.enterprisedb.com/blog/switchover-switchback-in-postgresql-9-3) from PostgreSQL 9.3 up to the latest version?
Question 2: What does the message cp: cannot stat ‘pg_xlog/RECOVERYHISTORY’: No such file or directory in the above log mean?
Question 3: Between scenario 1 and scenario 3, which method is the correct way to do switchover and switchback? Scenario 2 fails with an error because recovery_target_timeline='latest' must be used, as all the community experts know.
Answers:
1. If you shut down the standby cleanly, then remove recovery.conf and restart it, it will come up, but it has to perform crash recovery (database system was not properly shut down). The proper way to promote a standby to a primary is by using the trigger file or running pg_ctl promote (or, from v12 on, by running the SQL function pg_promote); then you have no downtime and don't need to perform crash recovery. Promoting the standby will make it pick a new timeline, so you need recovery_target_timeline = 'latest' if you want the new standby to follow that timeline switch (see the sketch after these answers).
2. That is caused by your restore_command: its arguments are reversed, so cp tries to read pg_xlog/RECOVERYHISTORY as its source. A restore_command must copy from the archive into the path PostgreSQL asks for, e.g. 'cp /path/to/archive/%f %p'.
3. The method shown in answer 1 above is the correct one.
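For reference, a minimal sketch of that promotion flow; the host and trigger file are the ones from step 6, while the archive path is a placeholder:
# On the standby that should become the new primary (instead of removing recovery.conf and restarting):
pg_ctl -D $PGDATA promote        # or: touch /tmp/promote, since trigger_file is configured
# recovery.conf on the old primary, so that it can follow the new primary's timeline switch:
standby_mode = 'on'
primary_conninfo = 'host=192.x.x.128 port=5432 user=replication'
restore_command = 'cp /path/to/archive/%f %p'
recovery_target_timeline = 'latest'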

Slow postgresql startup in docker container

We built a Debian docker image with PostgreSQL to run one of our services. The database is for internal container use and does not need port mapping. I believe it is installed via apt-get in the Dockerfile.
We stop and start this service often, and the database being slow to start up is a performance issue. Although the database is empty, it takes slightly over 20 seconds to accept connections the first time we start the docker image. The log is as follows:
2019-04-05 13:05:30.924 UTC [19] LOG: could not bind IPv6 socket: Cannot assign requested address
2019-04-05 13:05:30.924 UTC [19] HINT: Is another postmaster already running on port 5432? If not, wait a few seconds and retry.
2019-04-05 13:05:30.982 UTC [20] LOG: database system was shut down at 2019-04-05 12:57:16 UTC
2019-04-05 13:05:30.992 UTC [20] LOG: MultiXact member wraparound protections are now enabled
2019-04-05 13:05:30.998 UTC [19] LOG: database system is ready to accept connections
2019-04-05 13:05:30.998 UTC [24] LOG: autovacuum launcher started
2019-04-05 13:05:31.394 UTC [26] [unknown]@[unknown] LOG: incomplete startup packet
2019-04-19 13:21:58.974 UTC [37] LOG: could not bind IPv6 socket: Cannot assign requested address
2019-04-19 13:21:58.974 UTC [37] HINT: Is another postmaster already running on port 5432? If not, wait a few seconds and retry.
2019-04-19 13:21:59.025 UTC [38] LOG: database system was interrupted; last known up at 2019-04-05 13:05:34 UTC
2019-04-19 13:21:59.455 UTC [39] [unknown]@[unknown] LOG: incomplete startup packet
2019-04-19 13:21:59.971 UTC [42] postgres@postgres FATAL: the database system is starting up
[...]
2019-04-19 13:22:15.221 UTC [85] root@postgres FATAL: the database system is starting up
2019-04-19 13:22:15.629 UTC [38] LOG: database system was not properly shut down; automatic recovery in progress
2019-04-19 13:22:15.642 UTC [38] LOG: redo starts at 0/14EEBA8
2019-04-19 13:22:15.822 UTC [38] LOG: invalid record length at 0/24462D0: wanted 24, got 0
2019-04-19 13:22:15.822 UTC [38] LOG: redo done at 0/24462A8
2019-04-19 13:22:15.822 UTC [38] LOG: last completed transaction was at log time 2019-04-05 13:05:36.602318+00
2019-04-19 13:22:16.084 UTC [38] LOG: MultiXact member wraparound protections are now enabled
2019-04-19 13:22:16.094 UTC [37] LOG: database system is ready to accept connections
2019-04-19 13:22:16.094 UTC [89] LOG: autovacuum launcher started
2019-04-19 13:22:21.528 UTC [92] root@test LOG: could not receive data from client: Connection reset by peer
2019-04-19 13:22:21.528 UTC [92] root@test LOG: unexpected EOF on client connection with an open transaction
Any suggestions for fixing this startup issue?
EDIT: Some requested the Dockerfile; here are the relevant lines:
RUN apt-get update \
&& apt-get install -y --force-yes \
postgresql-9.6-pgrouting \
postgresql-9.6-postgis-2.3 \
postgresql-9.6-postgis-2.3-scripts \
[...]
# Download, compile and install GRASS 7.2
[...]
USER postgres
# Create a database 'grass_backend' owned by the "root" role.
RUN /etc/init.d/postgresql start \
&& psql --command "CREATE USER root WITH SUPERUSER [...];" \
&& psql --command "CREATE EXTENSION postgis; CREATE EXTENSION plpython3u;" --dbname [dbname] \
&& psql --command "CREATE EXTENSION postgis_sfcgal;" --dbname [dbname] \
&& psql --command "CREATE EXTENSION postgis; CREATE EXTENSION plpython3u;" --dbname grass_backend
WORKDIR [...]
The file ends after WORKDIR, which I guess means the database isn't properly shut down.
Answer: I now stop PostgreSQL properly inside the docker build (see the sketch below). It starts 15 s faster. Thanks for replying.
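For illustration, one way to do that in the Dockerfile above is to end the database-setup RUN step with a clean shutdown, so the image layer is committed with the server properly stopped (a sketch, not the exact change):
RUN /etc/init.d/postgresql start \
 && psql --command "CREATE USER root WITH SUPERUSER [...];" \
 && [...] \
 && /etc/init.d/postgresql stop   # clean shutdown, so the next start needs no crash recovery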
Considering the line database system was not properly shut down; automatic recovery in progress, that would definitely explain the slow startup. Please don't kill the service; send the stop command and wait for it to close properly.
Please note that the system might kill the process if it takes too long to stop. This will happen with PostgreSQL if there are still connections held to it (probably from your application). If you disconnect all the connections and then stop, PostgreSQL should be able to stop relatively quickly.
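If the container itself is stopped from outside, it can also help to allow a longer grace period than Docker's default 10 seconds before it sends SIGKILL (the container name below is a placeholder):
docker stop --time 60 my-postgres-container   # give PostgreSQL up to 60 s to shut down cleanly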
Also make sure you stop the PostgreSQL service inside the container before turning the container off.
TCP will keep connections lingering for a while. If you are starting and stopping in quick succession without properly stopping the service inside, that would explain why the port is unavailable. Normally the service can start and stop in very quick succession on my machine if nothing is connected to it.
3 start-stop cycles of postgresql on my machine (I have 2 decently sized databases)
$ time bash -c 'for i in 1 2 3; do /etc/init.d/postgresql-11 restart; done'
* Stopping PostgreSQL 11 (this can take up to 92 seconds) ... [ ok ]
* /run/postgresql: correcting mode
* Starting PostgreSQL 11 ... [ ok ]
* Stopping PostgreSQL 11 (this can take up to 92 seconds) ... [ ok ]
* /run/postgresql: correcting mode
* Starting PostgreSQL 11 ... [ ok ]
* Stopping PostgreSQL 11 (this can take up to 92 seconds) ... [ ok ]
* /run/postgresql: correcting mode
* Starting PostgreSQL 11 ... [ ok ]
real 0m1.188s
user 0m0.260s
sys 0m0.080s

pg_xlog files not recycling on slave

I've set up streaming replication with PostgreSQL 9.3.
My problem is that on the slave server the pg_xlog folder just gets fuller and fuller, and WAL files are not getting recycled.
The slave server has the following (relevant) values in postgresql.conf:
wal_keep_segments = 150
hot_standby = on
checkpoint_segments = 32
checkpoint_completion_target = 0.9
archive_mode = off
#archive_command = ''
My initial replication command was:
pg_basebackup --xlog-method=stream -h <master-ip> -D . --username=replication --password
So I guess my WAL files are OK.
Here is my slave server startup log:
2017-05-08 09:55:31 IDT LOG: database system was shut down in recovery at 2017-05-08 09:55:19 IDT
2017-05-08 09:55:31 IDT LOG: entering standby mode
2017-05-08 09:55:31 IDT LOG: redo starts at 361/C76DD3E8
2017-05-08 09:55:31 IDT LOG: consistent recovery state reached at 361/C89A8278
2017-05-08 09:55:31 IDT LOG: database system is ready to accept read only connections
2017-05-08 09:55:31 IDT LOG: record with zero length at 361/C89A8278
2017-05-08 09:55:31 IDT LOG: started streaming WAL from primary at 361/C8000000 on timeline 1
2017-05-08 09:55:32 IDT LOG: incomplete startup packet
2017-05-08 09:58:34 IDT LOG: received SIGHUP, reloading configuration files
2017-05-08 09:58:34 IDT LOG: parameter "checkpoint_completion_target" changed to "0.9"
I even tried to manually copy older WAL files from the master server to the slave, but that also didn't help.
What am I doing wrong? How can I stop the pg_xlog folder from growing indefinitely?
Is it related to the "incomplete startup packet" log message?
One last thing: under the pg_xlog/archive_status folder, all of the WAL files have a .done suffix.
Appreciate any help I can get on this.
Edit:
I enabled log_checkpoints in postgresql.conf.
Here are the relevant log entries since I enabled it:
2017-05-12 08:43:11 IDT LOG: parameter "log_checkpoints" changed to "on"
2017-05-12 08:43:24 IDT LOG: checkpoint complete: wrote 2128 buffers (0.9%); 0 transaction log file(s) added, 0 removed, 9 recycled; write=189.240 s, sync=0.167 s, total=189.549 s; sync files=745, longest=0.010 s, average=0.000 s
2017-05-12 08:45:15 IDT LOG: checkpoint starting: time
2017-05-12 08:48:46 IDT LOG: checkpoint complete: wrote 15175 buffers (6.6%); 0 transaction log file(s) added, 0 removed, 1 recycled; write=209.078 s, sync=1.454 s, total=210.617 s; sync files=769, longest=0.032 s, average=0.001 s
2017-05-12 08:50:15 IDT LOG: checkpoint starting: time
2017-05-12 08:53:45 IDT LOG: checkpoint complete: wrote 2480 buffers (1.1%); 0 transaction log file(s) added, 0 removed, 1 recycled; write=209.162 s, sync=0.991 s, total=210.253 s; sync files=663, longest=0.076 s, average=0.001 s
Edit 2:
Since my slave server shows no restartpoints in the log, here is the relevant log from the slave starting up and recovering WALs before it reaches a consistent recovery state:
2017-05-12 09:35:42 IDT LOG: database system was shut down in recovery at 2017-05-12 09:35:41 IDT
2017-05-12 09:35:42 IDT LOG: entering standby mode
2017-05-12 09:35:42 IDT LOG: incomplete startup packet
2017-05-12 09:35:43 IDT FATAL: the database system is starting up
2017-05-12 09:35:43 IDT LOG: restored log file "0000000100000369000000B1" from archive
2017-05-12 09:35:43 IDT FATAL: the database system is starting up
2017-05-12 09:35:44 IDT FATAL: the database system is starting up
2017-05-12 09:35:44 IDT LOG: restored log file "0000000100000369000000AF" from archive
2017-05-12 09:35:44 IDT LOG: redo starts at 369/AFD28900
2017-05-12 09:35:44 IDT FATAL: the database system is starting up
2017-05-12 09:35:45 IDT FATAL: the database system is starting up
2017-05-12 09:35:45 IDT FATAL: the database system is starting up
2017-05-12 09:35:46 IDT LOG: restored log file "0000000100000369000000B0" from archive
2017-05-12 09:35:46 IDT FATAL: the database system is starting up
2017-05-12 09:35:46 IDT FATAL: the database system is starting up
2017-05-12 09:35:47 IDT FATAL: the database system is starting up
2017-05-12 09:35:47 IDT LOG: restored log file "0000000100000369000000B1" from archive
2017-05-12 09:35:47 IDT FATAL: the database system is starting up
2017-05-12 09:35:48 IDT FATAL: the database system is starting up
2017-05-12 09:35:48 IDT LOG: incomplete startup packet
2017-05-12 09:35:49 IDT LOG: restored log file "0000000100000369000000B2" from archive
2017-05-12 09:35:50 IDT LOG: restored log file "0000000100000369000000B3" from archive
2017-05-12 09:35:52 IDT LOG: restored log file "0000000100000369000000B4" from archive
.
.
.
2017-05-12 09:42:33 IDT LOG: restored log file "000000010000036A000000C0" from archive
2017-05-12 09:42:35 IDT LOG: restored log file "000000010000036A000000C1" from archive
2017-05-12 09:42:36 IDT LOG: restored log file "000000010000036A000000C2" from archive
2017-05-12 09:42:37 IDT LOG: restored log file "000000010000036A000000C3" from archive
2017-05-12 09:42:37 IDT LOG: consistent recovery state reached at 36A/C3ACEB28
2017-05-12 09:42:37 IDT LOG: database system is ready to accept read only connections
2017-05-12 09:42:39 IDT LOG: restored log file "000000010000036A000000C4" from archive
2017-05-12 09:42:40 IDT LOG: restored log file "000000010000036A000000C5" from archive
2017-05-12 09:42:42 IDT LOG: restored log file "000000010000036A000000C6" from archive
ERROR: WAL file '000000010000036A000000C7' not found in server 'main-db-server'
2017-05-12 09:42:42 IDT LOG: started streaming WAL from primary at 36A/C6000000 on timeline 1
Thanks!
The problem seems to have been resolved.
Apparently I had hardware issues on the master server.
I was able to perform a full pg_dump and re-index my DB, so I was fairly sure I did not have any data integrity issues.
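For reference, the kind of check I mean (the database name is a placeholder):
pg_dump mydb > /tmp/mydb.sql               # a full dump reads every table and fails on corrupt heap pages
psql -d mydb -c 'REINDEX DATABASE mydb;'   # rebuilds all indexes and fails on corrupt index pages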
But when I looked at the master server logs after enabling log_checkpoints in the config, I saw the following message a few minutes before the slave server stopped performing checkpoints:
IDT ERROR: failed to re-find parent key in index "<table_name>_id_udx" for split pages 17/18
After seeing that, I decided to switch hosting providers and moved my DB to a new server.
Since then (almost a week now), everything has been running smoothly; replication and checkpoints are running as expected.
I really hope this helps other people: when something like this happens, be aware that the issue might be caused by data integrity or hardware problems.

What is the purpose of `pg_logical` directory inside PostgreSQL data?

I've just stumbled upon this error while testing failover of a PostgreSQL 9.4 cluster I've set up. Here I'm trying to promote a slave to be the new master:
$ repmgr -f /etc/repmgr/repmgr.conf --verbose standby promote
2014-09-22 10:46:37 UTC LOG: database system shutdown was interrupted; last known up at 2014-09-22 10:44:02 UTC
2014-09-22 10:46:37 UTC LOG: database system was not properly shut down; automatic recovery in progress
2014-09-22 10:46:37 UTC LOG: redo starts at 0/18000028
2014-09-22 10:46:37 UTC LOG: consistent recovery state reached at 0/19000600
2014-09-22 10:46:37 UTC LOG: record with zero length at 0/1A000090
2014-09-22 10:46:37 UTC LOG: redo done at 0/1A000028
2014-09-22 10:46:37 UTC LOG: last completed transaction was at log time 2014-09-22 10:36:22.679806+00
2014-09-22 10:46:37 UTC FATAL: could not open directory "pg_logical/snapshots": No such file or directory
2014-09-22 10:46:37 UTC LOG: startup process (PID 2595) exited with exit code 1
2014-09-22 10:46:37 UTC LOG: aborting startup due to startup process failure
The pg_logical/snapshots directory does in fact exist on the master node, and it is empty.
UPDATE: I just manually created the empty directories pg_logical/snapshots and pg_logical/mappings, and the server started without complaining (see the sketch below). repmgr standby clone seems to omit these directories while syncing. But the question still remains, because I'm curious what this directory is for; maybe I'm missing something in my setup. Simply Googling it did not yield any meaningful results.
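For anyone hitting the same error, this is essentially all it took (assuming $PGDATA points at the promoted node's data directory):
mkdir -p $PGDATA/pg_logical/snapshots $PGDATA/pg_logical/mappings   # recreate the directories that the clone omitted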
It's for the new logical changeset extraction / logical replication feature in 9.4.
This shouldn't happen, though... it suggests a significant bug somewhere, probably repmgr. I'll wait for details (repmgr version, etc).
Update: Confirmed, it's a repmgr bug. It's fixed in git master already (and was before this report) and will be in the next release. Which had better be soon, given the significance of this issue.