PSQL timeline conflict prevents start of master - postgresql

We had an outage on one of our PSQL 14 clusters (managed by Zalando) due to the k8s control plane being unreachable for 30 minutes.
The control plane is now OK, but the master PSQL does not want to start:
LOG,00000,"listening on IPv4 address ""0.0.0.0"", port 5432"
LOG,00000,"listening on IPv6 address ""::"", port 5432"
LOG,00000,"listening on Unix socket ""/var/run/postgresql/.s.PGSQL.5432"""
LOG,00000,"database system was shut down at 2023-01-30 02:51:10 UTC"
WARNING,01000,"specified neither primary_conninfo nor restore_command",,"The database server will regularly poll the pg_wal subdirectory to check for files placed there."
LOG,00000,"entering standby mode"
FATAL,XX000,"requested timeline 5 is not a child of this server's history","Latest checkpoint is at 2/82000028 on timeline 4, but in the history of the requested timeline, the server forked off from that timeline at 0/530000A0."
LOG,00000,"startup process (PID 23007) exited with exit code 1"
LOG,00000,"aborting startup due to startup process failure"
LOG,00000,"database system is shut down"
We can see in the archive_status folder:
-rw-------. 1 postgres postgres 0 Jan 30 02:51 000000040000000200000081.ready
-rw-------. 1 postgres postgres 0 Jan 30 02:51 00000005.history.done
Would you know how we can recover safely from this?
I guess switching back to timeline 4 would be enough, as timeline 5 was created after the start of the outage.

The server is started in standby mode. Remove standby.signal if you want to start the server as a primary server.
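A minimal sketch of that path, assuming the data directory is /var/lib/postgresql/data (adjust to your actual PGDATA) and that this instance really holds the newest data you want to keep:
# remove the standby marker so the server starts as a primary instead of trying to follow timeline 5
rm /var/lib/postgresql/data/standby.signal
pg_ctl -D /var/lib/postgresql/data start
If the pod is still managed by the operator, it may recreate the standby configuration, so do this while the cluster is paused or on a manual copy of the data.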

Related

Armitage 'Connection refused' error in new install of Kali Linux after full upgrade

I installed Kali Linux via VMware and did a full system upgrade:
apt-get update
apt-get upgrade
apt-get full-upgrade
As part of the upgrade, PostgreSQL was upgraded from v11 to v12. I followed the instructions to finish this part of the upgrade:
pg_dropcluster 12 main --stop
pg_upgradecluster 11 main
pg_dropcluster 11 main
I start postgresql, initialize metasploit, and start Armitage:
/etc/init.d/postgresql start
msfdb init
armitage
The only console output appears unrelated:
Picked up _JAVA_OPTIONS: -Dawt.useSystemAAFontSettings=on
-Dswing.aatext=true
I do get the popup box with the connection information. I found that I get the "Unexpected end of file from server" error if I use 'localhost' as the host, so, per their instructions, I changed it to the external IP (in this case 192.168.9.134). I checked metasploit-framework/config/database.yml for the port and login credentials.
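For reference, the fields worth checking in that file usually look something like this (the values below are illustrative, not taken from your install):
production:
  adapter: postgresql
  database: msf
  username: msf
  password: <generated-password>
  host: 127.0.0.1
  port: 5432
  pool: 5
  timeout: 5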
After clicking 'Connect' with this information I get a connection window stating:
Connecting to 192.168.9.134:5432 Connection refused (Connection refused)
There's also a progress bar that fills up completely over time (unless I click 'Cancel'), after which nothing happens. As I run the command from the terminal I can see that the process is still running (I don't get my prompt back), but the window disappears and Armitage doesn't actually start. The log file, as verified by pg_lsclusters (/var/log/postgresql/postgresql-12-main.log), is actually empty.
The link I mentioned before suggests that the problem could either be not enough RAM (I set the VM to have 4 GB, and free -m shows):
total used free shared buff/cache available
Mem: 3964 803 2677 29 483 2787
Swap: 4093 0 4093
Or that the Metasploit RPC daemon never started (that window does come up the first time, but not subsequent times). I verified that it's running via msfdb status:
● postgresql.service - PostgreSQL RDBMS
   Loaded: loaded (/lib/systemd/system/postgresql.service; disabled; vendor preset: disabled)
   Active: active (exited) since Fri 2020-02-07 16:06:52 EST; 19min ago
  Process: 1753 ExecStart=/bin/true (code=exited, status=0/SUCCESS)
 Main PID: 1753 (code=exited, status=0/SUCCESS)

Feb 07 16:06:52 kali systemd[1]: Starting PostgreSQL RDBMS...
Feb 07 16:06:52 kali systemd[1]: Started PostgreSQL RDBMS.

COMMAND  PID  USER     FD  TYPE DEVICE SIZE/OFF NODE NAME
postgres 1735 postgres 3u  IPv6 32516  0t0      TCP  localhost:5432 (LISTEN)
postgres 1735 postgres 4u  IPv4 32517  0t0      TCP  localhost:5432 (LISTEN)

UID      PID  PPID C STIME TTY STAT TIME CMD
postgres 1735 1    0 16:06 ?   Ss   0:00 /usr/lib/postgresql/12/bin/postgres -D /var/lib/postgresql/12/main -c config_file=/etc/postgresql/12/main/postgresql.conf

[+] Detected configuration file (/usr/share/metasploit-framework/config/database.yml)
Also, running regular Metasploit appears to work fine (msfconsole) and loads without error (not sure if there's any output that would be helpful here). I don't use postgresql directly, so I haven't messed with any configuration nor do I have any other applications (that I'm aware of) that use it, so it should be a pretty clean setup (not to mention this is a fresh install of Kali Linux). I'm out of ideas for what to check next. An online search didn't seem to match this problem well. Any thoughts?
Armitage has been deprecated for some time now, as it has not been updated since 2015 and is (to some extent) incompatible with current versions of Metasploit.
Although this may not fix your problem, I suggest not using software that is this far out of date.

Postgresql WalReceiver process waits on connecting master regardless of "connect_timeout"

I am trying to deploy an automated, highly available PostgreSQL cluster on Kubernetes. In cases of master failover or temporary master failures, the standby loses its streaming replication connection, and when it retries, it takes a long time for the attempt to fail and be retried.
I use PostgreSQL 10 and streaming replication (cluster-main-cluster-master-service is a service that always routes to the master, and all the replicas connect to this service for replication). I've tried setting options like connect_timeout and keepalives in primary_conninfo in the standby's recovery.conf, and wal_receiver_timeout in its postgresql.conf, but I could not make any progress with them.
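For reference, the settings described above look roughly like this on a PostgreSQL 10 standby (the replication user name and the exact keepalive values are illustrative; the host and connect_timeout are the ones from this setup):
# recovery.conf on the standby
standby_mode = 'on'
primary_conninfo = 'host=cluster-main-cluster-master-service port=5432 user=replicator connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=5 keepalives_count=3'

# postgresql.conf on the standby
wal_receiver_timeout = 10s   # only applies once a replication connection is already established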
First, when the master goes down, replication stops with the following error (state 1):
2019-10-06 14:14:54.042 +0330 [3039] LOG: replication terminated by primary server
2019-10-06 14:14:54.042 +0330 [3039] DETAIL: End of WAL reached on timeline 17 at 0/33000098.
2019-10-06 14:14:54.042 +0330 [3039] FATAL: could not send end-of-streaming message to primary: no COPY in progress
2019-10-06 14:14:55.534 +0330 [12] LOG: record with incorrect prev-link 0/2D000028 at 0/33000098
After investigating Postgres activity I found that the WalReceiver process gets stuck in the LibPQWalReceiverConnect wait_event (state 2), but the timeout is much longer than what I configured (although I set connect_timeout to 10 seconds, it takes about 2 minutes). Then it fails with the following error (state 3):
2019-10-06 14:17:06.035 +0330 [3264] FATAL: could not connect to the primary server: could not connect to server: Connection timed out
Is the server running on host "cluster-main-cluster-master-service" (192.168.0.166) and accepting
TCP/IP connections on port 5432?
On the next try, it successfully connects to the primary (state 4):
2019-10-06 14:17:07.892 +0330 [5786] LOG: started streaming WAL from primary at 0/33000000 on timeline 17
I also tried killing the process when it gets stuck (state 2); when I do, the process is started again, connects, and then streams normally (jumps to state 4).
Checking netstat, I also found that the walreceiver process holds a connection in the SYN_SENT state to the old master (in the failover case).
connect_timeout governs how long PostgreSQL will wait for the replication connection to succeed, but that does not include establishing the TCP connection.
To reduce the time that the kernel waits for a successful answer to a TCP SYN request, reduce the number of retries. In /etc/sysctl.conf, set:
net.ipv4.tcp_syn_retries = 3
and run sysctl -p.
That should reduce the time significantly.
Reducing the value too much might make your system less stable.
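As a rough sketch of the effect (assuming the kernel's usual 1-second initial SYN retransmission timeout with exponential backoff): the default of 6 retries means the SYN is retransmitted after 1, 2, 4, 8, 16 and 32 s and abandoned only after a final 64 s wait, about 127 s in total, which matches the roughly 2 minutes observed above; with 3 retries the attempt fails after about 15 s.
# apply and verify (run as root)
sysctl -p
sysctl net.ipv4.tcp_syn_retries   # should now report 3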

PostgreSQL 10 not able to start on Mac

I'm running PostgreSQL 10 on a Mac running macOS Mojave 10.14.1, and I'm getting a "could not start PostgreSQL server" error: pg_ctl could not start server.
The log shows:
LOG: listening on IPv6 address "::1", port 5432
LOG: listening on IPv4 address "127.0.0.1", port 5432
LOG: listening on Unix socket "/tmp/.s.PGSQL.5432"
LOG: database system was interrupted; last known up at 2018-05-28 21:53:17 EDT
LOG: invalid record length at 0/18B6A00: wanted 24, got 0
LOG: invalid primary checkpoint record
LOG: invalid record length at 0/18B6920: wanted 24, got 0
LOG: invalid secondary checkpoint record
PANIC: could not locate a valid checkpoint record
LOG: startup process (PID 3462) was terminated by signal 6: Abort trap
LOG: aborting startup due to startup process failure
LOG: database system is shut down
Just wondering if anyone has seen something similar to this.
Thanks
Your database or WAL is corrupted. Normally it can recover by itself unless the damage is very bad. Is it a new database? If so, delete it and recreate it.
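If it really is a throwaway database, a minimal sketch of "delete it and recreate it" for a typical Homebrew install, assuming the data directory is /usr/local/var/postgres (check your own setup first, and only do this if you don't need the data):
# stop anything still trying to run, move the corrupt cluster aside, re-initialize
pg_ctl -D /usr/local/var/postgres stop -m fast
mv /usr/local/var/postgres /usr/local/var/postgres.corrupt
initdb -D /usr/local/var/postgres
pg_ctl -D /usr/local/var/postgres start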

Restart PostgreSQL without postgresql-server

I'm on CentOS 7 and I'm trying to get through the 'PG::ConnectionBad: FATAL: Peer authentication failed for user' error.
So I've already figured out that I should change pg_hba.conf (peer to md5), and I've done it. It seems that I have to restart Postgres, but it is not as easy as I thought.
I tried 'service postgresql restart', which resulted in 'Failed to restart postgresql.service: Unit not found.'
Then I tried to install postgresql-server and got:
oct 23 01:16:15 serverct1 pg_ctl[3280]: HINT: Is another postmaster already running on port 5432? If ...try.
oct 23 01:16:15 serverct1 pg_ctl[3280]: WARNING: could not create listen socket for "localhost"
oct 23 01:16:15 serverct1 pg_ctl[3280]: FATAL: could not create any TCP/IP sockets
oct 23 01:16:16 serverct1 pg_ctl[3280]: pg_ctl: could not start server
oct 23 01:16:16 serverct1 systemd[1]: postgresql.service: control process exited, code=exited status=1
oct 23 01:16:16 serverct1 systemd[1]: Failed to start PostgreSQL database server.
About port 5432 usage:
postgres 5432/tcp postgresql # POSTGRES
postgres 5432/udp postgresql # POSTGRES
So I'm curious:
1) Do postgresql and postgresql-server work separately?
2) Is it possible to restart postgresql without postgresql-server?
3) If not, how can I free port 5432 in order to run postgresql-server?
You can avoid trouble with the systemd service wrapper if you use the standard Postgres pg_ctl directly, e.g.:
pg_ctl reload
Or if needed pg_ctl reload -D $PGDATA
You don't need to restart PostgreSQL for pg_hba.conf changes to apply: https://www.postgresql.org/docs/current/static/auth-pg-hba-conf.html
The pg_hba.conf file is read on start-up and when the main server
process receives a SIGHUP signal. If you edit the file on an active
system, you will need to signal the postmaster (using pg_ctl reload or
kill -HUP) to make it re-read the file.
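A minimal sketch of that reload on a stock CentOS 7 layout (the data directory path is an assumption; adjust it or rely on $PGDATA):
# as the postgres OS user
su - postgres -c "pg_ctl reload -D /var/lib/pgsql/data"
# or, from any superuser SQL session, ask the server to re-read its configuration files
psql -U postgres -c "SELECT pg_reload_conf();"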

WAL contains references to invalid pages

CentOS 6.7
PostgreSQL 9.5.3
I have DB servers in master-standby replication.
Suddenly, the standby server's postgresql process stopped with these logs.
2016-07-14 18:14:19.544 JST [][5783e03b.3cdb][0][15579]WARNING: page 1671400 of relation base/16400/559613 is uninitialized
2016-07-14 18:14:19.544 JST [][5783e03b.3cdb][0][15579]CONTEXT: xlog redo Heap2/VISIBLE: cutoff xid 1902107520
2016-07-14 18:14:19.544 JST [][5783e03b.3cdb][0][15579]PANIC: WAL contains references to invalid pages
2016-07-14 18:14:19.544 JST [][5783e03b.3cdb][0][15579]CONTEXT: xlog redo Heap2/VISIBLE: cutoff xid 1902107520
2016-07-14 18:14:21.026 JST [][5783e038.3cd9][0][15577]LOG: startup process (PID 15579) was terminated by signal 6: Aborted
2016-07-14 18:14:21.026 JST [][5783e038.3cd9][0][15577]LOG: terminating any other active server processes
The master server's postgresql logs showed nothing special.
But the master server's /var/log/messages contained the following.
Jul 14 05:38:44 host kernel: sbridge: HANDLING MCE MEMORY ERROR
Jul 14 05:38:44 host kernel: CPU 8: Machine Check Exception: 0 Bank 9: 8c000040000800c0
Jul 14 05:38:44 host kernel: TSC 0 ADDR 1f7dad7000 MISC 90004000400008c PROCESSOR 0:306e4 TIME 1468442324 SOCKET 1 APIC 20
Jul 14 05:38:44 host kernel: EDAC MC1: CE row 1, channel 0, label "CPU_SrcID#1_Channel#0_DIMM#1": 1 Unknown error(s): memory scrubbing on FATAL area : cpu=8 Err=0008:00c0 (ch=0), addr = 0x1f7dad7000 => socket=1, Channel=0(mask=1), rank=4
Jul 14 05:38:44 host kernel:
Jul 14 18:30:40 host kernel: sbridge: HANDLING MCE MEMORY ERROR
Jul 14 18:30:40 host kernel: CPU 8: Machine Check Exception: 0 Bank 9: 8c000040000800c0
Jul 14 18:30:40 host kernel: TSC 0 ADDR 1f7dad7000 MISC 90004000400008c PROCESSOR 0:306e4 TIME 1468488640 SOCKET 1 APIC 20
Jul 14 18:30:41 host kernel: EDAC MC1: CE row 1, channel 0, label "CPU_SrcID#1_Channel#0_DIMM#1": 1 Unknown error(s): memory scrubbing on FATAL area : cpu=8 Err=0008:00c0 (ch=0), addr = 0x1f7dad7000 => socket=1, Channel=0(mask=1), rank=4
Jul 14 18:30:41 host kernel:
The memory errors started about a week ago, so I suspect the memory errors caused postgresql's error.
My questions are:
1) Can a kernel memory error cause postgresql's "WAL contains references to invalid pages" error?
2) Why is there nothing in the master server's postgresql logs?
Thanks.
Faulty memory can cause all kinds of data corruption, so that seems like a good enough explanation to me.
Perhaps there are no log entries at the master PostgreSQL server because all that was corrupted was the WAL stream.
You can run
oid2name
to find out which database has OID 16400 and then
oid2name -d <database with OID 16400> -f 559613
to find out which table belongs to file 559613.
Is that table larger than 12 GB? If not, that would mean that page 1671400 is indeed an invalid value (1671400 pages × the default 8 kB page size ≈ 12.7 GB, so a smaller table cannot contain that page number).
There may also be replication bugs in 9.5.3 that are fixed in later minor releases and could cause replication problems even without a hardware fault; read the release notes.
I would perform a new pg_basebackup and reinitialize the slave system.
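A sketch of that reinitialization, assuming the standby's data directory is /var/lib/pgsql/9.5/data and a replication user named replicator (both are assumptions; adjust to your setup):
# on the standby: stop PostgreSQL if it is still running, move the broken cluster aside
pg_ctl -D /var/lib/pgsql/9.5/data stop -m fast
mv /var/lib/pgsql/9.5/data /var/lib/pgsql/9.5/data.broken
# take a fresh base backup from the master; -R writes a recovery.conf with standby settings
pg_basebackup -h <master-host> -U replicator -D /var/lib/pgsql/9.5/data -X stream -P -R
pg_ctl -D /var/lib/pgsql/9.5/data start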
But what I'd really be worried about is possible data corruption on the master server. Block checksums are cool (turned on if pg_controldata <data directory> | grep checksum gives you 1), but possibly won't detect the effects of memory corruption.
Try something like
pg_dumpall -f /dev/null
on the master and see if there are errors.
Keep your old backups in case you need to repair something!