Greenplum DB stuck in recovery mode

I have a Greenplum deployment with several segments.
Version: postgres (Greenplum Database) 8.2.15
A few days ago the following errors appeared in pg_log:
FATAL 54000 out of on_shmem_exit slots
WARNING 1000 StartTransaction while in START state
PANIC XX000 Waiting on lock already held! (lwlock.c:557)
LOG 0 server process (PID 224596) was terminated by signal 6: Aborted
LOG 0 terminating any other active server processes
FATAL 57P01 terminating connection due to administrator command
LOG 0 sweeper process (PID 102635) exited with exit code 2
LOG 0 seqserver process (PID 102632) exited with exit code 2
FATAL 57P03 the database system is in recovery mode
LOG 0 ftsprobe process (PID 102633) exited with exit code 2
FATAL 57P03 the database system is in recovery mode
FATAL 57P03 the database system is in recovery mode
FATAL 57P03 the database system is in recovery mode
FATAL 57P03 the database system is in recovery mode
FATAL 57P03 the database system is in recovery mode
The database has now been in recovery mode for 4 days. gpstart and gpstop both fail with the error "the database system is in recovery mode".
Output of ps:
[gpadmin@mdw1 ~]$ ps -ef | grep post
gpadmin 2979 189094 0 12:25 pts/0 00:00:00 grep post
postfix 3264 3251 0 2015 ? 00:34:40 qmgr -l -t fifo -u
gpadmin 102637 230099 0 May18 ? 00:01:47 postgres: port 5432, stats sender process
gpadmin 230099 1 0 Apr24 ? 02:47:53 /usr/local/greenplum-db-4.3.10.0/bin/postgres -D /data/master/gpseg-1 -p 5432 -b 1 -z 96 --silent-mode=true -i -M master -C -1 -x 194 -E
gpadmin 230100 230099 0 Apr24 ? 00:49:45 postgres: port 5432, master logger process
[gpadmin@mdw1 ~]$
I have searched extensively but have not been able to find a solution. Please point me to how I can bring the database back up.

Postgres Replication with pglogical: ERROR: connection to other side has died

Got this error (on replica) while replicating between 2 Postgres instances:
ERROR: connection to other side has died
Here are the logs on the replica/subscriber:
2017-09-15 20:03:55 UTC [14335-3] LOG: apply worker [14335] at slot 7 generation 109 crashed
2017-09-15 20:03:55 UTC [2961-1732] LOG: worker process: pglogical apply 16384:3661733826 (PID 14335) exited with exit code 1
2017-09-15 20:03:59 UTC [14331-2] ERROR: connection to other side has died
2017-09-15 20:03:59 UTC [14331-3] LOG: apply worker [14331] at slot 2 generation 132 crashed
2017-09-15 20:03:59 UTC [2961-1733] LOG: worker process: pglogical apply 16384:3423246629 (PID 14331) exited with exit code 1
2017-09-15 20:04:02 UTC [14332-2] ERROR: connection to other side has died
2017-09-15 20:04:02 UTC [14332-3] LOG: apply worker [14332] at slot 4 generation 125 crashed
2017-09-15 20:04:02 UTC [2961-1734] LOG: worker process: pglogical apply 16384:2660030132 (PID 14332) exited with exit code 1
2017-09-15 20:04:02 UTC [14350-1] LOG: starting apply for subscription parking_sub
2017-09-15 20:04:05 UTC [14334-2] ERROR: connection to other side has died
2017-09-15 20:04:05 UTC [14334-3] LOG: apply worker [14334] at slot 6 generation 119 crashed
2017-09-15 20:04:05 UTC [2961-1735] LOG: worker process: pglogical apply 16384:394989729 (PID 14334) exited with exit code 1
2017-09-15 20:04:06 UTC [14333-2] ERROR: connection to other side has died
Logs on master/provider:
2017-09-15 23:22:43 UTC [22068-5] repuser@ga-master ERROR: got sequence entry 1 for toast chunk 1703536315 instead of seq 0
2017-09-15 23:22:43 UTC [22068-6] repuser@ga-master LOG: could not receive data from client: Connection reset by peer
2017-09-15 23:22:44 UTC [22067-5] repuser@ga-master ERROR: got sequence entry 1 for toast chunk 1703536315 instead of seq 0
2017-09-15 23:22:44 UTC [22067-6] repuser@ga-master LOG: could not receive data from client: Connection reset by peer
2017-09-15 23:22:48 UTC [22070-5] repuser@ga-master ERROR: got sequence entry 1 for toast chunk 1703536315 instead of seq 0
2017-09-15 23:22:48 UTC [22070-6] repuser@ga-master LOG: could not receive data from client: Connection reset by peer
2017-09-15 23:22:49 UTC [22069-5] repuser@ga-master ERROR: got sequence entry 1 for toast chunk 1703536315 instead of seq 0
2017-09-15 23:22:49 UTC [22069-6] repuser@ga-master LOG: could not receive data from client: Connection reset by peer
Config on master/provider:
archive_mode = on
archive_command = 'cp %p /data/pgdata/wal_archives/%f'
max_wal_senders = 20
wal_level = logical
max_worker_processes = 100
max_replication_slots = 100
shared_preload_libraries = pglogical
max_wal_size = 20GB
Config on the replica/subscriber:
max_replication_slots = 100
shared_preload_libraries = pglogical
max_worker_processes = 100
max_wal_size = 20GB
I have a total of 18 subscriptions for 18 schemas. It seemed to work fine at first, but it quickly deteriorated: some subscriptions started to bounce between the down and replicating statuses, with the error posted above.
Question
What could be the possible causes? Do I need to change my Pg configurations?
Also, I noticed that when replication is going on, the CPU usage on the master/provider is pretty high.
/# ps aux | sort -nrk 3,3 | head -n 5
postgres 18180 86.4 1.0 415168 162460 ? Rs 22:32 19:03 postgres: getaround getaround 10.240.0.7(64106) CREATE INDEX
postgres 20349 37.0 0.2 339428 38452 ? Rs 22:53 0:07 postgres: wal sender process repuser 10.240.0.7(49742) idle
postgres 20351 33.8 0.2 339296 36628 ? Rs 22:53 0:06 postgres: wal sender process repuser 10.240.0.7(49746) idle
postgres 20350 28.8 0.2 339016 44024 ? Rs 22:53 0:05 postgres: wal sender process repuser 10.240.0.7(49744) idle
postgres 20352 27.6 0.2 339420 36632 ? Rs 22:53 0:04 postgres: wal sender process repuser 10.240.0.7(49750) idle
Thanks in advance!
I had a similar problem, which was fixed by setting wal_sender_timeout on the master/provider to 5 minutes (the default is 1 minute). The provider drops a replication connection once this timeout expires, and raising it seems to have fixed the problem for me.
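As a sketch, the change described in this answer would look like the following in postgresql.conf on the master/provider. The 5min value is simply the one reported here, not a general recommendation, and the comments are assumptions about a stock configuration.

```
# postgresql.conf on the master/provider
wal_sender_timeout = 5min    # default is 60s (1 minute)
```

This setting is read when the configuration is reloaded, so a reload (for example SELECT pg_reload_conf();) should be enough and a full restart should not be needed.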

How to start a PostgreSQL 9.4 server installed by Homebrew

I installed PostgreSQL 9.4.5 with Homebrew.
Before 9.4.5 I used 9.3.9 from MacPorts, but I want to use only Homebrew.
So I uninstalled 9.3.9 and MacPorts and installed 9.4.5 with Homebrew.
Running "initdb /usr/local/var/postgres" succeeded.
But when I run "postgres -D /usr/local/var/postgres", it shows an error.
The error
LOG: could not bind IPv6 socket: Address already in use
HINT: Is another postmaster already running on port 5432? If not, wait a few seconds and retry.
LOG: could not bind IPv4 socket: Address already in use
HINT: Is another postmaster already running on port 5432? If not, wait a few seconds and retry.
WARNING: could not create listen socket for "localhost"
FATAL: could not create any TCP/IP sockets
I tried to start the server manually:
pg_ctl -D /usr/local/var/postgres -l /usr/local/var/postgres/server.log start
But pg_ctl -D /usr/local/var/postgres status shows "pg_ctl: no server running".
I checked pg_hba.conf and it shows:
# IPv4 local connections:
host all all 127.0.0.1/32 trust
Also, when I run psql, it shows the error psql: FATAL: could not open relation mapping file "global/pg_filenode.map": No such file or directory
What do I need to do? Please tell me.
Postscript
I also tried ps aux | grep postgres:
postgres 276 0.0 0.0 2470236 452 ?? Ss Tue12AM 0:03.18 postgres: stats collector process
postgres 275 0.0 0.0 2614684 1160 ?? Ss Tue12AM 0:11.63 postgres: autovacuum launcher process
postgres 274 0.0 0.0 2614552 416 ?? Ss Tue12AM 0:00.79 postgres: wal writer process
postgres 273 0.0 0.0 2614552 440 ?? Ss Tue12AM 0:00.81 postgres: writer process
postgres 272 0.0 0.0 2614552 596 ?? Ss Tue12AM 0:00.07 postgres: checkpointer process
postgres 242 0.0 0.0 2614552 1184 ?? S Tue12AM 0:01.07 /opt/local/lib/postgresql93/bin/postgres -D /opt/local/var/db/postgresql93/defaultdb
root 75 0.0 0.0 2469228 884 ?? Ss Tue12AM 0:00.03 /opt/local/bin/daemondo --label=postgresql93-server --start-cmd /opt/local/etc/LaunchDaemons/org.macports.postgresql93-server/postgresql93-server.wrapper start ; --stop-cmd /opt/local/etc/LaunchDaemons/org.macports.postgresql93-server/postgresql93-server.wrapper stop ; --restart-cmd /opt/local/etc/LaunchDaemons/org.macports.postgresql93-server/postgresql93-server.wrapper restart ; --pid=none
This shows that some processes are still active, and two of the lines show MacPorts paths, even though I had deleted the /opt/local folder.
I killed these processes, after which ps aux | grep postgres showed nothing.
Then I ran postgres -D /usr/local/var/postgres again. This time it showed:
LOG: database system was shut down at 2015-11-19 18:45:38 JST
LOG: MultiXact member wraparound protections are now enabled
LOG: database system is ready to accept connections
LOG: autovacuum launcher started
After that the terminal did not return to a prompt, so I had to press Control+C.
However, I was able to start postgres manually with pg_ctl -D /usr/local/var/postgres -l /usr/local/var/postgres/server.log start
After that, psql commands worked.

Fatal error starting postgres

I'm unfamiliar with how to use postgres and need some help. I'm currently running OSX Yosemite.
When I start postgres I get this:
pg_ctl: could not start server
Examine the log output.
There was an error executing [start] on postgres. Check /Users/work/git/proj/var/log/postgres.log for details.
createuser: could not connect to database postgres: FATAL: could not open relation mapping file "global/pg_filenode.map": No such file or directory
The log is below.
When I try to stop postgres I get this:
Postgres not running
And when I run ps -ef |grep postgres I get this:
20010 13398 1 0 Jul07 ? 00:00:00 /usr/pgsql-9.3/bin/postgres -h -k /Users/work/git/proj/var/pg
20010 13399 13398 0 Jul07 ? 00:00:09 postgres: logger process
20010 13401 13398 0 Jul07 ? 00:00:10 postgres: checkpointer process
20010 13402 13398 0 Jul07 ? 00:00:00 postgres: writer process
20010 13403 13398 0 Jul07 ? 00:00:00 postgres: wal writer process
20010 13404 13398 0 Jul07 ? 00:00:36 postgres: autovacuum launcher process
20010 13405 13398 0 Jul07 ? 00:00:02 postgres: stats collector process
20010 18112 17723 0 10:22 pts/0 00:00:00 grep postgres
What does this all mean and how could I possibly fix this?
Log text:
Postgres data dir doesn't exist. Creating
The files belonging to this database system will be owned by user "rose.smith".
This user must also own the server process.
The database cluster will be initialized with locale "C".
The default database encoding has accordingly been set to "SQL_ASCII".
The default text search configuration will be set to "english".
Data page checksums are disabled.
creating directory /Users/work/git/proj/postgres ... ok
creating subdirectories ... ok
selecting default max_connections ... 100
selecting default shared_buffers ... 128MB
creating configuration files ... ok
creating template1 database in /Users/work/git/proj/postgres/base/1 ... ok
initializing pg_authid ... ok
initializing dependencies ... ok
creating system views ... ok
loading system objects' descriptions ... ok
creating collations ... ok
creating conversions ... ok
creating dictionaries ... ok
setting privileges on built-in objects ... ok
creating information schema ... ok
loading PL/pgSQL server-side language ... ok
vacuuming database template1 ... ok
copying template1 to template0 ... ok
copying template1 to postgres ... ok
syncing data to disk ... ok
Success. You can now start the database server using:
/usr/pgsql-9.3/bin/postgres -D /Users/work/git/proj/postgres
or
/usr/pgsql-9.3/bin/pg_ctl -D /Users/work/git/proj/postgres -l logfile start
waiting for server to start....< 2015-06-04 17:24:57.966 GMT >LOG: redirecting log output to logging collector process
< 2015-06-04 17:24:57.966 GMT >HINT: Future log output will appear in directory "pg_log".
done
server started
waiting for server to shut down.... done
server stopped
waiting for server to start....< 2015-06-04 18:10:18.044 GMT >LOG: redirecting log output to logging collector process
< 2015-06-04 18:10:18.044 GMT >HINT: Future log output will appear in directory "pg_log".
done
server started
"/Users/work/git/proj/var/log/postgres.log" 413L, 20935C
After running /usr/pgsql-9.3/bin/postgres -D /Users/work/git/proj/postgres:
< 2015-07-08 14:40:36.331 GMT >FATAL: lock file "postmaster.pid" already exists
< 2015-07-08 14:40:36.331 GMT >HINT: Is another postmaster (PID 18145) running in data directory "/Users/work/git/proj/postgres"?
I can't speak to why this worked after trying these commands just a few minutes ago, but it is now working. Good luck to anyone else with the same problem.
stop postgres
killall postgres
remove postgres database with rm -rf postgres
start postgres
This website was helpful. I think my problem may have been the same as his.
I had deleted ~/Library/Containers/com.heroku.postgres or ~/Application Support/Postgres/ while the Postgres.app was still running. The old version was still running since I deleted the pid file, and it didn't know how to shut it down.
Source: https://github.com/PostgresApp/PostgresApp/issues/96
I faced the same issue and solved it with the following commands.
If you installed PostgreSQL using Homebrew:
rm /usr/local/var/postgres/postmaster.pid
pg_ctl -D /usr/local/var/postgres -l /usr/local/var/postgres/server.log start
Hope this helps you!
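Removing postmaster.pid is only safe when no server is actually using that data directory. A minimal sketch of a check to run first is shown here; pid_file_is_stale is a hypothetical helper name, not a PostgreSQL tool, and it assumes the first line of postmaster.pid holds the postmaster's PID. It cannot detect the case where an unrelated process has reused that PID, so treat its answer as a hint rather than proof.

```shell
#!/bin/bash
# Sketch: decide whether postmaster.pid is stale before deleting it.
# pid_file_is_stale is a hypothetical helper, not part of PostgreSQL.

pid_file_is_stale() {
  local pidfile="$1/postmaster.pid"
  local pid=""

  # No pid file at all: there is nothing to clean up.
  [ -f "$pidfile" ] || return 1

  # The first line of postmaster.pid holds the postmaster's PID.
  read -r pid < "$pidfile" || true
  case "$pid" in
    ''|*[!0-9]*) return 1 ;;   # unexpected contents: do not guess
  esac

  # A live process with that PID means the file is NOT stale.
  if kill -0 "$pid" 2>/dev/null || ps -p "$pid" >/dev/null 2>&1; then
    return 1
  fi
  return 0
}

# Example use (data directory taken from the answer above):
# if pid_file_is_stale /usr/local/var/postgres; then
#   rm /usr/local/var/postgres/postmaster.pid
# fi
```

If the helper reports that the file is not stale, a server is probably still running from that data directory, and stopping it cleanly is a better next step than deleting the file.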

PostgreSQL 9.3 replication not starting after pg_basebackup completes

I am trying to create a hot_standby server, and I receive the following error after pg_basebackup completes. Notice I use a shell script, replicator.sh, to start the replication. Can anyone give me some insight?
My specs:
Debian Wheezy 7.6
Postgresql 9.3
Database size: ~115GB
Error:
postgres@database-master:/etc/postgresql/9.3/main$ sh replicator.sh
Stopping PostgreSQL
[ ok ] Stopping PostgreSQL 9.3 database server: main.
Cleaning up old cluster directory
Starting base backup as replicator
Password:
113720266/113720266 kB (100%), 1/1 tablespace
NOTICE: WAL archiving is not enabled; you must ensure that all required WAL segments are copied through other means to complete the backup
pg_basebackup: base backup completed
Starting Postgresql
[....] Starting PostgreSQL 9.3 database server: main[....] The PostgreSQL server failed to start.
Please check the log output:
2014-09-11 17:56:33 UTC LOG: database system was interrupted; last known up at 2014-09-11 16:54:29 UTC
2014-09-11 17:56:33 UTC LOG: creating missing WAL directory "pg_xlog/archive_status"
2014-09-11 17:56:33 UTC LOG: incomplete startup packet
2014-09-11 17:56:33 UTC LOG: invalid checkpoint record
2014-09-11 17:56:33 UTC FATAL: could not locate required checkpoint record
2014-09-11 17:56:33 UTC HINT: If you are not restoring from a backup, try removing the file "/var/lib/postgresql/9.3/main/backup_label".
2014-09-11 17:56:33 UTC LOG: startup process (PID 21972) exited with exit code 1
2014-09-11 17:56:33 UTC LOG: aborting startup due to startup process failure
... failed! failed!
Contents of replicator.sh:
#!/bin/bash
echo Stopping PostgreSQL
/etc/init.d/postgresql stop
echo Cleaning up old cluster directory
rm -rf /var/lib/postgresql/9.3/main
echo Starting base backup as replicator
pg_basebackup -h 123.456.789.123 -D /var/lib/postgresql/9.3/main -U replicator -v -P
echo Writing recovery.conf file
sudo -u postgres bash -c "cat > /var/lib/postgresql/9.3/main/recovery.conf <<- _EOF1_
standby_mode = 'on'
primary_conninfo = 'host=123.456.789.123 port=5432 user=replicator password=XXXXX sslmode=require'
trigger_file = '/tmp/postgresql.trigger'
_EOF1_
"
echo Starting Postgresql
/etc/init.d/postgresql start
Thank you,
Jake
My best guess from the above is that the pg_basebackup failed and your shell script doesn't check for error return codes or use set -e to automatically abort after errors, so it just carried on regardless.
It's also possible that you don't have WAL archiving configured, or don't have a restore_command set in the replica. In that case, the transaction logs required to start the base backup will not be available and startup will fail.
I strongly recommend that you:
Use pg_basebackup -X stream so that the required transaction logs get copied along with the backup; and
Use set -e in your shell script, or test for errors with a suitable if ! pg_basebackup .... ; then block.
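A minimal sketch of those two recommendations applied to the backup step of replicator.sh. take_base_backup is a hypothetical function name introduced for illustration; the host, data directory and user are passed as arguments, and the -v -P flags are carried over from the script in the question.

```shell
#!/bin/bash
# Abort the script as soon as any command fails.
set -e

# Hypothetical wrapper around the backup step of replicator.sh.
take_base_backup() {
  local primary_host="$1" data_dir="$2" repl_user="$3"

  # -X stream copies the WAL needed to make the backup consistent
  # over a second replication connection while the backup runs.
  if ! pg_basebackup -h "$primary_host" -D "$data_dir" -U "$repl_user" \
       -X stream -v -P; then
    echo "pg_basebackup failed; not writing recovery.conf or starting PostgreSQL" >&2
    return 1
  fi
}

# Example call (values from the question's script):
# take_base_backup 123.456.789.123 /var/lib/postgresql/9.3/main replicator
```

The explicit check is kept alongside set -e because set -e is suppressed inside a function that is itself called from an if condition, and because it lets the script print a clear message before stopping.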

chef-server-ctl reconfigure fails after customizing the PostgreSQL port

I'm using Open Source Chef 11.0.10 on Ubuntu 12.04. This is a shared server where PostgreSQL and Apache are already running, so I'm trying to customize the Chef port numbers.
I've created the file /etc/chef-server/chef-server.rb, which contains the lines:
nginx['ssl_port'] = 8443
postgresql['port'] = 5433
When I execute the command:
sudo chef-server-ctl reconfigure
it fails on the line:
execute[/opt/chef-server/embedded/bin/createdb -T template0 -E UTF-8 opscode_chef] action run
and the error message says:
---- Begin output of /opt/chef-server/embedded/bin/createdb -T template0 -E UTF-8 opscode_chef ----
STDOUT:
STDERR: createdb: could not connect to database template1: could not connect to server: No such file or directory
Is the server running locally and accepting
connections on Unix domain socket "/tmp/.s.PGSQL.5432"?
---- End output of /opt/chef-server/embedded/bin/createdb -T template0 -E UTF-8 opscode_chef ----
Now, the Chef instance of PostgreSQL does appear to be running, in addition to the original instance:
$ ps -ef | grep postgresql | grep -v grep
postgres 1000 1 0 09:14 ? 00:00:00 /usr/lib/postgresql/9.1/bin/postgres -D /var/lib/postgresql/9.1/main -c config_file=/etc/postgresql/9.1/main/postgresql.conf
root 4830 4421 0 09:46 ? 00:00:00 runsv postgresql
root 4831 4830 0 09:46 ? 00:00:00 svlogd -tt /var/log/chef-server/postgresql
998 5579 4830 0 09:49 ? 00:00:00 /opt/chef-server/embedded/bin/postgres -D /var/opt/chef-server/postgresql/data
What did I miss?
More details:
I had used the omnibus package to do the initial chef-server install:
https://opscode-omnibus-packages.s3.amazonaws.com/ubuntu/12.04/x86_64/chef-server_11.0.10-1.ubuntu.12.04_amd64.deb
It failed before completion at the same step with the same error because it's trying to use the default PostgreSQL port, which is already in use.
And the Chef PostgreSQL instance is running:
$ sudo /opt/chef-server/embedded/bin/sv status postgresql
run: postgresql: (pid 5579) 86034s; run: log: (pid 4831) 86158s
I gave up trying to get Chef's PostgreSQL instance configured to use a different port.
Instead, I modified our existing PostgreSQL installation's port number to be 5433 and let Chef's instance use 5432. Now the "chef-server-ctl reconfigure" command completes successfully.
Check the PostgreSQL log file:
tail -f /var/log/chef-server/postgresql/current
2015-02-28_13:29:01.48646 FATAL: could not create lock file "/tmp/.s.PGSQL.5432.lock": Permission denied
2015-02-28_13:29:02.57961 FATAL: could not create lock file "/tmp/.s.PGSQL.5432.lock": Permission denied
2015-02-28_13:29:02.57961 FATAL: could not create lock file "/tmp/.s.PGSQL.5432.lock": Permission denied
My problem was solved by running the following command:
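chmod 777 /tmp
(Note that /tmp is normally mode 1777, world-writable with the sticky bit set, so chmod 1777 /tmp restores the usual permissions.)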