Can't start postgresql replication - postgresql

We have PostgreSQL replication running on a separate server. Today I was tuning postgresql.conf on the replica cluster.
After making the changes, I restarted PostgreSQL with this command:
pg_ctlcluster 9.2 main2 restart
But instead of restarting, it gave this error:
The PostgreSQL server failed to start. Please check the log output.
And checking the log, I see this:
2015-06-16 12:18:16 EEST [10655]: [2-1] LOG: received smart shutdown request
2015-06-16 12:18:16 EEST [10661]: [2-1] FATAL: terminating walreceiver process due to administrator command
2015-06-16 12:18:16 EEST [10658]: [1-1] LOG: shutting down
2015-06-16 12:18:16 EEST [10658]: [2-1] LOG: database system is shut down
When I check the log now, it still shows only the entries from that last shutdown; nothing new is written when I try to start the server. The error message says to check the log, but there is no new information there, even if I try to start the server again.
P.S. Do I need to do anything on master?
Update
After changing the postgresql.conf settings back, replication started again. But from the error it is hard to tell what was wrong.
Here are the settings I changed (after the change they were the same as on the master; only when I commented them out could I start replication):
shared_buffers = 1536MB
effective_cache_size = 3072MB
checkpoint_segments = 15
checkpoint_completion_target = 0.9
autovacuum = on
track_counts = on
work_mem = 25MB
So, as I said, after commenting these out I could start it. But I don't understand why it won't start with these settings.

If I were you, and if upgrading is an option, the first thing I would do is upgrade to PostgreSQL 9.4 (or newer). There's a good reason to do this when it comes to replication: a new feature called "replication slots" (see the announcement).
In short: replication slots are more robust and easier to set up than WAL archiving (which, judging by your logs, you appear to be using).
In this post you'll find a comprehensive guide on implementing the feature.
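As a rough sketch of what using a slot involves on 9.4 or newer: the slot is created on the master with pg_create_physical_replication_slot(), and the standby refers to it through primary_slot_name in recovery.conf. The slot name standby_slot, the helper name create_standby_slot and the postgres login used here are placeholders for illustration, not details from the question.

```shell
#!/bin/sh
# Sketch only: assumes PostgreSQL 9.4+ on the master and a superuser psql login.
# "standby_slot" is a made-up slot name; pick your own.

create_standby_slot() {
    slot_name="$1"
    # Creates a physical replication slot; the master then retains WAL
    # until the standby attached to this slot has received it.
    psql -U postgres -d postgres -c \
        "SELECT * FROM pg_create_physical_replication_slot('${slot_name}');"
}

# Example call (run on the master):
#   create_standby_slot standby_slot
# Then, on the standby, add this line to recovery.conf and restart it:
#   primary_slot_name = 'standby_slot'
```

Keep in mind that a slot nobody consumes makes the master retain WAL indefinitely, so drop slots you no longer need with pg_drop_replication_slot().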

Related

Postgresql fatal the database system is starting up - windows 10

I have installed PostgreSQL on Windows 10, on a USB disk.
Every day at work, when I wake my PC from sleep, plug the disk in again and try to start PostgreSQL, I get this error:
FATAL: the database system is starting up
The service starts with following command:
E:\PostgresSql\pg96\pgservice.exe "//RS//PostgreSQL 9.6 Server"
It is the default one.
logs from E:\PostgresSql\data\logs\pg96
2019-02-28 10:30:36 CET [21788]: [1-1] user=postgres,db=postgres,app=[unknown],client=::1 FATAL: the database system is starting up
2019-02-28 10:31:08 CET [9796]: [1-1] user=postgres,db=postgres,app=[unknown],client=::1 FATAL: the database system is starting up
I want this start up to happen faster.
When you commit data to a Postgres database, the only thing which is immediately saved to disk is the write-ahead log. The actual table changes are only applied to the in-memory buffers, and won't be permanently saved to disk until the next checkpoint.
If the server is stopped abruptly, or if it suddenly loses access to the file system, then everything in memory is lost, and the next time you start it up, it needs to resort to replaying the log in order to get the tables back to the correct state (which can take quite a while, depending on how much has happened since the last checkpoint). And until it's finished, any attempt to use the server will result in FATAL: the database system is starting up.
If you make sure you shut the server down cleanly before unplugging the disk - giving it a chance to set a checkpoint and flush all of its buffers - then it should be able to start up again more or less immediately.
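A clean stop before unplugging could look roughly like this. The data directory path is a placeholder (the question only shows the service and log paths), and stop_cleanly is just an illustrative helper name; the same pg_ctl arguments can also be typed directly at a Windows command prompt.

```shell
#!/bin/sh
# Sketch only: the data directory passed in is a placeholder.

stop_cleanly() {
    data_dir="$1"
    # "fast" mode disconnects clients, then performs a shutdown checkpoint;
    # -w waits until the server has actually finished shutting down.
    pg_ctl -D "$data_dir" stop -m fast -w
}

# Example call, before unplugging the disk:
#   stop_cleanly /path/to/data
```

Avoid -m immediate here: that mode skips the shutdown checkpoint, so the next start would have to replay the log again.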

How is PostgreSQL hot standby WAL file restoring triggered?

Primary server
# postgresql.conf
wal_level = hot_standby
archive_mode = on
archive_timeout = 10
archive_command = 'test ! -f /archive/%f && cp %p /archive/%f'
Standby server
hot_standby = on
I copied /archive/* from the primary server to $PGDATA/pg_xlog on the standby, and nothing happened. When I restarted the standby server, I got these error messages in the server log:
2016-11-21 17:56:09 CST [17762-3] LOG: invalid primary checkpoint record
2016-11-21 17:56:09 CST [17762-4] LOG: record with zero length at 0/6000100
2016-11-21 17:56:09 CST [17762-5] LOG: invalid secondary checkpoint record
2016-11-21 17:56:09 CST [17762-6] PANIC: could not locate a valid checkpoint record
2016-11-21 17:56:09 CST [17761-1] LOG: startup process (PID 17762) was terminated by signal 6: Aborted
2016-11-21 17:56:09 CST [17761-2] LOG: aborting startup due to startup process failure
Questions:
Is it enough to sync data to standby server by simply copying /archive/* in primary server to $PGDATA/pg_xlog in the standby?
How and when is the restoring of WAL files triggered in a hot standby server? Does the standby server periodically check its $PGDATA/pg_xlog directory for new WAL files? Or do I have to trigger it manually?
I am talking about hot standby, not streaming replication; so I assume I don't have to configure conninfo. Am I right?
After configuring hot_standby = on and restarting the server, I can still do an INSERT without error. How do I configure it to be truly read-only?
That looks a lot like you didn't initialize the standby database correctly.
The log file states that PostgreSQL won't even begin to replicate, because it cannot find a valid checkpoint to start with.
What does the backup_label file in your standby's data directory contain? If that file doesn't exist, that's probably the problem.
Did that standby suddenly stop working or has it never worked?
How exactly did you create the standby?
You must first create the standby from a low level base backup of the master. You cannot create a new instance and use pg_dump and pg_restore. I'm guessing that's what you tried to do.
The simplest way to do a suitable base backup is to use pg_basebackup. Other options are discussed in the manual, but really, just use:
pg_basebackup -X stream -D standby_datadir_location -h master_ip
or similar.
Only once you have a valid base backup may you start archive recovery or streaming replication. The simplest way is to enable streaming replication. Let pg_basebackup do that for you by passing the -R flag.
If you want archive recovery, you should add a restore_command to the standby's recovery.conf that copies the archives from the archive location to the standby.
It's all covered in the manual.
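For the archive-recovery variant, a minimal recovery.conf could be written along these lines. This is a sketch for the pre-10 releases discussed here; /archive is the directory from the question's archive_command, while the helper name write_recovery_conf and the data directory path are placeholders. It assumes the standby can read /archive (for example over a shared mount).

```shell
#!/bin/sh
# Sketch only: writes a minimal recovery.conf for a pre-10 standby.
# /archive is the archive directory from the question's archive_command.

write_recovery_conf() {
    standby_datadir="$1"
    cat > "$standby_datadir/recovery.conf" <<'EOF'
standby_mode = 'on'
restore_command = 'cp /archive/%f %p'
EOF
}

# Example call, after restoring a base backup into the directory and
# before starting the standby:
#   write_recovery_conf /path/to/standby_datadir
```

With standby_mode on, the server keeps calling restore_command for the next segment by itself, so nothing has to be copied into pg_xlog by hand, and while it stays in recovery it accepts only read-only queries.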

incorrect resource manager data checksum in record at 2/XYZ + terminating walreceiver process due to administrator command

I am running a streaming replication environment with PostgreSQL 9.1 (1 master, 3 slaves). Everything worked fine for approx. 2 months. Yesterday, replication to one of the slaves failed, with the log on the slave showing:
LOG: incorrect resource manager data checksum in record at 61/DA2710A7
FATAL: terminating walreceiver process due to administrator command
LOG: incorrect resource manager data checksum in record at 61/DA2710A7
LOG: incorrect resource manager data checksum in record at 61/DA2710A7
LOG: incorrect resource manager data checksum in record at 61/DA2710A7
LOG: incorrect resource manager data checksum in record at 61/DA2710A7
LOG: incorrect resource manager data checksum in record at 61/DA2710A7
LOG: incorrect resource manager data checksum in record at 61/DA2710A7
LOG: incorrect resource manager data checksum in record at 61/DA2710A7
The slave was no longer in sync with the master. Two hours later, during which the log gained a new line like the ones above every 5 seconds, I restarted the slave database server:
LOG: incorrect resource manager data checksum in record at 61/DA2710A7
LOG: received fast shutdown request
LOG: aborting any active transactions
LOG: incorrect resource manager data checksum in record at 61/DA2710A7
FATAL: terminating connection due to administrator command
FATAL: terminating connection due to administrator command
LOG: shutting down
LOG: database system is shut down
The new log file on the slave contains:
LOG: database system was shut down in recovery at 2016-02-29 05:12:11 CET
LOG: entering standby mode
LOG: redo starts at 61/D92C10C9
LOG: consistent recovery state reached at 61/DA2710A7
LOG: database system is ready to accept read only connections
LOG: incorrect resource manager data checksum in record at 61/DA2710A7
LOG: streaming replication successfully connected to primary
Now the slave is in sync with the master, but the checksum entry is still there. I also checked the network logs: the network was available the whole time.
My questions are:
Does anyone know why the walreceiver was terminated?
Why didn't PostgreSQL retry the replication?
What can I do to prevent this in the future?
Thank you.
EDIT:
The database servers are running on SLES 11 with ext3. I found an article about low performance of SLES 11 with large RAM but I am not sure if it applies since my machine has only 8 GB RAM (https://www.novell.com/support/kb/doc.php?id=7010287)
Any help would be appreciated.
EDIT (2):
The PostgreSQL version is 9.1.5. It seems that PostgreSQL 9.1.6 provides a fix for a similar issue:
Fix persistence marking of shared buffers during WAL replay (Jeff Davis)
This mistake can result in buffers not being written out during checkpoints, resulting in data corruption if the server later crashes without ever having written those buffers. Corruption can occur on any server following crash recovery, but it is significantly more likely to occur on standby slave servers since those perform much more WAL replay.
Source: http://www.postgresql.org/docs/9.1/static/release-9-1-6.html
Might this be the fix? If I upgrade to PostgreSQL 9.1.6, will everything run smoothly?
In case someone stumbles across this question, I ended up reinstalling the databases from backed-up data and set up replication again. Never really figured out what went wrong.
"Never really figured out what went wrong."
I'm experiencing the same error, except that in my case the slave never fully synced from the very beginning.
Then the primary server hit some kernel errors (a heat problem in the server case?). It had to be powered off because it could not complete its shutdown. While the primary was still shutting down, the slave already logged:
LOG: incorrect resource manager data checksum in record at 1/63663CB0
After restarting both the primary and the slave, the situation did not change: the same log entry appears every 5 seconds.

Autovacuum is not running on Openshift Online Postgres cartridge

I have Postgres 9.2 on my Openshift Online cartridge. Using pgAdmin3, I have enabled (by ticking the box) the autovacuum setting for postgresql.conf. However, autovacuum does not seem to be running.
Here is what I have:
ps -ef | grep -i vacuum
No autovacuum process is shown.
In the psql console, SHOW autovacuum; says that its value is on.
In the psql console, SELECT schemaname, relname, last_vacuum, last_autovacuum FROM pg_stat_user_tables; gives no value in the last_vacuum and last_autovacuum columns, even though I did a manual vacuum via the Maintenance function in pgAdmin3.
The properties tab for the database in pgAdmin3 shows an AUTOVACUUM value of 'not running'.
What am I missing?
EDIT
I also cannot access the postgresql.conf on Openshift Online when trying to find the file on the server - hoping to manually edit the file instead of using pgAdminIII.
-- Found this https://www.openshift.com/forums/openshift/how-do-i-set-maxpreparedtransactions-on-my-postgresql-cartridge I am now able to view/edit my postgresql.conf. Apparently the autovacuum is on already so the conf has the right setting.
When I issue pg_ctl restart -m fast, I get:
LOG: could not bind socket for statistics collector: Permission denied
LOG: trying another address for the statistics collector
LOG: could not bind socket for statistics collector: Permission denied
LOG: trying another address for the statistics collector
LOG: could not bind socket for statistics collector: Cannot assign requested address
LOG: trying another address for the statistics collector
LOG: could not bind socket for statistics collector: Cannot assign requested address
LOG: disabling statistics collector for lack of working socket
WARNING: autovacuum not started because of misconfiguration
HINT: Enable the "track_counts" option.
LOG: database system was shut down at 2014-04-22 09:58:19 GMT
LOG: database system is ready to accept connections
Though track_counts is already set to on in postgresql.conf
Sorry for being so stupid but any help or pointers are much appreciated.
Thank you in advance.
I ran into a similar issue and found a helpful hint in this discussion:
... for some insane reason, openshit disabled localhost, and autovacuum only connects to localhost, I suppose it makes sense that they wouldn't want to be trying to vacuum a remote db... but openshit breaks autovacuum.
One solution I've found (and that I'll probably use) is to manually add a cron job that does a forced vacuum. Here is a batch script that looks promising, but be careful with the side effects that a forced vacuum might involve (depending on your app, of course).
Patching postgres to use the OPENSHIFT_PG_HOST environment variable instead of localhost seems to solve the problem: pgstat.patch.
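As a rough sketch of the cron approach: the script below runs a plain VACUUM with ANALYZE through vacuumdb, which is gentler than a forced (FULL) vacuum. OPENSHIFT_PG_HOST is the variable named above; the localhost fallback, the helper name vacuum_all_databases, the script path and the schedule are assumptions for illustration.

```shell
#!/bin/sh
# Sketch only: a manual vacuum to run from cron while autovacuum is unavailable.
# OPENSHIFT_PG_HOST is the variable named in the answer above; the fallback
# of localhost is an assumption.

vacuum_all_databases() {
    db_host="${OPENSHIFT_PG_HOST:-localhost}"
    # --all processes every database; --analyze also refreshes planner statistics.
    vacuumdb --host "$db_host" --all --analyze
}

# In the script that cron runs, call the function as the last line:
#   vacuum_all_databases
# Hypothetical crontab entry running that script nightly at 03:00:
#   0 3 * * * /path/to/vacuum_all.sh
```

Because this is a plain vacuum, it does not take the exclusive table locks that VACUUM FULL does, so it can run while the application is in use.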

LOG: server process (PID 11748) was terminated by signal 11: Segmentation fault

I am using Postgres 8.3.7 on a Fedora Core 2 Linux box, and the Postgres service keeps crashing.
When I restart the system, it works fine for some time. Then, at some random time, it crashes again.
What could be the possible reasons for these random segfaults?
FATAL: the database system is in recovery mode
LOG: autovacuum launcher started
LOG: database system is ready to accept connections
LOG: server process (PID 11748) was terminated by signal 11: Segmentation fault
LOG: terminating any other active server processes
FATAL: the database system is in recovery mode
FATAL: the database system is in recovery mode
FATAL: the database system is in recovery mode
FATAL: the database system is in recovery mode
LOG: all server processes terminated; reinitializing
LOG: database system was interrupted; last known up at 2010-05-24 13:28:06 PDT
LOG: database system was not properly shut down; automatic recovery in progress
A little too specific, with few details, and perhaps more appropriate for serverfault.com or the PostgreSQL mailing lists.
Some random suggestions:
VACUUM ANALYZE VERBOSE ?
Can't you upgrade to the latest version?
Are there any special circumstances when this happens? Disk nearly full? High load? Anything suspicious in the OS logs (/var/log/messages)?
Can't you raise the log level of PostgreSQL to log the queries and see if this is related to some particular query (e.g. a function)?
PostgreSQL has a very responsive developer community.
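Raising the log level as suggested could be sketched like this. The three parameters exist in 8.3; the helper name enable_statement_logging and the file paths are placeholders. Logging every statement is verbose, so this is meant as a temporary diagnostic setting.

```shell
#!/bin/sh
# Sketch only: appends logging settings to a postgresql.conf.
# The file path passed in is a placeholder.

enable_statement_logging() {
    conf_file="$1"
    cat >> "$conf_file" <<'EOF'
log_statement = 'all'              # log every SQL statement
log_min_error_statement = error    # also log the statement that caused an error
log_line_prefix = '%t [%p] '       # timestamp and backend PID on each line
EOF
}

# Example call, followed by a reload so the settings take effect:
#   enable_statement_logging /path/to/postgresql.conf
#   pg_ctl -D /path/to/data reload
```

With the PID in the line prefix, the statements logged by a backend can be matched against the PID reported in the "terminated by signal 11" message.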