PostgreSQL FATAL: the database system is starting up - Windows 10

I have installed PostgreSQL on Windows 10 on a USB disk.
Every day when I wake my work PC from sleep and plug the disk in again, then try to start PostgreSQL, I get this error:
FATAL: the database system is starting up
The service starts with the following command:
E:\PostgresSql\pg96\pgservice.exe "//RS//PostgreSQL 9.6 Server"
It is the default one.
Logs from E:\PostgresSql\data\logs\pg96:
2019-02-28 10:30:36 CET [21788]: [1-1] user=postgres,db=postgres,app=[unknown],client=::1 FATAL: the database system is starting up
2019-02-28 10:31:08 CET [9796]: [1-1] user=postgres,db=postgres,app=[unknown],client=::1 FATAL: the database system is starting up
I want this startup to happen faster.

When you commit data to a Postgres database, the only thing which is immediately saved to disk is the write-ahead log. The actual table changes are only applied to the in-memory buffers, and won't be permanently saved to disk until the next checkpoint.
If the server is stopped abruptly, or if it suddenly loses access to the file system, then everything in memory is lost, and the next time you start it up, it needs to resort to replaying the log in order to get the tables back to the correct state (which can take quite a while, depending on how much has happened since the last checkpoint). And until it's finished, any attempt to use the server will result in FATAL: the database system is starting up.
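A checkpoint can also be inspected and triggered manually from psql; a minimal sketch (setting names as of 9.6):
-- Force an immediate checkpoint, flushing all dirty buffers to disk (superuser only):
CHECKPOINT;
-- See how automatic checkpoints are spaced:
SHOW checkpoint_timeout;            -- time between automatic checkpoints (default 5min)
SHOW checkpoint_completion_target;  -- how far writes are spread across that interval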
If you make sure you shut the server down cleanly before unplugging the disk - giving it a chance to set a checkpoint and flush all of its buffers - then it should be able to start up again more or less immediately.
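On Windows, for example, stopping the service cleanly before unplugging could look like this (the service name is taken from the startup command above; the bin and data paths are guesses based on the paths in the question):
REM Stop the service so PostgreSQL checkpoints and flushes its buffers:
net stop "PostgreSQL 9.6 Server"
REM Alternatively, stop it directly with pg_ctl (paths are assumptions):
E:\PostgresSql\pg96\bin\pg_ctl stop -m fast -D E:\PostgresSql\data\pg96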

Related

Postgres streaming replication - servers keep shutting down

I am new to PostgreSQL and I am trying to set up streaming replication from our server to a test DB on my laptop. I have been following this tutorial https://www.percona.com/blog/2018/09/07/setting-up-streaming-replication-postgresql/ along with the Postgres documentation here https://www.postgresql.org/docs/11/runtime-config-replication.html.
I'm running Windows 10, PostgreSQL 11, PostGIS 2.5 extension.
The server and my local machine both keep shutting down and the logs are filled with postmaster.pid errors such as:
LOG: performing immediate shutdown because data directory lock file is invalid
LOG: received immediate shutdown request
LOG: could not open file "postmaster.pid": No such file or directory
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
Could anyone point me towards the issue here? I know my servers aren't configured properly, but I just don't know what configurations need to be changed.
Here is an image of my standby server's most recent log: [standby log]
Here is an image of my master server's most recent log: [master log]
You must have messed up in many ways.
You removed or overwrote postmaster.pid on the master server.
That is very dangerous and causes the server to die with the error message you quote.
You didn't create recovery.conf before starting the standby server, or you removed backup_label. From the error messages I'd suspect the second, with ensuing data corruption.
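For reference, a minimal recovery.conf for a PostgreSQL 11 standby might look like this (host, user, and password are placeholders; the file goes in the standby's data directory and must exist before the standby is first started):
standby_mode = 'on'
primary_conninfo = 'host=192.168.0.10 port=5432 user=replicator password=secret'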

Lost all databases in PostgreSQL because of automatic recovery

What actually happened:
I am running Windows 7, and the system was shut down due to a power failure. When the system restarted, I opened PostgreSQL and all my databases were gone.
I checked the logs and there were some entries:
Database system was interrupted; last known up at (dateTime)
Database system was not properly shutdown; automatic recovery in progress
Record with zero length at 0/C9B92168
Redo is not required
Can anybody tell me how I can recover my databases, as I don't have a restore point?

incorrect resource manager data checksum in record at 2/XYZ + terminating walreceiver process due to administrator command

I am running a streaming replication environment with PostgreSQL 9.1 (1 master, 3 slaves). Everything worked fine for approx. 2 months. Yesterday, the replication to one of the slaves failed, with the slave's log showing:
LOG: incorrect resource manager data checksum in record at 61/DA2710A7
FATAL: terminating walreceiver process due to administrator command
LOG: incorrect resource manager data checksum in record at 61/DA2710A7
LOG: incorrect resource manager data checksum in record at 61/DA2710A7
LOG: incorrect resource manager data checksum in record at 61/DA2710A7
LOG: incorrect resource manager data checksum in record at 61/DA2710A7
LOG: incorrect resource manager data checksum in record at 61/DA2710A7
LOG: incorrect resource manager data checksum in record at 61/DA2710A7
LOG: incorrect resource manager data checksum in record at 61/DA2710A7
The slave was no longer in sync with the master. Two hours later, during which the log got a new line like the above every 5 seconds, I restarted the slave database server:
LOG: incorrect resource manager data checksum in record at 61/DA2710A7
LOG: received fast shutdown request
LOG: aborting any active transactions
LOG: incorrect resource manager data checksum in record at 61/DA2710A7
FATAL: terminating connection due to administrator command
FATAL: terminating connection due to administrator command
LOG: shutting down
LOG: database system is shut down
The new log file on the slave contains:
LOG: database system was shut down in recovery at 2016-02-29 05:12:11 CET
LOG: entering standby mode
LOG: redo starts at 61/D92C10C9
LOG: consistent recovery state reached at 61/DA2710A7
LOG: database system is ready to accept read only connections
LOG: incorrect resource manager data checksum in record at 61/DA2710A7
LOG: streaming replication successfully connected to primary
Now the slave is in sync with the master, but the checksum entry is still there. One more thing I checked was the network logs: the network was available.
My questions are:
Does anyone know why the walreceiver was terminated?
Why didn't PostgreSQL retry the replication?
What can I do to prevent this in the future?
Thank you.
EDIT:
The database servers are running on SLES 11 with ext3. I found an article about low performance of SLES 11 with large RAM but I am not sure if it applies since my machine has only 8 GB RAM (https://www.novell.com/support/kb/doc.php?id=7010287)
Any help would be appreciated.
EDIT (2):
PostgreSQL version is 9.1.5. It seems that PostgreSQL version 9.1.6 provides a fix for a similar issue:
Fix persistence marking of shared buffers during WAL replay (Jeff Davis)
This mistake can result in buffers not being written out during checkpoints, resulting in data corruption if the server later crashes without ever having written those buffers. Corruption can occur on any server following crash recovery, but it is significantly more likely to occur on standby slave servers since those perform much more WAL replay.
Source: http://www.postgresql.org/docs/9.1/static/release-9-1-6.html
Might this be the fix? Should I upgrade to PostgreSQL 9.1.6, and would everything then run smoothly?
In case someone stumbles across this question, I ended up reinstalling the databases from backed-up data and set up replication again. Never really figured out what went wrong.
"Never really figured out what went wrong."
I'm experiencing the same error - just that it never syncs in full from the very beginning.
Then the primary server got some kernel errors (a heat problem in the server case?). The server had to be switched off forcibly because the shutdown would not complete. Already while shutting down, the slave showed:
LOG: incorrect resource manager data checksum in record at 1/63663CB0
After restart of the primary server and restart of the slave server, the situation doesn't change: same log entries every 5 seconds.

PostgreSQL large table update slows down

I run an update on a large table (e.g. 8 GB). It is a simple update of 3 fields in the table. I had no problems running it under PostgreSQL 9.1; it would take 40-60 minutes, but it worked. I run the same query on a 9.4 database (freshly created, not upgraded) and it starts the update fine but then slows down. It uses only ~2% CPU, the level of IO is 4-5 MB/s, and it just sits there. No locks, no other queries or connections, just this single update SQL on the server.
The SQL is below. The "lookup" table has 12 records. The lookup can return only one row; it breaks a discrete scale (SMALLINT, -32768 .. +32767) into non-overlapping regions. The "src" and "dest" tables are ~60 million records each.
UPDATE dest SET
    field1 = src.field1,
    field2 = src.field2,
    field3_id = (SELECT lookup.id FROM lookup
                 WHERE src.value BETWEEN lookup.min AND lookup.max)
FROM src
WHERE dest.id = src.id;
I thought my disk had slowed down, but I can copy 1 GB files in parallel with the query execution and it runs fast at >40 MB/s, and I have only one disk (it is a VM with iSCSI media). All other disk operations are not impacted; there is plenty of IO bandwidth. At the same time PostgreSQL is just sitting there doing very little, running very slowly.
I have 2 virtualized Linux servers, one running PostgreSQL 9.1 and another running 9.4. Both servers have close to identical PostgreSQL configurations.
Has anyone else had similar experience? I am running out of ideas. Help.
Edit
The query "ran" for 20 hours I had to kill the connections and restart the server. Surprisingly it didn't kill the connection via query:
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE pid <> pg_backend_pid() AND datname = current_database();
and the server produced the following log:
2015-05-21 12:41:53.412 EDT FATAL: terminating connection due to administrator command
2015-05-21 12:41:53.438 EDT FATAL: terminating connection due to administrator command
2015-05-21 12:41:53.438 EDT STATEMENT: UPDATE <... this is 60,000,000 record table update statement>
Also server restart took long time, producing the following log:
2015-05-21 12:43:36.730 EDT LOG: received fast shutdown request
2015-05-21 12:43:36.730 EDT LOG: aborting any active transactions
2015-05-21 12:43:36.730 EDT FATAL: terminating connection due to administrator command
2015-05-21 12:43:36.734 EDT FATAL: terminating connection due to administrator command
2015-05-21 12:43:36.747 EDT LOG: autovacuum launcher shutting down
2015-05-21 12:44:36.801 EDT LOG: received immediate shutdown request
2015-05-21 12:44:36.815 EDT WARNING: terminating connection because of crash of another server process
2015-05-21 12:44:36.815 EDT DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
"The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory" - is this an indication of a bug in PostgreSQL?
Edit
I tested 9.1, 9.3 and 9.4. Both 9.1 and 9.3 don't experience the slowdown. 9.4 consistently slows down on large transactions. I noticed that when a transaction starts, htop indicates high CPU and the process status is "R" (running). Then it gradually changes to low CPU usage and status "D" (uninterruptible disk sleep; see screenshot). My biggest question is why 9.4 is different from 9.1 and 9.3. I have a dozen servers and this effect is observed across the board.
Thanks everyone for the help. No matter how much I tried to emphasize the difference in performance between identically configured 9.4 and earlier versions, no one seemed to pay attention to it.
The problem was solved by disabling transparent huge pages:
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
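Note that these echo commands do not survive a reboot. One common way to make the setting permanent (the exact mechanism varies by distribution) is a kernel boot parameter:
# Append to the kernel command line in the bootloader configuration (e.g. GRUB):
transparent_hugepage=never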
Here are some resources I found helpful in researching the issue:
* https://dba.stackexchange.com/questions/32890/postgresql-pg-stat-activity-shows-commit/34169#34169
* https://lwn.net/Articles/591723/
* https://blogs.oracle.com/linux/entry/performance_issues_with_transparent_huge
I'd suspect a lot of disk seeking - 5 MB/s is just about right for very random IO on an ordinary (spinning) hard drive.
As you are basically replacing all your rows, I'd try setting the dest table's fillfactor to about 45% (alter table dest set (fillfactor=45);) and then cluster dest using dest_pkey;. This allows updated row versions to be placed on the same disk page.
Additionally, running cluster src using src_pkey; so that both tables have their data in the same physical order on disk can also help.
Also remember to run vacuum dest; after every update that large, so old row versions can be reused by subsequent updates.
Your old server probably evolved its fillfactor naturally over multiple updates. On the new server the table is packed to 100%, so updated row versions have to be placed at the end of the table.
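Putting those suggestions together, the sequence might look like this (index names such as dest_pkey and src_pkey are assumptions based on typical primary-key naming):
-- Leave ~55% of each page free so updated row versions can stay on the same page:
ALTER TABLE dest SET (fillfactor = 45);
CLUSTER dest USING dest_pkey;  -- rewrites the table in index order, applying the new fillfactor
CLUSTER src USING src_pkey;    -- matches the physical order of both tables on disk
VACUUM ANALYZE dest;           -- reclaims old row versions and refreshes planner statistics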
If only a few of the target rows are actually updated, you can avoid generating new row versions by using IS DISTINCT FROM. This can prevent a lot of useless disk traffic.
UPDATE dest SET
    field1 = src.field1,
    field2 = src.field2,
    field3_id = lu.id
FROM src
JOIN lookup lu ON src.value BETWEEN lu.min AND lu.max
WHERE dest.id = src.id
  -- avoid generating unnecessary row versions
  AND (dest.field1 IS DISTINCT FROM src.field1
       OR dest.field2 IS DISTINCT FROM src.field2
       OR dest.field3_id IS DISTINCT FROM lu.id);

PostgreSQL hot standby won't start after deleting WAL files

I have a PostgreSQL 9.1 hot standby server. The WAL files ended up taking up the entire HD, so I deleted all the WAL files. Now I want to bring the server back up, so I ran:
/usr/local/pgsql/bin/pg_ctl start -D /usr/local/pgsql/data
The problem is the server never fully starts. I see this, followed by non-stop missing-WAL-file log output:
server starting
sh-4.1$ LOG: database system was shut down in recovery at 2013-02-10 03:17:06 UTC
LOG: entering standby mode
cp: cannot stat `/usr/local/pgsql/wals/0000000100000035000000A4': No such file or directory
LOG: redo starts at 35/A4943928
LOG: consistent recovery state reached at 35/A4AE8EB8
LOG: database system is ready to accept read only connections
LOG: invalid record length at 35/A4AE8EB8
cp: cannot stat `/usr/local/pgsql/wals/0000000100000035000000A4': No such file or directory
LOG: streaming replication successfully connected to primary
FATAL: could not receive data from WAL stream: FATAL: requested WAL segment 0000000100000035000000A4 has already been removed
How can I get the server back?
You need to recreate the hot standby server from a dump (or filesystem copy).
You can find details on how to do that in the manual: High Availability, Load Balancing, and Replication
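A sketch of recreating the standby with pg_basebackup, which is available from 9.1 onward (hostname and replication user are placeholders; paths follow the question):
# On the standby: stop the server and clear the old data directory
/usr/local/pgsql/bin/pg_ctl stop -D /usr/local/pgsql/data -m fast
rm -rf /usr/local/pgsql/data/*
# Take a fresh base backup from the master
/usr/local/pgsql/bin/pg_basebackup -h master.example.com -U replicator -D /usr/local/pgsql/data -P
# Recreate recovery.conf in the data directory, then start the standby
/usr/local/pgsql/bin/pg_ctl start -D /usr/local/pgsql/data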
Please check whether the required WALs are present in the location you are restoring from, before connecting the slave to the primary host in streaming mode. The above problem can happen due to an unavailable WAL segment in the location used for the initial recovery. If that is fine, then check your restore_command in recovery.conf.
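For the archive location shown in the logs above, the restore_command line in recovery.conf would typically look like:
restore_command = 'cp /usr/local/pgsql/wals/%f %p'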