PostgreSQL large table update slows down

I run an update on a large table (e.g. 8 GB). It is a simple update of 3 fields in the table. I had no problems running it under PostgreSQL 9.1; it would take 40-60 minutes, but it worked. I run the same query on a 9.4 database (freshly created, not upgraded) and it starts the update fine but then slows down. It uses only ~2% CPU, the level of I/O is 4-5 MB/s, and it just sits there. No locks, no other queries or connections, just this single update SQL on the server.
The SQL is below. The "lookup" table has 12 records. The lookup can return only one row; it breaks a discrete scale (SMALLINT, -32768 .. +32767) into non-overlapping regions. The "src" and "dest" tables are ~60 million records each.
UPDATE dest SET
field1 = src.field1,
field2 = src.field2,
field3_id = (SELECT lookup.id FROM lookup WHERE src.value BETWEEN lookup.min AND lookup.max)
FROM src
WHERE dest.id = src.id;
I thought my disk had slowed down, but I can copy 1 GB files in parallel with the query execution and that runs fast at >40 MB/s, and I have only one disk (it is a VM with iSCSI media). All other disk operations are not impacted; there is plenty of I/O bandwidth. At the same time PostgreSQL is just sitting there doing very little, running very slowly.
I have 2 virtualized Linux servers, one running PostgreSQL 9.1 and another running 9.4. Both servers have close to identical PostgreSQL configurations.
Has anyone else had a similar experience? I am running out of ideas. Help.
Edit
The query "ran" for 20 hours I had to kill the connections and restart the server. Surprisingly it didn't kill the connection via query:
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE pid <> pg_backend_pid() AND datname = current_database();
and the server produced the following log:
2015-05-21 12:41:53.412 EDT FATAL: terminating connection due to administrator command
2015-05-21 12:41:53.438 EDT FATAL: terminating connection due to administrator command
2015-05-21 12:41:53.438 EDT STATEMENT: UPDATE <... this is 60,000,000 record table update statement>
Also, the server restart took a long time, producing the following log:
2015-05-21 12:43:36.730 EDT LOG: received fast shutdown request
2015-05-21 12:43:36.730 EDT LOG: aborting any active transactions
2015-05-21 12:43:36.730 EDT FATAL: terminating connection due to administrator command
2015-05-21 12:43:36.734 EDT FATAL: terminating connection due to administrator command
2015-05-21 12:43:36.747 EDT LOG: autovacuum launcher shutting down
2015-05-21 12:44:36.801 EDT LOG: received immediate shutdown request
2015-05-21 12:44:36.815 EDT WARNING: terminating connection because of crash of another server process
2015-05-21 12:44:36.815 EDT DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
"The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory" - is this an indication of a bug in PostgreSQL?
Edit
I tested 9.1, 9.3 and 9.4. Both 9.1 and 9.3 don't experience the slowdown. 9.4 consistently slows down on large transactions. I noticed that when a transaction starts, the htop monitor indicates high CPU and the process status is "R" (running). Then it gradually changes to low CPU usage and status "D" - disk (see screenshot). My biggest question is: why is 9.4 different from 9.1 and 9.3? I have a dozen servers and this effect is observed across the board.

Thanks everyone for the help. No matter how much I tried to emphasize the performance difference between identically configured 9.4 and earlier versions, no one seemed to pay attention to that.
The problem was solved by disabling transparent huge pages:
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
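To verify the change took effect, you can read the same sysfs file back; the currently active value is shown in brackets:
cat /sys/kernel/mm/transparent_hugepage/enabled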
Here are some resources I found helpful in researching the issue:
* https://dba.stackexchange.com/questions/32890/postgresql-pg-stat-activity-shows-commit/34169#34169
* https://lwn.net/Articles/591723/
* https://blogs.oracle.com/linux/entry/performance_issues_with_transparent_huge

I'd suspect a lot of disk seeking - 5 MB/s is just about right for very random I/O on an ordinary (spinning) hard drive.
As you constantly replace basically all your rows, I'd try setting the dest table's fillfactor to about 45% (alter table dest set (fillfactor=45);) and then running cluster dest using dest_pkey;. This would allow updated row versions to be placed on the same disk page.
Additionally, running cluster src using src_pkey;, so that both tables have their data in the same physical order on disk, can also help.
Also remember to run vacuum dest; after every update that large, so the space taken by old row versions can be reused by subsequent updates.
Your old server probably evolved its effective fill factor naturally over multiple updates. On the new server the table is packed to 100%, so updated row versions have to be placed at the end of the table.
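A minimal sketch of the suggested sequence, assuming the primary-key indexes are named dest_pkey and src_pkey (adjust to the actual index names):
-- leave ~55% of each page free so updated row versions can stay nearby
ALTER TABLE dest SET (fillfactor = 45);
CLUSTER dest USING dest_pkey;  -- rewrites dest in primary-key order, applying the new fillfactor
CLUSTER src USING src_pkey;    -- put src in the same physical order as dest
-- after each bulk update, reclaim dead row versions for reuse
VACUUM dest;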

If only a few of the target rows actually change, you can avoid generating new row versions by adding IS DISTINCT FROM conditions. This can prevent a lot of useless disk traffic.
UPDATE dest SET
field1 = src.field1,
field2 = src.field2,
field3_id = lu.id
FROM src
JOIN lookup lu ON src.value BETWEEN lu.min AND lu.max
WHERE dest.id = src.id
-- avoid generating unnecessary row versions
AND (dest.field1 IS DISTINCT FROM src.field1
OR dest.field2 IS DISTINCT FROM src.field2
OR dest.field3_id IS DISTINCT FROM lu.id
)
;

Related

postgresql / postgis : low disk space

I'm trying to run a query on two tables in my PostgreSQL database.
The query is like this:
psql -d gis -c "create table hgb as (select osm.*, h.geometry from osm_polygon osm join hautegaronne h on ST_contains(h.geometry,osm.way));"
The osm table has 1.3G rows and the h table has 1 row.
I get the error:
"CAUTION: Connection terminated due to crash of another server process
DETAIL: The postmaster instructed this server process to roll back the transaction
running and exiting because another server process exited abnormally
and that there is probably corrupted shared memory.
TIP: In a moment, you should be able to reconnect to the database.
data and relaunch your order.
the connection to the server was cut unexpectedly
The server may have terminated abnormally before or during the
request processing.
the connection to the server has been lost"
And my Linux is showing me a notification "the / volume has only 438.2 MB of free disk space".
The PostgreSQL log file says:
ERROR: could not write block 16700650 of file "base/16384/486730.127": No space left on device
I have tried to set temp_file_limit = 10000000 in postgresql.conf but it didn't change anything.
Could I set an external hard drive as a temp folder for postgresql to process my query?
Can I set a cap limit for the temp file size?
Thank you

Postgresql fatal the database system is starting up - windows 10

I have installed PostgreSQL on Windows 10 on a USB disk.
Every day when I wake my work PC from sleep and plug the disk back in, then try to start PostgreSQL, I get this error:
FATAL: the database system is starting up
The service starts with following command:
E:\PostgresSql\pg96\pgservice.exe "//RS//PostgreSQL 9.6 Server"
It is the default one.
Logs from E:\PostgresSql\data\logs\pg96:
2019-02-28 10:30:36 CET [21788]: [1-1] user=postgres,db=postgres,app=[unknown],client=::1 FATAL: the database system is starting up
2019-02-28 10:31:08 CET [9796]: [1-1] user=postgres,db=postgres,app=[unknown],client=::1 FATAL: the database system is starting up
I want this startup to happen faster.
When you commit data to a Postgres database, the only thing which is immediately saved to disk is the write-ahead log. The actual table changes are only applied to the in-memory buffers, and won't be permanently saved to disk until the next checkpoint.
If the server is stopped abruptly, or if it suddenly loses access to the file system, then everything in memory is lost, and the next time you start it up, it needs to resort to replaying the log in order to get the tables back to the correct state (which can take quite a while, depending on how much has happened since the last checkpoint). And until it's finished, any attempt to use the server will result in FATAL: the database system is starting up.
If you make sure you shut the server down cleanly before unplugging the disk - giving it a chance to set a checkpoint and flush all of its buffers - then it should be able to start up again more or less immediately.
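If a clean shutdown isn't always practical, forcing a checkpoint by hand just before unplugging at least shrinks the amount of WAL that has to be replayed on the next start (a sketch; CHECKPOINT requires superuser privileges):
-- flush all dirty buffers to disk so crash recovery has little left to replay
CHECKPOINT;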

Postgres 11 Standby never catches up

Since upgrading to Postgres 11 I cannot get my production standby server to catch up. In the logs things look fine eventually:
2019-02-06 19:23:53.659 UTC [14021] LOG: consistent recovery state reached at 3C772/8912C508
2019-02-06 19:23:53.660 UTC [13820] LOG: database system is ready to accept read only connections
2019-02-06 19:23:53.680 UTC [24261] LOG: started streaming WAL from primary at 3C772/8A000000 on timeline 1
But the following queries show everything is not fine:
warehouse=# SELECT coalesce(abs(pg_wal_lsn_diff(pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn())), -1) / 1024 / 1024 / 1024 AS replication_delay_gbytes;
replication_delay_gbytes
-------------------------
208.2317776754498486
(1 row)
warehouse=# select now() - pg_last_xact_replay_timestamp() AS replication_delay;
replication_delay
-------------------
01:54:19.150381
(1 row)
After a while (a couple hours) replication_delay stays about the same but replication_delay_gbytes grows, although note replication_delay is behind from the beginning and replication_delay_gbytes starts near 0. During startup there were a number of these messages:
2019-02-06 18:24:36.867 UTC [14036] WARNING: xlog min recovery request 3C734/FA802AA8 is past current point 3C700/371ED080
2019-02-06 18:24:36.867 UTC [14036] CONTEXT: writing block 0 of relation base/16436/2106308310_vm
but Googling suggests these are fine.
The replica was created using repmgr, by running pg_basebackup to perform the clone and then starting up the replica and watching it catch up. This previously worked with Postgres 10.
Any thoughts on why this replica comes up but is perpetually lagging?
I'm still not sure what the issue is/was, but I was able to get the standby caught up with these two changes:
set use_replication_slots=true in the repmgr config
set wal_compression=on in the postgres config
Using replication slots didn't seem to change anything other than causing replication_delay_gbytes to stay roughly flat. Turning on WAL compression did help, somehow, although I'm not entirely sure how. Yes, in theory it made it possible to ship WAL files to the standby faster, but reviewing network logs I see a drop in sent/received bytes that matches the effects of compression, so it seems to be shipping WAL files at the same speed, just using less network.
It still seems like there is some underlying issue at play here, though. For example, when I do pg_basebackup to create the standby it generates roughly 500 MB/s of network traffic, but then when it is streaming WALs after the standby finishes recovery it drops to ~250 MB/s without WAL compression and ~100 MB/s with WAL compression. Yet there was no decrease in network traffic after it caught up with WAL compression, so I'm not sure what's going on there that allowed it to catch up.
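For reference, wal_compression can be changed on the primary without hand-editing postgresql.conf; a sketch using ALTER SYSTEM (it only needs a configuration reload, not a restart):
-- writes the setting to postgresql.auto.conf, then reloads the configuration
ALTER SYSTEM SET wal_compression = on;
SELECT pg_reload_conf();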

bdr_init_copy hangs indefinitely

Fairly new to PostgreSQL, but I have to get replication set up. I settled on BDR, and it works fine in the local demo, but on distributed machines it starts to get problematic, mostly because I have no real clue what the hell I am doing, and I cry myself to sleep pining for MySQL. I've gotten BDR working across multiple servers, almost. When I run:
SELECT bdr.bdr_node_join_wait_for_ready();
on the joining nodes, it hangs. This happens on both DB2 and DB3. DB1 returns a valid response. Researching this I came across the bdr_init_copy command, which apparently does everything I have been doing by hand, and then some, so I tried that out. Now, when I run:
/usr/lib/postgresql/9.4/bin/bdr_init_copy -d "host=192.168.1.10 dbname=demo3" --local-dbname="host=192.168.1.23 dbname=demo3" -n db2 -D bdr_data
I get
bdr_init_copy: starting ...
Getting remote server identification ...
Detected 2 BDR database(s) on remote server
Updating BDR configuration on the remote node:
demo2: creating replication slot ...
demo2: creating node entry for local node ...
demo3: creating replication slot ...
demo3: creating node entry for local node ...
Creating base backup of the remote node...
63655/63655 kB (100%), 1/1 tablespace
Creating restore point on remote node ...
Bringing local node to the restore point ...
And it sits there. I am assuming the same cause underlies both issues. As far as I can tell there are no log entries created on the local node (db2), but the following is present on the remote (db1):
2016-10-12 22:38:43 UTC [20808-1] postgres#demo2 LOG: logical decoding found consistent point at 0/5001F00
2016-10-12 22:38:43 UTC [20808-2] postgres#demo2 DETAIL: There are no running transactions.
2016-10-12 22:38:43 UTC [20808-3] postgres#demo2 STATEMENT: SELECT pg_create_logical_replication_slot('bdr_17163_6340711416785871202_2_17163__', 'bdr');
2016-10-12 22:38:43 UTC [20811-1] postgres#demo3 LOG: logical decoding found consistent point at 0/5002090
2016-10-12 22:38:43 UTC [20811-2] postgres#demo3 DETAIL: There are no running transactions.
2016-10-12 22:38:43 UTC [20811-3] postgres#demo3 STATEMENT: SELECT pg_create_logical_replication_slot('bdr_17939_6340711416785871202_2_17939__', 'bdr');
2016-10-12 22:38:44 UTC [20812-1] postgres#demo3 LOG: restore point "bdr_6340711416785871202" created at 0/50022A8
2016-10-12 22:38:44 UTC [20812-2] postgres#demo3 STATEMENT: SELECT pg_create_restore_point('bdr_6340711416785871202')
Any help out there?
Right, just had this issue and none of the other forums were any help. Some of them even say things like it is okay for the new node to report its status as "o" and the other nodes report the new server status as "i" because "this is just a bug and it's fine". It's NOT OKAY. The new server could receive replication updates, but no primary updates were possible on the new server.
The key to solving this problem is to crank up the logging on the server you are joining to (not the new one). On the new server logs, you might see things like: 08006: could not receive data from client: Connection reset by peer, which is not very helpful, and will have you checking firewalls, etc. The real money shot will come from the source server logs when they have entries saying something like: no free replication state could be found for 11, increase max_replication_slots
What's probably happened is you either have too many servers in your cluster for the default settings or, more likely, there is some junk left over from old hosts.
You need to clean things up ... ON EVERY SERVER IN THE EXISTING CLUSTER (NB!). Start by getting a list of things on the existing cluster:
select * from bdr.bdr_nodes order by node_sysid;
THEN, check the following:
select conn_sysid,conn_dboid from bdr.bdr_connections order by conn_sysid;
.. if you see old entries (that don't contain node_sysid from the first query) then delete
eg. delete from bdr.bdr_connections where conn_sysid='<from-first-query>';
select * from pg_replication_slots order by slot_name;
.. if you see old entries that don't contain an active sysid then delete
.. NB, use the function, DO NOT do a "delete from"
eg. select pg_drop_replication_slot('bdr_17213_6574566740899221664_1_17213__');
select * from pg_replication_identifier order by riname;
.. if you see old entries that don't contain an active sysid then delete
.. NB, use the function, DO NOT do a "delete from"
select pg_replication_identifier_drop('bdr_6443767151306784833_1_17210_17213_');
With any luck, after you've done this on every node, you will see your new server's BDR status go to 'r'. As you clean up each host, you should notice that the "08006: could not receive data from client: Connection reset by peer" log entries matching the conn_sysid of the server you've just cleaned up stop appearing. Good luck
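As a quick way to confirm the join has finished, you can re-run the node query and look at the status column (a sketch; assumes the node_name and node_status columns of bdr.bdr_nodes as in BDR 0.9/1.x):
-- every node, including the new one, should eventually report node_status = 'r'
select node_name, node_status from bdr.bdr_nodes order by node_name;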

Connection lost after query runs for few minutes in PostgreSQL

I am using PostgreSQL 8.4 and PostGIS 1.5. What I'm trying to do is INSERT data from one table into another (but not strictly the same data). For each column, a few queries are run, and there are a total of 50143 rows stored in the table. But the query is quite resource-heavy: after the query has run for a few minutes, the connection is lost. It happens about 21-22k ms into the execution of the query, after which I have to start the DBMS manually again. How should I go about solving this issue?
The error message is as follows:
[Err] server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
Additionally, here is the psql error log:
2013-07-03 05:33:06 AZOST HINT: In a moment you should be able to reconnect to the database and repeat your command.
2013-07-03 05:33:06 AZOST WARNING: terminating connection because of crash of another server process
2013-07-03 05:33:06 AZOST DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
My guess, reading your problem, is that you are hitting out of memory issues. Craig's suggestion to turn off overcommit is a good one. You may also need to reduce work_mem if this is a big query. This may slow down your query but it will free up memory. work_mem is per operation so a query can use many times that setting.
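A sketch of lowering work_mem for just the problematic session, so other connections keep the server-wide default (the 16MB value here is only an example):
-- applies to the current session only; run the big INSERT ... SELECT afterwards in the same session
SET work_mem = '16MB';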
Another possibility is you are hitting some sort of bug in a C-language module in PostgreSQL. If this is the case, try updating to the latest version of PostGIS etc.
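If you go the upgrade route, it is worth recording the exact versions in play first; a sketch:
-- reports the PostGIS build and the libraries it was compiled against
SELECT PostGIS_full_version();
-- reports the PostgreSQL server version
SELECT version();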