On my Aurora Postgres server I continuously see a vacuum timeout every minute for the following:
autovacuum: VACUUM pg_catalog.pg_statistic
I tried running it manually and got the following output:
INFO: vacuuming "pg_catalog.pg_statistic"
INFO: "pg_statistic": found 0 removable, 409109 nonremovable row versions in 19981 out of 19990 pages
DETAIL: 408390 dead row versions cannot be removed yet, oldest xmin: 4230263474
There were 0 unused item identifiers.
Skipped 0 pages due to buffer pins, 9 frozen pages.
0 pages are entirely empty.
CPU: user: 0.06 s, system: 0.00 s, elapsed: 0.07 s.
INFO: vacuuming "pg_toast.pg_toast_2619"
INFO: "pg_toast_2619": found 0 removable, 272673 nonremovable row versions in 61558 out of 61558 pages
DETAIL: 219 dead row versions cannot be removed yet, oldest xmin: 4230263474
There were 0 unused item identifiers.
Skipped 0 pages due to buffer pins, 0 frozen pages.
0 pages are entirely empty.
CPU: user: 0.14 s, system: 0.00 s, elapsed: 0.14 s.
VACUUM
Query returned successfully in 5 secs 55 msec.
Can anyone point to the reason why this is happening?
I think your autovacuum may actually be working fine; Timeout:VacuumDelay might only be a wait-event metric. From AWS:
A process is waiting in a cost-based vacuum delay point.
The closest setting that exists in PostgreSQL is autovacuum_vacuum_cost_delay. It is meant to slow autovacuum down: once the accumulated cleanup cost reaches autovacuum_vacuum_cost_limit, autovacuum sleeps for that many milliseconds before continuing.
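You can check the current cost-based vacuum settings directly; these are standard PostgreSQL parameters, so a query like this should work unchanged on Aurora:
SELECT name, setting, unit
FROM pg_settings
WHERE name IN ('autovacuum_vacuum_cost_delay',
               'autovacuum_vacuum_cost_limit',
               'vacuum_cost_delay',
               'vacuum_cost_limit');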
We can use pg_stat_all_tables to check the latest autovacuum time (note that pg_statistic is a system catalog, so it does not appear in pg_stat_user_tables):
SELECT
    schemaname, relname,
    last_autovacuum,
    last_autoanalyze
FROM pg_stat_all_tables
WHERE relname = 'pg_statistic';
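That said, the more telling part of your VACUUM output is the 408390 dead row versions that "cannot be removed yet" because of the old xmin horizon. A sketch of the usual suspects to check, using only standard catalog views (long-running transactions, replication slots, and prepared transactions can all hold xmin back):
-- Long-running transactions pinning the xmin horizon
SELECT pid, state, backend_xmin, xact_start, query
FROM pg_stat_activity
WHERE backend_xmin IS NOT NULL
ORDER BY age(backend_xmin) DESC;

-- Replication slots retaining old row versions
SELECT slot_name, active, xmin, catalog_xmin
FROM pg_replication_slots;

-- Forgotten prepared transactions
SELECT gid, prepared, owner
FROM pg_prepared_xacts;
Until whatever is holding that xmin goes away, VACUUM will keep reporting the same dead rows as nonremovable, no matter how often it runs.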
I just did a large import of renderd tiles for an OSM server. I want to start my next process (running the Nominatim import), but it takes a lot of memory. The problem I have is that walwriter, background writer, and checkpointer are consuming 131GB of memory. I checked pg_top and the processes are sleeping. Is there any way to clear these processes safely, or to force postgres to finish the walwriter and background writer work?
I am using Postgres v12, and shared_buffers is set to 128GB.
pg_top:
last pid: 628600; load avg 0.08, 0.03, 0.04; up 1+00:31:38 02:22:22
5 processes: 5 sleeping
CPU states: 0.0% user, 0.0% nice, 0.0% system, 100% idle, 0.0% iowait
Memory: 487G used, 16G free, 546M buffers, 253G cached
DB activity: 0 tps, 0 rollbs/s, 0 buffer r/s, 100 hit%, 43 row r/s, 0 row w/s -
DB I/O: 0 reads/s, 0 KB/s, 0 writes/s, 0 KB/s
DB disk: 3088.7 GB total, 2538.8 GB free (17% used)
Swap: 45M used, 8147M free, 588K cached
627692 postgres 20 0 131G 4368K sleep 0:00 0.00% 0.00% postgres: 12/main: background writer
627691 postgres 20 0 131G 6056K sleep 0:00 0.00% 0.00% postgres: 12/main: checkpointer
627693 postgres 20 0 131G 4368K sleep 0:00 0.00% 0.00% postgres: 12/main: walwriter
628601 postgres 20 0 131G 11M sleep 0:00 0.00% 0.00% postgres: 12/main: postgres postgres [local] idle
627695 postgres 20 0 131G 6924K sleep 0:00 0.00% 0.00% postgres: 12/main: logical replication launcher
pg_wal directory:
Everything is just fine, and htop is lying to you.
Of course the background processes that access shared buffers will use that memory, and since it is shared memory, it is reported for each of these processes. In reality, it is allocated only once.
The shared memory allocated by PostgreSQL is slightly bigger than shared_buffers, so if that parameter is set to 128GB, the data you reported make sense and are perfectly normal.
If you set max_wal_size = 32GB, it is normal to have a lot of WAL segments.
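If you want to confirm the numbers rather than trust top's accounting, the relevant settings can be read from pg_settings (a standard view, nothing version-specific):
SELECT name, setting, unit
FROM pg_settings
WHERE name IN ('shared_buffers', 'max_wal_size', 'wal_buffers');
Each backend maps the whole shared memory segment, so its virtual-size column shows ~131GB, while the resident column only counts the shared pages that process has actually touched; the memory itself is allocated once.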
I'm testing checkpoint_completion_target in RDS PostgreSQL and see that a checkpoint is taking a total time of 28.5 seconds. However, I configured:
checkpoint_completion_target = 0.9
checkpoint_timeout = 300
According to this, shouldn't the checkpoint be spread over 300 * 0.9 = 270 seconds?
PostgreSQL version 11.10
Log:
2021-03-19 16:06:47 UTC::#:[25023]:LOG: checkpoint starting: time
2021-03-19 16:07:16 UTC::#:[25023]:LOG: checkpoint complete: wrote 283 buffers (0.2%); 0 WAL file(s) added, 0 removed, 1 recycled; write=28.500 s, sync=0.006 s, total=28.533 s; sync files=56, longest=0.006 s, average=0.000 s; distance=64990 kB, estimate=68721 kB
https://www.postgresql.org/docs/10/runtime-config-wal.html
https://www.postgresql.org/docs/11/wal-configuration.html
The checkpointer implements its throttling by napping in 0.1-second chunks, and there is no provision for taking more than one nap per buffer that needs to be written. So if there is very little work to be done, it will finish early despite the setting of checkpoint_completion_target: here, 283 buffers at 0.1 s per nap is about 28.3 s, which matches the write=28.500 s in your log almost exactly.
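If you want to watch checkpoint behavior over time rather than grepping the log, the cumulative counters in pg_stat_bgwriter (a standard view in v11) are handy:
SELECT checkpoints_timed, checkpoints_req,
       checkpoint_write_time, checkpoint_sync_time,
       buffers_checkpoint
FROM pg_stat_bgwriter;
A checkpoint that writes many thousands of buffers will spread out toward the 270-second target; tiny ones like yours finish as fast as the one-nap-per-buffer pacing allows.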
I have something like this:
outputs = Parallel(n_jobs=12, verbose=10)(delayed(_process_article)(article, config) for article in data)
Case 1: Run on ubuntu with 80 cores:
CPU(s): 80
Thread(s) per core: 2
Core(s) per socket: 20
Socket(s): 2
There are a total of 90,000 tasks. At around 67k it fails and is terminated.
joblib.externals.loky.process_executor.BrokenProcessPool: A process in the executor was terminated abruptly, the pool is not usable anymore.
When I monitor top around the 67k mark, I see a sharp fall in free memory:
top - 11:40:25 up 2 days, 18:35, 4 users, load average: 7.09, 7.56, 7.13
Tasks: 32 total, 3 running, 29 sleeping, 0 stopped, 0 zombie
%Cpu(s): 7.6 us, 2.6 sy, 0.0 ni, 89.8 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 33554432 total, 40 free, 33520996 used, 33396 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 40 avail Mem
Case 2: Mac with 8 cores
hw.physicalcpu: 4
hw.logicalcpu: 8
But on the Mac it is much, much slower, and surprisingly it does not get killed at 67k.
Additionally, I reduced the parallelism (in case 1) to 2 and 4, and it still fails :(
Why is this happening? Has anyone faced this issue before and has a fix?
Note: when I run for 50,000 tasks it runs well and does not give any problems.
Thank you!
Got a machine with an increased memory of 128GB and that solved the problem! The top output already pointed at the cause: 33554432 KiB (32GB) total with only 40 KiB free and no swap configured, so the kernel's OOM killer was terminating workers, which joblib surfaces as a BrokenProcessPool.
I recently started using a tool called pg_top that shows statistics for Postgres; however, since I am not very versed with the internals of Postgres, I need a bit of clarification on the output.
last pid: 6152; load avg: 19.1, 18.6, 20.4; up 119+20:31:38 13:09:41
41 processes: 5 running, 36 sleeping
CPU states: 52.1% user, 0.0% nice, 0.8% system, 47.1% idle, 0.0% iowait
Memory: 47G used, 16G free, 2524M buffers, 20G cached
DB activity: 151 tps, 0 rollbs/s, 253403 buffer r/s, 86 hit%, 1550639 row r/s, 21 row w/s
DB I/O: 0 reads/s, 0 KB/s, 35 writes/s, 2538 KB/s
DB disk: 233.6 GB total, 195.1 GB free (16% used)
Swap:
My question is about the DB activity line: the 1.5 million row r/s, is that a lot? If so, what can be done to improve it? I am running puppetdb 2.3.8 with 6.8 million resources, 2500 nodes, and Postgres 9.1. All of this runs on a single 24-core box with 64GB of memory.
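Whether 1.5 million row r/s is "a lot" depends on where the reads come from. One way to see which tables drive it is the standard statistics views (a minimal sketch; a high seq_tup_read relative to idx_tup_fetch usually points at sequential scans that an index could avoid):
SELECT relname, seq_scan, seq_tup_read, idx_scan, idx_tup_fetch
FROM pg_stat_user_tables
ORDER BY seq_tup_read DESC
LIMIT 10;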
I have just added a new index on a Sphinx server where MySQL is running locally. Sphinx queries the local MySQL and finds the data, but the resulting index is zero-sized:
using config file '/usr/local/sphinx/etc/sphinx.conf'...
indexing index 'test'...
collected 861000 docs, 0.0 MB
collected 932575 docs, 0.0 MB
total 932575 docs, 0 bytes
total 479.180 sec, 0 bytes/sec, 1946.18 docs/sec
total 0 reads, 0.000 sec, 0.0 kb/call avg, 0.0 msec/call avg
total 4 writes, 0.000 sec, 0.1 kb/call avg, 0.0 msec/call avg
rotating indices: succesfully sent SIGHUP to searchd (pid=5545).
please help
This issue was resolved after restarting the Sphinx service; the .new.sp* files that Sphinx had created were also gone after the restart.