We have a dedicated database server running PostgreSQL 8.3 on Debian Linux. The database is regularly queried for a lot of data, while updates/inserts also happen frequently. Periodically the database stops responding for a short time (about 10 seconds) and then returns to its normal execution flow.
What I noticed through top is that there is an iowait spike during that time, lasting for as long as the database does not respond. At the same time pdflush gets activated. So my idea is that pdflush has to write data from the page cache back to disk, triggered by the dirty and background dirty ratios. The rest of the time, when PostgreSQL works normally, there is no iowait, since pdflush is not active. The relevant vm.* values on my system are the following:
dirty_background_ratio = 5
dirty_ratio = 10
dirty_expire_centisecs = 3000
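With roughly 12 GB of RAM (see MemTotal below), those ratios work out to roughly 5% ≈ 620 MB of dirty data before background writeback starts and 10% ≈ 1.2 GB before writers are forced to flush synchronously; the exact base the kernel uses for these percentages varies between versions, so treat the figures as rough estimates.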
My meminfo:
MemTotal: 12403212 kB
MemFree: 1779684 kB
Buffers: 253284 kB
Cached: 9076132 kB
SwapCached: 0 kB
Active: 7298316 kB
Inactive: 2555240 kB
SwapTotal: 7815544 kB
SwapFree: 7814884 kB
Dirty: 1804 kB
Writeback: 0 kB
AnonPages: 495028 kB
Mapped: 3142164 kB
Slab: 280588 kB
SReclaimable: 265284 kB
SUnreclaim: 15304 kB
PageTables: 422980 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 14017148 kB
Committed_AS: 3890832 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 304188 kB
VmallocChunk: 34359433983 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
I am thinking of tweaking how long a dirty page stays in memory (dirty_expire_centisecs) so as to spread the iowait spikes more evenly over time (i.e. call pdflush more regularly so that it writes smaller chunks of data to disk). Any other proposed solutions?
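For reference, these knobs can be inspected and experimented with at runtime via sysctl; the values below only illustrate the idea and are not a recommendation:

sysctl vm.dirty_background_ratio vm.dirty_ratio vm.dirty_expire_centisecs
# expire dirty pages after ~10 s instead of 30 s, so pdflush runs more often with less work each time
sysctl -w vm.dirty_expire_centisecs=1000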
IO spikes are likely to happen when PostgreSQL is checkpointing.
You can verify that by logging checkpoints and seeing whether they coincide with the periods when the server stops responding.
If that's the case, tuning checkpoint_segments and checkpoint_completion_target is likely to help.
See the wiki's advice on checkpoint tuning and the documentation on WAL configuration.
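If checkpoints do turn out to be the culprit, a minimal sketch of what that could look like in postgresql.conf on an 8.3-era server (the numbers are illustrative starting points, not tuned recommendations):

log_checkpoints = on                  # log every checkpoint so you can match them against the stalls
checkpoint_segments = 16              # default is 3; more segments means less frequent checkpoints
checkpoint_completion_target = 0.7    # spread checkpoint writes over 70% of the checkpoint interval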
I just did a large import of renderd tiles for an OSM server. I want to start my next process (the Nominatim import), but it takes a lot of memory. The problem I have is that the walwriter, background writer, and checkpointer each appear to be consuming 131 GB of memory. I checked pg_top and the processes are sleeping. Is there any way to clear these processes safely, or to force Postgres to make the walwriter and background writer finish their work?
I am using Postgres v12, and shared_buffers is set to 128GB.
HTOP:
pg_top:
last pid: 628600; load avg 0.08, 0.03, 0.04; up 1+00:31:38 02:22:22
5 processes: 5 sleeping
CPU states: 0.0% user, 0.0% nice, 0.0% system, 100% idle, 0.0% iowait
Memory: 487G used, 16G free, 546M buffers, 253G cached
DB activity: 0 tps, 0 rollbs/s, 0 buffer r/s, 100 hit%, 43 row r/s, 0 row w/s -
DB I/O: 0 reads/s, 0 KB/s, 0 writes/s, 0 KB/s
DB disk: 3088.7 GB total, 2538.8 GB free (17% used)
Swap: 45M used, 8147M free, 588K cached
627692 postgres 20 0 131G 4368K sleep 0:00 0.00% 0.00% postgres: 12/main: background writer
627691 postgres 20 0 131G 6056K sleep 0:00 0.00% 0.00% postgres: 12/main: checkpointer
627693 postgres 20 0 131G 4368K sleep 0:00 0.00% 0.00% postgres: 12/main: walwriter
628601 postgres 20 0 131G 11M sleep 0:00 0.00% 0.00% postgres: 12/main: postgres postgres [local] idle
627695 postgres 20 0 131G 6924K sleep 0:00 0.00% 0.00% postgres: 12/main: logical replication launcher
pg_wal directory:
Everything is just fine, and htop is lying to you.
Of course the background processes that access shared buffers will use that memory, and since it is shared memory, it is reported for each of these processes. In reality, it is allocated only once.
The shared memory allocated by PostgreSQL is slightly bigger than shared_buffers, so if that parameter is set to 128GB, your reported data make sense and are perfectly normal.
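If you want to double-check, you can look at how much of a backend's resident memory is shared rather than private. A minimal sketch, assuming a kernel recent enough to expose /proc/<pid>/smaps_rollup and a single cluster on the machine:

# compare shared vs. private resident memory for the background writer
pid=$(pgrep -f 'background writer')
grep -E 'Rss|Shared|Private' /proc/$pid/smaps_rollup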
If you set max_wal_size = 32GB, it is normal to have a lot of WAL segments.
The pool default.rgw.buckets.data has 501 GiB stored, but USED shows 3.5 TiB.
root@ceph-01:~# ceph df
--- RAW STORAGE ---
CLASS SIZE AVAIL USED RAW USED %RAW USED
hdd 196 TiB 193 TiB 3.5 TiB 3.6 TiB 1.85
TOTAL 196 TiB 193 TiB 3.5 TiB 3.6 TiB 1.85
--- POOLS ---
POOL ID PGS STORED OBJECTS USED %USED MAX AVAIL
device_health_metrics 1 1 19 KiB 12 56 KiB 0 61 TiB
.rgw.root 2 32 2.6 KiB 6 1.1 MiB 0 61 TiB
default.rgw.log 3 32 168 KiB 210 13 MiB 0 61 TiB
default.rgw.control 4 32 0 B 8 0 B 0 61 TiB
default.rgw.meta 5 8 4.8 KiB 11 1.9 MiB 0 61 TiB
default.rgw.buckets.index 6 8 1.6 GiB 211 4.7 GiB 0 61 TiB
default.rgw.buckets.data 10 128 501 GiB 5.36M 3.5 TiB 1.90 110 TiB
The default.rgw.buckets.data pool is using erasure coding:
root@ceph-01:~# ceph osd erasure-code-profile get EC_RGW_HOST
crush-device-class=hdd
crush-failure-domain=host
crush-root=default
jerasure-per-chunk-alignment=false
k=6
m=4
plugin=jerasure
technique=reed_sol_van
w=8
If anyone could help explain why it's using up 7 times more space, it would help a lot.
Versioning is disabled. ceph version 15.2.13 (octopus stable).
This is related to bluestore_min_alloc_size_hdd=64K (default on Octopus).
With erasure coding, data is broken up into smaller chunks, each of which takes at least 64K on disk.
One option would be to lower bluestore_min_alloc_size_hdd to 4K, which is good if your use case requires storing millions of tiny (16K) objects. In my case I'm storing hundreds of millions of 3-4 MB photos, so I decided to skip erasure coding, stay on bluestore_min_alloc_size_hdd=64K, and switch to replicated size 3 (min_size 2), which is much safer and faster in the long run.
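For reference, on Octopus the value that will be used for newly created OSDs can be inspected and changed roughly like this (existing OSDs keep the value they were built with, which is why they have to be recreated):

ceph config get osd bluestore_min_alloc_size_hdd
ceph config set osd bluestore_min_alloc_size_hdd 4096   # only affects OSDs created after this point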
Here is the reply from Josh Baergen on the mailing list:
Hey Arkadiy,
If the OSDs are on HDDs and were created with the default
bluestore_min_alloc_size_hdd, which is still 64KiB in Octopus, then in
effect data will be allocated from the pool in 640KiB chunks (64KiB *
(k+m)). 5.36M objects taking up 501GiB is an average object size of 98KiB
which results in a ratio of 6.53:1 allocated:stored, which is pretty close
to the 7:1 observed.
If my assumption about your configuration is correct, then the only way to
fix this is to adjust bluestore_min_alloc_size_hdd and recreate all your
OSDs, which will take a while...
Josh
I'm testing checkpoint_completion_target in RDS PostgreSQL and see that a checkpoint is taking a total time of 28.5 seconds. However, I have configured:
checkpoint_completion_target = 0.9
checkpoint_timeout = 300
According to this, shouldn't the checkpoint be spread over 300 * 0.9 = 270 seconds?
PostgreSQL version 11.10
Log:
2021-03-19 16:06:47 UTC::#:[25023]:LOG: checkpoint starting: time
2021-03-19 16:07:16 UTC::#:[25023]:LOG: checkpoint complete: wrote 283 buffers (0.2%); 0 WAL file(s) added, 0 removed, 1 recycled; write=28.500 s, sync=0.006 s, total=28.533 s; sync files=56, longest=0.006 s, average=0.000 s; distance=64990 kB, estimate=68721 kB
https://www.postgresql.org/docs/10/runtime-config-wal.html
https://www.postgresql.org/docs/11/wal-configuration.html
The checkpointer implements its throttling by napping in 0.1 second chunks. And there is no provision for taking more than one nap per buffer needing to be written. So if there is very little work to be done, it will finish early despite the setting of checkpoint_completion_target.
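That fits the numbers in your log: 283 buffers to write, with at most one 0.1 s nap per buffer, gives roughly 28.3 s, which is almost exactly the 28.5 s write time reported. With so little work to do, the checkpoint finishes long before the 270 s target.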
I recently started using a tool called pg_top that shows statistics for Postgres; however, since I am not very well versed in the internals of Postgres, I need a bit of clarification on the output.
last pid: 6152; load avg: 19.1, 18.6, 20.4; up 119+20:31:38 13:09:41
41 processes: 5 running, 36 sleeping
CPU states: 52.1% user, 0.0% nice, 0.8% system, 47.1% idle, 0.0% iowait
Memory: 47G used, 16G free, 2524M buffers, 20G cached
DB activity: 151 tps, 0 rollbs/s, 253403 buffer r/s, 86 hit%, 1550639 row r/s, 21 row w/s
DB I/O: 0 reads/s, 0 KB/s, 35 writes/s, 2538 KB/s
DB disk: 233.6 GB total, 195.1 GB free (16% used)
Swap:
My question is about the DB activity line: is 1.5 million row r/s a lot? If so, what can be done to improve it? I am running PuppetDB 2.3.8 with 6.8 million resources and 2500 nodes, on Postgres 9.1. All of this runs on a single 24-core box with 64 GB of memory.
One of our websites is using about 2 GB of memory, and we are trying to understand why it uses so much (we are trying to move this site to Azure, and high memory usage means a higher Azure bill).
I took an IIS dump, and from Task Manager I can see it was using about 2.2 GB of memory.
Then I ran !address -summary and this is what I got:
--- Usage Summary ---------------- RgnCount ----------- Total Size -------- %ofBusy %ofTotal
Free 913 7fb`2f5ce000 ( 7.981 Tb) 99.76%
<unknown> 4055 4`a49c9000 ( 18.572 Gb) 96.43% 0.23%
Heap 338 0`1dbd1000 ( 475.816 Mb) 2.41% 0.01%
Image 3147 0`0c510000 ( 197.063 Mb) 1.00% 0.00%
Stack 184 0`01d40000 ( 29.250 Mb) 0.15% 0.00%
Other 14 0`001bf000 ( 1.746 Mb) 0.01% 0.00%
TEB 60 0`00078000 ( 480.000 kb) 0.00% 0.00%
PEB 1 0`00001000 ( 4.000 kb) 0.00% 0.00%
--- Type Summary (for busy) ------ RgnCount ----------- Total Size -------- %ofBusy %ofTotal
MEM_PRIVATE 2206 4`ba7d2000 ( 18.914 Gb) 98.20% 0.23%
MEM_IMAGE 5522 0`148b0000 ( 328.688 Mb) 1.67% 0.00%
MEM_MAPPED 71 0`019a0000 ( 25.625 Mb) 0.13% 0.00%
--- State Summary ---------------- RgnCount ----------- Total Size -------- %ofBusy %ofTotal
MEM_FREE 913 7fb`2f5ce000 ( 7.981 Tb) 99.76%
MEM_RESERVE 2711 4`378f4000 ( 16.868 Gb) 87.58% 0.21%
MEM_COMMIT 5088 0`9912e000 ( 2.392 Gb) 12.42% 0.03%
--- Protect Summary (for commit) - RgnCount ----------- Total Size -------- %ofBusy %ofTotal
PAGE_READWRITE 1544 0`81afb000 ( 2.026 Gb) 10.52% 0.02%
PAGE_EXECUTE_READ 794 0`0f35d000 ( 243.363 Mb) 1.23% 0.00%
PAGE_READONLY 2316 0`05ea8000 ( 94.656 Mb) 0.48% 0.00%
PAGE_EXECUTE_READWRITE 279 0`020f4000 ( 32.953 Mb) 0.17% 0.00%
PAGE_WRITECOPY 92 0`0024f000 ( 2.309 Mb) 0.01% 0.00%
PAGE_READWRITE|PAGE_GUARD 61 0`000e6000 ( 920.000 kb) 0.00% 0.00%
PAGE_EXECUTE 2 0`00005000 ( 20.000 kb) 0.00% 0.00%
--- Largest Region by Usage ----------- Base Address -------- Region Size ----------
Free 5`3fac0000 7f9`59610000 ( 7.974 Tb)
<unknown> 3`06a59000 0`f9067000 ( 3.891 Gb)
Heap 0`0f1c0000 0`00fd0000 ( 15.813 Mb)
Image 7fe`fe767000 0`007ad000 ( 7.676 Mb)
Stack 0`01080000 0`0007b000 ( 492.000 kb)
Other 0`00880000 0`00183000 ( 1.512 Mb)
TEB 7ff`ffe44000 0`00002000 ( 8.000 kb)
PEB 7ff`fffdd000 0`00001000 ( 4.000 kb)
There are lots of things I don't really get:
The webserver has 8GB memory in total, but the Free section in the Usage summary is showing 7.9Tb? Why?
Unknown was showing 19.572GB, but the webserver has 8GB memory in total. Why?
The task manager shows the private working set was about 2.2 GB, but if I add Heap, Image and Stack together it is only around 700 MB, so where is the rest of the ~1.5 GB, or am I reading the output completely wrong?
Many Thanks
The webserver has 8GB memory in total, but the Free section in the Usage summary is showing 7.9Tb? Why?
The 8 GB RAM is physical memory, i.e. the memory that sits in the DIMM slots of your server. The ~8 TB is virtual address space, which can be backed by physical RAM or by the page file.
The virtual address space is 4 GB for a 32-bit process; for 64-bit processes it depends on the limits of the OS (here roughly 8 TB).
Unknown was showing 19.572GB, but the webserver has 8GB memory in total. Why?
The 19 GB is the amount of virtual memory used by an <unknown> memory manager, e.g. .NET or direct calls to VirtualAlloc().
Even if 19 GB is more than 8 GB, this does not necessarily mean that memory was swapped to disk. It depends on the state of the memory. Looking at MEM_RESERVE, we see that most of it is not in use yet. Therefore, your application may still have good performance.
The task manager shows the private working set was about 2.2 GB, but if I add Heap, Image and Stack together it is only around 700 MB, so where is the rest of the ~1.5 GB, or am I reading the output completely wrong?
The rest is in <unknown>, so the sum is even more than the 2.2 GB shown by Task Manager. The working set size indicates how much physical RAM is used by your process. Ideally, everything would be in RAM, since RAM is the fastest, but RAM is limited and not all applications fit into it. Therefore, memory that is not used very often is swapped to disk, which decreases the use of physical RAM and thus decreases the working set size.
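A rough cross-check against the dump above: MEM_RESERVE + MEM_COMMIT is about 16.9 GB + 2.4 GB ≈ 19.3 GB, which matches the sum of <unknown>, Heap, Image, Stack and the smaller regions. Only the ~2.4 GB of MEM_COMMIT is memory the process is actually using, and that is roughly the 2.2 GB Task Manager reports; most of it is PAGE_READWRITE memory inside <unknown>, i.e. very likely the .NET managed heap rather than the native Heap, Image or Stack regions.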