We have an ASP.NET web service application that is hitting OutOfMemory exceptions in its IIS w3wp.exe process. We captured a mini memory dump of w3wp.exe when the OOM happened, and it shows a large MEM_RESERVE total but a small managed heap. The question is: how can we find out what is consuming the MEM_RESERVE (virtual address space)? I expect to see some MEM_RESERVE, but not 2.9 GB. Since the MEM_RESERVE total doesn't add up against the eeheap size, would it come from unmanaged heaps? If so, is there any way to confirm that?
0:000> !address -summary
Mapping file section regions...
Mapping module regions...
Mapping PEB regions...
Mapping TEB and stack regions...
Mapping heap regions...
Mapping page heap regions...
Mapping other regions...
Mapping stack trace database regions...
Mapping activation context regions...
--- Usage Summary ---------------- RgnCount ----------- Total Size -------- %ofBusy %ofTotal
<unknown> 1664 d4c1d000 ( 3.324 GB) 92.34% 83.11%
Free 348 19951000 ( 409.316 MB) 9.99%
Image 1184 ad58000 ( 173.344 MB) 4.70% 4.23%
Heap 75 5ab4000 ( 90.703 MB) 2.46% 2.21%
Stack 193 1200000 ( 18.000 MB) 0.49% 0.44%
TEB 64 40000 ( 256.000 kB) 0.01% 0.01%
Other 10 35000 ( 212.000 kB) 0.01% 0.01%
PEB 1 1000 ( 4.000 kB) 0.00% 0.00%
--- Type Summary (for busy) ------ RgnCount ----------- Total Size -------- %ofBusy %ofTotal
MEM_PRIVATE 1323 d6f04000 ( 3.358 GB) 93.28% 83.96%
MEM_IMAGE 1828 dd75000 ( 221.457 MB) 6.01% 5.41%
MEM_MAPPED 40 1a26000 ( 26.148 MB) 0.71% 0.64%
--- State Summary ---------------- RgnCount ----------- Total Size -------- %ofBusy %ofTotal
MEM_RESERVE 746 b8b4e000 ( 2.886 GB) 80.16% 72.15%
MEM_COMMIT 2445 2db51000 ( 731.316 MB) 19.84% 17.85%
MEM_FREE 348 19951000 ( 409.316 MB) 9.99%
--- Protect Summary (for commit) - RgnCount ----------- Total Size -------- %ofBusy %ofTotal
PAGE_READWRITE 1024 1f311000 ( 499.066 MB) 13.54% 12.18%
PAGE_EXECUTE_READ 248 b2e6000 ( 178.898 MB) 4.85% 4.37%
PAGE_READONLY 613 1c1d000 ( 28.113 MB) 0.76% 0.69%
PAGE_WRITECOPY 219 103e000 ( 16.242 MB) 0.44% 0.40%
PAGE_EXECUTE_READWRITE 144 59c000 ( 5.609 MB) 0.15% 0.14%
PAGE_EXECUTE_WRITECOPY 69 21f000 ( 2.121 MB) 0.06% 0.05%
PAGE_READWRITE|PAGE_GUARD 128 144000 ( 1.266 MB) 0.03% 0.03%
--- Largest Region by Usage ----------- Base Address -------- Region Size ----------
<unknown> c2010000 4032000 ( 64.195 MB)
Free 86010000 4000000 ( 64.000 MB)
Image 72f48000 f5d000 ( 15.363 MB)
Heap 64801000 fcf000 ( 15.809 MB)
Stack 1ed0000 7a000 ( 488.000 kB)
TEB ffe8a000 1000 ( 4.000 kB)
Other fffb0000 23000 ( 140.000 kB)
PEB fffde000 1000 ( 4.000 kB)
0:000> !ao
---------Heap 1 ---------
Managed OOM occured after GC #2924 (Requested to allocate 0 bytes)
Reason: Didn't have enough memory to allocate an LOH segment
Detail: LOH: Failed to reserve memory (117440512 bytes)
---------Heap 7 ---------
Managed OOM occured after GC #2308 (Requested to allocate 0 bytes)
Reason: Could not do a full GC
0:000> !vmstat
TYPE MINIMUM MAXIMUM AVERAGE BLK COUNT TOTAL
~~~~ ~~~~~~~ ~~~~~~~ ~~~~~~~ ~~~~~~~~~ ~~~~~
Free:
Small 4K 64K 34K 219 7,475K
Medium 72K 832K 273K 80 21,903K
Large 1,080K 65,536K 7,954K 49 389,759K
Summary 4K 65,536K 1,204K 348 419,139K
Reserve:
Small 4K 64K 14K 436 6,519K
Medium 88K 1,020K 303K 169 51,303K
Large 1,088K 65,528K 21,052K 141 2,968,407K
Summary 4K 65,528K 4,056K 746 3,026,231K
Commit:
Small 4K 64K 16K 2,019 33,386K
Medium 68K 1,024K 287K 275 79,043K
Large 1,028K 65,736K 7,315K 87 636,435K
Summary 4K 65,736K 314K 2,381 748,866K
Private:
Small 4K 64K 28K 851 23,911K
Medium 88K 1,024K 313K 222 69,687K
Large 1,028K 65,736K 18,429K 186 3,427,951K
Summary 4K 65,736K 2,797K 1,259 3,521,550K
Mapped:
Small 4K 64K 18K 23 431K
Medium 68K 1,004K 385K 11 4,239K
Large 1,520K 6,640K 3,684K 6 22,104K
Summary 4K 6,640K 669K 40 26,775K
Image:
Small 4K 64K 9K 1,581 15,562K
Medium 68K 1,000K 267K 211 56,419K
Large 1,028K 15,732K 4,299K 36 154,787K
Summary 4K 15,732K 124K 1,828 226,771K
0:000> !eeheap -gc
Number of GC Heaps: 8
------------------------------
Heap 0 (01cd1720)
generation 0 starts at 0x44470de4
generation 1 starts at 0x43f51000
generation 2 starts at 0x02161000
ephemeral segment allocation context: none
segment begin allocated size
02160000 02161000 02dee4a8 0xc8d4a8(13161640)
43f50000 43f51000 444b9364 0x568364(5669732)
Large object heap starts at 0x12161000
segment begin allocated size
12160000 12161000 121bd590 0x5c590(378256)
c2010000 c2011000 c6041020 0x4030020(67305504)
Heap Size: Size: 0x5281dbc (86515132) bytes.
------------------------------
Heap 1 (01cd65a8)
generation 0 starts at 0x8c4b38e0
generation 1 starts at 0x8c011000
generation 2 starts at 0x04161000
ephemeral segment allocation context: none
segment begin allocated size
04160000 04161000 054114f0 0x12b04f0(19596528)
25c40000 25c41000 25fc7328 0x386328(3695400)
8c010000 8c011000 8c4c4778 0x4b3778(4929400)
Large object heap starts at 0x13161000
segment begin allocated size
13160000 13161000 13161010 0x10(16)
Heap Size: Size: 0x1ae9fa0 (28221344) bytes.
------------------------------
Heap 2 (01cdb5c0)
generation 0 starts at 0x4a71a420
generation 1 starts at 0x49f51000
generation 2 starts at 0x06161000
ephemeral segment allocation context: none
segment begin allocated size
06160000 06161000 06e89b18 0xd28b18(13798168)
27c40000 27c41000 27f8f6dc 0x34e6dc(3466972)
9a010000 9a011000 9a5900b8 0x57f0b8(5763256)
49f50000 49f51000 4a72cc2c 0x7dbc2c(8240172)
Large object heap starts at 0x14161000
segment begin allocated size
14160000 14161000 14161010 0x10(16)
Heap Size: Size: 0x1dd1ee8 (31268584) bytes.
------------------------------
Heap 3 (01ce05d8)
generation 0 starts at 0x34fbe540
generation 1 starts at 0x34d11000
generation 2 starts at 0x08161000
ephemeral segment allocation context: none
segment begin allocated size
08160000 08161000 08c26248 0xac5248(11293256)
2bc40000 2bc41000 2bfc9294 0x388294(3703444)
34d10000 34d11000 34fcdbb0 0x2bcbb0(2870192)
Large object heap starts at 0x15161000
segment begin allocated size
15160000 15161000 151c3b10 0x62b10(404240)
Heap Size: Size: 0x116cb9c (18271132) bytes.
------------------------------
Heap 4 (01ce55f0)
generation 0 starts at 0x934a7cec
generation 1 starts at 0x93011000
generation 2 starts at 0x0a161000
ephemeral segment allocation context: none
segment begin allocated size
0a160000 0a161000 0b4bf898 0x135e898(20310168)
45f50000 45f51000 45f70df4 0x1fdf4(130548)
93010000 93011000 934b83ec 0x4a73ec(4879340)
Large object heap starts at 0x16161000
segment begin allocated size
16160000 16161000 161ab050 0x4a050(303184)
Heap Size: Size: 0x186fac8 (25623240) bytes.
------------------------------
Heap 5 (01cec608)
generation 0 starts at 0x31f1d4d0
generation 1 starts at 0x31c41000
generation 2 starts at 0x0c161000
ephemeral segment allocation context: none
segment begin allocated size
0c160000 0c161000 0d2120b0 0x10b10b0(17502384)
7aea0000 7aea1000 7af5c074 0xbb074(766068)
31c40000 31c41000 31f2caac 0x2ebaac(3062444)
Large object heap starts at 0x17161000
segment begin allocated size
17160000 17161000 17179ad0 0x18ad0(101072)
Heap Size: Size: 0x14706a0 (21431968) bytes.
------------------------------
Heap 6 (01cf5a20)
generation 0 starts at 0x5fbb5598
generation 1 starts at 0x5f821000
generation 2 starts at 0x0e161000
ephemeral segment allocation context: none
segment begin allocated size
0e160000 0e161000 0eb22284 0x9c1284(10228356)
5f820000 5f821000 5fbc9838 0x3a8838(3835960)
Large object heap starts at 0x18161000
segment begin allocated size
18160000 18161000 18161010 0x10(16)
Heap Size: Size: 0xd69acc (14064332) bytes.
------------------------------
Heap 7 (01cfaa38)
generation 0 starts at 0xad530e00
generation 1 starts at 0xad011000
generation 2 starts at 0x10161000
ephemeral segment allocation context: none
segment begin allocated size
10160000 10161000 109bd218 0x85c218(8765976)
a4010000 a4011000 a42f418c 0x2e318c(3027340)
41f50000 41f51000 42a99090 0xb48090(11829392)
ad010000 ad011000 ad5399a0 0x5289a0(5409184)
Large object heap starts at 0x19161000
segment begin allocated size
19160000 19161000 1980ddb0 0x6acdb0(6999472)
Heap Size: Size: 0x225cb84 (36031364) bytes.
------------------------------
GC Heap Size: Size: 0xf950f98 (261427096) bytes.
0:000> !heap -s
SEGMENT HEAP ERROR: failed to initialize the extention
LFH Key : 0x3a251113
Termination on corruption : ENABLED
Heap Flags Reserv Commit Virt Free List UCR Virt Lock Fast
(k) (k) (k) (k) length blocks cont. heap
-----------------------------------------------------------------------------
Virtual block: 1a830000 - 1a830000 (size 00000000)
Virtual block: 1a9d0000 - 1a9d0000 (size 00000000)
007b0000 00000002 16384 15520 16384 198 418 5 2 3 LFH
00280000 00001002 3136 1608 3136 7 23 3 0 0 LFH
00a40000 00001002 1088 828 1088 79 25 2 0 0 LFH
00c30000 00001002 1088 308 1088 13 5 2 0 0 LFH
00100000 00041002 256 4 256 2 1 1 0 0
01560000 00001002 256 24 256 3 4 1 0 0
01510000 00001002 64 4 64 2 1 1 0 0
01670000 00001002 64 4 64 2 1 1 0 0
00290000 00011002 256 4 256 1 2 1 0 0
01870000 00001002 256 8 256 3 3 1 0 0
004e0000 00001002 256 4 256 1 2 1 0 0
019e0000 00001002 256 4 256 2 1 1 0 0
01af0000 00001002 256 4 256 2 1 1 0 0
02080000 00001002 1280 396 1280 1 34 2 0 0 LFH
01fc0000 00041002 256 4 256 2 1 1 0 0
1b700000 00041002 256 256 256 6 4 1 0 0 LFH
21840000 00001002 64192 1880 64192 1639 13 12 0 0 LFH
External fragmentation 87 % (13 free blocks)
-----------------------------------------------------------------------------
If we recycle w3wp, all of the memory is returned to the system, so there does not appear to be a classic memory leak. I understand that MEM_RESERVE regions are reserved virtual address space that has not been committed. The question remains: is there a way to figure out what is using up all of this virtual address space?
I have the memory dump, but I don't have access to the server to collect performance counters or anything similar.
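One option with only the dump is to save the full "!address" region listing to a log file (for example via .logopen before running it) and aggregate the MEM_RESERVE regions offline. Below is a rough parsing sketch in Python; the exact column layout of "!address" varies between debugger versions, so the field handling here is an assumption, and the largest bases it prints are meant to be fed back into "!address <base>" for a closer look.

# Rough sketch: aggregate MEM_RESERVE regions from a saved "!address" listing.
# Assumptions: each region line starts with base address, end address and region
# size (hex), followed by type/state/protect/usage text; the exact column layout
# differs between debugger versions, so treat this as illustrative only.
import re
import sys
from collections import Counter

LINE = re.compile(
    r"^\s*(?P<base>[0-9a-f`]+)\s+(?P<end>[0-9a-f`]+)\s+(?P<size>[0-9a-f`]+)"
    r"\s+(?P<rest>.*MEM_RESERVE.*)$",
    re.IGNORECASE,
)

def parse(path):
    totals = Counter()   # reserved bytes per usage tag (Heap, Stack, <unknown>, ...)
    regions = []         # (size, base, description) of every reserved region
    with open(path) as f:
        for line in f:
            m = LINE.match(line)
            if not m:
                continue
            size = int(m.group("size").replace("`", ""), 16)
            tag = "<unknown>"
            for candidate in ("Heap", "Stack", "Image", "TEB", "PEB", "MappedFile", "Other"):
                if candidate in m.group("rest"):
                    tag = candidate
                    break
            totals[tag] += size
            regions.append((size, m.group("base"), m.group("rest").strip()))
    return totals, sorted(regions, reverse=True)[:20]

if __name__ == "__main__":
    totals, top = parse(sys.argv[1])
    for tag, size in totals.most_common():
        print(f"{tag:12s} {size / 2**20:10.1f} MB reserved")
    print("\nLargest reserved regions (inspect each base with '!address <base>'):")
    for size, base, rest in top:
        print(f"{base:>12s} {size / 2**20:10.1f} MB  {rest}")

Likely candidates for large reserved-but-uncommitted ranges are GC segments (compare the bases against the segment list in "!eeheap -gc") and native allocations made directly with VirtualAlloc, which "!heap" will not show.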
Thanks folks!!
I have a 5-node MongoDB replica set: 1 primary, 3 secondaries, and 1 arbiter.
I am using MongoDB version 4.2.3.
Sizes:
"dataSize" : 688.4161271536723,
"indexes" : 177,
"indexSize" : 108.41889953613281
My primary is very slow: each command from the shell takes a long time to return.
Memory usage seems very high, and it looks like mongod is consuming more than 50% of the RAM:
# free -lh
total used free shared buff/cache available
Mem: 188Gi 187Gi 473Mi 56Mi 740Mi 868Mi
Low: 188Gi 188Gi 473Mi
High: 0B 0B 0B
Swap: 191Gi 117Gi 74Gi
------------------------------------------------------------------
Top Memory Consuming Process Using ps command
------------------------------------------------------------------
PID PPID %MEM %CPU CMD
311 49145 97.8 498 mongod --config /etc/mongod.conf
23818 23801 0.0 3.8 /bin/prometheus --config.file=/etc/prometheus/prometheus.yml
23162 23145 0.0 8.4 /usr/bin/cadvisor -logtostderr
25796 25793 0.0 0.4 postgres: checkpointer
23501 23484 0.0 1.0 /postgres_exporter
24490 24473 0.0 0.1 grafana-server --homepath=/usr/share/grafana --config=/etc/grafana/grafana.ini --packaging=docker cfg:default.log.mode=console
top:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
311 systemd+ 20 0 313.9g 184.6g 2432 S 151.7 97.9 26229:09 mongod
23818 nobody 20 0 11.3g 150084 17988 S 20.7 0.1 8523:47 prometheus
23162 root 20 0 12.7g 93948 5964 S 65.5 0.0 18702:22 cadvisor
The serverStatus memory section shows this:
octopusrs0:PRIMARY> db.serverStatus().mem
{
"bits" : 64,
"resident" : 189097,
"virtual" : 321404,
"supported" : true
}
octopusrs0:PRIMARY> db.serverStatus().tcmalloc.tcmalloc.formattedString
------------------------------------------------
MALLOC: 218206510816 (208097.9 MiB) Bytes in use by application
MALLOC: + 96926863360 (92436.7 MiB) Bytes in page heap freelist
MALLOC: + 3944588576 ( 3761.9 MiB) Bytes in central cache freelist
MALLOC: + 134144 ( 0.1 MiB) Bytes in transfer cache freelist
MALLOC: + 713330688 ( 680.3 MiB) Bytes in thread cache freelists
MALLOC: + 1200750592 ( 1145.1 MiB) Bytes in malloc metadata
MALLOC: ------------
MALLOC: = 320992178176 (306122.0 MiB) Actual memory used (physical + swap)
MALLOC: + 13979086848 (13331.5 MiB) Bytes released to OS (aka unmapped)
MALLOC: ------------
MALLOC: = 334971265024 (319453.5 MiB) Virtual address space used
MALLOC:
MALLOC: 9420092 Spans in use
MALLOC: 234 Thread heaps in use
MALLOC: 4096 Tcmalloc page size
------------------------------------------------
Call ReleaseFreeMemory() to release freelist memory to the OS (via madvise()).
Bytes released to the OS take up virtual address space but no physical memory.
How can I determine what is causing this high memory consumption, and what can I do to return to normal memory consumption?
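For reference, a quick way to see where the resident memory is going is to put the WiredTiger cache, the tcmalloc freelists, and the application bytes side by side. A minimal sketch with pymongo (assuming you can run serverStatus against the admin database; the exact field paths, especially under tcmalloc, can vary between MongoDB versions):

# Minimal sketch: summarize where mongod's memory is going using serverStatus.
# Assumes pymongo and permission to run serverStatus; field paths may differ
# slightly between MongoDB versions.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # placeholder URI
status = client.admin.command("serverStatus")

mem = status["mem"]
print(f"resident: {mem['resident']} MiB, virtual: {mem['virtual']} MiB")

cache = status["wiredTiger"]["cache"]
print(f"WiredTiger cache in use: {int(cache['bytes currently in the cache']) / 2**30:.1f} GiB")
print(f"WiredTiger cache limit:  {int(cache['maximum bytes configured']) / 2**30:.1f} GiB")

tc = status["tcmalloc"]["tcmalloc"]
for key in ("pageheap_free_bytes", "pageheap_unmapped_bytes",
            "current_total_thread_cache_bytes"):
    if key in tc:
        print(f"{key}: {int(tc[key]) / 2**30:.1f} GiB")

In the tcmalloc output above, the notable numbers are roughly 90 GiB sitting in the page heap freelist on top of roughly 203 GiB in use by the application.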
Thanks,
Tamar
I have a Ceph cluster running with 18 x 600 GB OSDs. There are three pools (size: 3, pg_num: 64), each with a 200 GB image on it, and 6 servers are connected to these images via iSCSI, storing about 20 VMs on them. Here is the output of "ceph df":
POOLS:
POOL ID STORED OBJECTS USED %USED MAX AVAIL
cephfs_data 1 0 B 0 0 B 0 0 B
cephfs_metadata 2 17 KiB 22 1.5 MiB 100.00 0 B
defaults.rgw.buckets.data 3 0 B 0 0 B 0 0 B
defaults.rgw.buckets.index 4 0 B 0 0 B 0 0 B
.rgw.root 5 2.0 KiB 5 960 KiB 100.00 0 B
default.rgw.control 6 0 B 8 0 B 0 0 B
default.rgw.meta 7 393 B 2 384 KiB 100.00 0 B
default.rgw.log 8 0 B 207 0 B 0 0 B
rbd 9 150 GiB 38.46k 450 GiB 100.00 0 B
rbd3 13 270 GiB 69.24k 811 GiB 100.00 0 B
rbd2 14 150 GiB 38.52k 451 GiB 100.00 0 B
Based on this, I expect about 1.7 TB of raw capacity usage (see the quick check below), but it is currently about 9 TB!
RAW STORAGE:
CLASS SIZE AVAIL USED RAW USED %RAW USED
hdd 9.8 TiB 870 GiB 9.0 TiB 9.0 TiB 91.35
TOTAL 9.8 TiB 870 GiB 9.0 TiB 9.0 TiB 91.35
And the cluster is down because there is very little capacity remaining. I wonder what is causing this and how I can get it fixed.
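For reference, the 1.7 TB expectation is just the STORED column of the three RBD pools multiplied by the replication size; a rough sketch of that arithmetic, ignoring metadata and allocation overhead:

# Back-of-the-envelope check of expected raw usage from the "ceph df" output above.
# Ignores metadata, journal and allocation overhead, so it is only a lower bound.
stored_gib = {"rbd": 150, "rbd3": 270, "rbd2": 150}   # STORED column, GiB
replication_size = 3                                   # pool size: 3

expected_raw_gib = sum(stored_gib.values()) * replication_size
print(f"expected raw usage ~ {expected_raw_gib} GiB (~{expected_raw_gib / 1024:.1f} TiB)")
# prints ~1710 GiB (~1.7 TiB), versus the ~9 TiB RAW USED the cluster reports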
Your help is much appreciated
The problem was mounting the iSCSI targets without the discard option.
Since I am using Red Hat Virtualization, I just modified all storage domains created on top of Ceph and enabled "discard" on them. After just a few hours, about 1 TB of storage was released. Now, about 12 hours later, 5 TB of storage has been released.
Thanks
I am trying to get a feel for what I should expect in terms of performance from cloud storage.
I just ran gsutil perfdiag from a Compute Engine instance in the same location (US) and the same project as my Cloud Storage bucket.
For Nearline storage, I get 25 Mibit/s read and 353 Mibit/s write. Is that low, high, or average, and why is there such a discrepancy between read and write?
==============================================================================
DIAGNOSTIC RESULTS
==============================================================================
------------------------------------------------------------------------------
Latency
------------------------------------------------------------------------------
Operation Size Trials Mean (ms) Std Dev (ms) Median (ms) 90th % (ms)
========= ========= ====== ========= ============ =========== ===========
Delete 0 B 5 112.0 52.9 78.2 173.6
Delete 1 KiB 5 94.1 17.5 90.8 115.0
Delete 100 KiB 5 80.4 2.5 79.9 83.4
Delete 1 MiB 5 86.7 3.7 88.2 90.4
Download 0 B 5 58.1 3.8 57.8 62.2
Download 1 KiB 5 2892.4 1071.5 2589.1 4111.9
Download 100 KiB 5 1955.0 711.3 1764.9 2814.3
Download 1 MiB 5 2679.4 976.2 2216.2 3869.9
Metadata 0 B 5 69.1 57.0 42.8 129.3
Metadata 1 KiB 5 37.4 1.5 37.1 39.0
Metadata 100 KiB 5 64.2 47.7 40.9 113.0
Metadata 1 MiB 5 45.7 9.1 49.4 55.1
Upload 0 B 5 138.3 21.0 122.5 164.8
Upload 1 KiB 5 170.6 61.5 139.4 242.0
Upload 100 KiB 5 387.2 294.5 245.8 706.1
Upload 1 MiB 5 257.4 51.3 228.4 319.7
------------------------------------------------------------------------------
Write Throughput
------------------------------------------------------------------------------
Copied a 1 GiB file 5 times for a total transfer size of 5 GiB.
Write throughput: 353.13 Mibit/s.
------------------------------------------------------------------------------
Read Throughput
------------------------------------------------------------------------------
Copied a 1 GiB file 5 times for a total transfer size of 5 GiB.
Read throughput: 25.16 Mibit/s.
------------------------------------------------------------------------------
System Information
------------------------------------------------------------------------------
IP Address:
##.###.###.##
Temporary Directory:
/tmp
Bucket URI:
gs://pl_twitter/
gsutil Version:
4.12
boto Version:
2.30.0
Measurement time:
2015-05-11 07:03:26 PM
Google Server:
Google Server IP Addresses:
##.###.###.###
Google Server Hostnames:
Google DNS thinks your IP is:
CPU Count:
4
CPU Load Average:
[0.16, 0.05, 0.06]
Total Memory:
14.38 GiB
Free Memory:
11.34 GiB
TCP segments sent during test:
5592296
TCP segments received during test:
2417850
TCP segments retransmit during test:
3794
Disk Counter Deltas:
disk reads writes rbytes wbytes rtime wtime
sda1 31 5775 126976 1091674112 856 1603544
TCP /proc values:
wmem_default = 212992
wmem_max = 212992
rmem_default = 212992
tcp_timestamps = 1
tcp_window_scaling = 1
tcp_sack = 1
rmem_max = 212992
Boto HTTPS Enabled:
True
Requests routed through proxy:
False
Latency of the DNS lookup for Google Storage server (ms):
2.5
Latencies connecting to Google Storage server IPs (ms):
##.###.###.### = 1.1
------------------------------------------------------------------------------
In-Process HTTP Statistics
------------------------------------------------------------------------------
Total HTTP requests made: 94
HTTP 5xx errors: 0
HTTP connections broken: 0
Availability: 100%
For standard storage I get:
==============================================================================
DIAGNOSTIC RESULTS
==============================================================================
------------------------------------------------------------------------------
Latency
------------------------------------------------------------------------------
Operation Size Trials Mean (ms) Std Dev (ms) Median (ms) 90th % (ms)
========= ========= ====== ========= ============ =========== ===========
Delete 0 B 5 121.9 34.8 105.1 158.9
Delete 1 KiB 5 159.3 58.2 126.0 232.3
Delete 100 KiB 5 106.8 17.0 103.3 125.7
Delete 1 MiB 5 167.0 77.3 145.1 251.0
Download 0 B 5 87.2 10.3 81.1 100.0
Download 1 KiB 5 95.5 18.0 92.4 115.6
Download 100 KiB 5 156.7 20.5 155.8 179.6
Download 1 MiB 5 219.6 11.7 213.4 232.6
Metadata 0 B 5 59.7 4.5 57.8 64.4
Metadata 1 KiB 5 61.0 21.8 49.6 85.4
Metadata 100 KiB 5 55.3 10.4 50.7 67.7
Metadata 1 MiB 5 75.6 27.8 67.4 109.0
Upload 0 B 5 162.7 37.0 139.0 207.7
Upload 1 KiB 5 165.2 23.6 152.3 194.1
Upload 100 KiB 5 392.1 235.0 268.7 643.0
Upload 1 MiB 5 387.0 79.5 340.9 486.1
------------------------------------------------------------------------------
Write Throughput
------------------------------------------------------------------------------
Copied a 1 GiB file 5 times for a total transfer size of 5 GiB.
Write throughput: 515.63 Mibit/s.
------------------------------------------------------------------------------
Read Throughput
------------------------------------------------------------------------------
Copied a 1 GiB file 5 times for a total transfer size of 5 GiB.
Read throughput: 123.14 Mibit/s.
------------------------------------------------------------------------------
System Information
------------------------------------------------------------------------------
IP Address:
10.240.133.190
Temporary Directory:
/tmp
Bucket URI:
gs://test_throughput_standard/
gsutil Version:
4.12
boto Version:
2.30.0
Measurement time:
2015-05-21 11:08:50 AM
Google Server:
Google Server IP Addresses:
##.###.##.###
Google Server Hostnames:
Google DNS thinks your IP is:
CPU Count:
8
CPU Load Average:
[0.28, 0.18, 0.08]
Total Memory:
49.91 GiB
Free Memory:
47.9 GiB
TCP segments sent during test:
5165461
TCP segments received during test:
1881727
TCP segments retransmit during test:
3423
Disk Counter Deltas:
disk reads writes rbytes wbytes rtime wtime
dm-0 0 0 0 0 0 0
loop0 0 0 0 0 0 0
loop1 0 0 0 0 0 0
sda1 0 4229 0 1080618496 0 1605286
TCP /proc values:
wmem_default = 212992
wmem_max = 212992
rmem_default = 212992
tcp_timestamps = 1
tcp_window_scaling = 1
tcp_sack = 1
rmem_max = 212992
Boto HTTPS Enabled:
True
Requests routed through proxy:
False
Latency of the DNS lookup for Google Storage server (ms):
1.2
Latencies connecting to Google Storage server IPs (ms):
##.###.##.### = 1.3
------------------------------------------------------------------------------
In-Process HTTP Statistics
------------------------------------------------------------------------------
Total HTTP requests made: 94
HTTP 5xx errors: 0
HTTP connections broken: 0
Availability: 100%
==============================================================================
DIAGNOSTIC RESULTS
==============================================================================
------------------------------------------------------------------------------
Latency
------------------------------------------------------------------------------
Operation Size Trials Mean (ms) Std Dev (ms) Median (ms) 90th % (ms)
========= ========= ====== ========= ============ =========== ===========
Delete 0 B 5 145.1 59.4 117.8 215.2
Delete 1 KiB 5 178.0 51.4 190.6 224.3
Delete 100 KiB 5 98.3 5.0 96.6 104.3
Delete 1 MiB 5 117.7 19.2 112.0 140.2
Download 0 B 5 109.4 38.9 91.9 156.5
Download 1 KiB 5 149.5 41.0 141.9 192.5
Download 100 KiB 5 106.9 20.3 108.6 127.8
Download 1 MiB 5 121.1 16.0 112.2 140.9
Metadata 0 B 5 70.0 10.8 76.8 79.9
Metadata 1 KiB 5 113.8 36.6 124.0 148.7
Metadata 100 KiB 5 63.1 20.2 55.7 86.5
Metadata 1 MiB 5 59.2 4.9 61.3 62.9
Upload 0 B 5 127.5 22.6 117.4 153.6
Upload 1 KiB 5 215.2 54.8 221.4 270.4
Upload 100 KiB 5 229.8 79.2 171.6 329.8
Upload 1 MiB 5 489.8 412.3 295.3 915.4
------------------------------------------------------------------------------
Write Throughput
------------------------------------------------------------------------------
Copied a 1 GiB file 5 times for a total transfer size of 5 GiB.
Write throughput: 503 Mibit/s.
------------------------------------------------------------------------------
Read Throughput
------------------------------------------------------------------------------
Copied a 1 GiB file 5 times for a total transfer size of 5 GiB.
Read throughput: 1.05 Gibit/s.
------------------------------------------------------------------------------
System Information
------------------------------------------------------------------------------
IP Address:
################
Temporary Directory:
/tmp
Bucket URI:
gs://test_throughput_standard/
gsutil Version:
4.12
boto Version:
2.30.0
Measurement time:
2015-05-21 06:20:49 PM
Google Server:
Google Server IP Addresses:
#############
Google Server Hostnames:
Google DNS thinks your IP is:
CPU Count:
8
CPU Load Average:
[0.08, 0.03, 0.05]
Total Memory:
49.91 GiB
Free Memory:
47.95 GiB
TCP segments sent during test:
4958020
TCP segments received during test:
2326124
TCP segments retransmit during test:
2163
Disk Counter Deltas:
disk reads writes rbytes wbytes rtime wtime
dm-0 0 0 0 0 0 0
loop0 0 0 0 0 0 0
loop1 0 0 0 0 0 0
sda1 0 4202 0 1080475136 0 1610000
TCP /proc values:
wmem_default = 212992
wmem_max = 212992
rmem_default = 212992
tcp_timestamps = 1
tcp_window_scaling = 1
tcp_sack = 1
rmem_max = 212992
Boto HTTPS Enabled:
True
Requests routed through proxy:
False
Latency of the DNS lookup for Google Storage server (ms):
1.6
Latencies connecting to Google Storage server IPs (ms):
############ = 1.3
2nd Run:
==============================================================================
DIAGNOSTIC RESULTS
==============================================================================
------------------------------------------------------------------------------
Latency
------------------------------------------------------------------------------
Operation Size Trials Mean (ms) Std Dev (ms) Median (ms) 90th % (ms)
========= ========= ====== ========= ============ =========== ===========
Delete 0 B 5 91.5 14.0 85.1 106.0
Delete 1 KiB 5 125.4 76.2 91.7 203.3
Delete 100 KiB 5 104.4 15.9 99.0 123.2
Delete 1 MiB 5 128.2 36.0 116.4 170.7
Download 0 B 5 60.2 8.3 63.0 68.7
Download 1 KiB 5 62.6 11.3 61.6 74.8
Download 100 KiB 5 103.2 21.3 110.7 123.8
Download 1 MiB 5 137.1 18.5 130.3 159.8
Metadata 0 B 5 73.4 35.9 62.3 114.2
Metadata 1 KiB 5 55.9 18.1 55.3 75.6
Metadata 100 KiB 5 45.7 11.0 42.5 59.1
Metadata 1 MiB 5 49.9 7.9 49.2 58.8
Upload 0 B 5 128.2 24.6 115.5 158.8
Upload 1 KiB 5 153.5 44.1 132.4 206.4
Upload 100 KiB 5 176.8 26.8 165.1 209.7
Upload 1 MiB 5 277.9 80.2 214.7 378.5
------------------------------------------------------------------------------
Write Throughput
------------------------------------------------------------------------------
Copied a 1 GiB file 5 times for a total transfer size of 5 GiB.
Write throughput: 463.76 Mibit/s.
------------------------------------------------------------------------------
Read Throughput
------------------------------------------------------------------------------
Copied a 1 GiB file 5 times for a total transfer size of 5 GiB.
Read throughput: 184.96 Mibit/s.
------------------------------------------------------------------------------
System Information
------------------------------------------------------------------------------
IP Address:
#################
Temporary Directory:
/tmp
Bucket URI:
gs://test_throughput_standard/
gsutil Version:
4.12
boto Version:
2.30.0
Measurement time:
2015-05-21 06:24:31 PM
Google Server:
Google Server IP Addresses:
####################
Google Server Hostnames:
Google DNS thinks your IP is:
CPU Count:
8
CPU Load Average:
[0.19, 0.17, 0.11]
Total Memory:
49.91 GiB
Free Memory:
47.9 GiB
TCP segments sent during test:
5180256
TCP segments received during test:
2034323
TCP segments retransmit during test:
2883
Disk Counter Deltas:
disk reads writes rbytes wbytes rtime wtime
dm-0 0 0 0 0 0 0
loop0 0 0 0 0 0 0
loop1 0 0 0 0 0 0
sda1 0 4209 0 1080480768 0 1604066
TCP /proc values:
wmem_default = 212992
wmem_max = 212992
rmem_default = 212992
tcp_timestamps = 1
tcp_window_scaling = 1
tcp_sack = 1
rmem_max = 212992
Boto HTTPS Enabled:
True
Requests routed through proxy:
False
Latency of the DNS lookup for Google Storage server (ms):
3.5
Latencies connecting to Google Storage server IPs (ms):
################ = 1.1
------------------------------------------------------------------------------
In-Process HTTP Statistics
------------------------------------------------------------------------------
Total HTTP requests made: 94
HTTP 5xx errors: 0
HTTP connections broken: 0
Availability: 100%
3rd run:
==============================================================================
DIAGNOSTIC RESULTS
==============================================================================
------------------------------------------------------------------------------
Latency
------------------------------------------------------------------------------
Operation Size Trials Mean (ms) Std Dev (ms) Median (ms) 90th % (ms)
========= ========= ====== ========= ============ =========== ===========
Delete 0 B 5 157.0 78.3 101.5 254.9
Delete 1 KiB 5 153.5 49.1 178.3 202.5
Delete 100 KiB 5 152.9 47.5 168.0 202.6
Delete 1 MiB 5 110.6 20.4 105.7 134.5
Download 0 B 5 104.4 50.5 66.8 167.6
Download 1 KiB 5 68.1 11.1 68.7 79.2
Download 100 KiB 5 85.5 5.8 86.0 90.8
Download 1 MiB 5 126.6 40.1 100.5 175.0
Metadata 0 B 5 67.9 16.2 61.0 86.6
Metadata 1 KiB 5 49.3 8.6 44.9 59.5
Metadata 100 KiB 5 66.6 35.4 44.2 107.8
Metadata 1 MiB 5 53.9 13.2 52.1 69.4
Upload 0 B 5 136.7 37.1 114.4 183.5
Upload 1 KiB 5 145.5 58.3 116.8 208.2
Upload 100 KiB 5 227.3 37.6 233.3 259.3
Upload 1 MiB 5 274.8 45.2 261.8 328.5
------------------------------------------------------------------------------
Write Throughput
------------------------------------------------------------------------------
Copied a 1 GiB file 5 times for a total transfer size of 5 GiB.
Write throughput: 407.03 Mibit/s.
------------------------------------------------------------------------------
Read Throughput
------------------------------------------------------------------------------
Copied a 1 GiB file 5 times for a total transfer size of 5 GiB.
Read throughput: 629.07 Mibit/s.
------------------------------------------------------------------------------
System Information
------------------------------------------------------------------------------
IP Address:
###############
Temporary Directory:
/tmp
Bucket URI:
gs://test_throughput_standard/
gsutil Version:
4.12
boto Version:
2.30.0
Measurement time:
2015-05-21 06:32:48 PM
Google Server:
Google Server IP Addresses:
################
Google Server Hostnames:
Google DNS thinks your IP is:
CPU Count:
8
CPU Load Average:
[0.11, 0.13, 0.13]
Total Memory:
49.91 GiB
Free Memory:
47.94 GiB
TCP segments sent during test:
5603925
TCP segments received during test:
2438425
TCP segments retransmit during test:
4586
Disk Counter Deltas:
disk reads writes rbytes wbytes rtime wtime
dm-0 0 0 0 0 0 0
loop0 0 0 0 0 0 0
loop1 0 0 0 0 0 0
sda1 0 4185 0 1080353792 0 1603851
TCP /proc values:
wmem_default = 212992
wmem_max = 212992
rmem_default = 212992
tcp_timestamps = 1
tcp_window_scaling = 1
tcp_sack = 1
rmem_max = 212992
Boto HTTPS Enabled:
True
Requests routed through proxy:
False
Latency of the DNS lookup for Google Storage server (ms):
2.2
Latencies connecting to Google Storage server IPs (ms):
############## = 1.6
All things being equal, write performance is generally higher for modern storage systems because of the presence of a caching layer between the application and the disks. That said, what you are seeing is within the expected range for Nearline storage.
I have observed far superior throughput when using Standard storage buckets, though latency did not improve much. Consider using a Standard bucket if your application requires high throughput. If your application is sensitive to latency, then using local storage as a cache (or scratch space) may be the only option (a minimal sketch of that idea is at the end of this answer).
Here is a snippet from one of my experiments on Standard buckets:
------------------------------------------------------------------------------
Latency
------------------------------------------------------------------------------
Operation Size Trials Mean (ms) Std Dev (ms) Median (ms) 90th % (ms)
========= ========= ====== ========= ============ =========== ===========
Delete 0 B 10 91.5 12.4 89.0 98.5
Delete 1 KiB 10 96.4 9.1 95.6 105.6
Delete 100 KiB 10 92.9 22.8 85.3 102.4
Delete 1 MiB 10 86.4 9.1 84.1 93.2
Download 0 B 10 54.2 5.1 55.4 58.8
Download 1 KiB 10 83.3 18.7 78.4 94.9
Download 100 KiB 10 75.2 14.5 68.6 92.6
Download 1 MiB 10 95.0 19.7 86.3 126.7
Metadata 0 B 10 33.5 7.9 31.1 44.8
Metadata 1 KiB 10 36.3 7.2 35.8 46.8
Metadata 100 KiB 10 37.7 9.2 36.6 44.1
Metadata 1 MiB 10 116.1 231.3 36.6 136.1
Upload 0 B 10 151.4 67.5 122.9 195.9
Upload 1 KiB 10 134.2 22.4 127.9 149.3
Upload 100 KiB 10 168.8 20.5 168.6 188.6
Upload 1 MiB 10 213.3 37.6 200.2 262.5
------------------------------------------------------------------------------
Write Throughput
------------------------------------------------------------------------------
Copied 5 1 GiB file(s) for a total transfer size of 10 GiB.
Write throughput: 3.46 Gibit/s.
Parallelism strategy: both
------------------------------------------------------------------------------
Write Throughput With File I/O
------------------------------------------------------------------------------
Copied 5 1 GiB file(s) for a total transfer size of 10 GiB.
Write throughput: 3.9 Gibit/s.
Parallelism strategy: both
------------------------------------------------------------------------------
Read Throughput
------------------------------------------------------------------------------
Copied 5 1 GiB file(s) for a total transfer size of 10 GiB.
Read throughput: 7.04 Gibit/s.
Parallelism strategy: both
------------------------------------------------------------------------------
Read Throughput With File I/O
------------------------------------------------------------------------------
Copied 5 1 GiB file(s) for a total transfer size of 10 GiB.
Read throughput: 1.64 Gibit/s.
Parallelism strategy: both
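To illustrate the local-cache idea mentioned above, here is a minimal read-through cache sketch using the google-cloud-storage Python client (bucket and object names are placeholders; credentials, eviction, and consistency handling are left out):

# Minimal read-through cache sketch for latency-sensitive reads from GCS.
# Assumes the google-cloud-storage client library and application default
# credentials; bucket/object names are placeholders.
import os
from google.cloud import storage

CACHE_DIR = "/tmp/gcs-cache"
client = storage.Client()

def cached_read(bucket_name: str, object_name: str) -> bytes:
    """Return object bytes, downloading from GCS only on a cache miss."""
    local_path = os.path.join(CACHE_DIR, bucket_name, object_name)
    if not os.path.exists(local_path):
        os.makedirs(os.path.dirname(local_path), exist_ok=True)
        blob = client.bucket(bucket_name).blob(object_name)
        blob.download_to_filename(local_path)   # slow path: one GCS round trip
    with open(local_path, "rb") as f:           # fast path: local disk
        return f.read()

# Example: data = cached_read("my-bucket", "path/to/object.bin")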
Hope that is helpful.
I can't understand where my Ceph raw space has gone.
cluster 90dc9682-8f2c-4c8e-a589-13898965b974
health HEALTH_WARN 72 pgs backfill; 26 pgs backfill_toofull; 51 pgs backfilling; 141 pgs stuck unclean; 5 requests are blocked > 32 sec; recovery 450170/8427917 objects degraded (5.341%); 5 near full osd(s)
monmap e17: 3 mons at {enc18=192.168.100.40:6789/0,enc24=192.168.100.43:6789/0,enc26=192.168.100.44:6789/0}, election epoch 734, quorum 0,1,2 enc18,enc24,enc26
osdmap e3326: 14 osds: 14 up, 14 in
pgmap v5461448: 1152 pgs, 3 pools, 15252 GB data, 3831 kobjects
31109 GB used, 7974 GB / 39084 GB avail
450170/8427917 objects degraded (5.341%)
18 active+remapped+backfill_toofull
1011 active+clean
64 active+remapped+wait_backfill
8 active+remapped+wait_backfill+backfill_toofull
51 active+remapped+backfilling
recovery io 58806 kB/s, 14 objects/s
OSD tree (each host has 2 OSD):
# id weight type name up/down reweight
-1 36.45 root default
-2 5.44 host enc26
0 2.72 osd.0 up 1
1 2.72 osd.1 up 0.8227
-3 3.71 host enc24
2 0.99 osd.2 up 1
3 2.72 osd.3 up 1
-4 5.46 host enc22
4 2.73 osd.4 up 0.8
5 2.73 osd.5 up 1
-5 5.46 host enc18
6 2.73 osd.6 up 1
7 2.73 osd.7 up 1
-6 5.46 host enc20
9 2.73 osd.9 up 0.8
8 2.73 osd.8 up 1
-7 0 host enc28
-8 5.46 host archives
12 2.73 osd.12 up 1
13 2.73 osd.13 up 1
-9 5.46 host enc27
10 2.73 osd.10 up 1
11 2.73 osd.11 up 1
Real usage:
/dev/rbd0 14T 7.9T 5.5T 59% /mnt/ceph
Pool size:
osd pool default size = 2
Pools:
ceph osd lspools
0 data,1 metadata,2 rbd,
rados df
pool name category KB objects clones degraded unfound rd rd KB wr wr KB
data - 0 0 0 0 0 0 0 0 0
metadata - 0 0 0 0 0 0 0 0 0
rbd - 15993591918 3923880 0 444545 0 82936 1373339 2711424 849398218
total used 32631712348 3923880
total avail 8351008324
total space 40982720672
Raw usage is 4x the real usage. As I understand it, it should be 2x?
Yes, it should be 2x. I'm not really sure that the real raw usage is 7.9T. Why do you check this value on the mapped disk?
These are my pools:
pool name KB objects clones degraded unfound rd rd KB wr wr KB
admin-pack 7689982 1955 0 0 0 693841 3231750 40068930 353462603
public-cloud 105432663 26561 0 0 0 13001298 638035025 222540884 3740413431
rbdkvm_sata 32624026697 7968550 31783 0 0 4950258575 232374308589 12772302818 278106113879
total used 98289353680 7997066
total avail 34474223648
total space 132763577328
You can see that the total amount of used space is roughly 3 times the used space in the rbdkvm_sata pool.
ceph -s shows the same result too:
pgmap v11303091: 5376 pgs, 3 pools, 31220 GB data, 7809 kobjects
93736 GB used, 32876 GB / 123 TB avail
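Going back to the original pgmap output at the top of the question, the same arithmetic looks like this (a rough sketch using only the numbers posted there):

# Rough check of the replication ratio using the numbers posted in the question.
rados_data_gb = 15252        # "15252 GB data" from the pgmap line
raw_used_gb   = 31109        # "31109 GB used" from the pgmap line
df_used_gb    = 7.9 * 1024   # "7.9T" used according to df inside the mapped RBD

print(f"raw used / RADOS data: {raw_used_gb / rados_data_gb:.2f}x")   # ~2.0x, matches size = 2
print(f"raw used / df used:    {raw_used_gb / df_used_gb:.2f}x")      # ~3.8x, the confusing figure
# The df number only reflects the live filesystem blocks inside the mapped RBD
# image, not everything RADOS is still accounting for on behalf of that image.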
I don't think you have just one RBD image. The result of "ceph osd lspools" indicates that you have 3 pools, and one of the pools is named "metadata" (maybe you were using CephFS). /dev/rbd0 appeared because you mapped that image, but you may well have other images too. To list the images in a pool you can use "rbd list -p <pool-name>", and you can see an image's details with "rbd info -p <pool-name> <image-name>".
I'm running PostgreSQL 9.3 on a machine with 32 GB of RAM and no swap. There are up to 200 clients connected, and there is one other significant process (about 4 GB) running on the box. How do I interpret this error log message? How can I prevent the out-of-memory error? Allow swapping? Add more memory to the machine? Allow fewer client connections? Adjust a setting?
Example pg_top output:
last pid: 6607; load avg: 3.59, 2.32, 2.61; up 16+09:17:29 20:49:51
113 processes: 1 running, 111 sleeping, 1 uninterruptable
CPU states: 22.5% user, 0.0% nice, 4.9% system, 63.2% idle, 9.4% iowait
Memory: 29G used, 186M free, 7648K buffers, 23G cached
DB activity: 2479 tps, 1 rollbs/s, 217 buffer r/s, 99 hit%, 11994 row r/s, 3820 row w/s
DB I/O: 0 reads/s, 0 KB/s, 0 writes/s, 0 KB/s
DB disk: 149.8 GB total, 46.7 GB free (68% used)
Swap:
Example top output showing the only other significant (4 GB) process on the box:
top - 21:05:09 up 16 days, 9:32, 2 users, load average: 2.73, 2.91, 2.88
Tasks: 247 total, 3 running, 244 sleeping, 0 stopped, 0 zombie
%Cpu(s): 22.1 us, 4.1 sy, 0.0 ni, 62.9 id, 9.8 wa, 0.0 hi, 0.7 si, 0.3 st
KiB Mem: 30827220 total, 30642584 used, 184636 free, 7292 buffers
KiB Swap: 0 total, 0 used, 0 free. 23449636 cached Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
7407 postgres 20 0 7604928 10172 7932 S 29.6 0.0 2:51.27 postgres
10469 postgres 20 0 7617716 176032 160328 R 11.6 0.6 0:01.48 postgres
10211 postgres 20 0 7630352 237736 208704 S 10.6 0.8 0:03.64 postgres
18202 elastic+ 20 0 8726984 4.223g 4248 S 9.6 14.4 883:06.79 java
9711 postgres 20 0 7619500 354188 335856 S 7.0 1.1 0:08.03 postgres
3638 postgres 20 0 7634552 1.162g 1.127g S 6.6 4.0 0:50.42 postgres
postgresql.conf:
max_connections = 1000 # (change requires restart)
shared_buffers = 7GB # min 128kB
work_mem = 40MB # min 64kB
maintenance_work_mem = 1GB # min 1MB
effective_cache_size = 20GB
....
log:
ERROR: out of memory
DETAIL: Failed on request of size 67108864.
STATEMENT: SELECT "package_texts".* FROM "package_texts" WHERE "package_texts"."id" = $1 LIMIT 1
TopMemoryContext: 798624 total in 83 blocks; 11944 free (21 chunks); 786680 used
TopTransactionContext: 8192 total in 1 blocks; 7328 free (0 chunks); 864 used
Prepared Queries: 253952 total in 5 blocks; 136272 free (18 chunks); 117680 used
Type information cache: 24240 total in 2 blocks; 3744 free (0 chunks); 20496 used
Operator lookup cache: 24576 total in 2 blocks; 11888 free (5 chunks); 12688 used
TableSpace cache: 8192 total in 1 blocks; 3216 free (0 chunks); 4976 used
MessageContext: 8192 total in 1 blocks; 6976 free (0 chunks); 1216 used
Operator class cache: 8192 total in 1 blocks; 1680 free (0 chunks); 6512 used
smgr relation table: 24576 total in 2 blocks; 5696 free (4 chunks); 18880 used
TransactionAbortContext: 32768 total in 1 blocks; 32736 free (0 chunks); 32 used
Portal hash: 8192 total in 1 blocks; 1680 free (0 chunks); 6512 used
PortalMemory: 8192 total in 1 blocks; 7888 free (0 chunks); 304 used
PortalHeapMemory: 1024 total in 1 blocks; 568 free (0 chunks); 456 used
ExecutorState: 32928 total in 3 blocks; 15616 free (5 chunks); 17312 used
printtup: 34002024 total in 2 blocks; 7056 free (7 chunks); 33994968 used
ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
Relcache by OID: 24576 total in 2 blocks; 12832 free (3 chunks); 11744 used
CacheMemoryContext: 1372624 total in 24 blocks; 38832 free (0 chunks); 1333792 used
CachedPlanSource: 7168 total in 3 blocks; 3080 free (1 chunks); 4088 used
CachedPlanQuery: 7168 total in 3 blocks; 2992 free (1 chunks); 4176 used
CachedPlanSource: 15360 total in 4 blocks; 7128 free (5 chunks); 8232 used
CachedPlanQuery: 15360 total in 4 blocks; 3320 free (1 chunks); 12040 used
CachedPlanSource: 3072 total in 2 blocks; 552 free (0 chunks); 2520 used
CachedPlanQuery: 7168 total in 3 blocks; 1592 free (1 chunks); 5576 used
CachedPlanSource: 3072 total in 2 blocks; 536 free (0 chunks); 2536 used
... 2 Thousand snipped lines of CachedPlans ...
CachedPlanSource: 15360 total in 4 blocks; 7128 free (5 chunks); 8232 used
CachedPlanQuery: 15360 total in 4 blocks; 3320 free (1 chunks); 12040 used
CachedPlanSource: 7168 total in 3 blocks; 3880 free (3 chunks); 3288 used
CachedPlanQuery: 7168 total in 3 blocks; 4032 free (1 chunks); 3136 used
CachedPlanSource: 7168 total in 3 blocks; 3936 free (3 chunks); 3232 used
CachedPlanQuery: 7168 total in 3 blocks; 4032 free (1 chunks); 3136 used
CachedPlanSource: 7168 total in 3 blocks; 3080 free (1 chunks); 4088 used
CachedPlanQuery: 7168 total in 3 blocks; 2992 free (1 chunks); 4176 used
CachedPlanSource: 7168 total in 3 blocks; 3872 free (2 chunks); 3296 used
CachedPlanQuery: 7168 total in 3 blocks; 4032 free (1 chunks); 3136 used
pg_toast_17305_index: 1024 total in 1 blocks; 16 free (0 chunks); 1008 used
index_package_raises_on_natural_key: 3072 total in 2 blocks; 1648 free (1 chunks); 1424 used
index_package_extensions_on_natural_key: 3072 total in 2 blocks; 1736 free (2 chunks); 1336 used
index_package_mixins_on_natural_key: 3072 total in 2 blocks; 1736 free (2 chunks); 1336 used
index_package_mixins_on_includes_id: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
package_texts_pkey: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
index_package_file_objects_on_natural_key: 3072 total in 2 blocks; 1736 free (2 chunks); 1336 used
index_package_symbols_on_natural_key: 3072 total in 2 blocks; 1136 free (1 chunks); 1936 used
index_package_symbols_on_full_name: 3072 total in 2 blocks; 1736 free (2 chunks); 1336 used
index_package_symbols_on_alias_for_id: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
package_symbols_pkey: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
pg_toast_17313_index: 1024 total in 1 blocks; 16 free (0 chunks); 1008 used
index_packages_on_natural_key: 3072 total in 2 blocks; 1736 free (2 chunks); 1336 used
packages_pkey: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
index_package_files_on_natural_key: 1024 total in 1 blocks; 16 free (0 chunks); 1008 used
package_files_pkey: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
pg_toast_2619_index: 1024 total in 1 blocks; 16 free (0 chunks); 1008 used
index_projects_on_user_id: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
index_projects_on_type: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
index_projects_on_name_and_type: 1024 total in 1 blocks; 16 free (0 chunks); 1008 used
index_projects_on_claim_ticket: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
ruby_gem_metadata_pkey: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
pg_constraint_contypid_index: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
pg_constraint_conrelid_index: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
pg_constraint_conname_nsp_index: 1024 total in 1 blocks; 16 free (0 chunks); 1008 used
pg_attrdef_oid_index: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
pg_attrdef_adrelid_adnum_index: 1024 total in 1 blocks; 16 free (0 chunks); 1008 used
pg_index_indrelid_index: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
pg_db_role_setting_databaseid_rol_index: 1024 total in 1 blocks; 64 free (0 chunks); 960 used
pg_opclass_am_name_nsp_index: 3072 total in 2 blocks; 1736 free (2 chunks); 1336 used
pg_foreign_data_wrapper_name_index: 1024 total in 1 blocks; 200 free (0 chunks); 824 used
pg_enum_oid_index: 1024 total in 1 blocks; 200 free (0 chunks); 824 used
pg_class_relname_nsp_index: 1024 total in 1 blocks; 16 free (0 chunks); 1008 used
pg_foreign_server_oid_index: 1024 total in 1 blocks; 200 free (0 chunks); 824 used
pg_statistic_relid_att_inh_index: 3072 total in 2 blocks; 1736 free (2 chunks); 1336 used
pg_cast_source_target_index: 1024 total in 1 blocks; 16 free (0 chunks); 1008 used
pg_language_name_index: 1024 total in 1 blocks; 200 free (0 chunks); 824 used
pg_collation_oid_index: 1024 total in 1 blocks; 200 free (0 chunks); 824 used
pg_amop_fam_strat_index: 3072 total in 2 blocks; 1736 free (2 chunks); 1336 used
pg_index_indexrelid_index: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
pg_ts_template_tmplname_index: 1024 total in 1 blocks; 64 free (0 chunks); 960 used
pg_ts_config_map_index: 3072 total in 2 blocks; 1784 free (2 chunks); 1288 used
pg_opclass_oid_index: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
pg_foreign_data_wrapper_oid_index: 1024 total in 1 blocks; 200 free (0 chunks); 824 used
pg_event_trigger_evtname_index: 1024 total in 1 blocks; 200 free (0 chunks); 824 used
pg_ts_dict_oid_index: 1024 total in 1 blocks; 200 free (0 chunks); 824 used
pg_event_trigger_oid_index: 1024 total in 1 blocks; 200 free (0 chunks); 824 used
pg_conversion_default_index: 3072 total in 2 blocks; 1784 free (2 chunks); 1288 used
pg_operator_oprname_l_r_n_index: 3072 total in 2 blocks; 1736 free (2 chunks); 1336 used
pg_trigger_tgrelid_tgname_index: 1024 total in 1 blocks; 64 free (0 chunks); 960 used
pg_enum_typid_label_index: 1024 total in 1 blocks; 64 free (0 chunks); 960 used
pg_ts_config_oid_index: 1024 total in 1 blocks; 200 free (0 chunks); 824 used
pg_user_mapping_oid_index: 1024 total in 1 blocks; 200 free (0 chunks); 824 used
pg_opfamily_am_name_nsp_index: 3072 total in 2 blocks; 1784 free (2 chunks); 1288 used
pg_foreign_table_relid_index: 1024 total in 1 blocks; 200 free (0 chunks); 824 used
pg_type_oid_index: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
pg_aggregate_fnoid_index: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
pg_constraint_oid_index: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
pg_rewrite_rel_rulename_index: 1024 total in 1 blocks; 64 free (0 chunks); 960 used
pg_ts_parser_prsname_index: 1024 total in 1 blocks; 64 free (0 chunks); 960 used
pg_ts_config_cfgname_index: 1024 total in 1 blocks; 64 free (0 chunks); 960 used
pg_ts_parser_oid_index: 1024 total in 1 blocks; 200 free (0 chunks); 824 used
pg_operator_oid_index: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
pg_namespace_nspname_index: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
pg_ts_template_oid_index: 1024 total in 1 blocks; 200 free (0 chunks); 824 used
pg_amop_opr_fam_index: 3072 total in 2 blocks; 1736 free (2 chunks); 1336 used
pg_default_acl_role_nsp_obj_index: 3072 total in 2 blocks; 1784 free (2 chunks); 1288 used
pg_collation_name_enc_nsp_index: 3072 total in 2 blocks; 1784 free (2 chunks); 1288 used
pg_range_rngtypid_index: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
pg_ts_dict_dictname_index: 1024 total in 1 blocks; 64 free (0 chunks); 960 used
pg_type_typname_nsp_index: 1024 total in 1 blocks; 16 free (0 chunks); 1008 used
pg_opfamily_oid_index: 1024 total in 1 blocks; 200 free (0 chunks); 824 used
pg_class_oid_index: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
pg_proc_proname_args_nsp_index: 3072 total in 2 blocks; 1736 free (2 chunks); 1336 used
pg_attribute_relid_attnum_index: 1024 total in 1 blocks; 16 free (0 chunks); 1008 used
pg_proc_oid_index: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
pg_language_oid_index: 1024 total in 1 blocks; 200 free (0 chunks); 824 used
pg_namespace_oid_index: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
pg_amproc_fam_proc_index: 3072 total in 2 blocks; 1736 free (2 chunks); 1336 used
pg_foreign_server_name_index: 1024 total in 1 blocks; 200 free (0 chunks); 824 used
pg_attribute_relid_attnam_index: 1024 total in 1 blocks; 16 free (0 chunks); 1008 used
pg_conversion_oid_index: 1024 total in 1 blocks; 200 free (0 chunks); 824 used
pg_user_mapping_user_server_index: 1024 total in 1 blocks; 64 free (0 chunks); 960 used
pg_conversion_name_nsp_index: 1024 total in 1 blocks; 64 free (0 chunks); 960 used
pg_authid_oid_index: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
pg_auth_members_member_role_index: 1024 total in 1 blocks; 64 free (0 chunks); 960 used
pg_tablespace_oid_index: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
pg_database_datname_index: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
pg_auth_members_role_member_index: 1024 total in 1 blocks; 64 free (0 chunks); 960 used
pg_database_oid_index: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
pg_authid_rolname_index: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
MdSmgr: 24576 total in 2 blocks; 13984 free (0 chunks); 10592 used
ident parser context: 0 total in 0 blocks; 0 free (0 chunks); 0 used
hba parser context: 7168 total in 3 blocks; 304 free (1 chunks); 6864 used
LOCALLOCK hash: 8192 total in 1 blocks; 1680 free (0 chunks); 6512 used
Timezones: 83472 total in 2 blocks; 3744 free (0 chunks); 79728 used
ErrorContext: 8192 total in 1 blocks; 8160 free (6 chunks); 32 used
If I'm reading the output of your top correctly, it wasn't taken at a point when you were actually out of memory.
The actual error looks unremarkable: the failed request is not huge (67108864 bytes, i.e. 64 MB), so presumably the machine was simply out of memory at that point.
Let's take a quick look at your settings:
max_connections = 1000 # (change requires restart)
work_mem = 40MB # min 64kB
So you are of the opinion that you can support 1,000 concurrent queries, each using, say, 10 MB + 40 MB (some might use multiples of 40 MB, but let's be reasonable). That suggests a machine with more than 500 cores and something like 100 GB of RAM. That's not the case.
So take your number of cores and double it: that's a reasonable value for the maximum number of connections. It allows one query to run on each core while another waits for I/O. Then place a connection pooler in front of the DB if you need to (pgbouncer, or Java's connection pooling).
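As an illustration of the pooling idea from a Python client (a sketch, assuming psycopg2; pgbouncer achieves the same thing outside the application), the point is simply to cap server connections at roughly twice the core count instead of letting every client open its own backend:

# Sketch of application-side connection pooling with psycopg2: cap the number of
# server connections at roughly 2 x CPU cores instead of letting every client
# open its own backend.
import os
from psycopg2 import pool

MAX_CONNS = 2 * (os.cpu_count() or 4)   # "number of cores, doubled" rule of thumb

db_pool = pool.ThreadedConnectionPool(
    minconn=1,
    maxconn=MAX_CONNS,
    dsn="dbname=mydb user=myuser host=localhost",   # placeholder DSN
)

def run_query(sql, params=None):
    conn = db_pool.getconn()
    try:
        with conn.cursor() as cur:
            cur.execute(sql, params)
            rows = cur.fetchall()
        conn.commit()
        return rows
    finally:
        db_pool.putconn(conn)   # return the connection to the pool instead of closing it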
Then, you might even consider increasing work_mem if you need to.
Oh - perfectly reasonable to run without swap enabled. Once you start swapping you are in a world of pain anyway as regards database usage.
Edit: expand on work_mem vs shared_buffers
If in doubt, always refer to the documentation.
The shared_buffers value is, as the name suggests, shared between backends. work_mem, on the other hand, is not only per backend, it's actually per sort, so one query might use three or four times that amount if it is doing sorts on three subqueries.
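Putting the two settings together, the worst case is easy to estimate; a rough sketch (the per-query sort count and per-backend overhead are assumptions, not measurements):

# Rough worst-case memory estimate for the settings in the question.
# work_mem is per sort/hash node, per backend, so multiply by an assumed
# number of such nodes per query; this is an upper-bound sketch.
shared_buffers_gb       = 7
max_connections         = 1000
work_mem_mb             = 40
sorts_per_query         = 2      # assumption: a couple of sort/hash nodes per query
per_backend_overhead_mb = 10     # assumption: rough per-backend baseline

worst_case_gb = (shared_buffers_gb
                 + max_connections * (per_backend_overhead_mb
                                      + sorts_per_query * work_mem_mb) / 1024)
print(f"worst case ~ {worst_case_gb:.0f} GB on a 32 GB machine")   # ~95 GB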
Restarting the server helped in my case.