Ceph "PGs not deep-scrubbed in time" count keeps increasing
I noticed this about 4 days ago and don't know what to do. The problem is as follows:
I have a 6-node Ceph cluster with 3 monitors and 84 OSDs, 72 × 7200 rpm spinning disks and 12 × NVMe SSDs for journaling. All scrub settings are at their default values. Every PG in the cluster is active+clean and every cluster stat is green. Yet the number of PGs not deep-scrubbed in time keeps increasing and is at 96 right now. Output from ceph -s:
  cluster:
    id:     xxxxxxxxxxxxxxxxx
    health: HEALTH_WARN
            1 large omap objects
            96 pgs not deep-scrubbed in time
  services:
    mon: 3 daemons, quorum mon1,mon2,mon3 (age 6h)
    mgr: mon2(active, since 2w), standbys: mon1
    mds: cephfs:1 {0=mon2=up:active} 2 up:standby
    osd: 84 osds: 84 up (since 4d), 84 in (since 3M)
    rgw: 3 daemons active (mon1, mon2, mon3)
  data:
    pools:   12 pools, 2006 pgs
    objects: 151.89M objects, 218 TiB
    usage:   479 TiB used, 340 TiB / 818 TiB avail
    pgs:     2006 active+clean
  io:
    client: 1.3 MiB/s rd, 14 MiB/s wr, 93 op/s rd, 259 op/s wr
How do I solve this problem? The ceph health detail output shows that these not-deep-scrubbed alerts started on January 25th, but I didn't notice them before. I only noticed when an OSD went down for 30 seconds and came back up. Might that be related to this issue? Will it just resolve itself? Should I tamper with the scrub configuration? For example, how much client-side performance would I lose if I increased osd_max_scrubs from 1 to 2?
Usually the cluster deep-scrubs itself during low-I/O intervals. By default, every PG has to be deep-scrubbed once a week. PGs on an OSD that is down can't be deep-scrubbed, of course, so an outage can cause some delay.
You could run something like this to see which PGs are behind and if they're all on the same OSD(s):
ceph pg dump pgs | awk '{print $1" "$23}' | column -t
Sort the output if necessary, and you can issue a manual deep-scrub on one of the affected PGs to see if the number decreases and if the deep-scrub itself works.
ceph pg deep-scrub <PG_ID>
Also, please add the output of ceph osd pool ls detail so we can see whether any flags are set.
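The check described above (which PGs are behind, and whether they cluster on the same OSDs) can be sketched in Python. This is a minimal sketch, not a tested tool: it assumes the PG stats are available as a list of dicts with the fields pgid, last_deep_scrub_stamp and acting, as in the JSON output of ceph pg dump; the exact JSON layout and timestamp format vary between Ceph releases, and the sample data below is made up.

```python
from collections import Counter
from datetime import datetime, timedelta


def parse_stamp(stamp):
    """Parse a Ceph timestamp, ignoring fractional seconds and time zone."""
    return datetime.strptime(stamp[:19].replace("T", " "), "%Y-%m-%d %H:%M:%S")


def overdue_pgs(pg_stats, now, max_age=timedelta(days=7)):
    """Return (pgid, age) pairs older than max_age, oldest first."""
    overdue = []
    for pg in pg_stats:
        age = now - parse_stamp(pg["last_deep_scrub_stamp"])
        if age > max_age:
            overdue.append((pg["pgid"], age))
    return sorted(overdue, key=lambda item: item[1], reverse=True)


def overdue_per_osd(pg_stats, now, max_age=timedelta(days=7)):
    """Count, per OSD, how many overdue PGs have that OSD in their acting set."""
    late = {pgid for pgid, _ in overdue_pgs(pg_stats, now, max_age)}
    counts = Counter()
    for pg in pg_stats:
        if pg["pgid"] in late:
            counts.update(pg["acting"])
    return counts


# Made-up sample in the assumed shape; real data would come from the cluster.
SAMPLE = [
    {"pgid": "1.0", "last_deep_scrub_stamp": "2021-01-20 03:12:45.123456", "acting": [3, 17, 42]},
    {"pgid": "1.1", "last_deep_scrub_stamp": "2021-02-08T22:01:10.000000+0000", "acting": [5, 17, 60]},
    {"pgid": "1.2", "last_deep_scrub_stamp": "2021-01-25 11:00:00.000000", "acting": [3, 8, 42]},
]
NOW = datetime(2021, 2, 10, 12, 0, 0)

if __name__ == "__main__":
    for pgid, age in overdue_pgs(SAMPLE, NOW):
        print(pgid, age.days, "days since last deep scrub")
    print(overdue_per_osd(SAMPLE, NOW).most_common())
```

If a handful of OSDs dominate the per-OSD counts, those disks are worth a closer look; an even spread suggests the cluster as a whole simply cannot keep up with the interval.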
You can set the deep-scrub interval to 2 weeks to stretch the deep-scrub window.
Instead of
osd_deep_scrub_interval = 604800
use:
osd_deep_scrub_interval = 1209600
Mr. Eblock's idea of manually forcing a deep scrub on some of the PGs is a good one, because it spreads the work evenly across the 2 weeks.
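To get a feeling for what the two interval values mean for this cluster, here is a small back-of-the-envelope calculation in Python. It only computes the average number of deep scrubs needed to cover the 2006 PGs from the question once per interval; it is a rough sketch that ignores PG size, scrub time windows and client load.

```python
SECONDS_PER_DAY = 86400


def required_deep_scrub_rate(num_pgs, interval_seconds):
    """Average deep scrubs per day and per hour needed to cover num_pgs once per interval."""
    days = interval_seconds / SECONDS_PER_DAY
    per_day = num_pgs / days
    return per_day, per_day / 24


if __name__ == "__main__":
    # 604800 s is the one-week default, 1209600 s the suggested two weeks.
    for interval in (604800, 1209600):
        per_day, per_hour = required_deep_scrub_rate(2006, interval)
        print(f"{interval} s = {interval / SECONDS_PER_DAY:.0f} days: "
              f"{per_day:.1f} PGs/day, {per_hour:.1f} PGs/hour")
```

Doubling the interval halves the average rate the cluster has to sustain, which is why it can stop the backlog from growing without touching osd_max_scrubs.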
You have 2 options:
Increase the interval between deep scrubs.
Control deep scrubbing manually with a standalone script.
I've written a simple PHP script which takes care of deep scrubbing for me: https://gist.github.com/ethaniel/5db696d9c78516308b235b0cb904e4ad
It lists all the PGs, picks one PG whose last deep scrub was more than 2 weeks ago (the script takes the oldest one), checks that the OSDs the PG sits on are not being used for another scrub (i.e. are in active+clean state), and only then starts a deep scrub on that PG. Otherwise it goes looking for another PG.
I have osd_max_scrubs set to 1 (otherwise OSD daemons start crashing due to a bug in Ceph), so this script works nicely with the regular scheduler - whichever starts the scrubbing on a PG-OSD first, wins.
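The selection logic described above can be sketched as follows. This is an illustrative Python version, not the linked PHP script: the field names (pgid, state, acting, last_deep_scrub_stamp) are assumed to match the JSON from ceph pg dump, the sample data is made up, and the sketch only prints the command it would run instead of executing it.

```python
from datetime import datetime, timedelta


def parse_stamp(stamp):
    """Parse a Ceph timestamp, ignoring fractional seconds and time zone."""
    return datetime.strptime(stamp[:19].replace("T", " "), "%Y-%m-%d %H:%M:%S")


def pick_pg_to_deep_scrub(pg_stats, now, max_age=timedelta(days=14)):
    """Return the oldest overdue PG whose OSDs are not scrubbing, or None."""
    # OSDs that already take part in a scrub or deep scrub.
    busy_osds = set()
    for pg in pg_stats:
        if "scrubbing" in pg["state"]:
            busy_osds.update(pg["acting"])

    oldest_first = sorted(pg_stats, key=lambda pg: parse_stamp(pg["last_deep_scrub_stamp"]))
    for pg in oldest_first:
        if now - parse_stamp(pg["last_deep_scrub_stamp"]) <= max_age:
            break  # sorted oldest first, so no later PG is overdue either
        if pg["state"] == "active+clean" and busy_osds.isdisjoint(pg["acting"]):
            return pg["pgid"]
    return None


# Made-up sample: 2.a is oldest but shares OSD 2 with the PG being scrubbed.
SAMPLE = [
    {"pgid": "2.a", "state": "active+clean", "acting": [1, 2, 3],
     "last_deep_scrub_stamp": "2021-01-10 04:00:00.000000"},
    {"pgid": "2.b", "state": "active+clean+scrubbing+deep", "acting": [2, 7, 9],
     "last_deep_scrub_stamp": "2021-02-09 01:00:00.000000"},
    {"pgid": "2.c", "state": "active+clean", "acting": [4, 5, 6],
     "last_deep_scrub_stamp": "2021-01-15 04:00:00.000000"},
    {"pgid": "2.d", "state": "active+clean", "acting": [10, 11, 12],
     "last_deep_scrub_stamp": "2021-02-05 04:00:00.000000"},
]
NOW = datetime(2021, 2, 10, 12, 0, 0)

if __name__ == "__main__":
    pgid = pick_pg_to_deep_scrub(SAMPLE, NOW)
    if pgid is not None:
        print(f"ceph pg deep-scrub {pgid}")
```

Run from cron every few minutes, a loop like this starts at most one extra deep scrub per run, which keeps it compatible with osd_max_scrubs set to 1.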
Related
ceph active+undersized warning
Setup: 6 node cluster with 3 hosts with 12 hdd osd(s) each (36 total) and other 3 hosts with 24 ssd osd(s) each (72 total). 2 erasure code pool that takes 100% of data one for ssd class and the other for hdd class.
# hdd k=22 m=14 64% overhead. Withstands 14 hdd osd failures. This includes
# tolerating one host failure and additional 2 osd failures on top.
ceph osd erasure-code-profile set hdd_k22_m14_osd \
    k=22 \
    m=14 \
    crush-device-class=hdd \
    crush-failure-domain=osd
# ssd k=44 m=28 64% overhead. Withstands 28 ssd osd failures. This includes
# tolerating one host failure and additional 4 osd failures on top.
ceph osd erasure-code-profile set ssd_k44_m28_osd \
    k=44 \
    m=28 \
    crush-device-class=ssd \
    crush-failure-domain=osd
# creating erasure code pool min_size=k+2
ceph osd pool create cephfs.vol1.test.hdd.ec erasure hdd_k22_m14_osd
ceph osd pool set cephfs.vol1.test.hdd.ec allow_ec_overwrites true
ceph osd pool set cephfs.vol1.test.hhd.ec pg_num 128
ceph osd pool set cephfs.vol1.test.hhd.ec pgp_num 128
ceph osd pool set cephfs.vol1.test.hdd.ec min_size 24
# creating erasure code pool
ceph osd pool create cephfs.vol1.test.ssd.ec erasure ssd_k44_m28_osd
ceph osd pool set cephfs.vol1.test.ssd.ec allow_ec_overwrites true
ceph osd pool set cephfs.vol1.test.ssd.ec pg_num 128
ceph osd pool set cephfs.vol1.test.ssd.ec pgp_num 128
ceph osd pool set cephfs.vol1.test.ssd.ec min_size 46
k=22 m=14; 128 pgs; failure domain = osd. The reason for this is for ceph cluster to account for a full host failure (12osds). All osds have the same storage space and same storage class (hdd).
# ceph osd erasure-code-profile get hdd_k22_m14_osd
crush-device-class=hdd
crush-failure-domain=osd
crush-root=default
jerasure-per-chunk-alignment=false
k=22
m=14
plugin=jerasure
technique=reed_sol_van
w=8
# ceph osd pool ls detail | grep hdd
pool 16 'cephfs.vol1.test.hdd.ec' erasure profile hdd_k22_m14_osd size 36 min_size 24 crush_rule 7 object_hash rjenkins pg_num 253 pgp_num 241 pg_num_target 128 pgp_num_target 128 autoscale_mode on last_change 17748 lfor 0/7144/7142 flags hashpspool,ec_overwrites stripe_width 90112 target_size_bytes 344147139493888 application cephfs
# ceph osd pool ls detail | grep ssd
pool 17 'cephfs.vol1.test.ssd.ec' erasure profile ssd_k44_m28_osd size 72 min_size 46 crush_rule 8 object_hash rjenkins pg_num 128 pgp_num 128 autoscale_mode on last_change 13591 lfor 0/0/7109 flags hashpspool,ec_overwrites stripe_width 180224 target_size_bytes 113249697660928 application cephfs
{
    "rule_id": 7,
    "rule_name": "cephfs.vol1.test.hdd.ec",
    "ruleset": 7,
    "type": 3,
    "min_size": 3,
    "max_size": 36,
    "steps": [
        { "op": "set_chooseleaf_tries", "num": 5 },
        { "op": "set_choose_tries", "num": 100 },
        { "op": "take", "item": -2, "item_name": "default~hdd" },
        { "op": "choose_indep", "num": 0, "type": "osd" },
        { "op": "emit" }
    ]
}
{
    "rule_id": 8,
    "rule_name": "cephfs.vol1.test.ssd.ec",
    "ruleset": 8,
    "type": 3,
    "min_size": 3,
    "max_size": 72,
    "steps": [
        { "op": "set_chooseleaf_tries", "num": 5 },
        { "op": "set_choose_tries", "num": 100 },
        { "op": "take", "item": -12, "item_name": "default~ssd" },
        { "op": "choose_indep", "num": 0, "type": "osd" },
        { "op": "emit" }
    ]
}
Issues: However, this setup does not seem to work and gives:
# ceph -s
  cluster:
    id:     <id>
    health: HEALTH_WARN
            Degraded data redundancy: 19 pgs undersized
            20 pgs not deep-scrubbed in time
# ceph health detail
pg 17.0 is stuck undersized for 7h, current state active+undersized, last acting
[92,76,44,84,46,72,102,104,59,62,60,89,40,47,65,38,95,79,43,67,91,69,80,83,94,48,42,90,88,37,49,75,53,58,93,45,96,61,106,64,52,70,77,99,107,63,97,100,56,98,87,105,36,68,103,55,85,2147483647,82,66,51,101,81,54,78,74,39,50,73,71,57,41] pg 17.1 is stuck undersized for 7h, current state active+undersized, last acting [69,59,75,104,79,83,89,51,76,102,37,54,95,60,105,87,43,91,70,101,45,94,68,57,72,107,53,49,40,50,65,61,88,84,73,58,47,96,48,100,103,42,52,71,63,86,39,64,97,41,46,81,67,36,93,82,62,38,98,90,85,2147483647,44,99,55,80,78,56,92,66,106,77] pg 17.4 is stuck undersized for 7h, current state active+undersized, last acting [46,84,96,39,38,94,82,67,103,63,50,52,106,42,61,64,45,62,74,79,101,48,2147483647,85,105,59,72,81,91,60,56,71,102,77,70,57,54,100,49,75,36,53,92,98,58,83,51,69,44,89,65,47,43,41,99,107,90,76,37,68,80,40,55,93,104,66,95,78,86,97,73,88] pg 17.5 is stuck undersized for 7h, current state active+undersized, last acting [63,64,93,82,69,90,60,102,89,104,50,103,55,52,66,98,99,65,100,48,53,76,68,62,84,87,57,42,75,46,83,71,43,92,51,44,80,56,61,88,77,37,38,39,81,74,105,49,85,41,91,36,79,54,45,94,67,101,72,96,47,73,86,2147483647,106,97,70,107,59,78,40,95] pg 17.6 is stuck undersized for 7h, current state active+undersized, last acting [48,67,88,105,97,78,92,79,58,59,46,98,91,45,96,52,38,57,41,81,73,49,89,55,86,68,37,39,77,47,83,76,54,94,44,70,43,62,42,60,104,64,84,85,63,102,87,90,71,80,103,100,101,40,50,72,75,95,51,82,53,36,65,61,106,93,2147483647,99,56,74,107,66] pg 17.7 is stuck undersized for 7h, current state active+undersized, last acting [69,79,84,103,37,60,75,42,67,40,65,90,99,85,63,91,83,58,104,56,43,62,55,86,82,72,73,106,87,68,57,50,64,96,41,39,61,71,93,97,59,92,102,81,38,98,48,51,95,101,52,74,77,53,44,49,45,107,78,88,70,105,46,54,80,36,47,89,76,66,100,2147483647] pg 17.8 is stuck undersized for 7h, current state active+undersized, last acting 
[71,78,99,81,43,58,54,86,95,82,52,46,73,69,97,39,93,88,59,105,103,91,50,101,102,49,51,64,98,90,84,75,42,107,56,83,60,67,70,55,104,61,66,79,96,74,63,72,92,53,2147483647,100,62,77,45,87,85,89,76,80,37,44,68,57,41,94,40,48,38,47,65,36] pg 17.a is stuck undersized for 7h, current state active+undersized, last acting [65,42,58,61,52,57,60,85,100,75,98,40,74,79,38,72,91,48,93,80,54,41,83,95,76,49,46,71,55,88,63,94,73,44,45,102,89,107,92,86,53,103,47,43,56,82,104,106,51,37,36,39,99,97,59,81,64,66,84,96,90,77,87,78,50,105,62,67,69,70,101,2147483647] pg 17.b is stuck undersized for 7h, current state active+undersized, last acting [47,54,59,93,91,36,58,98,39,60,46,49,78,64,88,100,66,107,92,83,99,56,63,87,41,96,89,45,51,76,69,71,103,94,90,50,85,68,81,73,75,105,40,79,84,44,80,37,42,52,95,70,62,55,82,53,38,72,65,2147483647,48,106,43,101,104,86,61,57,102,77,74,67] pg 17.d is stuck undersized for 7h, current state active+undersized, last acting [92,83,39,44,75,98,96,61,41,64,38,97,63,37,70,68,87,90,36,77,73,60,69,55,93,47,2147483647,56,102,50,54,91,82,58,43,67,53,86,81,95,105,52,85,51,79,46,62,49,80,40,57,42,104,107,78,84,94,103,48,72,88,74,71,45,101,99,65,59,106,66,100,76] pg 17.10 is stuck undersized for 7h, current state active+undersized, last acting [96,94,52,46,43,50,82,97,75,84,53,106,91,78,64,65,42,95,98,87,69,99,77,59,76,2147483647,49,70,79,90,105,81,107,86,45,39,55,93,92,56,72,37,101,36,85,100,67,47,104,74,63,38,48,68,44,60,57,61,40,88,51,62,71,83,58,89,103,80,102,41,54,73] pg 17.13 is stuck undersized for 7h, current state active+undersized, last acting [46,55,50,77,73,97,45,57,67,95,103,38,90,106,66,87,36,44,82,49,100,107,84,88,102,40,65,60,43,70,42,86,48,39,71,74,99,56,59,96,72,92,101,62,93,51,47,52,85,53,104,76,37,79,58,94,81,64,83,68,69,63,54,80,98,61,78,105,2147483647,91,75,41] pg 17.14 is stuck undersized for 7h, current state active+undersized, last acting 
[105,62,66,55,53,51,97,50,65,90,104,56,74,52,70,100,42,107,101,40,58,63,44,49,59,69,38,80,73,102,36,76,106,75,39,99,92,60,94,91,89,41,46,72,88,2147483647,87,98,71,78,54,68,84,95,57,103,81,82,96,61,67,79,37,83,86,47,93,77,64,48,85,45] pg 17.19 is stuck undersized for 7h, current state active+undersized, last acting [50,90,73,99,45,101,72,93,85,47,59,78,95,83,96,58,76,39,43,49,44,92,91,102,81,74,62,86,54,56,103,87,70,105,75,48,88,97,67,38,57,46,36,84,107,66,65,69,106,41,80,42,52,63,64,61,98,100,79,60,51,94,53,89,37,68,40,55,77,71,2147483647,104] pg 17.1a is stuck undersized for 7h, current state active+undersized, last acting [70,95,59,78,87,85,66,68,40,63,90,73,89,101,86,80,82,50,107,74,55,49,72,48,43,104,62,97,81,94,103,58,77,52,2147483647,102,53,75,106,91,88,57,42,61,99,79,39,54,38,96,37,45,76,105,51,84,60,47,93,98,83,100,64,65,44,36,56,71,67,46,41,69] pg 17.1b is stuck undersized for 7h, current state active+undersized, last acting [84,37,62,58,87,36,94,77,53,55,45,93,43,82,75,78,101,104,95,106,98,107,61,99,38,46,52,76,56,51,66,83,42,80,63,81,79,86,100,90,88,65,47,60,44,103,2147483647,73,59,69,102,67,57,70,72,41,105,54,64,91,97,48,74,89,92,96,40,71,50,39,49,68] pg 17.1e is stuck undersized for 7h, current state active+undersized, last acting [103,48,71,70,104,47,77,56,55,89,68,97,72,82,36,69,40,83,107,38,80,76,39,100,92,79,57,37,42,66,98,53,62,43,84,95,75,105,59,94,106,45,88,54,96,67,91,46,44,58,2147483647,93,73,64,85,78,101,65,50,99,74,102,49,51,41,61,87,90,52,63,60,81] And the external cluster rook pvc mounts cannot write to it. What was done wrong here? Why are the pg(s) undersized?
This is a really bad design; you should start from scratch. First, the number of chunks you're creating is far too high, and there's no need for it. It's also a bad choice to involve all hosts, because when a host or even a single OSD fails there is no room for recovery, so your cluster stays in a degraded state until the failed host or OSD is back online. Second, OSD as the failure domain is not a good choice either; usually you'd go with host as the failure domain. For your relatively small setup I would rather choose a replicated pool with size 6 (2 replicas per node, so you can lose 2 hosts without data loss). If you really need to go with EC, be aware that you won't be able to sustain the loss of a host, since there's not enough space to recover. You could choose a profile like k=2 m=4 (or k=3 m=6 if you want more chunks) and keep OSD as the failure domain, but as I said, that's not very resilient. You'd be better off with replicated pools. Why your PGs are degraded depends on a couple of things. If you want to keep your current setup (which I don't recommend), you could post your ceph osd tree and ceph osd df to begin with.
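One detail in the health output is worth spelling out: the value 2147483647 in each acting set is CRUSH_ITEM_NONE (0x7fffffff), the placeholder Ceph prints when CRUSH found no OSD for that shard. The Python sketch below locates such holes in an acting set and checks how many spare OSDs each profile leaves. The acting list is a shortened, made-up excerpt; that the holes come from CRUSH running out of tries because k+m equals the number of OSDs in the class is a plausible reading of the "no room" point above, not something the posted output proves.

```python
CRUSH_ITEM_NONE = 0x7FFFFFFF  # printed as 2147483647 in acting/up sets


def missing_shards(acting):
    """Return the shard positions for which no OSD was mapped."""
    return [pos for pos, osd in enumerate(acting) if osd == CRUSH_ITEM_NONE]


def spare_osds(k, m, osds_in_class):
    """OSDs left over once every one of the k+m shards has its own OSD."""
    return osds_in_class - (k + m)


if __name__ == "__main__":
    acting = [92, 76, 44, 2147483647, 82]  # shortened, made-up excerpt
    print("unmapped shard positions:", missing_shards(acting))
    print("spare hdd OSDs:", spare_osds(22, 14, 36))
    print("spare ssd OSDs:", spare_osds(44, 28, 72))
```

With zero spare OSDs, every PG needs a distinct placement on all OSDs of its class, and a single unmapped shard is enough to leave the PG undersized.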
Ceph luminous rbd map hangs forever
I'm running a 1-node Ceph cluster and using the Ceph client from another node. QEMU works fine with RBD. When I try to map an RBD block device on the Ceph client I get an indefinite hang with no output. How do I diagnose what's wrong? The system is Ubuntu 16.04 server with Ceph Luminous.
sudo ceph tell osd.* version
{
    "version": "ceph version 12.2.2 (cf0baeeeeba3b47f9427c6c97e2144b094b7e5ba) luminous (stable)"
}
ceph -s
  cluster:
    id:     4bfcc109-e432-4ac0-ba9d-bf81243aea
    health: HEALTH_OK
  services:
    mon: 1 daemons, quorum gcmaster
    mgr: gcmaster(active)
    osd: 1 osds: 1 up, 1 in
  data:
    pools:   1 pools, 128 pgs
    objects: 1512 objects, 5879 MB
    usage:   7356 MB used, 216 GB / 223 GB avail
    pgs:     128 active+clean
rbd info gcbase
rbd image 'gcbase':
    size 512 MB in 128 objects
    order 22 (4096 kB objects)
    block_name_prefix: rbd_data.376974b0dc51
    format: 2
    features: layering, exclusive-lock, object-map, fast-diff, deep-flatten
    flags:
    create_timestamp: Fri Dec 29 17:58:02 2017
This hangs forever:
rbd map gcbase --pool rbd
As does this:
rbd map typo_gcbase --pool rbd
dmesg shows:
Dec 29 13:27:32 cephclient1 kernel: [85798.195468] libceph: mon0 192.168.1.55:6789 feature set mismatch, my 106b84a842a42 < server's 40106b84a842a42, missing 400000000000000
Dec 29 13:27:32 cephclient1 kernel: [85798.222070] libceph: mon0 192.168.1.55:6789 missing required protocol features
The dmesg output shows what's going on: the cluster requires a feature bit that is not supported by the libceph kernel module. The feature bit in question is either CEPH_FEATURE_CRUSH_TUNABLES5, CEPH_FEATURE_NEW_OSDOPREPLY_ENCODING or CEPH_FEATURE_FS_FILE_LAYOUT_V2 (they overlap because they were introduced at the same time), which only became available in kernel 4.5, whereas Ubuntu 16.04 uses a 4.4 kernel. A similar question (although related to CephFS) came up on the mailing list with a possible solution:
"Yes, you should be able to set your CRUSH tunables profile to hammer with "ceph osd crush tunables hammer". This will disable some features, but should make the older kernel compatible with the cluster."
Alternatively, you could upgrade to a mainline kernel or to a newer OS release.
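The missing bit can be derived directly from the two hexadecimal masks in the kernel log. A minimal Python sketch, using only the values from the dmesg line in the question (the function name is made up for illustration):

```python
def missing_feature_bits(client_features, server_features):
    """Return the bit positions required by the server but absent on the client."""
    missing = server_features & ~client_features
    return [bit for bit in range(missing.bit_length()) if (missing >> bit) & 1]


if __name__ == "__main__":
    mine = 0x106b84a842a42       # "my" mask from the libceph message
    servers = 0x40106b84a842a42  # "server's" mask from the same message
    print(hex(servers & ~mine), missing_feature_bits(mine, servers))
```

The only bit the client lacks is bit 58, which is the bit shared by the three feature names listed above.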
pg_top output analysis of puppetdb with postgres
I recently started using a tool called pg_top that shows statistics for Postgres. Since I am not very well versed in the internals of Postgres, I need a bit of clarification on the output.
last pid: 6152; load avg: 19.1, 18.6, 20.4; up 119+20:31:38 13:09:41
41 processes: 5 running, 36 sleeping
CPU states: 52.1% user, 0.0% nice, 0.8% system, 47.1% idle, 0.0% iowait
Memory: 47G used, 16G free, 2524M buffers, 20G cached
DB activity: 151 tps, 0 rollbs/s, 253403 buffer r/s, 86 hit%, 1550639 row r/s, 21 row w/s
DB I/O: 0 reads/s, 0 KB/s, 35 writes/s, 2538 KB/s
DB disk: 233.6 GB total, 195.1 GB free (16% used)
Swap:
My question is about the DB activity line: is 1.5 million row r/s a lot? If so, what can be done to improve it? I am running puppetdb 2.3.8 with 6.8 million resources and 2500 nodes, on Postgres 9.1. All of this runs on a single 24-core box with 64GB of memory.
Ceph enters degraded state after Deis installation
I have successfully upgraded Deis to v1.0.1 on a 3-node cluster, each node having 2GB of RAM, hosted by Digital Ocean. I then nse'ed into a deis-store-monitor service, ran ceph -s, and realized it has entered the active+undersized+degraded state and never gets back to active+clean. The detailed messages follow:
root#deis-2:/# ceph -s
libust[276/276]: Warning: HOME environment variable not set. Disabling LTTng-UST per-user tracing. (in setup_local_apps() at lttng-ust-comm.c:305)
    cluster dfa09ba0-66f2-46bb-8d84-12795f281f7d
     health HEALTH_WARN 1536 pgs degraded; 1536 pgs stuck unclean; 1536 pgs undersized; recovery 1314/3939 objects degraded (33.359%)
     monmap e3: 3 mons at {deis-1=10.132.183.190:6789/0,deis-2=10.132.183.191:6789/0,deis-3=10.132.183.192:6789/0}, election epoch 28, quorum 0,1,2 deis-1,deis-2,deis-3
     mdsmap e32: 1/1/1 up {0=deis-1=up:active}, 2 up:standby
     osdmap e77: 3 osds: 2 up, 2 in
      pgmap v109093: 1536 pgs, 12 pools, 897 MB data, 1313 objects
            27342 MB used, 48256 MB / 77175 MB avail
            1314/3939 objects degraded (33.359%)
                1536 active+undersized+degraded
  client io 817 B/s wr, 0 op/s
I am totally new to Ceph. I wonder: is it a big deal to fix this issue, or could I leave it in this state? If fixing it is recommended, could you point out how I should go about it? I read the Ceph troubleshooting section and the POOL, PG AND CRUSH CONFIG REFERENCE, but still have no idea what I should do next. Thanks a lot!
From this output: osdmap e77: 3 osds: 2 up, 2 in. It sounds like one of your deis-store-daemons isn't responding. deisctl restart store-daemon should recover your cluster, but I'd be curious about what happened to that daemon. I'd love to see journalctl --no-pager -u deis-store-daemon on all of your hosts. If you could add your logs to https://github.com/deis/deis/issues/2520 that'd help us figure out why the daemon isn't responding. Also, 2GB nodes on DO will likely result in performance issues (and Ceph may be unhappy).
Help me analyze dump file
Customers are reporting problems almost every day at about the same hours. The app runs on 2 nodes. It is the Metastorm BPM platform, and it calls our code. In some dumps I noticed very long-running threads (~50 minutes), but not in all of them. Administrators also tell me that memory usage goes up just before users report problems. Then everything slows down to the point where users can't work, and the admins have to restart the platform on both nodes. My first thought was deadlocks (long-running threads), but I didn't manage to confirm that: !syncblk isn't returning anything. Then I looked at memory usage. I noticed a lot of dynamic assemblies, so I thought it might be an assembly leak. But it looks like it's not that: I received a dump from a day when everything was working fine, and the number of dynamic assemblies is similar. So maybe a memory leak, I thought. But I cannot confirm that either. !dumpheap -stat shows that memory usage grows, but I haven't found anything interesting with !gcroot. There is one thing I don't understand, though: Threadpool Completion Port threads. There are a lot of them. So maybe something is waiting on something? Here is the data I can give you so far that will fit in this post. Could you suggest anything that could help diagnose this situation?
Users not reporting problems:
                     Node1    Node2
Size of dump:        638MB    646MB
DynamicAssemblies    259      265
GC Heaps:            37MB     35MB
Loader Heaps:        11MB     11MB

Node1:
Number of Timers: 12
CPU utilization 2%
Worker Thread: Total: 5 Running: 0 Idle: 5 MaxLimit: 2000 MinLimit: 200
Completion Port Thread: Total: 2 Free: 2 MaxFree: 16 CurrentLimit: 4 MaxLimit: 1000 MinLimit: 8
!dumpheap -stat (biggest)
0x793041d0 32,664 2,563,292 System.Object[]
0x79332b9c 23,072 3,485,624 System.Int32[]
0x79330a00 46,823 3,530,664 System.String
0x79333470 22,549 4,049,536 System.Byte[]

Node2:
Number of Timers: 12
CPU utilization 0%
Worker Thread: Total: 7 Running: 0 Idle: 7 MaxLimit: 2000 MinLimit: 200
Completion Port Thread: Total: 3 Free: 1 MaxFree: 16 CurrentLimit: 5 MaxLimit: 1000 MinLimit: 8
!dumpheap -stat
0x793041d0 30,678 2,537,272 System.Object[]
0x79332b9c 21,589 3,298,488 System.Int32[]
0x79333470 21,825 3,680,000 System.Byte[]
0x79330a00 46,938 5,446,576 System.String

----------------------------------------

Users start to report problems:
                     Node1    Node2
Size of dump:        662MB    655MB
DynamicAssemblies    236      235
GC Heaps:            159MB    113MB
Loader Heaps:        10MB     10MB

Node1:
Work Request in Queue: 0
Number of Timers: 14
CPU utilization 20%
Worker Thread: Total: 7 Running: 0 Idle: 7 MaxLimit: 2000 MinLimit: 200
Completion Port Thread: Total: 48 Free: 1 MaxFree: 16 CurrentLimit: 49 MaxLimit: 1000 MinLimit: 8
!dumpheap -stat
0x7932a208 88,974 3,914,856 System.Threading.ReaderWriterLock
0x79333054 71,397 3,998,232 System.Collections.Hashtable
0x24f70350 319,053 5,104,848 Our.Class
0x79332b9c 53,190 6,821,588 System.Int32[]
0x79333470 52,693 6,883,120 System.Byte[]
0x79333150 72,900 11,081,328 System.Collections.Hashtable+bucket[]
0x793041d0 247,011 26,229,980 System.Object[]
0x79330a00 644,807 34,144,396 System.String

Node2:
Work Request in Queue: 1
Number of Timers: 17
CPU utilization 17%
Worker Thread: Total: 6 Running: 0 Idle: 6 MaxLimit: 2000 MinLimit: 200
Completion Port Thread: Total: 48 Free: 2 MaxFree: 16 CurrentLimit: 49 MaxLimit: 1000 MinLimit: 8
!dumpheap -stat
0x7932a208 76,425 3,362,700 System.Threading.ReaderWriterLock
0x79332b9c 42,417 5,695,492 System.Int32[]
0x79333150 41,172 6,451,368 System.Collections.Hashtable+bucket[]
0x79333470 44,052 6,792,004 System.Byte[]
0x793041d0 175,973 18,573,780 System.Object[]
0x79330a00 397,361 21,489,204 System.String

Edit: I downloaded DebugDiag and let it analyze my dumps. Here is part of the output:
The following threads in process_name name_of_dump.dmp are making a COM call to thread 193 within the same process which in turn is waiting on data to be returned from another server via WinSock. The call to WinSock originated from 0x0107b03b and is destined for port xxxx at IP address xxx.xxx.xxx.xxx
( 18 76 172 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 210 211 212 213 214 215 216 217 218 224 225 226 227 228 229 231 232 233 236 239 )
14,79% of threads blocked
And the recommendation is:
Several threads making calls to the same STA thread can cause a performance bottleneck due to serialization. Server side COM servers are recommended to be thread aware and follow MTA guidelines when multiple threads are sharing the same object instance.
I checked with WinDbg what thread 193 does. It is calling our code. Our code calls some Metastorm engine code, which hangs on a remoting call. But !runaway shows it has been hanging for 8 seconds, so not that long. So I checked what the waiting threads are. All except thread 18 are in:
System.Threading._IOCompletionCallback.PerformIOCompletionCallback(UInt32, UInt32, System.Threading.NativeOverlapped*)
I could understand one, but why so many of them? Is it specific to the business process modeling engine we're using, or is it something typical? I guess they are taking threads that could be used by other clients, and that's why users report a slowdown.
Are those threads the Completion Port Threads I asked about before? Can I do anything more to diagnose this, or have I found our code to be the cause?
From the looks of the output, most of the memory is not on the .NET heaps (only 35 MB out of ~650), so if you are looking at the .NET heaps I think you are looking in the wrong place. The memory is probably either in assemblies or in native memory, if you are using some native component for file transfers or similar. You would want to use Debug Diag to monitor that. It is hard to say whether you are leaking dynamic assemblies without looking at the pattern of growth, so I would suggest watching the #current assemblies counter in perfmon to see if it keeps growing over time. If it does, you would have to investigate further by looking at what the dynamic assemblies are with !dda.