How does chronyd behave when it goes from being synchronized to servers whose names were resolved from the pool to a state where communication with those servers is lost?
CentOS 7
# chronyd -v
chronyd (chrony) version 1.29.1
# grep server /etc/chrony.conf
# Use public servers from the pool.ntp.org project.
server 0.amazon.pool.ntp.org iburst
server 1.amazon.pool.ntp.org iburst
server 2.amazon.pool.ntp.org iburst
server 3.amazon.pool.ntp.org iburst
# Serve time even if not synchronized to any NTP server.
The time was not synchronized:
# chronyc sources
210 Number of sources = 4
MS Name/IP address Stratum Poll Reach LastRx Last sample
===============================================================================
^? 45.76.218.213.vultr.com 2 10 0 518d -6679us[-6905us] +/- 40ms
^? v150-95-148-140.a08d.g.ty 2 10 0 745d -105.4s[ +315us] +/- 47ms
^? 45.76.98.188.vultr.com 3 10 0 564d +1031ms[ -242us] +/- 209ms
^? 153-128-30-125.compute.jp 2 10 0 748d -108.1s[-2427us] +/- 37ms
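For reading the output above: the Reach column in chronyc sources is an 8-bit reachability register printed in octal, where each bit records whether a valid reply came back for one of the last eight polls. A value of 0, as on all four lines here, means none of the last eight polls succeeded. A minimal sketch of decoding it (the function name reach_history is made up for illustration and is not part of chrony):

```python
def reach_history(reach_octal):
    """Decode chronyc's Reach column (an 8-bit register printed in octal).

    Returns eight booleans, most recent poll first; True means a valid
    reply was received for that poll.
    """
    register = int(reach_octal, 8) & 0xFF
    # Bit 0 holds the most recent poll; older polls sit in higher bits.
    return [bool((register >> bit) & 1) for bit in range(8)]


# Reach 377 means all of the last eight polls were answered; 0 means none.
all_answered = all(reach_history("377"))
none_answered = not any(reach_history("0"))
```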
I have something like this:
outputs = Parallel(n_jobs=12, verbose=10)(delayed(_process_article)(article, config) for article in data)
Case 1: Ubuntu with 80 logical CPUs (2 sockets x 20 cores x 2 threads):
CPU(s): 80
Thread(s) per core: 2
Core(s) per socket: 20
Socket(s): 2
There are a total of 90,000 tasks. At around 67k tasks the run fails and is terminated:
joblib.externals.loky.process_executor.BrokenProcessPool: A process in the executor was terminated abruptly, the pool is not usable anymore.
When I watch top at around 67k tasks, I see free memory drop sharply:
top - 11:40:25 up 2 days, 18:35, 4 users, load average: 7.09, 7.56, 7.13
Tasks: 32 total, 3 running, 29 sleeping, 0 stopped, 0 zombie
%Cpu(s): 7.6 us, 2.6 sy, 0.0 ni, 89.8 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 33554432 total, 40 free, 33520996 used, 33396 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 40 avail Mem
Case 2: Mac with 8 logical CPUs (4 physical cores):
hw.physicalcpu: 4
hw.logicalcpu: 8
On the Mac it is much, much slower, but surprisingly it does not get killed at 67k.
Additionally, I reduced the parallelism in case 1 to n_jobs=2 and n_jobs=4, and it still fails.
Why is this happening? Has anyone faced this issue before, and is there a fix?
Note: with 50,000 tasks it runs fine and does not give any problems.
Thank you!
I moved to a machine with more memory (128 GB), and that solved the problem!
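If more memory is not an option, one thing that might be worth trying is to run the tasks in batches and hand each batch's results off (for example, to disk) before starting the next, so that all 90,000 outputs are never held in memory at once. Whether this helps depends on where the memory actually goes, which the post does not establish. The sketch below is an assumption-laden illustration: batched, process_in_batches, run_batch and handle_results are hypothetical names, and run_batch stands in for the Parallel(...) call from the question.

```python
from itertools import islice


def batched(items, batch_size):
    """Yield successive lists of at most batch_size items from any iterable."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def process_in_batches(items, run_batch, handle_results, batch_size=5000):
    """Run run_batch on one batch at a time and hand its results off at once.

    run_batch and handle_results are hypothetical callables: run_batch could
    wrap the Parallel(...) call from the question, and handle_results could
    write the outputs to disk so they are not all kept in memory together.
    """
    total = 0
    for batch in batched(items, batch_size):
        results = run_batch(batch)
        handle_results(results)
        total += len(results)
    return total


# Tiny stand-in run: double each number, collect each batch's results.
collected = []
count = process_in_batches(
    range(10), lambda batch: [x * 2 for x in batch], collected.append, batch_size=4
)
```

In the real job, run_batch would be something like a function that calls Parallel(n_jobs=12)(delayed(_process_article)(article, config) for article in batch), and the batch size would need tuning against the memory each task uses.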
I am facing a problem with HttpClient (version 4.5.2) in a multi-threaded web application. In the normal case, when a connection request arrives, a connection is leased from the pool, used, and finally released back to the pool for future requests, as the following log excerpt for the connection with id 673890 shows.
15 Feb 2017 018:25:54:115 p-1-thread-121 DEBUG PoolingHttpClientConnectionManager:249 - Connection request: [route: {}->http://127.0.0.1:8080][total kept alive: 51; route allocated: 4 of 100; total allocated: 92 of 500]
15 Feb 2017 018:25:54:116 p-1-thread-121 DEBUG PoolingHttpClientConnectionManager:282 - Connection leased: [id: 673890][route: {}->http://127.0.0.1:8080][total kept alive: 51; route allocated: 4 of 100; total allocated: 92 of 500]
15 Feb 2017 018:25:54:116 p-1-thread-121 DEBUG DefaultManagedHttpClientConnection:90 - http-outgoing-673890: set socket timeout to 9000
15 Feb 2017 018:25:54:120 p-1-thread-121 DEBUG PoolingHttpClientConnectionManager:314 - Connection [id: 673890][route: {}->http://127.0.0.1:8080] can be kept alive for 10.0 seconds
15 Feb 2017 018:25:54:121 p-1-thread-121 DEBUG PoolingHttpClientConnectionManager:320 - Connection released: [id: 673890][route: {}->http://127.0.0.1:8080][total kept alive: 55; route allocated: 4 of 100; total allocated: 92 of 500]
After that connection (id 673890) has been used several times in the normal way described above, I see the following in the log:
15 Feb 2017 018:25:54:130 p-1-thread-126 DEBUG PoolingHttpClientConnectionManager:249 - Connection request: [route: {}->http://127.0.0.1:8080][total kept alive: 55; route allocated: 4 of 100; total allocated: 92 of 500]
15 Feb 2017 018:25:54:130 p-1-thread-126 DEBUG PoolingHttpClientConnectionManager:282 - Connection leased: [id: 673890][route: {}->http://127.0.0.1:8080][total kept alive: 54; route allocated: 4 of 100; total allocated: 92 of 500]
15 Feb 2017 018:25:54:131 p-1-thread-126 DEBUG DefaultManagedHttpClientConnection:90 - http-outgoing-673890: set socket timeout to 9000
15 Feb 2017 018:25:54:133 p-1-thread-126 DEBUG DefaultManagedHttpClientConnection:81 - http-outgoing-673890: Close connection
15 Feb 2017 018:25:54:133 p-1-thread-126 DEBUG PoolingHttpClientConnectionManager:320 - Connection released: [id: 673890][route: {}->http://127.0.0.1:8080][total kept alive: 55; route allocated: 3 of 100; total allocated: 91 of 500]
The log says that the connection is requested, leased, used, closed, and then released back to the pool. So my question is: why is the connection closed? And why is it released to the pool after being closed?
I know that the connection could be closed by the server, but that is a different situation. In that case, the connection is leased from the pool and found to be stale, so a new connection is established and used. The log I presented above shows a different behavior.
I am aware of two reasons for a connection to be closed in HttpClient. First, it is closed for being idle because its keep-alive time has expired. Second, it is closed by the server, which makes the connection stale in the pool. Is there any other reason for connections to be closed?
Based on Oleg Kalnichevski's reply on the HttpClient mailing list and my own investigation, it turned out that the problem was caused by a 'Connection: close' header sent by the other side. Another cause that may lead to the same situation is the use of HTTP/1.0 non-persistent connections.
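The two causes named above follow from the general HTTP persistence rules: an HTTP/1.1 connection is reusable unless either side sends 'Connection: close', and an HTTP/1.0 connection is not reusable unless 'Connection: keep-alive' is sent. The following is a simplified sketch of that rule only; it is not HttpClient's code, and the function name connection_is_reusable is made up for illustration.

```python
def connection_is_reusable(http_version, headers):
    """Decide whether a connection may be kept alive after a response.

    Simplified persistence rule: 'close' always wins, HTTP/1.1 persists
    by default, and HTTP/1.0 persists only with an explicit 'keep-alive'.
    """
    tokens = set()
    for name, value in headers.items():
        # Header names and connection tokens are case-insensitive.
        if name.lower() == "connection":
            tokens.update(t.strip().lower() for t in value.split(","))
    if "close" in tokens:
        return False
    if http_version == "HTTP/1.1":
        return True
    return "keep-alive" in tokens


# The case from the log: the other side answered with 'Connection: close'.
reusable = connection_is_reusable("HTTP/1.1", {"Connection": "close"})
```

A connection that is not reusable has to be closed before its slot goes back to the pool, which matches the "Close connection" line followed by "Connection released" in the second log excerpt.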
I'm working on my thesis about solving the traveling salesman problem with a genetic algorithm. I use NetLogo to solve the problem, but I got this error:
Can't find element 62 of the list
[7400 5100 5000 5000 2100 4300 5200 1200 900 4300 6000 6000 7600 5900 7600
8600 7400 7100 6800 8100 3300 1400 1200 10400 8500 3700 11400 6900 2000 650
0 3000 4900 9800 10600 4000 5200 7700 8500 5900 5000 7100 6100 6800 1000
3200 2700 2900 1800 1300 9600 4800 4600 6700 7700 6100 4200 3200 9000 8200
10500 13400],
which is only of length 62.
error while turtle 2 running ITEM
called by procedure CALCULATE-DISTANCE
called by procedure SETUP_1
called by Button 'setup 1'
and I don't know why. Can someone help me with this?
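One possible reading of the message: NetLogo's item counts positions from 0, so a list of length 62 has valid positions 0 through 61, and asking for element 62 is one past the end. The post does not show CALCULATE-DISTANCE, so where the index 62 comes from is unknown. The sketch below only illustrates the off-by-one in Python, whose lists are also zero-based; costs is a stand-in list and item is a hypothetical helper, not NetLogo's primitive.

```python
costs = list(range(62))  # stand-in for the 62-element list in the error message


def item(index, values):
    """Zero-based lookup that reports out-of-range indices clearly."""
    if not 0 <= index < len(values):
        raise IndexError(
            f"can't find element {index} of a list of length {len(values)}"
        )
    return values[index]


# Position 61 is the last valid one in a list of length 62.
last = item(len(costs) - 1, costs)
```

If the index is produced by a loop or a counter that runs from 1 to the length of the list, shifting it to run from 0 to length minus 1 would avoid this particular error.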
Customers are reporting problems almost every day at about the same times. The app runs on 2 nodes. It is the Metastorm BPM platform, and it calls our code.
In some dumps I noticed very long-running threads (~50 minutes), but not in all of them. Administrators also tell me that memory usage goes up just before users report problems. Then everything slows down to the point where users can't work, and the admins have to restart the platform on both nodes.
My first thought was deadlocks (because of the long-running threads), but I couldn't confirm that: !syncblk doesn't return anything. Then I looked at memory usage. I noticed a lot of dynamic assemblies, so I thought assemblies might be leaking, but that doesn't seem to be it: I received a dump from a day when everything was working fine, and the number of dynamic assemblies is similar. So I suspected a memory leak, but I can't confirm that either. !dumpheap -stat shows that memory usage grows, but I haven't found anything interesting with !gcroot.
There is one thing I don't understand, though: the thread pool completion port threads. There are a lot of them. So maybe something is waiting on something? Below is the data I can give you so far that fits in this post. Could you suggest anything that would help diagnose this situation?
Users not reporting problems:
Node1 Node2
Size of dump: 638MB 646MB
DynamicAssemblies 259 265
GC Heaps: 37MB 35MB
Loader Heaps: 11MB 11MB
Node1:
Number of Timers: 12
CPU utilization 2%
Worker Thread: Total: 5 Running: 0 Idle: 5 MaxLimit: 2000 MinLimit: 200
Completion Port Thread:Total: 2 Free: 2 MaxFree: 16 CurrentLimit: 4 MaxLimit: 1000 MinLimit: 8
!dumpheap -stat (biggest)
0x793041d0 32,664 2,563,292 System.Object[]
0x79332b9c 23,072 3,485,624 System.Int32[]
0x79330a00 46,823 3,530,664 System.String
0x79333470 22,549 4,049,536 System.Byte[]
Node2:
Number of Timers: 12
CPU utilization 0%
Worker Thread: Total: 7 Running: 0 Idle: 7 MaxLimit: 2000 MinLimit: 200
Completion Port Thread:Total: 3 Free: 1 MaxFree: 16 CurrentLimit: 5 MaxLimit: 1000 MinLimit: 8
!dumpheap -stat
0x793041d0 30,678 2,537,272 System.Object[]
0x79332b9c 21,589 3,298,488 System.Int32[]
0x79333470 21,825 3,680,000 System.Byte[]
0x79330a00 46,938 5,446,576 System.String
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Users start to report problems:
Node1 Node2
Size of dump: 662MB 655MB
DynamicAssemblies 236 235
GC Heaps: 159MB 113MB
Loader Heaps: 10MB 10MB
Node1:
Work Request in Queue: 0
Number of Timers: 14
CPU utilization 20%
Worker Thread: Total: 7 Running: 0 Idle: 7 MaxLimit: 2000 MinLimit: 200
Completion Port Thread:Total: 48 Free: 1 MaxFree: 16 CurrentLimit: 49 MaxLimit: 1000 MinLimit: 8
!dumpheap -stat
0x7932a208 88,974 3,914,856 System.Threading.ReaderWriterLock
0x79333054 71,397 3,998,232 System.Collections.Hashtable
0x24f70350 319,053 5,104,848 Our.Class
0x79332b9c 53,190 6,821,588 System.Int32[]
0x79333470 52,693 6,883,120 System.Byte[]
0x79333150 72,900 11,081,328 System.Collections.Hashtable+bucket[]
0x793041d0 247,011 26,229,980 System.Object[]
0x79330a00 644,807 34,144,396 System.String
Node2:
Work Request in Queue: 1
Number of Timers: 17
CPU utilization 17%
Worker Thread: Total: 6 Running: 0 Idle: 6 MaxLimit: 2000 MinLimit: 200
Completion Port Thread:Total: 48 Free: 2 MaxFree: 16 CurrentLimit: 49 MaxLimit: 1000 MinLimit: 8
!dumpheap -stat
0x7932a208 76,425 3,362,700 System.Threading.ReaderWriterLock
0x79332b9c 42,417 5,695,492 System.Int32[]
0x79333150 41,172 6,451,368 System.Collections.Hashtable+bucket[]
0x79333470 44,052 6,792,004 System.Byte[]
0x793041d0 175,973 18,573,780 System.Object[]
0x79330a00 397,361 21,489,204 System.String
Edit:
I downloaded debugdiag and let it analyze my dumps. Here is part of output:
The following threads in process_name name_of_dump.dmp are making a COM call to thread 193 within the same process which in turn is waiting on data to be returned from another server via WinSock.
The call to WinSock originated from 0x0107b03b and is destined for port xxxx at IP address xxx.xxx.xxx.xxx
( 18 76 172 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 210 211 212 213 214 215 216 217 218 224 225 226 227 228 229 231 232 233 236 239 )
14,79% of threads blocked
And the recommendation is:
Several threads making calls to the same STA thread can cause a performance bottleneck due to serialization. Server side COM servers are recommended to be thread aware and follow MTA guidelines when multiple threads are sharing the same object instance.
I used windbg to check what thread 193 is doing. It is calling our code, which calls some Metastorm engine code, which in turn hangs on a remoting call. But !runaway shows it has been hanging for only 8 seconds, so not that long. So I checked what the waiting threads are doing. All except thread 18 are in:
System.Threading._IOCompletionCallback.PerformIOCompletionCallback(UInt32, UInt32, System.Threading.NativeOverlapped*)
I could understand one, but why so many of them? Is this specific to the business process modeling engine we're using, or is it typical? I guess they are taking threads that could be used by other clients, and that is why users report the slowdown. Are these the completion port threads I asked about before? Can I do anything more to diagnose this, or have I found that our code is the cause?
From the looks of the output, most of the memory is not on the .NET heaps (only 35 MB out of ~650 MB), so if you are looking at the .NET heaps I think you are looking in the wrong place. The memory is probably either in assemblies or in native memory, if you are using some native component for file transfers or similar. You would want to use Debug Diag to monitor that.
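To put the "35 MB out of ~650 MB" comparison next to the other dumps, here is a small sketch that computes the GC heap size as a share of the dump size for all four dumps in the question. The figures are copied from the tables above; the dump file size is only a rough proxy for the process's memory, and the names dumps and managed_share are made up for illustration.

```python
# (GC heaps in MB, dump size in MB), copied from the question's tables.
dumps = {
    "no problems, Node1": (37, 638),
    "no problems, Node2": (35, 646),
    "problems, Node1": (159, 662),
    "problems, Node2": (113, 655),
}


def managed_share(gc_heap_mb, dump_mb):
    """Return the GC heap size as a percentage of the dump size."""
    return 100.0 * gc_heap_mb / dump_mb


shares = {name: managed_share(*sizes) for name, sizes in dumps.items()}
for name, share in shares.items():
    print(f"{name}: {share:.1f}% of the dump is GC heap")
```

On these numbers the managed heaps are a small share of the dump when users are not reporting problems and a noticeably larger share when they are, while the dump size itself barely changes.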
It is hard to say whether you are leaking dynamic assemblies without looking at the pattern of growth. I would suggest watching the # current assemblies counter in perfmon to see whether it keeps growing over time. If it does, you would have to investigate further by looking at what the dynamic assemblies are with !dda.