LSF job states for a given job

Let's say my job was running for some time, went into a suspended state because the machine was overloaded, resumed running after a while, and eventually completed.
So the states this job went through were RUNNING -> SUSPEND -> RUNNING.
How can I get all the states a given job has gone through?

Use bjobs -l if the job hasn't been cleaned from the system yet.
Use bhist -l otherwise. You might also need the -n option, depending on how old the job is.
Here's an example of bhist -l output when a job was suspended and later resumed because the system load temporarily exceeded the configured threshold.
$ bhist -l 1168
Job <1168>, User <mclosson>, Project <default>, Command <sleep 10000>
Fri Jan 20 15:08:40: Submitted from host <hostA>, to
Queue <normal>, CWD <$HOME>, Specified Hosts <hostA>;
Fri Jan 20 15:08:41: Dispatched 1 Task(s) on Host(s) <hostA>, Allocated 1 Slot(
s) on Host(s) <hostA>, Effective RES_REQ <select[type == any] or
der[r15s:pg] >;
Fri Jan 20 15:08:41: Starting (Pid 30234);
Fri Jan 20 15:08:41: Running with execution home </home/mclosson>, Execution CW
D </home/mclosson>, Execution Pid <30234>;
Fri Jan 20 16:19:22: Suspended: Host load exceeded threshold: 1-minute CPU ru
n queue length (r1m)
Fri Jan 20 16:21:43: Running;
Summary of time in seconds spent in various states by Fri Jan 20 16:22:09
PEND PSUSP RUN USUSP SSUSP UNKWN TOTAL
1 0 4267 0 141 0 4409
At 16:19:22 the job was suspended because r1m exceeded the threshold. Later, at 16:21:43, the job resumed.
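If the job is old enough that its events have been rotated out of the current event log, bhist may come back empty; a hedged example of widening the search (job ID 1168 is just the one from the example above):
# Search all archived LSF event log files, not only the current one
# (-n 0 means "no limit on the number of log files searched").
bhist -n 0 -l 1168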

Why does Puma mess up incoming requests? (timed out worker)

Problem
I have a Rails 7 app deployed on render.com, and it doesn't get a lot of traffic (maybe once per day). However, when a few requests do come in, everything seems to be running fine for a moment until Puma seems to barf. The incoming requests are from Twilio for a voice call, and the call eventually errors with "We're sorry, an application error has occurred. Goodbye". It seems like something about a "timed out" worker happens, then the worker boots, and whammo! a flood of "Completed 2XX OK" and "Kredis Connected to shared" lines comes crashing through as if they'd been pent up the entire time. THEN, nearly a day later, without any outside requests coming in, several log lines about "Out-of-sync worker list, no 78 worker" come through. My Puma config file is unchanged from what ships with Rails.
Questions
Where might I go look for the offending code? What tools could help me decipher why a Puma worker is timing out? Could it have something to do with how I'm using Redis via Kredis in my app?
Workaround
To get around this issue, I've started to occasionally redeploy my latest commit and that seems to help. I'm not certain, but it seems like inactivity causes Puma to become discombobulated.
Log output
Here's what the offending lines in my log file look like:
... a few requests that complete 200 OK ...
Sep 13 05:53:15 PM [70] ! Terminating timed out worker (worker failed to check in within 60 seconds): 90
... a couple more normal log lines and then ...
Sep 13 05:53:16 PM [70] - Worker 3 (PID: 134) booted in 0.04s, phase: 0
... some more normal log lines and then ...
Sep 13 05:53:16 PM I, [2022-09-13T22:53:16.593713 #74] INFO -- : [595ad8e5-fa3a-45a3-8c5b-a506e6c94b69] Completed 204 No Content in 110ms (Allocations: 13681)
Sep 13 05:53:16 PM I, [2022-09-13T22:53:16.425579 #86] INFO -- : [f1a64c71-8048-4032-8bf6-2e68aa1fa7ba] Completed 204 No Content in 2ms (Allocations: 541)
Sep 13 05:53:16 PM I, [2022-09-13T22:53:16.595408 #86] INFO -- : [68d19bd9-2286-4f75-a982-5fa3e864d6ac] Completed 200 OK in 105ms (Views: 0.2ms | Allocations: 1592)
Sep 13 05:53:16 PM I, [2022-09-13T22:53:16.614951 #76] INFO -- : [e883350f-9a26-4d3d-8f1c-4853285aa71a] Kredis (10.6ms) Connected to shared
Sep 13 05:53:16 PM I, [2022-09-13T22:53:16.615787 #76] INFO -- : [fbcd8730-1514-4af5-9332-0bdf0c89fc2d] Kredis (17.2ms) Connected to shared
Sep 13 05:53:16 PM I, [2022-09-13T22:53:16.705926 #86] INFO -- : [1f67a177-38f2-4bf5-bd03-1c59a3edb3a4] Kredis (224.1ms) Connected to shared
Sep 13 05:53:16 PM I, [2022-09-13T22:53:16.958386 #76] INFO -- : [e883350f-9a26-4d3d-8f1c-4853285aa71a] Completed 200 OK in 472ms (ActiveRecord: 213.1ms | Allocations: 32402)
Sep 13 05:53:17 PM I, [2022-09-13T22:53:17.034211 #86] INFO -- : [1f67a177-38f2-4bf5-bd03-1c59a3edb3a4] Completed 200 OK in 606ms (ActiveRecord: 256.6ms | Allocations: 17832)
Sep 13 05:53:17 PM I, [2022-09-13T22:53:17.136231 #76] INFO -- : [fbcd8730-1514-4af5-9332-0bdf0c89fc2d] Completed 200 OK in 654ms (ActiveRecord: 88.0ms | Allocations: 37385)
... literally a day later without any other activity ...
Sep 14 05:02:29 AM [69] ! Terminating timed out worker (worker failed to check in within 60 seconds): 78
Sep 14 05:02:31 AM [69] ! Out-of-sync worker list, no 78 worker
Sep 14 05:02:31 AM [69] ! Out-of-sync worker list, no 78 worker
Sep 14 05:02:31 AM [69] ! Out-of-sync worker list, no 78 worker
Sep 14 05:02:31 AM [69] ! Out-of-sync worker list, no 78 worker
Sep 14 05:02:31 AM [69] ! Out-of-sync worker list, no 78 worker
Sep 14 05:02:31 AM [69] ! Out-of-sync worker list, no 78 worker
Sep 14 05:02:31 AM [69] - Worker 1 (PID: 132) booted in 0.03s, phase: 0

How should I limit the slow disconnect of thousands of TCP connections?

I have a Go program running on CentOS that usually has around 5k TCP clients connected. Every once in a while this number goes up to around 15k for about an hour, and everything is still fine.
The program has a slow-shutdown mode where it stops accepting new clients and slowly kills all currently connected clients over the course of 20 minutes. During these slow-shutdown periods, if the machine has 15k clients, I sometimes get:
[Wed Oct 31 21:28:23 2018] net_ratelimit: 482 callbacks suppressed
[Wed Oct 31 21:28:23 2018] TCP: too many orphaned sockets
[Wed Oct 31 21:28:23 2018] TCP: too many orphaned sockets
[Wed Oct 31 21:28:23 2018] TCP: too many orphaned sockets
I have tried adding:
echo "net.ipv4.tcp_max_syn_backlog=5000" >> /etc/sysctl.conf
echo "net.ipv4.tcp_fin_timeout=10" >> /etc/sysctl.conf
echo "net.ipv4.tcp_tw_recycle=1" >> /etc/sysctl.conf
echo "net.ipv4.tcp_tw_reuse=1" >> /etc/sysctl.conf
sysctl -f /etc/sysctl.conf
These values are indeed set; I can see them with their correct new values. A typical sockstat looks like this:
cat /proc/net/sockstat
sockets: used 31682
TCP: inuse 17286 orphan 5 tw 3874 alloc 31453 mem 15731
UDP: inuse 8 mem 3
UDPLITE: inuse 0
RAW: inuse 0
FRAG: inuse 0 memory 0
Any ideas how to stop the "too many orphaned sockets" error and the crash? Should I increase the 20-minute slow-shutdown period to 40 minutes? Increase tcp_mem? Thanks!
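One knob none of the above touches is net.ipv4.tcp_max_orphans, which, as far as I understand, is the limit behind the "too many orphaned sockets" message; a hedged sketch of inspecting and raising it (65536 is only an illustrative value, not a recommendation):
# Check the current orphan limit and TCP memory thresholds
sysctl net.ipv4.tcp_max_orphans
sysctl net.ipv4.tcp_mem
# Raise the orphan limit, then reload the settings
echo "net.ipv4.tcp_max_orphans=65536" >> /etc/sysctl.conf
sysctl -f /etc/sysctl.conf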

MongoDB primary stepDown does not succeed

Setup: replica set with 5 nodes, version 3.4.5.
Trying to switch the PRIMARY with rs.stepDown(60, 30), but consistently getting this error:
rs0:PRIMARY> rs.stepDown(60, 30)
{
"ok" : 0,
"errmsg" : "No electable secondaries caught up as of 2017-07-11T00:21:11.205+0000. Please use {force: true} to force node to step down.",
"code" : 50,
"codeName" : "ExceededTimeLimit"
}
However, rs.printSlaveReplicationInfo() running in a parallel terminal confirms that all replicas are fully caught up:
rs0:PRIMARY> rs.printSlaveReplicationInfo()
source: X.X.X.X:27017
syncedTo: Tue Jul 11 2017 00:21:11 GMT+0000 (UTC)
0 secs (0 hrs) behind the primary
source: X.X.X.X:27017
syncedTo: Tue Jul 11 2017 00:21:11 GMT+0000 (UTC)
0 secs (0 hrs) behind the primary
source: X.X.X.X:27017
syncedTo: Tue Jul 11 2017 00:21:11 GMT+0000 (UTC)
0 secs (0 hrs) behind the primary
source: X.X.X.X:27017
syncedTo: Tue Jul 11 2017 00:21:11 GMT+0000 (UTC)
0 secs (0 hrs) behind the primary
Am I doing something wrong?
UPD: I've checked long-running operations before and during rs.stepDown, as suggested below, and it looks like this:
# Before rs.stepDown
$ watch "mongo --quiet --eval 'JSON.stringify(db.currentOp())' | jq -r '.inprog[] | \"\(.secs_running) \(.desc) \(.op)\"' | sort -rnk1"
984287 rsSync none
984287 ReplBatcher none
67 WT RecordStoreThread: local.oplog.rs none
null SyncSourceFeedback none
null NoopWriter none
0 conn615153 command
0 conn614948 update
0 conn614748 getmore
...
# During rs.stepDown
984329 rsSync none
984329 ReplBatcher none
108 WT RecordStoreThread: local.oplog.rs none
16 conn615138 command
16 conn615136 command
16 conn615085 update
16 conn615079 insert
...
Basically, long-running user operations seem to appear as a result of rs.stepDown(): secs_running becomes nonzero once the PRIMARY attempts to switch over and keeps growing all the way up until stepDown fails. Then everything gets back to normal.
Any ideas on why this happens and whether that's normal at all?
I have used the command below to step down to a secondary:
db.adminCommand( { replSetStepDown: 120, secondaryCatchUpPeriodSecs: 15, force: true } )
You can find this in the official MongoDB documentation:
https://docs.mongodb.com/manual/reference/command/replSetStepDown/
To close the loop on this question, it was determined that the failed stepdown was due to time going backward on the host.
MongoDB 3.4.6 is more resilient to time issues on the host, and upgrading the deployment fixed the stalling issues.
Before stepping down, rs.stepDown() will attempt to terminate long running user operations that would block the primary from stepping down, such as an index build, a write operation or a map-reduce job.
Do you have some long-running jobs going on? Check the result of db.currentOp().
You can try setting a longer stepping-down time: rs.stepDown(60, 360).
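If db.currentOp() does show a long-running operation blocking the step-down, a hedged sketch of finding and killing it from the shell, in the same mongo --eval style as the question (the 10-second threshold and the opid 12345 are made up for illustration):
# List operations that have been running for more than 10 seconds
mongo --quiet --eval 'db.currentOp().inprog.forEach(function(op) { if (op.secs_running > 10) print(op.opid + " " + op.op + " " + op.secs_running + "s"); })'
# Kill one of them by its opid (12345 is a placeholder)
mongo --quiet --eval 'db.killOp(12345)'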
Quoting an answer from https://jira.mongodb.org/browse/SERVER-27015:
This is most likely due to the fact that by default the shutdown command will only succeed on a primary if the secondaries are fully caught up at the exact moment that the shutdown command is executed.
I faced a similar issue and tried the db.shutdownServer() command several times; however, it only worked when the secondary was exactly 0 seconds behind the primary.
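A hedged variant of that, giving the secondaries a window to catch up instead of requiring them to be caught up at the exact moment (the 60-second window is an arbitrary example; run against the admin database):
# Ask the primary to wait up to 60 seconds for a secondary to catch up
# before it steps down and shuts down.
mongo admin --quiet --eval 'db.shutdownServer({ timeoutSecs: 60 })'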

Couchbase: 20k items stuck in Tap Queue

We are currently evaluating Couchbase, primarily as a memcached replacement. Our setup looks like this:
php -> localhost moxi -> couchbase bucket (Total bucket size = 10240 MB (2048 MB x 5 nodes with replica count 1))
The servers have 16GB RAM and are SSD-backed.
We were inserting at about 400 ops/s and had no problems for a few days, until we reached about 13 million items. We found out that we had forgotten to implement the delete function in our test setup, and a lot of keys had no expiration set.
To start over, we flushed the bucket through the web interface. This is where our problems began.
We started to see temp OOMs and back-offs, and the TAP queue was filled with 20k items; the drain and fill rates were nearly the same. See the attached screenshot.
What also caught our eye was that node 4 had only 220k items, whereas every other node had around 1.39M.
Somehow it looks like replication messed something up, but I'm relatively new to Couchbase. Any hints or suggestions?
The problem was solved for a short time after removing the failing node from the cluster.
So now, with four nodes left in the cluster, the same thing happened again with another node after a few hours. We tried setting the now-failing node into failover state. That fixed the problem again, but after re-adding the node, the same phenomenon happened again on that node.
Other things we realized are:
* Three out of four nodes have thousands of items in their TAP replication queue, but one ("the failing one") has 0.
* Also three out of four nodes have a back-off rate of around 400, but one ("the failing one") has 0.
* Only the failing one has a massive amount of "Temp OOMs per second", but the other three have 0.
The phenomenon seems to disappear if we lower the load on the servers by disabling the Couchbase writes for one of the two software projects writing to Couchbase.
But if we enable the writes again, after around 10 minutes we see this in the memcached.log on the failing node:
Tue Dec 17 12:29:05.010547 CET 3: (CENSORED) Received error[86] from mccouch for unknown
Tue Dec 17 12:29:05.010576 CET 3: (CENSORED) Retry notify CouchDB of update, vbucket=277 rev=522
Tue Dec 17 12:29:08.748103 CET 3: (CENSORED) Received error[86] from mccouch for unknown
Tue Dec 17 12:29:08.748257 CET 3: (CENSORED) Retry notify CouchDB of update, vbucket=321 rev=948
Tue Dec 17 12:40:17.354448 CET 3: (CENSORED) Received error[86] from mccouch for unknown
Tue Dec 17 12:40:17.354476 CET 3: (CENSORED) Retry notify CouchDB of update, vbucket=303 rev=491
This error then happens around 5 times within four hours:
Tue Dec 17 14:19:32.145071 CET 3: (CENSORED) TAP (Producer) eq_tapq:replication_ns_1#10.65.20.12 - Suspend for 5.00 secs
And after these four hours it starts spamming this constantly (maybe because the load increased heavily; in the evening our page generates much more load than in the morning or at noon), together with this "error from mccouch":
Tue Dec 17 16:42:30.875343 CET 3: (CENSORED) TAP (Producer) eq_tapq:replication_ns_1#10.65.20.12 - Suspend for 5.00 secs
Tue Dec 17 16:42:36.493317 CET 3: (CENSORED) TAP (Producer) eq_tapq:replication_ns_1#10.65.20.12 - Suspend for 5.00 secs
Tue Dec 17 16:43:25.239876 CET 3: (CENSORED) Received error[86] from mccouch for unknown
Tue Dec 17 16:43:25.240052 CET 3: (CENSORED) Retry notify CouchDB of update, vbucket=296 rev=483
Tue Dec 17 16:43:25.903997 CET 3: (CENSORED) TAP (Producer) eq_tapq:replication_ns_1#10.65.20.12 - Suspend for 5.00 secs
Tue Dec 17 16:43:31.906178 CET 3: (CENSORED) TAP (Producer) eq_tapq:replication_ns_1#10.65.20.12 - Suspend for 5.00 secs
Tue Dec 17 16:43:36.913045 CET 3: (CENSORED) TAP (Producer) eq_tapq:replication_ns_1#10.65.20.12 - Suspend for 5.00 secs
Tue Dec 17 16:43:42.919114 CET 3: (CENSORED) TAP (Producer) eq_tapq:replication_ns_1#10.65.20.12 - Suspend for 5.00 secs
Tue Dec 17 16:43:48.920354 CET 3: (CENSORED) TAP (Producer) eq_tapq:replication_ns_1#10.65.20.12 - Suspend for 5.00 secs
Tue Dec 17 16:43:54.924017 CET 3: (CENSORED) TAP (Producer) eq_tapq:replication_ns_1#10.65.20.12 - Suspend for 5.00 secs
Tue Dec 17 16:44:00.928572 CET 3: (CENSORED) TAP (Producer) eq_tapq:replication_ns_1#10.65.20.12 - Suspend for 5.00 secs
We have no clue what is happening here, or why this failing node seems to reject every replication and throw this error.
Do you have any idea?
Thanks for all your help and greetings from Cologne,
Andy!
Seeing as you just want to delete all items in the bucket, have you tried just deleting and re-creating the bucket?
This will be much faster than flush, as flush actually needs to send a delete request for every document in the bucket.
I can't find it in the docs at the moment, but I think flush is not really recommended with the latest versions.
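A hedged sketch of doing that with couchbase-cli (host, credentials, bucket name and sizes are placeholders, and the exact flags vary between Couchbase versions):
# Drop the bucket (all items go away with it)
couchbase-cli bucket-delete -c 127.0.0.1:8091 -u Administrator -p password --bucket=mybucket
# Re-create it with the same per-node quota and one replica
couchbase-cli bucket-create -c 127.0.0.1:8091 -u Administrator -p password --bucket=mybucket --bucket-type=couchbase --bucket-ramsize=2048 --bucket-replica=1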
You don't say which operating system you are running. If it's Linux, try checking the maximum number of open file descriptors (sockets) allowed for the user running Couchbase. Check the file /etc/security/limits.conf.
The command to check this on Linux is: ulimit -Hn.
Hope that helps.
Daniel
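For completeness, a hedged example of what raising the limit might look like, assuming a dedicated couchbase user and an illustrative limit of 10240:
# /etc/security/limits.conf entries (values are placeholders)
couchbase  soft  nofile  10240
couchbase  hard  nofile  10240
# Verify after logging in again as that user
ulimit -Hn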
I think you should try these settings:
http://docs.couchbase.com/couchbase-manual-2.1/#specifying-backoff-for-replication

G-WAN - Why does G-WAN detect only 1 core?

I am currently testing the G-WAN web server, and I have a question about the default settings for the G-WAN workers and the CPU core detection.
Running G-WAN on a Xen server (which contains a 4-core Xeon CPU), the gwan.log file reports that only a single core was detected:
'./gwan -w 4' used to override 1 x 1-Core CPU(s)
[Wed May 22 06:54:29 2013 GMT] using 1 workers 0[01]0
[Wed May 22 06:54:29 2013 GMT] among 2 threads 0[11]1
[Wed May 22 06:54:29 2013 GMT] (can't use more threads than CPU Cores)
[Wed May 22 06:54:29 2013 GMT] CPU: 1x Intel(R) Xeon(R) CPU E5506 @ 2.13GHz
[Wed May 22 06:54:29 2013 GMT] 0 id: 0 0
[Wed May 22 06:54:29 2013 GMT] 1 id: 1 1
[Wed May 22 06:54:29 2013 GMT] 2 id: 2 2
[Wed May 22 06:54:29 2013 GMT] 3 id: 3 3
[Wed May 22 06:54:29 2013 GMT] CPU(s):1, Core(s)/CPU:1, Thread(s)/Core:2
Any idea?
Thanks!!
the gwan.log file reports that I only got a single core:
Xen, like many other hypervisors, is breaking the CPUID instruction and the /proc/cpuinfo Linux kernel structures (both of which are used by G-WAN to detect CPU Cores).
As you have seen, this is a real problem for multithreaded applications designed to scale on multicore.
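A hedged way to double-check what the Xen guest actually exposes to userland (generic Linux commands, not G-WAN-specific):
# Logical CPUs usable by this process
nproc
# Logical CPUs listed in /proc/cpuinfo
grep -c '^processor' /proc/cpuinfo
# Sockets, cores per socket and threads per core as seen by the guest
lscpu | grep -E 'Socket|Core|Thread'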
'./gwan -w 4' used to override 1 x 1-Core CPU(s)
To avoid stupid manual overrides wasting memory and impacting performance, G-WAN checks that the user-supplied thread count is not greater than the actual CPU Core count.
This is why you have the warning: "(can't use more threads than CPU Cores)".
To bypass this protection and warning, you can use the -g ("God") mode:
./gwan -g -w 4
This command line switch is documented in the G-WAN executable help:
./gwan -h
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Usage: gwan [-b -d -g -k -r -t -v -w] [argument]
(grouped options '-bd' are ignored, use '-b -d')
-b: use the TCP_DEFER_ACCEPT TCP option
(this is disabling the DoS shield!)
-d: daemon mode (uses an 'angel' process)
-d:group:user dumps 'root' privileges
-g: do not limit worker threads to Cores
(using more threads than physical Cores
will lower speed and waste resources)
-k: (gracefully) kill local gwan processes
using the pid files of the gwan folder
-r: run the specified C script (and exit)
for general-purpose code, not servlets
-t: store client requests in 'trace' file
(may be useful to debug/trace attacks)
-v: return version number and build date
-w: define the number of worker threads.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Available network interfaces (2):
127.0.0.1 192.168.0.11