Unresponsive scala-play application

I am running a Scala Play app in production. A few days back I observed that, because of high CPU on the DB side, the Play app started acting up and response times increased to a few minutes. The app is deployed on 3 EC2 instances, all attached to an ELB. During this period two of the processes went unresponsive and their response time went up to 600 minutes (usually response time is below 200 milliseconds). Because of the high response time on those two processes, the ELB marked them as unhealthy, and all requests were routed to the single remaining process (which had a response time of 20 seconds).
Going through the logs didn't help much. After reading a few articles, I understand that a deadlock in a thread pool can be one cause. We use separate thread pools for blocking S3 calls and non-blocking DB calls:
executor {
  sync = {
    fork-join-executor {
      parallelism-factor = 1.0
      parallelism-max = 24
    }
  }
  async = {
    fork-join-executor {
      parallelism-factor = 1.0
      parallelism-max = 24
    }
  }
}
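For illustration, the blocking S3 calls are dispatched onto the sync pool roughly along these lines (a simplified sketch only; the class and method names are placeholders and the actual wiring in the app may differ):

import akka.actor.ActorSystem
import scala.concurrent.{ExecutionContext, Future}

class S3Service(system: ActorSystem) {
  // Dedicated pool for blocking S3 calls; "executor.sync" is the config path
  // of the block shown above (the DB calls use "executor.async" analogously).
  private implicit val s3Ec: ExecutionContext = system.dispatchers.lookup("executor.sync")

  def download(bucket: String, key: String): Future[Array[Byte]] = Future {
    // Blocking AWS SDK call, e.g. s3Client.getObject(bucket, key), runs on the sync pool
    ???
  }
}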
Can anyone help in understanding what could possibly have gone wrong?
All 3 nodes have the same build deployed, but only two of them went unresponsive. CPU on the unresponsive nodes was below 10%.
Play: 2.5.14
Scala: 2.11.11

There are many things that can go wrong, and it's just a guessing game with the information you've provided.
I'd start by creating thread dumps of the unresponsive JVM. If you capture console logs of your app, one way to get a dump is to send signal 3 to the JVM process.
Assuming you run your service in a Unix environment:
ps aux | grep java
Find the Java PID that runs your Play app, then send it signal 3:
kill -3 <pid>
On receiving signal 3, the JVM prints a thread dump to the console.
If the console is not available, run:
jstack -l <pid> >> threaddumps.log
Now you'll be able to see a snapshot of your threads' state and, if any threads are blocked, where they are blocking.
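If you prefer to capture this from inside the application (for example behind an admin-only endpoint), the same information is available programmatically through the thread MXBean. A minimal sketch in Scala (the object name and wiring are up to you):

import java.lang.management.ManagementFactory

object ThreadDiagnostics {
  // Dump of every live thread in the current JVM, plus an explicit deadlock check.
  def dump(): String = {
    val mx = ManagementFactory.getThreadMXBean
    val threads = mx.dumpAllThreads(true, true).mkString("\n")
    val deadlocks = Option(mx.findDeadlockedThreads())
      .map(ids => s"Deadlocked thread ids: ${ids.mkString(", ")}")
      .getOrElse("No Java-level deadlocks detected")
    s"$threads\n$deadlocks"
  }
}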

Related

First 10 long-running transactions

I have a fairly small cluster of 6 nodes: 3 client and 3 server nodes. Important configurations (a rough code sketch of these settings follows the list):
storeKeepBinary = true,
cacheMode = Partitioned (some caches, about 5-8 out of 25, are TRANSACTIONAL)
AtomicityMode = Atomic
backups = 1
readFromBackups = false
no persistence
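In code, these settings map roughly onto the following (a sketch only; the cache name and key/value types are illustrative, and only one of the 25 caches is shown):

import org.apache.ignite.cache.{CacheAtomicityMode, CacheMode}
import org.apache.ignite.configuration.{CacheConfiguration, IgniteConfiguration}

object ClusterConfigSketch {
  // One of the ATOMIC caches; the handful of TRANSACTIONAL caches differ only in atomicity mode.
  val cacheCfg: CacheConfiguration[String, Array[Byte]] =
    new CacheConfiguration[String, Array[Byte]]("someCache")
      .setCacheMode(CacheMode.PARTITIONED)
      .setAtomicityMode(CacheAtomicityMode.ATOMIC)
      .setBackups(1)
      .setReadFromBackup(false)
      .setStoreKeepBinary(true)

  // Server node configuration; client nodes additionally call setClientMode(true).
  // No persistence: no DataStorageConfiguration with persistence enabled is set.
  val igniteCfg: IgniteConfiguration =
    new IgniteConfiguration()
      .setCacheConfiguration(cacheCfg)
}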
When I run the app for a load/performance test on-prem on 2 large boxes (3 clients on one box and 3 servers on the other, all in Docker containers), I get decent performance.
However, when I move them over to AWS and run them in EKS, the only change I make is to change the cluster discovery from standard TCP (default) to Kubernetes-based discovery and run the same test.
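For reference, the discovery change is roughly of this shape (a sketch only, assuming the ignite-kubernetes module and the setter-style IP-finder API of older Ignite 2.x releases; Ignite 2.11+ passes a KubernetesConnectionConfiguration to the IP finder instead, and the namespace/service names below are placeholders):

import org.apache.ignite.configuration.IgniteConfiguration
import org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi
import org.apache.ignite.spi.discovery.tcp.ipfinder.kubernetes.TcpDiscoveryKubernetesIpFinder

object K8sDiscoverySketch {
  // Nodes discover each other through the Kubernetes service endpoint list
  val ipFinder = new TcpDiscoveryKubernetesIpFinder()
  ipFinder.setNamespace("ignite")
  ipFinder.setServiceName("ignite-service")

  val cfg: IgniteConfiguration =
    new IgniteConfiguration()
      .setDiscoverySpi(new TcpDiscoverySpi().setIpFinder(ipFinder))
}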
But now the performance is very bad; I keep getting:
WARN [sys-#145%test%] - [org.apache.ignite] First 10 long-running transactions [total=3]
Here the transactions run for more than a minute.
In other cases I am getting:
WARN [sys-#196%test-2%] - [org.apache.ignite] First 10 long-running cache futures [total=1]
Here the associated future has been running for more than 3 minutes.
Most of the places a Google search has taken me point to a flaky/inconsistent network as the cause.
The app and the test seem to be OK, since on-prem this works just fine and the performance is decent as well.
I wanted to check whether others have faced this, or whether something else needs to be done when running on Kubernetes in the public cloud. For example, somewhere I read that nodes need to be pinned to the host in a cloud/virtual environment, but it's not mandatory.
TIA

ActiveMQ Artemis produce/consume latency issue

I have been monitoring the end-to-end latency of my microservice applications. Each service is loosely coupled via an ActiveMQ Artemis queue.
------------- ------------- -------------
| Service 1 | --> | Service 2 | --> | Service 3 |
------------- ------------- -------------
Service 1 listens as an HTTP endpoint and produces to queue 1. Service 2 consumes from queue 1, modifies the message, and produces to queue 2. Service 3 consumes from queue 2. Each service inserts a row into its own DB table, so from there I can also monitor latency. "End-to-end" therefore means going into Service 1 and coming out of Service 3.
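To make the hops concrete, each middle service is essentially a consume-transform-produce loop; a minimal sketch of Service 2, assuming plain JMS over the Artemis client (queue names, the "modification", and the DB insert are placeholders, and the real services may use a different client API):

import javax.jms.{ConnectionFactory, Session, TextMessage}
import org.apache.activemq.artemis.jms.client.ActiveMQConnectionFactory

object Service2 {
  def main(args: Array[String]): Unit = {
    val cf: ConnectionFactory = new ActiveMQConnectionFactory("tcp://localhost:61616")
    val connection = cf.createConnection()
    val session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE)
    val consumer = session.createConsumer(session.createQueue("queue1"))
    val producer = session.createProducer(session.createQueue("queue2"))
    connection.start()

    while (true) {
      // Consume from queue 1, modify the message, produce to queue 2
      val in = consumer.receive().asInstanceOf[TextMessage]
      val out = session.createTextMessage(in.getText.toUpperCase) // placeholder "modification"
      producer.send(out)
      // a row would also be inserted into this service's latency table here
    }
  }
}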
Each service's processing time remains steady, and most messages have a reasonable e2e latency of a few milliseconds. I produce at a constant rate of 400 req/sec using JMeter, and I can monitor this via Grafana.
Sporadically I notice a dip in this constant rate, which can be seen throughout the chain. At first I thought it could be the producer side (Service 1), since the rate suddenly dropped to 370 req/sec and might be attributed to GC or possibly a fault in the JMeter HTTP simulator, but this does not explain why certain messages' e2e latency jumps to ~2-3 sec.
Since it would be hard to reproduce my scenario, I checked out this load generator for ActiveMQ Artemis and bumped the versions up to 2.17.0, 5.16.2 & 0.58.0 to match my broker, 2.17.0, which is a cluster of 2 masters/slaves using NFSv4 shared storage.
The command below generated 5,000,000 messages to a single queue, q6, with 4 producers/consumers and a max overall produce rate of 400. Messages are persistent. The only code change in the artemis-load-generator was in ConsumerLatencyRecorderTask: when elapsedTime > 1 sec, I print out the message ID and latency.
java -jar destination-bench.jar --persistent --bytes 1000 --protocol artemis --url tcp://localhost:61616?producerMaxRate=100 --out /tmp/test1.txt --name q6 --iterations 5000000 --runs 1 --warmup 20000 --forks 4 --destinations 1
From this I noticed that there were outlier messages with produce/consume latency nearing 2 secs. Most (90.00%) were below 3358.72 microseconds.
I am not sure why or how this happens. Is this reasonable?
EDIT/UPDATE
I have run the test a few times; this is the output of a shorter run.
java -jar destination-bench.jar --persistent --bytes 1000 --protocol artemis --url tcp://localhost:61616?producerMaxRate=100 --out ~/test-perf1.txt --name q6 --iterations 400000 --runs 1 --warmup 20000 --forks 4 --destinations 1
The result is below
RUN 1 EndToEnd Throughput: 398 ops/sec
**************
EndToEnd SERVICE-TIME Latencies distribution in MICROSECONDS
mean 10117.30
min 954.37
50.00% 1695.74
90.00% 2637.82
99.00% 177209.34
99.90% 847249.41
99.99% 859832.32
max 5939134.46
count 1600000
Looking at the JVM threads status, what I am noticing in my actual system is a lot of TIMED_WAITING threads on the broker, and where there are spikes, push-to-queue latency seems to increase.
Currently my data is, as I said, hosted on NFSv4. I read in the Artemis persistence section that:
If the journal is on a volume which is shared with other processes which might be writing other files (e.g. bindings journal, database, or transaction coordinator) then the disk head may well be moving rapidly between these files as it writes them, thus drastically reducing performance.
Should I move the bindings folder off NFS onto the VM's disk? Will this improve performance? It is unclear to me.
How does this affect Shared Store HA?
I started a fresh, default instance of ActiveMQ Artemis 2.17.0, cloned and built the artemis-load-generator (with a modification to alert immediately on messages that take > 1 second to process), and then ran the same command you ran. I let the test run for about an hour on my local machine, but I didn't let it finish because it was going to take over 3 hours (5 million messages at 400 messages per second). Out of roughly 1 million messages I saw only 1 "outlier" - certainly nothing close to the 10% you're seeing. It's worth noting that I was still using my computer for my normal development work during this time.
At this point I have to attribute this to some kind of environmental issue, e.g.:
Garbage Collection
Low performance disk
Network latency
Insufficient CPU, RAM, etc.

24-hour performance test execution stopped abruptly running in JMeter pod in AKS

I am running a 24-hour load test using JMeter in Azure Kubernetes Service. I am using the Throughput Shaping Timer in my JMX file. No listener is added as part of the JMX file.
My test stopped abruptly after 6 or 7 hrs.
The jmeter-server.log file under the JMeter slave pod shows the warning: WARN k.a.j.t.VariableThroughputTimer: No free threads left in worker pool.
(Snapshot from the jmeter-server.log file omitted.)
Using JMeter version 5.2.1 and Kubernetes version 1.19.6.
I checked: the JMeter pods for master and slaves are running continuously in AKS (no restarts happened).
I provided 2 GB of memory to the JMeter slave pod; still, the load test stops abruptly.
I am using a Log Analytics workspace for logging. I checked the ContainerLog table and am not seeing any errors.
The JMX file uses the following elements: Thread Group, Throughput Controller, HTTP Request Sampler and Throughput Shaping Timer.
Please advise.
It looks like your Schedule Feedback Function configuration is wrong in its last parameter.
The warning means that the Throughput Shaping Timer attempts to increase the number of threads to reach/maintain the desired concurrency but it doesn't have enough threads in order to do this.
So either increase the spare threads ratio to be closer to 1 (if you're using a float value as a percentage), or increase the absolute value to match the number of threads.
Quote from documentation:
Example function call: ${__tstFeedback(tst-name,1,100,10)} , where "tst-name" is name of Throughput Shaping Timer to integrate with, 1 and 100 are starting threads and max allowed threads, 10 is how many spare threads to keep in thread pool. If spare threads parameter is a float value <1, then it is interpreted as a ratio relative to the current estimate of threads needed. If above 1, spare threads is interpreted as an absolute count.
More information: Using JMeter’s Throughput Shaping Timer Plugin
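For example, assuming the timer is named tst-name as in the documentation snippet above, either form addresses the warning (the numbers are illustrative):
${__tstFeedback(tst-name,1,500,0.8)} - spare threads given as a ratio of the currently estimated thread count
${__tstFeedback(tst-name,1,500,100)} - spare threads given as an absolute count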
However, this doesn't explain the premature termination of the test, so make sure there are no errors in the JMeter/k8s logs; one possible reason is that the JMeter process is being terminated by the OOMKiller.

CoordinatedShutdown timeout on Akka cluster application

We have an Akka cluster application (sharding some actors). Sometimes, when we deploy and our application should be shut down, we see logs like this:
Coordinated shutdown phase [cluster-sharding-shutdown-region] timed
out after 10000 milliseconds
This happens on the first deploy after more than 2 days since the last deploy (on Mondays, for example). We ask the Akka node to leave the cluster with the JMX helper, and we have the following code too:
actorSystem.registerOnTermination {
  logger.error("Gracefully shutdown of node")
  System.exit(0)
}
So when this error happens, the node eventually leaves the cluster (or at least it closes the JMX entry point used to manage the Akka cluster), but the process doesn't finish and the "Gracefully shutdown of node" log line never appears. When this happens we need to shut down the Java process manually (we handle this with supervisor) and redeploy.
I know the timeout can be tuned through config, but what are the implications of increasing it? Why does coordinated shutdown sometimes time out? And what happens when coordinated shutdown times out?
Any clue would be appreciated :D
Thank you
What happens after a timeout? Quoting from the Akka documentation:
If tasks are not completed within a configured timeout (see reference.conf) the next phase will be started anyway. It is possible to configure recover=off for a phase to abort the rest of the shutdown process if a task fails or is not completed within the timeout.
Why might the shutdown time out? Quite possibly you have a deadlock somewhere; in that case, increasing the timeout wouldn't help. It may also very well be that you simply need more time for the shutdown; then you must increase the timeout.
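For reference, both knobs mentioned in the quote are per-phase settings, so the phase from your log could be tuned roughly like this (the value is illustrative):

akka.coordinated-shutdown.phases.cluster-sharding-shutdown-region {
  timeout = 30s
  # recover = off would abort the remaining phases instead of continuing
}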
But something that may be more related to your problem is the following:
By default, the JVM is not forcefully stopped (it will be stopped if all non-daemon threads have been terminated). To enable a hard System.exit as a final action you can configure:
akka.coordinated-shutdown.exit-jvm = on
So you can turn this on, which should solve the "shutdown the java process manually" step.
Nevertheless, the hard question is to find out why the shutdown times out in the first place. I guess with the above trick you can survive for some time, but you'd better spend some time to find the actual cause.
We used to face this problem (one of the coordinated shutdown phases timing out) with a short-lived application.
Use case where we faced this:
Application joins existing akka cluster
Does some work
Leaves the cluster
But at step 3 the status of the member was still Joining or WeaklyUp, and if you look at the task added for PhaseClusterLeave, it removes the member from the cluster only if its status is Up.
Snippet from ClusterDaemon.scala which is invoked when running the ClusterLeave phase:
def leaving(address: Address): Unit = {
  // only try to update if the node is available (in the member ring)
  if (latestGossip.members.exists(m ⇒ m.address == address && m.status == Up)) {
    val newMembers = latestGossip.members map { m ⇒ if (m.address == address) m.copy(status = Leaving) else m } // mark node as LEAVING
    val newGossip = latestGossip copy (members = newMembers)
    updateLatestGossip(newGossip)
    logInfo("Marked address [{}] as [{}]", address, Leaving)
    publishMembershipState()
    // immediate gossip to speed up the leaving process
    gossip()
  }
}
To solve this problem, we ended up writing our own CoordinatedShutdown, which you can refer to here: CswCoordinatedShutdown.scala

Jobs in a queue are dropped unexpectedly in Gearman

I'm dealing with a very strange problem now.
Since I started queueing over 1,000 jobs at once, Gearman hasn't been working properly.
The problem is that when I submit the jobs in background mode, I can see they are correctly queued on the monitoring page (Gearman monitor), but the queue is drained right after (within a few seconds) without the jobs being delivered to a worker.
In the end, the jobs are never executed by a worker; they just disappear from the queue (job server).
So I tried rebooting the server entirely and reinstalling Gearman as well as the PHP library. (I'm using 1 CentOS and 1 Ubuntu machine with the PHP Gearman library; the versions are 0.34 and 1.0.2.)
But no luck yet... the job server just misbehaves as I explained above.
What should I do for now?
Can I check the workers' state, or see and monitor the whole process from queueing the jobs to delivering them to a worker?
When I tried gearmand with an option like 'gearmand -vvvv', it never prints anything on the screen while I register a worker with the server and run a job with client code (PHP).
Any comment will be appreciated.
For your information, I'm not considering a persistent queue using MySQL or SQLite for now, because it sometimes causes performance issues with slow execution.