ActiveMQ Artemis produce/consume latency issue - activemq-artemis

I have been monitoring the end-to-end latency of my microservice applications. The services are loosely coupled via ActiveMQ Artemis queues.
-------------       -------------       -------------
| Service 1 |  -->  | Service 2 |  -->  | Service 3 |
-------------       -------------       -------------
Service 1 listens as an HTTP endpoint and produces to queue 1. Service 2 consumes from queue 1, modifies the message, and produces to queue 2. Service 3 consumes from queue 2. Each service also inserts a row into its own database table, which lets me monitor latency at each hop. So "end-to-end" means going into Service 1 and coming out of Service 3.
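For context, each hop is a plain consume/modify/produce step. Below is a minimal sketch of the Service 2 hop, assuming straightforward JMS 2.0 with the Artemis client; the URL, queue names, and the "modification" are placeholders, not my actual services:
import javax.jms.ConnectionFactory;
import javax.jms.DeliveryMode;
import javax.jms.JMSConsumer;
import javax.jms.JMSContext;
import javax.jms.JMSProducer;
import javax.jms.Queue;
import org.apache.activemq.artemis.jms.client.ActiveMQConnectionFactory;

public class Service2Relay {
    public static void main(String[] args) {
        ConnectionFactory cf = new ActiveMQConnectionFactory("tcp://localhost:61616");
        try (JMSContext ctx = cf.createContext()) {
            JMSConsumer in = ctx.createConsumer(ctx.createQueue("queue1"));
            Queue out = ctx.createQueue("queue2");
            JMSProducer producer = ctx.createProducer().setDeliveryMode(DeliveryMode.PERSISTENT);
            while (true) {
                String body = in.receiveBody(String.class);   // blocks for the next message
                producer.send(out, body + " [modified]");     // the "modify" step
                // the real service also inserts a row into its own DB table here
            }
        }
    }
}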
Each service's processing time remains steady, and most messages have a reasonable end-to-end latency of a few milliseconds. I produce at a constant rate of 400 req/sec using JMeter, and I can monitor this via Grafana.
Sporadically I notice a dip in this constant rate, which can be seen throughout the chain. At first I thought the cause could be the producer side (Service 1), since the rate suddenly dropped to 370 req/sec; that might be attributed to GC, or possibly to a fault in the JMeter HTTP simulator, but it does not explain why certain messages' end-to-end latency jumps to ~2-3 seconds.
Since it would be hard to reproduce my exact scenario, I checked out this load generator for ActiveMQ Artemis and bumped the versions up to 2.17.0, 5.16.2, and 0.58.0 to match my broker (2.17.0), which is a cluster of two master/slave pairs using NFSv4 shared storage.
The command below generated 5,000,000 messages to a single queue, q6, with 4 producers/consumers and a maximum overall produce rate of 400. Messages are persistent. The only code change in the artemis-load-generator was in ConsumerLatencyRecorderTask: when elapsedTime exceeds 1 second, I print out the message ID and latency.
java -jar destination-bench.jar --persistent --bytes 1000 --protocol artemis --url tcp://localhost:61616?producerMaxRate=100 --out /tmp/test1.txt --name q6 --iterations 5000000 --runs 1 --warmup 20000 --forks 4 --destinations 1
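That ConsumerLatencyRecorderTask change is just a guard around the latency the task already records; a minimal sketch of what it looks like (the class wrapper and names are my own approximations, not the generator's exact code):
import java.util.concurrent.TimeUnit;
import javax.jms.JMSException;
import javax.jms.Message;

class OutlierLogger {
    // Print any message whose recorded end-to-end latency exceeds 1 second.
    static void flagOutlier(Message message, long elapsedNanos) throws JMSException {
        if (elapsedNanos > TimeUnit.SECONDS.toNanos(1)) {
            System.out.printf("outlier: messageID=%s latency=%d ms%n",
                    message.getJMSMessageID(), TimeUnit.NANOSECONDS.toMillis(elapsedNanos));
        }
    }
}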
From this I noticed that there were outlier messages with produce/consume latency nearing 2 seconds, while most (90.00%) were below 3358.72 microseconds.
I am not sure why or how this happens. Is this reasonable?
EDIT/UPDATE
I have run the test a few times; this is the output of a shorter run.
java -jar destination-bench.jar --persistent --bytes 1000 --protocol artemis --url tcp://localhost:61616?producerMaxRate=100 --out ~/test-perf1.txt --name q6 --iterations 400000 --runs 1 --warmup 20000 --forks 4 --destinations 1
The result is below
RUN 1 EndToEnd Throughput: 398 ops/sec
**************
EndToEnd SERVICE-TIME Latencies distribution in MICROSECONDS
mean 10117.30
min 954.37
50.00% 1695.74
90.00% 2637.82
99.00% 177209.34
99.90% 847249.41
99.99% 859832.32
max 5939134.46
count 1600000
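(For reference, percentile outputs like this are typically produced with HdrHistogram. Below is a minimal, self-contained sketch of recording microsecond latencies and printing the same fields; it is illustrative only: the recorded values are random stand-ins, not my measurements, and it is not the load generator's actual code.)
import java.util.concurrent.ThreadLocalRandom;
import org.HdrHistogram.Histogram;

public class LatencyPercentiles {
    public static void main(String[] args) {
        // track values from 1 us up to 1 hour with 3 significant digits
        Histogram histogram = new Histogram(3_600_000_000L, 3);
        for (int i = 0; i < 1_600_000; i++) {
            // stand-in for one measured end-to-end latency in microseconds
            histogram.recordValue(ThreadLocalRandom.current().nextLong(954, 10_000));
        }
        System.out.printf("mean   %.2f%n", histogram.getMean());
        System.out.printf("min    %d%n", histogram.getMinValue());
        System.out.printf("50.00%% %d%n", histogram.getValueAtPercentile(50.0));
        System.out.printf("90.00%% %d%n", histogram.getValueAtPercentile(90.0));
        System.out.printf("99.00%% %d%n", histogram.getValueAtPercentile(99.0));
        System.out.printf("99.99%% %d%n", histogram.getValueAtPercentile(99.99));
        System.out.printf("max    %d%n", histogram.getMaxValue());
        System.out.printf("count  %d%n", histogram.getTotalCount());
    }
}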
Looking at the JVM thread status, what I notice in my actual system is a lot of TIMED_WAITING threads on the broker, and where there are spikes the push-to-queue latency seems to increase.
Currently my data is, as I said, hosted on NFSv4. I read in the Artemis persistence documentation that:
If the journal is on a volume which is shared with other processes which might be writing other files (e.g. bindings journal, database, or transaction coordinator) then the disk head may well be moving rapidly between these files as it writes them, thus drastically reducing performance.
Should I move the bindings folder off NFS onto the VM's local disk? Will this improve performance? It is unclear to me.
How does this affect Shared Store HA?

I started a fresh, default instance of ActiveMQ Artemis 2.17.0, cloned and built the artemis-load-generator (with a modification to alert immediately on messages that take > 1 second to process), and then ran the same command you ran. I let the test run for about an hour on my local machine, but I didn't let it finish because it was going to take over 3 hours (5 million messages at 400 messages per second). Out of roughly 1 million messages I saw only 1 "outlier" - certainly nothing close to the 10% you're seeing. It's worth noting that I was still using my computer for my normal development work during this time.
At this point I have to attribute this to some kind of environmental issue, e.g.:
Garbage Collection
Low performance disk (see the fsync check after this list)
Network latency
Insufficient CPU, RAM, etc.
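If the disk is the prime suspect (e.g. the NFSv4 mount holding the journal), one quick sanity check is to measure raw fsync latency on that volume and compare it against the local VM disk. Here's a minimal sketch; the directory argument is a placeholder, and I believe the Artemis CLI's perf-journal command does a more thorough version of the same thing:
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class FsyncLatencyCheck {
    public static void main(String[] args) throws IOException {
        // point this at the journal directory on the NFS mount, then at the local disk, and compare
        Path dir = Paths.get(args.length > 0 ? args[0] : "/tmp");
        Path file = Files.createTempFile(dir, "fsync-check", ".dat");
        ByteBuffer block = ByteBuffer.allocate(1000); // roughly the 1000-byte test messages
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.WRITE)) {
            for (int i = 0; i < 1_000; i++) {
                block.clear();
                long start = System.nanoTime();
                channel.write(block, 0);
                channel.force(true); // fsync: broadly what a blocking persistent send waits on
                long micros = (System.nanoTime() - start) / 1_000;
                if (micros > 100_000) {
                    System.out.println("slow write+fsync: " + micros + " us");
                }
            }
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
If the shared-storage numbers regularly spike into the hundreds of milliseconds, that would go a long way toward explaining multi-second outliers for persistent messages.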

Related

Kafka broker dying abruptly without any error log

We are running Kafka version 2.4.0. After 4-5 days of the application running, it dies without any logs. We have a 20 GB box with Xmx and Xms set to 5 GB. The GC activity of the application is healthy and there are no GC issues. I don't see the OOM killer being invoked, as checked from the system logs. There was 13 GB of available memory when the process died.
              total        used        free      shared  buff/cache   available
Mem:             19           5           0           0          13          13
Swap:             0           0           0
The root cause for this was the vm.max_map_count limit (default being 65k) being hit by the application. We concluded this by looking at the
jmx.java.nio.BufferPool.mapped.Count
metric in the JMX MBeans.
Another way to check this is
cat /proc/<kafka broker pid>/maps | wc -l
Updating the max_map_count limit fixed the issue for us.
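For reference, that metric maps to the java.nio:type=BufferPool,name=mapped platform MBean; here is a minimal sketch of reading it from inside a JVM (illustrative only, not the broker's own code or the exact metric pipeline we used):
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;

public class MappedBufferCount {
    public static void main(String[] args) {
        // the "mapped" pool mirrors java.nio:type=BufferPool,name=mapped in JMX;
        // its count climbs with every memory-mapped segment/index file the process keeps open
        for (BufferPoolMXBean pool : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            System.out.printf("%-8s count=%d totalCapacity=%d bytes%n",
                    pool.getName(), pool.getCount(), pool.getTotalCapacity());
        }
    }
}
When that count approaches the vm.max_map_count limit, new mmap calls start to fail, which is consistent with the process dying without an ordinary OOM.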
Another way to fix this issue could have been:
Increasing the segment creation duration, or the number of records at which a segment roll is triggered.
Having more instances so that each instance gets assigned a smaller number of partitions.

24 hours performance test execution stopped abruptly running in jmeter pod in AKS

I am running a 24-hour load test using JMeter in Azure Kubernetes Service. I am using the Throughput Shaping Timer in my JMX file. No listener is added as part of the JMX file.
My test stopped abruptly after 6 or 7 hours.
The jmeter-server.log file under the JMeter slave pod is giving a warning: WARN k.a.j.t.VariableThroughputTimer: No free threads left in worker pool.
Below is a snapshot from the jmeter-server.log file.
Using JMeter version 5.2.1 and Kubernetes version 1.19.6.
I checked that the JMeter pods for master and slaves are continuously running (no restarts happened) in AKS.
I provided 2 GB of memory to the JMeter slave pod; still, the load test stopped abruptly.
I am using a Log Analytics workspace for logging. I checked the ContainerLog table and am not seeing any errors.
Snapshot of the JMX file.
Using the following elements: Thread Group, Throughput Controller, HTTP Request Sampler, and Throughput Shaping Timer.
Please advise.
It looks like your Schedule Feedback Function configuration is wrong in its last parameter.
The warning means that the Throughput Shaping Timer attempts to increase the number of threads to reach/maintain the desired concurrency, but it doesn't have enough threads in the pool to do so.
So either increase the spare threads ratio to be closer to 1 if you're using a float value for the percentage, or increase the absolute value so it matches the number of threads you need (for example, raise the last argument of __tstFeedback from 10 to a larger absolute count, or switch it to a float ratio such as 0.5).
Quote from documentation:
Example function call: ${__tstFeedback(tst-name,1,100,10)} , where "tst-name" is name of Throughput Shaping Timer to integrate with, 1 and 100 are starting threads and max allowed threads, 10 is how many spare threads to keep in thread pool. If spare threads parameter is a float value <1, then it is interpreted as a ratio relative to the current estimate of threads needed. If above 1, spare threads is interpreted as an absolute count.
More information: Using JMeter’s Throughput Shaping Timer Plugin
However, it doesn't explain the premature termination of the test, so make sure there are no errors in the JMeter/k8s logs; one of the possible reasons is that the JMeter process is being terminated by the OOMKiller.

Unresponsive scala-play application

I am using a Scala Play app in production. A few days back I observed that, because of high CPU on the DB side, the Play app started acting up and response time increased to a few minutes. The Play app was deployed on 3 EC2 instances, all of them attached to an ELB. During this time two processes went unresponsive and response time went up to 600 minutes (usually response time is below 200 milliseconds). Because of the high response time at two of the processes, the ELB marked them as unhealthy and all requests were routed to the single remaining process (which had a response time of 20 seconds). Going through the logs didn't help much. After exploring a few articles, I understood that a deadlock in a thread pool can be one of the reasons. We use thread pools for blocking S3 calls and non-blocking DB calls; a different thread pool is used for each of these purposes.
executor {
  sync = {
    fork-join-executor {
      parallelism-factor = 1.0
      parallelism-max = 24
    }
  }
  async = {
    fork-join-executor {
      parallelism-factor = 1.0
      parallelism-max = 24
    }
  }
}
Can anyone help in understanding what could possibly have gone wrong?
All 3 nodes had the same build deployed, but only two of them went unresponsive. CPU on these unresponsive nodes was less than 10%.
Play: 2.5.14
Scala: 2.11.11
There are many things that can go wrong, and it's just a guessing game with the information you provided.
I'd start with capturing thread dumps of the JVM that is unresponsive. If you capture console logs of your app, one way to get the dump is sending signal 3 (SIGQUIT) to the JVM process.
Assuming you run your service in a Unix environment:
ps aux | grep java
Find the Java PID that runs your Play app, then:
kill -3 <pid>
On receiving signal 3, the JVM prints a thread dump to the console.
If the console is not available, do:
jstack -l <pid> >> threaddumps.log
Now you'll be able to see a snapshot of the state of your threads and, if there are blocked threads, where they are blocking.
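If you'd rather check for thread-pool deadlocks programmatically (for example from a health-check endpoint) instead of eyeballing dumps, the same information is exposed by ThreadMXBean; a minimal sketch, not tied to Play in any way:
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class DeadlockCheck {
    public static void main(String[] args) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        // covers both monitor deadlocks and ownable-synchronizer (Lock) deadlocks
        long[] deadlocked = threads.findDeadlockedThreads();
        if (deadlocked == null) {
            System.out.println("no deadlocked threads");
            return;
        }
        for (ThreadInfo info : threads.getThreadInfo(deadlocked, true, true)) {
            System.out.printf("%s is %s, blocked on %s owned by %s%n",
                    info.getThreadName(), info.getThreadState(),
                    info.getLockName(), info.getLockOwnerName());
        }
    }
}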

How to find the root cause of high CPU usage of Kafka brokers?

I am in charge of operating two Kafka clusters (one for prod and one for our dev environment). The setups are mostly similar, but the dev environment has no SASL/SSL and uses just 4 instead of 8 brokers. Each broker is assigned to a dedicated Google Kubernetes node with 4 vCPUs and 26 GB RAM.
On our dev environment we've got roughly 1000 messages in/sec, and each of the 4 brokers pretty consistently uses 3 out of the 4 available CPU cores (75% CPU usage).
On our prod environment we get about 1500 messages in/sec, and the CPU usage there is also 3 out of 4 cores.
It seems that CPU is the bottleneck for us, and I'd like to know how I can perform CPU profiling so that I know what exactly is causing the high usage. Since it's relatively consistent, I guess it could be our Snappy compression.
I am interested in any ideas on how I could investigate the cause of the high CPU usage and how I could tweak that in my cluster.
Apache Kafka version: 2.1 (CPU load used to be similar on Kafka 0.11.x too)
Dev Cluster (Snappy compression, no SASL/SSL, 4 Brokers): 1000 messages in / sec, 3 CPU cores consistent usage
Prod cluster (Snappy compression, SASL/SSL, 8 Brokers): 1500 messages in / sec, 3 CPU cores consistent usage
Side note: I already made sure producers send their messages Snappy-compressed. I have access to all JMX metrics, but I couldn't find anything useful for figuring out the CPU usage.
I already have the metrics in Prometheus (which is also where the CPU usage stats above come from). The problem is that the container's CPU usage doesn't tell me WHY it is that high. I need more granularity, e.g. what the CPU cycles are being spent on (compression? broker communication? SASL/SSL?).
If you have access to JMX metrics, you are almost done with CPU profiling. All you have to do is install Prometheus and Grafana, store the metrics in Prometheus, and monitor them with Grafana. You can find the complete steps in Monitoring Kafka.
Note: if you are suspicious about Snappy compression, maybe this performance test can help you.
Update:
Based on Confluent, most of the CPU usage is because of SSL.
Note that if SSL is enabled, the CPU requirements can be significantly higher (the exact details depend on the CPU type and JVM implementation).
You should choose a modern processor with multiple cores. Common clusters utilize 24 core machines.
If you need to choose between faster CPUs or more cores, choose more cores. The extra concurrency that multiple cores offers will far outweigh a slightly faster clock speed.
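If you need per-thread granularity rather than container-level CPU, one option is to rank broker threads by CPU time through the remote ThreadMXBean, assuming remote JMX is enabled on the brokers; a rough sketch, with the JMX URL as a placeholder:
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import javax.management.MBeanServerConnection;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class BrokerThreadCpu {
    public static void main(String[] args) throws Exception {
        // placeholder JMX URL: point it at one of the broker pods
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://kafka-broker-0:9999/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection conn = connector.getMBeanServerConnection();
            ThreadMXBean threads = ManagementFactory.newPlatformMXBeanProxy(
                    conn, ManagementFactory.THREAD_MXBEAN_NAME, ThreadMXBean.class);
            for (long id : threads.getAllThreadIds()) {
                ThreadInfo info = threads.getThreadInfo(id);
                long cpuNanos = threads.getThreadCpuTime(id); // -1 if disabled/unsupported
                if (info != null && cpuNanos > 0) {
                    System.out.printf("%-50s %10.1f ms cpu%n",
                            info.getThreadName(), cpuNanos / 1_000_000.0);
                }
            }
        } finally {
            connector.close();
        }
    }
}
Sampling this twice and diffing the numbers gives CPU spent per interval, and the thread names alone (network threads, request handler threads, log cleaner, etc.) usually hint at whether the time goes to SSL, compression, or request handling.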

Simulating very high loads using ApacheBench (ab)

Is it possible to simulate a very high load (~30k concurrent requests) using ApacheBench? I increased my ulimit to 30k and then load tested my server using ab -n 60000 -c 30000 .... Sometimes I get receive exceptions on ab, but the number of requests that ab is posting to my server is greater than what I am specifying (I monitored my server stats; it is getting more than 60k requests). Why am I getting this weird behaviour?