24-hour performance test execution stopped abruptly when running in a JMeter pod in AKS - kubernetes

I am running a 24-hour load test using JMeter in Azure Kubernetes Service. I am using a Throughput Shaping Timer in my JMX file. No listener is added as part of the JMX file.
My test stopped abruptly after 6 or 7 hours.
The jmeter-server.log file in the JMeter slave pod shows this warning --> WARN k.a.j.t.VariableThroughputTimer: No free threads left in worker pool.
Below is a snapshot from the jmeter-server.log file.
Using JMeter version 5.2.1 and Kubernetes version 1.19.6.
I checked that the JMeter master and slave pods are continuously running in AKS (no restarts happened).
I provided 2 GB of memory to the JMeter slave pod, but the load test still stopped abruptly.
I am using a Log Analytics workspace for logging. I checked the ContainerLog table and found no errors.
Snapshot of the JMX file.
Using the following elements -> Thread Group, Throughput Controller, HTTP Request Sampler and Throughput Shaping Timer.
Please suggest a solution.

It looks like your Schedule Feedback Function configuration is wrong in its last parameter.
The warning means that the Throughput Shaping Timer attempts to increase the number of threads to reach/maintain the desired concurrency, but it doesn't have enough threads in the worker pool to do this.
So either increase the spare threads ratio to be closer to 1 if you're using a float value as a percentage (for example, 0.5 keeps half of the currently estimated threads spare), or increase the absolute value so that it matches the number of threads you need.
Quote from documentation:
Example function call: ${__tstFeedback(tst-name,1,100,10)} , where "tst-name" is name of Throughput Shaping Timer to integrate with, 1 and 100 are starting threads and max allowed threads, 10 is how many spare threads to keep in thread pool. If spare threads parameter is a float value <1, then it is interpreted as a ratio relative to the current estimate of threads needed. If above 1, spare threads is interpreted as an absolute count.
More information: Using JMeter’s Throughput Shaping Timer Plugin
However, this doesn't explain the premature termination of the test, so make sure there are no errors in the JMeter/Kubernetes logs. One possible reason is that the JMeter process is being terminated by the OOMKiller.
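If the OOMKiller does turn out to be the culprit, one option is to give the slave container an explicit memory limit and keep the JMeter heap well below it. Below is a minimal sketch, assuming the pod starts JMeter via the standard bin/jmeter-server script (which picks up the HEAP environment variable); the Deployment name, labels and image are hypothetical placeholders:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: jmeter-slave                      # hypothetical name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: jmeter-slave
  template:
    metadata:
      labels:
        app: jmeter-slave
    spec:
      containers:
        - name: jmeter-slave
          image: my-registry/jmeter:5.2.1     # placeholder image
          env:
            - name: HEAP                      # read by the standard bin/jmeter and bin/jmeter-server start scripts
              value: "-Xms1g -Xmx1536m"       # keep max heap well below the container memory limit
          resources:
            requests:
              memory: "2Gi"
            limits:
              memory: "2Gi"                   # exceeding this limit is what triggers the OOMKiller

The point is simply that the maximum heap plus metaspace and native overhead must fit inside the container limit; depending on how the container entrypoint is set up, an OOM kill of the Java process can happen without an obvious pod restart.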

Related

ActiveMQ Artemis produce/consume latency issue

I have been monitoring the end-to-end latency of my microservice applications. Each service is loosely coupled via an ActiveMQ Artemis queue.
-------------     -------------     -------------
| Service 1 | --> | Service 2 | --> | Service 3 |
-------------     -------------     -------------
Service 1 listens as an HTTP endpoint and produces to queue 1. Service 2 consumes from queue 1, modifies the message, & produces to queue 2. Service 3 consumes from queue 2. Each service inserts a row into a separate database table; from there I can also monitor latency. So "end-to-end" is going into "Service 1" and coming out of "Service 3".
Each service's processing time remains steady, and most messages have a reasonable e2e latency of a few milliseconds. I produce at a constant rate of 400 req/sec using JMeter, and I can monitor this via Grafana.
Sporadically I notice a dip in this constant rate which can be seen throughout the chain. At first I thought it could be the producer side (Service 1), since the rate suddenly dropped to 370 req/sec and might be attributable to GC or possibly a fault in the JMeter HTTP simulator, but this does not explain why certain messages' e2e latency jumps to ~2-3 seconds.
Since it would be hard to reproduce my scenario, I checked out this load generator for ActiveMQ Artemis and bumped the versions up to 2.17.0, 5.16.2 & 0.58.0 to match my broker (2.17.0), which is a cluster of 2 masters/slaves using NFSv4 shared storage.
The command below generated 5,000,000 messages to a single queue q6, with 4 producers/consumers and a max overall produce rate of 400. Messages are persistent. The only code change in the artemis-load-generator was in ConsumerLatencyRecorderTask: when elapsedTime > 1 sec, I print out the message ID and latency.
java -jar destination-bench.jar --persistent --bytes 1000 --protocol artemis --url tcp://localhost:61616?producerMaxRate=100 --out /tmp/test1.txt --name q6 --iterations 5000000 --runs 1 --warmup 20000 --forks 4 --destinations 1
From this I noticed that there were outlier messages with produce/consume latency nearing 2 seconds. Most (90.00%) were below 3358.72 microseconds.
I am not sure why or how this happens. Is this reasonable?
EDIT/UPDATE
I have run the test a few times; this is the output of a shorter run.
java -jar destination-bench.jar --persistent --bytes 1000 --protocol artemis --url tcp://localhost:61616?producerMaxRate=100 --out ~/test-perf1.txt --name q6 --iterations 400000 --runs 1 --warmup 20000 --forks 4 --destinations 1
The result is below
RUN 1 EndToEnd Throughput: 398 ops/sec
**************
EndToEnd SERVICE-TIME Latencies distribution in MICROSECONDS
mean 10117.30
min 954.37
50.00% 1695.74
90.00% 2637.82
99.00% 177209.34
99.90% 847249.41
99.99% 859832.32
max 5939134.46
count 1600000
The JVM thread status is what I am noticing in my actual system: on the broker there are a lot of threads in TIMED_WAITING, and where there are spikes, push-to-queue latency seems to increase.
Currently, as I said, my data is hosted on NFSv4 as shown. I read in the Artemis persistence section that:
If the journal is on a volume which is shared with other processes which might be writing other files (e.g. bindings journal, database, or transaction coordinator) then the disk head may well be moving rapidly between these files as it writes them, thus drastically reducing performance.
Should I move the bindings folder off NFS onto the VM's disk? Will this improve performance? It is unclear to me.
How does this affect Shared Store HA?
I started a fresh, default instance of ActiveMQ Artemis 2.17.0, cloned and built the artemis-load-generator (with a modification to alert immediately on messages that take > 1 second to process), and then ran the same command you ran. I let the test run for about an hour on my local machine, but I didn't let it finish because it was going to take over 3 hours (5 million messages at 400 messages per second). Out of roughly 1 million messages I saw only 1 "outlier" - certainly nothing close to the 10% you're seeing. It's worth noting that I was still using my computer for my normal development work during this time.
At this point I have to attribute this to some kind of environmental issue, e.g.:
Garbage Collection
Low performance disk
Network latency
Insufficient CPU, RAM, etc.

K8s job memory consumption after job finishes

On the dashboard below, the yellow line represents the memory limit, the red line the memory request, and the green line memory usage. Why is memory consumption still being reported by Prometheus after job completion? I checked the job logs, and the job completed at the same time the memory request and limit went to 0. The job TTL is set to 60 seconds, so I think that's not related.
prometheus grafana metrics
Factually, a completed job means the process is no longer running, let alone consuming any resources. So what you see is probably due to a delay in the metrics refresh period.
Keep in mind that K8s-related metrics like resource requests are reported by scraping information from the K8s API server, whereas actual resource consumption is reported by different infrastructure components, e.g. the Metrics Server. Those systems will likely have different refresh periods, which explains the discrepancy when you aggregate them on the same chart.
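As a concrete illustration of those differing refresh periods, a typical Prometheus setup scrapes the two sources as separate jobs, each with its own interval. A minimal sketch, assuming kube-state-metrics exposes the request/limit series and the kubelet's cAdvisor endpoint exposes actual usage (job names, target and intervals are illustrative; auth/TLS settings omitted):

scrape_configs:
  - job_name: kube-state-metrics            # requests/limits, derived from K8s API objects
    scrape_interval: 30s
    static_configs:
      - targets: ['kube-state-metrics.kube-system.svc:8080']
  - job_name: kubernetes-cadvisor           # actual container memory usage reported by the kubelet
    scrape_interval: 60s
    metrics_path: /metrics/cadvisor
    kubernetes_sd_configs:
      - role: node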

K8s cluster memory decreases when running an Apache Flink Job

We are trying to deploy an Apache Flink job on a K8s cluster, but we are noticing an odd behavior: when we start our job, the task manager memory starts at the amount assigned, in our case 3 GB.
taskmanager.memory.process.size: 3g
Eventually, the memory starts decreasing until it reaches about 160 MB; at that point it recovers a little memory, so it doesn't run out completely.
That very low memory often causes the job to be terminated due to a task manager heartbeat exception, even when just trying to watch the logs on the Flink dashboard or while the job is processing.
Why is it going so low on memory? We expected that behavior, but in the range of GB, because we assigned those 3 GB to the task manager. Even if we change the task manager memory size, we see the same behavior.
Our Flink conf looks like this:
flink-conf.yaml: |+
  taskmanager.numberOfTaskSlots: 1
  blob.server.port: 6124
  taskmanager.rpc.port: 6122
  taskmanager.memory.process.size: 3g
  metrics.reporters: prom
  metrics.reporter.prom.class: org.apache.flink.metrics.prometheus.PrometheusReporter
  metrics.reporter.prom.port: 9999
  metrics.system-resource: true
  metrics.system-resource-probing-interval: 5000
  jobmanager.rpc.address: flink-jobmanager
  jobmanager.rpc.port: 6123
Is there a recommended configuration on K8s for memory, or is there something that we are missing in our flink-conf.yaml?
Thanks.
Your configuration looks fine. It's most likely an issue with your code and some kind of memory leak. This is a very good answer describing what may be the problem.
You can try setting a limit on the JVM heap with taskmanager.memory.task.heap.size so that you give the JVM some extra room to do GC, etc. But in the end, if you keep allocating objects that are never released, you will still run into this situation.
Presumably you are using your memory to store your state, in which case you can also try RocksDB as a state backend if you are storing large objects.
What are your requests/limits in your deployment templates? If there are no request sizes specified, you may be seeing your cluster resources get eaten up.
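For what it's worth, below is a minimal sketch of the tuning mentioned above, reusing the flink-conf.yaml block from the question plus a resources block on the TaskManager container; the specific sizes, the checkpoint path and the idea of matching the limit to taskmanager.memory.process.size are illustrative assumptions, not verified recommendations:

flink-conf.yaml: |+
  taskmanager.memory.process.size: 3g
  taskmanager.memory.task.heap.size: 1536m    # explicit cap on the task heap, leaving headroom for GC, framework and managed memory
  state.backend: rocksdb                      # keeps large state off the JVM heap
  state.checkpoints.dir: file:///checkpoints  # illustrative checkpoint location

# and on the TaskManager container in the deployment template:
resources:
  requests:
    memory: "3Gi"
  limits:
    memory: "3Gi"   # roughly matching taskmanager.memory.process.size so the pod is neither overcommitted nor silently OOM-killed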

Airflow Memory Error: Task exited with return code -9

According to both of these links (Link1 and Link2), my Airflow DAG run is returning the error INFO - Task exited with return code -9 due to an out-of-memory issue. My DAG run has 10 tasks/operators, and each task simply:
makes a query to get one of my BigQuery tables, and
writes the results to a collection in my Mongo database.
The sizes of the 10 BigQuery tables range from 1 MB to 400 MB, and the total size of all 10 tables is ~1 GB. My Docker container has a default of 2 GB of memory, and I've increased this to 4 GB; however, I am still receiving this error from a few of the tasks. I am confused about this, as 4 GB should be plenty of memory. I am also concerned because, in the future, these tables may become larger (a single table query could be 1-2 GB), and I'd like to avoid these return code -9 errors at that time.
I'm not quite sure how to handle this issue, since the point of the DAG is to transfer data from BigQuery to Mongo daily, so the queries / in-memory data for the DAG's tasks are necessarily fairly large, given the size of the tables.
As you said, the error message you get corresponds to an out of memory issue.
Referring to the official documentation:
DAG execution is RAM limited. Each task execution starts with two Airflow processes: task execution and monitoring. Currently, each node can take up to 6 concurrent tasks. More memory can be consumed, depending on the size of the DAG.
High memory pressure in any of the GKE nodes will lead the Kubernetes scheduler to evict pods from nodes in an attempt to relieve that pressure. While many different Airflow components are running within GKE, most don't tend to use much memory, so the case that happens most frequently is that a user uploaded a resource-intensive DAG. The Airflow workers run those DAGs, run out of resources, and then get evicted.
You can check it with the following steps:
In the Cloud Console, navigate to Kubernetes Engine -> Workloads
Click on airflow-worker, and look under Managed pods
If there are pods that show Evicted, click each evicted pod and look for the "The node was low on resource: memory" message at the top of the window.
What are the possible ways to fix the OOM issue?
Create a new Cloud Composer environment with a larger machine type than the current machine type.
Ensure that the tasks in the DAG are idempotent, which means that the result of running the same DAG run multiple times should be the same as the result of running it once.
Configure task retries by setting the number of retries on the task - this way when your task gets -9'ed by the scheduler it will go to up_for_retry instead of failed
Additionally you can check the behavior of CPU:
In the Cloud Console, navigate to Kubernetes Engine -> Clusters
Locate Node Pools at the bottom of the page, and expand the default-pool section
Click the link listed under Instance groups
Switch to the Monitoring tab, where you can find CPU utilization
Ideally, the GCE instances shouldn't be running at over 70% CPU at all times, or the Composer environment may become unstable under heavy resource usage.
I hope you find the above pieces of information useful.
I am going to chunk the data so that less is loaded into any 1 task at any given time. I'm not sure yet whether I will need to use GCS/S3 for intermediary storage.

Queue mode in HPC Pack 2016 can scale up?

I'm working on a project where we need to execute a lot of jobs (say 60,000 jobs) each time in an HPC cluster.
From the HPC documentation, I noticed HPC has 2 modes:
- Queued mode: Start jobs in queue order, and attempt to allocate the maximum requested resources to running jobs.
- Balanced mode: Attempt to start all incoming jobs as soon as possible at their minimum resource requirements
https://learn.microsoft.com/en-us/powershell/high-performance-computing/understanding-policy-configuration?view=hpc16-ps
But I'm not sure about the tolerance of this Balanced mode in HPC. Can it scale like other queue services such as SQS in AWS or Queue Storage in Azure?
Queued mode and Balanced mode refer to the scheduling policy; there is no real queue involved.
Basically, in Queued mode your jobs run in a FIFO manner. If the next job needs more cores than are currently available, it waits despite the fact that there are available resources. In Balanced mode, HPC Pack tries to run as many jobs as possible at the same time and makes a best effort to ensure they use the same amount of resources, in order to maximize resource usage.
Neither policy affects the scale target of the cluster.