How to get accurate JMeter load testing results with multiple web service pods - Kubernetes

I am new to JMeter load testing. We have set up distributed JMeter load testing on AWS EC2, with 1 master and 5 slaves.
So I am testing my web service endpoint (HTTP Request) in JMeter.
Details:
Tested my web service and reached 1,900 threads with a 0% error rate against 1 pod with 8Gi RAM.
Now I have deployed my web service in 2 pods (replicas) with 8Gi RAM each, yet the error rate is above 0% even though the thread count was set to only 2,100.
I checked the logs and there were no errors; the pods' health was fine, and the database CPU utilization was fine as well.
Our expectation is that, since we now have 2 pods, the service can accommodate twice the 1,900 threads (just as it did with 1 pod).
Did I miss something to check? I hope someone can shed some light on this :(
It has been bugging me for 12 hours now.
Thank you in advance.
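One thing worth verifying before assuming 2 pods give 2x capacity is whether the load is actually spread evenly across both pods; long-lived keep-alive connections from the JMeter slaves can stick to one pod. Below is a minimal sketch (not part of the original setup) that tallies which pod served each request; it assumes, hypothetically, that the service echoes the serving pod's hostname in an `X-Pod-Name` response header.

```python
# Hypothetical sketch: count how many requests each pod actually serves.
# Assumes the service returns the pod's hostname in an "X-Pod-Name" header,
# which the original setup may not have - adapt to however your pods can
# identify themselves.
import collections
import requests

SERVICE_URL = "http://my-service.example.com/endpoint"  # placeholder URL

counts = collections.Counter()
for _ in range(1000):
    resp = requests.get(SERVICE_URL, timeout=5)
    counts[resp.headers.get("X-Pod-Name", "unknown")] += 1

for pod, n in counts.most_common():
    print(f"{pod}: {n} requests")
```

If one pod turns out to take far more than half the traffic, the bottleneck is the load balancing in front of the pods rather than the pods themselves.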

Related

Internal k8s services communication not balanced

I’m running a k8s cluster on aws-eks.
I have two services A and B.
Service A listens to a RabbitMQ queue and sends an HTTP request to service B (which takes a couple of seconds).
Both services scale based on the number of messages in the queue.
The problem is that the requests from A to B are not balanced.
When scaled to about 100 pods each, I see that service A pods send requests to only about 60% of service B pods at any given time.
Eventually all pods get messages, but some pods are at 100% CPU receiving 5 messages at a time, while others are at 2% CPU receiving 1 message every minute or so.
That obviously causes poor performance and timeouts.
I've read that it should work in round robin, but when I set 10 fixed replicas of each service (all pods already up and running) and pushed 10 messages to the queue, I saw that every service A pod pulled a message to send to service B, yet some service B pods never received any request while others received more than one - resulting in one whole process finishing within 4 seconds while another took about 12 seconds.
Any ideas for why it’s working like that and how to change it to be more balanced?
Thanks
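For what it's worth, when service A talks to service B through a normal ClusterIP Service, kube-proxy balances per TCP connection, not per request, so a pool of keep-alive connections can pin many requests to the same few B pods. Below is a minimal sketch of one common workaround, assuming a plain in-cluster Service named `service-b` (names and paths here are placeholders, not taken from the question):

```python
# Sketch: force a fresh TCP connection per request so kube-proxy gets a
# chance to pick a (possibly different) service B pod each time.
# "service-b" and "/process" are placeholder names.
import requests

def call_service_b(payload: dict) -> dict:
    resp = requests.post(
        "http://service-b/process",       # in-cluster Service DNS name
        json=payload,
        headers={"Connection": "close"},  # do not keep the connection alive
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()
```

Alternatives with the same intent are a headless Service with client-side balancing, or putting a per-request load balancer (an ingress or mesh sidecar) between A and B.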

Is a RabbitMQ queueing system unnecessary in a Kubernetes cluster?

I have just been certified CKAD (Kubernetes Application Developer) by The Linux Foundation.
Now I am wondering: is a RabbitMQ queueing system unnecessary in a Kubernetes cluster?
We use workers with a queueing system in order to avoid the 30-second HTTP timeout. Say, for example, we have a microservice that generates big PDF documents in 50 seconds each on average, and you have 20 documents to generate right now; the classical scheme would be a worker that queues and processes the documents one by one (this is the case at the company I have been working for lately).
But in a Kubernetes cluster there is, by default, no timeout for HTTP requests going inside the cluster. You can wait 1,000 seconds without any issue (20 documents * 50 seconds = 1,000 seconds).
Given that last point, is it enough to say that a RabbitMQ queueing system (via the amqplib module) is useless in a Kubernetes cluster? Moreover, Kubernetes handles load balancing across your microservice replicas so well...
But in a Kubernetes cluster there is, by default, no timeout for HTTP requests going inside the cluster.
Not sure where you got that idea. Depending on your config there might be no timeouts at the proxy level, but there are still client and server timeouts to consider. Kubernetes doesn't change what you deploy, just how you deploy it. There are certainly other options than RabbitMQ specifically, and other system architectures you could consider, but "queue workers" is still a very common pattern and likely will be forever, even as the tech around it changes.
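The value of the worker-queue pattern here is not only about HTTP timeouts: it also gives you retries, backpressure, and the ability to scale workers on queue depth. As a rough illustration only, here is a minimal worker sketch in Python using pika, as a stand-in for the Node amqplib worker described in the question (the queue name and `generate_pdf` are placeholders):

```python
# Minimal queue-worker sketch (Python + pika). Each message is one PDF job.
# The message is acknowledged only after generation finishes, so a job on a
# crashed pod is redelivered to another worker instead of being lost.
import pika

def generate_pdf(body: bytes) -> None:
    ...  # the ~50-second document generation would go here

def on_message(channel, method, properties, body):
    generate_pdf(body)
    channel.basic_ack(delivery_tag=method.delivery_tag)

connection = pika.BlockingConnection(pika.ConnectionParameters(host="rabbitmq"))
channel = connection.channel()
channel.queue_declare(queue="pdf-jobs", durable=True)
channel.basic_qos(prefetch_count=1)  # at most one in-flight job per worker pod
channel.basic_consume(queue="pdf-jobs", on_message_callback=on_message)
channel.start_consuming()
```

A single long-running HTTP request gives you none of that: if the pod handling the 1,000-second request dies, the work is simply gone.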

HAProxy reverse SSL termination: memory keeps growing. Memory leak?

I have HAProxy 2.5.1 in an SSL termination config running in a container of a Kubernetes pod; the backend is a Scala app that runs in another container of the same pod.
I have seen that I can put 500K connections on the setup, and the RSS memory usage of HAProxy is then 20GB. If I remove the traffic and wait 15 minutes, the RSS drops to 15GB; but if I repeat the same exercise one or two more times, RSS for HAProxy hits 30GB and HAProxy gets killed, as I have a 30GB limit for HAProxy in the pod.
The question here is whether this behavior of continuous memory growth is expected.
Here is the incoming traffic:
And here is the memory usage chart, which shows how after 3 cycles of placing load and removing load, the RSS memory reached 30GB and the process got killed (as an observation, the two charts have different timezones, but they belong to the same run).
We switched from the Alpine-based image (musl) to a glibc-based image, and that solved the problem. We got a 5x increase in connection rate, and the memory growth is gone too.

Kube cluster under high load has timeout errors when connecting to an external service

I have two Azure Kubernetes clusters, each connecting to MS Dynamics. Under high load (hundreds of calls/min), the connections start throwing timeout errors (org.apache.http.conn.HttpHostConnectException) about 10% of the time.
And this error seems to come from specific nodes. So if there are 20 pods spread across 8 nodes, only the pods on, say, 4 of the nodes will experience the error. The other pods/nodes are handling just as many calls, but with no errors.
This holds over many days too: the same pods/nodes have had the error over the course of 10 days.
Has anyone experienced this?
How can I troubleshoot this?
NOTE: These are not 429 or connection-refused errors.
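If it helps to narrow it down, one crude way to confirm that failures really cluster on particular nodes is to tag every outbound Dynamics call with the node the pod is running on and count connection failures per node. A rough sketch follows (purely illustrative, not the asker's code; it assumes `NODE_NAME` is injected into the pod via the Kubernetes downward API, i.e. an env var with `fieldRef: spec.nodeName`):

```python
# Illustrative sketch: count outbound connection failures per node.
# NODE_NAME is assumed to be injected via the downward API; the URL is a
# placeholder for the actual Dynamics endpoint.
import collections
import os
import requests

NODE_NAME = os.environ.get("NODE_NAME", "unknown-node")
DYNAMICS_URL = "https://example.crm.dynamics.com/api/data"  # placeholder

failures = collections.Counter()

def call_dynamics(path: str) -> requests.Response:
    try:
        return requests.get(f"{DYNAMICS_URL}/{path}", timeout=10)
    except requests.exceptions.ConnectionError:
        failures[NODE_NAME] += 1
        print(f"connection failure on {NODE_NAME} (total {failures[NODE_NAME]})")
        raise
```

With per-node counts in hand, you can compare the configuration and outbound connection usage of the bad nodes against the healthy ones.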

`kubectl get pods` has high latency

I am attempting to identify and fix the source of high latency when running `kubectl get pods`.
I am running Kubernetes 1.1.4 on AWS.
When running the command from the afflicted cluster's master host, I consistently get response times of 6s.
Other queries, such as `get svc` and `get rc`, return on the order of 20ms.
Running get pods on a mirror cluster returns in 150ms.
I've crawled through master logs and system stats, but have not identified the issue.
We sped up LIST operations in 1.2. You might be interested in reading about the updates to Kubernetes performance and scalability in 1.2.
Chris - how big is your cluster and how many pods do you have in it?
Obviously, the time it takes to return the response will be longer if the result is larger.
Also, what do you mean by "running on a mirror cluster returns in 150ms"? What is a "mirror cluster"?