Load generated by distributed worker processes is not equivalent to load generated by a single process - locust

I started another test to figure out how users are allocated to worker nodes.
Here is my locust file.
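from locust import HttpUser, between

def mytask(self):
    # The task body was not shown in the original post; a GET on the host root is assumed.
    self.client.get("/")

class QuickstartUser(HttpUser):
    wait_time = between(1, 2)  # each user waits 1 to 2 seconds between tasks
    tasks = [mytask]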
It does nothing but access a Chinese search engine website that has never failed to respond.
When I start 30 users on a single node, the RPS is 20.
locust -f try_baidu.py
and got the running status and result shown below.
I then switched to distributed mode, using the following commands in 3 terminals on my computer.
locust -f try_baidu.py --master #for master node
locust -f try_baidu.py --worker --master-host= #for worker node each
I then entered the same number of users and hatch rate in the locust UI as above, namely 30 users and a hatch rate of 10.
I got the same RPS, about 20, and each worker node ran 15 users.
This shows that the number of users entered in the UI is the total number to simulate, and that it is spread across the worker nodes. It works like load balancing, sharing the burden of load generation.
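The split observed above (30 users in the UI, 15 on each of 2 workers) can be sketched as an even division of the total. This is only the arithmetic; `split_users` is a hypothetical helper, not a locust function, and locust's real dispatch logic is more involved.

```python
def split_users(total_users, num_workers):
    """Spread a total user count as evenly as possible over the workers."""
    base, extra = divmod(total_users, num_workers)
    # The first `extra` workers take one additional user each.
    return [base + 1 if i < extra else base for i in range(num_workers)]


print(split_users(30, 2))  # two workers, as in the test above
```

With 30 users and 2 workers each worker gets 15, which matches what the UI reported.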
But I don't know why the same number of users gives 2 different RPS values when running on a single node (Scenario 1) and distributed (Scenario 2). They should result in the same or similar RPS, as in the test above.
The only difference I can tell is that the comparison above ran on the same computer, while Scenario 2 has worker nodes on 2 remote Linux VMs. But is that the real reason?
My question may not have been asked very clearly, so I am adding some test results here to show what I get when running distributed and on a single node with the specified number of users.
Scenario 1: Single node
Scenario 2: Trying to simulate 3 worker processes, each running 30 users, but getting an even lower RPS.
From the console I can see that each worker process starts 30 users as expected, but I have no idea why the RPS is only 1/3 of the single-node value.
Scenario 3: Tripling the users to 90 for each worker process gives almost the same RPS as running on a single node.
It seems Scenario 3 is what I expected for triple the simulated load. But why does the locust graphic panel show each worker process running 90 users?
Scenario 4: To make sure locust truly distributes the specified users to each worker node, I put 30 users on a single worker node and got the same RPS as on a single node (not distributed).
Do I have to add up the total users distributed to the worker nodes and enter this total amount?
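As a rough sanity check on the RPS figures in these scenarios, the expected rate can be estimated from the user count and the wait time. This is a minimal sketch, assuming each user issues one request per loop and waits a uniformly random time between the two bounds; `estimated_rps` and its `response_time` parameter are hypothetical names for illustration.

```python
def estimated_rps(users, min_wait, max_wait, response_time=0.0):
    """Rough steady-state requests per second for users that loop request -> wait."""
    mean_wait = (min_wait + max_wait) / 2
    # Each user completes one request every (mean wait + response time) seconds.
    return users / (mean_wait + response_time)


print(estimated_rps(30, 1, 2))  # 30 users with wait_time = between(1, 2)
```

Under these assumptions 30 users with a negligible response time give about 20 RPS, which is in line with the single-node result, so an RPS well below that for the same total user count points at the load generation or reporting side rather than the target.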

The problem was solved to some extent by moving the master node to another system. I guess something was interfering with the master collecting the requests sent from one of the workers, which resulted in half of the expected RPS. This may suggest a possible solution when the RPS is lower than it should be.


First 10 long running transactions

I have a fairly small cluster of 6 nodes: 3 client nodes and 3 server nodes. Important configurations:
storeKeepBinary = true,
cacheMode = Partitioned (some caches, about 5-8 out of 25, have this as TRANSACTIONAL)
AtomicityMode = Atomic
backups = 1
readFromBackups = false
no persistence
When I run the app for a load/performance test on-prem on 2 large boxes, with 3 clients on one box and 3 servers on another, all within Docker containers, I get decent performance.
However, when I move them over to AWS and run them in EKS, the only change I make is switching the cluster discovery from standard TCP (default) to Kubernetes-based discovery, and then I run the same test.
But now the performance is very bad, and I keep getting:
WARN [sys-#145%test%] - [org.apache.ignite] First 10 long-running transactions [total=3]
Here the transactions have been running for more than a minute.
In other cases, I am getting:
WARN [sys-#196%test-2%] - [org.apache.ignite] First 10 long-running cache futures [total=1]
Here the associated future has been running for > 3 min.
Most of the places a Google search has taken me point to a flaky/inconsistent network as the cause.
The app and the test seem to be OK, since on the local on-prem setup this works just fine and the performance is decent as well.
I wanted to check whether others have faced this, or whether something else needs to be done when running on Kubernetes in the public cloud. For example, I read somewhere that nodes need to be pinned to the host in a cloud/virtual environment, but that it's not mandatory.

24-hour performance test execution stopped abruptly while running in a JMeter pod in AKS

I am running a 24-hour load test using JMeter in Azure Kubernetes Service. I am using the Throughput Shaping Timer in my jmx file. No listener is added to the jmx file.
My test stopped abruptly after 6 or 7 hours.
The jmeter-server.log file in the JMeter slave pod shows a warning: WARN k.a.j.t.VariableThroughputTimer: No free threads left in worker pool.
Below is a snapshot from the jmeter-server.log file.
Using JMeter version 5.2.1 and Kubernetes version 1.19.6.
I checked: the JMeter pods for master and slaves are continuously running (no restart happened) in AKS.
I gave the JMeter slave pod 2 GB of memory, but the load test still stopped abruptly.
I am using a Log Analytics workspace for logging. I checked the ContainerLog table and found no errors.
Snapshot of the JMX file.
Using the following elements: Thread Group, Throughput Controller, HTTP Request sampler and Throughput Shaping Timer.
Please suggest a solution.
It looks like your Schedule Feedback Function configuration is wrong in its last parameter.
The warning means that the Throughput Shaping Timer is trying to increase the number of threads to reach/maintain the desired concurrency, but it doesn't have enough threads to do so.
So either increase the spare threads ratio to be closer to 1 if you're using a float value for the percentage, or increase the absolute value to match the number of threads.
Quote from documentation:
Example function call: ${__tstFeedback(tst-name,1,100,10)} , where "tst-name" is name of Throughput Shaping Timer to integrate with, 1 and 100 are starting threads and max allowed threads, 10 is how many spare threads to keep in thread pool. If spare threads parameter is a float value <1, then it is interpreted as a ratio relative to the current estimate of threads needed. If above 1, spare threads is interpreted as an absolute count.
More information: Using JMeter’s Throughput Shaping Timer Plugin
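The rule quoted above for the spare threads parameter can be sketched as follows. This only restates the documented interpretation; it is not the plugin's code. `spare_threads` is a hypothetical helper, and both the rounding up and the treatment of a value of exactly 1 as an absolute count are assumptions.

```python
import math


def spare_threads(spare_param, threads_needed):
    """Interpret the last __tstFeedback argument as described in the quote."""
    if spare_param < 1:
        # A float below 1 is a ratio of the current estimate of threads needed.
        return math.ceil(spare_param * threads_needed)
    # Otherwise the value is taken as an absolute number of spare threads.
    return int(spare_param)


print(spare_threads(10, 40))   # absolute count
print(spare_threads(0.5, 40))  # ratio of the estimate
```

With an estimate of 40 needed threads, a parameter of 10 keeps 10 spare threads while 0.5 keeps 20, so a small absolute value can leave the pool short as the required thread count grows.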
However, this warning doesn't explain the premature termination of the test, so make sure there are no errors in the JMeter/k8s logs. One possible reason is that the JMeter process is being terminated by the OOM killer.

Kubernetes Handling a Sudden Request of Processing Power (Such as a Python Script using 5 Processes)

I have a little scenario that I am running in my mind with the following setup:
A Django Web Server running in Kubernetes with the ability to autoscale resources (Google Kubernetes Engine), and I set the resource values to be requesting nodes with 8 Processing Units (8 Cores) and 16 GB Ram.
Because it is a web server, I have my frontend that can call a Python script that executes with 5 Processes, and here's what I am worried about:
I know that if I run this script twice on my webserver (located in the same container as my Django code), I am going to be using (to keep it simple) 10 Processes/CPUs to execute this code.
So what would happen?
Would the first Python script be run on Pod 1 and the second Python script (since we used 5 out of the 8 processing units) trigger a Pod 2 and another Node, then run on that new replica with full access to 5 new processes?
Or, would the first Python script be run on Replica 1, and then the second Python script be throttled to 3 processing units because, perhaps, Kubernetes is allocating based on CPU usage in the Replica, not how many processes I called the script with?
If your system has a Django application that launches scripts with subprocess or a similar mechanism, those will always be in the same container as the main server, in the same pod, on the same node. You'll never trigger any of the Kubernetes autoscaling capabilities. If the pod has resource limits set, you could get CPU utilization throttled, and if you exceed the configured memory limit, the pod could get killed off (with Django and all of its subprocesses together).
If you want to take better advantage of Kubernetes scheduling and resource management, you may need to restructure this application. Ideally you could run the Django server and each of the supporting tasks in a separate pod. You would then need a way to trigger the tasks and collect the results.
I'd generally build this by introducing a job queue system such as RabbitMQ or Celery into the mix. The Django application adds items to the queue, but doesn't directly do the work itself. Then you have a worker for each of the processes that reads the queue and does work. This is not directly tied to Kubernetes, and you could run this setup with a RabbitMQ or Redis installation and a local virtual environment.
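The producer/worker split described above can be sketched with Python's standard library alone. This is a minimal stand-in, assuming an in-process `queue.Queue` in place of a real broker such as RabbitMQ and threads in place of separate worker processes; `run_jobs` and the doubling "work" are hypothetical placeholders.

```python
import queue
import threading


def run_jobs(jobs, num_workers=2):
    """Feed (job_id, payload) pairs through a queue and collect worker results."""
    work = queue.Queue()
    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            item = work.get()
            if item is None:  # sentinel: no more work for this worker
                break
            job_id, payload = item
            outcome = payload * 2  # placeholder for the real script
            with lock:
                results[job_id] = outcome

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for job in jobs:  # the web application only enqueues work
        work.put(job)
    for _ in threads:  # one sentinel per worker
        work.put(None)
    for t in threads:
        t.join()
    return results


print(run_jobs([("a", 1), ("b", 2), ("c", 3)]))
```

The point of the design is that the code adding jobs never runs them, so the workers can later be moved to their own processes, containers, or pods without changing the producer.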
If you deploy this setup to Kubernetes, then each of the tasks can run in its own deployment, fed by the work queue. Each task could run up to its own configured memory and CPU limits, and they could run on different nodes. With a little extra work you can connect a horizontal pod autoscaler to scale the workers based on the length of the job queue, so if you're running behind processing one of the tasks, the HPA can cause more workers to get launched, which will create more pods, which can potentially allocate more nodes.
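The scaling decision the horizontal pod autoscaler makes follows the documented rule desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue). A minimal sketch of that arithmetic, assuming the metric is the queue backlog per worker; the function name, the bounds, and the sample numbers are hypothetical, and the real controller also applies a tolerance and stabilization behavior that are omitted here.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10):
    """Replica count the HPA formula would ask for, clamped to the allowed range."""
    raw = math.ceil(current_replicas * (current_metric / target_metric))
    return max(min_replicas, min(max_replicas, raw))


# 2 workers, 30 queued jobs per worker on average, target of 10 per worker.
print(desired_replicas(2, 30, 10))
```

In this example the backlog is three times the target, so the worker deployment would be scaled from 2 to 6 pods, and the cluster autoscaler would only add nodes if those pods cannot be scheduled.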
The important detail here, though, is that a pod is the key unit of scaling; if all of your work stays within a single pod then you'll never trigger the horizontal pod autoscaler or the cluster autoscaler.

Running Parallel Tasks in Batch

I have a few questions about running tasks in parallel in Azure Batch. Per the official documentation, "Azure Batch allows you to set maximum tasks per node up to four times (4x) the number of node cores."
Is there a setup other than specifying the max tasks per node when creating a pool, that needs to be done (to the code) to be able to run parallel tasks with batch?
So if I am understanding this correctly, if I have a Standard_D1_v2 machine with 1 core, I can run up to 4 concurrent tasks in parallel on it. Is that right? If yes, I ran some tests and I am not quite sure about the behavior I got. In a pool of D1_v2 machines set up to run 1 task per node, I get a job execution time of about 16 min. Then, using the same applications and parameters, with the only change being a new pool with the same setup (also D1_v2) but running 4 tasks per node, I still get a job execution time of about 15 min. There wasn't any improvement in the job execution time from running tasks in parallel. What could be happening? What am I missing here?
I ran a test with a pool of D3_v2 machines with 4 cores, set up to run 2 tasks per core for a total of 8 tasks per node, and another test with a pool (same number of machines as the previous one) of D2_v2 machines with 2 cores, set up to run 2 tasks per core for a total of 4 parallel tasks per node. The run time/job execution time was the same for both tests. Isn't there supposed to be an improvement, considering that 8 tasks are running per node in the first test versus 4 tasks per node in the second? If yes, what could be the reason I'm not getting this improvement?
No. Although you may want to look into the task scheduling policy, compute node fill type to control how your tasks are distributed amongst nodes in your pool.
How many tasks are in your job? Are your tasks compute-bound? If so, you won't see any improvement (perhaps even end-to-end performance degradation).
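Why compute-bound tasks gain nothing from extra task slots can be seen with an idealized model. This is a sketch under stated assumptions: every task is single-threaded and purely CPU-bound, tasks run in waves of `slots` at a time, and the cores are shared evenly. The function name and the figures (16 tasks of 1 CPU-minute each) are hypothetical, chosen only to resemble the 1-core pool in the question.

```python
import math


def estimated_job_minutes(n_tasks, task_cpu_minutes, cores, slots):
    """Idealized job time on one node for CPU-bound, single-threaded tasks."""
    waves = math.ceil(n_tasks / slots)   # how many batches of tasks are needed
    slowdown = max(1.0, slots / cores)   # tasks slow down once slots exceed cores
    return waves * task_cpu_minutes * slowdown


print(estimated_job_minutes(16, 1, cores=1, slots=1))
print(estimated_job_minutes(16, 1, cores=1, slots=4))
```

Both calls give the same total under this model: running 4 tasks at once on 1 core needs a quarter of the waves, but each wave takes 4 times as long.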
Batch merely schedules the tasks concurrently on the node. If the command/process that you're running utilizes all of the cores on the machine and is compute-bound, you won't see an improvement. You should double check your tasks start and end times within the job and the node execution info to see if they are actually being scheduled concurrently on the same node.
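One way to do the check suggested above is to take each task's node, start time, and end time and compute the peak number of tasks that overlapped on each node. A minimal sketch, assuming those values have already been exported into plain tuples; the helper name and the sample data are hypothetical, and no Azure Batch SDK call is shown.

```python
from collections import defaultdict


def max_concurrency_per_node(tasks):
    """tasks: iterable of (node_id, start, end). Returns {node_id: peak overlap}."""
    events = defaultdict(list)
    for node_id, start, end in tasks:
        events[node_id].append((start, 1))   # a task starts
        events[node_id].append((end, -1))    # a task ends
    peaks = {}
    for node_id, node_events in events.items():
        running = peak = 0
        # Sorting puts an end (-1) before a start (+1) at the same instant,
        # so back-to-back tasks are not counted as overlapping.
        for _, delta in sorted(node_events):
            running += delta
            peak = max(peak, running)
        peaks[node_id] = peak
    return peaks


sample = [("n1", 0, 10), ("n1", 5, 15), ("n1", 10, 20), ("n2", 0, 5), ("n2", 5, 10)]
print(max_concurrency_per_node(sample))
```

A peak of 1 on every node would mean the tasks ran one after another despite the higher tasks-per-node setting, while a peak equal to the setting with no change in job time points back at compute-bound tasks.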

NServiceBus worker endpoints

I'm currently evaluating the distributor in NSB and noticed that when I run the distributor and a couple of workers on my own machine, the queue name for each worker has a Guid appended.
According to Udi, the master himself :), in this post: Distributor and worker end point queue in same machine
The reason is that NSB assumes you are running in a test setup.
But what happens if I run 4 workers on 1 separate machine?
Will the queue names on that machine again be appended with a Guid OR are the workers capable of sharing the same queue just because the distributor is on a remote machine?
The question is important as I did expect to have multiple workers on 1 remote machine and generating new queue names every time the machine is booted is not a good idea for maintenance purposes.
Kind regards
But what happens if I run 4 workers on 1 separate machine?
Why would you want to do that?
Each worker can be configured to run multiple worker threads. This is why it doesn't make sense to run multiple workers on a single machine ...
I would increase the number of threads a single worker is using until the max throughput is reached on that machine. Then, scale out to another machine ... so, one worker per box, multiple threads per worker
See here for details on NumberOfWorkerThreads configuration