How do I implement a spike test with Locust? - locust

I would like to run a spike test using Locust.IO where a large number of requests are made in parallel to my service.
I have experimented with Locust and this is the type of command I would like to run:
locust -f --headless --host --users 1000 --hatch-rate 1000 --run-time 5s
While running this test no requests are made. I have also tried extending the run time to 60 seconds and no requests are made.
Is there a way of running this type of test in locust?

Here is an example of how to make Users wait until they have all been started.
This way you can use a more reasonable hatch rate (maybe 50/s per load generator) but still have all your users start in the same second (more or less).
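import time
from locust import HttpUser, task
class MyUser(HttpUser):
    @task
    def t(self):
        while self.environment.runner.user_count < self.environment.runner.target_user_count:
            time.sleep(1)  # wait until every user has been spawned
        # do your stuff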
You may still run into Python/OS/network problems caused by creating too many outgoing network connections in a short period, so you may need to combine this with multiple processes and maybe even multiple load generators.
You might be able to work around this if you first do a dummy request against the server before the sleep, so that the connection is already established (although this may reduce the realism of your test a little).
You might also want to consider subclassing FastHttpUser instead of HttpUser.
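To see the "ramp up slowly, then fire together" idea in isolation, here is a plain-Python analogue. It is only an illustration built on the standard library's threading.Barrier, not Locust's API; run_spike, the worker count and the 10 ms ramp delay are made up for the example.

```python
import threading
import time


def run_spike(num_workers: int) -> list[float]:
    """Start workers gradually, then release them all at the same moment."""
    barrier = threading.Barrier(num_workers)
    fire_times: list[float] = []
    lock = threading.Lock()

    def worker() -> None:
        barrier.wait()  # block until every worker has been started
        now = time.monotonic()  # a real test would send its request here
        with lock:
            fire_times.append(now)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
        time.sleep(0.01)  # gentle ramp-up, playing the role of the hatch rate
    for t in threads:
        t.join()
    return fire_times


times = run_spike(20)
print(f"spread between first and last worker: {max(times) - min(times):.4f}s")
```

The ramp-up takes roughly 0.2 s here, but the workers should all fire within a much smaller window, which is the same effect the user_count check gives you in Locust.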


Limiting the number of times an endpoint of Kubernetes pod can be accessed?

I have a machine learning model inside a Docker image. I pushed the image to Google Container Registry and then deployed it in a Kubernetes pod. A FastAPI application runs on port 8000, and this FastAPI endpoint is public (call it mymodel:8000).
The structure of the FastAPI app is:
async def get_homepage()
async def get_modelpage()    # served at "/model"
async def get_results(query: Form(...))
Users can enter a query, submit it, and get results from the machine learning model running inside the container. I want to limit the number of queries that can be made by all users combined. So if the query limit is 100, all the users together can make only 100 queries in total.
I thought of a way to do this:
Keep a database that stores the number of times the GET and POST methods have been called. As soon as the total number of POST calls crosses the limit, stop accepting any more queries.
Is there an alternative way of doing this using Kubernetes limits? For example, could I define a limit_api_calls such that the total number of times mymodel:8000 is accessed is at most limit_api_calls?
I looked at the documentation and I could only find setting limits for CPUs, Memory and rateLimits.
There are several approaches that could satisfy your needs.
Custom implementation: as you mentioned, keep the number of API calls received in a persistence layer and deny requests once the limit has been reached.
Use a service mesh: Istio (for instance) will let you limit the number of requests received and act as a circuit breaker.
Use an external API manager: Apigee will also let you limit and even charge your users. However, if it is only for internal use (not pay per use), I definitely wouldn't recommend it.
The tricky part is what you want to happen after the limit has been reached. If it is just a pod, you can have the application exit so that the pod finishes and is cleaned up.
Otherwise, if you have a deployment with its replica set and several associated resources (like configmaps), you probably want some kind of asynchronous alert or polling check to clean up everything related to your deployment. You may want to take a close look at orchestrators like Airflow (Composer) and use tools such as Helm to keep deployments easy.
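For the custom implementation option, a minimal sketch could look like the following. It assumes SQLite as the persistence layer purely for illustration; the QueryQuota class, the quota table and the max_calls column are hypothetical names, and a real deployment with several replicas would need a store that all pods share.

```python
import sqlite3


class QueryQuota:
    """Hypothetical helper: one global query counter persisted in SQLite."""

    def __init__(self, path: str, limit: int) -> None:
        # isolation_level=None puts the connection in autocommit mode
        self.conn = sqlite3.connect(path, isolation_level=None)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS quota ("
            "id INTEGER PRIMARY KEY CHECK (id = 1), "
            "used INTEGER NOT NULL, "
            "max_calls INTEGER NOT NULL)"
        )
        # Create the single counter row only if it does not exist yet
        self.conn.execute(
            "INSERT OR IGNORE INTO quota (id, used, max_calls) VALUES (1, 0, ?)",
            (limit,),
        )

    def try_consume(self) -> bool:
        """Count one query; return False once the limit is exhausted."""
        cur = self.conn.execute(
            "UPDATE quota SET used = used + 1 WHERE id = 1 AND used < max_calls"
        )
        return cur.rowcount == 1  # no row updated means the limit was reached


quota = QueryQuota(":memory:", limit=3)
print([quota.try_consume() for _ in range(5)])  # [True, True, True, False, False]
```

The check and the increment happen in a single UPDATE, so two concurrent requests cannot both take the last remaining slot. In the POST handler you would call try_consume() first and reject the request (for example with HTTP 429) when it returns False.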

Scaling up a multi-tenant FastAPI application

I am trying to understand how to scale up FastAPI for our app. Our application is currently written like the code snippet below, so we don't use async calls. The application is multi-tenant and we expect large requests (~10 MB per request).
from fastapi import FastAPI
app = FastAPI()
@app.get("/")
def root():
    # psycopg2 queries (select ...); a query lasts 2-3 minutes, or an ML model runs
    return {"message": "Hello World"}
While one API call is being processed, another user has to wait before their requests start, which is what we don't want. I can increase from 1 worker to 4-6 workers (gunicorn), so that 4-6 users can use the app independently. Does that mean we can handle 4-6x more load, or is it less?
We were thinking of switching to async and using async Postgres drivers (asyncio) to get more throughput. I assume the database will then become the bottleneck soon? We also did some performance testing, and according to our tests this approach would cut the time in half.
How can we scale up our application further if we want to handle 1000 users at the same time during peaks? What should we take into consideration?
First of all: Does this processing need to be sync? I mean, is the user waiting for the response of this processing that takes 2-3 minutes? It is not recommended that you have APIs that take that long to respond.
If your user doesn't need to wait until it finishes, you have a few options:
You can use Celery and make this processing async using background tasks. Celery is commonly used for this kind of thing, where you have huge queries or heavy processing that takes a while and can be done asynchronously.
You can also use the background tasks from FastAPI, which allow you to run things in the background.
If you do it this way you will be able to scale your application easily. Note that Celery currently doesn't support async, so you would not be able to use async there unless you implement a few tweaks yourself.
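The "submit now, fetch the result later" pattern behind both options can be sketched with just the standard library. Here a ThreadPoolExecutor stands in for Celery or FastAPI's background tasks; slow_query, submit_job, job_status and the in-memory jobs dict are hypothetical names, and the 0.2 s sleep is a placeholder for the real 2-3 minute work.

```python
import time
import uuid
from concurrent.futures import Future, ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=4)
jobs: dict[str, Future] = {}  # in-memory only; a real setup would persist this


def slow_query(tenant: str) -> dict:
    """Stand-in for the long psycopg2 query or ML model call."""
    time.sleep(0.2)
    return {"tenant": tenant, "rows": 42}


def submit_job(tenant: str) -> str:
    """What the request handler would do: enqueue the work, return an id at once."""
    job_id = uuid.uuid4().hex
    jobs[job_id] = executor.submit(slow_query, tenant)
    return job_id


def job_status(job_id: str) -> dict:
    """What a status endpoint would return when the client polls."""
    future = jobs[job_id]
    if not future.done():
        return {"status": "pending"}
    return {"status": "done", "result": future.result()}


job_id = submit_job("acme")
print(job_status(job_id))   # still pending right after submission
jobs[job_id].result()       # block here only for the sake of the demo
print(job_status(job_id))
```

The request itself returns in milliseconds, so a worker is never tied up for minutes, and the heavy work can be scaled separately from the API.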
About scaling the number of workers: FastAPI recommends that you use your container infrastructure to manage the number of replicas running, so instead of relying on gunicorn workers you could simply scale the number of replicas of your service. If you are not using containers, you can use gunicorn to spin up new workers based on the number of requests you are receiving.
If none of my answers above make sense for you, I'd suggest:
Use an async Postgres driver, so that while your query is running FastAPI is able to receive requests from other users. Note that if your query result is huge, you might need a lot of memory to do this.
Create some sort of autoscaling based on response time or requests per second, so you can scale your application as you receive more requests.
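To illustrate why an async driver helps, here is a small sketch in which asyncio.sleep stands in for an awaited database call. No real driver is used; fake_async_query and handle_requests are made-up names, and the numbers are arbitrary.

```python
import asyncio
import time


async def fake_async_query(delay: float) -> str:
    """Stand-in for an awaited DB call; awaiting hands control back to the event loop."""
    await asyncio.sleep(delay)
    return "ok"


async def handle_requests(n: int, delay: float) -> float:
    """Run n 'requests' concurrently on one event loop and return the elapsed time."""
    start = time.perf_counter()
    results = await asyncio.gather(*(fake_async_query(delay) for _ in range(n)))
    assert all(r == "ok" for r in results)
    return time.perf_counter() - start


elapsed = asyncio.run(handle_requests(n=10, delay=0.2))
print(f"10 overlapping queries of 0.2s each took {elapsed:.2f}s in total")
```

Ten waits overlap in a single process instead of queueing one after another. This only helps while the time is spent waiting on I/O; the database still has to execute every query, and CPU-bound model inference would block the event loop just as before.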

How to interpret LocustIO's output / simulate short user visits

I like Locust, but I'm having a problem interpreting the results.
e.g. my use case is that I have a petition site. I expect 10,000 people to sign the petition over a 12 hour period.
I've written a locust file that simulates user behaviour:
Some users load but don't sign petition
Some users load and submit invalid data
Some users (hopefully) successfully submit.
In real life the user now goes away (because the petition is an API not a main website).
Locust shows me things like:
with 50 concurrent users the median time is 11s
with 100 concurrent users the median time is 20s
But as one "Locust" just repeats the tasks over and over, it's not really like one user. If I set it up with a swarm of 1 user, then that still represents many real world users, over a period of time; e.g. in 1 minute it might do the task 5 times: that would be 5 users.
Is there a way I can interpret the data ("this means we can handle N people/hour"), or some way I can see how many "tasks" get run per second or minute etc.? (i.e. Locust gives me requests per second but not tasks)
Tasks don't really exist at the logging level in Locust.
If you want, you could log your own fake samples and use them as your task counter. This has the unfortunate side effect of inflating your request rate, but it should not impact things like average response times.
Like this:
from locust.events import request_success
def mytask(self):
    # do your normal requests
    request_success.fire(request_type="task", name="completed", response_time=None, response_length=0)
Here's the hacky way that I've got somewhere. I'm not happy with it and would love to hear some other answers.
Create class variables on my HttpLocust (WebsiteUser) class:
WebsiteUser.successfulTasks = 0
Then on the UserBehaviour taskset:
def theTaskThatIsConsideredSuccessful(self):
    WebsiteUser.successfulTasks += 1
    # the work...

# This runs once regardless how many 'locusts'/users hatch
def setup(self):
    WebsiteUser.start_time = time.time()
    WebsiteUser.successfulTasks = 0

# This runs for every user when test is stopped.
# I could not find another method that did this (tried various combos)
# It doesn't matter much, you just get N copies of the result!
def on_stop(self):
    took = time.time() - WebsiteUser.start_time
    total = WebsiteUser.successfulTasks
    avg = took/total
    hr = 60*60/avg
    print("{} successful\nAverage: {}s/success\n{} successful signatures per hour".format(total, avg, hr))
And then set a zero wait_time and run till it settles (or failures emerge) and then stop the test with the stop button in the web UI.
Output is like
188 successful
13147.527132862522 successful signatures per hour
I think this therefore gives me the max conceivable throughput that the server can cope with (determined by changing the No. users hatched until failures emerge, or until the average response time becomes unbearable).
Obviously real users would have pauses, but that makes it harder to test the maximums.
Can't use distributed Locust instances
Messy; also can't 'reset' - have to quit the process and restart for another test.
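To turn a measured figure like this into "can we handle N people", you can compare it with the rate the question needs. A rough sketch, assuming signatures arrive evenly over the 12 hours (real traffic will have peaks, so treat the result as an upper bound on headroom); the function names are made up:

```python
def required_per_hour(total_users: int, hours: float) -> float:
    """Average completed tasks per hour needed to serve everyone in the window."""
    return total_users / hours


def headroom(measured_per_hour: float, needed_per_hour: float) -> float:
    """How many times the measured maximum exceeds the needed average."""
    return measured_per_hour / needed_per_hour


needed = required_per_hour(10_000, 12)   # 10,000 signatures over 12 hours
measured = 13147.527132862522            # per-hour figure from the sample output
ratio = headroom(measured, needed)
print(f"needed: {needed:.0f}/hour, measured: {measured:.0f}/hour, headroom: {ratio:.1f}x")
# prints: needed: 833/hour, measured: 13148/hour, headroom: 15.8x
```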

Multiple concurrent connections with Vertx

I'm trying to build a web application that should be able to handle at least 15000 rps. The optimizations I have made so far are increasing the worker pool size to 20 and setting an accept backlog of 25000. Since I have set my worker pool size to 20, will this help with the blocking piece of code?
A worker pool size of 20 seems to be the default.
I believe the important question in your case is how long you expect each request to run. On my side, I expect to have thousands of short-lived requests, each with a payload size of about 5-10KB. All of these will be blocking, because of a blocking database driver I use at the moment. I have increased the default worker pool size to 40 and I have explicitly set the number of verticle instances I deploy using the following formula:
final int instances = Math.min(Math.max(Runtime.getRuntime().availableProcessors() / 2, 1), 2);
A test run of 500 simultaneous clients running for 60 seconds, on a vert.x server doing nothing but blocking calls, produced an average of 6 failed requests out of 11089. My test payload in this case was ~28KB.
Of course, from experience I know that running my software in production would often produce results that I have not anticipated. Thus, the important thing in my case is to have good atomicity rules in place, so that I don't get half-baked or corrupted data in the database.
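For reference, the instance-count formula above clamps half the core count to the range 1 to 2. A Python rendering (with Java's integer division written as //, and a made-up function name) makes the values for a few core counts explicit:

```python
def verticle_instances(available_processors: int) -> int:
    """Python rendering of Math.min(Math.max(cores / 2, 1), 2)."""
    return min(max(available_processors // 2, 1), 2)


for cores in (1, 2, 4, 8, 16):
    print(cores, "cores ->", verticle_instances(cores), "instance(s)")
```

So any machine with 4 or more cores ends up with 2 instances, and smaller machines with 1.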

Unique access to Kubernetes resources

For some integration tests we would like a way of ensuring that only one test at a time has access to certain resources (e.g. 3 DeploymentConfigurations).
For that to work we have the following workflow:
Before test is started - wait until all DCs are "undeployed".
When test is started - set DC replicas to 1.
When test is stopped - set DC replicas to 0.
This works to some degree, but obviously has the problem that if the test terminates unexpectedly, the DCs might still be in flight.
Now one way to "solve" this would be to introduce a CR, with a Controller, which handles lifetime of the lock (CR).
Is there any more elegant and straight forward way of allowing unique access to Kubernetes resources?
Sadly we are stuck with Kubernetes 1.9 for now.
Look at 'kubectl wait' to set conditions between the steps of the test flow and, depending on the result, proceed to the next test step.
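As for the lock-with-a-lifetime idea from the question, the core logic is a lease that expires on its own, so a crashed test cannot hold the resources forever. A minimal in-memory sketch follows; LeaseLock is a hypothetical class, and where the state would live in the cluster (and how it would be updated atomically) is left open.

```python
import time
from typing import Callable, Optional


class LeaseLock:
    """Hypothetical lock whose ownership expires after ttl seconds."""

    def __init__(self, ttl: float, clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl = ttl
        self.clock = clock  # injectable so the expiry can be tested without waiting
        self.holder: Optional[str] = None
        self.expires_at = 0.0

    def acquire(self, owner: str) -> bool:
        """Take the lock if it is free, expired, or already ours (renewal)."""
        now = self.clock()
        if self.holder is None or now >= self.expires_at or self.holder == owner:
            self.holder = owner
            self.expires_at = now + self.ttl
            return True
        return False

    def release(self, owner: str) -> None:
        """Release only if the caller still holds the lock."""
        if self.holder == owner:
            self.holder = None


lock = LeaseLock(ttl=600)
print(lock.acquire("test-a"))  # True: the lock was free
print(lock.acquire("test-b"))  # False: test-a holds a lease that has not expired
```

A running test would call acquire() periodically to renew its lease; if it dies, the lease runs out and the next test can take over and reset the DCs first.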