With Threads: 100, Ramp-up: 1 and Loop count: 1 as the configuration, how will JMeter start sending requests to the server?
Will requests be sent at 1 req/sec, or will all requests be sent to the server at once?
JMeter will send requests as fast as it can, to wit:
It will start all threads (virtual users) you define in Thread Group within the ramp-up period (in your case - 100 threads in 1 second)
Each thread (virtual user) will start executing the Samplers in the Thread Group from top to bottom (or as directed by the Logic Controllers)
When there are no more samplers to execute or loops to iterate the thread will be shut down
When there are no more active threads left - JMeter test will end.
With regards to requests per second - it mostly depends on your application response time, i.e.
if you have 100 virtual users and response time is 1 second - you will get 100 requests/second
if you have 100 virtual users and response time is 2 seconds - you will get 50 requests/second
if you have 100 virtual users and response time is 500 milliseconds - you will get 200 requests/second
etc.
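The arithmetic above can be sketched as a one-liner. This is only a rough model: it assumes every virtual user loops continuously with no timers or think time, and the function name is made up for illustration.

```python
def requests_per_second(virtual_users: int, response_time_s: float) -> float:
    """Closed-loop throughput: each user sends its next request
    only after the previous response has arrived."""
    return virtual_users / response_time_s


# The three cases listed above, for 100 virtual users.
for response_time_s in (1.0, 2.0, 0.5):
    rate = requests_per_second(100, response_time_s)
    print(f"{response_time_s} s response time -> {rate:.0f} requests/second")
```

With Loop count: 1 each user sends its samplers only once, so the figure describes a sustained test rather than the single burst in the question.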
I would recommend increasing (and decreasing) the load gradually, this way you will be able to correlate increasing load with increasing throughput/response time/number of errors, etc. while releasing all threads at once will not tell you the full story (unless you're doing a form of spike testing, in this case consider using Synchronizing Timer)
A ramp-up period of 1 means JMeter starts all 100 threads within 1 second.
This setting isn't recommended, as the documentation describes:
The ramp-up period tells JMeter how long to take to "ramp-up" to the full number of threads chosen. If 10 threads are used, and the ramp-up period is 100 seconds, then JMeter will take 100 seconds to get all 10 threads up and running. Each thread will start 10 (100/10) seconds after the previous thread was begun. If there are 30 threads and a ramp-up period of 120 seconds, then each successive thread will be delayed by 4 seconds.
Ramp-up needs to be long enough to avoid too large a work-load at the start of a test, and short enough that the last threads start running before the first ones finish (unless one wants that to happen).
Start with Ramp-up = number of threads and adjust up or down as needed.
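The delay arithmetic in the quoted passage can be checked with a small sketch. It assumes threads are started at evenly spaced intervals, as the quote describes; the function name is hypothetical and this is not JMeter's actual scheduling code.

```python
def thread_start_offsets(threads: int, ramp_up_s: float) -> list[float]:
    """Start offset (in seconds) of each thread, assuming even spacing
    of ramp_up_s / threads between successive threads."""
    delay = ramp_up_s / threads
    return [i * delay for i in range(threads)]


print(thread_start_offsets(10, 100)[:3])   # 10 threads over 100 s
print(thread_start_offsets(30, 120)[:3])   # 30 threads over 120 s
print(thread_start_offsets(100, 1)[-1])    # the question's 100 threads over 1 s
```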
See also Can i set ramp up period 0 in JMeter?
Bear in mind that with a low ramp-up and many threads you may be limited by local resources, so your results may measure the capability of the client rather than the server.
I have the following HTTP-based application that routes every request to an Akka Actor which uses a long chain of Akka Actors to process the request.
path("process-request") {
  post {
    val startedAtAsNano = System.nanoTime()
    NonFunctionalMetrics.requestsCounter.inc()
    NonFunctionalMetrics.requestsGauge.inc()
    entity(as[Request]) { request =>
      onComplete(distributor ? [Response](replyTo => Request(request, replyTo))) {
        case Success(response) =>
          NonFunctionalMetrics.requestsGauge.dec()
          NonFunctionalMetrics.responseHistogram.labels(HttpResponseStatus.OK.getCode.toString).observeAsMicroseconds(startedAtAsNano, System.nanoTime())
          complete(response)
        case Failure(ex) =>
          NonFunctionalMetrics.requestsGauge.dec()
          NonFunctionalMetrics.responseHistogram.labels(HttpResponseStatus.INTERNAL_SERVER_ERROR.getCode.toString).observeAsMicroseconds(startedAtAsNano, System.nanoTime())
          logger.warn(s"A general error occurred for request: $request, ex: ${ex.getMessage}")
          complete(InternalServerError, s"A general error occurred: ${ex.getMessage}")
      }
    }
  }
}
As you can see, I'm sending the distributor an ask request for response.
The problem is that on really high RPS, sometimes, the distributor fails with the following exception:
2022-04-16 00:36:26.498 WARN c.d.p.b.http.AkkaHttpServer - A general error occurred for request: Request(None,0,None,Some(EntitiesDataRequest(10606082,0,-1,818052,false))) with ex: Ask timed out on [Actor[akka://MyApp/user/response-aggregator-pool#1374579366]] after [5000 ms]. Message of type [com.dv.phoenix.common.pool.WorkerPool$Request]. A typical reason for `AskTimeoutException` is that the recipient actor didn't send a reply.
This is a typical non-informative exception. The normal processing time is about 700 micros, so if a request takes 5 seconds it must be stuck somewhere in the pipeline; the processing itself cannot take that long.
I want to monitor this, so I thought about adding the Kamon integration, which provides an Akka Actors module with mailbox metrics, etc.
I tried adding the following configuration, but it didn't work for me:
https://kamon.io/docs/latest/instrumentation/akka/ask-pattern-timeout-warning/ (didn't show any effect)
Are there other suggestions for understanding the cause of this issue on a high-RPS system?
Thanks!
The Kamon instrumentation is useful for finding how you got to the ask. It can be useful if you have a lot of places where an ask can time out, but otherwise it's not likely to tell you the problem.
This is because an ask timeout is nearly always a symptom of some other problem (the lone exception is if many asks could plausibly be done in a stream (e.g. in a mapAsync or ask stage) but aren't; that doesn't apply in this code). Assuming that the timeouts aren't caused by (e.g.) a database being down so you're getting no reply or a cluster failing (both of these are fairly obvious, thus my assumption), the cause of a timeout (any timeout, generally) is often having too many elements in a queue ("saturation").
But which queue? We'll start with the distributor, which is an actor processing messages one-at-a-time from its mailbox (which is a queue). When you say that the normal processing time is 700 micros, is that measuring the time the distributor spends handling a request (i.e. the time before it can handle the next request)? If so, and the distributor is taking 700 micros, but requests come in every 600 micros, this can happen:
time 0: request 0 comes in, processing starts in distributor (mailbox depth 0)
600 micros: request 1 comes in, queued in distributor's mailbox (mailbox depth 1)
700 micros: request 0 completes (700 micros latency), processing of request 1 begins (mailbox depth 0)
1200 micros: request 2 comes in, queued (mailbox depth 1)
1400 micros: request 1 completes (800 micros latency), processing of request 2 begins (mailbox depth 0)
1800 micros: request 3 comes in, queued (mailbox depth 1)
2100 micros: request 2 completes (900 micros latency), processing of request 3 begins (mailbox depth 0)
2400 micros: request 4 comes in, queued (mailbox depth 1)
2800 micros: request 3 completes (1000 micros latency), processing of request 4 begins (mailbox depth 0)
3000 micros: request 5 comes in, queued (mailbox depth 1)
3500 micros: request 4 completes (1100 micros latency), processing of request 5 begins (mailbox depth 0)
3600 micros: request 6 comes in, queued (mailbox depth 1)
4200 micros: request 7 comes in, queued, request 5 completes (1200 micros latency), processing of request 6 begins (mailbox depth 1)
4800 micros: request 8 comes in, queued (mailbox depth 2)
4900 micros: request 6 completes (1300 micros latency), processing of request 7 begins (mailbox depth 1)
5400 micros: request 9 comes in, queued (mailbox depth 2)
and so on: the latency and depth increase without bound. Eventually, the depth is such that requests spend 5 seconds (or more, even) in the mailbox.
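The timeline above can be reproduced with a few lines of simulation. This is a minimal sketch of a single actor draining a FIFO mailbox, assuming a fixed 700 micros of processing per message and perfectly regular arrivals every 600 micros; the function name is made up for illustration.

```python
def simulate_latencies(service_us: int, interarrival_us: int, n_requests: int) -> list[int]:
    """Latency (arrival to completion, in micros) of each request handled
    one at a time, in arrival order, by a single actor."""
    free_at = 0                  # time at which the actor finishes its current message
    latencies = []
    for i in range(n_requests):
        arrival = i * interarrival_us
        start = max(arrival, free_at)      # wait in the mailbox if the actor is busy
        free_at = start + service_us
        latencies.append(free_at - arrival)
    return latencies


print(simulate_latencies(700, 600, 7))

# Index of the first request whose latency exceeds a 5 second ask timeout.
timeout_us = 5_000_000
first_late = next(i for i, latency in enumerate(simulate_latencies(700, 600, 60_000))
                  if latency > timeout_us)
print(first_late)
```

Under these assumptions every request waits 100 micros longer than the one before it, so the ask timeout is reached after roughly 50,000 requests, i.e. about 30 seconds of sustained overload.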
Kamon has the ability to track the number of messages in the mailbox of an actor (it's recommended to only do this on specific actors). Tracking the mailbox depth of distributor in this case would show it growing without bound to confirm that this is happening.
If the distributor's mailbox is the queue that's getting too deep, first consider how request N can affect request N + 1. The one-at-a-time processing model of an actor is only strictly required when the response to a request can be affected by the request immediately prior to it. If a request only concerns some portion of the overall state of the system, then that request can be handled in parallel with requests that do not concern any part of that portion. If there are distinct portions of the overall state such that no request is ever concerned with 2 or more portions, then responsibility for each portion of state can be offloaded to a specific actor, and the distributor looks at each request only for long enough to determine which actor to forward the request to (note that this will typically not entail the distributor making an ask: it hands off the request and it's the responsibility of the actor it hands off to (or that actor's designee...) to reply). This is basically what Cluster Sharding does under the hood. It's also noteworthy that doing this will probably increase the latency under low load (because you are doing more work), but it increases peak throughput by up to a factor of the number of portions of state.
If that's not a workable way to address the distributor's mailbox being saturated (viz. there's no good way to partition the state), then you can at least limit the time requests spend in the mailbox by including a "respond-by" field in the request message (e.g. for a 5 second ask timeout, you might require a response by 4900 millis after constructing the ask). When the distributor starts processing a message and the respond-by time has passed, it moves onto the next request: doing this effectively means that when the mailbox starts to saturate, the message processing rate increases.
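The respond-by idea can be sketched independently of Akka. The following is a plain Python model, not actor code: the mailbox is a deque of (request id, respond-by time) pairs, and all names are hypothetical.

```python
from collections import deque


def drain(mailbox: deque, now_us: int, service_us: int) -> tuple[list[int], list[int]]:
    """Process queued (request_id, respond_by_us) pairs in order,
    skipping any request whose respond-by time has already passed."""
    handled, skipped = [], []
    while mailbox:
        request_id, respond_by_us = mailbox.popleft()
        if now_us > respond_by_us:
            skipped.append(request_id)   # the asker has given up; spend no time on it
            continue
        now_us += service_us             # do the real work for this request
        handled.append(request_id)
    return handled, skipped


mailbox = deque([(0, 100), (1, 5_000), (2, 5_000)])
print(drain(mailbox, now_us=200, service_us=700))
```

Skipping costs almost nothing compared with processing, which is why the effective message rate goes up once the mailbox starts to saturate.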
Of course, it's possible that your distributor's mailbox isn't the queue that's getting saturated, or that if it is, it's not because the actor is spending too much time processing messages. It's possible that the distributor (or other actors needed for a response) aren't processing messages.
Actors run inside a dispatcher which has the ability to have some number of actors (or Future callbacks or other tasks, each of which can be viewed as equivalent to an actor which is spawned for processing a single message) processing a message at a given time. If there are more actors which have a message in their respective mailboxes than the number that can be processing a message, those actors are in a queue to be scheduled (note that this applies even if you happen to have a dispatcher which will spawn as many threads as it needs to process a message: since there are a limited number of CPU cores, the OS kernel scheduler's queue will take the role of the dispatcher queue). Kamon can track the depth of this queue. In my experience, it's more valuable to detect whether thread starvation (basically whether the time between task submission and when the task starts executing exceeds some threshold) is occurring. Lightbend's package of commercial tooling for use with Akka (disclaimer: I am employed by Lightbend) provides tools for detecting, with minimal overhead, whether starvation is occurring and providing other diagnostic information.
If thread starvation is being observed, and things like garbage collection pauses, or CPU throttling (e.g. due to running in a container) are ruled out, the primary cause of starvation is actors (or actor-like things) taking too long to process a message either because they are executing blocking I/O or are doing too much in the processing of a single message. If blocking I/O is the culprit, try to move the I/O to actors or futures running in a thread pool with far more threads than the number of CPU cores (some even advocate for an unbounded thread pool for this purpose). If it's a case of doing too much computation in processing a single message, look for spots in the processing where it makes sense to capture the state needed for the remainder of the computation in a message and send that message to yourself (this is basically equivalent to a coroutine yielding).
I'm trying to build a web application that should be able to handle at least 15000 rps. Some of the optimizations I have made are increasing the worker pool size to 20 and setting the accept backlog to 25000. Since I have set my worker pool size to 20, will this help with the blocking piece of code?
A worker pool size of 20 seems to be the default.
I believe the important question in your case is how long you expect each request to run. On my side, I expect to have thousands of short-lived requests, each with a payload size of about 5-10KB. All of these will be blocking, because of a blocking database driver I use at the moment. I have increased the default worker pool size to 40 and I have explicitly set the number of verticle instances to deploy using the following formula:
final int instances = Math.min(Math.max(Runtime.getRuntime().availableProcessors() / 2, 1), 2);
A test run of 500 simultaneous clients running for 60 seconds, on a vert.x server doing nothing but blocking calls, produced an average of 6 failed requests out of 11089. My test payload in this case was ~28KB.
Of course, from experience I know that running my software in production would often produce results that I have not anticipated. Thus, the important thing in my case is to have good atomicity rules in place, so that I don't get half-baked or corrupted data in the database.
I am using JMeter's Throughput Shaping Timer to test the performance of our REST server, and I noticed a few things I did not expect.
First of all, my setup details:
1)JMeter Version : 3.0 r1743807
2)JMX file : DropBox Link
Now, my questions:
1) The Throughput Shaping Timer is configured to run for 60 seconds (100 rps for 30 secs, then 200 rps for the next 30 secs). But the actual test runs for only 3 seconds, as shown below. Why?
2) As per the plan, the number of requests per second should go from 100 to 200. But here it seems to decrease, as shown above.
3) As per this plugin's documentation, the number of threads = desired requests per second * server response time / 1000. Is this because of how the plugin works internally, or is it simple logic that I am missing?
The issue is with the Thread Group settings.
You have only 1 iteration and ramp up 300 users in 1 second. So once JMeter has sent all 300 requests and received the responses, it finishes the test immediately. The timer settings only apply while the test is running.
If you need the test to run for some duration (say 60 seconds), then set the loop count to forever and the duration to 60.
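As for the formula in question 3, it is the same relationship between concurrency, throughput and response time, just solved for the number of threads. A minimal sketch, assuming the response time is given in milliseconds and each thread waits for its response before sending again (the function name is made up):

```python
import math


def threads_needed(target_rps: float, response_time_ms: float) -> int:
    """Threads required to sustain target_rps when each thread can complete
    only 1000 / response_time_ms requests per second."""
    return math.ceil(target_rps * response_time_ms / 1000)


print(threads_needed(100, 50))    # first 30 seconds of the schedule, 50 ms responses
print(threads_needed(200, 500))   # second 30 seconds, 500 ms responses
```

The 50 ms and 500 ms response times are example values, not measurements from the test plan in the question.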
I read this in the celery documentation for Task.rate_limit:
Note that this is a per worker instance rate limit, and not a global rate limit. To enforce a global rate limit (e.g., for an API with a maximum number of requests per second), you must restrict to a given queue.
How do I put a rate limit on a celery queue?
It turns out this can't be done at the queue level across multiple workers.
It can be done at the queue level for a single worker, or separately for each worker consuming the queue.
So if you set 10 jobs/minute on 5 workers, your workers will collectively process up to 50 jobs per minute.
So to get only 10 jobs per minute overall, either use one worker, or use 5 workers with a limit of 2/minute each.
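The arithmetic is simple enough to sketch. These helpers are only illustrative (they are not part of Celery) and assume every worker consumes the same queue with the same per-worker limit.

```python
def aggregate_rate(per_worker_per_min: float, workers: int) -> float:
    """Upper bound on tasks per minute across all workers."""
    return per_worker_per_min * workers


def per_worker_limit(global_per_min: float, workers: int) -> float:
    """Per-worker limit needed to stay within a global tasks-per-minute budget."""
    return global_per_min / workers


print(aggregate_rate(10, 5))     # 10/m on each of 5 workers
print(per_worker_limit(10, 5))   # global budget of 10/m split over 5 workers
```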
Update: How to exactly put the limit in settings/configuration:
task_annotations = {'tasks.<task_name>': {'rate_limit': '10/m'}}
or change the same for all tasks:
task_annotations = {'*': {'rate_limit': '10/m'}}
10/m means 10 tasks per minute, /s would mean per second. More details here: Task annotations setting
I was trying to find a way to rate limit a queue and found that Celery can't do that. However, Celery can control the rate per task, see:
http://docs.celeryproject.org/en/latest/userguide/workers.html#rate-limits
So as a workaround, you could set up one task per queue (which makes sense in a lot of situations) and put the limit on the task.
You can set this limit in the flower > worker pane.
there is a specified blank space for entering your limit there.
The format that is suggested to be used is also like the below:
The rate limits can be specified in seconds, minutes or hours by appending “/s”, “/m” or “/h” to the value. Tasks will be evenly distributed over the specified time frame.
Example: “100/m” (hundred tasks a minute). This will enforce a minimum delay of 600ms between starting two tasks on the same worker instance.
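The 600ms figure follows directly from the rate string. Here is a small sketch of that conversion; the parser is written for illustration and is not Celery's own implementation.

```python
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def min_delay_ms(rate_limit: str) -> float:
    """Minimum spacing between task starts on one worker,
    for a rate string such as '100/m'."""
    count, unit = rate_limit.split("/")
    return UNIT_SECONDS[unit] * 1000 / int(count)


print(min_delay_ms("100/m"))
print(min_delay_ms("10/m"))
```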
I'm trying to understand the performance numbers I'm getting and how to determine the optimal number of threads.
See the bottom of this post for my results
I wrote an experimental multi-threaded web client in perl which downloads a page, grabs the source for each image tag and downloads the image - discarding the data.
It uses a non-blocking connect with an initial per file timeout of 10 seconds which doubles after each timeout and retry. It also caches IP addresses so each thread only has to do a DNS lookup once.
The total amount of data downloaded is 2271122 bytes in 1316 files via 2.5Mbit connection from http://hubblesite.org/gallery/album/entire/npp/all/hires/true/ . The thumbnail images are hosted by a company which claims to specialize in low latency for high bandwidth applications.
Wall times are:
1 Thread takes 4:48 -- 0 timeouts
2 Threads take 2:38 -- 0 timeouts
5 Threads take 2:22 -- 20 timeouts
10 Threads take 2:27 -- 40 timeouts
50 Threads take 2:27 -- 170 timeouts
In the worst case ( 50 threads ) less than 2 seconds of CPU time are consumed by the client.
avg file size 1.7k
avg rtt 100 ms ( as measured by ping )
avg cli cpu/img 1 ms
The fastest average download speed is 5 threads at about 15 KB / sec overall.
The server actually does seem to have pretty low latency as it takes only 218 ms to get each image meaning it takes only 18 ms on average for the server to process each request:
0 cli sends syn
50 srv rcvs syn
50 srv sends syn + ack
100 cli conn established / cli sends get
150 srv recv's get
168 srv reads file, sends data , calls close
218 cli recv HTTP headers + complete file in 2 segments MSS == 1448
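As a sanity check, the per-image time from the timeline above can be multiplied out and compared with the single-thread wall time. This is a rough sketch that ignores DNS lookups, the initial page download and client CPU time; the function name is made up.

```python
def serial_download_seconds(files: int, per_file_ms: float) -> float:
    """Total time when files are fetched one after another,
    each on its own new connection."""
    return files * per_file_ms / 1000


total_s = serial_download_seconds(1316, 218)
minutes, seconds = divmod(round(total_s), 60)
print(f"{total_s:.1f} s, i.e. about {minutes}:{seconds:02d}")
```

That lands within a second or two of the 4:48 measured with 1 thread, which suggests the single-thread run is dominated by per-connection latency rather than bandwidth.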
I can see that the per file average download speed is low because of the small file sizes and the relatively high cost per file of the connection setup.
What I don't understand is why I see virtually no improvement in performance beyond 2 threads. The server seems to be sufficiently fast, but already starts timing out connections at 5 threads.
The timeouts seem to start after about 900 - 1000 successful connections whether it's 5 or 50 threads, which I assume is probably some kind of throttling threshold on the server, but I would expect 10 threads to still be significantly faster than 2.
Am I missing something here?
EDIT-1
Just for comparison's sake I installed the DownThemAll Firefox extension and downloaded the images using it. I set it to 4 simultaneous connections with a 10 second timeout. DownThemAll took about 3 minutes to download all the files and write them to disk, and it also started experiencing timeouts after about 900 connections.
I'm going to run tcpdump to try and get a better picture what's going on at the tcp protocol level.
I also cleared Firefox's cache and hit reload. 40 Seconds to reload the page and all the images. That seemed way too fast - maybe Firefox kept them in a memory cache which wasn't cleared? So I opened Opera and it also took about 40 seconds. I assume they're so much faster because they must be using HTTP/1.1 pipelining?
And the Answer Is!??
So after a little more testing and writing code to reuse the sockets via pipelining I found out some interesting info.
When running at 5 threads the non-pipelined version retrieves the first 1026 images in 77 seconds but takes a further 65 seconds to retrieve the remaining 290 images. This pretty much confirms MattH's theory about my client getting hit by a SYN FLOOD event causing the server to stop responding to my connection attempts for a short period of time. However, that is only part of the problem since 77 seconds is still very slow for 5 threads to get 1026 images; if you remove the SYN FLOOD issue it would still take about 99 seconds to retrieve all the files. So based on a little research and some tcpdump's it seems like the other part of the issue is latency and the connection setup overhead.
Here's where we get back to the issue of finding the "Sweet Spot" or the optimal number of threads. I modified the client to implement HTTP/1.1 Pipelining and found that the optimal number of threads in this case is between 15 and 20. For example:
1 Thread takes 2:37 -- 0 timeouts
2 Threads take 1:22 -- 0 timeouts
5 Threads take 0:34 -- 0 timeouts
10 Threads take 0:20 -- 0 timeouts
11 Threads take 0:19 -- 0 timeouts
15 Threads take 0:16 -- 0 timeouts
There are four factors which affect this: latency / RTT, maximum end-to-end
bandwidth, recv buffer size and the size of the image files being downloaded.
See this site for a discussion on how receive buffer size and RTT latency
affect available bandwidth.
In addition to the above, average file size affects the maximum per connection
transfer rate. Every time you issue a GET request you create an empty gap in
your transfer pipe which is the size of the connection RTT. For example, if
your Maximum Possible Transfer Rate ( recv buff size / RTT ) is 2.5Mbit and
your RTT is 100ms, then every GET request incurs a minimum 32kB gap in your
pipe. For a large average image size of 320kB that amounts to a 10% overhead
per file, effectively reducing your available bandwidth to 2.25Mbit. However,
for a small average file size of 3.2kB the overhead jumps to 1000% and
available bandwidth is reduced to 232 kbit / second - about 29kB / second.
So to find the optimal number of threads:
Gap Size = MPTR * RTT
Optimal threads = MPTR / (MPTR / (Gap Size + AVG file size) * AVG file size)
For my above scenario this gives me an optimum thread count of 11 threads, which is extremely close to my real world results.
If the actual connection speed is slower than the theoretical MPTR then it
should be used in the calculation instead.
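For anyone who wants to plug in their own numbers, here is a sketch of that calculation. It reads the formula as MPTR / (MPTR / (Gap Size + AVG file size) * AVG file size), which is my interpretation of the parenthesization, and the function and parameter names are made up. Units are kB and seconds.

```python
def optimal_threads(mptr_kb_s: float, rtt_s: float, avg_file_kb: float) -> float:
    """Connections needed to fill the pipe when every GET leaves
    an RTT-sized gap on its connection."""
    gap_kb = mptr_kb_s * rtt_s                                        # Gap Size = MPTR * RTT
    per_connection_rate = mptr_kb_s / (gap_kb + avg_file_kb) * avg_file_kb
    return mptr_kb_s / per_connection_rate


# 2.5Mbit is about 320 kB/s, RTT 100ms, 3.2kB files as in the example above.
print(round(optimal_threads(320, 0.1, 3.2)))
```

Written this way the expression simplifies to (Gap Size + AVG file size) / AVG file size, so the smaller the files are relative to the gap, the more connections it takes to keep the pipe full.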
Please correct me if this summary is incorrect:
Your multi-threaded client will start a thread that connects to the server and issues just one HTTP GET then that thread closes.
When you say 1, 2, 5, 10, 50 threads, you're just referring to how many concurrent threads you allow, each thread itself only handles one request
Your client takes between 2 and 5 minutes to download over 1000 images
Firefox and Opera will download an equivalent data set in 40 seconds
I suggest that the server rate-limits HTTP connections, either in the webserver daemon itself, in a server-local firewall or, most likely, in a dedicated firewall.
You are actually abusing the web service by not re-using the HTTP connections for more than one request, and the timeouts you experience are because your SYN FLOOD is being clamped.
Firefox and Opera are probably using between 4 and 8 connections to download all of the files.
If you redesign your code to re-use the connections you should achieve similar performance.