I am testing the majordomo broker's throughput. The test_client.c that comes with the majordomo code on GitHub sends synchronous requests. I want to test the maximum throughput that the majordomo broker can achieve. The specification (http://rfc.zeromq.org/spec:7) says that it can switch up to a million messages per second.
First, I changed the client code to send 100k requests asynchronously. Even after setting the HWM sufficiently high on all the sockets and increasing the TCP buffers to 4 MB, I was observing packet loss with three clients running in parallel.
So I changed the client to send 10k requests at once, and then send two requests for every reply that it receives. I chose 10k because that allowed me to run up to ten clients (each sending 100k messages) in parallel without any packet loss. Here is the client code:
#include "../include/mdp.h"
#include <sys/time.h>   //  for gettimeofday
#include <time.h>

int main (int argc, char *argv [])
{
    int verbose = (argc > 1 && streq (argv [1], "-v"));
    mdp_client_t *session = mdp_client_new (argv [1], verbose);

    int count1;
    struct timeval start, end;
    gettimeofday (&start, NULL);

    //  Prime the pipeline with 10k requests sent asynchronously
    for (count1 = 0; count1 < 10000; count1++) {
        zmsg_t *request = zmsg_new ();
        zmsg_pushstr (request, "Hello world");
        mdp_client_send (session, "echo", &request);
    }
    //  For each of the next 45k replies, send two more requests
    //  (10k + 2 * 45k = 100k requests in total)
    for (count1 = 0; count1 < 45000; count1++) {
        zmsg_t *reply = mdp_client_recv (session, NULL, NULL);
        if (reply) {
            zmsg_destroy (&reply);
            zmsg_t *request = zmsg_new ();
            zmsg_pushstr (request, "Hello world");
            mdp_client_send (session, "echo", &request);
            request = zmsg_new ();
            zmsg_pushstr (request, "Hello world");
            mdp_client_send (session, "echo", &request);
        }
        else
            break;          //  Interrupted by Ctrl-C
    }
    //  Receive the remaining 55k replies
    for (count1 = 45000; count1 < 100000; count1++) {
        zmsg_t *reply = mdp_client_recv (session, NULL, NULL);
        if (reply)
            zmsg_destroy (&reply);
        else
            break;          //  Interrupted by Ctrl-C
    }
    gettimeofday (&end, NULL);
    double elapsed = (end.tv_sec - start.tv_sec)
                   + (end.tv_usec - start.tv_usec) / 1000000.0;
    printf ("time = %.2f\n", elapsed);
    printf ("%d replies received\n", count1);

    mdp_client_destroy (&session);
    return 0;
}
I ran the broker, the worker, and the clients on the same machine. Here are the recorded times:
Clients in parallel (each sends 100k requests)    Time elapsed (seconds)
 1                                                 4
 2                                                 9
 3                                                12
 4                                                16
 5                                                21
10                                                43
So for every 100k requests, the broker takes about 4 seconds. Is this the expected behavior? I am not sure how to achieve a million messages per second.
LATEST UPDATE:
I came up with an approach to improve the throughput of the system:
Two brokers instead of one. One of the brokers (broker1) is responsible for sending the client requests to the workers, and the other broker (broker2) is responsible for sending the workers' responses back to the clients.
The workers register with broker1.
The clients generate a unique id and register with broker2.
Along with the request, a client also sends its unique id to broker1.
The worker extracts the unique client id from the request, and sends its response (along with the client id to which the response has to be sent) to broker2.
Now, every 100k requests takes around 2 seconds instead of 4 seconds (when using a single broker). I added gettimeofday calls within the broker code to measure how much latency the brokers themselves add.
Here is what I have recorded:
100k requests (total time: ~2 seconds) -> latency added by the brokers is 2 seconds
200k requests (total time: ~4 seconds) -> latency added by the brokers is 3 seconds
300k requests (total time: ~7 seconds) -> latency added by the brokers is 5 seconds
So the bulk of the time is being spent within the broker code. Could someone please suggest how to improve this?
The maximum throughput is bound by the maximum throughput of the broker, but it's also bound by the maximum throughput of the workers.
It seems to me like you're starting only one worker. If you read the majordomo protocol specification carefully, it says that the broker should be able to switch up to millions of messages per second, but it doesn't guarantee that a single worker can process millions of requests per second.
Given that the worker handles only one request at a time and the broker uses a fully synchronous dialog with each worker (it will not send the worker another request until it gets a reply to the previous one), it's impossible to squeeze the most out of the broker with a single worker: a full network round-trip with the broker happens between each reply and the next request, so the worker spends most of its time waiting.
For starters, you should try adding more workers :-)
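To get a feel for the ceiling that this synchronous dialog puts on a single worker, here is a rough model in Scala (the round-trip and processing times below are assumptions for illustration, not measurements of the majordomo broker):

    // Rough model: one synchronous worker finishes at most one request per
    // (broker <-> worker round trip + processing time). Numbers are illustrative.
    object SingleWorkerCeiling extends App {
      val roundTripMs  = 0.20   // assumed broker <-> worker network round trip
      val processingMs = 0.05   // assumed time the echo worker spends on one request
      val perWorkerRps = 1000.0 / (roundTripMs + processingMs)
      for (workers <- 1 to 4)
        println(f"$workers worker(s): at most ~${workers * perWorkerRps}%.0f requests/second (until the broker itself saturates)")
    }

With those assumed numbers a single worker tops out around 4000 requests/second no matter how fast the broker is, which is why adding workers (and keeping the clients asynchronous) is the first lever to pull.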
EDIT: quick demonstration.
I ran an echo service using a Python implementation of the Majordomo pattern with ZeroMQ. Of course, this is Python code with full error checking, and the C version should be much, much faster, but it shows the impact of having multiple workers handling the requests.
Test setup:
1 client
client sends a max of 1000 outstanding requests at once (initially, 1000 are sent, then the client sends one new request each time it receives a response);
the client sends a total of 100,000 requests;
Results:
Using 1 worker, the client gets 100,000 replies in 13.6 seconds (~7300 RPS).
Using 2 workers, the client gets 100,000 replies in 8.0 seconds (~12500 RPS).
Using 3 workers, the client gets 100,000 replies in 7.8 seconds (~13000 RPS).
Using more clients will yield better throughput in the broker because it will be performing I/O on multiple sockets, but it's not as easy to compute the total broker throughput, so I'll leave that for you to measure.
I have the following HTTP-based application that routes every request to an Akka Actor which uses a long chain of Akka Actors to process the request.
path("process-request") {
  post {
    val startedAtAsNano = System.nanoTime()
    NonFunctionalMetrics.requestsCounter.inc()
    NonFunctionalMetrics.requestsGauge.inc()
    entity(as[Request]) { request =>
      onComplete(distributor ? [Response](replyTo => Request(request, replyTo))) {
        case Success(response) =>
          NonFunctionalMetrics.requestsGauge.dec()
          NonFunctionalMetrics.responseHistogram.labels(HttpResponseStatus.OK.getCode.toString).observeAsMicroseconds(startedAtAsNano, System.nanoTime())
          complete(response)
        case Failure(ex) =>
          NonFunctionalMetrics.requestsGauge.dec()
          NonFunctionalMetrics.responseHistogram.labels(HttpResponseStatus.INTERNAL_SERVER_ERROR.getCode.toString).observeAsMicroseconds(startedAtAsNano, System.nanoTime())
          logger.warn(s"A general error occurred for request: $request, ex: ${ex.getMessage}")
          complete(InternalServerError, s"A general error occurred: ${ex.getMessage}")
      }
    }
  }
}
As you can see, I'm sending the distributor an ask and waiting for the response.
The problem is that at really high RPS the distributor sometimes fails with the following exception:
2022-04-16 00:36:26.498 WARN c.d.p.b.http.AkkaHttpServer - A general error occurred for request: Request(None,0,None,Some(EntitiesDataRequest(10606082,0,-1,818052,false))) with ex: Ask timed out on [Actor[akka://MyApp/user/response-aggregator-pool#1374579366]] after [5000 ms]. Message of type [com.dv.phoenix.common.pool.WorkerPool$Request]. A typical reason for `AskTimeoutException` is that the recipient actor didn't send a reply.
This is a typically uninformative exception. The normal processing time is about 700 micros; at 5 seconds the request must be stuck somewhere in the pipeline, since the processing itself cannot take that long.
I want to monitor this. I thought about adding the Kamon integration, which provides an Akka Actors module with mailbox metrics, etc.
I tried to add the following configuration, but it didn't work for me:
https://kamon.io/docs/latest/instrumentation/akka/ask-pattern-timeout-warning/ (didn't show any effect)
Are there any other suggestions for understanding the cause of this issue in a high-RPS system?
Thanks!
The Kamon instrumentation is useful for finding how you got to the ask. It can be useful if you have a lot of places where an ask can time out, but otherwise it's not likely to tell you the problem.
This is because an ask timeout is nearly always a symptom of some other problem (the lone exception is if many asks could plausibly be done in a stream (e.g. in a mapAsync or ask stage) but aren't; that doesn't apply in this code). Assuming that the timeouts aren't caused by (e.g.) a database being down so you're getting no reply or a cluster failing (both of these are fairly obvious, thus my assumption), the cause of a timeout (any timeout, generally) is often having too many elements in a queue ("saturation").
But which queue? We'll start with the distributor, which is an actor processing messages one-at-a-time from its mailbox (which is a queue). When you say that the normal processing time is 700 micros, is that measuring the time the distributor spends handling a request (i.e. the time before it can handle the next request)? If so, and the distributor is taking 700 micros, but requests come in every 600 micros, this can happen:
time 0: request 0 comes in, processing starts in distributor (mailbox depth 0)
600 micros: request 1 comes in, queued in distributor's mailbox (mailbox depth 1)
700 micros: request 0 completes (700 micros latency), processing of request 1 begins (mailbox depth 0)
1200 micros: request 2 comes in, queued (mailbox depth 1)
1400 micros: request 1 completes (800 micros latency), processing of request 2 begins (mailbox depth 0)
1800 micros: request 3 comes in, queued (mailbox depth 1)
2100 micros: request 2 completes (900 micros latency), processing of request 3 begins (mailbox depth 0)
2400 micros: request 4 comes in, queued (mailbox depth 1)
2800 micros: request 3 completes (1000 micros latency), processing of request 4 begins (mailbox depth 0)
3000 micros: request 5 comes in, queued (mailbox depth 1)
3500 micros: request 4 completes (1100 micros latency), processing of request 5 begins (mailbox depth 0)
3600 micros: request 6 comes in, queued (mailbox depth 1)
4200 micros: request 7 comes in, queued, request 5 completes (1200 micros latency), processing of request 6 begins (mailbox depth 1)
4800 micros: request 8 comes in, queued (mailbox depth 2)
4900 micros: request 6 completes (1300 micros latency), processing of request 7 begins (mailbox depth 1)
5400 micros: request 9 comes in, queued (mailbox depth 2)
and so on: the latency and depth increase without bound. Eventually, the depth is such that requests spend 5 seconds (or more, even) in the mailbox.
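The timeline above can be reproduced with a few lines of Scala; the 600 µs arrival interval and 700 µs service time are taken from the example, everything else is hypothetical:

    // Single server (the distributor), requests arriving every 600 µs, each taking 700 µs:
    // the time spent waiting in the mailbox grows without bound.
    object MailboxGrowth extends App {
      val arrivalUs = 600.0
      val serviceUs = 700.0
      var freeAtUs  = 0.0                          // when the distributor next becomes idle
      for (n <- 0 to 50000) {
        val arrival = n * arrivalUs
        val start   = math.max(arrival, freeAtUs)  // waits in the mailbox until the actor is free
        freeAtUs    = start + serviceUs
        val latency = freeAtUs - arrival
        if (n % 10000 == 0) println(f"request $n%6d  latency = ${latency / 1000.0}%8.1f ms")
      }
    }

With these rates the latency grows by 100 µs per request, so somewhere around the 50,000th request the mailbox wait alone exceeds the 5 second ask timeout.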
Kamon has the ability to track the number of messages in the mailbox of an actor (it's recommended to only do this on specific actors). Tracking the mailbox depth of distributor in this case would show it growing without bound to confirm that this is happening.
If the distributor's mailbox is the queue that's getting too deep, first consider how request N can affect request N + 1. The one-at-a-time processing model of an actor is only strictly required when the response to a request can be affected by the request immediately prior to it. If a request only concerns some portion of the overall state of the system, then that request can be handled in parallel with requests that do not concern any part of that portion. If there are distinct portions of the overall state such that no request is ever concerned with 2 or more portions, then responsibility for each portion of state can be offloaded to a specific actor, and the distributor looks at each request only for long enough to determine which actor to forward the request to (note that this will typically not entail the distributor making an ask: it hands off the request and it's the responsibility of the actor it hands off to (or that actor's designee...) to reply). This is basically what Cluster Sharding does under the hood. It's also noteworthy that doing this will probably increase the latency under low load (because you are doing more work), but it increases peak throughput by up to the number of portions of state.
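As a sketch of that "route, don't ask" shape (the message types and partitioning key below are made up for illustration; your distributor and state will of course look different):

    import akka.actor.typed.{ActorRef, Behavior}
    import akka.actor.typed.scaladsl.Behaviors

    // Hypothetical message types, only to show the shape of the hand-off.
    final case class Job(key: String, payload: String, replyTo: ActorRef[JobResult])
    final case class JobResult(value: String)

    object PartitionWorker {
      // Owns one disjoint slice of the state; replies directly to the original asker.
      def apply(): Behavior[Job] = Behaviors.receiveMessage { job =>
        job.replyTo ! JobResult(s"handled ${job.payload}")
        Behaviors.same
      }
    }

    object Distributor {
      // Looks at each request only long enough to pick a partition, then hands it off;
      // no ask, so the distributor's own processing time per message stays tiny.
      def apply(partitions: Int): Behavior[Job] = Behaviors.setup[Job] { ctx =>
        val workers = Vector.tabulate(partitions)(i => ctx.spawn(PartitionWorker(), s"partition-$i"))
        Behaviors.receiveMessage { job =>
          val idx = ((job.key.hashCode % partitions) + partitions) % partitions
          workers(idx) ! job
          Behaviors.same
        }
      }
    }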
If that's not a workable way to address the distributor's mailbox being saturated (viz. there's no good way to partition the state), then you can at least limit the time requests spend in the mailbox by including a "respond-by" field in the request message (e.g. for a 5 second ask timeout, you might require a response by 4900 millis after constructing the ask). When the distributor starts processing a message whose respond-by time has already passed, it moves on to the next request: doing this effectively means that when the mailbox starts to saturate, the message processing rate increases.
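A sketch of that idea (again with a made-up message type); the point is that expired requests are dropped cheaply instead of being processed after the asker has already given up:

    import akka.actor.typed.{ActorRef, Behavior}
    import akka.actor.typed.scaladsl.Behaviors

    // Hypothetical request type carrying a deadline derived from the ask timeout,
    // e.g. respondByNanos = System.nanoTime() + 4.9 seconds at the time of the ask.
    final case class TimedJob(respondByNanos: Long, payload: String, replyTo: ActorRef[String])

    object DeadlineAwareHandler {
      def apply(): Behavior[TimedJob] = Behaviors.receiveMessage { job =>
        if (System.nanoTime() > job.respondByNanos)
          Behaviors.same                             // asker has already timed out: skip the work
        else {
          job.replyTo ! s"processed ${job.payload}"  // the expensive handling would go here
          Behaviors.same
        }
      }
    }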
Of course, it's possible that your distributor's mailbox isn't the queue that's getting saturated, or that if it is, it's not because the actor is spending too much time processing messages. It's also possible that the distributor (or other actors needed for a response) isn't getting the chance to process messages at all.
Actors run inside a dispatcher, which allows some number of actors (or Future callbacks or other tasks, each of which can be viewed as equivalent to an actor that is spawned to process a single message) to be processing a message at a given time. If more actors have a message in their respective mailboxes than can be processing one, those actors sit in a queue waiting to be scheduled (note that this applies even if you happen to have a dispatcher which will spawn as many threads as it needs to process a message: since there are a limited number of CPU cores, the OS kernel scheduler's queue takes the role of the dispatcher queue). Kamon can track the depth of this queue. In my experience, it's more valuable to detect whether thread starvation (basically, whether the time between task submission and when the task starts executing exceeds some threshold) is occurring. Lightbend's package of commercial tooling for use with Akka (disclaimer: I am employed by Lightbend) provides tools for detecting, with minimal overhead, whether starvation is occurring, along with other diagnostic information.
If thread starvation is being observed, and things like garbage collection pauses, or CPU throttling (e.g. due to running in a container) are ruled out, the primary cause of starvation is actors (or actor-like things) taking too long to process a message either because they are executing blocking I/O or are doing too much in the processing of a single message. If blocking I/O is the culprit, try to move the I/O to actors or futures running in a thread pool with far more threads than the number of CPU cores (some even advocate for an unbounded thread pool for this purpose). If it's a case of doing too much computation in processing a single message, look for spots in the processing where it makes sense to capture the state needed for the remainder of the computation in a message and send that message to yourself (this is basically equivalent to a coroutine yielding).
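For the blocking-I/O case, the usual shape is to run the blocking calls on their own, much larger pool. Here is a sketch using plain Scala Futures; in Akka you would typically configure a dedicated dispatcher and look it up instead, but the idea is the same:

    import java.util.concurrent.Executors
    import scala.concurrent.{ExecutionContext, Future}

    object BlockingIo {
      // A dedicated pool for blocking calls, sized well beyond the number of CPU cores,
      // so JDBC/file/HTTP blocking never starves the dispatcher running the actors.
      private val blockingEc: ExecutionContext =
        ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(64))

      private def blockingCall(): String = { Thread.sleep(50); "done" }  // stand-in for real I/O

      // Callers get a Future back and the actor's thread is never blocked.
      def fetch(): Future[String] = Future(blockingCall())(blockingEc)
    }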
I have a JSR223 PreProcessor in the Concurrency Thread Group which creates the data to send to the Kafka producer, and I have a JSR223 Sampler that uses Kafka client 2.7.0 to send the messages to the Kafka producer.
The message sent to Kafka should be different each time, e.g. it has device information which should be different and events with the current time. These are being generated without any issues, as I tested with a few (50) threads. The problem I am having is when I want to send more messages, like 6000 messages per second. How do I resolve this issue?
Below is my setup.
You're showing us a screenshot of the Concurrency Thread Group configured to start 6000 threads (virtual users) and hold them for 20 seconds.
It will result in 6000 messages per second only if the cumulative response time of your JSR223 PreProcessor and Sampler is exactly 1 second. If it is less, you will generate more messages per second, and vice versa.
For example:
if the PreProcessor and Sampler execution time is 500 ms - you will end up with 12000 messages per second
if the PreProcessor and Sampler execution time is 2000 ms - you will send 3000 messages per second
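Here is that relationship as a quick numeric check (the per-iteration times are illustrative):

    // messages per second = threads / (cumulative PreProcessor + Sampler time in seconds)
    object KafkaRateCheck extends App {
      val threads = 6000
      for (iterationMs <- Seq(500, 1000, 2000)) {
        val rate = threads / (iterationMs / 1000.0)
        println(f"$threads threads, $iterationMs ms per iteration -> ~$rate%.0f messages/second")
      }
    }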
If you're sending fewer messages than you need, consider following the JMeter Best Practices: at the very least, disable all the Listeners and run your test in non-GUI mode. Still not enough? Increase the concurrency. Increased the concurrency, ran out of local resources and still not enough? Go for Distributed Testing.
If you're sending more than 6000 messages per second, you can limit the JMeter sampler execution rate to the desired throughput using the Throughput Shaping Timer.
You can see your current throughput using e.g. the Transactions per Second listener.
Can I get help with this? I can't seem to understand the question.
"In this problem you are to compare reading a file using a single-threaded file server with a multi- threaded file server. It takes 16 msec to get a request for work, dispatch it, and do the rest of the necessary processing, assuming the data are in the block cache. If a disk operation is needed (assume a spinning disk drive with 1 head), as is the case one-fourth of the time, an additional 32 msec is required."
Can I get help with this?
I don't think so (I don't think there's enough information in the question for anyone to be able to understand it).
Example 1
The file server is single-threaded and handles asynchronous requests, and the "16 msec" is primarily "request delivery latency" (the time between a process sending a request and the file server receiving it). A process sends a single request asking to read from 1000 files. The file server receives this request, "immediately" sends back 750 replies (for file data that was cached) and sends a single request asking something (file system code, disk driver?) to fetch the remaining 250 things; then the file server "immediately" waits for more requests while waiting for the reply from that something (file system code, disk driver?) to complete the earlier 250 things. In this case you can say that the throughput of the single-threaded file server is virtually infinite (e.g. infinite throughput for the "file data cache hit" case, which is the only thing that matters, because you can make more requests while waiting for slow disk I/O).
Example 2
The file server has 8 threads and handles synchronous requests. A single-threaded process sends 1 request (to read from 1 file) and then has to wait for the reply; the request is given to one of the file server's threads (it doesn't matter which) and that thread takes an average of "16 + 32*0.25 = 24 msec" to handle the request before the process can make its next request, and the process does this in a loop because it wants to read 1000 files. In this case throughput is "1/0.024 = 41.66 requests per second", which is extremely bad (primarily because the single-threaded process can't send requests fast enough to keep all threads of the multi-threaded server busy).
Example 3
The file server has 8 threads and handles synchronous requests. A process with 1000 threads sends 1 request (to read from 1 file) from each of its threads. In this case we need to know how many CPUs there are (and how the scheduler works) to determine anything about throughput. E.g. if there are only 2 CPUs, then you're not going to get 8 file server threads running in parallel at the same time.
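For what it's worth, the arithmetic behind Example 2 (the numbers come straight from the problem statement; the code just spells out the calculation):

    object FileServerMath extends App {
      val cachedMs = 16.0    // handling a request when the data is in the block cache
      val diskMs   = 32.0    // additional time when a disk operation is needed
      val diskFrac = 0.25    // a disk operation is needed one-fourth of the time
      val avgMs    = cachedMs + diskFrac * diskMs   // 16 + 0.25 * 32 = 24 msec
      println(f"average service time = $avgMs%.0f msec, so one synchronous client sees ~${1000.0 / avgMs}%.1f requests/second")
    }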
If Threads: 100, Ramp-up: 1 and Loop Count: 1 is the configuration, how will JMeter start sending requests to the server?
Will requests be sent at 1 req/sec, or will all requests be sent to the server at once?
JMeter will send requests as fast as it can, to wit:
It will start all threads (virtual users) you define in Thread Group within the ramp-up period (in your case - 100 threads in 1 second)
Each thread (virtual user) will start executing the Samplers present in the Thread Group from top to bottom (or according to the Logic Controllers)
When there are no more samplers to execute or loops to iterate, the thread will be shut down
When there are no more active threads left - the JMeter test will end.
With regards to requests per second - it mostly depends on your application response time, i.e.
if you have 100 virtual users and response time is 1 second - you will get 100 requests/second
if you have 100 virtual users and response time is 2 seconds - you will get 50 requests/second
if you have 100 virtual users and response time is 500 milliseconds - you will get 200 requests/second
etc.
I would recommend increasing (and decreasing) the load gradually; this way you will be able to correlate the increasing load with the increasing throughput/response time/number of errors, etc., while releasing all threads at once will not tell you the full story (unless you're doing a form of spike testing, in which case consider using the Synchronizing Timer)
JMeter's ramp-up period set as 1 means to start all 100 threads in 1 second.
This isn't a recommended setting, as described below:
The ramp-up period tells JMeter how long to take to "ramp-up" to the full number of threads chosen. If 10 threads are used, and the ramp-up period is 100 seconds, then JMeter will take 100 seconds to get all 10 threads up and running. Each thread will start 10 (100/10) seconds after the previous thread was begun. If there are 30 threads and a ramp-up period of 120 seconds, then each successive thread will be delayed by 4 seconds.
Ramp-up needs to be long enough to avoid too large a work-load at the start of a test, and short enough that the last threads start running before the first ones finish (unless one wants that to happen).
Start with Ramp-up = number of threads and adjust up or down as needed.
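The spacing rule from the quoted documentation, applied to its examples and to the 100 threads / 1 second case from the question (just a quick check):

    // delay between consecutive thread starts = ramp-up period / number of threads
    object RampUpSpacing extends App {
      for ((threads, rampUpSec) <- Seq((10, 100), (30, 120), (100, 1)))
        println(f"$threads threads over $rampUpSec s -> a new thread every ${rampUpSec.toDouble / threads}%.2f s")
    }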
See also Can i set ramp up period 0 in JMeter?
Bear in mind that with a low ramp-up and many threads, you may be limited by local resources, so your results may be a measurement of client capability rather than of the server.
Concurrent requests to remote actors were taking a long time to get responses: 1 request takes 300 ms, but 100 concurrent requests took almost 30 seconds to complete! So it almost looks like the requests are being executed sequentially. The request size is small, but the response size was about 120 kB in the JVM before serialization. The response had deeply nested case classes.
The response times are similar when running in two different JVMs on the same machine. But responses are fast when in the same JVM (i.e. local actors). It is a single client making concurrent requests to one remote actor.
I see this log in the Akka debug logs. What does it indicate?
DEBUG test-app akka.remote.EndpointWriter - Drained buffer with
maxWriteCount: 50, fullBackoffCount: 546, smallBackoffCount: 2,
noBackoffCount: 1 , adaptiveBackoff: 2000
The log shows that writes to the send buffer failed and had to be backed off. This could indicate that:
send-buffer is too small
receive-buffer on the remote actor's side is too small
network issues
The send buffer size and receive buffer size directly limit the number of concurrent requests and responses! Increase the send and receive buffer sizes on both the client and the server to support the concurrency you need.
If the buffer size is not adequate, Netty will wait for the buffer to be cleared before attempting to write to it again. By default there is also a backoff interval, which can be configured as well.
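A back-of-the-envelope check with the numbers from the question (assuming the serialized responses are of a similar order of magnitude to the ~120 kB in-JVM size):

    // With ~120 kB per reply and 100 concurrent requests, roughly 12 MB of reply data can
    // be in flight at once, which is far more than a small socket buffer can absorb, so
    // writes get refused and backed off (the fullBackoffCount in the log above).
    object BufferSizing extends App {
      val responseBytes = 120 * 1024   // reported response size (in-JVM, before serialization)
      val concurrent    = 100
      println(f"~${responseBytes.toLong * concurrent / (1024.0 * 1024.0)}%.1f MB of replies in flight")
    }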
The buffer settings live under remote.netty.tcp (backoff-interval sits one level up, directly under remote):
akka {
  remote {
    netty.tcp {
      # Sets the send buffer size of the Sockets,
      # set to 0b for platform default
      send-buffer-size = 1024000b

      # Sets the receive buffer size of the Sockets,
      # set to 0b for platform default
      receive-buffer-size = 2048000b
    }

    # Controls the backoff interval after a refused write is reattempted.
    # (Transports may refuse writes if their internal buffer is full)
    backoff-interval = 1 ms
  }
}
For full configuration see Akka reference config.