Akka HTTP performance tuning - Scala

I am performing load testing on the Akka HTTP framework (version 10.0) using the wrk tool.
wrk command:
wrk -t6 -c10000 -d 60s --timeout 10s --latency http://localhost:8080/hello
First run, without any blocking call:
import akka.actor.ActorSystem
import akka.http.scaladsl.Http
import akka.http.scaladsl.server.Directives._
import akka.stream.ActorMaterializer
import scala.io.StdIn

object WebServer {
  implicit val system = ActorSystem("my-system")
  implicit val materializer = ActorMaterializer()
  implicit val executionContext = system.dispatcher

  def main(args: Array[String]): Unit = {
    val bindingFuture = Http().bindAndHandle(router.route, "localhost", 8080)
    println(s"Server online at http://localhost:8080/\nPress RETURN to stop...")
    StdIn.readLine() // let it run until user presses return
    bindingFuture
      .flatMap(_.unbind())                 // trigger unbinding from the port
      .onComplete(_ => system.terminate()) // and shutdown when done
  }
}

object router {
  implicit val executionContext = WebServer.executionContext

  val route =
    path("hello") {
      get {
        complete {
          "Ok"
        }
      }
    }
}
Output of wrk:
Running 1m test @ http://localhost:8080/hello
  6 threads and 10000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     4.22ms   16.41ms    2.08s   98.30%
    Req/Sec      9.86k     6.31k   25.79k    62.56%
  Latency Distribution
     50%    3.14ms
     75%    3.50ms
     90%    4.19ms
     99%   31.08ms
  3477084 requests in 1.00m, 477.50MB read
  Socket errors: connect 9751, read 344, write 0, timeout 0
Requests/sec:  57860.04
Transfer/sec:      7.95MB
Now if I add a Future call to the route and run the test again:
val route =
  path("hello") {
    get {
      complete {
        Future { // Blocking code
          Thread.sleep(100)
          "OK"
        }
      }
    }
  }
Output of wrk:
Running 1m test @ http://localhost:8080/hello
  6 threads and 10000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   527.07ms  491.20ms   10.00s    88.19%
    Req/Sec      49.75     39.55   257.00    69.77%
  Latency Distribution
     50%  379.28ms
     75%  632.98ms
     90%    1.08s
     99%    2.07s
  13744 requests in 1.00m, 1.89MB read
  Socket errors: connect 9751, read 385, write 38, timeout 98
Requests/sec:    228.88
Transfer/sec:     32.19KB
As you can see, with the Future call only 13744 requests were served.
Following the Akka documentation, I added a separate dispatcher thread pool for the route, with a maximum of 200 threads.
implicit val executionContext = WebServer.system.dispatchers.lookup("my-blocking-dispatcher")

// config of dispatcher
my-blocking-dispatcher {
  type = Dispatcher
  executor = "thread-pool-executor"
  thread-pool-executor {
    // or in Akka 2.4.2+
    fixed-pool-size = 200
  }
  throughput = 1
}
After the above change, the performance improved a bit:
Running 1m test @ http://localhost:8080/hello
  6 threads and 10000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   127.03ms   21.10ms  504.28ms   84.30%
    Req/Sec     320.89    175.58   646.00    60.01%
  Latency Distribution
     50%  122.85ms
     75%  135.16ms
     90%  147.21ms
     99%  190.03ms
  114378 requests in 1.00m, 15.71MB read
  Socket errors: connect 9751, read 284, write 0, timeout 0
Requests/sec:   1903.01
Transfer/sec:    267.61KB
In the my-blocking-dispatcher config, if I increase the pool size above 200, the performance stays the same.
Now, what other parameters or configuration should I use to increase performance while using a Future call, so that the app gives maximum throughput?

Some disclaimers first: I haven't worked with the wrk tool before, so I might get something wrong. Here are the assumptions I've made for this answer:
The connection count is independent of the thread count, i.e. if I specify -t4 -c10000, wrk keeps 10000 connections, not 4 * 10000.
For every connection the behavior is as follows: it sends the request, receives the response completely, immediately sends the next one, and so on, until the time runs out.
Also, I've run the server on the same machine as wrk, and my machine seems to be weaker than yours (I only have a dual-core CPU), so I've reduced wrk's thread count to 2 and the connection count to 1000 to get decent results.
The Akka HTTP version I've used is 10.0.1, and the wrk version is 4.0.2.
Now to the answer. Let's look at the blocking code you have:
Future { // Blocking code
  Thread.sleep(100)
  "OK"
}
This means every request will take at least 100 milliseconds. If you have 200 threads and 1000 connections, the timeline will be as follows:
Msg: 0        200      400      600      800      1000     1200     2000
     |--------|--------|--------|--------|--------|--------|---..---|---...
Ms:  0        100      200      300      400      500      600      1000
where Msg is the number of processed messages and Ms is the elapsed time in milliseconds.
This gives us 2000 messages processed per second, or ~60000 messages per 30 seconds, which mostly agrees with the test figures:
wrk -t2 -c1000 -d 30s --timeout 10s --latency http://localhost:8080/hello
Running 30s test @ http://localhost:8080/hello
  2 threads and 1000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   412.30ms  126.87ms  631.78ms   82.89%
    Req/Sec      0.95k   204.41     1.40k    75.73%
  Latency Distribution
     50%  455.18ms
     75%  512.93ms
     90%  517.72ms
     99%  528.19ms
  here: --> 56104 requests in 30.09s <--, 7.70MB read
  Socket errors: connect 0, read 1349, write 14, timeout 0
Requests/sec:   1864.76
Transfer/sec:    262.23KB
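(As a quick sanity check, here's that back-of-the-envelope model as a tiny Scala sketch; the helper name is mine, not anything from Akka:)

// Closed-loop model: each of the N pool threads finishes one blocking
// request every serviceTimeMs milliseconds, so throughput is capped at
// N * 1000 / serviceTimeMs requests per second.
def modelThroughput(poolThreads: Int, serviceTimeMs: Int): Double =
  poolThreads * 1000.0 / serviceTimeMs

modelThroughput(200, 100) // = 2000.0, close to the measured ~1865 req/s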
It is also obvious that this number (2000 messages per second) is strictly bound by the thread count. E.g. if we had 300 threads, we'd process 300 messages every 100 ms, so we'd have 3000 messages per second, provided our system can handle that many threads. Let's see how we fare if we provide 1 thread per connection, i.e. 1000 threads in the pool:
wrk -t2 -c1000 -d 30s --timeout 10s --latency http://localhost:8080/hello
Running 30s test @ http://localhost:8080/hello
  2 threads and 1000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   107.08ms   16.86ms  582.44ms   97.24%
    Req/Sec      3.80k     1.22k    5.05k    79.28%
  Latency Distribution
     50%  104.77ms
     75%  106.74ms
     90%  110.01ms
     99%  155.24ms
  223751 requests in 30.08s, 30.73MB read
  Socket errors: connect 0, read 1149, write 1, timeout 0
Requests/sec:   7439.64
Transfer/sec:      1.02MB
As you can see, now one request takes almost exactly 100 ms on average, i.e. the same amount we put into Thread.sleep. It seems we can't get much faster than this! We're now pretty much in the standard situation of one thread per request, which worked well for many years until asynchronous IO let servers scale up much higher.
For comparison, here are the fully non-blocking test results on my machine with the default fork-join thread pool:
complete {
  Future {
    "OK"
  }
}
====>
wrk -t2 -c1000 -d 30s --timeout 10s --latency http://localhost:8080/hello
Running 30s test @ http://localhost:8080/hello
  2 threads and 1000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    15.50ms   14.35ms  468.11ms   93.43%
    Req/Sec     22.00k     5.99k   34.67k    72.95%
  Latency Distribution
     50%   13.16ms
     75%   18.77ms
     90%   25.72ms
     99%   66.65ms
  1289402 requests in 30.02s, 177.07MB read
  Socket errors: connect 0, read 1103, write 42, timeout 0
Requests/sec:  42946.15
Transfer/sec:      5.90MB
To summarize: if you use blocking operations, you need one thread per request to achieve the best throughput, so configure your thread pool accordingly. There are natural limits to how many threads your system can handle, and you might need to tune your OS for the maximum thread count. For best throughput, avoid blocking operations.
Also, don't confuse asynchronous operations with non-blocking ones. Your code with Future and Thread.sleep is a perfect example of an asynchronous, but blocking, operation. Lots of popular software operates in this mode (some legacy HTTP clients, Cassandra drivers, AWS Java SDKs, etc.). To fully reap the benefits of a non-blocking HTTP server, you need to be non-blocking all the way down, not just asynchronous. It might not always be possible, but it's something to strive for.
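For illustration, here is a minimal sketch of the question's route with the 100 ms delay made non-blocking via akka.pattern.after (assuming the same Directives import as the original route, an implicit ExecutionContext, and the WebServer.system scheduler are in scope): no thread is parked while waiting, the scheduler simply completes the Future after the timeout.

import akka.pattern.after
import scala.concurrent.Future
import scala.concurrent.duration._

val route =
  path("hello") {
    get {
      complete {
        // Non-blocking delay: the scheduler completes the Future after 100 ms,
        // no thread sits in Thread.sleep.
        after(100.millis, WebServer.system.scheduler)(Future.successful("OK"))
      }
    }
  }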

I get 3x the performance on my localhost with this config:
akka {
  actor {
    default-dispatcher {
      fork-join-executor {
        parallelism-min = 1
        parallelism-max = 64
        parallelism-factor = 1
      }
      throughput = 64
    }
  }

  http {
    host-connection-pool {
      max-connections = 10000
      max-open-requests = 4096
    }

    server {
      pipelining-limit = 1024
      max-connections = 4096
      backlog = 1024
    }
  }
}
Other values for these parameters may work even better (please write to me if so).
Akka Http version 10.1.12.
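For reference, here is a sketch of one way such a block can be applied if it doesn't already live in application.conf on the classpath: parse it and pass it to the ActorSystem (the keys shown are just a subset of the ones above).

import akka.actor.ActorSystem
import com.typesafe.config.ConfigFactory

// Parse a tuning snippet and fall back to the regular application.conf / reference.conf defaults.
val tuning = ConfigFactory.parseString(
  """
  akka.http.server.max-connections = 4096
  akka.http.server.backlog = 1024
  """)
implicit val system = ActorSystem("my-system", tuning.withFallback(ConfigFactory.load()))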

Related

Celery Prefetch count ignored when using greenlet based pools with SQS

We are experiencing an issue with Celery + SQS, using the gevent or eventlet pools.
Even if we have a large number of messages in the queue, with concurrency set to 100 and the prefetch multiplier set to 5 (so the total prefetch_count reported is 500), Celery will receive 10 messages from the queue (the maximum allowed by SQS) and wait for all of them to finish before attempting to receive again (from AWS we see either 10 or 0 messages in flight).
We do not see the same behavior using the prefork pool.
Our tasks are I/O bound, so using prefork would be inefficient. Is there any way to increase the number of messages pulled from the queue to keep all the greenlets busy?

How does JMeter start sending requests to the server

If Threads: 100, Ramp-up: 1 and Loop count: 1 is the configuration, how will JMeter start sending requests to the server?
Will requests be sent at 1 req/sec, or will all requests be sent to the server at once?
JMeter will send requests as fast as it can, to wit:
It will start all the threads (virtual users) you define in the Thread Group within the ramp-up period (in your case, 100 threads in 1 second).
Each thread (virtual user) will execute the Samplers present in the Thread Group from top to bottom (or according to the Logic Controllers).
When there are no more Samplers to execute or loops to iterate, the thread will be shut down.
When there are no more active threads left, the JMeter test will end.
With regard to requests per second, it mostly depends on your application response time (a small sketch of the arithmetic follows this list), i.e.:
if you have 100 virtual users and the response time is 1 second, you will get 100 requests/second
if you have 100 virtual users and the response time is 2 seconds, you will get 50 requests/second
if you have 100 virtual users and the response time is 500 milliseconds, you will get 200 requests/second
etc.
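A tiny Scala sketch of that arithmetic (the helper is hypothetical, just to illustrate the closed-loop relationship, assuming zero think time between requests):

// Each virtual user issues one request, waits for the response, then immediately
// issues the next one, so throughput = users / response time.
def expectedRps(virtualUsers: Int, responseTimeMs: Double): Double =
  virtualUsers / (responseTimeMs / 1000.0)

expectedRps(100, 1000) // 100.0 requests/second
expectedRps(100, 2000) //  50.0 requests/second
expectedRps(100, 500)  // 200.0 requests/second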
I would recommend increasing (and decreasing) the load gradually; this way you will be able to correlate increasing load with throughput, response time, number of errors, etc. Releasing all threads at once will not tell you the full story (unless you're doing a form of spike testing, in which case consider using a Synchronizing Timer).
A JMeter ramp-up period of 1 means starting all 100 threads in 1 second.
These aren't recommended settings, as described below:
The ramp-up period tells JMeter how long to take to "ramp-up" to the full number of threads chosen. If 10 threads are used, and the ramp-up period is 100 seconds, then JMeter will take 100 seconds to get all 10 threads up and running. Each thread will start 10 (100/10) seconds after the previous thread was begun. If there are 30 threads and a ramp-up period of 120 seconds, then each successive thread will be delayed by 4 seconds.
Ramp-up needs to be long enough to avoid too large a work-load at the start of a test, and short enough that the last threads start running before the first ones finish (unless one wants that to happen).
Start with Ramp-up = number of threads and adjust up or down as needed.
See also Can i set ramp up period 0 in JMeter?
Bear in mind that with a low ramp-up and many threads, you may be limited by local resources, so your results may be a measurement of client capability rather than the server.

Performance issue sending larger messages via akka remote actors

Responses to concurrent requests to remote actors were taking a long time: 1 request takes 300 ms, but 100 concurrent requests took almost 30 seconds to complete! So it almost looks like the requests are being executed sequentially. The request size is small, but the response size was about 120 kB in the JVM before serialization, and the response had deeply nested case classes.
The response times are similar when running on two different JVMs on the same machine. But responses are fast when in the same JVM (i.e. local actors). It is a single client making concurrent requests to one remote actor.
I see this log in akka debug logs. What does this indicate?
DEBUG test-app akka.remote.EndpointWriter - Drained buffer with
maxWriteCount: 50, fullBackoffCount: 546, smallBackoffCount: 2,
noBackoffCount: 1, adaptiveBackoff: 2000
The logs show that a write to the send buffer failed. This could indicate that:
the send buffer is too small
the receive buffer on the remote actor's side is too small
there are network issues
The send buffer size and receive buffer size directly limit the number of concurrent requests and responses! Increase the send buffer and receive buffer sizes, on both client and server, to support the required concurrency on both sides.
If the buffer size is not adequate, Netty will wait for the buffer to be cleared before attempting to rewrite to it. By default there is a backoff time too, and this can be configured as well (see the rough illustration after the config below).
The settings are under remote.netty.tcp:
akka {
  remote {
    netty.tcp {
      # Sets the send buffer size of the Sockets,
      # set to 0b for platform default
      send-buffer-size = 1024000b

      # Sets the receive buffer size of the Sockets,
      # set to 0b for platform default
      receive-buffer-size = 2048000b
    }

    # Controls the backoff interval after a refused write is reattempted.
    # (Transports may refuse writes if their internal buffer is full)
    backoff-interval = 1 ms
  }
}
For full configuration see Akka reference config.
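As a rough, hypothetical illustration of why buffer size caps concurrency (using the ~120 kB response size from the question and the receive-buffer-size above; the real limit also depends on serialization overhead and framing):

// Back-of-the-envelope: how many serialized responses fit in the buffer
// before Netty has to back off and retry the write.
val responseBytes = 120 * 1024 // ~120 kB per response, from the question
val bufferBytes   = 2048000    // receive-buffer-size from the config above
val inFlight      = bufferBytes / responseBytes
println(s"Roughly $inFlight responses fit in the buffer at once")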

Majordomo broker throughput measurement

I am testing the Majordomo broker's throughput. The test_client.c that comes along with the Majordomo code on GitHub sends synchronous requests. I want to test the maximum throughput that the Majordomo broker can achieve. The specification (http://rfc.zeromq.org/spec:7) says that it can switch up to a million messages per second.
First, I changed the client code to send 100k requests asynchronously. Even after setting the HWM on all the sockets sufficiently high, and increasing the TCP buffers to 4 MB, I was observing packet loss with three clients running in parallel.
So I changed the client to send 10k requests at once, and then send two requests for every reply that it receives. I chose 10k because that allowed me to run up to ten clients (each sending 100k messages) in parallel without any packet loss. Here is the client code:
#include "../include/mdp.h"
#include <time.h>
int main (int argc, char *argv [])
{
int verbose = (argc > 1 && streq (argv [1], "-v"));
mdp_client_t *session = mdp_client_new (argv[1], verbose);
int count1, count2;
struct timeval start,end;
gettimeofday(&start, NULL);
for (count1 = 0; count1 < 10000; count1++) {
zmsg_t *request = zmsg_new ();
zmsg_pushstr (request, "Hello world");
mdp_client_send (session, "echo", &request);
}
for (count1 = 0; count1 < 45000; count1++) {
zmsg_t *reply = mdp_client_recv (session,NULL,NULL);
if (reply)
{
zmsg_destroy (&reply);
zmsg_t *request = zmsg_new ();
zmsg_pushstr (request, "Hello world");
mdp_client_send (session, "echo", &request);
request = zmsg_new ();
zmsg_pushstr (request, "Hello world");
mdp_client_send (session, "echo", &request);
}
else
break; // Interrupted by Ctrl-C
}
/* receiving the remaining 55k replies */
for(count1 = 45000; count1 < 100000; count1++)
{
zmsg_t *reply = mdp_client_recv (session,NULL,NULL);
if (reply)
{
zmsg_destroy (&reply);
}
else
break;
}
gettimeofday(&end, NULL);
long elapsed = (end.tv_sec - start.tv_sec) +((end.tv_usec - start.tv_usec)/1000000);
printf("time = %ld\n", elapsed);
printf ("%d replies received\n", count1);
mdp_client_destroy (&session);
return 0;
}
I ran the broker, worker, and the clients within the same machine. Here is the recorded time:
Number of clients in parallel (each sending 100k)    Time elapsed (seconds)
 1                                                     4
 2                                                     9
 3                                                    12
 4                                                    16
 5                                                    21
10                                                    43
So for every 100k requests, the broker is taking about 4 seconds. Is this the expected behavior? I am not sure how to achieve a million messages per second.
LATEST UPDATE:
I came up with an approach to improve the throughput of the system:
Two brokers instead of one. One of the brokers (broker1) is responsible for sending the client requests to the workers, and the other broker (broker2) is responsible for sending the response of the workers to the clients.
The workers register with broker1.
The clients generate a unique id and register with broker2.
Along with the request, a client also sends its unique id to broker1.
Worker extracts the unique client id from the request, and sends its response (along with the client id to whom the response has to be sent) to broker2.
Now, every 100k requests take around 2 seconds instead of 4 seconds (when using a single broker). I added gettimeofday calls within the broker code to measure how much latency is added by the broker itself.
Here is what I have recorded
100k requests (total time: ~2 seconds) -> latency added by the brokers is 2 seconds
200k requests (total time: ~4 seconds) -> latency added by the brokers is 3 seconds
300k requests (total time: ~7 seconds) -> latency added by the brokers is 5 seconds
So the bulk of the time is being spent within the broker code. Could someone please suggest how to improve this.
The maximum throughput is bound by the maximum throughput of the broker, but it's also bound by the maximum throughput of the workers.
It seems to me like you're starting only one worker. If you read the Majordomo protocol specification carefully, it says that the broker should be able to switch up to millions of messages per second, but it doesn't guarantee that a single worker can process millions of requests per second.
Given that the worker treats only one request at a time and the broker uses a fully synchronous dialog with each worker (the broker will not send another request until it gets a reply), it's impossible to squeeze the most out of the broker with a single worker: a full network round-trip with the broker is performed between the reply and the next request, so the worker spends most of its time waiting.
For starters, you should try adding more workers :-)
EDIT: quick demonstration.
I ran an echo service using a Python implementation of the Majordomo pattern with ZeroMQ. Of course, this is Python code with full error checking, and the C version should be much, much faster, but it shows the impact of having multiple workers handling requests.
Test setup:
1 client
client sends a max of 1000 outstanding requests at once (initially, 1000 are sent, then the client sends one new request each time it receives a response);
the client sends a total of 100,000 requests;
Results:
Using 1 worker, the client gets 100,000 replies in 13.6 seconds (~7300 RPS).
Using 2 workers, the client gets 100,000 replies in 8.0 seconds (~12500 RPS).
Using 3 workers, the client gets 100,000 replies in 7.8 seconds (~13000 RPS).
Using more clients will yield better throughput in the broker because it will be performing I/O on multiple sockets, but it's not as easy to compute the total broker throughput, so I'll leave you to measure that.

TCP, HTTP and the Multi-Threading Sweet Spot

I'm trying to understand the performance numbers I'm getting and how to determine the optimal number of threads.
See the bottom of this post for my results
I wrote an experimental multi-threaded web client in Perl which downloads a page, grabs the source for each image tag and downloads the image, discarding the data.
It uses a non-blocking connect with an initial per-file timeout of 10 seconds which doubles after each timeout and retry. It also caches IP addresses, so each thread only has to do a DNS lookup once.
The total amount of data downloaded is 2271122 bytes in 1316 files via a 2.5 Mbit connection from http://hubblesite.org/gallery/album/entire/npp/all/hires/true/. The thumbnail images are hosted by a company which claims to specialize in low latency for high-bandwidth applications.
Wall times are:
1 Thread takes 4:48 -- 0 timeouts
2 Threads takes 2:38 -- 0 timeouts
5 Threads takes 2:22 -- 20 timeouts
10 Threads take 2:27 -- 40 timeouts
50 Threads take 2:27 -- 170 timeouts
In the worst case ( 50 threads ) less than 2 seconds of CPU time are consumed by the client.
avg file size 1.7k
avg rtt 100 ms ( as measured by ping )
avg cli cpu/img 1 ms
The fastest average download speed is with 5 threads, at about 15 KB/sec overall.
The server actually does seem to have pretty low latency, as it takes only 218 ms to get each image, meaning it takes only 18 ms on average for the server to process each request:
  0 cli sends syn
 50 srv rcvs syn
 50 srv sends syn + ack
100 cli conn established / cli sends get
150 srv recv's get
168 srv reads file, sends data, calls close
218 cli recv HTTP headers + complete file in 2 segments MSS == 1448
I can see that the per file average download speed is low because of the small file sizes and the relatively high cost per file of the connection setup.
What I don't understand is why I see virtually no improvement in performance beyond 2 threads. The server seems to be sufficiently fast, but already starts timing out connections at 5 threads.
The timeouts seem to start after about 900 - 1000 successful connections whether it's 5 or 50 threads, which I assume is probably some kind of throttling threshold on the server, but I would expect 10 threads to still be significantly faster than 2.
Am I missing something here?
EDIT-1
Just for comparison's sake, I installed the DownThemAll Firefox extension and downloaded the images using it. I set it to 4 simultaneous connections with a 10-second timeout. DTM took about 3 minutes to download all the files and write them to disk, and it also started experiencing timeouts after about 900 connections.
I'm going to run tcpdump to try and get a better picture what's going on at the tcp protocol level.
I also cleared Firefox's cache and hit reload. 40 seconds to reload the page and all the images. That seemed way too fast; maybe Firefox kept them in a memory cache which wasn't cleared? So I opened Opera and it also took about 40 seconds. I assume they're so much faster because they must be using HTTP/1.1 pipelining?
And the Answer Is!??
So after a little more testing and writing code to reuse the sockets via pipelining I found out some interesting info.
When running at 5 threads, the non-pipelined version retrieves the first 1026 images in 77 seconds but takes a further 65 seconds to retrieve the remaining 290 images. This pretty much confirms MattH's theory about my client getting hit by a SYN FLOOD event, causing the server to stop responding to my connection attempts for a short period of time. However, that is only part of the problem, since 77 seconds is still very slow for 5 threads to get 1026 images; if you remove the SYN FLOOD issue it would still take about 99 seconds to retrieve all the files. So based on a little research and some tcpdumps, it seems like the other part of the issue is latency and the connection setup overhead.
Here's where we get back to the issue of finding the "Sweet Spot" or the optimal number of threads. I modified the client to implement HTTP/1.1 Pipelining and found that the optimal number of threads in this case is between 15 and 20. For example:
1 Thread takes 2:37 -- 0 timeouts
2 Threads takes 1:22 -- 0 timeouts
5 Threads takes 0:34 -- 0 timeouts
10 Threads take 0:20 -- 0 timeouts
11 Threads take 0:19 -- 0 timeouts
15 Threads take 0:16 -- 0 timeouts
There are four factors which affect this: latency/RTT, maximum end-to-end bandwidth, receive buffer size, and the size of the image files being downloaded. See this site for a discussion of how receive buffer size and RTT latency affect available bandwidth.
In addition to the above, average file size affects the maximum per-connection transfer rate. Every time you issue a GET request you create an empty gap in your transfer pipe which is the size of the connection RTT. For example, if your Maximum Possible Transfer Rate (recv buffer size / RTT) is 2.5 Mbit and your RTT is 100 ms, then every GET request incurs a minimum 32 kB gap in your pipe. For a large average image size of 320 kB that amounts to a 10% overhead per file, effectively reducing your available bandwidth to 2.25 Mbit. However, for a small average file size of 3.2 kB the overhead jumps to 1000% and available bandwidth is reduced to 232 kbit/second, about 29 kB.
So to find the optimal number of threads:
Gap Size = MPTR * RTT
optimal threads = MPTR / (MPTR / (Gap Size + AVG file size) * AVG file size)
For my above scenario this gives me an optimum thread count of 11 threads, which is extremely close to my real world results.
If the actual connection speed is slower than the theoretical MPTR then it should be used in the calculation instead.
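Here is the same calculation as a small, hedged Scala sketch (the names are just the terms used above, not any library API; plug in your own measurements, and use the actual connection speed for MPTR if it's lower than the theoretical one):

// MPTR = maximum possible transfer rate in kB/s, RTT in seconds, average file size in kB.
def optimalThreads(mptrKBps: Double, rttSec: Double, avgFileKB: Double): Double = {
  val gapKB = mptrKBps * rttSec                           // Gap Size = MPTR * RTT
  mptrKBps / (mptrKBps / (gapKB + avgFileKB) * avgFileKB) // threads needed to keep the pipe full
}

// Example with the figures from this post: ~2.5 Mbit/s, 100 ms RTT, 1.7 kB average file size.
val threads = optimalThreads(312.5, 0.1, 1.7)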
Please correct me if this summary is incorrect:
Your multi-threaded client will start a thread that connects to the server and issues just one HTTP GET, then that thread closes.
When you say 1, 2, 5, 10, 50 threads, you're just referring to how many concurrent threads you allow; each thread itself only handles one request.
Your client takes between 2 and 5 minutes to download over 1000 images
Firefox and Opera will download an equivalent data set in 40 seconds
I suggest that the server rate-limits HTTP connections, either by the webserver daemon itself, a server-local firewall, or, most likely, a dedicated firewall.
You are actually abusing the web service by not re-using the HTTP connections for more than one request, and the timeouts you experience are because your SYN FLOOD is being clamped.
Firefox and Opera are probably using between 4 and 8 connections to download all of the files.
If you redesign your code to re-use the connections you should achieve similar performance.