Benchmarking: Why is my Play (Scala) throughput-latency curve not flat?

I am doing performance benchmarking of my Play (Scala) web app. The application is hosted on a cloud server. I am using Play 2.5.x and Scala 2.11.11. I used Apache Bench (ab) to generate requests. One example 'ab' command:
ab -n 10 -c 10 -T 'application/json'
For my APIs I consistently get a linear curve of response time (ms) vs. number of concurrent requests. Here is one such set of measurements:
Concurrent requests    50%     80%     90%
10                      592     602     732
20                     1002    1013    1014
50                     2168    2222    2290
100                    4177    4179    4222
200                    8477    9459    9462
The first column is the number of concurrent requests; the second, third and fourth columns are the "percentage of requests served within this time" values (in ms) for the 50%, 80% and 90% marks.
In the chart, the blue, red and orange bars represent the 50%, 80% and 90% figures respectively. The CPU load only goes above 50% when there are more than 100 concurrent requests.
These results are from my standard Play+Scala app without any specific optimizations; for example, I am using standard Action => Result controllers for the APIs. The results are quite disappointing given that the system is only partially loaded (CPU load < 50% and hardly any memory usage). The server has 2 CPUs and 8 GB of memory.
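For reference, a minimal sketch of the kind of standard Action => Result controller described above, plus the Action.async variant that is commonly used when the body does blocking work (a possible, but unconfirmed, reason for latency growing while the CPU stays under 50%). The controller, endpoints and dispatcher name below are placeholders, not taken from the actual app:

import javax.inject.Inject
import akka.actor.ActorSystem
import play.api.mvc.{Action, Controller}
import scala.concurrent.{ExecutionContext, Future}

class ApiController @Inject()(system: ActorSystem) extends Controller {

  // A plain Action => Result controller. If the body blocks (DB call,
  // outbound HTTP call, ...), it holds a thread from Play's default pool
  // and requests queue up even while the CPU stays mostly idle.
  def status = Action {
    Ok("""{"status":"ok"}""").as("application/json")
  }

  // One common alternative: run blocking work on a dedicated dispatcher
  // ("contexts.blocking-io" is a placeholder that would need to be defined
  // in application.conf) and return a Future via Action.async.
  private implicit val blockingEc: ExecutionContext =
    system.dispatchers.lookup("contexts.blocking-io")

  def statusAsync = Action.async {
    Future {
      Ok("""{"status":"ok"}""").as("application/json")
    }
  }
}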

If you are interested in how to measure real response latency, then use the wrk2 tool instead.
Here is a presentation by the wrk2 author about how to measure latency and throughput to compare the scalability of different systems or their configurations: https://www.infoq.com/presentations/latency-response-time
As an alternative, use Gatling - its measurement is implemented properly to overcome coordinated omission.
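For illustration, a minimal Gatling simulation sketch in the Gatling 2.x Scala DSL; the host and endpoint are placeholders, not taken from the question. The point of the open injection profile (constantUsersPerSec) is that requests keep arriving at a fixed rate no matter how slow the responses get, which is how coordinated omission is avoided:

import io.gatling.core.Predef._
import io.gatling.http.Predef._
import scala.concurrent.duration._

class PlayApiSimulation extends Simulation {

  // Placeholder host and endpoint -- replace with your cloud server and API path.
  val httpConf = http
    .baseURL("http://your-server:9000")
    .contentTypeHeader("application/json")

  val scn = scenario("JSON API")
    .exec(http("api call").get("/api/status"))

  // Open model: new requests arrive at a fixed rate regardless of how slow
  // previous responses are, so slow responses cannot hide behind fewer requests.
  setUp(
    scn.inject(constantUsersPerSec(50) during (60.seconds))
  ).protocols(httpConf)
}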
BTW, if possible, please share your sources and scripts for testing. In the history of the following repository you can find all of that for the Play 2.5 version too: https://github.com/plokhotnyuk/play
FYI: it is great to see that Java is still in the top 5, but Rust, Kotlin and Go are approaching quickly... and it is a pity that the Scala frameworks are not based on the top Java ones... even Node.js showed a better result than Netty and Undertow: https://www.techempower.com/benchmarks/#section=data-r15&hw=ph&test=json

Related

Determine ideal number of workers and EC2 sizing for master

I have a requirement to use Locust to simulate 20,000 (and more) users in a 10-minute test window.
The locustfile is a task sequence of 9 API calls. I am trying to determine the ideal number of workers, and how many workers should be attached to one EC2 instance on AWS. My testing shows that with 20 workers on two EC2 instances, the CPU load is minimal; the master, however, suffers badly. A 4-CPU, 16 GB RAM machine as the master ends up thrashing to the point that the workers start printing messages like this:
[2020-06-12 19:10:37,312] ip-172-31-10-171.us-east-2.compute.internal/INFO/locust.util.exception_handler: Retry failed after 3 times.
[2020-06-12 19:10:37,312] ip-172-31-10-171.us-east-2.compute.internal/ERROR/locust.runners: RPCError found when sending heartbeat: ZMQ sent failure
[2020-06-12 19:10:37,312] ip-172-31-10-171.us-east-2.compute.internal/INFO/locust.runners: Reset connection to master
The master seems memory-exhausted, as each Locust master process has grown to 12 GB of virtual RAM. OK - so that EC2 instance has a problem. But if I need to test 20,000 users, is there a machine big enough on the planet to handle this? Or do I need to take a different approach, and if so, what is the recommended direction?
In my specific case, one of the steps is to download a randomly selected file from CloudFront. This means that the more open connections to CloudFront are trying to download a file, the more congested the available network becomes.
Because the app client is actually a native app on a mobile device and there are a lot of factors affecting the download speed for each device, I decided to switch from a GET request to a HEAD request. This still lets me test the response time from CloudFront, where the distribution is protected by a Lambda@Edge function which authenticates the user using data from earlier in the test.
Doing this dramatically improved the load test results and doesn't artificially skew the other tests: with bandwidth or system resource exhaustion, every other test would be negatively impacted.
Using this approach I successfully executed a 10,000-user test in a ten-minute run. I used 4 EC2 t2.xlarge instances with 4 workers per instance. The 9 tasks in the test plan resulted in almost 750,000 URL calls.
The answer to the question in the title is: "it depends".
Your post is a little confusing. You say you have 10 master processes? Why?
This problem is most likely not related to the master at all, as it does not care about the size of the downloads (which seems to be the only difference between your test case and most other locust tests)
There are some general tips that might help:
Switch to FastHttpUser (https://docs.locust.io/en/stable/increase-performance.html)
Monitor your network usage (if your load generators are already maxing out their bandwidth or CPU then your test is very unrealistic anyway, and adding more users just adds to the noise; in general, start low and work your way up)
Increase the number of load generators
In general, the number of users is not an issue for locust, but number of requests per second or bandwidth might be.

Calculating and improving the number of requests/concurrent users my webserver can handle?

I am building a dynamic website and app using HTML/JavaScript/PHP/MySQL. I have completed the site and my main focus is now ensuring that when it is launched it is not taken down by the traffic I am hoping to receive (I predict around 5,000-7,000 unique visits on launch day).
The website is currently live, you can see it here : http://www.nightmapper.com/
My hosting is provided by bhost and I am on their silver VPS package:
1024MB Guaranteed Memory,
1536MB Burst Memory,
4 Virtual Cores,
40GB Disk Space,
750GB Data Transfer,
1 IPv4 Address
I manage the server myself, but I'm fairly new to it.
Anyway, the most computationally expensive page is the index/home page. On this page I have 10 MySQL queries, which are (mostly) used to get this week's venue listings. The listing results are each displayed with a thumbnail image.
The size of the home page for a first-time visit is 2.7 MB. I have done everything I can think of to minimize this, including generating thumbnails to reduce image size and utilizing browser caching.
I have tried a couple of methods for stress testing the site, including Load Impact (http://imgur.com/4UCGobf) and ab testing in the terminal. I am worried by the results (mostly by the figure of 5.26 requests per second, which appears to be quite low):
ab -n 100 -c 10 http://www.nightmapper.com/
This is ApacheBench, Version 2.3 <$Revision: 655654 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking www.nightmapper.com (be patient).....done
Server Software: Apache/2.2.22
Server Hostname: www.nightmapper.com
Server Port: 80
Document Path: /
Document Length: 44808 bytes
Concurrency Level: 10
Time taken for tests: 19.012 seconds
Complete requests: 100
Failed requests: 0
Write errors: 0
Total transferred: 4519300 bytes
HTML transferred: 4480800 bytes
Requests per second: 5.26 [#/sec] (mean)
Time per request: 1901.199 [ms] (mean)
Time per request: 190.120 [ms] (mean, across all concurrent requests)
Transfer rate: 232.14 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 26 38 17.9 32 107
Processing: 933 1828 510.2 1782 3495
Waiting: 22 116 303.4 28 1601
Total: 967 1867 518.8 1813 3591
Percentage of the requests served within a certain time (ms)
50% 1813
66% 1983
75% 2032
80% 2184
90% 2412
95% 3124
98% 3568
99% 3591
100% 3591 (longest request)
Using these results, how can I calculate the number of unique visitors a day and concurrent users I can handle, and which methods can I use to identify problems and improve on these results?
I should probably take this opportunity to ask for any good resources where I can learn more about such optimization, load testing and scalability.
This is a complex problem, as there are many factors involved. Here are some things I would investigate:
Your home page as you state is very large, that is going to be a problem. You could look at a caching service for the images, that could help a lot (something like Amazon Cloudfront: https://aws.amazon.com/cloudfront/). This type of content delivery service copies your images to "edge" locations, and takes the burden off of your Web server for downloading those. It could make a very big difference. I would guess that this is the biggest portion of your content, so removing this from your Web server will make things much faster.
The next thing you mention is that you are performing 10 MySQL queries on the home page load, that is a lot of individual queries. If you can restructure your data model or queries to get it down to 1 or 2 queries, it will probably be much faster.
The other option you could try is some sort of paging scheme on the Web page, as the user scrolls down you can perform individual MySQL queries for each portion as it becomes visible.
It seems like you are running on a single server now, an easy thing to do is to run on at least 2 servers (1 for your Web server, 1 for MySQL). MySQL consumes a lot of memory and CPU when it gets busy, so isolating that is recommended.
Scaling your application server is easy: you can use a load balancer and run many app server instances.
Scaling the database tier is more challenging, there are several ways to do that, including read-balancing (using MySQL replication to a read-only slave). After simple read-balancing it gets into sharding, but I doubt you will need that as it does not appear that you have a lot of database writes, or a very big data set. If you do get into a situation with high write volumes and very large data (50GB - 1TB), then sharding is worth looking into.
Estimating the number of users you can handle should be simple to figure out. There is a book I wrote called Software Pipelines which talks about approaches for doing this (http://www.amazon.com/Software-Pipelines-SOA-Multi-Core-Processing/dp/0137137974). The basic idea is to identify how long each step in your processing takes, and compare that against the peak traffic you expect. You have the crude figures to do that now, even with your current implementation. For example, if you can do 5 loads of the home page per second and you expect 7000 users/day, then just calculate the peak traffic. On average, 7000 users/day (with 1 home page load each) is only about 5 page requests/minute. Therefore, even if your peak load is 10x that number, you should be able to handle the load.
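To make that arithmetic explicit, here is a rough back-of-the-envelope sketch (in Scala, purely for illustration); the one-page-load-per-visitor and 10x peak-factor figures are assumptions carried over from the paragraph above, and 5.26 req/s is the ab result from the question:

object CapacityEstimate extends App {
  val visitorsPerDay      = 7000.0   // expected unique visits on launch day
  val pageLoadsPerVisitor = 1.0      // assume one home-page load each
  val measuredReqPerSec   = 5.26     // throughput measured with ab above
  val peakFactor          = 10.0     // assume peak traffic is 10x the daily average

  val avgReqPerSec  = visitorsPerDay * pageLoadsPerVisitor / (24 * 3600)
  val peakReqPerSec = avgReqPerSec * peakFactor

  println(f"average: $avgReqPerSec%.2f req/s (~${avgReqPerSec * 60}%.1f req/min)")
  println(f"peak:    $peakReqPerSec%.2f req/s, headroom: ${measuredReqPerSec / peakReqPerSec}%.1f x")
}

With these figures the peak works out to roughly 0.8 requests/second against a measured 5.26 requests/second, i.e. about 6x headroom even at a 10x peak.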
The key is to understand and profile your application to see where the time is being spent, then apply one or more of the approaches outlined above.
Good luck with your site!

Best CPU for GWT compile for a new build server

When building our current project, GWT compilation takes quite a large share of the overall time (currently ~25 min overall, 2/3 of it GWT compilation). We researched how to optimize that (e.g. here), however in the end we decided to buy a new build server. GWT compilation is quite a CPU-intensive task, so we did some tests to analyze the improvement per core:
1 cores = 197s
2 cores = 165s
3 cores = 149s
4 cores = 157s (possibly the last core was busy with other tasks)
Judging from those numbers, it seems that adding more cores doesn't necessarily improve performance, since the numbers appear to flatten out.
1.)
So now I would be interested whether someone can confirm or disprove that: 8 or 12 cores don't necessarily make a difference - but the individual CPU speed (MHz) does?
2.)
After seeing some benchmarks, our salespeople tend to buy Intel Xeon - any experience with AMD? (I am more of an AMD guy, however currently it seems hard to disregard the benchmarks.)
3.) Any other suggestions regarding memory, IO etc are welcome
Update: When we get the new server I'll post the updated numbers...
We are using an AMD FX-8350 (@ 4.00 GHz) with a Samsung 830 Pro SSD, and we've set localWorkers=4 as well as -Xmx2048m. Previously we used an Intel Xeon E5-2609 (@ 2.40 GHz). The change reduced compilation time from ~440s down to ~310s.
So we also found that raw CPU speed matters most for a single compilation process (with localWorkers=4). When multiple compilation processes run at the same time on the machine, an SSD reduces the IO wait time, which grows with the number of concurrent compilations.
Our current hardware supports up to 4 Maven builds at the same time (each with localWorkers=4) and then uses up to 20 GB of RAM. As the number of concurrent builds increases, the build time increases too, but not linearly, so we try to reduce the idle time in periods where not all resources are used by a single Maven process (Java class compilation, tests, ...).
When we compared hardware prices, we decided to buy a consumer PC to use as a slave in our Jenkins build farm. The overall price is much cheaper than server hardware, and it can easily be replaced with a new one in case of a hardware failure.

How do Clojure's agents compare to Scala's actors?

I wrote a simulation of the Ring network topology in Scala (source here) (Scala 2.8 RC7) and Clojure (source here) (Clojure 1.1) for a comparison of Actors and Agents.
While the Scala version shows almost constant message exchange rate as I increase the number of nodes in network from 100 to 1000000, the Clojure version shows message rates which decrease with the increase in the number of nodes. Also during a single run, the message rate in Clojure version decreases as the time passes.
So I am curious about how the Scala's Actors compare to Clojure's Agents? Are Agents inherently less concurrent than Actors or is the code inefficiently written (autoboxing?)?
PS: I noted that the memory usage in the Scala version increases a lot with the increase in the number of nodes (> 500 MB for 1 million nodes) while the Clojure one uses much less memory (~ 100 MB for 1 million nodes).
Edit:
Both the versions are running on same JVM with all the JVM args and Actor and Agent configuration parameters set as default. On my machine, the Scala version gives a message rate of around 5000 message/sec consistently for 100 to 1 million nodes, whereas the Clojure version starts with 60000 message/sec for 100 nodes which decreases to 200 messages/sec for 1 million nodes.
Edit 2
It turns out that my Clojure version was inefficiently written. I changed the type of the nodes collection from a list to a vector and now it shows consistent behaviour: 100,000 messages/sec for 100 nodes and 80,000 messages/sec for 100,000 nodes. So Clojure Agents seem to be faster than Scala Actors. I have updated the linked sources too.
[Disclaimer: I'm on the Akka team]
A Clojure Agent is a different beast from a Scala actor, most notably if you think about who controls the behavior. In Agents the behavior is defined outside and is pushed to the Agent, and in Actors the behavior is defined inside the Actor.
Without knowing anything about your code I really cannot say much: are you using the same JVM parameters, warming things up the same way, and sensible settings for Actors vs. sensible settings for Agents, or are they tuned separately?
As a side note:
Akka has an implementation of the ring bench located here: http://github.com/jboner/akka-bench/tree/master/ring/
Would be interesting to see what the result is compared to your Clojure test on your machine.

Statistics for requests on deployed VPS servers

I was thinking about different scalability features and suddenly realized that I don't really know how much a single server (VPS) can handle. This question is for those who run loaded projects.
Imagine server with:
1 GB RAM
1 Xeon CPU
CentOS
LAMP with FastCGI
PostgreSQL on the same machine
And we need to estimate the number of requests it can handle, so I decided to assume average parameters for the app:
80% of requests make one call to the DB, using indexes
40-50 KB of HTML per response
Cache hit in 60% of cases
Add some other parameters, and let's calculate - or tell your story about your loads.
I would look at Cacti - it can give you plenty of stats to choose from.