Maximizing S3 upload performance with AWS C++ SDK - aws-sdk-cpp

I am using a c5.18xlarge instance with the ENA adapter enabled (so, per AWS support, I should have 25 Gbps connectivity to S3). I am using the AWS C++ SDK (version 1.3.59) on RHEL 7 to upload a 70 GB file to a single S3 object using a 256 MB part size. Per AWS support, I've set the ClientConfiguration's maxConnections field to 999 and its executor field to a PooledThreadExecutor with a pool size of 999 (and these have improved my performance). I am performing a series of S3Client::UploadPart() calls, threading these myself; I get very similar performance when using UploadPartCallable() and letting the SDK manage the threading.
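For reference, here is a stripped-down sketch of the setup described above (the allocation tag, bucket, and key names are placeholders, and the part body is stubbed out; the real code streams 256 MB slices of the file and uploads many parts in parallel):

```cpp
#include <aws/core/Aws.h>
#include <aws/core/utils/memory/stl/AWSStringStream.h>
#include <aws/core/utils/threading/Executor.h>
#include <aws/s3/S3Client.h>
#include <aws/s3/model/CreateMultipartUploadRequest.h>
#include <aws/s3/model/UploadPartRequest.h>

int main()
{
    Aws::SDKOptions options;
    Aws::InitAPI(options);
    {
        // ClientConfiguration tuned as described above; the executor is only
        // exercised by the async/Callable APIs.
        Aws::Client::ClientConfiguration config;
        config.maxConnections = 999;
        config.executor = Aws::MakeShared<Aws::Utils::Threading::PooledThreadExecutor>(
            "upload-pool", 999);

        Aws::S3::S3Client s3(config);

        // Start the multipart upload (error handling omitted in this sketch).
        Aws::S3::Model::CreateMultipartUploadRequest create;
        create.SetBucket("my-bucket");
        create.SetKey("my-70gb-object");
        const auto uploadId = s3.CreateMultipartUpload(create).GetResult().GetUploadId();

        // One UploadPart call per 256 MB part; a tiny placeholder body is used here.
        auto body = Aws::MakeShared<Aws::StringStream>("part-body");
        *body << "part payload goes here";

        Aws::S3::Model::UploadPartRequest part;
        part.SetBucket("my-bucket");
        part.SetKey("my-70gb-object");
        part.SetUploadId(uploadId);
        part.SetPartNumber(1);
        part.SetBody(body);
        auto outcome = s3.UploadPart(part); // blocking call, threaded by the caller
    }
    Aws::ShutdownAPI(options);
    return 0;
}
```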
Here's the performance I'm seeing:
- 36 threads: 7.5 Gbps
- 200 threads: 15.7 Gbps
AWS support reported similar behavior (actually they were using 900 threads).
I've looked through the underlying implementation of S3Client, including all the low-level thread management and curl handle management, and I don't see anything obviously inefficient going on. It just doesn't make any sense to me that I would need 200 threads to achieve this performance on a machine that has 36 physical cores. Is this expected? Could someone explain what's happening, or suggest a different way to configure the SDK that doesn't require this many threads? I think I could provide my own HTTPClientFactory and, if I'm careful, cut out a mutex in how the curl handles are managed, but that seems unlikely to account for what I'm seeing.
Thanks for any help.
-Adam

I am using the AWS C++ SDK (version 1.3.59) on RHEL 7 to upload a 70 GB file to a single S3 object using a 256 MB part size.
You're probably being limited by your disk/storage device's read throughput. It's actually impressive that you're able to reach 15.7 Gbps.

In my test, I see that all threads created by Aws::Utils::Threading::PooledThreadExecutor are running on a single CPU core (while the spot instance has 72 vCPUs). Have you seen the same behavior in your tests?
The way I further improved performance was by using my own threading model with the blocking S3Client APIs instead of a PooledThreadExecutor with the S3 async methods (such as UploadPartAsync()).
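Roughly what I mean, as a minimal sketch (the bucket, key, upload ID, and the buildPartRequest() helper below are placeholders; retries, error handling, and completing the multipart upload are omitted):

```cpp
#include <atomic>
#include <thread>
#include <vector>

#include <aws/core/utils/memory/stl/AWSStringStream.h>
#include <aws/s3/S3Client.h>
#include <aws/s3/model/UploadPartRequest.h>

// Hypothetical helper: real code would set the upload ID returned by
// CreateMultipartUpload and wrap a stream over that part's bytes.
static Aws::S3::Model::UploadPartRequest buildPartRequest(int partNumber)
{
    Aws::S3::Model::UploadPartRequest req;
    req.SetBucket("my-bucket");
    req.SetKey("my-key");
    req.SetUploadId("upload-id-from-CreateMultipartUpload");
    req.SetPartNumber(partNumber);
    req.SetBody(Aws::MakeShared<Aws::StringStream>("part-body"));
    return req;
}

// Spawn our own worker threads, each calling the blocking UploadPart()
// directly; no executor or async callbacks are involved.
void uploadAllParts(const Aws::S3::S3Client& s3, int totalParts, int threadCount)
{
    std::atomic<int> nextPart{1};
    std::vector<std::thread> workers;

    for (int i = 0; i < threadCount; ++i) {
        workers.emplace_back([&] {
            // Each worker claims the next unclaimed part number and uploads it.
            for (int part = nextPart++; part <= totalParts; part = nextPart++) {
                auto outcome = s3.UploadPart(buildPartRequest(part));
                // Error handling / retries omitted in this sketch.
            }
        });
    }
    for (auto& w : workers) w.join();
}
```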

Related

How many concurrent users can the latest ejabberd XMPP server handle?

I want to build an instant messaging app and I need to know how many concurrent users the ejabberd XMPP server can handle. The following are my server hardware specs:
CPU: 2× Intel Xeon E5-2630 v4 (2 × 10 cores @ 2.20 GHz)
RAM: 256 GB REG ECC
Storage: 1 TB SSD
Thanks for the help in advance.
There is no hard limit in ejabberd; it all depends on whether the machine's CPU and RAM can handle the load.
It matters a great deal whether those accounts will generate traffic (have many contacts, change presence frequently, and send messages) or will be mostly idle:
- Many mostly idle accounts: consumes memory.
- Few accounts chatting and changing presence a lot: consumes CPU.
It is also important in the long term that you use a powerful SQL database instead of the internal Mnesia database (which is acceptable only for small servers).
You may want to run a benchmark tool, configured for your expected usage, to check ejabberd (or any other XMPP server).
For example, 18 years ago I used jabsimul on a machine with a single 1.7 GHz processor and less than 1 GB of RAM, and it could handle thousands of chatty users perfectly:
https://www.ejabberd.im/benchmark/index.html

GCP CloudSQL (PostgreSQL) Crash During Stored Procedure Execution and Failover

I have a stored procedure in GCP CloudSQL (PostgreSQL v9.0.23). It works fine in lower environments, but when it runs in Production (with significantly more volume) it crashes the database itself, which results in a failover.
When we checked the metrics, we found that memory usage goes above 90% just before the crash (15 GB out of the 16 GB instance memory). Read/write operations are also very high, at more than 1,000 ops per second.
The stored procedure runs a number of SELECT and INSERT statements. Any suggestions to improve this situation would help.
Thanks in advance.
Since you mention that the Cloud SQL instance runs smoothly with a small workload but crashes under the much more intensive Production workload, the issue appears to be the instance size, so I would suggest increasing the instance size to match your needs.
You also mention that memory usage is 15 GB out of 16 GB, which is nearly 94%. As per this document, your Cloud SQL instance is not covered by the Cloud SQL SLA if memory usage stays above 90% for more than 6 hours, so I would suggest keeping memory usage below 90%. I would also suggest keeping CPU utilization within the limits mentioned in this document. To know when your instance reaches any of these thresholds, set up a monitoring alert for those metrics as described here.
If increasing the instance size doesn't help, I would recommend creating a support ticket with Google Cloud Support so they can investigate in detail.

Determine ideal number of workers and EC2 sizing for master

I have a requirement to use Locust to simulate 20,000 (and more) users in a 10-minute test window.
The locustfile is a task sequence of 9 API calls. I am trying to determine the ideal number of workers, and how many workers should be attached to each EC2 instance on AWS. My testing shows that with 20 workers on two EC2 instances, the worker CPU load is minimal; the master, however, suffers badly. A 4-CPU, 16 GB RAM system as the master ends up thrashing to the point that the workers start printing messages like this:
[2020-06-12 19:10:37,312] ip-172-31-10-171.us-east-2.compute.internal/INFO/locust.util.exception_handler: Retry failed after 3 times.
[2020-06-12 19:10:37,312] ip-172-31-10-171.us-east-2.compute.internal/ERROR/locust.runners: RPCError found when sending heartbeat: ZMQ sent failure
[2020-06-12 19:10:37,312] ip-172-31-10-171.us-east-2.compute.internal/INFO/locust.runners: Reset connection to master
The master seems memory-exhausted, as each Locust master process has grown to 12 GB of virtual RAM. OK, so the EC2 instance has a problem. But if I need to test 20,000 users, is there a machine big enough on the planet to handle this? Or do I need to take a different approach, and if so, what is the recommended direction?
In my specific case, one of the tasks downloads a randomly selected file from CloudFront. This means that the more open connections to CloudFront are trying to download a file, the more congested the available network becomes.
Because the app client is actually a native mobile app and there are many factors affecting each device's download speed, I decided to switch from a GET request to a HEAD request. This lets me test the response time from CloudFront, where the distribution is protected by a Lambda@Edge function that authenticates the user using data from earlier in the test.
Doing this dramatically improved the load test results and doesn't artificially skew the other tests; with bandwidth or system resource exhaustion, every other test would be negatively impacted.
Using this approach I successfully executed a 10,000-user test in a ten-minute run. I used four EC2 t2.xlarge instances with 4 workers each. The 9 tasks in the test plan resulted in almost 750,000 URL calls.
The answer to the question in the title is: "it depends".
Your post is a little confusing. You say you have 10 master processes? Why?
This problem is most likely not related to the master at all, as the master does not care about the size of the downloads (which seems to be the only difference between your test case and most other Locust tests).
There are some general tips that might help:
- Switch to FastHttpUser (https://docs.locust.io/en/stable/increase-performance.html).
- Monitor your network usage (if your load generators are already maxing out their bandwidth or CPU, your test is very unrealistic anyway, and adding more users just adds to the noise). In general, start low and work your way up.
- Increase the number of load generators.
In general, the number of users is not an issue for Locust, but the number of requests per second or the bandwidth might be.

Benchmarking: Why is Play (Scala) throughput-latency curve not coming flat?

I am doing performance benchmarking of my Play (Scala) web app. The application is hosted on a cloud server. I am using Play 2.5.x and Scala 2.11.11. I used Apache Bench to generate requests; here is one example 'ab' command:
ab -n 10 -c 10 -T 'application/json'
For my APIs I am consistently getting a linear curve for number of concurrent requests vs. response time (ms). Here is one such data set:
Concurrent requests    50%     80%     90%
10                     592     602     732
20                     1002    1013    1014
50                     2168    2222    2290
100                    4177    4179    4222
200                    8477    9459    9462
The first column is the number of concurrent requests; the remaining columns give the response time (ms) within which 50%, 80%, and 90% of requests were served.
The CPU load goes above 50% only when there are more than 100 concurrent requests.
These results are from my standard Play + Scala app without any specific optimizations, e.g. I am using standard Action => Result controllers for the APIs. The results are quite disappointing to me given that the system is only partially loaded (CPU load < 50% and hardly any memory usage). The server has 2 CPUs and 8 GB of memory.
If you are interested in measuring real response latency, then use the wrk2 tool instead.
Here is a presentation by the wrk2 author about how to measure latency and throughput when comparing the scalability of different systems or configurations: https://www.infoq.com/presentations/latency-response-time
As an alternative, use Gatling - its measurement is implemented properly to overcome coordinated omission.
BTW, if possible, please share your sources and test scripts. In the history of the following repository you can find all of that for the Play 2.5 version too: https://github.com/plokhotnyuk/play
FYI: it is great to see that Java is still in the top 5, but Rust, Kotlin and Go are approaching quickly... and it is a great pity that the Scala frameworks are not built on top of the fastest Java ones... even Node.js showed a better result than Netty and Undertow: https://www.techempower.com/benchmarks/#section=data-r15&hw=ph&test=json

Statistics for requests on deployed VPS servers

I was thinking about different scalability features and suddenly realized that I don't really know how much a single server (VPS) can handle. This question is for those who run loaded projects.
Imagine server with:
1 GB RAM
1 Xeon CPU
CentOS
LAMP with FastCGI
PostgreSQL on the same machine
And we need to calculate the number of requests it can handle, so I decided to assume average parameters for the app:
80% of requests make one DB call that uses indexes
40-50 KB of HTML per response
Cache hit in 60% of cases
Add some other parameters and let's calculate, or tell your story about your loads.
I would look at Cacti - it can give you plenty of stats to choose from.