One core works harder than the others - multicore

I have a Xeon E5-2650 v4 (2.2 GHz) with 24 logical cores.
I run my application, which uses all 24 cores, but surprisingly core 12 always gets more L3 cache than the others.
How can I find out why this happens, and how can I force that core to behave like the others?

Related

Maximizing S3 upload performance with AWS C++ SDK

I am using a c5.18xlarge instance with the ENA adapter enabled (so I expect to have 25 Gbps connectivity to S3, per AWS support). I am using the AWS C++ SDK (version 1.3.59) on RHEL 7 to upload a 70 GB file to a single S3 object using a 256 MB part size. Per AWS support, I've set the ClientConfiguration's maxConnections field to 999 and its executor field to use a PooledThreadExecutor with a pool size of 999 (and these have improved my performance). I am performing a series of S3Client::UploadPart() calls, threading these myself; I get very similar performance when using UploadPartCallable() and letting the SDK manage the threading.
Here's the performance I'm seeing:
- 36 threads: 7.5 Gbps
- 200 threads: 15.7 Gbps
AWS support reported similar behavior (actually they were using 900 threads).
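To put the figures above in per-thread terms, here is a rough back-of-the-envelope sketch. It only uses the numbers quoted in the question; the helper names are made up for illustration, and it assumes 1 GB = 1024 MB when counting parts.

```python
# Figures quoted in the question: thread count -> aggregate throughput (Gbps).
measurements = {36: 7.5, 200: 15.7}

FILE_GB = 70      # file size from the question
PART_MB = 256     # part size from the question
LINK_GBPS = 25    # expected ENA bandwidth from the question

# Assumption: 1 GB = 1024 MB, so the file splits into whole 256 MB parts.
parts = FILE_GB * 1024 // PART_MB


def per_thread_mbps(threads, gbps):
    """Average throughput per thread, in Mbps."""
    return gbps * 1000 / threads


def link_utilization(gbps, link_gbps=LINK_GBPS):
    """Fraction of the expected link bandwidth actually used."""
    return gbps / link_gbps


if __name__ == "__main__":
    print(f"parts to upload: {parts}")
    for threads, gbps in measurements.items():
        print(
            f"{threads:>3} threads: {per_thread_mbps(threads, gbps):6.1f} Mbps/thread, "
            f"{link_utilization(gbps):.0%} of the expected link"
        )
```

Going from 36 to 200 threads roughly doubles the aggregate throughput while the per-thread rate drops, so each individual connection appears to be the slow part rather than the CPU.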
I've looked through the underlying implementation of S3Client and all the low level thread management and curl handle management. I don't see anything obviously inefficient going on. It just doesn't make any sense to me that I would need 200 threads to achieve this performance on a machine that has 36 physical cores. Is this expected? Could someone provide an explanation for what's happening or a different way to configure the SDK to not require this many threads? I think I could provide my own HTTPClientFactory and customize things to cut out a mutex in how the curl handles are managed if I'm careful, but this seems unlikely to account for what I'm seeing.
Thanks for any help.
-Adam
"I am using the AWS C++ SDK (version 1.3.59) on RHEL 7 to upload a 70 GB file to a single S3 object using a 256 MB part size."
You're probably being limited by your disk/storage device's read throughput. It's actually impressive that you're able to reach 15.7 Gbps.
In my test, I see all threads created by Aws::Utils::Threading::PooledThreadExecutor running on one single CPU core (while the spot instance has 72 vCPUs). Have you seen the same behavior in your tests?
I further improved the performance by using my own threading model with the S3Client blocking APIs instead of PooledThreadExecutor with the S3 async methods (such as UploadPartAsync()).

Benchmarking: Why is Play (Scala) throughput-latency curve not coming flat?

I am doing performance benchmarking of my Play (Scala) web app. The application is hosted on a cloud server. I am using Play 2.5.x and Scala 2.11.11. I used Apache Bench to generate requests. An example 'ab' command:
ab -n 10 -c 10 -T 'application/json'
For my APIs I consistently get a linear curve for response time (ms) vs. number of concurrent requests. Here is one such data set:
Concurrent requests    50%    80%    90%
                 10    592    602    732
                 20   1002   1013   1014
                 50   2168   2222   2290
                100   4177   4179   4222
                200   8477   9459   9462
First column is the number of concurrent requests. Second, third and fourth columns are the "percentage of requests served within this time".
In the chart, the blue, red and orange bars represent the 50%, 80% and 90% values respectively (the percentage of requests served within that time). The CPU load goes above 50% only when the number of concurrent requests exceeds 100.
These results are on my standard Play+Scala app without any specific optimizations e.g. I am using standard Action => Result controllers for APIs. The results are quite disappointing to me given that the system is partially loaded (CPU load < 50% and hardly any memory usage). The server has 2 CPUs + 8GB Mem.
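As a rough sanity check on the table above, dividing the number of concurrent requests by the median response time gives an approximate throughput. This is Little's law with the median used as a stand-in for the mean, which is an assumption of this sketch; the function name is illustrative only.

```python
# Median (50%) response times in ms from the table, keyed by concurrency.
median_ms = {10: 592, 20: 1002, 50: 2168, 100: 4177, 200: 8477}


def throughput_rps(concurrency, latency_ms):
    """Approximate requests per second: concurrency / latency (Little's law)."""
    return concurrency / (latency_ms / 1000.0)


throughputs = {c: throughput_rps(c, ms) for c, ms in median_ms.items()}

if __name__ == "__main__":
    for c, rps in throughputs.items():
        print(f"{c:>4} concurrent requests: about {rps:5.1f} req/s")
```

If the approximation holds, throughput levels off at roughly 20 to 24 requests per second regardless of concurrency. That would mean the response time grows linearly because requests are queuing behind some saturated resource, even though CPU load stays below 50%.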
If you are interested in measuring real response latency, then use the wrk2 tool instead.
Here is a presentation by the wrk2 author about how to measure latency and throughput to compare the scalability of different systems or their configurations: https://www.infoq.com/presentations/latency-response-time
As an option, use Gatling: it implements its measurements properly and so avoids coordinated omission.
BTW, if possible, please share your sources and test scripts. In the history of the following repository you can find all of that for Play 2.5 too: https://github.com/plokhotnyuk/play
FYI: It is great to see that Java is still in the top 5, but Rust, Kotlin and Go are approaching quickly... and it is a pity that Scala frameworks are not built on the top Java ones... even Node.js showed better results than Netty and Undertow: https://www.techempower.com/benchmarks/#section=data-r15&hw=ph&test=json

Why no good speedup for Astropy parallel testing with pytest-xdist?

I'm running the Astropy tests in parallel using the python setup.py test --parallel N option on my MacBook (4 real cores, solid state disk), which uses pytest-xdist to run the ~8000 tests in parallel.
I tried different N in the 1 to 10 range, but in all cases I only get speedups of roughly 2, whereas I expected speedups in the 3 to 4 range (because running the tests should be CPU-limited).
Why are the speedups low and how can I get good speedups (using multiple cores on one computer)?
Update
I tried the ramdisk suggestion from @Iguananaut:
diskutil erasevolume HFS+ 'ramdisk' `hdiutil attach -nomount ram://8388608`
mkdir /Volumes/ramdisk/tmp
time python setup.py test -a '--basetemp=/Volumes/ramdisk/tmp' --parallel 8
The speedup is now ~ 2.2 compared to ~ 2.0 with the SSD.
Since I have four physical cores I expect something in the range 3 to 4.
Maybe the overhead for running the tests in parallel is very large for some reason.
I would suspect the SSD is the limiting factor there. Many of the tests are CPU bound, but just as many make heavy use of the disk: temp files and the like. Those could perhaps be made even slower by running in parallel. Beyond that it's hard to say much since it depends on the particulars of your environment. I get a significant speedup running the tests on six cores. Not quite 6x, but it does make a difference.
One thing you might try is making a ramdisk to set as your temp directory. You can do this in OSX with diskutil. You can Google how to do that if you're not sure. Then you should be able to run ./setup.py test -a '--basetemp=path/to/ramdisk'. I haven't actually tried that with the Astropy tests and am not sure how it will work. But if it does work it will at least help somewhat to rule out I/O as the bottleneck.
That said, I'm being intentionally wishy-washy as to how much it might help. Even using a ramdisk, your RAM's speed now becomes the bottleneck for I/O-bound tests. No matter how many CPUs you have, all the CPU-bound tests could finish instantly and the I/O-bound tests won't be made any faster, so you would still have to wait just as long (or almost as long) for them to finish. With multiprocessing there's also additional overhead in message passing between the processes. Exactly how this is performed depends on a lot of factors, but it's most likely through some shared memory. Anyone reading this also has no way of knowing what other processes are running on your machine that could be contending for those same resources. Even if your system monitor doesn't show anything making heavy use of the CPU, that doesn't mean there aren't processes doing other things that are adding to some bottleneck.
TL;DR I wouldn't make much of not getting a speedup directly proportional to the number of cores you throw at it, especially on something like a laptop.
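One way to quantify this is Amdahl's law, which relates the speedup S on n workers to the fraction p of the work that actually runs in parallel: S = 1 / ((1 - p) + p / n). The sketch below inverts that formula for the speedups reported in the question. Treating the whole test run as one Amdahl-style workload on 4 workers is a simplifying assumption, not something measured in this thread.

```python
def parallel_fraction(speedup, workers):
    """Invert Amdahl's law S = 1 / ((1 - p) + p / n) to solve for p."""
    return (1 - 1 / speedup) / (1 - 1 / workers)


def amdahl_speedup(p, workers):
    """Speedup predicted by Amdahl's law for parallel fraction p."""
    return 1 / ((1 - p) + p / workers)


def max_speedup(p):
    """Upper bound on speedup as the number of workers grows without limit."""
    return 1 / (1 - p)


if __name__ == "__main__":
    # Speedups from the question: ~2.0 with the SSD, ~2.2 with the ramdisk.
    for label, observed in (("SSD", 2.0), ("ramdisk", 2.2)):
        p = parallel_fraction(observed, workers=4)  # assumes 4 physical cores
        print(f"{label}: parallel fraction ~{p:.2f}, ceiling ~{max_speedup(p):.1f}x")
```

Under that model, a speedup of 2 on four cores corresponds to roughly two thirds of the run being parallel, so a speedup of about 3 would be the ceiling no matter how many cores were added.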

Best CPU for GWT compile for a new build server

When building our current project, GWT compilation takes up quite a large share of the overall time (currently ~25 min overall, 2/3 of it GWT compile). We researched how to optimize that (e.g. here);
however, in the end we decided to buy a new build server. GWT compilation is quite a CPU-intensive task, so we did some tests to analyze the improvement per core:
1 core = 197s
2 cores = 165s
3 cores = 149s
4 cores = 157s (can be that the last core was busy with other tasks)
Judging from those numbers, it seems that adding more cores doesn't necessarily improve performance, since the times flatten out.
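To make the flattening concrete, the timings above can be turned into speedup and parallel efficiency (speedup divided by core count). This is plain arithmetic on the four measurements from the question; the function names are illustrative only.

```python
# Measured GWT compile times in seconds from the question, keyed by core count.
timings_s = {1: 197, 2: 165, 3: 149, 4: 157}

baseline = timings_s[1]


def speedup(cores):
    """Speedup relative to the single-core run."""
    return baseline / timings_s[cores]


def efficiency(cores):
    """Fraction of ideal linear scaling actually achieved."""
    return speedup(cores) / cores


if __name__ == "__main__":
    for cores in timings_s:
        print(
            f"{cores} core(s): speedup {speedup(cores):.2f}x, "
            f"efficiency {efficiency(cores):.0%}"
        )
```

Even the best run (3 cores) is only about 1.3x faster than one core, which suggests a large part of the compile is not being parallelized on this setup.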
1.) I would be interested to know whether anyone can confirm or disprove that. So 8 or 12 cores don't necessarily make a difference, but the individual CPU clock speed (MHz) does?
2.) After seeing some benchmarks, our sales people tend toward buying an Intel Xeon. Any experience with AMD? (I am more of an AMD guy; however, currently it seems hard to disregard the benchmarks.)
3.) Any other suggestions regarding memory, IO etc. are welcome.
Update: When we get the new server I'll post the updated numbers...
We are using an AMD FX-8350 (@ 4.00 GHz) with a Samsung 830 Pro SSD, and we've set localWorkers=4 as well as -Xmx2048m. Previously we used an Intel Xeon E5-2609 (@ 2.40 GHz). The switch reduced compilation time from ~440s down to ~310s.
So we also found that raw CPU speed matters most for a single compilation process (with localWorkers=4). When multiple compilation processes run at the same time on this machine, an SSD reduces the IO wait time, which increases with the number of concurrent compilation processes.
Our current hardware supports up to 4 Maven builds at the same time (each one with localWorkers=4) and then uses up to 20 GB of RAM. As the number of concurrent builds increases, the build time increases. But the increase is not linear, so we try to reduce the idle time in periods where a single Maven process does not use all the resources (Java class compiling, tests, ...).
When we compared hardware prices, we decided to buy a consumer PC to use as a slave in our Jenkins build farm. The overall price is much lower than for server hardware, and the machine can easily be replaced with a new one in case of a hardware failure.

What is your experience with Sun CoolThreads technology?

My project has some money to spend before the end of the fiscal year and we are considering replacing a Sun-Fire-V490 server we've had for a few years. One option we are looking at is the CoolThreads technology. All I know is the Sun marketing, which may not be 100% unbiased. Has anyone actually played with one of these?
I suspect it will be no value to us, since we don't use threads or virtual machines much and we can't spend a lot of time retrofitting code. We do spawn a ton of processes, but I doubt CoolThreads will be of help there.
(And yes, the money would be better spent on bonuses or something, but that's not going to happen.)
IIRC the CoolThreads technology refers to the fact that, rather than just ramping up the clock speed ever higher to improve performance, they are now looking at multiple-core processors with hyperthreading, effectively giving you loads of processors on one chip. Overall the processing capacity available is higher, but without the additional electrical power and aircon requirements you would expect (hence cool). Its usefulness definitely depends on what you are planning to run on it. If you are running Apache with the multi-threaded core, it will love it, as it can run the individual response threads on the individual CPU cores. If you are simply running single-threaded processes, you will get some performance increase over a single-CPU box, but not as great (any old-fashioned non-mod_perl/mod_python CGID processes would still be sharing the CPU a bit). If your application consists of one single-threaded process running maxed out on the box, you will get very little improvement over a single-core CPU running at the same speed.
Peter
Edit:
Oh, and for a benchmark: we compared a T2000 in our server farm to our current V240s (they may have been V480s, I don't recall). The T2000 took the load of 12-13 of the older boxes in a live test without any OS tweaking for performance. As I said, Apache loves it :-)
Disclosure: I work for Sun (but as an engineer in client software).
You don't necessarily need multithreaded code to make use of these machines. Having multiple processes will make use of multiple hardware threads on multiple cores.
The old T1 processors (T1000 and T2000 boxes) did have only a single FPU, and weren't really suitable for tasks with much more than about 1% floating point. The newer T2 and T2+ processors have an FPU per core. That's probably still not great for massive floating point crunching, but is much more respectable.
(Note: Hyper-Threading Technology is a trademark of Intel. Sun uses the term Chip MultiThreading (CMT).)
We used Sun Fire T2000s for my last system. The boxes themselves far exceeded our capacity requirements in terms of processing power. For us the decision was based on the lower power consumption and space requirement. We successfully ran WebSphere 6, Oracle 10g and SunONE Directory Server on the same box.
My info may be a bit out of date (I last used these servers 2 years ago), but as I recall one big gotcha was that all the cores on a single CPU shared the same FPU, so if your code did a lot of floating point (we were doing GIS) the FPU was a massive bottleneck and you didn't get much benefit from the large number of threads.
For any process with high parallelism these machines (e.g. the T1000/T2000) are great for their cost. I've been running Oracle on them for about 18 months now and it works great.
If your task is single-threaded/single-process, then you'd be better off with a high-speed dual/quad-core Intel machine.
If your application has lots of threads/lots of processes, then these machines will likely be great for it.
Best of all, Sun will send you one for 60 days to evaluate. That is what we did before committing to it; we ended up getting two T2000s and have recently purchased another four T1000s.
It hit me last night that our core processes aren't multi-threaded, but the machine in question does have a bunch of system processes that are. In particular, it acts as an NFS server. It sounds like running hundreds of processes will benefit from all those cores, as well.
I'll see if we can get a demo unit to test on first.
Sun has been selling the Niagara machines as all things to all comers. They do have their place, web services being the best deployment. We have run Oracle on some T2000s and it worked well for highly parallelized operations. But the machines fall flat on single-threaded operations, where performance is rather bad. If you have floating point work to do, look elsewhere. Even the newer chips with an FPU per core are inadequate. Also, these machines cannot take an enterprise-class pounding for long, and we've had reliability problems. Multi-core technology is more hype than substance. Sandia National Labs researched it and found that four to eight cores is about the top end of usefulness and that a 16-core chip has the same throughput as a dual-core chip. So a 16-core chip is a waste of a lot of money. Also, as the number of cores increases, the clock speed must decrease because of the thermal wall. Most manufacturers will probably settle on quad-core chips until memory technology improves (you can't keep 16 cores fed with memory, so most of the cores are stalled). Finally, given the chaos at Sun, you'd do better to look elsewhere.