Best CPU for GWT compile for a new build server - gwt

When building our current project, the GWT compile takes a large share of the overall time (currently ~25 min overall, of which roughly 2/3 is GWT compilation). We researched how to optimize that (e.g. here),
however in the end we decided to buy a new build server. GWT compilation is a CPU-intensive task, so we ran some tests to analyze the improvement per core:
1 core = 197s
2 cores = 165s
3 cores = 149s
4 cores = 157s (the last core may have been busy with other tasks)
Judging from those numbers, it seems that adding more cores doesn't necessarily improve performance, since the times appear to flatten out.
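For anyone who wants to reproduce this kind of per-core measurement, here is a rough Java sketch (not the exact setup used above) that invokes the GWT compiler class com.google.gwt.dev.Compiler with different -localWorkers values and times each run. The module name and classpath entries are placeholders you would have to adapt:

import java.util.concurrent.TimeUnit;

/**
 * Sketch: run the GWT compiler with 1..4 local workers and print the
 * wall-clock time for each run. Classpath and module name are placeholders.
 */
public class GwtCompileBenchmark {
    public static void main(String[] args) throws Exception {
        for (int workers = 1; workers <= 4; workers++) {
            long start = System.nanoTime();
            Process p = new ProcessBuilder(
                    "java", "-Xmx2048m",
                    "-cp", "src:gwt-dev.jar:gwt-user.jar",   // adjust to your project
                    "com.google.gwt.dev.Compiler",
                    "-localWorkers", String.valueOf(workers),
                    "com.example.MyModule")                   // placeholder module name
                    .inheritIO()
                    .start();
            p.waitFor();
            long seconds = TimeUnit.NANOSECONDS.toSeconds(System.nanoTime() - start);
            System.out.println(workers + " worker(s): " + seconds + "s");
        }
    }
}

Since -localWorkers mainly controls how many permutations are compiled in parallel, the curve tends to flatten once the permutation count, memory, or disk becomes the limit, which matches the numbers above.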
1.)
So now I would be interested whether anyone can confirm or disprove that: 8 or 12 cores don't necessarily make a difference, but the individual CPU speed (MHz) does?
2.)
After seeing some benchmarks, our sales department tends towards buying Intel Xeon - any experience with AMD? (I am more of an AMD guy, but currently it seems hard to disregard the benchmarks.)
3.) Any other suggestions regarding memory, IO, etc. are welcome.
Update: When we get the new server I'll post the updated numbers...

We are using an AMD FX-8350 (@4.00 GHz) with a Samsung 830 Pro SSD, and we've set localWorkers=4 as well as -Xmx2048m. Previously we used an Intel Xeon E5-2609 (@2.40 GHz). The switch reduced compilation time from ~440s down to ~310s.
So we also found that raw CPU speed matters most for a single compilation process (with localWorkers=4). When multiple compilation processes run at the same time on this machine, an SSD reduces the IO wait time, which grows with the number of concurrent compilation processes.
Our current hardware supports up to 4 Maven builds at the same time (each one with localWorkers=4) and then uses up to 20 GB of RAM. The build time increases with the number of concurrent builds, but not linearly, so we try to reduce idle time during phases where a single Maven process does not use all resources (Java class compilation, tests, ...).
When we compared hardware prices, we decided to buy a consumer PC to use as a slave in our Jenkins build farm. It is much cheaper than server hardware and can easily be replaced with a new one in case of a hardware failure.

Related

Optimization experiment technical error in AnyLogic

I'm trying to minimize the waiting time of people in the queue for a truck. I provided 25 trucks and only 1 is being used, so I set up an optimization experiment with the objective of minimizing the waiting time in the queue, with a requirement of 95% truck utilization so that more than one truck at a time could deliver people. When I run the optimization experiment it gives me this error: "OpenJDK 64-Bit Server VM warning: There is insufficient memory for the Java Runtime Environment to continue", even though I set the maximum available memory to 16343. How can I solve this issue so the experiment gives me the best number of trucks?
Thanks
"Insufficient memory for the Java Runtime Environment" simply means that running the model requires more memory than is available.
More often than not, the issue in AnyLogic is that there are too many agents, especially if you run a parameter variation or optimization experiment that executes multiple runs in parallel, which increases the memory requirement.
One option is to convert as many agents as possible to plain Java classes. Start by converting all agents that are not used in Flow Blocks and where you don't need any animation.
Check this blog post for an example and more information
https://www.theanylogicmodeler.com/post/why-use-java-classes-in-anylogic
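As a plain-Java illustration of that idea (the class and field names here are made up for the example, not AnyLogic API): an agent that only carries data can often be replaced by a lightweight class like this, which avoids the per-agent animation and engine overhead:

/**
 * A plain Java class standing in for an "agent" that only holds data.
 * Creating millions of these is far cheaper than creating millions of
 * full AnyLogic agents, because there is no animation or engine state.
 * Field names are illustrative only.
 */
public class Passenger {
    final double arrivalTime;   // when the person joined the queue
    double boardingTime;        // set when a truck picks them up

    Passenger(double arrivalTime) {
        this.arrivalTime = arrivalTime;
    }

    double waitingTime() {
        return boardingTime - arrivalTime;
    }
}

Large agent populations (multiplied by parallel experiment runs) are usually what push an experiment over the memory limit, so shrinking each element like this can make a big difference.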

Why no good speedup for Astropy parallel testing with pytest-xdist?

I'm running the Astropy tests in parallel using the python setup.py test --parallel N option on my MacBook (4 real cores, solid-state disk), which uses pytest-xdist to run the ~8000 tests in parallel.
I tried different N in the 1 to 10 range, but in all cases I can only get speed-ups of roughly 2, but I expected to get speedups in the 3 to 4 range (because running the tests should be CPU-limited).
Why are the speedups low and how can I get good speedups (using multiple cores on one computer)?
Update
I tried the ramdisk suggestion from @Iguananaut:
diskutil erasevolume HFS+ 'ramdisk' $(hdiutil attach -nomount ram://8388608)
mkdir /Volumes/ramdisk/tmp
time python setup.py test -a '--basetemp=/Volumes/ramdisk/tmp' --parallel 8
The speedup is now ~ 2.2 compared to ~ 2.0 with the SSD.
Since I have four physical cores I expect something in the range 3 to 4.
Maybe the overhead for running the tests in parallel is very large for some reason.
I would suspect the SSD is the limiting factor there. Many of the tests are CPU bound, but just as many make heavy use of the disk--temp files and the like. Those could perhaps be made even slower by running in parallel. Beyond that it's hard to say much since it depends on the particulars of your environment. I get significant speedup running the tests on six cores. Not quite 6x but it does make a difference.
One thing you might try is making a ramdisk to set as your temp directory. You can do this in OSX with diskutil. You can Google how to do that if you're not sure. Then you should be able to run ./setup.py test -A '--basetemp=path/to/ramdisk'. I haven't actually tried that with the Astropy tests and am not sure how it will work. But if it does work it will at least help somewhat rule out I/O as the bottleneck.
That said I'm being intentionally wishy-washy as to how much it might help. Even using a ramdisk--now your RAM's speed is becoming the bottleneck for I/O bound tests. No matter how many CPUs you have all the CPU-bound tests could finish instantly and the I/O-bound tests won't be made any faster, so you would still have to wait just as long (or almost as long for them to finish). With multiprocessing there's also additional overhead in message passing between the processes--exactly how this is being performed depends on a lot of factors but it's most likely through some shared memory. Anyone reading this also has no way of knowing what other processes are running on your machine that could be contending for those same resources. Even if your system monitor doesn't show anything making heavy use of the CPU, that doesn't mean there aren't processes doing other things that are adding to some bottleneck.
TL;DR I wouldn't make much of not getting a speedup directly proportional to the number of cores you throw at it, especially on something like a laptop.

NUMA awareness of JVM

My question concerns the extent to which a JVM application can exploit the NUMA layout of a host.
I have an Akka application in which actors concurrently process requests by combining incoming data with 'common' data already loaded into an immutable (Scala) object. The application scales well in the cloud, using many dual-core VMs, but performs poorly on a single 64-core machine. I presume this is because the common data object resides in one NUMA cell, and the many threads concurrently accessing it from other cells are too much for the interconnects.
If I run 64 separate JVM applications, each containing 1 actor, then performance is good again. A more moderate approach might be to run as many JVM applications as there are NUMA cells (8 in my case), giving the host OS a chance to keep the threads and memory together?
But is there a smarter way to achieve the same effect within a single JVM? E.g. if I replaced my common data object with several instances of a case class, would the JVM have the capability to place them on the optimal NUMA cell?
Update:
I'm using Oracle JDK 1.7.0_05, and Akka 2.1.4
I've now tried the UseNUMA and UseParallelGC JVM options. Neither seemed to have any significant impact on the slow performance when using one or a few JVMs. I've also tried using a PinnedDispatcher and the thread-pool-executor, with no effect. I'm not sure the configuration is having any effect at all, though, since there seems to be nothing different in the startup logs.
The biggest improvement remains when I use a single JVM per worker (~50). However, the problem with this appears to be that there is a long delay (up to a couple of minutes) before the FailureDetector registers the successful exchange of 'first heartbeat' between Akka cluster JVMs. I suspect there is some other issue here that I've not yet uncovered. I already had to increase ulimit -u since I was hitting the default maximum number of processes (1024).
Just to clarify, I'm not trying to achieve large numbers of messages, just trying to have lots of separate actors concurrently access an immutable object.
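As a plain-Java sketch (outside Akka) of the 'several instances' idea from the question: give each worker thread its own copy of the common data, so the copy is allocated, and therefore usually placed, on that thread's NUMA node when running with -XX:+UseNUMA and the parallel GC. CommonData and loadCommonData() are placeholders for the real reference data:

import java.util.concurrent.*;

/**
 * Sketch: per-thread copies of read-only reference data.
 * With -XX:+UseNUMA and -XX:+UseParallelGC, memory allocated by a thread
 * tends to be placed on that thread's NUMA node, so each worker reads
 * mostly local memory instead of hammering one remote node.
 */
public class NumaLocalCopies {
    static class CommonData {
        final double[] table;
        CommonData(double[] table) { this.table = table; }
    }

    static CommonData loadCommonData() {
        return new CommonData(new double[10_000_000]); // stand-in for real data
    }

    // Each thread builds (and therefore allocates) its own copy lazily.
    static final ThreadLocal<CommonData> LOCAL_COPY =
            ThreadLocal.withInitial(NumaLocalCopies::loadCommonData);

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
        for (int i = 0; i < 1000; i++) {
            pool.submit(() -> {
                CommonData data = LOCAL_COPY.get(); // local-node copy after first touch
                // ... combine an incoming request with 'data' here ...
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
    }
}

The obvious trade-off is N times the memory for the common data, which is also why running one JVM per NUMA cell achieves much the same effect.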
I think that if you are sure the problem is not in your message-processing algorithms, then you should take into account not only the NUMA option but the whole environment configuration, starting with the JVM version (the latest is better, and Oracle JDK also usually performs better than OpenJDK), then JVM options (including GC, memory and concurrency options, etc.), then Scala and Akka versions (the latest release candidates and milestones can be much better), and also the Akka configuration.
From here you can borrow all the settings that matter to get 50M messages per second of total throughput for Akka actors on contemporary laptops.
I never had the chance to run these benchmarks on a 64-core server, so any feedback would be greatly appreciated.
From my findings, which may help: current implementations of ForkJoinPool increase message-send latency as the number of threads in the pool grows. This is most noticeable when the rate of request-response calls between actors is high; e.g. on my laptop, increasing the pool size from 4 to 64 raises the message-send latency of Akka actors for such cases by up to 2-3x for most executor services (Scala's ForkJoinPool, JDK's ForkJoinPool, ThreadPoolExecutor).
You can check whether there are any differences by running mvnAll.sh with the benchmark.parallelism system property set to different values.

What are the differences between multi-CPU, multi-core and hyper-thread?

Could anyone explain to me the differences between multi-CPU, multi-core, and hyper-thread? I am always confused about these differences, and about the pros/cons of each architecture in different scenarios.
Here is my current understanding after learning online and learning from others' comments.
I think hyper-threading is the most limited technology among them, but also the cheapest. Its main idea is to duplicate registers so as to reduce context-switch time.
A multi-processor system is better than hyper-threading, but since the CPUs sit on different chips, communication between them has higher latency than between cores on one chip, and with multiple chips there is more expense and more power consumption than with multi-core.
Multi-core integrates all the CPUs on a single chip, so the latency of communication between them is greatly reduced compared with a multi-processor system. Since it uses a single chip to contain all the CPUs, it consumes less power and is less expensive than a multi-processor system.
Is this correct?
Multi-CPU was the first version: you'd have one or more mainboards with one or more CPU chips on them. The main problem here was that the CPUs had to expose some of their internal data to the other CPUs so they wouldn't get in each other's way.
The next step was hyper-threading. One chip on the mainboard but it had some parts twice internally so it could execute two instructions at the same time.
The current development is multi-core. It's basically the original idea (several complete CPUs) but in a single chip. The advantage: Chip designers can easily put the additional wires for the sync signals into the chip (instead of having to route them out on a pin, then over the crowded mainboard and up into a second chip).
Super computers today are multi-cpu, multi-core: They have lots of mainboards with usually 2-4 CPUs on them, each CPU is multi-core and each has its own RAM.
[EDIT] You got that pretty much right. Just a few minor points:
Hyper-threading keeps track of two contexts at once in a single core, exposing more parallelism to the out-of-order CPU core. This keeps the execution units fed with work, even when one thread is stalled on a cache miss, branch mispredict, or waiting for results from high-latency instructions. It's a way to get more total throughput without replicating much hardware, but if anything it slows down each thread individually. See this Q&A for more details, and an explanation of what was wrong with the previous wording of this paragraph.
The main problem with multi-CPU is that code running on them will eventually access the RAM. There are N CPUs but only one bus to access the RAM. So you must have some hardware which makes sure that a) each CPU gets a fair amount of RAM access, b) accesses to the same part of the RAM don't cause problems, and c) most importantly, that CPU 2 will be notified when CPU 1 writes to some memory address which CPU 2 has in its internal cache. If that doesn't happen, CPU 2 will happily use the cached value, oblivious to the fact that it is outdated.
Just imagine you have tasks in a list and you want to spread them to all available CPUs. So CPU 1 will fetch the first element from the list and update the pointers. CPU 2 will do the same. For efficiency reasons, both CPUs will not only copy the few bytes into the cache but a whole "cache line" (whatever that may be). The assumption is that, when you read byte X, you'll soon read X+1, too.
Now both CPUs have a copy of the memory in their cache. CPU 1 will then fetch the next item from the list. Without cache sync, it won't have noticed that CPU 2 has changed the list, too, and it will start to work on the same item as CPU 2.
This is what effectively makes multi-CPU so complicated. Side effects of this can lead to a performance which is worse than what you'd get if the whole code ran only on a single CPU. The solution was multi-core: You can easily add as many wires as you need to synchronize the caches; you could even copy data from one cache to another (updating parts of a cache line without having to flush and reload it), etc. Or the cache logic could make sure that all CPUs get the same cache line when they access the same part of real RAM, simply blocking CPU 2 for a few nanoseconds until CPU 1 has made its changes.
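To make the cache-line point concrete, here is a small Java sketch: two threads that write to counters on the same cache line keep invalidating each other's copy, while padded counters don't. The 64-byte line size and the exact field layout the JVM chooses are assumptions, so treat the timings as indicative only:

/**
 * Rough demo of cache-line contention ("false sharing"): each thread
 * increments its own counter, but in SharedLine both counters most
 * likely sit on the same 64-byte cache line, while in PaddedLine they
 * are pushed onto different lines by dummy padding fields.
 */
public class FalseSharingDemo {
    static class SharedLine {
        volatile long a;
        volatile long b;
    }

    static class PaddedLine {
        volatile long a;
        long p1, p2, p3, p4, p5, p6, p7; // ~56 bytes of padding (assumed layout)
        volatile long b;
    }

    static long timeMillis(Runnable w0, Runnable w1) throws InterruptedException {
        Thread t0 = new Thread(w0), t1 = new Thread(w1);
        long start = System.nanoTime();
        t0.start(); t1.start();
        t0.join(); t1.join();
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) throws InterruptedException {
        final long N = 50_000_000L;

        SharedLine s = new SharedLine();
        long shared = timeMillis(
                () -> { for (long i = 0; i < N; i++) s.a++; },
                () -> { for (long i = 0; i < N; i++) s.b++; });

        PaddedLine p = new PaddedLine();
        long padded = timeMillis(
                () -> { for (long i = 0; i < N; i++) p.a++; },
                () -> { for (long i = 0; i < N; i++) p.b++; });

        System.out.println("same cache line:      " + shared + " ms");
        System.out.println("separate cache lines: " + padded + " ms");
    }
}

The work is identical in both cases; only the memory layout changes, which is exactly the kind of coherence traffic the paragraph above describes.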
[EDIT2] The main reason why multi-core is simpler than multi-cpu is that on a mainboard, you simply can't run all wires between the two chips which you'd need to make sync effective. Plus a signal only travels 30cm/ns tops (speed of light; in a wire, you usually have much less). And don't forget that, on a multi-layer mainboard, signals start to influence each other (crosstalk). We like to think that 0 is 0V and 1 is 5V but in reality, "0" is something between -0.5V (overdrive when dropping a line from 1->0) and .5V and "1" is anything above 0.8V.
If you have everything inside of a single chip, signals run much faster and you can have as many as you like (well, almost :). Also, signal crosstalk is much easier to control.
You can find some interesting articles about dual CPU, multi-core and hyper-threading on Intel's website or in a short article from Yale University.
I hope you find here all the information you need.
In a nutshell: multi-CPU or multi-processor system has several processors. A multi-core system is a multi-processor system with several processors on the same die. In hyperthreading, multiple threads can run on the same processor (that is the context-switch time between these multiple threads is very small).
Multi-processor systems have been around for 30 years now, but mostly in labs. Multi-core is the new popular form of multi-processing. Server processors nowadays implement hyperthreading alongside multiple processors.
The wikipedia articles on these topics are quite illustrative.
Hyperthreading is a cheaper and slower alternative to having multiple cores
The Intel Manual Volume 3 System Programming Guide - 325384-056US September 2015, section 8.7 "INTEL HYPER-THREADING TECHNOLOGY ARCHITECTURE", describes HT briefly, including an architecture diagram.
TODO: by what percentage is it slower, on average, in real applications?
Hyperthreading is possible because modern single CPU cores already execute multiple instructions at once via the instruction pipeline: https://en.wikipedia.org/wiki/Instruction_pipelining
The instruction pipeline is a separation of functions inside of a single core to ensure that each part of the circuit is used at any given time: reading memory, decoding instructions, executing instructions, etc.
Hyperthreading separates functions further by using:
a single backend, which actually runs the instructions with its pipeline.
Dual core has two backends, which explains the greater cost and performance.
two front-ends, which take two streams of instructions and order them in a way to maximize pipelining usage of the single backend by avoiding hazards.
Dual core would also have 2 front-ends, one for each backend.
There are edge cases where instruction reordering produces no benefit, making hyperthreading useless. But it produces a significant improvement on average.
Two hyperthreads in a single core share more cache levels (TODO: how many? L1?) than two different cores, which share only L3; see:
Multiple threads and CPU cache
How are cache memories shared in multicore Intel CPUs?
The interface that each hyperthread exposes to the operating system is similar to that of an actual core, and both can be controlled separately. Thus cat /proc/cpuinfo shows me 4 processors, even though I only have 2 cores with 2 hyperthreads each.
Operating systems can however take advantage of knowing which hyperthreads are on the same core to run multiple threads of a given program on a single core, which might improve cache usage.
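The same thing is visible from within the JVM: Runtime.getRuntime().availableProcessors() reports the logical processors the OS exposes (cores times hardware threads), not physical cores, so on a 2-core machine with hyperthreading it typically prints 4. A tiny sketch:

/**
 * Prints the number of logical processors the OS exposes to the JVM.
 * On a machine with 2 cores and 2 hyperthreads per core this is
 * typically 4 - the same count that /proc/cpuinfo shows.
 */
public class LogicalCpus {
    public static void main(String[] args) {
        System.out.println("Logical processors: "
                + Runtime.getRuntime().availableProcessors());
    }
}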
This LinusTechTips video contains a light-hearted non-technical explanation: https://www.youtube.com/watch?v=wnS50lJicXc
Multi-CPU is a bit like multicore, but communication can only happen through RAM, not L3 cache
This means that if possible, you want to partition tasks that use the same memory a lot for each separate CPU.
E.g. the following SBI-7228R-T2X blade server contains 4 CPUs, 2 on each node:
(Image: the SBI-7228R-T2X server, from the linked source.)
We can see that there seem to be 4 sockets for the CPUs, each covered by a heat sink, with one open.
I think that across the nodes they don't even share RAM and must communicate through some kind of networking, thus representing one further step up the hyperthread/multi-core/multi-CPU hierarchy. TODO confirm:
https://scicomp.stackexchange.com/questions/7530/difference-between-nodes-and-cpus-when-running-software-on-a-cluster
SLURM nodes, tasks, cores, and cpus
https://www.quora.com/In-High-Performance-Computing-what-exactly-is-the-difference-between-the-terms-%E2%80%9Ccores-%E2%80%9D-%E2%80%9Cprocessors-%E2%80%9D-%E2%80%9Cnodes-%E2%80%9D-and-%E2%80%9Cclusters%E2%80%9D

What is your experience with Sun CoolThreads technology?

My project has some money to spend before the end of the fiscal year and we are considering replacing a Sun-Fire-V490 server we've had for a few years. One option we are looking at is the CoolThreads technology. All I know is the Sun marketing, which may not be 100% unbiased. Has anyone actually played with one of these?
I suspect it will be of no value to us, since we don't use threads or virtual machines much and we can't spend a lot of time retrofitting code. We do spawn a ton of processes, but I doubt CoolThreads will help there.
(And yes, the money would be better spent on bonuses or something, but that's not going to happen.)
IIRC, the CoolThreads technology refers to the fact that rather than ramping up the clock speed ever higher to improve performance, they are now using multi-core processors with hardware threading, effectively giving you lots of processors on one chip. Overall the processing capacity available is higher, but without the additional electrical power and air-conditioning requirements you would otherwise expect (hence "cool").
Its usefulness definitely depends on what you are planning to run on it. If you are running Apache with a multi-threaded worker model, it will love it, as the individual response threads can run on the individual CPU cores. If you are simply running single-threaded processes, you will still get some performance increase over a single-CPU box, but not as much (any old-fashioned non-mod_perl/mod_python CGI processes would still be sharing the CPU a bit). If your application consists of one single-threaded process running maxed out on the box, you will see very little improvement over a single-core CPU running at the same speed.
Peter
Edit:
Oh, and for a benchmark: we compared a T2000 in our server farm to our current V240s (may have been V480s, I don't recall). The T2000 took the load of 12-13 of the older boxes in a live test without any OS tweaking for performance. As I said, Apache loves it :-)
Disclosure: I work for Sun (but as an engineer in client software).
You don't necessarily need multithreaded code to make use of these machines. Having multiple processes will make use of the multiple hardware threads across multiple cores.
The old T1 processors (T1000 and T2000 boxes) did have only a single FPU, and weren't really suitable for tasks with much more than about 1% floating point. The newer T2 and T2+ processors have an FPU per core. That's probably still not great for massive floating point crunching, but is much more respectable.
(Note: Hyper-Threading Technology is a trademark of Intel. Sun uses the term Chip MultiThreading (CMT).)
We used Sun Fire T2000s for my last system. The boxes far exceeded our capacity requirements in terms of processing power. For us the decision was based on the lower power consumption and space requirements. We successfully ran WebSphere 6, Oracle 10g and SunONE Directory Server on the same box.
My info may be a bit out of date (I last used these servers 2 years ago), but as I recall one big gotcha was that all the cores on a single CPU shared the same FPU, so if your code did a lot of floating point (we were doing GIS), the FPU was a massive bottleneck and you didn't get much benefit from the large number of threads.
For any process with high parallelism these machines (eg, the t1000/t2000) are great for their cost. I've been running oracle on them for about 18 months now and it works great.
If your task is single-threaded/single-process, then you'd be better off with a high-speed dual- or quad-core Intel machine.
If your application has lots of threads/lots of processes then these machines will likely be great for it.
Best of all, Sun will send you one for 60 days to evaluate; that is what we did before committing to it. We ended up getting 2 T2000s and have recently purchased another 4 T1000s.
It hit me last night that our core processes aren't multi-threaded, but the machine in question does have a bunch of system processes that are. In particular, it acts as an NFS server. It sounds like running hundreds of processes will benefit from all those cores, as well.
I'll see if we can get a demo unit to test on first.
Sun has been selling the Niagara machines as all things to all comers. They do have their place: web serving is the best deployment. We have run Oracle on some T2000s and it worked well for highly parallelized operations, but the machines fall flat on single-threaded operations, where performance is rather bad. If you have floating-point work to do, look elsewhere; even the newer chips with an FPU per core are inadequate. Also, these machines cannot take an enterprise-class pounding for long, and we've had reliability problems.
Multi-core technology is more hype than substance. Sandia National Labs researched it and found that four to eight cores is about the top end of usefulness, and that a 16-core chip has the same throughput as a dual-core chip. So a 16-core chip is a waste of a lot of money. Also, as the number of cores increases, the clock speed must decrease because of the thermal wall. Most manufacturers will probably settle on quad-core chips until memory technology improves (you can't keep 16 cores fed from memory, and most of the cores end up stalled).
Finally, given the chaos at Sun, you'd do better to look elsewhere.