AnyLogic Computer Processor Advice needed - Single-core speed vs. number of cores?

I model on an ancient PC and recently got some lab funds for a new modeling computer. The choice of processor confounds me. For optimal AnyLogic simulation modeling, should I focus on maxing out the single-core speed or the number of processor cores? Also, would a high-end graphics card help? I have heard from my engineering colleagues that graphics cards do help with the workload in certain modeling tools. Any advice helps. Thanks.

This is what AnyLogic answered when I asked for the perfect computer to buy:
The recommended platform for AnyLogic is a powerful PC/laptop running a 64-bit operating system (Windows preferred), plus a CPU with multiple cores (e.g. an i7) and at least 8 GB of RAM.
In general, a faster CPU (3 GHz or more is recommended) means faster execution of a single run. More cores mean faster execution of experiments that run the model multiple times in parallel (optimization, parameter variation, Monte Carlo, etc.). Pedestrians and transporters also benefit from many cores, even in a single run, since the algorithm that moves pedestrians and transporters uses all available cores.
For the time being, AnyLogic doesn't support GPU processing. RAM is crucial when you have a lot of agents and many parallel runs (e.g. if a single run takes 1 GB, then 8 parallel runs will take 8 GB). When working with GIS maps, a good Internet connection may be needed, for example if the model requests a lot of routes from an online route provider.
On average, a mid-range PC/laptop is sufficient for most models; a high-end PC or a server/instance will be useful for really heavy models.

Just to add to Felipe's reply: a graphics card is completely irrelevant; AnyLogic does not support offloading computations to its tensor cores.
Focus on a decent processor speed and 8-12 cores, as well as at least 16 GB of RAM and (crucial!) an SSD. Good to go :)
Oh, and you may want to use Windows. Linux and macOS seem to have more problems/bugs in AnyLogic than Windows does.

Related

Why is MATLAB so slow on my Windows server?

I have Matlab R2017a installed on a server running MS Windows Server 2008 R2 Enterprise v 6.1 (SP1) and the benchmark results are awful:
bench
        LU       FFT       ODE    Sparse       2-D       3-D
    3.6424    0.5267    0.2114    5.0303    1.5557    3.4980
Note that it is particularly slow for LU and Sparse.
The server has this hardware:
CPU: Intel Xeon E7320 @ 2.13 GHz (4 physical processors, 16 logical)
128 GB RAM
64-bit operating system
Matlab Version: 9.2.0.556344 (R2017a)
Java version: Java 1.7.0_60-b19 with Oracle Corporation Java HotSpot(TM) 64-Bit Server VM mixed mode.
There are also other users who can be online on the server, but I can see that they are not stressing the system, and I have verified that these running times are stable (I have tested multiple times over the past week).
My question is this: is there any other library or something that Matlab relies on that could be "wrong"? I have another similar setup on a similar but slightly newer server that gets bench results much closer to what I'd expect based on the specs. I'm wondering if it's using a "wrong" linear algebra module or something.
Alternative explanation: I know that Matlab ran extremely slowly on a particular AMD Opteron CPU (I happen to have also worked on such a server in Matlab; see https://se.mathworks.com/matlabcentral/answers/33939-poor-matlab-performance-on-amd-based-computer). Is it possible that it's a similar issue with the Intel Xeon E7320?
Edit: Xeon E7320 as suggested by Peter.
Update: I'm not sure whether Matlab's bench takes advantage of just a single CPU core, multiple CPU cores, or also a GPU (OpenCL / CUDA). If it can use GPU acceleration, that would make a huge difference. (Especially if you don't have one at all in your "slow" server).
As discussed in comments, a dual-core Sandybridge laptop is 10x faster on some of the benchmarks, but only 2 or 1.5x faster on some other components. (But I'm not sure if the version of Matlab was controlled for; the thread you linked mentioned that different versions of Matlab do a different amount of work in their bench.)
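One way to check both of those things from a MATLAB session is something like this (a rough sketch; version can report which BLAS/LAPACK builds MATLAB linked against, and maxNumCompThreads limits its built-in multithreading):
version('-blas')    % which BLAS build this MATLAB is using
version('-lapack')  % which LAPACK build
% Rough check of core scaling: run bench once restricted to one thread, then with all threads.
lastN = maxNumCompThreads(1);   % limit built-in multithreading, remember the previous setting
tSingle = bench(1);             % one repetition; returns a row with the six timings
maxNumCompThreads(lastN);       % restore
tAll = bench(1);
disp([tSingle; tAll])           % compare single-threaded vs. default timings
If the two rows are nearly identical, bench (or at least the components you care about) is effectively single-threaded on that machine.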
The rest of this answer was written with the assumption that your test takes advantage of all your CPU cores (otherwise there's no point using an old many-core machine). But without considering GPU.
I think your CPU is actually a 65nm Core2-based Xeon E7320, not "E3720" (no google hits). What are you comparing against? Your Tigerton CPUs are ancient (about 10 years old), of course they're slow. (Tigerton is the same microarchitecture as Conroe/Merom, first-gen Core2).
You have very low memory bandwidth and cache speeds compared to a modern CPU, as well as only having SSSE3, not AVX or FMA. These CPUs don't have a memory controller built-in, so all 4 sockets are sharing the memory controller hub (MCH) via separate 1066MHz Front-Side Buses. Memory bandwidth doesn't scale with number of sockets, and is not great. Memory bandwidth has grown faster than per-core performance over the years. According to that link, a quad-socket 16-core Tigerton (like you have) is barely better than a quad-socket 8-core Barcelona Opteron. It's not so bad for CPU-bound workloads, but memory-bound workloads will do quite badly.
As well as the low clock speed, it's significantly slower clock-for-clock than a modern CPU. IDK what those times are supposed to be like (I'm here for the [performance] tag, not Matlab), but it's totally plausible that a 3GHz quad-core i5 or i7 Haswell / Skylake desktop or high-power laptop would be faster than your 16-core dinosaur machine.
(Actually, does that benchmark even scale with the number of cores? If not, the single-threaded memory bandwidth is probably really not good.)
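A crude in-MATLAB sanity check of memory bandwidth could look like this (my own rough sketch, not a proper benchmark; the array size and the GB/s arithmetic are arbitrary choices, and element-wise operations may themselves be multithreaded):
A = ones(2^27, 1);          % 2^27 doubles = ~1 GiB, forced to be physically allocated
tic; B = A + 1; t = toc;    % reads ~1 GiB and writes ~1 GiB
fprintf('effective bandwidth: ~%.1f GB/s\n', 2 * numel(A) * 8 / t / 1e9)
clear A B
Comparing that number against the same snippet on a modern desktop would show the bandwidth gap directly.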
A very big jump in performance happened with Sandybridge (for all code, including non-SIMD workloads), and there were several other smaller jumps in between your machine and modern CPUs as well. SnB can run 2 load instructions per clock, vs. 1 for previous Intel (like your Core2).
For FP-specific stuff that optimized libraries will take advantage of, x86 ISA extensions have been important: AVX doubles the SIMD vector width, doubling FLOPS (on Intel CPUs with full-width execution units). FMA does a mul+add in one instruction, potentially doubling FLOPS. Microarchitectural improvements are important, too: Haswell has two FMA units, vs. earlier CPUs having one FP adder and one FP multiplier, again potentially doubling FLOPS. Only workloads with contiguous memory access and a high ratio of computation to memory traffic will fully take advantage of this, e.g. a dense matmul, but in that case one Haswell core is doing as much work as 8 Tigerton cores.
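As a rough back-of-the-envelope comparison of per-core peak double-precision throughput (my numbers, assuming 128-bit SSE on Core2/Tigerton and 256-bit AVX+FMA on Haswell; real code rarely sustains peak):

$$\text{Tigerton: } 4\ \tfrac{\text{FLOP}}{\text{cycle}} \times 2.13\ \text{GHz} \approx 8.5\ \text{GFLOP/s} \qquad \text{Haswell: } 16\ \tfrac{\text{FLOP}}{\text{cycle}} \times 3.4\ \text{GHz} \approx 54\ \text{GFLOP/s}$$

That is roughly a 6-7x per-core gap at theoretical peak, in the same ballpark as the figure above once real-world efficiency is factored in.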
I assume Matlab can take advantage of AVX + FMA if the CPU has it.
And BTW, it's not just 16 "logical" processors. You don't have hyperthreading, so you have a 4-socket system with four quad-core CPUs, for 16 physical cores. (And these "quad core" chips are actually two separate dual-core dies in the same package, according to Wikipedia.)
So as far as the number of physical chips that need to communicate with each other, there are 8 (two in each package). That's a lot of hops to reach other CPUs, so synchronization between cores is more expensive than for a single-die quad-core. (And probably worse even than a modern dual-socket Xeon box with a pair of 18-core CPUs or something).
Note that high latency to memory can also hurt memory bandwidth: see the "latency bound platforms" part of this answer about optimizing memcpy/memset and how store bandwidth works in Intel CPUs.

What is the maximum memory per worker that MATLAB can address?

Short version: Is there a maximum amount of RAM / worker, that MATLAB can address?
Long version: My wife uses MATLAB's parallel processing capabilities in data-heavy spatial analyses (I don't really understand it, I just know how to build computers that make her work quicker) and I would like to build a new computer so she can radically reduce her process times.
I am leaning toward something in the 10-16 core range, since prices on such processors seem to be dropping with each generation, and I would like to use 128 GB of RAM, because "why not" if you can stomach the cost and see some meaningful time savings.
The number of cores I shoot for will depend on the maximum amount of RAM that MATLAB can address for each worker (if such a limit exists). The computer I built for similar work in 2013 has 4 physical cores (Intel i7-3770K) and 32 GB RAM (which she maxed out), and whatever I build next, I would like to have at least the same memory per core. With 128 GB of RAM as a given, 10 cores would be 12.8 GB/core, 12 cores would be ~10.7 GB/core, and 16 cores would be 8 GB/core. I am inclined to maximize cores rather than memory, but since she doesn't know what will benefit her processes the most, I would like to know how realistic those three options are. Anticipating the obvious follow-up question: she has an nVidia GPU capable of parallel processing, but she believes her processes would not benefit from its CUDA cores.
Thank you for your insights. Many, many Google searches did not yield an answer.
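(Not an answer to the hard-limit question, but for what it's worth: the number of local workers, and therefore how much of the 128 GB each worker can use, is something you choose yourself. A minimal Parallel Computing Toolbox sketch, with the worker count of 10 purely as an example:)
c = parcluster('local');
c.NumWorkers = 10;          % e.g. 10 workers -> ~12.8 GB each out of 128 GB, minus OS overhead
pool = parpool(c, 10);
% ... run the parfor / spmd workload here ...
delete(pool)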

MATLAB program simulation with the given processor requirements

I have a system with the configuration Intel(R) Core(TM) i3-5020U CPU @ 2.2 GHz, 4 GB RAM. But in order to compare the performance of my MATLAB program in terms of execution time, I need to execute it on a machine with the configuration Intel(R) Core(TM) i5-3570 CPU @ 3.40 GHz, 16 GB RAM. Is there a way to perform this kind of simulation?
TL;DR: No. Performance differences between Broadwell and IvyBridge depend on lots of complicated details. (See Agner Fog's microarch pdf for the low-level microarchitectural details, and also other stuff in the x86 tag wiki.)
It's likely that performance will scale with either clock speed or memory speed within maybe 10%, even between different microarchitectures, but it might not.
Using your own system, you can probably figure out how your code scales with CPU frequency, by forcing it to stay at minimum frequency for a test run. If it's a lot less than perfect scaling, then memory speed is a big factor. (The slower your CPU, the fewer cycles are spent waiting for memory.)
You can't extrapolate IvB i5 3.4GHz performance from BDW 2.2GHz performance without knowing a lot more details about exactly what your code bottlenecks on. It's possible that it bottlenecks on the same simple thing on both CPUs, in which case you could extrapolate. e.g. if it turns out that it bottlenecks on FP multiply latency, then run-time on IvB would be 5/3rds the run time on Broadwell (times the clock frequency ratio), since BDW has 3 cycle FP multiply and add, but SnB/IvB/Haswell have 5 cycle multiply. (FMA is 5 cycles on BDW, if I recall correctly. IvB doesn't support FMA, so if Matlab takes advantage of that on BDW, it's not even running the same machine code).
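To spell out that extrapolation (only valid under the FP-multiply-latency-bottleneck assumption, using the quoted clock speeds and ignoring turbo):

$$T_{\text{IvB}} \approx T_{\text{BDW}} \times \frac{5\ \text{cycles}}{3\ \text{cycles}} \times \frac{2.2\ \text{GHz}}{3.4\ \text{GHz}} \approx 1.08\, T_{\text{BDW}}$$

i.e. in that special case the higher clock of the i5 would almost cancel out IvyBridge's longer multiply latency.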
More likely, it's not that simple and cache / memory performance comes into it, too. Haswell/Broadwell don't have L1 cache-bank conflicts, but SnB/IvB do.
Depending on how you run the workload on the i5 CPU, it might or might not be able to turbo up to higher than its rated 3.4GHz, further confounding any attempt to guess at performance.
It's hard to measure practical efficiency across different computers. That's why you usually use theoretical efficiency with Big-O; check the wiki pages on algorithm efficiency and Big-O notation.
In case you have access to both programs (yours and the other person's code), you can test them on the same computer with the methods for measuring performance proposed by MathWorks, which are mainly timing functions for real (wall-clock) time and CPU time.
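For example (a minimal sketch; myAlgorithm is just a placeholder for whichever function you want to compare):
x = rand(2000);                                     % example input
tMedian = timeit(@() myAlgorithm(x));               % median wall-clock time over several runs
tic; myAlgorithm(x); tElapsed = toc;                % single-run wall-clock time
t0 = cputime; myAlgorithm(x); tCpu = cputime - t0;  % CPU time (can exceed wall time on multicore)
fprintf('timeit %.3fs, tic/toc %.3fs, cputime %.3fs\n', tMedian, tElapsed, tCpu)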
Lastly, you can see here several challenges about benchmarking that might be interesting to consider.

What are the minimum specifications for a Scala development machine?

I need to spec out a new computer. I'm only going to use this computer for developing Scala software. I'm going to be running IntelliJ, doing builds with Maven and SBT, and perhaps firing up a couple of virtual machines. I'm going to be building a mixture of fairly large Play Framework applications and micro-services. What is a reasonable machine for doing this work?
The Scala compiler still has poor parallelisation. I doubt that's going to change before you're due an upgrade. For this reason I would suggest, as a minimum, a Haswell 4670. Going up to an i7 will probably be of doubtful benefit. If you want to spend extra money, overclock a 4770K or a 4670K. If you've really got money to burn, use an Ivy Bridge 4960X, but you won't see much benefit for that extra money. Intel beats AMD on core-for-core performance. Make sure you've got a motherboard with 4 memory slots. Two 8 GB DDR3-1600 sticks are probably more than sufficient, but allow for an upgrade to 32 GB in a year or so's time, when hopefully memory has come down in price.
As already stated, get a decent SSD. Run your operating system, your IDE, and your projects off the SSD. You'll want a SATA hard drive for mass storage.
Anyway, above $1500 or so for the base unit, diminishing returns set in rapidly, unless you've really got money to burn.
You'll probably want a graphics card to run multiple monitors. An AMD 7790 should do the job. I'm assuming that a budget of 1000 to 1500 dollars for a base unit is not an issue. Personally I find three 24" 1920 x 1200 monitors just right for civilised development.
You should turn your focus to these key components:
CPU
Since Scala (and the compiler) parallelize well: go for more cores, the more the better. Depending on your budget, you can think about multi-CPU systems.
RAM
A lot of RAM helps a lot. I am pretty happy with 16 GB, but depending on the size of the projects you are planning, you might need more; 16 GB is a decent amount. You can also think about using RAM for a RAM disk for faster compilation, etc.
HDD
You definitely want a fast SSD. Look for one with high IOPS; the transfer rate is not that important for development. If you have a large budget, you can go for 2 SSDs in RAID-0, but be aware that some RAID controllers are not fast enough and will not give you the full possible performance of your SSD RAID.

Parallel programming on a Quad-Core and a VM?

I'm thinking of slowly picking up parallel programming. I've seen people use clusters with OpenMPI installed to learn this stuff. I do not have access to a cluster but have a quad-core machine. Will I be able to experience any benefit here? Also, if I'm running Linux inside a virtual machine, does it make sense to use OpenMPI inside a VM?
If your target is to learn, you don't need a cluster at all. Your quad-core (or any dual-core, or even a single-core) computer will be more than enough. The main point is to learn how to think "in parallel" and how to design your application.
Some important points are to:
Exploit different parallelism paradigms like divide-and-conquer, master-worker, SPMD, ... depending on the data and task dependencies of what you want to do.
Choose different data-division granularities to check the computation/communication ratio (in the case of message passing), or to check the amount of serial execution caused by mutual exclusion on memory regions.
With a quad-core you can measure the speedup of your approach (the performance gain attained because of the parallelization), which is normally given by dividing the time of the non-parallelized execution by the time of the parallel execution.
The closer you get to 4 (four cores meaning 1/4th the execution time), the better your parallelization strategy was (assuming you could evenly distribute work and data).
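Written as a formula (my notation, not part of the original answer):

$$S(p) = \frac{T_{\text{serial}}}{T_{\text{parallel}}(p)}, \qquad E(p) = \frac{S(p)}{p}$$

On a quad-core the ideal values are S(4) = 4 and E(4) = 1; anything close to that means the work and data were distributed well.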