What is the maximum memory per worker that MATLAB can address?

Short version: Is there a maximum amount of RAM per worker that MATLAB can address?
Long version: My wife uses MATLAB's parallel processing capabilities in data-heavy spatial analyses (I don't really understand it, I just know how to build computers that make her work quicker) and I would like to build a new computer so she can radically reduce her process times.
I am leaning toward something in the 10-16 core range, since prices on such processors seem to be dropping with each generation, and I would like to use 128 GB of RAM, because why not, if you can stomach the cost and see some meaningful time savings.
The number of cores I shoot for will depend on the maximum amount of RAM that MATLAB can address for each worker (if such a limit exists). The computer I built for similar work in 2013 has 4 physical cores (Intel i7-3770K) and 32 GB of RAM (which she maxed out), and whatever I build next should have at least the same memory per core. With 128 GB of RAM as a given, 10 cores would be 12.8 GB/core, 12 cores would be ~10.7 GB/core, and 16 cores would be 8 GB/core. I am inclined to maximize cores rather than memory, but since she doesn't know what will benefit her processes the most, I would like to know how realistic those three options are. As for your next question: she has an NVIDIA GPU capable of parallel processing, but she believes her processes would not benefit from its CUDA cores.
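For reference, the worker count is a setting rather than a hardware property, so memory per worker is simply total RAM divided by however many workers are started. A minimal sketch of checking this (assuming the Parallel Computing Toolbox local profile; memory() is Windows-only):

c = parcluster('local');            % default local cluster profile
c.NumWorkers = 10;                  % e.g. 10 workers with 128 GB RAM -> ~12.8 GB each
pool = parpool(c, c.NumWorkers);
spmd
    m = memory;                     % memory() is Windows-only
    fprintf('worker %d: %.1f GB largest possible array\n', labindex, m.MaxPossibleArrayBytes/2^30);
end
delete(pool);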
Thank you for your insights. Many, many Google searches did not yield an answer.

Related

AnyLogic Computer Processor Advice needed - Single-core speed vs. number of cores?

I model on an ancient PC and recently got some lab funds for a new modeling computer. The choice of processor confounds me. For optimal AnyLogic simulation modeling, should I focus on maximizing single-core speed or the number of processor cores? Also, would a high-end graphics card help? I have heard from my engineering colleagues that for certain modeling tools they do help with the workload. Any advice helps. Thanks.
This is what AnyLogic answered when I asked for the perfect computer to buy:
The recommended platform for AnyLogic is a powerful PC/laptop running a 64-bit operating system (Windows preferred), plus a CPU with multiple cores, like an i7, and at least 8 GB of RAM.
In general, a faster CPU (3 GHz or more recommended) means faster single-run execution. More cores mean faster execution of experiments that run the model multiple times in parallel (optimization, parameter variation, Monte Carlo, etc.). Pedestrians and transporters also benefit from many cores (even in a single run, since the algorithm driving pedestrian and transporter movement uses all available cores). For the time being, AnyLogic doesn't support GPU processing. RAM is crucial when you have a lot of agents and many parallel runs (e.g. if a single run takes 1 GB, then 8 parallel runs will take 8 GB). For working with GIS maps, a good Internet connection may be needed, for example if the model requests a lot of routes from an online route provider.
On average, a mid-range PC/laptop is sufficient for most models; a high-end PC or server/instance will be useful for really heavy models.
Just to add to Felipe's reply: the graphics card is completely irrelevant; AnyLogic does not support offloading computations to GPU cores.
Focus on decent processor speed, 8-12 cores, at least 16 GB of RAM, and (crucially!) an SSD. Good to go :)
Oh, and you may want to use Windows. Linux and macOS seem to have more problems/bugs in AnyLogic than Windows does.

Using a cluster of Raspberry Pi 4 boards for number crunching?

So I am currently developing an algorithm in MATLAB that is computationally expensive but is parallel processing friendly. Given that, I have been using the parallel processing library but I am still falling short of my computation time goals.
I am currently running my algorithm on an Intel i7-8086K CPU (6 cores, 12 logical, @ 4.00 GHz, 5 GHz turbo).
Here are my questions:
1. If I were to purchase, let's say, 10 Raspberry Pi 4 SBCs (4 cores @ 1.5 GHz), could I use my main desktop as the host and the Pis as the clients? (Let us assume I migrate my algorithm to C++ and run it on Ubuntu for now.)
1a. If I were to go through with the build in question 1, would there be a significant upgrade in computation for the ~$500 spent?
1b. If I am not able to use my desktop as the host (I believe this shouldn't be an issue), how many Raspberry Pis would I need to match my current CPU, or how many would make it advantageous to work on a Pi cluster vs. my computer?
2. Is it possible to run Windows on the host computer and Linux on the clients (Pis) so that I can continue using MATLAB?
Thanks for your help; any other advice and recommendations are welcome.
Does your algorithm bottleneck on raw FMA / FLOPS throughput? If so, a cluster of weak ARM cores is more trouble than it's worth. I'd expect a used Zen 2, Haswell, or Broadwell machine could be good if you can find one cheaply. (You'd have to look at core counts, clocks, and FLOPS/$, and check whether the problem would become memory-bottlenecked on an older system with less memory bandwidth.)
If you bottleneck instead on cache misses from memory bandwidth or latency (e.g. cache-unfriendly data layout), there might possibly be something to gain from having more weaker CPUs each with their own memory controller and cache, even if those caches are smaller than your Intel.
Does Matlab use your GPU at all (e.g. via OpenCL)? Your CPU's theoretical peak double-precision (FP64) throughput from the IA cores is about 384 GFLOPS (6 cores running 2x 256-bit SIMD FMA instructions per clock cycle at 4 GHz), and the integrated GPU can add roughly another 115.2 GFLOPS. For single precision it's about 768 GFLOPS from the x86 cores plus 460.8 GFLOPS from the iGPU. These are theoretical maximum throughputs.
Upgrading to a beefy GPU could be vastly more effective than a cluster of RPi4. e.g. https://en.wikipedia.org/wiki/FLOPS#Hardware_costs shows that cost per single-precision GFLOP in 2017 was about 5 cents, adding big GPUs to a cheapo CPU. Or 79 cents per double-precision GFLOP.
If your problem is GPU-friendly but Matlab hasn't been using your GPU, look into that. Maybe Matlab has options, or you could use OpenCL from C++.
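For reference, a minimal sketch of what explicitly using the GPU from Matlab looks like with gpuArray (this assumes a supported NVIDIA GPU and the Parallel Computing Toolbox; whether it helps depends entirely on the algorithm):

if gpuDeviceCount > 0
    A  = rand(4096, 'single');          % single precision is where GPUs shine
    Ag = gpuArray(A);                   % copy the data into GPU memory
    tic; Bg = Ag * Ag; wait(gpuDevice); tGPU = toc;   % wait() so the async GPU work is fully timed
    tic; B  = A  * A;                   tCPU = toc;
    fprintf('CPU: %.3f s   GPU: %.3f s\n', tCPU, tGPU);
    Bg = gather(Bg);                    % copy the result back only when needed
end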
will there be a significant upgrade in computation for the ~$500 spent?
RPi4 model B has a Broadcom BCM2711 SoC. The CPU is Cortex-A72.
Their cache hierarchy is 32 KB data + 48 KB instruction L1 per core, with a 1 MB shared L2. That's weaker than your 4 GHz i7, which has 32 KB L1d + 256 KB L2 private per core and a shared 12 MiB L3 cache. But faster cores waste more cycles on a cache miss of the same absolute duration, and the ARM chips run their DRAM at a competitive DDR4-2400.
RPi CPUs are not FP powerhouses. There's a large gap in the raw numbers, but with enough of them the throughput does add up.
https://en.wikipedia.org/wiki/FLOPS#FLOPs_per_cycle_for_various_processors shows that Cortex-A72 has peak FPU throughput of 2 double FLOPS per core per cycle, vs. 16 for Intel since Haswell, AMD since Zen2.
Dropping to single precision float improves x86 by a factor of 2, but A72 by a factor of 4. Apparently their SIMD units have lower throughput for FP64 instructions, as well as half the work per SIMD vector. (Some other ARM cores aren't extra slow for double, just the expected 2:1, like Cortex-A57 and A76.)
But all this is peak FLOPS throughput; coming close to it in real code requires well-tuned code with good computational intensity (lots of work done each time the data is loaded into cache and/or registers). A dense matrix multiply is the classic example: O(n^3) FPU work over O(n^2) data, in a way that makes cache-blocking possible. Prime95 is another example.
Still, here's a rough back-of-the-envelope calculation, being generous and assuming sustained non-turbo clocks for the Coffee Lake. (All 6 cores busy running 2x 256-bit FMA instructions per clock makes a lot of heat. That's literally what Prime95 does, so expect that level of power consumption if your code is that efficient.)
6 cores * 4 GHz * 4 elements/vec * 2 FMA vectors/cycle = 192 G FMAs/sec = 384 GFLOP/sec on the CFL
4 * 1.5GHz * 2 DP flops / clock = 12 GFLOP / sec per RPi.
With 5x RPi systems, that's 60 GFLOPS added to your existing ~384 GFLOP/s.
Doesn't sound worth the trouble of managing 5 RPi systems for a small fraction of your existing FP throughput. But again, if your problem has the right kind of parallelism, a GPU can run it much more efficiently. 60 GFLOPS for $500 is not a good deal compared to roughly $50 per 60 double-precision GFLOP from a high-end (in 2017) video card.
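The same envelope numbers written out as a few lines of MATLAB, so the assumptions stay explicit (theoretical peaks only; the per-cycle figures come from the Wikipedia table linked above):

cfl_gflops = 6 * 4.0 * 16;    % 6 cores * 4 GHz * 16 DP FLOPs/cycle (2x 256-bit FMA) = 384
rpi_gflops = 4 * 1.5 * 2;     % 4 cores * 1.5 GHz * 2 DP FLOPs/cycle = 12
n_pis      = 5;               % number of Pi boards in the comparison above
fprintf('cluster adds %g GFLOP/s vs. %g GFLOP/s already in the desktop\n', ...
        n_pis*rpi_gflops, cfl_gflops);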
The GPU in an RPi might have some compute capability, but it's almost certainly not worth it compared to slapping a $500 discrete GPU into your existing machine, if your code is GPU-friendly.
Or your problem might not scale with theoretical max FLOPS, but instead perhaps with cache bandwidth or some other factor.
Is it possible to run Windows on the host computer and linux on the clients(Pis) so that I continue using MATLAB?
Zero clue; I'm only considering theoretical best case for efficient machine code running on these CPUs.

Why is MATLAB so slow on my Windows server?

I have Matlab R2017a installed on a server running MS Windows Server 2008 R2 Enterprise v 6.1 (SP1) and the benchmark results are awful:
bench
   LU       FFT      ODE      Sparse   2-D      3-D
   3.6424   0.5267   0.2114   5.0303   1.5557   3.4980
Note that it is particularly slow for LU and Sparse.
The server has this hardware:
CPU: Intel Xeon E7320 @ 2.13 GHz (4 physical processors, 16 logical)
128 GB RAM
64-bit operating system
Matlab Version: 9.2.0.556344 (R2017a)
Java version: Java 1.7.0_60-b19 with Oracle corporation Java Hotspot(TM) 64-Bit Server VM mixed mode.
Other users can be logged on to the server, but I can see that they are not stressing the system, and I have verified that these running times are stable (tested multiple times over the past week).
My question is this: is there any other library or something that Matlab relies on that could be "wrong"? I have another similar setup on a similar but slightly newer server that gets bench results much closer to what I'd expect based on the specs. I'm wondering if it's using a "wrong" linear algebra module or something.
Alternative explanation: I know that Matlab ran extremely slowly on a particular AMD Opteron CPU (I happen to have worked in Matlab on such a server; see https://se.mathworks.com/matlabcentral/answers/33939-poor-matlab-performance-on-amd-based-computer). Is it possible that this is a similar issue with the Intel Xeon E7320?
Edit: Xeon E7320 as suggested by Peter.
Update: I'm not sure whether Matlab's bench takes advantage of just a single CPU core, multiple CPU cores, or also a GPU (OpenCL / CUDA). If it can use GPU acceleration, that would make a huge difference. (Especially if you don't have one at all in your "slow" server).
As discussed in comments, a dual-core Sandybridge laptop is 10x faster on some of the benchmark components, but only 1.5-2x faster on others. (But I'm not sure the version of Matlab was controlled for; the thread you linked mentioned that different versions of Matlab do different amounts of work in their bench.)
The rest of this answer was written with the assumption that your test takes advantage of all your CPU cores (otherwise there's no point using an old many-core machine). But without considering GPU.
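A quick sanity check of what bench has to work with on each machine (a sketch; assumes the Parallel Computing Toolbox is installed for the GPU calls):

maxNumCompThreads        % how many computational threads MATLAB will use
gpuDeviceCount           % 0 on the "slow" server would rule out GPU acceleration
if gpuDeviceCount > 0
    gpuDevice            % details of the CUDA device MATLAB selected
end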
I think your CPU is actually a 65nm Core2-based Xeon E7320, not "E3720" (no google hits). What are you comparing against? Your Tigerton CPUs are ancient (about 10 years old), of course they're slow. (Tigerton is the same microarchitecture as Conroe/Merom, first-gen Core2).
You have very low memory bandwidth and cache speeds compared to a modern CPU, as well as only having SSSE3, not AVX or FMA. These CPUs don't have a built-in memory controller, so all 4 sockets share the memory controller hub (MCH) via separate 1066 MHz front-side buses. Memory bandwidth doesn't scale with the number of sockets and is not great, and memory bandwidth has grown faster than per-core performance over the years. Benchmarks from that era showed a quad-socket 16-core Tigerton (like yours) barely beating a quad-socket 8-core Barcelona Opteron. It's not so bad for CPU-bound workloads, but memory-bound workloads will do quite badly.
As well as the low clock speed, it's significantly slower clock-for-clock than a modern CPU. IDK what those times are supposed to be like (I'm here for the [performance] tag, not Matlab), but it's totally plausible that a 3GHz quad-core i5 or i7 Haswell / Skylake desktop or high-power laptop would be faster than your 16-core dinosaur machine.
(Actually, does that benchmark even scale with the number of cores? If not, the single-threaded memory bandwidth is probably really not good.)
A very big jump in performance happened with Sandybridge (for all code, including non-SIMD workloads), and there were several other smaller jumps in between your machine and modern CPUs as well. SnB can run 2 load instructions per clock, vs. 1 for previous Intel (like your Core2).
For FP-specific work that optimized libraries take advantage of, x86 ISA extensions have been important: AVX doubles the SIMD vector width, doubling FLOPS (on Intel CPUs with full-width execution units). FMA does a multiply+add in one instruction, potentially doubling FLOPS again. Microarchitectural improvements matter too: Haswell has two FMA units vs. earlier CPUs having one FP adder and one FP multiplier, again potentially doubling FLOPS. Only workloads with contiguous memory access and a high ratio of computation to memory traffic will take full advantage of this, e.g. a dense matmul, but in that case one Haswell core can do as much work as 8 Tigerton cores.
I assume Matlab can take advantage of AVX + FMA if the CPU has it.
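One concrete check for the "wrong linear algebra module" theory is to ask MATLAB which BLAS/LAPACK it loaded and to time a dense matmul, which is essentially pure BLAS (a sketch; the 4000x4000 size is arbitrary):

version('-blas')         % which BLAS MATLAB loaded at runtime (typically Intel MKL)
version('-lapack')       % which LAPACK
A = rand(4000);          % dense double-precision matmul = DGEMM, dominated by the BLAS
tic; B = A*A; t = toc;
fprintf('4000x4000 matmul: %.2f s (~%.1f GFLOP/s)\n', t, 2*4000^3/t/1e9);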
And BTW, it's not just 16 "logical" processors. You don't have hyperthreading, so you have a 4-socket system with four quad-core CPUs, for 16 physical cores. (And these "quad-core" chips are actually two separate dual-core dies in the same package, according to Wikipedia.)
So as far as the number of physical chips that need to communicate with each other, there are 8 (two in each package). That's a lot of hops to reach other CPUs, so synchronization between cores is more expensive than for a single-die quad-core. (And probably worse even than a modern dual-socket Xeon box with a pair of 18-core CPUs or something).
Note that high latency to memory can also hurt memory bandwidth: see the "latency bound platforms" discussion in write-ups on optimizing memcpy/memset and on how store bandwidth works in Intel CPUs.

Performance issues with parallel MATLAB on a NUMA machine

I'm running memory-intensive parallel computations in MATLAB on a 64-core NUMA machine (8 cores per socket) under Windows 7, using the Parallel Computing Toolbox. I've noticed a very strange CPU load pattern: when running, say, 36 parallel MATLAB workers, the cores on the 1st socket are fully loaded, the 2nd socket is almost fully loaded too, the 3rd is at about 50%, and so on. The last socket is usually almost completely free and doing nothing. Running more than 12 parallel workers simultaneously seems to very adversely affect the performance of all workers.
I tried to experiment with CPU affinity, pinning different workers to different cores. While it helps in simple tests (i.e. the CPU load pattern becomes uniform across all cores), it doesn't help in our real-life memory-intensive computations.
I suspect the problem is memory locality, i.e. all memory is allocated on the 1st and 2nd sockets. That would explain the strange CPU load: the OS tries to run computational threads close to the data. But I know neither how to confirm this suspicion directly nor how to fix it if it's true.
I use maxNumCompThreads(4) in all my parallel workers, if that's important. Hyperthreading is off.
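One hedged way to probe the memory-locality suspicion (not from the original thread): have each worker allocate and build its own data inside the parallel construct instead of receiving big arrays broadcast from the client, so the OS can place each worker's pages on the node it actually runs on. A minimal sketch, where load_chunk and heavy_compute are hypothetical stand-ins for the real per-worker setup and kernel:

pool = parpool('local', 36);
parfor k = 1:36
    data = load_chunk(k);              % hypothetical loader: allocate inside the worker (first-touch)
    results(k) = heavy_compute(data);  % hypothetical memory-intensive kernel
end
delete(pool);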
You should only be able to run 12 local workers using Parallel Computing Toolbox. See the data sheet.
Please note that in R2014a the limit on the number of local workers was removed. See the release notes.
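So on R2014a or later the cap can simply be raised on the local profile; a minimal sketch (36 to match the number of workers in the question):

c = parcluster('local');
c.NumWorkers = 36;           % above the old 12-local-worker cap; needs R2014a+
saveProfile(c);              % persist the change in the local profile
parpool(c, 36);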

Performance degrades if the number of threads is more than 2 on a Xeon X5355

I have a strange problem, though it may not be that strange to some of you.
I am writing an application using boost threads and using boost barriers to synchronize the threads. I have two machines to test the application.
Machine 1 is a Core 2 Duo (T8300) machine (Windows XP Professional, 4 GB RAM), where I get the following performance figures:
Number of threads: 1, TPS: 21
Number of threads: 2, TPS: 35 (66% improvement)
Further increases in the number of threads decrease the TPS, but that is understandable since the machine has only two cores.
Machine 2 is a dual quad-core (Xeon X5355) machine (Windows Server 2003, 4 GB RAM) with 8 effective cores.
Number of threads: 1, TPS: 21
Number of threads: 2, TPS: 27 (28% improvement)
Number of threads: 4, TPS: 25
Number of threads: 8, TPS: 24
As you can see, performance degrades after 2 threads (even though the machine has 8 cores). If the program had some bottleneck, performance should have degraded at 2 threads as well.
Any ideas or explanations? Does the OS play some role in performance? It seems the Core 2 Duo (2.4 GHz) scales better than the Xeon X5355 (2.66 GHz), even though the Xeon has the higher clock speed.
Thank you
-Zoolii
The clock speed and the operating system don't have as much to do with it as the way your code is written. Things to check might include:
Are you actually spinning up more than two threads at one time?
Do you have unnecessary synchronization artifacts in your code?
Are you synchronizing your code at the appropriate places?
What is your shareable resource and how many of them are there? If each of your transactions relies on a single section of code, native library, file, database, whatever, then it doesn't matter how many CPUs you've got.
One tool at your disposal when analyzing software bottlenecks is the simple thread dump. Taking a few dumps throughout the life of an execution of your software should expose bottlenecks in your software. You may be able to take that output and use it to reevaluate your code.
Adding more CPUs does not always equate to better performance; locking and contention can severely degrade performance. Factors to consider are:
Is your algorithm suited to parallelisation?
Any inherently sequential portions of code?
Can you partition work into coarse-grained 'chunks'? Coarse-grained is usually better than fine-grained...
Can you alter your code to use less locking?
Synchronisation overheads can often be reduced by ensuring chunks of work are similarly sized.
Based on experience, it could be that the platform effectively handles only two threads well on that processor, that only pthreads work properly on that version of the operating system, that the two processor generations were designed to different specifications, or that threads beyond some limit are being rejected by the system and the resulting error handling is slowing down throughput on the two busy cores (and may even lead to cores 3 and 4 sitting idle).