Using a cluster of Raspberry Pi 4s for number crunching? - matlab

So I am currently developing an algorithm in MATLAB that is computationally expensive but parallel-processing friendly. Given that, I have been using the parallel processing library, but I am still falling short of my computation time goals.
I am currently running my algorithm on an Intel i7-8086K CPU (6 cores, 12 logical, @ 4.00GHz, turbo is 5GHz)
Here are my questions:
1. If I were to purchase, let's say, 10 Raspberry Pi 4 SBCs (4 cores @ 1.5GHz), could I use my main desktop as the host and the Pis as the clients? (Let us assume I migrate my algorithm to C++ and run it on Ubuntu for now.)
1a. If I were to go through with the build in question 1, would there be a significant upgrade in computation for the ~$500 spent?
1b. If I am not able to use my desktop as the host (I believe this shouldn't be an issue), how many Raspberry Pis would I need to equal my current CPU, or how many would I need to make it advantageous to work on a Pi cluster vs. my computer?
2. Is it possible to run Windows on the host computer and Linux on the clients (Pis) so that I can continue using MATLAB?
Thanks for your help; any other advice and recommendations are welcome.

Does your algorithm bottleneck on raw FMA / FLOPS throughput? If so then a cluster of weak ARM cores is more trouble than it's worth. I'd expect a used Zen2 machine, or maybe Haswell or Broadwell, could be good if you can find one cheaply. (You'd have to look at core counts, clocks, and FLOPS/$. And whether the problem would still not be memory bottlenecked on an older system with less memory bandwidth.)
If you bottleneck instead on cache misses from memory bandwidth or latency (e.g. cache-unfriendly data layout), there might be something to gain from having more, weaker CPUs, each with its own memory controller and cache, even if those caches are smaller than your Intel's.
Does Matlab use your GPU at all (e.g. via OpenCL)? Your current CPU's peak double (FP64) throughput from the IA cores is 384 GFLOPS, and its integrated GPU can add another 115.2 GFLOPS. Or for single precision, 460.8 GFLOPS from the GPU on top of 768 GFLOPS from your x86 cores. Again, these are theoretical max throughputs, the CPU numbers from running 2x 256-bit SIMD FMA instructions per clock cycle per core.
Upgrading to a beefy GPU could be vastly more effective than a cluster of RPi4. e.g. https://en.wikipedia.org/wiki/FLOPS#Hardware_costs shows that cost per single-precision GFLOP in 2017 was about 5 cents, adding big GPUs to a cheapo CPU. Or 79 cents per double-precision GFLOP.
If your problem is GPU-friendly but Matlab hasn't been using your GPU, look into that. Maybe Matlab has options, or you could use OpenCL from C++.
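To make that last option concrete, here's a minimal hedged sketch of "OpenCL from C++": a host program that runs one FMA per element on whatever GPU the driver reports. Error checking is omitted, and the kernel is a hypothetical placeholder, not your actual algorithm. Build with something like g++ ocl.cpp -lOpenCL.

    // Minimal OpenCL host sketch (assumes OpenCL 2.0 headers and an ICD are installed).
    #define CL_TARGET_OPENCL_VERSION 200
    #include <CL/cl.h>
    #include <vector>
    #include <cstdio>

    static const char* kSrc = R"(
    __kernel void fma_loop(__global const float* a, __global const float* b,
                           __global float* c) {
        size_t i = get_global_id(0);
        c[i] = fma(a[i], b[i], c[i]);          // one FMA per element
    })";

    int main() {
        const size_t n = 1 << 20;
        std::vector<float> a(n, 1.5f), b(n, 2.0f), c(n, 0.5f);

        cl_platform_id plat;  clGetPlatformIDs(1, &plat, nullptr);
        cl_device_id dev;     clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, nullptr);
        cl_context ctx = clCreateContext(nullptr, 1, &dev, nullptr, nullptr, nullptr);
        cl_command_queue q = clCreateCommandQueueWithProperties(ctx, dev, nullptr, nullptr);

        cl_program prog = clCreateProgramWithSource(ctx, 1, &kSrc, nullptr, nullptr);
        clBuildProgram(prog, 1, &dev, "", nullptr, nullptr);
        cl_kernel k = clCreateKernel(prog, "fma_loop", nullptr);

        cl_mem A = clCreateBuffer(ctx, CL_MEM_READ_ONLY  | CL_MEM_COPY_HOST_PTR, n * sizeof(float), a.data(), nullptr);
        cl_mem B = clCreateBuffer(ctx, CL_MEM_READ_ONLY  | CL_MEM_COPY_HOST_PTR, n * sizeof(float), b.data(), nullptr);
        cl_mem C = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR, n * sizeof(float), c.data(), nullptr);
        clSetKernelArg(k, 0, sizeof(A), &A);
        clSetKernelArg(k, 1, sizeof(B), &B);
        clSetKernelArg(k, 2, sizeof(C), &C);

        clEnqueueNDRangeKernel(q, k, 1, nullptr, &n, nullptr, 0, nullptr, nullptr);  // launch n work-items
        clEnqueueReadBuffer(q, C, CL_TRUE, 0, n * sizeof(float), c.data(), 0, nullptr, nullptr);
        std::printf("c[0] = %f\n", c[0]);      // 1.5*2.0 + 0.5 = 3.5
    }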
would there be a significant upgrade in computation for the ~$500 spent?
RPi4 model B has a Broadcom BCM2711 SoC. The CPU is Cortex-A72.
Their cache hierarchy is 32 KB data + 48 KB instruction L1 cache per core, with a shared 1 MB L2 cache. That's weaker than your 4GHz i7 with 32k L1d + 256k L2 private per core, and a shared 12 MiB L3 cache. But faster cores waste more cycles in the same absolute time waiting for a cache miss, and the ARM chips run their DRAM at a competitive DDR4-2400.
RPi CPUs are not FP powerhouses. There's a large gap in the raw numbers, but with enough of them the throughput does add up.
https://en.wikipedia.org/wiki/FLOPS#FLOPs_per_cycle_for_various_processors shows that Cortex-A72 has peak FPU throughput of 2 double FLOPS per core per cycle, vs. 16 for Intel since Haswell, AMD since Zen2.
Dropping to single precision float improves x86 by a factor of 2, but A72 by a factor of 4. Apparently their SIMD units have lower throughput for FP64 instructions, as well as half the work per SIMD vector. (Some other ARM cores aren't extra slow for double, just the expected 2:1, like Cortex-A57 and A76.)
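A hedged C++ sketch of the vector-width half of that, using NEON intrinsics (compile on the Pi itself, or with an aarch64 cross-compiler):

    #include <arm_neon.h>

    // One 128-bit NEON FMA instruction does 4 single-precision FMAs...
    float32x4_t fma4_f32(float32x4_t acc, float32x4_t a, float32x4_t b) {
        return vfmaq_f32(acc, a, b);   // acc + a*b: 4 FMAs = 8 FLOPs per instruction
    }
    // ...but only 2 double-precision FMAs, even before the A72's lower
    // per-instruction FP64 throughput is taken into account.
    float64x2_t fma2_f64(float64x2_t acc, float64x2_t a, float64x2_t b) {
        return vfmaq_f64(acc, a, b);   // 2 FMAs = 4 FLOPs per instruction
    }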
But all this is peak FLOPS throughput; coming close to it in real code requires well-tuned code with good computational intensity (lots of work done each time the data is loaded into cache, and/or into registers). A dense matrix multiply is the classic example: O(n^3) FPU work over O(n^2) data, in a way that makes cache-blocking possible. Prime95 is another example.
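For illustration, a toy cache-blocked matmul in C++ (the tile size is a hypothetical tuning parameter; in practice you'd call an optimized BLAS, which is what Matlab does internally):

    #include <vector>
    constexpr int N = 1024, BS = 64;   // assumes N is a multiple of the block size BS

    // C += A*B on NxN row-major matrices. Each BSxBS tile stays hot in cache
    // while being reused ~BS times, so FPU work dominates memory traffic.
    void matmul_blocked(const std::vector<double>& A, const std::vector<double>& B,
                        std::vector<double>& C) {
        for (int ii = 0; ii < N; ii += BS)
            for (int kk = 0; kk < N; kk += BS)
                for (int jj = 0; jj < N; jj += BS)
                    for (int i = ii; i < ii + BS; ++i)
                        for (int k = kk; k < kk + BS; ++k) {
                            double a = A[i * N + k];              // loaded once, used BS times
                            for (int j = jj; j < jj + BS; ++j)
                                C[i * N + j] += a * B[k * N + j]; // contiguous, SIMD-friendly
                        }
    }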
Still, here's a rough back-of-the-envelope calculation, being generous and assuming sustained non-turbo clocks for the Coffee Lake. (Keeping all 6 cores busy running 2x 256-bit FMA instructions per clock makes a lot of heat; that's literally what Prime95 does, so expect that level of power consumption if your code is that efficient.)
6 cores * 4GHz * 4 elements/vec * 2 vec/cycle = 192G FMAs / sec = 384 GFLOP/sec on the CFL
4 * 1.5GHz * 2 DP flops / clock = 12 GFLOP / sec per RPi.
With 5x RPi systems, that's 60 GFLOPS added to your existing 384 GFLOPS.
Doesn't sound worth the trouble of managing 5 RPi systems for much less than your existing total FP throughput. But again, if your problem has the right kind of parallelism, a GPU can run it much more efficiently. 60 GFLOPS for $500 is not a good deal compared to ~$50 per 60 double-precision GFLOPs from a high-end (in 2017) video card.
The GPU in an RPi might have some compute capability, but it's almost certainly not worth it compared to slapping a $500 discrete GPU into your existing machine if your code is GPU-friendly.
Or your problem might not scale with theoretical max FLOPS, but instead perhaps with cache bandwidth or some other factor.
Is it possible to run Windows on the host computer and Linux on the clients (Pis) so that I can continue using MATLAB?
Zero clue; I'm only considering theoretical best case for efficient machine code running on these CPUs.

Related

Differences between current gen Xeon Processors

What are the actual differences between the Xeon W series and the Bronze, Silver, Gold, and Platinum series?
With earlier generations of Xeons, the E3 parts were single-socket CPUs, whereas E5s could be used in motherboards with two sockets. The E7s supported quad-socket (probably 8-socket too).
However, with the current-generation Xeons, most of the lineup has a scalability of 2S (two processors in one motherboard).
If both Xeon Silver and Xeon Platinum can be used in a dual-socket motherboard, why would I need a Platinum processor, which is at least 5x more expensive than the Silver part? Unless there are other differences.
What are the differences between the current-gen Xeon processors? I see some differences in cache size. Other than that, I couldn't find anything else.
Gold/Platinum has more cores per socket, and/or higher base or turbo clocks. That's most of what you're paying for.
The extra UPI links that let them work in 4S or higher systems aren't relevant when being used in a 2 socket system, but that's not the only feature. Presumably it's only a small part of the cost. With the change from inclusive L3 cache to non-inclusive, Skylake Xeon and later already need a snoop filter separate from L3 tags even for single-socket, unlike Xeon E5 which just broadcast everything to the other socket. Presumably Xeon-SP's snoop filter can work for filtering snoops to the other socket as well so it didn't need to be a separate feature for 1S vs. 2S.
e.g. the top-end 2nd-gen (Cascade Lake) Intel® Xeon® Platinum 9282 Processor has 56 cores (112 threads), max turbo = 3.8 GHz, base clock = 2.6 GHz, and 77 MB of L3 Cache.
The top-end Silver is Intel® Xeon® Silver 4216: 16c/32t 3.2 GHz turbo, 2.10 GHz base, 22 MB L3 cache.
Despite having almost 4x the cores, sustained and peak turbo clocks are higher on the Platinum. (With a 400W TDP, vs. 100W for the Silver! Less-insane Platinum chips have lower TDPs, e.g. a 32c/64t with 2.3GHz base / 3.7GHz turbo is 250W TDP.)
Also, some (all?) Silver / Bronze CPUs only have one AVX-512 FMA execution unit, so throughput is halved for the 512-bit SIMD instructions that run on the FMA units: all FP math, int<->FP conversions, and _mm512_lzcnt_epi32. Look for the "# of AVX-512 FMA Units" line on the ARK page for a specific CPU. For integer SIMD, only multiply is affected. (In hardware, SIMD integer-multiply uops run on the FMA units.) Shifts, blends, shuffles, add/sub, compare, and booleans all have separate vector ALUs which are 512 bits wide and don't take as much die area as multipliers.
Even that top-end Silver 4216 Cascade Lake has only 1 512-bit FMA unit.
Running AVX2 code there's zero difference. Even AVX512 using only 256-bit vectors is fine. (gcc -march=skylake-avx512 defaults to -mprefer-vector-width=256 because using 512-bit vectors at all reduces max turbo temporarily. It wants to avoid the case where one unimportant 512-bit-vectorized loop gimps the clock speed for the rest of the program that spends most of its time in scalar code.)
But if you're doing heavy AVX-512 FP number crunching you probably want a CPU with 2 FMA units and to compile with 512-bit vectors.
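For example, the difference is just a compiler option (a hedged sketch, not from the question; flag spellings per current GCC docs):

    #include <cstddef>

    // Build:  g++ -O3 -march=cascadelake -mprefer-vector-width=512 -fopenmp-simd dot.cpp
    // can emit 512-bit zmm FMAs; with the default -mprefer-vector-width=256 the
    // same source uses 256-bit ymm, which costs nothing on a 1-FMA-unit Silver.
    double dot(const double* a, const double* b, size_t n) {
        double sum = 0.0;
        #pragma omp simd reduction(+:sum)   // lets GCC reorder the FP reduction
        for (size_t i = 0; i < n; ++i)
            sum += a[i] * b[i];             // contracts to an FMA under -O3
        return sum;
    }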
IDK why you tagged this Xeon Phi; that's a totally different microarchitecture.

Why is MATLAB so slow on my Windows server?

I have Matlab R2017a installed on a server running MS Windows Server 2008 R2 Enterprise v 6.1 (SP1) and the benchmark results are awful:
bench
    LU     FFT     ODE  Sparse     2-D     3-D
3.6424  0.5267  0.2114  5.0303  1.5557  3.4980
Note that it is particularly slow for LU and Sparse.
The server has this hardware:
CPU: Intel Xeon E7320 @ 2.13GHz (4 physical processors, 16 logical)
128 GB RAM
64-bit operating system
Matlab Version: 9.2.0.556344 (R2017a)
Java version: Java 1.7.0_60-b19 with Oracle corporation Java Hotspot(TM) 64-Bit Server VM mixed mode.
There are also other users who can be online on the server, but I can see that they are not stressing the system, and I have verified that these running times are stable (tested multiple times over the past week).
My question is this: is there any other library or something that Matlab relies on that could be "wrong"? I have another similar setup on a similar but slightly newer server that gets bench results much closer to what I'd expect based on the specs. I'm wondering if it's using a "wrong" linear algebra module or something.
Alternative explanation: I know that Matlab ran extremely slowly on a particular AMD Opteron CPU (I happen to have also worked in Matlab on such a server; link: https://se.mathworks.com/matlabcentral/answers/33939-poor-matlab-performance-on-amd-based-computer). Is it possible that it's a similar issue with the Intel Xeon E7320?
Edit: corrected to Xeon E7320, as suggested by Peter.
Update: I'm not sure whether Matlab's bench takes advantage of just a single CPU core, multiple CPU cores, or also a GPU (OpenCL / CUDA). If it can use GPU acceleration, that would make a huge difference. (Especially if you don't have one at all in your "slow" server).
As discussed in comments, a dual-core Sandybridge laptop is 10x faster on some of the benchmarks, but only 2x or 1.5x faster on some other components. (But I'm not sure the version of Matlab was controlled for; the thread you linked mentioned that different versions of Matlab do a different amount of work in their bench.)
The rest of this answer was written with the assumption that your test takes advantage of all your CPU cores (otherwise there's no point using an old many-core machine). But without considering GPU.
I think your CPU is actually a 65nm Core2-based Xeon E7320, not "E3720" (no google hits). What are you comparing against? Your Tigerton CPUs are ancient (about 10 years old), of course they're slow. (Tigerton is the same microarchitecture as Conroe/Merom, first-gen Core2).
You have very low memory bandwidth and cache speeds compared to a modern CPU, as well as only having SSSE3, not AVX or FMA. These CPUs don't have a memory controller built-in, so all 4 sockets are sharing the memory controller hub (MCH) via separate 1066MHz Front-Side Buses. Memory bandwidth doesn't scale with number of sockets, and is not great. Memory bandwidth has grown faster than per-core performance over the years. According to that link, a quad-socket 16-core Tigerton (like you have) is barely better than a quad-socket 8-core Barcelona Opteron. It's not so bad for CPU-bound workloads, but memory-bound workloads will do quite badly.
As well as the low clock speed, it's significantly slower clock-for-clock than a modern CPU. IDK what those times are supposed to be like (I'm here for the [performance] tag, not Matlab), but it's totally plausible that a 3GHz quad-core i5 or i7 Haswell / Skylake desktop or high-power laptop would be faster than your 16-core dinosaur machine.
(Actually, does that benchmark even scale with the number of cores? If not, the single-threaded memory bandwidth is probably really not good.)
A very big jump in performance happened with Sandybridge (for all code, including non-SIMD workloads), and there were several other smaller jumps in between your machine and modern CPUs as well. SnB can run 2 load instructions per clock, vs. 1 for previous Intel (like your Core2).
For FP-specific stuff that optimized libraries will take advantage of, x86 ISA extensions have been important: AVX doubles the SIMD vector width, doubling FLOPS (on Intel CPUs with full-width execution units). FMA does a mul+add in one instruction, potentially doubling FLOPS. Microarchitectural improvements are important, too: Haswell has two FMA units vs. earlier CPUs having one FP adder and one FP multiplier, again potentially doubling FLOPS. Only contiguous memory and high computation vs. memory workloads will fully take advantage of this, e.g. a dense matmul, but in that case one Haswell core is doing as much work as 8 Tigerton cores.
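As a hedged sketch of the building block those libraries lean on (assumes AVX2+FMA and n a multiple of 8; the function name is hypothetical):

    #include <immintrin.h>
    #include <cstddef>

    // y = a*x + y. Each _mm256_fmadd_ps is one instruction doing 8 mul+add pairs,
    // i.e. 16 FLOPs; a Core2-era CPU needs several 128-bit mulps/addps for the same work.
    void saxpy_fma(float a, const float* x, float* y, size_t n) {
        __m256 va = _mm256_set1_ps(a);
        for (size_t i = 0; i < n; i += 8) {
            __m256 vy = _mm256_fmadd_ps(va, _mm256_loadu_ps(x + i),
                                        _mm256_loadu_ps(y + i));
            _mm256_storeu_ps(y + i, vy);
        }
    }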
I assume Matlab can take advantage of AVX + FMA if the CPU has it.
And BTW, it's not just 16 "logical" processors. You don't have hyperthreading, so you have a 4-socket system with four quad-core CPUs, for 16 physical cores. (And these "quad core" chips are actually two separate dual-core dies in the same package, according to Wikipedia.)
So as far as the number of physical chips that need to communicate with each other, there are 8 (two in each package). That's a lot of hops to reach other CPUs, so synchronization between cores is more expensive than for a single-die quad-core. (And probably worse even than a modern dual-socket Xeon box with a pair of 18-core CPUs or something).
Note that high latency to memory can also hurt memory bandwidth: see the "latency bound platforms" part of this answer about optimizing memcpy/memset and how store bandwidth works in Intel CPUs.

SIMD intrinsics and memory bus size - how does the CPU fetch all 128/256 bits in a single memory read?

Hello forum – I have a few similar/related questions about SIMD intrinsics that I searched for online, including on Stack Overflow, without finding good answers, so I'm requesting your help.
Basically I am trying to understand how a 64-bit CPU fetches all 128 bits in a single read, and what the requirements for such an operation are.
Would the CPU fetch all 128 bits from memory in a single memory operation, or will it do two 64-bit reads?
Do CPU manufacturers demand a certain size of memory bus? For example, for a 64-bit CPU, would Intel require a 128-bit bus for SSE memory-bound operations?
Are these operations dependent on memory bus size, the number of channels, and the number of memory modules?
Loads/stores don't go directly to memory (unless you use them on an uncacheable memory region). Even NT stores go into a write-combining fill buffer.
Load/stores go between execution units and L1D cache. CPUs internally have wide data paths from cache to execution units, and from L1D to outer caches. See How can cache be that fast? on electronics.SE, about Intel IvyBridge.
e.g. IvB has 128b data paths between execution units and L1D. Haswell widened that to 256 bits. Unaligned loads/stores have full performance as long as they don't cross a cache-line boundary. Skylake-AVX512 widened that to 512 bits, so it can do 2 64-byte loads and a 64-byte store in a single clock cycle. (As long as data is hot in L1D cache).
AMD CPUs including Ryzen handle 256b vectors in 128b chunks (even in the execution units, unlike Intel after Pentium M). Older CPUs (e.g. Pentium III and Pentium-M) split 128b loads/stores (and vector ALU) into two 64-bit halves because their load/store execution units were only 64 bits wide.
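To tie that back to question 1: on any x86-64 CPU, a 128-bit load is a single instruction and (if it doesn't split across a cache-line boundary) a single L1d access, regardless of pointers being 64-bit. A minimal sketch (SSE2 only, so plain g++ -O2 on x86-64 compiles it; check the asm to see one movdqu rather than two 8-byte loads):

    #include <immintrin.h>
    #include <cstdint>

    int64_t sum_two_halves(const int64_t* p) {
        __m128i v  = _mm_loadu_si128((const __m128i*)p);  // one 128-bit load uop
        __m128i hi = _mm_unpackhi_epi64(v, v);            // bring the high 64 bits down
        return _mm_cvtsi128_si64(v) + _mm_cvtsi128_si64(hi);
    }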
The memory controllers are DDR2/3/4. The bus is 64-bits wide, but uses a burst mode with a burst size of 64 bytes (not coincidentally, the size of a cache line.)
Being a "64-bit" CPU is unrelated to the width of any internal or external data buses. That terminology did get used for other CPUs in the past, but even P5 Pentium had a 64-bit data bus. (aligned 8-byte load/store is guaranteed atomic as far back as P5, e.g. x87 or MMX.) 64-bit in this case refers to the width of pointers, and of integer registers.
Further reading:
David Kanter's Haswell deep dive compares data-path widths in Haswell vs. SnB cores.
What Every Programmer Should Know About Memory (but note that much of the software-prefetch stuff is obsolete, modern CPUs have better HW prefetchers than Pentium4). Still essential reading, especially if you want to understand how CPUs are connected to DDR2/3/4 memory.
Other performance links in the x86 tag wiki.
Enhanced REP MOVSB for memcpy for more about x86 memory bandwidth. Note especially that single-threaded bandwidth can be limited by max_concurrency / latency, rather than by the DRAM controller, especially on a many-core Xeon (higher latency to L3 / memory).

What is the maximum memory per worker that MATLAB can address?

Short version: Is there a maximum amount of RAM / worker, that MATLAB can address?
Long version: My wife uses MATLAB's parallel processing capabilities in data-heavy spatial analyses (I don't really understand it, I just know how to build computers that make her work quicker) and I would like to build a new computer so she can radically reduce her process times.
I am leaning toward something in the 10-16 core range, since prices on such processors seem to be dropping with each generation and I would like to use 128 GB of RAM, because 'why not' if you can stomach the cost and see some meaningful time savings?
The number of cores I shoot for will depend on the maximum amount of RAM that MATLAB can address for each worker (if such a limit exists). The computer I built for similar work in 2013 has 4 physical cores (Intel i7-3770k) and 32 GB RAM (which she maxed out), and whatever I build next, I would like to have at least the same memory/core. With 128 GB of RAM a given, 10 cores would be 12.8 GB/core, 12 cores would be ~10.5 GB/core and 16 cores would be 8 GB/core. I am inclined to maximize cores rather than memory, but since she doesn't know what will benefit her processes the most, I would like to know how realistic those three options are. As for your next question, she has an nVidia GPU capable of parallel processing, but she believes her processes would not benefit from its CUDA cores.
Thank you for your insights. Many, many Google searches did not yield an answer.

MATLAB program simulation with the given processor requirements

I have a system with the configuration Intel(R) Core(TM) i3-5020U CPU @ 2.2 GHz, 4 GB RAM. But in order to compare the performance of my MATLAB program in terms of execution time, I need to execute it on a machine with the configuration Intel(R) Core(TM) i5-3570 CPU @ 3.40 GHz, 16 GB RAM. Is there a way to perform this kind of simulation?
TL:DR: No. Performance differences between Broadwell and IvyBridge depend on lots of complicated details. (See Agner Fog's microarch pdf for the low-level microarchitectural details, and also other stuff in the x86 tag wiki)
It's likely that performance will scale with either clock speed or memory speed within maybe 10%, even between different microarchitectures, but it might not.
Using your own system, you can probably figure out how your code scales with CPU frequency, by forcing it to stay at minimum frequency for a test run. If it's a lot less than perfect scaling, then memory speed is a big factor. (The slower your CPU, the fewer cycles are spent waiting for memory.)
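A hedged sketch of that experiment (work() is a stand-in for your real computation; pin the frequency outside the program, e.g. with cpupower frequency-set on Linux or power-plan settings on Windows):

    #include <chrono>
    #include <cstdio>

    double work() {                        // toy FP-heavy placeholder, not your algorithm
        double s = 0.0;
        for (long i = 0; i < 200000000L; ++i) s += 1e-9 * (double)i;
        return s;
    }

    int main() {
        auto t0 = std::chrono::steady_clock::now();
        volatile double sink = work();     // volatile: keep the result live
        auto t1 = std::chrono::steady_clock::now();
        // Run once at default clocks and once pinned to minimum frequency; if the
        // times don't scale with the frequency ratio, memory speed is a big factor.
        std::printf("%.3f s (result %g)\n",
                    std::chrono::duration<double>(t1 - t0).count(), (double)sink);
    }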
You can't extrapolate IvB i5 3.4GHz performance from BDW 2.2GHz performance without knowing a lot more details about exactly what your code bottlenecks on. It's possible that it bottlenecks on the same simple thing on both CPUs, in which case you could extrapolate. e.g. if it turns out that it bottlenecks on FP multiply latency, then run-time on IvB would be 5/3rds the run time on Broadwell (times the clock frequency ratio), since BDW has 3 cycle FP multiply and add, but SnB/IvB/Haswell have 5 cycle multiply. (FMA is 5 cycles on BDW, if I recall correctly. IvB doesn't support FMA, so if Matlab takes advantage of that on BDW, it's not even running the same machine code).
More likely, it's not that simple and cache / memory performance comes into it, too. Haswell/Broadwell don't have L1 cache-bank conflicts, but SnB/IvB do.
Depending on how you run the workload on the i5 CPU, it might or might not be able to turbo up to higher than its rated 3.4GHz, further confounding any attempt to guess at performance.
It's hard to measure practical efficiency across different computers. That's why you usually use theoretical efficiency with Big-O; check the wiki pages on algorithm efficiency and Big-O notation.
If you have access to both codes (yours and the other guy's), you can test them on the same computer with the methods for measuring performance proposed by MathWorks, which are mainly timing functions for real time and CPU time.
Lastly, you can see here several challenges about benchmarking that might be interesting to consider.