CPU usage of CP-SAT solver with num_search_workers = 8

CPU usage of CP-SAT solver with num_search_workers = 8 - or-tools

Dears,
I have a question, If I have a 32-core CPU and use CP-SAT with num_search_workers = 8, how many CPU core am I using?
thanks

Related

Using a cluster of Raspberry Pi 4 as a cluster for number crunching?

So I am currently developing an algorithm in MATLAB that is computationally expensive but is parallel processing friendly. Given that, I have been using the parallel processing library but I am still falling short of my computation time goals.
I am currently running my algorithm on an Intel i7 8086k CPU (6 Core, 12 logical, #4.00GHz, turbo is 5GHz)
Here are my questions:
If I was to purchase, lets say 10 raspberry pi 4 SBCs (4 cores #1.5GHz), could I use my main desktop as the host and the PIs as the clients? (Let us assume I migrate my algorithm to C++ and run it in Ubuntu for now).
1a. If I was to go through with the build in question 1, will there be a significant upgrade in computation for the ~$500 spent?
1b. If I am not able to use my desktop as host (I believe this shouldn't be an issue), how many raspberry PIs would I need to equate to my current CPU or how many would I need to make it advantageous to work on a PI cluster vs my computer?
Is it possible to run Windows on the host computer and linux on the clients(Pis) so that I continue using MATLAB?
Thanks for your help, any other advise and recommendations are welcome

Does your algorithm bottleneck on raw FMA / FLOPS throughput? If so then a cluster of weak ARM cores is more trouble than it's worth. I'd expect a used Zen2 machine, or maybe Haswell or Broadwell, could be good if you can find one cheaply. (You'd have to look at core counts, clocks, and FLOPS/$. And whether the problem would still not be memory bottlenecked on an older system with less memory bandwidth.)
If you bottleneck instead on cache misses from memory bandwidth or latency (e.g. cache-unfriendly data layout), there might possibly be something to gain from having more weaker CPUs each with their own memory controller and cache, even if those caches are smaller than your Intel.
Does Matlab use your GPU at all (e.g. via OpenCL)? Your current CPU's peak double (FP64) throughput from the IA cores is 96 GFLOPS, but its integrated GPU is capable of 115.2 GFLOPS. Or for single-precision, 460.8 GFLOPS GPU vs. 192 GFLOPS from your x86 cores. Again, theoretical max throughput, running 2x 256-bit SIMD FMA instructions per clock cycle per core on the CPU.
Upgrading to a beefy GPU could be vastly more effective than a cluster of RPi4. e.g. https://en.wikipedia.org/wiki/FLOPS#Hardware_costs shows that cost per single-precision GFLOP in 2017 was about 5 cents, adding big GPUs to a cheapo CPU. Or 79 cents per double-precision GFLOP.
If your problem is GPU-friendly but Matlab hasn't been using your GPU, look into that. Maybe Matlab has options, or you could use OpenCL from C++.
will there be a significant upgrade in computation for the ~$500 spent?
RPi4 model B has a Broadcom BCM2711 SoC. The CPU is Cortex-A72.
Their cache hierachy 32 KB data + 48 KB instruction L1 cache per core. 1MB shared L2 cache. That's weaker than your 4GHz i7 with 32k L1d + 256k L2 private per-core, and a shared 12MiB L3 cache. But faster cores waste more cycles for the same absolute time waiting for a cache miss, and the ARM chips run their DRAM at a competitive DDR4-2400.
RPi CPUs are not FP powerhouses. There's a large gap in the raw numbers, but with enough of them the throughput does add up.
https://en.wikipedia.org/wiki/FLOPS#FLOPs_per_cycle_for_various_processors shows that Cortex-A72 has peak FPU throughput of 2 double FLOPS per core per cycle, vs. 16 for Intel since Haswell, AMD since Zen2.
Dropping to single precision float improves x86 by a factor of 2, but A72 by a factor of 4. Apparently their SIMD units have lower throughput for FP64 instructions, as well as half the work per SIMD vector. (Some other ARM cores aren't extra slow for double, just the expected 2:1, like Cortex-A57 and A76.)
But all this is peak FLOPS throughput; coming close to that in real code is only achieved with well-tuned code with good computational intensity (lots of work each time the data is loaded into cache, and/or into registers). e.g. a dense matrix multiply is the classic example: O(n^3) FPU work over O(n^2) data, in a way that makes cache-blocking possible. Or Prime95 is another example.
Still, a rough back of the envelope calculation, being generous and assuming sustained non-turbo clocks for the Coffee Lake. (All 6 cores busy running 2x 256-bit FMA instructions per clock makes a lot of heat. That's literally what Prime95 does, so expect that level of power consumption if your code is that efficient.)
6 * 4GHz * 4 elements/vec * 2 vec/cycle = 48G FMAs / sec = 96 GFLOP/sec on the CFL
4 * 1.5GHz * 2 DP flops / clock = 12 GFLOP / sec per RPi.
With 5x RPi systems, that's 60 GFLOPS added to your existing 96 GFLOP.
Doesn't sound worth the trouble to manage 5 RPi systems for less than your existing total FP throughput. But again, if your problem has the right kind of parallelism, a GPU can run it much more efficiently. 60 GFLOPS for 500$ is not a good deal compared to ~50$ per 60 GFLOP from a high-end (in 2017) video card.
The GPU in an RPi might have some compute capability, but almost certainly not worth it compared to slapping a 500$ discrete GPU into your existing machine if your code is CPU-friendly.
Or your problem might not scale with theoretical max FLOPS, but instead perhaps with cache bandwidth or some other factor.
Is it possible to run Windows on the host computer and linux on the clients(Pis) so that I continue using MATLAB?
Zero clue; I'm only considering theoretical best case for efficient machine code running on these CPUs.

Why is MATLAB so slow on my Windows server?

I have Matlab R2017a installed on a server running MS Windows Server 2008 R2 Enterprise v 6.1 (SP1) and the benchmark results are awful:
bench
3.6424 0.5267 0.2114 5.0303 1.5557 3.4980
[columns = LU, FFT, ODE, Sparse, 2-D, 3-D]
Note that it is particularly slow for LU and Sparse.
The server has this hardware:
CPU: Intel Xeon E7320 # 2.13GHz (4 physical processors, 16 logical)
128 GB RAM
64-bit operating system
Matlab Version: 9.2.0.556344 (R2017a)
Java version: Java 1.7.0_60-b19 with Oracle corporation Java Hotspot(TM) 64-Bit Server VM mixed mode.
There are also other users that can be online on the server but I can see that they are not stressing the system and have verified that these running times are stable (have tested multiple times the past week.
My question is this: is there any other library or something that Matlab relies on that could be "wrong"? I have another similar setup on a similar but slightly newer server that gets bench results much closer to what I'd expect based on the specs. I'm wondering if it's using a "wrong" linear algebra module or something.
Alternative explanation I know that Matlab ran extremely slowly on a particular AMD Opteron CPU (I happen to also have worked on such a server in Matlab, link https://se.mathworks.com/matlabcentral/answers/33939-poor-matlab-performance-on-amd-based-computer). Is it possible that it's a similar issue with the Intel Xeon E7320?
Edit: Xeon E7320 as suggested by Peter.

Update: I'm not sure whether Matlab's bench takes advantage of just a single CPU core, multiple CPU cores, or also a GPU (OpenCL / CUDA). If it can use GPU acceleration, that would make a huge difference. (Especially if you don't have one at all in your "slow" server).
As discussed in comments, a dual-core Sandybridge laptop is 10x faster on some of the benchmarks, but only 2 or 1.5x faster on some other components. (But I'm not sure if the version of Matlab was controlled for; that thread you linked mentioned that different version of Matlab do a different amount of work in their bench).
The rest of this answer was written with the assumption that your test takes advantage of all your CPU cores (otherwise there's no point using an old many-core machine). But without considering GPU.
I think your CPU is actually a 65nm Core2-based Xeon E7320, not "E3720" (no google hits). What are you comparing against? Your Tigerton CPUs are ancient (about 10 years old), of course they're slow. (Tigerton is the same microarchitecture as Conroe/Merom, first-gen Core2).
You have very low memory bandwidth and cache speeds compared to a modern CPU, as well as only having SSSE3, not AVX or FMA. These CPUs don't have a memory controller built-in, so all 4 sockets are sharing the memory controller hub (MCH) via separate 1066MHz Front-Side Buses. Memory bandwidth doesn't scale with number of sockets, and is not great. Memory bandwidth has grown faster than per-core performance over the years. According to that link, a quad-socket 16-core Tigerton (like you have) is barely better than a quad-socket 8-core Barcelona Opteron. It's not so bad for CPU-bound workloads, but memory-bound workloads will do quite badly.
As well as the low clock speed, it's significantly slower clock-for-clock than a modern CPU. IDK what those times are supposed to be like (I'm here for the [performance] tag, not Matlab), but it's totally plausible that a 3GHz quad-core i5 or i7 Haswell / Skylake desktop or high-power laptop would be faster than your 16-core dinosaur machine.
(Actually, does that benchmark even scale with the number of cores? If not, the single-threaded memory bandwidth is probably really not good.)
A very big jump in performance happened with Sandybridge (for all code, including non-SIMD workloads), and there were several other smaller jumps in between your machine and modern CPUs as well. SnB can run 2 load instructions per clock, vs. 1 for previous Intel (like your Core2).
For FP-specific stuff that optimized libraries will take advantage of, x86 ISA extensions have been important: AVX doubles the SIMD vector width, doubling FLOPS (on Intel CPUs with full-width execution units). FMA does a mul+add in one instruction, potentially doubling FLOPS. Microarchitectural improvements are important, too: Haswell has two FMA units vs. earlier CPUs having one FP adder and one FP multiplier, again potentially doubling FLOPS. Only contiguous memory and high computation vs. memory workloads will fully take advantage of this, e.g. a dense matmul, but in that case one Haswell core is doing as much work as 8 Tigerton cores.
I assume Matlab can take advantage of AVX + FMA if the CPU has it.
And BTW, it's not just 16 "logical" processors. You don't have hyperthreading, so you have a 4-socket system with four quad-core CPUs, for 16 physical cores. (And these "quad core" chips are actually two separate dual-core dies in the same package, according to wikipedia.
So as far as the number of physical chips that need to communicate with each other, there are 8 (two in each package). That's a lot of hops to reach other CPUs, so synchronization between cores is more expensive than for a single-die quad-core. (And probably worse even than a modern dual-socket Xeon box with a pair of 18-core CPUs or something).
Note that high latency to memory can also hurt memory bandwidth: see the "latency bound platforms" part of this answer about optimizing memcpy/memset and how store bandwidth works in Intel CPUs.

can tensor flow using multi-cpu process? [duplicate]

This question already has an answer here:
Configuring Tensorflow to use all CPU's
(1 answer)
Closed 6 years ago.
I have a desktop computer with 4 cpu.
can I run tensorflow with 4 cpu so it is faster than single cpu ?
like a MPI programme ?
and can I use tflearn to implement it ?
thanks very much !

Yes tensorflow will take advantage of the multiple cores on your CPU. When you build it from source you can build it to only run on the CPU or be able to run on both the CPU and GPU. Right now I have tensorflow built to run on my GPU but if I want to place it on the CPU I stick the following at the top of my python code.
os.environ['CUDA_VISIBLE_DEVICES'] = ''
This just sets an environment variable which can be done through the terminal as well. When I run it with no cuda devices aviable it runs all cores of my CPU at 100%. I would imagine that if you were to put this at the top of your tflearn code it would force it onto the cpu as well.

MATLAB program simulation with the given processor requirements

I have a system with configuration intel(R) core(TM) i3-5020U CPU # 2.2 GHz,4GB RAM. But in order to compare the performance of my MATLAB program in terms of execution time, I need to execute it on a machine with configuration Intel(R) Core(TM) i5-3570 CPU # 3.40GHz, 16 GB RAM. Is there a way to perform this kind of simulation?

TL:DR: No. Performance differences between Broadwell and IvyBridge depend on lots of complicated details. (See Agner Fog's microarch pdf for the low-level microarchitectural details, and also other stuff in the x86 tag wiki)
It's likely that performance will scale with either clock speed or memory speed within maybe 10%, even between different microarchitectures, but it might not.
Using your own system, you can probably figure out how your code scales with CPU frequency, by forcing it to stay at minimum frequency for a test run. If it's a lot less than perfect scaling, then memory speed is a big factor. (The slower your CPU, the fewer cycles are spent waiting for memory.)
You can't extrapolate IvB i5 3.4GHz performance from BDW 2.2GHz performance without knowing a lot more details about exactly what your code bottlenecks on. It's possible that it bottlenecks on the same simple thing on both CPUs, in which case you could extrapolate. e.g. if it turns out that it bottlenecks on FP multiply latency, then run-time on IvB would be 5/3rds the run time on Broadwell (times the clock frequency ratio), since BDW has 3 cycle FP multiply and add, but SnB/IvB/Haswell have 5 cycle multiply. (FMA is 5 cycles on BDW, if I recall correctly. IvB doesn't support FMA, so if Matlab takes advantage of that on BDW, it's not even running the same machine code).
More likely, it's not that simple and cache / memory performance comes into it, too. Haswell/Broadwell don't have L1 cache-bank conflicts, but SnB/IvB do.
Depending on how you run the workload on the i5 CPU, it might or might not be able to turbo up to higher than its rated 3.4GHz, further confounding any attempt to guess at performance.

It's hard to tell with different computers to measure practical efficiency. That's why you usually use theoretical efficiency with Big-O, check the wiki page for algorithm efficiency and Big-O notation.
In the case you have access to both codes (yours, and the other guy's code), you can test them in the same computer with the methods for measuring performance proposed by mathworks, which are mainly time functions in real time and cpu time.
Lastly, you can see here several challenges about benchmarking that might be interesting to consider.

Tensorflow. Cifar10 Multi-gpu example performs worse with more gpus

I have to test the distributed version of tensorflow across multiple gpus.
I run the Cifar-10 multi-gpu example on an AWS g2.8x EC2 instance.
Running time for 2000 steps of the cifar10_multi_gpu_train.py (code here) was 427 seconds with 1 gpu (flag num_gpu=1). Afterwards the eval.py script returned precision # 1 = 0.537.
With the same example running for the same number of steps (with one step being executed in parallel across all gpus), but using 4 gpus (flag num_gpu=4) running time was about 530 seconds and the eval.py script returned only a slightly higher precision # 1 of 0.552 (maybe due to randomness in the computation?).
Why is the example performing worse with a higher number of gpus? I have used a very small number of steps for testing purposes and was expecting a much higher gain in precision using 4 gpus.
Did I miss something or made some basic mistakes?
Did someone else try the above example?
Thank you very much.

The cifar10 example uses variables on CPU by default, which is what you need for a multi-GPU architecture. You may achieve about 1.5x speed up compared to a single GPU setup with 2 GPUs.
Your problem has to do with the Dual GPU architecture for Nvidia Tesla K80. It has a PCIe switch to communicate both GPU cards internally. It shall introduce an overhead on communication. See block diagram:

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse