Why don't we get an odd number of cores in a processor? - multicore

Why don't we have processors with an odd number of cores? Why don't manufacturers prefer an odd number of cores to an even number?

Related

Process performance difference under same CPU usage percentage

I am using a four-core Raspberry Pi for a small networking project, in which I need to create several worker processes on the same Raspberry Pi to speed up the task. I discovered that although top shows nearly the same CPU usage for each individual worker, namely 25% for one worker and 23% for each of four workers, the performance of a single process seems to be much lower when I have more workers (a single task takes 3 seconds with one worker, but 9 seconds with 4 workers). There is no communication between the workers, and they perform similar tasks independently. Can anyone explain why this could happen?

Number of workers in Matlab's parfor

I am running a for loop using MATLAB's parfor function. My CPU's specs are
I set the preferred number of workers to 24. However, MATLAB sets this number to 6. Is the number of workers bounded by the number of cores, or by (number of cores) x (number of processors) = 6 x 12?
MATLAB prefers to limit the number of workers to the number of cores (six in your case).
Your CPU (Intel i7-9750H) has hyperthreading, i.e. it can run multiple (here 2) threads per core. However, this is of no use if you want to run them under full load, because there are simply no resources available to switch to a different task (which is what the additional threads effectively are).
See the documentation.
Restricting to one worker per physical core ensures that each worker
has exclusive access to a floating point unit, which generally
optimizes performance of computational code. If your code is not
computationally intensive, for example, it is input/output (I/O)
intensive, then consider using up to two workers per physical core.
Running too many workers on too few resources may impact performance
and stability of your machine.
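The guidance quoted above can be turned into a small sizing rule. The Python sketch below is only an illustration: the function name suggested_workers and its parameters are made up here, and it is not how MATLAB itself picks the pool size.

```python
def suggested_workers(logical_cpus, threads_per_core=2, io_bound=False):
    """Hypothetical helper: suggest a worker count from the quoted rule.

    logical_cpus     -- hardware threads the OS reports (e.g. 12 on a
                        6-core CPU with hyperthreading)
    threads_per_core -- assumed SMT width (2 with hyperthreading, 1 without)
    io_bound         -- True if the workload mostly waits on I/O
    """
    if logical_cpus < 1 or threads_per_core < 1:
        raise ValueError("counts must be positive")
    physical_cores = max(1, logical_cpus // threads_per_core)
    # Compute-bound: one worker per physical core, so each has its own FPU.
    # I/O-bound: the quoted text allows up to two workers per physical core.
    return physical_cores * (2 if io_bound else 1)


# The i7-9750H from the question: 6 cores, 12 hardware threads.
print(suggested_workers(12))                 # 6
print(suggested_workers(12, io_bound=True))  # 12
```

For the CPU in the question this gives 6 workers for compute-bound code, which matches what MATLAB chose.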
Note that MATLAB needs to stream data to every core in order to run the distributed code. This is a kind of initialization effort, and it is the reason why you won't be able to cut the runtime in half if you double the number of cores/workers. It is also why MATLAB gains nothing from hyperthreading here. It would just increase the initial streaming effort without any speed-up. In fact, the core would probably force MATLAB to save intermediate results and switch to the other task from time to time... which is the same task as before ;)
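To make the trade-off concrete, here is a toy cost model in Python. It assumes, purely for illustration, that the compute part divides evenly across workers and that the data-streaming cost is paid once per worker, one after another. The function name and all numbers are hypothetical, not measurements of MATLAB.

```python
def estimated_runtime(work_seconds, workers, setup_seconds_per_worker):
    """Toy model: evenly split compute time plus a per-worker setup cost."""
    if workers < 1:
        raise ValueError("workers must be at least 1")
    compute = work_seconds / workers               # ideal parallel part
    setup = setup_seconds_per_worker * workers     # grows with pool size
    return compute + setup


# Hypothetical numbers: 60 s of compute, 2 s to ship data to each worker.
for n in (1, 2, 6, 12):
    print(n, estimated_runtime(60.0, n, 2.0))
```

With these made-up numbers, going from 6 to 12 workers makes the total longer (22 s versus 29 s), even though the model generously assumes that the extra hyperthreads compute at full speed.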

Differences between current gen Xeon Processors

What are the actual differences between the Xeon W series and the Bronze, Silver, Gold and Platinum series?
With earlier versions of Xeons, the E3s were single-socket CPUs, whereas E5s could be used in motherboards with two sockets. The E7s supported quad-socket boards (and probably 8 sockets too).
However, with the current-generation Xeons, most of the lineup has a scalability of 2S (2 processors in one motherboard).
If Xeon Silver and Xeon Platinum can both be used in a dual-socket motherboard, why would I need a Platinum processor, which is at least 5x more expensive than the Silver part? Unless there are other differences.
What are the differences between the current-gen Xeon processors? I see some differences in cache size. Other than that, I couldn't find anything else.
Gold/Platinum has more cores per socket, and/or higher base or turbo clocks. That's most of what you're paying for.
The extra UPI links that let them work in 4S or higher systems aren't relevant when being used in a 2 socket system, but that's not the only feature. Presumably it's only a small part of the cost. With the change from inclusive L3 cache to non-inclusive, Skylake Xeon and later already need a snoop filter separate from L3 tags even for single-socket, unlike Xeon E5 which just broadcast everything to the other socket. Presumably Xeon-SP's snoop filter can work for filtering snoops to the other socket as well so it didn't need to be a separate feature for 1S vs. 2S.
e.g. the top-end 2nd-gen (Cascade Lake) Intel® Xeon® Platinum 9282 Processor has 56 cores (112 threads), max turbo = 3.8 GHz, base clock = 2.6 GHz, and 77 MB of L3 Cache.
The top-end Silver is Intel® Xeon® Silver 4216: 16c/32t 3.2 GHz turbo, 2.10 GHz base, 22 MB L3 cache.
Despite having 3.5x the cores, sustained and peak turbo clocks are higher on the Platinum. (With a 400W TDP, vs. 100W for the Silver! Less-insane Platinum chips are lower TDP, e.g. a 32c/64t with 2.3GHz base / 3.7GHz turbo is 250W TDP).
Also, some (all?) Silver / Bronze CPUs only have one AVX512 FMA execution unit so throughput for 512-bit SIMD FP math instructions is reduced, including all FP math and int<->FP conversions, and _mm512_lzcnt_epi32. Look for the # of AVX-512 FMA Unit line on the Ark page for a specific CPU. For integer SIMD, only multiply is affected. (In hardware, SIMD integer multiply uops run on the FMA units.) Shifts, blends, shuffles, add/sub, compare, and boolean all have separate vector ALUs which are 512 bits wide and don't take as much die area as multipliers.
Even that top-end Silver 4216 Cascade Lake has only 1 512-bit FMA unit.
Running AVX2 code there's zero difference. Even AVX512 using only 256-bit vectors is fine. (gcc -march=skylake-avx512 defaults to -mprefer-vector-width=256 because using 512-bit vectors at all reduces max turbo temporarily. It wants to avoid the case where one unimportant 512-bit-vectorized loop gimps the clock speed for the rest of the program that spends most of its time in scalar code.)
But if you're doing heavy AVX-512 FP number crunching you probably want a CPU with 2 FMA units and to compile with 512-bit vectors.
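As a rough illustration of what the FMA unit count means for peak throughput, the Python sketch below counts floating-point operations per core per clock cycle. It assumes each FMA counts as 2 operations per lane and that AVX2 code has two 256-bit FMA units available. It ignores clock-speed changes, memory bandwidth and everything else that limits real code, and the function name is made up for this example.

```python
def peak_flops_per_cycle(fma_units, vector_bits, element_bits=64):
    """Peak FP operations per core per cycle, counting one FMA as 2 ops."""
    lanes = vector_bits // element_bits  # elements processed per instruction
    return fma_units * lanes * 2


# Double precision (64-bit elements), per core:
print(peak_flops_per_cycle(2, 256))  # 16: 256-bit vectors, two FMA units
print(peak_flops_per_cycle(1, 512))  # 16: 512-bit vectors, one FMA unit
print(peak_flops_per_cycle(2, 512))  # 32: 512-bit vectors, two FMA units
```

Under these assumptions a chip with one 512-bit FMA unit has the same peak per cycle with 512-bit vectors as with 256-bit ones, and only the two-unit parts double it.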
IDK why you tagged this Xeon Phi; that's a totally different microarchitecture.

What is the maximum memory per worker that MATLAB can address?

Short version: Is there a maximum amount of RAM / worker, that MATLAB can address?
Long version: My wife uses MATLAB's parallel processing capabilities in data-heavy spatial analyses (I don't really understand it, I just know how to build computers that make her work quicker) and I would like to build a new computer so she can radically reduce her process times.
I am leaning toward something in the 10-16 core range, since prices on such processors seem to be dropping with each generation and I would like to use 128 GB of RAM, because 'why not' if you can stomach the cost and see some meaningful time savings?
The number of cores I shoot for will depend on the maximum amount of RAM that MATLAB can address for each worker (if such a limit exists). The computer I built for similar work in 2013 has 4 physical cores (Intel i7-3770k) and 32 GB RAM (which she maxed out), and whatever I build next, I would like to have at least the same memory/core. With 128 GB of RAM a given, 10 cores would be 12.8 GB/core, 12 cores would be ~10.7 GB/core and 16 cores would be 8 GB/core. I am inclined to maximize cores rather than memory, but since she doesn't know what will benefit her processes the most, I would like to know how realistic those three options are. As for your next question, she has an NVIDIA GPU capable of parallel processing, but she believes her processes would not benefit from its CUDA cores.
Thank you for your insights. Many, many Google searches did not yield an answer.
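For reference, the memory-per-core arithmetic in the question can be checked with a few lines of Python. This only divides RAM by core count; it says nothing about whether MATLAB imposes a per-worker limit. The function name is made up for this example.

```python
def gb_per_core(total_ram_gb, cores):
    """Memory per core, assuming RAM is split evenly (a simplification)."""
    return total_ram_gb / cores


# (description, RAM in GB, physical cores) taken from the question.
configs = [
    ("2013 build", 32, 4),
    ("10 cores", 128, 10),
    ("12 cores", 128, 12),
    ("16 cores", 128, 16),
]
for name, ram, cores in configs:
    print(f"{name}: {gb_per_core(ram, cores):.1f} GB/core")
```

All three 128 GB options meet or exceed the 8 GB/core of the 2013 machine. The 16-core option only matches it.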

Performance degrades, if number of threads is more than 2 on Xeon X5355

I have a strange problem, though it may not seem that strange to some of you.
I am writing an application using boost threads and using boost barriers to synchronize the threads. I have two machines to test the application.
Machine 1 is a Core 2 Duo (T8300) machine (Windows XP Professional, 4 GB RAM) where I am getting the following performance figures:
Number of threads: 1, TPS: 21
Number of threads: 2, TPS: 35 (66% improvement)
A further increase in the number of threads decreases the TPS, but that is understandable as the machine has only two cores.
Machine 2 is a machine with two quad-core Xeon X5355 CPUs (Windows Server 2003, 4 GB RAM), so it has 8 effective cores.
Number of threads: 1, TPS: 21
Number of threads: 2, TPS: 27 (28% improvement)
Number of threads: 4, TPS: 25
Number of threads: 8, TPS: 24
As you can see, performance degrades after 2 threads (though the machine has 8 cores). If the program had some bottleneck, it should have degraded with 2 threads as well.
Any ideas or explanations? Does the OS play some role in performance? It seems like the Core 2 Duo (2.4 GHz) scales better than the Xeon X5355 (2.66 GHz), even though the Xeon has the higher clock speed.
Thank you
-Zoolii
The clock speed and the operating system don't have as much to do with it as the way your code is written. Things to check might include:
Are you actually spinning up more than two threads at one time?
Do you have unnecessary synchronization artifacts in your code?
Are you synchronizing your code at the appropriate places?
What is your shareable resource and how many of them are there? If each of your transactions relies on a single section of code, native library, file, database, whatever, then it doesn't matter how many CPUs you've got.
One tool at your disposal when analyzing software bottlenecks is the simple thread dump. Taking a few dumps throughout the life of an execution of your software should expose bottlenecks in your software. You may be able to take that output and use it to reevaluate your code.
Adding more CPUs does not always equate to better performance; locking and contention can severely degrade performance. Factors to consider are:
Is your algorithm suited to parallelisation?
Any inherently sequential portions of code?
Can you partition work into coarse-grained 'chunks'? Coarse is usually better than fine-grained...
Can you alter your code to use less locking?
Synchronisation overheads can often be reduced by ensuring chunks of work are similarly sized.
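One way to reason about the "inherently sequential portions" point is Amdahl's law. The Python sketch below fits the parallel fraction to the 2-thread Xeon measurement from the question (21 TPS to 27 TPS) and then predicts TPS for more threads. Amdahl's law is brought in here as an illustrative model, not something the original answers used, and it deliberately ignores locking and barrier overhead.

```python
def parallel_fraction(speedup, threads):
    """Invert Amdahl's law: the fraction p of work that ran in parallel."""
    if threads < 2 or speedup <= 0:
        raise ValueError("need at least 2 threads and a positive speedup")
    return (1.0 - 1.0 / speedup) / (1.0 - 1.0 / threads)


def amdahl_speedup(p, threads):
    """Amdahl's law: speedup when a fraction p scales perfectly."""
    return 1.0 / ((1.0 - p) + p / threads)


baseline_tps = 21.0
# Fit p to the Xeon's measured 2-thread result (27 TPS).
p = parallel_fraction(27.0 / baseline_tps, 2)
print("parallel fraction:", round(p, 3))
for n in (2, 4, 8):
    print(n, "threads ->", round(baseline_tps * amdahl_speedup(p, n), 1), "TPS")
```

Under this model TPS would keep rising slowly with more threads (roughly 31.5 at 4 threads and 34.4 at 8), so the measured drop to 25 and 24 TPS points to a cost that grows with thread count, such as contention or barrier synchronisation, rather than to a fixed sequential portion alone.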
Based on experience, it could be that Intel's policy allows only 2 threads or dual-process operation on that processor; that only pthreads can be used with that version of the operating system; that the two processors were designed to conform to different rules with different provisions or allowances; that the application's own thread processing is not allowed; or that threads beyond some number n are being backed out by the processor, and the processing of the error messages reporting this is slowing down the throughput of the two cores and may lead to the deactivation of cores 3 and 4.