Degrading performance when increasing number of slaves [duplicate] - scala

I am doing a simple scaling test on Spark using a sort benchmark, from 1 core up to 8 cores. I notice that 8 cores is slower than 1 core.
//run spark using 1 core
spark-submit --master local[1] --class john.sort sort.jar data_800MB.txt data_800MB_output
//run spark using 8 cores
spark-submit --master local[8] --class john.sort sort.jar data_800MB.txt data_800MB_output
In each case, the input and output directories are in HDFS.
1 core: 80 secs
8 cores: 160 secs
I would expect the 8-core run to show some amount of speedup.

Theoretical limitations
I assume you are familiar with Amdahl's law, but here is a quick reminder. Theoretical speedup is defined as follows:

S(s) = 1 / ((1 - p) + p / s)

where:
s - is the speedup of the parallel part.
p - is the fraction of the program that can be parallelized.
In practice the theoretical speedup is always limited by the part that cannot be parallelized; even if p is relatively high (0.95), the theoretical limit is quite low: with p = 0.95 the speedup can never exceed 1 / (1 - 0.95) = 20, no matter how many processors you use.
(Figure: Amdahl's law speedup curves for various values of p, by Daniels220 at English Wikipedia, CC BY-SA 3.0.)
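To make the bound concrete, here is a small Scala sketch of the formula above (the helper name amdahl is just for illustration):

// Amdahl's law: S(s) = 1 / ((1 - p) + p / s)
def amdahl(p: Double, s: Double): Double = 1.0 / ((1 - p) + p / s)

// even with p = 0.95, eight-way parallelism gives nowhere near 8x,
// and the limit for infinitely many cores is 1 / (1 - p) = 20x
println(amdahl(0.95, 8))                       // ≈ 5.93
println(amdahl(0.95, Double.PositiveInfinity)) // ≈ 20.0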
Effectively this sets a theoretical bound on how fast you can get. You can expect p to be relatively high for embarrassingly parallel jobs, but I wouldn't dream of anything close to 0.95 or higher. This is because
Spark is a high-cost abstraction
Spark is designed to work on commodity hardware at datacenter scale. Its core design is focused on making the whole system robust and immune to hardware failures. That is a great feature when you work with hundreds of nodes
and execute long-running jobs, but it doesn't scale down very well.
Spark is not focused on parallel computing
In practice Spark and similar systems are focused on two problems:
Reducing overall IO latency by distributing IO operations between multiple nodes.
Increasing amount of available memory without increasing the cost per unit.
both of which are fundamental problems for large-scale, data-intensive systems.
Parallel processing is more a side effect of this particular solution than the main goal. Spark is distributed first, parallel second. The main point is to keep processing time constant with an increasing amount of data by scaling out, not to speed up existing computations.
With modern coprocessors and GPGPUs you can achieve much higher parallelism on a single machine than on a typical Spark cluster, but it doesn't necessarily help in data-intensive jobs, due to IO and memory limitations. The problem is how to load data fast enough, not how to process it.
Practical implications
Spark is not a replacement for multiprocessing or multithreading on a single machine.
Increasing parallelism on a single machine is unlikely to bring any improvement, and it will typically decrease performance due to the overhead of the components.
In this context:
Assuming that the class and jar are meaningful and it is indeed a sort, it is simply cheaper to read the data (single partition in, single partition out) and sort in memory on a single partition than to execute the whole Spark sorting machinery, with its shuffle files and data exchange.
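For comparison, this is roughly what the single-partition alternative amounts to: a plain-Scala sketch, no Spark involved (it assumes line-oriented text, as the file names in the question suggest, and a heap large enough to hold the data, e.g. -Xmx2g):

import java.io.PrintWriter
import scala.io.Source

// read everything, sort in memory, write it back out - no shuffle, no executors
object LocalSort extends App {
  val src = Source.fromFile("data_800MB.txt")
  val sorted = try src.getLines().toVector.sorted finally src.close()
  val out = new PrintWriter("data_800MB_output")
  try sorted.foreach(out.println) finally out.close()
}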

Related

Number of workers in Matlab's parfor

I am running a for loop using MATLAB's parfor function on an Intel i7-9750H (6 cores, 12 threads).
I set the preferred number of workers to 24. However, MATLAB caps this number at 6. Is the number of workers bounded by the number of cores, or by (number of cores) x (threads per core) = 6 x 2 = 12?
MATLAB prefers to limit the number of workers to the number of physical cores (six in your case).
Your CPU (Intel i7-9750H) has hyperthreading, i.e. it can run multiple (here 2) threads per core. However, this is of no use if you want to run the workers under full load, because there are simply no spare resources to switch to a different task (which is what the additional threads effectively are).
See the documentation.
Restricting to one worker per physical core ensures that each worker has exclusive access to a floating point unit, which generally optimizes performance of computational code. If your code is not computationally intensive, for example, it is input/output (I/O) intensive, then consider using up to two workers per physical core. Running too many workers on too few resources may impact performance and stability of your machine.
Note that MATLAB needs to stream data to every worker in order to run the distributed code. This is a kind of initialization overhead, and the reason why you won't be able to cut the runtime in half just by doubling the number of cores/workers. It is also why there is no point in MATLAB using hyperthreading: it would only increase the initial streaming effort without any speed-up; in fact, the core would probably force MATLAB to save intermediate results and switch to the other task from time to time... which is the same task as before ;)
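The same rule of thumb applies outside MATLAB. Here is a rough Scala sketch of capping a CPU-bound thread pool at the physical rather than the logical core count (the division by 2 is an assumption of two hardware threads per core, as on the i7-9750H above):

import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

object PhysicalCores extends App {
  // availableProcessors() reports logical cores (12 on an i7-9750H)
  val logical  = Runtime.getRuntime.availableProcessors()
  val physical = math.max(1, logical / 2) // assumes 2 hardware threads per core

  // fixed pool sized to physical cores for CPU-bound work
  implicit val ec: ExecutionContext =
    ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(physical))

  val results = Future.traverse((1 to 24).toList) { i =>
    Future((1L to 5000000L).sum * i) // CPU-bound dummy task
  }
  println(Await.result(results, Duration.Inf).sum)
}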

How to distribute MATLAB's parfor workers between GPUs AND CPUs (cores)?

I have a computation that has the structure of a binary tree, where at each node a bunch of highly vectorized functions take the output of the previous branches to produce new branch(es) (nodes on the same level are independent). Since the functions are vectorized, they run well both on CPU or GPU, the latter naturally giving substantially faster execution.
I will soon have access to a 4-GPU, 2-CPU workstation to run my code on, and I would like to use it as optimally as I can. I understand how to use parfor on the GPUs only or on the CPU cores only, but I would like to distribute the workload sensibly between the GPUs and the CPUs, since running on the GPUs alone leaves that many CPU cores idle, and even though they are much slower than the GPUs, they are still fast enough to have a noticeable impact on the total execution time.
(Q1) Since the functions in each node are vectorized, is it actually reasonable to run independent nodes in 1-node-per-core mode? Or does that strictly depend on the particular case? Is there a "rule of thumb" for such dilemmas?
(Q2) Assuming in (Q1) that simultaneous execution of 1 node per core is suboptimal, is there a way to assign several CPU cores to one worker?
(Q3) Is there a way to distribute the parfor workers between GPUs and CPUs in an efficient way?
Here is what I don't consider particularly efficient in (Q3): depending on the loop index, a loop instance could execute the GPU code on a given gpuDevice or on a CPU core. Knowing the performance difference between GPU and CPU execution, one can deduce a suitable proportion of indices to assign to CPU execution. The problem with this is that parfor does not pick the loop indices in any particular order, which in turn can easily lead to instances where it tries to execute two independent tasks on the same GPU, which is inefficient since it will have to serialize those tasks.
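For a rough sense of that proportion, here is a Scala sketch with made-up numbers (4 GPUs, 12 spare CPU cores, and a GPU running a task ~10x faster than one core are all assumptions for illustration):

// hypothetical figures: 4 GPUs, 12 spare CPU cores, each GPU ~10x one core
object WorkSplit extends App {
  val gpus = 4; val cpuCores = 12; val gpuSpeedup = 10.0
  val totalThroughput = gpus * gpuSpeedup + cpuCores * 1.0
  val cpuShare = cpuCores / totalThroughput
  // under these assumptions, roughly 23% of the loop indices go to CPU workers
  println(f"CPU share of indices: ${cpuShare * 100}%.1f%%")
}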
Thanks!

Is Scala doing anything in parallel on its own?

I have a little program that creates mazes. It uses lots of collections (the default variants, which are immutable, or at least are used as immutable).
The program calculates 30 mazes with increasing dimensions, using a for comprehension over (1 to 30).
Since the parallel collections framework became available with the latest versions, I thought I would give it a spin, hoping for some performance gain.
This failed and when I investigated a little, I found the following:
When run without any call to anything remotely parallel, it still showed a processor load of about 30% on each of the 4 cores of my machine.
When I replaced the Range 1 to 30 with (1 to 30).par, CPU load went up to about 80% on all cores (which I expected). The order in which the mazes completed became more or less random (which I expected). The total time for all mazes stayed the same.
Replacing some of the internally used collections with their parallel counterparts didn't seem to have an effect.
I now have 2 questions:
Why do I have all 4 cores spinning, although there isn't anything that runs in parallel?
What might be likely reasons for the program to still take the same time, whether running in parallel or not? There are no obvious bottlenecks other than CPU cycles (no IO, no network, plenty of memory via the -Xmx setting).
Any ideas on this?
The 30% per core version is just a poor scheduler (sounds like Windows 7) migrating the process from core to core very frequently. It's probably closer to 25% per core (1/4) for your process plus misc other load making 30%. If you run the same example under Linux you would probably see one core pegged.
When you converted to (1 to 30).par, you started really using threads across all cores but the synchronization overhead of distributing such a small amount of work and then collecting the results cancelled out the parallelism gains. You need to break your work into larger independent chunks.
EDIT: If each of 1..30 represents some larger amount of work (solving a maze, say) then automatic parallelization will work much better if each unit of work is about the same size. Imagine you had 29 easy mazes and one very, very hard maze. The hard maze will still run serially (or very nearly so) after everything else has finished. If your mazes increase in complexity by number, try spawning them in the order 30 to 1 by -1 so that the biggest tasks go first. Think of it as a braindead solution to the knapsack problem.
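A sketch of that reordering (solveMaze is a stand-in for the real maze computation; note that .par is built into pre-2.13 Scala and lives in the scala-parallel-collections module from 2.13 on):

// dummy CPU-bound work that grows with the maze index
def solveMaze(n: Int): Long = (1L to n * 500000L).sum

// naive order: the hardest maze (30) may be picked up last and run alone at the tail
(1 to 30).par.foreach(solveMaze)

// biggest-first: the hard maze starts immediately, small ones fill in around it
(30 to 1 by -1).par.foreach(solveMaze)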

Performance degrades, if number of threads is more than 2 on Xeon X5355

I have a strange problem, though it may not seem that strange to some of you.
I am writing an application using Boost threads, with Boost barriers to synchronize the threads. I have two machines on which to test the application.
Machine 1 is a Core 2 Duo (T8300) machine (Windows XP Professional, 4 GB RAM), where I get the following performance figures:
Number of threads :1 , TPS :21
Number of threads :2 , TPS :35 (66 % improvement)
Further increases in the number of threads decrease the TPS, but that is understandable as the machine has only two cores.
Machine 2 is a dual quad-core (Xeon X5355) machine (Windows Server 2003 with 4 GB RAM) and has 8 effective cores.
Number of threads :1 , TPS :21
Number of threads :2 , TPS :27 (28 % improvement)
Number of threads :4 , TPS :25
Number of threads :8 , TPS :24
As you can see, performance degrades after 2 threads (even though the machine has 8 cores). If the program had some bottleneck, it should have degraded at 2 threads as well.
Any ideas? Explanations? Does the OS have some role in performance? It seems the Core 2 Duo (2.4 GHz) scales better than the Xeon X5355 (2.66 GHz), even though the Xeon has the higher clock speed.
Thank you
-Zoolii
The clock speed and the operating system don't have as much to do with it as the way your code is written. Things to check might include:
Are you actually spinning up more than two threads at one time?
Do you have unnecessary synchronization artifacts in your code?
Are you synchronizing your code at the appropriate places?
What are your shareable resources and how many of them are there? If each of your transactions relies on a single section of code, native library, file, database, whatever, then it doesn't matter how many CPUs you've got.
One tool at your disposal when analyzing software bottlenecks is the simple thread dump. Taking a few dumps throughout the life of an execution of your software should expose bottlenecks in your software. You may be able to take that output and use it to reevaluate your code.
Adding more CPUs does not always equate to better performance; locking and contention can severely degrade performance. Factors to consider are:
Is your algorithm suited to parallelisation?
Are there any inherently sequential portions of code?
Can you partition the work into coarse-grained 'chunks'? Coarse-grained is usually better than fine-grained...
Can you alter your code to use less locking?
Synchronisation overheads can often be reduced by ensuring chunks of work are similarly sized.
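To illustrate the locking point, here is a rough Scala sketch comparing one heavily contended lock with per-thread accumulation (the workload and thread count are arbitrary; LongAdder is from java.util.concurrent):

import java.util.concurrent.atomic.LongAdder

object ContentionDemo extends App {
  val n = 1000000
  var shared = 0L
  val lock = new Object

  def time(label: String)(body: => Unit): Unit = {
    val t0 = System.nanoTime(); body
    println(f"$label: ${(System.nanoTime() - t0) / 1e6}%.0f ms")
  }

  // every increment fights for the same lock: more threads, more contention
  time("shared lock") {
    (1 to 8).par.foreach { _ =>
      var i = 0
      while (i < n) { lock.synchronized { shared += 1 }; i += 1 }
    }
  }

  // LongAdder keeps per-thread cells and merges only on read: far less contention
  val adder = new LongAdder
  time("per-thread cells") {
    (1 to 8).par.foreach { _ =>
      var i = 0
      while (i < n) { adder.increment(); i += 1 }
    }
  }
}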

Parallel programming on a Quad-Core and a VM?

I'm thinking of slowly picking up parallel programming. I've seen people use clusters with OpenMPI installed to learn this stuff. I do not have access to a cluster, but I have a quad-core machine. Will I be able to experience any benefit here? Also, if I'm running Linux inside a virtual machine, does it make sense to use OpenMPI inside the VM?
If your target is to learn, you don't need a cluster at all. Your quad-core (or any dual-core, or even a single-core) computer will be more than enough. The main point is to learn how to think "in parallel" and how to design your application accordingly.
Some important points are to:
Exploit different parallelism paradigms, like divide-and-conquer, master-worker, SPMD, ..., depending on the data and task dependencies of what you want to do.
Choose different data-division granularities to check the computation/communication ratio (in the case of message passing), or to check the amount of serial execution caused by mutual exclusion on memory regions.
With a quad-core you can measure the speedup of your approach (the performance gain attained through parallelization), which is normally given by dividing the time of the non-parallelized execution by the time of the parallel execution.
The closer you get to 4 (four cores meaning one quarter of the execution time), the better your parallelization strategy is (provided you can evenly distribute work and data).
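A minimal Scala sketch of that measurement (the workload is a made-up CPU-bound function; on an otherwise idle quad-core the printed figure would ideally approach 4):

object SpeedupDemo extends App {
  // made-up CPU-bound workload
  def work(n: Int): Double = (1 to 2000000).map(i => math.sqrt(i.toDouble * n)).sum

  def time(body: => Unit): Double = {
    val t0 = System.nanoTime(); body; (System.nanoTime() - t0) / 1e9
  }

  val tSerial   = time((1 to 16).foreach(work))
  val tParallel = time((1 to 16).par.foreach(work))

  // speedup = T_serial / T_parallel; 4.0 would be the ideal on four cores
  println(f"speedup = ${tSerial / tParallel}%.2f")
}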