Why is this not faster using parallel collections? (Scala)

I just wanted to test the parallel collections a bit and I used the following line of code (in REPL):
(1 to 100000).par.filter(BigInt(_).isProbablePrime(100))
against:
(1 to 100000).filter(BigInt(_).isProbablePrime(100))
But the parallel version is not faster. In fact, it even feels a bit slower (though I haven't actually measured that).
Does anyone have an explanation for that?
Edit 1: Yes, I do have a multi-core processor
Edit 2: OK, I "solved" the problem myself. The implementation of isProbablePrime seems to be the problem, not the parallel collections. I replaced isProbablePrime with another function that tests for primality, and now I get the expected speedup.
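For illustration, here is a minimal sketch of the kind of replacement meant in Edit 2. The isPrime below is a hypothetical trial-division test, not necessarily the function the poster actually used:

// Hypothetical replacement for BigInt.isProbablePrime: plain trial division.
def isPrime(n: Int): Boolean =
  n > 1 && (2 to math.sqrt(n).toInt).forall(n % _ != 0)

(1 to 100000).filter(isPrime)      // sequential
(1 to 100000).par.filter(isPrime)  // parallel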

Both with sequential and parallel ranges, filter will generate a vector data structure - a Vector or a ParVector, respectively.
This is a known problem with parallel vectors that get generated from range collections - transformer methods (such as filter) for parallel vectors do not construct the vector in parallel.
A solution that allows efficient parallel construction of vectors has already been designed, but has not yet been implemented. I suggest you file a ticket so that it can be fixed for the next release.
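A quick way to observe the result types described in this answer from the REPL (Scala 2.x, where .par is part of the standard library; on 2.13+ it requires the separate scala-parallel-collections module):

(1 to 100000).filter(_ % 2 == 0)      // sequential: the result is backed by a Vector
(1 to 100000).par.filter(_ % 2 == 0)  // parallel: the result is a ParVector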

Related

Scala : How to speed up parallel synchronous processing on a small list?

I see that parallel collections are mostly designed to speed up the processing of large collections, but it is not mentioned whether they can be helpful for small lists.
Check this example:
List(1,2,3).map(loadHeavyFile(_))
List(1,2,3).par.map(loadHeavyFile(_)).toList
What I want to know here is:
Is one faster than another?
Will Scala use multiple threads if the list only has 3 elements?
In a general way, is it possible to speed up the response time of this code?
I know I can use Future, then Future.sequence, and wait for the results to come in, but this seems unnatural when loadHeavyFile is a synchronous call; I don't want to have to specify a timeout, for example.
Note: I want to preserve list ordering.
Any guidance here is appreciated.
The parallel collections documentation, in Measuring Performance - How big should a collection be to go parallel?, notes some parameters that are useful for estimating and deciding whether a parallel version of a collection may outperform the sequential one.
For such a small list, the overhead of creating the parallel version may be larger than the actual sequential mapping; in this example, however, the focus lies on I/O performance and possibly on memory management, should the involved files be large and loaded into memory.
To provide a more factual answer, consider benchmarking the sequential and parallel versions under different I/O loads in order to roughly estimate the threshold at which the parallel version becomes worth it. The I/O factor makes the estimate OS- and hardware-dependent, and it may prove hard to generalise.
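For example, a rough benchmarking sketch along these lines in Scala - loadHeavyFile is stubbed with a sleep here to simulate a slow synchronous call, so the timings are indicative only:

import scala.collection.parallel.CollectionConverters._  // only needed on Scala 2.13+

def time[A](label: String)(body: => A): A = {
  val start = System.nanoTime()
  val result = body
  println(f"$label: ${(System.nanoTime() - start) / 1e6}%.1f ms")
  result
}

// Stub standing in for the poster's synchronous, I/O-bound call.
def loadHeavyFile(id: Int): String = { Thread.sleep(500); s"contents-$id" }

val ids = List(1, 2, 3)
time("sequential") { ids.map(loadHeavyFile) }
time("parallel")   { ids.par.map(loadHeavyFile).toList }

Note that .par.map preserves the order of results; only the execution order is nondeterministic, so the poster's ordering requirement is not a problem.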

Is QuickSort really the fastest sorting technique?

Hello all, this is my very first question here. I am new to data structures and algorithms; my teacher asked me to compare the time complexity of different algorithms, including merge sort, heap sort, insertion sort, and quick sort. I searched the internet and found that quick sort is the fastest of all, but my version of quick sort is the slowest of all (it sorts 100 random integers in almost 1 second, while my other sorting algorithms take almost 0 seconds). I tweaked my quick sort logic many times (taking the first value as pivot, then trying the middle value as pivot, but in vain). I finally looked up code on the internet, and there was not much difference between my code and the code online. Now I am really confused: is this behaviour of quick sort natural (I mean, whatever your logic is, you are going to get the same results), or are there some specific situations where you should use quick sort? In the end, I know my question is not clear (I don't know how to ask it better; besides, my English is not very good). I hope someone can help me. I really wanted to attach a picture of the awkward result I am getting, but I can't (reputation < 10).
Theoretically, quicksort is supposed to be one of the fastest sorting algorithms, with an average runtime of O(n log n). Its worst case is O(n^2), but it only occurs when the pivot choices repeatedly split the array very unevenly - for example, when many values are equal to the pivot, or when the pivot is always the smallest or largest element.
In your situation, I can only assume that your pivot value is not ideal for your data array, even though the values still end up sorted. Otherwise, your quicksort implementation is unfortunately incorrect.
Quicksort has O(n^2) worst-case runtime and O(n log n) average-case runtime. A good reason why quicksort is so fast in practice compared to most other O(n log n) algorithms, such as heapsort, is that it is relatively cache-efficient: its running time is actually O((n/B) log(n/B)), where B is the block size. Heapsort, on the other hand, doesn't have any such speedup: it doesn't access memory cache-efficiently at all.
The value you choose as pivot may not be appropriate, hence your sorting may be taking some time. You can avoid quicksort's worst-case runtime of O(n^2) almost entirely by using an appropriate choice of pivot – such as picking it at random.
Also, the best and worst cases are often extremes that rarely occur in practice. Any average-case analysis assumes some distribution of inputs; for sorting, the typical choice is the random permutation model (as assumed on Wikipedia).
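To illustrate the random-pivot idea, here is a minimal (non-in-place) quicksort sketch in Scala; the poster's code was not shown, so this is just the general technique, with a three-way split that also handles many values equal to the pivot:

import scala.util.Random

def quicksort(xs: List[Int]): List[Int] = xs match {
  case Nil => Nil
  case _ =>
    // A random pivot makes the O(n^2) worst case very unlikely.
    val pivot = xs(Random.nextInt(xs.length))
    val (smaller, rest) = xs.partition(_ < pivot)
    val (equal, larger) = rest.partition(_ == pivot)
    quicksort(smaller) ::: equal ::: quicksort(larger)
}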

How to use parallel 'for' loop in Octave or Scilab?

I have two for loops running in my Matlab code. The inner loop is parallelized using Matlabpool on 12 processors (the maximum Matlab allows on a single machine).
I don't have a Distributed Computing license. Please help me do this using Octave or Scilab; I just want to parallelize the 'for' loop ONLY.
The links I found while searching for this on Google are broken.
parfor is not really implemented in Octave yet. The keyword is accepted, but it is a mere synonym of for (http://octave.1599824.n4.nabble.com/Parfor-td4630575.html).
The pararrayfun and parcellfun functions of the parallel package are handy on multicore machines.
They are often a good replacement for a parfor loop.
For examples, see
http://wiki.octave.org/Parallel_package.
To install, issue (just once):
pkg install -forge parallel
Then, once in each session, run
pkg load parallel
before using the functions.
In Scilab you can use parallel_run:
function a=g(arg1)
  a = arg1*arg1   // square the input
endfunction
res = parallel_run(1:10, g);   // applies g to each element of 1:10, possibly in parallel
Limitations:
- it uses only one core on Windows platforms;
- for now, parallel_run only handles arguments and results that are scalar matrices of real values, and the types argument is not used;
- one should not rely on side effects such as modifying variables from an outer scope: only the data stored in the result variables will be copied back into the calling environment;
- macros called by parallel_run are not allowed to use the JVM;
- no stack resizing (via gstacksize() or via stacksize()) should take place during a call to parallel_run.
In GNU Octave you can use the parfor construct:
parfor i = 1:10
  # do stuff that may run in parallel
endparfor
For more info: help parfor
To see a list of Free and Open Source alternatives to MATLAB-SIMULINK, please check its AlternativeTo page or my answer here. Specifically for SIMULINK alternatives, see this post.
Something you should consider is the difference between vectorized, parallel, concurrent, asynchronous, and multithreaded computing. Without going much into the details: vectorized programming is a way to avoid ugly for-loops; for example, the map function and list comprehensions in Python are vectorized computation. It is about the way you write the code, not necessarily how it is handled by the computer. Parallel computation, mostly used for GPU computing (data parallelism), is when you run a massive amount of arithmetic on big arrays, using GPU computational units. There is also task parallelism, which mostly refers to running a task on multiple threads, each processed by a separate CPU core. Concurrent or asynchronous is when you have just one computational unit, but it does multiple jobs at the same time without blocking the processor unconditionally - basically like a mom cooking, cleaning, and taking care of her kid at the same time, but doing only one job at a time :)
Given the above description, there is a lot in the FOSS world for each one of these. For Scilab specifically, check this page. There is an MPI interface for distributed computation (multithreading/parallelism on multiple computers), there are OpenCL interfaces for GPU/data-parallel computation, and there is an OpenMP interface for multithreading/task-parallelism. The feval function is not parallelism but a way to vectorize a conventional function. Scilab matrix arithmetic and parallel_run are vectorized or parallel depending on the platform, hardware, and version of Scilab.
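These categories are easiest to tell apart side by side in code. A small sketch in Scala (used here only for compactness; the concepts carry over to Scilab and Matlab, and .par assumes the parallel collections are available, as in Scala 2.12 or via scala-parallel-collections):

import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global

val xs = Vector(1, 2, 3, 4)

// "Vectorized" style: one collection-level operation instead of an explicit loop.
val squares = xs.map(x => x * x)

// Data parallelism: the same operation, split across CPU cores.
val squaresPar = xs.par.map(x => x * x)

// Asynchronous/concurrent: the work runs without blocking the caller...
val eventually = Future { xs.map(x => x * x) }
// ...until the result is explicitly awaited.
val result = Await.result(eventually, 1.second)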

Fast DP in Matlab (Viterbi for profile HMMs)

I've got efficiency problems with the Viterbi log-odds computation in Matlab.
Basically, my problem is that the nested loops are mandatory here, and they slow the code down a lot. This is the expensive part:
for i=1:input_len
  for j=1:num_states
    v_m = emission_value + max_over_3_elements; %V_M
    v_i = max_over_2_elements; %V_I
    v_d = max_over_2_elements; %V_D
  end
end
I believe I'm not the first to implement Viterbi for profile HMMs, so maybe you've got some advice. I also took a look at Matlab's own hmmviterbi, but there were no revelations (it also uses nested loops). I also tried replacing max with some primitive operations, but there was no noticeable difference (it was actually a little slower).
Unfortunately, loops are just slow in Matlab (it gets better with more recent versions, though), and I don't think this can easily be vectorized/parallelized, as the operations inside the loops are not independent of other iterations.
This seems like a task for MEX - it should not be too much work to write this in C and the expected speedup is probably quite large.

Understanding MATLAB on multiple cores, multiple processors and MPI

I have several closely related questions about how MATLAB takes advantage of parallel hardware. They are short, so I thought it would be best to put them in the same post:
Does MATLAB leverage/benefit from multiple cores when not using the Parallel Computing Toolbox?
Does MATLAB leverage/benefit from multiple processors when not using the PCT?
Does MATLAB use MPI when not using the PCT?
Does MATLAB use MPI when using the PCT?
Does MATLAB leverage/benefit from multiple cores when not using the Parallel Computing Toolbox?
Yes. Since R2007a, more and more built-in functions have been re-written to be multi-threaded (though multi-threading will only kick in if it's beneficial).
Element Wise Functions and Expressions:
------------------------------------------------------------------------------------------------
Functions that speed up for double arrays > 20k elements
1) Trigonometric: ACOS(x), ACOSH(x), ASIN(x), ASINH(x), ATAN(x), ATAND(x), ATANH(x), COS(x), COSH(x), SIN(x), SINH(x), TAN(x), TANH(x)
2) Exponential: EXP(x), POW2(x), SQRT(x)
3) Operators: x.^y
For example: 3*x.^3 + 2*x.^2 + 4*x + 6, sqrt(tan(x).*sin(x).*3+8);
Functions that speed up for double arrays > 200k elements
4) Trigonometric: HYPOT(x,y), TAND(x)
5) Complex: ABS(x)
6) Rounding and remainder: UNWRAP(x), CEIL(x), FIX(x), FLOOR(x), MOD(x,N), ROUND(x)
7) Basic and array operations: LOGICAL(X), ISINF(X), ISNAN(X), INT8(X), INT16(X), INT32(X)
Linear Algebra Functions:
------------------------------------------------------------------------------------------------
Functions that speed up for double arrays > 40k elements (200 square)
1) Operators: X*Y (Matrix Multiply), X^N (Matrix Power)
2) Reduction Operations: MAX and MIN (Three Input), PROD, SUM
3) Matrix Analysis: DET(X), RCOND(X), HESS(X), EXPM(X)
4) Linear Equations: INV(X), LSCOV(X,x), LINSOLVE(X,Y), A\b (backslash)
5) Matrix Factorizations: LU(X), QR(X) for sparse matrix inputs
6) Other Operations: FFT and IFFT of multiple columns of data, FFTN, IFFTN, SORT, BSXFUN, GAMMA, GAMMALN, ERF, ERFC, ERFCX, ERFINV, ERFCINV, FILTER
For code implemented as .m files, though, multiple cores won't help.
Multi-threaded mex-files will benefit as well, of course.
Does MATLAB use MPI when not using the PCT?
Not to my knowledge.
Does MATLAB use MPI when using the PCT?
Yes, when you run it on a cluster (though you can use other schedulers as well). To do this, you need a license for the Matlab Distributed Computing Server. I don't know what architecture the local scheduler uses (the one you use when you run parallel jobs on a local machine), but the fact that the MPI functions are part of the PCT suggests that it may be used for at least part of the functionality.
EDIT: See @Edric's answer for more details.
To clarify and expand on a couple of points from @Jonas' detailed answer:
PCT uses a build of MPICH2 (this is not shipped with base MATLAB).
MPI functions are available under the local scheduler - in this case, the build of MPICH2 can take advantage of shared memory for communication.
The labSend/labReceive family of functions present a wrapper around MPI_Send/MPI_Recv etc.
When not using the PCT, MATLAB issues only one command at a time (single-threaded).
However, if you have a multi-threaded BLAS, you could still benefit from extra cores (and it doesn't particularly matter whether they're all in a single processor or not).
MEX files can also be written with multiple threads, in which case you will use multiple cores even without the PCT. If you have performance problems, rewriting some of the hotspots as MEX is often a big win.
First, the answers are mostly "No, but...", as @BenVoigt has addressed. The "but..." part comes from the libraries used by Matlab. The most notable example, given by Ben, is BLAS, which you can replace with a version that supports multiple cores or processors, such as ATLAS, the Intel or AMD implementations, Goto BLAS, or some other options.
You can also call out from Matlab to code in other languages, which can leverage multiple cores, processors, computers, etc. In the past, I have called R from Matlab, and have made use of multiple cores in this way, by taking advantage of R packages that support multicore processing. The same could be done with MPI. However, as you scale, you'll discover that more and more of your code ends up being in the language that can do more parallel or distributed work (i.e. a free language like R, Python, C, C++, or Java), rather than in Matlab.
So, does Matlab benefit from such infrastructure without PCT? Not directly. Can your code in Matlab benefit from such infrastructure via various supporting libraries? Yes.
When not using PCT, MATLAB uses only one core/one processor.
I don't know the answers to the 3rd and 4th questions.