Dijkstra with adjacency list and min-heap as queue vs adjacency matrix and a plain array as "queue"

Which one should be faster? I have written a script comparing the run times of both. Initially the implementation with the adjacency list and min-heap performs faster, but as the number of nodes/edges increases, the latter seems to perform faster. Is this result expected?
With 500 nodes, my script says the latter performs faster by almost 30 ms.
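For reference, the textbook cost comparison (assuming a binary min-heap and a simple linear scan for the minimum) is roughly:

O((V + E) log V) for the adjacency list + min-heap version
O(V^2) for the adjacency matrix + plain array version

so on dense graphs, where E approaches V^2, the heap version tends toward O(V^2 log V) and the plain array scan can plausibly win; where the crossover happens depends on the constants in the two implementations and on how dense the generated test graphs are.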

Related

Converting all variables into gpuArrays doesn't speed up computation

I'm writing a simulation in MATLAB where I use CUDA acceleration.
Suppose we have vectors x and y, a matrix A, and scalar variables dt, dx, a, b, c.
What I found was that by converting x, y, and A to gpuArray() before running the iteration and the built-in functions, the iteration could be accelerated significantly.
However, when I also put variables like dt, dx, a, b, c into gpuArray(), the program slowed down significantly, by more than 30% (the time increased from 7 s to 11 s).
Why is it not a good idea to put all of the variables into gpuArray()?
(Side note: those scalars were only ever multiplied with x, y, and A; they were never used on their own during the iteration.)
GPU hardware is optimised for working on relatively large amounts of data. You only really see the benefit of GPU computing when you can feed the many processing cores lots of data to keep them busy. Typically this means you need operations working on thousands or millions of elements.
The overheads of launching operations on the GPU dwarf the computation time when you're dealing with scalar quantities, so it is no surprise that they are slower than on the CPU. (This is not peculiar to MATLAB & gpuArray).
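As a minimal sketch of that split (the sizes, the update rule, and nSteps below are made up for illustration), the large arrays go on the GPU while the scalars stay as ordinary CPU doubles:
n = 4000;                          % placeholder problem size
A = rand(n); x = rand(n, 1); y = rand(n, 1);
dt = 1e-3; a = 0.5; b = 0.1;       % scalars stay on the CPU
nSteps = 100;                      % placeholder iteration count
Ag = gpuArray(A); xg = gpuArray(x); yg = gpuArray(y);
for k = 1:nSteps
    % CPU scalars can participate in gpuArray arithmetic directly
    xg = xg + dt * (Ag * yg) + a * xg;
    yg = yg + dt * b * xg;
end
x = gather(xg); y = gather(yg);    % copy the results back to host memory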

Parallel Optimization in Matlab: Gradient or Loop

I am optimizing a rather messy likelihood function in Matlab, where I need to run about 1,000 separate runs of the optimization algorithm (fmincon) from different initial points; there are something like 32 free parameters.
Unfortunately I cannot parallelize both the 1,000 runs of the optimization algorithm and the computation of the finite-difference gradient at the same time; I must choose one.
Does anyone know whether it's more efficient to parallelize the outer loop, so each optimization runs on its own core, or to parallelize the finite-difference gradient computation?
Thanks!
This is impossible to answer exactly without knowing anything about your code and/or hardware.
If you have more than 32 cores, then some of them will have nothing to do during parallel gradient computation. In this case, running the 1000 simulations in parallel might be faster.
On the other hand, computing the gradients in parallel might enable your CPU(s) to use their caches more efficiently, in that there will be fewer cache misses. You may have a look at Why does the order of the loops affect performance when iterating over a 2D array? or What is “cache-friendly” code?.
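As a sketch of the first option (parallelizing over the start points and leaving fmincon's finite differences serial); the objective, bounds, and start points below are placeholders:
nStarts = 1000; nParams = 32;
startPoints = rand(nStarts, nParams);                  % placeholder initial points
lb = zeros(1, nParams); ub = ones(1, nParams);         % placeholder bounds
objFun = @(p) sum(p.^2);                               % stand-in for the real likelihood
opts = optimoptions('fmincon', 'UseParallel', false);  % keep the gradient serial
xBest = zeros(nStarts, nParams); fBest = zeros(nStarts, 1);
parfor k = 1:nStarts
    [xk, fk] = fmincon(objFun, startPoints(k, :), ...
        [], [], [], [], lb, ub, [], opts);
    xBest(k, :) = xk;                                  % sliced outputs
    fBest(k) = fk;
end
The other option would be a plain for loop over the start points with 'UseParallel' set to true, so that fmincon distributes the finite-difference evaluations instead.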

Large and Sparse Matrix Multiplication

I have a very large, sparse matrix of size 180 GB (text, 30k * 3M) containing only the entries and no additional data. I have to do matrix multiplication, inversion, and some similar linear algebra operations on it. I tried Octave and simple single-threaded C code for the multiplication, but my 40 GB of system RAM gets used up very fast and then the program starts thrashing. Are there any other options available to me? I am not familiar with MATLAB or any other matrix library that could help me do this.
When I multiply a simple matrix with 10 rows and 3M columns by its transpose, I get the following error:
memory exhausted or requested size too large for range of Octave's index type
I am not sure whether the same would work in MATLAB or not. Is there another library or code I could use for sparse matrix representation and multiplication?
If there are few enough nonzero entries, I suggest creating a sparse matrix S with appropriate dimensions and a maximum number of nonzero entries; see matlab create sparse matrix. Then, as @oleg komarov described, load the matrix in blocks and assign the nonzero entries from each block to the correct addresses in the sparse matrix S. I feel that if your matrix is sparse enough, loading it is really the only difficulty you face. I had similar issues with large transfer operators.
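A rough sketch of that assembly, assuming the text file holds one "row col value" triplet per line (the file name and block size are made up):
nRows = 30e3; nCols = 3e6;           % overall matrix dimensions
blockSize = 1e6;                     % number of triplets to read per block
S = sparse(nRows, nCols);            % empty sparse accumulator
fid = fopen('entries.txt', 'r');     % hypothetical triplet file
while ~feof(fid)
    T = textscan(fid, '%f %f %f', blockSize);
    if isempty(T{1}), break; end
    % fold this block's nonzeros into the full sparse matrix
    S = S + sparse(T{1}, T{2}, T{3}, nRows, nCols);
end
fclose(fid);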
Have you considered performing your processing in blocks? Transposition and multiplication work very well with block matrix processing (see https://en.wikipedia.org/wiki/Block_matrix), and that will get you around any limitations on the indices.
This wouldn't help you with matrix inversion, though, unless you can decompose your matrix into blocks such that the off-diagonal blocks are completely empty, which isn't stated in your assumptions.
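For multiplication, a hypothetical column-block version might look like this (the matrices and block width are placeholders, assumed to already fit in memory as sparse arrays):
A = sprand(30e3, 30e3, 1e-4);        % placeholder sparse inputs
B = sprand(30e3, 3e6, 1e-6);
blockCols = 1e5;                     % columns of B handled per block
C = sparse(size(A, 1), size(B, 2));  % result, assumed to stay sparse
for j = 1:blockCols:size(B, 2)
    cols = j:min(j + blockCols - 1, size(B, 2));
    C(:, cols) = A * B(:, cols);     % only one block of the product at a time
end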
Octave has both a memory limit of about 2 GB and a limit of about 2^32 on the maximum number of indices a matrix can hold (for 32-bit Octave). MATLAB doesn't have such a memory limit, since it will use all of your memory resources, swap file included. Thus you could try MATLAB with a huge swap file; you could then compute your operations (but it will take quite a long time anyway...).
If you are interested in other approaches, you may take a look at out-of-core computing, which aims to promote new methods for processing huge datasets that cannot reside entirely in memory, instead storing the data on disk and efficiently loading only the pieces that are needed.
For a practical approach, you may take a look at Blaze for Python (note: still in development!).

Matlab parfor work distribution

I have a parfor loop with, say, 100 iterations, and the workload of each iteration is different but changes linearly, in such a way that the first one takes the most time and the last one is the fastest. But when I run the parfor loop with my four instances/labs, during the last few hours only one lab is active, as it is working through the first few iterations on its own.
So I know which iterations are the slow ones. How could I make the workload between the cores more even? For example, could I somehow force all labs to start working on the first four slow iterations and then proceed in order? Or something similar, to prevent a single active core from running the few slow ones alone.
Matlab parfor does nothing more than split up the indices and distribute them to the workers. It does this by creating contiguous chunks of the indices. I don't know the exact algorithm, but this means that data with similar indices get computed in the same chunk and by the same worker.
The simplest solution would be a stochastic one. Just shuffle your indices so that the work intensive steps are distributed nicely. While this doesn't give you any guarantees on performance it is simple and will work most of the time.
Some example code:
% dummy data
N = 10;
data = 1:N;
% generate the permuted indices
permIndex = randperm(N);
% permute the data
dataPermuted = data(permIndex);
% run the loop
parfor i = 1:N
    % do something, e.g. pause for the time specified by the data
    pause(dataPermuted(i));
end
% invert the index permutation
dataInversePermuted(permIndex) = dataPermuted;
I used pause to simulate the different computation times.
I don't think this is documented anywhere, but you can quickly deduce that PARFOR runs iterations in reverse loop order (using pause and disp if you want to see it in action). So, you should simply reverse your loop. PARFOR gives you no means to explicitly control execution order, but SPMD using for-drange does (PARFOR is significantly easier to use though).
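A minimal sketch of that reversal (N, doWork, and out are placeholders for the real loop bounds, body, and output):
N = 100;
doWork = @(idx) idx^2;               % stand-in for the real iteration body
workOrder = N:-1:1;                  % reversed iteration order
out = zeros(1, N);
parfor i = 1:N
    out(i) = doWork(workOrder(i));   % slow original iterations now sit at the end
end
outOrdered(workOrder) = out;         % restore the original iteration order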
@denahiro's suggestion is also a good one.

Parallelizing a for loop to run simultaneously on multiple GPU cores?

I understand that you can use matlabpool and parfor to run for-loop iterations in parallel; however, I want to take advantage of the high number of cores in my GPU to run a larger number of simultaneous iterations. I was wondering if there is any built-in functionality to do this?
To my understanding, the way MATLAB runs code on the GPU is through a gpuArray, but that does not seem to parallelize a loop, only certain functions inside the loop.
For the loop that I am running, each iteration can run independently, and the only variables that need to exist outside of the loop are the data to be processed (a 3-D array, where the first index is time, and each iteration operates on a different time) and a 2-D output array in which each iteration stores the result for a particular time. Each time is independent.
Thanks
With a GPUArray, you can run elementwise operations in parallel by structuring your algorithm in terms of MATLAB's arrayfun. Effectively, this implicitly loops over each element of your arrays, and can apply the body of a MATLAB function to each element. The doc is: here.
There's a simple demo: here.
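A minimal sketch of the arrayfun route (the data and the elementwise body are made up):
data = gpuArray(rand(1000, 500));     % placeholder data, already on the GPU
f = @(x) x.^2 + sin(x);               % elementwise body applied to every element
resultGpu = arrayfun(f, data);        % runs f across the GPU's cores
result = gather(resultGpu);           % copy the result back to host memory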