Running internal matlab functions in parallel within existing matlabpool with "-singleCompThread" - matlab

I'm implementing an adaptive (approximate) matrix-vector multiplication for very large systems (known sparsity structure) - see Predicting runtime of parallel loop using a-priori estimate of effort per iterand (for given number of workers) for a more long-winded description. I first determine the entries I need to calculate for each block, but even though the entries are only a small subset, calculating them directly (with quadrature) would be impossibly expensive. However, they are characterised by an underlying structure (the difference of their respective modulations) which means I only need to calculate the quadrature once per "equivalence class", which I get by calling unique on a large 2xN matrix of differences (and then mapping back to the original entries).
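Schematically, the reduction looks like this (D is the 2xN matrix of differences; expensive_quadrature stands in for the actual quadrature routine):
% D is the 2xN matrix of modulation differences (one column per entry)
[classes, ~, ic] = unique(D.', 'rows');     % one row per "equivalence class"
quadVals = zeros(size(classes, 1), 1);
for k = 1:size(classes, 1)
    quadVals(k) = expensive_quadrature(classes(k, :));  % hypothetical quadrature routine
end
entryVals = quadVals(ic);                   % map the class results back to the original entries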
Unfortunately, this 2xN matrix becomes so large in practice that it is becoming somewhat of a bottleneck in my code - which is still orders of magnitude faster than calculating the quadrature redundantly, but annoying nevertheless, since it could run faster in principle.
The problem is that the cluster on which I compute requires the -singleCompThread option, so that Matlab doesn't spawn threads where it shouldn't. This means that unique is forced to use only one core, even though I could arrange within the code for it to be called serially (as this procedure must be completed for all relevant blocks).
My search for a solution has led me to the function maxNumCompThreads, but it is deprecated and will be removed in a future release (aside from throwing warnings every time it's called), so I didn't pursue it further.
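(For completeness, the deprecated route would have looked roughly like this; every invocation emits the deprecation warning mentioned above:)
nPrev = maxNumCompThreads(4);       % would (in principle) re-enable 4 computational threads (deprecated, warns)
[C, ia, ic] = unique(D.', 'rows');
maxNumCompThreads(nPrev);           % restore the -singleCompThread setting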
It is also possible to pass a function to a batch job and specify a cluster and a pool size it should run on (e.g. j=batch(cluster,@my_unique,3,{D,'cols'},'matlabpool',127); this is 2013a; in 2013b, the key 'matlabpool' changed to 'Pool'), but the problem is that batch opens a new pool. In my current setup, I can have a permanently open pool on the cluster, and it would take a lot of unnecessary time to always open and shut pools for batch (aside from the fact that the maximal size of the pool I could open would decrease).
What I'd like is to call unique in such a way that it takes advantage of the currently open matlabpool, without requesting new pools from, or submitting jobs to, the cluster.
Any ideas? Or is this impossible?
Best regards,
Axel
PS. It is completely unfathomable to me why the standard set functions in Matlab have a 'rows' option but not a 'cols' option, especially since this would "cost" about 5 lines of code within each function. This is the reason for my_unique:
function varargout = my_unique(a, elem_type, varargin)
% Adapt unique to be able to deal with columns as well
% Inputs:
% a:
%    Set of which the unique values are sought
% elem_type (optional, default='scalar'):
%    Parameter determining which kind of unique elements are sought.
%    Possible arguments are 'scalar', 'rows' and 'cols'.
% varargin (optional):
%    Any valid combination of optional arguments that can be passed to
%    unique (with the exception of 'rows' if elem_type is either 'rows'
%    or 'cols')
%
% Outputs:
% varargout:
%    Same outputs as unique
if nargin < 2; elem_type = 'scalar'; end
if ~any(strcmp(elem_type, {'scalar', 'rows', 'cols'}))
    error('Unknown Flag')
end
varargout = cell(1, max(nargout, 1));
switch (elem_type)
    case 'scalar'
        [varargout{:}] = unique(a, varargin{:});
    case 'rows'
        [varargout{:}] = unique(a, 'rows', varargin{:});
    case 'cols'
        [varargout{:}] = unique(transpose(a), 'rows', varargin{:});
        varargout = cellfun(@transpose, varargout, 'UniformOutput', false);
end
end
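A call on the difference matrix then looks like this (the random D here just stands in for the actual 2xN matrix):
D = randi(10, 2, 1e6);               % stand-in for the 2xN matrix of differences
[C, ia, ic] = my_unique(D, 'cols');  % C holds the unique columns
isequal(D, C(:, ic))                 % true: C(:,ic) reconstructs the original columns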

Without having tried the example you cited above, you could try blockproc to do block processing. It belongs to the Image Processing Toolbox, however.

Leaving aside the 'rows' problem for the time being, if I've understood correctly, what you're after is a way to use an open parallel pool to do a large call to 'unique'. One option may be to use distributed arrays. For example, you could do:
A = randi([1 100], 1e6, 2);   % example data on the client, already transposed to Nx2
D = distributed(A);           % spread the rows across the open pool
spmd
    r = unique(D, 'rows');    % operates in parallel; D is codistributed inside spmd
end
This works because sortrows is implemented for codistributed arrays. You'll find that you only get speedup from (co)distributed arrays if you can arrange for the data always to live on the cluster, and also when the data is so large that processing it on one machine is infeasible.
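Applied to your 2xN matrix of differences (and assuming the index outputs of unique are also supported for codistributed inputs in your release, which is worth verifying), the round trip might look roughly like:
Dt = distributed(D.');               % push the transposed (Nx2) difference matrix onto the open pool
spmd
    [C, ~, ic] = unique(Dt, 'rows'); % the "equivalence classes", computed on the workers
end
C  = gather(C);                      % bring the unique rows back to the client
ic = gather(ic);                     % D(:,k) corresponds to C(ic(k),:).'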

Related

How Should I Parallelize My Genetic Algorithm Fitness Evaluation?

I have a GA code that I developed myself. Since I'm new to coding, my code is not fast. I have a dual-core 2.6 GHz CPU.
The only line of the code that takes a long time to run is the fitness function. I am not familiar with the GA toolbox and my fitness function is quite complex so I assume even if I knew how to use the GA toolbox, I would have to code the fitness function myself.
The algorithm's structure is as follows:
After generating the initial generation and evaluating its fitness values (which takes long, but does not matter much because it is only run once), the algorithm starts a loop that is iterated up to 10000 times. In each iteration we have a new generation whose fitness values need to be calculated. So when a new generation of 50 individuals is generated, the whole generation is fed to the fitness_function. In this function there is a for loop which calculates the fitness value for each of the 50 individuals (so the for loop is iterated 50 times). Here is my question: how should I use parfor so that 25 individuals are evaluated by one CPU core and the other 25 individuals by the other core, so that the calculation time is roughly halved? I already know from here
I have tried changing the for loop in the fitness_function directly to parfor and I have received the following error: "The PARFOR loop cannot run due to the way variable "Z" is used." and "Variable z is indexed in different ways. Potentially causing dependencies between iterations." Variable Z is a 50*3 matrix which stores the fitness values for each of the individuals.
The problem with your assignments into Z is that you have three different assignment statements that index Z in different ways, which parfor does not allow. You need to make the assignment into Z meet the requirements for a "sliced" variable. The easiest way to do this is to make a temporary variable Zrow to store the values for the ith row of Z, and then make a single assignment, like this
parfor i = 1:50
    Zrow = zeros(1, 3); % allocate to ensure parfor knows this is a temporary
    ...
    Zrow(1) = TTT;
    ...
    Zrow(2) = sum(FSL,1);
    Zrow(3) = 0.5*Zrow(1) + 0.5*Zrow(2); % use the freshly computed Zrow values here, not Z(i,1)/Z(i,2),
                                         % otherwise Z is again indexed in different ways
    % Finally, make a single sliced assignment into Z
    Z(i, :) = Zrow;
end
In general, it's best to have the parfor loop be the outermost one. Also, whether parfor actually gives you any speed-up depends a lot on whether the body of the loop is already being multithreaded by MATLAB's built-in capabilities. (If it is, then parfor using only your local machine cannot make things faster, because the multithreaded code is already taking full advantage of your computer's resources.)
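Putting both points together, a sketch of fitness_function with the parfor at the individual level might look like this (evaluate_individual stands in for your actual per-individual computation, and I assume one individual per row of the generation matrix):
function Z = fitness_function(generation)
% generation: 50 x nGenes matrix, one individual per row (assumed layout)
nInd = size(generation, 1);
Z = zeros(nInd, 3);                        % sliced output: one row per individual
parfor i = 1:nInd
    Zrow = zeros(1, 3);                    % temporary, local to each iteration
    [TTT, FSL] = evaluate_individual(generation(i, :));  % hypothetical helper
    Zrow(1) = TTT;
    Zrow(2) = sum(FSL, 1);                 % assumes FSL is a column vector, so the sum is a scalar
    Zrow(3) = 0.5*Zrow(1) + 0.5*Zrow(2);
    Z(i, :) = Zrow;                        % single sliced assignment
end
end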

Clean methodology for running a function for a large set of input parameters (in Matlab)

I have a differential equation that's a function of around 30 constants. The differential equation is a system of (N^2+1) equations (where N is typically 4). Solving this system produces N^2+1 functions.
Often I want to see how the solution of the differential equation functionally depends on constants. For example, I might want to plot the maximum value of one of the output functions and see how that maximum changes for each solution of the differential equation as I linearly increase one of the input constants.
Is there a particularly clean method of doing this?
Right now I turn my differential-equation-solving script into a large function that returns an array of output functions. (Some of the inputs are vectors & matrices). For example:
for i = 1:N
[OutputArray1(i, :), OutputArray2(i, :), OutputArray3(i, :), OutputArray4(i, :), OutputArray5(i, :)] = DE_Simulation(Parameter1Array(i));
end
Here I loop through the function. The function solves a differential equation, and then returns the set of solution functions for that input parameter, and then each is appended as a row to a matrix.
There are a few issues I have with my method:
If I want to see the solution to the differential equation for a different parameter, I have to redefine the function so that it is an input of one of the thirty other parameters. For the sake of code readability, I cannot see myself explicitly writing all of the input parameters as individual inputs. (I've read that structures might be helpful here, but I'm not sure how that would be implemented.)
I typically get lost in parameter space and often have to update the same parameter across multiple scripts. I have a script that runs the differential-equation-solving function, and I have a second script that plots the set of simulated data. (And I will save the local variables to a file so that I can load them explicitly for plotting, but I often get lost figuring out which file is associated with what set of parameters). The remaining parameters that are not in the input of the function are inside the function itself. I've tried making the parameters global, but doing so drastically slows down the speed of my code. Additionally, some of the inputs are arrays I would like to plot and see before running the solver. (Some of the inputs are time-dependent boundary conditions, and I often want to see what they look like first.)
I'm trying to figure out a good method for me to keep track of everything. I'm trying to come up with a smart method of saving generated figures with a file tag that displays all the parameters associated with that figure. I can save such a file as a notepad file with a generic tagging-number that's listed in the title of the figure, but I feel like this is an awkward system. It's particularly awkward because it's not easy to see what's different about a long list of 30+ parameters.
Overall, I feel as though what I'm doing is fairly simple, yet I feel as though I don't have a good coding methodology and consequently end up wasting a lot of time saving almost-identical functions and scripts to solve fairly simple tasks.
It seems like what you really want here is something that deals with N-D arrays instead of splitting up the outputs.
If all of the OutputArray_ variables have the same number of columns, then the line
for i = 1:N
[OutputArray1(i, :), OutputArray2(i, :), OutputArray3(i, :), OutputArray4(i, :), OutputArray5(i, :)] = DE_Simulation(Parameter1Array(i));
end
seems to suggest that what you really want your function to return is an M x K array (where in this case, K = 5), and you want to pack that output into an M x K x N array. That is, it seems like you'd want to refactor your DE_Simulation to give you something like
for i = 1:N
OutputArray(:,:,i) = DE_Simulation(Parameter1Array(i));
end
If they aren't the same size, then a struct or a table is probably the best way to go, as you could assign to one element of the struct array per loop iteration or one row of the table per loop iteration (the table approach would assume that the size of the variables doesn't change from iteration to iteration).
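For instance, a sketch of the struct-array variant (the field names here are just placeholders):
results = struct([]);                         % will grow to a 1xN struct array
for i = 1:N
    [o1, o2, o3, o4, o5] = DE_Simulation(Parameter1Array(i));
    results(i).param = Parameter1Array(i);    % keep the input alongside the outputs
    results(i).out1 = o1;                     % placeholder field names;
    results(i).out2 = o2;                     % the sizes may differ between iterations,
    results(i).out3 = o3;                     % a struct array doesn't mind
    results(i).out4 = o4;
    results(i).out5 = o5;
end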
If, for some reason, you really need to have these as separate outputs (and perhaps later as separate inputs), then what you probably want is a cell array. In that case you'd be able to deal with the variable number of inputs doing something like
for i = 1:N
[OutputArray{i, 1:K}] = DE_Simulation(Parameter1Array(i));
end
I hesitate to even write that, though, because this almost certainly seems like the wrong data structure for what you're trying to do.

Create different `randperm` numbers in loops

Suppose that we have this structure:
for i = 1:x1
    Out = randperm(40);
    Out_Final = % divide 'Out' into 10 parts and select these parts for some purposes
    for j = 1:x2
        % process 'Out_Final'
    end
end
I'm using the outer loop (for i=1:x1) to repeat the main process (for j=1:x2) and average over the outputs to get more robust results. I don't want randperm to produce equal (or nearly equal) outputs; I want its output to differ as much as possible between calls in the (for i=1:x1) loop.
How can I do that in MATLAB R2014a?
The randomness algorithms used by randperm are very good. So, don't worry about that.
However, if you draw 10 random numbers from 1 to 10, you are likely to see some more frequently than others.
If you REALLY don't want this, you should probably not focus on randomly selecting the numbers, but on selecting numbers that are nicely spread out throughout their possible range. (That is quite a different problem to solve.)
To address your comment:
The rng function allows you to create reproducible results, make sure to check doc rng for examples.
In your case it seems like you actually don't want to reset the rng each time, as resetting it to the same seed on every pass would just reproduce the same numbers.
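In other words, seed the generator once before the outer loop (or not at all) and then just keep drawing; a minimal sketch:
rng('shuffle');                % or rng(someSeed) for reproducibility; call this ONCE, before the loops
for i = 1:x1
    Out = randperm(40);        % successive calls now give independent permutations
    for j = 1:x2
        % process Out_Final ...
    end
end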

best way to parallelize calculations on time series data in matlab

I have a linux cluster with Matlab & PCT installed (128 workers with Torque Manager), and I am looking for a good way to parallelize my calculations.
I have a time-series Trajectory data (100k x 2) matrix. I perform maximum likelihood (ML) calculations that involve matrix diagonalization, exponentiation & multiplications, which is running fast for smaller matrices. I divide the Trajectory data into small chunks and perform the calculations on many workers (coarse parallelization) and don't have any problems here as it works fine (gets done in ~30s)
But the calculations also depend on a number of parameters that I need to vary & test the effect on ML. (something akin to parameter sweep).
When I try to do this in a loop, the calculations become progressively slower, for a reason I have been unable to figure out.
%%%%%%% Pseudo-code example:
% trajectoryData: [100000 x 2] time-series data
load trajectoryData
% p1, p2, p3, p4 are parameters;
% I want to do this over multiple values of p3 & p4
paramsMat = [p1Vect(:), p2Vect(:), p3Vect(:), p4Vect(:)];   % one parameter set per row
matlabpool open 128
[ML] = objFunc([p1 p2 p3 p4], trajectoryData)   % runs fast ~ <30s
%% NOTE: this runs progressively slower
for i = 1:size(paramsMat, 1)
    currentparams = paramsMat(i, :);
    [ML] = objFunc(currentparams, trajectoryData)
end
matlabpool close
The objFunc function is as follows:
% objFunc.m
function [ML] = objFunc(Params, trajectoryData)
% b = 2 always
[a, b] = size(trajectoryData);
% split into fragments of 1000 points (or any other way)
fragsMat = reshape(trajectoryData, 1000, a*2/1000);
numFragments = size(fragsMat, 2);
costVal = zeros(1, numFragments);
% simple parallelization: do the calculation on small chunks
parfor ix = 1:numFragments
    % do heavy calculations
    costVal(ix) = costValFrag;
end
% just an example;
ML = sum(costVal);
%%%%%%
A single calculation only takes ~30s (using the full cluster), but inside the for loop the speed drops progressively for some reason, and by the 100th iteration it has become very slow. The workers are using only 10-20% of the CPU.
If you have any suggestions including alternative parallelization suggestions it would be of immense help.
If I read this correctly, each parameter set is completely independent of all the others, and you have more parameter sets than you do workers.
The simple solution is to use a batch job instead of parfor.
job_manager = findResource( ... look up the args that fit your cluster ... )
job = createJob(job_manager);
for i = 1:num_param_sets
    t = createTask(job, @your_function, 0, {your params});
end
submit(job);
This way you avoid any communications overhead you have from the parfor of the inner function, and you keep your matlabs separate. You can even tell it to automatically restart the workers between tasks (I think), as one of the job parameters.
What is the value of numFragments? If this is not always larger than your number of workers, then you will see things slowing down.
I would suggest trying to make your outer for loop be the parfor. It's generally better to apply the parallelism at the outermost level.
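Concretely, the outer-parfor version of your pseudo-code might look like the sketch below, with objFunc changed internally to use an ordinary for over the fragments (this assumes paramsMat holds one parameter set per row):
ML = zeros(size(paramsMat, 1), 1);
parfor i = 1:size(paramsMat, 1)
    % trajectoryData (~1.6 MB) is broadcast to each worker once
    ML(i) = objFunc(paramsMat(i, :), trajectoryData);
end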

Parallelize or vectorize all-against-all operation on a large number of matrices?

I have approximately 5,000 matrices with the same number of rows and varying numbers of columns (20 x ~200). Each of these matrices must be compared against every other in a dynamic programming algorithm.
In this question, I asked how to perform the comparison quickly and was given an excellent answer involving a 2D convolution. Serially, iteratively applying that method, like so
list = who('data_matrix_prefix*')
H = cell(numel(list), numel(list));
for i = 1:numel(list)
    for j = 1:numel(list)
        if i ~= j
            eval(['H{i,j} = compare(' char(list(i)) ',' char(list(j)) ');']);
        end
    end
end
is fast for small subsets of the data (e.g. for 9 matrices, 9*9 - 9 = 72 calls are made in ~1 s, 870 calls in ~2.5 s).
However, operating on all the data requires almost 25 million calls.
I have also tried using deal() to make a cell array composed entirely of the next element in data, so I could use cellfun() in a single loop:
% who(), load() and struct2cell() calls place k data matrices in a 1D cell array called data.
nextData = cell(k, 1);
for i = 1:k
    [nextData{:}] = deal(data{i});
    H(:,i) = cellfun(@compare, data, nextData, 'UniformOutput', false);
end
Unfortunately, this is not really any faster, because all the time is in compare(). Both of these code examples seem ill-suited for parallelization. I'm having trouble figuring out how to make my variables sliced.
compare() is totally vectorized; it uses matrix multiplication and conv2() exclusively (I am under the impression that all of these operations, including the cellfun(), should be multithreaded in MATLAB?).
Does anyone see an (explicitly) parallelized solution or a better vectorization of the problem?
Note
I realize both my examples are inefficient - the first would be twice as fast if it calculated a triangular cell array, and the second is still calculating the self comparisons, as well. But the time savings for a good parallelization are more like a factor of 16 (or 72 if I install MATLAB on everyone's machines).
Aside
There is also a memory issue. I used a couple of evals to append each column of H into a file, with names like H1, H2, etc. and then clear Hi. Unfortunately, the saves are very slow...
Do compare(a,b) == compare(b,a) and compare(a,a) == 1 hold?
If so, change your loop
for i = 1:numel(list)
    for j = 1:numel(list)
        ...
    end
end
to
for i = 1:numel(list)
    for j = i+1:numel(list)
        ...
    end
end
and deal with the symmetry and identity case. This will cut your calculation time by half.
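You can then fill in the remaining entries from the computed triangle, e.g.:
for i = 1:numel(list)
    H{i,i} = 1;              % identity case, if compare(a,a) == 1 holds in your setting
    for j = i+1:numel(list)
        eval(['H{i,j} = compare(' char(list(i)) ',' char(list(j)) ');']);
        H{j,i} = H{i,j};     % symmetry case
    end
end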
The second example can be easily sliced for use with the Parallel Computing Toolbox. This toolbox distributes iterations of your code among up to 8 local workers. If you want to run the code on a cluster, you also need the Distributed Computing Toolbox.
%# who(), load() and struct2cell() calls place k data matrices in a 1D cell array called data.
parfor i = 1:k-1 %# this will run the loop in parallel with the Parallel Computing Toolbox
    %# only make the necessary comparisons
    H(i+1:k, i) = cellfun(@compare, data(i+1:k), repmat(data(i), k-i, 1), 'UniformOutput', false);
    %# if the above doesn't work, try this
    hSlice = cell(k, 1);
    hSlice(i+1:k) = cellfun(@compare, data(i+1:k), repmat(data(i), k-i, 1), 'UniformOutput', false);
    H(:, i) = hSlice;
end
If I understand correctly, you have to perform 5000^2 matrix comparisons? Rather than trying to parallelise the compare function, perhaps you should think of your problem as being composed of 5000^2 tasks. The MATLAB Parallel Computing Toolbox supports task-based parallelism. Unfortunately my experience with PCT is with parallelisation of large linear-algebra-type problems, so I can't really tell you much more than that. The documentation will undoubtedly help you more.
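For what it's worth, a rough sketch of that task-based route, using the same old-style API as the batch answer above and assuming the matrices sit in the cell array data from your second example (one task per pair here; in practice you would probably bundle many pairs into each task to keep scheduling overhead down, and the findResource arguments depend on your cluster):
jm  = findResource('scheduler', 'type', 'jobmanager');    % adjust to match your cluster
job = createJob(jm);
for i = 1:k
    for j = i+1:k
        createTask(job, @compare, 1, {data{i}, data{j}}); % one output per comparison
    end
end
submit(job);
waitForState(job, 'finished');
out = getAllOutputArguments(job);                         % k*(k-1)/2 results, in task creation order
destroy(job);                                             % clean up the job on the scheduler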