best way to parallelize calculations on time series data in matlab

best way to parallelize calculations on time series data in matlab - matlab

I have a linux cluster with Matlab & PCT installed (128 workers with Torque Manager), and I am looking for a good way to parallelize my calculations.
I have a time-series Trajectory data (100k x 2) matrix. I perform maximum likelihood (ML) calculations that involve matrix diagonalization, exponentiation & multiplications, which is running fast for smaller matrices. I divide the Trajectory data into small chunks and perform the calculations on many workers (coarse parallelization) and don't have any problems here as it works fine (gets done in ~30s)
But the calculations also depend on a number of parameters that I need to vary & test the effect on ML. (something akin to parameter sweep).
When I try to do this using a loop, the calculations becomes progressively very slow, for some reason I am unable to figure out.
%%%%%%% Pseudo- Code Example:
% a [100000x2], timeseries data
load trajectoryData
% p1,p2,p3,p4 are parameters
% but i want to do this over a multiple values fp3 & fp4 ;
paramsMat = [p1Vect; p2Vect;p3Vect ;p4Vect];
matlabpool start 128
[ML] = objfun([p1 p2 p3 p4],trajectoryData) % runs fast ~ <30s
%% NOTE: this runs progressively slow
for i = 1:length(paramsMat)
currentparams = paramsMat(i,:);
[ML] = objfun(currentparams,trajectoryData)
end
matlabpool close
The objFunc function is as follows:
% objFunc.m
[ML] = objFunc(Params, trajectoryData)
% b = 2 always
[a b] = size(trajectoryData) ;
% split into fragments of 1000 points (or any other way)
fragsMat = reshape(trajectoryData,1000, a*2/1000) ;
% simple parallelization. do the calculation on small chunks
parfor ix = 1: numFragments
% do heavy calculations
costVal(ix) = costValFrag;
end
% just an example;
ML = sum(costVal) ;
%%%%%%
Just a single calculation oddly takes ~30s (using the full cluster) but within the for loop, for some weird reason there is damping of speed & even within the 100th calculation, it becomes very slow. The workers are using only 10-20% of CPU.
If you have any suggestions including alternative parallelization suggestions it would be of immense help.

If I read this correctly, each parameter set is completely independent of all the others, and you have more parameter sets than you do workers.
The simple solution is to use a batch job instead of parfor.
job_manager = findresource( ... look up the args that fit your cluster ... )
job = createJob(job_manager);
for i = 1:num_param_sets
t = createTask(job, #your_function, 0, {your params});
end
submit(job);
This way you avoid any communications overhead you have from the parfor of the inner function, and you keep your matlabs separate. You can even tell it to automatically restart the workers between tasks (I think), as one of the job parameters.

What is the value of numFragments? If this is not always larger than your number of workers, then you will see things slowing down.
I would suggest trying to make your outer for loop be the parfor. It's generally better to apply the parallelism at the outermost level.

Related

Why accessing 2d matrix in parfor so slow?

Let's say I have a large matrix A:
A = rand(10000,10000);
The following serial code took around 0.5 seconds
tic
for i=1:5
r=9999*rand(1);
disp(A(round(r)+1, round(r)+1))
end
toc
Whereas the following code with parfor took around 47 seconds
tic
parfor i=1:5
r=9999*rand(1);
disp(A(round(r)+1, round(r)+1))
end
toc
How can I speed up the parfor code?
EDIT: If instead of using disp, I try to compute the sum with the following code
sum=0;
tic
for i=1:5000
r=9999*rand(1);
sum=sum+(A(round(r)+1, round(r)+1));
end
toc
This takes .025 sec
But parfor it takes 42.5 sec:
tic
parfor i=1:5000
r=9999*rand(1);
sum=sum+(A(round(r)+1, round(r)+1));
end
toc

Your issue is in not considering node communication overheads.
When you use a parfor to loop using parallel computation, you have to think about the structure of several worker nodes doing small tasks for the client node.
Here are some issues with the tests you present:
The function disp is serial, since you can only display results one at a time to the client node. Communication between nodes is needed to schedule this task.
Creating a summation external to the loop means all of the nodes have to communicate the current value back to the client node.
A is a broadcast variable in all of your examples. From the docs:
This type of variable can be useful or even essential for particular tasks. However, large broadcast variables can cause significant communication between client and workers and increase parallel overhead.
The MATLAB editor warns you about this, underlining the variable in orange with the following tooltip:
The entire array or structure 'A' is a broadcast variable. This might result in unnecessary communication overhead.
Instead, we can calculate some random indices up front and slice A into temporary variables to use in the loop. Then do gathering operations (like summing all of the parts) after the loop.
k = 50;
sumA = zeros( k, 1 ); % Output values for each loop index
idx = randi( [1,1e4], k, 1 ); % Calculate our indices outside the loop
randA = A( idx, idx ); % Slice A outside the loop
parfor ii = 1:k
sumA( ii ) = randA( ii ); % All that's left to do in the loop
end
sumA = sum( sumA ); % Collate results from all nodes
I did a quick benchmark to compare your 2 summation tests with the above code, using R2017b and 12 workers, here are my results:
Serial loop: ~ 0.001 secs
Parallel with broadcasting: ~ 100 secs
Parallel no broadcasting: ~ 0.1 secs
Parallel loops are overkill for this operation, the overhead isn't justified, but it's clear that with some pre-allocation and avoiding of broadcast variables, they are at least not 5 orders of magnitude slower!
See how the version of the code without broadcast variables uses more vectorisation too, which will speed up the code without even having to use parfor. Optimising your code before using parallel computation will not only speed things up for serial computation, but often make the transition easier too!
Side note: sum and i are bad variable names because they are the names of built-in functions.

So there are a few main causes,
MATLAB parallel toolbox sucks. It just does unless you're using the GPU portion.
The only time it's beneficial is if the individual tasks are large enough. Your computer has to dedicate a core to assigning jobs to all the other cores. This is expensive and has a lot of overhead unless the jobs are of sufficient size. Your computer is running overtime assigning small jobs. If you were assigning jobs that would each take a minute it would be a different story.
You're running too few jobs. You're only looping through 5 times on very small jobs. Why would you even bother trying to multithread this? When I assign it to loop through 500,000 times it finally gains a speedup with parfor if I reduce the matrix size to 1000 x 1000
When you run parfor, MATLAB has to duplicate memory across all of the treads, you have a 10,000 x 10,000 matrix which takes up 800 MB. Duplicated across a 4 core machine is 3,200 MB or probably half of your RAM. Operating on these arrays costs extra memory, potentially doubling the size -> 6,400 MB. Probably more than you can afford to use.
Simply put, "how do you I speed up this parfor code?"
You don't

Saving time and memory using parfor?

Consider prova.mat in MATLAB obtained in the following way
for w=1:100
for p=1:9
A{p}=randn(100,1);
end
baseA_.A=A;
eval(['baseA.A' num2str(w) '= baseA_;'])
end
save(sprintf('prova.mat'),'-v7.3', 'baseA')
To have an idea of the actual dimensions in my data, the 1x9 cell in A1 is composed by the following 9 arrays: 904x5, 913x5, 1722x5, 4136x5, 9180x5, 3174x5, 5970x5, 4455x5, 340068x5. The other Aj's have a similar composition.
Consider the following code
clear all
load prova
tic
parfor w=1:100
indA=sprintf('A%d', w);
Aarr=baseA.(indA).A;
Boot=[];
for p=1:9
C=randn(100,1).*Aarr{p};
Boot=[Boot; C];
end
D{w}=Boot;
end
toc
If I run the parfor loop with 4 local workers in my Macbook Pro it takes 1.2 sec. Replacing parfor with for it takes 0.01 sec.
With my actual data, the difference of time is 31 sec versus 7 sec [the creation of the matrix C is also more complicated].
If have understood correctly the problem is that the computer has to send baseAto each local worker and this takes time and memory.
Could you suggest a solution that is able to make parfor more convenient than for? I thought that saving all cells in baseA was a way to save time by loading once at the beginning, but maybe I'm wrong.

General information
A lot of functions have implicit multi-threading built-in, making a parfor loop not more efficient, when using these functions, than a serial for loop, since all cores are already being used. parfor will actually be a detriment in this case, since it has the allocation overhead, whilst being as parallel as the function you are trying to use.
When not using one of the implicitly multithreaded functions parfor is basically recommended in two cases: lots of iterations in your loop (i.e., like 1e10), or if each iteration takes a very long time (e.g., eig(magic(1e4))). In the second case you might want to consider using spmd (slower than parfor in my experience). The reason parfor is slower than a for loop for short ranges or fast iterations is the overhead needed to manage all workers correctly, as opposed to just doing the calculation.
Check this question for information on splitting data between separate workers.
Benchmarking
Code
Consider the following example to see the behaviour of for as opposed to that of parfor. First open the parallel pool if you've not already done so:
gcp; % Opens a parallel pool using your current settings
Then execute a couple of large loops:
n = 1000; % Iteration number
EigenValues = cell(n,1); % Prepare to store the data
Time = zeros(n,1);
for ii = 1:n
tic
EigenValues{ii,1} = eig(magic(1e3)); % Might want to lower the magic if it takes too long
Time(ii,1) = toc; % Collect time after each iteration
end
figure; % Create a plot of results
plot(1:n,t)
title 'Time per iteration'
ylabel 'Time [s]'
xlabel 'Iteration number[-]';
Then do the same with parfor instead of for. You will notice that the average time per iteration goes up (0.27s to 0.39s for my case). Do realise however that the parfor used all available workers, thus the total time (sum(Time)) has to be divided by the number of cores in your computer. So for my case the total time went down from around 270s to 49s, since I have an octacore processor.
So, whilst the time to do each separate iteration goes up using parfor with respect to using for, the total time goes down considerably.
Results
This picture shows the results of the test as I just ran it on my home PC. I used n=1000 and eig(500); my computer has an I5-750 2.66GHz processor with four cores and runs MATLAB R2012a. As you can see the average of the parallel test hovers around 0.29s with a lot of spread, whilst the serial code is quite steady around 0.24s. The total time, however, went down from 234s to 72s, which is a speed up of 3.25 times. The reason that this is not exactly 4 is the memory overhead, as expressed in the extra time each iteration takes. The memory overhead is due to MATLAB having to check what each core is doing and making sure that each loop iteration is performed only once and that the data is put into the correct storage location.

Slice broadcasted data into a cell array
The following approach works for data which is looped by group. It does not matter what the grouping variable is, as long as it is determined before the loop. The speed advantage is huge.
A simplified example of such data is the following, with the first column containing a grouping variable:
ngroups = 1000;
nrows = 1e6;
data = [randi(ngroups,[nrows,1]), randn(nrows,1)];
data(1:5,:)
ans =
620 -0.10696
586 -1.1771
625 2.2021
858 0.86064
78 1.7456
Now, suppose for simplicity that I am interested in the sum() by group of the values in the second column. I can loop by group, index the elements of interest and sum them up. I will perform this task with a for loop, a plain parfor and a parfor with sliced data, and will compare the timings.
Keep in mind that this is a toy example and I am not interested in alternative solutions like bsxfun(), this is not the point of the analysis.
Results
Borrowing the same type of plot from Adriaan, I first confirm the same findings about plain parfor vs for. Second, both methods are completely outperformed by the parfor on sliced data which takes a bit more than 2 seconds to complete on a dataset with 10 million rows (the slicing operation is included in the timing). The plain parfor takes 24s to complete and the for almost twice that amount of time (I am on Win7 64, R2016a and i5-3570 with 4 cores).
The main point of slicing the data before starting the parfor is to avoid:
the overhead from the whole data being broadcast to the workers,
indexing operations into ever growing datasets.
The code
ngroups = 1000;
nrows = 1e7;
data = [randi(ngroups,[nrows,1]), randn(nrows,1)];
% Simple for
[out,t] = deal(NaN(ngroups,1));
overall = tic;
for ii = 1:ngroups
tic
idx = data(:,1) == ii;
out(ii) = sum(data(idx,2));
t(ii) = toc;
end
s.OverallFor = toc(overall);
s.TimeFor = t;
s.OutFor = out;
% Parfor
try parpool(4); catch, end
[out,t] = deal(NaN(ngroups,1));
overall = tic;
parfor ii = 1:ngroups
tic
idx = data(:,1) == ii;
out(ii) = sum(data(idx,2));
t(ii) = toc;
end
s.OverallParfor = toc(overall);
s.TimeParfor = t;
s.OutParfor = out;
% Sliced parfor
[out,t] = deal(NaN(ngroups,1));
overall = tic;
c = cache2cell(data,data(:,1));
s.TimeDataSlicing = toc(overall);
parfor ii = 1:ngroups
tic
out(ii) = sum(c{ii}(:,2));
t(ii) = toc;
end
s.OverallParforSliced = toc(overall);
s.TimeParforSliced = t;
s.OutParforSliced = out;
x = 1:ngroups;
h = plot(x, s.TimeFor,'xb',x,s.TimeParfor,'+r',x,s.TimeParforSliced,'.g');
set(h,'MarkerSize',1)
title 'Time per iteration'
ylabel 'Time [s]'
xlabel 'Iteration number[-]';
legend({sprintf('for : %5.2fs',s.OverallFor),...
sprintf('parfor : %5.2fs',s.OverallParfor),...
sprintf('parfor_sliced: %5.2fs',s.OverallParforSliced)},...
'interpreter', 'none','fontname','courier')
You can find cache2cell() on my github repo.
Simple for on sliced data
You might wonder what happens if we run the simple for on the sliced data? For this simple toy example, if we take away the indexing operation by slicing the data, we remove the only bottleneck of the code, and the for will actually be slighlty faster than the parfor.
However, this is a toy example where the cost of the inner loop is completely taken by the indexing operation. Hence, for the parfor to be worthwhile, the inner loop should be more complex and/or spread out.
Saving memory with sliced parfor
Now, assuming that your inner loop is more complex and the simple for loop is slower, let's look at how much memory we save by avoiding broadcasted data in a parfor with 4 workers and a dataset with 50 million rows (for about 760 MB in RAM).
As you can see, almost 3 GB of additional memory are sent to the workers. The slice operation needs some memory to be completed, but still much less than the broadcasting operation and can in principle overwrite the initial dataset, hence bearing negligible RAM cost once completed. Finally, the parfor on the sliced data will only use a small fraction of memory, i.e. that amount that corresponds to slices being used.
Sliced into a cell
The raw data is sliced by group and each section is stored into a cell. Since a cell array is an array of references we basically partitioned the contiguous data in memory into independent blocks.
While our sample data looked like this
data(1:5,:)
ans =
620 -0.10696
586 -1.1771
625 2.2021
858 0.86064
78 1.7456
out sliced c looks like
c(1:5)
ans =
[ 969x2 double]
[ 970x2 double]
[ 949x2 double]
[ 986x2 double]
[1013x2 double]
where c{1} is
c{1}(1:5,:)
ans =
1 0.58205
1 0.80183
1 -0.73783
1 0.79723
1 1.0414

Iteration for convergence in Matlab without using a while loop

I have to iterate a process where I have an initial guess for the Mach number (M0). This initial guess will give me another guess for the Mach number by using two equations (Mn). Eventually, i want to iterate this process untill the error between M0 and Mn is small. I have the following piece of code and it actually works well with a while loop.
However, I am afraid that the while loop will take many iterations and computational time for certain inputs since this will be part of a bigger code which most likely will give unfeasible inputs for the while loop.
Therefore my question is the following. How can I iterate this process within Matlab without consulting a while loop? The code that I am implementing now is the following:
%% Input
gamma = 1.4;
theta = atan(0.315);
cpi = -0.732;
%% Loop
M0 = 0.2; %initial guess
Err = 100;
iterations = 0;
while Err > 0.5E-3
B = (1-(M0^2)*(1-M0*cpi))^0.5;
Mn = (((gamma+1)/2) * ((B+((1-cpi)^0.5)*sec(theta)-1)^2/(B^2 + (tan(theta))^2)) - ((gamma-1)/2) )^-0.5;
Err = abs(M0 - Mn);
M0 = Mn;
iterations=iterations+1;
end
disp(iterations) disp(Mn)
Many thanks

Since M0 is calculated in each iteration and you have trigonometric functions, you cannot use another way than iteration structures (i.e. while).
If you had a specific increase or decrease at M0, then you could initialize a vector of M0 and do vector calculations for B and Err.
But, with sec and tan this is not possible.
Another wat would be to use the parallel processing. But, since you change the M0 at each iteration then you cannot use the parfor loop.
As for a for loop, in MATLAB you need an array for for "command" argument (e.g. 1:10 or 1:length(x) or i = A, where A = 1:10 or A = [1:10;11:20]). Since you evaluate a condition and depending on the result of the evaluation you judge if you continue the execution or not, it seems that the while loop (or do while in another language) is the only way to go.

I think you need to clarify the issue. If it the issue you want to solve is that some inputs take a long time to calculate, it is not the while loop that takes the time, it is the execution of the code multiple times that causes it. Any method that loops through will be restricted by the time the block of code takes to execute multiplied by the number of iterations required to converge.
You can introduce something to stop at a certain number of iterationtions, conceptually:
While ((err > tolerance) && (numIterations < limit))
If you want an answer which does not require iterating over the code, this is akin to finding a closed form solution, and I suspect this does not exist.
Edit to add: by not exist I mean in a practical form which can be implemented in a more efficient way then iterating to a solution.

Scope for improvement in this code

I have written the following code in MATLAB to process large images of the order of 3000x2500 pixels. Currently the operation takes more than half hour to complete. Is there any scope to improve the code to consume less time? I heard parallel processing can make things faster, but I have no idea on how to implement it. How do I do it, given the following code?
function dirvar(subfn)
[fn,pn] = uigetfile({'*.TIF; *.tiff; *.tif; *.TIFF; *.jpg; *.bmp; *.JPG; *.png'}, ...
'Select an image', '~/');
I = double(imread(fullfile(pn,fn)));
ld = input('Enter the lag distance = '); % prompt for lag distance
fh = eval(['#' subfn]); % Function handles
I2 = uint8(nlfilter(I, [7 7], fh));
imshow(I2); % Texture Layer Image
imwrite(I2,'result_mat.tif');
% Zero Degree Variogram
function [gamma] = ewvar(I)
c = (size(I)+1)/2; % Finds the central pixel of moving window
EW = I(c(1),c(2):end); % Determines the values from central pixel to margin of window
h = length(EW) - ld; % Number of lags
gamma = 1/(2 * h) * sum((EW(1:ld:end-1) - EW(2:ld:end)).^2);
end
The input lag distance is usually 1.

You really need to use the profiler to get some improvements out of it. My first guess (as I haven't run the profiler, which you should as suggested already), would be to use as little length operations as possible. Since you are processing every image with a [7 7] window, you can precalculate some parts,
such that you won't repeat these actions
function dirvar(subfn)
[fn,pn] = uigetfile({'*.TIF; *.tiff; *.tif; *.TIFF; *.jpg; *.bmp; *.JPG; *.png'}, ...
'Select an image', '~/');
I = double(imread(fullfile(pn,fn)));
ld = input('Enter the lag distance = '); % prompt for lag distance
fh = eval(['#' subfn]); % Function handles
%% precalculations
wind = [7 7];
center = (wind+1)/2; % Finds the central pixel of moving window
EWlength = (wind(2)+1)/2;
h = EWlength - ld; % Number of lags
%% calculations
I2 = nlfilter(I, wind, fh);
imshow(I2); % Texture Layer Image
imwrite(I2,'result_mat.tif');
% Zero Degree Variogram
function [gamma] = ewvar(I)
EW = I(center(1),center(2):end); % Determines the values from central pixel to margin of window
gamma = 1/(2 * h) * sum((EW(1:ld:end-1) - EW(2:ld:end)).^2);
end
end
Note that by doing so, you trade performance for clearness of your code and coupling (between the function dirvar and the nested function ewvar). However, since I haven't profiled your code (you should do that yourself using your own inputs), you can find what line of your code consumes the most time.
For batch processing, I would also recommend to leave out any input, imshow, imwrite and uigetfile. Those are commands that you typically call from a more high-level function/script and that will force you to enter these inputs even when you want them to stay the same. So instead of that code, make each of the variables they produce (/process) a parameter (/return value) for your function. That way, you could leave MATLAB running during the weekend to process everything (without having manually enter to all those values), even if you are unable to speed up the code.

A few general purpose tricks:
1 - use the MATLAB profiler to determine all the computational bottlenecks
2 - parallel processing can make things faster and there are a lot of tools that you can use, but it depends on how your entire code is set up and whether the code is optimized for it. By far the easiest trick to learn is parfor, where you can replace the top level for loop by parfor. This does mean you must open the MATLAB pool with matlabpool open.
3 - If you have a rather recent Nvidia GPU as well as MATLAB 2011, you can also write some CUDA code.
All in all 30 mins to me is peanuts, so don't fret it too much.

First of all, I strongly suggest you follow the advice by #Egon: Write a separate function that collects a list of files (the excellent UIPICKFILES from the FEX is your friend here), and then runs your filtering code in a loop for each image. Note that you should definitely keep the call to imwrite in your filtering code: In case the analysis crashes at image 48 (e.g. due to power failure), you don't want to lose all the previous work.
Running thusly in batch mode has two big advantages: (1) you can start running your code and go home for the week-end, and (2) you can easily parallelize this outside loop using PARFOR. However, with only a dual-core machine, it is unlikely that you get any significant improvements from parallelization - your OS also wants to run stuff at times, and the overhead of parallelization might be more than the gain from running two workers. Also, 2.5GB of RAM is seriously limiting.
As to your specific code: in my experience using IM2COL is often faster than NLFILTER. im2col creates a nElementsInMask-by-nMasks array out of your image, so that you can apply the filtering in one single operation. With a 7x7 window, the output of im2col will be 3000*2500*49 bytes, which is close to 400MB. Thus, it should just work. All that you need to do is rewrite ewvar so that it works on a 49x1 array of pixels that make up the pixels your mask, which will require some index juggling, if I understand your code correctly.

Parallelize or vectorize all-against-all operation on a large number of matrices?

I have approximately 5,000 matrices with the same number of rows and varying numbers of columns (20 x ~200). Each of these matrices must be compared against every other in a dynamic programming algorithm.
In this question, I asked how to perform the comparison quickly and was given an excellent answer involving a 2D convolution. Serially, iteratively applying that method, like so
list = who('data_matrix_prefix*')
H = cell(numel(list),numel(list));
for i=1:numel(list)
for j=1:numel(list)
if i ~= j
eval([ 'H{i,j} = compare(' char(list(i)) ',' char(list(j)) ');']);
end
end
end
is fast for small subsets of the data (e.g. for 9 matrices, 9*9 - 9 = 72 calls are made in ~1 s, 870 calls in ~2.5 s).
However, operating on all the data requires almost 25 million calls.
I have also tried using deal() to make a cell array composed entirely of the next element in data, so I could use cellfun() in a single loop:
# who(), load() and struct2cell() calls place k data matrices in a 1D cell array called data.
nextData = cell(k,1);
for i=1:k
[nextData{:}] = deal(data{i});
H{:,i} = cellfun(#compare,data,nextData,'UniformOutput',false);
end
Unfortunately, this is not really any faster, because all the time is in compare(). Both of these code examples seem ill-suited for parallelization. I'm having trouble figuring out how to make my variables sliced.
compare() is totally vectorized; it uses matrix multiplication and conv2() exclusively (I am under the impression that all of these operations, including the cellfun(), should be multithreaded in MATLAB?).
Does anyone see a (explicitly) parallelized solution or better vectorization of the problem?
Note
I realize both my examples are inefficient - the first would be twice as fast if it calculated a triangular cell array, and the second is still calculating the self comparisons, as well. But the time savings for a good parallelization are more like a factor of 16 (or 72 if I install MATLAB on everyone's machines).
Aside
There is also a memory issue. I used a couple of evals to append each column of H into a file, with names like H1, H2, etc. and then clear Hi. Unfortunately, the saves are very slow...

Does
compare(a,b) == compare(b,a)
and
compare(a,a) == 1
If so, change your loop
for i=1:numel(list)
for j=1:numel(list)
...
end
end
to
for i=1:numel(list)
for j= i+1 : numel(list)
...
end
end
and deal with the symmetry and identity case. This will cut your calculation time by half.

The second example can be easily sliced for use with the Parallel Processing Toolbox. This toolbox distributes iterations of your code among up to 8 different local processors. If you want to run the code on a cluster, you also need the Distributed Computing Toolbox.
%# who(), load() and struct2cell() calls place k data matrices in a 1D cell array called data.
parfor i=1:k-1 %# this will run the loop in parallel with the parallel processing toolbox
%# only make the necessary comparisons
H{i+1:k,i} = cellfun(#compare,data(i+1:k),repmat(data(i),k-i,1),'UniformOutput',false);
%# if the above doesn't work, try this
hSlice = cell(k,1);
hSlice{i+1:k} = cellfun(#compare,data(i+1:k),repmat(data(i),k-i,1),'UniformOutput',false);
H{:,i} = hSlice;
end

If I understand correctly you have to perform 5000^2 matrix comparisons ? Rather than try to parallelise the compare function, perhaps you should think of your problem being composed of 5000^2 tasks ? The Matlab Parallel Compute Toolbox supports task-based parallelism. Unfortunately my experience with PCT is with parallelisation of large linear algebra type problems so I can't really tell you much more than that. The documentation will undoubtedly help you more.

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse