MATLAB parfor loop & random simulations

I am running multiple simulations using MATLAB's parfor loop. The simulations run in a time-slotted fashion.
In these simulations, flows arrive at a server, finish their sessions and leave the server. The arrival process is random, so the departure process is random as well. Each simulation has 100 arrivals, so for 2 simulations run under the parfor loop, the 100 arrivals are unlikely to occur at the same slots in both (in other words, it is unlikely that 2 simulations are perfectly identical).
I calculate some metrics at the end of each simulation. After running 20 simulations, I observe that the metric values are identical for some of them; see lines 2, 4, 6, 8, 12, 15, 17 and 20:
1 12.1380000000000 8.67000000000000e-07 2126951.79378669
2 38.3040000000000 2.73600000000000e-06 964079.569727887
3 7.95200000000000 5.68000000000000e-07 2724654.56890640
4 38.3040000000000 2.73600000000000e-06 964079.569727887
5 21.0080000000000 1.50057142857143e-06 1653785.58341616
6 38.3040000000000 2.73600000000000e-06 964079.569727887
7 21.0080000000000 1.50057142857143e-06 1653785.58341616
8 38.3040000000000 2.73600000000000e-06 964079.569727887
9 11.9820000000000 8.55857142857143e-07 1827114.39301842
10 7.63000000000000 5.45000000000000e-07 2662037.48091595
11 8.28400000000000 5.91714285714286e-07 2584669.40096182
12 38.3040000000000 2.73600000000000e-06 964079.569727887
13 16.7040000000000 1.19314285714286e-06 1745020.01488049
14 20.6480000000000 1.47485714285714e-06 1131378.20827498
15 38.3040000000000 2.73600000000000e-06 964079.569727887
16 9.45400000000000 6.75285714285714e-07 2330783.95992713
17 38.3040000000000 2.73600000000000e-06 964079.569727887
18 20.5960000000000 1.47114285714286e-06 1349336.77768965
19 9.54400000000000 6.81714285714286e-07 2344366.38257795
20 38.3040000000000 2.73600000000000e-06 964079.569727887
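A quick way to list which simulations produced identical metric rows is sketched below. The variable name metrics is hypothetical, and only the first five rows of the table above are copied in to keep the example short.

```matlab
% First five rows of the results table (one row per simulation).
% "metrics" is a hypothetical variable name used only for this sketch.
metrics = [12.138 8.67e-07            2126951.79378669;
           38.304 2.736e-06           964079.569727887;
           7.952  5.68e-07            2724654.56890640;
           38.304 2.736e-06           964079.569727887;
           21.008 1.50057142857143e-06 1653785.58341616];

% Group identical rows: groupIdx(k) is the group that row k belongs to.
[~, ~, groupIdx] = unique(metrics, 'rows');
groupIdx = groupIdx(:);

% Count the members of each group and keep the rows whose group has more than one.
counts = accumarray(groupIdx, 1);
duplicatedSims = find(counts(groupIdx) > 1).';

disp(duplicatedSims);
```

Exact equality is used on purpose here: rows that agree to the last printed digit suggest the same random draws rather than a numerical coincidence.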
Getting identical results makes me think that the simulations themselves were perfectly identical, meaning that arrivals to and departures from the server occur at the same slots in different simulations.
Why am I getting identical results when the simulations are random?
If the simulations are perfectly identical, they are not useful to me, since I want to use the results to determine confidence intervals.
I tried my best to make the question clear. Unfortunately, I cannot post the code here (the main function is 1xxx lines long).
I tried to simulate the arrival process only.
The script to launch simulations in parallel:
numSim = 20;
numSlotPerSec = 10;
arrivalRatesVec = [0.2 0.3];
parfor i = 1:numSim
    [arrivals] = simulationFunction(numSlotPerSec, i, arrivalRatesVec);
end
The function that runs the simulations:
function [arrivals] = simulationFunction(numSlotPerSec, numSim, arrivalRatesVec)
    arrivals = [];
    numSlots = 1;
    % First arrival: exponential inter-arrival time converted to slots
    nextArrival = 1 + round(numSlotPerSec*exprnd(1/sum(arrivalRatesVec)));
    arrivals = [arrivals nextArrival];
    output_file = ['ResTest_' num2str(numSim)];
    while numSlots < 10000
        numSlots = numSlots + 1;
        % Each following arrival is at least one slot after the previous one
        nextArrival = nextArrival + max(1, round(numSlotPerSec*exprnd(1/sum(arrivalRatesVec))));
        arrivals = [arrivals nextArrival];
    end
    % Store the arrivals under a per-simulation variable name and save them
    eval(['arrivals_' num2str(numSim) '= arrivals']);
    save(output_file,['arrivals_', num2str(numSim)]);
end
I noticed that arrivals occur in different slots, but I still don't understand why my metrics are perfectly identical.
Can the eval function cause problems when it is called in a parfor loop? I saw in the MATLAB documentation that it might not access the correct workspace.
I launched 20 simulations again in parallel, just to see whether the arrival events happen at the same slots.
The figures (corresponding to simulations 2 and 13) show that users arrive at the server at the same slots (see the "entry" column for arrivals and the "exit" column for departures),
which means that these simulations are identical.
I also checked whether eval causes the problem: I saved one of my metrics without using eval, and it turns out that there are still duplicated values.
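One way to rule out the random number generator state as the source of the duplicates is to give every simulation its own substream, selected by the loop index. This is only a sketch under assumptions: baseSeed and numArrivals are made up, and the exponential gaps are drawn by inversion from an explicit RandStream instead of calling exprnd, so nothing depends on the global stream of a worker.

```matlab
% Sketch: one independent random substream per simulation index.
% baseSeed and numArrivals are illustrative assumptions.
numSim = 4;
numArrivals = 100;
numSlotPerSec = 10;
arrivalRatesVec = [0.2 0.3];
baseSeed = 12345;
meanInterArrival = 1/sum(arrivalRatesVec);

arrivals = cell(numSim, 1);
for i = 1:numSim    % the body has no cross-iteration state, so parfor could be used instead
    % 'mrg32k3a' supports substreams; the same seed with a different
    % substream gives statistically independent sequences.
    s = RandStream('mrg32k3a', 'Seed', baseSeed);
    s.Substream = i;
    % Exponential inter-arrival times by inversion, drawn from stream s only.
    gaps = -meanInterArrival * log(rand(s, 1, numArrivals));
    slots = max(1, round(numSlotPerSec * gaps));
    arrivals{i} = cumsum(slots);
end
```

If the duplicates disappear with explicit substreams, something in the main function is probably resetting or sharing the generator state; if they remain, the cause lies elsewhere.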


Matlab - Chien Search implementation initialisation

I'm trying to implement a DVB-S2 (48408, 48600) BCH decoder and I'm having trouble finding the roots of the locator polynomial. For the Chien search, the author here initialises the registers taking into account that the code is shortened, subtracting 48600 from (2^16 - 1). Why so?
This is the code I have so far:
function [error_locations, errors] = compute_chien_search(n, locator_polynomial, field, alpha_powers, n_max)
    t = length(locator_polynomial);
    error_locations = zeros(t, 1);
    errors = 0;
    % Init the registers with the locator polynomial coefficients.
    coefficient_buffer = locator_polynomial;
    alpha_degrees = uint32(1:t)';
    alpha_polynoms = field(alpha_degrees);
    alpha_polynoms = [1; alpha_polynoms];
    for i = 1:n
        for j = 2:t
            coefficient_buffer(j) = gf_mul_elements(coefficient_buffer(j), ...
                alpha_polynoms(j), ...
                field, alpha_powers, n_max);
        end
        % Compute locator polynomial at position i
        tmp = 0;
        for j = 2:t
            tmp = bitxor(tmp, coefficient_buffer(j));
        end
        % Signal the error
        if tmp == 1
            errors = errors + 1;
            error_locations(errors) = n_max - i + 1;
        end
    end
end
It almost gives me the correct result, except for some error locations. For example, for errors made in positions
418 14150 24575 25775 37403
the code above gives me
48183 47718 34451 24026 22826
which, after subtracting from 48600, gives
417 882 14149 24574 25774
which is the position minus 1, except for 37403, which it did not find.
What am I missing?
The code in question is a DVB-S2 12-error-correcting (48408, 48600) BCH code. The generator polynomial has degree 192 and is obtained by multiplying the 12 minimal polynomials given in the standard's documentation.
Update - I created an example C program using Windows | Visual Studio for BCH(48600,48408). On my desktop (Intel 3770K 3.5 GHz, Win 7, VS 2015), encoding takes about 30 µs and a 12-bit error correction takes about 4.5 ms. On my laptop (Intel Core i7-10510U up to 4.9 GHz, Win 10, VS 2019), a 12-bit error correction takes about 3.0 ms. I used a carryless multiply intrinsic to simplify generating the 192-bit polynomial, but this is a constant that is generated only once. Encode uses a [256][3] table of 64-bit unsigned integers (192-bit polynomials) and decode uses a [256][12] table of 16-bit unsigned integers (syndromes), so both process a byte at a time.
The code includes both Berlekamp Massey and Sugiyama extended Euclid decoders that I copied from existing RS code I have. For BCH (not RS) code, the Berlekamp Massey discrepancy will be zero on odd steps, so for odd steps, the discrepancy is not calculated (the iteration count since last update is incremented, the same as when a calculated discrepancy is zero). I didn't see a significant change in running time, but I left the check in there.
The run times are about the same for BM or Euclid.
I suspect an overflow problem in the case of a failure at bit error index 37403, since it is the only bit index > 2^15-1 (32767). There is this comment on that web site:
This code is great. However, it does not work for the large block sizes in the DVB-S2
specification. For example, it doesn't work with:
n = 16200;
n_max = 65535;
k_max = 65343;
t = 12;
prim_poly = 65581;
The good news is that the problem is easy to fix. Replace all the uint16() functions
in the code with uint32(). You will also have to run the following Matlab function
once. It took several hours for gftable() to complete on my computer.
gftable(16, 65581); (hex 1002D => x^16 + x^5 + x^3 + x^2 + 1)
The Chien search should look for roots among the values 1/(2^0) to 1/(2^48599), where 2 denotes the field's primitive element rather than the integer. Taking zero minus the log of each root gives the offset relative to the rightmost bit of the message, and 48599 - offset gives the index relative to the leftmost bit of the message.
If the coefficients of the error locator polynomial are reversed, then the Chien search would be looking for values 2^(0) to 2^(48599).
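The mapping between roots, logs and bit indexes can be checked with a small sketch. It builds the power and log tables for GF(2^16) with the primitive polynomial 65581 (hex 1002D) by repeated multiplication by 2, which takes a fraction of a second, and then round-trips the missing bit index 37403. The variable names are made up for the example, and plain doubles are used so that no intermediate value can overflow a 16-bit type.

```matlab
m = 16;
primPoly = 65581;      % hex 1002D => x^16 + x^5 + x^3 + x^2 + 1
nMax = 2^m - 1;        % 65535, length of the unshortened code
n = 48600;             % length of the shortened DVB-S2 code

% alphaPowers(k+1) = alpha^k and logTable(v) = k such that alpha^k = v
alphaPowers = zeros(nMax, 1);
logTable = zeros(nMax, 1);
v = 1;
for k = 0:nMax-1
    alphaPowers(k+1) = v;
    logTable(v) = k;
    v = 2*v;                       % multiply by alpha (the element x)
    if v >= 2^m
        v = bitxor(v, primPoly);   % reduce modulo the primitive polynomial
    end
end

% Bit index counted from the leftmost bit -> offset from the rightmost bit
bitIndex = 37403;
offset = (n - 1) - bitIndex;

% The Chien search root for that offset is alpha^(-offset)
rootLog = mod(-offset, nMax);
root = alphaPowers(rootLog + 1);

% Going back: zero minus the log of the root gives the offset again
recoveredOffset = mod(-logTable(root), nMax);
recoveredIndex = (n - 1) - recoveredOffset;
```

For this index the log of the root is 54339, which is above 32767, so any signed 16-bit intermediate in the search would not be able to hold it.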

MATLAB is too slow calculating Spearman's rank correlation for 9-element vectors

I need to calculate Spearman's rank correlation (using the corr function) for pairs of vectors of different lengths (for example, from 5-element to 20-element vectors). There are usually more than 300 pairs for each length. I track the progress with waitbar. I have noticed that it takes an unusually long time for pairs of 9-element vectors, whereas for other lengths (both greater and smaller) it takes very little time. Since the formula is exactly the same, the problem must originate in the MATLAB function corr.
I wrote the following code to verify that the problem is with the corr function and not with the other calculations I have besides corr, all of which (including corr) take place inside 2 or 3 nested for loops. The code repeats the timing 50 times to avoid accidental results.
The result is a bar graph confirming that it takes a long time for MATLAB to calculate Spearman's rank correlation for 9-element vectors. Since my calculations are not that heavy, this problem does not cause an endless wait; it just increases the total time consumed by the whole process. Can someone tell me what causes the problem and how to avoid it?
Times1 = zeros(20,50);
for i = 5:20
    for j = 1:50
        A = rand(i,2);
        tic; % start the timer for this call
        [r,p] = corr(A(:,1),A(:,2),'type','Spearman');
        Times1(i,j) = toc;
    end
end
Times2 = mean(Times1,2);
bar(Times2);
xlabel('number of elements in vectors');
ylabel('average time');
After some investigation, I think I found the root of this very interesting problem. My tests have been conducted profiling every outer iteration using the built-in Matlab profiler, as follows:
res = cell(20,1);
for i = 5:20
    profile clear;
    profile on -history;
    for j = 1:50
        uni = rand(i,2);
        [rho,pval] = corr(uni(:,1),uni(:,2),'type','Spearman');
    end
    profile off;
    p = profile('info');
    res{i} = p.FunctionTable;
end
The produced output looks like this:
The first thing I noticed is that the Spearman correlation for matrices with a number of rows less than or equal to 9 is computed in a different way than for matrices with 10 or more rows. For the former, the functions being internally called by the corr function are:
Function Number of Calls
----------------------- -----------------
'factorial' 100
'tiedrank>tr' 100
'tiedrank' 100
'corr>pvalSpearman' 50
'corr>rcumsum' 50
'perms>permsr' 50
'perms' 50
'corr>spearmanExactSub' 50
'corr>corrPearson' 50
'corr>corrSpearman' 50
'corr' 50
'parseArgs' 50
'parseArgs' 50
For the latter, the functions being internally called by the corr function are:
Function Number of Calls
----------------------- -----------------
'tiedrank>tr' 100
'tiedrank' 100
'corr>AS89' 50
'corr>pvalSpearman' 50
'corr>corrPearson' 50
'corr>corrSpearman' 50
'corr' 50
'parseArgs' 50
'parseArgs' 50
Since the computation of the Spearman correlation for matrices with 10 or more rows runs quickly and shows no evidence of performance bottlenecks, I decided not to spend time investigating it and focused on the main concern: the small matrices.
I tried to understand the difference between the execution time of the whole process for a matrix with 5 rows and for one with 9 rows (the one notably showing the worst performance). This is the code I used:
res5 = res{5,1};
res5_tt = [res5.TotalTime];
res5_tt_perc = ((res5_tt ./ sum(res5_tt)) .* 100).';
res9_tt = [res{9,1}.TotalTime];
res9_tt_perc = ((res9_tt ./ sum(res9_tt)) .* 100).';
res_diff = res9_tt_perc - res5_tt_perc;
[~,res_diff_sort] = sort(res_diff,'desc');
tab = [cellstr(char(res5.FunctionName)) num2cell([res5_tt_perc res9_tt_perc res_diff])];
tab = tab(res_diff_sort,:);
tab = cell2table(tab,'VariableNames',{'Function' 'TT_M5' 'TT_M9' 'DIFF'});
And here is the result:
Function TT_M5 TT_M9 DIFF
_______________________ _________________ __________________ __________________
'corr>spearmanExactSub' 7.14799963478685 16.2879721171023 9.1399724823154
'corr>pvalSpearman' 7.98185309750143 16.3043118970503 8.32245879954885
'perms>permsr' 3.47311716905926 8.73599255035966 5.26287538130039
'perms' 4.58132952553723 8.77488502392486 4.19355549838763
'corr>corrSpearman' 15.629476293326 16.440893059217 0.811416765890929
'corr>rcumsum' 0.510550019981949 0.0152486312660671 -0.495301388715882
'factorial' 0.669357868472376 0.0163923929871943 -0.652965475485182
'parseArgs' 1.54242684137027 0.0309456171268161 -1.51148122424345
'tiedrank>tr' 2.37642998160463 0.041010720272735 -2.3354192613319
'parseArgs' 2.4288171135289 0.0486075856244615 -2.38020952790444
'corr>corrPearson' 2.49766877262937 0.0484657591710417 -2.44920301345833
'tiedrank' 3.16762535118088 0.0543584195582888 -3.11326693162259
'corr' 21.8214856092549 16.5664346332513 -5.25505097600355
Once the bottleneck was detected, I started analyzing the internal code (open corr) and I finally found the cause of the problem. Within the spearmanExactSub, this part of code is being executed (where n is the number of rows of the matrix):
n = arg1;
nfact = factorial(n);
Dperm = sum((repmat(1:n,nfact,1) - perms(1:n)).^2, 2);
All permutations of a vector whose values range from 1 to n are computed. This is what increases the computational complexity (and, obviously, the computation time) of the function. Other operations, like the subsequent repmat of 1:n over factorial(n) rows and the ones below that point, make the situation worse. Now, long story short...
factorial(5) = 120
factorial(6) = 720
factorial(7) = 5040
factorial(8) = 40320
factorial(9) = 362880
can you see the reason why, between 5 and 9, your bar graph shows an "exponentially" increasing computational time?
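To put numbers on it, here is a rough sketch of how large the matrices built by perms(1:n) and repmat(1:n,nfact,1) become, assuming both are stored as doubles (8 bytes per element):

```matlab
n = 5:9;
rows = factorial(n);               % number of permutations, one per row
bytesPerMatrix = rows .* n * 8;    % nfact-by-n matrix of doubles
megabytes = bytesPerMatrix / 2^20;

for k = 1:numel(n)
    fprintf('n = %d: %7d permutations, %8.3f MiB per matrix\n', ...
        n(k), rows(k), megabytes(k));
end
```

Both perms(1:n) and the repmat result have this size, and their difference and its square create further temporaries of the same shape, on every single call to corr.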
On a side note, there is nothing you can do to solve this problem, unless you find another implementation of the Spearman correlation that doesn't present the same bottleneck or you implement your own.
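As an illustration of the "implement your own" route, here is a minimal sketch that computes only the coefficient, with no p-value and therefore no permutations. It assumes the data contain no ties (reasonable for continuous random values) and uses the classic formula rho = 1 - 6*sum(d^2)/(n*(n^2-1)); the sample vectors are made up.

```matlab
% Hypothetical sample data without ties
x = [0.81; 0.12; 0.57; 0.33; 0.95];
y = [0.20; 0.10; 0.70; 0.40; 0.90];
n = numel(x);

% Ranks via sort: the element at sorted position k gets rank k
[~, ix] = sort(x);
rx = zeros(n, 1);
rx(ix) = (1:n).';
[~, iy] = sort(y);
ry = zeros(n, 1);
ry(iy) = (1:n).';

% Spearman's rho from the rank differences (valid only without ties)
d = rx - ry;
rho = 1 - 6*sum(d.^2) / (n*(n^2 - 1));
```

With ties, average ranks and a Pearson correlation of the ranks would be needed instead, and a p-value would have to come from a separate approximation.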

Speedup constrained shuffling. GPU (Tesla K40m), CPU parallel computations in MATLAB

I have 100 lamps. They are blinking. I observe them for some time. For each lamp I calculate the mean, std and autocorrelation of the intervals between blinks.
Now I need to resample the observed data and keep the permutations in which all parameters (mean, std, autocorrelation) are inside some range. The code I have works well, but it takes too long (a week) for each round of the experiment. I run it on a computing server with 12 cores and 2 Tesla K40m GPUs (details are at the end).
My code:
close all
clear all
% open parpool skip error if it was opened
try parpool(24); end
% Sample input. It is faked, just for demo.
% Number of "lamps" and number of "blinks" are similar to real.
NLamps = 10^2;
NBlinks = 2*10^2;
Events = cumsum([randg(9,NLamps,NBlinks)],2); % each row - different "lamp"
% Define parameters
nLags=2; % I need to keep autocorrelation with lags 1-2
alpha=[0.01,0.1]; % range of allowed relative deviation from observed
% parameters should be > 0 to avoid generating original
% sequence
nPermutations=10^2; % In original code 10^5
% Processing of experimental data
Intervals=cellfun(@(x) diff(x),Events,'UniformOutput',false);
observedParams=cellfun(@(x) fGetParameters(x,nLags),Intervals,'UniformOutput',false);
% Constrained shuffling. EXPENSIVE PART!!!
while true
    parfor iPermutation=1:nPermutations
        % Shuffle intervals
        shuffledIntervals=cellfun(@(x,y) fPermute(x,y),Intervals,DurationOfExperiment,'UniformOutput',false);
        % get parameters of shuffled intervals
        shuffledParameters=cellfun(@(x) fGetParameters(x,nLags),shuffledIntervals,'UniformOutput',false);
        % get relative deviation
        % find shuffled Lamps, which are inside alpha range
        MaximumDeviation=max(delta,[] ,2);
        MinimumDeviation=min(delta,[] ,2);
        % if shuffling of ANY lamp was successful, save these Intervals
        if ~isempty(LampID)
            parsave( LampID,shuffledIntervals,shuffledParameters);
        end
    end
end
function [ params ] = fGetParameters( intervals,nLags )
    % Calculate [mean,std,autocorrelations with lags from 1 to nLags]
    for lag=1:nLags
        R(lag) = corr(intervals(1:end-lag)',intervals((1+lag):end)','type','Spearman');
    end
    params = [mean(intervals),std(intervals),R];
end
function [ Intervals ] = fPermute( Intervals,Duration )
    % Create long shuffled time-series
    % Keep the same duration
    % Calculate Intervals
end
function parsave( LampID,Intervals,params)
end
Server specs:
CUDADevice with properties:
Name: 'Tesla K40m'
Index: 1
ComputeCapability: '3.5'
SupportsDouble: 1
DriverVersion: 8
ToolkitVersion: 8
MaxThreadsPerBlock: 1024
MaxShmemPerBlock: 49152
MaxThreadBlockSize: [1024 1024 64]
MaxGridSize: [2.1475e+09 65535 65535]
SIMDWidth: 32
TotalMemory: 1.1979e+10
AvailableMemory: 1.1846e+10
MultiprocessorCount: 15
ClockRateKHz: 745000
ComputeMode: 'Default'
GPUOverlapsTransfers: 1
KernelExecutionTimeout: 0
CanMapHostMemory: 1
DeviceSupported: 1
DeviceSelected: 1
>> feature('numcores')
MATLAB detected: 12 physical cores.
MATLAB detected: 24 logical cores.
MATLAB was assigned: 24 logical cores by the OS.
MATLAB is using: 12 logical cores.
MATLAB is not using all logical cores because hyper-threading is enabled.
>> system('for /f "tokens=2 delims==" %A in (''wmic cpu get name /value'') do @(echo %A)')
Intel(R) Xeon(R) CPU E5-2630 v2 @ 2.60GHz
Intel(R) Xeon(R) CPU E5-2630 v2 @ 2.60GHz
>> memory
Maximum possible array: 496890 MB (5.210e+11 bytes) *
Memory available for all arrays: 496890 MB (5.210e+11 bytes) *
Memory used by MATLAB: 18534 MB (1.943e+10 bytes)
Physical Memory (RAM): 262109 MB (2.748e+11 bytes)
* Limited by System Memory (physical + swap file) available.
Is it possible to speed up my calculation? I am thinking about CPU+GPU computing, but I could not work out how to do it (I have no experience with gpuArrays). Moreover, I am not sure it is a good idea. Sometimes algorithm optimisation gives a bigger gain than parallel computing.
The saving step is not the bottleneck: it happens once every 10-30 minutes in the best case.
GPU-based processing is only available on some functions and with the right cards (if I remember correctly).
For the GPU part of your question: MATLAB has a list of functions that you can run on the GPU. The most expensive part of your code is the function corr, which unfortunately isn't on the list.
If the profiler isn't highlighting bottlenecks - something weird is going on... So I ran some tests on your code above:
nPermutations = 10^0 iteration takes ~0.13 seconds
nPermutations = 10^1 iteration takes ~1.3 seconds
nPermutations = 10^3 iteration takes ~130 seconds
nPermutations = 10^4 probably takes ~1300 seconds
nPermutations = 10^5 probably takes ~13000 seconds
Which is a lot less than a week...
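The extrapolation is just a linear scaling of the measured ~0.13 seconds per permutation; a small sketch of the arithmetic, converted to hours for comparison with a week:

```matlab
secondsPerPermutation = 0.13;         % measured for nPermutations = 10^0
nPermutations = 10.^[0 1 3 4 5];

% Linear extrapolation: total time grows in proportion to nPermutations
estimatedSeconds = secondsPerPermutation * nPermutations;
estimatedHours = estimatedSeconds / 3600;
hoursInAWeek = 7 * 24;

for k = 1:numel(nPermutations)
    fprintf('nPermutations = %6d: ~%8.2f s (~%.2f h)\n', ...
        nPermutations(k), estimatedSeconds(k), estimatedHours(k));
end
```

This only covers one pass of the parfor loop; the estimate says nothing about how many passes of the outer while loop are needed.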
Note that I put a break in your while statement, as I couldn't see anywhere in your code where you ever break out of the while loop. I hope for your sake that this isn't the reason your function would run forever...
while true
    parfor iPermutation=1:nPermutations
        % Shuffle intervals
        shuffledIntervals=cellfun(@(x,y) fPermute(x,y),Intervals,DurationOfExperiment,'UniformOutput',false);
        % get parameters of shuffled intervals
        shuffledParameters=cellfun(@(x) fGetParameters(x,nLags),shuffledIntervals,'UniformOutput',false);
        % get relative deviation
        % find shuffled Lamps, which are inside alpha range
        MaximumDeviation=max(delta,[] ,2);
        MinimumDeviation=min(delta,[] ,2);
        % if shuffling of ANY lamp was successful, save these Intervals
        if ~isempty(LampID)
            parsave( LampID,shuffledIntervals,shuffledParameters);
        end
    end
    break % You need to break out of the loop at some point
          % otherwise it would run forever....
end

matlab parfor and spmd doesn't work

The script is as follows:
Lambdass = [0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000];
numcores = feature('numcores'); % get the number of cpu cores
num_slices = floor(length(Lambdass)/numcores); % get the number of slices for parallel computing
if mod(length(Lambdass), numcores)~=0
    num_slices = num_slices + 1;
end
for slice_i=1:num_slices
    if slice_i~=num_slices
        Lambdas = Lambdass(((slice_i-1)*numcores+1):((slice_i)*numcores));
    else
        Lambdas = Lambdass(((slice_i-1)*numcores+1):end);
    end
    % start the parallel processing
    myparpool = parpool(length(Lambdas))
    parfor li = 1:length(Lambdas)
        % spmd
        lambda = Lambdas(li);
        save_path1 = sprintf('results/lambda_%f/', lambda);
        if ~exist([save_path1, '/fs_results.mat'], 'file')
            do_something_and_save_results(lambda, save_path1);
        end
    end
end
This script runs correctly on one computer, but on another computer parfor does not seem to work correctly: the warning below is shown, and parfor finally runs without parallelism, just like a sequential for loop. Could anyone give some advice, please?
Starting parallel pool (parpool) using the 'local' profile ... Warning: Could not launch SMPD process manager. Using fallback parallel mechanism.
> In SmpdGateway>SmpdGateway.canUseSmpd at 81
In Local.hSubmitCommunicatingJob at 15
In CJSCommunicatingJob>CJSCommunicatingJob.submitOneJob at 81
In Job.Job>Job.submit at 302
In InteractiveClient>InteractiveClient.start at 327
In Pool.Pool>iStartClient at 537
In Pool.Pool>Pool.hBuildPool at 434
In parpool at 104
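Separately from the warning, the slicing logic in the script can be written more compactly with ceil and min. The sketch below only shows the index arithmetic; numcores is fixed to a hypothetical value of 3 so that the last slice is shorter than the others, and no pool is started.

```matlab
Lambdass = [0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000];
numcores = 3;    % hypothetical core count, chosen so the last slice is partial

% ceil replaces the floor + "add one if there is a remainder" logic
num_slices = ceil(numel(Lambdass) / numcores);

slices = cell(num_slices, 1);
for slice_i = 1:num_slices
    first = (slice_i - 1)*numcores + 1;
    last = min(slice_i*numcores, numel(Lambdass));   % clamp the last slice
    slices{slice_i} = Lambdass(first:last);
end
```

Each slices{slice_i} corresponds to the Lambdas vector of one pass of the outer loop in the script.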

Matlab lsqnonlin() exitflag=4

I am optimizing some test data using lsqnonlin (i.e. data simulated from known parameter values).
maturity=[1 3 6 9 12 15 18 21 24 30 36 48 60 72 84 96 108 120]'; %maturities
options=optimset('Algorithm',{'levenberg-marquardt',.01},'Display','iter','TolFun',10^(-20),'TolX',10^-3,'MaxFunEvals',10000,'MaxIter',10000); %LM
vp0=[0.99 0.94 0.84 0.0802 -0.0144 -0.0042 0.001693 0.004094 0.003256 log(0.000960765^2) 0.077]'; %LM
[vpML,resnorm,residual,exitflag,output,lambda,jacobian]=lsqnonlin(@(vp) DNS_LL_LM(vp,y,maturity),vp0,[],[],options); %LM
I want convergence to be declared when the norm of the parameter vector changes by less than 10^-6.
As 'TolX' refers to the raw changes in the parameter vector, I use 10^-3 as the tolerance on X, which when squared would give the desired norm of 10^-6.
However, when I run the code, the exit flag keeps coming up as exitflag=4: "Magnitude of search direction was smaller than the specified tolerance."
But there is nowhere to set the tolerance for the search direction.
In the options you can only set "TolX" and "TolFun".
So how can I force the optimization to keep running until my desired convergence criterion is met?
Kind Regards
OK, I went into the code, and there seems to be some disconnect between the code and the exit flags as described here:
For example, exitflag 2, which according to the link above is supposed to indicate that the change in x was less than the tolerance, is in fact used here to indicate that the Jacobian is undefined:
if undefJac
    msgFlag = 26;
    msgData = {'levenbergMarquardt',msgFlag,verbosity > 0,detailedExitMsg,caller, ...
        [], [], []};
    done = true;
end
The description of exitflag 4 on the mathworks page is a little vague but you can see what it is doing below:
if norm(step) < tolX*(sqrtEps + norm(XOUT))
msgData = {'levenbergMarquardt',EXITFLAG,verbosity > 0,detailedExitMsg,caller, ...
done = true;
It seems that it is testing whether the norm of the step is less than the tolerance on X times the norm of X. This is along the lines of what I want, and can easily be changed to give me exactly what I want.
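To see what that test means in numbers, here is a small sketch that evaluates the threshold tolX*(sqrtEps + norm(XOUT)) at the starting point vp0 from the question. It assumes sqrtEps is sqrt(eps), which is what the name suggests but is not shown in the excerpt, and desiredAbsStep is a name made up for the target of 10^-6.

```matlab
vp0 = [0.99 0.94 0.84 0.0802 -0.0144 -0.0042 0.001693 0.004094 0.003256 ...
       log(0.000960765^2) 0.077]';
tolX = 1e-3;
sqrtEps = sqrt(eps);    % assumption: value of sqrtEps in the excerpt

% Step norm below which the test in the excerpt stops the iteration at vp0
stepThreshold = tolX * (sqrtEps + norm(vp0));

% TolX that would make the same test trigger at an absolute step norm of 1e-6
desiredAbsStep = 1e-6;
tolXForDesired = desiredAbsStep / (sqrtEps + norm(vp0));

fprintf('norm(vp0) = %.4f, step threshold = %.3e, TolX for 1e-6 = %.3e\n', ...
    norm(vp0), stepThreshold, tolXForDesired);
```

Because the threshold scales with norm(X), which here is dominated by the log-variance parameter, TolX = 10^-3 stops the solver at steps several orders of magnitude larger than 10^-6. The threshold also moves as X changes during the iterations, so the last value is only a starting estimate.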