Speed up constrained shuffling: GPU (Tesla K40m), CPU parallel computations in MATLAB

I have 100 lamps. They are blinking. I observe them for some time. For each lamp I calculate the mean, std, and autocorrelation of the intervals between blinks.
Now I need to resample the observed data and keep only the permutations whose parameters (mean, std, autocorrelation) all stay inside some range. The code I have works, but it takes too long (a week) for each round of the experiment. I run it on a computing server with 12 cores and 2 Tesla K40m GPUs (details at the end).
My code:
close all
clear all
clc
% open a parpool; skip the error if a pool is already open
try parpool(24); end
% Sample input. It is faked, just for demo.
% Number of "lamps" and number of "blinks" are similar to real.
NLamps = 10^2;
NBlinks = 2*10^2;
Events = cumsum([randg(9,NLamps,NBlinks)],2); % each row - different "lamp"
DurationOfExperiment=Events(:,end).*1.01;
%% MAIN
% Define parameters
nLags=2; % I need to keep autocorrelation with lags 1-2
alpha=[0.01,0.1]; % range of allowed relative deviation from observed
% parameters should be > 0 to avoid generating original
% sequence
nPermutations=10^2; % In original code 10^5
% Processing of experimental data
DurationOfExperiment=num2cell(DurationOfExperiment);
Events=num2cell(Events,2);
Intervals=cellfun(@(x) diff(x),Events,'UniformOutput',false);
observedParams=cellfun(@(x) fGetParameters(x,nLags),Intervals,'UniformOutput',false);
observedParams=cell2mat(observedParams);
% Constrained shuffling. EXPENSIVE PART!!!
while true
    parfor iPermutation=1:nPermutations
        % Shuffle intervals
        shuffledIntervals=cellfun(@(x,y) fPermute(x,y),Intervals,DurationOfExperiment,'UniformOutput',false);
        % Get parameters of shuffled intervals
        shuffledParameters=cellfun(@(x) fGetParameters(x,nLags),shuffledIntervals,'UniformOutput',false);
        shuffledParameters=cell2mat(shuffledParameters);
        % Get relative deviation
        delta=abs((shuffledParameters-observedParams)./observedParams);
        % Find shuffled lamps which are inside the alpha range
        MaximumDeviation=max(delta,[],2);
        MinimumDeviation=min(delta,[],2);
        LampID=find(and(MaximumDeviation<alpha(2),MinimumDeviation>alpha(1)));
        % If shuffling of ANY lamp was successful, save these intervals
        if ~isempty(LampID)
            shuffledIntervals=shuffledIntervals(LampID);
            shuffledParameters=shuffledParameters(LampID,:);
            parsave(LampID,shuffledIntervals,shuffledParameters);
            disp('DONE')
        end
    end
end
%% FUNCTIONS
function [ params ] = fGetParameters( intervals,nLags )
% Calculate [mean, std, autocorrelations with lags from 1 to nLags]
R=nan(1,nLags);
for lag=1:nLags
    R(lag) = corr(intervals(1:end-lag)',intervals((1+lag):end)','type','Spearman');
end
params = [mean(intervals),std(intervals),R];
end
%--------------------------------------------------------------------------
function [ Intervals ] = fPermute( Intervals,Duration )
% Create a long shuffled time series
Time=cumsum([0,datasample(Intervals,numel(Intervals)*3)]);
% Keep the same duration
Time(Time>Duration)=[];
% Calculate intervals
Intervals=diff(Time);
end
%--------------------------------------------------------------------------
function parsave( LampID,Intervals,params )
save([num2str(randi(10^9)),'.mat'],'LampID','Intervals','params')
end
Server specs:
>>gpuDevice()
CUDADevice with properties:
Name: 'Tesla K40m'
Index: 1
ComputeCapability: '3.5'
SupportsDouble: 1
DriverVersion: 8
ToolkitVersion: 8
MaxThreadsPerBlock: 1024
MaxShmemPerBlock: 49152
MaxThreadBlockSize: [1024 1024 64]
MaxGridSize: [2.1475e+09 65535 65535]
SIMDWidth: 32
TotalMemory: 1.1979e+10
AvailableMemory: 1.1846e+10
MultiprocessorCount: 15
ClockRateKHz: 745000
ComputeMode: 'Default'
GPUOverlapsTransfers: 1
KernelExecutionTimeout: 0
CanMapHostMemory: 1
DeviceSupported: 1
DeviceSelected: 1
>> feature('numcores')
MATLAB detected: 12 physical cores.
MATLAB detected: 24 logical cores.
MATLAB was assigned: 24 logical cores by the OS.
MATLAB is using: 12 logical cores.
MATLAB is not using all logical cores because hyper-threading is enabled.
>> system('for /f "tokens=2 delims==" %A in (''wmic cpu get name /value'') do @(echo %A)')
Intel(R) Xeon(R) CPU E5-2630 v2 @ 2.60GHz
Intel(R) Xeon(R) CPU E5-2630 v2 @ 2.60GHz
>> memory
Maximum possible array: 496890 MB (5.210e+11 bytes) *
Memory available for all arrays: 496890 MB (5.210e+11 bytes) *
Memory used by MATLAB: 18534 MB (1.943e+10 bytes)
Physical Memory (RAM): 262109 MB (2.748e+11 bytes)
* Limited by System Memory (physical + swap file) available.
Question:
Is it possible to speed up my calculation? I have been thinking about combined CPU+GPU computing, but I could not figure out how to do it (I have no experience with gpuArrays), and I am not sure it is a good idea: sometimes algorithmic optimisation gives a bigger payoff than parallel computing.
P.S.
The saving step is not the bottleneck: it happens once every 10-30 minutes at best.

GPU-based processing is only available for some functions and with the right cards (if I remember correctly).
For the GPU part of your question: MATLAB has a list of functions that can run on the GPU, and the most expensive part of your code, the function corr, unfortunately isn't on it.
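As a quick check you can also list the gpuArray-enabled functions of your own installation instead of consulting the online list:
methods('gpuArray')   % lists every function overloaded for gpuArray inputs; corr is not among them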
If the profiler isn't highlighting bottlenecks, something weird is going on... So I ran some tests on your code above:
nPermutations = 10^0 iteration takes ~0.13 seconds
nPermutations = 10^1 iteration takes ~1.3 seconds
nPermutations = 10^3 iteration takes ~130 seconds
nPermutations = 10^4 probably takes ~1300 seconds
nPermutations = 10^5 probably takes ~13000 seconds
That is a lot less than a week (13000 seconds is roughly 3.6 hours)...
Note that I added a break at the end of your while statement, because I couldn't see anywhere in your code where you ever break out of the while loop. I hope for your sake that this isn't the reason your function runs forever...
while true
    parfor iPermutation=1:nPermutations
        % Shuffle intervals
        shuffledIntervals=cellfun(@(x,y) fPermute(x,y),Intervals,DurationOfExperiment,'UniformOutput',false);
        % Get parameters of shuffled intervals
        shuffledParameters=cellfun(@(x) fGetParameters(x,nLags),shuffledIntervals,'UniformOutput',false);
        shuffledParameters=cell2mat(shuffledParameters);
        % Get relative deviation
        delta=abs((shuffledParameters-observedParams)./observedParams);
        % Find shuffled lamps which are inside the alpha range
        MaximumDeviation=max(delta,[],2);
        MinimumDeviation=min(delta,[],2);
        LampID=find(and(MaximumDeviation<alpha(2),MinimumDeviation>alpha(1)));
        % If shuffling of ANY lamp was successful, save these intervals
        if ~isempty(LampID)
            shuffledIntervals=shuffledIntervals(LampID);
            shuffledParameters=shuffledParameters(LampID,:);
            parsave(LampID,shuffledIntervals,shuffledParameters);
            disp('DONE')
        end
    end
    break % You need to break out of the loop at some point,
          % otherwise it would run forever...
end
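Beyond the missing break, one algorithmic note (a sketch, not a measured result): the cellfun calls with anonymous function handles execute once per lamp per permutation inside the hot parfor, and that call overhead adds up; a plain loop over the lamps avoids it while keeping the logic identical:
parfor iPermutation = 1:nPermutations
    shuffledParameters = nan(NLamps, 2 + nLags);   % columns: [mean, std, R(1..nLags)]
    shuffledIntervals  = cell(NLamps, 1);
    for iLamp = 1:NLamps
        s = fPermute(Intervals{iLamp}, DurationOfExperiment{iLamp});
        shuffledIntervals{iLamp}    = s;
        shuffledParameters(iLamp,:) = fGetParameters(s, nLags);
    end
    % ...acceptance test (delta, LampID) and parsave as before...
end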

Related

How to maximize MATLAB's GPU utility?

I've surveyed my GPU's performance against itself and the CPU for varying matrix sizes, and found the opposite of what most GPU literature suggests: the GPU's computing advantage diminishes with array size. Code, results, and specs are shown below. Noteworthy observations:
1. GPU utilization remains below 10%, according to Task Manager
2. ~(50%, 20%) = (RAM, CPU) usage for large (K > 9000) arrays
3. A considerable speed-ratio drop is observed around K > 8000
4. Splitting the K > 8000 (= 9000) Xga matrix into four increases vectorized speed two-fold
5. My GPU ranks far higher among GPUs than my CPU does among CPUs (#24 vs. #174); it thus seems an on-par CPU would outperform the GPU for larger arrays
6. The last plot's GPU vs. CPU benchmark supports (5); the GPU isn't as vastly superior as expected
What's the culprit: my code, MATLAB, or a hardware configuration that under-utilizes the GPU? How can I find out and resolve it?
%% CODE: centroid indexing in K-means algorithm
% size(X) = [16000, 3]
% size(centroids) = [K, 3]
% Xga = gpuArray(single(X)); cga = gpuArray(single(centroids));
% Speed ratio = t2/t1, if t2 > t1 - else, t1/t2
%% TIMING
f1 = fasterFunction(...);
f2 = slowerFunction(...);
t1 = gputimeit(f1) % OR timeit(f1) for non-GPU arrays
t2 = timeit(f2) % OR gputimeit(f2) for GPU arrays
%% FUNCTIONS
function out = vecHammer(X, c, K, m)
    [~, out] = min(reshape(permute(sum((X-permute(c,[3 2 1])).^2,2),[1 2 3]),m,K),[],2);
end
function out = forvecHammer(X, c, m)
    out = zeros(m,1);
    for j = 1:m
        [~,out(j)] = min(sum(((X(j,:))'-c').^2));
    end
end
function out = forforHammer(X,c,m,K)
    out = zeros(m,1); idxtemp = zeros(K,1);
    for i = 1:m
        for j = 1:K
            idxtemp(j) = sum((X(i,:)-c(j,:)).^2,2);
        end
        [~, out(i)] = min(idxtemp);
    end
end
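For completeness, this is how the functions above would be fed into that timing scheme (a sketch with illustrative sizes; gputimeit synchronizes the device, timeit is for host arrays):
K = 5000; m = 16000;                               % illustrative sizes
X   = rand(m,3,'single');  c   = rand(K,3,'single');
Xga = gpuArray(X);         cga = gpuArray(c);
tCPU = timeit(@() vecHammer(X, c, K, m));          % host timing
tGPU = gputimeit(@() vecHammer(Xga, cga, K, m));   % device timing
fprintf('speed ratio: %.2f\n', tCPU / tGPU);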
The probable answer is that the data was simply too small, and only so much can be parallelized; my GPU handles a gigabyte dataset at a few percent utilization, while this one barely measured up to 10 MB.

CUDA GPU time in MATLAB increasing when the grid size is increased

I am using MATLAB R2017a. I am running simple code to calculate the cumulative sum from the first point up to the i-th point.
My CUDA kernel code is:
__global__ void summ(const double *A, double *B, int N) {
    for (int i = threadIdx.x; i < N; i++) {
        B[i+1] = B[i] + A[i];
    }
}
My MATLAB code is:
k=parallel.gpu.CUDAKernel('summ.ptx','summ.cu');
n=10^7;
A=rand(n,1);
ans=zeros(n,1);
A1=gpuArray(A);
ans2=gpuArray(ans);
k.ThreadBlockSize = [1024,1,1];
k.GridSize = [3,1];
tic
G = feval(k,A1,ans2,n);
G1 = gather(G);
GPU_time = toc
I am wondering why the GPU time increases when I increase the grid size (k.GridSize). For instance, for 10^7 data points:
k.GridSize=[1,1] the time is 8.0748s
k.GridSize=[2,1] the time is 8.0792s
k.GridSize=[3,1] the time is 8.0928s
From what I understand, for 10^7 data points the system needs 10^7 / 1024 ≈ 9766 blocks, so the grid size should be [9766,1].
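For reference, that launch-configuration arithmetic would look like this (a sketch; ceil(10^7/1024) = 9766):
n = 1e7;
threadsPerBlock = 1024;
numBlocks = ceil(n / threadsPerBlock);   % = 9766 for n = 1e7
k.ThreadBlockSize = [threadsPerBlock, 1, 1];
k.GridSize = [numBlocks, 1];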
The GPU device is
Name: 'Tesla K20c'
Index: 1
ComputeCapability: '3.5'
SupportsDouble: 1
DriverVersion: 9.1000
ToolkitVersion: 8
MaxThreadsPerBlock: 1024
MaxShmemPerBlock: 49152
MaxThreadBlockSize: [1024 1024 64]
MaxGridSize: [2.1475e+09 65535 65535]
SIMDWidth: 32
TotalMemory: 5.2983e+09
AvailableMemory: 4.9132e+09
MultiprocessorCount: 13
ClockRateKHz: 705500
ComputeMode: 'Default'
GPUOverlapsTransfers: 1
KernelExecutionTimeout: 0
CanMapHostMemory: 1
DeviceSupported: 1
DeviceSelected: 1
Thank you for your response.
You appear to be worrying about a very, very small portion of the time compared to the overall effect. The real question you should be asking is: does this amount of time to solve this problem make sense? The answer to that is: no, absolutely not.
Here is modified code which should run much faster:
n=10^7;
dev = gpuDevice;
A = randn(n,1,'gpuArray');
B = randn(n,1,'gpuArray');
tic
G = A+cumsum(B);
wait(dev)
toc
On my 1060 this runs in 0.03 seconds. For even faster speeds you can use single precision.
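A sketch of that single-precision variant (assuming the rest of the benchmark is unchanged):
A = randn(n,1,'single','gpuArray');
B = randn(n,1,'single','gpuArray');
tic
G = A + cumsum(B);   % same computation in single precision
wait(dev)
toc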
At any rate, that 0.02 seconds is easily attributable to small fluctuations in the load on your GPU; that is a much more likely explanation than anything to do with grid sizes.

MATLAB is too slow at calculating Spearman's rank correlation for 9-element vectors

I need to calculate Spearman's rank correlation (using the corr function) for pairs of vectors of different lengths (for example, 5-element up to 20-element vectors). There are usually more than 300 pairs for each length. I track the progress with waitbar, and I have noticed that it takes an unusually long time for 9-element pairs of vectors, whereas for other lengths (both greater and smaller) it runs very quickly. Since the formula is exactly the same, the problem must originate in the MATLAB function corr.
I wrote the following code to verify that the problem is with the corr function and not the other calculations I have besides corr, all of which (including corr) take place inside two or three for loops. The code repeats the timing 50 times to avoid accidental results.
The result is a bar graph confirming that it takes MATLAB a long time to calculate Spearman's rank correlation for 9-element vectors. Since my calculations are not that heavy, this problem does not cause an endless wait; it just increases the total time consumed by the whole process. Can someone tell me what causes the problem and how to avoid it?
Times1 = zeros(20,50);
for i = 5:20
    for j = 1:50
        tic
        A = rand(i,2);
        [r,p] = corr(A(:,1),A(:,2),'type','Spearman');
        Times1(i,j) = toc;
    end
end
Times2 = mean(Times1,2);
bar(Times2);
xticks(1:25);
xlabel('number of elements in vectors');
ylabel('average time');
After some investigation, I think I found the root of this very interesting problem. My tests were conducted by profiling every outer iteration with the built-in MATLAB profiler, as follows:
res = cell(20,1);
for i = 5:20
    profile clear;
    profile on -history;
    for j = 1:50
        uni = rand(i,2);
        corr(uni(:,1),uni(:,2),'type','Spearman');
    end
    profile off;
    p = profile('info');
    res{i} = p.FunctionTable;
end
Looking at the collected FunctionTable data, the first thing I noticed is that the Spearman correlation for matrices with 9 or fewer rows is computed in a different way than for matrices with 10 or more rows. For the former, the functions internally called by the corr function are:
Function Number of Calls
----------------------- -----------------
'factorial' 100
'tiedrank>tr' 100
'tiedrank' 100
'corr>pvalSpearman' 50
'corr>rcumsum' 50
'perms>permsr' 50
'perms' 50
'corr>spearmanExactSub' 50
'corr>corrPearson' 50
'corr>corrSpearman' 50
'corr' 50
'parseArgs' 50
'parseArgs' 50
For the latter, the functions being internally called by the corr function are:
Function Number of Calls
----------------------- -----------------
'tiedrank>tr' 100
'tiedrank' 100
'corr>AS89' 50
'corr>pvalSpearman' 50
'corr>corrPearson' 50
'corr>corrSpearman' 50
'corr' 50
'parseArgs' 50
'parseArgs' 50
Since the computation of the Spearman correlation for matrices with 10 or more rows runs smoothly and quickly and shows no evidence of performance bottlenecks, I decided not to lose time investigating it and focused on the main concern: the small matrices.
I tried to understand the difference between the execution time of the whole process for a matrix with 5 rows and for one with 9 rows (the one showing notably the worst performance). This is the code I used:
res5 = res{5,1};
res5_tt = [res5.TotalTime];
res5_tt_perc = ((res5_tt ./ sum(res5_tt)) .* 100).';
res9_tt = [res{9,1}.TotalTime];
res9_tt_perc = ((res9_tt ./ sum(res9_tt)) .* 100).';
res_diff = res9_tt_perc - res5_tt_perc;
[~,res_diff_sort] = sort(res_diff,'descend');
tab = [cellstr(char(res5.FunctionName)) num2cell([res5_tt_perc res9_tt_perc res_diff])];
tab = tab(res_diff_sort,:);
tab = cell2table(tab,'VariableNames',{'Function' 'TT_M5' 'TT_M9' 'DIFF'});
And here is the result:
Function TT_M5 TT_M9 DIFF
_______________________ _________________ __________________ __________________
'corr>spearmanExactSub' 7.14799963478685 16.2879721171023 9.1399724823154
'corr>pvalSpearman' 7.98185309750143 16.3043118970503 8.32245879954885
'perms>permsr' 3.47311716905926 8.73599255035966 5.26287538130039
'perms' 4.58132952553723 8.77488502392486 4.19355549838763
'corr>corrSpearman' 15.629476293326 16.440893059217 0.811416765890929
'corr>rcumsum' 0.510550019981949 0.0152486312660671 -0.495301388715882
'factorial' 0.669357868472376 0.0163923929871943 -0.652965475485182
'parseArgs' 1.54242684137027 0.0309456171268161 -1.51148122424345
'tiedrank>tr' 2.37642998160463 0.041010720272735 -2.3354192613319
'parseArgs' 2.4288171135289 0.0486075856244615 -2.38020952790444
'corr>corrPearson' 2.49766877262937 0.0484657591710417 -2.44920301345833
'tiedrank' 3.16762535118088 0.0543584195582888 -3.11326693162259
'corr' 21.8214856092549 16.5664346332513 -5.25505097600355
Once the bottleneck was detected, I started analyzing the internal code (open corr) and I finally found the cause of the problem. Within the spearmanExactSub, this part of code is being executed (where n is the number of rows of the matrix):
n = arg1;
nfact = factorial(n);
Dperm = sum((repmat(1:n,nfact,1) - perms(1:n)).^2, 2);
All permutations of a vector whose values range from 1 to n are being computed. This is what drives up the computational complexity (and, obviously, the computation time) of the function. Other operations, like the subsequent repmat of 1:n over factorial(n) rows and the ones below that point, worsen the situation further. Now, long story short:
factorial(5) = 120
factorial(6) = 720
factorial(7) = 5040
factorial(8) = 40320
factorial(9) = 362880
Can you see why, between 5 and 9 elements, your bar graph shows "exponentially" increasing computation times?
On a side note, there is nothing you can do to solve this problem, unless you find another implementation of the Spearman correlation that doesn't present the same bottleneck or you implement your own.
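Two possible workarounds, sketched below (not verified against every MATLAB release): if the p-value isn't needed, requesting only the first output of corr should skip the expensive exact-permutation p-value path; alternatively, rho can be computed directly as the Pearson correlation of the tiedrank ranks, bypassing corr's p-value machinery entirely.
A = rand(9,2);
% Workaround 1: request rho only, so pvalSpearman should not run.
r1 = corr(A(:,1), A(:,2), 'type', 'Spearman');
% Workaround 2: Spearman's rho as Pearson correlation of the ranks.
Cm = corrcoef(tiedrank(A(:,1)), tiedrank(A(:,2)));
r2 = Cm(1,2);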

linear combination of curves to match a single curve with integer constraints

I have a set of vectors (curves) which I would like to match to a single curve. The issue isn't only finding the linear combination of the set of curves that most closely matches the single curve (this can be done with least squares, Ax = b). I need to be able to add constraints, for example limiting the number of curves used in the fitting to a particular number, or requiring that the selected curves lie next to each other. Such constraints call for mixed-integer programming optimization.
I have started by using lsqlin, which allows constraints, and have been able to limit the variables to be > 0.0, but I am at a loss as to how to add further constraints. Is there a way to add integer constraints to least squares, or alternatively is there a way to solve this as a MILP?
Any help in the right direction is much appreciated!
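For what it's worth, the "at most k curves" restriction is usually modelled with binary indicator variables and a big-M bound; a solver-agnostic sketch of the extra constraint matrices (kMax and M are illustrative values; z = [x; y] is the stacked decision vector of weights x and binaries y) would be:
nCurves = size(C,2);
kMax = 5;  M = 10;                        % illustrative values
% x_i - M*y_i <= 0 : curve i gets weight only if its binary y_i = 1
% sum(y) <= kMax   : at most kMax curves selected
Aineq = [ eye(nCurves), -M*eye(nCurves);
          zeros(1,nCurves), ones(1,nCurves) ];
bineq = [ zeros(nCurves,1); kMax ];
lb    = zeros(2*nCurves,1);               % x >= 0; y declared binary in the solver
% the quadratic least-squares objective then goes to an MIQP solver such as CPLEX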
Edit: Based on the suggestion by ErwinKalvelagen I am attempting to use CPLEX and its quadratic solvers, but so far I have not managed to get it working. I have created a minimal non-working example; the data is uploaded here and the code is below. The issue is that MATLAB's LS solver lsqlin is able to solve the problem, whereas CPLEX's cplexlsqnonneglin returns "CPLEX Error 5002: %s is not convex" for the same problem.
function [ ] = minWorkingLSexample( )
%MINWORKINGLSEXAMPLE minimal LS example for MATLAB and CPLEX
% MATLAB is able to solve the least squares problem; CPLEX returns:
%   Error using cplexlsqnonneglin
%   CPLEX Error 5002: %s is not convex.
%
%   Error in Backscatter_Transform_excel2_readMut_LINPROG_CPLEX (line 203)
%     cplexlsqnonneglin (C,d);
%
load('C_n_d_2.mat')
lb = zeros(size(C,2),1);
options = optimoptions('lsqlin','Algorithm','trust-region-reflective');
[fact2,resnorm,residual,exitflag,output] = ...
    lsqlin(C,d,[],[],[],[],lb,[],[],options);
%% CPLEX
ctype = cellstr(repmat('C',1,size(C,2)));
options = cplexoptimset;
options.Display = 'on';
[fact3, resnorm, residual, exitflag, output] = ...
    cplexlsqnonneglin (C,d);
end
I could reproduce the Cplex problem. Here is a workaround. Instead of solving the first model, use a model that is less nonlinear:
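(The original answer presents the two models as images; in outline they are:
Model 1: minimize ||Cx - d||^2 subject to x >= 0
Model 2: minimize r'r subject to r = d - Cx, x >= 0
where r is an explicit vector of residuals.)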
The second model solves fine with Cplex. The problem is somewhat of a tolerance/numeric issue. For the second model we have a much more well-behaved Q matrix (a diagonal). Essentially we moved some of the complexity from the objective into linear constraints.
You should now see something like:
Tried aggregator 1 time.
QP Presolve eliminated 1 rows and 1 columns.
Reduced QP has 401 rows, 443 columns, and 17201 nonzeros.
Reduced QP objective Q matrix has 401 nonzeros.
Presolve time = 0.02 sec. (1.21 ticks)
Parallel mode: using up to 8 threads for barrier.
Number of nonzeros in lower triangle of A*A' = 80200
Using Approximate Minimum Degree ordering
Total time for automatic ordering = 0.00 sec. (3.57 ticks)
Summary statistics for Cholesky factor:
Threads = 8
Rows in Factor = 401
Integer space required = 401
Total non-zeros in factor = 80601
Total FP ops to factor = 21574201
Itn Primal Obj Dual Obj Prim Inf Upper Inf Dual Inf
0 3.3391791e-01 -3.3391791e-01 9.70e+03 0.00e+00 4.20e+04
1 9.6533667e+02 -3.0509942e+03 1.21e-12 0.00e+00 1.71e-11
2 6.4361775e+01 -3.6729243e+02 3.08e-13 0.00e+00 1.71e-11
3 2.2399862e+01 -6.8231454e+01 1.14e-13 0.00e+00 3.75e-12
4 6.8012056e+00 -2.0011575e+01 2.45e-13 0.00e+00 1.04e-12
5 3.3548410e+00 -1.9547176e+00 1.18e-13 0.00e+00 3.55e-13
6 1.9866256e+00 6.0981384e-01 5.55e-13 0.00e+00 1.86e-13
7 1.4271894e+00 1.0119284e+00 2.82e-12 0.00e+00 1.15e-13
8 1.1434804e+00 1.1081026e+00 6.93e-12 0.00e+00 1.09e-13
9 1.1163905e+00 1.1149752e+00 5.89e-12 0.00e+00 1.14e-13
10 1.1153877e+00 1.1153509e+00 2.52e-11 0.00e+00 9.71e-14
11 1.1153611e+00 1.1153602e+00 2.10e-11 0.00e+00 8.69e-14
12 1.1153604e+00 1.1153604e+00 1.10e-11 0.00e+00 8.96e-14
Barrier time = 0.17 sec. (38.31 ticks)
Total time on 8 threads = 0.17 sec. (38.31 ticks)
QP status(1): optimal
Cplex Time: 0.17sec (det. 38.31 ticks)
Optimal solution found.
Objective : 1.115360
See here for some details.
Update: In Matlab this becomes:
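(The original snippet was not preserved; a minimal sketch of the second model using CPLEX's cplexqp, assuming C and d from the question, might look like:)
nvar = size(C,2);  m = size(C,1);
H   = blkdiag(zeros(nvar), 2*eye(m));   % objective 0.5*z'*H*z = r'*r over z = [x; r]
f   = zeros(nvar+m, 1);
Aeq = [C, eye(m)];                      % C*x + r = d
beq = d;
lb  = [zeros(nvar,1); -inf(m,1)];       % x >= 0, r free
z   = cplexqp(H, f, [], [], Aeq, beq, lb, []);
x   = z(1:nvar);  r = z(nvar+1:end);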

Matlab code 5 msec timer

Can anyone tell me how I can write a 5 msec timer in MATLAB?
%% Decomposing into sets of 40-byte packets
% While Time < T1 (= 5 msec), keep on filling the 40-byte packet
%while (Total_Connection_Time - Running_Time) > 0
for n = 1:length(total_number_of_bytes)
    % n = counter to go through the "total_number_of_bytes" matrix
    packets = []; % 40-byte matrix (packetization phase)
    % check whether the number of bytes in each talkspurt period is <= or > 40 bytes, to start packetization
    if (total_number_of_bytes(n) <= 40)
        k = 40 - total_number_of_bytes(n); % how many bytes are still needed to complete a 40-byte packet
        packets = [packets, total_number_of_bytes(n) + k];
        total_number_of_bytes(n) = 40; % new bytes matrix after packetization (bytes from the next talkspurt top up to 40 bytes)
        total_number_of_bytes(n+1) = total_number_of_bytes(n+1) - k; % bytes are taken from the next talkspurt period to fill the 40-byte packet
        if total_number_of_bytes(n+1) < 0
            for i = 1:length(total_number_of_bytes) % loop through the array starting at total_number_of_bytes(n+1)
                total_number_of_bytes(n+1) = total_number_of_bytes(n+1) + total_number_of_bytes(n+1+i)
                total_number_of_bytes(n+1+i) = 0;
                packets = [total_number_of_bytes]
            end
        end
    end
    if (total_number_of_bytes(n) > 40)
        m = total_number_of_bytes(n) - 40; % because we need 40-byte packets
        packets = [packets, total_number_of_bytes - 40];
        total_number_of_bytes(n) = 40;
        total_number_of_bytes(n+1) = total_number_of_bytes(n+1) + m; % the remaining bytes are added to the next talkspurt period
        packets = [total_number_of_bytes]
    end
end
For better accuracy use
java.lang.Thread.sleep(5);
instead of tic/toc; see here for further info.
Tic and toc are getting a bad rap, so I will just post this.
I tried the following:
tic
count = 0;
while toc < 0.005
    a = randn(10);
    count = count + 1;
end
toc
Running it ten times, the maximum value of toc was 5.006 ms. The count was around 1000 each time.
This is not the same as your program, but if graphics or GUI are not involved, I think tic and toc can do the job.
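As a further option (a sketch; the achievable resolution is ultimately OS-dependent), MATLAB's timer object can fire a callback at a fixed 5 ms rate:
t = timer('ExecutionMode','fixedRate', ...
          'Period', 0.005, ...                % 5 ms period
          'TimerFcn', @(~,~) disp('tick'));
start(t);
pause(0.1);                                   % let roughly 20 callbacks fire
stop(t);  delete(t);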