Standard arrays seem faster than gpuArray on conv net feed forward - matlab

I am implementing convolutional networks in MATLAB, and I added support for GPUs (I am using gpuArrays). I implemented the feed-forward part. When I run it with standard arrays (which are already sitting in my workspace), it takes 0.15 sec. However, when I run the EXACT same thing with the arrays being gpuArrays, all of which are in my workspace before running the feed-forward script, it takes ~1.39 sec. Can someone explain what's going on here? Thanks
UPDATE: I timed the individual parts, and everything suggests that the main bottleneck is the convolution, so I will paste that part of the code here:
pad = (size(layers_W{layerNum}, 1)-1) / 2;
for imageNum = 1:options.minibatchSize
    for filterNum = 1:size(layers_W{layerNum}, 4)
        for filterD = 1:size(layers_W{layerNum}, 3)
            c = conv2(convInput(:, :, filterD, imageNum), ...
                rot90(layers_W{layerNum}(:, :, filterD, filterNum), 2), 'valid');
            layers_activations{layerNum}(pad+1:end-pad, pad+1:end-pad, filterNum, imageNum) = ...
                layers_activations{layerNum}(pad+1:end-pad, pad+1:end-pad, filterNum, imageNum) + ...
                c;
        end
        layers_activations{layerNum}(pad+1:end-pad, pad+1:end-pad, filterNum, imageNum) = ...
            layers_activations{layerNum}(pad+1:end-pad, pad+1:end-pad, filterNum, imageNum) + ...
            layers_b{layerNum}(filterNum);
    end
end
if strcmp(options.activation, 'relu') == 1
    layers_activations{layerNum} = max(0, layers_activations{layerNum});
elseif strcmp(options.activation, 'sigmoid') == 1
    layers_activations{layerNum} = 1 ./ (1 + exp(-layers_activations{layerNum}));
end
This exact piece of code is ~52 times slower on GPU than on CPU. Any ideas?
UPDATE 2: I separately tested the line that does the 2-D convolution (~10 times slower on GPU) and the line below it that adds two matrices (~100 times slower on GPU). I am completely confused about why this is happening.

This isn't a surprise at all. The GPU is efficient at convolutions on large images (HD, 4K) but not particularly so at images of 227x227 or smaller, which are typical in CNNs. You need to at least run a 3-D convolution, so that each complete filter volume is applied to an input activation in one call, rather than looping over every depth slice as well as every filter and image. Try replacing the inner loop with a call to convn.
Smart GPU implementations of convolution in this context, such as that used by the Neural Network Toolbox in MATLAB, use custom kernels and multi-threading to take advantage of spatial parallelism and parallelism in the batch dimensions of filters and inputs. Your implementation throws away all the batch parallelism.
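For illustration, here is a minimal sketch of what collapsing the depth loop into convn could look like, assuming convInput is H-by-W-by-depth-by-batch and layers_W{layerNum} is kH-by-kW-by-depth-by-numFilters as in the question, and that the padded interior of layers_activations starts out at zero (which the accumulation in the original code implies):
W = layers_W{layerNum};
pad = (size(W, 1) - 1) / 2;
for imageNum = 1:options.minibatchSize
    for filterNum = 1:size(W, 4)
        % Flipping the filter in all three dimensions makes convn compute the same
        % cross-correlation as the original conv2/rot90 loop over filterD.
        % (Use flipdim instead of flip on releases older than R2013b.)
        k = flip(flip(flip(W(:, :, :, filterNum), 1), 2), 3);
        c = convn(convInput(:, :, :, imageNum), k, 'valid');   % collapses the depth loop
        layers_activations{layerNum}(pad+1:end-pad, pad+1:end-pad, filterNum, imageNum) = ...
            c + layers_b{layerNum}(filterNum);
    end
end
This still loops over filters and images, so a good GPU implementation would go further and batch those dimensions too, but it at least hands the GPU one reasonably sized convolution per call instead of many tiny 2-D ones.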

Related

Nested for loops for running a sliding window function in MATLAB

I'm trying to write a simple piece of code that takes the sum over a large window and the sum over a small running window and forms the energy ratio.
My code looks like this in MATLAB:
S = data1;
[nt, ntraces] = size(S);
% Create sliding windows for First Break Picking:
% define a window length
% for large window
nl = 300;
% for small running window
ns = 50;
% tolerance/fudge factor
beta = 0.0000;
for i_slide = 1:nt-nl
    for i_large = i_slide:(i_slide+nl)
        large_window(i_large) = sum(S(i_large).^2)';
        for i_small = i_slide+ns:i_slide+nl
            small_window(i_small) = sum(S(i_small).^2)';
        end
    end
    ER(i_slide) = small_window/(large_window + beta);
end
The problem I am having is that my small running window is not indexing correctly, nor is it summing along the whole large-window length at the maximum slide.
Any ideas how I can overcome this problem?
In general, the problem you're really trying to solve seems to be general 2-D (or 1-D?) convolution. You can use MATLAB's conv or conv2 function (or filter, or imfilter if you have the Image Processing Toolbox) to do this. If you need to write a 2-D convolution function, I wouldn't try to write one that does two convolutions and takes the ratio. Instead, write a simple convolution function, my_conv, run it twice, and take the ratio. E.g., you're trying to write:
output = my_double_conv(data,smallFilt,bigFilt); %this does ratios
I don't think that's a good idea in general. Don't do that. Do
output = my_conv(data,smallFilt) ./ my_conv(data,bigFilt);
You might see some speed benefits from not having to index everything twice in my_double_conv, but if computational concerns are your issue, you shouldn't be writing your own convolution in the first place; instead you should be using FFT convolutions, or integral-image convolutions (e.g., http://hebb.mit.edu/courses/9.29/2004/readings/c13-1.pdf or http://en.wikipedia.org/wiki/Summed_area_table )
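To illustrate the running-sum idea (a 1-D analogue of the summed-area table linked above), the windowed sums of squared amplitudes can be built once from a cumulative sum instead of re-summing every window. This sketch reuses the question's nt, ns, nl and beta and works on a single trace; exactly how the two windows are anchored relative to each other is an assumption you would adjust to your picking scheme:
energy = S(:, 1).^2;                          % squared amplitudes of one trace
csum   = cumsum([0; energy]);                 % prepend 0 so differences give inclusive sums
winsum = @(w) csum(1+w:end) - csum(1:end-w);  % sum over every length-w window, O(1) per window
small_sums = winsum(ns);                      % nt-ns+1 values
large_sums = winsum(nl);                      % nt-nl+1 values
n  = numel(large_sums);                       % align both windows at the same start sample
ER = small_sums(1:n) ./ (large_sums + beta);  % note the element-wise ./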
That said, your code has several problems. Have you tried debugging with the MATLAB debugger?
For example, this is clearly wrong, since i_small is a scalar index:
for i_small = i_slide+ns:i_slide+nl
small_window(i_small) = sum(S(i_small).^2)';
end
That sum is not going to "sum" over anything, since i_small will be a scalar...
Do you want:
small_window = S(i_slide+ns:i_slide+nl);
small_window_sum = sum(small_window.^2);
Also note that for element-wise matrix operations, like:
small_window/(large_window + beta);
where small_window and large_window are vectors or matrices rather than scalars, you want:
small_window./(large_window + beta); %note the "."
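Putting those two points together, here is a rough sketch of the ratio-of-convolutions version using the built-in conv, with box filters purely for illustration and reusing the question's variable names:
trace     = S(:, 1).^2;               % squared amplitudes of one trace
smallFilt = ones(ns, 1);              % box filter = moving sum over ns samples
bigFilt   = ones(nl, 1);              % box filter = moving sum over nl samples
num = conv(trace, smallFilt, 'same');
den = conv(trace, bigFilt,   'same') + beta;
ER  = num ./ den;                     % element-wise division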

best way to parallelize calculations on time series data in matlab

I have a Linux cluster with MATLAB & PCT installed (128 workers with the Torque manager), and I am looking for a good way to parallelize my calculations.
I have a time-series trajectory data matrix (100k x 2). I perform maximum-likelihood (ML) calculations that involve matrix diagonalization, exponentiation & multiplication, which run fast for smaller matrices. I divide the trajectory data into small chunks and perform the calculations on many workers (coarse parallelization), and I don't have any problems here as it works fine (gets done in ~30 s).
But the calculations also depend on a number of parameters that I need to vary to test their effect on the ML (something akin to a parameter sweep).
When I try to do this in a loop, the calculations become progressively slower, for a reason I am unable to figure out.
%%%%%%% Pseudo-code example:
% a [100000x2] time-series data matrix
load trajectoryData
% p1, p2, p3, p4 are parameters
% but I want to do this over multiple values of p3 & p4
paramsMat = [p1Vect; p2Vect; p3Vect; p4Vect];
matlabpool open 128
[ML] = objFunc([p1 p2 p3 p4], trajectoryData) % runs fast, ~ <30s
%% NOTE: this runs progressively slower
for i = 1:length(paramsMat)
    currentparams = paramsMat(i,:);
    [ML] = objFunc(currentparams, trajectoryData)
end
matlabpool close
The objFunc function is as follows:
% objFunc.m
function [ML] = objFunc(Params, trajectoryData)
% b = 2 always
[a, b] = size(trajectoryData);
% split into fragments of 1000 points (or any other way)
fragsMat = reshape(trajectoryData, 1000, a*2/1000);
% simple parallelization: do the calculation on small chunks
parfor ix = 1:numFragments
    % do heavy calculations
    costVal(ix) = costValFrag;
end
% just an example
ML = sum(costVal);
%%%%%%
Oddly, a single calculation takes ~30 s (using the full cluster), but inside the for loop the speed keeps degrading, and by around the 100th iteration it has become very slow. The workers are using only 10-20% of the CPU.
Any suggestions, including alternative parallelization approaches, would be of immense help.
If I read this correctly, each parameter set is completely independent of all the others, and you have more parameter sets than you do workers.
The simple solution is to use a batch job instead of parfor.
job_manager = findResource( ... look up the args that fit your cluster ... );
job = createJob(job_manager);
for i = 1:num_param_sets
    t = createTask(job, @your_function, 0, {your params});
end
submit(job);
This way you avoid the communication overhead from the parfor inside the inner function, and you keep your MATLAB sessions separate. You can even tell it to automatically restart the workers between tasks (I think), as one of the job parameters.
What is the value of numFragments? If this is not always larger than your number of workers, then you will see things slowing down.
I would suggest trying to make your outer for loop be the parfor. It's generally better to apply the parallelism at the outermost level.
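A rough sketch of that restructuring, assuming paramsMat holds one parameter set per row; an inner parfor inside objFunc simply runs as an ordinary for loop on the worker, so the only parallelism here is across parameter sets:
nSets = size(paramsMat, 1);
ML    = zeros(nSets, 1);
parfor i = 1:nSets
    % each worker evaluates one complete parameter set serially
    ML(i) = objFunc(paramsMat(i, :), trajectoryData);
end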

Matlab's fftn gets slower with multithreading?

I have access to a 12 core machine and some matlab code that relies heavily on fftn. I would like to speed up my code.
Since the fft can be parallelized I would think that more cores would help but I'm seeing the opposite.
Here's an example:
X = peaks(1028);
ncores = feature('numcores');
ntrials = 20;
mtx_power_times = zeros(ncores, ntrials);
fft_times = zeros(ncores, ntrials);
for i = 1:ncores
    for j = 1:ntrials
        maxNumCompThreads(i);
        tic;
        X^2;
        mtx_power_times(i,j) = toc;
        tic;
        fftn(X);
        fft_times(i,j) = toc;
    end
end
subplot(1,2,1);
plot(mtx_power_times, 'x-')
title('mtx power time vs number of cores');
subplot(1,2,2);
plot(fft_times, 'x-');
title('fftn time vs num of cores');
Which gives me this:
The speedup for matrix multiplication is great but it looks like my ffts go almost 3x slower when I use all my cores. What's going on?
For reference my version is 7.12.0.635 (R2011a)
Edit: On large 2-D arrays taking 1-D transforms I get the same problem.
Edit: The problem appears to be that fftw is not seeing the thread limit that maxNumCompThreads enforces. All the CPUs run at full speed no matter what I set maxNumCompThreads to.
So... is there a way I can specify how many processors I want to use for an fft in Matlab?
Edit: Looks like I can't do this without some careful work in .mex files. http://www.mathworks.com/matlabcentral/answers/35088-how-to-control-number-of-threads-in-fft has an answer. It would be nice if someone has an easy fix...
To use different cores explicitly, you should use the Parallel Computing Toolbox. For instance, you could use a parfor loop and hand each worker an independent slice of the work via a function handle:
function x = f(n, i)
    ...
end

m = ones(8);
parfor i = 1:8
    m(i,:) = f(m(i,:), i);
end
More info is available at:
High performance computing
Multithreaded computation
Multithreading
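For the "large 2-D array of independent 1-D transforms" case from the edit, here is a sketch of what explicit parallelism could look like, assuming a pool is already open (matlabpool in R2011a). Whether this beats a single implicitly multithreaded call depends on transform size and pool overhead, so treat it as something to benchmark rather than a guaranteed win:
X = peaks(1028);
Y = complex(zeros(size(X)));
parfor k = 1:size(X, 2)
    Y(:, k) = fft(X(:, k));   % columns are independent, so each worker takes a share
end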

Scope for improvement in this code

I have written the following code in MATLAB to process large images of the order of 3000x2500 pixels. Currently the operation takes more than half an hour to complete. Is there any scope to improve the code so that it takes less time? I heard parallel processing can make things faster, but I have no idea how to implement it. How do I do it, given the following code?
function dirvar(subfn)
[fn,pn] = uigetfile({'*.TIF; *.tiff; *.tif; *.TIFF; *.jpg; *.bmp; *.JPG; *.png'}, ...
    'Select an image', '~/');
I = double(imread(fullfile(pn,fn)));
ld = input('Enter the lag distance = '); % prompt for lag distance
fh = eval(['@' subfn]); % function handle
I2 = uint8(nlfilter(I, [7 7], fh));
imshow(I2); % texture layer image
imwrite(I2,'result_mat.tif');

% Zero-degree variogram
    function [gamma] = ewvar(I)
        c = (size(I)+1)/2;     % central pixel of the moving window
        EW = I(c(1),c(2):end); % values from the central pixel to the window margin
        h = length(EW) - ld;   % number of lags
        gamma = 1/(2 * h) * sum((EW(1:ld:end-1) - EW(2:ld:end)).^2);
    end
end
The input lag distance is usually 1.
You really need to use the profiler to get some improvements out of this. My first guess (as I haven't run the profiler, which you should, as already suggested) would be to avoid recomputing things that never change inside the window function. Since you are processing every image with a [7 7] window, you can precalculate some parts so that you don't repeat these operations:
function dirvar(subfn)
[fn,pn] = uigetfile({'*.TIF; *.tiff; *.tif; *.TIFF; *.jpg; *.bmp; *.JPG; *.png'}, ...
    'Select an image', '~/');
I = double(imread(fullfile(pn,fn)));
ld = input('Enter the lag distance = '); % prompt for lag distance
fh = eval(['@' subfn]); % function handle

%% precalculations
wind = [7 7];
center = (wind+1)/2;      % central pixel of the moving window
EWlength = (wind(2)+1)/2;
h = EWlength - ld;        % number of lags

%% calculations
I2 = nlfilter(I, wind, fh);
imshow(I2); % texture layer image
imwrite(I2,'result_mat.tif');

% Zero-degree variogram
    function [gamma] = ewvar(I)
        EW = I(center(1),center(2):end); % values from the central pixel to the window margin
        gamma = 1/(2 * h) * sum((EW(1:ld:end-1) - EW(2:ld:end)).^2);
    end
end
Note that by doing so, you trade some clarity and coupling (between the function dirvar and the nested function ewvar) for performance. Since I haven't profiled your code, you should do that yourself, with your own inputs, to find which lines consume the most time.
For batch processing, I would also recommend leaving out the input, imshow, imwrite and uigetfile calls. Those are commands that you typically call from a higher-level function/script, and they force you to re-enter the inputs even when you want them to stay the same. So instead, make each of the variables they produce (or consume) a parameter (or return value) of your function. That way, you could leave MATLAB running over the weekend to process everything (without having to manually enter all those values), even if you are unable to speed up the code.
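A sketch of that refactoring; dirvar_batch, fileList and outDir are hypothetical names, and ewvar is assumed to be rewritten to take the lag distance as an argument instead of reading it from input():
function dirvar_batch(fileList, ld, outDir)
fh = @(win) ewvar(win, ld);                               % bind the lag distance once
for k = 1:numel(fileList)
    I  = double(imread(fileList{k}));
    I2 = uint8(nlfilter(I, [7 7], fh));
    [~, name] = fileparts(fileList{k});
    imwrite(I2, fullfile(outDir, [name '_result.tif']));  % keep per-image saves
end
end

function gamma = ewvar(I, ld)
c  = (size(I) + 1) / 2;   % central pixel of the moving window
EW = I(c(1), c(2):end);   % central pixel to the window margin
h  = length(EW) - ld;     % number of lags
gamma = 1/(2 * h) * sum((EW(1:ld:end-1) - EW(2:ld:end)).^2);
end
The outer loop over files is also the natural place to put a parfor later, as the answers below suggest.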
A few general purpose tricks:
1 - use the MATLAB profiler to determine all the computational bottlenecks
2 - parallel processing can make things faster and there are a lot of tools that you can use, but it depends on how your entire code is set up and whether the code is optimized for it. By far the easiest trick to learn is parfor, where you can replace the top level for loop by parfor. This does mean you must open the MATLAB pool with matlabpool open.
3 - If you have a rather recent Nvidia GPU as well as MATLAB 2011, you can also write some CUDA code.
All in all 30 mins to me is peanuts, so don't fret it too much.
First of all, I strongly suggest you follow the advice by @Egon: write a separate function that collects a list of files (the excellent UIPICKFILES from the FEX is your friend here), and then runs your filtering code in a loop for each image. Note that you should definitely keep the call to imwrite in your filtering code: in case the analysis crashes at image 48 (e.g. due to a power failure), you don't want to lose all the previous work.
Running thusly in batch mode has two big advantages: (1) you can start running your code and go home for the week-end, and (2) you can easily parallelize this outside loop using PARFOR. However, with only a dual-core machine, it is unlikely that you get any significant improvements from parallelization - your OS also wants to run stuff at times, and the overhead of parallelization might be more than the gain from running two workers. Also, 2.5GB of RAM is seriously limiting.
As to your specific code: in my experience, using IM2COL is often faster than NLFILTER. im2col creates an nElementsInMask-by-nMasks array out of your image, so that you can apply the filtering in one single operation. With a 7x7 window, the output of im2col has 3000*2500*49 elements, which is about 350 MB as uint8 but close to 3 GB in double precision, so with 2.5 GB of RAM you may have to process the image in horizontal strips. All that you need to do is rewrite ewvar so that it works on a 49-by-1 array of the pixels that make up your mask, which will require some index juggling, if I understand your code correctly.
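A rough sketch of the im2col route for the zero-degree variogram, with the same zero padding that nlfilter applies; the indexing of the east-west ray through each 7x7 column is worth double-checking against nlfilter's output on a small test image:
wind  = [7 7];
ld    = 1;                                                   % lag distance (usually 1, per the question)
cols  = im2col(padarray(I, (wind - 1)/2), wind, 'sliding');  % 49 x nPixels, one column per mask
ray   = (prod(wind)+1)/2 : wind(1) : prod(wind);             % centre pixel, then one step east each time
EW    = cols(ray, :);                                        % values along the east-west ray, 4 x nPixels
h     = size(EW, 1) - ld;                                    % number of lags
gamma = 1/(2*h) * sum((EW(1:ld:end-1, :) - EW(2:ld:end, :)).^2, 1);
I2    = uint8(reshape(gamma, size(I)));                      % same shape as nlfilter's output
As noted above, at 3000x2500 in double precision the cols array is large, so you may need to convert back to uint8 first or process the image in strips on a 2.5 GB machine.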

Parallelize or vectorize all-against-all operation on a large number of matrices?

I have approximately 5,000 matrices with the same number of rows and varying numbers of columns (20 x ~200). Each of these matrices must be compared against every other in a dynamic programming algorithm.
In this question, I asked how to perform the comparison quickly and was given an excellent answer involving a 2D convolution. Serially, iteratively applying that method, like so
list = who('data_matrix_prefix*');
H = cell(numel(list), numel(list));
for i = 1:numel(list)
    for j = 1:numel(list)
        if i ~= j
            eval(['H{i,j} = compare(' char(list(i)) ',' char(list(j)) ');']);
        end
    end
end
is fast for small subsets of the data (e.g. for 9 matrices, 9*9 - 9 = 72 calls are made in ~1 s, 870 calls in ~2.5 s).
However, operating on all the data requires almost 25 million calls.
I have also tried using deal() to make a cell array composed entirely of the next element in data, so I could use cellfun() in a single loop:
% who(), load() and struct2cell() calls place k data matrices in a 1-D cell array called data.
nextData = cell(k, 1);
for i = 1:k
    [nextData{:}] = deal(data{i});
    H(:,i) = cellfun(@compare, data, nextData, 'UniformOutput', false);
end
Unfortunately, this is not really any faster, because all the time is in compare(). Both of these code examples seem ill-suited for parallelization. I'm having trouble figuring out how to make my variables sliced.
compare() is totally vectorized; it uses matrix multiplication and conv2() exclusively (I am under the impression that all of these operations, including the cellfun(), should be multithreaded in MATLAB?).
Does anyone see a (explicitly) parallelized solution or better vectorization of the problem?
Note
I realize both my examples are inefficient - the first would be twice as fast if it calculated a triangular cell array, and the second is still calculating the self comparisons, as well. But the time savings for a good parallelization are more like a factor of 16 (or 72 if I install MATLAB on everyone's machines).
Aside
There is also a memory issue. I used a couple of evals to save each column of H to a file with names like H1, H2, etc., and then clear Hi. Unfortunately, the saves are very slow...
Do
compare(a,b) == compare(b,a)
and
compare(a,a) == 1
hold? If so, change your loop
for i = 1:numel(list)
    for j = 1:numel(list)
        ...
    end
end
to
for i = 1:numel(list)
    for j = i+1:numel(list)
        ...
    end
end
and deal with the symmetry and identity case. This will cut your calculation time by half.
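For example, a small sketch of handling the symmetric and identity entries after the half loop, assuming compare(a,a) really is 1 for your metric:
for i = 1:numel(list)
    H{i,i} = 1;              % self-comparison, if you need it at all
    for j = i+1:numel(list)
        H{j,i} = H{i,j};     % mirror the upper triangle into the lower one
    end
end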
The second example can easily be sliced for use with the Parallel Computing Toolbox. This toolbox distributes iterations of your code among up to 8 local workers. If you want to run the code on a cluster, you also need the MATLAB Distributed Computing Server.
%# who(), load() and struct2cell() calls place k data matrices in a 1-D cell array called data.
parfor i = 1:k-1 %# this runs the loop in parallel with the Parallel Computing Toolbox
    %# only make the necessary comparisons
    H(i+1:k, i) = cellfun(@compare, data(i+1:k), repmat(data(i), k-i, 1), 'UniformOutput', false);

    %# if the above doesn't work, try this
    hSlice = cell(k, 1);
    hSlice(i+1:k) = cellfun(@compare, data(i+1:k), repmat(data(i), k-i, 1), 'UniformOutput', false);
    H(:, i) = hSlice;
end
If I understand correctly, you have to perform 5000^2 matrix comparisons? Rather than trying to parallelise the compare function, perhaps you should think of your problem as being composed of 5000^2 tasks. The MATLAB Parallel Computing Toolbox supports task-based parallelism. Unfortunately my experience with PCT is with the parallelisation of large linear-algebra-type problems, so I can't really tell you much more than that. The documentation will undoubtedly help you more.
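A sketch of what the task-based route could look like with the pre-R2012 PCT job API used elsewhere on this page, assuming the k matrices are already in the cell array data from the second example. With ~25 million pairs you would batch many comparisons into each task rather than creating one task per pair, so treat this purely as the shape of the API:
jm  = findResource('scheduler', 'type', 'jobmanager');   % adapt to your cluster configuration
job = createJob(jm);
for i = 1:k
    for j = i+1:k
        createTask(job, @compare, 1, {data{i}, data{j}}); % one output per comparison
    end
end
submit(job);
waitForState(job, 'finished');
results = getAllOutputArguments(job);                     % one row per task, in creation order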