Efficient 3D element-wise operations in MATLAB

Say I have two arrays:
A=50;
B=50;
C=1000;
X = rand(A,B);
Y = rand(A,B,C);
I want to subtract X from each of the C slices of Y. This is a fairly common problem, and I have found three alternative solutions:
% Approach 1: for-loop
tic
Z1 = zeros(size(Y));
for i = 1:C
    Z1(:,:,i) = Y(:,:,i) - X;
end
toc

% Approach 2: repmat
tic
Z2 = Y - repmat(X,[1 1 C]);
toc

% Approach 3: bsxfun
tic
Z3 = bsxfun(@minus,Y,X);
toc
I'm building a program which frequently (i.e., many thousands of times) solves problems like this, so I'm looking for the most efficient solution. Here is a common pattern of results:
Elapsed time is 0.013527 seconds.
Elapsed time is 0.004080 seconds.
Elapsed time is 0.006310 seconds.
The loop is clearly slower, and bsxfun is a little slower than repmat. I find the same pattern when I element-wise multiply (rather than subtract) X against slices of Y, though repmat and bsxfun are a little closer in multiplication.
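For reference, the multiplication variant is timed the same way (a minimal sketch, using the same X, Y, and C as above):
% Element-wise multiplication instead of subtraction
tic; Z2m = Y .* repmat(X,[1 1 C]); toc
tic; Z3m = bsxfun(@times,Y,X); toc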
Increasing the size of the data...
A=500;
B=500;
C=1000;
Elapsed time is 2.049753 seconds.
Elapsed time is 0.570809 seconds.
Elapsed time is 1.016121 seconds.
Here, repmat is the clear winner. I'm wondering if anyone in the SO community has a cool trick up their sleeve to speed this operation up at all.

Depending on your real use case, bsxfun and repmat will each sometimes have an advantage over the other, just as @rayryeng suggested. There is one other option you can consider: MEX. Here I hard-coded some parameters for better performance.
#include "mex.h"
#include "matrix.h"
void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
double *A, *B, *C;
int ind_1, ind_2, ind_3, ind_21, ind_32, ind_321, Dims[3] = {500,500,5000};
plhs[0] = mxCreateNumericArray(3, Dims, mxDOUBLE_CLASS, mxREAL);
A = mxGetPr(prhs[0]);
B = mxGetPr(prhs[1]);
C = mxGetPr(plhs[0]);
for ( int ind_3 = 0; ind_3 < 5000; ind_3++)
{
ind_32 = ind_3*250000;
for ( int ind_2 = 0; ind_2 < 500; ind_2++)
{
ind_21 = ind_2*500; // taken out of the innermost loop to save some calculation
ind_321 = ind_32 + ind_21;
for ( int ind_1 = 0 ; ind_1 < 500; ind_1++)
{
C[ind_1 + ind_321] = A[ind_1 + ind_321] - B[ind_1 + ind_21];
}
}
}
}
To use it, type this into the command window (assuming you saved the above C file as mexsubtract.c):
mex -WIN64 mexsubtract.c
Then you can use it like this:
Z4 = mexsubtract(Y,X);
Here are some test results on my computer using A=500, B=500, C=5000:
(repmat) Elapsed time is 3.441695 seconds.
(bsxfun) Elapsed time is 3.357830 seconds.
(cmex) Elapsed time is 3.391378 seconds.
It's a close contender, and in some more extreme cases it has an edge. For example, this is what I got with A = 10, B = 500, C = 200000:
(repmat) Elapsed time is 2.769177 seconds.
(bsxfun) Elapsed time is 3.178385 seconds.
(cmex) Elapsed time is 2.552115 seconds.
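As a side note, on R2016b or newer, implicit expansion broadcasts X across the third dimension directly, so it is worth timing against the options above (a minimal sketch, not part of the benchmarks in this answer):
% Implicit expansion (R2016b+): X is expanded along dim 3 automatically
Z5 = Y - X;
isequal(Z5,Z3) % should match the bsxfun result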

Related

Why is adding numbers on a single CPU core 6 times faster than on a 32-core GPU?

I'm trying to speed up my computation by using gpuArray. However, it doesn't help in my code below.
for i=1:10
    calltest;
end

function [t1,t2] = calltest
N = 10;
tic
u = gpuArray(rand(1,N).^(1./[N:-1:1]));
t1 = toc
tic
u2 = rand(1,N).^(1./[N:-1:1]);
t2 = toc
end
where I get
t1 =
4.8445e-05
t2 =
1.4369e-05
I have an Nvidia GTX 850M graphics card. Am I using gpuArray incorrectly? This code is wrapped inside a function, and the function is called by a loop thousands of times.
Why? Because there is both a) a small problem scale and b) not a "mathematically dense" GPU kernel. The method of comparison is also blurring the root cause of the problem.
Step 0: separate the data-set (a vector) creation from the section under test:
N = 10;
R = rand( 1, N );
tic; < a-gpu-based-computing-section>; GPU_t = toc
tic; c = R.^( 1 ./ [N:-1:1] ); CPU_t = toc
Step 1: test the scaling:
Trying just 10 elements will not make the observation clear: an overhead-naive formulation of Amdahl's Law does not explicitly emphasize the added time spent on the CPU-side GPU-kernel assembly & transport plus the (CPU-to-GPU + GPU-to-CPU) data-handling phases. These add-on phases may become negligibly small, but only when compared to
a) genuinely large-scale vector/matrix GPU-kernel processing, which N ~ 10 obviously is not,
or
b) genuinely "mathematically dense" GPU-kernel processing, which R.^() obviously is not.
So do not blame GPU computing for the must-do overheads it carries: it cannot start working without these prior add-ons in time (and the CPU may, during the same amount of time, already have produced the final result, Q.E.D.).
Fine-grained measurement of each section of the CPU-GPU-CPU workflow:
N = 10; %% 100, 1000, 10000, 100000, ..
tic; CPU_hosted = rand( N, 'single' ); %% 'double'
CPU_gen_RAND = toc
tic; GPU_hosted_IN1 = gpuArray( CPU_hosted );
GPU_xfer_h2d = toc
tic; GPU_hosted_IN2 = rand( N, 'gpuArray' );
GPU_gen__h2d = toc
tic; <kernel-generation-with-might-be-lazy-eval-deferred-xfer-setup>;
GPU_kernel_AssyExec = toc
tic; CPU_hosted_RES = gather( GPU_hosted_RES );
GPU_xfer_d2h = toc
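To see these overheads amortize, here is a minimal sketch at a much larger scale (it assumes the Parallel Computing Toolbox and a CUDA device; wait() forces the asynchronous GPU kernel to finish before toc):
N   = 5e7;                      %% large problem, unlike N = 10
R   = rand( 1, N, 'single' );
G   = gpuArray( R );            %% pay the host-to-device transfer once
dev = gpuDevice;
tic; cCPU = R.^2 + sin( R );               CPU_t = toc
tic; cGPU = G.^2 + sin( G ); wait( dev );  GPU_t = toc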

Fastest way to sum the elements of a matrix

I have some problems with the efficiency of my code. Basically my code works like this:
a = zeros(1,50000);
for n = 1:50000
    a(n) = 10.*n - 5;
end
sum(a);
What is the fastest way to compute the sum of all the elements of this vector?
First, you want to remove your for loop by turning it into a vectorized operation:
tic
b = 1:50000;
a = 10.*b - 5;
result = sum(a);
toc
Elapsed time is 0.008504 seconds.
An alternative is to simplify the operation itself: you are multiplying each of the numbers 1 to 50000 by 10, subtracting 5, and then taking the sum (a single number), which is equivalent to:
tic
result = sum(1:50000)*10 - 5*50000;
toc
Elapsed time is 0.003851 seconds.
Or, if you are really into math (a purely closed-form approach):
tic
result = (1+50000)*(50000/2)*10 - 5*50000;
toc
Elapsed time is 0.003702 seconds.
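For the record, the closed form follows from the arithmetic-series identity:
\sum_{n=1}^{N} (10n - 5) = 10 \cdot \frac{N(N+1)}{2} - 5N = 5N^2
which for N = 50000 gives 5 * 50000^2 = 1.25e10, matching the expression above.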
As you can see, a little math can do more good than purely efficient programming. Also, a loop is not always slow; in your case, the loop is actually faster than the vectorized method:
tic
a = zeros(1,50000);
for n = 1:50000
    a(n) = 10.*n - 5;
end
sum(a);
toc
Elapsed time is 0.006431 seconds.
Timing
Let's do some timing and see the results. The function to run it yourself is provided at the bottom. The approximate execution time execTime is in seconds, and the improvement impPercentage is in percent.
Results
R2016a on OSX 10.11.4
                  execTime     impPercentage
                 __________    _____________
    loop         0.00059336         0
    vectorized   0.00014494    75.574
    adiel        0.00010468    82.359
    math         9.3659e-08    99.984
Code
The following function can be used to generate the output. Note that it requires at least R2013b, for the built-in timeit function and table.
function timings
%feature('accel','on') %// commented out because it's undocumented
cycleCount = 100;
execTime = zeros(4,cycleCount);
names = {'loop';'vectorized';'adiel';'math'};
w = warning;
warning('off','MATLAB:timeit:HighOverhead');
for k = 1:cycleCount
    execTime(1,k) = timeit(@()loop,1);
    execTime(2,k) = timeit(@()vectorized,1);
    execTime(3,k) = timeit(@()adiel,1);
    execTime(4,k) = timeit(@()math,1);
end
warning(w);
execTime = min(execTime,[],2);
impPercentage = (1 - execTime/max(execTime)) * 100;
table(execTime,impPercentage,'RowNames',names)

function result = loop
a = zeros(1,50000);
for n = 1:50000
    a(n) = 10.*n - 5;
end
result = sum(a);

function result = vectorized
b = 1:50000;
a = 10.*b - 5;
result = sum(a);

function result = adiel
result = sum(1:50000)*10 - 5*50000;

function result = math
result = (1+50000)*(50000/2)*10 - 5*50000;

Fast way to compute (1:N)'*(1:N)

I am looking for a fast way to compute
(1:N)'*(1:N)
for reasonably large N. I feel like the symmetry of the problem makes it so that actually doing the multiplications and additions is wasteful.
The question of why you want to do this really matters.
In the theoretical sense, the triangular approach suggested in the other answers will save you operations. @jgmao's answer is especially interesting in reducing multiplies.
In the practical sense, number of CPU operations is no longer the metric to minimize when writing fast code. Memory bandwidth dominates when you have so few CPU operations, so tuned cache-aware access patterns are how to make this go fast. Matrix multiplication code is implemented extremely efficiently, since it's such a common operation, and every implementation of the BLAS numeric library worth its salt will use optimized access patterns, and SIMD computation as well.
Even if you wrote straight C and reduced your op count to the theoretic minimum, you'd probably still not beat the full matrix multiply. What this boils down to is to find the numeric primitive which most closely matches your operation.
All that said, there's a BLAS operation which gets a little closer than DGEMM (matrix multiply). It's called DSYRK, the rank-k update, and it can be used for exactly A'*A. The MEX function I wrote for this a long time ago is here. I haven't messed with it in a long time, but it did work when I first wrote it, and did in fact run faster than a straight A'*A.
/* xtrx.c: calculates x'*x taking advantage of the symmetry.
   Peter Boettcher <email removed>
   Last modified: <Thu Jan 23 13:53:02 2003> */
#include "mex.h"

/* Prototype for the FORTRAN BLAS rank-k update (added; link with -lmwblas).
   Note: on 64-bit MATLAB the BLAS integer type may be ptrdiff_t rather than int. */
void dsyrk_(const char *uplo, const char *trans, const int *n, const int *k,
            const double *alpha, const double *a, const int *lda,
            const double *beta, double *c, const int *ldc);

const double one = 1;
const double zero = 0;

void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    double *x, *z;
    int i, j, mrows, ncols;

    if (nrhs != 1) mexErrMsgTxt("One input required.");

    x = mxGetPr(prhs[0]);
    mrows = mxGetM(prhs[0]);
    ncols = mxGetN(prhs[0]);
    plhs[0] = mxCreateDoubleMatrix(ncols, ncols, mxREAL);
    z = mxGetPr(plhs[0]);

    /* Call the FORTRAN BLAS routine for the rank-k update: z = x'*x */
    dsyrk_("U", "T", &ncols, &mrows, &one, x, &mrows, &zero, z, &ncols);

    /* Result is in the upper triangle. Copy it down to the lower part */
    for (i = 0; i < ncols; i++)
        for (j = i+1; j < ncols; j++)
            z[i*ncols + j] = z[j*ncols + i];
}
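A usage sketch (the -lmwblas link flag and the calling pattern are assumptions about a typical setup, not from the original post):
% mex xtrx.c -lmwblas   % compile, linking against MATLAB's bundled BLAS
v = 1:1000;
M = xtrx(v);            % same result as v'*v, computed via DSYRK
isequal(M, v'*v)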
MATLAB's matrix multiplication is generally pretty fast, but here are a couple of ways to get just the upper triangular matrix. Not surprisingly, they are slower than naïvely computing v'*v (or than using a MEX wrapper that calls the more appropriate symmetric rank-k update function in BLAS!). Anyway, here are a few MATLAB-only solutions:
The first uses linear indexing:
% test vector
N = 1e3;
v = 1:N;
% compute upper triangle of product
[ii, jj] = find(triu(ones(N)));
upperMask = false(N,N);
upperMask(ii + N*(jj-1)) = true;
Mu = zeros(N);
Mu(upperMask) = v(ii).*v(jj); % other lines always the same computation
% validate
M = v'*v;
isequal(triu(M),Mu)
This next way won't be faster than the naive approach either, but here's another solution to compute the lower triangle with bsxfun:
Ml = bsxfun(@(x,y) [zeros(y-1,1); x(y:end)*y],v',v);
For the upper triangle:
Mu = bsxfun(@(x,y) [x(1:y)*y; zeros(numel(x)-y,1)],v',v);
isequal(triu(M),Mu)
Another solution builds the whole matrix using cumsum, for this special case (where v = 1:N): row j of the cumulative sum of N stacked copies of v is exactly j*v. This one is actually close in speed.
M = cumsum(repmat(v,[N 1]));
Maybe these can be a starting point for something better.
This is 3 times faster than (1:N).'*(1:N) provided an int32 result is acceptable (it's even faster if the numbers are small enough to use int16 instead of int32):
N = 1000;
aux = int32(1:N);
result = bsxfun(@times,aux.',aux);
Benchmarking:
>> N = 1000; aux = int32(1:N); tic, for count = 1:1e2, bsxfun(@times,aux.',aux); end, toc
Elapsed time is 0.734992 seconds.
>> N = 1000; aux = 1:N; tic, for count = 1:1e2, aux.'*aux; end, toc
Elapsed time is 2.281784 seconds.
Note that aux.'*aux cannot be used for aux = int32(1:N).
As pointed out by @DanielE.Shub, if the result is needed as a double matrix, a final cast has to be done, and in that case the gain is very small:
>> N = 1000; aux = int32(1:N); tic, for count = 1:1e2, double(bsxfun(@times,aux.',aux)); end, toc
Elapsed time is 2.173059 seconds.
Given the specially ordered structure of the input, consider the case N = 4:
(1:4)'*(1:4) = [1  2  3  4
                2  4  6  8
                3  6  9 12
                4  8 12 16]
You will find that the first row is just (1:N), and from the second row (j = 2) on, each row is the previous row (j-1) plus (1:N).
So:
1. You do not need to do many multiplications. Instead, you can generate the matrix by N*N additions.
2. Since the output is symmetric, only half of the output matrix needs to be computed, so the total computation is (N-1)+(N-2)+...+1, i.e. about N^2/2, additions. A sketch of this idea follows.
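Here is a minimal sketch of this addition-only idea (my own illustration of the scheme described above):
N = 1000;
v = 1:N;
M = zeros(N);
M(1,:) = v;                          % first row is just 1:N
for j = 2:N
    M(j,j:N) = M(j-1,j:N) + v(j:N);  % previous row plus (1:N), upper triangle only
end
M = triu(M) + triu(M,1).';           % mirror the upper triangle down
isequal(M, v'*v)                     % check against direct multiplication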

How to vectorize an expression in MATLAB

I'm unable to vectorize this:
for x = 2:i
    for y = 2:j
        if x ~= y
            Savings(x,y) = Costs(x,1) + Costs(1,y) - Costs(x,y);
        end
    end
end
Could someone tell me if I could improve the performance of this code? Thanks.
With some help from bsxfun:
Ix=2:i;
Iy=2:j;
I = false(i,j);
I(Ix,Iy) = bsxfun(@ne, Ix', Iy);
S = bsxfun(@plus, Costs(Ix,1), Costs(1,Iy)) - Costs(Ix,Iy);
Savings(I) = S(I(Ix,Iy));
You can vectorize it like this; I don't know whether that will affect your performance, though. You will need to test that yourself.
m = size(Costs, 1);
n = size(Costs, 2);
[Y, X] = meshgrid(2:m, 2:n);
Cx = Costs(:,1);
Cy = Costs(1,:);
S = Cx(X) + Cy(Y) - Costs(2:end,2:end);
S(eye(m-1,n-1)==1) = 0;
Savings = zeros(m,n);
Savings(2:end,2:end) = S;
EDIT
Incidentally, I have verified that all three methods give the same answer. For a Costs size of 400x400, the run times were as follows:
Elapsed time is 0.00741386 seconds. //My method
Elapsed time is 0.003304 seconds. //Mohsen's method (after correcting to prevent errors)
Elapsed time is 2.16231 seconds. //Original Loop
So both our methods give a significant boost. However, if you just preallocate Savings, your loop (shown below with that one change) is actually the fastest. Is this really too slow for your purposes?
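For completeness, the preallocated version of the original loop is simply this (assuming Costs, i, and j as in the question):
Savings = zeros(i,j);   % preallocation is the only change
for x = 2:i
    for y = 2:j
        if x ~= y
            Savings(x,y) = Costs(x,1) + Costs(1,y) - Costs(x,y);
        end
    end
end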

Octave/Matlab: Adding new elements to a vector

I have a vector x and I have to add an element (newElem).
Is there any difference between -
x(end+1) = newElem;
and
x = [x newElem];
?
x(end+1) = newElem is a bit more robust.
x = [x newElem] will only work if x is a row vector; if it is a column vector, x = [x; newElem] should be used. x(end+1) = newElem, however, works for both row and column vectors, for example:
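% Both orientations accept x(end+1):
r = [1 2 3];   r(end+1) = 4;    % row vector, now 1-by-4
c = [1; 2; 3]; c(end+1) = 4;    % column vector, now 4-by-1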
In general though, growing vectors should be avoided. If you do this a lot, it might bring your code down to a crawl. Think about it: growing an array involves allocating new space, copying everything over, adding the new element, and cleaning up the old mess... quite a waste of time if you knew the correct size beforehand :)
Just to add to @ThijsW's answer, there is a significant speed advantage to the first method over the concatenation method:
big = 1e5;
tic;
x = rand(big,1);
toc

x = zeros(big,1);
tic;
for ii = 1:big
    x(ii) = rand;
end
toc

x = [];
tic;
for ii = 1:big
    x(end+1) = rand;
end
toc

x = [];
tic;
for ii = 1:big
    x = [x rand];
end
toc
Elapsed time is 0.004611 seconds.
Elapsed time is 0.016448 seconds.
Elapsed time is 0.034107 seconds.
Elapsed time is 12.341434 seconds.
I got these times running R2012b; however, when I ran the same code on the same computer in MATLAB R2010a, I got
Elapsed time is 0.003044 seconds.
Elapsed time is 0.009947 seconds.
Elapsed time is 12.013875 seconds.
Elapsed time is 12.165593 seconds.
So I guess the speed advantage only applies to more recent versions of MATLAB.
As mentioned before, the use of x(end+1) = newElem has the advantage that it allows you to concatenate your vector with a scalar, regardless of whether your vector is transposed or not. Therefore it is more robust for adding scalars.
However, what should not be forgotten is that x = [x newElem] will also work when you try to add multiple elements at once. Furthermore, this generalizes a bit more naturally to the case where you want to concatenate matrices: M = [M M1 M2 M3].
All in all, if you want a solution that allows you to concatenate your existing vector x with newElem that may or may not be a scalar, this should do the trick:
x(end+(1:numel(newElem))) = newElem;
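A quick usage sketch:
x = 1:3;
newElem = [10 20];
x(end+(1:numel(newElem))) = newElem;   % x is now [1 2 3 10 20]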