Fastest way to sum the elements of a matrix - matlab

I have some problems with the efficiency of my code. Basically my code works like this:
a = zeros(1,50000);
for n = 1:50000
a(n) = 10.*n - 5;
end
sum(a);
What is the fastest way to compute the sum of all the elements of this matrix?

First, remove the for loop by replacing it with a vectorized operation:
tic
a = zeros(1,50000);
b = [1:50000];
a = 10.*b-5;
result = sum(a);
toc
Elapsed time is 0.008504 seconds.
Alternatively, simplify the operation itself. You multiply each of 1 to 50000 by 10, subtract 5 from each, and then sum the results into a single number, which is equivalent to:
tic
result = sum(1:50000)*10 - 5*50000;
toc
Elapsed time is 0.003851 seconds.
Or, if you are really into math (this is a purely mathematical approach):
tic
result = (1+50000)*(50000/2)*10 - 5*50000;
toc
Elapsed time is 0.003702 seconds.
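The closed form above follows from the arithmetic series. As a quick derivation (writing N for the vector length, a symbol not used in the original), the whole sum collapses to 5N^2:

```latex
\sum_{n=1}^{N} (10n - 5) = 10 \cdot \frac{N(N+1)}{2} - 5N = 5N^2,
\qquad N = 50000 \;\Rightarrow\; 5 \cdot 50000^2 = 1.25 \times 10^{10}.
```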
As you can see, a little math can do more good than purely efficient programming. Also, loops are not always slow: in this tic/toc measurement, the loop is actually faster than the vectorized method:
tic
a = zeros(1,50000);
for n = 1:50000
a(n)=10.*n-5;
end
sum(a);
toc
Elapsed time is 0.006431 seconds.
Timing
Let's do some timing and see the results. The function to run it yourself is provided at the bottom. The approximate execution time execTime is in seconds and the percentage of improvement impPercentage in %.
Results
R2016a on OSX 10.11.4
                   execTime     impPercentage
                  __________    _____________
    loop          0.00059336         0
    vectorized    0.00014494    75.574
    adiel         0.00010468    82.359
    math          9.3659e-08    99.984
Code
The following function can be used to generate the output. Note that it requires R2013b or later for the built-in timeit function and table.
function timings
%feature('accel','on') %// commented out because it's undocumented
cycleCount = 100;
execTime = zeros(4,cycleCount);
names = {'loop';'vectorized';'adiel';'math'};
w = warning;
warning('off','MATLAB:timeit:HighOverhead');
for k = 1:cycleCount
execTime(1,k) = timeit(@()loop,1);
execTime(2,k) = timeit(@()vectorized,1);
execTime(3,k) = timeit(@()adiel,1);
execTime(4,k) = timeit(@()math,1);
end
warning(w);
execTime = min(execTime,[],2);
impPercentage = (1 - execTime/max(execTime)) * 100;
table(execTime,impPercentage,'RowNames',names)
function result = loop
a = zeros(1,50000);
for n = 1:50000
a(n) = 10.*n - 5;
end
result = sum(a);
function result = vectorized
b = 1:50000;
a = 10.*b - 5;
result = sum(a);
function result = adiel
result = sum(1:50000)*10 - 5*50000;
function result = math
result = (1+50000)*(50000/2)*10 - 5*50000;
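Before comparing speed, it is worth confirming that the four approaches return the same number. The following is a minimal sketch, assuming only base MATLAB/Octave; the variable names (N, r_loop, r_vec, r_sum, r_math) are illustrative and not from the original answers.

```matlab
N = 50000;

% Loop version, as in the question
a = zeros(1,N);
for n = 1:N
    a(n) = 10.*n - 5;
end
r_loop = sum(a);

% Vectorized, partially simplified, and closed-form versions
r_vec  = sum(10.*(1:N) - 5);
r_sum  = sum(1:N)*10 - 5*N;
r_math = (1+N)*(N/2)*10 - 5*N;

% All intermediate values are integers far below 2^53, so the
% comparison can be exact
assert(isequal(r_loop, r_vec, r_sum, r_math, 5*N^2));
```

Because every value involved is an integer that a double represents exactly, an exact comparison is safe here; with non-integer data a tolerance would be the safer choice.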

Related

Why does `minmax` take longer than a consecutive `min` and `max`?

Well basically the question says it all, my intuition tells me that a call to minmax should take less time than calling a min and then a max.
Is there some optimization I prevent Matlab carrying out in the following code?
minmax:
function minmax_vals = minmaxtest()
buffSize = 1000000;
A = rand(128,buffSize);
windowSize = 100;
minmax_vals = zeros(128,buffSize/windowSize*2);
for i=1:(buffSize/windowSize)
minmax_vals(:,(2*i-1):(2*i)) = minmax(A(:,((i-1)*windowSize+1):(i*windowSize)));
end
end
separate min-max:
function minmax_vals = minmaxtest()
buffSize = 1000000;
A = rand(128,buffSize);
windowSize = 100;
minmax_vals = zeros(128,buffSize/windowSize*2);
for i=1:(buffSize/windowSize)
minmax_vals(:,(2*i-1)) = min(A(:,((i-1)*windowSize+1):(i*windowSize)),[],2);
minmax_vals(:,(2*i)) = max(A(:,((i-1)*windowSize+1):(i*windowSize)),[],2);
end
end
Summary
You can see the overhead because minmax isn't completely obfuscated. Simply type
edit minmax
And you will see the function!
It appears that there is a data-type conversion, nntype.data('format',x,'Data');, which min and max do not perform and which could be costly. This is for use with MATLAB's neural network (nn) tools, as minmax belongs to that toolbox.
In short, min and max are lower-level, compiled functions (hence they are fully obfuscated), which don't require functionality from the nn toolbox.
Benchmark
Here is a slightly more isolated benchmark, without your windowing and using timeit instead of the profiler. I've also included timings for just the data conversion used in minmax! The test gets the min and max of each row in a large matrix; the code that generates the timing plot is below.
It appears that there is a linear relationship between number of rows and time taken (as expected for a linear operator), but the coefficient is much greater for the combined minmax relationship, with the separate operations being approximately 10x quicker. Also you can clearly see that data conversion takes more time than the min then max version alone!
function benchie()
K = zeros(10, 3);
for k = 1:10
n = 2^k;
A = rand(n, 200);
Arow = zeros(1,200);
m = zeros(n,2);
f1 = @()minmaxtest(A,m);
K(k,1) = timeit(f1);
f2 = @()minthenmaxtest(A,m);
K(k,2) = timeit(f2);
f3 = @()dataconversiontest(A, Arow);
K(k,3) = timeit(f3);
end
figure; hold on; plot(2.^(1:10), K(:,1)); plot(2.^(1:10), K(:,2)); plot(2.^(1:10), K(:,3));
end
function minmaxtest(A,m)
for ii = 1:size(A,1)
m(ii, 1:2) = minmax(A(ii,:));
end
end
function dataconversiontest(A, Arow)
for ii = 1:size(A,1)
Arow = nntype.data('format', A(ii,:), 'Data');
end
end
function minthenmaxtest(A,m)
for ii = 1:size(A,1)
m(ii, 1) = min(A(ii,:));
m(ii, 2) = max(A(ii,:));
end
end
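The row loop in minthenmaxtest can itself be avoided, because min and max accept a dimension argument (the question's windowed version already uses it). A minimal sketch, assuming a small random matrix; the names m_loop and m_vec are illustrative:

```matlab
A = rand(8, 200);    % small example matrix (size is an arbitrary choice)

% Loop version, as in minthenmaxtest
m_loop = zeros(size(A,1), 2);
for ii = 1:size(A,1)
    m_loop(ii,1) = min(A(ii,:));
    m_loop(ii,2) = max(A(ii,:));
end

% Vectorized version: reduce along dimension 2 (across the columns)
m_vec = [min(A, [], 2), max(A, [], 2)];

assert(isequal(m_loop, m_vec));
```

This needs no toolbox function, so it also sidesteps the data conversion discussed above.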

Initialize vector with function in matlab

I just started out with MATLAB and am having trouble finding a solution for the following:
I am trying to initialize a vector of 1000 different values, with a function that doesn't take any arguments as input. I can do this with a for loop, but haven't found out how to do it without.
What I expected to work:
z = zeros(1,1000)
result = arrayfun(*functionname*,z)
This however gives an error saying that the first input must be a function handle.
My function is a simple implementation of a monte carlo method to calculate pi:
function Result = mcm()
clear
N=1000;
M=0;
for j=1:N
p=[2*rand-1; 2*rand-1];
if p'*p<1
M=M+1;
end
end
Result=4*M/N
One way to actually vectorize your given function mcm would be -
N = 1000; %// Number of data points
P = [2*rand(1,N)-1; 2*rand(1,N)-1]; %// OR 2*rand(2,N)-1
out = 4*sum(sum(P.^2,1)<1)/N
Runtime tests
Code -
N = 1000000; %// Number of data points
disp('---------------- With Original Approach')
tic
M=0;
for j=1:N
P=[2*rand-1; 2*rand-1];
if P'*P<1
M=M+1;
end
end
Result=4*M/N;
toc
disp('---------------- With Proposed Approach')
tic
P = 2*rand(2,N)-1;
out = 4*sum(sum(P.^2,1)<1)/N;
toc
Timings & Outputs -
---------------- With Original Approach
Elapsed time is 3.952998 seconds.
---------------- With Proposed Approach
Elapsed time is 0.089590 seconds.
>> Result
Result =
3.1422
>> out
out =
3.1428
Since your function takes no arguments, you can't pass it to arrayfun directly: arrayfun applies the function to each element in the array, so it calls the function with one input.
Instead use this:
z = ones(1,1000) * mcm;
A side benefit is that mcm will only run once, so it will be faster than looping that function 1000 times. Note, however, that all 1000 entries then hold the same value.
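If 1000 independent estimates are wanted, one option is to wrap the call in an anonymous function that ignores the element arrayfun passes in. This is a sketch, not the answer's method; mcm_handle is a hypothetical stand-in for mcm, written in the vectorized form shown earlier so the snippet is self-contained.

```matlab
% Hypothetical stand-in for mcm: one Monte Carlo estimate of pi per call
mcm_handle = @() 4*sum(sum((2*rand(2,1000)-1).^2, 1) < 1)/1000;

z = zeros(1,1000);
% @(~) discards the element supplied by arrayfun and calls the function anew,
% so each entry of result is a separate estimate
result = arrayfun(@(~) mcm_handle(), z);
```

Each element of result comes from its own call, so the values differ from one another, unlike with ones(1,1000) * mcm.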

Time series aggregation efficiency

I commonly need to summarize a time series with irregular timing with a given aggregation function (e.g., sum, average, etc.). However, the current solution that I have seems inefficient and slow.
Take the aggregation function:
function aggArray = aggregate(array, groupIndex, collapseFn)
groups = unique(groupIndex, 'rows');
aggArray = nan(size(groups, 1), size(array, 2));
for iGr = 1:size(groups,1)
grIdx = all(groupIndex == repmat(groups(iGr,:), [size(groupIndex,1), 1]), 2);
for iSer = 1:size(array, 2)
aggArray(iGr,iSer) = collapseFn(array(grIdx,iSer));
end
end
end
Note that both array and groupIndex can be 2D. Every column in array is an independent series to be aggregated, but the columns of groupIndex should be taken together (as a row) to specify a period.
Then when we bring an irregular time series to it (note the second period is one base period longer), the timing results are poor:
a = rand(20006,10);
b = transpose([ones(1,5) 2*ones(1,6) sort(repmat((3:4001), [1 5]))]);
tic; aggregate(a, b, @sum); toc
Elapsed time is 1.370001 seconds.
Using the profiler, we can find out that the grIdx line takes about 1/4 of the execution time (.28 s) and the iSer loop takes about 3/4 (1.17 s) of the total (1.48 s).
Compare this with the period-indifferent case:
tic; cumsum(a); toc
Elapsed time is 0.000930 seconds.
Is there a more efficient way to aggregate this data?
Timing Results
Taking each response and putting it in a separate function, here are the timing results I get with timeit with Matlab 2015b on Windows 7 with an Intel i7:
original | 1.32451
felix1 | 0.35446
felix2 | 0.16432
divakar1 | 0.41905
divakar2 | 0.30509
divakar3 | 0.16738
matthewGunn1 | 0.02678
matthewGunn2 | 0.01977
Clarification on groupIndex
An example of a 2D groupIndex would be where both the year number and week number are specified for a set of daily data covering 1980-2015:
a2 = rand(36*52*5, 10);
b2 = [sort(repmat(1980:2015, [1 52*5]))' repmat(1:52, [1 36*5])'];
Thus a "year-week" period is uniquely identified by a row of groupIndex. This is effectively handled through calling unique(groupIndex, 'rows') and taking the third output, so feel free to disregard this portion of the question.
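To make the role of the third output concrete, here is a small sketch with made-up (year, week) rows; the data values are assumptions chosen only for illustration.

```matlab
% Five daily observations tagged with (year, week); values are made up
groupIndex = [1980 1; 1980 1; 1980 2; 1981 1; 1981 1];

[groups, ~, idx] = unique(groupIndex, 'rows');

% groups lists each distinct (year, week) row once, in sorted order;
% idx(k) is the row of groups that observation k belongs to
assert(isequal(groups, [1980 1; 1980 2; 1981 1]));
assert(isequal(idx(:), [1; 1; 2; 3; 3]));
```

The idx vector is a single integer label per observation, which is the form that functions such as accumarray expect.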
Method #1
You can create the mask corresponding to grIdx across all groups in one go with bsxfun(@eq,..). Now, for collapseFn as @sum, you can bring in matrix-multiplication and thus have a completely vectorized approach, like so -
M = squeeze(all(bsxfun(@eq,groupIndex,permute(groups,[3 2 1])),2))
aggArray = M.'*array
For collapseFn as @mean, you need to do a bit more work, as shown here -
M = squeeze(all(bsxfun(@eq,groupIndex,permute(groups,[3 2 1])),2))
aggArray = bsxfun(@rdivide,M,sum(M,1)).'*array
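A tiny worked example may help show what the mask does. This is a sketch with made-up data; the explicit double(M) conversion and the names aggSum and aggMean are additions for clarity, not part of the original answer.

```matlab
groupIndex = [1; 1; 2; 2; 2];                 % made-up group labels
array      = [1 10; 2 20; 3 30; 4 40; 5 50];  % two series, five samples
groups     = unique(groupIndex, 'rows');

% M(i,g) is true when row i of groupIndex equals row g of groups
M = squeeze(all(bsxfun(@eq, groupIndex, permute(groups,[3 2 1])), 2));
M = double(M);   % explicit conversion so the matrix product is unambiguous

aggSum  = M.' * array;                              % per-group sums: [3 30; 12 120]
aggMean = bsxfun(@rdivide, M, sum(M,1)).' * array;  % per-group means
```

Each column of M selects the rows of one group, so M.'*array adds up exactly those rows; dividing each column of M by its count first turns the sums into means.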
Method #2
In case you are working with a generic collapseFn, you can use the 2D mask M created with the previous method to index into the rows of array, thus changing the complexity from O(n^2) to O(n). Some quick tests suggest this to give appreciable speedup over the original loopy code. Here's the implementation -
n = size(groups,1);
M = squeeze(all(bsxfun(@eq,groupIndex,permute(groups,[3 2 1])),2));
out = zeros(n,size(array,2));
for iGr = 1:n
out(iGr,:) = collapseFn(array(M(:,iGr),:),1);
end
Please note that the 1 in collapseFn(array(M(:,iGr),:),1) denotes the dimension along which collapseFn would be applied, so that 1 is essential there.
Bonus
By its name, groupIndex seems like it would hold integer values, which could be exploited for a more efficient creation of M: treat each row of groupIndex as an indexing tuple, convert each row into a scalar, and thus get a 1D array version of groupIndex. This should be more efficient, as the data size would be O(n) now. This M could be fed to all the approaches listed in this post. So, we would have M like so -
dims = max(groupIndex,[],1);
agg_dims = cumprod([1 dims(end:-1:2)]);
[~,~,idx] = unique(groupIndex*agg_dims(end:-1:1).'); %//'
m = size(groupIndex,1);
M = false(m,max(idx));
M((idx-1)*m + [1:m]') = 1;
Mex Function 1
HAMMER TIME: Mex function to crush it:
The base case test with original code from the question took 1.334139 seconds on my machine. IMHO, the 2nd fastest answer from @Divakar is:
groups2 = unique(groupIndex);
aggArray2 = squeeze(all(bsxfun(@eq,groupIndex,permute(groups2,[3 2 1])),2)).'*array;
Elapsed time is 0.589330 seconds.
Then my MEX function:
[groups3, aggArray3] = mg_aggregate(array, groupIndex, @(x) sum(x, 1));
Elapsed time is 0.079725 seconds.
Testing that we get the same answer: norm(groups2-groups3) returns 0 and norm(aggArray2 - aggArray3) returns 2.3959e-15. Results also match original code.
Code to generate the test conditions:
array = rand(20006,10);
groupIndex = transpose([ones(1,5) 2*ones(1,6) sort(repmat((3:4001), [1 5]))]);
For pure speed, go mex. If the thought of compiling C++ code / complexity is too much of a pain, go with Divakar's answer. Another disclaimer: I haven't subjected my function to robust testing.
Mex Approach 2
Somewhat surprising to me, this code appears even faster than the full Mex version in some cases (eg. in this test took about .05 seconds). It uses a mex function mg_getRowsWithKey to figure out the indices of groups. I think it may be because my array copying in the full mex function isn't as fast as it could be and/or overhead from calling 'feval'. It's basically the same algorithmic complexity as the other version.
[unique_groups, map] = mg_getRowsWithKey(groupIndex);
results = zeros(length(unique_groups), size(array,2));
for iGr = 1:length(unique_groups)
array_subset = array(map{iGr},:);
%// do your collapse function on array_subset. eg.
results(iGr,:) = sum(array_subset, 1);
end
When you do array(groups(1)==groupIndex,:) to pull out array entries associated with the full group, you're searching through the ENTIRE length of groupIndex. If you have millions of row entries, this will totally suck. array(map{1},:) is far more efficient.
There's still unnecessary copying of memory and other overhead associated with calling 'feval' on the collapse function. If you implement the aggregator function efficiently in c++ in such a way to avoid copying of memory, probably another 2x speedup can be achieved.
A little late to the party, but a single loop using accumarray makes a huge difference:
function aggArray = aggregate_gnovice(array, groupIndex, collapseFn)
[groups, ~, index] = unique(groupIndex, 'rows');
numCols = size(array, 2);
aggArray = nan(size(groups, 1), numCols); %// one row per unique group, also for 2D groupIndex
for col = 1:numCols
aggArray(:, col) = accumarray(index, array(:, col), [], collapseFn);
end
end
Timing this using timeit in MATLAB R2016b for the sample data in the question gives the following:
original | 1.127141
gnovice | 0.002205
Over a 500x speedup!
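For readers new to accumarray, a minimal sketch of what a single call does for one data column; the index and value vectors are made up for illustration.

```matlab
index = [1; 1; 2; 3; 3];     % group number of each sample (made up)
vals  = [1; 2; 3; 4; 5];     % one data column

% accumarray gathers the values that share an index and applies the function
groupSum = accumarray(index, vals, [], @sum);   % [3; 3; 9]
groupMax = accumarray(index, vals, [], @max);   % [2; 3; 5]
```

The grouping work happens inside accumarray, so no logical mask over the full index vector has to be built for each group.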
Doing away with the inner loop, i.e.
function aggArray = aggregate(array, groupIndex, collapseFn)
groups = unique(groupIndex, 'rows');
aggArray = nan(size(groups, 1), size(array, 2));
for iGr = 1:size(groups,1)
grIdx = all(groupIndex == repmat(groups(iGr,:), [size(groupIndex,1), 1]), 2);
aggArray(iGr,:) = collapseFn(array(grIdx,:));
end
and calling the collapse function with a dimension parameter
res=aggregate(a, b, @(x)sum(x,1));
gives some speedup (3x on my machine) already. It also avoids the errors that sum or mean produce when they encounter a single row of data without a dimension parameter and then collapse across columns rather than within the group.
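The single-row pitfall is easy to reproduce. A short sketch, with a made-up row standing in for a group that contains only one sample:

```matlab
x = [1 2 3];             % a group with one row, three series

s_default = sum(x);      % 6: a row vector is summed across its columns
s_dim1    = sum(x, 1);   % [1 2 3]: each column is reduced separately
```

Without the dimension argument the three series are merged into one number, which is why the collapse function is called as sum(x,1).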
If you had just one group label vector, i.e. the same group labels for all columns of data, you could speed things up further:
function aggArray = aggregate(array, groupIndex, collapseFn)
ng=max(groupIndex);
aggArray = nan(ng, size(array, 2));
for iGr = 1:ng
aggArray(iGr,:) = collapseFn(array(groupIndex==iGr,:));
end
The latter function gives identical results for your example, with a 6x speedup, but cannot handle different group labels per data column.
Assuming a 2D test case for the group index (provided here as well, with 10 different columns for groupIndex):
a = rand(20006,10);
B=[]; % make random length periods for each of the 10 signals
for i=1:size(a,2)
n0=randi(10);
b=transpose([ones(1,n0) 2*ones(1,11-n0) sort(repmat((3:4001), [1 5]))]);
B=[B b];
end
tic; erg0=aggregate(a, B, @sum); toc % original method
tic; erg1=aggregate2(a, B, @(x)sum(x,1)); toc %just remove the inner loop
tic; erg2=aggregate3(a, B, @(x)sum(x,1)); toc %use function below
Elapsed time is 2.646297 seconds.
Elapsed time is 1.214365 seconds.
Elapsed time is 0.039678 seconds (!!!!).
function aggArray = aggregate3(array, groupIndex, collapseFn)
[groups,ix1,jx] = unique(groupIndex, 'rows','first');
[groups,ix2,jx] = unique(groupIndex, 'rows','last');
ng=size(groups,1);
aggArray = nan(ng, size(array, 2));
for iGr = 1:ng
aggArray(iGr,:) = collapseFn(array(ix1(iGr):ix2(iGr),:));
end
I think this is as fast as it gets without using MEX. Thanks to the suggestion of Matthew Gunn!
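Profiling shows that 'unique' is really cheap here, and getting just the first and last index of the repeating rows in groupIndex speeds things up considerably. Note that indexing with ix1(iGr):ix2(iGr) relies on the rows of each group being contiguous in groupIndex, as they are in this test case. I get 88x speedup with this iteration of the aggregation.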
Well I have a solution that is almost as quick as the mex but only using matlab.
The logic is the same as most of the above, creating a dummy 2D matrix but instead of using @eq I initialize a logical array from the start.
Elapsed time for mine is 0.172975 seconds.
Elapsed time for Divakar 0.289122 seconds.
function aggArray = aggregate(array, group, collapseFn)
[m,~] = size(array);
n = max(group);
D = false(m,n);
row = (1:m)';
idx = m*(group(:) - 1) + row;
D(idx) = true;
aggArray = zeros(n,size(array,2)); %// one row per group; assign to the declared output
for ii = 1:n
aggArray(ii,:) = collapseFn(array(D(:,ii),:),1);
end
end

How to reduce the time of computation in presence of for loops and matrix-vector multiplication

I am writing a program in which computation time is really important, so I have to write my code in a way that reduces it. Below is my code, but it becomes time consuming as the length of my vectors grows. Is there any way to produce the same result faster?
K1 = [1 2 3 4 5]; K2 = [6 7 8 9 10];
kt1 = [1.5 3 4.5]; kt2 = [6.5 8 9.5];
numk1 = bsxfun(@minus,K1.',kt1);
denomk1 = bsxfun(@minus, kt1.',kt1)+eye(numel(kt1));
numk2 = bsxfun(@minus,K2.',kt2);
denomk2 = bsxfun(@minus, kt2.', kt2)+eye(numel(kt2));
for j=1:numel(kt1)
for jj=1:numel(kt2)
k1_dir = bsxfun(@rdivide,numk1,denomk1(j,:)); k1_dir(:,j)=[];
k_dir1 = prod(k1_dir,2);
k2_dir = bsxfun(@rdivide,numk2,denomk2(jj,:)); k2_dir(:,jj)=[];
k_dir2 = prod(k2_dir,2);
k1_k2(:,:,j,jj) = k_dir1 * k_dir2';
end
end
In the above code, as the lengths of K1 and K2 increase, the lengths of kt1 and kt2 increase too. So for long vectors this code is time consuming.
Going full-throttle on vectorization, this could be one approach to replace the loopy portion of the code listed in the problem -
%// Size parameters
M1 = numel(K1);
N1 = numel(kt1);
M2 = numel(K2);
N2 = numel(kt2);
%// Indices to be removed from k1_dir & k2_dir.
%// In our vectorized version, we will just use these to set
%// corresponding elements in vectorized versions of k1_dir & k2_dir
%// to ONES, as later on PROD would take care of it.
rm_idx1 = bsxfun(@plus,[1:M1]',[0:N1-1]*(M1*N1+M1)); %//'
rm_idx2 = bsxfun(@plus,[1:M2]',[0:N2-1]*(M2*N2+M2)); %//'
%// Get vectorized version of k1_dir, as k1_dirv
k1_dirv = bsxfun(@rdivide,numk1,permute(denomk1,[3 2 1]));
k1_dirv(rm_idx1) = 1;
k_dir1v = prod(k1_dirv,2);
%// Get vectorized version of k2_dir, as k2_dirv
k2_dirv = bsxfun(@rdivide,numk2,permute(denomk2,[3 2 1]));
k2_dirv(rm_idx2) = 1;
k_dir2v = prod(k2_dirv,2);
%// Get vectorized version of k1_k2, as k1_k2v
k1_k2v = bsxfun(@times,k_dir1v,permute(k_dir2v,[2 1 4 3]));
Quick runtime test:
With the inputs setup like so -
SZ1 = 100;
SZ2 = 100;
K1 = randi(9,1,SZ1);
K2 = randi(9,1,SZ1);
kt1 = randi(9,1,SZ2);
kt2 = randi(9,1,SZ2);
The runtimes for the loopy portion in the original (after adding pre-allocation with zeros for a fairer benchmark) and the proposed vectorized approach were -
-------------------------- With Original Loopy Approach
Elapsed time is 1.086666 seconds.
-------------------------- With Proposed Vectorized Approach
Elapsed time is 0.178805 seconds.
It doesn't seem like JIT is showing its magic, at least not when bsxfun is used inside nested loops, and the fact that you need to index into that huge 4D array in each iteration isn't helping either. So, going full-throttle on vectorization in cases like these makes more sense!
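Since the vectorized version replaces column deletion with "set to one, then prod", it is worth checking it against the loops on the small inputs from the question. The following is a sketch under that assumption; the short names (ref, vec, rm1, rm2, v1, v2, d1, d2, maxDiff) are illustrative and not from the original answer.

```matlab
K1 = [1 2 3 4 5];   K2 = [6 7 8 9 10];
kt1 = [1.5 3 4.5];  kt2 = [6.5 8 9.5];
M1 = numel(K1); N1 = numel(kt1); M2 = numel(K2); N2 = numel(kt2);

numk1   = bsxfun(@minus, K1.', kt1);
denomk1 = bsxfun(@minus, kt1.', kt1) + eye(N1);
numk2   = bsxfun(@minus, K2.', kt2);
denomk2 = bsxfun(@minus, kt2.', kt2) + eye(N2);

% Reference: the nested loops from the question, with pre-allocation
ref = zeros(M1, M2, N1, N2);
for j = 1:N1
    for jj = 1:N2
        d1 = bsxfun(@rdivide, numk1, denomk1(j,:));   d1(:,j)  = [];
        d2 = bsxfun(@rdivide, numk2, denomk2(jj,:));  d2(:,jj) = [];
        ref(:,:,j,jj) = prod(d1,2) * prod(d2,2)';
    end
end

% Vectorized: entries that the loop deletes are set to 1 before prod
rm1 = bsxfun(@plus, (1:M1)', (0:N1-1)*(M1*N1+M1));
rm2 = bsxfun(@plus, (1:M2)', (0:N2-1)*(M2*N2+M2));
v1 = bsxfun(@rdivide, numk1, permute(denomk1,[3 2 1]));  v1(rm1) = 1;
v2 = bsxfun(@rdivide, numk2, permute(denomk2,[3 2 1]));  v2(rm2) = 1;
vec = bsxfun(@times, prod(v1,2), permute(prod(v2,2),[2 1 4 3]));

% Largest absolute difference between the two 4D results
maxDiff = max(abs(ref(:) - vec(:)));
```

The linear index r + (j-1)*(M1*N1+M1) addresses element (r, j, j) of the M1-by-N1-by-N1 array, which is exactly the column the loop removes for slice j.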

Octave: how can these FOR loops be vectorized?

I am writing an Octave script to calculate the price of a European option.
The first part uses Monte Carlo to simulate the underlying asset price over n number of time periods. This is repeated nIter number of times.
Octave makes it very easy to setup initial matrices. But I haven't found the way to complete the task in a vectorized way, avoiding FOR loops:
%% Octave simplifies creation of 'e', 'dlns', and 'Prices'
e = norminv(rand(nIter,n));
dlns = cat(2, ones(nIter,1), exp((adj_r+0.5*sigma^2)*dt+sigma*e.*sqrt(dt)));
Prices = zeros(nIter, n+1);
for i = 1:nIter % IS THERE A WAY TO VECTORIZE THESE FOR LOOPS?
for j = 1:n+1
if j == 1
Prices(i,j)=S0;
else
Prices(i,j)=Prices(i,j-1)*dlns(i,j);
end
endfor
endfor
Note that the price in n is equal to price in n-1 times a factor, hence the following does not work...
Prices(i,:) = S0 * dlns(i,:)
...since it takes S0 and multiplies it by all the factors, yielding different results than the expected random walk.
Because each new column depends on the previous column, it seems you would need at least one loop there. You can still do all operations within a column in a vectorized fashion, and that might speed it up for you. The vectorized replacement for the two nested loops would look something like this -
Prices(:,1)=S0;
for j = 2:n+1
Prices(:,j) = Prices(:,j-1).*dlns(:,j);
endfor
It just occurred to me that the dependency can be taken care of with cumprod, which computes the cumulative product that is essentially being done here, and thus leads to a no-loop solution! Here's the implementation -
Prices = [repmat(S0,nIter,1) cumprod(dlns(:,2:end),2)*S0]
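To see that the cumprod form reproduces the column-by-column recursion, here is a small sketch with made-up growth factors; the values of S0 and dlns, and the names P_loop, P_cum, nCols and maxDiff, are assumptions for illustration.

```matlab
S0   = 60;                      % starting price (arbitrary choice)
dlns = [1 1.01 0.99 1.02;       % made-up per-step growth factors;
        1 0.98 1.03 1.00;       % first column is all ones, as in the question
        1 1.05 0.97 1.01];
[nIter, nCols] = size(dlns);

% One-loop version: each column depends on the previous one
P_loop = zeros(nIter, nCols);
P_loop(:,1) = S0;
for j = 2:nCols
    P_loop(:,j) = P_loop(:,j-1) .* dlns(:,j);
end

% No-loop version: cumulative product along each row, scaled by S0
P_cum = [repmat(S0,nIter,1), cumprod(dlns(:,2:end),2)*S0];

% The two differ only in the order of multiplications
maxDiff = max(abs(P_loop(:) - P_cum(:)));
```

The multiplication order differs between the two versions, so compare them with a small tolerance rather than isequal.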
Benchmarking on MATLAB
Benchmarking Code -
%// Parameters as told by OP and then create the inputs
nIter= 100000;
n = 100;
adj_r = 0.03;
sigma = 0.2;
dt = 1/n;
S0 = 60;
e = norminv(rand(nIter,n));
dlns = cat(2, ones(nIter,1), exp((adj_r+0.5*sigma^2)*dt+sigma*e.*sqrt(dt)));
disp('-------------------------------------- With Original Approach')
tic
Prices = zeros(nIter, n+1);
for i = 1:nIter
for j = 1:n+1
if j == 1
Prices(i,j)=S0;
else
Prices(i,j)=Prices(i,j-1)*dlns(i,j);
end
end
end
toc, clear Prices
disp('-------------------------------------- With Proposed Approach - I')
tic
Prices2(nIter, n+1)=0; %// faster pre-allocation scheme
Prices2(:,1)=S0;
for j = 2:n+1
Prices2(:,j)=Prices2(:,j-1).*dlns(:,j);
end
toc, clear Prices2
disp('-------------------------------------- With Proposed Approach - II')
tic
Prices3 = [repmat(S0,nIter,1) cumprod(dlns(:,2:end),2)*S0];
toc, clear Prices3
Runtimes results -
-------------------------------------- With Original Approach
Elapsed time is 0.259054 seconds.
-------------------------------------- With Proposed Approach - I
Elapsed time is 0.020566 seconds.
-------------------------------------- With Proposed Approach - II
Elapsed time is 0.067292 seconds.
Now, the runtimes do suggest that the first proposed approach might be a better fit here!