why arrayfun does NOT improve my struct array operation performance - matlab

Here is the input data:
% @param Landmarks:
% Landmarks should be a 1*m struct.
% m is the number of items in the training set.
% Landmarks(i).data is an n*2 matrix
old function:
function Landmarks=CenterOfGravity(Landmarks)
% align center of gravity
for i=1 : length(Landmarks)
Landmarks(i).data=Landmarks(i).data - ones(size(Landmarks(i).data,1),1)...
*mean(Landmarks(i).data);
end
end
New function, which uses arrayfun:
function [Landmarks] = center_to_gravity(Landmarks)
Landmarks = arrayfun(@(struct_data)...
struct('data', struct_data.data - repmat(mean(struct_data.data), [size(struct_data.data, 1), 1]))...
,Landmarks);
end %function center_to_gravity
When using the profiler, I find that the time usage is NOT what I expected:
Function            Total Time   Self Time*
CenterOfGravity     0.011 s      0.004 s
center_to_gravity   0.029 s      0.001 s
Can someone tell me why?
BTW...I can't add "arrayfun" as a new tag for my reputation.

Using arrayfun does not count as "vectorizing your code" as described in every Matlab performance blog post ever written.
If your .data field is the same length for all entries of Landmarks, you could vectorize this code by first placing all of the data into a single DATASIZE-BY-LANDMARKSIZE matrix, and then running this command:
meanRemovedData = bsxfun(@minus, data, mean(data,1));
But you lose an awful lot of code clarity that way. (I'm pretty sure that bsxfun usually has vectorization-like speed advantages, but I haven't done any time testing this morning.)
In terms of why, I'm not really the right guy to ask. But many of the advantages of vectorization depend on performing simple operations on contiguous blocks of memory. Data stored in an array of structures is (I believe) stored as an array of pointers to disparate memory locations, which is why you can change the size or class of Landmarks(i).data without reallocating the whole structure array.
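As a rough sketch of the stacking idea above, assuming every Landmarks(i).data really has the same n-by-2 size: the version below stacks the matrices along a third dimension rather than into one 2-D matrix, which is a variation chosen here for convenience. The sample data and the names stacked, centered, and pieces are made up for illustration.

```matlab
% Hypothetical sample input: 4 landmark sets, each 5-by-2 (equal sizes assumed).
m = 4; n = 5;
Landmarks = struct('data', cell(1, m));
for i = 1:m
    Landmarks(i).data = rand(n, 2);
end

stacked  = cat(3, Landmarks.data);                     % n-by-2-by-m array
centered = bsxfun(@minus, stacked, mean(stacked, 1));  % remove each column mean

% Scatter the n-by-2 pages back into the struct array.
pieces = num2cell(centered, [1 2]);                    % 1-by-1-by-m cell array
[Landmarks.data] = pieces{:};
```

Whether this beats the plain loop depends on the data; the conversion to and from the struct array has its own cost, so it is worth timing before adopting it.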

Thanks to Amro and Pursuit for their enthusiastic answers to my question.
I got the best solution on MATLAB Answers from Jan Simon:
why arrayfun does NOT improve my struct array operation performance
Some points that do improve the performance:
Surprisingly, SUM/LENGTH is faster than MEAN.
timeit can give more accurate results.
The fastest approach uses tricks like this:
m = sum(data, 1) / size(data, 1);
data(:, 1) = data(:, 1) - m(1);
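The snippet above only shows the first column. A sketch of how the trick might be completed for an n-by-2 matrix, and how the SUM-versus-MEAN claim could be checked with timeit, is given below. The second-column line and the sample data are assumptions added here, not necessarily what the linked answer contains.

```matlab
data = rand(1000, 2);                   % hypothetical n-by-2 landmark matrix
original = data;                        % kept only for the timing comparison

m = sum(data, 1) / size(data, 1);       % column means without calling MEAN
data(:, 1) = data(:, 1) - m(1);
data(:, 2) = data(:, 2) - m(2);         % assumed: same step for the second column

% Time both ways of computing the mean (results vary by machine and release).
tMean = timeit(@() mean(original, 1));
tSum  = timeit(@() sum(original, 1) / size(original, 1));
fprintf('mean: %.3g s, sum/size: %.3g s\n', tMean, tSum);
```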

Consider the following three implementations (all vectorized using BSXFUN):
function s = func1(s)
    for i=1:numel(s)
        s(i).data = bsxfun(@minus, s(i).data, mean(s(i).data));
    end
end
function v = func2(s)
    v = arrayfun(@(ss) bsxfun(@minus,ss.data,mean(ss.data)), ...
        s, 'UniformOutput',false);
    v = struct('data',v);
end
function v = func3(s)
    v = arrayfun(@(ss) struct('data',bsxfun(@minus,ss.data,mean(ss.data))), ...
        s, 'UniformOutput',true);
end
Explanation:
The first uses a for-loop to iterate over the array of structs.
The second uses ARRAYFUN to return a cell array of the data matrices, which are then passed to STRUCT to build the array of structures.
The last one uses ARRAYFUN and builds a structure directly at each iteration.
Here is a simple test to compare the timings:
function testArrayStruct()
%# sample array of structures
s = struct('data',[]);
for i=5000:-1:1
s(i).data = rand(randi(1000),2);
end
%# timing
tic; v1 = func1(s); toc
tic; v2 = func2(s); toc
tic; v3 = func3(s); toc
%# check all have the same output
assert(isequal(v1,v2,v3))
end
The results:
Elapsed time is 0.357796 seconds. %# func1
Elapsed time is 0.427568 seconds. %# func2
Elapsed time is 0.537971 seconds. %# func3
So you can see that the loop-based solution is actually the fastest.

Related

MATLAB cellfun vectorization slow when using function handle

I encountered a weird bug in cell vectorization (MATLAB version R2019B).
Please consider the following minimal example, say we generate a cell array with variable length vector in each cell:
N = 10000;
rng(1);
result = cell(N,1);
numConnect = randi(10, [N,1]); % randomly generated number of connected nodes
for i = 1:N
result{i} = randi(N, [1, numConnect(i)]);
end
Now we want to retrieve numConnect after the fact, i.e., the length of each cell, and we can use cellfun for that. According to this documentation, in backward-compatibility mode you can pass a string as the func argument instead of a function handle. However, there is a drastic difference in performance on my machine.
tic;
nC1 = cellfun('length', result);
toc;
This one usually produces something like
Elapsed time is 0.038531 seconds.
If I change to an @ function handle:
tic;
nC2 = cellfun(@length, result);
toc;
Then something like
Elapsed time is 1.041925 seconds.
is typical. That is nearly a 30x difference!
Is this performance difference a bug on my local machine, or a "feature" of MATLAB's cellfun?
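To check whether the gap reproduces on a given machine, a sketch along these lines could be used. It rebuilds the cell array from the question and uses timeit instead of tic/toc; the variable names tString and tHandle are made up here, and the measured ratio will differ between releases and machines.

```matlab
N = 10000;
rng(1);
result = cell(N, 1);
numConnect = randi(10, [N, 1]);
for i = 1:N
    result{i} = randi(N, [1, numConnect(i)]);
end

tString = timeit(@() cellfun('length', result));   % legacy string syntax
tHandle = timeit(@() cellfun(@length, result));    % function-handle syntax
fprintf('string: %.4g s, handle: %.4g s, ratio: %.1f\n', ...
        tString, tHandle, tHandle / tString);

% Both forms should recover the lengths that were generated.
nC1 = cellfun('length', result);
nC2 = cellfun(@length, result);
```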

Time series aggregation efficiency

I commonly need to summarize a time series with irregular timing with a given aggregation function (i.e., sum, average, etc.). However, the current solution that I have seems inefficient and slow.
Take the aggregation function:
function aggArray = aggregate(array, groupIndex, collapseFn)
groups = unique(groupIndex, 'rows');
aggArray = nan(size(groups, 1), size(array, 2));
for iGr = 1:size(groups,1)
grIdx = all(groupIndex == repmat(groups(iGr,:), [size(groupIndex,1), 1]), 2);
for iSer = 1:size(array, 2)
aggArray(iGr,iSer) = collapseFn(array(grIdx,iSer));
end
end
end
Note that both array and groupIndex can be 2D. Every column in array is an independent series to be aggregated, but the columns of groupIndex should be taken together (as a row) to specify a period.
Then when we bring an irregular time series to it (note the second period is one base period longer), the timing results are poor:
a = rand(20006,10);
b = transpose([ones(1,5) 2*ones(1,6) sort(repmat((3:4001), [1 5]))]);
tic; aggregate(a, b, @sum); toc
Elapsed time is 1.370001 seconds.
Using the profiler, we can find out that the grIdx line takes about 1/4 of the execution time (0.28 s) and the iSer loop takes about 3/4 (1.17 s) of the total (1.48 s).
Compare this with the period-indifferent case:
tic; cumsum(a); toc
Elapsed time is 0.000930 seconds.
Is there a more efficient way to aggregate this data?
Timing Results
Taking each response and putting it in a separate function, here are the timing results I get with timeit with Matlab 2015b on Windows 7 with an Intel i7:
original | 1.32451
felix1 | 0.35446
felix2 | 0.16432
divakar1 | 0.41905
divakar2 | 0.30509
divakar3 | 0.16738
matthewGunn1 | 0.02678
matthewGunn2 | 0.01977
Clarification on groupIndex
An example of a 2D groupIndex would be where both the year number and week number are specified for a set of daily data covering 1980-2015:
a2 = rand(36*52*5, 10);
b2 = [sort(repmat(1980:2015, [1 52*5]))' repmat(1:52, [1 36*5])'];
Thus a "year-week" period is uniquely identified by a row of groupIndex. This is effectively handled by calling unique(groupIndex, 'rows') and taking the third output, so feel free to disregard this portion of the question.
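As a small illustration of that third output, here is a sketch on made-up [year week] rows (the five rows below are hypothetical and much shorter than the data in the question):

```matlab
% Hypothetical daily observations tagged with [year, week].
groupIndex = [1980 1; 1980 1; 1980 2; 1981 1; 1981 1];

[groups, ~, idx] = unique(groupIndex, 'rows');
% groups lists each distinct [year week] pair once, in sorted row order;
% idx(k) is the row of groups that groupIndex(k,:) matches.
disp(groups);
disp(idx.');
```

Once idx is available, the 2D group index has effectively been reduced to a single integer label per row, which is what most of the approaches below build on.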
Method #1
You can create the mask corresponding to grIdx across all groups in one go with bsxfun(@eq,..). Now, for collapseFn as @sum, you can bring in matrix multiplication and thus have a completely vectorized approach, like so -
M = squeeze(all(bsxfun(@eq,groupIndex,permute(groups,[3 2 1])),2))
aggArray = M.'*array
For collapseFn as @mean, you need to do a bit more work, as shown here -
M = squeeze(all(bsxfun(@eq,groupIndex,permute(groups,[3 2 1])),2))
aggArray = bsxfun(@rdivide,M,sum(M,1)).'*array
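To make the mask-and-multiply idea concrete, here is a sketch on a tiny made-up input. The data values and the explicit double(M) cast are additions for illustration; the two formulas are the ones given above.

```matlab
groupIndex = [1; 1; 2; 3; 3];                 % hypothetical group labels
array      = [1 10; 2 20; 3 30; 4 40; 5 50];  % two series, five samples
groups     = unique(groupIndex, 'rows');

% M(r, g) is true when row r of groupIndex equals row g of groups.
M = squeeze(all(bsxfun(@eq, groupIndex, permute(groups, [3 2 1])), 2));
M = double(M);                                % explicit cast before the products

sums  = M.' * array;                               % per-group column sums
means = bsxfun(@rdivide, M, sum(M, 1)).' * array;  % per-group column means
disp(sums);
disp(means);
```

Note that M has one entry per (sample, group) pair, so its memory footprint grows with the product of the two counts.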
Method #2
In case you are working with a generic collapseFn, you can use the 2D mask M created with the previous method to index into the rows of array, thus changing the complexity from O(n^2) to O(n). Some quick tests suggest that this gives an appreciable speedup over the original loopy code. Here's the implementation -
n = size(groups,1);
M = squeeze(all(bsxfun(@eq,groupIndex,permute(groups,[3 2 1])),2));
out = zeros(n,size(array,2));
for iGr = 1:n
out(iGr,:) = collapseFn(array(M(:,iGr),:),1);
end
Please note that the 1 in collapseFn(array(M(:,iGr),:),1) denotes the dimension along which collapseFn would be applied, so that 1 is essential there.
Bonus
By its name, groupIndex seems like it would hold integer values, which could be exploited for a more efficient M creation: consider each row of groupIndex as an indexing tuple, convert each row into a scalar, and thus get a 1D array version of groupIndex. This must be more efficient, as the data size would be O(n) now. This M could be fed to all the approaches listed in this post. So, we would have M like so -
dims = max(groupIndex,[],1);
agg_dims = cumprod([1 dims(end:-1:2)]);
[~,~,idx] = unique(groupIndex*agg_dims(end:-1:1).'); %//'
m = size(groupIndex,1);
M = false(m,max(idx));
M((idx-1)*m + [1:m]') = 1;
Mex Function 1
HAMMER TIME: Mex function to crush it:
The base-case test with the original code from the question took 1.334139 seconds on my machine. IMHO, the 2nd fastest answer, from @Divakar, is:
groups2 = unique(groupIndex);
aggArray2 = squeeze(all(bsxfun(@eq,groupIndex,permute(groups2,[3 2 1])),2)).'*array;
Elapsed time is 0.589330 seconds.
Then my MEX function:
[groups3, aggArray3] = mg_aggregate(array, groupIndex, @(x) sum(x, 1));
Elapsed time is 0.079725 seconds.
Testing that we get the same answer: norm(groups2-groups3) returns 0 and norm(aggArray2 - aggArray3) returns 2.3959e-15. Results also match original code.
Code to generate the test conditions:
array = rand(20006,10);
groupIndex = transpose([ones(1,5) 2*ones(1,6) sort(repmat((3:4001), [1 5]))]);
For pure speed, go mex. If the thought of compiling C++ code / the complexity is too much of a pain, go with Divakar's answer. Another disclaimer: I haven't subjected my function to robust testing.
Mex Approach 2
Somewhat surprising to me, this code appears even faster than the full Mex version in some cases (eg. in this test took about .05 seconds). It uses a mex function mg_getRowsWithKey to figure out the indices of groups. I think it may be because my array copying in the full mex function isn't as fast as it could be and/or overhead from calling 'feval'. It's basically the same algorithmic complexity as the other version.
[unique_groups, map] = mg_getRowsWithKey(groupIndex);
results = zeros(length(unique_groups), size(array,2));
for iGr = 1:length(unique_groups)
array_subset = array(map{iGr},:);
%// do your collapse function on array_subset. eg.
results(iGr,:) = sum(array_subset, 1);
end
When you do array(groups(1)==groupIndex,:) to pull out array entries associated with the full group, you're searching through the ENTIRE length of groupIndex. If you have millions of row entries, this will totally suck. array(map{1},:) is far more efficient.
There's still unnecessary copying of memory and other overhead associated with calling 'feval' on the collapse function. If you implement the aggregator function efficiently in c++ in such a way to avoid copying of memory, probably another 2x speedup can be achieved.
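The source of mg_getRowsWithKey is not shown here. As a pure-MATLAB stand-in (an assumption made for illustration, not the author's MEX function), a per-group list of row numbers can be built with unique and accumarray, and then used in the same loop as above:

```matlab
groupIndex = [1; 1; 2; 3; 3];                 % hypothetical group labels
array      = [1 10; 2 20; 3 30; 4 40; 5 50];

[unique_groups, ~, idx] = unique(groupIndex, 'rows');
% map{g} holds the row numbers of array that belong to group g.
map = accumarray(idx, (1:numel(idx)).', [], @(rows) {sort(rows)});

results = zeros(size(unique_groups, 1), size(array, 2));
for iGr = 1:size(unique_groups, 1)
    array_subset = array(map{iGr}, :);        % no scan over all of groupIndex
    results(iGr, :) = sum(array_subset, 1);
end
disp(results);
```

This keeps the key property described above: each group is pulled out by direct indexing rather than by comparing against the full groupIndex. It will likely be slower than a compiled version.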
A little late to the party, but a single loop using accumarray makes a huge difference:
function aggArray = aggregate_gnovice(array, groupIndex, collapseFn)
[groups, ~, index] = unique(groupIndex, 'rows');
numCols = size(array, 2);
aggArray = nan(size(groups, 1), numCols);
for col = 1:numCols
aggArray(:, col) = accumarray(index, array(:, col), [], collapseFn);
end
end
Timing this using timeit in MATLAB R2016b for the sample data in the question gives the following:
original | 1.127141
gnovice | 0.002205
Over a 500x speedup!
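For readers who have not used accumarray with a function handle, a minimal sketch on made-up values shows what the fourth argument does (the vectors below are hypothetical):

```matlab
index  = [1; 1; 2; 3; 3];     % group number of each sample
values = [1; 2; 3; 4; 5];

totals = accumarray(index, values);             % default: sum within each group
peaks  = accumarray(index, values, [], @max);   % custom collapse function
disp([totals peaks]);
```

accumarray works on one value vector at a time, which is why the function above still loops over the columns of array.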
Doing away with the inner loop, i.e.
function aggArray = aggregate(array, groupIndex, collapseFn)
groups = unique(groupIndex, 'rows');
aggArray = nan(size(groups, 1), size(array, 2));
for iGr = 1:size(groups,1)
grIdx = all(groupIndex == repmat(groups(iGr,:), [size(groupIndex,1), 1]), 2);
aggArray(iGr,:) = collapseFn(array(grIdx,:));
end
and calling the collapse function with a dimension parameter
res=aggregate(a, b, @(x)sum(x,1));
already gives some speedup (3x on my machine) and avoids the errors that e.g. sum or mean produce when they encounter a single row of data without a dimension parameter and then collapse across columns rather than labels.
If you had just one group label vector, i.e. the same group labels for all columns of data, you could speed things up further:
function aggArray = aggregate(array, groupIndex, collapseFn)
ng=max(groupIndex);
aggArray = nan(ng, size(array, 2));
for iGr = 1:ng
aggArray(iGr,:) = collapseFn(array(groupIndex==iGr,:));
end
The latter function gives identical results for your example, with a 6x speedup, but cannot handle different group labels per data column.
Assuming a 2D test case for the group index (provided here as well, with 10 different columns for groupIndex):
a = rand(20006,10);
B=[]; % make random length periods for each of the 10 signals
for i=1:size(a,2)
n0=randi(10);
b=transpose([ones(1,n0) 2*ones(1,11-n0) sort(repmat((3:4001), [1 5]))]);
B=[B b];
end
tic; erg0=aggregate(a, B, @sum); toc % original method
tic; erg1=aggregate2(a, B, @(x)sum(x,1)); toc %just remove the inner loop
tic; erg2=aggregate3(a, B, @(x)sum(x,1)); toc %use function below
Elapsed time is 2.646297 seconds.
Elapsed time is 1.214365 seconds.
Elapsed time is 0.039678 seconds (!!!!).
function aggArray = aggregate3(array, groupIndex, collapseFn)
[groups,ix1,jx] = unique(groupIndex, 'rows','first');
[groups,ix2,jx] = unique(groupIndex, 'rows','last');
ng=size(groups,1);
aggArray = nan(ng, size(array, 2));
for iGr = 1:ng
aggArray(iGr,:) = collapseFn(array(ix1(iGr):ix2(iGr),:));
end
I think this is as fast as it gets without using MEX. Thanks to the suggestion of Matthew Gunn!
Profiling shows that 'unique' is really cheap here and getting out just the first and last index of the repeating rows in groupIndex speeds things up considerably. I get 88x speedup with this iteration of the aggregation.
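The first/last trick relies on every group occupying one contiguous block of rows, which holds for the sorted sample data. A sketch with an explicit check for that assumption, on made-up data, might look like this (the check via accumarray is an addition, not part of the answer above):

```matlab
groupIndex = [1; 1; 2; 3; 3];                 % sorted, so groups are contiguous
array      = [1 10; 2 20; 3 30; 4 40; 5 50];

[groups, ix1, jx] = unique(groupIndex, 'rows', 'first');
[~, ix2]          = unique(groupIndex, 'rows', 'last');

% Contiguity check: each block length must equal that group's member count.
counts = accumarray(jx, 1);
assert(isequal(ix2 - ix1 + 1, counts), 'groups are not contiguous');

aggArray = zeros(size(groups, 1), size(array, 2));
for iGr = 1:size(groups, 1)
    aggArray(iGr, :) = sum(array(ix1(iGr):ix2(iGr), :), 1);
end
disp(aggArray);
```

If the assertion fails, the range ix1:ix2 would pull in rows from other groups, and one of the index-list or accumarray approaches is the safer choice.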
Well, I have a solution that is almost as quick as the mex but uses only MATLAB.
The logic is the same as most of the above, creating a dummy 2D matrix, but instead of using @eq I initialize a logical array from the start.
Elapsed time for mine is 0.172975 seconds.
Elapsed time for Divakar 0.289122 seconds.
function aggArray = aggregate(array, group, collapseFn)
    [m,~] = size(array);
    n = max(group);
    D = false(m,n);
    row = (1:m)';
    idx = m*(group(:) - 1) + row;
    D(idx) = true;
    aggArray = zeros(n,size(array,2)); % one output row per group
    for ii = 1:n
        aggArray(ii,:) = collapseFn(array(D(:,ii),:),1);
    end
end

Powers of a matrix

I have a square matrix A (nxn). I would like to collect a series of k powers of this matrix in an nxnxk multidimensional array (not element-wise powers but actual matrix powers), i.e. getting [A^0 A^1 A^2 .. A^k]. It's sort of a Vandermonde variant for the matrix case.
I am able to do it with loops but it is annoying and slow. I tried using bsxfun but no luck since I am probably missing something here.
Here is a simple loop that I did:
for j=1:1:100
final(:,:,j)=A^(j-1);
end
You are trying to perform a cumulative version of mpower with a vector of k values.
Sadly, bsxfun hasn't evolved yet to handle such a case. So, the best I could suggest at this point would be having a running storage that accumulates the matrix-product at each iteration to be used at the next one.
Your original loop code looked something like this -
final = zeros([size(A),100]);
for j=1:1:100
final(:,:,j)=A^(j-1);
end
So, with the suggestion, the modified loopy code would be -
final = zeros([size(A),100]);
matprod = A^0;
final(:,:,1) = matprod;
for j=2:1:100
matprod = A*matprod;
final(:,:,j)= matprod;
end
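A quick way to convince yourself that the running product gives the same powers is to try it on a small matrix whose powers are easy to verify by hand. The matrix, the value of k, and the name P below are made up for this sketch:

```matlab
A = [1 1; 0 1];                        % hypothetical test matrix: A^p = [1 p; 0 1]
k = 5;

P = zeros([size(A), k + 1]);           % pages hold A^0, A^1, ..., A^k
P(:, :, 1) = eye(size(A, 1));          % A^0
for j = 2:(k + 1)
    P(:, :, j) = A * P(:, :, j - 1);   % A^(j-1) from A^(j-2)
end
disp(P(:, :, k + 1));
```

Each iteration costs one matrix product, whereas calling A^(j-1) from scratch repeats work that the previous iteration already did.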
Benchmarking -
%// Input
A = randi(9,200,200);
disp('---------- Original loop code -----------------')
tic
final = zeros([size(A),100]);
for j=1:1:100
final(:,:,j)=A^(j-1);
end
toc
disp('---------- Modified loop code -----------------')
tic
final2 = zeros([size(A),100]);
matprod = A^0;
final2(:,:,1) = matprod;
for j=2:1:100
matprod = A*matprod;
final2(:,:,j)= matprod;
end
toc
Runtimes -
---------- Original loop code -----------------
Elapsed time is 1.255266 seconds.
---------- Modified loop code -----------------
Elapsed time is 0.205227 seconds.

How to accumulate submatrices without looping (subarray smoothing)?

In Matlab I need to accumulate overlapping diagonal blocks of a large matrix. The sample code is given below.
Since this piece of code needs to run several times, it consumes a lot of resources. The process is used in array signal processing for a so-called subarray smoothing or spatial smoothing. Is there any way to do this faster?
% some values for parameters
M = 1000; % size of array
m = 400; % size of subarray
n = M-m+1; % number of subarrays
R = randn(M)+1i*rand(M);
% main code
S = R(1:m,1:m);
for i = 2:n
S = S + R(i:m+i-1,i:m+i-1);
end
ATTEMPTS:
1) I tried the following alternative vectorized version, but unfortunately it became much slower!
[X,Y] = meshgrid(1:m);
inds1 = sub2ind([M,M],Y(:),X(:));
steps = (0:n-1)*(M+1);
inds = repmat(inds1,1,n) + repmat(steps,m^2,1);
RR = sum(R(inds),2);
S = reshape(RR,m,m);
2) I used Matlab coder to create a MEX file and it became much slower!
I've personally had to speed up some portions of my code lately. Not being an expert at all, I would recommend trying the following:
1) Vectorize:
Getting rid of the for-loop
S = R(1:m,1:m);
for i = 2:n
S = S + R(i:m+i-1,i:m+i-1);
end
and replacing it with an alternative based on cumsum should be the way to go here.
Note: I will try to work on this approach in a future edit.
2) Generating a MEX-file:
In some instances, you could simply fire up the Matlab Coder app (given that you have it in your current Matlab version).
This should generate a .mex file for you, that you can call as it was the function that you are trying to replace.
Regardless of your choice (1) or 2)), you should time your new implementation with tic; my_function(); toc; over a fair number of function calls, and compare it with your current implementation:
my_time = zeros(1,10000);
for count = 1:10000
tic;
my_function();
my_time(count) = toc;
end
mean(my_time)
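One possible reading of the cumsum idea in point 1) is sketched below. This is an interpretation added here, not something the answer spells out: keep a running sum along each diagonal of R, so that the sum over n diagonal steps becomes a difference of two running sums. The small sizes and the names C, S_ref, and err are assumptions for the sketch, and no timing claim is made.

```matlab
M = 8; m = 3; n = M - m + 1;           % small hypothetical sizes
R = randn(M) + 1i * rand(M);

% Reference: the loop from the question.
S_ref = R(1:m, 1:m);
for i = 2:n
    S_ref = S_ref + R(i:m+i-1, i:m+i-1);
end

% C(i+1, j+1) = R(i,j) + R(i-1,j-1) + ... is the running sum along each
% diagonal, stored with one extra row and column of zero padding.
C = complex(zeros(M + 1));
for i = 1:M
    C(i + 1, 2:end) = R(i, :) + C(i, 1:end-1);
end

% A window of n diagonal steps is the difference of two running sums.
S = C(n+1:n+m, n+1:n+m) - C(1:m, 1:m);

err = max(abs(S(:) - S_ref(:)));       % expected to be rounding noise only
fprintf('max abs difference: %.3g\n', err);
```

This replaces n additions of m-by-m blocks with M row updates, but differencing running sums can lose some precision for large M, so both accuracy and speed should be measured on the real sizes before relying on it.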

MATLAB bootstrap without for loop

Yesterday I implemented my first bootstrap in MATLAB (and yes, I know, for loops are evil):
%data is an mxn matrix where the data should be sampled per column, but
%there can be NaN elements
%from the array (a column of data) n values are sampled nReps times
function result = bootstrap_std(data, n, nReps,quantil)
result = zeros(1,size(data,2));
for i=1:size(data,2)
bootstrap_data = zeros(n,nReps);
values = find(~isnan(data(:,i)));
if isempty(values)
bootstrap_data(:,:) = NaN;
else
for k=1:nReps
bootstrap_data(:,k) = datasample(data(values,i),n);
end
end
stat = zeros(1,nReps);
for k=1:nReps
stat(k) = nanstd(bootstrap_data(:,k));
end
sort(stat);
result(i) = quantile(stat,quantil);
end
end
As one can see, this version works columnwise. The algorithm does what it should but is really slow when the data size increases. My question is now: is it possible to implement this logic without using for loops? My problem here is that I could not find a version of datasample that does the sampling columnwise. Or is there a better function to use?
I am happy for any hint or idea how I can speed up this implementation.
Thanks and best regards!
stephan
The bottlenecks in your implementation are:
The function spends a lot of time inside nanstd which is unnecessary since you exclude NaN values from your sample anyway.
There are a lot of functions that operate column-wise, but you spend time looping over the columns and calling them many times.
You make many calls to datasample which is a relatively slow function. It's much faster to create a random vector of indices using randi and use that instead.
Here's how I would write the function (actually I probably wouldn't put in this many comments, and I wouldn't use so many temp variables, but I'm doing it now so you can see what all the steps of the computation are).
function result = bootstrap_std_new(data, n, nRep, quantil)
result = zeros(1, size(data,2));
for i = 1:size(data,2)
isbad = isnan(data(:,i)); %// Vector of NaN values
if all(isbad)
result(i) = NaN;
else
data0 = data(~isbad, i); %// Temp copy of this column for indexing
index = randi(size(data0,1), n, nRep); %// Create the indexing vector
bootstrapdata = data0(index); %// Sample the data
stdevs = std(bootstrapdata); %// Stdev of sampled data
result(i) = quantile(stdevs, quantil); %// Find the correct quantile
end
end
end
Here are some timings
>> data = randn(100,10);
>> data(randi(1000, 50, 1)) = NaN;
>> tic, bootstrap_std(data, 50, 1000, 0.5); toc
Elapsed time is 1.359529 seconds.
>> tic, bootstrap_std_new(data, 50, 1000, 0.5); toc
Elapsed time is 0.038558 seconds.
So this gives you about a 35x speedup.
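The step that replaces the datasample loop is plain matrix indexing: indexing a column vector with an n-by-nRep index matrix returns an n-by-nRep matrix of sampled values. A tiny sketch with a fixed, made-up index matrix (standing in for what randi would return) shows the shapes involved:

```matlab
data0 = [10; 20; 30];          % hypothetical NaN-free column
index = [1 3; 2 2];            % stands in for randi(3, 2, 2): n = 2, nRep = 2

bootstrapdata = data0(index);  % same shape as index: one resample per column
stdevs = std(bootstrapdata);   % std works down each column, giving 1-by-nRep
disp(bootstrapdata);
disp(stdevs);
```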
Your main issue seems to be that you may have varying numbers/positions of NaN in each column, so can't work on the full matrix unless you're okay with also sampling NaNs. However, some of the inner loops could be simplified.
for k=1:nReps
bootstrap_data(:,k) = datasample(data(values,i),n);
end
Since you're sampling with replacement, you should be able to just do:
bootstrap_data = datasample(data(values,i), n*nReps);
bootstrap_data = reshape(bootstrap_data, [n nReps]);
Also nanstd can work on a full matrix so no need to loop:
stat = nanstd(bootstrap_data); % or nanstd(x,0,2) to change dimension
It would also be worth just looking over your code with profile to see where the bottlenecks are.