Performance/Low level meaning of " a(idx) = [] " vs "a = a(~idx)" - matlab

As the title says, I would like to know what MATLAB does differently between the two options. For the sake of argument, let's say that the array a and the index idx are large enough that memory becomes an issue, and define:
Case A: a(idx) = []
Case B: a = a(~idx)
My intuition says that Case A performs an in-place reassignment: the CPU copies the kept elements from their original positions to their new, compacted positions while tracking the current "head" of the array, and then trims the excess memory.
On the other hand, Case B would perform an indexed bulk copy into newly allocated memory.
So probably Case A is slower but less memory demanding than Case B. Am I assuming right? I don't know, immediately after writing this I feel like Case B needs to perform Case A first... Any ideas?
Thanks in advance

This is an interesting question, so I measured it:
I am using Windows (64 bits) version of Matlab R2016a.
CPU: Core i5-3550 at 3.3GHz.
Memory: 8GB DDR3 1333 (Dual channel).
len = 100000000; % Number of elements in the array (large enough to exceed the cache).
idx = zeros(len, 1, 'logical'); % Allocate idx as all false.
idx(1:10:end) = 1; % Set every 10th element of idx to true.
a = ones(len, 1); % Fill array a with ones.
disp('Measure: a(idx) = [];')
tic; a(idx) = []; toc
a = ones(len, 1);
disp(' ');disp('Measure: a = a(~idx);')
tic; a = a(~idx); toc
disp(' ');disp('Measure: not_idx = ~idx;')
tic; not_idx = ~idx; toc
a = ones(len, 1);
disp(' ');disp('Measure: a = a(not_idx);')
tic; a = a(not_idx); toc
Measure: a(idx) = [];
Elapsed time is 1.647617 seconds.
Measure: a = a(~idx);
Elapsed time is 0.732233 seconds.
Measure: not_idx = ~idx;
Elapsed time is 0.032649 seconds.
Measure: a = a(not_idx);
Elapsed time is 0.686351 seconds.
a = a(~idx) is about twice as fast as a(idx) = [].
The total time of a = a(~idx) equals the sum of not_idx = ~idx plus a = a(not_idx).
MATLAB is probably computing ~idx as a separate temporary, so Case B consumes more memory.
Memory consumption matters only when physical RAM is nearly exhausted.
I think the extra memory is negligible (the ~idx temporary is short-lived).
Neither solution is optimized.
I estimate that a fully optimized implementation (in C) would be about 10 times faster.
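As a quick sanity check, the two forms can be compared on a small array before timing them on a large one. This is a minimal sketch and not part of the benchmark above; the names aA and aB are made up for illustration.

```matlab
% Small check that both deletion forms give the same result.
a   = (1:10).';              % sample data
idx = false(10, 1);
idx(1:3:end) = true;         % mark elements 1, 4, 7, 10 for removal

aA = a;
aA(idx) = [];                % Case A: delete the marked elements
aB = a(~idx);                % Case B: keep the complement

assert(isequal(aA, aB));     % both leave 2 3 5 6 8 9
```

For timing on real data, wrapping each form in a function handle and passing it to timeit tends to give more stable numbers than a single tic/toc pair.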


Fastest approach to copying/indexing variable parts of 3D matrix

I have large sets of 3D data consisting of 1D signals acquired in 2D space.
The first step in processing this data is thresholding all signals to find the arrival of a high-amplitude pulse. This pulse is present in all signals and arrives at different times.
After thresholding, the 3D data set should be reordered so that every signal starts at the arrival of the pulse and what came before is thrown away (the end of the signals is of no importance; as of now I append zeros to the end of all signals so the data remains the same size).
Now, I have implemented this in the following manner:
First, I calculate the sample number of the first sample exceeding the threshold in all signals:
M = randn(1000,500,500); % example matrix of realistic size
threshold = 0.25*max(M(:,1,1)); % 25% of the maximum in the first signal as threshold
[~,index] = max(M>threshold); % indices of first sample exceeding threshold in all signals
Next, I want all signals to be shifted so that they all start with the pulse. For now, I have implemented it this way:
outM = zeros(size(M)); % preallocation for speed
for i = 1:size(M,2)
    for j = 1:size(M,3)
        outM(1:size(M,1)+1-index(1,i,j),i,j) = M(index(1,i,j):end,i,j);
    end
end
This works fine, and I know for-loops are not that slow anymore, but this easily takes a few seconds for the datasets on my machine. A single iteration of the for-loop takes about 0.05-0.1 sec, which seems slow to me for just copying a vector containing 500-2000 double values.
Therefore, I have looked into the best way to tackle this, but for now I haven't found anything better.
I have tried several things: 3D masks, linear indexing, and parallel loops (parfor).
For 3D masks, I checked whether any improvement is possible. I first construct a logical mask, and then compare the speed of logical-mask indexing/copying with the doubly nested for-loop.
%% set up for logical mask copying
AA = true(500,1); % only copy the first 500 values after the threshold value
Mask = false(size(M));
Jepla = zeros(500,size(M,2),size(M,3));
for i = 1:size(M,2)
    for j = 1:size(M,3)
        Mask(index(1,i,j):index(1,i,j)+499,i,j) = AA;
    end
end
%% speed comparison
tic; Jepla = M(Mask); toc
tic
for i = 1:size(M,2)
    for j = 1:size(M,3)
        outM(1:size(M,1)+1-index(1,i,j),i,j) = M(index(1,i,j):end,i,j);
    end
end
toc
The for-loop is faster every time, even though there is more that's copied.
Next, linear indexing.
%% setup for linear index copying
%put all indices in 1 long column
LongIndex = reshape(index,numel(index),1);
% convert to linear indices and store in new variable
linearIndices = sub2ind(size(M),LongIndex,repmat(1:size(M,2),1,size(M,3))',repelem(1:size(M,3),size(M,2))');
% extend linear indices with those of all values to copy
k = zeros(numel(M),1);
count = 1;
for i = 1:numel(LongIndex)
    values = linearIndices(i):size(M,1)*i;
    k(count:count+length(values)-1) = values;
    count = count + length(values);
end
k = k(1:count-1);
% get linear indices of locations in new matrix
l = zeros(length(k),1);
count = 1;
for i = 1:numel(LongIndex)
    values = repelem(LongIndex(i)-1,size(M,1)-LongIndex(i)+1);
    l(count:count+length(values)-1) = values;
    count = count + length(values);
end
l = k-l;
% create new matrix
outM = zeros(size(M));
%% speed comparison
tic; outM(l) = M(k); toc
tic
for i = 1:size(M,2)
    for j = 1:size(M,3)
        outM(1:size(M,1)+1-index(1,i,j),i,j) = M(index(1,i,j):end,i,j);
    end
end
toc
Again, the alternative approach, linear indexing, is (a lot) slower.
After this failed, I learned about parallelisation, and thought this would surely speed up my code.
By reading some of the documentation around parfor and trying it out a bit, I changed my code to the following:
outM = zeros(size(M));
inM = mat2cell(M,size(M,1),ones(size(M,2),1),size(M,3));
parfor i = 1:500
    for j = 1:500
        outM(:,i,j) = [inM{i}(index(1,i,j):end,1,j);zeros(index(1,i,j)-1,1)];
    end
end
I changed it so that "outM" and "inM" would both be sliced variables, as I read this is best. Still this is very slow, a lot slower than the original for loop.
So now the question, should I give up on trying to improve the speed of this operation? Or is there another way in which to do this? I have searched a lot, and for now do not see how to speed this up.
Sorry for the long question, but I wanted to show what I tried.
Thank you in advance!
Not sure if this is an option in your situation, but it looks like cell arrays are actually faster here:
outM2 = cell(size(M,2),size(M,3));
for i = 1:size(M,2)
    for j = 1:size(M,3)
        outM2{i,j} = M(index(1,i,j):end,i,j);
    end
end
A second idea, which also came out faster, is to batch all signals that have to be shifted by the same amount:
for i = unique(index(:)).'
    outM(1:size(M,1)+1-i,index==i) = M(i:end,index==i);
end
Whether this approach is actually faster depends entirely on your data.
And yes, integer-valued and logical indexing can be mixed.
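To check that batching by shift value reproduces the nested loop, both can be run on a tiny array. This is a minimal sketch under made-up data: the sizes and the contents of index are assumptions chosen for illustration, and the mask is flattened with (:) so that it addresses the collapsed second and third dimensions.

```matlab
% Compare the nested loop with batching by shift value on a small array.
M     = reshape(1:4*3*2, 4, 3, 2);          % 4 samples, 3-by-2 signals
index = reshape([1 2 2 3 1 4], 1, 3, 2);    % assumed first-sample indices

% Reference: nested loop, as in the question
ref = zeros(size(M));
for i = 1:size(M,2)
    for j = 1:size(M,3)
        n = size(M,1) + 1 - index(1,i,j);
        ref(1:n,i,j) = M(index(1,i,j):end,i,j);
    end
end

% Batched: one assignment per distinct shift value
out = zeros(size(M));
for s = unique(index(:)).'
    sel = (index(:) == s);                  % signals sharing this shift
    out(1:size(M,1)+1-s, sel) = M(s:end, sel);
end

assert(isequal(out, ref));
```

The batched loop runs once per distinct shift rather than once per signal, so it pays off mainly when many signals share the same arrival sample.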

Matlab: for loop on vector. Weird speed behaviour?

I was playing a bit with for loops in MATLAB. I know that it is often possible to avoid them, and that the code is then much faster. But for the case where I really want to go through all the elements of a vector V, I ran this little test:
V =1:n;
s1 = 0;
tic
for x=V
    s1 = s1+x;
end
toc
s2 = 0;
tic
for ind=1:numel(V)
    s2 = s2+V(ind);
end
toc
s1 and s2 are equal (as expected), but the first loop takes 24.63 sec and the second only 0.48 sec.
I was a bit surprised by these numbers. Is this a known effect? Does anyone have an explanation?
This is probably caused by memory allocation. The statement in case 1,
for x=V
has to create a copy of V. How do we know? If you modify V within the loop, x won't be affected: it will still run through the original V values.
On the other hand, the statement in case 2,
for ind=1:numel(V)
doesn't actually create the vector 1:numel(V). From help for,
Long loops are more memory efficient when the colon expression appears in the for statement since the index vector is never created.
The fact that no memory needs to be allocated probably accounts for the increase of speed, at least in part.
To test this, let's change for ind=1:numel(V) to for ind=[1:numel(V)]. This will force creation of the vector 1:numel(V). Then the running time should be similar to case 1, or indeed a little larger because we still need to index into V with V(ind).
These are the running times on my computer:
% Case 1
V =1:n;
s1 = 0;
tic
for x=V
    s1 = s1+x;
end
toc
% Case 2
s2 = 0;
tic
for ind=1:numel(V)
    s2 = s2+V(ind);
end
toc
% Case 3
s3 = 0;
tic
for ind=[1:numel(V)]
    s3 = s3+V(ind);
end
toc
Elapsed time is 0.610825 seconds.
Elapsed time is 0.182983 seconds.
Elapsed time is 0.831321 seconds.
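The claim that for x=V iterates over the original values of V can be checked directly. The sketch below only shows the observable behaviour; whether MATLAB makes an eager copy or relies on copy-on-write internally is an assumption it cannot confirm. The names seen and k are made up for illustration.

```matlab
% Modifying V inside the loop does not change the values x takes.
V    = 1:5;
seen = zeros(1, 5);
k    = 0;
for x = V
    V(:) = 0;            % overwrite V during iteration
    k = k + 1;
    seen(k) = x;         % x still runs through the original values
end
assert(isequal(seen, 1:5));
assert(all(V == 0));
```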

Time series aggregation efficiency

I commonly need to summarize a time series with irregular timing with a given aggregation function (i.e., sum, average, etc.). However, the current solution that I have seems inefficient and slow.
Take the aggregation function:
function aggArray = aggregate(array, groupIndex, collapseFn)
groups = unique(groupIndex, 'rows');
aggArray = nan(size(groups, 1), size(array, 2));
for iGr = 1:size(groups,1)
    grIdx = all(groupIndex == repmat(groups(iGr,:), [size(groupIndex,1), 1]), 2);
    for iSer = 1:size(array, 2)
        aggArray(iGr,iSer) = collapseFn(array(grIdx,iSer));
    end
end
end
Note that both array and groupIndex can be 2D. Every column in array is an independent series to be aggregated, but the columns of groupIndex should be taken together (as a row) to specify a period.
Then when we bring an irregular time series to it (note the second period is one base period longer), the timing results are poor:
a = rand(20006,10);
b = transpose([ones(1,5) 2*ones(1,6) sort(repmat((3:4001), [1 5]))]);
tic; aggregate(a, b, @sum); toc
Elapsed time is 1.370001 seconds.
Using the profiler, we can find out that the grIdx line takes about 1/4 of the execution time (.28 s) and the iSer loop takes about 3/4 (1.17 s) of the total (1.48 s).
Compare this with the period-indifferent case:
tic; cumsum(a); toc
Elapsed time is 0.000930 seconds.
Is there a more efficient way to aggregate this data?
Timing Results
Taking each response and putting it in a separate function, here are the timing results I get with timeit with Matlab 2015b on Windows 7 with an Intel i7:
original | 1.32451
felix1 | 0.35446
felix2 | 0.16432
divakar1 | 0.41905
divakar2 | 0.30509
divakar3 | 0.16738
matthewGunn1 | 0.02678
matthewGunn2 | 0.01977
Clarification on groupIndex
An example of a 2D groupIndex would be where both the year number and week number are specified for a set of daily data covering 1980-2015:
a2 = rand(36*52*5, 10);
b2 = [sort(repmat(1980:2015, [1 52*5]))' repmat(1:52, [1 36*5])'];
Thus a "year-week" period is uniquely identified by a row of groupIndex. This is effectively handled by calling unique(groupIndex, 'rows') and taking the third output, so feel free to disregard this portion of the question.
Method #1
You can create the mask corresponding to grIdx across all groups in one go with bsxfun(@eq,..). Now, for collapseFn as @sum, you can bring in matrix-multiplication and thus have a completely vectorized approach, like so -
M = squeeze(all(bsxfun(@eq,groupIndex,permute(groups,[3 2 1])),2))
aggArray = M.'*array
For collapseFn as @mean, you need to do a bit more work, as shown here -
M = squeeze(all(bsxfun(@eq,groupIndex,permute(groups,[3 2 1])),2))
aggArray = bsxfun(@rdivide,M,sum(M,1)).'*array
Method #2
In case you are working with a generic collapseFn, you can use the 2D mask M created with the previous method to index into the rows of array, thus changing the complexity from O(n^2) to O(n). Some quick tests suggest this gives an appreciable speedup over the original loopy code. Here's the implementation -
n = size(groups,1);
M = squeeze(all(bsxfun(@eq,groupIndex,permute(groups,[3 2 1])),2));
out = zeros(n,size(array,2));
for iGr = 1:n
    out(iGr,:) = collapseFn(array(M(:,iGr),:),1);
end
Please note that the 1 in collapseFn(array(M(:,iGr),:),1) denotes the dimension along which collapseFn would be applied, so that 1 is essential there.
Judging by its name, groupIndex holds integer values. This can be exploited to create M more efficiently: treat each row of groupIndex as an indexing tuple, convert it to a scalar, and so obtain a 1D version of groupIndex. This should be more efficient, as the data size is then O(n). This M can be fed to all the approaches listed in this post. So, we would have M like so -
dims = max(groupIndex,[],1);
agg_dims = cumprod([1 dims(end:-1:2)]);
[~,~,idx] = unique(groupIndex*agg_dims(end:-1:1).'); %//'
m = size(groupIndex,1);
M = false(m,max(idx));
M((idx-1)*m + [1:m]') = 1;
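To make the row-to-scalar trick concrete, here is a minimal sketch on toy data. The values of groupIndex and array are made up, key is a hypothetical name for the intermediate scalar, and the mask is cast to double before the multiplication as a precaution.

```matlab
% Build the one-hot group mask M from scalar keys, then sum per group.
groupIndex = [1 1; 1 1; 1 2; 2 1; 2 1];     % toy 2-column group labels
array      = [1 10; 2 20; 3 30; 4 40; 5 50];

dims     = max(groupIndex, [], 1);           % largest label per column
agg_dims = cumprod([1 dims(end:-1:2)]);      % mixed-radix weights
key      = groupIndex * agg_dims(end:-1:1).';% one scalar per row
[~, ~, idx] = unique(key);                   % group number of each row

m = size(groupIndex, 1);
M = false(m, max(idx));
M((idx - 1) * m + (1:m).') = true;           % one true entry per row

aggArray = double(M).' * array;              % group-wise sums
```

Each row of M has exactly one true entry, so M.'*array adds up the rows of array that belong to each group.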
Mex Function 1
HAMMER TIME: Mex function to crush it:
The base case test with the original code from the question took 1.334139 seconds on my machine. IMHO, the 2nd fastest answer, from @Divakar, is:
groups2 = unique(groupIndex);
aggArray2 = squeeze(all(bsxfun(@eq,groupIndex,permute(groups2,[3 2 1])),2)).'*array;
Elapsed time is 0.589330 seconds.
Then my MEX function:
[groups3, aggArray3] = mg_aggregate(array, groupIndex, @(x) sum(x, 1));
Elapsed time is 0.079725 seconds.
Testing that we get the same answer: norm(groups2-groups3) returns 0 and norm(aggArray2 - aggArray3) returns 2.3959e-15. Results also match the original code.
Code to generate the test conditions:
array = rand(20006,10);
groupIndex = transpose([ones(1,5) 2*ones(1,6) sort(repmat((3:4001), [1 5]))]);
For pure speed, go mex. If the thought of compiling C++ code and the added complexity is too much of a pain, go with Divakar's answer. Another disclaimer: I haven't subjected my function to robust testing.
Mex Approach 2
Somewhat surprising to me, this code appears even faster than the full Mex version in some cases (eg. in this test took about .05 seconds). It uses a mex function mg_getRowsWithKey to figure out the indices of groups. I think it may be because my array copying in the full mex function isn't as fast as it could be and/or overhead from calling 'feval'. It's basically the same algorithmic complexity as the other version.
[unique_groups, map] = mg_getRowsWithKey(groupIndex);
results = zeros(length(unique_groups), size(array,2));
for iGr = 1:length(unique_groups)
    array_subset = array(map{iGr},:);
    %// do your collapse function on array_subset. eg.
    results(iGr,:) = sum(array_subset, 1);
end
When you do array(groups(1)==groupIndex,:) to pull out array entries associated with the full group, you're searching through the ENTIRE length of groupIndex. If you have millions of row entries, this will totally suck. array(map{1},:) is far more efficient.
There's still unnecessary copying of memory and other overhead associated with calling 'feval' on the collapse function. If you implement the aggregator function efficiently in c++ in such a way to avoid copying of memory, probably another 2x speedup can be achieved.
A little late to the party, but a single loop using accumarray makes a huge difference:
function aggArray = aggregate_gnovice(array, groupIndex, collapseFn)
[groups, ~, index] = unique(groupIndex, 'rows');
numCols = size(array, 2);
aggArray = nan(size(groups, 1), numCols); % one row per unique group, also for 2D groupIndex
for col = 1:numCols
    aggArray(:, col) = accumarray(index, array(:, col), [], collapseFn);
end
end
Timing this using timeit in MATLAB R2016b for the sample data in the question gives the following:
original | 1.127141
gnovice | 0.002205
Over a 500x speedup!
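A small worked example may help show what unique(..., 'rows') and accumarray are doing here. This is a minimal sketch on made-up data: the [year week] labels and values are assumptions chosen so the result can be checked by hand.

```matlab
% Group-wise sum with unique(...,'rows') + accumarray on toy data.
array      = [1 10; 2 20; 3 30; 4 40; 5 50];
groupIndex = [2000 1; 2000 1; 2000 2; 2001 1; 2001 1];   % e.g. [year week]

[groups, ~, index] = unique(groupIndex, 'rows');  % index maps each row to its group
aggArray = nan(size(groups, 1), size(array, 2));
for col = 1:size(array, 2)
    aggArray(:, col) = accumarray(index, array(:, col), [], @sum);
end

disp(aggArray)   % rows correspond to (2000,1), (2000,2), (2001,1)
```

Here index is [1;1;2;3;3], so the first two rows are summed into group 1, the third row forms group 2 on its own, and the last two rows are summed into group 3.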
Doing away with the inner loop, i.e.
function aggArray = aggregate(array, groupIndex, collapseFn)
groups = unique(groupIndex, 'rows');
aggArray = nan(size(groups, 1), size(array, 2));
for iGr = 1:size(groups,1)
    grIdx = all(groupIndex == repmat(groups(iGr,:), [size(groupIndex,1), 1]), 2);
    aggArray(iGr,:) = collapseFn(array(grIdx,:));
end
end
and calling the collapse function with a dimension parameter
res=aggregate(a, b, @(x)sum(x,1));
already gives some speedup (3x on my machine). It also avoids the errors that sum or mean produce when they encounter a single row of data without a dimension parameter and then collapse across columns rather than within the group.
If you had just one group label vector, i.e. the same group labels for all columns of data, you could speed things up further:
function aggArray = aggregate(array, groupIndex, collapseFn)
ng = max(groupIndex); % assumes the labels are the integers 1..ng (ng was undefined in the original post)
aggArray = nan(ng, size(array, 2));
for iGr = 1:ng
    aggArray(iGr,:) = collapseFn(array(groupIndex==iGr,:));
end
end
The latter function gives identical results for your example, with a 6x speedup, but cannot handle different group labels per data column.
Assuming a 2D test case for the group index (provided here as well, with 10 different columns for groupIndex):
a = rand(20006,10);
B=[]; % make random length periods for each of the 10 signals
for i=1:size(a,2)
    n0 = randi(10); % assumed: the definition of n0 is missing from the original post
    b=transpose([ones(1,n0) 2*ones(1,11-n0) sort(repmat((3:4001), [1 5]))]);
    B=[B b];
end
tic; erg0=aggregate(a, B, @sum); toc % original method
tic; erg1=aggregate2(a, B, @(x)sum(x,1)); toc %just remove the inner loop
tic; erg2=aggregate3(a, B, @(x)sum(x,1)); toc %use function below
Elapsed time is 2.646297 seconds.
Elapsed time is 1.214365 seconds.
Elapsed time is 0.039678 seconds (!!!!).
function aggArray = aggregate3(array, groupIndex, collapseFn)
[groups,ix1,jx] = unique(groupIndex, 'rows','first');
[groups,ix2,jx] = unique(groupIndex, 'rows','last');
ng = size(groups, 1); % number of unique groups
aggArray = nan(ng, size(array, 2));
for iGr = 1:ng
    aggArray(iGr,:) = collapseFn(array(ix1(iGr):ix2(iGr),:));
end
end
I think this is as fast as it gets without using MEX. Thanks to the suggestion of Matthew Gunn!
Profiling shows that 'unique' is really cheap here and getting out just the first and last index of the repeating rows in groupIndex speeds things up considerably. I get 88x speedup with this iteration of the aggregation.
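The first/last-index idea relies on equal group rows being contiguous, which holds for the sorted sample data. A minimal sketch on made-up, already sorted labels (agg and g are hypothetical names):

```matlab
% Aggregate contiguous groups via first/last occurrence indices.
groupIndex = [1; 1; 2; 2; 2; 3];            % sorted labels (contiguous groups)
array      = (1:6).' * [1 10];              % rows [1 10; 2 20; ...; 6 60]

[groups, ix1] = unique(groupIndex, 'rows', 'first');  % first row of each group
[~,      ix2] = unique(groupIndex, 'rows', 'last');   % last row of each group

agg = nan(size(groups, 1), size(array, 2));
for g = 1:size(groups, 1)
    agg(g, :) = sum(array(ix1(g):ix2(g), :), 1);      % contiguous block only
end
```

If the rows of a group are interleaved with other groups, the range ix1(g):ix2(g) would include foreign rows, so unsorted data would need to be sorted first.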
Well I have a solution that is almost as quick as the mex but only using matlab.
The logic is the same as most of the above, creating a dummy 2D matrix, but instead of using @eq I initialize a logical array from the start.
Elapsed time for mine is 0.172975 seconds.
Elapsed time for Divakar 0.289122 seconds.
function aggArray = aggregate(array, group, collapseFn)
[m,~] = size(array);
n = max(group);
D = false(m,n);
row = (1:m)';
idx = m*(group(:) - 1) + row;
D(idx) = true;
aggArray = zeros(n,size(array,2)); % one row per group; assigned to the declared output
for ii = 1:n
    aggArray(ii,:) = collapseFn(array(D(:,ii),:),1);
end
end

Preallocation and Vectorization Speedup

I am trying to improve the speed of a script I am running.
Here is the code (my machine: 4 cores, Windows 7):
clear y;
start_time = tic;
% no y pre-allocation using zeros
for k=1:n
    y(k) = (1-(3/5)*x(k)+(3/20)*x(k)^2 -(x(k)^3/60)) / (1+(2/5)*x(k)-(1/20)*x(k)^2);
end
elapsed_time1 = toc(start_time);
fprintf('Computational time for serialized solution: %f\n',elapsed_time1);
The above code gives an elapsed time of 0.013654 s.
On the other hand, I tried pre-allocation by adding y = zeros(1,n); where the comment is in the code above, but the running time is similar, around 0.01 s. Any ideas why? I was told it would improve by a factor of 2. Am I missing something?
Lastly is there any type of vectorization in Matlab that will allow me to forget about the for loop in the above code?
In your code: try with n=10000 and you'll see more of a difference (a factor of almost 10 on my machine).
These things related with allocation are most noticeable when the size of your variable is large. In that case it's more difficult for Matlab to dynamically allocate memory for that variable.
To reduce the number of operations: do it vectorized, and reuse intermediate results to avoid powers:
y = (1 + x.*(-3/5 + x.*(3/20 - x/60))) ./ (1 + x.*(2/5 - x/20));
With n=100:
Parag's / venergiac's solution:
tic
for count = 1:100
    y=(1-(3/5)*x+(3/20)*x.^2 -(x.^3/60))./(1+(2/5)*x-(1/20)*x.^2);
end
toc
Elapsed time is 0.010769 seconds.
My solution:
tic
for count = 1:100
    y = (1 + x.*(-3/5 + x.*(3/20 - x/60))) ./ (1 + x.*(2/5 - x/20));
end
toc
Elapsed time is 0.006186 seconds.
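The nested form is just the same rational function with the powers factored out (Horner's scheme), which can be confirmed numerically. This is a minimal sketch; the test input x is an assumption, since the original x is not shown.

```matlab
% Check that the nested (Horner) form matches the expanded form.
x  = linspace(0, 1, 1000);   % assumed test input
y1 = (1 - (3/5)*x + (3/20)*x.^2 - (x.^3/60)) ./ (1 + (2/5)*x - (1/20)*x.^2);
y2 = (1 + x.*(-3/5 + x.*(3/20 - x/60))) ./ (1 + x.*(2/5 - x/20));

maxDiff = max(abs(y1 - y2));
assert(maxDiff < 1e-12);     % equal up to rounding error
```

The nested form replaces the two power operations per polynomial with multiplications, which is where the time saving comes from.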
You don't need a for loop. Replace the for loop with the following and MATLAB will handle it.
y=(1-(3/5)*x+(3/20)*x.^2 -(x.^3/60))./(1+(2/5)*x-(1/20)*x.^2);
This may give a computational advantage when vectors become larger in size. Smaller size is the reason why you cannot see the effect of pre-allocation. Read this page for additional tips on how to improve the performance.
Edit: I observed that at larger sizes, n>=10^6, I am getting a constant performance improvement when I try the following:
instead of using linspace. At n=10^7, I gain 0.05 seconds (0.03 vs 0.08) by NOT using linspace.
Try element-wise operations (.*, .^, ./):
clear y;
start_time = tic;
% no y pre-allocation using zeros
for k=1:n
    y(k) = (1-(3/5)*x(k)+(3/20)*x(k)^2 -(x(k)^3/60)) / (1+(2/5)*x(k)-(1/20)*x(k)^2);
end
elapsed_time1 = toc(start_time);
fprintf('Computational time for serialized solution: %f\n',elapsed_time1);
start_time = tic;
y = (1-(3/5)*x+(3/20)*x.^2 -(x.^3/60)) ./ (1+(2/5)*x-(1/20)*x.^2);
elapsed_time1 = toc(start_time);
fprintf('Computational time for product solution: %f\n',elapsed_time1);
My results:
Computational time for serialized solution: 2.578290
Computational time for product solution: 0.010060

Efficient method for finding elements in MATLAB matrix

I would like to know how can the bottleneck be treated in the given piece of code.
%% Points is an Nx3 matrix having the coordinates of N points where N ~ 10^6
Z = points(:,3);
listZ = (Z >= a & Z < b); % Bottleneck
np = sum(listZ); % For later usage
slice = points(listZ,:);
Currently for N ~ 10^6, np ~ 1000 and number of calls to this part of code = 1000, the bottleneck statement is taking around 10 seconds in total, which is a big chunk of time compared to the rest of my code.
Some more screenshots of sample code for only the indexing statement, as requested by @EitanT
If the equality on one side is not important, you can reformulate it as a one-sided comparison, which is an order of magnitude faster:
Z = rand(1e6,3);
a=0.5; b=0.6;
c=(a+b)/2; d=(b-a)/2; % midpoint and half-width of the interval (inferred; missing from the original post)
tic
for k=1:100
    listZ1 = (Z >= a & Z < b); % Bottleneck
end
toc
tic
for k=1:100
    listZ2 = (abs(Z-c)<d);
end
toc
isequal(listZ1, listZ2)
Elapsed time is 5.567460 seconds.
Elapsed time is 0.625646 seconds.
ans =
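The effect of dropping the equality can be seen on a handful of values. This is a minimal sketch with bounds chosen to be exactly representable in binary floating point, so the boundary cases are deterministic; the definitions of c and d as midpoint and half-width are an assumption, since the original post does not show them.

```matlab
% Two-sided test vs. the abs() reformulation, including the boundaries.
a = 0.5;  b = 0.75;              % exactly representable bounds
c = (a + b) / 2;                 % midpoint, 0.625 (assumed definition)
d = (b - a) / 2;                 % half-width, 0.125 (assumed definition)

Z = [0.25 0.5 0.625 0.75 1.0];
twoSided = (Z >= a & Z < b);     % includes Z == a
oneSided = (abs(Z - c) < d);     % excludes both boundaries

disp([twoSided; oneSided])
```

The two masks differ only at Z == a. For random double data that case is very unlikely, but with bounds such as 0.5 and 0.6 rounding in c and d can shift the boundary behaviour slightly.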
Assuming the worst case:
element-wise & is not short-circuited internally
the comparisons are single-threaded
You're doing 2*1e6*1e3 = 2e9 comparisons in ~10 seconds. That's ~200 million comparisons per second (~200 MFLOPS).
Considering you can do some 1.7 GFLOPS on a single core, this indeed seems rather low.
Are you running Windows 7? If so, have you checked your power settings? You are on a mobile processor, so I expect that by default, there will be some low-power consumption scheme in effect. This allows windows to scale down the processing speed, so...check that.
Other than that....I really have no clue.
Try doing something like this:
tic
for i = 1:1000
    x = (a >= 0.5);
    x = (x < 0.6); % note: this compares the logical mask x, not a, with 0.6
end
toc
I found it to be faster than:
tic
for i = 1:1000
    x = (a >= 0.5 & a < 0.6);
end
toc
by about 4 seconds:
Elapsed time is 0.985001 seconds. (first one)
Elapsed time is 4.888243 seconds. (second one)
I think the reason for the slowdown is the element-wise & operation.
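Note that the two-step version compares the 0/1 mask x with 0.6 in its second step, so it does not produce the same mask as the combined test, and its timing is not directly comparable. A minimal sketch on three made-up values shows the difference, along with one two-step variant that does match (the names ref and y are hypothetical):

```matlab
% The two-step version is not equivalent to the combined test.
a   = [0.40 0.55 0.70];
ref = (a >= 0.5 & a < 0.6);      % intended mask: [0 1 0]

x = (a >= 0.5);                  % [0 1 1]
x = (x < 0.6);                   % mask compared with 0.6 -> [1 0 0]

% A two-step variant that keeps the intended meaning:
y = (a >= 0.5);
y(y) = a(y) < 0.6;               % second test only where the first passed

assert(~isequal(x, ref));
assert(isequal(y, ref));
```

The variant applies the second comparison only to elements that passed the first, which can help when few elements do, though it has not been timed here.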