When does vectorization is a better or worse solution than a loop? [duplicate] - matlab

This question already has answers here:
'for' loop vs vectorization in MATLAB
(5 answers)
Closed 3 years ago.
In Matlab, I am trying to vectorise my code to improve the simulation time. However, the result I got was that I deteriorated the overall efficiency.
To understand the phenomenon I created 3 distinct functions that does the same thing but with different approach :
The main file :
clc,
clear,
n = 10000;
Value = cumsum(ones(1,n));
NbLoop = 10000;
time01 = zeros(1,NbLoop);
time02 = zeros(1,NbLoop);
time03 = zeros(1,NbLoop);
for test = 1 : NbLoop
tic
vector1 = function01(n,Value);
time01(test) = toc ;
tic
vector2 = function02(n,Value);
time02(test) = toc ;
tic
vector3 = function03(n,Value);
time03(test) = toc ;
end
figure(1)
hold on
plot( time01, 'b')
plot( time02, 'g')
plot( time03, 'r')
The function 01:
function vector = function01(n,Value)
vector = zeros( 2*n,1);
for k = 1:n
vector(2*k -1) = Value(k);
vector(2*k) = Value(k);
end
end
The function 02:
function vector = function02(n,Value)
vector = zeros( 2*n,1);
vector(1:2:2*n) = Value;
vector(2:2:2*n) = Value;
end
The function 03:
function vector = function03(n,Value)
MatrixTmp = transpose([Value(:), Value(:)]);
vector = MatrixTmp (:);
end
The blue plot correspond to the for - loop.
n = 100:
n = 10000:
When I run the code with n = 100, the more efficient solution is the first function with the for loop.
When n = 10000 The first function become the less efficient.
Do you have a way to know how and when to properly replace a for-loop by a vectorised counterpart?
What is the impact of index searching with array of tremendous dimensions ?
Does Matlab compute in a different manner an array of dimensions 3 or higher than a array of dimension 1 or 2?
Is there a clever way to replace a while loop that use the result of an iteration for the next iteration?

Using MATLAB Online I see something different:
n 10000 100
function01 5.6248e-05 2.2246e-06
function02 1.7748e-05 1.9491e-06
function03 2.7748e-05 1.2278e-06
function04 1.1056e-05 7.3390e-07 (my version, see below)
Thus, the loop version is always slowest. Method #2 is faster for very large matrices, Method #3 is faster for very small matrices.
The reason is that method #3 makes 2 copies of the data (transpose or a matrix incurs a copy), and that is bad if there's a lot of data. Method #2 uses indexing, which is expensive, but not as expensive as copying lots of data twice.
I would suggest this function instead (Method #4), which transposes only vectors (which is essentially free). It is a simple modification of your Method #3:
function vector = function04(n,Value)
vector = [Value(:).'; Value(:).'];
vector = vector(:);
end
Do you have a way to know how and when to properly replace a for-loop by a vectorised counterpart?
In general, vectorized code is always faster if there are no large intermediate matrices. For small data you can vectorize more aggressively, for large data sometimes loops are more efficient because of the reduced memory pressure. It depends on what is needed for vectorization.
What is the impact of index searching with array of tremendous dimensions?
This refers to operations such as d = data(data==0). Much like everything else, this is efficient for small data and less so for large data, because data==0 is an intermediate array of the same size as data.
Does Matlab compute in a different manner an array of dimensions 3 or higher than a array of dimension 1 or 2?
No, not in general. Functions such as sum are implemented in a dimensionality-independent waycitation needed.
Is there a clever way to replace a while loop that use the result of an iteration for the next iteration?
It depends very much on what the operations are. Functions such as cumsum can often be used to vectorize this type of code, but not always.
This is my timing code, I hope it shows how to properly use timeit:
%n = 10000;
n = 100;
Value = cumsum(ones(1,n));
vector1 = function01(n,Value);
vector2 = function02(n,Value);
vector3 = function03(n,Value);
vector4 = function04(n,Value);
assert(isequal(vector1,vector2))
assert(isequal(vector1,vector3))
assert(isequal(vector1,vector4))
timeit(#()function01(n,Value))
timeit(#()function02(n,Value))
timeit(#()function03(n,Value))
timeit(#()function04(n,Value))
function vector = function01(n,Value)
vector = zeros(2*n,1);
for k = 1:n
vector(2*k-1) = Value(k);
vector(2*k) = Value(k);
end
end
function vector = function02(n,Value)
vector = zeros(2*n,1);
vector(1:2:2*n) = Value;
vector(2:2:2*n) = Value;
end
function vector = function03(n,Value)
MatrixTmp = transpose([Value(:), Value(:)]);
vector = MatrixTmp(:);
end
function vector = function04(n,Value)
vector = [Value(:).'; Value(:).'];
vector = vector(:);
end

Related

Computing matrices for Model Predictive Control

In model predictive control, an optimization problem is solved at every time instant and it is very common to write down the matrices in a compact form. Without going into the details of the optimization problem, suppose I have matrices and . I need to compute the matrices and defined as
Note that N_p is called "prediction horizon" and it is not the order of matrix A. How can I compute these matrices in a fast and efficient way? In Matlab, I have done the following, but maybe there is a more efficient way to compute these matrices:
A_cal = zeros(length(A)*Np, length(A)); %calligraphic A matrix
B_cal = zeros(size(B,1)*Np, size(Bd,2)*Np); %calligraphic B matrix
temp = eye(size(A));
for j = 1:Np
A_cal(1+(j-1)*length(A):j*length(A),:)= temp;
if j > 1
%The current row is obtained as shift of the previous row, and only the block in the first column is computed
B_cal(1+(j-1)*size(B,1):j*size(B,1),:) = circshift(B_cal(1+(j-2)*size(B,1):(j-1)*size(B,1),:),size(B,2),2);
B_cal(1+(j-1)*size(B,1):j*size(B,1),1:size(B,2)) = temp_prev*B_cent;
end
temp_prev = temp; %this variable contains A^(j-1)
temp = temp * A_cent; %use temp variable to speed up the matrix power computation
end
I assume you already solved your problem, but here is the code from my exercises sheet, creating your matrices.
S_x being the first matrix.
function S_x = compute_Sx(A,N)
S_x = eye(size(A));
for i=1:N
S_x = [S_x;A^i];
end
end
function S_u = compute_Su(A,B,N)
S_u = zeros(size(A,1)*N,size(B,2)*N);
for i=1:N
S_u = S_u + kron(diag(ones(N-i+1,1),-i+1),A^(i-1)*B);
end
S_u = [zeros(size(A,1),size(B,2)*N);S_u];
end
Best Regards

Fastest approach to copying/indexing variable parts of 3D matrix

I have large sets of 3D data consisting of 1D signals acquired in 2D space.
The first step in processing this data is thresholding all signals to find the arrival of a high-amplitude pulse. This pulse is present in all signals and arrives at different times.
After thresholding, the 3D data set should be reordered so that every signal starts at the arrival of the pulse and what came before is thrown away (the end of the signals is of no importance, as of now i concatenate zeros to the end of all signals so the data remains the same size).
Now, I have implemented this in the following manner:
First, i start by calculating the sample number of the first sample exceeding the threshold in all signals
M = randn(1000,500,500); % example matrix of realistic size
threshold = 0.25*max(M(:,1,1)); % 25% of the maximum in the first signal as threshold
[~,index] = max(M>threshold); % indices of first sample exceeding threshold in all signals
Next, I want all signals to be shifted so that they all start with the pulse. For now, I have implemented it this way:
outM = zeros(size(M)); % preallocation for speed
for i = 1:size(M,2)
for j = 1:size(M,3)
outM(1:size(M,1)+1-index(1,i,j),i,j) = M(index(1,i,j):end,i,j);
end
end
This works fine, and i know for-loops are not that slow anymore, but this easily takes a few seconds for the datasets on my machine. A single iteration of the for-loop takes about 0.05-0.1 sec, which seems slow to me for just copying a vector containing 500-2000 double values.
Therefore, I have looked into the best way to tackle this, but for now I haven't found anything better.
I have tried several things: 3D masks, linear indexing, and parallel loops (parfor).
for 3D masks, I checked to see if any improvements are possible. Therefore i first contruct a logical mask, and then compare the speed of the logical mask indexing/copying to the double nested for loop.
%% set up for logical mask copying
AA = logical(ones(500,1)); % only copy the first 500 values after the threshold value
Mask = logical(zeros(size(M)));
Jepla = zeros(500,size(M,2),size(M,3));
for i = 1:size(M,2)
for j = 1:size(M,3)
Mask(index(1,i,j):index(1,i,j)+499,i,j) = AA;
end
end
%% speed comparison
tic
Jepla = M(Mask);
toc
tic
for i = 1:size(M,2)
for j = 1:size(M,3)
outM(1:size(M,1)+1-index(1,i,j),i,j) = M(index(1,i,j):end,i,j);
end
end
toc
The for-loop is faster every time, even though there is more that's copied.
Next, linear indexing.
%% setup for linear index copying
%put all indices in 1 long column
LongIndex = reshape(index,numel(index),1);
% convert to linear indices and store in new variable
linearIndices = sub2ind(size(M),LongIndex,repmat(1:size(M,2),1,size(M,3))',repelem(1:size(M,3),size(M,2))');
% extend linear indices with those of all values to copy
k = zeros(numel(M),1);
count = 1;
for i = 1:numel(LongIndex)
values = linearIndices(i):size(M,1)*i;
k(count:count+length(values)-1) = values;
count = count + length(values);
end
k = k(1:count-1);
% get linear indices of locations in new matrix
l = zeros(length(k),1);
count = 1;
for i = 1:numel(LongIndex)
values = repelem(LongIndex(i)-1,size(M,1)-LongIndex(i)+1);
l(count:count+length(values)-1) = values;
count = count + length(values);
end
l = k-l;
% create new matrix
outM = zeros(size(M));
%% speed comparison
tic
outM(l) = M(k);
toc
tic
for i = 1:size(M,2)
for j = 1:size(M,3)
outM(1:size(M,1)+1-index(1,i,j),i,j) = M(index(1,i,j):end,i,j);
end
end
toc
Again, the alternative approach, linear indexing, is (a lot) slower.
After this failed, I learned about parallelisation, and though this would for sure speed up my code.
By reading some of the documentation around parfor and trying it out a bit, I changed my code to the following:
gcp;
outM = zeros(size(M));
inM = mat2cell(M,size(M,1),ones(size(M,2),1),size(M,3));
tic
parfor i = 1:500
for j = 1:500
outM(:,i,j) = [inM{i}(index(1,i,j):end,1,j);zeros(index(1,i,j)-1,1)];
end
end
end
toc
I changed it so that "outM" and "inM" would both be sliced variables, as I read this is best. Still this is very slow, a lot slower than the original for loop.
So now the question, should I give up on trying to improve the speed of this operation? Or is there another way in which to do this? I have searched a lot, and for now do not see how to speed this up.
Sorry for the long question, but I wanted to show what I tried.
Thank you in advance!
Not sure if an option in your situation, but looks like cell arrays are actually faster here:
outM2 = cell(size(M,2),size(M,3));
tic;
for i = 1:size(M,2)
for j = 1:size(M,3)
outM2{i,j} = M(index(1,i,j):end,i,j);
end
end
toc
And a second idea which also came out faster, batch all data which have to be shifted by the same value:
tic;
for i = 1:unique(index).'
outM(1:size(M,1)+1-i,index==i) = M(i:end,index==i);
end
toc
It totally depends on your data if this approach is actually faster.
And yes integer valued and logical indexing can be mixed

Vectorize Evaluations of Meshgrid Points in Matlab

I need the "for" loop in the following representative section of code to run as efficiently as possible. The mean function in the code is acting as a representative placeholder for my own function.
x = linspace(-1,1,15);
y = linspace(2,4,15);
[xgrid, ygrid] = meshgrid(x,y);
mc = rand(100000,1);
z=zeros(size(xgrid));
for i=1:length(xgrid)
for j=1:length(ygrid)
z(i,j) = mean(xgrid(i,j) + ygrid(i,j) + xgrid(i,j)*ygrid(i,j)*mc);
end
end
I have vectorized the code and improved its speed by about 2.5 times by building a matrix in which mc is replicated for each grid point. My implementation results in a very large matrix (3 x 22500000) filled with repeated data. I've mitigated the memory penalty of this approach by converting the matrix to single precision, but it seems like there should be a more efficient way to do what I want that avoids replicating so much data.
You could use bsxfun with few reshapes -
A = bsxfun(#times,y,x.'); %//'
B = bsxfun(#plus,y,x.'); %//'
C = mean(bsxfun(#plus,bsxfun(#times,mc,reshape(A,1,[])) , reshape(B,1,[])),1);
z_out = reshape(C,numel(x),[]).';

Fastest way to add multiple sparse matrices in a loop in MATLAB

I have a code that repeatedly calculates a sparse matrix in a loop (it performs this calculation 13472 times to be precise). Each of these sparse matrices is unique.
After each execution, it adds the newly calculated sparse matrix to what was originally a sparse zero matrix.
When all 13742 matrices have been added, the code exits the loop and the program terminates.
The code bottleneck occurs in adding the sparse matrices. I have made a dummy version of the code that exhibits the same behavior as my real code. It consists of a MATLAB function and a script given below.
(1) Function that generates the sparse matrix:
function out = test_evaluate_stiffness(n)
ind = randi([1 n*n],300,1);
val = rand(300,1);
[I,J] = ind2sub([n,n],ind);
out = sparse(I,J,val,n,n);
end
(2) Main script (program)
% Calculate the stiffness matrix
n=1000;
K=sparse([],[],[],n,n,n^2);
tic
for i=1:13472
temp=rand(1)*test_evaluate_stiffness(n);
K=K+temp;
end
fprintf('Stiffness Calculation Complete\nTime taken = %f s\n',toc)
I'm not very familiar with sparse matrix operations so I may be missing a critical point here that may allow my code to be sped up considerably.
Am I handling the updating of my stiffness matrix in a reasonable way in my code? Is there another way that I should be using sparse that will result in a faster solution?
A profiler report is also provided below:
If you only need the sum of those matrices, instead of building all of them individually and then summing them, simply concatenate the vectors I,J and vals and call sparse only once. If there are duplicate rows [i,j] in [I,J] the corresponding values S(i,j) will be summed automatically, so the code is absolutely equivalent. As calling sparse involves an internal call to a sorting algorithm, you save 13742-1 intermediate sorts and can get away with only one.
This involves changing the signature of test_evaluate_stiffness to output [I,J,val]:
function [I,J,val] = test_evaluate_stiffness(n)
and removing the line out = sparse(I,J,val,n,n);.
You will then change your other function to:
n = 1000;
[I,J,V] = deal([]);
tic;
for i = 1:13472
[I_i, J_i, V_i] = test_evaluate_stiffness(n);
nE = numel(I_i);
I(end+(1:nE)) = I_i;
J(end+(1:nE)) = J_i;
V(end+(1:nE)) = rand(1)*V_i;
end
K = sparse(I,J,V,n,n);
fprintf('Stiffness Calculation Complete\nTime taken = %f s\n',toc);
If you know the lengths of the output of test_evaluate_stiffness ahead of time, you can possibly save some time by preallocating the arrays I,J and V with appropriately-sized zeros matrices and set them using something like:
I((i-1)*nE + (1:nE)) = ...
J((i-1)*nE + (1:nE)) = ...
V((i-1)*nE + (1:nE)) = ...
The biggest remaining computation, taking 11s, is the sparse operation
on the final I,J,V vectors so I think we've taken it down to the bare
bones.
Nearly... but one final trick: if you can create the vectors so that J is sorted ascending then you will greatly improve the speed of the sparse call, about a factor 4 in my experience.
(If it's easier to have I sorted, then create the transpose matrix sparse(J,I,V) and un-transpose it afterwards.)

Apply function to rolling window

Say I have a long list A of values (say of length 1000) for which I want to compute the std in pairs of 100, i.e. I want to compute std(A(1:100)), std(A(2:101)), std(A(3:102)), ..., std(A(901:1000)).
In Excel/VBA one can easily accomplish this by writing e.g. =STDEV(A1:A100) in one cell and then filling down in one go. Now my question is, how could one accomplish this efficiently in Matlab without having to use any expensive for-loops.
edit: Is it also possible to do this for a list of time series, e.g. when A has dimensions 1000 x 4 (i.e. 4 time series of length 1000)? The output matrix should then have dimensions 901 x 4.
Note: For the fastest solution see Luis Mendo's answer
So firstly using a for loop for this (especially if those are your actual dimensions) really isn't going to be expensive. Unless you're using a very old version of Matlab, the JIT compiler (together with pre-allocation of course) makes for loops inexpensive.
Secondly - have you tried for loops yet? Because you should really try out the naive implementation first before you start optimizing prematurely.
Thirdly - arrayfun can make this a one liner but it is basically just a for loop with extra overhead and very likely to be slower than a for loop if speed really is your concern.
Finally some code:
n = 1000;
A = rand(n,1);
l = 100;
for loop (hardly bulky, likely to be efficient):
S = zeros(n-l+1,1); %//Pre-allocation of memory like this is essential for efficiency!
for t = 1:(n-l+1)
S(t) = std(A(t:(t+l-1)));
end
A vectorized (memory in-efficient!) solution:
[X,Y] = meshgrid(1:l)
S = std(A(X+Y-1))
A probably better vectorized solution (and a one-liner) but still memory in-efficient:
S = std(A(bsxfun(#plus, 0:l-1, (1:l)')))
Note that with all these methods you can replace std with any function so long as it is applies itself to the columns of the matrix (which is the standard in Matlab)
Going 2D:
To go 2D we need to go 3D
n = 1000;
k = 4;
A = rand(n,k);
l = 100;
ind = bsxfun(#plus, permute(o:n:(k-1)*n, [3,1,2]), bsxfun(#plus, 0:l-1, (1:l)')); %'
S = squeeze(std(A(ind)));
M = squeeze(mean(A(ind)));
%// etc...
OR
[X,Y,Z] = meshgrid(1:l, 1:l, o:n:(k-1)*n);
ind = X+Y+Z-1;
S = squeeze(std(A(ind)))
M = squeeze(mean(A(ind)))
%// etc...
OR
ind = bsxfun(#plus, 0:l-1, (1:l)'); %'
for t = 1:k
S = std(A(ind));
M = mean(A(ind));
%// etc...
end
OR (taken from Luis Mendo's answer - note in his answer he shows a faster alternative to this simple loop)
S = zeros(n-l+1,k);
M = zeros(n-l+1,k);
for t = 1:(n-l+1)
S(t,:) = std(A(k:(k+l-1),:));
M(t,:) = mean(A(k:(k+l-1),:));
%// etc...
end
What you're doing is basically a filter operation.
If you have access to the image processing toolbox,
stdfilt(A,ones(101,1)) %# assumes that data series are in columns
will do the trick (no matter the dimensionality of A). Note that if you also have access to the parallel computing toolbox, you can let filter operations like these run on a GPU, although your problem might be too small to generate noticeable speedups.
To minimize number of operations, you can exploit the fact that the standard deviation can be computed as a difference involving second and first moments,
and moments over a rolling window are obtained efficiently with a cumulative sum (using cumsum):
A = randn(1000,4); %// random data
N = 100; %// window size
c = size(A,2);
A1 = [zeros(1,c); cumsum(A)];
A2 = [zeros(1,c); cumsum(A.^2)];
S = sqrt( (A2(1+N:end,:)-A2(1:end-N,:) ...
- (A1(1+N:end,:)-A1(1:end-N,:)).^2/N) / (N-1) ); %// result
Benchmarking
Here's a comparison against a loop based solution, using timeit. The loop approach is as in Dan's solution but adapted to the 2D case, exploting the fact that std works along each column in a vectorized manner.
%// File loop_approach.m
function S = loop_approach(A,N);
[n, p] = size(A);
S = zeros(n-N+1,p);
for k = 1:(n-N+1)
S(k,:) = std(A(k:(k+N-1),:));
end
%// File bsxfun_approach.m
function S = bsxfun_approach(A,N);
[n, p] = size(A);
ind = bsxfun(#plus, permute(0:n:(p-1)*n, [3,1,2]), bsxfun(#plus, 0:n-N, (1:N).')); %'
S = squeeze(std(A(ind)));
%// File cumsum_approach.m
function S = cumsum_approach(A,N);
c = size(A,2);
A1 = [zeros(1,c); cumsum(A)];
A2 = [zeros(1,c); cumsum(A.^2)];
S = sqrt( (A2(1+N:end,:)-A2(1:end-N,:) ...
- (A1(1+N:end,:)-A1(1:end-N,:)).^2/N) / (N-1) );
%// Benchmarking code
clear all
A = randn(1000,4); %// Or A = randn(1000,1);
N = 100;
t_loop = timeit(#() loop_approach(A,N));
t_bsxfun = timeit(#() bsxfun_approach(A,N));
t_cumsum = timeit(#() cumsum_approach(A,N));
disp(' ')
disp(['loop approach: ' num2str(t_loop)])
disp(['bsxfun approach: ' num2str(t_bsxfun)])
disp(['cumsum approach: ' num2str(t_cumsum)])
disp(' ')
disp(['bsxfun/loop gain factor: ' num2str(t_loop/t_bsxfun)])
disp(['cumsum/loop gain factor: ' num2str(t_loop/t_cumsum)])
Results
I'm using Matlab R2014b, Windows 7 64 bits, dual core processor, 4 GB RAM:
4-column case:
loop approach: 0.092035
bsxfun approach: 0.023535
cumsum approach: 0.0002338
bsxfun/loop gain factor: 3.9106
cumsum/loop gain factor: 393.6526
Single-column case:
loop approach: 0.085618
bsxfun approach: 0.0040495
cumsum approach: 8.3642e-05
bsxfun/loop gain factor: 21.1431
cumsum/loop gain factor: 1023.6236
So the cumsum-based approach seems to be the fastest: about 400 times faster than the loop in the 4-column case, and 1000 times faster in the single-column case.
Several functions can do the job efficiently in Matlab.
On one side, you can use functions such as colfilt or nlfilter, which performs computations on sliding blocks. colfilt is way more efficient than nlfilter, but can be used only if the order of the elements inside a block does not matter. Here is how to use it on your data:
S = colfilt(A, [100,1], 'sliding', #std);
or
S = nlfilter(A, [100,1], #std);
On your example, you can clearly see the difference of performance. But there is a trick : both functions pad the input array so that the output vector has the same size as the input array. To get only the relevant part of the output vector, you need to skip the first floor((100-1)/2) = 49 first elements, and take 1000-100+1 values.
S(50:end-50)
But there is also another solution, close to colfilt, more efficient. colfilt calls col2im to reshape the input vector into a matrix on which it applies the given function on each distinct column. This transforms your input vector of size [1000,1] into a matrix of size [100,901]. But colfilt pads the input array with 0 or 1, and you don't need it. So you can run colfilt without the padding step, then apply std on each column and this is easy because std applied on a matrix returns a row vector of the stds of the columns. Finally, transpose it to get a column vector if you want. In brief and in one line:
S = std(im2col(X,[100 1],'sliding')).';
Remark: if you want to apply a more complex function, see the code of colfilt, line 144 and 147 (for v2013b).
If your concern is speed of the for loop, you can greatly reduce the number of loop iteration by folding your vector into an array (using reshape) with the columns having the number of element you want to apply your function on.
This will let Matlab and the JIT perform the optimization (and in most case they do that way better than us) by calculating your function on each column of your array.
You then reshape an offseted version of your array and do the same. You will still need a loop but the number of iteration will only be l (so 100 in your example case), instead of n-l+1=901 in a classic for loop (one window at a time).
When you're done, you reshape the array of result in a vector, then you still need to calculate manually the last window, but overall it is still much faster.
Taking the same input notation than Dan:
n = 1000;
A = rand(n,1);
l = 100;
It will take this shape:
width = (n/l)-1 ; %// width of each line in the temporary result array
tmp = zeros( l , width ) ; %// preallocation never hurts
for k = 1:l
tmp(k,:) = std( reshape( A(k:end-l+k-1) , l , [] ) ) ; %// calculate your stat on the array (reshaped vector)
end
S2 = [tmp(:) ; std( A(end-l+1:end) ) ] ; %// "unfold" your results then add the last window calculation
If I tic ... toc the complete loop version and the folded one, I obtain this averaged results:
Elapsed time is 0.057190 seconds. %// windows by window FOR loop
Elapsed time is 0.016345 seconds. %// "Folded" FOR loop
I know tic/toc is not the way to go for perfect timing but I don't have the timeit function on my matlab version. Besides, the difference is significant enough to show that there is an improvement (albeit not precisely quantifiable by this method). I removed the first run of course and I checked that the results are consistent with different matrix sizes.
Now regarding your "one liner" request, I suggest your wrap this code into a function like so:
function out = foldfunction( func , vec , nPts )
n = length( vec ) ;
width = (n/nPts)-1 ;
tmp = zeros( nPts , width ) ;
for k = 1:nPts
tmp(k,:) = func( reshape( vec(k:end-nPts+k-1) , nPts , [] ) ) ;
end
out = [tmp(:) ; func( vec(end-nPts+1:end) ) ] ;
Which in your main code allows you to call it in one line:
S = foldfunction( #std , A , l ) ;
The other great benefit of this format, is that you can use the very same sub function for other statistical function. For example, if you want the "mean" of your windows, you call the same just changing the func argument:
S = foldfunction( #mean , A , l ) ;
Only restriction, as it is it only works for vector as input, but with a bit of rework it could be made to take arrays as input too.