Is there a way to speed up concatenation in MATLAB?

I want to concatenate along the third dimension
z = cat(3,A,B,C);
many, many times. If I were doing this along the second dimension, then
z = [A,B,C];
would be faster than
z = cat(2,A,B,C);
Can a similar thing be done along the third dimension or is there any other way to speed this up?

There are some indexing options to get a slightly better performance than cat(3,...).
Both solutions use U(30,30,3)=0; instead of zeros(30,30,3) to preallocate, but this is unsafe, as it will result in a subscripted assignment dimension mismatch when U already exists as a variable of a larger size.
The first option is to assign the different slices individually.
%fast but unsafe preallocation
U(30,30,3)=0;
%robust alternative:
%U=zeros(30,30,3)
U(:,:,3)=C;
U(:,:,1)=A;
U(:,:,2)=B;
The second option is to use linear indexing. For z1 = cat(3,A,B,C); and z2 = [A,B,C]; it is true that z1(:)==z2(:), since both store the elements of A, then B, then C in column-major order.
%fast but unsafe preallocation
U(30,30,3)=0;
%robust alternative:
%U=zeros(30,30,3)
U(:)=[A,B,C];
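A quick self-contained check of that layout equivalence:
A=randn(30); B=randn(30); C=randn(30);
U=zeros(30,30,3);
U(:)=[A,B,C];
isequal(U,cat(3,A,B,C)) %returns true: both store A(:), then B(:), then C(:)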
I benchmarked the solutions, comparing them to cat(3,A,B,C) and [A,B,C]. The linear indexing solution is only slightly slower than [A,B,C].
0.392289 s for 2D CAT
0.476525 s for Assign slices
0.588346 s for cat(3...)
0.392703 s for linear indexing
Code for benchmarking:
N=30;
A=randn(N,N);
B=randn(N,N);
C=randn(N,N);
T=containers.Map;
cycles=10^5;
tic;
for i=1:cycles
W=[A,B,C];
X=W+1;
end
T('2D CAT')=toc;
tic;
for i=1:cycles
W=cat(3,A,B,C);
X=W+1;
end
T('cat(3...)')=toc;
U=zeros(N,N,3);
tic;
for i=1:cycles
U(N,N,3)=0;
U(:,:,3)=C;
U(:,:,1)=A;
U(:,:,2)=B;
V=U+1;
end
T('Assign slices')=toc;
tic;
for i=1:cycles
U(N,N,3)=0;
U(:)=[A,B,C];
V=U+1;
end
T('linear indexing')=toc;
for X=T.keys
fprintf('%f s for %s\n',T(X{1}),X{1})
end
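As an aside (not part of the original benchmark), timeit usually gives more stable numbers than a hand-rolled tic/toc loop; with A, B and C from above:
timeit(@() cat(3,A,B,C)) %median time of a single call
timeit(@() [A,B,C])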

Related

dynamically fill vector without assigning empty matrix

Oftentimes I need to dynamically fill a vector in MATLAB. However, this is slightly annoying since you have to define an empty variable first, e.g.:
[a,b,c]=deal([]);
for ind=1:10
if rand>.5 %some random condition to emphasize the dynamical fill of vector
a=[a, randi(5)];
end
end
a %display result
Is there a better way to implement this 'push' function, so that you do not have to define an empty vector beforehand? People tell me this is nonsensical in MATLAB; if you think this is the case, please explain why.
related: Push a variable in a vector in Matlab, is-there-an-elegant-way-to-create-dynamic-array-in-matlab
In MATLAB, pre-allocation is the way to go. From the docs:
for and while loops that incrementally increase the size of a data structure each time through the loop can adversely affect performance and memory use.
As pointed out in the comments by m7913d, there is a question on MathWorks' Answers section which addresses this same point; read it here.
I would suggest "over-allocating" memory, then reducing the size of the array after your loop.
numloops = 10;
a = nan(numloops, 1);
for ind = 1:numloops
if rand > 0.5
a(ind) = 1; % assign some value to the current loop index
end
end
a = a(~isnan(a)); % Get rid of values which weren't used (and remain NaN)
No, this doesn't decrease the amount you have to write before your loop; it's even worse than having to write a = []! However, you're better off spending a few extra keystrokes and minutes writing well-structured code than making that saving and ending up with worse code.
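If you truly cannot bound the size in advance, one common compromise (a sketch of the general pattern, not something from the docs) is geometric growth: double the capacity whenever it runs out, then trim once at the end. This makes the total copying cost amortized linear instead of quadratic:
cap = 8; cnt = 0; % initial capacity and element count
a = nan(1, cap);
for ind = 1:1000
    if rand > 0.5
        cnt = cnt + 1;
        if cnt > cap % out of room: double the capacity
            cap = 2*cap;
            a(cap) = NaN; % extending by assignment zero-fills the gap
        end
        a(cnt) = randi(5);
    end
end
a = a(1:cnt); % trim the unused capacity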
It is (as far as I know) not possible in MATLAB to omit the initialisation of your variable before using it on the right-hand side of an expression. Moreover, it is not desirable to omit it, as preallocating an array is almost always the right way to go.
As mentioned in this post, it is desirable to preallocate a matrix even if the exact number of elements is not known. To demonstrate this, here is a small benchmark:
Ns = [1 10 100 1000 10000 100000];
timeEmpty = zeros(size(Ns));
timePreallocate = zeros(size(Ns));
for i=1:length(Ns)
N = Ns(i);
timeEmpty(i) = timeit(@() testEmpty(N));
timePreallocate(i) = timeit(@() testPreallocate(N));
end
figure
semilogx(Ns, timeEmpty ./ timePreallocate);
xlabel('N')
ylabel('time_{empty}/time_{preallocate}');
% do not preallocate memory
function a = testEmpty (N)
a = [];
for ind=1:N
if rand>.5 %some random condition to emphasize the dynamical fill of vector
a=[a, randi(5)];
end
end
end
% preallocate memory with the largest possible return size
function a = testPreallocate (N)
last = 0;
a = zeros(N, 1);
for ind=1:N
if rand>.5 %some random condition to emphasize the dynamical fill of vector
last = last + 1;
a(last) = randi(5);
end
end
a = a(1:last);
end
The resulting figure shows how much slower the method without preallocation is than preallocating a matrix based on the largest possible return size. Note that preallocation is especially important for large matrices, since the total cost of growing an array element by element scales quadratically with its length.

performance difference between subscript indexing and linear indexing

I have a 2D matrix in MATLAB and I use two different ways to access its elements. One is based on subscript indexing and the other is based on linear indexing. I test both methods with the following code:
N = 512; it = 400; im = zeros(N);
%// linear indexing
[ind_x,ind_y] = ndgrid(1:2:N,1:2:N);
index = sub2ind(size(im),ind_x,ind_y);
tic
for i=1:it
im(index) = im(index) + 1;
end
toc %//cost 0.45 seconds on my machine (MATLAB2015b, Thinkpad T410)
%// subscript indexing
x = 1:2:N;
y = 1:2:N;
tic
for i=1:it
im(x,y) = im(x,y) +1;
end
toc %// cost 0.12 seconds on my machine(MATLAB2015b, Thinkpad T410)
%// someone pointed out that double vs. uint32 might be an issue, so we turn both into uint32
%//uint32 for linear indexing
index = uint32(index);
tic
for i=1:it
im(index) = im(index) +1;
end
toc%// cost 0.25 seconds on my machine(MATLAB2015b, Thinkpad T410)
%//uint32 for the subscript indexing
x = uint32(1:2:N);
y = uint32(1:2:N);
tic
for i=1:it
im(x,y) = im(x,y) +1;
end
toc%// cost 0.11 seconds on my machine(MATLAB2015b, Thinkpad T410)
%% /*********************comparison with others*****************/
%//third way of indexing, loops
tic
for i=1:it
for j=1:2:N
for k=1:2:N
im(j,k) = im(j,k)+1;
end
end
end
toc%// cost 0.74 seconds on my machine(MATLAB2015b, Thinkpad T410)
It seems that directly using subscript indexing is faster than the linear indexing obtained from sub2ind. Does anyone know why? I thought they were almost the same.
The intuition
As Daniel mentioned in his answer, the linear index takes up more space in RAM while the subscripts are much smaller.
For subscripted indexing, MATLAB will internally not create the linear index, but use a (double) compiled loop to cycle through all elements.
The linearly indexed version, on the other hand, has to loop through all the linear indices passed in from outside, which requires more reads from memory and therefore takes longer.
Claims
Linear indexing is faster
...as long as the total number of indices is the same
Timings
From the timings we see a direct confirmation for the first claim and we can infer the second with some additional testing (below).
LOOPED
subs assignment: 0.2878s
linear assignment: 0.0812s
VECTORIZED
subs assignment: 0.0302s
linear assignment: 0.0862s
First claim
We can test it with loops. The number of subsref operations is the same, but the linear index points directly to the element of interest while subscripts, internally, need to be converted.
The functions of interest:
function B = subscriptedIndexing(A,row,col)
n = numel(row);
B = zeros(n);
for r = 1:n
for c = 1:n
B(r,c) = A(row(r),col(c));
end
end
end
function B = linearIndexing(A,index)
B = zeros(size(index));
for ii = 1:numel(index)
B(ii) = A(index(ii));
end
end
Second claim
This claim is an inference from the observed difference in speed when using the vectorized approach.
First, the vectorized approach (as opposed to the looped) speeds up the subscripted assignment while linear indexing is slightly slower (probably not statistically significant).
Second, the only difference in the two indexing methods comes from the size of the indices/subscripts. We want to isolate this as the only possible cause of the difference in the timings. One other major player could be JIT optimization.
The testing functions:
function B = subscriptedIndexingVect(A,row,col)
n = numel(row);
B = zeros(n);
B = A(row,col);
end
function B = linearIndexingVect(A,index)
B = zeros(size(index));
B = A(index);
end
NOTE: I keep the superfluous preallocation of B, to keep the vectorized and looped approaches comparable. In other words, differences in timings should only come from indexing and the internal implementation of the loops.
All tests are run with:
function testFun(N)
A = magic(N);
row = 1:2:N;
col = 1:2:N;
[ind_x,ind_y] = ndgrid(row,col);
index = sub2ind(size(A),ind_x,ind_y);
% isequal(linearIndexing(A,index), subscriptedIndexing(A,row,col))
% isequal(linearIndexingVect(A,index), subscriptedIndexingVect(A,row,col))
fprintf('<strong>LOOPED</strong>\n')
fprintf('  subs assignment:  %.4fs\n',  timeit(@()subscriptedIndexing(A,row,col)))
fprintf('  linear assignment: %.4fs\n\n',timeit(@()linearIndexing(A,index)))
fprintf('<strong>VECTORIZED</strong>\n')
fprintf('  subs assignment:  %.4fs\n',  timeit(@()subscriptedIndexingVect(A,row,col)))
fprintf('  linear assignment: %.4fs\n',  timeit(@()linearIndexingVect(A,index)))
end
Turning JIT on/off has NO impact:
feature accel off
testFun(5e3)
...
VECTORIZED
subs assignment: 0.0303s
linear assignment: 0.0873s
feature accel on
testFun(5e3)
...
VECTORIZED
subs assignment: 0.0303s
linear assignment: 0.0871s
This rules out JIT optimization as the source of the subscripted assignment's superior speed, which leaves us with the only plausible cause: the number of RAM accesses. It is true that the final matrix has the same number of elements; however, the linear assignment has to retrieve all elements of the index in order to fetch the numbers.
SETUP
Tested on Win7 64 with MATLAB R2015b. Prior versions of MATLAB will give different results due to recent changes in MATLAB's execution engine.
In fact, turning JIT off in Matlab R2014a affects timings, but only for the loops (expected result):
feature accel off
testFun(5e3)
LOOPED
subs assignment: 7.8915s
linear assignment: 6.4418s
VECTORIZED
subs assignment: 0.0295s
linear assignment: 0.0878s
This again confirms that the difference in timings between linear and subscripted assignment should come from the number of RAM accesses, since JIT does not play a role in the vectorized approach.
It does not really surprise me that the subscript indexing is much faster here. If you take a look at your input data, the index is much smaller in this case. For the subscript indexing case you have 512 elements while for the linear indexing case you have 65536 elements.
When you apply your example to a vector instead, you will notice that there is no difference between the two methods.
Here is the slightly modified code I used to evaluate different matrix sizes:
it = 400; im = zeros(512*512,1);
x = 1:2:size(im,1);
y = 1:2:size(im,2);
%// linear indexing
[ind_x,ind_y] = ndgrid(x,y);
index = sub2ind(size(im),ind_x,ind_y);
tic
for i=1:it
im(index) = im(index) + 1;
end
toc
%// subscript indexing
tic
for i=1:it
im(x,y) = im(x,y) +1;
end
toc
A very good question. Up front, I don't know the correct answer; however, you can analyze the behavior. Save the first toc into t1 and the second one into t2, then calculate t1/t2. You will notice that changing the number of iterations or the size of your matrix (almost) does not change the factor.
I propose:
The number of iterations only improves the quality of the tic/toc measurement. (obvious?)
The size of the matrix has no influence, i.e. there must be a constant overhead in the syntax.
I imagine that there is simply an internal check or transformation from linear indices to subscript indexing, i.e. the internal addition (operation) you perform is exactly the same. It appears more natural to use subscript indexing than linear indexing, so maybe MathWorks simply optimized the former.
UPDATE:
You can also simply access a single element in your matrix; you will see that using a subscript index is faster than using a linear index. That supports the theory that a slow conversion from linear index to subscripts is done internally.
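A minimal way to test that claim (my own sketch, using timeit):
im = zeros(512);
ind = sub2ind(size(im), 101, 201); % precomputed linear index
timeit(@() im(101,201)) % subscript access
timeit(@() im(ind)) % linear access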
DISCLAIMER: I don't have a MATLAB license at the moment, so the code I provide below is admittedly untested. However, if anyone decides to test, please comment on this answer accordingly.
Depending on your release of MATLAB (are you using R2015b?), there is a possibility that you may not have paid the full upfront cost of preallocation when invoking "zeros". There is a possibility that you are paying for allocation on the first get/set of im, which is causing additional but hidden overhead when you first access the values inside im.
See: http://undocumentedmatlab.com/blog/preallocation-performance
As an initial test, I suggest switching the order that you are profiling the code:
N = 512; it = 400; im = zeros(N);
%// subscript indexing
x = 1:2:N;
y = 1:2:N;
tic
for i=1:it
im(x,y) = im(x,y) +1;
end
toc %// What's the cost now?
%// linear indexing
[ind_x,ind_y] = ndgrid(1:2:N,1:2:N);
index = sub2ind(size(im),ind_x,ind_y);
tic
for i=1:it
im(index) = im(index) + 1;
end
toc %// What's the cost now?
To profile subscript vs. linear indexing perhaps more fairly, I suggest one of two possible methods:
Make sure you incur allocation costs on both methods by creating two separate im matrices, im1 and im2, both initially set to zeros(N), and use each matrix for a separate indexing method.
Run a full get/set on each element of im before actually profiling between subscript vs. linear indexing.
Method 1:
N = 512; it = 400; im1 = zeros(N); im2 = zeros(N);
%// subscript indexing
x = 1:2:N;
y = 1:2:N;
tic
for i=1:it
im1(x,y) = im1(x,y) + 1;
end
toc %// What's the cost now?
%// linear indexing
[ind_x,ind_y] = ndgrid(1:2:N,1:2:N);
index = sub2ind(size(im2),ind_x,ind_y);
tic
for i=1:it
im2(index) = im2(index) + 1;
end
toc %// What's the cost now?
Method 2:
N = 512; it = 400; im = zeros(N);
%// Run a full get/set on each element to force allocation
tic
for i=1:N^2
im(i) = im(i) +1;
end
toc
%// subscript indexing
x = 1:2:N;
y = 1:2:N;
tic
for i=1:it
im(x,y) = im(x,y) +1;
end
toc %// What's the cost now?
%// linear indexing
[ind_x,ind_y] = ndgrid(1:2:N,1:2:N);
index = sub2ind(size(im),ind_x,ind_y);
tic
for i=1:it
im(index) = im(index) + 1;
end
toc %// What's the cost now?
I have a second hypothesis: you incur some additional overhead when you explicitly declare each and every element to be accessed, versus having MATLAB infer the elements for you. excasa's "duplicate post" reference (not exactly a duplicate, in my humble opinion) has the same general insight, but uses different data points to come to this conclusion. I won't write examples of this here, but basically, creating a straight-up giant array index, compared to the smaller subscript indices x and y, gives MATLAB less room for internal optimizations. I don't know what inside MATLAB would perform these specific optimizations, but perhaps they come from the black magic you may know as MATLAB's JIT/LXE. If you honestly want to check whether JIT is the culprit here (and are working in R2014b or prior), you can try disabling it and then running the code above.
There are several ways to disable the JIT:
Use undocumented feature methods.
Copy/paste the commands to the command prompt, as opposed to running them straight from the script editor.
Unfortunately, I do not know of a way to turn off LXE in R2015a and later, and trying to diagnose whether LXE is the culprit may be a bit of an uphill battle. If this is where you are stuck, perhaps you can delve even further via MathWorks technical support or MATLAB Central. You may be surprised to find some astounding experts from either source.

Vectorizing the solution of a linear equation system in MATLAB

Summary: This question deals with the improvement of an algorithm for the computation of linear regression.
I have a 3D array (dlMAT) representing monochrome photographs of the same scene taken at different exposure times (the vector IT). Mathematically, every vector along the 3rd dimension of dlMAT represents a separate linear regression problem that needs to be solved. The equation whose coefficients need to be estimated is of the form:
DL = R*IT^P, where DL and IT are obtained experimentally and R and P must be estimated.
The above equation can be transformed into a simple linear model after applying a logarithm:
log(DL) = log(R) + P*log(IT) => y = a + b*x
Presented below is the most "naive" way to solve this system of equations, which essentially involves iterating over all "3rd dimension vectors" and fitting a polynomial of order 1 to (IT, DL(ind1,ind2,:)):
%// Define some nominal values:
R = 0.3;
IT = 600:600:3000;
P = 0.97;
%// Impose some believable spatial variations:
pMAT = 0.01*randn(3)+P;
rMAT = 0.1*randn(3)+R;
%// Generate "fake" observation data:
dlMAT = bsxfun(@times,rMAT,bsxfun(@power,permute(IT,[3,1,2]),pMAT));
%// Regression:
sol = cell(size(rMAT)); %// preallocation
for ind1 = 1:size(dlMAT,1)
for ind2 = 1:size(dlMAT,2)
sol{ind1,ind2} = polyfit(log(IT(:)),log(squeeze(dlMAT(ind1,ind2,:))),1);
end
end
fittedP = cellfun(@(x)x(1),sol); %// Estimate of pMAT
fittedR = cellfun(@(x)exp(x(2)),sol); %// Estimate of rMAT
The above approach seems like a good candidate for vectorization, since it does not utilize MATLAB's main strength that is MATrix operations. For this reason, it does not scale very well and takes much longer to execute than I think it should.
There exist alternative ways to perform this computation based on matrix division, as demonstrated here and here, which involve something like this:
sol = [ones(size(x)),log(x)]\log(y);
That is, appending a vector of 1s to the observations, followed by mldivide to solve the equation system.
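For a single vector the recipe looks like this (nominal values assumed purely for illustration):
x = (600:600:3000).'; % IT
y = 0.3*x.^0.97; % DL = R*IT^P with R = 0.3, P = 0.97
ab = [ones(size(x)), log(x)]\log(y);
R_est = exp(ab(1)) % recovers ~0.3
P_est = ab(2) % recovers ~0.97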
The main challenge I'm facing is how to adapt my data to the algorithm (or vice versa).
Question #1: How can the matrix-division-based solution be extended to solve the problem presented above (and potentially replace the loops I am using)?
Question #2 (bonus): What is the principle behind this matrix-division-based solution?
The secret ingredient behind the solution that includes matrix division is the Vandermonde matrix. The question discusses a linear problem (linear regression), and those can always be formulated as a matrix problem, which \ (mldivide) can solve in a mean-square-error sense. Such an algorithm, solving a similar problem, is demonstrated and explained in this answer.
Below is benchmarking code that compares the original solution with two alternatives suggested in chat:
function regressionBenchmark(numEl)
clc
if nargin<1, numEl=10; end
%// Define some nominal values:
R = 5;
IT = 600:600:3000;
P = 0.97;
%// Impose some believable spatial variations:
pMAT = 0.01*randn(numEl)+P;
rMAT = 0.1*randn(numEl)+R;
%// Generate "fake" measurement data using the relation "DL = R*IT.^P"
dlMAT = bsxfun(@times,rMAT,bsxfun(@power,permute(IT,[3,1,2]),pMAT));
%% // Method1: loops + polyval
disp('-------------------------------Method 1: loops + polyval')
tic; [fR,fP] = method1(IT,dlMAT); toc;
fprintf(1,'Regression performance:\nR: %d\nP: %d\n',norm(fR-rMAT,1),norm(fP-pMAT,1));
%% // Method2: loops + Vandermonde
disp('-------------------------------Method 2: loops + Vandermonde')
tic; [fR,fP] = method2(IT,dlMAT); toc;
fprintf(1,'Regression performance:\nR: %d\nP: %d\n',norm(fR-rMAT,1),norm(fP-pMAT,1));
%% // Method3: vectorized Vandermonde
disp('-------------------------------Method 3: vectorized Vandermonde')
tic; [fR,fP] = method3(IT,dlMAT); toc;
fprintf(1,'Regression performance:\nR: %d\nP: %d\n',norm(fR-rMAT,1),norm(fP-pMAT,1));
function [fittedR,fittedP] = method1(IT,dlMAT)
sol = cell(size(dlMAT,1),size(dlMAT,2));
for ind1 = 1:size(dlMAT,1)
for ind2 = 1:size(dlMAT,2)
sol{ind1,ind2} = polyfit(log(IT(:)),log(squeeze(dlMAT(ind1,ind2,:))),1);
end
end
fittedR = cellfun(@(x)exp(x(2)),sol);
fittedP = cellfun(@(x)x(1),sol);
function [fittedR,fittedP] = method2(IT,dlMAT)
sol = cell(size(dlMAT,1),size(dlMAT,2));
for ind1 = 1:size(dlMAT,1)
for ind2 = 1:size(dlMAT,2)
sol{ind1,ind2} = flipud([ones(numel(IT),1) log(IT(:))]\log(squeeze(dlMAT(ind1,ind2,:)))).';
end
end
fittedR = cellfun(@(x)exp(x(2)),sol);
fittedP = cellfun(@(x)x(1),sol);
function [fittedR,fittedP] = method3(IT,dlMAT)
N = 1; %// Degree of polynomial
VM = bsxfun(@power, log(IT(:)), 0:N); %// Vandermonde matrix
result = fliplr((VM\log(reshape(dlMAT,[],size(dlMAT,3)).')).');
%// Compressed version:
%// result = fliplr(([ones(numel(IT),1) log(IT(:))]\log(reshape(dlMAT,[],size(dlMAT,3)).')).');
fittedR = exp(real(reshape(result(:,2),size(dlMAT,1),size(dlMAT,2))));
fittedP = real(reshape(result(:,1),size(dlMAT,1),size(dlMAT,2)));
The reason why method 2 can be vectorized into method 3 is essentially that matrix multiplication can be separated by the columns of the second matrix. If A*B produces matrix X, then by definition A*B(:,n) gives X(:,n) for any n. Moving A to the right-hand side with mldivide, this means that the divisions A\X(:,n) can be done in one go for all n with A\X. The same holds for an overdetermined system (the linear regression problem), in which there is in general no exact solution and mldivide finds the solution that minimizes the mean-square error. In this case too, the operations A\X(:,n) (method 2) can be done in one go for all n with A\X (method 3).
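A small numerical check of that separability claim:
A = randn(5,2); X = randn(5,3); % overdetermined: 5 equations, 2 unknowns
sol_all = A\X; % all right-hand sides in one call
sol_col = [A\X(:,1), A\X(:,2), A\X(:,3)]; % column by column
norm(sol_all - sol_col) % ~0 up to rounding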
The implications of improving the algorithm when increasing the size of dlMAT can be seen below (timing figure omitted):
For the case of 500*500 (or 2.5E5) elements, the speedup from Method 1 to Method 3 is about x3500!
It is also interesting to observe the output of profile (here, for the case of 500*500):
(profiler screenshots for Method 1, Method 2 and Method 3 omitted)
From the above it is seen that rearranging the elements via squeeze and flipud takes up about half (!) of the runtime of Method 2. It is also seen that some time is lost on the conversion of the solution from cells to matrices.
Since the 3rd solution avoids all of these pitfalls, as well as the loops altogether (which mostly means re-evaluation of the code on every iteration), it unsurprisingly results in a considerable speedup.
Notes:
There was very little difference between the "compressed" and the "explicit" versions of Method 3, slightly in favor of the "explicit" version. For this reason the "compressed" version was not included in the comparison.
A solution was attempted where the inputs to Method 3 were gpuArray-ed. This did not improve performance (and even degraded it somewhat), possibly due to a wrong implementation, or to the overhead associated with copying matrices back and forth between RAM and VRAM.

Fastest way to add multiple sparse matrices in a loop in MATLAB

I have a code that repeatedly calculates a sparse matrix in a loop (it performs this calculation 13472 times to be precise). Each of these sparse matrices is unique.
After each execution, it adds the newly calculated sparse matrix to what was originally a sparse zero matrix.
When all 13472 matrices have been added, the code exits the loop and the program terminates.
The code bottleneck occurs in adding the sparse matrices. I have made a dummy version of the code that exhibits the same behavior as my real code. It consists of a MATLAB function and a script given below.
(1) Function that generates the sparse matrix:
function out = test_evaluate_stiffness(n)
ind = randi([1 n*n],300,1);
val = rand(300,1);
[I,J] = ind2sub([n,n],ind);
out = sparse(I,J,val,n,n);
end
(2) Main script (program)
% Calculate the stiffness matrix
n=1000;
K=sparse([],[],[],n,n,n^2);
tic
for i=1:13472
temp=rand(1)*test_evaluate_stiffness(n);
K=K+temp;
end
fprintf('Stiffness Calculation Complete\nTime taken = %f s\n',toc)
I'm not very familiar with sparse matrix operations so I may be missing a critical point here that may allow my code to be sped up considerably.
Am I handling the updating of my stiffness matrix in a reasonable way in my code? Is there another way that I should be using sparse that will result in a faster solution?
A profiler report is also provided below (screenshot omitted).
If you only need the sum of those matrices, instead of building all of them individually and then summing them, simply concatenate the vectors I, J and val and call sparse only once. If there are duplicate pairs [i,j] in [I,J], the corresponding values S(i,j) are summed automatically, so the code is exactly equivalent. As calling sparse involves an internal call to a sorting algorithm, you save 13472-1 intermediate sorts and get away with only one.
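The duplicate-summing behaviour of sparse is easy to verify:
S = sparse([1 1 2],[1 1 2],[10 5 7],2,2);
full(S) % S(1,1) is 15: the two (1,1) entries were summed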
This involves changing the signature of test_evaluate_stiffness to output [I,J,val]:
function [I,J,val] = test_evaluate_stiffness(n)
and removing the line out = sparse(I,J,val,n,n);.
You will then change your main script to:
n = 1000;
[I,J,V] = deal([]);
tic;
for i = 1:13472
[I_i, J_i, V_i] = test_evaluate_stiffness(n);
nE = numel(I_i);
I(end+(1:nE)) = I_i;
J(end+(1:nE)) = J_i;
V(end+(1:nE)) = rand(1)*V_i;
end
K = sparse(I,J,V,n,n);
fprintf('Stiffness Calculation Complete\nTime taken = %f s\n',toc);
If you know the lengths of the outputs of test_evaluate_stiffness ahead of time, you can possibly save some time by preallocating the arrays I, J and V with appropriately-sized zero arrays and setting them using something like:
I((i-1)*nE + (1:nE)) = ...
J((i-1)*nE + (1:nE)) = ...
V((i-1)*nE + (1:nE)) = ...
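In the dummy code above each call happens to return exactly 300 triplets, so under that assumption a fully preallocated version could look like this sketch:
n = 1000; nLoops = 13472; nE = 300; % nE matches the dummy generator above
[I,J,V] = deal(zeros(nLoops*nE,1));
tic;
for i = 1:nLoops
    [I_i, J_i, V_i] = test_evaluate_stiffness(n);
    idx = (i-1)*nE + (1:nE);
    I(idx) = I_i; J(idx) = J_i; V(idx) = rand(1)*V_i;
end
K = sparse(I,J,V,n,n);
fprintf('Time taken = %f s\n',toc);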
The biggest remaining computation, taking 11s, is the sparse operation on the final I,J,V vectors, so I think we've taken it down to the bare bones.
Nearly... but one final trick: if you can create the vectors so that J is sorted ascending, then you will greatly improve the speed of the sparse call; about a factor of 4 in my experience.
(If it's easier to have I sorted, then create the transpose matrix sparse(J,I,V) and un-transpose it afterwards.)
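A sketch of that sorting step, applied to the I,J,V vectors built above:
[J, order] = sort(J); % make the column index ascending
I = I(order);
V = V(order);
K = sparse(I,J,V,n,n); % noticeably faster with J sorted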

Can you suggest something faster than imfilter under certain conditions?

I'm using a MATLAB program that has a very long loop; inside this loop is the following code:
...
H = fspecial('gaussian', 6*sig(i), sig(i));
img_out = imfilter(img{i},H,'same');
...
Where 'sig' is a list of Gaussian widths, and 'img' is a cell array of images.
I need to make this code more efficient and perhaps those two points will allow for something more clever:
The filter is always Gaussian - just different sigma.
The image inside 'img{i}' is a grayscale sparse matrix.
I found a wonderful solution to the problem:
http://blog.ivank.net/fastest-gaussian-blur.html
There is a quick implementation in Matlab Help files:
intImage = integralImage(I);
avgH = integralKernel([1 1 7 7], 1/49);
J = integralFilter(intImage, avgH);
So 3 passes of that should approximate a Gaussian!
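Here is a hedged sketch of that three-pass idea (the box width w comes from the usual w = sqrt(12*sig^2/n + 1) rule; note that integralFilter only returns the fully covered region, so the output shrinks by w-1 pixels per dimension on every pass, which this sketch simply accepts):
I = rand(100); % example image
sig = 3; n = 3; % sigma and number of box passes
w = round(sqrt(12*sig^2/n + 1)); % box width approximating the Gaussian
J = I;
for k = 1:n
    intImg = integralImage(J);
    J = integralFilter(intImg, integralKernel([1 1 w w], 1/w^2));
end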
I tried to batch process images of the same size and sigma by stacking them:
%problem generator
%real number is 10^6
num_of_images=10^4;
%i assume squared images
image_size=randi([60,100],num_of_images,1);
sig=randi([2,4],num_of_images,1);
img=cell(num_of_images,1);
ratio_nnz=.02;
for idx=1:num_of_images
ti=rand(image_size(idx))/ratio_nnz;
ti(ti>1)=0;
img{idx}=ti;
end
%existing approach
tic;
for idx=1:num_of_images
H = fspecial('gaussian', 6*sig(idx), sig(idx));
img_out = imfilter(img{idx},H,'same');
end
toc;
%idea: match images of the same sigma and size
tic
%calculate all filters offline
[sig_unique,~,sig_index]=unique(sig);
H=cell(numel(sig_unique),1);
for idx=1:numel(sig_unique)
H{idx}= fspecial('gaussian', 6*sig_unique(idx), sig_unique(idx));
end
%find instances of same size and sigma
[x,y]=cellfun(@size,img);
[a,b,c]=unique([sig_index,x,y],'rows');
img_out=cell(size(img));
for didx=1:numel(b)
%img{c==didx} contains images of same sigma and size, process them at
%once
iH=H{a(didx,1)};
timg=cat(3,img{c==didx});
timg_out=imfilter(timg,iH,'same');
img_out(c==didx)=num2cell(timg_out,[1,2]);
end
toc
The result surprised me: calling imfilter fewer times with larger stacked matrices was actually slower with the data I generated. Nevertheless, try it with your data and/or the faster Gaussian filter you are planning to implement; it might be faster in that case.