How to Efficiently Combine Sparse Matrices Vertically - matlab

My goal is to combine many sparse matrices together to form one large sparse matrix. The only two ideas I've been able to think of are (1) create a large sparse matrix and overwrite certain blocks, (2) create the blocks individually use vertcat to form my final sparse matrix. However,I've read that overwriting sparse matrices is quite inefficient, and I've also read that vertcat isn't exactly computationally efficient. (I didn't both to consider using a for loop because of how inefficient they are).
What other alternatives do I have then?
Edit: By combine I mean "gluing" matrices together (vertically), the elements don't interact.

According to the matlab help, you can "disassemble" a sparse matrix with
[i,j,s] = find(S);
This means that if you have two matrices S and T, and you want to (effectively) vertcat them, you can do
[is, js, ss] = find(S);
[it, jt, st] = find(T);
ST = sparse([is; it + size(S,1)], [js; jt], [ss; st]);
Not sure if this is very efficient... but I'm guessing it's not too bad.
EDIT: using a 2000x1000 sparse matrix with a density of 1%, and combining it with another that has density of 2%, the above code ran in 0.016 seconds on my machine. Just doing [S;T] was 10x faster. What makes you think vertical concatenation is slow?
EDIT2: assuming you need to do this with "many" sparse matrices, the following works (this assumes you want them all "in the same place"):
m = 1000; n = 2000; density = 0.01;
N = 100;
Q = cell(1, N);
is = Q;
js = Q;
ss = Q;
numrows = 0; % keep track of dimensions so far
for ii = 1:N
Q{ii} = sprandn(m+ii, n-jj, density); % so each matrix has different size
[a b c] = find(Q{ii});
sz = size(Q{ii});
is{ii} = a' + numrows; js{ii}=b'; ss{ii}=c'; % append "on the corner"
numrows = numrows + sz(1); % keep track of the size
end
tic
ST = sparse([is{:}], [js{:}], [ss{:}]);
fprintf(1, 'using find takes %.2f sec\n', toc);
Output:
using find takes 0.63 sec
The big advantage of this method is that you don't need to have the same number of columns in your individual sparse arrays... it will all get sorted out by the sparse command which will simply consider the missing columns to be all zeros.

Considering the answer already given.
I have changed the experiment a bit, to be able to join matricies vertically (it should have the same width), so you we do no need to tweak n by extracting ii (which is mistyped by jj).
This approach
tic
ST = sparse([is{:}], [js{:}], [ss{:}]);
fprintf(1, 'using find takes %.2f sec\n', toc);
with its 0.45 sec is much slower than this one
tic
ST = vertcat(Q{:});
fprintf(1, 'using vertcat takes %.2f sec\n', toc);
with 0.18 sec average.
I also checked it with profiler, first example is expectedly slower, since at least the memory allocation is 100 times higher. Most probably, because of [ss{:}] array constructions which explicityly copies the data to the new array.
However, even with precomputed vectors the speed is 0,3 sec vs 0,18 sec for vertcat.
Thus, I suggest that vertcat is a better option for original problem. At least in 2021 :)

Related

Use of bsxfun with singleton expansion with matrixes of three dimensions

I'm using bsxfun to vectorize an operation with singleton expansion between matrixes of sizes:
MS: (nms, nls)
KS: (nks, nls)
The operation is the sum of the absolute differences between each value MS(m,l) with m in 1:nms and l in 1:nls, and every KS(k,l) with k in 1:nks.
I achieve this through the code:
[~, nls] = size(MS);
MS = reshape(MS',1,nls,[]);
R = sum(abs(bsxfun(#minus,MS,KS)));
R is of size (nls, nms).
I want to generalize this operation to a list of samples, so the new sizes will be:
MS: (nxs, nls, nms)
KS: (nxs, nls, nks)
This can be achieved easily with a for loop that executes the first piece of code for each 2 dimensional matrixes, but I suspect that performance may be much better by generalizing the previous code by adding a new dimension.
R has would be of size: (nxs, nls, nms)
I have tried to reshape MS to 4 dimensions with no success. Could this be done with reshaping and bsxfun?
You might need this:
% generate small dummy data
nxs = 2;
nls = 3;
nms = 4;
nks = 5;
MS = rand(nxs, nls, nms);
KS = rand(nxs, nls, nks);
R = sum(abs(bsxfun(#minus,MS,permute(KS,[1,2,4,3]))),4)
This will produce a matrix of size [2,3,4], i.e. [nxs,nls,nms]. Each element [k1,k2,k3] will correspond to
R(k1,k2,k3) == sum_k abs(MS(k1,k2,k3) - KS(k1,k2,k))
For instance, in my random run
R(2,1,3)
ans =
1.255765020150647
>> sum(abs(MS(2,1,3)-KS(2,1,:)))
ans =
1.255765020150647
The trick is to introduce singleton dimensions with permute: permute(KS,[1,2,4,3]) is of size [nxs,nls,1,nks], while MS of size [nxs,nls,nms] is implicitly also of size [nxs,nls,nms,1]: every array in MATLAB is assumed to possess a countably infinite number of trailing singleton dimensions. From here it's easy to see how you can bsxfun together arrays of size [nxs,nls,nms,1] and [nxs,nls,1,nks], respectively, to obtain one with size [nxs,nls,nms,nks]. Summing along dimension 4 seals the deal.
I noted in a comment, that it might be faster to permute the summing index to be in the first place. Turns out that this by itself makes the code run slower. However, by reshaping the arrays to have decreasing dimension sizes, the overall performance increases (due to optimal memory access). Compare this:
% generate larger dummy data
nxs = 20;
nls = 30;
nms = 40;
nks = 500;
MS = rand(nxs, nls, nms);
KS = rand(nxs, nls, nks);
MS2 = permute(MS,[4 3 2 1]);
KS2 = permute(KS,[3 4 2 1]);
R3 = permute(squeeze(sum(abs(bsxfun(#minus,MS2,KS2)),1)),[3 2 1]);
What I did was put the summing nks dimension into first place, and order the rest of the dimensions in decreasing order. This could be done automatically, I just didn't want to overcomplicate the example. In your use case you'll probably know the magnitude of the dimensions anyway.
Runtimes with the above two codes: 0.07028 s for the original, 0.051162 s for the reordered one (best out of 5). Larger examples don't fit into memory for me now, unfortunately.

make this matlab snippet run without a loop

I want a code the below code more efficient timewise. preferably without a loop.
arguments:
t % time values vector
t_index = c % one of the possible indices ranging from 1:length(t).
A % a MXN array where M = length(t)
B % a 1XN array
code:
m = 1;
for k = t_index:length(t)
A(k,1:(end-m+1)) = A(k,1:(end-m+1)) + B(m:end);
m = m + 1;
end
Many thanks.
I'd built from B a matrix of size NxM (call it B2), with zeros in the right places and a triangular from according to the conditions and then all you need to do is A+B2.
something like this:
N=size(A,2);
B2=zeros(size(A));
k=c:length(t);
B2(k(1):k(N),:)=hankel(B)
ans=A+B2;
Note, the fact that it is "vectorized" doesn't mean it is faster these days. Matlab's JIT makes for loops comparable and sometimes faster than built-in vectorized options.

Efficient way in MATLAB to apply the same left and right matrix multiplication to a large set of matrices

I have a lot of 2-by-2 matrices S1, S2, ..., SN, and on each of those matrices, I want to perform a left and right matrix multiplication as in R*S*R^T, where R is also a 2-by-2 matrix. Obviously I could just write this with a for loop, but I anticipate it being very slow for large N in MATLAB. Is there a simple and efficient way to accomplish this without using a for loop? Thanks in Advance!
Your biggest problem is not the loops. For matrices so small calling MATLABs A*B introduces a lot of overhead. The best thing you can do is to store all the matrices in a large 4 x n_matrices matrix and spell out the matrix multiplications manually:
A = rand(4, 1000);
B = rand(4, 1000);
tic;
C = zeros(size(A));
C(1,:) = A(1,:).*B(1,:) + A(3,:).*B(2,:);
C(2,:) = A(2,:).*B(1,:) + A(4,:).*B(2,:);
C(3,:) = A(1,:).*B(3,:) + A(3,:).*B(4,:);
C(4,:) = A(2,:).*B(3,:) + A(4,:).*B(4,:);
toc
Elapsed time is 0.020950 seconds.
As you see, this takes little time (this is a 6-years old desktop PC). For small matrices like this it is practical and I can not imagine anything else written in MATLAB that could beat this performance-wise. Well, for very large number of 2x2 matrices you could introduce blocking (i.e., handle only a number of matrices at a time) to enhance cache reuse.
I would say that the cycle here is not that bad and not that slow, consider this
N = 1000000
S = cell(1,N);
Out = S;
A = rand(2);
B = rand(2);
for i = 1 : N
S{i} = rand(2);
end
tic
for i = 1 : N
Out{i} = A * S{i} * B;
end
toc
tic
f = #(i) A*i*B;
Out = cellfun(f,S,'UniformOutput' , false);
toc
N =
1000000
Elapsed time is 2.609569 seconds.
Elapsed time is 9.871200 seconds.
You may think of performing a cat of your 2x2 matrices and then performing just 2 multiplications (transposing correctly on the way). But you will loose time in catting.

Calculating all two element sums of [a1 a2 a3 ... an] in Matlab

Context: I'm working on Project Euler Problem 23 using Matlab in order to practice my barely existing programming skills.
My Problem:
Now I have a vector with roughly 6500 numbers (ranging from 12 to 28122) as elements and want to calculate all the two element sums. That is I only need one instance of every sum, so having calculated a1 + an it's not necessary to calculate an + a1.
Edit for clarification: This includes the sums a1+a1, a2+a2,..., an+an.
The problem is that this is much too slow.
Problem specific constraints:
It's a given that sums 28123 or over aren't necessary to calculate, since those can't be used to solve the problem further.
My approach:
AbundentNumberSumsRaw=[];
for i=1:3490
AbundentNumberSumsRaw=[AbundentNumberSumRaw AbundentNumbers(i)+AbundentNumbers(i:end);
end
This works terribly :p
My Comments:
I'm pretty sure that incrementally increasing the vector AbundentNumbersRaw is bad coding, since that means memory usage will spike unnecessarily. I haven't done so, since a) I don't know what size vector to pre-allocate and b) I couldn't come up with a way to inject the sums into AbundentNumbersRaw in a orderly manner without using some ugly looking nested loops.
"for i=1:3490" is lower than the numbers of elements simply because I checked and saw that all the resulting sums for numbers whose index are above 3490 would be too large for me to use anyway.
I'm pretty sure my main issue is that the program need to do a lot of incremental increases of the vector AbundentNumbersRaw.
Any and all help and suggestions would be much appreciated :)
Cheers
Rasmus
Suppose
a = 28110*rand(6500,1)+12;
then
sums = [
a(1) + a(1:end)
a(2) + a(2:end)
...
];
is the calculation you're after.
You also state that sums whose value goes over 28123 should be discarded.
This can be generalized like so:
% Compute all 2-element sums without repetitions
C = arrayfun(#(x) a(x)+a(x:end), 1:numel(a), 'uniformoutput', false);
C = cat(1, C{:});
% discard sums exceeding threshold
C(C>28123) = [];
or using a loop
% Compute all 2-element sums without repetitions
E = cell(numel(a),1);
for ii = 1:numel(a)
E{ii} = a(ii)+a(ii:end); end
E = cat(1, E{:});
% discard sums exceeding threshold
E(E>28123) = [];
Simple testing shows that arrayfun is somewhat faster than the loop, so I'd go for the arrayfun option.
As your primary problem is to find out, which integers in a given set can be written as the sum of two integers of a different set, I'd choose a different approach:
AbundantNumbers = 1:6500; % replace with the list you generated somewhere else
maxInteger = 28122;
AbundantNumberSum(1:maxInteger) = true; % logical array
for i = 1:length(AbundantNumbers)
sumIndices = AbundantNumbers(i) + AbundantNumbers;
AbundantNumberSum(sumIndices(sumIndices <= maxInteger)) = false;
end
Unfortunantely, this is not an answer to your question but to your problem ;-) For the MatLab way to solve your original question, see the elegant answer of Rody Oldenhuis.
My approach would be the following:
v = 1:3490; % your vector here
s = length(v);
result = zeros(s); % preallocate memory
for m = 1:s
result(m,m:end) = v(m)+v(m:end);
end
You will get a matrix of 3490 x 3490 elements and more than half of them 0.

What is a fast way to compute column by column correlation in matlab

I have two very large matrices (60x25000) and I'd like to compute the correlation between the columns only between the two matrices. For example:
corrVal(1) = corr(mat1(:,1), mat2(:,1);
corrVal(2) = corr(mat1(:,2), mat2(:,2);
...
corrVal(i) = corr(mat1(:,i), mat2(:,i);
For smaller matrices I can simply use:
colCorr = diag( corr( mat1, mat2 ) );
but this doesn't work for very large matrices as I run out of memory. I've considered slicing up the matrices to compute the correlations and then combining the results but it seems like a waste to compute correlation between column combinations that I'm not actually interested.
Is there a quick way to directly compute what I'm interested?
Edit: I've used a loop in the past but its just way to slow:
mat1 = rand(60,5000);
mat2 = rand(60,5000);
nCol = size(mat1,2);
corrVal = zeros(nCol,1);
tic;
for i = 1:nCol
corrVal(i) = corr(mat1(:,i), mat2(:,i));
end
toc;
This takes ~1 second
tic;
corrVal = diag(corr(mat1,mat2));
toc;
This takes ~0.2 seconds
I can obtain a x100 speed improvement by computing it by hand.
An=bsxfun(#minus,A,mean(A,1)); %%% zero-mean
Bn=bsxfun(#minus,B,mean(B,1)); %%% zero-mean
An=bsxfun(#times,An,1./sqrt(sum(An.^2,1))); %% L2-normalization
Bn=bsxfun(#times,Bn,1./sqrt(sum(Bn.^2,1))); %% L2-normalization
C=sum(An.*Bn,1); %% correlation
You can compare using that code:
A=rand(60,25000);
B=rand(60,25000);
tic;
C=zeros(1,size(A,2));
for i = 1:size(A,2)
C(i)=corr(A(:,i), B(:,i));
end
toc;
tic
An=bsxfun(#minus,A,mean(A,1));
Bn=bsxfun(#minus,B,mean(B,1));
An=bsxfun(#times,An,1./sqrt(sum(An.^2,1)));
Bn=bsxfun(#times,Bn,1./sqrt(sum(Bn.^2,1)));
C2=sum(An.*Bn,1);
toc
mean(abs(C-C2)) %% difference between methods
Here are the computing times:
Elapsed time is 10.822766 seconds.
Elapsed time is 0.119731 seconds.
The difference between the two results is very small:
mean(abs(C-C2))
ans =
3.0968e-17
EDIT: explanation
bsxfun does a column-by-column operation (or row-by-row depending on the input).
An=bsxfun(#minus,A,mean(A,1));
This line will remove (#minus) the mean of each column (mean(A,1)) to each column of A... So basically it makes the columns of A zero-mean.
An=bsxfun(#times,An,1./sqrt(sum(An.^2,1)));
This line multiply (#times) each column by the inverse of its norm. So it makes them L-2 normalized.
Once the columns are zero-mean and L2-normalized, to compute the correlation, you just have to make the dot product of each column of An with each column of B. So you multiply them element-wise An.*Bn, and then you sum each column: sum(An.*Bn);.
I think the obvious loop might be good enough for your size of problem. On my laptop it takes less than 6 seconds to do the following:
A = rand(60,25000);
B = rand(60,25000);
n = size(A,1);
m = size(A,2);
corrVal = zeros(1,m);
for k=1:m
corrVal(k) = corr(A(:,k),B(:,k));
end