Calculating correlation coefficient efficiently - matlab

I have high-dimensional data: a matrix A of size (50,12000) and a matrix B of size (50,1000).
I want to calculate the correlation of each column of A with each column of B.
I tried corr([A B]) in MATLAB, but it consumes a lot of memory and freezes. How can I do this quickly and efficiently?

To compute the correlation of each column of A with each column of B, use corr(A,B), not corr([A B]). The latter also computes the correlations within A and within B, producing a 13000x13000 result instead of the 12000x1000 one you need.
If corr(A,B) still causes memory problems, work in chunks. For example, the following code divides A into vertical stripes of chunk_size columns, computes the correlation of each A-stripe with B, and stores it. The final result is the same as corr(A,B).
chunk_size = 100; %// must divide size(A,2) (easy to avoid if needed, though)
result = NaN(size(A,2),size(B,2)); %// preallocate
for ii = chunk_size:chunk_size:size(A,2)
    ind = ii+(-chunk_size+1:0);
    result(ind,:) = corr(A(:,ind),B);
end
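If chunk_size does not divide size(A,2), the last stripe can simply be made shorter. The following is a minimal sketch of that variant, not part of the original answer; the small random A and B are stand-ins for the real data, and corr requires the Statistics and Machine Learning Toolbox.

```matlab
% Hypothetical small inputs standing in for the 50x12000 and 50x1000 data
A = randn(50, 1050);
B = randn(50, 40);

chunk_size = 100;                      % does not need to divide size(A,2)
nA = size(A, 2);
result = NaN(nA, size(B, 2));          % preallocate

for first = 1:chunk_size:nA
    % Column range of this stripe; the last stripe may be shorter
    ind = first:min(first + chunk_size - 1, nA);
    result(ind, :) = corr(A(:, ind), B);
end
```

Only one stripe of A is passed to corr at a time, so peak memory is governed by chunk_size rather than by the full width of A.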

Related

Optimize nested for loop for calculating xcorr of matrix rows

I have 2 nested loops which do the following:
Get two rows of a matrix
Check if indices meet a condition or not
If they do: calculate xcorr between the two rows and put it into new vector
Find the index of the maximum value of sub vector and replace element of LAG matrix with this value
I don't know how to speed this code up, by vectorizing or otherwise.
b=size(data,1);
F=size(data,2);
LAG= zeros(b,b);
for i=1:b
    for j=1:b
        if j>i
            x=data(i,:);
            y=data(j,:);
            d=xcorr(x,y);
            d=d(:,F:(2*F)-1);
            [M,I] = max(d);
            LAG(i,j)=I-1;
            d=xcorr(y,x);
            d=d(:,F:(2*F)-1);
            [M,I] = max(d);
            LAG(j,i)=I-1;
        end
    end
end
First, a note on floating point precision...
You mention in a comment that your data contains the integers 0, 1, and 2. You would therefore expect a cross-correlation to give integer results. However, since the calculation is being done in double-precision, there appears to be some floating-point error introduced. This error can cause the results to be ever so slightly larger or smaller than integer values.
Since your calculations involve looking for the location of the maxima, you could get slightly different results if there are repeated maximal integer values with added precision errors. For example, let's say you expect the value 10 to be the maximum and to appear at indices 2 and 4 of a vector d. You might calculate d one way and get d(2) = 10 and d(4) = 10.00000000000001, with some added precision error. The maximum would therefore be located at index 4. If you use a different method to calculate d, you might get d(2) = 10 and d(4) = 9.99999999999999, with the error going in the opposite direction, causing the maximum to be located at index 2.
The solution? Round your cross-correlation data first:
d = round(xcorr(x, y));
This will eliminate the floating-point errors and give you the integer results you expect.
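A small made-up vector shows the effect: max returns the first index among exact ties, so an error of about 1e-14 is enough to move the reported location. The values below are illustrative only and do not come from the original data.

```matlab
d_exact = [3 10 7 10];                  % exact integer result, tie at indices 2 and 4
d_noisy = d_exact;
d_noisy(4) = d_noisy(4) + 1e-14;        % simulated floating-point error

[~, I_exact]   = max(d_exact);          % first of the tied maxima
[~, I_noisy]   = max(d_noisy);          % the perturbed entry now wins
[~, I_rounded] = max(round(d_noisy));   % rounding restores the tie
```

Here I_exact and I_rounded are both 2, while I_noisy is 4.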
Now, on to the actual solutions...
Solution 1: Non-loop option
You can pass a matrix to xcorr and it will perform the cross-correlation for every pairwise combination of columns. Using this, you can forego your loops altogether like so:
d = round(xcorr(data.'));
[~, I] = max(d(F:(2*F)-1,:), [], 1);
LAG = reshape(I-1, b, b).';
Solution 2: Improved loop option
There are limits to how large data can be for the above solution, since it will produce large intermediate and output variables that can exceed the maximum array size available. In such a case for loops may be unavoidable, but you can improve upon the for-loop solution above. Specifically, you can compute the cross-correlation once for a pair (x, y), then just flip the result for the pair (y, x):
% Loop over rows:
for row = 1:b
    % Loop over upper matrix triangle:
    for col = (row+1):b
        % Cross-correlation for upper triangle:
        d = round(xcorr(data(row, :), data(col, :)));
        [~, I] = max(d(:, F:(2*F)-1));
        LAG(row, col) = I-1;
        % Cross-correlation for lower triangle:
        d = fliplr(d);
        [~, I] = max(d(:, F:(2*F)-1));
        LAG(col, row) = I-1;
    end
end
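The flip works because, for real-valued rows, the cross-correlation of (y, x) is the cross-correlation of (x, y) with the lags reversed. The sketch below checks this on two made-up integer rows. It uses conv with a reversed second argument as a stand-in for xcorr, an assumption made here so that the check does not need the Signal Processing Toolbox.

```matlab
x = [1 2 0 2 1];
y = [0 1 2 2 0];

% For real row vectors of length F, conv(x, fliplr(y)) gives the raw
% cross-correlation sequence of x and y over lags -(F-1)..(F-1).
dxy = conv(x, fliplr(y));
dyx = conv(y, fliplr(x));

% Swapping the arguments reverses the lag axis
same_when_flipped = max(abs(dyx - fliplr(dxy))) < 1e-12;
```

This is why one xcorr call per pair is enough in the loop above: the second result is obtained by reversing the first.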

Use of bsxfun with singleton expansion with matrixes of three dimensions

I'm using bsxfun to vectorize an operation with singleton expansion between matrices of sizes:
MS: (nms, nls)
KS: (nks, nls)
The operation is the sum of the absolute differences between each value MS(m,l) with m in 1:nms and l in 1:nls, and every KS(k,l) with k in 1:nks.
I achieve this through the code:
[~, nls] = size(MS);
MS = reshape(MS',1,nls,[]);
R = sum(abs(bsxfun(@minus,MS,KS)));
R is of size (nls, nms).
I want to generalize this operation to a list of samples, so the new sizes will be:
MS: (nxs, nls, nms)
KS: (nxs, nls, nks)
This can be achieved easily with a for loop that executes the first piece of code for each 2-dimensional matrix, but I suspect that performance may be much better if the previous code is generalized by adding a new dimension.
R would be of size: (nxs, nls, nms)
I have tried to reshape MS to 4 dimensions with no success. Could this be done with reshaping and bsxfun?
You might need this:
% generate small dummy data
nxs = 2;
nls = 3;
nms = 4;
nks = 5;
MS = rand(nxs, nls, nms);
KS = rand(nxs, nls, nks);
R = sum(abs(bsxfun(@minus,MS,permute(KS,[1,2,4,3]))),4)
This will produce a matrix of size [2,3,4], i.e. [nxs,nls,nms]. Each element [k1,k2,k3] will correspond to
R(k1,k2,k3) == sum_k abs(MS(k1,k2,k3) - KS(k1,k2,k))
For instance, in my random run
R(2,1,3)
ans =
1.255765020150647
>> sum(abs(MS(2,1,3)-KS(2,1,:)))
ans =
1.255765020150647
The trick is to introduce singleton dimensions with permute: permute(KS,[1,2,4,3]) is of size [nxs,nls,1,nks], while MS of size [nxs,nls,nms] is implicitly also of size [nxs,nls,nms,1]: every array in MATLAB is assumed to possess a countably infinite number of trailing singleton dimensions. From here it's easy to see how you can bsxfun together arrays of size [nxs,nls,nms,1] and [nxs,nls,1,nks], respectively, to obtain one with size [nxs,nls,nms,nks]. Summing along dimension 4 seals the deal.
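In MATLAB R2016b and later, arithmetic operators expand singleton dimensions implicitly, so the same computation can be written without bsxfun. A minimal sketch, checked against an explicit loop on small dummy data (the names R_implicit and R_loop are introduced here for illustration):

```matlab
nxs = 2; nls = 3; nms = 4; nks = 5;
MS = rand(nxs, nls, nms);
KS = rand(nxs, nls, nks);

% Implicit expansion: [nxs,nls,nms,1] minus [nxs,nls,1,nks]
R_implicit = sum(abs(MS - permute(KS, [1 2 4 3])), 4);

% Reference: loop over the nms slices, summing over the nks dimension
R_loop = zeros(nxs, nls, nms);
for m = 1:nms
    R_loop(:, :, m) = sum(abs(bsxfun(@minus, MS(:, :, m), KS)), 3);
end
```

Both results have size [nxs,nls,nms] and should agree to within rounding.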
I noted in a comment, that it might be faster to permute the summing index to be in the first place. Turns out that this by itself makes the code run slower. However, by reshaping the arrays to have decreasing dimension sizes, the overall performance increases (due to optimal memory access). Compare this:
% generate larger dummy data
nxs = 20;
nls = 30;
nms = 40;
nks = 500;
MS = rand(nxs, nls, nms);
KS = rand(nxs, nls, nks);
MS2 = permute(MS,[4 3 2 1]);
KS2 = permute(KS,[3 4 2 1]);
R3 = permute(squeeze(sum(abs(bsxfun(@minus,MS2,KS2)),1)),[3 2 1]);
What I did was put the summing nks dimension into first place, and order the rest of the dimensions in decreasing order. This could be done automatically, I just didn't want to overcomplicate the example. In your use case you'll probably know the magnitude of the dimensions anyway.
Runtimes with the above two codes: 0.07028 s for the original, 0.051162 s for the reordered one (best out of 5). Larger examples don't fit into memory for me now, unfortunately.

How can I detect the minimum and maximum values every 50 rows

I'm trying to detect peak values in MATLAB using the findpeaks function. The problem is that my data consists of 4200 rows and I just want to detect the minimum and maximum point in every 50 rows. Later I will use this code on real-time accelerometer data.
This is my code:
[peaks,peaklocations] = findpeaks( filteredX, 'minpeakdistance', 50 );
plot( x, filteredX, x( peaklocations ), peaks, 'or' )
So you want to first reshape your vector into segments of 50 samples and then compute the peaks for each segment.
A = randn(4200,1);
B = reshape(A,[50,size(A,1)/50]); %// B is a 50x84 matrix, one segment per column
pks = zeros(50,size(A,1)/50); %// preallocate; zero-padded (use NaN if you prefer)
pklocations = zeros(50,size(A,1)/50); %// preallocate; zero-padded
for i = 1:size(A,1)/50
    [p,loc] = findpeaks(B(:,i)); %// call findpeaks once per segment; adjust its parameters as needed
    pks(1:numel(p),i) = p;
    pklocations(1:numel(loc),i) = loc;
end
This generates 2 matrices, pklocations and pks, with one column per segment. The downside, of course, is that you do not know in advance how many peaks each segment will have, while every column of a matrix must have the same length, so I padded the columns with zeros. You can pad with NaN instead if you want.
EDIT: since the OP is looking for only 1 maximum and 1 minimum for each 50 samples, this can easily be done with the min/max functions in MATLAB.
A = randn(4200,1);
B = reshape (A,[50,size(A,1)/50]); %//which gives B the structure of 50*84 Matrix
[pks,pklocations] = max(B);
[trghs,trghlocations] = min(B);
Alternatively, you could take max(pks) from the findpeaks result, but that only complicates things.
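The locations returned by max and min are row indices within each 50-sample column of B. If positions in the original vector are needed (for plotting, say), add the start offset of each segment. A minimal sketch; the names seg_len, offsets, peak_idx, and trough_idx are introduced here for illustration:

```matlab
A = randn(4200, 1);
seg_len = 50;
B = reshape(A, seg_len, []);               % one 50-sample segment per column

[pks, pklocations]     = max(B);           % row vectors, one entry per segment
[trghs, trghlocations] = min(B);

offsets    = (0:size(B, 2) - 1) * seg_len; % start offset of each segment in A
peak_idx   = pklocations + offsets;        % indices into the original vector A
trough_idx = trghlocations + offsets;
```

With these indices, A(peak_idx) reproduces pks and A(trough_idx) reproduces trghs.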

Best way to join different length column vectors into a matrix in MATLAB

Assuming I have a series of column vectors of different lengths, what would be the best way, in terms of computation time, to join all of them into one matrix whose size is determined by the longest column, with the cells of the extended columns filled with NaNs?
Edit: Please note that I am trying to avoid cell arrays, since they are expensive in terms of memory and run time.
For example:
A = [1;2;3;4];
B = [5;6];
C = magicFunction(A,B);
Result:
C =
1 5
2 6
3 NaN
4 NaN
The following code avoids cell arrays except for counting the number of elements in each vector, which keeps the code a bit cleaner. The cost of using cell arrays for that small amount of work should be low, and varargin delivers the inputs as a cell array anyway. You could avoid cell arrays there too, but that would most probably require for-loops and a separate variable name for each input, which isn't elegant for a function with an unknown number of inputs. Otherwise, the code uses numeric arrays, logical indexing, and my favourite, bsxfun, which should be cheap in terms of runtime.
Function Code
function out = magicFunction(varargin)
    lens = cellfun(@(x) numel(x),varargin);
    out = NaN(max(lens),numel(lens));
    out(bsxfun(@le,[1:max(lens)]',lens)) = vertcat(varargin{:}); %//'
return;
Example
Script -
A1 = [9;2;7;8];
A2 = [1;5];
A3 = [2;6;3];
out = magicFunction(A1,A2,A3)
Output -
out =
9 1 2
2 5 6
7 NaN 3
8 NaN NaN
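The one-line assignment in magicFunction relies on two facts: the bsxfun comparison builds a logical mask marking the valid cells of each column, and logical indexing fills those cells in column-major order, which is the same order in which vertcat stacks the inputs. A step-by-step sketch with the same inputs as the example (the names vecs and mask are introduced here):

```matlab
vecs = {[9;2;7;8], [1;5], [2;6;3]};
lens = cellfun(@numel, vecs);                 % number of elements per vector

% mask(r,c) is true when row r exists in vector c
mask = bsxfun(@le, (1:max(lens))', lens);

out = NaN(max(lens), numel(lens));            % start with everything padded
out(mask) = vertcat(vecs{:});                 % fill valid cells column by column
```

Cells where mask is false keep their initial NaN, which produces the padding.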
Benchmarking
As part of the benchmarking, we compare our solution to @gnovice's solution, which is mostly based on cell arrays. The intention is to see what speedup, if any, we get by avoiding cell arrays. Here's the benchmarking code with 20 vectors -
%// Let's create row vectors A1,A2,A3.. to be used with @gnovice's solution
num_vectors = 20;
max_vector_length = 1500000;
vector_lengths = randi(max_vector_length,num_vectors,1);
vs = arrayfun(@(x) randi(9,1,vector_lengths(x)),1:numel(vector_lengths),'uni',0);
[A1,A2,A3,A4,A5,A6,A7,A8,A9,A10,A11,A12,A13,A14,A15,A16,A17,A18,A19,A20] = vs{:};
%// Maximally cell-array based approach used in linked @gnovice's solution
disp('--------------------- With @gnovice''s approach')
tic
tcell = {A1,A2,A3,A4,A5,A6,A7,A8,A9,A10,A11,A12,A13,A14,A15,A16,A17,A18,A19,A20};
maxSize = max(cellfun(@numel,tcell)); %# Get the maximum vector size
fcn = @(x) [x nan(1,maxSize-numel(x))]; %# Create an anonymous function
rmat = cellfun(fcn,tcell,'UniformOutput',false); %# Pad each cell with NaNs
rmat = vertcat(rmat{:});
toc, clear tcell maxSize fcn rmat
%// Transpose each of the input vectors to get column vectors as needed
%// for our problem
vs = cellfun(@(x) x',vs,'uni',0); %//'
[A1,A2,A3,A4,A5,A6,A7,A8,A9,A10,A11,A12,A13,A14,A15,A16,A17,A18,A19,A20] = vs{:};
%// Our solution
disp('--------------------- With our new approach')
tic
out = magicFunction(A1,A2,A3,A4,A5,A6,A7,A8,A9,A10,...
A11,A12,A13,A14,A15,A16,A17,A18,A19,A20);
toc
Results -
--------------------- With @gnovice's approach
Elapsed time is 1.511669 seconds.
--------------------- With our new approach
Elapsed time is 0.671604 seconds.
Conclusions -
With 20 vectors and a maximum length of 1500000, the speedup is roughly 2x to 3x, and the speedup grew as we increased the number of vectors. Those additional results are not shown here to save space.
If you use a cell matrix you won't need them to be filled with NaNs, just write each array into one column and the unused elements stay empty (that would be the space efficient way). You could either use:
cell_result{1} = A;
cell_result{2} = B;
This results in a 2-element cell array whose cells hold all the elements of A and B. Or, if you want each value stored in its own cell, with one array per row of the cell array:
cell_result(1,1:numel(A)) = num2cell(A);
cell_result(2,1:numel(B)) = num2cell(B);
If you need them to be filled with NaNs for later code, the easiest approach is to find the maximum length and create a matrix of size (max_length x number of arrays).
So let's say you have n=5 arrays: A, B, C, D and E.
h=zeros(1,n);
h(1)=numel(A);
h(2)=numel(B);
h(3)=numel(C);
h(4)=numel(D);
h(5)=numel(E);
max_No_Entries=max(h);
result= zeros(max_No_Entries,n);
result(:,:)=NaN;
result(1:numel(A),1)=A;
result(1:numel(B),2)=B;
result(1:numel(C),3)=C;
result(1:numel(D),4)=D;
result(1:numel(E),5)=E;

How to Efficiently Combine Sparse Matrices Vertically

My goal is to combine many sparse matrices together to form one large sparse matrix. The only two ideas I've been able to think of are (1) create a large sparse matrix and overwrite certain blocks, or (2) create the blocks individually and use vertcat to form the final sparse matrix. However, I've read that overwriting sparse matrices is quite inefficient, and I've also read that vertcat isn't exactly computationally efficient. (I didn't bother to consider a for loop because of how inefficient they are.)
What other alternatives do I have then?
Edit: By combine I mean "gluing" matrices together (vertically), the elements don't interact.
According to the matlab help, you can "disassemble" a sparse matrix with
[i,j,s] = find(S);
This means that if you have two matrices S and T, and you want to (effectively) vertcat them, you can do
[is, js, ss] = find(S);
[it, jt, st] = find(T);
ST = sparse([is; it + size(S,1)], [js; jt], [ss; st]);
Not sure if this is very efficient... but I'm guessing it's not too bad.
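One caveat with the find-based approach: sparse(i,j,s) infers the matrix size from the largest indices, so all-zero trailing rows or columns would be dropped from the result. Passing the dimensions explicitly avoids that. A minimal sketch on random test matrices (sizes and densities chosen arbitrarily here), compared against plain concatenation:

```matlab
S = sprand(200, 100, 0.01);
T = sprand(300, 100, 0.02);

[is, js, ss] = find(S);
[it, jt, st] = find(T);

% Shift T's row indices below S; explicit size keeps empty trailing rows/columns
ST = sparse([is; it + size(S, 1)], [js; jt], [ss; st], ...
            size(S, 1) + size(T, 1), size(S, 2));

matches_vertcat = isequal(ST, [S; T]);
```

With the explicit size arguments, ST has the same dimensions as [S; T] even when the last rows of T or the last columns of both matrices happen to be empty.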
EDIT: using a 2000x1000 sparse matrix with a density of 1%, and combining it with another that has density of 2%, the above code ran in 0.016 seconds on my machine. Just doing [S;T] was 10x faster. What makes you think vertical concatenation is slow?
EDIT2: assuming you need to do this with "many" sparse matrices, the following works (this assumes you want them all "in the same place"):
m = 1000; n = 2000; density = 0.01;
N = 100;
Q = cell(1, N);
is = Q;
js = Q;
ss = Q;
numrows = 0; % keep track of dimensions so far
for ii = 1:N
    Q{ii} = sprandn(m+ii, n-ii, density); % so each matrix has different size (the original post had an undefined jj here)
    [a, b, c] = find(Q{ii});
    sz = size(Q{ii});
    is{ii} = a' + numrows; js{ii}=b'; ss{ii}=c'; % append "on the corner"
    numrows = numrows + sz(1); % keep track of the size
end
tic
ST = sparse([is{:}], [js{:}], [ss{:}]);
fprintf(1, 'using find takes %.2f sec\n', toc);
Output:
using find takes 0.63 sec
The big advantage of this method is that you don't need to have the same number of columns in your individual sparse arrays... it will all get sorted out by the sparse command which will simply consider the missing columns to be all zeros.
Considering the answer already given.
I changed the experiment a bit so that the matrices can be joined vertically (they must all have the same width), so there is no need to tweak n by subtracting ii (mistyped as jj in the original code).
This approach
tic
ST = sparse([is{:}], [js{:}], [ss{:}]);
fprintf(1, 'using find takes %.2f sec\n', toc);
with its 0.45 sec is much slower than this one
tic
ST = vertcat(Q{:});
fprintf(1, 'using vertcat takes %.2f sec\n', toc);
with 0.18 sec average.
I also checked it with the profiler. The first example is, as expected, slower, since at the very least its memory allocation is 100 times higher, most probably because the [ss{:}] array constructions explicitly copy the data into a new array.
However, even with precomputed vectors the time is 0.3 sec vs 0.18 sec for vertcat.
Thus, I suggest that vertcat is a better option for original problem. At least in 2021 :)