Correlation matrix of a huge matrix in MATLAB

I have an X matrix of dimensions 37,000,000 by 22, and I want to compute the correlation matrix of X.
I.e.,
X_corr = corr(X,'type','Spearman');
And I'd like the size of X_corr to be of 22 by 22.
But it takes forever. Is there any way to compute the correlation matrix faster for such long matrices?
Thanks!

Inspired by @Bitwise's solution, I looked into the implementation of corr (you can do so by simply typing edit corr). It has a loop over pairs of variables, since it wants to deal with NaNs. If you don't have NaNs in your data, you can compute Spearman's correlation simply as:
X = rand(3e6, 22);
R = tiedrank(X); % Elapsed time is 8.956700 seconds.
C = corrcoef(R); % Elapsed time is 0.579448 seconds.
which should be the same as
C2 = corr(X, 'type', 'Spearman'); % Elapsed time is 9.501480 seconds.
But the overall time is about the same, since tiedrank dominates the cost.
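As a quick sanity check, the two routes should agree up to round-off. A minimal sketch (on a smaller matrix, since exact sizes and timings are machine-dependent):
X  = rand(1e5, 22);                % smaller than 3e6 rows, for a fast check
C1 = corrcoef(tiedrank(X));        % rank-transform, then Pearson
C2 = corr(X, 'type', 'Spearman');  % built-in Spearman
max(abs(C1(:) - C2(:)))            % should be on the order of eps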

Try corrcoef():
>> X=rand(1000000,22);
>> tic;corr(X);toc
Elapsed time is 18.320141 seconds.
>> tic;corrcoef(X);toc
Elapsed time is 0.494406 seconds.
Also this is almost what you want (I don't have enough memory for 37e6x22):
>> X=rand(10000000,22);
>> tic;corrcoef(X);toc
Elapsed time is 7.620509 seconds.
Edit:
If you want Spearman, you can convert to ranks and then calculate Pearson, which is equivalent. Sorting isn't that bad:
>> X=rand(10000000,22);
>> tic;sort(X);toc
Elapsed time is 31.639637 seconds.
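If your data has no ties (e.g. continuous values from rand), you can even skip tiedrank and build the ranks from a single sort per column. A sketch under that no-ties assumption:
X = rand(1e6, 22);
[n, m] = size(X);
[~, p] = sort(X, 1);                                   % p(k,j) = row of the k-th smallest value in column j
r = zeros(n, m);
r(bsxfun(@plus, p, (0:m-1)*n)) = repmat((1:n)', 1, m); % scatter ranks back into place
C = corrcoef(r);                                       % Pearson on ranks == Spearman when there are no ties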

Related

Relational operators between big sparse matrices in Matlab

I have two big sparse double matrices in Matlab:
P with dimension 1048576 x 524288
I with dimension 1048576 x 524288
I want to find the number of entries (i,j) such that P(i,j) <= I(i,j).
Naively I tried to run
n=sum(sum(P<=I));
but it is extremely slow (I had to shut down Matlab because it had been running forever and I wasn't able to stop it).
Is there any other more efficient way to proceed, or is what I want to do infeasible?
From some simple tests,
n = numel(P) - nnz(P>I);
seems to be faster than sum(sum(P<=I)) or even nnz(P<=I). The reason is probably that the sparse matrix P<=I has many more nonzero entries than P>I, and thus requires more memory.
Example:
>> P = sprand(10485, 52420, 1e-3);
>> I = sprand(10485, 52420, 1e-3);
>> tic, disp(sum(sum(P<=I))); toc
(1,1) 549074582
Elapsed time is 3.529121 seconds.
>> tic, disp(nnz(P<=I)); toc
549074582
Elapsed time is 3.538129 seconds.
>> tic, disp(numel(P) - nnz(P>I)); toc
549074582
Elapsed time is 0.010624 seconds.
Of course this highly depends on the matrix sizes and density.
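To see why the complement is so much cheaper, you can compare the densities of the two comparison results directly. A small sketch (exact counts vary with the random seed):
P = sprand(1e4, 1e4, 1e-3);
I = sprand(1e4, 1e4, 1e-3);
nnz(P <= I) / numel(P)   % close to 1: 0 <= 0 holds wherever both are zero, so this result is effectively dense
nnz(P >  I) / numel(P)   % tiny: true only near the few nonzeros of P, so this result stays genuinely sparse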
Here is a solution using the indices of the nonzero elements:
xp = find(P);                        % linear indices of nonzeros in P
xi = find(I);                        % linear indices of nonzeros in I
vp = nonzeros(P);                    % values at xp
vi = nonzeros(I);                    % values at xi
[s,ia,ib] = intersect(xp,xi);        % positions where both are nonzero
iia = true(numel(vp),1);
iia(ia) = false;                     % P-only nonzeros (I is zero there)
iib = true(numel(vi),1);
iib(ib) = false;                     % I-only nonzeros (P is zero there)
n = sum(vp(ia) <= vi(ib)) ...        % both nonzero: compare values directly
  + sum(vp(iia) < 0) ...             % P nonzero, I zero: need vp <= 0
  + sum(vi(iib) > 0) ...             % P zero, I nonzero: need 0 <= vi
  + numel(P) ...                     % both zero: 0 <= 0 always holds...
  - (numel(xp)+numel(xi)-numel(s));  % ...so subtract the union of nonzero positions
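A quick way to convince yourself the bookkeeping above is right is to check it against the direct computation on inputs small enough for that to be fast. A sketch (sprandn is used so that negative values exercise the vp < 0 branch):
P = sprandn(200, 300, 0.05);
I = sprandn(200, 300, 0.05);
% ... run the snippet above to compute n ...
assert(n == full(sum(sum(P <= I))));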

Optimization/vectorization of Matlab algorithm including very big matrices

I have an optimization problem in Matlab. Assume I get the following three vectors as input:
A of size (1 x N) (time-series of signal amplitude)
F of size (1 x N) (time-series of signal instantaneous frequency)
fx of size (M x 1) (frequency axis that I want to match the above on)
Now, the elements of F will not necessarily match the items of fx exactly (99% of the time they will not), which is why I have to match each to the closest frequency.
Here's the catch: we are talking about big data. N can easily be up to 2 million, and this has to be run hundreds of times on several hundred subjects. My two concerns:
Time (main concern)
Memory (production will run on machines with 16+ GB of memory, but development is on a machine with only 8 GB)
I have these two working solutions. For the following, N=2604000 and M=201:
Method 1 (for-loop)
Simple for-loop. Memory is no problem at all, but it is time consuming. Easiest implementation.
tic;
I = zeros(M,N);
for i = 1:N
    [~,f] = min(abs(fx-F(i)));
    I(f,i) = A(i).^2;
end
toc;
Duration: 18.082 seconds.
Method 2 (vectorized)
The idea is to match the frequency axis with each instantaneous frequency, to get the id.
           F
           [ 0.9  0.2  2.3  1.4 ]   N
      [0]  [  0    1    0    0  ]
 fx   [1]  [  1    0    0    1  ]
      [2]  [  0    0    1    0  ]
                                    M
And then multiply each column with the amplitude at that time.
tic;
m_Ff = repmat(F,M,1);
m_fF = repmat(fx,1,N);
[~,idx] = min(abs(m_Ff - m_fF)); clearvars m_Ff m_fF;
m_if = repmat(idx,M,1); clearvars idx;
m_fi = repmat((1:M)',1,N);
I = double(m_if==m_fi); clearvars m_if m_fi;
I = bsxfun(@times,I,A.^2); % use A.^2 so Method 2 matches Method 1's squared amplitudes
toc;
Duration: 64.223 seconds. This is surprising to me, but probably because of the huge variable sizes and my limited memory, which forces MATLAB to page the variables out to disk. I have an SSD, though.
The only thing I have not taken advantage of is that the matrices will have many zero elements. I will try and look into sparse matrices.
I need at least single precision for both the amplitudes and frequencies, but I actually found that it takes a lot of time to convert from double to single.
Any suggestions on how to improve?
UPDATE
As of the suggestions, I am now down to a combined time of 2.53 seconds. This takes advantage of the fact that fx is monotonically increasing and evenly spaced (always starting at 0). Here is the code:
tic; df = mode(diff(fx)); toc; % Find fx step size
tic; idx = round(F./df+1); toc; % Convert to bin ids
tic; I = zeros(M,N); toc; % Pre-allocate output
tic; lin_idx = idx + (0:N-1)*M; toc; % Find indices to insert at
tic; I(lin_idx) = A.^2; toc; % Insert
The timing outputs are the following:
Elapsed time is 0.000935 seconds.
Elapsed time is 0.021878 seconds.
Elapsed time is 0.175729 seconds.
Elapsed time is 0.018815 seconds.
Elapsed time is 2.294869 seconds.
Hence the most time-consuming step is now the final one. Any advice on this is greatly appreciated. Thanks to @Peter and @Divakar for getting me this far.
UPDATE 2 (Solution)
Woohoo! Using sparse(i,j,k) really improves the outcome:
tic; df = fx(2)-fx(1); toc;
tic; idx = round(F./df+1); toc;
tic; I = sparse(idx,1:N,A.^2,M,N); toc; % size given explicitly so I is M-by-N even if the highest bins stay empty
With timings:
Elapsed time is 0.000006 seconds.
Elapsed time is 0.016213 seconds.
Elapsed time is 0.114768 seconds.
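As a sanity check, the sparse construction should reproduce Method 1's output, since both pick the same nearest bin when fx is evenly spaced starting at 0. A sketch reusing the variables above:
I_loop = zeros(M, N);
for i = 1:N
    [~, f] = min(abs(fx - F(i)));
    I_loop(f, i) = A(i)^2;
end
isequal(I_loop, full(I))   % expected true, unless an F value sits exactly halfway between two bins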
Here's one approach based on bsxfun -
abs_diff = abs(bsxfun(@minus,fx,F));
[~,idx] = min(abs_diff,[],1);
IOut = zeros(M,N);
lin_idx = idx + [0:N-1]*M;
IOut(lin_idx) = A.^2;
I'm not entirely following the relationship between F and fx, but it sounds like fx might be a set of frequency bins, and you want to find the appropriate bin for each input F.
Optimizing this depends on the characteristics of fx.
If fx is monotonic and evenly spaced, then you don't need to search it at all. You just need to scale and offset F to align the scales, then round to get the bin number.
If fx is monotonic (sorted) but not evenly spaced, you want histc. This will use an efficient search on the edges of fx to find the correct bin. You probably need to transform fx first so that it contains the edges of the bins rather than the centers (see the sketch after this list).
If it's neither, then you should be able to at least sort it to get it monotonic, storing the sort order, and restoring the original order once you've found the correct "bin".
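For the sorted-but-unevenly-spaced case, here is a sketch of the histc route, assuming fx is a column vector of bin centers and reusing the idx/sparse scatter from the updates above:
edges = [-inf; (fx(1:end-1) + fx(2:end))/2; inf];  % midpoints between centers become bin edges
[~, idx] = histc(F, edges);                        % idx(i) = index of the fx center nearest to F(i)
I = sparse(idx, 1:N, A.^2, M, N);                  % same scatter as in UPDATE 2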

Correlation matrix ignoring NaN

I am using Matlab and I have a (60x882) matrix, and I need to compute pairwise correlations between columns. However, I want to ignore all the columns that contain one or more NaNs (i.e. the result for any pair of columns in which at least one entry is NaN should be NaN).
Here's my code so far:
for i = 1:size(auxret,2)
    for j = 1:size(auxret,2)
        rho(i,j) = corr(auxret(:,i), auxret(:,j));
    end
end
But this is extremely inefficient. I considered using the function:
corr(auxret, 'rows','pairwise');
But it didn't produce the same result (it ignores NaNs but still computes the correlation, so unless all entries of a column except one are NaN, it will still give an output).
Any suggestions on how to improve efficiency?
To get the same output as your code using corr(auxret, 'rows','pairwise'), the following does the job:
auxret(:,any(isnan(auxret))) = NaN;
r = corr(auxret, 'rows','pairwise');
What you describe is the default behavior of corr without any special options. For example,
auxret = [8  2  3
          3  5  NaN
          7 10  3
          7  4  6
          2  6  7];
rho = corr(auxret)
results in
rho =
1.0000 -0.1497 NaN
-0.1497 1.0000 NaN
NaN NaN NaN
This would be an efficient approach, especially when dealing with input data involving NaNs -
%// Get mask of invalid columns and thus extract columns without any NaN
mask = any(isnan(auxret),1);
A = auxret(:,~mask);
%// Use correlation formula to get correlation outputs for valid columns
n = size(A,1);
sum_cols = sum(A,1);
sumsq_sqcolsum = n*sum(A.^2,1) - sum_cols.^2;
val1 = n.*(A.'*A) - bsxfun(@times,sum_cols.',sum_cols);
val2 = sqrt(bsxfun(@times,sumsq_sqcolsum.',sumsq_sqcolsum));
valid_outvals = val1./val2;
%// Setup output array and store the valid outputs in it
ncols = size(auxret,2);
valid_idx = find(~mask);
out = nan(ncols);
out(valid_idx,valid_idx) = valid_outvals;
Basically, as a pre-processing step, it removes all columns having one or more NaNs altogether and calculates the correlation outputs for the rest. Then we initialize an output array of NaNs of the appropriate size and put the valid outputs back into it at the appropriate places.
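To check the formula-based block, compare its valid-column output against plain corr on the NaN-free columns. A sketch reusing the variable names from the code above:
ref = corr(auxret(:, ~mask));           % Pearson on the valid columns only
max(abs(valid_outvals(:) - ref(:)))     % should be ~0 up to round-off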
Benchmarking
It seems the results are valid whether you stay with your loopy approach or go with the optional corr(auxret, 'rows','pairwise'). But there is a big catch here: even a single NaN in any of the columns slows down the performance quite a bit, and this performance drop is huge with the original loopy approach and still big with the rows + pairwise option, as we will see in the benchmarking results below.
Benchmarking code
nrows = 60;
ncols = 882;
percent_nans = 1; %// decides the percentage of NaNs in input
auxret = rand(nrows,ncols);
auxret(randperm(numel(auxret),round((percent_nans/100)*numel(auxret))))=nan;
disp('------------------------------- With Proposed Approach')
tic
%// Solution code from earlier
toc
disp('------------------------------- With ROWS + PAIRWISE Approach')
tic
auxret(:,any(isnan(auxret))) = NaN;
out1 = corr(auxret, 'rows','pairwise');
toc
disp('------------------------------- With Original Loopy Approach')
tic
out2 = zeros(size(auxret,2));
for i = 1:size(auxret,2)
    for j = 1:size(auxret,2)
        out2(i,j) = corr(auxret(:,i), auxret(:,j));
    end
end
toc
So, there are a few possible cases based on the input data sizes and the percentage of NaNs, and correspondingly we have the runtime results -
Case 1: Input is 6 x 88 and Percentage of NaNs is 10
------------------------------- With Proposed Approach
Elapsed time is 0.006371 seconds.
------------------------------- With ROWS + PAIRWISE Approach
Elapsed time is 0.052563 seconds.
------------------------------- With Original Loopy Approach
Elapsed time is 0.875620 seconds.
Case 2: Input is 6 x 88 and Percentage of NaNs is 1
------------------------------- With Proposed Approach
Elapsed time is 0.006303 seconds.
------------------------------- With ROWS + PAIRWISE Approach
Elapsed time is 0.049194 seconds.
------------------------------- With Original Loopy Approach
Elapsed time is 0.871369 seconds.
Case 3: Input is 6 x 88 and Percentage of NaNs is 0.001
------------------------------- With Proposed Approach
Elapsed time is 0.006738 seconds.
------------------------------- With ROWS + PAIRWISE Approach
Elapsed time is 0.025754 seconds.
------------------------------- With Original Loopy Approach
Elapsed time is 0.867647 seconds.
Case 4: Input is 60 x 882 and Percentage of NaNs is 10
------------------------------- With Proposed Approach
Elapsed time is 0.007766 seconds.
------------------------------- With ROWS + PAIRWISE Approach
Elapsed time is 2.479645 seconds.
------------------------------- With Original Loopy Approach
...... Taken Too long ...
Case 5: Input is 60 x 882 and Percentage of NaNs is 1
------------------------------- With Proposed Approach
Elapsed time is 0.014144 seconds.
------------------------------- With ROWS + PAIRWISE Approach
Elapsed time is 2.324878 seconds.
------------------------------- With Original Loopy Approach
...... Taken Too long ...
Case 6: Input is 60 x 882 and Percentage of NaNs is 0.001
------------------------------- With Proposed Approach
Elapsed time is 0.020410 seconds.
------------------------------- With ROWS + PAIRWISE Approach
Elapsed time is 1.830632 seconds.
------------------------------- With Original Loopy Approach
...... Taken Too long ...

How to generate a random integer that is either 0 or 1 in MATLAB

I searched the MATLAB documentation for how to generate a random integer that is either a 0 or a 1.
I stumbled upon the two functions randint and randi.
randint appears to be deprecated in my version of MATLAB, although it is in the documentation online, and randi appears to only create random numbers between 1 and a specified imax value.
I even created my own randint function to solve the problem, although it doesn't run very efficiently in my program since it is used for large data sets:
function [ints] = randint(m,n)
ints = round(rand(m,n));
end
Is there a built-in function to create a random integer that is either a 0 or 1 or is there a more efficient way to create such a function in MATLAB?
This seems to be a little faster:
result = rand(m,n)<.5;
Or, if you need the result as double:
result = double(rand(m,n)<.5);
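Part of the speed difference in the timings below is memory traffic: a logical array uses 1 byte per element versus 8 for a double. A quick sketch to see this:
m = 100; n = 1e5;
L = rand(m, n) < .5;   % logical, 1 byte per element
D = double(L);         % double, 8 bytes per element
whos L D               % L: ~1e7 bytes, D: ~8e7 bytes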
Examples with m = 100, n = 1e5 (Matlab 2010b):
>> tic, round(rand(m,n)); toc
Elapsed time is 0.494488 seconds.
>> tic, randi([0 1], m,n); toc
Elapsed time is 0.565805 seconds.
>> tic, rand(m,n)<.5; toc
Elapsed time is 0.365703 seconds.
>> tic, double(rand(m,n)<.5); toc
Elapsed time is 0.445467 seconds.
randi is the way to go; just set the bounds to 0 and 1:
A = randi([0 1], m, n);
To generate either 0 or 1 randomly...
x=round(rand)

Number of values greater than a threshold

I have a matrix A. Now I want to find the number of elements greater than 5 and their corresponding indices. How can I solve this in MATLAB without using a for loop?
For example if A = [1 4 6 8 9 5 6 8 9]':
Number of elements > 5: 6
Indices: [3 4 5 7 8 9]
You use find:
index = find(A>5);
numberOfElements = length(index);
You use sum, which allows you to get the number of elements with one command:
numberOfElements = sum(A>5);
Do you really need explicit indices? Because the logical matrix A>5 can also be used as an index (usually a tad more efficient than indexing with find):
index = (A>5);
numberOfElements = sum(index);
For completeness: indexing with logicals is the same as with regular indices:
>> A(A>5)
ans =
6 8 9 6 8 9
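Putting the pieces together on the question's own example, a short sketch:
A = [1 4 6 8 9 5 6 8 9]';
mask = A > 5;                  % logical mask
numberOfElements = sum(mask)   % 6
index = find(mask)             % [3 4 5 7 8 9]'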
Motivated by the above discussion with Rody, here is a simple benchmark which tests the speed of integer vs. logical array indexing in MATLAB. Quite an important thing, I would say, since 'vectorized' MATLAB is mostly about indexing. So:
% random data
a = rand(10^7, 1);
% threshold - how much data meets the a>threshold criterion.
% This determines the total indexing time - the more data we extract from a,
% the longer it takes.
% In this example a small threshold means most data in a
% will meet the criterion.
threshold = 0.08;
% prepare logical and integer indices (note the uint32 cast)
index_logical = a>threshold;
index_integer = uint32(find(index_logical));
% logical indexing of a
tic
for i=1:10
b = a(index_logical);
end
toc
% integer indexing of a
tic
for i=1:10
b = a(index_integer);
end
toc
On my computer the results are
Elapsed time is 0.755399 seconds.
Elapsed time is 0.728462 seconds.
meaning that the two methods perform almost the same - that's how I chose the example threshold. It is interesting, because the index_integer array is almost 4 times larger!
index_integer 9198678x1 36794712 uint32
index_logical 10000000x1 10000000 logical
For larger values of the threshold, integer indexing is faster. Results for threshold=0.5:
Elapsed time is 0.687044 seconds. (logical)
Elapsed time is 0.296044 seconds. (integer)
Unless I am doing something wrong here, integer indexing seems to be the fastest most of the time.
Including the creation of the indices in the test yields very different results, however:
a = rand(1e7, 1);
threshold = 0.5;
% logical
tic
for i=1:10
inds = a>threshold;
b = a(inds);
end
toc
% double
tic
for i=1:10
inds = find(a>threshold);
b = a(inds);
end
toc
% integer
tic
for i=1:10
inds = uint32(find(a>threshold));
b = a(inds);
end
toc
Results (Rody):
Elapsed time is 1.945478 seconds. (logical)
Elapsed time is 3.233831 seconds. (double)
Elapsed time is 3.508009 seconds. (integer)
Results (angainor):
Elapsed time is 1.440018 seconds. (logical)
Elapsed time is 1.851225 seconds. (double)
Elapsed time is 1.726806 seconds. (integer)
So it would seem that the actual indexing is faster when indexing with integers, but front-to-back, logical indexing performs much better.
The runtime difference between the last two methods is unexpected though: it seems Matlab's internals either do not cast the doubles to integers, or perform error-checking on each element before doing the actual indexing. Otherwise, we would have seen virtually no difference between the double and integer methods.
Edit: There are two options, as I see it:
matlab converts double indices to uint32 indices explicitly before the indexing call (much like we do in the integer test)
matlab passes doubles and performs the double->int cast on the fly during the indexing call
The second option should be faster, because we only have to read the double indices once. In our explicit conversion test we have to read the double indices, write the integer indices, and then read the integer indices again during the actual indexing. So matlab should be faster... Why is it not?
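One way to probe this (a sketch, not a definitive answer) is to time the explicit double-to-uint32 cast on its own; if the cast alone accounts for the gap between the double and integer variants, then the conversion cost, not the indexing itself, is what we are measuring:
a = rand(1e7, 1);
inds_d = find(a > 0.5);      % double indices, as in the 'double' variant
tic
for i = 1:10
    inds_u = uint32(inds_d); % the extra step the explicit 'integer' variant pays for
end
toc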