Calculating Jaccard distance of a large matrix in Matlab

I have a large matrix of size 40K x 900K. It is a sparse, binary matrix and I would like to calculate the Jaccard distance between its rows (a 40K-by-40K distance matrix in total). I'm aware of the built-in function pdist, which can calculate this distance for me, but due to the matrix size it apparently can't cope, and it shows me the following error message:
Matrix is too large to convert to linear index.
Error in ==> pdist at 139
elseif any(imag(X(:)))
Any suggestions on how to resolve this problem?
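One common workaround is to avoid pdist entirely: for binary vectors, |a union b| = |a| + |b| - |a intersect b|, so the distances can be computed in row blocks with sparse matrix products. A minimal sketch, assuming X is a sparse 0/1 matrix with one sample per row (the block size and single-precision output are arbitrary choices, and the full 40K-by-40K result alone takes roughly 6.4 GB as singles):
X = spones(X);                         % ensure the sparse matrix holds exact 0/1 entries
n = size(X,1);
s = full(sum(X,2));                    % number of ones per row
blk = 1000;                            % block size; tune to available memory
D = zeros(n, n, 'single');
for k = 1:blk:n
    rows = k:min(k+blk-1, n);
    inter = full(X(rows,:) * X.');     % pairwise intersection counts for this block
    uni = s(rows) + s.' - inter;       % union counts via |a|+|b|-|a & b| (implicit expansion, R2016b+)
    D(rows,:) = single(1 - inter ./ max(uni, 1));  % Jaccard distance; max(...,1) avoids 0/0 for empty rows
end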

PCA for dimensionality reduction MATLAB

I have a feature matrix of size [4096 x 180], where 180 is the number of samples and 4096 is the feature-vector length of each sample.
I want to reduce the dimensionality of the data using PCA.
I tried using the built-in pca function of MATLAB, [V U] = pca(X), and reconstructed the data by X_rec = U(:,1:n)*V(:,1:n)', n being the dimension I chose. This returns a matrix of 4096 x 180.
Now I have 3 questions:
How to obtain the reduced dimension?
When I set n to 200, it gave an error about matrix dimensions, which led me to assume that the reduced dimension cannot exceed the number of samples. Is this true?
How to find the right number of reduced dimensions?
I have to use the reduced dimension feature set for further classification.
If anyone can provide a detailed step-by-step explanation of the pca code for this, I would be grateful. I have looked in many places but my confusion still persists.
You may want to refer to the Matlab example that analyses city data.
Here is some simplified code:
load cities;
[~, pca_scores, ~, ~, var_explained] = pca(ratings);
Here, pca_scores contains the principal component scores (your data projected onto the principal components), and var_explained gives the percentage of variance explained by each component. You do not need to do any explicit multiplication after running pca; Matlab gives you the scores directly.
In your case, consider that the data X is a 4096-by-180 matrix, i.e. you have 4096 samples and 180 features. Your goal is to reduce the dimensionality so that you have p features, where p < 180. In Matlab, you can simply run the following:
p = 100;
[~, pca_scores, ~, ~, var_explained] = pca(X, 'NumComponents', p);
pca_scores will be a 4096-by-p matrix and var_explained will be a vector of length p.
To answer your questions:
How to obtain the reduced dimension? In the above example, pca_scores is your reduced-dimension data.
When I set n to 200, it gave an error about matrix dimensions... Is this true? You can't use 200, since the reduced dimension has to be less than the number of samples, which is 180 here.
How to find the right number of reduced dimensions? You can make this decision by checking the var_explained vector. Typically you want to retain about 99% of the variance of the features.
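As a concrete sketch of that last point (the 99% threshold is just the rule of thumb mentioned above):
[~, scores, ~, ~, var_explained] = pca(X);   % full decomposition first
p = find(cumsum(var_explained) >= 99, 1);    % smallest p explaining ~99% of the variance
reduced_features = scores(:, 1:p);           % the reduced-dimension data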

In Matlab, how to quickly generate a sparse random matrix, and quickly multiply it with a dense vector?

I am using the function X = randsrc(250,600,[[-1,0,1];[0.5/ps,1-1/ps,0.5/ps]]) with ps = 2373. This generates a 250-by-600 matrix whose entries are only -1, 0 or 1, chosen randomly according to the probability distribution [0.5/ps, 1-1/ps, 0.5/ps], so the density is about 0.00042.
The above X is called a sparse random projection matrix; see https://web.stanford.edu/~hastie/Papers/Ping/KDD06_rp.pdf. It can be used to compress a data vector from dimension 600 down to 250 with some nice geometric properties guaranteed.
The problem is that, in Matlab, randsrc seems to be very slow (e.g., compared with randn(250,600)). How can I generate the above matrix quickly?
Also, how can I quickly compute X*y, where y may be a dense vector?
My code is:
ps=2373;
tic;
X = randsrc(250,600,[[-1,0,1];[0.5/ps,1-1/ps,0.5/ps]]);
toc
a = randn(600,1);
tic;
X*a;
toc
Also, I have tried the equivalent Python function, http://scikit-learn.org/stable/modules/generated/sklearn.random_projection.SparseRandomProjection.html, and it is about twice as fast as Matlab.
You can use sprand to generate the sparsity structure, then find to extract the rows and columns of the non-zero elements. Finally, randsample selects the values -1 and 1 with 50% probability each:
ps=2373;
tic
[i,j,~] = find(sprand(250,600,1/ps));                       % sparsity pattern with density 1/ps
X = sparse(i,j,randsample([-1,1],length(i),true),250,600);  % random +/-1 values; explicit size keeps trailing empty rows/columns
toc
MATLAB is very fast at multiplying sparse matrices, so once X is stored as a sparse matrix, X*a is very fast.
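Note that randsample ships with the Statistics Toolbox; if you don't have it, the same idea works in base MATLAB (a sketch):
[i,j,~] = find(sprand(250,600,1/ps));    % sparsity pattern with density 1/ps
vals = 2*(rand(numel(i),1) > 0.5) - 1;   % +/-1 with equal probability, no toolbox needed
X = sparse(i,j,vals,250,600);            % assemble the sparse projection matrix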

the coeff of pca in matlab is not a p*p matrix

My data matrix is X which is 4999*37152. Then I use this command in Matlab:
[coeff, score, latent, tsquared1, explained1] = pca(X);
The output: coeff is 37152*4998, score is 4999*4998, latent is 4998*1. According to http://www.mathworks.com/help/stats/pca.html, coeff should be p*p. So what is wrong with my code?
As the Matlab documentation says, "Rows of X correspond to observations and columns correspond to variables". So you are feeding in a matrix with only 4999 observations of 37152 variables. Geometrically, you have 4999 points in a 37152-dimensional space. These points are contained in a 4998-dimensional affine subspace, so Matlab gets you 4998 directions there (each expressed as a vector with 37152 components).
For more, see the Statistics site:
Why are there only n-1 principal components for n data points if the number of dimensions is larger than n?
PCA when the dimensionality is greater than the number of samples
The MATLAB documentation is written under assumption that you have at least as many observations as variables, which is how people normally use PCA.
Of course, it's possible that your data actually has 37152 observations for 4999 variables, in which case you need to transpose X.
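You can see this behaviour on a small scale (a sketch with random data standing in for X):
X = randn(50, 300);               % 50 observations of 300 variables
[coeff, score, latent] = pca(X);
size(coeff)                       % 300-by-49: only n-1 = 49 components survive centering
size(score)                       % 50-by-49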

Fast Computation of Eigenvectors of a Sparse Matrix

I am working on a project that involves the computation of the eigenvectors of a very large sparse matrix.
To be more specific, I have a matrix that is the Laplacian of a big graph and I am interested in finding the eigenvector associated with the second smallest eigenvalue.
Of course Matlab takes ages to compute the eigenvectors, not least because it computes all of them.
Any suggestions?
Thank you very much
Andrea
Have you tried this usage of eigs:
[v,c]=eigs(A,2,'sm');
for example:
A = delsq(numgrid('C',256));
[v,c]=eigs(A,2,'sm');
generates a ~50K x 50K sparse matrix and finds its 2 smallest eigenvalues and eigenvectors in about 1 second on my old laptop...
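On newer MATLAB releases (R2017b and later) the 'sm' flag is superseded by sigma keywords, and eigs does not guarantee the order of the returned eigenvalues, so it is safer to sort them before picking the second one. A sketch, assuming A is your sparse Laplacian:
[v, c] = eigs(A, 2, 'smallestreal');  % 'smallestreal'/'smallestabs' replace the old 'sa'/'sm'
[~, idx] = sort(diag(c));             % ensure ascending order of eigenvalues
fiedler = v(:, idx(2));               % eigenvector of the second smallest eigenvalue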

Correlation matrix to set of vectors

I am trying to calculate the correlation matrix of a set of histogram vectors, but the result is a truncated version of what I (think) I want. I have 200 histograms with 32 bins each. The result from
correlation_matrix = corrcoef(set_of_histograms)
is a 32 by 32 matrix.
I want to use this to calculate how well my original histograms match up (later using eigs and other tools on it).
But which correlation method is right for this? I have tried corrcoef, but there are corr and cov as well. I can't work out their differences from the Matlab help...
correlation_matrix = corrcoef(set_of_histograms')
(Note the ')
1) corrcoef treats every column as a variable and calculates the correlation between each pair of columns. I'm assuming your histograms matrix is 200x32, i.e. every histogram is a row. If you transpose the matrix before running corrcoef, every histogram becomes a column, and you get the 200x200 result you're looking for:
[rho, p] = corrcoef( set_of_histograms' );
(' transposes the matrix)
2) cov returns the covariance matrix, not the correlation; while the covariance matrix is used in calculating the correlation, it is not the measure you're looking for.
3) As for corr and corrcoef, there are a few implementation differences between them. As long as you are only interested in Pearson's correlation, they are identical for your purposes. corr can also calculate Spearman's or Kendall's correlations, which corrcoef cannot.
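As a quick sanity check of the shapes (random data standing in for your histograms):
set_of_histograms = rand(200, 32);   % 200 histograms, 32 bins each
R = corrcoef(set_of_histograms');    % transpose so each histogram is a column (a variable)
size(R)                              % 200-by-200: one correlation per pair of histograms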