Correlation matrix to set of vectors - matlab

I am trying to calculate the correlation matrix of a set of histogram vectors, but the result is a truncated version of what I (think) I want. I have 200 histograms with 32 bins each. The result from
correlation_matrix = corrcoef(set_of_histograms)
is a 32 by 32 matrix.
I want to use this to calculate how my original histograms match up (later using eigs and other stuff).
But which correlation function is right for this? I have tried corrcoef, but there are corr and cov as well. I can't understand their differences from reading the MATLAB help.

correlation_matrix = corrcoef(set_of_histograms')
(Note the ')

1) corrcoef treats every column as a variable and every row as an observation, and calculates the correlation between each pair of columns. I'm assuming your histograms matrix is 200x32, so it correlates the 32 bins with each other, which is why you get a 32x32 result. If you transpose your histograms matrix before running corrcoef, each histogram becomes a column and you should get the 200x200 result you're looking for:
[rho, p] = corrcoef( set_of_histograms' );
(' transposes the matrix)
2) cov returns the covariance matrix, not the correlation; while the covariance matrix is used in calculating the correlation, it is not the measure you're looking for.
3) As for corr and corrcoef, they have a few implementation differences between them. As long as you are only interested in Pearson's correlation, they are identical for your purposes. corr also has an option to calculate Spearman's or Kendall's correlations, which corrcoef does not have.
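The three points above can be checked with a small sketch. The random matrix below is only a hypothetical stand-in for the 200x32 histogram data, and the variable names are made up for illustration. It shows the output size with and without the transpose, and how the correlation matrix can be recovered from cov by dividing by the standard deviations:

```matlab
% Hypothetical stand-in for the 200-by-32 histogram matrix from the question.
rng(0);
set_of_histograms = rand(200, 32);

% Columns are variables: this correlates the 32 bins with each other.
R_bins = corrcoef(set_of_histograms);      % 32-by-32

% After transposing, each histogram is a column, so histograms are compared.
R_hist = corrcoef(set_of_histograms');     % 200-by-200

% cov gives the unnormalised version; dividing each entry by the product
% of the two standard deviations gives the correlation back.
C = cov(set_of_histograms');
d = sqrt(diag(C));
R_from_cov = C ./ (d * d');

disp(size(R_bins));
disp(size(R_hist));
disp(max(abs(R_hist(:) - R_from_cov(:))));  % should be close to zero
```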

Related

How to measure the pairwise cosine for a data matrix in MATLAB

Assume there is a data matrix (MATLAB)
X = [0.8147, 0.9134, 0.2785, 0.9649, 0.9572;
0.9058, 0.6324, 0.5469, 0.1576, 0.4854;
0.1270, 0.0975, 0.9575, 0.9706, 0.8003]
Each column represents a feature vector for a sample.
What is the fastest way to get the pairwise cosine similarity of the columns of X in MATLAB? That is, we want to compute a symmetric 5x5 matrix S, where the element S(3,4) is the cosine between the third column and the fourth column.
Note: The cosine measurement cos(a,b) means the cosine of the angle between vectors a and b.
If you have the Statistics Toolbox, use pdist with the 'cosine' option, followed by squareform. Note that:
pdist considers rows, not columns, as observations. So you need to transpose the input.
The output is 1 minus the cosine similarity. So you need to subtract the result from 1.
To get the result in the form of a symmetric matrix apply squareform.
So, you can use
S = 1 - squareform(pdist(X.', 'cosine'));
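If the Statistics Toolbox is not available, the same matrix can be sketched with base functions only: scale every column to unit length and take all pairwise dot products. This is an alternative to the pdist approach rather than part of the original answer, and it assumes no column of X is all zeros (col_norms and Xn are illustrative names):

```matlab
X = [0.8147, 0.9134, 0.2785, 0.9649, 0.9572;
     0.9058, 0.6324, 0.5469, 0.1576, 0.4854;
     0.1270, 0.0975, 0.9575, 0.9706, 0.8003];

% Euclidean norm of each column (1-by-5 row vector).
col_norms = sqrt(sum(X.^2, 1));

% Divide every column by its norm so that all columns have unit length.
Xn = bsxfun(@rdivide, X, col_norms);

% Dot products of unit vectors are the cosines of the angles between them.
S = Xn' * Xn;    % 5-by-5, symmetric, ones on the diagonal

disp(S(3, 4));   % cosine between the third and fourth columns
```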

PCA for dimensionality reduction MATLAB

I have a feature vector of size [4096 x 180], where 180 is the number of samples and 4096 is the feature vector length of each sample.
I want to reduce the dimensionality of the data using PCA.
I tried using the built in pca function of MATLAB [V U]=pca(X) and reconstructed the data by X_rec= U(:, 1:n)*V(:, 1:n)', n being the dimension I chose. This returns a matrix of 4096 x 180.
Now I have 3 questions:
How to obtain the reduced dimension?
When I set n to 200, it gave a matrix dimension error, which led me to assume that the reduced dimension cannot be larger than the number of samples. Is this true?
How to find the right number of reduced dimensions?
I have to use the reduced dimension feature set for further classification.
If anyone can provide a detailed step by step explanation of the pca code for this I would be grateful. I have looked at many places but my confusion still persists.
You may want to refer to the MATLAB example that analyses the cities data.
Here is some simplified code:
load cities;
[~, pca_scores, ~, ~, var_explained] = pca(ratings);
Here, pca_scores are the principal component scores (the data projected onto the principal components), and var_explained holds the percentage of the total variance explained by each component. You do not need to do any explicit multiplication after running pca; MATLAB gives you the scores directly.
In your case, note that pca treats rows as observations and columns as variables. Your X is 4096-by-180 with one sample per column, so pass X' (180-by-4096) to pca. Your goal is to reduce the dimensionality so that each sample has p features, where p < 180. In MATLAB, you can simply run the following,
p = 100;
[~, pca_scores, ~, ~, var_explained] = pca(X', 'NumComponents', p);
pca_scores will be a 180-by-p matrix. NumComponents only truncates the coefficients and scores, so var_explained still lists the percentage of variance for every component, not just the first p.
To answer your questions:
How to obtain the reduced dimension? In the above example, pca_scores is your reduced-dimension data.
When I put n as 200, it gave an error as matrix dimension increased, which gave me the assumption that we cannot reduce dimension lesser than the sample size. Is this true? You can't use 200, since the number of reduced dimensions has to be less than 180.
How to find the right number of reduced dimensions? You can make this decision by checking the var_explained vector. A common choice is to keep enough components to retain a high percentage of the variance of the features, for example about 99%.
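One way to turn the variance check into code is to accumulate the explained percentages and take the first component count that reaches a chosen threshold. The following is only a sketch: the random data, its size (500 features by 60 samples, smaller than in the question so it runs quickly) and the 95% target are assumptions for illustration, and pca requires the Statistics Toolbox.

```matlab
rng(0);
X = randn(500, 60);          % hypothetical data: 500 features x 60 samples

% pca expects one observation per row, so transpose to 60-by-500.
[coeff, score, ~, ~, explained] = pca(X');

% Cumulative percentage of variance explained by the first 1, 2, ... components.
cum_explained = cumsum(explained);

target = 95;                 % assumed threshold, in percent
p = find(cum_explained >= target, 1, 'first');

% Reduced data: one row per sample, p columns.
reduced = score(:, 1:p);

fprintf('%d components retain %.1f%% of the variance\n', p, cum_explained(p));
```

The rows of reduced can then be passed to a classifier in place of the original feature vectors.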

compute SVD using Matlab function

I have a doubt about SVD. In the literature I have read, it is written that we have to convert the input matrix into a covariance matrix first, and then apply MATLAB's svd function.
But on the MathWorks website, svd is applied directly to the input matrix (no need to convert it into a covariance matrix):
[U,S,V]=svd(inImageD);
Which one is correct?
And if we want to do dimensionality reduction, we have to project our data onto the eigenvectors. But where are the eigenvectors in the output of svd?
I know that S holds the eigenvalues, but what are U and V?
To reduce the dimensionality of our data, do we need to subtract the mean from the input matrix and then multiply it by the eigenvectors? Or can we just multiply the input matrix by the eigenvectors (without subtracting the mean first)?
EDIT
Suppose I want to do classification using SIFT as the features and an SVM as the classifier.
I have 10 images for training and I arrange them in different rows.
So the 1st row is for the 1st image, the 2nd row for the 2nd image, and so on.
Feat=[1 2 5 6 7 >> Images1
2 9 0 6 5 >> Images2
3 4 7 8 2 >> Images3
2 3 6 3 1 >> Images4
..
.
so on. . ]
To do dimensionality reduction (on my 10x5 matrix), we have to do A*EigenVector.
And from what you explained (@Sam Roberts), I can compute it by using the eigs function on the covariance matrix (instead of using the svd function).
And as I arrange the features of the images in different rows, I need to do A'*A.
So it becomes:
Matrix=A'*A
MAT_Cov=Cov(Matrix)
[EigVector EigValue] = eigs (MAT_Cov);
is that right??
Eigenvector decomposition (EVD) and singular value decomposition (SVD) are closely related.
Let's say you have some data a = rand(3,4);. Note that this is not a square matrix - it represents a dataset of observations (rows) and variables (columns).
Do the following:
[u1,s1,v1] = svd(a);
[u2,s2,v2] = svd(a');
[e1,d1] = eig(a*a');
[e2,d2] = eig(a'*a);
Now note a few things.
Up to the sign (+/-), which is arbitrary, u1 is the same as v2. Up to a sign and an ordering of the columns, they are also equal to e1. (Note that there may be some very tiny numerical differences as well, due to slight differences in the svd and eig algorithms.)
Similarly, u2 is the same as v1 and e2.
s1 equals s2' (s1 is 3-by-4 and s2 is 4-by-3), and apart from ordering and some extra columns and rows of zeros, both also equal sqrt(d1) and sqrt(d2). Again, there may be some very tiny numerical differences due to algorithmic issues (on the order of 1e-10 or so).
Note also that a*a' is basically the covariances of the rows, and a'*a is basically the covariances of the columns (that's not quite true - a would need to be centred first by subtracting the column or row mean for them to be equal, and there might be a multiplicative constant difference as well, but it's basically pretty similar).
Now to answer your questions, I assume that what you're really trying to do is PCA. You can do PCA either by taking the original data matrix and applying SVD, or by taking its covariance matrix and applying EVD. Note that Statistics Toolbox has two functions for PCA - pca (in older versions princomp) and pcacov.
Both do essentially the same thing, but from different starting points, because of the above equivalences between SVD and EVD.
Strictly speaking, u1, v1, u2 and v2 above are not eigenvectors, they are singular vectors - and s1 and s2 are singular values. They are singular vectors/values of the matrix a. e1 and d1 are the eigenvectors and eigenvalues of a*a' (not a), and e2 and d2 are the eigenvectors and eigenvalues of a'*a (not a). a does not have any eigenvectors - only square matrices have eigenvectors.
Centring by subtracting the mean is a separate issue - you would typically do that prior to PCA, but there are situations where you wouldn't want to. You might also want to normalise by dividing by the standard deviation but again, you wouldn't always want to - it depends what the data represents and what question you're trying to answer.
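To make the equivalence concrete for a 10x5 feature matrix like the one in the question, here is a minimal sketch that reduces the data to k dimensions in both ways: SVD of the centred data, and EVD of the covariance matrix. The random matrix A and the choice k = 2 are assumptions for illustration. Note that cov(A) already centres the columns, so it is applied to A directly rather than to A'*A.

```matlab
rng(1);
A = rand(10, 5);                       % hypothetical: 10 images x 5 features
k = 2;                                 % assumed target dimension

% Centre each column (feature) by subtracting its mean.
Ac = bsxfun(@minus, A, mean(A, 1));

% Route 1: SVD of the centred data. Columns of V are the principal directions.
[U, S, V] = svd(Ac, 'econ');
reduced_svd = Ac * V(:, 1:k);          % 10-by-k

% Route 2: EVD of the covariance matrix of the features.
[E, D] = eig(cov(A));
[evals, order] = sort(diag(D), 'descend');   % eig does not sort descending
E = E(:, order);
reduced_evd = Ac * E(:, 1:k);          % 10-by-k

% The two projections agree up to the sign of each column.
disp(max(max(abs(abs(reduced_svd) - abs(reduced_evd)))));

% Squared singular values, scaled by n-1, are the covariance eigenvalues.
n = size(A, 1);
disp(max(abs(diag(S).^2 / (n - 1) - evals)));
```

Both displayed differences should be at the level of floating-point round-off.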

Multivariate Random Number Generation in Matlab

I'm probably being a little dense but I'm not very mathsy and can't seem to understand the covariance element of creating multivariate data.
I'm after two columns of random data (representing two correlated variables).
I think I am right in needing to use the mvnrnd function and I understand that 'mu' must be a column of my mean vectors. As I need 4 distinct classes within my data these are going to be (1, 1) (-1 1) (1 -1) and (-1 -1). I assume I will have to do the function 4x with a different column of mean vectors each time and then combine them to get my full data set.
I don't understand what I should put for SIGMA - Matlab help tells me that it must be 'a d-by-d symmetric positive semi-definite matrix, or a d-by-d-by-n array' i.e. a covariance matrix. I don't understand how I create a covariance matrix for numbers that I am yet to generate.
Any advice would be greatly appreciated!
Assuming that I understood your case properly, I would go this way:
data = [normrnd(0,1,5000,1),normrnd(0,1,5000,1)]; %% your starting data series
MU = mean(data,1);
SIGMA = cov(data);
Now, it should be possible to feed mvnrnd with MU and SIGMA:
r = mvnrnd(MU,SIGMA,5000);
plot(r(:,1),r(:,2),'+') %% in case you wanna plot the results
I hope this helps.
I think your aim is to generate simulated multivariate Gaussian data. For example, I use
k = 6; % feature dimension
mu = rand(1,k);
sigma = 10*eye(k,k);
Ten times the identity matrix is a symmetric positive definite matrix, so it is a valid covariance. With this sigma the variables are uncorrelated and have equal variance, so the Gaussian is rounder (spherical) than with other choices of sigma.
Then you can pass mu and sigma to mvnrnd as in the example above and look at the plot.
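For the four-class case described in the question, a sketch along these lines may help. SIGMA is chosen up front rather than estimated from data: the diagonal holds the variance of each variable and the off-diagonal entry is their covariance. The particular values (variance 0.2, covariance 0.1, 500 points per class) are assumptions picked only for illustration, and mvnrnd requires the Statistics Toolbox.

```matlab
rng(2);
mus = [ 1  1;
       -1  1;
        1 -1;
       -1 -1];                 % one mean vector per row, one row per class

% Assumed covariance: variance 0.2 per variable, covariance 0.1 between them
% (i.e. a correlation of 0.5). Symmetric and positive definite.
SIGMA = [0.2 0.1;
         0.1 0.2];

n = 500;                       % assumed number of points per class
data = zeros(4 * n, 2);
labels = zeros(4 * n, 1);

for c = 1:4
    idx = (c - 1) * n + (1:n);
    data(idx, :) = mvnrnd(mus(c, :), SIGMA, n);   % n-by-2 draws for class c
    labels(idx) = c;
end
```

Something like gscatter(data(:,1), data(:,2), labels) can then be used to look at the four clusters.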

svds not working for some matrices - wrong result

Here is my testing function:
function diff = svdtester()
y = rand(500,20);
[U,S,V] = svd(y);
%{
y = sprand(500,20,.1);
[U,S,V] = svds(y);
%}
diff_mat = y - U*S*V';
diff = mean(abs(diff_mat(:)));
end
There are two very similar parts: one finds the SVD of a random matrix, the other finds the SVD of a random sparse matrix. Regardless of which one you choose to comment (right now the second one is commented-out), we compute the difference between the original matrix and the product of its SVD components and return that average absolute difference.
When using rand/svd, the typical return value (mean error) is around 8.8e-16, basically zero. When using sprand/svds, the typical return value is around 0.07, which is fairly terrible considering the sparse matrix is 90% zeros to start with.
Am I misunderstanding how SVD should work for sparse matrices, or is something wrong with these functions?
Yes, the behavior of svds is a little bit different from svd. According to MATLAB's documentation:
[U,S,V] = svds(A,...) returns three output arguments, and if A is m-by-n:
U is m-by-k with orthonormal columns
S is k-by-k diagonal
V is n-by-k with orthonormal columns
U*S*V' is the closest rank k approximation to A
In fact, k defaults to 6 when you do not specify it, so you get a rather crude approximation. To get a more exact approximation, specify k to be min(size(y)):
[U, S, V] = svds(y, min(size(y)))
and you will get an error of the same order of magnitude as with svd.
P.S. Also, MATLAB's documentation says:
Note svds is best used to find a few singular values of a large, sparse matrix. To find all the singular values of such a matrix, svd(full(A)) will usually perform better than svds(A,min(size(A))).
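The effect of k can be seen by comparing the three options side by side. This is a sketch reusing the 500x20 sparse matrix from the question; the names err_default, err_all and err_full are made up for illustration.

```matlab
rng(3);
y = sprand(500, 20, 0.1);

% Default call: only the 6 largest singular values, so a rank-6 approximation.
[U6, S6, V6] = svds(y);
err_default = mean(mean(abs(y - U6 * S6 * V6')));

% Ask for all min(size(y)) = 20 singular values.
[Uk, Sk, Vk] = svds(y, min(size(y)));
err_all = mean(mean(abs(y - Uk * Sk * Vk')));

% The route the documentation suggests when all singular values are needed.
[Uf, Sf, Vf] = svd(full(y), 'econ');
err_full = mean(mean(abs(full(y) - Uf * Sf * Vf')));

fprintf('default k: %g, k = 20: %g, svd(full): %g\n', ...
        err_default, err_all, err_full);
```

Only the default call should show a large reconstruction error; the other two should be close to machine precision.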