Matlab - Stratified Sampling of Multidimensional Data

I want to divide a corpus into training & testing sets in a stratified fashion.
The observation data points are arranged in a Matrix A as
A=[16,3,0;12,6,4;19,2,1;.........;17,0,2;13,3,2]
Each column of the matrix represents a distinct feature.
In Matlab, the cvpartition(A,'holdout',p) function requires A to be a vector. How can I perform the same action with A as a matrix, i.e. so that the resulting sets have roughly the same distribution of each feature as in the original corpus?

By using a matrix A rather than grouped data, you are making the assumption that a random partition of your data will return a test and train set with the same column distributions.
In general, the assumption you are making in your question is that there is a partition of A such that each of the marginal distributions of A (one per column) is preserved in both parts, for all three variables at once. There is no guarantee that this is true. Check whether the columns of your matrix are correlated. If they are not, simply partition on column 1 and use the row indices to define a test matrix:
cv = cvpartition(A(:, 1), 'holdout', p);
test_mat = A(cv.test, :);
If they are correlated, you may need to go back and reconsider what you are trying to do.
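If the features take only a few discrete values, one possible way to stratify on all columns at once is to treat each distinct row as its own stratum. The following is a minimal sketch under that assumption and is not part of the original answer: the data in A and the holdout fraction p are made up, and cvpartition needs the Statistics and Machine Learning Toolbox. Strata with very few rows cannot be split proportionally, so this only makes sense when rows repeat often.

```matlab
rng(0);                                  % reproducible example
A = randi([0 2], 300, 3);                % hypothetical data: 3 discrete features
p = 0.25;                                % hypothetical holdout fraction

% Give every distinct row (combination of feature values) its own group label
[~, ~, g] = unique(A, 'rows');

% cvpartition stratifies the holdout split by the group labels
cv = cvpartition(g, 'HoldOut', p);
train_mat = A(training(cv), :);
test_mat  = A(test(cv), :);

% Compare the per-column means of the two sets as a rough sanity check
disp([mean(train_mat, 1); mean(test_mat, 1)]);
```

Comparing per-column means (or histograms) of train_mat and test_mat is a quick way to see whether the split preserved the marginal distributions well enough for your purpose.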

Related

The pooled covariance matrix of TRAINING must be positive definite. (lda classifier)

I have a problem with classification (LDA classifier ).
I have 80 samples of training data (80x100) and 15 samples of testing data (15x100). classify function returns: The covariance matrix of each group in TRAINING must be positive definite.
Without knowing what your data looks like, all I can do is suggest a few solutions that may solve your problem. A non positive definite covariance matrix can be produced by many different factors:
linear dependence between two or more columns (you can get rid of as many columns that produce linear dependence as possible)
non-stationary data (in this case, you can use differences instead of levels because they grant stationarity)
columns with highly mismatching magnitude, for example a column with very big values and another one with very small values (rescale your columns so that all of them have approximately the same magnitude).
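One more cause is worth checking for the sizes quoted in the question: a covariance matrix estimated from 80 rows of 100 columns has rank at most 79, so it cannot be positive definite whatever the data looks like. The sketch below illustrates this with random stand-in data (not the asker's data) and shows one possible workaround, projecting onto a few leading principal directions before classifying. The choice k = 20 is arbitrary and only for illustration.

```matlab
rng(0);
X = randn(80, 100);                      % hypothetical stand-in for the training data

% Covariance of 100 features estimated from 80 samples is rank deficient
S = cov(X);
r = rank(S);                             % cannot exceed 79

% Possible workaround: reduce to k < r dimensions with PCA via the SVD
Xc = bsxfun(@minus, X, mean(X, 1));      % mean-subtracted data
[~, ~, V] = svd(Xc, 'econ');
k = 20;                                  % arbitrary illustrative choice
Z = Xc * V(:, 1:k);                      % reduced training data, 80 x k

rz = rank(cov(Z));                       % full rank in the reduced space
fprintf('rank before: %d of %d, after: %d of %d\n', r, size(X, 2), rz, k);
```

Any test data would have to be centred with the training means and projected with the same V(:, 1:k) before being passed to the classifier.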

Randomly rearranging data points when creating cross-validation indices?

I have a dataset where the columns correspond to features (predictors) and the rows correspond to data points. The data points are extracted in a structured way, i.e. they are sorted. I will use either crossvalind or cvpartition from Matlab for stratified cross-validation.
If I use the above function, do I still have to first randomly rearrange the data points (rows)?
These functions shuffle your data internally, as you can see in the docs:
Indices = crossvalind('Kfold', N, K) returns randomly generated indices for a K-fold cross-validation of N observations. Indices contains equal (or approximately equal) proportions of the integers 1 through K that define a partition of the N observations into K disjoint subsets. Repeated calls return different randomly generated partitions. K defaults to 5 when omitted. In K-fold cross-validation, K-1 folds are used for training and the last fold is used for evaluation. This process is repeated K times, leaving one different fold for evaluation each time.
However, if your data is structured in the sense that the ith object carries information about object i+1, then you should consider a different kind of split. For example, if your data is actually (locally) a time series, typical random CV is not a valid estimation technique. Why? If your data contains clusters in which knowing the value of at least one element lets you estimate the remaining ones with high probability, then what you obtain after applying CV is an estimate of exactly that ability: predicting inside these clusters. If, in real-life use of your model, you expect to see completely new clusters, the model you selected may perform no better than chance there. In other words, if your data has some kind of internal cluster structure (or is a time series), your splits should respect it by splitting over clusters: instead of K splits of random points, you make K splits of random clusters, and so on.
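Splitting over clusters rather than over points can be sketched with base MATLAB alone. The cluster labels in clusterId and the number of folds K are hypothetical; in practice the labels would come from whatever defines the structure in your data (subject, recording session, time block, and so on).

```matlab
rng(1);
clusterId = repelem((1:12)', 5);         % hypothetical: 12 clusters of 5 points each
K = 4;                                   % hypothetical number of folds

% Shuffle the cluster labels, then deal them out to the K folds in turn
ids  = unique(clusterId);
perm = ids(randperm(numel(ids)));
foldOfCluster = zeros(max(ids), 1);
foldOfCluster(perm) = mod(0:numel(ids) - 1, K)' + 1;

% Every point inherits the fold of its cluster
foldOfPoint = foldOfCluster(clusterId);

for f = 1:K
    testIdx  = (foldOfPoint == f);       % whole clusters held out together
    trainIdx = ~testIdx;
    fprintf('fold %d: %d test points\n', f, nnz(testIdx));
end
```

Because whole clusters are held out together, the score for each fold reflects performance on clusters the model has never seen.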

What does selecting the largest eigenvalues and eigenvectors in the covariance matrix mean in data analysis?

Suppose there is a matrix B of size 500*1000 double (here, 500 is the number of observations and 1000 is the number of features).
sigma is the covariance matrix of B, and D is a diagonal matrix whose diagonal elements are the eigenvalues of sigma. Assume A is the eigenvectors of the covariance matrix sigma.
I have the following questions:
I need to select the first k = 800 eigenvectors corresponding to the eigenvalues with the largest magnitude to rank the selected features. The final matrix is named Aq. How can I do this in MATLAB?
What is the meaning of these selected eigenvectors?
It seems the size of the final matrix Aq is 1000*800 double once I calculate Aq. The time points/observation information of 500 has disappeared. For the final matrix Aq, what does the value 1000 in matrix Aq represent now? Also, what does the value 800 in matrix Aq represent now?
I'm assuming you determined the eigenvectors from the eig function. What I would recommend to you in the future is to use the eigs function. This not only computes the eigenvalues and eigenvectors for you, but it will compute the k largest eigenvalues with their associated eigenvectors for you. This may save computational overhead where you don't have to compute all of the eigenvalues and associated eigenvectors of your matrix as you only want a subset. You simply supply the covariance matrix of your data to eigs and it returns the k largest eigenvalues and eigenvectors for you.
Now, back to your problem, what you are describing is ultimately Principal Component Analysis. The mechanics behind this would be to compute the covariance matrix of your data and find the eigenvalues and eigenvectors of the computed result. Doing it this way is generally not recommended because forming the covariance matrix explicitly and then computing its eigenvalues and eigenvectors can lose numerical accuracy for large matrices. The most canonical way to do this now is via the Singular Value Decomposition of the mean-subtracted data. Concretely, the columns of the V matrix give you the eigenvectors of the covariance matrix, or the principal components, and the associated eigenvalues are the squares of the singular values on the diagonal of the matrix S, divided by n - 1, where n is the number of observations.
See this informative post on Cross Validated as to why this is preferred:
https://stats.stackexchange.com/questions/79043/why-pca-of-data-by-means-of-svd-of-the-data
I'll throw in another link as well that talks about the theory behind why the Singular Value Decomposition is used in Principal Component Analysis:
https://stats.stackexchange.com/questions/134282/relationship-between-svd-and-pca-how-to-use-svd-to-perform-pca
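The link between the SVD of the mean-subtracted data and the eigenvalues of the covariance matrix can be checked numerically. This is a small sketch on made-up data: the squared singular values divided by n - 1 should match the eigenvalues of cov(B) up to round-off.

```matlab
rng(2);
B = randn(50, 6);                        % hypothetical data: 50 observations, 6 features
n = size(B, 1);

% SVD route: mean-subtract first, then decompose
Bm = bsxfun(@minus, B, mean(B, 1));
[~, S, V] = svd(Bm, 'econ');
evalsSvd = diag(S).^2 / (n - 1);         % svd returns singular values in descending order

% Covariance route, sorted to the same order
evalsEig = sort(eig(cov(B)), 'descend');

maxDiff = max(abs(evalsSvd - evalsEig));
fprintf('largest difference between the two routes: %g\n', maxDiff);
```

The columns of V are the principal directions, already ordered from largest to smallest variance, so no separate sorting step is needed on this route.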
Now let's answer your question one at a time.
Question #1
The eig function does not guarantee that the eigenvalues, and the corresponding columns of the eigenvector matrix, come back in any particular order. If you wish to select out the largest k eigenvalues and associated eigenvectors given the output of eig (800 in your example), you'll need to sort the eigenvalues in descending order, rearrange the columns of the eigenvector matrix produced by eig in the same way, then select out the first k columns.
I should also note that using eigs will not guarantee sorted order, so you will have to explicitly sort these too when it comes down to it.
In MATLAB, doing what we described above would look something like this:
sigma = cov(B);
[A,D] = eig(sigma);
vals = diag(D);
[~,ind] = sort(abs(vals), 'descend');
Asort = A(:,ind);
Note that the sort is done on the absolute value of the eigenvalues. For a covariance matrix this is only a safeguard: the matrix is symmetric positive semidefinite, so its eigenvalues are non-negative apart from tiny negative values caused by round-off, and taking abs does not change the ordering in any meaningful way. For a general symmetric matrix, sorting by magnitude matters more: an eigenvalue of, say, -10000 indicates a component with significant weight, and sorting purely on the signed values would place it near the lowest ranks.
The first line of code finds the covariance matrix of B, even though you said it's already stored in sigma, but let's make this reproducible. Next, we find the eigenvalues of your covariance matrix and the associated eigenvectors. Take note that each column of the eigenvector matrix A represents one eigenvector. Specifically, the ith column / eigenvector of A corresponds to the ith eigenvalue seen in D.
However, the eigenvalues are in a diagonal matrix, so we extract out the diagonals with the diag command, sort them and figure out their ordering, then rearrange A to respect this ordering. I use the second output of sort because it tells you the position of where each value in the unsorted result would appear in the sorted result. This is the ordering we need to rearrange the columns of the eigenvector matrix A. It's imperative that you choose 'descend' as the flag so that the largest eigenvalue and associated eigenvector appear first, just like we talked about before.
You can then pluck out the first k largest vectors and values via:
k = 800;
Aq = Asort(:,1:k);
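As an alternative to sorting the full output of eig, the eigs route mentioned earlier can be sketched as follows. The matrix sizes here are made up and much smaller than in the question, and the eigenvalues are sorted explicitly rather than relying on the order in which eigs returns them.

```matlab
rng(3);
B = randn(200, 30);                      % hypothetical data: 200 observations, 30 features
sigma = cov(B);
k = 5;                                   % number of leading components wanted

% Ask eigs for the k eigenvalues of largest magnitude and their eigenvectors
[Aq, Dq] = eigs(sigma, k);

% Sort explicitly and reorder the eigenvector columns to match
[valsK, ind] = sort(diag(Dq), 'descend');
Aq = Aq(:, ind);

% Cross-check against the full decomposition
valsAll = sort(eig(sigma), 'descend');
maxDiff = max(abs(valsK - valsAll(1:k)));
fprintf('largest eigenvalue mismatch: %g\n', maxDiff);
```

The eigenvectors themselves may differ in sign between eig and eigs, which is harmless because an eigenvector is only defined up to scale.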
Question #2
It's a well known fact that the eigenvectors of the covariance matrix are the principal components. Concretely, the first principal component (i.e. the eigenvector associated with the largest eigenvalue) gives you the direction of maximum variability in your data. Each principal component after that captures a decreasing amount of variability. It's also good to note that the principal components are orthogonal to one another.
Here's a good example from Wikipedia for two dimensional data:
I pulled the above image from the Wikipedia article on Principal Component Analysis, which I linked you to above. This is a scatter plot of samples that are distributed according to a bivariate Gaussian distribution centred at (1,3) with a standard deviation of 3 in roughly the (0.878, 0.478) direction and of 1 in the orthogonal direction. The component with a standard deviation of 3 is the first principal component while the one that is orthogonal is the second component. The vectors shown are the eigenvectors of the covariance matrix scaled by the square root of the corresponding eigenvalue, and shifted so their tails are at the mean.
Now let's get back to your question. The reason why we take a look at the k largest eigenvalues is a way of performing dimensionality reduction. Essentially, you would be performing a data compression where you would take your higher dimensional data and project them onto a lower dimensional space. The more principal components you include in your projection, the more it will resemble the original data. It actually begins to taper off at a certain point, but the first few principal components allow you to faithfully reconstruct your data for the most part.
A great visual example of performing PCA (or SVD rather) and data reconstruction is found by this great Quora post I stumbled upon in the past.
http://qr.ae/RAEU8a
Question #3
You would use this matrix to reproject your higher dimensional data onto a lower dimensional space. The number of rows being 1000 is still there, which means that there were originally 1000 features in your dataset. The 800 is what the reduced dimensionality of your data would be. Consider this matrix as a transformation from the original dimensionality of a feature (1000) down to its reduced dimensionality (800).
You would then use this matrix in conjunction with reconstructing what the original data was. Concretely, this would give you an approximation of what the original data looked like with the least amount of error. In this case, you don't need to use all of the principal components (i.e. just the k largest vectors) and you can create an approximation of your data with less information than what you had before.
How you reconstruct your data is very simple. Let's talk about the forward and reverse operations first with the full data. The forward operation is to take your original data and reproject it but instead of the lower dimensionality, we will use all of the components. You first need to have your original data but mean subtracted:
Bm = bsxfun(@minus, B, mean(B,1));
Bm will produce a matrix where each feature of every sample is mean subtracted. bsxfun allows the subtraction of two matrices of unequal dimensions provided that the dimensions can be broadcast so that they match up. Specifically, the mean of each column / feature of B is computed, and that row of means is conceptually replicated into a temporary matrix as large as B. Subtracting this replicated matrix from your original data subtracts from every data point its respective feature mean, thus centring your data so that the mean of each feature is 0.
Once you do this, the operation to project is simply:
Bproject = Bm*Asort;
The above operation is quite simple. What you are doing is expressing each sample's features as a linear combination of principal components. For example, given the first sample (the first row of the mean-subtracted data), its first feature in the projected domain is the dot product of the row vector for that sample with the first principal component, which is a column vector. Its second feature in the projected domain is the dot product of the same row with the second component. You would repeat this for all samples and all principal components. In effect, you are reprojecting the data so that it is expressed with respect to the principal components, which are orthogonal basis vectors that transform your data from one representation to another.
A better description of what I just talked about can be found here. Look at Amro's answer:
Matlab Principal Component Analysis (eigenvalues order)
Now to go backwards, you simply do the inverse operation, but a special property with the eigenvector matrix is that if you transpose this, you get the inverse. To get the original data back, you undo the operation above and add the means back to the problem:
out = bsxfun(@plus, Bproject*Asort.', mean(B, 1));
You want to get the original data back, so you're solving for Bm with respect to the previous operation that I did. However, the inverse of Asort is just the transpose here. After you perform this operation you recover the data, but it is still mean-subtracted. To get the original data back, you must add the means of each feature back into the data matrix to get the final result. That's why we're using another bsxfun call here so that you can do this for each sample's feature values.
You should be able to go back and forth from the original domain and projected domain with the above two lines of code. Now where the dimensionality reduction (or the approximation of the original data) comes into play is the reverse operation. What you need to do first is project the data onto the bases of the principal components (i.e. the forward operation), but now to go back to the original domain where we are trying to reconstruct the data with a reduced number of principal components, you simply replace Asort in the above code with Aq and also reduce the amount of features you're using in Bproject. Concretely:
out = bsxfun(@plus, Bproject(:,1:k)*Aq.', mean(B, 1));
Doing Bproject(:,1:k) selects out the k features in the projected domain of your data, corresponding to the k largest eigenvectors. Interestingly, if you just want the representation of the data with regards to a reduced dimensionality, you can just use Bproject(:,1:k) and that'll be enough. However, if you want to go forward and compute an approximation of the original data, we need to compute the reverse step. The above code is simply what we had before with the full dimensionality of your data, but we use Aq as well as selecting out the k features in Bproject. This will give you the original data that is represented by the k largest eigenvectors / eigenvalues in your matrix.
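The forward and reverse operations can be checked on a small made-up matrix before applying them to real data. This sketch uses the same variable names as above on hypothetical data: with all components the round trip should reproduce B up to round-off, and with only the first k components the reconstruction error should shrink as k grows.

```matlab
rng(4);
B = randn(40, 8) * randn(8, 8);          % hypothetical correlated data, 40 x 8
mu = mean(B, 1);
Bm = bsxfun(@minus, B, mu);

% Eigenvectors of the covariance matrix, sorted by descending eigenvalue
[A, D] = eig(cov(B));
[~, ind] = sort(abs(diag(D)), 'descend');
Asort = A(:, ind);

% Forward projection onto all components
Bproject = Bm * Asort;

% Full round trip: should give back B up to round-off
outFull = bsxfun(@plus, Bproject * Asort.', mu);
errFull = norm(outFull - B, 'fro');

% Truncated reconstructions for k = 1..8
err = zeros(1, 8);
for k = 1:8
    Aq  = Asort(:, 1:k);
    out = bsxfun(@plus, Bproject(:, 1:k) * Aq.', mu);
    err(k) = norm(out - B, 'fro');
end
disp(err);                               % error per k, should not increase
```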
If you'd like to see an awesome example, I'll mimic the Quora post that I linked to you but using another image. Consider doing this with a grayscale image where each row is a "sample" and each column is a feature. Let's take the cameraman image that's part of the image processing toolbox:
im = imread('cameraman.tif');
imshow(im); %// Using the image processing toolbox
We get this image:
This is a 256 x 256 image, which means that we have 256 data points and each point has 256 features. What I'm going to do is convert the image to double for precision in computing the covariance matrix. I'll then repeat the above code, stepping k through 3, 11, 15, 25, 45, 65, 125 and 155. Therefore, for each k, we are introducing more principal components and we should slowly start to get a reconstruction of our data.
Here's some runnable code that illustrates my point:
%%%%%%%// Pre-processing stage
clear all;
close all;
%// Read in image - make sure we cast to double
B = double(imread('cameraman.tif'));
%// Calculate covariance matrix
sigma = cov(B);
%// Find eigenvalues and eigenvectors of the covariance matrix
[A,D] = eig(sigma);
vals = diag(D);
%// Sort their eigenvalues
[~,ind] = sort(abs(vals), 'descend');
%// Rearrange eigenvectors
Asort = A(:,ind);
%// Find mean subtracted data
Bm = bsxfun(@minus, B, mean(B,1));
%// Reproject data onto principal components
Bproject = Bm*Asort;
%%%%%%%// Begin reconstruction logic
figure;
counter = 1;
for k = [3 11 15 25 45 65 125 155]
    %// Extract out highest k eigenvectors
    Aq = Asort(:,1:k);
    %// Project back onto original domain
    out = bsxfun(@plus, Bproject(:,1:k)*Aq.', mean(B, 1));
    %// Place projection onto right slot and show the image
    subplot(4, 2, counter);
    counter = counter + 1;
    imshow(out,[]);
    title(['k = ' num2str(k)]);
end
As you can see, the majority of the code is the same from what we have seen. What's different is that I loop over all values of k, project back onto the original space (i.e. computing the approximation) with the k highest eigenvectors, then show the image.
We get this nice figure:
As you can see, starting with k=3 doesn't really do us any favours... we can see some general structure, but it wouldn't hurt to add more in. As we start increasing the number of components, we start to get a clearer picture of what the original data looks like. At k=25, we actually can see what the cameraman looks like perfectly, and we don't need components 26 and beyond to see what's happening. This is what I was talking about with regards to data compression where you don't need to work on all of the principal components to get a clear picture of what's going on.
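Judging k by eye works for an image, but for general data a number is more convenient. One common heuristic, which is not part of the original answer, is to pick the smallest k whose eigenvalues account for a chosen fraction of the total variance. The data and the 90% threshold below are made up for illustration.

```matlab
rng(5);
B = randn(100, 10) * diag(10:-1:1);      % hypothetical data with unequal variances

% Eigenvalues of the covariance matrix, largest first
vals = sort(eig(cov(B)), 'descend');

% Fraction of total variance captured by the first k components
explained = cumsum(vals) / sum(vals);

target = 0.90;                           % hypothetical threshold
k = find(explained >= target, 1, 'first');
fprintf('%d components reach %.0f%% of the variance\n', k, 100 * target);
```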
I'd like to end this note by referring you to Chris Taylor's wonderful exposition on the topic of Principal Components Analysis, with code, graphs and a great explanation to boot! This is where I got started on PCA, but the Quora post is what solidified my knowledge.
Matlab - PCA analysis and reconstruction of multi dimensional data

Kullback Leibler Divergence of 2 Histograms in MatLab

I would like a function to calculate the KL distance between two histograms in MatLab. I tried this code:
http://www.mathworks.com/matlabcentral/fileexchange/13089-kldiv
However, it says that I should have two distributions P and Q of sizes n x nbins. However, I am having trouble understanding how the author of the package wants me to arrange the histograms. I thought that providing the discretized values of the random variable together with the number of bins would suffice (I would assume the algorithm would use an arbitrary support to evaluate the expectations).
Any help is appreciated.
Thanks.
The function you link to requires that the two histograms passed be aligned and thus have the same length NBIN x N (not N X NBIN), that is, if N>1 then the number of rows in the inputs should be equal to the number of bins in the histograms. If you are just going to compare two histograms (that is if N=1) it doesn't really matter, you can pass either row or column vector versions of these as long as you are consistent and the order of bins matches.
A generic call to the function looks like this:
dists = kldiv(bins,P,Q)
The implementation allows comparison of multiple histograms to each other (that is, N>1), in which case pairs of columns (with matching column index) in each array are compared and the result is a row vector with distances for each matching pair.
Array bins should be the same size as P and Q and is used to perform a very minimal check that the inputs are of the same size, but is not used in the computation. The routine expects bins to contain the numeric labels of your bins so that it can check for repeated bin labels and warn you if repeats occur, but otherwise doesn't use the information.
You could do away with bins and compute the distance with
KL = sum(P .* (log2(P)-log2(Q)));
without using the Matlab Central versions. However the version you link to performs the abovementioned minimal checks and in addition allows computation of two alternative distances (consult the documentation).
The version linked to by eigenchris checks that no histogram bins are empty (which would make the computation blow up numerically) and if there are, removes their contribution to the sum (not sure this is entirely appropriate - consult an expert on the subject). You should probably also be aware of the exact form of the formula; specifically, note the use of log2 above versus the natural logarithm in the version linked to by eigenchris.
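For two histograms over the same bins, the one-line formula can be wrapped in a few lines that also normalise the counts and handle empty bins in P. This is a sketch with made-up counts, not the File Exchange implementation. It uses the usual convention that a bin with P = 0 contributes nothing to the sum; a bin with Q = 0 where P > 0 makes the divergence infinite, and the code leaves that as Inf rather than hiding it.

```matlab
P = [10 20 30 40];                       % hypothetical bin counts, same bins for both
Q = [25 25 25 25];

% Turn counts into probabilities
P = P / sum(P);
Q = Q / sum(Q);

% Bins with P == 0 contribute 0 by convention, so skip them
mask = P > 0;
KL = sum(P(mask) .* (log2(P(mask)) - log2(Q(mask))));   % in bits, because of log2

% The divergence of a distribution from itself should be zero
KLself = sum(P(mask) .* (log2(P(mask)) - log2(P(mask))));
fprintf('KL(P||Q) = %.4f bits, KL(P||P) = %g\n', KL, KLself);
```

Swapping log2 for log gives the result in nats instead of bits. Note that KL(P||Q) and KL(Q||P) generally differ, so the order of the arguments matters.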

Is there a statistical difference between generating many random vectors vs a single random matrix

Is there a statistical difference between generating a series of paths for a Monte Carlo simulation using the following two methods (note that by path I mean a vector of 350 points, normally distributed):
A)
for path = 1:300000
    Zn(path, :) = randn(1, 350);
end
or the far more efficient B)
Zn = randn(300000, 350);
I just want to be sure there is no funny added correlation or dependence between the rows in method B that isn't present in method A. Like maybe method B distributes normally over 2 dimensions where A is over 1 dimension, so maybe that makes the two statistically different?
If there is a difference then I need to know the same for uniform distributions (i.e. rand instead of randn)
Just to add to the answer of @natan (+1), run the following code:
%# Store the seed
Rng1 = rng;
%# Get a matrix of random numbers
X = rand(3, 3);
%# Restore the seed
rng(Rng1);
%# Get a matrix of random numbers one vector at a time
Y = nan(3, 3);
for n = 1:3
    Y(:, n) = rand(3, 1);
end
%# Test for differences
if any(any(X - Y ~= 0)); disp('Error'); end;
You'll note that there is no difference between X and Y. That is, there is no difference between building a matrix in one step, and building a matrix from a sequence of vectors.
However, there is a difference between my code and yours. Note I am populating the matrix by columns, not rows, since when rand is used to construct a matrix in one step, it populates by column. By the way, I'm not sure if you realize, but as a general rule you should always try and perform vector operations on the columns of matrices, not the rows. I explained why in a response to a question on SO the other day; see here for more...
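The row-wise loop from the question can be checked in the same way. Since rand fills a matrix column by column, a matrix built one row at a time from the same generator state should equal the transpose of the matrix built in one step. A small sketch of that check, using the default generator:

```matlab
s = rng;                                 % store the generator state
X = rand(3, 3);                          % matrix built in one step

rng(s);                                  % restore the state
Y = nan(3, 3);
for n = 1:3
    Y(n, :) = rand(1, 3);                % matrix built one row at a time
end

% Same numbers in the same order, just laid out along rows instead of columns
sameAsTranspose = isequal(Y, X.');
disp(sameAsTranspose);
```

So the two constructions draw exactly the same stream of numbers; only their arrangement in the matrix differs.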
Regarding the question of independence/dependence, one needs to be careful with the language one uses. The numbers generated by rand are pseudo-random: the sequence is completely determined by the generator's seed, so strictly speaking its elements are perfectly dependent. For the vast majority of statistical tests they will appear to be independent. Nonetheless, in theory, one could construct a statistical test that would demonstrate the dependency between the numbers in a sequence generated by rand.
Final thought, if you have a copy of Greene's "Econometric Analysis", he gives a neat discussion of random number generation in section 17.2.
As far as base R's random number generator is concerned, there also doesn't appear to be any difference between generating a sequence of random numbers at once or one by one. Thus, the behaviour @Colin T Bowers (+1) describes above also holds in R. Below is an R version of Colin's code:
#set seed
set.seed(1234)
# generate a sequence of 10,000 random numbers at once
X<-rnorm(10000)
# reset the seed
set.seed(1234)
# create a vector of 10,000 zeros
Y<-rep(0,times=10000)
# generate a sequence of 10,000 random numbers, one at a time
for (i in 1:10000){
  Y[i]<-rnorm(1)
}
# Test for differences
if(any(X-Y!=0)){print("Error")}