Randomly rearranging data points when creating cross-validation indices? - matlab

I have a dataset where the columns correspond to features (predictors) and the rows correspond to data points. The data points are extracted in a structured way, i.e. they are sorted. I will use either crossvalind or cvpartition from MATLAB for stratified cross-validation.
If I use either of these functions, do I still have to randomly rearrange the data points (rows) first?

These functions assign observations to folds at random, so you do not need to shuffle the rows yourself. From the crossvalind docs:
Indices = crossvalind('Kfold', N, K) returns randomly generated indices for a K-fold cross-validation of N observations. Indices contains equal (or approximately equal) proportions of the integers 1 through K that define a partition of the N observations into K disjoint subsets. Repeated calls return different randomly generated partitions. K defaults to 5 when omitted. In K-fold cross-validation, K-1 folds are used for training and the last fold is used for evaluation. This process is repeated K times, leaving one different fold for evaluation each time.
However, if your data is structured in the sense that the i-th object carries information about object i+1, you should consider a different kind of split. For example, if your data is (locally) a time series, ordinary random CV is not a valid estimation technique. Why? If the data contains clusters in which knowing the value of one element lets you estimate the remaining ones with high probability, then what CV measures is exactly that ability: predicting inside clusters the model has already seen. If in real-life use you expect completely new clusters, the model you selected may behave arbitrarily on them. In other words, if your data has internal cluster structure (or is a time series), your splits should respect it by splitting over clusters: instead of K random subsets of points, use K random subsets of clusters.
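As a rough illustration of splitting over clusters rather than over points, here is a minimal sketch in base MATLAB. The variable groupId (one cluster label per row) and the toy data are assumptions made for the example; this is not something crossvalind or cvpartition does for you.

```matlab
% Hypothetical example: 12 observations belonging to 4 clusters
% (e.g. 4 separate recordings). groupId is an assumed variable name.
groupId = [1 1 1 2 2 2 3 3 3 4 4 4]';
K = 2;                                   % number of folds

rng(0);                                  % fix the seed for reproducibility
[~, ~, g] = unique(groupId);             % map cluster labels to 1..nGroups
nGroups = max(g);

% Shuffle the clusters (not the rows) and deal them out to the K folds
groupFold = zeros(nGroups, 1);
groupFold(randperm(nGroups)) = mod(0:nGroups-1, K)' + 1;

% Every row inherits the fold of its cluster
foldOfRow = groupFold(g);

for k = 1:K
    testIdx  = (foldOfRow == k);
    trainIdx = ~testIdx;
    % No cluster appears on both sides of the split
    assert(isempty(intersect(groupId(trainIdx), groupId(testIdx))));
    % ... fit on rows trainIdx, evaluate on rows testIdx ...
end
```

For a time series, the same idea applies with contiguous blocks of rows playing the role of clusters.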

Related

What's the best way to obtain cosine similarity from two vectors in MATLAB?

I'll need to repeat this process multiple times, and the number of values will vary from ~10 to ~1000. I don't have access to all the vectors at once - they'll become accessible to me two vectors at a time.
In each instance there will always be the same number of values in each of the pair of vectors. However, from instance to instance the number of values will vary.
For column vectors a and b I might try,
a.'*b/(norm(a)*norm(b))
Ideally you would combine all or a subset of your vectors into arrays and do the operations at once, taking advantage of MATLAB's multithreading. Vectors of different lengths are a challenge, though...
Do you have access to all the vectors at once?
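If the vectors really do arrive one pair at a time, a small anonymous function keeps the per-pair computation in one place. This is only a sketch: cosSim is a hypothetical name, the vectors are made-up examples, and a zero vector would produce NaN because its norm is 0.

```matlab
% Cosine similarity of two equal-length vectors (row or column)
cosSim = @(a, b) dot(a, b) / (norm(a) * norm(b));

% Made-up example vectors
a = [1; 2; 3];
b = [2; 4; 6];      % parallel to a
c = [3; 0; -1];     % orthogonal to a (1*3 + 2*0 + 3*(-1) = 0)

s_ab = cosSim(a, b);   % close to 1 for parallel vectors
s_ac = cosSim(a, c);   % close to 0 for orthogonal vectors
```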

Kullback Leibler Divergence of 2 Histograms in MatLab

I would like a function to calculate the KL distance between two histograms in MatLab. I tried this code:
http://www.mathworks.com/matlabcentral/fileexchange/13089-kldiv
However, it says that I should have two distributions P and Q of sizes n x nbins. However, I am having trouble understanding how the author of the package wants me to arrange the histograms. I thought that providing the discretized values of the random variable together with the number of bins would suffice (I would assume the algorithm would use an arbitrary support to evaluate the expectations).
Any help is appreciated.
Thanks.
The function you link to requires that the two histograms passed be aligned and thus have the same length NBIN x N (not N X NBIN), that is, if N>1 then the number of rows in the inputs should be equal to the number of bins in the histograms. If you are just going to compare two histograms (that is if N=1) it doesn't really matter, you can pass either row or column vector versions of these as long as you are consistent and the order of bins matches.
A generic call to the function looks like this:
dists = kldiv(bins,P,Q)
The implementation allows comparison of multiple histograms to each other (that is, N>1), in which case pairs of columns (with matching column index) in each array are compared and the result is a row vector with distances for each matching pair.
Array bins should be the same size as P and Q and is used to perform a very minimal check that the inputs are of the same size, but is not used in the computation. The routine expects bins to contain the numeric labels of your bins so that it can check for repeated bin labels and warn you if repeats occur, but otherwise doesn't use the information.
You could do away with bins and compute the distance with
KL = sum(P .* (log2(P)-log2(Q)));
without using the Matlab Central versions. However the version you link to performs the abovementioned minimal checks and in addition allows computation of two alternative distances (consult the documentation).
The version linked to by eigenchris checks that no histogram bins are empty (which would make the computation blow up numerically) and, if there are any, removes their contribution to the sum (not sure this is entirely appropriate - consult an expert on the subject). You should probably also be aware of the exact form of the formula: specifically, note the use of log2 above versus the natural logarithm in the version linked to by eigenchris.
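To make the one-line formula concrete, here is a minimal sketch that starts from raw bin counts. The counts are made up, and skipping bins where P is zero follows the usual convention that 0*log(0) counts as 0, which is an assumption on my part and not necessarily what either File Exchange version does. If Q is zero in a bin where P is not, the result is Inf.

```matlab
% Hypothetical bin counts for two histograms over the same bins
countsP = [10 20 30 40];
countsQ = [25 25 25 25];

% Convert counts to probabilities so that each histogram sums to 1
P = countsP / sum(countsP);
Q = countsQ / sum(countsQ);

% Bins with P == 0 contribute nothing to the sum
nz = P > 0;
KL_bits = sum(P(nz) .* (log2(P(nz)) - log2(Q(nz))));   % base-2 logarithm
KL_nats = sum(P(nz) .* (log(P(nz))  - log(Q(nz))));    % natural logarithm

% The two results differ only by the constant factor log(2)
assert(abs(KL_nats - KL_bits * log(2)) < 1e-12);
```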

Comparing multivariate distributions

I have a set of multivariate instances and I need to extract a representative set from these instances; for instance if I have 100,000 multivariate instances, I want to extract 1000 instances that would be representative of the original distribution. I used Latin Hypercube Sampling and Random Sampling to extract two representative sets and now I want to check how much of a correlation these two representative sets have with the original set.
If I further elaborate;
I have 100,000 multivariate instances (let's call it A)
I derive two representative samples from 'A' (each set will have 1000 instances; let's call these two sets B and C)
I want to check whether 'B' and 'C' preserves the distribution of the original 'A'.
Thanks a lot in advance!
This is more of a statistics question, but here's an outline. Normally you'd use a Chi-squared test to compare the distributions. The basic steps are as follows.
Bin each of the data sets. Try to set up the bins so that there are at least 5 samples in each bin. (Use the same bins for all data sets.)
Use the large sample "A" to determine the expected number of samples (call it f_e) in each bin. (BTW, note that f_e for any particular bin is 1/100 of the number of samples from A in that bin, since A contains 100 times as many data points as B or C.)
To test one of the samples (say B), calculate the sum S = sum over all bins of (f_o - f_e)^2/f_e, where f_o is the observed frequency in the bin.
If B follows the same distribution as A, this sum is approximately a Chi-squared variable with degrees of freedom one less than the total number of bins that you are using.
Calculate 1 - chi2cdf(S,dof). This is the probability that a sum as large as or larger than the one you obtained (S) could have happened purely due to random variation (that is, even if the distributions were identical). So a small result (close to 0) means that the distributions are likely to be different, and a large result (close to 1) means they're not likely to be significantly different.
There's probably a library function to do all of the above. IDK, as I haven't used any statistics libraries for a long while.
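The steps above can be sketched as follows for a single variable (one column of the data); for multivariate data this only checks one marginal at a time. The data, the bin edges and the variable names are assumptions for illustration. histcounts needs R2014b or later, and chi2cdf comes from the Statistics and Machine Learning Toolbox.

```matlab
rng(1);                               % fix the seed for reproducibility
A = randn(100000, 1);                 % stand-in for one column of the full set
B = A(randperm(numel(A), 1000));      % random subsample to be tested

% Same bins for both sets (9 edges, so 8 bins)
edges = [-Inf, -1.5:0.5:1.5, Inf];
nA  = histcounts(A, edges);
f_o = histcounts(B, edges);           % observed counts in B
f_e = nA * numel(B) / numel(A);       % expected counts, scaled down from A

S   = sum((f_o - f_e).^2 ./ f_e);     % Chi-squared statistic
dof = numel(f_e) - 1;                 % one less than the number of bins
p   = 1 - chi2cdf(S, dof);            % probability of a sum at least this large
```

The same toolbox also has chi2gof, which may cover much of this in one call.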

Is there a statistical difference between generating many random vectors vs a single random matrix

Is there a statistical difference between generating a series of paths for a Monte Carlo simulation using the following two methods (note that by path I mean a vector of 350 points, normally distributed):
A)
for path = 1:300000
Zn(path, :) = randn(1, 350);
end
or the far more efficient B)
Zn = randn(300000, 350);
I just want to be sure there is no funny added correlation or dependence between the rows in method B that isn't present in method A. Like maybe method B distributes normally over 2 dimensions where A is over 1 dimension, so maybe that makes the two statistically different?
If there is a difference then I need to know the same for uniform distributions (i.e. rand instead of randn)
Just to add to the answer of @natan (+1), run the following code:
%# Store the seed
Rng1 = rng;
%# Get a matrix of random numbers
X = rand(3, 3);
%# Restore the seed
rng(Rng1);
%# Get a matrix of random numbers one vector at a time
Y = nan(3, 3);
for n = 1:3
Y(:, n) = rand(3, 1);
end
%# Test for differences
if any(any(X - Y ~= 0)); disp('Error'); end;
You'll note that there is no difference between X and Y. That is, there is no difference between building a matrix in one step, and building a matrix from a sequence of vectors.
However, there is a difference between my code and yours. Note I am populating the matrix by columns, not rows, since when rand is used to construct a matrix in one step, it populates by column. By the way, I'm not sure if you realize, but as a general rule you should always try and perform vector operations on the columns of matrices, not the rows. I explained why in a response to a question on SO the other day; see here for more...
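To see what filling by rows does instead, here is a small sketch along the same lines (Yrows is a name introduced for this example). The same numbers come out, only arranged as the transpose, so statistically nothing changes:

```matlab
s = rng;                        % save the generator state
X = rand(3, 3);                 % matrix in one call (filled column by column)

rng(s);                         % restore the generator state
Yrows = nan(3, 3);
for n = 1:3
    Yrows(n, :) = rand(1, 3);   % fill one row at a time
end

% Same stream of numbers, transposed arrangement
assert(isequal(Yrows, X.'));
```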
Regarding the question of independence/dependence, one needs to be careful with the language one uses. The numbers generated by rand are pseudo-random: each one is a deterministic function of the generator state, so strictly speaking the sequence is perfectly dependent. For the vast majority of statistical tests it will appear to be independent; nonetheless, in theory, one could construct a statistical test that demonstrates the dependency between the numbers generated by rand.
Final thought, if you have a copy of Greene's "Econometric Analysis", he gives a neat discussion of random number generation in section 17.2.
As far as base R's random number generator is concerned, there also doesn't appear to be any difference between generating a sequence of random numbers at once and generating them one by one. Thus, the behavior @Colin T Bowers (+1) describes above also holds in R. Below is an R version of Colin's code:
#set seed
set.seed(1234)
# generate a sequence of 10,000 random numbers at once
X<-rnorm(10000)
# reset the seed
set.seed(1234)
# create a vector of 10,000 zeros
Y<-rep(0,times=10000)
# generate a sequence of 10,000 random numbers, one at a time
for (i in 1:10000){
Y[i]<-rnorm(1)
}
# Test for differences
if(any(X-Y!=0)){print("Error")}

Matlab - Stratified Sampling of Multidimensional Data

I want to divide a corpus into training & testing sets in a stratified fashion.
The observation data points are arranged in a Matrix A as
A=[16,3,0;12,6,4;19,2,1;.........;17,0,2;13,3,2]
Each column of the matrix represent a distinct feature.
In Matlab, the cvpartition(A,'holdout',p) function requires A to be a vector. How can I perform the same action with A as a Matrix i.e. resulting sets have roughly the same distribution of each feature as in the original corpus.
By using a matrix A rather than grouped data, you are assuming that a random partition of your data will return test and training sets with the same column distributions as A.
More precisely, you are assuming that there is a partition of A under which each of the marginal distributions of A (one per column, three here) is preserved in both sets. There is no guarantee that this is true. Check whether the columns of your matrix are correlated. If they are not, simply partition on one column and use the resulting row indices to define a test matrix:
cv = cvpartition(A(:, 1), 'holdout', p);
test_mat = A(cv.test, :);
If they are correlated, you may need to go back and reconsider what you are trying to do.
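A rough sketch of both steps, under several assumptions: the matrix A is made-up data, the bin edges are arbitrary, and coarsening column 1 into a few strata with discretize (R2015a or later) is my own addition so that each stratum has enough rows. cvpartition needs the Statistics and Machine Learning Toolbox.

```matlab
rng(0);                             % fix the seed for reproducibility
A = randi([0 20], 200, 3);          % hypothetical 200-by-3 feature matrix
p = 0.25;                           % fraction of rows held out for testing

% Step 1: pairwise correlations between the columns (3-by-3 matrix)
R = corrcoef(A);

% Step 2: coarsen column 1 into 4 strata (edges are assumptions),
% then stratify the holdout split on those strata
strata = discretize(A(:, 1), [0 5 10 15 20]);
cv = cvpartition(strata, 'HoldOut', p);

train_mat = A(training(cv), :);
test_mat  = A(test(cv), :);
```

Off-diagonal entries of R close to 0 suggest the columns are roughly uncorrelated; comparing histograms of each column in train_mat and test_mat is a simple way to check that the split kept the distributions similar.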