Comparing multivariate distributions - matlab

I have a set of multivariate instances and I need to extract a representative subset; for example, from 100,000 multivariate instances I want to extract 1000 that are representative of the original distribution. I used Latin Hypercube Sampling and Random Sampling to extract two representative sets, and now I want to check how closely each of these representative sets matches the original set.
To elaborate further:
I have 100,000 multivariate instances (let's call it A)
I derive two representative samples from 'A' (each set will have 1000 instances; let's call these two sets B and C)
I want to check whether 'B' and 'C' preserve the distribution of the original 'A'.
Thanks a lot in advance!

This is more of a statistics question, but here's an outline. Normally you'd use a Chi-squared test to compare the distributions. The basic steps are as follows.
Bin each of the data sets, trying to set up the bins so that there are at least 5 samples in each bin. (Use the same bins for all data sets.)
Use the large sample "A" to determine the expected number of samples (call it f_e) in each bin. (Note that f_e for any particular bin is 1/100 of the number of A's samples in that bin, since A contains 100 times as many data points as B or C.)
To test one of the samples (say B), calculate the sum S = sum over all bins of (f_o - f_e)^2 / f_e, where f_o is the observed frequency in the bin.
This sum is a Chi-squared variable with degrees of freedom one less than the total number of bins that you are using.
Calculate 1 - chi2cdf(S,dof). This is the probability that a sum as large as or larger than the one you obtained (S) could have happened purely due to random variation (that is, even if the distributions were identical). So a small result (close to 0) means the distributions are likely different, and a large result (close to 1) means they are not likely to be significantly different.
There's probably a library function to do all of the above. IDK, as I haven't used any statistics libraries for a long while.
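For reference, here is a minimal sketch of the above (my own variable names; it assumes univariate column vectors A and B and the Statistics Toolbox for chi2cdf — for multivariate data you would bin jointly, e.g. with histcounts2 or a manual grid):
edges = linspace(min(A), max(A), 21);   % 20 shared bins for every data set
fA = histcounts(A, edges);              % reference counts from the large set
fo = histcounts(B, edges);              % observed counts in the candidate set
fe = fA * (numel(B) / numel(A));        % expected counts, rescaled to B's size
S  = sum((fo - fe).^2 ./ fe);           % Chi-squared statistic
p  = 1 - chi2cdf(S, numel(fe) - 1);     % dof = number of bins - 1
A small p suggests B does not preserve A's distribution; in practice you would merge bins until every expected count is at least 5, as noted in the steps above.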

Pearson correlation coefficient

This question of mine is not tightly related to Matlab, but is relevant to it:
I'm looking for how to fill in the matrix [[a,b,c],[d,e,f]] in a few nontrivial ways so that as many entries as possible in
corrcoef([a,b,c],[d,e,f])
are zero. My attempts yield NaN results in most cases.
Given the current comments, you are trying to understand how two series of random draws from two distributions can have zero correlation. Specifically, exercise 4.6.9 to which you refer mentions draws from two normal distributions.
An issue with your approach is that you are hoping to derive a link between a theoretical property and experimentation, in this case using Matlab. And, as you seem to have noticed, unless you are looking at specific degenerate cases, your experimentation will fail. That is because although the true correlation parameter rho in the exercise might be zero, a sample of random draws will always have some level of correlation. Here is an illustration; as you'll notice if you run it, the actual correlations span the whole spectrum between -1 and 1, despite their average being zero (as it should be, since the two generators are pseudo-uncorrelated):
n = 1e4;                                 % number of repeated experiments
experiment = nan(n,1);
for i = 1:n
    r = corrcoef(rand(4,1), rand(4,1));  % correlation of two 4-sample draws
    experiment(i) = r(2);                % off-diagonal entry = sample correlation
end
hist(experiment);
title(sprintf('Average correlation: %.4f', mean(experiment)));
If you look at the definition of the Pearson correlation on Wikipedia, you will see that the only way it can be zero is when the numerator is zero, i.e. E[(X-Xbar)(Y-Ybar)]=0. Though this might be the case asymptotically, you will be hard-pressed to find a non-degenerate case where it happens in a small sample. Nevertheless, to show you that you can derive some such degenerate cases, let's dig a bit further. If you want the expectation of this product to be zero, you could make either the left-hand or the right-hand factor zero whenever the other is non-zero. For one factor to be zero, the draw must be exactly equal to the average of the draws. Therefore we can imagine creating such a pair of variables using this technique:
we create two vectors of 4 variables, and alternate which draw will be equal to the average.
let's say we want X to average 1, and Y to average 2, and we make even-indexed draws equal to the average for X and odd-indexed draws equal to the average for Y.
one such generation would be: X=[0,1,2,1], Y=[2,0,2,4], and you can check that corrcoef([0,1,2,1],[2,0,2,4]) does in fact produce an identity matrix. This is because every time a component of X differs from its average of 1, the corresponding component of Y is equal to its average of 2.
another example, where the average of X is 3 and that of Y is 4 is: X=[3,-5,3,11], Y=[1008,4,-1000,4]. etc.
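As a quick sanity check of the construction (nothing new here, just the two examples above run through corrcoef):
X1 = [0 1 2 1];   Y1 = [2 0 2 4];        % X1 deviates from its mean (1) only
R1 = corrcoef(X1, Y1)                    % where Y1 sits at its mean (2): R1 = eye(2)
X2 = [3 -5 3 11]; Y2 = [1008 4 -1000 4];
R2 = corrcoef(X2, Y2)                    % same trick, means 3 and 4: R2 = eye(2)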
If you wanted to know how to create samples from non-correlated distributions altogether, that would be an entirely different question, though (perhaps) more interesting in terms of understanding statistics. If this is your case, and given that the exercise you mention discusses normal distributions, I would suggest you take a look at generating antithetic variables using the Box-Muller transform.
Happy randomizing!

Randomly rearranging data points when creating cross-validation indices?

I have a dataset where the columns correspond to features (predictors) and the rows correspond to data points. The data points are extracted in a structured way, i.e. they are sorted. I will use either crossvalind or cvpartition from Matlab for stratified cross-validation.
If I use the above function, do I still have to first randomly rearrange the data points (rows)?
These functions shuffle your data internally, as you can see in the docs:
Indices = crossvalind('Kfold', N, K) returns randomly generated indices for a K-fold cross-validation of N observations. Indices contains equal (or approximately equal) proportions of the integers 1 through K that define a partition of the N observations into K disjoint subsets. Repeated calls return different randomly generated partitions. K defaults to 5 when omitted. In K-fold cross-validation, K-1 folds are used for training and the last fold is used for evaluation. This process is repeated K times, leaving one different fold for evaluation each time.
However, if your data is structured in the sense that the i-th object carries information about object i+1, then you should consider a different kind of splitting. For example, if your data is actually a (locally) time series, typical random CV is not a valid estimation technique. Why? Because if your data contains clusters where knowing the value of at least one element gives you a high probability of estimating the remaining ones, then what you obtain from CV is actually an estimate of the ability to do exactly that: predict inside these clusters. If, during real-life use of your model, you expect to encounter a completely new cluster, the model you selected can be completely random there. In other words, if your data has some kind of internal cluster structure (or is a time series), your splits should respect that structure by splitting over clusters (instead of K random splits of points, you use K random splits of clusters, and so on).
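If your rows really are exchangeable and stratified folds are all you need, a minimal usage sketch (assuming a class-label vector y, one label per row) looks like this:
c = cvpartition(y, 'KFold', 5);     % stratified: preserves class proportions
for k = 1:c.NumTestSets
    trainIdx = training(c, k);      % logical index of the training rows
    testIdx  = test(c, k);          % logical index of the held-out rows
    % fit your model on X(trainIdx,:) and evaluate on X(testIdx,:)
end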

Matlab: What are the ways to determine the distribution of the data

I have a data set of n = 1000 realizations of a univariate random variable X = {x1, x2, ..., xn}. The data are generated by varying a parameter on which the random variable depends. For example, let the r.v. be the area of a circle: by varying the radius (keeping the dimension fixed, say a 2-dimensional circle) I generate n areas for radii in the range r = 5 to n.
Using the fitdist command I can fit a distribution to the data set, choosing from distributions like Normal, Kernel, Binomial, etc. Fitting the data to k such candidates gives me k fitted distributions. How do I select the best-fit distribution, and hence the pdf?
Also, do I always need to normalize (post-process) the data to the range [0,1] before fitting?
If I understand correctly, you are asking how to decide which distribution to choose once you have a few fits.
There are three major metrics (IMO) for measuring "goodness-of-fit":
Chi-Squared
Kolmogorov-Smirnov
Anderson-Darling
Which to choose depends on a large number of factors; you can randomly pick one or read the Wiki pages to figure out which suits your needs. These tests are also part of MATLAB.
For instance, you can use kstest for the Kolmogorov-Smirnov test. You can provide the data and the hypothesized distribution to the function and evaluate the different options based on the KS test.
Alternatively, you can use Anderson-Darling through adtest or Chi-Squared through chi2gof.
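For illustration, a minimal sketch (my variable names; x is your data vector): fit a candidate with fitdist, then hand the fitted distribution to each test:
pd = fitdist(x, 'Normal');                    % one of your k candidate fits
[hKS, pKS] = kstest(x, 'CDF', pd);            % Kolmogorov-Smirnov
[hAD, pAD] = adtest(x, 'Distribution', pd);   % Anderson-Darling
[hC2, pC2] = chi2gof(x, 'CDF', pd);           % Chi-squared
Higher p-values (and h = 0) indicate fits the test cannot reject; comparing p-values across your k candidates is one simple, if informal, selection rule.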

Kullback Leibler Divergence of 2 Histograms in MatLab

I would like a function to calculate the KL distance between two histograms in MatLab. I tried this code:
http://www.mathworks.com/matlabcentral/fileexchange/13089-kldiv
However, it says that I should have two distributions P and Q of sizes n x nbins. However, I am having trouble understanding how the author of the package wants me to arrange the histograms. I thought that providing the discretized values of the random variable together with the number of bins would suffice (I would assume the algorithm would use an arbitrary support to evaluate the expectations).
Any help is appreciated.
Thanks.
The function you link to requires that the two histograms passed be aligned and thus have the same size, NBIN x N (not N x NBIN); that is, if N > 1, the number of rows in the inputs should equal the number of bins in the histograms. If you are just comparing two histograms (that is, if N = 1) it doesn't really matter: you can pass either row- or column-vector versions as long as you are consistent and the order of bins matches.
A generic call to the function looks like this:
dists = kldiv(bins,P,Q)
The implementation allows comparison of multiple histograms to each other (that is, N>1), in which case pairs of columns (with matching column index) in each array are compared and the result is a row vector with distances for each matching pair.
The array bins should be the same size as P and Q; the routine uses it only to perform a minimal check that the inputs are of the same size, and expects it to contain the numeric labels of your bins so that it can warn you about repeated bin labels, but otherwise the information does not enter the computation.
You could do away with bins and compute the distance with
KL = sum(P .* (log2(P)-log2(Q)));
without using the Matlab Central version. However, the version you link to performs the above-mentioned minimal checks and in addition allows the computation of two alternative distances (consult the documentation).
The version linked to by eigenchris checks that no histogram bins are empty (which would make the computation blow up numerically) and, if there are, removes their contribution to the sum (I am not sure this is entirely appropriate; consult an expert on the subject). You should also be aware of the exact form of the formula: note the use of log2 above versus the natural logarithm in the version linked to by eigenchris.
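For a single pair of normalized histograms, a minimal sketch with the same naive empty-bin guard (whether dropping such bins is appropriate is exactly the caveat above):
nz = (P > 0) & (Q > 0);                            % skip bins empty in either one
KL = sum(P(nz) .* (log2(P(nz)) - log2(Q(nz))));    % KL divergence in bits
Here P and Q are assumed to be vectors of bin probabilities, each summing to 1.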

Controlled random number/dataset generation in MATLAB

Say I have a cube of dimensions 1x1x1 spanning between coordinates (0,0,0) and (1,1,1). I want to generate a random set of points (assume 10 points) within this cube which are somewhat uniformly distributed (i.e. within a certain minimum and maximum distance from each other and also not too close to the boundaries). How do I go about this without using loops? If this is not possible using vector/matrix operations, then a solution with loops will also do.
Let me provide some more background details about my problem (This will help in terms of what I exactly need and why). I want to integrate a function, F(x,y,z), inside a polyhedron. I want to do it numerically as follows:
$\int_V F(x,y,z) \, dV \approx \sum_{i} F(x_i,y_i,z_i) \times V_i$
Here, $F(x_i,y_i,z_i)$ is the value of the function at point $(x_i,y_i,z_i)$ and $V_i$ is the weight. So to calculate the integral accurately, I need to identify a set of random points which are neither too close to nor too far from each other (sorry, but I don't know what this range is myself; I will only be able to figure it out with a parametric study once I have a working code). Also, I need to do this for a 3D mesh which has multiple polyhedrons, hence I want to avoid loops to speed things up.
Check out this nice random-vector generator with fixed sum from the File Exchange (FEX).
The code "generates m random n-element column vectors of values, [x1;x2;...;xn], each with a fixed sum, s, and subject to a restriction a<=xi<=b. The vectors are randomly and uniformly distributed in the n-1 dimensional space of solutions. This is accomplished by decomposing that space into a number of different types of simplexes (the many-dimensional generalizations of line segments, triangles, and tetrahedra.) The 'rand' function is used to distribute vectors within each simplex uniformly, and further calls on 'rand' serve to select different types of simplexes with probabilities proportional to their respective n-1 dimensional volumes. This algorithm does not perform any rejection of solutions - all are generated so as to already fit within the prescribed hypercube."
Alternatively, use p = rand(3,10), where each column corresponds to one point and each row corresponds to the coordinate along one axis (x, y, z).
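If "somewhat uniformly distributed" just means "away from the faces and not too close to each other", a hedged sketch built on that idea is rejection sampling; margin and dmin are placeholder tolerances to tune in your parametric study, and pdist (Statistics Toolbox) wants points as rows, hence the transposed layout:
margin = 0.1;  dmin = 0.2;                        % assumed tolerances; tune them
pts = margin + (1 - 2*margin) * rand(10, 3);      % rows are points, kept off the faces
while any(pdist(pts) < dmin)                      % is any pair closer than dmin?
    pts = margin + (1 - 2*margin) * rand(10, 3);  % redraw the whole set
end
The loop only runs for the occasional redraw, so in practice this stays close to the loop-free spirit of the question.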