Kullback-Leibler Divergence of 2 Histograms in MatLab

I would like a function to calculate the KL distance between two histograms in MatLab. I tried this code:
http://www.mathworks.com/matlabcentral/fileexchange/13089-kldiv
It says that I should have two distributions P and Q of size n x nbins, but I am having trouble understanding how the author of the package wants me to arrange the histograms. I thought that providing the discretized values of the random variable together with the number of bins would suffice (I would assume the algorithm uses an arbitrary support to evaluate the expectations).
Any help is appreciated.
Thanks.

The function you link to requires that the two histograms be aligned (computed over the same bins) and therefore have the same size, NBIN x N (not N x NBIN). That is, if N>1, the number of rows in the inputs should equal the number of bins in the histograms. If you are just going to compare two histograms (that is, N=1), it doesn't really matter: you can pass either row or column vectors, as long as you are consistent and the order of the bins matches.
A generic call to the function looks like this:
dists = kldiv(bins,P,Q)
The implementation allows comparison of multiple histograms to each other (that is, N>1), in which case pairs of columns (with matching column index) in each array are compared and the result is a row vector with distances for each matching pair.
Array bins should be the same size as P and Q and is used to perform a very minimal check that the inputs are of the same size, but is not used in the computation. The routine expects bins to contain the numeric labels of your bins so that it can check for repeated bin labels and warn you if repeats occur, but otherwise doesn't use the information.
You could do away with bins and compute the distance with
KL = sum(P .* (log2(P)-log2(Q)));
without using the Matlab Central version. However, the version you link to performs the above-mentioned minimal checks and, in addition, allows computation of two alternative distances (consult the documentation).
The version linked to by eigenchris checks whether any histogram bins are empty (which would make the computation blow up numerically) and, if there are, removes their contribution to the sum (I am not sure this is entirely appropriate; consult an expert on the subject). You should probably also be aware of the exact form of the formula: note the use of log2 above versus the natural logarithm in the version linked to by eigenchris.
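As a rough illustration of the layout and the formula discussed above, here is a minimal sketch that does not use either File Exchange function. The counts are made-up example data, and both histograms of each pair are assumed to have been computed over the same bins. Bins where P is zero are treated as contributing nothing (the usual 0*log(0) = 0 convention, which is an assumption here, not something taken from the linked code); a bin where Q is zero but P is not would make the result Inf.

```matlab
% Hypothetical counts: 4 bins (rows) x 2 histogram pairs (columns),
% i.e. NBIN x N with NBIN = 4 and N = 2.
countsP = [10 5; 20 0; 30 15; 40 30];
countsQ = [25 10; 25 10; 25 10; 25 20];

% Normalize each column so that it sums to 1 (counts -> probabilities).
P = bsxfun(@rdivide, countsP, sum(countsP, 1));
Q = bsxfun(@rdivide, countsQ, sum(countsQ, 1));

% Per-bin terms of the sum. Where P == 0 the product is NaN (0 * -Inf),
% so those entries are set to 0 explicitly.
terms = P .* (log2(P) - log2(Q));
terms(P == 0) = 0;

KL_bits = sum(terms, 1);       % 1-by-N row vector, one value per column pair
KL_nats = KL_bits * log(2);    % the same divergence with the natural logarithm
```

The last line shows that the log2 and natural-log versions differ only by a constant factor of log(2), so either can be used as long as you do not mix them.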

Related

MatLab: Generate N pseudo-random numbers with a Poisson distribution having mean M and total T, where N, M, and T are user defined

I’d like to be able to generate in MatLab a sequence of N pseudo-random numbers with a Poisson distribution having mean M. The sum of the N numbers should be T. N, M, and T are always positive or zero and would be user specified parameters to any function.
Obviously, if T is small relative to N it is likely that there will be problems achieving a total of T. In that case the function could just return the values T and then N-1 zeros or an error code. However, it is highly likely that in most cases T>>N.
I have been trying variations based on the method of generating random numbers with a given distribution provided at http://matlabtricks.com/post-44/generate-random-numbers-with-a-given-distribution and trying various normalizations at each step but have not been successful.
You could try to approximate what you want by using a multinomial distribution.
In Wikipedia's notation, k=N, n=T and p_i=M/T. A Poisson distribution has the distinctive property that its mean equals its variance; if your parameters are such that p_i is small, then the mean n*p_i is quite close to the variance n*p_i*(1-p_i). The sum is automatically equal to T (a property of the multinomial distribution).
Multinomial sampling in Matlab is done using the mnrnd function.
UPDATE
Regarding the comment, let's consider N sampled values vi and write their sum:
Sum(i=1...N) vi = T
Let's compute the mean value of the left and right sides of this equation.
Sum(i=1...N) E(vi) = E(T) = T
On the right side, mean value of constant is constant itself. On the left side we have
Sum(i=1...N) E(vi) = Sum(i=1...N) M = N*M = T
Therefore, M=T/N and pi=M/T=1/N.
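A minimal sketch of this idea, using only base functions so that it runs without the Statistics Toolbox: drawing T category labels uniformly from 1..N and counting how often each label occurs is a multinomial draw with n=T and p_i=1/N. The values of N and T below are arbitrary examples.

```matlab
N = 10;       % number of values to generate (example value)
T = 1000;     % required total (example value)
M = T / N;    % the mean is implied by N and T, as derived above

labels = randi(N, T, 1);               % T independent draws from categories 1..N
v = accumarray(labels, 1, [N 1])';     % counts per category, a 1-by-N row vector

% With the Statistics Toolbox, the equivalent draw would be:
%   v = mnrnd(T, ones(1, N) / N);
```

Each entry of v has mean M and variance M*(1 - 1/N), so the Poisson approximation gets better as N grows. The entries always sum to exactly T.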

How to create a non-linear sequence of numbers given only the starting and ending and number of steps

I am working on tuning a logistic regression model in Apache Spark using cross validation.
I would like to create a range of numbers that follow an exponential curve, e.g. each element in the list is obtained by multiplying the number before it by some constant, C. I will use this range as the regularization options in the paramGrid.
The trick is, I want to do this iteratively, so that the parameters of the best model are used to narrow down the window for the search range. Therefore, after the first iteration, I need a way to programmatically calculate C given the starting value X, the ending value Y, and the number of steps N.
If there is a function I can use which will give me what I want, great. Otherwise, what is the formula for calculating C?
You want that
Y = X*C^N
which means that you compute the factor as
C = pow(Y/X, 1.0/N)
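For illustration, here is a small sketch of the formula in MATLAB syntax (the question is about Spark, but the arithmetic is the same in any language). The values of X, Y and N are arbitrary examples. N is taken to be the number of multiplications, so the sequence has N+1 elements; if N is instead meant to be the number of elements, the exponent becomes 1/(N-1).

```matlab
X = 0.001;    % first value (example)
Y = 10;       % last value (example)
N = 4;        % number of steps, i.e. multiplications by C (example)

C = (Y / X)^(1 / N);      % constant factor between neighbours
seq = X * C.^(0:N);       % N+1 values: X, X*C, X*C^2, ..., X*C^N = Y

% logspace(log10(X), log10(Y), N + 1) produces the same sequence.
```

With these numbers C comes out as 10 and the sequence is 0.001, 0.01, 0.1, 1, 10, up to floating-point rounding.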

Randomly rearranging data points when creating cross-validation indices?

I have a dataset where the columns corresponds to features (predictors) and the rows correspond to data points. The data points are extracted in a structured way, i.e. they are sorted. I will use either crossvalind or cvpartition from Matlab for stratified cross-validation.
If I use the above function, do I still have to first randomly rearrange the data points (rows)?
These functions shuffle your data internally, as you can see in the docs:
Indices = crossvalind('Kfold', N, K) returns randomly generated indices for a K-fold cross-validation of N observations. Indices contains equal (or approximately equal) proportions of the integers 1 through K that define a partition of the N observations into K disjoint subsets. Repeated calls return different randomly generated partitions. K defaults to 5 when omitted. In K-fold cross-validation, K-1 folds are used for training and the last fold is used for evaluation. This process is repeated K times, leaving one different fold for evaluation each time.
However, if your data is structured in the sense that the i-th object carries some information about object i+1, then you should consider a different kind of splitting. For example, if your data is actually a (locally) time series, typical random CV is not a valid estimation technique. Why? If your data contains clusters where knowing the value of at least one element lets you estimate the remaining ones with high probability, then what you obtain after applying CV is an estimate of the ability to do exactly that: predict inside these clusters. So if, during real-life usage of your model, you expect to see a completely new cluster, the model you selected may behave essentially randomly there. In other words, if your data has some kind of internal cluster structure (or is a time series), your splits should respect it by splitting over clusters (so instead of K splits of random points you have K splits of random clusters, and so on).
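Splitting over clusters can be sketched roughly as follows. This assumes you already have a cluster (or time-block) label for every row; the variable groupId and its values are hypothetical, and this is a hand-rolled illustration rather than a feature of crossvalind or cvpartition.

```matlab
groupId = [1 1 1 2 2 3 3 3 4 4 5 5]';   % hypothetical cluster label for each row
K = 3;                                   % number of folds

groups = unique(groupId);                       % distinct cluster labels (column)
perm = groups(randperm(numel(groups)));         % clusters in random order
foldOfGroup = mod(0:numel(groups)-1, K)' + 1;   % deal clusters to folds 1..K in turn

[~, loc] = ismember(groupId, perm);      % position of each row's cluster in perm
foldIdx = foldOfGroup(loc);              % fold number for every row

% Rows with foldIdx == k form the test set of fold k; whole clusters stay together.
```

Because the fold is assigned per cluster rather than per row, no cluster is ever split between training and test data.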

What's the significance of a negative NCC coefficient w.r.t. image template matching?

I have been using Matlab's normxcorr2 function to do template matching with images by performing normalized cross correlation. To find the maximum correspondence between a template and an image, one can simply run normxcorr2 and then find the maximum absolute value of all the values returned by normxcorr2 (the function returns values between -1.0 and 1.0).
From a quick Google search, I found out that a negative correlation coefficient means an inverse relationship between two variables (e.g. as x increases, y decreases), and that a positive correlation coefficient means the opposite (e.g. as x increases, y increases). How does this apply to image template matching? That is, what does a negative value from normxcorr2 mean conceptually with respect to template matching?
View normalized cross correlation as a normalized vector dot product (taken after subtracting each vector's mean). If the angle between two normalized vectors is zero, their dot product will be 1; if they point in opposite directions, their dot product will be negative 1. The idea carries over directly if you take each array and stack its columns end to end: the result is essentially a dot product between two vectors.
Also, just as a personal anecdote: what confused me about template matching at first was that I intuitively believed element-wise subtraction of two images would be a good metric for image similarity. When I first saw cross correlation, I wondered why it used element-wise multiplication. Then I realized that the latter operation is the same thing as a vector dot product. The vector dot product, as I mentioned before, indicates when two vectors are pointing in the same direction. In your case, the two vectors are normalized first, which is why the range is from -1 to 1. If you want to read more about the implementation, "Fast Normalized Cross Correlation" by J.P. Lewis is a classic paper on the subject.
Check the formula on Wikipedia.
When f(x,y) - mean(f) and t(x,y) - mean(t) have different signs, the corresponding term of the sum is negative (the standard deviation is always positive). If there are a lot of such (x,y), the whole sum will also be negative. You can think of it this way: if 1.0 means that one image is equal to another, then -1.0 means that one image is the negative of the other (try normxcorr2(x, -x)).
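To make the dot-product view concrete, here is a small sketch that computes the coefficient for a single alignment of two equally sized patches, without the Image Processing Toolbox. normxcorr2 evaluates this kind of coefficient at every offset of the template; the helper names center and ncc and the patch values are made up for the example.

```matlab
t = [1 2 3; 4 5 6; 7 8 10];          % hypothetical 3x3 template patch

% Stack the columns end to end and subtract the mean.
center = @(a) a(:) - mean(a(:));

% Normalized dot product of the two mean-subtracted vectors.
ncc = @(a, b) dot(center(a), center(b)) / (norm(center(a)) * norm(center(b)));

same   = ncc(t, t);          % identical patches: coefficient close to 1
neg    = ncc(t, 255 - t);    % photographic negative: coefficient close to -1
scaled = ncc(t, 2*t + 5);    % brightness/contrast change: still close to 1
```

So a strongly negative value means the image region looks like the template with light and dark swapped. Whether that counts as a match depends on the application, which is why taking the maximum absolute value is not always what you want.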

Controlled random number/dataset generation in MATLAB

Say, I have a cube of dimensions 1x1x1 spanning between coordinates (0,0,0) and (1,1,1). I want to generate a random set of points (assume 10 points) within this cube which are somewhat uniformly distributed (i.e. within certain minimum and maximum distance from each other and also not too close to the boundaries). How do I go about this without using loops? If this is not possible using vector/matrix operations then the solution with loops will also do.
Let me provide some more background details about my problem (This will help in terms of what I exactly need and why). I want to integrate a function, F(x,y,z), inside a polyhedron. I want to do it numerically as follows:
$\int F(x,y,z)\,dV \approx \sum_{i} F(x_i,y_i,z_i) \times V_i(x_i,y_i,z_i)$
Here, $F(x_i,y_i,z_i)$ is the value of the function at point $(x_i,y_i,z_i)$ and $V_i$ is the weight. So to calculate the integral accurately, I need to identify a set of random points which are neither too close to nor too far from each other (sorry, but I myself don't know what this range is; I will be able to figure it out with a parametric study only after I have working code). Also, I need to do this for a 3D mesh which has multiple polyhedrons, hence I want to avoid loops to speed things up.
Check out this nice random vectors generator with fixed sum FEX file.
The code "generates m random n-element column vectors of values, [x1;x2;...;xn], each with a fixed sum, s, and subject to a restriction a<=xi<=b. The vectors are randomly and uniformly distributed in the n-1 dimensional space of solutions. This is accomplished by decomposing that space into a number of different types of simplexes (the many-dimensional generalizations of line segments, triangles, and tetrahedra.) The 'rand' function is used to distribute vectors within each simplex uniformly, and further calls on 'rand' serve to select different types of simplexes with probabilities proportional to their respective n-1 dimensional volumes. This algorithm does not perform any rejection of solutions - all are generated so as to already fit within the prescribed hypercube."
Use i=rand(3,10) where each column corresponds to one point, and each row corresponds to the coordinate in one axis (x,y,z)
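Note that rand(3,10) on its own gives uniformly distributed points but does not enforce any minimum spacing or distance from the boundary. One simple way to add those constraints is rejection ("dart throwing"): draw one candidate at a time and keep it only if it is far enough from the points accepted so far. This is a sketch of that technique and not something taken from the answers above; it uses a loop, and the values of margin and dMin are arbitrary placeholders you would tune in your parametric study.

```matlab
nPts   = 10;      % number of points wanted
margin = 0.05;    % minimum distance from each face of the cube (placeholder)
dMin   = 0.15;    % minimum distance between any two points (placeholder)

pts = zeros(3, 0);                 % accepted points, one per column
while size(pts, 2) < nPts
    % Candidate drawn uniformly from the cube shrunk by the margin.
    c = margin + (1 - 2*margin) * rand(3, 1);

    % Distances from the candidate to all accepted points (empty at first).
    d = sqrt(sum(bsxfun(@minus, pts, c).^2, 1));

    if all(d >= dMin)              % all() of an empty array is true
        pts(:, end+1) = c;         % accept the candidate
    end
end
```

If dMin is chosen too large for the number of points, the loop will run for a very long time or never finish, so in practice you may want to cap the number of attempts.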