Measuring the entropy of a transition probability matrix in MATLAB

I'm working on a project which requires analyzing certain graph properties of transition probability matrices that are constructed as weighted directed graphs.
One of the properties of interest is the entropy of these graphs, which I have yet to find a proper way to measure. The general idea is that I need some measure that lets me quantify the extent to which a given graph is "ordered", in order to assess the predictive value of the nodes within the graph (i.e., if all the nodes have exactly the same connection patterns, then their predictive value is effectively zero, though this is a very simplistic explanation, as there are many other factors contributing to a node's predictive power).
I've experimented with some built-in MATLAB functions:
entropy - generally used to determine the entropy of an image
wentropy - to be honest I do not fully understand the proper use of this function, but I've tried it with the 'shannon' and 'log energy' types and have produced some inconsistent results
This is a very basic script I whipped up to do some testing. It produces two matrices:
a 20x20 matrix whose values are drawn entirely from a uniform distribution on [0,1], intended to produce a matrix with a relatively low degree of order - unordgraph
a 20x20 matrix containing four 5x5 "patches" whose values are integers drawn uniformly from a range significantly larger than one, while the rest of the values are drawn from a uniform distribution on [0,1] (as in the previous matrix); this graph is more "ordered" than the previous one - ordgraph
When I run the code:
clear all;
n = 50;
gsize = 20;
orderedrange = [100 200];
enttype = 'shannon';
for i = 1:n
    unordgraph = rand(gsize);
    % entvec(1,i) = entropy(unordgraph);
    entvec(1,i) = wentropy(unordgraph,enttype);
    % ordgraph = reshape(1:gsize^2,gsize,gsize);
    ordgraph = rand(gsize);
    ordgraph(1:5,1:5) = randi(orderedrange,5);
    ordgraph(6:10,6:10) = randi(orderedrange,5);
    ordgraph(11:15,11:15) = randi(orderedrange,5);
    ordgraph(16:20,16:20) = randi(orderedrange,5);
    % entvec(2,i) = entropy(ordgraph);
    entvec(2,i) = wentropy(ordgraph,enttype);
end
fprintf('the mean entropy of the unordered graph is: %.4f\n',mean(entvec(1,:)));
fprintf('the mean entropy of the ordered graph is: %.4f\n',mean(entvec(2,:)));
I get outputs such as:
the mean entropy of the unordered graph is: 88.8871
the mean entropy of the ordered graph is: -23936552.0113
I'm not really sure what such negative values mean, since running the same script on a matrix composed entirely of zeros or ones (and hence maximally ordered) produces a mean entropy of 0.
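As far as I can tell from the wentropy documentation, the 'shannon' type computes -sum(s.^2.*log(s.^2)), which turns negative as soon as entries exceed 1, so the raw 100-200 values in ordgraph would drive it strongly negative. A minimal sketch of an alternative I've been considering (my own, assuming each row of the transition matrix should be treated as a probability distribution):
% Sketch only: normalize each row to sum to 1, compute its Shannon entropy
% in bits, then average over rows (0*log(0) is treated as 0).
P = ordgraph;                          % or unordgraph
P = bsxfun(@rdivide, P, sum(P,2));     % row-stochastic version of the matrix
H = -sum(P .* log2(P + (P == 0)), 2);  % per-row entropy in bits
meanRowEntropy = mean(H)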
I have a pretty rudimentary background in graph theory, which makes this task that much more difficult, and I would be really grateful for any help, whether theoretical or algorithmic.
Thanks in advance,
Ron

Related

Transforming draws in Matlab from Gaussian mixture to uniform

Consider the following draws for a 2x1 vector in Matlab with a probability distribution that is a mixture of two Gaussian components.
P=10^3; %number draws
v=1;
%First component
mu_a = [0,0.5];
sigma_a = [v,0;0,v];
%Second component
mu_b = [0,8.2];
sigma_b = [v,0;0,v];
%Combine
MU = [mu_a;mu_b];
SIGMA = cat(3,sigma_a,sigma_b);
w = ones(1,2)/2; %equal weight 0.5
obj = gmdistribution(MU,SIGMA,w);
%Draws
RV_temp = random(obj,P);%Px2
% Transform each component of RV_temp into a uniform in [0,1] by estimating the cdf.
RV1=ksdensity(RV_temp(:,1), RV_temp(:,1),'function', 'cdf');
RV2=ksdensity(RV_temp(:,2), RV_temp(:,2),'function', 'cdf');
Now, if we check whether RV1 and RV2 are uniformly distributed on [0,1] by doing
ecdf(RV1)
ecdf(RV2)
we can see that RV1 is uniformly distributed on [0,1] (the empirical cdf is close to the 45 degree line) while RV2 is not.
I don't understand why. It seems that the more distant mu_a(2) and mu_b(2) are, the worse the job done by ksdensity with a reasonable number of draws. Why?
When you have a mixture of N(0.5,v) and N(8.2,v), the range of the generated data is larger than if the means were closer together, as with N(0,v) and N(0,v) in the other dimension. You are then asking ksdensity to approximate a function using P points spread over this wider range.
As in standard linear interpolation, the denser the points, the better the approximation of the function (inside the range); the same holds here. So in the N(0.5,v) and N(8.2,v) case, where the points are sparser, the approximation is worse than in the N(0,v) and N(0,v) case, where the points are denser.
As a small side note, is there any reason you do not apply ksdensity directly to the bivariate data? Also, I cannot reproduce your comment where you say that 5e2 points are also good. Final comment: 1e3 is typically preferred over 10^3.
I think this is simply about the number of samples you're using. In the first dimension, the means of the two Gaussians are relatively close, so a thousand samples are enough to obtain a cdf really close to the U[0,1] cdf. In the second dimension the means are much further apart, so you need more samples: with 100000 samples I obtained a result close to the uniform cdf, while with 1000 the result was clearly farther from it.
Try increasing the number of samples to a million and check whether the result keeps getting closer.
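For what it's worth, a quick check along these lines (my own sketch, reusing the setup from the question): rerun the transform with a much larger sample size and overlay the empirical cdf of RV2 on the 45-degree line.
P = 1e5;                                               % far more draws than 10^3
obj = gmdistribution([0 0.5; 0 8.2], cat(3, eye(2), eye(2)), [0.5 0.5]);
RV_temp = random(obj, P);
RV2 = ksdensity(RV_temp(:,2), RV_temp(:,2), 'function', 'cdf');
[f, x] = ecdf(RV2);
plot(x, f, x, x, '--');                                % ecdf of RV2 vs. the U[0,1] cdf
legend('ecdf of RV2', 'U[0,1] cdf', 'Location', 'southeast');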

Should I perform data centering before apply SVD?

I have to use SVD in MATLAB to obtain a reduced version of my data.
I've read that the function svds(X,k) performs the SVD and returns the first k singular values and vectors. There is no mention in the documentation of whether the data have to be normalized.
By normalization I mean both subtraction of the mean value and division by the standard deviation.
When I implemented PCA, I used to normalize in this way. But I know it is not needed when using the MATLAB function pca(), because it computes the covariance matrix using cov(), which implicitly performs the centering.
So, the question is: I need the projection matrix to reduce my n-dim data to k-dim data by SVD. Should I normalize the training data (and therefore apply the same normalization to new data before projecting it), or not?
Thanks
Essentially, the answer is yes: you should typically perform normalization. The reason is that features can have very different scales, and we typically do not want scale to dominate when judging how distinct the features are.
Suppose we have two features x and y, both with variance 1, but where x has a mean of 1 and y has a mean of 1000. Then the matrix of samples will look like
n = 500; % samples
x = 1 + randn(n,1);
y = 1000 + randn(n,1);
svd([x,y])
But the problem is that the scale of y (without normalizing) essentially washes out the small variations in x. Specifically, if we just examine the singular values of [x,y], we might be inclined to say that x is a linear factor of y (since one of the singular values is much smaller than the other). But we know that is not the case, since x was generated independently.
In fact, you will often find that you only see the "real" structure in a signal once you remove the mean. At the extreme end, imagine we have some feature
z = 1e6 + sin(t)
Now if somebody just gave you those numbers, you might look at the sequence
z = 1000001.54, 1000001.2, 1000001.4,...
and just think, "that signal is boring, it's basically 1e6 plus some round-off terms...". But once we remove the mean, we see the signal for what it actually is: a very interesting and specific one indeed. So, long story short, you should always remove the mean and scale.
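A tiny illustration of this point (my own sketch, not part of the original answer): compare the only singular value of the raw signal with that of the mean-removed signal.
t = linspace(0, 10*pi, 1000)';   % five full periods of the sine
z = 1e6 + sin(t);
svd(z)              % dominated by the 1e6 offset (on the order of 1e7)
svd(z - mean(z))    % reflects the oscillation itself (on the order of 10)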
It really depends on what you want to do with your data. Centering and scaling can be helpful for obtaining principal components that are representative of the shape of the variation in the data, irrespective of scaling. I would say it is mostly needed if you want to use the principal components themselves, particularly if you want to visualize them. It can also help during classification, since your scores will then be normalized, which may help your classifier. However, it depends on the application, since in some applications the energy also carries useful information that should not be discarded - there is no general answer!
Now, you write that all you need is "the projection matrix useful to reduce my n-dim data to k-dim ones by SVD". In that case, there is no need to center or scale anything:
[U,~] = svd(TrainingData);
ReducedData = U(:,1:k)'*TestData;
will do the job. svds may be worth considering when your TrainingData is huge in both dimensions, so that svd is too slow (if it is huge in only one dimension, just apply svd to the Gram matrix).
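A hedged sketch of that Gram-matrix trick (my own, assuming features are rows and samples are columns as in the projection above, with far more samples than features): the eigenvectors of TrainingData*TrainingData' are the left singular vectors of TrainingData, so the projection can be built without a full svd.
G = TrainingData * TrainingData';     % small n-by-n matrix, n = number of features
[U, D] = eig(G);                      % eigenvectors of X*X' = left singular vectors of X
[~, idx] = sort(diag(D), 'descend');  % order by decreasing eigenvalue
U = U(:, idx);
ReducedData = U(:,1:k)' * TestData;   % same projection as before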
It depends!!!
A common use in signal processing where it makes no sense to normalize is noise reduction via dimensionality reduction of correlated signals, where all the features are contaminated with random Gaussian noise of the same variance. In that case, if the magnitude of a certain feature is twice as large, its SNR is also approximately twice as large, so normalizing the features makes no sense: it would just amplify the parts with the worse SNR and shrink the parts with the good SNR. You also don't need to subtract the mean in that case (as in PCA); the mean (or DC component) is no different from any other frequency.

Simple binary logistic regression using MATLAB

I'm working on doing a logistic regression using MATLAB for a simple classification problem. My covariate is one continuous variable ranging between 0 and 1, while my categorical response is a binary variable of 0 (incorrect) or 1 (correct).
I'm looking to run a logistic regression to establish a predictor that would output the probability of some input observation (e.g. the continuous variable as described above) being correct or incorrect. Although this is a fairly simple scenario, I'm having some trouble running this in MATLAB.
My approach is as follows: I have one column vector X that contains the values of the continuous variable, and another equally-sized column vector Y that contains the known classification of each value of X (e.g. 0 or 1). I'm using the following code:
[b,dev,stats] = glmfit(X,Y,'binomial','link','logit');
However, this gives me nonsensical results with a p = 1.000, coefficients (b) that are extremely high (-650.5, 1320.1), and associated standard error values on the order of 1e6.
I then tried using an additional parameter to specify the size of my binomial sample:
glm = GeneralizedLinearModel.fit(X,Y,'distr','binomial','BinomialSize',size(Y,1));
This gave me results more in line with what I expected. I extracted the coefficients, used glmval to create estimates (Y_fit = glmval(b,[0:0.01:1],'logit');), and created an array for the fit (X_fit = linspace(0,1)). When I overlaid plots of the original data and the model using figure, plot(X,Y,'o',X_fit,Y_fit,'-'), the resulting plot of the model essentially looked like the lower quarter of the 'S'-shaped curve that is typical of logistic regression plots.
My questions are as follows:
1) Why did my use of glmfit give strange results?
2) How should I go about addressing my initial question: given some input value, what's the probability that its classification is correct?
3) How do I get confidence intervals for my model parameters? glmval should be able to take the stats output from glmfit as input, but my use of glmfit is not giving correct results.
Any comments and input would be very useful, thanks!
UPDATE (3/18/14)
I found that mnrval seems to give reasonable results. I can use [b_fit,dev,stats] = mnrfit(X,Y+1); where Y+1 simply turns my binary classifier into a nominal one.
I can loop through [pihat,lower,upper] = mnrval(b_fit,loopVal(ii),stats); to get various pihat probability values, where loopVal = linspace(0,1) or some appropriate input range and ii = 1:length(loopVal).
The stats parameter has a great correlation coefficient (0.9973), but the p-values for b_fit are 0.0847 and 0.0845, which I'm not quite sure how to interpret. Any thoughts? Also, why would mnrfit work where glmfit did not in my example? I should note that the p-values for the coefficients when using GeneralizedLinearModel.fit were both p << 0.001, and the coefficient estimates were quite different as well.
Finally, how does one interpret the dev output from the mnrfit function? The MATLAB documentation states that it is "the deviance of the fit at the solution vector. The deviance is a generalization of the residual sum of squares." Is this useful as a stand-alone value, or is it only meaningful in comparison to dev values from other models?
It sounds like your data may be linearly separable. In short, since your input data is one-dimensional, that means there is some value xDiv such that all values of x < xDiv belong to one class (say y = 0) and all values of x > xDiv belong to the other class (y = 1).
If your data were two-dimensional, this would mean you could draw a line through your two-dimensional space X such that all instances of a particular class are on one side of the line.
This is bad news for logistic regression (LR), as LR isn't really meant to deal with problems where the data are linearly separable.
Logistic regression is trying to fit a function of the following form:
y = 1 / (1 + exp(-(b0 + b1*x)))
This will only return values of y = 0 or y = 1 when the expression inside the exponential in the denominator goes to infinity or negative infinity.
Now, because your data are linearly separable and MATLAB's LR functions attempt to find a maximum-likelihood fit for the data, you will get extreme weight values.
This isn't necessarily a solution, but try flipping the label on just one of your data points (so for some index t where y(t) == 0, set y(t) = 1). This will make your data no longer linearly separable, and the learned weight values will be dragged dramatically closer to zero.
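A toy sketch of this effect (my own, not from the answer): with perfectly separable 1-D data, glmfit warns and returns huge coefficients and standard errors, much as described in the question; flipping a single label tames the fit.
X = [0.1 0.2 0.3 0.4 0.6 0.7 0.8 0.9]';
Y = [0 0 0 0 1 1 1 1]';                                  % perfectly separable at x ~ 0.5
[b1,~,s1] = glmfit(X, Y, 'binomial', 'link', 'logit');   % huge |b|, huge standard errors
Y(1) = 1;                                                % flip one label, as suggested above
[b2,~,s2] = glmfit(X, Y, 'binomial', 'link', 'logit');   % much more moderate coefficients
[b1 b2]
[s1.se s2.se]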

Calculating confidence intervals for a non-normal distribution

First, I should specify that my knowledge of statistics is fairly limited, so please forgive me if my question seems trivial or perhaps doesn't even make sense.
I have data that doesn't appear to be normally distributed. Typically, when I plot confidence intervals, I would use the mean +- 2 standard deviations, but I don't think that is acceptable for a non-normal distribution. My sample size is currently set to 1000 samples, which seems like enough to determine whether or not the data are normally distributed.
I use Matlab for all my processing, so are there any functions in Matlab that would make it easy to calculate the confidence intervals (say 95%)?
I know there are the 'quantile' and 'prctile' functions, but I'm not sure if that's what I need to use. The function 'mle' also returns confidence intervals for normally distributed data, although you can also supply your own pdf.
Could I use ksdensity to create a pdf for my data, then feed that pdf into the mle function to give me confidence intervals?
Also, how would I go about determining whether my data are normally distributed? I can currently tell just by looking at the histogram or the pdf from ksdensity, but is there a way to measure it quantitatively?
Thanks!
There are a couple of questions there. Here are some suggestions.
You are right that the mean of 1000 samples should be approximately normally distributed (unless your data are "heavy-tailed", which I'm assuming is not the case). To get a 1-alpha confidence interval for the mean (in your case alpha = 0.05) you can use the norminv function. For example, say we want a 95% CI for the mean of a sample of data X; then we can type
N = 1000; % sample size
X = exprnd(3,N,1); % sample from a non-normal distribution
mu = mean(X); % sample mean (normally distributed)
sig = std(X)/sqrt(N); % sample standard deviation of the mean
alphao2 = .05/2; % alpha over 2
CI = [mu + norminv(alphao2)*sig ,...
mu - norminv(alphao2)*sig ]
CI =
2.9369 3.3126
Testing whether a data sample is normally distributed can be done in many ways. One simple method is a QQ plot. To do this, use qqplot(X), where X is your data sample. If the result is approximately a straight line, the sample is normal; if not, the sample is not normal.
For example if X = exprnd(3,1000,1) as above, the sample is non-normal and the qqplot is very non-linear:
X = exprnd(3,1000,1);
qqplot(X);
On the other hand if the data is normal the qqplot will give a straight line:
qqplot(randn(1000,1))
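If you want a number rather than a picture (my own addition, assuming the Statistics Toolbox is available), there are also formal normality tests; h = 1 means normality is rejected at the 5% level:
X = exprnd(3, 1000, 1);                % clearly non-normal sample
[h_lillie, p_lillie] = lillietest(X)   % Lilliefors test
[h_jb, p_jb] = jbtest(X)               % Jarque-Bera test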
You might also consider bootstrapping, with the bootci function.
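For instance, a minimal bootci sketch (my own) for a 95% CI of the mean of a non-normal sample:
X = exprnd(3, 1000, 1);
ci = bootci(2000, @mean, X)   % 2000 bootstrap resamples, 95% interval by default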
You may use the method proposed in [1]:
MEDIAN +/- 1.7 * (1.25 * R / (1.35 * SQN))
where R is the interquartile range and SQN is the square root of the sample size N.
This is often used in notched box plots, a useful data visualization for non-normal data. If the notches of two medians do not overlap, the medians are, approximately, significantly different at about a 95% confidence level.
[1] McGill, R., J. W. Tukey, and W. A. Larsen. "Variations of Boxplots." The American Statistician. Vol. 32, No. 1, 1978, pp. 12–16.
Are you sure you need confidence intervals, or just the central 90% range of the data?
If you need the latter, I suggest you use prctile(). For example, if you have a vector x holding independent, identically distributed samples of a random variable, you can get some useful information by running
y = prctile(x, [5 50 95])
This returns in [y(1), y(3)] the range containing 90% of your samples, and in y(2) the median of the sample.
Try the following example (using a normally distributed variable):
t = 0:99;
tt = repmat(t, 1000, 1);
x = randn(1000, 100) .* tt + tt; % simple gaussian model with varying mean and variance
y = prctile(x, [5 50 95]);
plot(t, y);
legend('5%','50%','95%')
I have not used MATLAB, but from my understanding of statistics, if your distribution cannot be assumed to be normal, then you have to treat it as a Student's t distribution and calculate the confidence interval and accuracy accordingly.
http://www.stat.yale.edu/Courses/1997-98/101/confint.htm

MATLAB's fminsearch function

I have two images I'm trying to co-register - i.e., one could be of a ball in the centre of the picture, the other of the same ball near the edge, and I'm trying to find the number of pixels I have to move the second image so that the balls are in the same place. (I'm actually using 3D MRI brain scans, but the principle is the same.)
I've written a function that will move the ball left, right, up or down by a given number of pixels as well as another function that compares the correlation of the ball-in-the-centre image with the translated ball-at-the-edge image. When the two balls are in the same place the correlation function will return 0 and a number larger than 0 for other positions.
I'm trying to use fminsearch (documentation) to find the optimal translation for the correlation function's minimum (ie, the balls being in the same place) like so:
global reference_im unknown_im;
starting_trans = [0 0 0];
trans_vector = fminsearch(@correlate_images,starting_trans)
correlate_images.m:
function r = correlate_images(translate)
global reference_im unknown_im;
new_im = move_image(unknown_im,translate(1),translate(2),translate(3));
% This bit is unimportant to the question
% but you can see how I calculate my correlation
r = 1 - corr(reshape(new_im,[],1),reshape(reference_im,[],1));
There are two problems. Firstly, fminsearch insists on passing floating-point values for the translation vector into the correlate_images function. Is there any way to tell it that only integers are necessary? (It would save a large number of CPU cycles!)
Secondly, when I run this program the resulting trans_vector is always the same as starting_trans. I assume this is because no minimum has been found, but is there another reason it's just plain not working?
Many thanks!
EDIT
I've discovered what I think is the reason the output trans_vector is always the same as starting_trans. fminsearch looks at the starting value, then at a small increment in each direction from there; this small increment is always less than one, which means the correlation result is always a perfect match (since move_image returns the same as the input image for sub-pixel movements). I'm going to keep working on convincing MATLAB to run fminsearch over integer values only!
First, I'd say that MATLAB might not be the best tool for this problem. I'd look at Elastix, a fairly user-friendly wrapper around the registration functions in ITK. You get a variety of registration techniques, and the manuals for both programs do a good job of explaining the specifics of image registration.
Second, for this kind of simple translational registration, you can use the FFT. Forward-transform both images, multiply one transform pointwise by the complex conjugate of the other (that is, use A .* conj(B), not A * B - these are different operations, and the former is what you want), and there will be a peak in the inverse transform whose offset from the origin is the translation you need. Numerical Recipes in C has a good explanation. The speed difference between the FFT version and the direct correlation version is huge: the FFT is O(N log N), while the correlation method is O(N * M), where M is the number of pixels in your search neighbourhood. If you allow the entire image to be searched, correlation becomes O(N * N), which takes much longer than the FFT version. Changing parameters from floats to integers won't solve that problem.
The reason the fminsearch function uses floats (if I may guess at the reasons behind the developers' decisions) is that for problems that aren't test problems (i.e., spheres in a volume), you often need sub-pixel resolution to perform a correct registration. Take a look at the ITK documentation for the reasoning behind this approach.
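A rough sketch of that FFT approach (my own, assuming two equally sized 2-D grayscale images and reusing the variable names from the question); note that the recovered offsets wrap around modulo the image size:
F = fft2(reference_im) .* conj(fft2(unknown_im));   % pointwise product with the conjugate
c = real(ifft2(F));                                 % cross-correlation surface
[~, idx] = max(c(:));                               % peak of the correlation
[dy, dx] = ind2sub(size(c), idx);
shift = [dy dx] - 1                                 % estimated translation (modulo image size)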
Third, I'd suggest that a good way to write this program in MATLAB (if you still want to do so!) while still forcing integer translations would be to avoid the fminsearch function, which will want to use floats. Try something like:
startXPos = -10; % these parameters dictate the size of your search neighborhood
startYPos = -10; % corresponds to M in the explanation above
endXPos = 10;
endYPos = 10;
optimalX = 0;
optimalY = 0;
maxCorrVal = 0;
for i = startXPos:endXPos
    for j = startYPos:endYPos
        % test the correlation of the two images here, where one image is
        % shifted relative to the other
        currCorrVal = Correlate(image1, image2OffsetByiAndj);
        if (currCorrVal > maxCorrVal)
            maxCorrVal = currCorrVal;
            optimalX = i;
            optimalY = j;
        end
    end
end
From here, you just have to write the offset function. This way, you avoid the float problem, and you're also incrementing your translation vector (I don't see any way for that vector to move in your provided functions, which probably explains your lack of movement).
There is a very similar demo in the Image Processing Toolbox that uses the normalized cross-correlation function normxcorr2 to perform image registration. To avoid repeating the same thing, check out the demo directly:
Registering an Image Using Normalized Cross-Correlation
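In condensed form, that approach looks roughly like the sketch below (my own summary; template and background are hypothetical names for a small patch and the larger image it is sought in):
c = normxcorr2(template, background);          % normalized cross-correlation surface
[~, idx] = max(abs(c(:)));                     % location of the best match
[ypeak, xpeak] = ind2sub(size(c), idx);
offset = [ypeak - size(template,1), xpeak - size(template,2)]   % top-left corner of the match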