MATLAB custom curve fitting fails

I'm using cftool for a custom fit of Mössbauer spectroscopy data. There are two coefficients, Gamma and N0.
N = f(v) = (299792458000^2*Gamma^2*N0) / (299792458000^2*Gamma^2 + 4*v^2*(4.29383292e-15)^2)
Using default settings (Trust-region, robust off, etc.) I get the following fit:
Fit computation did not converge:
Fitting stopped because the number of iterations or function evaluations exceeded the specified maximum.
Fit found when optimization terminated:
General model:
f(v) = (299792458000^2*Gamma^2*N0)/(299792458000^2*Gamma^2+4*v^2*(4.29383292e-15)^2)
Coefficients (with 95% confidence bounds):
Gamma = 0.9137 (-Inf, Inf)
N0 = 2.454e+04 (2.059e+04, 2.849e+04)
Goodness of fit:
SSE: 6.41e+11
R-square: -2068
Adjusted R-square: -2073
RMSE: 4.013e+04
Warning: A negative R-square is possible if the model does not contain a constant term and the fit is poor (worse than just fitting the mean). Try changing the model or using a different StartPoint.
If I switch to Levenberg-Marquardt, I get a straight line through the data:
General model:
f(v) = (299792458000^2*Gamma^2*N0)/(299792458000^2*Gamma^2+4*v^2*(4.29383292e-15)^2)
Coefficients (with 95% confidence bounds):
Gamma = 0.793 (-Inf, Inf)
N0 = 6.456e+04 (6.447e+04, 6.465e+04)
Goodness of fit:
SSE: 3.098e+08
R-square: 2.22e-16
Adjusted R-square: -0.002513
RMSE: 882.3
Why is this failing so badly in both cases?

f(v) simplifies to f(v) = N0/(1 + (2.8645e-26*(v/Gamma))^2), so the 1 in the denominator dominates until v/Gamma gets as big as about 10^25. With your Gamma of 0.793 and your |v| of at most 15, the model is numerically constant in v, so I think MATLAB will have a hard time converging to anything other than N0.
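One way around this, as a rough sketch (xv and yv are placeholder names for the velocity and count vectors): absorb the huge constants into a single effective width parameter w = Gamma/2.8645e-26, so the fitter only works with numbers on the scale of the data.
% Plain Lorentzian with an effective width w in the same units as v
ft = fittype('N0/(1 + (v/w)^2)', 'independent', 'v', 'coefficients', {'N0', 'w'});
f = fit(xv(:), yv(:), ft, 'StartPoint', [max(yv), 2]);  % start near the data scale
Gamma = 2.8645e-26 * f.w;  % undo the reparameterization
This at least gives the optimizer a well-conditioned problem; if the recovered Gamma then comes out at an unphysical scale, the constants in the original model probably have inconsistent units.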

Related

Testing statistical significance of multiple linear regression with known variance and α-level rejection region

In a multiple linear regression model with two regressors, how would a test for significance change when the variance is known?
The null hypothesis would be H0: beta_1 = beta_2 = 0.
With the variance known, we wouldn't have to estimate sigma-squared, but can we still use the F-test F0 = MSR/MSRes?
I know that the estimate of sigma-squared is MSRes, but does that still hold when sigma-squared is known?

Is this code for nonlinear regression in MATLAB right?

I am trying to solve this example in MATLAB but I can't get the right answer. I use this code:
clear all;clc;
x=[4 2.25 1.45 1.0 0.65 0.25 0.006];
y=[ 0.398 0.298 0.238 0.198 0.158 0.098 0.048];
n=length(x);
sumx=sum(log10(x));
sumy=sum(log10(y));
sum2x=sum(log10(x));
sum3x=sum(log10(y));
sum4x=sum(log10(x.*y));
sumxy=sum(log10(x.^2));
sumx2y=sum(log10((x.^2) .*y));
m1=[n sumx sum2x;sumx sum2x sum3x;sum2x sum3x sum4x]
m2=[sumy;sumxy;sumx2y]
m3=inv(m1)*m2;
plot(x,y)
Isn't this just a power fit of the data (-r and C) to obtain the constants n and k?
If so, you can just use the Curve Fitting app in MATLAB:
General model Power1:
f(x) = a*x^b
Coefficients (with 95% confidence bounds):
a = 0.2003 (0.1826, 0.2179)
b = 0.4889 (0.4028, 0.5751)
Goodness of fit:
SSE: 0.001052
R-square: 0.9877
Adjusted R-square: 0.9852
RMSE: 0.0145
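For reference, a minimal sketch of the same fit done programmatically rather than in the app, using the x and y from the question; note that the log-space line fit weights the points differently, so its coefficients can differ noticeably from the app's nonlinear power1 fit:
x = [4 2.25 1.45 1.0 0.65 0.25 0.006];
y = [0.398 0.298 0.238 0.198 0.158 0.098 0.048];
% Linearize y = a*x^b as log10(y) = log10(a) + b*log10(x)
p = polyfit(log10(x), log10(y), 1);
b = p(1);     % exponent
a = 10^p(2);  % prefactor
% Or, with the Curve Fitting Toolbox, the nonlinear fit directly:
% f = fit(x(:), y(:), 'power1');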

Command `gmdistribution` in Matlab?

I have a question on the Matlab command gmdistribution to generate draws from mixtures of Gaussians.
Consider the following code to draw from a mixture of two bivariate normals
clear
rng default
P=10^4; %number draws
%First component (X1,X2)
v=1;
mu_a = [0,2];
sigma_a = [v,0;0,v];
%Second component (Y1,Y2)
mu_b = [0,4];
sigma_b = [v,0;0,v];
MU = [mu_a;mu_b];
SIGMA = cat(3,sigma_a,sigma_b);
w = ones(1,2)/2; %equal weight 0.5
obj = gmdistribution(MU,SIGMA,w);
%Draws of the mixture (R1,R2)
R = random(obj,P);%nx2
We know that (R1, R2) may be correlated. Indeed, writing (W1, W2) for the mixture, we can show that
cov(W1, W2) = 1/4*cov(X1, Y2) + 1/4*cov(X2, Y1)
because
cov(W1, W2) = E(W1*W2) - E(W1)*E(W2)
= 1/4*E(X1*X2) + 1/4*E(X1*Y2) + 1/4*E(Y1*X2) + 1/4*E(Y1*Y2) - [1/2*E(X1) + 1/2*E(Y1)]*[1/2*E(X2) + 1/2*E(Y2)]
= 1/4*cov(X1, Y2) + 1/4*cov(Y1, X2)
(the cov(X1, X2) and cov(Y1, Y2) terms vanish because both covariance matrices are diagonal).
However, if I check their correlation
corr(R(:,1), R(:,2))
I get almost zero (0.0024)
I checked many other values of MU and SIGMA, but I couldn't find any case with a correlation noticeably far from 0. Is this just a coincidence, or does gmdistribution impose that (X1, X2) is independent of (Y1, Y2)?
We can best illustrate the problem with a figure. To make the effect more visible, I decreased the variance of both components from 1 to 0.2 (v = 0.2). If we then draw some realisations from the mixture model, we get the following scatterplot:
Each "blob" corresponds to one component, one with its centre at (0, 2) and the other at (0, 4).
Now, the linear correlation coefficient tells us how much W2 increases when W1 increases by one. But as we can see, there is no such trend in the realisations: if W1 increases, W2 neither increases nor decreases.
This is due to both components having the same mean (0) in W1. If that is not the case, e.g. mu_a = [0,2]; and mu_b = [2,5];, we get the following plot:
Here it is clearly visible that if W1 is high, chances are that W2 is also high. This leads to a strong positive correlation of about 0.87. Summing up: if either mu_a(1) == mu_b(1) or mu_a(2) == mu_b(2), the correlation will be near zero.
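A quick numerical check of this claim (same setup as the question, with v = 0.2 and the shifted means from the second figure). For diagonal components one can show cov(W1, W2) = (mu_a(1) - mu_b(1))*(mu_a(2) - mu_b(2))/4, which vanishes exactly when the means agree in either coordinate:
rng default
MU = [0 2; 2 5];                         % mu_a and mu_b, now shifted in W1
SIGMA = cat(3, 0.2*eye(2), 0.2*eye(2));  % v = 0.2 for both components
obj = gmdistribution(MU, SIGMA, [0.5 0.5]);
R = random(obj, 1e4);
corr(R(:,1), R(:,2))                     % comes out close to 0.87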

Cost function in logistic regression gives NaN as a result

I am implementing logistic regression using batch gradient descent. There are two classes into which the input samples are to be classified. The classes are 1 and 0. While training the data, I am using the following sigmoid function:
t = 1 ./ (1 + exp(-z));
where
z = x*theta
And I am using the following cost function to calculate cost, to determine when to stop training.
function cost = computeCost(x, y, theta)
    htheta = sigmoid(x*theta);
    cost = sum(-y .* log(htheta) - (1-y) .* log(1-htheta));
end
I am getting a cost of NaN at each step, since the values of htheta are either 1 or 0 in most cases. What should I do to determine the cost value at each iteration?
This is the gradient descent code for logistic regression:
function [theta, cost_history] = batchGD(x, y, theta, alpha)
    cost_history = zeros(1000,1);
    for iter = 1:1000
        htheta = sigmoid(x*theta);
        new_theta = zeros(size(theta,1),1);
        for feature = 1:size(theta,1)
            new_theta(feature) = theta(feature) - alpha * sum((htheta - y) .* x(:,feature));
        end
        theta = new_theta;
        cost_history(iter) = computeCost(x, y, theta);
    end
end
There are two possible reasons why this may be happening to you.
The data is not normalized
This is because when you apply the sigmoid (logistic) function to your hypothesis, the output probabilities are almost all approximately 0 or 1, and with your cost function, log(1 - 1) or log(0) will produce -Inf. The accumulation of all of these individual terms in your cost function will eventually lead to NaN.
Specifically, if y = 0 for a training example and the output of your hypothesis is a very small number x close to 0, the first part of the cost function gives us -0*log(x), which is 0 times infinity and produces NaN. Similarly, if y = 1 for a training example and the output of your hypothesis is very close to 1, so that 1 minus it is a very small number x, the second part gives us -(1 - 1)*log(x), again 0 times infinity, and produces NaN. Simply put, the output of your hypothesis is either very close to 0 or very close to 1.
This is most likely because the dynamic range of each feature is widely different, so part of your hypothesis, specifically the weighted sum x*theta for each training example, will contain very large negative or positive values; applying the sigmoid function to these values gets you very close to 0 or 1.
One way to combat this is to normalize the data in your matrix before training with gradient descent. A typical approach is to normalize to zero mean and unit variance. Given an input feature x_k, where k = 1, 2, ..., n and you have n features, the new normalized feature x_k^{new} can be found by:
x_k^{new} = (x_k - m_k) / s_k
where m_k is the mean of feature k and s_k is the standard deviation of feature k. This is also known as standardizing data. You can read up on more details about this in another answer I gave here: How does this code for standardizing data work?
Because you are using the linear algebra approach to gradient descent, I'm assuming you have prepended your data matrix with a column of all ones. Knowing this, we can normalize your data like so:
mX = mean(x,1);
mX(1) = 0;
sX = std(x,[],1);
sX(1) = 1;
xnew = bsxfun(@rdivide, bsxfun(@minus, x, mX), sX);
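As a side note, not part of the original answer: on MATLAB R2016b and newer, implicit expansion makes the bsxfun calls unnecessary, so the same normalization can be written directly:
xnew = (x - mX) ./ sX;   % mX and sX broadcast across the rows of x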
The mean and standard deviations of each feature are stored in mX and sX respectively. You can learn how this code works by reading the post I linked above; I won't repeat it here because it isn't in the scope of this post. To ensure proper normalization, I've set the mean and standard deviation of the first column to 0 and 1 respectively. xnew contains the new normalized data matrix; use xnew with your gradient descent algorithm instead. Once you find the parameters, you must normalize any new test instances with the mean and standard deviation from the training set before predicting. Because the learned parameters are with respect to the statistics of the training set, the same transformations must be applied to any test data you submit to the prediction model.
Assuming you have new data points stored in a matrix called xx, you would normalize and then perform the predictions:
xxnew = bsxfun(@rdivide, bsxfun(@minus, xx, mX), sX);
Now that you have this, you can perform your predictions:
pred = sigmoid(xxnew*theta) >= 0.5;
You can change the 0.5 threshold to whatever you believe best separates the positive and negative classes.
The learning rate is too large
As you mentioned in the comments, once you normalize the data the costs appear to be finite, but then suddenly go to NaN after a few iterations. Normalization can only get you so far. If your learning rate alpha is too large, each iteration will overshoot in the direction of the minimum, making the cost at each iteration oscillate or even diverge, which appears to be what is happening here. In your case, the cost is diverging, growing at each iteration until it is so large that it can no longer be represented in floating point.
As such, another option is to decrease your learning rate alpha until you see that the cost function is decreasing at each iteration. A popular method for determining the best learning rate is to run gradient descent over a range of logarithmically spaced values of alpha, check the final cost for each, and choose the learning rate that yields the smallest cost.
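A minimal sketch of that sweep, assuming the batchGD and computeCost functions from the question and the normalized matrix xnew from above:
alphas = logspace(-4, 0, 9);          % candidate learning rates, 1e-4 to 1
final_costs = zeros(size(alphas));
for i = 1:numel(alphas)
    theta0 = zeros(size(xnew,2), 1);  % fresh parameters for each candidate
    [~, ch] = batchGD(xnew, y, theta0, alphas(i));
    final_costs(i) = ch(end);
end
[~, best] = min(final_costs);
alpha = alphas(best);                 % the rate with the smallest final cost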
Using the two fixes above together should allow gradient descent to converge quite nicely, assuming the cost function is convex, which for logistic regression it certainly is.
Let's assume you have an observation where:
the true value is y_i = 1
your model is quite extreme and says that P(y_i = 1) = 1
Then your cost function will get a value of NaN because you're adding 0 * log(0), which is undefined. Hence:
Your formula for the cost function has a problem (there is a subtle 0, infinity issue)!
As @rayryeng pointed out, 0 * log(0) produces NaN because 0 * Inf isn't kosher. This is actually a huge problem: if your algorithm believes it can predict a value perfectly, it incorrectly assigns a cost of NaN.
Instead of:
cost = sum(-y .* log(htheta) - (1-y) .* log(1-htheta));
You can avoid multiplying 0 by infinity by instead writing your cost function in MATLAB as:
y_logical = y == 1;
cost = sum(-log(htheta(y_logical))) + sum(-log(1 - htheta(~y_logical)));
The idea is that if y_i is 1, we add -log(htheta_i) to the cost, and if y_i is 0, we add -log(1 - htheta_i). This is mathematically equivalent to -y_i * log(htheta_i) - (1 - y_i) * log(1 - htheta_i), but without running into the numerical problems that essentially stem from htheta_i being equal to 0 or 1 within the limits of double-precision floating point.
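Another common guard, not from the original answer, is to clip the predictions away from exactly 0 and 1 before taking the logs, which keeps the vectorized form of the cost:
htheta = min(max(htheta, eps), 1 - eps);                 % eps is MATLAB's double-precision epsilon
cost = sum(-y .* log(htheta) - (1-y) .* log(1-htheta));  % now always finite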
It happened to me because of an indeterminate form of the type:
0*log(0)
This can happen when one of the predicted values Y equals either 0 or 1.
In my case the solution was to add a conditional expression to the Python code as follows:
y * np.log(Y) + (1 - y) * np.log(1 - Y) if (Y != 1 and Y != 0) else 0
This way, when the actual value (y) and the predicted one (Y) are equal, no cost needs to be computed, which is the expected behavior.
(Notice that when a given Y converges to 0, the left addend vanishes (because y = 0) and the right addend tends toward 0. The same happens when Y converges to 1, but with the opposite addend.)
(There is also a very rare scenario, which you probably won't need to worry about, where y = 0 and Y = 1 or vice versa, but if your dataset is standardized and the weights are properly initialized, it won't be an issue.)