Global minimum value using Gradient Descent for logistic regression using MATLAB

I am trying to calculate the global minimum value using the gradient descent algorithm in MATLAB.
The derivative form of the update rule is:
theta_j := theta_j - (alpha/m) * sum_{i=1}^{m} (h_theta(x^(i)) - y^(i)) * x_j^(i)
My Code is:
%Load and split the features and label
data = load('Data.txt');
X = data(:,1:2);
y = data(:,3);
%Number of training examples (needed before adding the intercept column)
m = size(X,1);
%Add the intercept term as a new feature
X = [ones(m,1),X];
%Define the sizes and the initial theta value
[m,n] = size(X);
alpha = 0.01;
iteration = 400;
theta = zeros(n,1);
[theta,hist] = GradientDescent(X,y,m, alpha, theta, iteration);
J = CostFunction(X,y,theta,m);
fprintf('The cost for initial value zero is: %f\n',J);
fprintf('Minimum Gradient Descent is:\n')
fprintf('%f\n',theta)
My GradientDescent function is:
function [theta,hist] = GradientDescent(X,y,m,alpha, theta, iteration)
hist = zeros(iteration,1); % one cost entry per iteration
for i = 1:iteration
z = X * theta;
sigmoid = Sigmoid(z);
theta = theta - sum((alpha/m) *(X' * (sigmoid - y)));
hist(i) = CostFunction(X,y,theta,m);
end
end
Cost Function is:
function [J,grad] = CostFunction(X,y,theta,m)
J = 0;
z = X * theta;
hpyX = Sigmoid(z);
J = (1/m) .* ((-y' * log(hpyX)) - ((1-y)' * log(1-hpyX)));
grad = (1/m) * (X' * (hpyX - y));
end
and Sigmoid function is:
function sigmoid = Sigmoid(z)
sigmoid = zeros(size(z));
sigmoid = 1 ./ (1 + exp(-z));
end
I am getting this output:
The cost for initial value zero is: NaN
Minimum Gradient Descent is:
0.413048
0.413048
0.413048
On the other hand, I get a different result for 400 iterations with fminunc:
option = optimset('GradObj','on','MaxIter',400);
[theta, cost] = fminunc(@(t)(CostFunction(X,y,t,m)),theta,option);
fprintf('The cost function by fminunc is: %f\n',J)
fprintf('Theta is:\n')
fprintf('%f\n',theta)
Output
The cost function by fminunc is: 0.693147
Theta is:
-25.161343
0.206232
0.201472
I can't understand why I am getting different results. Did I make a mistake in the gradient descent function? Also, I am getting a NaN value for the cost function when I calculate gradient descent.
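For reference, here is a minimal sketch of the standard vectorized update (the function name GradientDescentFixed is my own illustration, not the original code). The sum() in the update above collapses the n-by-1 gradient into a single scalar, so every component of theta receives the same step, which is why all three values come out identical; with all thetas equal, X*theta saturates the sigmoid, and products like 0*log(0) in the cost then likely evaluate to NaN:
function [theta,hist] = GradientDescentFixed(X,y,m,alpha,theta,iteration)
hist = zeros(iteration,1);
for i = 1:iteration
h = Sigmoid(X * theta); % m-by-1 vector of predictions
grad = (1/m) * (X' * (h - y)); % n-by-1 gradient, one entry per parameter
theta = theta - alpha * grad; % no sum(): each theta gets its own step
hist(i) = CostFunction(X,y,theta,m);
end
end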

Related

Gradient descent and normal equation not giving the same results, why?

I am working on a simple script that tries to find values for my hypothesis. I am using gradient descent for one and the normal equation for the other. The normal equation gives me the proper results, but my gradient descent does not. I can't figure out why it is not working in such a simple case.
Hi, I am trying to understand why my gradient descent does not match the normal equation on linear regression. I am using MATLAB to implement both. Here's what I tried:
So I created a dummy training set as such:
x = {1 2 3}, y = {2 3 4}
so my hypothesis should converge to theta = {1 1}, so that I get a simple
h(x) = 1 + x;
Here's the test code comparing normal equation and gradient descent:
clear;
disp("gradient descend");
X = [1; 2; 3];
y = [2; 3; 4];
theta = [0 0];
num_iters = 10;
alpha = 0.3;
thetaOut = gradientDescent(X, y, theta, 0.3, 10); % GD -> does not work, why?
disp(thetaOut);
clear;
disp("normal equation");
X = [1 1; 1 2; 1 3];
y = [2;3;4];
Xt = transpose(X);
theta = pinv(Xt*X)*Xt*y; % normal equation -> works!
disp(theta);
And here is the inner loop of the gradient descent:
samples = length(y);
for epoch = 1:iterations
hipoth = X * theta;
factor = alpha * (1/samples);
theta = theta - factor * ((hipoth - y)' * X )';
%disp(epoch);
end
and the output after 10 iterations:
gradient descend = 1.4284 1.4284 -> wrong
normal equation = 1.0000 1.0000 -> correct
This does not make sense; it should converge to {1 1}.
Any ideas? Do I have a MATLAB syntax problem?
Thank you!
Gradient descent can solve a lot of different problems. You want to do a linear regression, i.e. find a linear function h(x) = theta_1 * x + theta_2 that best fits your data:
h(X) = Y + error
What the "best" fit is, is debatable. The most common way to define best fit is to minimize the square of the errors between fit and actual data. Assuming that is what you want ...
Replace the function with
function [theta] = gradientDescent(X, Y, theta, alpha, num_iters)
n = length(Y);
for epoch = 1:num_iters
Y_pred = theta(1)*X + theta(2);
D_t1 = (-2/n) * X' * (Y - Y_pred);
D_t2 = (-2/n) * sum(Y - Y_pred);
theta(1) = theta(1) - alpha * D_t1;
theta(2) = theta(2) - alpha * D_t2;
end
end
and change your parameters a bit, e.g.
num_iters = 10000;
alpha = 0.05;
you get the correct answer. I took the code snippet from here which might also provide a nice starting point to read up on what is actually happening here.
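As a quick check, a hypothetical driver for the replaced function (the values are my own, consistent with the test code above):
X = [1; 2; 3];
Y = [2; 3; 4];
theta = gradientDescent(X, Y, [0; 0], 0.05, 10000);
disp(theta); % should approach [1; 1], i.e. h(x) = 1 + x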
Your gradient descent is solving a different problem than the normal equation: you are not feeding them the same data. On top of that, you seem to overcomplicate the theta update a bit, but that is not a problem. Minor changes to your code result in proper output:
function theta=gradientDescent(X,y,theta,alpha,iterations)
samples = length(y);
for epoch = 1:iterations
hipoth = X * theta;
factor = alpha * (1/samples);
theta = theta - factor * X'*(hipoth - y);
%disp(epoch);
end
end
and the main code:
clear;
X = [1 1; 1 2; 1 3];
y = [2;3;4];
theta = [0 0];
num_iters = 10;
alpha = 0.3;
thetaOut = gradientDescent(X, y, theta', 0.3, 600); % Iterate a bit more, you impatient person!
theta = pinv(X.'*X)*X.'*y; % normal equation -> works!
disp("gradient descend");
disp(thetaOut);
disp("normal equation");
disp(theta);

Gradient Descent Overshooting and Cost Blowing Up when used for Regularized Logistic Regression

I'm using MATLAB to code Regularized Logistic Regression and am using Gradient Descent to discover the parameters. All is based on Andrew Ng's Coursera Machine Learning course. I am trying to code the cost function from Andrew's notes/videos. I am not entirely sure if I'm doing it right.
The main problem is... if the number of iterations gets too large, my cost seems to blow up. This happens regardless of whether I normalize or not (converting all the data to be between 0 and 1). The problem also causes the produced decision boundary to shrink (underfit?). Below are three sample results, where the decision boundaries from GD are compared against those from MATLAB's fminunc.
As can be seen, the cost shoots up when the number of iterations increases. Could it be that I incorrectly coded the cost? Or is there indeed a possibility that Gradient Descent can overshoot? If it helps, I am providing my code. The code I used to calculate the cost history is:
costHistory(i) = (-1 * ( (1/m) * y'*log(h_x) + (1-y)'*log(1-h_x))) + ( (lambda/(2*m)) * sum(theta(2:end).^2) );
based on the equation below:
J(theta) = -(1/m) * sum_{i=1}^{m} [ y^(i)*log(h_theta(x^(i))) + (1 - y^(i))*log(1 - h_theta(x^(i))) ] + (lambda/(2*m)) * sum_{j=2}^{n} theta_j^2
The full code is given below. Note that I have called other functions as well in this code. Would appreciate any pointers! :) Thank you in advance!
% REGULARIZED Logistic Regression with Gradient Descent
clc; clear all; close all;
dataset = load('ex2data2.txt');
x = dataset(:,1:end-1); y = dataset(:,end); m = length(y);
% Mapping the features (includes adding the intercept term)
x = mapFeature(x(:,1), x(:,2)); % Change to polynomial of the 6th degree
% Define the initial thetas. Same as the number of features, including
% the newly added intercept term (1s)
theta = zeros(size(x,2),1) + 0.05;
initial_theta = theta; % will be used later...
% Set lambda equals to 1
lambda = 1;
% calculate theta transpose x and also the hypothesis h_x
alpha = 0.005;
itr = 120000; % number of iterations set to 120K
for i = 1:itr
ttrx = x * theta; % theta transpose x
h_x = 1 ./ (1 + exp(-ttrx)); % sigmoid hypothesis
error = h_x - y;
% the gradient a.k.a. the derivative of J(\theta)
for j = 1:length(theta)
if j == 1
gradientA(j,1) = 1/m * (error)' * x(:,j);
theta(j) = theta(j) - alpha * gradientA(j,1);
else
gradientA(j,1) = (1/m * (error)' * x(:,j)) - (lambda/m)*theta(j);
theta(j) = theta(j) - alpha * gradientA(j,1);
end
end
costHistory(i) = (-1 * ( (1/m) * y'*log(h_x) + (1-y)'*log(1-h_x))) + ( (lambda/(2*m)) * sum(theta(2:end).^2) );
end
[cost, grad] = costFunctionReg(initial_theta, x, y, lambda);
% Using MATLAB's built-in function fminunc to minimize the cost function
% Set options for fminunc
options = optimset('GradObj', 'on', 'MaxIter', 500);
% Run fminunc to obtain the optimal theta
% This function will return theta and the cost
[thetafm, cost] = fminunc(@(t)(costFunctionReg(t, x, y, lambda)), initial_theta, options);
close all;
plotDecisionBoundary_git(theta, x, y); % based on GD
plotDecisionBoundary_git(thetafm, x, y); % based on fminunc
figure;
plot(1:itr, costHistory(:), '--r');
title('The cost history based on GD');
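For comparison, the standard regularized gradient from the course material adds the penalty term for j >= 2 rather than subtracting it; a sketch of that inner update (my illustration, in the same style as the loop above) would be:
gradientA(j,1) = (1/m * (error)' * x(:,j)) + (lambda/m)*theta(j); % note the plus sign
Subtracting the term, as in the loop above, pushes the non-bias thetas away from zero on every step, which is consistent with the cost blowing up as the number of iterations grows.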

[Octave] Using fminunc is not always giving a consistent solution

I am trying to find the coefficients in an equation to model the step response of a motor, which is of the form 1-e^x. The equation I'm using to model it is of the form
a(1)*t^2 + a(2)*t^3 + a(3)*t^4 + ...
(It is derived in a research paper used to solve for motor parameters)
Sometimes using fminunc to find the coefficients works out okay and I get a good result that matches the training data fairly well. Other times the returned coefficients are horrible (the modeled output goes extremely far above what it should be and is orders of magnitude off). This especially happens once I start using higher-order terms: any model that uses x^8 or higher (x^9, x^10, x^11, etc.) always produces bad results.
Since it works sometimes, I can't think why my implementation would be wrong. I have tried fminunc both while providing the gradients and while not providing them, yet there is no difference. I've looked into using other functions to solve for the coefficients, like polyfit, but polyfit requires terms raised to every power from 1 up to the highest-order term, while the model I'm using has its lowest power at 2.
Here is the main code:
clear;
%Overall Constants
max_power = 7;
%Loads in data
%data = load('TestData.txt');
load testdata.mat
%Sets data into variables
indep_x = data(:,1); Y = data(:,2);
%number of data points
m = length(Y);
%X is a matrix with the independent variable
exps = [2:max_power];
X_prime = repmat(indep_x, 1, max_power-1); %Repeats columns of the indep var
X = bsxfun(@power, X_prime, exps);
%Initializes theta to rand vals
init_theta = rand(max_power-1,1);
%Sets up options for fminunc
options = optimset( 'MaxIter', 400, 'Algorithm', 'quasi-newton');
%fminunc minimizes the output of the cost function by changing the theta parameters
[theta, cost] = fminunc(@(t)(costFunction(t, X, Y)), init_theta, options)
%
Y_line = X * theta;
figure;
hold on; plot(indep_x, Y, 'or');
hold on; plot(indep_x, Y_line, 'bx');
And here is costFunction:
function [J, Grad] = costFunction (theta, X, Y)
%# of training examples
m = length(Y);
%Initialize Cost and Grad-Vector
J = 0;
Grad = zeros(size(theta));
%Produces an output based off the current values of theta
model_output = X * theta;
%Computes the squared error for each example then adds them to get the total error
squared_error = (model_output - Y).^2;
J = (1/(2*m)) * sum(squared_error);
%Computes the gradients for each theta t
for t = 1:size(theta, 1)
Grad(t) = (1/m) * sum((model_output-Y) .* X(:, t));
end
endfunction
Any help or advice would be appreciated.
Try adding regularization to your costFunction:
function [J, Grad] = costFunction (theta, X, Y, lambda)
m = length(Y);
%Initialize Cost and Grad-Vector
J = 0;
Grad = zeros(size(theta));
%Produces an output based off the current values of theta
model_output = X * theta;
%Computes the squared error for each example then adds them to get the total error
squared_error = (model_output - Y).^2;
J = (1/(2*m)) * sum(squared_error);
% Regularization
J = J + lambda*sum(theta(2:end).^2)/(2*m);
%Computes the gradients for each theta t
regularizator = lambda*theta/m;
% overwrite 1st element i.e the one corresponding to theta zero
regularizator(1) = 0;
for t = 1:size(theta, 1)
Grad(t) = (1/m) * sum((model_output-Y) .* X(:, t)) + regularizator(t);
end
endfunction
The regularization parameter lambda penalizes large coefficients and thereby damps the updates. Start with lambda=1. The greater the value of lambda, the slower the learning will occur. Increase lambda if the behavior you describe persists. You may need to increase the number of iterations if lambda gets high.
You may also consider normalizing your data, and some heuristic for initializing theta - setting all thetas to 0.1 may be better than random. If nothing else, it'll provide better reproducibility from training to training.
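A sketch of how these suggestions could be wired into the main code above (my illustration; note that once costFunction takes lambda, the anonymous function passed to fminunc must forward it):
% z-score each column of the design matrix (feature normalization)
mu = mean(X);
sigma = std(X);
X_norm = bsxfun(@rdivide, bsxfun(@minus, X, mu), sigma);
% deterministic initialization instead of rand, for reproducibility
init_theta = zeros(max_power-1, 1) + 0.1;
% forward lambda through the anonymous function
lambda = 1;
[theta, cost] = fminunc(@(t)(costFunction(t, X_norm, Y, lambda)), init_theta, options);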

Matlab Regularized Logistic Regression - how to compute gradient

I am currently taking Machine Learning on the Coursera platform and I am trying to implement Logistic Regression. To implement Logistic Regression, I am using gradient descent to minimize the cost function and I am to write a function called costFunctionReg.m that returns both the cost and the gradient of each parameter evaluated at the current set of parameters.
The problem is better described below:
My cost function is working, but the gradient function is not. Please note that I would prefer to implement this using loops, rather than vectorized operations.
I am computing theta[0] (in MATLAB, theta(1)) separately as it is not being regularized, i.e. we do not use the first term (with lambda).
function [J, grad] = costFunctionReg(theta, X, y, lambda)
%COSTFUNCTIONREG Compute cost and gradient for logistic regression with regularization
% J = COSTFUNCTIONREG(theta, X, y, lambda) computes the cost of using
% theta as the parameter for regularized logistic regression and the
% gradient of the cost w.r.t. to the parameters.
% Initialize some useful values
m = length(y); % number of training examples
n = length(theta); %number of parameters (features)
% You need to return the following variables correctly
J = 0;
grad = zeros(size(theta));
% ====================== YOUR CODE HERE ======================
% Instructions: Compute the cost of a particular choice of theta.
% You should set J to the cost.
% Compute the partial derivatives and set grad to the partial
% derivatives of the cost w.r.t. each parameter in theta
% ----------------------1. Compute the cost-------------------
%hypothesis
h = sigmoid(X * theta);
for i = 1 : m
% The cost for the ith term before regularization
J = J - ( y(i) * log(h(i)) ) - ( (1 - y(i)) * log(1 - h(i)) );
% Adding regularization term
for j = 2 : n
J = J + (lambda / (2*m) ) * ( theta(j) )^2;
end
end
J = J/m;
% ----------------------2. Compute the gradients-------------------
%not regularizing theta[0] i.e. theta(1) in matlab
j = 1;
for i = 1 : m
grad(j) = grad(j) + ( h(i) - y(i) ) * X(i,j);
end
for j = 2 : n
for i = 1 : m
grad(j) = grad(j) + ( h(i) - y(i) ) * X(i,j) + lambda * theta(j);
end
end
grad = (1/m) * grad;
% =============================================================
end
What am I doing wrong?
The way you are applying regularization is incorrect. You should add the regularization term after you sum over all training examples, but instead you are adding it after each example. If you left your code as it was before the correction, you would inadvertently be making the gradient step larger, and you would eventually overshoot the solution. This overshooting accumulates and inevitably gives you a gradient vector of Inf or -Inf for all components (except for the bias term).
Simply put, place your lambda*theta(j) statement after the second for loop terminates:
for j = 2 : n
for i = 1 : m
grad(j) = grad(j) + ( h(i) - y(i) ) * X(i,j); % Change
end
grad(j) = grad(j) + lambda * theta(j); % Change
end
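Equivalently, the corrected gradient can be written without loops (a vectorized sketch of the same computation, assuming h = sigmoid(X * theta) as in the question):
grad = (1/m) * (X' * (h - y)); % unregularized gradient, all j
grad(2:end) = grad(2:end) + (lambda/m) * theta(2:end); % penalty for j >= 2 only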

Gradient Descent with multiple variable without Matrix

I'm new to MATLAB and machine learning, and I tried to write a gradient descent function without using matrix multiplication.
m is the number of examples in my training set
n is the number of features for each example
The function gradientDescentMulti takes 5 arguments:
X mxn Matrix
y m-dimensional vector
theta : n-dimensional vector
alpha : a real number
nb_iters : a real number
I already have a solution using matrix multiplication
function theta = gradientDescentMulti(X, y, theta, alpha, num_iters)
m = length(y); % number of training examples
for iter = 1:num_iters
gradJ = 1/m * (X'*X*theta - X'*y);
theta = theta - alpha * gradJ;
end
end
The result after iterations:
theta =
1.0e+05 *
3.3430
1.0009
0.0367
But now I tried to do the same without matrix multiplication; this is the function:
function theta = gradientDescentMulti(X, y, theta, alpha, num_iters)
m = length(y); % number of training examples
n = size(X, 2); % number of features
for iter = 1:num_iters
new_theta = zeros(1, n);
%// for each feature, find the new theta
for t = 1:n
S = 0;
for example = 1:m
h = 0;
for example_feature = 1:n
h = h + (theta(example_feature) * X(example, example_feature));
end
S = S + ((h - y(example)) * X(example, n)); %// Sum each feature for this example
end
new_theta(t) = theta(t) - alpha * (1/m) * S; %// Calculate new theta for this example
end
%// only at the end of the function, update all theta simultaneously
theta = new_theta'; %// Transpose new_theta (horizontal vector) to theta (vertical vector)
end
end
The result: all the thetas are the same :/
theta =
1.0e+04 *
3.5374
3.5374
3.5374
If you look at the gradient update rule, it may be more efficient to compute the hypothesis of all of your training examples first, then subtract the ground truth value of each training example from it, and store these differences in an array or vector. Once you do this, you can compute the update rule very easily. It doesn't appear that you're doing this in your code; note also that your inner sum uses X(example, n) instead of X(example, t), so S is the same for every feature, which is exactly why all the thetas come out identical.
As such, I rewrote the code, but with a separate array that stores the difference between the hypothesis of each training example and its ground truth value. Once I have this, I compute the update rule for each feature separately:
for iter = 1 : num_iters
%// Compute hypothesis differences with ground truth first
h = zeros(1, m);
for t = 1 : m
%// Compute hypothesis
for tt = 1 : n
h(t) = h(t) + theta(tt)*X(t,tt);
end
%// Compute difference between hypothesis and ground truth
h(t) = h(t) - y(t);
end
%// Now update parameters
new_theta = zeros(1, n);
%// for each feature, find the new theta
for tt = 1 : n
S = 0;
%// For each sample, compute products of hypothesis difference
%// and the right feature of the sample and accumulate
for t = 1 : m
S = S + h(t)*X(t,tt);
end
%// Compute gradient descent step
new_theta(tt) = theta(tt) - (alpha/m)*S;
end
theta = new_theta'; %// Transpose new_theta (horizontal vector) to theta (vertical vector)
end
When I do this, I get the same answers as using the matrix formulation.
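For reference, the fully vectorized update from the top of the question condenses all of the loops above into a single line per iteration:
theta = theta - (alpha/m) * (X' * (X*theta - y));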