Logistic Regression Implementation - matlab

I am having some difficulties implementing logistic regression, specifically in how I should proceed step by step. So far I am implementing it in the following way:
First, I take theta to have as many elements as there are features and make it an n*1 vector of zeros. Then I use this theta to compute the following:
htheta = sigmoid(theta' * X');
theta = theta - (alpha/m) * sum (htheta' - y)'*X
Now using the theta computed in the first step to compute the cost function
J= 1/m *((sum(-y*log(htheta))) - (sum((1-y) * log(1 - htheta)))) + lambda/(2*m) * sum(theta).^2
In the end computing the gradient
grad = (1/m) * sum ((sigmoid(X*theta) - y')*X);
As I am taking theta to be zero, I am getting the same value of J throughout the vector. Is this the right output?

You are computing the gradient in the last step, while it has been computed before in the computation of the new theta. Moreover, your definition of the cost function contains a regularization parameter, but this is not incorporated in the gradient computation. A working version without the regularization:
% generate dummy data for testing
y=randi(2,[10,1])-1;
X=[ones(10,1) randn([10,1])];
% initialize
alpha = 0.1;
theta = zeros(1,size(X,2));
J = NaN(100,1);
% loop a fixed number of times => can improve this by stopping when the
% cost function no longer decreases
htheta = sigmoid(X*theta');
for n=1:100
grad = X' * (htheta-y); % gradient
theta = theta - alpha*grad'; % update theta
htheta = sigmoid(X*theta');
J(n) = sum(-y'*log(htheta)) - sum((1-y)' * log(1 - htheta)); % cost function
end
If you now plot the cost function, you will see (except for randomness) that it converges after about 15 iterations.
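The comment in the loop above suggests stopping once the cost no longer decreases. A minimal sketch of that idea follows. The data, the tolerance tol, and the iteration cap max_iter are assumptions made for illustration, and the gradient and cost are averaged over m here (the answer's code uses sums), which only rescales the step size.

```matlab
% Sketch: gradient descent for logistic regression with a simple stopping rule.
sigmoid = @(z) 1 ./ (1 + exp(-z));     % defined inline so the script is self-contained

% Hypothetical, deterministic, non-separable data
x = (1:10)' - 5.5;
y = [0 0 0 1 0 0 1 0 1 1]';
X = [ones(10, 1) x];
m = size(X, 1);

alpha    = 0.1;      % learning rate
tol      = 1e-6;     % assumed threshold on the change in cost
max_iter = 1000;     % assumed upper bound on iterations
theta    = zeros(size(X, 2), 1);
J        = NaN(max_iter, 1);

for n = 1:max_iter
    h     = sigmoid(X * theta);
    grad  = X' * (h - y) / m;                       % averaged gradient
    theta = theta - alpha * grad;                   % update theta
    h     = sigmoid(X * theta);
    J(n)  = -(y' * log(h) + (1 - y)' * log(1 - h)) / m;   % averaged cost
    % stop when the cost changes by less than tol between iterations
    if n > 1 && abs(J(n-1) - J(n)) < tol
        break;
    end
end
J = J(1:n);          % keep only the iterations that actually ran
```

Plotting J afterwards shows how many iterations were needed for the chosen tolerance.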

Related

Gradient Descent Overshooting and Cost Blowing Up when used for Regularized Logistic Regression

I'm using MATLAB to code Regularized Logistic Regression and am using Gradient Descent to discover the parameters. All is based on Andrew Ng's Coursera Machine Learning course. I am trying to code the cost function from Andrew's notes/videos. I am not entirely sure if I'm doing it right.
The main problem is... if the number of iterations gets too large, my cost seems to be blowing up. This happens regardless of whether I normalize or not (converting all the data to be between 0 and 1). This problem also causes the decision boundary being produced to shrink (underfit?). Below are three sample results that were obtained, where the decision boundaries of GD are compared against that of Matlab's fminunc.
As can be seen, the cost shoots up when the number of iterations increases. Could it be that I incorrectly coded the cost? Or is there indeed a possibility that Gradient Descent can overshoot? If it helps, I am providing my code. The code I used to calculate the cost history is:
costHistory(i) = (-1 * ( (1/m) * y'*log(h_x) + (1-y)'*log(1-h_x))) + ( (lambda/(2*m)) * sum(theta(2:end).^2) );, based on the equation below:
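The equation referred to here did not survive in the text. Assuming it is the standard regularized logistic regression cost from the course, it would read as follows, with the intercept parameter left out of the penalty sum:

```latex
J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)})
            + \left(1 - y^{(i)}\right) \log\left(1 - h_\theta(x^{(i)})\right) \right]
            + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2
```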
The full code is given below. Note that I have called other functions as well in this code. Would appreciate any pointers! :) Thank you in advance!
% REGULARIZED Logistic Regression with Gradient Descent
clc; clear all; close all;
dataset = load('ex2data2.txt');
x = dataset(:,1:end-1); y = dataset(:,end); m = length(y);
% Mapping the features (includes adding the intercept term)
x = mapFeature(x(:,1), x(:,2)); % Change to polynomial of the 6th degree
% Define the initial thetas. Same as the number of features, including
% the newly added intercept term (1s)
theta = zeros(size(x,2),1) + 0.05;
initial_theta = theta; % will be used later...
% Set lambda equals to 1
lambda = 1;
% calculate theta transpose x and also the hypothesis h_x
alpha = 0.005;
itr = 120000; % number of iterations set to 120K
for i = 1:itr
ttrx = x * theta; % theta transpose x
h_x = 1 ./ (1 + exp(-ttrx)); % sigmoid hypothesis
error = h_x - y;
% the gradient a.k.a. the derivative of J(\theta)
for j = 1:length(theta)
if j == 1
gradientA(j,1) = 1/m * (error)' * x(:,j);
theta(j) = theta(j) - alpha * gradientA(j,1);
else
gradientA(j,1) = (1/m * (error)' * x(:,j)) - (lambda/m)*theta(j);
theta(j) = theta(j) - alpha * gradientA(j,1);
end
end
costHistory(i) = (-1 * ( (1/m) * y'*log(h_x) + (1-y)'*log(1-h_x))) + ( (lambda/(2*m)) * sum(theta(2:end).^2) );
end
[cost, grad] = costFunctionReg(initial_theta, x, y, lambda);
% Using MATLAB's built-in function fminunc to minimize the cost function
% Set options for fminunc
options = optimset('GradObj', 'on', 'MaxIter', 500);
% Run fminunc to obtain the optimal theta
% This function will return theta and the cost
[thetafm, cost] = fminunc(@(t)(costFunctionReg(t, x, y, lambda)), initial_theta, options);
close all;
plotDecisionBoundary_git(theta, x, y); % based on GD
plotDecisionBoundary_git(thetafm, x, y); % based on fminunc
figure;
plot(1:itr, costHistory(:), '--r');
title('The cost history based on GD');

I am not getting this line of code in Machine Learning

I have to write this piece of code for the lrcostfunction assignment in the Machine Learning course in coursera. But I still don't understand why
theta1 = [0 ; theta(2:end, :)];
is written. What does theta1 mean?
h = sigmoid(X * theta)
theta1 = [0 ; theta(2:end, :)];
p = lambda * (theta1' * theta1)/(2 * m);
J = ((-y)'*log(h)-(1-y)'*log(1-h))/m + p;
grad = (X' * (h - y) + lambda * theta1)/ m;
In logistic regression, theta (θ) is a vector representing the parameters (or weights) of the linear function of x.
Now, given a training set, one method to learn the parameters theta (θ) is to make h(x) close to y, at least for the training examples we have. This is defined using a cost function or error function (J(θ)), evaluated for each value of θ, which we want to minimize.
The parameters are typically initialized to zero. Gradient descent then repeatedly computes the next value of each parameter from the partial derivative of J(θ) with respect to that parameter, since we want to minimize J(θ).
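The update rule referred to in this answer did not survive in the text. Assuming it is the standard gradient descent update from the course, applied simultaneously to every parameter, it reads:

```latex
\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)
```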
Here \alpha is the learning rate with which the gradient descent algorithm runs. It starts from the initial parameter values and computes the next values using the gradient descent update rule, and so on for every parameter in theta.
EDIT:
Explaining the code:
theta1 = [0 ; theta(2:end, :)];
The above code is MATLAB code. Here theta1 is an array (vector or matrix representation). It is created using vertical concatenation (the ; operator) of two parts:
1) 0
2) theta(2:end, :)
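First is the scalar value 0.
Second, theta(2:end, :) takes all values of the array theta as they are, except the first row. (Note that theta is the input array to LRCOSTFUNCTION(theta, X, y, lambda).) The result, theta1, is theta with its first entry replaced by 0, so the intercept term theta(1) is left out of the regularization terms in p and grad.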
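To make the effect of that line concrete, here is a small sketch with made-up numbers (theta, lambda and m below are hypothetical values, not taken from the assignment). It shows that theta1 has the same size as theta and that its zero first entry removes the intercept from both the penalty and the gradient's regularization part.

```matlab
theta  = [2; -1; 3];               % hypothetical parameters; theta(1) is the intercept
lambda = 1;                        % hypothetical regularization strength
m      = 5;                        % hypothetical number of training examples

theta1 = [0; theta(2:end, :)];     % vertical concatenation: [0; -1; 3]

% Penalty term: equals lambda/(2*m) * sum(theta(2:end).^2), intercept excluded
p = lambda * (theta1' * theta1) / (2 * m);

% Regularization part of the gradient: first entry is 0, so theta(1) is not shrunk
reg_grad = lambda * theta1 / m;
```

With these numbers p is 1 and reg_grad is [0; -0.2; 0.6].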

Why "theta" in this code is NaN? [duplicate]

I'm learning neural networks (linear regression) in MATLAB for my research project and this is a part of the code I use.
The problem is the value of "theta" is NaN and I don't know why.
Could you tell me where is the error?
function [theta, J_history] = gradientDescent(X, y, theta, alpha, num_iters)
theta = zeros(2, 1); % initialize fitting parameters
%GRADIENTDESCENT Performs gradient descent to learn theta
% theta = GRADIENTDESCENT(X, y, theta, alpha, num_iters) updates theta by
% taking num_iters gradient steps with learning rate alpha
% Initialize some useful values
m = length(y); % number of training examples
J_history = zeros(num_iters, 1);
for iter = 1:num_iters
theta = theta - ((alpha/m)*((X*theta)-y)' * X)';
end
end
% run gradient descent
theta = gradientDescent(X, y, theta, alpha, iterations);
The function you have is fine. But the sizes of X and theta are incompatible. In general, if size(X) is [N, M], then size(theta) should be [M, 1].
So I would suggest replacing the line
theta = zeros(2, 1);
with
theta = zeros(size(X, 2), 1);
X should have as many columns as theta has elements. So in this example, size(X) should be [133, 2].
Also, you should move that initialization before you call the function.
For example, the following code does not return NaN if you remove the initialization of theta from the function.
X = rand(133, 1); % or rand(133, 2)
y = rand(133, 1);
theta = zeros(size(X, 2), 1); % initialize fitting parameters
% run gradient descent
theta = gradientDescent(X, y, theta, 0.1, 1500)
EDIT: This is in response to comments below.
Your problem is due to the gradient descent algorithm not converging. To see it yourself, plot J_history, which should never increase if the algorithm is stable. You can compute J_history by inserting the following line inside the for-loop in the function gradientDescent:
J_history(iter) = mean((X * theta - y).^2);
In your case (i.e. given data file and alpha = 0.01), J_history increases exponentially. This is shown in the plot below. Note that the y-axis is in logarithmic scale.
This is a clear sign of instability in gradient descent.
There are two ways to eliminate this problem.
Option 1. Use smaller alpha. alpha controls the rate of gradient descent. If it is too large, the algorithm is unstable. If it is too small, the algorithm takes a long time to reach the optimal solution. Try something like alpha = 1e-8 and go from there. For example, alpha = 1e-8 results in the following cost function:
Option 2. Use feature scaling to reduce the magnitude of the inputs. One way of doing this is called standardization. The following is an example of using standardization and the resulting cost function:
data=xlsread('v & t.xlsx');
data(:,1) = (data(:,1)-mean(data(:,1)))/std(data(:,1));
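The following sketch puts the two suggestions together: standardize the feature, then record J_history inside the loop to check that it never increases. The data here are made up to stand in for the spreadsheet column (large feature values), and alpha = 0.1 and num_iters = 200 are assumed values, not the ones from the question.

```matlab
% Hypothetical data with a large-valued feature
x = 50 * (1:100)';
y = 3 + 0.02 * x + sin(x);
m = length(y);

% Standardize the feature, then add the intercept column
x_std = (x - mean(x)) / std(x);
X = [ones(m, 1) x_std];

theta     = zeros(size(X, 2), 1);   % one parameter per column of X
alpha     = 0.1;                    % assumed learning rate
num_iters = 200;                    % assumed number of iterations
J_history = zeros(num_iters, 1);

for iter = 1:num_iters
    % same update as in gradientDescent above
    theta = theta - ((alpha/m) * ((X*theta) - y)' * X)';
    J_history(iter) = mean((X*theta - y).^2);
end
```

A call such as semilogy(J_history) then shows the cost decreasing rather than growing exponentially.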

[Octave]Using fminunc is not always giving a consistent solution

I am trying to find the coefficients in an equation to model the step response of a motor which is of the form 1-e^x. The equation I'm using to model is of the form
a(1)*t^2 + a(2)*t^3 + a(3)*t^4 + ...
(It is derived in a research paper used to solve for motor parameters)
Sometimes using fminunc to find the coefficients works out okay, and I get a good result that matches the training data fairly well. Other times the returned coefficients are horrible (the fit goes far above what the output should be and is orders of magnitude off). This especially happens once I started using higher order terms: using any model that uses x^8 or higher (x^9, x^10, x^11, etc.) always produces bad results.
Since it works sometimes, I can't think why my implementation would be wrong. I have tried fminunc while providing the gradients and while also not providing the gradients yet there is no difference. I've looked into using other functions to solve for the coefficients, like polyfit, but in that instance it has to have terms that are raised from 1 to the highest order term, but the model I'm using has its lowest power at 2.
Here is the main code:
clear;
%Overall Constants
max_power = 7;
%Loads in data
%data = load('TestData.txt');
load testdata.mat
%Sets data into variables
indep_x = data(:,1); Y = data(:,2);
%number of data points
m = length(Y);
%X is a matrix with the independent variable
exps = [2:max_power];
X_prime = repmat(indep_x, 1, max_power-1); %Repeats columns of the indep var
X = bsxfun(@power, X_prime, exps);
%Initializes theta to rand vals
init_theta = rand(max_power-1,1);
%Sets up options for fminunc
options = optimset( 'MaxIter', 400, 'Algorithm', 'quasi-newton');
%fminunc minimizes the output of the cost function by changing the theta parameters
[theta, cost] = fminunc(@(t)(costFunction(t, X, Y)), init_theta, options)
%
Y_line = X * theta;
figure;
hold on; plot(indep_x, Y, 'or');
hold on; plot(indep_x, Y_line, 'bx');
And here is costFunction:
function [J, Grad] = costFunction (theta, X, Y)
%# of training examples
m = length(Y);
%Initialize Cost and Grad-Vector
J = 0;
Grad = zeros(size(theta));
%Produces an output based off the current values of theta
model_output = X * theta;
%Computes the squared error for each example then adds them to get the total error
squared_error = (model_output - Y).^2;
J = (1/(2*m)) * sum(squared_error);
%Computes the gradients for each theta t
for t = 1:size(theta, 1)
Grad(t) = (1/m) * sum((model_output-Y) .* X(:, t));
end
endfunction
Any help or advice would be appreciated.
Try adding regularization to your costFunction:
function [J, Grad] = costFunction (theta, X, Y, lambda)
m = length(Y);
%Initialize Cost and Grad-Vector
J = 0;
Grad = zeros(size(theta));
%Produces an output based off the current values of theta
model_output = X * theta;
%Computes the squared error for each example then adds them to get the total error
squared_error = (model_output - Y).^2;
J = (1/(2*m)) * sum(squared_error);
% Regularization
J = J + lambda*sum(theta(2:end).^2)/(2*m);
%Computes the gradients for each theta t
regularizator = lambda*theta/m;
% overwrite 1st element i.e the one corresponding to theta zero
regularizator(1) = 0;
for t = 1:size(theta, 1)
Grad(t) = (1/m) * sum((model_output-Y) .* X(:, t)) + regularizator(t);
end
endfunction
The regularization parameter lambda controls how strongly large coefficients are penalized. Start with lambda=1. The greater the value of lambda, the more the coefficients are shrunk toward zero. Increase lambda if the behavior you describe persists. You may need to increase the number of iterations if lambda gets high.
You may also consider normalization of your data, and some heuristic for initializing theta - setting all theta to 0.1 may be better than random. If nothing else it'll provide better reproducibility from training to training.
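As a rough illustration of the normalization suggestion, the sketch below scales each polynomial column to unit norm before fitting and then converts the coefficients back. The time samples and the target are made up, and the backslash least-squares solve is used in place of fminunc only to keep the example short; the same scaled matrix Xs could be passed to costFunction instead.

```matlab
t    = linspace(0, 10, 50)';            % hypothetical time samples
exps = 2:7;                             % powers used by the model
X    = bsxfun(@power, t, exps);         % columns t.^2 ... t.^7
Y    = 1 - exp(-t);                     % hypothetical step-response-like target

% Scale every column to unit 2-norm so the features have comparable magnitude
scale = sqrt(sum(X.^2, 1));
Xs    = bsxfun(@rdivide, X, scale);

% Fit on the scaled features (least squares stands in for fminunc here)
theta_s = Xs \ Y;

% Convert back to coefficients for the unscaled columns of X
theta = theta_s ./ scale';

% Scaling improves the conditioning of the problem
cond_before = cond(X);
cond_after  = cond(Xs);
```

Comparing cond_before and cond_after gives a sense of how much harder the unscaled problem is for an iterative optimizer.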

Matlab Regularized Logistic Regression - how to compute gradient

I am currently taking Machine Learning on the Coursera platform and I am trying to implement Logistic Regression. To implement Logistic Regression, I am using gradient descent to minimize the cost function and I am to write a function called costFunctionReg.m that returns both the cost and the gradient of each parameter evaluated at the current set of parameters.
The problem is better described below:
My cost function is working, but the gradient function is not. Please note that I would prefer to implement this using looping, rather than element-by-element operations.
I am computing theta[0] (in MATLAB, theta(1)) separately as it is not being regularized, i.e. we do not use the first term (with lambda).
function [J, grad] = costFunctionReg(theta, X, y, lambda)
%COSTFUNCTIONREG Compute cost and gradient for logistic regression with regularization
% J = COSTFUNCTIONREG(theta, X, y, lambda) computes the cost of using
% theta as the parameter for regularized logistic regression and the
% gradient of the cost w.r.t. to the parameters.
% Initialize some useful values
m = length(y); % number of training examples
n = length(theta); %number of parameters (features)
% You need to return the following variables correctly
J = 0;
grad = zeros(size(theta));
% ====================== YOUR CODE HERE ======================
% Instructions: Compute the cost of a particular choice of theta.
% You should set J to the cost.
% Compute the partial derivatives and set grad to the partial
% derivatives of the cost w.r.t. each parameter in theta
% ----------------------1. Compute the cost-------------------
%hypothesis
h = sigmoid(X * theta);
for i = 1 : m
% The cost for the ith term before regularization
J = J - ( y(i) * log(h(i)) ) - ( (1 - y(i)) * log(1 - h(i)) );
% Adding regularization term
for j = 2 : n
J = J + (lambda / (2*m) ) * ( theta(j) )^2;
end
end
J = J/m;
% ----------------------2. Compute the gradients-------------------
%not regularizing theta[0] i.e. theta(1) in matlab
j = 1;
for i = 1 : m
grad(j) = grad(j) + ( h(i) - y(i) ) * X(i,j);
end
for j = 2 : n
for i = 1 : m
grad(j) = grad(j) + ( h(i) - y(i) ) * X(i,j) + lambda * theta(j);
end
end
grad = (1/m) * grad;
% =============================================================
end
What am I doing wrong?
The way you are applying regularization is incorrect. Regularization should be added once, after summing over all training examples, but you are adding it for each example. If you leave your code as it was before the correction, you inadvertently make the gradient step larger and will eventually overshoot the solution. This overshooting will accumulate and will inevitably give you a gradient vector of Inf or -Inf for all components (except for the bias term).
Simply put, place your lambda*theta(j) statement after the inner for loop terminates:
for j = 2 : n
for i = 1 : m
grad(j) = grad(j) + ( h(i) - y(i) ) * X(i,j); % Change
end
grad(j) = grad(j) + lambda * theta(j); % Change
end
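One way to check the corrected loop is to compare it against the vectorized form of the same gradient on a tiny example. The data and parameter values below are made up for illustration, and sigmoid is defined inline so the script runs on its own.

```matlab
sigmoid = @(z) 1 ./ (1 + exp(-z));

% Hypothetical data: intercept column plus two features
X = [ones(4, 1), [0.5; -1.2; 2.0; 0.3], [1.0; 0.4; -0.7; 1.5]];
y = [1; 0; 1; 0];
theta  = [0.1; -0.2; 0.3];
lambda = 1;
m = length(y);
n = length(theta);
h = sigmoid(X * theta);

% Loop version: sum over examples first, then add regularization once per parameter
grad_loop = zeros(n, 1);
for j = 1:n
    for i = 1:m
        grad_loop(j) = grad_loop(j) + (h(i) - y(i)) * X(i, j);
    end
    if j > 1                                   % theta(1) is not regularized
        grad_loop(j) = grad_loop(j) + lambda * theta(j);
    end
end
grad_loop = (1/m) * grad_loop;

% Vectorized version of the same gradient, used only as a cross-check
grad_vec = (X' * (h - y) + lambda * [0; theta(2:end)]) / m;

max_diff = max(abs(grad_loop - grad_vec));
```

If max_diff is at the level of floating-point rounding, the loop places the regularization term correctly.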