Vectorization of a gradient descent code - MATLAB

I am implementing batch gradient descent in MATLAB, and I have a problem with the update step for theta.
theta is a column vector with two components (two rows).
X is a matrix with m rows (the number of training samples) and n = 2 columns (the number of features).
y is a column vector with m rows.
During the update step, I need to set each theta(i) to
theta(i) = theta(i) - (alpha/m)*sum((X*theta-y).*X(:,i))
This can be done with a for loop, but I can't figure out how to vectorize it (because of the X(:,i) term).
Any suggestion?

Looks like you are trying to do a simple matrix multiplication, the thing MATLAB is supposedly best at.
theta = theta - (alpha/m) * (X' * (X*theta-y));

In addition to the answer given by Mad Physicist, the following can also be applied.
theta = theta - (alpha/m) * sum( (X * theta - y).* X )';
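Both answers compute the same update as the original loop. A quick sanity check with made-up random data (the second vectorized form uses implicit expansion, so it needs R2016b or newer):
m = 5; n = 2; alpha = 0.01;
X = randn(m, n); y = randn(m, 1); theta = randn(n, 1);
% loop version: update all components from the same old theta
t_loop = theta;
for i = 1:n
    t_loop(i) = theta(i) - (alpha/m) * sum((X*theta - y) .* X(:,i));
end
% vectorized versions
t_vec1 = theta - (alpha/m) * (X' * (X*theta - y));
t_vec2 = theta - (alpha/m) * sum((X*theta - y) .* X)';
max(abs([t_loop - t_vec1; t_loop - t_vec2])) % ~0 up to floating-point error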

I am not getting this line of code in Machine Learning

I have to write this piece of code for the lrcostfunction assignment in the Machine Learning course on Coursera, but I still don't understand why
theta1 = [0 ; theta(2:end, :)];
is written. What does theta1 mean?
h = sigmoid(X * theta)
theta1 = [0 ; theta(2:end, :)];
p = lambda * (theta1' * theta1)/(2 * m);
J = ((-y)'*log(h)-(1-y)'*log(1-h))/m + p;
grad = (X' * (h - y) + lambda * theta1)/ m;
In logistic regression, theta (θ) is a vector representing the parameters (or weights) of the linear function of x.
Now, given a training set, one method to learn the parameters theta (θ) is to make h(x) close to y, at least for the training examples we have. This is quantified by a cost function (or error function) J(θ), defined for each value of θ, which we want to minimize.
The parameters are initialized to zero. Gradient descent then computes the next value of each parameter from the partial derivative of J(θ), since we want to minimize it:
theta_j := theta_j - alpha * dJ(theta)/d(theta_j)
Here alpha is the learning rate with which the gradient descent algorithm runs. Each parameter starts at its initial value of zero, and the next value is calculated using the above equation, and so on for the other theta parameters.
EDIT:
Explaining the code:
theta1 = [0 ; theta(2:end, :)];
The above code is MATLAB code. Here theta1 is an array (a column vector). It is created by vertically concatenating two pieces:
1) 0
2) theta(2:end, :)
The first is the scalar value 0.
The second takes all the values of theta as they are, except the first row. (Note that theta is the input array to LRCOSTFUNCTION(theta, X, y, lambda).)
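So theta1 is a copy of theta whose first element (the bias parameter) is replaced by zero; this way the regularization term p and the gradient do not penalize the bias. A quick illustration with made-up numbers:
theta = [1; 2; 3];
theta1 = [0; theta(2:end, :)]   % returns [0; 2; 3]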

Logistic Regression Implementation

I am having some difficulty implementing logistic regression, in terms of how I should proceed step by step. So far I have implemented it in the following way:
First, I take theta to be an n*1 vector of zeros, where n is the number of features. I then use this theta to compute the following:
htheta = sigmoid(theta' * X');
theta = theta - (alpha/m) * sum (htheta' - y)'*X
Now, using the theta computed in the first step, I compute the cost function:
J= 1/m *((sum(-y*log(htheta))) - (sum((1-y) * log(1 - htheta)))) + lambda/(2*m) * sum(theta).^2
Finally, I compute the gradient:
grad = (1/m) * sum ((sigmoid(X*theta) - y')*X);
As I am taking theta to be zero, I am getting the same value of J throughout the vector. Is this the right output?
You are computing the gradient in the last step, although it was already computed when updating theta. Moreover, your definition of the cost function contains a regularization parameter, but this is not incorporated in the gradient computation. A working version without the regularization:
% define the logistic (sigmoid) function in case it is not on the path
sigmoid = @(z) 1 ./ (1 + exp(-z));
% generate dummy data for testing
y = randi(2, [10,1]) - 1;
X = [ones(10,1) randn([10,1])];
% initialize
alpha = 0.1;
theta = zeros(1, size(X,2));
J = NaN(100,1);
% loop a fixed number of times => can improve this by stopping when the
% cost function no longer decreases
htheta = sigmoid(X*theta');
for n = 1:100
    grad = X' * (htheta - y);       % gradient
    theta = theta - alpha*grad';    % update theta
    htheta = sigmoid(X*theta');
    J(n) = -y'*log(htheta) - (1-y)'*log(1 - htheta); % cost function
end
If you now plot the cost function, you will see (except for randomness) that it converges after about 15 iterations.
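If you also want the regularization term from your original cost function, here is a minimal sketch of the corresponding gradient (assuming, as is conventional, that the bias parameter theta(1) belonging to the column of ones is not penalized):
lambda = 1;                                      % example regularization strength
theta_pen = [0, theta(2:end)];                   % zero out the bias parameter
grad = X' * (htheta - y) + lambda * theta_pen';  % regularized gradient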

Why "theta" in this code is NaN? [duplicate]

I'm learning neural networks (linear regression) in MATLAB for my research project and this is a part of the code I use.
The problem is that the value of theta is NaN and I don't know why.
Could you tell me where the error is?
function [theta, J_history] = gradientDescent(X, y, theta, alpha, num_iters)
theta = zeros(2, 1); % initialize fitting parameters
%GRADIENTDESCENT Performs gradient descent to learn theta
%   theta = GRADIENTDESCENT(X, y, theta, alpha, num_iters) updates theta by
%   taking num_iters gradient steps with learning rate alpha

% Initialize some useful values
m = length(y); % number of training examples
J_history = zeros(num_iters, 1);

for iter = 1:num_iters
    theta = theta - ((alpha/m)*((X*theta)-y)' * X)';
end
end
% run gradient descent
theta = gradientDescent(X, y, theta, alpha, iterations);
The function you have is fine. But the sizes of X and theta are incompatible. In general, if size(X) is [N, M], then size(theta) should be [M, 1].
So I would suggest replacing the line
theta = zeros(2, 1);
with
theta = zeros(size(X, 2), 1);
Alternatively, X should have as many columns as theta has elements; if you keep theta = zeros(2, 1), then in this example size(X) should be [133, 2].
Also, you should move that initialization outside the function, before you call it; otherwise it overwrites the theta argument you pass in.
For example, the following code does not return NaN if you remove the initialization of theta from the function.
X = rand(133, 1); % or rand(133, 2)
y = rand(133, 1);
theta = zeros(size(X, 2), 1); % initialize fitting parameters
% run gradient descent
theta = gradientDescent(X, y, theta, 0.1, 1500)
EDIT: This is in response to comments below.
Your problem is due to the gradient descent algorithm not converging. To see it yourself, plot J_history, which should never increase if the algorithm is stable. You can compute J_history by inserting the following line inside the for-loop in the function gradientDescent:
J_history(iter) = mean((X * theta - y).^2);
In your case (i.e. with the given data file and alpha = 0.01), J_history increases exponentially; you can see this by plotting it with the y-axis on a logarithmic scale.
This is a clear sign of instability in gradient descent.
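For reference, a minimal way to produce such a plot once J_history has been recorded:
semilogy(J_history)                        % logarithmic y-axis
xlabel('iteration'); ylabel('J\_history')  % backslash escapes the TeX subscript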
There are two ways to eliminate this problem.
Option 1. Use a smaller alpha. alpha controls the rate of gradient descent. If it is too large, the algorithm is unstable. If it is too small, the algorithm takes a long time to reach the optimal solution. Try something like alpha = 1e-8 and go from there; with this data, alpha = 1e-8 gives a cost function that decreases steadily.
Option 2. Use feature scaling to reduce the magnitude of the inputs. One way of doing this is called standardization. The following is an example of using standardization:
data = xlsread('v & t.xlsx');
% standardize the first column: zero mean, unit standard deviation
data(:,1) = (data(:,1) - mean(data(:,1))) / std(data(:,1));
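The same scaling can be applied to every feature column at once. A sketch with a hypothetical feature matrix (implicit expansion requires R2016b or newer):
F = randn(100, 3);              % hypothetical feature matrix
F = (F - mean(F)) ./ std(F);    % each column: zero mean, unit standard deviation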

MATLAB - linear regression - y-intercept by adding a column of ones

I am trying to understand, from the following link on linear regression, the computation of the coefficients beta0 and beta1 for the relation y = beta0 + beta1 * x.
I understand the first computation of beta1, which is actually a simple least-squares regression with only one parameter to find (the slope coefficient).
In the example of "accidents", why do they append a column of ones to the x array to compute the two coefficients:
X = [ones(length(x),1) x];
b = X\y
result:
b =
1.0e+02 *
1.427120171726537
0.000001256394274
What is the underlying calculation with this column of ones? Could anyone explain it to me?
This is more like a comment, but since I am not allowed to post comments, I am writing it as an answer.
They are adding a column of ones to make the model suitable for matrix multiplication. You have y = beta0 + beta1*x. In matrix multiplication form, it can be written as y = [1 x] * [beta0 beta1]'. Please note the transpose sign on the beta vector.
Vectorization is encouraged in MATLAB and R because vectorized expressions run in optimized, compiled routines, which is generally much faster and uses fewer resources than interpreted loops.
Ones are often added to introduce "bias". In your case, try visualizing this equation:
y = w1 * x + c
The ones are added to represent another input which is always one:
y = w1 * x1 + c * x2   (where x2 is always 1)
So, to model equations with constants (bias terms) in them, a column of ones is added to the input.
Because in the equation y = beta0 + beta1 * x, beta0 is implicitly multiplied by 1.
Put another way, consider the ith (x, y) pair:
y[i] = beta0 + beta1 * x[i]
     = beta0 * 1 + beta1 * x[i]
That 1 multiplying beta0 for every i is where the ones vector comes from.
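A quick check with made-up data shows that the backslash solution with the ones column recovers both coefficients:
x = (1:10)';
y = 3 + 2*x + 0.1*randn(10,1);   % true intercept 3, true slope 2
X = [ones(length(x),1) x];
b = X \ y                        % b(1) is close to 3, b(2) close to 2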

Gaussian Basis Function

Can you please tell me how I can model a Gaussian basis function in a two-dimensional space in order to obtain a scalar output?
I know how to apply this with a scalar input, but I don't understand how I should apply it to a two-dimensional vector input. I've seen so many variations of this that I am confused.
With each Gaussian basis function, associate a center of the same dimension as the input; let's call it c. If x is your input, you can compute the output as
y = exp( - 0.5 * (x-c)'*(x-c) )
This will work with any dimension of x and c, provided they are the same. A more general form is
y = sqrt(det(S)) * exp( - 0.5 * (x-c)'* S * (x-c) )
where S is some positive definite matrix, namely the inverse covariance matrix. A simple choice is to take S to be a diagonal matrix with positive entries on the diagonal.
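For example, a minimal evaluation of one Gaussian basis function at a 2-D input (made-up numbers):
x = [1; 2];                % 2-D input vector
c = [0; 0];                % center of the basis function
S = diag([2, 0.5]);        % diagonal positive definite matrix
y = sqrt(det(S)) * exp(-0.5 * (x - c)' * S * (x - c))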
To sample from a multivariate normal distribution, use the MVNRND function from the Statistics Toolbox. Example:
MU = [2 3]; %# mean
COV = [1 1.5; 1.5 3]; %# covariance (can be isotropic/diagonal/full)
p = mvnrnd(MU, COV, 1000); %# sample 1000 2D points
plot(p(:,1), p(:,2), '.') %# plot them