Linear regression cost function and gradient descent

I have been studying data science and ML topics for a while, and I always get stuck at one point that causes me great confusion.
In courses like Andrew Ng's, the error between the predicted value and the true value from, e.g., linear regression is defined as:
error = predicted_value - y
In some other tutorials/courses, the error is presented as:
error = y - predicted_value
Also, for instance, on Udacity's data science Nanodegree, the gradient descent weights update is given by:
error = y - predicted_value
W_new = W + learn_rate * np.matmul(error, X)
At the same time, in several other books/courses, the same procedure is given by:
error = predicted_value - y
W_new = W - learn_rate * np.matmul(error, X)
Could someone help me out with those different notations?
Thank you!
EDIT
Following @bottaio's answer, I got the following:
First case:
# compute errors
y_pred = np.matmul(X, W) + b
error = y_pred - y
# compute steps
W_new = W - learn_rate * np.matmul(error, X)
b_new = b - learn_rate * error.sum()
return W_new, b_new
Second case:
# compute errors
y_pred = np.matmul(X, W) + b
error = y - y_pred
# compute steps
W_new = W + learn_rate * np.matmul(error, X)
b_new = b + learn_rate * error.sum()
return W_new, b_new
Running the first and second cases, I get: [plot omitted]
Third case:
# compute errors
y_pred = np.matmul(X, W) + b
error = y_pred - y
# compute steps
W_new = W + learn_rate * np.matmul(error, X)
b_new = b + learn_rate * error.sum()
return W_new, b_new
Running the third case, I get: [plot omitted]
That's exactly the intuition I'm trying to achieve.
What's the relationship between using error = y - y_pred and having to use the positive step W_new = W + learn_rate * np.matmul(error, X) instead of W_new = W - learn_rate * np.matmul(error, X)?
Thank you for all the support!

error = predicted_value - y
error' = y - predicted_value = -error
W + lr * matmul(error', X) = W + lr * matmul(-error, X) = W - lr * matmul(error, X)
These two expressions are two ways of looking at the same thing: you propagate the error backwards.
To be honest, the second states more clearly what is going on under the hood. The error is just the difference between the model's prediction and the ground truth (which explains predicted - y), and a gradient descent step changes the weights in the direction opposite to the gradient (which explains the minus).
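To see this numerically, here is a minimal NumPy sketch with made-up data confirming that the two conventions produce the exact same update:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # made-up data: 100 samples, 3 features
y = X @ np.array([1.0, -2.0, 0.5])       # made-up ground-truth targets
W = np.zeros(3)
learn_rate = 0.01

y_pred = X @ W

# convention 1: error = predicted - y, step is subtracted
error = y_pred - y
W1 = W - learn_rate * np.matmul(error, X)

# convention 2: error = y - predicted, step is added
error2 = y - y_pred
W2 = W + learn_rate * np.matmul(error2, X)

print(np.allclose(W1, W2))               # True: identical updates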


Is this the correct way to calculate the covariance matrix for a 2D feature map?

Within a neural network, I have some 2D feature maps with values between 0 and 1. For these maps, I want to calculate the covariance matrix based on the values at each coordinate. Unfortunately, PyTorch has no .cov() function like NumPy does, so I wrote the following function instead:
def get_covariance(tensor):
    bn, nk, w, h = tensor.shape
    tensor_reshape = tensor.reshape(bn, nk, 2, -1)
    x = tensor_reshape[:, :, 0, :]
    y = tensor_reshape[:, :, 1, :]
    mean_x = torch.mean(x, dim=2).unsqueeze(-1)
    mean_y = torch.mean(y, dim=2).unsqueeze(-1)
    xx = torch.sum((x - mean_x) * (x - mean_x), dim=2).unsqueeze(-1) / (h * w - 1)
    xy = torch.sum((x - mean_x) * (y - mean_y), dim=2).unsqueeze(-1) / (h * w - 1)
    yx = xy
    yy = torch.sum((y - mean_y) * (y - mean_y), dim=2).unsqueeze(-1) / (h * w - 1)
    cov = torch.cat((xx, xy, yx, yy), dim=2)
    cov = cov.reshape(bn, nk, 2, 2)
    return cov
Is that the correct way to do it?
Edit:
Here is a comparison with the numpy function:
a = torch.randn(1, 1, 64, 64)
a_numpy = a.reshape(1, 1, 2, -1).numpy()
torch_cov = get_covariance(a)
numpy_cov = np.cov(a_numpy[0][0])
torch_cov
tensor([[[[ 0.4964, -0.0053],
          [-0.0053,  0.4926]]]])
numpy_cov
array([[ 0.99295635, -0.01069122],
       [-0.01069122,  0.98539236]])
Apparently, my values are too small by a factor of 2. Why could that be?
Edit2: Ahhh, I figured it out. After the reshape, each of x and y holds h*w/2 values, so the sums have to be divided by (h*w/2 - 1) :) Then the values match.
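For reference, here is a sketch of the function with that fix applied (same variable names as the original; after the reshape each of x and y holds n = h*w/2 samples, so the unbiased divisor is n - 1):

import torch

def get_covariance(tensor):
    # tensor: (bn, nk, w, h); the reshape splits the w*h values
    # into two rows of n = h*w/2 samples each
    bn, nk, w, h = tensor.shape
    n = (h * w) // 2
    tensor_reshape = tensor.reshape(bn, nk, 2, -1)
    x = tensor_reshape[:, :, 0, :]
    y = tensor_reshape[:, :, 1, :]
    mean_x = torch.mean(x, dim=2, keepdim=True)
    mean_y = torch.mean(y, dim=2, keepdim=True)
    # unbiased estimates: divide by n - 1, not h*w - 1
    xx = torch.sum((x - mean_x) ** 2, dim=2, keepdim=True) / (n - 1)
    xy = torch.sum((x - mean_x) * (y - mean_y), dim=2, keepdim=True) / (n - 1)
    yy = torch.sum((y - mean_y) ** 2, dim=2, keepdim=True) / (n - 1)
    cov = torch.cat((xx, xy, xy, yy), dim=2)
    return cov.reshape(bn, nk, 2, 2)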

Regularized logistic regression with vectorization

I'm trying to implement a vectorized version of regularised logistic regression. I found a post that explains the regularised version, but I don't understand it.
To make it easy I will copy the code below:
hx = sigmoid(X * theta);
m = length(X);
J = (sum(-y' * log(hx) - (1 - y') * log(1 - hx)) / m) + lambda * sum(theta(2:end).^2) / (2*m);
grad = ((hx - y)' * X / m)' + lambda .* theta .* [0; ones(length(theta)-1, 1)] ./ m;
I understand the first part of the cost equation. If I'm correct, it could be represented as:
J = ((-y' * log(hx)) - ((1-y)' * log(1-hx)))/m;
The problem is the regularization term. Let's look at it in more detail:
Dimensions:
X = (m x (n+1))
theta = ((n+1) x 1)
I don't understand why he leaves the first element of theta (theta_0) out of the equation, when in theory the regularization term is:
(lambda / (2*m)) * sum(theta_j^2) for j = 1..n
and it has to take all the thetas into account.
For the gradient descent, I think that this equation is equivalent:
L = eye(length(theta));
L(1,1) = 0;
grad = (1/m) * X' * (hx - y) + (lambda/m) * (L * theta);
In MATLAB, indexes begin from 1, while in mathematical notation they begin from 0 (the indexes in the formula you mentioned also begin from 0).
So, in theory, the first element of theta is also left out of that equation.
And as for your second question, you're right! It is an equivalent, clean equation!
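If it helps, here is the same vectorized cost and gradient as a minimal NumPy sketch (names are made up; MATLAB's theta(1) corresponds to theta[0] here, and it is excluded from the regularization):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_grad(theta, X, y, lam):
    # X: (m, n+1) with a leading column of ones, theta: (n+1,), y: (m,)
    m = X.shape[0]
    hx = sigmoid(X @ theta)
    # cost: cross-entropy plus an L2 penalty on theta[1:] only
    J = (-(y @ np.log(hx)) - ((1 - y) @ np.log(1 - hx))) / m \
        + lam * np.sum(theta[1:] ** 2) / (2 * m)
    mask = np.ones_like(theta)
    mask[0] = 0.0                        # the intercept is not regularized
    grad = X.T @ (hx - y) / m + lam * mask * theta / m
    return J, grad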

Supplied objective function must return a scalar value

I am trying to code an ML algorithm in MATLAB. These are my different functions:
sigmoid.m:
function g = sigmoid(z)
g = zeros(size(z));
% note: costFunction passes z = -X * theta, so overall this still
% evaluates the standard sigmoid 1 / (1 + exp(-X * theta))
g = 1 ./ (1 + exp(z));
costFunction.m
function [J, grad ] = costFunction(theta, X, y)
m = length(y); % number of training examples
z = -X * theta;
g = sigmoid(z);
J = 1/m * ((-y * log(g)') - ((1 - y) * log(1 - g)'));
grad = zeros(size(theta'));
grad = (1/m) * (X' * (g - y));
ex2.m (This is the main file of my project and I put the relative lines I get this error message)
options = optimset('GradObj', 'on', 'MaxIter', 400);
[theta, cost] = ...
fminunc(@(t)(costFunction(t, X, y)), initial_theta, options);
The error message:
Error using fminunc (line 348) Supplied objective function must return
a scalar value.
Error in ex2 (line 97) fminunc(@(t)(costFunction(t, X, y)),
initial_theta, options);
I don't know if there is enough information above; if not, let me know and I will add more.
I changed the following line of code:
J = 1/m * ((-y * log(g)') - ((1 - y) * log(1 - g)'));
To the following line of code:
J = 1/m * (((-y)' * log(g)) - ((1 - y)' * log(1 - g)));
And the problem was solved!
y and g were 100x1 matrices; with the previous code, J was a 100x100 matrix, but with the new code J is a 1x1 matrix, i.e. a scalar.
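The same shape mismatch is easy to reproduce outside MATLAB; here is a tiny NumPy sketch (made-up 100x1 data) showing why the outer product produces a matrix while the inner product produces a scalar:

import numpy as np

y = np.random.rand(100, 1)                 # a 100x1 column, like the MATLAB y
g = 0.01 + 0.98 * np.random.rand(100, 1)   # predictions strictly inside (0, 1)

outer = -y @ np.log(g).T                   # (100,1) @ (1,100) -> a 100x100 matrix
inner = -y.T @ np.log(g)                   # (1,100) @ (100,1) -> 1x1, a scalar

print(outer.shape, inner.shape)            # (100, 100) (1, 1)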

How to compute the log of a multivariate Gaussian in MATLAB

I am trying to compute log(N(x | mu, Sigma)) in MATLAB, where
x is the data vector (dimensions D x 1), mu (dimensions D x 1) is the mean, and Sigma (dimensions D x D) is the covariance.
My present implementation is
function [loggaussian] = logmvnpdf(x,mu,Sigma)
[D,~] = size(x);
const = -0.5 * D * log(2*pi);
term1 = -0.5 * ((x - mu)' * (inv(Sigma) * (x - mu)));
term2 = - 0.5 * logdet(Sigma);
loggaussian = const + term1 + term2;
end
function y = logdet(A)
y = log(det(A));
end
For some cases I get a warning:
Warning: Matrix is close to singular or badly scaled. Results may be inaccurate. RCOND = NaN
I know you will point out that my data is not consistent, but I need to implement the function so that I get the best approximation instead of a warning. How do I ensure that I always get a value?
I think the warning comes from using inv(Sigma). According to the documentation, you should avoid using inv where its use can be replaced by \ (mldivide). This will give you both better speed and accuracy.
For your code, instead of inv(Sigma) * (x - mu) use Sigma \ (x - mu).
The following approach should be (a little) less sensitive to ill-conditioning of the covariance matrix:
function logpdf = logmvnpdf (x, mu, K)
n = length (x);
R = chol (K);
const = 0.5 * n * log (2 * pi);
term1 = 0.5 * sum (((R') \ (x - mu)) .^ 2);
term2 = sum (log (diag (R)));
logpdf = - (const + term1 + term2);
end
If K is singular or near-singular, you can still have warnings (or errors) when calling chol.
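For reference, here is a sketch of the same Cholesky-based computation in NumPy/SciPy (it assumes, like the MATLAB version, that K is positive definite):

import numpy as np
from scipy.linalg import cholesky, solve_triangular

def logmvnpdf(x, mu, K):
    # x, mu: (D,) vectors; K: (D, D) positive-definite covariance
    D = x.shape[0]
    R = cholesky(K)                                # upper triangular, K = R' * R
    z = solve_triangular(R.T, x - mu, lower=True)  # solves R' * z = x - mu
    const = 0.5 * D * np.log(2.0 * np.pi)
    term1 = 0.5 * np.dot(z, z)                     # the Mahalanobis term
    term2 = np.sum(np.log(np.diag(R)))             # 0.5 * log det(K)
    return -(const + term1 + term2)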

Constant term in MATLAB principal component regression (PCR) analysis

I am trying to learn principal component regression (PCR) with MATLAB. I am using this guide: http://www.mathworks.fr/help/stats/examples/partial-least-squares-regression-and-principal-components-regression.html
It's really good, but I just cannot understand one step.
We do the PCA and the regression, nice and clear:
[PCALoadings,PCAScores,PCAVar] = princomp(X);
betaPCR = regress(y-mean(y), PCAScores(:,1:2));
And then we adjust the first coefficient:
betaPCR = PCALoadings(:,1:2)*betaPCR;
betaPCR = [mean(y) - mean(X)*betaPCR; betaPCR];
yfitPCR = [ones(n,1) X]*betaPCR;
How come the constant coefficient needs to be mean(y) - mean(X)*betaPCR? Can you explain that to me?
Thanks in advance!
This is really a math question, not a coding question. Your PCA extracts a set of features and puts them in a matrix, which gives you PCALoadings and PCAScores. Pull out the first two principal components and their loadings, and put them in their own matrices:
W = PCALoadings(:, 1:2)
Z = PCAScores(:, 1:2)
The relationship between X and Z is that X can be approximated by:
Z = (X - mean(X)) * W <=> X ~ mean(X) + Z * W' (1)
The intuition is that Z captures most of the "important information" in X, and the matrix W tells you how to transform between the two representations.
Now you can do a regression of y on Z. First you have to subtract the mean from y, so that both the left and right hand sides have mean zero:
y - mean(y) = Z * beta + errors (2)
Now you want to use that regression to make predictions for y from X. Substituting from equation (1) into equation (2) gives you
y - mean(y) = (X - mean(X)) * W * beta
= (X - mean(X)) * beta1
where we have defined beta1 = W * beta (you do this in your third line of code). Rearranging:
y = mean(y) - mean(X) * beta1 + X * beta1
= [ones(n,1) X] * [mean(y) - mean(X) * beta1; beta1]
= [ones(n,1) X] * betaPCR
which works out if we define
betaPCR = [mean(y) - mean(X) * beta1; beta1]
as in your fourth line of code.
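As a sanity check, here is a small NumPy sketch of this algebra end to end (it computes the PCA via an SVD of the centered data, which is an assumption about princomp's internals but matches its loadings up to sign; the point is only that [ones(n,1) X] * betaPCR reproduces mean(y) + Z * beta):

import numpy as np

rng = np.random.default_rng(1)
n = 50
X = rng.normal(size=(n, 4))                        # made-up predictors
y = X @ np.array([2.0, -1.0, 0.5, 0.0]) + 3.0      # made-up response

# PCA of the centered X: loadings W, scores Z
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
W = Vt.T[:, :2]                                    # first two loadings
Z = Xc @ W                                         # first two scores

# regress the centered y on the scores, then map beta back to X space
beta, *_ = np.linalg.lstsq(Z, y - y.mean(), rcond=None)
beta1 = W @ beta
betaPCR = np.concatenate(([y.mean() - X.mean(axis=0) @ beta1], beta1))

# predictions from [ones X] * betaPCR match mean(y) + Z * beta
yfit = np.column_stack([np.ones(n), X]) @ betaPCR
print(np.allclose(yfit, y.mean() + Z @ beta))      # True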