OK, so what exactly does this algorithm mean?
What I know:
i) alpha: how big each gradient descent step will be.
ii) ∑{ hTheta[x(i)] - y(i) }: the total error for the given values of Theta.
The error is the difference between the predicted value, hTheta[x(i)], and the actual value, y(i).
∑{ hTheta[x(i)] - y(i) } gives the sum of the errors over all training examples.
What does Xj^(i) at the end stand for?
Is the following what we do when implementing gradient descent for multiple-variable linear regression?
Theta(j) minus:
alpha
times 1/m
times:
{ error of the first training example multiplied by the jth element of the first training example, PLUS
error of the second training example multiplied by the jth element of the second training example, PLUS
.
.
.
PLUS error of the mth training example multiplied by the jth element of the mth training example. }
Gradient descent is an iterative algorithm for finding a minimum of a function. For a convex function, it is guaranteed to reach the global minimum, provided alpha is small enough. Here is the gradient descent update for finding the minimum of a function J:
$$\theta := \theta - \alpha \nabla J(\theta)$$
The idea is to move the parameters in the direction opposite to the gradient, with the step size set by the learning rate alpha. Repeating this eventually moves them down to the minimum of the function.
We can rewrite this parameter update for each component of theta:
$$\theta_j := \theta_j - \alpha \frac{\partial J(\theta)}{\partial \theta_j}$$
In multivariate linear regression, the goal of the optimization is to minimize the sum of squared errors:
$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$$
The partial derivative of this cost function follows from the chain rule (differentiation by substitution). The power rule brings the exponent 2 down as a coefficient, which cancels the 1/2, and lowers the power from 2 to 1. We then multiply by the derivative of h_theta(x^(i)) with respect to theta_j, which is x_j^(i):
$$\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$$
Here, x_j^(i) is the partial derivative of h_theta(x^(i)) with respect to theta_j. It is the j-th element of the i-th training example.
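As a rough illustration of the update rule discussed in this answer, here is a minimal pure-Python sketch. The function names (`predict`, `gradient_step`), the toy data, and the choices alpha = 0.1 and 5000 iterations are assumptions made for this example, not part of the original question. The first column of `X` is a constant 1 so that theta[0] acts as the intercept.

```python
def predict(theta, x):
    """Linear hypothesis h_theta(x) = sum_j theta_j * x_j."""
    return sum(t * xj for t, xj in zip(theta, x))


def gradient_step(theta, X, y, alpha):
    """One batch update: theta_j - alpha * (1/m) * sum_i error_i * x_j^(i)."""
    m = len(X)
    # Error of every training example, computed with the current theta.
    errors = [predict(theta, X[i]) - y[i] for i in range(m)]
    # All theta_j are updated simultaneously from the same errors.
    return [
        theta[j] - alpha / m * sum(errors[i] * X[i][j] for i in range(m))
        for j in range(len(theta))
    ]


# Toy data generated from y = 1 + 2 * x (hypothetical, noise-free).
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]

theta = [0.0, 0.0]
for _ in range(5000):
    theta = gradient_step(theta, X, y, alpha=0.1)
```

With this data the loop should settle close to theta = [1, 2]. Note that the errors are computed once per step, before any theta_j changes, which is what the "simultaneous update" in the formula means.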
So I have found a formula describing stochastic gradient descent (SGD):
θ = θ-η*∇L(θ;x,y)
where θ is a parameter, η is the learning rate and ∇L() is the gradient of the loss function. What I don't get is how the parameter θ (which should be the weights and biases) is updated mathematically. Is there a mathematical interpretation of the parameter θ?
Thanks for any answers.
That formula applies to both gradient descent and stochastic gradient descent (SGD). The difference between the two is that in SGD the loss is computed over a random subset of the training data (i.e. a mini-batch/batch) as opposed to computing the loss over all the training data as in traditional gradient descent. So in SGD x and y correspond to a subset of the training data and labels, whereas in gradient descent they correspond to all the training data and labels.
θ represents the parameters of the model. Mathematically this is usually modeled as a vector containing all the parameters of the model (all the weights, biases, etc...) arranged into a single vector. When you compute the gradient of the loss (a scalar) w.r.t. θ you get a vector containing the partial derivative of loss w.r.t. each element of θ. So ∇L(θ;x,y) is just a vector, the same size as θ. If we were to assume that the loss were a linear function of θ, then this gradient points in the direction in parameter space that would result in the maximal increase in loss with a magnitude that corresponds to the expected increase in loss if we took a step of size 1 in that direction. Since loss isn't actually a linear function and we actually want to decrease loss we instead take a smaller step in the opposite direction, hence the η and minus.
It's also worth pointing out that, mathematically, the form you've given is a bit problematic. We wouldn't usually write it like this, since assignment and equality aren't the same thing. The equation you provided seems to imply that the θ on the left-hand and right-hand sides are the same. They are not. The θ on the left side of the equals sign is the value of the parameters after taking a step, and the θs on the right side are the parameters before taking the step. We can be clearer by writing it with subscripts:
$$\theta_{t+1} = \theta_t - \eta \nabla L(\theta_t; x, y)$$
where θ_{t} is the parameter vector at step t and θ_{t+1} is the parameter vector one step later.
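To make the "before" and "after" parameter vectors concrete, here is a small pure-Python sketch of mini-batch SGD for a one-feature linear model. Everything specific in it is an assumption for illustration: the model y = w*x + b, the squared-error loss, the toy data, the names `gradient` and `sgd`, and the values of eta, the batch size and the step count.

```python
import random


def gradient(theta, batch):
    """Gradient of the mean squared-error loss (with a 1/2 factor) on one batch.

    theta = [w, b]; the result has the same length as theta.
    """
    w, b = theta
    n = len(batch)
    grad_w = sum((w * x + b - y) * x for x, y in batch) / n
    grad_b = sum((w * x + b - y) for x, y in batch) / n
    return [grad_w, grad_b]


def sgd(data, eta, steps, batch_size, seed=0):
    rng = random.Random(seed)           # fixed seed so runs are repeatable
    theta_t = [0.0, 0.0]                # parameters before the step
    for _ in range(steps):
        batch = rng.sample(data, batch_size)   # random subset of the data
        grad = gradient(theta_t, batch)
        # theta_{t+1} = theta_t - eta * grad L(theta_t; x, y)
        theta_next = [p - eta * g for p, g in zip(theta_t, grad)]
        theta_t = theta_next            # the new vector becomes the current one
    return theta_t


# Hypothetical noise-free data from y = 3 * x + 1.
data = [(x, 3.0 * x + 1.0) for x in [0.0, 0.5, 1.0, 1.5, 2.0, 2.5]]
w, b = sgd(data, eta=0.1, steps=5000, batch_size=2)
```

Setting `batch_size` to `len(data)` turns the same loop into ordinary (full-batch) gradient descent, which is the only difference between the two methods described above.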
I am not doing signal processing, but in my area I need the spectral density of a matrix of data, and I get quite confused at the detailed level.
%matrix H is given.
corr=xcorr2(H); %get the correlation
spec=fft2(corr); % Wiener-Khinchin Theorem
In MATLAB, xcorr2 calculates the correlation function of this matrix. The lag ranges from -N+1 to N-1, so if H is N by N, corr is 2N-1 by 2N-1. For discretized data, should I use all of corr or only half of it?
Another problem: I think the Wiener-Khinchin theorem is basically stated for continuous functions. I have always thought of the discrete FT as an approximation to the continuous FT, or as a tool for calculating it. If you use the MATLAB built-in function 'fft', you should divide the final result by δx.
Could any kind soul who knows this area well share some MATLAB code with me?
Basically, approximating a continuous FT by a discrete FT is the same as approximating an integral by a finite sum.
We will first discuss the 1D case, then the 2D case.
Let's look at the Wiener-Khinchin theorem.
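It states that:
"For the discrete-time case, the power spectral density of the function with discrete values x[n] is
$$S(f) = \sum_{k=-\infty}^{\infty} r_{xx}[k] \, e^{-i 2\pi k f}$$
where
$$r_{xx}[k] = \operatorname{E}\left[ x[n] \, x^{*}[n-k] \right]$$
is the autocorrelation function of x[n]."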
1) You can already see that the sum in the calculation of S(f) runs from -infinity to +infinity.
2) Now consider the MATLAB fft. You can see (command 'edit fft' in MATLAB) that it is defined as:
X(k) = sum_{n=1}^N x(n)*exp(-j*2*pi*(k-1)*(n-1)/N), 1 <= k <= N.
which is exactly what you need in order to calculate the power spectral density at a frequency f.
Note that for continuous functions S(f) is a continuous function, whereas for discretized functions S(f) is discrete.
Now that we know all that, it extends easily to the 2D case. Indeed, the structure of fft2 matches the structure of the right-hand side of the Wiener-Khinchin theorem for the 2D case.
You will, however, need to divide your result by N*M, where N is the number of sample points in x and M is the number of sample points in y.
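The 1D relationship can be checked numerically. The sketch below is written in plain Python rather than MATLAB, uses a hypothetical five-sample signal, and leaves out any normalisation. It computes the autocorrelation at all 2N-1 lags, reorders it so that lag 0 comes first (the role ifftshift would play in MATLAB), and compares its DFT with the squared magnitude of the DFT of the zero-padded signal. The helper names `dft` and `linear_autocorr` are made up for this example.

```python
import cmath


def dft(values):
    """Plain O(N^2) DFT using the same sign convention as the fft formula above."""
    n = len(values)
    return [
        sum(values[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def linear_autocorr(x):
    """Unnormalised autocorrelation of a real signal at lags -(N-1)..(N-1)."""
    n = len(x)
    return [
        sum(x[t + lag] * x[t] for t in range(n) if 0 <= t + lag < n)
        for lag in range(-(n - 1), n)
    ]


x = [1.0, 2.0, 0.0, -1.0, 3.0]          # hypothetical real-valued samples
n = len(x)

corr = linear_autocorr(x)               # length 2N-1, lag 0 in the middle
# Put lag 0 first and the negative lags at the end before transforming.
corr_wrapped = corr[n - 1:] + corr[:n - 1]
spec_from_corr = dft(corr_wrapped)

# Direct route: zero-pad to length 2N-1 and take |X(k)|^2.
x_padded = x + [0.0] * (n - 1)
spec_direct = [abs(v) ** 2 for v in dft(x_padded)]

max_diff = max(abs(a - b) for a, b in zip(spec_from_corr, spec_direct))
```

In this sketch the full set of 2N-1 lags is needed for the two routes to agree. If the reordering step is skipped, the transform picks up a phase factor, although its magnitude stays the same.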
I have to calculate:
gamma=(I-K*A^-1)*OLS;
where I is the identity matrix, K and A are diagonal matrices of the same size, and OLS is the ordinary least squares estimate of the parameters.
I do this in Matlab using:
gamma=(I-A\K)*OLS;
However, I then have to calculate:
gamma2=(I-K^2*A^-2)*OLS;
I calculate this in Matlab using:
gamma2=(I+A\K)*(I-A\K)*OLS;
Is this correct?
I also want to calculate the variance of the OLS parameters.
The formula is simple enough:
Var(B)=sigma^2*(Delta)^-1;
where sigma is a constant and Delta is a diagonal matrix containing the eigenvalues.
I tried doing this by:
Var_B=Delta\sigma^2;
But MATLAB reports that matrix dimensions must agree.
Could you please tell me how to calculate Var(B) in MATLAB, and confirm whether or not my other calculations are correct?
In general, matrix multiplication does not commute, so A^2 - B^2 is not equal to (A+B)*(A-B). Your case is special, however, because one of the two matrices is the identity, which commutes with every matrix. In addition, K and A are diagonal, so they commute and K^2*A^-2 equals (A\K)^2. So your method for finding gamma2 is valid.
'Var_B=Delta\sigma^2' is not a valid mldivide expression, because Delta is a matrix and sigma^2 is a scalar. See the documentation. Try Var_B=sigma^2*inv(Delta). The function inv returns the matrix inverse. Although inv could also be used in your expressions for gamma and gamma2, the \ operator is recommended there for better accuracy and faster computation.
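The expansion behind the gamma2 factorisation can be written out explicitly. Here B is shorthand, introduced only for this sketch, for A^{-1}K:

```latex
\begin{aligned}
(I + B)(I - B) &= I \cdot I - I B + B I - B^2 \\
               &= I - B + B - B^2 \\
               &= I - B^2, \qquad B = A^{-1} K, \\
B^2 &= A^{-1} K A^{-1} K = K^2 A^{-2}
      \quad \text{(diagonal matrices commute).}
\end{aligned}
```

The cross terms cancel only because I commutes with B. For two general matrices the terms -AB and +BA would not cancel.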
In MATLAB I need to generate a second derivative of a Gaussian window to apply to a vector representing the height of a curve. I need the second derivative in order to determine the locations of the inflection points and maxima along the curve. The vector representing the curve may be quite noisy, hence the use of the Gaussian window.
What is the best way to generate this window?
Is it best to use the gausswin function to generate the Gaussian window and then take the second derivative of that?
Or to generate the window manually using the equation for the second derivative of the Gaussian?
Or is it even best to apply the Gaussian window to the data and then take the second derivative of the result? (I know these last two are mathematically the same, but with discrete data points I do not know which will be more accurate.)
The maximum length of the height vector is going to be around 100-200 elements.
Thanks
Chris
I would create a linear filter composed of the weights generated by the second derivative of a Gaussian function and convolve this with your vector.
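The weights of a second derivative of a Gaussian are given by:
$$G''(t) = C \left( \frac{(t-\tau)^2}{\sigma^4} - \frac{1}{\sigma^2} \right) \exp\left( -\frac{(t-\tau)^2}{2\sigma^2} \right)$$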
Where:
tau is the time shift for the filter. If you are generating weights for a discrete filter of length T with an odd number of samples, set tau to zero and let t vary over [-T/2, T/2].
sigma varies the scale of your operator. Set sigma to a value around T/6. If you are concerned about long filter length, the filter can be shortened so that sigma is about T/4.
C is the normalising factor. This can be derived algebraically, but in practice I always do this numerically after calculating the filter weights. For unity gain when smoothing periodic signals, I set C = 1 / sum(G'').
As for your comment on the equivalence of smoothing first and taking a derivative later, I would say it is more involved than that: which derivative operator would you use in the second step? A simple central difference would not yield the same results.
You can get an equivalent (but approximate) response to a second derivative of a Gaussian by filtering the data with two Gaussians of different scales and then taking the point-wise differences between the two resulting vectors. See Difference of Gaussians for that approach.
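As a hedged illustration of the filter described in this answer, here is a pure-Python sketch (a MATLAB version would follow the same steps). It builds second-derivative-of-Gaussian weights with tau = 0 and applies them to a test signal. The normalisation is an assumption of this sketch and differs from the one mentioned above: the weights are shifted to sum to zero and scaled so that the filter returns 2 for the parabola y = t^2, with unit sample spacing. The function names, the length 13 and sigma = 2 are illustrative choices.

```python
import math


def gaussian_second_derivative_weights(length, sigma):
    """Weights of a second-derivative-of-Gaussian filter with an odd length."""
    half = length // 2
    ts = range(-half, half + 1)
    raw = [
        ((t * t) / sigma ** 4 - 1.0 / sigma ** 2)
        * math.exp(-(t * t) / (2.0 * sigma ** 2))
        for t in ts
    ]
    # Assumption of this sketch: force zero response to a constant signal.
    mean = sum(raw) / len(raw)
    zero_sum = [w - mean for w in raw]
    # Assumption of this sketch: scale so the response to y = t^2 is exactly 2.
    gain = sum(w * t * t for w, t in zip(zero_sum, ts)) / 2.0
    return [w / gain for w in zero_sum]


def apply_filter(signal, weights):
    """Convolve over the region where the filter fits entirely inside the signal."""
    half = len(weights) // 2
    return [
        sum(weights[k + half] * signal[i - k] for k in range(-half, half + 1))
        for i in range(half, len(signal) - half)
    ]


weights = gaussian_second_derivative_weights(length=13, sigma=2.0)
parabola = [float(t * t) for t in range(40)]
curvature = apply_filter(parabola, weights)   # should be close to 2 everywhere
```

On real data, inflection points would show up as sign changes in the filtered output, and maxima of the curve as points where the output is negative and the first derivative changes sign.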
I want to fit and draw a curve that is constrained by the following condition:
diff(yfit)<=0
where yfit is the fitted polynomial of degree n.
The condition ensures that the slope of the polynomial, whatever its degree, is non-positive for all x.
How can I apply the condition using the "polyfit" function or any other polynomial fitting function?
From my limited mathematical point of view, a polynomial function of degree 2, for example, has by definition a region with positive slope and a region with negative slope.
One thing you can try is using absolute values:
Build your own fitting routine (least squares is easy, and is what polyfit does) and don't use polynomial functions, but absolute values thereof.
Least squares: set 0 = d/da ( sum( (func - point)^2 ) ) for the coefficient a of each order. Wikipedia and others provide in-depth descriptions.
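The least-squares step mentioned above can be sketched as follows, in plain Python. Setting the derivative of the summed squared error to zero for each coefficient gives a small linear system (the normal equations), solved here by Gaussian elimination. This sketch only fits an ordinary polynomial and then checks the condition diff(yfit) <= 0 on the sample grid. It does not enforce the constraint, and the function names and data are hypothetical.

```python
def polyfit_normal_equations(xs, ys, degree):
    """Least-squares polynomial coefficients in ascending powers of x."""
    n = degree + 1
    # d/da_r of sum_i (p(x_i) - y_i)^2 = 0 gives, for each r:
    #   sum_c a_c * sum_i x_i^(r + c) = sum_i y_i * x_i^r
    A = [[sum(x ** (r + c) for x in xs) for c in range(n)] for r in range(n)]
    b = [sum(y * x ** r for x, y in zip(xs, ys)) for r in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, n):
            factor = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= factor * A[col][c]
            b[r] -= factor * b[col]
    # Back substitution.
    coeffs = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(A[r][c] * coeffs[c] for c in range(r + 1, n))
        coeffs[r] = (b[r] - known) / A[r][r]
    return coeffs


def is_non_increasing(values, tol=1e-12):
    """Discrete version of diff(yfit) <= 0."""
    return all(b - a <= tol for a, b in zip(values, values[1:]))


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [5.0, 3.0, 1.0, -1.0, -3.0]          # hypothetical data on y = 5 - 2x
coeffs = polyfit_normal_equations(xs, ys, degree=1)
yfit = [sum(a * x ** p for p, a in enumerate(coeffs)) for x in xs]
monotone = is_non_increasing(yfit)
```

For higher degrees the normal equations become poorly conditioned, so this direct approach is only a sketch of the idea. A check like `is_non_increasing` tells you whether a given fit happens to satisfy the condition on the sampled points, not for all x.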