why is gradient descent with momentum an exponentially weighted average?

I recently watched Andrew Ng's video on SGDM. I understand that the momentum term updates the gradient by weighting the last gradient and using a small component of V_dw. I don't understand why momentum is also known as exponentially weighted average. Also, in Ng's video at 6:37 he says using Beta = 0.9 effectively means using an average of the last 10 gradients.
Can someone explain how that works? To me, it's just a scalar weighting of 1-0.9 to all the gradients in the vector dW.
Appreciate any insight! I feel like I'm missing something fundamental.

You just have to think about what is in your last gradient. The last gradient is already a weighted gradient, due to the momentum term.
In the first step you just do plain gradient descent. In the second step you have a momentum gradient of m_grad_2 = grad_2 + 0.9 m_grad_1. In the third step you again have a momentum gradient m_grad_3 = grad_3 + 0.9 m_grad_2, but the old gradient already contains a momentum term. Since m_grad_1 = grad_1, 0.9*m_grad_2 = 0.9 * (grad_2 + 0.9 grad_1), which is 0.9 grad_2 + 0.81 grad_1. So a gradient from k steps back enters the current update with weight 0.9^k. After 10 steps that weight is 0.9^10 ≈ 0.35, roughly 1/e, and it keeps shrinking from there, which is why Beta = 0.9 is described as averaging over about the last 1/(1-0.9) = 10 gradients.
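The unrolling above can be checked numerically. This is a minimal sketch in plain Python using scalar gradients; the function names momentum_recursive and momentum_unrolled are made up for illustration. It follows the recursion exactly as written in this answer (no (1-Beta) factor on the new gradient; Ng's version multiplies the new gradient by (1-Beta), which only rescales every weight by the same constant).

```python
def momentum_recursive(grads, beta):
    """Apply m_t = g_t + beta * m_{t-1}, starting from m_0 = 0."""
    m = 0.0
    for g in grads:
        m = g + beta * m
    return m


def momentum_unrolled(grads, beta):
    """Same quantity as an explicit weighted sum: the gradient
    from k steps back is multiplied by beta**k."""
    newest_first = list(reversed(grads))
    return sum((beta ** k) * g for k, g in enumerate(newest_first))


grads = [1.0, 2.0, 3.0]
print(momentum_recursive(grads, 0.9))  # 3 + 0.9*2 + 0.81*1
print(momentum_unrolled(grads, 0.9))   # same value, up to rounding
print(0.9 ** 10)                       # weight of a gradient 10 steps back
```

Both functions return the same number, which is the sense in which the momentum term is an exponentially weighted sum of all past gradients.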

Related

Is there any numerical accuracy difference between calculating sin(pi/2-A) and cos(A) in MATLAB?

I am reading a MATLAB function for calculating great-circle distance, written by my senior colleague.
The distance between two points on the earth surface should be calculated using this formula:
d = r * arccos[(sin(lat1) * sin(lat2)) + cos(lat1) * cos(lat2) * cos(long2 - long1)]
However, the script has the code like this:
dist = (acos(cos(pi/180*(90-lat2)).*cos(pi/180*(90-lat1))+sin(pi/180*(90-lat2)).*sin(pi/180*(90-lat1)).*cos(pi/180*(diff_long)))) .* r_local;
(-180 < long1,long2 <= 180, -90 < lat1,lat2 <= 90)
Why are sin(pi/2-A) and cos(pi/2-A) used to replace cos(A) and sin(A)?
Doesn't using the constant pi introduce an additional source of error?
Since lat1 and lat2 might be very close to zero in my work, is this a trick to improve the numerical accuracy of MATLAB's sin() and cos() functions?
I look forward to answers that explain how trigonometric functions in MATLAB work and analyze the error of these functions when the argument is close or equal to 0 and pi/2.

If the purpose is to increase accuracy, this seems a very poor idea. When the angle is small, 90-A spoils any accuracy. That even makes tiny angles vanish (90-ε=90).
Conversely, the sine of a tiny angle is very close to the angle itself (in radians) and is therefore computed quite accurately, while the cosine is virtually 1, or 1-A²/2. For top accuracy on tiny angles, you may resort to the versine, using versin(A) := 1-cos(A) = 2 sin²(A/2), and rework the equations in terms of 1-versin(A) instead of cos(A).
If the angle is close to 90°, accuracy is lost anyway, 90°-A will not restore it.
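Both points can be reproduced in any IEEE double-precision environment. The sketch below uses Python's math module rather than MATLAB (an assumption made so the example is self-contained; the behaviour comes from double-precision arithmetic, not from the language). The names versin_naive and versin_stable are hypothetical helpers.

```python
import math

# A tiny latitude in degrees is absorbed completely by 90 - A,
# because the spacing of doubles near 90 is about 1.4e-14.
tiny_deg = 1e-15
absorbed = (90.0 - tiny_deg == 90.0)
print(absorbed)


def versin_naive(a):
    """1 - cos(a): suffers cancellation for tiny a (radians)."""
    return 1.0 - math.cos(a)


def versin_stable(a):
    """2 * sin(a/2)**2: same quantity without the cancellation."""
    s = math.sin(a / 2.0)
    return 2.0 * s * s


a = 1e-9
print(versin_naive(a))   # the tiny angle's contribution is lost
print(versin_stable(a))  # close to a*a/2
```

With a = 1e-9 the naive form loses the angle entirely, while the half-angle form keeps it to full relative precision.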

I very much doubt this has to do with accuracy. Or at least, I don't think this helps any when it comes to accuracy.
The maximum absolute value of both differences, sin(pi/2-A) - cos(A) and cos(pi/2-A) - sin(A), is 1.1102e-16, which is very small. This is just basic floating point accuracy, and there's really no way of telling which of the numbers is more correct. Note that cos(pi/2) = 6.1232e-17. So, if theta = 0, your colleague's code cos(pi/2-0) will give an error of 6.1232e-17, while simply doing the obvious sin(0) will be correct.
If you need numbers that are more accurate than this then you can try vpa.
I guess this is either because your colleague found another formula and implemented that, or he/she's confused and has attempted to increase the accuracy.
The latter might be the case if he/she tried to avoid the approximations sin(theta) ≈ theta and cos(theta) ≈ 1 for small values of theta. However, this doesn't make sense, since cos(pi/2-theta) ≈ theta and sin(pi/2-theta) ≈ 1 for small values of theta.
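The comparison is easy to repeat. This is a rough sketch in Python (an assumption, chosen so it runs without MATLAB; both use IEEE doubles, though the exact maximum found depends on the sample grid and the math library). The function name max_discrepancy and the grid of 1000 samples are made up for illustration.

```python
import math

# pi/2 is not exactly representable, so its cosine is a small residual
# rather than zero, while sin(0) is exactly zero.
residual = math.cos(math.pi / 2)
exact = math.sin(0.0)
print(residual, exact)


def max_discrepancy(samples=1000):
    """Largest absolute difference between the direct and the
    'pi/2 - A' formulations for A sampled over [0, pi/2]."""
    worst = 0.0
    for i in range(samples + 1):
        a = (math.pi / 2) * i / samples
        worst = max(
            worst,
            abs(math.sin(math.pi / 2 - a) - math.cos(a)),
            abs(math.cos(math.pi / 2 - a) - math.sin(a)),
        )
    return worst


print(max_discrepancy())
```

The discrepancy stays at the level of a few units in the last place, so neither formulation can be called more accurate in general, but only the direct form is exact at theta = 0.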

Your best chance is to ask the author of the text where you got those expressions from directly, if that is possible.
It may be that the original expressions come from navigation formulae that were written when calculations were done manually: pencil, paper and ruler, with no computers and no calculators.
Tables and graphs were then used to speed up results: pi-x was equivalent to reading the table from the other end, or reading the graph upside down.

Scale correction for IFFT of smaller frequency space created by FFT

This might be considered a repost of this question; however, I am seeking a much deeper explanation of this matter and of how to properly solve this problem.
I want to study the PSF/SRF of a voxel in a 44x44 matrix. For that I create a matrix 100x bigger (4400x4400) so 1 voxel in the smaller matrix corresponds to 100x100 voxels in the bigger one. I set the values to 1 of those 100^2 voxels.
Now I do a FFT of the big matrix and an IFFT of only the center portion (44x44) of the frequency space. This is the code:
A = zeros(4400,4400);
A(2201:2300,2201:2300) = 1;
B = fftshift(fft2(A));
C = ifft2(ifftshift(B(2179:2222,2179:2222)));
D = numel(C)/numel(B) * C;
figure, subplot(1,3,1), imshow(A), subplot(1,3,2), imshow(real(C)), subplot(1,3,3), imshow(real(D));
The problem is the following: I would expect the value in the voxel of the new 44x44 matrix to be 1. However, using this numel factor correction they decrease to 0.35. And if I don't apply the correction they go up to huge values.

For starters, let me try to clarify the scaling issue: For the DFT/IDFT there are various scaling conventions regarding the input size. You either need a factor of 1/N in the DFT or a factor of 1/N in the IDFT or a factor of 1/sqrt(N) in both. All have pros and cons and all are equally valid.
Matlab uses the 1/N in the IDFT convention, as you can see in the documentation.
In your example, the forward DFT has a size of 4400 and the backward IDFT a size of 44. The IDFT therefore divides by 44 instead of 4400, so its normalization is a factor of 100 weaker than it should be to match the forward transformation, and your values are a factor of 100 too large. Since you're doing a 2-D DFT/IDFT, the factor of 100 is missing twice, so you need to rescale by 1/100^2. Your numel(C)/numel(B) does exactly that; I've just tried to give you the explanation for it.
A reason why you might not see the 1 is that you're plotting only the real part of the inverse DFT. Since you did some fftshifting you might have introduced a phase so that part of your signal is in the imaginary part.
edit: Another reason is that you truncate B to the central 44 by 44 window before transforming back. Since A is not bandlimited, B has energy also outside this window. By truncating you are losing a part of it. Therefore, it is not surprising that the resulting amplitude is lower.
Here is a zoom on the image of B to show this phenomenon:
The red square is what you keep; everything else is truncated. By Parseval's theorem, the total energy in the image domain and in the Fourier domain is equal, so by truncating you necessarily reduce the energy of your signal in the image domain as well.
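A 1-D toy version of the experiment shows both effects at once. This is a minimal pure-Python sketch, not a translation of the MATLAB code: the sizes N = 40 and M = 4, the box position, and the helper names dft and idft are assumptions chosen to keep it small. The idft helper uses the same 1/N-in-the-inverse convention as described above, and M / N plays the role of numel(C)/numel(B).

```python
import cmath


def dft(x):
    """Forward DFT without any scaling factor."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(X):
    """Inverse DFT with the 1/n factor."""
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


N, M = 40, 4                      # fine grid and coarse grid (factor 10)
a = [1.0 if 20 <= t < 30 else 0.0 for t in range(N)]   # one coarse voxel set to 1
A = dft(a)

# Keep only the M lowest frequencies, in unshifted order: 0, 1, -2, -1.
freqs = list(range(M // 2)) + list(range(-(M // 2), 0))
kept = [A[k % N] for k in freqs]

c = idft(kept)                    # uncorrected: a factor N/M too large
d = [v * M / N for v in c]        # corrected for the size mismatch

print(max(abs(v) for v in c))
print(max(abs(v) for v in d))
print(abs(sum(d)))                # total is preserved, the peak is not
```

After the correction the samples still sum to 1, because the zero-frequency term survives the truncation, but no single sample reaches 1: the discarded high frequencies carried part of the box, and some of what remains ends up in the imaginary part.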

scale the loss value according to "badness" in caffe

I want to scale the loss value of each image based on how close or far the "current prediction" is from the "correct label" during training. For example, if the correct label is "cat" and the network thinks it is "dog", the penalty (loss) should be less than if the network thinks it is a "car".
The way that I am doing is as following:
1- I defined a matrix of the distance between the labels,
2- pass that matrix as a bottom to the "softmaxWithLoss" layer,
3- multiply each log(prob) by this value in forward_cpu to scale the loss according to badness.
However, I do not know what I should do in the backward_cpu part. I understand the gradient (bottom_diff) has to be changed, but I am not quite sure how to incorporate the scale value here. According to the math I have to scale the gradient by the scale (because it is just a scale), but I don't know how.
Also, it seems there is a loss layer in caffe called "InfoGainLoss" that does a very similar job, if I am not mistaken; however, the backward part of this layer is a little confusing:
bottom_diff[i * dim + j] = scale * infogain_mat[label * dim + j] / prob;
I am not sure why infogain_mat[] is divided by prob rather than multiplied by it! If I use the identity matrix for infogain_mat, isn't it supposed to act like softmax loss in both forward and backward?
It will be highly appreciated if someone can give me some pointers.

You are correct in observing that the scaling you are doing for the log(prob) is exactly what "InfogainLoss" layer is doing (You can read more about it here and here).
As for the derivative (back-prop): the loss computed by this layer is
L = - sum_j infogain_mat[label * dim + j] * log( prob(j) )
If you differentiate this expression with respect to prob(j) (which is the input variable to this layer), you'll notice that the derivative of log(x) is 1/x; this is why you see that
dL/dprob(j) = - infogain_mat[label * dim + j] / prob(j)
Now, why don't you see a similar expression in the back-prop of the "SoftmaxWithLoss" layer?
Well, as the name of that layer suggests, it is actually a combination of two layers: a softmax that computes class probabilities from the classifier's outputs, and a log loss layer on top of it. Combining these two layers enables a more numerically robust estimation of the gradients.
Working a little with "InfogainLoss" layer I noticed that sometimes prob(j) can have a very small value leading to unstable estimation of the gradients.
Here's a detailed computation of the forward and backward passes of "SoftmaxWithLoss" and "InfogainLoss" layers with respect to the raw predictions (x), rather than the "softmax" probabilities derived from these predictions using a softmax layer. You can use these equations to create a "SoftmaxWithInfogainLoss" layer that is more numerically robust than computing infogain loss on top of a softmax layer:
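For reference, pushing L = - sum_j H[label][j] * log(prob(j)) through the softmax gives dL/dx(k) = prob(k) * sum_j H[label][j] - H[label][k], with no division by prob. The following is a minimal sketch of that result in plain Python, assuming a single sample and a nested-list matrix H; it is not Caffe's code, and the function names are made up for illustration.

```python
import math


def softmax(x):
    """Numerically stable softmax of a list of raw scores."""
    m = max(x)
    e = [math.exp(v - m) for v in x]
    s = sum(e)
    return [v / s for v in e]


def infogain_loss(x, H, label):
    """Forward pass: L = -sum_j H[label][j] * log(softmax(x)[j])."""
    p = softmax(x)
    return -sum(H[label][j] * math.log(p[j]) for j in range(len(x)))


def infogain_grad(x, H, label):
    """Backward pass w.r.t. the raw scores x:
    dL/dx[k] = p[k] * sum_j H[label][j] - H[label][k]."""
    p = softmax(x)
    row_sum = sum(H[label])
    return [p[k] * row_sum - H[label][k] for k in range(len(x))]


x = [0.5, -1.0, 2.0]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
print(infogain_grad(x, identity, 2))  # reduces to p - onehot(label)
```

With the identity matrix for H this is the familiar softmax-loss gradient, prob minus the one-hot label, which is the sense in which the two layers agree.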
PS,
Note that if you are going to use infogain loss for weighing, you should feed H (the infogain_mat) with label similarities, rather than distances.
Update:
I recently implemented this robust gradient computation and created this pull request. This PR was merged into the master branch in April 2017.

fitness in inverted pendulum

What is the fitness function used to solve an inverted pendulum?
I am evolving neural networks with a genetic algorithm, and I don't know how to evaluate each individual.
I tried minimizing the angle of the pendulum and maximizing the distance traveled at the end of the evaluation time (10 s), but this doesn't work.
The inputs to the neural network are: cart velocity, cart position, pendulum angular velocity and pendulum angle at time (t). The output is the force applied at time (t+1).
Thanks in advance.

I found this paper which lists their objective function as being:
Defined as:
where "Xmax = 1.0, thetaMax = pi/6, X'max = 1.0, theta'Max = 3.0, N is the number of iteration steps, T = 0.02 * TS and Wk are selected positive weights." (These are the specific values for angles, velocities, and positions from the paper; you will want to use your own values depending on the boundary conditions of your pendulum.)
The paper also states "The first and second terms determine the accumulated sum of normalised absolute deviations of X1 and X3 from zero and the third term when minimised, maximises the survival time."
That should be more than enough to get started with, but I highly recommend you read the whole paper. It's a great read and I found it quite educational.
You can make your own fitness function, but I think using the position, velocity, angle, and rate of change of the angle of the pendulum is a good idea for the fitness function. You can, however, choose to use those variables in very different ways than the way the author of the paper chose to model their function.
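As one illustration of rolling your own, here is a hypothetical cost function in Python in the spirit of the quoted description: normalised absolute deviations of position and angle, plus a term that rewards survival time. It is not the paper's formula, which is not reproduced here. The function name, the equal default weights, the averaging over steps, and the form of the survival term are all assumptions; only the default limits Xmax = 1.0 and thetaMax = pi/6 are taken from the quote.

```python
import math


def pendulum_cost(states, max_steps, x_max=1.0, theta_max=math.pi / 6,
                  weights=(1.0, 1.0, 1.0)):
    """Cost to MINIMISE for one individual (hypothetical sketch).

    states: list of (cart_position, pendulum_angle) pairs recorded at each
            step until the pendulum fell or max_steps was reached.
    """
    n = len(states)
    if n == 0:
        return float("inf")
    w_x, w_theta, w_time = weights
    # Average normalised deviation of position and angle from zero.
    pos_term = sum(abs(x) / x_max for x, _ in states) / n
    ang_term = sum(abs(th) / theta_max for _, th in states) / n
    # Fraction of the evaluation time that was NOT survived.
    time_term = (max_steps - n) / max_steps
    return w_x * pos_term + w_theta * ang_term + w_time * time_term


perfect = [(0.0, 0.0)] * 500
early_fall = [(0.5, 0.1)] * 50
print(pendulum_cost(perfect, 500))
print(pendulum_cost(early_fall, 500))
```

A genetic algorithm would rank individuals by this cost (lower is better); the velocities could be added as further normalised terms in the same way.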
It wouldn't hurt to read up on harmonic oscillators either. They take the general form:
mx'' + Bx' + kx = Acos(w*t)
(where B, or A may be 0 depending on whether or not the oscillator is damped/undamped or driven/undriven respectively).

Minimization of L1-Regularized system, converging on non-minimum location?

This is my first post to Stack Overflow, so if this isn't the correct area I apologize. I am working on minimizing an L1-regularized system.
This weekend is my first dive into optimization. I have a basic linear system Y = X*B, where X is an n-by-p matrix, B is a p-by-1 vector of model coefficients and Y is an n-by-1 output vector.
I am trying to find the model coefficients, and I have implemented both gradient descent and coordinate descent algorithms to minimize the L1-regularized system. To find my step size I am using the backtracking algorithm. I terminate the algorithm by looking at the norm-2 of the gradient and stopping if it is 'close enough' to zero (for now I'm using 0.001).
The function I am trying to minimize is the following (0.5)*(norm((Y - X*B),2)^2) + lambda*norm(B,1). (Note: By norm(Y,2) I mean the norm-2 value of the vector Y) My X matrix is 150-by-5 and is not sparse.
If I set the regularization parameter lambda to zero I should converge on the least squares solution, I can verify that both my algorithms do this pretty well and fairly quickly.
If I start to increase lambda, my model coefficients all tend towards zero, which is what I expect. My algorithms never terminate, though, because the norm-2 of the gradient is always a positive number. For example, a lambda of 1000 will give me coefficients in the 10^(-19) range, but the norm-2 of my gradient is ~1.5 after several thousand iterations. While my gradient values all converge to something in the 0 to 1 range, my step size becomes extremely small (10^(-37) range). If I let the algorithm run for longer the situation does not improve; it appears to have gotten stuck somehow.
Both my gradient and coordinate descent algorithms converge on the same point and give the same norm2(gradient) number for the termination condition. They also work quite well with lambda of 0. If I use a very small lambda(say 0.001) I get convergence, a lambda of 0.1 looks like it would converge if I ran it for an hour or two, a lambda any greater and the convergence rate is so small it's useless.
I had a few questions that I think might relate to the problem?
In calculating the gradient I am using a finite difference method, (f(x+h) - f(x-h))/(2h), with an h of 10^(-5). Any thoughts on this value of h?
Another thought was that at these very tiny steps it is traveling back and forth in a direction nearly orthogonal to the minimum, making the convergence rate so slow it is useless.
My last thought was that perhaps I should be using a different termination method, perhaps looking at the rate of convergence, if the convergence rate is extremely slow then terminate. Is this a common termination method?

The 1-norm isn't differentiable. This will cause fundamental problems with a lot of things, notably the termination test you chose; the gradient will change drastically around your minimum and fail to exist on a set of measure zero.
The termination test you really want will be along the lines of "there is a very short vector in the subgradient."
It is fairly easy to find the shortest vector in the subgradient of ||Ax-b||_2^2 + lambda ||x||_1. Choose, wisely, a tolerance eps and do the following steps:
Compute v = grad(||Ax-b||_2^2).
If x[i] < -eps, then subtract lambda from v[i]. If x[i] > eps, then add lambda to v[i]. If -eps <= x[i] <= eps, then add the number in [-lambda, lambda] to v[i] that minimises |v[i]|.
You can do your termination test here, treating v as the gradient. I'd also recommend using v for the gradient when choosing where your next iterate should be.
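The steps above can be sketched as follows. This is a minimal plain-Python version, assuming the caller has already computed v as the gradient of the smooth least-squares term; the function name min_norm_subgradient and the default eps are made up for illustration. For a coordinate near zero, the value in [-lambda, lambda] that minimises |v[i]| amounts to shrinking v[i] towards zero by lambda.

```python
import math


def min_norm_subgradient(v, x, lam, eps=1e-8):
    """Shortest subgradient of smooth(x) + lam * ||x||_1.

    v: gradient of the smooth term at x
    x: current iterate
    """
    out = []
    for vi, xi in zip(v, x):
        if xi < -eps:
            out.append(vi - lam)          # d|x|/dx = -1
        elif xi > eps:
            out.append(vi + lam)          # d|x|/dx = +1
        else:
            # Any value in [vi - lam, vi + lam] is allowed; take the
            # one closest to zero.
            out.append(math.copysign(max(abs(vi) - lam, 0.0), vi))
    return out


def should_stop(v, x, lam, tol=1e-3, eps=1e-8):
    """Termination test: 2-norm of the shortest subgradient below tol."""
    g = min_norm_subgradient(v, x, lam, eps)
    return math.sqrt(sum(gi * gi for gi in g)) < tol


x = [0.0, 2.0, -3.0, 0.0]
v = [0.5, 1.0, 1.0, -4.0]
print(min_norm_subgradient(v, x, lam=1.0))
```

With a large lambda and coefficients at zero, every |v[i]| <= lambda gives a zero entry, so the test passes even though the plain gradient norm never gets close to zero.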
