Different delta rules - neural-network

I am having a lot of trouble understanding the delta rule.
As far as I know, the delta rule is used to update the weights of the network during learning.
Let's say I have these two formulas:
first formula
second formula
What exactly does the first formula say? The new weight should be computed as:
learning rate (eta) * gradient of the loss function.
This result will be the new weight. Am I correct?
The second formula is confusing. What exactly does it say? Both should be delta rules, but what is the difference between them? Could you please explain what the following parts of the formula are?
I think it is like this (but I am not totally sure about some parts):
change of the weight between neurons i and j = learning rate (eta) * gamma_i(t) (I have no idea what this is; is it an output?) * x_j (which I believe is the input of neuron j) + momentum (that is OK) * w_ij(t-1) (which I think is the previous weight).
Thank you for your help.

The delta rule is a gradient descent algorithm.
The two formulas you've given provide the gradient of the weights used to perform the gradient descent, not the new weights.
The first formula is a general expression, while the second is a rule for computing the gradient coefficient as a function of the previous gradient.
The new weights are computed as in every gradient descent algorithm:
w_new = w - lambda*dw
where lambda is a positive number that may be constant or may depend on the iteration number.
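For concreteness, here is a minimal Octave/MATLAB sketch of one such update for a single linear neuron, including the momentum term from the second formula. The minus sign in w - lambda*dw is absorbed into the error term here, and all the names and toy values (eta, mu, dw_prev, the vectors) are mine, not from your formulas:
eta = 0.1;                 % learning rate (eta)
mu  = 0.9;                 % momentum coefficient
x = [0.5; -1.2; 0.8];      % inputs x_j
w = [0.2; 0.1; -0.3];      % current weights w_ij
dw_prev = zeros(size(w));  % previous weight change, Delta_w(t-1)
target = 1.0;              % desired output

y = w' * x;                           % neuron output (linear activation)
err = target - y;                     % error term (the gamma_i(t) in the second formula)
dw = eta * err * x + mu * dw_prev;    % Delta_w(t): delta rule plus momentum
w = w + dw;                           % apply the update; dw becomes dw_prev next step
Repeating this update over the training samples is the whole learning procedure; only the computation of err changes when a nonlinear activation is involved.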


scale the loss value according to "badness" in caffe

I want to scale the loss value of each image based on how close/far the "current prediction" is to the "correct label" during training. For example, if the correct label is "cat" and the network thinks it is "dog", the penalty (loss) should be smaller than if the network thinks it is a "car".
The way I am doing it is as follows:
1- I defined a matrix of distances between the labels,
2- I pass that matrix as a bottom to the "SoftmaxWithLoss" layer,
3- I multiply each log(prob) by this value to scale the loss according to badness in forward_cpu.
However, I do not know what I should do in the backward_cpu part. I understand that the gradient (bottom_diff) has to be changed, but I am not quite sure how to incorporate the scale value here. According to the math, I have to scale the gradient by the scale (because it is just a scale), but I don't know how.
Also, it seems there is a loss layer in caffe called "InfogainLoss" that does a very similar job, if I am not mistaken; however, the backward part of this layer is a little confusing:
bottom_diff[i * dim + j] = scale * infogain_mat[label * dim + j] / prob;
I am not sure why infogain_mat[] is divided by prob rather than multiplied by it! If I use the identity matrix for infogain_mat, isn't it supposed to act like the softmax loss in both the forward and backward passes?
It would be highly appreciated if someone could give me some pointers.
You are correct in observing that the scaling you are doing for log(prob) is exactly what the "InfogainLoss" layer is doing (you can read more about it here and here).
As for the derivative (back-prop): the loss computed by this layer is
L = - sum_j infogain_mat[label * dim + j] * log( prob(j) )
If you differentiate this expression with respect to prob(j) (which is the input variable to this layer), you'll notice that, since the derivative of log(x) is 1/x, you get
dL/dprob(j) = - infogain_mat[label * dim + j] / prob(j)
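To see this concretely, here is a small Octave/MATLAB check of the forward loss and this derivative for one sample (toy values of mine, not caffe code; I use 1-based indexing for the label):
H = eye(3);              % infogain_mat; the identity reduces this to plain log loss
prob = [0.7; 0.2; 0.1];  % softmax probabilities for one sample
label = 2;               % ground-truth class

L = -sum(H(label, :) .* log(prob'));   % forward: L = -sum_j H(label,j)*log(prob(j))
dL_dprob = -(H(label, :)') ./ prob;    % backward: dL/dprob(j) = -H(label,j)/prob(j)
With the identity matrix, dL_dprob comes out as [0; -5; 0], i.e. -1/prob(label) in the label's slot and zero elsewhere, matching the expression above.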
Now, why don't you see a similar expression in the back-prop of the "SoftmaxWithLoss" layer?
Well, as the name of that layer suggests, it is actually a combination of two layers: a softmax that computes class probabilities from the classifier outputs, and a log-loss layer on top of it. Combining these two layers enables a more numerically robust estimation of the gradients.
Working a little with the "InfogainLoss" layer, I noticed that sometimes prob(j) can have a very small value, leading to an unstable estimation of the gradients.
Here's a detailed computation of the forward and backward passes of the "SoftmaxWithLoss" and "InfogainLoss" layers with respect to the raw predictions (x), rather than the "softmax" probabilities derived from these predictions using a softmax layer. You can use these equations to create a "SoftmaxWithInfogainLoss" layer that is more numerically robust than computing the infogain loss on top of a softmax layer.
PS,
Note that if you are going to use the infogain loss for weighting, you should feed H (the infogain_mat) with label similarities rather than distances.
Update:
I recently implemented this robust gradient computation and created this pull request. The PR was merged into the master branch in April 2017.

fitness in inverted pendulum

What fitness function should be used to solve an inverted pendulum?
I am evolving neural networks with a genetic algorithm, and I don't know how to evaluate each individual.
I tried minimizing the angle of the pendulum and maximizing the distance traveled at the end of the evaluation time (10 s), but this did not work.
The inputs for the neural network are: cart velocity, cart position, pendulum angular velocity, and pendulum angle at time (t). The output is the force applied at time (t+1).
Thanks in advance.
I found this paper, which gives their objective function, defined as:
where "Xmax = 1.0, thetaMax = pi/6, _X'max = 1.0, theta'Max =
3.0, N is the number of iteration steps, T = 0.02 * TS and Wk are selected positive weights." (Using specific values for angles, velocities, and positions from the paper, however, you will want to use your own values depending on the boundary conditions of your pendulum).
The paper also states: "The first and second terms determine the accumulated sum of normalised absolute deviations of X1 and X3 from zero, and the third term, when minimised, maximises the survival time."
That should be more than enough to get started with, but I highly recommend you read the whole paper. It's a great read, and I found it quite educational.
You can make your own fitness function, but I think the idea of using the position, velocity, angle, and rate of change of the pendulum's angle is a good one. You can, however, use those variables in very different ways than the way the authors of the paper chose to model their function; a rough sketch follows.
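As a starting point, here is a rough Octave/MATLAB sketch in the spirit of the paper's objective. Everything here is a placeholder assumption of mine: simulate_pendulum stands in for your own simulator, and the weights and the exact form of the survival term are things you would tune:
% simulate_pendulum is assumed to run one individual and return the cart
% position X and pendulum angle theta at every step, how many of the N
% allowed steps it survived, and N itself.
function f = pendulum_fitness(individual)
  Xmax = 1.0;  thetaMax = pi/6;             % normalisation bounds, as in the paper
  [X, theta, stepsSurvived, N] = simulate_pendulum(individual);
  w = [1, 1, 1];                            % positive weights W_k, to be tuned
  f = w(1) * sum(abs(X)) / Xmax ...         % accumulated normalised |position|
    + w(2) * sum(abs(theta)) / thetaMax ... % accumulated normalised |angle|
    + w(3) * (N - stepsSurvived) / N;       % small when the pendulum survives long
end
Lower f is better here, so a GA that maximizes fitness can simply use -f.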
It wouldn't hurt to read up on harmonic oscillators either. They take the general form:
mx" + Bx' -kx = Acos(w*t)
(where B, or A may be 0 depending on whether or not the oscillator is damped/undamped or driven/undriven respectively).

Minimization of L1-Regularized system, converging on non-minimum location?

This is my first post to Stack Overflow, so if this isn't the correct area, I apologize. I am working on minimizing an L1-regularized system.
This weekend was my first dive into optimization. I have a basic linear system Y = X*B, where X is an n-by-p matrix, B is a p-by-1 vector of model coefficients, and Y is an n-by-1 output vector.
I am trying to find the model coefficients, and I have implemented both gradient descent and coordinate descent algorithms to minimize the L1-regularized system. To find my step size I am using the backtracking algorithm, and I terminate the algorithm by looking at the 2-norm of the gradient, stopping when it is 'close enough' to zero (for now I'm using 0.001).
The function I am trying to minimize is the following: (0.5)*(norm((Y - X*B),2)^2) + lambda*norm(B,1). (Note: by norm(Y,2) I mean the 2-norm of the vector Y.) My X matrix is 150-by-5 and is not sparse.
If I set the regularization parameter lambda to zero, I should converge on the least squares solution, and I can verify that both of my algorithms do this pretty well and fairly quickly.
If I start to increase lambda, my model coefficients all tend towards zero; this is what I expect. My algorithms never terminate, though, because the 2-norm of the gradient is always a positive number. For example, a lambda of 1000 gives me coefficients in the 10^(-19) range, but the 2-norm of my gradient is ~1.5, and this is after several thousand iterations. While my gradient values all converge to something in the 0 to 1 range, my step size becomes extremely small (in the 10^(-37) range). If I let the algorithm run for longer, the situation does not improve; it appears to have gotten stuck somehow.
Both my gradient and coordinate descent algorithms converge on the same point and give the same norm2(gradient) value for the termination condition. They also work quite well with a lambda of 0. If I use a very small lambda (say 0.001) I get convergence; a lambda of 0.1 looks like it would converge if I ran it for an hour or two; for any greater lambda the convergence rate is so small it's useless.
I have a few thoughts that I think might relate to the problem:
In calculating the gradient I am using a finite difference method ((f(x+h) - f(x-h))/(2h)) with an h of 10^(-5). Any thoughts on this value of h?
Another thought was that at these very tiny steps it is traveling back and forth in a direction nearly orthogonal to the minimum, making the convergence rate so slow it is useless.
My last thought was that perhaps I should be using a different termination method, perhaps looking at the rate of convergence; if the convergence rate is extremely slow, then terminate. Is this a common termination method?
The 1-norm isn't differentiable. This will cause fundamental problems with a lot of things, notably the termination test you chose; the gradient will change drastically around your minimum and fail to exist on a set of measure zero.
The termination test you really want will be along the lines of "there is a very short vector in the subgradient."
It is fairly easy to find the shortest vector in the subgradient of ||Ax-b||_2^2 + lambda ||x||_1. Choose a tolerance eps wisely and do the following steps:
Compute v = grad(||Ax-b||_2^2).
If x[i] < -eps, then subtract lambda from v[i]. If x[i] > eps, then add lambda to v[i]. If -eps <= x[i] <= eps, then add the number in [-lambda, lambda] to v[i] that minimises |v[i]|, i.e. brings v[i] as close to zero as possible.
You can do your termination test here, treating v as the gradient. I'd also recommend using v for the gradient when choosing where your next iterate should be.
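Here is a short Octave/MATLAB sketch of that test as I read it (my own code, not a reference implementation; I use the 0.5-scaled objective from your question, so the smooth gradient is A'*(A*x-b), and eps_x is the tolerance eps above):
% Minimum-norm subgradient of 0.5*||A*x-b||_2^2 + lambda*||x||_1
function g = min_norm_subgradient(A, b, lambda, x, eps_x)
  g = A' * (A * x - b);        % gradient of the smooth part
  neg = x < -eps_x;            % clearly negative coordinates: subtract lambda
  pos = x >  eps_x;            % clearly positive coordinates: add lambda
  g(neg) = g(neg) - lambda;
  g(pos) = g(pos) + lambda;
  mid = ~neg & ~pos;           % |x[i]| <= eps_x: pick s in [-lambda, lambda]
  g(mid) = sign(g(mid)) .* max(abs(g(mid)) - lambda, 0);  % closest to zero
end
You would then terminate when norm(min_norm_subgradient(A, b, lambda, x, eps_x)) falls below your 0.001 threshold, instead of testing the norm of the finite-difference gradient.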

Goodness of fit with MATLAB and chi-square test

I would like to measure the goodness of fit to an exponential decay curve. I am using the MATLAB function lsqcurvefit. It has been suggested to me that I do a chi-square test.
I would like to use the MATLAB function chi2gof, but I am not sure how I would tell it that the data is being fitted to an exponential curve.
The chi2gof function tests the null hypothesis that a set of data, say X, is a random sample drawn from some specified distribution (such as the exponential distribution).
From your description in the question, it sounds like you want to see how well your data X fits an exponential decay function. I really must emphasize that this is completely different from testing whether X is a random sample drawn from the exponential distribution. If you use chi2gof for your stated purpose, you'll get meaningless results.
The usual approach for testing the goodness of fit of some data X to some function f is least squares, or some variant of least squares. Further, a least squares approach can be used to generate test statistics that test goodness of fit, many of which are distributed according to the chi-square distribution. I believe this is probably what your friend was referring to.
EDIT: I have a few spare minutes, so here's something to get you started. DISCLAIMER: I've never worked specifically on this problem, so what follows may not be correct.
I'm going to assume you have a set of data x_n, n = 1, ..., N, and the corresponding timestamps for the data, t_n, n = 1, ..., N. Now, the exponential decay function is y_n = y_0 * e^{-b * t_n}. Note that by taking the natural logarithm of both sides we get: ln(y_n) = ln(y_0) - b * t_n.
Okay, so this suggests using OLS to estimate the linear model ln(x_n) = ln(x_0) - b * t_n + e_n. Nice! Because now we can test goodness of fit using the standard R^2 measure, which MATLAB will return in the stats structure if you use the regress function to perform the OLS.
Hope this helps. Again I emphasize, I came up with this off the top of my head in a couple of minutes, so there may be good reasons why what I've suggested is a bad idea. Also, if you know the initial value of the process (i.e. x_0), then you may want to look into constrained least squares, where you bind the parameter ln(x_0) to its known value.
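A minimal Octave/MATLAB sketch of this idea (my own code; it assumes x and t are column vectors of the same length with all x > 0, and that regress is available, i.e. the Statistics Toolbox in MATLAB or the statistics package in Octave):
% Fit ln(x_n) = ln(x_0) - b*t_n + e_n by OLS and read off R^2.
y = log(x);                    % log-transform the decay data
X = [ones(size(t)), t];        % design matrix: intercept and time
[beta, ~, ~, ~, stats] = regress(y, X);
x0_hat = exp(beta(1));         % estimated initial value x_0
b_hat  = -beta(2);             % estimated decay rate b
R2     = stats(1);             % goodness of fit: R^2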

Solving multiple phase angles for multiple equations

I have several equations, each with its own frequency and amplitude. I would like to sum the equations together and adjust the individual phases phase1, phase2, and phase3 to keep the total amplitude of eq_total under a specific value like 0.8. I know I can normalize the signal or change the vertical offset, but for my purposes I need the amplitude to be controlled by changing/finding values for just the phases phase1, phase2, and phase3 that limit the maximum amplitude when the equations are summed.
Note: I'm using constructive and destructive phase interference to adjust the maximum amplitude of the summed equations.
Example:
eq1 = 0.2*cos(2*pi*t*3 + phase1) + verticalOffset1
eq2 = 0.7*cos(2*pi*t*9 + phase2) + verticalOffset2
eq3 = 0.8*cos(2*pi*t*5 + phase3) + verticalOffset3
eq_total = eq1 + eq2 + eq3
Is there a way to solve for phase1, phase2, and phase3 so that the amplitude of the summed signals in eq_total never goes over 0.8, just by adjusting/finding the values of phase1, phase2, and phase3?
Here's a picture of a GeoGebra applet I tested this idea with.
Here's the GeoGebra ggb file I used to edit/test the idea with (I used this to see if my idea would work). Java is required if you want to interact with the applet dynamically.
http://dl.dropbox.com/u/6576402/questions/ggb/sin_find_phases_example.ggb
I'm using MATLAB/Octave.
Thanks.
Your example
eq1 = 0.2*cos(2*pi*t*3 + phase1) + verticalOffset1
eq2 = 0.7*cos(2*pi*t*9 + phase2) + verticalOffset2
eq3 = 0.8*cos(2*pi*t*5 + phase3) + verticalOffset3
eq_total = eq1 + eq2 + eq3
where the maximum amplitude should be less than 0.8, has infinitely many solutions. Unless you have some additional objective you'd like to achieve, I suggest modifying the problem so that you find the combination of phase angles with a maximum amplitude of exactly 0.8 (or 0.79, so that you're guaranteed to be below).
Furthermore, only two of the three phase angles are effectively independent: shifting time by some tau adds 2*pi*3*tau, 2*pi*9*tau, and 2*pi*5*tau to the respective phases without changing the maximum of the sum. Thus, you have only two unknowns in eq_total.
You can solve the nonlinear optimization problem using e.g. FMINSEARCH. You formulate the problem so that max(abs(eq_total(phase1,phase2))) equals 0.79.
Thus:
%# define the vector t and verticalOffset here
%# objectiveFunction is (eq_total-0.79)^2, so phase shifts 1 and 2 that
%# satisfy this (approximately) should guarantee that the signal never exceeds 0.8
objectiveFunction = @(phase)(max(abs(0.2*cos(2*pi*t+phase(1))+0.7*cos(2*pi*t*9+phase(2))+0.8*cos(2*pi*t*5)+verticalOffset)) - 0.79)^2;
%# search for optimal phase shift, starting at no shift
solution = fminsearch(objectiveFunction,[0;0]);
EDIT
Unfortunately, when I try this code and plot the results, the maximum amplitude is not 0.79, it's over 1. Am I doing something wrong? See the code below:
t = linspace(0,1,8000);
verticalOffset = 0;
objectiveFunction = @(phase)(max(abs(0.2*cos(2*pi*t+phase(1))+0.7*cos(2*pi*t*9+phase(2))+0.8*cos(2*pi*t*5)+verticalOffset)) - 0.79)^2;
s1 = fminsearch(objectiveFunction,[0;0])
eqt = 0.2*cos(2*pi*t+s1(1))+0.7*cos(2*pi*t*9+s1(2))+0.8*cos(2*pi*t*5)+verticalOffset;
plot(eqt)
fminsearch will find a minimum of the objective function. Whether this solution satisfies all your conditions is something you have to test. In this case, the solution given by fminsearch with the starting value [0;0] gives a maximum of ~1.3, which is obviously not good enough. However, when you plot the maximum for a range of phase angles from 0 to 2*pi, you'll see that fminsearch didn't get stuck in a bad local minimum; rather, there is no good solution at all (in the plot, the z-axis is the maximum).
If I understand you correctly, you are trying to find a phase that varies the amplitude of a signal. To my knowledge, this is not possible.
For a signal
s = A * cos (w*t + phi)
only A allows you to change the amplitude. With w you change the frequency of the signal, and phi regulates the "horizontal shift".
Furthermore, I think you are missing a "moving variable" like the time t in the equation above.
Maybe this article clarifies things a little.
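A quick Octave/MATLAB check of both points (toy values of mine): the phase does not change the amplitude of a single sinusoid, but the peak of a sum of sinusoids does depend on the relative phases, which is what the fminsearch answer above exploits:
t = linspace(0, 1, 8000);
max(0.7 * cos(2*pi*9*t))           % a single sinusoid: peak ~0.7
max(0.7 * cos(2*pi*9*t + pi/3))    % phase-shifted: peak still ~0.7
max(0.2*cos(2*pi*3*t) + 0.7*cos(2*pi*9*t))           % sum: one peak value
max(0.2*cos(2*pi*3*t) + 0.7*cos(2*pi*9*t + pi/2))    % different relative phase, different peak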
If you set all the vertical offsets to -1, then it solves your problem, because each eq# will never be greater than 0, so the sum can never exceed 0.8.
I know that this isn't that helpful, but I'm hoping it will help you understand your problem better.
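A quick sanity check of this trick in Octave/MATLAB, using the frequencies from the question with every offset set to -1:
t = linspace(0, 1, 8000);
eqt = 0.2*cos(2*pi*t*3) - 1 + 0.7*cos(2*pi*t*9) - 1 + 0.8*cos(2*pi*t*5) - 1;
max(eqt)    % = -1.3 (all cosines peak at t = 0), comfortably below 0.8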