What is the idea behind Double DQN?
The Bellman equation used to calculate the target Q-value for updating the online network in Double DQN is:
value = reward + discount_factor * target_network.predict(next_state)[argmax(online_network.predict(next_state))]
The Bellman equation used to calculate the Q value updates in the original DQN is:
value = reward + discount_factor * max(target_network.predict(next_state))
but the target network used for evaluating the action is updated using the weights of the online network, so the value fed into the target is basically the old Q-value of the action.
Any ideas on how adding another network, based on the weights of the first network, helps?
I really liked the explanation from here:
https://becominghuman.ai/beat-atari-with-deep-reinforcement-learning-part-2-dqn-improvements-d3563f665a2c
"This is actually quite simple: you probably remember from the previous post that we were trying to optimize the Q function defined as follows:
Q(s, a) = r + γ maxₐ’(Q(s’, a’))
Because this definition is recursive (the Q value depends on other Q values), in Q-learning we end up training a network to predict its own output, as we pointed out last time.
The problem of course is that at each minibatch of training, we are changing both Q(s, a) and Q(s’, a’), in other words, we are getting closer to our target but also moving our target! This can make it a lot harder for our network to converge.
It thus seems like we should instead use a fixed target so as to avoid this problem of the network “chasing its own tail”, but of course that isn’t possible since the target Q function should get better and better as we train."
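To make the difference concrete, here is a minimal numpy sketch of the two target computations, assuming (as in the pseudocode above) that online_network.predict(state) and target_network.predict(state) each return a 1-D array with one Q-value per action:

import numpy as np

def dqn_target(reward, next_state, discount_factor, target_network):
    # Vanilla DQN: the target network both selects and evaluates the
    # next action, which is what drives the overestimation bias.
    return reward + discount_factor * np.max(target_network.predict(next_state))

def double_dqn_target(reward, next_state, discount_factor,
                      online_network, target_network):
    # Double DQN: the online network selects the action, while the
    # slower-moving target network evaluates it, decoupling the two roles.
    best_action = np.argmax(online_network.predict(next_state))
    return reward + discount_factor * target_network.predict(next_state)[best_action]

Because the two networks are unlikely to overestimate the same actions at the same time, evaluating the online network's chosen action with the target network is less biased than taking the max over a single network's own estimates.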
Related
I'm completely new to reinforcement learning, so I may be wrong.
My questions are:
Is the Q-Learning equation ( Q(s, a) = r + y * max(Q(s', a')) ) used in DQN only for computing a loss function?
Is the equation recurrent? Assume I use DQN for, say, playing Atari Breakout; the number of possible states is very large (assuming the state is a single game frame), so it's not efficient to create a matrix of all the Q-values. The equation should update the Q-value of a given [state, action] pair, so what will it do in the case of DQN? Will it call itself recursively? If it did, the equation couldn't be calculated, because the recursion would never stop.
I've already tried to find what I want and I've seen many tutorials, but almost none of them show the background; they just implement it using a Python library like Keras.
Thanks in advance, and I apologise if something sounds dumb; I just don't get it.
Is the Q-Learning equation ( Q(s, a) = r + y * max(Q(s', a')) ) used in DQN only for computing a loss function?
Yes, generally that equation is only used to define our losses. More specifically, it is rearranged a bit; that equation is what we expect to hold, but it generally does not yet precisely hold during training. We subtract the right-hand side from the left-hand side to compute a (temporal-difference) error, and that error is used in the loss function.
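As a minimal sketch of that rearrangement (q_values is a hypothetical function returning one Q-value per action for a given state):

import numpy as np

def td_loss(q_values, state, action, reward, next_state, gamma):
    # What we expect to hold: Q(s, a) = r + gamma * max_a' Q(s', a')
    target = reward + gamma * np.max(q_values(next_state))
    prediction = q_values(state)[action]
    td_error = target - prediction   # how far the equation is from holding
    return td_error ** 2             # squared TD error used as the loss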
Is the equation recurrent? Assume I use DQN for, say, playing Atari Breakout; the number of possible states is very large (assuming the state is a single game frame), so it's not efficient to create a matrix of all the Q-values. The equation should update the Q-value of a given [state, action] pair, so what will it do in the case of DQN? Will it call itself recursively? If it did, the equation couldn't be calculated, because the recursion would never stop.
Indeed the space of state-action pairs is much too large to enumerate them all in a matrix/table. In other words, we can't use Tabular RL. This is precisely why we use a Neural Network in DQN though. You can view Q(s, a) as a function. In the tabular case, Q(s, a) is simply a function that uses s and a to index into a table/matrix of values.
In the case of DQN and other Deep RL approaches, we use a Neural Network to approximate such a "function". We use s (and potentially a, though not really in the case of DQN) to create features based on that state (and action). In the case of DQN and Atari games, we simply take a stack of raw images/pixels as features. These are then used as inputs for the Neural Network. At the other end of the NN, DQN provides Q-values as outputs. In the case of DQN, multiple outputs are provided; one for every action a. So, in conclusion, when you read Q(s, a) you should think "the output corresponding to a when we plug the features/images/pixels of s as inputs into our network".
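As an illustration (not the actual DQN architecture), a tiny linear "network" with one output per action could look like this:

import numpy as np

# Hypothetical toy Q-network: a single linear layer mapping state features
# to one Q-value per action. DQN uses a deep convolutional net instead,
# but the input/output contract is the same.
n_features, n_actions = 4, 3
W = np.random.randn(n_features, n_actions) * 0.01

def q_values(state_features):
    # "Q(s, a)" is simply the a-th entry of this output vector.
    return state_features @ W

So q_values(s)[a] plays the role of Q(s, a).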
Further question from comments:
I think I still don't get the idea... Let's say we did one iteration through the network with state S and we got the following output [A = 0.8, B = 0.1, C = 0.1] (where A, B and C are possible actions). We also got a reward R = 1 and set the y (a.k.a. gamma) to 0.95. Now, how can we put these variables into the loss function formula https://imgur.com/a/2wTj7Yn? I don't understand what the prediction is if the DQN outputs which action to take. Also, what's the target Q? Could you post the formula with the variables filled in, please?
First a small correction: DQN does not output which action to take. Given inputs (a state s), it provides one output value per action a, which can be interpreted as an estimate of the Q(s, a) value for the input state s and the action a corresponding to that particular output. These values are typically used afterwards to determine which action to take (for example by selecting the action corresponding to the maximum Q value), so in some sense the action can be derived from the outputs of DQN, but DQN does not directly provide actions to take as outputs.
Anyway, let's consider the example situation. The loss function from the image is:
loss = (r + gamma max_a' Q-hat(s', a') - Q(s, a))^2
Note that there's a small mistake in the image: it has the old state s inside the Q-hat instead of the new state s'. The s' in the formula above is correct.
In this formula:
r is the observed reward
gamma is (typically) a constant value
Q(s, a) is one of the output values from our Neural Network that we get when we provide it with s as input. Specifically, it is the output value corresponding to the action a that we have executed. So, in your example, if we chose to execute action A in state s, we have Q(s, A) = 0.8.
s' is the state we happen to end up in after having executed action a in state s.
Q-hat(s', a') (which we compute once for every possible subsequent action a') is, again, one of the output values from our Neural Network. This time, it's a value we get when we provide s' as input (instead of s), and again it will be the output value corresponding to action a'.
The Q-hat instead of Q there is because, in DQN, we typically actually use two different Neural Networks. Q-values are computed using the same Neural Network that we also modify by training. Q-hat-values are computed using a different "Target Network". This Target Network is typically a "slower-moving" version of the first network. It is constructed by occasionally (e.g. once every 10K steps) copying the other Network, and leaving its weights frozen in between those copy operations.
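Putting the example numbers into that formula (the target-network outputs for s' below are made up for illustration, since they weren't given in the comment):

import numpy as np

gamma = 0.95
r = 1.0

q_s = np.array([0.8, 0.1, 0.1])   # online network's outputs for s (actions A, B, C)
action = 0                        # suppose we executed action A, so Q(s, A) = 0.8

q_hat_s_prime = np.array([0.4, 0.2, 0.3])  # hypothetical target-network outputs for s'

target = r + gamma * np.max(q_hat_s_prime)  # r + gamma * max_a' Q-hat(s', a') = 1.38
loss = (target - q_s[action]) ** 2          # (1.38 - 0.8)^2 ≈ 0.3364
print(loss)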
Firstly, the Q function is used both in the loss function and for the policy. The actual output of your Q function and the 'ideal' one are used to calculate the loss. Taking the action with the highest Q-function output among all possible actions in a state is your policy.
Secondly, no, it's not recurrent. The equation is actually slightly different from what you have posted (perhaps a mathematician can correct me on this). It is actually Q(s, a) := r + y * max(Q(s', a')). Note the colon before the equals sign. This is called the assignment operator and means that we update the left side of the equation so that it equals the right side once (not recurrently). You can think of it as being the same as the assignment operator in most programming languages (x = x + 1 doesn't cause any problems).
The Q values will propagate through the network as you keep performing updates anyway, but it can take a while.
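A minimal tabular sketch of that assignment (illustration only; DQN replaces the table with a network):

from collections import defaultdict

gamma = 0.9
Q = defaultdict(lambda: [0.0, 0.0])   # two actions per state, initialized to zero

def q_update(s, a, r, s_next):
    # The right-hand side reads the *currently stored* values for s';
    # nothing is called recursively, exactly like x = x + 1.
    Q[s][a] = r + gamma * max(Q[s_next])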
I am trying to reduce dimensionality of a training set using PCA.
I have come across two approaches.
[V, U, eigen] = pca(train_x);  % V: loadings (coeff), U: scores, eigen: eigenvalues (latent)
eigen_sum = 0;
for lamda = 1:length(eigen)
    eigen_sum = eigen_sum + eigen(lamda, 1);
    if (eigen_sum / sum(eigen) >= 0.90)  % stop once 90% of the variance is covered
        break;
    end
end
train_x = train_x * V(:, 1:lamda);  % project the raw (uncentered) data onto the first lamda loadings
Here, I simply use the eigenvalue matrix to reconstruct the training set with a lower number of features, determined by the principal components describing 90% of the original set's variance.
The alternate method that I found is almost exactly the same, save the last line, which changes to:
train_x=U(:,1:lamda);
In other words, we take the training set as the principal component representation of the original training set up to some feature lamda.
Both of these methods seem to yield similar results (out-of-sample test error), but there is a difference, however minuscule it may be.
My question is, which one is the right method?
The answer depends on your data, and what you want to do.
Using your variable names: generally speaking, it is easy to expect that the outputs of pca maintain
U = train_x * V
But this is only true if your data is normalized, specifically if you already removed the mean from each component. If not, then what one can expect is
U = train_x * V - mean(train_x * V)
And in that regard, whether you want to remove or maintain the mean of your data before processing it depends on your application.
It's also worth noting that even if you remove the mean before processing, there might be some small difference, but it will be around floating-point precision error:
((train_x * V) - U) ./ U ~~ 1.0e-15
And this error can be safely ignored.
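A quick numerical check of this claim (a numpy sketch using the SVD of the centered data, which is what MATLAB's pca computes internally):

import numpy as np

rng = np.random.default_rng(0)
train_x = rng.normal(size=(100, 5)) + 3.0   # data with a non-zero mean

Xc = train_x - train_x.mean(axis=0)         # remove the mean first
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
V = Vt.T                                    # loadings, analogous to pca's V
U = Xc @ V                                  # scores, analogous to pca's U

# With the mean removed, the identity holds up to floating-point error:
print(np.max(np.abs(train_x @ V - train_x.mean(axis=0) @ V - U)))  # ~1e-14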
Just a few questions; I hope someone will find time to answer :).
What if we have a COUPLED model, for example a system of n independent variables X and n nonlinear partial differential equations PDEf(X, PDEf(X)) in TIME that depend on X and on PDEf(X) (partial differential equations depending on the variables X)? Can you give some advice? Here is one example:
Let's say that c is the output, or desired, variable, and that r is the independent variable. The partial differential equation looks like:
∂c/∂t = D*(1/r)*(∂c/∂r) + 2*D*(∂²c/∂r²)
D=constant
r = 0:0.1:Rp is Matlab syntax; how do I represent the same in Modelica (I used an integrator, but it didn't work)?
Here is the code (it does not work):
model PDEtest
  /* Boundary conditions
     1. delta(c)/delta(r)=0 for r=0
     2. delta(c)/delta(r)=-j*d for r=Rp */
  parameter Real Rp=88*1e-3; // length
  parameter Real initialConc=1000;
  parameter Real Dp=1e-14;
  parameter Integer np=10; // num. of points
  Real cp[np](start=fill(initialConc,np));
  Modelica.Blocks.Continuous.Integrator r(k=1); // independent x1
  Real j;
protected
  parameter Real dr=Rp/np;
  parameter Real ts= 0.01; // for using when loop (sample(0,ts) )
algorithm
  j:=sin(time); // this should be independent variable like x2
  r.u:=dr;
  while r.y<=Rp loop
    for i in 2:np-1 loop
      der(cp[i]):=2*Dp/r.y+(cp[i]-cp[i-1])/dr+2*(Dp*(cp[i+1]-2*cp[i]+cp[i-1])/dr^2);
    end for;
    if r.y==Rp then
      cp[np]:=-j*Dp;
    end if;
    cp[1]:=if time >=0 then initialConc else initialConc;
  end while;
  annotation (uses(Modelica(version="3.2")));
end PDEtest;
Here are more questions:
This code doesn't work in OpenModelica 1.8.1, and it also doesn't work in the Dymola 2013 demo. How can we have a continuous function of the variable c, not an array of functions?
Can we place the values of the array cp in a combiTable? And how?
If “equation” is used instead of “algorithm”, the code can't be checked successfully. Why? In OpenModelica, the error is: could not flatten model :S.
Is there any simplified way to use a set of coupled equations (PDEs)? I know about the PDE libraries for Modelica, but I think they are complicated. I want to write a function that solves the PDE and call that function in the “main model”, so that the output of the function is a continuous function c. I don't know what to do with an array of functions.
Can you give me advice on how to understand the Modelica language if we “speak” Matlab? For example, in Matlab we can specify the values of the independent variable r like r=0:TimeStep:Rp… How do we do the same in Modelica? And please explain how the “equation” section works: is there a similarity with Matlab, and is a sequential approach necessary?
Cheers :)
It's hard to answer your question, since you are assuming that Modelica ~ Matlab, but that's not the case. So I won't comment on your code, since it's really wrong. Let me give you an example model for the Burgers equation. Maybe you could use it as a starting point.
model burgereqn
  parameter Integer N = 10;              // number of interior grid points
  parameter Real h = 1/(N+1);            // grid spacing
  parameter Real v = 234;                // viscosity
  parameter Real Pi = 3.14159265358979;
  parameter Real x[N+2] = {h*i for i in 1:N+2};  // grid coordinates
  parameter Real u0[N+2] = {(sin(2*Pi*x[i]) + 0.5*sin(Pi*x[i])) for i in 1:N+2};  // initial condition
  Real u[N+2](start=u0);
equation
  der(u[1]) = 0;                         // boundary values held fixed
  for i in 2:N+1 loop
    der(u[i]) = - ((u[i+1]^2 - u[i-1]^2)/(4*(x[i+1] - x[i-1])))
                + (v/(x[i+1] - x[i-1])^2)*(u[i+1] - 2*u[i] + u[i-1]);
  end for;
  der(u[N+2]) = 0;
end burgereqn;
Your further questions:
cp is a continuous variable, and the array represents its value at every discretization point.
Why would you want to do that? As far as I understand, cp is your desired solution variable.
You should almost always use an equation section; algorithm sections are usually used in functions. I'm pretty sure you can represent your desired behaviour with equations.
I don't know that library, but the hard part of a PDE is the discretization and the solving itself. You may run into issues while solving the PDE with a Modelica tool, since a Modelica tool usually has no specialized solving algorithm for PDEs.
Please consider further references for that question. You could start with Modelica.org.
Rheological models are usually built using three (or four) basic elements, which are:
The spring (existing in Modelica.Mechanics.Translational.Components for example). Its equation is f = c * (s_rel - s_rel0);
The damper (dashpot) (also existing in Modelica.Mechanics.Translational.Components). Its equation is f = d * v_rel; for a linear damper, and it could easily be modified to model a non-linear damper: f = d * v_rel^(1/n);
The slider, which (as far as I know) does not exist in this library... Its equation is abs(f) <= flim. Unfortunately, I don't really understand how I could write the corresponding Modelica model...
I think this model should extend Modelica.Mechanics.Translational.Interfaces.PartialCompliant, but the problem is that f (the force measured between flange_b and flange_a) should be modified only when it's greater than flim...
If the slider extends PartialCompliant, it means that it already follows the equations flange_b.f = f; and flange_a.f = -f;
Adding the equation f = if abs(f)>flim then sign(f)*flim else f; gives me the error "An independent subset of the model has imbalanced number of equations and variables", which I couldn't really explain, even though I understand that if abs(f)<=flim, the equation f = f is useless...
Actually, the slider element doesn't generate a new force (unlike the spring, whose force depends on its strain, or the damper, whose force depends on its strain rate). The force is an input for the slider element, which is sometimes modified (when this force becomes greater than the limit allowed by the element). That's why I don't really understand whether I should define this force as an input or an output...
If you have any suggestions, I would greatly appreciate them! Thanks
After the first two comments, I decided to add a picture that, I hope, will help you to understand the behaviour I'm trying to model.
On the left, you can see the four elements used to develop rheological models :
a : the spring
b : the linear damper (dashpot)
c : the non-linear damper
d : the slider
On the right, you can see the behaviour I'm trying to reproduce: a and b are two associations with springs, and c and d are respectively the expected stress/strain curves. I'm trying to model the same behaviour, except that I'm thinking in terms of force and not stress. As I said in the comment to Marco's answer, curve a reminds me of the behaviour of a diode:
if the force applied to the component is less than the sliding limit, there is no relative displacement between the two flanges
if the force becomes greater than the sliding limit, the force transmitted by the system equals the limit and there is relative displacement between the flanges (see the small sketch after this list)
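In Python-like pseudocode (just to describe the force law I mean, not Modelica):

def transmitted_force(f_applied, flim):
    # Below the sliding limit the force passes through unchanged
    # (no relative displacement between the flanges).
    if abs(f_applied) <= flim:
        return f_applied
    # Above the limit, the transmitted force saturates at +/- flim
    # and the flanges slide relative to each other.
    return flim if f_applied > 0 else -flim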
I can't be sure, but I suspect what you are really trying to model here is Coulomb friction (i.e. a constant force that always opposes the direction of motion). If so, there is already a component in the Modelica Standard Library, called MassWithStopAndFriction, that models that (and several other flavors of friction). The wrinkle is that it is bundled with inertia.
If you don't want the inertia effect, it might be possible to set the inertia to zero. I suspect that could cause a singularity. One way you might be able to avoid the singularity is to "evaluate" the parameter (at least, that is what it is called in Dymola when you set the Evaluate flag to true on the command line). No promises whether that will work, since whether such a simplification can be properly handled is model- and tool-dependent.
If Coulomb friction is what you want and you really don't want inertia and the approach above doesn't work, let me know and I think I can create a simple model that will work (so long as you don't have inertia).
A few considerations:
- The force is neither an input nor an output; it is just a relation that you add to the component in order to define how the force will be propagated between the two translational flanges of the component. When you deal with acausal connectors, I think it is better to think about the degrees of freedom of your component instead of inputs and outputs. In this case you have two connectors, and independently of which one of the two flanges receives information about the force, the equation you implement will define how that information is propagated to the other flange.
- I tested this:
model slider
  extends Modelica.Mechanics.Translational.Interfaces.PartialCompliantWithRelativeStates;
  parameter Real flim = 1;
equation
  f = if abs(f) > flim then sign(f)*flim else f;
end slider;
in Dymola, and it works. It is correct Modelica code, so it should also work in OpenModelica; I can't think of a reason why it should be seen as an unbalanced mathematical model.
I hope this helps,
Marco
Can anyone recommend a website or give me a brief of how backpropagation is implemented in a NN? I understand the basic concept, but I'm unsure of how to go about writing the code.
Many of the sources I've found simply show equations without giving any explanation of why they're doing it, and the variable names make it difficult to figure out.
Example:
/* Compute the error (delta) terms for the output layer and the total error. */
void bpnn_output_error(double *delta, double *target, double *output,
                       int nj, double *err)
{
    int j;
    double o, t, errsum;

    errsum = 0.0;
    for (j = 1; j <= nj; j++) {            /* arrays are 1-indexed here */
        o = output[j];                     /* actual output of node j   */
        t = target[j];                     /* expected output of node j */
        delta[j] = o * (1.0 - o) * (t - o);
        errsum += ABS(delta[j]);           /* ABS is assumed to be a macro, e.g.
                                              #define ABS(x) ((x) >= 0 ? (x) : -(x)) */
    }
    *err = errsum;
}
In that example, can someone explain the purpose of
delta[j] = o * (1.0 - o) * (t - o);
Thanks.
The purpose of
delta[j] = o * (1.0 - o) * (t - o);
is to find the error of an output node in a backpropagation network.
o represents the output of the node, t is the expected value of output for the node.
The term o * (1.0 - o) is the derivative of a commonly used transfer function, the sigmoid function. (Other transfer functions are not uncommon and would require rewriting the code to use their first derivative instead of the sigmoid's. A mismatch between function and derivative would likely mean that training would not converge.) The node has an "activation" value that is fed through a transfer function to obtain the output o, like
o = f(activation)
The main thing is that backpropagation uses gradient descent, and the error gets backward-propagated by application of the Chain Rule. The problem is one of credit assignment, or blame if you will, for the hidden nodes whose output is not directly comparable to the expected value. We start with what is known and comparable, the output nodes. The error is taken to be proportional to the first derivative of the output times the raw error value between the expected output and actual output.
So more symbolically, we'd write that line as
delta[j] = f'(activation_j) * (t_j - o_j)
where f is your transfer function, and f' is the first derivative of it.
Further back in the hidden layers, the error at a node is its estimated contribution to the errors found at the next layer. So the deltas from the succeeding layer are multiplied by the connecting weights, and those products are summed. That sum is multiplied by the first derivative of the activation of the hidden node to get the delta for a hidden node, or
delta[j] = f'(activation_j) * Sum(delta[k] * w_jk)
where j now references a hidden node and k a node in a succeeding layer.
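A small numpy sketch of both formulas, assuming sigmoid activations everywhere (the shapes and names here are illustrative, not from the original C code):

import numpy as np

def output_deltas(o, t):
    # delta_j = f'(activation_j) * (t_j - o_j); for the sigmoid,
    # f'(activation) can be written as o * (1 - o).
    return o * (1.0 - o) * (t - o)

def hidden_deltas(h, deltas_next, w):
    # delta_j = f'(activation_j) * sum_k(delta_k * w_jk), where
    # w[j, k] is the weight from hidden node j to next-layer node k.
    return h * (1.0 - h) * (w @ deltas_next)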
(t - o) is the error in the output of the network, since t is the target output and o is the actual output. It is being stored in a normalized form in the delta array. The method used to normalize depends on the implementation, and o * (1.0 - o) seems to be doing that (I could be wrong about that assumption).
This normalized error is accumulated over the entire training set to judge when the training is complete: usually when errsum falls below some target threshold.
Actually, if you know the theory, the programs should be easy to understand. You can read the book and work through some simple examples with a pencil to figure out the exact steps of propagation. This is a general principle for implementing numerical programs: you must know the details in small cases.
If you know Matlab, I'd suggest you to read some Matlab source code (e.g. here), which is easier to understand than C.
For the code in your question, the names are quite self-explanatory: output is probably the array of your predictions, target the array of training labels, and delta the error between prediction and true values; it also serves as the value used to update the weight vector.
Essentially, what backprop does is run the network on the training data, observe the output, and then adjust the values of the nodes, going from the output nodes back to the input nodes iteratively.