I've been trying to implement the algorithm described here, and then test it on the "large action task" described in the same paper.
Overview of the algorithm:
In brief, the algorithm uses an RBM of the form shown below to solve reinforcement learning problems by changing its weights such that the negative free energy of a network configuration matches the reward signal given for that state-action pair.
To select an action, the algorithm performs Gibbs sampling while holding the state variables fixed. Given enough sampling steps, this produces the action with the lowest free energy, and thus the highest expected reward, for the given state.
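In symbols (my reading of the paper, not a literal quote from it): with Q(s,a) = -F(s,a), the update after receiving reward r for the state-action pair (s,a) is

delta_w[ik] ~ epsilon * (r - Q(s,a)) * s[i] * <h[k]>

so the weights stop changing once the negative free energy of a visited state-action pair matches its reward.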
Overview of the large action task:
(As far as I can describe it: states are random bit vectors; each of a small set of key states has an associated random key action, and the reward for an action is the number of bits it shares with the key action of the key state nearest the current state, so the maximum reward equals the number of action bits.)
Overview of the author's guidelines for implementation:
A restricted Boltzmann machine with 13 hidden variables was trained on an instantiation of
the large action task with a 12-bit state space and a 40-bit action space. Thirteen key states were
randomly selected. The network was run for 12 000 actions with a learning rate going from 0.1
to 0.01 and temperature going from 1.0 to 0.1 exponentially over the course of training. Each
iteration was initialized with a random state. Each action selection consisted of 100 iterations of
Gibbs sampling.
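For concreteness, this is how I interpret the exponential schedules (my assumption; the paper does not give the exact form, and iter/numiters are my own variable names):

% exponential interpolation between the published endpoints
epsilonw = 0.1 * (0.01/0.1)^((iter-1)/(numiters-1));  % learning rate: 0.1 -> 0.01
temp     = 1.0 * (0.1/1.0)^((iter-1)/(numiters-1));   % temperature:   1.0 -> 0.1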
Important omitted details:
Were bias units needed?
Was weight decay needed? And if so, L1 or L2?
Was a sparsity constraint needed for the weights and/or activations?
Was there modification of the gradient descent? (e.g. momentum)
What meta-parameters were needed for these additional mechanisms?
My implementation:
I initially assumed the authors used no mechanisms other than those described in the guidelines, so I tried training the network without bias units. This led to near-chance performance and was my first clue that some of the mechanisms used must have been deemed 'obvious' by the authors and thus omitted.
I played around with the various omitted mechanisms mentioned above, and got my best results by using:
softmax hidden units
momentum of .9 (.5 until 5th iteration)
bias units for the hidden and visible layers
a learning rate 1/100th of that listed by the authors.
l2 weight decay of .0002
But even with all of these modifications, my performance on the task was generally only around an average reward of 28 after 12,000 iterations.
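For reference, the softmax and softmax_sample helpers used in my code below look like this (minimal sketches of my versions, treating all hidden units as a single softmax group):

function p = softmax(x)
% Row-wise softmax; subtract the row max for numerical stability.
ex = exp(bsxfun(@minus, x, max(x,[],2)));
p = bsxfun(@rdivide, ex, sum(ex,2));
end

function s = softmax_sample(p)
% Draw a one-hot sample per row from the categorical distribution p.
s = zeros(size(p));
for i = 1:size(p,1)
    k = find(rand < cumsum(p(i,:)), 1);
    s(i,k) = 1;
end
end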
Code for each iteration:
%%%%%%%%% START POSITIVE PHASE %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
data = [batchdata(:,:,(batch)) rand(1,numactiondims)>.5];
poshidprobs = softmax(data*vishid + hidbiases);
%%%%%%%%% END OF POSITIVE PHASE %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
hidstates = softmax_sample(poshidprobs);
%%%%%%%%% START ACTION SELECTION PHASE %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
if test
    [negaction, poshidprobs] = choose_factored_action(data(1:numdims),hidstates,vishid,hidbiases,visbiases,cdsteps,0);
else
    [negaction, poshidprobs] = choose_factored_action(data(1:numdims),hidstates,vishid,hidbiases,visbiases,cdsteps,temp);
end
data(numdims+1:end) = negaction > rand(numcases,numactiondims);
if mod(batch,100) == 1
    disp(poshidprobs);
    disp(min(~xor(repmat(correct_action(:,(batch)),1,size(key_actions,2)), key_actions(:,:))));
end
posprods = data' * poshidprobs;
poshidact = poshidprobs;
posvisact = data;
%%%%%%%%% END OF ACTION SELECTION PHASE %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
if batch > 5
    momentum = .9;
else
    momentum = .5;
end
%%%%%%%%% UPDATE WEIGHTS AND BIASES %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
F = calcF_softmax2(data,vishid,hidbiases,visbiases,temp);
Q = -F;
action = data(numdims+1:end);
reward = maxreward - sum(abs(correct_action(:,(batch))' - action));
if correct_action(:,(batch)) == correct_action(:,1)
    reward_dataA = [reward_dataA reward];
    Q_A = [Q_A Q];
else
    reward_dataB = [reward_dataB reward];
    Q_B = [Q_B Q];
end
reward_error = sum(reward - Q);
rewardsum = rewardsum + reward;
errsum = errsum + abs(reward_error);
error_data(ind) = reward_error;
reward_data(ind) = reward;
Q_data(ind) = Q;
vishidinc = momentum*vishidinc + ...
epsilonw*( (posprods*reward_error)/numcases - weightcost*vishid);
visbiasinc = momentum*visbiasinc + (epsilonvb/numcases)*((posvisact)*reward_error - weightcost*visbiases);
hidbiasinc = momentum*hidbiasinc + (epsilonhb/numcases)*((poshidact)*reward_error - weightcost*hidbiases);
vishid = vishid + vishidinc;
hidbiases = hidbiases + hidbiasinc;
visbiases = visbiases + visbiasinc;
%%%%%%%%%%%%%%%% END OF UPDATES %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
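For reference, calcF_softmax2 returns the free energy of the current (state, action) configuration. For plain sigmoid hidden units the analogous computation would be the standard RBM free energy (a minimal sketch, not my actual softmax version; how the temperature enters is my assumption):

function F = calcF_sigmoid(data, vishid, hidbiases, visbiases, temp)
% F(v) = -v*b_vis' - sum_k log(1 + exp((v*W + b_hid)_k)), with all
% energies divided by the temperature.
x = (data*vishid + hidbiases) / temp;
F = -(data*visbiases') / temp - sum(log(1 + exp(x)), 2);
end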
What I'm asking for:
So, if any of you can get this algorithm to work properly (the authors claim to average ~40 reward after 12000 iterations), I'd be extremely grateful.
If my code appears to be doing something obviously wrong, then calling attention to that would also constitute a great answer.
I'm hoping that what the authors left out is indeed obvious to someone with more experience with energy-based learning than myself, in which case, simply point out what needs to be included in a working implementation.
The algorithm in the paper looks weird. They use a kind of Hebbian learning that increases connection strengths, but there is no mechanism to decay them. In contrast, regular CD pushes the energy of incorrect fantasies up, balancing overall activity. I would speculate that you will need strong sparsity regularization and/or weight decay.
Biases never hurt :)
Momentum and other fancy stuff may speed things up, but it is usually not necessary.
Why softmax on the hiddens? Shouldn't it just be a sigmoid?
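As a concrete (untested) sketch in your own notation, an L1 penalty would replace the L2 term in your vishid update like this:

% L1 weight decay: penalize with sign(w) instead of w itself
vishidinc = momentum*vishidinc + ...
    epsilonw*( (posprods*reward_error)/numcases - weightcost*sign(vishid) );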
Related
I am having a problem that could be easily solved in a causal environment like Fortran, but has proved difficult in Modelica, given my limited knowledge.
Consider a volume with an inlet and outlet. The inlet mass flow rate is specified, while the outlet mass flow is calculated based on pressure in the volume. When pressure in the volume goes above a set point, the outlet area starts to increase linearly from its initial value to a max value and remains fixed afterwards. In other words:
A = min(const * (t - t*) + A_0, A_max) once p > p_set,
where t* is the time at which the pressure in the volume first exceeds the set pressure.
The question is: is there a function to capture t* during the simulation? Or how could the model be programmed to do it? I have tried a number of ways, but the models are never closed (balanced). Thoughts are welcome and appreciated!
Happy holidays/New Year!
Mohammad
You may find the sample and hold example in my book useful. It uses sampling based on time whereas you probably want it based on your pressure value. But the principle is the same. That will allow you to record the time at which your event occurred.
Addressing your specific case, the following (untested) code is probably pretty close to what you want:
...
discrete Modelica.SIunits.Time t_star(start=-1, fixed=true);
equation
  when p >= p_set then
    t_star = time;
  end when;
  // before the event the area keeps its initial value A_0
  A = if t_star < 0 then A_0 else min(const*(time - t_star) + A_0, A_max);
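A complete toy model wrapping that idea (my own sketch, with made-up pressure dynamics purely so that it simulates):

model OutletArea "Capture t_star and ramp the outlet area"
  parameter Real p_set = 2;
  parameter Real A_0 = 0.1;
  parameter Real A_max = 1;
  parameter Real const = 0.5;
  Real p "pressure in the volume (placeholder dynamics)";
  Real A "outlet area";
  discrete Modelica.SIunits.Time t_star(start=-1, fixed=true);
equation
  der(p) = 1 - A*p; // dummy dynamics, just for illustration
  when p >= p_set then
    t_star = time;
  end when;
  A = if t_star < 0 then A_0 else min(const*(time - t_star) + A_0, A_max);
initial equation
  p = 1;
end OutletArea;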
I have the model of a dynamic system in Simulink (I cannot change the programming framework). It can be described as an oscillator subject to periodic excitation. I am trying to control its motion, in particular to maximize it (for energy generation).
With latching control (a popular control strategy), the idea is to 'latch', i.e. lock in place, the device when its velocity is 0 for a predefined time, and then release it until its velocity reaches 0 again.
So, what I need to do in Simulink is to output a signal 1 once the velocity signal reaches (or is close to) 0, hold it constant for a time period (at 1), then release it (the signal becomes 0), and repeat the process once the velocity reaches 0 again.
I have found a good blog on holding signals constant in Simulink:
http://blogs.mathworks.com/simulink/2014/08/06/how-do-you-hold-the-value-of-a-signal/
However, in my case, I have two conditions determining the signal: the magnitude of the velocity and the time elapsed within the latching period. Now, the problem is that as soon as the period is finished and the device is released (signal = 0), the velocity is still very small, which could result in an incorrect signal of 1 if a simple if statement is used.
I think using an S-function may be the best solution, but then I will have to use a fixed time-step. Are there any Simulink-native solutions for this problem?
I ended up using a Matlab function as a temporary solution, and it is very effective. I have taken inspiration from https://uk.mathworks.com/matlabcentral/answers/11323-hold-true-value-for-finite-length-of-time
u is the velocity signal.
function y = fcn(u,nlatch)
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% This function is used to determine the latching signal.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Persistent memory (retained between calls); note that "sign" here
% shadows MATLAB's built-in sign() function inside this function.
persistent tick started sign;
% Initialize variables on the first call:
if isempty(tick)
    tick = 0;
    started = 0;
    sign = (u>0);
end
U = 0;      % default: no latching
s = (u>0);  % current sign of the velocity
if s ~= sign
    started = 1;  % zero crossing detected: start the latching period
end
if started
    if tick < nlatch
        tick = tick + 1;
        U = 1;    % hold the latch
    else
        tick = 0;     % latching period over: release and wait
        started = 0;  % for the next zero crossing
        sign = s;
    end
end
y = U;
end
However, as I mentioned, I have to use a fixed-step solver, which is no big deal for me, but it could create problems for other users.
If anyone has a more "Simulink-native" solution, please let me know!
Edit
I have now modified the function: the latching is now applied when there is a change of sign in the velocity signal, rather than looking for a small magnitude as before (abs(u)<0.005), which was too case-specific.
A colleague of mine has also found a Simulink-native solution (block diagram not shown here).
However, the MATLAB function is faster (less computationally intensive) when the same time step is employed. Perhaps the least computationally intensive solution would be a C S-function.
In Caffe we have a decay_ratio which is usually set to 0.0005. Then all trainable parameters, e.g., the W matrix in FC6, are decayed by:
W = W * (1 - 0.0005)
after the gradient is applied.
I have gone through many TensorFlow tutorial codes, but I do not see how people implement this weight decay to prevent numerical problems (very large absolute values).
In my experience, I often run into numerical problems after 100k iterations of training.
I have also gone through related questions on Stack Overflow, e.g.:
How to set weight cost strength in TensorFlow?
However, the solution seems a little different from what is implemented in Caffe.
Does anyone have similar concerns? Thank you.
The current answer is wrong in that it doesn't give you proper "weight decay as in cuda-convnet/caffe" but instead L2-regularization, which is different.
When using pure SGD (without momentum) as an optimizer, weight decay is the same thing as adding a L2-regularization term to the loss. When using any other optimizer, this is not true.
Weight decay (I don't know how to TeX here, so excuse my pseudo-notation):
w[t+1] = w[t] - learning_rate * dw - weight_decay * w
L2-regularization:
loss = actual_loss + lambda * 1/2 sum(||w||_2^2 for w in network_params)
Computing the gradient of the extra term in L2-regularization gives lambda * w, and inserting it into the SGD update equation
dloss_dw = dactual_loss_dw + lambda * w
w[t+1] = w[t] - learning_rate * dw
gives the same result as weight decay, but mixes lambda with the learning_rate. Any other optimizer, even SGD with momentum, gives a different update rule for weight decay than for L2-regularization! See the paper Fixing weight decay in Adam for more details. (Edit: AFAIK, this 1987 Hinton paper introduced "weight decay", literally as "each time the weights are updated, their magnitude is also decremented by 0.4%", on page 10.)
That being said, there doesn't seem to be support for "proper" weight decay in TensorFlow yet. There are a few issues discussing it, specifically because of the above paper.
One possible way to implement it is to write an op that does the decay step manually after every optimizer step. A different way, which is what I'm currently doing, is to use an additional SGD optimizer just for the weight decay and "attach" it to your train_op. Both of these are crude work-arounds, though. My current code:
# In the network definition:
with arg_scope([layers.conv2d, layers.dense],
               weights_regularizer=layers.l2_regularizer(weight_decay)):
    # define the network.

loss = # compute the actual loss of your problem.
train_op = optimizer.minimize(loss, global_step=global_step)
if args.weight_decay not in (None, 0):
    with tf.control_dependencies([train_op]):
        sgd = tf.train.GradientDescentOptimizer(learning_rate=1.0)
        train_op = sgd.minimize(tf.add_n(tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES)))
This somewhat makes use of TensorFlow's provided bookkeeping. Note that the arg_scope takes care of appending an L2-regularization term for every layer to the REGULARIZATION_LOSSES graph-key, which I then all sum up and optimize using SGD which, as shown above, corresponds to actual weight-decay.
Hope that helps, and if anyone gets a nicer code snippet for this, or TensorFlow implements it better (i.e. in the optimizers), please share.
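For reference, the "manual decay op" alternative mentioned above could look roughly like this (an untested sketch, TF1 graph mode, decaying all trainable variables):

with tf.control_dependencies([train_op]):
    # Caffe-style decay, w <- w * (1 - weight_decay), applied after the
    # gradient step; the assign ops are created inside the
    # control_dependencies block so they are guaranteed to run after train_op.
    decay_ops = [w.assign(w * (1.0 - weight_decay))
                 for w in tf.trainable_variables()]
    train_op = tf.group(*decay_ops)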
Edit: see also this PR which just got merged into TF.
This is a duplicate question:
How to define weight decay for individual layers in TensorFlow?
# Create your variables
weights = tf.get_variable('weights', collections=['variables'])

with tf.variable_scope('weights_norm') as scope:
    weights_norm = tf.reduce_sum(
        input_tensor=WEIGHT_DECAY_FACTOR * tf.pack(  # tf.pack was renamed tf.stack in TF 1.0
            [tf.nn.l2_loss(i) for i in tf.get_collection('weights')]
        ),
        name='weights_norm'
    )

# Add the weight decay loss to another collection called losses
tf.add_to_collection('losses', weights_norm)

# Add the other loss components to the collection losses
# ...

# To calculate your total loss
tf.add_n(tf.get_collection('losses'), name='total_loss')
You can set whatever lambda value you want for the weight decay. The above just adds the L2 norm of your weights to the loss, scaled by WEIGHT_DECAY_FACTOR.
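Then you minimize the total loss as usual, e.g. (a short sketch):

total_loss = tf.add_n(tf.get_collection('losses'), name='total_loss')
train_op = tf.train.GradientDescentOptimizer(0.01).minimize(total_loss)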
I have started studying machine learning and am stuck on one issue.
My implementations of this method (in both MATLAB and C++) converge only after 1,500,000 iterations, and I cannot understand why. I found an implementation of the method in Python, and that algorithm converged in 2,000 iterations. By 'converged' I mean that it gave almost the same answer as a method that is obviously correct.
The data are not preprocessed in any way. Can you please explain: is this a normal number of iterations, or have I made a mistake in the algorithm?
The cost function used and its partial derivatives:

J(t0,t1) = (1/(2N)) * sum_j (Y(j) - (t1*X(j) + t0))^2
dJ/dt0 = -(1/N) * sum_j (Y(j) - (t1*X(j) + t0))
dJ/dt1 = -(1/N) * sum_j X(j)*(Y(j) - (t1*X(j) + t0))
MATLAB code
%y=t0+t1*x
learningRate = 0.0001;
curT0 = 0;
curT1 = 0;
i = 1;
while (i < 1500000)
    derT0 = 0;
    derT1 = 0;
    for j=1:1:N
        derT0 = derT0 + (-1/N)*(Y(j) - (curT1*X(j) + curT0));
        derT1 = derT1 + (-1/N)*X(j)*(Y(j) - (curT1*X(j) + curT0));
    end
    curT0 = curT0 - (learningRate*derT0);
    curT1 = curT1 - (learningRate*derT1);
    %sprintf('Iteration %d, t0=%f, t1=%f',i,curT0,curT1)
    i = i+1;
end
P.S. I tried to increase the learningRate variable, but in that case the algorithm diverges and blows up to huge numbers.
The gradient descent depends on three things:
the initial solution [curT0,curT1] (the starting point from which to begin the search)
the learning rate (how big a step to take in the direction of the gradient)
the number of iterations
If you start far off and take small steps, it will take many iterations to reach the solution. If the steps are too big, you can miss the solution by stepping over it. (For linear regression the squared-error cost is convex, so there is a single global minimum; for other cost functions the search can also get stuck in local minima, depending on where in the search space you start.)
You can also specify other stopping criteria, such as a tolerance (a threshold which, when crossed, stops the iteration). Your current code always loops the maximum number of iterations (1,500,000).
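For example, a sketch in your notation that stops as soon as the updates become negligible (also vectorized over the training points, which is much faster in MATLAB; X and Y are assumed to be column vectors):

learningRate = 0.0001;
tol = 1e-9; % stopping tolerance (problem-dependent)
curT0 = 0; curT1 = 0;
for i = 1:1500000
    derT0 = -(1/N)*sum(Y - (curT1*X + curT0)); % same gradient, vectorized
    derT1 = -(1/N)*sum(X.*(Y - (curT1*X + curT0)));
    curT0 = curT0 - learningRate*derT0;
    curT1 = curT1 - learningRate*derT1;
    if max(abs(learningRate*derT0), abs(learningRate*derT1)) < tol
        break; % updates have become negligible: converged
    end
end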
I was reading a post online where the author mentioned that using if statements and abs() functions can have negative repercussions in MATLAB's variable-step ODE solvers (like ode45). According to the OP, they can significantly affect the time step (requiring too small a step) and give poor results when the differential equations are finally integrated. I was wondering whether this is true, and if so, why. Also, how can this problem be mitigated without resorting to fixed-step solvers? I've given example code below to illustrate what I mean:
function [Z,Y] = sauters(We,Re,rhos,nu_G,Uinj,Dinj,theta,ts,SMDs0,Uzs0,...
Uts0,Vzs0,zspan,K)
Y0 = [SMDs0;Uzs0;Uts0;Vzs0]; %Initial Conditions
options = odeset('RelTol',1e-7,'AbsTol',1e-7); %Tolerance Levels
[Z,Y] = ode45(@func,zspan,Y0,options);
function DY = func(z,y)
DY = zeros(4,1);
%Calculate Local Droplet Reynolds Numbers
Rez = y(1)*abs(y(2)-y(4))*Dinj*Uinj/nu_G;
Ret = y(1)*abs(y(3))*Dinj*Uinj/nu_G;
%Calculate Droplet Drag Coefficient
Cdz = dragcof(Rez);
Cdt = dragcof(Ret);
%Calculate Total Relative Velocity and Droplet Reynolds Number
Utot = sqrt((y(2)-y(4))^2 + y(3)^2);
Red = y(1)*abs(Utot)*Dinj*Uinj/nu_G;
%Calculate Derivatives
%Axial Droplet Velocity
DY(2) = -(3/4)*rhos*(Cdz/y(1))*Utot*(1 - y(4)/y(2));
%Tangential Droplet Velocity
DY(3) = -(3/4)*rhos*(Cdt/y(1))*Utot*(y(3)/y(2));
%Axial Gas Velocity
DY(4) = (3/8)*((ts - ts^2)/(z^2))*(cos(theta)/(tan(theta)^2))*...
    (Cdz/y(1))*(Utot/y(4))*(1 - y(4)/y(2)) - y(4)/z;
%SMD (computed after DY(2) and DY(3), which it uses)
if(Red > 1)
    DY(1) = -(We/8)*rhos*y(1)*(Utot*Utot/y(2))*(Cdz*(y(2)-y(4)) + ...
        Cdt*y(3)) + (We/6)*y(1)*y(1)*(y(2)*DY(2) + y(3)*DY(3)) + ...
        (We/Re)*K*(Red^0.5)*Utot*Utot/y(2);
else %Red <= 1
    DY(1) = -(We/8)*rhos*y(1)*(Utot*Utot/y(2))*(Cdz*(y(2)-y(4)) + ...
        Cdt*y(3)) + (We/6)*y(1)*y(1)*(y(2)*DY(2) + y(3)*DY(3)) + ...
        (We/Re)*K*(Red)*Utot*Utot/y(2);
end
end
end
Where the function "dragcof" is given by the following:
function Cd = dragcof(Re)
if(Re <= 0.01)
Cd = (0.1875) + (24.0/Re);
elseif(Re > 0.01 && Re <= 260.0)
Cd = (24.0/Re)*(1.0 + 0.1315*Re^(0.32 - 0.05*log10(Re)));
else
Cd = (24.0/Re)*(1.0 + 0.1935*Re^0.6305);
end
end
This is because derivatives computed using if-statements, modulus operations (abs()), or things like unit step functions, Dirac deltas, etc., introduce discontinuities in the value of the solution or its derivative(s), resulting in kinks, jumps, inflection points, etc.
This implies that the solution of the ODE completely changes behavior at the relevant points. What variable-step integrators will do is:
detect this
recognize that they won't be able to use information directly beyond the "problem point"
decrease the step size, and repeat from the top, until the problem point is passed within the accuracy demands
Therefore, there will be many failed steps and step-size reductions near the problem points, negatively affecting the overall integration time.
Variable-step integrators will nevertheless continue to produce accurate results. Fixed-step integrators are not a good remedy for this sort of problem, since they cannot detect such problems in the first place (there is no error estimation).
What you could do is simply split the problem up into multiple parts. If you know beforehand at what points in time the changes occur, you just start a new integration for each interval, each time using the output of the previous integration as the initial values for the next one.
If you don't know beforehand where the problems will be, you can use a very nice feature of Matlab's ODE solvers called event functions (see the documentation). You let the solver detect the event (a change of sign in the derivative, a change of condition in the if-statement, or whatever), terminate the integration when the event is detected, and then start a new integration from the last time, with the final values of the previous integration as initial conditions, as before, until the final time is reached.
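A rough (untested) sketch of this for the code above, using a terminal event at Red = 1, the switching condition in func (the event function is assumed to be nested inside sauters so it can see Dinj, Uinj and nu_G):

options = odeset('RelTol',1e-7,'AbsTol',1e-7,'Events',@events);
z0 = zspan(1);  zf = zspan(end);  y0 = Y0;
Z = [];  Y = [];
while z0 < zf
    [z,y] = ode45(@func,[z0 zf],y0,options);
    Z = [Z; z];  Y = [Y; y];
    z0 = z(end);  y0 = y(end,:);   % restart from the event point
    % (in practice you may need to nudge z0 slightly forward so the
    % just-detected zero crossing does not retrigger immediately)
end

function [value,isterminal,direction] = events(z,y)
    Utot = sqrt((y(2)-y(4))^2 + y(3)^2);
    Red = y(1)*abs(Utot)*Dinj*Uinj/nu_G;
    value = Red - 1;    % crosses zero exactly where the if-condition switches
    isterminal = 1;     % stop the integration at the event
    direction = 0;      % detect crossings in both directions
end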
There will still be a slight penalty in overall execution time this way, since Matlab will try to detect the location of the event accurately. However, it is still much better than running the integration blindly when it comes to both execution time and accuracy of the results.
Yes, it is true, and it happens because your solution is not smooth enough at some points.
Assume you want to integrate y'(t) = f(t,y). Then what happens is that f gets integrated to become y. Thus, if your definition of f contains:
an abs(): then f has a kink, and y is still smooth and once differentiable;
an if: then f has a jump, y has a kink, and there is no more differentiability.
Matlab's ode45 presumes that your solution is 5 times differentiable and tries to ensure an accuracy of order 4. Non-smooth points of your function are misinterpreted as stiffness, which leads to small step sizes and even to breakdowns.
What you can do: because of the lack of smoothness, you cannot expect high accuracy anyway. Thus, ode23 might be a better choice. In the worst case, you have to stick to first-order schemes.