How to implement weight decay in tensorflow as in Caffe - neural-network

In Caffe we have a decay_ratio which is usually set to 0.0005. Then all trainable parameters, e.g., the W matrix in FC6, are decayed by:
W = W * (1 - 0.0005)
after we apply the gradient to it.
I have gone through many TensorFlow tutorial codes, but do not see how people implement this weight decay to prevent numerical problems (very large absolute values).
In my experience, I often run into numerical problems after 100k iterations during training.
I have also gone through related questions on Stack Overflow, e.g.,
How to set weight cost strength in TensorFlow?
However, the solution there seems a little different from how it is implemented in Caffe.
Does anyone have similar concerns? Thank you.

The current answer is wrong in that it doesn't give you proper "weight decay as in cuda-convnet/caffe" but instead L2-regularization, which is different.
When using pure SGD (without momentum) as an optimizer, weight decay is the same thing as adding a L2-regularization term to the loss. When using any other optimizer, this is not true.
Weight decay (don't know how to TeX here, so excuse my pseudo-notation):
w[t+1] = w[t] - learning_rate * dw - weight_decay * w
L2-regularization:
loss = actual_loss + lambda * 1/2 sum(||w||_2^2 for w in network_params)
Computing the gradient of the extra term in L2-regularization gives lambda * w and thus inserting it into the SGD update equation
dloss_dw = dactual_loss_dw + lambda * w
w[t+1] = w[t] - learning_rate * dw
gives the same as weight decay, but mixes lambda with the learning_rate. Any other optimizer, even SGD with momentum, gives a different update rule for weight decay than for L2-regularization! See the paper Fixing weight decay in Adam for more details. (Edit: AFAIK, this 1987 Hinton paper introduced "weight decay", literally as "each time the weights are updated, their magnitude is also decremented by 0.4%" on page 10.)
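To make the difference concrete, here is a toy one-parameter sketch (my own illustration, not framework code) of SGD-with-momentum under the two schemes. The lambda for decoupled decay is rescaled by the learning rate so that plain SGD (mu = 0) would make both schemes identical; with momentum they still diverge, because L2 feeds lambda * w through the momentum buffer:

    lr, mu, lam = 0.1, 0.9, 0.01
    grad = lambda w: 2.0 * w          # stand-in for dactual_loss_dw

    w_l2, v_l2 = 1.0, 0.0             # L2: lam * w folded into the gradient
    w_wd, v_wd = 1.0, 0.0             # decoupled decay: w shrunk directly
    for t in range(5):
        v_l2 = mu * v_l2 + grad(w_l2) + lam * w_l2
        w_l2 = w_l2 - lr * v_l2
        v_wd = mu * v_wd + grad(w_wd)
        w_wd = w_wd - lr * v_wd - lr * lam * w_wd
        print(t, w_l2, w_wd)          # identical at t=0, diverging afterwards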
That being said, there doesn't seem to be support for "proper" weight decay in TensorFlow yet. There are a few issues discussing it, specifically because of the above paper.
One possible way to implement it is by writing an op that does the decay step manually after every optimizer step. A different way, which is what I'm currently doing, is to use an additional SGD optimizer just for the weight decay and "attach" it to your train_op. Both of these are just crude workarounds, though. My current code:
# In the network definition:
with arg_scope([layers.conv2d, layers.dense],
               weights_regularizer=layers.l2_regularizer(weight_decay)):
    # define the network.

loss = # compute the actual loss of your problem.
train_op = optimizer.minimize(loss, global_step=global_step)
if args.weight_decay not in (None, 0):
    with tf.control_dependencies([train_op]):
        sgd = tf.train.GradientDescentOptimizer(learning_rate=1.0)
        train_op = sgd.minimize(tf.add_n(tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES)))
This somewhat makes use of TensorFlow's provided bookkeeping. Note that the arg_scope takes care of appending an L2-regularization term for every layer to the REGULARIZATION_LOSSES graph-key, which I then all sum up and optimize using SGD which, as shown above, corresponds to actual weight-decay.
Hope that helps, and if anyone gets a nicer code snippet for this, or TensorFlow implements it better (i.e. in the optimizers), please share.
Edit: see also this PR which just got merged into TF.

This is a duplicate question:
How to define weight decay for individual layers in TensorFlow?
# Create your variables and register them in a custom 'weights' collection
# (a shape is required by tf.get_variable; [10, 10] is a placeholder)
weights = tf.get_variable('weights', shape=[10, 10],
                          collections=['weights', 'variables'])

with tf.variable_scope('weights_norm') as scope:
    weights_norm = tf.reduce_sum(
        input_tensor=WEIGHT_DECAY_FACTOR * tf.stack(  # tf.pack was renamed tf.stack in TF 1.0
            [tf.nn.l2_loss(i) for i in tf.get_collection('weights')]
        ),
        name='weights_norm'
    )

# Add the weight decay loss to another collection called losses
tf.add_to_collection('losses', weights_norm)

# Add the other loss components to the collection losses
# ...

# To calculate your total loss
tf.add_n(tf.get_collection('losses'), name='total_loss')
You can set whatever lambda value you want for the weight decay; the above just adds the L2 norm of your weights, scaled by that factor, to the loss.

Related

Fitting data with Voigt profile in lmfit in python - huge errors

I am trying to fit some RIXS data with Voigt profiles (lmfit in Python), and I have defined the Voigt profile in the following way:
import numpy as np

def gfunction_norm(x, pos, gwid):
    gauss = (1 / (gwid * np.sqrt(2 * np.pi))) * np.exp(-0.5 * ((x - pos) / gwid) ** 2)
    return (gauss - gauss.min()) / (gauss.max() - gauss.min())

def lfunction_norm(x, pos, lwid):
    lorentz = (0.15915 * lwid) / ((x - pos) ** 2 + 0.25 * lwid ** 2)
    return (lorentz - lorentz.min()) / (lorentz.max() - lorentz.min())

def voigt(x, pos, gwid, lwid, int):  # note: 'int' shadows the Python builtin
    step = 0.005
    x2 = np.arange(pos - 7, pos + 7 + step, step)
    voigt3 = np.convolve(gfunction_norm(x2, pos, gwid),
                         lfunction_norm(x2, pos, lwid), mode='same')
    norm = (voigt3 - voigt3.min()) / (voigt3.max() - voigt3.min())
    y = np.interp(x, x2, norm)  # interpolate onto the x passed in (was the global 'energy')
    return y * int
I have used this definition instead of the popular Voigt profile definition in Python:
from scipy.special import wofz

def voigt(x, alpha, cen, gamma):
    sigma = alpha / np.sqrt(2 * np.log(2))
    return np.real(wofz((x - cen + 1j * gamma) / sigma / np.sqrt(2))) / (sigma * 2.51)
because it gives me more clarity on the intensity of the peaks etc.
Now, I have a couple of spectra with 9-10 peaks and I am trying to fit all of them with Voigt profiles (precisely in the way I defined it).
Now, I have a couple of questions:
Do you think my Voigt definition is OK? What (dis)advantages do I have by using the convolution instead of the approximative second definition?
As a result of my fit, sometimes I get crazy large standard deviations. For example, these are best-fit parameters for one of the peaks:
int8: 0.00986265 +/- 0.00113104 (11.47%) (init = 0.05)
pos8: -2.57960013 +/- 0.00790640 (0.31%) (init = -2.6)
gwid8: 0.06613237 +/- 0.02558441 (38.69%) (init = 0.1)
lwid8: 1.0909e-04 +/- 1.48706395 (1363160.91%) (init = 0.001)
(intensity, position, Gaussian and Lorentzian width, respectively).
Does this output mean that this peak should be purely Gaussian?
I have noticed that large errors usually happen when the best-fit parameter is very small. Does this have something to do with the Levenberg-Marquardt algorithm that is used by default in lmfit? I should note that I sometimes have the same problem even when I use the other definition of the Voigt profile (and not just for Lorentzian widths).
Here is a part of the code (it is a part of a bigger code and it is in a for loop, meaning I use same initial values for multiple similar spectra):
model = Model(final)
result = model.fit(spectra[:,nb_spectra], params, x=energy)
print(result.fit_report())
"final" is the sum of many voigt profiles that I previously defined.
Thank you!
This seems a duplicate of, or follow-up to, Lmfit fit produces huge uncertainties - please use one SO question per topic.
Do you think my Voigt definition is OK? What (dis)advantages do I have by using the convolution instead of the approximative second definition?
What makes you say that the second definition is approximate? In some sense, all floating-point calculations are approximate, but the Faddeeva function from scipy.special.wofz is the analytic solution for the Voigt profile. Doing the convolution yourself is likely to be a bit slower and is also an approximation (at the machine-precision level).
Since you are using Lmfit, I would recommend using its VoigtModel which will make your life easier: it uses scipy.special.wofz and parameter names that make it easy to switch to other profiles (say, GaussianModel).
You did not give a very complete example of code (for reference, a minimal, working version of actual code is more or less expected on SO and highly recommended), but that might look something like
from lmfit.models import VoigtModel
model = VoigtModel(prefix='p1_') + VoigtModel(prefix='p2_') + ...
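Fleshed out into a runnable sketch (the data and all initial values below are placeholders of my own invention, standing in for your energy axis and spectrum), that might be:

    import numpy as np
    from lmfit.models import VoigtModel

    # Placeholder data: two synthetic peaks on a fake energy axis
    x = np.linspace(-4.0, 0.0, 500)
    y = np.exp(-0.5 * ((x + 2.6) / 0.1) ** 2) + np.exp(-0.5 * ((x + 1.8) / 0.1) ** 2)

    model = VoigtModel(prefix='p1_') + VoigtModel(prefix='p2_')
    params = model.make_params(p1_center=-2.6, p1_amplitude=0.05, p1_sigma=0.1,
                               p2_center=-1.8, p2_amplitude=0.05, p2_sigma=0.1)
    # By default gamma is constrained to equal sigma; free it for a full
    # 4-parameter Voigt per peak:
    params['p1_gamma'].set(value=0.01, vary=True, expr='')
    params['p2_gamma'].set(value=0.01, vary=True, expr='')

    result = model.fit(y, params, x=x)
    print(result.fit_report())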
As a result of my fit, sometimes I get crazy large standard deviations. For example, these are best-fit parameters for one of the peaks:
int8: 0.00986265 +/- 0.00113104 (11.47%) (init = 0.05)
pos8: -2.57960013 +/- 0.00790640 (0.31%) (init = -2.6)
gwid8: 0.06613237 +/- 0.02558441 (38.69%) (init = 0.1)
lwid8: 1.0909e-04 +/- 1.48706395 (1363160.91%) (init = 0.001)
(intensity, position, gaussian and lorentzian width respectively).
Does this output mean that this peak should be purely gaussian?
First, that may not be a "crazy large" standard deviation - it sort of depends on the data and rest of the fit. Perhaps the value for int8 is really, really small, and heavily overlapped with other peaks -- it might be highly correlated with other variables. But, it may very well mean that the peak is more Gaussian-like.
Since you are analyzing X-ray scattering data, the use of a Voigt function is probably partially justified with the idea (assertion, assumption, expectation?) that the material response would give a Gaussian profile, while the instrumentation (including X-ray source) would give a Lorentzian broadening. That suggests that the Lorentzian width might be the same for the various peaks, or perhaps parameterized as a simple function of the incident and scattering wavelengths or q values. That is, you might be able to (and may be better off) constrain the values of the Lorentzian width (your lwid, I think, or gamma in lmfit.models.VoigtModel) to all be the same.
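With lmfit, that kind of constraint is one line per parameter. For example, continuing the prefixed-VoigtModel sketch above, you could tie the second peak's Lorentzian width to the first peak's:

    # Constrain p2's Lorentzian width to always equal p1's during the fit:
    params['p2_gamma'].set(expr='p1_gamma')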

Why does huggingface bert pooler hack make mixed precision training stable?

The Huggingface BERT implementation has a hack to remove the pooler from the optimizer.
https://github.com/huggingface/transformers/blob/b832d5bb8a6dfc5965015b828e577677eace601e/examples/run_squad.py#L927
# hack to remove pooler, which is not used
# thus it produce None grad that break apex
param_optimizer = [n for n in param_optimizer if 'pooler' not in n[0]]
We are trying to run pretraining on huggingface BERT models. The code always diverges later during training if this pooler hack is not applied. I also see the pooler layer being used during classification.
pooled_output = outputs[1]
pooled_output = self.dropout(pooled_output)
logits = self.classifier(pooled_output)
The pooler layer is an FFN with a tanh activation:
class BertPooler(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        self.activation = nn.Tanh()

    def forward(self, hidden_states):
        # We "pool" the model by simply taking the hidden state corresponding
        # to the first token.
        first_token_tensor = hidden_states[:, 0]
        pooled_output = self.dense(first_token_tensor)
        pooled_output = self.activation(pooled_output)
        return pooled_output
My question is: why does this pooler hack solve the numeric instability?
Problem seen with pooler
There are quite a few resources out there that probably tackle this issue better than I can, see for example here, or here.
Specifically, the problem is that you are dealing with vanishing (or exploding) gradients, particularly when using activation functions that flatten out in either direction for very small/large inputs, which is the case for both sigmoid and tanh (the only difference here is the range in which their output lies: [0, 1] and [-1, 1], respectively).
Additionally, if you have a low-precision number format, as is the case with APEX, then the gradient-vanishing behavior is much more likely to appear already for relatively moderate outputs, as the precision limits the numbers it is able to differentiate from zero. One way to deal with this is to use functions that have strictly non-zero and easily computable derivatives, such as Leaky ReLU, or to simply avoid the activation function altogether (which I'm assuming is what huggingface is doing here).
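A quick numpy illustration of that precision effect (my own toy example, not the huggingface code): the tanh derivative 1 - tanh(x)^2 underflows to exactly zero in float16 for quite moderate inputs that float32 still resolves:

    import numpy as np

    for dtype in (np.float32, np.float16):
        x = np.array([2.0, 5.0, 10.0], dtype=dtype)
        grad = 1.0 - np.tanh(x) ** 2  # derivative of tanh
        print(dtype.__name__, grad)
    # float32 gives small but non-zero gradients everywhere;
    # float16 already returns exactly 0.0 at x = 5, so no learning signal flows.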
Note that the problem of exploding gradients is usually not as tragic, since we can apply gradient clipping (limiting the gradient to a fixed maximum size), but nonetheless the principle is the same. For zeroed gradients, on the other hand, there is no such easy fix, since they cause your neurons to "die" (no active learning happens with zero backflow), which is why I'm assuming you see the diverging behavior.

Feed Forward - Neural Networks Keras

For my input to the feed-forward neural network that I have implemented in Keras, I just wanted to check that my understanding is correct.
[[ 25.26000023 26.37000084 24.67000008 23.30999947]
[ 26.37000084 24.67000008 23.30999947 21.36000061]
[ 24.67000008 23.30999947 21.36000061 19.77000046]...]
So the data above is a time window of 4 inputs in an array. My input layer is
model.add(Dense(4, input_dim=4, activation='sigmoid'))
model.fit(trainX, trainY, nb_epoch=10000,verbose=2,batch_size=4)
and batch_size is 4. In theory, when I call the fit function, will the function go over all of these inputs in each nb_epoch? And does the batch_size need to be 4 in order for this time window to work?
Thanks, John
and batch_size is 4, in theory when I call the fit function will the function go over all these inputs in each nb_epoch?
Yes, each epoch is an iteration over all training samples.
and does the batch_size need to be 4 in order for this time window to work?
No, these are completely unrelated things. A batch is simply a subset of your training data which is used to compute an approximation of the true gradient of the cost function. The bigger the batch, the closer you get to the true gradient (and to original Gradient Descent), but training gets slower. The closer you get to 1, the more stochastic and noisy the approximation becomes (and the closer to Stochastic Gradient Descent). The fact that you matched batch_size and data dimensionality is just a coincidence, and has no meaning.
Let me put this in a more general setting. What you do in gradient descent with an additive loss function (which neural nets usually use) is going against the gradient, which is
grad_theta 1/N SUM_i=1^N loss(x_i, pred(x_i), y_i|theta) =
= 1/N SUM_i=1^N grad_theta loss(x_i, pred(x_i), y_i|theta)
where loss is some loss function of your pred (prediction) as compared to y_i.
In the batch-based scenario (the rough idea), you do not need to go over all examples, but instead over some strict subset, like batch = {(x_1, y_1), (x_5, y_5), (x_89, y_89) ... }, and use an approximation of the gradient of the form
1/|batch| SUM_(x_i, y_i) in batch: grad_theta loss(x_i, pred(x_i), y_i|theta)
As you can see, this is not related in any sense to the space where the x_i live, so there is no connection to the dimensionality of your data.
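A small numpy sketch of this (toy least-squares loss, arbitrary numbers): the batch gradient is just a noisy estimate of the full gradient, whatever the feature dimension:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(32, 4))            # 32 samples, 4 features
    y = rng.normal(size=32)
    w = np.zeros(4)

    grad_full = X.T @ (X @ w - y) / len(X)  # gradient over all N samples
    idx = rng.choice(len(X), size=4, replace=False)
    grad_batch = X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)

    print(grad_full)   # true gradient
    print(grad_batch)  # noisy approximation from a batch of 4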
Let me explain this with an example:
When you have 32 training examples and you call model.fit with a batch_size of 4, the neural network will be presented with 4 examples at a time, but one epoch will still be defined as one complete pass over all 32 examples. So in this case the network will go through 4 examples at a time, and will, theoretically at least, call the forward pass (and the backward pass) 32 / 4 = 8 times.
In the extreme case when your batch_size is 1, that is plain old stochastic gradient descent. When your batch_size is greater than 1 but smaller than the whole training set, it is called mini-batch gradient descent (using the whole set at once is plain batch gradient descent).
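In code, the bookkeeping is simply:

    import math

    num_samples, batch_size = 32, 4
    updates_per_epoch = math.ceil(num_samples / batch_size)
    print(updates_per_epoch)  # 8 forward/backward passes per epoch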

Free Energy Reinforcement Learning Implementation

I've been trying to implement the algorithm described here, and then test it on the "large action task" described in the same paper.
Overview of the algorithm:
In brief, the algorithm uses an RBM of the form shown below to solve reinforcement learning problems by changing its weights such that the free energy of a network configuration equates to the reward signal given for that state-action pair.
To select an action, the algorithm performs Gibbs sampling while holding the state variables fixed. With enough time, this produces the action with the lowest free energy, and thus the highest reward for the given state.
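For reference, a minimal sketch (my own notation with plain binary hidden units, not the paper's factored/softmax version) of the free-energy/Q relationship:

    import numpy as np

    def free_energy(v, W, hidbias, visbias):
        # F(v) = -visbias.v - sum_j log(1 + exp(hidbias_j + (v W)_j))
        # for a fixed visible vector v = [state, action]
        return -(v @ visbias) - np.sum(np.logaddexp(0.0, v @ W + hidbias))

    # The weights are trained so that Q(s, a) = -F([s, a]) approximates the reward.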
Overview of the large action task:
Overview of the author's guidelines for implementation:
A restricted Boltzmann machine with 13 hidden variables was trained on an instantiation of the large action task with a 12-bit state space and a 40-bit action space. Thirteen key states were randomly selected. The network was run for 12 000 actions with a learning rate going from 0.1 to 0.01 and temperature going from 1.0 to 0.1 exponentially over the course of training. Each iteration was initialized with a random state. Each action selection consisted of 100 iterations of Gibbs sampling.
Important omitted details:
Were bias units needed?
Was weight decay needed? And if so, L1 or L2?
Was a sparsity constraint needed for the weights and/or activations?
Was there modification of the gradient descent? (e.g. momentum)
What meta-parameters were needed for these additional mechanisms?
My implementation:
I initially assumed the authors used no mechanisms other than those described in the guidelines, so I tried training the network without bias units. This led to near-chance performance, and was my first clue that some of the mechanisms used must have been deemed 'obvious' by the authors and thus omitted.
I played around with the various omitted mechanisms mentioned above, and got my best results by using:
softmax hidden units
momentum of .9 (.5 until 5th iteration)
bias units for the hidden and visible layers
a learning rate 1/100th of that listed by the authors.
l2 weight decay of .0002
But even with all of these modifications, my performance on the task was generally around an average reward of 28 after 12000 iterations.
Code for each iteration:
%%%%%%%%% START POSITIVE PHASE %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
data = [batchdata(:,:,(batch)) rand(1,numactiondims)>.5];
poshidprobs = softmax(data*vishid + hidbiases);
%%%%%%%%% END OF POSITIVE PHASE %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
hidstates = softmax_sample(poshidprobs);

%%%%%%%%% START ACTION SELECTION PHASE %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
if test
    [negaction poshidprobs] = choose_factored_action(data(1:numdims),hidstates,vishid,hidbiases,visbiases,cdsteps,0);
else
    [negaction poshidprobs] = choose_factored_action(data(1:numdims),hidstates,vishid,hidbiases,visbiases,cdsteps,temp);
end
data(numdims+1:end) = negaction > rand(numcases,numactiondims);

if mod(batch,100) == 1
    disp(poshidprobs);
    disp(min(~xor(repmat(correct_action(:,(batch)),1,size(key_actions,2)), key_actions(:,:))));
end

posprods = data' * poshidprobs;
poshidact = poshidprobs;
posvisact = data;
%%%%%%%%% END OF ACTION SELECTION PHASE %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

if batch > 5
    momentum = .9;
else
    momentum = .5;
end

%%%%%%%%% UPDATE WEIGHTS AND BIASES %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
F = calcF_softmax2(data,vishid,hidbiases,visbiases,temp);
Q = -F;
action = data(numdims+1:end);
reward = maxreward - sum(abs(correct_action(:,(batch))' - action));

if correct_action(:,(batch)) == correct_action(:,1)
    reward_dataA = [reward_dataA reward];
    Q_A = [Q_A Q];
else
    reward_dataB = [reward_dataB reward];
    Q_B = [Q_B Q];
end

reward_error = sum(reward - Q);
rewardsum = rewardsum + reward;
errsum = errsum + abs(reward_error);
error_data(ind) = reward_error;
reward_data(ind) = reward;
Q_data(ind) = Q;

vishidinc = momentum*vishidinc + ...
    epsilonw*( (posprods*reward_error)/numcases - weightcost*vishid);
visbiasinc = momentum*visbiasinc + (epsilonvb/numcases)*((posvisact)*reward_error - weightcost*visbiases);
hidbiasinc = momentum*hidbiasinc + (epsilonhb/numcases)*((poshidact)*reward_error - weightcost*hidbiases);

vishid = vishid + vishidinc;
hidbiases = hidbiases + hidbiasinc;
visbiases = visbiases + visbiasinc;
%%%%%%%%%%%%%%%% END OF UPDATES %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
What I'm asking for:
So, if any of you can get this algorithm to work properly (the authors claim an average reward of ~40 after 12,000 iterations), I'd be extremely grateful.
If my code appears to be doing something obviously wrong, then calling attention to that would also constitute a great answer.
I'm hoping that what the authors left out is indeed obvious to someone with more experience with energy-based learning than myself; in that case, simply point out what needs to be included in a working implementation.
The algorithm in the paper looks weird. They use a kind of Hebbian learning that increases connection strength, but no mechanism to decay it. In contrast, regular CD pushes the energy of incorrect fantasies up, balancing overall activity. I would speculate that you will need strong sparsity regularization and/or weight decay.
Bias never would hurt :)
Momentum and other fancy stuff may speed things up, but is usually not necessary.
Why softmax on the hidden units? Shouldn't it just be sigmoid?

MLP Neural network not training correctly, probably converging to a local minimum

I'm making an MLP neural network with back-propagation in Matlab. The problem is that it seems unable to handle the curves in a function well, and it also doesn't scale well with the values. It can, for example, reach 80% of cos(x), but if I use 100*cos(x) it will just not train at all.
What is even weirder is that it can train well to some functions, while for others it just doesn't work at all.
For example:
Well trained: http://img515.imageshack.us/img515/2148/coscox3.jpg
Not so well: http://img252.imageshack.us/img252/5370/cos2d.jpg (smoothness from being left a long time)
Wrong results, stuck like this: http://img717.imageshack.us/img717/2145/ex2ug.jpg
This is the algo I'm trying to implement:
http://img594.imageshack.us/img594/9590/13012012001.jpg
http://img27.imageshack.us/img27/954/13012012002.jpg
And this is my implementation:
close all; clc;

j = [4,3,1]; % number of neurons in hidden layers and output layer
i = [1,j(1),j(2)];
X = 0:0.1:pi;
d = cos(X);

%-----------Weights------------%
%-----First layer weights------%
W1p = rand([i(1)+1,j(1)]);
W1p = W1p/sum(W1p(:));
W1 = rand([i(1)+1,j(1)]);
W1 = W1/sum(W1(:));
%-----Second layer weights-----%
W2p = rand([i(2)+1,j(2)]);
W2p = W2p/sum(W2p(:));
W2 = rand([i(2)+1,j(2)]);
W2 = W2/sum(W2(:));
%-----Third layer weights------%
W3p = rand([i(3)+1,j(3)]);
W3p = W3p/sum(W3p(:));
W3 = rand([i(3)+1,j(3)]);
W3 = W3/sum(W3(:));
%-----------/Weights-----------%

V1 = zeros(1,j(1));
V2 = zeros(1,j(2));
V3 = zeros(1,j(3));
Y1a = zeros(1,j(1));
Y1 = [0 Y1a];
Y2a = zeros(1,j(2));
Y2 = [0 Y2a];
O = zeros(1,j(3));
e = zeros(1,j(3));

%----Learning and forgetting factor-----%
alpha = 0.1;
etha = 0.1;
sortie = zeros(1,length(X));

while(1)
    n = randi(length(X),1);
    %---------------Feed forward---------------%
    %-----First layer-----%
    X0 = [-1 X(:,n)];
    V1 = X0*W1;
    Y1a = tanh(V1/2);
    %----Second layer-----%
    Y1 = [-1 Y1a];
    V2 = Y1*W2;
    Y2a = tanh(V2/2);
    %----Output layer-----%
    Y2 = [-1 Y2a];
    V3 = Y2*W3;
    O = tanh(V3/2);
    e = d(n)-O;
    sortie(n) = O;
    %------------/Feed Forward-----------------%

    %------------Backward propagation-----------%
    %----Output layer-----%
    delta3 = e*0.5*(1+O)*(1-O);
    W3n = W3 + alpha*(W3-W3p) + etha * delta3 * W3;
    %----Second Layer-----%
    delta2 = zeros(1,length(Y2a));
    for b = 1:length(Y2a)
        delta2(b) = 0.5*(1-Y2a(b))*(1+Y2a(b)) * sum(delta3*W3(b+1,1));
    end
    W2n = W2 + alpha*(W2-W2p) + (etha * delta2'*Y1)';
    %----First Layer-----%
    delta1 = zeros(1,length(Y1a));
    for b = 1:length(Y1a)
        for m = 1:length(Y2a)
            delta1(b) = 0.5*(1-Y1a(b))*(1+Y1a(b)) * sum(delta2(m)*W2(b+1,m));
        end
    end
    W1n = W1 + alpha*(W1-W1p) + (etha * delta1'*X0)';

    W3p = W3;
    W3 = W3n;
    W2p = W2;
    W2 = W2n;
    W1p = W1;
    W1 = W1n;

    figure(1);
    plot(1:length(d),d,1:length(d),sortie);
    drawnow;
end
My question is: what can I do to correct this?
My guesses so far are that I either have something wrong in the back-propagation, specifically in calculating the deltas and the weights, or I have the weights initialized wrongly (too small, or not dependent on the initial input).
I am not an expert in this field, but I have had some experience playing with Matlab- and Java-based neural network systems.
I can suggest that using the toolbox could help you; it has helped others that I know.
I can offer a few points of information:
Do not expect NNs to work on all training data; sometimes the data is too complicated for classification in this manner.
The format of your NN will have a drastic impact on its convergence performance.
Finally:
Training algorithms like this will often train better when the various parameters are normalized to +/- 1. cos(x) is normalized; 100*cos(x) is not. This is because the required weight updates are much larger, and the training system might be taking very small steps. If your data has multiple different ranges, then normalization is vital. Might I suggest that you start with, at the very least, investigating that.
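For instance (a sketch in Python; the same one-liner idea applies in Matlab), you can min-max scale the targets into [-1, 1] before training and invert the scaling on the network's output:

    import numpy as np

    X = np.arange(0, np.pi, 0.1)
    d = 100 * np.cos(X)                                     # raw targets, roughly [-100, 100]
    d_scaled = 2 * (d - d.min()) / (d.max() - d.min()) - 1  # now in [-1, 1]
    # after training, map a network output O back:
    # prediction = (O + 1) / 2 * (d.max() - d.min()) + d.min()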