Backpropagation with ReLU - Understanding the calculation - neural-network

I've been getting started with neural networks and am attempting to implement a forward and backward pass with a ReLU activation function. However, I feel like I'm misunderstanding something relatively fundamental here when it comes to the backward pass.
import numpy as np

class ReLU:
    def __init__(self):
        self.input_tensor = None

    def forward(self, input_tensor):
        self.input_tensor = input_tensor
        return np.maximum(0, input_tensor)

    def backward(self, error_tensor):
        deriv = np.greater(error_tensor, 0).astype(int)
        return self.input_tensor - deriv
My question is simple: what is the output of the backward method even supposed to look like? My confusion stems from the fact that the derivative of ReLU is simple enough, but I'm not sure how it is then factored into the output that is passed on to the next layer. I'm absolutely aware that I can't simply subtract the derivative from the old input, but I'm unable to see how they go together.

For x > 0, ReLU is like multiplying x by 1; otherwise it's like multiplying x by 0. The derivative is then either 1 (x > 0) or 0 (x <= 0).
So depending on what the output (equivalently, the input) was, you must multiply the error_tensor by 1 or 0.
If it isn't clear, that means you have to save the input or output of the forward pass to help calculate the gradient.
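To make that concrete, here is a minimal corrected sketch (my own illustration, not from the original answer): the backward pass multiplies the incoming error_tensor element-wise by a mask marking where the forward input was positive.

import numpy as np

class ReLU:
    def __init__(self):
        self.input_tensor = None

    def forward(self, input_tensor):
        # Cache the input; the backward pass needs it to build the mask.
        self.input_tensor = input_tensor
        return np.maximum(0, input_tensor)

    def backward(self, error_tensor):
        # Derivative of ReLU: 1 where the forward input was positive, 0 elsewhere.
        # The gradient from the next layer is scaled element-wise by that mask.
        return error_tensor * (self.input_tensor > 0)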

Related

Why does the huggingface BERT pooler hack make mixed precision training stable?

The Huggingface BERT implementation has a hack to remove the pooler from the optimizer.
https://github.com/huggingface/transformers/blob/b832d5bb8a6dfc5965015b828e577677eace601e/examples/run_squad.py#L927
# hack to remove pooler, which is not used
# thus it produce None grad that break apex
param_optimizer = [n for n in param_optimizer if 'pooler' not in n[0]]
We are trying to run pretraining on huggingface BERT models. The code always diverges later during the training if this pooler hack is not applied. I also see the pooler layer being used during classification.
pooled_output = outputs[1]
pooled_output = self.dropout(pooled_output)
logits = self.classifier(pooled_output)
The pooler layer is an FFN with a tanh activation:
class BertPooler(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        self.activation = nn.Tanh()

    def forward(self, hidden_states):
        # We "pool" the model by simply taking the hidden state corresponding
        # to the first token.
        first_token_tensor = hidden_states[:, 0]
        pooled_output = self.dense(first_token_tensor)
        pooled_output = self.activation(pooled_output)
        return pooled_output
My question is: why does this pooler hack solve the numeric instability?
Problem seen with pooler
There are quite a few resources out there that probably tackle this issue better than I can; see for example here or here.
Specifically, the problem is that you are dealing with vanishing (or exploding) gradients, which arise with activation functions that flatten out in either direction for very small/large inputs. This is the case for both sigmoid and tanh (the only difference here is the range in which their output lies, which is [0, 1] and [-1, 1], respectively).
Additionally, if you have a low-precision floating-point format, as is the case with APEX, then the vanishing-gradient behavior is much more likely to appear already for relatively moderate outputs, as the limited precision restricts the numbers it is able to differentiate from zero. One way to deal with this is to use functions that have strictly non-zero and easily computable derivatives, such as Leaky ReLU, or simply avoid the activation function altogether (which I'm assuming is what huggingface is doing here).
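To see this numerically (a quick illustration of my own, not from the original answer): the derivative of tanh is 1 - tanh(x)**2, and in float16 it already underflows to exactly zero for moderate inputs where float32 still resolves it.

import numpy as np

x = 6.0
# float32 still resolves the tiny gradient of tanh at x = 6 (~2.5e-05)...
grad32 = np.float32(1) - np.tanh(np.float32(x)) ** 2
# ...but in float16, tanh(6) rounds to exactly 1.0, so the gradient vanishes.
grad16 = np.float16(1) - np.tanh(np.float16(x)) ** 2
print(grad32, grad16)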
Note that the problem of exploding gradients is usually not as severe, as we can apply gradient clipping (limiting the gradient to a fixed maximum size), but the principle is nonetheless the same. For zeroed gradients, on the other hand, there is no such easy fix, since they cause your neurons to "die" (no learning happens when zero gradient flows back), which is why I'm assuming that you see the diverging behavior.
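As an aside, the gradient clipping mentioned above is a one-liner in PyTorch; a minimal sketch (the toy model and max_norm=1.0 are illustrative choices, not from the original answer):

import torch
import torch.nn as nn

model = nn.Linear(4, 2)                  # toy stand-in for the real model
loss = model(torch.randn(8, 4)).sum()
loss.backward()
# Rescale all gradients in place so their combined norm is at most max_norm,
# bounding the update size and taming exploding gradients.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)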

Keras: make specific weights in a dense layer untrainable [duplicate]

I am using keras and tensorflow 1.4.
I want to explicitly specify which neurons are connected between two layers. Therefore I have a matrix A with ones wherever neuron i in the first layer is connected to neuron j in the second layer, and zeros elsewhere.
My first attempt was to create a custom layer with a kernel that has the same size as A, with non-trainable zeros where A has zeros and trainable weights where A has ones. The desired output would then be a simple dot product. Unfortunately I did not manage to figure out how to implement a kernel that is partly trainable and partly non-trainable.
Any suggestions?
(Building a functional model with a lot of neurons that are connected by hand could be a workaround, but a somewhat 'ugly' solution.)
The simplest way I can think of, if you have this matrix correctly shaped, is to subclass the Dense layer and simply multiply the original weights by the matrix in the code:
class CustomConnected(Dense):

    def __init__(self, units, connections, **kwargs):
        # this is matrix A
        self.connections = connections
        # initialize the original Dense with all the usual arguments
        super(CustomConnected, self).__init__(units, **kwargs)

    def call(self, inputs):
        # change the kernel before calling the original call:
        self.kernel = self.kernel * self.connections
        # call the original calculations:
        return super(CustomConnected, self).call(inputs)
Using:
model.add(CustomConnected(units, matrixA))
model.add(CustomConnected(hidden_dim2, matrixB, activation='tanh'))  # can use all the other named parameters...
Notice that all the neurons/units still have a bias added at the end. The argument use_bias=False will still work if you don't want biases. You can also do exactly the same thing with a vector B, for instance, and mask the original biases with self.bias = self.bias * vectorB.
Hint for testing: use different input and output dimensions, so you can be sure that your matrix A has the correct shape.
I just realized that my code is potentially buggy, because I'm changing a property that is used by the original Dense layer. If weird behaviors or messages appear, you can try another call method:
def call(self, inputs):
    output = K.dot(inputs, self.kernel * self.connections)
    if self.use_bias:
        output = K.bias_add(output, self.bias)
    if self.activation is not None:
        output = self.activation(output)
    return output
Here, K comes from import keras.backend as K.
You may also go further and set a custom get_weights() method if you want to see the weights masked with your matrix. (This would not be necessary in the first approach above)
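For illustration, such a get_weights override could look like this (my own sketch, reusing the connections matrix from above; not part of the original answer):

# inside CustomConnected:
def get_weights(self):
    # Report the weights as the layer effectively uses them,
    # i.e. with the kernel masked by the connection matrix A.
    weights = super(CustomConnected, self).get_weights()
    weights[0] = weights[0] * self.connections  # weights[0] is the kernel
    return weights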

Using SciPy's quad to get the principal value of an integral by integrating to just below and from just above the singular point

I am trying to compute the principal value of an integral (over s) of 1/((s - q02)*(s - q2)) on [Ecut, inf] with q02 < Ecut < q2. Doing the principal value by hand (or in Mathematica) one obtains the general result
ln((q2 - Ecut)/(Ecut - q02)) / (q02 - q2)
In the specific example below this gives the result -1.58637*10^-11. One should also be able to get the same result by splitting the integral in two, integrating up to q2 - eps and then starting from q2 + eps, and adding the two results (the divergences should cancel). By taking eps smaller and smaller, one should recover the result above. When I implement this in scipy using quad, however, my result converges to the wrong result 6.04685e-11, as I show in the plot of eps vs. integral result I include.
Why is quad doing this? Even if I have eps = 0 it gives me this wrong result, when I would expect it to give me an error as the integrand blows up...
import numpy as np
import matplotlib.pyplot as plt
from scipy.integrate import quad

q02 = 485124412.
Ecut = 17909665929.
q2 = 90000000000.

def integrand(s):
    return 1/((s - q02)*(s - q2))

xx = [1., 0.1, 0.01, 0.001, 0.0001, 0.00001, 0.000001, 0.0000001, 0.00000001,
      0.000000001, 0.0000000001, 0.00000000001, 0.]
integral = [0*y for y in xx]
i = 0
for eps in xx:
    ans1, err = quad(integrand, Ecut, q2 - eps)
    ans2, err = quad(integrand, q2 + eps, np.inf)
    integral[i] = ans1 + ans2
    i = i + 1
plt.semilogx(xx, integral, marker='.')
plt.show()
One should also be able to get the same result by splitting the integral in two, integrating up to q2 - eps and then starting from q2 + eps, and then adding the two results
Only if computations were perfectly accurate. In numerical practice, what you described is basically the worst thing one could do. You get two large integrals of opposite signs that very nearly cancel each other when added; what is left has more to do with the errors of integration than with the actual value of the integral.
I notice you disregarded the error values err in your script, not even printing them out. Bad idea: they are of size 1e-10, which would already tell you that a final result of "something e-11" is junk.
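For instance (a small check of my own, not in the original script), printing the error estimates next to the sum makes the problem visible; this uses the same integrand and limits as the script above:

eps = 1.0
ans1, err1 = quad(integrand, Ecut, q2 - eps)
ans2, err2 = quad(integrand, q2 + eps, np.inf)
# The reported error estimates are ~1e-10, an order of magnitude larger
# than the ~1e-11 value being computed.
print(ans1 + ans2, err1, err2)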
The Computational Science question Numerical Principal Value Integration - Hilbert-like addresses this issue. One of the approaches indicated there is to add the values of the integrand at points symmetric about the singularity before trying to integrate. This requires taking the integral over a symmetric interval centered at the singularity q2 (that is, from Ecut to 2*q2 - Ecut), and then adding the contribution of the integral from 2*q2 - Ecut to infinity. This split makes sense anyway, because quad treats infinite limits very differently (using Fourier integration), which is yet another thing that would affect the way the singularity cancels out.
So, an implementation of this approach would be
ans1, err = quad(lambda s: integrand(s) + integrand(2*q2-s), Ecut, q2)
ans2, err = quad(integrand, 2*q2-Ecut, np.inf)
No eps is needed. However, the result is still off: it's about -2.5e-11. It turns out the second integral is the culprit. Unfortunately, the Fourier integral approach doesn't seem to be effective here (or I didn't find a way to make it work). Providing a large but finite value as the upper limit leads to a better result, especially if the option epsabs is also used, e.g. epsabs=1e-20.
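Concretely, that variant could look like the following sketch (the finite cutoff 1e9*q2 mirrors the value used with the Cauchy weight below; it is an illustrative choice, not a prescribed one):

ans1, err1 = quad(lambda s: integrand(s) + integrand(2*q2 - s), Ecut, q2)
ans2, err2 = quad(integrand, 2*q2 - Ecut, 1e9*q2, epsabs=1e-20)
print(ans1 + ans2)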
Better yet, read the documentation of quad extra carefully and notice that it directly supports integrals with Cauchy weight 1/(s-q2), choosing an appropriate numerical method for them. This still requires a finite upper limit, and a small value of epsabs, but the result is pretty accurate:
quad(lambda s: 1/(s - q02), Ecut, 1e9*q2, weight='cauchy', wvar=q2, epsabs=1e-20)
returns -1.5863735715967363e-11, compared to the exact value -1.5863735704856253e-11. Notice that the factor 1/(s - q2) does not appear in the integrand above, being relegated to the weight options.
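For reference, the exact value quoted here follows directly from the closed-form expression at the top of the question and can be checked in a couple of lines:

import numpy as np

q02 = 485124412.
Ecut = 17909665929.
q2 = 90000000000.
# Closed-form principal value: ln((q2 - Ecut)/(Ecut - q02)) / (q02 - q2)
print(np.log((q2 - Ecut) / (Ecut - q02)) / (q02 - q2))  # ~ -1.5863735704856253e-11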

SciPy.optimize.least_squares() Objective Function Questions

I am trying to minimize a highly non-linear function by optimizing three unknown parameters a, b, and c0. I'm attempting to replicate some governing equations of a casino roulette ball in Python 3.
Here is the link to the research paper:
http://www.dewtronics.com/tutorials/roulette/documents/Roulette_Physik.pdf
I will be referencing equations (35) and (40) in the paper.
Basically, I take stopwatch lap measurements of the roulette ball spinning on the wheel. For each successive lap, the lap time will increase because of losses of momentum to non-conservative forces of friction. Then I take these time measurements and fit equation (35) using a Levenberg-Marquardt least squares method in equation (40).
My question is twofold:
(1) I'm using scipy.optimize.least_squares() with method='lm', and I'm not sure how to write the objective function! Right now I have the function written exactly as it is in the paper:
def fall_time(k, a, b, c0):
    F = (1 / (a * b)) * (c0 - np.arcsinh(c0) * np.exp(a * k * 2 * np.pi))
    return F

def parameter_estimation_function(x0, tk):
    a = x0[0]
    b = x0[1]
    c0 = x0[2]

    S = 0
    for i, t in enumerate(tk):
        k = i + 1
        S += (t - fall_time(k, a, b, c0))**2

    return [S, 1, 1]

sol = least_squares(parameter_estimation_function, [0.1, 0.8, -0.1], args=([tk1]), method='lm', jac='2-point', max_nfev=2000)
print(sol)
Now, in the documentation examples, I never saw the objective function written the way I have it. In the documentation, the objective function always returns the residual, not the square of the residual. Additionally, they never use the sum! So I'm wondering if the sum and the square are automatically handled under the hood of least_squares()?
(2) Perhaps my second question is a result of my failure to understand how to write the objective function. But anyhow, I'm having trouble getting the algorithm to converge on the minimum. I know this is because the Levenberg-Marquardt algorithm is "greedy" and stops near the closest minimum, but I figured that I would at least be able to converge on about the same result given different initial guesses. With slight alterations in the initial guess, I'm getting parameter results with different signs. Additionally, I've yet to find a combination of initial guesses that allows the algorithm to converge! It always times out before it finds the solution. I've even increased the number of function evaluations to 10,000 to see if it would help. To no avail!
Perhaps somebody could shed some light on my mistakes here! I'm still relatively new to python and the scipy library!
Here is some sample data for tk that I've measured myself from the video here: https://www.youtube.com/watch?v=0Zj_9ypBnzg
tk = [0.52,1.28,2.04,3.17,4.53,6.22]
tk1 = [0.51,1.4,2.09,3,4.42,6.17]
tk2 = [0.63,1.35,2.19,3.02,4.57,6.29]
tk3 = [0.63,1.39,2.23,3.28,4.70,6.32]
tk4 = [0.57,1.4,2.1,3.06,4.53,6.17]
Thanks
1) Yes, as you suspected, the sum and the square of the residuals are automatically handled; the objective function only needs to return the vector of residuals.
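A quick way to convince yourself of this (a toy check of my own, not from the original answer): least_squares minimizes 0.5 * sum(residuals**2) and reports that value as sol.cost.

import numpy as np
from scipy.optimize import least_squares

# Residuals of a one-parameter toy problem; the optimum is x = 2.
res = least_squares(lambda x: np.array([x[0] - 1.0, x[0] - 3.0]), [0.0])
print(res.x)                                # ~[2.]
print(res.cost, 0.5 * np.sum(res.fun**2))   # both 1.0: squaring/summing is internal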
2) Hard to say, since I'm not deeply familiar with the problem (e.g., how many local minima exist, what constitutes a 'reasonable' result, etc.). I may investigate more later.
But for kicks I fiddled with some of the values to see what would happen. For example, you can just replace the 1/b constant with a standalone variable b_inv, and this seemed to stabilize the results quite a bit. Here's the code I used to check results. (Note that I rewrote the objective function for brevity. It simply leverages the element-wise operations of numpy arrays, without changing the overall result.)
import numpy as np
from scipy.optimize import least_squares

def fall_time(k, a, b_inv, c0):
    return (b_inv / a) * (c0 - np.arcsinh(c0) * np.exp(a * k * 2 * np.pi))

def parameter_estimation_function(x, tk):
    return np.asarray(tk) - fall_time(k=np.arange(1, len(tk) + 1), a=x[0], b_inv=x[1], c0=x[2])

tk_samples = [
    [0.52, 1.28, 2.04, 3.17, 4.53, 6.22],
    [0.51, 1.4, 2.09, 3, 4.42, 6.17],
    [0.63, 1.35, 2.19, 3.02, 4.57, 6.29],
    [0.63, 1.39, 2.23, 3.28, 4.70, 6.32],
    [0.57, 1.4, 2.1, 3.06, 4.53, 6.17]
]

for i in range(len(tk_samples)):
    sol = least_squares(parameter_estimation_function, [0.1, 1.25, -0.1],
                        args=(tk_samples[i],), method='lm', jac='2-point', max_nfev=2000)
    print(sol.x)
with console output:
[ 0.03621789 0.64201913 -0.12072879]
[ 3.59319972e-02 1.17129458e+01 -6.53358716e-03]
[ 3.55516005e-02 1.48491493e+01 -5.31098257e-03]
[ 3.18068316e-02 1.11828091e+01 -7.75329834e-03]
[ 3.43920725e-02 1.25160378e+01 -6.36307506e-03]

Tensorflow max-margin loss training?

I want to train a neural network in tensorflow with a max-margin loss function using one negative sample per positive sample:
max(0, 1 - pos_score + neg_score)
What I'm currently doing is this:
The network takes three inputs: input1, and then one positive example input2_pos and one negative example input2_neg. (These are indices to a word embeddings layer.) The network is supposed to calculate a score that expresses how related two examples are.
Here's a simplified version of my code:
input1 = tf.placeholder(dtype=tf.int32, shape=[batch_size])
input2_pos = tf.placeholder(dtype=tf.int32, shape=[batch_size])
input2_neg = tf.placeholder(dtype=tf.int32, shape=[batch_size])

# f is a neural network outputting a score
pos_score = f(input1, input2_pos)
neg_score = f(input1, input2_neg)

cost = tf.maximum(0., 1. - pos_score + neg_score)
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)
What I see when I run this is that the network just learns which input holds the positive example - it always predicts a similar score along the lines of:
pos_score = 0.9965983
neg_score = 0.00341663
How can I structure the variables/training so that the network learns the task instead?
I want just one network that takes two inputs and calculates a score expressing the correlation between them, and to train it with a max-margin loss.
Calculating scores for the positive and negative examples separately does not seem like an option to me, since then it won't backpropagate properly. Another option seems to be randomizing the inputs - but then for the loss function I would need to know which example is the positive one, and inputting that as another parameter would give away the solution again.
Any ideas?
Given your results (1 for every positive, 0 for every negative), it seems you have two different networks learning:
to predict 1 for the first one
to predict 0 for the second one
When using max-margin loss, you need to use the same network for computing both pos_score and neg_score. The way to do that is to share the variables. I will give you a small example using tf.get_variable():
with tf.variable_scope("network"):
    w = tf.get_variable("weights", shape=..., initializer=...)

def f(x, y):
    with tf.variable_scope("network", reuse=True):
        w = tf.get_variable("weights")
        res = w * (x - y)  # some computation
    return res
With this function f as model, the training will optimize the shared variable with name "network/weights".
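Wiring this back into the snippet from the question, both scores then come from the same shared weights (a sketch; the tf.reduce_mean is my addition to collapse the per-example margins into a scalar loss, which the original cost left implicit):

pos_score = f(input1, input2_pos)
neg_score = f(input1, input2_neg)
cost = tf.reduce_mean(tf.maximum(0., 1. - pos_score + neg_score))
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)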