The paper "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" by Devlin & Co. calculated for the base model size 110M parameters (i.e. L=12, H=768, A=12) where L = number of layers, H = hidden size and A = number of self-attention operations. As far as I know parameters in a neural network are usually the count of "weights and biases" between the layers. So how is this calculated based on the given information? 12768768*12?
Transformer Encoder-Decoder Architecture
The BERT model contains only the encoder block of the transformer architecture. Let's look at the individual elements of an encoder block for BERT to visualize the number of weight matrices as well as the bias vectors. The given configuration L = 12 means there will be 12 layers of self attention, H = 768 means that the embedding dimension of individual tokens is 768, and A = 12 means there will be 12 attention heads in one layer of self attention. The encoder block performs the following sequence of operations:
The input will be the sequence of tokens as a matrix of dimension s × d, where s is the sequence length and d is the embedding dimension. The resulting input sequence will be the sum of the token embeddings, the token type embeddings and the position embeddings, giving a d-dimensional vector for each token. In the BERT model, the first set of parameters is the vocabulary embeddings. BERT uses WordPiece[2] embeddings with a vocabulary of 30522 tokens, each embedded into 768 dimensions.
Embedding layer normalization. One weight matrix and one bias vector.
Multi-head self attention. There will be A heads, and for each head there will be three matrices corresponding to the query matrix, the key matrix and the value matrix. The first dimension of these matrices is the embedding dimension and the second dimension is the embedding dimension divided by the number of attention heads (768/12 = 64). Apart from this, there will be one more matrix to transform the concatenated values generated by the attention heads into the final token representation.
Residual connection and layer normalization. One weight matrix and one bias vector.
Position-wise feedforward network will have one hidden layer, that will correspond to two weight matrices and two bias vectors. In the paper, it is mentioned that the number of units in the hidden layer will be four times the embedding dimension.
Residual connection and layer normalization. One weight matrix and one bias vector.
Let's calculate the actual number of parameters by associating the right dimensions to the weight matrices and bias vectors for the BERT base model.
Embedding Matrices:
Word Embedding Matrix size [Vocabulary size, embedding dimension] = [30522, 768] = 23440896
Position embedding matrix size, [Maximum sequence length, embedding dimension] = [512, 768] = 393216
Token Type Embedding matrix size [2, 768] = 1536
Embedding Layer Normalization, weight and Bias [768] + [768] = 1536
Total Embedding parameters = 23837184 ≈ 24M
Attention Head:
Query Weight Matrix size [768, 64] = 49152 and Bias [64] = 64
Key Weight Matrix size [768, 64] = 49152 and Bias [64] = 64
Value Weight Matrix size [768, 64] = 49152 and Bias [64] = 64
Total parameters for one layer of attention with 12 heads = 12 * (3 * (49152 + 64)) = 1771776. Equivalently, each of the query, key and value projections is a single [768, 768] matrix with a [768] bias shared across the heads: 3 * (589824 + 768) = 1771776.
Dense weight for projection after concatenation of heads [768, 768] = 589824 and Bias [768] = 768, (589824+768 = 590592)
Layer Normalization weight and Bias [768], [768] = 1536
Position wise feedforward network weight matrices and biases: [768, 3072] = 2359296 with bias [3072] = 3072, and [3072, 768] = 2359296 with bias [768] = 768, (2359296 + 3072 + 2359296 + 768 = 4722432)
Layer Normalization weight and Bias [768], [768] = 1536
Total parameters for one complete attention layer (1771776 + 590592 + 1536 + 4722432 + 1536 = 7087872 ≈ 7M)
Total parameters for 12 layers of attention (12 * 7087872 = 85054464 ≈ 85M)
Output layer of BERT Encoder:
Dense Weight Matrix and Bias [768, 768] = 589824, [768] = 768, (589824 + 768 = 590592)
Total Parameters in BERT Base = 23837184 + 85054464 + 590592 = 109482240 ≈ 110M
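To tie the numbers together, here is a short Python sketch of the same arithmetic (the [768, 768] dense output layer is included as the last term; the values follow the BERT base configuration used above):

# Sanity check of the counts above for BERT base
# (vocab = 30522, max positions = 512, token types = 2, H = 768, L = 12, feed-forward size = 4*H)
V, P, T, H, L, FF = 30522, 512, 2, 768, 12, 3072

embeddings = V * H + P * H + T * H + 2 * H     # word + position + token type + LayerNorm
per_layer = (
    3 * (H * H + H)      # query, key and value projections (all 12 heads together) with biases
    + (H * H + H)        # dense projection after concatenating the heads
    + 2 * H              # LayerNorm after attention
    + (H * FF + FF)      # feed-forward, first dense layer
    + (FF * H + H)       # feed-forward, second dense layer
    + 2 * H              # LayerNorm after the feed-forward block
)
output_dense = H * H + H                       # the [768, 768] output dense layer counted above

print(embeddings)                                 # 23837184
print(per_layer)                                  # 7087872
print(embeddings + L * per_layer + output_dense)  # 109482240, i.e. about 110M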
So from what I've understood, the formula for the MSE is: MSE = (1/n) * ∑(t − y)^2, where n is the number of training examples, t is my target output and y my actual output. Let's say I had 2 training examples, each with 1 output:
[0;0] t=[0] y=[1]
[1;1] t=[1] y=[1]
If I apply the MSE I would get MSE = 1/2 * [(0-1)^2 + (1-1)^2] = 1/2
But what if I have more than 1 output? Do I calculate the MSE of each training set and then I calculate the mean of all the MSEs I got?
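A quick numpy check of the calculation above, together with one common convention for the multi-output case: average the squared error over every output of every example, which gives the same value as computing the MSE of each example first and then averaging those (the 2x3 arrays below are made-up toy values, just for illustration):

import numpy as np

# Single-output case from above: two training examples
t = np.array([0.0, 1.0])
y = np.array([1.0, 1.0])
print(np.mean((t - y) ** 2))     # 0.5, i.e. 1/2 * [(0-1)^2 + (1-1)^2]

# Multi-output case: 2 examples with 3 outputs each (toy values)
T = np.array([[0.0, 1.0, 0.0], [1.0, 0.0, 1.0]])
Y = np.array([[0.1, 0.9, 0.2], [0.8, 0.1, 1.0]])
print(np.mean((T - Y) ** 2))                    # average over all outputs of all examples
print(np.mean(np.mean((T - Y) ** 2, axis=1)))   # per-example MSE first, then averaged: same value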
Consider the following code in Keras for building an LSTM model.
from keras.models import Sequential
from keras.layers import LSTM, Dense, TimeDistributed

model = Sequential()
model.add(LSTM(30, input_dim=22,return_sequences=True, init = 'glorot_uniform'))
model.add(TimeDistributed(Dense(1)))
model.compile(loss='mean_squared_error', optimizer='Nadam')
model.fit(train3d, trainY3d, nb_epoch=100, batch_size=8)
As one can see, there is only 1 LSTM layer. I see there are 14 weight arrays in the output. If you check this link you would notice that there are 4 triples (input gate, forget gate, cell state, output gate) of W, U (parameter matrices) and b (bias vector) for the LSTM layer, and then there are 2 arrays, W and b, for the dense layer.
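As a rough sanity check of that count, assuming input_dim=22 and 30 LSTM units as in the model above and the per-gate W/U/b layout just described:

# Count the weight arrays and parameters implied by the description above
input_dim, units = 22, 30

# 4 gates (input gate, forget gate, cell state, output gate), each with W, U and b
lstm_arrays = 4 * 3                         # 12 arrays
lstm_params = 4 * (input_dim * units        # W: (22, 30)
                   + units * units          # U: (30, 30)
                   + units)                 # b: (30,)

# TimeDistributed(Dense(1)): one weight matrix and one bias vector
dense_arrays = 2
dense_params = units * 1 + 1

print(lstm_arrays + dense_arrays)   # 14 arrays in total
print(lstm_params + dense_params)   # 4 * 1590 + 31 = 6391 parameters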
My question is: for this case of a 1-layer LSTM, is there a way to attribute 100% of Y to the impact of each input feature X_i, for all i?
I have read that you calculate the output of a Neuron in a Neural Net by adding up all the inputs times their corresponding weights and then smoothing it with e.g. the Sigmoid Function.
But what I don't understand is that this sum (without smoothing) could get bigger than 1.
When this happens my Sigmoid Function outputs 1.0.
The function I am using to calculate the Neuron Output (without smoothing) is:
def sum(self, inputs):
    valu = 0
    for i, val in enumerate(inputs):
        valu += float(val) * self.weights[i]
    return valu
So my question is:
Am I doing something wrong, because I have read that the output should be between 0 and 1?
The sigmoid function is not exactly a smoothing function; it is a non-linear function that maps any real input into the (0, 1) range in a non-linear manner. Informally speaking, a non-linear function does not have a constant slope or, in other words, it can't be described as a straight line.
The sigmoid function squishes the input such that, as the magnitude of the input increases, the output asymptotically approaches 0 (for large negative inputs) or 1 (for large positive inputs).
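A small numpy sketch of this behaviour; the sigmoid never actually reaches 1, but for large inputs the result is rounded to exactly 1.0 in floating point, which may be what you are seeing:

import numpy as np

def sigmoid(z):
    # maps any real input into the open interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

for z in [-10.0, -1.0, 0.0, 1.0, 3.5, 10.0, 50.0]:
    print(z, sigmoid(z))
# sigmoid(3.5) is about 0.97, sigmoid(10) about 0.99995, and sigmoid(50) is so close
# to 1 that float64 rounds it to 1.0; mathematically it is still strictly below 1.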
I am trying to implement a neural network with ReLU.
input layer -> 1 hidden layer -> relu -> output layer -> softmax layer
Above is the architecture of my neural network.
I am confused about the backpropagation of this ReLU.
For the derivative of ReLU, if x <= 0, the output is 0.
If x > 0, the output is 1.
So when you calculate the gradient, does that mean I kill gradient descent if x <= 0?
Can someone explain the backpropagation of my neural network architecture 'step by step'?
if x <= 0, output is 0. if x > 0, output is 1
The ReLU function is defined as: For x > 0 the output is x, i.e. f(x) = max(0,x)
So for the derivative f '(x) it's actually:
if x < 0, output is 0. if x > 0, output is 1.
The derivative f '(0) is not defined. So it's usually set to 0 or you modify the activation function to be f(x) = max(e,x) for a small e.
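A minimal numpy sketch of this definition, using the common convention of setting the derivative at 0 to 0:

import numpy as np

def relu(x):
    # f(x) = max(0, x), element-wise
    return np.maximum(0, x)

def relu_prime(x):
    # 1 for x > 0, 0 for x < 0; the undefined point x = 0 is set to 0 here
    return (x > 0).astype(float)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))        # [0.   0.   0.   0.5  2. ]
print(relu_prime(x))  # [0. 0. 0. 1. 1.]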
Generally: A ReLU is a unit that uses the rectifier activation function. That means it works exactly like any other hidden layer, but instead of tanh(x), sigmoid(x) or whatever activation you use, you'll use f(x) = max(0, x).
If you have written code for a working multilayer network with sigmoid activation it's literally 1 line of change. Nothing about forward- or back-propagation changes algorithmically. If you haven't got the simpler model working yet, go back and start with that first. Otherwise your question isn't really about ReLUs but about implementing a NN as a whole.
If you have a layer made out of a single ReLU, like your architecture suggests, then yes, you kill the gradient at 0. During training, the ReLU will return 0 to your output layer, which will either return 0 or 0.5 if you're using logistic units, and the softmax will squash those. So a value of 0 under your current architecture doesn't make much sense for the forward propagation part either.
See for example this. What you can do is use a "leaky ReLU", which uses a small slope for negative inputs, such as 0.01, instead of 0.
I would reconsider this architecture however, it doesn't make much sense to me to feed a single ReLU into a bunch of other units then apply a softmax.
Here is a good example: using ReLU to implement XOR.
Reference: http://pytorch.org/tutorials/beginner/pytorch_with_examples.html
# -*- coding: utf-8 -*-
import numpy as np
import matplotlib.pyplot as plt

# N is batch size (sample size); D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 4, 2, 30, 1

# Create the XOR input and output data
x = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([[0], [1], [1], [0]])

# Randomly initialize weights
w1 = np.random.randn(D_in, H)
w2 = np.random.randn(H, D_out)

learning_rate = 0.002
loss_col = []
for t in range(200):
    # Forward pass: compute predicted y
    h = x.dot(w1)
    h_relu = np.maximum(h, 0)  # using ReLU as the activation function
    y_pred = h_relu.dot(w2)

    # Compute and print loss
    loss = np.square(y_pred - y).sum()  # loss function
    loss_col.append(loss)
    print(t, loss, y_pred)

    # Backprop to compute gradients of w1 and w2 with respect to loss
    grad_y_pred = 2.0 * (y_pred - y)       # the last layer's error
    grad_w2 = h_relu.T.dot(grad_y_pred)
    grad_h_relu = grad_y_pred.dot(w2.T)    # the second layer's error
    grad_h = grad_h_relu.copy()
    grad_h[h < 0] = 0                      # the derivative of ReLU
    grad_w1 = x.T.dot(grad_h)

    # Update weights
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2

plt.plot(loss_col)
plt.show()
More about the derivative of ReLU can be found here: http://kawahara.ca/what-is-the-derivative-of-relu/
So when you calculate the gradient, does that mean I kill gradient descent if x <= 0?
Yes! If the weighted sum of the inputs and the bias of the neuron (the input to the activation function) is less than or equal to zero and the neuron uses the ReLU activation function, the value of the derivative is zero during backpropagation and the input weights to this neuron do not change (they are not updated).
Can someone explain the backpropagation of my neural network architecture 'step by step'?
A simple example can show one step of backpropagation. This example covers the complete process of one step, but you can also check only the part related to ReLU. The architecture is similar to the one introduced in the question and uses one neuron in each layer for simplicity. The architecture is as follows:
x -> (w1, b1) -> f -> (w2, b2) -> g -> h
f and g represent the ReLU and sigmoid activations, respectively, and b1 and b2 represent the biases of the hidden and output neurons.
Step 1:
First, the output is calculated:
z1 = w1 * x + b1, a1 = f(z1)
z2 = w2 * a1 + b2, h = g(z2)
This merely represents the output calculation. "z" and "a" represent the sum of the inputs to a neuron and the output value of the neuron's activation function, respectively. So h is the estimated value. Suppose the real value is y and the error function is the squared error, E = (y - h)^2.
The weights are now updated with backpropagation.
The new weight is obtained by calculating the gradient of the error function with respect to the weight and subtracting this gradient from the previous weight, i.e.:
w_new = w_old - η * ∂E/∂w
In backpropagation, the gradient of the neuron(s) of the last layer is calculated first. The chain rule is used to calculate it:
∂E/∂w2 = ∂E/∂h * ∂h/∂z2 * ∂z2/∂w2 = 2 * (h - y) * g'(z2) * a1
The three general terms used above are:
The difference between the actual value and the estimated value, 2 * (h - y), which comes from the squared error
The output of the hidden neuron, a1
And the derivative of the activation function; given that the activation function in the last layer is sigmoid, we have g'(z2) = g(z2) * (1 - g(z2))
And the above expression does not necessarily become zero.
Now we go to the second layer we reach during backpropagation, the hidden ReLU layer. There we will have:
∂E/∂w1 = ∂E/∂h * ∂h/∂z2 * ∂z2/∂a1 * ∂a1/∂z1 * ∂z1/∂w1 = 2 * (h - y) * g'(z2) * w2 * f'(z1) * x
It consists of 4 main terms:
The difference between the actual value and the estimated value, 2 * (h - y)
The input of the neuron, x
The loss derivative of the connected neuron in the next layer (for a wider network, the sum of such terms), g'(z2) * w2
And the derivative of the activation function; since the activation function is ReLU we will have:
f'(z1) = 0 if z1 <= 0 (z1 is the input of the ReLU function), so the whole gradient ∂E/∂w1 becomes 0.
Otherwise f'(z1) = 1, and the gradient is not necessarily zero.
So if the input of the ReLU neuron is less than or equal to zero, the loss derivative is zero and its incoming weights will not be updated.
To repeat: the weighted sum of the neuron's inputs must be less than or equal to zero to kill gradient descent for that neuron.
The example given is a very simple example to illustrate the backpropagation process.
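Here is a minimal numeric version of that single step; the input, target, weights and learning rate are arbitrary toy values, f is the ReLU, g the sigmoid, and the loss is E = (y - h)^2 as above:

import numpy as np

def f(z):                # ReLU
    return max(z, 0.0)

def f_prime(z):          # ReLU derivative, 0 for z <= 0
    return 1.0 if z > 0 else 0.0

def g(z):                # sigmoid
    return 1.0 / (1.0 + np.exp(-z))

x, y = 1.5, 1.0                                 # toy input and target
w1, b1, w2, b2, lr = 0.4, 0.1, 0.3, 0.1, 0.1    # toy parameters and learning rate

# Forward pass
z1 = w1 * x + b1; a1 = f(z1)    # hidden ReLU neuron
z2 = w2 * a1 + b2; h = g(z2)    # output sigmoid neuron

# Backward pass: exactly the two gradients derived above
dE_dw2 = 2 * (h - y) * g(z2) * (1 - g(z2)) * a1
dE_dw1 = 2 * (h - y) * g(z2) * (1 - g(z2)) * w2 * f_prime(z1) * x

# Here z1 = 0.7 > 0, so both gradients are non-zero. If z1 were <= 0,
# f_prime(z1) would be 0 and dE_dw1 would vanish: w1 would not be updated.
w1 -= lr * dE_dw1
w2 -= lr * dE_dw2
print(z1, dE_dw1, dE_dw2)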
Yes, the original ReLU function has the problem you describe.
So they later made a change to the formula and called it leaky ReLU.
In essence, leaky ReLU tilts the horizontal part of the function slightly by a very small amount. For more information, watch this:
An explanation of activation methods, and an improved ReLU, on YouTube
Additionally, here you can find an implementation in the Caffe framework: https://github.com/BVLC/caffe/blob/master/src/caffe/layers/relu_layer.cpp
The negative_slope parameter specifies whether to "leak" the negative part by multiplying it with the slope value rather than setting it to 0. Of course, you should set this parameter to zero to have the classical version.
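A small numpy sketch of the same idea; negative_slope plays the role of the Caffe parameter mentioned above, 0.01 is a typical choice and 0 recovers the classical ReLU:

import numpy as np

def leaky_relu(x, negative_slope=0.01):
    # classical ReLU when negative_slope == 0; otherwise the negative part is
    # "leaked" by multiplying it with the slope instead of setting it to 0
    return np.where(x > 0, x, negative_slope * x)

x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(leaky_relu(x))                       # [-0.03  -0.005  0.     0.5    3.   ]
print(leaky_relu(x, negative_slope=0.0))   # classical ReLU: [0.  0.  0.  0.5 3. ]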