The paper "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" by Devlin & Co. calculated for the base model size 110M parameters (i.e. L=12, H=768, A=12) where L = number of layers, H = hidden size and A = number of self-attention operations. As far as I know parameters in a neural network are usually the count of "weights and biases" between the layers. So how is this calculated based on the given information? 12768768*12?
Transformer Encoder-Decoder Architecture
The BERT model contains only the encoder block of the transformer architecture. Let's look at the individual elements of an encoder block for BERT to visualize the number of weight matrices as well as bias vectors. The given configuration L = 12 means there will be 12 layers of self-attention, H = 768 means that the embedding dimension of individual tokens will be 768, and A = 12 means there will be 12 attention heads in one layer of self-attention. The encoder block performs the following sequence of operations:
The input will be the sequence of tokens as a matrix of dimension S x d, where S is the sequence length and d is the embedding dimension. The resultant input sequence will be the sum of token embeddings, token type embeddings and position embeddings, giving a d-dimensional vector for each token. In the BERT model, the first set of parameters is the vocabulary embeddings. BERT uses WordPiece [2] embeddings with a vocabulary of 30522 tokens, each of dimension 768.
Embedding layer normalization. One weight vector and one bias vector.
Multi-head self-attention. There will be h heads, and for each head there will be three matrices corresponding to the query, key and value projections. The first dimension of these matrices is the embedding dimension and the second dimension is the embedding dimension divided by the number of attention heads. Apart from this, there is one more matrix that transforms the concatenated outputs of the attention heads into the final token representation.
Residual connection and layer normalization. One weight vector and one bias vector.
Position-wise feedforward network. This has one hidden layer, which corresponds to two weight matrices and two bias vectors. The paper mentions that the number of units in the hidden layer is four times the embedding dimension.
Residual connection and layer normalization. One weight vector and one bias vector.
Let's calculate the actual number of parameters by associating the right dimensions to the weight matrices and bias vectors for the BERT base model.
Embedding Matrices:
Word Embedding Matrix size [Vocabulary size, embedding dimension] = [30522, 768] = 23440896
Position embedding matrix size, [Maximum sequence length, embedding dimension] = [512, 768] = 393216
Token Type Embedding matrix size [2, 768] = 1536
Embedding Layer Normalization, weight and Bias [768] + [768] = 1536
Total Embedding parameters = 23837184 ≈ 24M
Attention Head:
Query Weight Matrix size [768, 64] = 49152 and Bias [64] = 64
Key Weight Matrix size [768, 64] = 49152 and Bias [64] = 64
Value Weight Matrix size [768, 64] = 49152 and Bias [64] = 64
Total parameters for one layer of attention with 12 heads = 12 * (3 * (49152 + 64)) = 1771776
Dense weight for projection after concatenation of heads [768, 768] = 589824 and Bias [768] = 768, (589824+768 = 590592)
Layer Normalization weight and Bias [768], [768] = 1536
Position-wise feedforward network weight matrices and biases [3072, 768] = 2359296, [3072] = 3072 and [768, 3072] = 2359296, [768] = 768, (2359296 + 3072 + 2359296 + 768 = 4722432)
Layer Normalization weight and Bias [768], [768] = 1536
Total parameters for one complete encoder layer (1771776 + 590592 + 1536 + 4722432 + 1536 = 7087872 ≈ 7M)
Total parameters for 12 layers of attention (12 * 7087872 = 85054464 ≈ 85M)
Output layer of BERT Encoder:
Dense Weight Matrix and Bias [768, 768] = 589824, [768] = 768, (589824 + 768 = 590592)
Total Parameters in BERT Base = 23837184 + 85054464 + 590592 = 109482240 ≈ 110M
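If it helps, here is a small MATLAB-style sketch (not from the paper; the variable names are mine and it simply re-tallies the figures above) that reproduces the same total from L, H, A and the vocabulary size:

% Sketch: recompute the BERT-base parameter count from the numbers above.
% H = hidden size, A = attention heads, L = layers, V = WordPiece vocabulary size.
H = 768; A = 12; L = 12; V = 30522; maxPos = 512; typeVocab = 2;
embeddings  = V*H + maxPos*H + typeVocab*H + 2*H;   % word + position + token-type + LayerNorm
attention   = A * 3 * (H*(H/A) + H/A) ...           % per-head Q, K, V weights and biases
            + (H*H + H) + 2*H;                      % output projection + LayerNorm
ffn         = (H*4*H + 4*H) + (4*H*H + H) + 2*H;    % two dense layers + LayerNorm
encoder     = L * (attention + ffn);
outputDense = H*H + H;                              % final dense layer on the encoder output
total       = embeddings + encoder + outputDense    % 109482240, i.e. ~110M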
I'm building a Deep Learning model for regression with Matlab. My training data is composed of 1512 samples of 512 numeric features.
disp(size(X_train)); % 1512x512
disp(size(Y_train)); % 1512x3
layers = [
featureInputLayer(size(X_train, 2))
convolution1dLayer(3, 20)
tanhLayer()
averagePooling1dLayer(2)
convolution1dLayer(3, 10)
tanhLayer()
averagePooling1dLayer(2)
flattenLayer()
fullyConnectedLayer(20)
tanhLayer()
fullyConnectedLayer(10)
tanhLayer()
fullyConnectedLayer(3)
tanhLayer()
regressionLayer()
];
opts = trainingOptions('adam', ...
'MaxEpochs',200, ...
'Shuffle','every-epoch', ...
'Plots','training-progress', ...
'Verbose',false);
net = trainNetwork(X_train, Y_train, layers, opts);
Layer 'conv_layer_1': Input data must have one spatial dimension only, one temporal dimension only, or one of each. Instead, it has 0 spatial dimensions and 0 temporal dimensions.
The function to perform an N-dimensional convolution of arrays A and B in matlab is shown below:
C = convn(A,B) % returns the N-dimensional convolution of arrays A and B.
I am interested in a 3-D convolution with a Gaussian filter.
If A is a 3 x 5 x 6 matrix, what do the dimensions of B have to be?
The dimensions of B can be anything you want. There is no set restriction in terms of size. For the Gaussian filter, it can be 1D, 2D or 3D. In 1D, what will happen is that each row gets filtered independently. In 2D, what will happen is that each slice gets filtered independently. Finally, in 3D you will be doing what is expected in 3D convolution. I am assuming you would like a full 3D convolution, not just 1D or 2D.
You may be interested in the output size of convn. If you refer to the documentation: given two N-dimensional matrices, for each dimension k of the output, if nAk is the size of dimension k of matrix A and nBk is the size of dimension k of matrix B, then the size nCk of dimension k of the output matrix C is:
nCk = max([nAk + nBk - 1, nAk, nBk])
nAk + nBk - 1 is straight from convolution theory: the output size in a dimension is simply the sum of the two sizes in dimension k minus 1. However, should this value be smaller than either nAk or nBk, we need to make sure the output is large enough that either input matrix can fit inside it, which is why the final output size is also bounded below by the sizes of A and B.
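As a quick sanity check of that rule with the sizes from your question (A is 3 x 5 x 6; the 5 x 5 x 5 filter B is just a made-up example size):

nA = [3 5 6];
nB = [5 5 5];
nC = max([nA + nB - 1; nA; nB])   % column-wise max, gives [7 9 10]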
To make this easier, you can set the size of the filter guided by the standard deviation of the distribution. I would like to refer you to my previous Stack Overflow post: By which measures should I set the size of my Gaussian filter in MATLAB?
This determines what the output size of a Gaussian filter should be given a standard deviation.
In 2D, the dimensions of the filter are N x N, such that N = ceil(6*sigma + 1) with sigma being the desired standard deviation. Therefore, you would allocate a 3D matrix of size N x N x N with N = ceil(6*sigma + 1);.
Therefore, the code you would want to use to create a 3D Gaussian filter would be something like this:
% Example input
A = rand(3, 5, 6);
sigma = 0.5; % Example
% Find size of Gaussian filter
N = ceil(6*sigma + 1);
% Define grid of centered coordinates of size N x N x N
lim = (N - 1) / 2;
[X, Y, Z] = meshgrid(-lim : lim);
% Compute Gaussian filter - note normalization step
B = exp(-(X.^2 + Y.^2 + Z.^2) / (2.0*sigma^2));
B = B / sum(B(:));
% Convolve
C = convn(A, B);
One final note: if any dimension of the filter you provide is larger than the corresponding dimension of the input matrix A, you will still get an output whose size follows the nCk rule above, but the border elements will be influenced by the implicit zero-padding.
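As a side note, if you would rather have the filtered result come out the same size as A, convn also accepts a shape argument (A and B here are the ones defined in the snippet above):

C_same = convn(A, B, 'same');   % central part of the full convolution
size(C_same)                    % equals size(A)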
Introduction
From what I understood from CS231n Convolutional Neural Networks for Visual Recognition, the size of the output volume represents the number of neurones, given the following parameters:
Input volume size (W)
The receptive field size of the Conv Layer neurons (F) which is the size of the kernel or filter
Stride with which they are applied (S) or steps that we use to move the kernel
Amount of zero padding used (P) on the border
I posted two examples. In example 1 I have no problem at all. But it's in example 2 that I get confused.
Example 1
In the Real-world example section they start with a [227 x 227 x 3] input image. The parameters are the following: F = 11, S = 4, P = 0, W = 227.
We note that the convolution has a depth of K = 96. (Why?)
The size of the output volume is (227 - 11)/4 + 1 = 55. So we will have 55 x 55 x 96 = 290,400 neurones, each pointing (excuse me if I butchered the term) to an [11 x 11 x 3] region in the image, which is in fact the region where we want to compute the dot product with the kernel.
Example 2
In the following example, taken from the Numpy examples section, we have an input volume with the shape [11 x 11 x 4]. The parameters used to compute the size of the output volume are the following: W = 11, P = 0, S = 2 and F = 5.
We note that the convolution has a depth of K = 4
The formula (11-5)/2+1 = 4 produces only 4 neurones. Each neurone points to a region of size [5 x 5 x 4] in the image.
It seems that they are moving the kernel in the x direction only. Shouldn't we have 12 neurones, each having [5 x 5 x 4] weights?
V[0,0,0] = np.sum(X[:5,:5,:] * W0) + b0
V[1,0,0] = np.sum(X[2:7,:5,:] * W0) + b0
V[2,0,0] = np.sum(X[4:9,:5,:] * W0) + b0
V[3,0,0] = np.sum(X[6:11,:5,:] * W0) + b0
Questions
I really don't understand why only 4 neurones are used and not 12
Why did they pick K = 96 in example 1?
Is the W parameter always equal to the width of the input image?
Example 1
Why does the convolution have a depth of K = 96?
The depth (K) is equal to the number of filters used in the convolutional layer. A bigger number usually gives better results; the problem is slower training. Complex images would require more filters. I usually start my tests with 32 filters on the first layer and 64 on the second layer.
Example 2
The formula (11-5)/2+1 = 4 produces only 4 neurones.
I'm no expert, but I think this is false. The formula only defines the output size (height and width). A convolutional layer has a size (height and width) and a depth. The size is defined by this formula, the depth by the number of filters used. The total number of neurons is:
height * width * depth = 4 * 4 * 4 = 64
Questions
The layer has 64 neurons, 16 for each depth slice.
A bigger number of filters is usually better.
As far as I know, you need to calculate the height and the width of the convolutional layer output separately. When calculating the width of the output, W will be the width of the image and F the width of the filter. When calculating the height, you use the height of the image and of the filter. When the image and filters are square, you can do a single calculation because both will give the same result.
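To make the arithmetic of both examples concrete, here is a small sketch (convOutSize is just an illustrative helper, assuming square inputs and filters so width and height use the same formula):

convOutSize = @(W, F, P, S) (W - F + 2*P)/S + 1;
% Example 1: 227 x 227 x 3 input, F = 11, S = 4, P = 0, K = 96 filters
n1 = convOutSize(227, 11, 0, 4)^2 * 96   % 55 * 55 * 96 = 290400 neurons
% Example 2: 11 x 11 x 4 input, F = 5, S = 2, P = 0, K = 4 filters
n2 = convOutSize(11, 5, 0, 2)^2 * 4      % 4 * 4 * 4 = 64 neurons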
I'm totally confused regarding PCA. I have a 4D image of size 90x60x12x350. That means that each voxel is a vector of size 350 (time series).
Now I divide the 3D image (90x60x12) into cubes. So let's say a cube contains n voxels, so I have n vectors of size 350. I want to reduce these n vectors to only one vector and then calculate the correlations between all vectors of all cubes.
So for a cube I can construct the matrix M where I just put each voxel after each other, i.e. M = [v1 v2 v3 ... vn] and each v is of size 350.
Now I can apply PCA in Matlab by using [coeff, score, latent, ~, explained] = pca(M); and taking the first component. And now my confusion begins.
Should I transpose the matrix M, i.e. PCA(M')?
Should I take the first column of coeff or of score?
This third question is now a bit unrelated. Let's assume we have a matrix A = rand(30,100) where the rows are the data points and the columns are the features. Now I want to reduce the dimensionality of the feature vectors while keeping all data points. How can I do this with PCA?

When I do [coeff, score, latent, ~, explained] = pca(M); then coeff is of dimension 100x29 and score is of size 30x29. I'm totally confused.
Yes, according to the pca help, "Rows of X correspond to observations and columns to variables."
score just tells you the representation of M in the principal component space; you want the first column of coeff. (As for the sizes in your third question: with only 30 observations, the centered data has rank at most 29, so pca returns at most 29 components, which is why coeff is 100x29 and score is 30x29.)
numberOfDimensions = 5;
coeff = pca(A);
reducedDimension = coeff(:,1:numberOfDimensions);
reducedData = A * reducedDimension;
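As a quick check of the shapes this produces for the A = rand(30,100) from your third question (and why you saw 29 columns):

size(coeff)              % 100 x 29 (at most 29 components from 30 centered observations)
size(reducedDimension)   % 100 x 5
size(reducedData)        % 30 x 5, i.e. all 30 data points kept with 5 features each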
I disagree with the answer above.
[coeff,score]=pca(A)
where A has rows as observations and column as features.
If A has 3 features and more than 3 observations (let's say 100), and you want the reduced "feature" matrix of 2 dimensions, say matrix B (the size of B is 100x2), what you should do is:
B = score(:,1:2);
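For what it's worth, here is a quick way to see how the two suggestions relate: pca centers the data by default, so score is just the centered data projected onto coeff. A sketch with made-up data:

A = rand(100, 3);
[coeff, score] = pca(A);
B1 = score(:, 1:2);                     % reduced data taken from score
B2 = (A - mean(A)) * coeff(:, 1:2);     % same thing computed from coeff
max(abs(B1(:) - B2(:)))                 % ~0, up to floating-point error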
I have read an image file into an array like this
A = imread(fileName);
and now I want to calculate the Shannon entropy. The Shannon entropy implementation found in MATLAB is a byte-level entropy analysis which considers a file to be composed of 256 byte levels.
wentropy(x,'shannon')
But I need to perform a bigram entropy analysis, which would need to view the file as consisting of 65536 levels. Could anyone suggest a good method of accomplishing this?
The entropy of a random variable X can be calculated using the following formula:
H(X) = -sum_x p(x) * log2(p(x))
where p(x) is Prob(X = x).
Given a set of n observations (x1, x2, ..., xn), you then compute P(X = x) over the range of all x values (in your case between 0 and 65535) and sum across all of them. The easiest way to do this is using hist:
byteLevel = 65536;
% count the observations (observations is your data as a vector, e.g. double(A(:)))
observationHist = hist(observations, byteLevel);
% convert to a probability
probXVal = observationHist ./ sum(observationHist);
% drop empty bins so that 0 * log2(0) does not turn the sum into NaN
probXVal = probXVal(probXVal > 0);
% compute the entropy
entropy = - sum( probXVal .* log2(probXVal) );
There are several implementations of this on the file exchange that are worth checking out.
Note: where are you getting that wentropy uses 256 byte levels? I don't see that anywhere in the docs. Remember that in MATLAB the pixels of a color image have 3 channels (R, G, B), with each channel requiring 8 bits (i.e. 256 levels).
Also, because each channel is bounded within [0, 256), you could create a mapping from P(R=r, G=g, B=b) to P(X=x) as follows:
imgData = double(A);   % cast the image from the question to double so the *256 scaling does not saturate uint8
data = imgData(:,:,1);
data = data + (imgData(:,:,2) * 256);
data = data + (imgData(:,:,3) * 256 * 256);
I believe you can then use data to calculate the total entropy of the image where each channel is independent.
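A sketch of how you might then compute the entropy of those mapped values (continuing from the snippet above; accumarray is used instead of a histogram with 256^3 bins so only the values that actually occur are counted):

[~, ~, idx] = unique(data(:));      % label every pixel by its distinct mapped value
counts = accumarray(idx, 1);        % occurrences of each distinct value
p = counts / numel(data);           % empirical probabilities
H = -sum(p .* log2(p))              % Shannon entropy in bits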
Convert the color image with 65536 levels to a gray image with 256 levels and then evaluate the entropy.