Introduction
From what I understood from CS231n Convolutional Neural Networks for Visual Recognition, the size of the output volume represents the number of neurons, given the following parameters:
Input volume size (W)
The receptive field size of the Conv Layer neurons (F), which is the size of the kernel or filter
Stride with which they are applied (S), i.e. the step we use to move the kernel
Amount of zero padding used (P) on the border
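For reference, the notes compute the output spatial size along each dimension as:
output size = (W - F + 2*P)/S + 1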
I posted two examples. In example 1 I have no problem at all, but it's example 2 that confuses me.
Example 1
In the Real-world example section they start with a [227 x 227 x 3] input image. The parameters are the following: F = 11, S = 4, P = 0, W = 227.
We note that the convolution has a depth of K = 96. (Why?)
The size of the output volume is (227 - 11)/4 + 1 = 55. So we will have 55 x 55 x 96 = 290,400 neurons, each pointing (excuse me if I butchered the term) to an [11 x 11 x 3] region in the image, which is in fact the region where we compute the dot product with the kernel.
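As a quick sanity check on these numbers, here is a minimal Python sketch (mine, not from the notes):
W, F, P, S, K = 227, 11, 0, 4, 96
out = (W - F + 2 * P) // S + 1   # spatial size of the output volume
print(out, out * out * K)        # 55 290400, i.e. 290,400 neurons in the layer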
Example 2
In the following example, taken from the Numpy examples section, we have an input volume with shape [11 x 11 x 4]. The parameters used to compute the size of the output volume are the following: W = 11, P = 0, S = 2 and F = 5.
We note that the convolution has a depth of K = 4.
The formula (11 - 5)/2 + 1 = 4 produces only 4 neurons. Each neuron points to a region of size [5 x 5 x 4] in the image.
It seems that they are moving the kernel in the x direction only. Shouldn't we have 12 neurons, each having [5 x 5 x 4] weights?
V[0,0,0] = np.sum(X[:5,:5,:] * W0) + b0
V[1,0,0] = np.sum(X[2:7,:5,:] * W0) + b0
V[2,0,0] = np.sum(X[4:9,:5,:] * W0) + b0
V[3,0,0] = np.sum(X[6:11,:5,:] * W0) + b0
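To see where the remaining neurons come from, here is a minimal runnable numpy sketch of the same computation; X, W0 and b0 are random placeholders, and the lines above only spell out the first column of V:
import numpy as np

X = np.random.randn(11, 11, 4)   # input volume, as in the Numpy examples section
W0 = np.random.randn(5, 5, 4)    # one 5x5x4 filter
b0 = 0.0
V = np.zeros((4, 4))             # one depth slice of the output volume

for i in range(4):               # 4 positions along the first spatial axis, stride 2
    for j in range(4):           # 4 positions along the second spatial axis, stride 2
        V[i, j] = np.sum(X[2*i:2*i+5, 2*j:2*j+5, :] * W0) + b0

print(V.shape)                   # (4, 4): 16 neurons per filter, not 4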
Questions
I really don't understand why only 4 neurons are used and not 12.
Why did they pick K = 96 in example 1?
Is the W parameter always equal to the width of the input image?
Example 1
Why is it that the convolution has a depth of K = 96?
The depth (K) is equal to the number of filters used in the convolutional layer. A bigger number usually gives better results; the problem is slower training. Complex images require more filters. I usually start tests with 32 filters on the first layer and 64 on the second layer.
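For example, in a framework such as PyTorch (not used in the question; shown here only to illustrate that K is the filter count), the conv layer of example 1 would be declared as:
import torch.nn as nn

# 96 filters (K) of size 11x11 applied to a 3-channel input, stride 4, no padding
conv1 = nn.Conv2d(in_channels=3, out_channels=96, kernel_size=11, stride=4, padding=0)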
Example 2
The formula (11 - 5)/2 + 1 = 4 produces only 4 neurons.
I'm no expert, but I think this is false. The formula only defines the output size (height and width). A convolutional layer has a size (height and width) and a depth. The size is defined by this formula, the depth by the number of filters used. The total number of neurons is:
height * width * depth = 4 * 4 * 4 = 64
Questions
The layer has 64 neurons, 16 for each depth slice.
A bigger number of filters is usually better.
As far as I know, you need to calculate the height and width of the convolutional layer separately. When calculating the width of the output, W will be the width of the image and F will be the width of the filters used. When calculating the height, you will use the height of the image and of the filters. When the image and filters are square, you can do a single calculation because both will give the same result.
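A small Python sketch of that per-dimension calculation (the function name is mine):
def conv_output_size(w, h, f_w, f_h, s, p):
    out_w = (w - f_w + 2 * p) // s + 1   # width uses the image and filter widths
    out_h = (h - f_h + 2 * p) // s + 1   # height uses the image and filter heights
    return out_w, out_h

print(conv_output_size(227, 227, 11, 11, 4, 0))  # (55, 55), example 1
print(conv_output_size(11, 11, 5, 5, 2, 0))      # (4, 4), example 2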
Related
The paper "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" by Devlin & Co. calculated for the base model size 110M parameters (i.e. L=12, H=768, A=12) where L = number of layers, H = hidden size and A = number of self-attention operations. As far as I know parameters in a neural network are usually the count of "weights and biases" between the layers. So how is this calculated based on the given information? 12768768*12?
Transformer Encoder-Decoder Architecture
The BERT model contains only the encoder block of the transformer architecture. Let's look at the individual elements of an encoder block for BERT to visualize the number of weight matrices as well as the bias vectors. The given configuration L = 12 means there will be 12 layers of self-attention, H = 768 means that the embedding dimension of individual tokens will be 768, and A = 12 means there will be 12 attention heads in one layer of self-attention. The encoder block performs the following sequence of operations:
The input will be the sequence of tokens as a matrix of s * d dimensions, where s is the sequence length and d is the embedding dimension. The resulting input sequence will be the sum of the token embeddings, token type embeddings and position embeddings, giving a d-dimensional vector for each token. In the BERT model, the first set of parameters is the vocabulary embeddings. BERT uses WordPiece[2] embeddings with 30522 tokens, each of 768 dimensions.
Embedding layer normalization. One weight matrix and one bias vector.
Multi-head self attention. There will be h number of heads, and for each head there will be three matrices which will correspond to query matrix, key matrix and the value matrix. The first dimension of these matrices will be the embedding dimension and the second dimension will be the embedding dimension divided by the number of attention heads. Apart from this, there will be one more matrix to transform the concatenated values generated by attention heads to the final token representation.
Residual connection and layer normalization. One weight matrix and one bias vector.
Position-wise feedforward network will have one hidden layer, that will correspond to two weight matrices and two bias vectors. In the paper, it is mentioned that the number of units in the hidden layer will be four times the embedding dimension.
Residual connection and layer normalization. One weight matrix and one bias vector.
Let's calculate the actual number of parameters by associating the right dimensions to the weight matrices and bias vectors for the BERT base model.
Embedding Matrices:
Word Embedding Matrix size [Vocabulary size, embedding dimension] = [30522, 768] = 23440896
Position embedding matrix size, [Maximum sequence length, embedding dimension] = [512, 768] = 393216
Token Type Embedding matrix size [2, 768] = 1536
Embedding Layer Normalization, weight and Bias [768] + [768] = 1536
Total Embedding parameters = 23837184 ≈ 24M
Attention Head:
Query Weight Matrix size [768, 64] = 49152 and Bias [768] = 768
Key Weight Matrix size [768, 64] = 49152 and Bias [768] = 768
Value Weight Matrix size [768, 64] = 49152 and Bias [768] = 768
Total parameters for one layer of attention with 12 heads = 12 * (3 * (49152 + 768)) = 1797120
Dense weight for projection after concatenation of heads [768, 768] = 589824 and Bias [768] = 768, (589824+768 = 590592)
Layer Normalization weight and Bias [768], [768] = 1536
Position-wise feedforward network weight matrices and biases [768, 3072] = 2359296, [3072] = 3072 and [3072, 768] = 2359296, [768] = 768, (2359296 + 3072 + 2359296 + 768 = 4722432)
Layer Normalization weight and Bias [768], [768] = 1536
Total parameters for one complete attention layer (1797120 + 590592 + 1536 + 4722432 + 1536 = 7113216 ≈ 7M)
Total parameters for 12 layers of attention (12 * 7113216 = 85358592 ≈ 85M)
Output layer of BERT Encoder:
Dense Weight Matrix and Bias [768, 768] = 589824, [768] = 768, (589824 + 768 = 590592)
Total Parameters in BERT Base = 23837184 + 85358592 + 590592 = 109786368 ≈ 110M
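A short Python sketch that reproduces the arithmetic above (the shapes are the ones listed in this answer, not read from a checkpoint):
H, L, A = 768, 12, 12                                   # hidden size, layers, attention heads
vocab, max_len, ffn = 30522, 512, 4 * H

embeddings  = vocab * H + max_len * H + 2 * H + 2 * H   # word + position + token type + LayerNorm
attention   = A * 3 * (H * (H // A) + H)                # Q, K, V weights and biases, as counted above
projection  = H * H + H                                 # dense layer after concatenating the heads
feedforward = H * ffn + ffn + ffn * H + H               # position-wise feedforward network
layer_norms = 2 * (H + H)                               # two layer normalizations per encoder layer
per_layer   = attention + projection + feedforward + layer_norms

output_dense = H * H + H                                # output layer of the BERT encoder
total = embeddings + L * per_layer + output_dense
print(embeddings, per_layer, total)                     # 23837184 7113216 109786368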
I want to perform a Census transform in MATLAB at the center pixels of each filter's window as shown below:
If the image does not appear, here is an alternative link: https://i.ibb.co/9Y6LfSL/Shared-Screenshot.jpg
My initial attempt at the code is:
function output = census(img, census_size)
img_gray = rgb2gray(img);
[y, x] = size(img_gray);
borders = floor(census_size/2); % limit to exclude image borders when filtering
img_out = zeros(y - 2*borders, x - 2*borders); % preallocate the output
for iy = 1+borders : y-borders
    for ix = 1+borders : x-borders
        f = img_gray(iy-borders:iy+borders, ix-borders:ix+borders); % current filter window
        iix = ix - borders;
        iiy = iy - borders;
        % shift = bitsll(img_out(iiy,iix),1);
        img_out(iiy,iix) = 0; % Must be implemented with census
    end
end
normalised_image = img_out ./ max(max(img_out));
output = img_out;
imshow(normalised_image);
end
iix and iiy in the inner for loop represent my current location for the center pixel. f is my current filter window.
In addition to the comparison operation with the window's other pixels, I need to set each comparison result to logical 1/0, extend the total result (by shifting, I guess) to 8-bit data, and then convert this binary number to a decimal number. How can I do this in a practical way in MATLAB?
I have checked this in Python: https://stackoverflow.com/a/38269363/12173333
But I could not reproduce it in MATLAB.
If you have the Image Processing Toolbox you can use blockproc:
%Load your image
I = imread('https://i.stack.imgur.com/9oxaQ.png');
%Creation of the census transform function
fun = @(B) [128 64 32 16 0 8 4 2 1]*(B.data(:)>B.data(2,2));
%Process the image, block-by-block with overlap, force the result to be of type uint8
I2 = uint8(blockproc(I.',[1 1],fun,'BorderSize',[1,1],'TrimBorder',0)).'
Here blockproc is configured for a 3x3 window (with overlap) and works on a grayscale image. The function fun checks which pixels of the block are strictly bigger than the center of the block. We obtain a 9x1 logical vector. Then I multiply this vector by [128 64 32 16 0 8 4 2 1] (binary to decimal transformation).
Update:
Optimisation with linear algebra
For an arbitrary window size you can use:
I = imread('https://i.stack.imgur.com/9oxaQ.png');
w = 5; % window size, any odd number between 3 and 31.
b2d = 2.^[w^2-1:-1:ceil(w^2/2),0,floor(w^2/2):-1:1] % binary to decimal vector
cen = ceil(w/2); % center position
%Creation of the census transform function
fun = @(B) b2d*(B.data(:)>B.data(cen,cen));
%Process the image, block-by-block with overlap
I2 = blockproc(I.',[1 1],fun,'BorderSize',[cen-1,cen-1],'TrimBorder',0).'/sum(b2d)
(Example input and output images omitted.)
I have a set of 25 images of label 'Infected' and 25 images of label 'Normal'.
I am trying to extract the dual-tree complex wavelet transform based coefficients as features for each of the images.
My code to obtain the coefficients using DT-CWT is as follows:
I = imread('infected_img1.jpg'); %read image
I = rgb2gray(I); %rgb to gray-scale
L = 6; %no. of levels for wavelet decomposition
I = reshape(I',1,size(I,1)*size(I,2)); %change into a vector
I = [I,zeros(1,2^L - rem(length(I),2^L))]; %pad zeros to make dim(I) a multiple of 2^L
I = double(I);
dt = dddtree('cplxdt',I,L,'dtf3'); %perform DT-CWT
dt_Coeffs = (dt.cfs{L}(:,:,1) + 1i*dt.cfs{L}(:,:,2)); %extract coefficents at Level 6
Now, since I have 24 more images to extract coefficients from, I repeat this block for each of the images. My ultimate aim is to append all the coefficient vectors generated in each iteration to form a matrix, but each image tends to give a different-sized coefficient vector.
I want to know about some dimension reduction method that can reduce each vector to a uniform size and at the same time preserve its information.
Can anyone suggest methods with a good amount of clarity?
As I mentioned in my comment,
You can't shrink something (i.e. remove information) and still preserve all of the information.
Instead you can pad all of the vectors to the length of the largest vector and then concatenate them to create a matrix. You could program your own method but in the spirit of not reinventing the wheel I've previously used padcat(). It is very easy to use and pads with NaN but you can easily change this to 0.
Here's an example usage which also contains a handy conversion from NaN to 0.
>> a = [1 2 3 4];
>> b = [1 2 3];
>> c = padcat(a, b);
c =
1 2 3 4
1 2 3 NaN
>> c(isnan(c)) = 0
c =
1 2 3 4
1 2 3 0
I need to find the pixel values that lie between 2 intersecting lines. The following image shows the points that I want, namely the brown region.
These 4 co-ordinates can change and are not necessarily the corner points.
What is the fastest way to get the pixel values? Is there any function that can give me the necessary mask?
You should calculate, for each point, whether it is above or below the line. If the line is given in its equation form Ax + By + C = 0, then it is as simple as calculating the sign of Ax + By + C for your point (x, y). If your lines are given in any other form, you should first convert them to this form.
Let L1 be the set of all points below the first line, and L2 the set of all points below the second line. Then your set is X = Xor(L1, L2).
(Illustration: the two half-plane masks, XORed, give the desired region.)
Here is MATLAB code that solves your problem for the corner points, based on the solution I've described. You can adjust the line equations in your code.
function CreateMask()
rows = 100;
cols = 200;
[X,Y] = ndgrid(1:cols,1:rows);
belowFirstLine = X*(1/cols) + Y*(-1/rows) + 0 < 0;
belowSecondLine = X*(-1/cols) + Y*(-1/rows) + 1 < 0;
figure;imshow( transpose(xor(belowSecondLine,belowFirstLine)));
end
Here is a geometrical, rather than analytical, solution.
First, you need to construct a mask image, initially filled with zeros. Then you should draw both lines using Bresenham's algorithm. There is no built-in implementation in MATLAB, but you can pick one from MATLAB Central. I assume you have the coordinates of the intersections of the lines with the image borders.
After that, your image is divided into four areas, and you need to flood-fill two of them using bwfill. Now you have the mask.
You can start by generating two matrices with x and y coordinates, sized as the region:
1 2 3 4 5        1 1 1 1 1
1 2 3 4 5   vs.  2 2 2 2 2
1 2 3 4 5        3 3 3 3 3
Then one needs 4 line equations, each converting x*a + y*b < c into a mask:
the diagonal masks have to be XORed and the top/bottom masks ANDed,
or, without logical operators: mask = mod(diag1 + diag2, 2) .* top_mask .* bot_mask; (a numpy version is sketched below)
The line width can be controlled by adding to 'c' half of the line width, assuming that a and b are normalized.
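For illustration, here is the same mask logic in numpy (reusing the diagonal line equations from the earlier CreateMask example; the top and bottom lines are placeholders to replace with your own):
import numpy as np

rows, cols = 100, 200
y, x = np.mgrid[0:rows, 0:cols]                 # y and x coordinate matrices, sized as the region

diag1 = x * (1/cols) - y * (1/rows) < 0         # below the first diagonal
diag2 = -x * (1/cols) - y * (1/rows) + 1 < 0    # below the second diagonal
top_mask = y > 10                               # placeholder top boundary
bot_mask = y < 90                               # placeholder bottom boundary

mask = np.logical_xor(diag1, diag2) & top_mask & bot_mask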
I have an image of size 480 (height) by 640 (width). I also have a matrix of size [1 x 280]. These 280 values are points which can be found in the image.
I would like to find out what are the points from the matrix that can be found in a particular section of the image. I did a nested for loop to specify the location where I want to scan to search for the points, but I have trouble doing the "scanning".
% matrixC = [435 560 424 132 453 ........ 596] for example of size 280
for height = 1:480
for width = 635:640
W = C(max);
end
end
This only displays W as the greatest value of C, but I need to show only the greatest value of C within the section between 1 and 480 for the height and 635 to 640 for the width. How do I write code to scan only the particular section I am interested in, and if there are, say, 10 numbers found within that section, how do I select them?
You can use ismember and direct indexing in the image matrix to get a binary matrix of "is" or "is-not" values.
imageC = randi(256, 480, 640); % random image
vectorW = randi(256, 1, 280); % random vector of points
imagePart = imageC(1:480, 635:640); % select section by indexing
imageMember = ismember(imagePart, vectorW); % check membership
Update (changing vectorW to C, adding handling for a 3-channel image, and returning the actual point values): you can apply your own image imageC and vector C to the following by replacing the first two lines.
imageC = randi(256, 480, 640, 3); % (random) image [480 x 640 x 3]
C = randi(256, 1, 280); % (random) vector of points [1 x 280]
imagePart = imageC(1:480, 635:640, :); % select section by indexing [480 x 6 x 3]
imageMember = ismember(imagePart, C); % check membership [480 x 6 x 3]
pointsInImage = unique(imagePart(imageMember)); % unique set of points from C found in imagePart