Understanding 3D convolution and when to use it? - neural-network

I am new to convolutional neural networks, and I am learning 3D convolution.
What I could understand is that 2D convolution gives us relationships between low level features in the X-Y dimension, while the 3D convolution helps detect low level features and relationships between them in all the 3 dimensions.
Consider a CNN employing 2D conv layers to recognize handwritten digits. If a digit, say 5, were written in different colors:
Would a strictly 2D CNN perform poorly (since the colors belong to different channels in the z dimension)?
Also, are there practical well-known neural nets that employ 3D convolution?

The problem is that the 2D aspects of an image have locality. In a sense, things that are nearby are expected to be related in some fundamental way. E.g. a pixel near a hair pixel is expected to be a hair pixel, a priori. However, the different channels have no such relationship. When you only have 3 channels, a 3D convolution is equivalent to being fully connected in z. When you have 27 channels (e.g. in the middle of the net), why would any 3 channels be considered "close" to each other?
This answer explains the difference nicely.
Doing a "fully-connected" relationship over the channels is what most libraries do by default. Note this line in particular: "...a filter / kernel tensor of shape [filter_height, filter_width, in_channels, out_channels]". For an input vector of size in_channels, a matrix of size [in_channels, out_channels] is fully-connected. So, the filter can be thought of as a fully-connected layer on a "patch" of image size [filter_height, filter_width].
To illustrate, on a single channel, a regular plain old image filter takes a patch of the image and maps that patch to a single pixel in a new image.
On the other hand, suppose that we have multiple channels. Instead of performing a linear mapping from a 3x3 patch to a 1x1 pixel, we perform a linear mapping from a 3x3xin_channels patch to a 1x1xout_channels set of pixels. How do we do this? Well, a linear mapping is just a matrix. Note that a 3x3xin_channels patch can be written as a vector with 3*3*in_channels entries, and a 1x1xout_channels set of pixels can be written as a vector with out_channels entries. A linear mapping between the two is given by a matrix with 3*3*in_channels rows and out_channels columns. The entries of that matrix are the parameters of that layer of the network. The layer works by simply multiplying the input vector by the matrix of weights to get the output vector, and this is repeated over all patches of the image. (In practice, libraries don't loop over patches explicitly; they use an equivalent vectorized implementation, but the result is the same.)
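To make the bookkeeping concrete, here is a minimal NumPy sketch of this per-patch matrix multiplication. The array names and sizes are illustrative placeholders, not from any particular library or network:

    import numpy as np

    # Illustrative sizes (made up for this sketch)
    in_channels, out_channels = 3, 8
    H, W = 32, 32

    image = np.random.rand(H, W, in_channels)                     # input feature map
    weights = np.random.rand(3 * 3 * in_channels, out_channels)   # the layer's parameters

    out = np.zeros((H - 2, W - 2, out_channels))                  # "valid" convolution, stride 1
    for i in range(H - 2):
        for j in range(W - 2):
            patch = image[i:i + 3, j:j + 3, :]                    # 3x3xin_channels patch
            out[i, j, :] = patch.reshape(-1) @ weights            # fully connected over the patch,
                                                                  # including ALL input channels

Every output channel at (i, j) is a weighted sum over all nine spatial positions and all input channels of the patch, which is exactly the "fully connected in z" behaviour described above.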
To illustrate: the mapping takes a 3x3xin_channels column of the input and maps it to a 1x1xout_channels stack of output pixels.
What you are proposing, by contrast, is to convolve along the channel axis as well, i.e. to work with a 3x3x3 patch that spans only 3 of the in_channels at a time.
There is no mathematical reason why you can't do something with that 3x3x3 patch containing only 3 channels of your whole set of in_channels. However, whatever 3 channels you choose is totally arbitrary, and they have no intrinsic relationship to one another that would suggest that treating them as being "nearby" would help.
To reiterate, in an image, the pixels that are near each other are expected to be "similar" or "related" in some sense. This is why a convolution works at all. If you jumbled up the pixels and then did a convolution, it would be worthless. On that note, all of the channels are just a jumble. There is no "nearby relatedness" property along the channels. E.g. the "red" channel isn't near the "green" channel OR the "blue" channel, because "nearness" doesn't make any sense between the channels. Since "nearness" isn't a property of the channel dimension, then doing a convolution in that dimension probably isn't going to be useful.
On the other hand, we can simply take the input of ALL of the in_channels to generate the output from ALL of the out_channels simultaneously, and let them influence each other in a linear sort of way. Note that the linear transformation described involves a sort of cross-pollination of the parameters. For example, for a layer at the top of the network, taking in a 3x3 patch of r,g,b channels labeled r_1_1-r_3_3 etc., a single pixel in a single channel of the output from that patch would look like:
A*r_1_1 + B*r_1_2 + ... + C*r_3_3 + D*b_1_1 + E*b_1_2 + ... + F*b_3_3 + G*g_1_1 + ...
Where the capital letters are entries of the weight matrix.
So your question, "Would a strictly 2D CNN perform poorly?", is based on the assumption that the convolutional layer doesn't include any "mixing" between the various channels. This is not the case. The in_channels are ALL combined in a linear mapping to obtain the out_channels.

Related

Dimensions of inputs to a fully connected layer from convolutional layer in a CNN

The question is about the mathematical details of convolutional neural networks. Assume that the architecture of the net (whose objective is image classification) is as follows:
Input image: 32x32
First hidden layer: 3x28x28 (formed by convolving with 3 filters of size 5x5, stride length = 1 and no padding), followed by activation
Pooling layer (pooling over a 2x2 region), producing an output of 3x14x14
Second hidden layer: 6x10x10 (formed by convolving with 6 filters of size 5x5, stride length = 1 and no padding), followed by activation
Pooling layer (pooling over a 2x2 region), producing an output of 6x5x5
Fully connected layer (FCN-1) with 100 neurons
Fully connected layer (FCN-2) with 10 neurons
From my readings thus far, I have understood that each of the six 5x5 matrices is connected to FCN-1. I have two questions, both of which are related to the way the output of one layer is fed to the next.
The output of the second pooling layer is 6x5x5. How is this fed to FCN-1? What I mean is that each neuron in FCN-1 can be seen as a node that takes a scalar as input (or a 1x1 matrix). So how do we feed it an input of 6x5x5? I initially thought we’d flatten out the 6x5x5 matrices into a 150x1 array and then feed it to the neuron as if we had 150 training points. But doesn’t flattening out the feature maps defeat the argument for the spatial structure of images?
From the first pooling layer we get 3 feature maps of size 14x14. How are the feature maps in the second layer generated? Let's say I look at the same region (a 5x5 area starting from the top left of the feature maps) across the 3 feature maps I get from the first convolutional layer. Are these three 5x5 patches used as separate training examples to produce the corresponding region in the next set of feature maps? If so, then what if the three feature maps are instead the RGB values of an input image? Would we still use them as separate training examples?
Generally, some CNNs (like VGG16 and VGG19) flatten the 3D tensor output from the max-pooling layer, so in your example the input to the FC layer would become (None, 150). Other CNNs (like ResNet50) instead apply a global pooling operation (ResNet50 uses global average pooling) to reduce the output to 6x1x1, which is then flattened to (None, 6) and fed into the FC layers.
This link has an image of a popular CNN architecture called VGG19.
To answer your query about flattening defeating the spatial arrangement: when you flatten the image, say a pixel is at location X_ij (ith row, jth column), which maps to flat index n*i + j, where n is the width of the image. Its upper neighbour is X_(i-1)j (flat index n*(i-1) + j), and similarly for the other neighbours. Since pixels are correlated with their neighbouring pixels, the FC layer can still adjust its weights to reflect that information.
Hence you can consider the conv -> activation -> pooling group of layers as feature-extraction layers, whose output tensors (analogous to the dimensions/features of a vector) are fed into a standard ANN at the end of the network.
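As a concrete illustration of the two options, here is a minimal NumPy sketch. The numbers mirror the architecture in the question, but the random values and weight matrices are placeholders, not a real trained network:

    import numpy as np

    # Output of the second pooling layer: 6 feature maps of size 5x5
    pooled = np.random.rand(6, 5, 5)

    # Option 1 (VGG-style): flatten everything into one long vector
    flat = pooled.reshape(-1)            # shape (150,)
    W_fc1 = np.random.rand(100, 150)     # weights of FCN-1 (100 neurons)
    fc1 = W_fc1 @ flat                   # shape (100,) -- one value per neuron

    # Option 2 (ResNet-style): global pooling, one value per feature map
    gap = pooled.mean(axis=(1, 2))       # shape (6,), global average pooling
    W_fc1_small = np.random.rand(100, 6)
    fc1_small = W_fc1_small @ gap        # shape (100,)

Either way, each neuron of FCN-1 receives a whole vector (not a single scalar) and computes one weighted sum of it, so the 150 values are 150 inputs to the same neuron, not 150 training points.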

For what reason Convolution 1x1 is used in deep neural networks?

I'm looking at the InceptionV3 (GoogLeNet) architecture and cannot understand why we need conv 1x1 layers.
I know how convolution works, but I only see the benefit when the patch size is greater than 1.
You can think of a 1x1xD convolution as a dimensionality reduction technique when it's placed somewhere inside a network.
If you have an input volume of 100x100x512 and you convolve it with a set of D filters, each of size 1x1x512, you reduce the number of features from 512 to D.
The output volume is, therefore, 100x100xD.
As you can see, this (1x1x512)xD convolution is mathematically equivalent to a fully connected layer. The main difference is that while an FC layer requires the input to have a fixed size, the convolutional layer can accept as input any volume with spatial extent greater than or equal to 100x100.
A 1x1xD convolution can substitute for any fully connected layer because of this equivalence.
In addition, 1x1xD convolutions not only reduce the number of features passed to the next layer, but also introduce new parameters and a new non-linearity into the network, which can help increase model accuracy.
When the 1x1xD convolution is placed at the end of a classification network, it acts exactly like an FC layer, but instead of thinking about it as a dimensionality reduction technique, it's more intuitive to think about it as a layer that outputs a tensor of shape WxHxnum_classes.
The spatial extent of the output tensor (given by W and H) is dynamic and is determined by the locations of the input image that the network analyzed.
If the network has been defined with an input of 200x200x3 and we feed it an image of that size, the output will be a map with W = H = 1 and depth = num_classes.
But if the input image has a spatial extent greater than 200x200, then the convolutional network will analyze different locations of the input image (just like a standard convolution does) and will produce a tensor with W > 1 and H > 1.
This is not possible with an FC layer, which constrains the network to accept fixed-size input and produce fixed-size output.
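A minimal NumPy sketch of this equivalence, using the 100x100x512 example above (D and the random values are just placeholders):

    import numpy as np

    D = 64                                   # number of 1x1 filters (arbitrary choice)
    x = np.random.rand(100, 100, 512)        # input volume HxWxC
    w = np.random.rand(512, D)               # the D filters, each of size 1x1x512

    # A 1x1 convolution is the same 512 -> D linear map applied at every pixel
    y = x.reshape(-1, 512) @ w               # (100*100, D)
    y = y.reshape(100, 100, D)               # output volume 100x100xD

Because the same 512xD matrix is applied at every spatial location, the operation works unchanged on, say, a 300x200x512 input as well, which is what lets a fully convolutional network produce a WxHxnum_classes map instead of a fixed-size output.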
A 1x1 convolution simply maps an input pixel (with all its channels) to an output pixel, without looking at anything around it. It is often used to reduce the number of depth channels, since multiplying volumes with extremely large depths is often very slow.
input (256 depth) -> 1x1 convolution (64 depth) -> 4x4 convolution (256 depth)
input (256 depth) -> 4x4 convolution (256 depth)
The bottom one is about ~3.7x slower.
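To see where the ~3.7x figure comes from, here is a back-of-the-envelope count of the multiplications per output location (biases and activation costs ignored):

    # Bottleneck: 1x1 conv (256 -> 64), then 4x4 conv (64 -> 256)
    bottleneck = 1 * 1 * 256 * 64 + 4 * 4 * 64 * 256   # 16,384 + 262,144 = 278,528

    # Direct: 4x4 conv (256 -> 256)
    direct = 4 * 4 * 256 * 256                          # 1,048,576

    print(direct / bottleneck)                          # ~3.76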
Theoretically the neural network can 'choose' which input 'colors' to look at using this, instead of brute force multiplying everything.

Coordinate normalization for NN input in MATLAB

I am trying to implement a classification NN in MATLAB.
My inputs are clusters of coordinates from an image (corresponding to Delaunay triangulation vertices).
There are 3 clusters (results of the OPTICS algorithm), each given as a matrix of coordinate pairs (not all clusters are of the same size). Elements represent coordinates in Euclidean 2D space, so (110,12) is a point in my image, and each such matrix represents one cluster of points.
Clustering was done on image edges, so the coordinates refer to logical values (always 1s in this case) in the image matrix. (After edge detection there are 3 "dense" areas in an image, and these collections of pixels are used for classification.) There are 6 target classes.
So, my question is how can I format them into single column vector inputs to use in a neural network?
(There is a relevant answer here, but I would like some elaboration if possible; after 12 hours of trying things I don't fully get it yet.)
Remember, there are 3 different coordinate matrices for each picture, so my initial thought was to create an NN with 3 inputs (of different lengths). But how do I serialize this?
Here's a cluster with its tags on, in case it helps (figure not shown).
For you to train the classifier, you need a matrix X where each row corresponds to an image. If you want to use a coordinate representation, this means all images will have to be of the same size, say M by N. So the row for an image will have M times N elements (features), and the corresponding feature values will be the cluster assignments. The class vector y will be whatever labels you have, that is, one of the six different classes you mentioned in the comments above. Keep in mind that if you use a coordinate representation, X can get very high-dimensional, and unless you have a large number of images, chances are your classifier will perform very poorly. If you have few images, consider using the fractions of pixels belonging to each cluster, as I suggested in one of the comments: this gives you a shorter feature description that is invariant to rotation and translation, and may yield better classification.
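Here is a minimal NumPy sketch of the two representations described above (the asker works in MATLAB, but the logic is the same; the image size M x N and the example cluster arrays are made up for illustration):

    import numpy as np

    M, N = 120, 160                 # assumed common image size
    # One image: three clusters, each an (n_i x 2) array of (row, col) coordinates
    clusters = [np.array([[110, 12], [111, 13]]),
                np.array([[40, 80], [41, 81], [42, 82]]),
                np.array([[10, 100]])]

    # Option 1: coordinate (label-map) representation -> one row of length M*N
    label_map = np.zeros((M, N))
    for k, pts in enumerate(clusters, start=1):
        label_map[pts[:, 0], pts[:, 1]] = k      # pixel value = cluster id
    row_coords = label_map.reshape(-1)           # M*N-element feature vector for this image

    # Option 2: fraction of edge pixels in each cluster -> a 3-element feature vector
    sizes = np.array([len(pts) for pts in clusters], dtype=float)
    row_fracs = sizes / sizes.sum()

    # Stack one such row per image to build X; y holds the class label (1..6) per image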

How to plot the U-Matrix, Sample Hits and Input Planes from data trained by a SOM

I have written a simple SOM algorithm in MATLAB. My big challenge is: how can I visualize/plot my data in the form of a U-Matrix, Sample Hits, and Component/Input Planes? These three plots exist in the SOM toolbox in MATLAB, but I cannot call them on the output of my own code, because they need a 'net' object as input and my code does not create any 'net'.
Is there any guidance?
You can create your own functions, as they are not too complicated. I will assume a SOM of 20x20x4 (400 nodes, 4 features) for the explanation.
The Hit-Map is no more than presenting each sample to the already-learned SOM and adding +1 to the count of the node that was chosen as the Best Matching Unit (BMU). Then you plot this map. So if node(1,1) fires 10 times and node(1,2) fires 100 times, you will have an image where node(1,2) has a higher intensity than node(1,1).
The U-Matrix is a map representing the distance between each node's weight vector and those of its closest neighbours (summed, or averaged, over the neighbours). So here you calculate the Euclidean distance between the feature vector of node X and that of every neighbour. For example, if node(1,1,:) = [1,1,2,3], node(1,2,:) = [2,2,1,1], and node(2,1,:) = [1,1,1,1], then the value of the U-Matrix for node (1,1) could be U(1,1) = norm(squeeze(node(1,1,:) - node(1,2,:))) + norm(squeeze(node(1,1,:) - node(2,1,:))) = 4.8818.
The Component/Input Planes are the simplest and do not require any processing: you just pick each feature of the SOM map and plot it. In our example of a 20x20x4 SOM you have 4 features and therefore 4 component planes, which you can plot with imagesc(node(:,:,1)) for feature 1, and so on.
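Here is a small NumPy sketch of the hit-map and U-Matrix computations described above (the asker's code is in MATLAB, but the logic carries over line for line; node and data are random placeholders, and a 4-connected neighbourhood is assumed):

    import numpy as np

    rows, cols, feats = 20, 20, 4
    node = np.random.rand(rows, cols, feats)      # learned SOM weights (placeholder)
    data = np.random.rand(1000, feats)            # samples to map onto the SOM (placeholder)

    # Hit map: +1 on the BMU of every sample
    hits = np.zeros((rows, cols))
    for x in data:
        d = np.linalg.norm(node - x, axis=2)      # distance of x to every node
        i, j = np.unravel_index(np.argmin(d), d.shape)
        hits[i, j] += 1

    # U-Matrix: summed distance of each node to its 4-connected neighbours
    umat = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                ni, nj = i + di, j + dj
                if 0 <= ni < rows and 0 <= nj < cols:
                    umat[i, j] += np.linalg.norm(node[i, j] - node[ni, nj])

    # Component planes: just plot node[:, :, k] for each feature k (imagesc in MATLAB)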

Making feature vector from Gabor filters for classification

My aim is to classify types of cars (sedans, SUVs, hatchbacks). Earlier I was using corner features for classification, but that didn't work out very well, so now I am trying Gabor features.
code from here
Now the features are extracted: when I give an image as input, then for 5 scales and 8 orientations I get two [1x40] matrices:
1. 40 columns of squared Energy.
2. 40 columns of mean Amplitude.
The problem is that I want to use these two matrices for classification, and I have about 230 images of 3 classes (SUV, sedan, hatchback).
I do not know how to create an [N x 230] matrix that can be used as vInputs by the neural network in MATLAB (where N is the total number of features for one image).
My question:
How do I create a one-dimensional feature vector from the two [1x40] matrices for one image? (Should I append the mean Amplitude to the squared Energy matrix to get a [1x80] matrix, or something else?)
Should I be using these Gabor features for my classification purpose in the first place? If not, then what?
Thanks in advance
In general, there is nothing to think about - a simple neural network requires a one-dimensional feature vector and does not care about the ordering, so you can simply concatenate any number of feature vectors into one (even in random order - it does not matter). In particular, if you have some feature matrices, you also concatenate their rows to create a vectorized format.
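As a minimal sketch of that concatenation for the setup in the question (NumPy here, though the asker works in MATLAB; the feature values below are random placeholders standing in for the real Gabor outputs):

    import numpy as np

    n_images = 230
    X = np.zeros((80, n_images))            # 80 = 40 energies + 40 amplitudes, one column per image
    for k in range(n_images):
        energy = np.random.rand(40)         # squared Energy features of image k (placeholder)
        amplitude = np.random.rand(40)      # mean Amplitude features of image k (placeholder)
        X[:, k] = np.concatenate([energy, amplitude])   # one 80-element vector per image

    # X is the [N x 230] input matrix (N = 80), with one column per image,
    # plus a separate 230-element label vector for the 3 classes.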
The only exception is when your data actually has some underlying geometrical dependencies, for example when the matrix is actually a matrix of pixels. In such a case, architectures like PyraNet, Convolutional Neural Networks, and others, which apply some kind of receptive fields based on this 2D structure, should do better. Those implementations simply accept the 2D feature matrix as input.