YOLOv3 convolutional layers count - neural-network

I am really confused about the number of convolutional layers in YOLOv3.
According to the paper, it uses Darknet-53, and the authors don't mention any further details or additions to that structure.
However, according to AlexeyAB's build, it is composed of 106 layers!
Moreover, the towardsdatascience website claims that the additional 53 layers are added for the detection process. What does that really mean? Are the first 53 layers only for feature extraction, then?
So my question is: what is the purpose of these extra 53 layers that are not mentioned in the paper? Where did they come from, and why?

According to AlexeyAB (creator of a very popular fork of Darknet), https://groups.google.com/forum/?nomobile=true#!topic/darknet/9WppEzRouMU (this link now appears to be dead):
YOLO has
75 convolutional layers + 31 other layers (shortcut, route, upsample, yolo) = 106 layers in total.
You can count the total number of convolutional layers in the cfg file; there are 75. Also remember that YOLOv3 does detection at 3 different scales, at layers 82, 94 and 106.
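To make that counting concrete, here is a minimal Python sketch (my own illustration, not from the post) that tallies the section headers in a Darknet cfg file such as yolov3.cfg from the AlexeyAB repository; the filename and parsing assumptions are mine.

```python
from collections import Counter

def count_layer_sections(cfg_path):
    """Count Darknet layer sections ([convolutional], [shortcut], [route], ...)."""
    counts = Counter()
    with open(cfg_path) as f:
        for line in f:
            line = line.strip()
            # Every layer in a Darknet cfg starts with a bracketed section header;
            # [net] is the global configuration block, not a layer.
            if line.startswith("[") and line.endswith("]") and line != "[net]":
                counts[line.strip("[]")] += 1
    return counts

# For yolov3.cfg this should report 75 "convolutional" sections, matching the
# count quoted above; the remaining sections are shortcut, route, upsample and yolo.
print(count_layer_sections("yolov3.cfg"))
```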

Darknet-53 is the name of the feature extractor developed by Joseph Redmon et al., and it does indeed constitute the first 53 layers of YOLOv3. The next 53 layers are dedicated to resizing, concatenating and upsampling feature maps to prepare them for detection at three different scales, at layers 82, 94 and 106 respectively. The first detection layer detects the largest objects, the second the medium-sized ones, and the last layer everything that remains (in theory, at least).
I think the idea behind this hierarchical structure is that the further one moves into YOLOv3, the more high-level the information it is able to extract.

Related

Classifying an object from an ordered sequence of spatially aligned images

This is about a project to count the number of tumour cells in a given 88 by 88 pixel frame. To say there is just one image is not actually accurate; there are 4 independent images of that frame i.e. the same 'situation' on the ground. All 4 of them have to be considered to count how many tumour cells there are (usually 0, sometimes 1, rarely 2).
Here are the 4 images of a sample frame concatenated together
These images are obtained by visualising the situation through different lenses (wavelengths).
I have read several blog articles on neural networks. However, they all assume a one-image-one-label relationship, rather than the many-images-one-label relationship I am working with.
Hence, I am looking for suggestions for possible tools or just the technical term related to this kind of problem so I can proceed further.
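One common way to arrange this kind of many-images-one-label data (a suggestion of mine, not something from the question) is to stack the aligned images as channels of a single input tensor, so that an ordinary one-input/one-label CNN can consume them. A minimal NumPy sketch with stand-in data:

```python
import numpy as np

# Four stand-ins for the aligned 88x88 images of one frame (one per wavelength).
frame_images = [np.random.rand(88, 88).astype(np.float32) for _ in range(4)]

# Stack them along a channel axis: one sample of shape (88, 88, 4).
x = np.stack(frame_images, axis=-1)

# The single label for the whole frame, e.g. the tumour-cell count.
y = 1

print(x.shape, y)  # (88, 88, 4) 1
```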

Where is the class-prediction layer located in FCN (Semantic Segmentation)? And how does it work for per-pixel prediction?

Recently, I read the FCN paper carefully and was left with one unanswered question.
Brief introduction: FCN makes a dense prediction, i.e., it predicts which class each pixel belongs to. In the PASCAL case, there are 21 classes (20 object classes + 1 background). For prediction, the older method uses an FC layer with 21 output nodes and picks the highest output as the class. This layer (the class-prediction layer) is located at the very end of the network; it outputs the class and also feeds the loss function. In the newer method, it can be replaced by a 1x1 convolution with 21 output channels.
My question is: after reading the paper (especially section 4.1), it seems that the class-prediction layer comes right after the coarse output (the fused feature maps) and is then followed by an upsampling layer (see pic no. 1 below). In that case, the class-prediction layer is not at the very end of the network. How can this predict the class of each pixel? Initially, I thought the class-prediction layer was at the very end of the network (see pic no. 2 below), so that each pixel would get its own 21-way 1x1-conv prediction there. But my view changed after reading section 4.1 of the paper, which I paste below. What do you think about this?
Best,
Ardian.
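To illustrate the structure the question describes, here is a minimal PyTorch sketch (my own, not the paper's code; the feature-map sizes and channel counts are assumptions): the 21-class scores come from a 1x1 convolution on the coarse feature map, the upsampling happens afterwards, and the per-pixel class is simply the argmax over the upsampled score channels.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

num_classes = 21                                   # 20 PASCAL classes + background
coarse_features = torch.randn(1, 512, 14, 14)      # assumed coarse (fused) feature map

# Class-prediction layer: a 1x1 convolution producing one score map per class.
score_layer = nn.Conv2d(512, num_classes, kernel_size=1)
coarse_scores = score_layer(coarse_features)       # (1, 21, 14, 14)

# Upsampling comes *after* the class scores, as described in section 4.1.
full_scores = F.interpolate(coarse_scores, size=(224, 224),
                            mode="bilinear", align_corners=False)

# Per-pixel prediction: the class with the highest upsampled score at each pixel.
per_pixel_class = full_scores.argmax(dim=1)        # (1, 224, 224)
print(per_pixel_class.shape)
```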

What kind of features are extracted with the AlexNet layers?

The question is about this method, which extracts features from the FC7 layer of AlexNet.
What kind of features is it actually extracting?
I used this method on images of paintings done by two artists. The training set is about 150 training images from each artist (so that the features are stored in a 300x4096 matrix); the validation set is 40 images. This works really well, 85-90% correct classification. I would like to know why it works so well.
WHAT FEATURES ?
FC8 is the classification layer; FC7 is the fully-connected layer just before it, operating on the flattened outputs of the earlier convolutional layers. Its activations represent the abstract, top-level features that training has "discovered". To examine these features, try one of the many layer visualization tools available online (don't ask for references here; SO bans requests for resources).
The features vary from one training to another, depending on the kernel initialization (usually random) and very dependent on the training set. However, the features tend to be simple in the early layers, with greater variety and detail in the later ones. For instance, on the original AlexNet target (ILSVRC 2012, a.k.a. ImageNet data set), the FC7 features often include vehicle tires, animal faces, various types of flower petals, green leaves and stems, two-legged animal torsos, airplane sections, car/truck/bus grill work, etc.
Does that help?
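For concreteness, here is a minimal sketch of FC7 feature extraction using torchvision's pretrained AlexNet; this is my own illustration under that assumption, not necessarily the framework the linked method uses.

```python
import torch
from torchvision import models

# Load AlexNet pretrained on ImageNet and switch to inference mode.
alexnet = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
alexnet.eval()

# In torchvision's AlexNet, the classifier ends with FC8; dropping that last
# layer leaves the 4096-dim FC7 activations as the feature vector.
fc7_extractor = torch.nn.Sequential(
    alexnet.features,
    alexnet.avgpool,
    torch.nn.Flatten(),
    *list(alexnet.classifier.children())[:-1],   # stop before FC8
)

image_batch = torch.randn(1, 3, 224, 224)        # stand-in for a preprocessed painting
with torch.no_grad():
    fc7_features = fc7_extractor(image_batch)    # shape (1, 4096)
print(fc7_features.shape)
```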
WHY DOES IT WORK SO WELL ?
That depends on the data set and training parameters. How different are the images from the two artists? There are plenty of features to extract: choice of subject, palette, compositional complexity, hard/soft edges, even the direction of brush strokes. For instance, differentiating any two early cubists could be a little tricky; telling Rembrandt from Jackson Pollock should hit 100%.

How to fine tune an FCN-32s for interactive object segmentation

I'm trying to implement the proposed model in a CVPR paper (Deep Interactive Object Selection) in which the data set contains 5 channels for each input sample:
1. Red
2. Blue
3. Green
4. Euclidean distance map associated with positive clicks
5. Euclidean distance map associated with negative clicks
To do so, I should fine-tune the FCN-32s network using "object binary masks" as labels.
As you can see, in the first conv layer I have 2 extra channels, so I did net surgery to use the pretrained parameters for the first 3 channels and Xavier initialization for the 2 extra ones.
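For the net-surgery step itself, here is a minimal NumPy sketch of the idea (the filter shapes are my own assumptions, stand-ins for the actual first-layer weights of FCN-32s): copy the pretrained RGB filters into the first three input channels of the new 5-channel kernel and Xavier-initialize the two distance-map channels.

```python
import numpy as np

# Stand-in for the pretrained first-layer weights: (out_channels, in_channels, kH, kW).
pretrained_w = np.random.randn(64, 3, 3, 3).astype(np.float32)

# New first-layer weights with 5 input channels (RGB + 2 distance maps).
new_w = np.zeros((64, 5, 3, 3), dtype=np.float32)
new_w[:, :3] = pretrained_w                    # reuse the pretrained RGB filters as-is

# Xavier (Glorot) uniform initialization for the two extra channels.
fan_in = 5 * 3 * 3
fan_out = 64 * 3 * 3
limit = np.sqrt(6.0 / (fan_in + fan_out))
new_w[:, 3:] = np.random.uniform(-limit, limit, size=(64, 2, 3, 3)).astype(np.float32)

print(new_w.shape)  # (64, 5, 3, 3)
```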
For the rest of the FCN architecture, I have these questions:
Should I freeze all the layers before "fc6" (except the first conv layer)? If yes, how will the extra channels of the first conv layer be learned? Are the gradients strong enough to reach the first conv layer during training?
What should the kernel size of "fc6" be? Should I keep 7? I saw in the "Caffe net_surgery" notebook that it depends on the output size of the last layer ("pool5").
The main issue is the number of outputs of the "score_fr" and "upscore" layers. Since I'm not doing class segmentation (which would use 21 for 20 classes plus background), how should I change it? What about 2 (one for the object and one for the non-object (background) area)?
Should I change "crop" layer "offset" to 32 to have center crops?
If I change any of these layers, what is the best initialization strategy for them: "bilinear" for "upscore" and "Xavier" for the rest?
Should I convert my binary label matrix to zero-centered values ({-0.5, 0.5}), or is it OK to keep the values in {0, 1}?
Any useful idea will be appreciated.
PS:
I'm using a Euclidean loss while using 1 as the number of outputs for the "score_fr" and "upscore" layers. If I used 2 instead, I guess it should be a softmax loss.
I can answer some of your questions.
The gradients will reach the first layer so it should be possible to learn the weights even if you freeze the other layers.
Change the num_output to 2 and finetune. You should get a good output.
I think you'll need to experiment with each of the options and see how the accuracy is.
You can use the values 0,1.

Depth of Artificial Neural Networks

According to this answer, one should never use more than two hidden layers of neurons.
According to this answer, a middle layer should contain at most twice the number of input or output neurons (so if you have 5 input neurons and 10 output neurons, one should use (at most) 20 middle neurons per layer).
Does that mean that all data will be modeled within that number of neurons?
So if, for example, one wants to do anything from modeling weather (a million input nodes of data from different weather stations) to simple OCR (of scanned text at 1000x1000 DPI), one would need the same number of nodes?
PS.
My last question was closed. Is there another SE site where these kinds of questions are on topic?
You will likely overfit your data (a.k.a. high variance). Think of it like this: the more neurons and layers you have, the more parameters you have with which to fit your training data ever more closely.
Remember that for a first-layer node the equation is Z1 = sigmoid(sum(W1 * x)), and for a second-layer node it becomes Z2 = sigmoid(sum(W2 * Z1)).
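As a concrete illustration of those two equations (shapes borrowed from the question's 5-input/20-hidden/10-output example; the random weights are just stand-ins), here is a small NumPy sketch:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.random.rand(5)          # 5 input neurons
W1 = np.random.randn(20, 5)    # hidden layer with 20 neurons
W2 = np.random.randn(10, 20)   # output layer with 10 neurons

Z1 = sigmoid(W1 @ x)           # Z1 = sigmoid(W1 . x)
Z2 = sigmoid(W2 @ Z1)          # Z2 = sigmoid(W2 . Z1)
print(Z1.shape, Z2.shape)      # (20,) (10,)
```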
Look into the machine learning class taught at Stanford... it's a great online course and a good reference.
More than two hidden layers can be useful in certain architectures
such as cascade correlation (Fahlman and Lebiere 1990) and in special
applications, such as the two-spirals problem (Lang and Witbrock 1988)
and ZIP code recognition (Le Cun et al. 1989).
Fahlman, S.E. and Lebiere, C. (1990), "The Cascade-Correlation Learning Architecture," NIPS 2, 524-532.
Le Cun, Y., Boser, B., Denker, J.S., Henderson, D., Howard, R.E., Hubbard, W., and Jackel, L.D. (1989), "Backpropagation Applied to Handwritten ZIP Code Recognition," Neural Computation, 1, 541-551.
Check out the sections "How many hidden layers should I use?" and "How many hidden units should I use?" in the comp.ai.neural-nets FAQ for more information.