Dropout Implementation: Do all images in a batch get the same mask in Caffe?

I just wanted to confirm whether all the images in a mini-batch get the same dropout mask or not.
For clarification: Suppose a 1000 sized vector is passed through the Dropout layer with a mini batch size of 100.
Now, the 21st and 31st elements of the 1st image's vector get dropped out.
Is it necessary that the 21st and 31st elements will get dropped for all the remaining 99 images in the batch? Or is it the case that every image gets a separate mask?

No, each image in a batch gets an independent, completely random mask.
Actually, the Dropout layer doesn't even care about the shape of its input: it counts the number of elements in the bottom Blob (e.g. for a batch of 10 images with 3 channels of 224x224, that is 10 * 3 * 224 * 224 = 1,505,280) and generates that many independent random numbers.
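To make that concrete, here is a minimal NumPy sketch of the behaviour described above (an illustration of per-element masking, not Caffe's actual code):

import numpy as np

def dropout_forward(bottom, p=0.5):
    # One independent Bernoulli draw per element of the bottom blob,
    # so every image in the batch gets its own mask.
    mask = (np.random.rand(*bottom.shape) >= p)
    return bottom * mask / (1.0 - p), mask   # inverted-dropout scaling at train time

batch = np.random.rand(100, 1000)            # 100 images, 1000-element vectors
out, mask = dropout_forward(batch, p=0.5)
print(np.array_equal(mask[0], mask[1]))      # almost certainly False: each image has a different mask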

Related

Classifying an object from an ordered sequence of spatially aligned images

This is about a project to count the number of tumour cells in a given 88 by 88 pixel frame. To say there is just one image is not actually accurate: there are 4 independent images of that frame, i.e. of the same 'situation' on the ground. All 4 of them have to be considered to count how many tumour cells there are (usually 0, sometimes 1, rarely 2).
Here are the 4 images of a sample frame, concatenated together.
These images are obtained by visualising the situation through different lenses (wavelengths).
I have read several blog articles on neural networks. However, all of them assume a one-image-one-label relationship, instead of the many-images-one-label relationship I am working with.
Hence, I am looking for suggestions for possible tools or just the technical term related to this kind of problem so I can proceed further.
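For illustration only (this is an assumption about one possible framing, not something from the thread): one common way to turn such many-images-one-label data into the usual one-input-one-label setting is to stack the 4 aligned views of a frame as 4 input channels of a single example, e.g.:

import numpy as np

# Four hypothetical aligned 88x88 grayscale views of the same frame.
frames = [np.random.rand(88, 88) for _ in range(4)]
x = np.stack(frames, axis=0)   # shape (4, 88, 88): one channels-first input tensor
y = 1                          # the single cell-count label for this frame
print(x.shape, y)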

Implementing a CNN - but the graph is in Russian

By Google-translating the labels, I figured out:
Вход means input
Слой means layer
Свертка means convolution (so this must be the number of filters?)
Шаг means step (this must be the stride?)
субдискр is short for субдискретизация, i.e. subsampling (I guess this is pooling?)
Now my question is: how would a 22x256 input result in 6x256 with 5 filters?
The filter size (kernel) that I found to produce 6x256 is [17,1] with 1 filter. From layer 1 to layer 2, a kernel size of [1,8] with stride [1,8] is what I found to work. This just does not look like anything on the graph though.
In the paper they wrote this about the layer between 1 and 2:
"The second layer allows to reduce the dimensionality of the signal in time, producing a weighted average of the signal over 16 values"
Here is an explanation of how the sizes of the inputs change as you proceed through the layers.
In the input, the dimensions you give are a width of 28, a height of 28 and a depth of 1. For the filters in layer 1, the depth of each filter must equal the depth of the input, so each filter has dimension 5x5x1. Applying one filter reduces the spatial size (because of the stride) and produces a 14x14x1 activation map, so applying 32 such filters gives you 32 activation maps. Stacked together, this 14x14x32 volume is the output of layer 1 and the input to your second layer. In the second layer you again apply a filter, now of dimension 5 (width) x 5 (height) x 32 (depth), to produce one 14x14x1 activation map; stacking all 64 activation maps gives 14x14x64 as the output of the second layer, and so on.
The figure that you posted looks very different in its representation. Check it against the standard diagrams in your language.
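As a quick sanity check on those numbers, the standard convolution/pooling output-size formula can be applied directly; a small sketch in Python (the padding values are my assumptions, since the figure does not state them):

def conv_output_size(in_size, kernel, stride=1, pad=0):
    # floor((in + 2*pad - kernel) / stride) + 1
    return (in_size + 2 * pad - kernel) // stride + 1

print(conv_output_size(22, 17, stride=1))            # 6: the 22x256 -> 6x256 step with a [17,1] kernel
print(conv_output_size(256, 8, stride=8))            # 32: the [1,8] kernel with stride [1,8] along the 256 axis
print(conv_output_size(256, 16, stride=16))          # 16: if the paper's "average over 16 values" is a [1,16] kernel, stride [1,16]
print(conv_output_size(28, 5, stride=2, pad=2))      # 14: the 28 -> 14 step described above, assuming padding of 2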
I asked the authors: they told me that they used a 1-dimensional CNN.
This means that the first number is the depth and the second number is the width, i.e. depth x width.

What is the optimal hidden layer size?

Suppose we have a standard autoencoder with three layers (i.e. L1 is the input layer, L3 the output layer, with #input = #output = 100, and L2 is the hidden layer with 50 units). I know the interesting part of an autoencoder is the hidden layer L2: instead of passing 100 inputs to my supervised model, it will feed it with 50 inputs. What is the optimal hidden layer size? 50 works well, but why not use 51, 52 or 63 hidden units? Will 51 hidden units make the supervised model perform better than 50?
Suppose now that the number of inputs is 1,000,000. If N is the number of hidden units, I don't want to test every possible value of N to find the optimal one. I assumed there is at least an algorithm that avoids having to test each possible value, or that eliminates some of them.
Could that question help?
There is no rule for it. Selecting the number of hidden units is purely a matter of trial and error.
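Since the answer amounts to a search, here is a minimal sketch of that trial and error using scikit-learn; the MLPRegressor trained to reproduce its input is only a stand-in for whatever autoencoder you actually use, and the data is a random toy set:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

X = np.random.rand(2000, 100)                       # toy stand-in for the real 100-feature inputs
X_train, X_val = train_test_split(X, test_size=0.25, random_state=0)

results = {}
for n_hidden in [25, 50, 63, 100]:                  # coarse grid rather than every possible value of N
    ae = MLPRegressor(hidden_layer_sizes=(n_hidden,), max_iter=300, random_state=0)
    ae.fit(X_train, X_train)                        # autoencoder: the target is the input itself
    recon = ae.predict(X_val)
    results[n_hidden] = np.mean((recon - X_val) ** 2)

best = min(results, key=results.get)                # keep the size with the lowest validation reconstruction error
print(best, results[best])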

How to fine-tune an FCN-32s for interactive object segmentation

I'm trying to implement the model proposed in a CVPR paper (Deep Interactive Object Selection), in which the data set contains 5 channels for each input sample:
1. Red
2. Blue
3. Green
4. Euclidean distance map associated to positive clicks
5. Euclidean distance map associated to negative clicks (as follows):
To do so, I should fine-tune the FCN-32s network using "object binary masks" as labels:
As you can see, in the first conv layer I have 2 extra channels, so I did net surgery to use the pretrained parameters for the first 3 channels and Xavier initialization for the 2 extra ones.
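For reference, a pycaffe sketch of that kind of channel-widening surgery; the file and layer names ('fcn32s_3ch_deploy.prototxt', 'fcn32s_5ch_deploy.prototxt', 'conv1_1') are assumptions about the setup, not files from the paper:

import numpy as np
import caffe

net_rgb = caffe.Net('fcn32s_3ch_deploy.prototxt', 'fcn32s.caffemodel', caffe.TEST)
net_5ch = caffe.Net('fcn32s_5ch_deploy.prototxt', caffe.TEST)     # 5-channel first conv, randomly initialized

w_rgb = net_rgb.params['conv1_1'][0].data        # (num_filters, 3, kh, kw)
w_5ch = net_5ch.params['conv1_1'][0].data        # (num_filters, 5, kh, kw)
w_5ch[:, :3] = w_rgb                             # reuse the pretrained RGB filters
fan_in = w_5ch[0].size                           # Xavier-style uniform init for the 2 extra channels
w_5ch[:, 3:] = np.random.uniform(-1, 1, w_5ch[:, 3:].shape) * np.sqrt(3.0 / fan_in)
net_5ch.params['conv1_1'][1].data[...] = net_rgb.params['conv1_1'][1].data   # copy the biases

for name in net_rgb.params:                      # copy every other layer unchanged
    if name == 'conv1_1':
        continue
    for i, blob in enumerate(net_rgb.params[name]):
        net_5ch.params[name][i].data[...] = blob.data

net_5ch.save('fcn32s_5ch_init.caffemodel')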
For the rest of the FCN architecture, I have these questions:
Should I freeze all the layers before "fc6" (except the first conv layer)? If yes, how will the extra channels of the first conv layer be learned? Are the gradients strong enough to reach the first conv layer during training?
What should the kernel size of "fc6" be? Should I keep 7? I saw in the "Caffe net_surgery" notebook that it depends on the output size of the last layer ("pool5").
The main problem is the number of outputs of the "score_fr" and "upscore" layers. Since I'm not doing class segmentation (which uses 21 outputs for 20 classes plus background), how should I change it? Would 2 work (one for the object and one for the non-object (background) area)?
Should I change "crop" layer "offset" to 32 to have center crops?
In case of changing each of these layers, what is the best initialization strategy for them? "bilinear" for "upscore" and "Xavier" for the rest?
Should I convert my binary label matrix values to a zero-centered form ({-0.5, 0.5}), or is it OK to use them with values in {0, 1}?
Any useful idea will be appreciated.
PS:
I'm using Euclidean loss while using "1" as the number of outputs for the "score_fr" and "upscore" layers. If I use 2 for that, I guess the loss should be softmax.
I can answer some of your questions.
The gradients will reach the first layer so it should be possible to learn the weights even if you freeze the other layers.
Change the num_output to 2 and finetune. You should get a good output.
I think you'll need to experiment with each of the options and see how the accuracy is.
You can use the values 0,1.

Enhancing 8 bit images to 16 bit

My objective is to enhance 8-bit images to 16-bit ones. In other words, I want to increase the dynamic range of an 8-bit image. To do that, I can sequentially take multiple 8-bit images of a fixed scene with a fixed camera. To simplify the issue, let's assume they are grayscale images.
Intuitively, I think I can achieve the goal by
Multiplying two 8-bit images:
resImage = double(img1) .* double(img2)
Averaging a specified number of 8-bit images:
resImage = mean(images, 3)
assuming images(:,:,i) contains the ith 8-bit image.
After that, I can convert the resulting image to a 16-bit one:
resImage = uint16(resImage)
But before testing these methods, I wonder whether there is another way to do this (other than buying a 16-bit camera), or whether there is literature on this subject that would serve me better.
UPDATE: As the comments below show, I got great information on the drawbacks of the simple averaging above and on image stacks for this kind of enhancement, so it may be a good topic to study after all. Thank you all for your great comments.
This question appears to relate to increasing the dynamic range of an image by integrating information from multiple 8-bit exposures into a 16-bit image. This is related to the practice of capturing and combining "image stacks" in astronomical imaging, among other fields. An explanation of this practice and of how it can both reduce image noise and enhance dynamic range is available here:
http://keithwiley.com/astroPhotography/imageStacking.shtml
The idea is that successive captures of the same scene are subject to image noise, and this noise leads to stochastic variation of the captured pixel values. In the simplest case these variations can be leveraged by summing and dividing, i.e. mean averaging the stack, to improve its dynamic range, but the practicality depends very much on the noise characteristics of the camera.
You want to sum many images together, assuming there is no jitter and the camera is steady. Accumulate a large sum and then divide by some amount.
Note that to get a reasonable 16-bit image from an 8-bit source, you'd need to take hundreds of images to get any kind of reasonable result. Jitter will distort edge information, and the camera has some inherent noise level that might mean you are essentially 'grinding metal'. In a practical sense, you might get 2 or 3 more bits of data from image summing, but not 8 more. To get 3 more bits would require summing at least 64 images (6 extra bits in the sum); then divide by 8 (3 bits), as the lower bits are garbage.
The rule of thumb is that to gain n extra bits you need the square of 2^n images, i.e. 4^n, so 3 extra bits (2^3 = 8) means 8^2 = 64 images, 4 bits would be 256 images, etc.
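Here is a minimal NumPy sketch of that sum-and-rescale approach, assuming the 8-bit frames are already perfectly aligned (registration and jitter correction are ignored):

import numpy as np

def stack_to_16bit(frames):
    # frames: array of aligned 8-bit grayscale frames, shape (N, H, W), dtype uint8
    frames = np.asarray(frames, dtype=np.uint64)
    acc = frames.sum(axis=0)                         # accumulate a large sum
    n = frames.shape[0]
    # Rescale so the full 8-bit range maps onto the full 16-bit range; with
    # enough frames, the averaged noise fills in the extra low-order bits.
    out = acc * (65535.0 / (255.0 * n))
    return np.clip(np.rint(out), 0, 65535).astype(np.uint16)

frames = np.random.randint(0, 256, size=(64, 480, 640), dtype=np.uint8)   # e.g. 64 frames for ~3 extra bits
hdr = stack_to_16bit(frames)
print(hdr.dtype, hdr.min(), hdr.max())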
Here's a link that talks about sampling:
http://electronicdesign.com/analog/understand-tradeoffs-increasing-resolution-averaging
"In fact, it can be shown that the improvement is proportional to the square root of the number of samples in the average."
Note that SNR is a log scale so equating it to bits is reasonable.