I heard that the convolution and pooling operations happen in hidden layers, but I see the opposite in the picture below. So which one is correct?
[Picture: CNN architecture with an input image, convolution/pooling layers, and a 2-node output layer]
Every layer except the input layer (in your case the input image) and the output layer (in your case the final 2-node output layer) can be considered a hidden layer. So yes, the convolution and pooling operations happen in hidden layers.
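As a minimal, hedged illustration (the layer sizes below are made up and not taken from your picture), here is a small PyTorch model in which the convolution and pooling layers all sit between the input image and a 2-node output, i.e. they are the hidden layers:

import torch
import torch.nn as nn

# Toy example: everything between the input image and the 2-node output is hidden.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),  # hidden: convolution
    nn.ReLU(),                                   # hidden: activation
    nn.MaxPool2d(2),                             # hidden: pooling
    nn.Flatten(),
    nn.Linear(16 * 16 * 16, 2),                  # output layer (2 nodes)
)

x = torch.randn(1, 3, 32, 32)  # the "input layer": a 32x32 RGB image
print(model(x).shape)          # torch.Size([1, 2])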
I know that a saliency map is also a form of image segmentation task, but it has been used very widely for interpretable deep learning (see Grad-CAM, etc.).
I also came across this paper (http://img.cs.uec.ac.jp/pub/conf16/161011shimok_0.pdf), which talks about class saliency maps, something that rings a bell when it comes to image segmentation. Please tell me whether this concept exists for image segmentation, or whether I need to read more on this subject.
Class saliency maps, as described in Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps, describe per pixel how much changing that pixel will influence a prediction. Hence I see no reason why this could not be applied to image segmentation tasks.
The resulting images from a segmentation task and from a saliency map have to be interpreted differently, however. In an image segmentation task the output is a per-pixel prediction of whether or not a pixel belongs to a class, sometimes in the form of a certainty score.
A class saliency map describes per pixel how much changing that pixel would change the score of the classifier. Or, to quote from the above paper: "which pixels need to be changed the least to affect the class score the most".
Edit: Added example.
Say that a pixel gets a score of 99% for being of the class "Dog"; we can be rather certain that this pixel actually is part of a dog. The saliency map can still show a low score for this same pixel. This means that changing this pixel slightly would not influence the prediction of that pixel belonging to the class "Dog". In my experience so far, both the per-pixel class probability map and the saliency map show somewhat similar patterns, but this does not mean they are to be interpreted equally.
A piece of code I came across that can be applied to PyTorch models (from Nikhil Kasukurthi, not mine) can be found on GitHub.
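For orientation, computing such a class saliency map boils down to one backward pass from the class score to the input pixels. Below is a minimal, hedged PyTorch sketch; the model, the class index and the random input are arbitrary placeholders, and a recent torchvision is assumed:

import torch
import torchvision.models as models

model = models.resnet18(weights="DEFAULT").eval()
image = torch.randn(1, 3, 224, 224, requires_grad=True)  # stand-in for a real, preprocessed image

score = model(image)[0, 243]  # 243: an arbitrary class index of interest
score.backward()              # gradient of the class score w.r.t. the input pixels

# Per-pixel saliency: maximum absolute gradient over the colour channels.
saliency = image.grad.abs().max(dim=1)[0]  # shape (1, 224, 224)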
I've just finished reading the notes for Stanford's CS231n on CNNs and there is a link to a live demo; however, I am unsure what "Activations", "Activation Gradients", "Weights" and "Weight Gradients" refer to in the demo. The screenshots below have been copied from the demo.
Confusion point 1
I'm first confused by what "activations" refers to for the input layer. Based on the notes, I thought that the activation layer refers to the RELU layer in a CNN, which essentially tells the CNN which neurons should be lit up (using the RELU function). I'm not sure how that relates to the input layer as shown below. Furthermore, why are there two images displayed? The first image seems to display the image that is provided to the CNN but I'm unable to distinguish what the second image is displaying.
Confusion point 2
I'm unsure what "activations" and "activation gradients" are displaying here, for the same reason as above. I think the "weights" display what the 16 filters in the convolution layer look like, but I'm not sure what "Weight Gradients" is supposed to be showing.
Confusion point 3
I think I understand what the "activations" are referring to in the ReLU layers. They display the output images of all 16 filters after every value (pixel) of the output image has had the ReLU function applied to it, hence why each of the 16 images contains pixels that are black (un-activated) or some shade of white (activated). However, I don't understand what "activation gradients" is referring to.
Confusion point 4
I also don't understand what "activation gradients" refers to here.
I'm hoping that by understanding this demo, I'll understand CNNs a little better.
This question is similar to this question, but not quite. Also, here's a link to the ConvNetJS example code with comments (here's a link to the full documentation). You can take a look at the code at the top of the demo page for the code itself.
An activation function is a function that takes some input and outputs a value based on whether the input reaches some "threshold" (this is specific to each activation function). This comes from how neurons work: they take some electrical input and will only activate if that input reaches some threshold.
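As a concrete example (plain NumPy, not taken from the demo), the ReLU used throughout this network is such a thresholded function: inputs below zero do not activate the unit, inputs above zero pass through unchanged.

import numpy as np

def relu(x):
    # Zero below the threshold (0), identity above it.
    return np.maximum(0, x)

print(relu(np.array([-2.0, -0.5, 0.0, 1.5, 3.0])))  # negatives map to 0, positives pass through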
Confusion Point 1: The first set of images shows the raw input image (the left, coloured image); the right of the two is the output after going through the activation functions. You shouldn't really be able to interpret the second image, because it has gone through non-linear and seemingly random transformations through the network.
Confusion Point 2: Similar to the previous point, the "activations" are the functions the image pixel information is passed into. A gradient is essentially the slope of the activation function. It appears more sparse (i.e., colours show up in only certain places) because it shows possible areas in the image that each node is focusing on. For example, the 6th image on the first row has some colour in the bottom-left corner, which may indicate a large change in the activation function, i.e. something interesting in this area. This article may clear up some confusion on weights and activation functions, and this article has some really great visuals of what each step is doing.
Confusion Point 3: This confused me at first, because if you think about a ReLU function, you will see that it has a slope of one for positive x and zero everywhere else, so taking the gradient (or slope) of the activation function (ReLU in this case) doesn't seem to make sense. The "max activation" and "min activation" values do make sense for a ReLU: the minimum value will be zero and the max is whatever the maximum value is; this follows straight from the definition of a ReLU. To explain the gradient values, I suspect that some Gaussian noise and a bias term of 0.1 have been added to those values. Edit: the gradient refers to the slope of the cost-weight curve shown below. The y-axis is the loss value, or the calculated error, using the weight values w on the x-axis.
Image source https://i.ytimg.com/vi/b4Vyma9wPHo/maxresdefault.jpg
Confusion Point 4: See above.
Confusion point 1: Looking at the code, it seems that in the case of the input layer the "Activations" visualisation is the coloured image in the first figure. The second figure does not really make sense, because the code is trying to display some gradient values but it's not clear where they come from.
// HACK to draw in color in input layer
if(i===0) {
  draw_activations_COLOR(activations_div, L.out_act, scale);        // first figure: the input image
  draw_activations_COLOR(activations_div, L.out_act, scale, true);  // second figure: gradient values (see note above)
}
Confusion point 2, 3 & 4:
Activations: the output of the layer.
Activation Gradients: the name is confusing, but it is basically the gradient of the loss with respect to the input of the current layer l. This is useful if you want to debug the autodiff algorithm.
Weights: only printed if the layer is a convolution; these are the different filters of the convolution.
Weight Gradients: the gradient of the loss with respect to the weights of the current layer l.
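For readers who find it easier to see these four quantities in code, here is a hedged PyTorch sketch (not ConvNetJS; a reasonably recent PyTorch is assumed) that captures them for a single convolution layer:

import torch
import torch.nn as nn

layer = nn.Conv2d(3, 16, kernel_size=3, padding=1)
captured = {}

layer.register_forward_hook(
    lambda m, inp, out: captured.update(activations=out.detach()))
layer.register_full_backward_hook(
    lambda m, grad_in, grad_out: captured.update(act_grads=grad_in[0].detach()))

x = torch.randn(1, 3, 32, 32, requires_grad=True)
layer(x).sum().backward()  # dummy "loss": just the sum of the outputs

print(captured["activations"].shape)  # Activations: the layer's output
print(captured["act_grads"].shape)    # Activation gradients: dLoss/d(layer input)
print(layer.weight.shape)             # Weights: the 16 convolution filters
print(layer.weight.grad.shape)        # Weight gradients: dLoss/d(weights)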
Confusion point 1
For the convolutional layers, every layer's job is to detect features. Imagine that you want to detect a human face: the first layer will detect edges, maybe the next layer will detect noses, and so on; towards the last layers, more and more complex features are detected. In the first layer, what you see is what that layer detected in the image.
Confusion point 2
If you look at the fully connected layers, I think they are probably showing the gradients obtained during back-propagation, because for the fully connected layers you only get grey, black, etc. colours.
Confusion point 3
There are no separate ReLU layers as such: after a convolution you apply the activation function, get another matrix, and pass it through the next layer. After the ReLU, you get the colours.
Confusion point 4
It is the same as above.
Please let me know if you don't understand any point.
This may be a basic conceptual question, but reading about different CNNs such as VGG, AlexNet, GoogLeNet, etc., it seems that once a model has been trained on a specific input image size (let's say 256x256), I can't give a different image size (1920x1080) to the model during inference without resizing or cropping. Is this true?
I know that YOLO handles images with different resolutions; does YOLO resize the image before giving it to the convolution layers?
The requirement I have is to do object recognition on a series of images that may not all have the same size. The obvious approach would be resizing the images, but that may lead to losing information in the image.
If so, do I need to train a model for every image size that I have, and then reload the model each time for that specific image?
There are some more conceptual issues here: VGG, AlexNet and GoogLeNet are image classification models, while YOLO is an object detection model. A network can accept variable-sized images only if it is fully convolutional.
So your only option is to resize the images to a common size. This works well in practice, so you should do it and evaluate different image sizes to see how accuracy changes with them. Only after doing such an experiment can you decide whether resizing is inappropriate.
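For illustration, here is a minimal, hedged PyTorch sketch (layer sizes are arbitrary) of one common way a network becomes size-agnostic: a convolutional body followed by global (adaptive) pooling, so the same weights accept different input resolutions:

import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),  # collapses any spatial size to 1x1
    nn.Flatten(),
    nn.Linear(32, 10),
)

print(model(torch.randn(1, 3, 256, 256)).shape)    # torch.Size([1, 10])
print(model(torch.randn(1, 3, 1080, 1920)).shape)  # torch.Size([1, 10])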
I need to classify pixel-wise instances in an image. Most object detection models, e.g. RetinaNet and the R-CNN family, only detect bounding boxes, and in my case the non-instance region in a bounding box can be significantly different from the instance. Even the Mask R-CNN model still does object classification based on the bounding-box area. Does anybody know which model I should use? I guess Facebook's MultiPathNet would probably work, but I am not using Linux. Are there any other models? Thanks a lot.
It sounds like you're looking for instance-level segmentation (as a short term for the long explanation).
Mask R-CNN sounds just right for the job.
It does instance-level segmentation based on the region proposals, not only bounding boxes.
The segmentation is a binary mask of the instance. The classification is made by a dedicated branch.
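As a hedged usage sketch (assuming a recent torchvision; the input is a random placeholder image), Mask R-CNN returns a box, a label, a score and a per-pixel mask for each detected instance:

import torch
from torchvision.models.detection import maskrcnn_resnet50_fpn

model = maskrcnn_resnet50_fpn(weights="DEFAULT").eval()

image = torch.rand(3, 480, 640)  # stand-in for a real RGB image scaled to [0, 1]
with torch.no_grad():
    out = model([image])[0]

print(out["boxes"].shape)   # (N, 4): bounding box per instance
print(out["labels"].shape)  # (N,): class per instance
print(out["scores"].shape)  # (N,): confidence per instance
print(out["masks"].shape)   # (N, 1, 480, 640): soft per-pixel mask per instance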
I'm trying to fit two data sets. They contain the results of measuring the same object with two different measurement devices (x-ray vs. µCT).
I did manage to reconstruct the image data and fit the orientation and offset of the stacks. It looks like this (one image from a stack of about 500 images):
The whole point of this is to compare several denoising algorithms on the x-ray data (left). It is assumed that the data from µCT (right) is close to the real signal, without any noise. So, I want to compare the denoised x-ray data from each of the algorithms to the "pure" signal from µCT to see which algorithm produces the lowest RMS error. Therefore, I need to somehow fit the gray values from the left part to those of the right part without manipulating the noise too much.
The gray values on the right are in the range of 0 to 100, whereas the x-ray data ranges from about 4000 to 30000. The "bubbles" are in a range of about 8000 to 11000 (those are not real bubbles but an artificial phantom with holes, made with a 3D printer).
What I tried to do is (kind of) band-pass those bubbles and map them to ~100 while shifting everything else towards 4 (which is the value of the background in the µCT data).
That's the code for this:
zwst = zwsr;                                                                     % keep an unmodified copy to index against
zwsr(zwst<=8000) = round(zwst(zwst<=8000)*4/8000);                               % background: map towards ~4
zwsr(zwst>8000 & zwst<=11000) = round(zwst(zwst>8000 & zwst<=11000)/9500*100);   % "bubbles": map towards ~100
zwsr(zwst>11000) = round(zwst(zwst>11000)*4/30000);                              % high values: map towards ~4
The results look like this:
Some of those bubbles look distorted and the noise part in the background is gone completely. Is there any better way to fit those gray values while maintaining the noisy part?
EDIT: To clarify things: the µCT data is assumed to be noise-free while the x-ray data is assumed to be noisy. In other words, µCT = signal, while x-ray = signal + noise. To quantify the quality of my denoising methods, I want to calculate x-ray - µCT = noise.
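A minimal sketch of that comparison in plain NumPy, assuming the denoised x-ray slice and the µCT slice are already registered, equally sized, and mapped to the same gray-value range:

import numpy as np

def rmse(denoised_xray, muct_reference):
    # x-ray - µCT = (estimated) noise; its root-mean-square is the error score.
    residual = denoised_xray.astype(np.float64) - muct_reference.astype(np.float64)
    return np.sqrt(np.mean(residual ** 2))

The algorithm with the lowest RMSE against the (assumed noise-free) µCT reference would then be the preferred denoiser.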
Too long for a comment, but I believe a reasonable answer:
There is a huge subfield of image processing / signal processing called image fusion. There is even specific MATLAB tooling for it using wavelets (http://uk.mathworks.com/help/wavelet/gs/image-fusion.html).
The idea behind image fusion is: given 2 images of the same thing but with very different resolution/data, how can we create a single image containing the information of both?
Stitching both images together "by hand" generally does not give very good results, so there is a large number of techniques to do it mathematically. Wavelets are very common here.
These techniques are widely used in medical imaging, as (like in your case) different imaging techniques give different information, and doctors want all of it together:
Example (top row: images pasted together, bottom row: image fusion techniques)
Have a look at some papers and some MATLAB tutorials, and you'll probably get there with the easy-to-use MATLAB code, without any fancy state-of-the-art programming.
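If you would rather prototype outside MATLAB, here is a minimal, hedged sketch of the same idea in Python with PyWavelets (pywt); it assumes the two images are already registered, equally sized, and on comparable gray-value scales, and it uses a deliberately simple fusion rule:

import numpy as np
import pywt

def fuse(img_a, img_b, wavelet="db2"):
    # One-level 2D wavelet decomposition of both images.
    cA_a, details_a = pywt.dwt2(img_a, wavelet)
    cA_b, details_b = pywt.dwt2(img_b, wavelet)

    # Average the coarse approximations; keep the stronger detail coefficient
    # at each position (a very common, very simple fusion rule).
    cA = (cA_a + cA_b) / 2
    details = tuple(np.where(np.abs(da) > np.abs(db), da, db)
                    for da, db in zip(details_a, details_b))
    return pywt.idwt2((cA, details), wavelet)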
Good luck!