Convolutional layers convolve the wrong way around (PyTorch)? - neural-network

I have been trying to visualize the outputs of a VGG-16 network, but the output seems to be just wrong. As far as I know, a convolution does not translate the semantic content of the picture: for example, if the head is in the top part of the input image, it should still be at the top after the convolution is applied. But that doesn't seem to be the case here. I used the following code to extract the intermediate layers.
import torch
import torchvision.models as tv  # assuming `tv` is torchvision.models
import matplotlib.pyplot as plt

class vgg16(torch.nn.Module):
    def __init__(self, pretrained=True):
        super(vgg16, self).__init__()
        vgg_pretrained_features = tv.vgg16(pretrained=pretrained).features
        self.layerss = torch.nn.Sequential()
        for x in range(30):  # keep the first 30 modules of the feature extractor
            self.layerss.add_module(str(x), vgg_pretrained_features[x])
        self.layerss.eval()

    def forward(self, x):
        output = []
        for i, layer in enumerate(self.layerss):
            x = layer(x)
            output.append(x)  # collect every intermediate activation
        return output

model = vgg16()
output = model.forward(img)  # img: the preprocessed input image tensor, shape (1, 3, H, W)

plt.imshow(output[0][0][0].detach())  # first channel of the first layer's output
Here are the original picture and the output of the first channel of the first layer of the VGG network:
As you can see, the face has moved all the way down, the necklace is all the way up, and the overall structure of the picture is broken.

doesn't translate the semantic segment of the picture,
I kind of understand where you're coming from. This might be true, but here is the thing: your model doesn't exclusively contain convolution layers. It also has max-pooling layers (namely nn.MaxPool2d). These layers can indeed disturb the spatial coherence that is initially apparent in the input image.
Combined with a rather high receptive field (which is the case for this type of CNN), having the observed output is not inconceivable.
Then understanding why the result is this way is another problem to which I don't have the answer. The features you are extracting here should reflect higher-level information. These ultimately depend on the pretraining that was performed on the model, i.e. on which type of task and dataset the model was trained on prior to this inference.
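If you want to check which operations are actually involved, you can simply print the first 30 modules of the feature extractor; a quick sketch (assuming tv is torchvision.models, as in the question):
import torchvision.models as tv

features = tv.vgg16(pretrained=True).features
for i, layer in enumerate(features):
    if i >= 30:
        break
    print(i, layer)  # Conv2d, ReLU and MaxPool2d modules interleaved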

Related

Predictions using Convolutional Neural Networks and DL4J

This is my first time working with DL4J (Deep Learning for Java) and also my first convolutional neural network. My goal is to use the convolutional neural network to give me some predicted values about an image. I gathered and labelled my images myself. The labels or expected outputs consist of two numbers between 0 and 1 (I just wrote them in the file name, like 0.01x0.87.jpg).
Now I can't find any way to use the DataSetIterator class that DL4J uses so that I can also set my label values.
Is there a simple way to tell DL4J that I want to train my Network to recognize that image 0.01x0.01.jpg should spit out the values 0.01 and 0.01?
What you want to do is usually known as regression. In contrast to classification where you want to either have a 0 or 1 output, in regression any value can be the target.
In your case, you will likely want to use a network architecture that uses either a sigmoid (which forces your values to be between 0 and 1) or an identity (which keeps the values as is, i.e. allows for them to be outside of the 0 to 1 range) activation function.
As you have two values that you are trying to predict, you will have to also define that you are using two outputs.
So much for your model architecture.
For data loading, you can use the ImageRecordReader, but also pass it a PathMultiLabelGenerator of your own. When you implement the PathMultiLabelGenerator interface, you will get the full path of the image as a string, and you can do whatever you want with it: for example, remove the file ending, split on x, and parse your filename into a list of DoubleWritable. DoubleWritable is just a simple wrapper class for double, so creating one is as easy as passing the actual value to the constructor.
To create a dataset iterator you can now follow the documentation on RecordReaderDataSetIterator.
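The parsing itself is trivial. Here is the idea sketched in Python just to show the string handling; in DL4J you would do the same inside your PathMultiLabelGenerator implementation and return DoubleWritables instead (the path below is a made-up example):
import os

def labels_from_path(path):
    # "/data/train/0.01x0.87.jpg" -> [0.01, 0.87]
    name = os.path.splitext(os.path.basename(path))[0]  # "0.01x0.87"
    return [float(part) for part in name.split("x")]    # the two regression targets

print(labels_from_path("/data/train/0.01x0.87.jpg"))  # [0.01, 0.87]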

What does the parameter retain_graph mean in the Variable's backward() method?

I'm going through the neural transfer PyTorch tutorial and am confused about the use of retain_variable (deprecated, now referred to as retain_graph). The code example shows:
import torch.nn as nn

class ContentLoss(nn.Module):
    def __init__(self, target, weight):
        super(ContentLoss, self).__init__()
        self.target = target.detach() * weight
        self.weight = weight
        self.criterion = nn.MSELoss()

    def forward(self, input):
        self.loss = self.criterion(input * self.weight, self.target)
        self.output = input
        return self.output

    def backward(self, retain_variables=True):
        # Why is retain_variables True??
        self.loss.backward(retain_variables=retain_variables)
        return self.loss
From the documentation
retain_graph (bool, optional) – If False, the graph used to compute
the grad will be freed. Note that in nearly all cases setting this
option to True is not needed and often can be worked around in a much
more efficient way. Defaults to the value of create_graph.
So by setting retain_graph= True, we're not freeing the memory allocated for the graph on the backward pass. What is the advantage of keeping this memory around, why do we need it?
@cleros's answer is pretty much on point about the use of retain_graph=True. In essence, it retains the information necessary to compute a certain variable, so that we can do a backward pass on it later.
An illustrative example
Suppose that we have the computation graph built by the code below. The variables d and e are the outputs, and a is the input. For example,
import torch
from torch.autograd import Variable
a = Variable(torch.rand(1, 4), requires_grad=True)
b = a**2
c = b*2
d = c.mean()
e = c.sum()
Doing d.backward() is fine. After this computation, the part of the graph that computes d will be freed by default to save memory. So if we then do e.backward(), an error message will pop up. In order to do e.backward(), we have to set the parameter retain_graph to True in d.backward(), i.e.,
d.backward(retain_graph=True)
As long as you use retain_graph=True in your backward method, you can do backward any time you want:
d.backward(retain_graph=True) # fine
e.backward(retain_graph=True) # fine
d.backward() # also fine
e.backward() # error will occur!
More useful discussion can be found here.
A real use case
Right now, a real use case is multi-task learning, where you have multiple losses that may sit at different layers. Suppose that you have two losses, loss1 and loss2, and that they reside in different layers. In order to backprop the gradients of loss1 and loss2 w.r.t. the learnable weights of your network independently, you have to use retain_graph=True in the backward() call of the first back-propagated loss.
# suppose you first back-propagate loss1, then loss2 (you can also do the reverse)
loss1.backward(retain_graph=True)
loss2.backward() # now the graph is freed, and next process of batch gradient descent is ready
optimizer.step() # update the network parameters
This is a very useful feature when you have more than one output of a network. Here's a completely made up example: imagine you want to build some random convolutional network that you can ask two questions of: Does the input image contain a cat, and does the image contain a car?
One way of doing this is to have a network that shares the convolutional layers, but that has two parallel classification branches following them (forgive my terrible ASCII graph, but this is supposed to be three conv layers, followed by three fully connected layers for cats and three for cars):
                     -- FC - FC - FC - cat?
Conv - Conv - Conv -|
                     -- FC - FC - FC - car?
Given a picture that we want to run both branches on, when training the network, we can do so in several ways. First (which would probably be the best thing here, illustrating how bad the example is), we simply compute a loss on both assessments and sum the loss, and then backpropagate.
However, there's another scenario - in which we want to do this sequentially. First we want to backprop through one branch, and then through the other (I have had this use-case before, so it is not completely made up). In that case, running .backward() on one graph will destroy any gradient information in the convolutional layers, too, and the second branch's convolutional computations (since these are the only ones shared with the other branch) will not contain a graph anymore! That means, that when we try to backprop through the second branch, Pytorch will throw an error since it cannot find a graph connecting the input to the output!
In these cases, we can solve the problem by simply retaining the graph on the first backward pass. The graph will then not be freed; it is only consumed by the final backward pass, which does not retain it.
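A self-contained sketch of that two-branch scenario (the layer sizes and names below are made up purely for illustration):
import torch
import torch.nn as nn

trunk = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                      nn.Conv2d(8, 8, 3, padding=1), nn.ReLU(),
                      nn.Conv2d(8, 8, 3, padding=1), nn.ReLU())
cat_head = nn.Linear(8 * 32 * 32, 1)
car_head = nn.Linear(8 * 32 * 32, 1)

x = torch.rand(4, 3, 32, 32)            # dummy batch of images
features = trunk(x).flatten(1)          # shared convolutional features

cat_loss = cat_head(features).sigmoid().mean()
car_loss = car_head(features).sigmoid().mean()

# Summed variant: (cat_loss + car_loss).backward() needs no retain_graph.
# Sequential variant: keep the shared part of the graph alive for the second pass.
cat_loss.backward(retain_graph=True)
car_loss.backward()                     # the last pass is allowed to free the graph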
EDIT: If you retain the graph on all backward passes, the implicit graph definitions attached to the output variables will never be freed. There might be a use case for that as well, but I cannot think of one. So in general, you should make sure that the last backward pass frees the memory by not retaining the graph information.
As for what happens with multiple backward passes: as you guessed, PyTorch accumulates gradients by adding them in place (into a variable's/parameter's .grad attribute).
This can be very useful, since it means that looping over a batch and processing it once at a time, accumulating the gradients at the end, will do the same optimization step as doing a full batched update (which only sums up all the gradients as well). While a fully batched update can be parallelized more, and is thus generally preferable, there are cases where batched computation is either very, very difficult to implement or simply not possible. Using this accumulation, however, we can still rely on some of the nice stabilizing properties that batching brings. (If not on the performance gain)
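A small sketch of that accumulation pattern (the model, loss and data below are made up for illustration):
import torch
import torch.nn as nn

model = nn.Linear(10, 1)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# one "full batch" processed as four smaller chunks
chunks = [(torch.rand(8, 10), torch.rand(8, 1)) for _ in range(4)]

optimizer.zero_grad()
for x_chunk, y_chunk in chunks:
    # scale so the summed gradients match a full-batch mean loss
    loss = criterion(model(x_chunk), y_chunk) / len(chunks)
    loss.backward()          # gradients are added into each parameter's .grad
optimizer.step()             # one update using the accumulated gradients
optimizer.zero_grad()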

How to fine tune an FCN-32s for interactive object segmentation

I'm trying to implement the proposed model in a CVPR paper (Deep Interactive Object Selection) in which the data set contains 5 channels for each input sample:
1. Red
2. Blue
3. Green
4. Euclidean distance map associated with positive clicks
5. Euclidean distance map associated with negative clicks
To do so, I should fine-tune the FCN-32s network using object binary masks as labels.
Since the first conv layer now has 2 extra input channels, I did net surgery to reuse the pretrained parameters for the first 3 channels and Xavier initialization for the 2 extra ones.
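Roughly, the surgery looked like the sketch below (the prototxt/caffemodel file names are placeholders; in the 5-channel prototxt I renamed the first conv layer, e.g. to conv1_1_5ch, so that copy_from() skips it instead of failing on the shape mismatch):
import caffe
import numpy as np

# original 3-channel FCN-32s with its pretrained weights
net3 = caffe.Net('fcn32s.prototxt', 'fcn32s_pretrained.caffemodel', caffe.TEST)
# new 5-channel definition; every layer except the renamed first conv gets the pretrained weights
net5 = caffe.Net('fcn32s_5ch.prototxt', caffe.TEST)
net5.copy_from('fcn32s_pretrained.caffemodel')

w3 = net3.params['conv1_1'][0].data              # shape (64, 3, 3, 3)
w5 = net5.params['conv1_1_5ch'][0].data          # shape (64, 5, 3, 3)

w5[:, :3] = w3                                   # reuse the pretrained RGB filters
fan_in = w5.shape[1] * w5.shape[2] * w5.shape[3]
scale = np.sqrt(3.0 / fan_in)                    # Xavier-style uniform range
w5[:, 3:] = np.random.uniform(-scale, scale, w5[:, 3:].shape)
net5.params['conv1_1_5ch'][1].data[...] = net3.params['conv1_1'][1].data  # copy the biases

net5.save('fcn32s_5ch_init.caffemodel')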
For the rest of the FCN architecture, I have these questions:
Should I freeze all the layers before "fc6" (except the first conv layer)? If yes, how will the extra channels of the first conv layer be learned? Are the gradients strong enough to reach the first conv layer during training?
What should the kernel size of "fc6" be? Should I keep 7? I saw in the "Caffe net_surgery" notebook that it depends on the output size of the last layer ("pool5").
The main problem is the number of outputs of the "score_fr" and "upscore" layers. Since I'm not doing class segmentation (which would use 21 outputs for 20 classes plus the background), how should I change it? What about 2 (one for the object and one for the non-object/background area)?
Should I change the "crop" layer "offset" to 32 to have center crops?
In case of changing each of these layers, what is the best initialization strategy for them? "bilinear" for "upscore" and "Xavier" for the rest?
Should I convert my binary label matrix values to zero-centered ones ({-0.5, 0.5}), or is it OK to use them with values in {0, 1}?
Any useful idea will be appreciated.
PS:
I'm using Euclidean loss, with "1" as the number of outputs for the "score_fr" and "upscore" layers. If I used 2 instead, I guess it should be softmax.
I can answer some of your questions.
The gradients will reach the first layer so it should be possible to learn the weights even if you freeze the other layers.
Change the num_output to 2 and finetune. You should get a good output.
I think you'll need to experiment with each of the options and see how the accuracy is.
You can use the values 0,1.

Choosing train images for convolutional neural network

The goal is to localise objects in images. I decided to modify and train an existing model. However, I can't decide whether I should train the model using masks or only with ROIs.
For example: for class 1 data, only the class 1 object will appear in the image and every other region will be filled with 0's; for the 2nd class I'll do the same thing and leave only the 2nd class's object in the mask, and so on for the 3rd and 4th classes.
The second way, using ROIs: I'll crop each class from the image without a mask, keeping only the region of interest.
Then I hope to continue with something similar to this: https://github.com/jazzsaxmafia/Weakly_detector
Shall I choose the first way or the second? Any comment like "Your plan won't work, try this" is also appreciated.
--Edit--
To be clear,
Original image : http://s31.postimg.org/btyn660bf/image.jpg
1'st approach using masks:
1'st class : http://s31.postimg.org/4s0pjywpn/class11.png
2'nd class : http://s31.postimg.org/3zy1krsij/class21.png
3'rd class : http://s31.postimg.org/itcp5j09n/class31.png
4'th class : http://s31.postimg.org/yowxv31gb/class41.png
1'st approach using ROI's:
1'st class : http://s31.postimg.org/4x4gtn40r/class1.png
2'nd class : http://s31.postimg.org/8s7uw7n6j/class2.png
3'rd class : http://s31.postimg.org/mxdny0w7v/class3.png
4'th class : http://s31.postimg.org/qfpnuex3v/class4.png
P.S.: The locations of the objects will be very similar in new examples, so maybe the mask approach is a bit more useful. For the ROI approach I need to normalise each object, and they have very different sizes. However, normalising the whole masked image may introduce much less variance relative to the original.
CNNs are generally quite robust to varying backgrounds assuming they're trained on a large amount of high-quality data. So I would guess that the difference between using the mask and ROI approaches won't be very substantial. For what it's worth, you will need to normalize the size of the images you're feeding to the CNN, regardless of which approach you use.
I have implemented some gesture recognition software and encountered a similar question. I could just use the raw, unprocessed ROI, or I could use a pre-processed version that filtered out much of the background. I basically tried it both ways and compared the accuracy of the models. In my case, I was able to get slightly better results from the pre-processed images. On the other hand, the backgrounds in my images were much more complex and varied. Anyway, my recommendation would be to build a solid mechanism for testing the accuracy of your model and experiment to see what works best.
Honestly, the most important thing is collecting lots of good samples for each class. In my case, I kept seeing substantial improvements until I hit about 5000 images per class. Since collecting lots of data takes a long time, it's best to capture and store the raw, full size images, along with any meta-data involved in the actual collection of the data so that you can experiment with different approaches (masking vs. ROI, varying input image sizes, other pre-processing such as histogram normalization, etc.) without having to collect new data.

Training a neural network

I have a picture of 1200*1175 pixels. I want to train a net (MLP or Hopfield) to learn a specific part of it (201*111 pixels) and save the weights, so I can use them in a new net (with the same architecture) to find that specific part without training it again. Now there are these questions: which kind of net is useful, MLP or Hopfield? If MLP, how many hidden layers? The trainlm function is unusable because of an "out of memory" error. I converted the picture to a binary image; is that useful?
What exactly do you need the solution to do? Find an object within an image (like "Where's Waldo"?)? Will the target object always be the same size and orientation? Might it look different because of lighting changes?
If you just need to find a fixed pattern of pixels within a larger image, I suggest using a straightforward correlation measure, such as cross-correlation, to find it efficiently.
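For example, a quick template-matching sketch using normalized cross-correlation in OpenCV (the file names are placeholders):
import cv2

image = cv2.imread("full_image.png", cv2.IMREAD_GRAYSCALE)       # the 1200*1175 picture
template = cv2.imread("target_patch.png", cv2.IMREAD_GRAYSCALE)  # the 201*111 part to find

result = cv2.matchTemplate(image, template, cv2.TM_CCORR_NORMED)
_, max_val, _, max_loc = cv2.minMaxLoc(result)
print("best match at top-left corner", max_loc, "with score", max_val)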
If you need to contend with any of the issues mentioned above, then there are two basic solutions: 1. build a model using examples of the object in different poses, scalings, etc., so that the model will recognize any of them, or 2. develop a way to normalize the patch of pixels being examined to minimize the effect of those distortions (like Hu's invariant moments). If nothing else, you'll want to perform some sort of data reduction to get the number of inputs down. Technically, you could also try a model which is invariant to rotations, etc., but I don't know how well those work. I suspect that they are more temperamental than traditional approaches.
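As a concrete illustration of the second option, OpenCV exposes Hu's invariant moments directly; a minimal sketch (the patch file name is a placeholder):
import cv2
import numpy as np

patch = cv2.imread("patch.png", cv2.IMREAD_GRAYSCALE)    # the pixel patch being examined
hu = cv2.HuMoments(cv2.moments(patch)).flatten()         # 7 values, invariant to translation/scale/rotation
hu = -np.sign(hu) * np.log10(np.abs(hu) + 1e-30)         # common log scaling for comparability
print(hu)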
I found AdaBoost to be helpful in picking out only the important bits of an image. That, and resizing the image to something very tiny (like 40x30) using a Gaussian filter, will speed it up and put weight on a larger area of the photo rather than on a tiny, insignificant pixel.