I'm using fully convolutional networks for semantic segmentation in Caffe, using the Cityscapes dataset.
This script converts class IDs, and says to set the IDs of classes to ignore to 255 and to "ignore these labels during training". How do we do that in practice? I mean, how do I 'tell' my network that 255 is not a true class like the other integers?
Thanks for giving me an intuition behind it.
With, e.g., a "SoftmaxWithLoss" layer, you can add loss_param { ignore_label: 255 } to tell Caffe to ignore this label:
    layer {
      name: "loss"
      type: "SoftmaxWithLoss"
      bottom: "prediction"
      bottom: "labels_with_255_as_ignore"
      loss_weight: 1
      loss_param: { ignore_label: 255 }
    }
I did not check it, but I believe ignore_label is also used by the InfogainLoss layer and some other loss layers.
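To give some intuition for what ignoring a label means, here is a minimal NumPy sketch (not Caffe's actual implementation). The function name softmax_loss_with_ignore is made up for illustration, and the sketch assumes the loss is averaged over the non-ignored pixels only. Pixels labelled 255 are simply left out of the loss, so they contribute no gradient either.

```python
import numpy as np

def softmax_loss_with_ignore(scores, labels, ignore_label=255):
    """Mean cross-entropy over the pixels whose label is not ignore_label.

    scores: (num_pixels, num_classes) raw class scores.
    labels: (num_pixels,) integer class ids, possibly containing ignore_label.
    """
    # Numerically stable log-softmax.
    shifted = scores - scores.max(axis=1, keepdims=True)
    log_prob = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

    # Rows that carry a real class label.
    rows = np.flatnonzero(labels != ignore_label)
    if rows.size == 0:
        return 0.0  # nothing to learn from in this batch
    # Log-probability of the true class, for the valid rows only.
    picked = log_prob[rows, labels[rows]]
    return float(-picked.mean())

scores = np.array([[2.0, 0.5, 0.1],
                   [0.2, 1.5, 0.3],
                   [9.0, 9.0, 9.0]])
labels = np.array([0, 1, 255])  # the last pixel is "ignore"

loss_masked = softmax_loss_with_ignore(scores, labels)
loss_first_two = softmax_loss_with_ignore(scores[:2], labels[:2])
```

The two losses come out identical: the pixel labelled 255 has no effect, whatever the network predicts for it.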
I'm new to TensorFlow and neural nets, and I have never used anything other than the JavaScript version of TensorFlow. Basically, I'm experimenting and studying all this.
Reading the (Python) TensorFlow docs I saw that pruning can be done with tf.contrib.model_pruning, but, as far as I have found, there is nothing similar for TensorFlow.js. So I'd like to experiment a bit and implement at least a very simple / basic pruning method.
This "very simple / basic pruning method" can be something like removing from the hidden layers those neurons whose weight is very near to 0. I would then train the model a bit more and see if I can recover the loss in accuracy.
I know I can access the weights with something like this:
    const weights = model.layers.map(layer => {
      return layer.getWeights()[0].dataSync();
    });
What I would like to know is whether it is actually possible to find and remove the units associated with those weights (and whether I can do this during training).
Thanks!
Edu
It is possible to set the weights on the model. The same way you retrieve a layer's weights with getWeights, you can use setWeights to change them.
    model.fit(x, y, {
      epochs: 1000,
      callbacks: {
        onEpochEnd: () => {
          // check your weights
          model.layers[0].getWeights()
          // set your weights
          model.layers[0].setWeights([tensors])
        }
      }
    })
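As for what to put into those tensors, here is a rough sketch of the "very simple / basic pruning method" from the question, written in NumPy rather than TensorFlow.js. The helper prune_units and the (n_inputs, n_units) kernel layout are assumptions for illustration. It does not physically remove units. It only zeroes the kernel column and bias of each unit whose incoming weights are all near zero, and the result would be what you write back with setWeights.

```python
import numpy as np

def prune_units(kernel, bias, threshold):
    """Zero out units (columns) whose incoming weights are all near zero.

    kernel: (n_inputs, n_units) dense-layer weight matrix (assumed layout).
    bias:   (n_units,) bias vector.
    Returns pruned copies plus the boolean keep-mask.
    """
    # L2 norm of each unit's incoming weights.
    norms = np.linalg.norm(kernel, axis=0)
    keep = norms >= threshold
    # Multiplying by the mask zeroes the pruned columns and biases.
    return kernel * keep, bias * keep, keep

kernel = np.array([[0.8, 1e-4, -0.5],
                   [-0.3, -2e-4, 0.9]])
bias = np.array([0.1, 0.05, -0.2])

pruned_kernel, pruned_bias, keep = prune_units(kernel, bias, threshold=1e-2)
```

Since further training can move the zeroed weights away from zero again, the mask would probably have to be re-applied after each epoch (e.g. inside onEpochEnd) if you want the units to stay pruned.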
I've noticed that a frequent occurrence during training is NANs being introduced.
Often times it seems to be introduced by weights in inner-product/fully-connected or convolution layers blowing up.
Is this occurring because the gradient computation is blowing up? Or is it because of weight initialization (if so, why does weight initialization have this effect)? Or is it likely caused by the nature of the input data?
The overarching question here is simply: What is the most common reason for NANs to occur during training? And secondly, what are some methods for combatting this (and why do they work)?
I came across this phenomenon several times. Here are my observations:
Gradient blow up
Reason: large gradients throw the learning process off-track.
What you should expect: Looking at the runtime log, you should look at the loss values per iteration. You'll notice that the loss starts to grow significantly from iteration to iteration. Eventually the loss becomes too large to be represented by a floating point variable and turns into nan.
What can you do: Decrease the base_lr (in the solver.prototxt) by an order of magnitude (at least). If you have several loss layers, you should inspect the log to see which layer is responsible for the gradient blow up and decrease the loss_weight (in train_val.prototxt) for that specific layer, instead of the general base_lr.
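A toy illustration of this failure mode, in plain Python rather than Caffe: gradient descent on f(w) = w**2, once with a step size that is too large and once with a step size an order of magnitude smaller. The function run_sgd and the specific numbers are made up for this sketch.

```python
import math

def run_sgd(lr, steps=2000):
    """Minimise f(w) = w**2 with plain gradient descent; return the loss history."""
    w = 1.0
    losses = []
    for _ in range(steps):
        grad = 2.0 * w      # df/dw
        w -= lr * grad      # with lr = 1.5 this maps w to -2 * w
        losses.append(w * w)
    return losses

diverged = run_sgd(lr=1.5)     # loss grows every iteration, then inf, then nan
converged = run_sgd(lr=0.15)   # same problem, learning rate divided by 10
```

With the large step the loss grows from iteration to iteration, overflows to inf, and finally becomes nan once an inf - inf shows up in the update. With the smaller step the same problem converges.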
Bad learning rate policy and params
Reason: caffe fails to compute a valid learning rate and gets 'inf' or 'nan' instead. This invalid rate multiplies all updates and thus invalidates all parameters.
What you should expect: Looking at the runtime log, you should see that the learning rate itself becomes 'nan', for example:
... sgd_solver.cpp:106] Iteration 0, lr = -nan
What can you do: fix all parameters affecting the learning rate in your 'solver.prototxt' file.
For instance, if you use lr_policy: "poly" and you forget to define max_iter parameter, you'll end up with lr = nan...
For more information about learning rate in caffe, see this thread.
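To see why a missing max_iter gives nan, here is a small sketch of the "poly" formula, base_lr * (1 - iter / max_iter) ^ power, evaluated in float32 with NumPy. It assumes that an unset max_iter behaves like 0. The function poly_lr is only an illustration, not Caffe code.

```python
import numpy as np

def poly_lr(base_lr, iteration, max_iter, power):
    """Caffe-style "poly" policy: base_lr * (1 - iter / max_iter) ** power."""
    # Silence NumPy's warnings so the nan propagates quietly, as it would in C++.
    with np.errstate(divide="ignore", invalid="ignore"):
        frac = np.float32(iteration) / np.float32(max_iter)
        return np.float32(base_lr) * (np.float32(1.0) - frac) ** np.float32(power)

lr_ok = poly_lr(0.01, iteration=0, max_iter=10000, power=0.9)
lr_bad = poly_lr(0.01, iteration=0, max_iter=0, power=0.9)  # max_iter "forgotten"
```

At iteration 0 the fraction is 0 / 0, which is nan in floating point, and the nan carries through to the learning rate.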
Faulty Loss function
Reason: Sometimes the computation of the loss in the loss layers causes nans to appear. For example, feeding an InfogainLoss layer with non-normalized values, using a custom loss layer with bugs, etc.
What you should expect: Looking at the runtime log you probably won't notice anything unusual: loss is decreasing gradually, and all of a sudden a nan appears.
What can you do: See if you can reproduce the error, add printout to the loss layer and debug the error.
For example: Once I used a loss that normalized the penalty by the frequency of label occurrence in a batch. It just so happened that if one of the training labels did not appear in the batch at all - the loss computed produced nans. In that case, working with large enough batches (with respect to the number of labels in the set) was enough to avoid this error.
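A hypothetical reconstruction of that kind of bug in NumPy (this is not the actual loss I used, and the safe branch is an extra guard added for this sketch): the loss is averaged per class and then over classes, so a class that is absent from the batch produces a 0 / 0.

```python
import numpy as np

def frequency_weighted_loss(per_sample_loss, labels, num_classes, safe=False):
    """Average the loss per class, then average over classes."""
    total = np.bincount(labels, weights=per_sample_loss, minlength=num_classes)
    count = np.bincount(labels, minlength=num_classes).astype(np.float64)
    if safe:
        # Only average over classes that actually occur in the batch.
        present = count > 0
        return float((total[present] / count[present]).mean())
    with np.errstate(invalid="ignore", divide="ignore"):
        return float((total / count).mean())  # 0 / 0 for a missing class

losses = np.array([0.7, 0.2, 0.4])
labels = np.array([0, 0, 2])  # class 1 never appears in this batch

naive = frequency_weighted_loss(losses, labels, num_classes=3)
guarded = frequency_weighted_loss(losses, labels, num_classes=3, safe=True)
```

The naive version returns nan for this batch, while the guarded version averages over the two classes that are present.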
Faulty input
Reason: you have an input with nan in it!
What you should expect: once the learning process "hits" this faulty input - output becomes nan. Looking at the runtime log you probably won't notice anything unusual: loss is decreasing gradually, and all of a sudden a nan appears.
What can you do: re-build your input datasets (lmdb/leveldb/hdf5...) and make sure you do not have bad image files in your training/validation set. For debugging, you can build a simple net that reads the input layer, has a dummy loss on top of it and runs through all the inputs: if one of them is faulty, this dummy net should also produce nan.
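If the inputs can be loaded as arrays outside of Caffe, a quick check along the same lines can be sketched in NumPy. The helper find_non_finite and the sample data are made up for illustration. It is an alternative to the dummy net, not a replacement for re-building the dataset.

```python
import numpy as np

def find_non_finite(samples):
    """Return the indices of samples that contain nan or inf."""
    return [i for i, x in enumerate(samples)
            if not np.isfinite(np.asarray(x, dtype=np.float64)).all()]

samples = [np.zeros((1, 4, 4)),                      # fine
           np.array([[1.0, np.nan], [0.0, 2.0]]),    # contains nan
           np.array([np.inf, 1.0])]                  # contains inf

bad = find_non_finite(samples)
```

Here bad lists the positions of the two faulty samples, which tells you which inputs to inspect or drop.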
Stride larger than kernel size in "Pooling" layer
For some reason, choosing stride > kernel_size for pooling may result in nans. For example:
    layer {
      name: "faulty_pooling"
      type: "Pooling"
      bottom: "x"
      top: "y"
      pooling_param {
        pool: AVE
        stride: 5
        kernel_size: 3
      }
    }
results in nans in y.
Instabilities in "BatchNorm"
It was reported that under some settings "BatchNorm" layer may output nans due to numerical instabilities.
This issue was raised in BVLC/caffe, and PR #5136 attempts to fix it.
Recently, I became aware of the debug_info flag: setting debug_info: true in 'solver.prototxt' makes caffe print more debug information to the log (including gradient magnitudes and activation values) during training. This information can help in spotting gradient blowups and other problems in the training process.
In my case, not setting the bias in the convolution/deconvolution layers was the cause.
Solution: add the following to the convolution layer parameters.
    bias_filler {
      type: "constant"
      value: 0
    }
This answer is not about a cause for nans, but rather proposes a way to help debug it.
You can use this Python layer:
    import caffe
    import numpy as np

    class checkFiniteLayer(caffe.Layer):
        def setup(self, bottom, top):
            # param_str from the prototxt is used as a prefix in error messages
            self.prefix = self.param_str

        def reshape(self, bottom, top):
            pass  # used in-place, nothing to reshape

        def forward(self, bottom, top):
            for i in range(len(bottom)):
                # number of nan/inf entries in this bottom's data
                isbad = np.sum(1 - np.isfinite(bottom[i].data[...]))
                if isbad > 0:
                    raise Exception("checkFiniteLayer: %s forward pass bottom %d has %.2f%% non-finite elements" %
                                    (self.prefix, i, 100 * float(isbad) / bottom[i].count))

        def backward(self, top, propagate_down, bottom):
            for i in range(len(top)):
                if not propagate_down[i]:
                    continue
                # number of nan/inf entries in this top's gradient
                isf = np.sum(1 - np.isfinite(top[i].diff[...]))
                if isf > 0:
                    raise Exception("checkFiniteLayer: %s backward pass top %d has %.2f%% non-finite elements" %
                                    (self.prefix, i, 100 * float(isf) / top[i].count))
Add this layer to your train_val.prototxt at the points you suspect may cause trouble:
    layer {
      type: "Python"
      name: "check_loss"
      bottom: "fc2"
      top: "fc2"  # "in-place" layer
      python_param {
        module: "check_finite_layer"  # module name of check_finite_layer.py; its folder must be in $PYTHONPATH
        layer: "checkFiniteLayer"
        param_str: "prefix-check_loss"  # string for printouts
      }
    }
The learning rate is too high and should be decreased.
The accuracy in my RNN code was nan; selecting a lower value for the learning rate fixed it.
One more solution for anyone stuck like I just was:
I was receiving nan or inf losses on a network I set up with float16 dtype across the layers and input data. After all else failed, it occurred to me to switch back to float32, and the nan losses were solved!
So bottom line, if you switched dtype to float16, change it back to float32.
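A likely reason, sketched here with NumPy rather than with any particular framework: float16 cannot represent anything above 65504, so values that are perfectly ordinary in float32 overflow to inf, and an inf - inf then gives nan. The numbers below are arbitrary examples.

```python
import numpy as np

# Silence the overflow/invalid warnings so the script runs quietly.
with np.errstate(over="ignore", invalid="ignore"):
    big = np.float32(70000.0)            # fine in float32
    as_half = np.float16(big)            # overflows: float16 max is 65504
    diff = as_half - as_half             # inf - inf
    half_exp = np.exp(np.float16(12.0))  # exp(12) is about 162755, too big for float16

half_max = float(np.finfo(np.float16).max)
```

After this runs, as_half and half_exp are inf and diff is nan, while the same computations in float32 stay finite.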
I was trying to build a sparse autoencoder and had several layers in it to induce sparsity. While running my net, I encountered the NaN's. On removing some of the layers (in my case, I actually had to remove 1), I found that the NaN's disappeared. So, I guess too much sparsity may lead to NaN's as well (some 0/0 computations may have been invoked!?)
I'm using Caffe (http://caffe.berkeleyvision.org/) for image classification. I'm using it on Windows and everything seems to be compiling just fine.
To start learning I followed the MNIST tutorial (http://caffe.berkeleyvision.org/gathered/examples/mnist.html). I downloaded the data and ran ..\caffe.exe train --solver=...examples\mnist\lenet_solver.prototxt. It ran 10,000 iterations, printed that the accuracy was 98.5, and generated two files: lenet_iter_10000.solverstate and lenet_iter_10000.caffemodel.
So I thought it would be fun to try to classify my own image. It should be easy, right?
I can find resources such as: https://software.intel.com/en-us/articles/training-and-deploying-deep-learning-networks-with-caffe-optimized-for-intel-architecture#Examples telling how to prepare, train and time my model. But each time a tutorial/article gets to actually feeding a single instance into the CNN, it skips to the next point and says to download some new model. Some resources say to use classifier.bin/.exe, but this file takes an imagenet_mean.binaryproto or a similar file for MNIST. I have no idea where to find or how to generate this file.
So in short: once I have trained a CNN using Caffe, how do I input a single image and get the output using the files I already have?
Update: Based on the help, I got the net to recognize an image, but the recognition is not correct even though the network had an accuracy of 99.0%. I used the following Python code to recognize an image:
    import caffe
    import numpy as np
    from PIL import Image

    NET_FILE = 'deploy.prototxt'
    MODEL_FILE = 'lenet_iter_10000.caffemodel'
    net = caffe.Net(NET_FILE, MODEL_FILE, caffe.TEST)
    im = Image.open("img4.jpg")
    in_ = np.array(im, dtype=np.float32)
    net.blobs['data'].data[...] = in_
    out = net.forward()  # Run the network for the given input image
    print(out)
I'm not sure if I format the image correctly for the MNIST example. The image is a 28x28 grayscale image with a basic 4. Do I have to do more transformations on the image?
The network (deploy) looks like this (start and end):
    input: "data"
    input_shape {
      dim: 1   # batch size
      dim: 1   # number of channels (1 = grayscale)
      dim: 28  # height
      dim: 28  # width
    }
    ....
    layer {
      name: "loss"
      type: "Softmax"
      bottom: "ip2"
      top: "loss"
    }
If I understand the question correctly, you have a trained model and you want to test the model using your own input images. There are many ways to do this.
One method I commonly use is to run a python script similar to what I have here.
Just keep in mind that you have to build the Python interface of Caffe using make pycaffe, and point to its folder by editing the line sys.path.append('../../../python')
Also edit the following lines so they point to your model files.
NET_FILE = 'deploy.prototxt'
MODEL_FILE = 'fcn8s-heavy-pascal.caffemodel'
Edit the following line. Instead of score you should use the last layer of your network to get the output.
out = net.blobs['score'].data
You need to create a deploy.prototxt file from your original network.prototxt file. The data layer has to look like this:
    input: "data"
    input_shape {
      dim: 1
      dim: [channels]
      dim: [height]
      dim: [width]
    }
where you replace [channels], [height], and [width] with the correct values for your image (Caffe orders blob dimensions as batch, channels, height, width).
You also need to remove any layers that take "label" as a bottom input (this would usually be only your loss layer).
Then you can use this deploy.prototxt file to test your inputs using MATLAB or Python.
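One thing that often goes wrong at this point is input preprocessing: the image fed at test time has to be transformed the same way as the training data. Below is a minimal NumPy sketch for the MNIST case. It assumes the network was trained with the stock lenet_train_test.prototxt, whose data layer uses scale: 0.00390625 (1/256). The helper prepare_mnist_input is hypothetical, and the constant array stands in for a real image loaded with Image.open.

```python
import numpy as np

def prepare_mnist_input(gray_image, scale=1.0 / 256):
    """Turn a 28x28 grayscale image (values 0-255) into a 1x1x28x28 float32 blob."""
    arr = np.asarray(gray_image, dtype=np.float32)
    if arr.shape != (28, 28):
        raise ValueError("expected a 28x28 single-channel image, got %r" % (arr.shape,))
    # Same scaling as the training data layer, then add batch and channel axes.
    return (arr * scale).reshape(1, 1, 28, 28)

fake_image = np.full((28, 28), 255, dtype=np.uint8)  # stand-in for a loaded image
blob = prepare_mnist_input(fake_image)
# With pycaffe this would then be fed as: net.blobs['data'].data[...] = blob
```

Also note that the MNIST digits are white on a black background, so a dark digit on a light background would probably need to be inverted (255 - image) before scaling.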
I have a big net with many layers. I added a new fully connected layer to the net and want to fine-tune it. However, it is tedious to set lr_mult: 0 in every layer except the new one, since there are so many layers in the net.
Is there a good way to solve this problem?
Thanks.
How about, instead of setting lr_mult: 0 for all parameters to all layers prior to the new fully connected layer, just stop the back propagation after the new layer?
You can do that by setting propagate_down: false.
For example:
    layer {
      name: "new_layer"
      type: "InnerProduct"
      ...
      inner_product_param {
        ...
      }
      propagate_down: false  # do not continue backprop after this layer
    }
Alternatively, you can use sed, a command line utility, to directly change all entries in your prototxt file:
    ~$ sed -i -E 's/lr_mult *: *[0-9.]+/lr_mult: 0/g' train_val.prototxt
This one-liner changes every lr_mult in your train_val.prototxt to zero, including decimal values such as 0.1. You'll only need to manually set the lr_mult for the new layer.
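If sed is not available (e.g. on Windows), the same substitution can be sketched in Python with the re module. The function freeze_all and the sample text are made up for illustration. It works on the prototxt as a plain string, so reading and writing the file is left to you.

```python
import re

# "lr_mult", optional spaces, a colon, optional spaces, then the numeric value.
LR_MULT = re.compile(r"(lr_mult\s*:\s*)[0-9.eE+-]+")

def freeze_all(prototxt_text):
    """Set every lr_mult value in a prototxt string to 0."""
    return LR_MULT.sub(r"\g<1>0", prototxt_text)

example = ("param { lr_mult: 1 }\n"
           "param { lr_mult: 0.1 }\n"
           "param { lr_mult:2 decay_mult: 0 }")
frozen = freeze_all(example)
```

As with the sed version, the lr_mult of the new layer has to be set back by hand afterwards.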
I am trying to run a Caffe experiment. I am using the following loss layer in my Train.prototxt:
    layer {
      name: "loss"
      type: "SoftmaxWithLoss"
      bottom: "ip2"
      bottom: "label"
      include {
        phase: TRAIN
      }
    }
I see the following configuration displayed when training starts:
I0923 21:19:13.101313 26423 net.cpp:410] loss <- ip2
I0923 21:19:13.101323 26423 net.cpp:410] loss <- label
I0923 21:19:13.101339 26423 net.cpp:368] loss -> (automatic)
I have not given a top parameter in the loss layer.
What exactly does "automatic" (loss -> (automatic)) mean here?
Thanks in advance!
Caffe layers, including loss layers, produce Blobs (4-D arrays) as the output of their computations. If you don't set a Blob name through the top parameter, the corresponding Blob will be added to the "output" of the net.
This means that, if you call the Net::forward() method, it will return a list of Blobs, i.e., the ones that are not bound as the input of another layer.
When you call the Caffe training tool, it automatically prints such Blobs to the screen. This way you can follow the value of the loss or accuracy during training.