How do I use a pre-trained Caffe model? - neural-network

I have some questions about how to actually interact with a pre-trained Caffe model. In my case I'm using a model for scene recognition.
In the caffe git repository, there are some code examples in Python and C++ on the implementations of Image Classifiers. However, those do not apply to my use case (since they only classify the input image as ONE class).
My goal is an application that takes an input image (jpg) and outputs the highest predicted class label for each pixel in the input image (e.g., indices for sky, beach, road, car).
Could anyone give me some pointers on how to proceed?
There already seem to be implementations of this. This demo is kind of what I want.
Thank you!

What you are looking for is not image classification, but rather semantic segmentation.
A recent work by Jonathan Long, Evan Shelhamer and Trevor Darrell is based on Caffe, and can be found here. It uses a fully convolutional network, that is, a network with no "InnerProduct" layers, only convolutional layers, which is therefore capable of producing outputs of different sizes for inputs of different sizes.
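To make the "class label per pixel" idea concrete, here is a minimal sketch of the last step only: turning a per-class score map into a label map by taking the highest-scoring class at each pixel. The scores below are made up for illustration; in practice they would come from the output blob of a segmentation network, and the function name `label_map` is just a placeholder.

```python
def label_map(scores):
    """scores[c][y][x] is the score of class c at pixel (y, x).

    Returns labels[y][x], the index of the highest-scoring class per pixel.
    """
    num_classes = len(scores)
    height = len(scores[0])
    width = len(scores[0][0])
    return [
        [max(range(num_classes), key=lambda c: scores[c][y][x]) for x in range(width)]
        for y in range(height)
    ]


# Hypothetical scores for 3 classes (0 = sky, 1 = beach, 2 = road) on a 2x2 image.
scores = [
    [[0.90, 0.80], [0.10, 0.20]],  # sky
    [[0.05, 0.10], [0.70, 0.10]],  # beach
    [[0.05, 0.10], [0.20, 0.70]],  # road
]
labels = label_map(scores)
print(labels)
```

Because the network is fully convolutional, the score map (and therefore the label map) simply grows or shrinks with the input image.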


Can we make a convolutional network that uses more than one image to make a prediction?

I cropped the following image from a tutorial.
This diagram shows the rough structure of a standard neural network: it takes one image as input and makes a prediction.
What I am thinking about is some kind of parallel structure, something like the following image.
Not exactly as in the above image, but you can see I am trying to use two images to make one prediction. The image is only there to give you an idea of what I am asking.
Is it possible to use more than one (two, three, ...) images like this, or in any other way, to make one prediction? This is not meant for actual photo classification, but I think such a technique could be used in a field like audio classification, where a graphical representation of the data is used with image classification techniques.
Any advice, guidance or opinion on this?
If we consider implementing exactly what is in the diagram with a high-level API like Keras (keras.models.Sequential), all we can do is keep adding one layer after another.
So what kind of technology can I use to implement the parallel structure?
Yes, you can use more than one image as input. See for example the Siamese neural network, which takes 2 images as input and passes both through a shared network architecture.
If instead you want an arbitrary and variable number of images as input, you can use an architecture based on recurrent neural networks, such as a Convolutional LSTM, which essentially applies a CNN to every image of the input sequence and aggregates the results with an LSTM recurrent network.
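As a rough illustration of the "shared network" idea, the sketch below passes two inputs through the same weights and compares the resulting embeddings. It is a toy: the single linear layer with ReLU, the weight values and the function names are all assumptions made for this example, not a real Siamese architecture. In Keras, multi-input graphs of this kind are normally built with the functional API (keras.Model with several keras.Input tensors) rather than with Sequential.

```python
def embed(image, weights):
    """Shared branch: one linear layer followed by ReLU.

    image is a flat list of pixel values; weights is a list of rows.
    """
    return [max(0.0, sum(w * p for w, p in zip(row, image))) for row in weights]


def siamese_distance(image_a, image_b, weights):
    """Run both inputs through the SAME weights, then compare the embeddings."""
    emb_a = embed(image_a, weights)
    emb_b = embed(image_b, weights)
    return sum((a - b) ** 2 for a, b in zip(emb_a, emb_b)) ** 0.5


# Hypothetical weights and 4-pixel "images".
shared_weights = [[1.0, 0.0, -1.0, 0.0], [0.0, 1.0, 0.0, -1.0]]
img1 = [1.0, 2.0, 0.0, 0.0]
img2 = [1.0, 2.0, 0.0, 0.0]
img3 = [0.0, 0.0, 1.0, 2.0]

print(siamese_distance(img1, img2, shared_weights))  # identical inputs
print(siamese_distance(img1, img3, shared_weights))  # different inputs
```

The key design point is that there is only one set of weights; both branches are the same function, so the two inputs are mapped into the same embedding space.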

Why do we need CNNs for object detection?

I want to ask a general question. Nowadays deep learning, especially the Convolutional Neural Network (CNN), is used in every field. Sometimes it is not necessary to use a CNN for the problem, but researchers use it anyway and follow the trend.
So, is object detection a kind of problem where a CNN is really needed?
That is an unfortunate question. In the title you ask about CNNs, but in the body you ask about deep learning in general.
We don't necessarily need deep learning for object recognition, but trained deep networks get better results. Companies like Google are thankful for every percentage point of improvement.
As for CNNs, they get better results than "traditional" ANNs and also have fewer parameters because of weight sharing. CNNs also allow transfer learning (you take a feature detector, i.e. the convolution and pooling layers, and then connect your own fully connected layers to it).
A key concept of CNNs is the idea of translational invariance. In short, using a convolutional kernel on an image allows the machine to learn a set of weights for a specific feature (an edge, or a much more detailed object, depending on the depth of the layer in the network) and apply it across the entire image.
Consider detecting a cat in an image. If we designed some set of weights that allowed the learner to recognize a cat, we would like those weights to be the same no matter where the cat is in the image! So we would "assign" a kernel in a convolutional layer to detecting cats, and then convolve it over the entire image.
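The point about reusing the same weights everywhere can be sketched with a tiny hand-written example. Here a 2x2 diagonal stands in for the "cat", and one fixed kernel is slid over two images in which the pattern sits at different positions. The pattern, the kernel values and the helper names are assumptions for illustration only; a real network would learn its kernels.

```python
def correlate2d(image, kernel):
    """Valid-mode 2D cross-correlation (what CNN "convolution" layers compute)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[y + i][x + j] * kernel[i][j] for i in range(kh) for j in range(kw))
            for x in range(out_w)
        ]
        for y in range(out_h)
    ]


def argmax2d(response):
    """Return (row, col) of the strongest response."""
    positions = ((y, x) for y in range(len(response)) for x in range(len(response[0])))
    return max(positions, key=lambda pos: response[pos[0]][pos[1]])


def place(pattern, top, left, size=6):
    """Paste a pattern onto an empty size x size image."""
    image = [[0] * size for _ in range(size)]
    for i, row in enumerate(pattern):
        for j, value in enumerate(row):
            image[top + i][left + j] = value
    return image


pattern = [[1, 0], [0, 1]]   # toy "cat": a 2x2 diagonal
kernel = [[1, -1], [-1, 1]]  # one fixed set of weights tuned to that pattern

pos_a = argmax2d(correlate2d(place(pattern, 0, 0), kernel))
pos_b = argmax2d(correlate2d(place(pattern, 3, 2), kernel))
print(pos_a, pos_b)
```

The same four weights find the pattern in both images; only the location of the peak response moves with the pattern.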
Whatever the reason for the recent successes of CNNs, it should be noted that regular fully connected ANNs should, in principle, be able to perform just as well. The problem is that they quickly become computationally infeasible on larger images, whereas CNNs are much more efficient due to parameter sharing.
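A quick back-of-the-envelope count shows why parameter sharing matters. The sketch below compares the number of weights in a first layer with 32 units or filters on RGB images; the image sizes, the 3x3 kernel and the layer width are arbitrary choices for illustration.

```python
def dense_params(in_h, in_w, in_channels, units):
    """Weights + biases of a fully connected layer on a flattened image."""
    return in_h * in_w * in_channels * units + units


def conv_params(kernel_h, kernel_w, in_channels, filters):
    """Weights + biases of a conv layer; independent of the image size."""
    return kernel_h * kernel_w * in_channels * filters + filters


dense_small = dense_params(128, 128, 3, 32)  # 128x128 RGB input
dense_large = dense_params(600, 600, 3, 32)  # 600x600 RGB input
conv_any = conv_params(3, 3, 3, 32)          # any input size

print(dense_small, dense_large, conv_any)
```

The fully connected layer grows with the number of pixels, while the convolutional layer stays at a few hundred weights regardless of image size.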

Convolution Neural Network for image detection/classification

So here is the setup: I have a set of images (labeled train and test) and I want to train a conv net that tells me whether or not a specific object is within an image.
To do this, I followed the TensorFlow tutorial on MNIST and trained a simple conv net on images of size 128x128 that are cropped to the area of interest (the object). The architecture is as follows: three successive blocks, each consisting of 2 conv layers and 1 max-pool down-sampling layer, followed by one fully connected softmax layer (with two classes, 0 and 1, for whether the object is present or not).
I implemented it using TensorFlow, and this works quite well, but since I have enough computing power I was wondering how I could increase the complexity of the classifier:
- adding more layers?
- adding more channels at each layer? (currently 32, 64, 128, and 1024 for the fully connected layer)
- anything else?
But the most important part is that I now want to detect this same object in larger images (roughly 600x600, whereas the size of the object should be around 100x100).
I was wondering how I could use the previously trained "small" network for small images in order to pretrain a larger network on the large images? One option could be to classify the image using a sliding window of size 128x128 and scan the whole image, but I would like to try, if possible, to train a whole network on it.
Any suggestions on how to proceed? Or an article / resource tackling this kind of problem? (I am really new to deep learning, so sorry if this is a stupid question...)
Thanks !
I suggest that you continue reading on the field overall. Your search keys include CNN, image classification, neural net, AlexNet, GoogleNet, and ResNet. This will return many articles, on-line classes and lectures, and other materials to help you learn about classification with neural nets.
Don't just add layers or filters: the complexity of the topology (net design) must be fitted to the task; a net that's too complex will over-fit the training data. The one you've been using is probably LeNet; the three I cite above are for the ImageNet image classification contest.
Since you are working with images, I would suggest using a pretrained image classification network (like VGG, AlexNet, etc.) and fine-tuning it on your 128x128 image data. In my experience, unless you have a very large dataset, a fine-tuned network will give better accuracy and also save training time. After building a good image classifier on your dataset, you can use any popular algorithm to generate region proposals from the large image. Then pass the region proposals to the classification network one by one and check whether it classifies each proposal as positive or negative. If a proposal is classified as positive, your object is most probably present in that region; otherwise it is not. If the classifier marks many overlapping proposals as containing the object, you can use non-maximum suppression to reduce the number of positive proposals.
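A minimal sketch of the scan-and-suppress part of this pipeline is shown below. Plain sliding windows are used as a stand-in for a real region proposal algorithm, the scores are made up rather than produced by a classifier, and the stride of 64 and IoU threshold of 0.5 are arbitrary assumptions.

```python
def sliding_windows(image_size, window, stride):
    """Square boxes (x1, y1, x2, y2) covering a square image."""
    last = image_size - window
    starts = list(range(0, last + 1, stride))
    if starts[-1] != last:
        starts.append(last)  # make sure the right/bottom edge is covered
    return [(x, y, x + window, y + window) for y in starts for x in starts]


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the best box, drop boxes that overlap it too much, repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


# Candidate regions on a 600x600 image, each the size the classifier expects.
windows = sliding_windows(image_size=600, window=128, stride=64)

# Hypothetical positive detections with classifier scores.
boxes = [(100, 100, 228, 228), (110, 105, 238, 233), (400, 400, 528, 528)]
scores = [0.9, 0.8, 0.7]
kept = non_max_suppression(boxes, scores)
print(len(windows), kept)
```

Here the second box overlaps the first heavily and is suppressed, while the third box covers a different region and survives.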

Training a model for Latent-SVM

I am very keen to train a new model from my own data set of faces!
I have found no information about this topic, so I hope my notes can help people and that I can get some answers as well.
I will try to explain the steps I needed to train my own model, followed by some questions...
I have downloaded the Latent code from:
I have downloaded the PASCAL VOC 2008 code (devkit) from:
I have emulated the file/folder structure of PASCAL VOC with my own data set:
- Annotations, where I created a .xml in which I defined an object, face (each image contains only one face). I didn't define difficulties or poses...
- JPEGImages, where I stored all the images
- ImageSets, where I defined three files:
  - test.txt, where I wrote the file names of my positive samples
  - train.txt, where I wrote the file names of my negative samples
  - trainval.txt, where I wrote the file names of my positive samples (exactly the same file as test.txt)
I have changed some things in globals.m and VOCinit.m (to tell the algorithm the path and the location of some files...)
Then I run the training with the command: pascal('face', 1);
Following these steps, the training runs to completion without failing and I get my own model, BUT I have some doubts...
Can you see anything weird in my explanation? Could it work?
Must the files test.txt and trainval.txt be identical? Why? What does that mean?
Do I have to choose the number of parts I want in the model INSIDE the function?
Imagine I have two kinds of samples (frontal faces and side faces) and I want to detect both. How can I address this? I thought I would have to train a model with two components, but how can I tell the training code which samples are frontal and which are side views? In the annotations, with the pose label? (I don't think so...) Is there another way to handle this?
Thank you for your time!!
I hope you can resolve my doubts :)
I think test.txt should contain samples (images) that will be used to estimate how good the system is after learning the faces. However, trainval.txt is used during the learning stage (training) to fine-tune the parameters of the model; it is an essential part of supervised learning.
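One way to follow this advice is to make the two lists disjoint, so that the images used for evaluation are never seen during training. The sketch below only produces the two lists of image names; the `face_000`-style names, the 20% test fraction and the function name are assumptions for illustration, and writing the lists to trainval.txt and test.txt is left out.

```python
import random


def split_ids(image_ids, test_fraction=0.2, seed=0):
    """Shuffle the ids reproducibly and split them into (trainval, test)."""
    rng = random.Random(seed)
    shuffled = list(image_ids)
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return sorted(shuffled[n_test:]), sorted(shuffled[:n_test])


# Hypothetical image names (file names without extension).
image_ids = ["face_%03d" % i for i in range(10)]
trainval_ids, test_ids = split_ids(image_ids)
print(len(trainval_ids), len(test_ids))
```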
Also, it is very hard to have one single SVM classify faces that are both frontal and sideways. Here is my suggestion:
Train one SVM to detect whether the input image is a frontal face or a sideways face. Call this something like SVM-0.
Train another SVM for frontal faces. This SVM will classify all your individuals. Note, however, that an SVM is usually a binary classifier, so make sure you choose the right SVM, one that has a multiclass architecture. Call this SVM-F.
Train a final SVM for sideways faces. Again, use a multiclass SVM. Call it SVM-S.
Present the input image to SVM-0 and, if it detects a frontal face, present the input again to SVM-F; otherwise, give the input to SVM-S.
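The routing described in these steps can be sketched as follows. The three classifiers are hypothetical stand-ins (plain functions over a made-up feature dictionary), not trained SVMs; only the control flow is the point.

```python
def classify_face(image, pose_svm, frontal_svm, side_svm):
    """Route the image: SVM-0 decides the pose, then the matching SVM runs."""
    pose = pose_svm(image)
    if pose == "frontal":
        return pose, frontal_svm(image)
    return pose, side_svm(image)


# Hypothetical stand-ins for trained models; each is just a callable here.
def svm_0(image):
    return "frontal" if image["eyes_visible"] == 2 else "side"


def svm_f(image):
    return "person_A" if image["width"] > 50 else "person_B"


def svm_s(image):
    return "person_A" if image["width"] > 30 else "person_B"


frontal_result = classify_face({"eyes_visible": 2, "width": 60}, svm_0, svm_f, svm_s)
side_result = classify_face({"eyes_visible": 1, "width": 20}, svm_0, svm_f, svm_s)
print(frontal_result, side_result)
```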
In my experience, you should expect very low performance from SVM-S. It is a hard problem to solve. Frontal faces are not a big deal, unless you are working with faces that vary in pose, illumination, and expression (PIE). Face recognition is greatly affected by PIE variations in the images.
I recommend this website; it contains very good information and tutorials for starters, with or without experience.

Where do filters/kernels for a convolutional network come from?

I've seen some tutorial examples, like the UFLDL convolutional net, where they use features obtained by unsupervised learning, and some others where kernels are engineered by hand (using Sobel and Gabor detectors, different sharpness/blur settings, etc.). Strangely, I can't find a general guideline on how one should choose a good kernel for something more than a toy network. For example, considering a deep network with many convolution-pooling layers, are the same kernels used at each layer, or does each layer have its own kernel subset? If so, where do these deeper layers' filters come from? Should I learn them using some unsupervised learning algorithm on data passed through the first convolution-and-pooling layer pair?
I understand that this question doesn't have a single answer; I'd be happy to just get the general approach (a review article would be fantastic).
The current state of the art suggests learning all the convolutional layers from the data using backpropagation (ref).
Also, this paper recommends small kernels (3x3) and pooling (2x2). You should train different filters for each layer.
Kernels in deep networks are mostly trained all at the same time in a supervised way (known inputs and outputs of network) using Backpropagation (computes gradients) and some version of Stochastic Gradient Descent (optimization algorithm).
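To show what "learning a kernel from known inputs and outputs" means, here is a toy sketch that recovers a single 1D kernel with stochastic gradient descent. The gradient is written out by hand instead of using a framework's backpropagation, and the signal length, learning rate, number of epochs and the "true" kernel are all assumptions chosen for illustration.

```python
import random


def correlate1d(signal, kernel):
    """Valid-mode 1D cross-correlation."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k)) for i in range(len(signal) - k + 1)]


def train_kernel(signals, targets, kernel_size=3, lr=0.05, epochs=200, seed=0):
    """Fit kernel weights by SGD on the mean squared error, one signal per step."""
    rng = random.Random(seed)
    kernel = [rng.uniform(-0.1, 0.1) for _ in range(kernel_size)]
    for _ in range(epochs):
        for signal, target in zip(signals, targets):
            output = correlate1d(signal, kernel)
            errors = [o - t for o, t in zip(output, target)]
            n = len(errors)
            # d(MSE)/d(kernel[j]) = 2/n * sum_i error[i] * signal[i + j]
            grads = [
                2.0 / n * sum(errors[i] * signal[i + j] for i in range(n))
                for j in range(kernel_size)
            ]
            kernel = [w - lr * g for w, g in zip(kernel, grads)]
    return kernel


# Known inputs and outputs: random signals filtered by a kernel we pretend not to know.
data_rng = random.Random(1)
true_kernel = [1.0, 0.0, -1.0]
signals = [[data_rng.uniform(-1, 1) for _ in range(20)] for _ in range(30)]
targets = [correlate1d(s, true_kernel) for s in signals]

learned = train_kernel(signals, targets)
print([round(w, 3) for w in learned])
```

Nothing about the edge-detector-like shape of the kernel was specified by hand; it emerges from the training pairs alone, which is the same principle that applies to every layer of a deep network.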
Kernels in different layers are usually independent. They can have different sizes and their numbers can differ as well. How to design a network is an open question and it depends on your data and the problem itself.
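The following sketch illustrates that each layer has its own independent set of kernels, with its own size and count. It walks through a hypothetical three-layer stack (5x5 then 3x3 kernels, each followed by 2x2 pooling) and reports the feature map size and the number of learned parameters per layer. The layer list, the 128x128 RGB input and the use of "same" padding are assumptions for illustration.

```python
def conv_output(size, kernel, padding=0, stride=1):
    """Spatial size after a convolution."""
    return (size + 2 * padding - kernel) // stride + 1


def describe(input_size, in_channels, layers):
    """layers is a list of (filters, kernel_size) with odd kernel sizes.

    Each conv uses 'same' padding and is followed by 2x2 max pooling.
    Returns one (output_size, channels, parameter_count) tuple per layer.
    """
    size, channels, rows = input_size, in_channels, []
    for filters, kernel in layers:
        size = conv_output(size, kernel, padding=kernel // 2)
        size = size // 2  # 2x2 pooling with stride 2
        params = kernel * kernel * channels * filters + filters
        rows.append((size, filters, params))
        channels = filters
    return rows


rows = describe(128, 3, [(32, 5), (64, 3), (128, 3)])
for row in rows:
    print(row)
```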
If you want to work with your own dataset, you should start with an existing pre-trained network [Caffe Model Zoo] and fine-tune it on your dataset. This way, the architecture of the network is fixed, as you have to respect the architecture of the original network. The networks you can download are trained on very large problems, which makes them able to generalize well to other classification/regression problems. If your dataset is at least partly similar to the original dataset, the fine-tuned networks should work very well.
A good place to get more information is the Caffe @ CVPR2015 tutorial.