Keras: using VGG16 to detect specific, non-generic item?

I'm learning about using neural networks and object detection, using Python and Keras. My goal is to detect something very specific in an image, let's say a very specific brand / type of car carburetor (part of a car engine).
The tutorials I found so far use the detection of cats and dogs as example, and many of those use a pre-trained VGG16 network to improve performance.
If I want to detect only my specific carburetor, and don't care about anything else in the image, does it make sense to use VGG16? Is VGG16 only useful when you want to detect many generic items, rather than one specific item?
Edit: I only want to know if there is a specific object (carburetor) in the image. No need to locate or put a box around it. I have about 1000 images of this specific carburetor for the network to train on.

VGG16 or some other pretrained neural network is primarily used for classification. That means you can use it to decide which category an image belongs to.
As I understand it, what you need is to detect where in an image a carburetor is located. For something like that you need a different, more complicated approach.
You could use NVIDIA DetectNet, or segmentation networks such as U-Net or SegNet.

VGG16 can be used for that. (Is it the best choice? That is an open question without a clear answer.)
But you must replace its final layers to fit your needs.
While a regular VGG model has about a thousand classes at its end, a cats vs. dogs VGG has its end changed to have two classes. In your case, you should change its ending to have only one class.
In Keras, you'd have to load the VGG model with the option include_top=False.
And you should then add your own final Dense layers (two or three dense layers at the end), making sure that the last layer has only one unit: Dense(1, activation='sigmoid').
This will work for "detecting" (yes / no results).
But if your goal is "locating/segmentation", then you should create your own version of a U-net or a SegNet, for instance.
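For the yes/no case, the head replacement described above might look like the following minimal sketch. The function name build_detector, the hidden layer size, the dropout rate and the decision to freeze the convolutional base are assumptions made for illustration, not part of the original answer.

```python
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.applications import VGG16


def build_detector(input_shape=(224, 224, 3), weights="imagenet"):
    """Binary 'is the object present?' classifier on top of a VGG16 base."""
    # include_top=False drops the original 1000-class classification head.
    base = VGG16(weights=weights, include_top=False, input_shape=input_shape)
    # Assumption: keep the pretrained filters fixed and train only the new head.
    base.trainable = False

    inputs = keras.Input(shape=input_shape)
    x = base(inputs, training=False)
    x = layers.Flatten()(x)
    x = layers.Dense(256, activation="relu")(x)  # hidden size is an arbitrary choice
    x = layers.Dropout(0.5)(x)
    # A single sigmoid unit gives a yes/no probability.
    outputs = layers.Dense(1, activation="sigmoid")(x)

    model = keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model
```

Training such a model would also need images without the carburetor as negative examples, and inputs would normally be passed through keras.applications.vgg16.preprocess_input first.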


Can we make a convolutional network that uses more than one image to make a prediction?

I cropped the following image from a tutorial.
This diagram shows the rough structure of a standard neural network: it takes one image as input and makes a prediction.
What I am thinking about is some kind of parallel structure, something like the following image.
It is not exactly as in the image above, but you can see that I am trying to use two images to make one prediction. The image is only meant to give an idea of what I am asking.
Is it possible to use more than one image (two, three, ...) like this, or in any other way, to make one prediction? This is not meant for actual photo classification, but I think such a technique could be used in a field like audio classification, where a graphical representation of the data is used with image classification techniques.
Any advice, guidance or opinion on this?
If we consider implementing exactly what is in the diagram with a high-level API like Keras (keras.models.Sequential), all we can do is keep adding one layer after another.
So what kind of technology can I use to implement the parallel structure?
Yes, you can use more than one image as input. See for example the Siamese Neural Network which takes as input 2 images and passes them through a shared network architecture.
If instead you want an arbitrary, variable number of images as input, you can use an architecture based on recurrent neural networks, such as a Convolutional LSTM, which combines convolutional processing of each image in the input sequence with an LSTM recurrent network.
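In Keras, a parallel structure like this is usually built with the functional API rather than Sequential. The sketch below feeds two images through one shared CNN branch, Siamese-style, and merges the features for a single prediction. The function name build_two_image_model, the input shape and all layer sizes are illustrative assumptions.

```python
from tensorflow import keras
from tensorflow.keras import layers


def build_two_image_model(image_shape=(64, 64, 1)):
    """Two image inputs, one shared CNN branch, one prediction."""
    # A small CNN that is reused for both inputs, so its weights are shared.
    encoder = keras.Sequential(
        [
            keras.Input(shape=image_shape),
            layers.Conv2D(16, 3, activation="relu"),
            layers.MaxPooling2D(),
            layers.Conv2D(32, 3, activation="relu"),
            layers.GlobalAveragePooling2D(),
        ],
        name="shared_encoder",
    )

    image_a = keras.Input(shape=image_shape, name="image_a")
    image_b = keras.Input(shape=image_shape, name="image_b")

    # Merge the two feature vectors and make a single prediction from them.
    merged = layers.Concatenate()([encoder(image_a), encoder(image_b)])
    hidden = layers.Dense(32, activation="relu")(merged)
    output = layers.Dense(1, activation="sigmoid")(hidden)
    return keras.Model(inputs=[image_a, image_b], outputs=output)
```

If the two images are of different kinds (for example two different graphical representations of the same audio clip), building a separate encoder for each input instead of sharing one gives two independent branches.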

Why do we need CNNs for object detection?

I want to ask a general question. Nowadays deep learning, and especially the Convolutional Neural Network (CNN), is used in almost every field. Sometimes a CNN is not necessary for the problem, but researchers use one anyway because they are following the trend.
So, is object detection a kind of problem where a CNN is really needed?
The question is a bit unclear: the title asks about CNNs, but the body asks about deep learning in general.
We don't necessarily need deep learning for object recognition, but trained deep networks get better results. Companies like Google value every percentage point of improvement.
As for CNNs, they get better results than "traditional" ANNs and also have fewer parameters because of weight sharing. CNNs also allow transfer learning: you take a feature detector (the convolution and pooling layers) and connect your own fully connected layers on top of it.
A key concept of CNNs is the idea of translational invariance. In short, using a convolutional kernel on an image allows the machine to learn a set of weights for a specific feature (an edge, or a much more detailed object, depending on the layering of the network) and apply it across the entire image.
Consider detecting a cat in an image. If we designed some set of weights that allowed the learner to recognize a cat, we would like those weights to be the same no matter where the cat is in the image! So we would "assign" one filter of a convolutional layer to detecting cats, and then convolve it over the entire image.
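A toy 1-D version makes the idea concrete: the same kernel weights respond to a pattern wherever it occurs, and only the position of the peak response moves. The kernel and signals below are made-up values chosen for illustration.

```python
def correlate(signal, kernel):
    """Valid 1-D cross-correlation: slide the kernel over the signal."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


kernel = [1, 2, 1]  # hypothetical "feature detector" weights

# The same pattern placed at the left and at the right end of a signal.
pattern_left = [1, 2, 1, 0, 0, 0, 0, 0]
pattern_right = [0, 0, 0, 0, 0, 1, 2, 1]

response_left = correlate(pattern_left, kernel)
response_right = correlate(pattern_right, kernel)

# The peak value is identical; only its location follows the pattern.
peak_left = response_left.index(max(response_left))
peak_right = response_right.index(max(response_right))
```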
Whatever the reason for the recent successes of CNNs, it should be noted that regular fully-connected ANNs should, in principle, perform just as well. The problem is that they quickly become computationally infeasible on larger images, whereas CNNs are much more efficient due to parameter sharing.
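A quick parameter count shows how large the gap is. The image size (224x224 RGB), the 64 output units/filters and the 3x3 kernel are example values assumed here, not figures from the answers above.

```python
def dense_params(height, width, channels, units):
    """Fully connected layer: one weight per input value per unit, plus biases."""
    return height * width * channels * units + units


def conv_params(kernel_size, in_channels, filters):
    """Conv layer: one small kernel per filter, shared across all positions."""
    return kernel_size * kernel_size * in_channels * filters + filters


# First layer on a 224x224 RGB image, producing 64 units or 64 feature maps.
dense_count = dense_params(224, 224, 3, 64)
conv_count = conv_params(3, 3, 64)
```

Note that the convolutional count does not depend on the image size at all, while the fully connected count grows with every extra pixel.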

Face Recognition based on Deep Learning (Siamese Architecture)

I want to use a pre-trained model for face identification. I am trying to use a Siamese architecture, which requires only a small number of images. Could you point me to a trained model that I can adapt to the Siamese architecture? How can I change the network model so that I can feed it two images and find their similarity (I do not want to create images based on the tutorial here)? I only want to use the system for a real-time application. Do you have any recommendations?
I suppose you can use this model, described in Xiang Wu, Ran He, Zhenan Sun, Tieniu Tan, A Light CNN for Deep Face Representation with Noisy Labels (arXiv 2015), as a starting point for your experiments.
As for the Siamese network, what you are trying to learn is a mapping from a face image into some high-dimensional vector space, in which distances between points reflect the (dis)similarity between faces.
To do so, you only need one network that takes a face as input and produces a high-dimensional vector as output.
However, to train this single network using the Siamese approach, you are going to duplicate it: creating two instances of the same net (you need to explicitly link the weights of the two copies). During training you are going to provide pairs of faces to the nets: one to each copy, then the single loss layer on top of the two copies can compare the high-dimensional vectors representing the two faces and compute a loss according to a "same/not same" label associated with this pair.
Hence, you only need the duplication for training. At test time ('deploy') you are going to have a single net providing you with a semantically meaningful high-dimensional representation of faces.
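The answer does not name a specific loss. One common choice for "same / not same" pairs is the contrastive loss, sketched below in plain Python on two embedding vectors. The margin value and the function names are assumptions, and some formulations add a factor of 0.5.

```python
import math


def euclidean_distance(a, b):
    """Distance between two embedding vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contrastive_loss(emb_a, emb_b, same, margin=1.0):
    """Pull 'same' pairs together; push 'not same' pairs at least `margin` apart."""
    d = euclidean_distance(emb_a, emb_b)
    if same:
        # Same identity: any distance is penalised.
        return d ** 2
    # Different identity: only penalised when closer than the margin.
    return max(0.0, margin - d) ** 2
```

At deploy time only the distance is needed: two faces are judged to match when euclidean_distance of their embeddings falls below a threshold chosen on validation data.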
For a more advanced Siamese architecture and loss see this thread.
On the other hand, you might want to consider the approach described in Oren Tadmor, Yonatan Wexler, Tal Rosenwein, Shai Shalev-Shwartz, Amnon Shashua, Learning a Metric Embedding for Face Recognition using the Multibatch Method (arXiv 2016). This approach is more efficient and easier to implement than pair-wise losses over image pairs.

How to combine two classification models in MATLAB?

I am trying to detect faces using MATLAB's built-in Viola-Jones face detection. Is there any way to combine two classification models, like "FrontalFaceCART" and "ProfileFace", into one in order to get a better result?
Thank you.
You can't combine models. That makes no sense in any classification task, since every classifier is different (it works differently, i.e. has a different algorithm behind it, and may also be trained differently).
According to the classification model(s) help (which can be found here), your two classifiers work as follows:
FrontalFaceCART is a model composed of weak classifiers, based on classification and regression tree analysis
ProfileFace is composed of weak classifiers, based on a decision stump
More information can be found in the link provided, but you can easily see that their inner behaviour is rather different, so you can't mix or combine them.
It's like (in Machine Learning) mixing a Support Vector Machine with a K-Nearest Neighbour: the first one uses separating hyperplanes whereas the latter is simply based on distance(s).
You can, however, train several models in parallel (i.e. independently) and choose the one that suits you best (e.g. lowest error rate / highest accuracy): you create as many different classifiers as you like, give them the same training set, evaluate each one's accuracy (and/or other metrics) and choose the best model.
One option is to make a hierarchical classifier. So in a first step you use the frontal face classifier (assuming that most pictures are frontal faces). If the classifier fails, you try with the profile classifier.
I did that with a dataset of faces and it improved my overall classification accuracy. Furthermore, if you have some a priori information, you can use it. In my case the faces were usually in the upper-middle part of the picture.
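The fallback logic itself is small. Here is a language-neutral sketch in Python, where each detector is assumed to be a callable that returns a (possibly empty) list of bounding boxes. The function name and the stub detectors are hypothetical and only stand in for the real frontal and profile models.

```python
def detect_with_fallback(image, detectors):
    """Try each (name, detector) pair in order; return the first non-empty result."""
    for name, detect in detectors:
        boxes = detect(image)
        if boxes:
            return name, boxes
    # No detector found a face.
    return None, []


# Hypothetical stand-ins for the frontal and profile detectors.
def frontal_stub(image):
    return [(10, 10, 50, 50)] if image == "frontal.jpg" else []


def profile_stub(image):
    return [(20, 15, 40, 60)] if image == "profile.jpg" else []


detectors = [("frontal", frontal_stub), ("profile", profile_stub)]
```

Calling detect_with_fallback("profile.jpg", detectors) first tries the frontal detector, gets nothing, and then returns the profile detector's boxes.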
To further improve performance beyond the two MATLAB classifiers you are using, you would need to change your technique (and probably your programming language). This is the best method so far: FaceNet.

How do I use a pre-trained Caffe model?

I have some questions about how to actually interact with a pre-trained Caffe model. In my case I'm using a model for scene recognition.
In the Caffe git repository, there are some code examples in Python and C++ that implement image classifiers. However, those do not apply to my use case (since they only classify the input image as ONE class).
My goal is an application that takes an input image (jpg) and outputs the highest predicted class label for each pixel in the input image (e.g., indices for sky, beach, road, car).
Could anyone give me some pointers on how to proceed?
There already seem to exist implementations for this. This demo is kind of what I want.
Thank you!
What you are looking for is not image classification, but rather semantic segmentation.
A recent work by Jonathan Long, Evan Shelhamer and Trevor Darrell is based on Caffe and can be found here. It uses a fully convolutional network, that is, a network with no "InnerProduct" layers, only convolutional layers, and is thus capable of producing outputs of different sizes for inputs of different sizes.
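Once such a network has produced a score map, turning it into per-pixel labels is a single argmax over the class axis. The sketch below assumes the scores for one image are already available as a NumPy array of shape (num_classes, height, width). The class list comes from the question, while the tiny score array is made up for illustration.

```python
import numpy as np

CLASS_NAMES = ["sky", "beach", "road", "car"]  # example classes from the question


def label_map(scores):
    """Map (num_classes, H, W) scores to an (H, W) array of class indices."""
    return np.argmax(scores, axis=0)


# Made-up scores for a 2x2 image: top row sky, bottom row road and car.
scores = np.zeros((len(CLASS_NAMES), 2, 2))
scores[0, 0, :] = 1.0  # sky wins on the top row
scores[2, 1, 0] = 1.0  # road wins bottom-left
scores[3, 1, 1] = 1.0  # car wins bottom-right

labels = label_map(scores)
names = [[CLASS_NAMES[i] for i in row] for row in labels]
```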