Why we need CNN for the Object Detection? - neural-network

I want to ask one general question that nowadays Deep learning specially Convolutional Neural Network (CNN) has been used in every field. Sometimes it is not necessary to use CNN for the problem but the researchers are using and following the trend.
So for the Object Detection problem, is it a kind of problem where CNN is really needed to solve the detection problem?

That is unhappy question. In title you ask about CNN, but you ask about deep learning in general.
So we don't necessary need deep learning for object recognition. But trained deep networks gets better results. Companies like Google and others are thankful for every % of better results.
About CNN, they gets better results than "traditional" ANN and also have less parameters because of weights sharing. CNN also allow transfer learning(you take a feature detector- convolution and pooling layers and than you connect on feature detector yours full connected layers).

A key concept of CNN's is the idea of translational invariance. In short, using a convolutional kernel on an image allows the machine to learn a set of weights for a specific feature (an edge, or a much more detailed object, depending on the layering of the network) and apply it across the entire image.
Consider detecting a cat in an image. If we designed some set of weights that allowed the learner to recognize a cat, we would like those weights to be the same no matter where the cat is in the image! So we would "assign" a layer in the convolutional kernel to detecting cats, and then convolve over the entire image.
Whatever the reason for the recent successes of CNN's, it should be noted that regular fully-connected ANN's should perform just as well. The problem is that they quickly become computationally infeasible on larger images, whereas CNN's are much more efficient due to parameter sharing.

Related

How to evaluate a quality of neural network for object detection in Keras?

I've already trained the neural network in Keras for detecting two classes of images (cats and dogs) and got accuracy on test data. Is it enough for the conclusion in the master thesis or should I do other actions for evaluating the quality of network (for instance, cross-validation)?
Not really, I would expect more than just accuracy from my students in any classification setup. Accuracy only evaluates that particular network on that particular test set but you would have to some extent justify the design choices you've made in building that network. Here are some things to consider:
Presumably you have some hyper-parameters you've fixed, you can investigate how these affect your results. How many filters? How many layers? and most importantly why?
An important aspect of object classification is how your model handles noise. Depending on your dataset, one simple way would be to pre-process the test data, blur it, invert colours etc and you'll see that your performance will drop. Why does it do that? How does the confusion matrix look like then?
What is the performance of the network? Is it fast, slow compared to another system, say VGG?
When you evaluate your project in general not just the network, asking why things worked helps a lot, not just why things didn't work.

Use a trained neural network to imitate its training data

I'm in the overtures of designing a prose imitation system. It will read a bunch of prose, then mimic it. It's mostly for fun so the mimicking prose doesn't need to make too much sense, but I'd like to make it as good as I can, with a minimal amount of effort.
My first idea is to use my example prose to train a classifying feed-forward neural network, which classifies its input as either part of the training data or not part. Then I'd like to somehow invert the neural network, finding new random inputs that also get classified by the trained network as being part of the training data. The obvious and stupid way of doing this is to randomly generate word lists and only output the ones that get classified above a certain threshold, but I think there is a better way, using the network itself to limit the search to certain regions of the input space. For example, maybe you could start with a random vector and do gradient descent optimisation to find a local maximum around the random starting point. Is there a word for this kind of imitation process? What are some of the known methods?
How about Generative Adversarial Networks (GAN, Goodfellow 2014) and their more advanced siblings like Deep Convolutional Generative Adversarial Networks? There are plenty of proper research articles out there, and also more gentle introductions like this one on DCGAN and this on GAN. To quote the latter:
GANs are an interesting idea that were first introduced in 2014 by a
group of researchers at the University of Montreal lead by Ian
Goodfellow (now at OpenAI). The main idea behind a GAN is to have two
competing neural network models. One takes noise as input and
generates samples (and so is called the generator). The other model
(called the discriminator) receives samples from both the generator
and the training data, and has to be able to distinguish between the
two sources. These two networks play a continuous game, where the
generator is learning to produce more and more realistic samples, and
the discriminator is learning to get better and better at
distinguishing generated data from real data. These two networks are
trained simultaneously, and the hope is that the competition will
drive the generated samples to be indistinguishable from real data.
(DC)GAN should fit your task quite well.

Criteria Behind Structuring a Neural Network

I'm just starting with Torch and neural networks and just glancing at a lot of sample code and tutorials, I see a lot of variety in the how people structure their neural networks. There are layers like Linear(), Tanh(), Sigmoid() as well as criterions like MSE, ClassNLL, MultiMargin, etc.
I'm wondering what kind of factors people keep in mind when creating the structure of their network? For example, I know that in a ClassNLLCriterion, you want to have the last layer of your network be a LogSoftMax() layer so that you can input the right log probabilities.
Are there any other general rules or guidelines when it comes to creating these networks?
Thanks
Here is a good webpage which contains the pros and cons of some of the main activation functions;
http://cs231n.github.io/neural-networks-1/#actfun
It can boil down to the problem at hand and knowing what to do when something goes wrong. As an example, if you have a huge dataset and you can't churn through it terribly quickly then a ReLU might be better in order to quickly get to a local minimum. However you could find that some of the ReLU units "die" so you might want to keep a track on the proportion of activated neurons in that particular layer to make sure this hasn't happened.
In terms of criterions, they are also problem specific but a bit less ambiguous. For example, binary cross entropy for binary classification, MSE for regression etc. It really depends on the objective of the whole project.
For the overall network architecture, I personally find it can be a case of trying out different architectures and seeing which ones work and which don't on your test set. If you think that the problem at hand is terribly complex and you need a complex network to solve the problem then you will probably want to try making a very deep network to begin with, then add/remove a few layers at a time to see if you have under/overfitted. As another example, if you are using convolutional network and the input is relatively small then you might try and use a smaller set of convolutional filters to begin with.

Where do filters/kernels for a convolutional network come from?

I've seen some tutorial examples, like UFLDL covolutional net, where they use features obtained by unsupervised learning, or some others, where kernels are engineered by hand (using Sobel and Gabor detectors, different sharpness/blur settings etc). Strangely, I can't find a general guideline on how one should choose a good kernel for something more than a toy network. For example, considering a deep network with many convolutional-pooling layers, are the same kernels used at each layer, or does each layer have its own kernel subset? If so, where do these, deeper layer's filters come from - should I learn them using some unsupervised learning algorithm on data passed through the first convolution-and-pooling layer pair?
I understand that this question doesn't have a singular answer, I'd be happy to just the the general approach (some review article would be fantastic).
The current state of the art suggest to learn all the convolutional layers from the data using backpropagation (ref).
Also, this paper recommend small kernels (3x3) and pooling (2x2). You should train different filters for each layer.
Kernels in deep networks are mostly trained all at the same time in a supervised way (known inputs and outputs of network) using Backpropagation (computes gradients) and some version of Stochastic Gradient Descent (optimization algorithm).
Kernels in different layers are usually independent. They can have different sizes and their numbers can differ as well. How to design a network is an open question and it depends on your data and the problem itself.
If you want to work with your own dataset, you should start with an existing pre-trained network [Caffe Model Zoo] and fine-tune it on your dataset. This way, the architecture of the network would be fixed, as you would have to respect the architecture of the original network. The networks you can donwload are trained on very large problems which makes them able to generalize well to other classification/regression problems. If your dataset is at least partly similar to the original dataset, the fine-tuned networks should work very well.
Good place to get more information is Caffe # CVPR2015 tutorial.

How do neural networks handle large images where the area of interest is small?

If I've understood correctly, when training neural networks to recognize objects in images it's common to map single pixel to a single input layer node. However, sometimes we might have a large picture with only a small area of interest. For example, if we're training a neural net to recognize traffic signs, we might have images where the traffic sign covers only a small portion of it, while the rest is taken by the road, trees, sky etc. Creating a neural net which tries to find a traffic sign from every position seems extremely expensive.
My question is, are there any specific strategies to handle these sort of situations with neural networks, apart from preprocessing the image?
Thanks.
Using 1 pixel per input node is usually not done. What enters your network is the feature vector and as such you should input actual features, not raw data. Inputing raw data (with all its noise) will not only lead to bad classification but training will take longer than necessary.
In short: preprocessing is unavoidable. You need a more abstract representation of your data. There are hundreds of ways to deal with the problem you're asking. Let me give you some popular approaches.
1) Image proccessing to find regions of interest. When detecting traffic signs a common strategy is to use edge detection (i.e. convolution with some filter), apply some heuristics, use a threshold filter and isolate regions of interest (blobs, strongly connected components etc) which are taken as input to the network.
2) Applying features without any prior knowledge or image processing. Viola/Jones use a specific image representation, from which they can compute features in a very fast way. Their framework has been shown to work in real-time. (I know their original work doesn't state NNs but I applied their features to Multilayer Perceptrons in my thesis, so you can use it with any classifier, really.)
3) Deep Learning.
Learning better representations of the data can be incorporated into the neural network itself. These approaches are amongst the most popular researched atm. Since this is a very large topic, I can only give you some keywords so that you can research it on your own. Autoencoders are networks that learn efficient representations. It is possible to use them with conventional ANNs. Convolutional Neural Networks seem a bit sophisticated at first sight but they are worth checking out. Before the actual classification of a neural network, they have alternating layers of subwindow convolution (edge detection) and resampling. CNNs are currently able to achieve some of the best results in OCR.
In every scenario you have to ask yourself: Am I 1) giving my ANN a representation that has all the data it needs to do the job (a representation that is not too abstract) and 2) keeping too much noise away (and thus staying abstract enough).
We usually dont use fully connected network to deal with image because the number of units in the input layer will be huge. In neural network, we have specific neural network to deal with image which is Convolutional neural network(CNN).
However, CNN plays a role of feature extractor. The encoded feature will finally feed into a fully connected network which act as a classifier. In your case, I dont know how small your object is compare to the full image. But if the interested object is really small, even use CNN, the performance for image classification wont be very good. Then we probably need to use object detection(which used sliding window) to deal with it.
If you want recognize small objects on large sized image, you should use "scanning window".
For "scanning window" you can to apply dimention reducing methods:
DCT (http://en.wikipedia.org/wiki/Discrete_cosine_transform)
PCA (http://en.wikipedia.org/wiki/Principal_component_analysis)