Training image classifier - Neural Network

I would like to train a conv neural network to detect the presence of hands in images.
The difficulty is that:
1/ the images will contain objects other than hands; for example, in a picture of a group of people the hands are just a small part of the image
2/ hands can have many orientations and shapes (open or closed, seen from different angles, etc.)
I was thinking of training the convnet on a big set of cropped hand images (+ random images without hands) and then applying the classifier to all the subsquares of my images. Is this a good approach?
Are there other examples of complex 2-class convnets / RNNs I could use for inspiration?
Thank you!

I was thinking of training the convnet on a big set of cropped hand
images (+ random images without hands) and then apply the classifier
on all the subsquares of my images. Is this a good approach?
Yes, I believe this would be a good approach. However, note that when you say random, you should perhaps sample the negatives from images where "hands are most likely to appear". It really depends on your use case, and you have to tune the data set to fit what you're doing.
Your data set should be built something like this:
Crop images of hands from a big image.
Sample X images from that same big image, but not anywhere near the hand or hands.
If, however, you choose to do something like this:
Crop images of hands from a big image.
Download 1 million images (an exaggeration) that definitely don't have hands, for example deserts, oceans, skies, caves, mountains, basically lots of scenery, and use these as your "random images without hands". Then you might get bad results.
The reason is that your images already follow an underlying distribution. I assume that most of your images are pictures of groups of friends, parties at a house, or scenes with buildings in the background. Under that assumption, introducing scenery images could corrupt this distribution: the net may learn to separate party photos from landscapes rather than hands from non-hands.
Therefore, be really careful when using "random images"!
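The "sample negatives from the same image, but not near the hands" step above can be sketched as follows. This is a minimal illustration, not a fixed recipe: the (x, y, w, h) box format, the margin parameter and the function names are assumptions made for the example.

```python
import random

def boxes_overlap(a, b):
    """True if two (x, y, w, h) boxes share any pixel."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah

def sample_negative_boxes(image_size, hand_boxes, patch, count,
                          margin=0, seed=0, max_tries=10_000):
    """Sample up to `count` square patches that stay `margin` pixels away from every hand box."""
    width, height = image_size
    rng = random.Random(seed)
    # Grow each hand box by the margin so negatives are not "anywhere near" a hand.
    grown = [(x - margin, y - margin, w + 2 * margin, h + 2 * margin)
             for x, y, w, h in hand_boxes]
    negatives = []
    for _ in range(max_tries):
        if len(negatives) == count:
            break
        x = rng.randint(0, width - patch)   # randint is inclusive on both ends
        y = rng.randint(0, height - patch)
        candidate = (x, y, patch, patch)
        if not any(boxes_overlap(candidate, g) for g in grown):
            negatives.append(candidate)
    return negatives

# Hypothetical annotations for one 640x480 photo.
hand_boxes = [(100, 80, 40, 40), (300, 200, 50, 50)]
negatives = sample_negative_boxes((640, 480), hand_boxes, patch=40, count=20, margin=10)
```

Because the negatives come from the same photos as the positives, they share the background statistics (rooms, people, buildings) that the classifier will see at test time.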
on all the subsquares of my images
As to this part of your question, you are essentially running a sliding window over the entire image. Yes, in practice it would work. But if speed matters, this may not be a good idea, because the classifier has to run once per window. You might want to run a segmentation algorithm first to narrow down the search space.
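For illustration, a sliding window over an image can be sketched like this. The image is a plain list of rows, and toy_classifier is a made-up stand-in for the trained convnet, so treat it as a sketch of the loop structure only.

```python
def sliding_windows(width, height, window, stride):
    """Yield (x, y, window, window) for every window that fits inside the image."""
    for y in range(0, height - window + 1, stride):
        for x in range(0, width - window + 1, stride):
            yield (x, y, window, window)

def detect(image, classify, window, stride):
    """Run `classify` on every window crop and return the boxes it accepts."""
    height, width = len(image), len(image[0])
    hits = []
    for x, y, w, h in sliding_windows(width, height, window, stride):
        crop = [row[x:x + w] for row in image[y:y + h]]
        if classify(crop):
            hits.append((x, y, w, h))
    return hits

def toy_classifier(crop):
    """Stand-in for the trained convnet: accept crops that are almost entirely bright."""
    pixels = [p for row in crop for p in row]
    return sum(pixels) / len(pixels) > 0.9

# 12x8 dark image with one bright 4x4 patch playing the role of a hand.
image = [[0.0] * 12 for _ in range(8)]
for y in range(2, 6):
    for x in range(4, 8):
        image[y][x] = 1.0

hits = detect(image, toy_classifier, window=4, stride=2)
num_windows = len(list(sliding_windows(12, 8, window=4, stride=2)))
```

The number of classifier calls grows with the image area divided by the squared stride, and it multiplies again if several window sizes are searched, which is why narrowing the search space first pays off.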
Are there other examples of complex 2-class convnets / RNNs I could
use for inspiration?
I'm not sure what you mean by complex 2-class convnets. I'm not familiar with RNNs, so let me focus on convnets. You can basically define the convolutional net yourself: the size of the convolutional layers, how many layers there are, the pooling method, how big the fully connected layer is, etc. The last layer is basically a softmax layer, where the net decides which class the input belongs to. If you have 2 classes, your last layer has 2 nodes. If you have 3, then 3, and so on. So it can range from 2 to perhaps even 1000. I've not heard of convnets that have more than 1000 classes, but I could be ill-informed. I hope this helps!
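As a small illustration of the two-node softmax layer described above, here is a sketch with made-up activations; the class names and numbers are assumptions, not outputs of any real network.

```python
import math

def softmax(logits):
    """Turn raw last-layer outputs into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

CLASSES = ["no hand", "hand"]  # one output node per class
logits = [0.3, 2.1]            # hypothetical activations of the last layer
probs = softmax(logits)
predicted = CLASSES[probs.index(max(probs))]
```

Adding a third class only means adding a third entry to CLASSES and a third output node; the rest of the net can stay the same.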

This seems more a matter of finding good labeled training data than of choosing a network. A neural network can learn the difference between "pictures of hands" and "pictures which incidentally include hands", but it needs some labeled examples to figure out which category an image belongs to.
You might want to take a look at this:


Can we use Deep Learning networks to detect interesting or boring pictures?

I'm handling a Deep Learning classification task to distinguish whether an image/video is boring or interesting.
Based on ten thousand labeled samples (1. interesting, 2. a little interesting, 3. normal, 4. boring), I fine-tuned some pre-trained ImageNet models (ResNet, Inception, VGG, etc.) for my classification task.
My training error is very small, which means training has converged. But the test error is very high: accuracy is only around 35%, not much better than a random result.
I found the difficult parts are:
The same object can have different labels. For example, with a dog on grass, a very cute dog may be labeled as an interesting image, but an ugly dog may be labeled as a boring image.
Many factors define interesting or boring: image quality, image color, the object, the environment... Detecting only good image quality or only a good environment may be possible, but how can we combine all these factors?
Everyone's interests are different. I may be interested in pets while someone else finds them boring, but there is some common sense that everyone shares. How can I detect it?
Finally, do you think this problem can be solved using deep learning? If so, how would you approach this task?
This is a very broad question. I'll try and give some pointers:
"My training error is very small... But test error is very high" means you overfit your training set: your model learns specific training examples instead of learning general "classification rules" applicable to unseen examples.
This usually means you have too many trainable parameters relative to the number of training samples.
Your problem is not exactly a "classification" problem: classifying a "little interesting" image as "boring" is worse than classifying it as "interesting". Your label set has order. Consider using a loss function that takes that into account. Maybe "InfogainLoss" (if you want to maintain discrete labels), or "EuclideanLoss" (if you are willing to accept a continuous score).
If you have enough training examples, I think it is not too much to ask from a deep model to distinguish between an "interesting" dog image and a "boring" one. Even though the semantic difference is not large, there is a difference between the images, and a deep model should be able to capture it.
However, you might want to start your finetuning from a net that is trained for "aesthetic" tasks (e.g., MemNet, flickr style etc.) and not a "semantic" net like VGG/GoogLeNet etc.
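To make the point about ordered labels concrete, here is a small sketch comparing a plain classification loss with a squared-error loss on an integer score. The integer codes for the four labels are an assumption for illustration, and this is not the Caffe layer itself, only the idea behind it.

```python
# Labels in their natural order; the integer codes are an illustrative assumption.
LABELS = {"interesting": 0, "a little interesting": 1, "normal": 2, "boring": 3}

def zero_one_loss(true_label, predicted_label):
    """Plain classification view: every mistake costs the same."""
    return 0.0 if true_label == predicted_label else 1.0

def squared_error_loss(true_label, predicted_score):
    """Regression view: the cost grows with the distance along the label order."""
    return 0.5 * (LABELS[true_label] - predicted_score) ** 2

def expected_score(class_probs):
    """Collapse per-class probabilities into one continuous score."""
    return sum(LABELS[name] * p for name, p in class_probs.items())

truth = "a little interesting"
near = squared_error_loss(truth, LABELS["interesting"])  # one step away
far = squared_error_loss(truth, LABELS["boring"])        # two steps away
same_cost = zero_one_loss(truth, "interesting") == zero_one_loss(truth, "boring")
```

With the squared-error view, predicting "boring" for a "little interesting" image costs four times as much as predicting "interesting", whereas the plain classification loss cannot tell the two mistakes apart.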

Multiscale search for HOG+SVM in Matlab

First of all, this is my first question here, so I hope I can explain it clearly.
My goal is to detect different classes of traffic signs in images. For that purpose I have trained binary SVMs following these steps:
First I got a database of cropped traffic signs like the one in the link. I considered different classes (prohibition, danger, etc.) and negative images. All of them were scaled to 40x40 pixels.
I trained a linear SVM model for each class (1-vs-all), using HOG features. Each image is described by a 1728-dimensional feature vector (I concatenate the feature vectors of the three image planes). I used cross-validation to set the parameter C, tested on previously unseen 40x40 images, and got very accurate results (F1 score over 0.9 for all classes). I used libsvm for training and testing.
Now I want to detect signs in full road images by sliding a window over the image at different scales. The problem I'm facing is that I couldn't find any function that does this for me (like detectMultiScale in OpenCV), and my solution is very slow and rudimentary (I'm just using a triple for loop: for each scale I crop consecutive, overlapping 40x40 images, compute HOG features and apply svmpredict to each one).
Can someone give me a clue to a faster way to do it? I also thought about computing the HOG feature vector of the whole input image and then rearranging that vector into a matrix where each row holds the features of one 40x40 window, but I couldn't find a straightforward way of doing it.
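The "compute the features once, then rearrange them into one row per window" idea can be sketched as below. This is written in Python rather than MATLAB and rests on loud assumptions: the per-cell feature is just the mean intensity (a stand-in for a real HOG cell histogram), cells are 8x8 pixels so a 40x40 window spans 5x5 cells, windows step one cell at a time, and the weights are arbitrary rather than a trained SVM.

```python
def cell_features(image, cell):
    """Compute one feature per cell x cell block, once for the whole image.

    The mean intensity is a stand-in; a real HOG stores an orientation histogram per cell.
    """
    rows, cols = len(image) // cell, len(image[0]) // cell
    grid = []
    for r in range(rows):
        grid_row = []
        for c in range(cols):
            block = [image[r * cell + i][c * cell + j]
                     for i in range(cell) for j in range(cell)]
            grid_row.append(sum(block) / len(block))
        grid.append(grid_row)
    return grid

def window_matrix(grid, cells_per_window):
    """Rearrange the cell grid into one feature row per window (stride of one cell)."""
    n = cells_per_window
    positions, rows = [], []
    for r in range(len(grid) - n + 1):
        for c in range(len(grid[0]) - n + 1):
            positions.append((r, c))
            rows.append([grid[r + i][c + j] for i in range(n) for j in range(n)])
    return positions, rows

def linear_scores(rows, weights, bias):
    """Score every window with a linear model: w . x + b."""
    return [sum(w * x for w, x in zip(weights, row)) + bias for row in rows]

# 64x48 toy image, 8-pixel cells, 40x40 windows (5x5 cells).
image = [[(x + y) % 7 for x in range(64)] for y in range(48)]
grid = cell_features(image, cell=8)
positions, rows = window_matrix(grid, cells_per_window=5)
scores = linear_scores(rows, weights=[1.0 / 25] * 25, bias=0.0)
```

The same steps would be repeated on each rescaled copy of the image. With a real HOG, features gathered this way can differ slightly from per-crop features near window borders, because gradients at a crop's edge are computed without the neighbouring pixels.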
I would suggest using SURF feature detection; however, I don't know whether this would also be too slow for your needs.
See : for more information on how to implement it and whether it is a viable solution for you.

How do neural networks handle large images where the area of interest is small?

If I've understood correctly, when training neural networks to recognize objects in images it's common to map each pixel to a single input layer node. However, sometimes we might have a large picture with only a small area of interest. For example, if we're training a neural net to recognize traffic signs, we might have images where the traffic sign covers only a small portion of the picture, while the rest is taken up by the road, trees, sky etc. Creating a neural net that tries to find a traffic sign at every position seems extremely expensive.
My question is: are there any specific strategies for handling this sort of situation with neural networks, apart from preprocessing the image?
Using 1 pixel per input node is usually not done. What enters your network is the feature vector, so you should input actual features, not raw data. Feeding in raw data (with all its noise) will not only lead to bad classification; training will also take longer than necessary.
In short: preprocessing is unavoidable. You need a more abstract representation of your data. There are hundreds of ways to deal with the problem you're asking. Let me give you some popular approaches.
1) Image processing to find regions of interest. When detecting traffic signs, a common strategy is to use edge detection (i.e. convolution with some filter), apply some heuristics, use a threshold filter, and isolate regions of interest (blobs, strongly connected components, etc.), which are then taken as input to the network.
2) Applying features without any prior knowledge or image processing. Viola/Jones use a specific image representation, from which they can compute features in a very fast way. Their framework has been shown to work in real-time. (I know their original work doesn't state NNs but I applied their features to Multilayer Perceptrons in my thesis, so you can use it with any classifier, really.)
3) Deep Learning.
Learning better representations of the data can be incorporated into the neural network itself. These approaches are among the most actively researched at the moment. Since this is a very large topic, I can only give you some keywords so that you can research it on your own. Autoencoders are networks that learn efficient representations. It is possible to use them with conventional ANNs. Convolutional Neural Networks seem a bit sophisticated at first sight, but they are worth checking out. Before the actual classification stage, they have alternating layers of subwindow convolution (edge detection) and resampling. CNNs are currently able to achieve some of the best results in OCR.
In every scenario you have to ask yourself: am I 1) giving my ANN a representation that has all the data it needs to do the job (a representation that is not too abstract) and 2) keeping enough noise out (and thus staying abstract enough)?
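Approach 1) above (edge detection, threshold, connected blobs) can be sketched roughly as follows. The central-difference filter, the threshold value and the minimum blob size are assumptions chosen for a toy image; a real pipeline would tune them and add its own heuristics.

```python
from collections import deque

def gradient_magnitude(image):
    """Approximate edge strength with central differences (a simple convolution filter)."""
    h, w = len(image), len(image[0])
    mag = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = image[y][x + 1] - image[y][x - 1]
            gy = image[y + 1][x] - image[y - 1][x]
            mag[y][x] = (gx * gx + gy * gy) ** 0.5
    return mag

def regions_of_interest(image, threshold, min_pixels=4):
    """Threshold the edge map and return bounding boxes (x0, y0, x1, y1) of connected blobs."""
    mag = gradient_magnitude(image)
    h, w = len(mag), len(mag[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for sy in range(h):
        for sx in range(w):
            if seen[sy][sx] or mag[sy][sx] <= threshold:
                continue
            # Flood fill one 8-connected blob of strong-edge pixels.
            queue, blob = deque([(sx, sy)]), []
            seen[sy][sx] = True
            while queue:
                x, y = queue.popleft()
                blob.append((x, y))
                for ny in range(max(0, y - 1), min(h, y + 2)):
                    for nx in range(max(0, x - 1), min(w, x + 2)):
                        if not seen[ny][nx] and mag[ny][nx] > threshold:
                            seen[ny][nx] = True
                            queue.append((nx, ny))
            if len(blob) >= min_pixels:  # heuristic: drop tiny blobs
                xs, ys = [p[0] for p in blob], [p[1] for p in blob]
                boxes.append((min(xs), min(ys), max(xs), max(ys)))
    return boxes

# Toy image: dark background with one bright 4x4 square playing the role of a sign.
image = [[0.0] * 16 for _ in range(12)]
for y in range(4, 8):
    for x in range(6, 10):
        image[y][x] = 1.0

boxes = regions_of_interest(image, threshold=0.5)
```

Only the crops inside the returned boxes would then be passed to the network, instead of every position in the image.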
We usually don't use a fully connected network to deal with images, because the number of units in the input layer would be huge. There is a specific kind of neural network for images: the convolutional neural network (CNN).
However, the CNN plays the role of a feature extractor. The encoded features are finally fed into a fully connected network, which acts as a classifier. In your case, I don't know how small your object is compared to the full image. But if the object of interest is really small, then even with a CNN the performance of image classification won't be very good. In that case we probably need to use object detection (which uses a sliding window) to deal with it.
If you want to recognize small objects in a large image, you should use a "scanning window".
For the "scanning window" you can apply dimension-reduction methods:

How to Compare the quality of two images?

I have applied two different image enhancement algorithms to a particular image and got two resulting images. Now I want to compare the quality of those two images in order to assess the effectiveness of the two algorithms and find the more appropriate one, based on a comparison of feature vectors of the two images. So which feature vectors should I compare in this case?
I am asking in the context of comparing the texture features of the images, and which feature vector would be most suitable.
I need mathematical support for verifying the effectiveness of either algorithm based on an evaluation of the images, for example using contrast and variance. Are there any other approaches to do that?
A better approach might be to compute a signal-to-noise ratio by comparing the image spectra?
Slayton is right: you need a metric and a way to measure against it, which can be an academic project in itself. However, I can think of one approach straight away, though I'm not sure it makes sense for your specific task:
The sum of abs( colour difference ) across all pixels. The lower, the more similar the images are.
For each pixel, get the absolute colour difference (or distance, to be precise) in Lab space between the original and the processed image, and sum that up. Don't ruin your day trying to understand the full Wikipedia article and coding it yourself; this has been done before. Try re-using the methods getDistanceLabFrom(Color color) or getDistanceRgbFrom(Color color) from this PHP implementation. It worked like a charm for me when I needed a way to match the color of pixels in a JPG picture, which is basically the same principle.
The theory behind it (as far as my limited understanding goes): it treats the RGB or (better) Lab colour space as a three-dimensional space and calculates the distance between two points in it. That's why it works well, and why it hardly worked for me when I looked at a color code from a one-dimensional perspective.
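A minimal sketch of this "sum of per-pixel colour distances" idea, assuming both images are the same size and stored as rows of 3-component tuples. It works on the tuples as given (RGB here); for Lab you would convert the pixels first, which is not shown. It is not the PHP implementation mentioned above.

```python
import math

def color_distance(c1, c2):
    """Euclidean distance between two colours treated as points in 3-D space."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(c1, c2)))

def total_color_difference(image_a, image_b):
    """Sum of per-pixel colour distances; lower means the images are more similar."""
    if len(image_a) != len(image_b) or any(
        len(ra) != len(rb) for ra, rb in zip(image_a, image_b)
    ):
        raise ValueError("images must have the same dimensions")
    return sum(
        color_distance(pa, pb)
        for row_a, row_b in zip(image_a, image_b)
        for pa, pb in zip(row_a, row_b)
    )

# Two tiny 2x2 images; only two pixels differ.
original = [[(255, 0, 0), (0, 255, 0)], [(0, 0, 255), (10, 10, 10)]]
processed = [[(255, 0, 0), (0, 250, 0)], [(0, 0, 255), (13, 14, 10)]]
difference = total_color_difference(original, processed)
```

Dividing the sum by the number of pixels gives a per-pixel average, which makes values comparable between images of different sizes.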
The usual way is to start with a reference image (a good one), then add some noise to it (in a controlled way).
Then, your algorithm should remove as much of the added noise as possible. The results are easy to compare with a signal-to-noise ratio (see Wikipedia).
Now, this approach is easy to apply with simple noise models, but if you aim to improve more complex appearance issues, you must devise a way to apply the corresponding degradation, which is not easy.
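The reference-plus-controlled-noise comparison can be sketched like this. The pixel values, the noise pattern and the two "denoiser outputs" are made up for the example; images are flattened to plain lists to keep it short.

```python
import math

def snr_db(reference, estimate):
    """Signal-to-noise ratio in dB: reference power over the power of (reference - estimate)."""
    signal = sum(r * r for r in reference)
    noise = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    if noise == 0:
        return math.inf
    return 10.0 * math.log10(signal / noise)

reference = [10.0, 12.0, 11.0, 13.0, 12.0, 10.0]  # flattened "good" image
added_noise = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]   # controlled, known noise
noisy = [r + n for r, n in zip(reference, added_noise)]

# Two hypothetical algorithms: one leaves half of the noise, the other leaves 10% of it.
output_a = [r + 0.5 * n for r, n in zip(reference, added_noise)]
output_b = [r + 0.1 * n for r, n in zip(reference, added_noise)]

snr_noisy = snr_db(reference, noisy)
snr_a = snr_db(reference, output_a)
snr_b = snr_db(reference, output_b)
```

The algorithm whose output has the higher SNR against the reference removed more of the added noise, so here output_b would be preferred over output_a.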
Another quite common way to do it is the one recommended by slayton: ask all your colleagues to judge the output of your algorithm, then average their impressions.
If you have only the 2 images and no reference (highest quality) image, then you can see my crude solution/bash script there:
It gets the 2 filenames and outputs the higher quality filename. It assumes the content of the images is identical (same source image).
It can be fooled though.

Does enlarging images make them easier to analyze programmatically?

Can you enlarge a feature so that rather than take up a certain number of pixels it actually takes up one or two times that many to make it easier to analyze? Would there be a way to generalize that in MATLAB?
This sounds an awful lot like a fictitious "zoom, enhance!" procedure that you'd hear about on CSI. In general, "blowing up" a feature doesn't make it any easier to analyze, because no additional information is created when you do this. Generally you would apply other, different transformations like noise reduction to make analysis easier.
As John F has stated, you are not adding any information. In fact, with more pixels to crunch through you are making it "harder" in the sense of requiring more processing.
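A small sketch of why enlarging adds pixels but no information, written in Python rather than MATLAB and using nearest-neighbour upscaling on a toy image stored as a list of rows:

```python
def enlarge_nearest(image, factor):
    """Nearest-neighbour upscaling: every pixel becomes a factor x factor block."""
    return [
        [pixel for pixel in row for _ in range(factor)]
        for row in image
        for _ in range(factor)
    ]

def shrink_by_sampling(image, factor):
    """Keep every factor-th pixel in each direction."""
    return [row[::factor] for row in image[::factor]]

small = [[1, 2, 3], [4, 5, 6]]
big = enlarge_nearest(small, 2)

distinct_small = {p for row in small for p in row}
distinct_big = {p for row in big for p in row}
recovered = shrink_by_sampling(big, 2)
```

The enlarged image has four times as many pixels to process, yet it contains exactly the same values and can be shrunk back to the original without loss. Smoother interpolation methods do produce intermediate values, but those are still computed from the pixels that were already there.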
You might be able to intelligently increase the resolution of an image using Compressed Sensing. It will require some work (or at least some serious thought), though, as you'll have to determine how best to sample the image you already have. There's a large number of papers referenced at Rice University Compressive Sensing Resources.
The challenge is that the image is already sampled using Nyquist-Shannon constraints. You essentially have to re-sample it using a linear basis function (with IID random elements) in such a way that the estimate is at the desired resolution and find some surrogate for the original image at that same resolution that doesn't bias the estimate.
The function imresize is useful for, well, resizing images, larger or smaller. And imcrop is useful for cropping images.
You might get other more useful answers if you tag the question image-processing too.