Explain how is heat map used in CNN for crowd count case? - neural-network

im new to neural netwworks, and through the readings i often come across where heat maps are used in the network along with the ground truth provided with the dataset to (as far as my understanding is) evaluate the accuracy of network performance.
to be specific, consider the application of a crowd density estimation network, the dataset provide the crowd images, each image hase a corresponding ground truth .mat file, this file has:
a matrix of X and Y coordinations representing the appearanse of human head in the image.
the total number of human heads in the image (crowd count) which is equal to the matrix rows.
my current understanding is that one image will get through the network and the result is a going to be compared with the given ground-truth (either the head locations, ore the crowd count),
SO, how and at which point and is the crowd density map or heat map is used? do we generate one for the image while training an compare it with the one generated from the ground truth? how is this done?
all the papers i've read neglect describing this process.
any clear explanation will be appreciated.

The counting-by-density CNN approache, which you describe are fully convolutional regression network where the output is a density (heat map). That mean the training ground-truth data is a density map too. In general a MSE loss is used for training. The ground-truth density maps are in general precomputed from the head detection ground-truth.
E.g. if you look at the MCNN code preparation folder you will find get_density_map_gaussian.m file which performs the density estimation from the ground-truth annotated head.

Related

Convolutional Neural Networks, Matrix of Convolutional (Kernel)

Good afternoon! In the first stage where on input of Convolutional Neural Network (input layer) we recieve a source image (hence an image of handwritten English letter). First of all we are using an nxn window which goes from left to right for scanning image and multiplication on kernel (convolutional matrix) to build Feature maps? But nowhere written about what exact values a kernel should be have (In other words on what Kernel values I should multiply data retrieved from n*n window ). Is it suitable to multiply data on this Convolutional Kernel intended for edge detection? There a numerous Convolutional Kernels (Emboss, Gaussian Filter, Edge detection, Angle detection, etc.)? But nowhere is written to what exact kernel it is need to multiply data for detecting hand written symbols.
Sample of Edge detection 3*3 kernel
Convolutional operation for multiplication on kernel
In addition, if size of entire image is 30*30, than is it possible to use window of 5*5 for building feature maps? Would it be enough sufficient for reaching optimal precision of letter detection?
On what exact kernel it is best to multiply area of entire image for the maximum precision of letter recognition? Or initially all values in kernel is equaled to 0? Could i also ask, what formula or rule is applied to detect overall needed amount of to be built feature maps? Or if the task is in letter recognition of English Language, than in each stage of Feature maps building process there must be exact 25 feature maps? Thank you for reply!
In a CNN, the convolutional kernel is a shared weight matrix, and is learned in a similar way to other weights. It is initialized in the same way, with small random values, and the weight deltas from back propagation are summed across all the features that receive its output (i.e. usually all "pixels" in the output of the convolutional layer)
A typical random kernel will perform a little like an edge detector.
After training, the first CNN layer can be displayed and will often have learned some kernels that can be interpreted if you are familiar with image processing
There is a nice animated view of kernel features being learned here: http://cs.nyu.edu/~yann/research/sparse/
In short your answer is this: There is no need to look for correct kernels to use. Instead look for a CNN library where you set params such as number of convolutional layers, and research the way to view the kernels as they learn - most CNN libraries will have a documented way to visualise them.

PCA on Sift desciptors and Fisher Vectors

I was reading this particular paper http://www.robots.ox.ac.uk/~vgg/publications/2011/Chatfield11/chatfield11.pdf and I find the Fisher Vector with GMM vocabulary approach very interesting and I would like to test it myself.
However, it is totally unclear (to me) how do they apply PCA dimensionality reduction on the data. I mean, do they calculate Feature Space and once it is calculated they perform PCA on it? Or do they just perform PCA on every image after SIFT is calculated and then they create feature space?
Is this supposed to be done for both training test sets? To me it's an 'obviously yes' answer, however it is not clear.
I was thinking of creating the feature space from training set and then run PCA on it. Then, I could use that PCA coefficient from training set to reduce each image's sift descriptor that is going to be encoded into Fisher Vector for later classification, whether it is a test or a train image.
EDIT 1;
Simplistic example:
[coef , reduced_feat_space]= pca(Feat_Space','NumComponents', 80);
and then (for both test and train images)
reduced_test_img = test_img * coef; (And then choose the first 80 dimensions of the reduced_test_img)
What do you think? Cheers
It looks to me like they do SIFT first and then do PCA. the article states in section 2.1 "The local descriptors are fixed in all experiments to be SIFT descriptors..."
also in the introduction section "the following three steps:(i) extraction
of local image features (e.g., SIFT descriptors), (ii) encoding of the local features in an image descriptor (e.g., a histogram of the quantized local features), and (iii) classification ... Recently several authors have focused on improving the second component" so it looks to me that the dimensionality reduction occurs after SIFT and the paper is simply talking about a few different methods of doing this, and the performance of each
I would also guess (as you did) that you would have to run it on both sets of images. Otherwise your would be using two different metrics to classify the images it really is like comparing apples to oranges. Comparing a reduced dimensional representation to the full one (even for the same exact image) will show some variation. In fact that is the whole premise of PCA, you are giving up some smaller features (usually) for computational efficiency. The real question with PCA or any dimensionality reduction algorithm is how much information can I give up and still reliably classify/segment different data sets
And as a last point, you would have to treat both images the same way, because your end goal is to use the Fisher Feature Vector for classification as either test or training. Now imagine you decided training images dont get PCA and test images do. Now I give you some image X, what would you do with it? How could you treat one set of images differently from another BEFORE you've classified them? Using the same technique on both sets means you'd process my image X then decide where to put it.
Anyway, I hope that helped and wasn't to rant-like. Good Luck :-)

How to choose the number of nodes for using BP network in face recognition?

I read some books but still cannot make sure how should I organize the network. For example, I have pgm image with size 120*100, how the input should be like(like a one dimensional array with size 120*100)? and how many nodes should I adapt.
It's typically best to organize your input image as a 2D matrix. The reason is that the layers at the lower levels of the neural networks used in machine perception tasks are typically locally connected. For example, each neuron of the first layer of such a neural net will only process the pixels of a small NxN patch of the input image. This naturally leads to a 2D structure which can be more easily described with 2D matrices.
For a detailed explanation I'll refer you to the DeepFace paper which describes the stat of the art in face recognition systems.
120*100 one dimensional vector is fine. The locations of the pixel values in that vector does not matter, because all nodes are fully connected with the nodes in the next layer anyway. But you must be consistent with their locations between training, validating, and testing.
The most successful approach so far was to go with a convolutional neural network with 2D input, just as #benoitsteiner stated. For a far simpler example I'd refer you to a LeNet-5, a small neural network developed for MNIST hand-written digit recognition. It is used in EBLearn for face recognition with quite good results.

Convert SURFpoints object MATLAB

Is there any way I can convert the SURFpoints object, generated by matlab, into a matrix with x and y positions, for feeding into a neural network?
I am a pretty much complete beginner, but from what I can tell, and by looking at documentation, I wasn't sure if there was a way to get SURFpoints into neural networks?
Many thanks,
Hugh
SURFPoints has a field, Location, that is an n x 2 matrix that has the (x,y) coordinates of each SURF point detected in the image.
Note, however, that SURF points have other attributes beside their location (such as scale and orientation). If you only take into account the (x,y) locations, you are throwing away a lot of data.
Also, it's unclear how you would feed this information into a neural network. A neural network, like many other machine learning models, expects a uniform length feature vector of an entity. If your task is something like image classification, you'll have to come up with some way to convert the list of SURF points into a feature vector that captures the properties you want your classifier to care about. Depending on your application, a neural network may or may not be the best way to go. In the context of computer vision and image processing, neural networks these days are more commonly used for unsupervised feature discovery (see "deep learning"). For supervised learning tasks, other models like boosted decision trees and SVMs give better theoretical guarantees and have fared much better in practice.

Pattern recognition in Neural Network using matlab simulation

I am new to this neural network in matlab. I wanted to create a Neural Network using matlab simulation.
This matlab simulation is using pattern recognition.
I am running on a windows XP platform.
For example, I have a sets of waveforms of circular shape.
I have extracted out the poles.
These poles will teach my Neural Network that it is circular in shape, hence whenever I input another set of slightly different circular shape waveform, the Neural Network is able to distinguish between the shape.
Currently, I have extracted the poles of these 3 shapes, cylinder, circle and rectangle.
But I am clueless of how I should go about creating my Neural Network.
I'd recommend utilizing SOM (Self-organizing map) for pattern recognition since it's really robust. Also there's a Som Toolbox for Matlab you might be interested in. However, to make it learn waves while neglecting their offsets, you'd need to make some changes to the "similarity function". These changes will affect quite a lot on the SOM's training time but if that's not a problem, keep reading.
For the SOM you'll have to sample your waves to constant sized vectors, let say:
sin x -> sin_vector = (a1, a2, a3, ..., aN)
cos x -> cos_vector = (b1, b2, b3, ..., bN)
Usually similarity of "SOM-vectors" is calculated with euclidian distance. Euclidian distance of those two vectors is huge since they have a different offset. In your case they should be considered to be similar ie. distance to be small. So.. if you don't sample all the similar waves from the same starting point, they will be classified in different classes. That is probably a problem. But! Similarity of vectors in SOM is calculated in order to find the BMU (best-matching unit) from the map and pulling the BMU's and its neigborhood's vectors torwards the values of the given sample. So all you need to change is the way to compare those vectors and the way to pull the vectors' values torwards the sample so that both will be "offset-tolerent".
Slow but working solution is first finding the best offset index for each vector. Best offset index is the one that will produce the smallest value with euclidian distance for the sample. Smallest distance calculated with some node of the net will then be the BMU. Then the BMU's and its neigborhood's vectors are pulled torwards the given sample using the offset index calculated for each node just before. Everything else should work out-of-the-box.
This solution is relatively slow but should work great. I'd recommend studying the consept of SOM thoroughly and then reading this post (and angry comments) again :)
PLEASE comment if you know some mathematical solution that would be better than that previous one!
You can try to use Matlab's Neural network pattern recognition tool nprtool as it is specialize to train and test neural network for pattern recognition.