KNN Classifier using cross validation - matlab

I am trying to implement a KNN classifier using cross-validation. I have several images of a given character for training (e.g. 5 images) and another two for testing. As I understand it, cross-validation means choosing the K with the lowest error during training and then using it on the test data to measure how accurate my results are.
My question is: how do I train on images in MATLAB to get my K value? Do I compare them and look for mismatches, or something else?
Any help would be really appreciated.

First of all, you need to define your task precisely. For example: given an image I in R^(MxN), we wish to classify I as an image containing faces or an image without faces.
I often work with pixel classifiers, where the task is something like: for an image I, decide whether each pixel is a face pixel or a non-face pixel.
An important part of defining the task is to make a hypothesis that can be used as a basis for training a classifier. For example: we believe that the distribution of pixel intensities can be used to discriminate images of faces from images not containing faces.
Then you need to select some features that describe your image. This can be done in many ways, and you should look at what other people do when they analyse the same type of images you are working with.
One widely used method in pixel classification is to use pixel intensity values and do a multi-scale analysis of the image. The idea in multi-scale analysis is that different structures are most evident at different levels of blurring, called scales. As an illustration, consider an image of a tree. Without blurring we notice the fine structure, such as small branches and leaves. When we blur the image we notice the trunk and major branches. This is often used as part of segmentation methods.
When you know your task and the features, you can train a classifier. If you use kNN and cross-validation to find the best k, you should split your dataset into training and test sets and then split the training set into training and validation sets. You then train using the reduced training set and use the validation set to decide which k is best. In the case of binary classification, e.g. face vs. non-face, the error rate is often used as a measure of performance.
Finally, you use the selected parameters to train the classifier on the full training set (training plus validation data) and estimate its performance on the test set.
A classification example: With or without milk?
As a full example, consider images of a cup of coffee taken from above, so that each shows the rim of the cup surrounding a brown disk. Further assume that all images are scaled and cropped so that the diameter of the disk and the dimensions of the image are the same. To simplify the task, we convert the color image to grayscale and scale the pixel intensities to the range [0,1].
We want to train a classifier that can distinguish coffee with milk from coffee without milk. From inspecting histograms of some of the coffee images, we see that each image has two clearly separated "bumps" in its histogram. We believe that these bumps correspond to foreground (coffee) and background. Now we make the hypothesis that the average intensity of the foreground can be used to distinguish between coffee with milk and coffee without.
To find the foreground pixels, we observe that because the foreground/background ratio is the same in every image (by design), we can simply find the intensity threshold that yields that ratio for each image. Then we calculate the average intensity of the foreground pixels and use this value as the feature for each image.
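As a rough sketch of this feature extraction step (not a definitive implementation): the toy image, the variable names and the foreground fraction below are made up, and I assume the coffee is darker than the background, so the foreground pixels are the darkest ones.

```matlab
% Sketch: average foreground intensity when the foreground fraction is known.
% Assumptions (mine, for illustration): the foreground (coffee) is darker than
% the background, and fgFraction is the known share of foreground pixels.
I = [0.20 0.30 0.90;
     0.25 0.95 0.90;
     0.30 0.90 0.85];                 % toy grayscale image, intensities in [0,1]
fgFraction = 4/9;                     % hypothetical foreground/total ratio

v = sort(I(:));                       % all intensities in ascending order
nFg = round(fgFraction * numel(v));   % number of pixels counted as foreground
threshold = v(nFg);                   % largest intensity still in the foreground
feature = mean(v(1:nFg));             % average foreground intensity

fprintf('threshold = %.2f, feature = %.4f\n', threshold, feature);
```

If the coffee were brighter than the background in your images, you would take the brightest pixels instead (sort in descending order).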
If we have N images that we have manually labeled, we split them into a training set and a test set. We then calculate the average foreground intensity for each image in the training set, giving us a set of (average foreground intensity, label) pairs. We want to use kNN, where an image is assigned the majority class of its k closest images. We measure the distance as the absolute value of the difference in average foreground intensity.
We search for the optimal k with cross-validation. Here we use the simplest variant, a single training/validation split (holdout); full 2-fold cross-validation would also swap the two halves and average the errors. We test k = {1,3,5} and select the k that gives the lowest prediction error on the validation set.
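The search for k could look roughly like the sketch below. All feature values and labels are invented for illustration, and the variable names are my own. It uses only base MATLAB so that each step of kNN is visible.

```matlab
% Sketch: choose k for a 1-D kNN classifier using a holdout validation set.
% Feature = average foreground intensity; label 0 = black coffee, 1 = with milk.
% All numbers are made up; two training labels are deliberately "noisy".
xTrain = [0.20 0.22 0.25 0.28 0.30 0.55 0.58 0.60 0.65 0.70];
yTrain = [0    0    0    0    1    1    1    1    1    0   ];
xVal   = [0.21 0.27 0.33 0.57 0.62 0.68];
yVal   = [0    0    0    1    1    1   ];

ks  = [1 3 5];                        % candidate values of k (odd, so no ties)
err = zeros(size(ks));
for a = 1:numel(ks)
    k = ks(a);
    pred = zeros(size(xVal));
    for i = 1:numel(xVal)
        d = abs(xTrain - xVal(i));    % distance = |difference in intensity|
        [~, order] = sort(d);         % nearest training images first
        pred(i) = mean(yTrain(order(1:k))) > 0.5;   % majority vote
    end
    err(a) = mean(pred ~= yVal);      % validation error rate for this k
end
[~, best] = min(err);                 % first k with the lowest error
bestK = ks(best);
fprintf('validation errors: %s, selected k = %d\n', mat2str(err, 3), bestK);
```

With this toy data, k = 1 is misled by the two noisy training labels, while k = 3 and k = 5 both classify the validation set correctly, so the first of them (k = 3) is selected. After that you would retrain on training plus validation data with the selected k and report the error on the test set.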

Related

How to extract memnet heat maps with the caffe model?

I want to extract both memorability score and memorability heat maps by using the available memnet caffemodel by Khosla et al. at link
Looking at the prototxt model, I can see that the final inner-product output should be the memorability score, but how should I obtain the memorability map for a given input image? Here are some examples.
Thanks in advance
As described in their paper [1], the CNN (MemNet) outputs a single real-valued memorability score. So the network they made publicly available calculates this single score for a given input image, not a heatmap.
In section 5 of the paper, they describe how to use this trained CNN to predict a memorability heatmap:
To generate memorability maps, we simply scale up the image and apply MemNet to overlapping regions of the image. We do this for multiple scales of the image and average the resulting memorability maps.
Let's consider the two important steps here:
Problem 1: Make the CNN work with any input size.
To make the CNN work on images of any arbitrary size, they use the method presented in [2].
While convolutional layers can be applied to images of arbitrary size - resulting in smaller or larger outputs - the inner product layers have a fixed input and output size.
To make an inner product layer work with any input size, you apply it just like a convolutional kernel. The first FC layer becomes a convolution whose kernel covers its whole original input (6x6 for AlexNet's fc6); each following FC layer with 4096 outputs is interpreted as a 1x1 convolution with 4096 feature maps.
To do that in Caffe, you can directly follow the Net Surgery tutorial. You create a new .prototxt file in which you replace the InnerProduct layers with Convolution layers. Caffe will then no longer recognize the weights in the .caffemodel, because the layer types no longer match. So you load the old net and its parameters into Python, load the new net, assign the old parameters to the new net, and save it as a new .caffemodel file.
Now, we can run images of any dimensions (at least 227x227) through the network.
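To see why this reinterpretation works, here is a small numerical sketch in MATLAB. It is not Caffe code: it uses a single input channel and a single output unit with made-up integer weights, purely to illustrate the idea.

```matlab
% Sketch: an inner-product (FC) unit applied as a sliding convolution kernel.
% Simplification (mine): one input channel, one output unit, made-up weights.
W = reshape(1:36, 6, 6);              % FC weights for a 6x6 input, kept as a 6x6 kernel
fc = @(patch) W(:)' * patch(:);       % ordinary inner-product layer on a 6x6 input

Xbig = magic(8);                      % a larger 8x8 input
% conv2 flips its kernel, so rotate W by 180 degrees to get a plain
% sliding dot product; 'valid' keeps only positions fully inside the input.
map = conv2(Xbig, rot90(W, 2), 'valid');

disp(size(map));                                  % 3x3 map instead of a single value
disp(map(1,1) == fc(Xbig(1:6, 1:6)));             % same value as the FC unit on that window
disp(map(3,2) == fc(Xbig(3:8, 2:7)));
```

Each entry of the output map is exactly what the original FC unit would have produced on the corresponding 6x6 window, which is why the weights can be copied over unchanged.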
Problem 2: Generate the heat map
As explained in the paper [1], you apply this fully-convolutional network from Problem 1 to the same image at different scales. MemNet is a re-trained AlexNet, so the default input dimension is 227x227. They mention that a 451x451 input gives an 8x8 output, which implies a stride of 32 for applying the layers, since (451 - 227) / (8 - 1) = 32. So a simple example could be:
Scale 1: 227x227 → 1x1. (I guess they definitely use this scale.)
Scale 2: 259x259 → 2x2. (Wild guess)
Scale 3: 323x323 → 4x4. (Wild guess)
Scale 4: 451x451 → 8x8. (This scale is mentioned in the paper.)
The results will look like this:
So you just average these outputs to get your final 8x8 heatmap. From the image above, it should be clear how to average the outputs from the different scales: upsample the low-resolution ones to 8x8, and then average.
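The averaging step could be sketched like this. The score values are invented, and nearest-neighbour upsampling is my assumption; the paper does not say which interpolation they use.

```matlab
% Sketch: average memorability maps from several scales into one 8x8 heatmap.
% The maps are made-up numbers standing in for network outputs.
m1 = 0.8;                             % 1x1 output
m2 = [0.6 0.7; 0.8 0.9];              % 2x2 output
m4 = 0.5 * ones(4);                   % 4x4 output
m8 = zeros(8);  m8(1,1) = 1;          % 8x8 output

maps = {m1, m2, m4, m8};
target = 8;                           % side length of the final heatmap
heat = zeros(target);
for i = 1:numel(maps)
    f = target / size(maps{i}, 1);    % integer upsampling factor
    % kron with a block of ones repeats every entry f-by-f times
    % (nearest-neighbour upsampling, assumed here)
    heat = heat + kron(maps{i}, ones(f));
end
heat = heat / numel(maps);            % average over scales
disp(heat);
```

A smoother interpolation (e.g. bilinear) would work the same way; only the upsampling line changes.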
From the paper, I assume that they use very high-resolution scales, so that their heatmap ends up around the same size as the original image. They write that it takes 1s on a "normal" GPU. This is quite a long time, which also indicates that they probably upsample the input images to quite high dimensions.
Bibliography:
[1]: A. Khosla, A. S. Raju, A. Torralba, and A. Oliva, "Understanding and Predicting Image Memorability at a Large Scale", in: ICCV, 2015.
[2]: J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation", in: CVPR, 2015.

Selecting initial seeds of rectified images in matlab

Dear friends, I am currently working on a disparity algorithm that visits only a small fraction of the disparity space in order to find a semi-dense disparity map. It works by growing from a small set of correspondence seeds. Before that, I am implementing the standard region-growing algorithm in MATLAB to understand how it works.
The first step of the baseline growing algorithm says that:
Require: Rectified images Il, Ir, initial correspondence
seeds S, image similarity threshold. Compute similarity simil(s) for every seed s belonging to S.
Now I cannot understand this step. First of all, how do I calculate the initial seed points from two rectified images? Should I use the SIFT algorithm in MATLAB, or is there a better way to do it? Can anybody also give me some idea of how a region-growing-based disparity algorithm works and whether it is better than SAD or SSD?
If you have rectified images, finding disparity is a matter of calculating costs between pixels in left and right images on the same horizontal line.
You can take a few selected points in the images (for example the ones that have high gradient or feature points coming from SIFT), set those as roots/seeds of your regions and calculate cost for a range of disparities using SAD/SSD or whatever cost function you prefer.
Then take the best disparity for a root and assign it to a neighbor. If the cost for that neighbor is lower than a predefined threshold, add it to the region; otherwise go to the next neighbor. When you cannot add any more points, the region growing is finished.
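Here is a minimal sketch of these steps for a single seed and one round of growing. Everything in it is an assumption for illustration: the synthetic image pair, the 3x3 SAD window, the disparity range and the threshold. It is simpler than the algorithm in the linked paper.

```matlab
% Sketch: SAD cost for one seed, then one growing step to its 4-neighbours.
% Synthetic rectified pair: a left pixel (r, c) appears at (r, c - 2) in the
% right image, so the true disparity is 2 (made-up data).
Il = magic(12);
trueD = 2;
Ir = circshift(Il, [0 -trueD]);

w = 1;                                % half window size (3x3 window)
% SAD between the window around (r,c) in the left image and the window
% around (r, c-d) on the same row of the right image
sad = @(r, c, d) sum(sum(abs(Il(r-w:r+w, c-w:c+w) - Ir(r-w:r+w, c-d-w:c-d+w))));

seed = [6 8];                         % hypothetical seed position (row, column)
dRange = 0:4;                         % candidate disparities
costs = arrayfun(@(d) sad(seed(1), seed(2), d), dRange);
[~, idx] = min(costs);
seedD = dRange(idx);                  % best disparity for the seed

thresh = 5;                           % hypothetical cost threshold
disparity = nan(size(Il));            % NaN = not assigned yet
disparity(seed(1), seed(2)) = seedD;
nbrs = [0 1; 0 -1; 1 0; -1 0];        % 4-neighbourhood offsets
for n = 1:size(nbrs, 1)
    p = seed + nbrs(n, :);
    if sad(p(1), p(2), seedD) < thresh    % neighbour accepts the seed's disparity
        disparity(p(1), p(2)) = seedD;
    end
end
fprintf('seed disparity = %d, pixels in region = %d\n', seedD, sum(~isnan(disparity(:))));
```

A full implementation would keep a queue of accepted pixels and repeat the neighbour test for each of them until nothing more can be added, with bounds checks near the image border.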
This is a detailed example of the process: http://arxiv.org/pdf/0812.1340.pdf

How to ensure consistency in SIFT features?

I am working with a classification algorithm that requires the size of the feature vector of all samples in training and testing to be the same.
I also have to use the SIFT feature extractor. This is causing problems, because the feature matrix of every image comes out with a different size. I know that SIFT detects a variable number of keypoints in each image, but is there a way to ensure that the size of the SIFT features is consistent so that I do not get a dimension mismatch error?
I have tried rootSIFT as a workaround:
[~, features] = vl_sift(single(images{i}));
double_features = double(features);
root_it = sqrt( double_features/sum(double_features) ); %root-sift
feats{i} = root_it;
This gives me a consistent 128 x 1 vector for every image, but it is not working for me as the size of each vector is now very small and I am getting a lot of NaN in my classification result.
Is there any way to solve this?
With SIFT there are, in general, two steps you need to perform.
1. Extract SIFT features. The keypoints (the first output argument of vl_sift, a 4xNP matrix holding x, y, scale and orientation for each point) are scale invariant and should, in theory, be present in every image of the same object. This is not completely true in practice: points are often unique to a single frame (image). Each keypoint is described by a 128-dimensional descriptor (the second output argument, of size 128xNP).
2. Match points. Each time you compute features for a different image, the number of points is different! Many of them should be the same points as in the previous image, but many WON'T be. You will have new points, and old points may no longer be present. This is why you should perform a feature matching step to link the points across images. This is usually done with nearest-neighbor (kNN) matching, often combined with RANSAC to reject bad matches. You can Google how to perform this task and you'll find tons of examples.
After the second step, you should have a fixed number of points for the whole set of images (assuming they are images of the same object). The number of points will be significantly smaller than in any single image (sometimes about 30 times fewer). Then do whatever you want with them!
Hint for matching: http://www.vlfeat.org/matlab/vl_ubcmatch.html
UPDATE:
You seem to be trying to train some kind of OCR. You would probably need to match SIFT features independently for each character.
How to use vl_ubcmatch:
[~, features1] = vl_sift(single(I1));   % vl_sift expects a single-precision grayscale image
[~, features2] = vl_sift(single(I2));
matches = vl_ubcmatch(features1, features2);   % 2xM: matched descriptor indices in I1 and I2
You can apply a dense SIFT to the image. This way you have more control over from where you get the feature descriptors. I haven't used vlfeat, but looking at the documentation I see there's a function to extract dense SIFT features called vl_dsift. With vl_sift, I see there's a way to bypass the detector and extract the descriptors from points of your choice using the 'frames' option. Either way it seems you can get a fixed number of descriptors.
If you are using images of the same size, dense SIFT or the frames option is okay. There's another approach you can take, called the bag-of-features model (similar to the bag-of-words model), in which you cluster the features extracted from the images to generate codewords, represent each image as a histogram over those codewords, and feed the histograms into a classifier.
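To make the bag-of-features idea concrete, here is a small sketch of the encoding step. It uses tiny 2-D toy "descriptors" instead of real 128-D SIFT descriptors, and a hand-picked codebook; in practice the codebook would come from clustering (e.g. k-means) the descriptors of your training images. All names and numbers are hypothetical.

```matlab
% Sketch: turn a variable number of descriptors per image into a
% fixed-length histogram over codewords (bag of features).
codebook = [0 10  0;
            0  0 10];                 % 2x3: three codewords as columns (hand-picked)
descA = [1 9 0 1;
         0 1 9 1];                    % image A: 4 toy descriptors (columns)
descB = [0 11;
         8  0];                       % image B: 2 toy descriptors
images = {descA, descB};

K = size(codebook, 2);
H = zeros(K, numel(images));          % one K-dimensional histogram per image
for i = 1:numel(images)
    D = images{i};
    for j = 1:size(D, 2)
        delta = codebook - repmat(D(:, j), 1, K);
        [~, nearest] = min(sum(delta.^2, 1));   % index of the closest codeword
        H(nearest, i) = H(nearest, i) + 1;
    end
    H(:, i) = H(:, i) / sum(H(:, i)); % normalise so the descriptor count drops out
end
disp(H);
```

Both images end up with a 3x1 feature vector even though they have different numbers of descriptors, which is exactly what a classifier with a fixed input size needs.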

DWT: What is it and when and where we use it

I was reading up on the DWT for the first time and the document stated that it is used to represent time-frequency data of a signal which other transforms do not provide.
But when I look for a usage example of the DWT in MATLAB I see the following code:
X=imread('cameraman.tif');
X=im2double(X);
[F1,F2]= wfilters('db1', 'd');
[LL,LH,HL,HH] = dwt2(X,'db1','d');
I am unable to understand dwt2: what is it, and when and where do we use it? What does dwt2 actually return, and what does the above code do?
The first two statements simply read in the image, and convert it so that the dynamic range of each channel is between [0,1] through im2double.
Now, the third statement, wfilters constructs the wavelet filter banks for you. These filter banks are what are used in the DWT. The method of the DWT is the same, but you can use different kinds of filters to achieve specific results.
Basically, with wfilters, you get to choose what kind of filter you want (in your case, you chose db1: Daubechies), and you can optionally specify the type of filter that you want. Different filters provide different results and have different characteristics. There are a lot of different wavelet filter banks you could use and I'm not quite the expert as to the advantages and disadvantages for each filter bank that exists. Traditionally, Daubechies-type filters are used so stick with those if you don't know which ones to use.
Not specifying the type will output both the decomposition and the reconstruction filters. Decomposition is the forward transformation where you are given the original image / 2D data and want to transform it using the DWT. Reconstruction is the reverse transformation where you are given the transform data and want to recreate the original data.
The fourth statement, dwt2, computes the 2D DWT for you, but we will get into that later.
You specified the flag d, so you want only the decomposition filters. You can use the output of wfilters as input to the 2D DWT if you wish, as this specifies the low-pass and high-pass filters to use when decomposing your image. You don't have to do it like this, though. You can simply name the filter you want to use, which is how you're calling the function in your code. In other words, you can do this:
[F1,F2]= wfilters('db1', 'd');
[LL,LH,HL,HH] = dwt2(X,F1,F2);
... or you can just do this:
[LL,LH,HL,HH] = dwt2(X,'db1','d');
The above statements are the same thing. Note that there is a 'd' flag on the dwt2 function because you want the forward transform as well.
Now, dwt2 is the 2D DWT (Discrete Wavelet Transform). I won't go into the DWT in detail here because this isn't the place to talk about it, but I would definitely check out this link for better details. They also have fully working MATLAB code and their own implementation of the 2D DWT so you can fully understand what exactly the DWT is and how it's computed.
However, the basic idea behind the 2D DWT is that it is a multi-resolution transform. It analyzes your signal and decomposes it into multiple scales / sizes and features. Each scale / size has a bunch of features that describe something about the signal that was not seen in the other scales.
One thing about the DWT is that it naturally subsamples your image by a factor of 2 (i.e. halves each dimension) after the analysis is done - hence the multi-resolution bit I was talking about. For MATLAB, dwt2 outputs four different variables, and these correspond to the variable names of the output of dwt2:
LL - Low-Low. This means that the vertical direction of your 2D image / signal is low-pass filtered as well as the horizontal direction.
LH - Low-High. This means that the vertical direction of your 2D image / signal is low-pass filtered while the horizontal direction is high-pass filtered.
HL - High-Low. This means that the vertical direction of your 2D image / signal is high-pass filtered while the horizontal direction is low-pass filtered.
HH - High-High. This means that the vertical direction of your 2D image / signal is high-pass filtered as well as the horizontal direction.
Roughly speaking, LL corresponds to just the structural / predominant information of your image, while HH corresponds to the edges of your image. The LH and HL components I'm not too familiar with, but they're sometimes used in feature analysis. If you want to do a further decomposition, you would apply the DWT again on the LL component only. However, depending on your analysis, the other components are used too; it just depends on what you want to use it for! dwt2 only performs a single-level DWT decomposition, so if you want the next level, you would call dwt2 on the LL component.
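If you don't have the Wavelet Toolbox at hand, a single-level Haar (db1) decomposition is simple enough to sketch in base MATLAB. This is a hand-rolled illustration, not what dwt2 does internally: the signs of the detail bands and the way the names map onto dwt2's outputs may differ from the toolbox, and it only handles images with even dimensions.

```matlab
% Sketch: one level of a 2-D Haar decomposition on 2x2 blocks, plus its inverse.
X = magic(4);                         % toy "image" with even dimensions
a = X(1:2:end, 1:2:end);              % top-left pixel of every 2x2 block
b = X(1:2:end, 2:2:end);              % top-right
c = X(2:2:end, 1:2:end);              % bottom-left
d = X(2:2:end, 2:2:end);              % bottom-right

LL = (a + b + c + d) / 2;             % low-pass vertically and horizontally
LH = (a - b + c - d) / 2;             % low-pass vertically, high-pass horizontally
HL = (a + b - c - d) / 2;             % high-pass vertically, low-pass horizontally
HH = (a - b - c + d) / 2;             % high-pass in both directions

% Inverse transform: recover the four pixels of every block
Xr = zeros(size(X));
Xr(1:2:end, 1:2:end) = (LL + LH + HL + HH) / 2;
Xr(1:2:end, 2:2:end) = (LL - LH + HL - HH) / 2;
Xr(2:2:end, 1:2:end) = (LL + LH - HL - HH) / 2;
Xr(2:2:end, 2:2:end) = (LL - LH - HL + HH) / 2;

disp(size(LL));                       % each band has half the size in each dimension
disp(isequal(Xr, X));                 % the four bands reconstruct X exactly
```

Each band is half the size of the input in both dimensions, and the four bands together are enough to rebuild the image. For a second level you would run the same steps on LL.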
Applications
Now, for your specific question about applications. The DWT for images is mostly used in image compression and image analysis. One application of the 2D DWT is JPEG 2000. The core of the algorithm is that it breaks the image down into its DWT components, then constructs trees of the coefficients generated by the DWT to determine which components can be omitted before the image is saved. This way, you eliminate extraneous information. There is also a great benefit in that the DWT is invertible: the transform itself discards nothing. JPEG 2000 has a lossless mode, which uses the reversible LeGall 5/3 wavelet, so you can reconstruct the original data without any artifacts or quantization errors. JPEG 2000 also has a lossy option (based on the CDF 9/7 wavelet), where you can reduce the file size even more by eliminating more of the DWT coefficients in a way that is imperceptible to the average user.
Another application is in watermarking images. You can embed information in the wavelet coefficients so that it prevents people from trying to steal your images without acknowledgement. The DWT is also heavily used in medical image analysis and compression as the images generated in this domain are quite high resolution and quite large. It would be extremely useful if you could represent the images in the same way but occupying less physical space in comparison to the standard image compression algorithms (that are also lossy if you want high compression ratios) that exist.
One more application I can think of would be the dynamic delivery of video content over networks. Depending on what your connection speed is or the resolution of your screen, you get a lower or higher quality video. If you specifically use the LL component of each frame, you would stream / use a particular version of the LL component depending on what device / connection you have. So if you had a bad connection or if your screen has a low resolution, you would most likely show the video with the smallest size. You would then keep increasing the resolution depending on the connection speed and/or the size of your screen.
This is just a taste as to what the DWT is used for (personally, I don't use it because the DWT is used in domains that I don't personally have any experience in), but there are a lot more applications that are quite useful where the DWT is used.

Neural network output layer for multiple pattern recognition

Assume that I have a method or other neural network to do pattern detection on an image correctly. How should I design a neural network where there are multiple patterns in an image?
Say that in an image there are X patterns to be detected; what would be the best approach? AFAIK output layer neuron values should be in [-1,1]. How would I know that X patterns were recognised? Does this mean that I have to set a hardcoded limit on how many patterns it can recognise (since the number of output neurons is fixed)?
Here's a suggestion using face detection as an example. This Face Detection link on Github is described as detecting multiple patterns (i.e. faces) using a Haar classifier. If you read the Implementation section, it states that the algorithm uses scaleOption and templateSizeOption parameters (among others) to govern how many faces are detected in an image. It sounds like you should look for features in subspaces or windows of a given image (perhaps even windows that overlap).
scaleOption - this parameter is used to specify the rate at which the haar features used for face detection will be scaled. A lower scale option means that more faces will be detected, while a higher scale option will perform a faster detection, but may miss some faces from the input image. The default scale value is 1.1, that determines an increase in the features dimension of 10% at each step.
templateSizeOption – it is used to specify the minimal area in which to search for a face. If we want to detect persons from close-up images, the size should be over 40 pixels, otherwise a 25 region pixels (which is the default value ) is enough for detecting a large number of faces.
To do this, use a Hopfield network. First, extract your targets in equal-sized windows and store them in the network. Then scan your image with a simple search algorithm, and at each position compare the similarity between the network's output and your target; use a separate array for each target to save the results. At the end, extract the closest pattern in each array. You can apply some image processing to your original image before starting.
Yes, this can be done with a neural network. I think that most practical solutions would involve applying the neural network to a window that is scanned over the image. Multiple hits from the neural network would imply multiple target objects in the image.
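A rough sketch of the scanning-window idea is below. The "classifier" is just a stand-in function handle (a brightness test) where your trained network would go, and the image, window size and names are made up.

```matlab
% Sketch: scan a fixed-size window over the image and collect every hit.
img = zeros(12, 12);
img(2:4, 2:4)  = 1;                   % first toy "pattern"
img(8:10, 7:9) = 1;                   % second toy "pattern"

win = 3;                              % window side length
% Stand-in for a trained network: returns true if the window is a target
classify = @(patch) mean(patch(:)) > 0.99;

hits = zeros(0, 2);                   % top-left corner (row, col) of each detection
for r = 1:size(img, 1) - win + 1
    for c = 1:size(img, 2) - win + 1
        if classify(img(r:r+win-1, c:c+win-1))
            hits(end+1, :) = [r c];   %#ok<AGROW>
        end
    end
end
fprintf('%d patterns found\n', size(hits, 1));
disp(hits);
```

The number of detections is simply the number of rows in hits, so it is not tied to the number of output neurons. With a real classifier, neighbouring windows usually fire on the same object, so you would also need to merge overlapping hits.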
Incidentally, neural network outputs do not have to lie in the range -1 .. 1.