How can I use SURF features in hierarchical clustering - MATLAB

First, is it reasonable to extract SURF features and use them for clustering? I want to cluster similar objects in images (each image contains one object).
If yes, how can it be done?
I extract features like this:
I = imread('cameraman.tif');
points = detectSURFFeatures(I);
[features, valid_points] = extractFeatures(I, points);
features is not a vector but a matrix. Also, the number of points returned by detectSURFFeatures differs from image to image.
How should the features be used?

First, you are detecting SURF features, not SIFT features (although they serve the same basic purpose). The reason you get multiple SURF features is that SURF is a local image feature, i.e. it describes only a small portion of the image. In general, multiple features will be detected in a single image. You probably want to find a way to combine these features into a single image descriptor before clustering.
A common method for combining these features is Bag-of-words.
Since it seems you are doing unsupervised learning, you will first need to learn a codebook. A popular method is to use k-means clustering on all the SURF features you extracted in all of your images.
Use these clusters to generate a k-dimensional image descriptor by creating a histogram of "codeword appearances" for each image. In this case there is one "codeword" per cluster. A codeword is said to "appear" each time a SURF feature belongs to its associated cluster.
The histogram of codewords will act as an image descriptor for each image. You can then apply clustering on the image descriptors to find similar images, I recommend either normalizing the image descriptors to have constant norm or using the cosine similarity metric (kmeans(X,k,'Distance','cosine') in MATLAB if you use k-means clustering).
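For concreteness, a minimal sketch of that pipeline, assuming your grayscale images are stored in a cell array imgs (a placeholder name) and that the Statistics and Machine Learning Toolbox (kmeans, knnsearch, linkage) is available:

% 1) pool SURF descriptors from all images
allFeatures = [];
for i = 1:numel(imgs)
    pts = detectSURFFeatures(imgs{i});
    f = extractFeatures(imgs{i}, pts);
    allFeatures = [allFeatures; f]; %#ok<AGROW>
end
% 2) learn a codebook of k visual words
k = 100;
[~, codebook] = kmeans(double(allFeatures), k, 'MaxIter', 500);
% 3) build a normalized histogram of codeword appearances per image
imageDescriptors = zeros(numel(imgs), k);
for i = 1:numel(imgs)
    f = extractFeatures(imgs{i}, detectSURFFeatures(imgs{i}));
    idx = knnsearch(codebook, double(f));   % nearest codeword for each SURF descriptor
    h = histcounts(idx, 1:k+1);
    imageDescriptors(i, :) = h / max(sum(h), 1);
end
% 4) hierarchical clustering of the image descriptors with cosine distance
Z = linkage(imageDescriptors, 'average', 'cosine');
labels = cluster(Z, 'maxclust', 5);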
That said, a solution which is likely to work better is to extract a deep feature using a convolutional neural network that was trained on a very large, diverse dataset (like ImageNet) then use those as image descriptors.
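As a rough sketch of that alternative (assuming the Deep Learning Toolbox and the pretrained AlexNet support package are installed, and reusing the hypothetical imgs cell array from above):

net = alexnet;                              % any ImageNet-trained network works similarly
inputSize = net.Layers(1).InputSize(1:2);
deepFeats = zeros(numel(imgs), 4096);       % AlexNet's 'fc7' activations are 4096-D
for i = 1:numel(imgs)
    I = imresize(imgs{i}, inputSize);
    if size(I, 3) == 1
        I = repmat(I, 1, 1, 3);             % the network expects an RGB input
    end
    deepFeats(i, :) = activations(net, I, 'fc7', 'OutputAs', 'rows');
end
% deepFeats can replace the bag-of-words histograms in the clustering step above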

Related

bagOfFeatures: extract different types of features

I have a problem with the bagOfFeatures function implemented in the MATLAB Computer Vision System Toolbox.
I'm studying classification of different types of images. First of all, I'm trying to use bagOfFeatures with different custom extractors. I want to divide my work into 2 branches: first extract SURF points, then compute 3 different types of descriptors for them, for example SURF, BRISK and FREAK. When I use the following line in my custom extractor:
features = extractFeatures(grayImage, multiscaleGridPoints, 'Upright', true, 'Method', 'SURF');
it always needs the SURF method to work, but I need to be able to get different types of descriptors.
Can I use the bagOfFeatures function from the Computer Vision System Toolbox to do this, or does it only support SURF feature extraction?
Unfortunately, you cannot use BRISK or FREAK with MATLAB's implementation of bag-of-features, because the bag-of-features algorithm uses K-means clustering to create the "visual words". The problem is that BRISK and FREAK descriptors are binary bit strings, and you cannot cluster them with K-means, which only works on real-valued vectors.
You can certainly use different kinds of interest point detectors with MATLAB's framework. However, you are limited to descriptors which are real-valued vectors. So SURF and SIFT will work, but BRISK and FREAK will not. If you absolutely must use BRISK and FREAK, you will have to implement your own bag of features. There are several methods for clustering binary descriptors, but I do not know how well any of them work in the context of bag-of-features.
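If you stay with real-valued descriptors, a custom extractor is straightforward. A minimal sketch (saved as surfExtractor.m; the folder name and function name are placeholders):

function [features, featureMetrics] = surfExtractor(I)
    % custom extractor for bagOfFeatures: SURF points + SURF descriptors
    if size(I, 3) == 3
        I = rgb2gray(I);
    end
    points = detectSURFFeatures(I);
    [features, validPoints] = extractFeatures(I, points, 'Method', 'SURF');
    featureMetrics = validPoints.Metric;    % strength of each detected point
end

% usage (older releases take an imageSet instead of an imageDatastore):
% imds = imageDatastore('myImageFolder');
% bag  = bagOfFeatures(imds, 'CustomExtractor', @surfExtractor);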

How to ensure consistency in SIFT features?

I am working with a classification algorithm that requires the size of the feature vector of all samples in training and testing to be the same.
I am also using the SIFT feature extractor. This causes problems because the feature matrix of every image comes out with a different size. I know that SIFT detects a variable number of keypoints in each image, but is there a way to ensure that the size of the SIFT features is consistent so that I do not get a dimension mismatch error?
I have tried rootSIFT as a workaround:
[~, features] = vl_sift(single(images{i}));
double_features = double(features);
root_it = sqrt( double_features/sum(double_features) ); %root-sift
feats{i} = root_it;
This gives me a consistent 128 x 1 vector for every image, but it is not working for me as the size of each vector is now very small and I am getting a lot of NaN in my classification result.
Is there any way to solve this?
Using SIFT there are 2 steps you need to perform in general.

1. Extract SIFT features. These points (first output argument of your function, the (x,y) locations of the keypoints) are scale invariant and should in theory be present in every image of the same object. This is not completely true in practice; often points are unique to each frame (image). Each point is described by a 128-dimensional descriptor (second output argument of your function).

2. Match points. Each time you compute features on a different image, the number of points is different! Many of them should correspond to points in the previous image, but many WON'T. You will have new points, and old points may no longer be present. This is why you should perform a feature-matching step to link those points across images. Usually this is done with kNN matching or RANSAC; you can Google how to perform this task and you'll find plenty of examples.

After the second step, you should have a fixed number of points for the whole set of images (assuming they are images of the same object). This number will be significantly smaller than in each single image (sometimes around 30 times fewer points). Then do whatever you want with them!
Hint for matching: http://www.vlfeat.org/matlab/vl_ubcmatch.html
UPDATE:
You seem to be trying to train some kind of OCR. You would probably need to match SIFT features independently for each character.
How to use vl_ubcmatch:
[~, features1] = vl_sift(single(I1));   % vl_sift expects a single-precision grayscale image
[~, features2] = vl_sift(single(I2));
matches = vl_ubcmatch(features1, features2);
You can apply a dense SIFT to the image. This way you have more control over from where you get the feature descriptors. I haven't used vlfeat, but looking at the documentation I see there's a function to extract dense SIFT features called vl_dsift. With vl_sift, I see there's a way to bypass the detector and extract the descriptors from points of your choice using the 'frames' option. Either way it seems you can get a fixed number of descriptors.
If you are using images of the same size, dense SIFT or the frames option is okay. There's another approach you can take, called the bag-of-features model (similar to the bag-of-words model), in which you cluster the features that you extracted from images to generate codewords and feed them into a classifier.
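A rough sketch of both options with VLFeat (assuming vl_setup has been run; the step, bin size and frame values are arbitrary):

I = single(imread('cameraman.tif'));        % VLFeat expects single-precision grayscale

% dense SIFT: descriptors on a fixed grid, so same-size images yield the same count
[frames, descriptors] = vl_dsift(I, 'Step', 8, 'Size', 8);

% alternatively, bypass the detector and describe points of your choice
fc = [64; 64; 4; 0];                        % one frame: [x; y; scale; orientation]
[f, d] = vl_sift(I, 'Frames', fc);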

PCA on SIFT descriptors and Fisher Vectors

I was reading this particular paper http://www.robots.ox.ac.uk/~vgg/publications/2011/Chatfield11/chatfield11.pdf and I find the Fisher Vector with GMM vocabulary approach very interesting and I would like to test it myself.
However, it is totally unclear (to me) how they apply PCA dimensionality reduction to the data. I mean, do they compute the feature space first and then perform PCA on it? Or do they perform PCA on every image's SIFT descriptors and then create the feature space?
Is this supposed to be done for both the training and test sets? To me the answer is an obvious 'yes', however it is not clear.
I was thinking of creating the feature space from training set and then run PCA on it. Then, I could use that PCA coefficient from training set to reduce each image's sift descriptor that is going to be encoded into Fisher Vector for later classification, whether it is a test or a train image.
EDIT 1;
Simplistic example:
[coef , reduced_feat_space]= pca(Feat_Space','NumComponents', 80);
and then (for both test and train images)
reduced_test_img = test_img * coef; (And then choose the first 80 dimensions of the reduced_test_img)
What do you think? Cheers
It looks to me like they do SIFT first and then do PCA. The article states in section 2.1: "The local descriptors are fixed in all experiments to be SIFT descriptors..."
Also, in the introduction: "the following three steps: (i) extraction of local image features (e.g., SIFT descriptors), (ii) encoding of the local features in an image descriptor (e.g., a histogram of the quantized local features), and (iii) classification ... Recently several authors have focused on improving the second component". So it looks to me like the dimensionality reduction occurs after SIFT, and the paper is simply discussing a few different methods of doing this and the performance of each.
I would also guess (as you did) that you would have to run it on both sets of images. Otherwise you would be using two different metrics to classify the images; it really is like comparing apples to oranges. Comparing a reduced-dimensional representation to the full one (even for the same exact image) will show some variation. In fact that is the whole premise of PCA: you are (usually) giving up some smaller features for computational efficiency. The real question with PCA or any dimensionality reduction algorithm is how much information can I give up and still reliably classify/segment different data sets?
And as a last point, you would have to treat both images the same way, because your end goal is to use the Fisher vector for classification, whether an image is test or training. Now imagine you decided that training images don't get PCA and test images do. Now I give you some image X, what would you do with it? How could you treat one set of images differently from another BEFORE you've classified them? Using the same technique on both sets means you'd process my image X and then decide where to put it.
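In code, that amounts to something like the following sketch (trainDesc and testDesc are hypothetical matrices of pooled SIFT descriptors, one descriptor per row):

% learn the projection on the training descriptors only
[coeff, ~, ~, ~, ~, mu] = pca(trainDesc, 'NumComponents', 80);

% apply the SAME mean and coefficients to both sets
trainReduced = (trainDesc - mu) * coeff;    % N_train x 80
testReduced  = (testDesc  - mu) * coeff;    % N_test  x 80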
Anyway, I hope that helped and wasn't too rant-like. Good luck :-)

Combine detectors in Bag of visual words

I am using Bag of visual words for classification.
I have quantized SIFT descriptor into 100 words for each image and encoded the histogram of the images and have completed classification.
Now, I want to try to combine two different descriptors and detectors i.e. SIFT and SURF, which means neither the number of key points will be the same nor will be the descriptor dimensionality (SIFT 128D and SURF 64D).
What will be the easiest way to combine them?
If, for each image, I encode one histogram for SIFT (which will be a 100x1 histogram) and another for SURF (another 100x1) and then stack them together making 200x1 histogram, will that be correct?
Any other way?
Thanks a lot in advance.
In bag of words, the number of keypoints and the descriptor size are irrelevant: once you generate the codebook, you get a histogram whose dimensionality depends only on the codebook size. And since the histogram is normalized, it does not depend on the number of features detected per image. If you have both SIFT and SURF features, all you need to do is generate 2 codebooks (one per descriptor type), compute a histogram against each, and concatenate the histograms to get the final feature vector.
A brief overview of the method is mentioned here:
http://en.wikipedia.org/wiki/Bag-of-words_model_in_computer_vision
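A sketch of the concatenation, assuming siftWords (100x128) and surfWords (100x64) are codebooks already learned with k-means, and siftDesc / surfDesc hold one image's descriptors:

hSift = histcounts(knnsearch(siftWords, double(siftDesc)), 1:101);
hSurf = histcounts(knnsearch(surfWords, double(surfDesc)), 1:101);

% normalize each histogram separately, then stack into one 200-D image descriptor
imageDescriptor = [hSift / max(sum(hSift), 1), hSurf / max(sum(hSurf), 1)];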

Ideas for extracting features of an object using keypoints of an image

I'd appreciate it if you could help me create a feature vector for a simple object using keypoints. For now, I use the ETH-80 dataset; the objects have an almost-blue background and the pictures are taken from different views. Like this:
After creating a feature vector, I want to train a neural network with this vector and use that neural network to recognize an input image of an object. I don't want to make it complex; input images will be as simple as the training images.
I asked similar questions before, and someone suggested using the average value of a 20x20 neighborhood around each keypoint. I tried it, but it does not seem to work with the ETH-80 images because of the different views. That's why I am asking another question.
SURF or SIFT. Look for interest point detectors. A MATLAB SIFT implementation is freely available.
Update: Object Recognition from Local Scale-Invariant Features
SIFT and SURF features consist of two parts, the detector and the descriptor. The detector finds the point in some n-dimensional space (4D for SIFT), and the descriptor is used to robustly describe the surroundings of said points. The latter is increasingly used for image categorization and identification in what is commonly known as the "bag of words" or "visual words" approach. In the most simple form, one can collect all data from all descriptors from all images and cluster them, for example using k-means. Every original image then has descriptors that contribute to a number of clusters. The centroids of these clusters, i.e. the visual words, can be used as a new descriptor for the image. The VLFeat website contains a nice demo of this approach, classifying the Caltech 101 dataset:
http://www.vlfeat.org/applications/apps.html#apps.caltech-101
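If you prefer not to implement the clustering yourself, the Computer Vision Toolbox bag-of-features does essentially this in a few lines; a sketch, assuming the ETH-80 images are arranged in one folder per class (the folder name is a placeholder):

imds = imageDatastore('ETH80', 'IncludeSubfolders', true, 'LabelSource', 'foldernames');
bag  = bagOfFeatures(imds);                      % builds the visual vocabulary with k-means
featureVector = encode(bag, readimage(imds, 1)); % fixed-length descriptor for one image
% one such vector per image can then be fed to a neural network or any other classifier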