How can I get the similarity matrix from MinHash LSH?

I have read many tutorials and tried a number of MinHash LSH implementations, but none of them can generate the similarity matrix; instead, they return only the items whose similarity exceeds the threshold. How can I generate the matrix? My intention is to use the LSH results for clustering.

The whole point of LSH is to avoid pairwise distances, because that does not scale.
If you then put the data into a distance matrix, you get all the scalability problems again!
Instead, consider an algorithm like DBSCAN clustering. It doesn't need a distance matrix, only the neighbors within distance epsilon of each point.
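A minimal Python sketch of that idea, assuming the items are sets compared by Jaccard distance. The names `jaccard_distance`, `region_query`, and `dbscan` are illustrative, not from any library. `region_query` is a brute-force stand-in; in practice an LSH index lookup would take its place. The clustering itself only ever asks "who is within eps of item i?" and never builds a full matrix.

```python
from collections import deque

def jaccard_distance(a, b):
    """1 - |intersection| / |union| for two sets."""
    union = len(a | b)
    return 1.0 - (len(a & b) / union if union else 1.0)

def region_query(items, i, eps):
    # Brute-force stand-in (an assumption of this sketch): an LSH index would
    # return candidates for items[i], to be filtered by true distance.
    return [j for j in range(len(items))
            if jaccard_distance(items[i], items[j]) <= eps]

def dbscan(items, eps, min_pts, neighbors=region_query):
    """Return one label per item; -1 marks noise."""
    NOISE, UNSEEN = -1, None
    labels = [UNSEEN] * len(items)
    cluster = -1
    for i in range(len(items)):
        if labels[i] is not UNSEEN:
            continue
        nbrs = neighbors(items, i, eps)
        if len(nbrs) < min_pts:
            labels[i] = NOISE          # may later become a border point
            continue
        cluster += 1                   # i is a core point: start a new cluster
        labels[i] = cluster
        queue = deque(nbrs)
        while queue:
            j = queue.popleft()
            if labels[j] == NOISE:
                labels[j] = cluster    # noise reachable from a core point
            if labels[j] is not UNSEEN:
                continue
            labels[j] = cluster
            j_nbrs = neighbors(items, j, eps)
            if len(j_nbrs) >= min_pts: # only core points expand the cluster
                queue.extend(j_nbrs)
    return labels

docs = [{"a", "b", "c"}, {"a", "b", "c", "d"}, {"a", "b", "d"},
        {"x", "y", "z"}, {"x", "y", "z", "w"}, {"q"}]
labels = dbscan(docs, eps=0.5, min_pts=2)
```

Swapping `neighbors` for a function that queries an LSH index keeps the rest of the code unchanged, which is why the matrix is never needed.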

Related

Bag of feature: how to create the query histogram?

I'm trying to implement the Bag of Features model.
Given a descriptor matrix (representing an image) that belongs to the initial dataset, computing its histogram is easy, since we already know from k-means which cluster each descriptor vector belongs to.
But what if we want to compute the histogram of a query matrix? The only solution that comes to mind is to compute the distance from each descriptor vector to each of the k cluster centroids.
This can be inefficient: suppose k=100 (so 100 centroids) and a query image represented by 1000 SIFT descriptors; that gives a 1000x100 matrix.
We then have to compute 1000 * 100 Euclidean distances in 128 dimensions. This seems really inefficient.
How to solve this problem?
NOTE: can you suggest some implementations where this point is explained?
NOTE: I know LSH is a solution (since we are using high-dim vectors), but I don't think that actual implementations use it.
UPDATE:
I was talking with a colleague of mine: using a hierarchical clustering approach instead of classic k-means should speed up the process considerably. Is it correct to say that with k centroids, a hierarchical clustering needs only about log(k) comparisons to find the closest centroid instead of k comparisons?
For a bag of features approach, you do indeed need to quantize the descriptors. Yes, if you have 10000 features and 100 centroids, that is 10000*100 distances (unless you use an index here).
Compare this to comparing each of the 10000 features to each of the 10000 features of each image in your database. Does it still sound that bad?
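The quantization step discussed above can be sketched in Python as follows. This is a toy illustration under stated assumptions: 2-D points stand in for 128-D SIFT descriptors, and the function names are made up for the example. Each query descriptor is assigned to its nearest centroid by a linear scan, and the assignments are counted into a histogram.

```python
def squared_distance(u, v):
    # Squared Euclidean distance; the square root is not needed to find the minimum.
    return sum((a - b) ** 2 for a, b in zip(u, v))

def nearest_centroid(descriptor, centroids):
    # Linear scan over all k centroids. An index (or a hierarchical vocabulary)
    # would replace this step if it became the bottleneck.
    return min(range(len(centroids)),
               key=lambda c: squared_distance(descriptor, centroids[c]))

def bow_histogram(descriptors, centroids):
    # One bin per centroid; each descriptor votes for its nearest centroid.
    hist = [0] * len(centroids)
    for d in descriptors:
        hist[nearest_centroid(d, centroids)] += 1
    return hist

# Toy data: 3 centroids, 4 query descriptors (2-D instead of 128-D).
centroids = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
query = [(1.0, 1.0), (9.0, 1.0), (8.0, -1.0), (0.5, 9.0)]
hist = bow_histogram(query, centroids)
```

The cost is (number of descriptors) x (number of centroids) distance computations per query image, independent of how many images the database holds.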

Self-Organizing Maps

I have a question on self-organizing maps:
But first, here is my approach on implementing one:
The SOM neurons are stored in a basic array. Each neuron consists of a vector (another array whose size equals the number of input neurons) of double values, which are initialized to random values.
As far as I understand the algorithm, this is actually all I need to implement it.
So, for the training I choose a sample of the training data at random and calculate the BMU using the Euclidean distance between the sample's values and the neuron weights.
Afterwards I update its weights and those of all other neurons in its range, depending on the neighborhood function and the learning rate.
Then, I decrease the neighborhood function and the learning rate.
This is done until a fixed amount of iterations.
My question is now: How do I determine the clusters after the training? My approach so far is to present a new input vector and find the BMU, i.e. the neuron with the minimum Euclidean distance to it. But this seems a little naive to me. I'm sure that I've missed something.
There is no single correct way of doing that. As you noted, finding the BMU is one of them and the only one that makes sense if you just want to find the most similar cluster.
If you want to reconstruct your input vector, returning the BMU prototype works too, but may not be very precise (it is equivalent to the Nearest Neighbor rule or 1NN). For a better reconstruction, you need to interpolate between neurons. This could be done by weighting each neuron inversely proportionally to its distance to the input vector and then computing the weighted average (this is equivalent to weighted KNN). You can also restrict this interpolation to the BMU's neighbors only, which will work faster and may give better results (this would be weighted 5NN). This technique was used here: The Continuous Interpolating Self-organizing Map.
You can see and experiment with those different options here: http://www.inf.ufrgs.br/~rcpinto/itm/ (not a SOM, but a close cousin). Click "Apply" to do regression on a curve using the reconstructed vectors, then check "Draw Regression" and try the different options.
BTW, the description of your implementation is correct.
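The BMU lookup and the inverse-distance interpolation described in this answer can be sketched in Python as follows. The function names are illustrative. One simplification to note: restricting the interpolation to the `k` neurons closest in input space is an assumption of this sketch; the answer restricts it to the BMU's neighbors on the map grid, which needs the grid topology.

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def bmu_index(neurons, x):
    # Best matching unit: the neuron whose weight vector is closest to x (1NN).
    return min(range(len(neurons)), key=lambda i: euclidean(neurons[i], x))

def interpolate(neurons, x, k=None, tiny=1e-12):
    """Inverse-distance weighted average of the k closest neurons (all if k is None)."""
    ranked = sorted(neurons, key=lambda w: euclidean(w, x))
    if k is not None:
        ranked = ranked[:k]
    # `tiny` avoids division by zero when x coincides with a neuron.
    weights = [1.0 / (euclidean(w, x) + tiny) for w in ranked]
    total = sum(weights)
    return [sum(wt * w[d] for wt, w in zip(weights, ranked)) / total
            for d in range(len(x))]

neurons = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0)]
x = (0.5, 0.0)
cluster = bmu_index(neurons, x)              # cluster assignment
reconstruction = interpolate(neurons, x, k=2)  # smoother than the BMU prototype alone
```

Here the BMU prototype alone would reconstruct `x` as `(0.0, 0.0)`, while the weighted average of the two closest neurons lands much nearer to `x`.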
A pretty common approach nowadays is soft subspace clustering, where feature weights are added to find the most relevant features. You can use these weights to improve the BMU calculation with Euclidean distance.

Selecting an appropriate similarity metric of a k-means clustering model

I'm using the k-means algorithm to cluster my data.
I have 5 thousand samples. (Each sample describes a customer. To analyse customer value, I'm going to cluster them based on 4 behavior features.)
I have calculated the distance using both the Euclidean metric and Pearson correlation.
I need to know whether Euclidean distance or Pearson correlation is the correct method for calculating distances.
I'm using the silhouette to validate my clustering. With Pearson correlation, the silhouette value is higher than with the Euclidean metric.
Does this mean that Pearson correlation is the more appropriate distance metric?
k-means does not support arbitrary distances.
It is based on variance minimization, which corresponds to (squared) Euclidean distance.
With Pearson correlation, it will fail badly.
See this answer for an example of how k-means fails badly with Pearson:
https://stackoverflow.com/a/21335448/1060350
Short summary: the mean does not work for Pearson, but k-means is based on computing means. Use PAM or a similar method that uses medoids instead.
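To illustrate the medoid idea, here is a minimal Python sketch. It is a simple alternating k-medoids loop, not the full PAM swap search, and it uses `1 - correlation` as the Pearson distance. Both choices, the function names, and the toy customer data are assumptions of this example. Because a medoid is always an actual data point, no mean is ever computed.

```python
import math

def pearson_distance(u, v):
    """1 - Pearson correlation; close to 0 for perfectly correlated profiles."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    if su == 0 or sv == 0:
        return 1.0  # correlation is undefined for constant vectors; treated as uncorrelated here
    return 1.0 - cov / (su * sv)

def k_medoids(points, initial_medoids, dist, max_iter=100):
    """Alternate between assigning points and re-picking medoids (indices into points)."""
    medoids = list(initial_medoids)
    for _ in range(max_iter):
        # Assignment step: each point joins its closest medoid.
        clusters = {m: [] for m in medoids}
        for i, p in enumerate(points):
            nearest = min(medoids, key=lambda m: dist(p, points[m]))
            clusters[nearest].append(i)
        # Update step: the new medoid is the member with the smallest total
        # distance to the other members of its cluster.
        new_medoids = []
        for m, members in clusters.items():
            if not members:            # keep the old medoid for an empty cluster
                new_medoids.append(m)
                continue
            best = min(members,
                       key=lambda c: sum(dist(points[c], points[j]) for j in members))
            new_medoids.append(best)
        if new_medoids == medoids:
            break
        medoids = new_medoids
    labels = [min(range(len(medoids)), key=lambda k: dist(p, points[medoids[k]]))
              for p in points]
    return medoids, labels

# Toy data: three "rising" and three "falling" profiles at very different scales.
customers = [(1, 2, 3, 4), (2, 4, 6, 8.5), (10, 20, 30, 41),
             (4, 3, 2, 1), (8, 6, 4, 1.5), (40, 30, 20, 9)]
medoids, labels = k_medoids(customers, [0, 3], pearson_distance)
```

The rising and falling profiles end up in separate clusters regardless of magnitude, which is the behavior Pearson correlation is meant to capture.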

Knn regression in Matlab

What is the k-nearest-neighbour regression function in MATLAB? Is only the kNN classification function available? Does anybody know of any useful literature on this?
Regards
Farideh
I don't believe the k-NN regression algorithm is directly implemented in MATLAB, but if you do some googling you can find some valid implementations. The algorithm is fairly simple though:
Find the k nearest elements using whatever distance metric is suitable.
Compute the inverse distance weight of each of the k elements.
Compute the weighted mean of the k elements using the inverse distance weights.
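The three steps above can be sketched as follows. The sketch is in Python rather than MATLAB, assumes Euclidean distance as the metric, and uses an illustrative function name `knn_regress`. The `tiny` constant is an assumption added to avoid division by zero when the query coincides with a training point.

```python
import math

def knn_regress(train_x, train_y, query, k=3, tiny=1e-12):
    # Step 1: find the k nearest training points (Euclidean distance assumed).
    nearest = sorted((math.dist(x, query), y)
                     for x, y in zip(train_x, train_y))[:k]
    # Step 2: inverse distance weight of each of the k neighbors.
    weights = [1.0 / (d + tiny) for d, _ in nearest]
    # Step 3: weighted mean of the neighbors' target values.
    return sum(w * y for w, (_, y) in zip(weights, nearest)) / sum(weights)

train_x = [(0.0,), (1.0,), (2.0,), (10.0,)]
train_y = [0.0, 1.0, 2.0, 10.0]
prediction = knn_regress(train_x, train_y, (1.5,), k=2)
```

With `k=2` the query at 1.5 is equally far from the training points at 1.0 and 2.0, so both get the same weight and the prediction is their plain average.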

how to do clustering with similarity as a measure?

I read about spherical k-means but I did not come across an implementation. To be clear, similarity is simply the dot product of two document unit vectors. I have read that standard k-means uses distance as its measure. Is the distance in question the vector distance, just like in coordinate geometry: sqrt((x2-x1)^2 + (y2-y1)^2)?
There are more clustering methods than k-means. The problem with k-means is not so much that it is built on Euclidean distance, but that the mean must reduce the distances for the algorithm to converge.
However, there are tons of other clustering algorithms that do not need to compute a mean or have triangle inequality. If you read the Wikipedia article on DBSCAN, it also mentions a version called GDBSCAN, Generalized DBSCAN. You definitely should be able to plug your similarity function into GDBSCAN. Most likely, you could just use 1/similarity and use it as a distance function, unless the algorithm requires triangle inequality. So this trick should work with DBSCAN and OPTICS, for example. Probably also with hierarchical clustering, k-medians and k-medoids (PAM).
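A small Python sketch of how a dot-product similarity on unit vectors relates to a distance. The `1/similarity` conversion is the one suggested above. The `1 - similarity` variant is an alternative added here as an assumption, since `1/similarity` is undefined when the similarity is zero. The sketch also checks the identity that, for unit vectors, the squared Euclidean distance equals `2 - 2 * similarity`.

```python
import math

def normalize(v):
    # Scale v to unit length so the dot product becomes the cosine similarity.
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

a = normalize([3.0, 4.0, 0.0])
b = normalize([4.0, 3.0, 0.0])

sim = dot(a, b)                                   # similarity of two unit vectors
euclid_sq = sum((x - y) ** 2 for x, y in zip(a, b))
identity_rhs = 2.0 - 2.0 * sim                    # equals euclid_sq for unit vectors

inverse_distance = 1.0 / sim     # the 1/similarity trick; breaks down if sim == 0
cosine_distance = 1.0 - sim      # alternative conversion (assumption of this sketch)
```

Either converted value can be passed to a method that only needs a dissimilarity, such as DBSCAN or k-medoids; whether the triangle inequality holds has to be checked separately for the chosen conversion.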