Selecting an appropriate similarity metric for a k-means clustering model

I'm using the k-means algorithm to cluster my data.
I have 5,000 samples. (Each sample describes a customer; to analyze customer value, I'm going to cluster them based on 4 behavioral features.)
The distance is calculated using either the Euclidean metric or Pearson correlation, and I don't know which of the two is the correct method for calculating distances.
I'm using the silhouette coefficient to validate my clustering. When I use Pearson correlation, the silhouette value is higher than when I use the Euclidean metric.
Does this mean that Pearson correlation is the more appropriate distance metric?

k-means does not support arbitrary distances.
It is based on variance minimization, which corresponds to (squared) Euclidean distance.
With Pearson correlation, it will fail badly.
See this answer for an example of how k-means fails badly with Pearson:
https://stackoverflow.com/a/21335448/1060350
Short summary: the mean does not work for Pearson, but k-means is based on computing means. Use PAM or a similar method that uses medoids instead.
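A minimal sketch of the medoid-based alternative in Python, assuming the scikit-learn-extra package is available (SciPy's "correlation" metric is 1 minus the Pearson correlation); the sizes and parameters here are illustrative, not taken from the question:

```python
# Sketch: PAM (k-medoids) on a precomputed Pearson distance matrix.
# Assumes scikit-learn-extra is installed; sizes/parameters are illustrative.
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.metrics import silhouette_score
from sklearn_extra.cluster import KMedoids

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))                  # customers x behavioral features

D = squareform(pdist(X, metric="correlation"))  # pairwise 1 - Pearson r

# PAM uses actual samples (medoids) as cluster centers, so any distance is valid.
pam = KMedoids(n_clusters=4, metric="precomputed", method="pam", random_state=0)
labels = pam.fit_predict(D)

# Validate with silhouette on the same precomputed distances.
print(silhouette_score(D, labels, metric="precomputed"))
```

Because the silhouette is computed on the same precomputed distances used for clustering, the comparison between metrics stays consistent.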

Related

How can I get the similarity matrix from minhash LSH?

I have read many tutorials and tried a number of minhash LSH implementations, but they cannot generate the similarity matrix; instead they return only the pairs of similar data that exceed the threshold. How can I generate it? My intention is to use the LSH results for clustering.
The whole point of LSH is to avoid computing pairwise distances, because that does not scale.
If you then put the data into a distance matrix, you get all the scalability problems back!
Instead, consider an algorithm like DBSCAN clustering. It doesn't need a distance matrix, only the neighbors within distance epsilon.
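A minimal sketch of that idea in Python with scikit-learn; the data, eps, and min_samples values are illustrative. DBSCAN is run directly on boolean set-membership vectors with Jaccard distance, so no n-by-n similarity matrix is ever materialized:

```python
# Sketch: DBSCAN on boolean set-membership vectors with Jaccard distance,
# without ever materializing an n x n similarity matrix.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = rng.random((1000, 64)) < 0.1        # boolean "shingle present" vectors

# eps is a Jaccard *distance* threshold: 0.4 keeps pairs with similarity >= 0.6
db = DBSCAN(eps=0.4, min_samples=3, metric="jaccard", algorithm="brute")
labels = db.fit_predict(X)              # -1 marks noise points
```

If you already have candidate neighbor pairs from LSH, scikit-learn's DBSCAN also accepts a sparse precomputed distance matrix containing only those neighbor entries (metric="precomputed").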

Using Manhattan distance for bisecting k-means instead of Euclidean distance

I was asked to use Manhattan distance instead of Euclidean distance for bisecting k-means in Spark. I tried changing it and using the code, but due to various private declarations and limited scope in the existing code, I am unable to create a complete solution. Could somebody tell me what other way I can do it?
There is a good reason why Spark chooses Euclidean distance without giving an easy way to override it. You should be aware that k-means is designed for Euclidean distance. It may stop converging to an optimum with other distance functions, when the mean is no longer the best estimate for the cluster "centroid". Please see the paper below: http://research.ijcaonline.org/volume67/number10/pxc3886785.pdf
And here is the paper conclusion:
As a conclusion, the K-means, which is implemented using Euclidean distance metric gives best result and K-means based on Manhattan distance metric's performance, is worst.
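If Manhattan distance is nevertheless a hard requirement, the matching centroid estimate is the component-wise median rather than the mean (k-medians), since the median minimizes the sum of absolute deviations. A minimal Lloyd-style sketch in Python, independent of Spark's internals; the function name, sizes, and iteration count are illustrative:

```python
# Sketch: k-medians - Lloyd iterations with Manhattan distance and
# component-wise medians, which minimize the L1 cost within each cluster.
import numpy as np

def k_medians(X, k, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # assign each point to the center with the smallest L1 distance
        d = np.abs(X[:, None, :] - centers[None, :, :]).sum(axis=2)
        labels = d.argmin(axis=1)
        # update: the per-coordinate median minimizes sum of |x - c|
        for j in range(k):
            if np.any(labels == j):
                centers[j] = np.median(X[labels == j], axis=0)
    return labels, centers

labels, centers = k_medians(np.random.default_rng(1).random((500, 3)), k=4)
```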

Clustering based on Pearson correlation

I have a use case where I have traffic data for every 15 minutes for 1 month.
This data is collected for various resources in a network.
Now I need to group resources which are similar (based on the traffic usage pattern over 00:00 to 23:45 hrs).
One way to check if two resources have similar traffic behavior is to compute the Pearson correlation coefficient for all pairs of resources and create an N*N matrix.
My question is: which method should I apply to cluster the similar resources?
Existing methods in k-means clustering are based on Euclidean distance. Which algorithm can I use to cluster based on similarity of pattern?
Any thoughts or links to possible solutions are welcome. I want to implement this using Java.
Pearson correlation is not compatible with the mean. Thus, k-means must not be used: it is proper for least-squares, but not for correlation.
Instead, just use hierarchical agglomerative clustering (HAC), which will work with Pearson correlation matrices just fine. Or DBSCAN: it also works with arbitrary distance functions. You can set a threshold: an absolute correlation of, e.g., +0.75 may be a desirable value of epsilon. But to get a feeling for your distance function, the dendrograms used by HAC are probably easier.
Beware that Pearson is not defined for constant patterns. If you have a resource with 0 usage, its distance will be undefined.
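A compact sketch of the HAC route with SciPy (the question asks for Java, but the same steps map to any library); note that SciPy's "correlation" distance is 1 minus the plain (not absolute) Pearson correlation, and all sizes here are illustrative:

```python
# Sketch: hierarchical clustering of 15-minute traffic profiles.
# SciPy's "correlation" distance is 1 - Pearson r (plain, not absolute).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
profiles = rng.random((50, 96))         # resources x 96 quarter-hour bins

# Pearson is undefined for constant rows (e.g. zero-usage resources),
# so filter those out before computing distances.
d = pdist(profiles, metric="correlation")   # condensed pairwise distances
Z = linkage(d, method="average")            # inspect with scipy's dendrogram()

# Cut where 1 - r <= 0.25, i.e. keep clusters with correlation >= +0.75.
labels = fcluster(Z, t=0.25, criterion="distance")
```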

k-NN regression in Matlab

What is the k-nearest-neighbour regression function in Matlab? Is only the knn classification function available? Does anybody know of any useful literature regarding this?
I don't believe the k-NN regression algorithm is directly implemented in Matlab, but if you do some googling you can find some valid implementations. The algorithm is fairly simple, though; a sketch follows below:
1. Find the k nearest elements using whatever distance metric is suitable.
2. Compute the inverse-distance weight of each of the k elements.
3. Compute the weighted mean of the k elements using the inverse-distance weights.
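A minimal NumPy sketch of those three steps (the function name and the epsilon guard are illustrative, not a Matlab built-in):

```python
# Sketch: inverse-distance-weighted k-NN regression in plain NumPy.
import numpy as np

def knn_regress(X_train, y_train, x_query, k=5, eps=1e-12):
    # 1. find the k nearest training points (Euclidean distance here)
    dist = np.linalg.norm(X_train - x_query, axis=1)
    idx = np.argsort(dist)[:k]
    # 2. inverse-distance weights (eps guards against division by zero)
    w = 1.0 / (dist[idx] + eps)
    # 3. weighted mean of the k neighbors' targets
    return np.sum(w * y_train[idx]) / np.sum(w)

X = np.random.default_rng(0).random((100, 2))
y = X.sum(axis=1)
print(knn_regress(X, y, np.array([0.5, 0.5]), k=5))
```

Outside Matlab, the same behavior is available as scikit-learn's KNeighborsRegressor(n_neighbors=k, weights="distance").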

How to do clustering with similarity as a measure?

I read about spherical k-means but I did not come across an implementation. To be clear, similarity is simply the dot product of two document unit vectors. I have read that standard k-means uses distance as its measure. Is the distance being specified the vector distance, just like in coordinate geometry: sqrt((x2-x1)^2 + (y2-y1)^2)?
There are more clustering methods than k-means. The problem with k-means is not so much that it is built on Euclidean distance, but that the mean must reduce the distances for the algorithm to converge.
However, there are tons of other clustering algorithms that do not need to compute a mean or require the triangle inequality. If you read the Wikipedia article on DBSCAN, it also mentions a version called GDBSCAN, Generalized DBSCAN. You should definitely be able to plug your similarity function into GDBSCAN. Most likely you could also just use 1/similarity as a distance function, unless the algorithm requires the triangle inequality. This trick should work with DBSCAN and OPTICS, for example, and probably also with hierarchical clustering, k-medians, and k-medoids (PAM).
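A concrete sketch of that trick in Python, here using 1 - similarity (which for unit vectors is exactly cosine distance) rather than 1/similarity, since it stays bounded and zero on the diagonal; all sizes and the eps value are illustrative:

```python
# Sketch: density-based clustering of unit document vectors using a
# distance derived from dot-product similarity (1 - sim = cosine distance).
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
docs = rng.random((200, 300))
docs /= np.linalg.norm(docs, axis=1, keepdims=True)   # unit vectors

D = 1.0 - docs @ docs.T                 # distance matrix from similarities
np.clip(D, 0.0, None, out=D)            # guard tiny negative rounding errors

db = DBSCAN(eps=0.3, min_samples=5, metric="precomputed")
labels = db.fit_predict(D)
```

For spherical k-means specifically, a common approximation is to L2-normalize the vectors and run standard k-means, since for unit vectors the squared Euclidean distance equals 2 - 2 * (dot product).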