I was asked to use Manhattan distance instead of Euclidean distance for bisecting k-means in Spark. I tried changing the code, but due to various private declarations and the limited scope of the existing code I am unable to put together a complete solution. Could somebody suggest another way to do it?
There is a good reason why Spark chooses Euclidean distance without providing an easy way to override it. You should be aware that k-means is designed for Euclidean distance. With other distance functions it may stop converging to the optimum, because the mean is no longer the best estimate of the cluster "centroid". Please see the paper below: http://research.ijcaonline.org/volume67/number10/pxc3886785.pdf
And here is the paper conclusion:
"As a conclusion, the K-means, which is implemented using Euclidean distance metric gives best result and K-means based on Manhattan distance metric’s performance, is worst."
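To make the point concrete, here is a minimal sketch (plain NumPy, not Spark code; the data and names are made up for illustration) of why swapping only the distance breaks k-means: the arithmetic mean minimizes the sum of squared Euclidean distances, while the component-wise median minimizes the sum of Manhattan distances.

```python
import numpy as np

rng = np.random.default_rng(0)
points = rng.normal(size=(100, 4)) ** 3   # a skewed 4-dimensional "cluster"

mean = points.mean(axis=0)
median = np.median(points, axis=0)

def total_sq_euclidean(center):
    return np.sum((points - center) ** 2)

def total_manhattan(center):
    return np.sum(np.abs(points - center))

print("sum of squared Euclidean distances to mean / median:",
      total_sq_euclidean(mean), total_sq_euclidean(median))   # the mean wins
print("sum of Manhattan distances to mean / median:",
      total_manhattan(mean), total_manhattan(median))         # the median wins
```

So if Manhattan distance is a hard requirement, the consistent fix is to change the centre-update step as well (a k-medians variant uses the component-wise median), rather than only the distance computation.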
I'm using the k-means algorithm to cluster my data.
I have 5,000 samples. (Each sample describes a customer; to analyse customer value I'm going to cluster them based on 4 behavioural features.)
The distance is calculated using either the Euclidean metric or Pearson correlation.
I need to know whether Euclidean distance or Pearson correlation is the correct method for calculating the distances.
I'm using the silhouette score to validate my clustering. When I use Pearson correlation, the silhouette value is higher than when I use the Euclidean metric.
Does this mean that Pearson correlation is the more appropriate distance metric?
k-means does not support arbitrary distances.
It is based on variance minimization, which corresponds to (squared) Euclidean distance.
With Pearson correlation, it will fail badly.
See this answer for an example of how k-means fails badly with Pearson correlation:
https://stackoverflow.com/a/21335448/1060350
Short summary: the mean does not work for Pearson correlation, but k-means is based on computing means. Use PAM or a similar method that uses medoids instead.
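As an illustration of the medoid idea, here is a hypothetical sketch (NumPy only) of a simplified Voronoi-style k-medoids iteration on a precomputed Pearson distance matrix. It is not the full PAM swap heuristic, and all names and the example data are made up; it only shows that medoids, unlike means, remain meaningful under a correlation-based distance.

```python
import numpy as np

def pearson_distance_matrix(X):
    # 1 - Pearson correlation between the rows of X
    return 1.0 - np.corrcoef(X)

def k_medoids(D, k, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    medoids = rng.choice(n, size=k, replace=False)
    for _ in range(n_iter):
        labels = np.argmin(D[:, medoids], axis=1)       # assign each point to its nearest medoid
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.where(labels == c)[0]
            if len(members) == 0:
                continue
            # the medoid is the member with the smallest total distance to its cluster
            within = D[np.ix_(members, members)].sum(axis=1)
            new_medoids[c] = members[np.argmin(within)]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    labels = np.argmin(D[:, medoids], axis=1)
    return medoids, labels

X = np.random.default_rng(1).normal(size=(50, 4))       # e.g. 50 customers, 4 features
D = pearson_distance_matrix(X)
medoids, labels = k_medoids(D, k=3)
```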
One approach for clustering a high-dimensional dataset is to use a linear transformation, and the most common approaches are PCA and random projection (where random projection arises from the Johnson-Lindenstrauss lemma). I was wondering why we can't use other random transformations, e.g. when the transformation matrix R is drawn from a uniform distribution?
There are many random projections in use, such as the one by Achlioptas:
Achlioptas, D. (2001, May). Database-friendly random projections. In Proceedings of the twentieth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems (pp. 274-281). ACM.
The J-L lemma only proves that at least one projection with the desired properties exists, but it does not give an actual projection. IIRC, a uniformly distributed random matrix was not shown to satisfy these optimality criteria.
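For reference, a small sketch (assuming NumPy; variable names are made up) of the database-friendly projection from the Achlioptas paper cited above: the matrix entries are sqrt(3) · {+1 with probability 1/6, 0 with probability 2/3, −1 with probability 1/6}, which satisfies the J-L guarantee despite being sparse and non-Gaussian.

```python
import numpy as np

def achlioptas_projection(X, k, seed=0):
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    # sparse projection matrix from Achlioptas (2001)
    R = np.sqrt(3) * rng.choice([1.0, 0.0, -1.0], size=(d, k), p=[1 / 6, 2 / 3, 1 / 6])
    return X @ R / np.sqrt(k)           # project n x d data down to n x k

X = np.random.default_rng(1).normal(size=(100, 1000))
Y = achlioptas_projection(X, k=50)

# pairwise Euclidean distances are approximately preserved
print(np.linalg.norm(X[0] - X[1]), np.linalg.norm(Y[0] - Y[1]))
```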
Can we use hierarchical agglomerative clustering to cluster data in this format?
"beirut,proff,email1"
"beirut,proff,email2"
"swiss,aproff,email1"
"france,instrc,email2"
"swiss,instrc,email2"
"beirut,proff,email1"
"swiss,instrc,email2"
"france,aproff,email2"
If not, what clustering algorithm is compatible with data that has string values?
Thank you for your help!
Any type of clustering requires a distance metric. If all you're willing to do with your strings is treat them as equal or not equal to each other, the best you can really do is the field-wise Hamming distance: the distance between "abc,def,ghi" and "uvw,xyz,ghi" is 2, and the distance between "abc,def,ghi" and "abw,dez,ghi" is also 2. If you want to cluster similar strings within a particular field, say clustering "Slovakia" and "Slovenia" because of the name similarity, or "Poland" and "Ukraine" because they border each other, you'll need more complex metrics. Given a distance metric, hierarchical agglomerative clustering should work fine.
All this assumes, however, that clustering is what you actually want to do. Your dataset seems like sort of an odd use-case for clustering.
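To illustrate the field-wise Hamming idea, here is a hedged sketch using SciPy's agglomerative clustering on a precomputed distance matrix. The records come from the question; the helper name fieldwise_hamming and the choice of average linkage and two clusters are my own, not anything prescribed.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

records = [
    ("beirut", "proff", "email1"),
    ("beirut", "proff", "email2"),
    ("swiss", "aproff", "email1"),
    ("france", "instrc", "email2"),
    ("swiss", "instrc", "email2"),
]

def fieldwise_hamming(a, b):
    # treat each field as equal / not equal; count the fields that differ
    return sum(x != y for x, y in zip(a, b))

n = len(records)
D = np.array([[fieldwise_hamming(records[i], records[j]) for j in range(n)]
              for i in range(n)], dtype=float)

# condensed distance matrix -> average-linkage agglomerative clustering
Z = linkage(squareform(D), method="average")
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```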
Hierarchical clustering is a rather flexible clustering algorithm. Except for some linkages (Ward?), it places no requirements on the "distance": it could be a similarity as well, negative values will usually work just as well, you don't need the triangle inequality, etc.
Other algorithms, such as k-means, are much more limited. K-means minimizes variance, so it can only handle (squared) Euclidean distance; and it needs to be able to compute means, so the data needs to live in a continuous, fixed-dimensionality vector space; sparsity may also be an issue.
One algorithm that is probably even more flexible is Generalized DBSCAN. Essentially, it needs a binary decision "x is a neighbor of y" (e.g. distance less than epsilon) and a predicate to identify "core points" (e.g. density). You can come up with arbitrarily complex predicates of this kind, which may no longer correspond to a single "distance" at all.
Either way: if you can measure the similarity of these records, hierarchical clustering should work. The question is whether you can get enough similarity out of that data, and not just 3 bits: "has the same email", "has the same name", "has the same location". 3 bits will not provide a very interesting hierarchy.
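For completeness, a minimal sketch of the Generalized DBSCAN idea: the algorithm only needs an is_neighbor predicate and a core-point predicate. Both predicates below are made-up examples for illustration, not from any library, and the implementation is a bare-bones DBSCAN expansion, not an optimized one.

```python
# labels: None = unvisited, -1 = noise, 0..k = cluster id
def generalized_dbscan(points, is_neighbor, is_core):
    NOISE = -1
    labels = [None] * len(points)
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = [j for j in range(len(points)) if is_neighbor(points[i], points[j])]
        if not is_core(neighbors):
            labels[i] = NOISE
            continue
        labels[i] = cluster
        seeds = list(neighbors)
        while seeds:
            j = seeds.pop()
            if labels[j] == NOISE:
                labels[j] = cluster                  # noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = [m for m in range(len(points)) if is_neighbor(points[j], points[m])]
            if is_core(j_neighbors):
                seeds.extend(j_neighbors)            # expand only from core points
        cluster += 1
    return labels

# made-up predicates: "same email" as the neighborhood, ">= 2 neighbors" as core
records = [("beirut", "proff", "email1"), ("swiss", "aproff", "email1"),
           ("france", "instrc", "email2"), ("swiss", "instrc", "email2")]
print(generalized_dbscan(records,
                         is_neighbor=lambda a, b: a[2] == b[2],
                         is_core=lambda nbrs: len(nbrs) >= 2))
```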
In my clustering problem, not only can points come and go, but features can also be removed or added. Is there any clustering algorithm for this problem?
Specifically, I am looking for an agglomerative hierarchical clustering version of this kind of algorithm.
You can use hierarchical clustering (except that it scales really badly) or any other distance-based clustering. K-means, however, is a bit tricky: how do you compute the mean when a value is not present?
You only need to define an appropriate distance function first.
Clustering is usually done based on similarity, so first find out what "similar" means for you. This is very data-set- and use-case-specific, although many people can use some kind of distance function. There is no "one size fits all" solution.
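As one concrete, assumed (not standard) example of such a distance function for data where features appear and disappear: encode missing values as NaN and average the per-feature differences over only the features two points share. The function name and the rescaling choice are illustrative, not an established API.

```python
import numpy as np

def masked_distance(a, b):
    # compare only the features that are present (not NaN) in both points
    shared = ~np.isnan(a) & ~np.isnan(b)
    if not shared.any():
        return np.nan                     # nothing comparable
    return float(np.abs(a[shared] - b[shared]).mean())

x = np.array([1.0, 2.0, np.nan, 4.0])
y = np.array([1.5, np.nan, 3.0, 1.0])
print(masked_distance(x, y))              # uses only features 0 and 3 -> 1.75
```

The resulting pairwise distance matrix can then be fed into hierarchical clustering or any other distance-based method, as suggested above.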