Clustering based on Pearson correlation

I have a use case where I have traffic data for every 15 minutes for 1 month.
This data is collected for various resources in the network.
Now I need to group resources which are similar (based on traffic usage pattern over 00:00 to 23:45 hrs).
One way to check whether two resources have similar traffic behavior is to compute the Pearson correlation coefficient for all pairs of resources and build an N*N matrix.
My question is: which method should I apply to cluster the similar resources?
Existing methods in k-means clustering are based on Euclidean distance. Which algorithm can I use to cluster based on similarity of pattern?
Any thoughts or links to possible solutions are welcome. I want to implement this in Java.

Pearson correlation is not compatible with the mean. Thus, k-means must not be used - it is proper for least squares, but not for correlation.
Instead, just use hierarchical agglomerative clustering (HAC), which will work with Pearson correlation matrices just fine. Or DBSCAN: it also works with arbitrary distance functions. You can set a threshold: an absolute correlation of, e.g., +0.75 may be a desirable value for epsilon. But to get a feeling for your distance function, the dendrograms used by HAC are probably easier.
Beware that Pearson correlation is not defined for constant patterns. If you have a resource with 0 usage, its distance to everything else will be undefined.
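A minimal sketch in Java (the question asks for Java), assuming each resource's traffic is stored as a double[] of equal length: it computes the Pearson correlation for every pair of resources and converts it to the distance d = 1 - r, which can then be handed to whatever HAC or DBSCAN implementation you use. The class and method names, and the eps = 1 - 0.75 reading of the suggested threshold, are illustrative assumptions.

```java
// Sketch: Pearson correlation matrix -> distance matrix for HAC/DBSCAN.
// Assumes each row of `traffic` is one resource's usage series (equal length, non-constant).
public class PearsonDistance {

    static double pearson(double[] x, double[] y) {
        int n = x.length;
        double mx = 0, my = 0;
        for (int i = 0; i < n; i++) { mx += x[i]; my += y[i]; }
        mx /= n; my /= n;
        double cov = 0, vx = 0, vy = 0;
        for (int i = 0; i < n; i++) {
            cov += (x[i] - mx) * (y[i] - my);
            vx  += (x[i] - mx) * (x[i] - mx);
            vy  += (y[i] - my) * (y[i] - my);
        }
        return cov / Math.sqrt(vx * vy); // NaN if either series is constant
    }

    /** d(i,j) = 1 - r(i,j): 0 = identical pattern, 2 = perfectly anti-correlated. */
    static double[][] distanceMatrix(double[][] traffic) {
        int n = traffic.length;
        double[][] d = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                double dist = 1.0 - pearson(traffic[i], traffic[j]);
                d[i][j] = dist;
                d[j][i] = dist;
            }
        }
        return d; // feed into HAC, or use as the DBSCAN distance with eps = 1 - 0.75 = 0.25
    }
}
```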

Related

Comparing k-means clustering

I have 150 images, 15 each of 10 different people. So basically I know which images should belong together if clustered.
These images are of 73 dimensions (feature vectors), and I clustered them into 10 clusters using the kmeans function in MATLAB.
Later, I processed these 150 data points and reduced their dimension from 73 to 3 for my work, and applied the same kmeans function to them.
I want to compare the results obtained on these data sets (processed and unprocessed) by applying the same k-means function, and wish to know whether the processing which reduced them to a lower dimension improves the k-means clustering or not.
I thought comparing the variance of each cluster could be one parameter for comparison; however, I am not sure if I can directly compare and evaluate my results (within-cluster sum of distances etc.) as the two cases are of different dimensions. Could anyone please suggest a way to compare the k-means results, some way to normalize them, or any other comparison that I can make?
I can think of three options. I am unaware of any well-developed methodology to do this specifically with k-means clustering.
1. Look at the confusion matrices between the two approaches.
2. Compare the Mahalanobis distances between the clusters, and between items in clusters and their nearest other clusters.
3. Look at the Voronoi cells and see how far your points are from the boundaries of the cells.
The problem with 3 is that the distance metrics get skewed: 3D distances vs. 73D distances are not commensurate, so I'm not a fan of that approach. I'd recommend reading some books on k-means if you are adamant about that path; rank speculation is fun, but standing on the shoulders of giants is better.
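For option 1, a minimal sketch of the confusion (contingency) matrix between two k-means labelings of the same points; the method name, the int[] label encoding and the fixed k are illustrative assumptions.

```java
// Sketch: contingency table between two clusterings of the same n points.
// labelsA and labelsB hold cluster ids in [0, k) for each point.
static int[][] confusionMatrix(int[] labelsA, int[] labelsB, int k) {
    int[][] table = new int[k][k];
    for (int i = 0; i < labelsA.length; i++) {
        table[labelsA[i]][labelsB[i]]++;
    }
    return table; // row = cluster in run A, column = cluster in run B
}
```

Since k-means cluster ids are arbitrary, agreement between the two runs shows up as one dominant cell per row and column rather than as values on the diagonal.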

Best way to validate DBSCAN Clusters

I have used the ELKI implementation of DBSCAN to identify fire hot spot clusters from a fire data set and the results look quite good. The data set is spatial and the clusters are based on latitude, longitude. Basically, the DBSCAN parameters identify hot spot regions where there is a high concentration of fire points (defined by density). These are the fire hot spot regions.
My question is, after experimenting with several different parameters and finding a pair that gives a reasonable clustering result, how does one validate the clusters?
Is there a suitable formal validation method for my use case? Or is this subjective depending on the application domain?
ELKI contains a number of evaluation functions for clusterings.
Use the -evaluator parameter to enable them, from the evaluation.clustering.internal package.
Some of them will not automatically run because they have quadratic runtime cost - probably more than your clustering algorithm.
I do not trust these measures. They are designed for particular clustering algorithms, and are mostly useful for deciding the k parameter of k-means - not much more than that. If you blindly go by these measures, you end up with useless results most of the time. Also, these measures do not work with noise, with either of the strategies we tried.
The cheapest are the label-based evaluators. These will automatically run, but apparently your data does not have labels (or they are numeric, in which case you need to set the -parser.labelindex parameter accordingly). Personally, I prefer the Adjusted Rand Index to compare the similarity of two clusterings. All of these indexes are sensitive to noise so they don't work too well with DBSCAN, unless your reference has the same concept of noise as DBSCAN.
If you can afford it, a "subjective" evaluation is always best.
You want to solve a problem, not optimize a number. That is the whole point of "data science": being problem oriented and solving the problem, not obsessing over minimizing some arbitrary quality number. If the results don't work in reality, you have failed.
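ELKI ships its own implementation of the label-based indexes; purely as a standalone sketch of the formula behind the Adjusted Rand Index mentioned above (the method names and int[] label encoding are assumptions):

```java
// Sketch: Adjusted Rand Index between two flat clusterings of the same n points.
// labelsA/labelsB hold cluster ids in [0, kA) and [0, kB) respectively.
static double adjustedRandIndex(int[] labelsA, int[] labelsB, int kA, int kB) {
    int n = labelsA.length;
    long[][] table = new long[kA][kB];
    for (int i = 0; i < n; i++) table[labelsA[i]][labelsB[i]]++;

    long sumPairs = 0;
    long[] rowSums = new long[kA], colSums = new long[kB];
    for (int i = 0; i < kA; i++)
        for (int j = 0; j < kB; j++) {
            sumPairs += choose2(table[i][j]);
            rowSums[i] += table[i][j];
            colSums[j] += table[i][j];
        }
    long sumA = 0, sumB = 0;
    for (long r : rowSums) sumA += choose2(r);
    for (long c : colSums) sumB += choose2(c);

    double expected = (double) sumA * sumB / choose2(n);
    double maximum = (sumA + sumB) / 2.0;
    return (sumPairs - expected) / (maximum - expected); // 1 = identical partitions, ~0 = chance level
}

static long choose2(long m) { return m * (m - 1) / 2; }
```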
There are different methods to validate the output of a DBSCAN clustering. Generally we can distinguish between internal and external indices, depending on whether you have labeled data available or not. For DBSCAN there is a great internal validation index called DBCV.
External Indices:
If you have some labeled data, external indices are great and can demonstrate how well the clustering did vs. the labeled data. One example is the Rand index: https://en.wikipedia.org/wiki/Rand_index
Internal Indices:
If you don't have labeled data, then internal indices can be used to give the clustering result a score. In general, these indices calculate the distances of points within a cluster and to other clusters, and score the result based on compactness (how close are the points to each other within a cluster?) and separability (how much distance is there between the clusters?).
For DBSCAN, there is a great internal validation index called DBCV by Moulavi et al. The paper is available here: https://epubs.siam.org/doi/pdf/10.1137/1.9781611973440.96
Python package: https://github.com/christopherjenness/DBCV
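As a concrete illustration of the external-index idea, here is a minimal sketch of the plain (unadjusted) Rand index linked above, computed directly from pair counts; the method name and the int[] label encoding are assumptions, and DBSCAN noise points would need to be handled separately (e.g. treated as singleton clusters).

```java
// Sketch: plain Rand index between a clustering and reference labels of the same n points.
// RI = (a + b) / C(n, 2), where a = pairs grouped together in both partitions
// and b = pairs separated in both partitions.
static double randIndex(int[] clustering, int[] reference) {
    int n = clustering.length;
    long agreeing = 0, pairs = 0;
    for (int i = 0; i < n; i++) {
        for (int j = i + 1; j < n; j++) {
            boolean sameInClustering = clustering[i] == clustering[j];
            boolean sameInReference  = reference[i] == reference[j];
            if (sameInClustering == sameInReference) agreeing++; // covers both a and b
            pairs++;
        }
    }
    return (double) agreeing / pairs; // 1.0 = perfect agreement
}
```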

ELKI - Clustering Statistics

When a data set is analyzed by a clustering algorithm in ELKI 0.5, the program produces a number of statistics: the Jaccard index, F1-Measures, etc. In order to calculate these statistics, there have to be 2 clusterings to compare. What is the clustering created by the algorithm compared to?
The automatic evaluation (note that you can configure the evaluation manually!) is based on labels in your data set. At least in the current version (why are you using 0.5 and not 0.6.0?) it should only automatically evaluate if it finds labels in the data set.
We have currently not published internal measures. There are some implementations, such as evaluation/clustering/internal/EvaluateSilhouette.java, some of which will be in the next release.
In my experiments, internal evaluation measures were badly misleading. For example, with the silhouette coefficient, the labeled "solution" would often even score a negative silhouette coefficient (i.e. worse than not clustering at all).
Also, these measures are not scalable. The silhouette coefficient is O(n^2) to compute, which usually makes this evaluation more expensive than the actual clustering!
We do appreciate contributions!
You are more than welcome to contribute your favorite evaluation measure to ELKI, to share with others.

How do I choose k when using k-means clustering with Silhouette function?

I've been studying k-means clustering, and one big thing which is not clear is what the silhouette function really tells me.
I know it indicates what the appropriate k should be, but I can't understand what the value of the silhouette function really means.
I read somewhere that if the mean silhouette is less than 0.5, your clustering is not valid.
Thanks for your answers in advance.
From the definition of the silhouette:
Silhouette Value
The silhouette value for each point is a measure of how similar that point is to points in its own cluster compared to points in other clusters, and ranges from -1 to +1.
The silhouette value for the ith point, S_i, is defined as
S_i = (b_i - a_i) / max(a_i, b_i)
where a_i is the average distance from the ith point to the other points in the same cluster as i, and b_i is the minimum average distance from the ith point to points in a different cluster, minimized over clusters.
This method just compares the intra-group similarity to the closest group's similarity. If a data member's average distance to the other members of its own cluster is higher than its average distance to the members of some other cluster, then this value is negative and the clustering is not successful. On the other hand, silhouette values close to 1 indicate a successful clustering operation. 0.5 is not an exact cut-off for a valid clustering.
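A minimal sketch of the per-point silhouette value from the definition quoted above, given a precomputed distance matrix and flat cluster labels; the method name is an assumption, and singleton clusters are given the conventional value 0.

```java
// Sketch: silhouette value S_i = (b_i - a_i) / max(a_i, b_i) per point,
// from a precomputed distance matrix and cluster labels in [0, k).
static double[] silhouette(double[][] dist, int[] labels, int k) {
    int n = labels.length;
    double[] s = new double[n];
    for (int i = 0; i < n; i++) {
        double[] sum = new double[k];
        int[] count = new int[k];
        for (int j = 0; j < n; j++) {
            if (j == i) continue;
            sum[labels[j]] += dist[i][j];
            count[labels[j]]++;
        }
        if (count[labels[i]] == 0) { s[i] = 0.0; continue; } // singleton cluster, by convention
        double a = sum[labels[i]] / count[labels[i]];          // mean distance within own cluster
        double b = Double.POSITIVE_INFINITY;                   // mean distance to nearest other cluster
        for (int c = 0; c < k; c++) {
            if (c == labels[i] || count[c] == 0) continue;
            b = Math.min(b, sum[c] / count[c]);
        }
        s[i] = (b - a) / Math.max(a, b); // in [-1, +1]; average over i for the overall score
    }
    return s;
}
```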
#fatihk gave a good citation; additionally, you may think of the silhouette value as a degree of how much clusters overlap with each other, i.e. -1: they overlap perfectly, +1: the clusters are perfectly separable.
BUT low silhouette values for a particular algorithm do NOT mean that there are no clusters; rather, they mean that the algorithm used cannot separate the clusters, and you may consider tuning your algorithm or using a different one (think about k-means for concentric circles vs. DBSCAN).
There is an explicit formula associated with the elbow method to automatically determine the number of clusters. It tells you about the strength of the elbow(s) detected when using the elbow method to determine the number of clusters, see here. See the illustration here:
[Illustration: Enhanced Elbow rule]

Clustering words into groups

This is a homework question. I have a huge document full of words. My challenge is to classify these words into different groups/clusters that adequately represent the words. My strategy is to use the K-Means algorithm, which, as you know, takes the following steps:
1. Generate k random means for the entire group.
2. Create k clusters by associating each word with the nearest mean.
3. Compute the centroid of each cluster, which becomes the new mean.
4. Repeat Step 2 and Step 3 until a certain benchmark/convergence has been reached.
Theoretically I kind of get it, but not quite. At each step I have questions:
How do I decide on the k random means? Technically I could say 5, but that may not necessarily be a good number. So is this k purely a random number, or is it actually driven by heuristics such as the size of the dataset, the number of words involved, etc.?
How do you associate each word with the nearest mean? Theoretically I can conclude that each word is associated by its distance to the nearest mean, so if there are 3 means, a word belongs to the cluster whose mean it has the shortest distance to. However, how is this actually computed? Between two words "group" and "textword", and assuming a mean word "pencil", how do I create a similarity matrix?
How do you calculate the centroid?
When you repeat step 2 and step 3, are you treating each previous cluster as a new data set?
Lots of questions, and I am obviously not clear. If there are any resources I can read, that would be great. Wikipedia did not suffice :(
As you don't know the exact number of clusters, I'd suggest you use a kind of hierarchical clustering:
Imagine that all your words are just points in a non-Euclidean space. Use the Levenshtein distance to calculate the distance between words (it works great if you want to detect clusters of lexicographically similar words).
Build a minimum spanning tree which contains all of your words.
Remove links whose length is greater than some threshold.
The groups of words that remain linked are your clusters of similar words (see the sketch after this answer).
P.S. You can find many papers on the web describing clustering based on building a minimal spanning tree.
P.P.S. If you want to detect clusters of semantically similar words, you need algorithms for automatic thesaurus construction.
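A small sketch of the approach above, with two simplifications: the word list and the threshold of 2 are made up for illustration, and instead of building the minimum spanning tree explicitly it links every pair of words whose Levenshtein distance is at most the threshold and returns the connected components, which for a single cut-off yields the same groups as removing the over-threshold MST edges.

```java
import java.util.*;

// Sketch: Levenshtein distance + distance threshold -> clusters of similar words.
public class WordClusters {

    static int levenshtein(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++)
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
            }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        String[] words = {"cat", "cart", "card", "dog", "dig", "pencil"}; // assumed sample input
        int threshold = 2;                                                // assumed cut-off

        // Union-find: link words whose edit distance is within the threshold.
        int[] parent = new int[words.length];
        for (int i = 0; i < parent.length; i++) parent[i] = i;
        for (int i = 0; i < words.length; i++)
            for (int j = i + 1; j < words.length; j++)
                if (levenshtein(words[i], words[j]) <= threshold)
                    union(parent, i, j);

        // Collect connected components = clusters of lexicographically similar words.
        Map<Integer, List<String>> clusters = new HashMap<>();
        for (int i = 0; i < words.length; i++)
            clusters.computeIfAbsent(find(parent, i), c -> new ArrayList<>()).add(words[i]);
        System.out.println(clusters.values()); // e.g. [[cat, cart, card], [dog, dig], [pencil]]
    }

    static int find(int[] p, int x) { return p[x] == x ? x : (p[x] = find(p, p[x])); }
    static void union(int[] p, int x, int y) { p[find(p, x)] = find(p, y); }
}
```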
That you have to choose "k" for k-means is one of the biggest drawbacks of k-means.
However, if you use the search function here, you will find a number of questions that deal with the known heuristic approaches to choosing k, mostly by comparing the results of running the algorithm multiple times.
As for "nearest": k-means actually does not use distances. Some people believe it uses Euclidean distance, others say it is squared Euclidean. Technically, what k-means is interested in is the variance. It minimizes the overall variance by assigning each object to the cluster such that the variance is minimized. Coincidentally, the sum of squared deviations - one object's contribution to the total variance - over all dimensions is exactly the definition of squared Euclidean distance. And since the square root is monotone, you can also use Euclidean distance instead.
Anyway, if you want to use k-means with words, you first need to represent the words as vectors where the squared Euclidean distance is meaningful. I don't think this will be easy; it may not even be possible.
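To make steps 1-4 from the question concrete, here is a minimal sketch of Lloyd's algorithm on plain numeric vectors; how to map words to such vectors is exactly the open issue raised above, and the class and method names are assumptions.

```java
import java.util.Random;

// Sketch of Lloyd's algorithm (the steps from the question) on numeric vectors.
public class KMeansSketch {

    static int[] kmeans(double[][] points, int k, int maxIter, Random rnd) {
        int n = points.length, dim = points[0].length;
        // Step 1: pick k random points as the initial means (duplicates possible; fine for a sketch).
        double[][] means = new double[k][];
        for (int c = 0; c < k; c++) means[c] = points[rnd.nextInt(n)].clone();

        int[] assignment = new int[n];
        for (int iter = 0; iter < maxIter; iter++) {
            // Step 2: assign each point to the nearest mean (squared Euclidean distance).
            boolean changed = false;
            for (int i = 0; i < n; i++) {
                int best = 0;
                double bestDist = Double.POSITIVE_INFINITY;
                for (int c = 0; c < k; c++) {
                    double d = 0;
                    for (int j = 0; j < dim; j++) {
                        double diff = points[i][j] - means[c][j];
                        d += diff * diff;
                    }
                    if (d < bestDist) { bestDist = d; best = c; }
                }
                if (assignment[i] != best) { assignment[i] = best; changed = true; }
            }
            if (!changed) break; // Step 4: converged, no assignment changed

            // Step 3: recompute each mean as the centroid (average) of its cluster.
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (int i = 0; i < n; i++) {
                counts[assignment[i]]++;
                for (int j = 0; j < dim; j++) sums[assignment[i]][j] += points[i][j];
            }
            for (int c = 0; c < k; c++)
                if (counts[c] > 0)
                    for (int j = 0; j < dim; j++) means[c][j] = sums[c][j] / counts[c];
        }
        return assignment;
    }
}
```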
About the distance: in fact, the Levenshtein (or edit) distance satisfies the triangle inequality. It also satisfies the rest of the properties necessary to be a metric (not all distance functions are metric functions). Therefore you can implement a clustering algorithm using this metric function, and this is the function you could use to compute your similarity matrix S:
S_{i,j} = d(x_i, x_j) = d(x_j, x_i) = S_{j,i}
It's worth mentioning that the Damerau-Levenshtein distance doesn't satisfy the triangle inequality, so be careful with this.
About the k-means algorithm: yes, in the basic version you must define the K parameter by hand. The rest of the algorithm is the same for any given metric.
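To make that warning concrete, here is a small sketch assuming the commonly implemented restricted variant of Damerau-Levenshtein (optimal string alignment); it exhibits a triple of strings for which the triangle inequality fails.

```java
// Sketch: the restricted Damerau-Levenshtein distance (optimal string alignment)
// can violate the triangle inequality, e.g. d("CA","ABC") > d("CA","AC") + d("AC","ABC").
public class OsaTriangleCheck {

    static int osa(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++)
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
                if (i > 1 && j > 1
                        && a.charAt(i - 1) == b.charAt(j - 2)
                        && a.charAt(i - 2) == b.charAt(j - 1)) {
                    d[i][j] = Math.min(d[i][j], d[i - 2][j - 2] + 1); // transposition
                }
            }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        System.out.println(osa("CA", "ABC"));                   // 3
        System.out.println(osa("CA", "AC") + osa("AC", "ABC")); // 1 + 1 = 2, so the triangle inequality fails
    }
}
```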