Applying vector based clustering algorithms to social network context - cluster-analysis

I have a social network described as edges in a file. I used graph-based clustering algorithms to find dense parts of the graph. However, I also need to apply vector-based clustering to this data, and I cannot find any guidance on how to do that in this context. I also have information about each node in the form of features. I think that using vectors containing the features of each user makes no sense here. For example, k-means would calculate the distance between user u1 with feature vector v1 = [f1,f2,f3,...] and user u2 with feature vector v2 = [f1,f2,f3,...]. However, both vectors would hold only binary values, depending on which features the user has. Additionally, I have a matrix with the users on one axis and the features on the other, where the user is able to set permissions.
My question is now how I can make use of k-means, DBSCAN, etc. in the context of this topic.
Best wishes.

Many algorithms can be modified to work with distances for binary features. For example, k-means has a variant for binary and categorical data: k-modes.
But I don't think it will do anything useful on your data.
Your approach to this problem is bad: don't first decide on the algorithm and then try to make it run. You are then bound to solve the wrong problem. Instead, formalize the problem first: state mathematically what a good clustering would be. Then identify the appropriate algorithm by its mathematical ability to find a good solution to this objective.
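To make the remark about distances for binary features concrete, here is a minimal sketch in plain Python. The function names are hypothetical, and the choice of Jaccard distance and of a column-wise majority vote as the "mode" are assumptions for illustration, not something the answer prescribes. Whether either is meaningful depends on the objective you formalize first.

```python
def jaccard_distance(a, b):
    """Jaccard distance between two equal-length 0/1 feature vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    # Assumption: two users with no features at all count as identical.
    return 0.0 if either == 0 else 1.0 - both / either


def hamming_distance(a, b):
    """Number of positions in which two 0/1 feature vectors differ."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(1 for x, y in zip(a, b) if x != y)


def mode_vector(vectors):
    """Column-wise majority value, the k-modes analogue of a k-means centroid."""
    n = len(vectors)
    # Ties are resolved towards 1 (an arbitrary choice for this sketch).
    return [1 if 2 * sum(column) >= n else 0 for column in zip(*vectors)]


v1 = [1, 1, 0, 0]
v2 = [1, 0, 1, 0]
print(jaccard_distance(v1, v2), hamming_distance(v1, v2))
```

Jaccard ignores features that both users lack, which often matters for sparse binary data; Hamming treats shared absences as agreement.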

Related

Density Based Clustering with Representatives

I'm looking for a method to perform density based clustering. The resulting clusters should have a representative unlike DBSCAN.
Mean-Shift seems to fit those needs but doesn't scale well enough for me. I have looked into some subspace clustering algorithms and only found CLIQUE using representatives, but this part is not implemented in ELKI.
As I noted in the comments on the previous iteration of your question,
https://stackoverflow.com/questions/34720959/dbscan-java-library-with-corepoints
Density-based clustering does not assume there is a center or representative.
Consider the following example image from Wikipedia user Chire (CC BY-SA 3.0):
Which object should be the representative of the red cluster?
Density-based clustering is about finding "arbitrarily shaped" clusters. These do not have a meaningful single representative object. They are not meant to "compress" your data - this is not a vector quantization method, but structure discovery. But it is the nature of such complex structure that it cannot be reduced to a single representative. The proper representation of such a cluster is the set of all points in the cluster. For geometric understanding in 2D, you can also compute convex hulls, for example, to get an area as in that picture.
Choosing representative objects is a different task. This is not needed for discovering this kind of structure, and thus these algorithms do not compute representative objects - it would waste CPU.
You could choose the object with the highest density as representative of the cluster.
It is a fairly easy modification to DBSCAN to store the neighbor count of every object.
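As a rough sketch of that idea, the following plain-Python code recomputes the neighbor counts after clustering instead of modifying DBSCAN itself (a simplification, and quadratic in the number of points). The function names, and the convention that noise is labeled -1, are assumptions for illustration.

```python
import math


def neighbor_counts(points, eps):
    """For every point, count the points within distance eps (including itself)."""
    return [sum(1 for q in points if math.dist(p, q) <= eps) for p in points]


def densest_representatives(points, labels, eps, noise_label=-1):
    """Map each cluster label to the index of its point with the most eps-neighbors."""
    counts = neighbor_counts(points, eps)
    best = {}
    for i, label in enumerate(labels):
        if label == noise_label:
            continue  # noise points do not belong to any cluster
        if label not in best or counts[i] > counts[best[label]]:
            best[label] = i  # ties keep the first point encountered
    return best


points = [(0, 0), (2, 0), (1, 0), (1, 1),
          (10, 10), (10, 10.5), (10, 11.2), (50, 50)]
labels = [0, 0, 0, 0, 1, 1, 1, -1]  # labels as produced by some DBSCAN run
print(densest_representatives(points, labels, eps=1.0))
```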
But as Anony-Mousse mentioned, the object may nevertheless be a rather bad choice. Density-based clustering is not designed to yield representative objects.
You could try AffinityPropagation, but it will also not scale very well.

Best way to validate DBSCAN Clusters

I have used the ELKI implementation of DBSCAN to identify fire hot spot clusters from a fire data set and the results look quite good. The data set is spatial and the clusters are based on latitude, longitude. Basically, the DBSCAN parameters identify hot spot regions where there is a high concentration of fire points (defined by density). These are the fire hot spot regions.
My question is, after experimenting with several different parameters and finding a pair that gives a reasonable clustering result, how does one validate the clusters?
Is there a suitable formal validation method for my use case? Or is this subjective depending on the application domain?
ELKI contains a number of evaluation functions for clusterings.
Use the -evaluator parameter to enable them, from the evaluation.clustering.internal package.
Some of them will not automatically run because they have quadratic runtime cost - probably more than your clustering algorithm.
I do not trust these measures. They are designed for particular clustering algorithms, and are mostly useful for deciding the k parameter of k-means, not much more than that. If you blindly go by these measures, you end up with useless results most of the time. Also, these measures do not work with noise, with either of the strategies we tried.
The cheapest are the label-based evaluators. These will automatically run, but apparently your data does not have labels (or they are numeric, in which case you need to set the -parser.labelindex parameter accordingly). Personally, I prefer the Adjusted Rand Index to compare the similarity of two clusterings. All of these indexes are sensitive to noise so they don't work too well with DBSCAN, unless your reference has the same concept of noise as DBSCAN.
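For readers who want to see what the Adjusted Rand Index computes, here is a minimal plain-Python sketch using the standard contingency-table formula. It is not ELKI's implementation. It treats every label, including a noise label such as -1, as an ordinary cluster, which is an assumption and exactly the kind of choice that makes these indexes sensitive to noise.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index between two labelings of the same objects."""
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must cover the same objects")
    n = len(labels_a)
    if n < 2:
        return 1.0  # no pairs to disagree on
    # Pairs of objects that are together in both clusterings.
    together = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    pairs_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    pairs_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = pairs_a * pairs_b / comb(n, 2)
    maximum = (pairs_a + pairs_b) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both labelings are a single cluster
    return (together - expected) / (maximum - expected)


print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))
```

The first call compares two labelings that differ only in the names of the clusters; the second compares two labelings that split every pair differently.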
If you can afford it, a "subjective" evaluation is always best.
You want to solve a problem, not a number. That is the whole point of "data science", being problem oriented and solving the problem, not obsessed with minimizing some random quality number. If the results don't work in reality, you failed.
There are different methods to validate a DBSCAN clustering output. Generally, we can distinguish between internal and external indices, depending on whether you have labeled data available or not. For DBSCAN there is a great internal validation index called DBCV.
External Indices:
If you have some labeled data, external indices are great and can demonstrate how well the clustering did vs. the labeled data. One example is the Rand index: https://en.wikipedia.org/wiki/Rand_index
Internal Indices:
If you don't have labeled data, then internal indices can be used to give the clustering result a score. In general, the indices calculate the distance of points within the cluster and to other clusters, and try to give you a score based on the compactness (how close are the points to each other in a cluster?) and separability (how much distance is between the clusters?).
For DBSCAN, there is one great internal validation index called DBCV, by Moulavi et al. The paper is available here: https://epubs.siam.org/doi/pdf/10.1137/1.9781611973440.96
Python package: https://github.com/christopherjenness/DBCV

Python Clustering Algorithms

I've been looking around scipy and sklearn for clustering algorithms for a particular problem I have. I need some way of characterizing a population of N particles into k groups, where k is not necessarily known, and in addition to this, no a priori linking lengths are known (similar to this question).
I've tried kmeans, which works well if you know how many clusters you want. I've tried dbscan, which does poorly unless you tell it a characteristic length scale on which to stop looking (or start looking) for clusters. The problem is, I have potentially thousands of these clusters of particles, and I cannot spend the time to tell kmeans/dbscan algorithms what they should go off of.
Here is an example of what dbscan finds:
You can see that there really are two separate populations here, but however I adjust the epsilon factor (the max. distance between neighboring clusters parameter), I simply cannot get it to see those two populations of particles.
Are there any other algorithms which would work here? I'm looking for minimal information upfront - in other words, I'd like the algorithm to be able to make "smart" decisions about what could constitute a separate cluster.
I've found one that requires NO a priori information/guesses and does very well for what I'm asking it to do. It's called Mean Shift and is located in SciKit-Learn. It's also relatively quick (compared to other algorithms like Affinity Propagation).
Here's an example of what it gives:
I also want to point out that the documentation states that it may not scale well.
When using DBSCAN, it can be helpful to scale/normalize the data or distances beforehand, so that the estimation of epsilon will be relative.
There is an implementation of DBSCAN (I think it's the one Anony-Mousse somewhere described as 'floating around') which comes with an epsilon estimator function. It works, as long as it's not fed with large datasets.
There are several incomplete versions of OPTICS on GitHub. Maybe you can find one to adapt for your purpose. I am still trying to figure out myself what effect minPts has when using one and the same extraction method.
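The scaling and epsilon estimation mentioned in this answer could be sketched as follows in plain Python. This uses the common sorted k-distance heuristic, which is an assumption here: it is not necessarily what the estimator in the implementation referred to above does. Function names are hypothetical, and the code is quadratic, so it suits small datasets only.

```python
import math


def min_max_scale(points):
    """Rescale every dimension to [0, 1] so that epsilon is relative to the data range."""
    lows = [min(column) for column in zip(*points)]
    highs = [max(column) for column in zip(*points)]
    return [
        tuple((x - lo) / (hi - lo) if hi > lo else 0.0
              for x, lo, hi in zip(p, lows, highs))
        for p in points
    ]


def k_distances(points, k):
    """Distance from each point to its k-th nearest neighbor, sorted descending."""
    if not 1 <= k < len(points):
        raise ValueError("k must be between 1 and len(points) - 1")
    result = []
    for i, p in enumerate(points):
        distances = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(distances[k - 1])
    return sorted(result, reverse=True)


points = [(0, 0), (0, 1), (0, 2), (0, 10)]
print(k_distances(points, k=1))
```

Plotting the returned list and looking for a "knee" gives a candidate epsilon for minPts close to k; the outlier at (0, 10) shows up as the single large value at the front.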
You can try a minimum spanning tree (Zahn's algorithm) and then remove the longest edge, similar to alpha shapes. I used it with a Delaunay triangulation and a concave hull: http://www.phpdevpad.de/geofence. You can also try hierarchical clustering, for example with clusterfck.
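A minimal sketch of the minimum-spanning-tree idea, in plain Python: build the tree on the complete Euclidean graph with Prim's algorithm, drop the longest edges, and read the clusters off the remaining connected components. The function names and the `n_cuts` parameter are hypothetical, and this simple version does not use a Delaunay triangulation, so it is quadratic in the number of points.

```python
import math


def minimum_spanning_tree(points):
    """Prim's algorithm; returns the tree as a list of (length, i, j) edges."""
    n = len(points)
    in_tree = [False] * n
    best_dist = [math.inf] * n
    best_from = [-1] * n
    if n:
        best_dist[0] = 0.0
    edges = []
    for _ in range(n):
        # Pick the cheapest point not yet in the tree.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best_dist[i])
        in_tree[u] = True
        if best_from[u] >= 0:
            edges.append((best_dist[u], best_from[u], u))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best_dist[v]:
                    best_dist[v], best_from[v] = d, u
    return edges


def mst_clusters(points, n_cuts=1):
    """Remove the n_cuts longest tree edges and label the connected components."""
    edges = sorted(minimum_spanning_tree(points))
    kept = edges[:max(0, len(edges) - n_cuts)]
    parent = list(range(len(points)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for _, i, j in kept:
        parent[find(i)] = find(j)
    return [find(i) for i in range(len(points))]


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11)]
print(mst_clusters(points, n_cuts=1))
```

The labels are arbitrary component identifiers; only equality between them is meaningful. How many edges to cut, or which length threshold to use, remains a choice you have to make.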
Your plot indicates that you chose the minPts parameter way too small.
Have a look at OPTICS, which no longer needs the epsilon parameter of DBSCAN.

Matlab: K-means clustering with predefined populations

I am trying to differentiate two populations. Each population is an NxM matrix in which N is fixed between the two and M is variable in length (N=column specific attributes of each run, M=run number). I have looked at PCA and K-means for differentiating the two, but I am curious about the best practice.
To my knowledge, in K-means, there is no initial 'calibration' in which the clusters are chosen such that known bimodal populations can be differentiated. It simply minimizes the distance and assigns the data to an arbitrary number of populations. I would like to tell the clustering algorithm that I want the best fit in which the two populations are separated. I can then use the fit I get from the initial clustering on future datasets. Any help, example code, or reading material would be appreciated.
-R
K-means and PCA are typically used in unsupervised learning problems, i.e. problems where you have a single batch of data and want to find some easier way to describe it. In principle, you could run K-means (with K=2) on your data, and then evaluate the degree to which your two classes of data match up with the data clusters found by this algorithm (note: you may want multiple starts).
It sounds like you have a supervised learning problem: you have a training data set which has already been partitioned into two classes. In this case, k-nearest neighbors (as mentioned by #amas) is probably the approach most like k-means; however, Support Vector Machines can also be an attractive approach.
I frequently refer to The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition (Springer Series in Statistics) by Trevor Hastie (Author), Robert Tibshirani (Author), Jerome Friedman (Author).
It really depends on the data. But just so you know, K-means does get stuck in local minima, so if you want to use it, try running it from different random starting points. PCA might also be useful; however, as with any other spectral clustering method, you have much less control over the clustering procedure. I recommend that you cluster the data using k-means with multiple random starting points and see how it works; then you can predict the class of each new sample with K-NN (I don't know if that is useful for your case).
Check Lazy learners and K-NN for prediction.
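The question is about Matlab, but the k-nearest-neighbors prediction suggested in the answers above is easy to show language-independently. Here is a minimal plain-Python sketch, assuming each run has already been reduced to a fixed-length numeric vector with a known population label; the function name and the toy data are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of query by majority vote among its k nearest training points."""
    order = sorted(range(len(train_points)),
                   key=lambda i: math.dist(train_points[i], query))
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Two known populations, used as labeled training data.
train_points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_labels = ["A", "A", "A", "B", "B", "B"]

print(knn_predict(train_points, train_labels, (0.5, 0.5)))
print(knn_predict(train_points, train_labels, (5.5, 5.5)))
```

Because K-NN is a lazy learner, there is no fitting step: the labeled populations themselves are the model that is applied to future datasets.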

Clustering: a training dataset of variable data dimensions

I have a dataset of n data points, where each data point is represented by a set of extracted features. Generally, clustering algorithms require all input data to have the same dimensions (the same number of features); that is, the input data X is an n*d matrix of n data points, each of which has d features.
In my case, I've previously extracted some features from my data, but the number of extracted features is most likely different for each data point (I mean, I have a dataset X whose data points do not all have the same number of features).
Is there any way to adapt them so that I can cluster them using common clustering algorithms that require data of the same dimensions?
Thanks
Sounds like the problem you have is that it's a 'sparse' data set. There are generally two options.
Reduce the dimensionality of the input data set using multi-dimensional scaling techniques, for example sparse SVD (e.g. the Lanczos algorithm) or sparse PCA. Then apply traditional clustering on the dense lower-dimensional outputs.
Directly apply a sparse clustering algorithm, such as sparse k-means. Note you can probably find a PDF of this paper if you look hard enough online (try scholar.google.com).
[Updated after problem clarification]
In the problem, a handwritten word is analyzed visually for connected components (lines). For each component, a fixed number of multi-dimensional features is extracted. We need to cluster the words, each of which may have one or more connected components.
Suggested solution:
Classify the connected components first, into 1000(*) unique component classifications. Then classify the words against the classified components they contain (a sparse problem described above).
*Note, the exact number of component classifications you choose doesn't really matter as long as it's high enough as the MDS analysis will reduce them to the essential 'orthogonal' classifications.
There are also clustering algorithms such as DBSCAN that in fact do not care about your data. All this algorithm needs is a distance function. So if you can specify a distance function for your features, then you can use DBSCAN (or OPTICS, which is an extension of DBSCAN, that doesn't need the epsilon parameter).
So the key question here is how you want to compare your features. This doesn't have much to do with clustering, and is highly domain dependent. If your features are e.g. word occurrences, cosine distance is a good choice (using 0s for non-present features). But if you e.g. have a set of SIFT keypoints extracted from a picture, there is no obvious way to relate the different features with each other efficiently, as the features have no order that would let you compare the first keypoint with the first keypoint, and so on. A possible approach here is to derive another - uniform - set of features. Typically, bag of words features are used for such a situation. For images, this is also known as visual words. Essentially, you first cluster the sub-features to obtain a limited vocabulary. Then you can assign each of the original objects a "text" composed of these "words" and use a distance function such as cosine distance on them.
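The last step of that bag-of-words approach can be sketched in plain Python as follows. The sketch assumes the vocabulary step has already been done, so every object is given as a list of word ids of arbitrary length; the function names, and the convention that an empty object has distance 1 to everything, are assumptions for illustration.

```python
import math
from collections import Counter


def bag_of_words(word_ids):
    """Turn a variable-length list of word ids into a sparse count vector."""
    return Counter(word_ids)


def cosine_distance(a, b):
    """Cosine distance between two sparse count vectors (missing words count as 0)."""
    dot = sum(count * b.get(word, 0) for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # assumption: an empty object is maximally distant
    return 1.0 - dot / (norm_a * norm_b)


# Objects with different numbers of sub-features become comparable.
x = bag_of_words([3, 3, 7])
y = bag_of_words([3, 7, 7, 9, 9, 9])
z = bag_of_words([1, 2])
print(cosine_distance(x, y), cosine_distance(x, z))
```

A function like `cosine_distance` is all that a distance-based algorithm such as DBSCAN or OPTICS needs; the objects never have to be padded to a common length.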
I see two options here:
Restrict yourself to those features for which all your data-points have a value.
See if you can generate sensible default values for missing features.
However, if possible, you should probably resample all your data-points, so that they all have values for all features.
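The two options above can be sketched in plain Python, assuming each data point is stored as a dictionary from feature name to value. That representation, the function names, and the default of 0.0 are all assumptions for illustration; whether a default value is sensible depends entirely on what the feature means.

```python
def common_features(records):
    """Names of the features for which every data point has a value."""
    if not records:
        return []
    keys = set(records[0])
    for record in records[1:]:
        keys &= set(record)
    return sorted(keys)


def all_features(records):
    """Names of all features that occur in at least one data point."""
    keys = set()
    for record in records:
        keys |= set(record)
    return sorted(keys)


def to_matrix(records, features, default=0.0):
    """Build a fixed-width row per data point, filling missing features with default."""
    return [[record.get(name, default) for name in features] for record in records]


records = [{"a": 1.0, "b": 2.0}, {"a": 3.0, "c": 4.0}]
print(to_matrix(records, common_features(records)))  # option 1: shared features only
print(to_matrix(records, all_features(records)))     # option 2: defaults for the rest
```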