Clustering a sparse dataset of binary vectors - cluster-analysis

If I have a sparse dataset where each data point is described by a vector of 1000 elements, each of which is either 0 or 1 (mostly 0s and a few 1s), do you know of any distance function that could help me cluster them? Is something like Euclidean distance suitable in this case? I would like to know if there is a simple, convenient distance metric for such a situation that I can try on my data.
Thanks

Your question doesn't have one answer; there are best practices depending on the domain.
Once you decide on the similarity metric, the clustering is usually done by averaging or by finding a medoid. See these papers on clustering binary data for algorithm examples:
Carlos Ordonez, Clustering Binary Data Streams with K-means
Tao Li, A General Model for Clustering Binary Data
For ideas on similarity measures, see this online "tool for measuring similarity between binary strings". It mentions: Sokal-Michener, Jaccard, Russell-Rao, Hamann, Sorensen, anti-Dice, Sneath-Sokal, Rogers-Tanimoto, Ochiai, Yule, Anderberg, Kulczynski, Pearson's Phi, Gower2, Dot Product, Cosine Coefficient, and Hamming Distance.
They also cite these papers:
Luke, B. T., Clustering Binary Objects
Lin, D., An Information-Theoretic Definition of Similarity.
du Toit, S.H.C.; Steyn, A.G.W.; Stumpf, R.H.; Graphical Exploratory Data Analysis; Chapter 3, p. 77; Springer-Verlag, 1986.
(I personally like the cosine. There is also the KL divergence, and its symmetric Jensen-Shannon counterpart.)

Have a look at distance functions used for sparse text vectors, such as the cosine distance, and at those used for comparing sets, such as the Jaccard distance.
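If you want something quick to try, here is a minimal sketch (my own illustration, using NumPy/SciPy with synthetic data standing in for yours) that computes Jaccard and cosine distances on sparse binary vectors and clusters on the resulting distance matrix:

import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

# Synthetic stand-in data: 200 binary vectors of 1000 elements, mostly 0s.
rng = np.random.default_rng(0)
X = (rng.random((200, 1000)) < 0.01).astype(int)

# Pairwise distances in condensed form.
D_jaccard = pdist(X, metric='jaccard')   # fraction of disagreeing positions among the 1s
D_cosine = pdist(X, metric='cosine')     # 1 - cosine similarity

# One way to cluster directly on a distance matrix: hierarchical linkage.
Z = linkage(D_jaccard, method='average')
labels = fcluster(Z, t=10, criterion='maxclust')   # cut the tree into 10 clusters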

Many distance / similarity functions for binary vectors have been proposed.
In A Survey of Binary Similarity and Distance Measures (Choi, Cha, Tappert, 2010), the authors list 76 such functions.

If it really is lots of 0s and a few 1s, you could try clustering on the position of the first or last 1 - see http://aggregate.org/MAGIC/#Least Significant 1 Bit
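In case it helps, here is a hypothetical NumPy sketch of that idea (grouping vectors simply by where their first or last 1 occurs, on synthetic data):

import numpy as np

# Group vectors by the index of their first (or last) set bit instead of
# running a full clustering. Rows with no 1 at all also map to index 0 here
# and would need special handling.
X = (np.random.default_rng(0).random((200, 1000)) < 0.01).astype(int)
first_one = X.argmax(axis=1)                            # index of the first 1 per row
last_one = X.shape[1] - 1 - X[:, ::-1].argmax(axis=1)   # index of the last 1 per row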

Related

How to efficiently calculate/estimate cosine similarity for billions of pairs in a non-sparse matrix?

Consider I have 10 million items, each identified by a 100-dimensional vector of real numbers (they are actually word2vec embeddings). For each item I want to get (approximately) the top 200 most similar items, using cosine similarity. My current standard cosine similarity implementation, as a UDF in Hadoop (Hive), takes about 1s to compare one item against the 10 million others. This makes it infeasible to run for the whole matrix. My next move is to run it on Spark, with more parallelization, but that still won't solve the problem completely.
I know there are some methods to reduce the calculation for a sparse matrix. But my matrix is NOT sparse.
How can I efficiently get the most similar items for each item?
Is there an approximation of cosine similarity that will be more efficient to calculate?
You can compress the vectors to make the score calculation simpler, and then compare the compressed codes with a different distance, such as the Hamming distance.
The relevant keyword is vector quantization, and there are many algorithms for vector compression.
Here is an example of making the compressed representation comparable to cosine similarity:
https://github.com/tdebatty/java-LSH/blob/master/src/main/java/info/debatty/java/lsh/SuperBit.java#L208
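The linked SuperBit class is Java; as a rough sketch of the underlying idea (sign random projections, where the Hamming distance between short bit signatures estimates the cosine), here is an illustrative NumPy version with made-up sizes:

import numpy as np

# Each vector is reduced to n_bits sign bits; the fraction of differing bits
# between two signatures estimates the angle, hence the cosine similarity.
rng = np.random.default_rng(0)
dim, n_bits = 100, 256
hyperplanes = rng.standard_normal((n_bits, dim))   # random projection directions

def signature(v):
    # One bit per hyperplane: which side of the hyperplane v falls on.
    return (hyperplanes @ v) >= 0

def approx_cosine(sig_a, sig_b):
    # Estimate cos(theta) from the fraction of disagreeing bits.
    hamming_fraction = np.count_nonzero(sig_a != sig_b) / n_bits
    return np.cos(np.pi * hamming_fraction)

# Compare the estimate with the exact cosine similarity on two random vectors.
a, b = rng.standard_normal(dim), rng.standard_normal(dim)
exact = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
print(exact, approx_cosine(signature(a), signature(b)))

Comparing short bit signatures can be much cheaper than full dot products, especially when the bits are packed into machine words, and the signatures can also be indexed with LSH so that only candidate pairs are compared at all.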

Clustering algorithm with different epsilons on different axes

I am looking for a clustering algorithm such as DBSCAN to deal with 3D data, in which it is possible to set different epsilons depending on the axis. So, for instance, an epsilon of 10 m on the x-y plane and an epsilon of 0.2 m on the z axis.
Essentially, I am looking for large but flat clusters.
Note: I am an archaeologist; the algorithm will be used to look for potential correlations between objects scattered over large surfaces but within narrow vertical layers.
Solution 1:
Scale your data set to match your desired epsilon.
In your case, scale z by 50.
Solution 2:
Use a weighted distance function.
E.g. WeightedEuclideanDistanceFunction in ELKI, and choose your weights accordingly, e.g. -distance.weights 1,1,50 will put 50x as much weight on the third axis.
This may be the most convenient option, since you are already using ELKI.
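If you are not tied to ELKI, the scaling idea from Solution 1 looks roughly like this in scikit-learn (the coordinates are made up; only the relative stretch of the z axis matters):

import numpy as np
from sklearn.cluster import DBSCAN

# Stretching z by eps_xy/eps_z makes one Euclidean epsilon behave like
# 10 m in the x-y plane and 0.2 m along z.
eps_xy, eps_z = 10.0, 0.2
points = np.random.default_rng(0).random((500, 3)) * [1000, 1000, 5]   # fake x, y, z in meters
scaled = points * [1.0, 1.0, eps_xy / eps_z]                           # z weighted by 50
labels = DBSCAN(eps=eps_xy, min_samples=5).fit_predict(scaled)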
Just define a custom distance metric when computing the DBSCAN core points. The standard DBSCAN uses the Euclidean distance to compute points within an epsilon. So all dimensions are treated the same.
However, you could use the Mahalanobis distance to weigh each dimension differently. You can use a diagonal covariance matrix for flat clusters. You can use a full symmetric covariance matrix for flat tilted clusters, etc.
In your case, you would use a covariance matrix like:
100    0    0
  0  100    0
  0    0  0.04
In the pseudocode provided at the Wikipedia entry for DBSCAN, just use one of the distance metrics suggested above in the regionQuery function.
Update
Note: scaling the data is equivalent to using an appropriate metric.
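For completeness, here is a hedged scikit-learn sketch of the Mahalanobis variant (not the poster's setup; it assumes DBSCAN accepts the inverse covariance via metric_params, and the coordinates are made up):

import numpy as np
from sklearn.cluster import DBSCAN

# Diagonal covariance from above: eps = 1 then corresponds to roughly
# 10 m in the x-y plane and 0.2 m along z.
S = np.diag([100.0, 100.0, 0.04])
VI = np.linalg.inv(S)                    # Mahalanobis uses the inverse covariance
points = np.random.default_rng(1).random((500, 3)) * [1000, 1000, 5]
labels = DBSCAN(eps=1.0, min_samples=5, metric='mahalanobis',
                metric_params={'VI': VI}, algorithm='brute').fit_predict(points)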

Clustering with a Distance Matrix via Mahalanobis distance

I have a set of pairwise distances (in a matrix) between objects that I would like to cluster. I currently use k-means clustering (computing distance from the centroid as the average distance to all members of the given cluster, since I do not have coordinates), with k chosen by the best Davies-Bouldin index over an interval.
However, I have three separate metrics (more in the future, potentially) describing the difference between the data, each fairly different in terms of magnitude and spread. Currently, I compute the distance matrix with the Euclidean distance across the three metrics, but I am fairly certain that the difference between the metrics is messing it up (e.g. the largest one is overpowering the other ones).
I thought a good way to deal with this is to use the Mahalanobis distance to combine the metrics. However, I obviously cannot compute the covariance matrix between the coordinates, but I can compute it for the distance metrics. Does this make sense? That is, if I get the distance between two objects i and j as:
D(i,j) = sqrt( dt S^-1 d )
where d is the 3-vector of the different distance metrics between i and j, dt is the transpose of d, and S is the covariance matrix of the distances, would D be a good, normalized metric for clustering?
I have also thought of normalizing the metrics (i.e. subtracting the mean and dividing out the variance) and then simply staying with the euclidean distance (in fact it would seem that this essentially is Mahalanobis distance, at least in some cases), or of switching to something like DBSCAN or EM, and have not ruled them out (though MDS then clustering might be a bit excessive). As a sidenote, any packages able to do all of this would be greatly appreciated. Thanks!
Consider using k-medoids (PAM) instead of a hacked k-means: PAM can work with arbitrary distance functions, whereas k-means is designed to minimize variances, not arbitrary distances.
EM will have the same problem - it needs to be able to compute meaningful centers.
You can also use hierarchical linkage clustering. It only needs a distance matrix.
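To make both suggestions concrete, here is a rough NumPy/SciPy sketch (random matrices stand in for the three real distance matrices) that combines them with the D(i,j) = sqrt(dt S^-1 d) formula from the question and then runs hierarchical linkage clustering:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Three stand-in pairwise distance matrices for n objects.
n = 50
rng = np.random.default_rng(0)
D1, D2, D3 = [squareform(rng.random(n * (n - 1) // 2)) for _ in range(3)]

# d[i, j] is the 3-vector of distances between objects i and j.
d = np.stack([D1, D2, D3], axis=-1)                 # shape (n, n, 3)
S = np.cov(d.reshape(-1, 3), rowvar=False)          # covariance of the three metrics
VI = np.linalg.inv(S)
D = np.sqrt(np.einsum('ijk,kl,ijl->ij', d, VI, d))  # D(i,j) = sqrt(dt S^-1 d)

# Hierarchical linkage clustering needs only this combined distance matrix.
Z = linkage(squareform(D, checks=False), method='average')
labels = fcluster(Z, t=5, criterion='maxclust')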

How to select top 100 features(a subset) which are most relevant after pca?

I performed PCA on a 63x2308 matrix and obtained a score matrix and a coefficient matrix. The score matrix is 63x2308 and the coefficient matrix is 2308x2308.
How do I extract the column names of the top 100 most important features, so that I can perform regression on them?
PCA should give you both a set of eigenvectors (your coefficient matrix) and a vector of eigenvalues (1x2308, often referred to as lambda). You might need to use a different PCA function in MATLAB to get them.
The eigenvalues indicate how much of your data each eigenvector explains. A simple method for selecting features would be to select the 100 features with the highest eigenvalues. This gives you a set of features which explain most of the variance in the data.
If you need to justify your approach for a write-up, you can calculate the amount of variance explained per eigenvector and cut off at, for example, 95% of the variance explained.
Bear in mind that selecting based solely on eigenvalue might not correspond to the set of features most important to your regression, so if you don't get the performance you expect, you might want to try a different feature selection method such as recursive feature selection. I would suggest using Google Scholar to find a couple of papers doing something similar and seeing what methods they use.
A quick MATLAB example of taking the top 100 principal components using PCA:
[eigenvectors, projected_data, eigenvalues] = princomp(X);        % coefficients, scores, eigenvalues
[~, feature_idx] = sort(eigenvalues, 'descend');                  % rank components by explained variance
selected_projected_data = projected_data(:, feature_idx(1:100));  % keep the 100 strongest components
Have you tried with
B = sort(your_matrix,2,'descend');
C = B(:,1:100);
Be careful!
With just 63 observations and 2308 variables, your PCA result will be meaningless because the data is underspecified. You should have at least (rule of thumb) dimensions*3 observations.
With 63 observations, you can at most define a 62 dimensional hyperspace!

K-means distance parameters in Matlab - Varying results

I have a matrix I am working with which is 300x5000, and I wanted to test which distance calculation parameter is the most effective. I got the following results:
'Sqeuclidean' = 17 iterations, total sum of distances = 25175.4
'Correlation' = 9 iterations, total sum of distances = 32.7
'Cityblock' = 34 iterations, total sum of distances = 105175.3
'Cosine' = 11 iterations, total sum of distances = 11.9
I am having trouble understanding why the results vary so much and how to choose the most effective distance parameter. Any advice?
EDIT:
I have 300 features with 5000 instances of each feature.
The function call looks like this:
[idx, ctrs, sumd, d] = kmeans(matrix, 25, 'distance', 'cityblock', 'replicate', 20)
with only the distance parameter interchanged. The features were already normalized.
Thanks!
As slayton commented, you really need to define what 'best' means for your particular problem.
The only thing that matters is how well the distance function clusters the data. In general, clustering is highly-dependent on the distance function. The two metrics that you've selected (number of iterations, sum of distances) are pretty irrelevant to how well the clustering works.
You need to know what you're trying to achieve with clustering, and you need some metric for how well you've achieved that goal. If there's an objective metric to determine how good your clusters are, then use that. Often, the metric is fuzzier: does this look right when I visualize the data? Look at your data, and look at how each distance function clusters it. Select the distance function that seems to generate the best clusters. Do this for several subsets of your data to make sure that your intuition is correct. You should also try to understand the result that each distance function gives you.
Lastly, some problems lend themselves to a particular distance function. If your problem has spatial features, then a Euclidean (geometric) distance is often a natural choice. Other distance functions will perform better for different problems.
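As a hedged illustration of "use an objective metric" (shown in Python rather than MATLAB; the data matrix and the three label arrays are placeholders for your real matrix and the idx outputs of the different kmeans runs), score every clustering with the same internal index, such as the silhouette, instead of comparing each run's own sum of distances:

import numpy as np
from sklearn.metrics import silhouette_score

# Placeholder data (instances x features) and placeholder cluster assignments
# standing in for the idx output of each kmeans run.
rng = np.random.default_rng(0)
X = rng.random((1000, 300))
runs = {
    'sqeuclidean': rng.integers(0, 25, size=1000),
    'cityblock':   rng.integers(0, 25, size=1000),
    'cosine':      rng.integers(0, 25, size=1000),
}

# The same index (higher is better) makes the runs directly comparable,
# unlike sums of distances measured on different scales.
for name, labels in runs.items():
    print(name, silhouette_score(X, labels, metric='euclidean'))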
Distance values from different distance functions, data sets, or normalizations are generally not comparable. A simple example from reality: measure distances in "meter" or in "inch", and you get very different results. The result in meters will not be better just because it is measured on a different scale. So you must not compare the variances of different results.
Notice that k-means is meant to be used with Euclidean distance only, and may not converge with other distance functions. IMHO, L_p norms should be fine, and on TF-IDF maybe also cosine, but I do not know a proof for that.
Oh, and k-means works really badly with high-dimensional data. It is meant for low dimensionality.