cluster a set of probability distributions - cluster-analysis

I have a set of probability distributions that I got by running a trained LDA model over a set of documents. Each document has a topic distribution, i.e. a vector whose i-th entry says how likely the document is to be assigned to topic i. So I have a set of vectors, each of which is such a topic distribution, e.g. [0.1, 0.2, 0.3, 0, 0, 0.2, 0.1, 0.1].
I want to cluster these vectors to see how the documents are related, and I would also like to find the vectors that lie closest to the center of each cluster. Which clustering algorithms work best for this?
Thank you!
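A minimal sketch of one way to approach this, using scikit-learn's KMeans purely as an illustration (which algorithm fits best is exactly the open question here; the data X and the cluster count are made up):

import numpy as np
from sklearn.cluster import KMeans

# X: one row per document, one column per topic (each row sums to 1)
X = np.array([[0.1, 0.2, 0.3, 0.0, 0.0, 0.2, 0.1, 0.1],
              [0.0, 0.1, 0.4, 0.1, 0.0, 0.2, 0.1, 0.1],
              [0.3, 0.1, 0.1, 0.2, 0.1, 0.0, 0.1, 0.1]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)   # 2 clusters, illustrative only

# for each cluster, report the document whose topic vector is closest to the centroid
for c in range(km.n_clusters):
    members = np.where(km.labels_ == c)[0]
    dists = np.linalg.norm(X[members] - km.cluster_centers_[c], axis=1)
    print("cluster", c, "closest document:", members[np.argmin(dists)])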

Related

sklearn python affinity propagation - is there a method to calculate error in clusters?

Looking at the docs for sklearn.cluster and Affinity Propagation, I don't see anything that would calculate the error within a cluster. Does this exist, or is it something I have to write on my own?
Update: Let me propose a possible idea:
With Affinity Propagation we have a dissimilarity matrix, i.e. a matrix that measures how dissimilar each point is from every other point. When AP is finished I have, for every point, the label of the cluster it belongs to. What if I took the dissimilarity measurements from that matrix? For example, say that in a 10x10 matrix point 3 is an exemplar (cluster center) and point 4 is assigned to it. The dissimilarity between the exemplar and that point is, say, -5. Suppose two more points are assigned to this exemplar, with dissimilarities of -3 and -8 respectively. Then I could call the cluster's error (-5 + -3 + -8)/3 = -16/3. If another cluster has dissimilarity measurements of -2, -3, -2, -3, -2, -3, its error would be -15/6. This seems to provide a potential error measurement.
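A rough sketch of how that proposed measurement could be computed with scikit-learn (the data X is made up; the similarity matrix is AP's usual negative squared Euclidean distance, which matches the negative values above):

import numpy as np
from sklearn.cluster import AffinityPropagation
from sklearn.metrics.pairwise import euclidean_distances

X = np.random.rand(10, 2)                      # made-up data
S = -euclidean_distances(X, squared=True)      # 10x10 similarity matrix (negative dissimilarities)

ap = AffinityPropagation(affinity='precomputed', random_state=0).fit(S)

# average (dis)similarity between each exemplar and the points assigned to it
for k, exemplar in enumerate(ap.cluster_centers_indices_):
    members = np.where(ap.labels_ == k)[0]
    members = members[members != exemplar]     # leave the exemplar itself out
    if len(members):
        print("cluster", k, "mean similarity to exemplar:", S[members, exemplar].mean())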
I don't think there is a commonly accepted definition of "error" that would make sense in the context of affinity propagation, which is a similarity-based method.
Error measures of that kind work well with coordinate-based methods such as k-means, but with AP we may not have coordinates at all.

How's it even possible to use softmax for word2vec?

How is it possible to use softmax for word2vec? I mean, softmax outputs the probabilities of all classes, which sum up to 1, e.g. [0, 0.1, 0.8, 0.1, 0]. But if my label is, for example, [0, 1, 0, 1, 0] (multiple correct classes), then it is impossible for softmax to output that value, isn't it?
Should I use something else instead? Or am I missing something?
I suppose you're talking about the Skip-Gram model (i.e., predicting a context word from the center word), because the CBOW model predicts the single center word and therefore assumes exactly one correct class.
Strictly speaking, if you were to train word2vec with the SG model and an ordinary softmax loss, the correct label for your example would be [0, 0.5, 0, 0.5, 0]. Alternatively, you can feed several examples per center word, with labels [0, 1, 0, 0, 0] and [0, 0, 0, 1, 0]. It's hard to say which one performs better, but the label must be a valid probability distribution per input example.
In practice, however, ordinary softmax is rarely used, because there are too many classes: computing the full distribution is too expensive and simply not needed (almost all probabilities are nearly zero all the time). Instead, researchers use sampled loss functions for training, which approximate the softmax loss but are much more efficient. The following loss functions are particularly popular:
Negative Sampling
Noise-Contrastive Estimation
These losses are more complicated than softmax, but if you're using TensorFlow, both are implemented and can be used just as easily.
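For instance, here is a minimal TensorFlow 2 sketch of a skip-gram training step with NCE (the vocabulary size, embedding size, and word ids below are made up; tf.nn.sampled_softmax_loss takes the same arguments):

import tensorflow as tf

vocab_size, embed_dim, num_sampled = 10000, 128, 64        # illustrative sizes

embeddings  = tf.Variable(tf.random.uniform([vocab_size, embed_dim], -1.0, 1.0))
nce_weights = tf.Variable(tf.random.truncated_normal([vocab_size, embed_dim], stddev=0.1))
nce_biases  = tf.Variable(tf.zeros([vocab_size]))

center_words  = tf.constant([12, 57, 3])                          # ids of the center words
context_words = tf.constant([[43], [9], [101]], dtype=tf.int64)   # one true context word per example

inputs = tf.nn.embedding_lookup(embeddings, center_words)   # [batch_size, embed_dim]

loss = tf.reduce_mean(
    tf.nn.nce_loss(weights=nce_weights,
                   biases=nce_biases,
                   labels=context_words,      # the "correct" classes
                   inputs=inputs,
                   num_sampled=num_sampled,   # negative samples drawn per batch
                   num_classes=vocab_size))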

Matlab: how to change the initial probability distribution of a hidden Markov model

How do I change the initial probability distribution of a hidden Markov model in MATLAB? I know that by default, MATLAB's HMM algorithms begin in state 1. To assign a different distribution of initial probabilities, the transition and emission matrices have to be augmented to include the prior. How can I do this? I need an answer with a practical example. Assume I have Trans = [0.2, 0.8; 0.4, 0.6], Emis = [0.6, 0.5, 0.3; 0.4, 0.5, 0.7] and initial = [0.2, 0.8].
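A sketch of that augmentation, following the approach described in the Statistics Toolbox documentation for hmmgenerate/hmmdecode (this block is in MATLAB to match the question; the observation sequence seq is assumed):

% add a dummy initial state whose outgoing probabilities are the desired prior
Trans = [0.2 0.8; 0.4 0.6];
Emis  = [0.6 0.5 0.3; 0.4 0.5 0.7];
p     = [0.2 0.8];                                % desired initial state distribution

TransHat = [0 p; zeros(size(Trans,1),1) Trans];   % 3x3: state 1 is the dummy state, jumping to the real states with probabilities p
EmisHat  = [zeros(1,size(Emis,2)); Emis];         % 3x3: the dummy state emits nothing

% use the augmented matrices wherever Trans/Emis were used, e.g.
% pStates = hmmdecode(seq, TransHat, EmisHat);
% (state k of the original model is state k+1 of the augmented one)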

K-means Clustering, major understanding issue

Suppose we have 64-dimensional data to cluster; say the dataset is a matrix dt of size 64x150.
Using the kmeans function from the VLFeat library, I cluster my dataset into 20 centers:
[centers, assignments] = vl_kmeans(dt, 20);
centers is a 64x20 matrix.
assignments is a 1x150 matrix with values inside it.
According to manual: The vector assignments contains the (hard) assignments of the input data to the clusters.
I still cannot understand what the numbers in assignments mean; I just don't get it. Would anyone mind helping me out here? An example would be great. What do these values actually represent?
In k-means the problem you are trying to solve is clustering your 150 points into 20 clusters. Each point is 64-dimensional and is thus represented by a vector of size 64. So in your case dt is the set of points, and each column is one 64-dim point.
After running the algorithm you get centers and assignments. centers holds the positions of the 20 cluster centers in 64-dim space, in case you want to visualize them, measure distances between points and clusters, etc. assignments, on the other hand, contains the actual cluster assignment of each 64-dim point in dt. So if assignments(7) is 15, it indicates that the 7th column of dt belongs to the 15th cluster.
For example, consider clustering lots of 2-D points, say 1000 of them, into 3 clusters (the EDIT below links to an example image). In that case dt would be 2x1000, centers would be 2x3, and assignments would be 1x1000, holding numbers ranging from 1 to 3 (or 0 to 2, in case you're using OpenCV).
EDIT:
The code to produce the example image is located here: http://pypr.sourceforge.net/kmeans.html#k-means-example, along with a tutorial on k-means for pyPR.
In OpenCV it is the number of the cluster that each of the input points belongs to.
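The same structure, sketched with scikit-learn purely as an illustration (note that sklearn expects one sample per row, whereas vl_kmeans takes one sample per column, and Python indexing starts at 0 rather than 1):

import numpy as np
from sklearn.cluster import KMeans

dt = np.random.rand(64, 150)                  # 150 points, one 64-dim point per column, as in the question

km = KMeans(n_clusters=20, n_init=10, random_state=0).fit(dt.T)   # transpose: sklearn wants samples as rows

centers     = km.cluster_centers_.T           # 64x20, same layout as vl_kmeans' centers
assignments = km.labels_                      # length-150 vector of cluster indices (0..19 here, 1..20 in MATLAB)

print(assignments[7])                         # the cluster that the 8th point belongs to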

Clustering with a Distance Matrix via Mahalanobis distance

I have a set of pairwise distances (in a matrix) between objects that I would like to cluster. I currently use k-means clustering (computing distance from the centroid as the average distance to all members of the given cluster, since I do not have coordinates), with k chosen by the best Davies-Bouldin index over an interval.
However, I have three separate metrics (potentially more in the future) describing the differences between the data, each fairly different in magnitude and spread. Currently, I compute the distance matrix as the Euclidean distance across the three metrics, but I am fairly certain that the difference in scale between the metrics is messing things up (e.g. the largest one overpowers the others).
I thought a good way to deal with this is to use the Mahalanobis distance to combine the metrics. However, I obviously cannot compute the covariance matrix between the coordinates, but I can compute it for the distance metrics. Does this make sense? That is, if I get the distance between two objects i and j as:
D(i,j) = sqrt( d^T S^-1 d )
where d is the 3-vector of the different distance metrics between i and j, d^T is the transpose of d, and S is the covariance matrix of the distances, would D be a good, normalized metric for clustering?
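In code, the proposed combination might look like this (a sketch that implements the formula exactly as written above; the three distance matrices are made up, and the covariance is estimated over all pairs):

import numpy as np

n = 50
rng = np.random.default_rng(0)
D1, D2, D3 = (rng.random((n, n)) for _ in range(3))          # placeholder n x n distance matrices

# one 3-vector of metric values per (i, j) pair
d = np.stack([D1.ravel(), D2.ravel(), D3.ravel()], axis=1)   # shape (n*n, 3)

S_inv = np.linalg.inv(np.cov(d, rowvar=False))               # inverse covariance of the three metrics

# D(i,j) = sqrt(d^T S^-1 d), computed for every pair at once
D = np.sqrt(np.einsum('ij,jk,ik->i', d, S_inv, d)).reshape(n, n)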
I have also thought of normalizing the metrics (i.e. subtracting the mean and dividing by the standard deviation) and then simply staying with the Euclidean distance (in fact it would seem that this essentially is a Mahalanobis distance, at least in some cases), or of switching to something like DBSCAN or EM, and have not ruled them out (though MDS followed by clustering might be a bit excessive). As a side note, any packages able to do all of this would be greatly appreciated. Thanks!
Consider using k-medoids (PAM) instead of a hacked k-means; k-medoids can work with arbitrary distance functions, whereas k-means is designed to minimize variances, not arbitrary distances.
EM will have the same problem - it needs to be able to compute meaningful centers.
You can also use hierarchical linkage clustering. It only needs a distance matrix.
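For example, SciPy's hierarchical clustering can be run directly on a precomputed distance matrix (a sketch with a made-up symmetric matrix D and an arbitrary cut into 4 clusters); for k-medoids, the KMedoids estimator in the scikit-learn-extra package accepts metric='precomputed'.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

n = 20
rng = np.random.default_rng(0)
A = rng.random((n, n))
D = (A + A.T) / 2                               # made-up symmetric pairwise distance matrix
np.fill_diagonal(D, 0)

Z = linkage(squareform(D), method='average')    # linkage works on the condensed distance vector
labels = fcluster(Z, t=4, criterion='maxclust') # cut the dendrogram into 4 clusters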