I am learning about k-means clustering, and I don't understand how the assignment is done when a data point p is equidistant from two centroids c1 and c2. To which cluster will the point p belong?
It is not uniquely defined.
If k-means works well on your data, it should not matter much for the result. This is mostly an issue on data where k-means does not work well, e.g. binary data.
Most implementations will likely prefer the first cluster.
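For instance, NumPy's argmin returns the first index when there is a tie, so an assignment step built on it sends an equidistant point to the lower-numbered cluster. A minimal sketch (the centers and the point are made up for illustration):

    import numpy as np

    centers = np.array([[0.0, 0.0], [2.0, 0.0]])
    point = np.array([1.0, 0.0])                 # equidistant from both centers
    dists = np.linalg.norm(centers - point, axis=1)
    print(np.argmin(dists))                      # prints 0: the first cluster wins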
I have heard a lot about "breaking the symmetry" in the context of neural network programming and initialization. Can somebody please explain what this means? As far as I can tell, it has something to do with neurons behaving identically during forward and backward propagation when the weight matrix is filled with identical values at initialization, whereas random initialization, i.e., not using identical values throughout the matrix, produces asymmetrical behavior.
Your understanding is correct.
When all initial values are identical, for example when every weight is initialized to 0, then during backpropagation all weights receive the same gradient, and hence the same update. This is what is referred to as the symmetry.
Intuitively, that means all nodes will learn the same thing, and we don't want that, because we want the network to learn different kinds of features. This is achieved by random initialization, since then the gradients will differ and each node will grow more distinct from the other nodes, enabling diverse feature extraction. This is what is referred to as breaking the symmetry.
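You can see the symmetry numerically in a tiny network. A minimal sketch (the 2-2-1 architecture, input, and target are made up for illustration; identical hidden weights produce identical gradient rows):

    import numpy as np

    x = np.array([1.0, 2.0])           # one input example
    W1 = np.zeros((2, 2))              # both hidden units start identical
    w2 = np.array([0.5, 0.5])          # identical output weights
    h = np.tanh(W1 @ x)                # hidden activations: identical
    y = w2 @ h                         # network output
    grad_y = y - 1.0                   # gradient of squared error w.r.t. y (target 1.0)
    grad_h = grad_y * w2               # same value for both hidden units
    grad_W1 = np.outer(grad_h * (1 - h**2), x)
    print(grad_W1)                     # both rows identical: the units never diverge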
I'm using the function fcm from MATLAB for overlapping (fuzzy) clustering. The output of this function is a matrix of size k×n, with k being the number of clusters and n the number of examples.
Now my problem is: how do I choose clusters for an example? For each example I have membership scores for all clusters, so I can easily find the best-matched cluster, but what about the other clusters?
Many thanks.
It depends on the clustering algorithm, but you can probably interpret those soft clustering values as probabilities. This gives two well-founded options for extracting a hard clustering:
Sample each point's cluster from its cluster distribution (a column in your k×n matrix).
Assign each point to its most probable cluster. This corresponds to the MAP (maximum a posteriori) solution to the clustering problem.
Option 2 is probably the way to go - a single sample may not be a great representation of what's going on; with MAP, you're at least guaranteed to get something probable.
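A minimal sketch of both options (the 3×5 membership matrix U is made up for illustration; fcm's output already has columns summing to 1):

    import numpy as np

    rng = np.random.default_rng(0)
    U = rng.random((3, 5))
    U /= U.sum(axis=0)                       # normalize columns to probabilities

    map_labels = U.argmax(axis=0)            # option 2: most probable cluster per point
    sampled = np.array([rng.choice(3, p=U[:, j]) for j in range(5)])  # option 1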
I want to use k-means clustering to classify Testing data based on Training data, both of which have 3 classes (1, 2 and 3).
How would I classify the Testing data set using a number of clusters other than 3, e.g. k=10 in kmeans (e.g. using MATLAB)? I know that I can set k=3 and then label each test point by its nearest cluster... but I'm not sure what to do for values other than k=3. How would you label each of those 10 clusters?
Thanks
The classification with 10 clusters would be no different from the classification with 3 clusters. The number of clusters in k-means is independent of the number of "classes" in the data: k-means is an unsupervised learning algorithm, so it gives no consideration to the class labels of the training data during training.
The algorithm would look something like this (a runnable NumPy sketch; cluster_centers, assignments and train_labels are assumed to come from the training run):

    import numpy as np
    from scipy import stats

    distances = np.linalg.norm(cluster_centers - test_point, axis=1)
    nearest = np.argmin(distances)                               # closest cluster
    predicted_class = stats.mode(train_labels[assignments == nearest]).mode
where we find the cluster whose center is nearest to our test point, then take the most common class label among the training examples assigned to that cluster.
It is a little unclear what exactly you want to do, but here is an outline based on what I understand.
When you are clustering data, labels are ideally not present: you use clustering either to get insights from the data or as a pre-processing step.
However, if you want to perform a clustering and then assign a class id to a new data point based on the nearest cluster center, you can do the following.
First, select k by bootstrapping or other methods, for example silhouette coefficients. Once you have the cluster centers, check which center is closest to the new data point and assign the class id accordingly.
In such cases you may also want to use the Rand index or the adjusted Rand index to assess cluster quality; see the sketch below.
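A minimal sketch of the k selection and the quality check (assuming scikit-learn; the data X and its class ids y are hypothetical placeholders):

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score, adjusted_rand_score

    best_k, best_score = 2, -2.0
    for k in range(2, 11):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        score = silhouette_score(X, km.labels_)   # higher = better-separated clusters
        if score > best_score:
            best_k, best_score, best_km = k, score, km

    # With ground-truth labels available, the adjusted Rand index measures
    # the agreement between the clustering and the classes.
    print(best_k, adjusted_rand_score(y, best_km.labels_))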
I know that Mahout is meant for batch processing, but I am interested in whether I can use its KMeans, and how, to cluster individual points.
Let's say we have the following situation:
Global clustering, which batch-processes all data and produces centroids as a result
Single-point clustering, which uses the centroids from the global clustering to assign that point to a cluster. It does not require recomputing the cluster centroids, just assigning the point to an existing cluster.
Can I do this using Mahout, or do I have to implement it myself? I thought of setting the number of iterations to 1 and assigning the point that way, but the problem is that KMeans recomputes the cluster centroids, and if the new point is an outlier it creates a new cluster for it. I don't want that; I actually want the distance to the closest centroid.
For now, it seems that KMeans is not very appropriate for this and that it should be implemented separately... Is that correct?
Thanks
You don't need to use Mahout for this.
K-means assigns points to the nearest center.
So just get all the centers (which should easily fit into RAM) and compute the squared Euclidean distance to each center.
It's just a few CPU cycles; there is absolutely no benefit in trying to do this in Mahout - the overhead would be far too large for just k distance computations.
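A minimal sketch of that assignment step (centers is a hypothetical k×d array exported from the batch run, point a length-d vector):

    import numpy as np

    sq_dists = ((centers - point) ** 2).sum(axis=1)   # squared distance to each center
    nearest = int(np.argmin(sq_dists))                # cluster the point falls into
    dist = float(np.sqrt(sq_dists[nearest]))          # distance to the closest centroid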
I have a data set and I want to find out which distribution fits it. How can I test different distributions against this data set? Is there any code, or an automated way, to do that in MATLAB?
Thanks.
I think what you're looking for is called the Bayesian Information Criterion or BIC. Check it out on Wikipedia... Then pick several distributions, calculate the BIC for each distribution with your data, and finally see which one has the best BIC.
Although I make this sound like a simple problem, it actually isn't. For many distributions, calculating the BIC requires numerical optimization over the parameters of the distribution. However, for some distributions MATLAB can compute the maximum likelihood estimate (MLE) for you automatically, which is part of what you'll need for the BIC.
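A minimal sketch of the comparison (in Python with SciPy to keep it self-contained; data is your sample, assumed positive here so the exponential and gamma fits are valid):

    import numpy as np
    from scipy import stats

    candidates = {"normal": stats.norm, "exponential": stats.expon, "gamma": stats.gamma}
    n = len(data)
    for name, dist in candidates.items():
        params = dist.fit(data)                        # maximum likelihood fit
        loglik = np.sum(dist.logpdf(data, *params))
        bic = len(params) * np.log(n) - 2 * loglik     # lower BIC = better fit
        print(name, bic)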