When computing an X-Means clustering solution for a dataset, the algorithm description says that the centers must be seeded properly.
In WEKA's XMeans there is an option to specify the initial centers, and in other XMeans libraries the user often has to provide an initial set of centers.
However, there is no indication of whether, or how, the WEKA XMeans implementation creates the initial centers if none are provided.
How does WEKA produce initial centers if none are provided? Or is it necessary to generate the initial centers yourself in order to run the XMeans algorithm properly?
You cannot use predefined centers with x-means, because it recursively works on subsets.
You could define the initial kmin (usually 2) centers, but you cannot predefine what happens after that, and the whole purpose of x-means is to not have to know k beforehand. If you predefine k centers, you are assuming that this is the right k.
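For what it's worth, this is easy to see outside WEKA. Here is a minimal sketch with the pyclustering library (assuming its xmeans API; the toy data is made up), showing that you only seed the kmin starting centers and x-means decides everything after that, up to kmax:

# A sketch using pyclustering (not WEKA): only the kmin starting centers
# are seeded by the user; x-means itself decides how to split further.
from pyclustering.cluster.center_initializer import kmeans_plusplus_initializer
from pyclustering.cluster.xmeans import xmeans

# Toy 2-D data with a few loose groups.
data = [[1.0, 1.2], [1.1, 0.9], [5.0, 5.1], [5.2, 4.9],
        [9.0, 0.8], [9.1, 1.1], [1.2, 1.0], [5.1, 5.0]]

# Seed only kmin = 2 centers (here via k-means++); everything after this
# is decided by the x-means splitting criterion, not by the user.
initial_centers = kmeans_plusplus_initializer(data, 2).initialize()

xm = xmeans(data, initial_centers, kmax=10)
xm.process()

print(xm.get_clusters())   # index lists, one per discovered cluster
print(xm.get_centers())    # final centers; their count is chosen by x-means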
Related
How can I apply the k-means algorithm with cluster positions determined by a PSO algorithm?
Just do it.
K-means allows you to specify the initial centroids.
Without any information on the nature of the data you're dealing with (number of dimensions, datatypes, outliers, overlap, etc.), it is impossible to give specific answers.
I don't know of any genuine k-means implementation where you can pass in a list of centroids that the algorithm uses to initialize the k-means centroids; usually these are selected randomly. (Can't you write your own implementation of k-means that does this initialization? Simply take an open-source implementation and add an argument.)
However, in the Python sklearn implementation of k-means, besides the k-means++ initialization, you can pass in the initial centers as an array:
init : {‘k-means++’, ‘random’ or an ndarray}
Method for initialization, defaults to ‘k-means++’:
‘k-means++’ : selects initial cluster centers for k-mean clustering
in a smart way to speed up convergence.
...
If an ndarray is passed, it should be of shape
(n_clusters, n_features) and gives the initial centers.
Haven't used it, though.
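For concreteness, a minimal sketch of that option (assuming scikit-learn's KMeans API; the data and centers here are made up for illustration):

import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(100, 2)                  # toy data with 2 features
centers = np.array([[0.2, 0.2],             # e.g. centers from PSO
                    [0.5, 0.8],
                    [0.9, 0.1]])

# n_clusters must match the number of rows in the init array;
# n_init=1 because an explicit init leaves nothing to restart.
km = KMeans(n_clusters=3, init=centers, n_init=1).fit(X)
print(km.cluster_centers_)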
And I wrote this before I remembered/looked up kmeans++:
This is a poor-man's approach:
You can run kmeans with a k parameter equal to the length of the list/array that the PSO algorithm (whatever it did) has given you.
Then kmeans will quickly find its own centroids. Do this several times, maybe with different distance measures (Euclidean, Manhattan, shortest, longest, avg...) and different seeds for your random-number generator. Each time, afterwards, compare the coordinates of the k-means centroids with the coordinates of the PSO centroids.
When there is a near 1:1 correspondence (depending on your requirements), you've found a match. Then do something with your list of k-means classification results.
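As a sketch of that comparison step (assuming scikit-learn and SciPy; the PSO centers are hypothetical), the 1:1 pairing can be done with the Hungarian algorithm:

import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

X = np.random.rand(200, 2)                                    # toy data
pso_centers = np.array([[0.2, 0.2], [0.5, 0.8], [0.9, 0.1]])  # hypothetical

# k equals the number of centers the PSO algorithm produced.
km = KMeans(n_clusters=len(pso_centers), n_init=10).fit(X)

# Pair each k-means centroid with the closest PSO centroid, 1:1.
cost = cdist(km.cluster_centers_, pso_centers)
rows, cols = linear_sum_assignment(cost)
for r, c in zip(rows, cols):
    print(f"kmeans centroid {r} <-> PSO centroid {c}: distance {cost[r, c]:.3f}")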
I have a question about using a clustering method vs fitting the same data with a distribution.
Assume I have a dataset with 2 features (feat_A and feat_B) and that I use a clustering algorithm to divide the data into an optimal number of clusters... say 3.
My goal is to assign to each input point [feat_Ai, feat_Bi] a probability (or something similar) that the point belongs to cluster 1, 2, or 3.
a. First approach with clustering:
I cluster the data into the 3 clusters and assign to each point a probability of belonging to a cluster based on its distance from that cluster's center.
b. Second approach using mixture model:
I fit a mixture model or mixture distribution to the data. Data are fit to the distribution using an expectation maximization (EM) algorithm, which assigns posterior probabilities to each component density with respect to each observation. Clusters are assigned by selecting the component that maximizes the posterior probability.
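To make the contrast concrete, a minimal sketch of both approaches (assuming scikit-learn; the inverse-distance weighting in a. is just one simple choice, not the only one):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

X = np.random.rand(500, 2)              # toy [feat_A, feat_B] data

# a. k-means: turn distances to the 3 centers into soft assignments.
km = KMeans(n_clusters=3, n_init=10).fit(X)
d = km.transform(X)                     # distance of each point to each center
w = 1.0 / (d + 1e-12)
p_kmeans = w / w.sum(axis=1, keepdims=True)

# b. mixture model: EM yields genuine posterior probabilities per component.
gm = GaussianMixture(n_components=3).fit(X)
p_gmm = gm.predict_proba(X)

print(p_kmeans[0], p_gmm[0])            # soft assignments for the first point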
In my problem I find the cluster centers (or fit the model, if approach b. is used) with a subsample of the data. Then I have to assign a probability to a lot of other data. I would like to know which approach is better for still getting meaningful assignments in the presence of new data.
I would go for a clustering method, for example k-means, because:
If the new data come from a distribution different from the one used to create the mixture model, the assignments could be incorrect.
With new data, the posterior probabilities change.
The clustering method minimizes the variance of the clusters in order to find a kind of optimal separation border; the mixture model takes the variance of the data into consideration to create the model (so I am not sure the clusters that will be formed are separated in an optimal way).
More info about the data:
The features should not be assumed to be dependent.
Feat_A represents the duration of a physical activity and Feat_B the step count. In principle we could say that a longer activity duration means a higher step count, but this is not always true.
Please help me think this through, and let me know if you have any other points.
I've created a codebook based on the k-means clustering algorithm, but the algorithm didn't converge to an optimal codebook: on each run the cluster centroids vary (because of the random selection of initial seeds). There is an option in MATLAB to give an initial matrix to k-means, but how can we select the initial codebook from a large dataset? Is there any other way to get a unique codebook using k-means?
It's somewhat standard to run k-means multiple times using different initial states (e.g., initial seeds) and choose the result with the lowest error as the best result.
It's also typical to seed k-means by randomly choosing k elements from your data set as the initial seeds.
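A minimal sketch of both ideas together (assuming scikit-learn; init='random' picks k observations from the data as the initial seeds, and the sizes here are made up):

import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(300, 8)               # toy stand-in for the codebook data
best = None
for seed in range(10):
    km = KMeans(n_clusters=16, init='random', n_init=1,
                random_state=seed).fit(X)
    if best is None or km.inertia_ < best.inertia_:
        best = km                         # keep the run with the lowest error

codebook = best.cluster_centers_
print(best.inertia_)                      # lowest within-cluster sum of squares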
Since by default MATLAB's K-Means uses the K-Means++ algorithm for initialization, it uses random numbers.
Hence sequential calls to K-Means will probably produce different results.
You have 3 options to make this deterministic:
Set MATLAB's random number generator to a fixed state before calling K-Means (see the sketch after this list).
Use the stream option in the K-Means options to set the stream inside K-Means.
Write your own version of K-Means which uses a deterministic way to initialize it.
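The MATLAB details aside, the first option is the same idea in any library: fix the random number generator state and repeated runs become reproducible. As a sketch in Python/scikit-learn terms (calling rng(seed) before kmeans would be the MATLAB counterpart):

import numpy as np
from sklearn.cluster import KMeans

X = np.random.RandomState(0).rand(100, 2)   # toy data

# The same fixed random_state gives identical centers on every call.
a = KMeans(n_clusters=4, random_state=42, n_init=10).fit(X).cluster_centers_
b = KMeans(n_clusters=4, random_state=42, n_init=10).fit(X).cluster_centers_
print(np.allclose(a, b))                     # True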
I'm running kmeans in MATLAB on a 400x1000 matrix, and for some reason I get different results whenever I run the algorithm. Below is a code example:
[idx, ~, ~, ~] = kmeans(factor_matrix, 10, 'dist','sqeuclidean','replicates',20);
For some reason, each time I run this code I get different results. Any ideas?
I am using it to identify multicollinearity issues.
Thanks for the help!
The k-means implementation in MATLAB has a randomized component: the selection of initial centers. This causes different outcomes. Practically however, MATLAB runs k-means a number of times and returns you the clustering with the lowest distortion. If you're seeing wildly different clusterings each time, it may mean that your data is not amenable to the kind of clusters (spherical) that k-means looks for, and is an indication toward trying other clustering algorithms (e.g. spectral ones).
You can get deterministic behavior by passing it an initial set of centers as one of the function arguments (the start parameter). This will give you the same output clustering each time. There are several heuristics to choose the initial set of centers (e.g. K-means++).
As you can read on the wiki, k-means algorithms are generally heuristic and partially probabilistic, the one in Matlab being no exception.
This means that there is a certain random part to the algorithm (in Matlab's case, repeatedly using random starting points to find the global solution). This makes kmeans output clusters that are of good-quality-on-average. But: given the pseudo-random nature of the algorithm, you will get slightly different clusters each time -- this is normal behavior.
This is called the initialization problem, as k-means starts from random initial points to cluster your data. MATLAB selects k random points, calculates the distance from the points in your data to these locations, and finds new centroids to further minimize the distance. So you might get different results for the centroid locations, but the answers are similar.
I am trying to differentiate two populations. Each population is an NxM matrix in which N is fixed between the two and M is variable in length (N=column specific attributes of each run, M=run number). I have looked at PCA and K-means for differentiating the two, but I was curious of the best practice.
To my knowledge, in k-means there is no initial 'calibration' in which the clusters are chosen such that known bimodal populations can be differentiated. It simply minimizes the distances and assigns the data to a given number of clusters. I would like to tell the clustering algorithm that I want the best fit in which the two populations are separated. I could then use the fit from the initial clustering on future datasets. Any help, example code, or reading material would be appreciated.
-R
K-means and PCA are typically used in unsupervised learning problems, i.e. problems where you have a single batch of data and want to find some easier way to describe it. In principle, you could run K-means (with K=2) on your data, and then evaluate the degree to which your two classes of data match up with the data clusters found by this algorithm (note: you may want multiple starts).
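A sketch of that check (assuming scikit-learn; the data and labels are made up): cluster with K=2, then measure how well the found clusters line up with the two known populations:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

X = np.random.rand(200, 5)                # toy runs-by-attributes matrix
y = np.random.randint(0, 2, size=200)     # the two known population labels

# n_init > 1 gives the multiple starts mentioned above.
labels = KMeans(n_clusters=2, n_init=20).fit_predict(X)
print(adjusted_rand_score(y, labels))     # 1.0 = perfect agreement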
It sounds like you have a supervised learning problem: you have a training data set which has already been partitioned into two classes. In this case k-nearest neighbors (as mentioned by @amas) is probably the approach most similar to k-means; however, Support Vector Machines can also be an attractive approach.
I frequently refer to The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition (Springer Series in Statistics) by Trevor Hastie (Author), Robert Tibshirani (Author), Jerome Friedman (Author).
It really depends on the data. But just so you know, k-means does get stuck at local minima, so if you want to use it, try running it from different random starting points. PCA might also be useful; however, like any other spectral method, you have much less control over the clustering procedure. I recommend that you cluster the data using k-means with multiple random starting points and see how it works; then you can predict and learn, for each of the new samples, with K-NN (I don't know if it is useful for your case).
Check Lazy learners and K-NN for prediction.