How can I generate clusters with the k-means algorithm without giving the k value?
I want to do k-means clustering and generate the clusters automatically.
You may try mean shift clustering; it behaves like k-means clustering but does not have a k parameter.
The basic idea is as follows: clustering is like increasing the "high frequencies" in your dataset, or "sharpening" your dataset, in order to find the "modes" (the "modes" correspond to the significant "trends" in your dataset).
The inverse operation, i.e. smoothing the dataset, is easier to define (in short, replace each sample with the mean of its neighbors). From this definition, you can extract the "high frequency" component of the signal as the difference between the initial signal and the smoothed one. This gives you a "gradient direction", or a "good move", that will sharpen the signal. At the end of the process, all the samples are clustered into a small number of points, corresponding to the "modes".
Reference:
https://en.wikipedia.org/wiki/Mean_shift
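For a concrete starting point, here is a minimal sketch using scikit-learn's MeanShift; the toy data and the quantile value are placeholders to adapt to your dataset:

import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth

# Toy data standing in for your dataset: three blobs in 2-D.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2))
               for c in ([0, 0], [3, 3], [0, 4])])

# The bandwidth plays the role k plays in k-means: it controls how
# aggressively neighboring samples are merged, but the number of
# clusters itself falls out of the data.
bandwidth = estimate_bandwidth(X, quantile=0.2)

ms = MeanShift(bandwidth=bandwidth)
labels = ms.fit_predict(X)

print("clusters found:", len(np.unique(labels)))
print("cluster centers (the 'modes'):")
print(ms.cluster_centers_)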
There is X-means (a k-means variation); it is implemented in Weka. For more info, see the documentation:
http://www.cs.cmu.edu/~dpelleg/download/xmeans.pdf
http://weka.sourceforge.net/doc.packages/XMeans/weka/clusterers/XMeans.html
http://www.cs.cmu.edu/~dpelleg/kmeans.html
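Outside Weka, a hedged sketch of the same idea in Python, assuming the pyclustering package (pip install pyclustering) and its xmeans class:

from pyclustering.cluster.xmeans import xmeans
from pyclustering.cluster.center_initializer import kmeans_plusplus_initializer

# Toy 2-D points; replace with your own list of samples.
data = [[0.1, 0.2], [0.2, 0.1], [3.0, 3.1], [3.1, 2.9], [0.0, 4.1], [0.2, 3.9]]

# X-means starts from a few seeds and grows up to kmax clusters,
# splitting a center whenever the split improves a BIC-style score.
initial_centers = kmeans_plusplus_initializer(data, 2).initialize()
model = xmeans(data, initial_centers, kmax=10)
model.process()

print("clusters:", model.get_clusters())  # lists of point indices
print("centers:", model.get_centers())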
Related
How can I apply the k-means algorithm with initial cluster positions determined by a PSO algorithm?
Just do it.
K-means allows you to specify the initial centroids.
Without any information on the nature of the data you're dealing with (number of dimensions, datatypes, outliers, overlap, etc.), it is impossible to give specific answers.
I don't know of any genuine k-means implementation where you can pass in a list of centroids that the algorithm uses to initialize the k-means centroids. Usually these are selected randomly. (Can't you write your own implementation of k-means that does this initialization? Simply take an open-source implementation and add an argument.)
However, in the Python sklearn implementation of k-means, there is a k-means++ implementation where you can pass in the initial centers as an array:
init : {‘k-means++’, ‘random’ or an ndarray}
Method for initialization, defaults to ‘k-means++’:
‘k-means++’ : selects initial cluster centers for k-mean clustering
in a smart way to speed up convergence.
...
If an ndarray is passed, it should be of shape
(n_clusters, n_features) and gives the initial centers.
Haven't used it, though.
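Still, for illustration, a minimal sketch of passing externally computed centers (the values below are placeholders standing in for the PSO output) to sklearn's KMeans:

import numpy as np
from sklearn.cluster import KMeans

# Placeholder values standing in for the centers your PSO step produced.
pso_centers = np.array([[0.0, 0.0],
                        [3.0, 3.0],
                        [0.0, 4.0]])

# Toy data scattered around those centers.
rng = np.random.default_rng(0)
X = pso_centers[rng.integers(0, 3, size=200)] + rng.normal(scale=0.3, size=(200, 2))

# init accepts an ndarray of shape (n_clusters, n_features); n_init=1
# because the starting point is fixed, so random restarts are pointless.
km = KMeans(n_clusters=len(pso_centers), init=pso_centers, n_init=1)
labels = km.fit_predict(X)
print(km.cluster_centers_)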
And I wrote this before I remembered/looked up kmeans++:
This is a poor-man's approach:
You can run kmeans with a k parameter equal to the length of the list/array that the PSO algorithm (whatever it did) has given you.
Then kmeans will quickly find its own centroids. Do this several times, maybe with different distance measures (Euclidean, Manhattan, shortest, longest, average...) and different seeds for your random-number generator. Each time, afterwards, compare the coordinates of the k-means centroids with the coordinates of the PSO centroids.
When there is a near 1:1 correspondence (depending on your requirements), you've found a match. Then do something with your list of k-means classification results.
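A sketch of that comparison step, pairing the two centroid sets optimally with SciPy's Hungarian-algorithm solver; the tolerance is an assumption to tune to your requirements:

import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def match_centroids(kmeans_centers, pso_centers, tol=0.5):
    # Pairwise Euclidean distances, then an optimal 1:1 pairing.
    cost = cdist(kmeans_centers, pso_centers)
    rows, cols = linear_sum_assignment(cost)
    # A pair counts as a "match" when the centroids nearly coincide.
    return [(r, c, cost[r, c] < tol) for r, c in zip(rows, cols)]

# Example: two centroid sets that almost coincide, in a different order.
km = np.array([[0.0, 0.0], [3.1, 2.9]])
pso = np.array([[3.0, 3.0], [0.1, -0.1]])
print(match_centroids(km, pso))  # both pairs should report True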
I created a code book based on the k-means clustering algorithm, but the algorithm doesn't converge to an optimal code book: the cluster centroids vary on each run (because of the random selection of initial seeds). There is an option in MATLAB to give an initial matrix to k-means, but how can we select the initial code book from a large data set? Is there any other way to get a unique code book using k-means?
It's somewhat standard to run k-means multiple times using different initial states (e.g., initial seeds) and choose the result with the lowest error as the best result.
It's also typical to seed k-means by randomly choosing k elements from your data set as the initial seeds.
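A minimal sketch of both points with scikit-learn, restarting from random data points and keeping the lowest-error run (KMeans(n_init=20) does the same internally):

import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(300, 5))  # toy data
k = 4

# init="random" seeds each run with k elements drawn from the data;
# keep the run with the lowest within-cluster sum of squares (inertia_).
runs = [KMeans(n_clusters=k, init="random", n_init=1, random_state=s).fit(X)
        for s in range(20)]
best = min(runs, key=lambda m: m.inertia_)
print("best inertia:", best.inertia_)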
Since by default MATLAB's k-means uses the k-means++ algorithm for initialization, it relies on random numbers.
Hence sequential calls to k-means will probably produce different results.
You have 3 options to make this deterministic:
Set MATLAB's random number generator to a fixed state before calling k-means.
Use the stream option in the k-means options to set the stream inside k-means.
Write your own version of k-means that initializes deterministically.
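Those options are MATLAB-specific, but for comparison, the analogous fix in Python's scikit-learn is a single random_state argument; a minimal sketch:

import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(100, 3))

# Pinning the random state fixes the k-means++ initialization, so
# repeated calls return identical centroids - the same effect as
# seeding MATLAB's random number generator before calling kmeans.
a = KMeans(n_clusters=3, random_state=42, n_init=10).fit(X).cluster_centers_
b = KMeans(n_clusters=3, random_state=42, n_init=10).fit(X).cluster_centers_
assert np.allclose(a, b)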
In Lingpipe's EM tutorial they said that it is possible to run the algorithm with no supervised data:
It is possible to train a classifier in a completely unsupervised fashion by having the initial classifier assign categories at random. Only the number of categories must be fixed. The algorithm is exactly the same, and the result after convergence or the maximum number of epochs is a classifier.
But their class, TradNaiveBayesClassifier, requires both a labeled and an unlabeled corpus to run. How can I modify it to run with no labeled data?
EM is a probabilistic maximum-likelihood optimization algorithm. In general, it is applied to unsupervised (clustering) models such as PLSA and Gaussian mixture models.
I think the LingPipe doc is saying that you can use a random initialization of all the data labels (a distribution over labels for each data point), feed that into NB to compute the ELBO (evidence lower bound), and then maximize it to get updated parameters.
In short, you will need to use the NB model to write up the M step, i.e. updating the model parameters.
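This is not LingPipe's Java API, but a minimal sketch of the underlying math in Python: random soft labels, then alternating M and E steps for a Bernoulli naive Bayes (the name nb_em is made up for the example):

import numpy as np

def nb_em(X, n_classes, n_iter=50, seed=0, eps=1e-9):
    # Completely unsupervised EM for a Bernoulli naive Bayes:
    # only the number of categories is fixed, as in the LingPipe quote.
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Random initial soft labels: one category distribution per document.
    resp = rng.dirichlet(np.ones(n_classes), size=n)
    for _ in range(n_iter):
        # M step: re-estimate class priors and feature probabilities
        # from the current soft labels (this is the part to write).
        prior = (resp.sum(axis=0) + eps) / (n + n_classes * eps)
        theta = (resp.T @ X + eps) / (resp.sum(axis=0)[:, None] + 2 * eps)
        # E step: posterior over categories under the current model.
        log_lik = (X @ np.log(theta).T
                   + (1 - X) @ np.log(1 - theta).T
                   + np.log(prior))
        log_lik -= log_lik.max(axis=1, keepdims=True)  # numerical stability
        resp = np.exp(log_lik)
        resp /= resp.sum(axis=1, keepdims=True)
    return resp.argmax(axis=1)

# Tiny demo: two obvious groups of binary feature vectors.
X = np.array([[1, 1, 0, 0]] * 5 + [[0, 0, 1, 1]] * 5, dtype=float)
print(nb_em(X, n_classes=2))  # one label for the first five, another for the rest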
I am running kmeans in MATLAB on a 400x1000 matrix, and for some reason I get different results whenever I run the algorithm. Below is a code example:
[idx, ~, ~, ~] = kmeans(factor_matrix, 10, 'dist','sqeuclidean','replicates',20);
For some reason, each time I run this code I get different results. Any ideas?
I am using it to identify multicollinearity issues.
Thanks for the help!
The k-means implementation in MATLAB has a randomized component: the selection of initial centers. This causes different outcomes. Practically however, MATLAB runs k-means a number of times and returns you the clustering with the lowest distortion. If you're seeing wildly different clusterings each time, it may mean that your data is not amenable to the kind of clusters (spherical) that k-means looks for, and is an indication toward trying other clustering algorithms (e.g. spectral ones).
You can get deterministic behavior by passing it an initial set of centers as one of the function arguments (the start parameter). This will give you the same output clustering each time. There are several heuristics to choose the initial set of centers (e.g. K-means++).
As you can read on the wiki, k-means algorithms are generally heuristic and partially probabilistic, the one in Matlab being no exception.
This means that there is a certain random part to the algorithm (in Matlab's case, repeatedly using random starting points to find the global solution). This makes kmeans output clusters that are of good-quality-on-average. But: given the pseudo-random nature of the algorithm, you will get slightly different clusters each time -- this is normal behavior.
This is called the initialization problem, as kmeans starts with random initial points to cluster your data. MATLAB selects k random points, calculates the distances from the points in your data to these locations, and finds new centroids to further minimize the distance. So you might get different centroid locations on each run, but the answers are similar.
How can I do K-means clustering of time series data?
I understand how this works when the input data is a set of points, but I don't know how to cluster a time series of size 1xM, where M is the data length. In particular, I'm not sure how to update the mean of the cluster for time series data.
I have a set of labelled time series, and I want to use the K-means algorithm to check whether I will get back similar labels or not. My X matrix will be N x M, where N is the number of time series and M is the data length as mentioned above.
Does anyone know how to do this? For example, how could I modify this k-means MATLAB code so that it would work for time series data? Also, I would like to be able to use different distance metrics besides Euclidean distance.
To better illustrate my doubts, here is the code I modified for time series data:
% n is the number of time series (rows of X); defined here so the
% snippet is self-contained
n = size(X,1);
% Check if the second input is a matrix of centroids
if ~isscalar(k)
    c = k;
    k = size(c,1);
else
    c = X(ceil(rand(k,1)*n),:);       % assign centroids randomly at start
end
% Allocate variables
g0 = ones(n,1);
gIdx = zeros(n,1);
D = zeros(n,k);
% Main loop: converged when the previous partition equals the current one
while any(g0 ~= gIdx)
    g0 = gIdx;
    % Distance from every series to every centroid
    for t = 1:k
        % Loop over each series
        for s = 1:n
            D(s,t) = sqrt(sum((X(s,:) - c(t,:)).^2));  % Euclidean distance
        end
    end
    % Partition data to the closest centroids
    [~, gIdx] = min(D, [], 2);
    % Update centroids using the means of the partitions
    for t = 1:k
        % Is this how we calculate the new mean of the time series?
        c(t,:) = mean(X(gIdx==t,:));
    end
end
Time series are usually high-dimensional, and you need specialized distance functions to compare them for similarity. Plus, there might be outliers.
k-means is designed for low-dimensional spaces with a (meaningful) Euclidean distance. It is not very robust to outliers, as it puts squared weight on them.
Doesn't sound like a good idea to me to use k-means on time series data. Try looking into more modern, robust clustering algorithms. Many will allow you to use arbitrary distance functions, including time series distances such as DTW.
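For reference, DTW itself is a short dynamic program; a plain sketch (real projects should prefer a tuned library such as dtaidistance or tslearn):

import numpy as np

def dtw(a, b):
    # Classic DTW between two 1-D series: D[i, j] is the cheapest
    # alignment cost of a[:i] and b[:j]. No window constraint here.
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j],      # insertion
                                 D[i, j - 1],      # deletion
                                 D[i - 1, j - 1])  # match
    return D[n, m]

# Two series with the same shape, shifted in time: DTW stays small
# where a plain Euclidean comparison would blow up.
t = np.linspace(0, 2 * np.pi, 100)
print(dtw(np.sin(t), np.sin(t + 0.5)))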
It's probably too late for an answer, but:
k-means can be used to cluster longitudinal data
Anony-Mousse is right, DTW distance is the way to go for time series
The methods above use R. You'll find more methods by searching, e.g., for "Iterative Incremental Clustering of Time Series".
I have recently come across the kml R package which claims to implement k-means clustering for longitudinal data. I have not tried it out myself.
Also the Time-series clustering - A decade review paper by S. Aghabozorgi, A. S. Shirkhorshidi and T. Ying Wah might be useful to you to seek out alternatives. Another nice paper although somewhat dated is Clustering of time series data-a survey by T. Warren Liao.
If you really did want to use clustering, then depending on your application you could generate a low-dimensional feature vector for each time series. For example, use the time series mean, standard deviation, dominant frequency from a Fourier transform, etc. This would be suitable for use with k-means, but whether it gives you useful results depends on your specific application and the content of your time series.
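A minimal sketch of that feature-vector route (which features are useful is application dependent; these three are just the ones suggested above):

import numpy as np
from sklearn.cluster import KMeans

def series_features(x):
    # Collapse one series to (mean, std, dominant FFT frequency bin).
    spectrum = np.abs(np.fft.rfft(x - x.mean()))
    return np.array([x.mean(), x.std(), float(np.argmax(spectrum))])

# Toy set: two slow and two fast sine waves with a little noise.
rng = np.random.default_rng(0)
t = np.linspace(0, 1, 500)
series = [np.sin(2 * np.pi * f * t) + rng.normal(scale=0.1, size=t.size)
          for f in (3, 3, 20, 20)]

features = np.array([series_features(s) for s in series])  # N x 3 matrix
labels = KMeans(n_clusters=2, n_init=10).fit_predict(features)
print(labels)  # the 3 Hz pair should share one label, the 20 Hz pair the other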
I don't think k-means is the right way for it either. As @Anony-Mousse suggested, you can utilize DTW. In fact, I had the same problem in one of my projects and I wrote my own class for that in Python. The logic is:
Create all cluster combinations: k is the cluster count and n is the number of series, so there are n! / k! / (n-k)! combinations. These act as the potential centers.
For each series, calculate the distance to each center in each candidate group and assign it to the nearest one.
For each candidate group, calculate the total distance within the individual clusters.
Choose the minimum.
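A sketch of that exhaustive procedure (only feasible for small n; dist can be a DTW function like the one sketched in the earlier answer, or any other series metric):

import itertools
import numpy as np

def best_partition(series, k, dist):
    # Try every set of k series as candidate centers - that is
    # n! / k! / (n-k)! combinations - assign each series to its
    # nearest center, and keep the cheapest grouping.
    n = len(series)
    best = (np.inf, None, None)
    for centers in itertools.combinations(range(n), k):
        d = np.array([[dist(series[i], series[c]) for c in centers]
                      for i in range(n)])
        cost = d.min(axis=1).sum()      # total within-cluster distance
        if cost < best[0]:
            best = (cost, centers, d.argmin(axis=1))
    return best  # (total distance, center indices, cluster labels)

# Usage with a plain Euclidean metric; swap in DTW for real series.
data = [np.array(s, float) for s in ([1, 1, 1], [1, 2, 1], [9, 9, 9], [9, 8, 9])]
print(best_partition(data, 2, dist=lambda a, b: float(np.linalg.norm(a - b))))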
And, the Python implementation is here if you're interested.