Matlab: Kmeans gives different results each time - matlab

I running kmeans in matlab on a 400x1000 matrix and for some reason whenever I run the algorithm I get different results. Below is a code example:
[idx, ~, ~, ~] = kmeans(factor_matrix, 10, 'dist','sqeuclidean','replicates',20);
For some reason, each time I run this code I get different results? any ideas?
I am using it to identify multicollinearity issues.
Thanks for the help!

The k-means implementation in MATLAB has a randomized component: the selection of initial centers. This causes different outcomes. Practically however, MATLAB runs k-means a number of times and returns you the clustering with the lowest distortion. If you're seeing wildly different clusterings each time, it may mean that your data is not amenable to the kind of clusters (spherical) that k-means looks for, and is an indication toward trying other clustering algorithms (e.g. spectral ones).
You can get deterministic behavior by passing it an initial set of centers as one of the function arguments (the start parameter). This will give you the same output clustering each time. There are several heuristics to choose the initial set of centers (e.g. K-means++).

As you can read on the wiki, k-means algorithms are generally heuristic and partially probabilistic, the one in Matlab being no exception.
This means that there is a certain random part to the algorithm (in Matlab's case, repeatedly using random starting points to find the global solution). This makes kmeans output clusters that are of good-quality-on-average. But: given the pseudo-random nature of the algorithm, you will get slightly different clusters each time -- this is normal behavior.

This is called initialization problem, as kmeans starts with random iniinital points to cluster your data. matlab selects k random points and calculates the distance of points in your data to these locations and finds new centroids to further minimize the distance. so you might get different results for centroid locations, but the answer is similar.

Related

Selecting the K value for Kmeans clustering [duplicate]

This question already has answers here:
Cluster analysis in R: determine the optimal number of clusters
(8 answers)
Closed 3 years ago.
I am going to build a K-means clustering model for outlier detection. For that, I need to identify the best number of clusters needs to be selected.
For now, I have tried to do this using Elbow Method. I plotted the sum of squared error vs. the number of clusters(k) but, I got a graph like below which makes confusion to identify the elbow point.
I need to know, why do I get a graph like this and how do I identify the optimal number of clusters.
K-means is not suitable for outlier detection. This keeps popping up here all the time.
K-means is conceptualized for "pure" data, with no false points. All measurements are supposed to come from the data, and only vary by some Gaussian measurement error. Occasionally this may yield some more extreme values, but even these are real measurements, from the real clusters, and should be explained not removed.
K-means itself is known to not work well on noisy data where data points do not belong to the clusters
It tends to split large real clusters in two, and then points right in the middle of the real cluster will have a large distance to the k-means centers
It tends to put outliers into their own clusters (because that reduces SSQ), and then the actual outliers will have a small distance, even 0.
Rather use an actual outlier detection algorithm such as Local Outlier Factor, kNN, LOOP etc. instead that were conceptualized with noisy data in mind.
Remember that the Elbow Method doesn't just 'give' the best value of k, since the best value of k is up to interpretation.
The theory behind the Elbow Method is that we in tandem both want to minimize some error function (i.e. sum of squared errors) while also picking a low value of k.
The Elbow Method thus suggests that a good value of k would lie in a point on the plot that resembles an elbow. That is the error is small, but doesn't decrease drastically when k increases locally.
In your plot you could argue that both k=3 and k=6 resembles elbows. By picking k=3 you'd have picked a small k, and we see that k=4, and k=5 doesn't do much better in minimizing the error. Same goes with k=6.

Iteration in k means clustering

I am implementing k means clustering in tensorflow and have successfully made the function where we randomly select centroids from the sample points. Then these centroids are to be updated based on distance from sample points.
Is it always guaranteed that the more i iterate the better I get the cluster prediction or there is some point after which the predictions start getting wrong/anomalous??
Usually, K-means solving algorithm behaves as expected, in that it converges to a local minimum always. (I assume you're talking about the Lloyd/Florgy method) This is a statistical method used to find a local minima. It may stall at a saddle point where one of the dimensions is optimized but the others is not.
To abbreviate the rigorousness of the proof, it will always converge, albeit slowly due to many saddle points in your function.
There is no point in which your prediction gets more "wrong". It will be closer to the minima that you wanted, but the minima may not be the global. This may be your source of concern, because random initializations of K-means does not guarrantee this to happen.
One way to alleviate this is to actually run K-means on subgroups of your data, and then take those final points and average them to find a good initializer for your final clustering on the whole dataset.
Hope this helps.

How can I apply KMEANS algorithm with determined cluster position which has specified from PSO algorith?

How can I apply KMEANS algorithm with determined cluster position which has specified from PSO algorithm ??
Just do it.
K-means allows you to specify the initial centroids.
Without any information on the nature of the data you're dealing with (number if dimensions, datatypes, outliers, overlap etc), it is impossible to give specific answers.
I don't know of any genuine k-means implementation where you can pass in a list of centroids that the algorithm uses to initialize the k-means centroids. Usually these are selected randomly. (Can't you write your own implementation of k-means that does this initialization? Simple take an open-source implementaion and add an argument)
However, In the python sklearn implementation of kmeans, there is a kmeans++ implementation, where you can pass in the initial centers as an array.
init : {‘k-means++’, ‘random’ or an ndarray}
Method for initialization, defaults to ‘k-means++’:
‘k-means++’ : selects initial cluster centers for k-mean clustering
in a smart way to speed up convergence.
...
If an ndarray is passed, it should be of shape
(n_clusters, n_features) and gives the initial centers.
Haven't used it, though.
And I wrote this before I remembered/looked up kmeans++:
This is a poor-man's approach:
You can run kmeans with a k parameter equal to the length of the list/array that the PSO algorithm (whatever it did) has given you.
Then kmeans will quickly find its own centroids. Do this several times, maybe with different distance-measures (Euclidean, manhattan, shortest, longest, avg...), and different seeds for your random-number generator. Each time, afterwards, compare the coordinates of the k-means centroids with the coordinates of the PSO centroids.
When there is a near 1:1 correspondence (depending on your requirements), you've found a match. then do something with your list of k-means classfication-results.

MATLAB: K Means clustering With varying centroids

I'm created a code book based on k-means clustering algorithm.But the algorithm didn't converge to an optimal code book, each time, the cluster centroids are varying(because of random selection of initial seeds). There is an option in Matlab to give an initial matrix to K-Means.But how we can can select the initial code book from a large data set? Is there any other way to get a unique code book using K-means?
It's somewhat standard to run k-means multiple times using different initial states (e.g., initial seeds) and choose the result with the lowest error as the best result.
It's also typical to seed k-means by randomly choosing k elements from your data set as the initial seeds.
Since by default MATLAB's K-Means uses the K-MEans++ algorithm for initialization it means it uses random numbers.
Hence each call (For sequential calls) to K-Means will probably produce different results.
You have 3 options to make this deterministic:
Set MATLAB's Random Number Generator state to certain state before calling K-Means.
Use the stream option in K-Means options to set the stream inside K-Means.
Write your own version of K-Means which uses a deterministic way to initialize K-Means.

How can I perform K-means clustering on time series data?

How can I do K-means clustering of time series data?
I understand how this works when the input data is a set of points, but I don't know how to cluster a time series with 1XM, where M is the data length. In particular, I'm not sure how to update the mean of the cluster for time series data.
I have a set of labelled time series, and I want to use the K-means algorithm to check whether I will get back a similar label or not. My X matrix will be N X M, where N is number of time series and M is data length as mentioned above.
Does anyone know how to do this? For example, how could I modify this k-means MATLAB code so that it would work for time series data? Also, I would like to be able to use different distance metrics besides Euclidean distance.
To better illustrate my doubts, here is the code I modified for time series data:
% Check if second input is centroids
if ~isscalar(k)
c=k;
k=size(c,1);
else
c=X(ceil(rand(k,1)*n),:); % assign centroid randomly at start
end
% allocating variables
g0=ones(n,1);
gIdx=zeros(n,1);
D=zeros(n,k);
% Main loop converge if previous partition is the same as current
while any(g0~=gIdx)
% disp(sum(g0~=gIdx))
g0=gIdx;
% Loop for each centroid
for t=1:k
% d=zeros(n,1);
% Loop for each dimension
for s=1:n
D(s,t) = sqrt(sum((X(s,:)-c(t,:)).^2));
end
end
% Partition data to closest centroids
[z,gIdx]=min(D,[],2);
% Update centroids using means of partitions
for t=1:k
% Is this how we calculate new mean of the time series?
c(t,:)=mean(X(gIdx==t,:));
end
end
Time series are usually high-dimensional. And you need specialized distance function to compare them for similarity. Plus, there might be outliers.
k-means is designed for low-dimensional spaces with a (meaningful) euclidean distance. It is not very robust towards outliers, as it puts squared weight on them.
Doesn't sound like a good idea to me to use k-means on time series data. Try looking into more modern, robust clustering algorithms. Many will allow you to use arbitrary distance functions, including time series distances such as DTW.
It's probably too late for an answer, but:
k-means can be used to cluster longitudinal data
Anony-Mousse is right, DWT distance is the way to go for time series
The methods above use R. You'll find more methods by looking, e.g., for "Iterative Incremental Clustering of Time Series".
I have recently come across the kml R package which claims to implement k-means clustering for longitudinal data. I have not tried it out myself.
Also the Time-series clustering - A decade review paper by S. Aghabozorgi, A. S. Shirkhorshidi and T. Ying Wah might be useful to you to seek out alternatives. Another nice paper although somewhat dated is Clustering of time series data-a survey by T. Warren Liao.
If you did really want to use clustering, then dependent on your application you could generate a low dimensional feature vector for each time series. For example, use time series mean, standard deviation, dominant frequency from a Fourier transform etc. This would be suitable for use with k-means, but whether it would give you useful results is dependent on your specific application and the content of your time series.
I don't think k-means is the right way for it either. As #Anony-Mousse suggested you can utilize DTW. In fact, I had the same problem for one of my projects and I wrote my own class for that in Python. The logic is;
Create your all cluster combinations. k is for cluster count and n is for number of series. The number of items returned should be n! / k! / (n-k)!. These would be something like potential centers.
For each series, calculate distances for each center in each cluster groups and assign it to the minimum one.
For each cluster groups, calculate total distance within individual clusters.
Choose the minimum.
And, the Python implementation is here if you're interested.