Using K-means clustering with predefined seeds in MATLAB - matlab

I need an example showing how to use K-means clustering in MATLAB, but using some prespecified data points as the initial seeds.
Thanks

IDX = kmeans(X,k,'start',seeds)
will run K-means with the predefined data points in seeds as the initial centroids. These can be k rows of X, but you can choose any seeds as long as seeds is a k-by-p array, where p is the number of columns of X. Note that if you specify seeds, you don't need to specify k (pass [] instead): kmeans will infer the number of clusters from the number of rows of seeds.
By default, kmeans chooses k randomly picked rows of X as seeds.
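A minimal sketch of the call (the data and seed values here are made up for illustration):

```matlab
% Toy data: 100 points in 2-D, two well-separated blobs
X = [randn(50, 2); randn(50, 2) + 5];

% Hand-picked seeds: a 2-by-2 array, so k = 2 is inferred
seeds = [0 0; 5 5];

% Pass [] for k; kmeans takes the number of clusters from size(seeds, 1)
idx = kmeans(X, [], 'start', seeds);
```

Here idx is an n-by-1 vector giving the cluster index of each row of X.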

Related

Evaluation of K-means clustering (accuracy)

I created a 2-dimensional random dataset (a set of points plus a column of labels) for centroid-based k-means clustering in MATLAB, where each point is represented by its X and Y coordinates and each label gives the point's cluster; see the example in the figure below.
I applied the K-means clustering algorithm on these point datasets. I need help with the following:
What function can I use to evaluate the accuracy of the K-means algorithm? In more detail: my aim is to score the k-means result by how many cluster assignments it gets right, compared with the labels generated alongside the data. For example, I want to verify whether the point (7.200592168, 11.73878455) is assigned to the same cluster as the point (6.951107307, 11.27498898), etc.
If I understand your question correctly, you are looking for the adjusted Rand index. This scores the similarity between your MATLAB labels and your k-means labels.
Alternatively you can create a confusion matrix to visualise the mapping between your two labelsets.
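A sketch of both ideas (the label vectors are made up; confusionmat requires the Statistics and Machine Learning Toolbox, and the adjusted Rand index is not built into MATLAB, so it is computed here from the confusion matrix using the standard Hubert-Arabie formula):

```matlab
% Hypothetical labels: trueLabels from the generated data,
% kmeansLabels as returned by kmeans
trueLabels   = [1 1 1 2 2 2 3 3 3]';
kmeansLabels = [2 2 2 3 3 1 1 1 1]';

% Confusion matrix: rows = true clusters, columns = k-means clusters
C = confusionmat(trueLabels, kmeansLabels);

% Adjusted Rand index from the confusion matrix
n    = sum(C(:));
sumC = sum(sum(C .* (C - 1))) / 2;    % pairs together in both labelings
a    = sum(C, 2);  b = sum(C, 1);
sumA = sum(a .* (a - 1)) / 2;         % pairs together in the true labels
sumB = sum(b .* (b - 1)) / 2;         % pairs together in the k-means labels
expected = sumA * sumB / (n * (n - 1) / 2);
maxIdx   = (sumA + sumB) / 2;
ARI = (sumC - expected) / (maxIdx - expected);
```

An ARI of 1 means the two labelings agree perfectly (up to renaming of clusters); values near 0 mean agreement no better than chance.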
I would use the squared error: you are trying to minimize the total squared distance between each point and the mean coordinate of its cluster.
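This quantity falls out of kmeans directly; a minimal sketch (the data and k are illustrative):

```matlab
% Toy 2-D data
X = [randn(50, 2); randn(50, 2) + 4];
k = 2;

% Third output of kmeans: within-cluster sums of point-to-centroid
% distances (squared Euclidean by default), one entry per cluster
[idx, C, sumd] = kmeans(X, k);
totalSquaredError = sum(sumd);   % lower means a tighter clustering
```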

Randomly rearranging data points when creating cross-validation indices?

I have a dataset where the columns correspond to features (predictors) and the rows correspond to data points. The data points are extracted in a structured way, i.e. they are sorted. I will use either crossvalind or cvpartition from MATLAB for stratified cross-validation.
If I use the above function, do I still have to first randomly rearrange the data points (rows)?
These functions shuffle your data internally, as you can see in the docs
Indices = crossvalind('Kfold', N, K) returns randomly generated indices for a K-fold cross-validation of N observations. Indices contains equal (or approximately equal) proportions of the integers 1 through K that define a partition of the N observations into K disjoint subsets. Repeated calls return different randomly generated partitions. K defaults to 5 when omitted. In K-fold cross-validation, K-1 folds are used for training and the last fold is used for evaluation. This process is repeated K times, leaving one different fold for evaluation each time.
However, if your data is structured in the sense that the i-th object carries some information about object i+1, then you should consider a different kind of splitting. For example, if your data is actually a (locally) time series, typical random CV is not a valid estimation technique. Why? Because if your data contains clusters where knowing at least one element gives you a high probability of estimating the remaining ones, then what you obtain from CV is precisely an estimate of the ability to do so: to predict inside those clusters. If, in real use, your model is expected to encounter a completely new cluster, the model you selected may perform no better than random there. In other words, if your data has some kind of internal cluster (or time-series) structure, your splits should respect that structure by splitting over clusters: instead of K random splits of points, use K random splits of clusters, and so on.
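Cluster-wise splitting can be sketched as follows (the group vector is hypothetical, and the block labels are assumed to be positive integers):

```matlab
% 'group' assigns each observation to a block/cluster (illustrative)
group = [1 1 1 2 2 3 3 3 4 4]';
K     = 2;                                 % number of folds

blocks = unique(group);
perm   = blocks(randperm(numel(blocks)));          % shuffle whole blocks
foldOfBlock(perm) = mod(0:numel(blocks)-1, K) + 1; % deal blocks round-robin
fold   = foldOfBlock(group);                       % fold index per observation

for f = 1:K
    trainIdx = fold ~= f;   % train on all blocks outside fold f
    testIdx  = fold == f;   % evaluate on the held-out blocks
    % fit and score your model on these row subsets
end
```

Every block lands entirely in one fold, so no information leaks between training and evaluation through within-block structure.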

How to apply K-mean algorithm to multidimensional array?

I have a matrix A = (a1, a2, a3, ..., an)', where a1, a2, ..., an are row vectors. I want to apply the k-means algorithm to A in order to group the row vectors ai (i = 1, 2, ..., n) into k clusters. Suppose b1, b2, ..., bk are the centers of the k clusters; k samples are randomly selected as the initial centers. All samples (a1, a2, ..., an) are assigned to the k clusters according to their cosine distance to the centers bi (i = 1, 2, ..., k). The centers are then recalculated and all samples reassigned, repeating until the centers no longer change, which gives the final centers b1, b2, ..., bk. Finally, for each cluster, only the vector closest to the cluster center is retained. How can I implement this?
The kmeans function (in the Statistics and Machine Learning Toolbox) performs exactly this. Simply use:
idx = kmeans(A, k, 'Distance', 'cosine')
to get the cluster index of each row of A.
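The last step of the question, keeping only the vector closest to each center, can be sketched using the fourth output of kmeans, which holds the distance from every point to every centroid (the data here is made up):

```matlab
A = randn(20, 5);   % toy data: 20 row vectors
k = 3;

% idx: cluster index per row; C: k cluster centers;
% D: n-by-k distances from every row of A to every center (cosine)
[idx, C, ~, D] = kmeans(A, k, 'Distance', 'cosine');

% For each cluster, keep only the row of A closest to its center
keep = zeros(k, 1);
for j = 1:k
    members    = find(idx == j);
    [~, best]  = min(D(members, j));
    keep(j)    = members(best);
end
representatives = A(keep, :);   % one retained vector per cluster
```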

Best Way to Randomly Initialize Clusters in MATLAB

Say you have k clusters, and you have an array with n rows and 3 columns. Each row is a datapoint. What is the best (i.e., vectorized) way to randomly assign each row to a cluster?
Bonus points: commenting the code.
You could make an n-length vector with integers 1 to k:
k = 4;                                 % number of clusters
n = size(examples, 1);                 % number of data points (rows)
cluster_assignments = randi(k, 1, n);  % uniform random cluster per point
and use the indexing to match up this n-length vector of cluster membership to the n examples you are working with.
I can give you 2 options:
Random Initialization.
K-Means++.
They are implemented in my Stack Overflow Q22342015 GitHub Repository.
The code includes K-Means implementation which accepts arbitrary Distance Function as in - K-Means Algorithm with Arbitrary Distance Function MATLAB (Chebyshev Distance).
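K-Means++ seeding can be sketched as follows; this is a simplified squared-Euclidean version written for illustration, not the repository's code (pdist2 requires the Statistics and Machine Learning Toolbox):

```matlab
% Simplified K-Means++ seeding: X is n-by-p data, k is the cluster count
X = randn(200, 3);
k = 4;

n = size(X, 1);
centers = zeros(k, size(X, 2));
centers(1, :) = X(randi(n), :);          % first seed: uniform random point
for j = 2:k
    % Squared distance from each point to its nearest chosen seed
    d2 = min(pdist2(X, centers(1:j-1, :)).^2, [], 2);
    % Sample the next seed with probability proportional to d2
    p  = d2 / sum(d2);
    centers(j, :) = X(find(rand <= cumsum(p), 1), :);
end
```

The resulting centers array can then be passed to kmeans via the 'Start' option. Seeding far-apart points this way typically converges faster and avoids poor local minima compared with purely random initialization.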

Matlab - Stratified Sampling of Multidimensional Data

I want to divide a corpus into training & testing sets in a stratified fashion.
The observation data points are arranged in a Matrix A as
A=[16,3,0;12,6,4;19,2,1;.........;17,0,2;13,3,2]
Each column of the matrix represent a distinct feature.
In MATLAB, the cvpartition(A,'holdout',p) function requires A to be a vector. How can I perform the same action with A as a matrix, i.e. so that the resulting sets have roughly the same distribution of each feature as the original corpus?
By using a matrix A rather than grouped data, you are making the assumption that a random partition of your data will return a test and train set with the same column distributions.
In general, the assumption you are making is that there is a partition of A such that each of the marginal distributions of A (one per column) is preserved across the split. There is no guarantee that this is true. Check whether the columns of your matrix are correlated. If they are not, simply partition on the first column and use the row indices to define a test matrix:
cv = cvpartition(A(:, 1), 'holdout', p);
test_mat = A(cv.test, :);
If they are correlated, you may need to go back and reconsider what you are trying to do.
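A quick sanity check of the resulting split can be sketched like this (A and p are illustrative; the column-mean comparison could be replaced by a two-sample test such as kstest2 per column):

```matlab
% Toy corpus: 100 observations, 3 count-valued features
A = randi(20, 100, 3);
p = 0.3;

% Stratify on the first column, then slice the matrix by row indices
cv        = cvpartition(A(:, 1), 'holdout', p);
train_mat = A(cv.training, :);
test_mat  = A(cv.test, :);

% Compare per-column means of the two sets; they should be close
disp([mean(train_mat); mean(test_mat)]);
```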