I am using the k-means clustering function in MATLAB. I need to run it with K=2 repeatedly until all of the resulting clusters satisfy a defined condition.
If my data was
all_data=5*[3*rand(1000,1)+5,3*rand(1000,1)+5];
and I run
[IDX C]=kmeans(all_data,2);
I need to check the sizes of all_data(IDX==1,:) and all_data(IDX==2,:) after each iteration, and keep re-clustering with k=2 until the resulting clusters reach a specific size.
How can this be implemented in an iterative loop in MATLAB?
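One way to structure this is a loop that keeps splitting any cluster that is still too large with k=2. This is only a sketch: the threshold `max_size` and the cell-array bookkeeping are assumptions, and you should substitute your own stopping condition.

```matlab
% Sketch: repeatedly split with k=2 until every cluster is small enough.
% max_size is a placeholder threshold -- replace with your own condition.
max_size = 300;
clusters = {all_data};          % start with everything in one cluster
done = false;
while ~done
    done = true;
    new_clusters = {};
    for c = 1:numel(clusters)
        data_c = clusters{c};
        if size(data_c,1) > max_size
            % split this cluster in two and re-check on the next pass
            idx = kmeans(data_c, 2);
            new_clusters{end+1} = data_c(idx==1,:); %#ok<AGROW>
            new_clusters{end+1} = data_c(idx==2,:); %#ok<AGROW>
            done = false;
        else
            new_clusters{end+1} = data_c; %#ok<AGROW>
        end
    end
    clusters = new_clusters;
end
```

The outer while loop ends only when a full pass makes no further splits, so every cell of `clusters` then satisfies the size condition.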
I have a dataset where the columns correspond to features (predictors) and the rows correspond to data points. The data points are extracted in a structured way, i.e. they are sorted. I will use either crossvalind or cvpartition from MATLAB for stratified cross-validation.
If I use the above function, do I still have to first randomly rearrange the data points (rows)?
These functions shuffle your data internally, as you can see in the docs:
Indices = crossvalind('Kfold', N, K) returns randomly generated indices for a K-fold cross-validation of N observations. Indices contains equal (or approximately equal) proportions of the integers 1 through K that define a partition of the N observations into K disjoint subsets. Repeated calls return different randomly generated partitions. K defaults to 5 when omitted. In K-fold cross-validation, K-1 folds are used for training and the last fold is used for evaluation. This process is repeated K times, leaving one different fold for evaluation each time.
However, if your data is structured in the sense that the i-th object carries information about object i+1, then you should consider a different kind of splitting. For example, if your data is actually a (locally) time series, a typical random CV split is not a valid estimation technique. Why? If your data contains clusters where knowing the value of even one element gives you a high probability of estimating the remaining ones, then what you obtain from CV is an estimate of the ability to do exactly that: predict inside those clusters. If, during real-life use of your model, you expect to encounter completely new clusters, the model you selected may perform no better than chance on them. In other words, if your data has some kind of internal cluster (or time-series) structure, your splits should respect that structure by splitting over clusters: instead of K random splits over points, use K random splits over clusters.
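A cluster-wise split can be built on top of crossvalind by partitioning the group labels instead of the rows. This is a sketch; `all_data` and the per-row label vector `group_id` are assumed names for your matrix and your known group structure.

```matlab
% Sketch of cluster-wise (grouped) cross-validation.
% Assumes group_id is a vector giving each row's cluster/group label.
groups = unique(group_id);
K = 5;
% partition the *groups*, not the individual rows
fold_of_group = crossvalind('Kfold', numel(groups), K);
for k = 1:K
    test_groups = groups(fold_of_group == k);
    test_mask   = ismember(group_id, test_groups);
    train_data  = all_data(~test_mask, :);
    test_data   = all_data(test_mask, :);
    % ... train on train_data, evaluate on test_data ...
end
```

Because each fold holds out whole groups, the evaluation measures generalization to unseen clusters rather than interpolation within known ones.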
I have a very large amount of data in the form of a matrix. I have already clustered it using k-means clustering in MATLAB R2013a. I want the exact coordinates of the centroid of each cluster formed. Is it possible using any formula or anything else?
I want to find the centroid of each cluster so that whenever new data arrives in the matrix, I can compute its distance from each centroid to determine which cluster the new data belongs to.
My data is heterogeneous in nature, so it is difficult to find the average of the data in each cluster. So I am trying to write some code that prints the centroid locations automatically.
In MATLAB, use
[idx,C] = kmeans(..)
instead of
idx = kmeans(..)
As per the documentation:
[idx,C] = kmeans(..) returns the k cluster centroid locations in the k-by-p matrix C.
The centroid is simply evaluated as the average value of all the points' coordinates that are assigned to that cluster.
If you have the assignments {point; cluster} you can easily evaluate the centroid: say a given cluster has n points a1, a2, ..., an assigned to it. You can evaluate the centroid of that cluster as:
centroid=(a1+a2+...+an)/n
Obviously you can run this process in a loop, depending on how your data structure (i.e. the assignment point/centroid) is organized.
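That loop can be sketched as follows, along with the follow-up step of assigning a new observation to its nearest centroid. This assumes `all_data` (n-by-p) and an assignment vector `idx` from a previous kmeans call; `new_point` is a placeholder for your incoming observation.

```matlab
% Sketch: recover centroids from assignments, then classify a new point.
k = max(idx);
C = zeros(k, size(all_data,2));
for j = 1:k
    C(j,:) = mean(all_data(idx==j, :), 1);   % centroid = mean of members
end

new_point = all_data(1,:);                   % placeholder new observation
% squared Euclidean distance from new_point to each centroid
d = sum((C - repmat(new_point, k, 1)).^2, 2);
[~, nearest] = min(d);                       % index of the closest cluster
```

repmat is used instead of implicit expansion because R2013a predates implicit expansion (introduced in R2016b).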
I have a parfor loop for parallel computing in MATLAB. I want different random numbers on every call inside these parfor loops, running on 8 workers. If I don't use rng('shuffle'), I get the same random numbers from randperm(10). But with it, my code runs rng('shuffle') before randperm at (almost) the same time on all workers. Do I get different random numbers in that case? When I look at the randperm outputs in the parfor loop, some of them are the same!
Do I need to save the RNG state before rng('shuffle') and restore it with something like rng(saved_rng) after the parallel loop ends?
The MATLAB help says:
Note: Because rng('shuffle') seeds the random number generator based on the current time, you should not use this command to set the random number stream on different workers if you want to assure independent streams. This is especially true when the command is sent to multiple workers simultaneously, such as inside a parfor, spmd, or a communicating job. For independent streams on the workers, use the default behavior; or if that is not sufficient for your needs, consider using a unique substream on each worker.
So what should I do? Do I get different random numbers if I delete rng? I have two versions of this code: one computes with parfor, the other with a plain for loop. Can I remove the shuffle from the for-loop version? Do I get different random numbers then?
Thanks.
P.S.
I can use these structures:
parfor I=1:X
    xx = randperm(10);
end

or

parfor I=1:X
    rng('shuffle');
    xx = randperm(10);
end

or

rng('shuffle');
parfor I=1:X
    xx = randperm(10);
end
I want different random numbers from randperm on every iteration. How can I do that? For the for structure I need the shuffle (without it the random numbers are the same), but when I add it to parfor, some of the randperm outputs are the same!
To do this properly, you need to choose an RNG algorithm that supports parallel substreams (in other words, you can split up the random stream into substreams, and each of the substreams still has the right statistical properties that you want from a random stream).
The default RNG algorithm (Mersenne Twister, or mt19937ar) does not support parallel substreams, but MATLAB supports two algorithms that do (the multiplicative lagged Fibonacci generator mlfg6331_64 and the combined multiple recursive generator mrg32k3a).
For example:
s = RandStream.create('mrg32k3a','NumStreams',4,'Seed','shuffle','CellOutput',true)
s is now a cell array of random number substreams. All have the same seed, and you can record s{1}.Seed for reproducibility if you want.
Now, you can call rand(s{1}) (or randn(s{1})) to generate random numbers from stream 1, and so on. Reset a stream to its initial configuration with reset(s{1}), and you should find that each stream is separately reproducible.
Each worker can then generate random numbers in a way that is still statistically sound, and reproducible even in parallel:
parfor i = 1:4
    rand(s{i})
end
For more information, look in the documentation for Statistics Toolbox under Speed up Statistical Computations. There are a few articles in there that take you through all the complicated details. If you don't have Statistics Toolbox, the documentation is available online on the MathWorks website.
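Putting this together for the randperm case in the question: randperm also accepts a stream as its first argument, so each iteration can draw from its own substream. A sketch (the loop bound `X = 8` stands in for your 8 workers):

```matlab
% Sketch: one independent stream per iteration, fed to randperm.
% mrg32k3a supports multiple statistically independent streams.
X = 8;
s = RandStream.create('mrg32k3a', 'NumStreams', X, ...
                      'Seed', 'shuffle', 'CellOutput', true);
perms = zeros(X, 10);
parfor i = 1:X
    % each iteration draws from its own stream, so the permutations
    % differ across iterations, yet the whole run is reproducible
    % from the recorded seed s{1}.Seed
    perms(i,:) = randperm(s{i}, 10);
end
```

No rng('shuffle') is needed inside the loop at all; the independence comes from the stream construction, not from time-based reseeding.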
I'm using the Statistics Toolbox function kmeans in MATLAB for the first time. I want to get the total Euclidean distance to the nearest centroid as an indicator of the optimal k.
Here is my code:
clear all
N = 10;
opts = statset('MaxIter', 1000);
X = dlmread('data.txt');
crit = zeros(1, N);
for j = 1:N
    [a,b,c] = kmeans(X, j, 'Start','cluster', 'EmptyAction','drop', 'Options',opts);
    clear a b
    crit(j) = sum(c);
end
save(['crit_',VF,'_',num2str(i),'_limswvl1.mat'],'crit')
Everything should go well, except that I get this error for j = 6:
X must have more rows than the number of clusters.
I do not understand the problem, since X has 54 rows and no NaNs.
I tried different EmptyAction options but it still won't work.
Any idea? :)
The problem occurs since you use the cluster method to get initial centroids. From MATLAB documentation:
'cluster' - Perform preliminary clustering phase on random 10%
subsample of X. This preliminary phase is itself
initialized using 'sample'.
So when j=6, it tries to divide 10% of the data into 6 clusters, but 10% of 54 rows is only ~5 rows, which is fewer than the 6 clusters requested. Hence the error X must have more rows than the number of clusters.
To get around this, either choose the starting points randomly from X ('Start','sample') or uniformly at random from the range of X ('Start','uniform'), so that initialization is no longer limited to the 10% subsample.
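For example, the loop from the question with only the 'Start' option changed (a sketch; X is assumed to be your 54-row data matrix):

```matlab
% Sketch: same elbow-criterion loop, but with 'Start','sample' so that
% initial centroids are drawn from all of X, not a 10% subsample.
opts = statset('MaxIter', 1000);
N = 10;
crit = zeros(1, N);
for j = 1:N
    [~, ~, sumd] = kmeans(X, j, 'Start','sample', ...
                          'EmptyAction','drop', 'Options',opts);
    crit(j) = sum(sumd);   % total within-cluster distance for this k
end
```

With 'sample', even k = 6 draws its 6 initial centroids from all 54 rows, so the row-count check passes.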
I am using the kmeans2 algorithm from scipy to cluster pixel colors in an image to get the top average colors in the image.
http://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.vq.kmeans2.html#scipy.cluster.vq.kmeans2
I am confused about the meaning of this parameter:
iter : int
Number of iterations of the k-means algorithm to run. Note that this differs in meaning from the iters parameter to the kmeans function.
If I want the kmeans algorithm to run until the clusters don't change, would I set the iter value high? Is there a way to find a best iter value?
The k-means algorithm works by initializing some K points and clustering your data by their distance from those points. It then iterates: it computes the centroid of each cluster and redefines the clusters by distance from those centroids. This isn't guaranteed to converge quickly, though it often does, which is why the function asks for an iteration count.
Edit: "maximum iteration value" is not quite right, I think; it will literally run iter iterations. The default of 10 is a common choice, though.
More iterations generally give the algorithm more chance to converge, but with diminishing returns once the assignments stop changing. You could try running k-means on a sample of your data with various iter values and see at what point the extra compute time stops buying a meaningful improvement in cluster quality.
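If you want to stop when the clusters no longer change rather than after a fixed count, one option is to implement the loop yourself with an explicit convergence check. This is a sketch in plain NumPy (independent of scipy's kmeans2; the function name and the `max_iter` safety cap are my own):

```python
import numpy as np

def kmeans_until_stable(data, k, max_iter=100, seed=0):
    """Lloyd's algorithm that stops when assignments stop changing."""
    rng = np.random.default_rng(seed)
    # initialize centroids as k distinct random data points
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    labels = np.full(len(data), -1)
    for _ in range(max_iter):
        # assign each point to its nearest centroid
        dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break                      # converged: no assignment changed
        labels = new_labels
        # recompute each centroid as the mean of its members
        for j in range(k):
            members = data[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return centroids, labels

# two well-separated blobs -> k=2 should recover them
pts = np.vstack([np.zeros((20, 2)), np.ones((20, 2)) * 10.0])
centroids, labels = kmeans_until_stable(pts, 2)
```

With a structure like this, iter stops being a tuning knob: the loop exits as soon as the labeling is stable, and `max_iter` is just a safety cap.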