Best Way to Randomly Initialize Clusters in MATLAB

Say you have k clusters and an array with n rows and 3 columns, where each row is a data point. What is the best (i.e., vectorized) way to randomly assign each row to a cluster?
Bonus points: commenting the code.

You could make an n-length vector of random integers from 1 to k:
k = 4;                                % number of clusters
n = size(examples, 1);                % number of rows (data points); length() would return the longest dimension instead
cluster_assignments = randi(k, 1, n); % one random cluster label per row
and then use indexing to match this n-length vector of cluster memberships up with the n examples you are working with.
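For example, to pull the rows of one cluster back out, logical indexing on that vector is enough (a small sketch using the variables above):
cluster2_rows = examples(cluster_assignments == 2, :);    % all rows assigned to cluster 2
cluster_means = zeros(k, 3);                               % per-cluster mean, one row per cluster
for c = 1:k
    cluster_means(c, :) = mean(examples(cluster_assignments == c, :), 1);
end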

I can give you 2 options:
Random Initialization.
K-Means++.
They are implemented in my Stack Overflow Q22342015 GitHub Repository.
The code also includes a K-Means implementation that accepts an arbitrary distance function, as in K-Means Algorithm with Arbitrary Distance Function MATLAB (Chebyshev Distance).
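For reference, K-Means++ seeding can be sketched in a few lines (an illustrative sketch, not the repository's code; mData is assumed to be an n-by-d data matrix):
function vCenters = KMeansPPInit(mData, k)
% K-Means++ seeding: the first center is drawn uniformly at random; each
% subsequent center is drawn with probability proportional to its squared
% distance from the nearest center chosen so far.
n = size(mData, 1);
vCenters = zeros(k, size(mData, 2));
vCenters(1, :) = mData(randi(n), :);
minSqDist = sum(bsxfun(@minus, mData, vCenters(1, :)).^2, 2);
for ii = 2:k
    prob = minSqDist / sum(minSqDist);                % sampling weights
    idx  = find(rand() <= cumsum(prob), 1, 'first');  % weighted draw of a data point
    vCenters(ii, :) = mData(idx, :);
    minSqDist = min(minSqDist, sum(bsxfun(@minus, mData, vCenters(ii, :)).^2, 2));
end
end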

Related

How can I classify my data for K-Means Clustering

A proof of concept prototype I have to do for my final year project is to implement K-Means Clustering on a big data set and display the results on a graph. I only know object-oriented languages like Java and C# and decided to give MATLAB a try. I notice that with a functional language the approach to solving problems is very different, so I would like some insight on a few things if possible.
Suppose I have the following data set:
raw_data
400.39 513.29 499.99 466.62 396.67
234.78 231.92 215.82 203.93 290.43
15.07 14.08 12.27 13.21 13.15
334.02 328.79 272.2 306.99 347.79
49.88 52.2 66.35 47.69 47.86
732.88 744.62 687.53 699.63 694.98
And I picked row 2 and 4 to be the 2 centroids:
centroids
234.78 231.92 215.82 203.93 290.43 % Centroid 1
334.02 328.79 272.2 306.99 347.79 % Centroid 2
I now want to compute the Euclidean distance of each point to each centroid, then assign each point to its closest centroid and display this on a graph. Let's say I want to classify the points of centroid 1 as blue and those of centroid 2 as green. How can I do this in MATLAB? If this were Java I would initialise each row as an object and add it to separate ArrayLists (representing the clusters).
If rows 1, 2 and 3 all belong to the first centroid / cluster, and rows 4, 5 and 6 belong to the second centroid / cluster - how can I classify these to display them as blue or green points on a graph? I am new to MATLAB and really curious about this. Thanks for any help.
(To begin with, MATLAB has a flexible distance function, pdist2, and a built-in kmeans implementation, but I'm assuming that you want to build your code from scratch.)
In MATLAB, you try to implement everything as matrix algebra, without loops over elements.
In your case, if R is the raw_data matrix and C is the centroids matrix,
you can shift the dimension that represents centroid number to the 3rd place by
permC = permute(C, [3 2 1]);
The bsxfun function then lets you subtract permC from R while expanding R's third dimension as necessary: D = bsxfun(@minus, R, permC). An element-wise square followed by summation across columns, SqD = sum(D.^2, 2), gives you the squared distance of each observation from each centroid. Performing all these operations within a single statement, and shifting the third (centroid) dimension back to the 2nd place, looks like this:
SqD = permute(sum(bsxfun(@minus, R, permute(C, [3 2 1])).^2, 2), [1 3 2])
Picking the centroid of minimal distance is now straightforward: [minDist,minCentroid]=min(SqD,[],2)
If this looks complex, I recommend inspecting the product of each sub-step and reading the help of each command.
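Putting it together for the display part of the question (a minimal sketch; plotting the first two columns of the 5-D data is purely an illustrative choice):
R = [400.39 513.29 499.99 466.62 396.67;
     234.78 231.92 215.82 203.93 290.43;
      15.07  14.08  12.27  13.21  13.15;
     334.02 328.79 272.20 306.99 347.79;
      49.88  52.20  66.35  47.69  47.86;
     732.88 744.62 687.53 699.63 694.98];  % raw data from the question
C = R([2 4], :);                            % centroids: rows 2 and 4
% Squared distance of every row of R to every centroid (n-by-k), as derived above
SqD = permute(sum(bsxfun(@minus, R, permute(C, [3 2 1])).^2, 2), [1 3 2]);
[~, minCentroid] = min(SqD, [], 2);         % index of the nearest centroid per row
% Colour points by cluster membership: blue for cluster 1, green for cluster 2
figure; hold on;
plot(R(minCentroid == 1, 1), R(minCentroid == 1, 2), 'b.', 'MarkerSize', 15);
plot(R(minCentroid == 2, 1), R(minCentroid == 2, 2), 'g.', 'MarkerSize', 15);
plot(C(:, 1), C(:, 2), 'kx', 'MarkerSize', 12, 'LineWidth', 2);  % mark the centroids
hold off;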

Kullback Leibler Divergence of 2 Histograms in MatLab

I would like a function to calculate the KL distance between two histograms in MatLab. I tried this code:
http://www.mathworks.com/matlabcentral/fileexchange/13089-kldiv
However, it says that I should have two distributions P and Q of sizes n x nbins. However, I am having trouble understanding how the author of the package wants me to arrange the histograms. I thought that providing the discretized values of the random variable together with the number of bins would suffice (I would assume the algorithm would use an arbitrary support to evaluate the expectations).
Any help is appreciated.
Thanks.
The function you link to requires that the two histograms passed in be aligned and of the same size, NBIN x N (not N x NBIN); that is, if N > 1 then the number of rows in the inputs should equal the number of bins in the histograms. If you are just comparing two histograms (that is, if N = 1) it doesn't really matter: you can pass either row- or column-vector versions, as long as you are consistent and the order of bins matches.
A generic call to the function looks like this:
dists = kldiv(bins,P,Q)
The implementation allows comparison of multiple histograms to each other (that is, N>1), in which case pairs of columns (with matching column index) in each array are compared and the result is a row vector with distances for each matching pair.
Array bins should be the same size as P and Q and is used to perform a very minimal check that the inputs are of the same size, but is not used in the computation. The routine expects bins to contain the numeric labels of your bins so that it can check for repeated bin labels and warn you if repeats occur, but otherwise doesn't use the information.
You could do away with bins and compute the distance with
KL = sum(P .* (log2(P)-log2(Q)));
without using the MATLAB Central versions. However, the version you link to performs the minimal checks mentioned above and, in addition, allows computation of two alternative distances (consult the documentation).
The version linked to by eigenchris checks that no histogram bins are empty (which would make the computation blow up numerically) and, if there are, removes their contribution to the sum (I am not sure this is entirely appropriate; consult an expert on the subject). You should also be aware of the exact form of the formula: note the use of log2 above versus the natural logarithm in the version linked to by eigenchris.
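As a rough illustration of the empty-bin point (a minimal sketch, not the File Exchange code; it assumes P and Q are already normalized to sum to 1):
% Two example histograms over the same bins, each normalized to sum to 1
P = [0.1 0.4 0.0 0.5];
Q = [0.2 0.3 0.1 0.4];
% Keep only bins where both histograms are non-zero so the logarithms stay finite;
% whether dropping those bins is statistically appropriate is a separate question.
idx = (P > 0) & (Q > 0);
KL = sum(P(idx) .* (log2(P(idx)) - log2(Q(idx))));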

K-means Clustering, major understanding issue

Suppose we have 64-dimensional data to cluster; let's say the dataset matrix is dt = 64x150.
Using the kmeans function from the vl_feat library, I will cluster my dataset into 20 centers:
[centers, assignments] = vl_kmeans(dt, 20);
centers is a 64x20 matrix.
assignments is a 1x150 matrix with values inside it.
According to manual: The vector assignments contains the (hard) assignments of the input data to the clusters.
I still cannot understand what those numbers in the matrix assignments mean. I don't get it at all. Would anyone mind helping me out a bit here? An example or something would be great. What do these values represent anyway?
In k-means the problem you are trying to solve is the problem of clustering your 150 points into 20 clusters. Each point is a 64-dimension point and thus represented by a vector of size 64. So in your case dt is the set of points, each column is a 64-dim vector.
After running the algorithm you get centers and assignments. centers are the 20 positions of the cluster's center in a 64-dim space, in case you want to visualize it, measure distances between points and clusters, etc. 'assignments' on the other hand contains the actual assignments of each 64-dim point in dt. So if assignments[7] is 15 it indicates that the 7th vector in dt belongs to the 15th cluster.
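Concretely (a small sketch using the variables above), you can use assignments to look up a single point's cluster or to pull out every point of a given cluster:
cluster_of_point7 = assignments(7);        % cluster index of the 7th point
points_in_c15 = dt(:, assignments == 15);  % all 64-dim points assigned to cluster 15
center_15 = centers(:, 15);                % the corresponding cluster center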
For example, here you can see a clustering of many 2-D points, say 1000 of them, into 3 clusters. In this case dt would be 2x1000, centers would be 2x3, and assignments would be 1x1000 and would hold numbers ranging from 1 to 3 (or 0 to 2, in case you're using OpenCV).
EDIT:
The code to produce this image is located here: http://pypr.sourceforge.net/kmeans.html#k-means-example along with a tutorial on kmeans for pyPR.
In OpenCV, it is the number of the cluster that each of the input points belongs to.

Find the closest weight vector to each instance in the data matrix

Suppose I have a weight matrix W of size n x m, where m is the number of variables and n is the number of instances, and a data matrix X of the same size. I am trying to find the closest weight vector to each instance in X. However, both matrices are very high-dimensional, so plain methods are not efficient enough. I tried a GPU trick in MATLAB, but it did not work well because it was a sequential approach, computing the closest weight for each instance one at a time. I am now looking for efficient one-shot code that takes all of W and X and finds the winner using MATLAB tricks, possibly with some GPU help. Can anyone suggest a code snippet in MATLAB?
This is what I wrote for the sequential approach:
x_in_d = gpuArray(x_in); % take input instance to device
W_d = gpuArray(W); % take weight matrix to device
Dx = W_d - x_in_d(ones(size(W_d,1),1),logical(ones(1,length(x_in_d)))); % replicate the input row across all rows of W and subtract
[d_min,winner] = min(sum((Dx.^2)')); % row-wise squared distances; winner is the index of the closest weight vector
d_min = gather(d_min); %gather results
winner = gather(winner);
What do you mean by so dimensional? It's just an m x n matrix right?
It would be really helpful if you could provide some sample data, based off your description (which isn't the clearest), here is what I think your data looks like.
weights=
[1 4 2
5 3 1]
data=
[2 5 1
1 2 2]
And you want to figure out which row of weights is closest to the row of data? Which in this case would be the first row of weights for both rows of data.
Please edit your question to clarify what you're asking for, and consider including some examples.
EDIT:
I like Rody's duplicate comment; if I am correct, check out: Link Here
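For what it's worth, a one-shot vectorized version of what the question seems to be after could look like this (a sketch assuming each row of W and X is a vector, and that pdist2 from the Statistics Toolbox is available):
W = rand(5000, 100);   % hypothetical weight matrix, n x m
X = rand(5000, 100);   % hypothetical data matrix, same size
% All pairwise squared Euclidean distances between rows of X and rows of W,
% then the closest weight vector for every instance in a single call.
D = pdist2(X, W, 'squaredeuclidean');
[d_min, winner] = min(D, [], 2);   % winner(i) is the row of W closest to X(i,:)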

Is there a statistical difference between generating many random vectors vs a single random matrix

Is there a statistical difference between generating a series of paths for a Monte Carlo simulation using the following two methods (note that by path I mean a vector of 350 points, normally distributed):
A)
for path = 1:300000
    Zn(path, :) = randn(1, 350);
end
or the far more efficient B)
Zn = randn(300000, 350);
I just want to be sure there is no extra correlation or dependence between the rows in method B that isn't present in method A. For example, maybe method B distributes normally over 2 dimensions whereas A does so over 1 dimension; would that make the two statistically different?
If there is a difference then I need to know the same for uniform distributions (i.e. rand instead of randn)
Just to add to the answer of @natan (+1), run the following code:
%# Store the seed
Rng1 = rng;
%# Get a matrix of random numbers
X = rand(3, 3);
%# Restore the seed
rng(Rng1);
%# Get a matrix of random numbers one vector at a time
Y = nan(3, 3);
for n = 1:3
    Y(:, n) = rand(3, 1);
end
%# Test for differences
if any(any(X - Y ~= 0)); disp('Error'); end;
You'll note that there is no difference between X and Y. That is, there is no difference between building a matrix in one step, and building a matrix from a sequence of vectors.
However, there is a difference between my code and yours. Note that I am populating the matrix by columns, not rows, since when rand is used to construct a matrix in one step, it populates by column. By the way, I'm not sure if you realize it, but as a general rule you should always try to perform vector operations on the columns of matrices, not the rows. I explained why in a response to a question on SO the other day; see here for more...
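To see that column-wise population order directly (a small sketch):
s = rng;              % store the seed
A = rand(2, 2);       % matrix built in one step
rng(s);               % restore the seed
v = rand(4, 1);       % the same four numbers drawn as a vector
B = reshape(v, 2, 2); % reshape fills column-by-column
isequal(A, B)         % returns true: rand fills matrices by column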
Regarding the question of independence/dependence, one needs to be careful with the language one uses. The sequence of numbers generated by rand is perfectly dependent (it is deterministic given the seed). For the vast majority of statistical tests they will appear to be independent; nonetheless, in theory one could construct a statistical test that would demonstrate the dependency between a sequence of numbers generated by rand.
Final thought, if you have a copy of Greene's "Econometric Analysis", he gives a neat discussion of random number generation in section 17.2.
As far as base R's random number generator is concerned, there also doesn't appear to be any difference between generating a sequence of random numbers at once and generating them one by one. Thus, the behavior @Colin T Bowers (+1) describes above also holds in R. Below is an R version of Colin's code:
#set seed
set.seed(1234)
# generate a sequence of 10,000 random numbers at once
X<-rnorm(10000)
# reset the seed
set.seed(1234)
# create a vector of 10,000 zeros
Y<-rep(0,times=10000)
# generate a sequence of 10,000 random numbers, one at a time
for (i in 1:10000){
    Y[i] <- rnorm(1)
}
# Test for differences
if(any(X-Y!=0)){print("Error")}