How do I classify data points without label information? - matlab

Based on the previous question from here
I have another question. classification accuracy?

I'll address your last point first to get it out of the way. If you don't know what the classification labels were to begin with, then there's no way to assess classification accuracy. How do you know whether the correct label was assigned to the point in C or D if you don't know what label it was to begin with? In that case, we're going to have to leave that alone.
However, what you could do is calculate the percentage of what values get classified as A or B in the matrices C and D to get a sense of the distribution of samples in them both. Specifically, if for example in matrix C, the majority of samples get classified to belong to the group defined by matrix A, then that is probably a good indication that a C is very much like A in distribution.
In any case, one thing I can suggest for you to classify which points in C or D belong to either A or B is to use the k-nearest neighbours algorithm. Concretely, you have a bunch of source data points, namely those that belong in matrices A and B, where A and B have their own labels. In your case, samples in A are assigned a label of 1 and samples in B are assigned a label of -1. To determine where an unknown point belongs to for a group, you can simply find the distance between this point in feature space with all values in A and B. Whichever point in A or B that is the closest with the unknown point, then whatever group that point belonged to in your source points, that's the group you would apply to this unknown point.
As such, simply concatenate C and D into a single N x 1000 matrix, apply k-nearest neighbour to another concatenated matrix with A and B and figure out which point it's the closest to in this other concatenated matrix. Then, read off what the label was and that'll give you what the label of the unknown point can possibly be.
In MATLAB, use the knnsearch function that's part of the Statistics Toolbox. However, I encourage you to take a look at my previous post on explaining the k-nearest neighbour algorithm here: Finding K-nearest neighbors and its implementation
In any case, here's how you'd apply what I said above with your problem statement, assuming A, B, C and D are already defined:
labels = [ones(size(A,1),1); -ones(size(B,1),1)]; %// Create labels for A and B
%// Create source and query points
sourcePoints = [A; B];
queryPoints = [C; D];
%// Perform knnsearch
IDX = knnsearch(sourcePoints, queryPoints);
%// Extract out the groups per point
groups = labels(IDX);
groups will contain the labels associated with each of the points provided by queryPoints. knnsearch returns the row location of the source point in sourcePoints that best matched with a query point. As such, each value of the output tells you which point in the source point matrix best matched with that particular query point. Ultimately, this returns the location we need in the labels array to figure out what the actual labels are.
Therefore, if you want to see what labels were assigned to the points in C, you can do:
labelsC = groups(1:size(C,1));
labelsD = groups(size(C,1)+1:end);
Therefore, in labelsC and labelsD, they contain the labels assigned for each of the unknown points in both matrices. Any values that are 1 meant that the particular points resembled those from matrix A. Similarly, any values that are -1 meant that the particular points resembled those from matrix B.
If you want to plot all of this together, just combine what you did in the previous question with your new data from this question:
%// Code as before
[coeffA, scoreA] = pca(A);
[coeffB, scoreB] = pca(B);
numDimensions = 2;
scoreAred = scoreA(:,1:numDimensions);
scoreBred = scoreB(:,1:numDimensions);
%// New - Perform dimensionality reduction on C and D
[coeffC, scoreC] = pca(C);
[coeffD, scoreD] = pca(D);
scoreCred = scoreC(:,1:numDimensions);
scoreDred = scoreD(:,1:numDimensions);
%// Plot the data
plot(scoreAred(:,1), scoreAred(:,2), 'rx', scoreBred(:,1), scoreBred(:,2), 'bo');
hold on;
plot(scoreCred(labelsC == 1,1), scoreCred(labelsC == 1,2), 'gx', ...
scoreCred(labelsC == -1,1), scoreCred(labelsC == -1,2), 'mo');
plot(scoreDred(labelsD == 1,1), scoreDred(labelsD == 1,2), 'kx', ...
scoreDred(labelsD == -1,1), scoreDred(labelsD == -1,2), 'co');
The above is the case for two dimensions. We plot both A and B with their dimensions reduced to 2. Similarly, we apply PCA to C and D, then plot everything together. The first line plots A and B normally. Next, we have to use hold on; so we can invoke plot multiple times and append results to the same figure. We have to call plot four times to account for four different combinations:
Matrix C having labels from A
Matrix C having labels from B
Matrix D having labels from A
Matrix D having labels from B
Each case I have placed a different colour, but used the same marker to denote which class each point belongs to: x for group A and o for group B.
I'll leave it to you to extend this to three dimensions.

Related

Vectorizing a parallel FOR loop across multiple dimensions MATLAB

Please correct me if there are somethings unclear in this question. I have two matrices pop, and ben of 3 dimensions. Call these dimensions as c,t,w . I want to repeat the exact same process I describe below for all of the c dimensions, without using a for loop as that is slow. For the discussion below, fix a value of the dimension c, to explain my thinking, later I will give a MWE. So when c is fixed I have a 2D matrix with dimension t,w.
Now I repeat the entire process (coming below!) for all of the w dimension.
If the value of u is zero, then I find the next non zero entry in this same t dimension. I save both this entry as well as the corresponding t index. If the value of u is non zero, I simply store this value and the corresponding t index. Call the index as i - note i would be of dimension (c,t,w). The last entry of every u(c,:,w) is guaranteed to be non zero.
Example if the u(c,:,w) vector is [ 3 0 4 2 0 1], then the corresponding i values are [1,3,3,4,6,6].
Now I take these entries and define a new 3d array of dimension (c,t,w) as follows. I take my B array and do the following what is not a correct syntax but to explain you: B(c,t,w)/u(c,i(c,t,w),w). Meaning I take the B values and divide it by the u values corresponding to the non zero indices of u from i that I computed.
For the above example, the denominator would be [3,4,4,2,1,1]. I hope that makes sense!!
QUESTION:
To do this, as this process simply repeats for all c, I can do a very fast vectorizable calculation for a single c. But for multiple c I do not know how to avoid the for loop. I don't knw how to do vectorizable calculations across dimensions.
Here is what I did, where c_size is the dimension of c.
for c=c_size:-1:1
uu=squeeze(pop(c,:,:)) ; % EXTRACT A 2D MATRIX FROM pop.
BB=squeeze(B(c,:,:)) ; % EXTRACT A 2D MATRIX FROM B
ii = nan(size(uu)); % Start with all nan values
[dum_row, ~] = find(uu); % Get row indices of non-zero values
ii(uu ~= 0) = dum_row; % Place row indices in locations of non-zero values
ii = cummin(ii, 1, 'reverse'); % Column-wise cumulative minimum, starting from bottomi
dum_i = ii+(time_size+1).*repmat(0:(scenario_size-1), time_size+1, 1); % Create linear index
ben(c,:,:) = BB(dum_i)./uu(dum_i);
i(c,:,:) = ii ;
clear dum_i dum_row uu BB ii
end
The central question is to avoid this for loop.
Related questions:
Vectorizable FIND function with if statement MATLAB
Efficiently finding non zero numbers from a large matrix
Vectorizable FIND function with if statement MATLAB

How to understand the Matlab build in function "kmeans"?

Suppose I have a matrix A, the size of which is 2000*1000 double. Then I apply
Matlab build in function "kmeans"to the matrix A.
k = 8;
[idx,C] = kmeans(A, k, 'Distance', 'cosine');
I get C = 8*1000 double; idx = 2000*1 double, with values from 1 to 8;
According to the documentation, C returns the k cluster centroid locations in the k-by-p (8 by 1000) matrix. And idx returns an n-by-1 vector containing cluster indices of each observation.
My question is:
1) I do not know how to understand the C, the centroid locations. Locations should be represented as (x,y), right? How to understand the matrix C correctly?
2) What are the final centers c1, c2,...,ck? Are they just values or locations?
3) For each cluster, if I only want to get the vector closest to the center of this cluster, how to calculate and get it?
Thanks!
Before I answer the three parts, I'll just explain the syntax that is used in MATLAB's explanation of k-means (http://www.mathworks.com/help/stats/kmeans.html).
A is your data matrix (it's represented as X in the link). There are n rows (in this case, 2000), which represent the number of observations/data points that you have. There are also p columns (in this case, 1000), which represent the number of "features" that each data points has. For example, if your data consisted of 2D points, then p would equal 2.
k is the number of clusters that you want to group the data into. Based on the dimensions of C that you gave, k must be 8.
Now I will answer the three parts:
The C matrix has dimensions k x p. Each row represents a centroid. Centroid locations DO NOT have to be (x, y) at all. The dimensions of the centroid locations are equal to p. In other words, if you have 2D points, you could graph the centroids as (x, y). If you have 3D points, you could graph the centroids as (x, y, z). Since each data point in A has 1000 features, your centroids therefore have 1000 dimensions.
This is sort of difficult to explain without knowing what your data is exactly. Centroids are certainly not just values, and they may not necessarily be locations. If your data A were coordinate points, you could certainly represent the centroids as locations. However, we can view it more generally. If you had a cluster centroid i and the data points v that are grouped with that centroid, the centroid would represent the data point that is most similar to those in its cluster. Hopefully, that makes sense, and I can give a clearer explanation if necessary.
The k-means method actually gives us a good way to accomplish this. The function actually has 4 possible outputs, but I will focus on the 4th, which I will call D:
[idx,C,sumd,D] = kmeans(A, k, 'Distance', 'cosine');
D has dimensions n x k. For a data point i, the row i in the D matrix gives the distance from that point to every centroid. Therefore, for each centroid, you simply need to find the data point closest to this, and return that corresponding data point. I can supply the short code for this if you need it.
Also, just a tip. You should probably use kmeans++ method of initializing the centroids. It's faster and generally better. You can call it using this:
[idx,C,sumd,D] = kmeans(A, k, 'Distance', 'cosine', 'Start', 'plus');
Edit:
Here is the code necessary for part 3:
[~, min_idxs] = min(D, [], 1);
closest_vecs = A(min_idxs, :);
Each row i of closest_vecs is the vector that is closest to centroid i.
OK, before we actually get into the details, let's give a brief overview on what K-means clustering is first.
k-means clustering works such that for some data that you have, you want to group them into k groups. You initially choose k random points in your data, and these will have labels from 1,2,...,k. These are what we call the centroids. Then, you determine how close the rest of the data are to each of these points. You then group those points so that whichever points are closest to any of these k points, you assign those points to belong to that particular group (1,2,...,k). After, for all of the points for each group, you update the centroids, which actually is defined as the representative point for each group. For each group, you compute the average of all of the points in each of the k groups. These become the new centroids for the next iteration. In the next iteration, you determine how close each point in your data is to each of the centroids. You keep iterating and repeating this behaviour until the centroids don't move anymore, or they move very little.
How you use the kmeans function in MATLAB is that assuming you have a data matrix (A in your case), it is arranged such that each row is a sample and each column is a feature / dimension of a sample. For example, we could have N x 2 or N x 3 arrays of Cartesian coordinates, either in 2D or 3D. In colour images, we could have N x 3 arrays where each column is a colour component in an image - red, green or blue.
How you invoke kmeans in MATLAB is the following way:
[IDX, C] = kmeans(X, K);
X is the data matrix we talked about, K is the total number of clusters / groups you would like to see and the outputs IDX and C are respectively an index and centroid matrix. IDX is a N x 1 array where N is the total number of samples that you have put into the function. Each value in IDX tells you which centroid the sample / row in X best matched with a particular centroid. You can also override the distance measure used to measure the distance between points. By default, this is the Euclidean distance, but you used the cosine distance in your invocation.
C has K rows where each row is a centroid. Therefore, for the case of Cartesian coordinates, this would be a K x 2 or K x 3 array. Therefore, you would interpret IDX as telling which group / centroid that the point is closest to when computing k-means. As such, if we got a value of IDX=1 for a point, this means that the point best matched with the first centroid, which is the first row of C. Similarly, if we got a value of IDX=1 for a point, this means that the point best matched with the third centroid, which is the third row of C.
Now to answer your questions:
We just talked about C and IDX so this should be clear.
The final centres are stored in C. Each row gives you a centroid / centre that is representative of a group.
It sounds like you want to find the closest point to each cluster in the data, besides the actual centroid itself. That's easy to do if you use knnsearch which performs K-Nearest Neighbour search by giving a set of points and it outputs the K closest points within your data that are close to a query point. As such, you supply the clusters as the input and your data as the output, then use K=2 and skip the first point. The first point will have a distance of 0 as this will be equal to the centroid itself and the second point will give you the closest point that is closest to the cluster.
You can do that by the following, assuming you already ran kmeans:
out = knnsearch(A, C, 'k', 2);
out = out(:,2);
You run knnsearch, then toss out the closest point as it would essentially have a distance of 0. The second column is what you're after, which gives you the closest point to the cluster excluding the actual centroid. out will give you which points in your data matrix A that was closest to each centroid. To get the actual points, do this:
pts = A(out,:);
Hope this helps!

k-means clustering using function 'kmeans' in MATLAB

I have this matrix:
x = [2+2*i 2-2*i -2+2*i -2-2*i];
I want to simulate transmitting it and adding noise to it. I represented the components of the complex number as below:
A = randn(150, 2) + 2*ones(150, 2); C = randn(150, 2) - 2*ones(150, 2);
At the receiver, I received the below vector, where the components are ordered based on what I sent originally, i.e., the components of x).
X = [A A A C C A C C];
Now I want to apply the kmeans(X) to have four clusters, so kmeans(X, 4). I am experiencing the following problems:
I am not sure if I can represent the complex numbers as shown in X above.
I can't plot the result of the kmeans to show the clusters.
I could not understand the clusters centroid results.
How can I find the best error rate, if this example was to represent a communication system and at the receiver, k-means clustering was used in order to decide what the transmitted signal was?
If you don't "understand" the cluster centroid results, then you don't understand how k-means works. I'll present a small summary here.
How k-means works is that for some data that you have, you want to group them into k groups. You initially choose k random points in your data, and these will have labels from 1,2,...,k. These are what we call the centroids. Then, you determine how close the rest of the data are to each of these points. You then group those points so that whichever points are closest to any of these k points, you assign those points to belong to that particular group (1,2,...,k). After, for all of the points for each group, you update the centroids, which actually is defined as the representative point for each group. For each group, you compute the average of all of the points in each of the k groups. These become the new centroids for the next iteration. In the next iteration, you determine how close each point in your data is to each of the centroids. You keep iterating and repeating this behaviour until the centroids don't move anymore, or they move very little.
Now, let's answer your questions one-by-one.
1. Complex number representation
k-means in MATLAB doesn't define how complex data is handled. A common way for people to deal with complex numbered data is to split up the real and imaginary parts into separate dimensions as you have done. This is a perfectly valid way to use k-means for complex valued data.
See this post on the MathWorks MATLAB forum for more details: https://www.mathworks.com/matlabcentral/newsreader/view_thread/78306
2. Plot the results
You aren't constructing your matrix X properly. Note that A and C are both 150 x 2 matrices. You need to structure X such that each row is a point, and each column is a variable. Therefore, you need to concatenate your A and C row-wise. Therefore:
X = [A; A; A; C; C; A; C; C];
Note that you have duplicate points. This is actually no different than doing X = [A; C]; as far as kmeans is concerned. Perhaps you should generate X, then add the noise in rather than taking A and C, adding noise, then constructing your signal.
Now, if you want to plot the results as well as the centroids, what you need to do is use the two output version of kmeans like so:
[idx, centroids] = kmeans(X, 4);
idx will contain the cluster number that each point in X belongs to, and centroids will be a 4 x 2 matrix where each row tells you the mean of each cluster found in the data. If you want to plot the data, as well as the clusters, you simply need to do following. I'm going to loop over each cluster membership and plot the results on a figure. I'm also going to colour in where the mean of each cluster is located:
x = X(:,1);
y = X(:,2);
figure;
hold on;
colors = 'rgbk';
for num = 1 : 4
plot(x(idx == num), y(idx == num), [colors(num) '.']);
end
plot(centroids(:,1), centroids(:,2), 'c.', 'MarkerSize', 14);
grid;
The above code goes through each cluster, plots them in a different colour, then plots the centroids in cyan with a slightly larger thickness so you can see what the graph looks like.
This is what I get:
3. Understanding centroid results
This is probably because you didn't construct X properly. This is what I get for my centroids:
centroids =
-1.9176 -2.0759
1.5980 2.8071
2.7486 1.6147
0.8202 0.8025
This is pretty self-explanatory and I talked about how this is structured earlier.
4. Best representation of the signal
What you can do is repeat the clustering a number of times, then the algorithm will decide what the best clustering was out of these times. You would simply use the Replicates flag and denote how many times you want this run. Obviously, the more times you run this, the better your results may be. Therefore, do something like:
[idx, centroids] = kmeans(X, 4, 'Replicates', 5);
This will run kmeans 5 times and give you the best centroids of these 5 times.
Now, if you want to determine what the best sequence that was transmitted, you'd have to split up your X into 150 rows each (as your random sequence was 150 elements), then run a separate kmeans on each subset. You can try to find the best representation of each part of the sequence by using the Replicates flag each time.... so you can do something like:
for num = 1 : 8
%// Look at 150 points at a time
[idx, centroids] = kmeans(X((num-1)*150 + 1 : num*150, :), 4, 'Replicates', 5);
%// Do your analysis
%//...
%//...
end
idx and centroids would be the results for each portion of your transmitted signal. You probably want to look at centroids at each iteration to determine what symbol was transmitted at a particular time.
If you want to plot the decision regions, then you're probably looking for a Voronoi diagram. All you do is given a set of points that are defined within the domain of your problem, you just have to determine which cluster each point belongs to. Given that our data spans between -5 <= (x,y) <= 5, let's go through each point in the grid and determine which cluster each point belongs to. We'd then colour the appropriate point according to which cluster it belongs to.
Something like:
colors = 'rgbk';
[X,Y] = meshgrid(-5:0.05:5, -5:0.05:5);
X = X(:);
Y = Y(:);
figure;
hold on;
for idx = 1 : numel(X)
[~,ind] = min(sum(bsxfun(#minus, [X(idx) Y(idx)], centroids).^2, 2));
plot(X(idx), Y(idx), [colors(ind), '.']);
end
plot(centroids(:,1), centroids(:,2), 'c.', 'MarkerSize', 14);
The above code will plot the decision regions / Voronoi diagram of the particular configuration, as well as where the cluster centres are located. Note that the code is rather unoptimized and it'll take a while for the graph to generate, but I wanted to write something quick to illustrate my point.
Here's what the decision regions look like:
Hope this helps! Good luck!

Remove duplicates in correlations in matlab

Please see the following issue:
P=rand(4,4);
for i=1:size(P,2)
for j=1:size(P,2)
[r,p]=corr(P(:,i),P(:,j))
end
end
Clearly, the loop will cause the number of correlations to be doubled (i.e., corr(P(:,1),P(:,4)) and corr(P(:,4),P(:,1)). Does anyone have a suggestion on how to avoid this? Perhaps not using a loop?
Thanks!
I have four suggestions for you, depending on what exactly you are doing to compute your matrices. I'm assuming the example you gave is a simplified version of what needs to be done.
First Method - Adjusting the inner loop index
One thing you can do is change your j loop index so that it only goes from 1 up to i. This way, you get a lower triangular matrix and just concentrate on the values within the lower triangular half of your matrix. The upper half would essentially be all set to zero. In other words:
for i = 1 : size(P,2)
for j = 1 : i
%// Your code here
end
end
Second Method - Leave it unchanged, but then use unique
You can go ahead and use the same matrix like you did before with the full two for loops, but you can then filter the duplicates by using unique. In other words, you can do this:
[Y,indices] = unique(P);
Y will give you a list of unique values within the matrix P and indices will give you the locations of where these occurred within P. Note that these are column major indices, and so if you wanted to find the row and column locations of where these locations occur, you can do:
[rows,cols] = ind2sub(size(P), indices);
Third Method - Use pdist and squareform
Since you're looking for a solution that requires no loops, take a look at the pdist function. Given a M x N matrix, pdist will find distances between each pair of rows in a matrix. squareform will then transform these distances into a matrix like what you have seen above. In other words, do this:
dists = pdist(P.', 'correlation');
distMatrix = squareform(dists);
Fourth Method - Use the corr method straight out of the box
You can just use corr in the following way:
[rho, pvals] = corr(P);
corr in this case will produce a m x m matrix that contains the correlation coefficient between each pair of columns an n x m matrix stored in P.
Hopefully one of these will work!
this works ?
for i=1:size(P,2)
for j=1:i
Since you are just correlating each column with the other, then why not just use (straight from the documentation)
[Rho,Pval] = corr(P);
I don't have the Statistics Toolbox, but according to http://www.mathworks.com/help/stats/corr.html,
corr(X) returns a p-by-p matrix containing the pairwise linear correlation coefficient between each pair of columns in the n-by-p matrix X.

Error in Centroid calculation of kmeans in matlab

I got a strange output of kMeans implemented in matlab.
All my entries in my input matrix F of dimension d x n are between 0 and 1. When i run the kmeans algorithm using the following matlab command which creates 50 cluster.
[IDX, B] = kmeans(F,50,'MaxIter',1000,'EmptyAction','singleton')
Here IDX is the label returned and B is the centroid of cluster created. Since all your data points are in [0,1]^d , you expect the calculated centroid is also in [0,1]^d , where d is the dimension of the point.
However, the resultant centroid which i got from kmeans after several different initialization contains negative value.
Can anyone let me know the reason for it?
I can't really answer your question without the actual data matrix "F". However, I note that, if size(F) == [d, n] then the code
[IDX, B] = kmeans(F,50,'MaxIter',1000,'EmptyAction','singleton')
will treat F as a set of d points, each of n-variables. So all d points belong to [0,1]^n.
Also
Do you really need the optional arguments? What happens if you remove them?
What happens if you reduce the the number of data points in the input matrix F?
what happens if you reduce the the number of clusters, to say 10, instead of 50?