clusterdata Matlab function - matlab

I am using the Matlab clusterdata function to split my data into two groups: noise and non-noise. The function works well, except that the group numbering is inconsistent: sometimes it labels all noise data as group 1 and all non-noise data as group 2, and sometimes the other way around.
How can I control this? I would like the noise data always to be labelled as group 1.

Controlling which label an unsupervised learning algorithm gives to each cluster is generally a problem, because the numbering is arbitrary. I suggest evaluating some feature of the data after clustering to check whether the labels are the way you want them.
If all your data is in an N x d matrix X, with an N x 1 label vector Y taking values -1 and 1, you could evaluate the variance of each cluster. I suspect the noise data would exhibit higher variance, which could be used to decide whether the labels should be switched.
In the code below, 1 should be non-noise and -1 should be noise (this choice of labels makes it easy to flip them).
%#Variance summed over all dimensions
varL1 = sum(var(X(Y==1,:)));
varL2 = sum(var(X(Y==-1,:)));
%#Flip the labels if the cluster labelled 1 has the higher variance
if varL1 > varL2
    Y = Y * (-1);
end
If this works, you could afterwards relabel the noise cluster as group 1 and the non-noise cluster as group 2 with
Y(Y==1) = 2; %#NB: The order in which these statements are evaluated is important.
Y(Y==-1) = 1;
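Since clusterdata returns labels 1 and 2 rather than -1 and 1, the same variance check can also be applied to those labels directly. The following is only a sketch under assumptions: the toy data and the hand-written label vector T (standing in for the clusterdata output) are made up for illustration, and it assumes the noise cluster really is the one with the larger variance.

```matlab
% Hypothetical toy data: 50 tightly packed points and 50 widely spread ones
tight  = [linspace(-0.1, 0.1, 50)', linspace(0.1, -0.1, 50)'];
spread = [linspace(-5, 5, 50)',     linspace(5, -5, 50)'];
X = [tight; spread];
% T stands in for the output of clusterdata; here it is written by hand,
% with the widely spread (noise) points deliberately labelled 2
T = [ones(50,1); 2*ones(50,1)];

% Variance of each cluster, summed over all dimensions
v1 = sum(var(X(T==1,:)));
v2 = sum(var(X(T==2,:)));

% Swap the labels (1 <-> 2) if cluster 2 has the higher variance,
% so that the noisier cluster always ends up as group 1
if v2 > v1
    T = 3 - T;
end
```

Using 3 - T swaps both labels in a single step, so the order-of-evaluation issue mentioned above does not arise.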

Related

Plotting K-means results in Matlab

I have 3 sets of signals, each containing 4 distinct operational states, and I have to classify the states in each signal using K-means in Matlab. The classification is done after I have smoothened the original signal using a filter. My output should be a plot of the smoothened signal with each part of the signal in a different color to denote the different operational state.
I am very new to Matlab, and this is what I have for the classification part.
numClusters = 4;
idx_1 = kmeans([X_1 smoothY_1],numClusters,'Replicates', 5);
[numDataPoints,numDimensions] = size(smoothY_1);
Colors = hsv(numClusters);
for i = 1 : numDataPoints
plot(X_1(i),smoothY_1(i),'.','Color',Colors(idx_1(i),:))
hold on
end
I have a few questions.
1) It appears to me that the kmeans function in Matlab will return a set of arbitrary cluster index in every run. For example, running the code above on the same signal twice may give me the cluster index (for 10 data points) [4 4 2 2 2 1 1 3 3 3] and [2 2 1 1 1 4 4 3 3 3], resulting in arbitrary colors denoting each state. Ideally, I would like the indices to be (somewhat) ordered and the colors to be the same for corresponding states, so that it makes sense to say "Red means Operational State 1, blue means State 2, etc". How can I synchronize this?
I have 2 pictures to illustrate this.
Set 1 and 2 are two of the datasets. Each stage of the signal is in a different color. I would like, for example, the first segment to be red, second in cyan, third in green, fourth in purple.
2) I can't seem to plot the graph using the specifier '-'. There is no output when I try that, so I'm forced to use '.', which isn't what I want. How can I plot a continuous curve here?
3) Right now, I'm running K-means independently on all 3 sets of data, so there's no concept of training/test datasets. I would like to use one dataset for training and the other 2 for testing, but I don't know how to do that using K-means in Matlab. How can I do that?
ETA: I noticed that my smoothed plots are all about half the heights of my plots of the original data, e.g. the highest point in my original signal is y = 22, while the highest point in my smoothed signal is y = 11, although the shape remains the same. Is this correct?
ETA2: I realized that what the K-means clustering seems to have done is simply divide the graph into numClusters segments (based on the X_1 values) and that's it. I've tried different values of numClusters and each gave me equally divided segments. Surely this can't be right? For instance, isn't it more likely that the long segment after the biggest spike belongs to a single cluster, rather than 3 clusters? Should I be using K-means at all?
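For the first question:
You can relabel the index vector so that the clusters are numbered in order of first appearance (here a is your index vector, e.g. idx_1):
[~,~,a] = unique(a,'stable');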
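As a small illustration of what this does, here is the first index vector from the question run through the same call. The variable names idx and ordered are chosen here for the example only.

```matlab
% Cluster indices as returned by one kmeans run (taken from the question)
idx = [4 4 2 2 2 1 1 3 3 3];

% The third output of unique maps every element to the position of its
% value in the list of unique values; with 'stable' that list is kept in
% order of first appearance instead of being sorted
[~, ~, ordered] = unique(idx, 'stable');

% Force a row vector, since the orientation of the third output depends
% on the MATLAB version
ordered = ordered(:).';
```

The second example vector from the question, [2 2 1 1 1 4 4 3 3 3], maps to the same result, so both runs end up with the same colours.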
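For the second question:
You can find all the information about line styles in the MATLAB LineSpec documentation.
If you don't pass a LineSpec, the default is a continuous line, as you want. Note that a line needs at least two points, so plotting one point per plot call, as your loop does, draws nothing with '-'; pass vectors of points to plot instead.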
For the third question:
I don't think you can train the k-means algorithm (because of how the method works) the way you could train an SVM, but I'm waiting for an expert opinion.
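One way to approximate a training/test split with k-means, offered here as an assumption rather than as part of the answer above, is to keep the centroids found on the training set and assign each test point to its nearest centroid. The centroid values and test points below are made up for illustration; in practice C would be the second output of kmeans run on the training data.

```matlab
% Hypothetical centroids, one per row, as kmeans would return them
C = [0 0; 10 10];
% Hypothetical test points, one per row
Xtest = [1 -1; 9 11; 0.5 0.2];

% Squared Euclidean distance from every test point to every centroid
D = zeros(size(Xtest,1), size(C,1));
for k = 1:size(C,1)
    D(:,k) = sum(bsxfun(@minus, Xtest, C(k,:)).^2, 2);
end

% Each test point gets the label of its closest centroid
[~, label] = min(D, [], 2);
```

This only reuses the cluster centres; unlike an SVM, nothing is learned from class labels.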

Inputs for the ROC curve

I have a 2-column matrix, where each row holds observations for healthy (column 1) and not healthy (column 2) patients. I also have 5 partition values which should be used to plot the ROC curve.
Could you please help me to understand how to get the inputs from this data for the perfcurve function?
Thank you for any reply!
I've made a small script that shows the basics of perfcurve given a two-column matrix input. If you execute this in MATLAB and take a careful look at it, you should have no trouble using perfcurve.
%Simulate your data as Gaussian data with 1000 measurements in each group.
%Let's give them a mean difference of 1 and a standard deviation of 1.
Data = zeros(1000,2);
Data(:,1) = normrnd(0,1,1000,1);
Data(:,2) = normrnd(1,1,1000,1);
%Now the data is reshaped to a vector (required for perfcurve) and I create the labels.
Data = reshape(Data,2000,1);
Labels = zeros(size(Data,1),1);
Labels(end/2+1:end) = 1;
%Your bottom half of the data (initially second column) is now group 1, the
%top half is group 0.
%Let's set the positive class to group 1.
PosClass = 1;
%Now we have all required variables to call perfcurve. We will give
%perfcurve the 'Xvals' input to define the values at which the ROC curve is
%calculated. This parameter can be left out to let MATLAB calculate the
%curve at all values.
[X, Y] = perfcurve(Labels,Data,PosClass, 'Xvals', 0:0.25:1);
%Let's plot this
plot(X,Y)
%One limitation of scripting it like this is that you must have equal group
%sizes for healthy and sick. If you build the Data vector and a separate
%Labels vector directly, instead of reshaping a two-column matrix, you can
%also handle groups of different sizes.
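To see what perfcurve computes, it can help to work out a few ROC points by hand. The sketch below is not the perfcurve implementation; it assumes that a higher measurement means "not healthy", and the measurements and the five partition (threshold) values are made up for illustration.

```matlab
% Hypothetical measurements: column 1 of the matrix (healthy) and
% column 2 (not healthy)
healthy = [1 2 3 4];
sick    = [3 4 5 6];
% Hypothetical partition values used as decision thresholds
cuts = [0 2.5 3.5 4.5 7];

tpr = zeros(size(cuts));   % true positive rate at each threshold
fpr = zeros(size(cuts));   % false positive rate at each threshold
for k = 1:numel(cuts)
    % A patient is called "not healthy" when the value exceeds the threshold
    tpr(k) = mean(sick    > cuts(k));
    fpr(k) = mean(healthy > cuts(k));
end
% plot(fpr, tpr, '-o') would draw the resulting ROC points
```

Each threshold gives one (fpr, tpr) point, and the ROC curve connects these points.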

k-means clustering using function 'kmeans' in MATLAB

I have this matrix:
x = [2+2*i 2-2*i -2+2*i -2-2*i];
I want to simulate transmitting it and adding noise to it. I represented the components of the complex number as below:
A = randn(150, 2) + 2*ones(150, 2); C = randn(150, 2) - 2*ones(150, 2);
At the receiver, I received the vector below, where the components are ordered based on what I sent originally (i.e., the components of x).
X = [A A A C C A C C];
Now I want to apply the kmeans(X) to have four clusters, so kmeans(X, 4). I am experiencing the following problems:
I am not sure if I can represent the complex numbers as shown in X above.
I can't plot the result of the kmeans to show the clusters.
I could not understand the clusters centroid results.
How can I find the best error rate, if this example was to represent a communication system and at the receiver, k-means clustering was used in order to decide what the transmitted signal was?
If you don't "understand" the cluster centroid results, then you don't understand how k-means works. I'll present a small summary here.
k-means groups your data into k groups. You initially choose k random points in your data and label them 1,2,...,k; these are the initial centroids. You then determine how close every other point is to each centroid, and assign each point to the group (1,2,...,k) of the centroid it is closest to. Next, you update the centroids: the centroid is the representative point of a group, computed as the average of all the points assigned to that group. These averages become the centroids for the next iteration, in which you again determine how close each point is to each centroid and reassign the points. You keep iterating until the centroids no longer move, or move very little.
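The loop described above can be written out in a few lines. This is a minimal sketch for illustration, not the code behind MATLAB's kmeans: the toy data is made up, the starting centroids are fixed instead of random so that the result is reproducible, and empty clusters are not handled.

```matlab
% Hypothetical toy data: two well-separated groups of 2-D points
X = [0 0; 0 1; 1 0; 10 10; 10 11; 11 10];
k = 2;
% Assumption: start from the 1st and 4th points instead of random ones
C = X([1 4], :);

for iter = 1:100
    % Squared distance from every point to every centroid
    D = zeros(size(X,1), k);
    for j = 1:k
        D(:,j) = sum(bsxfun(@minus, X, C(j,:)).^2, 2);
    end
    % Assign each point to its nearest centroid
    [~, idx] = min(D, [], 2);
    % New centroid of each group = mean of the points assigned to it
    Cnew = zeros(k, size(X,2));
    for j = 1:k
        Cnew(j,:) = mean(X(idx == j, :), 1);
    end
    % Stop once the centroids no longer move
    if max(abs(Cnew(:) - C(:))) < 1e-9
        break;
    end
    C = Cnew;
end
```

Here idx holds the group of each point and C the final centroids, which play the same roles as the two outputs of kmeans.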
Now, let's answer your questions one-by-one.
1. Complex number representation
k-means in MATLAB doesn't define how complex data is handled. A common way for people to deal with complex numbered data is to split up the real and imaginary parts into separate dimensions as you have done. This is a perfectly valid way to use k-means for complex valued data.
See this post on the MathWorks MATLAB forum for more details: https://www.mathworks.com/matlabcentral/newsreader/view_thread/78306
2. Plot the results
You aren't constructing your matrix X properly. Note that A and C are both 150 x 2 matrices. You need to structure X such that each row is a point and each column is a variable, so you need to stack A and C vertically rather than side by side:
X = [A; A; A; C; C; A; C; C];
Note that you have duplicate points. This is actually no different than doing X = [A; C]; as far as kmeans is concerned. Perhaps you should generate X, then add the noise in rather than taking A and C, adding noise, then constructing your signal.
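One possible way to generate the signal first and add the noise afterwards is sketched below. It assumes 150 transmissions of each of the four symbols in x and unit-variance Gaussian noise; the variable names symbols, clean and Xnoisy are chosen here for illustration.

```matlab
% Real and imaginary parts of the four symbols in x, one symbol per row
symbols = [2 2; 2 -2; -2 2; -2 -2];
n = 150;   % assumed number of transmissions per symbol

% Repeat each symbol n times: rows 1..150 hold the first symbol,
% rows 151..300 the second, and so on (600 x 2 in total)
clean = kron(symbols, ones(n, 1));

% Add independent Gaussian noise to every received component
Xnoisy = clean + randn(size(clean));
```

Xnoisy then has one received point per row and can be passed to kmeans in place of X.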
Now, if you want to plot the results as well as the centroids, what you need to do is use the two output version of kmeans like so:
[idx, centroids] = kmeans(X, 4);
idx will contain the cluster number that each point in X belongs to, and centroids will be a 4 x 2 matrix where each row is the mean of one cluster found in the data. If you want to plot the data as well as the clusters, you simply need to do the following. I'm going to loop over each cluster membership and plot the results on a figure. I'm also going to colour in where the mean of each cluster is located:
x = X(:,1);
y = X(:,2);
figure;
hold on;
colors = 'rgbk';
for num = 1 : 4
plot(x(idx == num), y(idx == num), [colors(num) '.']);
end
plot(centroids(:,1), centroids(:,2), 'c.', 'MarkerSize', 14);
grid;
The above code goes through each cluster, plots them in a different colour, then plots the centroids in cyan with a slightly larger thickness so you can see what the graph looks like.
This is what I get:
3. Understanding centroid results
This is probably because you didn't construct X properly. This is what I get for my centroids:
centroids =
-1.9176 -2.0759
1.5980 2.8071
2.7486 1.6147
0.8202 0.8025
This is pretty self-explanatory and I talked about how this is structured earlier.
4. Best representation of the signal
What you can do is repeat the clustering a number of times, then the algorithm will decide what the best clustering was out of these times. You would simply use the Replicates flag and denote how many times you want this run. Obviously, the more times you run this, the better your results may be. Therefore, do something like:
[idx, centroids] = kmeans(X, 4, 'Replicates', 5);
This will run kmeans 5 times and give you the best centroids of these 5 times.
Now, if you want to determine the best estimate of the sequence that was transmitted, you'd have to split X into blocks of 150 rows each (as your random sequence was 150 elements long), then run a separate kmeans on each block. You can try to find the best representation of each part of the sequence by using the Replicates flag each time, so you can do something like:
for num = 1 : 8
%// Look at 150 points at a time
[idx, centroids] = kmeans(X((num-1)*150 + 1 : num*150, :), 4, 'Replicates', 5);
%// Do your analysis
%//...
%//...
end
idx and centroids would be the results for each portion of your transmitted signal. You probably want to look at centroids at each iteration to determine what symbol was transmitted at a particular time.
If you want to plot the decision regions, then you're probably looking for a Voronoi diagram. Given a set of points defined within the domain of your problem, you just have to determine which cluster each point belongs to. Given that our data spans -5 <= x, y <= 5, let's go through each point in a grid over that range, determine which cluster it belongs to, and colour the point accordingly.
Something like:
colors = 'rgbk';
[X,Y] = meshgrid(-5:0.05:5, -5:0.05:5);
X = X(:);
Y = Y(:);
figure;
hold on;
for idx = 1 : numel(X)
[~,ind] = min(sum(bsxfun(@minus, [X(idx) Y(idx)], centroids).^2, 2));
plot(X(idx), Y(idx), [colors(ind), '.']);
end
plot(centroids(:,1), centroids(:,2), 'c.', 'MarkerSize', 14);
The above code will plot the decision regions / Voronoi diagram of the particular configuration, as well as where the cluster centres are located. Note that the code is rather unoptimized and it'll take a while for the graph to generate, but I wanted to write something quick to illustrate my point.
Here's what the decision regions look like:
Hope this helps! Good luck!

Creating a new probabilistic matrix from two existing ones according to prespecified rules in MATLAB

I have a problem in my MATLAB code. Let me first give you some explanation about the issue. I have two matrices which represent probabilities of specific outcomes of events. The first one is called DemandProbabilityMatrix or in short DemandP. Entry (i,j) shows the probability that item i is demanded j many times. Similarly, we have a ReturnProbabilityMatrix, i.e. ReturnP. An element of type (i,j) stores the probability that item i is returned j many times.
We want to compute the net demand probability from these two matrices. For example:
DemandP=[ .4 .5 .1]
ReturnP=[ .2 .3 .5]
In this case we have 1 item, and it can be demanded or returned 1, 2 or 3 times with the given probabilities. To be more specific, the item will be demanded exactly once with probability .4.
Then we need to compute the net demand. In this case, net demand can be -2,-1,0,1 or 2. For instance in order to get a net demand of -1 we can either have a demand of 1 and return of 2 or demand of 2 and return of 3. Thus we have
NetDemandP(1,2)= DemandP(1,1)*ReturnP(1,2)+DemandP(1,2)*ReturnP(1,3).
Thus the NetDemandP should look as:
NetDemandP=[.20 .37 .28 .13 .02]
I can do this with nested for loops, but I'm trying to come up with a faster way. In case it helps, here is my for-loop solution, where I denotes the number of rows in ReturnP and DemandP, and J+1 denotes the number of columns in those matrices.
NetDemandP=zeros(I,2*J+1);
for i=1:I
for j=1:J+1
for k=1:J+1
NetDemandP(i,j-k+J+1)=NetDemandP(i,j-k+J+1)+DemandP(i,j)*ReturnP(i,k);
end
end
end
Thanks in advance
What you want is the convolution of your probability distributions. More specifically, you want the convolution of the demand distribution with the reversed return distribution: reversing ReturnP turns the sum over all pairs with the same difference j-k into an ordinary convolution sum. This is easily achieved in Matlab. For example:
DemandP = [.4 .5 .1];
ReturnP = [.2 .3 .5];
NetDemandP = conv(DemandP,fliplr(ReturnP))
If you have matrices instead of vectors, then just iterate through the rows:
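%Preallocate: each row of the result has 2*(number of columns)-1 entries
NetDemandP = zeros(size(DemandP,1), 2*size(DemandP,2)-1);
for i = 1:size(DemandP,1)
    NetDemandP(i,:) = conv(DemandP(i,:),fliplr(ReturnP(i,:)));
end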
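As a quick sanity check, the convolution can be compared against the nested loops from the question. In this sketch the first row is the example from the question, while the second row of each matrix is made up for illustration; loopP and convP are names chosen here.

```matlab
DemandP = [.4 .5 .1; .2 .2 .6];   % second row is hypothetical
ReturnP = [.2 .3 .5; .1 .1 .8];   % second row is hypothetical
[I, n] = size(DemandP);
J = n - 1;

% Nested-loop version from the question
loopP = zeros(I, 2*J+1);
for i = 1:I
    for j = 1:J+1
        for k = 1:J+1
            loopP(i,j-k+J+1) = loopP(i,j-k+J+1) + DemandP(i,j)*ReturnP(i,k);
        end
    end
end

% Convolution version, one row at a time
convP = zeros(I, 2*J+1);
for i = 1:I
    convP(i,:) = conv(DemandP(i,:), fliplr(ReturnP(i,:)));
end

% Largest difference between the two results
maxDiff = max(abs(loopP(:) - convP(:)));
```

Both versions agree up to rounding, and the first row reproduces the NetDemandP values given in the question.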

Matlab: how to find which variables from dataset could be discarded using PCA in matlab?

I am using PCA to find out which variables in my dataset are redundant because they are highly correlated with other variables. I am using the MATLAB princomp function on data previously normalized using zscore:
[coeff, PC, eigenvalues] = princomp(zscore(x))
I know that eigenvalues tells me how much of the dataset's variation each principal component covers, and that coeff tells me how much of the i-th original variable is in the j-th principal component (where i indexes rows and j indexes columns).
So I assumed that, to find out which variables in the original dataset are the most important and which are the least, I should multiply the coeff matrix by the eigenvalues: the coeff values represent how much of every variable each component has, and the eigenvalues tell how important each component is.
So this is my full code:
[coeff, PC, eigenvalues] = princomp(zscore(x));
e = eigenvalues./sum(eigenvalues);
abs(coeff)/e
But this does not really show anything. I tried it on the following set, where variable 1 is fully correlated with variable 2 (v2 = v1 + 2):
v1 v2 v3
1 3 4
2 4 -1
4 6 9
3 5 -2
but the results of my calculations were as follows:
v1 0.5525
v2 0.5525
v3 0.5264
and this does not really show anything. I would expect the result for variable 2 to show that it is far less important than v1 or v3.
Which of my assumptions is wrong?
EDIT I have completely reworked the answer now that I understand which assumptions were wrong.
Before explaining what doesn't work in the OP, let me make sure we use the same terminology. In principal component analysis, the goal is to obtain a coordinate transformation that separates the observations well, and that may make it easy to describe the data, i.e. the different multi-dimensional observations, in a lower-dimensional space. Observations are multidimensional when they're made up of multiple measurements. If there are fewer linearly independent observations than there are measurements, we expect at least one of the eigenvalues to be zero, because e.g. two linearly independent observation vectors in a 3D space can be described by a 2D plane.
If we have an array
x = [ 1 3 4
2 4 -1
4 6 9
3 5 -2];
that consists of four observations with three measurements each, princomp(x) will find the lower-dimensional space spanned by the four observations. Since there are two co-dependent measurements, one of the eigenvalues will be near zero, since the space of measurements is only 2D and not 3D, which is probably the result you wanted to find. Indeed, if you inspect the eigenvectors (coeff), you find that the first two components are extremely obviously collinear
coeff = princomp(x)
coeff =
0.10124 0.69982 0.70711
0.10124 0.69982 -0.70711
0.9897 -0.14317 1.1102e-16
Since the first two components are, in fact, pointing in opposite directions, the values of the first two components of the transformed observations are, on their own, meaningless: [1 1 25] is equivalent to [1000 1000 25].
Now, suppose we want to find out whether any measurements are linearly dependent, and we really want to use principal components for this (because in real life, measurements may not be perfectly collinear, and we are interested in finding good vectors of descriptors for a machine-learning application). Then it makes a lot more sense to consider the three measurements as "observations" and run princomp(x'). Since there are then only three "observations" but four "measurements", the fourth eigenvalue will be zero. However, since two of the observations are linearly dependent, we're left with only two non-zero eigenvalues:
eigenvalues =
24.263
3.7368
0
0
To find out which of the measurements are so highly correlated (not actually necessary if you use the eigenvector-transformed measurements as input for e.g. machine learning), the best way would be to look at the correlation between the measurements:
corr(x)
ans =
1 1 0.35675
1 1 0.35675
0.35675 0.35675 1
Unsurprisingly, each measurement is perfectly correlated with itself, and v1 is perfectly correlated with v2.
EDIT2
but the eigenvalues tell us which vectors in the new space are most important (cover the most of variation) and also coefficients tell us how much of each variable is in each component. so I assume we can use this data to find out which of the original variables hold the most of variance and thus are most important (and get rid of those that represent small amount)
This works if your observations show very little variance in one measurement variable (e.g. where x = [1 2 3;1 4 22;1 25 -25;1 11 100];, and thus the first variable contributes nothing to the variance). However, with collinear measurements, both vectors hold equivalent information, and contribute equally to the variance. Thus, the eigenvectors (coefficients) are likely to be similar to one another.
In order for @agnieszka's comments to keep making sense, I have left the original points 1-4 of my answer below. Note that #3 was in response to the division of the eigenvectors by the eigenvalues, which to me didn't make a lot of sense.
1. The vectors should be in rows, not columns (each vector is an observation).
2. coeff returns the basis vectors of the principal components, and its order has little to do with the original input.
3. To see the importance of the principal components, you use eigenvalues/sum(eigenvalues).
4. If you have two collinear vectors, you can't say that the first is important and the second isn't. How do you know that it shouldn't be the other way around? If you want to test for collinearity, you should check the rank of the array instead, or call unique on normalized (i.e. norm equal to 1) vectors.
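A rank check on the example data could look like the sketch below. It makes one assumption beyond the answer: because v2 = v1 + 2 is an offset relation rather than a pure scaling, the columns are centred before the rank is taken. It also uses corrcoef from base MATLAB in place of corr.

```matlab
% Example data from the question: rows are observations, columns v1..v3
x = [1 3  4
     2 4 -1
     4 6  9
     3 5 -2];

% Centre each column so that the constant offset between v1 and v2
% no longer hides their dependence
xc = bsxfun(@minus, x, mean(x, 1));

% A rank below the number of columns means some variables are
% linearly dependent
r = rank(xc);

% Pairwise correlations show which variables are involved
R = corrcoef(x);
```

Here r is smaller than the number of variables, and the entry of R for v1 and v2 equals 1, which points to that pair as the redundant one.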