Plotting K-means results in Matlab - matlab

I have 3 sets of signals, each containing 4 distinct operational states, and I have to classify the states in each signal using K-means in Matlab. The classification is done after I have smoothened the original signal using a filter. My output should be a plot of the smoothened signal with each part of the signal in a different color to denote the different operational state.
I am very new to Matlab, and this is what I have for the classification part.
numClusters = 4;
idx_1 = kmeans([X_1 smoothY_1],numClusters,'Replicates', 5);
[numDataPoints,numDimensions] = size(smoothY_1);
Colors = hsv(numClusters);
for i = 1 : numDataPoints
plot(X_1(i),smoothY_1(i),'.','Color',Colors(idx_1(i),:))
hold on
end
I have a few questions.
1) It appears to me that the kmeans function in Matlab will return a set of arbitrary cluster index in every run. For example, running the code above on the same signal twice may give me the cluster index (for 10 data points) [4 4 2 2 2 1 1 3 3 3] and [2 2 1 1 1 4 4 3 3 3], resulting in arbitrary colors denoting each state. Ideally, I would like the indices to be (somewhat) ordered and the colors to be the same for corresponding states, so that it makes sense to say "Red means Operational State 1, blue means State 2, etc". How can I synchronize this?
I have 2 pictures to illustrate this.
Set 1 and 2 are two of the datasets. Each stage of the signal is in a different color. I would like, for example, the first segment to be red, second in cyan, third in green, fourth in purple.
2) I can't seem to plot the graph using the specifier '-'. There is no output when I tried to do that, so I'm forced to use '.', which isn't what i want. How can I plot a continuous curve here?
3) Right now, I'm running K-means independently on all 3 sets of data, so there's no concept of training/test datasets. I would like to use one dataset for training and the other 2 for testing, but I don't know how to do that using K-means in Matlab. How can I do that?
ETA: I noticed that my smoothed plots are all about half the heights of my plots of the original data, e.g. the highest point in my original signal is y = 22, while the highest point in my smoothed signal is y = 11, although the shape remains the same. Is this correct?
ETA2: I realized that it seems as if what the K-means clustering did was simply divide the graph into numClusters segments (based on X_1 values) and that's it. I've tried with different values of numClusters and each gave me equally divided segments. Surely this can't be right? For instance, isn't it more likely that the long segment after the biggest spike belong to the same cluster, rather than 3 clusters? Should I be using K-means at all?

For the first question:
You can reorder your vector with
[~,~,a] = unique(a,'stable');
For the second question:
You can find all the information about the LineSpec here:
LineSpec
If you don't add a LineSpec the default option is a continuous line, as you want.
For the third question:
I don't think that you can train your kmean algorithm (due to the method) as it could be possible with an SVM, but i'm waiting for an expert opinion.

Related

How can I classify my data for K-Means Clustering

A proof of concept prototype I have to do for my final year project is to implement K-Means Clustering on a big data set and display the results on a graph. I only know object-oriented languages like Java and C# and decided to give MATLAB a try. I notice that with a functional language the approach to solving problems is very different, so I would like some insight on a few things if possible.
Suppose I have the following data set:
raw_data
400.39 513.29 499.99 466.62 396.67
234.78 231.92 215.82 203.93 290.43
15.07 14.08 12.27 13.21 13.15
334.02 328.79 272.2 306.99 347.79
49.88 52.2 66.35 47.69 47.86
732.88 744.62 687.53 699.63 694.98
And I picked row 2 and 4 to be the 2 centroids:
centroids
234.78 231.92 215.82 203.93 290.43 % Centroid 1
334.02 328.79 272.2 306.99 347.79 % Centroid 2
I want to now compute the euclidean distances of each point to each centroid, then assign each point to it's closest centroid and display this on a graph. Let's say I want I want to classify the centroids as blue and green. How can I do this in MATLAB? If this was Java I would initialise each row as an object and add to separate ArrayLists (representing the clusters).
If rows 1, 2 and 3 all belong to the first centroid / cluster, and rows 4, 5 and 6 belong to the second centroid / cluster - how can I classify these to display them as blue or green points on a graph? I am new to MATLAB and really curious about this. Thanks for any help.
(To begin with, Matlab has a flexible distance measuring function, pdist2 and also kmeans implementation, but I'm assuming that you want to build your code from scratch).
In Matlab, you try to implement everything as matrix algebra, without loops over elements.
In your case, if R is the raw_data matrix and C is the centroids matrix,
you can shift the dimension that represents centroid number to the 3rd place by
permC=permute(C,[3 2 1]); Then the bsxfun function allows you to subtract C from R while expanding R's third dimension as necessary: D=bsxfun(#minus,R,permC). Element-wise square followed by summation across columns SqD=sum(D.^2,2) will give you the squared distances of each observation from each centroid. Performing all these operations within a single statement and shifting the third (centroid) dimension back to the 2nd place will look like this:
SqD=permute(sum(bsxfun(#minus,R,permute(C,[3 2 1])).^2,2),[1 3 2])
Picking the centroid of minimal distance is now straightforward: [minDist,minCentroid]=min(SqD,[],2)
If this looks complex, I recommend inspecting the product of each sub-step and reading the help of each command.

K-means Clustering, major understanding issue

Suppose that we have a 64dim matrix to cluster, let's say that the matrix dataset is dt=64x150.
Using from vl_feat's library its kmeans function, I will cluster my dataset to 20 centrers:
[centers, assignments] = vl_kmeans(dt, 20);
centers is a 64x20 matrix.
assignments is a 1x150 matrix with values inside it.
According to manual: The vector assignments contains the (hard) assignments of the input data to the clusters.
I still can not understand what those numbers in the matrix assignments mean. I dont get it at all. Anyone mind helping me a bit here? An example or something would be great. What do these values represent anyway?
In k-means the problem you are trying to solve is the problem of clustering your 150 points into 20 clusters. Each point is a 64-dimension point and thus represented by a vector of size 64. So in your case dt is the set of points, each column is a 64-dim vector.
After running the algorithm you get centers and assignments. centers are the 20 positions of the cluster's center in a 64-dim space, in case you want to visualize it, measure distances between points and clusters, etc. 'assignments' on the other hand contains the actual assignments of each 64-dim point in dt. So if assignments[7] is 15 it indicates that the 7th vector in dt belongs to the 15th cluster.
For example here you can see clustering of lots of 2d points, let's say 1000 into 3 clusters. In this case dt would be 2x1000, centers would be 2x3 and assignments would be 1x1000 and will hold numbers ranging from 1 to 3 (or 0 to 2, in case you're using openCV)
EDIT:
The code to produce this image is located here: http://pypr.sourceforge.net/kmeans.html#k-means-example along with a tutorial on kmeans for pyPR.
In openCV it is the number of the cluster that each of the input points belong to

How to put a treshold or enhancement on my colorbar plots ?

I have two time series and I plot some similarity measures using colorbar. However, for one of my metrics, one of the results is very high compared to the other. Therefore, I can't distinguish enough variability in the charts. Is there a way to exclude some too high data from the figure ?
Thnx
How about just apply a threshold before plotting:
%//Code assumes 2D image:
I_th = I;
I_th(I < threshold ) = threshold ; %//where threshold is a constant you define
imagesc(I_th);
you can force the values above a certain threshold to be the threshold value.For example,
A=[1 2 3 4 5];
A(A>3)=3;
this will give you A=[1 2 3 3 3];
Alternatively, instead of excluding values, you might consider doing a color scale with a log transform, so that you can distinguish the colors better.
here is one example:
http://www.mikesoltys.com/2012/03/16/matlab-tip-logarithmic-color-scales-for-contour-and-image-plots/

Matlab: how to find which variables from dataset could be discarded using PCA in matlab?

I am using PCA to find out which variables in my dataset are redundand due to being highly correlated with other variables. I am using princomp matlab function on the data previously normalized using zscore:
[coeff, PC, eigenvalues] = princomp(zscore(x))
I know that eigenvalues tell me how much variation of the dataset covers every principal component, and that coeff tells me how much of i-th original variable is in the j-th principal component (where i - rows, j - columns).
So I assumed that to find out which variables out of the original dataset are the most important and which are the least I should multiply the coeff matrix by eigenvalues - coeff values represent how much of every variable each component has and eigenvalues tell how important this component is.
So this is my full code:
[coeff, PC, eigenvalues] = princomp(zscore(x));
e = eigenvalues./sum(eigenvalues);
abs(coeff)/e
But this does not really show anything - I tried it on a following set, where variable 1 is fully correlated with variable 2 (v2 = v1 + 2):
v1 v2 v3
1 3 4
2 4 -1
4 6 9
3 5 -2
but the results of my calculations were following:
v1 0.5525
v2 0.5525
v3 0.5264
and this does not really show anything. I would expect the result for variable 2 show that it is far less important than v1 or v3.
Which of my assuptions is wrong?
EDIT I have completely reworked the answer now that I understand which assumptions were wrong.
Before explaining what doesn't work in the OP, let me make sure we'll have the same terminology. In principal component analysis, the goal is to obtain a coordinate transformation that separates the observations well, and that may make it easy to describe the data , i.e. the different multi-dimensional observations, in a lower-dimensional space. Observations are multidimensional when they're made up from multiple measurements. If there are fewer linearly independent observations than there are measurements, we expect at least one of the eigenvalues to be zero, because e.g. two linearly independent observation vectors in a 3D space can be described by a 2D plane.
If we have an array
x = [ 1 3 4
2 4 -1
4 6 9
3 5 -2];
that consists of four observations with three measurements each, princomp(x) will find the lower-dimensional space spanned by the four observations. Since there are two co-dependent measurements, one of the eigenvalues will be near zero, since the space of measurements is only 2D and not 3D, which is probably the result you wanted to find. Indeed, if you inspect the eigenvectors (coeff), you find that the first two components are extremely obviously collinear
coeff = princomp(x)
coeff =
0.10124 0.69982 0.70711
0.10124 0.69982 -0.70711
0.9897 -0.14317 1.1102e-16
Since the first two components are, in fact, pointing in opposite directions, the values of the first two components of the transformed observations are, on their own, meaningless: [1 1 25] is equivalent to [1000 1000 25].
Now, if we want to find out whether any measurements are linearly dependent, and if we really want to use principal components for this, because in real life, measurements my not be perfectly collinear and we are interested in finding good vectors of descriptors for a machine-learning application, it makes a lot more sense to consider the three measurements as "observations", and run princomp(x'). Since there are thus three "observations" only, but four "measurements", the fourth eigenvector will be zero. However, since there are two linearly dependent observations, we're left with only two non-zero eigenvalues:
eigenvalues =
24.263
3.7368
0
0
To find out which of the measurements are so highly correlated (not actually necessary if you use the eigenvector-transformed measurements as input for e.g. machine learning), the best way would be to look at the correlation between the measurements:
corr(x)
ans =
1 1 0.35675
1 1 0.35675
0.35675 0.35675 1
Unsurprisingly, each measurement is perfectly correlated with itself, and v1 is perfectly correlated with v2.
EDIT2
but the eigenvalues tell us which vectors in the new space are most important (cover the most of variation) and also coefficients tell us how much of each variable is in each component. so I assume we can use this data to find out which of the original variables hold the most of variance and thus are most important (and get rid of those that represent small amount)
This works if your observations show very little variance in one measurement variable (e.g. where x = [1 2 3;1 4 22;1 25 -25;1 11 100];, and thus the first variable contributes nothing to the variance). However, with collinear measurements, both vectors hold equivalent information, and contribute equally to the variance. Thus, the eigenvectors (coefficients) are likely to be similar to one another.
In order for #agnieszka's comments to keep making sense, I have left the original points 1-4 of my answer below. Note that #3 was in response to the division of the eigenvectors by the eigenvalues, which to me didn't make a lot of sense.
the vectors should be in rows, not columns (each vector is an
observation).
coeff returns the basis vectors of the principal
components, and its order has little to do with the original input
To see the importance of the principal components, you use eigenvalues/sum(eigenvalues)
If you have two collinear vectors, you can't say that the first is important and the second isn't. How do you know that it shouldn't be the other way around? If you want to test for colinearity, you should check the rank of the array instead, or call unique on normalized (i.e. norm equal to 1) vectors.

Neural networks - input values

I have a question that may be trivial but it's not described anywhere i've looked. I'm studying neural networks and everywhere i look there's some theory and some trivial example with some 0s and 1s as an input. I'm wondering: do i have to put only one value as an input value for one neuron, or can it be a vector of, let's say, 3 values (RGB colour for example)?
The above answers are technically correct, but don't explain the simple truth: there is never a situation where you'd need to give a vector of numbers to a single neuron.
From a practical standpoint this is because (as one of the earlier solutions has shown) you can just have a neuron for each number in a vector and then have all of those be the input to a single neuron. This should get you your desired behavior after training, as the second layer neuron can effectively make use of the entire vector.
From a mathematical standpoint, there is a fundamental theorem of coding theory that states that any vector of numbers can be represented as a single number. Thus, if you really don't want an extra layer of neurons, you could simply encode the RGB values into a single number and input that to the neuron. Though, this coding function would probably make most learning problems more difficult, so I doubt this solution would be worth it in most cases.
To summarize: artificial neural networks are used without giving a vector to an input unit, but lose no computational power because of this.
When dealing with multi-dimensional data, I believe a two layer neural network is said to give better result.
In your case:
R[0..1] => (N1)----\
\
G[0..1] => (N2)-----(N4) => Result[0..1]
/
B[0..1] => (N3)----/
As you can see, the N4 neurone can handle 3 entries.
The [0..1] interval is a convention but a good one imo. That way, you can easily code a set of generic neuron classes that can take an arbitrary number of entries (I had template C++ classes with the number of entries as template parameter personally). So you code the logic of your neurons once, then you toy with the structure of the network and/or combinations of functions within your neurons.
Generally, the input for a single neuron is a value between 0 and 1. That convention is not just for ease of implementation but because normalizing the input values to the same range ensures that each input carries similar weighting. (If you have some images with 8 bit color with pixel values between 0 and 7 and some images with 16 bit color with pixel values between 0 and 255 you probably wouldn't want to favor the 24 bit color images just because the numerical values are higher. Similarly, you will probably want your images to be the same dimensions.)
As far as using pixel values as inputs, it is very common to try to gather a higher level representation of the image than its pixels (more info). For example, given a 5 x 5 (normalized) gray scale image:
[1 1 1 1 1]
[0 0 1 0 0]
[0 0 1 0 0]
[0 0 1 0 0]
[0 0 1 0 0]
We could use a the following feature matrices to help discover horizontal, vertical, and diagonal features of the images. See python haar face detection for more information.
[1 1] [0 0] [1 0] [0 1] [1 0], [0 1]
[0 0], [1 1], [1 0], [0 1], [0 1], [1 0]
To build the input vector, v, for this image, take the first 2x2 feature matrix and "apply" it with element-wise multiplication to the first position in the image. Applying,
[1 1] (the first feature matrix) to [1 1] (the first position in the image)
[0 0] [0 0]
will result in 2 because 1*1 + 1*1 + 0*0 + 0*0 = 2. Append 2 to the back of your input vector for this image. Then move this feature matrix to the next position, one to the right, and apply it again, adding the result to the input vector. Do this repeatedly for each position of the feature matrix and for each of the feature matrices. This will build your input vector for a single image. Be sure that you build the vectors in the same order for each image.
In this case the image is black and white, but with RGB values you could extend the algorithm to do the same computation but add 3 values to the input vector for each pixel--one for each color. This should provide you with one input vector per image and a single input to each neuron. The vectors will then need to be normalized before running through the network.
Normally a single neuron takes as its input multiple real numbers and outputs a real number, which typically is calculated as applying the sigmoid function to the sum of the real numbers (scaled, and then plus or minus a constant offset).
If you want to put in, say, two RGB vectors (2 x 3 reals), you need to decide how you want to combine the values. If you add all the elements together and apply the sigmoid function, it is equivalent to getting in six reals "flat". On the other hand, if you process the R elements, then the G elements, and the B elements, all individually (e.g. sum or subtract the pairs), you have in practice three independent neurons.
So in short, no, a single neuron does not take in vector values.
Use light wavelength normalized to visible spectrum as the input.
There are some approximate equations on the net.
Search for RGB to wavelength conversion
or
use HSL color model and extract Hue component and possibly use Saturation and Lightness as well. Well...
It can be whatever you want, as long as you write your inner function accordingly.
The examples you mention use [0;1] as their domain, but you can use R, R², or whatever you want, as long as the function you use in your neurons is defined on this domain.
In your case, you can define your functions on R3 to allow for RGB values to be handled
A trivial example : use (x1, y1, z1),(x2,y2,z2)->(ax1+x2,by1+y2,cz1+z2) as your function to transform two colors into one, a b and c being your learning coefs, which you will determine during the learning phase.
Very detailed information (including the answer to your question) is available on Wikipedia.