How do I correctly plot the clusters produced from a cluster analysis in matlab? - matlab

I want to carry out hierarchical clustering in MATLAB and plot the clusters on a scatter plot. I used the evalclusters function to first investigate what a 'good' number of clusters would be under different criteria, e.g. Silhouette and CalinskiHarabasz. Here is the code I used for the evaluation (x is my data, with 200 observations and 10 variables):
E = evalclusters(x,'linkage','CalinskiHarabasz','KList',[1:10])
% store the optimal number of clusters
optk=E.OptimalK;
% save the outputs to a structure
clust_struc(1).Optimalk=optk;
clust_struc(1).method={'CalinskiHarabasz'}
I then used code similar to what I have found online:
gscatter(x(:,1),x(:,2),E.OptimalY,'rbgckmr','xod*s.p')
%OptimalY is a vector 200 long with the cluster numbers
and this is what I get:
My question may be silly, but I don't understand why only the first two columns of the data are used to produce the scatter plot. I realise that the clusters themselves are incorporated through OptimalY, but should I not be using all of the data in x?

Each row in x is an observation described by size(x,2) dimensions. All of these dimensions are used when clustering the rows of x.
However, when plotting the clusters we cannot show more than 2 or 3 dimensions, so we represent each observation by a few key properties. I'm not sure that x(:,1) and x(:,2) are the best choice, but you have to choose two variables for a 2-D plot.
Usually you would have some property of interest that you want to plot. Have a look at the example in the MATLAB documentation: the fisheriris data has 4 variables, the length and width measurements of the sepals and petals of three species of iris flowers. It is up to you to decide which to plot against each other (in the example they chose petal length and petal width).
Here is a comparison between taking the petal measurements and the sepal measurements as the axes for plotting the grouping:
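A minimal sketch of such a side-by-side comparison, assuming the Statistics and Machine Learning Toolbox is installed (it provides fisheriris, evalclusters and gscatter). The KList range 1:6 and the axis labels are my own choices for illustration, not taken from the original post. All four variables go into the clustering; only the choice of plot axes differs between the two panels.

```matlab
% Cluster on all four fisheriris measurements, then view the same
% cluster labels on two different pairs of axes.
load fisheriris                      % provides meas (150-by-4) and species
E = evalclusters(meas,'linkage','CalinskiHarabasz','KList',1:6);
idx = E.OptimalY;                    % one cluster label per row of meas

figure
subplot(1,2,1)
gscatter(meas(:,1),meas(:,2),idx)    % columns 1-2: sepal length and width
xlabel('Sepal length'); ylabel('Sepal width')
subplot(1,2,2)
gscatter(meas(:,3),meas(:,4),idx)    % columns 3-4: petal length and width
xlabel('Petal length'); ylabel('Petal width')
```

The same pattern applies to the 200-by-10 matrix x in the question: cluster on all of x, then pass whichever two columns are of interest to gscatter together with E.OptimalY.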

Related

Matlab - how does a contour plot choose its levels automatically?

I am using contourf to generate filled contour plots in MATLAB with a specified number of levels.
According to the documentation (https://www.mathworks.com/help/matlab/ref/contourf.html#mw_9088c636-4036-4e00-bd43-f6c5632b63ec):
"Specify levels as a scalar value n to display the contour lines at n automatically chosen levels (heights)."
How are these levels chosen automatically? What is the algorithm for choosing the thresholds? Take a level count of 1 as an example.
Many thanks!
As said in the comments, it just places n evenly spaced dividing lines strictly between the min and max of your data.
Proof:
n=10;
z=peaks;
[m,c]=contour(z,n,'ShowText','on');
levels=linspace(min(z(:)),max(z(:)),n+2);
isequal(c.LevelList,levels(2:end-1))

How to quickly/easily merge and average data in matrix in MATLAB?

I have a matrix of air-fuel ratio (AFR) values at certain engine speeds and throttle positions (e.g. the AFR is 14 at 2500 rpm and 60% throttle).
The matrix is currently 25x10: the engine speed ranges from 1200 to 6000 rpm in steps of 200 rpm, and the throttle ranges from 0.1 to 1 in steps of 0.1.
Say I have measured new values, e.g. an AFR of 13.5 at 2138 rpm and 74.3% throttle. How do I merge that into the matrix? The closest matrix values are at 2000 or 2200 rpm and 70 or 80% throttle. I also don't want new data to simply replace the older data. How can I make the matrix take this value in and adjust its values to take the new value into account?
Simplified, I have the following x-axis values (top row) and a 1x4 matrix (below):
2 4 6 8
14 16 18 20
I just measured an AFR value of 15.5 at 3 rpm. If you interpolate the AFR matrix you would get 15, so this value is out of the ordinary.
I want the matrix to take in this data and adjust the other values to it, i.e. average everything so that the more data I put in, the more reliable and accurate the matrix becomes. So in the simplified case the matrix would become something like:
2 4 6 8
14.3 16.3 18.2 20.1
So it averages between old and new data. I've read the documentation about concatenation but I believe my problem can't be solved that way.
EDIT: To clarify my question, here is a visual illustration.
The 'matrix' keeps the same size of 5 points while a new data point is added. It takes the new data into account and adjusts the matrix accordingly. This is what I'm trying to achieve. The more scattered data I get, the more accurate the matrix becomes. (And yes, the green dot in this case would be an outlier, but it explains my case.)
Cheers
This is not a matter of simple merge/average. I don't think there's a quick method to do this unless you have simplifying assumptions. What you want is a statistical inference of the underlying trend. I suggest using Gaussian process regression to solve this problem. There's a great MATLAB toolbox by Rasmussen and Williams called GPML. http://www.gaussianprocess.org/gpml/
This sounds more like a data fitting task to me. What you are describing is a set of measurements for which you wish to get the best linear fit. Instead of updating a table of data in place, keep a table of all the measured values and then find the best fit to those values. So, for example, I could create a matrix, A, which holds all of the recorded values. Let's start with:
A=[2,14;3,15.5;4,16;6,18;8,20];
I now need a matrix of inputs for my fitting curve (which, in this instance, let's assume is linear, so the columns are 1 and x):
B=[ones(size(A,1),1), A(:,1)];
We can find the linear fit parameters (where it cuts the y-axis and the gradient) using:
B\A(:,2)
Or, if you want the points that the line goes through for the values of x:
B*(B\A(:,2))
This results in the points:
2, 14.1897
3, 15.1552
4, 16.1207
6, 18.0517
8, 19.9828
which represents the best fit line through these points.
You can manually extend this to polynomial fitting if you want, or you can use the Matlab function polyfit. To manually extend the process you should use a revised B matrix. You can also produce only a specified set of points in the last line. The complete code would then be:
% Original measurements - could be read in from a file,
% but for this example we will set it to a matrix
% Note that not all tabulated values need to be present
A=[2,14; 3,15.5; 4,16; 5,17; 8,20];
% Now create the polynomial values of x corresponding to
% the data points. Choosing a second order polynomial...
B=[ones(size(A,1),1), A(:,1), A(:,1).^2];
% Find the polynomial coefficients for the best fit curve
coeffs=B\A(:,2);
% Now generate a table of values at specific points
% First define the x-values
tabinds = 2:2:8;
% Then generate the polynomial values of x
tabpolys=[ones(length(tabinds),1), tabinds', (tabinds').^2];
% Finally, multiply by the coefficients found
curve_table = [tabinds', tabpolys*coeffs];
% and display the results
disp(curve_table);
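For comparison, here is a short sketch of the polyfit route mentioned above, using the same measurements and the same second-order polynomial. The variable name p is my own; polyfit returns the coefficients with the highest power first, which is the reverse of the order in coeffs above.

```matlab
% Same measurements as in the manual version
A = [2,14; 3,15.5; 4,16; 5,17; 8,20];
% Least-squares fit of a second-order polynomial; p = [a2 a1 a0]
p = polyfit(A(:,1), A(:,2), 2);
% Evaluate the fitted polynomial at the tabulated x-values
tabinds = (2:2:8)';
curve_table = [tabinds, polyval(p, tabinds)];
disp(curve_table);
```

Each new measurement is simply appended as another row of A before refitting, so older data is never overwritten.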

How can I classify my data for K-Means Clustering

A proof of concept prototype I have to do for my final year project is to implement K-Means Clustering on a big data set and display the results on a graph. I only know object-oriented languages like Java and C# and decided to give MATLAB a try. I notice that with a functional language the approach to solving problems is very different, so I would like some insight on a few things if possible.
Suppose I have the following data set:
raw_data
400.39 513.29 499.99 466.62 396.67
234.78 231.92 215.82 203.93 290.43
15.07 14.08 12.27 13.21 13.15
334.02 328.79 272.2 306.99 347.79
49.88 52.2 66.35 47.69 47.86
732.88 744.62 687.53 699.63 694.98
And I picked row 2 and 4 to be the 2 centroids:
centroids
234.78 231.92 215.82 203.93 290.43 % Centroid 1
334.02 328.79 272.2 306.99 347.79 % Centroid 2
I now want to compute the Euclidean distance of each point to each centroid, then assign each point to its closest centroid and display this on a graph. Let's say I want to classify the centroids as blue and green. How can I do this in MATLAB? If this was Java I would initialise each row as an object and add it to separate ArrayLists (representing the clusters).
If rows 1, 2 and 3 all belong to the first centroid / cluster, and rows 4, 5 and 6 belong to the second centroid / cluster - how can I classify these to display them as blue or green points on a graph? I am new to MATLAB and really curious about this. Thanks for any help.
(To begin with, MATLAB has a flexible distance function, pdist2, and also a kmeans implementation, but I'm assuming that you want to build your code from scratch.)
In MATLAB, you try to express everything as matrix algebra, without loops over elements.
In your case, if R is the raw_data matrix and C is the centroids matrix,
you can shift the dimension that represents the centroid number to the 3rd place by
permC=permute(C,[3 2 1]);
Then the bsxfun function allows you to subtract C from R while expanding R's third dimension as necessary:
D=bsxfun(@minus,R,permC)
Element-wise squaring followed by summation across columns,
SqD=sum(D.^2,2)
gives you the squared distance of each observation from each centroid. Performing all these operations within a single statement and shifting the third (centroid) dimension back to the 2nd place looks like this:
SqD=permute(sum(bsxfun(@minus,R,permute(C,[3 2 1])).^2,2),[1 3 2])
Picking the centroid of minimal distance is now straightforward: [minDist,minCentroid]=min(SqD,[],2)
If this looks complex, I recommend inspecting the product of each sub-step and reading the help of each command.
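Putting the steps together with the data from the question, a minimal sketch of the assignment and the blue/green plot could look like this. The colour table and the choice to plot only the first two columns are my own assumptions for illustration; the points live in 5 dimensions, so any 2-D plot shows just a slice of them.

```matlab
% Raw data and centroids from the question (rows 2 and 4 are the centroids)
R = [400.39 513.29 499.99 466.62 396.67
     234.78 231.92 215.82 203.93 290.43
      15.07  14.08  12.27  13.21  13.15
     334.02 328.79 272.20 306.99 347.79
      49.88  52.20  66.35  47.69  47.86
     732.88 744.62 687.53 699.63 694.98];
C = R([2 4],:);

% Squared Euclidean distance of every row of R to every centroid (6-by-2)
SqD = permute(sum(bsxfun(@minus,R,permute(C,[3 2 1])).^2,2),[1 3 2]);
% Index of the closest centroid for each row
[minDist,minCentroid] = min(SqD,[],2);

% Colour each point by its cluster: 1 -> blue, 2 -> green
colors = [0 0 1; 0 1 0];
figure
scatter(R(:,1),R(:,2),60,colors(minCentroid,:),'filled')
xlabel('column 1'); ylabel('column 2')
```

Here minCentroid plays the role that the separate ArrayLists would play in Java: R(minCentroid==1,:) are the rows of the first cluster and R(minCentroid==2,:) the rows of the second.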

Coordinate normalization for NN input in matlab

I am trying to implement a classification NN in Matlab.
My inputs are clusters of coordinates from an image (corresponding to Delaunay triangulation vertices).
There are 3 clusters (results of the OPTICS algorithm) in this format:
(Not all clusters are of the same size.) Elements represent coordinates in Euclidean 2-D space, so (110,12) is a point in my image and the matrix depicted represents one cluster of points.
Clustering was done on image edges, so the coordinates refer to logical values (always 1s in this case) in the image matrix. (After edge detection there are 3 "dense" areas in an image, and these collections of pixels are used for classification.) There are 6 target classes.
So, my question is: how can I format them into single column vector inputs to use in a neural network?
(There is a relevant answer here, but I would like some elaboration if possible. I am probably too tired right now from 12 hours of trying stuff and don't get it 100%.)
Remember, there are 3 different coordinate matrices for each picture, so my initial thought was to create an NN with 3 inputs (of different lengths). But how do I serialize this?
Here's a cluster with its tags on in case it helps:
For you to train the classifier, you need a matrix X where each row will correspond to an image. If you want to use a coordinate representation, this means all images will have to be of the same size, say, M by N. So, the row of an image will have M times N elements (features) and the corresponding feature values will be the cluster assignments. Class vector y will be whatever labels you have, that is one of the six different classes you mentioned through the comments above. You should keep in mind that if you use a coordinate representation, X can get very high-dimensional, and unless you have a large number of images, chances are your classifier will perform very poorly. If you have few images, consider using fractions of pixels belonging to clusters that I suggested in one of the comments: this can give you a shorter feature description that is invariant to rotation and translation, and may yield better classification.
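As a rough sketch of both representations, assume each picture has already been turned into an M-by-N label image in which 0 marks background and 1..3 mark the cluster a pixel belongs to. That input format, the random stand-in data and all variable names here are hypothetical, and reading "fractions of pixels" as the share of image pixels in each cluster is my own interpretation.

```matlab
% Hypothetical stand-in data: two 4-by-5 label images with 3 clusters
M = 4; N = 5; numClusters = 3;
labelImages = {randi([0 numClusters],M,N), randi([0 numClusters],M,N)};
numImages = numel(labelImages);

X = zeros(numImages, M*N);          % coordinate representation
F = zeros(numImages, numClusters);  % fraction-of-pixels representation
for i = 1:numImages
    L = labelImages{i};
    X(i,:) = L(:)';                 % one row per image, one column per pixel
    for k = 1:numClusters
        F(i,k) = nnz(L == k) / numel(L);   % share of pixels in cluster k
    end
end
```

X grows with the image size (M*N columns), whereas F always has one column per cluster, which is why the second form is the more practical one when only a few images are available.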

K-means Clustering, major understanding issue

Suppose that we have a set of 64-dimensional vectors to cluster; let's say the dataset matrix dt is 64x150.
Using the kmeans function from the VLFeat library, I will cluster my dataset into 20 centers:
[centers, assignments] = vl_kmeans(dt, 20);
centers is a 64x20 matrix.
assignments is a 1x150 matrix with values inside it.
According to manual: The vector assignments contains the (hard) assignments of the input data to the clusters.
I still cannot understand what the numbers in the assignments matrix mean. Could anyone help me a bit here? An example would be great. What do these values represent?
In k-means the problem you are trying to solve is the problem of clustering your 150 points into 20 clusters. Each point is a 64-dimension point and thus represented by a vector of size 64. So in your case dt is the set of points, each column is a 64-dim vector.
After running the algorithm you get centers and assignments. centers holds the 20 positions of the cluster centers in the 64-dim space, in case you want to visualize them, measure distances between points and clusters, etc. assignments, on the other hand, contains the actual assignment of each 64-dim point in dt. So if assignments(7) is 15, the 7th vector in dt belongs to the 15th cluster.
For example, here you can see a clustering of lots of 2-D points, let's say 1000, into 3 clusters. In this case dt would be 2x1000, centers would be 2x3, and assignments would be 1x1000 and would hold numbers ranging from 1 to 3 (or 0 to 2, in case you're using OpenCV).
EDIT:
The code to produce this image is located here: http://pypr.sourceforge.net/kmeans.html#k-means-example along with a tutorial on kmeans for pyPR.
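To make the meaning of assignments concrete, here is a tiny hand-made sketch in MATLAB. The data, centers and labels are invented for illustration and laid out the way the question describes the vl_kmeans outputs (one column per point); nothing here calls VLFeat itself.

```matlab
% Five 2-D points, one per column, and three cluster centers
dt          = [1.0 1.2 5.0 5.1 9.0
               1.0 0.9 5.0 4.8 9.0];
centers     = [1.10 5.05 9.0
               0.95 4.90 9.0];
assignments = [1 1 2 2 3];     % cluster index for each column of dt

% All points that belong to cluster 2
members2 = dt(:, assignments == 2);
% The center that the 4th point was assigned to
centerOf4th = centers(:, assignments(4));
% Number of points per cluster (cast with double() first if the
% assignments vector comes back as an integer type)
counts = accumarray(double(assignments(:)), 1)';
```

So assignments is simply a lookup from column number in dt to column number in centers.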
In OpenCV it is the index of the cluster that each input point belongs to.