Matlab plot scatterplot with intermediate coordinates - matlab

This is an exam task that I have. Let's say I have a 200x6 matrix where 200 people rated a movie with respect to 6 questions, each on a continuous [0, 1] scale (0: disagree, 1: agree).
To get a useful overview of the 6-dimensional dataset I want to plot the rank-2 approximation of the data. First I do the rank 2 approximation:
A = rand(200, 6); % some data (placeholder: random values in [0, 1])
[U, S, V] = svd(A);
Ak = U(:, 1:2) * S(1:2, 1:2) * V(:, 1:2)';
I want to plot this approximation as a 2D scatterplot with a "*" mark per survey participant, using either U or V coordinates as intermediate coordinates depending on how my data is organized. The problem is that I don't know what intermediate coordinates mean, and I can't find a good explanation anywhere. I wonder if someone could help, possibly with a small code example. Any help appreciated, thank you.

Formally, intermediate axes are orthogonal linear combinations of your data, taken along the directions of maximum explained variance (a.k.a. principal components).
If most of the data have a similar shape (e.g. a [5 4 3 2 1 0] pattern), then the first component will be similar to this shape/vector, since the variance around it is minimal (or: the variance along it is maximal). Each subsequent component likewise captures as much of the remaining variance as possible, in a direction orthogonal to the previous ones.
So, the answer is: principal components 1 and 2.
More precisely: the first intermediate coordinate of a single data sample can be understood as the magnitude of that "first main pattern" in the sample.
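A minimal sketch of such a plot, assuming each row of A is one participant (so the per-participant coordinates come from U) and that the coordinates are scaled by the singular values. The random A is only a placeholder for the real survey data, and coords is a name introduced here for illustration:

```matlab
% Placeholder data: 200 participants (rows) x 6 questions (columns)
A = rand(200, 6);

% Economy-size SVD: U is 200x6, S is 6x6, V is 6x6
[U, S, V] = svd(A, 'econ');

% Intermediate coordinates of each participant along the first two
% components. This equals A * V(:, 1:2), i.e. the data projected onto
% the first two right singular vectors.
coords = U(:, 1:2) * S(1:2, 1:2);

% One '*' per participant
figure;
plot(coords(:, 1), coords(:, 2), '*');
xlabel('Component 1');
ylabel('Component 2');
```

If the matrix were stored the other way round (6x200, one column per participant), the same role would be played by the first two columns of V. Note that A is not mean-centred here; subtracting the column means first would make this a principal component analysis in the strict sense, and the question leaves that choice open.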

Related

Distance Calculations for Nearest Mean Classifier

Greetings,
How can I calculate how many distance calculations would need to be performed to classify the IRIS dataset using a Nearest Mean Classifier?
I know that the IRIS dataset has 4 features and that every record is classified according to 3 different labels.
According to some textbooks, the calculation can be carried out as follows:
However, I am lost on these different notations and on what this equation means. For example, what is s^2 in the equation?
The notation is standard with most machine learning textbooks. s in this case is the sample standard deviation for the training set. It is quite common to assume that each class has the same standard deviation, which is why every class is assigned the same value.
However, you shouldn't be paying attention to that. The most important point is the case where the priors are equal. This is a fair assumption, and it means that you expect each class to be roughly equally represented in your dataset. Under this assumption, the classifier simply boils down to finding the smallest distance from a sample x to each of the classes, as represented by their mean vectors.
How you'd compute this is quite simple. In your training set, you have a set of training examples with each example belonging to a particular class. For the case of the iris dataset, you have three classes. You find the mean feature vector for each class, which would be stored as m1, m2 and m3 respectively. After, to classify a new feature vector, simply find the smallest distance from this vector to each of the mean vectors. Whichever one has the smallest distance is the class you'd assign.
Since you chose MATLAB as the language, allow me to demonstrate with the actual iris dataset.
load fisheriris; % Load iris dataset (meas: features, species: class labels)
[~,~,id] = unique(species); % Assign for each example a unique ID
means = zeros(3, 4); % Store the mean vectors for each class
for i = 1 : 3 % Find the mean vectors per class
    means(i,:) = mean(meas(id == i, :), 1); % Find the mean vector for class i
end
x = meas(10, :); % Choose a random row from the dataset
% Determine which class has the smallest distance and thus figure out the class
[~,c] = min(sum(bsxfun(@minus, x, means).^2, 2));
The code is fairly straightforward. Load in the dataset and, since the labels are in a cell array, create a new set of labels that are enumerated as 1, 2 and 3 so that it's easy to isolate the training examples per class and compute their mean vectors. That's what's happening in the for loop. Once that's done, I choose a random data point from the training set, then compute the distance from this point to each of the mean vectors. We choose the class that gives us the smallest distance.
If you wanted to do this for the entire dataset, you can but that will require some permutation of the dimensions to do so.
data = permute(meas, [1 3 2]);
means_p = permute(means, [3 1 2]);
P = sum(bsxfun(@minus, data, means_p).^2, 3);
[~,c] = min(P, [], 2);
data and means_p are the features and mean vectors rearranged into 3D arrays, each with a singleton dimension, so that bsxfun can expand them against each other. The third line of code computes all of the squared distances in one vectorized step and produces a 2D matrix in which row i holds the distances from training example i to each of the mean vectors. We finally find the class with the smallest distance for each example.
To get a sense of the accuracy, we can simply compute the fraction of the total number of times we classified correctly:
>> sum(c == id) / numel(id)
ans =
0.9267
With this simple nearest mean classifier, we have an accuracy of 92.67%... not bad, but you can do better. Finally, to answer your question, you would need K * d distance calculations, with K being the number of examples and d being the number of classes. You can clearly see that this is required by examining the logic and code above.
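For the iris dataset that works out to 150 * 3 = 450 distance calculations. As a rough sketch, the count can be confirmed by writing the classification as explicit loops and incrementing a counter. The random matrices below are placeholders with the same sizes as meas and means (an assumption made so the snippet runs without the toolbox dataset):

```matlab
numExamples = 150;  % rows in the iris dataset
numClasses  = 3;    % setosa, versicolor, virginica
numFeatures = 4;

% Placeholder data with the same shape as meas and means above
data  = rand(numExamples, numFeatures);
means = rand(numClasses, numFeatures);

count  = 0;                        % number of distance calculations
labels = zeros(numExamples, 1);    % predicted class per example
for i = 1 : numExamples
    best = inf;
    for j = 1 : numClasses
        d = sum((data(i, :) - means(j, :)).^2);  % one squared distance
        count = count + 1;
        if d < best
            best = d;
            labels(i) = j;
        end
    end
end
% count now equals numExamples * numClasses
```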

MATLAB: give IDs to points stored in a matrix to distinguish between neighbours

Even though the title might sound trivial at first, I hope someone can help me by giving me hints about the MATLAB functions I can use:
I have a matrix of points with properties for each (read: individuals with properties) of the form (x, y, direction):
A = [1 1 45°]
B = [3 1 225°]
C = [0 2 90°]
D = [5 5 187°]
With a probability P, particle A chooses one of B and C as a neighbour and turns its direction according to that neighbour (while D is too far away). EDIT: it also moves towards it with a constant velocity (I basically forgot the most important part of the question ..., stupid me).
I have now implemented a matrix called:
I = [1 1 45; 3 1 225; 0 2 90; 5 5 187];
In a scenario A chooses C (randomly) as attractive neighbour and turns towards C. This means my program has to be able to distinguish between B and C.
Does there maybe exist a type like "point" where you can store properties with an ID? Do I have to use vectors instead of one matrix? I am working with a lot of individuals right now, so preallocating 50 vectors would not be optimal (this is why I chose a matrix).
To make a clear question:
I have a lot of points, I need to store 3 properties to an ID for each point and then check for one point with IDx which other points with IDy's are within reach.
The mathematics are irrelevant for now, but I need a function in MATLAB that gives a better option than storing this information in a matrix (because that does not seem good for identifying each point). This is part of a flocking simulation for individuals.
If anyone can help me with this I would be very happy! If I asked that question in a bad way please give me feedback as well to clarify.
Thanks!
From what I understood from you, the following can be done:
When you store your elements in the original matrix, let the row index be their ID.
Since points do not change locations but only orientation, you can compute a matrix of relative distances just once (the upper triangle suffices, since the n x n distance matrix is symmetric).
In the distance matrix use the IDs you have from your first matrix as IDs for the same objects in the second matrix. Your search will be a min-search over ~0.5*n^2 elements.
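A small sketch of this idea, using the matrix I from the question with the row index as the ID. The interaction radius reach and the variable names are hypothetical, chosen only for illustration:

```matlab
% Rows are individuals: [x, y, direction in degrees]; row index = ID
I = [1 1 45; 3 1 225; 0 2 90; 5 5 187];
n = size(I, 1);

% Full pairwise distance matrix: D(a, b) is the distance between IDs a and b
dx = bsxfun(@minus, I(:, 1), I(:, 1).');
dy = bsxfun(@minus, I(:, 2), I(:, 2).');
D  = sqrt(dx.^2 + dy.^2);

reach = 3;   % hypothetical interaction radius
id    = 1;   % individual A

% IDs of all other individuals within reach of A
neighbours = find(D(id, :) <= reach & (1:n) ~= id);

% Pick one of them at random as the attractive neighbour
chosen = neighbours(randi(numel(neighbours)));
```

With these numbers, neighbours contains IDs 2 and 3 (B and C), while D is out of reach. If the individuals also move, as the edit in the question says, D would have to be recomputed after each step.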

How can I classify my data for K-Means Clustering

A proof of concept prototype I have to do for my final year project is to implement K-Means Clustering on a big data set and display the results on a graph. I only know object-oriented languages like Java and C# and decided to give MATLAB a try. I notice that with a functional language the approach to solving problems is very different, so I would like some insight on a few things if possible.
Suppose I have the following data set:
raw_data
400.39 513.29 499.99 466.62 396.67
234.78 231.92 215.82 203.93 290.43
15.07 14.08 12.27 13.21 13.15
334.02 328.79 272.2 306.99 347.79
49.88 52.2 66.35 47.69 47.86
732.88 744.62 687.53 699.63 694.98
And I picked row 2 and 4 to be the 2 centroids:
centroids
234.78 231.92 215.82 203.93 290.43 % Centroid 1
334.02 328.79 272.2 306.99 347.79 % Centroid 2
I now want to compute the Euclidean distance from each point to each centroid, then assign each point to its closest centroid and display this on a graph. Let's say I want to colour the two clusters blue and green. How can I do this in MATLAB? If this was Java I would initialise each row as an object and add it to separate ArrayLists (representing the clusters).
If rows 1, 2 and 3 all belong to the first centroid / cluster, and rows 4, 5 and 6 belong to the second centroid / cluster - how can I classify these to display them as blue or green points on a graph? I am new to MATLAB and really curious about this. Thanks for any help.
(To begin with, Matlab has a flexible distance measuring function, pdist2 and also kmeans implementation, but I'm assuming that you want to build your code from scratch).
In Matlab, you try to implement everything as matrix algebra, without loops over elements.
In your case, if R is the raw_data matrix and C is the centroids matrix,
you can shift the dimension that represents centroid number to the 3rd place by
permC=permute(C,[3 2 1]); Then the bsxfun function allows you to subtract C from R while expanding R's third dimension as necessary: D=bsxfun(@minus,R,permC). Element-wise square followed by summation across columns SqD=sum(D.^2,2) will give you the squared distances of each observation from each centroid. Performing all these operations within a single statement and shifting the third (centroid) dimension back to the 2nd place will look like this:
SqD=permute(sum(bsxfun(@minus,R,permute(C,[3 2 1])).^2,2),[1 3 2])
Picking the centroid of minimal distance is now straightforward: [minDist,minCentroid]=min(SqD,[],2)
If this looks complex, I recommend inspecting the product of each sub-step and reading the help of each command.
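Put together with the data from the question, this could look like the sketch below. Since the points are 5-dimensional, only the first two columns are plotted, which is an arbitrary choice made here for illustration:

```matlab
% Raw data and centroids (rows 2 and 4) from the question
R = [400.39 513.29 499.99 466.62 396.67
     234.78 231.92 215.82 203.93 290.43
      15.07  14.08  12.27  13.21  13.15
     334.02 328.79 272.20 306.99 347.79
      49.88  52.20  66.35  47.69  47.86
     732.88 744.62 687.53 699.63 694.98];
C = R([2 4], :);

% Squared Euclidean distance of every row of R to every centroid (6x2)
SqD = permute(sum(bsxfun(@minus, R, permute(C, [3 2 1])).^2, 2), [1 3 2]);

% Index of the closest centroid for every row
[minDist, minCentroid] = min(SqD, [], 2);

% Plot columns 1 and 2 only: cluster 1 in blue, cluster 2 in green
figure;
hold on;
plot(R(minCentroid == 1, 1), R(minCentroid == 1, 2), 'b.', 'MarkerSize', 15);
plot(R(minCentroid == 2, 1), R(minCentroid == 2, 2), 'g.', 'MarkerSize', 15);
plot(C(1, 1), C(1, 2), 'bx', 'MarkerSize', 12);   % centroid 1
plot(C(2, 1), C(2, 2), 'gx', 'MarkerSize', 12);   % centroid 2
```

Logical indexing with minCentroid == k plays the role that the separate ArrayLists would play in Java: it selects the rows belonging to cluster k without copying them into a new container.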

k-means clustering using function 'kmeans' in MATLAB

I have this matrix:
x = [2+2*i 2-2*i -2+2*i -2-2*i];
I want to simulate transmitting it and adding noise to it. I represented the components of the complex number as below:
A = randn(150, 2) + 2*ones(150, 2); C = randn(150, 2) - 2*ones(150, 2);
At the receiver, I received the below vector, where the components are ordered based on what I sent originally, i.e., the components of x).
X = [A A A C C A C C];
Now I want to apply the kmeans(X) to have four clusters, so kmeans(X, 4). I am experiencing the following problems:
I am not sure if I can represent the complex numbers as shown in X above.
I can't plot the result of the kmeans to show the clusters.
I could not understand the clusters centroid results.
How can I find the best error rate, if this example was to represent a communication system and at the receiver, k-means clustering was used in order to decide what the transmitted signal was?
If you don't "understand" the cluster centroid results, then you don't understand how k-means works. I'll present a small summary here.
How k-means works is that for some data that you have, you want to group them into k groups. You initially choose k random points in your data, and these will have labels from 1,2,...,k. These are what we call the centroids. Then, you determine how close the rest of the data are to each of these points. You then group those points so that whichever points are closest to any of these k points, you assign those points to belong to that particular group (1,2,...,k). After, for all of the points for each group, you update the centroids, which actually is defined as the representative point for each group. For each group, you compute the average of all of the points in each of the k groups. These become the new centroids for the next iteration. In the next iteration, you determine how close each point in your data is to each of the centroids. You keep iterating and repeating this behaviour until the centroids don't move anymore, or they move very little.
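The iteration described above can be sketched in a few lines. This is only an illustration and not what the built-in kmeans does internally; the toy data and the fixed initial centroids (rows 1 and 4, instead of random points) are assumptions made so the result is reproducible:

```matlab
% Toy data: two well-separated groups of 2D points
X = [0 0; 0 1; 1 0; 10 10; 10 11; 11 10];
k = 2;

% Initial centroids: fixed rows here, random rows in the usual description
centroids = X([1 4], :);

for iter = 1 : 100
    % Squared distance of every point to every centroid
    D = zeros(size(X, 1), k);
    for j = 1 : k
        D(:, j) = sum(bsxfun(@minus, X, centroids(j, :)).^2, 2);
    end

    % Assign each point to its closest centroid
    [~, idx] = min(D, [], 2);

    % Update each centroid to the mean of its assigned points
    newCentroids = centroids;
    for j = 1 : k
        if any(idx == j)
            newCentroids(j, :) = mean(X(idx == j, :), 1);
        end
    end

    % Stop once the centroids no longer move
    if max(abs(newCentroids(:) - centroids(:))) < 1e-10
        break;
    end
    centroids = newCentroids;
end
```

After the loop, idx holds the group label of each point and each row of centroids is the mean of one group, which is the same layout as the two outputs of kmeans used further down.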
Now, let's answer your questions one-by-one.
1. Complex number representation
k-means in MATLAB doesn't define how complex data is handled. A common way for people to deal with complex numbered data is to split up the real and imaginary parts into separate dimensions as you have done. This is a perfectly valid way to use k-means for complex valued data.
See this post on the MathWorks MATLAB forum for more details: https://www.mathworks.com/matlabcentral/newsreader/view_thread/78306
2. Plot the results
You aren't constructing your matrix X properly. Note that A and C are both 150 x 2 matrices. You need to structure X such that each row is a point and each column is a variable, so you need to concatenate A and C row-wise:
X = [A; A; A; C; C; A; C; C];
Note that you have duplicate points. This is actually no different than doing X = [A; C]; as far as kmeans is concerned. Perhaps you should generate X, then add the noise in rather than taking A and C, adding noise, then constructing your signal.
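One possible reading of that suggestion, as a sketch: draw a sequence of symbols from x, add complex Gaussian noise to the sequence, then split the result into real and imaginary columns. The sequence length of 150, the random choice of symbols, and the names symbols, txIdx, tx and rx are assumptions made for illustration:

```matlab
% The four symbols from the question
symbols = [2+2i, 2-2i, -2+2i, -2-2i];
n = 150;                                  % assumed sequence length

% Random transmitted sequence (which symbol is sent at each time step)
txIdx = randi(4, n, 1);
tx    = reshape(symbols(txIdx), [], 1);   % n x 1 complex column

% Add unit-variance Gaussian noise to the real and imaginary parts
rx = tx + (randn(n, 1) + 1i * randn(n, 1));

% One row per received point, one column per dimension
X = [real(rx) imag(rx)];
```

Keeping txIdx around means the cluster assignments can later be compared against what was actually sent.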
Now, if you want to plot the results as well as the centroids, what you need to do is use the two output version of kmeans like so:
[idx, centroids] = kmeans(X, 4);
idx will contain the cluster number that each point in X belongs to, and centroids will be a 4 x 2 matrix where each row tells you the mean of each cluster found in the data. If you want to plot the data as well as the clusters, you simply need to do the following. I'm going to loop over each cluster membership and plot the results on a figure. I'm also going to colour in where the mean of each cluster is located:
x = X(:,1);
y = X(:,2);
figure;
hold on;
colors = 'rgbk';
for num = 1 : 4
plot(x(idx == num), y(idx == num), [colors(num) '.']);
end
plot(centroids(:,1), centroids(:,2), 'c.', 'MarkerSize', 14);
grid;
The above code goes through each cluster, plots them in a different colour, then plots the centroids in cyan with a slightly larger thickness so you can see what the graph looks like.
This is what I get:
3. Understanding centroid results
This is probably because you didn't construct X properly. This is what I get for my centroids:
centroids =
-1.9176 -2.0759
1.5980 2.8071
2.7486 1.6147
0.8202 0.8025
This is pretty self-explanatory and I talked about how this is structured earlier.
4. Best representation of the signal
What you can do is repeat the clustering a number of times, then the algorithm will decide what the best clustering was out of these times. You would simply use the Replicates flag and denote how many times you want this run. Obviously, the more times you run this, the better your results may be. Therefore, do something like:
[idx, centroids] = kmeans(X, 4, 'Replicates', 5);
This will run kmeans 5 times and give you the best centroids of these 5 times.
Now, if you want to determine the best estimate of the sequence that was transmitted, you'd have to split up your X into blocks of 150 rows each (as your random sequence was 150 elements), then run a separate kmeans on each subset. You can try to find the best representation of each part of the sequence by using the Replicates flag each time, so you can do something like:
for num = 1 : 8
%// Look at 150 points at a time
[idx, centroids] = kmeans(X((num-1)*150 + 1 : num*150, :), 4, 'Replicates', 5);
%// Do your analysis
%//...
%//...
end
idx and centroids would be the results for each portion of your transmitted signal. You probably want to look at centroids at each iteration to determine what symbol was transmitted at a particular time.
If you want to plot the decision regions, then you're probably looking for a Voronoi diagram. All you do is given a set of points that are defined within the domain of your problem, you just have to determine which cluster each point belongs to. Given that our data spans between -5 <= (x,y) <= 5, let's go through each point in the grid and determine which cluster each point belongs to. We'd then colour the appropriate point according to which cluster it belongs to.
Something like:
colors = 'rgbk';
[X,Y] = meshgrid(-5:0.05:5, -5:0.05:5);
X = X(:);
Y = Y(:);
figure;
hold on;
for idx = 1 : numel(X)
    [~,ind] = min(sum(bsxfun(@minus, [X(idx) Y(idx)], centroids).^2, 2));
    plot(X(idx), Y(idx), [colors(ind), '.']);
end
plot(centroids(:,1), centroids(:,2), 'c.', 'MarkerSize', 14);
The above code will plot the decision regions / Voronoi diagram of the particular configuration, as well as where the cluster centres are located. Note that the code is rather unoptimized and it'll take a while for the graph to generate, but I wanted to write something quick to illustrate my point.
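A vectorized variant of the same idea, as a sketch: all grid points are assigned to their nearest centroid in one step, and each region is drawn with a single plot call. The centroids below are hypothetical stand-ins for the output of kmeans, and gx, gy and G are names introduced here so that the data matrix X is not overwritten:

```matlab
% Hypothetical centroids; in practice use the second output of kmeans
centroids = [-2 -2; 2 2; 2 -2; -2 2];

% Grid of points covering the domain, one row per point
[gx, gy] = meshgrid(-5:0.05:5, -5:0.05:5);
G = [gx(:) gy(:)];

% Squared distance of every grid point to every centroid, all at once
SqD = permute(sum(bsxfun(@minus, G, permute(centroids, [3 2 1])).^2, 2), [1 3 2]);
[~, ind] = min(SqD, [], 2);

% One plot call per region instead of one per grid point
colors = 'rgbk';
figure;
hold on;
for k = 1 : 4
    plot(G(ind == k, 1), G(ind == k, 2), [colors(k) '.']);
end
plot(centroids(:, 1), centroids(:, 2), 'c.', 'MarkerSize', 14);
```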
Here's what the decision regions look like:
Hope this helps! Good luck!

3-D Plotting with MATLAB for Galton's Skewness and Moors' Kurtosis

I know there are many plotting documents for MATLAB online and I am pretty sure that this has been asked many times. I apologize in advance for any inconvenience.
I am dealing with a new distribution and I need to draw a 3D plot for different values of its parameters (I could do it with Excel or another program, but since my other graphs are drawn with MATLAB, I need to produce this 3D plot in MATLAB too, in order to publish it in an article). I calculated the results using MATLAB loops, but the plotting is giving me the hardest time. I had no other choice but to ask for your assistance. I have these equations for different alphas and betas with a constant sigma, and I calculate Galton's skewness and Moors' kurtosis with the last two equations.
median=sqrt(2*(sigma^2)*beta*gammaincinv(0.5,alpha));
q1=sqrt(2*(sigma^2)*beta*gammaincinv((6/8),alpha));
q3=sqrt(2*(sigma^2)*beta*gammaincinv((2/8),alpha));
q4=sqrt(2*(sigma^2)*beta*gammaincinv((7/8),alpha));
q5=sqrt(2*(sigma^2)*beta*gammaincinv((5/8),alpha));
q6=sqrt(2*(sigma^2)*beta*gammaincinv((3/8),alpha));
q7=sqrt(2*(sigma^2)*beta*gammaincinv((1/8),alpha));
galtonskewness=(q1-2*median+q3)/(q1-q3);
moorskurtosis=(q4-q5+q6-q7)/(q1-q3);
Let's assume that,
sigma=1
beta=[0.1 0.2 0.5 1 2 5];
alpha=[0.1 0.2 0.5 1 2 5];
I have used mesh(X,Y,Z) for the same range of alphas and betas with the same increment, but I get the error "these values cannot be complex". I just want to draw something like the one below.
It must be something easy that I am missing out, but I do not understand where the mistake is. I appreciate any help. Thank you!
I ran the above code for a 2D mesh of points for alpha and beta between 0.1 and 5 for both dimensions and I got results for both.
I suspect it's due to your alpha and beta declaration. You are only providing a few points, and if you try to use mesh on those, you won't get good results. Therefore, define a meshgrid of points for both alpha and beta, then vectorize your MATLAB code to produce the kurtosis and skewness surfaces. Only under certain situations should you use for loops. In general, you should avoid using them whenever possible.
How meshgrid works is that given a range of X and Y values, it will produce two (or three if you want 3D co-ordinates) arrays where each location in each array gives you the spatial co-ordinate at that particular location. Therefore, if we did something like:
[X,Y] = meshgrid(1:3, 1:3);
This is what we get:
X =
1 2 3
1 2 3
1 2 3
Y =
1 1 1
2 2 2
3 3 3
Notice that in a 2D grid, for the top-left corner, (x,y) = (1,1), and so for the corresponding location in X, we get 1 and Y we get 1. If you do the same logic for any other position in the 2D grid, you simply look at the X and Y values in each array and it will tell you what the component is for each dimension.
As such, instead of looping through all possible points in your grid, generate them all using meshgrid, then vectorize the computation by calculating your values all at once rather than individually. Once you do this, you have the right structure to be able to put this into mesh.
Therefore, try doing this instead:
%// Define meshgrid of points
[alpha,beta] = meshgrid(0.1:0.1:5, 0.1:0.1:5);
%// From your code
sigma = 1;
%// Calculate quantities - Notice that this is all vectorized
med=sqrt(2*(sigma^2)*beta.*gammaincinv(0.5,alpha));
q1=sqrt(2*(sigma^2)*beta.*gammaincinv((6/8),alpha));
q3=sqrt(2*(sigma^2)*beta.*gammaincinv((2/8),alpha));
q4=sqrt(2*(sigma^2)*beta.*gammaincinv((7/8),alpha));
q5=sqrt(2*(sigma^2)*beta.*gammaincinv((5/8),alpha));
q6=sqrt(2*(sigma^2)*beta.*gammaincinv((3/8),alpha));
q7=sqrt(2*(sigma^2)*beta.*gammaincinv((1/8),alpha));
galtonskewness=(q1-2*med+q3)./(q1-q3);
moorskurtosis=(q4-q5+q6-q7)./(q1-q3);
%// Show our meshes
figure;
mesh(alpha, beta, galtonskewness);
figure;
mesh(alpha, beta, moorskurtosis);
Also take note that I renamed your median variable to med. MATLAB has a function called median and so you don't want to unintentionally shadow over this function with a variable of the same name.
This is what I get:
Take note that I'm not getting the plots that you have placed in your post. It may be because I'm choosing the wrong variables to define the mesh, or perhaps your equations may be incorrect. Double check what you know in theory to what you have here in code and try again.
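One thing that can be checked directly from the formulas as written: every quantile shares the factor sqrt(2*sigma^2*beta), and that factor cancels in both ratios, so neither measure should depend on beta or sigma. The sketch below checks this numerically by comparing two values of beta; the anonymous functions q, galton and moors are helper names introduced here for illustration:

```matlab
sigma = 1;
alpha = 0.1:0.1:5;

% Quantile function as written in the question
q = @(p, a, b) sqrt(2 * sigma^2 * b .* gammaincinv(p, a));

% Galton's skewness and Moors' kurtosis, same formulas as above
galton = @(a, b) (q(6/8, a, b) - 2 * q(4/8, a, b) + q(2/8, a, b)) ./ ...
                 (q(6/8, a, b) - q(2/8, a, b));
moors  = @(a, b) (q(7/8, a, b) - q(5/8, a, b) + q(3/8, a, b) - q(1/8, a, b)) ./ ...
                 (q(6/8, a, b) - q(2/8, a, b));

% Largest difference between beta = 0.1 and beta = 5 over all alphas
skewDiff = max(abs(galton(alpha, 0.1) - galton(alpha, 5)));
kurtDiff = max(abs(moors(alpha, 0.1) - moors(alpha, 5)));
```

Both differences should come out at the level of floating-point rounding. If so, the mesh is flat along the beta axis, and a line plot against alpha, e.g. plot(alpha, galton(alpha, 1)), carries the same information. That may be one reason the surfaces do not resemble the expected figure, though without the original equations this is only a guess.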
This should hopefully give you enough to start with though!