How to predict KNN classifier without using built-in function - matlab

I have some trouble on predicting KNN classifier without using built-in function. I got stuck here and had no idea how to go to next step. Here is my code:
% calculate Euclidean distance
dist = pdist2(test, train, 'euclidean');
for k = [1 3 5 7]
[~, nearest] = sort(dist, 2);
nearst = nearest(:, 1:k);
end % for loop
Where test is a 297x64 matrix, and train is a 1500x64 matrix. The dist matrix is 297x1500. Any help will be thankful!

So you managed to get sorted indices in terms of distances in your nearst, all you have to do is to refer to your labels of the original data. So you have somewhere a variable labels which holds a true label for each point. Use indices stored in nearst to read them out and simply report the most common value.

Related

Clustering of 1 dimensional data

I am trying to learn the k-means clustering algorithm in MATLAB without using inbuilt k-means function. Say I have the data of size 1x100 and I want to group them into two clusters. So how can I do this. I want to visualize the two centroids and data together on a plot in MATLAB.
Note : When I plot in MATLAB, I am able to see only data but not the data and two centroids simultaneously.
Any help in this regard is highly appreciated.
A minimal K-means clustering algorithm in matlab could be:
p = rand(100,2); % rand(number_of_points,number_of_dimension)
c = p(1:3,:); % We create 3 centroids
% We run this minimal KNN algorithm:
for ii = 1:10
% Which centroids is the closest for each points ? min(Euclidian_distance):
[~,idx] = min(sum((permute(p,[3,2,1])-c).^2,2),[],1);
% We calculate the new centroids (the center of mass of the corresponding points)
c = splitapply(#mean,p,idx(:))
end
And we can plot the result if needed:
hold on
scatter(p(:,1),p(:,2),[],idx(:))
scatter(c(:,1),c(:,2),[],'red')
And we obtain:
With our 3 centroids in red and the clusters with a distinct color.
Noticed that in this example the data are of dimension 2, but it will also work with any other dimension.
The 3 initial centroids correspond to 3 points of the dataset (randomly selected), it ensure that every centroids are the closest centroid for, at least, 1 point.
In this example there is 10 iterations. But it is certainly better to define a tolerance and stop the iteration when the centroids have converged.

Distance Calculations for Nearest Mean Classifer

Greetins,
How can I calculate how many distance calculations would need to be performed to classify the IRIS dataset using Nearest Mean Classifier.
I know that IRIS dataset has 4 features and every record is classified according to 3 different labels.
According to some textbooks, the calculation can be carried out as follow:
However, I am lost on these different notations and what does this equation mean. For example, what is s^2 is in the equation?
The notation is standard with most machine learning textbooks. s in this case is the sample standard deviation for the training set. It is quite common to assume that each class has the same standard deviation, which is why every class is assigned the same value.
However you shouldn't be paying attention to that. The most important point is when the priors are equal. This is a fair assumption which means that you expect that the distribution of each class in your dataset are roughly equal. By doing this, the classifier simply boils down to finding the smallest distance from a training sample x to each of the other classes represented by their mean vectors.
How you'd compute this is quite simple. In your training set, you have a set of training examples with each example belonging to a particular class. For the case of the iris dataset, you have three classes. You find the mean feature vector for each class, which would be stored as m1, m2 and m3 respectively. After, to classify a new feature vector, simply find the smallest distance from this vector to each of the mean vectors. Whichever one has the smallest distance is the class you'd assign.
Since you chose MATLAB as the language, allow me to demonstrate with the actual iris dataset.
load fisheriris; % Load iris dataset
[~,~,id] = unique(species); % Assign for each example a unique ID
means = zeros(3, 4); % Store the mean vectors for each class
for i = 1 : 3 % Find the mean vectors per class
means(i,:) = mean(meas(id == i, :), 1); % Find the mean vector for class 1
end
x = meas(10, :); % Choose a random row from the dataset
% Determine which class has the smallest distance and thus figure out the class
[~,c] = min(sum(bsxfun(#minus, x, means).^2, 2));
The code is fairly straight forward. Load in the dataset and since the labels are in a cell array, it's handy to create a new set of labels that are enumerated as 1, 2 and 3 so that it's easy to isolate out the training examples per class and compute their mean vectors. That's what's happening in the for loop. Once that's done, I choose a random data point from the training set then compute the distance from this point to each of the mean vectors. We choose the class that gives us the smallest distance.
If you wanted to do this for the entire dataset, you can but that will require some permutation of the dimensions to do so.
data = permute(meas, [1 3 2]);
means_p = permute(means, [3 1 2]);
P = sum(bsxfun(#minus, data, means_p).^2, 3);
[~,c] = min(P, [], 2);
data and means_p are the transformed features and mean vectors in a way that is a 3D matrix with a singleton dimension. The third line of code computes the distances vectorized so that it finally generates a 2D matrix with each row i calculating the distance from the training example i to each of the mean vectors. We finally find the class with the smallest distance for each example.
To get a sense of the accuracy, we can simply compute the fraction of the total number of times we classified correctly:
>> sum(c == id) / numel(id)
ans =
0.9267
With this simple nearest mean classifier, we have an accuracy of 92.67%... not bad, but you can do better. Finally, to answer your question, you would need K * d distance calculations, with K being the number of examples and d being the number of classes. You can clearly see that this is required by examining the logic and code above.

Generate random samples from arbitrary discrete probability density function in Matlab

I've got an arbitrary probability density function discretized as a matrix in Matlab, that means that for every pair x,y the probability is stored in the matrix:
A(x,y) = probability
This is a 100x100 matrix, and I would like to be able to generate random samples of two dimensions (x,y) out of this matrix and also, if possible, to be able to calculate the mean and other moments of the PDF. I want to do this because after resampling, I want to fit the samples to an approximated Gaussian Mixture Model.
I've been looking everywhere but I haven't found anything as specific as this. I hope you may be able to help me.
Thank you.
If you really have a discrete probably density function defined by A (as opposed to a continuous probability density function that is merely described by A), you can "cheat" by turning your 2D problem into a 1D problem.
%define the possible values for the (x,y) pair
row_vals = [1:size(A,1)]'*ones(1,size(A,2)); %all x values
col_vals = ones(size(A,1),1)*[1:size(A,2)]; %all y values
%convert your 2D problem into a 1D problem
A = A(:);
row_vals = row_vals(:);
col_vals = col_vals(:);
%calculate your fake 1D CDF, assumes sum(A(:))==1
CDF = cumsum(A); %remember, first term out of of cumsum is not zero
%because of the operation we're doing below (interp1 followed by ceil)
%we need the CDF to start at zero
CDF = [0; CDF(:)];
%generate random values
N_vals = 1000; %give me 1000 values
rand_vals = rand(N_vals,1); %spans zero to one
%look into CDF to see which index the rand val corresponds to
out_val = interp1(CDF,[0:1/(length(CDF)-1):1],rand_vals); %spans zero to one
ind = ceil(out_val*length(A));
%using the inds, you can lookup each pair of values
xy_values = [row_vals(ind) col_vals(ind)];
I hope that this helps!
Chip
I don't believe matlab has built-in functionality for generating multivariate random variables with arbitrary distribution. As a matter of fact, the same is true for univariate random numbers. But while the latter can be easily generated based on the cumulative distribution function, the CDF does not exist for multivariate distributions, so generating such numbers is much more messy (the main problem is the fact that 2 or more variables have correlation). So this part of your question is far beyond the scope of this site.
Since half an answer is better than no answer, here's how you can compute the mean and higher moments numerically using matlab:
%generate some dummy input
xv=linspace(-50,50,101);
yv=linspace(-30,30,100);
[x y]=meshgrid(xv,yv);
%define a discretized two-hump Gaussian distribution
A=floor(15*exp(-((x-10).^2+y.^2)/100)+15*exp(-((x+25).^2+y.^2)/100));
A=A/sum(A(:)); %normalized to sum to 1
%plot it if you like
%figure;
%surf(x,y,A)
%actual half-answer starts here
%get normalized pdf
weight=trapz(xv,trapz(yv,A));
A=A/weight; %A normalized to 1 according to trapz^2
%mean
mean_x=trapz(xv,trapz(yv,A.*x));
mean_y=trapz(xv,trapz(yv,A.*y));
So, the point is that you can perform a double integral on a rectangular mesh using two consecutive calls to trapz. This allows you to compute the integral of any quantity that has the same shape as your mesh, but a drawback is that vector components have to be computed independently. If you only wish to compute things which can be parametrized with x and y (which are naturally the same size as you mesh), then you can get along without having to do any additional thinking.
You could also define a function for the integration:
function res=trapz2(xv,yv,A,arg)
if ~isscalar(arg) && any(size(arg)~=size(A))
error('Size of A and var must be the same!')
end
res=trapz(xv,trapz(yv,A.*arg));
end
This way you can compute stuff like
weight=trapz2(xv,yv,A,1);
mean_x=trapz2(xv,yv,A,x);
NOTE: the reason I used a 101x100 mesh in the example is that the double call to trapz should be performed in the proper order. If you interchange xv and yv in the calls, you get the wrong answer due to inconsistency with the definition of A, but this will not be evident if A is square. I suggest avoiding symmetric quantities during the development stage.

Mahalanobis distance in matlab: pdist2() vs. mahal() function

I have two matrices X and Y. Both represent a number of positions in 3D-space. X is a 50*3 matrix, Y is a 60*3 matrix.
My question: why does applying the mean-function over the output of pdist2() in combination with 'Mahalanobis' not give the result obtained with mahal()?
More details on what I'm trying to do below, as well as the code I used to test this.
Let's suppose the 60 observations in matrix Y are obtained after an experimental manipulation of some kind. I'm trying to assess whether this manipulation had a significant effect on the positions observed in Y. Therefore, I used pdist2(X,X,'Mahalanobis') to compare X to X to obtain a baseline, and later, X to Y (with X the reference matrix: pdist2(X,Y,'Mahalanobis')), and I plotted both distributions to have a look at the overlap.
Subsequently, I calculated the mean Mahalanobis distance for both distributions and the 95% CI and did a t-test and Kolmogorov-Smirnoff test to asses if the difference between the distributions was significant. This seemed very intuitive to me, however, when testing with mahal(), I get different values, although the reference matrix is the same. I don't get what the difference between both ways of calculating mahalanobis distance is exactly.
Comment that is too long #3lectrologos:
You mean this: d(I) = (Y(I,:)-mu)inv(SIGMA)(Y(I,:)-mu)'? This is just the formula for calculating mahalanobis, so should be the same for pdist2() and mahal() functions. I think mu is a scalar and SIGMA is a matrix based on the reference distribution as a whole in both pdist2() and mahal(). Only in mahal you are comparing each point of your sample set to the points of the reference distribution, while in pdist2 you are making pairwise comparisons based on a reference distribution. Actually, with my purpose in my mind, I think I should go for mahal() instead of pdist2(). I can interpret a pairwise distance based on a reference distribution, but I don't think it's what I need here.
% test pdist2 vs. mahal in matlab
% the purpose of this script is to see whether the average over the rows of E equals the values in d...
% data
X = []; % 50*3 matrix, data omitted
Y = []; % 60*3 matrix, data omitted
% calculations
S = nancov(X);
% mahal()
d = mahal(Y,X); % gives an 60*1 matrix with a value for each Cartesian element in Y (second matrix is always the reference matrix)
% pairwise mahalanobis distance with pdist2()
E = pdist2(X,Y,'mahalanobis',S); % outputs an 50*60 matrix with each ij-th element the pairwise distance between element X(i,:) and Y(j,:) based on the covariance matrix of X: nancov(X)
%{
so this is harder to interpret than mahal(), as elements of Y are not just compared to the "mahalanobis-centroid" based on X,
% but to each individual element of X
% so the purpose of this script is to see whether the average over the rows of E equals the values in d...
%}
F = mean(E); % now I averaged over the rows, which means, over all values of X, the reference matrix
mean(d)
mean(E(:)) % not equal to mean(d)
d-F' % not zero
% plot output
figure(1)
plot(d,'bo'), hold on
plot(mean(E),'ro')
legend('mahal()','avaraged over all x values pdist2()')
ylabel('Mahalanobis distance')
figure(2)
plot(d,'bo'), hold on
plot(E','ro')
plot(d,'bo','MarkerFaceColor','b')
xlabel('values in matrix Y (Yi) ... or ... pairwise comparison Yi. (Yi vs. all Xi values)')
ylabel('Mahalanobis distance')
legend('mahal()','pdist2()')
One immediate difference between the two is that mahal subtracts the sample mean of X from each point in Y before computing distances.
Try something like E = pdist2(X,Y-mean(X),'mahalanobis',S); to see if that gives you the same results as mahal.
Note that
mahal(X,Y)
is equivalent to
pdist2(X,mean(Y),'mahalanobis',cov(Y)).^2
Well, I guess there are two different ways to calculate mahalanobis distance between two clusters of data like you explain above:
1) you compare each data point from your sample set to mu and sigma matrices calculated from your reference distribution (although labeling one cluster sample set and the other reference distribution may be arbitrary), thereby calculating the distance from each point to this so called mahalanobis-centroid of the reference distribution.
2) you compare each datapoint from matrix Y to each datapoint of matrix X, with, X the reference distribution (mu and sigma are calculated from X only)
The values of the distances will be different, but I guess the ordinal order of dissimilarity between clusters is preserved when using either method 1 or 2? I actually wonder when comparing 10 different clusters to a reference matrix X, or to each other, if the order of the dissimilarities would differ using method 1 or method 2? Also, I can't imagine a situation where one method would be wrong and the other method not. Although method 1 seems more intuitive in some situations, like mine.

Multivariate Random Number Generation in Matlab

I'm probably being a little dense but I'm not very mathsy and can't seem to understand the covariance element of creating multivariate data.
I'm after two columns of random data (representing two correlated variables).
I think I am right in needing to use the mvnrnd function and I understand that 'mu' must be a column of my mean vectors. As I need 4 distinct classes within my data these are going to be (1, 1) (-1 1) (1 -1) and (-1 -1). I assume I will have to do the function 4x with a different column of mean vectors each time and then combine them to get my full data set.
I don't understand what I should put for SIGMA - Matlab help tells me that it must be 'a d-by-d symmetric positive semi-definite matrix, or a d-by-d-by-n array' i.e. a covariance matrix. I don't understand how I create a covariance matrix for numbers that I am yet to generate.
Any advice would be greatly appreciated!
Assuming that I understood your case properly, I would go this way:
data = [normrnd(0,1,5000,1),normrnd(0,1,5000,1)]; %% your starting data series
MU = mean(data,1);
SIGMA = cov(data);
Now, it should be possible to feed mvnrnd with MU and SIGMA:
r = mvnrnd(MU,SIGMA,5000);
plot(r(:,1),r(:,2),'+') %% in case you wanna plot the results
I hope this helps.
I think your aim is to generate the simulated multivariate gaussian distributed data. For example, I use
k = 6; % feature dimension
mu = rand(1,k);
sigma = 10*eye(k,k);
unit matrix by 10 times is a symmetric positive semi-definite matrix. And the gaussian distribution will be more round than other type of sigma.
then you can use it as the above example of mvnrnd function and see the plot.