Suppose that I have generated some data in matlab as follows:
n = 100;
x = randi(n,[n,1]);
y = rand(n,1);
data = [x y];
plot(x,y,'rx')
axis([0 100 0 1])
Now I want to generate an algorithm to classify all these data into some (arbitrary) clusters, in such a way that a point is a member of a cluster only if its distance to at least one of the cluster's members is less than 10. How could I write the code?
The clustering method you are describing is DBSCAN. Note that this algorithm will find only one cluster in the provided data, since it is very unlikely that there is a point whose distance to all other points is more than 10.
If this is really what you want, you can use the built-in dbscan function, or the one posted on the File Exchange if you are using a version older than R2019a.
% Generating random points, roughly similar to the data provided by the OP
data = bsxfun(@times, rand(100, 2), [100 1]);
% Adding more random points drawn from a few Gaussian blobs
for i=1:5
    mu = rand(1, 2)*100 - 50;
    A = rand(2)*5;
    sigma = A*A' + eye(2)*(1+rand*2);
    data = [data; mvnrnd(mu, sigma, 20)];
end
% clustering using DBSCAN, with epsilon = 10 and min-points = 1
idx = DBSCAN(data, 10, 1);
% plotting clusters
numCluster = max(idx);
colors = lines(numCluster);
scatter(data(:, 1), data(:, 2), 30, colors(idx, :), 'filled')
title(['No. of Clusters: ' num2str(numCluster)])
axis equal
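If you are on R2019a or newer, the built-in dbscan gives the same kind of result; here is a minimal sketch (note that with min-points = 1 every point is a core point, so no point is labeled as noise):
% Same clustering with the built-in dbscan (R2019a+); epsilon = 10, minpts = 1
idx = dbscan(data, 10, 1);
numCluster = max(idx);
colors = lines(numCluster);
scatter(data(:, 1), data(:, 2), 30, colors(idx, :), 'filled')
title(['No. of Clusters: ' num2str(numCluster)])
axis equal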
The numbers in the figure above show the distances between the closest pairs of points in any two different clusters.
The Matlab built-in function clusterdata() works well for what you're asking.
Here is how to apply it to your example:
% number of points
n = 100;
% create the data
x = randi(n,[n,1]);
y = rand(n,1);
data = [x y];
% the number of clusters you want to create
num_clusters = 5;
T1 = clusterdata(data,'Criterion','distance',...
'Distance','euclidean',...
'MaxClust', num_clusters)
scatter(x, y, 100, T1,'filled')
In this case, I used 5 clusters and the Euclidean distance as the metric to group the data points, but you can always change that (see the documentation of clusterdata()).
See the result below for 5 clusters with some random data.
Note that the data is skewed (x-values range from 0 to 100, while y-values range from 0 to 1), so the results are also skewed, but you could always normalize your data first, as sketched below.
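For instance, a minimal sketch using zscore (one of several reasonable normalizations) would be:
% Normalize each column to zero mean and unit variance before clustering
dataNorm = zscore(data);
T2 = clusterdata(dataNorm, 'Criterion','distance', ...
                 'Distance','euclidean', 'MaxClust', num_clusters);
scatter(x, y, 100, T2, 'filled')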
Here is a way using the connected components of a graph:
D = pdist2(data, data) < 10;
D(1:size(D,1)+1:end) = 0;
G = graph(D);
C = conncomp(G);
conncomp returns a vector whose elements are the cluster numbers of the corresponding points.
Use pdist2 to compute the pairwise distance matrix of the points.
Use the distance matrix to create a logical adjacency matrix, where two points are neighbors if the distance between them is less than 10.
Set the diagonal elements of the adjacency matrix to 0 to eliminate self-loops.
Create a graph from the adjacency matrix.
Compute the connected components of the graph.
Note that using pdist2 may not be feasible for large datasets, and you may need other methods to form a sparse adjacency matrix, as sketched below.
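One possible way to build such a sparse matrix, sketched here with rangesearch (Statistics and Machine Learning Toolbox), which returns only the neighbors within the given radius:
% Sketch: sparse adjacency via rangesearch instead of a full distance matrix
n = size(data, 1);
nbrs = rangesearch(data, data, 10);          % neighbor indices within radius 10
i = repelem((1:n)', cellfun(@numel, nbrs));  % row index repeated per neighbor
j = [nbrs{:}]';                              % all neighbor indices, stacked
A = sparse(i, j, 1, n, n);
C = conncomp(graph(A, 'omitselfloops'));     % drop the self-distances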
After posting my answer I noticed that the answer provided by @saastn, which suggests the DBSCAN algorithm, follows nearly the same approach.
I would like to calculate the Mahalanobis distance of an input feature vector Y (1x14) to all feature vectors in matrix X (18x14). Each block of 6 vectors in X represents one class (so I have 3 classes). Then, based on the Mahalanobis distances, I will choose the vector that is nearest to the input and classify it into one of the three classes as well.
My problem is that when I use the following code I get only one value. How can I get the Mahalanobis distance between the input Y and every vector in X, so that at the end I have 18 values and can choose the smallest one? Any help will be appreciated. Thank you.
Note: I know that the Mahalanobis distance is a measure of the distance between a point P and a distribution D, but I don't know how this could be applied in my situation.
Y = test1; % Y: 1x14 vector
S = cov(X); % X: 18x14 matrix
mu = mean(X,1);
d = ((Y-mu)/S)*(Y-mu)'
I also tried to separate the matrix X into 3 parts, each representing the feature vectors of one class. This is the code, but it doesn't work properly: I get 3 distances and some have negative values!
Y = test1;
X1 = Action1;
S1 = cov(X1);
mu1 = mean(X1,1);
d1 = ((Y-mu1)/S1)*(Y-mu1)'
X2 = Action2;
S2 = cov(X2);
mu2 = mean(X2,1);
d2 = ((Y-mu2)/S2)*(Y-mu2)'
X3= Action3;
S3 = cov(X3);
mu3 = mean(X3,1);
d3 = ((Y-mu3)/S3)*(Y-mu3)'
d= [d1,d2,d3];
MahalanobisDist= min(d)
One last thing: when I used the mahal function provided by Matlab, I got this warning:
Warning: Matrix is close to singular or badly scaled. Results may be inaccurate.
If you have to implement the distance yourself (for a school assignment, for instance) this is of absolutely no use to you, but if you just need to calculate the distance as an intermediate step for other calculations I highly recommend d = pdist2(a, b, distance_measure); the documentation is on MATLAB's site.
It computes the pairwise distances between a vector (or even a matrix) b and all elements in a and stores them in a matrix d, where the columns correspond to entries in b and the rows to entries in a. So d(i,j) is the distance between row j in b and row i in a (hope that made sense). It can even take parameters to find the k nearest neighbors; it's a great function.
In your case you would use the following code, and you'd end up with both the distances and the indices:
% number of neighbors
K = 1;
% X is 18x14, Y is 1x14; dist and iidx are Kx1 (the K smallest
% distances and the corresponding row indices in X)
[dist, iidx] = pdist2(X, Y, 'mahalanobis', 'Smallest', K);
% to find the class, you can do something like this
num_samples_per_class = 6;
matching_class = ceil(iidx / num_samples_per_class);
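For example, if iidx comes back as 7, the nearest training vector is row 7 of X, and ceil(7/6) = 2 assigns Y to the second class. As for the singular-matrix warning from mahal: with only 6 samples of 14 features per class, the per-class covariance estimates are rank-deficient and cannot be inverted reliably, which also explains the suspicious negative distances in your three-way split. A common workaround is to regularize the covariance before inverting; a sketch (the ridge term lambda is an assumption, not part of the original code):
lambda = 1e-3;                            % small ridge term (hypothetical value)
S1 = cov(X1) + lambda*eye(size(X1, 2));   % regularized class covariance
d1 = (Y - mu1) / S1 * (Y - mu1)';         % a proper, non-negative distance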
Consider this example of code that obtains the best fit to the data by varying the number of fitted Gaussians according to the Akaike criterion:
MU1 = [1];
SIGMA1 = [2];
MU2 = [-3];
SIGMA2 = [1];
X = [mvnrnd(MU1,SIGMA1,1000);mvnrnd(MU2,SIGMA2,1000)];
AIC = zeros(1,4);
obj = cell(1,4);
options = statset('Display','final');
for k = 1:4
    obj{k} = gmdistribution.fit(X, k, 'Options', options);
    AIC(k) = obj{k}.AIC;
end
[minAIC,numComponents] = min(AIC)
I want to do the same thing but with data that are given in a form of a histogram (consider for example the data http://pastebin.com/embed_js.php?i=1mNRuEHZ).
What is the most direct way to implement the same procedure in matlab in this case?
If I'm getting you right, your problem is to convert between data that is already compiled as a histogram (so numbers of observations paired with the actual value of an observation) and the original individual observations. When compiling the histogram, you have lost two things:
Order. You don't know what the order of the observations was in the original data, which is probably not important, provided your observations are independent. Also, as I understand gmdistribution.fit(), it doesn't take order into account anyway.
Resolution. When you create a histogram, you need to bin your data, which loses precision, so to speak, because it is impossible to recover the precise values of your observations from the bins.
Once you are aware of that, you can create a 'vector of observations' from your histogram data. Say X1 is your histogram data (an Nx2 matrix). If you do
invX = cell2mat(arrayfun(@(x,y) repmat(y,1,x), abs(int16(1000*X1(:, 2)))', X1(:, 1)', ...
    'UniformOutput', false))';
you get a vector that contains individual observations, just like X in your example.
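As a quick sanity check with made-up numbers (not from the pastebin data), the reconstruction works like this:
% Two bins: value -3 with weight 0.002, value 1 with weight 0.001
X1 = [-3, 0.002; 1, 0.001];
invX = cell2mat(arrayfun(@(x,y) repmat(y,1,x), abs(int16(1000*X1(:, 2)))', X1(:, 1)', ...
    'UniformOutput', false))';
% invX is now [-3; -3; 1]: each bin value repeated according to its scaled count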
Note that you have to convert the bin counts to integers first. At this step, because the given data's precision is quite high, I had to round to make the computation possible on my machine. However, the final result seems fairly reasonable.
Also note that I used absolute values: there are some cases in your histogram data where the count is actually negative, which obviously doesn't make sense for a histogram.
Last but not least, you have to increase the number of iterations for the fit procedure to 1000. The final code to produce the figure below reads:
% X1 is the Nx2 histogram data loaded from the link above
invX = cell2mat(arrayfun(@(x,y) repmat(y,1,x), abs(int16(1000*X1(:, 2)))', X1(:, 1)', ...
    'UniformOutput', false))';
X = invX;
AIC = zeros(1,4);
obj = cell(1,4);
options = statset('Display','final', 'MaxIter', 1000);
for k = 1:4
obj{k} = gmdistribution.fit(X,k,'Options',options);
AIC(k)= obj{k}.AIC;
end
[minAIC,numComponents] = min(AIC);
hold on;
plot(linspace(-1, 2, length(X1(:, 2))), abs(X1(:, 2)), 'LineWidth', 2)
% x and pd were undefined in the original snippet; one way to obtain them
% is to evaluate the fitted mixture on a grid:
x = linspace(-1, 2, 1000)';
pd = pdf(obj{numComponents}, x);
plot(x, pd/max(pd)*double(max(abs(X1(:, 2)))), 'LineWidth', 5);
h = legend('Original data', 'PDF');
set(h,'FontSize',32);
Output looks like this:
I'm new to MATLAB programming. I need to write a script that generates a random sequence (x_1, ..., x_N) of size N following the normal distribution N(0, 1) and computes the empirical mean m_N and variance σ_N^2.
Then I need to plot them.
This is my attempt:
function f = normal_distribution(n)
x =randn(n);
muem = 1./n .* (sum(x));
muem
%mean(muem)
vaem = 1./n .* (sum((x).^2));
vaem
hold on
plot(x,muem,'-')
grid on
plot(x,vaem,'*')
NB: these are the formulas that I used: m_N = (1/N) * sum(x_i) and σ_N^2 = (1/N) * sum(x_i^2).
I obtained a figure, but I don't know whether it is correct or not. Thanks for the help.
From your question, it seems what you want is to calculate the mean and variance of a sample of size N (not an NxN matrix) drawn from a standard normal distribution. So you may want to use randn(n, 1) instead of randn(n). Also, as @ThP pointed out, it does not make sense to plot the mean and variance against x. What you could do instead is calculate means and variances for increasing sample sizes n_1, n_2, ..., n_m, and then plot sample size vs. mean or variance to see them converge to 0 and 1. See the code below:
function [] = plotMnV(nIter)
    means = zeros(nIter, 1);
    vars = zeros(nIter, 1);
    for pow = 1:nIter
        n = 2^pow;
        x = randn(n, 1);
        means(pow) = 1/n * sum(x);
        vars(pow) = 1/n * sum(x.^2);
    end
    plot(1:nIter, means, 'o-');
    hold on;
    plot(1:nIter, vars, '*-');
end
For example, plotMnV(20) gave me the plot below.
I am working on a thumb recognition system and need to implement the KNN algorithm to classify my images. According to this, it has only 2 measurements through which it calculates the distance to find the nearest neighbour, but in my case I have 400 images of 25 x 42 pixels, of which 200 are for training and 200 for testing. I have been searching for a few hours but cannot find a way to compute the distance between the points.
EDIT:
I have reshaped the first 200 images into 1 x 1050 vectors and stored them in a matrix trainingData of size 200 x 1050. I built testingData similarly.
Here is an illustration code for k-nearest neighbor classification (some functions used require the Statistics toolbox):
%# image size
sz = [25,42];
%# training images
numTrain = 200;
trainData = zeros(numTrain,prod(sz));
for i=1:numTrain
img = imread( sprintf('train/image_%03d.jpg',i) );
trainData(i,:) = img(:);
end
%# testing images
numTest = 200;
testData = zeros(numTest,prod(sz));
for i=1:numTest
img = imread( sprintf('test/image_%03d.jpg',i) );
testData(i,:) = img(:);
end
%# target class (I'm just using random values. Load your actual values instead)
trainClass = randi([1 5], [numTrain 1]);
testClass = randi([1 5], [numTest 1]);
%# compute pairwise distances between each test instance vs. all training data
D = pdist2(testData, trainData, 'euclidean');
[D,idx] = sort(D, 2, 'ascend');
%# K nearest neighbors
K = 5;
D = D(:,1:K);
idx = idx(:,1:K);
%# majority vote
prediction = mode(trainClass(idx),2);
%# performance (confusion matrix and classification error)
C = confusionmat(testClass, prediction);
err = sum(C(:)) - sum(diag(C))
If you want to compute the Euclidean distance between vectors a and b, just use Pythagoras. In Matlab:
dist = sqrt(sum((a-b).^2));
However, you might want to use pdist to compute it for all combinations of vectors in your matrix at once.
dist = squareform(pdist(myVectors, 'euclidean'));
I'm interpreting columns as instances to classify and rows as potential neighbors. This is arbitrary though and you could switch them around.
If have a separate test set, you can compute the distance to the instances in the training set with pdist2:
dist = pdist2(trainingSet, testSet, 'euclidean')
You can use this distance matrix to knn-classify your vectors as follows. I'll generate some random data to serve as example, which will result in low (around chance level) accuracy. But of course you should plug in your actual data and results will probably be better.
nrOfVectors = 100; nrOfFeatures = 10; nrOfClasses = 5; % example sizes, chosen arbitrarily
m = rand(nrOfVectors, nrOfFeatures); % random example data
classes = randi(nrOfClasses, 1, nrOfVectors); % random true classes
k = 3; % number of neighbors to consider, 3 is a common value
d = squareform(pdist(m, 'euclidean')); % distance matrix
[neighborvals, neighborindex] = sort(d,1); % get sorted distances
Take a look at the neighborvals and neighborindex matrices and see if they make sense to you. The first is a sorted version of the earlier d matrix, and the latter gives the corresponding instance numbers. Note that the self-distances (on the diagonal in d) have floated to the top. We're not interested in those (always zero), so we'll skip the top row in the next step.
% look up the classes of the k nearest neighbors (skipping the self-match in row 1)
assignedClasses = mode(classes(neighborindex(2:1+k,:)), 1);
So we assign the most common class among the k nearest neighbors!
You can compare the assigned classes with the actual classes to get an accuracy score:
accuracy = sum(classes == assignedClasses)/length(classes);
fprintf('KNN Classifier Accuracy: %.2f%%\n', 100*accuracy)
Or make a confusion matrix to see the distribution of classifications:
confusionmat(classes, assignedClasses)
Yes, there is a function for kNN: knnclassify.
Play around with the number of neighbors you want to keep in order to get the best result (use a confusion matrix). This function takes care of the distance, of course.
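Note that knnclassify comes from the Bioinformatics Toolbox and is no longer recommended; a rough equivalent with fitcknn is sketched below, assuming trainingData/testingData as in the question and label vectors trainLabels/testLabels (which are assumptions, not from the question):
% kNN with the Statistics and Machine Learning Toolbox;
% trainLabels/testLabels are hypothetical label vectors
mdl = fitcknn(trainingData, trainLabels, 'NumNeighbors', 5);
predicted = predict(mdl, testingData);
confusionmat(testLabels, predicted)   % inspect this to tune NumNeighbors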
I have a matrix A (369x10) which I want to cluster into 19 clusters.
I use this method
[idx ctrs]=kmeans(A,19)
which yields
idx(369x1) and ctrs(19x10)
I get the point up to here. All my rows in A are clustered into 19 clusters.
Now I have an array B (49x10). I want to know where the rows of B fall among the given 19 clusters.
How is it possible in MATLAB?
Thank you in advance
The following is a complete example of clustering:
%% generate sample data
K = 3;
numObservations = 100;
dimensions = 3;
data = rand([numObservations dimensions]);
%% cluster
opts = statset('MaxIter', 500, 'Display', 'iter');
[clustIDX, clusters, interClustSum, Dist] = kmeans(data, K, 'options',opts, ...
'distance','sqEuclidean', 'EmptyAction','singleton', 'replicates',3);
%% plot data+clusters
figure, hold on
scatter3(data(:,1),data(:,2),data(:,3), 50, clustIDX, 'filled')
scatter3(clusters(:,1),clusters(:,2),clusters(:,3), 200, (1:K)', 'filled')
hold off, xlabel('x'), ylabel('y'), zlabel('z')
%% plot clusters quality
figure
[silh,h] = silhouette(data, clustIDX);
avrgScore = mean(silh);
%% Assign data to clusters
% calculate distance (squared) of all instances to each cluster centroid
D = zeros(numObservations, K); % init distances
for k=1:K
    % squared Euclidean distance to centroid k: sum((x-y).^2)
    D(:,k) = sum( ((data - repmat(clusters(k,:),numObservations,1)).^2), 2);
end
% find, for each instance, the cluster closest to it
[minDists, clusterIndices] = min(D, [], 2);
% compare it with what you expect it to be
sum(clusterIndices == clustIDX)
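The distance loop above can also be written as a single pdist2 call; a sketch giving the same result:
% One call instead of the loop over centroids
[minDists, clusterIndices] = min(pdist2(data, clusters, 'squaredeuclidean'), [], 2);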
I can't think of a better way to do it than what you described. A built-in function would save one line, but I couldn't find one. Here's the code I would use:
[ids, ctrs] = kmeans(A, 19);
D = dist([testpoint; ctrs]); % testpoint is 1x10 and D will be 20x20
[distance, testpointID] = min(D(1, 2:end));
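If the dist function (from the Neural Network Toolbox) is not available, a sketch with pdist2 from the Statistics Toolbox does the same job:
% distance from the single test point to each of the 19 centroids
[distance, testpointID] = min(pdist2(testpoint, ctrs));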
I don't know if I get your meaning right, but if you want to know which cluster your points belong to, you can easily use the knnsearch function. It takes two arguments and searches the first argument for the row that is closest to the second argument.
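A minimal sketch with the variables from the question:
% index of the closest centroid for each row of B
predicted = knnsearch(ctrs, B);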
Assuming you're using the squared Euclidean distance metric, try this:
d = zeros(size(B,1), size(ctrs,1)); % init distances to each centroid
for i = 1:size(ctrs,1)
    d(:,i) = sum((B - ctrs(repmat(i,size(B,1),1),:)).^2, 2);
end
[distances, predicted] = min(d, [], 2)
predicted should then contain the index of the closest centroid, and distances should contain the distances to the closest centroid.
Take a look inside the kmeans function, at the subfunction 'distfun'. This shows you how to do the above, and also contains the equivalents for other distance metrics.
For a small amount of data, you could do:
[testpointID, dum] = find(permute(all(bsxfun(@eq, B, permute(ctrs, [3,2,1])), 2), [3,1,2]))
But this is somewhat obscure; the bsxfun with the permuted ctrs creates a 49 x 10 x 19 array of booleans, which is then 'all-ed' across the second dimension, permuted back, and then the row IDs are found. Again, this is probably not practical for large amounts of data.
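A more readable equivalent is sketched below with ismember; like the bsxfun version, it only works when rows of B coincide exactly with centroid coordinates:
% tf flags rows of B that match a centroid; testpointID is the centroid index
[tf, testpointID] = ismember(B, ctrs, 'rows');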