What data of images are given to kmeans clustering in matlab?

I have 100 images in my database, and I am using those 100 images as both the training set and the test images. I have to make 5 clusters, and I am using eigenfaces (PCA) for feature extraction. What data should be given to the kmeans command in MATLAB?
Syntax for kmeans command:
[IDX,C] = kmeans(X,k)
1. What is the X value?
2. Do we have to give the Euclidean distance as an input?
3. Do we have to give a weight vector for the input images?
Please explain in detail.
Source code I tried:
X = [];
srcFiles = dir('C:\Users\rahul\Desktop\tomorow\*.jpg'); % the folder in which your images exist
for i = 1 : length(srcFiles)
    filename = strcat('C:\Users\rahul\Desktop\tomorow\', srcFiles(b).name);
    Imgdata = imread(filename);
    X(:, i) = princomp(Imgdata);
end
[idx, c] = kmeans(X, 5)
The error I am getting:
Index exceeds matrix dimensions.
Error in pca (line 4)
filename =strcat('C:\Users\rahul\Desktop\tomorow\',srcFiles(b).name);

The PCA function you are using (I don't know what it is exactly) produces a vector of n numbers; this vector describes the picture and is what needs to be given to the k-means algorithm. (The "Index exceeds matrix dimensions" error in your loop, by the way, comes from indexing srcFiles(b) instead of srcFiles(i).)
First of all, run the PCA for all 100 images, producing an n-by-100 matrix.
X = [];
for i = 1 : 100
    X(:, i) = PCA(picture...);
end
If PCA returns a row vector instead of a column vector, you need
X(:, i) = PCA(picture)'
The kmeans function takes this matrix as a parameter, as well as the number k of clusters. So
[idx, c] = kmeans(X, 5);
The distance used for clustering is Euclidean by default. If you want a different distance metric, you can supply it as a parameter; see the kmeans documentation for the available distance metrics.
Finally, the standard k-means algorithm is not weighted, so you can't supply weights to the vectors.
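Putting the pieces together, here is a minimal sketch of the whole pipeline; note that the loop must index srcFiles(i), not srcFiles(b), and extractPCAFeatures is a hypothetical stand-in for whatever your eigenfaces code computes per image:
X = [];
srcFiles = dir('C:\Users\rahul\Desktop\tomorow\*.jpg');
for i = 1 : length(srcFiles)
    filename = fullfile('C:\Users\rahul\Desktop\tomorow', srcFiles(i).name);
    Imgdata = imread(filename);
    X(:, i) = extractPCAFeatures(Imgdata); % hypothetical: one feature column per image
end
% kmeans treats each row as one observation, so transpose to 100-by-n first
[idx, c] = kmeans(X', 5);
% a different metric can be supplied as a parameter, e.g.:
% [idx, c] = kmeans(X', 5, 'Distance', 'cosine');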

Related

Creating Clusters in matlab

Suppose that I have generated some data in matlab as follows:
n = 100;
x = randi(n,[n,1]);
y = rand(n,1);
data = [x y];
plot(x,y,'rx')
axis([0 100 0 1])
Now I want to generate an algorithm to classify all these data into some clusters (the number of which is arbitrary), such that a point is a member of a cluster only if the distance between this point and at least one of the cluster's members is less than 10. How could I write the code?
The clustering method you are describing is DBSCAN. Note that this algorithm will find only one cluster in the provided data, since it's very unlikely that there is a point in the dataset whose distance to all other points is more than 10.
If this is really what you want, you can use the built-in dbscan function, or the implementation posted on the File Exchange if you are using a release older than R2019a.
% Generating random points, similar to the data provided by the OP
data = bsxfun(@times, rand(100, 2), [100 1]);
% Adding more random points
for i=1:5
    mu = rand(1, 2)*100 - 50;
    A = rand(2)*5;
    sigma = A*A' + eye(2)*(1+rand*2); % [1,1.5;1.5,3];
    data = [data; mvnrnd(mu, sigma, 20)];
end
% clustering using DBSCAN, with epsilon = 10 and min-points = 1
idx = DBSCAN(data, 10, 1);
% plotting clusters
numCluster = max(idx);
colors = lines(numCluster);
scatter(data(:, 1), data(:, 2), 30, colors(idx, :), 'filled')
title(['No. of Clusters: ' num2str(numCluster)])
axis equal
The numbers in the figure above show the distances between the closest pairs of points in any two different clusters.
The Matlab built-in function clusterdata() works well for what you're asking.
Here is how to apply it to your example:
% number of points
n = 100;
% create the data
x = randi(n,[n,1]);
y = rand(n,1);
data = [x y];
% the number of clusters you want to create
num_clusters = 5;
T1 = clusterdata(data,'Criterion','distance',...
'Distance','euclidean',...
'MaxClust', num_clusters)
scatter(x, y, 100, T1,'filled')
In this case, I used 5 clusters and the Euclidean distance as the metric to group the data points, but you can always change that (see the documentation of clusterdata()).
See the result below for 5 clusters with some random data.
Note that the data is skewed (x-values are from 0 to 100, and y-values are from 0 to 1), so the results are also skewed; but you could always normalize your data first, as sketched below.
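For instance, a minimal sketch of normalizing first with zscore (which standardizes each column; Statistics Toolbox):
% standardize x and y so both contribute comparably to the distance metric
T2 = clusterdata(zscore(data), 'Criterion', 'distance', ...
                 'Distance', 'euclidean', 'MaxClust', num_clusters);
scatter(x, y, 100, T2, 'filled')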
Here is a way using the connected components of a graph:
D = pdist2(data, data) < 10;
D(1:size(D,1)+1:end) = 0;
G = graph(D);
C = conncomp(G);
The connected components output C is a vector giving the cluster number of each point.
Use pdist2 to compute the pairwise distance matrix of the data points.
Use the distance matrix to create a logical adjacency matrix that marks two points as neighbors if the distance between them is less than 10.
Set the diagonal elements of the adjacency matrix to 0 to eliminate self-loops.
Create a graph from the adjacency matrix.
Compute the connected components of the graph.
Note that using pdist2 for large datasets may not be feasible, and you may need another method to form a sparse adjacency matrix, as sketched below.
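For example, here is a minimal sketch of building the adjacency matrix sparsely with rangesearch (Statistics and Machine Learning Toolbox), which only stores pairs within the radius:
% find, for each point, all points within radius 10 (including the point itself)
nbrs = rangesearch(data, data, 10); % cell array of neighbor index lists
rows = repelem((1:numel(nbrs))', cellfun(@numel, nbrs));
cols = [nbrs{:}]'; % concatenate the per-point index lists
D = sparse(rows, cols, 1, size(data,1), size(data,1));
D(1:size(D,1)+1:end) = 0; % drop self-loops
C = conncomp(graph(D)); % same clustering as before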
I noticed after posting my answer that the answer provided by @saastn, which suggests the DBSCAN algorithm, follows nearly the same approach.

Storing output from each for loop iteration in MATLAB

Say I have 3 matrix data files in a folder.
I have a function (clustering_coef_bu) which calculates the clustering coefficient of a 2D matrix (data, with dimensions 512x512). The output of the function is a 512x1 vector (the clustering coefficient), in double format.
With the for loop below, I am calculating the clustering coefficient of each matrix (data). However, I am having difficulty storing the output clustering coefficient from each run of the loop. It would be ideal to collect the output of each matrix in one single structure, e.g. a cell array or a matrix with dimensions 512x3.
for k = 1:3
    ClusteringCoefficient = clustering_coef_bu(data)
end
Any help would be great. Thanks.
Something like this would probably help you:
widthArray = 3;
ClusteringCoefficient = zeros(size(data, 1), widthArray);
for k = 1:widthArray
    % load or select the k-th data matrix here before calling the function
    ClusteringCoefficient(:, k) = clustering_coef_bu(data); % fills one 512x1 column per pass
end
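If the per-matrix outputs could differ in size, a cell array (as mentioned in the question) is the safer container; a minimal sketch:
ClusteringCoefficient = cell(1, widthArray);
for k = 1:widthArray
    ClusteringCoefficient{k} = clustering_coef_bu(data); % one 512x1 vector per cell
end
allCoefficients = [ClusteringCoefficient{:}]; % 512x3 double, when the sizes match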

PCA using princomp in MATLAB (for face recognition)

I'm trying to do dimensionality reduction using MATLAB's princomp, but I'm not sure I'm doing it right.
Here is my code, just for testing, but I'm not sure that I'm doing the projection right:
A = rand(4,3)
AMean = mean(A)
[n m] = size(A)
Ac = (A - repmat(AMean,[n 1]))
pc = princomp(A)
k = 2; %Number of first principal components
A_pca = Ac * pc(1:k,:)' %Not sure I'm doing projection right
reconstructedA = A_pca * pc(1:k,:)
error = reconstructedA- Ac
And my code for face recognition using the ORL dataset:
%load orl_data 400x768 double matrix (400 images 768 features)
%make labels
orl_label = [];
for i = 1:40
    orl_label = [orl_label; ones(10,1)*i];
end
n = size(orl_data,1);
k = randperm(n);
s = round(0.25*n); %Take 25% for train
%Raw pixels
%Split on test and train sets
data_tr = orl_data(k(1:s),:);
label_tr = orl_label(k(1:s),:);
data_te = orl_data(k(s+1:end),:);
label_te = orl_label(k(s+1:end),:);
tic
[nn_ind, estimated_label] = EuclDistClassifier(data_tr,label_tr,data_te);
toc
rate = sum(estimated_label == label_te)/size(label_te,1)
%Using PCA
tic
pc = princomp(data_tr);
toc
mean_face = mean(data_tr);
pc_n = 100;
f_pc = pc(1:pc_n,:)';
data_pca_tr = (data_tr - repmat(mean_face, [s,1])) * f_pc;
data_pca_te = (data_te - repmat(mean_face, [n-s,1])) * f_pc;
tic
[nn_ind, estimated_label] = EuclDistClassifier(data_pca_tr,label_tr,data_pca_te);
toc
rate = sum(estimated_label == label_te)/size(label_te,1)
If I choose enough principal components, the two approaches give me equal recognition rates. If I use a small number of principal components, the rate using PCA is poorer.
Here are some questions:
Is the princomp function the best way to calculate the first k principal components using MATLAB?
Do PCA-projected features vs. raw features give no extra accuracy, only a smaller feature vector size (faster to compare feature vectors)?
How can I automatically choose the minimal k (number of principal components) that gives the same accuracy as the raw feature vector?
What if I have a very big set of samples: can I use only a subset of them with comparable accuracy? Or can I compute PCA on some set and later "add" another set (I don't want to recompute PCA for set1+set2, but somehow iteratively add the information from set2 to the existing PCA of set1)?
I also tried the GPU version simply using gpuArray:
%Test using GPU
tic
A_cpu = rand(30000,32*24);
A = gpuArray(A_cpu);
AMean = mean(A);
[n m] = size(A)
pc = princomp(A);
k = 100;
A_pca = (A - repmat(AMean,[n 1])) * pc(1:k,:)';
A_pca_cpu = gather(A_pca);
toc
clear;
tic
A = rand(30000,32*24);
AMean = mean(A);
[n m] = size(A)
pc = princomp(A);
k = 100;
A_pca = (A - repmat(AMean,[n 1])) * pc(1:k,:)';
toc
clear;
It works faster, but it's not suitable for big matrices. Maybe I'm wrong?
If I use a big matrix, it gives me:
Error using gpuArray Out of memory on device.
"Is princomp function the best way to calculate first k principal components using MATLAB?"
It computes a full SVD, so it will be slow on large datasets. You can speed this up significantly by specifying the number of dimensions you need at the start and computing a partial SVD. The MATLAB function for a partial SVD is svds.
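For instance, a minimal sketch of computing the first k principal components with svds, assuming the rows of X are observations:
Xc = bsxfun(@minus, X, mean(X, 1)); % center the data
[U, S, V] = svds(Xc, k); % k largest singular triplets
coeff = V; % principal component directions, m-by-k
scores = U*S; % projected data, n-by-k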
If svds is not fast enough for you, there's a more modern implementation here:
http://cims.nyu.edu/~tygert/software.html (matlab version: http://code.google.com/p/framelet-mri/source/browse/pca.m )
(cf the paper describing the algorithm http://cims.nyu.edu/~tygert/blanczos.pdf )
You can control the error of your approximation by increasing the number of singular vectors computed; there are precise bounds on that in the linked paper. Here's an example:
>> A = rand(40,30); %random rank-30 matrix
>> [U,S,V] = pca(A,2); %compute a rank-2 approximation to A
>> norm(A-U*S*V',2)/norm(A,2) %relative error
ans =
0.1636
>> [U,S,V] = pca(A,25); %compute a rank-25 approximation to A
>> norm(A-U*S*V',2)/norm(A,2) %relative error
ans =
0.0410
When you have large data and a sparse matrix, computing a full SVD is often impossible since the factors will never be sparse. In this case you must compute a partial SVD to fit within memory. Example:
>> A = sprandn(5000,5000,10000);
>> tic;[U,S,V]=pca(A,2);toc;
no pivots
Elapsed time is 124.282113 seconds.
>> tic;[U,S,V]=svd(A);toc;
??? Error using ==> svd
Use svds for sparse singular values and vectors.
>> tic;[U,S,V]=princomp(A);toc;
??? Error using ==> svd
Use svds for sparse singular values and vectors.
Error in ==> princomp at 86
[U,sigma,coeff] = svd(x0,econFlag); % put in 1/sqrt(n-1) later
>> tic;pc=princomp(A);toc;
??? Error using ==> eig
Use eigs for sparse eigenvalues and vectors.
Error in ==> princomp at 69
[coeff,~] = eig(x0'*x0);
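As a side note, in MATLAB releases from R2012b onward the built-in pca function (the replacement for princomp; not to be confused with the downloaded pca.m used above) accepts the number of components directly. A minimal sketch:
coeff = pca(X, 'NumComponents', k); % m-by-k principal directions
scores = bsxfun(@minus, X, mean(X,1)) * coeff; % n-by-k projection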

generate synthetic data 2d x t x v using matlab

I am trying to generate/simulate a synthetic data set for a synthetic blood-flow image in MATLAB, but I don't know where to start.
I know I should use the mesh function, but how do I make it work in the time dimension?
I will be very thankful if anybody could help or guide me through this. I want to generate a data set of size 25x25x10x4, which is X x Y x t x V. The image should look similar to the example blood-flow images shown in the original post.
Thank you in advance!
Dataset #1:
Use your favorite line representation (polar, linear, whatever) and randomly generate the parameters for your line. E.g., if you go for y = mx + c, randomly generate m and c. Now that you have defined your line, draw it on the image (e.g. with the linked SO method), as sketched below.
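A minimal sketch of that idea, rasterizing a random line y = m*x + c into a single 25x25 binary frame with a naive per-column loop (the parameter ranges are my own assumptions):
sz = 25;
m = 2*rand - 1; % random slope
c = rand*sz/2; % random intercept
img = zeros(sz);
for x = 1:sz
    y = round(m*x + c);
    if y >= 1 && y <= sz
        img(y, x) = 1; % mark the pixel the line passes through
    end
end
imagesc(img); axis image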
Dataset #2:
They look like 2D Gaussians. Use mvnpdf in the following manner.
[X Y] = meshgrid(x_range,y_range);
Z = reshape( mvnpdf([X(:) Y(:)],MU,SIGMA) ,size(X));
imagesc(Z);
Use some randomly generated MU and SIGMA such that MU lies within x_range and y_range. E.g., x_range = -3:0.1:3; y_range = x_range; and
MU =
0.9575 0.9649
SIGMA =
1.2647 0.3760
0.3760 1.0938
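Putting those pieces together into one runnable snippet, using the example values above:
x_range = -3:0.1:3; y_range = x_range;
MU = [0.9575 0.9649];
SIGMA = [1.2647 0.3760; 0.3760 1.0938];
[X, Y] = meshgrid(x_range, y_range);
Z = reshape(mvnpdf([X(:) Y(:)], MU, SIGMA), size(X));
imagesc(Z); axis image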
Just to complement @Jacob's very specific answer, you need a 4D MxNxTxV matrix. Here, according to the post, MxN is the dimension of each image, T is the time dimension, and V is the number of channels or samples per time frame (3 for RGB or >3 for any spectral image).
For each T, generate V images.
Simulate the V images with random parameters for Dataset #1 and Dataset #2.
Put everything in one 4D matrix per dataset (i.e. using a double for loop or concatenation).
Replace randn() with generate_image() below, i.e. a hypothetical function generating random samples of the type of structure you want, according to @Jacob's suggestions:
M = 25; N = 25;
T = 10; V = 4;
DataSet1 = zeros(M,N,T,V);
DataSet2 = zeros(M,N,T,V);
for t = 1:T
    for v = 1:V
        DataSet1(:,:,t,v) = randn(M,N);
        DataSet2(:,:,t,v) = randn(M,N);
    end
end

KNN algo in matlab

I am working on a thumb recognition system and need to implement the KNN algorithm to classify my images. In the example I linked, there are only 2 measurements per sample, through which the distance to the nearest neighbour is calculated; but in my case I have 400 images of 25x42 pixels, of which 200 are for training and 200 for testing. I have been searching for hours but cannot find a way to compute the distance between the points.
EDIT:
I have reshaped the first 200 images into 1x1050 vectors and stored them in a matrix trainingData of size 200x1050. Similarly I made testingData.
Here is an illustration code for k-nearest neighbor classification (some functions used require the Statistics toolbox):
%# image size
sz = [25,42];
%# training images
numTrain = 200;
trainData = zeros(numTrain,prod(sz));
for i=1:numTrain
    img = imread( sprintf('train/image_%03d.jpg',i) );
    trainData(i,:) = img(:);
end
%# testing images
numTest = 200;
testData = zeros(numTest,prod(sz));
for i=1:numTest
    img = imread( sprintf('test/image_%03d.jpg',i) );
    testData(i,:) = img(:);
end
%# target class (I'm just using random values. Load your actual values instead)
trainClass = randi([1 5], [numTrain 1]);
testClass = randi([1 5], [numTest 1]);
%# compute pairwise distances between each test instance vs. all training data
D = pdist2(testData, trainData, 'euclidean');
[D,idx] = sort(D, 2, 'ascend');
%# K nearest neighbors
K = 5;
D = D(:,1:K);
idx = idx(:,1:K);
%# majority vote
prediction = mode(trainClass(idx),2);
%# performance (confusion matrix and classification error)
C = confusionmat(testClass, prediction);
err = sum(C(:)) - sum(diag(C))
If you want to compute the Euclidean distance between vectors a and b, just use Pythagoras. In Matlab:
dist = sqrt(sum((a-b).^2));
However, you might want to use pdist to compute it for all combinations of vectors in your matrix at once.
dist = squareform(pdist(myVectors, 'euclidean'));
I'm interpreting columns as instances to classify and rows as potential neighbors. This is arbitrary though and you could switch them around.
If you have a separate test set, you can compute the distance to the instances in the training set with pdist2:
dist = pdist2(trainingSet, testSet, 'euclidean')
You can use this distance matrix to knn-classify your vectors as follows. I'll generate some random data to serve as an example, which will result in low (around chance level) accuracy. But of course you should plug in your actual data, and the results will probably be better.
nrOfVectors = 100; nrOfFeatures = 10; nrOfClasses = 5; % example sizes
m = rand(nrOfVectors,nrOfFeatures); % random example data
classes = randi(nrOfClasses, 1, nrOfVectors); % random true classes
k = 3; % number of neighbors to consider, 3 is a common value
d = squareform(pdist(m, 'euclidean')); % distance matrix
[neighborvals, neighborindex] = sort(d,1); % get sorted distances
Take a look at the neighborvals and neighborindex matrices and see if they make sense to you. The first is a sorted version of the earlier d matrix, and the latter gives the corresponding instance numbers. Note that the self-distances (on the diagonal in d) have floated to the top. We're not interested in this (always zero), so we'll skip the top row in the next step.
neighborclasses = classes(neighborindex); % look up the class of each sorted neighbor
assignedClasses = mode(neighborclasses(2:1+k,:),1);
So we assign the most common class among the k nearest neighbors!
You can compare the assigned classes with the actual classes to get an accuracy score:
accuracy = sum(classes == assignedClasses)/length(classes);
fprintf('KNN Classifier Accuracy: %.2f%%\n', 100*accuracy)
Or make a confusion matrix to see the distribution of classifications:
confusionmat(classes, assignedClasses)
Yes, there is a function for KNN: knnclassify.
Play around with the number of neighbors you want to keep in order to get the best result (use a confusion matrix). This function takes care of the distance, of course.
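Note that knnclassify (from the Bioinformatics Toolbox) has been removed in newer MATLAB releases; a minimal sketch of the modern replacement fitcknn, assuming trainingData/testingData as described in the question and a hypothetical 200x1 label vector trainLabels:
% fit a k-nearest-neighbor model and classify the test set
mdl = fitcknn(trainingData, trainLabels, 'NumNeighbors', 3); % trainLabels is hypothetical here
predicted = predict(mdl, testingData);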