I have a 21x5 sized matrix (top5features) containing values for 5 different feature types extracted from 21 cancer nodules. I am trying to apply principal component analysis on my data and plotting the results, but am having trouble understanding how to do so. The following is my code so far, but it only plots a portion of the data and I do not believe it is what I am going for:
top5features = features(1:21,[42 55 61 62 60]);
[W, pc] = princomp(top5features);
pc = pc'; W = W';
plot(pc(1,:),pc(2,:),'.');
title('{\bf PCA} of Top 5 Features')
My goal is to make the plot so that it has 21 points, with each point pertaining to a specific nodule. These 21 nodules are also divided into two groups, and if possible I would like to color code them according to the group they belong to. I am somewhat of a beginner using Matlab and any help would be appreciated.
Given your comments, the first 10 columns of your PCA decomposed data denote one group while the last 11 columns of your PCA decomposed data denote another group. This can simply be done in a single plot command like so, using your code earlier:
%// Your code
top5features = features(1:21,[42 55 61 62 60]);
[W, pc] = princomp(top5features);
pc = pc'; W = W';
%// Group 1
group1 = pc(:,1:10);
%// Group 2
group2 = pc(:,11:21);
%// Plot as separate colours
plot(group1(1,:), group1(2,:), 'b.', group2(1,:), group2(2,:), 'r.');
title('{\bf PCA} of Top 5 Features')
legend('Group 1', 'Group 2');
The above code first separates your PCA reduced data into the two groups that you have specified. Next, we use a single plot command to plot both groups together on a single plot but we colour differentiate them. Blue is for group 1 while red is for group 2. We place dot markers at each of the points. As a bonus, we add in a legend that denotes which group each point belongs to.
Hope this helps!
Related
I am trying to plot the histogram from the attached datasets in excel files. I have 2 questions (Q.2 is more important). The related csv files can be accessed from this link:
CSV files
1.Why the two histograms are different though exact same bins and bin sizes are used.
aa = xlsread('LF_NPV_Branch_Run.csv','C2:C828');
bb = xlsread('RES_Cob.csv','A1:CV827');
cc = aa*ones(1,100);
dev=bb-cc;
err_a=dev';
nbins = 20;
bound_n=min([floor(min(min(err_a))/10)*10,-10])
bound_p=max([ceil(max(max(err_a))/10)*10,10])
bins = linspace(bound_n,bound_p,nbins)
hist(err_a, bins)
figure(2)
hist(err_a(:), bins)
2.For figure 2, though the number for the tallest bin shows ~38000, but when I calculate the number using the bin on the center (zero) the number of points should be 63039 (which is more than the limit on the Y axis), not ~38000. What is the reason of this apparent mismatch?
val = dev(dev > bins(10) & dev < bins(11));
size(val)
Normally, if you have multiple questions, you should ask them seperately, but I can see that these two questions are closely related.
If you read MATLAB's documentation for hist(x,xbins):
If xbins is a vector of evenly spaced values, then hist uses the values as the bin centers.
The bin edges for the bin centred at bin(10) are actually
lower=(bins(9)+bins(10))/2
upper=(bins(10)+bins(11))/2
Therefore, to answer your Q2, you should find the result of the following matches the bin size shown in figure:
val = dev(dev > lower & dev <= upper);
size(val)
If you want bins to be the edges, you should use histogram(err_a(:), bins). See Specify Bin Edges of Histogram.
Q1:
err_a is a 100x827 matrix; err_a(:) makes it a 82700x1 column vector.
hist(m, bins) returns a bin for every column in m for each bin centre specified in bins. In your case, err_a has 827 columns. For each bin centre, hist(err_a, bins) gives 827 results and that is why there is a cluster of columns for every bin centre. hist(err_a(:), bins) on the other hand only gives 1 result per bin centre.
I'm using MATLAB to perform some statistics on some data. I have two 17x206x378 matrices where dimension 1 are subjects from the same group (so 17 subjects in matrix1, 17 in matrix 2). I want to perform ttests so I get 206 p-values. I then want to do this SEPARATELY for each of the 378 elements in the third dimension.
So say u is a 17x206x378 matrix and d is a different 17x206x378 matrix.
I basically started by doing:
[h,p,ci,s] = ttest2(u,d)
Which does in fact give me a p-matrix size 1x206x378 so everything looked great.
Then to do a quick check I just extracted the first of the third dimension elements from each matrix with:
u1=u(:,:,1); d1=d(:,:,1);
and ran test2 on this data via what you would expect:
[h1,p1,ci1,s1] = ttest2(u1,d1);
I again got a 1x206 p1-matrix of results but the values are not the same as those in the 1x206x378 p-matrix. When I plot the values in both the p(:,:,1) and the p1 vectors the resulting plots look very similar but not exactly the same.
Obviously one of these give results that are significant (below .05) in some instances where the other does not and I do not want to report a fake result so 2 questions I suppose?
1) I am under the impression I am doing the ttests on the same data so what exactly is going on here?
2) If I do ultimately want to get 206 p-values for each of the 378 third dimension elements, what is the correct way to do this?
Thanks for your help!
I ran the following code:
u = rand(17,206,378);
d = rand(17,206,378);
u1 = u(:,:,1);
d1 = d(:,:,1);
[h,p,ci,s] = ttest(u,d);
[h1,p1,ci1,s1] = ttest(u1,d1);
sum(abs(p1(1,:)- p(1,:,1)))
And the output was 0, indicating that the corresponding elements of p and p1 are the same. Maybe it's an indexing issue.
I am accessing 10 images from a folder "c1" and I have query image. I have implemented code for loading images in cell array and then I'm calculating histogram intersection between query image and each image from folder "c1" one-by-one. Now i want to draw precision-recall curve but i am not sure how to write code for getting "precision-recall curve" using the data obtained from histogram intersection.
My code:
Inp1=rgb2gray(imread('D:\visionImages\c1\1.ppm'));
figure, imshow(Inp1), title('Input image 1');
srcFiles = dir('D:\visionImages\c1\*.ppm'); % the folder in which images exists
for i = 1 : length(srcFiles)
filename = strcat('D:\visionImages\c1\',srcFiles(i).name);
I = imread(filename);
I=rgb2gray(I);
Seq{i}=I;
end
for i = 1 : length(srcFiles) % loop for calculating histogram intersections
A=Seq{i};
B=Inp1;
a = size(A,2); b = size(B,2);
K = zeros(a, b);
for j = 1:a
Va = repmat(A(:,j),1,b);
K(j,:) = 0.5*sum(Va + B - abs(Va - B));
end
end
Precision-Recall graphs measure the accuracy of your image retrieval system. They're also used in the performance of any search engine really, like text or documents. They're also used in machine learning evaluation and performance, though ROC Curves are what are more commonly used.
Precision-Recall graphs are more suitable for document and data retrieval. For the case of images here, given your query image, you are measuring how similar this image is with the rest of the images in your database. You then have similarity measures for each of the database images in relation to your query image and then you sort these similarities in descending order. For a good retrieval system, you would want the images that are the most relevant (i.e. what you are searching for) to all appear at the beginning, while the irrelevant images would appear after.
Precision
The definition of Precision is the ratio of the number of relevant images you have retrieved to the total number of irrelevant and relevant images retrieved. In other words, supposing that A was the number of relevant images retrieved and B was the total number of irrelevant images retrieved. When calculating precision, you take a look at the first several images, and this amount is A + B, as the total number of relevant and irrelevant images is how many images you are considering at this point. As such, another definition Precision is defined as the ratio of how many relevant images you have retrieved so far out of the bunch that you have grabbed:
Precision = A / (A + B)
Recall
The definition of Recall is slightly different. This evaluates how many of the relevant images you have retrieved so far out of a known total, which is the the total number of relevant images that exist. As such, let's say you again take a look at the first several images. You then determine how many relevant images there are, then you calculate how many relevant images that have been retrieved so far out of all of the relevant images in the database. This is defined as the ratio of how many relevant images you have retrieved overall. Supposing that A was again the total number of relevant images you have retrieved out of a bunch you have grabbed from the database, and C represents the total number of relevant images in your database. Recall is thus defined as:
Recall = A / C
How you calculate this in MATLAB is actually quite easy. You first need to know how many relevant images are in your database. After, you need to know the similarity measures assigned to each database image with respect to the query image. Once you compute these, you need to know which similarity measures map to which relevant images in your database. I don't see that in your code, so I will leave that to you. Once you do this, you then sort on the similarity values then you go through where in the sorted similarity values these relevant images occur. You then use these to calculate your precision and recall.
I'll provide a toy example so I can show you what the graph looks like as it isn't quite clear on how you're calculating your similarities here. Let's say I have 5 images in a database of 20, and I have a bunch of similarity values between them and a query image:
rng(123); %// Set seed for reproducibility
num_images = 20;
sims = rand(1,num_images);
sims =
Columns 1 through 13
0.6965 0.2861 0.2269 0.5513 0.7195 0.4231 0.9808 0.6848 0.4809 0.3921 0.3432 0.7290 0.4386
Columns 14 through 20
0.0597 0.3980 0.7380 0.1825 0.1755 0.5316 0.5318
Also, I know that images [1 5 7 9 12] are my relevant images.
relevant_IDs = [1 5 7 9 12];
num_relevant_images = numel(relevant_IDs);
Now let's sort the similarity values in descending order, as higher values mean higher similarity. You'd reverse this if you were calculating a dissimilarity measure:
[sorted_sims, locs] = sort(sims, 'descend');
locs will now contain the image ranks that each image ranked as. Specifically, these tell you which position in similarity the image belongs to. sorted_sims will have the similarities sorted in descending order:
sorted_sims =
Columns 1 through 13
0.9808 0.7380 0.7290 0.7195 0.6965 0.6848 0.5513 0.5318 0.5316 0.4809 0.4386 0.4231 0.3980
Columns 14 through 20
0.3921 0.3432 0.2861 0.2269 0.1825 0.1755 0.0597
locs =
7 16 12 5 1 8 4 20 19 9 13 6 15 10 11 2 3 17 18 14
Therefore, the 7th image is the highest ranked image, followed by the 16th image being the second highest image and so on. What you need to do now is for each of the images that you know are relevant, you need to figure out where these are located after sorting. We will go through each of the image IDs that we know are relevant, and figure out where these are located in the above locations array:
locations_final = arrayfun(#(x) find(locs == x, 1), relevant_IDs)
locations_final =
5 4 1 10 3
Let's sort these to get a better understand of what this is saying:
locations_sorted = sort(locations_final)
locations_sorted =
1 3 4 5 10
These locations above now tell you the order in which the relevant images will appear. As such, the first relevant image will appear first, the second relevant image will appear in the third position, the third relevant image appears in the fourth position and so on. These precisely correspond to part of the definition of Precision. For example, in the last position of locations_sorted, it would take ten images to retrieve all of the relevant images (5) in our database. Similarly, it would take five images to retrieve four relevant images in the database. As such, you would compute precision like this:
precision = (1:num_relevant_images) ./ locations_sorted;
Similarly for recall, it's simply the ratio of how many images were retrieved so far from the total, and so it would just be:
recall = (1:num_relevant_images) / num_relevant_images;
Your Precision-Recall graph would now look like the following, with Recall on the x-axis and Precision on the y-axis:
plot(recall, precision, 'b.-');
xlabel('Recall');
ylabel('Precision');
title('Precision-Recall Graph - Toy Example');
axis([0 1 0 1.05]); %// Adjust axes for better viewing
grid;
This is the graph I get:
You'll notice that between a recall ratio of 0.4 to 0.8 the precision is increasing a bit. This is because you have managed to retrieve a successive chain of images without touching any of the irrelevant ones, and so your precision will naturally increase. It goes way down after the last image, as you've had to retrieve so many irrelevant images before finally hitting a relevant image.
You'll also notice that precision and recall are inversely related. As such, if precision increases, then recall decreases. Similarly, if precision decreases, then recall will increase.
The first part makes sense because if you don't retrieve that many images in the beginning, you have a greater chance of not including irrelevant images in your results but at the same time, the amount of relevant images is rather small. This is why recall would decrease when precision would increase
The second part also makes sense because as you keep trying to retrieve more images in your database, you'll inevitably be able to retrieve all of the relevant ones, but you'll most likely start to include more irrelevant images, which would thus drive your precision down.
In an ideal world, if you had N relevant images in your database, you would want to see all of these images in the top N most similar spots. As such, this would make your precision-recall graph a flat horizontal line hovering at y = 1, which means that you've managed to retrieve all of your images in all of the top spots without accessing any irrelevant images. Unfortunately, that's never going to happen (or at least not for now...) as trying to figure out the best features for CBIR is still an on-going investigation, and no image search engine that I have seen has managed to get this perfect. This is still one of the most broadest and unsolved computer vision problems that exist today!
Edit
You retrieved this code to compute histogram intersection from this post. They have a neat way of computing histogram intersection as:
n is the total number of bins in your histogram. You'll have to play around with this to get good results, but we can leave that as a parameter in your code. The code above assumes that you have two matrices A and B where each column is a histogram. You'll generate a matrix that is of a x b, where a is the number of columns in A and b is the number of columns in b. The row and column of this matrix (i,j) tells you the similarity between the ith column in A with the b jth column in B. In your case, A would be a single column which denotes the histogram of your query image. B would be a 10 column matrix that denotes the histograms for each of the database images. Therefore, we will get a 1 x 10 array of similarity measures through histogram intersection.
As such, we need to modify your code so that you're using imhist for each of the images. We can also specify an additional parameter that gives you how many bins each histogram will have. Therefore, your code will look like this. Each new line that I have placed will have a %// NEW comment beside each line.
Inp1=rgb2gray(imread('D:\visionImages\c1\1.ppm'));
figure, imshow(Inp1), title('Input image 1');
num_bins = 32; %// NEW - I'm specifying 32 bins here. Play around with this yourself
A = imhist(Inp1, num_bins); %// NEW - Calculate histogram
srcFiles = dir('D:\visionImages\c1\*.ppm'); % the folder in which images exists
B = zeros(num_bins, length(srcFiles)); %// NEW - Store histograms of database images
for i = 1 : length(srcFiles)
filename = strcat('D:\visionImages\c1\',srcFiles(i).name);
I = imread(filename);
I=rgb2gray(I);
B(:,i) = imhist(I, num_bins); %// NEW - Put each histogram in a separate
%// column
end
%// NEW - Taken directly from the website
%// but modified for only one histogram in `A`
b = size(B,2);
Va = repmat(A, 1, b);
K = 0.5*sum(Va + B - abs(Va - B));
Take note that I have copied the code from the website, but I have modified it because there is only one image in A and so there is some code that isn't necessary.
K should now be a 1 x 10 array of histogram intersection similarities. You would then use K and assign sims to this variable (i.e. sims = K;) in the code I have written above, then run through your images. You also need to know which images are relevant images, and you'd have to change the code I've written to reflect that.
Hope this helps!
I have a dataset 6x1000 of binary data (6 data points, 1000 boolean dimensions).
I perform cluster analysis on it
[idx, ctrs] = kmeans(x, 3, 'distance', 'hamming');
And I get the three clusters. How can I visualize my result?
I have 6 rows of data each having 1000 attributes; 3 of them should be alike or similar in a way. Applying clustering will reveal the clusters. Since I know the number of clusters
I only need to find similar rows. Hamming distance tell us the similarity between rows and the result is correct that there are 3 clusters.
[EDIT: for any reasonable data, kmeans will always finds asked number
of clusters]
I want to take that knowledge
and make it easily observable and understandable without having to write huge explanations.
Matlab's example is not suitable since it deals with numerical 2D data while my questions concerns n-dimensional categorical data.
The dataset is here http://pastebin.com/cEWJfrAR
[EDIT1: how to check if clusters are significant?]
For more information please visit the following link:
https://chat.stackoverflow.com/rooms/32090/discussion-between-oleg-komarov-and-justcurious
If the question is not clear ask, for anything you are missing.
For representing the differences between high-dimensional vectors or clusters, I have used Matlab's dendrogram function. For instance, after loading your dataset into the matrix x I ran the following code:
l = linkage(a, 'average');
dendrogram(l);
and got the following plot:
The height of the bar that connects two groups of nodes represents the average distance between members of those two groups. In this case it looks like (5 and 6), (1 and 2), and (3 and 4) are clustered.
If you would rather use the hamming distance rather than the euclidian distance (which linkage does by default), then you can just do
l = linkage(x, 'average', {'hamming'});
although it makes little difference to the plot.
You can start by visualizing your data with a 'barcode' plot and then labeling rows with the cluster group they belong:
% Create figure
figure('pos',[100,300,640,150])
% Calculate patch xy coordinates
[r,c] = find(A);
Y = bsxfun(#minus,r,[.5,-.5,-.5, .5])';
X = bsxfun(#minus,c,[.5, .5,-.5,-.5])';
% plot patch
patch(X,Y,ones(size(X)),'EdgeColor','none','FaceColor','k');
% Set axis prop
set(gca,'pos',[0.05,0.05,.9,.9],'ylim',[0.5 6.5],'xlim',[0.5 1000.5],'xtick',[],'ytick',1:6,'ydir','reverse')
% Cluster
c = kmeans(A,3,'distance','hamming');
% Add lateral labeling of the clusters
nc = numel(c);
h = text(repmat(1010,nc,1),1:nc,reshape(sprintf('%3d',c),3,numel(c))');
cmap = hsv(max(c));
set(h,{'Background'},num2cell(cmap(c,:),2))
Definition
The Hamming distance for binary strings a and b the Hamming distance is equal to the number of ones (population count) in a XOR b (see Hamming distance).
Solution
Since you have six data strings, so you could create a 6 by 6 matrix filled with the Hamming distance. The matrix would be symetric (distance from a to b is the same as distance from b to a) and the diagonal is 0 (distance for a to itself is nul).
For example, the Hamming distance between your first and second string is:
hamming_dist12 = sum(xor(x(1,:),x(2,:)));
Loop that and fill your matrix:
hamming_dist = zeros(6);
for i=1:6,
for j=1:6,
hamming_dist(i,j) = sum(xor(x(i,:),x(j,:)));
end
end
(And yes this code is a redundant given the symmetry and zero diagonal, but the computation is minimal and optimizing not worth the effort).
Print your matrix as a spreadsheet in text format, and let the reader find which data string is similar to which.
This does not use your "kmeans" approach, but your added description regarding the problem helped shaping this out-of-the-box answer. I hope it helps.
Results
0 182 481 495 490 500
182 0 479 489 492 488
481 479 0 180 497 517
495 489 180 0 503 515
490 492 497 503 0 174
500 488 517 515 174 0
Edit 1:
How to read the table? The table is a simple distance table. Each row and each column represent a series of data (herein a binary string). The value at the intersection of row 1 and column 2 is the Hamming distance between string 1 and string 2, which is 182. The distance between string 1 and 2 is the same as between string 2 and 1, this is why the matrix is symmetric.
Data analysis
Three clusters can readily be identified: 1-2, 3-4 and 5-6, whose Hamming distance are, respectively, 182, 180, and 174.
Within a cluster, the data has ~18% dissimilarity. By contrast, data not part of a cluster has ~50% dissimilarity (which is random given binary data).
Presentation
I recommend Kohonen network or similar technique to present your data in, say, 2 dimensions. In general this area is called Dimensionality reduction.
I you can also go simpler way, e.g. Principal Component Analysis, but there's no quarantee you can effectively remove 9998 dimensions :P
scikit-learn is a good Python package to get you started, similar exist in matlab, java, ect. I can assure you it's rather easy to implement some of these algorithms yourself.
Concerns
I have a concern over your data set though. 6 data points is really a small number. moreover your attributes seem boolean at first glance, if that's the case, manhattan distance if what you should use. I think (someone correct me if I'm wrong) Hamming distance only makes sense if your attributes are somehow related, e.g. if attributes are actually a 1000-bit long binary string rather than 1000 independent 1-bit attributes.
Moreover, with 6 data points, you have only 2 ** 6 combinations, that means 936 out of 1000 attributes you have are either truly redundant or indistinguishable from redundant.
K-means almost always finds as many clusters as you ask for. To test significance of your clusters, run K-means several times with different initial conditions and check if you get same clusters. If you get different clusters every time or even from time to time, you cannot really trust your result.
I used a barcode type visualization for my data. The code which was posted here earlier by Oleg was too heavy for my solution (image files were over 500 kb) so I used image() to make the figures
function barcode(A)
B = (A+1)*2;
image(B);
colormap flag;
set(gca,'Ydir','Normal')
axis([0 size(B,2) 0 size(B,1)]);
ax = gca;
ax.TickDir = 'out'
end
I have been working with MATLAB's treeplot function, but it seems to provide surprisingly little plotting functionality and/or extendibility.
I am plotting a tree like so:
tree = [0 1 2 2 2 2 2 1 8 8 1 11 11 1 14];
treeplot(tree)
Giving:
What I would like to do is add annotations or labels to specific nodes. A good starter would be to add the node numbers to each node, as in the example from the help file:
As they state, though:
These indices are shown only for the point of illustrating the example; they are not part of the treeplot output.
Is there a way to get the locations of the plotted nodes, or at the very least to plot the node numbers? I couldn't find any FEX submissions with more advanced tree plots.
Ultimately, I'd like to plot small pictures at the nodes (using methods from answers to a previous question of mine).
This should help you make a labeled tree:
(You supply the 'treeVec'.)
treeplot(treeVec);
count = size(treeVec,2);
[x,y] = treelayout(treeVec);
x = x';
y = y';
name1 = cellstr(num2str((1:count)'));
text(x(:,1), y(:,1), name1, 'VerticalAlignment','bottom','HorizontalAlignment','right')
title({'Level Lines'},'FontSize',12,'FontName','Times New Roman');
With your sample input, this gives
To get the position of the nodes, use treelayout
[x,y]=treelayout(tree);
The vectors x and y give you the positions, which you can then use to plot images at the nodes.