Pairwise Similarity and Sorting Samples - matlab

The following is a problem from an assignment that I am trying to solve:
Visualization of similarity matrix. Represent every sample with a four-dimensional vector (sepal length, sepal width, petal length, petal width). For every two samples, compute their pairwise similarity. You may do so using the Euclidean distance or other metrics. This leads to a similarity matrix where the element (i,j) stores the similarity between samples i and j. Please sort all samples so that samples from the same category appear together. Visualize the matrix using the function imagesc() or any other function.
Here is the code I have written so far:
load('iris.mat'); % create a table of the data
iris.Properties.VariableNames = {'Sepal_Length' 'Sepal_Width' 'Petal_Length' 'Petal_Width' 'Class'}; % change the variable names to their actual meaning
iris_copy = iris(1:150,{'Sepal_Length' 'Sepal_Width' 'Petal_Length' 'Petal_Width'}); % make a copy of the (numerical) features of the table
iris_distance = table2array(iris_copy); % convert the table to an array
% pairwise similarity
D = pdist(iris_distance); % calculate the Euclidean distance and store the result in D
W = squareform(D); % convert to squareform
figure()
imagesc(W); % visualize the matrix
Now, I think I've got the coding mostly right to answer the question. My issue is how to sort all the samples so that samples from the same category appear together, because I got rid of the class labels when I created the copy. Is it already sorted by converting to squareform? Any other suggestions? Thank you!

It should be in the same order as the original data. While you could sort it afterwards, the easiest solution is to sort your data by class right after the line that renames the variables and before the line that copies the features.
load('iris.mat'); % create a table of the data
iris.Properties.VariableNames = {'Sepal_Length' 'Sepal_Width' 'Petal_Length' 'Petal_Width' 'Class'}; % change the variable names to their actual meaning
% Sort the table here on the "Class" attribute. Don't forget to change the table name
% in the next line too if you need to.
iris_copy = iris(1:150,{'Sepal_Length' 'Sepal_Width' 'Petal_Length' 'Petal_Width'}); % make a copy of the (numerical) features of the table
Consider using sortrows:
tblB = sortrows(tblA,'RowNames') sorts a table based on its row names. Row names of a table label the rows along the first dimension of the table. If tblA does not have row names, that is, if tblA.Properties.RowNames is empty, then sortrows returns tblA.
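For this problem you want the syntax that sorts by a table variable rather than by row names. A minimal sketch, assuming the class labels are stored in the table variable named 'Class':
iris_sorted = sortrows(iris, 'Class'); % group samples of the same class together
features = table2array(iris_sorted(:, {'Sepal_Length' 'Sepal_Width' 'Petal_Length' 'Petal_Width'})); % keep the four numeric features
W = squareform(pdist(features)); % pairwise Euclidean distances of the sorted samples
figure()
imagesc(W); % blocks along the diagonal now correspond to the classes
colorbar;
With the samples grouped by class, the three Iris species should show up as blocks of low distance along the diagonal of the visualized matrix.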

Related

Calculating the most similar pair of column vectors using cosine distance in a matrix

I have a 943x1682 matrix and I want to find the two most similar column vectors in it. So I want to see the cosine distance of each column to every other column, ideally not including a column compared with itself; if that can't be avoided, I can just ignore those entries.
I wrote a loop to compute this, so that I end up with a 1682x1682 matrix in which each cell holds the similarity between columns i and j. However, when I run it, it takes forever, and when I try to open the resulting matrix in my workspace, it says:
Cannot display summaries of variables with more than 524288 elements.
Is there an easier way to do this or am I doing something wrong?
Cross posted on MATLAB Answers. Repeating answer here:
Use a standard matrix multiply to get the dot products. MATLAB is very fast at standard matrix multiplies. And then normalize the result. E.g.,
AA = A' * A; % the column dot products via a standard matrix multiply
Anorm = sqrt(diag(AA)); % the norms of the columns
Adist = AA ./ (Anorm .* Anorm.'); % normalize the column dot products into cosine distances
Then pick off the maximum value for your answer, disregarding the diagonal. E.g.,
n = size(A,2); % the number of columns
Adist(1:n+1:end) = -inf; % disregard the diagonal (column compared to itself)
[~,x] = max(Adist(:)); % find the max cosine distance linear index
[col1,col2] = ind2sub(size(Adist),x); % convert linear index into the original columns
Then col1 and col2 are the column numbers of the most similar columns, using cosine distance as the measure.
You can normalise the columns of the matrix first, then the cosine similarity equation simplifies to a matrix multiplication:
aNorm = normc(A);
cosSim = aNorm' * aNorm;
Generally, matrix multiplication is more performant than looping. In a quick test, with N = 1000, the looping code takes ~7 seconds and the matrix multiplication code ~0.5 seconds.
The resulting matrix may still be too large to open in your workspace; you could copy individual rows or columns into a temporary variable and view those, or do a contour plot (heat map) of the matrix to get a visual representation.
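Note that normc comes from the Deep Learning Toolbox (formerly the Neural Network Toolbox). If you don't have it, a rough base-MATLAB sketch of the same idea (assuming A is the 943x1682 data matrix; vecnorm needs R2017b or later) would be:
aNorm = A ./ vecnorm(A); % divide each column by its 2-norm (implicit expansion)
cosSim = aNorm' * aNorm; % 1682x1682 matrix of cosine similarities between columns
cosSim(1:size(cosSim,1)+1:end) = -inf; % ignore each column compared with itself
[~, idx] = max(cosSim(:)); % most similar pair
[col1, col2] = ind2sub(size(cosSim), idx); % convert back to column numbers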

Find the minimum absolute values along the third dimension of a 3D matrix while ensuring the sign is maintained

I have an m x n x k matrix and I want to find the elements that have minimal absolute value along the third dimension for each unique 2D spatial coordinate. An additional constraint is that once I find these minimum values, the sign of these values (i.e. before I took the absolute value) must be maintained.
The code I wrote to accomplish this is shown below.
tmp = abs(dist); %the size(dist)=[m,n,k]
[v,ind] = min(tmp,[],3); %find the index of minimal absolute value in the 3rd dimension
ind = reshape(ind,m*n,1);
[col,row]=meshgrid(1:n,1:m); row = reshape(row,m*n,1); col = reshape(col,m*n,1);
ind2 = sub2ind(size(dist),row,col,ind); % row, col, ind are sub
dm = dist(ind2); %take the signed value from dist
dm = reshape(dm,m,n);
The resulting m x n matrix dm satisfies the constraints I mentioned above. However, this code seems a little inefficient since I have to generate linear indices. Is there any way to improve it?
If I'm interpreting your problem statement correctly, you wish to find the minimum absolute value along the third dimension for each unique 2D spatial coordinate in your 3D matrix. This is already being done by the first two lines of your code.
However, a small caveat is that once you find these minimum values, you must ensure that the original sign of these values (i.e. before taking the absolute value) is respected. That is the purpose of the rest of the code.
If you want to select the original values, you have no choice but to generate the correct linear indices and sample from the original matrix. However, a lot of the code is rather superfluous: there is no need to perform any kind of reshaping.
We can simplify your method by using ndgrid to generate the correct spatial coordinates for sampling the 3D matrix, then use ind from your code to reference the third dimension. After that, use these to sample dist and complete your code:
%// From your code
[v,ind] = min(abs(dist),[],3);
%// New code
[row,col] = ndgrid(1:size(dist,1), 1:size(dist,2));
%// Output
dm = dist(sub2ind(size(dist), row, col, ind));
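As a quick sanity check, here is the simplified version run on made-up random data (the sizes below are arbitrary):
dist = randn(4, 5, 6); % made-up m x n x k data
[~, ind] = min(abs(dist), [], 3); % index of the minimum-magnitude element along dim 3
[row, col] = ndgrid(1:size(dist,1), 1:size(dist,2)); % spatial coordinates for every (i,j)
dm = dist(sub2ind(size(dist), row, col, ind)); % signed values with minimal magnitude
% abs(dm) equals min(abs(dist),[],3), but dm keeps the original signs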

How to change the value of a random subset of elements in a matrix without using a loop?

I'm currently attempting to select a subset of the 0's in a very large matrix (about 400x300 elements) and change their value to 1. I am able to do this, but it requires a loop in which each iteration selects the next value in a randperm vector. In other words, 50% of the 0's in the matrix are randomly selected, one at a time, and changed to 1:
z=1;
for z=1:(.5*numberofzeroes)
    A(zeroposition(rpnumberofzeroes(z),1),zeroposition(rpnumberofzeroes(z),2))=1;
    z=z+1;
end
Here 'A' is the matrix, 'zeroposition' is a 2-column matrix with the positions of the 0's in the matrix (the "coordinates", if you like), and 'rpnumberofzeroes' is a randperm vector from 1 to the number of zeroes in the matrix.
So for example, for z=20, the code might be something like this:
A(3557,2684)=1;
...so that the 0 which appears in this location within A will now be a 1.
It performs this loop thousands of times, because .5*numberofzeroes is a very big number. This inevitably takes a long time, so my question is: can this be done without using a loop, or at least in some way that takes less processing time and fewer resources?
As I said, the only thing that needs to be done is an entirely random selection of 50% (or whatever proportion) of the 0's changed to 1.
Thanks in advance for the help, and let me know if I can clear anything up! I'm new here, so apologies in advance if I've made any faux pas.
That's very easy. I'd like to introduce you to my friend sub2ind. sub2ind allows you to take row and column coordinates of a matrix and convert them into linear column-major indices so that you can access multiple values in a matrix simultaneously in a single call. As such, the equivalent code you want is:
%// First take the first half of the shuffled indices stored in rpnumberofzeroes
vals = rpnumberofzeroes(1:round(0.5*numberofzeroes));
%// Use these to pull out the rows and columns of A we want to access
rows = zeroposition(vals, 1);
cols = zeroposition(vals, 2);
%// Get linear indices via sub2ind
ind1 = sub2ind(size(A), rows, cols);
%// Now set these locations to 1
A(ind1) = 1;
The first statement grabs the first half of the shuffled indices stored in rpnumberofzeroes. Each of these indices points at a row of zeroposition, whose first column holds the row coordinate and whose second column holds the column coordinate of a zero in A. Extracting those rows and columns gives us the locations in A that we want to change. Once that's done, we use sub2ind to turn them into indices into A. sub2ind requires three inputs: the size of the matrix you are trying to access (in our case, that's A), the row coordinates, and the column coordinates. The output is a set of column-major linear indices, one for each row and column pair.
The last piece of the puzzle is to use these to index into A and set the locations to 1.
This can be accomplished with linear indexing as well:
% find linear position of all zeros in matrix
ix=find(abs(A)<eps);
% set one half of those, selected at random, to one.
A(ix(randperm(numel(ix), round(0.5*numel(ix))))) = 1;
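For instance, on a small made-up example (the sizes here are arbitrary):
A = double(rand(6,5) > 0.5); % a 6x5 matrix of 0s and 1s
ix = find(A == 0); % linear positions of all zeros
pick = ix(randperm(numel(ix), round(0.5*numel(ix)))); % a random half of those positions
A(pick) = 1; % flip the chosen zeros to 1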

Maintaining order of corresponding matrices in MATLAB

I have a logistic regression model and I want to create a lift chart to show its efficacy. To do that I need to order my validation set by descending predicted probability. This sort is easily done in MATLAB, but I need to know how it changes the order of my predictions so that I can re-order the actual values of the validation set accordingly. Is there a simple way to do this without writing code?
Use the second output of sort:
[As,inds] = sort(A,'descend');
Bs = B(inds);
Note that if you have your vectors in a single matrix, you can use sortrows. For example, if you want to sort a matrix X according to the second column:
Y = sortrows(X,-2) % -2 means second column, descending
Y1 = Y(:,1); % first column of X sorted according to X(:,2)
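For the lift chart itself, a minimal sketch along these lines should work (predictedProb and actual are assumed names for the predicted probabilities and the true 0/1 outcomes of the validation set; they are not from the question):
[~, order] = sort(predictedProb, 'descend'); % rank validation samples by predicted probability
actualSorted = actual(order); % reorder the true outcomes the same way
gains = cumsum(actualSorted) / sum(actualSorted); % cumulative share of positives captured
plot((1:numel(gains)) / numel(gains), gains); % gains/lift curve
xlabel('Fraction of samples'); ylabel('Fraction of positives captured');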

Matlab code to compare two histograms

I want to compare two image histograms. They are as follows:
h1 --> double-valued one-dimensional vector, 4096 elements long.
h2 --> double-valued one-dimensional vector, 4096 elements long.
I am using this matlab function here:
http://clickdamage.com/sourcecode/code/compareHists.m
It is as follows:
% s = compareHists(h1,h2)
% returns a histogram similarity in the range 0..1
%
% Compares 2 normalised histograms using the Bhattacharyya coefficient.
% Assumes that sum(h1) == sum(h2) == 1
%
function s = compareHists(h1,h2)
s = sum(sum(sum(sqrt(h1).*sqrt(h2))));
My question is: is there a need for multiple sums?
Wouldn't a single sum in the above equation suffice,
like this: sum(sqrt(h1).*sqrt(h2))?
Can someone please explain the code above? Also, tell me if it is all right to use a single sum.
I tried both ways and got the same answer for my two image histograms. However, I only tested this on two histograms, so I want to be sure.
Thanks!
In general, sum sums along one dimension only. If you want to sum along multiple dimensions, you either
use sum several times; or
use linear indexing to reduce to a single dimension and then use sum once: sum(sqrt(h1(:)).*sqrt(h2(:))).
In your case, if there's only one dimension, yes, a single sum would suffice.
You are right, only one sum is needed. However, if either h1 or h2 is a multidimensional matrix, then you may want to sum as many times as there are dimensions. For example:
A=magic(4); % a 4 by 4 matrix of magic numbers.
sum(A) % returns [34,34,34,34], i.e. the sum of elements in each column.
sum(sum(A)) % returns 136, i.e. the sum of all elements in A.
I believe the code you downloaded was originally written to handle multiple histograms stacked as the columns of a matrix. This is (IMHO) the reason for the multiple sums.
In your case you can leave it with only one sum.
You can do even better - without any sum
s = sqrt(h1(:)')*sqrt(h2(:));
The trick is to use vector multiplication!
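As a quick check that the different forms agree, here is a made-up example with two random histograms normalized to sum to 1:
h1 = rand(4096,1); h1 = h1 / sum(h1); % made-up normalized histogram
h2 = rand(4096,1); h2 = h2 / sum(h2); % made-up normalized histogram
s1 = sum(sum(sum(sqrt(h1).*sqrt(h2)))); % original triple sum
s2 = sum(sqrt(h1).*sqrt(h2)); % single sum
s3 = sqrt(h1(:)') * sqrt(h2(:)); % dot product, no explicit sum
% s1, s2 and s3 agree up to floating-point rounding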
I don't see the point of three sums either, but if the histograms are stored as matrices rather than vectors you will need two sums, like sum(sum(sqrt(h1).*sqrt(h2))), to compare them. The inner sum collapses each column to a single value; the outer sum then adds up those values.