Unable to figure out the ground truth databased in calculating the mean Average Precision Recall using Matlab

Unable to figure out the ground truth databased in calculating the mean Average Precision Recall using Matlab - matlab

Assuming that I have a dataset of the following size:
train = 500,000 * 960 %number of training samples (vector) each of 960 length
B_base = 1000000*960 %number of base samples (vector) each of 960 length
Query = 1000*960 %number of query samples (vector) each of 960 length
truth_nn = 1000*100
truth_nn contains the ground truth neighbors in the form of the
pre-computed k nearest neighbors and their square Euclidean distance. So, the columns of truth_nn represent the k = 100 nearest neighbors. I am finding difficult to apply nearest neighbor search in the code snippet. Can somebody please show how to apply the ground truth neighbors truth_nn in finding the mean average precision-recall?
It will be of immense help if somebody can show with any small example by creating any data matrix, query matrix, and the ground truth neighbors in the form of the pre-computed k nearest neighbors and their square Euclidean distance. I tried creating a sample database.
Assume, the base data is
B_base = [1 1; 2 2; 3 2; 4 4; 5 6];
Query data is
Query = [1 1; 2 1; 6 2];
[neighbors distances] = knnsearch(a,b,'k',2);
would find 2 nearest neighbors.
Question 1: how do I create the truth data containing the ground truth neighbors and pre-computed k nearest neighbor distances?
This is called as the mean average precision recall. I tried implementing the knearest neighbor search and the average precision recall as follows but cannot understand (unsure) how to apply the ground truth table
Question 2:
I am trying to apply k nearest neighbor search by converting first the real-valued features into binary.
I am unable to apply the concept of k-nearest neighbor search for different values of k = 10,20,50 and to check how much data has been correctly recalled using the GIST database. In the GIST truth_nn() file, when I specify truth_nn(i,1:k) for a query vector i, the function AveragePrecision throws error. So, if somebody can show using any sample ground truth that is of similar structure to that in GIST, how to properly specify k and calculate the Average precision recall, then I shall be able to apply the solution to the GIST database. As of now, this is my approach and shall be of immense help if the correct way is provided using any example that will be easier for me to relate to the GIST database. The problem is on how I can find neighbors from the ground truth and compare it with the neighbors obtained after sorting the distances?
I am also interested on how I can apply pdist2() instead of the present distance calcualtion as it takes a long time.
numQueryVectors = size(Query,1);
%Calculate distances
for i=1:numQueryVectors,
queryMatrix(i,:)
dist = sum((repmat(queryMatrix(i,:),numDataVectors,1)-B_base ).^2,2);
[sortval sortpos] = sort(dist,'ascend');
neighborIds(i,:) = sortpos(1:k);
neighborDistances(i,:) = sqrt(sortval(1:k));
end
%Sorting calculated nearest neighbor distances for k = 50
%HOW DO I SPECIFY k = 50 in the ground truth, truth_nn
for i=1:numQueryVectors
AP(i) = AveragePrecision(neighborIds(i,:),truth_nn(i,:));
end
mAP = mean(AP);
function ap = AveragePrecision(rank_id, truth_id)
truth_num = length(truth_id);
truth_pos = zeros(truth_num,1);
for j=1:50 %% for k = 50 nearest neighbors
truth_pos(j) = find(rank_id == truth_id(j));
end
truth_pos = sort(truth_pos, 'ascend');
% compute average precision as the area below the recall-precision curve
ap = 0;
delta_recall = 1/truth_num;
for j=1:truth_num
p = j/truth_pos(j);
ap = ap + p*delta_recall;
end
end
end
UPDATE : Based on solution, I tried to calculate the mean average precision using the formula given formula hereand a reference code . But, not sure if my approach is correct because the theory says that I need to rank the returned queries based on the indices. I do not understand this fully. Mean average precision is required to judge the quality of the retrieval algortihm.
precision = positives/total_data;
recal = positives /(positives+negatives);
precision = positives/total_data;
recall = positives /(positives+negatives);
truth_pos = sort(positives, 'ascend');
truth_num = length(truth_pos);
ap = 0;
delta_recall = 1/truth_num;
for j=1:truth_num
p = j/truth_pos(j);
ap = ap + p*delta_recall;
end
ap
The value of ap = infinity , value of positive = 0 and negatives = 150. This means that knnsearch() does not work at all.

I think you are doing extra work. This process is very simple in matlab, you can also operate on entire arrays. This should be faster than for loops, and is a bit easier to read.
Your truth_nn and neighbors should have the same data, if there are no errors. There is one entry per row. Matlab already sorts the kmeans result in ascending order, so the column 1 is the closest neighbor, the second closest is column 2, 3rd closest is 3,.... There is no need to sort the data again.
Just compare truth_nn to neighbors to get your statistics. This is a simple example to show you how the program should go. It will not work on your data without some modification
%in your example this is provided, I created my own
truth_nn = [1,2;
1,3;
4,3];
B_base = [1 1; 2 2; 3 2; 4 4; 5 6];
Query = [1 1; 2 1; 6 2];
%performs k means
num_clusters = 2;
[neighbors distances] = knnsearch(B_base,Query,'k',num_clusters);
%--- output---
% neighbors = [1,2;
% 1,2; notice this doesn't match truth_nn 1,3
% 4,3]
% distances = [ 0 1.4142;
% 1.0000 1.0000;
% 2.8284 3.0000];
%computes statistics, nnz counts number of nonzero elements, in the first
%case every piece of data that matches
%NOTE1: the indexing on truth_nn (:,1:num_clusters ) it says all rows
% but only use the first num_clusters columns. This should
% prevent the dimension mistmatch error you were getting
positives = nnz(neighbors == truth_nn(:,1:num_clusters )); %result = 5
negatives = nnz(neighbors ~= truth_nn(:,1:num_clusters )); %result = 1
%NOTE1: I've switched this from truth_nn to neighbors, this helps
% when you cahnge num_neghbors
total_data = numel(neighbors); %result = 6
percent_incorrect = 100*(negatives / total_data); % 16.6666
percent_correct = 100*(positives / total_data); % 93.3333

Related

Optimize nested for loop for calculating xcorr of matrix rows

I have 2 nested loops which do the following:
Get two rows of a matrix
Check if indices meet a condition or not
If they do: calculate xcorr between the two rows and put it into new vector
Find the index of the maximum value of sub vector and replace element of LAG matrix with this value
I dont know how I can speed this code up by vectorizing or otherwise.
b=size(data,1);
F=size(data,2);
LAG= zeros(b,b);
for i=1:b
for j=1:b
if j>i
x=data(i,:);
y=data(j,:);
d=xcorr(x,y);
d=d(:,F:(2*F)-1);
[M,I] = max(d);
LAG(i,j)=I-1;
d=xcorr(y,x);
d=d(:,F:(2*F)-1);
[M,I] = max(d);
LAG(j,i)=I-1;
end
end
end

First, a note on floating point precision...
You mention in a comment that your data contains the integers 0, 1, and 2. You would therefore expect a cross-correlation to give integer results. However, since the calculation is being done in double-precision, there appears to be some floating-point error introduced. This error can cause the results to be ever so slightly larger or smaller than integer values.
Since your calculations involve looking for the location of the maxima, then you could get slightly different results if there are repeated maximal integer values with added precision errors. For example, let's say you expect the value 10 to be the maximum and appear in indices 2 and 4 of a vector d. You might calculate d one way and get d(2) = 10 and d(4) = 10.00000000000001, with some added precision error. The maximum would therefore be located in index 4. If you use a different method to calculate d, you might get d(2) = 10 and d(4) = 9.99999999999999, with the error going in the opposite direction, causing the maximum to be located in index 2.
The solution? Round your cross-correlation data first:
d = round(xcorr(x, y));
This will eliminate the floating-point errors and give you the integer results you expect.
Now, on to the actual solutions...
Solution 1: Non-loop option
You can pass a matrix to xcorr and it will perform the cross-correlation for every pairwise combination of columns. Using this, you can forego your loops altogether like so:
d = round(xcorr(data.'));
[~, I] = max(d(F:(2*F)-1,:), [], 1);
LAG = reshape(I-1, b, b).';
Solution 2: Improved loop option
There are limits to how large data can be for the above solution, since it will produce large intermediate and output variables that can exceed the maximum array size available. In such a case for loops may be unavoidable, but you can improve upon the for-loop solution above. Specifically, you can compute the cross-correlation once for a pair (x, y), then just flip the result for the pair (y, x):
% Loop over rows:
for row = 1:b
% Loop over upper matrix triangle:
for col = (row+1):b
% Cross-correlation for upper triangle:
d = round(xcorr(data(row, :), data(col, :)));
[~, I] = max(d(:, F:(2*F)-1));
LAG(row, col) = I-1;
% Cross-correlation for lower triangle:
d = fliplr(d);
[~, I] = max(d(:, F:(2*F)-1));
LAG(col, row) = I-1;
end
end

Find all neighbors within specified distance of each element in 2-D Array

I would like to implement Adaptive watershed segmentation using Matlab.
One step in this algorithm is :
Scan the marker map pixel by pixel. If M(x0,y0) is 1, seek the spurious maxima in its neighbourhood with a radius of D(x ,y ).
D and M are all 2-D array.
Is there any function can Find all neighbors within specified distance of each element in 2-D Array?
I couldn't use the rangesearch() so I don't know dose it can solve my problem.
Thank you in advance!

In general, rengesearch is definitely the best way to solve this problem.
However, since you don't have Statistics and Machine Learning Toolbox, you can try using MATLAB's bwdist function.
bwdist function is basically a distance transform, and it can also return a map of distances from a certain point.
The second parameter of bwdist defines the type of distance function to use (The default distance function is Euclidean).
Example:
%inputMat is the original matrix
inputMat = ones(10,10);
%(Px,Py) is the point to calculate the distance from
Px = 5; Py = 5;
%calls bwdist
tempMat = false(size(inputMat));
tempMat(Py,Px) = true;
distMat = bwdist(tempMat);
%search for neighbour pixels
neigbourPixels = distMat < D(Px,Py);
[Y, X] = ind2sub(size(neigbourPixels),find(neigbourPixels));
%prints result
[Y,X]
result:
ans =
4 4
5 4
6 4
4 5
5 5
6 5
4 6
5 6
6 6
Runtime optimization
The downside of this method is that the computation time of bwdist may be slow if you call it many times. Therefore, if the runtime is important, I suggest the following optimization:
First stage:
Calculate the distance map only once, on a big matrix as follows:
N = max(size(inputMat));
tempMat = false(N,N);
tempMat(N/2,N/2) = true;
distMatInitial = bwdist(tempMat);
Second stage:
Given a new point (Px,Py) which you want to find its neigbours, calculate its distance map without calling to bwdist again, by simply copying a patch from the center of distMatInitial:
Mp = 30; %determines maximal patch size
distsMapFromPxPy = zeros(size(inputMat));
distsMapFromPxPy(Py-Mp:Py+Mp,Px-Mp:Px+Mp) = distMatInitial(N/2-Mp,N/2+Mp,N/2-Mp:N/2+Mp);
*Notice that this example doesn't handle edge cases such as:
(1)(Py-Mp)<0 (2)(Py+Mp)>m (3)(Px-Mp)<0 (4)(Px+Mp)>n
Therefore, if you choose to try this approach, don't forget to handle these cases.

Bit Error Rate using Test Data

I have the below code. I want to find the bit error rate considering the clustered data as a trained data and send a test data. Can I do that with this code? I appreciate your active support.
clear all;
clc;
T=[ 2+2*i 2-2*i -2+2*i -2-2*i];
A=randn(150,2)+2*ones(150,2); C=randn(150,2)-2*ones(150,2); B=randn(150,2)+2*ones(150,2);
F=randn(150,2)-2*ones(150,2); D=randn(150,2)+2*ones(150,2); G=randn(150,2)-2*ones(150,2);
E=randn(150,2)+2*ones(150,2); H=randn(150,2)-2*ones(150,2);
X = [A; B; D; C; F; E; G; H];
[idx, centroids] = kmeans(X, 4, 'Replicates', 20);
x = X(:,1); y = X(:,2);
figure;
colors = 'rgbk';
[X,Y] = meshgrid(-5:0.05:5, -5:0.05:5);
X = X(:);
Y = Y(:);
figure; hold on;
for idx = 1 : numel(X)
[dummy,ind] = min(sum(bsxfun(#minus, [X(idx) Y(idx)], centroids).^2, 2));
plot(X(idx), Y(idx), [colors(ind), '.']);
end

OK, now your question is more clear. I didn't understand what you meant in your other post. Alright, it looks like your T is your transmission alphabet. Be advised that the clusters that you get through k-means will probably not be the same as those from your transmission alphabet, so you're going to have to figure out which centroids are the closest from to your transmission alphabet. We can do that with the following code:
gt = zeros(1,4);
for idx = 1 : 4
[dummy,gt(idx)] = min(sum(bsxfun(#minus, [real(T(idx)), imag(T(idx))], centroids).^2, 2));
end
gt will contain which symbol in your alphabet matches which cluster centroid in your data. For your results to be reproducible, I set the random seed generator to 1234 (i.e. rng(1234);) then ran your code. It gave me the following for gt:
gt =
4 2 3 1
Simply put, each element in gt tells you which symbol in T matches up with what centroid in centroids. Therefore, gt(1) = 4 means that centroid #1 got matched to the 4th symbol in your transmission alphabet, gt(2) = 2 means that centroid #2 got matched to the 2nd symbol in your alphabet and so on.
As such, given your test sequence that is composed of the alphabet in T, simply create your test sequence, keeping in mind what gt is. Therefore, you could do something like this:
rng(1234);
rand_ind = randi(4, 10, 1);
test_sequence = T(rand_ind);
gt_labels = gt(rand_ind);
The above code will generate a random integer sequence from 1 to 4 and there will be 10 of these numbers. I then use this to create a random test sequence using the alphabet from T. gt_labels will also contain what the actual labels of each symbol is with respect to the cluster centroids. Now, let's split this up into real and imaginary components and add some noise.
x = real(test_sequence).*randn(1, 10);
y = imag(test_sequence).*randn(1, 10);
This noise... let's say... this was added as you sent this through a communication channel. Now that we have our real and imaginary components, let's figure out how this sequence was classified as. We would use x and y and determine the cluster that each point belongs to:
labels = zeros(1, 10);
for idx = 1 : 10
[dummy,labels(idx)] = min(sum(bsxfun(#minus, [x(idx), y(idx)], centroids).^2, 2));
end
labels will contain how your clustering mechanism classified each point as. I got:
labels =
1 4 3 1 4 1 2 2 2 3
Similarly, this is the labelling that was assigned to your test sequence before transmission:
gt_labels =
4 3 2 1 1 2 2 1 1 1
As such, the BER (bit error rate) is simply counting the number of mismatches and dividing by the total sequence. You would multiply this by 100% to get this in percentage instead of a proportion. Therefore:
BER = sum(labels ~= gt_labels) / 10 * 100;
BER =
80
As such, we have a BER of 80%.... not very good!
This should be enough to get you started. Hope this helps!

How to compare two unsorted lists in Matlab?

I have two lists of 2-dimensional points given as M x 2 - and N x 2 - matrices, respectively, with M and N possibly being very large.
What is the fastest way to determine how many of them are equal?

I am not sure whether you want to count repetitive entries, but if not you could use intersect or some quite intuitive algorithm based on sorting (see below). I would not prefer a nested-loop version...
function test_compareVecs()
%% create some random data
N = 31415;
M1 = 100000;
M2 = 200000;
vec = rand(N,2);
v1 = [rand(M1-N,2); vec];
v2 = [rand(M2-N,2); vec];
v1 = v1(randperm(M1),:);
v2 = v2(randperm(M2),:);
%% intersect
disp('intersect:');
tic
s = size(intersect(v1,v2,'rows'),1);
toc;
s
%% alternative approach
disp('alternative approach:');
tic;
s = compareVecs(v1,v2);
toc;
s
end
function s = compareVecs(v1,v2)
%% create help vector
help_vec = [[v1,zeros(size(v1,1),1)]; ...
[v2,ones(size(v2,1),1)]];
%% sort by first column
% note: for some reason "sortrows(help_vec,1)" is slower
hash_vec = help_vec(:,1); % dummy hash
[~,sidx] = sort(hash_vec);
help_vec = help_vec(sidx,:);
%% diff + compare
help_vec = diff(help_vec);
s = sum(help_vec(:,1) == 0 & ...
help_vec(:,2) == 0 & ...
help_vec(:,3) ~= 0);
end
Result
intersect:
Elapsed time is 0.145717 seconds.
s = 31415
alternative approach:
Elapsed time is 0.048084 seconds.
s = 31415

Compute all pair-wise distances with pdist2 and then count pairs with zero distance. If the coordinates are float values, you may want to use a tolerance instead of comparing against zero:
%// Data:
M = 10;
N = 8;
listM = randi(10,M,2)-1;
listN = randi(10,N,2)-1;
tol = 1e-6;
%// Distance matrix:
d = pdist2(listM, listN);
%// Count:
count = sum(d(:)<tol);

This should work irrespective of the order of the points in each list, or their lengths. It is a hash-table/dictionary solution that should be fast but with memory demand linear with the lengths of the lists. Please, note that the syntax below may not be perfect, but a quick reference to the main data structures mentioned should make corrections trivial.
(1) populate a dictionary-like containers.Map, in a way that the key is a unique function of the points, e.g. num2str(M(i,1))'-'num2str(M(i,2)).
(2) Then, go over all elements of the second list, create the key just as in (1) and check if it exists. If it does, set map(key)=1 else set it to 0. In the end, all the keys consisting of common points will have 1s stored, and the rest will be zeros.
(3) Finalize by summing over the values of the map (something like sum(map.values())) which should give you the total number of unique intersections among the two sets, irrespective of the order these points appear in each list.
OBS: if you don't want to count just unique intersections but all repeated points, in (2), rather than making map(key)=1, add 1 to map(key). The rest is the same.

Extremely large weighted average

I am using 64 bit matlab with 32g of RAM (just so you know).
I have a file (vector) of 1.3 million numbers (integers). I want to make another vector of the same length, where each point is a weighted average of the entire first vector, weighted by the inverse distance from that position (actually it's position ^-0.1, not ^-1, but for example purposes). I can't use matlab's 'filter' function, because it can only average things before the current point, right? To explain more clearly, here's an example of 3 elements
data = [ 2 6 9 ]
weights = [ 1 1/2 1/3; 1/2 1 1/2; 1/3 1/2 1 ]
results=data*weights= [ 8 11.5 12.666 ]
i.e.
8 = 2*1 + 6*1/2 + 9*1/3
11.5 = 2*1/2 + 6*1 + 9*1/2
12.666 = 2*1/3 + 6*1/2 + 9*1
So each point in the new vector is the weighted average of the entire first vector, weighting by 1/(distance from that position+1).
I could just remake the weight vector for each point, then calculate the results vector element by element, but this requires 1.3 million iterations of a for loop, each of which contains 1.3million multiplications. I would rather use straight matrix multiplication, multiplying a 1x1.3mil by a 1.3milx1.3mil, which works in theory, but I can't load a matrix that large.
I am then trying to make the matrix using a shell script and index it in matlab so only the relevant column of the matrix is called at a time, but that is also taking a very long time.
I don't have to do this in matlab, so any advice people have about utilizing such large numbers and getting averages would be appreciated. Since I am using a weight of ^-0.1, and not ^-1, it does not drop off that fast - the millionth point is still weighted at 0.25 compared to the original points weighting of 1, so I can't just cut it off as it gets big either.
Hope this was clear enough?
Here is the code for the answer below (so it can be formatted?):
data = load('/Users/mmanary/Documents/test/insertion.txt');
data=data.';
total=length(data);
x=1:total;
datapad=[zeros(1,total) data];
weights = ([(total+1):-1:2 1:total]).^(-.4);
weights = weights/sum(weights);
Fdata = fft(datapad);
Fweights = fft(weights);
Fresults = Fdata .* Fweights;
results = ifft(Fresults);
results = results(1:total);
plot(x,results)

The only sensible way to do this is with FFT convolution, as underpins the filter function and similar. It is very easy to do manually:
% Simulate some data
n = 10^6;
x = randi(10,1,n);
xpad = [zeros(1,n) x];
% Setup smoothing kernel
k = 1 ./ [(n+1):-1:2 1:n];
% FFT convolution
Fx = fft(xpad);
Fk = fft(k);
Fxk = Fx .* Fk;
xk = ifft(Fxk);
xk = xk(1:n);
Takes less than half a second for n=10^6!

This is probably not the best way to do it, but with lots of memory you could definitely parallelize the process.
You can construct sparse matrices consisting of entries of your original matrix which have value i^(-1) (where i = 1 .. 1.3 million), multiply them with your original vector, and sum all the results together.
So for your example the product would be essentially:
a = rand(3,1);
b1 = [1 0 0;
0 1 0;
0 0 1];
b2 = [0 1 0;
1 0 1;
0 1 0] / 2;
b3 = [0 0 1;
0 0 0;
1 0 0] / 3;
c = sparse(b1) * a + sparse(b2) * a + sparse(b3) * a;
Of course, you wouldn't construct the sparse matrices this way. If you wanted to have less iterations of the inside loop, you could have more than one of the i's in each matrix.
Look into the parfor loop in MATLAB: http://www.mathworks.com/help/toolbox/distcomp/parfor.html

I can't use matlab's 'filter' function, because it can only average
things before the current point, right?
That is not correct. You can always add samples (i.e, adding or removing zeros) from your data or from the filtered data. Since filtering with filter (you can also use conv by the way) is a linear action, it won't change the result (it's like adding and removing zeros, which does nothing, and then filtering. Then linearity allows you to swap the order to add samples -> filter -> remove sample).
Anyway, in your example, you can take the averaging kernel to be:
weights = 1 ./ [3 2 1 2 3]; % this kernel introduces a delay of 2 samples
and then simply:
result = filter(w,1,[data, zeros(1,3)]); % or conv (data, w)
% removing the delay introduced by the kernel
result = result (3:end-1);

You considered only 2 options:
Multiplying 1.3M*1.3M matrix with a vector once or multiplying 2 1.3M vectors 1.3M times.
But you can divide your weight matrix to as many sub-matrices as you wish and do a multiplication of n*1.3M matrix with the vector 1.3M/n times.
I assume that the fastest will be when there will be the smallest number of iterations and n is such that creates the largest sub-matrix that fits in your memory, without making your computer start swapping pages to your hard drive.
with your memory size you should start with n=5000.
you can also make it faster by using parfor (with n divided by the number of processors).

The brute force way will probably work for you, with one minor optimisation in the mix.
The ^-0.1 operations to create the weights will take a lot longer than the + and * operations to compute the weighted-means, but you re-use the weights across all the million weighted-mean operations. The algorithm becomes:
Create a weightings vector with all the weights any computation would need:
weights = (-n:n).^-0.1
For each element in the vector:
Index the relevent portion of the weights vector to consider the current element as the 'centre'.
Perform the weighted-mean with the weights portion and the entire vector. This can be done with a fast vector dot-multiply followed by a scalar division.
The main loop does n^2 additions and subractions. With n equal to 1.3 million that's 3.4 trillion operations. A single core of a modern 3GHz CPU can do say 6 billion additions/multiplications a second, so that comes out to around 10 minutes. Add time for indexing the weights vector and overheads, and I still estimate you could come in under half an hour.