Jaccard index in MATLAB produces wrong results

I have the following matrix
a =
0 10 10 0 0
0 5 5 0 0
1 0 0 50 51
0 0 10 100 100
I compute the Jaccard distances
D = pdist(a,'jaccard');
D =
1.0000 1.0000 0.7500 1.0000 1.0000 1.0000
and finally I put the distances in a matrix
sim = squareform(D)
sim =
0 1.0000 1.0000 0.7500
1.0000 0 1.0000 1.0000
1.0000 1.0000 0 1.0000
0.7500 1.0000 1.0000 0
The jaccard index is computed as "One minus the Jaccard coefficient, which is the percentage of nonzero coordinates that differ." (http://www.mathworks.it/help/stats/pdist.html)
The distance between rows 1 and 4 is correct (0.75), while the distance between rows 1 and 2 should be 0 but is instead 1. It seems that when the Jaccard similarity is 1, MATLAB doesn't execute the 1 - similarity computation. What am I doing wrong?

MATLAB seems right to me.
All of the non-zero numbers in rows 1 and 2 differ (in row 1 they're all 10, in row 2 they're all 5), so rows 1 and 2 should have a distance of 1.
Three out of four of the non-zero numbers in rows 1 and 4 differ (10:0, 10:10, 0:100, 0:100), so rows 1 and 4 should have a distance of 0.75.
There seems to be a lot of disagreement about what thing is the Jaccard "coefficient", the Jaccard "index", the Jaccard "similarity" and the Jaccard "distance", and which is one minus the other. MATLAB's documentation doesn't help, as it's not obvious, in the sentence you quote, whether "which" refers to (what MATLAB is describing as) the Jaccard coefficient, or to one minus the Jaccard coefficient.
In any case, whether or not the terminology used by the MATLAB documentation is correct, the function pdist seems to be giving consistent results, and you can always take one minus whatever it outputs if you want something different.
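For reference, here is a minimal sketch of the computation that pdist with 'jaccard' appears to perform, applied to rows 1 and 2 of a (the variable names nz and d are just for illustration):
x = a(1,:); y = a(2,:);
nz = (x ~= 0) | (y ~= 0);         % coordinates where at least one value is nonzero
d = sum(x(nz) ~= y(nz)) / sum(nz) % fraction of those coordinates that differ -> 1
With y = a(4,:) instead, the same computation gives 3/4 = 0.75, matching the pdist output.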

Related

Distance Transform Behaves Differently Between Matlab and OpenCV

I am currently translating code from Matlab to OpenCV, but I found that the distance transform function behaves differently between Matlab and OpenCV.
Take the simple matrix as an example
bw =
0 0 0 0 0
0 1 0 0 0
0 0 0 0 0
0 0 0 1 0
0 0 0 0 0
The Matlab distance transform assigns to each pixel the distance between that pixel and the nearest nonzero pixel of BW, which makes sense, and I got
1.4142 1.0000 1.4142 2.2361 3.1623
1.0000 0 1.0000 2.0000 2.2361
1.4142 1.0000 1.4142 1.0000 1.4142
2.2361 2.0000 1.0000 0 1.0000
3.1623 2.2361 1.4142 1.0000 1.4142
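(For reference, a result like this presumably comes from a call along the lines of the following; bwdist computes, for each pixel, the Euclidean distance to the nearest nonzero pixel.)
D = bwdist(bw)  % Euclidean distance transform of the binary matrix above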
In OpenCV, I chose DIST_L2 (the simple Euclidean distance). It gives me
1.3692 0.9550 1.3692 2.3242 3.2792
0.9550 0 0.9550 1.9100 2.3242
1.3692 0.9550 1.3692 2.3242 1.3692
2.3242 1.9100 0.9550 0 0.9550
3.2792 2.3242 1.3692 0.9550 1.3692
I don't understand why; it doesn't make sense to me. I realize that OpenCV computes the distance to the nearest zero pixel, so I have already inverted the input matrix.
maskSize – Size of the distance transform mask. It can be 3, 5, or CV_DIST_MASK_PRECISE (the latter option is only supported by the first function).
It looks like the OpenCV distance transform only approximates the Euclidean distance when maskSize is 3 or 5. Setting maskSize to 0 (i.e. CV_DIST_MASK_PRECISE, a value the documentation doesn't mention explicitly) solves the issue.

Normalization of inputs of a feedforward Neural network

Let's say I have an m-by-n matrix of different features of a time series signal (column 1 represents linear regression of the last n samples, column 2 represents the average of the last n samples, column 3 represents the local max values of a different but correlated time series signal, etc.). How should I normalize these inputs? All the inputs fall into different categories, so they have different ranges. One ranges from 0 to 1, another ranges from -5 to 50, etc.
Should I normalize the WHOLE matrix? Or should I normalize each set of inputs one by one individually?
Note: I usually use mapminmax function from MATLAB for the normalization.
You should normalise each vector/column of your matrix individually; they represent different data types and shouldn't be mixed up together.
You could, for example, transpose your matrix so that your 3 different data types are in the rows instead of the columns (mapminmax normalizes along each row) and still use mapminmax:
A = [0 0.1 -5; 0.2 0.3 50; 0.8 0.8 10; 0.7 0.9 20];
A =
0 0.1000 -5.0000
0.2000 0.3000 50.0000
0.8000 0.8000 10.0000
0.7000 0.9000 20.0000
B = mapminmax(A')
B =
-1.0000 -0.5000 1.0000 0.7500
-1.0000 -0.5000 0.7500 1.0000
-1.0000 1.0000 -0.4545 -0.0909
You should normalize each feature independently.
column 1 represents linear regression of the last n samples, column 2 represents the average of the last n samples, column 3 represents the local max values of a different time series but correlated signal, etc
I can't say for sure about your particular problem, but generally, you should normalize each feature independently. So normalize column 1, then column 2 etc.
Should I normalize the WHOLE matrix? Or should I normalize each set of inputs one by one individually?
I'm not sure what you mean here. What is an input? If by that you mean an instance (a row of your matrix), then no, you should not normalize rows individually, but columns.
I don't know how you would do this in Matlab, but I took your question more as a theoretical one than an implementation one.
If you want a range of [0,1] for all the columns, normalized within each column, you can use mapminmax like so (assuming A is the 2D input array) -
out = mapminmax(A.',0,1).'
You can also use bsxfun for the same output, like so -
Aoffsetted = bsxfun(@minus,A,min(A,[],1))
out = bsxfun(@rdivide,Aoffsetted,max(Aoffsetted,[],1))
Sample run -
>> A
A =
3 7 4 2 7
1 3 4 5 7
1 9 7 5 3
8 1 8 6 7
>> mapminmax(A.',0,1).'
ans =
0.28571 0.75 0 0 1
0 0.25 0 0.75 1
0 1 0.75 0.75 0
1 0 1 1 1
>> Aoffsetted = bsxfun(@minus,A,min(A,[],1));
>> bsxfun(@rdivide,Aoffsetted,max(Aoffsetted,[],1))
ans =
0.28571 0.75 0 0 1
0 0.25 0 0.75 1
0 1 0.75 0.75 0
1 0 1 1 1

special point distance transform in matlab

I use MATLAB to calculate the distance transform of a binary image, and I found that bwdist() calculates the distances for all of the points of the image, but I just want to know the distance for one specific point.
For example, I have a binary image like this
image =
1 0 0
0 0 1
0 0 0
bwdist() computes the distance transform for all points:
>> bwdist(image)
ans =
0 1.0000 1.0000
1.0000 1.0000 0
2.0000 1.4142 1.0000
But I just want to compute the distance for the point image(3,2), so the function should give me 1.4142.
Is there any function that can do this?
You can use find to find row and column indices for all 1's, then use pdist2 from Statistics and Machine Learning Toolbox to calculate distances for all 1's from the search point (3,2) and finally choose the minimum of those distances to get the final output. Here's the implementation shown as a sample run -
>> image
image =
1 0 0
0 0 1
0 0 0
>> point
point =
3 2
>> [R,C] = find(image);
>> min(pdist2([R C],point))
ans =
1.4142
If you don't have access to pdist2, you can use bsxfun to replace it like so -
min(sqrt(sum(bsxfun(@minus,[R C],point).^2,2)))
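If you need this for several query points, a small helper along these lines may be handy (a hypothetical function written for this answer, not a toolbox one):
function d = distToNearestOne(bwImage, point)
% Distance from point ([row col]) to the nearest nonzero pixel of bwImage.
[R, C] = find(bwImage);                                 % coordinates of all 1's
d = min(sqrt(sum(bsxfun(@minus, [R C], point).^2, 2))); % smallest Euclidean distance
end
With the image above, distToNearestOne(image, [3 2]) returns 1.4142.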

shape context matching technique

I'm trying to implement the shape context matching algorithm, which is:
1. Calculate the distance of a point to all other points.
2. Normalize the distance by the mean distance.
3. Create a logarithmic distance scale for the normalized distances.
4. Create the distance histogram: iterate over each scale, incrementing bins when the distance falls within that scale; bins with higher numbers describe points closer together.
5. Calculate the angle between all points.
6. Bin the angles, which is slightly different than binning the distances.
7. Matching - cost matrix: calculate the cost of matching each point to every other point.
8. Matching - additional cost terms: surrounding texture difference, tangent angle difference.
9. Matching: find the pairing of points that leads to the least total cost according to equation 3.
I'm now at step 3, and I don't know how to calculate the log distance scale.
Example:
If we have the coordinates of the shape points as:
0.2000 0.5000
0.4000 0.5000
0.3000 0.4000
0.1500 0.3000
0.3000 0.2000
0.4500 0.3000
step(1): Euclidean distance from each point to all others:
0 0.2000 0.1414 0.2062 0.3162 0.3202
0.2000 0 0.1414 0.3202 0.3162 0.2062
0.1414 0.1414 0 0.1803 0.2000 0.1803
0.2062 0.3202 0.1803 0 0.1803 0.3000
0.3162 0.3162 0.2000 0.1803 0 0.1803
0.3202 0.2062 0.1803 0.3000 0.1803 0
step(2): Normalized distances between each point:
0 1.0623 0.7511 1.0949 1.6796 1.7004
1.0623 0 0.7511 1.7004 1.6796 1.0949
0.7511 0.7511 0 0.9575 1.0623 0.9575
1.0949 1.7004 0.9575 0 0.9575 1.5934
1.6796 1.6796 1.0623 0.9575 0 0.9575
1.7004 1.0949 0.9575 1.5934 0.9575 0
I don't know how to create the log distance scale, where:
The log distance scale for normalized distances (closer = more discriminate):
0.1250 0.2500 0.5000 1.0000 2.0000
Can anyone help me?
I have no background in shape context, but here's my thinking:
The teacher wants to measure the distance in a way that fits the needs of shape context. To achieve this:
(step 3) First build a scale (aka a ruler). The scale has 5 bins, namely 0~0.1250, 0.1250~0.2500, 0.2500~0.5000, 0.5000~1.0000, 1.0000~2.0000. The maximum range should cover the maximum value in your normalized distance data (1.7004 here). This scale increases logarithmically (I mean "exponentially"). I still have no idea about "closer = more discriminate".
(step 4) Throw each data value into the different bins, using the scale. The teacher uses an iteration method, which works, but essentially you just need to find some way to categorize the values.
I think you can also achieve it in this way:
>> x = rand(1,10)*2;ceil(log2(ceil(x/.125)))+1,x
ans =
2 1 5 5 5 3 5 4 1 4
x =
0.1517 0.1079 1.0616 1.5583 1.8680 0.2598 1.1376 0.9388 0.0238 0.6742
So effectively you are computing ceil(log2(d/0.125)) + 1 for each distance d, i.e. finding which bin of the scale 0.1250, 0.2500, 0.5000, 1.0000, 2.0000 the value falls into.
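To tie this back to the question's data, here is a minimal sketch (assuming the 6-by-6 matrix of normalized distances from step 2 is stored in a variable Dn, an assumed name) that bins every pairwise distance against that scale and builds the histogram:
scale = 0.125 * 2.^(0:4);               % 0.1250 0.2500 0.5000 1.0000 2.0000
d = Dn(Dn > 0);                         % drop the zero diagonal
binIdx = ceil(log2(ceil(d/0.125))) + 1; % bin of the scale each distance falls into
counts = histc(binIdx, 1:numel(scale))  % the distance histogram over the 5 bins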

Finding generalized eigenvectors numerically in Matlab

I have a matrix such as this example (my actual matrices can be much larger)
A = [-1 -2 -0.5;
0 0.5 0;
0 0 -1];
that has only two linearly independent eigenvectors (the eigenvalue -1 is repeated). I would like to obtain a complete basis with generalized eigenvectors. One way I know how to do this is with Matlab's jordan function in the Symbolic Math Toolbox, but I'd prefer something designed for numeric inputs (indeed, with two outputs, jordan fails for large matrices: "Error in MuPAD command: Similarity matrix too large."). I don't need the Jordan canonical form, which is notoriously unstable in numeric contexts, just a matrix of generalized eigenvectors. Is there a function or combination of functions that automates this in a numerically stable way, or must one use the generic manual method (and how stable is such a procedure)?
NOTE: By "generalized eigenvector," I mean a non-zero vector that can be used to augment the incomplete basis of a so-called defective matrix. I do not mean the eigenvectors that correspond to the eigenvalues obtained from solving the generalized eigenvalue problem using eig or qz (though this latter usage is quite common, I'd say that it's best avoided). Unless someone can correct me, I don't believe the two are the same.
UPDATE 1 – Five months later:
See my answer here for how to obtain generalized eigenvectors symbolically for matrices larger than 82-by-82 (the limit for my test matrix in this question).
I'm still interested in numeric schemes (or in how such schemes might be unstable if they're all related to calculating the Jordan form). I don't wish to blindly implement the linear algebra 101 method that has been marked as a duplicate of this question, as it's not a numerical algorithm but rather a pencil-and-paper method used to evaluate students (I suppose that it could be implemented symbolically, however). If anyone can point me to either an implementation of that scheme or a numerical analysis of it, I'd be interested in that.
UPDATE 2 – Feb. 2015: All of the above is still true as tested in R2014b.
As mentioned in my comments, if your matrix is defective but you know which eigenvector/eigenvalue pairs you want to consider identical given your tolerance, you can proceed as in the example below:
% example matrix A:
A = [1 0 0 0 0;
3 1 0 0 0;
6 3 2 0 0;
10 6 3 2 0;
15 10 6 3 2]
% Produce eigenvalues and eigenvectors (not generalized ones)
[vecs,vals] = eig(A)
This should output:
vecs =
0 0 0 0 0.0000
0 0 0 0.2236 -0.2236
0 0 0.0000 -0.6708 0.6708
0 0.0000 -0.0000 0.6708 -0.6708
1.0000 -1.0000 1.0000 -0.2236 0.2236
vals =
2 0 0 0 0
0 2 0 0 0
0 0 2 0 0
0 0 0 1 0
0 0 0 0 1
Here we see that the first three eigenvectors are almost identical to working precision (up to sign), as are the last two. You must know the structure of your problem and identify the identical eigenvectors of identical eigenvalues. Here the eigenvalues are exactly identical, so we know which ones to consider, and we will assume that the corresponding vectors 1-2-3 are identical, and likewise vectors 4-5. (In practice you will likely check the norm of the differences of the eigenvectors and compare it to your tolerance, as sketched below.)
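A minimal sketch of such a check (the tolerance value and the sameDir helper are assumptions for illustration; eigenvectors returned by eig are only defined up to sign/scale, so compare up to sign):
tol = 1e-8;                                        % assumed tolerance
sameDir = @(u,v) min(norm(u-v), norm(u+v)) < tol;  % equal up to sign?
sameDir(vecs(:,1), vecs(:,2))                      % true: vectors 1 and 2 coincide
sameDir(vecs(:,4), vecs(:,5))                      % true: vectors 4 and 5 coincide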
Now we proceed to compute the generalized eigenvectors. This is ill-conditioned to solve simply with Matlab's \, because (A - lambda*I) is obviously not full rank, so we use pseudoinverses:
genvec21 = pinv(A - vals(1,1)*eye(size(A)))*vecs(:,1);
genvec22 = pinv(A - vals(1,1)*eye(size(A)))*genvec21;
genvec1 = pinv(A - vals(4,4)*eye(size(A)))*vecs(:,4);
Which should give:
genvec21 =
-0.0000
0.0000
-0.0000
0.3333
0
genvec22 =
0.0000
-0.0000
0.1111
-0.2222
0
genvec1 =
0.0745
-0.8832
1.5317
0.6298
-3.5889
These are our other generalized eigenvectors. If we now use them to check that we recover the Jordan normal form like this:
jordanJ = [vecs(:,1) genvec21 genvec22 vecs(:,4) genvec1];
jordanJ^-1*A*jordanJ
We obtain:
ans =
2.0000 1.0000 0.0000 -0.0000 -0.0000
0 2.0000 1.0000 -0.0000 -0.0000
0 0.0000 2.0000 0.0000 -0.0000
0 0.0000 0.0000 1.0000 1.0000
0 0.0000 0.0000 -0.0000 1.0000
This is our Jordan normal form (up to working-precision errors).
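As a quick sanity check (a sketch added here, not part of the original procedure), each vector in a chain should map to the previous one under (A - lambda*I), up to working precision, because the right-hand sides fed to pinv above lie in the range of (A - lambda*I):
norm((A - vals(1,1)*eye(size(A)))*genvec21 - vecs(:,1)) % ~0: genvec21 maps to the eigenvector
norm((A - vals(1,1)*eye(size(A)))*genvec22 - genvec21)  % ~0: genvec22 maps to genvec21
norm((A - vals(4,4)*eye(size(A)))*genvec1  - vecs(:,4)) % ~0: genvec1 maps to the eigenvector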