How Orange3 computes the cosine value in the Distances widget - algebra

Orange3 says that the cosine of vector No. 1, [1, 0], to vector No. 2, [0, 1], is 1.000, and of No. 1 to vector No. 7, [-1, 0], is 2.000 in the Distance Matrix, as in the capture below. I believe these should be 0.000 and -1.000, since the value is supposed to be a cosine. Or, if it is in radians, they should be 1.5708 (pi/2) and 3.1416 (pi).
It sounds like the range of the cosine is 0.0 to 2.0 in Orange3, but I have never been told this before.
Does anyone have an idea about these cosine results?
Thank you.

What you describe is cosine similarity. Orange computes cosine distance, which is 1 minus the cosine similarity: a similarity of 0 (orthogonal vectors) becomes a distance of 1.000, and a similarity of -1 (opposite vectors) becomes a distance of 2.000, which is exactly what you observed.
The code is here: https://github.com/biolab/orange3/blob/master/Orange/distance/distance.py#L455.
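A minimal sketch of the arithmetic (not Orange's actual implementation, which is in the file linked above), in MATLAB terms:

u = [1 0]; v = [0 1]; w = [-1 0];
cosdist = @(a, b) 1 - dot(a, b) / (norm(a) * norm(b));  % 1 - cosine similarity
cosdist(u, v)   % 1.000: similarity 0, orthogonal vectors
cosdist(u, w)   % 2.000: similarity -1, opposite vectors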

Related

Is there an analytical way to find the distance of a point to its projection on a logarithmic spiral? If not, how can the distance be approximated?

In one of my experiments, I need to measure the Euclidean distance from a point to its projection on a logarithmic spiral. The spiral formula is:
$$x=e^{0.14\theta}\cos(\theta)$$
$$y=e^{0.14\theta}\sin(\theta)$$
$\theta$ is a sequence of 158 numbers ranging from -19 to -12.7 in steps of 0.04. There is a point outside the spiral. Is there any way to find the distance from it to its projection on the spiral?
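I am not aware of a closed form for the foot of the perpendicular on a logarithmic spiral, but numerically the problem reduces to minimizing the squared distance over $\theta$. A sketch in MATLAB, with a made-up query point p:

p  = [0.05, -0.02];                                 % hypothetical point outside the spiral
d2 = @(t) (exp(0.14*t).*cos(t) - p(1)).^2 + ...     % squared distance from p
          (exp(0.14*t).*sin(t) - p(2)).^2;          % to the spiral point at angle t
ts = -19:0.04:-12.7;                                % coarse grid over the given range
[~, i] = min(d2(ts));                               % bracket the global minimum
tStar  = fminbnd(d2, ts(max(i-1, 1)), ts(min(i+1, end)));  % refine locally
dist   = sqrt(d2(tStar))

The coarse grid guards against fminbnd converging to the wrong local minimum; the distance to a spiral can have several.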

ROC curve and speaker recognition

I am using the Euclidean distance for speaker recognition. I want to plot the ROC curve using perfcurve in MATLAB. Since the scores are the resulting Euclidean distances, am I doing it right? Thanks.
Labels = [1 1 1 1 1 1 1 0 0 1];
scores = [18.5573 15.3364 16.8427 19.6381 16.4195 17.3226 18.9520 21.6811 21.4013 22.3880];
[x, y] = perfcurve(Labels, scores, 1);
plot(x, y);
xlabel('False positive rate'); ylabel('True positive rate');
You did it right.
The only sensitive point is that you have to understand the meaning of your scores: is higher better, or lower better? perfcurve assumes that higher scores indicate the positive class.
If lower is better - as it is for distances, where a smaller distance means a closer match - then I would use [x,y]=perfcurve(Labels,-scores,1); instead.
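Putting that together for distance-based scores, a sketch (the fourth output of perfcurve is the AUC):

Labels = [1 1 1 1 1 1 1 0 0 1];
scores = [18.5573 15.3364 16.8427 19.6381 16.4195 17.3226 18.9520 21.6811 21.4013 22.3880];
[x, y, ~, auc] = perfcurve(Labels, -scores, 1);   % negated: smaller distance -> more positive
plot(x, y);
xlabel('False positive rate'); ylabel('True positive rate');
title(sprintf('AUC = %.3f', auc));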

How to convert distance into probability?

Can anyone shed some light on my MATLAB program?
I have data from two sensors and I'm doing a kNN classification for each of them separately.
In both cases the training set looks like a set of vectors, 42 rows in total, like this:
[44 12 53 29 35 30 49;
 54 36 58 30 38 24 37; ...]
Then I get a sample, e.g. [40 30 50 25 40 25 30], and I want to classify the sample to its closest neighbor.
As a criterion of proximity I use the Euclidean metric, sqrt(sum(Y.^2)), where Y is the element-wise difference; this gives me an array of distances between the sample and each class of the training set.
So, two questions:
Is it possible to convert distances into a probability distribution, something like Class 1: 60%, Class 2: 30%, Class 3: 5%, Class 5: 1%, etc.?
Added: up to this moment I have been using the formula probability = distance / sum of distances, but I cannot plot a correct CDF or histogram.
This gives me a distribution of sorts, but I see a problem with it: if the distance is large, say 700, the closest class still gets the biggest probability, but that would be wrong, because such a distance is too big to be a credible match to any of the classes.
If I were able to get two probability density functions, I guess I would then take some product of them. Is that possible?
Any help or remark is highly appreciated.
I think there are multiple ways of doing this (a MATLAB sketch of each follows the list):
as Adam suggested, use (1/d) / sum(1/d);
use the square, or an even higher order, of the inverse distance, e.g. (1/d^2) / sum(1/d^2); this makes the class probability distribution more skewed, e.g. a 40%/60% prediction under 1/d becomes roughly 31%/69% under 1/d^2;
use softmax (https://en.wikipedia.org/wiki/Softmax_function), i.e. normalize the exponentials of the negative distances;
use exp(-d^2/sigma^2) / sum(exp(-d^2/sigma^2)), which imitates Gaussian likelihoods; sigma could be the average within-cluster distance, or simply set to 1 for all clusters.
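A minimal sketch of all four conversions, with made-up distances to four classes:

d = [7 12 25 40];                          % hypothetical distances to each class
pInv   = (1./d)    / sum(1./d);            % inverse distance
pInvSq = (1./d.^2) / sum(1./d.^2);         % squared inverse: more skewed
pSoft  = exp(-d)   / sum(exp(-d));         % softmax of negative distances
sigma  = mean(d);                          % e.g. average within-cluster distance
pGauss = exp(-d.^2/sigma^2) / sum(exp(-d.^2/sigma^2));  % Gaussian-style

Each of these is a valid probability vector (non-negative, sums to 1); they differ only in how sharply they favor the nearest class.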
You could try to invert your distances to get a likelihood measure, i.e. the bigger the distance, the smaller its inverse. Then you can normalize, as in probability = (1/distance) / sum(1/distance).
Hi: have you tried the formula probability = 1 - distance, assuming that you are using a standardized distance between 0 and 1?

Cosine distance range interpretation

I am trying to use the cosine distance in pdist2. I am confused about its output. As far as I know it should be between 0 and 1. Since MATLAB uses 1 - cosine, 1 would be the highest variability and 0 the lowest. However, the output seems to range from 0.5 to 1.5 or thereabouts!
Can somebody please advise me on how to interpret this output?
From help pdist2:
'cosine' - One minus the cosine of the included angle
between observations (treated as vectors)
Since the cosine varies between -1 and 1, the result of pdist2(...,'cosine') varies between 0 and 2. If you want the cosine itself, use 1-pdist2(matrix1,matrix2,'cosine').
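A quick check with two hypothetical row vectors:

A = [1 0]; B = [-1 0];
d = pdist2(A, B, 'cosine')   % 2: the included angle is 180 degrees, so cos = -1
c = 1 - d                    % -1: the cosine itself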

Mahalanobis distance

I would like to apply the Mahalanobis distance method to the data obtained from my observations.
Each observation is a time response of the system. I have 30 observations, each with 14000 points.
I would like to use the MAHAL command in MATLAB, but it tells me that the number of rows in variable X must be greater than the number of columns. The nature of my observations, however, is such that each observation is 1 row with 14000 columns (time points).
I don't know how to overcome this problem.
If anybody knows, please help me.
You can't do that. The Mahalanobis distance of a point x from a group of values with mean mu and covariance matrix Sigma is defined as sqrt((x-mu)*inv(Sigma)*(x-mu)'). If Sigma is not invertible - and it will not be if you have 30 observations and 14000 variables, since the sample covariance then has rank at most 29 - the Mahalanobis distance is not defined.
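An illustration of why, on a smaller case with the same shape problem (n < p), using made-up data:

X = randn(30, 100);   % 30 observations, 100 variables: same n < p problem, smaller scale
S = cov(X);
rank(S)               % at most 29: S is singular, so inv(S) does not exist
% mahal(X(1,:), X) errors out for the same reason: rows of X must exceed columns.

One common workaround is to reduce the dimensionality first (e.g. PCA down to fewer than 30 components) and compute the Mahalanobis distance in that lower-dimensional space.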