How to convert distance into probability? - matlab

Can anyone shed some light on my MATLAB program?
I have data from two sensors and I'm doing a kNN classification for each of them separately.
In both cases the training set is a set of vectors, 42 rows in total, like this:
[44 12 53 29 35 30 49;
54 36 58 30 38 24 37;..]
Then I get a sample, e.g. [40 30 50 25 40 25 30], and I want to assign the sample to its closest neighbor.
As a criterion of proximity I use the Euclidean metric, sqrt(sum(Y.^2)), where Y is the element-wise difference between the sample and a training vector. This gives me an array of distances between the sample and each class of the training set.
So, two questions:
Is it possible to convert distance into distribution of probabilities, something like: Class1: 60%, Class 2: 30%, Class 3: 5%, Class 5: 1%, etc.
added: Up to this moment I'm using formula: probability = distance/sum of distances, but I cannot plot a correct cdf or histogram.
This gives me a distribution of sorts, but I see a problem: if the distance is large, for example 700, the closest class will still get the biggest probability, which would be wrong because the sample is too far away to be matched with any of the classes.
If I were able to get two probability density functions, one per sensor, I guess I could take some product of them. Is that possible?
Any help or remark is highly appreciated.

I think there are multiple ways of doing this:
as Adam suggested, use (1/d) / sum(1/d)
use the square, or an even higher power, of the inverse distance, e.g. (1/d^2) / sum(1/d^2). This makes the class probability distribution more skewed. For example, if 1/d gave a 40%/60% prediction, 1/d^2 gives about 31%/69%.
use softmax (https://en.wikipedia.org/wiki/Softmax_function) of the negative distance, i.e. exp(-d) / sum(exp(-d)).
use exp(-d^2/sigma^2) / sum(exp(-d^2/sigma^2)), which imitates Gaussian likelihoods. Sigma could be the average within-cluster distance, or simply set to 1 for all clusters.
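The four options listed above can be compared side by side with a small sketch. The distance values and the value of sigma are made up for illustration; they are not taken from the question.

```matlab
% Distances from one sample to each class (made-up values for illustration).
d = [12 20 35 90];

% 1) Inverse distance, normalized so the values sum to 1.
p_inv = (1 ./ d) / sum(1 ./ d);

% 2) Inverse squared distance: sharper, favors the nearest class more.
p_inv2 = (1 ./ d.^2) / sum(1 ./ d.^2);

% 3) Softmax of the negative distance. Subtracting min(d) avoids
%    underflow and does not change the normalized result.
e = exp(-(d - min(d)));
p_soft = e / sum(e);

% 4) Gaussian-style kernel. sigma is a hypothetical scale parameter,
%    e.g. a typical within-class distance. Subtracting min(d)^2 is again
%    only for numerical stability.
sigma = 20;
g = exp(-(d.^2 - min(d)^2) / sigma^2);
p_gauss = g / sum(g);

% One row per method, one column per class.
disp([p_inv; p_inv2; p_soft; p_gauss]);
```

All four variants normalize over the classes, so a sample that is far from every class still ends up with a clear winner. If that matters, one option is to check min(d) against a threshold before trusting the probabilities.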

You could try inverting your distances to get a likelihood measure, i.e. the bigger the distance x, the smaller its inverse. Then you can normalize, as in probability = (1/distance) / sum(1/distance).

Hi: Have you ever tried with the formula probability = 1-distance assuming that you are using a standardized distance between 0 and 1?

Related

How to take the difference between the resulting and the correct bucket of a one hot vector into account?

Hi I am using tensorflow at my university to try to classify steering angles of a simulation program using only the images the simulation produces.
The Steering angles are values from -1 to 1 and I separated them into 50 "buckets". So the first value of my prediction vector would mean that the predicted steering angle is between -1 and -0.96.
The following shows the classification and optimization functions I am using.
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=prediction, labels=y))
optimizer = tf.train.AdamOptimizer(0.001).minimize(cost)
y is a vector with 49 zeros and a single 1 for the correct bucket. My question now is:
How do I take into account that, if e.g. the correct bucket is at index 25, a prediction of 26 is much better than a prediction of 48?
I didn't post the actual network since it is just a couple of conv2d and maxpool layers with a fully connected layer at the end.
Since you are applying cross-entropy (negative log-likelihood), you are penalizing the system based only on the predicted output and the ground truth.
Say your system predicted scores for all 50 output classes and the highest one was class 25, but your ground truth is class 26. The loss will take the value predicted for class 26 and adapt the parameters to produce a higher value on that output the next time it sees this input. How close bucket 25 is to bucket 26 plays no role.
You could do two basic things:
Change your y and prediction to be scalars in the range -1..1, and make the loss function (y-prediction)**2 or something similar. This is a very different model, but perhaps more reasonable than the one-hot.
Keep the one-hot target and loss, but use y = target*w, where w is a constant matrix that is mostly zeros, with 1s on the diagonal and smaller values on the neighboring diagonals (e.g. y(i) = target(i) * 1. + target(i-1) * .5 + target(i+1) * .5 + ...). It is kind of gross, but it should converge to something reasonable.
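The second suggestion can be sketched as follows. It is written in MATLAB to match the rest of this page, so indices are 1-based; the same matrix could be built in NumPy and passed to TensorFlow as a constant. The neighbor weight of 0.5 comes from the example above, and the final renormalization is an added assumption so that y still sums to 1.

```matlab
nBuckets = 50;
neighborWeight = 0.5;   % value taken from the example formula above

% Tridiagonal smoothing matrix: 1 on the diagonal, neighborWeight beside it.
w = eye(nBuckets) ...
    + neighborWeight * diag(ones(nBuckets - 1, 1),  1) ...
    + neighborWeight * diag(ones(nBuckets - 1, 1), -1);

% One-hot row vector with the correct bucket at index 25.
target = zeros(1, nBuckets);
target(25) = 1;

% Soft target, renormalized (assumption) so it stays a probability vector.
y = target * w;
y = y / sum(y);

% Buckets 23..27: the mass is now spread over the correct bucket
% and its two immediate neighbors.
disp(y(23:27));
```

With this target, a prediction that puts mass on bucket 24 or 26 is penalized less than one that puts it on bucket 48. Wider bands can be added with further calls to diag.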

KNN Classifier for simple digit recognition

Actually, I have an assignment where I need to recognize individual decimal digits as part of a text recognition process. I am given a set of JPEG images of some digits. Each image is 160 x 160 pixels. After checking some resources here I managed to write this code, but:
1) I am not sure whether reading the images and resizing them into matrices to hold them is right.
2) Suppose I have 30 training images for the digits [0-9], three images per digit, and 10 test images, each showing a single digit. How do I calculate the distance between every test image and every training image in a loop? My code for calculating the Euclidean distance gives an output of zero.
3) How do I calculate accuracy using a confusion matrix?
% number of train data
Train = 30;
%number of test data
Test =10;
% to store my images
tData = uint8(zeros(160,160,30));
tTest = uint8(zeros(160,160,10));
for k=1:Test
s1='im-';
s2=num2str(k);
t = strcat('testy/im-',num2str(k),'.jpg');
im=rgb2gray(imread(t));
I=imresize(im,[160 160]);
tTest(:,:,k)=I;
%case testing if it belongs to zero
for l=1:3
ss1='zero-';
ss2=num2str(l);
t1 = strcat('data/zero-',num2str(l),'.jpg');
im1=rgb2gray(imread(t1));
I1=imresize(im1,[160 160]);
tData(:,:,l)=I1;
% Euclidean distance
distance= sqrt(sum(bsxfun(@minus, tData(:,:,k), tTest(:,:,l)).^2, 2));
[d,index] = sort(distance);
%k=3
% index_close(l) = index(l:3);
%x_close = I(index_close,:);
end
end
First of all, I think 10 test images are not enough.
Just use the function below: data_train is your training data, lab_train holds its labels, and data_test is the data to classify. Resize your images to smaller sizes!
I think the default distance measure is Euclidean distance, but you can choose others, such as the city-block metric.
Class = knnclassify(data_test, data_train, lab_train, 11);
fprintf('11-NN Accuracy: %f\n', sum(Class == lab_test')/length(lab_test));
Class = knnclassify(data_test, data_train, lab_train, 1, 'cityblock');
fprintf('1-NN Accuracy (cityblock): %f\n', sum(Class == lab_test')/length(lab_test));
OK, now you have the overall accuracy, but this is not a good measure on its own. It is better to calculate the accuracy separately for each class and then take the mean.
You can calculate the accuracy for a specific class (id) like this:
idLocations = (lab_test == id);
NumberOfId = sum(idLocations);
% every true label at idLocations equals id, so compare the predictions to id
NumberOfCorrect = sum(Class(idLocations) == id);
NumberOfCorrect/NumberOfId % accuracy for class id
To answer your questions:
1) Image resizing does affect the accuracy of the whole process.
As you mentioned in your question, your images are already 160 by 160, so imresize will not change them. If an image is much smaller, say 60 x 60, imresize interpolates to increase its spatial dimensions, which may affect the structure and shape of the digit. To cope with this kind of variability, your training data should have many more samples (at least 50 per class), and some pre-processing, such as de-skewing the digit image, should be applied.
2) Euclidean distance is a good measure but not the best for this kind of problem: it weights all directions equally (its level sets are spheres), so it may give the same distance for two different digits. If you are working in MATLAB, beware of variable casting: you are taking a difference, so both operands should be double. With uint8 operands, negative differences saturate to 0, which may be one reason for the wrong distance. In the line distance= sqrt(sum(bsxfun(@minus, tData(:,:,k), tTest(:,:,l)).^2, 2)); you are summing along the second dimension only, so the output is a 160 x 1 column vector of per-row sums rather than a single distance. I think it should be: distance= sqrt(sum(sum(bsxfun(@minus, double(tData(:,:,k)), double(tTest(:,:,l))).^2, 2))); I have added one more sum to total the squared differences over the whole matrix, plus the casts to double. Try it and see whether it helps.
3) To check the accuracy of your classifier precisely you need a large training dataset. The confusion matrix is usually built during cross-validation, where you split your labelled data into training samples and testing samples, so you know the true classes in both. Run the classification and prepare a num_classes x num_classes matrix (10 x 10 in your case), where rows represent actual classes and columns represent predictions. Take a sample from the test set and predict its class. If the classifier predicts 5 and the sample's actual class is also 5, add 1 to confusion_matrix(5,5); if the classifier predicted 3, add 1 to confusion_matrix(5,3). Finally, sum the diagonal elements of confusion_matrix and divide by the total number of test samples. The result is the accuracy of your classifier.
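Points 2) and 3) can be put together in a small sketch that uses only base MATLAB. The tiny 2 x 2 "images" and their labels are made up so the example runs without any files; with real data the arrays would be 160 x 160 x N. It uses 1-NN for brevity.

```matlab
% Toy stand-ins for images: three 2x2 uint8 training images with labels.
train = uint8(cat(3, [10 10; 10 10], [200 200; 200 200], [90 100; 110 120]));
trainLabels = [1 2 3];

% Three test images; the third is labelled 3 but looks like class 2.
test = uint8(cat(3, [12 9; 11 10], [190 210; 205 199], [180 220; 200 210]));
testLabels = [1 2 3];

numClasses = 3;
confusion = zeros(numClasses);

for k = 1:size(test, 3)
    dists = zeros(1, size(train, 3));
    for l = 1:size(train, 3)
        % Cast to double first: uint8 subtraction saturates at 0.
        diffImg = double(train(:, :, l)) - double(test(:, :, k));
        % One scalar Euclidean distance over all pixels.
        dists(l) = sqrt(sum(diffImg(:).^2));
    end
    [~, nearest] = min(dists);
    predicted = trainLabels(nearest);
    % Rows are actual classes, columns are predicted classes.
    confusion(testLabels(k), predicted) = confusion(testLabels(k), predicted) + 1;
end

% Overall accuracy: correct predictions (diagonal) over all test samples.
accuracy = trace(confusion) / sum(confusion(:));
disp(confusion);
fprintf('Accuracy: %.4f\n', accuracy);
```

For digits 0 to 9, the labels would need to be shifted by one (digit + 1) before being used as matrix indices, since MATLAB indexing starts at 1.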
P.S. Try to have at least 50 samples per class, and during cross-validation divide the data in a 90:10 ratio, where 90% of the samples are used for training and the remaining 10% for testing the classifier.
Hope this helps.
feel free to share your thoughts.
Thank You

How can I find the difference between two plots with a dimensional mismatch?

I have a question that I don't know if there is a solution off the bat.
Here it goes,
I have two data sets, plotted on the same figure. I need to find their difference, simple so far...
The problem is that matrix A has, say, 1000 data points while the second (matrix B) has 580. How can I find the difference between the two graphs when there is a dimensional mismatch between them?
One way I thought of is artificially inflating matrix B to 1000 data points while keeping the trend of the plot the same. Would this be possible, and if so, how?
for example:
A=[1 45 33 4 1009 ];
B=[1 22 33 44 55 66 77 88 99 1010];
Ya=A.*20+4;
Yb=B./10+3;
C=abs(B - A)
plot(A,Ya,'r',B,Yb)
xlim([-100 1000])
grid on
hold on
plot(length(B),C)
One way to do it is to resample the 580 element vector to 1000 samples. Use matlab resample (requires the Signal Processing Toolbox, I believe) for this:
x = randn(580,1);
y = randn(1000,1);
xr = resample(x, 50,29); % 50/29 = 1000/580 is the resampling ratio
You should then be able to compare the two data vectors.
There are two ways that I can think of:
1- Matching the size:
Generating more data for the matrix with lower number of elements (using interpolation, etc.)
Removing some data from the matrix with higher number of elements (i.e. outlier removal)
2- Comparing the matrices with their properties.
For instance, you can calculate the mean and the covariance of one matrix and compare them to those of the other. Options include cov, mean, median, std, var, xcorr, and xcov.
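Option 1 (matching the size by interpolation) can be sketched with interp1, which is part of base MATLAB. The two curves below are made up for illustration, and the sketch assumes both data sets come with their own x values covering the same range.

```matlab
% Two curves sampled on different grids (made-up data for illustration).
xa = linspace(0, 10, 1000);
ya = sin(xa);
xb = linspace(0, 10, 580);
yb = sin(xb) + 0.1;

% Interpolate the shorter curve onto the x values of the longer one.
% Points of xa outside the range of xb would come back as NaN.
yb_on_xa = interp1(xb, yb, xa, 'linear');

% Both vectors now have 1000 elements and can be subtracted pointwise.
C = abs(ya - yb_on_xa);

fprintf('Largest difference between the curves: %.4f\n', max(C));
```

After this step, plot(xa, C) shows the difference on the same axis as the original curves. Going the other way, interpolating A onto the 580 points of B, works the same and avoids inventing samples.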

Finding Probability of Gaussian Distribution Using Matlab

The original question was to model lightbulbs that are used 24/7, where a bulb usually lasts 25 days. A box contains 12 bulbs. What is the probability that the box will last longer than a year?
I had to use MATLAB to model a Gaussian curve based on an exponential variable.
The code below generates a Gaussian model with mean = 300 and std = sqrt(12)*25.
The reason I used so many different variables and added them up is that I was supposed to be demonstrating the central limit theorem. The Gaussian curve represents the probability of a box of bulbs lasting a given number of days, where 300 is the average number of days a box will last.
I am having trouble using the Gaussian I generated to find the probability for days > 365. The statement 1-normcdf(365,300, sqrt(12)*25) was an attempt to figure out the expected value of that probability, which I got as 0.2265. Any tips on how to find the probability for days > 365 based on the Gaussian I generated would be greatly appreciated.
Thank you!!!
clear all
samp_num=10000000;
param=1/25;
a=-log(rand(1,samp_num))/param;
b=-log(rand(1,samp_num))/param;
c=-log(rand(1,samp_num))/param;
d=-log(rand(1,samp_num))/param;
e=-log(rand(1,samp_num))/param;
f=-log(rand(1,samp_num))/param;
g=-log(rand(1,samp_num))/param;
h=-log(rand(1,samp_num))/param;
i=-log(rand(1,samp_num))/param;
j=-log(rand(1,samp_num))/param;
k=-log(rand(1,samp_num))/param;
l=-log(rand(1,samp_num))/param;
x=a+b+c+d+e+f+g+h+i+j+k+l;
mean_x=mean(x);
std_x=std(x);
bin_sizex=.01*10/param;
binsx=[0:bin_sizex:800];
u=hist(x,binsx);
u1=u/samp_num;
1-normcdf(365,300, sqrt(12)*25)
bar(binsx,u1)
legend(['mean=',num2str(mean_x),'std=',num2str(std_x)]);
[f, y] = ecdf(x) will create an empirical CDF for the data in x. Find the value of f where y first exceeds 365; 1 minus that value is the probability you are after.
Generate N replicates of x, where N should be several thousand or tens of thousands. Then p-hat = count(x > 365) / N, and has a standard error of sqrt[p-hat * (1 - p-hat) / N]. The larger the number of replications is, the smaller the margin of error will be for the estimate.
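The replicate-and-count estimate described above can be sketched in base MATLAB. N = 100000 is an arbitrary choice, and the exponential lifetimes are generated with the same -log(rand)/param trick the question uses.

```matlab
N = 100000;          % number of simulated boxes (arbitrary choice)
meanLife = 25;       % mean bulb lifetime in days
bulbsPerBox = 12;

% Each column is one box; each entry is one exponential bulb lifetime.
lifetimes = -meanLife * log(rand(bulbsPerBox, N));
boxLife = sum(lifetimes, 1);

% Proportion of boxes lasting more than a year, with its standard error.
p_hat = mean(boxLife > 365);
se = sqrt(p_hat * (1 - p_hat) / N);
ci = [p_hat - 1.96 * se, p_hat + 1.96 * se];

fprintf('p_hat = %.4f, 95%% CI = [%.4f, %.4f]\n', p_hat, ci(1), ci(2));
```

Building one 12 x N matrix replaces the twelve separate variables a to l, and the estimate comes straight from the simulated sums instead of from a fitted Gaussian.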
When I did this in JMP with N=10,000 I ended up with [0.2039, 0.2199] as a 95% CI for the true proportion of the time that a box of bulbs lasts more than a year. The discrepancy with your value of 0.2265, along with a histogram of the 10,000 outcomes, indicates that the actual distribution is still somewhat skewed. In other words, using a CLT approximation for the sum of 12 exponentials is going to give answers that are slightly off.
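The size of the CLT error can also be checked without simulation, because a sum of 12 independent exponentials with the same mean follows an Erlang (gamma) distribution whose tail probability has a closed form. The sketch below is an addition, not part of the original answer, and uses erfc in place of normcdf so that no toolbox is needed.

```matlab
n = 12;              % bulbs per box
meanLife = 25;       % mean bulb lifetime in days
t = 365;

% Survival function of an Erlang(n, rate) sum:
% P(X > t) = exp(-rate*t) * sum_{k=0}^{n-1} (rate*t)^k / k!
rt = t / meanLife;
k = 0:(n - 1);
p_exact = exp(-rt) * sum(rt.^k ./ factorial(k));

% Normal (CLT) approximation with the same mean and standard deviation.
mu = n * meanLife;
sd = sqrt(n) * meanLife;
p_clt = 0.5 * erfc((t - mu) / (sd * sqrt(2)));

fprintf('exact = %.4f, CLT approximation = %.4f\n', p_exact, p_clt);
```

The exact value comes out at about 0.21, inside the simulated confidence interval and below the normal approximation of 0.2265.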

K-means distance parameters in Matlab - Varying results

I have a matrix I am working with which is 300x5000, and I wanted to test which distance parameter is the most effective. I got the following results:
'Sqeuclidean' = 17 iterations, total sum of distances = 25175.4
'Correlation' = 9 iterations, total sum of distances = 32.7
'Cityblock' = 34 iterations, total sum of distances = 105175.3
'Cosine' = 11 iterations, total sum of distances = 11.9
I am having trouble understanding why the results vary so much and how to choose the most effective distance parameter. Any advice?
EDIT:
I have 300 features with 5000 instances of each feature.
the function looks like this:
[idx, ctrs, sumd, d] = kmeans(matrix, 25, 'distance', 'cityblock', 'replicate', 20)
with the distance parameter swapped for each run. The features were already normalized.
Thanks!
As slayton commented, you really need to define what 'best' means for your particular problem.
The only thing that matters is how well the distance function clusters the data. In general, clustering is highly-dependent on the distance function. The two metrics that you've selected (number of iterations, sum of distances) are pretty irrelevant to how well the clustering works.
You need to know what you're trying to achieve with clustering, and you need some metric for how well you've achieved that goal. If there's an objective metric to determine how good your clusters are, then use that. Often, the metric is fuzzier: does this look right when I visualize the data. Look at your data, and look at how each distance function clusters the data. Select the distance function that seems to generate the best clusters. Do this for several subsets of your data, to make sure that your intuition is correct. You should also try to understand the result that each distance function gives you.
Lastly, some problems lend themselves to a particular distance function. If your problem has spatial features, then a Euclidean (geometric) distance is often a natural choice. Other distance functions will perform better for different problems.
Distance values from different distance functions, data sets, or normalizations are generally not comparable. Simple example from reality: measure distances in "meter" or in "inch", and you get very different results. The result in meters will not be better just because it is measured on a different scale. So you must not compare the variances of different results.
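A tiny sketch makes the point concrete. The points, centers, and cluster assignment are made up; the clustering is held fixed, and only the distance function or the unit changes.

```matlab
% Four 2-D points and two cluster centers, coordinates in meters
% (made-up data for illustration).
X = [0 0; 1 0; 10 10; 11 10];
centers = [0.5 0; 10.5 10];
idx = [1 1 2 2];                  % fixed cluster assignment

D = X - centers(idx, :);          % offset of each point from its center

sumSqEuclid = sum(sum(D.^2));     % total under squared Euclidean distance
sumCityblock = sum(sum(abs(D)));  % total under city-block distance

% Same data, same clustering, but measured in inches.
m2in = 39.37;
sumSqEuclidIn = sum(sum((m2in * D).^2));

fprintf('sqeuclidean (m):  %g\n', sumSqEuclid);
fprintf('cityblock (m):    %g\n', sumCityblock);
fprintf('sqeuclidean (in): %g\n', sumSqEuclidIn);
```

The three totals describe exactly the same clusters, so a smaller number in one row says nothing about cluster quality relative to another row.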
Notice that k-means is meant to be used with Euclidean distance only, and may not converge with other distance functions. IMHO, L_p norms should be fine, and on TF-IDF maybe also cosine, but I do not know a proof for that.
Oh, and k-means works really badly with high-dimensional data. It is meant for low dimensionality.