How do I know the confidence level of my correct probability score? - matlab

I have a writer recognition system that gives back an NLL (Negative Log-Likelihood) score for a test sample against every trained model. For example, if there are thirteen models to compare the sample against, the NLL output will look like this:
15885.1881156907 17948.1931699086 17205.1548161452 16846.8936368077 20798.8048757930 18153.8179076007 18972.6746781821 17398.9047592641 19292.8326540969 22559.3178790489 17315.0994094185 19471.9518308519 18867.2297851016
where each column represents the score of that sample against one model: column 1 gives the score against model 1, and so on.
This test sample was written by the writer of model 1, so the first column should have the minimum value for a correct prediction.
The output shown here gives the desired prediction, since the value in column 1 is the minimum.
When I presented my results I was asked how confident I was about the scores or the predicted values, and to provide a confidence level for each score.
I did some reading after this and found posts about the 95% confidence interval, which shows up in every result of my Google query, but it does not appear to be what I need.
The reason I need this: suppose for a test sample I have scores from two models. Then, using the confidence level, I am supposed to know which score to pick.
For example, for the same test sample, the scores from another model are:
124494.535128967 129586.451168849 126269.733526396 129579.895935672 128582.387405272 125984.657455834 127486.755531507 125162.136816278 129790.811437270 135902.112799503 126599.346536290 136223.382395325 126182.202727967
Both predict correctly, since in both cases the score in column 1 is the minimum. But again, how do I find the confidence level of my score?
Would appreciate any guidance here.

To my knowledge, you cannot evaluate a confidence level for just one value.
Suppose you store your results in a matrix where each column corresponds to a model and each row corresponds to an example (or observation). You can evaluate the confidence for every single model by using all the predicted results from that model (i.e. you can evaluate a confidence interval for any column in your matrix) according to the following procedure:
Evaluate the mean value of the column, let's call this µ
Evaluate the standard deviation of the column, let's call this σ
Evaluate the standard error of the mean as ε = σ/sqrt(N), where N is the number of samples (rows)
The lower bound of the (roughly 95%) confidence interval is given by µ-2ε, whereas the upper bound is given by µ+2ε. By straightforward subtraction you can find the width of this confidence interval: the closer it is to zero, the more accurate your measurement (a minimal MATLAB sketch follows below).
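Assuming your NLL scores are stored in a matrix called scores, with one row per test sample and one column per model (the variable name is just for illustration), the procedure above could look like this:
% scores: N-by-M matrix of NLL values, one row per test sample, one column per model
N     = size(scores, 1);        % number of samples (rows)
mu    = mean(scores);           % per-column means (1-by-M)
sigma = std(scores);            % per-column standard deviations (1-by-M)
eps_m = sigma ./ sqrt(N);       % standard error of the mean for each column

ci_lower = mu - 2*eps_m;        % lower bound of the ~95% confidence interval
ci_upper = mu + 2*eps_m;        % upper bound
ci_width = ci_upper - ci_lower  % the smaller this is, the more accurate the estimate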
Hope this is what you're looking for.

Related

Pearson correlation coefficient

This question of mine is not tightly related to Matlab, but is relevant to it:
I'm looking for a way to fill in the matrix [[a,b,c],[d,e,f]] in a few nontrivial ways so that as many entries as possible in
corrcoef([a,b,c],[d,e,f])
are zero. My attempts yield NaN result in most cases.
Given the current comments, you are trying to understand how two series of random draws from two distributions can have zero correlation. Specifically, exercise 4.6.9 to which you refer mentions draws from two normal distributions.
An issue with your approach is that you are hoping to derive a link between a theoretical property and experimentation, in this case using Matlab. And, as you seem to have noticed, unless you are looking at specific degenerate cases, your experimentation will fail. That is because, although the true correlation parameter rho in the exercise might be zero, a sample of random draws will almost always have some non-zero correlation. Here is an illustration; as you'll notice if you run it, the sample correlations span the whole spectrum between -1 and 1 even though their average is close to zero (as it should be, since both generators are pseudo-uncorrelated):
n = 1e4;                                  % number of repetitions
experiment = nan(n,1);
for i = 1:n
    r = corrcoef(rand(4,1), rand(4,1));   % correlation of two 4-sample uniform draws
    experiment(i) = r(2);                 % off-diagonal entry = sample correlation
end
hist(experiment);
title(sprintf('Average correlation: %.4f', mean(experiment)));
If you look at the definition of the Pearson correlation in Wikipedia, you will see that the only way it can be zero is when the numerator is zero, i.e. E[(X-Xbar)(Y-Ybar)] = 0. Though this might be the case asymptotically, you will be hard-pressed to find a non-degenerate case where this happens in a small sample. Nevertheless, to show that you can derive some such degenerate cases, let's dig a bit further. If you want the expectation of this product to be zero, you could make either the left-hand or the right-hand factor zero whenever the other is non-zero. For one side to be zero, the draw must be exactly equal to the average of the draws. Therefore we can imagine creating such a pair of variables using this technique:
we create two vectors of 4 variables, and alternate which draw will be equal to the average.
let's say we want X to average 1, and Y to average 2, and we make even-indexed draws equal to the average for X and odd-indexed draws equal to the average for Y.
one such generation would be: X=[0,1,2,1], Y=[2,0,2,4], and you can check that corrcoef([0,1,2,1],[2,0,2,4]) does in fact produce an identity matrix. This is because, every time a component of X differs from its average of 1, the corresponding component of Y is equal to its average of 2.
another example, where the average of X is 3 and that of Y is 4, is: X=[3,-5,3,11], Y=[1008,4,-1000,4], etc. (both pairs are verified in the snippet below).
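To check these hand-built examples numerically, a quick verification in MATLAB (the vectors are the ones constructed above):
% Each pair is built so that whenever X deviates from its mean, Y sits exactly
% at its mean, making the sample covariance identically zero.
X1 = [0, 1, 2, 1];   Y1 = [2, 0, 2, 4];
X2 = [3, -5, 3, 11]; Y2 = [1008, 4, -1000, 4];

corrcoef(X1, Y1)   % returns the 2x2 identity matrix (off-diagonal correlation is 0)
corrcoef(X2, Y2)   % same result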
If you wanted to know how to create samples from non-correlated distributions altogether, that would be an entirely different question, though (perhaps) more interesting in terms of understanding statistics. If this is your case, and given that the exercise you mention discusses normal distributions, I would suggest you take a look at generating antithetic variables using the Box-Muller transform.
Happy randomizing!

Query about NaiveBayes Classifier

I am building a text classifier for classifying reviews as positive or negative. I have a query on NaiveBayes classifier formula:
                    P(label) * P(f1|label) * ... * P(fn|label)
P(label|features) = ------------------------------------------
                                  P(features)
As per my understanding, probabilities are multiplied when events occur together, e.g. the probability of A and B occurring together. Is it appropriate to multiply the probabilities in this case? I would appreciate it if someone could explain this formula in a bit more detail. I am trying to do some manual classification (just to check some algorithm-generated classifications which seem a tad off; this will enable me to identify the exact reason for the misclassifications).
In basic probability terms, to calculate p(label|feature1,feature2), we have to multiply the probabilities of feature 1 and feature 2 occurring together. But in this case I am not trying to calculate a standard probability, rather the strength of positivity/negativity of the text. So if I sum up the probabilities, I get a number which can identify the positivity/negativity quotient. This is a bit unconventional, but do you think this can give some good results? The reason is that the sum and the product can be quite different, e.g. 2*2 = 4 but 3*1 = 3.
The class-conditional probabilities P(feature|label) can be multiplied together if the features are conditionally independent given the label. However, it has been found in practice that Naive Bayes still produces good results even when this independence assumption is violated. Thus, you can compute the individual class-conditional probabilities P(feature|label) from simple counting and then multiply them together.
One thing to note is that in some applications these probabilities can be extremely small, resulting in potential numerical underflow. Thus, you may want to add together the logs of the probabilities rather than multiplying the probabilities themselves.
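To make the log-space trick concrete, here is a minimal MATLAB sketch with made-up numbers (the variable names and values are purely illustrative, not part of any toolbox):
% p_label: prior probability P(label)
% p_feat_given_label: class-conditional probabilities P(f_i|label) for the observed features
p_label = 0.6;
p_feat_given_label = [0.01, 0.002, 0.0005];   % can be very small in practice

% Multiplying directly risks numerical underflow when there are many features:
score_product = p_label * prod(p_feat_given_label);

% Summing logs is numerically safer and preserves the ranking between labels:
score_log = log(p_label) + sum(log(p_feat_given_label));

% score_log equals log(score_product) up to floating-point error, so comparing
% log-scores across labels gives the same classification as comparing products.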
I understand if the features were different, like the probability of a person being male given a height of 170 cm and a weight of 200 pounds. Then these probabilities have to be multiplied together, as these conditions (events) occur together. But in the case of text classification this is not valid, as it really doesn't matter if the events occur together. E.g. if the probability of a review being positive given the occurrence of the word "best" is 0.1, and the probability of a review being positive given the occurrence of the word "polite" is 0.05, then the probability of the review being positive given the occurrence of both words (best and polite) is not 0.1*0.05. A more indicative number would be the sum of the probabilities (which would need to be normalized).

Using Linear Prediction Over Time Series to Determine Next K Points

I have a time series of N data points of sunspots and would like to predict, based on a subset of these points, the remaining points in the series and then compare the correctness.
I'm just getting introduced to linear prediction using Matlab, so I have decided to go the route of using the following code segment within a loop, so that every point outside of the training set, up until the end of the given data, has a prediction:
% x is the data; training_set is some subset of x starting from the beginning
% 'unknown' is the number of points to extend the prediction over, starting from
% the end of the training set (i.e. difference in length of training set and data vectors)
% x_pred is set to x initially
p = length(training_set);
coeffs = lpc(training_set, p);
for i = 1:unknown
    nextValue = -coeffs(2:end) * x_pred(end-unknown-1+i:-1:end-unknown-1+i-p+1)';
    x_pred(end-unknown+i) = nextValue;
end
error = norm(x - x_pred)
I have three questions regarding this:
1) Does this appropriately do what I have described? I ask because my error seems rather large (>100) when predicting over only the last 20 points of a dataset that has hundreds of points.
2) Am I interpreting the second argument of lpc correctly? Namely, that it means the 'order' or rather number of points that you want to use in predicting the next point?
3) Is there a more efficient, single-line function in Matlab that I can call to replace the looping and just compute all the necessary predictions for me, given some subset of my overall data as a training set?
I tried looking through the lpc Matlab tutorial, but it didn't seem to do the prediction in the way my needs require. I have also been using How to use aryule() in Matlab to extend a number series? as a reference.
So after much deliberation and experimentation I have found the above approach to be correct, and there does not appear to be any single Matlab function to do the above work. The large errors experienced are reasonable, since I am using a linear prediction algorithm for a problem (i.e. sunspot prediction) that has inherently nonlinear behavior.
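For anyone wanting to reproduce the approach end to end, here is a self-contained sketch on a synthetic signal that, unlike sunspots, is genuinely well suited to linear prediction; the signal, the order p, and the 20-point split are arbitrary choices for illustration:
% Synthetic signal: a noisy sum of two sinusoids
rng(0);
x = sin(2*pi*(1:300)/27) + 0.5*sin(2*pi*(1:300)/11) + 0.05*randn(1,300);

unknown = 20;                          % number of points to predict at the end
training_set = x(1:end-unknown);       % everything before the held-out tail
p = 16;                                % model order (kept well below the training length)

coeffs = lpc(training_set, p);         % fit the linear predictor
x_pred = x;                            % seed with the known data
for i = 1:unknown
    % predict the next point from the p most recent (possibly already predicted) values
    nextValue = -coeffs(2:end) * x_pred(end-unknown-1+i:-1:end-unknown-1+i-p+1)';
    x_pred(end-unknown+i) = nextValue;
end

pred_error = norm(x(end-unknown+1:end) - x_pred(end-unknown+1:end))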
Hope this helps anyone else out there working on something similar.

Higher order moments and shape parameters

I have used the 5th moment of my data as a feature for classification and it gives good results, but I don't know what it measures. Is it a shape parameter like kurtosis and skewness?
I'm using matlab's
m=moment(X,order);
which returns the central sample moment of X specified by the positive integer order.
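To make the relationship to skewness and kurtosis concrete, here is a short MATLAB sketch; the standardized 5th moment at the end is one common way to turn the raw central moment into a scale-free shape parameter (that normalization is an illustration, not something moment does for you):
X = randn(1000, 1);                    % some sample data

m5 = moment(X, 5);                     % central moment: mean((X - mean(X)).^5)
m5_manual = mean((X - mean(X)).^5);    % identical, by definition

% Skewness and kurtosis are the 3rd and 4th central moments divided by
% sigma^3 and sigma^4; the analogous standardized 5th moment is:
sigma = std(X, 1);                     % biased (1/N) standard deviation
standardized_m5 = m5 / sigma^5;        % scale-free shape measure, like skewness/kurtosis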

Understanding the Pearson Correlation Coefficient

As part of the calculations to generate a Pearson Correlation Coefficient, the following computation is performed:
In the second formula: p_a,i is the predicted rating user a would give item i, n is the number of similar users being compared to, and r_u,i is the rating of item i by user u.
What value will be used if user u has not rated this item? Did I misunderstand anything here?
According to the link, earlier calculations in step 1 of the algorithm are over a set of items, indexed 1 to m, where m is the total number of items in common.
Step 3 of the algorithm specifies: "To find a rating prediction for a particular user for a particular item, first select a number of users with the highest, weighted similarity scores with respect to the current user that have rated on the item in question."
These calculations are performed only on the intersection of the different users' sets of rated items. There will be no calculations performed when a user has not rated an item.
It only makes sense to calculate results if both users have rated a movie. Linear regression can be visualised as a method of finding a straight line through a two-dimensional graph, where one variable is plotted on the X axis and the other on the Y axis. Each combination of ratings is represented as a point on a Euclidean plane [u1_rating, u2_rating]. Since you cannot plot points which have only one dimension to them, you'll have to discard those cases.
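As a concrete illustration, here is a minimal MATLAB sketch that computes the Pearson correlation between two users over only the items they have both rated, assuming unrated items are stored as NaN (that representation is just an assumption for this example):
% Ratings of two users over the same catalogue of items; NaN = not rated
u1 = [5, 3, NaN, 1, 4, NaN];
u2 = [4, NaN, 2, 1, 5, 3];

both = ~isnan(u1) & ~isnan(u2);        % items rated by both users
r = corrcoef(u1(both), u2(both));      % Pearson correlation on the intersection only
similarity = r(1, 2)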