Interpretation of MATLAB's NaiveBayses 'posterior' function - matlab

After we created a Naive Bayes classifier object nb (say, with multivariate multinomial (mvmn) distribution), we can call posterior function on testing data using nb object. This function has 3 output parameters:
[post,cpre,logp] = posterior(nb,test)
I understand how post is computed and the meaning of that, also cpre is the predicted class, based on the maximum over posterior probabilities for each class.
The question is about logp. It is clear how it is computed (logarithm of the PDF of each pattern in test), but I don't understand the meaning of this measure and how it can be used in the context of Naive Bayes procedure. Any light on this is very much appreciated.
Thanks.

The logp you are referring to is the log likelihood, which is one way to measure how good a model is fitting. We use log probabilities to prevent computers from underflowing on very small floating-point numbers, and also because adding is faster than multiplying.
If you learned your classifier several times with different starting points, you would get different results because the likelihood function is not log-concave, meaning there are local maxima that you would get stuck in. If you computed the likelihood of the posterior on your original data you would get the likelihood of the model. Although the likelihood gives you a good measure of how one set of parameters fits compared to another, you need to be careful that you're not overfitting.
In your case, you are computing the likelihood on some unobserved (test) data, which gives you an idea of how well your learned classifier is fitting on the data. If you were trying to learn this model based on the test set, you would pick the parameters based on the highest test likelihood; however in general when you're doing this it's better to use a validation set. What you are doing here is computing predictive likelihood.
Computing the log likelihood is not limited to Naive Bayes classifiers and can in fact be computed for any Bayesian model (gaussian mixture, latent dirichlet allocation, etc).

Related

Poorly calibrated probabilities but good classification in confusion matrix

I have an imbalanced data set. My goal is to balance sensitivity and specificity via the confusion matrix. I used glmnet in r with class weights. The model does well at balancing the sensitivity/specificity, but I looked at the calibration plot, and the probabilities are not well calibrated. I have read about calibrating probabilities, but I am wondering if it matters if my goal is to produce class predictions. If it does matter, I have not found a way to calibrate the probabilities when using caret::train().
This topic has been widely discussed, especially in some answers by Stephan Kolassa. I will try to summarize the main take-home messages for your specific question.
From a pure statistical point of view your interest should be on producing as output a probability for each class of any new data instance. As you deal with unbalanced data such probabilities can be small which however - as long as they are correct - is not an issue. Of course, some models can give you poor estimates of the class probabilities. In such cases, the calibration allows you to better calibrate the probabilities obtained from a given model. This means that whenever you estimate for a new observation a probability p of belonging to the target class, then p is indeed its true probability to be of that class.
If you are able to obtain a good probability estimator, then balancing sensitivity or specificity is not part of the statistical part of your problem, but rather of the decision component. Such the final decision will likely need to use some kind of threshold. Depending on the costs of type I and II errors, the cost-optimal threshold might change; however, an optimal decision might also include more than one threshold.
Ultimately, you really have to be careful about which is the specific need of the end-user of your model, because this is what is going to determine the best way of taking decisions using it.

How to deal with the randomness of NN training process?

Consider the training process of deep FF neural network using mini-batch gradient descent. As far as I understand, at each epoch of the training we have different random set of mini-batches. Then iterating over all mini batches and computing the gradients of the NN parameters we will get random gradients at each iteration and, therefore, random directions for the model parameters to minimize the cost function. Let's imagine we fixed the hyperparameters of the training algorithm and started the training process again and again, then we would end up with models, which completely differs from each other, because in those trainings the changes of model parameters were different.
1) Is it always the case when we use such random based training algorithms?
2) If it is so, where is the guaranty that training the NN one more time with the best hyperparameters found during the previous trainings and validations will yield us the best model again?
3) Is it possible to find such hyperparameters, which will always yield the best models?
Neural Network are solving a optimization problem, As long as it is computing a gradient in right direction but can be random, it doesn't hurt its objective to generalize over data. It can stuck in some local optima. But there are many good methods like Adam, RMSProp, momentum based etc, by which it can accomplish its objective.
Another reason, when you say mini-batch, there is at least some sample by which it can generalize over those sample, there can be fluctuation in the error rate, and but at least it can give us a local solution.
Even, at each random sampling, these mini-batch have different-2 sample, which helps in generalize well over the complete distribution.
For hyperparameter selection, you need to do tuning and validate result on unseen data, there is no straight forward method to choose these.

MATLAB computing Bayesian Information Criterion with the fit.m results

I'm trying to compute the Bayesian with results from fit.m
According to the Wikipedia, log-likelihood can be approximated (when noise is ~N(0,sigma^2)) as:
L = -(n/2)*log(2*pi*sigma^2) - (rss(2*sigma^2))
with n as the number of samples, k as the number of free parameters, and rss as residual sum of squares. And BIC is defined as:
-2*L + k*log(n)
But this is a bit different from the fitglm.m result even for simple polynomial models and the discrepancy seems to increase when higher order terms are used.
Because I want to fit Gaussian models and compute BICs of them, I cannot just use fitglm.m Or, is there any other way to write Gaussian model with the Wilkinson notation? I'm not familiar with the notation, so I don't know if it's possible.
I'm not 100% sure this is your issue, but I think your definition of BIC may be misunderstood.
The Bayesian Information Criterion (BIC) is an approximation to the log of the evidence, and is defined as:
where
is the data,
is the number of adaptive parameters of your model,
is the data size, and most importantly,
is the maximimum a posteriori estimate for your model / parameter set.
Compare for instance with the much simpler Akaike Information Criterion (AIK):
which relies on the usually simpler to obtain maximum likelihood estimate
of the model instead.
Your
is simply a parameter, which is subject to estimation. If the
you're using here is derived from the sample variance, for instance, then that simply corresponds to the
estimate, and not the
one.
So, your discrepancy may simply derive from the builtin function using the 'correct' estimate and you using the wrong one in your 'by-hand' calculations of the BIC.

K nearest neighbour validation performance

I am using knn to do classification for a telecom problem. I splitted my data into 70% training and 30% validation. While the knn classifier is able to catch over 80% in 2 deciles in training, its performance in validation sample is as good as random 45 degree line. I am surprised how does KNN work that the model performance in training and validation are so different.
Any pointers ?
Reasonable pointers are hardly possible without more details. The behavior of your KNN depends on several aspects:
The parameter K defining the neighbors. If it is set to K=1, for example, you will get no training error at all, this showing that the consideration of training-to-validation-error may not be justified.
The parameter K is often found using cross validation. I would suggest you to do this as well.
The distance metric. Which function are you using, are there different units, length scales, etc.?
The noise of your data, the size of your data ... -- there simply exist data sets which are hard to describe.
By the way: can you tell what kind of data you want to describe, and, if possible, also provide some examples or show some scatter plot (data and your result)?

SVM Classification with Cross Validation

I am new to using Matlab and am trying to follow the example in the Bioinformatics Toolbox documentation (SVM Classification with Cross Validation) to handle a classification problem.
However, I am not able to understand Step 9, which says:
Set up a function that takes an input z=[rbf_sigma,boxconstraint], and returns the cross-validation value of exp(z).
The reason to take exp(z) is twofold:
rbf_sigma and boxconstraint must be positive.
You should look at points spaced approximately exponentially apart.
This function handle computes the cross validation at parameters
exp([rbf_sigma,boxconstraint]):
minfn = #(z)crossval('mcr',cdata,grp,'Predfun', ...
#(xtrain,ytrain,xtest)crossfun(xtrain,ytrain,...
xtest,exp(z(1)),exp(z(2))),'partition',c);
What is the function that I should be implementing here? Is it exp or minfn? I will appreciate if you can give me the code for this section. Thanks.
I will like to know what does it mean when it says exp([rbf_sigma,boxconstraint])
rbf_sigma: The svm is using a gaussian kernel, the rbf_sigma set the standard deviation (~size) of the kernel. To understand how kernels work, the SVM is putting the kernel around every sample (so that you have a gaussian around every sample). Then the kernels are added up (sumed) for the samples of each category/type. At each point the type which sum is higher would be the "winner". For example if type A has a higher sum of these kernels at point X, then if you have a new datum to classify in point X, it will be classified as type A. (there are other configuration parameters that may change the actual threshold where a category is selected over another)
Fig. Analyze this figure from the webpage you gave us. You can see how by adding up the gaussian kernels on the red samples "sumA", and on the green samples "sumB"; it is logical that sumA>sumB in the center part of the figure. It is also logical that sumB>sumA in the outer part of the image.
boxconstraint: it is a cost/penalty over miss-classified data. During the training stage of the classifier, where you use the training data to adjust the SVM parameters, the training algorithm is using an error function to decide how to optimize the SVM parameters in an iterative fashion. The cost for a miss-classified sample is proportional to how far it is from the boundary where it would have been classified correctly. In the figure that I am attaching the boundary is the inner blue circumference.
Taking into account BGreene indications and from what I understand of the tutorial:
In the tutorial they advice to try values for rbf_sigma and boxconstraint that are exponentially apart. This means that you should compare values like {0.2, 2, 20, ...} (note that this is {2*10^(i-2), i=1,2,3,...}), and NOT like {0.2, 0.3, 0.4, 0.5} (which would be linearly apart). They advice this to try a wide range of values first. You can further optimize later FROM the first optimum that you obtained before.
The command "[searchmin fval] = fminsearch(minfn,randn(2,1),opts)" will give you back the optimum values for rbf_sigma and boxconstraint. Probably you have to use exp(z) because it affects how fminsearch increments the values of z(1) and z(2) during the search for the optimum value. I suppose that when you put exp(z(1)) in the definition of #minfn, then fminsearch will take 'exponentially' big steps.
In machine learning, always try to understand that there are three subsets in your data: training data, cross-validation data, and test data. The training set is used to optimize the parameters of the SVM classifier for EACH value of rbf_sigma and boxconstraint. Then the cross validation set is used to select the optimum value of the parameters rbf_sigma and boxconstraint. And finally the test data is used to obtain an idea of the performance of your classifier (the efficiency of the classifier is determined upon the test set).
So, if you start with 10000 samples you may divide the data for example as training(50%), cross-validation(25%), test(25%). So that you will sample randomly 5000 samples for the training set, then 2500 samples from the 5000 remaining samples for the cross-validation set, and the rest of samples (that is 2500) would be separated for the test set.
I hope that I could clarify your doubts. By the way, if you are interested in the optimization of the parameters of classifiers and machine learning algorithms I strongly suggest that you follow this free course -> www.ml-class.org (it is awesome, really).
You need to implement a function called crossfun (see example).
The function handle minfn is passed to fminsearch to be minimized.
exp([rbf_sigma,boxconstraint]) is the quantity being optimized to minimize classification error.
There are a number of functions nested within this function handle:
- crossval is producing the classification error based on cross validation using partition c
- crossfun - classifies data using an SVM
- fminsearch - optimizes SVM hyperparameters to minimize classification error
Hope this helps