EM soft clustering in lingpipe - unsupervised-learning

In Lingpipe's EM tutorial they said that it is possible to run the algorithm with no supervised data:
It is possible to train a classifier in a completely unsupervised fashion by having the initial classifier assign categories at random. Only the number of categories must be fixed. The algorithm is exactly the same, and the result after convergence or the maximum number of epochs is a classifier.
But their class, TradNaiveBayesClassifier required a labeled and an unlabeled corpora to run. How can I modify it to run with no labelled data?

EM is a probabilistic maximal likelihood optimization algorithm. In general, it is applied to unsupervised algorithms (for clustering) such as PLSA, Gaussian Mixture Model.
I think the linepipe doc is saying that you can using random initialization of all data labels (distribution of labels for each data) and then feed into NB to compute the ELBO (evidence lower bound), and then maximize it to get update of parameters.
In short, you will need to use the NB to write up the M step --- updating the model parameters.

Related

quadprog in MATLAB taking lot of time

My goal is to classify an image using multi class linear SVM (with out kernel). I would like to write my own SVM classifier
I am using MATLAB and have trained linear SVM using image sets provided.
I have around 20 classes, 5 images in each class (total of 100 images) and I am using one-versus-all strategy.
Each image is a (112,92) matrix. That means 112*92=10304 values.
I am using quadprog(H,f,A,C) to solve the quadratic equation (y=w'x+b) in the SVM. One call to quadprog returns w vector of size 10304 for one image. That means I have to call quadprog for 100 times.
The problem is one quadprog call takes 35 secs to get executed. That means for 100 images it will take 3500 secs. This might be due to large size of vectors and matrices involved.
I want to reduce the execution time of quadprog. Is there any way to do it?
First of all, when you do classification using SVM, you usually extract a feature (like HOG) of an image, so that the dimensionality of space on which SVM has to operate gets reduced. You are using raw pixel values, which generates a 10304-D vector. That is not good. Use some standard feature.
Secondly, you do not call quadprog 100 times. You call only once. The concept behind the optimization is, you want to find a weight vector w and a bias b such that it satisfies w'x_i+b=y_i for all images (i.e. all x_i). Note that i will go from 1 to the number of examples in your training set, but w and b stay the same. If you wanted a (w,b) that will satisfy only one x, you do not need any fancy optimization. So stack your x in a big matrix of dimension NxD, y will be a vector of Nx1, and then call quadprog. It will take a longer time than 35 seconds, but you do it only once. This is called training an SVM. While testing for a new image, you just extract its feature, and do w*x+b to get its class.
Third, unless you are doing this as an exercise to understand SVMs, quadprog is not the best way to solve this problem. You need to use some other techniques which work well with large data, for example, Sequential Minimal Optimization.
How can one linear hyperplane classify more than 2 classes: For multi-class classification, SVMs usually use two popular approaches:
One-vs-one SVM: For a C class problem, you train C(C-1)/2 classifiers and at test time, you test that many classifiers and choose the class which receives most votes.
One-vs-all SVM: As name suggests, you train one classifier per class with positive samples from that class and negative samples from all other classes.
From LIBSVM FAQs:
It is one-against-one. We chose it after doing the following comparison: C.-W. Hsu and C.-J. Lin. A comparison of methods for multi-class support vector machines , IEEE Transactions on Neural Networks, 13(2002), 415-425.
"1-against-the rest" is a good method whose performance is comparable to "1-against-1." We do the latter simply because its training time is shorter.
However, also note that a naive implementation of one-vs-one may not be practical for large-scale problems. LIBSVM website also lists this shortcoming and provides an extension.
LIBLINEAR does not support one-versus-one multi-classification, so we provide an extension here. If k is the number of classes, we generate k(k-1)/2 models, each of which involves only two classes of training data. According to Yuan et al. (2012), one-versus-one is not practical for large-scale linear classification because of the huge space needed to store k(k-1)/2 models. However, this approach may still be viable if model vectors (i.e., weight vectors) are very sparse. Our implementation stores models in a sparse form and can effectively handle some large-scale data.

h2o random forest calculating MSE for multinomial classification

Why is h2o.randomforest calculating MSE on Out of bag sample and while training for a multinomail classification problem?
I have done binary classification also using h2o.randomforest, there it used to calculate AUC on out of bag sample and while training but for multi classification random forest is calculating MSE which seems suspicious. Please see this screenshot.
My target variable was a factor containing 4 factor levels model1, model2, model3 and model4. In the screenshot you would also a confusion matrix for these factors.
Can someone please explain this behaviour?
Both binomial and multinomial classification display MSE, so you will see it in the Scoring History table for both models (highlighted training_MSE column).
H2O does not evaluate a multinomial AUC. A few evaluation methods exist, but there is not yet a single widely adopted method. The pROC package discusses the method of Hand and Till, but mentions that it cannot be plotted and results rarely tested. Log loss and classification error are still available, specific to classification, as each has standard methods of evaluation in a multinomial context.
There is a confusion matrix comparing your 4 factor levels, as you highlighted. Can you clarify what more you are expecting? If you were looking for four individual confusion matrices, the four-column table contains enough information that they could be computed.

K nearest neighbour validation performance

I am using knn to do classification for a telecom problem. I splitted my data into 70% training and 30% validation. While the knn classifier is able to catch over 80% in 2 deciles in training, its performance in validation sample is as good as random 45 degree line. I am surprised how does KNN work that the model performance in training and validation are so different.
Any pointers ?
Reasonable pointers are hardly possible without more details. The behavior of your KNN depends on several aspects:
The parameter K defining the neighbors. If it is set to K=1, for example, you will get no training error at all, this showing that the consideration of training-to-validation-error may not be justified.
The parameter K is often found using cross validation. I would suggest you to do this as well.
The distance metric. Which function are you using, are there different units, length scales, etc.?
The noise of your data, the size of your data ... -- there simply exist data sets which are hard to describe.
By the way: can you tell what kind of data you want to describe, and, if possible, also provide some examples or show some scatter plot (data and your result)?

Interpretation of MATLAB's NaiveBayses 'posterior' function

After we created a Naive Bayes classifier object nb (say, with multivariate multinomial (mvmn) distribution), we can call posterior function on testing data using nb object. This function has 3 output parameters:
[post,cpre,logp] = posterior(nb,test)
I understand how post is computed and the meaning of that, also cpre is the predicted class, based on the maximum over posterior probabilities for each class.
The question is about logp. It is clear how it is computed (logarithm of the PDF of each pattern in test), but I don't understand the meaning of this measure and how it can be used in the context of Naive Bayes procedure. Any light on this is very much appreciated.
Thanks.
The logp you are referring to is the log likelihood, which is one way to measure how good a model is fitting. We use log probabilities to prevent computers from underflowing on very small floating-point numbers, and also because adding is faster than multiplying.
If you learned your classifier several times with different starting points, you would get different results because the likelihood function is not log-concave, meaning there are local maxima that you would get stuck in. If you computed the likelihood of the posterior on your original data you would get the likelihood of the model. Although the likelihood gives you a good measure of how one set of parameters fits compared to another, you need to be careful that you're not overfitting.
In your case, you are computing the likelihood on some unobserved (test) data, which gives you an idea of how well your learned classifier is fitting on the data. If you were trying to learn this model based on the test set, you would pick the parameters based on the highest test likelihood; however in general when you're doing this it's better to use a validation set. What you are doing here is computing predictive likelihood.
Computing the log likelihood is not limited to Naive Bayes classifiers and can in fact be computed for any Bayesian model (gaussian mixture, latent dirichlet allocation, etc).

SVM Classification with Cross Validation

I am new to using Matlab and am trying to follow the example in the Bioinformatics Toolbox documentation (SVM Classification with Cross Validation) to handle a classification problem.
However, I am not able to understand Step 9, which says:
Set up a function that takes an input z=[rbf_sigma,boxconstraint], and returns the cross-validation value of exp(z).
The reason to take exp(z) is twofold:
rbf_sigma and boxconstraint must be positive.
You should look at points spaced approximately exponentially apart.
This function handle computes the cross validation at parameters
exp([rbf_sigma,boxconstraint]):
minfn = #(z)crossval('mcr',cdata,grp,'Predfun', ...
#(xtrain,ytrain,xtest)crossfun(xtrain,ytrain,...
xtest,exp(z(1)),exp(z(2))),'partition',c);
What is the function that I should be implementing here? Is it exp or minfn? I will appreciate if you can give me the code for this section. Thanks.
I will like to know what does it mean when it says exp([rbf_sigma,boxconstraint])
rbf_sigma: The svm is using a gaussian kernel, the rbf_sigma set the standard deviation (~size) of the kernel. To understand how kernels work, the SVM is putting the kernel around every sample (so that you have a gaussian around every sample). Then the kernels are added up (sumed) for the samples of each category/type. At each point the type which sum is higher would be the "winner". For example if type A has a higher sum of these kernels at point X, then if you have a new datum to classify in point X, it will be classified as type A. (there are other configuration parameters that may change the actual threshold where a category is selected over another)
Fig. Analyze this figure from the webpage you gave us. You can see how by adding up the gaussian kernels on the red samples "sumA", and on the green samples "sumB"; it is logical that sumA>sumB in the center part of the figure. It is also logical that sumB>sumA in the outer part of the image.
boxconstraint: it is a cost/penalty over miss-classified data. During the training stage of the classifier, where you use the training data to adjust the SVM parameters, the training algorithm is using an error function to decide how to optimize the SVM parameters in an iterative fashion. The cost for a miss-classified sample is proportional to how far it is from the boundary where it would have been classified correctly. In the figure that I am attaching the boundary is the inner blue circumference.
Taking into account BGreene indications and from what I understand of the tutorial:
In the tutorial they advice to try values for rbf_sigma and boxconstraint that are exponentially apart. This means that you should compare values like {0.2, 2, 20, ...} (note that this is {2*10^(i-2), i=1,2,3,...}), and NOT like {0.2, 0.3, 0.4, 0.5} (which would be linearly apart). They advice this to try a wide range of values first. You can further optimize later FROM the first optimum that you obtained before.
The command "[searchmin fval] = fminsearch(minfn,randn(2,1),opts)" will give you back the optimum values for rbf_sigma and boxconstraint. Probably you have to use exp(z) because it affects how fminsearch increments the values of z(1) and z(2) during the search for the optimum value. I suppose that when you put exp(z(1)) in the definition of #minfn, then fminsearch will take 'exponentially' big steps.
In machine learning, always try to understand that there are three subsets in your data: training data, cross-validation data, and test data. The training set is used to optimize the parameters of the SVM classifier for EACH value of rbf_sigma and boxconstraint. Then the cross validation set is used to select the optimum value of the parameters rbf_sigma and boxconstraint. And finally the test data is used to obtain an idea of the performance of your classifier (the efficiency of the classifier is determined upon the test set).
So, if you start with 10000 samples you may divide the data for example as training(50%), cross-validation(25%), test(25%). So that you will sample randomly 5000 samples for the training set, then 2500 samples from the 5000 remaining samples for the cross-validation set, and the rest of samples (that is 2500) would be separated for the test set.
I hope that I could clarify your doubts. By the way, if you are interested in the optimization of the parameters of classifiers and machine learning algorithms I strongly suggest that you follow this free course -> www.ml-class.org (it is awesome, really).
You need to implement a function called crossfun (see example).
The function handle minfn is passed to fminsearch to be minimized.
exp([rbf_sigma,boxconstraint]) is the quantity being optimized to minimize classification error.
There are a number of functions nested within this function handle:
- crossval is producing the classification error based on cross validation using partition c
- crossfun - classifies data using an SVM
- fminsearch - optimizes SVM hyperparameters to minimize classification error
Hope this helps