How to automatically optimize a classifier in Weka in order to have a given class to contain 100 % sure data? - classification

I have two (or three) classes and each classes can only possess one label.
I want to optimize (automatically if possible) parameters and thresholds of classifiers in order for my first class to contain only 100 % sure data. Even if it contains a small number of instances.
I don't mind for the remaining classes to contain false alarm or correct rejection.
I don't mind to have unclassified data.
I have already been searching on stackoverflow and on the weka's wiki but maybe my lack of knowledge concerning weka made me miss some keywords.
I also tried to perform the task with the well-known "iris" database but I think that in this case, any class can be 100 % sure.
Yet, I have only succeed in testing multiple classifiers and tuning them manually but without performing 100 % correct for my first class. (I checked this result in the confusion matrix given by weka's report.)
Somehow, I know it is possible for my class to contain 100% sure data because I managed to do it in Matlab with simple threshold set manually. But I would like to try out a bigger database, to obtain better threshold and to use the power of weka.
Any suggestions would be helpful, thanks !

You probably need the "Cost Sensitive Classifier" among "meta" classifiers.
If you are working in the Explorer, here is the dialog you get.
Choose the your "classifier" (something beyond ZeroR :) ).
Set your "cost matrix". For 2-class problem this will be 2x2 matrix.
By setting one non-diagonal component very large (>>1, let us say 1000) you ensure that misclassifying one class (your "first" class) is 1000 times more expensive than misclassifying another class. This should do the job.

Related

How does word embedding/ word vectors work/created?

How does word2vec create vectors for words? I trained two word2vec models using two different files (from commoncrawl website) but I am getting same word vectors for a given word from both models.
Actually, I have created multiple word2vec models using different text files from the commoncrawl website. Now I want to check which model is better among all. How can select the best model out of all these models and why I am getting same word vectors for different models?
Sorry, If the question is not clear.
If you are getting identical word-vectors from models that you've prepared from different text corpuses, something is likely wrong in your process. You may not be performing any training at all, perhaps because of a problem in how the text iterable is provided to the Word2Vec class. (In that case, word-vectors would remain at their initial, randomly-initialized values.)
You should enable logging, and review the logs carefully to see that sensible counts of words, examples, progress, and incremental-progress are displayed during the process. You should also check that results for some superficial, ad-hoc checks look sensible after training. For example, does model.most_similar('hot') return other words/concepts somewhat like 'hot'?
Once you're sure models are being trained on varied corpuses – in which case their word-vectors should be very different from each other – deciding which model is 'best' depends on your specific goals with word-vectors.
You should devise a repeatable, quantitative way to evaluate a model against your intended end-uses. This might start crudely with a few of your own manual reviews of results, like looking over most_similar() results for important words for better/worse results – but should become more extensive. rigorous, and automated as your project progresses.
An example of such an automated scoring is the accuracy() method on gensim's word-vectors object. See:
https://github.com/RaRe-Technologies/gensim/blob/6d6f5dcfa3af4bc61c47dfdf5cdbd8e1364d0c3a/gensim/models/keyedvectors.py#L652
If supplied with a specifically-formatted file of word-analogies, it will check how well the word-vectors solve those analogies. For example, the questions-words.txt of Google's original word2vec code release includes the analogies they used to report vector quality. Note, though, that the word-vectors that are best for some purposes, like understanding text topics or sentiment, might not also be the best at solving this style of analogy, and vice-versa. If training your own word-vectors, it's best to choose your training corpus/parameters based on your own goal-specific criteria for what 'good' vectors will be.

How best to deal with "None of the above" in Image Classification?

This seems to be a fundamental question which some of you out there must have an opinion on. I have an image classifier implemented in CNTK with 48 classes. If the image does not match any of the 48 classes very well, then I'd like to be able to conclude that it was not among these 48 image types. My original idea was simply that if the highest output of the final Softmax layer was low, I would be able to conclude that the test image matched none well. While I occasionally see this occur, in most testing, Softmax still produces a very high (and mistaken) result when handed an 'unknown image type'. But maybe my network is 'over fit' and if it wasn't, my original idea would work fine. What do you think? Any way to define a 49-th class called 'none-of-the-above'?
You really have these two options indeed--thresholding the posterior probabilities (softmax values), and adding a garbage class.
In my area (speech), both approaches are their place:
If "none of the above" inputs are of the same nature as the "above" (e.g. non-grammatical inputs), thresholding works fine. Note that the posterior probability for a class is equal to one minus an estimate of the error rate for choosing this class. Rejecting anything with posterior < 50% would be rejecting all cases where you are more likely wrong than right. As long as your none-of-the-above classes are of similar nature, the estimate may be accurate enough to make this correct for them as well.
If "none of the above" inputs are of similar nature but your number of classes is very small (e.g. 10 digits), or if the inputs are of a totally different nature (e.g. a sound of a door slam or someone coughing), thresholding typically fails. Then, one would train a "garbage model." In our experience, it is OK to include the training data for the correct classes. Now the none-of-the-above class may match a correct class as well. But that's OK as long as the none-of-the-above class is not overtrained--its distribution will be much flatter, and thus even if it matches a known class, it will match it with a lower score and thus not win against the actual known class' softmax output.
In the end, I would use both. Definitely use a threshold (to catch the cases that the system can rule out) and use a garbage model, which I would just train it on whatever you have. I would expect that including the correct examples in training will not harm, even if it is the only data you have (please check the paper Anton posted for whether that applies to image as well). It may also make sense to try to synthesize data, e.g. by randomly combining patches from different images.
I agree with you that this is a key question, but I am not aware of much work in that area either.
There's one recent paper by Zhang and LeCun, that addresses the question for image classification in particular. They use large quantities of unlabelled data to create an additional "none of the above" class. The catch though is that, in some cases, their unlabelled data is not completely unlabelled, and they have means of removing "unlabelled" images that are actually in one of their labelled classes. Having said that, the authors report that apart from solving the "none of the above" problem, they even see performance gains even on their test sets.
As for fitting something post-hoc, just by looking at the outputs of the softmax, I can't provide any pointers.

Multiclass classification in SVM

I have been working on "Script identification from bilingual documents".
I want to classify the pages/blocks as either Eng(class 1), Hindi (class 2) or Mixed using libsvm in matlab. but the problem is that the training data i have consists of samples corresponding to Hindi and english pages/blocks only but no mixed pages.
The test data i want to give may consists of Mixed pages/blocks also, in that case i want it to be classified as "Mixed". I am planning to do it using confidence score or probability values. like if the prob value of class 1 is greater than a threshold (say 0.8) and prob value of class 2 is less than a threshold say(0.05) then it will be classified as class 1, and class 2 vice-versa. but if aforementioned two conditions dont satisfy then i want to classify it as "Mixed".
The third return value from the "libsvmpredict" is prob_values and i was planning to go ahead with this prob_values to decide whether the testdata is Hindi, English or Mixed. but at few places i learnt that "libsvmpredict" does not produce the actual prob_values.
Is there any way which can help me to classify the test data into 3 classes( Hindi, English, Mixed) using training data consisting of only 2 classes in SVM.
This is not the modus operandi for SVMs.
In no way SVMs can predict a given class without knowing it, without knowing how to separate such class from all other classes.
The function svmpredict() in LibSVM actually shows the probability estimates and the greater this value is, the more confident you can be regarding your prediction. But you cannot rely on such values if you have just two classes in order to predict a third class: indeed svmpredict() will return as many decision values as there are classes.
You can go on with your thresholding system (which, again, is not SVM-based) but it most likely fail or give bad performances. Think about that: you have to set up two thresholds and use them in a logic AND manner. The chance of correctly classified non-Mixed documents will indeed drastically decrease.
My suggestion is: instead of wasting time setting up thresholds, with a high chance of bad performances, join some of these texts together or create some new files with some Hindi and some English lines in order to add to your training data some proper Mixed documents and perform a standard 3-classes SVM system.
In order to create such files you can as well use Matlab, which has a pretty decent file I/O functions such as fread(), fwrite(), fprintf(), fscanf(), importdata() and so on...

running NN software with my own data

New with Matlab.
When I try to load my own date using the NN pattern recognition app window, I can load the source data, but not the target (it is never on the drop down list). Both source and target are in the same directory. Source is 5000 observations with 400 vars per observation and target can take on 10 different values (recognizing digits). Any Ideas?
Before you do anything with your own data you might want to try out the example data sets available in the toolbox. That should make many problems easier to find later on because they definitely work, so you can see what's wrong with your code.
Regarding your actual question: Without more details, e.g. what your matrices contain and what their dimensions are, it's hard to help you. In your case some of the problems mentioned here might be similar to yours:
http://www.mathworks.com/matlabcentral/answers/17531-problem-with-targets-in-nprtool
From what I understand about nprtool your targets have to consist of a matrix with only one 1 (for the correct class) in either row or column (depending on the input matrix), so make sure that's the case.

Error Correcting Tournaments (ect) Multi Class Classification in Vowpal Wabbit

I tried to go through this paper which describes the ECT algorithm but could not make much out of it.
I know it is different from one-against-al (oaa) and even performs better than oaa.I wanted a simple explanation about how ect works.
ECT and Filter trees are useful (only) if you have a very big number of output labels (classes), let's say N=1000. With OAA (one-against-all), it would mean to do N binary classification tasks for each example (during both training and testing). With ECT you can make the prediction much faster: log(N). You can imagine Filter trees (which are the basis of ECT) as a decision tree where in each node you ask whether the example belongs to one set of labels or another set of labels (using all the features, unlike original decision trees).
In general, ECT is worse (in terms of loss or accuracy) than OAA (but in some cases it may be almost as good as OAA). With N=10 labels, you should try OAA first. With N>1000, OAA is too slow (and even the accuracy is low), you should try ECT (or --log_multi or --csoaa_ldf in VW, if you can preselect a smaller number of labels which are relevant for each example).
See http://cilvr.cs.nyu.edu/diglib/lsml/logarithmic.pdf