Building weka classifier - classification

I'm trying to build a classifier in Weka. I have two data sets: training and testing. The two files are identical: with the same number and type of attributes. However, the weka explorer is giving me error saying Train and test set are not compatible. How to resolve this error?
Here is a snap of the two sets:
training set
testing set

Searching through their wiki, here's what i found:
One of Weka's fundamental assumption is that the structure of the training and test sets are exactly the same. This does not only mean that you need the exact same number of attributes, but also the exact same type. In case of nominal attributes, you must ensure that the number of labels and the order of the labels are the same.
https://weka.wikispaces.com/Why+do+I+get+the+error+message+%27training+and+test+set+are+not+compatible%27%3F
They may seem to be the same but you never know, you should at least try one of the visual diff applications suggested by them.
I hope this helped you in some way.

Related

How does word embedding/ word vectors work/created?

How does word2vec create vectors for words? I trained two word2vec models using two different files (from commoncrawl website) but I am getting same word vectors for a given word from both models.
Actually, I have created multiple word2vec models using different text files from the commoncrawl website. Now I want to check which model is better among all. How can select the best model out of all these models and why I am getting same word vectors for different models?
Sorry, If the question is not clear.
If you are getting identical word-vectors from models that you've prepared from different text corpuses, something is likely wrong in your process. You may not be performing any training at all, perhaps because of a problem in how the text iterable is provided to the Word2Vec class. (In that case, word-vectors would remain at their initial, randomly-initialized values.)
You should enable logging, and review the logs carefully to see that sensible counts of words, examples, progress, and incremental-progress are displayed during the process. You should also check that results for some superficial, ad-hoc checks look sensible after training. For example, does model.most_similar('hot') return other words/concepts somewhat like 'hot'?
Once you're sure models are being trained on varied corpuses – in which case their word-vectors should be very different from each other – deciding which model is 'best' depends on your specific goals with word-vectors.
You should devise a repeatable, quantitative way to evaluate a model against your intended end-uses. This might start crudely with a few of your own manual reviews of results, like looking over most_similar() results for important words for better/worse results – but should become more extensive. rigorous, and automated as your project progresses.
An example of such an automated scoring is the accuracy() method on gensim's word-vectors object. See:
https://github.com/RaRe-Technologies/gensim/blob/6d6f5dcfa3af4bc61c47dfdf5cdbd8e1364d0c3a/gensim/models/keyedvectors.py#L652
If supplied with a specifically-formatted file of word-analogies, it will check how well the word-vectors solve those analogies. For example, the questions-words.txt of Google's original word2vec code release includes the analogies they used to report vector quality. Note, though, that the word-vectors that are best for some purposes, like understanding text topics or sentiment, might not also be the best at solving this style of analogy, and vice-versa. If training your own word-vectors, it's best to choose your training corpus/parameters based on your own goal-specific criteria for what 'good' vectors will be.

Multiclass classification in SVM

I have been working on "Script identification from bilingual documents".
I want to classify the pages/blocks as either Eng(class 1), Hindi (class 2) or Mixed using libsvm in matlab. but the problem is that the training data i have consists of samples corresponding to Hindi and english pages/blocks only but no mixed pages.
The test data i want to give may consists of Mixed pages/blocks also, in that case i want it to be classified as "Mixed". I am planning to do it using confidence score or probability values. like if the prob value of class 1 is greater than a threshold (say 0.8) and prob value of class 2 is less than a threshold say(0.05) then it will be classified as class 1, and class 2 vice-versa. but if aforementioned two conditions dont satisfy then i want to classify it as "Mixed".
The third return value from the "libsvmpredict" is prob_values and i was planning to go ahead with this prob_values to decide whether the testdata is Hindi, English or Mixed. but at few places i learnt that "libsvmpredict" does not produce the actual prob_values.
Is there any way which can help me to classify the test data into 3 classes( Hindi, English, Mixed) using training data consisting of only 2 classes in SVM.
This is not the modus operandi for SVMs.
In no way SVMs can predict a given class without knowing it, without knowing how to separate such class from all other classes.
The function svmpredict() in LibSVM actually shows the probability estimates and the greater this value is, the more confident you can be regarding your prediction. But you cannot rely on such values if you have just two classes in order to predict a third class: indeed svmpredict() will return as many decision values as there are classes.
You can go on with your thresholding system (which, again, is not SVM-based) but it most likely fail or give bad performances. Think about that: you have to set up two thresholds and use them in a logic AND manner. The chance of correctly classified non-Mixed documents will indeed drastically decrease.
My suggestion is: instead of wasting time setting up thresholds, with a high chance of bad performances, join some of these texts together or create some new files with some Hindi and some English lines in order to add to your training data some proper Mixed documents and perform a standard 3-classes SVM system.
In order to create such files you can as well use Matlab, which has a pretty decent file I/O functions such as fread(), fwrite(), fprintf(), fscanf(), importdata() and so on...

How to automatically optimize a classifier in Weka in order to have a given class to contain 100 % sure data?

I have two (or three) classes and each classes can only possess one label.
I want to optimize (automatically if possible) parameters and thresholds of classifiers in order for my first class to contain only 100 % sure data. Even if it contains a small number of instances.
I don't mind for the remaining classes to contain false alarm or correct rejection.
I don't mind to have unclassified data.
I have already been searching on stackoverflow and on the weka's wiki but maybe my lack of knowledge concerning weka made me miss some keywords.
I also tried to perform the task with the well-known "iris" database but I think that in this case, any class can be 100 % sure.
Yet, I have only succeed in testing multiple classifiers and tuning them manually but without performing 100 % correct for my first class. (I checked this result in the confusion matrix given by weka's report.)
Somehow, I know it is possible for my class to contain 100% sure data because I managed to do it in Matlab with simple threshold set manually. But I would like to try out a bigger database, to obtain better threshold and to use the power of weka.
Any suggestions would be helpful, thanks !
You probably need the "Cost Sensitive Classifier" among "meta" classifiers.
If you are working in the Explorer, here is the dialog you get.
Choose the your "classifier" (something beyond ZeroR :) ).
Set your "cost matrix". For 2-class problem this will be 2x2 matrix.
By setting one non-diagonal component very large (>>1, let us say 1000) you ensure that misclassifying one class (your "first" class) is 1000 times more expensive than misclassifying another class. This should do the job.

running NN software with my own data

New with Matlab.
When I try to load my own date using the NN pattern recognition app window, I can load the source data, but not the target (it is never on the drop down list). Both source and target are in the same directory. Source is 5000 observations with 400 vars per observation and target can take on 10 different values (recognizing digits). Any Ideas?
Before you do anything with your own data you might want to try out the example data sets available in the toolbox. That should make many problems easier to find later on because they definitely work, so you can see what's wrong with your code.
Regarding your actual question: Without more details, e.g. what your matrices contain and what their dimensions are, it's hard to help you. In your case some of the problems mentioned here might be similar to yours:
http://www.mathworks.com/matlabcentral/answers/17531-problem-with-targets-in-nprtool
From what I understand about nprtool your targets have to consist of a matrix with only one 1 (for the correct class) in either row or column (depending on the input matrix), so make sure that's the case.

Why the decision tree shows a correct classificationthe while some instances are being misclassified

I am using WEKA, 10-fold cross validation or split 66% to create training and testing sets .. I used c4.5 (J48) as a classifier ..
I get in my results that some instances are misclassified, but, when I visualize the tree I see that based on the tree the instances should have been classified correctly !!!
I don't see this when the testing set is the same training set .. if the classifier decided to create such a tree why some instances are not being classified based on this tree ???
Thanks in advance.
It sounds like you are after a completely unpruned tree such that the Training Data should return a 100% accuracy.
The options that are likely causing the undesirable results are outlined below:
Unpruned is used to minimise the number of rules in the decision tree and can lead to lower generalisation error
minNumObj is used to determine the minimum number of cases required to make a rule. If this is higher than one, you could get some errors on the training data.
I wouldn't generally recommend using these options for a given problem, but if you are trying to obtain a 100% result on the training data, this would be the place to start.
Hope this helps!