SVM-pref package from Cornell University - classification

I'm using SVM-pref (http://svmlight.joachims.org) for a binary classification problem. I don't have much experience with this package and so I seek help with the following questions:
(1) My features are all discrete/nominal. Is there a special way to represent the feature vectors, such as a way to convert the nominal values into continuous values, or do we just replace the nominal values with dummy numbers like 1, 2, 3, etc.?
(2) If the answer to the first question is that we replace nominal values with dummy numbers, then my second question is: should we start numbering feature values from 1, so that we have 1:1 but never 1:0, since otherwise the learner will treat a zero-valued feature as non-existent? Is that correct?
(3) How do we choose the best -c value and the values for the rest of the parameters? Is it only by trial and error, or are there other approaches for deciding on these parameters?

To use categorical features in SVM you must encode them using dummy variables, e.g. one-hot coding. For every level of the category, you should introduce a dimension. Something like this for a feature with levels A, B and C:
A -> [1,0,0]
B -> [0,1,0]
C -> [0,0,1]
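The encoding above can be sketched in plain Python. This is a minimal sketch, not part of the package: the helper names `build_feature_index` and `to_svmlight_line` and the toy rows are made up for illustration. It assumes the sparse `<label> <id>:<value>` input format with 1-based feature ids in increasing order, where zero-valued features are simply left out.

```python
def build_feature_index(rows):
    """Assign a 1-based column id to every (feature position, level) pair."""
    index = {}
    for row in rows:
        for position, level in enumerate(row):
            key = (position, level)
            if key not in index:
                index[key] = len(index) + 1  # feature ids start at 1
    return index


def to_svmlight_line(label, row, index):
    """Format one example as '<label> <id>:1 ...' with ids in increasing order."""
    ids = sorted(index[(position, level)] for position, level in enumerate(row))
    return " ".join([str(label)] + [f"{i}:1" for i in ids])


# Toy data (hypothetical): two nominal features per example.
rows = [("A", "red"), ("B", "red"), ("C", "blue")]
labels = [1, -1, 1]

index = build_feature_index(rows)
lines = [to_svmlight_line(y, r, index) for y, r in zip(labels, rows)]
for line in lines:
    print(line)
```

Each level gets its own dimension, and only the levels that are present are written out, so no feature ever needs an explicit zero value.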
For (2), see the answer to the previous question: use one dimension per categorical level.
For (3), this is typically done by testing candidate values in a cross-validation setting.
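As a rough sketch of what such a search can look like, the code below tries a few candidate C values with k-fold cross-validation and keeps the one with the best mean score. Everything here is an assumption for illustration: `train_and_score` is a hypothetical callback that you would implement by training on the training indices and returning the accuracy on the validation indices, and `placeholder_score` only stands in for it so that the sketch runs.

```python
import math


def k_fold_indices(n_examples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous folds."""
    sizes = [n_examples // k + (1 if i < n_examples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        val_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_examples) if i < start or i >= start + size]
        yield train_idx, val_idx
        start += size


def select_c(candidates, n_examples, k, train_and_score):
    """Return (best_c, mean_score) over the candidate values."""
    best = None
    for c in candidates:
        scores = [train_and_score(c, tr, va) for tr, va in k_fold_indices(n_examples, k)]
        mean = sum(scores) / len(scores)
        if best is None or mean > best[1]:
            best = (c, mean)
    return best


def placeholder_score(c, train_idx, val_idx):
    # Stand-in for real training: pretends accuracy peaks at c = 1.
    return 1.0 / (1.0 + abs(math.log10(c)))


best_c, best_score = select_c([0.01, 0.1, 1, 10, 100], n_examples=10, k=3,
                              train_and_score=placeholder_score)
print(best_c, best_score)
```

Candidate values are usually spaced on a logarithmic grid, as here, and the same loop extends to other parameters by iterating over combinations.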

Here is also another useful and informative discussion about representing nominal features for SVM classifiers.

Related

Implementing one hot encoding

I already understand the uses and concept behind one hot encoding with neural networks. My question is just how to implement the concept.
Let's say, for example, I have a neural network that takes in up to 10 letters (not case sensitive) and uses one hot encoding. Each input will be a 26 dimensional vector of some kind for each spot. In order to code this, do I act as if I have 260 inputs with each one displaying only a 1 or 0, or is there some other standard way to implement these 26 dimensional vectors?
In your case, you have to distinguish between frameworks. I can speak for PyTorch, which is my go-to framework when programming a neural network.
There, one-hot encodings for sequences are generally performed in a way where your network will expect a sequence of indices. Taking your 10 letters as an example, this could be the sequence of ["a", "b", "c" , ...]
The embedding layer will be initialized with a "dictionary length", i.e. the number of distinct elements (num_embeddings) your network can receive - in your case 26. Additionally, you can specify embedding_dim, i.e. the output dimension of a single character. This is already past the step of one-hot encodings, since you generally only need them to know which value to associate with that item.
Then, you would feed an encoded version of the above string to the layer, which could look like this: [0,1,2,3, ...]. Assuming the sequence has length 10, this will produce an output of shape [10,embedding_dim], i.e. a 2-dimensional Tensor.
To summarize, PyTorch essentially allows you to skip this rather tedious step of encoding it as a one-hot encoding. This is mainly due to the fact that your vocabulary can in some instances be quite large: Consider for example Machine Translation Systems, in which you could have 10,000+ words in your vocabulary. Instead of storing every single word as a 10,000-dimensional vector, using a single index is more convenient.
If that does not completely answer your question (since I am essentially telling you how it is generally done): instead of making a 260-dimensional vector, you would again use a [10,26] Tensor, in which each row represents a different letter.
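To make the two representations concrete without depending on any framework, here is a small plain-Python sketch. The names `to_indices` and `to_one_hot` are made up for this example, and leaving unused trailing positions as all-zero rows for inputs shorter than 10 letters is an assumption, not something the frameworks prescribe.

```python
import string

ALPHABET = string.ascii_lowercase  # 26 letters; input is not case sensitive
MAX_LEN = 10


def to_indices(text):
    """Map each letter to its 0-based alphabet position, e.g. 'abc' -> [0, 1, 2]."""
    # ALPHABET.index raises ValueError for anything that is not a letter a-z.
    return [ALPHABET.index(ch) for ch in text.lower()[:MAX_LEN]]


def to_one_hot(text):
    """Return a MAX_LEN x 26 matrix; unused trailing rows stay all-zero."""
    matrix = [[0] * len(ALPHABET) for _ in range(MAX_LEN)]
    for row, idx in enumerate(to_indices(text)):
        matrix[row][idx] = 1
    return matrix


indices = to_indices("Abc")
one_hot = to_one_hot("Abc")
flat = [value for row in one_hot for value in row]  # the 260-input view
print(indices, len(one_hot), len(one_hot[0]), len(flat))
```

The index list is what an embedding layer would consume, the 10 x 26 matrix is the explicit one-hot form, and flattening it gives the 260-input layout from the question.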
If you have 10 distinct elements (e.g. a, b, ..., j or 1, 2, ..., 10) to be represented as one-hot vectors of dimension 26, then your input is just 10 vectors, each of dimension 26. Do this:
import torch
y = torch.eye(26) # Identity matrix: row i is the one-hot vector for letter i.
one_hot = y[torch.arange(0, 10)] # Selects rows 0..9: 10 one-hot vectors, each of dimension 26.
Hope this helps a bit.

Multiclass classification in SVM

I have been working on "Script identification from bilingual documents".
I want to classify the pages/blocks as either English (class 1), Hindi (class 2) or Mixed using libsvm in MATLAB. The problem is that my training data consists of samples corresponding to Hindi and English pages/blocks only; there are no Mixed pages.
The test data may also contain Mixed pages/blocks, and in that case I want them to be classified as "Mixed". I am planning to do this using confidence scores or probability values: if the probability of class 1 is greater than a threshold (say 0.8) and the probability of class 2 is less than a threshold (say 0.05), the sample is classified as class 1, and vice versa for class 2. If neither condition is satisfied, I want to classify it as "Mixed".
The third return value from "libsvmpredict" is prob_values, and I was planning to use these prob_values to decide whether the test data is Hindi, English or Mixed. But in a few places I read that "libsvmpredict" does not produce actual probability values.
Is there any way to classify the test data into 3 classes (Hindi, English, Mixed) using training data consisting of only 2 classes in SVM?
This is not the modus operandi for SVMs.
An SVM cannot predict a class it has never seen, because it has not learned how to separate that class from all the others.
The function svmpredict() in LibSVM does return probability estimates, and the greater the value, the more confident you can be in the prediction. But with only two classes you cannot rely on these values to predict a third: indeed, svmpredict() returns as many decision values as there are classes.
You can go on with your thresholding system (which, again, is not SVM-based), but it will most likely fail or perform poorly. Think about it: you have to set up two thresholds and combine them with a logical AND. The chance of correctly classifying non-Mixed documents will therefore decrease drastically.
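For reference, the thresholding rule under discussion amounts to the following. This is only a sketch of the rule described in the question, written in Python rather than MATLAB; the function name is made up, and the two probabilities are assumed to come from whatever the classifier reports for each class.

```python
def classify_with_thresholds(p_english, p_hindi, high=0.8, low=0.05):
    """Two thresholds combined with AND; anything else falls through to 'Mixed'."""
    if p_english > high and p_hindi < low:
        return "English"
    if p_hindi > high and p_english < low:
        return "Hindi"
    return "Mixed"


print(classify_with_thresholds(0.97, 0.03))  # clearly English
print(classify_with_thresholds(0.90, 0.10))  # fairly confident, yet labeled Mixed
```

The second call shows the weakness: a page with 0.90 probability for English still ends up as "Mixed" because the second condition (below 0.05) is not met.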
My suggestion: instead of spending time tuning thresholds, with a high chance of poor performance, join some of these texts together, or create new files containing some Hindi and some English lines, so that your training data includes proper Mixed documents, and then train a standard 3-class SVM.
To create such files you can use MATLAB as well, which has pretty decent file I/O functions such as fread(), fwrite(), fprintf(), fscanf(), importdata() and so on.
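The same idea can be sketched in a few lines of Python, assuming each page is available as a list of text lines. The function name `make_mixed_page` and the placeholder lines are hypothetical; real training pages would come from your own Hindi and English files.

```python
import random


def make_mixed_page(english_lines, hindi_lines, n_lines, rng):
    """Draw n_lines with at least one line from each script, then shuffle them."""
    if n_lines < 2:
        raise ValueError("a mixed page needs at least two lines")
    n_english = rng.randint(1, n_lines - 1)  # leaves room for at least one Hindi line
    page = rng.choices(english_lines, k=n_english)
    page += rng.choices(hindi_lines, k=n_lines - n_english)
    rng.shuffle(page)
    return page


# Placeholder content standing in for lines read from real files.
english = ["english line 1", "english line 2", "english line 3"]
hindi = ["hindi line 1", "hindi line 2", "hindi line 3"]

rng = random.Random(0)  # fixed seed so the synthetic pages are reproducible
mixed_page = make_mixed_page(english, hindi, n_lines=6, rng=rng)
print(mixed_page)
```

Pages generated this way can be labeled as the third class and run through the same feature extraction as the real Hindi and English pages.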

RapidMiner: Ability to classify based off user set support threshold?

I have built a small text analysis model that classifies small text files as either good, bad, or neutral. I was using a Support Vector Machine as my classifier. However, I was wondering whether, instead of classifying into all three, I could classify into either Good or Bad, and if the support for that text file is below .7 or some user-specified threshold, classify that text file as neutral. I know this isn't regarded as the best way of doing this; I am just trying to see what would happen if I took a different approach.
The operator Drop Uncertain Predictions might be what you want.
After you have applied your model to some test data, the resulting example set will have a prediction and two new attributes called confidence(Good) and confidence(Bad). These confidences are between 0 and 1 and for the two class case they will sum to 1 for each example within the example set. The highest confidence dictates the value of the prediction.
The Drop Uncertain Predictions operator requires a min confidence parameter and will set the prediction to missing if the maximum confidence it finds is below this value (you can also have different confidences for different class values for more advanced investigations).
You could then use the Replace Missing Values operator to change all missing predictions to be a text value of your choice.
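Outside RapidMiner, the combined effect of the two operators can be sketched in plain Python. This is not RapidMiner code; the function name and the dictionary layout of the confidences are assumptions made for the example.

```python
def predict_with_neutral(confidences, min_confidence=0.7, fallback="neutral"):
    """Keep the most confident label, or use the fallback if it is too uncertain."""
    label, confidence = max(confidences.items(), key=lambda item: item[1])
    return label if confidence >= min_confidence else fallback


print(predict_with_neutral({"Good": 0.90, "Bad": 0.10}))
print(predict_with_neutral({"Good": 0.55, "Bad": 0.45}))
```

The first example keeps its prediction, while the second falls below the minimum confidence and is replaced by the fallback value.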

In preprocessing data with high cardinality, do you hash first or one-hot-encode first?

Hashing reduces dimensionality while one-hot-encoding essentially blows up the feature space by transforming multi-categorical variables into many binary variables. So it seems like they have opposite effects. My questions are:
What is the benefit of doing both on the same dataset? I read something about capturing interactions but not in detail - can somebody elaborate on this?
Which one comes first and why?
Binary one-hot-encoding is needed for feeding categorical data to linear models and SVMs with the standard kernels.
For example, you might have a feature that is the day of the week. Then you create a one-hot encoding for each day.
1000000 Sunday
0100000 Monday
0010000 Tuesday
...
0000001 Saturday
Feature hashing is mostly used to allow significant storage compression for parameter vectors: one hashes the high-dimensional input vectors into a lower-dimensional feature space. The parameter vector of the resulting classifier can then live in the lower-dimensional space instead of the original input space. This can be used as a method of dimension reduction, so you usually expect to trade a small loss in performance for a significant storage benefit.
The example in Wikipedia is a good one. Suppose you have three documents:
John likes to watch movies.
Mary likes movies too.
John also likes football.
Using a bag-of-words model, you first build a document-term matrix (each row is a document, and each entry indicates whether a word appears in that document).
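A short sketch of how that matrix can be built for the three documents is shown here. The tokenizer is deliberately naive (it strips the final period and splits on spaces), and the column order simply follows first appearance; both are choices made for this example.

```python
documents = [
    "John likes to watch movies.",
    "Mary likes movies too.",
    "John also likes football.",
]


def tokenize(text):
    """Naive tokenizer: drop the trailing period and split on whitespace."""
    return text.rstrip(".").split()


# The dictionary: every distinct word, in order of first appearance.
vocabulary = []
for doc in documents:
    for word in tokenize(doc):
        if word not in vocabulary:
            vocabulary.append(word)

# One row per document, one column per word; 1 means the word appears.
matrix = [[1 if word in tokenize(doc) else 0 for word in vocabulary]
          for doc in documents]

print(vocabulary)
for row in matrix:
    print(row)
```

The vocabulary has 9 entries, so each document becomes a 9-dimensional binary vector.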
The problem with this process is that such dictionaries take up a large amount of storage space, and grow in size as the training set grows.
Instead of maintaining a dictionary, a feature vectorizer that uses the hashing trick can build a vector of a pre-defined length by applying a hash function h to the features (e.g., words) in the items under consideration, then using the hash values directly as feature indices and updating the resulting vector at those indices.
Suppose you generate the hashed features below with 3 buckets (you apply k different hash functions to the original features and count how many times the hashed values hit each bucket).
       bucket1  bucket2  bucket3
doc1:  3        2        0
doc2:  2        2        0
doc3:  1        0        2
You have now reduced the features from 9 dimensions to 3.
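A minimal sketch of the hashing trick with a single hash function looks like this. It uses MD5 from `hashlib` only because Python's built-in `hash()` for strings changes between runs; the choice of hash and the helper names are assumptions, and the resulting counts will not match the illustrative table above, which comes from different hash functions.

```python
import hashlib


def bucket_of(token, n_buckets):
    """Map a token to a bucket index using a hash that is stable across runs."""
    digest = hashlib.md5(token.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets


def hashed_counts(tokens, n_buckets):
    """Build a fixed-length count vector without storing any dictionary."""
    vector = [0] * n_buckets
    for token in tokens:
        vector[bucket_of(token, n_buckets)] += 1
    return vector


doc1 = ["John", "likes", "to", "watch", "movies"]
vector1 = hashed_counts(doc1, n_buckets=3)
print(vector1)
```

The vector length is fixed in advance, so new words never enlarge the feature space; the price is that different words can collide in the same bucket.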
A more interesting application of feature hashing is to do personalization. The original paper of feature hashing contains a nice example.
Imagine you want to design a spam filter that is customized to each user. The naive way of doing this is to train a separate classifier for each user, which is infeasible both for training (training and updating each personalized model) and for serving (holding all classifiers in memory). A smarter way is illustrated below:
Each token is duplicated and one copy is individualized by concatenating each word with a unique user id. (See USER123_NEU and USER123_Votre).
The bag-of-words model now holds the common keywords and also user-specific keywords.
All words are then hashed into a low-dimensional feature space, where the document is trained and classified.
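The three steps above can be sketched as follows. This is an illustration rather than the method from the paper: the helper names, the MD5-based bucket function and the bucket count of 16 are all assumptions chosen to keep the example small.

```python
import hashlib


def bucket_of(token, n_buckets):
    """Stable hash of a token into one of n_buckets."""
    digest = hashlib.md5(token.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets


def personalized_tokens(user_id, tokens):
    """Keep every token and add a user-prefixed copy, e.g. 'NEU' -> 'USER123_NEU'."""
    return list(tokens) + [f"{user_id}_{token}" for token in tokens]


def hashed_features(user_id, tokens, n_buckets):
    """Hash shared and user-specific tokens into one fixed-length vector."""
    vector = [0] * n_buckets
    for token in personalized_tokens(user_id, tokens):
        vector[bucket_of(token, n_buckets)] += 1
    return vector


expanded = personalized_tokens("USER123", ["NEU", "Votre"])
features = hashed_features("USER123", ["NEU", "Votre"], n_buckets=16)
print(expanded)
print(features)
```

A single classifier trained on such vectors can learn weights for the shared copies that apply to everyone and weights for the prefixed copies that apply to one user only.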
Now to answer your questions:
Yes. One-hot encoding should come first, since it transforms a categorical feature into binary features so that linear models can consume it.
You can certainly apply both to the same dataset, as long as there is a benefit to using the compressed feature space. Note that if you can tolerate the original feature dimension, feature hashing is not required. For example, in a common digit recognition problem such as MNIST, each image is represented by 28x28 pixels, so the input dimension is only 784. Feature hashing won't have any benefit in this case.

Algorithm generation

I have a rather large(not too large but possibly 50+) set of conditions that must be placed on a set of data(or rather the data should be manipulated to fit the conditions).
For example, suppose I have a sequence of binary numbers of length n;
if n = 5 then an element in the data might be {0,1,1,0,0} or {0,0,0,1,1}, etc.
BUT there might be a set of conditions such as
x_3 + x_4 = 2
sum(x_even) <= 2
x_2*x_3 = x_4 mod 2
etc...
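To make the example conditions concrete, here is a small Python check. It assumes that the indices are 1-based (x_1 is the first element) and that x_even means the elements at even positions; the function name is made up for this sketch.

```python
from itertools import product


def satisfies_conditions(bits):
    """Check the three example conditions on a sequence of 0/1 values."""
    x = [None] + list(bits)  # pad so that x[i] corresponds to x_i
    return all([
        x[3] + x[4] == 2,                                # x_3 + x_4 = 2
        sum(x[i] for i in range(2, len(x), 2)) <= 2,     # sum(x_even) <= 2
        (x[2] * x[3]) % 2 == x[4] % 2,                   # x_2*x_3 = x_4 mod 2
    ])


# Enumerate every sequence of length 5 and keep the valid ones.
valid = [bits for bits in product([0, 1], repeat=5) if satisfies_conditions(bits)]
print(valid)
```

For n = 5 brute-force enumeration is trivial, which is a cheap way to sanity-check a proposed list of conditions against sample data.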
Because the conditions are quite complex, in that they come from experiment (although they can be written down in logic form) and are hard to diagnose, I would instead like to use a large sample set of valid data, i.e., data that I know satisfies the conditions. It is easier to collect the data than it is to deduce the conditions that the data must abide by.
Having said that, basically what I'm doing is very similar to neural networks. The difference is, I would like an actual algorithm, in some sense optimal, in some form of code that I can run instead of the network.
It might not be clear what I'm actually trying to do. What I have is a set of data in some raw format that is unique and unambiguous but not appropriate for my needs(in a sense the amount of data is too large).
I need to map the data into another set that actually is ambiguous to some degree but also has certain specific set of constraints that all the data follows(certain things just cannot happen while others are preferred).
The unique constraints and preferences are hard to figure out. That is, the mapping from the non-ambiguous set to the ambiguous set is hard to describe(which is why it is ambiguous). The goal, actually, is to have an unambiguous map by supplying the right constraints if at all possible.
So, in the vein of my initial example, I'm given (or supply) a set of elements and need some way to derive a list of constraints similar to those I've listed.
In a sense, I simply have a set of valid data and train on it, much as one would with a neural network.
Then, after this "training", I'm given a mapping function that I can use on any element in my dataset; it will produce a new element satisfying the constraints or, if it can't, will give a result that is as close to unambiguous as possible.
The main difference between neural networks and what I'm trying to achieve is that I'd like to have an algorithm in code to use instead of having to run a neural network. The algorithm would probably be a lot less complex, would not need potential retraining, and would be a lot faster.
Here is a simple example.
Suppose my "training set" are the binary sequences and mappings
01000 => 10000
00001 => 00010
01010 => 10100
00111 => 01110
then from the "Magical Algorithm Finder"(tm) I would get a mapping out like
f(x) = x rol 1 (rol = rotate left)
or whatever way one would want to express it.
Then I could simply apply f(x) to any other element, such as x = 011100 and could apply f to generate a hopefully unambiguous output.
Of course there are many such functions that will work on this example but the goal is to supply enough of the dataset to narrow it down to hopefully a few functions that make the most sense(at the very least will always map the training set correctly).
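One very simple way to picture this narrowing-down is to enumerate a fixed pool of candidate functions and keep those that reproduce every training pair. The sketch below does exactly that; the pool of candidates is hand-picked and hypothetical, so this is only an illustration of the idea, not a general algorithm finder.

```python
def rol(bits, k):
    """Rotate a bit string left by k positions."""
    k %= len(bits)
    return bits[k:] + bits[:k]


def shl(bits, k):
    """Shift a bit string left by k positions, filling with zeros."""
    return bits[k:] + "0" * k


def invert(bits):
    """Flip every bit."""
    return "".join("1" if b == "0" else "0" for b in bits)


CANDIDATES = {
    "rol 1": lambda b: rol(b, 1),
    "rol 2": lambda b: rol(b, 2),
    "ror 1": lambda b: rol(b, -1),
    "shl 1": lambda b: shl(b, 1),
    "reverse": lambda b: b[::-1],
    "invert": invert,
}


def consistent_candidates(pairs, candidates):
    """Return the names of all candidates that map every input to its output."""
    return [name for name, fn in candidates.items()
            if all(fn(x) == y for x, y in pairs)]


pairs = [("01000", "10000"), ("00001", "00010"),
         ("01010", "10100"), ("00111", "01110")]

print(consistent_candidates(pairs, CANDIDATES))
# One extra pair with a leading 1 separates rotation from plain shifting.
print(consistent_candidates(pairs + [("10000", "00001")], CANDIDATES))
```

With the four original pairs both "rol 1" and "shl 1" survive, because no input starts with a 1; adding a single pair that does start with a 1 leaves only "rol 1".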
In my specific case I could easily convert my problem into mapping the set of binary digits of length m to the set of base B digits of length n. The constraints prevent some numbers from having an inverse, i.e., the mapping is injective but not surjective.
My algorithm could be a simple collection of if statements acting on the digits if need be.
I think what you are looking for here is an application of Learning Classifier Systems (LCS). There are actually quite a few open-source LCS applications available, but you may need to experiment with the parameters in order to get a good result.
LCS/XCS/ZCS have the features that you are looking for including individual rules that could be heavily optimized, pressure to reduce the rule-set, and of course a human-readable/understandable set of rules. (Unlike a neural-net)