Shannon's Entropy measure in Decision Trees - encoding

Why is Shannon's Entropy measure used in Decision Tree branching?
Entropy(S) = - p(+)log( p(+) ) - p(-)log( p(-) )
I know it is a measure of the no. of bits needed to encode information; the more uniform the distribution, the more the entropy. But I don't see why it is so frequently applied in creating decision trees (choosing a branch point).

Because you want to ask the question that will give you the most information. The goal is to minimize the number of decisions/questions/branches in the tree, so you start with the question that will give you the most information and then use the following questions to fill in the details.

For the sake of decision trees, forget about the number of bits and just focus on the formula itself. Consider a binary (+/-) classification task where you have an equal number of + and - examples in your training data. Initially, the entropy will be 1 since p(+) = p(-) = 0.5. You want to split the data on an attribute that most decreases the entropy (i.e., makes the distribution of classes least random). If you choose an attribute, A1, that is completely unrelated to the classes, then the entropy will still be 1 after splitting the data by the values of A1, so there is no reduction in entropy. Now suppose another attribute, A2, perfectly separates the classes (e.g., the class is always + for A2="yes" and always - for A2="no". In this case, the entropy is zero, which is the ideal case.
In practical cases, attributes don't typically perfectly categorize the data (the entropy is greater than zero). So you choose the attribute that "best" categorizes the data (provides the greatest reduction in entropy). Once the data are separated in this manner, another attribute is selected for each of the branches from the first split in a similar manner to further reduce the entropy along that branch. This process is continued to construct the tree.

You seem to have an understanding of the math behind the method, but here is a simple example that might give you some intuition behind why this method is used: Imagine you are in a classroom that is occupied by 100 students. Each student is sitting at a desk, and the desks are organized such there are 10 rows and 10 columns. 1 out of the 100 students has a prize that you can have, but you must guess which student it is to get the prize. The catch is that everytime you guess, the prize is decremented in value. You could start by asking each student individually whether or not they have the prize. However, initially, you only have a 1/100 chance of guessing correctly, and it is likely that by the time you find the prize it will be worthless (think of every guess as a branch in your decision tree). Instead, you could ask broad questions that dramatically reduce the search space with each question. For example "Is the student somewhere in rows 1 though 5?" Whether the answer is "Yes" or "No" you have reduced the number of potential branches in your tree by half.

Related

What to do with not enough training data?

I have a problem that I don't have enough training data for my NN. It is trying to predict the result of a soccer game given the last games which I woulf say is a regression task.
The training data are results of soccer games of the last 15 seasons (which are about 4500 games). Getting to new data would be hard and would take a lot of time.
What should I do now?
Is it good to duplicate the data?
Should I input randomized data? (Maybe noise but I'm not quite sure what that is)
If there is no way of creating more data,
I should probably turn up the learning rate right? (I have it sitting at 0.01 and the momentum at 0.9)
I am using mini batches consisting of 32 training datas in training. Since I don't have a lot of training I don't have a lot of mini batches. Should I stop using them?
To start from the beginning: This is a very theoretical question and is not directly related to programming, which I recommend (in future) to post over at the Data Science Stackexchange.
To go into your problem: 4500 samples is not as bad as it sounds, depending on the exact task at hand. Are you trying to predict the match results (i.e. which team is the winner?), are you looking for more specific predictions (across a lot of different, specific teams)?
If you can make sure that you have a reasonable amount of data per class, one can work with a number of samples lower than what you have. Simply duplicating the data will not help you much, since you are very likely to just overfit on the samples you are seeing, without much of an improvement; Or rather, you will get the same results as training over a longer period (since essentially you see every sample twice per epoch, instead of one).
Again, what usually happens after long training periods is overfitting, so nothing gained here.
Your second suggestion is generally called data augmentation. Instead of simply copying samples, you alter them enough to make it look "different" to the network. But be careful! Data augmentation works well for some inputs, like images, since the change in input is significant enough to not represent the same sample, but still contains meaningful information about the class (a horizontally mirrored image of a cat still shows a "valid cat", unlike a vertically mirrored image, which is more unrealistic in the real world).
Essentially, it depends on your input features to determine where it makes sense to add noise. If you are only changing the results of the previous game, a minor change in input (adding/subtracting one goal at random) can significantly change the prediction you make.
If you slightly scramble ELO scores by a random number, on the other hand, the input value will not be too different, "but different enough" to use it as a novel example.
Turning up the learning rate is not a good idea, since you are essentially just letting the network converge more towards the specific samples. On the contrary, I would argue that the current learning rate is still too high, and you should certainly not increase it.
Regarding mini batches, I think I have referenced this a million times now, but always consider smaller minibatches. From a theoretical point of view, you are more likely to converge to a local minimum.

What is the actual meaning implied by information gain in data mining?

Information Gain= (Information before split)-(Information after split)
Information gain can be found by above equation. But what I don't understand is what is exactly the meaning of this information gain? Does it mean that how much more information is gained or reduced by splitting according to the given attribute or something like that???
Link to the answer:
https://stackoverflow.com/a/1859910/740601
Information gain is the reduction in entropy achieved after splitting the data according to an attribute.
IG = Entropy(before split) - Entropy(after split).
See http://en.wikipedia.org/wiki/Information_gain_in_decision_trees
Entropy is a measure of the uncertainty present. By splitting the data, we are trying to reduce the entropy in it and gain information about it.
We want to maximize the information gain by choosing the attribute and split point which reduces the entropy the most.
If entropy = 0, then there is no further information which can be gained from it.
Correctly written it is
Information-gain = entropy-before-split - average entropy-after-split
the difference of entropy vs. information is the sign. Entropy is high, if you do not have much information of the data.
The intuition is that of statistical information theory. The rough idea is: how many bits per record do you need to encode the class label assignment? If you have only one class left, you need 0 bits per record. If you have a chaotic data set, you will need 1 bit for every record. And if the class is unbalanced, you could get away with less than that, using a (theoretical!) optimal compression scheme; e.g. by encoding the exceptions only. To match this intuition, you should be using the base 2 logarithm, of course.
A split is considered good, if the branches have lower entropy on average afterwards. Then you have gained information on the class label by splitting the data set. The IG value is the average number of bits of information you gained for predicting the class label.

In preprocessing data with high cardinality, do you hash first or one-hot-encode first?

Hashing reduces dimensionality while one-hot-encoding essentially blows up the feature space by transforming multi-categorical variables into many binary variables. So it seems like they have opposite effects. My questions are:
What is the benefit of doing both on the same dataset? I read something about capturing interactions but not in detail - can somebody elaborate on this?
Which one comes first and why?
Binary one-hot-encoding is needed for feeding categorical data to linear models and SVMs with the standard kernels.
For example, you might have a feature which is a day of a week. Then you create a one-hot-encoding for each of them.
1000000 Sunday
0100000 Monday
0010000 Tuesday
...
0000001 Saturday
Feature-hashing is mostly used to allow for significant storage compression for parameter vectors: one hashes the high dimensional input vectors into a lower dimensional feature space. Now the parameter vector of a resulting classifier can therefore live in the lower-dimensional space instead of in the original input space. This can be used as a method of dimension reduction thus usually you expect to trade a bit of decreasing of performance with significant storage benefit.
The example in wikipedia is a good one. Suppose your have three documents:
John likes to watch movies.
Mary likes movies too.
John also likes football.
Using a bag-of-words model, you first create below document to words model. (each row is a document, each entry in the matrix indicates whether a word appears in the document).
The problem with this process is that such dictionaries take up a large amount of storage space, and grow in size as the training set grows.
Instead of maintaining a dictionary, a feature vectorizer that uses the hashing trick can build a vector of a pre-defined length by applying a hash function h to the features (e.g., words) in the items under consideration, then using the hash values directly as feature indices and updating the resulting vector at those indices.
Suppose you generate below hashed features with 3 buckets. (you apply k different hash functions to the original features and count how many times the hashed value hit a bucket).
bucket1 bucket2 bucket3
doc1: 3 2 0
doc2: 2 2 0
doc3: 1 0 2
Now you successfully transformed the features in 9-dimensions to 3-dimensions.
A more interesting application of feature hashing is to do personalization. The original paper of feature hashing contains a nice example.
Imagine you want to design a spam filter but customized to each user. The naive way of doing this is to train a separate classifier for each user, which are unfeasible regarding either training (to train and update the personalized model) or serving (to hold all classifiers in memory). A smart way is illustrated below:
Each token is duplicated and one copy is individualized by concatenating each word with a unique user id. (See USER123_NEU and USER123_Votre).
The bag of words model now holds the common keywords and also use-specific keywords.
All words are then hashed into a low dimensioanl feature space where the document is trained and classified.
Now to answer your questions:
Yes. one-hot-encoding should come first since it is transforming a categorical feature to binary feature to make it consumable by linear models.
You can apply both on the same dataset for sure as long as there is benefit to use the compressed feature-space. Note if you can tolerate the original feature dimension, feature-hashing is not required. For example, in a common digit recognition problem, e.g., MINST, the image is represented by 28x28 binary pixels. The input dimension is only 784. For sure feature hashing won't have any benefit in this case.

Newbie to Neural Networks

Just starting to play around with Neural Networks for fun after playing with some basic linear regression. I am an English teacher so don't have a math background and trying to read a book on this stuff is way over my head. I thought this would be a better avenue to get some basic questions answered (even though I suspect there is no easy answer). Just looking for some general guidance put in layman's terms. I am using a trial version of an Excel Add-In called NEURO XL. I apologize if these questions are too "elementary."
My first project is related to predicting a student's Verbal score on the SAT based on a number of test scores, GPA, practice exam scores, etc. as well as some qualitative data (gender: M=1, F=0; took SAT prep class: Y=1, N=0; plays varsity sports: Y=1, N=0).
In total, I have 21 variables that I would like to feed into the network, with the output being the actual score (200-800).
I have 9000 records of data spanning many years/students. Here are my questions:
How many records of the 9000 should I use to train the network?
1a. Should I completely randomize the selection of this training data or be more involved and make sure I include a variety of output scores and a wide range of each of the input variables?
If I split the data into an even number, say 9x1000 (or however many) and created a network for each one, then tested the results of each of these 9 on the other 8 sets to see which had the lowest MSE across the samples, would this be a valid way to "choose" the best network if I wanted to predict the scores for my incoming students (not included in this data at all)?
Since the scores on the tests that I am using as inputs vary in scale (some are on 1-100, and others 1-20 for example), should I normalize all of the inputs to their respective z-scores? When is this recommended vs not recommended?
I am predicting the actual score, but in reality, I'm NOT that concerned about the exact score but more of a range. Would my network be more accurate if I grouped the output scores into buckets and then tried to predict this number instead of the actual score?
E.g.
750-800 = 10
700-740 = 9
etc.
Is there any benefit to doing this or should I just go ahead and try to predict the exact score?
What if ALL I cared about was whether or not the score was above or below 600. Would I then just make the output 0(below 600) or 1(above 600)?
5a. I read somewhere that it's not good to use 0 and 1, but instead 0.1 and 0.9 - why is that?
5b. What about -1(below 600), 0(exactly 600), 1(above 600), would this work?
5c. Would the network always output -1, 0, 1 - or would it output fractions that I would then have to roundup or rounddown to finalize the prediction?
Once I have found the "best" network from Question #3, would I then play around with the different parameters (number of epochs, number of neurons in hidden layer, momentum, learning rate, etc.) to optimize this further?
6a. What about the Activation Function? Will Log-sigmoid do the trick or should I try the other options my software has as well (threshold, hyperbolic tangent, zero-based log-sigmoid).
6b. What is the difference between log-sigmoid and zero-based log-sigmoid?
Thanks!
First a little bit of meta content about the question itself (and not about the answers to your questions).
I have to laugh a little that you say 'I apologize if these questions are too "elementary."' and then proceed to ask the single most thorough and well thought out question I've seen as someone's first post on SO.
I wouldn't be too worried that you'll have people looking down their noses at you for asking this stuff.
This is a pretty big question in terms of the depth and range of knowledge required, especially the statistical knowledge needed and familiarity with Neural Networks.
You may want to try breaking this up into several questions distributed across the different StackExchange sites.
Off the top of my head, some of it definitely belongs on the statistics StackExchange, Cross Validated: https://stats.stackexchange.com/
You might also want to try out https://datascience.stackexchange.com/ , a beta site specifically targeting machine learning and related areas.
That said, there is some of this that I think I can help to answer.
Anything I haven't answered is something I don't feel qualified to help you with.
Question 1
How many records of the 9000 should I use to train the network? 1a. Should I completely randomize the selection of this training data or be more involved and make sure I include a variety of output scores and a wide range of each of the input variables?
Randomizing the selection of training data is probably not a good idea.
Keep in mind that truly random data includes clusters.
A random selection of students could happen to consist solely of those who scored above a 30 on the ACT exams, which could potentially result in a bias in your result.
Likewise, if you only select students whose SAT scores were below 700, the classifier you build won't have any capacity to distinguish between a student expected to score 720 and a student expected to score 780 -- they'll look the same to the classifier because it was trained without the relevant information.
You want to ensure a representative sample of your different inputs and your different outputs.
Because you're dealing with input variables that may be correlated, you shouldn't try to do anything too complex in selecting this data, or you could mistakenly introduce another bias in your inputs.
Namely, you don't want to select a training data set that consists largely of outliers.
I would recommend trying to ensure that your inputs cover all possible values for all of the variables you are observing, and all possible results for the output (the SAT scores), without constraining how these requirements are satisfied.
I'm sure there are algorithms out there designed to do exactly this, but I don't know them myself -- possibly a good question in and of itself for Cross Validated.
Question 3
Since the scores on the tests that I am using as inputs vary in scale (some are on 1-100, and others 1-20 for example), should I normalize all of the inputs to their respective z-scores? When is this recommended vs not recommended?
My understanding is that this is not recommended as the input to a Nerual Network, but I may be wrong.
The convergence of the network should handle this for you.
Every node in the network will assign a weight to its inputs, multiply them by their weights, and sum those products as a core part of its computation.
That means that every node in the network is searching for some coefficients for each of their inputs.
To do this, all inputs will be converted to numeric values -- so conditions like gender will be translated into "0=MALE,1=FEMALE" or something similar.
For example, a node's metric might look like this at a given point in time:
2*ACT_SCORE + 0*GENDER + (-5)*VARISTY_SPORTS ...
The coefficients for each values are exactly what the network is searching for as it converges.
If you change the scale of a value, like ACT_SCORE, you just change the scale of the coefficient that will be found by the reciporical of that scaling factor.
The result should still be the same.
There are other concerns in terms of accuracy (computers have limited capacity to represent small fractions) and speed that may enter this, but not being familiar with NEURO XL, I can't say whether or not they apply for this technology.
Question 4
I am predicting the actual score, but in reality, I'm NOT that concerned about the exact score but more of a range. Would my network be more accurate if I grouped the output scores into buckets and then tried to predict this number instead of the actual score?
This will reduce accuracy, although you should converge to a solution much faster with fewer possible outputs (scores).
Neural Networks actually describe very high-dimensional functions in their input variables.
If you reduce the granularity of that function's output space, you essentially state that you don't care about local minima and maxima in that function, especially around the borders between your output scores.
As a result, you are sacrificing information that may be an essential component of the "true" function that you are searching for.
I hope this has been helpful, but you really should break this question down into its many components and ask them separately on different sites -- potentially some of them do belong here on StackOverflow as well.

Algorithm generation

I have a rather large(not too large but possibly 50+) set of conditions that must be placed on a set of data(or rather the data should be manipulated to fit the conditions).
For example, Suppose I have the a sequence of binary numbers of length n,
if n = 5 then a element in the data might be {0,1,1,0,0} or {0,0,0,1,1}, etc...
BUT there might be a set of conditions such as
x_3 + x_4 = 2
sum(x_even) <= 2
x_2*x_3 = x_4 mod 2
etc...
Because the conditions are quite complex in that they come from experiment(although they can be written down in logic form) and are hard to diagnose I would like instead to use a large sample set of valid data. i.e., Data I know satisfies the conditions and is a pretty large set. i.e., it is easier to collect the data then it is to deduce the conditions that the data must abide by.
Having said that, basically what I'm doing is very similar to neural networks. The difference is, I would like an actual algorithm, in some sense optimal, in some form of code that I can run instead of the network.
It might not be clear what I'm actually trying to do. What I have is a set of data in some raw format that is unique and unambiguous but not appropriate for my needs(in a sense the amount of data is too large).
I need to map the data into another set that actually is ambiguous to some degree but also has certain specific set of constraints that all the data follows(certain things just cannot happen while others are preferred).
The unique constraints and preferences are hard to figure out. That is, the mapping from the non-ambiguous set to the ambiguous set is hard to describe(which is why it is ambiguous). The goal, actually, is to have an unambiguous map by supplying the right constraints if at all possible.
So, on the vein of my initial example, I'm given(or supply) a set of elements and need some way to derive a list of constraints similar to what I've listed.
In a sense, I simply have a set of valid data and train it very similar to neural networks.
Then, after this "Training" I'm given the mapping function I can then use on any element in my dataset and it will produce a new element satisfying the constraint's, or if it can't, will give as close as possible an unambiguous result.
The main difference between neural networks and what I'm trying to achieve is I'd like to be able to use have an algorithm to code to be used instead of having to run a neural network. The difference here is the algorithm would probably be a lot less complex, not need potential retraining, and a lot faster.
Here is a simple example.
Suppose my "training set" are the binary sequences and mappings
01000 => 10000
00001 => 00010
01010 => 10100
00111 => 01110
then from the "Magical Algorithm Finder"(tm) I would get a mapping out like
f(x) = x rol 1 (rol = rotate left)
or whatever way one would want to express it.
Then I could simply apply f(x) to any other element, such as x = 011100 and could apply f to generate a hopefully unambiguous output.
Of course there are many such functions that will work on this example but the goal is to supply enough of the dataset to narrow it down to hopefully a few functions that make the most sense(at the very least will always map the training set correctly).
In my specific case I could easily convert my problem into mapping the set of binary digits of length m to the set of base B digits of length n. The constraints prevents some numbers from having an inverse. e.g., the mapping is injective but not surjective.
My algorithm could be a simple collection if statements acting on the digits if need be.
I think what you are looking for here is an application of Learning Classifier Systems, LCS -wiki. There are actually quite a few LCS open-source applications available, but you may need to experiment with the parameters in order to get a good result.
LCS/XCS/ZCS have the features that you are looking for including individual rules that could be heavily optimized, pressure to reduce the rule-set, and of course a human-readable/understandable set of rules. (Unlike a neural-net)