How to train a neural network to detect presence of a pattern?

How to train a neural network to detect presence of a pattern? - neural-network

The question phrasing is vague - and I'm happy to change it based on feedback. But, I am trying to train a neural network to detect fraudulent transactions on a website. I have a lot of parameters as inputs (time of day, country of origin, number of visits in the past month, number of visits from unique IP's in the past month, number of transactions, average transaction size, etc, etc). Altogether, perhaps over 100 inputs. The inputs have been normalized and sanitized and they form a time series. Historically, I can look at my data and identify that a transaction was fraudulent of Type A or of Type B or not fraudulent. My training set can be large (in the thousands or tens of thousands of points).
Ultimately, I want an indicator: Fraud of Type A, Fraud of Type B or Not Fraud. Generally, fraudulent transactions tend to fit a pattern. I can't exactly identify the pattern (that's why I'm using a NN). However, not fraudulent transactions can be of any type of pattern. So it seems strange to identify things into 3 buckets when the third bucket is "other".
If this were a switch / case, it would be something like:
switch transactionData
when transactionData.transaction_count < 0.2 && ....
FRAUD_A
when transactionData.transaction_count > 0.5 && ....
FRAUD_B
else
NOT_FRAUD
Obviously, these are simplified cases, but my problem runs into how to properly train for the else case. Do I get three types of data (fraud_a, fraud_b and not_fraud) and train them? Or is there another way to train for other?

It is usually perfectly ok to have OTHER (NOT FRAUD) class along with these you are interested in. But I understand your concern. Basically, its job of NN to learn "case/switch" and in most cases it will learn right one, assuming that most samples belong to NOT FRAUD class. In some pathological cases classifiers can learn different idea e.g. everything is FRAUD A class, unless proven otherwise. You can't usually control it directly, but it can be changed by creating better features and some other tricks. For now, proceed with what you have and see what happens.
One thing you can do is to train two classifiers, one (FRAUD/NOT FRAUD) and then if fraud is detected feed data into second two-class classifier (FRAUD A/FRAUD B). Sometimes (but not always) this works better.

Related

What to do with not enough training data?

I have a problem that I don't have enough training data for my NN. It is trying to predict the result of a soccer game given the last games which I woulf say is a regression task.
The training data are results of soccer games of the last 15 seasons (which are about 4500 games). Getting to new data would be hard and would take a lot of time.
What should I do now?
Is it good to duplicate the data?
Should I input randomized data? (Maybe noise but I'm not quite sure what that is)
If there is no way of creating more data,
I should probably turn up the learning rate right? (I have it sitting at 0.01 and the momentum at 0.9)
I am using mini batches consisting of 32 training datas in training. Since I don't have a lot of training I don't have a lot of mini batches. Should I stop using them?

To start from the beginning: This is a very theoretical question and is not directly related to programming, which I recommend (in future) to post over at the Data Science Stackexchange.
To go into your problem: 4500 samples is not as bad as it sounds, depending on the exact task at hand. Are you trying to predict the match results (i.e. which team is the winner?), are you looking for more specific predictions (across a lot of different, specific teams)?
If you can make sure that you have a reasonable amount of data per class, one can work with a number of samples lower than what you have. Simply duplicating the data will not help you much, since you are very likely to just overfit on the samples you are seeing, without much of an improvement; Or rather, you will get the same results as training over a longer period (since essentially you see every sample twice per epoch, instead of one).
Again, what usually happens after long training periods is overfitting, so nothing gained here.
Your second suggestion is generally called data augmentation. Instead of simply copying samples, you alter them enough to make it look "different" to the network. But be careful! Data augmentation works well for some inputs, like images, since the change in input is significant enough to not represent the same sample, but still contains meaningful information about the class (a horizontally mirrored image of a cat still shows a "valid cat", unlike a vertically mirrored image, which is more unrealistic in the real world).
Essentially, it depends on your input features to determine where it makes sense to add noise. If you are only changing the results of the previous game, a minor change in input (adding/subtracting one goal at random) can significantly change the prediction you make.
If you slightly scramble ELO scores by a random number, on the other hand, the input value will not be too different, "but different enough" to use it as a novel example.
Turning up the learning rate is not a good idea, since you are essentially just letting the network converge more towards the specific samples. On the contrary, I would argue that the current learning rate is still too high, and you should certainly not increase it.
Regarding mini batches, I think I have referenced this a million times now, but always consider smaller minibatches. From a theoretical point of view, you are more likely to converge to a local minimum.

Neural Network Underfitting with Dogs and Cats

Without necessarily getting into the code of it, but focusing more on the principles, I have a question about what I assume would be underfitting.
If I am training a network that recognizes true or false as to whether an image is of a dog, and I have maybe 40,000 images, where all dog images are labeled as 1, and all other images are labeled as 0 - what can I do to assure accuracy so that, if only maybe 5,000 of those images are dogs, the network does not act “lazily” from its training, and also label dogs as closer to 0 than 1?
For example, the main purpose of this question is to be able to recognize with high accuracy if an image really is of a dog, without really caring too much about the other images, other than the fact that they are not of dogs. Also, I would like to be able to retain the probability that the guess is correct, because this is highly important for my purposes.
The only two things I was able to come up with were to:
Have more nodes in the network, or
Have half of the images be of dogs (so use 10,000 images where 5,000 of them are dogs).
But I think this 2nd option might give dogs a disproportionately large chance of being the output of the testing data, which would destroy the accuracy and the whole purpose of this network.
I am sure this has been addressed before, so even a point in the right direction would be highly appreciated!

So you have a binary classification task where both classes appear with different frequency in your dataset. About 1/8 is "dog" and 7/8 is "no dog".
In order to avoid biased learning towards one or the other class, it is important that you stratify your training, validation and test data so that these fractions are kept across every subset.
You say that you want to "retain the probability" that the guess is correct - I assume you mean you want to evaluate the "dogness"-probability as output variable. That's a simple softmax output layer with two outputs: 1st is "dog", 2nd "not dog". It's the typical way to address classification problems, regardless of the number of classes you need to distinguish.

How best to deal with "None of the above" in Image Classification?

This seems to be a fundamental question which some of you out there must have an opinion on. I have an image classifier implemented in CNTK with 48 classes. If the image does not match any of the 48 classes very well, then I'd like to be able to conclude that it was not among these 48 image types. My original idea was simply that if the highest output of the final Softmax layer was low, I would be able to conclude that the test image matched none well. While I occasionally see this occur, in most testing, Softmax still produces a very high (and mistaken) result when handed an 'unknown image type'. But maybe my network is 'over fit' and if it wasn't, my original idea would work fine. What do you think? Any way to define a 49-th class called 'none-of-the-above'?

You really have these two options indeed--thresholding the posterior probabilities (softmax values), and adding a garbage class.
In my area (speech), both approaches are their place:
If "none of the above" inputs are of the same nature as the "above" (e.g. non-grammatical inputs), thresholding works fine. Note that the posterior probability for a class is equal to one minus an estimate of the error rate for choosing this class. Rejecting anything with posterior < 50% would be rejecting all cases where you are more likely wrong than right. As long as your none-of-the-above classes are of similar nature, the estimate may be accurate enough to make this correct for them as well.
If "none of the above" inputs are of similar nature but your number of classes is very small (e.g. 10 digits), or if the inputs are of a totally different nature (e.g. a sound of a door slam or someone coughing), thresholding typically fails. Then, one would train a "garbage model." In our experience, it is OK to include the training data for the correct classes. Now the none-of-the-above class may match a correct class as well. But that's OK as long as the none-of-the-above class is not overtrained--its distribution will be much flatter, and thus even if it matches a known class, it will match it with a lower score and thus not win against the actual known class' softmax output.
In the end, I would use both. Definitely use a threshold (to catch the cases that the system can rule out) and use a garbage model, which I would just train it on whatever you have. I would expect that including the correct examples in training will not harm, even if it is the only data you have (please check the paper Anton posted for whether that applies to image as well). It may also make sense to try to synthesize data, e.g. by randomly combining patches from different images.

I agree with you that this is a key question, but I am not aware of much work in that area either.
There's one recent paper by Zhang and LeCun, that addresses the question for image classification in particular. They use large quantities of unlabelled data to create an additional "none of the above" class. The catch though is that, in some cases, their unlabelled data is not completely unlabelled, and they have means of removing "unlabelled" images that are actually in one of their labelled classes. Having said that, the authors report that apart from solving the "none of the above" problem, they even see performance gains even on their test sets.
As for fitting something post-hoc, just by looking at the outputs of the softmax, I can't provide any pointers.

Newbie to Neural Networks

Just starting to play around with Neural Networks for fun after playing with some basic linear regression. I am an English teacher so don't have a math background and trying to read a book on this stuff is way over my head. I thought this would be a better avenue to get some basic questions answered (even though I suspect there is no easy answer). Just looking for some general guidance put in layman's terms. I am using a trial version of an Excel Add-In called NEURO XL. I apologize if these questions are too "elementary."
My first project is related to predicting a student's Verbal score on the SAT based on a number of test scores, GPA, practice exam scores, etc. as well as some qualitative data (gender: M=1, F=0; took SAT prep class: Y=1, N=0; plays varsity sports: Y=1, N=0).
In total, I have 21 variables that I would like to feed into the network, with the output being the actual score (200-800).
I have 9000 records of data spanning many years/students. Here are my questions:
How many records of the 9000 should I use to train the network?
1a. Should I completely randomize the selection of this training data or be more involved and make sure I include a variety of output scores and a wide range of each of the input variables?
If I split the data into an even number, say 9x1000 (or however many) and created a network for each one, then tested the results of each of these 9 on the other 8 sets to see which had the lowest MSE across the samples, would this be a valid way to "choose" the best network if I wanted to predict the scores for my incoming students (not included in this data at all)?
Since the scores on the tests that I am using as inputs vary in scale (some are on 1-100, and others 1-20 for example), should I normalize all of the inputs to their respective z-scores? When is this recommended vs not recommended?
I am predicting the actual score, but in reality, I'm NOT that concerned about the exact score but more of a range. Would my network be more accurate if I grouped the output scores into buckets and then tried to predict this number instead of the actual score?
E.g.
750-800 = 10
700-740 = 9
etc.
Is there any benefit to doing this or should I just go ahead and try to predict the exact score?
What if ALL I cared about was whether or not the score was above or below 600. Would I then just make the output 0(below 600) or 1(above 600)?
5a. I read somewhere that it's not good to use 0 and 1, but instead 0.1 and 0.9 - why is that?
5b. What about -1(below 600), 0(exactly 600), 1(above 600), would this work?
5c. Would the network always output -1, 0, 1 - or would it output fractions that I would then have to roundup or rounddown to finalize the prediction?
Once I have found the "best" network from Question #3, would I then play around with the different parameters (number of epochs, number of neurons in hidden layer, momentum, learning rate, etc.) to optimize this further?
6a. What about the Activation Function? Will Log-sigmoid do the trick or should I try the other options my software has as well (threshold, hyperbolic tangent, zero-based log-sigmoid).
6b. What is the difference between log-sigmoid and zero-based log-sigmoid?
Thanks!

First a little bit of meta content about the question itself (and not about the answers to your questions).
I have to laugh a little that you say 'I apologize if these questions are too "elementary."' and then proceed to ask the single most thorough and well thought out question I've seen as someone's first post on SO.
I wouldn't be too worried that you'll have people looking down their noses at you for asking this stuff.
This is a pretty big question in terms of the depth and range of knowledge required, especially the statistical knowledge needed and familiarity with Neural Networks.
You may want to try breaking this up into several questions distributed across the different StackExchange sites.
Off the top of my head, some of it definitely belongs on the statistics StackExchange, Cross Validated: https://stats.stackexchange.com/
You might also want to try out https://datascience.stackexchange.com/ , a beta site specifically targeting machine learning and related areas.
That said, there is some of this that I think I can help to answer.
Anything I haven't answered is something I don't feel qualified to help you with.
Question 1
How many records of the 9000 should I use to train the network? 1a. Should I completely randomize the selection of this training data or be more involved and make sure I include a variety of output scores and a wide range of each of the input variables?
Randomizing the selection of training data is probably not a good idea.
Keep in mind that truly random data includes clusters.
A random selection of students could happen to consist solely of those who scored above a 30 on the ACT exams, which could potentially result in a bias in your result.
Likewise, if you only select students whose SAT scores were below 700, the classifier you build won't have any capacity to distinguish between a student expected to score 720 and a student expected to score 780 -- they'll look the same to the classifier because it was trained without the relevant information.
You want to ensure a representative sample of your different inputs and your different outputs.
Because you're dealing with input variables that may be correlated, you shouldn't try to do anything too complex in selecting this data, or you could mistakenly introduce another bias in your inputs.
Namely, you don't want to select a training data set that consists largely of outliers.
I would recommend trying to ensure that your inputs cover all possible values for all of the variables you are observing, and all possible results for the output (the SAT scores), without constraining how these requirements are satisfied.
I'm sure there are algorithms out there designed to do exactly this, but I don't know them myself -- possibly a good question in and of itself for Cross Validated.
Question 3
Since the scores on the tests that I am using as inputs vary in scale (some are on 1-100, and others 1-20 for example), should I normalize all of the inputs to their respective z-scores? When is this recommended vs not recommended?
My understanding is that this is not recommended as the input to a Nerual Network, but I may be wrong.
The convergence of the network should handle this for you.
Every node in the network will assign a weight to its inputs, multiply them by their weights, and sum those products as a core part of its computation.
That means that every node in the network is searching for some coefficients for each of their inputs.
To do this, all inputs will be converted to numeric values -- so conditions like gender will be translated into "0=MALE,1=FEMALE" or something similar.
For example, a node's metric might look like this at a given point in time:
2*ACT_SCORE + 0*GENDER + (-5)*VARISTY_SPORTS ...
The coefficients for each values are exactly what the network is searching for as it converges.
If you change the scale of a value, like ACT_SCORE, you just change the scale of the coefficient that will be found by the reciporical of that scaling factor.
The result should still be the same.
There are other concerns in terms of accuracy (computers have limited capacity to represent small fractions) and speed that may enter this, but not being familiar with NEURO XL, I can't say whether or not they apply for this technology.
Question 4
I am predicting the actual score, but in reality, I'm NOT that concerned about the exact score but more of a range. Would my network be more accurate if I grouped the output scores into buckets and then tried to predict this number instead of the actual score?
This will reduce accuracy, although you should converge to a solution much faster with fewer possible outputs (scores).
Neural Networks actually describe very high-dimensional functions in their input variables.
If you reduce the granularity of that function's output space, you essentially state that you don't care about local minima and maxima in that function, especially around the borders between your output scores.
As a result, you are sacrificing information that may be an essential component of the "true" function that you are searching for.
I hope this has been helpful, but you really should break this question down into its many components and ask them separately on different sites -- potentially some of them do belong here on StackOverflow as well.

Shannon's Entropy measure in Decision Trees

Why is Shannon's Entropy measure used in Decision Tree branching?
Entropy(S) = - p(+)log( p(+) ) - p(-)log( p(-) )
I know it is a measure of the no. of bits needed to encode information; the more uniform the distribution, the more the entropy. But I don't see why it is so frequently applied in creating decision trees (choosing a branch point).

Because you want to ask the question that will give you the most information. The goal is to minimize the number of decisions/questions/branches in the tree, so you start with the question that will give you the most information and then use the following questions to fill in the details.

For the sake of decision trees, forget about the number of bits and just focus on the formula itself. Consider a binary (+/-) classification task where you have an equal number of + and - examples in your training data. Initially, the entropy will be 1 since p(+) = p(-) = 0.5. You want to split the data on an attribute that most decreases the entropy (i.e., makes the distribution of classes least random). If you choose an attribute, A1, that is completely unrelated to the classes, then the entropy will still be 1 after splitting the data by the values of A1, so there is no reduction in entropy. Now suppose another attribute, A2, perfectly separates the classes (e.g., the class is always + for A2="yes" and always - for A2="no". In this case, the entropy is zero, which is the ideal case.
In practical cases, attributes don't typically perfectly categorize the data (the entropy is greater than zero). So you choose the attribute that "best" categorizes the data (provides the greatest reduction in entropy). Once the data are separated in this manner, another attribute is selected for each of the branches from the first split in a similar manner to further reduce the entropy along that branch. This process is continued to construct the tree.

You seem to have an understanding of the math behind the method, but here is a simple example that might give you some intuition behind why this method is used: Imagine you are in a classroom that is occupied by 100 students. Each student is sitting at a desk, and the desks are organized such there are 10 rows and 10 columns. 1 out of the 100 students has a prize that you can have, but you must guess which student it is to get the prize. The catch is that everytime you guess, the prize is decremented in value. You could start by asking each student individually whether or not they have the prize. However, initially, you only have a 1/100 chance of guessing correctly, and it is likely that by the time you find the prize it will be worthless (think of every guess as a branch in your decision tree). Instead, you could ask broad questions that dramatically reduce the search space with each question. For example "Is the student somewhere in rows 1 though 5?" Whether the answer is "Yes" or "No" you have reduced the number of potential branches in your tree by half.