What to do with not enough training data?

What to do with not enough training data? - neural-network

I have a problem that I don't have enough training data for my NN. It is trying to predict the result of a soccer game given the last games which I woulf say is a regression task.
The training data are results of soccer games of the last 15 seasons (which are about 4500 games). Getting to new data would be hard and would take a lot of time.
What should I do now?
Is it good to duplicate the data?
Should I input randomized data? (Maybe noise but I'm not quite sure what that is)
If there is no way of creating more data,
I should probably turn up the learning rate right? (I have it sitting at 0.01 and the momentum at 0.9)
I am using mini batches consisting of 32 training datas in training. Since I don't have a lot of training I don't have a lot of mini batches. Should I stop using them?

To start from the beginning: This is a very theoretical question and is not directly related to programming, which I recommend (in future) to post over at the Data Science Stackexchange.
To go into your problem: 4500 samples is not as bad as it sounds, depending on the exact task at hand. Are you trying to predict the match results (i.e. which team is the winner?), are you looking for more specific predictions (across a lot of different, specific teams)?
If you can make sure that you have a reasonable amount of data per class, one can work with a number of samples lower than what you have. Simply duplicating the data will not help you much, since you are very likely to just overfit on the samples you are seeing, without much of an improvement; Or rather, you will get the same results as training over a longer period (since essentially you see every sample twice per epoch, instead of one).
Again, what usually happens after long training periods is overfitting, so nothing gained here.
Your second suggestion is generally called data augmentation. Instead of simply copying samples, you alter them enough to make it look "different" to the network. But be careful! Data augmentation works well for some inputs, like images, since the change in input is significant enough to not represent the same sample, but still contains meaningful information about the class (a horizontally mirrored image of a cat still shows a "valid cat", unlike a vertically mirrored image, which is more unrealistic in the real world).
Essentially, it depends on your input features to determine where it makes sense to add noise. If you are only changing the results of the previous game, a minor change in input (adding/subtracting one goal at random) can significantly change the prediction you make.
If you slightly scramble ELO scores by a random number, on the other hand, the input value will not be too different, "but different enough" to use it as a novel example.
Turning up the learning rate is not a good idea, since you are essentially just letting the network converge more towards the specific samples. On the contrary, I would argue that the current learning rate is still too high, and you should certainly not increase it.
Regarding mini batches, I think I have referenced this a million times now, but always consider smaller minibatches. From a theoretical point of view, you are more likely to converge to a local minimum.

Related

Topics and LL/token in Mallet change every time

Why do I get different keywords and LL/token every time I run topic models in Mallet? Is it normal?
Please help. Thank you.

Yes, this is normal and expected. Mallet implements a randomized algorithm. Finding the exact optimal best topic model for a collection is computationally intractable, but it's much easier to find one of countless "pretty good" solutions.
As an intuition, imagine shaking a box of sand. The smaller particles will sift towards one side, and the larger particles towards the other. That's way easier than trying to sort them by hand. You won't get the exact order, but each time you'll get one of a large number of equally good approximate sortings.
If you want to have a stronger guarantee of local optimality, add --num-icm-iterations 100 to switch from sampling to choosing the single best allocation for each token, given all the others.

Average result of 50 Netlogo Simulation_Agent Based Simulation

I run an infectious disease spread model similar to "VIRUS" model in the model library changing the "infectiousness".
I did 20 runs each for infectiousness values 98% , 95% , 93% and the Maximum infected count was 74.05 , 73 ,78.9 respectively. (peak was at tick 38 for all 3 infectiousness values)
[I took the average of the infected count for each tick and took the maximum of these averages as the "maximum infected".]
I was expecting the maximum infected count to decrease when the infectiousness is reduced, but it didn't. As per what I understood this happens, because I considered the average values of each simulation run. (It is like I am considering a new simulation run with average infected count for each tick ).
I want to say that, I am considering all 20 simulation runs. Is there a way to do that other than the way I used the average?

In the Models Library Virus model with default parameter settings at other values, and those high infectiousness values, what I see when I run the model is a periodic variation in the numbers three classes of person. Look at the plot in the lower left corner, and you'll see this. What is happening, I believe, is this:
When there are many healthy, non-immune people, that means that there are many people who can get infected, so the number of infected people goes up, and the number of healthy people goes down.
Soon after that, the number of sick, infectious people goes down, because they either die or become immune.
Since there are now more immune people, and fewer infectious people, the number of non-immune healthy grows; they are reproducing. (See "How it works" in the Info tab.) But now we have returned to the situation in step 1, ... so the cycle continues.
If your model is sufficiently similar to the Models Library Virus model, I'd bet that this is part of what's happening. If you don't have a plot window like the Virus model, I recommend adding it.
Also, you didn't say how many ticks you are running the model for. If you run it for a short number of ticks, you won't notice the periodic behavior, but that doesn't mean it hasn't begun.
What this all means that increasing infectiousness wouldn't necessarily increase the maximum number infected: a faster rate of infection means that the number of individuals who can infected drops faster. I'm not sure that the maximum number infected over the whole run is an interesting number, with this model and a high infectiousness value. It depends what you are trying to understand.
One of the great things about NetLogo and some other ABM systems is that you can watch the system evolve over time, using various tools such as plots, monitors, etc. as well as just looking at the agents move around or change states over time. This can help you understand what is going on in a way that a single number like an average won't. Then you can use this insight to figure out a more informative way of measuring what is happening.
Another model where you can see a similar kind of periodic pattern is Wolf-Sheep Predation. I recommend looking at that. It may be easier to understand the pattern. (If you are interested in mathematical models of this kind of phenomenon, look up Lotka-Volterra models.)
(Real virus transmission can be more complicated, because a person (or other animal) is a kind of big "island" where viruses can reproduce quickly. If they reproduce too quickly, this can kill the host, and prevent further transmission of the virus. Sometimes a virus that reproduces more slowly can harm more people, because there is time for them to infect others. This blog post by Elliott Sober gives a relatively simple mathematical introduction to some of the issues involved, but his simple mathematical models don't take into account all of the complications involved in real virus transmission.)
EDIT: You added a comment Lawan, saying that you are interested in modeling COVID-19 transmission. This paper, Variation and multilevel selection of SARS‐CoV‐2 by Blackstone, Blackstone, and Berg, suggests that some of the dynamics that I mentioned in the preceding remarks might be characteristic of COVID-19 transmission. That paper is about six months old now, and it offered some speculations based on limited information. There's probably more known now, but this might suggest avenues for further investigation.
If you're interested, you might also consider asking general questions about virus transmission on the Biology Stackexchange site.

Neural Network Underfitting with Dogs and Cats

Without necessarily getting into the code of it, but focusing more on the principles, I have a question about what I assume would be underfitting.
If I am training a network that recognizes true or false as to whether an image is of a dog, and I have maybe 40,000 images, where all dog images are labeled as 1, and all other images are labeled as 0 - what can I do to assure accuracy so that, if only maybe 5,000 of those images are dogs, the network does not act “lazily” from its training, and also label dogs as closer to 0 than 1?
For example, the main purpose of this question is to be able to recognize with high accuracy if an image really is of a dog, without really caring too much about the other images, other than the fact that they are not of dogs. Also, I would like to be able to retain the probability that the guess is correct, because this is highly important for my purposes.
The only two things I was able to come up with were to:
Have more nodes in the network, or
Have half of the images be of dogs (so use 10,000 images where 5,000 of them are dogs).
But I think this 2nd option might give dogs a disproportionately large chance of being the output of the testing data, which would destroy the accuracy and the whole purpose of this network.
I am sure this has been addressed before, so even a point in the right direction would be highly appreciated!

So you have a binary classification task where both classes appear with different frequency in your dataset. About 1/8 is "dog" and 7/8 is "no dog".
In order to avoid biased learning towards one or the other class, it is important that you stratify your training, validation and test data so that these fractions are kept across every subset.
You say that you want to "retain the probability" that the guess is correct - I assume you mean you want to evaluate the "dogness"-probability as output variable. That's a simple softmax output layer with two outputs: 1st is "dog", 2nd "not dog". It's the typical way to address classification problems, regardless of the number of classes you need to distinguish.

How to train an ANN to play a card game?

I would like to teach an ANN to play Hearts, but I am stuck on how to actually perform the training.
A friend suggested to use weka for the implementation of the actual ANN, but I've never used it, so I'm leaning towards a custom implementation.
I have programmed the rules and I can let the computer play a game, choosing random but legal cards each turn.
Now I am at a loss of what to send to the ANN as input and how to extract output (decreasing amount of cards each turn, so I can't let each output neuron be a possible card) and how to teach it and when to perform teaching.
My guess is to give the ANN as input:
The cards that have been played previously, with metadata of which player has played which card
The cards on the table for this turn, also with the same metadata
The cards in the ANN's hand
And then have the output be 13 neurons (the maximal amount of cards per player), of which I take the most activated of the cards that still are in the ANN's hand.
I also don't really know when to teach it (after each turn or after each game), as it is beneficial to have all the penalty cards, but bad to have all but one penalty card.
Any and all help is appreciated. I don't really know where else to put this question.
I currently have it programmed in Swift, but it's only 200 lines and I know a few other languages, so I can translate it.

Note that neural networks might not be the best thing to use here. More on that at the end of the answer, I'll answer your questions first.
Now I am at a loss of what to send to the ANN as input and how to extract output (decreasing amount of cards each turn, so I can't let each output neuron be a possible card) and how to teach it and when to perform teaching.
ANNs require labeled input data. This means a pair (X, y) where X can be whatever (structured) data related to your problem and y is the list of correct answers you expect the ANN to learn for X.
For example, think about how you would learn math in school. The teacher will do a couple of exercises on the blackboard, and you will write those down. This is your training data.
Then, the teacher will invite you to the blackboard to do one on your own. You might not do so well at first, but he/she will guide you in the right direction. This is the training part.
Then, you'll have to do problems on your own, hopefully having learnt how.
The thing is, even this trivial example is much too complex for an ANN. An ANN usually takes in real-valued numbers and outputs one or more real-valued numbers. So it's actually much dumber than a grade schooler who learns about ax + b = 0 type equations.
For your particular problem, it can be hard to see how it fits in this format. As a whole, it doesn't: you can't present the ANN with a game and have it learn the moves, that is much too complex. You need to present it with something for which you have a correct numerical label associated with and you want the ANN to learn the underlying pattern.
To do this, you should break your problem up into subproblems. For example, input the current player's cards and expect as output the correct move.
The cards that have been played previously, with metadata of which player has played which card
The ANN should only care about the current player. I would not use metadata or any other information that identifies the players.
Giving it a history could get complicated. You might want recurrent neural networks for that.
The cards on the table for this turn, also with the same metadata
Yes, but again, I wouldn't use metadata.
The cards in the ANN's hand
Also good.
Make sure you have as many input units as the MAXIMUM number of cards you want to input (2 x total possible cards, for the cards in hand and those on the table). This will be a binary vector where the ith position is true if the card corresponding to that position exists in hand / on the table.
Then do the same for moves: you will have m binary output units, where the ith will be true if the ANN thinks you should do move i, where there are m possible moves in total (pick the max if m depends on stages in the game).
Your training data will also have to be in this format. For simplicity, let's say there can be at most 2 cards in hand and 2 on the table, out of a total of 5 cards, and we can choose from 2 moves (say fold and all in). Then a possible training instance is:
Xi = 1 0 0 1 0 0 0 0 1 1 (meaning cards 1 and 4 in hand, cards 4 and 5 on table)
yi = 0 1 (meaning you should go all in in this case)
I also don't really know when to teach it (after each turn or after each game), as it is beneficial to have all the penalty cards, but bad to have all but one penalty card.
You should gather a lot of labeled training data in the format I described, train it on that, and then use it. You will need thousands or even tens of thousands of games to see good performance. Teaching it after each turn or game is unlikely to do well.
This will lead to very large neural networks. Another thing that you might try is to predict who will win given a current game configuration. This will significantly reduce the number of output units, making learning easier. For example, given the cards currently on the table and in hand, what is the probability that the current player will win? With enough training data, neural networks can attempt to learn these probabilities.
There are obvious shortcomings: the need for large training data sets. There is no memory of how the game has gone so far (unless you use much more advanced nets).
For games such as these, I suggest you read about reinforcement learning, or dedicated algorithms for your particular game. You're not going to have much luck teaching an ANN to play chess for example, and I doubt you will teaching it to play a card game.

First of all you need to create some good learning data set for training ANN. If your budget allows you can ask some cards professionals to share with you enough of their matches of how they played cards. Another way of generating data could be some bots, which play cards. Then you need to think how to represent data set of playing matches to neural network. Also I recommend you to represent cards not by their value (0.2, 0.3, 0.4, ..., 0.10, 0.11 (for jack), but as separated input. Also look for elastic neural networks which can be used for such task.

Newbie to Neural Networks

Just starting to play around with Neural Networks for fun after playing with some basic linear regression. I am an English teacher so don't have a math background and trying to read a book on this stuff is way over my head. I thought this would be a better avenue to get some basic questions answered (even though I suspect there is no easy answer). Just looking for some general guidance put in layman's terms. I am using a trial version of an Excel Add-In called NEURO XL. I apologize if these questions are too "elementary."
My first project is related to predicting a student's Verbal score on the SAT based on a number of test scores, GPA, practice exam scores, etc. as well as some qualitative data (gender: M=1, F=0; took SAT prep class: Y=1, N=0; plays varsity sports: Y=1, N=0).
In total, I have 21 variables that I would like to feed into the network, with the output being the actual score (200-800).
I have 9000 records of data spanning many years/students. Here are my questions:
How many records of the 9000 should I use to train the network?
1a. Should I completely randomize the selection of this training data or be more involved and make sure I include a variety of output scores and a wide range of each of the input variables?
If I split the data into an even number, say 9x1000 (or however many) and created a network for each one, then tested the results of each of these 9 on the other 8 sets to see which had the lowest MSE across the samples, would this be a valid way to "choose" the best network if I wanted to predict the scores for my incoming students (not included in this data at all)?
Since the scores on the tests that I am using as inputs vary in scale (some are on 1-100, and others 1-20 for example), should I normalize all of the inputs to their respective z-scores? When is this recommended vs not recommended?
I am predicting the actual score, but in reality, I'm NOT that concerned about the exact score but more of a range. Would my network be more accurate if I grouped the output scores into buckets and then tried to predict this number instead of the actual score?
E.g.
750-800 = 10
700-740 = 9
etc.
Is there any benefit to doing this or should I just go ahead and try to predict the exact score?
What if ALL I cared about was whether or not the score was above or below 600. Would I then just make the output 0(below 600) or 1(above 600)?
5a. I read somewhere that it's not good to use 0 and 1, but instead 0.1 and 0.9 - why is that?
5b. What about -1(below 600), 0(exactly 600), 1(above 600), would this work?
5c. Would the network always output -1, 0, 1 - or would it output fractions that I would then have to roundup or rounddown to finalize the prediction?
Once I have found the "best" network from Question #3, would I then play around with the different parameters (number of epochs, number of neurons in hidden layer, momentum, learning rate, etc.) to optimize this further?
6a. What about the Activation Function? Will Log-sigmoid do the trick or should I try the other options my software has as well (threshold, hyperbolic tangent, zero-based log-sigmoid).
6b. What is the difference between log-sigmoid and zero-based log-sigmoid?
Thanks!

First a little bit of meta content about the question itself (and not about the answers to your questions).
I have to laugh a little that you say 'I apologize if these questions are too "elementary."' and then proceed to ask the single most thorough and well thought out question I've seen as someone's first post on SO.
I wouldn't be too worried that you'll have people looking down their noses at you for asking this stuff.
This is a pretty big question in terms of the depth and range of knowledge required, especially the statistical knowledge needed and familiarity with Neural Networks.
You may want to try breaking this up into several questions distributed across the different StackExchange sites.
Off the top of my head, some of it definitely belongs on the statistics StackExchange, Cross Validated: https://stats.stackexchange.com/
You might also want to try out https://datascience.stackexchange.com/ , a beta site specifically targeting machine learning and related areas.
That said, there is some of this that I think I can help to answer.
Anything I haven't answered is something I don't feel qualified to help you with.
Question 1
How many records of the 9000 should I use to train the network? 1a. Should I completely randomize the selection of this training data or be more involved and make sure I include a variety of output scores and a wide range of each of the input variables?
Randomizing the selection of training data is probably not a good idea.
Keep in mind that truly random data includes clusters.
A random selection of students could happen to consist solely of those who scored above a 30 on the ACT exams, which could potentially result in a bias in your result.
Likewise, if you only select students whose SAT scores were below 700, the classifier you build won't have any capacity to distinguish between a student expected to score 720 and a student expected to score 780 -- they'll look the same to the classifier because it was trained without the relevant information.
You want to ensure a representative sample of your different inputs and your different outputs.
Because you're dealing with input variables that may be correlated, you shouldn't try to do anything too complex in selecting this data, or you could mistakenly introduce another bias in your inputs.
Namely, you don't want to select a training data set that consists largely of outliers.
I would recommend trying to ensure that your inputs cover all possible values for all of the variables you are observing, and all possible results for the output (the SAT scores), without constraining how these requirements are satisfied.
I'm sure there are algorithms out there designed to do exactly this, but I don't know them myself -- possibly a good question in and of itself for Cross Validated.
Question 3
Since the scores on the tests that I am using as inputs vary in scale (some are on 1-100, and others 1-20 for example), should I normalize all of the inputs to their respective z-scores? When is this recommended vs not recommended?
My understanding is that this is not recommended as the input to a Nerual Network, but I may be wrong.
The convergence of the network should handle this for you.
Every node in the network will assign a weight to its inputs, multiply them by their weights, and sum those products as a core part of its computation.
That means that every node in the network is searching for some coefficients for each of their inputs.
To do this, all inputs will be converted to numeric values -- so conditions like gender will be translated into "0=MALE,1=FEMALE" or something similar.
For example, a node's metric might look like this at a given point in time:
2*ACT_SCORE + 0*GENDER + (-5)*VARISTY_SPORTS ...
The coefficients for each values are exactly what the network is searching for as it converges.
If you change the scale of a value, like ACT_SCORE, you just change the scale of the coefficient that will be found by the reciporical of that scaling factor.
The result should still be the same.
There are other concerns in terms of accuracy (computers have limited capacity to represent small fractions) and speed that may enter this, but not being familiar with NEURO XL, I can't say whether or not they apply for this technology.
Question 4
I am predicting the actual score, but in reality, I'm NOT that concerned about the exact score but more of a range. Would my network be more accurate if I grouped the output scores into buckets and then tried to predict this number instead of the actual score?
This will reduce accuracy, although you should converge to a solution much faster with fewer possible outputs (scores).
Neural Networks actually describe very high-dimensional functions in their input variables.
If you reduce the granularity of that function's output space, you essentially state that you don't care about local minima and maxima in that function, especially around the borders between your output scores.
As a result, you are sacrificing information that may be an essential component of the "true" function that you are searching for.
I hope this has been helpful, but you really should break this question down into its many components and ask them separately on different sites -- potentially some of them do belong here on StackOverflow as well.

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse