Neural Network playing Tic Tac Toe doesn't learn - neural-network

I have a neural network playing tic-tac-toe. (I know there are other better methods for this, but I want to learn about NN)
So the NN plays against a random AI. First, it should learn to make an allowed move, ie. not choosing a field that is already occupied.
It doesn't get very far with this, however.
When NN chooses an illegal move I optimize the weights such that the distance to another, randomly chosen (legal) field is minimized. (There is one output which should have values between 1 and 9).
My problem is: in changing the weights, a formerly optimized outcome is now also changed. So I have this kind of overfitting: Everytime I backpropagade to optimize the weights for one particular situation, the decision for every other situation becomes worse!
I know I should probably have 9 output neurons instead of 1 and should probably not use a random field as the target, as I assume this can mess things up. I am starting to change this.
Still, the issue seems to remain. Obviously. How can I improve the decision in one situation without forgetting every other situation?
One solution I came up with is to "remember" every game played and optimizing simultaneously over all games played.
However, after a while this becomes very demanding on the computation. Also, it seems to go into the direction of a complete enumartion of all possible board situations. This might be possible for Tic Tac Toe but if I move to another game, say Go, this becomes infeasible.
Where is my mistake? How do I generally tackle this problem? Or where could I read about it? Thanks a lot!

To tackle this problem efficiently, you sould consider Reinforcement Learning methods, instead of what you are currently doing. What your are trying to do is to learn the behaviour of an agent playing Tic Tac Toe. The agent gets a high reward when he wins a game, a high penalty when he loses and an even higher penalty when he performs an illegal move. My guess is that using methods such as Q-learning with neural networks will work perfectly, even with very simple neural nets. One useful paper on the topic could be: https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf, or earlier papers on TD-Gammon (I think you can easily find tutorials on the topic using the keywords TD-Gammon, Q-learning, ...).
By the way, a more down-to-earth answer to why your model might not work is that you are seemingly using one single unit to represent categorical outputs: if you want to represent an integer between 1 and N, you should represent it using N output neurons with values between 0 and 1, and pick the neuron with the highest value as your answer. Using a single neuron with value between 1 and 9 creates an unatural assymetry between your outputs, and, for example, when the expected value is 3, your network gets a higher error for outputing a 9 than a 2. This should obviously not be the case: all wrong answers are equally wrong.
Hope this helps,
Best

Related

Ways to Improve Universal Differential Equation Training with sciml_train

About a month ago I asked a question about strategies for better convergence when training a neural differential equation. I've since gotten that example to work using the advice I was given, but when I applied what the same advice to a more difficult model, I got stuck again. All of my code is in Julia, primarily making use of the DiffEqFlux library. In effort to keep this post as brief as possible, I won't share all of my code for everything I've tried, but if anyone wants access to it to troubleshoot I can provide it.
What I'm Trying to Do
The data I'm trying to learn comes from an SIRx model:
function SIRx!(du, u, p, t)
β, μ, γ, a, b = Float32.([280, 1/50, 365/22, 100, 0.05])
S, I, x = u
du[1] = μ*(1-x) - β*S*I - μ*S
du[2] = β*S*I - (μ+γ)*I
du[3] = a*I - b*x
nothing
end;
The initial condition I used was u0 = Float32.([0.062047128, 1.3126149f-7, 0.9486445]);. I generated data from t=0 to 25, sampled every 0.02 (in training, I only use every 20 points or so for speed, and using more doesn't improve results). The data looks like this: Training Data
The UDE I'm training is
function SIRx_ude!(du, u, p, t)
μ, γ = Float32.([1/50, 365/22])
S,I,x = u
du[1] = μ*(1-x) - μ*S + ann_dS(u, #view p[1:lenS])[1]
du[2] = -(μ+γ)*I + ann_dI(u, #view p[lenS+1:lenS+lenI])[1]
du[3] = ann_dx(u, #view p[lenI+1:end])[1]
nothing
end;
Each of the neural networks (ann_dS, ann_dI, ann_dx) are defined using FastChain(FastDense(3, 20, tanh), FastDense(20, 1)). I tried using a single neural network with 3 inputs and 3 outputs, but it was slower and didn't perform any better. I also tried normalizing inputs to the network first, but it doesn't make a significant difference outside of slowing things down.
What I've Tried
Single shooting
The network just fits a line through the middle of the data. This happens even when I weight the earlier datapoints more in the loss function. Single-shot Training
Multiple Shooting
The best result I had was with multiple shooting. As seen here, it's not simply fitting a straight line, but it's not exactly fitting the data eitherMultiple Shooting Result. I've tried continuity terms ranging from 0.1 to 100 and group sizes from 3 to 30 and it doesn't make a significant difference.
Various Other Strategies
I've also tried iteratively growing the fit, 2-stage training with a collocation, and mini-batching as outlined here: https://diffeqflux.sciml.ai/dev/examples/local_minima, https://diffeqflux.sciml.ai/dev/examples/collocation/, https://diffeqflux.sciml.ai/dev/examples/minibatch/. Iteratively growing the fit works well the first couple of iterations, but as the length increases it goes back to fitting a straight line again. 2-stage collocation training works really well for stage 1, but it doesn't actually improve performance on the second stage (I've tried both single and multiple shooting for the second stage). Finally, mini-batching worked about as well as single-shooting (which is to say not very well) but much more quickly.
My Question
In summary, I have no idea what to try. There are so many strategies, each with so many parameters that can be tweaked. I need a way to diagnose the problem more precisely so I can better decide how to proceed. If anyone has experience with this sort of problem, I'd appreciate any advice or guidance I can get.
This isn't a great SO question because it's more exploratory. Did you lower your ODE tolerances? That would improve your gradient calculation which could help. What activation function are you using? I would use something like softplus instead of tanh so that you don't have the saturating behavior. Did you scale the eigenvalues and take into account the issues explored in the stiff neural ODE paper? Larger neural networks? Different learning rates? ADAM? Etc.
This is much better suited for a forum for discussion like the JuliaLang Discourse. We can continue there since walking through this will not be fruitful without some back and forth.

What to do with not enough training data?

I have a problem that I don't have enough training data for my NN. It is trying to predict the result of a soccer game given the last games which I woulf say is a regression task.
The training data are results of soccer games of the last 15 seasons (which are about 4500 games). Getting to new data would be hard and would take a lot of time.
What should I do now?
Is it good to duplicate the data?
Should I input randomized data? (Maybe noise but I'm not quite sure what that is)
If there is no way of creating more data,
I should probably turn up the learning rate right? (I have it sitting at 0.01 and the momentum at 0.9)
I am using mini batches consisting of 32 training datas in training. Since I don't have a lot of training I don't have a lot of mini batches. Should I stop using them?
To start from the beginning: This is a very theoretical question and is not directly related to programming, which I recommend (in future) to post over at the Data Science Stackexchange.
To go into your problem: 4500 samples is not as bad as it sounds, depending on the exact task at hand. Are you trying to predict the match results (i.e. which team is the winner?), are you looking for more specific predictions (across a lot of different, specific teams)?
If you can make sure that you have a reasonable amount of data per class, one can work with a number of samples lower than what you have. Simply duplicating the data will not help you much, since you are very likely to just overfit on the samples you are seeing, without much of an improvement; Or rather, you will get the same results as training over a longer period (since essentially you see every sample twice per epoch, instead of one).
Again, what usually happens after long training periods is overfitting, so nothing gained here.
Your second suggestion is generally called data augmentation. Instead of simply copying samples, you alter them enough to make it look "different" to the network. But be careful! Data augmentation works well for some inputs, like images, since the change in input is significant enough to not represent the same sample, but still contains meaningful information about the class (a horizontally mirrored image of a cat still shows a "valid cat", unlike a vertically mirrored image, which is more unrealistic in the real world).
Essentially, it depends on your input features to determine where it makes sense to add noise. If you are only changing the results of the previous game, a minor change in input (adding/subtracting one goal at random) can significantly change the prediction you make.
If you slightly scramble ELO scores by a random number, on the other hand, the input value will not be too different, "but different enough" to use it as a novel example.
Turning up the learning rate is not a good idea, since you are essentially just letting the network converge more towards the specific samples. On the contrary, I would argue that the current learning rate is still too high, and you should certainly not increase it.
Regarding mini batches, I think I have referenced this a million times now, but always consider smaller minibatches. From a theoretical point of view, you are more likely to converge to a local minimum.

Neural network to figure out the meaning of user input

I'm trying to make a neural network try to figure out the meaning of input(keyboard keys in this case) according to the user.
I have multiple possible output "commands" that the NN can interpret the inputs to mean, and at each state certain outputs can count as beneficial while others are a detriment
When the NN starts up for the first time, no input should have any particular meaning to it but as time goes on I want the NN to be able to figure out what the user most likely meant.
I've tried a Multilayer perceptron NN that has as many input nodes as there are physical inputs and as many output nodes as there are commands and a number of nodes equal to the sum of the other two layers as a single hidden layer, in this case it is then a 5,15,10.
The NN assumes that the user will only make moves that are in the NN's best interest.
So far it seems the NN is just figuring out what is the command it can take that will most likely result in a beneficial move, regardless of the input key rather than trying to figure out what key should result in what move according to the user.
Because of this I'm wondering (most likely wrong) if I should produce a separate NN for each input to try and figure out the current output according to the user.
Is there a different type of NN I should look into that will work better, and is there a recommended configuration for this problem?
I'll be happy with some recommendations of reading material that would help in this particular problem.
I'm at best an amateur in NN and would like to learn a lot more about the whole field, But I'm trying to focus my efforts on this problem for now.
Accordng to me you want the output to be according to behaviour of the player as number of inputs are more than in actual case. So according to me there should be some type of memory for the actions taken by the player in order to find the patterns.This can be done using Long Short term Memory.

How to train an ANN to play a card game?

I would like to teach an ANN to play Hearts, but I am stuck on how to actually perform the training.
A friend suggested to use weka for the implementation of the actual ANN, but I've never used it, so I'm leaning towards a custom implementation.
I have programmed the rules and I can let the computer play a game, choosing random but legal cards each turn.
Now I am at a loss of what to send to the ANN as input and how to extract output (decreasing amount of cards each turn, so I can't let each output neuron be a possible card) and how to teach it and when to perform teaching.
My guess is to give the ANN as input:
The cards that have been played previously, with metadata of which player has played which card
The cards on the table for this turn, also with the same metadata
The cards in the ANN's hand
And then have the output be 13 neurons (the maximal amount of cards per player), of which I take the most activated of the cards that still are in the ANN's hand.
I also don't really know when to teach it (after each turn or after each game), as it is beneficial to have all the penalty cards, but bad to have all but one penalty card.
Any and all help is appreciated. I don't really know where else to put this question.
I currently have it programmed in Swift, but it's only 200 lines and I know a few other languages, so I can translate it.
Note that neural networks might not be the best thing to use here. More on that at the end of the answer, I'll answer your questions first.
Now I am at a loss of what to send to the ANN as input and how to extract output (decreasing amount of cards each turn, so I can't let each output neuron be a possible card) and how to teach it and when to perform teaching.
ANNs require labeled input data. This means a pair (X, y) where X can be whatever (structured) data related to your problem and y is the list of correct answers you expect the ANN to learn for X.
For example, think about how you would learn math in school. The teacher will do a couple of exercises on the blackboard, and you will write those down. This is your training data.
Then, the teacher will invite you to the blackboard to do one on your own. You might not do so well at first, but he/she will guide you in the right direction. This is the training part.
Then, you'll have to do problems on your own, hopefully having learnt how.
The thing is, even this trivial example is much too complex for an ANN. An ANN usually takes in real-valued numbers and outputs one or more real-valued numbers. So it's actually much dumber than a grade schooler who learns about ax + b = 0 type equations.
For your particular problem, it can be hard to see how it fits in this format. As a whole, it doesn't: you can't present the ANN with a game and have it learn the moves, that is much too complex. You need to present it with something for which you have a correct numerical label associated with and you want the ANN to learn the underlying pattern.
To do this, you should break your problem up into subproblems. For example, input the current player's cards and expect as output the correct move.
The cards that have been played previously, with metadata of which player has played which card
The ANN should only care about the current player. I would not use metadata or any other information that identifies the players.
Giving it a history could get complicated. You might want recurrent neural networks for that.
The cards on the table for this turn, also with the same metadata
Yes, but again, I wouldn't use metadata.
The cards in the ANN's hand
Also good.
Make sure you have as many input units as the MAXIMUM number of cards you want to input (2 x total possible cards, for the cards in hand and those on the table). This will be a binary vector where the ith position is true if the card corresponding to that position exists in hand / on the table.
Then do the same for moves: you will have m binary output units, where the ith will be true if the ANN thinks you should do move i, where there are m possible moves in total (pick the max if m depends on stages in the game).
Your training data will also have to be in this format. For simplicity, let's say there can be at most 2 cards in hand and 2 on the table, out of a total of 5 cards, and we can choose from 2 moves (say fold and all in). Then a possible training instance is:
Xi = 1 0 0 1 0 0 0 0 1 1 (meaning cards 1 and 4 in hand, cards 4 and 5 on table)
yi = 0 1 (meaning you should go all in in this case)
I also don't really know when to teach it (after each turn or after each game), as it is beneficial to have all the penalty cards, but bad to have all but one penalty card.
You should gather a lot of labeled training data in the format I described, train it on that, and then use it. You will need thousands or even tens of thousands of games to see good performance. Teaching it after each turn or game is unlikely to do well.
This will lead to very large neural networks. Another thing that you might try is to predict who will win given a current game configuration. This will significantly reduce the number of output units, making learning easier. For example, given the cards currently on the table and in hand, what is the probability that the current player will win? With enough training data, neural networks can attempt to learn these probabilities.
There are obvious shortcomings: the need for large training data sets. There is no memory of how the game has gone so far (unless you use much more advanced nets).
For games such as these, I suggest you read about reinforcement learning, or dedicated algorithms for your particular game. You're not going to have much luck teaching an ANN to play chess for example, and I doubt you will teaching it to play a card game.
First of all you need to create some good learning data set for training ANN. If your budget allows you can ask some cards professionals to share with you enough of their matches of how they played cards. Another way of generating data could be some bots, which play cards. Then you need to think how to represent data set of playing matches to neural network. Also I recommend you to represent cards not by their value (0.2, 0.3, 0.4, ..., 0.10, 0.11 (for jack), but as separated input. Also look for elastic neural networks which can be used for such task.

Newbie to Neural Networks

Just starting to play around with Neural Networks for fun after playing with some basic linear regression. I am an English teacher so don't have a math background and trying to read a book on this stuff is way over my head. I thought this would be a better avenue to get some basic questions answered (even though I suspect there is no easy answer). Just looking for some general guidance put in layman's terms. I am using a trial version of an Excel Add-In called NEURO XL. I apologize if these questions are too "elementary."
My first project is related to predicting a student's Verbal score on the SAT based on a number of test scores, GPA, practice exam scores, etc. as well as some qualitative data (gender: M=1, F=0; took SAT prep class: Y=1, N=0; plays varsity sports: Y=1, N=0).
In total, I have 21 variables that I would like to feed into the network, with the output being the actual score (200-800).
I have 9000 records of data spanning many years/students. Here are my questions:
How many records of the 9000 should I use to train the network?
1a. Should I completely randomize the selection of this training data or be more involved and make sure I include a variety of output scores and a wide range of each of the input variables?
If I split the data into an even number, say 9x1000 (or however many) and created a network for each one, then tested the results of each of these 9 on the other 8 sets to see which had the lowest MSE across the samples, would this be a valid way to "choose" the best network if I wanted to predict the scores for my incoming students (not included in this data at all)?
Since the scores on the tests that I am using as inputs vary in scale (some are on 1-100, and others 1-20 for example), should I normalize all of the inputs to their respective z-scores? When is this recommended vs not recommended?
I am predicting the actual score, but in reality, I'm NOT that concerned about the exact score but more of a range. Would my network be more accurate if I grouped the output scores into buckets and then tried to predict this number instead of the actual score?
E.g.
750-800 = 10
700-740 = 9
etc.
Is there any benefit to doing this or should I just go ahead and try to predict the exact score?
What if ALL I cared about was whether or not the score was above or below 600. Would I then just make the output 0(below 600) or 1(above 600)?
5a. I read somewhere that it's not good to use 0 and 1, but instead 0.1 and 0.9 - why is that?
5b. What about -1(below 600), 0(exactly 600), 1(above 600), would this work?
5c. Would the network always output -1, 0, 1 - or would it output fractions that I would then have to roundup or rounddown to finalize the prediction?
Once I have found the "best" network from Question #3, would I then play around with the different parameters (number of epochs, number of neurons in hidden layer, momentum, learning rate, etc.) to optimize this further?
6a. What about the Activation Function? Will Log-sigmoid do the trick or should I try the other options my software has as well (threshold, hyperbolic tangent, zero-based log-sigmoid).
6b. What is the difference between log-sigmoid and zero-based log-sigmoid?
Thanks!
First a little bit of meta content about the question itself (and not about the answers to your questions).
I have to laugh a little that you say 'I apologize if these questions are too "elementary."' and then proceed to ask the single most thorough and well thought out question I've seen as someone's first post on SO.
I wouldn't be too worried that you'll have people looking down their noses at you for asking this stuff.
This is a pretty big question in terms of the depth and range of knowledge required, especially the statistical knowledge needed and familiarity with Neural Networks.
You may want to try breaking this up into several questions distributed across the different StackExchange sites.
Off the top of my head, some of it definitely belongs on the statistics StackExchange, Cross Validated: https://stats.stackexchange.com/
You might also want to try out https://datascience.stackexchange.com/ , a beta site specifically targeting machine learning and related areas.
That said, there is some of this that I think I can help to answer.
Anything I haven't answered is something I don't feel qualified to help you with.
Question 1
How many records of the 9000 should I use to train the network? 1a. Should I completely randomize the selection of this training data or be more involved and make sure I include a variety of output scores and a wide range of each of the input variables?
Randomizing the selection of training data is probably not a good idea.
Keep in mind that truly random data includes clusters.
A random selection of students could happen to consist solely of those who scored above a 30 on the ACT exams, which could potentially result in a bias in your result.
Likewise, if you only select students whose SAT scores were below 700, the classifier you build won't have any capacity to distinguish between a student expected to score 720 and a student expected to score 780 -- they'll look the same to the classifier because it was trained without the relevant information.
You want to ensure a representative sample of your different inputs and your different outputs.
Because you're dealing with input variables that may be correlated, you shouldn't try to do anything too complex in selecting this data, or you could mistakenly introduce another bias in your inputs.
Namely, you don't want to select a training data set that consists largely of outliers.
I would recommend trying to ensure that your inputs cover all possible values for all of the variables you are observing, and all possible results for the output (the SAT scores), without constraining how these requirements are satisfied.
I'm sure there are algorithms out there designed to do exactly this, but I don't know them myself -- possibly a good question in and of itself for Cross Validated.
Question 3
Since the scores on the tests that I am using as inputs vary in scale (some are on 1-100, and others 1-20 for example), should I normalize all of the inputs to their respective z-scores? When is this recommended vs not recommended?
My understanding is that this is not recommended as the input to a Nerual Network, but I may be wrong.
The convergence of the network should handle this for you.
Every node in the network will assign a weight to its inputs, multiply them by their weights, and sum those products as a core part of its computation.
That means that every node in the network is searching for some coefficients for each of their inputs.
To do this, all inputs will be converted to numeric values -- so conditions like gender will be translated into "0=MALE,1=FEMALE" or something similar.
For example, a node's metric might look like this at a given point in time:
2*ACT_SCORE + 0*GENDER + (-5)*VARISTY_SPORTS ...
The coefficients for each values are exactly what the network is searching for as it converges.
If you change the scale of a value, like ACT_SCORE, you just change the scale of the coefficient that will be found by the reciporical of that scaling factor.
The result should still be the same.
There are other concerns in terms of accuracy (computers have limited capacity to represent small fractions) and speed that may enter this, but not being familiar with NEURO XL, I can't say whether or not they apply for this technology.
Question 4
I am predicting the actual score, but in reality, I'm NOT that concerned about the exact score but more of a range. Would my network be more accurate if I grouped the output scores into buckets and then tried to predict this number instead of the actual score?
This will reduce accuracy, although you should converge to a solution much faster with fewer possible outputs (scores).
Neural Networks actually describe very high-dimensional functions in their input variables.
If you reduce the granularity of that function's output space, you essentially state that you don't care about local minima and maxima in that function, especially around the borders between your output scores.
As a result, you are sacrificing information that may be an essential component of the "true" function that you are searching for.
I hope this has been helpful, but you really should break this question down into its many components and ask them separately on different sites -- potentially some of them do belong here on StackOverflow as well.