I have been experimenting with evolving artificial creatures, but so far all creatures just die. To initialize the creatures that do not result from asexual reproduction; I create around 8 random neurons which both have a connection in and a connection out. I'm using mutation to get a set of weights which are used in a small neural network, that can form recurrent connections. I have 15 inputs and 5 output. There is a max number of 25 neurons in the hidden layer. The mutation chance is 25%. The different mutations are add a connection, disable a connection, make a small change to a weight, add a neuron, and disable a neuron. Is there something off with my mutation chances?
Real evolution is a massively parallel computation. Even so it took eons to get the basics of life. And then most of them died. Only a small sliver of all possible genes are ok.
To get your simulation to work in a reasonable time frame, you're gonna have to take some shortcuts.
Also, you should make sure your "small neural net" is capable of creating the kind of lifeforms that are successful. Your architecture may not be powerful enough to produce viable life.
Related
I've seen some comments in online articles/tutorials or Stack Overflow questions which suggest that increasing number of epochs can result in overfitting. But my intuition tells me that there should be no direct relationship at all between number of epochs and overfitting. So I'm looking for answer which explains if I'm right or wrong (or whatever's in between).
Here's my reasoning though. To overfit, you need to have enough free parameters (I think this is called "capacity" in neural networks) in your model to generate a function which can replicate the sample data points. If you don't have enough free parameters, you'll never overfit. You might just underfit.
So really, if you don't have too many free parameters, you could run infinite epochs and never overfit. If you have too many free parameters, then yes, the more epochs you have the more likely it is that you get to a place where you're overfitting. But that's just because running more epochs revealed the root cause: too many free parameters. The real loss function doesn't care about how many epochs you run. It existed the moment you defined your model structure, before you ever even tried to do gradient descent on it.
In fact, I'd venture as far as to say: assuming you have the computational resources and time, you should always aim to run as many epochs as possible, because that will tell you whether your model is prone to overfitting. Your best model will be the one that provides great training and validation accuracy, no matter how many epochs you run it for.
EDIT
While reading more into this, I realise I forgot to take into account that you can arbitrarily vary the sample size as well. Given a fixed model, a smaller sample size is more prone to being overfit. And then that kind of makes me doubt my intuition above. Still happy to get an answer though!
Your intuition to me seems completely correct.
But here is the caveat. The whole purpose of deep models is that they are "deep" (duh!!). So what happens is that your feature space gets exponentially larger as you grow your network.
Here is an example to compare a deep model with a simpler mode:
Assume you have a 10-variable data set. With a crazy amount of feature engineering, you might be able to extract 50 features out of it. Then if you run a traditional model (let's say a logistic regression), you will have 50 parameters (capacity in your word, or degree of freedom) to train.
But, if you use a very simple deep model with Layer 1: 10 unit, layer2: 10 units, layer3: 5 units, layer4: 2 units, you will end up with (10*10 + 10*10 + 5*2 = 210) parameters to train.
Therefore, usually when we train a neural net for a long time, we end of with a memorized version of our data set(this gets worse if our data set is small and easy to be memorized).
But as you also mentioned, there is no intrinsic reason why higher number of epochs result in overfitting. Early stopping is usually a very good way for avoiding this. Just set patience equal to 5-10 epochs.
If the amount of trainable parameters is small with respect to the size of your training set (and your training set is reasonably diverse) then running over the same data multiple times will not be that significant, since you will be learning some features about your problem, rather than just memorizing the training data set. The problem arises when the amount of parameters is comparable to your training data set size (or bigger), it is basically the same problem as with any machine learning technique that uses too many features. This is quite common if you use large layers with dense connections. To combat this overfitting problem there are lots of regularization techniques (dropout, L1 regularizer, constraining certain connections to be 0 or equal such as in CNN).
The problem is that might still be left with too many trainable parameters. A simple way to regularize even further is to have a small learning rate (i.e. don't learn too much from this particular example lest you memorize it) combined with monitoring the epochs (if there is a large gap increase between validation/training accuracy, you are starting to overfit your model). You can then use the gap info to stop your training. This is a version of what is known as early stopping (stop before you reach the minimum in your loss function).
I am coding a simple Neural Network, but I have thought of one issue that is bothering me.
This NN is for finding categories in the input. To better understand this, say the categories are "the numbers" (0,1,2...9).
To implement this the output layer is 10 nodes. Say I train this NN with several input -output pairs and save the learned weights somewhere. As the learning process takes quite a lot of time, after that I go and take a break. Come fresh the next day and re-start learning with new input -output pairs. So fair so goo
But what happen if on that time, I decide that I want to recognize hexadecimals (0,1,...9,A,B,,,E,F)... ergo the categories are increasing.
I suspect that would imply changing the structure of the NN and therefore I should retrain the NN from scratch.
Is this so?
Any comment, advice or your share of experience will be greatly appreciated
EDIT: This question has been marked as duplicate. I read the other question and although similar, my question is more concrete. While the other question speaks in generalities and the answer also is quite general- mine is very concrete as I use an example:
If I train a NN to recognize decimal numbers and later on decide to add data to make it recognize hexadecimals, can this be possible? How? Do I have to retrain the whole NN? In other words, does the structure of the NN needs to stay stationary with 10 OR 16 outputs since the beginning?
I would very much appreciate for a concrete answer to this. Thanks
A few considerations
Your training set and testing set should have the same distribution
Unless you have some way of specifying sample weights like some algorithms can you should at all costs avoid training on biased data. This is true for machine learning in general, not only neural networks.
Resuming training from a previous session is equivalent of using good initial values
Technically, you're just using the previous network as initial value instead of a random value. You should keep training in the whole dataset as always, to avoid a biased network.
Short Answer
Yes, you should always retrain your network if by retrain, you mean doing a training routine with the full dataset.
If you just mean retrain as doing a really long training iteration, it isn't your choice anyway. You must always train the network until the training error and testing error (or cross validated error) converge. If you reuse the previously trained network, that will probably happen faster.
You see, this is true no matter what kind of model change. If you change the network architecture, or the dataset, or both (your example), or some other parameter.
Of course, if you change the network architecture, you're going to have a bit of trouble on reusing the previous network. You could reuse the learned parameters from nodes that were kept and randomly initialize the parameters for the new nodes.
I am currently training a reinforcement learning agent using a simple Neural Network with 100 hidden elements to solve 2048 game. I am using DQN's reinforcement learning algorithm (i.e. Q-learning with replay memory), but with 2 layers Neural Network instead of Deep Neural Network.
However, I left it trained on my laptop overnight (~7 hours, ~1000 games played, > 100000 steps) and the score does not seem to increase. I suspect there might be 3 sources of errors in my code: bug, parameters tuned badly, or maybe I just don't wait long enough.
Is there any method to figure out what is wrong with the code?
And what is the best practice to improve the training results?
I'll talk about all three of your hypothesis.
If you are using a standard DL framework like caffe or tensorflow, the chance of it being a bug is small.
Try decreasing the learning rate. Maybe you set it too high for the network to converge.
The training time of 100000 steps is not that long. For a simple pong game, you need to train around 500000 steps to get a good accuracy. So you can try training it for longer.
Also, 2048 is a fairly complicated game, so maybe you network is not deep enough to learn how to play it. Two layers is not much for such a complicated game. Try increasing the number of hidden layers. Perhaps you can use the network provided here
i am experimenting with neural networks. I have a network with 8 input neurons, 5 hidden and 2 output. When i let the network learn with backpropagation, sometimes, it produces worse result between single iterations of training. What can be the cause? It should not be implementation error, because i even tried using implementation from Introduction to Neural Networks for Java and it does exactly the same.
Nothing is wrong. Back propagation is just a gradient optimization, and gradient methods do not have a guarantee of making error smaller in each iteration (you do have a guarantee that there exists a very small step size/learning rate which has such property, but in practise no way of finding it); furthermore you are probably updating weights after each sample making your training stochastic, which is even more "unstable" in this matter (as you do not really calculate the true gradient). However, if due to this, your method is not converging - think about proper scaling of your data as well as reducing the learning rate and probably adding the momentum term. These are just gradient-based optimization-related issues, not BP as such.
I wrote a neural network and made a small application with things eating other things.
But I don't really know, how to make the thing genetic.
Currently I'm recording all the inputs and outputs from every individual every frame.
At the end of an generation, I then teach every knew individual the data from the top 10 best fitting individuals from prevous generations.
But the problem is, that the recorded data from a a pool of top 10 individuals at 100 generations, is about 50MB large. When I now start a new generation with 20 individuals I have to teach them 20x50MB.
This process takes longer than 3 minutes, and I am not sure if this is what I am supposed to do in genetic neural networks.
My approach works kind of good actually. Only the inefficiency bugs me. (Of course I know, I could just reduce the population.)
And I could't find me a solution to what I have to crossover and what to mutate.
Crossovering and mutating biases and weights is nonsense, isn't it? It only would break the network, would't it? I saw examples doing just this. Mutating the weight vector. But I just can't see, how this would make the network progress reaching it's desired outputs.
Can somebody show me how the network would become better at what it is doing by randomly switching and mutating weights and connections?
Would't it be the same, just randomly generating networks and hoping they start doing what they are supposed to do?
Are there other algorithms for genetic neural networks?
Thank you.
Typically, genetic algorithms for neural networks are used as an alternative to training with back-propagation. So there is no training phase (trying to combine various kinds of supervised training with evolution is an interesting idea, but isn't done commonly enough for there to be any standard methods that I know of).
In this context, crossover and mutation of weights and biases makes sense. It provides variation in the population. A lot of the resulting neural networks (especially early on) won't do much of anything interesting, but some will be better. As you keep selecting these better networks, you will continue to get better offspring. Eventually (assuming your task is reasonable and such) you'll have neural networks that are really good at what you want them to do. This is substantially better than random search, because evolution will explore the search space of potential neural networks in a much more intelligent manner.
So yes, just about any genetic neural network algorithm will involve mutating the weights, and perhaps crossing them over as well. Some, such as NEAT, also evolve the topology of the neural network and so allow mutations and crossovers that add or remove nodes and connections between nodes.