Number of Q values for a deep reinforcement learning network - neural-network

I am currently developing a deep reinforcement learning network; however, I have a small doubt about the number of Q-values I will have at the output of the NN. I will have a total of 150 Q-values, which personally seems excessive to me. I have read in several papers and books that this could be a problem. I know that it will depend on the kind of NN I build, but do you think that the number of Q-values is too high? Should I reduce it?

There is no general principle for what counts as "too many". Everything depends on the problem and on the learning throughput you can achieve. In particular, the number of actions does not have to matter as long as the internal parametrisation of Q(a, s) is efficient. To give an example, let's assume the neural network is actually of the form NN(a, s) = Q(a, s); in other words, it accepts the action as an input, together with the state, and outputs the Q-value. If such an architecture can be trained on the problem considered, then it might be able to scale to big action spaces. On the other hand, if the neural net basically has an independent output per action, something of the form NN(s)[a] = Q(a, s), then many actions can lead to a relatively sparse learning signal for the model and thus to slow convergence.
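To make the two parameterisations concrete, here is a minimal PyTorch sketch; the state/action dimensions and layer sizes are made-up placeholders, not anything from the question:

```python
import torch
import torch.nn as nn

# Hypothetical sizes, just for illustration.
STATE_DIM, ACTION_DIM, N_ACTIONS = 8, 4, 150

# Form 1: NN(a, s) = Q(a, s) -- the action is fed in as part of the input,
# and the network outputs a single scalar Q-value.
class QActionAsInput(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM + ACTION_DIM, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

# Form 2: NN(s)[a] = Q(a, s) -- one independent output head per action,
# i.e. a vector of 150 Q-values indexed by the (discrete) action.
class QPerActionOutput(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM, 64), nn.ReLU(),
            nn.Linear(64, N_ACTIONS),
        )

    def forward(self, state):
        return self.net(state)
```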
Since you are asking about reducing the action space, it sounds like the true problem has complex control (maybe it is a continuous control domain?) and you are looking for some discretization to make it simpler to learn. If this is the case, you will have to follow the typical trial-and-error approach: try a simple action space, observe the dynamics, and if the results are not satisfactory, increase the complexity of the problem. This allows making iterative improvements, as opposed to going in the opposite direction: starting with a setting too complex to get any results and then having to reduce it without knowing what the "reasonable values" are.
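If the underlying control really is continuous, the "start simple" advice can be as basic as discretizing each action dimension on a coarse grid and refining it only if needed. A tiny sketch with made-up bounds:

```python
import numpy as np

# Hypothetical 1-D continuous control, e.g. a torque bounded in [-2, 2].
# Start with a coarse set of discrete actions and only refine the grid
# if the learned behaviour is not satisfactory.
coarse_actions = np.linspace(-2.0, 2.0, num=5)    # 5 Q-values at the output
finer_actions = np.linspace(-2.0, 2.0, num=21)    # 21 Q-values, if needed later
```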

Related

Is running more epochs really a direct cause of overfitting?

I've seen comments in online articles/tutorials and Stack Overflow questions which suggest that increasing the number of epochs can result in overfitting. But my intuition tells me that there should be no direct relationship at all between the number of epochs and overfitting. So I'm looking for an answer which explains whether I'm right or wrong (or whatever's in between).
Here's my reasoning though. To overfit, you need to have enough free parameters (I think this is called "capacity" in neural networks) in your model to generate a function which can replicate the sample data points. If you don't have enough free parameters, you'll never overfit. You might just underfit.
So really, if you don't have too many free parameters, you could run infinite epochs and never overfit. If you have too many free parameters, then yes, the more epochs you have the more likely it is that you get to a place where you're overfitting. But that's just because running more epochs revealed the root cause: too many free parameters. The real loss function doesn't care about how many epochs you run. It existed the moment you defined your model structure, before you ever even tried to do gradient descent on it.
In fact, I'd venture as far as to say: assuming you have the computational resources and time, you should always aim to run as many epochs as possible, because that will tell you whether your model is prone to overfitting. Your best model will be the one that provides great training and validation accuracy, no matter how many epochs you run it for.
EDIT
While reading more into this, I realise I forgot to take into account that you can arbitrarily vary the sample size as well. Given a fixed model, a smaller sample size is more prone to being overfit. And then that kind of makes me doubt my intuition above. Still happy to get an answer though!
Your intuition to me seems completely correct.
But here is the caveat. The whole purpose of deep models is that they are "deep" (duh!!). So what happens is that the number of trainable parameters grows very quickly as you grow your network.
Here is an example comparing a deep model with a simpler model:
Assume you have a 10-variable data set. With a crazy amount of feature engineering, you might be able to extract 50 features out of it. If you then run a traditional model (say, a logistic regression), you will have 50 parameters (capacity, in your words, or degrees of freedom) to train.
But if you use even a very simple deep model with layer 1: 10 units, layer 2: 10 units, layer 3: 5 units, layer 4: 2 units, you will end up with 10*10 + 10*10 + 10*5 + 5*2 = 260 weights to train (plus biases).
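A quick way to check that count for the architecture above:

```python
# Weight count for the deep model above: input 10 -> 10 -> 10 -> 5 -> 2
layer_sizes = [10, 10, 10, 5, 2]
weights = sum(fan_in * fan_out
              for fan_in, fan_out in zip(layer_sizes[:-1], layer_sizes[1:]))
biases = sum(layer_sizes[1:])
print(weights, biases)  # 260 weights and 27 biases, vs. 50 parameters for the logistic regression
```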
Therefore, when we train a neural net for a long time, we usually end up with a memorized version of our data set (this gets worse if our data set is small and easy to memorize).
But as you also mentioned, there is no intrinsic reason why a higher number of epochs should result in overfitting. Early stopping is usually a very good way of avoiding it: just set the patience to 5-10 epochs.
If the number of trainable parameters is small with respect to the size of your training set (and your training set is reasonably diverse), then running over the same data multiple times will not matter much, since you will be learning features of your problem rather than just memorizing the training data set. The problem arises when the number of parameters is comparable to (or bigger than) your training set size; it is basically the same problem as with any machine learning technique that uses too many features. This is quite common if you use large, densely connected layers. To combat this overfitting problem there are lots of regularization techniques (dropout, the L1 regularizer, constraining certain connections to be 0 or equal, as in a CNN).
The problem is that you might still be left with too many trainable parameters. A simple way to regularize even further is to use a small learning rate (i.e. don't learn too much from any particular example, lest you memorize it) combined with monitoring across epochs (if the gap between training and validation accuracy starts to grow, you are starting to overfit your model). You can then use that gap to stop your training. This is a version of what is known as early stopping (stopping before you reach the minimum of your training loss).
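Putting the pieces from these answers together (dropout plus patience-based early stopping), here is a minimal Keras sketch; the data, layer sizes, and patience value are placeholders, not recommendations:

```python
import numpy as np
import tensorflow as tf

# Placeholder data just to make the sketch self-contained; use your own.
x_train, y_train = np.random.rand(1000, 20), np.random.randint(0, 10, 1000)
x_val, y_val = np.random.rand(200, 20), np.random.randint(0, 10, 200)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.5),  # dropout: regularization against memorization
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Stop once validation loss has not improved for `patience` epochs,
# and roll back to the best weights seen so far.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=7, restore_best_weights=True)

model.fit(x_train, y_train, epochs=500,
          validation_data=(x_val, y_val), callbacks=[early_stop])
```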

How to interpret the discriminator's loss and the generator's loss in Generative Adversarial Nets?

I am reading people's implementations of DCGAN, especially this one in TensorFlow.
In that implementation, the author draws the losses of the discriminator and of the generator, which is shown below (images come from https://github.com/carpedm20/DCGAN-tensorflow):
Neither the discriminator's loss nor the generator's loss seems to follow any pattern, unlike ordinary neural networks, whose loss decreases as training iterations increase. How should these losses be interpreted when training GANs?
Unfortunately, as you've said, the losses for GANs are very non-intuitive. This mostly comes down to the fact that the generator and the discriminator are competing against each other, so an improvement in one means a higher loss for the other, until that other learns better from the loss it receives, which in turn throws off its competitor, and so on.
Now one thing that should happen often enough (depending on your data and initialisation) is that both the discriminator and generator losses converge to some fairly stable values, like this:
(it's OK for the loss to bounce around a bit - that's just evidence of the model trying to improve itself)
This loss convergence would normally signify that the GAN model has found some optimum where it can't improve further, which should also mean it has learned well enough. (Also note that the loss values themselves usually aren't very informative.)
Here are a few side notes that I hope will be of help:
if the losses haven't converged very well, it doesn't necessarily mean that the model hasn't learned anything - check the generated examples; sometimes they come out good enough. Alternatively, you can try changing the learning rate and other parameters.
if the model converged well, still check the generated examples - sometimes the generator finds one or a few examples that the discriminator can't distinguish from the genuine data. The trouble is that it then keeps producing those few and never creates anything new; this is called mode collapse. Usually introducing some diversity into your data helps.
as vanilla GANs are rather unstable, I'd suggest using some version of the DCGAN models, as they contain features like convolutional layers and batch normalisation that are supposed to help with the stability of convergence (the picture above is the result of a DCGAN rather than a vanilla GAN).
This is common sense, but still: as with most neural net structures, tweaking the model, i.e. changing its parameters and/or architecture to fit your particular needs/data, can improve the model or break it.
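To make the "competing losses" concrete, here is a minimal PyTorch sketch of the standard GAN objectives; the tiny fully connected D and G and the random "real" batch are placeholders purely for illustration:

```python
import torch
import torch.nn as nn

# Tiny placeholder networks and a random "real" batch, only to show the losses.
D = nn.Sequential(nn.Linear(784, 128), nn.LeakyReLU(0.2), nn.Linear(128, 1))
G = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 784))
bce = nn.BCEWithLogitsLoss()

real = torch.randn(32, 784)        # stand-in for a batch of real data
fake = G(torch.randn(32, 64))      # generated batch from random noise

# Discriminator: push D(real) towards 1 and D(fake) towards 0.
d_loss = bce(D(real), torch.ones(32, 1)) + bce(D(fake.detach()), torch.zeros(32, 1))

# Generator: push D(fake) towards 1 -- so the better G gets, the worse D does,
# which is exactly why the two loss curves pull against each other.
g_loss = bce(D(fake), torch.ones(32, 1))
```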

Neural Network Retraining

I am coding a simple Neural Network, but I have thought of one issue that is bothering me.
This NN is for finding categories in the input. To better understand this, say the categories are "the numbers" (0,1,2...9).
To implement this, the output layer has 10 nodes. Say I train this NN with several input-output pairs and save the learned weights somewhere. As the learning process takes quite a lot of time, I then take a break, come back fresh the next day, and resume learning with new input-output pairs. So far so good.
But what happens if, at that point, I decide that I want to recognize hexadecimals (0, 1, ..., 9, A, B, ..., E, F)? Ergo, the number of categories increases.
I suspect that would imply changing the structure of the NN and therefore I should retrain the NN from scratch.
Is this so?
Any comment, advice, or shared experience will be greatly appreciated.
EDIT: This question has been marked as a duplicate. I read the other question and, although it is similar, my question is more concrete. While the other question speaks in generalities, and its answer is also quite general, mine is very concrete because I use an example:
If I train a NN to recognize decimal digits and later decide to add data to make it recognize hexadecimals, is this possible? How? Do I have to retrain the whole NN? In other words, does the structure of the NN need to be fixed at 10 or 16 outputs from the beginning?
I would very much appreciate a concrete answer to this. Thanks.
A few considerations
Your training set and testing set should have the same distribution
Unless you have some way of specifying sample weights, as some algorithms allow, you should avoid training on biased data at all costs. This is true for machine learning in general, not only neural networks.
Resuming training from a previous session is equivalent to using good initial values
Technically, you're just using the previous network as the initial value instead of a random one. You should keep training on the whole dataset as always, to avoid a biased network.
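As an illustration of the "good initial values" point, saving and reloading the weights between sessions might look like this in PyTorch (the architecture and file name are made up):

```python
import torch
import torch.nn as nn

# A hypothetical digit classifier: 784 inputs -> 10 categories.
model = nn.Sequential(nn.Linear(784, 64), nn.ReLU(), nn.Linear(64, 10))

# End of today's session: save the learned weights.
torch.save(model.state_dict(), "digits_net.pt")

# Next day: rebuild the same architecture and load the weights back in --
# they simply act as (very good) initial values for further training.
resumed = nn.Sequential(nn.Linear(784, 64), nn.ReLU(), nn.Linear(64, 10))
resumed.load_state_dict(torch.load("digits_net.pt"))
# Then keep training on the whole dataset, not only the new examples.
```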
Short Answer
Yes, you should always retrain your network, if by retrain you mean running a training routine on the full dataset.
If by retrain you just mean doing a really long training run, it isn't your choice anyway. You must always train the network until the training error and the testing error (or cross-validated error) converge. If you reuse the previously trained network, that will probably happen faster.
This is true no matter what kind of change you make: to the network architecture, to the dataset, to both (your example), or to some other parameter.
Of course, if you change the network architecture, you're going to have a bit of trouble reusing the previous network. You could reuse the learned parameters for the nodes that were kept and randomly initialize the parameters for the new nodes.
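For the concrete 10-to-16-output case, that last suggestion might look like the following PyTorch sketch (the hidden size of 64 is an arbitrary placeholder):

```python
import torch
import torch.nn as nn

# Hypothetical sizes: a hidden layer of 64 units feeding the output layer.
old_head = nn.Linear(64, 10)   # trained output layer for the 10 decimal digits
new_head = nn.Linear(64, 16)   # new output layer for the 16 hexadecimal digits

with torch.no_grad():
    # Reuse the learned parameters for the 10 classes that were kept...
    new_head.weight[:10] = old_head.weight
    new_head.bias[:10] = old_head.bias
    # ...while the remaining 6 rows (A-F) keep their random initialization.

# Then swap new_head into the network and keep training on the full
# hexadecimal dataset, not only on examples of A-F.
```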

Neural Networks backpropagation

I have gone through neural networks and have understood the derivation of backpropagation almost perfectly (finally!). However, I have a small doubt.
We update all the weights simultaneously, so what is the guarantee that this leads to a smaller cost? If the weights were updated one by one, it would definitely lead to a lower cost, and it would be similar to linear regression. But if we update all the weights simultaneously, might we not overshoot the minimum?
Also, do we update the biases the same way we update the weights, after each forward and backward pass for each training example?
Lastly, I have started reading about RNNs. What are some good resources for understanding BPTT in RNNs?
Yes, updating only one weight at a time could decrease the error value at every step, but it's usually infeasible to do such updates in practical NN solutions. Most of today's architectures have around 10^6 parameters, so one epoch per parameter would take enormously long. Moreover, because of the nature of backpropagation, you usually have to compute lots of intermediate derivatives in order to compute the derivative with respect to a given parameter, so you would waste a lot of computation with such an approach.
But the phenomenon you mention was noticed a long time ago, and there are ways of dealing with it. The two most common issues connected with it are:
Covariate shift: the error and weight updates of a given layer depend strongly on the output of the previous layer, so when you update that layer, the results in the next layer may change. The most common way to deal with this problem right now is batch normalization (see the short sketch after this list).
Nonlinear functions vs. linear differentiation: it's easy to overlook when thinking about BP, but the derivative is a linear operator, which can generate a lot of problems in gradient descent. The most counterintuitive example is the fact that if you multiply your input by a constant, then every derivative will also be multiplied by the same number. This can lead to many problems, but most recent learning methods do a good job of dealing with it.
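A minimal sketch of the batch-normalization fix mentioned above, in PyTorch; the layer sizes are arbitrary:

```python
import torch.nn as nn

# The same stack of fully connected layers with batch normalization inserted
# after each linear layer; this stabilizes the interaction between layers
# that are all updated simultaneously.
net = nn.Sequential(
    nn.Linear(128, 64), nn.BatchNorm1d(64), nn.ReLU(),
    nn.Linear(64, 32), nn.BatchNorm1d(32), nn.ReLU(),
    nn.Linear(32, 10),
)
```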
Regarding BPTT, I strongly recommend Geoffrey Hinton's course on ANNs, and especially this video.

Neural Nets: computing weight deltas is a bottleneck

After profiling my neural net's code, I've realized that the method which computes the weight change for each arc in the network (-rate*gradient + momentum*previous_delta - decay*rate*weight), given the gradient, is the bottleneck (55% inclusive samples).
Is there any trick to compute these values in an efficient manner?
This is normal behaviour. I am assuming that you are using an iterative process to solve for the weights at each step (such as backpropagation?). If the number of neurons is large and the training (back-testing) step is cheap, then it is normal that weight updates like this will consume a large fraction of the compute time during training of the neural network.
Did you get this result on a simple XOR problem or similar? If so, you will probably find that once you start to solve more complex problems (such as pattern detection in multidimensional arrays, image processing, etc.), those functions will begin to consume an insignificant fraction of the compute time.
If you are profiling, I would suggest profiling with a problem that is closer to the purpose for which the neural network is designed (I am guessing you didn't design it to solve XOR or play tic-tac-toe); you will probably find that optimising code such as -rate*gradient + momentum*previous_delta - decay*rate*weight is more or less a waste of time, at least in my experience.
If you do find that this code is compute-intensive in real-world applications, then I would suggest trying to reduce the number of times this line of code is executed via structural changes. Neural network optimization is a rich field and I can't possibly give you useful advice from such a broad question, but I will say that if your program is unusually slow, you're unlikely to see significant improvements by tinkering with such low-level code. I will, however, suggest the following from my own experience:
Consider parallelisation. Many search algorithms, such as those implemented in backpropagation techniques, are amenable to parallel execution. As weight adjustments are identical in terms of computational demand for a given network, think static loops in OpenMP.
Modify the convergence criterion (the critical convergence rate below which you stop adjusting weights) so that you perform fewer of these calculations.
Consider an alternative to deterministic solutions such as backpropagation, which are somewhat prone to local optimisation anyway. Consider Gaussian mutation (all things being equal, Gaussian mutation will 1) reduce the time spent on mutation relative to back-testing, 2) increase convergence time, and 3) be less prone to getting caught in local minima of the error search space).
Please note that this is a non-technical answer to what I have interpreted as a non-technical question.
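As a small practical aside, if the per-arc updates are currently computed one scalar at a time, expressing the whole update as a single array operation is usually the first thing to try. A minimal NumPy sketch with made-up shapes and hyperparameters:

```python
import numpy as np

# Made-up shapes and hyperparameters; in practice these are your per-layer arrays.
weights = np.random.randn(256, 128)
gradient = np.random.randn(256, 128)
previous_delta = np.zeros_like(weights)
rate, momentum, decay = 0.01, 0.9, 1e-4

# One vectorised update for the whole layer instead of one arc at a time.
delta = -rate * gradient + momentum * previous_delta - decay * rate * weights
weights += delta
previous_delta = delta
```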