How to train a neural network with Q-Learning - neural-network

I just implemented Q-Learning without neural networks but I am stuck at implementing them with neural networks.
I will give you a pseudo code showing how my Q-Learning is implemented:
train(int iterations)
buffer = empty buffer
for i = 0 while i < iterations:
move = null
if random(0,1) > threshold:
move = random_move()
else
move = network_calculate_move()
input_to_network = game.getInput()
output_of_network = network.calculate(input_to_network)
game.makeMove(move)
reward = game.getReward()
maximum_next_q_value = max(network.calculate(game.getInput()))
if reward is 1 or -1: //either lost or won
output_of_network[move] = reward
else:
output_of_network[move] = reward + discount_factor * max
buffer.add(input_to_network, output_of_network)
if buffer is full:
buffer.remove_oldest()
train_network()
train_network(buffer b):
batch = b.extract_random_batch(batch_size)
for each input,output in batch:
network.train(input, output, learning_rate) //one forward/backward pass
My problem right now is that this code works for a buffer size of less than 200.
For any buffer over 200, my code does not work anymore so I've got a few questions:
Is this implementation correct? (In theory)
How big should the batch size be compared to the buffer size
How would one usually train the network? For how long? Until a specific MSE of the whole batch is reached?

Is this implementation correct? (In theory)
Yes, your pseudocode does have the right approach.
How big should the batch size be compared to the buffer size
Algorithmically speaking, using larger batches in stochastic gradient descent allows you to reduce the variance of your stochastic gradient updates (by taking the average of the gradients in the batch), and this in turn allows you to take bigger step-sizes, which means the optimization algorithm will make progress faster.
The experience replay buffer stores a fixed number of recent memories, and as new ones come in, old ones are removed. When the time comes to train, we simply draw a uniform batch of random memories from the buffer, and train our network with them.
While related, there is no standard value for batch size vs. buffer size. Experimenting with these hyperparameters is one of the joys of deep reinforcement learning.
How would one usually train the network? For how long? Until a
specific MSE of the whole batch is reached?
Networks are usually trained until they "converge," which means that there are repeatedly no meaningful changes in the Q-table between episodes

Related

How to compensate if I cant do a large batch size in neural network

I am trying to run an action recognition code from GitHub. The original code used a batch size of 128 with 4 GPUS. I only have two gpus so I cannot match their bacth size number. Is there anyway I can compensate this difference in batch. I saw somewhere that iter_size might compensate according to a formula effective_batchsize= batch_size*iter_size*n_gpu. what is iter_size in this formula?
I am using PYthorch not Caffe.
In pytorch, when you perform the backward step (calling loss.backward() or similar) the gradients are accumulated in-place. This means that if you call loss.backward() multiple times, the previously calculated gradients are not replaced, but in stead the new gradients get added on to the previous ones. That is why, when using pytorch, it is usually necessary to explicitly zero the gradients between minibatches (by calling optimiser.zero_grad() or similar).
If your batch size is limited, you can simulate a larger batch size by breaking a large batch up into smaller pieces, and only calling optimiser.step() to update the model parameters after all the pieces have been processed.
For example, suppose you are only able to do batches of size 64, but you wish to simulate a batch size of 128. If the original training loop looks like:
optimiser.zero_grad()
loss = model(batch_data) # batch_data is a batch of size 128
loss.backward()
optimiser.step()
then you could change this to:
optimiser.zero_grad()
smaller_batches = batch_data[:64], batch_data[64:128]
for batch in smaller_batches:
loss = model(batch) / 2
loss.backward()
optimiser.step()
and the updates to the model parameters would be the same in each case (apart maybe from some small numerical error). Note that you have to rescale the loss to make the update the same.
The important concept is not so much the batch size; it's the quantity of epochs you train. Can you double the batch size, giving you the same cluster batch size? If so, that will compensate directly for the problem. If not, double the quantity of iterations, so you're training for the same quantity of epochs. The model will quickly overcome the effects of the early-batch bias.
However, if you are comfortable digging into the training code, myrtlecat gave you an answer that will eliminate the batch-size difference quite nicely.

Set Batch Size *and* Number of Training Iterations for a neural network?

I am using the KNIME Doc2Vec Learner node to build a Word Embedding. I know how Doc2Vec works. In KNIME I have the option to set the parameters
Batch Size: The number of words to use for each batch.
Number of Epochs: The number of epochs to train.
Number of Training Iterations: The number of updates done for each batch.
From Neural Networks I know that (lazily copied from https://stats.stackexchange.com/questions/153531/what-is-batch-size-in-neural-network):
one epoch = one forward pass and one backward pass of all the training examples
batch size = the number of training examples in one forward/backward pass. The higher the batch size, the more memory space you'll need.
number of iterations = number of passes, each pass using [batch size] number of examples. To be clear, one pass = one forward pass + one backward pass (we do not count the forward pass and backward pass as two different passes).
As far as I understand it makes little sense to set batch size and iterations, because one is determined by the other (given the data size, which is given by the circumstances). So why can I change both parameters?
This is not necessarily the case. You can also train "half epochs". For example, in Google's inceptionV3 pretrained script, you usually set the number of iterations and the batch size at the same time. This can lead to "partial epochs", which can be fine.
If it is a good idea or not to train half epochs may depend on your data. There is a thread about this but not a concluding answer.
I am not familiar with KNIME Doc2Vec, so I am not sure if the meaning is somewhat different there. But from the definitions you gave setting batch size + iterations seems fine. Also setting number of epochs could cause conflicts though leading to situations where numbers don't add up to reasonable combinations.

Neural network gets only 50% good prediction on test data

I made a neural network whice i want to classify the input data (400 caracteristics per input data) as one of the five arabic dialects. I divede the trainig data in "train data", "validation data" and than "test date", with net.divideFcn = 'dividerand'; . I use trainbr as training function, whice results in a long training, that's because i have 9000 elements in training data.
For the network arhitecture i used two-layers, first with 10 perceptrons, second with 5, 5 because i use one vs all strategy.
The network training ends usually with minimum gradient reached, rather than minimux error.
How can i make the network predict better? Could it be o problem with generalization (the network learn very well the training data, but test on new data tends to fail?
Should i add more perceptrons to the first layer? I'm asking that because i take about a hour to train the network when i have 10 perceptrons on first layer, so the time will increase.
This is the code for my network:
[Test] = load('testData.mat');
[Ex] = load('trainData.mat');
Ex.trainVectors = Ex.trainVectors';
Ex.trainLabels = Ex.trainLabels';
net = newff(minmax(Ex.trainVectors),[10 5] ,{'logsig','logsig'},'trainlm','learngdm','sse');
net.performFcn = 'mse';
net.trainParam.lr = 0.01;
net.trainParam.mc = 0.95;
net.trainParam.epochs = 1000;
net.trainParam.goal = 0;
net.trainParam.max_fail = 50;
net.trainFcn = 'trainbr';
net.divideFcn = 'dividerand';
net.divideParam.trainRatio = 0.7;
net.divideParam.valRatio = 0.15;
net.divideParam.testRatio = 0.15;
net = init(net);
net = train(net,Ex.trainVectors,Ex.trainLabels);
Thanks !
Working with neural networks is some type of creative work. So noone can't give you the only true answer. But I can give some advices based on my own experience.
First of all - check the network error when training ends (on training and validation data sets. Before you start to use test data set). You told it is minimum but what is its actual value? If it 50% too, so we have bad data or wrong net architecture.
If error for train data set is OK. Next step - lets check how much the coefficients of your net are changing at the validation step. And what's up about the error here. If they changed dramatically that's the sigh our architecture is wrong: Network does not have the ability to generalize and will retrain at every new data sets.
What else can we do before changing architecture? We can change the number of epochs. Sometimes we can get good results but it is some type of random - we must be sure the changing of coefficient is small at the ending steps of training. But as I remember nntool check it automatically, so maybe we can skip this step.
One more thing I want to recommend to you - change train data set. Maybe you know rand is give you always the same number at start of matlab, so if you create your data sets only once you can work with the same sets always. This problem is also about non-homogeneous data. It can be that some part of your data is more important than other. So if some different random sets will give about the same error data is ok and we can go further. If not - we need to work with data and split it more carefully. Sometimes I avoid using dividerand and divide data manually.
Sometimes I tried to change the type of activation function. But here you use perceptron... So the idea - try to use sigma- or linear- neurons instead of perceptrons. This rarely leads to significant improvements but can help.
If all this steps can't give you enough you have to change net architecture. And the number of neurons in the first layer is the first you have to do. Usually when I work on the neural network I spend a lot of time trying not only different number of neurons but the different types of nets too.
For example, I found interesting article about your topic: link at Alberto Simões article. And that's what they say:
Regarding the number of units in the hidden layers, there are some
rules of thumb: use the same number of units in all hidden layers, and
use at least the same number of units as the maximum between the
number of classes and the number of features. But there can be up to
three times that value. Given the high number of features we opted to
keep that same number of units in the hidden layer.
Some advices from comments:
Data split method (for train and test data sets) depends on your data. For example, I worked on industry data and found that at the last part of the data set technological parameters (pressure for some equipment) was changed. So I have to get data for both operation modes to train data set. But for your case I don't thing there are the same problem... I recommend you to try several random sets (just check they are really different!).
For measuring net error I usually calculate full vector of errors - I train net and then check it's work for all values to get the whole errors vector. It's useful to get some useful vies like histograms and etc and I can see where my net is go wrong. It is not necessary and even harmful to get sse (or mse) close to zero - usually that's mean you already overtrain the net. For the first approximation I usually try to get 80-95% of correct values on training data set and then try the net on the test data set.

Tensorflow Inception Multiple GPU Training Loss is not Summed?

I am trying to go through Tensorflow's inception code for multiple GPUs (on 1 machine). I am confused because we get multiple losses from the different towers, aka the GPUs, as I understand, but the loss variable evaluated seems to only be of the last tower and not a sum of the losses from all towers:
for step in xrange(FLAGS.max_steps):
start_time = time.time()
_, loss_value = sess.run([train_op, loss])
duration = time.time() - start_time
Where loss was last defined specifically for each tower:
for i in xrange(FLAGS.num_gpus):
with tf.device('/gpu:%d' % i):
with tf.name_scope('%s_%d' % (inception.TOWER_NAME, i)) as scope:
# Force all Variables to reside on the CPU.
with slim.arg_scope([slim.variables.variable], device='/cpu:0'):
# Calculate the loss for one tower of the ImageNet model. This
# function constructs the entire ImageNet model but shares the
# variables across all towers.
loss = _tower_loss(images_splits[i], labels_splits[i], num_classes,
scope)
Could someone explain where the step is to combine the losses from different towers? Or are we simply a single tower's loss as representative of the other tower's losses as well?
Here's the link to the code:
https://github.com/tensorflow/models/blob/master/inception/inception/inception_train.py#L336
For monitoring purposes, considering all towers work as expected, single tower's loss is as representative as average of all towers' losses. This is due to the fact that there is no relation between batch and tower it is assigned to.
But the train_op uses gradients from all towers, as per line 263, 278 so technically training takes into account batches from all towers, as it should be.
Note, that average of losses will have lower variance than single tower's loss, but they will have the same expectation.
Yes, according to this code, losses are not summed or averaged across gpus. Loss per gpu is used inside of each gpu (tower) for gradient calculation. Only gradients are synchronized. So the isnan test is only done for the portion of data processed by the last gpu. This is not crucial but can be a limitation.
If really needed, I think you can do as follows to get averaged loss cross gpus:
per_gpu_loss = []
for i in xrange(FLAGS.num_gpus):
with tf.device('/gpu:%d' % i):
with tf.name_scope('%s_%d' % (inception.TOWER_NAME, i)) as scope:
...
per_gpu_loss.append(loss)
mean_loss = tf.reduce_mean(per_gpu_loss, name="mean_loss")
tf.summary.scalar('mean_loss', mean_loss)
and then replace loss in sess.run as mean_loss:
_, loss_value = sess.run([train_op, mean_loss])
loss_value is now an average across losses processed by all the gpus.

Re-Use Sliding Window data for Neural Network for Time Series?

I've read a few ideas on the correct sample size for Feed Forward Neural networks. x5, x10, and x30 the # of weights. This part I'm not overly concerned about, what I am concerned about is can I reuse my training data (randomly).
My data is broken up like so
5 independent vars and 1 dependent var per sample.
I was planning on feeding 6 samples in (6x5 = 30 input neurons), confirm the 7th samples dependent variable (1 output neuron.
I would train on neural network by running say 6 or 7 iterations. before trying to predict the next iteration outside of my training data.
Say I have
each sample = 5 independent variables & 1 dependent variables (6 vars total per sample)
output = just the 1 dependent variable
sample:sample:sample:sample:sample:sample->output(dependent var)
Training sliding window 1:
Set 1: 1:2:3:4:5:6->7
Set 2: 2:3:4:5:6:7->8
Set 3: 3:4:5:6:7:8->9
Set 4: 4:5:6:7:8:9->10
Set 5: 5:6:7:6:9:10->11
Set 6: 6:7:8:9:10:11->12
Non training test:
7:8:9:10:11:12 -> 13
Training Sliding Window 2:
Set 1: 2:3:4:5:6:7->8
Set 2: 3:4:5:6:7:8->9
...
Set 6: 7:8:9:10:11:12->13
Non Training test: 8:9:10:11:12:13->14
I figured I would randomly run through my set's per training iteration say 30 times the number of my weights. I believe in my network I have about 6 hidden neurons (i.e. sqrt(inputs*outputs)). So 36 + 6 + 1 + 2 bias = 45 weights. So 44 x 30 = 1200 runs?
So I would do a randomization of the 6 sets 1200 times per training sliding window.
I figured due to the small # of data, I was going to do simulation runs (i.e. rerun over the same problem with new weights). So say 1000 times, of which I do 1140 runs over the sliding window using randomization.
I have 113 variables, this results in 101 training "sliding window".
Another question I have is if I'm trying to predict up or down movement (i.e. dependent variable). Should I match to an actual # or just if I guessed up/down movement correctly? I'm thinking I should shoot for an actual number, but as part of my analysis do a % check on if this # is guessed correctly as up/down.
If you have a small amount of data, and a comparatively large number of training iterations, you run the risk of "overtraining" - creating a function which works very well on your test data but does not generalize.
The best way to avoid this is to acquire more training data! But if you cannot, then there are two things you can do. One is to split the training data into test and verification data - using say 85% to train and 15% to verify. Verification means compute the fitness of the learner on the training set, without adjusting the weights/training. When the verification data fitness (which you are not training on) stops improving (in general it will be noisy), and your training data fitness continues improving - stop training. If on the other hand you use a "sliding window", you may not have a good criterion to know when to stop training - the fitness function will bounce around in unpredictable ways (you might slowly make the effect of each training iteration have less effect on the parameters, however, to give you convergence... maybe not the best approach but some training regimes do this) The other thing you can do normalize out your node's weights via some metric to ensure some notion of 'smoothness' - if you visualize overfitting for a second you'll find that in the extreme case your fitness function sharply curves around your dataset positives...
As for the latter question - for the training to converge, you fitness function needs to be smooth. If you were to just use binary all-or-nothing fitness terms, most likely what would happen is that whatever algorithm you are using to train (backprop, BGFS, etc...) would not converge. In practice, the classification criterion should be an activation that is above for a positive result, less than or equal to for a negative result, and varies smoothly in your weight/parameter space. You can think of 0 as "I am certain that the answer is up" and 1 as "I am certain that the answer is down", and thus realize a fitness function that has a higher "cost" for incorrect guesses that were more certain... There are subtleties possible in how the function is shaped (for example you might have different ideas about how acceptable a false negative and false positive are) - and you may also introduce regions of "uncertain" where the result is closer to "zero weight" - but it should certainly be continuous/smooth.
You can re-use sliding window's.
It basically the same concept as bootstrapping (your training set); which in itself reduces training time, but don't know if it's really helpful in making the net more adaptive to anything other than the training data.
Below is an example of a sliding window in pictorial format (using spreadsheet magic)
http://i.imgur.com/nxhtgaQ.png
https://github.com/thistleknot/FredAPI/blob/05f74faf85d15f6898aa05b9b08d5363fe27c473/FredAPI/Program.cs
Line 294 shows how the code is ran using randomization, it resets the randomization at position 353 so the rest flows as normal.
I was also able to use a 1 (up) or 0 (down) as my target values and the network did converge.