Neural network doesn't converge - using Multilayer Perceptron

I've developed a "Pong" style game which effectively has a ball at the bottom of the screen, bouncy walls on the left and right, and a sticky wall on the top. It randomly chooses a point on the bottom (on a straight horizontal line) and a random angle, bounces off the side walls, and hits the top wall. This is repeated 1000 times, and each time the x-value of the launch position, the launch angle, and the final x-value of the position where it collides with the top wall are recorded.
This gives me 2 inputs - the x-value of the launch and the launch angle - and 1 output - the x-value of the final position. I tried using a multilayer perceptron with 2 input nodes, 2 hidden nodes (1 layer) and 1 output node. However, the error converges up to a point (~20) and then levels off. Here's what I've tried, and none of it helped; either the error never converges or it starts diverging:
Transform inputs and output to be between 0 and 1 (see the scaling sketch after this list)
Transform inputs and output to be between -1 and 1
Increase number of hidden layers
Increase number of nodes in hidden layer
Convert the launch position, launch angle and final position into 0s and 1s resulting in ~750+175 inputs and ~750 outputs - no convergence
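For reference, the scaling in the first two items amounts to a simple per-column min-max transform; a minimal Python sketch (the data below is made up purely for illustration):

import numpy as np

# min-max transform of one column to [lo, hi], e.g. [0, 1] or [-1, 1]
def rescale(col, lo=0.0, hi=1.0):
    col = np.asarray(col, dtype=float)
    return lo + (hi - lo) * (col - col.min()) / (col.max() - col.min())

launch_x = np.random.uniform(0, 800, size=1000)  # hypothetical raw launch positions
launch_x_01 = rescale(launch_x, 0.0, 1.0)        # in [0, 1]
launch_x_pm1 = rescale(launch_x, -1.0, 1.0)      # in [-1, 1]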
So, after spending all night and morning and making my brain and body revolt against me, I'm hoping someone can help me identify the problem here. Is this a task that's just not solvable by a neural network or am I doing something wrong?
PS: I'm using the online version of Neuroph and not coding my own procedure. At least this should help me avoid implementation issues.

If it doesn't minimize the training error, that's most likely a bug in the implementation. If you're measuring the accuracy on a held-out test set, on the other hand, there's nothing surprising about the error going up after a while.
As to the formulation, I think that with a sufficient amount of training data and a sufficiently long training time, a sufficiently complex NN can learn the mapping whether you binarize the input or not (provided the implementation you use supports non-binary input and output). I have only a vague idea of what "sufficient" means in the above sentence, but I'd venture a guess that 1000 samples won't do. Note also that the more complex the network, the more data it will generally need to estimate the parameters.

To eliminate potential implementation issues in Neuroph, I'd suggest trying the exact same process (Multi-Layer Perceptron, same parameters, same data, etc.) but use Weka instead.
I've used the MLP in Weka before with success, so I can verify that this implementation works correctly. I know Weka has fairly high penetration in the academic community and it's fairly well vetted, but I'm not sure about Neuroph since it's newer. If you get the same results as with Neuroph, then you know the issue is in your data, your neural net topology, or the configuration.
Qnan brings up a good point - what exactly is the error you are measuring? To really determine why the training error isn't converging towards zero, you need to determine what exactly it is that the error represents.
Also, how many epochs (i.e., number of iterations) is the neural net running in training before it stops converging?
In Weka, if I recall correctly you can set the training to execute either until the error reaches a certain value or for a certain number of epochs. Looks like Neuroph is the same way, from a quick look.
If you're limiting the number of epochs, try bumping up the number to something significantly higher to give the network more iterations to converge.
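As an aside (this is not from the original answer), another quick way to run such a cross-check with a generous iteration cap is scikit-learn's MLPRegressor; the data below is just a random placeholder for the real normalized samples:

import numpy as np
from sklearn.neural_network import MLPRegressor

# placeholder data: columns are launch x-position and launch angle, target is final x
X = np.random.uniform(size=(1000, 2))
y = np.random.uniform(size=1000)

mlp = MLPRegressor(hidden_layer_sizes=(2,), max_iter=5000, learning_rate_init=0.01)
mlp.fit(X, y)
print(mlp.loss_)  # final training loss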

Related

Reinforcement Learning - Won't Converge

I'm working on my bachelor thesis.
My topic is reinforcement learning. The Setup:
Unity3D (C#)
Own neural network framework
Confirmed that the network works by training it on a sine function.
It can approximate it. Well, there are some values which won't reach their desired value, but it's good enough.
When training it with single values, it always converges.
Here is my problem:
I'm trying to teach my network the Q-value function of a simple game, catch balls:
In this game it just has to catch a ball dropping from a random position and at a random angle.
+1 if catch
-1 if failed
My network model has 1 hidden layer, with the number of neurons ranging from 45 to 180 (I tested these numbers with no success).
It uses experience replay with 32 samples from a 100k memory and a learning rate of 0.0001.
It learns for 50,000 frames, then tests for 10,000 frames. This happens 10 times.
Inputs are PlatformPosX, BallPosX, BallPosY from the last 4 frames
Pseudocode:
Choose action (epsilon-greedy)
Do action
Store (state, action, currentReward, done) in memory
If in learn phase: replay
My problem is:
Its actions start clipping to either 0 or 1, sometimes with some variance.
It never reaches an ideal policy, such as the platform simply following the ball.
EDIT:
Sorry for the sparse info...
My quality function (Q-function) is trained with the target:
reward + gamma * nextEstimatedReward
so it is discounted.
Why would you possibly expect that to work?
Your training can barely approximate a 1-dimensional function, and now you expect it to solve a 12-dimensional function which involves a differential equation? You should have verified first whether your training even converges for a multi-dimensional function at all with the chosen training parameters.
Your training, given the little detail you provided, also appears to be unsuitable. There is hardly a chance it ever successfully catches the ball, and even when it does, you are rewarding it mostly for random outputs. The only correlation between input and output is in the last few frames, when the pad can only reach the target in time through a limited set of possible actions.
Then there is the choice of inputs. Don't require your model to differentiate by itself. Relevant inputs would have been x, y, dx, dy - preferably x and y relative to the pad position rather than world coordinates. That would have a much better chance to converge, even if it only learned to keep x minimal.
Working with absolute world coordinates is pretty much bound to fail, as it would require the training to cover the entire range of possible input combinations, and the network to be big enough to even store all the combinations. Be aware that the network isn't learning the actual function; it's learning an approximation for every single possible set of inputs. Even if the ideal solution is actually just a linear equation, the non-linear properties of the activation function make it impossible to learn it in a generalized form for unbounded inputs.
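A minimal sketch of the input encoding suggested above (the names are hypothetical, not taken from the question's code):

# ball position relative to the pad, plus a finite-difference velocity estimate
def make_state(ball_x, ball_y, pad_x, pad_y, prev_ball_x, prev_ball_y):
    x = ball_x - pad_x
    y = ball_y - pad_y
    dx = ball_x - prev_ball_x  # per-frame velocity estimate
    dy = ball_y - prev_ball_y
    return [x, y, dx, dy]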

neural network converges too fast and predicts blank results

I am using a UNet model to train a segmentation algorithm with roughly 1,000 grayscale medical images and 1,000 corresponding masks, where the section of interest in the medical image is white pixels and the background is black.
I am using dice loss and a similar dice score as an accuracy metric to account for the fact that my white pixels are generally fewer in number than the black background pixels. But I am still having a few problems when training:
1) The loss converges too fast. If I have my SGD optimizer's learning rate at 0.01 for example, at around 2 epochs the loss (training and validation) will drop to 0.00009 and the accuracy shoots up and settles at 100% in proportion. Testing on an unseen set gives blank images.
Assumption - Overfitting:
I assumed this was due to overfitting, so I augmented the dataset as much as possible with rigid transformations - flipping and rotating - but it still didn't help.
Also if I test the model against the same data I used to train it, it still predicts blank images. So does this mean it isn't a case of overfitting?
2) The model doesn't look like it's even training. I was able to check the model's output before it reduced all the test data to blackness, but even then the results looked like blurry versions of the original, without segmenting the features highlighted by my training mask.
3) The loss vs. epochs and accuracy vs. epochs charts are very smooth: they show none of the oscillating behaviour that I expect to see when doing semantic segmentation. According to this related post, a smooth chart usually occurs when there is only one class. I, however, assumed that my model would see the training masks (white pixels vs. black pixels) and treat that as a two-class problem. Am I wrong in this assumption?
4) According to this post, dice is good for an unbalanced training set. I have also tried to get precision/recall/F1 results as they suggest, but was unable to, and I assume it might be related to my 3rd issue, where the model sees my segmentation task as a single-class problem.
TLDR: How can I fix the black output results I am getting? Can you please help me clarify whether my learning model is actually seeing the white and black pixels in each mask as two separate classes, and if not, what it is actually doing?
Your model is only predicting one class (the background/black pixels) because of the class imbalance.
The loss converges too fast. If I have my SGD optimizer's learning rate at 0.01 for example, at around 2 epochs the loss (training and validation) will drop to 0.00009 and the accuracy shoots up and settles at 100% in proportion. Testing on an unseen set gives blank images.
Lower your learning rate. 0.01 is really high, so try something like 3e-5 for your learning rate and see how your model performs.
Also, having 100% accuracy (supposedly you're using dice?) suggests that you're still measuring plain accuracy, so I believe your model does not actually use dice/dice loss for training and evaluation (code snippets would be appreciated).
Example:
model.compile(optimizer=Adam(lr=TRAIN_SEG_LEARNING_RATE),
              loss=dice_coef_loss,
              metrics=[dice_coef])
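For reference, dice_coef and dice_coef_loss as used in a compile() call like the one above are commonly defined along these lines (a sketch; the asker's actual code may differ):

from keras import backend as K

def dice_coef(y_true, y_pred, smooth=1.0):
    # soft dice coefficient; the smooth term avoids division by zero on empty masks
    y_true_f = K.flatten(y_true)
    y_pred_f = K.flatten(y_pred)
    intersection = K.sum(y_true_f * y_pred_f)
    return (2.0 * intersection + smooth) / (K.sum(y_true_f) + K.sum(y_pred_f) + smooth)

def dice_coef_loss(y_true, y_pred):
    return 1.0 - dice_coef(y_true, y_pred)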
Also if I test the model against the same data I used to train it, it still predicts blank images. So does this mean it isn't a case of overfitting?
Try using model.evaluate(test_data, test_label). If the evaluated performance looks good even though the predictions are blank (the dice coefficient should be extremely low if you're only predicting 0s), then either your labels are messed up in some way or there is something wrong with your pipeline.
Possible Solutions if all else fails:
make sure to go through all the sanity checks in this article
You might not have enough data, so try to use a patch-wise approach with random crops (see the sketch after this list).
Add more regularization (dropout, BatchNormalization, InstanceNormalization, increasing input image size, etc.)
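A minimal sketch of the patch-wise idea from the list above (patch size and names are illustrative):

import numpy as np

def random_crop(image, mask, size=128):
    # take the same random patch from the image and its mask
    h, w = image.shape[:2]
    top = np.random.randint(0, h - size + 1)
    left = np.random.randint(0, w - size + 1)
    return (image[top:top + size, left:left + size],
            mask[top:top + size, left:left + size])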

Q-learning using neural networks

I'm trying to implement the Deep q-learning algorithm for a pong game.
I've already implemented Q-learning using a table as the Q-function. It works very well and learns how to beat the naive AI within 10 minutes. But I can't make it work using a neural network as a Q-function approximator.
I want to know if I am on the right track, so here is a summary of what I am doing:
I'm storing the current state, action taken and reward as current Experience in the replay memory
I'm using a multilayer perceptron as the Q-function, with 1 hidden layer of 512 hidden units. For the input -> hidden layer I am using a sigmoid activation function; for the hidden -> output layer I'm using a linear activation function.
A state is represented by the position of both players and the ball, as well as the velocity of the ball. Positions are remapped to a much smaller state space.
I am using an epsilon-greedy approach for exploring the state space where epsilon gradually goes down to 0.
When learning, a random batch of 32 subsequent experiences is selected. Then I compute the target Q-values Q(s, a) for each state and action in the batch:
forall Experience e in batch
    if e == endOfEpisode
        target = e.getReward
    else
        target = e.getReward + discountFactor * qMaxPostState
    end
Now that I have a set of 32 target Q-values, I train the neural network on those values using batch gradient descent. I am doing just 1 training step. How many should I do?
I am programming in Java and using Encog for the multilayer perceptron implementation. The problem is that training is very slow and performance is very weak. I think I am missing something, but can't figure out what. I would expect at least a somewhat decent result as the table approach has no problems.
I'm using a multi layer perceptron as Q-function with 1 hidden layer with 512 hidden units.
Might be too big. Depends on your input / output dimensionality and the problem. Did you try fewer?
Sanity checks
Can the network possibly learn the necessary function?
Collect ground truth input/output. Fit the network in a supervised way. Does it give the desired output?
A common error is to get the last activation function wrong. Most of the time, you will want a linear activation function (as you have). Then you want the network to be as small as possible, because RL is pretty unstable: you can have 99 runs where it doesn't work and 1 where it works.
Do I explore enough?
Check how much you explore. Maybe you need more exploration, especially in the beginning?
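One common way to keep exploration high early on is a linearly decaying epsilon with a floor; a small sketch (the numbers are placeholders):

import random

def select_action(q_values, step, eps_start=1.0, eps_end=0.05, decay_steps=50000):
    # linearly anneal epsilon from eps_start to eps_end over decay_steps, then hold it
    eps = max(eps_end, eps_start - (eps_start - eps_end) * step / decay_steps)
    if random.random() < eps:
        return random.randrange(len(q_values))                    # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])   # exploit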
See also
My DQN agent
keras-rl
Try using ReLU (or better, leaky ReLU) units in the hidden layer and a linear activation for the output.
Try changing the optimizer; sometimes SGD with proper learning-rate decay helps.
Sometimes Adam works fine.
Reduce the number of hidden units. It might be just too much.
Adjust the learning rate. The more units you have, the more impact the learning rate has, as the output is the weighted sum of all the neurons before it.
Try using the local position of the ball, meaning ballY - paddleY. This can help drastically, as it reduces the data to "above or below the paddle", distinguished by the sign. Remember: if you use the local position, you won't need the player's paddle position, and the enemy's paddle position must be local too.
Instead of the velocity, you can give it the previous state as an additional input.
The network can calculate the difference between those 2 steps.
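A sketch combining the last two suggestions (hypothetical names, not Encog code):

# ball position relative to the player's paddle for the current and the previous frame;
# the network can infer the velocity from the difference between the two frames
def make_state(ball_x, ball_y, paddle_y, prev_ball_x, prev_ball_y, prev_paddle_y):
    return [ball_x, ball_y - paddle_y,
            prev_ball_x, prev_ball_y - prev_paddle_y]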

Accuracy in Caffe stays at 0.1 and does not change

Throughout the whole training process, the accuracy stays at 0.1. What am I doing wrong?
Model, solver and part of log here:
https://gist.github.com/yutkin/3a147ebbb9b293697010
Topology in png format:
P.S. I am using the latest version of Caffe and a g2.2xlarge instance on AWS.
You're working on the CIFAR-10 dataset, which has 10 classes. When the training of a network commences, the first guesses are essentially random, so your accuracy is 1/N, where N is the number of classes - in your case 1/10, i.e., 0.1. If your accuracy stays the same over time, it implies that your network isn't learning anything. This may happen due to a large learning rate. The basic idea of training a network is that you calculate the loss and propagate it back; the gradients, scaled by the learning rate, are then used to update the current weights and biases. If the learning rate is too big, you may overshoot the local minima every time. If it is too small, convergence will be slow. I see that your base_lr here is 0.01. As far as my experience goes, this is somewhat large. You may want to keep it at 0.001 in the beginning and then reduce it by a factor of 10 whenever you observe that the accuracy is not improving. Anything below 0.00001 usually doesn't make much of a difference. The trick is to observe the progress of the training and make parameter changes as and when required.
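To make the suggested schedule concrete, here is a small framework-agnostic Python sketch of the "divide by 10 when accuracy stops improving" idea (the names and thresholds are made up):

def next_learning_rate(current_lr, accuracy_history, patience=3, floor=1e-5):
    # drop the learning rate by 10x when accuracy has not improved for `patience` epochs
    if len(accuracy_history) > patience and \
            max(accuracy_history[-patience:]) <= max(accuracy_history[:-patience]):
        return max(current_lr / 10.0, floor)
    return current_lr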
I know the thread is quite old but maybe my answer helps somebody. I experienced the same problem with an accuracy like a random guess.
What helped was to set the number of outputs of the last layer before the accuracy layer to the number of labels.
In your case that should be the ip2 layer. Open the model definition of your net and set its num_output to the number of labels.
See Section 4.4 for more information: A Practical Introduction to Deep Learning with Caffe and Python

Replicator Neural Network for outlier detection, Step-wise function causing same prediction

In my project, one of my objectives is to find outliers in aeronautical engine data. I chose to use a Replicator Neural Network to do so and read the following report on it (http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.12.3366&rep=rep1&type=pdf), but I am having a slight issue understanding the step-wise function (page 4, figure 3) and the predictions it causes.
A replicator neural network is best described in the above report, but as background: the one I have built has the same number of outputs as inputs and 3 hidden layers with the following activation functions:
Hidden layer 1: tanh sigmoid, S1(θ) = tanh(θ)
Hidden layer 2: step-wise, S2(θ) = 1/2 + 1/(2(k − 1)) · Σ_j tanh[a3(θ − j/N)]
Hidden layer 3: tanh sigmoid, S1(θ) = tanh(θ)
Output layer 4: standard sigmoid, S3(θ) = 1/(1 + e^(−θ))
I have implemented the algorithm and it seems to be training (the mean squared error decreases steadily during training). The only thing I don't understand is how predictions are made once the middle layer with the step-wise activation function is applied, since it causes the 3 middle nodes' activations to become specific discrete values (e.g., my last activations on the 3 middle nodes were 1.0, -1.0, 2.0). These values are then forward-propagated, and I get very similar or exactly the same predictions every time.
The section on pages 3-4 of the report describes the algorithm best, but I have no idea what I have to do to fix this, and I don't have much time either :(
Any help would be greatly appreciated.
Thank you
I'm facing the problem of implementing this algorithm myself, and here is my insight into the problem you might have had: the middle layer, by using a step-wise function, is essentially performing clustering on the data. Each unit in that layer maps its input to a discrete value, which can be interpreted as a coordinate in a grid system. Imagine we use two neurons in the middle layer with step-wise values ranging from -2 to +2 in increments of 1. This way we define a 5x5 grid where each set of features will be placed. The more steps you allow, the more grid cells; the more grid cells, the more "clusters" you have.
This all sounds fine; after all, we are compressing the data into a lower-dimensional representation which is then used to try to reconstruct the original input.
This step-wise function, however, has a big problem of its own: back-propagation does not work (in theory) with step-wise functions. You can find more about this in this paper, where they suggest replacing the step-wise function with a ramp-like function - that is, having an almost infinite number of clusters.
Your problem might be directly related to this. Try replacing the step-wise function with a ramp-like one and measure how the error changes throughout the learning phase.
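As an illustration of the two activation shapes under discussion, here is a small sketch with arbitrary constants (not the exact formulation from the report):

import numpy as np

def staircase(theta, k=4, a3=100.0):
    # smooth k-level staircase on [0, 1]; with a large a3 the gradient is ~0 almost everywhere
    return 0.5 + (1.0 / (2 * (k - 1))) * sum(np.tanh(a3 * (theta - j / k)) for j in range(1, k))

def ramp(theta):
    # ramp-like alternative: keeps a nonzero gradient, so back-propagation has something to use
    return np.clip(theta, 0.0, 1.0)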
By the way, do you have any of this code available anywhere for other researchers to use?