Replicator Neural Network for outlier detection, Step-wise function causing same prediction - neural-network

In my project, one of my objectives is to find outliers in aeronautical engine data. I chose to use a Replicator Neural Network for this and read the following report on it (http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.12.3366&rep=rep1&type=pdf). I am having trouble understanding the step-wise function (page 4, figure 3) and the prediction values it produces.
The report above describes replicator neural networks best, but as background: the network I have built has the same number of outputs as inputs and 3 hidden layers, with the following activation functions:
Hidden layer 1 = tanh sigmoid, S1(θ) = tanh(θ)
Hidden layer 2 = step-wise, S2(θ) = 1/2 + 1/(2(k − 1)) · Σ_j tanh[a3(θ − j/N)] (sum over j)
Hidden layer 3 = tanh sigmoid, S1(θ) = tanh(θ)
Output layer 4 = normal sigmoid, S3(θ) = 1/(1 + e^(−θ))
I have implemented the algorithm and it seems to be training (the mean squared error decreases steadily during training). The only thing I don't understand is how predictions are made once the middle layer with the step-wise activation function is applied, since it forces the activations of the 3 middle nodes to take specific discrete values (e.g. my last activations on the 3 middle nodes were 1.0, -1.0, 2.0). These values are then forward propagated, and I get very similar or exactly the same predictions every time.
Pages 3-4 of the report describe the algorithm best, but I have no idea what I have to do to fix this, and I don't have much time either :(
Any help would be greatly appreciated.
Thank you

I'm also working on implementing this algorithm, and here is my insight into the problem you might have had: the middle layer, by using a step-wise function, is essentially performing clustering on the data. Each neuron in that layer maps the data to a discrete number, which can be interpreted as a coordinate in a grid system. Imagine we use two neurons in the middle layer with step-wise values ranging from -2 to +2 in increments of 1. This defines a 5x5 grid into which each set of features is placed. The more steps you allow, the more grid cells, and the more grid cells, the more "clusters" you have.
This all sounds fine. After all, we are compressing the data into a lower-dimensional representation, which is then used to reconstruct the original input.
The step-wise function, however, has a big problem of its own: back-propagation does not work (in theory) with step-wise functions, because the gradient is zero on the flat part of every step, so almost no error signal passes through the layer. You can find more about this in this paper. The authors suggest replacing the step-wise function with a ramp-like function, that is, having an almost infinite number of clusters.
Your problem might be directly related to this. Try replacing the step-wise function with a ramp-like one and measure how the error changes throughout the learning phase.
By the way, do you have any of this code available anywhere for other researchers to use?
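To make the step-wise versus ramp comparison concrete, here is a minimal Python sketch. It assumes k = N and a sum over j = 1..N-1 in the step-wise formula from the question, with N = 4 and a3 = 100 picked purely for illustration; the function names are hypothetical and this is not the report's reference implementation.

```python
import math

def staircase(theta, n=4, a3=100.0):
    # Smooth staircase: 1/2 + 1/(2(n-1)) * sum_{j=1}^{n-1} tanh(a3 * (theta - j/n)).
    # Assumption: k = N = n, so the output takes n levels between 0 and 1.
    total = sum(math.tanh(a3 * (theta - j / n)) for j in range(1, n))
    return 0.5 + total / (2.0 * (n - 1))

def staircase_grad(theta, n=4, a3=100.0):
    # Derivative of the staircase; it is close to zero on every flat step.
    total = sum(1.0 - math.tanh(a3 * (theta - j / n)) ** 2 for j in range(1, n))
    return a3 * total / (2.0 * (n - 1))

def ramp(theta):
    # Ramp-like replacement: linear between 0 and 1, clipped outside.
    return min(1.0, max(0.0, theta))

def ramp_grad(theta):
    # The ramp passes a constant gradient everywhere inside (0, 1).
    return 1.0 if 0.0 < theta < 1.0 else 0.0

if __name__ == "__main__":
    for theta in (0.30, 0.375, 0.45):
        print(theta, staircase(theta), staircase_grad(theta), ramp_grad(theta))
```

Inputs that fall on the same step give nearly identical outputs and a near-zero gradient, which is consistent with the "same prediction every time" symptom; the ramp keeps a usable gradient over the whole interval.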

Related

Q-learning using neural networks

I'm trying to implement the Deep q-learning algorithm for a pong game.
I've already implemented Q-learning using a table as the Q-function. It works very well and learns how to beat the naive AI within 10 minutes. But I can't make it work using a neural network as a Q-function approximator.
I want to know if I am on the right track, so here is a summary of what I am doing:
I'm storing the current state, action taken and reward as current Experience in the replay memory
I'm using a multi layer perceptron as Q-function with 1 hidden layer with 512 hidden units. For the input -> hidden layer I am using a sigmoid activation function. For the hidden -> output layer I'm using a linear activation function.
A state is represented by the position of both players and the ball, as well as the velocity of the ball. Positions are remapped, to a much smaller state space.
I am using an epsilon-greedy approach for exploring the state space where epsilon gradually goes down to 0.
When learning, a random batch of 32 subsequent experiences is selected. Then I compute the target Q-values Q(s, a) for the current state and action of each experience:
forall Experience e in batch
    if e == endOfEpisode
        target = e.getReward
    else
        target = e.getReward + discountFactor*qMaxPostState
    end
end
Now I have a set of 32 target Q-values, and I train the neural network on those values using batch gradient descent. I am doing just 1 training step. How many should I do?
I am programming in Java and using Encog for the multilayer perceptron implementation. The problem is that training is very slow and performance is very weak. I think I am missing something, but can't figure out what. I would expect at least a somewhat decent result as the table approach has no problems.
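The target computation in the pseudocode above can be sketched in Python as follows. This is only an illustration under assumptions: each experience is taken to be a tuple (state, action, reward, next_state, done), which means the next state is stored as well, and q_values is a hypothetical callable standing in for the network's forward pass.

```python
def q_targets(batch, q_values, discount=0.99):
    # batch: list of (state, action, reward, next_state, done) tuples (assumed layout).
    # q_values: hypothetical callable mapping a state to a list of Q-values, one per action.
    targets = []
    for state, action, reward, next_state, done in batch:
        # Start from the current predictions so only the taken action is corrected.
        target = list(q_values(state))
        if done:
            target[action] = reward
        else:
            target[action] = reward + discount * max(q_values(next_state))
        targets.append(target)
    return targets

if __name__ == "__main__":
    toy_q = lambda state: [0.0, 1.0]  # stand-in for the network
    batch = [(0, 0, 1.0, 1, False), (1, 1, -1.0, None, True)]
    print(q_targets(batch, toy_q, discount=0.9))
```

Keeping the untouched actions at their current predictions means the regression error is zero for them, so one gradient step only moves the Q-value of the action that was actually taken.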
"I'm using a multi layer perceptron as Q-function with 1 hidden layer with 512 hidden units."
Might be too big. Depends on your input / output dimensionality and the problem. Did you try fewer?
Sanity checks
Can the network possibly learn the necessary function?
Collect ground truth input/output. Fit the network in a supervised way. Does it give the desired output?
A common error is to use the wrong activation function in the last layer. Most of the time, you will want a linear activation function (as you have). Beyond that, you want the network to be as small as possible, because RL is pretty unstable: you can have 99 runs where it doesn't work and 1 where it works.
Do I explore enough?
Check how much you explore. Maybe you need more exploration, especially in the beginning?
See also
My DQN agent
keras-rl
Try using ReLU (or, better, Leaky ReLU) units in the hidden layer and a linear activation for the output.
Try changing the optimizer; sometimes SGD with proper learning-rate decay helps.
Sometimes Adam works fine.
Reduce the number of hidden units. It might be just too much.
Adjust the learning rate. The more units you have, the more impact the learning rate has, as the output is the weighted sum of all neurons before it.
Try using the local position of the ball, meaning ballY - paddleY. This can help drastically, as it reduces the data to above or below the paddle, distinguished by the sign. Remember: if you use the local position, you won't need the player's paddle position, and the enemy's paddle position must be local too.
Instead of the velocity, you can give it the previous state as an additional input. The network can calculate the difference between those 2 steps.
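The two input suggestions above (local positions, and the previous state instead of the velocity) could look roughly like this in Python. The function names and the frame layout (ball_x, ball_y, paddle_y, enemy_y) are hypothetical, and "local" is assumed here to mean relative to the ball's vertical position.

```python
def local_features(ball_x, ball_y, paddle_y, enemy_y):
    # Positions relative to the ball: the sign tells above/below.
    # Assumption: "local" means ball_y minus the respective paddle position.
    return [ball_x, ball_y - paddle_y, ball_y - enemy_y]

def network_input(current, previous):
    # Feed the current and the previous frame instead of an explicit velocity;
    # the network can work out the difference between the two itself.
    return local_features(*current) + local_features(*previous)

if __name__ == "__main__":
    current = (0.5, 0.7, 0.4, 0.9)    # ball_x, ball_y, paddle_y, enemy_y
    previous = (0.4, 0.6, 0.4, 0.9)
    print(network_input(current, previous))
```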

How can I improve the performance of a feedforward network as a q-value function approximator?

I'm trying to navigate an agent in an n*n gridworld domain using Q-Learning + a feedforward neural network as a Q-function approximator. Basically, the agent should find the best/shortest way to reach a certain terminal goal position (+10 reward). For every step it takes, the agent gets a -1 reward. In the gridworld there are also some positions the agent should avoid (-10 reward, terminal states too).
So far I have implemented a Q-learning algorithm that saves all Q-values in a Q-table, and the agent performs well.
In the next step, I want to replace the Q-table with a neural network, trained online after every step of the agent. I tried a feedforward NN with one hidden layer and four outputs, representing the Q-values for the possible actions in the gridworld (north, south, east, west).
As input I used an n x n zero matrix that has a "1" at the current position of the agent.
To reach my goal I tried to solve the problem from the ground up:
Explore the gridworld with standard Q-Learning and use the Q-map as training data for the Network once Q-Learning is finished
--> worked fine
Use Q-Learning and provide the updates of the Q-map as training data for the NN (batchSize = 1)
--> worked well
Replace the Q-map completely with the NN. (This is the point where it gets interesting!)
-> FIRST MAP: 4 x 4
As described above, I have 16 "discrete" inputs and 4 outputs, and it works fine with 8 neurons (relu) in the hidden layer (learning rate: 0.05). I used an epsilon-greedy policy with an epsilon that decreases from 1 to 0.1 within 60 episodes.
The test scenario is shown here. Performance is compared between standard Q-learning with a Q-map and "neural" Q-learning (in this case I used 8 neurons and different dropout rates).
To sum it up: neural Q-learning works well for small grids, and the performance is okay and reliable.
-> Bigger MAP: 10 x 10
Now I tried to use the neural network for bigger maps.
At first I tried this simple case.
In my case the neural net looks as follows: 100 inputs; 4 outputs; about 30 neurons (relu) in one hidden layer; again I used a decreasing exploration factor for the greedy policy; over 200 episodes the learning rate decreases from 0.1 to 0.015 to increase stability.
At first I had problems with convergence and with interpolation between single positions, caused by the discrete input vector.
To solve this I added some neighbouring positions to the vector, with values depending on their distance to the current position. This improved the learning a lot and the policy got better. Performance with 24 neurons is seen in the picture above.
Summary: the simple case is solved by the network, but only with a lot of parameter tuning (number of neurons, exploration factor, learning rate) and special input transformation.
Now here are my questions/problems I still haven't solved:
(1) My network is able to solve really simple cases and examples in a 10 x 10 map, but it fails as the problem gets a bit more complex. In cases where failing is very likely, the network has no chance of finding a correct policy.
I'm open to any idea that could improve performance in these cases.
(2) Is there a smarter way to transform the input vector for the network? I'm sure that adding the neighbouring positions to the input vector improves the interpolation of the Q-values over the map on the one hand, but on the other hand makes it harder to train special/important positions into the network. I already tried standard Cartesian two-dimensional input (x/y) at an early stage, but failed.
(3) Is there a network type other than a feedforward network with backpropagation that generally produces better results for Q-function approximation? Have you seen projects where a FF-NN performs well with bigger maps?
It's known that Q-Learning + a feedforward neural network as a q-function approximator can fail even in simple problems [Boyan & Moore, 1995].
Rich Sutton has a question related to this in the FAQ on his web site.
A possible explanation is the phenomenon known as interference, described in [Barreto & Anderson, 2008]:
Interference happens when the update of one state–action pair changes the Q-values of other pairs, possibly in the wrong direction.
Interference is naturally associated with generalization, and also happens in conventional supervised learning. Nevertheless, in the reinforcement learning paradigm its effects tend to be much more harmful. The reason for this is twofold. First, the combination of interference and bootstrapping can easily become unstable, since the updates are no longer strictly local. The convergence proofs for the algorithms derived from (4) and (5) are based on the fact that these operators are contraction mappings, that is, their successive application results in a sequence converging to a fixed point which is the solution for the Bellman equation [14,36]. When using approximators, however, this asymptotic convergence is lost, [...]
Another source of instability is a consequence of the fact that in on-line reinforcement learning the distribution of the incoming data depends on the current policy. Depending on the dynamics of the system, the agent can remain for some time in a region of the state space which is not representative of the entire domain. In this situation, the learning algorithm may allocate excessive resources of the function approximator to represent that region, possibly “forgetting” the previous stored information.
One way to alleviate the interference problem is to use a local function approximator. The more independent each basis function is from each other, the less severe this problem is (in the limit, one has one basis function for each state, which corresponds to the lookup-table case) [86]. A class of local functions that have been widely used for approximation is the radial basis functions (RBFs) [52].
So, in your kind of problem (n*n gridworld), an RBF neural network should produce better results.
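As a rough illustration of what a local (RBF) input representation for an n*n gridworld might look like, here is a small Python sketch. It assumes one Gaussian basis function centred on every grid cell and a hand-picked width sigma; both choices, and the function name, are assumptions for illustration and are not taken from the cited papers.

```python
import math

def rbf_features(x, y, n, sigma=1.0):
    # One Gaussian basis function per cell of an n x n grid (assumed layout).
    # A small sigma approaches the one-hot / lookup-table case;
    # a larger sigma generalises more across neighbouring cells.
    feats = []
    for cx in range(n):
        for cy in range(n):
            d2 = (x - cx) ** 2 + (y - cy) ** 2
            feats.append(math.exp(-d2 / (2.0 * sigma ** 2)))
    return feats

if __name__ == "__main__":
    feats = rbf_features(2, 3, n=10)
    print(len(feats), max(feats), feats[2 * 10 + 3])
```

The Q-values for the four actions would then be computed from these features, for example with one linear output unit per action, so an update at one position mostly affects nearby positions only.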
References
Boyan, J. A. & Moore, A. W. (1995) Generalization in reinforcement learning: Safely approximating the value function. NIPS-7. San Mateo, CA: Morgan Kaufmann.
Barreto, A. M. S. & Anderson, C. W. (2008) Restricted gradient-descent algorithm for value-function approximation in reinforcement learning. Artificial Intelligence 172, 454–482.

sigmoid - back propagation neural network

I'm trying to create a sample neural network that can be used for credit scoring. Since this is a complicated structure for me, I'm trying to start small first.
I created a network using back propagation - input layer (2 nodes), 1 hidden layer (2 nodes + 1 bias), output layer (1 node), which uses sigmoid as the activation function for all layers. I'm trying to test it first using a^2 + b^2 = c^2, which means my inputs would be a and b, and the target output would be c.
My problem is that my input and target output values are real numbers which can range over (-∞, +∞). So when I pass these values to my network, my error function would be something like (target - network output). Would that be correct or accurate? After all, I'm taking the difference between the network output (which ranges from 0 to 1) and the target output (which is a large number).
I've read that the solution would be to normalise first, but I'm not really sure how to do this. Should I normalise both the input and target output values before feeding them to the network? Which normalisation function is best to use? I have read about several different methods. After getting the optimized weights and using them to test some data, I'm getting an output value between 0 and 1 because of the sigmoid function. Should I revert the computed values to the un-normalised/original form? Or should I only normalise the target output and not the input values? This has had me stuck for weeks, as I'm not getting the desired outcome and am not sure how to incorporate the normalisation idea into my training algorithm and testing.
Thank you very much!!
So, to answer your questions:
The sigmoid function squashes its input to the interval (0, 1). It is usually useful in classification tasks because you can interpret its output as the probability of a certain class. Your network performs a regression task (you need to approximate a real-valued function), so it's better to use a linear function as the activation after your last hidden layer (in your case also the first :) ), that is, for the output node.
I would advise you not to use the sigmoid function as the activation function in your hidden layers. It's much better to use tanh or relu nonlinearities. A detailed explanation (as well as some useful tips if you want to keep sigmoid as your activation) can be found here.
It's also important to understand that the architecture of your network is not suitable for the task you are trying to solve. You can learn a little bit about what different networks can learn here.
As for normalization: the main reason you should normalize your data is to avoid giving any spurious prior knowledge to your network. Consider two variables: age and income. The first varies from e.g. 5 to 90. The second varies from e.g. 1000 to 100000. The mean absolute value is much bigger for income than for age, so due to the linear transformations in your model, the ANN treats income as more important at the beginning of training (because of random initialization). Now consider that you are trying to classify whether a given person has grey hair :) Is income truly the more important variable for this task?
There are a lot of rules of thumb on how you should normalize your input data. One is to squash all inputs to the [0, 1] interval. Another is to make every variable have mean = 0 and sd = 1. I usually use the second method when the distribution of a given variable is similar to a normal distribution, and the first in other cases.
When it comes to the output, it's usually also useful to normalize it when you are solving a regression task (especially in the multiple regression case), but it's not as crucial as for the input.
You should remember to keep the parameters needed to restore the original scale of your inputs and outputs. You should also remember to compute them only on the training set and apply them to the training, test and validation sets alike.
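The last point (fit the normalization parameters on the training set only, reuse them everywhere, and keep them so the outputs can be restored) can be sketched in Python like this. It uses the mean = 0 / sd = 1 method; the function names and the sample numbers are hypothetical.

```python
def fit_standardizer(train_values):
    # Compute mean and standard deviation on the training data only.
    n = len(train_values)
    mean = sum(train_values) / n
    var = sum((v - mean) ** 2 for v in train_values) / n
    std = var ** 0.5 or 1.0  # guard against a constant column
    return mean, std

def standardize(values, mean, std):
    # Apply the stored training-set parameters to any data split.
    return [(v - mean) / std for v in values]

def restore(values, mean, std):
    # Map normalized values (e.g. network outputs) back to the original scale.
    return [v * std + mean for v in values]

if __name__ == "__main__":
    train_income = [1000.0, 2000.0, 3000.0]
    test_income = [1500.0, 4000.0]
    mean, std = fit_standardizer(train_income)
    scaled_test = standardize(test_income, mean, std)
    print(scaled_test, restore(scaled_test, mean, std))
```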

Why does my neural network trained on MNIST data set not predict 7 and 9 correctly?

I'm using Matlab ( github code repository ). The details of the network are:
Hidden units: 100 ( variable )
Epochs : 500
Batch size: 100
The weights are being updated using Back propagation algorithm.
I've been able to recognize 0,1,2,3,4,5,6,8 which I have drawn in photoshop.
However 7,9 are not recognized, but upon running on the test set I get only 749/10000 wrong and it correctly classifies 9251/10000.
Any idea what might be wrong? It is learning, and based on the test set results it's learning correctly.
I don't see anything downright incorrect in your code, but there is a lot that can be improved:
You use this to set the initial weights:
hiddenWeights = rand(hiddenUnits,inputVectorSize);
outputWeights = rand(outputVectorSize,hiddenUnits);
hiddenWeights = hiddenWeights./size(hiddenWeights, 2);
outputWeights = outputWeights./size(outputWeights, 2);
This will make your weights very small I think. Not only that, but you will have no negative values, so you'll throw away half of the sigmoid's range of values. I suggest you try:
weights = 2*rand(x, y) - 1
Which will generate random numbers in [-1, 1]. You can then try dividing this interval to get smaller weights (try dividing by the sqrt of the size).
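For illustration, the same initialization idea written as a small Python sketch rather than Matlab: uniform values in [-1, 1] scaled by 1/sqrt(number of inputs to the layer). The function name is hypothetical, and interpreting "the size" as the number of columns (the fan-in, as in size(..., 2) above) is an assumption.

```python
import math
import random

def init_weights(rows, cols, rng=random):
    # Uniform in [-1, 1], then divided by sqrt(cols), where cols is assumed
    # to be the number of inputs feeding each unit.
    scale = 1.0 / math.sqrt(cols)
    return [[(2.0 * rng.random() - 1.0) * scale for _ in range(cols)]
            for _ in range(rows)]

if __name__ == "__main__":
    hidden_weights = init_weights(100, 784, random.Random(0))
    flat = [w for row in hidden_weights for w in row]
    print(min(flat), max(flat))
```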
You use this as the output delta:
outputDelta = dactivation(outputActualInput).*(outputVector - targetVector) % (tk-yk)*f'(yin)
Multiplying by the derivative is done if you use the square loss function. For log loss (which is usually the one used in classification), you should have just outputVector - targetVector. It might not make that big of a difference, but you might want to try.
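To see the difference between the two output deltas, here is a minimal Python sketch for a single sigmoid output unit. The function names are hypothetical; the log-loss form assumes a sigmoid output combined with cross-entropy, for which the derivative term cancels.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def output_delta_square(z, target):
    # Square loss: (y - t) multiplied by the sigmoid derivative y * (1 - y).
    y = sigmoid(z)
    return (y - target) * y * (1.0 - y)

def output_delta_log(z, target):
    # Log loss with a sigmoid output: the derivative cancels, leaving y - t.
    return sigmoid(z) - target

if __name__ == "__main__":
    # A confidently wrong output: the target is 1 but the pre-activation is very negative.
    print(output_delta_square(-10.0, 1.0), output_delta_log(-10.0, 1.0))
```

With square loss the delta nearly vanishes when the unit is saturated on the wrong side, while with log loss it stays close to -1, so the error keeps driving the weights.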
You say in the comments that the network doesn't detect your own sevens and nines. This can suggest overfitting on the MNIST data. To address this, you'll need to add some form of regularization to your network: either weight decay or dropout.
You should try different learning rates as well, if you haven't already.
You don't seem to have any bias neurons. Each layer, except the output layer, should have a neuron that only returns the value 1 to the next layer. You can implement this by adding another feature to your input data that is always 1.
MNIST is a big data set for which better algorithms are still being researched. Your network is very basic and small, with no regularization, no bias neurons and no improvements to classic gradient descent. It's not surprising that it's not working too well: you'll likely need a more complex network for better results.
Nothing to do with neural nets or your code, but this picture of KNN-nearest digits shows that some MNIST digits are simply hard to recognize:

Backpropagation learning fails to converge

I use a neural network with 3 layers for a categorization problem: 1) ~2k neurons 2) ~2k neurons 3) 20 neurons. My training set consists of 2 examples, and most of the inputs in each example are zeros. For some reason, after backpropagation training the network gives virtually the same output for both examples (the output is either valid for only 1 of the examples, or has 1.0 wherever either of the examples has a 1). It reaches this state after the first epoch and doesn't change much afterwards, even if the learning rate is the minimal double value. I use sigmoid as the activation function.
I thought something could be wrong with my code, so I used the AForge open source library, and it seems to suffer from the same issue.
What might be the problem here?
Solution: I've removed one layer and decreased the number of neurons in the hidden layer to 800
2000 by 2000 by 20 is huge. That's approximately 4 million weights to determine, meaning the algorithm has to search a 4-million-dimensional space. Any optimization algorithm will be totally at a loss in this case. I'm assuming you're using gradient descent, which is not even that powerful, so likely the algorithm is stuck in a local optimum somewhere in this gigantic search space.
Simplify your model!
Added:
And please also describe in more detail what you're trying to do. Do you really have only 2 training examples? That's like trying to categorize 2 points using a 4-million-dimensional plane. It doesn't make sense to me.
You mentioned that most of the inputs are zero. To reduce the size of your search space, try removing redundancy in your training examples. For instance if
trainingExample[0].inputValue[i] == trainingExample[1].inputValue[i]
then inputValue[i] carries no information for the NN.
Also, in case it wasn't clear: two training examples seems like far too few.
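The redundancy check described above can be sketched in Python as follows: keep only the input positions whose value differs between the training examples. The function names and the toy vectors are hypothetical.

```python
def informative_indices(examples):
    # An input position is informative only if at least one example
    # differs from the first example at that position.
    first = examples[0]
    return [i for i in range(len(first))
            if any(ex[i] != first[i] for ex in examples[1:])]

def project(example, indices):
    # Reduce an example to the informative positions only.
    return [example[i] for i in indices]

if __name__ == "__main__":
    a = [0, 1, 0, 0, 1]
    b = [0, 0, 0, 1, 1]
    keep = informative_indices([a, b])
    print(keep, project(a, keep), project(b, keep))
```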