Neural networks and the XOR function - neural-network

I'm playing with a neural network I implemented myself: it's a trivial forward network using RPROP as a learning algorithm as the only "plus" compared to the basic design.
The network scores decently when I test it against MNIST or when I attempt at image compression, but when I try to model something as simple as the XOR function, sometimes during learning it gets trapped into a local minima, and outputs the following truth table:
0 XOR 0 = 1.4598413968251171e-171
1 XOR 0 = 0.9999999999999998
0 XOR 1 = 0.9999999999999998
1 XOR 1 = 0.5
Often the result after the training is correct, but sometimes 1 XOR 1 outputs 0.5 instead of 1 as it should. It does not really always happens with XOR(1,1), but with other inputs as well. Being the XOR function a "classical" in the literature of back propagation I wonder what's happening here, especially given that my network appears to learn more complex (but perhaps less non-linear) tasks just fine.
My wild guess is that's something wrong with the biases.
Any hint?
Note 1: the network layout above is 2|3|1, but does not change much when I use more hidden units, certain learning attempts still go wrong.
Note 2: I put the implementation into a Gist: https://gist.github.com/antirez/e45939b918868b91ec6fea1d1938db0d

The problem was due to a bug in my implementation: the bias unit of the NN immediately before the output unit was not computed correctly. After fixing the code the XOR function is always computed right.

Related

Ways to Improve Universal Differential Equation Training with sciml_train

About a month ago I asked a question about strategies for better convergence when training a neural differential equation. I've since gotten that example to work using the advice I was given, but when I applied what the same advice to a more difficult model, I got stuck again. All of my code is in Julia, primarily making use of the DiffEqFlux library. In effort to keep this post as brief as possible, I won't share all of my code for everything I've tried, but if anyone wants access to it to troubleshoot I can provide it.
What I'm Trying to Do
The data I'm trying to learn comes from an SIRx model:
function SIRx!(du, u, p, t)
β, μ, γ, a, b = Float32.([280, 1/50, 365/22, 100, 0.05])
S, I, x = u
du[1] = μ*(1-x) - β*S*I - μ*S
du[2] = β*S*I - (μ+γ)*I
du[3] = a*I - b*x
nothing
end;
The initial condition I used was u0 = Float32.([0.062047128, 1.3126149f-7, 0.9486445]);. I generated data from t=0 to 25, sampled every 0.02 (in training, I only use every 20 points or so for speed, and using more doesn't improve results). The data looks like this: Training Data
The UDE I'm training is
function SIRx_ude!(du, u, p, t)
μ, γ = Float32.([1/50, 365/22])
S,I,x = u
du[1] = μ*(1-x) - μ*S + ann_dS(u, #view p[1:lenS])[1]
du[2] = -(μ+γ)*I + ann_dI(u, #view p[lenS+1:lenS+lenI])[1]
du[3] = ann_dx(u, #view p[lenI+1:end])[1]
nothing
end;
Each of the neural networks (ann_dS, ann_dI, ann_dx) are defined using FastChain(FastDense(3, 20, tanh), FastDense(20, 1)). I tried using a single neural network with 3 inputs and 3 outputs, but it was slower and didn't perform any better. I also tried normalizing inputs to the network first, but it doesn't make a significant difference outside of slowing things down.
What I've Tried
Single shooting
The network just fits a line through the middle of the data. This happens even when I weight the earlier datapoints more in the loss function. Single-shot Training
Multiple Shooting
The best result I had was with multiple shooting. As seen here, it's not simply fitting a straight line, but it's not exactly fitting the data eitherMultiple Shooting Result. I've tried continuity terms ranging from 0.1 to 100 and group sizes from 3 to 30 and it doesn't make a significant difference.
Various Other Strategies
I've also tried iteratively growing the fit, 2-stage training with a collocation, and mini-batching as outlined here: https://diffeqflux.sciml.ai/dev/examples/local_minima, https://diffeqflux.sciml.ai/dev/examples/collocation/, https://diffeqflux.sciml.ai/dev/examples/minibatch/. Iteratively growing the fit works well the first couple of iterations, but as the length increases it goes back to fitting a straight line again. 2-stage collocation training works really well for stage 1, but it doesn't actually improve performance on the second stage (I've tried both single and multiple shooting for the second stage). Finally, mini-batching worked about as well as single-shooting (which is to say not very well) but much more quickly.
My Question
In summary, I have no idea what to try. There are so many strategies, each with so many parameters that can be tweaked. I need a way to diagnose the problem more precisely so I can better decide how to proceed. If anyone has experience with this sort of problem, I'd appreciate any advice or guidance I can get.
This isn't a great SO question because it's more exploratory. Did you lower your ODE tolerances? That would improve your gradient calculation which could help. What activation function are you using? I would use something like softplus instead of tanh so that you don't have the saturating behavior. Did you scale the eigenvalues and take into account the issues explored in the stiff neural ODE paper? Larger neural networks? Different learning rates? ADAM? Etc.
This is much better suited for a forum for discussion like the JuliaLang Discourse. We can continue there since walking through this will not be fruitful without some back and forth.

Activation function to get day of week

I'm writing a program to predict when will something happens. I don't know which activation function to get output in day of week (1-7).
I tried sigmoid function but i need to input the predicted day and it output probability of it, I don't want it to be this way.
I expect the activation function returning 0 to infinite, is ReLU the best activation function for this task?
EDIT:
also, what if i wanted output more than 7 days, for example, x will hapen in 9th day from today, or 15th day from today, etc? I'm looking for dynamic ways to do this
What you are trying to do is solving a classification problem with a regression approach. That's at least unconventional.
You can use any activation function you want and define your output as you want. E.g. linear, relu with output range from 1 to 7 or something between -1(or 0) and 1 like tanh or sigmoid and map the output (-1 -> 1; -0.3 -> 2; ...).
The problem for you will be that you get a floatingpoint number as a result. So your model not only has to learn how to classify correctly but also how to predict the (allmost) exact number you want in your output neuron. That makes the problem more complicated than it has to be. With a model like that it also will be likley that for some outlier datapoints you might get unexpected return values like 0, -1 or 8. What do you do then?
To sum it up: Listen to #venkata krishnan, use softmax and seven output neurons and map this result to a number between 1 and 7 outside the neural network if you have to.
EDIT
What comes to my mind after reading the comments again would be a mix of what you want and what you should do.
You could try to make the second last layer a 7 neuron softmax layer and map those output to a single neuron in the last layer.
Niether did i ever try that nor have i ever read about something like that so i can't tell you if thats a good idea, likely not, but you might consider it worth a try.
I want to add onto the point of #venkata krishnan, which raises a valid point in your problem setting. You will find an answer to your original question further down, but I strongly suggeste you read the following comment first.
Generally, you want to discern between categorical, ordinal and interval variables. I have given a relatively lengthy explanation in a different answer on Stackoverflow, it might be helpful to understand this concept in more detail.
In your scenario, you mostly want to have an understanding of "how wrong" you are. Of course, it is perfectly reasonable to assume what you are doing and interpret it as a interval variable, and therefore have an assumed ordering (and a distance) between different values.
What is problematic, though, is the fact that you are assuming a continuous space on a discrete variable. E.g., it does not make any sense to interpret the output of 4.3, since you can only tell between 4 (Friday, assuming you start numbering your days at 0), or 5 (Saturday). Any value in between would have to be rounded, which is perfectly fine - until you want to perform backpropagation on this loss.
It is problematic, because you are essentially introducing a non-convex and non-continous function, no matter how you "round" your values. Again, to exemplify this, you could assume to round to the nearest number; then, at the value of 4.5, you would see a sudden increase in the loss, which is non-differentialbe, and will therefore put a hard time on your optimizer, potentially limiting convergence of your system.
If, instead, you utilize several output neurons, as suggested by #venkata krishnan, you might lose the information of distance (how many days you are off) on paper, but you can of course still interpret your loss in any way you like. This would certainly be the better option for a discrete-valued variable.
To answer your original question: I personally would make sure that your loss function is bounded both in the upper and lower level, as you could otherwise have undefined/inconsistent loss values, that might lead to subpar optimization. One way to do this is to re-scale a Sigmoid function (the co-domain of sigmoid(R) is [0,1]. Eventually, you can then just multiply your output by 6, to get a value range that is [0,6], and could (after rounding) cover all the values you want.
As far I know, there is no such thing like an activation function which will yield 0 to infinite. You can apply 7 output nodes with a "Softmax" activation function which will return the probability. There is another solution which may work. You can you 3 output nodes with "Binary" activation function which will return either 0 or 1. That means you can have 8 different outputs with only 3 nodes which are 000, 001, 010, 011, 100, 101, 110 and 111. You can use 7 of them. 

Character Recognition Using Back Propagation Algorithm Testing

Recently I've been working on character recognition using Back Propagation Algorithm. I've taken the image and reduced to 5x7 size, therefore I got 35 pixels and trained the network using those pixels with 35 input neurons, 35 hidden nodes, and 10 output nodes. And I had completed the training successfully and I got weights that I needed. And I've got stuck here. I have my test set and I know I should feed forward the network. But I don't know what to do exactly. My test set will be 4 samples of 1x35. My output layer has 10 neurons. how do I exactly distinguish the characters with the output that I will get? I want to know how this testing works. Please guide me through this stage. Thanks in advance.
One vs All
A common approach for testing these types of neural networks is "one-vs-all" approach. We view each of the output nodes as its own classifier that is giving the probability of the sample being that class vs not being that class.
For instance if you network output [1, 0, ..., 0] then class 1 has high probability of being class 1 vs not being class 1. Class 2 has low probability of being class 2 vs not being class 2, etc.
Ties
In the case of a tie, it is common (in research) to have a random function break the tie. If you get [1, 1, 1, ..., 1] then the function would pick a number from 1-10 and that is your prediction. In practice sometimes an expert system is used to break ties. Perhaps class 1 is more expensive than class 2, so we tie in preference to class 2.
Steps
So the steps are:
Split dataset into test/train set
Train weights on train set
Pass test set forward through the neural network
For each sample, choose the argmax (the output with highest value) as your prediction
In case of tie, choose randomly between all tying classes
Aside
In your particular case, I imagine implementation of this strategy will result in a network that barely beats random performance (10%) accuracy.
I would suggest some reconsidering of the network architecture.
If you look at your 5x7 images, can you tell what number that image was originally? It seems likely that scaling the image down to this size losses too much information that the network cannot distinguish between classes.
Debugging
From what you've described I would look at the following when debugging your network.
Is your data preprocessing (down-scaling) leeching out too much information? Check this by manually investigating a few of the images and seeing if you can tell what the image should be.
Does your one-hot algorithm work? When you convert your targets for training, does it successfully convert 1 -> [1, 0, 0, ..., 0]?
Is your back-prop / gradient descent algorithm correct? You should see (roughly) a monotonic decrease in your loss function while training. Try at every step (or every few steps) printing the loss that you are optimizing. Or even for a very simple gut check, print mean squared error: (P-Y)^2

What is the policy gradient when multiple actions are possible?

I am trying to program a reinforcement learning algorithm using policy gradients, as inspired by Karpathy's blog article. Karpathy's example has only two actions UP or DOWN, so a single output neuron is sufficient (high activation=UP, low activation=DOWN). I want to extend this to multiple actions, so I believe I need a softmax activation function on the output layer. However, I am not certain about what the gradient for the output layer should be.
If I was using a cross-entropy loss function with the softmax activation in a supervised learning context, the gradient for neuron is simply:
g[i] = a[i] - target[i]
where target[i] = 1 for the desired action and 0 for all others.
To use this for reinforcement learning I would multiply g[i] by the discounted reward before back-propagating.
However, it seems that reinforcement learning uses negative log-likelihood as the loss instead of cross-entropy. How does that change the gradient?
Note: something that I think will get you on the right track:
The negative log likelihood is also know as the multiclass cross-entropy (Pattern Recognition and Machine Learning).
EDIT: misread the question. I thought this was talking about Deep Deterministic Policy Gradients
It would depend on your domain, but with a softmax, you are getting a probability across all output nodes. To me that doesn't really make sense in most domains when you think about DDPG. For example, if you are controlling the extension of robotic arms and legs, it wouldn't make sense to have limb extension measured as [.25, .25, .25, .25], if you wanted to have all limbs extended. In this case, .25 could mean fully extended, but what happens if the vector of outputs is [.75,.25,0,0]? So in this way, you could have a separate sigmoid function from 0 to 1 for all action nodes, where then you could represent it as [1,1,1,1] for all arms being extended. I hope that makes sense.
Since the actor network is what determines the actions in DDPG, we could then represent our network like this for our robot (rough keras example):
state = Input(shape=[your_state_shape])
hidden_layer = Dense(30, activation='relu')(state)
all_limbs = Dense(4, activation='sigmoid')(hidden_layer)
model = Model(input=state, output=all_limbs)
Then, your critic network will have to account for the action dimensions.
state = Input(shape=[your_state_shape])
action = Input(shape=[4])
state_hidden = Dense(30, activation='relu')(state)
state_hidden_2 = Dense(30, activation='linear')(state_hidden)
action_hidden = Dense(30, activation='linear')(action)
combined = merge([state_hidden_2, action_hidden], mode='sum')
squasher = Dense(30, activation='relu')(combined)
output = Dense(4, activation='linear')(squasher) #number of actions
Then you can use your target functions from there. Note, I don't know if this working code, as I haven't tested it, but hopefully you get the idea.
Source: https://arxiv.org/pdf/1509.02971.pdf
Awesome blog on this with Torc (not created by me): https://yanpanlau.github.io/2016/10/11/Torcs-Keras.html
In the above blog, they also show how to use different output functions, such as one TAHN, and two sigmoid functions for actions.

Re-Use Sliding Window data for Neural Network for Time Series?

I've read a few ideas on the correct sample size for Feed Forward Neural networks. x5, x10, and x30 the # of weights. This part I'm not overly concerned about, what I am concerned about is can I reuse my training data (randomly).
My data is broken up like so
5 independent vars and 1 dependent var per sample.
I was planning on feeding 6 samples in (6x5 = 30 input neurons), confirm the 7th samples dependent variable (1 output neuron.
I would train on neural network by running say 6 or 7 iterations. before trying to predict the next iteration outside of my training data.
Say I have
each sample = 5 independent variables & 1 dependent variables (6 vars total per sample)
output = just the 1 dependent variable
sample:sample:sample:sample:sample:sample->output(dependent var)
Training sliding window 1:
Set 1: 1:2:3:4:5:6->7
Set 2: 2:3:4:5:6:7->8
Set 3: 3:4:5:6:7:8->9
Set 4: 4:5:6:7:8:9->10
Set 5: 5:6:7:6:9:10->11
Set 6: 6:7:8:9:10:11->12
Non training test:
7:8:9:10:11:12 -> 13
Training Sliding Window 2:
Set 1: 2:3:4:5:6:7->8
Set 2: 3:4:5:6:7:8->9
...
Set 6: 7:8:9:10:11:12->13
Non Training test: 8:9:10:11:12:13->14
I figured I would randomly run through my set's per training iteration say 30 times the number of my weights. I believe in my network I have about 6 hidden neurons (i.e. sqrt(inputs*outputs)). So 36 + 6 + 1 + 2 bias = 45 weights. So 44 x 30 = 1200 runs?
So I would do a randomization of the 6 sets 1200 times per training sliding window.
I figured due to the small # of data, I was going to do simulation runs (i.e. rerun over the same problem with new weights). So say 1000 times, of which I do 1140 runs over the sliding window using randomization.
I have 113 variables, this results in 101 training "sliding window".
Another question I have is if I'm trying to predict up or down movement (i.e. dependent variable). Should I match to an actual # or just if I guessed up/down movement correctly? I'm thinking I should shoot for an actual number, but as part of my analysis do a % check on if this # is guessed correctly as up/down.
If you have a small amount of data, and a comparatively large number of training iterations, you run the risk of "overtraining" - creating a function which works very well on your test data but does not generalize.
The best way to avoid this is to acquire more training data! But if you cannot, then there are two things you can do. One is to split the training data into test and verification data - using say 85% to train and 15% to verify. Verification means compute the fitness of the learner on the training set, without adjusting the weights/training. When the verification data fitness (which you are not training on) stops improving (in general it will be noisy), and your training data fitness continues improving - stop training. If on the other hand you use a "sliding window", you may not have a good criterion to know when to stop training - the fitness function will bounce around in unpredictable ways (you might slowly make the effect of each training iteration have less effect on the parameters, however, to give you convergence... maybe not the best approach but some training regimes do this) The other thing you can do normalize out your node's weights via some metric to ensure some notion of 'smoothness' - if you visualize overfitting for a second you'll find that in the extreme case your fitness function sharply curves around your dataset positives...
As for the latter question - for the training to converge, you fitness function needs to be smooth. If you were to just use binary all-or-nothing fitness terms, most likely what would happen is that whatever algorithm you are using to train (backprop, BGFS, etc...) would not converge. In practice, the classification criterion should be an activation that is above for a positive result, less than or equal to for a negative result, and varies smoothly in your weight/parameter space. You can think of 0 as "I am certain that the answer is up" and 1 as "I am certain that the answer is down", and thus realize a fitness function that has a higher "cost" for incorrect guesses that were more certain... There are subtleties possible in how the function is shaped (for example you might have different ideas about how acceptable a false negative and false positive are) - and you may also introduce regions of "uncertain" where the result is closer to "zero weight" - but it should certainly be continuous/smooth.
You can re-use sliding window's.
It basically the same concept as bootstrapping (your training set); which in itself reduces training time, but don't know if it's really helpful in making the net more adaptive to anything other than the training data.
Below is an example of a sliding window in pictorial format (using spreadsheet magic)
http://i.imgur.com/nxhtgaQ.png
https://github.com/thistleknot/FredAPI/blob/05f74faf85d15f6898aa05b9b08d5363fe27c473/FredAPI/Program.cs
Line 294 shows how the code is ran using randomization, it resets the randomization at position 353 so the rest flows as normal.
I was also able to use a 1 (up) or 0 (down) as my target values and the network did converge.