First neural net - weights don't converge

I just wrote my first implementation of a neural network for the game tic-tac-toe.
I have 9 input neurons (1, 0, -1 for the state of each square) and 1 output neuron (from -1 to 1 - that's the value of the state).
I use a simple reinforcement algorithm:
The winning state gets the value +1, the previous state gets 1 - 1*scale, the state before that 1 - 2*scale, and so on.
The losing state gets the value -1, the previous state gets -1 + 1*scale, the state before that -1 + 2*scale, and so on. (optional so far)
I feed these values into the network to get an error function to work with.
In the next iteration, I take all possible states and feed them into the network. Then I select the state with the highest value.
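A minimal sketch of the reward assignment and greedy state selection described above (Python; all names are made up, and `net.predict` stands in for whatever your network's forward pass is):

```python
# Hypothetical sketch of the reward scheme above; names are placeholders.
def assign_targets(states, won, scale=0.1):
    """states: board states from the first move to the terminal one.
    Returns (state, target_value) pairs, walking back from the end of the game."""
    targets = []
    for steps_back, state in enumerate(reversed(states)):
        if won:
            value = 1.0 - steps_back * scale    # +1, 1 - scale, 1 - 2*scale, ...
        else:
            value = -1.0 + steps_back * scale   # -1, -1 + scale, -1 + 2*scale, ...
        targets.append((state, value))
    return list(reversed(targets))

def pick_move(candidate_states, net):
    # Evaluate every reachable state and play the one the network values highest.
    return max(candidate_states, key=lambda s: net.predict(s))
```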
Now the problem is: if I only use the rewards for winning, the algorithm works
just fine; the output values converge to the values calculated by the reward formula, and I get winning rates of around 85% against a random player.
If I use rewards for winning and (negative) rewards for losing, the algorithm does not converge at all and tends to output only positive or only negative values (around +1 or around -1).
Does anybody know what could be the reason behind this? I use the tanh(x) output function for my neurons.
Thanks and best wishes,
beinando

Related

Machine Learning to predict time-series multi-class signal changes

I would like to predict the switching behavior of time-dependent signals. Currently the signal has 3 states (1, 2, 3), but it could be that this will change in the future. For the moment, however, it is absolutely okay to assume three states.
I can make the following assumptions about these states (see picture):
1. The signals repeat periodically, possibly with variations depending on the time of day.
2. The duration of state 2 is always constant and relatively short for all signals.
3. The durations of states 1 and 3 are also constant, but vary between the different signals.
4. The switching sequence is always the same: 1 --> 2 --> 3 --> 2 --> 1 --> [...]
5. There is a constant but unknown time offset between the different signals.
There is no constant time reference between my observations for the different signals. They are simply measured one after the other, but always at different times.
I am able to rebuild my model periodically after I have obtained more samples.
I have the following problems:
I can only observe one signal at a time.
I can only observe the signals at different times.
I cannot trigger my measurement with the state transition. That means, when I measure, I am always "in the middle" of a state. Therefore I don't know when this state has started and also not exactly when this state will end.
I cannot observe a certain signal for a long duration, so I am not able to observe a complete period.
My samples (observations) are widespread in time.
I would like to get a prediction either for the next state change or for the current state at the current time. It is likely that I will never have measured my signals at the requested time.
So far I have tested the TimeSeriesPredictor from the ML.NET toolbox, as it seemed suitable to me. However, in my opinion, this algorithm requires that you always pass only the data of one signal. This means that assumption 5 is not included in the prediction, which is probably suboptimal. Also, in this case I had the problem that the prediction did not change, even though it should change time-dependently when I query multiple predictions. This behavior led me to believe that only the order of the values enters the model, but not the associated timestamp. If I have understood everything correctly, then exactly this timestamp is my most important "feature"...
So far I have not tested regression-based approaches, e.g. FastTree, since my data is not linear but keeps changing states. Maybe this assumption is not valid and regression-based methods could also be suitable?
I also don't know whether a multiclass classifier is required, because I had understood that the TimeSeriesPredictor would also be suitable for this, since it works with the Single data type. Whether the prediction is 1.3 or exactly 1.0 would be fine for me.
To sum it up:
I am looking for an algorithm which is able to recognize the switching patterns based on loose and widespread samples. It would be okay to define boundaries, e.g. the duration of state 3 of signal 1 will never exceed 30 s, or the duration of state 1 of signal 3 will never exceed 60 s.
Then, after the algorithm has obtained an approximate model of the switching behaviour, I would like to request a prediction of a certain signal's state at a certain time.
Which methods can I use to get the best prediction, preferably using the ML.NET toolbox or MATLAB?
Not sure if this is quite what you're looking for, but if you want to detect spikes and changes in signals, check out the anomaly detection algorithms in ML.NET. Here are two tutorials that show how to use them:
Detect anomalies in product sales
  - Spike detection
  - Change point detection
Detect anomalies in time series
  - Detect anomaly period
  - Detect anomaly
One way to approach this would be to first determine the periodicity of each of the signals independently. This could be done by looking at the frequency distribution of the time differences between measurements of state 2 only, separately for each signal.
This will give a multimodal distribution. The shortest peak will correspond to the duration of the switching event (after discarding time differences less than the maximum duration of state 2). The second shortest peak will be the duration between the end of one switching event and the start of the next.
When you have the three periodicity estimates, you can simply calculate the difference between each of them. Given that you have the timestamps of the state-2 measurements for each signal, you should be able to calculate the switching times for all the other signals.
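A rough numpy sketch of this idea, assuming you have, per signal, the timestamps at which state 2 was observed; the function name, bin width, and maximum state-2 duration are all assumptions to be tuned:

```python
import numpy as np

def estimate_period(state2_times, max_state2_duration=5.0, bin_width=1.0):
    """Estimate a signal's period from the timestamps (seconds) of its state-2 observations."""
    t = np.sort(np.asarray(state2_times, dtype=float))
    diffs = np.diff(t)                              # gaps between consecutive state-2 sightings
    diffs = diffs[diffs > max_state2_duration]      # drop gaps that fall inside the same state-2 interval
    # Histogram the remaining gaps; the peaks correspond to the recurring
    # durations between switching events described above.
    bins = np.arange(0.0, diffs.max() + bin_width, bin_width)
    counts, edges = np.histogram(diffs, bins=bins)
    peak = counts.argmax()
    return 0.5 * (edges[peak] + edges[peak + 1])    # centre of the dominant peak
```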

Activation function to get day of week

I'm writing a program to predict when something will happen. I don't know which activation function to use to get the output as a day of the week (1-7).
I tried the sigmoid function, but then I need to input the predicted day and it outputs the probability of that day; I don't want it to work this way.
I expect the activation function to return values from 0 to infinity. Is ReLU the best activation function for this task?
EDIT:
Also, what if I want to output more than 7 days, for example "x will happen on the 9th day from today", or the 15th day from today, etc.? I'm looking for a dynamic way to do this.
What you are trying to do is solve a classification problem with a regression approach. That's at least unconventional.
You can use any activation function you want and define your output as you want, e.g. linear or ReLU with an output range from 1 to 7, or something between -1 (or 0) and 1 like tanh or sigmoid, and then map the output (-1 -> 1; -0.3 -> 2; ...).
The problem for you will be that you get a floating-point number as a result. So your model not only has to learn how to classify correctly, but also how to predict the (almost) exact number you want in your output neuron. That makes the problem more complicated than it has to be. With a model like that it is also likely that for some outlier data points you will get unexpected return values like 0, -1 or 8. What do you do then?
To sum it up: listen to #venkata krishnan, use softmax with seven output neurons, and map the result to a number between 1 and 7 outside the neural network if you have to.
EDIT
What comes to my mind after reading the comments again would be a mix of what you want and what you should do.
You could try to make the second-to-last layer a 7-neuron softmax layer and map those outputs to a single neuron in the last layer.
I have neither tried that nor read about anything like it, so I can't tell you whether it's a good idea (likely not), but you might consider it worth a try.
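For reference, a minimal Keras sketch of the recommended approach (seven softmax outputs, mapped back to a day number outside the network); the layer sizes and the feature dimension of 10 are placeholder assumptions:

```python
import numpy as np
from tensorflow import keras

# One output neuron per day of the week; the network predicts class probabilities.
model = keras.Sequential([
    keras.layers.Dense(32, activation="relu", input_shape=(10,)),  # 10 = assumed feature count
    keras.layers.Dense(7, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# Map predicted probabilities back to a day number 1..7 outside the network
# (random input here is just a stand-in for real features):
probs = model.predict(np.random.rand(1, 10))
day = int(np.argmax(probs)) + 1
```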
I want to add to the point of #venkata krishnan, who raises a valid point about your problem setting. You will find an answer to your original question further down, but I strongly suggest you read the following comment first.
Generally, you want to distinguish between categorical, ordinal and interval variables. I have given a relatively lengthy explanation in a different answer on Stack Overflow; it might be helpful for understanding this concept in more detail.
In your scenario, you mostly want to have an understanding of "how wrong" you are. Of course, it is perfectly reasonable to do what you are doing and interpret the output as an interval variable, and therefore have an assumed ordering (and a distance) between different values.
What is problematic, though, is the fact that you are assuming a continuous space for a discrete variable. E.g., it does not make any sense to interpret an output of 4.3, since you can only distinguish between 4 (Friday, assuming you start numbering your days at 0) and 5 (Saturday). Any value in between would have to be rounded, which is perfectly fine - until you want to perform backpropagation on this loss.
It is problematic because you are essentially introducing a non-convex and non-continuous function, no matter how you "round" your values. Again, to exemplify this, suppose you round to the nearest number; then, at the value of 4.5, you would see a sudden jump in the loss, which is non-differentiable and will therefore give your optimizer a hard time, potentially limiting the convergence of your system.
If, instead, you utilize several output neurons, as suggested by #venkata krishnan, you might lose the information of distance (how many days you are off) on paper, but you can of course still interpret your loss in any way you like. This would certainly be the better option for a discrete-valued variable.
To answer your original question: I personally would make sure that your loss function is bounded both above and below, as you could otherwise get undefined/inconsistent loss values that might lead to subpar optimization. One way to do this is to re-scale a sigmoid function (the co-domain of sigmoid over R is [0, 1]). You can then just multiply the output by 6 to get a value range of [0, 6], which (after rounding) covers all the values you want.
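A sketch of that bounded-output idea in Keras (architecture and feature size are assumptions): a single sigmoid output rescaled by 6, so the prediction lands in [0, 6] and can be rounded to a day index.

```python
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Dense(32, activation="relu", input_shape=(10,)),  # 10 = assumed feature count
    keras.layers.Dense(1, activation="sigmoid"),                   # bounded raw output in (0, 1)
    keras.layers.Lambda(lambda x: 6.0 * x),                        # rescale to (0, 6); round at inference time
])
model.compile(optimizer="adam", loss="mse")
```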
As far as I know, there is no such thing as an activation function that yields 0 to infinity. You can use 7 output nodes with a "softmax" activation function, which will return a probability for each day. There is another solution which may work: you can use 3 output nodes with a "binary" (threshold) activation function that returns either 0 or 1. That means you can have 8 different outputs with only 3 nodes: 000, 001, 010, 011, 100, 101, 110 and 111. You can use 7 of them.

Backpropagation through time in stateful RNNs

If I use a stateful RNN in Keras for processing a sequence of length N divided into N parts (each time step is processed individually),
how is backpropagation handled? Does it only affect the last time step, or does it backpropagate through the entire sequence?
If it does not propagate through the entire sequence, is there a way to do this?
The backpropagation horizon is limited to the second dimension of the input sequence, i.e. if your data is of shape (num_sequences, num_time_steps_per_seq, data_dim), then backprop is done over a time horizon of num_time_steps_per_seq. Take a look at
https://github.com/fchollet/keras/issues/3669
There are a couple of things you need to know about RNNs in Keras. By default, the parameter return_sequences=False in all recurrent layers. This means that by default only the activations of the RNN after processing the entire input sequence are returned as output. If you want the activations at every time step and want to optimize every time step separately, you need to pass return_sequences=True as a parameter (https://keras.io/layers/recurrent/#recurrent).
The next thing that is important to know is that all a stateful RNN does is remember the last activation. So if you have a large input sequence and break it up into smaller sequences (which I believe you are doing), the activation in the network is retained after processing the first sequence and therefore affects the activations when processing the second sequence. This has nothing to do with how the network is optimized; the network simply minimizes the difference between the output and the targets you give it.
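A minimal Keras sketch of those two points together, with placeholder shapes (4 sequences per batch, chunks of 10 time steps, 3 features):

```python
from tensorflow import keras

model = keras.Sequential([
    keras.layers.LSTM(16,
                      return_sequences=True,           # emit an output at every time step
                      stateful=True,                   # carry the final state over to the next chunk
                      batch_input_shape=(4, 10, 3)),   # stateful layers need a fixed batch size
    keras.layers.TimeDistributed(keras.layers.Dense(1)),
])
model.compile(optimizer="adam", loss="mse")

# Feed consecutive chunks of the same long sequences. Gradients only flow back
# over the 10 steps inside each chunk; the carried-over state is treated as a constant.
# model.train_on_batch(x_chunk_1, y_chunk_1)
# model.train_on_batch(x_chunk_2, y_chunk_2)
# model.reset_states()   # call this when you start a new set of sequences
```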
To Q1: how is backpropagation handled? (An RNN is not only fully connected vertically, as in a basic NN, it is also considered deep in time, having horizontal backprop connections in the hidden layer.)
Suppose batch_input_shape=(num_seq, 1, data_dim): "Backprop will be truncated to 1 timestep, as the second dimension is 1. No gradient updates will be performed further back in time than the second dimension's value." - see here
Thus, if the time-step value there is > 1, the gradient WILL be propagated further back in time, over as many time steps as are assigned in the second dimension of the input shape.
Set return_sequences=True for all recurrent layers except the last one (use the last one as the output, or feed it into a Dense layer to produce the needed output). True is needed so that the full sequence is passed from one layer to the next, shifted by +1 in the sliding window, so that backprop can run through the already-estimated weights.
return_state=True is used to get the states returned - 2 state tensors in an LSTM [output, state_h, state_c = layers.LSTM(64, return_state=True, name="encoder")] or 1 state tensor in a GRU - which "can be used in the encoder-decoder sequence-to-sequence model, where the encoder final state is used as the initial state of the decoder."
But remember (in any case): stateful training does not allow shuffling, and is more time-consuming compared with stateless training.
P.S. As you can see here - (c, h) in TF or (h, c) in Keras - both h and c are elements of the output, so both matter in batched or multi-threaded training.
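To illustrate the return_state point, a minimal encoder-decoder sketch in Keras, where the encoder's final (h, c) states seed the decoder; the sizes (64 units, 8 features) are placeholders:

```python
from tensorflow import keras
from tensorflow.keras import layers

encoder_in = keras.Input(shape=(None, 8))
enc_out, state_h, state_c = layers.LSTM(64, return_state=True, name="encoder")(encoder_in)

decoder_in = keras.Input(shape=(None, 8))
dec_out = layers.LSTM(64, return_sequences=True, name="decoder")(
    decoder_in, initial_state=[state_h, state_c])          # encoder final state -> decoder initial state
outputs = layers.Dense(8, activation="softmax")(dec_out)

model = keras.Model([encoder_in, decoder_in], outputs)
```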

Re-Use Sliding Window data for Neural Network for Time Series?

I've read a few ideas on the correct sample size for feed-forward neural networks: 5x, 10x, and 30x the number of weights. This part I'm not overly concerned about; what I am concerned about is whether I can reuse my training data (randomly).
My data is broken up like so:
5 independent vars and 1 dependent var per sample.
I was planning on feeding 6 samples in (6 x 5 = 30 input neurons) and predicting the 7th sample's dependent variable (1 output neuron).
I would train the neural network by running, say, 6 or 7 iterations before trying to predict the next value outside of my training data.
Say I have
each sample = 5 independent variables & 1 dependent variable (6 vars total per sample)
output = just the 1 dependent variable
sample:sample:sample:sample:sample:sample->output(dependent var)
Training sliding window 1:
Set 1: 1:2:3:4:5:6->7
Set 2: 2:3:4:5:6:7->8
Set 3: 3:4:5:6:7:8->9
Set 4: 4:5:6:7:8:9->10
Set 5: 5:6:7:8:9:10->11
Set 6: 6:7:8:9:10:11->12
Non training test:
7:8:9:10:11:12 -> 13
Training Sliding Window 2:
Set 1: 2:3:4:5:6:7->8
Set 2: 3:4:5:6:7:8->9
...
Set 6: 7:8:9:10:11:12->13
Non Training test: 8:9:10:11:12:13->14
I figured I would randomly run through my sets per training iteration, say 30 times the number of my weights. I believe in my network I have about 6 hidden neurons (i.e. sqrt(inputs*outputs)). So 36 + 6 + 1 + 2 bias = 45 weights. So 44 x 30 = 1200 runs?
So I would do a randomization of the 6 sets 1200 times per training sliding window.
I figured that due to the small amount of data, I was going to do simulation runs (i.e. rerun the same problem with new initial weights), say 1000 times, in each of which I do 1140 runs over the sliding window using randomization.
I have 113 samples, which results in 101 training "sliding windows".
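For concreteness, a small sketch of how those windows can be built (Python/numpy; the window length and shapes follow the layout above, everything else is a placeholder):

```python
import numpy as np

def make_windows(features, targets, window=6):
    """features: shape (n_samples, 5); targets: shape (n_samples,).
    Each window flattens 6 consecutive samples (30 inputs) and targets the
    dependent variable of the sample that follows them."""
    X, y = [], []
    for start in range(len(features) - window):
        X.append(features[start:start + window].ravel())   # 6 x 5 = 30 inputs
        y.append(targets[start + window])                   # 7th sample's dependent variable
    return np.array(X), np.array(y)
```

Each "training sliding window" in the question then just takes six consecutive rows of X/y for training and holds out the following row as the non-training test.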
Another question I have: I'm trying to predict up or down movement (i.e. the dependent variable). Should I fit to an actual number, or just to whether I guessed the up/down movement correctly? I'm thinking I should shoot for an actual number, but as part of my analysis do a percentage check on whether this number gets the up/down direction right.
If you have a small amount of data and a comparatively large number of training iterations, you run the risk of "overtraining" - creating a function which works very well on your training data but does not generalize.
The best way to avoid this is to acquire more training data! But if you cannot, then there are two things you can do.
One is to split the training data into training and verification data - using, say, 85% to train and 15% to verify. Verification means computing the fitness of the learner on the verification set without adjusting the weights/training. When the verification fitness (which you are not training on) stops improving (in general it will be noisy) while your training fitness continues improving, stop training. If on the other hand you use a "sliding window", you may not have a good criterion for knowing when to stop training - the fitness function will bounce around in unpredictable ways. (You could slowly reduce the effect of each training iteration on the parameters to force convergence; maybe not the best approach, but some training regimes do this.)
The other thing you can do is normalize your node weights via some metric to ensure some notion of 'smoothness' - if you visualize overfitting for a second, you'll find that in the extreme case your fitness function curves sharply around your dataset positives...
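A language-agnostic sketch of that verification/early-stopping loop (Python; train_step, evaluate, and the patience value are placeholders for your own training and fitness code):

```python
def train_with_early_stopping(data, train_step, evaluate, patience=20, split=0.85):
    """train_step(train_set, weights) -> new weights (one training iteration);
    evaluate(verify_set, weights) -> error on data the weights are never fitted to."""
    cut = int(len(data) * split)
    train_set, verify_set = data[:cut], data[cut:]       # e.g. 85% train / 15% verify
    best_err, best_weights, stale, weights = float("inf"), None, 0, None
    while stale < patience:
        weights = train_step(train_set, weights)         # adjust weights on the training split only
        err = evaluate(verify_set, weights)              # measure fitness on the held-out split
        if err < best_err:
            best_err, best_weights, stale = err, weights, 0
        else:
            stale += 1                                   # verification error stopped improving
    return best_weights                                  # weights from the best verification point
```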
As for the latter question - for the training to converge, your fitness function needs to be smooth. If you were to just use binary all-or-nothing fitness terms, most likely whatever algorithm you are using to train (backprop, BFGS, etc.) would not converge. In practice, the classification criterion should be an activation that is above a threshold for a positive result, less than or equal to it for a negative result, and that varies smoothly in your weight/parameter space. You can think of 0 as "I am certain that the answer is up" and 1 as "I am certain that the answer is down", and thus realize a fitness function that has a higher "cost" for incorrect guesses that were more certain. There are subtleties possible in how the function is shaped (for example, you might have different ideas about how acceptable a false negative and a false positive are) - and you may also introduce regions of "uncertain" where the result is closer to "zero weight" - but it should certainly be continuous/smooth.
You can re-use sliding windows.
It's basically the same concept as bootstrapping (your training set), which in itself reduces training time, but I don't know if it's really helpful in making the net generalize to anything other than the training data.
Below is an example of a sliding window in pictorial format (using spreadsheet magic)
http://i.imgur.com/nxhtgaQ.png
https://github.com/thistleknot/FredAPI/blob/05f74faf85d15f6898aa05b9b08d5363fe27c473/FredAPI/Program.cs
Line 294 shows how the code is run using randomization; it resets the randomization at position 353 so the rest flows as normal.
I was also able to use a 1 (up) or 0 (down) as my target values and the network did converge.

How can I prevent my program from getting stuck at a local maximum (Feed forward artificial neural network and genetic algorithm)

I'm working on a feed-forward artificial neural network (FFANN) that will take input in the form of a simple calculation and return the result (acting as a pocket calculator). The outcome won't be exact.
The network is trained using a genetic algorithm on the weights.
Currently my program gets stuck at a local maximum at:
5-6% correct answers, with 1% error margin
30 % correct answers, with 10% error margin
40 % correct answers, with 20% error margin
45 % correct answers, with 30% error margin
60 % correct answers, with 40% error margin
I currently use two different genetic algorithms:
The first is a basic selection: picking two random individuals from my population, naming the one with the better fitness the winner and the other the loser. The loser receives one of the weights from the winner.
The second is mutation, where the loser from the selection receives a slight modification based on the number of resulting errors (the fitness is decided by correct and incorrect answers).
So if the network outputs a lot of errors, it will receive a big modification, whereas if it has many correct answers, we are close to an acceptable goal and the modification will be smaller.
So to the question: What are ways I can prevent my ffann from getting stuck at local maxima?
Should I modify my current genetic algorithm to something more advanced with more variables?
Should I create additional mutation or crossover?
Or should I maybe try to modify my mutation variables to something bigger/smaller?
This is a big topic so if I missed any information that could be needed, please leave a comment
Edit:
Tweaking the mutation numbers to more suitable values has gotten me a better answer rate, but it is still far from acceptable:
10% correct answers, with 1% error margin
33 % correct answers, with 10% error margin
43 % correct answers, with 20% error margin
65 % correct answers, with 30% error margin
73 % correct answers, with 40% error margin
The network is currently a very simple 3 layered structure with 3 inputs, 2 neurons in the only hidden layer, and a single neuron in the output layer.
The activation function used is Tanh, placing values in between -1 and 1.
The selection-type crossover is very simple, working as follows:
[a1, b1, c1, d1] // Selected as winner due to most correct answers
[a2, b2, c2, d2] // Loser
The loser will end up receiving one of the values from the winner, moving the value straight down since I believe the position in the array (of weights) matters to how it performs.
The mutation is very simple, adding a very small value (currently somewhere between about 0.01 and 0.001) to a random weight in the losers array of weights, with a 50/50 chance of being a negative value.
Here are a few examples of training data:
1, 8, -7 // the -7 represents + (1+8)
3, 7, -3 // -3 represents - (3-7)
7, 7, 3 // 3 represents * (7*7)
3, 8, 7 // 7 represents / (3/8)
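A compact sketch of the selection/crossover/mutation scheme described above (Python; fitness() and the population layout are placeholders, and higher fitness is assumed to mean more correct answers):

```python
import random

def tournament_step(population, fitness, mutation_size=0.01):
    """population: list of weight arrays (lists of floats); fitness: higher is better."""
    a, b = random.sample(range(len(population)), 2)            # pick two random individuals
    winner, loser = (a, b) if fitness(population[a]) >= fitness(population[b]) else (b, a)
    # Crossover: the loser receives one weight from the winner, at the same position.
    i = random.randrange(len(population[winner]))
    population[loser][i] = population[winner][i]
    # Mutation: nudge one random weight of the loser, 50/50 positive or negative.
    j = random.randrange(len(population[loser]))
    population[loser][j] += random.choice([-1.0, 1.0]) * mutation_size
```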
Use a niching technique in the GA. The score of every solution (some form of quadratic error, I think) is changed to take into account its similarity to the rest of the population. This maintains diversity inside the population and avoids premature convergence and getting trapped in a local optimum.
Take a look here:
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.100.7342
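A minimal fitness-sharing sketch of that idea (one common niching scheme; it assumes a higher-is-better raw fitness, and the distance function and sharing radius sigma are placeholders you would tune):

```python
def shared_fitness(population, raw_fitness, distance, sigma=1.0):
    """Divide each individual's raw score by the size of its niche, so crowded
    regions of the search space are penalized and diversity is preserved."""
    shared = []
    for ind in population:
        # Each neighbour within distance sigma contributes between 0 and 1 to the niche count.
        niche_count = sum(max(0.0, 1.0 - distance(ind, other) / sigma)
                          for other in population)
        shared.append(raw_fitness(ind) / niche_count)   # niche_count >= 1 because ind counts itself
    return shared
```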
A common problem when using GAs to train ANNs is that the population becomes highly correlated
as training progresses.
You could try increasing mutation chance and/or effect as the error-change decreases.
In English: the population becomes genetically similar due to crossover and fitness selection as a local minimum is approached. You can introduce variation by increasing the chance of mutation.
You can do a simple modification to the selection scheme: the population can be viewed as having a 1-dimensional spatial structure - a circle (consider the first and last locations to be adjacent).
The production of an individual for location i is permitted to involve only parents from i's local neighborhood, where the neighborhood is defined as all individuals within distance R of i. Aside from this restriction no changes are made to the genetic system.
It's only one or a few lines of code and it can help to avoid premature convergence.
References:
Trivial Geography in Genetic Programming (2005) - Lee Spector, Jon Klein
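A sketch of that 1-D "trivial geography" restriction (Python; R and the parent-selection details are placeholder assumptions):

```python
import random

def pick_local_parents(pop_size, i, R=10, n_parents=2):
    """Parents for the individual produced at slot i may only come from slots
    within distance R of i on a ring (first and last slots are adjacent)."""
    neighbourhood = [(i + offset) % pop_size for offset in range(-R, R + 1)]
    return random.sample(neighbourhood, n_parents)   # indices of the chosen parents
```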