Re-Use Sliding Window data for Neural Network for Time Series? - neural-network

I've read a few ideas on the correct sample size for Feed Forward Neural networks. x5, x10, and x30 the # of weights. This part I'm not overly concerned about, what I am concerned about is can I reuse my training data (randomly).
My data is broken up like so
5 independent vars and 1 dependent var per sample.
I was planning on feeding 6 samples in (6x5 = 30 input neurons), confirm the 7th samples dependent variable (1 output neuron.
I would train on neural network by running say 6 or 7 iterations. before trying to predict the next iteration outside of my training data.
Say I have
each sample = 5 independent variables & 1 dependent variables (6 vars total per sample)
output = just the 1 dependent variable
sample:sample:sample:sample:sample:sample->output(dependent var)
Training sliding window 1:
Set 1: 1:2:3:4:5:6->7
Set 2: 2:3:4:5:6:7->8
Set 3: 3:4:5:6:7:8->9
Set 4: 4:5:6:7:8:9->10
Set 5: 5:6:7:6:9:10->11
Set 6: 6:7:8:9:10:11->12
Non training test:
7:8:9:10:11:12 -> 13
Training Sliding Window 2:
Set 1: 2:3:4:5:6:7->8
Set 2: 3:4:5:6:7:8->9
...
Set 6: 7:8:9:10:11:12->13
Non Training test: 8:9:10:11:12:13->14
I figured I would randomly run through my set's per training iteration say 30 times the number of my weights. I believe in my network I have about 6 hidden neurons (i.e. sqrt(inputs*outputs)). So 36 + 6 + 1 + 2 bias = 45 weights. So 44 x 30 = 1200 runs?
So I would do a randomization of the 6 sets 1200 times per training sliding window.
I figured due to the small # of data, I was going to do simulation runs (i.e. rerun over the same problem with new weights). So say 1000 times, of which I do 1140 runs over the sliding window using randomization.
I have 113 variables, this results in 101 training "sliding window".
Another question I have is if I'm trying to predict up or down movement (i.e. dependent variable). Should I match to an actual # or just if I guessed up/down movement correctly? I'm thinking I should shoot for an actual number, but as part of my analysis do a % check on if this # is guessed correctly as up/down.

If you have a small amount of data, and a comparatively large number of training iterations, you run the risk of "overtraining" - creating a function which works very well on your test data but does not generalize.
The best way to avoid this is to acquire more training data! But if you cannot, then there are two things you can do. One is to split the training data into test and verification data - using say 85% to train and 15% to verify. Verification means compute the fitness of the learner on the training set, without adjusting the weights/training. When the verification data fitness (which you are not training on) stops improving (in general it will be noisy), and your training data fitness continues improving - stop training. If on the other hand you use a "sliding window", you may not have a good criterion to know when to stop training - the fitness function will bounce around in unpredictable ways (you might slowly make the effect of each training iteration have less effect on the parameters, however, to give you convergence... maybe not the best approach but some training regimes do this) The other thing you can do normalize out your node's weights via some metric to ensure some notion of 'smoothness' - if you visualize overfitting for a second you'll find that in the extreme case your fitness function sharply curves around your dataset positives...
As for the latter question - for the training to converge, you fitness function needs to be smooth. If you were to just use binary all-or-nothing fitness terms, most likely what would happen is that whatever algorithm you are using to train (backprop, BGFS, etc...) would not converge. In practice, the classification criterion should be an activation that is above for a positive result, less than or equal to for a negative result, and varies smoothly in your weight/parameter space. You can think of 0 as "I am certain that the answer is up" and 1 as "I am certain that the answer is down", and thus realize a fitness function that has a higher "cost" for incorrect guesses that were more certain... There are subtleties possible in how the function is shaped (for example you might have different ideas about how acceptable a false negative and false positive are) - and you may also introduce regions of "uncertain" where the result is closer to "zero weight" - but it should certainly be continuous/smooth.

You can re-use sliding window's.
It basically the same concept as bootstrapping (your training set); which in itself reduces training time, but don't know if it's really helpful in making the net more adaptive to anything other than the training data.
Below is an example of a sliding window in pictorial format (using spreadsheet magic)
http://i.imgur.com/nxhtgaQ.png
https://github.com/thistleknot/FredAPI/blob/05f74faf85d15f6898aa05b9b08d5363fe27c473/FredAPI/Program.cs
Line 294 shows how the code is ran using randomization, it resets the randomization at position 353 so the rest flows as normal.
I was also able to use a 1 (up) or 0 (down) as my target values and the network did converge.

Related

Neural Network Oscillates Around 0.5

I wanted to create my own neural network - mainly for the fun of it, but also because Khan Academy doesn't allow libraries, and I hadn't seen any good neural nets on the site.
Neural Network Info:
The one I am showing in the images is a 1-2-3-2-1 neural network, although it does this for all layer sizes and amounts. The thicker line is the first training run, which is 5,000 iterations. The thinner line shows another 1,000 iterations after the first training run.
Training Data Info:
I'm making it switch 0 to 1 and 1 to 0. The graphs shown are the loss when trying to change 1 to 0. The dataset looks like this:
[{
inputs: [0],
outputs: [1]
}, {
inputs: [1],
outputs: [0]
}]
Before each iteration, the dataset is randomized.
I put a neural net together, but when testing I ran across an interesting issue:
It will oscillate around 0.5 about 3/4ths of the time. The other 1/4th of the time, it works as intended. Sometimes it will go to where it is supposed to (about a quarter of the time) (These graphs show the loss, with the line in the center being 0):
Another part of the time (maybe 1/20th, so pretty rarely), it will "stick" at 0.5, but then kick itself out:
Or it'll get it right, but then just mess itself up for no reason (very rare, almost never happens):
And the rest of the time, it will just stay at around 0.5:
I have no clue what's causing these to happen (although I think it might be my implementation of Gradient Descent, found on line 137 of the program), or how to fix them.
You can find the program here:
khanacademy.org/cs/-/6305674778411008
I think that this can be a overffitting. the neural network reach the min. but after of some time the loss start to grow again and stop in a local min.
but this depends of how you neural network are implemented. you need see if you data is normalized between 0 and 1 or -1 and 1 for example. because if o data is not normalized the gradient can "break out".
Standartzation is important too.

Set Batch Size *and* Number of Training Iterations for a neural network?

I am using the KNIME Doc2Vec Learner node to build a Word Embedding. I know how Doc2Vec works. In KNIME I have the option to set the parameters
Batch Size: The number of words to use for each batch.
Number of Epochs: The number of epochs to train.
Number of Training Iterations: The number of updates done for each batch.
From Neural Networks I know that (lazily copied from https://stats.stackexchange.com/questions/153531/what-is-batch-size-in-neural-network):
one epoch = one forward pass and one backward pass of all the training examples
batch size = the number of training examples in one forward/backward pass. The higher the batch size, the more memory space you'll need.
number of iterations = number of passes, each pass using [batch size] number of examples. To be clear, one pass = one forward pass + one backward pass (we do not count the forward pass and backward pass as two different passes).
As far as I understand it makes little sense to set batch size and iterations, because one is determined by the other (given the data size, which is given by the circumstances). So why can I change both parameters?
This is not necessarily the case. You can also train "half epochs". For example, in Google's inceptionV3 pretrained script, you usually set the number of iterations and the batch size at the same time. This can lead to "partial epochs", which can be fine.
If it is a good idea or not to train half epochs may depend on your data. There is a thread about this but not a concluding answer.
I am not familiar with KNIME Doc2Vec, so I am not sure if the meaning is somewhat different there. But from the definitions you gave setting batch size + iterations seems fine. Also setting number of epochs could cause conflicts though leading to situations where numbers don't add up to reasonable combinations.

Character Recognition Using Back Propagation Algorithm Testing

Recently I've been working on character recognition using Back Propagation Algorithm. I've taken the image and reduced to 5x7 size, therefore I got 35 pixels and trained the network using those pixels with 35 input neurons, 35 hidden nodes, and 10 output nodes. And I had completed the training successfully and I got weights that I needed. And I've got stuck here. I have my test set and I know I should feed forward the network. But I don't know what to do exactly. My test set will be 4 samples of 1x35. My output layer has 10 neurons. how do I exactly distinguish the characters with the output that I will get? I want to know how this testing works. Please guide me through this stage. Thanks in advance.
One vs All
A common approach for testing these types of neural networks is "one-vs-all" approach. We view each of the output nodes as its own classifier that is giving the probability of the sample being that class vs not being that class.
For instance if you network output [1, 0, ..., 0] then class 1 has high probability of being class 1 vs not being class 1. Class 2 has low probability of being class 2 vs not being class 2, etc.
Ties
In the case of a tie, it is common (in research) to have a random function break the tie. If you get [1, 1, 1, ..., 1] then the function would pick a number from 1-10 and that is your prediction. In practice sometimes an expert system is used to break ties. Perhaps class 1 is more expensive than class 2, so we tie in preference to class 2.
Steps
So the steps are:
Split dataset into test/train set
Train weights on train set
Pass test set forward through the neural network
For each sample, choose the argmax (the output with highest value) as your prediction
In case of tie, choose randomly between all tying classes
Aside
In your particular case, I imagine implementation of this strategy will result in a network that barely beats random performance (10%) accuracy.
I would suggest some reconsidering of the network architecture.
If you look at your 5x7 images, can you tell what number that image was originally? It seems likely that scaling the image down to this size losses too much information that the network cannot distinguish between classes.
Debugging
From what you've described I would look at the following when debugging your network.
Is your data preprocessing (down-scaling) leeching out too much information? Check this by manually investigating a few of the images and seeing if you can tell what the image should be.
Does your one-hot algorithm work? When you convert your targets for training, does it successfully convert 1 -> [1, 0, 0, ..., 0]?
Is your back-prop / gradient descent algorithm correct? You should see (roughly) a monotonic decrease in your loss function while training. Try at every step (or every few steps) printing the loss that you are optimizing. Or even for a very simple gut check, print mean squared error: (P-Y)^2

How to use KNN to classify data in MATLAB?

I'm having problems in understanding how K-NN classification works in MATLAB.ยด
Here's the problem, I have a large dataset (65 features for over 1500 subjects) and its respective classes' label (0 or 1).
According to what's been explained to me, I have to divide the data into training, test and validation subsets to perform supervised training on the data, and classify it via K-NN.
First of all, what's the best ratio to divide the 3 subgroups (1/3 of the size of the dataset each?).
I've looked into ClassificationKNN/fitcknn functions, as well as the crossval function (idealy to divide data), but I'm really not sure how to use them.
To sum up, I wanted to
- divide data into 3 groups
- "train" the KNN (I know it's not a method that requires training, but the equivalent to training) with the training subset
- classify the test subset and get it's classification error/performance
- what's the point of having a validation test?
I hope you can help me, thank you in advance
EDIT: I think I was able to do it, but, if that's not asking too much, could you see if I missed something? This is my code, for a random case:
nfeats=60;ninds=1000;
trainRatio=0.8;valRatio=.1;testRatio=.1;
kmax=100; %for instance...
data=randi(100,nfeats,ninds);
class=randi(2,1,ninds);
[trainInd,valInd,testInd] = dividerand(1000,trainRatio,valRatio,testRatio);
train=data(:,trainInd);
test=data(:,testInd);
val=data(:,valInd);
train_class=class(:,trainInd);
test_class=class(:,testInd);
val_class=class(:,valInd);
precisionmax=0;
koptimal=0;
for know=1:kmax
%is it the same thing use knnclassify or fitcknn+predict??
predicted_class = knnclassify(val', train', train_class',know);
mdl = fitcknn(train',train_class','NumNeighbors',know) ;
label = predict(mdl,val');
consistency=sum(label==val_class')/length(val_class);
if consistency>precisionmax
precisionmax=consistency;
koptimal=know;
end
end
mdl_final = fitcknn(train',train_class','NumNeighbors',know) ;
label_final = predict(mdl,test');
consistency_final=sum(label==test_class')/length(test_class);
Thank you very much for all your help
For your 1st question "what's the best ratio to divide the 3 subgroups" there are only rules of thumb:
The amount of training data is most important. The more the better.
Thus, make it as big as possible and definitely bigger than the test or validation data.
Test and validation data have a similar function, so it is convenient to assign them the same amount
of data. But it is important to have enough data to be able to recognize over-adaptation. So, they
should be picked from the data basis fully randomly.
Consequently, a 50/25/25 or 60/20/20 partitioning is quite common. But if your total amount of data is small in relation to the total number of weights of your chosen topology (e.g. 10 weights in your net and only 200 cases in the data), then 70/15/15 or even 80/10/10 might be better choices.
Concerning your 2nd question "what's the point of having a validation test?":
Typically, you train the chosen model on your training data and then estimate the "success" by applying the trained model to unseen data - the validation set.
If you now would completely stop your efforts to improve accuracy, you indeed don't need three partitions of your data. But typically, you feel that you can improve the success of your model by e.g. changing the number of weights or hidden layers or ... and now a big loops starts to run with many iterations:
1) change weights and topology, 2) train, 3) validate, not satisfied, goto 1)
The long-term effect of this loop is, that you increasingly adapt your model to the validation data, so the results get better not because you so intelligently improve your topology but because you unconsciously learn the properties of the validation set and how to cope with them.
Now, the final and only valid accuracy of your neural net is estimated on really unseen data: the test set. This is done only once and is also useful to reveal over-adaption. You are not allowed to start a second even bigger loop now to prohibit any adaption to the test set!

How can I prevent my program from getting stuck at a local maximum (Feed forward artificial neural network and genetic algorithm)

I'm working on a feed forward artificial neural network (ffann) that will take input in form of a simple calculation and return the result (acting as a pocket calculator). The outcome wont be exact.
The artificial network is trained using genetic algorithm on the weights.
Currently my program gets stuck at a local maximum at:
5-6% correct answers, with 1% error margin
30 % correct answers, with 10% error margin
40 % correct answers, with 20% error margin
45 % correct answers, with 30% error margin
60 % correct answers, with 40% error margin
I currently use two different genetic algorithms:
The first is a basic selection, picking two random from my population, naming the one with best fitness the winner, and the other the loser. The loser receives one of the weights from the winner.
The second is mutation, where the loser from the selection receives a slight modification based on the amount of resulting errors. (the fitness is decided by correct answers and incorrect answers).
So if the network outputs a lot of errors, it will receive a big modification, where as if it has many correct answers, we are close to a acceptable goal and the modification will be smaller.
So to the question: What are ways I can prevent my ffann from getting stuck at local maxima?
Should I modify my current genetic algorithm to something more advanced with more variables?
Should I create additional mutation or crossover?
Or Should I maybe try and modify my mutation variables to something bigger/smaller?
This is a big topic so if I missed any information that could be needed, please leave a comment
Edit:
Tweaking the numbers of the mutation to a more suited value has gotten be a better answer rate but far from approved:
10% correct answers, with 1% error margin
33 % correct answers, with 10% error margin
43 % correct answers, with 20% error margin
65 % correct answers, with 30% error margin
73 % correct answers, with 40% error margin
The network is currently a very simple 3 layered structure with 3 inputs, 2 neurons in the only hidden layer, and a single neuron in the output layer.
The activation function used is Tanh, placing values in between -1 and 1.
The selection type crossover is very simple working like the following:
[a1, b1, c1, d1] // Selected as winner due to most correct answers
[a2, b2, c2, d2] // Loser
The loser will end up receiving one of the values from the winner, moving the value straight down since I believe the position in the array (of weights) matters to how it performs.
The mutation is very simple, adding a very small value (currently somewhere between about 0.01 and 0.001) to a random weight in the losers array of weights, with a 50/50 chance of being a negative value.
Here are a few examples of training data:
1, 8, -7 // the -7 represents + (1+8)
3, 7, -3 // -3 represents - (3-7)
7, 7, 3 // 3 represents * (7*7)
3, 8, 7 // 7 represents / (3/8)
Use a niching techniche in the GA. A useful alternative is niching. The score of every solution (some form of quadratic error, I think) is changed in taking account similarity of the entire population. This maintains diversity inside the population and avoid premature convergence an traps into local optimum.
Take a look here:
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.100.7342
A common problem when using GAs to train ANNs is that the population becomes highly correlated
as training progresses.
You could try increasing mutation chance and/or effect as the error-change decreases.
In English. The population becomes genetically similar due to crossover and fitness selection as a local minim is approached. You can introduce variation by increasing the chance of mutation.
You can do a simple modification to the selection scheme: the population can be viewed as having a 1-dimensional spatial structure - a circle (consider the first and last locations to be adjacent).
The production of an individual for location i is permitted to involve only parents from i's local neighborhood, where the neighborhood is defined as all individuals within distance R of i. Aside from this restriction no changes are made to the genetic system.
It's only one or a few lines of code and it can help to avoid premature convergence.
References:
TRIVIAL GEOGRAPHY IN GENETIC PROGRAMMING (2005) - Lee Spector, Jon Klein