How to calculate the time complexity of a sequential neural network in TensorFlow? - neural-network

I have a sample neural network and am trying to see how much it would cost me to run it on a server and how long it would take to train if, for example, I add 3 more layers with around 4000, 3000, and 2000 nodes respectively.
I understand that, from a high-level perspective, the network needs to:
Feed the inputs through the network and get the results (which in turn will run sigmoid), which I guess happens in constant time (even though the output may not be constant or even linear!)
Run Adam to optimize the weights/biases, which I guess happens in linear time, since it is like gradient descent and differs mainly in how it manages the learning rate!
Update the weights/biases, which is constant!
I can't find a calculator to estimate the computation needed, and I'm thinking of making one if I can get a good understanding of the different variables in a neural network!
This is the code for my Tensorflow model:
const model = tf.sequential();
model.add(tf.layers.flatten({inputShape: [4317, 5]}));
model.add(tf.layers.dense({units: 1000, activation: 'sigmoid'}));
model.add(tf.layers.dense({units: 4316, activation: 'sigmoid'}));
const optimizer = tf.train.adam();
model.compile({
  optimizer: optimizer,
  loss: 'meanSquaredError'
});
And here is the network summary printed by Tensorflow
_________________________________________________________________
Layer (type)                  Output shape              Param #
=================================================================
flatten_Flatten1 (Flatten)    [null,21585]              0
_________________________________________________________________
dense_Dense1 (Dense)          [null,1000]               21586000
_________________________________________________________________
dense_Dense2 (Dense)          [null,4316]               4320316
=================================================================
Total params: 25906316
Trainable params: 25906316
Non-trainable params: 0
What if I change the activation functions to linear or ReLU?
I have a laptop with 16 GB of memory and a 3.2 GHz 8-core ARMv8-A CPU (M1 chip), and it looks like the laptop takes about a minute to train a batch of 32 inputs.

With N training inputs, each weight is used O(N) times per round of training, so assuming M weights you have roughly O(N*M) training time per round. It doesn't really matter where those weights sit in your network; even for recurrent layers (GRU, RNN, LSTM) this stays true.
Where the analysis breaks down is that you can't let M go to infinity (which is what big-O formally requires), because in that case your network training won't converge anymore. Effectively, it would be O(infinity).
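To make the O(N*M) estimate concrete for the model above, here is a minimal Python sketch. It is an assumption-laden rule of thumb, not an exact model: it treats one training step as costing on the order of ~6 multiply-accumulates per weight per sample (forward pass, backward pass, and Adam update combined). It reproduces the parameter counts from the summary and turns them into a ballpark operation count:

# Rough cost estimate for the dense model above (assumed ~6 ops/weight/sample).
def dense_params(n_in, n_out):
    return (n_in + 1) * n_out   # weights plus one bias per output unit

layers = [(21585, 1000), (1000, 4316)]               # flatten output -> Dense1 -> Dense2
params = sum(dense_params(i, o) for i, o in layers)
print(params)                                         # 25906316, matching the summary

batch_size = 32
ops_per_batch = 6 * params * batch_size               # very rough multiply-accumulate count
print("~%.2e ops per batch of %d" % (ops_per_batch, batch_size))

Switching sigmoid to ReLU or linear only changes the per-activation cost, which is tiny compared to the matrix multiplications, so it barely moves this estimate.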

Related

Neural Network Oscillates Around 0.5

I wanted to create my own neural network - mainly for the fun of it, but also because Khan Academy doesn't allow libraries, and I hadn't seen any good neural nets on the site.
Neural Network Info:
The one I am showing in the images is a 1-2-3-2-1 neural network, although it does this for all layer sizes and amounts. The thicker line is the first training run, which is 5,000 iterations. The thinner line shows another 1,000 iterations after the first training run.
Training Data Info:
I'm making it switch 0 to 1 and 1 to 0. The graphs shown are the loss when trying to change 1 to 0. The dataset looks like this:
[{
  inputs: [0],
  outputs: [1]
}, {
  inputs: [1],
  outputs: [0]
}]
Before each iteration, the dataset is randomized.
I put a neural net together, but when testing I ran across an interesting issue:
It will oscillate around 0.5 about three quarters of the time; the other quarter of the time, it works as intended. When it works, it goes to where it is supposed to (these graphs show the loss, with the line in the center being 0):
Another part of the time (maybe 1/20th, so pretty rarely), it will "stick" at 0.5, but then kick itself out:
Or it'll get it right, but then just mess itself up for no reason (very rare, almost never happens):
And the rest of the time, it will just stay at around 0.5:
I have no clue what's causing these to happen (although I think it might be my implementation of Gradient Descent, found on line 137 of the program), or how to fix them.
You can find the program here:
khanacademy.org/cs/-/6305674778411008
I think this could be overfitting. The neural network reaches the minimum, but after some time the loss starts to grow again and it settles in a local minimum.
But this depends on how your neural network is implemented. You need to check whether your data is normalized, for example between 0 and 1 or between -1 and 1, because if the data is not normalized the gradient can "blow up".
Standardization is important too.
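As a minimal sketch of the normalization mentioned above (plain Python/NumPy; the array here is just hypothetical raw data), min-max scaling to [0, 1] and standardization look like this:

import numpy as np

x = np.array([3.0, 7.0, 1.0, 9.0])               # hypothetical raw inputs

x_minmax = (x - x.min()) / (x.max() - x.min())   # min-max normalization to [0, 1]
x_std = (x - x.mean()) / x.std()                 # standardization: zero mean, unit variance

print(x_minmax, x_std)

For the 0/1 toy data in the question the inputs are already in range, so normalization alone may not explain the oscillation; the gradient descent implementation itself is still worth checking.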

How to train a neural network with Q-Learning

I just implemented Q-Learning without neural networks but I am stuck at implementing them with neural networks.
I will give you pseudocode showing how my Q-Learning is implemented:
train(int iterations):
    buffer = empty buffer
    for i = 0 while i < iterations:
        move = null
        if random(0,1) > threshold:
            move = random_move()
        else:
            move = network_calculate_move()
        input_to_network = game.getInput()
        output_of_network = network.calculate(input_to_network)
        game.makeMove(move)
        reward = game.getReward()
        maximum_next_q_value = max(network.calculate(game.getInput()))
        if reward is 1 or -1:  // either lost or won
            output_of_network[move] = reward
        else:
            output_of_network[move] = reward + discount_factor * maximum_next_q_value
        buffer.add(input_to_network, output_of_network)
        if buffer is full:
            buffer.remove_oldest()
        train_network(buffer)

train_network(buffer b):
    batch = b.extract_random_batch(batch_size)
    for each input, output in batch:
        network.train(input, output, learning_rate)  // one forward/backward pass
My problem right now is that this code works for a buffer size of less than 200.
For any buffer size over 200, my code does not work anymore, so I've got a few questions:
Is this implementation correct? (In theory)
How big should the batch size be compared to the buffer size?
How would one usually train the network? For how long? Until a specific MSE of the whole batch is reached?
Is this implementation correct? (In theory)
Yes, your pseudocode does have the right approach.
How big should the batch size be compared to the buffer size?
Algorithmically speaking, using larger batches in stochastic gradient descent allows you to reduce the variance of your stochastic gradient updates (by taking the average of the gradients in the batch), and this in turn allows you to take bigger step-sizes, which means the optimization algorithm will make progress faster.
The experience replay buffer stores a fixed number of recent memories, and as new ones come in, old ones are removed. When the time comes to train, we simply draw a uniform batch of random memories from the buffer, and train our network with them.
While related, there is no standard value for batch size vs. buffer size. Experimenting with these hyperparameters is one of the joys of deep reinforcement learning.
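As a minimal Python sketch of that replay buffer and uniform batch sampling (names and sizes are hypothetical; it assumes your network exposes a train(input, target) method):

import random
from collections import deque

BUFFER_SIZE = 10000   # assumed values; tune for your problem
BATCH_SIZE = 32

buffer = deque(maxlen=BUFFER_SIZE)    # oldest memories drop out automatically

def remember(state, target_q_values):
    buffer.append((state, target_q_values))

def train_network(network):
    if len(buffer) < BATCH_SIZE:
        return                                         # wait until enough memories exist
    batch = random.sample(list(buffer), BATCH_SIZE)    # uniform random batch
    for state, target in batch:
        network.train(state, target)                   # one forward/backward pass each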
How would one usually train the network? For how long? Until a specific MSE of the whole batch is reached?
Networks are usually trained until they "converge," which means that the Q-values repeatedly show no meaningful changes between episodes.

Character Recognition Using Back Propagation Algorithm Testing

Recently I've been working on character recognition using the Back Propagation Algorithm. I've taken the image and reduced it to 5x7 size, so I got 35 pixels, and trained the network using those pixels with 35 input neurons, 35 hidden nodes, and 10 output nodes. I completed the training successfully and got the weights I needed. And I've got stuck here: I have my test set and I know I should feed it forward through the network, but I don't know what to do exactly. My test set will be 4 samples of 1x35, and my output layer has 10 neurons. How do I distinguish the characters from the output that I will get? I want to know how this testing works. Please guide me through this stage. Thanks in advance.
One vs All
A common approach for testing these types of neural networks is the "one-vs-all" approach. We view each of the output nodes as its own classifier that gives the probability of the sample being that class vs. not being that class.
For instance, if your network outputs [1, 0, ..., 0], then the sample has a high probability of being class 1 (vs. not being class 1), a low probability of being class 2 (vs. not being class 2), and so on.
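A minimal Python/NumPy sketch of this prediction step (forward is a hypothetical name for your own feed-forward pass returning the 10 output activations, passed in as a parameter here):

import numpy as np

def predict(forward, weights, x):
    # x is a 1x35 pixel vector; forward(weights, x) returns 10 output activations
    outputs = forward(weights, x)
    best = np.flatnonzero(outputs == outputs.max())
    return np.random.choice(best)          # argmax; random choice breaks ties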
Ties
In the case of a tie, it is common (in research) to have a random function break the tie: if you get [1, 1, 1, ..., 1], the function would pick a number from 1-10 and that is your prediction. In practice, an expert system is sometimes used to break ties. Perhaps class 1 is more expensive than class 2, so we break the tie in favor of class 2.
Steps
So the steps are:
Split dataset into test/train set
Train weights on train set
Pass test set forward through the neural network
For each sample, choose the argmax (the output with highest value) as your prediction
In case of tie, choose randomly between all tying classes
Aside
In your particular case, I imagine implementing this strategy will result in a network that barely beats random performance (10% accuracy).
I would suggest some reconsidering of the network architecture.
If you look at your 5x7 images, can you tell what number that image was originally? It seems likely that scaling the image down to this size loses so much information that the network cannot distinguish between classes.
Debugging
From what you've described I would look at the following when debugging your network.
Is your data preprocessing (down-scaling) leaching out too much information? Check this by manually inspecting a few of the images and seeing if you can tell what the image should be.
Does your one-hot encoding work? When you convert your targets for training, does it successfully convert 1 -> [1, 0, 0, ..., 0]?
Is your back-prop / gradient descent algorithm correct? You should see a (roughly) monotonic decrease in your loss function while training. Try printing the loss you are optimizing at every step (or every few steps). Or, for a very simple gut check, print the mean squared error: mean((P - Y)^2).
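A minimal Python/NumPy sketch of the last two checks (the one-hot conversion and the MSE gut check; labels are assumed to run 1-10, matching the 1 -> [1, 0, ..., 0] example above):

import numpy as np

def one_hot(label, num_classes=10):
    vec = np.zeros(num_classes)
    vec[label - 1] = 1.0           # label 1 -> [1, 0, ..., 0]
    return vec

assert np.array_equal(one_hot(1), [1, 0, 0, 0, 0, 0, 0, 0, 0, 0])

def mse(predictions, targets):
    # P and Y as arrays of the same shape
    return np.mean((np.asarray(predictions) - np.asarray(targets)) ** 2)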

Tensorflow Inception Multiple GPU Training Loss is not Summed?

I am trying to go through TensorFlow's Inception code for multiple GPUs (on 1 machine). I am confused because, as I understand it, we get multiple losses from the different towers (i.e. the GPUs), but the loss variable that gets evaluated seems to be only that of the last tower, not a sum of the losses from all towers:
for step in xrange(FLAGS.max_steps):
  start_time = time.time()
  _, loss_value = sess.run([train_op, loss])
  duration = time.time() - start_time
Where loss was last defined specifically for each tower:
for i in xrange(FLAGS.num_gpus):
  with tf.device('/gpu:%d' % i):
    with tf.name_scope('%s_%d' % (inception.TOWER_NAME, i)) as scope:
      # Force all Variables to reside on the CPU.
      with slim.arg_scope([slim.variables.variable], device='/cpu:0'):
        # Calculate the loss for one tower of the ImageNet model. This
        # function constructs the entire ImageNet model but shares the
        # variables across all towers.
        loss = _tower_loss(images_splits[i], labels_splits[i], num_classes,
                           scope)
Could someone explain where the step is that combines the losses from the different towers? Or are we simply using a single tower's loss as representative of the other towers' losses as well?
Here's the link to the code:
https://github.com/tensorflow/models/blob/master/inception/inception/inception_train.py#L336
For monitoring purposes, assuming all towers work as expected, a single tower's loss is as representative as the average of all towers' losses. This is because there is no relationship between a batch and the tower it is assigned to.
But the train_op uses gradients from all towers, as per lines 263 and 278, so technically training takes into account the batches from all towers, as it should.
Note that the average of the losses will have lower variance than a single tower's loss, but they will have the same expectation.
Yes, according to this code, losses are not summed or averaged across GPUs. The per-GPU loss is used inside each GPU (tower) for the gradient calculation; only the gradients are synchronized. So the isnan test is only done on the portion of data processed by the last GPU. This is not crucial, but it can be a limitation.
If really needed, I think you can do the following to get the loss averaged across GPUs:
per_gpu_loss = []
for i in xrange(FLAGS.num_gpus):
  with tf.device('/gpu:%d' % i):
    with tf.name_scope('%s_%d' % (inception.TOWER_NAME, i)) as scope:
      ...
      per_gpu_loss.append(loss)
mean_loss = tf.reduce_mean(per_gpu_loss, name="mean_loss")
tf.summary.scalar('mean_loss', mean_loss)
and then replace loss in sess.run with mean_loss:
_, loss_value = sess.run([train_op, mean_loss])
loss_value is now the average of the losses computed by all the GPUs.

Re-Use Sliding Window data for Neural Network for Time Series?

I've read a few ideas on the correct sample size for feed-forward neural networks: 5x, 10x, and 30x the number of weights. That part I'm not overly concerned about; what I am concerned about is whether I can reuse my training data (randomly).
My data is broken up like so
5 independent vars and 1 dependent var per sample.
I was planning on feeding 6 samples in (6x5 = 30 input neurons) and confirming against the 7th sample's dependent variable (1 output neuron).
I would train the neural network by running, say, 6 or 7 iterations before trying to predict the next iteration outside of my training data.
Say I have
each sample = 5 independent variables & 1 dependent variable (6 vars total per sample)
output = just the 1 dependent variable
sample:sample:sample:sample:sample:sample->output(dependent var)
Training sliding window 1:
Set 1: 1:2:3:4:5:6->7
Set 2: 2:3:4:5:6:7->8
Set 3: 3:4:5:6:7:8->9
Set 4: 4:5:6:7:8:9->10
Set 5: 5:6:7:8:9:10->11
Set 6: 6:7:8:9:10:11->12
Non training test:
7:8:9:10:11:12 -> 13
Training Sliding Window 2:
Set 1: 2:3:4:5:6:7->8
Set 2: 3:4:5:6:7:8->9
...
Set 6: 7:8:9:10:11:12->13
Non Training test: 8:9:10:11:12:13->14
I figured I would randomly run through my sets per training iteration, say 30 times the number of my weights. I believe in my network I have about 6 hidden neurons (i.e. sqrt(inputs*outputs)). So 36 + 6 + 1 + 2 bias = 45 weights, and 45 x 30 = 1350 runs?
So I would do a randomization of the 6 sets about 1350 times per training sliding window.
Due to the small amount of data, I figured I would do simulation runs (i.e. rerun the same problem with new starting weights), say 1000 times, in each of which I do 1140 runs over the sliding window using randomization.
I have 113 samples, which results in 101 training "sliding windows".
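A minimal Python sketch of how such sliding-window sets could be built (names are hypothetical; it assumes one row per time step holding the 5 independent variables and the 1 dependent variable):

WINDOW = 6   # samples fed as input per set

def make_sets(rows):
    # rows: list of (independent_vars, dependent_var) tuples, one per time step
    sets = []
    for start in range(len(rows) - WINDOW):
        inputs = []
        for ind_vars, dep_var in rows[start:start + WINDOW]:
            inputs.extend(ind_vars)                 # 6 x 5 = 30 input values
        target = rows[start + WINDOW][1]            # the following sample's dependent variable
        sets.append((inputs, target))
    return sets

Each training sliding window in the question is then just 6 consecutive sets, with the next set held out as the non-training test.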
Another question I have: if I'm trying to predict up or down movement (i.e. the dependent variable), should I match against an actual number or just against whether I guessed the up/down movement correctly? I'm thinking I should shoot for an actual number, but as part of my analysis do a percentage check on whether this number is guessed correctly as up/down.
If you have a small amount of data and a comparatively large number of training iterations, you run the risk of "overtraining": creating a function which works very well on your training data but does not generalize.
The best way to avoid this is to acquire more training data! But if you cannot, then there are two things you can do.
One is to split the training data into training and verification sets, using, say, 85% to train and 15% to verify. Verification means computing the fitness of the learner on the verification set, without adjusting the weights/training. When the verification fitness (which you are not training on) stops improving (in general it will be noisy) while your training fitness continues improving, stop training. If, on the other hand, you use a "sliding window", you may not have a good criterion for when to stop training: the fitness function will bounce around in unpredictable ways. (You might slowly reduce the effect of each training iteration on the parameters to give you convergence; maybe not the best approach, but some training regimes do this.)
The other thing you can do is regularize your node weights via some metric to ensure some notion of "smoothness": if you visualize overfitting for a second, you'll find that in the extreme case your fitness function curves sharply around your dataset positives.
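A minimal Python sketch of that train/verify early-stopping loop (train_one_epoch and evaluate_loss are hypothetical functions standing in for your own network code; the patience value is an assumption):

PATIENCE = 10                      # epochs to wait for an improvement

best_val_loss = float("inf")
epochs_without_improvement = 0

for epoch in range(1000):
    train_one_epoch(network, train_set)              # adjusts weights
    val_loss = evaluate_loss(network, verify_set)    # weights are NOT adjusted here
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        epochs_without_improvement = 0
    else:
        epochs_without_improvement += 1
    if epochs_without_improvement >= PATIENCE:
        break                                        # verification loss stopped improving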
As for the latter question: for the training to converge, your fitness function needs to be smooth. If you were to use just binary all-or-nothing fitness terms, most likely whatever algorithm you are using to train (backprop, BFGS, etc.) would not converge.
In practice, the classification criterion should be an activation that is above a threshold for a positive result, less than or equal to it for a negative result, and that varies smoothly in your weight/parameter space. You can think of 0 as "I am certain that the answer is up" and 1 as "I am certain that the answer is down", and thus build a fitness function that has a higher "cost" for incorrect guesses that were more certain.
There are subtleties possible in how the function is shaped (for example, you might have different ideas about how acceptable a false negative and a false positive are), and you may also introduce regions of "uncertain" where the result is closer to "zero weight", but it should certainly be continuous/smooth.
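A minimal Python sketch contrasting an all-or-nothing score with a smooth loss that penalizes confident wrong guesses more heavily (binary cross-entropy is one common choice, used here only as an illustration):

import math

def zero_one_loss(prediction, target):
    return 0.0 if round(prediction) == target else 1.0   # no useful gradient for training

def cross_entropy(prediction, target, eps=1e-12):
    # smooth: a confident wrong prediction costs much more than a hesitant one
    p = min(max(prediction, eps), 1 - eps)
    return -(target * math.log(p) + (1 - target) * math.log(1 - p))

print(cross_entropy(0.6, 0), cross_entropy(0.99, 0))     # ~0.92 vs ~4.6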
You can re-use sliding windows.
It's basically the same concept as bootstrapping (your training set), which in itself reduces training time, but I don't know if it's really helpful in making the net more adaptive to anything other than the training data.
Below is an example of a sliding window in pictorial format (using spreadsheet magic):
http://i.imgur.com/nxhtgaQ.png
https://github.com/thistleknot/FredAPI/blob/05f74faf85d15f6898aa05b9b08d5363fe27c473/FredAPI/Program.cs
Line 294 shows how the code is run using randomization; it resets the randomization at position 353 so the rest flows as normal.
I was also able to use 1 (up) or 0 (down) as my target values, and the network did converge.