What is the meaning of forward pass and backward pass in neural networks?
Everybody is mentioning these expressions when talking about backpropagation and epochs.
I understood that forward pass and backward pass together form an epoch.
The "forward pass" refers to calculation process, values of the output layers from the inputs data. It's traversing through all neurons from first to last layer.
A loss function is calculated from the output values.
And then "backward pass" refers to process of counting changes in weights (de facto learning), using gradient descent algorithm (or similar). Computation is made from last layer, backward to the first layer.
Backward and forward pass makes together one "iteration".
During one iteration, you usually pass a subset of the data set, which is called "mini-batch" or "batch" (however, "batch" can also mean an entire set, hence the prefix "mini")
"Epoch" means passing the entire data set in batches. One epoch contains (number_of_items / batch_size) iterations
I am trying to make my very first neural network work. I designed it so that I can choose the number of layers and the number of nodes per layer freely. I had a hard time implementing backpropagation, but I think I have done it recursively, even if it is not as performant as it could be. I am using the sigmoid as an activation for all nodes (even the input nodes and the output node).
My network has a single output node in the output layer that should predict a variable (zero or one).
My question is: how exactly should I train my network? I noticed that when I use the following algorithm:
for i in [1:100000]
    feed the same record to my neural network
    perform a forward pass
    compute the error for this record with the current weights, using the square of the difference as the loss function
    update the weights using backpropagation
it converges to the correct result (the output node value converges to zero when the record is labeled with zero, and to one when the record is labeled with one). But when I feed a different record to the network at each iteration of this algorithm, the network completely diverges.
Suppose that I would like to work with a mini-batch of N records. This means that I have to make N forward passes, giving one of the N records as input to the network each time, compute the error, and take the average over the N records. But then, when I would like to use the average error in the backpropagation algorithm, which input record should I use? Because, as far as I know, the input layer is also used to compute the weights between it and the first hidden layer. Should I then use the last one of the N records as input? Or the first one? Does it even matter? I am a bit confused here and I found nothing on the internet to answer this particular question.
Best regards.
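For what it's worth, the usual way to handle this is not to pick one of the N records as "the" input for backpropagation, but to run a forward and backward pass for each record separately and average the resulting weight gradients (equivalently, take the gradient of the average loss). A minimal NumPy sketch with made-up sizes (2 inputs, one sigmoid hidden layer of 2 nodes, one sigmoid output; biases omitted to keep it short):

    import numpy as np

    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

    # Made-up toy network: 2 inputs -> 2 hidden (sigmoid) -> 1 output (sigmoid)
    W1, W2 = np.random.randn(2, 2), np.random.randn(2, 1)

    def grads_for_one_record(x, y):
        # forward pass for this single record
        h = sigmoid(x @ W1)
        out = sigmoid(h @ W2)
        # backward pass for this single record (squared-error loss)
        d_out = 2 * (out - y) * out * (1 - out)
        dW2 = np.outer(h, d_out)
        d_h = (d_out @ W2.T) * h * (1 - h)
        dW1 = np.outer(x, d_h)
        return dW1, dW2

    # Mini-batch of N records: average the per-record gradients, then update once
    batch_X = np.random.randn(8, 2)
    batch_y = np.random.randint(0, 2, size=8)
    dW1_avg, dW2_avg = np.zeros_like(W1), np.zeros_like(W2)
    for x, y in zip(batch_X, batch_y):
        dW1, dW2 = grads_for_one_record(x, y)
        dW1_avg += dW1 / len(batch_X)
        dW2_avg += dW2 / len(batch_X)

    learning_rate = 0.5
    W1 -= learning_rate * dW1_avg
    W2 -= learning_rate * dW2_avg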
What does it mean to "unroll a RNN dynamically". I've seen this specifically mentioned in the Tensorflow source code, but I'm looking for a conceptual explanation that extends to RNN in general.
In the tensorflow rnn method, it is documented:
If the sequence_length vector is provided, dynamic calculation is
performed. This method of calculation does not compute the RNN steps
past the maximum sequence length of the minibatch (thus saving
computational time),
But in the dynamic_rnn method it mentions:
The parameter sequence_length is optional and is used to
copy-through state and zero-out outputs when past a batch element's
sequence length. So it's more for correctness than performance,
unlike in rnn().
So does this mean rnn is more performant for variable length sequences? What is the conceptual difference between dynamic_rnn and rnn?
From the documentation I understand that what they are saying is that the parameter sequence_length in the rnn method affects the performance because, when it is set, it will perform dynamic computation and stop early.
For example, if the rnn's largest input sequence has a length of 50 and the other sequences are shorter, it is better to set sequence_length for each sequence, so that the computation for each sequence stops when that sequence ends and doesn't process the padding zeros up to the 50th time step. However, if sequence_length is not provided, every sequence is considered to have the same length, so the zeros used for padding are treated as normal items in the sequence.
This does not mean that dynamic_rnn is less performant; the documentation says that the parameter sequence_length will not affect the performance there because the computation is already dynamic.
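As a rough illustration (a sketch assuming the TensorFlow 1.x API, with made-up shapes), padded sequences and their true lengths are passed like this:

    import tensorflow as tf  # TensorFlow 1.x style API

    batch_size, max_len, n_features, n_units = 32, 50, 8, 64

    # sequences zero-padded up to max_len = 50 time steps
    inputs = tf.placeholder(tf.float32, [batch_size, max_len, n_features])
    # true length of each sequence in the batch, e.g. [50, 12, 37, ...]
    seq_len = tf.placeholder(tf.int32, [batch_size])

    cell = tf.nn.rnn_cell.BasicLSTMCell(n_units)
    # with sequence_length set, outputs past a sequence's true length are zeroed
    # out and the state is copied through instead of being recomputed
    outputs, state = tf.nn.dynamic_rnn(cell, inputs, sequence_length=seq_len,
                                       dtype=tf.float32)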
Also according to this post about RNNs in Tensorflow:
Internally, tf.nn.rnn creates an unrolled graph for a fixed RNN length. That means, if you call tf.nn.rnn with inputs having 200 time steps you are creating a static graph with 200 RNN steps. First, graph creation is slow. Second, you’re unable to pass in longer sequences (> 200) than you’ve originally specified.
tf.nn.dynamic_rnn solves this. It uses a tf.While loop to dynamically construct the graph when it is executed. That means graph creation is faster and you can feed batches of variable size. What about performance? You may think the static rnn is faster than its dynamic counterpart because it pre-builds the graph. In my experience that’s not the case.
In short, just use tf.nn.dynamic_rnn. There is no benefit to tf.nn.rnn and I wouldn’t be surprised if it was deprecated in the future.
dynamic_rnn is even faster (or equal), so the author suggests using dynamic_rnn anyway.
To better understand dynamic unrolling, consider how you would create an RNN from scratch, but using Tensorflow (I mean without using any RNN library), for a 2-time-step input:
Create two placeholders, X1 and X2
Create two weight variables, Wx and Wy, and a bias b
Calculate output, Y1 = fn(X1 x Wx + b) and Y2 = fn(X2 x Wx + Y1 x Wy + b).
It's clear that we get two outputs, one for each time step. Keep in mind that Y2 depends on X1 indirectly, via Y1.
Now consider you have 50 time steps of inputs, X1 through X50. In this case, you'll have to create 50 outputs, Y1 through Y50. This is what Tensorflow does by dynamic unrolling:
It creates these 50 outputs for you via tf.nn.dynamic_rnn().
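For the 2-time-step case described above, a rough sketch of those steps without any RNN library (assuming the TensorFlow 1.x API, with made-up sizes and tanh standing in for fn):

    import tensorflow as tf  # TensorFlow 1.x style API

    n_inputs, n_neurons = 3, 5

    # one placeholder per time step
    X1 = tf.placeholder(tf.float32, [None, n_inputs])
    X2 = tf.placeholder(tf.float32, [None, n_inputs])

    # shared weights and bias, reused at every time step
    Wx = tf.Variable(tf.random_normal([n_inputs, n_neurons]))
    Wy = tf.Variable(tf.random_normal([n_neurons, n_neurons]))
    b = tf.Variable(tf.zeros([1, n_neurons]))

    # manual "unrolling": one output per time step
    Y1 = tf.tanh(tf.matmul(X1, Wx) + b)
    Y2 = tf.tanh(tf.matmul(X2, Wx) + tf.matmul(Y1, Wy) + b)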
I hope this helps.
LSTM (or GRU) cells are the basis of both.
Imagine an RNN as a stacked deep net with
weight sharing (= the weight and bias matrices are the same in all layers)
input coming "from the side" into each layer
outputs are interpreted by higher layers (i.e. a decoder), one in each layer
The depth of this net should depend on (actually, be equal to) the actual input and output lengths, and nothing else, as the weights are the same in all layers anyway.
Now, the classic way to build this is to group input-output pairs into buckets of fixed maximum length (i.e. model_with_buckets()). dynamic_rnn breaks with this constraint and adapts to the actual sequence lengths.
So no real trade-off here. Except maybe that you'll have to rewrite older code in order to adapt.
I'm trying to create a sample neural network that can be used for credit scoring. Since this is a complicated structure for me, I'm trying to learn them on a small scale first.
I created a network using backpropagation: input layer (2 nodes), 1 hidden layer (2 nodes + 1 bias), output layer (1 node), which uses the sigmoid as the activation function for all layers. I'm trying to test it first using a^2 + b^2 = c^2, which means my inputs would be a and b, and the target output would be c.
My problem is that my input and target output values are real numbers which can range over (-∞, +∞). So when I'm passing these values to my network, my error function would be something like (target - network output). Would that be correct or accurate? In the sense that I'm getting the difference between the network output (which ranges from 0 to 1) and the target output (which is a large number).
I've read that the solution would be to normalise first, but I'm not really sure how to do this. Should I normalise both the input and target output values before feeding them to the network? Which normalisation function is best to use? I have read about different normalisation methods. After getting the optimized weights and using them to test some data, I'm getting an output value between 0 and 1 because of the sigmoid function. Should I revert the computed values to the un-normalised/original form? Or should I only normalise the target output and not the input values? This has really got me stuck for weeks, as I'm not getting the desired outcome and I'm not sure how to incorporate the normalisation idea in my training algorithm and testing.
Thank you very much!!
So, to answer your questions:
The sigmoid function squashes its input to the interval (0, 1). It's usually useful in classification tasks because you can interpret its output as the probability of a certain class. Your network performs a regression task (you need to approximate a real-valued function), so it's better to set a linear function as the activation after your last hidden layer (in your case, also the first one :) ).
I would advise you not to use the sigmoid function as the activation function in your hidden layers. It's much better to use tanh or relu nonlinearities. A detailed explanation (as well as some useful tips if you want to keep the sigmoid as your activation) might be found here.
It's also important to understand that the architecture of your network is not suitable for the task which you are trying to solve. You can learn a little bit about what different networks might learn here.
In the case of normalization: the main reason why you should normalize your data is not to give any spurious prior knowledge to your network. Consider two variables: age and income. The first one varies from e.g. 5 to 90. The second one varies from e.g. 1000 to 100000. The mean absolute value is much bigger for income than for age, so due to the linear transformations in your model, the ANN treats income as more important at the beginning of training (because of the random initialization). Now consider that you are trying to solve a task where you need to classify whether a given person has grey hair :) Is income truly the more important variable for this task?
There are a lot of rules of thumb on how you should normalize your input data. One is to squash all inputs into the [0, 1] interval. Another is to make every variable have mean = 0 and sd = 1. I usually use the second method when the distribution of a given variable is similar to a normal distribution, and the first in other cases.
When it comes to normalizing the output, it's usually also useful to normalize it when you are solving a regression task (especially in the multiple-regression case), but it's not as crucial as in the input case.
You should remember to keep the parameters needed to restore the original scale of your inputs and outputs. You should also remember to compute them only on the training set and to apply them to the training, test and validation sets.
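A minimal NumPy sketch of both rules of thumb, with the normalization parameters computed on the training set only (the data and shapes are made up; the inverse transform at the end is what you use to map the network output back to the original scale):

    import numpy as np

    # Made-up data for a^2 + b^2 = c^2: inputs (a, b), target c
    X_train = np.random.rand(200, 2) * 100
    y_train = np.sqrt((X_train ** 2).sum(axis=1, keepdims=True))

    # Option 1: min-max scaling to [0, 1]
    x_min, x_max = X_train.min(axis=0), X_train.max(axis=0)
    X_train_mm = (X_train - x_min) / (x_max - x_min)

    # Option 2: standardization to mean = 0, sd = 1
    x_mean, x_sd = X_train.mean(axis=0), X_train.std(axis=0)
    X_train_std = (X_train - x_mean) / x_sd

    # Scale the target as well, keeping its parameters to undo the scaling later
    y_min, y_max = y_train.min(), y_train.max()
    y_train_scaled = (y_train - y_min) / (y_max - y_min)

    # At test time: reuse the *training* parameters, never recompute them on test data
    X_test = np.random.rand(50, 2) * 100
    X_test_mm = (X_test - x_min) / (x_max - x_min)

    # After prediction, map the network output back to the original scale:
    # y_pred = y_pred_scaled * (y_max - y_min) + y_min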
I am new to neural nets, and am creating an LSTM from scratch. I have the forward propagation working... but I have a few questions about the moving pieces in forward propagation in the context of a trained model, backpropagation, and memory management.
So, right now, when I run forward propagation, I stack the new columns f_t, i_t, C_t, h_t, etc. onto their corresponding arrays as I accumulate previous positions for the BPTT gradient calculations.
My question has 4 parts:
1) How far back in time do I need to backpropagate in order to retain reasonably long-term memories? (Memory stretching back 20-40 time steps is probably what I need for my system, although I could benefit from a much longer time period; that is just the minimum for decent performance, and I'm only shooting for the minimum right now, so I can get it working.)
2) Once I consider my model "trained", is there any reason for me to keep more than the 2 time steps I need to calculate the next C and h values (where C_t is the cell state, and h_t is the final output of the LSTM net)? In which case I would need multiple versions of the forward propagation function.
3) If I have limited time-series data on which to train, and I want to train my model, will the performance of my model converge as I train it on the training data over and over (as opposed to oscillating around some maximal average performance)? And will it converge if I implement dropout?
4) How many components of the gradient do I need to consider? When I calculate the gradient of the various matrices, I get a primary contribution at time step t, and secondary contributions from time step t-1 (and the calculation recurses all the way back to t=0). In other words: does the primary contribution dominate the gradient calculation, or will the slope change due to the secondary components enough to warrant implementing the recursion as I backpropagate through the time steps?
As you have observed, it depends on the dependencies in the data. But an LSTM can learn longer-term dependencies even though we backpropagate only a few time steps, as long as we do not reset the cell and hidden states.
No. Given c_t and h_t, you can determine c and h for the next time step. Since you don't need to backpropagate, you can throw away c_t (and even h_t, if you are just interested in the final LSTM output).
You might converge if you start overfitting. Using dropout will definitely help to avoid that, especially along with early stopping.
There will be 2 components of the gradient for h_t: one for the current output and one from the next time step. Once you add them both, you won't have to worry about any other components.
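In equation form (a restatement of the point above, using standard BPTT notation), the total gradient flowing into h_t is

    ∂L/∂h_t = ∂L_t/∂h_t + (∂h_{t+1}/∂h_t)^T ∂L/∂h_{t+1}

where the first term comes from the output at step t and the second is the contribution carried back from step t+1; recursing over t from the last step backward automatically accounts for all earlier contributions.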
In my undergrad thesis I am creating a neural network to control an automated shifting algorithm for a vehicle.
I have created the NN from scratch, starting from a .m script which works correctly. I tested it to recognize some shapes.
Some brief background information:
An NN consists of neurons, which are mathematical blocks located in a layer. There are multiple layers; the output of a layer is the input of the next layer. The actual output is subtracted from the known output, and the error is obtained in this manner. Using the backpropagation algorithm, which consists of some algebraic equations, the coefficients of the neurons are updated.
What I want to do is:
In the code there are 6 input matrices (they don't have to be matrices, just anything) and corresponding outputs; let's call them the x(i) matrices and y(i) vectors. In a for loop I go through each matrix and vector to teach the network. Finally, using the last known updated coefficients, the network gives some response to an unknown input.
I couldn't find a way to simulate that for loop in Simulink, so that it goes through each different input and output pair. When the network is done with one pair, it should change the input, compare with the corresponding output, and then update the coefficient matrices.
I modelled the layers as given and fed them with just one input, but I need multiple.
When it comes to the automatic transmission control problem, it should do all of this in real time. It should continuously read the output, update the coefficients, and give the decision.
Check out the "For each Subsystem". Exists since 2011b
To create the input signals you use the "Concatenate" Block which would have six inputs in your case, and a three dimensional output x.dim = [1x20x6] then you could iterate over the third dimension...
A very useful pattern to create smaller models that run faster and to keep your code DRY (Dont repeat yourself)