Max/Mean pooling are ways to generate representations based on the LSTM outputs. How does back-propagation happen in such cases?
I understand how back-propagation happens in the case where pooling is not done. But I would like to know how it happens in the above case. Thanks in advance.
Since max pooling takes the maximum value across all time steps, during back-propagation the error is routed only to the time step that produced that maximum: the weights that contributed to the maximum value in a particular channel are the ones that get updated for that example. (With mean pooling, by contrast, the gradient is split equally across all time steps.)
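To make the routing concrete, here is a small illustrative sketch (not from the original post); H stands for a matrix of LSTM outputs with one column per time step, and dPooled is a made-up gradient coming from the layers above:

H = randn(4, 10);                       % hypothetical LSTM outputs: 4 hidden units x 10 time steps
[pooled, t] = max(H, [], 2);            % max pooling over time; pooled is the representation, t(i) the winning time step
dPooled = randn(4, 1);                  % gradient arriving from the layers above
dH = zeros(size(H));                    % gradient with respect to every LSTM output
dH(sub2ind(size(H), (1:4)', t)) = dPooled;              % max pooling: only the argmax time step receives gradient
dH_mean = repmat(dPooled, 1, size(H, 2)) / size(H, 2);  % mean pooling: gradient spread evenly over all time steps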
I have a neural network for regression, meaning that the output is a real-valued number in the range 0 to 1.
I used dropout for all layers, and the error suddenly increased and never converged.
Is dropout usable for a regression task? Because if we disregard some nodes, the last layer will have fewer nodes and the predicted value will definitely be very different from the actual value, so the back-propagated error will be large and the model will be destroyed. Why, then, should we use dropout for a regression task in neural networks?
Because if we disregard some nodes, the last layer will have fewer nodes and the predicted value will definitely be very different from the actual value.
You are correct. Hence most frameworks scale up the activations of the surviving neurons during training (and don't at prediction time), the so-called inverted dropout. This simple hack is effective and works well in most cases. However, it doesn't work as well for a regression task. It works well where the outputs of the activation only need to be correct relative to each other (as with softmax). In regression the values are absolute, and the small differences between the "train" and "prediction" setups do cause mild instabilities on occasion.
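A minimal sketch of that scaling, often called inverted dropout (the activations a and the rate p below are purely illustrative):

p = 0.5;                            % drop probability (illustrative)
a = rand(100, 1);                   % hypothetical activations of some layer
mask = rand(size(a)) > p;           % keep each unit with probability 1 - p
aTrain = (a .* mask) / (1 - p);     % training: drop units and scale the survivors up
aTest  = a;                         % prediction: no mask and no scaling, so the expected activation matches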
It is always best to start with zero dropout and then increase it slowly to see which value gives the best result.
I used dropout for all layers, and the error suddenly increased and never converged.
This also happens when you use too much dropout, especially in regression tasks. Did you try reducing the dropout rate? Dropout is mainly recommended for layers with a very large number of trainable parameters. Also consider removing dropout from the last layer and then checking again.
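As an illustration of that last point, a toy forward pass with dropout applied to the hidden layer only and not to the regression output (all sizes, rates, and weights below are made up):

x  = randn(8, 1);                              % one input example (made-up size)
W1 = 0.1 * randn(16, 8);  b1 = zeros(16, 1);   % hidden layer parameters
W2 = 0.1 * randn(1, 16);  b2 = 0;              % linear output layer (regression)
p = 0.2;                                       % modest dropout rate, hidden layer only
h = tanh(W1 * x + b1);                         % hidden activations
h = h .* (rand(size(h)) > p) / (1 - p);        % inverted dropout on the hidden layer
y = W2 * h + b2;                               % output layer: no dropout here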
I have gone through neural networks and have understood the derivation of back-propagation almost perfectly (finally!). However, I have a small doubt.
We update all the weights simultaneously, so what is the guarantee that this leads to a smaller cost? If the weights were updated one by one, each update would definitely lead to a lower cost, and it would be similar to linear regression. But if we update all the weights simultaneously, might we not overshoot the minimum?
Also, do we update the biases the same way we update the weights, after each forward and backward pass for each training case?
Lastly, I have started reading about RNNs. What are some good resources for understanding BPTT in RNNs?
Yes, updating only one weight at a time could decrease the error value at every step, but it is usually infeasible in practical neural networks. Most of today's architectures have on the order of 10^6 parameters, so one pass per parameter would take enormously long. Moreover, because of the nature of backpropagation, you usually have to compute many intermediate derivatives in order to obtain the derivative with respect to any single parameter, so such an approach wastes a lot of computation.
But the phenomenon you mention was noticed a long time ago, and there are ways of dealing with it. The two most common issues connected with it are:
Covariate shift: the error and weight updates of a given layer depend strongly on the output of the previous layer, so when you update that layer, the inputs seen by the next layer change. The most common way to deal with this problem right now is batch normalization (a sketch follows this list).
Nonlinear function vs. linear differentiation: it is easy to overlook when you think about BP, but the derivative is a linear operator, which can cause problems for gradient descent. The most counterintuitive example is that if you multiply your input by a constant, every derivative is multiplied by the same number. This can lead to many problems, but most recent learning methods do a good job of dealing with it.
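A minimal batch-normalization sketch to make the first point concrete (the batch here is random data, and gamma and beta would be learned in practice):

X = randn(32, 8);                                   % mini-batch: 32 examples, 8 features
mu     = mean(X, 1);                                % per-feature batch mean
sigma2 = var(X, 1, 1);                              % per-feature batch variance
Xhat = bsxfun(@rdivide, bsxfun(@minus, X, mu), sqrt(sigma2 + 1e-5));  % normalize each feature
gamma = ones(1, 8);  beta = zeros(1, 8);            % learnable scale and shift (initial values)
Y = bsxfun(@plus, bsxfun(@times, Xhat, gamma), beta);  % normalized output fed to the next layer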
For BPTT, I strongly recommend Geoffrey Hinton's course on neural networks, and especially this video.
Having a neural network with a lot of inputs causes my network problems: the network gets stuck, and the feed-forward calculation always gives an output of 1.0 because the weighted sum is too big; during backpropagation, the sum of the gradients also gets too high, which makes the learning steps far too dramatic.
The network uses tanh as the activation function in all layers.
Having given it a lot of thought, I came up with the following solutions:
Solution 1: initializing smaller random weight values (WeightRandom / PreviousLayerNeuronCount), or
Solution 2: after calculating the sum of either outputs or gradients, dividing it by the number of neurons in the previous layer (for the output sum) or by the number of neurons in the next layer (for the gradient sum), and only then passing the result into the activation/derivative function.
I don't feel comfortable with the solutions I came up with.
Solution 1 does not solve the problem entirely; the possibility of the gradient or output sum getting too high is still there. Solution 2 seems to solve the problem, but I fear that it changes the network's behavior so much that it might no longer be able to solve some problems at all.
What would you suggest in this situation, keeping in mind that reducing the neuron count in the layers is not an option?
Thanks in advance!
General things that affect backpropagation include the initial choice of weights and biases, the number of hidden units, the number of training patterns, and the number of iterations. For selecting the initial weights and biases there are several algorithms you can use; one of them is the Nguyen-Widrow algorithm. You can use it to initialize the weights and biases; I have tried it and it gives good results.
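For illustration, here is a simple fan-in-scaled initialization in the spirit of Solution 1 (this shows only the scaling idea, not the full Nguyen-Widrow algorithm; the layer sizes are made up):

fanIn  = 200;                                     % neurons (or inputs) feeding this layer (made up)
fanOut = 50;                                      % neurons in this layer (made up)
W = (2 * rand(fanOut, fanIn) - 1) / sqrt(fanIn);  % uniform in [-1, 1], scaled down by sqrt(fanIn)
b = zeros(fanOut, 1);
% Keeping the initial weighted sums small stops tanh from saturating at +/-1 on the first forward pass.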
Throughout the whole training process the accuracy stays at 0.1. What am I doing wrong?
Model, solver and part of log here:
https://gist.github.com/yutkin/3a147ebbb9b293697010
Topology in png format:
P.S. I am using the latest version of Caffe and a g2.2xlarge instance on AWS.
You're working on the CIFAR-10 dataset, which has 10 classes. When the training of a network commences, the first guesses are essentially random, so your accuracy is 1/N, where N is the number of classes; in your case that is 1/10, i.e., 0.1. If your accuracy stays the same over time, it implies that your network isn't learning anything.
This may happen due to a large learning rate. The basic idea of training a network is that you calculate the loss and propagate it back; the gradients are multiplied by the learning rate and used to update the current weights and biases. If the learning rate is too big, you may overshoot the local minimum every time; if it is too small, convergence will be slow.
I see that your base_lr here is 0.01. In my experience this is somewhat large. You may want to keep it at 0.001 in the beginning and then reduce it by a factor of 10 whenever you observe that the accuracy is not improving. Anything below 0.00001 usually doesn't make much of a difference. The trick is to observe the progress of the training and adjust the parameters as and when required.
I know the thread is quite old, but maybe my answer will help somebody. I experienced the same problem, with an accuracy no better than a random guess.
What helped was to set the number of outputs of the last layer before the accuracy layer to the number of labels.
In your case that should be the ip2 layer. Open the model definition of your net and set num_output to the number of labels.
See Section 4.4 for more information: A Practical Introduction to Deep Learning with Caffe and Python
I am using
net = newfit(in,out,lag(j),{'tansig','tansig'});
to generate a new neural network. The default value of the number of validation checks is 6.
I am training a lot of networks and this is taking a lot of time. I guess it doesn't matter if my results are a bit less accurate if they can be made considerably faster.
How can I train faster?
I believe one way might be to reduce the number of validation checks. How can I do that (in code, not using the GUI)?
Is there some other way to increase the speed?
As I said, the increase in speed may be at a little loss of accuracy.
Just to extend mtrw's answer: according to the documentation, training stops when any of these conditions occurs:
The maximum number of epochs is reached: net.trainParam.epochs
The maximum amount of time is exceeded: net.trainParam.time
Performance is minimized to the goal: net.trainParam.goal
The performance gradient falls below min_grad: net.trainParam.min_grad
mu exceeds mu_max: net.trainParam.mu_max
Validation performance has increased more than max_fail times since the last time it decreased (when using validation): net.trainParam.max_fail
The epochs and time constraints let you put an upper bound on the training duration.
The goal constraint stops the training when the performance (error) drops below it, and usually lets you adjust the time/accuracy trade-off: less accurate results in exchange for faster execution.
min_grad is similar (the gradient tells you the strength of the "descent"): if the magnitude of the gradient falls below min_grad, training stops. The intuition is that if the error function is not changing by much, we are reaching a plateau and should probably stop training, since we are not going to improve much more.
mu, mu_dec, and mu_max are used to control the weight updating process (backpropagation).
max_fail is usually used to avoid over-fitting, not so much for speedup.
My advice: set time and epochs to the maximum your application constraints allow (otherwise the results will be poor), and then use goal and min_grad to reach the desired speed/accuracy trade-off. Keep in mind that max_fail won't gain you any time, since it is mainly used to ensure good generalization power.
(Disclaimer: I don't have the neural network toolbox, so I'm only extrapolating from the Mathworks documentation)
It looks from your input parameters like you're using TRAINLM. According to the documentation, you can set the net.trainParam.max_fail parameter to change the validation checks.
You can set the initial mu value, as well as the increment and decrement factors. But this would require some insight into the expected answer and performance of the search.
For a more blunt approach, you can also control the maximum number of iterations by setting the net.trainParam.epochs parameter to something less than its default 100. You might also set the net.trainParam.time parameter to limit the number of seconds.
You should probably set net.trainParam.show to NaN to skip any displays.
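Putting these suggestions together, a small sketch of setting those fields in code (the dummy data, the hidden-layer size, and all parameter values below are purely illustrative; trainlm is assumed as the training function, which is the default for newfit):

in  = rand(3, 100);  out = rand(1, 100);          % dummy data, standing in for the poster's in/out
net = newfit(in, out, 10, {'tansig','tansig'});   % 10 hidden units, a stand-in for lag(j)
net.trainParam.max_fail = 3;     % fewer validation checks (default is 6)
net.trainParam.epochs   = 50;    % cap the number of iterations
net.trainParam.time     = 60;    % cap the training time, in seconds
net.trainParam.goal     = 1e-3;  % stop as soon as the error is "good enough"
net.trainParam.show     = NaN;   % skip the progress display
net = train(net, in, out);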
Neural nets are treated as objects in MATLAB. To access any parameter before (or after) training, you need to access the network's properties using the . operator.
In addition to mtrw's and Amro's answers, make MATLAB's Neural Network Toolbox documentation your new best friend. It will usually explain things in much better detail.