How is the derivative in this ReLU backpropagation being calculated? (Neural Network)

The "dvalue" variable is what I'm hung up on... I understand the derivative of the ReLU.
[Picture 1]
[Picture 2]
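Since the pictures aren't reproduced here, here is a minimal numpy sketch of the pattern this question is most likely about (an assumption based on common from-scratch implementations): "dvalues" holds the gradient arriving from the next layer, and ReLU's backward pass simply zeroes it wherever the forward input was not positive.

```python
import numpy as np

# Forward pass: ReLU keeps positive inputs, zeroes the rest.
inputs = np.array([[1.0, -2.0, 3.0, -4.0]])
outputs = np.maximum(0.0, inputs)

# Backward pass: `dvalues` is the gradient flowing back from the
# next layer (dL/d_output). ReLU's local derivative is 1 where the
# input was > 0 and 0 elsewhere, so the chain rule reduces to
# copying dvalues and zeroing the entries where inputs <= 0.
dvalues = np.array([[5.0, 6.0, 7.0, 8.0]])  # hypothetical upstream gradient
dinputs = dvalues.copy()
dinputs[inputs <= 0] = 0.0

print(dinputs)  # [[5. 0. 7. 0.]]
```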

Related

Linear Regression using Neural Network

I am working on a regression problem with the following sample training data.
As shown, I have an input of only 4 parameters, only one of which (Z) actually changes, so the rest carry no real information, and an output of 124 parameters, denoted O1 to O124.
O1 changes at a constant rate of 20 (1000, then 1020, then 1040, ...), while O2 changes at a different but still constant rate of 30, and the same applies to all 124 outputs: everything changes linearly at a constant rate.
I believed this was a trivial problem and that a very simple neural network model would reach 100% accuracy on the test data, but the results were the opposite.
I reached 100% test accuracy using a linear regressor and 99.99997% test accuracy using a KNN regressor.
I reached 41% test accuracy with a 10-layer neural network using ReLU activation, while all other activation functions failed, and a shallow ReLU network also failed.
Using a simple neural network with a linear activation function and no hidden layers, I reached 92% on the test data.
My question is: how can I get the neural network to reach 100% on the test data, like the linear regressor does?
A shallow network with linear activation is supposed to be equivalent to the linear regressor, but the results are different. Am I missing something?
If you use linear activations, a deep model is in principle the same as a linear regression / an NN with one layer. E.g., for a deep NN with linear activations the prediction is given as y = W_3(W_2(W_1 x)), which can be rewritten as y = (W_3 W_2 W_1) x, which is the same as y = W_4 x, i.e. a linear regression.
Given that, check whether your NN without a hidden layer converges to the same parameters as your linear regression. If it does not, your implementation is probably wrong. If it does, then your larger NN probably just converges to some worse solution of the problem, where the test accuracy is lower. In that case, try different random seeds.
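A quick numpy check of the algebra above (the hidden width of 8 is arbitrary; the 4 inputs and 124 outputs mirror the question):

```python
import numpy as np

rng = np.random.default_rng(0)

# Three "layers" with linear activation: y = W3 @ (W2 @ (W1 @ x))
W1 = rng.normal(size=(8, 4))
W2 = rng.normal(size=(8, 8))
W3 = rng.normal(size=(124, 8))
x = rng.normal(size=(4,))

deep = W3 @ (W2 @ (W1 @ x))

# Collapse the stack into a single matrix W4 = W3 @ W2 @ W1
W4 = W3 @ W2 @ W1
shallow = W4 @ x

print(np.allclose(deep, shallow))  # True: the deep linear net IS a single linear map
```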

Why do we need biases in the neural network?

We have weights and an optimizer in the neural network.
Why can't we just compute W * input, then apply the activation, estimate the loss, and minimize it?
Why do we need to do W * input + b?
Thanks for your answer!
There are two ways to think about why biases are useful in neural nets. The first is conceptual, and the second is mathematical.
Neural nets are loosely inspired by biological neurons. The basic idea is that human neurons take a bunch of inputs and "add" them together. If the sum of the inputs is greater than some threshold, then the neuron will "fire" (produce an output that goes to other neurons). This threshold is essentially the same thing as a bias. So, in this way, the bias in artificial neural nets helps to replicate the behavior of real, human neurons.
Another way to think about biases is simply by considering any linear function, y = mx + b. Let's say you are using y to approximate some linear function z. If z has a non-zero z-intercept, and you have no bias in the equation for y (i.e. y = mx), then y can never perfectly fit z. Similarly, if the neurons in your network have no bias terms, then it can be harder for your network to approximate some functions.
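A tiny numpy illustration of this point, using the made-up target z = 2x + 3:

```python
import numpy as np

# Target: z = 2x + 3, which has a non-zero intercept.
x = np.linspace(-1, 1, 100)
z = 2 * x + 3

# Least-squares fit WITHOUT a bias: y = m*x. Best m minimizes ||m*x - z||^2.
m_no_bias = (x @ z) / (x @ x)
err_no_bias = np.mean((m_no_bias * x - z) ** 2)

# Least-squares fit WITH a bias: y = m*x + b (append a column of ones).
A = np.stack([x, np.ones_like(x)], axis=1)
(m, b), *_ = np.linalg.lstsq(A, z, rcond=None)
err_bias = np.mean((m * x + b - z) ** 2)

print(err_no_bias)  # ~9.0: y = m*x can never absorb the +3 offset
print(err_bias)     # ~0.0: the bias term recovers it exactly
```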
All that said, you don't "need" biases in neural nets; indeed, recent developments (like batch normalization) have made biases less common in convolutional neural nets.

Neural Network(Multilayer Perceptron) config to approximate the sine function

I wrote a multilayer perceptron and am trying to approximate the sine function.
My network contains only a single hidden layer with 50 neurons (the input and output layers each have only 1 neuron, of course). The activation function in the hidden layer is tanh, and the output layer is linear. The learning rate is set to 0.0001, with momentum 0.9 (standard momentum, not Nesterov momentum). Training is online, since the data is generated without noise. Weights and biases are generated randomly with mean 0.
After 10000 epochs, my network's result is plotted below (the upper image is the real sine function, the lower image is my network's output); although it is not too bad, I cannot achieve the exact sine function.
Can anyone give me advice on a better configuration for better error convergence?
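For comparison, here is a minimal numpy sketch of the described setup (1-50-1, tanh hidden layer, linear output). The learning rate, epoch count, and full-batch gradient descent here are guesses, not the poster's exact online-plus-momentum configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Data: the sine function on [-pi, pi], no noise.
x = np.linspace(-np.pi, np.pi, 200).reshape(-1, 1)
y = np.sin(x)

# 1 -> 50 -> 1 network: tanh hidden layer, linear output.
W1 = rng.normal(0, 0.5, size=(1, 50)); b1 = np.zeros(50)
W2 = rng.normal(0, 0.5, size=(50, 1)); b2 = np.zeros(1)

lr = 0.01  # hypothetical; the question used 0.0001 with momentum
for epoch in range(20000):
    # Forward pass
    h = np.tanh(x @ W1 + b1)
    pred = h @ W2 + b2

    # Mean-squared-error gradient, backpropagated by hand
    d_pred = 2 * (pred - y) / len(x)
    dW2 = h.T @ d_pred; db2 = d_pred.sum(axis=0)
    d_h = (d_pred @ W2.T) * (1 - h ** 2)   # tanh'(z) = 1 - tanh(z)^2
    dW1 = x.T @ d_h; db1 = d_h.sum(axis=0)

    # Plain gradient descent update
    for p, g in ((W1, dW1), (b1, db1), (W2, dW2), (b2, db2)):
        p -= lr * g

print(np.mean((pred - y) ** 2))  # should be close to 0 after training
```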

MNIST - Training stuck

I'm reading Neural Networks and Deep Learning (first two chapters), and I'm trying to follow along and build my own ANN to classify digits from the MNIST data set.
I've been scratching my head for several days now, since my implementation peaks at ~57% accuracy at classifying digits from the test set (some 5734/10000) after 10 epochs (accuracy on the training set stagnates after the tenth epoch, and accuracy on the test set deteriorates, presumably because of over-fitting).
I'm using nearly the same configuration as in the book: a 2-layer feedforward ANN (784-30-10) with all layers fully connected; standard sigmoid activation functions; quadratic cost function; weights initialized the same way (drawn from a Gaussian distribution with mean 0 and standard deviation 1).
The only differences are that I'm using online training instead of batch/mini-batch training, and a learning rate of 1.0 instead of 3.0 (I have tried mini-batch training with a learning rate of 3.0, though).
And yet, my implementation doesn't get past 60% after a bunch of epochs, whereas in the book the ANN goes above 90% just after the first epoch with pretty much the exact same configuration.
At first I messed up implementing the backpropagation algorithm, but after reimplementing backpropagation differently three times, with exactly the same results each time, I'm stumped...
An example of the results the backpropagation algorithm is producing:
Take a simpler feedforward network with the same configuration mentioned above (online training + learning rate of 1.0): 3 input neurons, 2 hidden neurons, and 1 output neuron.
The initial weights are initialized as follows:
Layer #0 (3 neurons)
Layer #1 (2 neurons)
- Neuron #1: weights=[0.1, 0.15, 0.2] bias=0.25
- Neuron #2: weights=[0.3, 0.35, 0.4] bias=0.45
Layer #2 (1 neuron)
- Neuron #1: weights=[0.5, 0.55] bias=0.6
Given an input of [0.0, 0.5, 1.0], the output is 0.78900331.
Backpropagating for the same input and with the desired output of 1.0 gives the following partial derivatives (dw = derivative wrt weight, db = derivative wrt bias):
Layer #0 (3 neurons)
Layer #1 (2 neurons)
- Neuron #1: dw=[0, 0.0066968054, 0.013393611] db=0.013393611
- Neuron #2: dw=[0, 0.0061298212, 0.012259642] db=0.012259642
Layer #2 (1 neuron)
- Neuron #1: dw=[0.072069918, 0.084415339] db=0.11470326
Updating the network with those partial derivatives yields a corrected output value of 0.74862305.
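For anyone who wants to check these numbers, here is a small numpy sketch that reproduces the forward pass under the stated configuration (sigmoid activations everywhere, quadratic cost). Note that the sign and scale of the gradients depend on conventions (e.g. C = 0.5*(y-a)^2 vs (y-a)^2, and whether dC/dw or its negative is listed), so compare with that in mind:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.0, 0.5, 1.0])
W1 = np.array([[0.1, 0.15, 0.2],
               [0.3, 0.35, 0.4]]); b1 = np.array([0.25, 0.45])
W2 = np.array([[0.5, 0.55]]);      b2 = np.array([0.6])
y = 1.0

# Forward pass
a1 = sigmoid(W1 @ x + b1)
a2 = sigmoid(W2 @ a1 + b2)
print(a2)  # ~0.78900331, matching the output in the question

# Backward pass for quadratic cost C = 0.5 * (a2 - y)^2
delta2 = (a2 - y) * a2 * (1 - a2)           # dC/db for the output neuron
dW2 = np.outer(delta2, a1)                  # dC/dW for the output layer
delta1 = (W2.T @ delta2) * a1 * (1 - a1)    # hidden-layer deltas
dW1 = np.outer(delta1, x)

print(delta2, dW2)
print(delta1, dW1)
```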
If anyone would be kind enough to confirm the above results, it would help me tremendously as I've pretty much ruled out backpropagation being faulty as the reason for the problem.
Did anyone tackling the MNIST problem ever come across this problem?
Even suggestions for things I should check would help since I'm really lost here.
Doh..
Turns out nothing was wrong with my backpropagation implementation...
The problem was that I read the images into a signed char array (in C++), and the pixel values overflowed, so when I divided by 255.0 to normalize the input vectors into the range 0.0-1.0, I actually got negative values... ;-;
So basically I spent some four days debugging and reimplementing the same thing when the problem was somewhere else entirely.
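The same overflow is easy to reproduce with numpy's int8, the analogue of a C++ signed char: any pixel value above 127 wraps around to a negative number, so dividing by 255.0 "normalizes" it to a negative input.

```python
import numpy as np

# MNIST pixel intensities are 0..255 and belong in an unsigned byte.
pixels = np.array([0, 100, 200, 255], dtype=np.uint8)

# Reading them into a signed byte wraps everything above 127 around.
as_signed = pixels.astype(np.int8)
print(as_signed)          # [   0  100  -56   -1]
print(as_signed / 255.0)  # [ 0.     0.392 -0.220 -0.004]  <- negative "inputs"

# The correct normalization keeps the unsigned interpretation.
print(pixels / 255.0)     # [ 0.     0.392  0.784  1.   ]
```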

NAN when I use ReLU activation function in convolutional neural network Lenet-5

I programmed the convolutional neural network LeNet-5, with some modifications:
- I replaced the activation function of the output neurons in the last layer from RBF to SoftMax.
- I replaced the subsampling layers with max-pooling layers.
- The learning method is backpropagation.
As a result, the network works correctly.
After that, I tried to replace the sigmoid output of each neuron in the feature maps with ReLU (Rectified Linear Unit). The network began to learn faster, but if I do not choose a low learning rate, I get NaN values.
For a small input data set it is enough to pick a lower learning rate, but with more than 1,000 examples the network runs for a while and in the end I get NaN again.
Why do I get NaN when using ReLU? Is the LeNet architecture just not suited to ReLU?
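One likely mechanism (an illustration, not a diagnosis of this exact network): ReLU is unbounded, so unlike a sigmoid it does not squash activations, and weight scales or learning rates that were stable with sigmoids can let activations and gradients grow geometrically until training overflows to inf and then NaN. A toy numpy sketch (the layer count and weight scale are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.normal(size=100)
sig, rel = x.copy(), x.copy()

for layer in range(50):
    W = rng.normal(0, 1.0, size=(100, 100))  # std 1 is far too large for ReLU here
    sig = 1.0 / (1.0 + np.exp(-(W @ sig)))   # sigmoid saturates, stays in (0, 1)
    rel = np.maximum(0.0, W @ rel)           # ReLU passes magnitudes straight through

print(np.max(np.abs(sig)))  # <= 1 by construction
print(np.max(np.abs(rel)))  # astronomically large: activations explode layer by layer,
                            # and gradients of that size overflow to inf/NaN in training
```

Common remedies are a smaller learning rate, smaller weight initialization (e.g. He initialization), gradient clipping, or batch normalization.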