I'm still a little confused about how activation functions work in neural networks (well, confused enough that I can't explain them in layman's terms). So far I have:
The activation function in a hidden layer determines whether the neuron is switched ON (passes a value to the next layer) or switched OFF (nothing is passed to the next layer). This is accomplished by feeding the result from a weights/bias calculation into a function (e.g. sigmoid) that results in an output that is high (ON/value passed) or low (OFF/value not passed).
What's confusing me:
What happens next? If the neuron is ON: (1) do the connections between the neuron and the next layer have their own values of w & b that are passed to the next layer? (2) Is the input to the activation function passed to the next layer? Or (3) is the output from the sigmoid function passed to the next layer? I think the answer is (1).
Can anyone help me with these areas of confusion?
First of all, I think you should forget the idea of "ON" or "OFF", because that is not really how it usually works: the output of such a function does not have to be binary. Threshold activation functions exist, but they are not the only ones. The sigmoid function maps the reals into the open interval (0, 1). It is applied as-is and, unless you add a threshold, your neuron always outputs something, however tiny or big, that is neither exactly 0 nor exactly 1.
Take the example of the linear activation function: the output can be any real number, so the idea of on/off isn't relevant at all.
The goal of such a function is to add complexity to the model and to make it non-linear. If you had a neural network without these functions, the output would just be a weighted linear sum of the inputs plus a bias, which is often not expressive enough to solve problems (simulating an XOR gate with a network is the classic example; you can't do it without non-linear activation functions). With activation functions, you can use whatever you want: tanh, sigmoid, ReLU...
That being said, the answer is 1 and 3.
If you take any neuron n in a hidden layer, its input is a weighted sum of the previous layer's outputs plus a bias (also weighted, by a weight often called w0); the activation function is then applied to that sum. Imagine the weighted values coming from the previous neurons are 0.5 and 0.2, and the weighted bias is 0.1. You then apply a function, say the sigmoid, to 0.5 + 0.2 + 0.1 = 0.8, which gives roughly 0.69.
The output of the neuron is the result of the function. Each neuron of the next layer will compute a weighted sum of the outputs of the current layer, including the output of our neuron. Note that each neuron of the next layer has its own weights between the previous layer and itself. Then the neurons of the next layer apply an activation function (not necessarily the same as the current layer's) to produce their own outputs. So, informally, a neuron of the next layer does something like activ_func(.. + .. + 0.69*weight_n + ..).
In other words, each layer's values are the result of the activation function applied to the weighted sum of the previous layer's values plus a weighted bias. If you managed to read that without suffocating, you can apply this definition recursively to every layer (except the input, of course).
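To make the data flow concrete, here is a minimal Python/NumPy sketch of that forward pass. The 0.5, 0.2 and 0.1 are the illustrative numbers from the example above; the next layer's weight and bias are made-up values, not anything learned:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Already-weighted contributions from the previous layer, plus a weighted bias.
weighted_inputs = np.array([0.5, 0.2])
weighted_bias = 0.1

pre_activation = weighted_inputs.sum() + weighted_bias   # 0.8
output = sigmoid(pre_activation)                         # ~0.69 -- this is what the next layer sees

# A neuron in the next layer folds this 0.69 into its own weighted sum,
# with its own (hypothetical) weight and bias, before applying its own activation:
next_weight, other_contributions, next_bias = 0.3, 0.4, -0.2
next_output = sigmoid(other_contributions + output * next_weight + next_bias)
print(output, next_output)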
I need clarification on when exactly we say an activation function is "activated". The job of an activation function is to introduce non-linearity, right? Is it just scaling a given input to a confined range?
I need clarification on when exactly we say an activation function is "activated".
We don't. This is not a Boolean thing, to be "active" or "inactive". You may be thinking in terms of whether a neuron fires (sends an electrical signal through its axon).
Perceptrons (the "neurons" of a software neural network) do not necessarily work this way. A couple of the activation functions do have hard binary outputs (-1 vs 1, or 0 vs 1), but most are continuous functions.
Instead, think of it as an "attention function", an evaluation of "how excited should this neuron get in response to the input?" For instance, ReLU (y = max(x, 0)) translates as "If this is boring, I don't care how boring it is; call it a 0 and move on." Sigmoid and tanh are more discriminating:
below -2 ......... forget it
-2 through 2 ... yes, let's pay attention to the pros and cons
above 2 .......... I got the idea -- this is extremely cool ... don't care about the rest of your sales pitch, you already got an A+.
Activation functions are a sort of normalization or scaling filter. They help the next layer effectively focus on discriminating among undecided cases; a good activation function usually has a useful gradient (say, around 1.0) in the middle range ("the model isn't sure") of its inputs. They keep a wildly excited input (say +1000) from dominating the next layer's "conversation".
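As a rough illustration of that squashing behaviour (the sample inputs here are arbitrary), compare what ReLU, sigmoid and tanh do to a few values, including a "wildly excited" +1000:

import numpy as np

def relu(x):
    return np.maximum(x, 0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

xs = np.array([-5.0, -2.0, -0.5, 0.0, 0.5, 2.0, 1000.0])
print("relu:   ", relu(xs))     # negatives clipped to 0; the huge input passes straight through
print("sigmoid:", sigmoid(xs))  # squashed into (0, 1); only roughly -2..2 stays discriminating
print("tanh:   ", np.tanh(xs))  # squashed into (-1, 1); the +1000 input just saturates at 1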
In artificial neural networks, the activation function of a node defines the output of that node given an input or set of inputs. A standard computer chip circuit can be seen as a digital network of activation functions that can be "ON" (1) or "OFF" (0), depending on input.
It depends on what activation function you are talking about, but in general they are used to make the output results clearer in regression, or to scale the input so that it is easier to choose between classes in classification.
References:
https://medium.com/the-theory-of-everything/understanding-activation-functions-in-neural-networks-9491262884e0
https://towardsdatascience.com/activation-functions-neural-networks-1cbd9f8d91d6
I have been experimenting with neural networks these days, and I have come across a general question regarding which activation function to use. This might be a well-known fact, but I couldn't understand it properly. A lot of the examples and papers I have seen work on classification problems, and they use either sigmoid (in the binary case) or softmax (in the multi-class case) as the activation function in the output layer, which makes sense. But I haven't seen any activation function used in the output layer of a regression model.
So my question is: is it by choice that we don't use any activation function in the output layer of a regression model, because we don't want the activation function to limit or put restrictions on the value? The output value can be any number, possibly as big as thousands, so activation functions like sigmoid or tanh won't make sense. Or is there another reason? Or can we actually use some activation function made for this kind of problem?
For a linear-regression type of problem, you can simply create the output layer without any activation function, as we are interested in numerical values without any transformation.
More info:
https://machinelearningmastery.com/regression-tutorial-keras-deep-learning-library-python/
For classification:
You can use sigmoid, tanh, softmax, etc.
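As a rough sketch of that advice (assuming tf.keras; the layer sizes and input width here are arbitrary, not taken from the question), a regression model with a linear output layer next to a binary classifier with a sigmoid output might look like this:

from tensorflow import keras
from tensorflow.keras import layers

# Regression: no activation on the output layer, so predictions are unbounded.
regressor = keras.Sequential([
    keras.Input(shape=(10,)),
    layers.Dense(64, activation="relu"),
    layers.Dense(1),                      # no activation == linear output
])
regressor.compile(optimizer="adam", loss="mse")

# Binary classification: sigmoid squashes the output into (0, 1).
classifier = keras.Sequential([
    keras.Input(shape=(10,)),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
classifier.compile(optimizer="adam", loss="binary_crossentropy")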
If you have, say, a sigmoid as the activation function in the output layer of your NN, you will never get any value less than 0 or greater than 1.
Basically, if the data you're trying to predict is distributed within that range, you might approach it with a sigmoid function and test whether your prediction performs well on your training set.
More generally, when predicting data you should come up with the function that represents your data in the most effective way.
Hence, if your real data does not fit a sigmoid function well, you have to think of some other function (e.g. a polynomial, a periodic function, or some combination of them), but you should also always consider how easily you will be able to build your cost function and evaluate its derivatives.
Just use a linear activation function without limiting the output value range unless you have some reasonable assumption about it.
I have read the answers to "Why use softmax only in the output layer and not in hidden layers?". My exact question pertains to the accepted answer there:
Variable independence: a lot of regularization and effort is put into keeping your variables independent, uncorrelated and quite sparse. If you use a softmax layer as a hidden layer, then you will keep all your nodes (hidden variables) linearly dependent, which may result in many problems and poor generalization.
What are the complications that arise from forgoing variable independence in hidden layers? Please provide at least one example. I know hidden-variable independence helps a lot in codifying backpropagation, but backpropagation can be codified for softmax as well (please verify whether or not I am correct in this claim; I seem to have gotten the equations right, hence the claim).
Training issue: try to imagine that, to make your network work better, you have to make part of the activations from your hidden layer a little bit lower. Then, automatically, you are making the rest of them have a higher mean activation, which might in fact increase the error and harm your training phase.
I don't understand how you achieve that kind of flexibility even with sigmoid hidden neurons, where you can fine-tune the activation of a particular neuron, which is precisely gradient descent's job. So why are we even worried about this issue? If you can implement backprop, the rest will be taken care of by gradient descent. Fine-tuning the weights so as to make the activations proper is not something you would want to do, even if you could (which you can't). (Kindly correct me if my understanding is wrong here.)
Mathematical issue: by creating constraints on the activations of your model you decrease its expressive power without any logical explanation. Striving to constrain all the activations in this way is not worth it in my opinion.
Kindly explain what is being said here
Batch normalization: I understand this; no issues here.
1/2. I don't think you have a clue what the author is trying to say. Imagine a layer with 3 nodes. Two of these nodes have an error responsibility of 0 with respect to the output error, so there is exactly one node that should be adjusted. But because softmax couples all the outputs in a layer, if you want to improve the output of node 0, then you immediately affect nodes 1 and 2 in that layer, possibly making the output even more wrong.
Fine-tuning the weights so as to make the activations proper is not something you would want to do, even if you could (which you can't). (Kindly correct me if my understanding is wrong here.)
That is the definition of backpropagation. That is exactly what you want. Neural networks rely on activations (which are non-linear) to map a function.
3. You're basically saying to every neuron: "hey, your output cannot be higher than x, because some other neuron in this layer already has value y". Because all the neurons in a softmax layer must have a total activation of 1, no neuron can exceed a certain value. For small layers that's a small problem, but for big layers it's a big problem. Imagine a layer with 100 neurons whose total output must be 1: the average value of those neurons will be 0.01. That means the next layer's connections end up relying on activations that stay very low on average, whereas other activation functions output (or take as input) values spanning the range (0, 1) or (-1, 1).
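A quick NumPy check of that 100-neuron argument (the pre-activations here are just random numbers for illustration):

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())   # shift by the max for numerical stability
    return e / e.sum()

z = np.random.randn(100)      # pre-activations of a hypothetical 100-unit softmax layer
a = softmax(z)

print(a.sum())   # 1.0 by construction
print(a.mean())  # exactly 0.01 (= 1/100)
print(a.max())   # typically far below 1: one unit can only grow at the expense of the others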
I am learning to build neural networks for regression problems. It works well for approximating linear functions: a setup with 1-5-1 units and linear activation functions in the hidden and output layers does the trick, and the results are fast and reliable. However, when I try to feed it simple quadratic data (f(x) = x*x), here is what happens:
With a linear activation function, it tries to fit a linear function through the dataset.
And with the tanh function, it tries to fit a tanh curve through the dataset.
This makes me believe that the current setup is inherently unable to learn anything but a linear relation, since it's repeating the shape of the activation function on the chart. But this may not be true, because I've seen other implementations learn curves just fine. So I may be doing something wrong. Please provide your guidance.
About my code
My weights are randomized in (-1, 1); inputs are not normalized. The dataset is fed in random order. Changing the learning rate or adding layers does not change the picture much.
I've created a jsfiddle; the place to play with is this function:
function trainingSample(n) {
return [[n], [n]];
}
It produces a single training sample: an array of an input vector array and a target vector array.
In this example it produces an f(x)=x function. Modify it to be [[n], [n*n]] and you've got a quadratic function.
The play button is at the upper right, and there are also two input boxes to manually enter these values. If the target (right) box is left empty, you can test the output of the network by feedforward only.
There is also a configuration file for the network in the code, where you can set learning rate and other things. (Search for var Config)
It's occurred to me that in the setup I am describing, it is impossible to learn non-linear functions, because of the choice of features. Nowhere in the forward pass is there a dependency on the input of power higher than 1; that's why I am seeing a snapshot of my activation function in the output. Duh.
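To ground the "linear activations can only ever give a linear fit" part of this, here is a small NumPy check (with arbitrary random weights, not the fiddle's actual ones) showing that a 1-5-1 network with identity activations collapses into a single affine map, so no training could ever bend it into f(x) = x*x:

import numpy as np

rng = np.random.default_rng(0)

# Arbitrary 1-5-1 network with *linear* (identity) activations everywhere.
W1, b1 = rng.normal(size=(5, 1)), rng.normal(size=(5, 1))
W2, b2 = rng.normal(size=(1, 5)), rng.normal(size=(1, 1))

def net(x):
    h = W1 @ x + b1      # hidden layer, identity activation
    return W2 @ h + b2   # output layer, identity activation

# The whole network is equivalent to one affine map y = w*x + b:
w = W2 @ W1
b = W2 @ b1 + b2

x = np.array([[3.0]])
print(net(x), w @ x + b)  # the same up to floating-point rounding -- never x*x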
I'm trying to create a sample neural network that can be used for credit scoring. Since this is a complicated structure for me, I'm trying to learn things small first.
I created a network using backpropagation: an input layer (2 nodes), 1 hidden layer (2 nodes + 1 bias), and an output layer (1 node), which uses sigmoid as the activation function for all layers. I'm trying to test it first using a^2 + b^2 = c^2, which means my inputs would be a and b, and the target output would be c.
My problem is that my input and target output values are real numbers which can range over (-∞, +∞). So when I'm passing these values to my network, my error function would be something like (target - network output). Would that be correct or accurate, in the sense that I'm taking the difference between the network output (which ranges from 0 to 1) and the target output (which can be a large number)?
I've read that the solution would be to normalise first, but I'm not really sure how to do this. Should I normalise both the input and target output values before feeding them to the network? Which normalisation function is best to use, since I've read about different methods of normalising? After getting the optimised weights and using them to test some data, I'm getting an output value between 0 and 1 because of the sigmoid function. Should I revert the computed values to the un-normalised/original form/values? Or should I only normalise the target output and not the input values? This has had me stuck for weeks, as I'm not getting the desired outcome and I'm not sure how to incorporate the normalisation idea into my training algorithm and testing.
Thank you very much!!
So, to answer your questions:
The sigmoid function squashes its input into the interval (0, 1). It's usually useful in classification tasks because you can interpret its output as the probability of a certain class. Your network performs a regression task (you need to approximate a real-valued function), so it's better to use a linear function as the activation coming out of your last hidden layer (in your case, also the first :) ).
I would advise you not to use the sigmoid function as the activation function in your hidden layers. It's much better to use tanh or ReLU non-linearities. A detailed explanation (as well as some useful tips if you want to keep sigmoid as your activation) can be found here.
It's also important to understand that the architecture of your network is not suitable for the task you are trying to solve. You can learn a little bit about what different networks can learn here.
In the case of normalization: the main reason why you should normalize your data is to avoid giving any spurious prior knowledge to your network. Consider two variables: age and income. The first varies from, e.g., 5 to 90; the second from, e.g., 1000 to 100000. The mean absolute value is much bigger for income than for age, so due to the linear transformations in your model, the ANN treats income as more important at the beginning of training (because of the random initialization). Now consider that you are trying to solve a task where you need to classify whether a given person has grey hair :) Is income truly the more important variable for this task?
There are a lot of rules of thumb on how to normalize your input data. One is to squash all inputs into the [0, 1] interval. Another is to make every variable have mean = 0 and sd = 1. I usually use the second method when the distribution of a given variable is similar to a normal distribution, and the first in other cases.
When it comes to the output, it's usually also useful to normalize it when you are solving a regression task (especially in the multiple-regression case), but it's not as crucial as for the inputs.
You should remember to keep the parameters needed to restore the original scale of your inputs and outputs. You should also remember to compute them only on the training set and then apply them to the training, test, and validation sets alike.
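A minimal sketch of that recipe (the arrays here are made up to mimic the a^2 + b^2 = c^2 setup from the question): fit min-max scaling on the training set only, apply the same parameters to every split, and keep the target's parameters so sigmoid outputs can be mapped back to the original scale:

import numpy as np

# Hypothetical data, just for illustrating the normalisation recipe.
rng = np.random.default_rng(42)
X_train = rng.uniform(-100, 100, (80, 2))
X_test = rng.uniform(-100, 100, (20, 2))
y_train = np.sqrt((X_train ** 2).sum(axis=1, keepdims=True))   # c = sqrt(a^2 + b^2)

# Compute min-max parameters on the training set ONLY.
x_min, x_max = X_train.min(axis=0), X_train.max(axis=0)
y_min, y_max = y_train.min(), y_train.max()

def scale(v, lo, hi):
    return (v - lo) / (hi - lo)        # squash into [0, 1]

def unscale(v, lo, hi):
    return v * (hi - lo) + lo          # restore the original scale

# Apply the same parameters to every split.
X_train_n = scale(X_train, x_min, x_max)
X_test_n = scale(X_test, x_min, x_max)
y_train_n = scale(y_train, y_min, y_max)   # targets now match the sigmoid's (0, 1) range

# After training, a sigmoid output in (0, 1) is mapped back to a real magnitude:
fake_network_output = 0.42
print(unscale(fake_network_output, y_min, y_max))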