I am new to Neural Networks and currently in need of guidance for a question I was presented with.
Question: Consider a single-input neuron with a bias. We would like the output to be -1 for inputs less than 3 and +1 for inputs greater than or equal to 3. What kind of transfer function is required, and what bias would you suggest?
Again, I am new to this, and I am fairly certain the answer is obvious, but I have little to go on right now. I originally considered going with either the signum function or a threshold function, but I could not get the outputs I need. Any help or information will be greatly appreciated.
Since you require such an abrupt transition (inputs < 3 should give -1 and inputs >= 3 should give +1), the most appropriate transfer function is a binary step or threshold function, as you rightly suggested. As for the bias: with a weight of 1, a bias of -3 shifts the threshold so that the net input crosses zero exactly at an input of 3.
Other common activation functions are typically continuous and will not produce this kind of jump.
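To make that concrete, here is a minimal sketch. The weight value of 1 is an assumption (any positive weight w with bias -3w gives the same decision boundary):

```python
# A minimal sketch of the single-input neuron described above, assuming weight w = 1.
# With w = 1, a bias of b = -3 puts the decision boundary exactly at input = 3:
# w*p + b < 0 for p < 3, and w*p + b >= 0 for p >= 3.

def hardlims(n):
    # symmetric hard-limit (threshold) transfer function
    return -1 if n < 0 else 1

def neuron(p, w=1.0, b=-3.0):
    return hardlims(w * p + b)

for p in [0, 2.9, 3, 3.1, 10]:
    print(p, "->", neuron(p))   # -1, -1, +1, +1, +1
```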
I've started working on forward and back propagation of neural networks. I've coded it as well and it works properly, but I'm confused about the algorithm itself. I'm new to neural networks.
So is forward propagation of a neural network finding the right label with the given weights?
And is back-propagation using forward propagation to find the most error-free parameters by minimizing the cost function, and then using these parameters to help classify other training examples? And is this what is called a trained neural network?
I feel like there is a big blunder in my understanding; if there is, please let me know where I'm wrong and why.
I will try my best to explain forward and back propagation in a detailed yet simple-to-understand manner, although it's not an easy topic to simplify.
Forward Propagation
Forward propagation is the process whereby, at runtime, values are fed into the front of the neural network (the inputs). You can imagine that these values then travel across the weights, which scale (multiply) the input values. They then arrive at the hidden layer (the neurons). Neurons vary quite a lot between different types of networks, but here is one way of explaining it: when the values reach a neuron, every value being fed into that neuron is summed up, and the total is fed into an activation function. The activation function can be very different depending on the use case, but let's take for example a step (threshold) activation function: it takes the summed value and rounds it to a 0 or 1. The result is then fed through more weights, and finally spat out at the outputs, which is the last step of the network.
You can imagine this network with this diagram.
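If it helps, here is a minimal sketch of that flow in Python: the inputs are scaled by the weights, summed at each neuron together with a bias, passed through the activation, and the results travel on to the next layer. The weights and biases are made-up values (they happen to compute XOR of the two inputs), not something the network has learned:

```python
def step(x):
    # threshold activation: rounds the summed value to 0 or 1
    return 1.0 if x >= 0 else 0.0

def forward(inputs, w_hidden, b_hidden, w_out, b_out):
    # hidden layer: each neuron sums its weighted inputs plus a bias, then applies the activation
    hidden = [step(sum(w * x for w, x in zip(w_row, inputs)) + b)
              for w_row, b in zip(w_hidden, b_hidden)]
    # output layer: same idea, using the hidden activations as its inputs
    return step(sum(w * h for w, h in zip(w_out, hidden)) + b_out)

# made-up weights and biases that happen to compute XOR of the two inputs
w_hidden = [[1.0, 1.0], [1.0, 1.0]]
b_hidden = [-0.5, -1.5]
w_out, b_out = [1.0, -1.0], -0.5

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, "->", forward([a, b], w_hidden, b_hidden, w_out, b_out))  # 0, 1, 1, 0
```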
Back Propagation
Back propagation is just like forward propagation, except that we work backwards through the network, starting from the outputs.
The aim of back propagation is to reduce the error during the training phase (to get the neural network as accurate as possible). This is done by going backwards through the layers and weights. The contribution of each weight to the error is calculated, and each weight is individually adjusted using an optimization algorithm. An optimization algorithm is exactly what it sounds like: it optimizes the weights, adjusting their values to make the neural network more accurate.
Some optimization algorithms include gradient descent and stochastic gradient descent. I will not go through the details in this answer as I have already explained them in some of my other answers (linked below).
The process of calculating the error with respect to the weights and adjusting them accordingly is the back-propagation process, and it is usually repeated many times to get the network as accurate as possible. The number of times you do this is called the epoch count. It is worth learning how to manage epochs and batch sizes (another topic), as these can severely impact the efficiency and accuracy of your network.
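As a rough illustration of the idea (not a full implementation), here is a toy training loop for a single sigmoid neuron with one weight and one bias. The data, learning rate, and epoch count are made up; each epoch, the parameters are nudged against their error gradients, which is the core of back-propagation plus gradient descent:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

data = [(0.0, 0.0), (1.0, 1.0)]   # made-up (input, target) pairs
w, b, lr = 0.5, 0.0, 1.0          # initial weight, bias, and learning rate
epochs = 1000                     # how many times we repeat the whole process

for _ in range(epochs):
    for x, target in data:
        y = sigmoid(w * x + b)         # forward pass
        error = y - target             # how far off the prediction is
        grad = error * y * (1.0 - y)   # chain rule back through the sigmoid
        w -= lr * grad * x             # adjust each parameter against its gradient
        b -= lr * grad                 # (this is the "back-propagation" step)

print(sigmoid(w * 0.0 + b), sigmoid(w * 1.0 + b))  # should end up close to 0 and 1
```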
I understand that this answer may be hard to follow, but unfortunately this is the best way I can explain it. It is expected that you might not understand this the first time you read it; remember this is a complicated topic. I have linked a few more resources down below, including a video (not mine) that explains these processes even better than a simple text explanation can. I hope my answer has resolved your question. Have a good day!
Further resources:
Link 1 - Detailed explanation of back-propagation.
Link 2 - Detailed explanation of stochastic/gradient-descent.
Youtube Video 1 - Detailed explanation of types of propagation.
Credits go to Sebastian Lague
I need clarification on when exactly we say an activation function is activated. The job of an activation function is to introduce non-linearity, right? Is it just scaling a given input to a confined range?
I need clarification on when exactly we say an activation function is activated.
We don't. This is not a Boolean thing, to be "active" or "inactive". You may be thinking in terms of whether a neuron fires (sends an electrical signal through its axon).
Perceptrons (the "neurons" of a software neural network) do not necessarily work this way. A couple of the activation functions do have hard binary signals, (-1 vs 1, or 0 vs 1), but most are continuous functions.
Instead, think of it as an "attention function", an evaluation of "how excited should this neuron get in response to the input?" For instance, ReLU (y = max(x, 0)) translates as "If this is boring, I don't care how boring it is; call it a 0 and move on." Sigmoid and tanh are more discriminating:
below -2: forget it
-2 through 2: yes, let's pay attention to the pros and cons
above 2: I got the idea; this is extremely cool. I don't care about the rest of your sales pitch, you already got an A+.
Activation functions are a sort of normalization or scaling filter. They help the next layer effectively focus on discriminating among undecided cases; a good activation function usually has a useful gradient (say, around 1.0) in the middle range ("the model isn't sure") of its inputs. They keep a wildly excited input (say +1000) from dominating the next layer's "conversation".
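To make the squashing behaviour concrete, here is a small sketch comparing how ReLU, sigmoid, and tanh treat the same inputs, including a wildly excited one. The sample inputs are arbitrary:

```python
import math

def relu(x):
    return max(x, 0.0)

def sigmoid(x):
    # written this way so very large negative inputs don't overflow math.exp
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

# a boring input, some undecided ones, and a wildly excited one
for x in [-1000, -2, 0, 2, 1000]:
    print(f"x={x:6}  relu={relu(x):8.1f}  sigmoid={sigmoid(x):.4f}  tanh={math.tanh(x):+.4f}")
```

Note how ReLU passes +1000 straight through, while sigmoid and tanh cap it, which is the "don't dominate the conversation" effect described above.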
In artificial neural networks, the activation function of a node defines the output of that node given an input or set of inputs. A standard computer chip circuit can be seen as a digital network of activation functions that can be "ON" (1) or "OFF" (0), depending on input.
It depends on which activation function you are talking about, but in general they are used to make the output results clearer in regression, or to scale the input so it is easier to choose between classes in classification.
References:
https://medium.com/the-theory-of-everything/understanding-activation-functions-in-neural-networks-9491262884e0
https://towardsdatascience.com/activation-functions-neural-networks-1cbd9f8d91d6
I am working with neural networks in my free time.
I have already developed a simple XOR operation with a neural network.
But I don't know when I should use which activation function.
Is there a trick, or is it just mathematical logic?
There are a lot of options for activation functions, such as identity, logistic, tanh, ReLU, etc.
The choice of activation function can be based on the gradient computation (back-propagation). E.g. the logistic function is differentiable everywhere, but it saturates when the input has a large absolute value, which slows down optimization. In such cases ReLU is preferred over the logistic function.
The above is only one simple example for the choice of activation function; it really depends on the actual situation.
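As a rough illustration of the saturation point above (the sample z values are arbitrary), the logistic gradient shrinks towards zero for large inputs while the ReLU gradient stays at 1 for any positive input:

```python
import math

def sigmoid_grad(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)          # derivative of the logistic function

def relu_grad(z):
    return 1.0 if z > 0 else 0.0  # derivative of ReLU (ignoring z = 0)

for z in [0.5, 2.0, 5.0, 10.0]:
    print(f"z={z:5}  sigmoid gradient={sigmoid_grad(z):.5f}  relu gradient={relu_grad(z):.1f}")
```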
Besides, I don't think the activation functions used in an XOR network are representative of more complex applications.
The subject of when to use a particular activation function over another is a subject of ongoing academic research. You can find papers related to it by searching for journal articles related to "neural network activation function" in an academic database, or through a Google Scholar search, such as this one:
https://scholar.google.com/scholar?hl=en&as_sdt=0%2C2&q=neural+network+activation+function&btnG=&oq=neural+network+ac
Generally, which function to use depends mostly on what you are trying to do. An activation function is like a lens. You put input into your network, and it comes out changed or focused in some way by the activation function. How your input should be changed depends on what you are trying to achieve. You need to think of your problem, then figure out what function will help you shape your signal into the results you are trying to approximate.
Ask yourself, what is the shape of the data you are trying to model? If it is linear or approximately so, then a linear activation function will suffice. If it is more "step-shaped," you would want to use something like Sigmoid or Tanh (the Tanh function is actually just a scaled Sigmoid), because their graphs exhibit a similar shape. In the case of your XOR problem, we know that either of those, which squash their outputs into the (0, 1) and (-1, 1) ranges respectively, will work quite well. If you need something that doesn't flatten out away from zero like those two do, the ReLU function might be a good choice (in fact ReLU is probably the most popular activation function these days, and deserves far more serious study than this answer can provide).
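For what it's worth, the "Tanh is a scaled Sigmoid" remark can be checked numerically via tanh(x) = 2 * sigmoid(2x) - 1. A quick sketch (the sample points are arbitrary):

```python
import math

sigmoid = lambda x: 1.0 / (1.0 + math.exp(-x))

for x in [-2.0, -0.5, 0.0, 0.5, 2.0]:
    assert abs(math.tanh(x) - (2.0 * sigmoid(2.0 * x) - 1.0)) < 1e-12
print("tanh(x) == 2*sigmoid(2x) - 1 at the sampled points")
```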
You should analyze the graph of each one of these functions and think about the effects each will have on your data. You know the data you will be putting in. When that data goes through the function, what will come out? Will that particular function help you get the output you want? If so, it is a good choice.
Furthermore, if you have a graph of some data with a really interesting shape that corresponds to some other function you know, feel free to use that one and see how it works! Some of ANN design is about understanding, but other parts (at least currently) are intuition.
You can solve your problem with sigmoid neurons. In this case the activation function is:
σ(z) = 1 / (1 + e^(-z))
where:
z = Σ_j w_j x_j + b
Here the w_j are the weights for each input x_j and b is the bias. Finally, you can use back-propagation to minimize the cost function.
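If it helps, here is a minimal sketch of such a sigmoid neuron in Python; the weights, bias, and inputs are made-up values:

```python
import math

def sigmoid_neuron(x, w, b):
    z = sum(w_j * x_j for w_j, x_j in zip(w, x)) + b   # z = sum_j w_j * x_j + b
    return 1.0 / (1.0 + math.exp(-z))                  # sigma(z)

print(sigmoid_neuron(x=[1.0, 0.0], w=[0.7, -0.3], b=0.1))  # roughly 0.69
```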
I'm trying to create a sample neural network that can be used for credit scoring. Since this is a complicated structure for me, I'm trying to learn on something small first.
I created a network using back propagation: an input layer (2 nodes), 1 hidden layer (2 nodes + 1 bias), and an output layer (1 node), which uses sigmoid as the activation function for all layers. I'm trying to test it first using a^2 + b^2 = c^2, which means my inputs would be a and b, and the target output would be c.
My problem is that my input and target output values are real numbers which can range over (-∞, +∞). So when I'm passing these values to my network, my error function would be something like (target - network output). Would that be correct or accurate, given that I'm taking the difference between the network output (which ranges from 0 to 1) and the target output (which can be a large number)?
I've read that the solution would be to normalise first, but I'm not really sure how to do this. Should I normalise both the input and the target output values before feeding them to the network? Which normalisation method is best to use, since I've read about several different ones? After getting the optimized weights and using them to test some data, I get an output value between 0 and 1 because of the sigmoid function. Should I convert the computed values back to the un-normalised/original form, or should I only normalise the target output and not the input values? This has had me stuck for weeks, as I'm not getting the desired outcome and I'm not sure how to incorporate the normalisation idea into my training algorithm and testing.
Thank you very much!!
So, to answer your questions:
The sigmoid function squashes its input into the interval (0, 1). That's usually useful in classification tasks because you can interpret its output as the probability of a certain class. Your network performs a regression task (you need to approximate a real-valued function), so it's better to set a linear function as the activation coming out of your last hidden layer (in your case also the first :) ).
I would advise you not to use the sigmoid function as the activation in your hidden layers. It's much better to use tanh or ReLU non-linearities. A detailed explanation (as well as some useful tips if you want to keep sigmoid as your activation) can be found here.
It's also important to understand that the architecture of your network is not well suited to the task you are trying to solve. You can learn a little about what different networks can learn here.
In the case of normalization: the main reason you should normalize your data is to avoid giving any spurious prior knowledge to your network. Consider two variables: age and income. The first varies from, e.g., 5 to 90; the second from, e.g., 1000 to 100000. The mean absolute value is much bigger for income than for age, so due to the linear transformations in your model, the ANN treats income as more important at the beginning of training (because of the random initialization). Now suppose you are trying to solve a task where you need to classify whether a given person has grey hair :) Is income truly the more important variable for this task?
There are a lot of rules of thumb on how you should normalize your input data. One is to squash all inputs into the [0, 1] interval. Another is to transform every variable to have mean = 0 and sd = 1. I usually use the second method when the distribution of a given variable is similar to the normal distribution, and the first in other cases.
When it comes to the output, it's usually also useful to normalize it when you are solving a regression task (especially in the multiple-regression case), but it's not as crucial as it is for the inputs.
You should remember to keep the parameters needed to restore the original scale of your inputs and outputs. You should also remember to compute those parameters only on the training set, and then apply them to the training, test, and validation sets.
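As a sketch of those two rules of thumb, and of fitting the normalization parameters on the training set only, something like the following could work (the income values are made up):

```python
# rule of thumb 1: squash to [0, 1] using the training-set min and max
def minmax_params(xs):
    return min(xs), max(xs)

def minmax_apply(x, lo, hi):
    return (x - lo) / (hi - lo)

# rule of thumb 2: mean = 0, sd = 1 using the training-set mean and sd
def zscore_params(xs):
    mean = sum(xs) / len(xs)
    sd = (sum((x - mean) ** 2 for x in xs) / len(xs)) ** 0.5
    return mean, sd

def zscore_apply(x, mean, sd):
    return (x - mean) / sd

train_income = [1000, 20000, 55000, 100000]   # made-up training values
test_income = [30000, 80000]                  # made-up test values

lo, hi = minmax_params(train_income)          # computed on the training set only...
print([minmax_apply(x, lo, hi) for x in test_income])   # ...then applied to the test set

mean, sd = zscore_params(train_income)        # keep (mean, sd) to restore the original scale later
print([zscore_apply(x, mean, sd) for x in test_income])
```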
What's the correct way to do 'disjoint' classification (where the outputs are mutually exclusive, i.e. true probabilities sum to 1) in FANN since it doesn't seems to have an option for softmax output?
My understanding is that by using sigmoid outputs, as if doing 'labeling', I wouldn't be getting the correct results for a classification problem.
FANN only supports tanh and linear error functions. This means, as you say, that the probabilities output by the neural network will not sum to 1. There is no easy solution to implementing a softmax output, as this will mean changing the cost function and hence the error function used in the backpropagation routine. As FANN is open source you could have a look at implementing this yourself. A question on Cross Validated seems to give the equations you would have to implement.
Although not the mathematically elegant solution you are looking for, I would try playing around with some cruder approaches before tackling the implementation of a softmax cost function, as one of these might be sufficient for your purposes. For example, you could use a tanh error function and then just renormalise all the outputs to sum to 1. Or, if you are actually only interested in what the most likely classification is, you could just take the output with the highest score.
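As a sketch of those cruder work-arounds (the raw scores below are made up, and the shift-then-rescale step is just one way to handle possibly negative tanh outputs):

```python
def renormalize(scores):
    # shift so the smallest score is 0 (tanh outputs can be negative), then rescale to sum to 1
    shifted = [s - min(scores) for s in scores]
    total = sum(shifted)
    return [s / total for s in shifted] if total > 0 else shifted

raw_outputs = [0.9, 0.2, -0.4]          # made-up per-class outputs from the network
print(renormalize(raw_outputs))         # pseudo-probabilities that sum to 1
print(max(range(len(raw_outputs)), key=lambda i: raw_outputs[i]))  # or just take the top class
```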
Steffen Nissen, the guy behind FANN, presents an example here where he tries to classify what language a text is written in based on letter frequency. I think he uses a tanh error function (default) and just takes the class with the biggest score, but he indicates that it works well.