I am working with neural networks in my free time.
I have already built a simple XOR operation with a neural network.
But I don't know how to choose the right activation function.
Is there a trick, or is it just math logic?
There are many activation functions to choose from, such as identity, logistic, tanh, ReLU, etc.
The choice of activation function can be based on the gradient computation (back-propagation). For example, the logistic function is differentiable everywhere, but it saturates when the input has a large magnitude: its gradient approaches zero, which slows down optimization. In this case ReLU is preferred over logistic.
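To make the saturation point concrete, here is a minimal sketch in plain Python that compares the derivative of the logistic function with the derivative of ReLU at a few inputs. The function names are my own and are only for illustration.

```python
import math

def logistic(z):
    """Logistic (sigmoid) function."""
    return 1.0 / (1.0 + math.exp(-z))

def logistic_grad(z):
    """Derivative of the logistic function: s * (1 - s)."""
    s = logistic(z)
    return s * (1.0 - s)

def relu_grad(z):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if z > 0 else 0.0

# The logistic gradient peaks at z = 0 and shrinks quickly as z grows,
# while the ReLU gradient stays at 1 for every positive input.
for z in (0.0, 2.0, 5.0, 10.0):
    print(f"z={z:5.1f}  logistic'={logistic_grad(z):.6f}  relu'={relu_grad(z):.1f}")
```

The logistic gradient is never larger than 0.25, so in a deep network the back-propagated signal can shrink layer by layer, which is one common argument for ReLU in hidden layers.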
The above is only one simple example of how to choose an activation function. It really depends on the actual situation.
Besides, I don't think the activation functions used in an XOR neural network are representative of more complex applications.
When to use a particular activation function over another is a subject of ongoing academic research. You can find papers on it by searching for journal articles related to "neural network activation function" in an academic database, or through a Google Scholar search, such as this one:
https://scholar.google.com/scholar?hl=en&as_sdt=0%2C2&q=neural+network+activation+function&btnG=&oq=neural+network+ac
Generally, which function to use depends mostly on what you are trying to do. An activation function is like a lens. You put input into your network, and it comes out changed or focused in some way by the activation function. How your input should be changed depends on what you are trying to achieve. You need to think of your problem, then figure out what function will help you shape your signal into the results you are trying to approximate.
Ask yourself, what is the shape of the data you are trying to model? If it is linear or approximately so, then a linear activation function will suffice. If it is more "step-shaped," you would want to use something like Sigmoid or Tanh (the Tanh function is actually just a rescaled and shifted Sigmoid), because their graphs exhibit a similar shape. In the case of your XOR problem, we know that either of those, which work by squashing the output into a bounded range ((0, 1) for Sigmoid, (-1, 1) for Tanh), will work quite well. If you need something that doesn't flatten out away from zero like those two do, the ReLU function might be a good choice (in fact ReLU is probably the most popular activation function these days, and deserves far more serious study than this answer can provide).
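The relationship between Tanh and Sigmoid can be checked numerically. The identity is tanh(x) = 2 * sigmoid(2x) - 1, so Tanh is a Sigmoid stretched to the range (-1, 1). A small sketch in plain Python (the helper names are mine):

```python
import math

def sigmoid(x):
    """Logistic sigmoid, with outputs in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def tanh_via_sigmoid(x):
    """Tanh written as a rescaled, shifted sigmoid."""
    return 2.0 * sigmoid(2.0 * x) - 1.0

# Compare against the library tanh at a few sample points.
for x in (-3.0, -0.5, 0.0, 0.5, 3.0):
    assert math.isclose(math.tanh(x), tanh_via_sigmoid(x), abs_tol=1e-12)
    print(f"x={x:5.1f}  tanh={math.tanh(x): .6f}  sigmoid={sigmoid(x):.6f}")
```

In practice the main difference is that Tanh is centred on zero, while Sigmoid outputs are always positive.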
You should analyze the graph of each one of these functions and think about the effects each will have on your data. You know the data you will be putting in. When that data goes through the function, what will come out? Will that particular function help you get the output you want? If so, it is a good choice.
Furthermore, if you have a graph of some data with a really interesting shape that corresponds to some other function you know, feel free to use that one and see how it works! Some of ANN design is about understanding, but other parts (at least currently) are intuition.
You can solve your problem with sigmoid neurons. In this case the activation function is:
$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
where:
$$z = \sum_{j} w_{j} x_{j} + b$$
In this formula the w_j are the weights for each input, b is the bias and the x_j are the inputs. Finally, you can use back-propagation to compute the gradient of the cost function with respect to the weights and bias.
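As an illustration of how sigmoid neurons can represent XOR, here is a small forward pass in plain Python. The weights are hand-picked by me for the example (an OR unit and a NAND unit feeding an AND unit); they are not the result of training, and a network trained with back-propagation would typically end up with different values.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def neuron(weights, bias, inputs):
    """One sigmoid neuron: weighted sum of the inputs plus bias, then sigmoid."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(z)

def xor_net(x1, x2):
    """2-2-1 network with hand-picked (assumed) weights."""
    h_or = neuron([20.0, 20.0], -10.0, [x1, x2])      # roughly x1 OR x2
    h_nand = neuron([-20.0, -20.0], 30.0, [x1, x2])   # roughly NOT (x1 AND x2)
    return neuron([20.0, 20.0], -30.0, [h_or, h_nand])  # roughly h_or AND h_nand

for a in (0, 1):
    for b in (0, 1):
        print(a, b, round(xor_net(a, b)))
```

The large weights push each sigmoid close to 0 or 1, which is why rounding the output recovers the XOR truth table.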
I have been experimenting with neural networks these days. I have come across a general question regarding the activation function to use. This might be a well-known fact, but I couldn't understand it properly. A lot of the examples and papers I have seen work on classification problems, and they use either sigmoid (in the binary case) or softmax (in the multi-class case) as the activation function in the output layer, which makes sense. But I haven't seen any activation function used in the output layer of a regression model.
So my question is: is it by choice that we don't use any activation function in the output layer of a regression model, because we don't want the activation function to limit or put restrictions on the value? The output value can be any number, as big as thousands, so an activation function like sigmoid or tanh won't make sense. Or is there some other reason? Or can we actually use some activation function made for these kinds of problems?
For a linear-regression type of problem, you can simply create the output layer without any activation function, as we are interested in numerical values without any transformation.
More info:
https://machinelearningmastery.com/regression-tutorial-keras-deep-learning-library-python/
For classification:
You can use sigmoid, tanh, softmax, etc.
If you have, say, a Sigmoid as the activation function in the output layer of your NN, you will never get any value less than 0 or greater than 1.
Basically, if the data you're trying to predict are distributed within that range, you might try a Sigmoid function and test whether your prediction performs well on your training set.
More generally, when predicting data you should come up with the function that represents your data in the most effective way.
Hence, if your real data does not fit a Sigmoid function well, you have to think of some other function (e.g. some polynomial function, a periodic function, or a combination of them), but you should also always consider how easily you can build your cost function and evaluate its derivatives.
Just use a linear activation function without limiting the output value range unless you have some reasonable assumption about it.
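A tiny sketch of why a bounded output activation is a poor fit for unbounded regression targets. The pre-activation values below are made up for illustration; the point is only that the linear (identity) output passes them through, while the sigmoid output can never leave the interval from 0 to 1.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def identity(z):
    """Linear output: the 'no activation function' case."""
    return z

# Hypothetical values arriving at the output neuron before activation.
pre_activations = [-3.0, 0.0, 7.5, 2500.0]

for z in pre_activations:
    print(f"z={z:8.1f}  linear={identity(z):8.1f}  sigmoid={sigmoid(z):.4f}")
```

If the targets are known to lie in a fixed range, one option is to rescale them into that range and keep the bounded activation; otherwise the linear output is the simpler choice.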
I recently started learning neural networks, and I thought that creating a sudoku solver would be a nice application for an NN. I started learning them with back-propagation neural networks, but later I found out that there are dozens of types of neural networks. At this point, I find it hard to learn all of them and then pick an appropriate one for my purpose. Hence, I am asking what would be a good choice for creating this solver. Can a back-propagation NN work here? If not, can you explain why and tell me which one can work?
Thanks!
Neural networks don't really seem to be the best way to solve sudoku, as others have already pointed out. I think a better (but also not really good/efficient) way would be to use a genetic algorithm. Genetic algorithms don't directly relate to NNs, but it's very useful to know how they work.
Better ideas (by better I mean more likely to be successful and probably better for you to learn something new) would include:
If you use a library:
Play around with the networks, try to train them on different datasets, maybe random numbers, and see what you get and how you have to tune the parameters to get better results.
Try to write an image generator. I wrote a few of them and they are still my favourite projects; in one of them I used backprop to teach an NN which color belongs to each x/y coordinate of the image, and the other approach combines randomly generated images with one another (GAN/NEAT).
Try to create a movie (a series of images) of the network learning to create a picture. It will show you very well how backprop works, what parameter tuning does to the results, and how it changes the way the network gets to the result.
If you are not using a library:
Try to solve easy problems, one after the other. Use backprop or a genetic algorithm for training (whatever you have implemented).
Try to improve your implementation and change some things that nobody else cares about and see how it changes the results.
List of 'tasks' for your Network:
XOR (basically the hello world of NN)
Pole balancing problem
Simple games like pong
More complex games like flappy bird, agar.io etc.
Choose more problems that you find interesting: maybe you are into image recognition, maybe text, audio, who knows. Think of something you can/would like to be able to do and find a way to make your computer do it for you.
It's not advisable to use only your own NN implementation, since it will probably not work properly the first few times and you'll get frustrated. Experiment with libraries and your own implementation.
Good way to find almost endless resources:
Use Google search and add 'filetype:pdf' at the end in order to show only PDF files. Search for neural network, genetic algorithm, evolutionary neural network.
Neither neural nets nor GAs are close to ideal solutions for Sudoku. I would advise looking into Constraint Programming (e.g. the Choco or Gecode solver). See https://gist.github.com/marioosh/9188179 for an example. It should solve any 9x9 sudoku in a matter of milliseconds (the daily Sudokus of the newspaper "Le Monde" are created using this type of technology, BTW).
There is also a famous "Dancing links" algorithm for this problem by Knuth that works very well https://en.wikipedia.org/wiki/Dancing_Links
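To give a feel for the constraint-based approach, here is a minimal sketch in plain Python. Note that this is neither Choco/Gecode nor Dancing Links: it is ordinary backtracking that always fills the empty cell with the fewest remaining candidates, which is a much simpler relative of what those tools do. The puzzle is just a sample grid (0 marks an empty cell).

```python
def candidates(grid, r, c):
    """Digits not yet used in the row, column or 3x3 box of cell (r, c)."""
    used = set(grid[r]) | {grid[i][c] for i in range(9)}
    br, bc = 3 * (r // 3), 3 * (c // 3)
    used |= {grid[i][j] for i in range(br, br + 3) for j in range(bc, bc + 3)}
    return [d for d in range(1, 10) if d not in used]

def solve(grid):
    """Fill the grid in place; return True if a solution was found."""
    best = None
    for r in range(9):
        for c in range(9):
            if grid[r][c] == 0:
                opts = candidates(grid, r, c)
                if best is None or len(opts) < len(best[2]):
                    best = (r, c, opts)
    if best is None:
        return True  # no empty cells left
    r, c, opts = best
    for d in opts:
        grid[r][c] = d
        if solve(grid):
            return True
    grid[r][c] = 0  # undo and backtrack
    return False

puzzle = [
    [5, 3, 0, 0, 7, 0, 0, 0, 0],
    [6, 0, 0, 1, 9, 5, 0, 0, 0],
    [0, 9, 8, 0, 0, 0, 0, 6, 0],
    [8, 0, 0, 0, 6, 0, 0, 0, 3],
    [4, 0, 0, 8, 0, 3, 0, 0, 1],
    [7, 0, 0, 0, 2, 0, 0, 0, 6],
    [0, 6, 0, 0, 0, 0, 2, 8, 0],
    [0, 0, 0, 4, 1, 9, 0, 0, 5],
    [0, 0, 0, 0, 8, 0, 0, 7, 9],
]

solved = solve(puzzle)
for row in puzzle:
    print(row)
```

Because the rules are hard constraints, a search like this either finds a valid grid or proves there is none, which is something a trained network cannot guarantee.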
As was mentioned in the comments, you probably want to take a look at convolutional networks. You basically input the sudoku board as a two-dimensional 'image'. I think using a receptive field of 3x3 would be quite interesting, and I don't really think you need more than one filter.
The harder thing is normalization: the numbers 1-9 don't have an underlying relation in sudoku; you could easily replace them by A-I, for example. So they are categories, not numbers. However, one-hot encoding every cell would mean a lot of inputs (81 cells x 9 digits = 729), so I'd stick to numerical normalization (1 = 0.1, 2 = 0.2, etc.).
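The two encodings can be compared with a short sketch in plain Python. The helper names are mine, and treating an empty cell as 0 (all zeros in the one-hot case) is an assumption for the example.

```python
def one_hot(cell):
    """Encode a digit 1-9 as a 9-element vector; an empty cell (0) is all zeros."""
    vec = [0.0] * 9
    if cell:
        vec[cell - 1] = 1.0
    return vec

def scaled(cell):
    """Numerical normalization: 1 -> 0.1, 2 -> 0.2, ..., 9 -> 0.9, empty -> 0.0."""
    return cell / 10.0

row = [5, 3, 0, 0, 7, 0, 0, 0, 0]  # one sample row, 0 = empty

scaled_row = [scaled(c) for c in row]
one_hot_row = [v for c in row for v in one_hot(c)]

print(len(scaled_row), "inputs per row with scaling")
print(len(one_hot_row), "inputs per row with one-hot encoding")
```

The scaled version is nine times smaller, but it tells the network that 8 is "close to" 9, which has no meaning in sudoku; the one-hot version avoids that at the cost of 729 inputs for a full board.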
The output of your network should be a softmax of some kind: if you don't use softmax and instead just output an x and y coordinate, then you can't ensure that the chosen square has not been filled in yet.
A numerical value should be passed along with the output, to show what number the network wants to fill in.
As PLEXATIC mentioned, neural nets aren't really well suited for this kind of task. A genetic algorithm does sound good.
However, if you still want to stick with neural nets, you could have a look at https://github.com/Kyubyong/sudoku. As Thomas W answered, 3x3 looks nice.
If you don't want to deal with CNNs, you could find some answers here as well: https://www.kaggle.com/dithyrambe/neural-nets-as-sudoku-solvers
What's the correct way to do 'disjoint' classification (where the outputs are mutually exclusive, i.e. the true probabilities sum to 1) in FANN, since it doesn't seem to have an option for softmax output?
My understanding is that if I use sigmoid outputs, as if doing 'labeling', I won't get correct results for a classification problem.
FANN only supports tanh and linear error functions. This means, as you say, that the probabilities output by the neural network will not sum to 1. There is no easy solution to implementing a softmax output, as this will mean changing the cost function and hence the error function used in the backpropagation routine. As FANN is open source you could have a look at implementing this yourself. A question on Cross Validated seems to give the equations you would have to implement.
Although not the mathematically elegant solution you are looking for, I would try playing around with some cruder approaches before tackling the implementation of a softmax cost function, as one of these might be sufficient for your purposes. For example, you could use a tanh error function and then just renormalise all the outputs to sum to 1. Or, if you are actually only interested in the most likely classification, you could just take the output with the highest score.
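The cruder approaches can be applied as a post-processing step on whatever the network outputs. Here is a sketch in plain Python that does not use FANN at all; the `raw` values stand in for hypothetical sigmoid outputs of a trained network, and the function names are mine. A post-hoc softmax is included only for comparison; it rescales the scores but is not the same thing as training with a softmax cost function.

```python
import math

def renormalise(outputs):
    """Divide non-negative outputs by their sum so they add up to 1."""
    total = sum(outputs)
    if total <= 0:
        raise ValueError("outputs must have a positive sum")
    return [o / total for o in outputs]

def softmax(scores):
    """Post-hoc softmax (shifted by the max for numerical stability)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def most_likely(outputs):
    """Index of the output with the highest score."""
    return max(range(len(outputs)), key=outputs.__getitem__)

raw = [0.2, 0.9, 0.4]  # hypothetical per-class outputs in (0, 1)

print("renormalised:", renormalise(raw))
print("softmax:     ", softmax(raw))
print("predicted class:", most_likely(raw))
```

All three agree on which class wins; they differ only in how the scores are scaled, so if you only need the predicted class, taking the maximum is enough.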
Steffen Nissen, the guy behind FANN, presents an example here where he tries to classify what language a text is written in based on letter frequency. I think he uses a tanh error function (default) and just takes the class with the biggest score, but he indicates that it works well.
I am currently studying a doctoral thesis in control theory. At the end of every chapter there is a simulation of a problem related to the chapter's subject. I have finished the theory, but for further understanding I would like to reproduce the simulations. The first simulation is as follows:
The solution of the problem results in a system of differential equations whose right-hand side consists of functions with unknown parameters. The author states the following: "We will use neural networks with one hidden layer, sigmoid basis functions and 5 weights in the external layer in order to approximate every parameter of the unknown functions. More specifically, the weights of the hidden layer are selected through iterative trials and are kept stable during the simulation." He then states the logic with which he selects the initial values of the unknown parameters and shows the results of the simulation.
Could anyone give me a lead on where to look and what I need to know in order to solve this specific problem myself in MATLAB (since this is the environment I am most familiar with)? The results of a Google search are chaotic, since I don't really know what I'm looking for.
If you need any more info, feel free to ask!
You can try MATLAB's Neural Network Toolbox. This gives you a nice UI where you can configure the network, train it with data to find the parameter values, and test its performance. No coding involved.
Or, you can program it by hand. Since you are working with one hidden layer, it should be very simple. I am sure any machine learning or neural net (NN) textbook would have an example of it. You can also look on GitHub for projects. There should be many NN projects there, in case you are looking to salvage code from an existing project.
Most importantly, you should start by learning about NNs, if you haven't done that already. An NN with a single hidden layer is easy to implement once you understand the equations for forward and back-propagation.
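To show how little code a hand-written single-hidden-layer network needs, here is a sketch in plain Python (it ports to MATLAB almost line for line). It loosely follows the setup in the question: five sigmoid basis functions whose hidden-layer parameters are fixed, and only the five output weights are adapted. Everything specific here is my assumption and not taken from the thesis: the slopes and centres, the target function sin(x), the learning rate, and the simple gradient update (the thesis most likely uses its own adaptation law).

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Fixed hidden layer: (slope, centre) pairs chosen by hand (assumed values).
HIDDEN = [(2.0, -2.0), (2.0, -1.0), (2.0, 0.0), (2.0, 1.0), (2.0, 2.0)]

def basis(x):
    """Outputs of the five fixed sigmoid basis functions."""
    return [sigmoid(a * (x - c)) for a, c in HIDDEN]

def predict(w, x):
    """Network output: weighted sum of the basis functions."""
    return sum(wi * p for wi, p in zip(w, basis(x)))

def mse(w, target, xs):
    return sum((predict(w, x) - target(x)) ** 2 for x in xs) / len(xs)

def train(target, xs, lr=0.05, epochs=500):
    """Adapt only the 5 output weights by per-sample gradient descent."""
    w = [0.0] * len(HIDDEN)
    for _ in range(epochs):
        for x in xs:
            phi = basis(x)
            err = sum(wi * p for wi, p in zip(w, phi)) - target(x)
            w = [wi - lr * err * p for wi, p in zip(w, phi)]
    return w

xs = [-3.0 + 0.25 * i for i in range(25)]  # sample points in [-3, 3]
weights = train(math.sin, xs)

print("error before:", mse([0.0] * 5, math.sin, xs))
print("error after: ", mse(weights, math.sin, xs))
```

Because the hidden layer is fixed, the output is linear in the trainable weights, which is what makes this kind of approximator convenient in adaptive control.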
I just started playing around with neural networks and, as I would expect, in order to train a neural network effectively there must be some relation between the function to approximate and activation function.
For instance, I had good results using sin(x) as an activation function when approximating cos(x), or two tanh(x) to approximate a Gaussian. Now, to approximate a function about which I know nothing, I am planning to use a cocktail of activation functions, for instance a hidden layer with some sin, some tanh and a logistic function. In your opinion, does this make sense?
Thank you,
Tunnuz
While it is true that different activation functions have different merits (mainly for either biological plausibility or a unique network design like radial basis function networks), in general you should be able to use any continuous squashing function and expect to be able to approximate most functions encountered in real-world training data.
The two most popular choices are the hyperbolic tangent and the logistic function, since they both have easily calculable derivatives and interesting behavior around zero.
If neither of those allows you to accurately approximate your function, my first response wouldn't be to change activation functions. Rather, you should first investigate your training set and network training parameters (learning rates, number of units in each pool, weight decay, momentum, etc.).
If you're still stuck, step back and make sure you're using the right architecture (feed forward vs. simple recurrent vs. full recurrent) and learning algorithm (back-propagation vs. back-prop through time vs. contrastive Hebbian vs. evolutionary/global methods).
One side note: make sure you never use a linear activation function in hidden layers (it is fine for output layers or crazy simple tasks), as linear units have very well documented limitations, namely the need for linear separability.
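The limitation of linear activations can be seen directly: two stacked linear layers compute exactly the same function as a single linear layer whose weight matrix is the product of the two, so the extra layer adds no expressive power. A minimal sketch in plain Python with made-up weights (biases are left out to keep it short):

```python
def matvec(m, v):
    """Matrix-vector product."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def matmul(a, b):
    """Matrix-matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

W1 = [[1.0, -2.0],
      [0.5, 3.0]]   # hidden layer weights (made up)
W2 = [[2.0, 1.0]]   # output layer weights (made up)
x = [4.0, -1.0]

# Two layers with linear (identity) activations...
two_layer = matvec(W2, matvec(W1, x))
# ...equal one layer with the combined weight matrix W2 * W1.
collapsed = matvec(matmul(W2, W1), x)

print(two_layer, collapsed)
```

A non-linear squashing function between the layers is what prevents this collapse and lets the network handle problems such as XOR that are not linearly separable.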