How to do disjoint classification without softmax output? - neural-network

What's the correct way to do 'disjoint' classification (where the outputs are mutually exclusive, i.e. true probabilities sum to 1) in FANN, since it doesn't seem to have an option for softmax output?
My understanding is that if I use sigmoid outputs, as if doing 'labeling', I won't get correct results for a classification problem.

FANN only supports tanh and linear error functions. This means, as you say, that the outputs of the neural network are not guaranteed to sum to 1. There is no easy way to implement a softmax output, as this would mean changing the cost function and hence the error function used in the backpropagation routine. As FANN is open source, you could have a look at implementing this yourself. A question on Cross Validated seems to give the equations you would have to implement.
Although not the mathematically elegant solution you are looking for, I would try playing around with some cruder approaches before tackling the implementation of a softmax cost function, as one of these might be sufficient for your purposes. For example, you could use a tanh error function and then just renormalise all the outputs to sum to 1. Or, if you are actually only interested in the most likely classification, you could just take the output with the highest score.
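The two cruder approaches above could be sketched roughly as follows. This is plain Python rather than FANN code: `raw_outputs` is a made-up vector standing in for whatever the network returns, and it is assumed to be non-negative (outputs from a symmetric activation in [-1, 1] would need to be shifted into [0, 1] first).

```python
def renormalise(outputs):
    """Scale non-negative scores so that they sum to 1."""
    total = sum(outputs)
    if total <= 0:
        # Degenerate case: fall back to a uniform distribution.
        return [1.0 / len(outputs)] * len(outputs)
    return [o / total for o in outputs]


def most_likely_class(outputs):
    """Return the index of the highest-scoring output."""
    return max(range(len(outputs)), key=lambda i: outputs[i])


# Hypothetical raw network outputs for a 3-class problem (assumed in [0, 1]).
raw_outputs = [0.9, 0.3, 0.3]

probabilities = renormalise(raw_outputs)   # sums to 1
predicted = most_likely_class(raw_outputs)  # index of the winning class
```

Renormalising does not change which output is largest, so the predicted class is the same whether or not the scores are rescaled first.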
Steffen Nissen, the guy behind FANN, presents an example here where he tries to classify what language a text is written in based on letter frequency. I think he uses a tanh error function (default) and just takes the class with the biggest score, but he indicates that it works well.

Related

Activation function for output layer for regression models in Neural Networks

I have been experimenting with neural networks these days. I have come across a general question regarding the activation function to use. This might be a well-known fact, but I couldn't understand it properly. A lot of the examples and papers I have seen work on classification problems, and they use either sigmoid (in the binary case) or softmax (in the multi-class case) as the activation function in the output layer, which makes sense. But I haven't seen any activation function used in the output layer of a regression model.
So my question is: is it by choice that we don't use any activation function in the output layer of a regression model, because we don't want the activation function to limit or restrict the value? The output value can be any number, even in the thousands, so an activation function like sigmoid or tanh won't make sense. Or is there another reason? Or can we actually use some activation function that is made for this kind of problem?
For a linear-regression type of problem, you can simply create the output layer without any activation function, since we are interested in numerical values without any transformation.
More info:
https://machinelearningmastery.com/regression-tutorial-keras-deep-learning-library-python/
For classification:
You can use sigmoid, tanh, softmax, etc.
If you have, say, a sigmoid as the activation function in the output layer of your NN, you will never get a value less than 0 or greater than 1.
Basically, if the data you're trying to predict are distributed within that range, you might try a sigmoid function and test whether your predictions perform well on your training set.
More generally, when predicting data you should come up with the function that represents your data most effectively.
Hence, if your real data does not fit a sigmoid function well, you have to think of some other function (e.g. a polynomial function, a periodic function, or a combination of them), but you should also always consider how easily you can build your cost function and evaluate its derivatives.
Just use a linear activation function without limiting the output value range unless you have some reasonable assumption about it.
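To illustrate why a bounded output activation is a poor fit for unbounded regression targets, here is a minimal sketch of a single output neuron evaluated with and without a sigmoid. The weights, bias and hidden activations are invented for the example and do not come from any trained model.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def output_neuron(hidden, weights, bias, activation=None):
    """Weighted sum of hidden activations, optionally passed through an activation."""
    z = sum(w * h for w, h in zip(weights, hidden)) + bias
    return z if activation is None else activation(z)


# Hypothetical hidden-layer activations and output weights.
hidden = [2.0, 3.0]
weights = [100.0, 400.0]
bias = 50.0

linear_out = output_neuron(hidden, weights, bias)                       # unbounded
sigmoid_out = output_neuron(hidden, weights, bias, activation=sigmoid)  # squashed into [0, 1]
```

With a linear output the neuron can produce a value in the thousands; with a sigmoid the same pre-activation can never leave the unit interval, whatever the weights are.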

Activation functions - Neural Network

I am working with neural networks in my free time.
I have already built a simple XOR operation with a neural network.
But I don't know how to choose the correct activation function.
Is there a trick, or is it just mathematical logic?
There are a lot of options for activation functions, such as identity, logistic, tanh, ReLU, etc.
The choice of activation function can be based on the gradient computation (back-propagation). E.g. the logistic function is differentiable everywhere, but it saturates when the input has a large magnitude, so its gradient becomes very small and optimization slows down. In that case ReLU is preferred over the logistic function.
The above is only one simple example of how to choose an activation function. It really depends on the actual situation.
Besides, I don't think the activation functions used in an XOR neural network are representative of more complex applications.
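The saturation effect mentioned above can be seen by comparing derivatives directly. The following is a small sketch; the sample inputs are arbitrary.

```python
import math


def logistic(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def logistic_grad(z):
    """Derivative of the logistic function: s * (1 - s)."""
    s = logistic(z)
    return s * (1.0 - s)


def relu_grad(z):
    """Derivative of ReLU (taken as 0 at z == 0 by convention here)."""
    return 1.0 if z > 0 else 0.0


inputs = [0.0, 2.0, 10.0]
logistic_grads = [logistic_grad(z) for z in inputs]
relu_grads = [relu_grad(z) for z in inputs]
```

The logistic gradient peaks at 0.25 for an input of zero and shrinks towards zero as the input grows, while the ReLU gradient stays at 1 for every positive input.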
The subject of when to use a particular activation function over another is a subject of ongoing academic research. You can find papers related to it by searching for journal articles related to "neural network activation function" in an academic database, or through a Google Scholar search, such as this one:
https://scholar.google.com/scholar?hl=en&as_sdt=0%2C2&q=neural+network+activation+function&btnG=&oq=neural+network+ac
Generally, which function to use depends mostly on what you are trying to do. An activation function is like a lens. You put input into your network, and it comes out changed or focused in some way by the activation function. How your input should be changed depends on what you are trying to achieve. You need to think of your problem, then figure out what function will help you shape your signal into the results you are trying to approximate.
Ask yourself, what is the shape of the data you are trying to model? If it is linear or approximately so, then a linear activation function will suffice. If it is more "step-shaped," you would want to use something like Sigmoid or Tanh (the Tanh function is actually just a scaled and shifted Sigmoid), because their graphs exhibit a similar shape. In the case of your XOR problem, we know that either of those will work quite well; both work by squashing the output into a bounded range, (0, 1) for Sigmoid and (-1, 1) for Tanh. If you need something that doesn't flatten out away from zero like those two do, the ReLU function might be a good choice (in fact ReLU is probably the most popular activation function these days, and deserves far more serious study than this answer can provide).
You should analyze the graph of each one of these functions and think about the effects each will have on your data. You know the data you will be putting in. When that data goes through the function, what will come out? Will that particular function help you get the output you want? If so, it is a good choice.
Furthermore, if you have a graph of some data with a really interesting shape that corresponds to some other function you know, feel free to use that one and see how it works! Some of ANN design is about understanding, but other parts (at least currently) are intuition.
You can solve your problem with sigmoid neurons. In this case the activation function is:
$\sigma(z) = \frac{1}{1 + e^{-z}}$
where:
$z = \sum_{j} w_{j} x_{j} + b$
In this formula, the $w_j$ are the weights for each input, $b$ is the bias and the $x_j$ are the inputs. Finally, you can use back-propagation to compute the gradients of the cost function with respect to the weights.
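As a concrete illustration of a sigmoid neuron, here is a minimal sketch that wires three such neurons into a 2-2-1 network computing XOR. It uses the common convention of a single bias added to the weighted sum. The weights are hand-picked for the example rather than learned by back-propagation, and the function names are made up.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def neuron(inputs, weights, bias):
    """Sigmoid neuron: sigma(sum_j w_j * x_j + b)."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(z)


def xor_net(x1, x2):
    """2-2-1 network with hand-picked (not trained) weights."""
    h_or = neuron([x1, x2], [20.0, 20.0], -10.0)     # roughly x1 OR x2
    h_nand = neuron([x1, x2], [-20.0, -20.0], 30.0)  # roughly NOT (x1 AND x2)
    return neuron([h_or, h_nand], [20.0, 20.0], -30.0)  # roughly h_or AND h_nand


predictions = [round(xor_net(a, b)) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]]
```

A trained network would arrive at different weights, but the structure (a hidden layer of sigmoid units followed by a sigmoid output) is the same.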

Using a learned Artificial Neural Network to solve inputs

I've recently been delving into artificial neural networks again, both evolved and trained. I have a question about what methods exist, if any, to solve for the inputs that would result in a target output set. Is there a name for this? Everything I search for leads me to backpropagation, which isn't necessarily what I need. In my search, the closest I've come to expressing my question is
Is it possible to run a neural network in reverse?
Which told me that there, indeed, would be many solutions for networks that had varying numbers of nodes for the layers and they would not be trivial to solve for. I had the idea of just marching toward an ideal set of inputs using the weights that have been established during learning. Does anyone else have experience doing something like this?
In order to elaborate:
Say you have a network with 401 input nodes which represents a 20x20 grayscale image and a bias, two hidden layers consisting of 100+25 nodes, as well as 6 output nodes representing a classification (symbols, roman numerals, etc).
After training a neural network so that it can classify with an acceptable error, I would like to run the network backwards. This would mean I would input a classification in the output that I would like to see, and the network would imagine a set of inputs that would result in the expected output. So for the roman numeral example, this could mean that I would request it to run the net in reverse for the symbol 'X' and it would generate an image that would resemble what the net thought an 'X' looked like. In this way, I could get a good idea of the features it learned to separate the classifications. I feel as it would be very beneficial in understanding how ANNs function and learn in the grand scheme of things.
For a simple feed-forward, fully connected NN, it is possible to project a hidden unit's activation into pixel space by taking the inverse of the activation function (for example the logit for sigmoid units), dividing it by the sum of the incoming weights and then multiplying that value by the weight of each pixel. That gives a visualization of the average pattern recognized by this hidden unit. Summing these patterns over the hidden units results in the average pattern that corresponds to this particular set of hidden unit activities. The same procedure can in principle be applied to project output activations into hidden unit activity patterns.
This is indeed useful for analyzing what features an NN has learned in image recognition. For more complex methods you can take a look at this paper (among other things, it contains examples of patterns that an NN can learn).
You cannot exactly run an NN in reverse, because it does not remember all the information from the source image, only the patterns that it learned to detect. So the network cannot "imagine a set of inputs". However, it is possible to sample a probability distribution (taking each weight as the probability of activation of the corresponding pixel) and produce a set of patterns that can be recognized by a particular neuron.
I know that you can, and I am working on a solution now. I have some code on my GitHub here for imagining the inputs of a neural network that classifies the handwritten digits of the MNIST dataset, but I don't think it is entirely correct. Right now, I simply take a trained network and my desired output and multiply backwards by the learned weights at each layer until I have a value for the inputs. This skips over the activation function and may have some other errors, but I am getting pretty reasonable images out of it. For example, this is the result of the trained network imagining a 3: [image: number 3]
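One possible reading of "multiply backwards by the learned weights" is to push the desired output through the transpose of each weight matrix, from the last layer to the first. The sketch below follows that reading; it is not the poster's actual code, it ignores activation functions and biases just as the description does, and the tiny 4-3-2 network with its weights is invented for illustration.

```python
def backproject(weights, activations):
    """Multiply by the transpose of a layer's weight matrix.

    weights[j][i] is the weight from input i to unit j of the layer,
    and activations[j] is the value assumed at unit j.
    """
    n_inputs = len(weights[0])
    return [
        sum(weights[j][i] * activations[j] for j in range(len(weights)))
        for i in range(n_inputs)
    ]


def imagine_input(layers, target):
    """Walk from the output layer back to the input layer."""
    signal = target
    for weights in reversed(layers):
        signal = backproject(weights, signal)
    return signal


# Hypothetical weights for a 4 -> 3 -> 2 network (not trained).
w_hidden = [
    [1.0, 0.0, 0.0, 1.0],
    [0.0, 1.0, 1.0, 0.0],
    [0.5, 0.5, 0.5, 0.5],
]
w_output = [
    [1.0, -1.0, 0.0],
    [-1.0, 1.0, 0.0],
]

target = [1.0, 0.0]  # "show me class 0"
imagined = imagine_input([w_hidden, w_output], target)
```

The result is a vector with one value per input pixel; positive entries mark pixels that push the network towards the requested class and negative entries mark pixels that push it away.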
Yes, you can run a probabilistic NN in reverse to get it to 'imagine' inputs that would match an output it's been trained to categorise.
I highly recommend Geoffrey Hinton's coursera course on NN's here:
https://www.coursera.org/course/neuralnets
He demonstrates in his introductory video a NN imagining various "2"s that it would recognise having been trained to identify the numerals 0 through 9. It's very impressive!
I think it's basically doing exactly what you're looking to do.
Gruff

Neural networks: classification using Encog

I'm trying to get started using neural networks for a classification problem. I chose to use the Encog 3.x library as I'm working on the JVM (in Scala). Please let me know if this problem is better handled by another library.
I've been using resilient backpropagation. I have 1 hidden layer, and e.g. 3 output neurons, one for each of the 3 target categories. So ideal outputs are either 1/0/0, 0/1/0 or 0/0/1. Now, the problem is that the training tries to minimize the error, e.g. turn 0.6/0.2/0.2 into 0.8/0.1/0.1 if the ideal output is 1/0/0. But since I'm picking the highest value as the predicted category, this doesn't matter for me, and I'd want the training to spend more effort in actually reducing the number of wrong predictions.
So I learnt that I should use a softmax function as the output (although it is unclear to me if this becomes a 4th layer or I should just replace the activation function of the 3rd layer with softmax), and then have the training reduce the cross entropy. Now I think that this cross entropy needs to be calculated either over the entire network or over the entire output layer, but the ErrorFunction that one can customize calculates the error on a neuron-by-neuron basis (reads array of ideal inputs and actual inputs, writes array of error values). So how does one actually do cross entropy minimization using Encog (or which other JVM-based library should I choose)?
I'm also working with Encog, but in Java, though I don't think that makes a real difference. I have a similar problem, and as far as I know you have to write your own function that minimizes cross entropy.
And as I understand it, softmax should just replace your 3rd layer.
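For reference, here is a minimal sketch of softmax with a cross-entropy loss. It is plain Python, not Encog code, and the example values are made up. The useful property is that, although the loss is defined over the whole output layer, its gradient with respect to each pre-softmax input reduces to a per-neuron difference between the predicted and ideal values. Whether that maps cleanly onto Encog's `ErrorFunction` interface is not confirmed here.

```python
import math


def softmax(logits):
    """Softmax over a list of pre-activation values (shifted for stability)."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(ideal, predicted):
    """Cross entropy between a one-hot ideal vector and predicted probabilities."""
    return -sum(t * math.log(p) for t, p in zip(ideal, predicted) if t > 0)


def output_deltas(ideal, predicted):
    """Gradient of the cross entropy w.r.t. the pre-softmax inputs: p_i - t_i."""
    return [p - t for t, p in zip(ideal, predicted)]


# Hypothetical pre-softmax outputs for a 3-class problem.
logits = [2.0, 1.0, 0.1]
ideal = [1.0, 0.0, 0.0]

probs = softmax(logits)
loss = cross_entropy(ideal, probs)
deltas = output_deltas(ideal, probs)
```

Because the per-neuron term is just predicted minus ideal, it can be written out element by element even though the softmax itself couples all outputs of the layer.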

Does it make sense to use an "activation function cocktail" for approximating an unknown function through a feed-forward neural network?

I just started playing around with neural networks and, as I would expect, in order to train a neural network effectively there must be some relation between the function to approximate and activation function.
For instance, I had good results using sin(x) as an activation function when approximating cos(x), or two tanh(x) to approximate a Gaussian. Now, to approximate a function about which I know nothing, I am planning to use a cocktail of activation functions, for instance a hidden layer with some sin, some tanh and a logistic function. In your opinion, does this make sense?
Thank you,
Tunnuz
While it is true that different activation functions have different merits (mainly for either biological plausibility or a unique network design like radial basis function networks), in general you should be able to use any continuous squashing function and expect to be able to approximate most functions encountered in real-world training data.
The two most popular choices are the hyperbolic tangent and the logistic function, since they both have easily calculable derivatives and interesting behavior around the axis.
If neither of those allows you to accurately approximate your function, my first response wouldn't be to change activation functions. Rather, you should first investigate your training set and network training parameters (learning rates, number of units in each pool, weight decay, momentum, etc.).
If you're still stuck, step back and make sure you're using the right architecture (feed forward vs. simple recurrent vs. full recurrent) and learning algorithm (back-propagation vs. back-prop through time vs. contrastive Hebbian vs. evolutionary/global methods).
One side note: Make sure you never use a linear activation function (except for output layers or crazy simple tasks), as these have very well documented limitations, namely the need for linear separability.
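The limitation of linear activations in hidden layers can be shown directly: two stacked linear layers compute exactly the same function as a single layer whose weight matrix is the product of the two. Here is a small sketch with made-up weights and biases omitted for brevity.

```python
def matvec(matrix, vector):
    """Matrix-vector product for lists of rows."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def matmul(a, b):
    """Matrix-matrix product for lists of rows."""
    columns = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in columns] for row in a]


# Hypothetical weights: a 2 -> 3 hidden layer and a 3 -> 1 output layer,
# both with linear (identity) activations.
w1 = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]]
w2 = [[1.0, 0.0, 2.0]]
x = [4.0, -2.0]

two_layer = matvec(w2, matvec(w1, x))    # pass through both layers
collapsed = matvec(matmul(w2, w1), x)    # single equivalent layer
```

Since the hidden layer adds nothing a single linear layer could not do, such a network can only separate classes with a linear boundary, however many layers it has.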