Difference between function approximator and optimization algorithm? - neural-network

I just started learning about artificial neural networks and genetic algorithms and found that the difference between them is that ANN is a function approximator and that GA is an optimization algorithm (according to SO). Problem is I am not 100% sure where and how to draw the line between these definitions; is there a simpler way to explain what the difference is using e.g. analogies (assume I am a 10 year old)? What I found particularly confusing is that both types seem to be able to solve the same problem in some cases (e.g. Traveling Salesman Problem).

ANNs approximate an unknown function that correlates input and output. The goal of ANNs is to find the mathematical relation between both: if is presented a new input, the modeling found by the net gives an approximation to the true value. Example: find the pressure of gas in a tube giving as input temperature, viscosity, density, section of tube, ecc., using a set of measurements for training.
GAs are used often to find max or min of a function (optimization). For example: find the optimal net (minor error) for my previous example, using a set of nets, or solve the traveling salesman problem (given a set of cities, visit each city once and find the minimal path).

Related

What is the Difference between evolutionary computing and classification?

I am looking for some comprehensive description. I couldn't find it via browsing as things are more clustered on the web and its not in my scope currently.
Classification and evolutionary computing is comparing oranges to apples. Let me explain:
Classification is a type of problem, where the goal is to determine a label given some input. (Typical example, given pixel values, determine image label).
Evolutionary computing is a family of algorithms to solve different types of problems. They work with a "population" of candidates (imagine a set of different neural networks trying to solve a given problem). Somehow you evaluate how good each candidate is in the given task (typically using a "fitness function", but there are other methods). Then a new generation of candidates is produced, taking the best candidates from the previous generation as a model, and including mutations and cross-over (that is, introducing changes). Repeat until happy.
Evolutionary computing can absolutely be used for classification! But there are examples where it is used in different ways. You may use evolutionary computing to create an artificial neural network controlling a robot (in this case, inputs are sensor values, outputs are commands for actuators). Or to create original content free of a given goal, as in Picbreeder.
Classification may be solved using evolutionary computation (maybe this is why you where confused in the first place) but other techniques are also common. You can use decision trees, or notably deep-learning (based on backpropagation).
Deep-learning based on backpropagation may sound similar to evolutionary computation, but it is quite different. Here you have only one artificial neural network, and a clear rule (backpropagation) telling you which changes to introduce every iteration.
Hope this helps to complement other answers!
Classification algorithms and evolutionary computing are different approaches. However, they are related in some ways.
Classification algorithms aim to identify the class label of new instances. They are trained with some labeled instances. For example, recognition of digits is a classification algorithm.
Evolutionary algorithms are used to find out the minimum or maximum solution of an optimization problem. They randomly explore the solution space of the given problem. They can find a good solution in a reasonable time and are not able to find the global optimum in all problems.
In some classification approaches, evolutionary algorithms are used to find out the optimal value of the parameters.

How can I improve the performance of a feedforward network as a q-value function approximator?

I'm trying to navigate an agent in a n*n gridworld domain by using Q-Learning + a feedforward neural network as a q-function approximator. Basically the agent should find the best/shortest way to reach a certain terminal goal position (+10 reward). Every step the agent takes it gets -1 reward. In the gridworld there are also some positions the agent should avoid (-10 reward, terminal states,too).
So far I implemented a Q-learning algorithm, that saves all Q-values in a Q-table and the agent performs well.
In the next step, I want to replace the Q-table by a neural network, trained online after every step of the agent. I tried a feedforward NN with one hidden layer and four outputs, representing the Q-values for the possible actions in the gridworld (north,south,east, west).
As input I used a nxn zero-matrix, that has a "1" at the current positions of the agent.
To reach my goal I tried to solve the problem from the ground up:
Explore the gridworld with standard Q-Learning and use the Q-map as training data for the Network once Q-Learning is finished
--> worked fine
Use Q-Learning and provide the updates of the Q-map as trainingdata
for NN (batchSize = 1)
--> worked good
Replacy the Q-Map completely by the NN. (This is the point, when it gets interesting!)
-> FIRST MAP: 4 x 4
As described above, I have 16 "discrete" Inputs, 4 Output and it works fine with 8 neurons(relu) in the hidden layer (learning rate: 0.05). I used a greedy policy with an epsilon, that reduces from 1 to 0.1 within 60 episodes.
The test scenario is shown here. Performance is compared beetween standard qlearning with q-map and "neural" qlearning (in this case i used 8 neurons and differnt dropOut rates).
To sum it up: Neural Q-learning works good for small grids, also the performance is okay and reliable.
-> Bigger MAP: 10 x 10
Now I tried to use the neural network for bigger maps.
At first I tried this simple case.
In my case the neural net looks as following: 100 input; 4 Outputs; about 30 neurons(relu) in one hidden layer; again I used a decreasing exploring factor for greedy policy; over 200 episodes the learning rate decreases from 0.1 to 0.015 to increase stability.
At frist I had problems with convergence and interpolation between single positions caused by the discrete input vector.
To solve this I added some neighbour positions to the vector with values depending on thier distance to the current position. This improved the learning a lot and the policy got better. Performance with 24 neurons is seen in the picture above.
Summary: the simple case is solved by the network, but only with a lot of parameter tuning (number of neurons, exploration factor, learning rate) and special input transformation.
Now here are my questions/problems I still haven't solved:
(1) My network is able to solve really simple cases and examples in a 10 x 10 map, but it fails as the problem gets a bit more complex. In cases where failing is very likely, the network has no change to find a correct policy.
I'm open minded for any idea that could improve performace in this cases.
(2) Is there a smarter way to transform the input vector for the network? I'm sure that adding the neighboring positons to the input vector on the one hand improve the interpolation of the q-values over the map, but on the other hand makes it harder to train special/important postions to the network. I already tried standard cartesian two-dimensional input (x/y) on an early stage, but failed.
(3) Is there another network type than feedforward network with backpropagation, that generally produces better results with q-function approximation? Have you seen projects, where a FF-nn performs well with bigger maps?
It's known that Q-Learning + a feedforward neural network as a q-function approximator can fail even in simple problems [Boyan & Moore, 1995].
Rich Sutton has a question in the FAQ of his web site related with this.
A possible explanation is the phenomenok known as interference described in [Barreto & Anderson, 2008]:
Interference happens when the update of one state–action pair changes the Q-values of other pairs, possibly in the wrong direction.
Interference is naturally associated with generalization, and also happens in conventional supervised learning. Nevertheless, in the reinforcement learning paradigm its effects tend to be much more harmful. The reason for this is twofold. First, the combination of interference and bootstrapping can easily become unstable, since the updates are no longer strictly local. The convergence proofs for the algorithms derived from (4) and (5) are based on the fact that these operators are contraction mappings, that is, their successive application results in a sequence converging to a fixed point which is the solution for the Bellman equation [14,36]. When using approximators, however, this asymptotic convergence is lost, [...]
Another source of instability is a consequence of the fact that in on-line reinforcement learning the distribution of the incoming data depends on the current policy. Depending on the dynamics of the system, the agent can remain for some time in a region of the state space which is not representative of the entire domain. In this situation, the learning algorithm may allocate excessive resources of the function approximator to represent that region, possibly “forgetting” the previous stored information.
One way to alleviate the interference problem is to use a local function approximator. The more independent each basis function is from each other, the less severe this problem is (in the limit, one has one basis function for each state, which corresponds to the lookup-table case) [86]. A class of local functions that have been widely used for approximation is the radial basis functions (RBFs) [52].
So, in your kind of problem (n*n gridworld), an RBF neural network should produce better results.
References
Boyan, J. A. & Moore, A. W. (1995) Generalization in reinforcement learning: Safely approximating the value function. NIPS-7. San Mateo, CA: Morgan Kaufmann.
André da Motta Salles Barreto & Charles W. Anderson (2008) Restricted gradient-descent algorithm for value-function approximation in reinforcement learning, Artificial Intelligence 172 (2008) 454–482

Extracting Patterns using Neural Networks

I am trying to extract common patterns that always appear whenever a certain event occurs.
For example, patient A, B, and C all had a heart attack. Using the readings from there pulse, I want to find the common patterns before the heart attack stroke.
In the next stage I want to do this using multiple dimensions. For example, using the readings from the patients pulse, temperature, and blood pressure, what are the common patterns that occurred in the three dimensions taking into consideration the time and order between each dimension.
What is the best way to solve this problem using Neural Networks and which type of network is best?
(Just need some pointing in the right direction)
and thank you all for reading
Described problem looks like a time series prediction problem. That means a basic prediction problem for a continuous or discrete phenomena generated by some existing process. As a raw data for this problem we will have a sequence of samples x(t), x(t+1), x(t+2), ..., where x() means an output of considered process and t means some arbitrary timepoint.
For artificial neural networks solution we will consider a time series prediction, where we will organize our raw data to a new sequences. As you should know, we consider X as a matrix of input vectors that will be used in ANN learning. For time series prediction we will construct a new collection on following schema.
In the most basic form your input vector x will be a sequence of samples (x(t-k), x(t-k+1), ..., x(t-1), x(t)) taken at some arbitrary timepoint t, appended to it predecessor samples from timepoints t-k, t-k+1, ..., t-1. You should generate every example for every possible timepoint t like this.
But the key is to preprocess data so that we get the best prediction results.
Assuming your data (phenomena) is continuous, you should consider to apply some sampling technique. You could start with an experiment for some naive sampling period Δt, but there are stronger methods. See for example Nyquist–Shannon Sampling Theorem, where the key idea is to allow to recover continuous x(t) from discrete x(Δt) samples. This is reasonable when we consider that we probably expect our ANNs to do this.
Assuming your data is discrete... you still should need to try sampling, as this will speed up your computations and might possibly provide better generalization. But the key advice is: do experiments! as the best architecture depends on data and also will require to preprocess them correctly.
The next thing is network output layer. From your question, it appears that this will be a binary class prediction. But maybe a wider prediction vector is worth considering? How about to predict the future of considered samples, that is x(t+1), x(t+2) and experiment with different horizons (length of the future)?
Further reading:
Somebody mentioned Python here. Here is some good tutorial on timeseries prediction with Keras: Victor Schmidt, Keras recurrent tutorial, Deep Learning Tutorials
This paper is good if you need some real example: Fessant, Francoise, Samy Bengio, and Daniel Collobert. "On the prediction of solar activity using different neural network models." Annales Geophysicae. Vol. 14. No. 1. 1996.

How to do disjoint classification without softmax output?

What's the correct way to do 'disjoint' classification (where the outputs are mutually exclusive, i.e. true probabilities sum to 1) in FANN since it doesn't seems to have an option for softmax output?
My understanding is that using sigmoid outputs, as if doing 'labeling', that I wouldn't be getting the correct results for a classification problem.
FANN only supports tanh and linear error functions. This means, as you say, that the probabilities output by the neural network will not sum to 1. There is no easy solution to implementing a softmax output, as this will mean changing the cost function and hence the error function used in the backpropagation routine. As FANN is open source you could have a look at implementing this yourself. A question on Cross Validated seems to give the equations you would have to implement.
Although not the mathematically elegant solution you are looking for, I would try play around with some cruder approaches before tackling the implementation of a softmax cost function - as one of these might be sufficient for your purposes. For example, you could use a tanh error function and then just renormalise all the outputs to sum to 1. Or, if you are actually only interested in what the most likely classification is you could just take the output with the highest score.
Steffen Nissen, the guy behind FANN, presents an example here where he tries to classify what language a text is written in based on letter frequency. I think he uses a tanh error function (default) and just takes the class with the biggest score, but he indicates that it works well.

Which multiplication and addition factor to use when doing adaptive learning rate in neural networks?

I am new to neural networks and, to get grip on the matter, I have implemented a basic feed-forward MLP which I currently train through back-propagation. I am aware that there are more sophisticated and better ways to do that, but in Introduction to Machine Learning they suggest that with one or two tricks, basic gradient descent can be effective for learning from real world data. One of the tricks is adaptive learning rate.
The idea is to increase the learning rate by a constant value a when the error gets smaller, and decrease it by a fraction b of the learning rate when the error gets larger. So basically the learning rate change is determined by:
+(a)
if we're learning in the right direction, and
-(b * <learning rate>)
if we're ruining our learning. However, on the above book there's no advice on how to set these parameters. I wouldn't expect a precise suggestion since parameter tuning is a whole topic on its own, but just a hint at least on their order of magnitude. Any ideas?
Thank you,
Tunnuz
I haven't looked at neural networks for the longest time (10 years+) but after I saw your question I thought I would have a quick scout about. I kept seeing the same figures all over the internet in relation to increase(a) and decrease(b) factor (1.2 & 0.5 respectively).
I have managed to track these values down to Martin Riedmiller and Heinrich Braun's RPROP algorithm (1992). Riedmiller and Braun are quite specific about sensible parameters to choose.
See: RPROP: A Fast Adaptive Learning Algorithm
I hope this helps.