Why use Crossover in Neural Network training? - neural-network

Why specifically is it used?
I know it increases variation which may help explore the problem space, but how much does it increase the probability of finding the optimal solution/configuration in time? And does it do anything else advantageous?
And does it necessarily always help, or are there instances in which it would increase the time taken to find the optimal solution?

As Patrick Trentin said, crossover improve the speed of convergence, because it allows to combine good genes that are already found in the population.
But, for neuro-evolution, crossover is facing the "permutation problem", also known as "the competing convention problem". When two parents are permutations of the same network, then, except in rare cases, their offspring will always have a lower fitness. Because the same part of the network is copied in two different locations, and so the offspring is losing viable genes for one of these two locations.
for example the networks A,B,C,D and D,C,B,A that are permutations of the same network. The offspring can be:
A,B,C,D (copy of parent 1)
D,C,B,A (copy of parent 2)
A,C,B,D OK
A,B,C,A
A,B,B,A
A,B,B,D
A,C,B,A
A,C,C,A
A,C,C,D
D,B,C,A OK
D,C,B,D
D,B,B,A
D,B,B,D
D,B,C,D
D,C,C,A
D,C,C,D
So, for this example, 2/16 of the offspring are copies of the parents. 2/16 are combinations without duplicates. And 12/16 have duplicated genes.
The permutation problem occurs because networks that are permutations one of the other have the same fitness. So, even for an elitist GA, if one is selected as parent, the other will also often be selected as parent.
The permutations may be only partial. In this case, the result is better than for complete permutations, but the offspring will, in a lot of cases, still have a lower fitness than the parents.
To avoid the permutation problem, I heard about similarity based crossover, that compute similarity of neurons and their connected synapses, doing the crossing-over between the most similar neurons instead of a crossing-over based on the locus.
When evolving topology of the networks, some NEAT specialists think the permutation problem is part of a broader problem: "the variable lenght genome problem". And NEAT seems to avoid this problem by speciation of the networks, when two networks are too differents in topology and weights, they aren't allowed to mate. So, NEAT algorithm seems to consider permuted networks as too different, and doesn't allow them to mate.
A website about NEAT also says:
However, in another sense, one could say that the variable length genome problem can never be "solved" because it is inherent in any system that generates different constructions that solve the same problem. For example, both a bird and a bat represent solutions to the problem of flight, yet they are not compatible since they are different conventions of doing the same thing. The same situation can happen in NEAT, where very different structures might arise that do the same thing. Of course, such structures will not mate, avoiding the serious consequence of damaged offspring. Still, it can be said that since disparate representations can exist simultaneously, incompatible genomes are still present and therefore the problem is not "solved." Ultimately, it is subjective whether or not the problem has been solved. It depends on what you would consider a solution. However, it is at least correct to say, "the problem of variable length genomes is avoided."
Edit: To answer your comment.
You may be right for similarity based crossover, I'm not sure it totally avoids the permutation problem.
About the ultimate goal of crossover, without considering the permutation problem, I'm not sure it is useful for the evolution of neural networks, but my thought is: if we divide a neural network in several parts, each part contributes to the fitness, so two networks with a high fitness may have different good parts. And combining these parts should create an even better network. Some offspring will of course inherit the bad parts, but some other offspring will inherit the good parts.
Like Ray suggested, it could be useful to experiment the evolution of neural networks with and without crossover. As there is randomness in the evolution, the problem is to run a large number of tests, to compute the average evolution speed.
About evolving something else than a neural network, I found a paper that says an algorithm using crossover outperforms a mutation-only algorithm for solving the all-pairs shortest path problem (APSP).
Edit 2:
Even if the permutation problem seems to be only applicable to some particular problems like neuro-evolution, I don't think we can say the same about crossover, because maybe we are missing something about the problems that don't seem to be suitable for crossover.
I found a free version of the paper about similarity based crossover for neuro-evolution, and it shows that:
an algorithm using a naive crossover performs worse than a mutation-only algorithm.
using similarity based crossover it performs better than a mutation-only algorithm for all tested cases.
NEAT algorithm sometimes performs better than a mutation-only algorithm.
Crossover is complex and I think there is a lack of studies that compare it with mutation-only algorithms, maybe because its usefulness highly depends:
of its engineering, in function of particular problems like the permutation problem. So of the type of crossover we use (similarity based, single point, uniform, edge recombination, etc...).
And of the mating algorithm. For example, this paper shows that a gendered genetic algorithm strongly outperforms a non-gendered genetic algorithm for solving the TSP. For solving two other problems, the algorithm doesn't strongly outperforms, but it is better than the non-gendered GA. In this experiment, males are selected on their fitness, and females are selected on their ability to produce a good offspring. Unfortunately, this study doesn't compare the results with a mutation-only algorithm.

Related

Does it matter which algorithm you use for Multiple Imputation by Chained Equations (MICE)

I have seen MICE implemented with different types of algorithms e.g. RandomForest or Stochastic Regression etc.
My question is that does it matter which type of algorithm i.e. does one perform the best? Is there any empirical evidence?
I am struggling to find any info on the web
Thank you
Yes, (depending on your task) it can matter quite a lot, which algorithm you choose.
You also can be sure, the mice developers wouldn't out effort into providing different algorithms, if there was one algorithm that anyway always performs best. Because, of course like in machine learning the "No free lunch theorem" is also relevant for imputation.
In general you can say, that the default settings of mice are often a good choice.
Look at this example from the miceRanger Vignette to see, how far imputations can differ for different algorithms. (the real distribution is marked in red, the respective multiple imputations in black)
The Predictive Mean Matching (pmm) algorithm e.g. makes sure that only imputed values appear, that were really in the dataset. This is for example useful, where only integer values like 0,1,2,3 appear in the data (and no values in between). Other algorithms won't do this, so while doing their regression they will also provide interpolated values like on the picture to the right ( so they will provide imputations that are e.g. 1.1, 1.3, ...) Both solutions can come with certain drawbacks.
That is why it is important to actually assess imputation performance afterwards. There are several diagnostic plots in mice to do this.

What is the Difference between evolutionary computing and classification?

I am looking for some comprehensive description. I couldn't find it via browsing as things are more clustered on the web and its not in my scope currently.
Classification and evolutionary computing is comparing oranges to apples. Let me explain:
Classification is a type of problem, where the goal is to determine a label given some input. (Typical example, given pixel values, determine image label).
Evolutionary computing is a family of algorithms to solve different types of problems. They work with a "population" of candidates (imagine a set of different neural networks trying to solve a given problem). Somehow you evaluate how good each candidate is in the given task (typically using a "fitness function", but there are other methods). Then a new generation of candidates is produced, taking the best candidates from the previous generation as a model, and including mutations and cross-over (that is, introducing changes). Repeat until happy.
Evolutionary computing can absolutely be used for classification! But there are examples where it is used in different ways. You may use evolutionary computing to create an artificial neural network controlling a robot (in this case, inputs are sensor values, outputs are commands for actuators). Or to create original content free of a given goal, as in Picbreeder.
Classification may be solved using evolutionary computation (maybe this is why you where confused in the first place) but other techniques are also common. You can use decision trees, or notably deep-learning (based on backpropagation).
Deep-learning based on backpropagation may sound similar to evolutionary computation, but it is quite different. Here you have only one artificial neural network, and a clear rule (backpropagation) telling you which changes to introduce every iteration.
Hope this helps to complement other answers!
Classification algorithms and evolutionary computing are different approaches. However, they are related in some ways.
Classification algorithms aim to identify the class label of new instances. They are trained with some labeled instances. For example, recognition of digits is a classification algorithm.
Evolutionary algorithms are used to find out the minimum or maximum solution of an optimization problem. They randomly explore the solution space of the given problem. They can find a good solution in a reasonable time and are not able to find the global optimum in all problems.
In some classification approaches, evolutionary algorithms are used to find out the optimal value of the parameters.

How many and which parents should we select for crossover in genetic algorithm

I have read many tutorials, papers and I understood the concept of Genetic Algorithm, but I have some problems to implement the problem in Matlab.
In summary, I have:
A chromosome containing three genes [ a b c ] with each gene constrained by some different limits.
Objective function to be evaluated to find the best solution
What I did:
Generated random values of a, b and c, say 20 populations. i.e
[a1 b1 c1] [a2 b2 c2]…..[a20 b20 c20]
At each solution, I evaluated the objective function and ranked the solutions from best to worst.
Difficulties I faced:
Now, why should we go for crossover and mutation? Is the best solution I found not enough?
I know the concept of doing crossover (generating random number, probability…etc) but which parents and how many of them will be selected to do crossover or mutation?
Should I do the crossover for the entire 20 solutions (parents) or only two of them?
Generally a Genetic Algorithm is used to find a good solution to a problem with a huge search space, where finding an absolute solution is either very difficult or impossible. Obviously, I don't know the range of your values but since you have only three genes it's likely that a good solution will be found by a Genetic Algorithm (or a simpler search strategy at that) without any additional operators. Selection and Crossover is usually carried out on all chromosome in the population (although it's not uncommon to carry some of the best from each generation forward as is). The general idea is that the fitter chromosomes are more likely to be selected and undergo crossover with each other.
Mutation is usually used to stop the Genetic Algorithm prematurely converging on a non-optimal solution. You should analyse the results without mutation to see if it's needed. Mutation is usually run on the entire population, at every generation, but with a very small probability. Giving every gene 0.05% chance that it will mutate isn't uncommon. You usually want to give a small chance of mutation, without it completely overriding the results of selection and crossover.
As has been suggested I'd do a lit bit more general background reading on Genetic Algorithms to give a better understanding of its concepts.
Sharing a bit of advice from 'Practical Neural Network Recipies in C++' book... It is a good idea to have a significantly larger population for your first epoc, then your likely to include features which will contribute to an acceptable solution. Later epocs which can have smaller populations will then tune and combine or obsolete these favourable features.
And Handbook-Multiparent-Eiben seems to indicate four parents are better than two. However bed manufactures have not caught on to this yet and seem to only produce single and double-beds.

Is there a rule/good advice on how big a artificial neural network should be?

My last lecture on ANN's was a while ago but I'm currently facing a project where I would want to use one.
So, the basics - like what type (a mutli-layer feedforward network), trained by an evolutionary algorithm (thats a given by the project), how many input-neurons (8) and how many ouput-neurons (7) - are set.
But I'm currently trying to figure out how many hidden layers I should use and how many neurons in each of these layers (the ea doesn't modify the network itself, but only the weights).
Is there a general rule or maybe a guideline on how to figure this out?
The best approach for this problem is to implement the cascade correlation algorithm, in which hidden nodes are sequentially added as necessary to reduce the error rate of the network. This has been demonstrated to be very useful in practice.
An alternative, of course, is a brute-force test of various values. I don't think simple answers such as "10 or 20 is good" are meaningful because you are directly addressing the separability of the data in high-dimensional space by the basis function.
A typical neural net relies on hidden layers in order to converge on a particular problem solution. A hidden layer of about 10 neurons is standard for networks with few input and output neurons. However, a trial an error approach often works best. Since the neural net will be trained by a genetic algorithm the number of hidden neurons may not play a significant role especially in training since its the weights and biases on the neurons which would be modified by an algorithm like back propogation.
As rcarter suggests, trial and error might do fine, but there's another thing you could try.
You could use genetic algorithms in order to determine the number of hidden layers or and the number of neurons in them.
I did similar things with a bunch of random forests, to try and find the best number of trees, branches, and parameters given to each tree, etc.

Neural Net Optimize w/ Genetic Algorithm

Is a genetic algorithm the most efficient way to optimize the number of hidden nodes and the amount of training done on an artificial neural network?
I am coding neural networks using the NNToolbox in Matlab. I am open to any other suggestions of optimization techniques, but I'm most familiar with GA's.
Actually, there are multiple things that you can optimize using GA regarding NN.
You can optimize the structure (number of nodes, layers, activation function etc.).
You can also train using GA, that means setting the weights.
Genetic algorithms will never be the most efficient, but they usually used when you have little clue as to what numbers to use.
For training, you can use other algorithms including backpropagation, nelder-mead etc..
You said you wanted to optimize number hidden nodes, for this, genetic algorithm may be sufficient, although far from "optimal". The space you are searching is probably too small to use genetic algorithms, but they can still work and afaik, they are already implemented in matlab, so no biggie.
What do you mean by optimizing amount of training done? If you mean number of epochs, then that's fine, just remember that training is somehow dependent on starting weights and they are usually random, so the fitness function used for GA won't really be a function.
A good example of neural networks and genetic programming is the NEAT architecture (Neuro-Evolution of Augmenting Topologies). This is a genetic algorithm that finds an optimal topology. It's also known to be good at keeping the number of hidden nodes down.
They also made a game using this called Nero. Quite unique and very amazing tangible results.
Dr. Stanley's homepage:
http://www.cs.ucf.edu/~kstanley/
Here you'll find just about everything NEAT related as he is the one who invented it.
Genetic algorithms can be usefully applied to optimising neural networks, but you have to think a little about what you want to do.
Most "classic" NN training algorithms, such as Back-Propagation, only optimise the weights of the neurons. Genetic algorithms can optimise the weights, but this will typically be inefficient. However, as you were asking, they can optimise the topology of the network and also the parameters for your training algorithm. You'll have to be especially wary of creating networks that are "over-trained" though.
One further technique with a modified genetic algorithms can be useful for overcoming a problem with Back-Propagation. Back-Propagation usually finds local minima, but it finds them accurately and rapidly. Combining a Genetic Algorithm with Back-Propagation, e.g., in a Lamarckian GA, gives the advantages of both. This technique is briefly described during the GAUL tutorial
It is sometimes useful to use a genetic algorithm to train a neural network when your objective function isn't continuous.
I'm not sure whether you should use a genetic algorithm for this.
I suppose the initial solution population for your genetic algorithm would consist of training sets for your neural network (given a specific training method). Usually the initial solution population consists of random solutions to your problem. However, random training sets would not really train your neural network.
The evaluation algorithm for your genetic algorithm would be a weighed average of the amount of training needed, the quality of the neural network in solving a specific problem and the numer of hidden nodes.
So, if you run this, you would get the training set that delivered the best result in terms of neural network quality (= training time, number hidden nodes, problem solving capabilities of the network).
Or are you considering an entirely different approach?
I'm not entirely sure what kind of problem you're working with, but GA sounds like a little bit of overkill here. Depending on the range of parameters you're working with, an exhaustive (or otherwise unintelligent) search may work. Try plotting your NN's performance with respect to number of hidden nodes for a first few values, starting small and jumping by larger and larger increments. In my experience, many NNs plateau in performance surprisingly early; you may be able to get a good picture of what range of hidden node numbers makes the most sense.
The same is often true for NNs' training iterations. More training helps networks up to a point, but soon ceases to have much effect.
In the majority of cases, these NN parameters don't affect performance in a very complex way. Generally, increasing them increases performance for a while but then diminishing returns kick in. GA is not really necessary to find a good value on this kind of simple curve; if the number of hidden nodes (or training iterations) really does cause the performance to fluctuate in a complicated way, then metaheuristics like GA may be apt. But give the brute-force approach a try before taking that route.
I would tend to say that genetic algorithms is a good idea since you can start with a minimal solution and grow the number of neurons. It is very likely that the "quality function" for which you want to find the optimal point is smooth and has only few bumps.
If you have to find this optimal NN frequently I would recommend using optimization algorithms and in your case quasi newton as described in numerical recipes which is optimal for problems where the function is expensive to evaluate.