NEAT algorithm: How to crossover disjoint and excess genes? - neural-network

I am currently implementing the NEAT algorithm developed by Kenneth Stanley, taking the original paper as a reference.
In the section where the crossover method is described, one thing confuses me a little bit.
So, the above figure is illustrating the crossover method for NEAT. To decide from which parent a gene is inherited, the paper says the following:
Matching genes are inherited
randomly, whereas disjoint genes (those that do not match in the middle) and excess
genes (those that do not match in the end) are inherited from the more fit parent.
For the matching genes (1 - 5) it's easy to understand. You just randomly inherit from either Parent1 or Parent2 (with 50% chance for both). But for the disjoint (6-8) and excess (9-10) genes you cannot inherit from the more fit parent because you only have those genes in either Parent1 or Parent2.
For example:
Parent1's fitness is higher than Parent2's. The disjoint gene 6 only exists in Parent2 (of course, because disjoint and excess genes only occur in one parent)
So, you cannot decide to inherit this gene from the more fit parent. Same goes for all other disjoint and excess genes. You can only inherit those from the parent they exist in.
So my question is: Do you maybe inherit all matching genes from the more fit parent and just take over the disjoint and excess genes? Or do I misunderstand something here?
Thanks in advance.

It might help to look at the actual implementation and see how it is handled. In the original C++ code here (look at lines 2085 onwards), the disjoint and excess genes from the less fit parent seem to be simply skipped.
In your implementation, you could inherit disjoint and excess genes from the less fit parent but disable them with probability 1, so that you can do pointwise mutations on them (toggle disabled to enabled) later on. However, this might result in significant genome bloat, so test and see what works.
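For reference, here is a minimal sketch of a crossover with that behaviour, assuming a genome is just a dictionary keyed by innovation number (the names `crossover`, `parent1_genes`, etc. are illustrative, not taken from the original NEAT source):

```python
import random

def crossover(parent1_genes, parent2_genes, fitness1, fitness2):
    """parentX_genes: {innovation_number: gene}; returns the child's genes."""
    # Orient so that parent1 is the fitter parent (ties are broken arbitrarily
    # here; the paper randomizes inheritance when fitness is equal).
    if fitness2 > fitness1:
        parent1_genes, parent2_genes = parent2_genes, parent1_genes

    child = {}
    for innov, gene in parent1_genes.items():
        if innov in parent2_genes:
            # Matching gene: inherit randomly from either parent.
            child[innov] = random.choice([gene, parent2_genes[innov]])
        else:
            # Disjoint/excess gene of the fitter parent: always inherited.
            child[innov] = gene
    # Disjoint/excess genes that exist only in the less fit parent are skipped,
    # which is what the original C++ code appears to do.
    return child
```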

It makes more sense to take the mismatching genes only from the more fit parent; this produces a strong offspring from the crossover. For the matching genes, apply the usual crossover operator. To improve diversity, create a second offspring by randomly selecting the mismatching genes from the two parents.
In this way, the first offspring will be more fit and the second offspring will maintain diversity. Hope this helps.

The graphic depicts the special case of two parents with the same fitness, so selection is random again and therefore could lead to the depicted case. I agree that it is misleading without that additional piece of information.

Related

NEAT Two Identical Genes with Different Innovation Numbers

In the paper describing the NEAT algorithm, the author states
A possible problem is that the same structural innovation will receive different innovation numbers in the same generation if it occurs by chance more than once. However, by keeping a list of the innovations that occurred in the current generation, it is possible to ensure that when the same structure arises more than once through independent mutations in the same generation, each identical mutation is assigned the same innovation number.
This makes sense because you don't want identical genes to end up with different innovation numbers. If they did, there would be problems when crossing over two genomes with identical genes but different innovation numbers because you would end up with an offspring with a copy of each gene from each parent, creating the same connection twice.
What doesn't make sense to me though is what happens if a mutation occurs between two genes, and then in the next generation the same mutation occurs? In the paper, it's very clear that only a list of mutations in the current generation is kept to avoid an "explosion of innovation numbers" but doesn't specify anything about what happens if the same mutation occurs across different generations.
Do you keep a global list of gene pairs and their corresponding innovation numbers to prevent this problem? Is there a reason why the paper only specifies what happens if the same mutation happens in the same generation and doesn't consider the case of cross-generational mutations?
No, you don't keep a global list of gene pairs. You can, if you want to prevent the same mutation from being assigned a new number, but I want to point out that it doesn't really matter. The only effect of the same mutation occurring again is that you do some unnecessary computation and your global innovation number increases.
For future genomes, however, there is no chance that they will end up carrying the same innovation under two different innovation numbers.
Matching genes are inherited
randomly, whereas disjoint genes (those that do not match in the middle) and excess
genes (those that do not match in the end) are inherited from the more fit parent
So when two identical innovations occur (with different innovation numbers), they will be either disjoint or excess genes. These are inherited only from the more fit parent, and only one parent can be the fitter one, so a single offspring will never contain both copies of the same innovation.
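As a side note, the per-generation innovation list the paper describes can be as simple as a dictionary that is cleared at the end of every generation. A minimal sketch (the class and method names are made up for illustration):

```python
class InnovationTracker:
    def __init__(self):
        self.next_innovation = 0
        self.current_generation = {}   # (in_node, out_node) -> innovation number

    def get_innovation(self, in_node, out_node):
        # Identical structural mutations within one generation reuse a number.
        key = (in_node, out_node)
        if key not in self.current_generation:
            self.current_generation[key] = self.next_innovation
            self.next_innovation += 1
        return self.current_generation[key]

    def end_generation(self):
        # Forget this generation's mutations; the same structure appearing in a
        # later generation simply gets a fresh innovation number.
        self.current_generation.clear()
```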

Skipping steps in Normalization?

Just curious: is there some reason why one cannot do all necessary normalizations in a single step? Isn't normalization ultimately the redrawing of the functional dependency (FD) graph? We start out with an FD diagram/graph and we want to end up with a graph (vertices are attributes; there is an edge between attributes a, b if b is functionally dependent on a) representing a relation in (Edit) BCNF?
EDIT: What I mean is: we start with an FD graph, which is a graph pairing attributes a, b iff b is functionally dependent on a, i.e., we join a and b with an edge iff b = f(a).
From this graph we want to obtain a graph (FD)_2 with certain traits, which are equivalent to having been fully normalized, i.e., (FD)_2 is in 5NF or 6NF, using the graph-theoretical relation between a graph and a given normal form. If so, we are basically mapping one graph to another graph. Can we use this approach, drawing (FD)_2 directly as a function of FD, to skip normalization steps?
Yes: Normalization can be characterized by rearranging (hyper)graphs. It does not have to be done by moving through normal forms in some order. (It's just a common misconception that it is.)
The normal forms on the continuum from 1NF to 6NF are those dealing with problematic FDs (functional dependencies) and JDs (join dependencies). They can be ordered so that if a relation value or variable satisfies a form then it satisfies the forms before but not necessarily after. Currently: 1NF, 2NF, 3NF, EKNF, BCNF, 4NF, ETNF, RFNF, SKNF, 5NF aka PJ/NF, Overstrong PJ/NF, 6NF. This ordering has nothing to do per se with decomposing to relation values or variables that are in higher normal forms. It is not necessary to decompose through a sequence of forms.
The normal forms are just different conditions that have been found to have helpful properties. Moreover, the normal forms are just those that have been discovered; there may well be other helpful properties yet to be distinguished. We don't pass through them in sequence to normalize; ETNF, for instance, only dates from 2012!
As to your graph characterization:
An FD has a set of attributes as determinant, which determines another set. But since the one set determines the other if and only if it determines each of the sets containing exactly one member of the other, informally but unambiguously we also talk about a set of attributes determining a single attribute. An FD {...} -> a holds iff a = f(...). (There can be zero or more determinant attributes.) BCNF is the highest normal form re problematic FDs, but there are higher normal forms re problematic JDs. A JD with given components holds in a relation iff the relation is always their join, i.e., its meaning/predicate can be expressed as the AND of the components'. So an FD {...} -> a holds iff a JD holds corresponding to a meaning/predicate with conjunct a = f(...)! An MVD (multi-valued dependency) corresponds to a certain binary JD. 5NF means that every JD that holds is "implied by the keys" (a technical term).
There are algorithms that, starting from the FDs, decompose directly to 2NF, directly to 3NF, and directly to BCNF (with various other properties, such as preservation of FDs). See the Alice book. One can decompose to 6NF simply by decomposing until there are no nontrivial JDs, without regard to FDs.
(See C. J. Date's Database Design and Relational Theory: Normal Forms and All That Jazz.)
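To make the "directly to BCNF" point concrete, here is a rough sketch (my own, not taken from the references above) of the textbook decomposition that goes straight from a set of FDs to BCNF sub-schemas, with no intermediate 2NF/3NF steps. Attributes are single characters and an FD is a (lhs_set, rhs_set) pair; both are assumptions made for illustration:

```python
def closure(attrs, fds):
    """Attribute closure of `attrs` under the FDs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def bcnf_decompose(relation, fds):
    """Return a list of sub-schemas, each in BCNF w.r.t. the given FDs.
    Sketch only: FDs are not re-derived for sub-schemas, so this is not a
    complete algorithm."""
    relation = set(relation)
    for lhs, rhs in fds:
        lhs, rhs = set(lhs), set(rhs) & relation
        if not lhs <= relation or not rhs or rhs <= lhs:
            continue                              # FD trivial or not applicable here
        if not relation <= closure(lhs, fds):     # lhs is not a superkey: BCNF violation
            part1 = closure(lhs, fds) & relation
            part2 = lhs | (relation - part1)
            return bcnf_decompose(part1, fds) + bcnf_decompose(part2, fds)
    return [relation]

# Example: R(A, B, C) with A -> B and B -> C; B -> C violates BCNF,
# so the schema splits into {B, C} and {A, B}.
print(bcnf_decompose("ABC", [({"A"}, {"B"}), ({"B"}, {"C"})]))
```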

Error Correcting Tournaments (ect) Multi Class Classification in Vowpal Wabbit

I tried to go through this paper, which describes the ECT algorithm, but could not make much out of it.
I know it is different from one-against-all (oaa) and even performs better than oaa. I wanted a simple explanation of how ECT works.
ECT and Filter trees are useful (only) if you have a very big number of output labels (classes), let's say N=1000. With OAA (one-against-all), that would mean doing N binary classification tasks for each example (during both training and testing). With ECT you can make the prediction much faster: about log(N) binary decisions. You can imagine Filter trees (which are the basis of ECT) as a decision tree where in each node you ask whether the example belongs to one set of labels or another set of labels (using all the features, unlike original decision trees).
In general, ECT is worse (in terms of loss or accuracy) than OAA (but in some cases it may be almost as good as OAA). With N=10 labels, you should try OAA first. With N>1000, OAA is too slow (and even the accuracy is low), so you should try ECT (or --log_multi or --csoaa_ldf in VW, if you can preselect a smaller number of labels which are relevant for each example).
See http://cilvr.cs.nyu.edu/diglib/lsml/logarithmic.pdf
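If it helps the intuition, here is a toy sketch of the tree-of-binary-decisions idea described above. It is not how Vowpal Wabbit is implemented; the tree layout and the `classifiers` mapping are assumptions made purely for illustration:

```python
def predict(labels, classifiers, features, lo=0, hi=None, node=0):
    """Descend a binary tree over labels[lo:hi]; `classifiers[node]` is a
    function mapping features to True (go left) or False (go right). A
    prediction costs about log2(N) binary decisions instead of N (as with OAA)."""
    if hi is None:
        hi = len(labels)
    if hi - lo == 1:                      # reached a leaf: one label remains
        return labels[lo]
    mid = (lo + hi) // 2
    if classifiers[node](features):       # binary decision at this node
        return predict(labels, classifiers, features, lo, mid, 2 * node + 1)
    else:
        return predict(labels, classifiers, features, mid, hi, 2 * node + 2)
```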

Shannon's Entropy measure in Decision Trees

Why is Shannon's Entropy measure used in Decision Tree branching?
Entropy(S) = - p(+)log( p(+) ) - p(-)log( p(-) )
I know it is a measure of the number of bits needed to encode information; the more uniform the distribution, the higher the entropy. But I don't see why it is so frequently applied in creating decision trees (choosing a branch point).
Because you want to ask the question that will give you the most information. The goal is to minimize the number of decisions/questions/branches in the tree, so you start with the question that will give you the most information and then use the following questions to fill in the details.
For the sake of decision trees, forget about the number of bits and just focus on the formula itself. Consider a binary (+/-) classification task where you have an equal number of + and - examples in your training data. Initially, the entropy will be 1 since p(+) = p(-) = 0.5. You want to split the data on an attribute that most decreases the entropy (i.e., makes the distribution of classes least random). If you choose an attribute, A1, that is completely unrelated to the classes, then the entropy will still be 1 after splitting the data by the values of A1, so there is no reduction in entropy. Now suppose another attribute, A2, perfectly separates the classes (e.g., the class is always + for A2="yes" and always - for A2="no"). In this case, the entropy drops to zero, which is the ideal case.
In practical cases, attributes don't typically perfectly categorize the data (the entropy is greater than zero). So you choose the attribute that "best" categorizes the data (provides the greatest reduction in entropy). Once the data are separated in this manner, another attribute is selected for each of the branches from the first split in a similar manner to further reduce the entropy along that branch. This process is continued to construct the tree.
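Here is a small worked example of that idea; the tiny dataset and the A1/A2 attributes are made up to mirror the description above, and the reduction in entropy is usually called information gain:

```python
import math

def entropy(labels):
    counts = {c: labels.count(c) for c in set(labels)}
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def information_gain(rows, labels, attribute):
    """Entropy before the split minus the weighted entropy after splitting
    on `attribute` (a key into each row dict)."""
    before = entropy(labels)
    after = 0.0
    for value in set(row[attribute] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[attribute] == value]
        after += len(subset) / len(labels) * entropy(subset)
    return before - after

rows   = [{"A1": "x", "A2": "yes"}, {"A1": "x", "A2": "no"},
          {"A1": "y", "A2": "yes"}, {"A1": "y", "A2": "no"}]
labels = ["+", "-", "+", "-"]

print(entropy(labels))                       # 1.0: balanced classes
print(information_gain(rows, labels, "A1"))  # 0.0: A1 is unrelated to the class
print(information_gain(rows, labels, "A2"))  # 1.0: A2 separates the classes perfectly
```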
You seem to have an understanding of the math behind the method, but here is a simple example that might give you some intuition behind why this method is used: Imagine you are in a classroom that is occupied by 100 students. Each student is sitting at a desk, and the desks are organized such that there are 10 rows and 10 columns. 1 out of the 100 students has a prize that you can have, but you must guess which student it is to get the prize. The catch is that every time you guess, the prize decreases in value. You could start by asking each student individually whether or not they have the prize. However, initially you only have a 1/100 chance of guessing correctly, and it is likely that by the time you find the prize it will be worthless (think of every guess as a branch in your decision tree). Instead, you could ask broad questions that dramatically reduce the search space with each question, for example "Is the student somewhere in rows 1 through 5?" Whether the answer is "Yes" or "No", you have reduced the number of potential branches in your tree by half.

Difference between Disjoint and Excess genes in NEAT?

I was reading Stanley's paper but I couldn't figure out what exactly Disjoint and Excess genes are in NEAT. I understand that both kinds are genes whose innovation numbers appear in only one of the two parents. But what distinguishes them?
Could anyone shed some light into the issue?
When aligning two genomes by gene ID (innovation number), the mismatches at the ends are referred to as excess genes, and all other mismatches are referred to as disjoint genes. As far as I am aware no NEAT implementation has ever treated disjoint and excess genes differently. The distinction was made in early NEAT papers, most probably because treating the types of mismatches differently was being suggested as a possible future research topic.
(FYI - speaking as the author of SharpNEAT).
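In code, the distinction comes down to comparing innovation numbers against the other genome's maximum. A minimal sketch, assuming a genome is represented simply as a set of innovation numbers (the representation and function name are illustrative):

```python
def classify_mismatches(genome_a, genome_b):
    """genome_a, genome_b: sets of innovation numbers. Returns (disjoint, excess)."""
    max_a, max_b = max(genome_a), max(genome_b)
    mismatches = genome_a ^ genome_b          # genes present in only one genome
    # Mismatches beyond the other genome's highest innovation number are excess;
    # all other mismatches are disjoint.
    excess   = {i for i in mismatches if i > min(max_a, max_b)}
    disjoint = mismatches - excess
    return disjoint, excess

# Using the innovation numbers from the crossover question above
# (matching 1-5, disjoint 6-8, excess 9-10):
print(classify_mismatches({1, 2, 3, 4, 5, 8}, {1, 2, 3, 4, 5, 6, 7, 9, 10}))
# -> ({6, 7, 8}, {9, 10})
```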