NEAT: Two Identical Genes with Different Innovation Numbers

In the paper about the NEAT algorithm, the author states:
A possible problem is that the same structural innovation will receive different innovation numbers in the same generation if it occurs by chance more than once. However, by keeping a list of the innovations that occurred in the current generation, it is possible to ensure that when the same structure arises more than once through independent mutations in the same generation, each identical mutation is assigned the same innovation number.
This makes sense because you don't want identical genes to end up with different innovation numbers. If they did, there would be problems when crossing over two genomes with identical genes but different innovation numbers: the offspring would inherit a copy of the gene from each parent, creating the same connection twice.
What doesn't make sense to me, though, is what happens if a mutation occurs between two genes and then the same mutation occurs in the next generation. The paper is very clear that only a list of mutations in the current generation is kept, to avoid an "explosion of innovation numbers", but it doesn't specify what happens if the same mutation occurs across different generations.
Do you keep a global list of gene pairs and their corresponding innovation numbers to prevent this problem? Is there a reason why the paper only specifies what happens if the same mutation happens in the same generation and doesn't consider the case of cross-generational mutations?

No, you don't keep a global list of gene pairs. You can if you want to prevent the same mutation from receiving a new innovation number, but I want to point out that it doesn't matter: the only effect of the same mutation happening again is some unnecessary computation and a higher global innovation number.
For future genomes, however, there is no chance that they will end up carrying the same innovation under two different innovation numbers:
Matching genes are inherited randomly, whereas disjoint genes (those that do not match in the middle) and excess genes (those that do not match in the end) are inherited from the more fit parent.
So when two identical innovations occur, they will be either disjoint or excess genes. These are inherited from the more fit parent, and only one parent can be fitter, so one offspring will never contain two genes representing the same innovation.
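For concreteness, here is a minimal sketch of the per-generation bookkeeping the paper describes; the function and variable names are mine, not from the paper or any particular implementation. Two identical mutations within one generation share a number, while the registry reset means the same mutation in a later generation receives a fresh (but harmless) one, exactly the behaviour discussed above:

```python
import itertools

# One global counter plus a per-generation registry mapping
# (in_node, out_node) -> innovation number. The registry is cleared at
# the start of every generation, mirroring the paper's bookkeeping.
innovation_counter = itertools.count(1)
generation_innovations = {}

def next_innovation(in_node, out_node):
    """Assign (or reuse) the innovation number for an add-connection
    mutation. Two identical mutations in the same generation share a
    number; the same mutation in a later generation gets a fresh one."""
    key = (in_node, out_node)
    if key not in generation_innovations:
        generation_innovations[key] = next(innovation_counter)
    return generation_innovations[key]

def start_new_generation():
    # Resetting each generation keeps the list small and avoids the
    # "explosion of innovation numbers" the paper warns about.
    generation_innovations.clear()
```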

Related

In FastText skip-gram training, what will happen if some sentences in the corpus have just one word?

Imagine that you have a corpus in which some lines have just one word, so there is no context around some of the words. In this situation, how does FastText provide embeddings for these single words? Note that the frequency of some of these words is one, and there is no cut-off to get rid of them.
There's no way to train a context_word -> target_word skip-gram pair for such words (in either 'context' or 'target' roles), so such words can't receive trained representations. Only texts with at least 2 tokens contribute anything to word2vec or FastText word-vector training.
(One possible exception: FastText in its 'supervised classification' mode might be able to make use of, and train vectors for, such words, because then even single words can be used to predict the known-label of training texts.)
I suspect that such corpuses will still result in the model counting the word in its initial vocabulary-discovery scan, and thus it will be allocated a vector (if it appears at least min_count times), and that vector will receive the usual small-random-vector initialization. But the word-vector will receive no further training – so when you request the vector back after training, it will be of low-quality, with the only meaningful contributions coming from any char n-grams shared with other words that received real training.
You should consider any text-breaking process that results in single-word texts as buggy for the purposes of FastText. If those single-word texts come from another meaningful context where they were once surrounded by other contextual words, you should change your text-breaking process to work in larger chunks that retain that context.
Also note: it's rare for min_count=1 to be a good idea for word-vector models, at least when the training text is real natural-language material where word-token frequencies roughly follow Zipf's law. There will be many, many 1-occurrence (or few-occurrence) words, but with just one to a few example usage contexts, not likely representing the true breadth and subtleties of that word's real usages, it's nearly impossible for such words to receive good vectors that generalize to other uses of those same words elsewhere.
Training good vectors requires a variety of usage examples, and just one or a few examples will practically be "noise" compared to the tens-to-hundreds of examples of other words' usage. So keeping these rare words, instead of dropping them as a default min_count=5 (or higher in larger corpuses) would do, tends to slow training, slow convergence ("settling") of the model, and lower the quality of the other more-frequent word vectors at the end – due to the significant-but-largely-futile efforts of the algorithm to helpfully position these many rare words.
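As an illustration, here is a small sketch using the Gensim FastText implementation (assuming Gensim 4.x; the toy corpus and the made-up token "zyxwv" are mine). It shows that a word appearing only in a single-word text still enters the vocabulary with min_count=1 and returns a vector, even though that vector was never trained through any skip-gram pair:

```python
from gensim.models import FastText

# A toy corpus: most lines have context, but one line is a single token.
corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
    ["zyxwv"],  # single-word line: no skip-gram pairs can be formed here
]

# min_count=1 keeps even one-occurrence words in the vocabulary,
# which is rarely a good idea on real corpora (see discussion above).
model = FastText(sentences=corpus, vector_size=32, window=2,
                 min_count=1, sg=1, epochs=10, seed=42)

# The word is in the vocabulary and a vector is returned, but it was
# never trained as part of any (context, target) pair; its vector is
# mostly its random initialization plus whatever char n-grams it
# shares with words that did receive real training.
print("zyxwv" in model.wv.key_to_index)  # True
print(model.wv["zyxwv"][:5])             # low-quality, mostly untrained
```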

NEAT algorithm: How to crossover disjoint and excess genes?

I am currently implementing the NEAT algorithm developed by Kenneth Stanley, taking the original paper as a reference.
In the section where the crossover method is described, one thing confuses me a little bit.
So, the above figure illustrates the crossover method for NEAT. To decide from which parent a gene is inherited, the paper says the following:
Matching genes are inherited randomly, whereas disjoint genes (those that do not match in the middle) and excess genes (those that do not match in the end) are inherited from the more fit parent.
For the matching genes (1-5) it's easy to understand: you just randomly inherit from either Parent1 or Parent2 (with a 50% chance for each). But for the disjoint (6-8) and excess (9-10) genes, you cannot inherit from the more fit parent, because those genes exist in only one of the parents.
For example:
Parent1's fitness is higher than Parent2's. The disjoint gene 6 exists only in Parent2 (of course, because disjoint and excess genes only occur in one parent).
So, you cannot decide to inherit this gene from the more fit parent. Same goes for all other disjoint and excess genes. You can only inherit those from the parent they exist in.
So my question is: Do you maybe inherit all matching genes from the more fit parent and just take over the disjoint and excess genes? Or do I misunderstand something here?
Thanks in advance.
It might help to look at the actual implementation and see how it is handled. In the original C++ code here (look at lines 2085 onwards), the disjoint and excess genes from the unfit parent seem to be just skipped.
In your implementation, you could inherit disjoint and excess genes from the unfit parent but disable them with probability 1 so you can do pointwise mutations on them (toggle disabled to enabled) later on. However, this might result in significant genome bloat, so test and see what works.
It makes more sense to take mismatching genes only from the more fit parent; this will create a strong offspring as a result of crossover. For matching genes, apply the usual crossover operator. To improve diversity, create a second offspring by randomly selecting mismatching genes from both parents.
In this way, the first offspring will be more fit and the second offspring will maintain diversity. Hope this helps.
The graphic depicts the special case of two parents with the same fitness, so selection is random again and therefore could lead to the depicted case. I agree that it is misleading without that additional piece of information.
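Putting the behaviour described above together, here is a minimal sketch of such a crossover. It is a hypothetical illustration, not the original implementation: genomes are reduced to dicts keyed by innovation number, matching genes are chosen randomly, and disjoint/excess genes of the less fit parent are simply skipped, as in the C++ code referenced above.

```python
import random

def crossover(parent1, parent2, fitness1, fitness2):
    """Sketch of NEAT crossover over dicts of innovation number -> gene."""
    # Make parent1 the fitter parent; break fitness ties randomly.
    if fitness2 > fitness1 or (fitness1 == fitness2 and random.random() < 0.5):
        parent1, parent2 = parent2, parent1
    child = {}
    for innov, gene in parent1.items():
        if innov in parent2:
            # Matching gene: take either parent's copy with 50% chance.
            child[innov] = random.choice((gene, parent2[innov]))
        else:
            # Disjoint or excess gene of the fitter parent: always kept.
            child[innov] = gene
    # Disjoint/excess genes of the less fit parent are simply skipped.
    return child

# Toy genomes echoing the question: genes 1-5 match, 6-8 are disjoint
# in Parent2, 9-10 are excess in Parent1 (gene payloads are just labels).
p1 = {i: f"p1-{i}" for i in (1, 2, 3, 4, 5, 9, 10)}
p2 = {i: f"p2-{i}" for i in (1, 2, 3, 4, 5, 6, 7, 8)}
print(sorted(crossover(p1, p2, fitness1=3.0, fitness2=1.0)))  # [1..5, 9, 10]
```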

How can one have disjoint genes when comparing two genomes in NEAT?

In the NEAT paper it is said that "Genes that do not match are either disjoint or excess, depending on whether they occur within or outside the range of the other parent's innovation numbers". I can't understand how it is possible for disjoint genes to arise, since I don't see a way for a genome to have a gap between innovation numbers within its connection genes. From what I understand, an innovation number is shared within a given genome and is incremented whenever a new gene appears in the genome. Could someone explain it?
Okay, I think I know the answer now. A list of innovations shared among all genomes is kept. Whenever a structural innovation appears in some genome, the list is checked for that innovation. If the list doesn't contain it, the global innovation number is incremented and assigned to the structural innovation, and the innovation is appended to the list together with its innovation number. If the list does contain it, the existing innovation number is returned and assigned to the structural innovation. For example, in a two-genome scenario, the first genome could have had 7 innovations and the second one only 5. Suppose the first 5 innovations were the same for both. Then a new innovation is added to the second genome, and it turns out to be the same innovation as the 7th one in the first genome. So the new (6th) innovation of the second genome is assigned innovation number 7, leaving a gap at number 6.
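A small sketch of that shared innovation list may make the worked example concrete (the node ids and connection pairs below are invented for illustration). Genome 2's sixth innovation is assigned number 7 because genome 1 already registered the same structure, which leaves a gap at 6 and is exactly how disjoint genes arise:

```python
# Shared innovation list kept across the whole run: keys identify a
# structural innovation (here a new connection between two node ids),
# values are innovation numbers.
innovations = {}
next_number = 0

def innovation_number(in_node, out_node):
    """Return the number for this structure, assigning a fresh one only
    if no genome has ever produced this innovation before."""
    global next_number
    key = (in_node, out_node)
    if key not in innovations:
        next_number += 1
        innovations[key] = next_number
    return innovations[key]

# Genome 1 acquires 7 innovations; its 7th is the connection (3, 5).
for conn in [(0, 2), (1, 2), (0, 3), (1, 3), (2, 4), (3, 4), (3, 5)]:
    innovation_number(*conn)

# Genome 2 shares the first 5 innovations and then independently adds
# (3, 5): it receives number 7, not 6, so its gene list has a gap at 6.
print(innovation_number(3, 5))  # -> 7
```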

Shannon's Entropy measure in Decision Trees

Why is Shannon's Entropy measure used in Decision Tree branching?
Entropy(S) = -p(+) log2(p(+)) - p(-) log2(p(-))
I know it is a measure of the number of bits needed to encode information; the more uniform the distribution, the higher the entropy. But I don't see why it is so frequently applied when creating decision trees (choosing a branch point).
Because you want to ask the question that will give you the most information. The goal is to minimize the number of decisions/questions/branches in the tree, so you start with the question that will give you the most information and then use the following questions to fill in the details.
For the sake of decision trees, forget about the number of bits and just focus on the formula itself. Consider a binary (+/-) classification task where you have an equal number of + and - examples in your training data. Initially, the entropy will be 1 since p(+) = p(-) = 0.5. You want to split the data on an attribute that most decreases the entropy (i.e., makes the distribution of classes least random). If you choose an attribute, A1, that is completely unrelated to the classes, then the entropy will still be 1 after splitting the data by the values of A1, so there is no reduction in entropy. Now suppose another attribute, A2, perfectly separates the classes (e.g., the class is always + for A2="yes" and always - for A2="no"). In this case, the entropy is zero, which is the ideal case.
In practical cases, attributes don't typically perfectly categorize the data (the entropy is greater than zero). So you choose the attribute that "best" categorizes the data (provides the greatest reduction in entropy). Once the data are separated in this manner, another attribute is selected for each of the branches from the first split in a similar manner to further reduce the entropy along that branch. This process is continued to construct the tree.
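A short sketch of the computation may help; it reproduces the A1/A2 comparison above for a node with 10 + and 10 - examples (the counts are illustrative):

```python
import math

def entropy(pos, neg):
    """Binary Shannon entropy, in bits, of a node holding pos/neg examples."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        p = count / total
        if p > 0:  # 0 * log(0) is taken as 0
            h -= p * math.log2(p)
    return h

# A node with 10 + and 10 - examples has the maximum entropy of 1 bit.
print(entropy(10, 10))  # 1.0

# Splitting on the uninformative attribute A1 leaves each branch at
# 1 bit, so the information gain is zero.
gain_a1 = entropy(10, 10) - (0.5 * entropy(5, 5) + 0.5 * entropy(5, 5))
print(gain_a1)  # 0.0

# Splitting on A2, which perfectly separates the classes, drives the
# weighted child entropy to 0, for the maximum possible gain of 1 bit.
gain_a2 = entropy(10, 10) - (0.5 * entropy(10, 0) + 0.5 * entropy(0, 10))
print(gain_a2)  # 1.0
```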
You seem to have an understanding of the math behind the method, but here is a simple example that might give you some intuition behind why this method is used: Imagine you are in a classroom occupied by 100 students. Each student is sitting at a desk, and the desks are organized such that there are 10 rows and 10 columns. 1 out of the 100 students has a prize that you can have, but you must guess which student it is to get the prize. The catch is that every time you guess, the prize decreases in value. You could start by asking each student individually whether or not they have the prize. However, initially you only have a 1/100 chance of guessing correctly, and it is likely that by the time you find the prize it will be worthless (think of every guess as a branch in your decision tree). Instead, you could ask broad questions that dramatically reduce the search space with each question, for example "Is the student somewhere in rows 1 through 5?" Whether the answer is "Yes" or "No", you have reduced the number of potential branches in your tree by half.

Difference between Disjoint and Excess genes in NEAT?

I was reading Stanley's paper but I couldn't figure out what exactly Disjoint and Excess genes are in NEAT. I understand they appear to be related, in that both contain innovation numbers not present in both parents. But what distinguishes them?
Could anyone shed some light into the issue?
When aligning two genomes by gene ID (innovation number), the mismatches at the ends are referred to as excess genes, and all other mismatches are referred to as disjoint genes. As far as I am aware no NEAT implementation has ever treated disjoint and excess genes differently. The distinction was made in early NEAT papers, most probably because treating the types of mismatches differently was being suggested as a possible future research topic.
(FYI - speaking as the author of SharpNEAT).
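In code, the distinction is just a question of whether a mismatched innovation number lies beyond the other genome's range. A minimal sketch, with genomes reduced to sets of innovation numbers for illustration:

```python
def classify_mismatches(genome1, genome2):
    """Split the mismatched genes of two genomes into disjoint and
    excess, following the definition above: excess genes lie beyond the
    other genome's highest innovation number; all other mismatches are
    disjoint."""
    ids1, ids2 = set(genome1), set(genome2)
    mismatches = ids1 ^ ids2  # genes present in only one of the genomes
    excess = {i for i in mismatches
              if (i in ids1 and i > max(ids2)) or (i in ids2 and i > max(ids1))}
    disjoint = mismatches - excess
    return disjoint, excess

# Example: gene 6 is missing from the second genome (disjoint), while
# genes 9 and 10 lie beyond its range (excess).
d, e = classify_mismatches({1, 2, 3, 4, 5, 6, 7, 8, 9, 10},
                           {1, 2, 3, 4, 5, 7, 8})
print(sorted(d))  # [6]
print(sorted(e))  # [9, 10]
```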