How is encoding as a bit string helpful in a genetic algorithm?

Suppose we have to solve a global optimization problem in which we have to find the values of 5 variables, all of which are integers. Assume we get the following two parent chromosomes:
Parent 1: 6, 10, 3, 5, 12
Parent 2: 12, 10, 3, 8, 11
If we do crossover after the first two elements, we get the following:
Child 1: 6, 10, 3, 8, 11
Child 2: 12, 10, 3, 5, 12
Here we can clearly see that the children are related to the parents.
But when we encode as bit strings, then each chromosome is encoded as a single string of bits and we can, at random, choose any point for crossover. I do not see how this is any more beneficial than completely randomly picking any trial solutions.
I have a similar question with mutation. We randomly flip a bit. If the bit flipped has a small place value, then the change will be small. But if it has a big place value, the change will be big. How is it better than completely randomly changing a chromosome?

Binary encoding is still common mainly because the first works on genetic algorithms used that encoding.
Furthermore, it's often space efficient: [6, 10, 3, 5, 12] represented as a sequence of integers would probably require 5 * 32 bits; with a bit-string representation, 5 * 4 bits are enough (assuming numbers in the [0, 15] range).
Under this aspect the knapsack problem is the best case for the bit string representation (each bit says if the corresponding object is in the knapsack).
we can, at random, choose any point for crossover. I do not see how this is any more beneficial than completely randomly picking any trial solutions
In general, choosing a crossover point in the middle of a digit introduces an essentially arbitrary mutation of that digit, with a destructive (negative) effect.
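For concreteness, here's a rough sketch of one-point crossover and bit-flip mutation on the packed representation (assuming 4 bits per variable and values in [0, 15], as in the space-efficiency example above):

#include <cstdint>
#include <random>

// Pack the 5 variables into one 20-bit chromosome; variable k occupies bits 4k..4k+3.
uint32_t encode(const int vars[5])
{
    uint32_t c = 0;
    for (int k = 0; k < 5; ++k)
        c |= static_cast<uint32_t>(vars[k] & 0xF) << (4 * k);
    return c;
}

// One-point crossover at an arbitrary bit position 1..19. A cut that falls inside a
// 4-bit field mixes that variable's bits from both parents -- the essentially
// arbitrary mutation of a digit described above.
uint32_t crossover(uint32_t p1, uint32_t p2, int cut)
{
    uint32_t low = (1u << cut) - 1;   // bits below the cut come from p2
    return (p1 & ~low) | (p2 & low);
}

// Mutation: flip one random bit. Flipping bit 4k changes variable k by 1, while
// flipping bit 4k+3 changes it by 8 -- exactly the place-value effect the question asks about.
uint32_t mutate(uint32_t c, std::mt19937 &rng)
{
    return c ^ (1u << std::uniform_int_distribution<int>(0, 19)(rng));
}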
There is a nice example of this effect in the Local Search Algorithms and Optimization Problems - Genetic Algorithms section of Artificial Intelligence: A Modern Approach (Russell, Norvig).
Also take a look at a similar question on Software Engineering.


_mm256_packs_epi32, except pack sequentially

One can use _mm256_packs_epi32 as follows: __m256i e = _mm256_packs_epi32(ai, bi);
In the debugger, I see the value of ai: m256i_i32 = {0, 1, 0, 1, 1, 1, 0, 1}. I also see the value of bi: m256i_i32 = {1, 1, 1, 1, 0, 0, 0, 1}. The packing gave me e: m256i_i16 = {0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1}. The packing is interleaved: e contains the first four numbers of ai, then the first four numbers of bi, then the last four numbers of ai, then the last four numbers of bi, in that order.
I am wondering if there is an instruction that just packs ai and bi side by side without the interleaving.
vpermq after packing would work, but I'm wondering if there's a single instruction to achieve this.
No sequential-across-lanes pack until AVX-512, unfortunately. (And even then only for 1 register, or not with saturation.)
The in-lane behaviour of shuffles like vpacksswd and vpalignr is one of the major warts of AVX2 that make the 256-bit versions of those shuffles less useful than their __m128i versions. But on Intel, and Zen2 CPUs, it is often still best to use __m256i vectors with a vpermq at the end, if you need the elements in a specific order. (Or vpermd with a vector constant after 2 levels of packing: How do I efficiently reorder bytes of a __m256i vector (convert int32_t to uint8_t)?)
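For example, the usual idiom is a rough two-instruction sketch like this (assuming the values fit in int16, so the signed saturation of vpacksswd doesn't change them):

__m256i packed   = _mm256_packs_epi32(ai, bi);                                // in-lane: a0..a3, b0..b3, a4..a7, b4..b7
__m256i in_order = _mm256_permute4x64_epi64(packed, _MM_SHUFFLE(3, 1, 2, 0)); // vpermq: a0..a7, b0..b7

The vpermq costs one extra lane-crossing shuffle uop, but only once per result vector.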
If your 32-bit elements came from unpacking narrower elements, and you don't care about order of the wider elements, you can widen with in-lane unpacks, which sets you up to pack back into the original order.
This is cheap for zero-extending unpacks: _mm256_unpacklo/hi_epi16 (with _mm256_setzero_si256()). That's as cheap as vpmovzxwd (_mm256_cvtepu16_epi32), and is actually better because you can do 256-bit loads of your source data and unpack two ways, instead of narrow loads to feed vpmovzx... which only works on data at the bottom of an input register. (And memory-source vpmovzx... ymm, [mem] can't micro-fuse the load with a YMM destination, only for the 128-bit XMM version, on Intel CPUs, so the front-end cost is the same as separate load and shuffle instructions.)
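A rough sketch of that widen/re-pack round trip (variable names are mine; it assumes the processed values still fit in an unsigned 16-bit range, so the final vpackusdw restores them exactly):

__m256i v    = _mm256_loadu_si256(reinterpret_cast<const __m256i *>(src)); // 16 x uint16_t from src
__m256i zero = _mm256_setzero_si256();
__m256i lo   = _mm256_unpacklo_epi16(v, zero);   // elements 0-3 and 8-11, zero-extended to 32 bits
__m256i hi   = _mm256_unpackhi_epi16(v, zero);   // elements 4-7 and 12-15, zero-extended to 32 bits
// ... do the 32-bit processing on lo / hi ...
__m256i back = _mm256_packus_epi32(lo, hi);      // in-lane pack puts everything back in the original order, no vpermq needed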
But that trick doesn't work quite as nicely for data you need to sign-extend. vpcmpgtw to get high halves for vpunpckl/hwd does work, but vpermq when re-packing is about as good, just different execution-port pressure. So vpmovsxwd is simpler there.
Slicing up your data into odd/even instead of low/high can also work, e.g. to get 16 bit elements zero-extended into 32-bit elements:
auto veven = _mm256_and_si256(v, _mm256_set1_epi32(0x0000FFFF));
auto vodd = _mm256_srli_epi32(v, 16);
After processing, one can recombine with a shift and vpblendw. (1 uop for port 5 on Intel Skylake / Ice Lake). Or for bytes, vpblendvb with a control vector, but that costs 2 uops on Intel CPUs (but for any port), vs. only 1 uop on Zen2. (Those uop counts aren't including the vpslld ymm, ymm, 16 shift to line up the odd elements back with their starting points.)
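A sketch of that recombination (assuming each 32-bit result still fits in 16 bits):

__m256i odd_hi = _mm256_slli_epi32(vodd, 16);              // vpslld: move the odd results back to the high half of each dword
__m256i merged = _mm256_blend_epi16(veven, odd_hi, 0xAA);  // vpblendw: odd 16-bit positions from odd_hi, even ones from veven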
Even with AVX-512, the situation isn't perfect. You still can't use a single shuffle uop to combine 2 vectors into one of the same width.
There's very nice single-vector narrowing with truncation, or signed or unsigned saturation, for any pair of element sizes like an inverse of vpmovzx / sx. e.g. qword to byte vpmov[su]qb, with an optional memory destination.
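For example (a sketch, with my own variable names), the qword-to-byte case with signed saturation looks like:

__m128i low8 = _mm512_cvtsepi64_epi8(v_qwords);          // vpmovsqb: 8 x int64 -> 8 x int8 in the low 8 bytes
_mm512_mask_cvtsepi64_storeu_epi8(dst, 0xFF, v_qwords);  // memory-destination form, storing under an element mask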
(Fun fact: vpmovdb [rdi]{k1}, zmm0 was the only way Xeon Phi (lacking both AVX-512BW and AVX-512VL) could do byte-masked stores to memory; that might be why these exist in memory-destination form. On mainstream Intel like Skylake-X / Ice Lake, the memory-destination versions are no cheaper than a separate pack into a register and then a store. https://uops.info/)
AVX-512 also has nice 2-input shuffles with a control vector, so for dword-to-word truncation you could use vpermt2w zmm1, zmm2, zmm3. But that needs a shuffle control vector, and vpermt2w is 3 uops on SKX and IceLake. (t2d and t2q are 1 uop). vpermt2b is only available in AVX-512VBMI (Ice Lake), and is also 3 uops there.
(Unlike vpermb which is 1 uop on Ice Lake, and AVX-512BW vpermw which is still 2 uops on Ice Lake. So they didn't reduce the front-end cost of the backwards-compatible instruction, but ICL can run 1 of its 2 uops on port 0 or 1, instead of both on the shuffle unit on port 5. Perhaps ICL has one uop preprocess the shuffle control into a vpermb control or something, which would also explain the improved latency: 3 cycles for data->data, 4 cycles for control->data. vs. 6 cycles on SKX for the 2p5 uops, apparently a serial dependency starting with both the control and data vectors.)

Implementing one hot encoding

I already understand the uses and concept behind one hot encoding with neural networks. My question is just how to implement the concept.
Let's say, for example, I have a neural network that takes in up to 10 letters (not case sensitive) and uses one-hot encoding. Each input will be a 26-dimensional vector of some kind for each spot. In order to code this, do I act as if I have 260 inputs, each holding only a 1 or 0, or is there some other standard way to implement these 26-dimensional vectors?
In your case, you have to distinguish between the various frameworks. I can speak for PyTorch, which is my go-to framework when programming a neural network.
There, one-hot encodings for sequences are generally handled by having your network expect a sequence of indices. Taking your 10 letters as an example, this could be the sequence ["a", "b", "c", ...].
The embedding layer will be initialized with a "dictionary length", i.e. the number of distinct elements (num_embeddings) your network can receive - in your case 26. Additionally, you can specify embedding_dim, i.e. the output dimension of a single character. This is already past the step of one-hot encodings, since you generally only need them to know which value to associate with that item.
Then, you would feed a coded version of the above string to the layer, which could look like this: [0, 1, 2, 3, ...]. Assuming the sequence is of length 10, this will produce an output of shape [10, embedding_dim], i.e. a 2-dimensional Tensor.
To summarize, PyTorch essentially allows you to skip this rather tedious step of encoding it as a one-hot encoding. This is mainly due to the fact that your vocabulary can in some instances be quite large: Consider for example Machine Translation Systems, in which you could have 10,000+ words in your vocabulary. Instead of storing every single word as a 10,000-dimensional vector, using a single index is more convenient.
If that should not completely answer your question (since I am essentially telling you how it is generally preferred): Instead of making a 260-dimensional vector, you would again use a [10,26] Tensor, in which each line represents a different letter.
If you have 10 distinct elements (e.g. a, b, ..., j, or 1, 2, ..., 10) to be represented as one-hot vectors of dimension 26, then your inputs are just 10 vectors, each of which is a 26-dimensional vector. Do this:
y = torch.eye(26) # a 26 x 26 identity matrix: one length-26 one-hot row per 'letter'
y[torch.arange(0,10)] # this gives you 10 one-hot vectors, each of dimension 26
Hope this helps a bit.

Why are vectors so shallow?

What is the rationale behind Scala's vectors having a branching factor of 32, and not some other number? Wouldn't smaller branching factors enable more structural sharing? Clojure seems to use the same branching factor. Is there anything magic about the branching factor 32 that I am missing?
It would help if you explained what a branching factor is:
The branching factor of a tree or a graph is the number of children at each node.
So, the answer appears to be largely here:
http://www.scala-lang.org/docu/files/collections-api/collections_15.html
Vectors are represented as trees with a high branching factor. Every
tree node contains up to 32 elements of the vector or contains up to
32 other tree nodes. Vectors with up to 32 elements can be represented
in a single node. Vectors with up to 32 * 32 = 1024 elements can be
represented with a single indirection. Two hops from the root of the
tree to the final element node are sufficient for vectors with up to
2^15 elements, three hops for vectors with 2^20, four hops for vectors
with 2^25 elements and five hops for vectors with up to 2^30 elements.
So for all vectors of reasonable size, an element selection involves
up to 5 primitive array selections. This is what we meant when we
wrote that element access is "effectively constant time".
So, basically, they had to make a design decision as to how many children to have at each node. As they explained, 32 seemed reasonable, but, if you find that it is too restrictive for you, then you could always write your own class.
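As a rough illustration of the indexing arithmetic behind "up to 5 primitive array selections" (a sketch, not the actual Vector source): with a branching factor of 32, each level of the tree consumes 5 bits of the index.

int slot(int index, int level)           // level 0 = the leaf array
{
    return (index >> (5 * level)) & 31;  // which of the 32 children to follow at this level
}
// A vector of one million elements has depth 4 (32^4 = 1,048,576), so element i is
// reached by following slot(i, 3), slot(i, 2), slot(i, 1) and reading slot(i, 0).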
For more information on why it may have been 32, you can look at this paper, as in the introduction they make the same statement as above, about it being nearly constant time, but this paper deals with Clojure it seems, more than Scala.
http://infoscience.epfl.ch/record/169879/files/RMTrees.pdf
James Black's answer is correct. Another argument for choosing 32 items might have been that the cache line size of many modern processors is 64 bytes, so two lines can hold 32 ints of 4 bytes each, or 32 pointers on a 32-bit machine, or on a 64-bit JVM with a heap size up to 32 GB thanks to pointer compression.
It's the "effectively constant time" for updates. With that large of a branching factor, you never have to go beyond 5 levels, even for terabyte-scale vectors. Here's a video with Rich talking about that and other aspects of Clojure on Channel 9. http://channel9.msdn.com/Shows/Going+Deep/Expert-to-Expert-Rich-Hickey-and-Brian-Beckman-Inside-Clojure
Just adding a bit to James's answer.
From an algorithm-analysis standpoint, the growth of both functions is logarithmic, so they scale the same way.
But, in practical applications, log_32(N) hops is a much smaller number of hops than, say, log_2(N), sufficiently so that it keeps element access closer to constant time, even for fairly large values of N. For N around a billion, that is 6 hops instead of about 30.
I'm sure they picked exactly 32 (as opposed to a higher number) because of some memory block size, but the main reason is the smaller number of hops compared to smaller branching factors.
I also recommend you watch this presentation on InfoQ, where Daniel Spiewak discusses Vectors starting about 30 minutes in: http://www.infoq.com/presentations/Functional-Data-Structures-in-Scala

Efficient encoding of integers with constant digit sum

How can a large set of integers, all with a known constant digit sum and a constant number of digits, be encoded?
Example of integers in base 10, with digit sum 5, and 3 digits:
014, 041, 104, 113, 122, 131, 140, 203 ....
The most important factor is space, but computing time is not completely unimportant.
The simplest way would be to store the digit sum itself, and leave it at that.
But it's possible that I misunderstand the question.
Edit: Here's my takeaway: You want to encode the set itself; yeah?
Encoding the set itself is as easy as storing the base, the digit sum, and the number of digits, e.g. {10, 5, 3} in the example you give.
Most of the time, however, you'll find that the most compact representation of a number is the number itself, unless it is very large.
Also, because the digit sum is commonly taken to be recursive, and between one and nine inclusive, 203 has the same digit sum as 500, or as 140, or as 950. This means that the set is huge for any combination of numbers, and also that any set (except for certain degenerate cases) uses every available digit in the base it is related to.
So, you know, the most efficient encoding of the numbers themselves, when stored singly, is the number itself, especially considering that every number between -2 147 483 648 and 2 147 483 647 generally takes the same amount of space in memory, and often in storage.
When you have as clearly defined a set of possible values to encode as this, the straightforward coding-theoretic approach is to sequentially number all possible values, then store this number in as many bits as necessary. This is quite clearly optimal if the frequencies of the individual values are identical or unknown. If you know something about the frequency distribution, you'll instead have to use something like a Huffman code to get a truly optimal result, but that's rather complicated and I'll handle only the other case.
For the uniformly distributed (or unknown) case the approach is as follows:
Imagine (you can pre-generate and store it, or generate it on the fly) a lexicographically sorted list of all your input (for the encoding) values. E.g. in your case the list would be (unless your digit sum is recursive): 005, 014, 023, 032, 041, 050, 104, 113, 122, 131, 140, 203, 212, 221, 230, 302, 311, 320, 401, 410, 500.
Then assign each item in the list an integer based on its position in the list: 005 becomes 0, 014 becomes 1, 023 becomes 2 and so on. There are 21 values in the list, for which you need 5 bits to encode any index into the list. This index is your encoded value, and encoding and decoding become obvious.
As for an algorithm to generate the list in the first place: The simplest way is to count from 000 to 999 and throw away everything that doesn't match your criterion. It is possible to be more clever about that by replicating counting and overflow (e.g. how 104 follows 050) but it's probably not worth the effort.
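Here's a small sketch of that scheme (base 10, digit sum 5, 3 digits, as in the question; the names are mine):

#include <algorithm>
#include <cstdio>
#include <vector>

int main()
{
    std::vector<int> table;                              // all 3-digit base-10 values with digit sum 5, ascending
    for (int n = 0; n <= 999; ++n)
        if (n / 100 + (n / 10) % 10 + n % 10 == 5)
            table.push_back(n);                          // 5, 14, 23, ..., 500

    int value = 203;                                     // encoding: the value's index in the list
    int code  = int(std::lower_bound(table.begin(), table.end(), value) - table.begin());
    std::printf("%zu values; %d -> code %d; decodes back to %d\n",
                table.size(), value, code, table[code]); // 21 values; 203 -> code 11; decodes back to 203
}

Decoding is just the reverse lookup table[code], and the 21 entries fit in a 5-bit code.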

How do BigNums implementations work?

I wanted to know how BigInteger and other such things are implemented. I tried to check out the Java source code, but it was all Greek and Latin to me.
Can you please explain the algorithm to me in words - no code - so that I understand what I am actually using when I use something from the Java API?
Conceptually, the same way you do arbitrary-size arithmetic by hand. You have something like an array of values, and algorithms for the various operations that work on the array.
Say you want to add 100 to 901. You start with the two numbers as arrays:
[0, 1, 0, 0]
[0, 9, 0, 1]
When you add, your addition algorithm starts from the right, takes 0+1, giving 1, 0+0, giving 0, and -- now the tricky part -- 9+1 gives 10, but now we need to carry, so we add 1 to the next column over, and put (9+1)%10 into the third column.
When your numbers grow big enough -- greater than 9999 in this example -- then you have to allocate more space somehow.
This is, of course, somewhat simpler if you store the digits in reverse order (least significant first).
Real implementations use full words, so the modulus is really some large power of two, but the concept is the same.
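Here's a minimal sketch of that addition (digits stored least significant first; base 10 for clarity, where a real implementation would use machine words):

#include <cstdint>
#include <vector>

std::vector<uint8_t> add(const std::vector<uint8_t> &a, const std::vector<uint8_t> &b)
{
    std::vector<uint8_t> sum;
    unsigned carry = 0;
    for (std::size_t i = 0; i < a.size() || i < b.size() || carry; ++i) {
        unsigned d = carry;
        if (i < a.size()) d += a[i];
        if (i < b.size()) d += b[i];
        sum.push_back(d % 10);   // the digit for this column
        carry = d / 10;          // the carry into the next column
    }
    return sum;                  // grows by a digit automatically when needed, e.g. 100 + 901 = 1001
}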
There's a very good section on this in Knuth.