_mm256_packs_epi32, except pack sequentially - x86-64

One can use _mm256_packs_epi32. as follows: __m256i e = _mm256_packs_epi32 ( ai, bi);
In the debugger, I see the value of ai: m256i_i32 = {0, 1, 0, 1, 1, 1, 0, 1}. I also see the value of bi: m256i_i32 = {1, 1, 1, 1, 0, 0, 0, 1}. The packing gave me e: m256i_i16 = {0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1}. The packing is interleaved. So we have in e first four numbers in ai, first four numbers in bi, last four numbers in ai, last four numbers in bi in that order.
I am wondering if there is an instruction that just packs ai and bi side by side without the interleaving.
vpermq after packing would work, but I'm wondering if there's a single instruction to achieve this.

No sequential-across-lanes pack until AVX-512, unfortunately. (And even then only for 1 register, or not with saturation.)
The in-lane behaviour of shuffles like vpacksswd and vpalignr is one of the major warts of AVX2 that make the 256-bit versions of those shuffles less useful than their __m128i versions. But on Intel, and Zen2 CPUs, it is often still best to use __m256i vectors with a vpermq at the end, if you need the elements in a specific order. (Or vpermd with a vector constant after 2 levels of packing: How do I efficiently reorder bytes of a __m256i vector (convert int32_t to uint8_t)?)
If your 32-bit elements came from unpacking narrower elements, and you don't care about order of the wider elements, you can widen with in-lane unpacks, which sets you up to pack back into the original order.
This is cheap for zero-extending unpacks: _mm256_unpacklo/hi_epi16 (with _mm256_setzero_si256()). That's as cheap as vpmovzxwd (_mm256_cvtepu16_epi32), and is actually better because you can do 256-bit loads of your source data and unpack two ways, instead of narrow loads to feed vpmovzx... which only works on data at the bottom of an input register. (And memory-source vpmovzx... ymm, [mem] can't micro-fuse the load with a YMM destination, only for the 128-bit XMM version, on Intel CPUs, so the front-end cost is the same as separate load and shuffle instructions.)
But that trick doesn't work work quite as nicely for data you need to sign-extend. vpcmpgtw to get high halves for vpunpckl/hwd does work, but vpermq when re-packing is about as good, just different execution-port pressure. So vpmovsxwd is simpler there.
Slicing up your data into odd/even instead of low/high can also work, e.g. to get 16 bit elements zero-extended into 32-bit elements:
auto veven = _mm256_and_si256(v, _mm256_set1_epi32(0x0000FFFF));
auto vodd = _mm256_srli_epi32(v, 16);
After processing, one can recombine with a shift and vpblendw. (1 uop for port 5 on Intel Skylake / Ice Lake). Or for bytes, vpblendvb with a control vector, but that costs 2 uops on Intel CPUs (but for any port), vs. only 1 uop on Zen2. (Those uop counts aren't including the vpslld ymm, ymm, 16 shift to line up the odd elements back with their starting points.)
Even with AVX-512, the situation isn't perfect. You still can use a single shuffle uop to combine 2 vectors to one of the same width.
There's very nice single-vector narrowing with truncation, or signed or unsigned saturation, for any pair of element sizes like an inverse of vpmovzx / sx. e.g. qword to byte vpmov[su]qb, with an optional memory destination.
(Fun fact: vpmovdm [rdi]{k1}, zmm0 was the only way Xeon Phi (lacking both AVX-512BW and AVX-512VL) could do byte-masked stores to memory; that might be why these exist in memory-destination form. On mainstream Intel like Skylake-X / Ice Lake, the memory destination versions are no cheaper than separate pack into register then store. https://uops.info/)
AVX-512 also has nice 2-input shuffles with a control vector, so for dword-to-word truncation you could use vpermt2w zmm1, zmm2, zmm3. But that needs a shuffle control vector, and vpermt2w is 3 uops on SKX and IceLake. (t2d and t2q are 1 uop). vpermt2b is only available in AVX-512VBMI (Ice Lake), and is also 3 uops there.
(Unlike vpermb which is 1 uop on Ice Lake, and AVX-512BW vpermw which is still 2 uops on Ice Lake. So they didn't reduce the front-end cost of the backwards-compatible instruction, but ICL can run 1 of its 2 uops on port 0 or 1, instead of both on the shuffle unit on port 5. Perhaps ICL has one uop preprocess the shuffle control into a vpermb control or something, which would also explain the improved latency: 3 cycles for data->data, 4 cycles for control->data. vs. 6 cycles on SKX for the 2p5 uops, apparently a serial dependency starting with both the control and data vectors.)

Related

Is there an efficient way to calculate the smallest N numbers from a set of numbers in hardware (HDL)?

I am trying to calculate the smallest N numbers from a set and I've found software algorithms to do this. I'm wondering if there is an efficient way to do this in hardware (i.e. HDL - in System Verilog or Verilog)? I am specifically trying to calculate the smallest 2 numbers from a set.
I am trying to do this combinationally optimizing with respect to area and speed (for a large set of signals) but I can only think of comparator trees to do this? Is there a more efficient way of doing this?
Thank you, any help is appreciated~
I don't think you can work around using comparator trees if you want to find the two smallest elements combinationally. However, if your goal isn't low latency than a (possibly pipelined) sequential circuit could also be an option.
One approach that I can come up with on the spot would be to break down the operation doing kind of an incomplete bubble sort in hardware using small sorting networks. Depending on the amount of area you are willing to spend you can use a smaller or larger p-sorting network that combinationaly sorts p elements at a time where p >= 3. You can then apply this network on your input set of size N, sorting p elements at a time. The two smallest elements in each iteration are stored in some sort of memory (e.g. an SRAM memory, if you want to process larger amounts of elements).
Here is an example for p=3 (the brackets indicate the grouping of elements the p-sorter is applied to):
(4 0 9) (8 6 7) (4 2 1) --> (0 4 9) (6 7 8) (1 2 4) --> 0 4 6 7 1 2
Now you start the next round:
You apply the p-sorter on the results of the first round.
Again you store the two smallest outputs of your p-sorter into the same memory overwriting values from the previous round.
Here the continuation of the example:
(0 4 6) (7 1 2) --> (0 4 6) (1 2 7) --> 0 4 1 2
In each round you can reduce the number of elements to look at by a factor of 2/p. E.g. with p==4 you discard half the elements in each round until the smallest two elements are stored at the first two memory locations. So the algorithm has time/cycle complexity of O(n log(n)). For an actual hardware implementation, you probably want to stick to powers of two for the size p of the sorting network.
Although the control logic of such a circuit is not trivial to implement the area should be mainly dominated by the size of your sorting network and the memory you need to hold the first 2/p*N intermediate results (assuming your input signals are not already stored in a memory that you can reuse for that purpose). If you want to tune your circuit towards throughput you can increase p and pipeline the sorting network at the expense of additional area. Additional speedup could be gained by replacing the single memory using up to p two-port memories (1 read and 1 write port each) which would allow you to fetch and write back the data for the sorting network in a single cycle thus increasing the utilization ratio of the comparators in the sorting network.

How does encoding as a bit string in Genetic algorithm helpful?

Suppose we have to solve a global optimization problem in which we have to find values of 5 variables, all of which are integers. Assume we get following two parent chromosomes,
Parent 1: 6, 10, 3, 5, 12
Parent 2: 12, 10, 3, 8, 11
If we do cross-over after first 2 elements, we get following
Child 1: 6, 10, 3, 8, 11
Child 2: 12, 10, 3, 5, 12
Here we can clearly see the children are related to parents.
But when we encode as bit strings, then each chromosome is encoded as a single string of bits and we can, at random, choose any point for crossover. I do not see how this is any more beneficial than completely randomly picking any trial solutions.
I have a similar question with mutation. We randomly flip a bit. If the bit flipped has a small place value, then the change will be small. But if it has a big place value, the change will be big. How is it better than completely randomly changing a chromosome?
Binary encoding is still common mainly because first works about GA used that encoding.
Furthermore it's often space efficient: [6, 10, 3, 5, 12] represented as a sequence of integers would probably require 5 * 32 bits; for a bit string representation 5 * 4 bits are enough (assuming numbers in the [0;15] range).
Under this aspect the knapsack problem is the best case for the bit string representation (each bit says if the corresponding object is in the knapsack).
we can, at random, choose any point for crossover. I do not see how this is any more beneficial than completely randomly picking any trial solutions
In general choosing a crossover point in the middle of a digit introduces an essentially arbitrary mutation of that digit with a destructive (negative) effect.
There is a nice example about this effect in the Local Search Algorithms and Optimization Problems - Genetic Algorithms section of Artificial Intelligence - A modern approach (Russel, Norvig).
Take also a look at a similar question on Software Engineering.

32-1024 bit fixed point vector arithmetic with AVX-2

For a mandelbrot generator I want to used fixed point arithmetic going from 32 up to maybe 1024 bit as you zoom in.
Now normaly SSE or AVX is no help there due to the lack of add with carry and doing normal integer arithmetic is faster. But in my case I have literally millions of pixels that all need to be computed. So I have a huge vector of values that all need to go through the same iterative formula over and over a million times too.
So I'm not looking at doing a fixed point add/sub/mul on single values but doing it on huge vectors. My hope is that for such vector operations AVX/AVX2 can still be utilized to improve the performance despite the lack of native add with carry.
Anyone know of a library for fixed point arithmetic on vectors or some example code how to do emulate add with carry on AVX/AVX2.
FP extended precision gives more bits per clock cycle (because double FMA throughput is 2/clock vs. 32x32=>64-bit at 1 or 2/clock on Intel CPUs); consider using the same tricks that Prime95 uses with FMA for integer math. With care it's possible to use FPU hardware for bit-exact integer work.
For your actual question: since you want to do the same thing to multiple pixels in parallel, probably you want to do carries between corresponding elements in separate vectors, so one __m256i holds 64-bit chunks of 4 separate bigintegers, not 4 chunks of the same integer.
Register pressure is a problem for very wide integers with this strategy. Perhaps you can usefully branch on there being no carry propagation past the 4th or 6th vector of chunks, or something, by using vpmovmskb on the compare result to generate the carry-out after each add. An unsigned add has carry out of a+b < a (unsigned compare)
But AVX2 only has signed integer compares (for greater-than), not unsigned. And with carry-in, (a+b+c_in) == a is possible with b=carry_in=0 or with b=0xFFF... and carry_in=1 so generating carry-out is not simple.
To solve both those problems, consider using chunks with manual wrapping to 60-bit or 62-bit or something, so they're guaranteed to be signed-positive and so carry-out from addition appears in the high bits of the full 64-bit element. (Where you can vpsrlq ymm, 62 to extract it for addition into the vector of next higher chunks.)
Maybe even 63-bit chunks would work here so carry appears in the very top bit, and vmovmskpd can check if any element produced a carry. Otherwise vptest can do that with the right mask.
This is a handy-wavy kind of brainstorm answer; I don't have any plans to expand it into a detailed answer. If anyone wants to write actual code based on this, please post your own answer so we can upvote that (if it turns out to be a useful idea at all).
Just for kicks, without claiming that this will be actually useful, you can extract the carry bit of an addition by just looking at the upper bits of the input and output values.
unsigned result = a + b + last_carry; // add a, b and (optionally last carry)
unsigned carry = (a & b) // carry if both a AND b have the upper bit set
| // OR
((a ^ b) // upper bits of a and b are different AND
& ~r); // AND upper bit of the result is not set
carry >>= sizeof(unsigned)*8 - 1; // shift the upper bit to the lower bit
With SSE2/AVX2 this could be implemented with two additions, 4 logic operations and one shift, but works for arbitrary (supported) integer sizes (uint8, uint16, uint32, uint64). With AVX2 you'd need 7uops to get 4 64bit additions with carry-in and carry-out.
Especially since multiplying 64x64-->128 is not possible either (but would require 4 32x32-->64 products -- and some additions or 3 32x32-->64 products and even more additions, as well as special case handling), you will likely not be more efficient than with mul and adc (maybe unless register pressure is your bottleneck).As
As Peter and Mystical suggested, working with smaller limbs (still stored in 64 bits) can be beneficial. On the one hand, with some trickery, you can use FMA for 52x52-->104 products. And also, you can actually add up to 2^k-1 numbers of 64-k bits before you need to carry the upper bits of the previous limbs.

Why are vectors so shallow?

What is the rationale behind Scala's vectors having a branching factor of 32, and not some other number? Wouldn't smaller branching factors enable more structural sharing? Clojure seems to use the same branching factor. Is there anything magic about the branching factor 32 that I am missing?
It would help if you explained what a branching factor is:
The branching factor of a tree or a graph is the number of children at each node.
So, the answer appears to be largely here:
http://www.scala-lang.org/docu/files/collections-api/collections_15.html
Vectors are represented as trees with a high branching factor. Every
tree node contains up to 32 elements of the vector or contains up to
32 other tree nodes. Vectors with up to 32 elements can be represented
in a single node. Vectors with up to 32 * 32 = 1024 elements can be
represented with a single indirection. Two hops from the root of the
tree to the final element node are sufficient for vectors with up to
215 elements, three hops for vectors with 220, four hops for vectors
with 225 elements and five hops for vectors with up to 230 elements.
So for all vectors of reasonable size, an element selection involves
up to 5 primitive array selections. This is what we meant when we
wrote that element access is "effectively constant time".
So, basically, they had to make a design decision as to how many children to have at each node. As they explained, 32 seemed reasonable, but, if you find that it is too restrictive for you, then you could always write your own class.
For more information on why it may have been 32, you can look at this paper, as in the introduction they make the same statement as above, about it being nearly constant time, but this paper deals with Clojure it seems, more than Scala.
http://infoscience.epfl.ch/record/169879/files/RMTrees.pdf
James Black's answer is correct. Another argument for choosing 32 items might have been that the cache line size in many modern processors is 64 bytes, so two lines can hold 32 ints with 4 bytes each or 32 pointers on a 32bit machine or a 64bit JVM with a heap size up to 32GB due to pointer compression.
It's the "effectively constant time" for updates. With that large of a branching factor, you never have to go beyond 5 levels, even for terabyte-scale vectors. Here's a video with Rich talking about that and other aspects of Clojure on Channel 9. http://channel9.msdn.com/Shows/Going+Deep/Expert-to-Expert-Rich-Hickey-and-Brian-Beckman-Inside-Clojure
Just adding a bit to James's answer.
From an algorithm analysis standpoint, because the growth of the two functions is logarithmic, so they scale the same way.
But, in practical applications, having
hops is a much smaller number of hops than, say, base 2, sufficiently so that it keeps it closer to constant time, even for fairly large values of N.
I'm sure they picked 32 exactly (as opposed to a higher number) because of some memory block size, but the main reason is the fewer number of hops, compared to smaller sizes.
I also recommend you watch this presentation on InfoQ, where Daniel Spiewak discusses Vectors starting about 30 minutes in: http://www.infoq.com/presentations/Functional-Data-Structures-in-Scala

Help designing a hash function to detect duplicate records?

Let me explain my program thus far. It is a rubiks cube solver. I am given a scrambled cube (this is the initial state). This becomes the root node of a graph. I am using iterative deepening depth first search to "brute force" this scrambled cube to a recognizable state which I can then use pattern recognition to solve.
As you can imagine, this is a very large graph, so I would like to come up with some sort of hashing functionality to detect duplicate nodes in this graph (thus speeding up the traversal).
I am largely unfamiliar with hashing functions, but here is what I am thinking... Each node is essentially a different state of the rubik's cube. So if I come to a cube state (node) that has already be seen, I want to skip over it. So I need a hashing function that takes me from the state variable to a checksum, where the state variable is a 54-character string. The only allowed characters are y, r, g, o, b, w (which correspond to colors).
Any help designing this hash function would be greatly appreciated.
For the fastest duplicate detection and removal - avoid generating many of the repeated positions in the first place. This is easy to do and quicker than generating and then finding the repeats. So for example if you have moves like F and B, if you allow the sub sequence FB don't also allow BF, which gives the same result. If you've just done 3F, don't follow it with F. You can generate a small look-up table for allowed next moves, given the last three moves.
For the remaining duplicates you want a fast hash because there are a lot of positions. To make your hash go fast, as others have commented, you want what it hashes from, the representation of the position, to be small. There are 12 edge cubies and there are 8 corner cubies. Representing each cubies position and orientation need take only five bits per cubie, i.e. 100 bits (12.5 bytes) total. For edges its four bits for position and one for flip. For corners its three bits for position and 2 for spin. You can ignore the last edge cubie since its position and flip is fixed by the others. With this representation you are already down to 12 bytes for the position.
You have about 70 real bits of information in a rubik cube position, and 96 bits is close enough to 70 to make it actually counter productive hashing those bits further. I.e. treat this representation of the board as your hash. That may sound a bit strange, but from your question I'm envisaging you at the same time experimenting with a less compact representation of the cube that is more amenable to your pattern matching. In that case the 12 byte value can be treated as if it were a hash, with the advantage that it's a hash that never has a collision. That makes the duplicate testing code and new value insertion shorter and simpler and faster. It's going to be cheaper than the MD5 solutions suggested so far.
There are many other tricks you could use to cut down the work in searching for repeated positions. Have a look at http://cube20.org/ for ideas.
You can always try a cryptographic hash function. Since your problem is not a question of security (there is no attacker purposely trying to find distinct states which hash to the same value), you can use a broken hash function. I recommend trying MD4, which is quite fast. Your 54-character string is quite appropriate for MD4 input (MD4 can process inputs up to 55 bytes as a single block).
A basic 2.4 GHz PC can hash about 12 millions such strings per second, using a single core, with a simple unrolled C implementation (e.g. one which would look like the MD4Transform() function in the sample code included in RFC 1320). This may be enough for your needs.
1) Don't Use A Hash
You have 9*6 = 54 separate faces on a rubik cube. Even wastefully using 1 byte per face this is 432 bits, so hashing won't save you too much space. A better packing of 3 bits per face comes to 162 bits (21 bytes). It sounds to me like you need a compact way to represent the rubik.
OTOH, if you are looking to store a set of many many previously-visited states then I've found that using a bloom filter instead of a true set gets me decent results (but often non-optimal) with much lower space utilization.
2) If you are married to the idea of a hash:
Just use MD5, its slightly more compact than the proposed rubik states, rather fast, and has good collision properties - it's not like you have a malicious adversary trying to cause rubik cube hash collisions ;-).
EDIT: Using cryptographic hash functions, such as MD4/MD5, is usually simple once you have a library or function implementing the algorithm (ex: OpenSSL, GNU TLS, and many stand-alone implementations exist). Usually the function is something like void md5(unsigned char *buf, size_t len, unsigned char *digest) where digest points to a pre-allocated 16 byte buffer and buf is the data to be hashed (your rubik cube structure). Here is some untested C code:
#include <openssl/md5.h>
void main()
{
unsigned char digest[16];
unsigned char buf[BUFLEN];
initializeBuffer(buf);
MD5(buf,BUFLEN,digest); // This is the openssl function
printDigest(digest);
}
And be sure to compile/link with -lssl.
8 corner cubes:
You can assign each of these corners to 8 positions which each require 3 bits to determine which corner cube is at which position for a total of 24 bits.
You can further reduce this to just recording 7-of-8 positions as you can easily use a process of elimination to determine what the 8th corner is (for 21 bits).
However, this can be reduced further as the 8 corners can only be arranged in 8! = 40320 permutations and 40320 can be represented in 16 bits.
Each corner cube can be orientated correctly or be rotated 120° clockwise or anti-clockwise to be in three different positions (represented as 0, 1 and 2 respectively).
This requires 2 bits per corner to represent.
However, the sum of the orientations (modulo 3) is always 0; so, if you know 7-of-8 orientations then (assuming you have a solvable cube) you can calculate the orientation of the 8th corner (giving a total of 14 bits).
Or for a further reduction, seven ternary (base 3) digits can represent the orientation of the corners and this can be represented in 12 binary digits (bits).
So the corners cubes can be represented in 28 bits, if you want to decode the permutations, or in 33 bits, if you want to directly record the positions of 7-of-8 corners.
12 edge cubes:
Each can be represented in 4 bits (for a total of 48 bits) which can be reduced to 44 bits by only recording the position of 11-of-12 edges (for a total of 44 bits).
However, the 12! = 479001600 permutations of the edges can be stored in 29 bits.
Each edge can be either be oriented correctly or flipped:
This requires 1 bit to represent.
However, edges are always flipped in pairs so the parity of the flipped edges will always be zero (again, meaning that you only need to record 11-of-12 orientations for the edges) giving a total of 11 bits required.
So edge cubes can be represented in 40 bits, if you want to decode the permutations, or in 55 bits if you want to record all the positions and flips of 11-of-12 edges.
6 centre cubes
You do not need to record any information about the centre cubes - they are fixed relative to the ball at the centre of the Rubik's cube (so assuming you are not worried about the orientation of any logos on the cube) are immobile.
Total:
Using permutations: 68 bits
Using positions: 88 bits
Just to establish the theoretical minimum representation - the state space of a valid Rubik's cube is about 4.3*10^19. Log2(4.3*10^19) will then determine how many bits you need to represent that full space, the ceiling of which is 66. So in theory, if you could number every valid state, any given state could be uniquely represented in 66 bits.
While you may want to follow others' advice and find a more compact way of representing the cube, consider representing the state in terms of edge, corner, and face pieces. Due to the swapping laws of legal cube moves, you should be able to concatenate a sequence of 12 4-bit edge locations, 8 3-bit corner locations, and 6 3-bit face locations. This should result in a unique representation using 90 bits.
This representation may not be conducive to the way you are creating your tree, but it is unique, easily comparable, and should be possible to find given a state in your existing representation.