Construct a 64 bit mask register from four 16 bit ones - x86-64

What is the best way to end up with a __mmask64 from four __mmask16? I just want to concatenate them. Can't seem to find a solution on the internet.

AVX-512 has hardware instructions for concatenating two mask registers; here, two kunpckwd instructions followed by one kunpckdq do the trick.
(Each instruction is 4-cycle latency, port 5 only, on SKX and Ice Lake: https://uops.info. At least the two independent kunpckwd in the first step can mostly overlap, starting one cycle apart, limited by competition for port 5. And the four masks won't all be ready at once anyway: if the compiler schedules the mask-generating instructions sensibly, one pair will be ready first so the first kunpckwd can get started early.)
// compiles nicely with GCC/clang/ICC. Current MSVC has major pessimizations
#include <immintrin.h>

inline
__mmask64 set_mask64_kunpck(__mmask16 m0, __mmask16 m1, __mmask16 m2, __mmask16 m3)
{
    __mmask32 md0 = _mm512_kunpackw(m1, m0);   // hi, lo
    __mmask32 md1 = _mm512_kunpackw(m3, m2);
    __mmask64 mq  = _mm512_kunpackd(md1, md0);
    return mq;
}
That's your best bet if your __mmask16 values are actually in k registers, where a compiler will have them if they're the result of AVX-512 compare/test intrinsics like _mm512_cmple_epu32_mask. If they're coming from an array you generated earlier, it might be better to combine them with plain scalar code (see Paul's answer) instead of slowly getting them into mask registers with kmov. kmov k, mem is 3 uops for the front-end, with a scalar integer load and a kmov k, reg uop in the back-end, plus an extra front-end uop for no apparent reason.
__mmask16 is just a typedef for unsigned short (in gcc/clang/ICC/MSVC) so you can simply manipulate it like an integer, and compilers will use kmov as necessary. (This can lead to pretty inefficient code if you're not careful, and unfortunately current compilers aren't smart enough to compile a shift/OR function into using kunpckwd.)
There are intrinsics like unsigned int _cvtmask16_u32 (__mmask16 a) but they're optional for current compilers that implement __mmask16 as unsigned short.
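For example, a purely scalar combine (a sketch of what the set_mask64_scalar referenced in the test function below could look like; it uses the explicit conversion intrinsic, but plain integer conversions compile the same way):
// Sketch only: the shift/OR route through scalar integers.
inline
__mmask64 set_mask64_scalar(__mmask16 m0, __mmask16 m1, __mmask16 m2, __mmask16 m3)
{
    return  (__mmask64)_cvtmask16_u32(m0)
         | ((__mmask64)_cvtmask16_u32(m1) << 16)
         | ((__mmask64)_cvtmask16_u32(m2) << 32)
         | ((__mmask64)_cvtmask16_u32(m3) << 48);
}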
To look at compiler output for a case where __mmask16 values start out in k registers, it's necessary to write a test function that uses intrinsics to create the mask values. (Or use inline asm constraints.) The standard x86-64 calling conventions handle __mmask16 as a scalar integer, so as a function arg it's already in an integer register, not a k register.
__mmask64 test(__m256i v0, __m256i v1, __m256i v2, __m256i v3)
{
    __mmask16 m0 = _mm256_movepi16_mask(v0);  // clang can optimize _mm_movepi8_mask into pmovmskb eax, xmm, avoiding k regs
    __mmask16 m1 = _mm256_movepi16_mask(v1);
    __mmask16 m2 = _mm256_movepi16_mask(v2);
    __mmask16 m3 = _mm256_movepi16_mask(v3);

    //return set_mask64_mmx(m0,m1,m2,m3);
    //return set_mask64_scalar(m0,m1,m2,m3);
    return set_mask64_kunpck(m0,m1,m2,m3);
}
With GCC and clang, that compiles to (Godbolt):
# gcc 11.1 -O3 -march=skylake-avx512
test(long long __vector(4), long long __vector(4), long long __vector(4), long long __vector(4)):
vpmovw2m k3, ymm0
vpmovw2m k1, ymm1
vpmovw2m k2, ymm2
vpmovw2m k0, ymm3 # create masks
kunpckwd k1, k1, k3
kunpckwd k0, k0, k2
kunpckdq k4, k0, k1 # combine masks
kmovq rax, k4 # use mask, in this case by returning as integer
ret
I could have used the final mask result for a blend intrinsic between two of the inputs, for example, but the compiler didn't try to avoid kunpck by doing 4x kmov (also only 1 port).
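For instance, a consumer of the combined mask could be a byte blend (a sketch only, not from the original answer; it assumes AVX-512BW and 512-bit inputs, since a 64-bit mask covers 64 byte elements):
// Sketch: use the combined 64-bit mask to blend bytes; where a mask bit
// is 1 the byte comes from b, otherwise from a (AVX-512BW).
__m512i blend_with_mask64(__m512i a, __m512i b,
                          __mmask16 m0, __mmask16 m1, __mmask16 m2, __mmask16 m3)
{
    __mmask64 k = set_mask64_kunpck(m0, m1, m2, m3);
    return _mm512_mask_blend_epi8(k, a, b);
}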
MSVC 19.29 -O2 -Gv -arch:AVX512 does a rather poor job, extracting each mask to a scalar integer register between intrinsics, like:
MSVC 19.29
kmovw ax, k1
movzx edx, ax
...
kmovd k3, edx
This is supremely dumb, not even using kmovw eax, k1 to zero-extend into a 32-bit register, not to mention not realizing that the next kunpck only cares about the low part of its input anyway, so there was no need to kmov the data to/from an integer register at all. Later, it even emits the following, apparently not realizing that kmovd writing a 32-bit register already zero-extends into the 64-bit register. (To be fair, GCC has some dumb missed optimizations like that around its __builtin_popcount builtin.)
; MSVC 19.29
kmovd ecx, k2
mov ecx, ecx
kmovq k1, rcx
The kunpck intrinsics do have strange prototypes, with inputs as wide as their outputs, e.g.
__mmask32 _mm512_kunpackw (__mmask32 a, __mmask32 b)
So perhaps this is tricking MSVC into manually doing the uint16_t -> uint32_t conversion by going to scalar and back, since it apparently doesn't know that vpmovw2m k3, ymm0 already zero-extends into the full k3.

You can just treat __mmask16 and __mmask64 like 16 bit and 64 bit ints, e.g.
__mmask64 set_mask64(__mmask16 m0, __mmask16 m1, __mmask16 m2, __mmask16 m3)
{
    return (((__mmask64)m0) <<  0)
         | (((__mmask64)m1) << 16)
         | (((__mmask64)m2) << 32)
         | (((__mmask64)m3) << 48);
}
or perhaps:
__mmask64 set_mask64(__mmask16 m0, __mmask16 m1, __mmask16 m2, __mmask16 m3)
{
    // note: _mm_set_pi16 takes its arguments highest element first
    return (__mmask64)_mm_set_pi16(m3, m2, m1, m0);
}
Both of the above use scalar/SSE code. Using AVX512 mask intrinsics will be more efficient (see Peter's answer for better solutions).

Related

In DPI-C, How to map data type to reg or wire

I am writing a CRC16 function in C to use in SystemVerilog.
Requirements are as below:
Output of CRC16 is 16 bits
Input of CRC16 is bigger than 72 bits
The difficulty is that I don't know whether DPI-C can map reg/wire data types in SystemVerilog to C or not, and
what the maximum width of a reg/wire supported through DPI-C is.
Can anybody help me?
Stay with compatible types across the language boundary. For the output, use shortint. For the input, use an array of byte in SystemVerilog, which maps to an array of char in C.
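As a rough sketch of that suggestion (not from the answer: the polynomial, init value and names are assumptions, and the exact C-side prototype depends on whether the SystemVerilog side passes a fixed-size array or an open array, which would arrive as an svOpenArrayHandle instead of a plain pointer):
// Sketch: CRC-16/CCITT-FALSE computed on the C side over a byte buffer.
// A matching SystemVerilog import could look roughly like:
//   import "DPI-C" function shortint unsigned crc16(input byte unsigned data[], input int len);
#include <stdint.h>

uint16_t crc16(const uint8_t *data, int len)
{
    uint16_t crc = 0xFFFF;                       // assumed init value
    for (int i = 0; i < len; i++) {
        crc ^= (uint16_t)data[i] << 8;           // feed next byte, MSB first
        for (int b = 0; b < 8; b++)
            crc = (crc & 0x8000) ? (uint16_t)((crc << 1) ^ 0x1021)
                                 : (uint16_t)(crc << 1);
    }
    return crc;
}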
DPI support has provision for any bit width, converting packed arrays into C arrays. The question is: what are you going to do with 72-bit data on the C side?
In any case, svBitVecVal for two-state bits and svLogicVecVal for four-state logic can be used on the C side to retrieve values. Look at H.7.6/7 of the LRM for more info.
Here is an example from LRM H.10.2 for 4-state data (logic):
SystemVerilog:
typedef struct {int x; int y;} pair;
import "DPI-C" function void f1(input int i1, pair i2, output logic [63:0] o3);
C:
void f1(const int i1, const pair *i2, svLogicVecVal* o3)
{
    int tab[8];
    printf("%d\n", i1);
    o3[0].aval = i2->x;
    o3[0].bval = 0;
    o3[1].aval = i2->y;
    o3[1].bval = 0;
    ...
}

Bitwise operations on different sized data types

I am trying to concatenate two Bytes, defined as p1 and p2, to form an Int pnum, and this is the implementation I came up with:
pnum = p1 << 8
pnum |= p2
But a test case fails: when one of the numbers is negative (i.e. has its MSB set), it is sign-extended on conversion to an Int, so the MSB moves to the 32nd bit instead of the 8th, producing an incorrect value.
What is the correct way to concatenate these two bytes, or to preserve the MSB of the byte during the conversion?
You could mask it before adding the p2 bits (also masked).
val pnum = p1 << 8 & 0xFF00 | p2 & 0xFF
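The same idea expressed in C, as an illustrative sketch (not from the answer): without the masks, the sign extension of a negative byte during integer promotion sets all the upper bits.
#include <stdint.h>

// Concatenate two signed bytes into the low 16 bits of an int.
// The masks strip the bits introduced by sign extension.
int concat_bytes(int8_t p1, int8_t p2)
{
    return ((p1 << 8) & 0xFF00) | (p2 & 0xFF);
}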

Hash function for 8 / 16 bit "graphics" on 8 bit processor

For an implementation of coherent noise (similar to Perlin noise), I'm looking for a hash function suitable for graphics.
I don't need it to be in any way cryptographic, and really, I don't even need it to be a super brilliant hash.
I just want to combine two 16 bit numbers and output an 8 bit hash. As random as possible is good, but also, fast on an AVR processor (8 bit, as used by Arduino) is good.
Currently I'm using an implementation here:
#include <stdint.h>

uint32_t hash(uint32_t a)
{
    a -= (a << 6);
    a ^= (a >> 17);
    a -= (a << 9);
    a ^= (a << 4);
    a -= (a << 3);
    a ^= (a << 10);
    a ^= (a >> 15);
    return a;
}
But given that I'm truncating all but 8 bits, and I don't need anything spectacular, can I get away with something using fewer instructions?
… I'm inspired in this search by the lib8tion library that's packaged with FastLED. It has specific functions to, for example, multiply two uint8_t numbers to give a uint16_t number in the fewest possible clock cycles.
Check out Pearson hashing:
unsigned char hash(unsigned short a, unsigned short b) {
    static const unsigned char t[256] = {...};
    return t[t[t[t[a & 0xFF] ^ (b & 0xFF)] ^ (a >> 8)] ^ (b >> 8)];
}
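The table contents are elided above; any fixed permutation of the values 0..255 works. As a sketch (the shuffle, seed and helper name are arbitrary assumptions, not part of the answer), such a permutation could be generated like this and then baked into the static const initializer:
#include <stdint.h>

static uint8_t t[256];

// Fill t[] with an arbitrary permutation of 0..255 via a Fisher-Yates shuffle
// driven by a small LCG; any fixed permutation is fine for Pearson hashing.
void init_pearson_table(uint32_t seed)
{
    for (int i = 0; i < 256; i++)
        t[i] = (uint8_t)i;
    for (int i = 255; i > 0; i--) {
        seed = seed * 1664525u + 1013904223u;
        int j = (int)(seed % (uint32_t)(i + 1));
        uint8_t tmp = t[i];
        t[i] = t[j];
        t[j] = tmp;
    }
}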

How to customize the output of the Postgres Pseudo Encrypt function?

I would like to use the pseudo_encrypt function mentioned a few times on StackOverflow to make my IDs look more random: https://wiki.postgresql.org/wiki/Pseudo_encrypt
How can I customize this to output unique "random" numbers specific to me? I read somewhere that you can just change the 1366.0 constant, but I don't want to take any risks with my IDs, as any potential ID duplicates would cause major issues.
I really have no idea what each constant actually does, so I don't want to mess around with it unless I get some direction. Does anyone know which constants I can safely change?
Here it is:
CREATE OR REPLACE FUNCTION "pseudo_encrypt"("VALUE" int) RETURNS int IMMUTABLE STRICT AS $function_pseudo_encrypt$
DECLARE
    l1 int;
    l2 int;
    r1 int;
    r2 int;
    i int := 0;
BEGIN
    l1 := ("VALUE" >> 16) & 65535;
    r1 := "VALUE" & 65535;
    WHILE i < 3 LOOP
        l2 := r1;
        r2 := l1 # ((((1366.0 * r1 + 150889) % 714025) / 714025.0) * 32767)::int;
        r1 := l2;
        l1 := r2;
        i := i + 1;
    END LOOP;
    RETURN ((l1::int << 16) + r1);
END;
$function_pseudo_encrypt$ LANGUAGE plpgsql;
For bigints:
CREATE OR REPLACE FUNCTION "pseudo_encrypt"("VALUE" bigint) RETURNS bigint IMMUTABLE STRICT AS $function_pseudo_encrypt$
DECLARE
    l1 bigint;
    l2 bigint;
    r1 bigint;
    r2 bigint;
    i int := 0;
BEGIN
    l1 := ("VALUE" >> 32) & 4294967295::bigint;
    r1 := "VALUE" & 4294967295;
    WHILE i < 3 LOOP
        l2 := r1;
        r2 := l1 # ((((1366.0 * r1 + 150889) % 714025) / 714025.0) * 32767*32767)::bigint;
        r1 := l2;
        l1 := r2;
        i := i + 1;
    END LOOP;
    RETURN ((l1::bigint << 32) + r1);
END;
$function_pseudo_encrypt$ LANGUAGE plpgsql;
Alternative solution: use different ciphers
Other cipher functions are now available on the PostgreSQL wiki. They're going to be significantly slower, but aside from that, they're better candidates for generating customized random-looking series of unique numbers.
For 32-bit outputs, Skip32 in plpgsql will encrypt its input with a 10-byte key, so you just have to choose your own secret key to get your own specific permutation (the particular order in which the 2^32 unique values will come out).
For 64-bit outputs, XTEA in plpgsql will do similarly, but using a 16-byte key.
Otherwise, to just customize pseudo_encrypt, see below:
Explanations about pseudo_encrypt's implementation:
This function has 3 properties:
global unicity of the output values
reversibility
pseudo-random effect
The first and second properties come from the Feistel network, and as already explained in CodesInChaos's answer, they don't depend on the choice of the constants 1366, 150889 and 714025.
Make sure when changing f(r1) that it stays a function in the mathematical sense, that is x=y implies f(x)=f(y), or in other words the same input must always produce the same output. Breaking this would break the unicity.
The purpose of these constants and this formula for f(r1) is to produce a reasonably good pseudo-random effect. Using postgres built-in random() or similar method is not possible because it's not a mathematical function as described above.
Why these arbitrary constants? In this part of the function:
r2 := l1 # ((((1366.0 * r1 + 150889) % 714025) / 714025.0) * 32767)::int;
The formula and the values 1366, 150889 and 714025 come from Numerical Recipes in C (2nd ed., 1992, by William H. Press et al.), chapter 7: Random Numbers, specifically pp. 284-285.
The book is not directly indexable on the web but readable through an interface here: http://apps.nrbook.com/c/index.html. It's also cited as a reference in various source code implementing PRNGs.
Among the algorithms discussed in this chapter, the one used above is very simple and relatively effective. The formula to get a new random number from a previous one (jran) is:
jran = (jran * ia + ic) % im;
ran = (float) jran / (float) im; /* normalize into the 0..1 range */
where jran is the current random integer.
This generator will necessarily loop over itself after a certain number of values (the "period"), so the constants ia, ic and im have to be chosen carefully for that period to be as large as possible. The book provides a table on p. 285 where constants are suggested for various lengths of the period. ia=1366, ic=150889 and im=714025 is one of the entries, for a period of 2^29, which is way more than needed.
Finally, the multiplication by 32767, or 2^15 - 1, is not part of the PRNG but is meant to produce a positive integer that fits in a half-block from the 0..1 pseudo-random float value. Don't change that part, unless to widen the block size of the algorithm.
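To see how the pieces fit together, here is an illustrative translation of the 32-bit function into C (a sketch only: PostgreSQL's ::int cast rounds while the C cast below truncates, so the output is not bit-identical to the SQL function):
#include <stdint.h>

// The Feistel structure of pseudo_encrypt, showing where the
// Numerical Recipes constants ia=1366, ic=150889, im=714025 plug in.
uint32_t pseudo_encrypt32(uint32_t value)
{
    uint32_t l1 = (value >> 16) & 0xFFFF;
    uint32_t r1 = value & 0xFFFF;
    for (int i = 0; i < 3; i++) {
        uint32_t jran = (1366u * r1 + 150889u) % 714025u;               // LCG step
        uint32_t f = (uint32_t)(((double)jran / 714025.0) * 32767.0);   // scale 0..1 to 15 bits
        uint32_t l2 = r1;
        uint32_t r2 = l1 ^ f;   // '#' is XOR in PostgreSQL
        r1 = l2;
        l1 = r2;
    }
    return (l1 << 16) | r1;
}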
This function looks like a blockcipher based on a Feistel network - but it's lacking a key.
The Feistel construction is bijective, i.e. it guarantees that there are no collisions. The interesting part is: r2 := l1 # f(r1). As long as f(r1) only depends on r1 the pseudo_encrypt will be bijective, no matter what the function does.
The lack of a key means that anybody who knows the source code can recover the sequential ID. So you're relying on security-through-obscurity.
The alternative is using a block cipher which takes a key. For 32 bit blocks there are relatively few choices, I know of Skip32 and ipcrypt. For 64 bit blocks there are many ciphers to choose from, including 3DES, Blowfish and XTEA.

How do I make use of multipliers to generate a simple adder?

I'm trying to synthesize an Altera circuit using as few logic elements as possible. Also, embedded multipliers do not count against logic elements, so I should be using them. So far the circuit looks correct in terms of functionality. However, the following module uses a large number of logic elements: 24, and I'm not sure why, since it should be using 8 plus a couple of combinational gates for the case block.
I suspect the adder, but I'm not 100% sure. If my suspicion is correct, however, is it possible to use multipliers as a simple adder?
module alu #(parameter N = 8)
(
    output logic [N-1:0] alu_res,
    input        [N-1:0] a,
    input        [N-1:0] b,
    input        [1:0]   op,
    input                clk
);

    wire [7:0]  dataa, datab;
    wire [15:0] result;

    // instantiate embedded 8-bit signed multiplier
    mult mult8bit (.*);

    // assign multiplier operands
    assign dataa = a;
    assign datab = b;

    always_comb
        unique case (op)
            // LW
            2'b00: alu_res = 8'b0;
            // ADD
            2'b01: alu_res = a + b;
            // MUL
            2'b10: alu_res = result[2*N-2:N-1]; // a is a fraction
            // MOV
            2'b11: alu_res = a;
        endcase

endmodule
Your case statement will generate a 4-input mux with op as the select, which uses a minimum of 2 logic cells per bit. Since you're assigning an 8-bit variable in the case block, you will require 2 logic elements for each bit of the output. Therefore the total is 8*2 logic elements for the large mux plus 8 for the adder, giving you 24.
I'm doing this project too, so I won't give too much away about how to optimise this. However, what I will tell you is that both the muxes and the adder can be implemented using multipliers, 8 at most. With that said, I don't think this architecture is optimal for a multiplier implementation.