AVX2: multiply two vectors of 64-bit integers, discarding the upper half of each result? [duplicate] - x86-64

I want to vectorize the multiplication of two memory aligned arrays.
I didn't find any way to multiply 64*64 bit in AVX/AVX2, so I just did loop-unroll and AVX2 loads/stores. Is there a faster way to do this?
Note: I don't want to save the high-half result of each multiplication.
void multiply_vex(long *Gi_vec, long q, long *Gj_vec){
    int i;
    __m256i data_j, data_i;
    __uint64_t *ptr_J = (__uint64_t*)&data_j;
    __uint64_t *ptr_I = (__uint64_t*)&data_i;

    for (i=0; i<BASE_VEX_STOP; i+=4) {
        data_i = _mm256_load_si256((__m256i*)&Gi_vec[i]);
        data_j = _mm256_load_si256((__m256i*)&Gj_vec[i]);

        ptr_I[0] -= ptr_J[0] * q;
        ptr_I[1] -= ptr_J[1] * q;
        ptr_I[2] -= ptr_J[2] * q;
        ptr_I[3] -= ptr_J[3] * q;

        _mm256_store_si256((__m256i*)&Gi_vec[i], data_i);
    }

    for (; i<BASE_DIMENSION; i++)
        Gi_vec[i] -= Gj_vec[i] * q;
}
UPDATE:
I'm using the Haswell microarchitecture with both the ICC and GCC compilers, so both AVX and AVX2 are fine.
I substituted the -= with the C intrinsic _mm256_sub_epi64 after the multiplication loop-unroll, which gives some speedup. Currently, it is ptr_J[0] *= q; ...
I use __uint64_t, but that is an error. The right data type is __int64_t.

You seem to be assuming long is 64bits in your code, but then using __uint64_t as well. In 32bit, the x32 ABI, and on Windows, long is a 32bit type. Your title mentions long long, but then your code ignores it. I was wondering for a while if your code was assuming that long was 32bit.
You're completely shooting yourself in the foot by using AVX256 loads but then aliasing a pointer onto the __m256i to do scalar operations. gcc just gives up and gives you the terrible code you asked for: vector load and then a bunch of extract and insert instructions. Your way of writing it means that both vectors have to be unpacked to do the sub in scalar as well, instead of using vpsubq.
Modern x86 CPUs have very fast L1 cache that can handle two operations per clock. (Haswell and later: two loads and one store per clock). Doing multiple scalar loads from the same cache line is better than a vector load and unpacking. (Imperfect uop scheduling reduces the throughput to about 84% of that, though: see below)
gcc 5.3 -O3 -march=haswell (Godbolt compiler explorer) auto-vectorizes a simple scalar implementation pretty well. When AVX2 isn't available, gcc foolishly still auto-vectorizes with 128b vectors: On Haswell, this will actually be about 1/2 the speed of ideal scalar 64bit code. (See the perf analysis below, but substitute 2 elements per vector instead of 4).
#include <stdint.h> // why not use this like a normal person?
#define BASE_VEX_STOP 1024
#define BASE_DIMENSION 1028
// restrict lets the compiler know the arrays don't overlap,
// so it doesn't have to generate a scalar fallback case
void multiply_simple(uint64_t *restrict Gi_vec, uint64_t q, const uint64_t *restrict Gj_vec){
for (intptr_t i=0; i<BASE_DIMENSION; i++) // gcc doesn't manage to optimize away the sign-extension from 32bit to pointer-size in the scalar epilogue to handle the last less-than-a-vector elements
Gi_vec[i] -= Gj_vec[i] * q;
}
inner loop:
.L4:
vmovdqu ymm1, YMMWORD PTR [r9+rax] # MEM[base: vectp_Gj_vec.22_86, index: ivtmp.32_76, offset: 0B], MEM[base: vectp_Gj_vec.22_86, index: ivtmp.32_76, offset: 0B]
add rcx, 1 # ivtmp.30,
vpsrlq ymm0, ymm1, 32 # tmp174, MEM[base: vectp_Gj_vec.22_86, index: ivtmp.32_76, offset: 0B],
vpmuludq ymm2, ymm1, ymm3 # tmp173, MEM[base: vectp_Gj_vec.22_86, index: ivtmp.32_76, offset: 0B], vect_cst_.25
vpmuludq ymm0, ymm0, ymm3 # tmp176, tmp174, vect_cst_.25
vpmuludq ymm1, ymm4, ymm1 # tmp177, tmp185, MEM[base: vectp_Gj_vec.22_86, index: ivtmp.32_76, offset: 0B]
vpaddq ymm0, ymm0, ymm1 # tmp176, tmp176, tmp177
vmovdqa ymm1, YMMWORD PTR [r8+rax] # MEM[base: vectp_Gi_vec.19_81, index: ivtmp.32_76, offset: 0B], MEM[base: vectp_Gi_vec.19_81, index: ivtmp.32_76, offset: 0B]
vpsllq ymm0, ymm0, 32 # tmp176, tmp176,
vpaddq ymm0, ymm2, ymm0 # vect__13.24, tmp173, tmp176
vpsubq ymm0, ymm1, ymm0 # vect__14.26, MEM[base: vectp_Gi_vec.19_81, index: ivtmp.32_76, offset: 0B], vect__13.24
vmovdqa YMMWORD PTR [r8+rax], ymm0 # MEM[base: vectp_Gi_vec.19_81, index: ivtmp.32_76, offset: 0B], vect__14.26
add rax, 32 # ivtmp.32,
cmp rcx, r10 # ivtmp.30, bnd.14
jb .L4 #,
Translate that back to intrinsics if you want, but it's going to be a lot easier to just let the compiler autovectorize. I didn't try to analyse it to see if it's optimal.
If you don't usually compile with -O3, you could use #pragma omp simd before the loop (and -fopenmp).
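A minimal sketch of that, using the same defines as above (compile with -fopenmp or -fopenmp-simd):

#include <stdint.h>

#define BASE_DIMENSION 1028

// Minimal sketch: the same scalar loop with an explicit vectorization hint,
// for builds that don't use -O3.
void multiply_simple_omp(uint64_t *restrict Gi_vec, uint64_t q,
                         const uint64_t *restrict Gj_vec)
{
    #pragma omp simd
    for (intptr_t i = 0; i < BASE_DIMENSION; i++)
        Gi_vec[i] -= Gj_vec[i] * q;
}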
Of course, instead of a scalar epilogue, it would prob. be faster to do an unaligned load of the last 32B of Gj_vec, and store into the last 32B of Gi_vec, potentially overlapping with the last store from the loop. (A scalar fallback is still needed if the arrays are smaller than 32B.)
Improved vector intrinsic version for Haswell
From my comments on Z Boson's answer. Based on Agner Fog's vector class library code.
Agner Fog's version saves an instruction but bottlenecks on the shuffle port by using phadd + pshufd, where I use psllq / paddd / pand.
Since one of your operands is constant, make sure to pass set1(q) as b, not a, so the "bswap" shuffle can be hoisted.
// replace hadd -> shuffle (4 uops) with shift/add/and (3 uops)
// The constant takes 2 insns to generate outside a loop.
__m256i mul64_haswell (__m256i a, __m256i b) {
    // instruction does not exist. Split into 32-bit multiplies
    __m256i bswap   = _mm256_shuffle_epi32(b, 0xB1);       // swap H<->L
    __m256i prodlh  = _mm256_mullo_epi32(a, bswap);        // 32 bit L*H products

    // or use pshufb instead of psllq to reduce port0 pressure on Haswell
    __m256i prodlh2 = _mm256_slli_epi64(prodlh, 32);       // a0Lb0H, 0, a1Lb1H, 0
    __m256i prodlh3 = _mm256_add_epi32(prodlh2, prodlh);   // a0Lb0H+a0Hb0L, xxx, a1Lb1H+a1Hb1L, xxx
    __m256i prodlh4 = _mm256_and_si256(prodlh3, _mm256_set1_epi64x(0xFFFFFFFF00000000)); // zero the low halves, keeping the cross-product sums already in the high halves (i.e. shifted left by 32)

    __m256i prodll  = _mm256_mul_epu32(a, b);              // a0Lb0L, a1Lb1L: 64 bit unsigned products
    __m256i prod    = _mm256_add_epi64(prodll, prodlh4);   // a0Lb0L+(a0Lb0H+a0Hb0L)<<32, a1Lb1L+(a1Lb1H+a1Hb1L)<<32
    return prod;
}
See it on Godbolt.
Note that this doesn't include the final subtract, only the multiply.
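Here is a hedged usage sketch of how mul64_haswell() might be dropped into the question's update loop: set1(q) is passed as b so its shuffle can be hoisted, and the overlapping unaligned tail mentioned earlier replaces the scalar epilogue. The function name, parameters, and the n >= 4 assumption are mine, not from the original answer:

#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

__m256i mul64_haswell(__m256i a, __m256i b);   // the helper defined just above

void multiply_sub_avx2(int64_t *Gi_vec, int64_t q, const int64_t *Gj_vec, size_t n)
{
    const __m256i vq = _mm256_set1_epi64x(q);
    const size_t last = n - 4;                 // final, possibly overlapping, 32-byte window

    // Compute the tail from the original Gi values before the loop overwrites
    // any of them, so overlapped elements just get the same result stored twice.
    __m256i tail_gj = _mm256_loadu_si256((const __m256i *)&Gj_vec[last]);
    __m256i tail_gi = _mm256_loadu_si256((const __m256i *)&Gi_vec[last]);
    __m256i tail    = _mm256_sub_epi64(tail_gi, mul64_haswell(tail_gj, vq));

    for (size_t i = 0; i + 4 <= n; i += 4) {
        __m256i gj = _mm256_loadu_si256((const __m256i *)&Gj_vec[i]);
        __m256i gi = _mm256_loadu_si256((const __m256i *)&Gi_vec[i]);
        gi = _mm256_sub_epi64(gi, mul64_haswell(gj, vq));
        _mm256_storeu_si256((__m256i *)&Gi_vec[i], gi);
    }
    _mm256_storeu_si256((__m256i *)&Gi_vec[last], tail);   // may overlap the loop's last store
}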
This version should perform a bit better on Haswell than gcc's autovectorized version. (like maybe one vector per 4 cycles instead of one vector per 5 cycles, bottlenecked on port0 throughput. I didn't consider other bottlenecks for the full problem, since this was a late addition to the answer.)
An AVX1 version (two elements per vector) would suck, and probably still be worse than 64bit scalar. Don't do it unless you already have your data in vectors, and want the result in a vector (extracting to scalar and back might not be worth it).
Perf analysis of GCC's autovectorized code (not the intrinsic version)
Background: see Agner Fog's insn tables and microarch guide, and other links in the x86 tag wiki.
Until AVX512 (see below), this is probably only barely faster than scalar 64bit code: imul r64, m64 has a throughput of one per clock on Intel CPUs (but one per 4 clocks on AMD Bulldozer-family). load/imul/sub-with-memory-dest is 4 fused-domain uops on Intel CPUs (with an addressing mode that can micro-fuse, which gcc fails to use). The pipeline width is 4 fused-domain uops per clock, so even a large unroll can't get this to issue at one-per-clock. With enough unrolling, we'll bottleneck on load/store throughput. 2 loads and one store per clock is possible on Haswell, but store-address uops stealing load ports will lower the throughput to about 81/96 = 84% of that, according to Intel's manual.
So perhaps the best way for Haswell would load and multiply with scalar, (2 uops), then vmovq / pinsrq / vinserti128 so you can do the subtract with a vpsubq. That's 8 uops to load&multiply all 4 scalars, 7 shuffle uops to get the data into a __m256i (2 (movq) + 4 (pinsrq is 2 uops) + 1 vinserti128), and 3 more uops to do a vector load / vpsubq / vector store. So that's 18 fused-domain uops per 4 multiplies (4.5 cycles to issue), but 7 shuffle uops (7 cycles to execute). So nvm, this is no good compared to pure scalar.
The autovectorized code is using 8 vector ALU instructions for each vector of four values. On Haswell, 5 of those uops (multiplies and shifts) can only run on port 0, so no matter how you unroll this algorithm it will achieve at best one vector per 5 cycles (i.e. one multiply per 5/4 cycles.)
The shifts could be replaced with pshufb (port 5) to move the data and shift in zeros. (Other shuffles don't support zeroing instead of copying a byte from the input, and there aren't any known-zeros in the input that we could copy.)
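As an illustration of that idea (a sketch, not code from the answer), a pshufb control vector like the following reproduces the data movement of a psrlq-by-32 within each qword; the psllq case is analogous with the selectors and zeroed bytes swapped:

#include <immintrin.h>

// Illustration only: move the high dword of each qword down and zero the rest,
// i.e. the same data movement as vpsrlq ymm, ymm, 32, but as a port-5 shuffle
// on Haswell instead of a port-0 shift.
__m256i srl32_via_pshufb(__m256i v)
{
    const __m256i ctrl = _mm256_setr_epi8(
        4, 5, 6, 7, -1, -1, -1, -1, 12, 13, 14, 15, -1, -1, -1, -1,
        4, 5, 6, 7, -1, -1, -1, -1, 12, 13, 14, 15, -1, -1, -1, -1);
    return _mm256_shuffle_epi8(v, ctrl);   // a selector with its high bit set writes zero
}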
paddq / psubq can run on ports 1/5 on Haswell, or p015 on Skylake.
Skylake runs pmuludq and immediate-count vector shifts on p01, so it could in theory manage a throughput of one vector per max(5/2, 8/3, 11/4) = 11/4 = 2.75 cycles. So it bottlenecks on total fused-domain uop throughput (including the 2 vector loads and 1 vector store). So a bit of loop unrolling will help. Probably resource conflicts from imperfect scheduling will bottleneck it to slightly less than 4 fused-domain uops per clock. The loop overhead can hopefully run on port 6, which can only handle some scalar ops, including add and compare-and-branch, leaving ports 0/1/5 for vector ALU ops, since they're close to saturating (8/3 = 2.666 clocks). The load/store ports are nowhere near saturating, though.
So, Skylake can theoretically manage one vector per 2.75 cycles (plus loop overhead), or one multiply per ~0.7 cycles, vs. Haswell's best option (one per ~1.2 cycles in theory with scalar, or one per 1.25 cycles in theory with vectors). The scalar one per ~1.2 cycles would probably require a hand-tuned asm loop, though, because compilers don't know how to use a one-register addressing mode for stores, and a two-register addressing mode for loads (dst + (src-dst) and increment dst).
Also, if your data isn't hot in L1 cache, getting the job done with fewer instructions lets the frontend get ahead of the execution units, and get started on the loads before the data is needed. Hardware prefetch doesn't cross page lines, so a vector loop will probably beat scalar in practice for large arrays, and maybe even for smaller arrays.
AVX-512DQ introduces a 64bx64b->64b vector multiply
gcc can auto-vectorize using it, if you add -mavx512dq.
.L4:
vmovdqu64 zmm0, ZMMWORD PTR [r8+rax] # vect__11.23, MEM[base: vectp_Gj_vec.22_86, index: ivtmp.32_76, offset: 0B]
add rcx, 1 # ivtmp.30,
vpmullq zmm1, zmm0, zmm2 # vect__13.24, vect__11.23, vect_cst_.25
vmovdqa64 zmm0, ZMMWORD PTR [r9+rax] # MEM[base: vectp_Gi_vec.19_81, index: ivtmp.32_76, offset: 0B], MEM[base: vectp_Gi_vec.19_81, index: ivtmp.32_76, offset: 0B]
vpsubq zmm0, zmm0, zmm1 # vect__14.26, MEM[base: vectp_Gi_vec.19_81, index: ivtmp.32_76, offset: 0B], vect__13.24
vmovdqa64 ZMMWORD PTR [r9+rax], zmm0 # MEM[base: vectp_Gi_vec.19_81, index: ivtmp.32_76, offset: 0B], vect__14.26
add rax, 64 # ivtmp.32,
cmp rcx, r10 # ivtmp.30, bnd.14
jb .L4 #,
So AVX512DQ (expected to be part of Skylake multi-socket Xeon (Purley) in ~2017) will give a much larger than 2x speedup (from wider vectors) if these instructions are pipelined at one per clock.
Update: Skylake-AVX512 (aka SKL-X or SKL-SP) runs VPMULLQ at one per 1.5 cycles for xmm, ymm, or zmm vectors. It's 3 uops with 15c latency. (With maybe an extra 1c of latency for the zmm version, if that's not a measurement glitch in the AIDA results.)
vpmullq is much faster than anything you can build out of 32-bit chunks, so it's very much worth having an instruction for this even if current CPUs don't have 64-bit-element vector-multiply hardware. (Presumably they use the mantissa multipliers in the FMA units.)
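As a hedged illustration of what that buys you, here is roughly what the question's loop looks like written directly with the AVX-512DQ intrinsic (_mm512_mullo_epi64 compiles to vpmullq); the function name and the alignment/size assumptions are mine:

#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

// Requires -mavx512dq. Assumes n is a multiple of 8 and 64-byte-aligned pointers;
// a scalar or masked epilogue would handle anything left over.
void multiply_sub_avx512(int64_t *Gi_vec, int64_t q, const int64_t *Gj_vec, size_t n)
{
    const __m512i vq = _mm512_set1_epi64(q);
    for (size_t i = 0; i < n; i += 8) {
        __m512i gj   = _mm512_load_si512((const void *)&Gj_vec[i]);
        __m512i gi   = _mm512_load_si512((const void *)&Gi_vec[i]);
        __m512i prod = _mm512_mullo_epi64(gj, vq);          // vpmullq: 64b x 64b -> low 64b
        _mm512_store_si512((void *)&Gi_vec[i], _mm512_sub_epi64(gi, prod));
    }
}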

If you're interested in SIMD 64bx64b to 64b (lower) operations here are the AVX and AVX2 solutions from Agner Fog's Vector Class Library. I would test these with arrays and see how it compares to what GCC does with a generic loop such as the one in Peter Cordes' answer.
AVX (use SSE - you can still compile with -mavx to get vex encoding).
// vector operator * : multiply element by element
static inline Vec2q operator * (Vec2q const & a, Vec2q const & b) {
#if INSTRSET >= 5   // SSE4.1 supported
    // instruction does not exist. Split into 32-bit multiplies
    __m128i bswap   = _mm_shuffle_epi32(b,0xB1);        // b0H,b0L,b1H,b1L (swap H<->L)
    __m128i prodlh  = _mm_mullo_epi32(a,bswap);         // a0Lb0H,a0Hb0L,a1Lb1H,a1Hb1L, 32 bit L*H products
    __m128i zero    = _mm_setzero_si128();              // 0
    __m128i prodlh2 = _mm_hadd_epi32(prodlh,zero);      // a0Lb0H+a0Hb0L,a1Lb1H+a1Hb1L,0,0
    __m128i prodlh3 = _mm_shuffle_epi32(prodlh2,0x73);  // 0, a0Lb0H+a0Hb0L, 0, a1Lb1H+a1Hb1L
    __m128i prodll  = _mm_mul_epu32(a,b);               // a0Lb0L,a1Lb1L, 64 bit unsigned products
    __m128i prod    = _mm_add_epi64(prodll,prodlh3);    // a0Lb0L+(a0Lb0H+a0Hb0L)<<32, a1Lb1L+(a1Lb1H+a1Hb1L)<<32
    return  prod;
#else   // SSE2
    int64_t aa[2], bb[2];
    a.store(aa);                                        // split into elements
    b.store(bb);
    return Vec2q(aa[0]*bb[0], aa[1]*bb[1]);             // multiply elements separately
#endif
}
AVX2
// vector operator * : multiply element by element
static inline Vec4q operator * (Vec4q const & a, Vec4q const & b) {
    // instruction does not exist. Split into 32-bit multiplies
    __m256i bswap   = _mm256_shuffle_epi32(b,0xB1);        // swap H<->L
    __m256i prodlh  = _mm256_mullo_epi32(a,bswap);         // 32 bit L*H products
    __m256i zero    = _mm256_setzero_si256();              // 0
    __m256i prodlh2 = _mm256_hadd_epi32(prodlh,zero);      // a0Lb0H+a0Hb0L,a1Lb1H+a1Hb1L,0,0
    __m256i prodlh3 = _mm256_shuffle_epi32(prodlh2,0x73);  // 0, a0Lb0H+a0Hb0L, 0, a1Lb1H+a1Hb1L
    __m256i prodll  = _mm256_mul_epu32(a,b);               // a0Lb0L,a1Lb1L, 64 bit unsigned products
    __m256i prod    = _mm256_add_epi64(prodll,prodlh3);    // a0Lb0L+(a0Lb0H+a0Hb0L)<<32, a1Lb1L+(a1Lb1H+a1Hb1L)<<32
    return  prod;
}
These functions work for signed and unsigned 64-bit integers. In your case, since q is constant within the loop, you don't need to recalculate some things every iteration, but your compiler will probably figure that out anyway.

Related

Why are the hex numbers for big endian different than little endian?

#include <stdio.h>

typedef unsigned char *byte_pointer;

void show_bytes(byte_pointer start, size_t len)
{
    size_t i;
    for (i = 0; i < len; i++)
        printf(" %.2x", start[i]);
    printf("\n");
}

void show_int(int x)
{
    show_bytes((byte_pointer) &x, sizeof(int));
}

void show_float(float x)
{
    show_bytes((byte_pointer) &x, sizeof(float));
}

void show_pointer(void *x)
{
    show_bytes((byte_pointer) &x, sizeof(void *));
}

int main()
{
    int a = 0x12345678;
    byte_pointer ap = (byte_pointer) &a;
    show_bytes(ap, 3);
    return 0;
}
(Solutions according to the CS:APP book)
Big endian: 12 34 56
Little endian: 78 56 34
I know systems have different conventions for storage allocation but if two systems use the same convention but are different endian why are the hex values different?
Endian-ness is an issue that arises when we use more than one storage location for a value/type, which we do because some things won't fit in a single storage location.
As soon as we use multiple storage locations for a single value that gives rise to the question of:  What part of the value will we store in each storage location?
The first byte of a two-byte item will have a lower address than the second byte, and in particular, the address of the second byte will be at +1 from the address of the lower byte.
Storing a two-byte item in two bytes of storage, do we store the most significant byte first and the least significant byte second, or vice versa?
We choose to use directly consecutive bytes for the two bytes of the two-byte item, so no matter which (endian) way we choose to store such an item, we refer to the whole two-byte item by the lower address (the address of its first byte).
We can express these storage choices with formulas; here item[0] refers to the first byte while item[1] refers to the second byte.
item[0] = value >> 8 // also value / 256
item[1] = value & 0xFF // also value % 256
value = (item[0]<<8) | item[1] // also item[0]*256 | item[1]
--vs--
item[0] = value & 0xFF // also value % 256
item[1] = value >> 8 // also value / 256
value = item[0] | (item[1]<<8) // also item[0] | item[1]*256
The first set of formulas is for big endian, and the second for little endian.
By these formulas, it doesn't matter what order we access memory as to whether item[0] first, then item[1], or vice versa, or both at the same time (common in hardware), as long as the formulas for one endian are consistently used.
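As a minimal C sketch of the two formulas (the variable names are illustrative), applying both to the bytes of a value in memory shows which one matches the machine you run it on:

#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void)
{
    uint16_t value = 0x1234;
    unsigned char item[2];
    memcpy(item, &value, 2);        /* item[0] is the byte at the lower address */

    /* Apply both formulas from above to the stored bytes. */
    unsigned as_big    = (item[0] << 8) | item[1];   /* big-endian formula    */
    unsigned as_little = item[0] | (item[1] << 8);   /* little-endian formula */

    printf("bytes in memory: %.2x %.2x\n", item[0], item[1]);
    printf("big-endian formula gives    0x%.4x\n", as_big);
    printf("little-endian formula gives 0x%.4x\n", as_little);
    /* Whichever formula reproduces 0x1234 matches this machine's byte order. */
    return 0;
}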
If the item in question is a four-byte value, then there are 4 possible orderings(!) — though only two of them are truly sensible.
For efficiency, the hardware offers us multibyte memory access in one instruction (and with one reference, namely to the lowest address of the multibyte item), and therefore, the hardware itself needs to define and consistently use one of the two possible/reasonable orderings.
If the hardware did not offer multibyte memory access, then the ordering would be entirely up to the software program itself to define (accessing memory one byte at a time), and the program could choose big or little endian, even differently for each variable, as long as it consistently accesses the multiple bytes of memory in the same manner to reassemble the values stored there.
In a similar manner, when we define a structure of multiple items (e.g. struct point { int x; int y; }), software chooses whether x comes first or y comes first in memory ordering. However, since programmers (and compilers) will still choose to use hardware instructions to access individual fields such as x in one go, the hardware's endian configuration remains necessary.

How to use fast division operation without support of division instruction, for a constant divisor? [duplicate]

I've been reading about div and mul assembly operations, and I decided to see them in action by writing a simple program in C:
File division.c
#include <stdlib.h>
#include <stdio.h>
int main()
{
size_t i = 9;
size_t j = i / 5;
printf("%zu\n",j);
return 0;
}
And then generating assembly language code with:
gcc -S division.c -O0 -masm=intel
But looking at generated division.s file, it doesn't contain any div operations! Instead, it does some kind of black magic with bit shifting and magic numbers. Here's a code snippet that computes i/5:
mov rax, QWORD PTR [rbp-16] ; Move i (=9) to RAX
movabs rdx, -3689348814741910323 ; Move some magic number to RDX (?)
mul rdx ; Multiply 9 by magic number
mov rax, rdx ; Take only the upper 64 bits of the result
shr rax, 2 ; Shift these bits 2 places to the right (?)
mov QWORD PTR [rbp-8], rax ; Magically, RAX contains 9/5=1 now,
; so we can assign it to j
What's going on here? Why doesn't GCC use div at all? How does it generate this magic number and why does everything work?
Integer division is one of the slowest arithmetic operations you can perform on a modern processor, with latency up to the dozens of cycles and bad throughput. (For x86, see Agner Fog's instruction tables and microarch guide).
If you know the divisor ahead of time, you can avoid the division by replacing it with a set of other operations (multiplications, additions, and shifts) which have the equivalent effect. Even if several operations are needed, it's often still a heck of a lot faster than the integer division itself.
Implementing the C / operator this way instead of with a multi-instruction sequence involving div is just GCC's default way of doing division by constants. It doesn't require optimizing across operations and doesn't change anything even for debugging. (Using -Os for small code size does get GCC to use div, though.) Using a multiplicative inverse instead of division is like using lea instead of mul and add.
As a result, you only tend to see div or idiv in the output if the divisor isn't known at compile-time.
For information on how the compiler generates these sequences, as well as code to let you generate them for yourself (almost certainly unnecessary unless you're working with a braindead compiler), see libdivide.
Dividing by 5 is the same as multiplying by 1/5, which is again the same as multiplying by 4/5 and shifting right 2 bits. The value concerned is CCCCCCCCCCCCCCCD in hex, which is the representation of 4/5 if put after a hexadecimal point (i.e. the binary for four fifths is 0.110011001100 recurring - see below for why). I think you can take it from here! You might want to check out fixed point arithmetic (though note it's rounded to an integer at the end).
As to why, multiplication is faster than division, and when the divisor is fixed, this is a faster route.
See Reciprocal Multiplication, a tutorial for a detailed writeup about how it works, explaining in terms of fixed-point. It shows how the algorithm for finding the reciprocal works, and how to handle signed division and modulo.
Let's consider for a minute why 0.CCCCCCCC... (hex) or 0.110011001100... binary is 4/5. Divide the binary representation by 4 (shift right 2 places), and we'll get 0.001100110011..., which by trivial inspection can be added to the original to get 0.111111111111..., which is obviously equal to 1, the same way 0.9999999... in decimal is equal to one. Therefore, we know that x + x/4 = 1, so 5x/4 = 1, x = 4/5. This is then represented as CCCCCCCCCCCCCCCD in hex after rounding (as the binary digit beyond the last one present would be a 1).
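To make that concrete, here is a hedged C sketch (assuming a compiler with unsigned __int128, such as GCC or Clang on x86-64; the function name is mine) that reproduces the multiply-and-shift sequence by hand and checks it against ordinary division:

#include <stdio.h>
#include <stdint.h>

/* Multiply by the 0.64 fixed-point reciprocal 0xCCCCCCCCCCCCCCCD (just over 4/5),
 * keep the high 64 bits of the 128-bit product, then shift right by 2:
 * the same steps as the compiler-generated asm in the question. */
static uint64_t div5(uint64_t n)
{
    const uint64_t magic = 0xCCCCCCCCCCCCCCCDull;
    uint64_t hi = (uint64_t)(((unsigned __int128)n * magic) >> 64);  /* the "mul rdx" step    */
    return hi >> 2;                                                  /* the "shr rax, 2" step */
}

int main(void)
{
    for (uint64_t n = 0; n < 1000000; n++) {
        if (div5(n) != n / 5) {
            printf("mismatch at %llu\n", (unsigned long long)n);
            return 1;
        }
    }
    printf("div5(9) = %llu\n", (unsigned long long)div5(9));  /* prints 1, as in the question */
    return 0;
}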
In general, multiplication is much faster than division. So if we can get away with multiplying by the reciprocal instead, we can significantly speed up division by a constant.
A wrinkle is that we cannot represent the reciprocal exactly (unless the division was by a power of two but in that case we can usually just convert the division to a bit shift). So to ensure correct answers we have to be careful that the error in our reciprocal does not cause errors in our final result.
-3689348814741910323 is 0xCCCCCCCCCCCCCCCD which is a value of just over 4/5 expressed in 0.64 fixed point.
When we multiply a 64 bit integer by a 0.64 fixed point number we get a 64.64 result. We truncate the value to a 64-bit integer (effectively rounding it towards zero) and then perform a further shift which divides by four and again truncates. By looking at the bit level it is clear that we can treat both truncations as a single truncation.
This clearly gives us at least an approximation of division by 5 but does it give us an exact answer correctly rounded towards zero?
To get an exact answer the error needs to be small enough not to push the answer over a rounding boundary.
The exact answer to a division by 5 will always have a fractional part of 0, 1/5, 2/5, 3/5 or 4/5 . Therefore a positive error of less than 1/5 in the multiplied and shifted result will never push the result over a rounding boundary.
The error in our constant is (1/5) * 2^-64. The value of i is less than 2^64, so the error after multiplying is less than 1/5. After the division by 4 the error is less than (1/5) * 2^-2.
(1/5) * 2^-2 < 1/5, so the answer will always be equal to doing an exact division and rounding towards zero.
Unfortunately this doesn't work for all divisors.
If we try to represent 4/7 as a 0.64 fixed point number with rounding away from zero, we end up with an error of (6/7) * 2^-64. After multiplying by an i value of just under 2^64 we end up with an error just under 6/7, and after dividing by four we end up with an error of just under 1.5/7, which is greater than 1/7.
So to implement division by 7 correctly we need to multiply by a 0.65 fixed point number. We can implement that by multiplying by the lower 64 bits of our fixed point number, then adding the original number (this may overflow into the carry bit), then doing a rotate through carry.
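Here is a hedged C sketch of the divisor-7 case (again assuming unsigned __int128; the function name is mine). It uses a carry-free reformulation of that add, which is the same trick the five-instruction x86 sequence further down implements:

#include <stdio.h>
#include <stdint.h>

/* The full multiplier for 7 is the 65-bit value 2^64 + M. After taking the high
 * half of n*M, add back n using the overflow-free identity ((n - q) >> 1) + q,
 * i.e. (n + q)/2 without risking a carry out, then apply the remaining shift. */
static uint64_t div7(uint64_t n)
{
    const uint64_t M = 0x2492492492492493ull;   /* low 64 bits of ceil(2^67 / 7) */
    uint64_t q = (uint64_t)(((unsigned __int128)n * M) >> 64);
    uint64_t t = ((n - q) >> 1) + q;
    return t >> 2;                              /* total post-shift for 7 is 3 */
}

int main(void)
{
    uint64_t tests[] = { 0, 1, 6, 7, 13, 14, 1000000007ull, UINT64_MAX };
    for (unsigned i = 0; i < sizeof tests / sizeof tests[0]; i++)
        printf("%llu / 7 = %llu (expect %llu)\n",
               (unsigned long long)tests[i],
               (unsigned long long)div7(tests[i]),
               (unsigned long long)(tests[i] / 7));
    return 0;
}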
Here is a link to a document of an algorithm that produces the values and code I see with Visual Studio (in most cases) and that I assume is still used in GCC for division of a variable integer by a constant integer.
http://gmplib.org/~tege/divcnst-pldi94.pdf
In the article, a uword has N bits, a udword has 2N bits, n = numerator = dividend, d = denominator = divisor, ℓ is initially set to ceil(log2(d)), shpre is pre-shift (used before multiply) = e = number of trailing zero bits in d, shpost is post-shift (used after multiply), prec is precision = N - e = N - shpre. The goal is to optimize calculation of n/d using a pre-shift, multiply, and post-shift.
Scroll down to figure 6.2, which defines how a udword multiplier (max size is N+1 bits), is generated, but doesn't clearly explain the process. I'll explain this below.
Figure 4.2 and figure 6.2 show how the multiplier can be reduced to a N bit or less multiplier for most divisors. Equation 4.5 explains how the formula used to deal with N+1 bit multipliers in figure 4.1 and 4.2 was derived.
In the case of modern X86 and other processors, multiply time is fixed, so pre-shift doesn't help on these processors, but it still helps to reduce the multiplier from N+1 bits to N bits. I don't know if GCC or Visual Studio have eliminated pre-shift for X86 targets.
Going back to Figure 6.2. The numerator (dividend) for mlow and mhigh can be larger than a udword only when denominator (divisor) > 2^(N-1) (when ℓ == N => mlow = 2^(2N)), in this case the optimized replacement for n/d is a compare (if n>=d, q = 1, else q = 0), so no multiplier is generated. The initial values of mlow and mhigh will be N+1 bits, and two udword/uword divides can be used to produce each N+1 bit value (mlow or mhigh). Using X86 in 64 bit mode as an example:
; upper 8 bytes of dividend = 2^(ℓ) = (upper part of 2^(N+ℓ))
; lower 8 bytes of dividend for mlow = 0
; lower 8 bytes of dividend for mhigh = 2^(N+ℓ-prec) = 2^(ℓ+shpre) = 2^(ℓ+e)
dividend dq 2 dup(?) ;16 byte dividend
divisor dq 1 dup(?) ; 8 byte divisor
; ...
mov rcx,divisor
mov rdx,0
mov rax,dividend+8 ;upper 8 bytes of dividend
div rcx ;after div, rax == 1
mov rax,dividend ;lower 8 bytes of dividend
div rcx
mov rdx,1 ;rdx:rax = N+1 bit value = 65 bit value
You can test this with GCC. You've already seen how j = i/5 is handled. Take a look at how j = i/7 is handled (which should be the N+1 bit multiplier case).
On most current processors, multiply has a fixed timing, so a pre-shift is not needed. For X86, the end result is a two instruction sequence for most divisors, and a five instruction sequence for divisors like 7 (in order to emulate a N+1 bit multiplier as shown in equation 4.5 and figure 4.2 of the pdf file). Example X86-64 code:
; rbx = dividend, rax = 64 bit (or less) multiplier, rcx = post shift count
; two instruction sequence for most divisors:
mul rbx ;rdx = upper 64 bits of product
shr rdx,cl ;rdx = quotient
;
; five instruction sequence for divisors like 7
; to emulate 65 bit multiplier (rbx = lower 64 bits of multiplier)
mul rbx ;rdx = upper 64 bits of product
sub rbx,rdx ;rbx -= rdx
shr rbx,1 ;rbx >>= 1
add rdx,rbx ;rdx = upper 64 bits of corrected product
shr rdx,cl ;rdx = quotient
; ...
To explain the 5 instruction sequence, a simple 3 instruction sequence could overflow. Let u64() mean upper 64 bits (all that is needed for quotient)
mul rbx ;rdx = u64(dvnd*mplr)
add rdx,rbx ;rdx = u64(dvnd*(2^64 + mplr)), could overflow
shr rdx,cl
To handle this case, cl = post_shift-1. rax = multiplier - 2^64, rbx = dividend. u64() is upper 64 bits. Note that rax = rax<<1 - rax. Quotient is:
u64( ( rbx * (2^64 + rax) )>>(cl+1) )
u64( ( rbx * (2^64 + rax<<1 - rax) )>>(cl+1) )
u64( ( (rbx * 2^64) + (rbx * rax)<<1 - (rbx * rax) )>>(cl+1) )
u64( ( (rbx * 2^64) - (rbx * rax) + (rbx * rax)<<1 )>>(cl+1) )
u64( ( (((rbx * 2^64) - (rbx * rax))>>1) + (rbx*rax) )>>(cl) )
mul rbx ; (rbx*rax)
sub rbx,rdx ; (rbx*2^64)-(rbx*rax)
shr rbx,1 ;( (rbx*2^64)-(rbx*rax))>>1
add rdx,rbx ;( ((rbx*2^64)-(rbx*rax))>>1)+(rbx*rax)
shr rdx,cl ;((((rbx*2^64)-(rbx*rax))>>1)+(rbx*rax))>>cl
I will answer from a slightly different angle: Because it is allowed to do it.
C and C++ are defined against an abstract machine. The compiler transforms a program written in terms of that abstract machine into a program for the concrete machine, following the as-if rule.
The compiler is allowed to make ANY changes as long as it doesn't change the observable behaviour as specified by the abstract machine. There is no reasonable expectation that the compiler will transform your code in the most straightforward way possible (even when a lot of C programmer assume that). Usually, it does this because the compiler wants to optimize the performance compared to the straightforward approach (as discussed in the other answers at length).
If under any circumstances the compiler "optimizes" a correct program to something that has a different observable behaviour, that is a compiler bug.
If there is any undefined behaviour in our code (signed integer overflow is a classic example), this contract is void.

Reducing LUT utilization in a Vivado HLS design (RSA cryptosystem using montgomery multiplication)

A question/problem for anyone experienced with Xilinx Vivado HLS and FPGA design:
I need help reducing the utilization numbers of a design within the confines of HLS (i.e. can't just redo the design in an HDL). I am targeting the Zedboard (Zynq 7020).
I'm trying to implement 2048-bit RSA in HLS, using the Tenca-Koç multiple-word radix-2 Montgomery multiplication algorithm, shown below (More algorithm details here):
I wrote this algorithm in HLS and it works in simulation and in C/RTL cosim. My algorithm is here:
#define MWR2MM_m 2048 // Bit-length of operands
#define MWR2MM_w 8    // word size
#define MWR2MM_e 257  // number of words per operand

// Type definitions
typedef ap_uint<1> bit_t;               // 1-bit scan
typedef ap_uint< MWR2MM_w > word_t;     // 8-bit words
typedef ap_uint< MWR2MM_m > rsaSize_t;  // m-bit operand size

/*
 * Multiple-word radix 2 Montgomery multiplication using carry-propagate adder
 */
void mwr2mm_cpa(rsaSize_t X, rsaSize_t Yin, rsaSize_t Min, rsaSize_t* out)
{
    // extend operands to 2 extra words of 0
    ap_uint<MWR2MM_m + 2*MWR2MM_w> Y = Yin;
    ap_uint<MWR2MM_m + 2*MWR2MM_w> M = Min;
    ap_uint<MWR2MM_m + 2*MWR2MM_w> S = 0;

    ap_uint<2> C = 0; // two carry bits
    bit_t qi = 0;     // an intermediate result bit

    // Store concatenations in a temporary variable to eliminate HLS compiler warnings about shift count
    // (needs word size + 2 bits so the top two bits can hold the carry)
    ap_uint<MWR2MM_w + 2> temp_concat = 0;

    // scan X bit-by-bit
    for (int i=0; i<MWR2MM_m; i++)
    {
        qi = (X[i]*Y[0]) xor S[0];

        // C gets top two bits of temp_concat, j'th word of S gets bottom 8 bits of temp_concat
        temp_concat = X[i]*Y.range(MWR2MM_w-1,0) + qi*M.range(MWR2MM_w-1,0) + S.range(MWR2MM_w-1,0);
        C = temp_concat.range(9,8);
        S.range(MWR2MM_w-1,0) = temp_concat.range(7,0);

        // scan Y and M word-by-word, for each bit of X
        for (int j=1; j<=MWR2MM_e; j++)
        {
            temp_concat = C + X[i]*Y.range(MWR2MM_w*j+(MWR2MM_w-1), MWR2MM_w*j) + qi*M.range(MWR2MM_w*j+(MWR2MM_w-1), MWR2MM_w*j) + S.range(MWR2MM_w*j+(MWR2MM_w-1), MWR2MM_w*j);
            C = temp_concat.range(9,8);
            S.range(MWR2MM_w*j+(MWR2MM_w-1), MWR2MM_w*j) = temp_concat.range(7,0);

            S.range(MWR2MM_w*(j-1)+(MWR2MM_w-1), MWR2MM_w*(j-1)) = (S.bit(MWR2MM_w*j), S.range( MWR2MM_w*(j-1)+(MWR2MM_w-1), MWR2MM_w*(j-1)+1));
        }
        S.range(S.length()-1, S.length()-MWR2MM_w) = 0;
        C = 0;
    }

    // if final partial sum is greater than the modulus, bring it back to proper range
    if (S >= M)
        S -= M;

    *out = S;
}
Unfortunately, the LUT utilization is huge.
This is problematic because I need to be able to fit multiple of these blocks in hardware as axi4-lite slaves.
Could someone please provide a few suggestions as to how I can reduce the LUT utilization, WITHIN THE CONFINES OF HLS?
I've already tried the following:
Experimenting with different word lengths
switching the top level inputs to arrays so they are BRAM (i.e. not using ap_uint<2048>, but instead ap_uint<MWR2MM_w> foo[MWR2MM_e])
Experimenting with all sorts of directives: compartmentalizing into multiple inline functions, dataflow architecture, resource limits on lshr, etc.
However, nothing really drives the LUT utilization down in a meaningful way. Is there a glaringly obvious way that I could reduce the utilization that is apparent to anyone?
In particular, I've seen papers on implementations of the mwr2mm algorithm that only use one DSP block and one BRAM. Is this even worth attempting to implement using HLS? Or is there no way that I can actually control the resources that the algorithm is mapped to without describing it in HDL?
Thanks for the help.

Convert 24-bit ADC serial read data from 3-byte format to signed integer (int32) in Matlab

I am receiving EEG data from a 24 bit ADC over serial. The ADC data is transmitted in 3 bytes from MSB to LSB. The full packet is 21 bytes:
The first byte is the start byte - 0xFF (255 in decimal)
Then packet number byte.
Then the next 3 bytes are the 24 bit ADC value broken into MSB LSB2 LSB1
I can parse the data fine, but re-constructing a 2's complement signed int32 number is causing issues. The values I am getting out certainly don't reflect what the ADC should be giving out.
Below are the lines to read and parse the 504 samples (which gives me 24 ADC values: 504 samples / 21 bytes = 24 values). I have tried uint8 instead of uchar with similar results (when I try int8 I get an invalid specified precision error).
comEEGSMT = serial(com,'BaudRate',3000000);
fopen(comEEGSMT);
rawData(1:504) = fread(comEEGSMT, 504, 'uchar');
fclose(comEEGSMT);
startPackets = find(rawData == 255);
bytes = rawData([startPackets+2 startPackets+3 startPackets+4]);
I have tried the following method to reconstruct the value:
ADC_value = bytes(:,1)*256^2 + bytes(:,2)*256 + bytes(:,3);
and the following line is the formula to convert the above number to volts:
ADC_value_volts = ADC_value*(5/3)*(1/(2^32));
The values are in the range of 4000 - 8000 microvolts with large jumps in value. The values SHOULD be in the range of 200 - 600 microvolts with small changes.
I have found other questions relating to similar issues, but have had no success trying the proposed solutions such as in the link below:
https://uk.mathworks.com/matlabcentral/answers/137965-concatenate-3-bytes-array-of-real-time-serial-data-into-single-precision
Any help would be very much appreciated as I've been stuck on this for quite long.
Thanks Mark
Starting with ADC_value as int32 with value 0, then:
ADC_value |= MSB << 16;
ADC_value |= LSB2 << 8;
ADC_value |= LSB1;
And then, to find out the corresponding volts value, supposing your ADC has a reference voltage VREF, in volts (e.g. 5.0V):
ADC_value_volts = (ADC_value * VREF)/2^24
since your converter is 24 bits, not 32.
Note the above expressions are in C language equivalent, not Matlab.
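As a self-contained illustration (an assumption-laden sketch, not code from the question's hardware), here is the same reconstruction in C with the sign extension for a 24-bit two's-complement sample made explicit:

#include <stdio.h>
#include <stdint.h>

/* Assemble a 24-bit two's-complement sample (MSB, LSB2, LSB1) into a signed
 * 32-bit integer, then scale to volts with the expression above
 * (the EDIT below refines the scaling for this particular ADC). */
static int32_t adc24_to_int32(uint8_t msb, uint8_t lsb2, uint8_t lsb1)
{
    int32_t value = ((int32_t)msb << 16) | ((int32_t)lsb2 << 8) | (int32_t)lsb1;
    if (value & 0x800000)       /* sign bit of the 24-bit sample set? */
        value -= 0x1000000;     /* sign-extend to 32 bits */
    return value;
}

int main(void)
{
    const double vref = 5.0;                             /* assumed reference voltage */
    int32_t sample = adc24_to_int32(0xFF, 0xFF, 0xFE);   /* -2 in 24-bit two's complement */
    double volts = (sample * vref) / (double)(1 << 24);
    printf("sample = %ld, volts = %g\n", (long)sample, volts);
    return 0;
}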
EDIT:
The ADC data sheet tells us the PGA gain can be set to the following values: 1, 2, 4, 6, 8, 12, 24, one value at a time for each channel.
The FSR (full scale range) of measurement is (2*VREF)/Gain = 5/3 for Gain = 6 (eq. (5), page 23), so this must be accounted for in the expression computing the volts values. (These can be verified if you have access to the hardware and can make some measurements.)
Data resulting from the ADC is already in two's complement, binary form, 24 bits.
The weird thing is that the data sheet counts bits starting with 1, not 0, which is why it says to shift by "17" instead of 16 - this is in fact 16 for coding (revealed in fig. 47, page 42).
So the computing formula of ADC_value_volts should be:
ADC_value_volts = (ADC_value * FSR/(2^23))/3 (1 LSB = FSR/(2^23), pg. 37)
If there are some other calculations/modifications relative to the original, these must be explained by the provider.
If the provider is not friendly, it's worth changing...

How to do bitwise operation decently?

I'm doing analysis on binary data. Suppose I have two uint8 data values:
a = uint8(0xAB);
b = uint8(0xCD);
I want to take the lower two bits from a, and whole content from b, to make a 10 bit value. In C-style, it should be like:
(a[2:1] << 8) | b
I tried bitget:
bitget(a,2:-1:1)
But this just gave me separate [1, 1] logical type values, which is not a scalar, and cannot be used in the bitshift operation later.
My current solution is:
Make a|b (a or b):
temp1 = bitor(bitshift(uint16(a), 8), uint16(b));
Left shift six bits to get rid of the higher six bits from a:
temp2 = bitshift(temp1, 6);
Right shift six bits to get rid of lower zeros from the previous result:
temp3 = bitshift(temp2, -6);
Putting all these on one line:
result = bitshift(bitshift(bitor(bitshift(uint16(a), 8), uint16(b)), 6), -6);
This doesn't seem efficient, right? I only want to get (a[2:1] << 8) | b, and it takes a long expression to get the value.
Please let me know if there's well-known solution for this problem.
Since you are using Octave, you can make use of bitpack and bitunpack:
octave> A = bitunpack (uint8 (0xAB))
A =
1 1 0 1 0 1 0 1
octave> B = bitunpack (uint8 (0xCD))
B =
1 0 1 1 0 0 1 1
Once you have them in this form, it's dead easy to do what you want:
octave> [B A(1:2)]
ans =
1 0 1 1 0 0 1 1 1 1
Then simply pad with zeros accordingly and pack it back into an integer:
octave> postpad ([B A(1:2)], 16, false)
ans =
1 0 1 1 0 0 1 1 1 1 0 0 0 0 0 0
octave> bitpack (ans, "uint16")
ans = 973
That or is equivalent to an addition when dealing with integers whose bit ranges don't overlap, as is the case here:
result = bitshift(bi2de(bitget(a,1:2)),8) + b;
e.g
a = 01010111
b = 10010010
result = 00000011 10010010
= a[2]*2^9 + a[1]*2^8 + b
an alternative method could be
result = mod(a,2^x)*2^y + b;
where x is the number of bits you want to extract from a and y is the number of bits in b; in your case:
result = mod(a,4)*256 + b;
an extra alternative solution close to the C solution:
result = bitor(bitshift(bitand(a,3), 8), b);
I think it is important to explain exactly what "(a[2:1] << 8) | b" is doing.
In assembly, referencing individual bits is an operation in its own right. Assume all operations take the exact same time, and the supposedly "efficient" a[2:1] starts looking rather inefficient.
The convenience statement actually does (a & 0x03).
If your compiler actually converts a uint8 to a uint16 based on how much it was shifted, this is not a 'free' operation, per se. Effectively, what your compiler will do is first clear the "memory" to the size of uint16 and then copy "a" into the location. This requires an extra step (clearing the "memory" (register)) that wouldn't normally be needed.
This means your statement actually is (uint16(a & 0x03) << 8) | uint16(b)
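For a concrete cross-check, here is a minimal C sketch of that expanded expression (variable names are just for illustration); it uses the same example bytes as the Octave answer above and prints the same value, 973:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint8_t a = 0xAB, b = 0xCD;
    /* (a[2:1] << 8) | b, written out as the widening and masking it implies */
    uint16_t result = (uint16_t)(((uint16_t)(a & 0x03) << 8) | (uint16_t)b);
    printf("%u\n", (unsigned)result);   /* prints 973, matching the Octave bitpack example above */
    return 0;
}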
Now yes, because you're doing a power of two shift, you could just move a into AH, move b into AL, AND AH with 0x03, and move it all out, but that's a compiler optimization and not what your C code said to do.
The point is that directly translating that statement into matlab yields
bitor(bitshift(uint16(bitand(a,3)),8),uint16(b))
But, it should be noted that while it is not as TERSE as (a[2:1] << 8) | b, the number of "high level operations" is the same.
Note that all scripting languages are going to be very slow upon initiating each instruction, but will complete said instruction rapidly. The terse nature of Python isn't because "terse is better" but to create simple structures that the language can recognize so it can easily go into vectorized operations mode and start executing code very quickly.
The point here is that you have an "overhead" cost for calling bitand; but when operating on an array it will use SSE and that "overhead" is only paid once. The JIT (just in time) compiler, which optimizes script languages by reducing overhead calls and creating temporary machine code for currently executing sections of code MAY be able to recognize that the type checks for a chain of bitwise operations need only occur on the initial inputs, hence further reducing runtime.
Very high level languages are quite different (and frustrating) from high level languages such as C. You are giving up a large amount of control over code execution for ease of code production; whether matlab actually has implemented uint8 or if it is actually using a double and truncating it, you do not know. A bitwise operation on a native uint8 is extremely fast, but to convert from float to uint8, perform bitwise operation, and convert back is slow. (Historically, Matlab used doubles for everything and only rounded according to what 'type' you specified)
Even now, octave 4.0.3 has a compiled bitshift function that, for bitshift(ones('uint32'),-32), results in it wrapping back to 1. BRILLIANT! VHLLs place you at the mercy of the language; it isn't about how terse or how verbose you write the code, it's how the blasted language decides to interpret it and execute machine level code. So instead of shifting, uint32(floor(ones / (2^32))) is actually FASTER and more accurate.