Nibble shuffling with x64 SIMD - x86-64

I'm aware of byte shuffling instructions, but I'd like to do the same with nibbles (4-bit values), concretely I'd like to shuffle 16 nibbles in a 64-bit word. My shuffling indices are also stored as 16 nibbles. What's the most efficient implementation of this?

Arbitrary shuffles with a control vector that has to be stored this way? Ugh, hard to work with. I guess you'd have to unpack both to feed SSSE3 pshufb and then re-pack that result.
Probably just punpcklbw against a right-shifted copy, then AND mask to keep only the low 4 bits in each byte. Then pshufb.
Sometimes an odd/even split is easier than widening each element (so bits just stay within their original byte or word). In this case, if we could change your nibble index numbering, punpcklqdq could put the odd or even nibbles in the high half, ready to bring them back down and OR.
But without doing that, re-packing is a separate problem. I guess you'd combine adjacent pairs of bytes into the low byte of each word, perhaps with pmaddubsw if throughput is more important than latency. Then you can packuswb (against zero or itself) or pshufb (with a constant control vector).
If you were doing multiple such shuffles, you could pack two vectors down to one, to store with movhps / movq. Using AVX2, it might be possible to have all the other instructions working on two independent shuffles in the two 128-bit lanes.
// UNTESTED, requires only SSSE3
#include <stdint.h>
#include <immintrin.h>

uint64_t shuffle_nibbles(uint64_t data, uint64_t control)
{
    __m128i vd = _mm_cvtsi64_si128(data);             // movq
    __m128i vd_hi = _mm_srli_epi32(vd, 4);            // x86 doesn't have a SIMD byte shift
    vd = _mm_unpacklo_epi8(vd, vd_hi);                // every nibble at the bottom of a byte, with high garbage
    vd = _mm_and_si128(vd, _mm_set1_epi8(0x0f));      // clear high garbage for later merging

    __m128i vc = _mm_cvtsi64_si128(control);
    __m128i vc_hi = _mm_srli_epi32(vc, 4);
    vc = _mm_unpacklo_epi8(vc, vc_hi);
    vc = _mm_and_si128(vc, _mm_set1_epi8(0x0f));      // make sure high bit is clear, else pshufb zeros that element.
    // AVX-512VBMI vpermb doesn't have that problem, if you have it available

    vd = _mm_shuffle_epi8(vd, vc);

    // left-hand input is the unsigned one, right hand is treated as signed bytes.
    vd = _mm_maddubs_epi16(vd, _mm_set1_epi16(0x1001));  // hi nibbles << 4 (*= 0x10), lo nibbles *= 1.
    // vd has nibbles merged into bytes, but interleaved with zero bytes

    vd = _mm_packus_epi16(vd, vd);  // duplicate vd into low & high halves.
    // Pack against _mm_setzero_si128() if you're not just going to movq into memory or a GPR
    // and you want the high half of the vector to be zero.
    return _mm_cvtsi128_si64(vd);
}
Masking the data with 0x0f ahead of the shuffle (instead of after) allows more ILP on CPUs with two shuffle units, at least if the uint64_t values were already in vector registers, or if the data and control values are coming from memory so both can be loaded in the same cycle. If they're coming from GPRs, 1/clock throughput for vmovq xmm, reg means there's a resource conflict between the dep chains, so they can't both start in the same cycle. But since the data might be ready before the control, masking early keeps it off the critical path for control->output latency.
If latency is a bottleneck instead of the usual throughput, consider replacing pmaddubsw with a right-shift by 4, por, and AND/pack. Or pshufb to pack while ignoring garbage in odd bytes. Since you'd need another constant anyway, you might as well make it a pshufb constant instead of an AND mask.
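As a concrete illustration, here's an untested sketch of my own (not from the answer above) of that lower-latency repack, assuming vd is the post-pshufb vector with the high nibble of every byte already zeroed as in the function above:
#include <immintrin.h>

// UNTESTED sketch: merge 16 nibbles (one per byte, high nibbles zero) into 8 bytes in
// the low half of the vector, using shift + por + pshufb instead of pmaddubsw + packuswb.
static inline __m128i repack_nibbles_lowlat(__m128i vd)
{
    __m128i hi = _mm_srli_epi16(vd, 4);   // each odd byte's nibble lands in bits 4..7 of the even byte below it
    vd = _mm_or_si128(vd, hi);            // even bytes now hold the merged result bytes; odd bytes hold garbage
    // gather the even bytes; control bytes with the high bit set (-128 = 0x80) zero the upper half
    const __m128i pack = _mm_setr_epi8(0, 2, 4, 6, 8, 10, 12, 14,
                                       -128, -128, -128, -128, -128, -128, -128, -128);
    return _mm_shuffle_epi8(vd, pack);
}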
If you had AVX-512, a shift and bit-blend with vpternlogd could avoid needing to mask the data before shuffling, and vpermb instead of vpshufb would avoid needing to mask the control, so you'd avoid the set1_epi8(0x0f) constant entirely.
clang's shuffle optimizer didn't spot anything, just compiling it as-written like GCC does (https://godbolt.org/z/xz7TTbM1d), even with -march=sapphirerapids. Not spotting that it could use vpermb instead of vpand / vpshufb.
shuffle_nibbles(unsigned long, unsigned long):
vmovq xmm0, rdi
vpsrld xmm1, xmm0, 4
vpunpcklbw xmm0, xmm0, xmm1 # xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1],xmm0[2],xmm1[2],xmm0[3],xmm1[3],xmm0[4],xmm1[4],xmm0[5],xmm1[5],xmm0[6],xmm1[6],xmm0[7],xmm1[7]
vmovq xmm1, rsi
vpsrld xmm2, xmm1, 4
vpunpcklbw xmm1, xmm1, xmm2 # xmm1 = xmm1[0],xmm2[0],xmm1[1],xmm2[1],xmm1[2],xmm2[2],xmm1[3],xmm2[3],xmm1[4],xmm2[4],xmm1[5],xmm2[5],xmm1[6],xmm2[6],xmm1[7],xmm2[7]
vmovdqa xmm2, xmmword ptr [rip + .LCPI0_0] # xmm2 = [15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15]
vpand xmm0, xmm0, xmm2
vpand xmm1, xmm1, xmm2
vpshufb xmm0, xmm0, xmm1
vpmaddubsw xmm0, xmm0, xmmword ptr [rip + .LCPI0_1]
vpackuswb xmm0, xmm0, xmm0
vmovq rax, xmm0
ret
(Without AVX, it requires 2 extra movdqa register-copy instructions.)

I came across this problem today. In AVX-512 you can use vpmultishiftqb (1), an amusing instruction available in Ice Lake and after (and apparently in Zen 4, according to Wikipedia), to shuffle nibbles much more quickly. Its power lies in its ability to permute bytes in an unaligned fashion: for each of the eight control bytes in a 64-bit element, it selects an unaligned 8-bit chunk from the corresponding 64-bit data element, starting at the bit offset given by that control byte. Below is an implementation.
#include <immintrin.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

// Convention: (a & (0xf << (4 * i))) >> (4 * i) is the ith nibble of a
// (i.e., lowest-significant is 0)
uint64_t shuffle_nibbles(uint64_t data, uint64_t indices) {
#if defined(__AVX512VBMI__) && defined(__AVX512VL__)
    // If your data is already in vectors, then this method also works in parallel
    const __m128i lo_nibble_msk = _mm_set1_epi8(0x0f);

    __m128i v_data = _mm_cvtsi64_si128(data);
    __m128i v_indices = _mm_cvtsi64_si128(indices);

    __m128i indices_lo = _mm_and_si128(lo_nibble_msk, v_indices);
    __m128i indices_hi = _mm_andnot_si128(lo_nibble_msk, v_indices);
    indices_lo = _mm_slli_epi32(indices_lo, 2);
    indices_hi = _mm_srli_epi32(indices_hi, 2);

    // Look up unaligned bytes
    __m128i shuffled_hi = _mm_multishift_epi64_epi8(indices_hi, v_data);
    __m128i shuffled_lo = _mm_multishift_epi64_epi8(indices_lo, v_data);
    shuffled_hi = _mm_slli_epi32(shuffled_hi, 4);

    // msk ? lo : hi
    __m128i shuffled = _mm_ternarylogic_epi32(lo_nibble_msk, shuffled_lo, shuffled_hi, 202);
    return _mm_cvtsi128_si64(shuffled);
#else
    // Fallback scalar implementation (prefer Peter Cordes's SSSE3 solution above; this is just an example)
    uint64_t result = 0;
    for (int i = 0; i < 16; ++i) {
        indices = (indices >> 60) + (indices << 4);  // rotate left by one nibble
        int idx = indices & 0xf;
        result <<= 4;
        result |= (data >> (4 * idx)) & 0xf;
    }
    return result;
#endif
}

int main() {
    // 0xaa025411fe034102
    uint64_t r1 = shuffle_nibbles(0xfedcba9876543210, 0xaa025411fe034102);
    // 0x55fdabee01fcbefd
    uint64_t r2 = shuffle_nibbles(0x0123456789abcdef, 0xaa025411fe034102);
    // 0xaaaa00002222aaaa
    uint64_t r3 = shuffle_nibbles(0xaa025411fe034102, 0xeeee11110000ffff);

    printf("0x%" PRIx64 "\n", r1);
    printf("0x%" PRIx64 "\n", r2);
    printf("0x%" PRIx64 "\n", r3);
}
Clang yields (2):
.LCPI0_0:
.zero 16,60
shuffle_nibbles(unsigned long, unsigned long):
vmovq xmm0, rdi
vmovq xmm1, rsi
vpslld xmm2, xmm1, 2
vpsrld xmm1, xmm1, 2
vmovdqa xmm3, xmmword ptr [rip + .LCPI0_0] # xmm3 = [60,60,60,60,60,60,60,60,60,60,60,60,60,60,60,60]
vpand xmm1, xmm1, xmm3
vpmultishiftqb xmm1, xmm1, xmm0
vpand xmm2, xmm2, xmm3
vpmultishiftqb xmm0, xmm2, xmm0
vpslld xmm1, xmm1, 4
vpternlogd xmm1, xmm0, dword ptr [rip + .LCPI0_1]{1to4}, 216
vmovq rax, xmm1
In my case, I am shuffling nibbles within vectors of 64-bit elements; this method also avoids the need for widening. If your shuffle(s) are constant and you stay in vectors, this method reduces to a measly four instructions: 2x vpmultishiftqb, 1x vpslld, and 1x vpternlogd. Counting µops suggests a latency of 5 cycles and a throughput of one shuffle every 2 cycles, bottlenecked on shuffle µops, for 128- and 256-bit vectors; and one every 3 cycles for 512-bit vectors, due to the reduced number of execution units for the latter two instructions.
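To make that last point concrete, here is an untested sketch of my own (parameter names are placeholders) of the per-iteration work once the control preparation and the 0x0f constant are hoisted out of the loop for a constant shuffle:
#include <immintrin.h>

// UNTESTED sketch: constant-control nibble shuffle, assuming the two multishift controls
// (low-nibble indices << 2 and high-nibble indices >> 2) and the byte mask were precomputed.
static inline __m128i shuffle_nibbles_const_ctrl(__m128i v_data,
                                                 __m128i ctrl_lo_bits,   // (indices & 0x0f..0f) << 2
                                                 __m128i ctrl_hi_bits,   // (indices & 0xf0..f0) >> 2
                                                 __m128i lo_nibble_msk)  // _mm_set1_epi8(0x0f)
{
    __m128i lo = _mm_multishift_epi64_epi8(ctrl_lo_bits, v_data);
    __m128i hi = _mm_multishift_epi64_epi8(ctrl_hi_bits, v_data);
    hi = _mm_slli_epi32(hi, 4);
    return _mm_ternarylogic_epi32(lo_nibble_msk, lo, hi, 202); // msk ? lo : hi
}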

Related

Unaligned access performance on Intel x86 vs AMD x86 CPUs

I have implemented a simple linear-probing hash map with an array-of-structs memory layout. The struct holds the key, the value, and a flag indicating whether the entry is valid. By default, this struct gets padded by the compiler, as key and value are 64-bit integers but the flag only takes up a single byte (a bool). Hence, I have also tried packing the struct, at the cost of unaligned access. I was hoping to get better performance from the packed/unaligned version due to higher memory density (we do not waste bandwidth on transferring padding bytes).
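For reference, a quick sizeof check of the two layouts being compared (my own sketch in C; the actual benchmark structs are in the C++ code further down), on x86-64 with GCC/Clang:
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct Padded { uint64_t key; uint64_t value; bool is_valid; };
struct Packed { uint64_t key; uint64_t value; uint8_t is_valid; } __attribute__((__packed__));

static_assert(sizeof(struct Padded) == 24, "8 + 8 + 1, rounded up to 8-byte alignment");
static_assert(sizeof(struct Packed) == 17, "no tail padding when packed");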
When benchmarking this hash map on an Intel Xeon Gold 5220S CPU (single-threaded, gcc 11.2, -O3 and -march=native), I see no performance difference between the padded version and the unaligned version. However, on an AMD EPYC 7742 CPU (same setup), I do find a performance difference between unaligned and padded. Here is a graph depicting the results for hash map load factors of 25 % and 50 %, with different successful query rates on the x axis (0, 25, 50, 75, 100): As you can see, on Intel the grey and blue (circle and square) lines almost overlap; the benefit of struct packing is marginal. On AMD, however, the line representing unaligned/packed structs is consistently higher, i.e., we get more throughput.
In order to investigate this, I tried to build a smaller microbenchmark. In this microbenchmark, we perform a similar workload, but without the hash map find logic (i.e., we just pick random indices in the array and advance a little there). Please find the benchmark below:
#include <atomic>
#include <chrono>
#include <cstdint>
#include <iostream>
#include <random>
#include <vector>

void ClobberMemory() { std::atomic_signal_fence(std::memory_order_acq_rel); }

template <typename T>
void doNotOptimize(T const& val) {
  asm volatile("" : : "r,m"(val) : "memory");
}

struct PaddedStruct {
  uint64_t key;
  uint64_t value;
  bool is_valid;

  PaddedStruct() { reset(); }

  void reset() {
    key = uint64_t{};
    value = uint64_t{};
    is_valid = 0;
  }
};

struct PackedStruct {
  uint64_t key;
  uint64_t value;
  uint8_t is_valid;

  PackedStruct() { reset(); }

  void reset() {
    key = uint64_t{};
    value = uint64_t{};
    is_valid = 0;
  }
} __attribute__((__packed__));

int main() {
  const uint64_t size = 134217728;
  uint16_t repetitions = 0;
  uint16_t advancement = 0;

  std::cin >> repetitions;
  std::cout << "Got " << repetitions << std::endl;
  std::cin >> advancement;
  std::cout << "Got " << advancement << std::endl;
  std::cout << "Initializing." << std::endl;

  std::vector<PaddedStruct> padded(size);
  std::vector<PackedStruct> unaligned(size);
  std::vector<uint64_t> queries(size);

  // Initialize the structs with random values + prefault
  std::random_device rd;
  std::mt19937 gen{rd()};
  std::uniform_int_distribution<uint64_t> dist{0, 0xDEADBEEF};
  std::uniform_int_distribution<uint64_t> dist2{0, size - advancement - 1};

  for (uint64_t i = 0; i < padded.size(); ++i) {
    padded[i].key = dist(gen);
    padded[i].value = dist(gen);
    padded[i].is_valid = 1;
  }

  for (uint64_t i = 0; i < unaligned.size(); ++i) {
    unaligned[i].key = padded[i].key;
    unaligned[i].value = padded[i].value;
    unaligned[i].is_valid = 1;
  }

  for (uint64_t i = 0; i < unaligned.size(); ++i) {
    queries[i] = dist2(gen);
  }

  std::cout << "Running benchmark." << std::endl;

  ClobberMemory();
  auto start_padded = std::chrono::high_resolution_clock::now();
  PaddedStruct* padded_ptr = nullptr;
  uint64_t sum = 0;
  for (uint16_t j = 0; j < repetitions; j++) {
    for (const uint64_t& query : queries) {
      for (uint16_t i = 0; i < advancement; i++) {
        padded_ptr = &padded[query + i];
        if (padded_ptr->is_valid) [[likely]] {
          sum += padded_ptr->value;
        }
      }
      doNotOptimize(sum);
    }
  }
  ClobberMemory();
  auto end_padded = std::chrono::high_resolution_clock::now();
  uint64_t padded_runtime = static_cast<uint64_t>(std::chrono::duration_cast<std::chrono::milliseconds>(end_padded - start_padded).count());
  std::cout << "Padded Runtime (ms): " << padded_runtime << " (sum = " << sum << ")" << std::endl;  // print sum to avoid that it gets optimized out

  ClobberMemory();
  auto start_unaligned = std::chrono::high_resolution_clock::now();
  uint64_t sum2 = 0;
  PackedStruct* packed_ptr = nullptr;
  for (uint16_t j = 0; j < repetitions; j++) {
    for (const uint64_t& query : queries) {
      for (uint16_t i = 0; i < advancement; i++) {
        packed_ptr = &unaligned[query + i];
        if (packed_ptr->is_valid) [[likely]] {
          sum2 += packed_ptr->value;
        }
      }
      doNotOptimize(sum2);
    }
  }
  ClobberMemory();
  auto end_unaligned = std::chrono::high_resolution_clock::now();
  uint64_t unaligned_runtime = static_cast<uint64_t>(std::chrono::duration_cast<std::chrono::milliseconds>(end_unaligned - start_unaligned).count());
  std::cout << "Unaligned Runtime (ms): " << unaligned_runtime << " (sum = " << sum2 << ")" << std::endl;
}
When running the benchmark, I pick repetitions = 3 and advancement = 5, i.e., after compiling and running it, you enter 3 (and press enter) and then enter 5 (and press enter). I updated the source code to (a) avoid loop unrolling by the compiler, which happened when repetitions/advancement were hardcoded, and (b) switch to pointers into the vector, as that more closely resembles what the hash map is doing.
On the Intel CPU, I get:
Padded Runtime (ms): 13204
Unaligned Runtime (ms): 12185
On the AMD CPU, I get:
Padded Runtime (ms): 28432
Unaligned Runtime (ms): 22926
So while in this microbenchmark Intel still benefits a little from the unaligned access, for the AMD CPU both the absolute and the relative improvement are higher. I cannot explain this. In general, from what I've learned from relevant SO threads, unaligned access to a single member is just as expensive as aligned access, as long as it stays within a single cache line (1). Also in (1), a reference to (2) is given, which claims that the cache fetch width can differ from the cache line size. However, except for Linus Torvalds' mail, I could not find any other documentation of cache fetch widths in processors, and especially not for my two concrete CPUs, to figure out whether that might have something to do with this.
Does anybody have an idea why the AMD CPU benefits much more from the struct packing? If it is about reduced memory bandwidth consumption, I should be able to see the effects on both CPUs. And if the bandwidth usage is similar, I do not understand what might be causing the differences here.
Thank you so much.
(1) Relevant SO thread: How can I accurately benchmark unaligned access speed on x86_64?
(2) https://www.realworldtech.com/forum/?threadid=168200&curpostid=168779
The L1 Data Cache fetch width on the Intel Xeon Gold 5220S (and all the other Skylake/CascadeLake Xeon processors) is up to 64 naturally-aligned Bytes per cycle per load.
The core can execute two loads per cycle for any combination of size and alignment that does not cross a cacheline boundary. I have not tested all the combinations on the SKX/CLX processors, but on Haswell/Broadwell, throughput was reduced to one load per cycle whenever a load crossed a cacheline boundary, and I would assume that SKX/CLX are similar. This can be viewed as a necessary feature rather than a "penalty" -- a line-splitting load might need to use both ports to load a pair of adjacent lines, then combine the requested portions of the lines into a payload for the target register.
Loads that cross page boundaries have a larger performance penalty, but to measure it you have to be very careful to understand and control the locations of the page table entries for the two pages: DTLB, STLB, in the caches, or in main memory. My recollection is that the most common case is pretty fast -- partly because the "Next Page Prefetcher" is pretty good at pre-loading the PTE entry for the next page into the TLB before a sequence of loads gets to the end of the first page. The only case that is painfully slow is for stores that straddle a page boundary, and the Intel compiler works very hard to avoid this case.
I have not looked at the sample code in detail, but if I were performing this analysis, I would be careful to pin the processor frequency, measure the instruction and cycle counts, and compute the average number of instructions and cycles per update. (I usually set the core frequency to the nominal (TSC) frequency just to make the numbers easier to work with.) For the naturally-aligned cases, it should be pretty easy to look at the assembly code and estimate what the cycle counts should be. If the measurements are similar to observations for that case, then you can begin looking at the overhead of unaligned accesses in reference to a more reliable understanding of the baseline.
Hardware performance counters can be valuable for this case as well, particularly the DTLB_LOAD_MISSES events and the L1D.REPLACEMENT event. It only takes a few high-latency TLB miss or L1D miss events to skew the averages.
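For example, on Linux something like the following could collect those counters (the event names are my guesses for Skylake-generation cores and vary by microarchitecture; the binary name is a placeholder):
perf stat -e cycles,instructions,l1d.replacement,dtlb_load_misses.miss_causes_a_walk ./microbench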
The number of cache-line accesses when using 24-byte data structures may be the same as when using 17-byte data structures.
Please see this blog post: https://lemire.me/blog/2022/06/06/data-structure-size-and-cache-line-accesses/
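A quick back-of-the-envelope check of that point (my own sketch, not taken from the linked post): count how many elements of a densely packed array straddle a 64-byte line boundary, for the two element sizes from the question.
#include <stdio.h>

// Fraction of the first n contiguously packed elements of the given size whose bytes
// straddle a 64-byte cache-line boundary (the array is assumed to start on a line boundary).
static double straddle_fraction(unsigned long elem_size, unsigned long n)
{
    unsigned long straddles = 0;
    for (unsigned long i = 0; i < n; i++) {
        unsigned long start = i * elem_size;
        unsigned long end = start + elem_size - 1;
        if (start / 64 != end / 64)
            straddles++;
    }
    return (double)straddles / (double)n;
}

int main(void)
{
    printf("24-byte elements: %.3f straddle a line\n", straddle_fraction(24, 1 << 20));
    printf("17-byte elements: %.3f straddle a line\n", straddle_fraction(17, 1 << 20));
}
Both fractions come out to 0.250 here, which is the point being made: the smaller element does not automatically mean fewer line-crossing accesses.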

Understanding CRC32 value as division remainder

I'm struggling with understanding the CRC algorithm. I've been reading this tutorial and, if I got it correctly, a CRC value is just the remainder of a division where the message serves as the dividend and the divisor is a predefined value, carried out in a special kind of polynomial arithmetic. It looked quite simple, so I tried implementing CRC-32:
public static uint Crc32Naive(byte[] bytes)
{
    uint poly = 0x04c11db7; // (Poly)
    uint crc = 0xffffffff;  // (Init)

    foreach (var it in bytes)
    {
        var b = (uint)it;
        for (var i = 0; i < 8; ++i)
        {
            var prevcrc = crc;

            // load LSB from current byte into LSB of crc (RefIn)
            crc = (crc << 1) | (b & 1);
            b >>= 1;

            // subtract polynomial if we've just popped out 1
            if ((prevcrc & 0x80000000) != 0)
                crc ^= poly;
        }
    }
    return Reverse(crc ^ 0xffffffff); // (XorOut) (RefOut)
}
(where the Reverse function reverses the bit order)
It is supposed to be analogous to following method of division (with some additional adjustments):
            1100001010
       _______________
10011 ) 11010110110000
        10011,,.,,....
        -----,,.,,....
         10011,.,,....
         10011,.,,....
         -----,.,,....
          00001.,,....
          00000.,,....
          -----.,,....
           00010,,....
           00000,,....
           -----,,....
            00101,....
            00000,....
            -----,....
             01011....
             00000....
             -----....
              10110...
              10011...
              -----...
               01010..
               00000..
               -----..
                10100.
                10011.
                -----.
                 01110
                 00000
                 -----
                  1110 = Remainder
For 0x00 the function returns 0xd202ef8d, which is correct, but for 0x01 it returns 0xd302ef8d instead of 0xa505df1b (I've been using this page to verify my results).
The result from my implementation makes more sense to me: incrementing the dividend by 1 should only change the remainder by 1, right? But it turns out that the result should look completely different. So apparently I am missing something obvious. What is it? How can changing the least significant digit of the dividend influence the result this much?
This is an example of a left shifting CRC that emulates division, with the CRC initialized = 0, and no complementing or reversing of the crc. The example code is emulating a division where 4 bytes of zeroes are appended to bytes[] ({bytes[],0,0,0,0} is the dividend, the divisor is 0x104c11db7, the quotient is not used, and the remainder is the CRC).
public static uint Crc32Naive(byte[] bytes)
{
    uint poly = 0x04c11db7; // (Poly is actually 0x104c11db7)
    uint crc = 0;           // (Init)

    foreach (var it in bytes)
    {
        crc ^= ((uint)it) << 24;    // xor next byte to upper 8 bits of crc
        for (var i = 0; i < 8; ++i) // cycle the crc 8 times
        {
            var prevcrc = crc;
            crc = (crc << 1);

            // subtract polynomial if we've just popped out 1
            if ((prevcrc & 0x80000000) != 0)
                crc ^= poly;
        }
    }
    return crc;
}
It's common to initialize the CRC to something other than zero, but it's not that common to post-complement the CRC, and I'm not aware of any CRC that does a post bit reversal of the CRC.
Another variation of CRC is one that right-shifts, normally used to emulate hardware where data is sent in bytes least significant bit first.
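For completeness, here is a small C sketch of that right-shifting (reflected) form (my addition, not part of the answer above), using the reversed polynomial 0xEDB88320 together with the usual init and final XOR; it should reproduce the 0xd202ef8d and 0xa505df1b values from the page linked in the question:
#include <stddef.h>
#include <stdint.h>

// Bit-at-a-time reflected (right-shifting) CRC-32: data enters at the low end of the
// register, least significant bit first.
uint32_t crc32_reflected(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFF;                  // (Init)
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int k = 0; k < 8; k++)
            crc = (crc & 1) ? (crc >> 1) ^ 0xEDB88320 : crc >> 1;
    }
    return crc ^ 0xFFFFFFFF;                    // (XorOut)
}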

CRC-32 algorithm from HDL to software

I implemented a Galois Linear-Feedback Shift-Register in Verilog (and also in MATLAB, mainly to emulate the HDL design). It's been working great, and as of now I use MATLAB to calculate CRC-32 fields and then include them in my HDL simulations to verify that a data packet has arrived correctly (padding the data with the CRC-32), which produces good results.
The thing is, I want to be able to calculate the CRC-32 I've implemented in software, because I'll be using a Raspberry Pi to input data through GPIO into my FPGA, and I haven't been able to do so. I've tried this online calculator, using the same parameters, but I never get it to yield the same result.
This is the MATLAB code I use to calculate my CRC-32:
N = 74*16;
data = [round(rand(1,N)) zeros(1,32)];

lfsr = ones(1,32);
next_lfsr = zeros(1,32);

for i = 1:length(data)
    next_lfsr(1) = lfsr(2);
    next_lfsr(2) = lfsr(3);
    next_lfsr(3) = lfsr(4);
    next_lfsr(4) = lfsr(5);
    next_lfsr(5) = lfsr(6);
    next_lfsr(6) = xor(lfsr(7),lfsr(1));
    next_lfsr(7) = lfsr(8);
    next_lfsr(8) = lfsr(9);
    next_lfsr(9) = xor(lfsr(10),lfsr(1));
    next_lfsr(10) = xor(lfsr(11),lfsr(1));
    next_lfsr(11) = lfsr(12);
    next_lfsr(12) = lfsr(13);
    next_lfsr(13) = lfsr(14);
    next_lfsr(14) = lfsr(15);
    next_lfsr(15) = lfsr(16);
    next_lfsr(16) = xor(lfsr(17), lfsr(1));
    next_lfsr(17) = lfsr(18);
    next_lfsr(18) = lfsr(19);
    next_lfsr(19) = lfsr(20);
    next_lfsr(20) = xor(lfsr(21),lfsr(1));
    next_lfsr(21) = xor(lfsr(22),lfsr(1));
    next_lfsr(22) = xor(lfsr(23),lfsr(1));
    next_lfsr(23) = lfsr(24);
    next_lfsr(24) = xor(lfsr(25), lfsr(1));
    next_lfsr(25) = xor(lfsr(26), lfsr(1));
    next_lfsr(26) = lfsr(27);
    next_lfsr(27) = xor(lfsr(28), lfsr(1));
    next_lfsr(28) = xor(lfsr(29), lfsr(1));
    next_lfsr(29) = lfsr(30);
    next_lfsr(30) = xor(lfsr(31), lfsr(1));
    next_lfsr(31) = xor(lfsr(32), lfsr(1));
    next_lfsr(32) = xor(data(i), lfsr(1));

    lfsr = next_lfsr;
end

crc32 = lfsr;
Note that I pad with 32 zeroes to calculate the CRC-32 in the first place (whatever's left in the LFSR at the end is my CRC-32); if I repeat the process with the zeroes replaced by this CRC-32, my LFSR ends up all zeroes, which means the verification passed.
The polynomial I'm using is the standard one for CRC-32: 04C11DB7. Note also that the order seems to be reversed, but that's just because it's mirrored so that the input enters at the MSB. This representation and the mirrored one give the same results for the same input; only the result is mirrored as well.
Any ideas would be of great help.
Thanks in advance
Your CRC is not a CRC. The last 32 bits fed in don't actually participate in the calculation, other than being exclusive-or'ed into the result. That is, if you replace the last 32 bits of data with zeros, do your calculation, and then exclusive-or the last 32 bits of data with the resulting "crc32", then you will get the same result.
So you will never get it to match another CRC calculation, since it isn't a CRC.
This code in C replicates your function, where the data bits come from the series of n bytes at p, least significant bit first, and the result is a 32-bit value:
unsigned long notacrc(void const *p, unsigned n) {
    unsigned char const *dat = p;
    unsigned long reg = 0xffffffff;
    while (n) {
        for (unsigned k = 0; k < 8; k++)
            reg = reg & 1 ? (reg >> 1) ^ 0xedb88320 : reg >> 1;
        reg ^= (unsigned long)*dat++ << 24;
        n--;
    }
    return reg;
}
You can immediately see that the last byte of data is simply exclusive-or'ed with the final register value. Less obvious is that the last four bytes are just exclusive-or'ed. This exactly equivalent version makes that evident:
unsigned long notacrc_xor(void const *p, unsigned n) {
    unsigned char const *dat = p;
    // initial register values
    unsigned long const init[] = {
        0xffffffff, 0x2dfd1072, 0xbe26ed00, 0x00be26ed, 0xdebb20e3};
    unsigned xor = n > 3 ? 4 : n;   // number of bytes merely xor'ed
    unsigned long reg = init[xor];
    while (n > xor) {
        reg ^= *dat++;
        for (unsigned k = 0; k < 8; k++)
            reg = reg & 1 ? (reg >> 1) ^ 0xedb88320 : reg >> 1;
        n--;
    }
    switch (n) {
        case 4:
            reg ^= *dat++;
        case 3:
            reg ^= (unsigned long)*dat++ << 8;
        case 2:
            reg ^= (unsigned long)*dat++ << 16;
        case 1:
            reg ^= (unsigned long)*dat++ << 24;
    }
    return reg;
}
There you can see that the last four bytes of the message, or all of the message if it is three or fewer bytes, is exclusive-or'ed with the final register value at the end.
An actual CRC must use all of the input data bits in determining when to exclusive-or the polynomial with the register. The inner part of that last function is what a CRC implementation looks like (though more efficient versions make use of pre-computed tables to process a byte or more at a time). Here is a function that computes an actual CRC:
unsigned long crc32_jam(void const *p, unsigned n) {
    unsigned char const *dat = p;
    unsigned long reg = 0xffffffff;
    while (n) {
        reg ^= *dat++;
        for (unsigned k = 0; k < 8; k++)
            reg = reg & 1 ? (reg >> 1) ^ 0xedb88320 : reg >> 1;
        n--;
    }
    return reg;
}
That one is called crc32_jam because it implements a particular CRC called "JAMCRC". That CRC is the closest to what you attempted to implement.
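As a footnote to the "pre-computed tables" remark above, an untested sketch of my own of the byte-at-a-time table version of the same crc32_jam:
#include <stddef.h>
#include <stdint.h>

static uint32_t crc_table[256];

// Build the table once: crc_table[i] is the register after running the bit-at-a-time
// loop on the single byte i with a zero initial register.
static void crc32_jam_init_table(void)
{
    for (uint32_t i = 0; i < 256; i++) {
        uint32_t reg = i;
        for (int k = 0; k < 8; k++)
            reg = reg & 1 ? (reg >> 1) ^ 0xedb88320 : reg >> 1;
        crc_table[i] = reg;
    }
}

// Same result as crc32_jam() above, but one table lookup per input byte.
static uint32_t crc32_jam_table(void const *p, size_t n)
{
    unsigned char const *dat = p;
    uint32_t reg = 0xffffffff;
    while (n--)
        reg = (reg >> 8) ^ crc_table[(reg ^ *dat++) & 0xff];
    return reg;
}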
If you want to use a real CRC, you will need to update your Verilog implementation.

How many bits do I need to store A*B + C?

I was wondering about this:
If A and B are 16-bit numbers and C is 8-bit, how many bits would I need to store the result of A*B + C? 32 or 33?
And what if C were a 16-bit number? What then?
I would appreciate if I got answers with an explanation of the hows and whys.
Why don't you just take the maximum value for each register, and check the result?
If all registers are unsigned:
0xFFFF * 0xFFFF + 0xFF = 0xFFFE0100 // 32 bits are enough
0xFFFF * 0xFFFF + 0xFFFF = 0xFFFF0000 // 32 bits are enough
If all registers are signed, the extreme case is 0x8000 * 0x8000, i.e. (-32768) * (-32768) = 0x40000000, since negative * negative is positive. Adding register C only moves the result by a comparatively tiny amount, so you would still need no more than 32 bits to store it.
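A more general way to see it (my addition, not part of the original answer): for unsigned n-bit A and B, and C of at most n bits, the largest possible result is (2^n - 1)^2 + (2^n - 1) = (2^n - 1) * 2^n = 2^(2n) - 2^n < 2^(2n), so the sum always fits in 2n bits; with n = 16 that is exactly the 32 bits computed above.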

Which data structure should I use for bit stuffing?

I am trying to implement bitstuffing for a project I am working on, namely a simple software AFSK modem. The simplified protocol looks something like this:
0111 1110 # burst sequence
0111 1110 # 16 times 0b0111_1110
...
0111 1110
...
... # 80 bit header (CRC, frame counter, etc.)
...
0111 1110 # header delimiter
...
... # data
...
0111 1110 # end-of-frame sequence
Now I need to find the reserved sequence 0111 1110 in the received data and therefore must ensure that neither the header nor the data contains six consecutive 1s. This can be done by bit stuffing, e.g. inserting a zero after every sequence of five 1s:
11111111
converts to
111110111
and
11111000
converts to
111110000
If I want to implement this efficiently, I guess I should not use arrays of 1s and 0s, where I have to convert the data bytes to 1s and 0s, populate an array, etc. But bitfields of static size do not seem to fit either, because the length of the content is variable due to the bit stuffing.
Which data structure can I use to do bit stuffing more efficiently?
I just saw this question now and seeing that it is unanswered and not deleted I'll go ahead and answer. It might help others who see this question and also provide closure.
Bit stuffing: here the maximum allowed run of consecutive 1s is 5, so after every run of five 1s a 0 must be inserted.
Here is the C program that does that:
#include <stdio.h>

typedef unsigned long long int ulli;

int main()
{
    ulli buf = 0x0fffff01,                       // data to be stuffed
         temp2 = 1ull << ((sizeof(ulli)*8)-1),   // mask to stuff data
         temp3 = 0;                              // temporary
    int count = 0;                               // continuous 1s indicator

    while(temp2)
    {
        if((buf & temp2) && count <= 5) // enter the block if the bit is `1` and if count <= 5
        {
            count++;
            if(count == 5)
            {
                temp3 = buf & (~(temp2 - 1ull)); // keep only the MS bits, from the current position up
                temp3 <<= 1ull;                  // shift 1 bit to accommodate the `0`
                temp3 |= buf & ((temp2) - 1);    // add back the LS bits of the original, producing stuffed data
                buf = temp3;
                count = 0;                       // reset count
                printf("%llx\n", temp3);         // debug only
            }
        }
        else
        {
            count = 0; // this was what took 95% of my debug time. i had not put this else clause :-)
        }
        temp2 >>= 1; // move on to next bit.
    }
    printf("ans = %llx", buf); // finally
}
The problem with this is that each stuffed bit shifts the upper bits further left, so if there are more than about 10 runs of 5 consecutive 1s, the data can overflow out of the 64-bit variable. It's better to divide the data into chunks, bit-stuff each chunk, and repeat.
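To address the data-structure part of the question directly, here is an untested sketch of my own (the names and the fixed 1024-byte buffer are arbitrary choices for the example): build the frame into a plain byte array MSB-first with a running bit position, inserting a 0 after every run of five 1s, so the variable output length is simply the final bit count rather than the size of a static bitfield.
#include <stdint.h>
#include <stdio.h>
#include <string.h>

struct bitstuffer {
    uint8_t buf[1024];  // output frame; fixed size only for the sake of the sketch
    size_t  bitpos;     // next free bit, counting from the MSB of buf[0]
    int     ones;       // length of the current run of 1 bits
};

static void put_bit_raw(struct bitstuffer *s, int bit)
{
    if (bit)
        s->buf[s->bitpos / 8] |= (uint8_t)(0x80u >> (s->bitpos % 8));
    s->bitpos++;
}

static void put_bit_stuffed(struct bitstuffer *s, int bit)
{
    put_bit_raw(s, bit);
    if (bit) {
        if (++s->ones == 5) {   // five 1s in a row: stuff a 0
            put_bit_raw(s, 0);
            s->ones = 0;
        }
    } else {
        s->ones = 0;
    }
}

int main(void)
{
    struct bitstuffer s;
    memset(&s, 0, sizeof s);

    uint8_t payload[] = { 0xFF, 0xF8 };  // 1111 1111 1111 1000
    for (size_t i = 0; i < sizeof payload; i++)
        for (int k = 7; k >= 0; k--)
            put_bit_stuffed(&s, (payload[i] >> k) & 1);

    // The 16 input bits become 18 output bits: 11111 0 11111 0 111 000
    printf("%zu bits: %02x %02x %02x\n", s.bitpos, s.buf[0], s.buf[1], s.buf[2]);
}
The same writer can emit the 0111 1110 flags through put_bit_raw so they are never stuffed, and a matching reader simply drops the 0 that follows any five consecutive 1s.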