Understanding CRC32 value as division remainder - hash

I'm struggling to understand the CRC algorithm. I've been reading this tutorial and, if I got it correctly, a CRC value is just the remainder of a division where the message serves as the dividend and the divisor is a predefined value, carried out in a special kind of polynomial arithmetic. It looked quite simple, so I tried implementing CRC-32:
public static uint Crc32Naive(byte[] bytes)
{
uint poly = 0x04c11db7; // (Poly)
uint crc = 0xffffffff; // (Init)
foreach (var it in bytes)
{
var b = (uint)it;
for (var i = 0; i < 8; ++i)
{
var prevcrc = crc;
// load LSB from current byte into LSB of crc (RefIn)
crc = (crc << 1) | (b & 1);
b >>= 1;
// subtract polynomial if we've just popped out 1
if ((prevcrc & 0x80000000) != 0)
crc ^= poly;
}
}
return Reverse(crc ^ 0xffffffff); // (XorOut) (RefOut)
}
(where the Reverse function reverses the bit order)
It is supposed to be analogous to the following method of division (with some additional adjustments):
            1100001010
       _______________
10011 ) 11010110110000
        10011,,.,,....
        -----,,.,,....
         10011,.,,....
         10011,.,,....
         -----,.,,....
          00001.,,....
          00000.,,....
          -----.,,....
           00010,,....
           00000,,....
           -----,,....
            00101,....
            00000,....
            -----,....
             01011....
             00000....
             -----....
              10110...
              10011...
              -----...
               01010..
               00000..
               -----..
                10100.
                10011.
                -----.
                 01110
                 00000
                 -----
                  1110 = Remainder
For the input 0x00 the function returns 0xd202ef8d, which is correct, but for 0x01 it returns 0xd302ef8d instead of 0xa505df1b (I've been using this page to verify my results).
The result from my implementation makes more sense to me: incrementing the dividend by 1 should only change the remainder by 1, right? But it turns out that the result should look completely different. So apparently I am missing something obvious. What is it? How can changing the least significant bit of the dividend influence the result this much?

This is an example of a left shifting CRC that emulates division, with the CRC initialized = 0, and no complementing or reversing of the crc. The example code is emulating a division where 4 bytes of zeroes are appended to bytes[] ({bytes[],0,0,0,0} is the dividend, the divisor is 0x104c11db7, the quotient is not used, and the remainder is the CRC).
public static uint Crc32Naive(byte[] bytes)
{
    uint poly = 0x04c11db7; // (Poly is actually 0x104c11db7)
    uint crc = 0; // (Init)
    foreach (var it in bytes)
    {
        crc ^= ((uint)it) << 24; // xor next byte to upper 8 bits of crc
        for (var i = 0; i < 8; ++i) // cycle the crc 8 times
        {
            var prevcrc = crc;
            crc = (crc << 1);
            // subtract polynomial if we've just popped out 1
            if ((prevcrc & 0x80000000) != 0)
                crc ^= poly;
        }
    }
    return crc;
}
It's common to initialize the CRC to something other than zero, but it's not that common to post-complement the CRC, and I'm not aware of any CRC that does a post bit reversal of the CRC.
Another variation of CRC is one that shifts right, normally used to emulate hardware where data is sent least significant bit first.
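To tie the left-shifting form back to the values in the question, here is a minimal sketch in C that keeps the left-shifting register and applies the four adjustments named in the question's comments (Init, RefIn, RefOut, XorOut). The names reverse_bits and crc32_msb_first are made up for this illustration and do not come from either post. The key difference from the question's code is that each message byte is xor'ed into the top of the register before the eight shift steps, so every message bit takes part in deciding when the polynomial is subtracted.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: reverse the low `width` bits of v. */
static uint32_t reverse_bits(uint32_t v, unsigned width)
{
    uint32_t r = 0;
    for (unsigned i = 0; i < width; i++) {
        r = (r << 1) | (v & 1u);
        v >>= 1;
    }
    return r;
}

/* Left-shifting CRC-32 with Init = 0xffffffff, RefIn, RefOut and
 * XorOut = 0xffffffff (parameter names as used in the question). */
uint32_t crc32_msb_first(const uint8_t *data, size_t len)
{
    const uint32_t poly = 0x04C11DB7u;  /* x^32 term is implicit */
    uint32_t crc = 0xFFFFFFFFu;         /* Init */

    for (size_t n = 0; n < len; n++) {
        /* RefIn: feed the byte least significant bit first. */
        crc ^= reverse_bits(data[n], 8) << 24;
        for (int i = 0; i < 8; i++) {
            uint32_t top = crc & 0x80000000u;
            crc <<= 1;
            if (top)
                crc ^= poly;            /* subtract the polynomial */
        }
    }
    /* RefOut, then XorOut. */
    return reverse_bits(crc, 32) ^ 0xFFFFFFFFu;
}
```

Under these assumptions the single bytes 0x00 and 0x01 give 0xd202ef8d and 0xa505df1b, the reference values quoted in the question.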


How to emulate *really simple* variable bit shifts with SSE?

I have two variable bit-shifting code fragments that I want to SSE-vectorize by some means:
1) a = 1 << b (where b = 0..7 exactly), i.e. 0/1/2/3/4/5/6/7 -> 1/2/4/8/16/32/64/128
2) a = 1 << (8 * b) (where b = 0..7 exactly), i.e. 0/1/2/3/4/5/6/7 -> 1/0x100/0x10000/etc
OK, I know that AMD's XOP VPSHLQ would do this, as would AVX2's VPSLLVQ. But my challenge here is whether this can be achieved on 'normal' (i.e. up to SSE4.2) SSE.
So, is there some funky SSE-family opcode sequence that will achieve the effect of either of these code fragments? These only need yield the listed output values for the specific input values (0-7).
Update: here's my attempt at 1), based on Peter Cordes' suggestion of using the floating point exponent to do simple variable bitshifting:
#include <stdint.h>
typedef union
{
int32_t i;
float f;
} uSpec;
void do_pow2(uint64_t *in_array, uint64_t *out_array, int num_loops)
{
uSpec u;
for (int i=0; i<num_loops; i++)
{
int32_t x = *(int32_t *)&in_array[i];
u.i = (127 + x) << 23;
int32_t r = (int32_t) u.f;
out_array[i] = r;
}
}
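A scalar sketch of the same exponent trick for both fragments, assuming IEEE-754 binary32 float and binary64 double. The function names are invented for this example, and memcpy replaces the union and the pointer cast purely to keep the type punning well defined. It only checks the arithmetic and does not settle how best to vectorize it.

```c
#include <stdint.h>
#include <string.h>

/* 1 << b for b in 0..7, built as the float 2^b.
 * Assumes IEEE-754 binary32: exponent bias 127, 23 mantissa bits. */
uint32_t pow2_via_float(uint32_t b)
{
    uint32_t bits = (127u + b) << 23;   /* sign 0, mantissa 0 */
    float f;
    memcpy(&f, &bits, sizeof f);
    return (uint32_t)f;                 /* exact: 2^b is an integer */
}

/* 1 << (8 * b) for b in 0..7, built as the double 2^(8b).
 * Assumes IEEE-754 binary64: exponent bias 1023, 52 mantissa bits. */
uint64_t pow256_via_double(uint32_t b)
{
    uint64_t bits = (uint64_t)(1023u + 8u * b) << 52;
    double d;
    memcpy(&d, &bits, sizeof d);
    return (uint64_t)d;                 /* exact up to 2^56 */
}
```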

EditorGUILayout.MaskField issue with large enums

I'm working on an input system that would allow the user to translate input mappings between different input devices and operating systems and potentially define their own.
I'm trying to create a MaskField for an editor window where the user can select from a list of RuntimePlatforms, but selecting individual values results in multiple values being selected.
Mainly for debugging I set it up to generate an equivalent enum RuntimePlatformFlags that it uses instead of RuntimePlatform:
[System.Flags]
public enum RuntimePlatformFlags: long
{
OSXEditor=(0<<0),
OSXPlayer=(0<<1),
WindowsPlayer=(0<<2),
OSXWebPlayer=(0<<3),
OSXDashboardPlayer=(0<<4),
WindowsWebPlayer=(0<<5),
WindowsEditor=(0<<6),
IPhonePlayer=(0<<7),
PS3=(0<<8),
XBOX360=(0<<9),
Android=(0<<10),
NaCl=(0<<11),
LinuxPlayer=(0<<12),
FlashPlayer=(0<<13),
LinuxEditor=(0<<14),
WebGLPlayer=(0<<15),
WSAPlayerX86=(0<<16),
MetroPlayerX86=(0<<17),
MetroPlayerX64=(0<<18),
WSAPlayerX64=(0<<19),
MetroPlayerARM=(0<<20),
WSAPlayerARM=(0<<21),
WP8Player=(0<<22),
BB10Player=(0<<23),
BlackBerryPlayer=(0<<24),
TizenPlayer=(0<<25),
PSP2=(0<<26),
PS4=(0<<27),
PSM=(0<<28),
XboxOne=(0<<29),
SamsungTVPlayer=(0<<30),
WiiU=(0<<31),
tvOS=(0<<32),
Switch=(0<<33),
Lumin=(0<<34),
BJM=(0<<35),
}
In this linked screenshot, only the first 4 options were selected. The integer next to "Platforms: " is the mask itself.
I'm not a bitwise wizard by a large margin, but my assumption is that this occurs because EditorGUILayout.MaskField returns a 32bit int value, and there are over 32 enum options. Are there any workarounds for this or is something else causing the issue?
The first thing I noticed is that all the values inside that enum are the same, because you are shifting 0 to the left. You can observe this by logging your values with this script.
// Shifts 0 to the left; the result is always 0, so this prints "0" 36 times.
for (int i = 0; i < 36; i++) {
    Debug.Log(System.Convert.ToString((0 << i), 2));
}
// Shifts a long 1 to the left, printing values up to 2^35.
// (With an int, 1 << i wraps around at i = 32, because C# only uses the low 5 bits of the shift count.)
for (int i = 0; i < 36; i++) {
    Debug.Log(System.Convert.ToString((1L << i), 2));
}
Declaring the enum as long is not enough on its own, because of how bit shifting works. Check out this example I found about the issue:
UInt32 x = ....;
UInt32 y = ....;
UInt64 result = (x << 32) + y;
The programmer intended to form a 64-bit value from two 32-bit ones by shifting 'x' by 32 bits and adding the most significant and the least significant parts. However, as 'x' is a 32-bit value at the moment when the shift operation is performed, shifting by 32 bits will be equivalent to shifting by 0 bits, which will lead to an incorrect result.
So the value being shifted must itself be a long, either through a literal suffix or a cast. Like this:
public enum RuntimePlatformFlags : long {
OSXEditor = (1 << 0),
OSXPlayer = (1 << 1),
WindowsPlayer = (1 << 2),
OSXWebPlayer = (1 << 3),
// With literals.
tvOS = (1L << 32),
Switch = (1L << 33),
// Or with casts.
Lumin = ((long)1 << 34),
BJM = ((long)1 << 35),
}
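The wrap-around can be reproduced outside Unity. C# uses only the low 5 bits of the shift count for a 32-bit operand and the low 6 bits for a 64-bit one. In C, shifting a 32-bit value by 32 or more is undefined, so this sketch applies the masking explicitly; both function names are made up for the illustration.

```c
#include <stdint.h>

/* Mimics C#'s `int << n`: only the low 5 bits of n are used. */
uint32_t shl32_csharp_style(uint32_t v, unsigned n)
{
    return v << (n & 31u);
}

/* Mimics C#'s `long << n`: only the low 6 bits of n are used. */
uint64_t shl64_csharp_style(uint64_t v, unsigned n)
{
    return v << (n & 63u);
}
```

With these, shifting 1 by 32 in the 32-bit version gives 1 again (the same value as shifting by 0), while the 64-bit version gives 0x100000000.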

CRC-32 algorithm from HDL to software

I implemented a Galois Linear-Feedback Shift-Register in Verilog (and also in MATLAB, mainly to emulate the HDL design). It's been working great, and as of now I use MATLAB to calculate CRC-32 fields, and then include them in my HDL simulations to verify a data packet has arrived correctly (padding data with CRC-32), which produces good results.
The thing is, I want to be able to calculate the CRC-32 I've implemented in software, because I'll be using a Raspberry Pi to input data through GPIO into my FPGA, and I haven't been able to do so. I've tried this online calculator, using the same parameters, but I never get the same result.
This is the MATLAB code I use to calculate my CRC-32:
N = 74*16;
data = [round(rand(1,N)) zeros(1,32)];
lfsr = ones(1,32);
next_lfsr = zeros(1,32);
for i = 1:length(data)
next_lfsr(1) = lfsr(2);
next_lfsr(2) = lfsr(3);
next_lfsr(3) = lfsr(4);
next_lfsr(4) = lfsr(5);
next_lfsr(5) = lfsr(6);
next_lfsr(6) = xor(lfsr(7),lfsr(1));
next_lfsr(7) = lfsr(8);
next_lfsr(8) = lfsr(9);
next_lfsr(9) = xor(lfsr(10),lfsr(1));
next_lfsr(10) = xor(lfsr(11),lfsr(1));
next_lfsr(11) = lfsr(12);
next_lfsr(12) = lfsr(13);
next_lfsr(13) = lfsr(14);
next_lfsr(14) = lfsr(15);
next_lfsr(15) = lfsr(16);
next_lfsr(16) = xor(lfsr(17), lfsr(1));
next_lfsr(17) = lfsr(18);
next_lfsr(18) = lfsr(19);
next_lfsr(19) = lfsr(20);
next_lfsr(20) = xor(lfsr(21),lfsr(1));
next_lfsr(21) = xor(lfsr(22),lfsr(1));
next_lfsr(22) = xor(lfsr(23),lfsr(1));
next_lfsr(23) = lfsr(24);
next_lfsr(24) = xor(lfsr(25), lfsr(1));
next_lfsr(25) = xor(lfsr(26), lfsr(1));
next_lfsr(26) = lfsr(27);
next_lfsr(27) = xor(lfsr(28), lfsr(1));
next_lfsr(28) = xor(lfsr(29), lfsr(1));
next_lfsr(29) = lfsr(30);
next_lfsr(30) = xor(lfsr(31), lfsr(1));
next_lfsr(31) = xor(lfsr(32), lfsr(1));
next_lfsr(32) = xor(data(i), lfsr(1));
lfsr = next_lfsr;
end
crc32 = lfsr;
See I use a 32-zeroes padding to calculate the CRC-32 in the first place (whatever's left in the LFSR at the end is my CRC-32, and if I do the same replacing the zeroes with this CRC-32, my LFSR becomes empty at the end too, which means the verification passed).
The polynomial I'm using is the standard for CRC-32: 04C11DB7. See also that the order seems to be reversed, but that's just because it's mirrored to have the input in the MSB. The results of using this representation and a mirrored one are the same when the input is the same, only the result will be also mirrored.
Any ideas would be of great help.
Thanks in advance
Your CRC is not a CRC. The last 32 bits fed in don't actually participate in the calculation, other than being exclusive-or'ed into the result. That is, if you replace the last 32 bits of data with zeros, do your calculation, and then exclusive-or the last 32 bits of data with the resulting "crc32", then you will get the same result.
So you will never get it to match another CRC calculation, since it isn't a CRC.
This code in C replicates your function, where the data bits come from the series of n bytes at p, least significant bit first, and the result is a 32-bit value:
unsigned long notacrc(void const *p, unsigned n) {
unsigned char const *dat = p;
unsigned long reg = 0xffffffff;
while (n) {
for (unsigned k = 0; k < 8; k++)
reg = reg & 1 ? (reg >> 1) ^ 0xedb88320 : reg >> 1;
reg ^= (unsigned long)*dat++ << 24;
n--;
}
return reg;
}
You can immediately see that the last byte of data is simply exclusive-or'ed with the final register value. Less obvious is that the last four bytes are just exclusive-or'ed. This exactly equivalent version makes that evident:
unsigned long notacrc_xor(void const *p, unsigned n) {
unsigned char const *dat = p;
// initial register values
unsigned long const init[] = {
0xffffffff, 0x2dfd1072, 0xbe26ed00, 0x00be26ed, 0xdebb20e3};
unsigned xor = n > 3 ? 4 : n; // number of bytes merely xor'ed
unsigned long reg = init[xor];
while (n > xor) {
reg ^= *dat++;
for (unsigned k = 0; k < 8; k++)
reg = reg & 1 ? (reg >> 1) ^ 0xedb88320 : reg >> 1;
n--;
}
switch (n) {
case 4:
reg ^= *dat++;
case 3:
reg ^= (unsigned long)*dat++ << 8;
case 2:
reg ^= (unsigned long)*dat++ << 16;
case 1:
reg ^= (unsigned long)*dat++ << 24;
}
return reg;
}
There you can see that the last four bytes of the message, or all of the message if it is three or fewer bytes, are simply exclusive-or'ed with the final register value at the end.
An actual CRC must use all of the input data bits in determining when to exclusive-or the polynomial with the register. The inner part of that last function is what a CRC implementation looks like (though more efficient versions make use of pre-computed tables to process a byte or more at a time). Here is a function that computes an actual CRC:
unsigned long crc32_jam(void const *p, unsigned n) {
unsigned char const *dat = p;
unsigned long reg = 0xffffffff;
while (n) {
reg ^= *dat++;
for (unsigned k = 0; k < 8; k++)
reg = reg & 1 ? (reg >> 1) ^ 0xedb88320 : reg >> 1;
n--;
}
return reg;
}
That one is called crc32_jam because it implements a particular CRC called "JAMCRC". That CRC is the closest to what you attempted to implement.
If you want to use a real CRC, you will need to update your Verilog implementation.
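For the software side, a possible sketch of JAMCRC next to the common CRC-32, written with fixed-width types. The only difference between the two is a final complement. The names jamcrc and crc32_iso_hdlc are chosen for this example, and the check values are the well-known results for the ASCII string "123456789".

```c
#include <stddef.h>
#include <stdint.h>

/* JAMCRC: reflected CRC-32 polynomial, register initialized to
 * all ones, no final exclusive-or. */
uint32_t jamcrc(const void *p, size_t n)
{
    const unsigned char *dat = p;
    uint32_t reg = 0xFFFFFFFFu;
    while (n--) {
        reg ^= *dat++;
        for (int k = 0; k < 8; k++)
            reg = (reg & 1u) ? (reg >> 1) ^ 0xEDB88320u : reg >> 1;
    }
    return reg;
}

/* The CRC-32 used by zlib, PNG and Ethernet is JAMCRC with the
 * result complemented. */
uint32_t crc32_iso_hdlc(const void *p, size_t n)
{
    return jamcrc(p, n) ^ 0xFFFFFFFFu;
}
```

For "123456789" these give 0x340bc6d9 and 0xcbf43926 respectively, which is a quick way to confirm which variant an online calculator implements.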

Which data structure should I use for bit stuffing?

I am trying to implement bitstuffing for a project I am working on, namely a simple software AFSK modem. The simplified protocol looks something like this:
0111 1110 # burst sequence
0111 1110 # 16 times 0b0111_1110
...
0111 1110
...
... # 80 bit header (CRC, frame counter, etc.)
...
0111 1110 # header delimiter
...
... # data
...
0111 1110 # end-of-frame sequence
Now I need to find the reserved sequence 0111 1110 in the received data and therefore must ensure that neither the header nor the data contains six consecutive 1s. This can be done by bit stuffing, e.g. inserting a zero after every sequence of five 1s:
11111111
converts to
111110111
and
11111000
converts to
111110000
If I want to implement this efficiently I guess I should not use arrays of 1s and 0s, where I have to convert the data bytes to 1s and 0s, then populate an array etc. but bitfields of static size do not seem to fit either, because the length of the content is variable due to the bit stuffing.
Which data structure can I use to do bit stuffing more efficiently?
I just saw this question now and seeing that it is unanswered and not deleted I'll go ahead and answer. It might help others who see this question and also provide closure.
Bit stuffing: here the maximum contiguous run of 1s is 5, so a 0 must be inserted after every run of five 1s.
Here is the C program that does that:
#include <stdio.h>
typedef unsigned long long int ulli;
int main()
{
ulli buf = 0x0fffff01, // data to be stuffed
temp2= 1ull << ((sizeof(ulli)*8)-1), // mask to stuff data
temp3 = 0; // temporary
int count = 0; // continuous 1s indicator
while(temp2)
{
if((buf & temp2) && count <= 5) // enter the loop if the bit is `1` and if count <= 5
{
count++;
if(count == 5)
{
temp3 = buf & (~(temp2 - 1ull)); // keep the current bit and everything above it
temp3 <<= 1ull; // shift that part left by 1 bit to accommodate the `0`
temp3 |= buf & ((temp2) - 1); // add back the untouched LS bits, producing the stuffed data
buf = temp3;
count = 0; // reset count
printf("%llx\n",temp3); // debug only
}
}
else
{
count = 0; // this was what took 95% of my debug time. i had not put this else clause :-)
}
temp2 >>=1; // move on to next bit.
}
printf("ans = %llx",buf); // finally
}
The problem with this is that if there are more than 10 runs of five consecutive 1s, the stuffed result might overflow the 64-bit buffer. It's better to split the data into chunks, bit-stuff each chunk, and repeat.
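One way around the overflow is to write the stuffed bits into a byte array through a small append-only bit buffer, so the output can grow to whatever capacity the caller provides. This is only a sketch: the bitbuf type and all function names are hypothetical, and bits are taken most significant bit first within each byte, which is an assumption about the framing.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical append-only bit buffer over caller-provided bytes. */
typedef struct {
    uint8_t *bytes;
    size_t   cap_bits;  /* capacity in bits */
    size_t   nbits;     /* bits written so far */
} bitbuf;

void bitbuf_init(bitbuf *bb, uint8_t *storage, size_t nbytes)
{
    memset(storage, 0, nbytes);
    bb->bytes = storage;
    bb->cap_bits = nbytes * 8;
    bb->nbits = 0;
}

/* Append one bit, MSB first within each byte. Returns -1 when full. */
static int bitbuf_push(bitbuf *bb, unsigned bit)
{
    if (bb->nbits >= bb->cap_bits)
        return -1;
    if (bit)
        bb->bytes[bb->nbits / 8] |= (uint8_t)(0x80u >> (bb->nbits % 8));
    bb->nbits++;
    return 0;
}

/* Copy in_bits bits from `in` to `out`, inserting a 0 after every
 * run of five consecutive 1s. Returns -1 if `out` is too small. */
int bit_stuff(const uint8_t *in, size_t in_bits, bitbuf *out)
{
    unsigned run = 0;
    for (size_t i = 0; i < in_bits; i++) {
        unsigned bit = (in[i / 8] >> (7 - i % 8)) & 1u;
        if (bitbuf_push(out, bit))
            return -1;
        run = bit ? run + 1 : 0;
        if (run == 5) {
            if (bitbuf_push(out, 0))
                return -1;
            run = 0;
        }
    }
    return 0;
}
```

In the worst case (all 1s) the output is one fifth longer than the input, so sizing the storage at 6 bits of capacity for every 5 input bits, rounded up to whole bytes, is sufficient.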

Fastest way of bitwise AND between two arrays on iPhone?

I have two image blocks stored as 1D arrays and have to do the following bitwise AND operations among their elements.
int compare(unsigned char *a, int a_pitch,
unsigned char *b, int b_pitch, int a_lenx, int a_leny)
{
int overlap =0 ;
for(int y=0; y<a_leny; y++)
for(int x=0; x<a_lenx; x++)
{
if(a[x + y * a_pitch] & b[x+y*b_pitch])
overlap++ ;
}
return overlap ;
}
Actually, I have to do this job about 220,000 times, so it becomes very slow on iPhone devices.
How could I speed this up on iPhone?
I heard that NEON could be useful, but I'm not really familiar with it. In addition it seems that NEON doesn't have bitwise AND...
Option 1 - Work in the native width of your platform (it's faster to fetch 32-bits into a register and then do operations on that register than it is to fetch and compare data one byte at a time):
#include <stdint.h>

// Assumes a and b are 4-byte aligned and that a_lenx, a_pitch and b_pitch
// are all multiples of 4.
int compare(unsigned char *a, int a_pitch,
            unsigned char *b, int b_pitch, int a_lenx, int a_leny)
{
    int overlap = 0;
    uint32_t* a_int = (uint32_t*)a;
    uint32_t* b_int = (uint32_t*)b;
    // Only the horizontal sizes shrink: each 32-bit word holds 4 pixels of a row.
    int a_lenx_int = a_lenx / 4;
    int a_pitch_int = a_pitch / 4;
    int b_pitch_int = b_pitch / 4;
    for(int y=0; y<a_leny; y++)
        for(int x=0; x<a_lenx_int; x++)
        {
            // AND four pixel pairs at once, then test each byte lane.
            uint32_t both = a_int[x + y * a_pitch_int] & b_int[x + y * b_pitch_int];
            if (both & 0x000000FF)
                overlap++;
            if (both & 0x0000FF00)
                overlap++;
            if (both & 0x00FF0000)
                overlap++;
            if (both & 0xFF000000)
                overlap++;
        }
    return overlap;
}
Option 2 - Use a heuristic to get an approximate result using fewer calculations (a good approach if the absolute difference between 101 overlaps and 100 overlaps is not important to your application):
int compare(unsigned char *a, int a_pitch,
unsigned char *b, int b_pitch, int a_lenx, int a_leny)
{
int overlap =0 ;
for(int y=0; y<a_leny; y+= 10)
for(int x=0; x<a_lenx; x+= 10)
{
//we compare 1% of all the pixels, and use that as the result
if(a[x + y * a_pitch] & b[x+y*b_pitch])
overlap++ ;
}
return overlap * 100;
}
Option 3 - Rewrite your function in inline assembly code. You're on your own for this one.
Your code is Rambo for the CPU - its worst nightmare :
byte access. Like aroth mentioned, ARM is VERY slow reading bytes from memory
random access. Two absolutely unnecessary multiply/add operations in addition to the already steep performance penalty by its nature.
Simply put, everything is wrong that can be wrong.
Don't call me rude. Let me be your angel instead.
First, I'll provide you a working NEON version. Then an optimized C version showing you exactly what you did wrong.
Just give me some time. I have to go to bed right now, and I have an important meeting tomorrow.
Why don't you learn ARM assembly? It's much easier and useful than x86 assembly.
It will also improve your C programming capabilities by a huge step.
Strongly recommended
cya
==============================================================================
Ok, here is an optimized version written in C with ARM assembly in mind.
Please note that both the pitches AND a_lenx have to be multiples of 4. Otherwise, it won't work properly.
There isn't much room left for optimizations with ARM assembly upon this version. (NEON is a different story - coming soon)
Take a careful look at how to handle variable declarations, loop, memory access, and AND operations.
And make sure that this function runs in ARM mode and not Thumb for best results.
unsigned int compare(unsigned int *a, unsigned int a_pitch,
unsigned int *b, unsigned int b_pitch, unsigned int a_lenx, unsigned int a_leny)
{
unsigned int overlap =0;
unsigned int a_gap = (a_pitch - a_lenx)>>2;
unsigned int b_gap = (b_pitch - a_lenx)>>2;
unsigned int aval, bval, xcount;
do
{
xcount = (a_lenx>>2);
do
{
aval = *a++;
// ldr aval, [a], #4
bval = *b++;
// ldr bval, [b], #4
aval &= bval;
// and aval, aval, bval
if (aval & 0x000000ff) overlap += 1;
// tst aval, #0x000000ff
// addne overlap, overlap, #1
if (aval & 0x0000ff00) overlap += 1;
// tst aval, #0x0000ff00
// addne overlap, overlap, #1
if (aval & 0x00ff0000) overlap += 1;
// tst aval, #0x00ff0000
// addne overlap, overlap, #1
if (aval & 0xff000000) overlap += 1;
// tst aval, #0xff000000
// addne overlap, overlap, #1
} while (--xcount);
a += a_gap;
b += b_gap;
} while (--a_leny);
return overlap;
}
First of all, why the double loop? You can do it with a single loop and a couple of pointers.
Also, you don't need to calculate x+y*pitch for every single pixel; just increment two pointers by one. Incrementing by one is a lot faster than x+y*pitch.
Why exactly do you need to perform this operation? I would make sure there are no high-level optimizations/changes available before looking into a low-level solution like NEON.
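A sketch of the pointer-increment idea from the last answer, kept at byte granularity so it carries no alignment or multiple-of-4 requirements. The name compare_rows is invented for the example. It still loops over rows because the pitch may be larger than the row length; when both pitches equal a_lenx the two loops collapse into one.

```c
#include <stddef.h>

/* Count positions where both blocks have a non-zero AND, walking each
 * row with two pointers instead of recomputing x + y * pitch. */
int compare_rows(const unsigned char *a, size_t a_pitch,
                 const unsigned char *b, size_t b_pitch,
                 size_t lenx, size_t leny)
{
    int overlap = 0;
    for (size_t y = 0; y < leny; y++) {
        const unsigned char *pa = a;
        const unsigned char *pb = b;
        const unsigned char *end = pa + lenx;
        while (pa < end)
            overlap += (*pa++ & *pb++) != 0;
        a += a_pitch;   /* advance to the start of the next row */
        b += b_pitch;
    }
    return overlap;
}
```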