In my project, CRC32 is calculated very many times.
I have used software CRC32 calculation until now.
But I noticed there is CPU support in SSE4.2 and linux also provides the hardware CRC32 calculation function using the CPU instruction.
I use Intel Xeon E5-2650 CPU so I tried to calculate CRC32 by using the linux function.
But the result is different with software CRC32 function that I used.
I used init value 127 in the both. Software CRC32 function I used is below
static uint32_t crc32_tab[] = {
0x00000000, 0x77073096, 0xee0e612c, 0x990951ba, 0x076dc419, 0x706af48f,
0xe963a535, 0x9e6495a3, 0x0edb8832, 0x79dcb8a4, 0xe0d5e91e, 0x97d2d988,
0x09b64c2b, 0x7eb17cbd, 0xe7b82d07, 0x90bf1d91, 0x1db71064, 0x6ab020f2,
0xf3b97148, 0x84be41de, 0x1adad47d, 0x6ddde4eb, 0xf4d4b551, 0x83d385c7,
0x136c9856, 0x646ba8c0, 0xfd62f97a, 0x8a65c9ec, 0x14015c4f, 0x63066cd9,
0xfa0f3d63, 0x8d080df5, 0x3b6e20c8, 0x4c69105e, 0xd56041e4, 0xa2677172,
0x3c03e4d1, 0x4b04d447, 0xd20d85fd, 0xa50ab56b, 0x35b5a8fa, 0x42b2986c,
0xdbbbc9d6, 0xacbcf940, 0x32d86ce3, 0x45df5c75, 0xdcd60dcf, 0xabd13d59,
0x26d930ac, 0x51de003a, 0xc8d75180, 0xbfd06116, 0x21b4f4b5, 0x56b3c423,
0xcfba9599, 0xb8bda50f, 0x2802b89e, 0x5f058808, 0xc60cd9b2, 0xb10be924,
0x2f6f7c87, 0x58684c11, 0xc1611dab, 0xb6662d3d, 0x76dc4190, 0x01db7106,
0x98d220bc, 0xefd5102a, 0x71b18589, 0x06b6b51f, 0x9fbfe4a5, 0xe8b8d433,
0x7807c9a2, 0x0f00f934, 0x9609a88e, 0xe10e9818, 0x7f6a0dbb, 0x086d3d2d,
0x91646c97, 0xe6635c01, 0x6b6b51f4, 0x1c6c6162, 0x856530d8, 0xf262004e,
0x6c0695ed, 0x1b01a57b, 0x8208f4c1, 0xf50fc457, 0x65b0d9c6, 0x12b7e950,
0x8bbeb8ea, 0xfcb9887c, 0x62dd1ddf, 0x15da2d49, 0x8cd37cf3, 0xfbd44c65,
0x4db26158, 0x3ab551ce, 0xa3bc0074, 0xd4bb30e2, 0x4adfa541, 0x3dd895d7,
0xa4d1c46d, 0xd3d6f4fb, 0x4369e96a, 0x346ed9fc, 0xad678846, 0xda60b8d0,
0x44042d73, 0x33031de5, 0xaa0a4c5f, 0xdd0d7cc9, 0x5005713c, 0x270241aa,
0xbe0b1010, 0xc90c2086, 0x5768b525, 0x206f85b3, 0xb966d409, 0xce61e49f,
0x5edef90e, 0x29d9c998, 0xb0d09822, 0xc7d7a8b4, 0x59b33d17, 0x2eb40d81,
0xb7bd5c3b, 0xc0ba6cad, 0xedb88320, 0x9abfb3b6, 0x03b6e20c, 0x74b1d29a,
0xead54739, 0x9dd277af, 0x04db2615, 0x73dc1683, 0xe3630b12, 0x94643b84,
0x0d6d6a3e, 0x7a6a5aa8, 0xe40ecf0b, 0x9309ff9d, 0x0a00ae27, 0x7d079eb1,
0xf00f9344, 0x8708a3d2, 0x1e01f268, 0x6906c2fe, 0xf762575d, 0x806567cb,
0x196c3671, 0x6e6b06e7, 0xfed41b76, 0x89d32be0, 0x10da7a5a, 0x67dd4acc,
0xf9b9df6f, 0x8ebeeff9, 0x17b7be43, 0x60b08ed5, 0xd6d6a3e8, 0xa1d1937e,
0x38d8c2c4, 0x4fdff252, 0xd1bb67f1, 0xa6bc5767, 0x3fb506dd, 0x48b2364b,
0xd80d2bda, 0xaf0a1b4c, 0x36034af6, 0x41047a60, 0xdf60efc3, 0xa867df55,
0x316e8eef, 0x4669be79, 0xcb61b38c, 0xbc66831a, 0x256fd2a0, 0x5268e236,
0xcc0c7795, 0xbb0b4703, 0x220216b9, 0x5505262f, 0xc5ba3bbe, 0xb2bd0b28,
0x2bb45a92, 0x5cb36a04, 0xc2d7ffa7, 0xb5d0cf31, 0x2cd99e8b, 0x5bdeae1d,
0x9b64c2b0, 0xec63f226, 0x756aa39c, 0x026d930a, 0x9c0906a9, 0xeb0e363f,
0x72076785, 0x05005713, 0x95bf4a82, 0xe2b87a14, 0x7bb12bae, 0x0cb61b38,
0x92d28e9b, 0xe5d5be0d, 0x7cdcefb7, 0x0bdbdf21, 0x86d3d2d4, 0xf1d4e242,
0x68ddb3f8, 0x1fda836e, 0x81be16cd, 0xf6b9265b, 0x6fb077e1, 0x18b74777,
0x88085ae6, 0xff0f6a70, 0x66063bca, 0x11010b5c, 0x8f659eff, 0xf862ae69,
0x616bffd3, 0x166ccf45, 0xa00ae278, 0xd70dd2ee, 0x4e048354, 0x3903b3c2,
0xa7672661, 0xd06016f7, 0x4969474d, 0x3e6e77db, 0xaed16a4a, 0xd9d65adc,
0x40df0b66, 0x37d83bf0, 0xa9bcae53, 0xdebb9ec5, 0x47b2cf7f, 0x30b5ffe9,
0xbdbdf21c, 0xcabac28a, 0x53b39330, 0x24b4a3a6, 0xbad03605, 0xcdd70693,
0x54de5729, 0x23d967bf, 0xb3667a2e, 0xc4614ab8, 0x5d681b02, 0x2a6f2b94,
0xb40bbe37, 0xc30c8ea1, 0x5a05df1b, 0x2d02ef8d
};
uint32_t crc32(uint32_t crc, const void *buf, size_t size)
{
const uint8_t *p;
p = buf;
crc = 127;
while (size--)
crc = crc32_tab[(crc ^ *p++) & 0xFF] ^ (crc >> 8);
return crc ^ ~0U;
}
That table is for the "standard" CRC-32 used in ethernet, zip, gzip, v.42, many other places.
I need to point out that that code in your question is incorrect, with the crc = 127; statement. crc should not be set at all, since it is an input to the function that is then lost, and it should not be initialized to 127. What should be there is crc = ~crc;. That should also be the approach used at the end, instead of return crc ^ ~0U;. That may not be portable if unsigned is not the same size at uint32_t. The use of ~ is fine with uint32_t, but if a different type is used that is not 32 bits, then more portable still is crc ^ 0xffffffff.
Anyway, to answer your question, Intel chose a different 32-bit CRC to implement in their hardware instruction. That CRC is usually referred to as CRC-32C, using a polynomial discovered by Castagnoli (what the "C" refers to) with better properties than the polynomial used in the standard CRC-32. iSCSI and SCTP use the CRC-32C instead of CRC-32. I presume that that had something to do with Intel's choice.
See this answer for code that computes the CRC-32C in software, as well as in hardware if available.
I'm studying about inline assembly. I want to write a simple routine in iPhone under Xcode 4 LLVM 3.0 Compiler. I succeed write basic inline assembly codes.
example :
int sub(int a, int b)
{
int c;
asm ("sub %0, %1, %2" : "=r" (c) : "r" (a), "r" (b));
return c;
}
I found it in stackoverflow.com and it works very well. But, I don't know how to write code about LOOP.
I need to assembly codes like
void brighten(unsigned char* src, unsigned char* dst, int numPixels, int intensity)
{
for(int i=0; i<numPixels; i++)
{
dst[i] = src[i] + intensity;
}
}
Take a look here at the loop section - http://en.wikipedia.org/wiki/ARM_architecture
Basically you'll want something like:
void brighten(unsigned char* src, unsigned char* dst, int numPixels, int intensity) {
asm volatile (
"\t mov r3, #0\n"
"Lloop:\n"
"\t cmp r3, %2\n"
"\t bge Lend\n"
"\t ldrb r4, [%0, r3]\n"
"\t add r4, r4, %3\n"
"\t strb r4, [%1, r3]\n"
"\t add r3, r3, #1\n"
"\t b Lloop\n"
"Lend:\n"
: "=r"(src), "=r"(dst), "=r"(numPixels), "=r"(intensity)
: "0"(src), "1"(dst), "2"(numPixels), "3"(intensity)
: "cc", "r3", "r4");
}
Update:
And here's that NEON version:
void brighten_neon(unsigned char* src, unsigned char* dst, int numPixels, int intensity) {
asm volatile (
"\t mov r4, #0\n"
"\t vdup.8 d1, %3\n"
"Lloop2:\n"
"\t cmp r4, %2\n"
"\t bge Lend2\n"
"\t vld1.8 d0, [%0]!\n"
"\t vqadd.s8 d0, d0, d1\n"
"\t vst1.8 d0, [%1]!\n"
"\t add r4, r4, #8\n"
"\t b Lloop2\n"
"Lend2:\n"
: "=r"(src), "=r"(dst), "=r"(numPixels), "=r"(intensity)
: "0"(src), "1"(dst), "2"(numPixels), "3"(intensity)
: "cc", "r4", "d1", "d0");
}
So this NEON version will do 8 at a time. It does however not check that numPixels is divisible by 8 so you'd definitely want to do that otherwise things will go wrong! Anyway, it's just a start at showing you what can be done. Notice the same number of instructions, but action on eight pixels of data at once. Oh and it's got the saturation in there as well that I assume you would want.
Though this answer is not directly an answer to your question, it is more a general advice regarding use of assembler versus modern compilers.
You will generally have a hard time beating the compiler regarding optimazation of your C code. Of course by clever use of certain knowledge about how your data behave it's possible that you might tweak it just a few percents.
One of the reasons for this is that modern compilers use a number of techniques when dealing with code like the one you describe, e.g. loop unrolling, instruction reordering to avoid pipeline stalls and bubbles, etc.
If you really want to make that algorithm scream, you should consider redesigning the algorithm instead in C so you avoid the worst delays. For instance reading and writing to memory is expensive compared to register access.
One way of accomplishing this could be to have your code load 4 bytes at a time by using an unsigned long and then doing the math on this in registers before writing these 4 bytes back in one store operation.
So to recap, make your algorithm work smarter not harder.
I work on Atom-32bit-intel, I have to port MicroC OS II, so there is no code to make any configuration on the Atom (No GDT, no LDT...):
my question is more about the state of the Atom-32bit after a reset, is the Atom in protecte mode or not ? and the most important how do i check which mode is it (which registers have to be checked nad how)?
Remark:
The CR0.PE = 1 (I checked it), is that enough to prove that the Atom is in protected mode ?
************ UPDATE : *****************
/*Read the IDTR*/
sidt (idt_ptr)
/*Read the GDTR*/
sgdt (gdt_ptr)
So I tried just to use IDT's address to link my ISR to the IDT :
fill_interrupt(ISR_Nbr,(unsigned int) isr33, 0x08, 0x8E);
static void fill_interrupt(unsigned char num, unsigned int base, unsigned short sel, unsigned char flags)
{
unsigned short *Interrupt_Address;
/*address = idt_ptr.base + num * 8 byte*/
Interrupt_Address = (unsigned short *)(idt_ptr.base + num*8);
*(Interrupt_Address) = base&0xFFFF;
*(Interrupt_Address+1) = sel;
*(Interrupt_Address+1) = (flags>>8)&0xFF00;
*(Interrupt_Address+1) = (base>>16)&0xFFFF;
}
my ISR a imple one :
isr33:
nop
nop
cli
push %ebp //save the context to swith back
mov %esp,%ebp
pop %ebp //Return to the calling function
sti
ret
Chapter 9 of volume 3 of the Intel Software Developer's Manual says that the reset value of CR0 is 60000010H. As you can see, bit 0, aka PE, is clear.
Regardless, you can setup the descriptor tables in Protected Mode as well as in Real Mode. You just have to be more careful about it.
I suggest you check if the BIOS or OS are setting this bit at a stage before you read it.
Atom is x86 instruction set, and as such, should be starting in real mode for compatibility. I don't have one on hand to test with though.
Resolved, I use N450 Atom board, it has already a BIOS, the BIOS configures the board in Protected Mode.