Why is the SSE4.2 CRC32 hash value different from the software CRC32 hash value?

In my project, CRC32 is calculated many times.
Until now I have used a software CRC32 calculation, but I noticed that SSE4.2 has CPU support for it, and that Linux provides a hardware CRC32 function using that CPU instruction.
I use an Intel Xeon E5-2650 CPU, so I tried calculating CRC32 with the Linux function.
But the result is different from that of the software CRC32 function I had been using.
I used the init value 127 in both. The software CRC32 function I used is below:
static uint32_t crc32_tab[] = {
0x00000000, 0x77073096, 0xee0e612c, 0x990951ba, 0x076dc419, 0x706af48f,
0xe963a535, 0x9e6495a3, 0x0edb8832, 0x79dcb8a4, 0xe0d5e91e, 0x97d2d988,
0x09b64c2b, 0x7eb17cbd, 0xe7b82d07, 0x90bf1d91, 0x1db71064, 0x6ab020f2,
0xf3b97148, 0x84be41de, 0x1adad47d, 0x6ddde4eb, 0xf4d4b551, 0x83d385c7,
0x136c9856, 0x646ba8c0, 0xfd62f97a, 0x8a65c9ec, 0x14015c4f, 0x63066cd9,
0xfa0f3d63, 0x8d080df5, 0x3b6e20c8, 0x4c69105e, 0xd56041e4, 0xa2677172,
0x3c03e4d1, 0x4b04d447, 0xd20d85fd, 0xa50ab56b, 0x35b5a8fa, 0x42b2986c,
0xdbbbc9d6, 0xacbcf940, 0x32d86ce3, 0x45df5c75, 0xdcd60dcf, 0xabd13d59,
0x26d930ac, 0x51de003a, 0xc8d75180, 0xbfd06116, 0x21b4f4b5, 0x56b3c423,
0xcfba9599, 0xb8bda50f, 0x2802b89e, 0x5f058808, 0xc60cd9b2, 0xb10be924,
0x2f6f7c87, 0x58684c11, 0xc1611dab, 0xb6662d3d, 0x76dc4190, 0x01db7106,
0x98d220bc, 0xefd5102a, 0x71b18589, 0x06b6b51f, 0x9fbfe4a5, 0xe8b8d433,
0x7807c9a2, 0x0f00f934, 0x9609a88e, 0xe10e9818, 0x7f6a0dbb, 0x086d3d2d,
0x91646c97, 0xe6635c01, 0x6b6b51f4, 0x1c6c6162, 0x856530d8, 0xf262004e,
0x6c0695ed, 0x1b01a57b, 0x8208f4c1, 0xf50fc457, 0x65b0d9c6, 0x12b7e950,
0x8bbeb8ea, 0xfcb9887c, 0x62dd1ddf, 0x15da2d49, 0x8cd37cf3, 0xfbd44c65,
0x4db26158, 0x3ab551ce, 0xa3bc0074, 0xd4bb30e2, 0x4adfa541, 0x3dd895d7,
0xa4d1c46d, 0xd3d6f4fb, 0x4369e96a, 0x346ed9fc, 0xad678846, 0xda60b8d0,
0x44042d73, 0x33031de5, 0xaa0a4c5f, 0xdd0d7cc9, 0x5005713c, 0x270241aa,
0xbe0b1010, 0xc90c2086, 0x5768b525, 0x206f85b3, 0xb966d409, 0xce61e49f,
0x5edef90e, 0x29d9c998, 0xb0d09822, 0xc7d7a8b4, 0x59b33d17, 0x2eb40d81,
0xb7bd5c3b, 0xc0ba6cad, 0xedb88320, 0x9abfb3b6, 0x03b6e20c, 0x74b1d29a,
0xead54739, 0x9dd277af, 0x04db2615, 0x73dc1683, 0xe3630b12, 0x94643b84,
0x0d6d6a3e, 0x7a6a5aa8, 0xe40ecf0b, 0x9309ff9d, 0x0a00ae27, 0x7d079eb1,
0xf00f9344, 0x8708a3d2, 0x1e01f268, 0x6906c2fe, 0xf762575d, 0x806567cb,
0x196c3671, 0x6e6b06e7, 0xfed41b76, 0x89d32be0, 0x10da7a5a, 0x67dd4acc,
0xf9b9df6f, 0x8ebeeff9, 0x17b7be43, 0x60b08ed5, 0xd6d6a3e8, 0xa1d1937e,
0x38d8c2c4, 0x4fdff252, 0xd1bb67f1, 0xa6bc5767, 0x3fb506dd, 0x48b2364b,
0xd80d2bda, 0xaf0a1b4c, 0x36034af6, 0x41047a60, 0xdf60efc3, 0xa867df55,
0x316e8eef, 0x4669be79, 0xcb61b38c, 0xbc66831a, 0x256fd2a0, 0x5268e236,
0xcc0c7795, 0xbb0b4703, 0x220216b9, 0x5505262f, 0xc5ba3bbe, 0xb2bd0b28,
0x2bb45a92, 0x5cb36a04, 0xc2d7ffa7, 0xb5d0cf31, 0x2cd99e8b, 0x5bdeae1d,
0x9b64c2b0, 0xec63f226, 0x756aa39c, 0x026d930a, 0x9c0906a9, 0xeb0e363f,
0x72076785, 0x05005713, 0x95bf4a82, 0xe2b87a14, 0x7bb12bae, 0x0cb61b38,
0x92d28e9b, 0xe5d5be0d, 0x7cdcefb7, 0x0bdbdf21, 0x86d3d2d4, 0xf1d4e242,
0x68ddb3f8, 0x1fda836e, 0x81be16cd, 0xf6b9265b, 0x6fb077e1, 0x18b74777,
0x88085ae6, 0xff0f6a70, 0x66063bca, 0x11010b5c, 0x8f659eff, 0xf862ae69,
0x616bffd3, 0x166ccf45, 0xa00ae278, 0xd70dd2ee, 0x4e048354, 0x3903b3c2,
0xa7672661, 0xd06016f7, 0x4969474d, 0x3e6e77db, 0xaed16a4a, 0xd9d65adc,
0x40df0b66, 0x37d83bf0, 0xa9bcae53, 0xdebb9ec5, 0x47b2cf7f, 0x30b5ffe9,
0xbdbdf21c, 0xcabac28a, 0x53b39330, 0x24b4a3a6, 0xbad03605, 0xcdd70693,
0x54de5729, 0x23d967bf, 0xb3667a2e, 0xc4614ab8, 0x5d681b02, 0x2a6f2b94,
0xb40bbe37, 0xc30c8ea1, 0x5a05df1b, 0x2d02ef8d
};
uint32_t crc32(uint32_t crc, const void *buf, size_t size)
{
    const uint8_t *p;

    p = buf;
    crc = 127;
    while (size--)
        crc = crc32_tab[(crc ^ *p++) & 0xFF] ^ (crc >> 8);
    return crc ^ ~0U;
}

That table is for the "standard" CRC-32 used in Ethernet, zip, gzip, V.42, and many other places.
I need to point out that the code in your question is incorrect, because of the crc = 127; statement. crc should not be set at all, since it is an input to the function that is then lost, and it should certainly not be initialized to 127. What should be there instead is crc = ~crc;. The same approach should be used at the end, in place of return crc ^ ~0U;, which may not be portable if unsigned is not the same size as uint32_t. The use of ~ is fine with uint32_t, but if a different type that is not 32 bits is used, then crc ^ 0xffffffff is more portable still.
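Applying those fixes, the routine would look like this (a corrected sketch using the same table):

uint32_t crc32(uint32_t crc, const void *buf, size_t size)
{
    const uint8_t *p = buf;

    crc = ~crc;              /* invert on entry instead of discarding the input */
    while (size--)
        crc = crc32_tab[(crc ^ *p++) & 0xFF] ^ (crc >> 8);
    return crc ^ 0xffffffff; /* invert again on exit, portably */
}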
Anyway, to answer your question: Intel chose a different 32-bit CRC to implement in their hardware instruction. That CRC is usually referred to as CRC-32C, using a polynomial discovered by Castagnoli (which is what the "C" refers to) with better properties than the polynomial used in the standard CRC-32. iSCSI and SCTP use CRC-32C instead of CRC-32; I presume that had something to do with Intel's choice.
See this answer for code that computes the CRC-32C in software, as well as in hardware if available.
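For illustration, a byte-at-a-time hardware CRC-32C might look like the following (a sketch assuming SSE4.2 is available, compiled with -msse4.2; serious implementations process eight bytes per step with _mm_crc32_u64):

#include <nmmintrin.h>  /* SSE4.2 intrinsics */
#include <stddef.h>
#include <stdint.h>

uint32_t crc32c_hw(uint32_t crc, const void *buf, size_t size)
{
    const uint8_t *p = buf;

    crc = ~crc;  /* same pre- and post-inversion convention as the table version */
    while (size--)
        crc = _mm_crc32_u8(crc, *p++);
    return ~crc;
}

Note that its output will still differ from the table-driven crc32() above, because the two use different polynomials.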

Related

Getting illegal instructions when vectorized code writes to PCI

I am writing a program that writes to a device's range of HW registers. I am using mmap to map the HW addresses to virtual addresses (user space). I tested the result of the mmap and it is OK. I implemented a copy of a buffer into the device:
void bufferCopy(void *dest, void *src, const size_t size) {
    uint8_t *pdest = static_cast<uint8_t *>(dest);
    uint8_t *psrc = static_cast<uint8_t *>(src);
    size_t iters = 0, tailBytes = 0;

    /* iterate 64bit */
    iters = (size / sizeof(uint64_t));
    for (size_t index = 0; index < iters; ++index) {
        *(reinterpret_cast<uint64_t *>(pdest)) =
            *(reinterpret_cast<uint64_t *>(psrc));
        pdest += sizeof(uint64_t);
        psrc += sizeof(uint64_t);
    }
    ...
but when running it on QEMU I get an illegal instruction exception. When I debugged it, I found it crashes on the following instruction (below is the asm of the main loop):
movdqu (%rsi,%rax,1),%xmm0
movups %xmm0,(%rdi,%rax,1) <----- this instruction crashes ...
add $0x10,%rax
cmp %rax,%r9
jne 0x7ffff7eca1e0 <_ZN12_GLOBAL__N_110bufferCopyEPvS0_m+64>
Any ideas why? My guess is that you can only write to PCI with 32/64-bit accesses.
The compiler doesn't know my limitations, so it optimizes my code and creates a vectorized loop (each iteration loads 128 bits and stores 128 bits). Does that make sense? Can I write to PCI with vectorized instructions?
Also, whether it is a missing feature in QEMU, a bug, or just a recommendation, how can I prevent the compiler from generating those vector instructions?
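One way to do that (an illustrative sketch, not from the thread) is to perform the MMIO accesses through volatile pointers; the compiler must emit every volatile access exactly as written, so it cannot merge them into 128-bit vector loads/stores:

#include <stdint.h>
#include <stddef.h>

/* Sketch: copy to MMIO 64 bits at a time; the volatile destination
   accesses are never widened or combined by the compiler. */
void bufferCopyMMIO(void *dest, const void *src, size_t size)
{
    volatile uint64_t *pdest = (volatile uint64_t *)dest;
    const uint64_t *psrc = (const uint64_t *)src;

    for (size_t i = 0; i < size / sizeof(uint64_t); ++i)
        pdest[i] = psrc[i];
    /* tail bytes (size % 8) would be copied with narrower volatile accesses */
}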

Disabling SPI peripheral on STM32H7 between two transmissions?

I'm still doing SPI experiments between two Nucleo STM32H743 boards.
I've configured SPI in full-duplex mode, with CRC enabled, at an SPI frequency of 25 MHz (so the Slave can transmit without issue).
DSIZE is 8 bits and the FIFO threshold is 4.
On the Master side, I'm sending 4 bytes, then waiting for 5 bytes from the Slave. I know I could use half-duplex or simplex mode, but I want to understand what's going on in full-duplex mode.
volatile unsigned long *CR1  = (unsigned long *)0x40013000;
volatile unsigned long *CR2  = (unsigned long *)0x40013004;
volatile unsigned long *TXDR = (unsigned long *)0x40013020;
volatile unsigned long *RXDR = (unsigned long *)0x40013030;
volatile unsigned long *SR   = (unsigned long *)0x40013014;
volatile unsigned long *IFCR = (unsigned long *)0x40013018;
volatile unsigned long *TXCRC = (unsigned long *)0x40013044;
volatile unsigned long *RXCRC = (unsigned long *)0x40013048;
volatile unsigned long *CFG2 = (unsigned long *)0x4001300C;

unsigned long SPI_TransmitCommandFullDuplex(uint32_t Data)
{
    // size of transfer (TSIZE)
    *CR2 = 4;
    /* Enable SPI peripheral */
    *CR1 |= SPI_CR1_SPE;
    /* Master transfer start */
    *CR1 |= SPI_CR1_CSTART;
    *TXDR = Data;
    while (((*SR) & SPI_FLAG_EOT) == 0);
    // clear flags
    *IFCR = 0xFFFFFFFF;
    // disable SPI
    *CR1 &= ~SPI_CR1_SPE;
    return 0;
}

void SPI_ReceiveResponseFullDuplex(uint8_t *pData)
{
    // size of transfer (TSIZE)
    *CR2 = 5;
    /* Enable SPI peripheral */
    *CR1 |= SPI_CR1_SPE;
    /* Master transfer start */
    *CR1 |= SPI_CR1_CSTART;
    *TXDR = 0;
    *((volatile uint8_t *)TXDR) = 0;
    while (((*SR) & SPI_FLAG_EOT) == 0);
    *((uint32_t *)pData) = *RXDR;
    *((uint8_t *)(pData + 4)) = *((volatile uint8_t *)RXDR);
    // clear flags
    *IFCR = 0xFFFFFFFF;
    // disable SPI
    *CR1 &= ~SPI_CR1_SPE;
}
This is working fine (both functions are simply called in sequence in main).
Then I tried to remove the SPI disabling between the two steps (i.e. I don't clear and set the SPE bit again) and I got stuck in the while loop of SPI_ReceiveResponseFullDuplex.
Is it necessary to disable SPI between two transmissions, or did I make a mistake in the configuration?
The behaviour of the SPE bit is not very clear in the reference manual. For example, it is written clearly that, in half-duplex mode, the SPI has to be disabled to change the direction of communication. But nothing is said about full-duplex mode (or I missed it).
This errata item might be relevant here.
Master data transfer stall at system clock much faster than SCK
Description
With the system clock (spi_pclk) substantially faster than SCK (spi_ker_ck divided by a prescaler), SPI/I2S master data transfer can stall upon setting the CSTART bit within one SCK cycle after the EOT event (EOT flag raise) signaling the end of the previous transfer.
Workaround
Apply one of the following measures:
• Disable then enable SPI/I2S after each EOT event.
• Upon EOT event, wait for at least one SCK cycle before setting CSTART.
• Prevent EOT events from occurring, by setting transfer size to undefined (TSIZE = 0)
and by triggering transmission exclusively by TXFIFO writes.
Your SCK frequency is 25 MHz; the system clock can be 400 or 480 MHz, i.e. 16 or 19 times SCK. When you remove the lines clearing and setting SPE, only these lines remain in effect after detecting EOT:
*IFCR = 0xFFFFFFFF;
*CR2 = 5;
*CR1 |= SPI_CR1_CSTART;
When this sequence (quite probably) takes less than 16 system clock cycles, i.e. CSTART gets set within one SCK cycle after EOT, you hit exactly the problem described above. It looks like someone did sloppy work again on the SPI clock system. What you did first, clearing and setting SPE, is one of the recommended workarounds.
I would just set TSIZE = 9 at the start, then write the command and the dummy bytes in one go; it makes no difference in full-duplex mode (see the sketch below).
Keep in mind that in full-duplex mode, another 4 bytes are received while the command goes out, and these must be read and discarded before getting the real answer. This was not a problem with your original code, because clearing SPE discards data still in the receive FIFO, but it would become one if the modified code worked, e.g. if there were some more delay before setting CSTART again.
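Under those assumptions, a combined routine might look like this (an untested sketch built from the register pointers in the question; the name SPI_CommandResponseFullDuplex is made up for illustration):

void SPI_CommandResponseFullDuplex(uint32_t Data, uint8_t *pData)
{
    *CR2 = 9;                        // TSIZE = 9: 4 command bytes + 5 response bytes
    *CR1 |= SPI_CR1_SPE;             // enable SPI
    *CR1 |= SPI_CR1_CSTART;          // master transfer start
    *TXDR = Data;                    // 4 command bytes
    *TXDR = 0;                       // 4 dummy bytes
    *((volatile uint8_t *)TXDR) = 0; // 9th dummy byte
    while (((*SR) & SPI_FLAG_EOT) == 0);
    (void)*RXDR;                     // discard the 4 bytes clocked in during the command
    *((uint32_t *)pData) = *RXDR;    // the real 4-byte answer
    *((uint8_t *)(pData + 4)) = *((volatile uint8_t *)RXDR); // 5th byte
    *IFCR = 0xFFFFFFFF;              // clear flags
    *CR1 &= ~SPI_CR1_SPE;            // disable SPI
}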

Would my solution work for 8-bit bus addressing using BSRR and BRR?

I have set up an 8-bit bus (PD0:PD7) on an STM32 MCU to send addresses (0:255) to another chip. I am wondering whether a function like the one below would work for fast address changes. I could not find an example that directly shows a register being assigned an integer, so I want to confirm it would work. I need a function which takes an integer address value (0:255) and sets the 8 pins of the bus to this value:
void chipbus(uint16_t bus8) {
    GPIOD->regs->BSRR = bus8;      // set all the '1' in bus8 to high
    GPIOD->regs->BRR = 255 - bus8; // 255-bus8 inverts the 8 bits;
                                   // BRR to set the new '1' to low
}
If this solution works, I am also curious whether, if I move the bus to ports PD5:PD12, my function would work as:
void chipbus(uint16_t bus8) {
    GPIOD->regs->BSRR = bus8 * 32;        // set all '1' in bus8 to high;
                                          // multiply by 32 to shift 5 bits/pins
    GPIOD->regs->BRR = (255 - bus8) * 32; // 255-bus8 inverts the 8 bits;
                                          // BRR to set the new '1' to low
}
Thank you!
Yes, both should work. However, a more recognisable but equivalent formulation would be:
void chipbus(uint16_t bus8) {
    GPIOD->regs->BSRR = bus8;          // set all the '1' in bus8 to high
    GPIOD->regs->BRR = (~bus8) & 0xFF; // invert the 8 bits; BRR sets the new '1' to low
}

void chipbus(uint16_t bus8) {
    GPIOD->regs->BSRR = bus8 << 5;            // set all '1' in bus8 to high, shifted up 5 bits
    GPIOD->regs->BRR = ((~bus8) & 0xFF) << 5; // invert the 8 bits
}
But there is an even faster way. BSRR is a 32-bit register that can both set and reset. You can combine the two write accesses into one:
void chipbus(uint16_t bus8) {
    GPIOD->regs->BSRR = (bus8 << 5) | (((~bus8) & 0xFF) << (16 + 5));
}
Happy bit-fiddling!
Yes, it'd definitely work. However, as others have noted, it's advisable to set the outputs in a single operation.
Taking advantage of the full 32-bit BSRR register, it can be done without inverting the data bits:
GPIOD->regs->BSRR = bus8 | (0xFF << 16);
or
GPIOD->regs->BSRR = (bus8 << 5) | (0xFF << (16 + 5));
because BSRR has the functionality to set some bits and reset others in a single write operation, and when both the set and the reset bit for a particular pin are set, the output becomes 1.
I wouldn't recommend using the two-step approach. Writing to BSRR and BRR as two separate steps means that the bus transitions through an unintended intermediate state in which some bits are still set from the previous value.
Instead, consider writing directly to the GPIO output data register (ODR). If you need to preserve the original value of the upper bits in the port, you can do that on the CPU side:
GPIOD->regs->ODR = (GPIOD->regs->ODR & 0xff00) | (bus8 & 0x00ff);

STM32 SPI data is sent the reverse way

I've been experimenting with writing to an external EEPROM using SPI, and I've had mixed success. The data does get shifted out, but in the opposite order. The EEPROM requires a start bit and then an opcode, which is essentially a 2-bit code for read, write and erase; the start bit and the opcode are combined into one byte. I'm creating a 32-bit unsigned int and then bit-shifting the values into it. When I transmit it, I see the actual data first, then the SB+opcode, and then the memory address. How do I reverse this so that the opcode is seen first, then the memory address, and then the actual data? As seen in the image below, the data is BCDE, the SB+opcode is 07 and the memory address is 3F. The correct sequence should be 07, 3F and then BCDE (I think!).
Here is the code:
uint8_t mem_addr = 0x3F;
uint16_t data = 0xBCDE;
uint32_t write_package = (ERASE << 24 | mem_addr << 16 | data);

while (1)
{
    /* USER CODE END WHILE */
    /* USER CODE BEGIN 3 */
    HAL_SPI_Transmit(&hspi1, &write_package, 2, HAL_MAX_DELAY);
    HAL_Delay(10);
}
/* USER CODE END 3 */
It looks like your SPI interface is set up to process 16-bit halfwords at a time. Therefore it would make sense to break the data to be sent into 16-bit halfwords too. That takes care of the ordering.
uint8_t mem_addr = 0x3F;
uint16_t data = 0xBCDE;
uint16_t write_package[2] = {
    (ERASE << 8) | mem_addr,
    data
};

HAL_SPI_Transmit(&hspi1, (uint8_t *)write_package, 2, HAL_MAX_DELAY);
EDIT
Added an explicit cast. As noted in the comments, without the explicit cast it wouldn't compile as C++ code, and it would cause some warnings as C code.
You're packing your information into a 32-bit integer; on line 3 of your code you decide which bits of data are placed where in the word. To change the order, you can replace that line with:
uint32_t write_package = ((data << 16) | (mem_addr << 8) | (ERASE));
That shifts data 16 bits left into the most significant 16 bits of the word, shifts mem_addr up by 8 bits and ORs it in, and then ORs ERASE into the least significant bits.
Your problem is endianness.
By default the STM32 uses little endian, so the lowest byte of the uint32_t is stored at the first address.
If I'm right, this is the declaration of the transmit function you are using:
HAL_StatusTypeDef HAL_SPI_Transmit(SPI_HandleTypeDef *hspi, uint8_t *pData, uint16_t Size, uint32_t Timeout)
It requires a pointer to uint8_t as data (not a uint32_t), so you should get at least a warning when you compile your code.
If you want to write code that is independent of the endianness used, you should store your data in an array instead of one "big" variable:
uint8_t write_package[4];
write_package[0] = ERASE;
write_package[1] = mem_addr;
write_package[2] = (data >> 8) & 0xFF;
write_package[3] = (data & 0xFF);
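The array then goes out in exactly the order written, e.g. (assuming 8-bit SPI data frames, so Size counts bytes):

HAL_SPI_Transmit(&hspi1, write_package, 4, HAL_MAX_DELAY);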

Is the 32-bit Atom in protected mode after a reset?

I work on a 32-bit Intel Atom, and I have to port MicroC/OS-II, so there is no code that does any configuration of the Atom (no GDT, no LDT...).
My question is about the state of the 32-bit Atom after a reset: is the Atom in protected mode or not? And, most importantly, how do I check which mode it is in (which registers have to be checked, and how)?
Remark:
CR0.PE = 1 (I checked it); is that enough to prove that the Atom is in protected mode?
UPDATE:
/*Read the IDTR*/
sidt (idt_ptr)
/*Read the GDTR*/
sgdt (gdt_ptr)
So I tried to just use the IDT's address to link my ISR into the IDT:
fill_interrupt(ISR_Nbr, (unsigned int)isr33, 0x08, 0x8E);

static void fill_interrupt(unsigned char num, unsigned int base, unsigned short sel, unsigned char flags)
{
    unsigned short *Interrupt_Address;

    /* address = idt_ptr.base + num * 8 bytes */
    Interrupt_Address = (unsigned short *)(idt_ptr.base + num * 8);
    *(Interrupt_Address)     = base & 0xFFFF;                         /* ISR offset, bits 0..15 */
    *(Interrupt_Address + 1) = sel;                                   /* code segment selector */
    *(Interrupt_Address + 2) = ((unsigned short)flags << 8) & 0xFF00; /* type/attribute byte in the high byte */
    *(Interrupt_Address + 3) = (base >> 16) & 0xFFFF;                 /* ISR offset, bits 16..31 */
}
My ISR is a simple one:
isr33:
    nop
    nop
    cli
    push %ebp       // save the context to switch back
    mov  %esp,%ebp
    pop  %ebp       // restore before returning
    sti
    iret            // an ISR entered through an interrupt gate must return with iret, not ret
Chapter 9 of Volume 3 of the Intel Software Developer's Manual says that the reset value of CR0 is 60000010H. As you can see, bit 0, aka PE, is clear.
Regardless, you can set up the descriptor tables in protected mode as well as in real mode. You just have to be more careful about it.
I suggest you check whether the BIOS or OS sets this bit at some stage before you read it.
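For reference, reading CR0.PE yourself could look like this (a sketch using GCC inline assembly, not from the thread; mov from CR0 is privileged, so this must run at ring 0):

static inline int in_protected_mode(void)
{
    unsigned long cr0;
    __asm__ volatile ("mov %%cr0, %0" : "=r"(cr0));
    return (cr0 & 1) != 0;  /* CR0 bit 0 is PE */
}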
Atom implements the x86 instruction set and, as such, should start in real mode for compatibility. I don't have one on hand to test with, though.
Resolved: I use an N450 Atom board; it already has a BIOS, and the BIOS configures the board into protected mode.