Getting an illegal instruction when vectorized code writes to PCI - x86-64

I am writing a program that writes to a device's range of HW registers. I am using mmap to map the HW addresses to a virtual address in user space. I tested the result of the mmap and it is OK. Then I implemented a copy of a buffer into the device:
void bufferCopy(void *dest, void *src, const size_t size) {
    uint8_t *pdest = static_cast<uint8_t *>(dest);
    uint8_t *psrc = static_cast<uint8_t *>(src);
    size_t iters = 0, tailBytes = 0;

    /* iterate 64 bit */
    iters = (size / sizeof(uint64_t));
    for (size_t index = 0; index < iters; ++index) {
        *(reinterpret_cast<uint64_t *>(pdest)) =
            *(reinterpret_cast<uint64_t *>(psrc));
        pdest += sizeof(uint64_t);
        psrc += sizeof(uint64_t);
    }
    ...
but when running it on QEMU I get an illegal instruction exception. When I debugged it, I saw that it crashes on the instruction marked below (this is the asm of the main loop):
movdqu (%rsi,%rax,1),%xmm0
movups %xmm0,(%rdi,%rax,1) <----- this instruction crashes ...
add $0x10,%rax
cmp %rax,%r9
jne 0x7ffff7eca1e0 <_ZN12_GLOBAL__N_110bufferCopyEPvS0_m+64>
Any ideas why? My guess is that you can write to PCI only in 32/64-bit accesses.
The compiler doesn't know my limitations, so it optimizes my code and creates a vectorized loop (each iteration loads 128 bits and stores 128 bits). Does that make sense? Can I write to PCI with vectorized instructions?
Also, whether it is a missing feature in QEMU, a bug, or just a recommendation, how can I prevent the compiler from generating those vector instructions?
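One way to keep the compiler from widening or vectorizing the MMIO stores is to perform them through a volatile pointer; the compiler then has to emit each 64-bit access individually instead of merging them into 128-bit SSE instructions. A minimal sketch of that idea (the function name is mine, and the tail handling is omitted just like in the question):

#include <cstdint>
#include <cstddef>

// Copy into an MMIO region using only individual 64-bit accesses.
// The volatile qualifier forbids the compiler from merging or
// vectorizing the stores.
void bufferCopyMMIO(void *dest, void *src, const size_t size) {
    volatile uint64_t *pdest = static_cast<volatile uint64_t *>(dest);
    const uint64_t *psrc = static_cast<const uint64_t *>(src);
    const size_t iters = size / sizeof(uint64_t);
    for (size_t index = 0; index < iters; ++index) {
        pdest[index] = psrc[index];   // one 64-bit store per iteration
    }
    // tail bytes (size % 8) still need separate handling, as in the original
}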

Related

How can I reduce the OLED display clear time?

I am using an STM32F0 microcontroller.
I am trying to integrate the STM32 with an OLED (SSD1306, 128*64) over I2C.
The issue is that clearing the display takes a couple of seconds (around 2 s). How can I make it faster?
I have attached the problematic portion below.
Is there any way to solve this?
I think the issue is in the I2C_TransferHandling function, because it is called repeatedly in the loop, but I have no idea how to solve it.
Issue section
void Display_Clear(void)
{
    int i = 0;

    Write_inst_oled(SSD1306_SET_COLUMN_ADDR);
    Write_inst_oled(0);
    Write_inst_oled(127);
    Write_inst_oled(SSD1306_SET_PAGE_ADDR);
    Write_inst_oled(0);
    Write_inst_oled(7);

    I2C_TransferHandling(I2C1, (SlaveAddr << 1), 1, I2C_Reload_Mode, I2C_Generate_Start_Write);
    while (I2C_GetFlagStatus(I2C1, I2C_FLAG_TXIS) == RESET);
    Display_SendByte(SSD1306_DATA_CONTINUE);
    delay_ms(1);

    for (i = 0; i < 1024; i++)   // Write zeros to clear the display
    {
        I2C_TransferHandling(I2C1, 0, 1, I2C_Reload_Mode, I2C_No_StartStop);
        while (I2C_GetFlagStatus(I2C1, I2C_FLAG_TXIS) == RESET);
        Display_SendByte(0);
        delay_ms(1);
    }

    I2C_TransferHandling(I2C1, (SlaveAddr << 1), 0, I2C_SoftEnd_Mode, I2C_Generate_Start_Write);
    Write_inst_oled(SSD1306_SET_COLUMN_ADDR);
    Write_inst_oled(0);
    Write_inst_oled(127);
    Write_inst_oled(SSD1306_SET_PAGE_ADDR);
    Write_inst_oled(0);
    Write_inst_oled(7);
}
Remove the delay from the loop. You wait 1 ms on every iteration, so dropping it will save you more than 1 second.
Increase the I2C speed. Most displays support 400 kHz or faster I2C; at the current rate, sending 1024 bytes takes at least 0.8 s.
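For the first point, a minimal sketch of the same loop with the per-byte delay removed (everything else is taken verbatim from the question's code; only the TXIS wait, which is actually required, remains):

for (i = 0; i < 1024; i++)   // write zeros to clear the display
{
    I2C_TransferHandling(I2C1, 0, 1, I2C_Reload_Mode, I2C_No_StartStop);
    while (I2C_GetFlagStatus(I2C1, I2C_FLAG_TXIS) == RESET);   // wait until the TX register is empty
    Display_SendByte(0);
    // no delay_ms(1) here -- the TXIS wait already paces the transfer
}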

Disabling SPI peripheral on STM32H7 between two transmissions?

I'm still doing SPI experiments between two Nucleo STM32H743 boards.
I've configured the SPI in full-duplex mode, with CRC enabled and an SPI frequency of 25 MHz (so the Slave can transmit without issue).
DSIZE is 8 bits and the FIFO threshold is 4.
On the Master side, I'm sending 4 bytes and then waiting for 5 bytes from the Slave. I know I could use half-duplex or simplex mode, but I want to understand what's going on in full-duplex mode.
volatile unsigned long *CR1  = (unsigned long *)0x40013000;
volatile unsigned long *CR2  = (unsigned long *)0x40013004;
volatile unsigned long *TXDR = (unsigned long *)0x40013020;
volatile unsigned long *RXDR = (unsigned long *)0x40013030;
volatile unsigned long *SR   = (unsigned long *)0x40013014;
volatile unsigned long *IFCR = (unsigned long *)0x40013018;
volatile unsigned long *TXCRC = (unsigned long *)0x40013044;
volatile unsigned long *RXCRC = (unsigned long *)0x40013048;
volatile unsigned long *CFG2 = (unsigned long *)0x4001300C;

unsigned long SPI_TransmitCommandFullDuplex(uint32_t Data)
{
    // size of transfer (TSIZE)
    *CR2 = 4;
    /* Enable SPI peripheral */
    *CR1 |= SPI_CR1_SPE;
    /* Master transfer start */
    *CR1 |= SPI_CR1_CSTART;

    *TXDR = Data;
    while ( ((*SR) & SPI_FLAG_EOT) == 0 );

    // clear flags
    *IFCR = 0xFFFFFFFF;
    // disable SPI
    *CR1 &= ~SPI_CR1_SPE;
    return 0;
}

void SPI_ReceiveResponseFullDuplex(uint8_t *pData)
{
    // size of transfer (TSIZE)
    *CR2 = 5;
    /* Enable SPI peripheral */
    *CR1 |= SPI_CR1_SPE;
    /* Master transfer start */
    *CR1 |= SPI_CR1_CSTART;

    *TXDR = 0;                          // 4 dummy bytes
    *((volatile uint8_t *)TXDR) = 0;    // 5th dummy byte
    while ( ((*SR) & SPI_FLAG_EOT) == 0 );

    *((uint32_t *)pData) = *RXDR;                          // 4 response bytes
    *((uint8_t *)(pData + 4)) = *((volatile uint8_t *)RXDR);  // 5th response byte

    // clear flags
    *IFCR = 0xFFFFFFFF;
    // disable SPI
    *CR1 &= ~SPI_CR1_SPE;
}
This works fine (both functions are just called in sequence in main).
Then I tried to remove the SPI disabling between the two steps (i.e. I no longer clear and set the SPE bit again) and I got stuck in the while loop of SPI_ReceiveResponseFullDuplex.
Is it necessary to disable SPI between two transmissions, or did I make a mistake in the configuration?
The behaviour of the SPE bit is not very clear in the reference manual. For example, it is written clearly that in half-duplex mode the SPI has to be disabled to change the direction of communication, but nothing is said about full-duplex mode (or I missed it).
This errata item might be relevant here.
Master data transfer stall at system clock much faster than SCK
Description
With the system clock (spi_pclk) substantially faster than SCK (spi_ker_ck divided by a prescaler), SPI/I2S master data transfer can stall upon setting the CSTART bit within one SCK cycle after the EOT event (EOT flag raise) signaling the end of the previous transfer.
Workaround
Apply one of the following measures:
• Disable then enable SPI/I2S after each EOT event.
• Upon EOT event, wait for at least one SCK cycle before setting CSTART.
• Prevent EOT events from occurring, by setting transfer size to undefined (TSIZE = 0)
and by triggering transmission exclusively by TXFIFO writes.
Your SCK frequency is 25 MHz and the system clock can be 400 or 480 MHz, i.e. 16 or 19 times SCK. When you remove the lines clearing and setting SPE, only these lines remain in effect after detecting EOT:
*IFCR = 0xFFFFFFFF;
*CR2 = 5;
*CR1 |= SPI_CR1_CSTART;
When this sequence (quite probably) takes fewer than 16 system clock cycles, i.e. less than one SCK cycle, you hit the problem described above. Looks like someone did sloppy work again on the SPI clock system. What you did first, clearing and setting SPE, is one of the recommended workarounds.
I would just set TSIZE=9 at the start, then write the command and the dummy bytes in one go; it makes no difference in full-duplex mode.
Keep in mind that in full-duplex mode another 4 bytes are received, which must be read and discarded before getting the real answer. This was not a problem with your original code, because clearing SPE discards data still in the receive FIFO, but it would become one if the modified code worked, e.g. if there were some more delay before enabling CSTART again.
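One possible shape of that combined transfer, as an untested sketch built only from the register pointers and macros already defined in the question (TSIZE = 9, a single enable/start, and the 4 bytes clocked in during the command phase discarded before reading the real answer; FIFO-space checks are ignored):

void SPI_CommandResponseFullDuplex(uint32_t Cmd, uint8_t *pResp)
{
    *CR2 = 9;                      // TSIZE: 4 command bytes + 5 response bytes
    *CR1 |= SPI_CR1_SPE;           // enable SPI
    *CR1 |= SPI_CR1_CSTART;        // master transfer start

    *TXDR = Cmd;                             // 4 command bytes
    *TXDR = 0;                               // 4 dummy bytes
    *((volatile uint8_t *)TXDR) = 0;         // 5th dummy byte

    while (((*SR) & SPI_FLAG_EOT) == 0);     // wait for end of transfer

    (void)*RXDR;                             // discard the 4 bytes received
                                             // while the command was shifted out
    *((uint32_t *)pResp) = *RXDR;            // 4 response bytes
    pResp[4] = *((volatile uint8_t *)RXDR);  // 5th response byte

    *IFCR = 0xFFFFFFFF;            // clear flags
    *CR1 &= ~SPI_CR1_SPE;          // disable SPI
}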

STM32 SPI data is sent the reverse way

I've been experimenting with writing to an external EEPROM using SPI and I've had mixed success. The data does get shifted out, but in the opposite order. The EEPROM requires a start bit and then an opcode, which is essentially a 2-bit code for read, write and erase; the start bit and the opcode are combined into one byte. I'm creating a 32-bit unsigned int and then bit-shifting the values into it. When I transmit this I see the actual data first, then the SB+opcode, and then the memory address. How do I reverse this so that the opcode is seen first, then the memory address and then the actual data? As seen in the image below, the data is BCDE, SB+opcode is 07 and the memory address is 3F. The correct sequence should be 07, 3F and then BCDE (I think!).
Here is the code:
uint8_t mem_addr = 0x3F;
uint16_t data = 0xBCDE;
uint32_t write_package = (ERASE << 24 | mem_addr << 16 | data);

while (1)
{
    /* USER CODE END WHILE */
    /* USER CODE BEGIN 3 */
    HAL_SPI_Transmit(&hspi1, &write_package, 2, HAL_MAX_DELAY);
    HAL_Delay(10);
}
/* USER CODE END 3 */
It looks like your SPI interface is set up to process 16-bit halfwords at a time. Therefore it would make sense to break up the data to be sent into 16-bit halfwords too; that would take care of the ordering.
uint8_t mem_addr = 0x3F;
uint16_t data = 0xBCDE;
uint16_t write_package[2] = {
    (ERASE << 8) | mem_addr,
    data
};

HAL_SPI_Transmit(&hspi1, (uint8_t *)write_package, 2, HAL_MAX_DELAY);
EDIT
Added an explicit cast. As noted in the comments, without the explicit cast it wouldn't compile as C++ code, and it would cause some warnings as C code.
You're packing your information into a 32-bit integer; on line 3 of your code you decide which bits of data are placed where in the word. To change the order you can replace that line with:
uint32_t write_package = ((data << 16) | (mem_addr << 8) | (ERASE));
That is shifting data 16 bits left into the most significant 16 bits of the word, shifting mem_addr up by 8 bits and or-ing it in, and then adding ERASE in the least significant bits.
Your problem is endianness.
By default the STM32 uses little-endian, so the lowest byte of the uint32_t is stored at the first address.
If I'm right, this is the declaration of the transmit function you are using:
HAL_StatusTypeDef HAL_SPI_Transmit(SPI_HandleTypeDef *hspi, uint8_t *pData, uint16_t Size, uint32_t Timeout)
It requires a pointer to uint8_t as data (and not a uint32_t), so you should get at least a warning when you compile your code.
If you want to write code that is independent of the endianness used, you should store your data in an array instead of one "big" variable.
uint8_t write_package[4];
write_package[0] = ERASE;
write_package[1] = mem_addr;
write_package[2] = (data >> 8) & 0xFF;
write_package[3] = (data & 0xFF);
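The matching transmit call would then look something like this (a sketch, assuming the SPI data size is set to 8 bits):

HAL_SPI_Transmit(&hspi1, write_package, sizeof(write_package), HAL_MAX_DELAY);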

Why is the SSE4.2 CRC32 hash value different from the software CRC32 hash value?

In my project, CRC32 is calculated very many times.
I have used a software CRC32 calculation until now.
But I noticed there is CPU support in SSE4.2, and Linux also provides a hardware CRC32 calculation function using that CPU instruction.
I use an Intel Xeon E5-2650 CPU, so I tried to calculate CRC32 using the Linux function.
But the result is different from the software CRC32 function I used.
I used the init value 127 in both. The software CRC32 function I used is below:
static uint32_t crc32_tab[] = {
0x00000000, 0x77073096, 0xee0e612c, 0x990951ba, 0x076dc419, 0x706af48f,
0xe963a535, 0x9e6495a3, 0x0edb8832, 0x79dcb8a4, 0xe0d5e91e, 0x97d2d988,
0x09b64c2b, 0x7eb17cbd, 0xe7b82d07, 0x90bf1d91, 0x1db71064, 0x6ab020f2,
0xf3b97148, 0x84be41de, 0x1adad47d, 0x6ddde4eb, 0xf4d4b551, 0x83d385c7,
0x136c9856, 0x646ba8c0, 0xfd62f97a, 0x8a65c9ec, 0x14015c4f, 0x63066cd9,
0xfa0f3d63, 0x8d080df5, 0x3b6e20c8, 0x4c69105e, 0xd56041e4, 0xa2677172,
0x3c03e4d1, 0x4b04d447, 0xd20d85fd, 0xa50ab56b, 0x35b5a8fa, 0x42b2986c,
0xdbbbc9d6, 0xacbcf940, 0x32d86ce3, 0x45df5c75, 0xdcd60dcf, 0xabd13d59,
0x26d930ac, 0x51de003a, 0xc8d75180, 0xbfd06116, 0x21b4f4b5, 0x56b3c423,
0xcfba9599, 0xb8bda50f, 0x2802b89e, 0x5f058808, 0xc60cd9b2, 0xb10be924,
0x2f6f7c87, 0x58684c11, 0xc1611dab, 0xb6662d3d, 0x76dc4190, 0x01db7106,
0x98d220bc, 0xefd5102a, 0x71b18589, 0x06b6b51f, 0x9fbfe4a5, 0xe8b8d433,
0x7807c9a2, 0x0f00f934, 0x9609a88e, 0xe10e9818, 0x7f6a0dbb, 0x086d3d2d,
0x91646c97, 0xe6635c01, 0x6b6b51f4, 0x1c6c6162, 0x856530d8, 0xf262004e,
0x6c0695ed, 0x1b01a57b, 0x8208f4c1, 0xf50fc457, 0x65b0d9c6, 0x12b7e950,
0x8bbeb8ea, 0xfcb9887c, 0x62dd1ddf, 0x15da2d49, 0x8cd37cf3, 0xfbd44c65,
0x4db26158, 0x3ab551ce, 0xa3bc0074, 0xd4bb30e2, 0x4adfa541, 0x3dd895d7,
0xa4d1c46d, 0xd3d6f4fb, 0x4369e96a, 0x346ed9fc, 0xad678846, 0xda60b8d0,
0x44042d73, 0x33031de5, 0xaa0a4c5f, 0xdd0d7cc9, 0x5005713c, 0x270241aa,
0xbe0b1010, 0xc90c2086, 0x5768b525, 0x206f85b3, 0xb966d409, 0xce61e49f,
0x5edef90e, 0x29d9c998, 0xb0d09822, 0xc7d7a8b4, 0x59b33d17, 0x2eb40d81,
0xb7bd5c3b, 0xc0ba6cad, 0xedb88320, 0x9abfb3b6, 0x03b6e20c, 0x74b1d29a,
0xead54739, 0x9dd277af, 0x04db2615, 0x73dc1683, 0xe3630b12, 0x94643b84,
0x0d6d6a3e, 0x7a6a5aa8, 0xe40ecf0b, 0x9309ff9d, 0x0a00ae27, 0x7d079eb1,
0xf00f9344, 0x8708a3d2, 0x1e01f268, 0x6906c2fe, 0xf762575d, 0x806567cb,
0x196c3671, 0x6e6b06e7, 0xfed41b76, 0x89d32be0, 0x10da7a5a, 0x67dd4acc,
0xf9b9df6f, 0x8ebeeff9, 0x17b7be43, 0x60b08ed5, 0xd6d6a3e8, 0xa1d1937e,
0x38d8c2c4, 0x4fdff252, 0xd1bb67f1, 0xa6bc5767, 0x3fb506dd, 0x48b2364b,
0xd80d2bda, 0xaf0a1b4c, 0x36034af6, 0x41047a60, 0xdf60efc3, 0xa867df55,
0x316e8eef, 0x4669be79, 0xcb61b38c, 0xbc66831a, 0x256fd2a0, 0x5268e236,
0xcc0c7795, 0xbb0b4703, 0x220216b9, 0x5505262f, 0xc5ba3bbe, 0xb2bd0b28,
0x2bb45a92, 0x5cb36a04, 0xc2d7ffa7, 0xb5d0cf31, 0x2cd99e8b, 0x5bdeae1d,
0x9b64c2b0, 0xec63f226, 0x756aa39c, 0x026d930a, 0x9c0906a9, 0xeb0e363f,
0x72076785, 0x05005713, 0x95bf4a82, 0xe2b87a14, 0x7bb12bae, 0x0cb61b38,
0x92d28e9b, 0xe5d5be0d, 0x7cdcefb7, 0x0bdbdf21, 0x86d3d2d4, 0xf1d4e242,
0x68ddb3f8, 0x1fda836e, 0x81be16cd, 0xf6b9265b, 0x6fb077e1, 0x18b74777,
0x88085ae6, 0xff0f6a70, 0x66063bca, 0x11010b5c, 0x8f659eff, 0xf862ae69,
0x616bffd3, 0x166ccf45, 0xa00ae278, 0xd70dd2ee, 0x4e048354, 0x3903b3c2,
0xa7672661, 0xd06016f7, 0x4969474d, 0x3e6e77db, 0xaed16a4a, 0xd9d65adc,
0x40df0b66, 0x37d83bf0, 0xa9bcae53, 0xdebb9ec5, 0x47b2cf7f, 0x30b5ffe9,
0xbdbdf21c, 0xcabac28a, 0x53b39330, 0x24b4a3a6, 0xbad03605, 0xcdd70693,
0x54de5729, 0x23d967bf, 0xb3667a2e, 0xc4614ab8, 0x5d681b02, 0x2a6f2b94,
0xb40bbe37, 0xc30c8ea1, 0x5a05df1b, 0x2d02ef8d
};
uint32_t crc32(uint32_t crc, const void *buf, size_t size)
{
    const uint8_t *p;

    p = buf;
    crc = 127;
    while (size--)
        crc = crc32_tab[(crc ^ *p++) & 0xFF] ^ (crc >> 8);
    return crc ^ ~0U;
}
That table is for the "standard" CRC-32 used in Ethernet, zip, gzip, V.42 and many other places.
I need to point out that the code in your question is incorrect, with the crc = 127; statement. crc should not be set at all, since it is an input to the function that is then lost, and it certainly should not be initialized to 127. What should be there is crc = ~crc;. The same approach should be used at the end, i.e. return ~crc; instead of return crc ^ ~0U;, which may not be portable if unsigned is not the same size as uint32_t. The use of ~ is fine with uint32_t, but if a different type that is not 32 bits is used, then crc ^ 0xffffffff is more portable still.
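Put together, a corrected version of the routine along those lines might look like this (the table and the loop are unchanged from the question; only the initialization and the final inversion change):

uint32_t crc32(uint32_t crc, const void *buf, size_t size)
{
    const uint8_t *p = (const uint8_t *)buf;

    crc = ~crc;        // invert the incoming CRC instead of overwriting it
    while (size--)
        crc = crc32_tab[(crc ^ *p++) & 0xFF] ^ (crc >> 8);
    return ~crc;       // final inversion; fine here since crc is uint32_t
}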
Anyway, to answer your question, Intel chose a different 32-bit CRC to implement in their hardware instruction. That CRC is usually referred to as CRC-32C, using a polynomial discovered by Castagnoli (what the "C" refers to) with better properties than the polynomial used in the standard CRC-32. iSCSI and SCTP use the CRC-32C instead of CRC-32. I presume that that had something to do with Intel's choice.
See this answer for code that computes the CRC-32C in software, as well as in hardware if available.
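For reference, the hardware path boils down to the SSE4.2 intrinsic. A byte-at-a-time sketch (compile with -msse4.2); note that it computes CRC-32C, so it will not match the table-based CRC-32 above, which is exactly the point:

#include <nmmintrin.h>   // _mm_crc32_u8 (SSE4.2)
#include <stddef.h>
#include <stdint.h>

uint32_t crc32c_hw(uint32_t crc, const void *buf, size_t size)
{
    const uint8_t *p = (const uint8_t *)buf;

    crc = ~crc;
    while (size--)
        crc = _mm_crc32_u8(crc, *p++);   // CRC-32C (Castagnoli), not CRC-32
    return ~crc;
}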

luajit qsort callback example memory leak

I have the following qsort example to try out callbacks in LuaJIT. However, it has a memory leak (luajit: not enough memory when executing) whose cause is not obvious to me.
Can somebody give me some hints on how to create a proper callback example?
local ffi = require("ffi")

-- ===============================================================================
ffi.cdef[[
void qsort(void *base, size_t nel, size_t width, int (*compar)(const void *, const void *));
]]

function compare(a, b)
    return a[0] - b[0]
end

-- ===============================================================================
-- Explicitly convert to a callback via cast
local callback = ffi.cast("int (*)(const char *, const char *)", compare)

local data = "efghabcd"
local size = 8
local loopSize = 1000 * 1000 * 100.
local bytes = ffi.new("char[15]")

-- ===============================================================================
for i=1,loopSize do
    ffi.copy(bytes, data, size)
    ffi.C.qsort(bytes, size, 1, callback)
end
Platform: OSX 10.8
luajit: 2.0.1
The problem appears to be that Lua never gets a chance to perform a full garbage-collection cycle inside the tight loop. As hinted in the comments, you can correct this by calling collectgarbage() yourself inside the loop.
Note that calling collectgarbage() on every iteration will impact the running time of whatever you're benchmarking. To minimize this, you should set a threshold to limit how often collectgarbage() gets called:
local memthreshold = 2 ^ 20 / 1024
local start = os.clock()
for i = 1, loopSize do
    ffi.copy(bytes, data, size)
    ffi.C.qsort(bytes, size, 1, callback)
    if collectgarbage'count' > memthreshold then
        collectgarbage()
    end
end
local elapse = os.clock() - start
print("elapsed:", elapse..'s')