Can speculative execution on an Intel CPU cause EXC_BAD_INSTRUCTION (SIGILL)?

I have a hypothesis that speculative execution on an Intel Nehalem (1st gen Core) CPU is causing a crash. Is that possible, or am I completely wrong? If it is possible, what can I do to prevent it? Could I disable speculative execution for just one function or for the whole translation unit?
The cpp file that contains the problematic code is compiled by clang with the flags -mavx2 -mxsave; all other files are compiled without these flags. The code works fine on every contemporary MacBook and Windows laptop/desktop available to me.
The tester's MacBook has an Intel(R) Core(TM) i5 CPU 760. This CPU doesn't support the AVX2 instruction set.
There is code that checks whether AVX2 is supported, and the AVX2 path is not executed if it isn't. I don't have direct access to this device for debugging, so I can't tell exactly which code causes the crash. For now I have two hypotheses:
1. The code that checks whether AVX2 is supported is wrong and returns true when it should return false.
2. Even though the check returns false, speculative execution actually runs the AVX2 code, causing the crash.
I have already replaced/"fixed" the checking code, following the primary hypothesis, but the tester still reports the crash. So I don't know for sure that isAvx2Supported is false.
Code that checks whether AVX2 is supported:
void cpuid(int info[4], int InfoType) noexcept
{
#ifdef _WIN32
__cpuidex(info, InfoType, 0);
#else
__cpuid_count(InfoType, 0, info[0], info[1], info[2], info[3]);
#endif
}
bool check_xcr0_ymm() noexcept
{
uint32_t xcr0;
#if defined(_MSC_VER)
xcr0 = (uint32_t)_xgetbv(0);
#else
__asm__ __volatile__("xgetbv" : "=a" (xcr0) : "c" (0) : "%edx");
#endif
// checking if xmm and ymm state are enabled in XCR0
return (xcr0 & 6) == 6;
}
bool check_4th_gen_intel_core_features() noexcept
{
// see original article
// https://software.intel.com/en-us/articles/how-to-detect-new-instruction-support-in-the-4th-generation-intel-core-processor-family
int cpuInfo[4] = {};
cpuid(cpuInfo, 1);
// CPUID.(EAX=01H, ECX=0H):ECX.FMA[bit 12]==1
// && CPUID.(EAX=01H, ECX=0H):ECX.MOVBE[bit 22]==1
// && CPUID.(EAX=01H, ECX=0H):ECX.OSXSAVE[bit 27]==1
constexpr uint32_t fma_movbe_osxsave_mask = ((1 << 12) | (1 << 22) | (1 << 27));
if((cpuInfo[2] & fma_movbe_osxsave_mask) != fma_movbe_osxsave_mask)
return false;
if(!check_xcr0_ymm())
return false;
cpuid(cpuInfo, 7);
// CPUID.(EAX=07H, ECX=0H):EBX.AVX2[bit 5]==1
// && CPUID.(EAX=07H, ECX=0H):EBX.BMI1[bit 3]==1
// && CPUID.(EAX=07H, ECX=0H):EBX.BMI2[bit 8]==1
constexpr uint32_t avx2_bmi12_mask = (1 << 5) | (1 << 3) | (1 << 8);
if((cpuInfo[1] & avx2_bmi12_mask) != avx2_bmi12_mask)
return false;
cpuid(cpuInfo, 0x80000001);
// CPUID.(EAX=80000001H):ECX.LZCNT[bit 5]==1
if((cpuInfo[2] & (1 << 5)) == 0)
return false;
return true;
}
const auto isAvx2Supported = check_4th_gen_intel_core_features();
The actual code that uses AVX2:
int findCharFast(const char* data, size_t dataSize, char c, unsigned int& offset)
{
if(isAvx2Supported)
{
const auto mask = _mm256_set1_epi8(c);
auto it = data + offset;
for(const auto end = data + dataSize - 31; it < end; it += 32)
{
if(const auto result = _mm256_movemask_epi8(_mm256_cmpeq_epi8(mask, _mm256_loadu_si256(reinterpret_cast<const __m256i*>(it)))))
{
return it - data + get_first_bit_set(result);
}
}
offset = it - data;
}
return -1;
}
The crash report says:
Crashed Thread: 43 Queue(0x60400005b270)[16]
Exception Type: EXC_BAD_INSTRUCTION (SIGILL)
Exception Codes: 0x0000000000000001, 0x0000000000000000
Thread 43 Crashed:: Queue(0x60400005b270)[16]
0 0x000000010d54b06b findCharFast(char const*, unsigned long, char, unsigned int&) + 91
Thread 43 crashed with X86 Thread State (64-bit):
rax: 0x00000000ffffffff rbx: 0x0000000000000000 rcx: 0x000070000982bbd0 rdx: 0x0000000000000000
rdi: 0x00007f9ac044fa0b rsi: 0x000000000000000d rbp: 0x000070000982bc00 rsp: 0x000070000982bbc8
r8: 0x0000000000000001 r9: 0x0000000000000001 r10: 0x000060c000334030 r11: 0xfffffffffffc0fde
r12: 0x0000000000000001 r13: 0x0000600000016fb0 r14: 0x00007f9ac044fa0b r15: 0x00007f9ac044fa18
rip: 0x000000010d54b06b rfl: 0x0000000000010246 cr2: 0x000000010d529cf0

There was a bug in the AVX2 support detection code. The Intel article that describes how to do it is basically wrong: before calling xgetbv, the implementation MUST check that CPUID.(EAX=01H):ECX.XSAVE[bit 26]==1. Checking only the OSXSAVE flag is not sufficient.
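The ordering described above can be sketched as follows. This is a minimal sketch, not the original fix: it assumes the raw CPUID leaf 1 ECX value is passed in, so the logic can be exercised on any machine. The names os_enables_ymm, kXsaveBit and kOsxsaveBit and the read_xcr0 callback are hypothetical; in real code the callback would wrap xgetbv or _xgetbv(0).

```cpp
#include <cstdint>

// Bit positions in CPUID.(EAX=01H):ECX.
constexpr std::uint32_t kXsaveBit   = 1u << 26; // CPU implements XSAVE/XGETBV
constexpr std::uint32_t kOsxsaveBit = 1u << 27; // OS has enabled XSAVE

// Returns true only if XGETBV may be executed and XCR0 reports that the OS
// saves both XMM (bit 1) and YMM (bit 2) state.
// read_xcr0 is a hypothetical callback; it is invoked only after both CPUID
// bits have been verified, so XGETBV is never reached on CPUs without XSAVE.
template <typename ReadXcr0>
bool os_enables_ymm(std::uint32_t leaf1_ecx, ReadXcr0 read_xcr0)
{
    constexpr std::uint32_t required = kXsaveBit | kOsxsaveBit;
    if ((leaf1_ecx & required) != required)
        return false;
    const std::uint64_t xcr0 = read_xcr0();
    return (xcr0 & 0x6) == 0x6;
}
```

Passing the XCR0 read as a callback is only a design choice for this sketch: it makes the "check the CPUID bits first" rule visible in the control flow and lets it be checked without the target hardware.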

Related

ESP32 FreeRtos Queue invokes [Guru Meditation Error: Core 0 panic'ed (LoadProhibited)]

I have encountered a pretty strange problem while exploring the capabilities of FreeRTOS on an ESP32 Wrover module. Currently, I have two tasks in my program. The first task will collect some data, and the second is dedicated to printing debug messages to the serial monitor. These tasks use a queue to exchange data. Since I want to create a few more tasks in the system, the data collector task receives the queue as part of a parameter struct.
Here is my problem: if the data collector task sends only one message to the queue, the program works perfectly. But if I add another message to the queue (as shown in the last piece of code), the CPU hits a "LoadProhibited" exception.
From what I have read in other topics, this problem is usually caused by accessing a NULL pointer somewhere in the program. But as you can see in the code below, I tried to add some protection by checking the pointers before adding anything to the queue. I also tried raising the amount of memory allocated to the tasks and pinning both tasks to core 1. I still got the same result.
Here is the main:
static QueueHandle_t debugMsgQueue = NULL;
static QueueHandle_t sensorDataBufQueue = NULL;
TaskHandle_t debugTaskHandle = NULL;
TaskHandle_t sensorTaskHandle = NULL;
uint32_t sensorTaskWatchdog;
ESP32Time rtc;
void StreamDebugger(void* pvParameters) {
char debugMsg[_debugDataLength];
while (1) {
if (debugMsgQueue != NULL) {
if (xQueueReceive(debugMsgQueue, (void*)debugMsg, portMAX_DELAY) == pdPASS) {
Serial.print(debugMsg);
}
}
}
}
void setup(){
Serial.begin(115200);
EEPROM.begin(_eepromSize);
/*CREATING GLOBAL DATA BUFFERS*/
debugMsgQueue = xQueueCreate(5, sizeof(char[_debugDataLength]));
sensorDataBufQueue = xQueueCreate(2, sizeof(char*));
if (debugMsgQueue == NULL || sensorDataBufQueue == NULL) {
Serial.print("\r\nCouldn't create dataBuffers. Aborting operation.");
}
BaseType_t xReturned;
/*DEBUG MESSAGE HANDLER TASK*/
xReturned = xTaskCreate(StreamDebugger, "DEBUG", 2048, NULL, 1, &debugTaskHandle);
if (xReturned != pdPASS) {
Serial.print("\r\nCouldn't create DEBUGTASK. Aborting operation.");
}
/*MEASUREMENT HANDLER TASK*/
const ReadSensorsParameters sensorTaskDescriptor{ &debugMsgQueue,&sensorDataBufQueue,&sensorTaskWatchdog,rtc};
xReturned = xTaskCreate(ReadSensors, "GETDATA", 4096, (void*)&sensorTaskDescriptor, 1, &sensorTaskHandle);
if (xReturned != pdPASS) {
Serial.print("\r\nCouldn't create GETDATATASK. Aborting operation.");
}
}
void loop(){
}
This is the struct which is used by the sensor data collector task:
typedef struct READSENTASKPARAMETERS {
QueueHandle_t* debugQueue;
QueueHandle_t* dataQueue;
uint32_t* watchdog;
ESP32Time &systemClock;
}ReadSensorsParameters;
This is the data collector task, the one that works:
void ReadSensors(void* pvParameters) {
ReadSensorsParameters* handlers = (ReadSensorsParameters*) pvParameters;
char debugMsg[_debugDataLength];
char dataMsg[_msgDataMaxLength];
strcpy(debugMsg, "READSENSORTASK");
if (debugMsg != NULL && *handlers->debugQueue != NULL) {
xQueueSend(*handlers->debugQueue, (void*)debugMsg, portMAX_DELAY);
}
vTaskDelete(NULL);
}
And here is the modified task, which for some reason does not work at all:
void ReadSensors(void* pvParameters) {
ReadSensorsParameters* handlers = (ReadSensorsParameters*) pvParameters;
char debugMsg[_debugDataLength];
char dataMsg[_msgDataMaxLength];
strcpy(debugMsg, "READSENSORTASK");
if (debugMsg != NULL && *handlers->debugQueue != NULL) {
xQueueSend(*handlers->debugQueue, (void*)debugMsg, portMAX_DELAY);
}
if (debugMsg != NULL && *handlers->debugQueue != NULL) {
xQueueSend(*handlers->debugQueue, (void*)debugMsg, portMAX_DELAY);
}
vTaskDelete(NULL);
}
And here is the error message I receive:
rst:0xc (SW_CPU_RESET),boot:0x13 (SPI_FAST_FLASH_BOOT)
configsip: 0, SPIWP:0xee
clk_drv:0x00,q_drv:0x00,d_drv:0x00,cs0_drv:0x00,hd_drv:0x00,wp_drv:0x00
mode:DIO, clock div:1
load:0x3fff0018,len:4
load:0x3fff001c,len:1044
load:0x40078000,len:8896
load:0x40080400,len:5816
entry 0x400806ac
READSENSORTASKGuru Meditation Error: Core 0 panic'ed (LoadProhibited). Exception was unhandled.
Core 0 register dump:
PC : 0x400d0e5c PS : 0x00060d30 A0 : 0x800889dc A1 : 0x3ffb2f80
A2 : 0x00000000 A3 : 0x3f400fad A4 : 0x3ffc07b8 A5 : 0x3ffb8058
A6 : 0x00000000 A7 : 0x00000000 A8 : 0x800d0e5a A9 : 0x3ffb2f70
A10 : 0x3ffb2f8a A11 : 0x3f400fbc A12 : 0x000000ff A13 : 0x0000ff00
A14 : 0x00ff0000 A15 : 0xff000000 SAR : 0x00000010 EXCCAUSE: 0x0000001c
EXCVADDR: 0x00000000 LBEG : 0x4000142d LEND : 0x4000143a LCOUNT : 0xfffffff3
Backtrace: 0x400d0e5c:0x3ffb2f80 0x400889d9:0x3ffb2fe0
Does anyone have any idea?
SOLVED! Turned out (after a few sleepless nights) that
static const MyTaskParameters sensorTaskDescriptor{
&debugMsgQueue,
&sensorDataBufQueue,
&sensorTaskWatchdog,
rtc,
&sensorTaskWatchdogSemaphore,
&rtcSemaphore
};
had to be declared as a static variable. What I think happened is that when READSENSORTASK was created, it immediately started running and was able to place data into the output buffer. After the first context switch the setup task was deleted automatically, and with it the sensorTaskDescriptor variable, which is why the next message placement triggered the LoadProhibited exception. What is still weird to me is that I was checking all the pointers against NULL. I guess the faulty access was somewhere inside the xQueueSend function.
Anyways, I hope this thread helps someone.
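The lifetime issue can be shown with a portable sketch that uses a toy scheduler in place of FreeRTOS. Everything here is a hypothetical stand-in: create_task for xTaskCreate, run_pending_tasks for the scheduler, TaskParams for the parameter struct and setup_like for setup(). The task only receives a pointer, so the object it points to must outlive the function that created the task. A local variable stops existing when that function returns, and the resulting dangling pointer is not NULL, which is why NULL checks cannot catch it.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical parameter block, standing in for the struct passed to the task.
struct TaskParams {
    std::uint32_t* watchdog;
    int messages_to_send;
};

using TaskFn = void (*)(void*);
struct PendingTask { TaskFn fn; void* arg; };

// Toy stand-in for the scheduler: tasks run later, after the creator returned.
static std::vector<PendingTask> g_pending;

void create_task(TaskFn fn, void* arg) { g_pending.push_back({fn, arg}); }

void run_pending_tasks()
{
    for (const PendingTask& t : g_pending) t.fn(t.arg);
    g_pending.clear();
}

// The task dereferences the pointer it was given when it finally runs.
void sensor_task(void* pv)
{
    auto* p = static_cast<TaskParams*>(pv);
    *p->watchdog += static_cast<std::uint32_t>(p->messages_to_send);
}

static std::uint32_t g_watchdog = 0;

void setup_like()
{
    // static: the object lives for the whole program, not just this call.
    // Without 'static' it would sit on this function's stack, and the pointer
    // handed to the task would dangle once setup_like() returns.
    static TaskParams params{&g_watchdog, 2};
    create_task(sensor_task, &params);
}
```

Calling setup_like() and then run_pending_tasks() mirrors the order of events on the device: the parameters are read only after the creating function has already returned.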

Cuda warp illegal address

I'm having some trouble with CUDA and passing classes to a kernel. I've some functions which allocate memory for the class on the GPU, pass it, and work fine. There is another one, though, that just won't work. I noticed that it happens only when I'm working with arrays. Here is an example.
File1.hh
#ifndef PROVA1_HH
#define PROVA1_HH
#include <cstdio>
class cls {
public:
int *x, y;
cls();
void kernel();
};
#endif
File1.cu
#include "Prova1.hh"
__global__ void kernel1(cls* c){
printf("%d\n", c->y);
c->y=2;
printf("%d\n", c->y);
c->x[0]=0; c->x[1]=1;
printf("%d %d\n", c->x[0], c->x[1]);
}
void cls::kernel(){
cls* dev_c; cudaMalloc(&dev_c, sizeof(cls));
cudaMemcpy(dev_c, this, sizeof(cls), cudaMemcpyHostToDevice);
printf("(%d, %d)\n", x[0], x[1]);
kernel1<<<1, 1>>> (dev_c);
cudaDeviceSynchronize();
cudaMemcpy(this, dev_c, sizeof(cls), cudaMemcpyDeviceToHost);
printf("(%d, %d)\n", x[0], x[1]);
}
cls::cls(){
y=3;
x=(int*) malloc(sizeof(int)*2);
x[0]=1; x[1]=2;
}
File.cu
#include<cstdio>
#include "Prova1.hh"
int main(){
cls c=cls();
c.kernel();
return 0;
}
I'm compiling with:
nvcc -std=c++11 -arch=sm_35 -rdc=true -c -o File1.o File1.cu
nvcc -std=c++11 -arch=sm_35 -rdc=true -g -G -o File.out File1.o File.cu
When I simply run it, the output is:
(1, 2)
3
2
(1, 2)
When I debug it, I get:
Starting program:
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/aarch64-linux-gnu/libthread_db.so.1".
[New Thread 0x7fb10eb1e0 (LWP 806)]
(1, 2)
CUDA Exception: Warp Illegal Address
The exception was triggered at PC 0x84fa10
Thread 1 "File.out" received signal CUDA_EXCEPTION_14, Warp Illegal Address.
[Switching focus to CUDA kernel 0, grid 1, block (0,0,0), thread (0,0,0), device 0, sm 0, warp 0, lane 0]
0x000000000084fad0 in kernel1(ciao*)<<<(1,1,1),(1,1,1)>>> ()
Do any of you know where I'm making mistakes?
There is a lot broken in that code you posted, but the core source of the error is that you are attempting to access a host pointer inside the kernel (no memory is ever allocated to x on the device and the values are not copied either). Unless you use managed memory, that is obviously never going to work.
You could rework your example into something like this:
#include <cstdio>
#include <cstdlib>
class cls {
public:
int *x, y;
__host__ __device__
cls(int *x_, int y_) : x(x_), y(y_) {};
void kernel();
};
__global__ void kernel1(cls* c){
printf("%d\n", c->y);
c->y=2;
printf("%d\n", c->y);
c->x[0]=0; c->x[1]=1;
printf("%d %d\n", c->x[0], c->x[1]);
}
void cls::kernel(){
int* dev_x; cudaMalloc(&dev_x, sizeof(int)*2);
cudaMemcpy(dev_x, x, sizeof(int)*2, cudaMemcpyHostToDevice);
cls h_dev_c(dev_x, y);
cls* dev_c; cudaMalloc(&dev_c, sizeof(cls));
cudaMemcpy(dev_c, &h_dev_c, sizeof(cls), cudaMemcpyHostToDevice);
printf("(%d)\n", y);
printf("(%d, %d)\n", x[0], x[1]);
kernel1<<<1, 1>>> (dev_c);
cudaDeviceSynchronize();
cudaMemcpy(&y, &(dev_c->y), sizeof(int), cudaMemcpyDeviceToHost);
cudaMemcpy(x, dev_x, sizeof(int)*2, cudaMemcpyDeviceToHost);
printf("(%d)\n", y);
printf("(%d, %d)\n", x[0], x[1]);
}
int main(){
int y=3;
int* x=(int*) malloc(sizeof(int)*2);
x[0]=1; x[1]=2;
cls c(x,y);
c.kernel();
return 0;
}
Note that you have to basically build a device copy of the class in host memory and then copy that to the device to make this work correctly (this is a very common design pattern for arrays of pointers or structures and classes containing pointers, although it is almost never recommended for complexity and performance reasons).

Can't write Double word on STM32F429 using HAL driver

I am trying to write a uint64_t (double word) variable to flash memory, without success. Here is the code:
#define APPLICATION_START_ADDRESS 0x8008000
void flashErase(uint8_t startSector, uint8_t numberOfSectors)
{
HAL_FLASH_Unlock();
Flash_eraseInitStruct.TypeErase = FLASH_TYPEERASE_SECTORS;
Flash_eraseInitStruct.VoltageRange = FLASH_VOLTAGE_RANGE_3;
Flash_eraseInitStruct.Sector = startSector;
Flash_eraseInitStruct.NbSectors = numberOfSectors;
if(HAL_FLASHEx_Erase(&Flash_eraseInitStruct, &Flash_halOperationSectorError) != HAL_OK)
{
Flash_raiseError(errHAL_FLASHEx_Erase);
}
HAL_FLASH_Lock();
}
int main(void)
{
HAL_Init();
main_clockSystemInit();
__IO uint64_t word = 0x1234567890;
flashErase(2, 1);
// flashProgramWord(aTxBuffer, APPLICATION_START_ADDRESS, 2 );
HAL_FLASH_Unlock();
HAL_FLASH_Program(FLASH_TYPEPROGRAM_DOUBLEWORD, APPLICATION_START_ADDRESS, word);
}
The PGSERR and PGAERR error flags are raised. The erase operation completes without problems, but programming returns an error.
Any ideas?
There is no STM32F249, did you mean STM32F429?
In order to use 64 bit programming, VPP (BOOT0) has to be powered by 8 - 9 Volts. Is it?
See the Reference Manual Section 3.6.2
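By the way,
__IO uint64_t word = 0x1234567890;
may not do what you expect on every compiler. This is a 32-bit architecture where int and long are both 32 bits wide. C99 and C++11 compilers give an unsuffixed constant the first type that can hold it (here long long), but an older C90 compiler may truncate it. An explicit ULL suffix removes the doubt and matches the unsigned variable; note that L or UL alone would not be enough, because long is only 32 bits here. __IO is unnecessary.
uint64_t word = 0x1234567890ULL;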
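If 8 to 9 V on VPP is not available, one option is to program the 64-bit value as two 32-bit words. This is a minimal sketch under assumptions: program_word is a hypothetical callback that, on the target, would wrap HAL_FLASH_Program(FLASH_TYPEPROGRAM_WORD, address, data), and program_u64_as_words is an illustrative name, not part of the HAL. The low word goes to the lower address, matching the little-endian layout of the Cortex-M4.

```cpp
#include <cstdint>

// Writes a 64-bit value as two 32-bit word programs.
// program_word(address, data) is a hypothetical callback returning true on
// success; on the target it would call HAL_FLASH_Program with
// FLASH_TYPEPROGRAM_WORD and compare the result against HAL_OK.
template <typename ProgramWord>
bool program_u64_as_words(std::uint32_t address, std::uint64_t value,
                          ProgramWord program_word)
{
    const auto low  = static_cast<std::uint32_t>(value & 0xFFFFFFFFu);
    const auto high = static_cast<std::uint32_t>(value >> 32);
    // Stop at the first failure so the caller can inspect the error flags.
    return program_word(address, low) && program_word(address + 4u, high);
}
```

The flash still has to be unlocked and the target sector erased beforehand, as in the question's code.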

Perl XS Memory Handling of Strings

I have an XSUB like this:
char *
string4()
CODE:
char *str = strdup("Hello World4");
int len = strlen(str) + 1;
New(0, RETVAL, len, char);
Copy(str, RETVAL, len, char);
free(str);
OUTPUT:
RETVAL
But this shows up as a memory leak, on the New(), in valgrind and if I run it in a loop the resident memory will continue to grow.
I get the same thing if I use this one too:
char *
string2()
CODE:
char *str = strdup("Hello World2");
RETVAL = str;
OUTPUT:
RETVAL
I'm able to prevent the leak and the increasing memory size by doing:
char *
string3()
PPCODE:
char *str = strdup("Hello World3");
XPUSHs(sv_2mortal(newSVpv(str, 0)));
free(str);
but the problem with this solution is that when I compile with -Werror I get the following warnings/errors.
test.c: In function ‘XS_test_string3’:
/usr/lib/x86_64-linux-gnu/perl/5.20/CORE/XSUB.h:175:28: error: unused variable ‘targ’ [-Werror=unused-variable]
#define dXSTARG SV * const targ = ((PL_op->op_private & OPpENTERSUB_HASTARG) \
^
test.c:270:2: note: in expansion of macro ‘dXSTARG’
dXSTARG;
^
test.c:269:9: error: unused variable ‘RETVAL’ [-Werror=unused-variable]
char * RETVAL;
The C file gets built with an unused RETVAL:
XS_EUPXS(XS_test_string3); /* prototype to pass -Wmissing-prototypes */
XS_EUPXS(XS_test_string3)
{
dVAR; dXSARGS;
if (items != 0)
croak_xs_usage(cv, "");
PERL_UNUSED_VAR(ax); /* -Wall */
SP -= items;
{
char * RETVAL;
dXSTARG;
#line 61 "test.xs"
char *str = strdup("Hello World3");
XPUSHs(sv_2mortal(newSVpv(str, 0)));
free(str);
#line 276 "test.c"
PUTBACK;
return;
}
}
So is there a better way to handle the returning of allocated strings in XS? Is there a way to return the string using RETVAL and free the memory? I appreciate any help.
Among other problems[1], your first snippet allocates memory using New, but never deallocates it.
Among other problems, your second snippet allocates memory using strdup, but never deallocates it.
The underlying problem with your third snippet is that you claim the XS function returns a value and it doesn't. That value would have been assigned to RETVAL, which is automatically created for that very purpose. The variable won't be created if you correctly specify that you don't return anything.
void
string3()
PREINIT:
char *str;
PPCODE:
str = strdup("Hello World3");
XPUSHs(sv_2mortal(newSVpv(str, 0)));
free(str);
or just
void
string3()
PPCODE:
XPUSHs(sv_2mortal(newSVpv("Hello World3", 0)));
Note that I moved your declarations out of PPCODE. In C, declarations can't appear after non-declarations, and the code in PPCODE can appear after non-declarations (depending on the options used to build Perl). Declarations belong in PREINIT. You could also use curlies around the code in PPCODE.
[1] One of them is the use of New. You shouldn't be using New; it was deprecated in favour of Newx ages ago, and it hasn't been in the documentation for as long as I can remember.

ReadFile(socket) is cancelled, if the thread that called it dies

I'm trying to learn async I/O.
My program creates sockets and either accepts them with AcceptEx or connects them with connect. In its main thread I call WaitForMultipleObjects() in a loop, but I still create threads to resolve the names, call connect() and call the initial ReadFile().
These threads exit after they call ReadFile() and let the main thread wait for the read result.
For some reason, after the connecting thread dies, the read operation is cancelled, the event is triggered, and GetOverlappedResult() fails with ERROR_OPERATION_ABORTED.
Example:
#define _WIN32_WINNT 0x0501
#include <winsock2.h>
#include <ws2tcpip.h>
#include <wspiapi.h>
#include <windows.h>
#include <stdio.h>
#include <tchar.h>
#define BUFSZ 2048
#define PORT 80
/* #define HOST "192.168.2.1" */
#define HOST "stackoverflow.com"
static struct {
char buf[BUFSZ];
OVERLAPPED overlap;
SOCKET sock;
} x = { 0 };
static DWORD WINAPI barthread(LPVOID param) {
static struct sockaddr_in inaddr = { 0 };
int rc;
BOOL b;
DWORD dw;
DWORD nb;
LPHOSTENT lphost;
x.sock = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
inaddr.sin_family = AF_INET;
lphost = gethostbyname(HOST);
inaddr.sin_addr.s_addr = ((LPIN_ADDR)lphost->h_addr)->s_addr;
inaddr.sin_port = htons(PORT);
rc = connect(x.sock, (struct sockaddr *)&inaddr, sizeof(struct sockaddr_in));
if (rc == 0) {
printf("thread 2 connected\n");
printf("thread 2 call ReadFile\n");
b = ReadFile((HANDLE)x.sock, x.buf, BUFSZ, &nb, &x.overlap);
dw = GetLastError();
if (b || dw == ERROR_IO_PENDING) {
printf("thread 2 ReadFile ok\n");
} else {
printf("thread 2 ReadFile failed\n");
}
printf("thread 2 sleeping\n");
Sleep(3000);
printf("thread 2 dying\n");
}
return 0;
}
int main(int argc, char* argv[])
{
WSADATA WD;
BOOL b;
DWORD dw;
DWORD nb;
DWORD tid;
WSAStartup(MAKEWORD(2, 0), &WD);
x.overlap.hEvent = CreateEvent(NULL, FALSE, FALSE, NULL);
CreateThread(NULL, 0, barthread, NULL, 0, &tid);
dw = WaitForSingleObject(x.overlap.hEvent, INFINITE);
printf("thread 1 event triggered\n");
b = GetOverlappedResult((HANDLE)x.sock, &x.overlap, &nb, FALSE);
dw = GetLastError();
printf("thread 1 GetOverlappedResult() = %d, GetLastError() = %d\n", b, dw);
return 0;
}
You shouldn't be using separate threads at all. The whole point of overlapped I/O is that a single thread can do multiple tasks at one time. Have your main loop use WSAAsyncGetHostByName() instead of gethostbyname(), and WSAConnect() in non-blocking mode with WSAEventSelect() instead of connect() in blocking mode.
I found a similar question here:
Asynchronous socket reading: the initiating thread must not be exited - what to do?
and this note in the Boost.Asio documentation (http://www.boost.org/doc/libs/1_39_0/doc/html/boost_asio/reference/asynchronous_operations.html):
"Specifically, on Windows versions prior to Vista, unfinished operations are cancelled when the initiating thread exits."
I have Windows 7, but I suffer from the same problem.
Instead of calling the initial ReadFile() in a temporary thread, I will just set a flag, set the event manually, and call ReadFile() in the main loop.