Intel Compiler /QxCORE-AVX2 switch and compatibility with AMD Excavator/Ryzen

For some unknown reason, Intel decided not to support AVX2 via the usual /arch: option. /arch: recognizes only the following instruction sets: IA32, SSE, SSE2, SSE3, and AVX. So if you want to compile for AVX2, you are basically forced to use the /QxCORE-AVX2 switch. The problem with this option is that it injects check code, which verifies at run time whether your CPU supports the selected instructions. If the CPU is not compatible, this message pops up:
Please verify that both the operating system and the processor support Intel(R)
MOVBE, F16C, FMA, BMI, LZCNT and AVX2 instructions.
Now I'm worried that the same message may pop up on AMD Excavator and Ryzen CPUs because they do not identify as GenuineIntel. Unfortunately, I do not have access to any AMD CPU, so I can't check this on real hardware. To make your life easier, I've compiled this simple code with the /QxCORE-AVX2 option enabled.
#include "stdafx.h"
int _tmain(int argc, _TCHAR* argv[])
{
    double a, b, c;
    a = 3.0;
    b = 2.0;
    c = 1.0;
    a = a*b + c;
    printf("a=%1.1f",a);
    return 0;
}
and here is the disassembled code: http://codepad.org/KL4Vq978
My question to people who understand asm is: do you see anything that may block execution of this code on the latest AMD CPUs? If so, will this http://www.softpedia.com/get/Programming/Patchers/Intel-Compiler-Patcher.shtml help?

It turns out that /arch:CORE-AVX2 is recognized, and the compiled executable contains FMA instructions! I really do not understand why this option is listed neither in Visual Studio nor in ICL /help.
Drop-down menu in Visual Studio (no AVX2!)
http://i.cubeupload.com/c1xidV.png
ICL /help
http://i.cubeupload.com/y2Cre6.png

The Ryzen supports these instruction sets, but the code will not run on AMD processors because it checks if the processor is "GenuineIntel". There has been a long discussion and legal battle about this issue. See http://www.agner.org/optimize/blog/read.php?i=49
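The difference between a vendor check and a feature check can be sketched in C. This is only an illustration, not the code the Intel compiler injects: the function names are hypothetical, and the register values are assumed to come from the CPUID instruction (for example via `__get_cpuid` and `__get_cpuid_count` in GCC/Clang's `<cpuid.h>`, or `__cpuid` and `__cpuidex` in MSVC's `<intrin.h>`). A complete check would also have to confirm that the operating system saves the AVX register state, which this sketch leaves out.

```c
#include <stdint.h>

/* Assemble the 12-character vendor string from CPUID leaf 0.
 * The characters are stored in the order EBX, EDX, ECX,
 * lowest byte of each register first. */
static void cpu_vendor_from_regs(uint32_t ebx, uint32_t edx, uint32_t ecx,
                                 char out[13])
{
    uint32_t regs[3] = { ebx, edx, ecx };
    for (int i = 0; i < 3; i++)
        for (int j = 0; j < 4; j++)
            out[4 * i + j] = (char)((regs[i] >> (8 * j)) & 0xff);
    out[12] = '\0';
}

/* Vendor-neutral test that looks only at the capability bits.
 * leaf1_ecx: ECX from CPUID leaf 1.
 * leaf7_ebx: EBX from CPUID leaf 7, subleaf 0. */
static int has_fma_and_avx2_bits(uint32_t leaf1_ecx, uint32_t leaf7_ebx)
{
    int fma  = (leaf1_ecx >> 12) & 1; /* CPUID.1:ECX bit 12 = FMA      */
    int avx2 = (leaf7_ebx >> 5) & 1;  /* CPUID.(7,0):EBX bit 5 = AVX2  */
    return fma && avx2;
}
```

A dispatcher built on `has_fma_and_avx2_bits` gives the same answer for "GenuineIntel" and "AuthenticAMD" parts that report the same bits, whereas a comparison against the vendor string rejects the latter regardless of what the CPU can do.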

Related

WARNING: "__aeabi_uldivmod" Undefined symbol in opendla.ko

I am trying to build the kernel module driver (KMD) for NVDLA, NVIDIA's Deep Learning Accelerator, and the build ends with the error in the title: an undefined "__aeabi_uldivmod" symbol in opendla.ko.
After some research I found that the error is caused by 64-bit operations (especially 64-bit division) in the KMD. After further investigation I found that the KMD was written for a 64-bit architecture, while I am trying to compile it for a 32-bit processor (ARM Cortex-A9). Some people online have suggested using -lgcc, which should take care of the issue.
Could anyone help me edit the makefile so that it links against libgcc?
Thanks in advance.
Linux kernel code that uses 64-bit division should use the functions provided by #include <linux/math64.h>. Otherwise, when building for 32-bit architectures, GCC emits calls to helper functions from libgcc (on ARM EABI, __aeabi_uldivmod for unsigned 64-bit division), and the kernel is not linked against libgcc.
For example, the div_u64 function divides a 64-bit unsigned dividend by a 32-bit unsigned divisor and returns a 64-bit unsigned quotient. The KMD code referenced by OP contains this function:
int64_t dla_get_time_us(void)
{
    return ktime_get_ns() / NSEC_PER_USEC;
}
After adding #include <linux/math64.h>, it can be rewritten to use the div_u64 function as follows:
int64_t dla_get_time_us(void)
{
    return div_u64(ktime_get_ns(), NSEC_PER_USEC);
}
(Note that ktime_get_ns() returns a u64 (an unsigned 64-bit integer) and NSEC_PER_USEC has the value 1000 so can be used as a 32-bit divisor.)
There may be other places in the code where 64-bit division is used, but that is the first one I spotted.
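To make the shape of the call concrete outside the kernel, here is a small user-space sketch. `div_u64_sketch` and `ns_to_us` are hypothetical stand-ins written for this illustration. They are not the kernel's implementation, and in the driver itself the real div_u64 from <linux/math64.h> should be used (that header also provides div_u64_rem for when the remainder is needed).

```c
#include <stdint.h>

#define NSEC_PER_USEC 1000u /* same value as the kernel constant */

/* Hypothetical user-space stand-in with the same shape as the kernel's
 * div_u64(): 64-bit dividend, 32-bit divisor, 64-bit quotient. */
static uint64_t div_u64_sketch(uint64_t dividend, uint32_t divisor)
{
    return dividend / divisor;
}

/* Mirrors the rewritten dla_get_time_us(), but takes the nanosecond
 * timestamp as a parameter instead of calling ktime_get_ns(). */
static int64_t ns_to_us(uint64_t ns)
{
    return (int64_t)div_u64_sketch(ns, NSEC_PER_USEC);
}
```

The point of the narrower divisor type is that a 64-by-32 division can be done on a 32-bit CPU without the general 64-by-64 helper that GCC would otherwise call.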

Is this a memory leak or am I misreading what Visual Studio 2017 is showing?

I am just learning C++ and decided to use Allegro 5 and Visual Studio 2017 to make a simple 2D game. Having written a very simple program that just creates and destroys a window, I am worried that there is a memory leak, even though, as far as I am aware, the code should be fine in this regard.
Hopefully, I just need to understand better what Visual Studio is showing me when running the code via the 'Local Windows Debugger' option.
I'm running the code below via Visual Studio 2017 and the 'Local Windows Debugger' option.
The code runs fine so far as I can see but in the 'Diagnostic Tools' window, looking at Process Memory usage, I am concerned there might be a memory leak.
Here is the code...
#include <cstdlib>   // std::system
#include <iostream>
#include <allegro5/allegro5.h>
int main()
{
    std::system("pause");
    al_init();
    ALLEGRO_DISPLAY *disp = al_create_display(320, 200);
    std::system("pause");
    al_destroy_display(disp);
    std::cout << "allegro display has been destroyed...\n";
    std::system("pause");
    return 0;
}
You'll see I have inserted pauses at various places in the code. When the code reaches the last pause, I expect Process Memory usage to be the same as at the first pause, or at least the same as after al_init().
But what I see is...
Typically the Process Memory shows usage of...
At start of Main... 1.7MB
After al_init();... 1.7MB
After creating display/window... 31.4MB
After al_destroy_display (window destroyed)... 13.8MB
Surely after the window is destroyed/closed I should see Process Memory usage return to 1.7MB?
I created a similarly simple program using OpenGL and see similar behavior, albeit with higher Process Memory usage in general.
Thank you for reading....

Unaligned accesses are not detected by Raspberry PI version 1

I'm working to make sure Redis runs well on a set of embedded systems, including the Raspberry PI. In order to fix certain code paths of Redis where unaligned memory accesses are performed (due to a change introduced in Redis 3.2), I'm trying to force the PI either to log a message on unaligned memory accesses or to send a signal to the process when this happens. In this way I can make sure both that Redis will run well where unaligned accesses are a violation, and that it will run faster on platforms where such accesses can be performed but are slower. ARM v6, the architecture used in the PI v1, is apparently able to deal with unaligned memory accesses, so if I use the following command to configure Linux to send a signal to the process performing the unaligned access:
echo 4 > /proc/cpu/alignment
And then run the following program:
#include <stdio.h>
#include <stdint.h>
int main(int argc, char **argv) {
    char *buf = "foobareklsjdfklsjdfslkjfskdljfskdfjdslkjfdslkjfsd";
    uint32_t *l = (uint32_t*) (buf+1);
    printf("%p\n", l);
    printf("%d\n", (int)*l);
    return 0;
}
I see neither a signal delivered to the process nor the counters in /proc/cpu/alignment incrementing.
My guess is that this is due to ARM v6's ability to deal with unaligned addresses automatically if a given CPU configuration flag is set. My question is: is my hypothesis correct? And if so, how can I force a PI version 1 to actually raise an exception on unaligned accesses, so that the Linux kernel can trap it and send a signal, log the access, and so forth, according to the /proc/cpu/alignment settings?
EDIT: It is worth noting that not all instructions can perform unaligned accesses, even on ARM v6. For instance, STMDB, STMFD, LDMDB, LDMEA and similar multiple-word instructions will indeed raise an exception and will be trapped by the Linux kernel.
I think I eventually found my answers:
1. Yes, I'm correct: up to the word size, ARM v6 (or greater) can silently handle unaligned accesses, so no trap is generated and the access is completely transparent to the Linux kernel. Nothing will be logged, and the trap counters in /proc/cpu/alignment will not be incremented.
2. AFAIK there is no way to force the kernel to trap word-sized unaligned accesses. To do that, the CPU would have to be configured to trap unaligned addresses in every case, but the Linux kernel does not do that AFAIK, probably because there is alignment-unsafe code inside the kernel itself. Checking the Linux kernel source code, one can indeed see:
if (cpu_is_v6_unaligned()) {
    set_cr(__clear_cr(CR_A));
    ai_usermode = safe_usermode(ai_usermode, false);
}
What this means is that the SCTLR.A bit is always cleared, so no trap will be generated for unaligned accesses that ARM v6 can handle.
3. There are a great many instructions that will still generate traps when used with unaligned addresses, for example multiple store/load instructions and loads and stores of double values.
4. However, there are instructions that GCC (the version shipped in the default Raspberry Linux distribution) will happily produce that are not handled correctly by the Linux kernel, which results in a SIGBUS even when /proc/cpu/alignment is set to fix up the access.
So point number 4 basically means that it is not a good idea to make programs run on ARM v6 by just letting the Linux kernel handle unaligned addresses for us, even when the performance implications of unaligned accesses are not a problem: the program can still crash, since not all the instructions are handled.
How to reliably find all the unaligned accesses in a program remains an open question AFAIK, since unfortunately the otherwise wonderful valgrind program never implemented this feature. In the past I had to use QEMU emulating Sparc, but this is a very slow process. Valgrind would be the trivial way to do that.
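Finding the accesses is one half of the problem. Once a suspicious access has been found, a common portable way to rewrite it is to copy the bytes with memcpy instead of dereferencing a cast pointer. The sketch below is a general C idiom, not code taken from Redis, and the helper names are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Returns nonzero if p is a multiple of align.
 * align must be a power of two (for example 4 for uint32_t). */
static int is_aligned(const void *p, size_t align)
{
    return ((uintptr_t)p & (align - 1)) == 0;
}

/* Read a 32-bit value, in host byte order, from an address that may be
 * unaligned. The compiler is free to turn the memcpy into a single load
 * where the target allows it, and into byte loads where it does not. */
static uint32_t load_u32(const void *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}
```

With these helpers, the test program from the question would call `load_u32(buf + 1)` rather than dereferencing `(uint32_t *)(buf + 1)`, and an assertion on `is_aligned` can be placed in debug builds at the sites where a direct dereference is kept.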

Running executables of different format on any OS

This shouldn't be as hard as one may think, if I got it right. Specifically, I'll begin with iOS and the ELF executable format. Let me clarify that I have a jailbroken iPhone and I don't want to do this in any App Store apps, so please avoid "good advice" like "you can't do it as it's prohibited by Apple".
So, what I have seen is that there's a Flash player implementation called Frash (by Comex, by the way, the developer of recent jailbreaks). After installation, this utility requires that Android's libflashplayer.so be copied to the iPhone file system. I dug into the source code and found out that the tweak actually opens the Android (ELF) shared object file, "parses" it and executes code from it. I already asked a friend of mine whether this is actually possible, and he told me that it is, because ELF on ARM and Mach-O on ARM are binary compatible (because they're both ARM). But he failed to explain it to me in detail, so I'd like to ask: how can it be done? I can't fully understand the source code fragment that handles this, but one thing is sure:
int fd = open("libflashplayer.so", O_RDONLY);
_assert(fd > 0);
fds_init();
sandbox_me();
int symtab_size;
Elf32_Sym *symtab;
void **init_array;
Elf32_Word init_array_size;
char *strtab;
TIME(base_load_elf(fd, &symtab, &symtab_size, &init_array, &init_array_size, &strtab));
// Call the init funcs
_assert(init_array);
while(init_array_size >= 4) {
    void (*x)() = *init_array++;
    notice("Calling %p", x);
    x();
    init_array_size -= 4;
}
(from the original code, as of 02/12/2011 on GitHub)
It seems to me that he uses libelf to perform this, right? And that in an ELF file there are symbols that can be executed on a compatible processor just fine?
I'd also like to know whether it is true for all other processor architectures? So maybe one can execute symbols from Linux binaries on OS X?
The important thing about compatibility is the underlying processor architecture, not Linux vs. OS X vs. Android. If the ELF executable or .so is compiled for the same processor instruction set, then this can work. If not, the code is not compatible. For example, two files that were both built for Linux but for different processors would not be compatible with each other.
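Which instruction set an ELF file targets is recorded in its header, so a loader can check it before trying to run anything. The following is a minimal sketch, not Frash's code: `elf_machine` and the `*_ID` constants are names invented here (the numeric values are the standard EM_386, EM_ARM and EM_X86_64 codes), and the function only reads the first 20 bytes of the file.

```c
#include <stddef.h>

/* Standard e_machine codes, renamed to avoid clashing with <elf.h>. */
enum { EM_386_ID = 3, EM_ARM_ID = 40, EM_X86_64_ID = 62 };

/* Returns the e_machine field of an ELF header, or -1 if the buffer is
 * too short or does not look like an ELF file. The field sits at byte
 * offset 18 in both the 32-bit and the 64-bit header layout. */
static int elf_machine(const unsigned char *hdr, size_t len)
{
    if (len < 20)
        return -1;
    if (hdr[0] != 0x7f || hdr[1] != 'E' || hdr[2] != 'L' || hdr[3] != 'F')
        return -1;
    /* e_ident[5] (EI_DATA): 1 = little-endian, 2 = big-endian */
    if (hdr[5] == 1)
        return hdr[18] | (hdr[19] << 8);
    if (hdr[5] == 2)
        return (hdr[18] << 8) | hdr[19];
    return -1;
}
```

A matching machine code is necessary but not sufficient: a custom loader still has to map the segments, apply relocations and supply every symbol the library imports, which is the work the question's `base_load_elf` call appears to do.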

A trivial SYSENTER/SYSCALL question

If a Windows executable makes use of SYSENTER and is executed on a processor implementing the AMD64 ISA, what happens? I am new to this topic (OSes, hardware/software interaction), but from what I've read I understand that SYSCALL is the AMD64 equivalent of Intel's SYSENTER. Hopefully this question makes sense.
If you try to use SYSENTER where it is not supported, you'll probably get an "invalid opcode" exception.
Note that this situation is unusual - generally, Windows executables do not directly contain instructions to enter kernel mode.
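Whether a processor advertises either instruction can be read from CPUID. The sketch below only decodes the relevant bits. The function names are hypothetical, and the EDX values are assumed to come from executing CPUID with leaf 1 and leaf 0x80000001 respectively. The bits describe the processor, not the mode it is currently running in, so they do not by themselves guarantee that the instruction is usable in a given mode.

```c
#include <stdint.h>

/* CPUID leaf 1, EDX bit 11 (SEP): SYSENTER/SYSEXIT are implemented. */
static int has_sysenter(uint32_t leaf1_edx)
{
    return (leaf1_edx >> 11) & 1;
}

/* CPUID leaf 0x80000001, EDX bit 11: SYSCALL/SYSRET are implemented. */
static int has_syscall(uint32_t ext_leaf1_edx)
{
    return (ext_leaf1_edx >> 11) & 1;
}
```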
As far as I know, AMD64 processors use different operating modes to handle such issues.
SYSENTER works fine but is not that fast.
A very useful place to get started on the different modes:
Wikipedia
They got rid of a bunch of unused functionality when they developed the AMD64 extensions. One of the main changes is that the cs, ds, es, and ss segment registers are effectively eliminated in 64-bit mode: they still exist, but their base and limit are ignored. Normally, loading segment registers is an extremely expensive operation (the CPU has to do permission checks, which could involve multiple memory accesses). Entering kernel mode requires loading new segment register values.
The SYSENTER instruction accelerates this by having a set of "shadow registers" which it can copy directly to the (internal, hidden) segment descriptors without doing any permission checks. The vast majority of the benefit is lost with only a couple of segment registers, so most likely the reasoning for removing support for the instruction is that using regular instructions for the mode switch is faster.