I am trying to build the kernel module driver (KMD) for NVDLA, NVIDIA's Deep Learning Accelerator, and got an error at the end of the build.
After doing some research on Google I found that the errors are caused by 64-bit operations (especially 64-bit division) in the KMD. After further investigation I found that the KMD was written for a 64-bit architecture, while I am trying to compile it for a 32-bit processor (ARM Cortex-A9). Some people online have suggested using -lgcc, which should take care of the issue.
Could anyone help me edit the makefile to link against libgcc?
Thanks in advance.
Linux kernel code that uses 64-bit division should use the functions provided by #include <linux/math64.h>. Otherwise, when building for 32-bit architectures, GCC will attempt to use functions from libgcc which is not used by the kernel.
For example, the div_u64 function divides a 64-bit unsigned dividend by a 32-bit unsigned divisor and returns a 64-bit unsigned quotient. The KMD code referenced by OP contains this function:
int64_t dla_get_time_us(void)
{
return ktime_get_ns() / NSEC_PER_USEC;
}
After adding #include <linux/math64.h>, it can be rewritten to use the div_u64 function as follows:
int64_t dla_get_time_us(void)
{
return div_u64(ktime_get_ns(), NSEC_PER_USEC);
}
(Note that ktime_get_ns() returns a u64 (an unsigned 64-bit integer) and NSEC_PER_USEC has the value 1000 so can be used as a 32-bit divisor.)
There may be other places in the code where 64-bit division is used, but that is the first one I spotted.
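To illustrate why a helper is needed at all: on a 32-bit target, a plain `/` on 64-bit operands compiles to a call into a libgcc routine (`__aeabi_uldivmod` on ARM EABI), which the kernel does not link. The following user-space sketch is not the kernel's implementation. `sketch_div_u64` and `sketch_ns_to_us` are hypothetical names, and the function only shows that a 64-by-32 division can be done with shifts, comparisons and subtraction, so no library call is required.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical user-space stand-in for the kernel's div_u64():
 * divides a 64-bit dividend by a 32-bit divisor using binary long
 * division, so no 64-bit '/' or '%' operator is ever emitted.
 */
static uint64_t sketch_div_u64(uint64_t dividend, uint32_t divisor,
                               uint32_t *remainder)
{
    uint64_t quotient = 0;
    uint64_t rem = 0; /* stays below divisor after every step */
    int bit;

    assert(divisor != 0);

    for (bit = 63; bit >= 0; bit--) {
        /* Bring down the next bit of the dividend. */
        rem = (rem << 1) | ((dividend >> bit) & 1u);
        if (rem >= divisor) {
            rem -= divisor;
            quotient |= (uint64_t)1 << bit;
        }
    }
    if (remainder != NULL)
        *remainder = (uint32_t)rem;
    return quotient;
}

/* Same conversion as dla_get_time_us(), with the timestamp passed in. */
static uint64_t sketch_ns_to_us(uint64_t ns)
{
    return sketch_div_u64(ns, 1000u, NULL);
}
```

In real kernel code the right choice remains div_u64() from <linux/math64.h>, which is tuned per architecture; the loop above is only meant to show what such a helper has to do.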
I write CUDA code that I call from MATLAB MEX files. I am not using any of MATLAB's GPU libraries or capabilities. My code is just CUDA code that accepts C-type variables; I only use MEX to convert from mwtypes to C types and then call independent, self-written CUDA code.
The problem is that sometimes, especially in the development phase, CUDA fails (because I made a mistake). Most CUDA calls are surrounded by a call to gpuErrchk(cudaDoSomething(cuda)), defined as:
// Uses MATLAB functions but you get the idea.
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort=true)
{
if (code != cudaSuccess)
{
mexPrintf("GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
if (abort){
//cudaDeviceReset(); //This does not make MATLAB release it
mexErrMsgIdAndTxt("MEX:myfun", ".");
}
}
}
While this works as expected, giving errors such as
GPUassert: an illegal memory access was encountered somefile.cu 208
In most cases MATLAB does not release the GPU afterwards, so even if I change the code and recompile, the next call results in this error:
GPUassert: all CUDA-capable devices are busy or unavailable
somefile.cu firs_cuda_line
The only way to get rid of this error is to restart MATLAB. This is annoying and hinders the development and testing process. It does not happen when I develop in, say, Visual Studio.
I have tried to cudaDeviceReset() both before and after the error has been raised, but to no avail.
What can I do/try to make MATLAB release the GPU after a GPU runtime error?
I'm performing a set of activities to make sure Redis runs well on a set of embedded systems, including the Raspberry Pi. In order to fix certain code paths of Redis where unaligned memory accesses are performed (due to a change introduced in Redis 3.2), I'm trying to force the Pi to either log a message on unaligned memory accesses or send a signal to the process when this happens. In this way I can make sure both that Redis will run well where unaligned accesses are a violation, and that it will run faster on platforms where such accesses can be performed but are slower. ARM v6, the one used in the Pi v1, is apparently able to deal with unaligned memory accesses, so if I use the following command to configure Linux to send a signal to the process performing the unaligned access:
echo 4 > /proc/cpu/alignment
And then run the following program:
#include <stdio.h>
#include <stdint.h>
int main(int argc, char **argv) {
char *buf = "foobareklsjdfklsjdfslkjfskdljfskdfjdslkjfdslkjfsd";
uint32_t *l = (uint32_t*) (buf+1);
printf("%p\n", l);
printf("%d\n", (int)*l);
return 0;
}
I can't see any signal received by the process, or the counters at /proc/cpu/alignment incrementing.
My guess is that this is due to ARM v6's ability to deal with unaligned addresses automatically if a given CPU configuration flag is set. My question is: is my hypothesis correct? And if so, how can I force a Pi version 1 to actually raise an exception on unaligned accesses, so that the Linux kernel can trap it and send a signal, log the access, and so forth, according to the /proc/cpu/alignment settings?
EDIT: It is worth noting that not all instructions can perform unaligned accesses, even on ARM v6. For instance STMDB, STMFD, LDMDB, LDMEA and similar multiple-word instructions will indeed raise an exception and will be trapped by the Linux kernel.
I think I eventually found my answers:
1. Yes, I'm correct: up to the word size, ARM v6 (or greater) silently handles unaligned accesses, so no trap is generated and the access is completely transparent to the Linux kernel. Nothing will be logged, nor will the trap counters in /proc/cpu/alignment be incremented.
2. AFAIK there is no way to force the kernel to trap word-sized unaligned accesses. To do that, the CPU would apparently have to be configured to trap unaligned addresses in every case, but the Linux kernel does not do that AFAIK, probably because there is alignment-unsafe code inside the kernel itself. Checking the Linux kernel source code, one can indeed see:
if (cpu_is_v6_unaligned()) {
set_cr(__clear_cr(CR_A));
ai_usermode = safe_usermode(ai_usermode, false);
}
What this means is that the SCTLR.A bit is always cleared, so no trap will be generated for unaligned accesses that ARM v6 can handle.
3. There are a great many instructions that will still generate traps when used with unaligned addresses, for example multiple load/store instructions and loads and stores of double values.
4. However, there are instructions that GCC (the version shipped in the default Raspberry Pi Linux distribution) will happily produce and that are not handled correctly by the Linux kernel. These result in a SIGBUS even when /proc/cpu/alignment is set to fix up the access.
5. So point number 4 basically means that it is not a good idea to fix programs for ARM v6 by just letting the Linux kernel handle unaligned addresses for us, even when the performance implications of unaligned addresses are not a problem: the program can still crash, since not all instructions are handled.
6. How to reliably find all the unaligned accesses in a program remains an open question AFAIK, since unfortunately the otherwise wonderful valgrind never implemented this feature. In the past I had to use QEMU emulating SPARC, but this is a very slow process. Valgrind would be the trivial way to do that.
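Since relying on the kernel fix-up is not safe, the usual way to repair an individual code path is to stop dereferencing a misaligned pointer altogether. A minimal sketch, assuming nothing about how Redis actually fixed its code: the helper names `load_u32_unaligned` and `store_u32_unaligned` are hypothetical, and the copy goes through memcpy so the compiler is free to pick whatever load sequence is valid on the target.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: read a 32-bit value from any address,
 * aligned or not, without casting the pointer to uint32_t *. */
static uint32_t load_u32_unaligned(const void *p)
{
    uint32_t v;

    assert(p != NULL);
    memcpy(&v, p, sizeof v); /* byte-wise copy, no alignment assumed */
    return v;
}

/* Hypothetical helper: write a 32-bit value to any address. */
static void store_u32_unaligned(void *p, uint32_t v)
{
    assert(p != NULL);
    memcpy(p, &v, sizeof v);
}
```

In the test program from the question, `*l` would become `load_u32_unaligned(buf + 1)`. The value is still read in the host byte order, exactly as with the original cast.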
For some unknown reason Intel decided not to support AVX2 via the usual /arch: option. /arch: recognizes only the following instruction sets: IA32, SSE, SSE2, SSE3, AVX. So if you want to compile for AVX2 you are basically forced to use the /QxCORE-AVX2 switch. The problem with this option is that it injects check code, which verifies at run time whether your CPU is compatible with the selected instructions. If the CPU is not compatible, this message pops up:
Please verify that both the operating system and the processor support Intel(R)
MOVBE, F16C, FMA, BMI, LZCNT and AVX2 instructions.
Now I'm worried that the same message may pop up on AMD Excavator and Ryzen CPUs because they are not GenuineIntel. Unfortunately I do not have access to any AMD CPU, so I can't check this on real hardware. To make your life easier, I've compiled this simple code with the /QxCORE-AVX2 option enabled.
#include "stdafx.h"
int _tmain(int argc, _TCHAR* argv[])
{
double a, b, c;
a = 3.0;
b = 2.0;
c = 1.0;
a = a*b + c;
printf("a=%1.1f",a);
return 0;
}
and here is decompiled asm code: http://codepad.org/KL4Vq978
My question to people who understand asm is: do you see anything that may block execution of this code on the latest AMD CPUs? If yes, will this http://www.softpedia.com/get/Programming/Patchers/Intel-Compiler-Patcher.shtml help?
It turns out that /arch:CORE-AVX2 is recognized and the compiled executable contains FMA instructions! I really do not understand why this option is not listed in Visual Studio or in ICL /help.
Drop-down menu in Visual Studio (no AVX2!)
http://i.cubeupload.com/c1xidV.png
ICL /help
http://i.cubeupload.com/y2Cre6.png
The Ryzen supports these instruction sets, but the code will not run on AMD processors because it checks if the processor is "GenuineIntel". There has been a long discussion and legal battle about this issue. See http://www.agner.org/optimize/blog/read.php?i=49
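The difference between a vendor check and a feature check can be sketched as follows. This is not the code the Intel compiler injects. It is a hypothetical illustration that decodes register values as returned by the CPUID instruction (leaf 0 for the vendor string, leaf 1 ECX bit 12 for FMA, leaf 7 sub-leaf 0 EBX bit 5 for AVX2). The functions take the register values as arguments so the sketch stays portable; how they are obtained (compiler intrinsic or inline asm) is left out, and a complete check would also have to confirm that the operating system saves the AVX register state.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Rebuild the 12-character vendor string from CPUID leaf 0.
 * The characters are stored in EBX, EDX, ECX order. */
static void sketch_vendor_string(uint32_t ebx, uint32_t edx, uint32_t ecx,
                                 char out[13])
{
    const uint32_t regs[3] = { ebx, edx, ecx };
    int i, j;

    assert(out != NULL);
    for (i = 0; i < 3; i++)
        for (j = 0; j < 4; j++)
            out[4 * i + j] = (char)((regs[i] >> (8 * j)) & 0xffu);
    out[12] = '\0';
}

/* CPUID leaf 7, sub-leaf 0: EBX bit 5 reports AVX2. */
static int sketch_has_avx2(uint32_t leaf7_ebx)
{
    return (int)((leaf7_ebx >> 5) & 1u);
}

/* CPUID leaf 1: ECX bit 12 reports FMA. */
static int sketch_has_fma(uint32_t leaf1_ecx)
{
    return (int)((leaf1_ecx >> 12) & 1u);
}

/* Vendor-based dispatch: rejects every non-Intel CPU. */
static int sketch_vendor_check(const char *vendor)
{
    return strcmp(vendor, "GenuineIntel") == 0;
}

/* Feature-based dispatch: accepts any CPU that reports both bits. */
static int sketch_feature_check(uint32_t leaf1_ecx, uint32_t leaf7_ebx)
{
    return sketch_has_fma(leaf1_ecx) && sketch_has_avx2(leaf7_ebx);
}
```

A CPU that reports "AuthenticAMD" together with both feature bits passes `sketch_feature_check` but fails `sketch_vendor_check`, which is the situation the question is worried about.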
I fiddled with this the whole day, so I thought I might make everyone benefit from my experience, please see my answer below.
I first had a problem with running a compiled Mex file within Matlab, because Matlab complained that it couldn't open the shared library libarmadillo. I solved this using the environment variables LD_LIBRARY_PATH and LD_RUN_PATH (DYLD_LIBRARY_PATH and LYLD_RUN_PATH in osx).
The problem remained however, that a simple test file would segfault at runtime even though the exact same code would compile and run fine outside Matlab (not Mex'd).
The segfault seems to be caused by the fact that Matlab uses 64-bit integers (long long or int64_t) in its bundled LAPACK and BLAS libraries. Armadillo, on the other hand, uses 32-bit integers (regular int on a 64-bit platform, or int32_t) by default.
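Why the width mismatch is fatal can be shown with a small sketch. Fortran-style BLAS and LAPACK routines receive their integer arguments by reference, so a library built with 64-bit integers reads 8 bytes through a pointer behind which the caller stored only 4. The code below is a hypothetical stand-in, not Armadillo or Matlab code, and it uses memcpy on a two-element array so that the demonstration itself stays well defined.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a library routine built with 64-bit
 * integers: it reads 8 bytes through the pointer it is given. */
static int64_t sketch_callee_reads_dim(const void *n)
{
    int64_t value;

    assert(n != NULL);
    memcpy(&value, n, sizeof value);
    return value;
}

/* Caller that stores its dimensions as 32-bit integers laid out
 * back to back and passes the address of the first one. */
static int64_t sketch_mismatched_call(int32_t rows, int32_t cols)
{
    int32_t dims[2];

    dims[0] = rows;
    dims[1] = cols;
    return sketch_callee_reads_dim(&dims[0]);
}
```

For a 3-by-3 matrix the callee does not see 3 but a value with both 32-bit halves fused together, in the billions. A real routine would then index far outside the matrix, which is consistent with the segfault described above.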
There are two solutions; the first one involves forcing Matlab to link to the system's libraries instead (which use ints), the second involves changing Armadillo's config file to enable long longs with BLAS. I tend to think that the first is more reliable, because there is no black-box effect, but it's also more troublesome, because you need to manually install and remember the path of your BLAS and LAPACK libs.
Both solutions required that I stopped using Armadillo's shared libraries and linked/included manually the sources.
To do this, you must simply install LAPACK and BLAS on your system (if they are not already there, in Ubuntu that's libblas-dev and liblapack-dev), and copy the entire includes directory somewhere sensible like in $HOME/.local/arma for example.
Solution 1: linking to system's libraries
From the matlab console, set the environment variables BLAS_VERSION and LAPACK_VERSION to point to your system's libraries. In my case (Ubuntu 14.04, Matlab R2014b):
setenv('BLAS_VERSION','/usr/lib/libblas.so');
setenv('LAPACK_VERSION','/usr/lib/liblapack.so');
You can then compile normally:
mex -compatibleArrayDims -outdir +mx -L/home/john/.local/arma -llapack -lblas -I/home/john/.local/arma test_arma.cpp
or if you define the flag ARMA_64BIT_WORD in includes/armadillo_bits/config.hpp, you can drop the option -compatibleArrayDims.
Solution 2: changing Armadillo's config
The second solution involves uncommenting the flag ARMA_BLAS_LONG_LONG in Armadillo's config file includes/armadillo_bits/config.hpp. Matlab will link to its bundled LAPACK and BLAS libraries, but this time Armadillo won't segfault because it is using the right word size. As before, you can also uncomment ARMA_64BIT_WORD if you want to drop the -compatibleArrayDims option.
Compiled with
mex -larmadillo -DARMA_BLAS_LONG_LONG armaMex_demo2.cpp
(In Matlab)
armaMex_demo2(rand(1))
works without segfault.
However, compiled with
mex -larmadillo armaMex_demo2.cpp
(In Matlab)
armaMex_demo2(rand(1))
causes a segfault.
Here, armaMex_demo2.cpp is
/* ******************************************************************* */
// armaMex_demo2.cpp: Modified from armaMex_demo.cpp copyright Conrad Sanderson and George Yammine.
/* ******************************************************************* */
// Demonstration of how to connect Armadillo with Matlab mex functions.
// Version 0.2
//
// Copyright (C) 2014 George Yammine
// Copyright (C) 2014 Conrad Sanderson
//
// This Source Code Form is subject to the terms of the Mozilla Public
// License, v. 2.0. If a copy of the MPL was not distributed with this
// file, You can obtain one at http://mozilla.org/MPL/2.0/.
/////////////////////////////
#include "armaMex.hpp"
void
mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[]) {
/*
Input: X (real matrix)
Output: Eigenvalues of X X.T
*/
if (nrhs != 1)
mexErrMsgTxt("Incorrect number of input arguments.");
// Check matrix is real
if( (mxGetClassID(prhs[0]) != mxDOUBLE_CLASS) || mxIsComplex(prhs[0]))
mexErrMsgTxt("Input must be double and not complex.");
// Create matrix X from the first argument.
arma::mat X = armaGetPr(prhs[0],true);
// Run an arma function (eig_sym)
arma::vec eigvals(X.n_rows);
if(not arma::eig_sym(eigvals, X*X.t()))
mexErrMsgTxt("arma::eig_sym failed.");
// return result to matlab
plhs[0] = armaCreateMxMatrix(eigvals.n_elem, 1);
armaSetPr(plhs[0], eigvals);
return;
}
After researching engines a lot, I've been building a 2D framework for the iPhone. As you know, the world of engine architecture is vast, so I've been trying to apply best practices as much as possible.
I've been using:
uint_fast8_t mId;
If I look up the definition of uint_fast8_t I find:
/* 7.18.1.3 Fastest-width integer types */
...
typedef uint8_t uint_fast8_t;
I've been using these types throughout my code. My question is: is there a performance benefit to using them, and what exactly is going on behind the scenes? Besides the obvious fact that this is the correct data type (unsigned 8-bit integer) for the data, is it worthwhile to have it peppered throughout my code?
Is this a needless optimization that the compiler would probably take care of anyways?
Thanks.
Edit: No responses/answers, so I'm putting a bounty on this!
The "fast" integer types are defined to be the fastest integer types available with at least the number of bits required (in this case 8).
If your platform defines uint_fast8_t as uint8_t then there will be absolutely no difference in speed.
The reason these types exist is that some architectures are slower when not using their native word length. E.g. I could find one reference where, for Alpha processors, uint_fast8_t was defined to be "unsigned int".
An uint_fast8_t is the fastest integer guaranteed to be at least 8 bits wide. Depending on your platform it could be 8 or 16 or 32 bits wide.
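What a given platform actually chose can be checked directly. The sketch below uses only standard <stdint.h> and <limits.h> facilities; the function names are hypothetical. It reports the width of the "fast" types and uses one as a loop counter, which is the kind of place where a wider native type can save extension instructions.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/* Width in bits that this platform chose for uint_fast8_t. */
static size_t sketch_fast8_bits(void)
{
    return sizeof(uint_fast8_t) * CHAR_BIT;
}

/* Width in bits that this platform chose for uint_fast16_t. */
static size_t sketch_fast16_bits(void)
{
    return sizeof(uint_fast16_t) * CHAR_BIT;
}

/* Counts the set bits of an 8-bit value, using a loop counter of
 * the "fast" type; the stored data keeps its exact-width type. */
static unsigned sketch_popcount8(uint8_t x)
{
    uint_fast8_t i;
    unsigned count = 0;

    /* The standard only guarantees "at least 8 bits". */
    assert(sizeof(uint_fast8_t) >= sizeof(uint8_t));

    for (i = 0; i < 8; i++)
        count += (x >> i) & 1u;
    return count;
}
```

If both functions report the same width as the exact-width types, as in the header quoted in the question, switching between uint8_t and uint_fast8_t changes nothing in the generated code.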
It isn't taken care of by the compiler itself; it does indeed make your program execute faster.
Here are some resources I found; you might already have seen them: http://embeddedgurus.com/stack-overflow/2008/06/efficient-c-tips-1-choosing-the-correct-integer-size/
http://www.mail-archive.com/avr-gcc-list#nongnu.org/msg03149.html
The header in mingw64 says the fast types are "Not actually guaranteed to be fastest for all purposes":
/* 7.18.1.3 Fastest minimum-width integer types
* Not actually guaranteed to be fastest for all purposes <---------------------
* Here we use the exact-width types for 8 and 16-bit ints.
*/
typedef signed char int_fast8_t;
typedef unsigned char uint_fast8_t;
typedef short int_fast16_t;
typedef unsigned short uint_fast16_t;
typedef int int_fast32_t;
typedef unsigned int uint_fast32_t;
__MINGW_EXTENSION typedef long long int_fast64_t;
__MINGW_EXTENSION typedef unsigned long long uint_fast64_t;
That still applies to ARM and other architectures, because using a narrow type requires zero extension or sign extension in many situations, which is less efficient than using a native int.
However, a narrow type is beneficial in large arrays or for slow operations (like division). I'm not sure how slow ARM divisions are, but on x86, 64-bit division is much slower than 32-bit or 8-bit division.