OpenCL double precision on CPU

Hello, I have run several OpenCL kernels in double precision on the GPU with the following defined:
#ifndef GPU_AMD
#pragma OPENCL EXTENSION cl_khr_fp64: enable
#else
#pragma OPENCL EXTENSION cl_amd_fp64 : enable
#endif
Now I would like to run the same OpenCL kernels in double precision on the CPU instead. Do I need the same extensions as above, or is there another OpenCL extension I have to enable before using double on the CPU?
thanks

You should just be able to use the cl_khr_fp64 extension. The cl_amd_fp64 extension is actually just a subset of the cl_khr_fp64 extension for AMD GPUs.
Some AMD GPUs will actually support the full cl_khr_fp64 extension these days, so check (with CLInfo perhaps) to see if that is a possibility.
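If you prefer to check programmatically rather than with a tool, a minimal host-side sketch (error handling omitted; it just takes the first platform and its first CPU device) can look for cl_khr_fp64 in the device's extension string:
#include <stdio.h>
#include <string.h>
#include <CL/cl.h>

int main(void) {
    cl_platform_id platform;
    cl_device_id device;
    char extensions[4096];

    /* First platform, first CPU device (error handling omitted). */
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_CPU, 1, &device, NULL);

    /* cl_khr_fp64 appears in the extension list if double precision is supported. */
    clGetDeviceInfo(device, CL_DEVICE_EXTENSIONS, sizeof(extensions), extensions, NULL);
    printf("fp64 supported: %s\n",
           strstr(extensions, "cl_khr_fp64") ? "yes" : "no");
    return 0;
}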
See this question for more information.

Related

Is there an equivalent of -march=native in the Crystal compiler?

GCC and Clang support a compiler option named -march=native, which is handy if you want to optimize for the current machine's architecture. The resulting binary might not be portable, but that is OK if it will only be executed on the same machine.
I wondered whether the Crystal compiler supports it. I can see the options --mcpu, --mattr, and --mcmodel, which might be what I need. Unfortunately, I could not find much information about them.
Is there a recommended way in Crystal to optimize for the current machine? Ideally, it should figure out the available CPU instructions automatically (like -march=native).
Background: How to see which flags -march=native will activate?
The Crystal compiler doesn't support -march. Maybe that should be added. From what I hear, there's often no clear separation between -mcpu and -march.
As a workaround, you could ask the compiler to emit LLVM IR or LLVM bitcode. That allows you to compile the binary with LLVM tools directly, which gives full access to LLVM options like -march.

Why does Spark BLAS use f2jBLAS instead of native BLAS for level-1 routines?

I found the following code in BLAS.scala:
// For level-1 routines, we use Java implementation.
private def f2jBLAS: NetlibBLAS = {
  if (_f2jBLAS == null) {
    _f2jBLAS = new F2jBLAS
  }
  _f2jBLAS
}
I think native BLAS is faster than a pure Java implementation.
So why does Spark choose f2jBLAS for level-1 routines? Is there a reason I am not aware of?
Thank you!
The answer is most likely to be found in the Performance section of the readme file of the netlib-java repository.
Java has a reputation with older generation developers because Java applications were slow in the 1990s. Nowadays, the JIT ensures that Java applications keep pace with – or exceed the performance of – C / C++ / Fortran applications.
This is followed by charts showing detailed benchmark results for various BLAS routines, both in pure Java (translated from Fortran with f2j) and with native BLAS, on Linux on ARM and on macOS on x86_64. The ddot benchmark shows that on x86 (the JRE for ARM doesn't seem to have JIT capabilities) F2J performs on par with the reference native BLAS implementation for longer vector sizes and even outperforms it for shorter ones.
The caveat is that the JIT only kicks in after a couple of invocations, which is not a problem since most ML algorithms are iterative in nature. Most of the level-1 routines are fairly simple, and the JIT compiler is able to generate well-optimised code for them. This is also why the tuning effort in highly optimised BLAS implementations goes into the level-2 and level-3 routines.
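To illustrate why level-1 routines gain little from hand tuning, this is essentially all ddot does (a reference-style sketch in C, ignoring the stride arguments of the real BLAS interface, and not Spark's actual code): a single pass over the data that any JIT or optimising compiler turns into a tight loop.
/* Reference-style ddot: one pass, one multiply-add per element. */
double ddot(int n, const double *x, const double *y) {
    double sum = 0.0;
    for (int i = 0; i < n; ++i)
        sum += x[i] * y[i];
    return sum;
}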

How can I generate 32-bit RISC-V from the Chisel source? What are the required modifications?

With the RISC-V toolchain, we are generating the Verilog files for Rocket Chip as 64-bit, but we need a 32-bit RISC-V Rocket chip.
What are the required changes to the Scala and Chisel files for that?
Is it possible to generate a 32-bit Rocket core?
Rocket is an RV64 implementation. Unfortunately, it does not have a simple switch to make it RV32. Making it RV32 will require some modification, hopefully a small one.

Does RenderScript support recursion?

OpenCL doesn't support recursion. CUDA does, but only from a certain version onward. An initial search indicated that RenderScript does support recursion, but I couldn't find anything explicit.
Does RenderScript support recursive function calls?
Yes, it does. However, using recursion will limit a script to processors capable of supporting it.
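For example, a minimal RenderScript sketch (the package name and kernel name are made up for illustration) with a recursive helper:
#pragma version(1)
#pragma rs java_package_name(com.example.recursion)

/* Recursive helper: plain C-style recursion inside the script. */
static int factorial(int n) {
    return (n <= 1) ? 1 : n * factorial(n - 1);
}

/* Mapping kernel (API 17+ syntax) that calls the recursive helper per element. */
int __attribute__((kernel)) factKernel(int in) {
    return factorial(in);
}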

Size of Structs in Objective-C and C++

I am converting some Objective-C++ code into plain Objective-C and I am having some trouble with structs. For both languages, I have the struct declared in the .h file like this.
struct BasicNIDSHeader {
    short messageCode;
    short messageDate;
    int   messageTime;
    int   messageLength;
    short sourceID;
    short destID;
    short numberOfBlocks;
};
In C++, the struct is being declared like
BasicNIDSHeader header;
and in Objective-C I do this
struct BasicNIDSHeader header;
The code for using them is actually the same in both languages.
memset(&header, 0, sizeof(header));
[[fileHandle readDataOfLength:sizeof(header)] getBytes:&header];
where fileHandle is an NSFileHandle.
The problem is that in the original C++ code, sizeof(header) = 18, while in Objective-C, sizeof(header) = 20.
Any ideas why this is happening or how to fix it? The code depends on the size being what it is in C++. I could just hardcode it, but I would like a better understanding of why this is happening. Plus, I hate hardcoding constants.
Thanks!
If you depend on the internal memory layout of your structs, you should disable padding. This is usually called "packed", and different compilers have different ways of signalling it.
In GCC you do this with the __attribute__ keyword. Details here.
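For illustration, a minimal sketch of the packed variant (the name PackedNIDSHeader is just for this example). Packed, the size is the plain sum of the member sizes, 18 bytes; unpacked, the struct is normally padded at the end to a multiple of the 4-byte alignment of its int members, giving 20:
#include <stdio.h>

/* Same members as BasicNIDSHeader, but packed: no padding is inserted,
   so the size is the sum of the member sizes (18 bytes). */
struct PackedNIDSHeader {
    short messageCode;
    short messageDate;
    int   messageTime;
    int   messageLength;
    short sourceID;
    short destID;
    short numberOfBlocks;
} __attribute__((packed));

int main(void) {
    /* On typical platforms this prints 18 instead of the padded 20. */
    printf("%zu\n", sizeof(struct PackedNIDSHeader));
    return 0;
}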
I can only speak for C++. In C++ there is an implementation-specific feature which aligns the data on specific addresses so that it can be processed efficiently.
In MS Visual C++ you can enforce byte alignment with a pragma:
#pragma pack(1)
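A minimal sketch of how that pragma is typically applied, using pack(push, 1) / pack(pop) so the packing only affects this one declaration:
#pragma pack(push, 1)   /* pack members on 1-byte boundaries */
struct BasicNIDSHeader {
    short messageCode;
    short messageDate;
    int   messageTime;
    int   messageLength;
    short sourceID;
    short destID;
    short numberOfBlocks;
};
#pragma pack(pop)       /* restore the previous packing */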