How does gem5 (Alpha) perform floating point operations? - fpu

I use gem5 to simulate the Alpha CPU architecture. How are floating point operations handled in gem5? Does it use a floating point unit? Do floating point operations consume more simulated time than integer operations?

Related

Can float16 data type save compute cycles while computing transcendental functions?

It's clear that float16 can save bandwidth, but can float16 also save compute cycles while computing transcendental functions, like exp()?
If your hardware has full support for it, not just conversion to float32, then yes, definitely. e.g. on a GPU, or on Intel Alder Lake with AVX-512 enabled, or Sapphire Rapids.
Half-precision floating-point arithmetic on Intel chips. Or apparently on Apple M2 CPUs.
If you can do two 64-byte SIMD vectors of FMAs per clock on a core, you go twice as fast if that's 32 half-precision FMAs per vector instead of 16 single-precision FMAs.
Speed vs. precision tradeoff: only enough precision for FP16 is needed
Without hardware ALU support for FP16, you can only save cycles by not requiring as much precision, because you know you're eventually going to round to fp16. So you'd use polynomial approximations of lower degree, and thus fewer FMA operations, even though you're computing with float32.
BTW, exp and log are interesting for floating point because the format itself is built around an exponential representation. So you can do an exponential by converting fp->int and stuffing that integer into the exponent field of an FP bit pattern. Then, with the fractional part of your FP number, you use a polynomial approximation to get the mantissa of the result (2^frac). A log implementation is the reverse: extract the exponent field and use a polynomial approximation of log of the mantissa over a range like 1.0 to 2.0. A rough sketch of the exp idea follows the links below.
See
Efficient implementation of log2(__m256d) in AVX2
Fastest Implementation of Exponential Function Using AVX
Very fast approximate Logarithm (natural log) function in C++?
vgetmantps vs andpd instructions for getting the mantissa of float
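To make that concrete, here is a minimal scalar C++ sketch of the exp2 idea: split x into integer and fractional parts, build 2^int by writing the exponent field directly, and approximate 2^frac with a short polynomial. The polynomial coefficients here are only illustrative (roughly fitted, not a proper minimax fit), and there is no overflow, underflow, or NaN handling; a version targeting fp16 output could get away with an even shorter polynomial.

    #include <cstdint>
    #include <cstring>
    #include <cmath>
    #include <cstdio>

    // Approximate 2^x for "reasonable" finite x, illustrating the split into
    // an integer part (stuffed into the exponent field) and a fractional part
    // (handled by a low-degree polynomial). No overflow/underflow/NaN handling.
    float exp2_approx(float x) {
        float xi = std::floor(x);     // integer part -> exponent field
        float xf = x - xi;            // fractional part in [0, 1)

        // Illustrative cubic approximation of 2^xf on [0, 1).
        float p = 1.0f + xf * (0.6951f + xf * (0.2280f + xf * 0.0780f));

        // Build the float 2^xi directly from its IEEE-754 bit pattern.
        uint32_t bits = (uint32_t)((int32_t)xi + 127) << 23;
        float scale;
        std::memcpy(&scale, &bits, sizeof scale);

        return scale * p;             // 2^xi * 2^xf
    }

    int main() {
        const float xs[] = {-3.7f, -1.0f, 0.3f, 2.5f, 7.9f};
        for (float x : xs)
            std::printf("x=%5.2f  approx=%10.5f  exp2=%10.5f\n",
                        x, exp2_approx(x), std::exp2(x));
    }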
Normally you do want some FP operations, so I don't think it would be worth trying to use only 16-bit integer operations to avoid unpacking to float32 even for exp or log, which are somewhat special and intimately connected with floating point's significand * 2^exponent format, unlike sin/cos/tan or other transcendental functions.
So I think your best bet would normally still be to start by converting fp16 to fp32, unless you have instructions like AVX-512 FP16 that can do actual FP math on it. But you can still gain performance from not needing as much precision, since implementing these functions normally involves a speed vs. precision tradeoff.
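As a sketch of that convert-compute-convert pattern (assuming an x86 CPU with F16C and FMA; the fused multiply-add in the middle is just a stand-in for whatever float32 polynomial you would actually evaluate):

    #include <immintrin.h>
    #include <cstdint>
    #include <cstddef>

    // Process fp16 data 8 elements at a time: widen to fp32, compute there,
    // narrow back to fp16. Compile with e.g. -mf16c -mfma (or -mavx2 -mfma).
    void process_f16(const uint16_t* in, uint16_t* out, std::size_t n) {
        for (std::size_t i = 0; i + 8 <= n; i += 8) {        // tail loop omitted
            __m128i h = _mm_loadu_si128((const __m128i*)(in + i));
            __m256  f = _mm256_cvtph_ps(h);                   // fp16 -> fp32
            // ... here you would evaluate your float32 approximation ...
            __m256  r = _mm256_fmadd_ps(f, f, _mm256_set1_ps(1.0f)); // x*x + 1
            __m128i o = _mm256_cvtps_ph(r, _MM_FROUND_TO_NEAREST_INT);
            _mm_storeu_si128((__m128i*)(out + i), o);         // fp32 -> fp16
        }
    }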

If I have a GPU server with 1PFlop/s single-precision floating-point power, can I do double-precision calculations?

If I have a GPU server with 1 PFlop/s of single-precision floating-point power, can I do double-precision calculations? If so, roughly how much double-precision computing power would that correspond to?
It depends on the GPU model that is available: high-end data-center GPUs like the Nvidia A100 have a substantial number of double-precision units (FP64 peak throughput is roughly half of FP32 peak), while consumer GPUs typically run double precision at only a small fraction of their single-precision rate.
The other main factor is the "suitable application": the amount of Flop/s you can extract from your GPU server changes depending on the nature of your application and the specific implementation.

Neural Networks w/ Fixed Point Parameters

Most neural networks are trained with floating point weights/biases.
Quantization methods exist to convert the weights from float to int, for deployment on smaller platforms.
Can you build neural networks from the ground up that constrain all parameters, and their updates, to integer arithmetic?
Could such networks achieve a good accuracy?
(I know a bit about fixed-point and have only some rusty NN experience from the 90's so take what I have to say with a pinch of salt!)
The general answer is yes, but it depends on a number of factors.
Bear in mind that floating-point arithmetic is basically the combination of an integer significand with an integer exponent so it's all integer under the hood. The real question is: can you do it efficiently without floats?
Firstly, "good accuracy" is highly dependent on many factors. It's perfectly possible to perform integer operations that have higher granularity than floating-point. For example, 32-bit integers have 31 bits of mantissa while 32-bit floats effectively have only 24. So provided you do not require the added precision that floats give you near zero, it's all about the types that you choose. 16-bit -- or even 8-bit -- values might suffice for much of the processing.
Secondly, accumulating the inputs to a neuron has the issue that unless you know the maximum number of inputs to a node, you cannot be sure what the upper bound is on the values being accumulated. So effectively you must specify this limit at compile time.
Thirdly, the most complicated operation during the execution of a trained network is often the activation function. Again, you first have to think about what the range of values is within which you will be operating. You then need to implement the function without the aid of an FPU and all of the advanced mathematical functions it provides. One way to do this is via lookup tables.
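A sketch of what that could look like for a sigmoid over Q16.16 inputs (the table size, input range, and fixed-point format are arbitrary choices; the table itself is generated with float math, which you would do offline or at build time rather than on the target):

    #include <cstdint>
    #include <cmath>
    #include <array>

    // Table-driven sigmoid for Q16.16 inputs clamped to [-8, 8].
    constexpr int FRAC_BITS  = 16;
    constexpr int TABLE_SIZE = 256;          // finer tables trade memory for accuracy
    constexpr double X_MIN = -8.0, X_MAX = 8.0;

    // Built with float math here for brevity; on a real target this table
    // would be precomputed offline and stored as constant data.
    static std::array<int32_t, TABLE_SIZE> make_table() {
        std::array<int32_t, TABLE_SIZE> t{};
        for (int i = 0; i < TABLE_SIZE; ++i) {
            double x = X_MIN + (X_MAX - X_MIN) * i / (TABLE_SIZE - 1);
            t[i] = (int32_t)(1.0 / (1.0 + std::exp(-x)) * (1 << FRAC_BITS));
        }
        return t;
    }
    static const std::array<int32_t, TABLE_SIZE> SIGMOID_TABLE = make_table();

    int32_t sigmoid_fix(int32_t x) {         // x and the result are Q16.16
        const int32_t lo = (int32_t)(X_MIN * (1 << FRAC_BITS));
        const int32_t hi = (int32_t)(X_MAX * (1 << FRAC_BITS));
        if (x <= lo) return 0;
        if (x >= hi) return 1 << FRAC_BITS;  // 1.0 in Q16.16
        // Map x linearly onto a table index; no interpolation, for brevity.
        int64_t idx = (int64_t)(x - lo) * (TABLE_SIZE - 1) / (hi - lo);
        return SIGMOID_TABLE[(unsigned)idx];
    }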
Finally, training involves measuring the error between values, and that error can often be quite small. Here is where accuracy is a concern: if the differences you are measuring are too small, they will round down to zero, and this may prevent progress. One solution is to increase the resolution of the values by providing more fractional bits.
One advantage that integers have over floating-point here is their even distribution. Where floating-point numbers lose absolute precision as they increase in magnitude, integers maintain a constant step size. This means that if you are trying to measure very small differences between values that are close to 1, you should have no more trouble than you would if those values were close to 0. The same is not true for floats.
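A two-line illustration of that point: near 1.0, float32 cannot represent a difference much smaller than about 6e-8, whereas a 32-bit integer keeps a fixed step of 1 across its whole range.

    #include <cstdio>
    #include <cstdint>

    int main() {
        float a = 1.0f;
        float b = 1.0f + 1e-9f;       // far below float32's step near 1.0 (~6e-8)
        std::printf("floats equal: %d\n", a == b);   // prints 1: difference rounded away

        int32_t i = (1 << 30);
        int32_t j = (1 << 30) + 1;    // integer step is 1 everywhere in range
        std::printf("ints equal:   %d\n", i == j);   // prints 0: difference survives
    }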
It's possible to train a network with higher precision types than those used to run the network if training time is not the bottleneck. You might even be able to train the network using floating-point types and run it using lower-precision integers but you need to be aware of differences in behavior that these shortcuts will bring.
In short, the problems involved are by no means insurmountable, but you need to take on some of the mental effort that would normally be saved by using floating-point. However, especially if your hardware is physically constrained, this can be a hugely beneficial approach, as floating-point arithmetic requires as much as 100 times more silicon and power than integer arithmetic.
Hope that helps.

Difference between data processed by CPU and GPU in openCL

I have some matlab code that uses several large MEX functions and I want to speed things up by using openCL ( I am replacing parts of code of the MEX functions with openCL code using openCL API ). I've translated a small part of the code into an openCL kernel and I am already facing difficulties.
Some elements of the resulting matrix after execution on the GPU differ from the corresponding elements of the matrix produced by the original MEX function, and the error is less than 0.01. This leads to a small error in the final result, but I fear the error will accumulate as I translate more code.
This is probably related to different precision of the calculations on the CPU and the GPU. Does anyone know how to ensure the same precision? I am running 64-bit Matlab R2012b on Ubuntu 12.04. The hardware I am using is an Intel Core2 Duo E4700 and an NVIDIA GeForce GT 520.
The small differences between results on your CPU and GPU are easily explained as arising from differences in floating-point precision if you have modified your code from using double precision (64-bit) f-p numbers on the CPU to using single-precision (32-bit) f-p numbers on the GPU.
I would not call this difference an error, rather it is an artefact of doing arithmetic on computers with floating-point numbers. The results you were getting on your CPU-only code were already different from any theoretically 'true' result. Much of the art of numerical computing is in keeping the differences between theoretical and actual computations small enough (whatever the heck that means) for the entire duration of a computation. It would take more time and space than I have now to expand on this, but surprises arising from lack of understanding of what floating-point arithmetic is, and isn't, are a rich source of questions here on SO. Some of the answers to those questions are very illuminating. This one should get you started.
If you have taken care to use the same precision on both CPU and GPU, then the differences you report may be explained by the non-associativity of floating-point arithmetic: in floating-point arithmetic it is not guaranteed that (a+b)+c == a+(b+c). The order of operations matters; if you have any SIMD going on, I'd bet that the order of operations is not identical in the two implementations. Even if you haven't, what have you done to ensure that operations are ordered the same on both GPU and CPU?
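To make the non-associativity point concrete, a minimal C++ demonstration:

    #include <cstdio>

    int main() {
        float a = 1e8f, b = -1e8f, c = 1.0f;
        float left  = (a + b) + c;    // 0 + 1 = 1
        float right = a + (b + c);    // the 1 is absorbed by the large-magnitude sum
        std::printf("left=%g right=%g\n", left, right);   // typically: left=1 right=0
    }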
As to what you should do about it, that's rather up to you. You could (though I personally wouldn't recommend it) write your own routines for doing double-precision f-p arithmetic on the GPU. If you choose to do this, expect to wave goodbye to much of the speed-up that the GPU promises.
A better course of action is to ensure that your single-precision software provides sufficient accuracy for your purposes. For example, in the world I work in our original measurements from the environment are generally not accurate to more than about 3 significant figures, so any results that our codes produce have no validity after about 3 s-f. So if I can keep the errors in the 5th and lower s-fs that's good enough.
Unfortunately, from your point of view, getting enough accuracy from single-precision computations isn't necessarily achieved by globally replacing double with float and recompiling; you may (and generally would) need to implement different algorithms, ones which take more time to guarantee more accuracy and which do not drift so much as computations proceed. Again, you'll lose some of the speed advantage that GPUs promise.
A common problem is that floating-point values are kept in an 80-bit x87 CPU register instead of being truncated and stored to memory after each operation. In these cases, the additional precision leads to deviations, so check what options your compiler offers to counter such issues. It can also be interesting to compare release and debug builds.

Newton-Raphson Division For Floating Point Divide?

I am trying to implement the Newton-Raphson division algorithm (see the Wikipedia entry) to implement an IEEE-754 32-bit floating point divide on a processor which has no hardware divide unit.
My memory locations are 32-bit two's complement words, and I have already implemented floating point addition, subtraction, and multiplication, so I can reuse that code to implement the Newton-Raphson algorithm. I am trying to first implement this all in Matlab.
At this step:
X_0 = 48/17 - 32/17 * D
How do I bit-shift D properly so that it falls between 0.5 and 1, as described in the algorithm details?
You might look at the compiler-rt runtime library (part of LLVM), which has a liberal license and implements floating-point operations for processors that lack hardware support.
You could also look at libgcc, though I believe that's GPL, which may or may not be an issue for you.
In fact, don't just look at them. Use one of them (or another soft-float library). There's no need to re-invent the wheel.
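That said, for reference, the scaling step the question asks about is pure exponent-field manipulation: force the divisor's biased exponent to 126 so its value lands in [0.5, 1), remember how far you shifted, and undo the shift on the result. A minimal C++ sketch (positive, normal inputs only; no handling of zero, infinity, NaN, subnormals, or extreme exponents; in Matlab the same bit work could be done with typecast, bitand and bitshift):

    #include <cstdint>
    #include <cstring>
    #include <cstdio>

    // Newton-Raphson reciprocal divide N / D for positive, normal binary32 inputs.
    float nr_divide(float N, float D) {
        // Force D's biased exponent to 126, which scales it into [0.5, 1).
        uint32_t bits;
        std::memcpy(&bits, &D, sizeof bits);
        int32_t shift = (int32_t)((bits >> 23) & 0xFF) - 126;   // D = d * 2^shift
        bits = (bits & ~(0xFFu << 23)) | (126u << 23);
        float d;
        std::memcpy(&d, &bits, sizeof d);

        // Initial estimate X0 = 48/17 - 32/17 * d, then Newton-Raphson steps
        // X_{k+1} = X_k * (2 - d * X_k); three iterations suffice for binary32.
        float x = 48.0f / 17.0f - (32.0f / 17.0f) * d;
        for (int i = 0; i < 3; ++i)
            x = x * (2.0f - d * x);

        // x ~ 1/d and 1/D = (1/d) * 2^-shift, so rescale by building 2^-shift
        // directly in the exponent field.
        uint32_t sbits = (uint32_t)(127 - shift) << 23;
        float scale;
        std::memcpy(&scale, &sbits, sizeof scale);
        return N * x * scale;
    }

    int main() {
        std::printf("%f (expect %f)\n", nr_divide(355.0f, 113.0f), 355.0f / 113.0f);
    }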