double or decimal for temperature spread formula

I am looking at writing an application which calculates temperatures spread in a material with an exposure (such as fire) over time.
I then need to draw the result as a heat map on the geometry of the material.
Now I wonder if I should in general go with Decimal or Double for all calculations and drawing. I have looked into both but am still unsure which to use.
I will need to compare values, including interpolated values over time. And double has, as far as I understand it, comparison problems due to its inexact representation.
But decimal is heavy to work with.
I am leaning towards double only, but at the same time a more exact representation and comparison would be worth a lot too.

Any finite representation of decimal numbers is bound to have the same “inexact representation” issues as a finite binary representation. Not all real numbers can be represented in finite space, and going to base 10 is not going to help there.
Decimal just uses the same bits less efficiently when it comes to representing a physical simulation. On the other hand, a particular implementation of decimal numbers may allow you to use more bits, but multi-precision binary floating-point implementations are available too.
The superstition that decimal floating-point libraries would somehow not be “inexact” or have “comparison problems” is widespread because these libraries can represent decimal numbers such as 0.1 exactly. If your simulation involves coefficients that are powers of ten, then decimal would indeed be a good fit. Otherwise, decimal will not solve any of the problems inherent to the finite representation of a continuous range of numbers.
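To make that concrete, here is a small sketch in Python (its standard decimal module plays the role of C#'s decimal; the behaviour is analogous): 0.1 is exact in decimal, but a coefficient like 1/3 still is not, so tolerance-based comparison is needed either way.

    from decimal import Decimal, getcontext

    getcontext().prec = 28                                    # default decimal precision

    print(0.1 + 0.2 == 0.3)                                   # False: binary double cannot store 0.1 exactly
    print(Decimal("0.1") + Decimal("0.2") == Decimal("0.3"))  # True: powers of ten are exact in decimal

    third = Decimal(1) / Decimal(3)                           # but 1/3 is not a power of ten...
    print(third * 3 == 1)                                     # False: the product is 0.9999... (28 nines), not 1

    def nearly_equal(a, b, tol=1e-9):
        # comparison with a tolerance is needed for simulation results in either format
        return abs(a - b) <= tol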

Can float16 data type save compute cycles while computing transcendental functions?

It's clear that float16 can save bandwidth, but can float16 also save compute cycles when computing transcendental functions, like exp()?
If your hardware has full support for it, not just conversion to float32, then yes, definitely: e.g. on a GPU, on Intel Alder Lake with AVX-512 enabled, or on Sapphire Rapids.
Half-precision floating-point arithmetic on Intel chips. Or apparently on Apple M2 CPUs.
If you can do two 64-byte SIMD vectors of FMAs per clock on a core, you go twice as fast if that's 32 half-precision FMAs per vector instead of 16 single-precision FMAs.
Speed vs. precision tradeoff: only enough precision for FP16 is needed
Without hardware ALU support for FP16, only by not requiring as much precision because you know you're eventually going to round to fp16. So you'd use polynomial approximations of lower degree, thus fewer FMA operations, even though you're computing with float32.
BTW, exp and log are interesting for floating point because the format itself is built around an exponential representation. So you can do an exponential by converting fp->int and stuffing that integer into the exponent field of an FP bit pattern. Then, with the fractional part of your FP number, you use a polynomial approximation to get the mantissa (significand) of the result. A log implementation is the reverse: extract the exponent field and use a polynomial approximation of log of the mantissa, over a range like 1.0 to 2.0. (A rough scalar sketch follows after the links below.)
See
Efficient implementation of log2(__m256d) in AVX2
Fastest Implementation of Exponential Function Using AVX
Very fast approximate Logarithm (natural log) function in C++?
vgetmantps vs andpd instructions for getting the mantissa of float
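To make the exponent-field trick concrete, here is a rough scalar sketch in Python. The cubic coefficients are just an illustrative fit, not production minimax constants; the links above cover real SIMD versions.

    import struct

    def exp2_approx(x):
        # 2**x = 2**i * 2**f, with i = floor(x) and f in [0, 1)
        i = int(x // 1)
        f = x - i
        # low-degree polynomial approximation of 2**f on [0, 1)
        # (illustrative cubic; relative error on the order of 1e-4, roughly fp16-level accuracy)
        p = 1.0 + f * (0.6951 + f * (0.2262 + f * 0.0789))
        # build 2**i by stuffing (i + bias) straight into the exponent field of a double
        bits = (i + 1023) << 52
        scale = struct.unpack("<d", struct.pack("<q", bits))[0]
        return scale * p

    print(exp2_approx(3.7), 2.0 ** 3.7)   # ~12.9958 vs 12.9960

The log direction is the mirror image: shift the bit pattern right by 52 to recover the exponent, then run a short polynomial on the remaining mantissa bits.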
Normally you do want some FP operations, so I don't think it would be worth trying to use only 16-bit integer operations to avoid unpacking to float32 even for exp or log, which are somewhat special and intimately connected with floating point's significand * 2^exponent format, unlike sin/cos/tan or other transcendental functions.
So I think your best bet would normally still be to start by converting fp16 to fp32, unless you have instructions like AVX-512 FP16 that can do actual FP math on it. But you can gain performance from not needing as much precision, since implementing these functions normally involves a speed vs. precision tradeoff.

MATLAB precision causing problems when dealing with floating point numbers

I am performing matrix multiplication operations involving floating point numbers. Due to MATLAB's precision I am getting incorrect output. For example, in the snippet below:
a = 1+1e-18
a = 1
a is rounded to 1, but I want all of the decimal places to be kept for my calculation so that it does not round to one. How can I get MATLAB to keep all of the decimal places when performing my calculations?
MATLAB does not natively support rational data types or extended floating point beyond double. Functions such as rat and rats look promising at first, but don't offer the functionality to work with the result out of the box.
You could get some mileage by retaining the numerator and denominator of your numbers separately. If you then implement the operators that you need for fractions, you'd have much higher precision in your final result.
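As an illustration of the numerator/denominator idea (sketched here in Python, because it has an exact rational type built in; in MATLAB you would carry two integer arrays and implement the same add/multiply rules yourself):

    from fractions import Fraction

    a = 1 + Fraction(1, 10**18)    # the 1 + 1e-18 from the question, kept exactly
    b = Fraction(3, 7)

    print(a == 1)                  # False: nothing has been rounded away
    print(a * b)                   # exact: numerators and denominators grow as needed
    print(float(a * b))            # convert to double only at the very end

The usual caveat is that numerators and denominators grow quickly under repeated arithmetic, so this is only practical for a modest number of operations.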

Neural Networks w/ Fixed Point Parameters

Most neural networks are trained with floating point weights/biases.
Quantization methods exist to convert the weights from float to int, for deployment on smaller platforms.
Can you build neural networks from the ground up that constrain all parameters, and their updates, to integer arithmetic?
Could such networks achieve a good accuracy?
(I know a bit about fixed-point and have only some rusty NN experience from the 90's so take what I have to say with a pinch of salt!)
The general answer is yes, but it depends on a number of factors.
Bear in mind that floating-point arithmetic is basically the combination of an integer significand with an integer exponent so it's all integer under the hood. The real question is: can you do it efficiently without floats?
Firstly, "good accuracy" is highly dependent on many factors. It's perfectly possible to perform integer operations that have higher granularity than floating-point. For example, 32-bit integers have 31 bits of mantissa while 32-bit floats effectively have only 24. So provided you do not require the added precision that floats give you near zero, it's all about the types that you choose. 16-bit -- or even 8-bit -- values might suffice for much of the processing.
Secondly, accumulating the inputs to a neuron has the issue that unless you know the maximum number of inputs to a node, you cannot be sure what the upper bound is on the values being accumulated. So effectively you must specify this limit at compile time.
Thirdly, the most complicated operation during the execution of a trained network is often the activation function. Again, you firstly have to think about what the range of values are within which you will be operating. You then need to implement the function without the aid of an FPU with all of the advanced mathematical functions it provides. One way to consider doing this is via lookup tables.
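A minimal sketch of the lookup-table idea (written in Python for brevity; the run-time part uses only integer operations, so it translates directly to integer-only C). The Q4.12 format, the 256-entry table, and the [-8, 8) input range are arbitrary choices for illustration.

    import math

    FRAC_BITS = 12                 # Q4.12 fixed point: value = raw / 2**12
    SCALE = 1 << FRAC_BITS

    # Build the table once, offline, with floats; run time then needs only integer ops.
    TABLE_SIZE = 256
    IN_MIN, IN_MAX = -8.0, 8.0
    SIGMOID_LUT = [
        round(SCALE / (1.0 + math.exp(-(IN_MIN + (IN_MAX - IN_MIN) * i / TABLE_SIZE))))
        for i in range(TABLE_SIZE)
    ]

    def sigmoid_q12(x_q12):
        # map Q4.12 inputs in [-8, 8) (raw -32768..32767) to a table index 0..255
        idx = (x_q12 + (8 << FRAC_BITS)) * TABLE_SIZE // (16 << FRAC_BITS)
        idx = max(0, min(TABLE_SIZE - 1, idx))   # clamp out-of-range inputs
        return SIGMOID_LUT[idx]                  # result is also Q4.12 (0..4096)

    print(sigmoid_q12(0) / SCALE)                # ~0.5
    print(sigmoid_q12(2 << FRAC_BITS) / SCALE)   # ~0.88, close to sigmoid(2)

If the table has to stay small, linearly interpolating between adjacent entries buys back accuracy for the cost of one extra multiply and shift.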
Finally, training involves measuring the error between values, and that error can often be quite small. Here is where accuracy is a concern. If the differences you are measuring are too small, they will round down to zero and this may prevent progress. One solution is to increase the resolution of the values by providing more fractional bits.
One advantage that integers have over floating-point here is their even distribution. Where floating-point numbers lose absolute precision as they increase in magnitude, integers maintain a constant step size. This means that if you are trying to measure very small differences between values that are close to 1, you should have no more trouble than you would if those values were close to 0 instead. The same is not true for floats.
It's possible to train a network with higher precision types than those used to run the network if training time is not the bottleneck. You might even be able to train the network using floating-point types and run it using lower-precision integers but you need to be aware of differences in behavior that these shortcuts will bring.
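A minimal sketch of that last idea, post-training quantization of float weights to int8 with a single symmetric scale per tensor (Python with numpy assumed; the scheme is deliberately simplistic):

    import numpy as np

    def quantize_int8(w):
        # map float values to int8 plus one float scale factor
        scale = np.abs(w).max() / 127.0
        q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
        return q, scale

    def int8_matmul(x_q, w_q, x_scale, w_scale):
        # integer matmul with a wide (int32) accumulator, one float rescale at the end
        acc = x_q.astype(np.int32) @ w_q.astype(np.int32)
        return acc.astype(np.float32) * (x_scale * w_scale)

    rng = np.random.default_rng(0)
    w = rng.standard_normal((4, 3)).astype(np.float32)
    x = rng.standard_normal((2, 4)).astype(np.float32)
    w_q, w_s = quantize_int8(w)
    x_q, x_s = quantize_int8(x)
    print(np.max(np.abs(x @ w - int8_matmul(x_q, w_q, x_s, w_s))))   # small but non-zero error

The small non-zero error printed at the end is exactly the kind of behavioural difference mentioned above that you have to budget for.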
In short, the problems involved are by no means insurmountable, but you need to take on some of the mental effort that would normally be saved by using floating-point. However, especially if your hardware is physically constrained, this can be a hugely beneficial approach, as floating-point arithmetic can require as much as 100 times more silicon and power than integer arithmetic.
Hope that helps.

Matlab: How to decrease the precision of calculations in matlab to let's say 4 digits?

I am wondering how to tell Matlab that all computations should be done, and carried forward, with, let's say, 4 digits.
format long and other formats are, I think, only for displaying the results with a specific precision, not for the precision of the values themselves.
Are you hoping to increase performance or save memory by not using double precision? If so, then variable precision arithmetic and vpa (part of the Symbolic Math toolbox) are probably not the right choice. Instead, you might consider fixed point arithmetic. This type of math is mostly used on microcontrollers and architectures with memory address widths of less than 32 bits or that lack dedicated floating point hardware. In Matlab you'll need the Fixed Point Designer toolbox. You can read more about the capabilities here.
Use digits: http://www.mathworks.com/help/symbolic/digits.html
digits(d) sets the current VPA accuracy to d significant decimal digits. The value d must be a positive integer greater than 1 and less than 2^29 + 1.
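If it helps to see the idea outside of MATLAB, Python's standard decimal module has the same notion of a context-wide number of significant digits, which can be handy for experimenting with the effect of a 4-digit working precision (a sketch, not a MATLAB equivalent):

    from decimal import Decimal, getcontext

    getcontext().prec = 4          # all subsequent arithmetic keeps 4 significant digits
    a = Decimal(1) / Decimal(3)
    print(a)                       # 0.3333
    print(a * 3)                   # 0.9999, so the truncation propagates through later steps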

Are there architectures which are not using two's complement for representation of negative values?

The benefits of using two's complement for storing negative values in memory are well known and well discussed on this board.
Hence, I'm wondering:
Do or did some architectures exist, which have chosen a different way for representing negative values in memory than using two's complement?
If so: What were the reasons?
Signed-magnitude existed as the most obvious, naive implementation of signed numbers.
One's complement has also been used on real machines.
Both of those representations have the benefit that the positive and negative ranges span equal intervals. A downside is that they both contain a negative-zero representation, which doesn't naturally occur in the sort of integer arithmetic commonly used in computation. And of course, the hardware for two's complement turns out to be much simpler to build.
Note that the above applies to integers. Common IEEE-style floating point representations are effectively sign-magnitude, with some more details layered into the magnitude representation.
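For concreteness, here is -5 encoded in 8 bits under each of the three integer schemes (a small Python sketch):

    def sign_magnitude(n, bits=8):
        # high bit is the sign, remaining bits hold |n|
        return ((1 << (bits - 1)) | abs(n)) if n < 0 else n

    def ones_complement(n, bits=8):
        # negative values are the bitwise inverse of |n|
        mask = (1 << bits) - 1
        return (~abs(n)) & mask if n < 0 else n

    def twos_complement(n, bits=8):
        # negative values wrap around modulo 2**bits
        return n & ((1 << bits) - 1)

    for name, encode in [("sign-magnitude", sign_magnitude),
                         ("one's complement", ones_complement),
                         ("two's complement", twos_complement)]:
        print(f"{name:>16}: {encode(-5, 8):08b}")

    # Output:
    #   sign-magnitude: 10000101
    # one's complement: 11111010
    # two's complement: 11111011
    # Note that sign-magnitude and one's complement each have a second zero
    # (10000000 and 11111111 respectively), the negative zero mentioned above.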