double.NaN - how does this counterintuitive feature work?

I stumbled upon .NET's definition of double.NaN in code:
public const double NaN = (double)0.0 / (double)0.0;
This is done similarly in PositiveInfinity and NegativeInfinity.
double.IsNaN (after removing a few #pragmas and comments) is defined as:
[Pure]
[ReliabilityContract(Consistency.WillNotCorruptState, Cer.Success)]
public static bool IsNaN(double d)
{
    if (d != d)
    {
        return true;
    }
    else
    {
        return false;
    }
}
This is very counter-intuitive to me.
Why is NaN defined as division by zero? How is 0.0 / 0.0 represented "behind the scenes"? How can division by 0 be possible in double, and why does NaN != NaN?

The answer is fairly simple: the .NET Framework implements the floating-point standard specified by the IEEE (System.Double complies with the IEC 60559:1989 (IEEE 754) standard for binary floating-point arithmetic). Floating-point arithmetic has to behave the same way across many systems, not just x86/x64 architectures, so following the standard means fewer compatibility issues (for instance, when porting code from a DSP to an x86 processor).
As for d != d, this is a performance optimisation. The comparison relies on a hardware instruction that can very quickly determine whether two double-precision values are equal. Under the standard, NaN != NaN, so comparing a value with itself is the fastest way to test for NaN.

Why is NaN defined as division by zero? How can division by 0 be possible in double, and why does NaN != NaN?
All of these are mandated by the IEEE 754 standard, which pretty much all modern CPUs implement.
How is 0.0 / 0.0 represented "behind the scenes"?
By having an exponent with all bits set to 1 and a mantissa with at least one bit set to 1. Note that this means that there are a large number of different bit patterns that all represent NaN - but, as mentioned above, even if the bit patterns are identical, they must be considered not equal (i.e. == must return false).
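The bit layout described above can be inspected directly. The following is a minimal Python sketch, not the .NET implementation; the helper names double_bits, bits_to_double and fields are made up for illustration. Python is used because its float is an IEEE 754 double, although 0.0 / 0.0 raises ZeroDivisionError there, so the NaN is obtained with float("nan") instead.

```python
import math
import struct

def double_bits(x: float) -> int:
    """Return the 64-bit IEEE 754 pattern of a double as an integer (hypothetical helper)."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]

def bits_to_double(bits: int) -> float:
    """Reinterpret a 64-bit integer as a double (hypothetical helper)."""
    return struct.unpack("<d", struct.pack("<Q", bits))[0]

def fields(x: float):
    """Split a double into (sign, exponent, mantissa) bit fields."""
    bits = double_bits(x)
    sign = bits >> 63
    exponent = (bits >> 52) & 0x7FF          # 11 exponent bits
    mantissa = bits & ((1 << 52) - 1)        # 52 stored mantissa bits
    return sign, exponent, mantissa

nan = float("nan")
sign, exponent, mantissa = fields(nan)
print(hex(exponent), hex(mantissa))

# A second NaN with a different mantissa payload, and infinity for contrast:
# infinity has the same all-ones exponent but a zero mantissa.
other_nan = bits_to_double(0x7FF8000000000001)
infinity = bits_to_double(0x7FF0000000000000)

def is_nan(d: float) -> bool:
    """Same trick as the d != d test in the question."""
    return d != d
```

Both nan and other_nan satisfy is_nan even though their bit patterns differ, and nan == nan is False even though both operands are the very same object.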

From the C# Spec:
14.9.2 Floating-point comparison operators
The predefined floating-point comparison operators are:
bool operator ==(float x, float y); bool operator ==(double x, double y);
bool operator !=(float x, float y); bool operator !=(double x, double y);
bool operator <(float x, float y); bool operator <(double x, double y);
bool operator >(float x, float y); bool operator >(double x, double y);
bool operator <=(float x, float y); bool operator <=(double x, double y);
bool operator >=(float x, float y); bool operator >=(double x, double y);
The operators compare the operands according to the rules of the IEC 60559 standard:
If either operand is NaN, the result is false for all operators except !=, for which the result is true. For any two operands, x != y always produces the same result as !(x == y). However, when one or both operands are NaN, the <, >, <=, and >= operators do not produce the same results as the logical negation of the opposite operator. [Example: If either of x and y is NaN, then x < y is false, but !(x >= y) is true. end example]
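The rule quoted from the spec can be checked in any language whose floating-point comparisons follow IEC 60559. Here is a short Python sketch (Python rather than C#, on the assumption that its float comparisons follow the same rules, which they do on common platforms); the variable names are illustrative only.

```python
nan = float("nan")
x, y = nan, 1.0

# Every ordered comparison and == is False when an operand is NaN;
# only != is True.
results = {
    "x == y": x == y,
    "x != y": x != y,
    "x < y": x < y,
    "x > y": x > y,
    "x <= y": x <= y,
    "x >= y": x >= y,
    # The negation of >= is NOT the same as <, as the spec example notes.
    "not (x >= y)": not (x >= y),
}

for expression, value in results.items():
    print(f"{expression}: {value}")
```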
As to how NaN is represented behind the scenes, the Wikipedia article on the IEEE 754 standard has some examples.

Related

Dart: What is the difference between floor() and toInt()

I want to truncate all decimals of a double without rounding. I have two possibilities here:
double x = 13.5;
int x1 = x.toInt(); // x1 = 13
int x2 = x.floor(); // x2 = 13
Is there any difference between those two approaches?
As explained by the documentation:
floor:
Rounds fractional values towards negative infinity.
toInt:
Equivalent to truncate.
truncate:
Rounds fractional values towards zero.
So floor rounds toward negative infinity, but toInt/truncate round toward zero. For positive values, this doesn't matter, but for negative fractional values, floor will return a number less than the original, whereas toInt/truncate will return a greater number.
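To see the difference concretely, here is a small Python sketch. It is only an analogy: math.floor stands in for Dart's floor(), and math.trunc (or int()) stands in for toInt()/truncate().

```python
import math

samples = [13.5, -13.5, 13.0, -0.5]

# Each row holds (value, floored, truncated).
rows = [(x, math.floor(x), math.trunc(x)) for x in samples]

for value, floored, truncated in rows:
    print(f"{value:6}: floor -> {floored:3}, trunc -> {truncated:3}")
```

The two results only differ for negative values with a fractional part, where floor moves away from zero and truncation moves toward it.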

Why do I get an incorrect output from a modulus operation with negative number

I tried this code in Dart and got 28.5:
void main() {
  double modulo = -1.5 % 30.0;
  print(modulo);
}
The same code in JavaScript returns -1.5:
let modulo = -1.5 % 30;
console.log(modulo);
What is the equivalent of the JavaScript code above in Dart?
The documentation for num.operator % states (emphasis mine):
Returns the remainder of the Euclidean division. The Euclidean division of two integers a and b yields two integers q and r such that a == b * q + r and 0 <= r < b.abs().
...
The sign of the returned value r is always positive.
See remainder for the remainder of the truncating division.
Meanwhile, num.remainder says (again, emphasis mine):
The result r of this operation satisfies: this == (this ~/ other) * other + r. As a consequence the remainder r has the same sign as the divider this.
So if you use:
void main() {
  double modulo = (-1.5).remainder(30.0);
  print(modulo);
}
you'll get -1.5.
Note that both values are mathematically correct; they correspond to different conventions for the quotient of a division that involves a negative number. A negative quotient can be rounded toward zero (also known as truncation) or toward negative infinity (flooring). remainder corresponds to a truncating division, and % corresponds to Euclidean division, which always yields a non-negative remainder and gives the same result as a flooring division whenever the divisor is positive.
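The conventions can be compared side by side. This is a Python sketch, not Dart: truncated_remainder and euclidean_remainder are hypothetical helpers that mimic what the quoted documentation describes for remainder and %, while Python's own % operator happens to use flooring, which makes the three-way contrast visible.

```python
import math

def truncated_remainder(a, b):
    """Remainder of truncating division: result takes the sign of a (like JavaScript's %)."""
    return math.fmod(a, b)

def euclidean_remainder(a, b):
    """Remainder of Euclidean division: result is never negative.

    A sketch only: for a tiny negative intermediate value the addition
    below can round up to abs(b) itself.
    """
    r = math.fmod(a, b)
    if r < 0:
        r += abs(b)
    return r

# The values from the question: truncation gives -1.5, the others give 28.5.
print(truncated_remainder(-1.5, 30.0), euclidean_remainder(-1.5, 30.0), -1.5 % 30.0)

# Flooring and Euclidean only disagree when the divisor is negative.
print(truncated_remainder(5, -3), euclidean_remainder(5, -3), 5 % -3)
```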
An issue was raised about this in the dart-lang repo a while ago. Apparently the % symbol in Dart is a "Euclidean modulo operator rather than remainder."
An equivalent of what you are trying to do can be accomplished with
(-1.5).remainder(30.0)

Okay to use double for == comparison and indexing?

In this answer, gire recommended not using == when comparing doubles.
When creating an increment variable in a for loop using start:step:stop notation, its type will be double. If one wants to use this loop variable for indexing and == comparisons, might that cause problems due to floating point precision?
Should one use integers? If so, is there a way to do so with the s:s:s notation?
Here's an example
a = rand(1, 5);
for ii = length(a):-1:1
    if (ii == 1) % Comparing var of type double with ==
        b = 0;
    else
        b = a(ii); % Using double for indexing
    end
    ... % Code
end
Note that the double-precision floating-point format stores 52 bits of mantissa (the fractional part of the significand) plus an implicit leading bit, so it can exactly represent every integer in the range
-9007199254740992 <= x <= 9007199254740992
Note that this is far larger than the range of an int32, which can only represent
-2147483648 <= x <= 2147483647
If you are just using the double as a loop variable, only incrementing it in integer steps, and not counting above 9,007,199,254,740,992 (2^53), then you are fine to use a double, and to use == to compare doubles.
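The exact-integer range is easy to probe. The sketch below uses Python, whose float is an IEEE 754 double, rather than MATLAB; LIMIT and the other names are illustrative. Because the significand has 53 bits once the implicit leading bit is counted, consecutive integers stay distinct up to 2**53 and start to collide just above it.

```python
LIMIT = 2 ** 53  # 9007199254740992

# Below the limit, every integer converts to a distinct double.
exact_below = float(LIMIT - 1) == LIMIT - 1

# Just above it, neighbouring integers collapse onto the same double.
collides_above = float(LIMIT + 1) == float(LIMIT)

# Counting in integer steps with a double stays exact, so == is safe here.
count = 0.0
all_exact = True
for expected in range(1, 1001):
    count += 1.0
    all_exact = all_exact and (count == expected)

print(exact_below, collides_above, all_exact)
```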
One reason people suggest for not using doubles is that you can't represent some common decimals exactly, e.g. 0.1 has no exact representation as a double. Therefore if you are working with monetary values, it may be better to separately store the data as an int and remember a scale factor of 10x or 100x or whatever.
It's sometimes bad to directly compare floating point numbers for equality because rounding issues can cause two floats to be not equal, even though the numbers are mathematically equal. This generally happens when the numbers are not exactly representable as floats, or when there is a significant size difference between the numbers, e.g.
>> 0.3 - 0.2 == 0.1
ans =
0
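The same comparison can be reproduced outside MATLAB. Here is a Python sketch (Python floats are also IEEE 754 doubles) that shows the failing equality, a tolerance-based alternative using the standard library's math.isclose, and the scaled-integer approach for monetary values; the variable names are illustrative.

```python
import math

difference = 0.3 - 0.2

# Exact equality fails because none of 0.1, 0.2, 0.3 is exactly representable.
exactly_equal = difference == 0.1

# A relative tolerance (math.isclose defaults to rel_tol=1e-09) accepts it.
close_enough = math.isclose(difference, 0.1)

# Storing amounts as integer cents (scale factor 100) keeps arithmetic exact.
cents_equal = (30 - 20) == 10

print(exactly_equal, close_enough, cents_equal)
```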
If you're indexing between integer bounds with integer steps (even though the variable class is actually double), it is ok to use == for comparisons with other integers.
You can cast the indices, if you really want to be safe.
For example:
for ii = int16(length(a):-1:1)
    if (ii == 1)
        b = 0;
    end
end

MATLAB - integers vs decimals assignment strange bug

newT = [b(i) d(i) a(i) z(i)];
newT, b(i), a(i)
Prints
newT =
123 364 123 902
ans =
1.234e+02
ans =
1.234e+02
What is the problem here? Why are the first and third entry in newT rounded to integer values? Why aren't they correctly assigned?
Unlike most other programming languages, integer types in Matlab take precedence over floating point types. When you combine them, either through concatenation or arithmetic, the floating point values are implicitly narrowed to integers, instead of the integers being widened to floating point.
>> int32(3) + 0.4
ans =
3
>> [int32(3) 0.4]
ans =
3 0
This is for historical reasons, because (IIRC) Matlab originally didn't have support for integers at all, so all numeric constants in Matlab produce double values, and the promotion rules were created to make it possible to mix integer types with floating-point constants.
To fix this, explicitly convert those int types to doubles before concatenating.
newT = [b(i) double(d(i)) a(i) double(z(i))];

Strange problem comparing floats in objective-C

At some point in an algorithm I need to compare the float value of a property of a class to a float. So I do this:
if (self.scroller.currentValue <= 0.1) {
}
where currentValue is a float property.
However, when the values are equal and self.scroller.currentValue is 0.1, the condition is not fulfilled and the code is not executed! I found out that I can fix this by casting 0.1 to float, like this:
if (self.scroller.currentValue <= (float)0.1) {
}
This works fine.
Can anyone explain to me why this is happening? Is 0.1 defined as a double by default or something?
Thanks.
I believe, though I have not found the passage in the standard that says so, that when a float is compared to a double, the float is converted to a double before the comparison. Floating-point literals without a suffix are doubles in C.
However, 0.1 has no exact representation as either a float or a double. Storing it in a float gives a small error; storing it in a double gives an even smaller one. The problem is that converting the float to a double carries over the float's larger error, so of course the two values are not going to compare equal.
Instead of using (float)0.1 you could use 0.1f which is a bit nicer to read.
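The effect can be reproduced without Objective-C. This is a Python sketch in which to_float32 is a hypothetical helper that rounds a double to single precision through the struct module, standing in for storing the value in a float property.

```python
import struct

def to_float32(x: float) -> float:
    """Round a double to the nearest float32 and widen it back (hypothetical helper)."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

# What a float property holding 0.1 actually contains, widened to double.
stored = to_float32(0.1)

# Comparing against the double literal 0.1: the float's rounding error
# pushes it slightly above, so <= fails.
compare_with_double = stored <= 0.1

# Comparing against 0.1 rounded the same way, like (float)0.1 or 0.1f.
compare_with_float = stored <= to_float32(0.1)

print(repr(stored), compare_with_double, compare_with_float)
```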
The problem is, as you have suggested in your question, that you are comparing a float with a double.
There is also a more general problem with comparing floats: when you do a calculation on a floating point number, the result may not be exactly what you expect. It is fairly common for the last bit of the resulting float to be wrong (although the inaccuracy can be larger than just the last bit). If you use == to compare two floats, all the bits have to be the same for the floats to be equal. If your calculation gives a slightly inaccurate result, they won't compare equal when you expect them to. Instead of comparing the values like this, you can check whether they are nearly equal. To do this, take the absolute difference between the floats and see if it is smaller than a given value (called an epsilon).
To choose a good epsilon you need to understand a bit about floating point numbers. Floating point numbers work similarly to representing a number to a given number of significant figures. If we work to 5 significant figures and your calculation results in the last digit of the result being wrong, then 1.2345 will have an error of +-0.0001 whereas 1234500 will have an error of +-100. If you always base your margin of error on the value 1.2345, then your compare routine will be identical to == for all values greater than 10 (when using decimal). This is worse in binary: it's all values greater than 2. This means that the epsilon we choose has to be relative to the size of the floats that we are comparing.
FLT_EPSILON is the gap between 1 and the next closest float. This means that it may be a good epsilon to choose if your number is between 1 and 2, but if your value is greater than 2 using this epsilon is pointless because the gap between 2 and the next nearest float is larger than epsilon. So we have to choose an epsilon relative to the size of our floats (as the error in the calculation is relative to the size of our floats).
A good(ish) floating point compare routine looks something like this:
#include <float.h>
#include <math.h>
#include <stdbool.h>

bool compareNearlyEqual (float a, float b, unsigned epsilonMultiplier)
{
    float epsilon;
    /* May as well do the easy check first. */
    if (a == b)
        return true;
    /* Scale epsilon by the operand with the larger magnitude. */
    if (fabsf(a) > fabsf(b)) {
        epsilon = scalbnf(1.0f, ilogb(a)) * FLT_EPSILON * epsilonMultiplier;
    } else {
        epsilon = scalbnf(1.0f, ilogb(b)) * FLT_EPSILON * epsilonMultiplier;
    }
    return fabsf(a - b) <= epsilon;
}
This comparison routine compares floats relative to the size of the largest float passed in. scalbnf(1.0f, ilogb(a)) * FLT_EPSILON finds the gap between a and the next nearest float. This is then multiplied by the epsilonMultiplier, so the size of the difference can be adjusted, depending on how inaccurate the result of the calculation is likely to be.
You can make a simple compareLessThan routine like this:
bool compareLessThan (float a, float b, unsigned epsilonMultiplier)
{
    /* Values that are nearly equal are not considered less than. */
    if (compareNearlyEqual (a, b, epsilonMultiplier))
        return false;
    return a < b;
}
You could also write a very similar compareGreaterThan function.
It's worth noting that comparing floats like this may not always be what you want. For instance this will never find that a float is close to 0 unless it is 0. To fix this you'd need to decide what value you thought was close to zero, and write an additional test for this.
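The relative-epsilon idea, including the near-zero caveat, can be tried out quickly. The following is a rough Python translation for doubles rather than floats, not a drop-in replacement for the C routine; gap_at and nearly_equal are names chosen here for illustration, and the sketch assumes finite, normal inputs.

```python
import math
import sys

def gap_at(x: float) -> float:
    """Spacing between adjacent doubles at the magnitude of x (finite, normal, nonzero x)."""
    _, exponent = math.frexp(abs(x))  # abs(x) == m * 2**exponent with 0.5 <= m < 1
    return math.ldexp(sys.float_info.epsilon, exponent - 1)

def nearly_equal(a: float, b: float, epsilon_multiplier: int = 4) -> bool:
    """Compare relative to the operand with the larger magnitude."""
    if a == b:
        return True
    larger = max(abs(a), abs(b))
    return abs(a - b) <= gap_at(larger) * epsilon_multiplier

# Rounding noise from a calculation is accepted...
print(nearly_equal(0.1 + 0.2, 0.3))
# ...a genuinely different value is rejected...
print(nearly_equal(1.0, 1.0 + 1e-9))
# ...and a tiny value is never "close to 0", because the allowed gap
# shrinks together with the operands.
print(nearly_equal(0.0, 1e-300))
```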
Sometimes the inaccuracies you get won't depend on the size of the result of a calculation, but will depend on the values that you put into a calculation. For instance sin(1.0f + (float)(200 * M_PI)) will give a much less accurate result than sin(1.0f) (the results should be identical). In this case your compare routine would have to look at the number you put into the calculation to know the margin of error of the answer.
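The sin example can be approximated numerically. This Python sketch emulates single precision only for the argument, using a hypothetical to_float32 helper, and then evaluates sin in double precision, so it isolates the error that comes from the input rather than from the sin routine itself.

```python
import math
import struct

def to_float32(x: float) -> float:
    """Round a double to the nearest float32 and widen it back (hypothetical helper)."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

reference = math.sin(1.0)

# 1.0 is exactly representable, so nothing is lost in the small argument.
small_argument = to_float32(1.0)

# 200 * pi must be rounded to float32 first; near 629 the spacing between
# adjacent float32 values is about 6e-5, far coarser than near 1.
large_argument = to_float32(1.0 + to_float32(200 * math.pi))

error_small = abs(math.sin(small_argument) - reference)
error_large = abs(math.sin(large_argument) - reference)

print(error_small, error_large)
```

A compare routine that only looks at the result (a value near 0.84) would assume an error around the float32 spacing near 1, which is much smaller than error_large.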
Doubles and floats store mantissas of different lengths in binary (23 bits for float, 52 for double). When a value cannot be represented exactly, its float and double forms will almost never be equal.
The IEEE Float Point article on wikipedia may help you understand this distinction.
In C, a floating-point literal like 0.1 is a double, not a float. Since the types of the data items being compared are different, the comparison is done in the more precise type (double). In all implementations I know about, float has a shorter representation than double (roughly 6 vs. 15 significant decimal digits). Moreover, the arithmetic is in binary, and 1/10 does not have an exact representation in binary.
Therefore, you're taking a float 0.1, which loses accuracy, extending it to double, and expecting it to compare equal to a double 0.1, which loses less accuracy.
Suppose we were doing this in decimal, with float being three digits and double being six, and we were comparing to 1/3.
We have the stored float value being 0.333. We're comparing it to a double with value 0.333333. We convert the float 0.333 to double 0.333000, and find it different.
0.1 is actually a very difficult value to store in binary. In base 2, 1/10 is the infinitely repeating fraction
0.0001100110011001100110011001100110011001100110011...
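The repeating pattern can be generated with exact rational arithmetic. Here is a small Python sketch using the standard fractions and decimal modules; binary_fraction_digits is a name made up for this example.

```python
from decimal import Decimal
from fractions import Fraction

def binary_fraction_digits(numerator: int, denominator: int, count: int) -> str:
    """Return the first `count` binary digits after the point of numerator/denominator."""
    remainder = Fraction(numerator, denominator)
    digits = []
    for _ in range(count):
        remainder *= 2              # shift one binary place to the left
        if remainder >= 1:
            digits.append("1")
            remainder -= 1
        else:
            digits.append("0")
    return "".join(digits)

# The block 0011 repeats forever after the leading 0.0
expansion = binary_fraction_digits(1, 10, 24)
print("0." + expansion + "...")

# The double nearest to 0.1 is therefore not exactly one tenth.
stored_exactly = Fraction(0.1) == Fraction(1, 10)
print(stored_exactly, Decimal(0.1))
```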
As several people have pointed out, the comparison has to be made with a constant of exactly the same precision.
Generally, in any language, you can't really count on equality of float-like types. In your case, since it looks like you have more control, it does appear that 0.1 is not a float by default. You could probably confirm that by comparing sizeof(0.1) with sizeof(self.scroller.currentValue).
Convert it to a string, then compare:
NSString* numberA = [NSString stringWithFormat:@"%.6f", a];
NSString* numberB = [NSString stringWithFormat:@"%.6f", b];
return [numberA isEqualToString: numberB];
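A quick experiment shows how the string approach behaves. This Python sketch uses an f-string with the same %.6f-style format; same_when_formatted is a hypothetical helper, not an existing API. Formatting rounds each value to a fixed grid of six decimal places, so the outcome depends on which side of a grid boundary each value falls rather than on how far apart the values are.

```python
def same_when_formatted(a: float, b: float) -> bool:
    """Compare two numbers by their six-decimal string form (hypothetical helper)."""
    return f"{a:.6f}" == f"{b:.6f}"

# 2e-7 apart, but on opposite sides of a rounding boundary: reported unequal.
straddling = same_when_formatted(1.0000004, 1.0000006)

# 8e-7 apart, but both round to 1.000001: reported equal.
same_bucket = same_when_formatted(1.0000006, 1.0000014)

print(straddling, same_bucket)
```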
