Why this happens in floating point conversion?

I noticed that some floating-point values are converted differently. This question helped me understand floating point, but I still don't know why this happens. I added two screenshots from debug mode for the sample code. Example values: 7.37 and 9.37. I encountered this in Swift, and surely Swift uses the IEEE 754 floating-point standard. Please explain how this happens. Why do the two conversions end up differently?
if let text = textField.text {
    if let number = formatter.number(from: text) {
        return Double(number)
    }
    return nil
}

Double floating point numbers are stored in base-2, and cannot represent all decimals exactly.
In this case, 7.37 and 9.37 are rounded to the nearest floating point numbers which are 7.37000000000000010658141036401502788066864013671875 and 9.3699999999999992184029906638897955417633056640625, respectively.
Of course, such decimal representations are too unwieldy for general use, so programming languages typically print shorter approximate decimal representations. Two popular choices are:
The shortest string that will be correctly rounded to the original number (which in this case are 7.37 and 9.37, respectively).
The value rounded to 17 significant digits, which is guaranteed to give the correct value when converting back to binary.
These appear to correspond to the 2 debug output values that you are seeing.
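The three renderings mentioned above can be reproduced outside Swift. A minimal sketch in Python, whose float is the same IEEE 754 binary64 type as Swift's Double; the helper name `renderings` is made up for illustration:

```python
from decimal import Decimal

def renderings(x: float) -> tuple[str, str, str]:
    """Return (exact stored value, shortest round-trip string, 17 significant digits)."""
    exact = str(Decimal(x))        # exact decimal expansion of the stored binary value
    shortest = repr(x)             # shortest string that converts back to the same double
    seventeen = format(x, ".17g")  # value rounded to 17 significant digits
    return exact, shortest, seventeen

for value in (7.37, 9.37):
    print(renderings(value))
```

For 7.37 the 17-digit form is 7.3700000000000001 and for 9.37 it is 9.3699999999999992, while the shortest form is 7.37 and 9.37 in both cases.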

Related

TI Basic Numeric Standard

Do numeric variables follow a documented standard on TI calculators?
I was really surprised to notice on my TI 83 Premium CE that this test actually returns true (i.e. 1):
0.1 -> X
0.1 -> Y
0.01 -> Z
X*Y=Z
I was expecting this to fail, assuming my calculator would use something like the IEEE 754 standard to represent floating-point numbers.
On the other hand, calculating 2^50+3-2^50 returns 0, which suggests that large integers do follow such a standard: the big number evidently has a limited mantissa.

TI-BASIC's = is a tolerant comparison
Try 1+10^-12=1 on your calculator. Those numbers aren't represented equally (1+10^-12-1 gives 1E-12), but you'll notice the comparison returns true: that's because = has a certain amount of tolerance. AFAICT from testing on my calculator, if the numbers are equal when rounded to ten significant digits, = will return true.
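The rule guessed at above (equal when rounded to ten significant digits) can be sketched as follows. This is only an illustration of that guess in Python, not TI's actual implementation, and `ti_equal` is a made-up name:

```python
def ti_equal(a: float, b: float, digits: int = 10) -> bool:
    """Compare two numbers after rounding each to `digits` significant digits."""
    fmt = f".{digits - 1}e"  # scientific notation keeps exactly `digits` significant digits
    return float(format(float(a), fmt)) == float(format(float(b), fmt))

print(ti_equal(1 + 1e-12, 1))   # differs only in the 13th digit
print(ti_equal(1 + 1e-9, 1))    # differs in the 10th digit
```

Under this rule the first comparison is true and the second is false, which matches the behaviour described for 1+10^-12=1.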
Secondarily,
TI-BASIC uses a proprietary BCD float format
TI floats are a BCD format that is nine bytes long, with one byte for sign and auxiliary information and 14 digits (7 bytes) of precision. The ninth byte is used for extra precision so numbers can be rounded properly.
See a source linked to by @doynax here for more information.
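Both observations from the question are what a 14-digit decimal format would produce. A rough sketch using Python's decimal module with 14 significant digits as a stand-in; this is an assumption for illustration only and does not model TI's actual byte layout or any extra guard digits:

```python
from decimal import Context, Decimal

TI_LIKE = Context(prec=14)  # assumption: 14 significant decimal digits

def tenth_squared_is_hundredth() -> bool:
    """0.1 is exact in decimal, so 0.1 * 0.1 is exactly 0.01."""
    x = Decimal("0.1")
    return TI_LIKE.multiply(x, x) == Decimal("0.01")

def big_plus_three_minus_big() -> Decimal:
    """2^50 has 16 digits, so adding 3 is lost when rounding to 14 digits."""
    big = TI_LIKE.power(Decimal(2), Decimal(50))
    return TI_LIKE.subtract(TI_LIKE.add(big, Decimal(3)), big)

print(tenth_squared_is_hundredth(), big_plus_three_minus_big())
```

In binary floating point the first test fails even without a tolerant comparison (0.1 * 0.1 is not exactly 0.01 there), which is why the decimal format alone already explains the surprise.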

How doubles truncate in swift in case of overflow

I know that Swift's Double values have about 15 significant decimal digits of precision, so I took a variable
let pi: Double = 3.1415926535897932384
and REPL returned me
pi: Double = 3.1415926535897931
I can clearly see that the REPL has rounded the trailing 32384 to 31 (in case of overflow). So, is it following the standard mathematical rule for rounding, or something else?

This behavior has to do with how floating-point values are represented in binary. The conversion does not round to the nearest decimal representation; it rounds to the nearest representable binary value, and that value is then printed in decimal.
// test this in a playground
9.05 // returns 9.050000000000001
In general, you shouldn't rely on the last printed digit of a Double.
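The same effect can be checked in any language that uses IEEE 754 doubles. A small sketch in Python (its float is a binary64 value like Swift's Double); `stored_digits` is a hypothetical helper name:

```python
import math

def stored_digits(literal: str, digits: int = 17) -> str:
    """Parse a decimal literal to the nearest double, then print it with `digits` significant digits."""
    return format(float(literal), f".{digits}g")

print(stored_digits("3.1415926535897932384"))  # the literal from the question
print(stored_digits("9.05", 16))               # the playground example
print(float("3.1415926535897932384") == math.pi)
```

The long literal is stored as the nearest double, which prints as 3.1415926535897931 at 17 significant digits, so the trailing 31 is the decimal rendering of the nearest binary value rather than a decimal rounding of 32384.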

Calculating floating points as binary

The question is :
x and y are two floating point numbers in 32-bit IEEE floating-point format
(8-bit exponent with bias 127) whose binary representation is as follows:
x: 1 10000001 00010100000000000000000
y: 0 10000010 00100001000000000000000
Compute their product z = x y and give the result in binary IEEE floating-point format.
So I've found out that x = -4.3125 and y = 9.03125. I can multiply them and get -38.947265625, but I don't know how to show that in IEEE format. Thanks in advance for the help.

I agree with the comment that it should be done in binary, rather than by conversion to decimal and decimal multiplication. I used Exploring Binary to do the arithmetic.
The first step is to find the actual binary significands. Neither input is subnormal, so they are 1.000101 and 1.00100001.
Multiply them, getting 1.00110111100101.
Similarly, subtract the bias, binary 1111111, from the exponents, getting 10 and 11. Add those, getting 101, then add back the bias, 10000100.
The sign bit for multiplying two numbers with different sign bits will be 1.
Now pack it all back together. The significand came out in the [1,2) range, so there is no need to normalize and adjust the exponent. We are still in the normal range, so drop the 1 before the binary point in the significand. The significand is narrow enough to fit without rounding; just add enough trailing zeros.
1 10000100 00110111100101000000000
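The result can be cross-checked by letting the machine do the packing. A short sketch using Python's struct module; the helper names `bits_to_float` and `float_to_bits` are made up for this example:

```python
import struct

def bits_to_float(bits: str) -> float:
    """Interpret a 'sign exponent fraction' bit string as an IEEE 754 single."""
    raw = int(bits.replace(" ", ""), 2).to_bytes(4, "big")
    return struct.unpack(">f", raw)[0]

def float_to_bits(value: float) -> str:
    """Encode a value as an IEEE 754 single and format it as 'sign exponent fraction'."""
    (n,) = struct.unpack(">I", struct.pack(">f", value))
    s = f"{n:032b}"
    return f"{s[0]} {s[1:9]} {s[9:]}"

x = bits_to_float("1 10000001 00010100000000000000000")
y = bits_to_float("0 10000010 00100001000000000000000")
print(x, y, float_to_bits(x * y))
```

This decodes x and y to -4.3125 and 9.03125 and encodes their product as 1 10000100 00110111100101000000000, in agreement with the hand calculation.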

You've made it harder by converting to decimal, because you then have to convert the result back. It's not that it can't be done that way, but it's harder by hand.
Without converting, the algorithm to multiply two floats is (roughly) this:
put the implicit 1 back (if applicable)
multiply, to full size (don't truncate) (you can get away with using just Guard and Sticky, if you know how they work)
add the exponents
xor the signs
normalize/round/handle special cases (under-/overflow)
So here, multiply (look up how binary multiply worked if you forgot)
1.00010100000000000000000 *
1.00100001000000000000000 =
1.00100001000000000000000 +
0.000100100001000000000000000 +
0.00000100100001000000000000000 =
1.00110111100101000000000000000
Add exponents (mind the bias): 2+3 = 5 in this case, so the biased exponent is 5+127 = 132 = 10000100.
Xor the signs, get 1.
No rounding is necessary because the dropped bits are all zero anyway.
Result: 1 10000100 00110111100101000000000
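The steps listed above can be sketched on raw bit patterns with integer arithmetic. This is a simplified illustration that assumes normal inputs and a normal result (no zeros, subnormals, infinities or NaNs), with round-to-nearest-even; `fp32_multiply` is a hypothetical name:

```python
def fp32_multiply(x_bits: int, y_bits: int) -> int:
    """Multiply two normal IEEE 754 singles given as 32-bit integers."""
    sign = (x_bits >> 31) ^ (y_bits >> 31)             # xor the signs
    ex, ey = (x_bits >> 23) & 0xFF, (y_bits >> 23) & 0xFF
    mx = (x_bits & 0x7FFFFF) | 0x800000                # put the implicit 1 back
    my = (y_bits & 0x7FFFFF) | 0x800000
    product = mx * my                                  # full 48-bit product, nothing truncated
    exp = ex + ey - 127                                # add exponents, remove one bias
    shift = 23
    if product >> 47:                                  # significand landed in [2,4): normalize
        shift, exp = 24, exp + 1
    kept = product >> shift
    remainder = product & ((1 << shift) - 1)
    half = 1 << (shift - 1)
    if remainder > half or (remainder == half and kept & 1):
        kept += 1                                      # round to nearest, ties to even
        if kept >> 24:                                 # rounding carried out of 24 bits
            kept, exp = kept >> 1, exp + 1
    if not 0 < exp < 255:
        raise OverflowError("result outside the normal range; not handled in this sketch")
    return (sign << 31) | (exp << 23) | (kept & 0x7FFFFF)

z = fp32_multiply(0b1_10000001_00010100000000000000000,
                  0b0_10000010_00100001000000000000000)
print(f"{z:032b}")
```

For the x and y in the question the remainder is zero, so the rounding branch is never taken, which is the "no rounding is necessary" remark above.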

How to stop matlab truncating long numbers

These two long numbers are the same except for the last digit.
test = [];
test(1) = 33777100285870080;
test(2) = 33777100285870082;
but the last digit is lost when the numbers are put in the array:
unique(test)
ans = 3.3777e+16
How can I prevent this? The numbers are ID codes and losing the last digit is screwing everything up.

Matlab uses 64-bit floating point representation by default for numbers. Those have a base-10 16-digit precision (more or less) and your numbers seem to exceed that.
Use something like uint64 to store your numbers:
> test = [uint64(33777100285870080); uint64(33777100285870082)];
> disp(test(1));
33777100285870080
> disp(test(2));
33777100285870082
This is really a rounding error, not a display error. To get the correct strings for output purposes, use int2str, because, again, num2str uses a 64-bit floating point representation, and that has rounding errors in this case.

To add more explanation to @rubenvb's solution: your values are greater than flintmax for IEEE 754 double-precision floating point, i.e., greater than 2^53. Beyond this point not all integers can be exactly represented as doubles. See also this related question.
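The 2^53 limit is a property of the double format, not of MATLAB, so it can be illustrated anywhere. A minimal sketch in Python, where `survives_double` is a made-up helper that checks whether an integer round-trips through a double:

```python
FLINTMAX = 2 ** 53  # same value MATLAB reports as flintmax for doubles

def survives_double(n: int) -> bool:
    """True if the integer n is exactly representable as an IEEE 754 double."""
    return int(float(n)) == n

a, b = 33777100285870080, 33777100285870082
print(a > FLINTMAX, survives_double(a), survives_double(b))
print(float(a) == float(b))  # both IDs collapse to the same double
```

Both IDs are above 2^53, and at that magnitude only every fourth integer is representable, so the second ID is rounded onto the first. Storing IDs as integers (or as strings) avoids the collision.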

mod() operation weird behavior

I use mod() to check whether a number's hundredths digit (the 0.01 place) is 2 or not.
if mod(5.02*100, 10) == 2
...
end
The result is that mod(5.02*100, 10) == 2 returns 0.
However, if I use mod(1.02*100, 10) == 2 or mod(20.02*100, 10) == 2, it returns 1.
The result of mod(5.02*100, 10) - 2 is
ans =
-5.6843e-14
Could this be a bug in MATLAB?
The version I used is R2013a. version 8.1.0

This is not a bug in MATLAB. It is a limitation of floating point arithmetic and conversion between binary and decimal numbers. Even a simple decimal number such as 0.1 cannot be exactly represented as a binary floating point number with finite precision.
Computer floating point arithmetic is typically not exact. Although we are used to dealing with numbers in decimal format (base 10), computers store and process numbers in binary format (base 2). The IEEE standard for double precision floating point representation (see http://en.wikipedia.org/wiki/Double-precision_floating-point_format, what MATLAB uses) specifies the use of 64 bits to represent a binary number. 1 bit is used for the sign, 52 bits are used for the mantissa (the actual digits of the number), and 11 bits are used for the exponent (which specifies where the binary point goes).
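To make the 1/11/52 bit split concrete, here is a small sketch that pulls the three fields out of a double using Python's struct module. The function name `double_fields` is invented for this example:

```python
import struct

def double_fields(x: float) -> tuple[int, int, int]:
    """Return (sign bit, 11-bit biased exponent, 52-bit mantissa field) of a double."""
    (n,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = n >> 63
    exponent = (n >> 52) & 0x7FF        # stored with a bias of 1023
    mantissa = n & ((1 << 52) - 1)      # fraction bits after the implicit leading 1
    return sign, exponent, mantissa

sign, exponent, mantissa = double_fields(5.02)
# Rebuild the value of a normal double from its fields.
rebuilt = (-1) ** sign * (1 + mantissa / 2 ** 52) * 2.0 ** (exponent - 1023)
print(sign, exponent - 1023, f"{mantissa:052b}", rebuilt == 5.02)
```

For 5.02 the unbiased exponent is 2, which corresponds to the e2 in the binary expansion used in the worked example.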
When you enter a number into MATLAB, it is immediately converted to binary representation for all manipulations and arithmetic and then converted back to decimal for display and output.
Here's what happens in your example:
Convert to binary (keeping only up to 52 digits):
5.02 => 1.01000001010001111010111000010100011110101110000101e2
100 => 1.1001e6
10 => 1.01e3
2 => 1.0e1
Perform multiplication:
1.01000001010001111010111000010100011110101110000101 e2
x 1.1001 e6
--------------------------------------------------------------
0.000101000001010001111010111000010100011110101110000101
0.101000001010001111010111000010100011110101110000101
+ 1.01000001010001111010111000010100011110101110000101
-------------------------------------------------------------
1.111101011111111111111111111111111111111111111111111101e8
Cutting off at 52 digits gives 1.111101011111111111111111111111111111111111111111111e8
Note that this is not the same as 1.11110110e8 which would be 502.
Perform modulo operation: (there may actually be additional error here depending on what algorithm is used within the mod() function)
mod( 1.111101011111111111111111111111111111111111111111111e8, 1.01e3) = 1.111111111111111111111111111111111111111111100000000e0
The error is exactly -2^-44, which is about -5.6843e-14. The conversion between decimal and binary and the rounding due to finite precision have caused a small error. In some cases you get lucky and the rounding errors cancel out, so you still get the 'right' answer, which is why you got what you expected for mod(1.02*100, 10), but in general you cannot rely on this.
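The size of the error can be confirmed in any environment that uses IEEE 754 doubles. A quick sketch in Python, whose `%` operator on floats plays the role of mod() here (an analogy, not MATLAB's implementation); `mod_error` is a made-up name:

```python
def mod_error() -> float:
    """Difference between mod(5.02*100, 10) and the expected digit 2."""
    return (5.02 * 100) % 10 - 2

print(mod_error(), mod_error() == -2.0 ** -44)
print(5.02 * 100 == 502, 1.02 * 100 == 102)
```

The product 5.02*100 lands one representable step below 502, while 1.02*100 happens to round to exactly 102.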

To use mod() correctly to test the particular digit of a number, use round() to round it to the nearest whole number and compensate for floating point error.
mod(round(5.02*100), 10) == 2
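The same fix, sketched in Python for comparison. The function name `hundredths_digit_is` is hypothetical, and the tolerance-based variant with math.isclose is an alternative that the answer does not mention:

```python
import math

def hundredths_digit_is(x: float, digit: int) -> bool:
    """Round to a whole number first, so tiny floating point errors cannot flip the digit."""
    return round(x * 100) % 10 == digit

print(hundredths_digit_is(5.02, 2))
# Alternative: keep the floating point result but compare with a tolerance.
print(math.isclose((5.02 * 100) % 10, 2))
```

Rounding works here because the value is known to be a whole number of hundredths; for inputs with more decimal places, the digit to test would need to be defined more carefully first.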
What you're encountering is a floating point error or artifact, like the commenters say. This is not a Matlab bug; it's just how floating point values work. You'd get the same results in C or Java. Floating point values are "approximate" types, so exact equality comparisons using == without some rounding or tolerance are prone to error.
>> isequal(1.02*100, 102)
ans =
1
>> isequal(5.02*100, 502)
ans =
0
5.02 is not the only number this happens for; many other values are affected. Here's an example that picks out several of them.
x = 1.02:1000.02;
ix = mod(x .* 100, 10) ~= 2;
disp(x(ix))
To understand the details of what's going on here (and in many other situations you'll encounter working with floats), have a read through the Wikipedia entry for "floating point", or my favorite article on it, "What Every Computer Scientist Should Know About Floating-Point Arithmetic". (That title is hyperbole; this article goes deep and I don't understand half of it. But it's a great resource.) This stuff is particularly relevant to Matlab because Matlab does everything in floating point by default.
