Precision of double values in Spark - scala

I am reading some data from a CSV file, and I have custom code to parse string values into different data types. For numbers, I use:
val format = NumberFormat.getNumberInstance()
which returns a DecimalFormat, and I call parse function on that to get my numeric value. DecimalFormat has arbitrary precision, so I am not losing any precision there. However, when the data is pushed into a Spark DataFrame, it is stored using DoubleType. At this point, I am expecting to see some precision issues, however I do not. I tried entering values from 0.1, 0.01, 0.001, ..., 1e-11 in my CSV file, and when I look at the values stored in the Spark DataFrame, they are all accurately represented (i.e. not like 0.099999999). I am surprised by this behavior since I do not expect a double value to store arbitrary precision. Can anyone help me understand the magic here?
Cheers!

There are probably two issues here: the number of significant digits that a Double can represent in its mantissa; and the range of its exponent.
Roughly, a Double has about 16 (decimal) digits of precision, and the exponent can cover the range from about 10^-308 to 10^+308. (Obviously, the actual limits are set by the binary representation used by the ieee754 format.)
When you try to store a number like 1e-11, this can be accurately approximated within the 56 bits available in the mantissa. Where you'll get accuracy issues is when you want to subtract two numbers that are so close together that they only differ by a small number of the least significant bits (assuming that their mantissas have been aligned shifted so that their exponents are the same).
For example, if you try (1e20 + 2) - (1e20 + 1), you'd hope to get 1, but actually you'll get zero. This is because a Double does not have enough precision to represent the 20 (decimal) digits needed. However, (1e100 + 2e90) - (1e100 + 1e90) is computed to be almost exactly 1e90, as it should be.

Related

Checking if there are rounding issues for epoch figures

I have an array of integers (they are actually epochs) and I would like to check if they can be represented in double precision floating point without rounding issues.
So I have a large n rows by 1 column array like this:
1104757200
1104757320
1135981260
1135981560
1135982040
1135982280
1135982340
1135982580
1135982880
1135983420
1135984020
1135984140
1135984200
1135985100
1135985340
And I would like to know if they can be stored without losing precision as double precision floating point numbers.
The output could be another array -vector- with 0 or 1 depending on if the number can be represented without losing precision or not.
Any tips on how to do that check in Matlab would be welcomed.

Maximum double value (float) possible in MATLAB (64-bit)

I'm aware that double is the default data-type in MATLAB.
When you compare two double numbers that have no floating part, MATLAB is accurate upto the 17th digit place in my testing.
a=12345678901234567 ; b=12345678901234567; isequal(a,b) --> TRUE
a=123456789012345671; b=123456789012345672; isequal(a,b) --> printed as TRUE
I have found a conservative estimate to be use numbers (non-floating) upto only 13th digit as other functions can become unreliable after it (such as ismember, or the MEX functions ismembc etc).
Is there a similar cutoff for floating values? E.g., if I use shares-outstanding for a company which can be very very large with decimal places, when do I start losing decimal accuracy?
a = 1234567.89012345678 ; b = 1234567.89012345679 ; isequal(a,b) --> printed as TRUE
a = 123456789012345.678 ; b = 123456789012345.677 ; isequal(a,b) --> printed as TRUE
isequal may not be right tool to use for comparing such numbers. I'm more concerned about up to how many places should I trust my decimal values once the integer part of a number starts growing?
It's usually not a good idea to test the equality of floating-point numbers. The behavior of binary floating-point numbers can differ drastically from what you may expect from base-10 decimals. Consider the example:
>> isequal(0.1, 0.3/3)
ans =
0
Ultimately, you have 53 bits of precision. This means that integers can be represented exactly (with no loss in accuracy) up to the number 253 (which is a little over 9 x 1015). After that, well:
>> (2^53 + 1) - 2^53
ans =
0
>> 2^53 + (1 - 2^53)
ans =
1
For non-integers, you are almost never going to be representing them exactly, even for simple-looking decimals such as 0.1 (as shown in that first example). However, it still guarantees you at least 15 significant figures of precision.
This means that if you take any number and round it to the nearest number representable as a double-precision floating point, then this new number will match your original number at least up to the first 15 digits (regardless of where these digits are with respect to the decimal point).
You might want to use variable precision arithmetics (VPA) in matlab. It computes expressions exactly up to a given digit count, which may be quite large. See here.
Check out the MATLAB function flintmax which tells you the maximum consecutive integers that can be stored in either double or single precision. From that page:
flintmax returns the largest consecutive integer in IEEEĀ® double
precision, which is 2^53. Above this value, double-precision format
does not have integer precision, and not all integers can be
represented exactly.

How to stop matlab truncating long numbers

These two long numbers are the same except for the last digit.
test = [];
test(1) = 33777100285870080;
test(2) = 33777100285870082;
but the last digit is lost when the numbers are put in the array:
unique(test)
ans = 3.3777e+16
How can I prevent this? The numbers are ID codes and losing the last digit is screwing everything up.
Matlab uses 64-bit floating point representation by default for numbers. Those have a base-10 16-digit precision (more or less) and your numbers seem to exceed that.
Use something like uint64 to store your numbers:
> test = [uint64(33777100285870080); uint64(33777100285870082)];
> disp(test(1));
33777100285870080
> disp(test(2));
33777100285870082
This is really a rounding error, not a display error. To get the correct strings for output purposes, use int2str, because, again, num2str uses a 64-bit floating point representation, and that has rounding errors in this case.
To add more explanation to #rubenvb's solution, your values are greater than flintmax for IEEE 754 double precision floating-point, i.e, greater than 2^53. After this point not all integers can be exactly represented as doubles. See also this related question.

Matlab double function outputs infinity when taking a big number of type "sym"

This is literally the number I obtain (from symsum function), which is of type sym:
a=328791078344903739363762093060350430076929707044786898291940722052812676355129485878814911641516759087483581972443760841410582114920781832660013389681326267351368505696628653562484228680842650173635989588528021721039959787053654401351638478786763875479187208098871238084448485336138651690856082810553570419028927840285091142054111375001
I would like to make mathematical operations (in particular, take a natural log) on this number and so want to transform it to double, however the output from double(a) is simply "Inf". How to go about this problem and convert it from "sum" to a numeric type?
Your number is ~3.3x10335 but the largest number that can be represented by MATLAB's double precision floating point numbers is ~1.8x10308 (see the output of realmax). Converting your number to double precision causes overflow because the number is larger than can be represented so MATLAB just returns Inf.
For an exhaustive overview of floating point representations and arithmetic, you can check out this PDF.
Can you count the digits and insert a decimal point before converting to double?
If so, take advantage of the fact that the natural log of a number that overflows may not itself overflow.
Using "^" for power, you can represent your number as 3.28791078344903739363762093060350430076929707044786898291940722052812676355129485878814911641516759087483581972443760841410582114920781832660013389681326267351368505696628653562484228680842650173635989588528021721039959787053654401351638478786763875479187208098871238084448485336138651690856082810553570419028927840285091142054111375001 * (10 ^ 335).
The decimal log of (10^335) is 335. Its natural log is 335*log(10).
The natural log of the original number is:
log(3.287910783449037393637620930603504300769297070447868982919407220528)
+ 335*log(10)
All inputs, intermediate results, and the final result of this calculation are in the double range.

Matlab precion when specifying fractions

I wanted to create a vector with three values 1/6, 2/3 and 1/6. Obviously I Matlab has to convert these rational numbers into real numbers but I expected that it would maximize the precision available.
It's storing the values as doubles but it's storing them as -
b =
0.1667 0.6667 0.1667
This is a huge loss of precision. Isn't double supposed to mean 52 bits of accuracy for the fractional part of the number, why are the numbers truncated so severly?
The numbers are only displayed that way. Internally, they use full precision. You can use the format command to change display precision. For example:
format long
will display them as:
0.166666666666667 0.666666666666667 0.166666666666667
So the answer is simple; there is no loss of precision. It's only a display issue.
You can read the documentation on what other formats you can use to display numbers.
you can not store values as 1/2 or 1/4 or 1/6 in to a Double variable... these are stored as decimals behind the system; if you want to store these values , try storing it as string that would work;
Whenever you want to make mathematical calculation using these strings then convert the value into number and continue....