comparing float and double and printing them - double

I have a quick question. Say I have a really big number, up to about 15 digits, and I take the input and assign it to two variables, one float and one double. If I were to compare the two numbers, how would I compare them? I think double has precision up to about 15 digits, and float has 8? So do I simply compare them while the float only contains 8 digits and pad the rest, or do I have the float print out all 15 digits and then make the comparison? Also, if I were asked to print out the float number, is the standard way of doing it just printing it up to 8 digits, which is its max precision?

Most languages will do some form of type promotion to let you compare types that are not identical, but reasonably similar. For details, you would have to indicate what language you are referring to.
Of course, the real problem with comparing floating point numbers is that the results might be unexpected due to rounding errors. Most mathematical equivalences don't hold for floating point arithmetic, so two sequences of operations which SHOULD yield the same value might actually yield slightly different values (or even very different values if you aren't careful).
EDIT: as for printing, the "standard way" is based on what you need. If, for some reason, you are doing monetary computations in floating point, chances are that you'll only want to print 2 decimal digits.
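As a rough illustration of the promotion and rounding points above, here is a minimal Python sketch. Python has no native single-precision type, so the helper `to_float32` (a made-up name) round-trips a value through a 4-byte IEEE 754 encoding with `struct` to mimic assigning it to a `float` variable. Treat it as a sketch of the idea, not as what any particular language does internally.

```python
import math
import struct

def to_float32(x: float) -> float:
    """Round x to the nearest IEEE 754 single, returned as a Python float (a double)."""
    return struct.unpack("f", struct.pack("f", x))[0]

big = 123456789012345.0      # 15 significant digits, exact in a double
as_float = to_float32(big)   # what a 32-bit float variable would hold

# Comparing the "float" with the double promotes the float to double first;
# the digits lost when it was stored as a float do not come back.
print(as_float == big)                            # False
print(math.isclose(as_float, big, rel_tol=1e-6))  # True: equal to float precision

# Printing: 9 significant digits are enough to identify a float, 17 a double.
print(f"{as_float:.9g}")
print(f"{big:.17g}")
```

So padding the float out to 15 digits gains nothing; the practical options are to compare with a tolerance sized for float precision, or to keep both values as doubles from the start.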

Thinking in terms of digits may be a problem here. Floats can have a range from negative infinity to positive infinity. In C# for example the range is ±1.5 × 10^−45 to ±3.4 × 10^38 with a precision of 7 digits.
Also, IEEE 754 defines floats and doubles.

Your question is the right one. You want to consider your approach, though.
Whether at 32 or 64 bits, the floating-point representation is not meant to compare numbers for equality. For example, the assertion 2.0/7.0 == 60.0/210.0 may or may not be true in the CPU's view. Conceptually, the floating-point is inherently meant to be imprecise.
If you wish to compare numbers for equality, use integers. Consider again the ratios of the last paragraph. The assertion that 2*210 == 7*60 is always true -- noting that those are the integral versions of the same four numbers as before, only related using multiplication rather than division. One suspects that what you are really looking for is something like this.
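A small Python sketch of the integer approach, assuming the values being compared really are ratios of integers: compare by cross-multiplication, or use the standard `fractions.Fraction` type, so that no rounding ever takes place. The helper name `ratios_equal` is invented for illustration.

```python
from fractions import Fraction

def ratios_equal(a: int, b: int, c: int, d: int) -> bool:
    """Return True if a/b == c/d, using only exact integer arithmetic."""
    return a * d == c * b

print(ratios_equal(2, 7, 60, 210))          # True
print(Fraction(2, 7) == Fraction(60, 210))  # True
print(0.1 + 0.2 == 0.3)                     # False: binary rounding gets in the way
```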


Marie Simulator Multiplication of fractions

I have a task to use the MARIE simulator to calculate the area of a circle,
given its radius.
I know that the MARIE language has no multiplication operator, so we multiply by adding a number several times: to multiply 2*3 I could write 3+3 or 2+2+2.
But the area of a circle involves pi, which is 3.14, and I can't see how to handle that. Can anyone give me an algorithm or code for it?
Thanks in advance.
MARIE does not have floating point support.
So you should refer to your course work or ask your instructors what to do, as it is not obvious.
It is, of course, possible to do floating point in software, but the complexity is extraordinary, so it is unlikely to be what they're looking for.
You could use fixed point arithmetic, fractions, or decimal.
Here's one solution that might be appropriate: multiply one of the numbers (having decimal places) by some fixed constant factor, do the arithmetic, then interpret answers accordingly.  For example, let's use 100 as the factor, so 3.14 is represented by 314.  Let's say r is 9, so we can square that (9x9=81), then multiply 81 x 314 = 25434.  Now we know that value is 100x too large, so the real answer is 254.34.  (You can choose to ignore the .34, or, round it, then ignore.  254 is still more accurate than 243 which we would get from 9x9x3.)
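The scaling idea can be sketched in Python (not MARIE assembly), assuming multiplication is built from repeated addition as described in the question. The function names are hypothetical, and the final division by 100 would itself need a repeated-subtraction loop on MARIE.

```python
PI_SCALED = 314   # 3.14 scaled by 100
SCALE = 100

def multiply_by_adding(a: int, b: int) -> int:
    """Multiply two non-negative integers using only addition, as MARIE would."""
    total = 0
    for _ in range(b):
        total += a
    return total

def circle_area_scaled(r: int) -> int:
    """Area of a circle of integer radius r, still scaled by 100."""
    r_squared = multiply_by_adding(r, r)
    return multiply_by_adding(PI_SCALED, r_squared)

area = circle_area_scaled(9)
print(area)                         # 25434, i.e. 254.34
print(area // SCALE, area % SCALE)  # 254 34
```

With r = 9 the scaled result still fits in a signed 16-bit word; r = 11 gives 121 x 314 = 37994, which does not.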
Fixed point multiplies all numbers by the constant (usually a power of 2, so that the binary point is in the same bit position).  Additions are relatively straightforward, but multiplications need to interpret results by factoring in (or out) that both sources are scaled, meaning the answer is doubly scaled.
If you need to measure radius also with decimal digits, e.g. 9.5, then you could scale both 9.5 and 3.14 by 100.  Then we need 950x950, and multiply by 314.  The answer will be 100x100x100 too large, so 1000000x too large.  With this approach, 16 bits that MARIE offers will overflow, so you would need to use at least 32-bit arithmetic (not trivial on 16-bit machine).
You can use two different scaling factors, e.g. 9.5 as 95 and 3.14 as 314.  Take 95x95x314, which is 10000x too large, and interpret the answer accordingly.  Still, this will overflow MARIE's 16 bits.
Fractions would maintain both a numerator and denominator for all numbers.  So, 3.14 could be 314/100, and 9.5 could be 95/10, which simplify to 157/50 and 19/2.  To add you have to find a common denominator, convert, then sum numerators.  To multiply you multiply both numerators and denominators: numerator = 19x19x157, denominator = 2x2x50.  This just fits in 16-bit unsigned arithmetic, but still overflows 16-bit signed arithmetic.
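The fraction bookkeeping above can be checked with a short Python sketch. It uses the standard `fractions.Fraction` type purely to verify the arithmetic and the 16-bit limits; on MARIE the numerator and denominator would have to be tracked by hand.

```python
from fractions import Fraction

INT16_MAX = 2**15 - 1    # largest signed 16-bit value
UINT16_MAX = 2**16 - 1   # largest unsigned 16-bit value

pi_approx = Fraction(314, 100)   # reduces to 157/50
radius = Fraction(95, 10)        # reduces to 19/2

area = pi_approx * radius * radius
print(area)                           # 56677/200
print(area.numerator <= UINT16_MAX)   # True: fits unsigned
print(area.numerator <= INT16_MAX)    # False: overflows signed
```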
And finally binary coded decimal is more like a string format, where numbers are stored one decimal digit per byte or per nibble (packed decimal).  Algorithms for addition and subtraction need to account for variable length inputs.
Big integer forms use a layout similar to binary coded decimal, but are composed of much larger elements instead of single decimal digits.
All of these approaches require some thought, and the more limitations you want to remove, the more work required.  So, I'd suggest to go back to your course to find what they really want.

Is it possible to predict when Perl's decimal/float math will be wrong? [duplicate]

In one respect, I understand that Perl's floats are inexact binary representations, which causes Perl's math to sometimes be wrong. What I don't understand is why these floats sometimes seem to give exact answers and other times not. Is it possible to predict when Perl's float math will give the wrong (i.e. inexact) answer?
For instance, in the below code, Perl's math is wrong 1 time when the subtraction is "16.12 - 15.13", wrong 2 times when the problem is "26.12 - 25.13", and wrong 20 times when the problem is "36.12 - 35.13". Furthermore, for some reason, in all of the above mentioned test cases, the result of our subtraction problem (i.e. $subtraction_problem) starts out as being wrong, but will tend to become more correct, the more we add or subtract from it (with $x). This makes no sense, why is it that the more we add to or subtract from our arithmetic problem, the more likely it becomes that the value is correct (i.e. exact)?
my $subtraction_problem = 16.12 - 15.13;
my $perl_math_failures = 0;
for (my $x = -25; $x < 25; $x++) {
    my $result = $subtraction_problem + $x;
    print "$result\n";
    # A result that prints with more than 6 characters shows rounding noise
    $perl_math_failures++ if length $result > 6;
}
print "There were $perl_math_failures perl math failures!\n";
None of this is Perl specific. See Goldberg:
Rounding Error
Squeezing infinitely many real numbers into a finite number of bits requires an approximate representation. Although there are infinitely many integers, in most programs the result of integer computations can be stored in 32 bits. In contrast, given any fixed number of bits, most calculations with real numbers will produce quantities that cannot be exactly represented using that many bits. Therefore the result of a floating-point calculation must often be rounded in order to fit back into its finite representation. This rounding error is the characteristic feature of floating-point computation. The section Relative Error and Ulps describes how it is measured.
Since most floating-point calculations have rounding error anyway, does it matter if the basic arithmetic operations introduce a little bit more rounding error than necessary? That question is a main theme throughout this section. The section Guard Digits discusses guard digits, a means of reducing the error when subtracting two nearby numbers. Guard digits were considered sufficiently important by IBM that in 1968 it added a guard digit to the double precision format in the System/360 architecture (single precision already had a guard digit), and retrofitted all existing machines in the field. Two examples are given to illustrate the utility of guard digits.
The IEEE standard goes further than just requiring the use of a guard digit. It gives an algorithm for addition, subtraction, multiplication, division and square root, and requires that implementations produce the same result as that algorithm. Thus, when a program is moved from one machine to another, the results of the basic operations will be the same in every bit if both machines support the IEEE standard. This greatly simplifies the porting of programs. Other uses of this precise specification are given in Exactly Rounded Operations.
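To make the rounding concrete, here is a short Python sketch (Python rather than Perl, but both use the same IEEE 754 doubles). It assumes, as I understand Perl's default behaviour, that numbers are converted to strings with about 15 significant digits. Under that assumption a rounding error is visible only when it survives rounding to 15 digits, which would explain why the results look "more correct" as the magnitude of the result grows. The helper names are made up.

```python
from decimal import Decimal

def exact_value(x: float) -> Decimal:
    """The exact decimal expansion of the double actually stored for x."""
    return Decimal(x)

def shown(x: float, digits: int = 15) -> str:
    """Format x with a fixed number of significant digits, like C's %.15g."""
    return format(x, f".{digits}g")

print(exact_value(0.1))      # 0.1000000000000000055511151231257827021181583404541015625
print(shown(0.1 + 0.2))      # 0.3
print(shown(0.1 + 0.2, 17))  # 0.30000000000000004

# The same subtraction as in the question, shifted to different magnitudes.
diff = 16.12 - 15.13
for offset in (-1, 0, 10):
    print(shown(diff + offset), shown(diff + offset, 17))
```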

Irrational number representation in computer

We can write a simple Rational Number class using two integers representing A/B with B != 0.
If we want to write an irrational number class (storing and computing), the first thing that comes to mind is to use floating point, which means using the IEEE 754 standard (binary fractions). This is because an irrational number must be approximated.
Is there another way to write irrational number class other than using binary fraction (whether they conserve memory space or not) ?
I studied jsbeuno's solution using Python: Irrational number representation in any programming language?
He's still using the built-in floating point to store.
This is not homework.
Thank you for your time.
By a cardinality argument, there are many more irrational numbers than rational ones (and the number of IEEE754 floating point numbers is finite, probably less than 2^64).
You can represent numbers with something else than fractions (e.g. logarithmically).
jsbeuno is storing the number as a base and a radix and using those when doing calcs with other irrational numbers; he's only using the float representation for output.
If you want to get fancier, you can define the base and the radix as rational numbers (with two integers) as described above, or make them themselves irrational numbers.
To make something thoroughly useful, though, you'll end up replicating a symbolic math package.
You can always use symbolic math, where items are stored exactly as they are and calculations are deferred until they can be performed with precision above some threshold.
For example, say you performed two operations on a rational number like 2, one to take the square root and then one to square that. With limited precision, you may get something like:
(√2)²
= 1.414213562²
= 1.999999999
However, storing symbolic math would allow you to store the result of √2 as √2 rather than an approximation of it, then realise that (√x)² is equivalent to x, removing the possibility of error.
Now that obviously involves a more complicated encoding than simple IEEE754, but it's not impossible to achieve.
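A very small Python sketch of the symbolic idea, covering only square roots of rationals. The class name `Sqrt` and its methods are invented for illustration; a real symbolic package does far more.

```python
import math
from dataclasses import dataclass
from fractions import Fraction

@dataclass(frozen=True)
class Sqrt:
    """Symbolic square root of a non-negative rational radicand."""
    radicand: Fraction

    def squared(self) -> Fraction:
        # (sqrt(x))**2 simplifies to x exactly; no approximation is involved.
        return self.radicand

    def approx(self) -> float:
        # Approximate only when a numeric value is actually needed.
        return math.sqrt(self.radicand)

root_two = Sqrt(Fraction(2))
print(root_two.squared() == 2)  # True: exact
print(math.sqrt(2) ** 2 == 2)   # False: rounding error in the float version
```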

Matlab precision: simple subtraction is not zero

I compute this simple sum on Matlab:
2*0.04-0.5*0.4^2 = -1.387778780781446e-017
but the result is not zero. What can I do?
Aabaz and Jim Clay have good explanations of what's going on.
It's often the case that, rather than exactly calculating the value of 2*0.04 - 0.5*0.4^2, what you really want is to check whether 2*0.04 and 0.5*0.4^2 differ by an amount that is small enough to be within the relevant numerical precision. If that's the case, then rather than checking whether 2*0.04 - 0.5*0.4^2 == 0, you can check whether abs(2*0.04 - 0.5*0.4^2) < thresh. Here thresh can either be some arbitrary smallish number, or an expression involving eps, which gives the precision of the numerical type you're working with.
Thanks to Jim and Tal for suggested improvement. Altered to compare the absolute value of the difference to a threshold, rather than the difference.
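The threshold test could look like the following Python sketch, where `sys.float_info.epsilon` plays the role of Matlab's eps. The function name `nearly_equal` and the factor of 4 are arbitrary choices for illustration, not a recommended universal tolerance.

```python
import sys

def nearly_equal(a: float, b: float, factor: int = 4) -> bool:
    """True if a and b differ by at most `factor` * eps, relative to their size."""
    thresh = factor * sys.float_info.epsilon * max(abs(a), abs(b))
    return abs(a - b) <= thresh

lhs = 2 * 0.04
rhs = 0.5 * 0.4 ** 2
print(lhs - rhs)               # tiny, not necessarily exactly 0
print(nearly_equal(lhs, rhs))  # True
print(nearly_equal(1.0, 1.1))  # False
```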
Matlab uses double-precision floating-point numbers to store real numbers. These are numbers of the form m*2^e where m is an integer between 2^52 and 2^53 (the mantissa) and e is the exponent. Let's call a number a floating-point number if it is of this form.
All numbers used in calculations must be floating-point numbers. Often, this can be done exactly, as with 2 and 0.5 in your expression. But for other numbers, most notably most numbers with digits after the decimal point, this is not possible, and an approximation has to be used. What happens in this case is that the number is rounded to the nearest floating-point number.
So, whenever you write something like 0.04 in Matlab, you're really saying "Get me the floating-point number that is closest to 0.04." In your expression, there are 2 numbers that need to be approximated: 0.04 and 0.4.
In addition, the exact result of operations like addition and multiplication on floating-point numbers may not be a floating-point number. Although it is always of the form m*2^e the mantissa may be too large. So you get an additional error from rounding the results of operations.
At the end of the day, a simple expression like yours will be off by about 2^-52 times the size of the operands, or about 10^-17.
In summary: the reason your expression does not evaluate to zero is two-fold:
Some of the numbers you start out with are only approximations of the exact numbers you provided.
The intermediate results may also be approximations of the exact results.
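The m*2^e form described above can be inspected directly. This Python sketch (Python doubles are the same IEEE 754 format Matlab uses) splits a positive, normal double into its integer mantissa and exponent; `decompose` is a made-up helper name.

```python
import math
from fractions import Fraction

def decompose(x: float) -> tuple:
    """Return (m, e) with x == m * 2**e and 2**52 <= m < 2**53, for normal x > 0."""
    frac, exp = math.frexp(x)   # x == frac * 2**exp, with 0.5 <= frac < 1
    return int(frac * 2**53), exp - 53

m, e = decompose(0.04)
print(m, e)
print(math.ldexp(m, e) == 0.04)            # True: the stored value is recovered
print(Fraction(0.5) == Fraction(1, 2))     # True: 0.5 is stored exactly
print(Fraction(0.04) == Fraction(4, 100))  # False: 0.04 is only approximated
```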
What you are seeing is quantization error. Matlab uses doubles to represent numbers, and while they are capable of a lot of precision, they still cannot represent all real numbers because there are an infinite number of real numbers. I'm not sure about Aabaz's trick, but in general I would say there isn't anything you can do, other than perhaps massaging your inputs to be double-friendly numbers.
I do not know if it is applicable to your problem but often the simplest solution is to scale your data.
For example:
c=a/min(abs([a b]));
d=b/min(abs([a b]));
EDIT: of course I did not mean to give a universal solution to this kind of problem, but it is still good practice that can help you avoid a few problems in numerical computation (curve fitting, etc.). See Jim Clay's answer for the reason why you are experiencing these problems.
I'm pretty sure this is a case of ye olde floating point accuracy issues.
Do you need 1e-17 accuracy? Is this merely a case of wanting 'pretty' output?
In that case, you can just use a formatted sprintf to display the number of significant digits you want.
Realize that this is not a Matlab problem, but a fundamental limitation of how numbers are represented in binary.
For fun, work out what .1 is in binary...

Decimal vs Double Speed

I write financial applications where I constantly battle the decision to use a double vs using a decimal.
All of my math works on numbers with no more than 5 decimal places and are not larger than ~100,000. I have a feeling that all of these can be represented as doubles anyways without rounding error, but have never been sure.
I would go ahead and make the switch from decimals to doubles for the obvious speed advantage, except that at the end of the day, I still use the ToString method to transmit prices to exchanges, and need to make sure it always outputs the number I expect. (89.99 instead of 89.99000000001)
Is the speed advantage really as large as naive tests suggest? (~100 times)
Is there a way to guarantee the output from ToString to be what I want? Is this assured by the fact that my number is always representable?
UPDATE: I have to process ~ 10 billion price updates before my app can run, and I have implemented with decimal right now for the obvious protective reasons, but it takes ~3 hours just to turn on, doubles would dramatically reduce my turn on time. Is there a safe way to do it with doubles?
Floating point arithmetic will almost always be significantly faster because it is supported directly by the hardware. So far almost no widely used hardware supports decimal arithmetic (although this is changing, see comments).
Financial applications should always use decimal numbers. The number of horror stories stemming from using floating point in financial applications is endless; you should be able to find many such examples with a Google search.
While decimal arithmetic may be significantly slower than floating point arithmetic, unless you are spending a significant amount of time processing decimal data the impact on your program is likely to be negligible. As always, do the appropriate profiling before you start worrying about the difference.
There are two separable issues here. One is whether the double has enough precision to hold all the bits you need, and the other is whether it can represent your numbers exactly.
As for the exact representation, you are right to be cautious, because an exact decimal fraction like 1/10 has no exact binary counterpart. However, if you know that you only need 5 decimal digits of precision, you can use scaled arithmetic in which you operate on numbers multiplied by 10^5. So for example if you want to represent 23.7205 exactly you represent it as 2372050.
Let's see if there is enough precision: double precision gives you 53 bits of precision.
This is equivalent to 15+ decimal digits of precision. So this would allow you five digits after the decimal point and 10 digits before the decimal point, which seems ample for your application.
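A quick Python check of that precision argument, assuming the bounds from the question (values up to about 100,000 with 5 decimal places, scaled by 10^5):

```python
# Largest value in the question, scaled so that it becomes an integer.
largest_scaled = 100_000 * 10**5
print(largest_scaled < 2**53)            # True: well inside the exact-integer range

# Every integer up to 2**53 is exactly representable in a double...
print(float(2**53) == 2**53)             # True
# ...but beyond that, neighbouring integers start to collide.
print(float(2**53 + 1) == float(2**53))  # True
```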
I would put this C code in a .h file:
#include <math.h>   /* floor */
#include <stdio.h>  /* FILE, fprintf */
typedef double scaled_int;
#define SCALE_FACTOR 1.0e5 /* 10^(number of digits needed after decimal point) */
static inline scaled_int adds(scaled_int x, scaled_int y) { return x + y; }
static inline scaled_int muls(scaled_int x, scaled_int y) { return x * y / SCALE_FACTOR; }
static inline scaled_int scaled_of_int(int x) { return (scaled_int) x * SCALE_FACTOR; }
static inline int intpart_of_scaled(scaled_int x) { return (int) floor(x / SCALE_FACTOR); }
static inline int fraction_of_scaled(scaled_int x) { return (int) (x - SCALE_FACTOR * intpart_of_scaled(x)); }
static inline void fprint_scaled(FILE *out, scaled_int x) {
    fprintf(out, "%d.%05d", intpart_of_scaled(x), fraction_of_scaled(x));
}
There are probably a few rough spots but that should be enough to get you started.
No overhead for addition, cost of a multiply or divide doubles.
If you have access to C99, you can also try scaled integer arithmetic using the int64_t 64-bit integer type. Which is faster will depend on your hardware platform.
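For comparison, here is a sketch of the scaled-integer variant in Python, where plain `int` stands in for `int64_t`. The function names are hypothetical, and truncating the product in `mul_scaled` is just one possible rounding policy; a real pricing system would need to choose its own.

```python
from decimal import Decimal

SCALE = 10**5   # five decimal places, as in the question

def to_scaled(text: str) -> int:
    """Parse a decimal string such as '89.99' into an integer count of 1e-5 units."""
    scaled = Decimal(text) * SCALE
    if scaled != scaled.to_integral_value():
        raise ValueError(f"{text!r} has more than 5 decimal places")
    return int(scaled)

def mul_scaled(a: int, b: int) -> int:
    """Multiply two scaled values; the product is doubly scaled, so divide once."""
    return (a * b) // SCALE   # floor division: rounds toward negative infinity

def format_scaled(value: int) -> str:
    """Render a scaled integer back as a fixed 5-decimal string."""
    sign = "-" if value < 0 else ""
    whole, frac = divmod(abs(value), SCALE)
    return f"{sign}{whole}.{frac:05d}"

price = to_scaled("89.99")
print(price)                                     # 8999000
print(format_scaled(price))                      # 89.99000
print(format_scaled(price + to_scaled("0.01")))  # 90.00000
print(format_scaled(mul_scaled(to_scaled("2.5"), to_scaled("4"))))  # 10.00000
```

Because every value is an integer, addition is exact and the printed string never picks up stray digits such as 89.99000000001.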
Always use Decimal for any financial calculations or you will be forever chasing 1-cent rounding errors.
Yes; software arithmetic really is 100 times slower than hardware. Or, at least, it is a lot slower, and a factor of 100, give or take an order of magnitude, is about right. Back in the bad old days when you could not assume that every 80386 had an 80387 floating-point co-processor, then you had software simulation of binary floating point too, and that was slow.
No; you are living in a fantasy land if you think that a pure binary floating point can ever exactly represent all decimal numbers. Binary numbers can combine halves, quarters, eighths, etc, but since an exact decimal of 0.01 requires two factors of one fifth and one factor of one quarter (1/100 = (1/4)*(1/5)*(1/5)) and since one fifth has no exact representation in binary, you cannot exactly represent all decimal values with binary values (because 0.01 is a counter-example which cannot be represented exactly, but is representative of a huge class of decimal numbers that cannot be represented exactly).
So, you have to decide whether you can deal with the rounding before you call ToString() or whether you need to find some other mechanism that will deal with rounding your results as they are converted to a string. Or you can continue to use decimal arithmetic since it will remain accurate, and it will get faster once machines are released that support the new IEEE 754 decimal arithmetic in hardware.
Obligatory cross-reference: What Every Computer Scientist Should Know About Floating-Point Arithmetic. That's one of many possible URLs.
Information on decimal arithmetic and the new IEEE 754:2008 standard at this Speleotrove site.
Just use a long and multiply by a power of 10. After you're done, divide by the same power of 10.
Decimals should always be used for financial calculations. The size of the numbers isn't important.
The easiest way for me to explain is via some C# code.
double one = 3.05;
double two = 0.05;
System.Console.WriteLine((one + two) == 3.1);
That bit of code will print out False even though 3.1 is equal to 3.1...
Same thing...but using decimal:
decimal one = 3.05m;
decimal two = 0.05m;
System.Console.WriteLine((one + two) == 3.1m);
This will now print out True!
If you want to avoid this sort of issue, I recommend you stick with decimals.
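The same comparison can be reproduced outside C#. This Python sketch uses the built-in float (an IEEE 754 double) and the standard `decimal.Decimal` type as a stand-in for C#'s decimal; the last line shows that rounding to a fixed number of places at output time hides the binary error in the printed string.

```python
from decimal import Decimal

print(3.05 + 0.05 == 3.1)                                   # False
print(Decimal("3.05") + Decimal("0.05") == Decimal("3.1"))  # True
print(f"{3.05 + 0.05:.2f}")                                 # 3.10
```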
I refer you to my answer given to this question.
Use a long, store the smallest amount you need to track, and display the values accordingly.