If I want -3 in binary, I can use signed bit, or ones complement, or two's complement, correct?
However, when is it appropriate to use a method? and why?
Let's just use 8 bits:
-3 with signed is just 10000011
-3 with ones is just 11111100
-3 with twos is just 11111101
Wikipedia neatly summarizes the benefits of two's complement:
The two's-complement system has the
advantage of not requiring that the
addition and subtraction circuitry
examine the signs of the operands to
determine whether to add or subtract.
This property makes the system both
simpler to implement and capable of
easily handling higher precision
arithmetic. Also, zero has only a
single representation, obviating the
subtleties associated with negative
zero, which exists in ones'-complement
systems.
Related
I have a program which calculates probability values
(p-values),
but it is entering a very large negative number into the
exp function
exp(-626294.830) which evaluates to zero instead of the very small
positive number that it should be.
How can I get this to evaluate as a very small floating point number?
I have tried
Math::BigFloat,
bignum, and
bigrat
but all have failed.
Wolfram Alpha says that exp(-626294.830) is 4.08589×10^-271997... zero is a pretty close approximation to that ;-) Although you've edited and removed the context from your question, do you really need to work with such tiny numbers, or perhaps there is some way you could optimize your algorithm or scale your numbers?
Anyway, you are correct that code like Math::BigFloat->new("-626294.830")->bexp seems to take quite some time, even with the support of use Math::BigFloat lib => 'GMP';.
The only alternative I can offer at the moment is Math::Prime::Util::GMP's expreal, although you need to specify a precision to it.
use Math::Prime::Util::GMP qw/expreal/;
use Math::BigFloat;
my $e = Math::BigFloat->new(expreal(-626294.830,272000));
print $e->bnstr,"\n";
__END__
4.086e-271997
But on my machine, even that still takes ~20s to run, which brings us back to the question of potential optimization in other places.
Floating point numbers do not have infinite precision. Assuming the number is represented as an IEEE 754 double, we have 52 bits for a fraction, 11 bits for the exponent, and one bit for the sign. Due to the way exponents are encoded, the smallest positive number that can be represented is 2^-1022.
If we look at your number e^-626294.830, we can do a change of base and see that it equals 2^(log_2 e · -626294.830) = 2^-903552.445, which is significantly smaller than 2^-1022. Approximating your number as zero is therefore correct.
Instead of calculating this value using arbitrary-precision numerics, you are likely better off solving the necessary equations by hand, then coding this in a way that does not require extreme precision. For example, it is unlikely that you need the exact value of e^-626294.830, but perhaps just the magnitude. Then, you can calculate the logarithm instead of using exp().
I'm new to Swift and is trying to learn the concept of "shifting behavior for signed integers". I saw this example from "the swift programming language 2.1".
My question is: Why is the sign bit included in the calculation as well?
I experienced with several number combinations and they all works, but I don't seem to get reasons behind including sign bit in the calculation.
To add -1 to -4, simply by performing a standard binary addition of
all eight bits (including the sign bit), and discarding anything that
doesn't fit in the eight bits once you are done:
This is not unique to Swift. Nevertheless, historically, computers didn't subtract, they just added. To calculate A-B, do a two's-complement of B and add it to A. The two's-complement (negate all the bits and add 1) means that the highest order bit will be 0 for positive numbers (and zero), or 1 for negative numbers.
This is called 2's compliment math and allows both addition and subtraction to be done with simple operations. This is the way many hardware platforms implement arithmetic. You should look more into 2's compliment arithmetic of you want a deeper understanding.
There are some great docs out there describing the nature of floating point numbers and why precision must necessarily be lost in some floating point operations. E.g.,:
http://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html
I'm interested in understanding which floating point operations are lossless in terms of the precision of the underlying number. For instance,
$x = 2.0/3.0; # this is an operation which will lose some precision
But will this lose precision?
$x = 473.25 / 1000000;
or this?
$x = 0.3429 + 0.2939201;
What general rules of thumb can one use to know when precision will be lost and when it will be retained?
In general most operations will introduce some precision loss. The exceptions are values that are exactly representable in floating point (mantissa is expressible as a non-repeating binary). When applying arithmetic, both operands and the result would have to be expressible without precision loss.
You can construct arbitrary examples that have no precision loss, but the number of such examples is small compared with the domain. For example:
1.0f / 8.0f = 0.125f
Put another way, the mantissa must be expressible as A/B where B is a power of 2, and (as pointed out by #ysth) both A and B are smaller than some upper bound which is dictated by the total number of bits available for the mantissa in the representation.
I don't think you'll find any useful rules of thumb. Generally speaking, precision will not be lost if both the following are true:
Each operand has an exact representation in floating point
The (mathematical) result has an exact representation in floating point
In your second example, 473.25 / 1000000, each operand has an exact floating point representation, but the quotient does not. (Any rational number that has a factor of 5 in the denominator has a repeating expansion in base 2 and hence no exact representation in floating point.) The above rules are not particularly useful because you cannot tell ahead of time whether the result is going to be exact just by looking at the operands; you need to also know the result.
Your question makes a big assumption.
The issue isn't whether certain operations are lossless, it's whether storing numbers in floating point at all will be without loss. For example, take this number from your example:
$x = .3429;
sprintf "%.20f", $x;
Outputs:
0.34289999999999998000
So to answer your question, some operations might be lossless in specific cases. However, it depends on both the original numbers and the result, so should never be counted upon.
For more information, read perlnumber.
I have a quick question. So, say I have a really big number up to like 15 digits, and I would take the input and assign it to two variables, one float and one double if I were to compare two numbers, how would you compare them? I think double has the precision up to like 15 digits? and float has 8? So, do I simply compare them while the float only contains 8 digits and pad the rest or do I have the float to print out all 15 digits and then make the comparison? Also, if I were asked to print out the float number, is the standard way of doing it is just printing it up to 8 digits? which is its max precision
thanks
Most languages will do some form of type promotion to let you compare types that are not identical, but reasonably similar. For details, you would have to indicate what language you are referring to.
Of course, the real problem with comparing floating point numbers is that the results might be unexpected due to rounding errors. Most mathematical equivalences don't hold for floating point artihmetic, so two sequences of operations which SHOULD yield the same value might actually yield slightly different values (or even very different values if you aren't careful).
EDIT: as for printing, the "standard way" is based on what you need. If, for some reason, you are doing monetary computations in floating point, chances are that you'll only want to print 2 decimal digits.
Thinking in terms of digits may be a problem here. Floats can have a range from negative infinity to positive infinity. In C# for example the range is ±1.5 × 10^−45 to ±3.4 × 10^38 with a precision of 7 digits.
Also, IEEE 754 defines floats and doubles.
Here is a link that might help http://en.wikipedia.org/wiki/IEEE_floating_point
Your question is the right one. You want to consider your approach, though.
Whether at 32 or 64 bits, the floating-point representation is not meant to compare numbers for equality. For example, the assertion 2.0/7.0 == 60.0/210.0 may or may not be true in the CPU's view. Conceptually, the floating-point is inherently meant to be imprecise.
If you wish to compare numbers for equality, use integers. Consider again the ratios of the last paragraph. The assertion that 2*210 == 7*60 is always true -- noting that those are the integral versions of the same four numbers as before, only related using multiplication rather than division. One suspects that what you are really looking for is something like this.
We can write a simple Rational Number class using two integers representing A/B with B != 0.
If we want to represent an irrational number class (storing and computing), the first thing came to my mind is to use floating point, which means use IEEE 754 standard (binary fraction). This is because irrational number must be approximated.
Is there another way to write irrational number class other than using binary fraction (whether they conserve memory space or not) ?
I studied jsbeuno's solution using Python: Irrational number representation in any programming language?
He's still using the built-in floating point to store.
This is not homework.
Thank you for your time.
With a cardinality argument, there are much more irrational numbers than rational ones. (and the number of IEEE754 floating point numbers is finite, probably less than 2^64).
You can represent numbers with something else than fractions (e.g. logarithmically).
jsbeuno is storing the number as a base and a radix and using those when doing calcs with other irrational numbers; he's only using the float representation for output.
If you want to get fancier, you can define the base and the radix as rational numbers (with two integers) as described above, or make them themselves irrational numbers.
To make something thoroughly useful, though, you'll end up replicating a symbolic math package.
You can always use symbolic math, where items are stored exactly as they are and calculations are deferred until they can be performed with precision above some threshold.
For example, say you performed two operations on a non-irrational number like 2, one to take the square root and then one to square that. With limited precision, you may get something like:
(√2)²
= 1.414213562²
= 1.999999999
However, storing symbolic math would allow you to store the result of √2 as √2 rather than an approximation of it, then realise that (√x)² is equivalent to x, removing the possibility of error.
Now that obviously involves a more complicated encoding that simple IEEE754 but it's not impossible to achieve.