I have two numbers in fixed point a(Q2n1,2m1) and b(Qn1,m1). to subtract these numbers a should have the same exponent size of b. now how to subtract them. I tried shifting a by m1 times to the right then a becomes (Q2n1,m1), but won't it affect the precision?
Is my approach correct?
regards, phani tej
Yes, the approach is right. Nevertheless you need only a precision of only m1, which is enough, 2m1 precision is not required . And also you can only subtract if both the exponents are of the same size.
Related
I have a task to use Marie Simulator to calculate the area of a circle
requiring its radius
I know that in Marie Language there is no multiplication operator so we use multiplication by adding numbers several times so If I wanted to multiply 2*3 I could write it down like 3+3 or 2+2+2
but when using the area of a circle there is pi which is 3.14 I can't imagine how could I get it so can anyone give me the algorithm or code for that ?
thanks in advance.
MARIE does not have floating point support.
So, should refer to your course work or ask your instructors what to do, as it is not obvious.
It is, of course, possible to do floating point in software, but the complexity is extraordinary, so unlikely to be what the're looking for.
You could use fixed point arithmetic, fractions, or decimal.
Here's one solution that might be appropriate: multiply one of the numbers (having decimal places) by some fixed constant factor, do the arithmetic, then interpret answers accordingly. For example, let's use 100 as the factor, so 3.14 is represented by 314. Let's say r is 9, so we can square that (9x9=81), then multiply 81 x 314 = 25434. Now we know that value is 100x too large, so the real answer is 254.34. (You can choose to ignore the .34, or, round it, then ignore. 254 is still more accurate than 243 which we would get from 9x9x3.)
Fixed point multiplies all numbers by the constant (usually a power of 2, so that the binary point is in the same bit position). Additions are relatively straightforward, but multiplications need to interpret results by factoring in (or out) that both sources are in scaled, meaning the answer is doubly scaled.
If you need to measure radius also with decimal digits, e.g. 9.5, then you could scale both 9.5 and 3.14 by 100. Then we need 950x950, and multiply by 314. The answer will be 100x100x100 too large, so 1000000x too large. With this approach, 16 bits that MARIE offers will overflow, so you would need to use at least 32-bit arithmetic (not trivial on 16-bit machine).
You can use two different scaling factors, e.g. 9.5 as 95 and 3.14 as 314. Take 95x95x314, is 10000x too large, so interpret the answer accordingly. Still this will overflow MARIE's 16-bits
Fractions would maintain both a numerator and denominator for all numbers. So, 3.14 could be 314/100, and 9.5 could be 95/10 — and simplified 157/50 and 19/2. To add you have to find a common denominator, convert, then sum numerators. To multiply you multiply both numerators and denominators: numerator = 19x19x157, denominator = 2x2x50. Just fits in 16-bit unsigned arithmetic, but still overflows 16-bit signed arithmetic..
And finally binary coded decimal is more like a string format, where numbers are stored one decimal digit per byte or per nibble (packed decimal). Algorithms for addition and subtraction need to account for variable length inputs.
Big integer forms also use similar to binary coded decimal but compose much larger elements instead of single decimal digits.
All of these approaches require some thought, and the more limitations you want to remove, the more work required. So, I'd suggest to go back to your course to find what they really want.
I have a program which calculates probability values
(p-values),
but it is entering a very large negative number into the
exp function
exp(-626294.830) which evaluates to zero instead of the very small
positive number that it should be.
How can I get this to evaluate as a very small floating point number?
I have tried
Math::BigFloat,
bignum, and
bigrat
but all have failed.
Wolfram Alpha says that exp(-626294.830) is 4.08589×10^-271997... zero is a pretty close approximation to that ;-) Although you've edited and removed the context from your question, do you really need to work with such tiny numbers, or perhaps there is some way you could optimize your algorithm or scale your numbers?
Anyway, you are correct that code like Math::BigFloat->new("-626294.830")->bexp seems to take quite some time, even with the support of use Math::BigFloat lib => 'GMP';.
The only alternative I can offer at the moment is Math::Prime::Util::GMP's expreal, although you need to specify a precision to it.
use Math::Prime::Util::GMP qw/expreal/;
use Math::BigFloat;
my $e = Math::BigFloat->new(expreal(-626294.830,272000));
print $e->bnstr,"\n";
__END__
4.086e-271997
But on my machine, even that still takes ~20s to run, which brings us back to the question of potential optimization in other places.
Floating point numbers do not have infinite precision. Assuming the number is represented as an IEEE 754 double, we have 52 bits for a fraction, 11 bits for the exponent, and one bit for the sign. Due to the way exponents are encoded, the smallest positive number that can be represented is 2^-1022.
If we look at your number e^-626294.830, we can do a change of base and see that it equals 2^(log_2 e · -626294.830) = 2^-903552.445, which is significantly smaller than 2^-1022. Approximating your number as zero is therefore correct.
Instead of calculating this value using arbitrary-precision numerics, you are likely better off solving the necessary equations by hand, then coding this in a way that does not require extreme precision. For example, it is unlikely that you need the exact value of e^-626294.830, but perhaps just the magnitude. Then, you can calculate the logarithm instead of using exp().
I compute this simple sum on Matlab:
2*0.04-0.5*0.4^2 = -1.387778780781446e-017
but the result is not zero. What can I do?
Aabaz and Jim Clay have good explanations of what's going on.
It's often the case that, rather than exactly calculating the value of 2*0.04 - 0.5*0.4^2, what you really want is to check whether 2*0.04 and 0.5*0.4^2 differ by an amount that is small enough to be within the relevant numerical precision. If that's the case, than rather than checking whether 2*0.04 - 0.5*0.4^2 == 0, you can check whether abs(2*0.04 - 0.5*0.4^2) < thresh. Here thresh can either be some arbitrary smallish number, or an expression involving eps, which gives the precision of the numerical type you're working with.
EDIT:
Thanks to Jim and Tal for suggested improvement. Altered to compare the absolute value of the difference to a threshold, rather than the difference.
Matlab uses double-precision floating-point numbers to store real numbers. These are numbers of the form m*2^e where m is an integer between 2^52 and 2^53 (the mantissa) and e is the exponent. Let's call a number a floating-point number if it is of this form.
All numbers used in calculations must be floating-point numbers. Often, this can be done exactly, as with 2 and 0.5 in your expression. But for other numbers, most notably most numbers with digits after the decimal point, this is not possible, and an approximation has to be used. What happens in this case is that the number is rounded to the nearest floating-point number.
So, whenever you write something like 0.04 in Matlab, you're really saying "Get me the floating-point number that is closest to 0.04. In your expression, there are 2 numbers that need to be approximated: 0.04 and 0.4.
In addition, the exact result of operations like addition and multiplication on floating-point numbers may not be a floating-point number. Although it is always of the form m*2^e the mantissa may be too large. So you get an additional error from rounding the results of operations.
At the end of the day, a simple expression like yours will be off by about 2^-52 times the size of the operands, or about 10^-17.
In summary: the reason your expression does not evaluate to zero is two-fold:
Some of the numbers you start out with are different (approximations) to the exact numbers you provided.
The intermediate results may also be approximations of the exact results.
What you are seeing is quantization error. Matlab uses doubles to represent numbers, and while they are capable of a lot of precision, they still cannot represent all real numbers because there are an infinite number of real numbers. I'm not sure about Aabaz's trick, but in general I would say there isn't anything you can do, other than perhaps massaging your inputs to be double-friendly numbers.
I do not know if it is applicable to your problem but often the simplest solution is to scale your data.
For example:
a=0.04;
b=0.2;
a-0.2*b
ans=-6.9389e-018
c=a/min(abs([a b]));
d=b/min(abs([a b]));
c-0.2*d
ans=0
EDIT: of course I did not mean to give a universal solution to these kind of problems but it is still a good practice that can make you avoid a few problems in numerical computation (curve fitting, etc ...). See Jim Clay's answer for the reason why you are experiencing these problems.
I'm pretty sure this is a case of ye olde floating point accuracy issues.
Do you need 1e-17 accuracy? Is this merely a case of wanting 'pretty' output?
In that case, you can just use a formatted sprintf to display the number of significant digits you want.
Realize that this is not a matlab problem, but a fundamental limitation of how numbers are represented in binary.
For fun, work out what .1 is in binary...
Some references:
http://en.wikipedia.org/wiki/Floating_point#Accuracy_problems
http://www.mathworks.com/support/tech-notes/1100/1108.html
I just can't understand fixed point and floating point numbers due to hard to read definitions about them all over Google. But none that I have read provide a simple enough explanation of what they really are. Can I get a plain definition with example?
A fixed point number has a specific number of bits (or digits) reserved for the integer part (the part to the left of the decimal point) and a specific number of bits reserved for the fractional part (the part to the right of the decimal point). No matter how large or small your number is, it will always use the same number of bits for each portion. For example, if your fixed point format was in decimal IIIII.FFFFF then the largest number you could represent would be 99999.99999 and the smallest non-zero number would be 00000.00001. Every bit of code that processes such numbers has to have built-in knowledge of where the decimal point is.
A floating point number does not reserve a specific number of bits for the integer part or the fractional part. Instead it reserves a certain number of bits for the number (called the mantissa or significand) and a certain number of bits to say where within that number the decimal place sits (called the exponent). So a floating point number that took up 10 digits with 2 digits reserved for the exponent might represent a largest value of 9.9999999e+50 and a smallest non-zero value of 0.0000001e-49.
A fixed point number just means that there are a fixed number of digits after the decimal point. A floating point number allows for a varying number of digits after the decimal point.
For example, if you have a way of storing numbers that requires exactly four digits after the decimal point, then it is fixed point. Without that restriction it is floating point.
Often, when fixed point is used, the programmer actually uses an integer and then makes the assumption that some of the digits are beyond the decimal point. For example, I might want to keep two digits of precision, so a value of 100 means actually means 1.00, 101 means 1.01, 12345 means 123.45, etc.
Floating point numbers are more general purpose because they can represent very small or very large numbers in the same way, but there is a small penalty in having to have extra storage for where the decimal place goes.
From my understanding, fixed-point arithmetic is done using integers. where the decimal part is stored in a fixed amount of bits, or the number is multiplied by how many digits of decimal precision is needed.
For example, If the number 12.34 needs to be stored and we only need two digits of precision after the decimal point, the number is multiplied by 100 to get 1234. When performing math on this number, we'd use this rule set. Adding 5620 or 56.20 to this number would yield 6854 in data or 68.54.
If we want to calculate the decimal part of a fixed-point number, we use the modulo (%) operand.
12.34 (pseudocode):
v1 = 1234 / 100 // get the whole number
v2 = 1234 % 100 // get the decimal number (100ths of a whole).
print v1 + "." + v2 // "12.34"
Floating point numbers are a completely different story in programming. The current standard for floating point numbers use something like 23 bits for the data of the number, 8 bits for the exponent, and 1 but for sign. See this Wikipedia link for more information on this.
The term ‘fixed point’ refers to the corresponding manner in which numbers are represented, with a fixed number of digits after, and sometimes before, the decimal point.
With floating-point representation, the placement of the decimal point can ‘float’ relative to the significant digits of the number.
For example, a fixed-point representation with a uniform decimal point placement convention can represent the numbers 123.45, 1234.56, 12345.67, etc, whereas a floating-point representation could in addition represent 1.234567, 123456.7, 0.00001234567, 1234567000000000, etc.
There's of what a fixed-point number is and , but very little mention of what I consider the defining feature. The key difference is that floating-point numbers have a constant relative (percent) error caused by rounding or truncating. Fixed-point numbers have constant absolute error.
With 64-bit floats, you can be sure that the answer to x+y will never be off by more than 1 bit, but how big is a bit? Well, it depends on x and y -- if the exponent is equal to 10, then rounding off the last bit represents an error of 2^10=1024, but if the exponent is 0, then rounding off a bit is an error of 2^0=1.
With fixed point numbers, a bit always represents the same amount. For example, if we have 32 bits before the decimal point and 32 after, that means truncation errors will always change the answer by 2^-32 at most. This is great if you're working with numbers that are all about equal to 1, which gain a lot of precision, but bad if you're working with numbers that have different units--who cares if you calculate a distance of a googol meters, then end up with an error of 2^-32 meters?
In general, floating-point lets you represent much larger numbers, but the cost is higher (absolute) error for medium-sized numbers. Fixed points get better accuracy if you know how big of a number you'll have to represent ahead of time, so that you can put the decimal exactly where you want it for maximum accuracy. But if you don't know what units you're working with, floats are a better choice, because they represent a wide range with an accuracy that's good enough.
It is CREATED, that fixed-point numbers don't only have some Fixed number of decimals after point (digits) but are mathematically represented in negative powers. Very good for mechanical calculators:
e.g, the price of smth is USD 23.37 (Q=2 digits after the point. ) The machine knows where the point is supposed to be!
Take the number 123.456789
As an integer, this number would be 123
As a fixed point (2), this
number would be 123.46 (Assuming you rounded it up)
As a floating point, this number would be 123.456789
Floating point lets you represent most every number with a great deal of precision. Fixed is less precise, but simpler for the computer..
int a=7
int b=10
float answer = (float)a/b;
answer=0.699999988 ( I expect 0.7 ??)
The short version is: Floating points are not accurate, it's only a finite set of bits, and a finite set of bits cannot be used to represent an infinite set of numbers.
The longer version is here: What Every Computer Scientist Should Know About Floating-Point Arithmetic
See also:
How is floating point stored? When does it matter?
Why is my number being rounded incorrectly?
Floating point numbers are accurate only to a certain finite number of digits of precision. You will need to do some rounding to get whole numbers.
If you need more precision, use the double data type, or the NSDecimal class (Which will preserve your decimal digits at the expense of complexity).
It is because floating point calculations are not precise.
The only thing I rely on is the existence of exact small integers (namely -2, -1, 0, 1, 2, as you might use for representing [0,1] plus some special values), and some people frown on using that too.