Floating point representation and mantissa - numbers

So there was this question on a Programming fundamentals past paper based on the unit number bases. I'm putting it down here
How will the number 45612.3789 be represented in a floating point
notation which has a five digit mantissa, the decimal point after the
second digit from left and 10 as the base for the exponent?
From what I can understand the answer should be 45.61237x10^3. Since the floating point part of the number is called the mantissa.
Is it the answer? I don't see a relation between the question and number bases and I'm kinda confused. If it is not the answer, what is? Thanks in advance.

Related

Marie Simulator Multiplication of fractions

I have a task to use Marie Simulator to calculate the area of a circle
requiring its radius
I know that in Marie Language there is no multiplication operator so we use multiplication by adding numbers several times so If I wanted to multiply 2*3 I could write it down like 3+3 or 2+2+2
but when using the area of a circle there is pi which is 3.14 I can't imagine how could I get it so can anyone give me the algorithm or code for that ?
thanks in advance.
MARIE does not have floating point support.
So, should refer to your course work or ask your instructors what to do, as it is not obvious.
It is, of course, possible to do floating point in software, but the complexity is extraordinary, so unlikely to be what the're looking for.
You could use fixed point arithmetic, fractions, or decimal.
Here's one solution that might be appropriate: multiply one of the numbers (having decimal places) by some fixed constant factor, do the arithmetic, then interpret answers accordingly.  For example, let's use 100 as the factor, so 3.14 is represented by 314.  Let's say r is 9, so we can square that (9x9=81), then multiply 81 x 314 = 25434.  Now we know that value is 100x too large, so the real answer is 254.34.  (You can choose to ignore the .34, or, round it, then ignore.  254 is still more accurate than 243 which we would get from 9x9x3.)
Fixed point multiplies all numbers by the constant (usually a power of 2, so that the binary point is in the same bit position).  Additions are relatively straightforward, but multiplications need to interpret results by factoring in (or out) that both sources are in scaled, meaning the answer is doubly scaled.
If you need to measure radius also with decimal digits, e.g. 9.5, then you could scale both 9.5 and 3.14 by 100.  Then we need 950x950, and multiply by 314.  The answer will be 100x100x100 too large, so 1000000x too large.  With this approach, 16 bits that MARIE offers will overflow, so you would need to use at least 32-bit arithmetic (not trivial on 16-bit machine).
You can use two different scaling factors, e.g. 9.5 as 95 and 3.14 as 314.  Take 95x95x314, is 10000x too large, so interpret the answer accordingly.  Still this will overflow MARIE's 16-bits
Fractions would maintain both a numerator and denominator for all numbers.  So, 3.14 could be 314/100, and 9.5 could be 95/10 — and simplified 157/50 and 19/2.  To add you have to find a common denominator, convert, then sum numerators.  To multiply you multiply both numerators and denominators: numerator = 19x19x157, denominator = 2x2x50.  Just fits in 16-bit unsigned arithmetic, but still overflows 16-bit signed arithmetic..
And finally binary coded decimal is more like a string format, where numbers are stored one decimal digit per byte or per nibble (packed decimal).  Algorithms for addition and subtraction need to account for variable length inputs.
Big integer forms also use similar to binary coded decimal but compose much larger elements instead of single decimal digits.
All of these approaches require some thought, and the more limitations you want to remove, the more work required.  So, I'd suggest to go back to your course to find what they really want.

Why is the "sign bit" included in the calculation of negative numbers?

I'm new to Swift and is trying to learn the concept of "shifting behavior for signed integers". I saw this example from "the swift programming language 2.1".
My question is: Why is the sign bit included in the calculation as well?
I experienced with several number combinations and they all works, but I don't seem to get reasons behind including sign bit in the calculation.
To add -1 to -4, simply by performing a standard binary addition of
all eight bits (including the sign bit), and discarding anything that
doesn't fit in the eight bits once you are done:
This is not unique to Swift. Nevertheless, historically, computers didn't subtract, they just added. To calculate A-B, do a two's-complement of B and add it to A. The two's-complement (negate all the bits and add 1) means that the highest order bit will be 0 for positive numbers (and zero), or 1 for negative numbers.
This is called 2's compliment math and allows both addition and subtraction to be done with simple operations. This is the way many hardware platforms implement arithmetic. You should look more into 2's compliment arithmetic of you want a deeper understanding.

comparing float and double and printing them

I have a quick question. So, say I have a really big number up to like 15 digits, and I would take the input and assign it to two variables, one float and one double if I were to compare two numbers, how would you compare them? I think double has the precision up to like 15 digits? and float has 8? So, do I simply compare them while the float only contains 8 digits and pad the rest or do I have the float to print out all 15 digits and then make the comparison? Also, if I were asked to print out the float number, is the standard way of doing it is just printing it up to 8 digits? which is its max precision
thanks
Most languages will do some form of type promotion to let you compare types that are not identical, but reasonably similar. For details, you would have to indicate what language you are referring to.
Of course, the real problem with comparing floating point numbers is that the results might be unexpected due to rounding errors. Most mathematical equivalences don't hold for floating point artihmetic, so two sequences of operations which SHOULD yield the same value might actually yield slightly different values (or even very different values if you aren't careful).
EDIT: as for printing, the "standard way" is based on what you need. If, for some reason, you are doing monetary computations in floating point, chances are that you'll only want to print 2 decimal digits.
Thinking in terms of digits may be a problem here. Floats can have a range from negative infinity to positive infinity. In C# for example the range is ±1.5 × 10^−45 to ±3.4 × 10^38 with a precision of 7 digits.
Also, IEEE 754 defines floats and doubles.
Here is a link that might help http://en.wikipedia.org/wiki/IEEE_floating_point
Your question is the right one. You want to consider your approach, though.
Whether at 32 or 64 bits, the floating-point representation is not meant to compare numbers for equality. For example, the assertion 2.0/7.0 == 60.0/210.0 may or may not be true in the CPU's view. Conceptually, the floating-point is inherently meant to be imprecise.
If you wish to compare numbers for equality, use integers. Consider again the ratios of the last paragraph. The assertion that 2*210 == 7*60 is always true -- noting that those are the integral versions of the same four numbers as before, only related using multiplication rather than division. One suspects that what you are really looking for is something like this.

Meaning of Precision Vs. Range of Double Types

To begin with, allow me to confess that I'm an experienced programmer, who has over 10 years of programming experience. However, the question I'm asking here is the one, which has bugged me ever since, I first picked up a book on C about a decade back.
Below is an excerpt from a book on Python, explaining about Python Floating type.
Floating-point numbers are represented using the native
double-precision (64-bit) representation of floating-point numbers on
the machine. Normally this is IEEE 754, which provides approximately
17 digits of precision and an exponent in the range of –308 to
308.This is the same as the double type in C.
What I never understood is the meaning of the phrase
" ... which provides approximately 17 digits of precision and an
exponent in the range of –308 to 308 ... "
My intuition here goes astray, since i can understand the meaning of precision, but how can range be different from that. I mean, if a floating point number can represent a value up to 17 digits, (i.e. max of 1,000,000,000,000,000,00 - 1), then how can exponent be +308. wouldn't that make a 308 digit number if exponent is 10 or a rough 100 digit number if exponent is 2.
I hope, I'm able to express my confusion.
Regards
Vaid, Abhishek
Suppose that we write 1500 with two digits of precision. That means we are being precise enough to distinguish 1500 from 1600 and 1400, but not precise enough to distinguish 1500 from 1510 or 1490. Telling those numbers apart would require three digits of precision.
Even though I wrote four digits, floating-point representation doesn't necessarily contain all these digits. 1500 is 1.5 * 10^3. In decimal floating-point representation, with two digits of precision, only the first two digits of the number and the exponent would be stored, which I will write (1.5, 3).
Why is there a distinction between the "real" digits and the placeholder zeros? Because it tells us how precisely we can represent numbers, that is, what fraction of their value is lost due to approximation. We can distinguish 1500 = (1.5, 3) from 1500+100 = (1.6, 3). But if we increase the exponent, we can't distinguish 15000 = (1.5, 4) from 15000+100 = (1.51, 4). At best, we can approximate numbers within +/- 10% with two decimal digits of precision. This is true no matter how small or large the exponent is allowed to be.
The regular decimal representation of numbers obscures the issue. If instead, one considers them in normalised scientific notation to separate the mantissa and exponent, then the distinction is immediate. Normalisation is achieved by scaling the mantissa until it is between 0.0 and 1.0 and adjusting the exponent to avoid loss of scale.
The precision of the number is the number of digits in the mantissa. A floating point number has a limited number of bits to represent this portion of the value. It determines how accurately numbers that are similar in size can be distinguished.
The range determines the allowable values of the exponent. In your example, the range of -308 through 308 is represented independently of the mantissa value and is limited by the number of bits in the floating point number allocated to storing the range.
These two values can be varied independently to suit. For example, in many graphic pipelines much smaller values are represented with truncated values that are scaled to fit into even 16 bits.
Numerical methods libraries expend large amounts of effort in ensuring that these limits are not exceeded to maintain the correctness of calculations. Casual use will not usually encounter these limits.
The choices in IEEE 754 are accepted to be a reasonable trade off between precision and range. The 32-bit single has similar, but different limits. The question What range of numbers can be represented in a 16-, 32- and 64-bit IEEE-754 systems? provides a longer summary and further links for more study.
From Wiki, Double Precision Floating Point numbers are expected to have a precision to 17 digits, or 17 SF. The exponent can be in the range -1022 to 1023.
Their -308 to 308 would appear to be an error, or else an idea not fully explained.

Fixed point vs Floating point number

I just can't understand fixed point and floating point numbers due to hard to read definitions about them all over Google. But none that I have read provide a simple enough explanation of what they really are. Can I get a plain definition with example?
A fixed point number has a specific number of bits (or digits) reserved for the integer part (the part to the left of the decimal point) and a specific number of bits reserved for the fractional part (the part to the right of the decimal point). No matter how large or small your number is, it will always use the same number of bits for each portion. For example, if your fixed point format was in decimal IIIII.FFFFF then the largest number you could represent would be 99999.99999 and the smallest non-zero number would be 00000.00001. Every bit of code that processes such numbers has to have built-in knowledge of where the decimal point is.
A floating point number does not reserve a specific number of bits for the integer part or the fractional part. Instead it reserves a certain number of bits for the number (called the mantissa or significand) and a certain number of bits to say where within that number the decimal place sits (called the exponent). So a floating point number that took up 10 digits with 2 digits reserved for the exponent might represent a largest value of 9.9999999e+50 and a smallest non-zero value of 0.0000001e-49.
A fixed point number just means that there are a fixed number of digits after the decimal point. A floating point number allows for a varying number of digits after the decimal point.
For example, if you have a way of storing numbers that requires exactly four digits after the decimal point, then it is fixed point. Without that restriction it is floating point.
Often, when fixed point is used, the programmer actually uses an integer and then makes the assumption that some of the digits are beyond the decimal point. For example, I might want to keep two digits of precision, so a value of 100 means actually means 1.00, 101 means 1.01, 12345 means 123.45, etc.
Floating point numbers are more general purpose because they can represent very small or very large numbers in the same way, but there is a small penalty in having to have extra storage for where the decimal place goes.
From my understanding, fixed-point arithmetic is done using integers. where the decimal part is stored in a fixed amount of bits, or the number is multiplied by how many digits of decimal precision is needed.
For example, If the number 12.34 needs to be stored and we only need two digits of precision after the decimal point, the number is multiplied by 100 to get 1234. When performing math on this number, we'd use this rule set. Adding 5620 or 56.20 to this number would yield 6854 in data or 68.54.
If we want to calculate the decimal part of a fixed-point number, we use the modulo (%) operand.
12.34 (pseudocode):
v1 = 1234 / 100 // get the whole number
v2 = 1234 % 100 // get the decimal number (100ths of a whole).
print v1 + "." + v2 // "12.34"
Floating point numbers are a completely different story in programming. The current standard for floating point numbers use something like 23 bits for the data of the number, 8 bits for the exponent, and 1 but for sign. See this Wikipedia link for more information on this.
The term ‘fixed point’ refers to the corresponding manner in which numbers are represented, with a fixed number of digits after, and sometimes before, the decimal point.
With floating-point representation, the placement of the decimal point can ‘float’ relative to the significant digits of the number.
For example, a fixed-point representation with a uniform decimal point placement convention can represent the numbers 123.45, 1234.56, 12345.67, etc, whereas a floating-point representation could in addition represent 1.234567, 123456.7, 0.00001234567, 1234567000000000, etc.
There's of what a fixed-point number is and , but very little mention of what I consider the defining feature. The key difference is that floating-point numbers have a constant relative (percent) error caused by rounding or truncating. Fixed-point numbers have constant absolute error.
With 64-bit floats, you can be sure that the answer to x+y will never be off by more than 1 bit, but how big is a bit? Well, it depends on x and y -- if the exponent is equal to 10, then rounding off the last bit represents an error of 2^10=1024, but if the exponent is 0, then rounding off a bit is an error of 2^0=1.
With fixed point numbers, a bit always represents the same amount. For example, if we have 32 bits before the decimal point and 32 after, that means truncation errors will always change the answer by 2^-32 at most. This is great if you're working with numbers that are all about equal to 1, which gain a lot of precision, but bad if you're working with numbers that have different units--who cares if you calculate a distance of a googol meters, then end up with an error of 2^-32 meters?
In general, floating-point lets you represent much larger numbers, but the cost is higher (absolute) error for medium-sized numbers. Fixed points get better accuracy if you know how big of a number you'll have to represent ahead of time, so that you can put the decimal exactly where you want it for maximum accuracy. But if you don't know what units you're working with, floats are a better choice, because they represent a wide range with an accuracy that's good enough.
It is CREATED, that fixed-point numbers don't only have some Fixed number of decimals after point (digits) but are mathematically represented in negative powers. Very good for mechanical calculators:
e.g, the price of smth is USD 23.37 (Q=2 digits after the point. ) The machine knows where the point is supposed to be!
Take the number 123.456789
As an integer, this number would be 123
As a fixed point (2), this
number would be 123.46 (Assuming you rounded it up)
As a floating point, this number would be 123.456789
Floating point lets you represent most every number with a great deal of precision. Fixed is less precise, but simpler for the computer..