What is the range of a float represented as a Q16.16 integer? - fixed-point

When encoding a floating point number as a Q16.16 fixed point integer with int(number * 0x10000), what is the range of floating point values that can be represented without losing precision?

The integer part ranges from -32768 to 32767, and the fractional part adds at most 65535/65536. In two's complement the fraction is always added, never subtracted, so the range is
-32768 to 32767 + 65535/65536.
However, precision is a different thing from range. You've got up to 31 bits of precision.

Assuming you are using a two's complement 32-bit integer to represent a Q15.16, the range of the integer is [2^31-1, -2^31] or [2147483647, -2147483648] and the scale is 2^-16 or 1/65536. Therefore the range of the fixed-point value is [2147483647/65536, -2147483648/65536] or approximately [32767.99998, -32768].
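The range above can be checked with a small sketch. It assumes a 32-bit two's complement container and the truncating int(number * 0x10000) encoding from the question; the helper names to_q16_16 and from_q16_16 are made up for illustration.

```python
SCALE = 0x10000          # 2^16, the Q16.16 scale factor
INT_MIN = -2**31         # smallest 32-bit two's complement integer
INT_MAX = 2**31 - 1      # largest 32-bit two's complement integer

def to_q16_16(number):
    """Encode a float as a Q16.16 integer (truncating, as in the question)."""
    raw = int(number * SCALE)
    if not INT_MIN <= raw <= INT_MAX:
        raise OverflowError(f"{number} is outside the Q16.16 range")
    return raw

def from_q16_16(raw):
    """Decode a Q16.16 integer back to a float."""
    return raw / SCALE

print(from_q16_16(INT_MAX))   # about 32767.99998
print(from_q16_16(INT_MIN))   # -32768.0
print(from_q16_16(1))         # the resolution, 2^-16
```

Any value that is a multiple of 1/65536 inside that range survives the round trip exactly; anything else is truncated to the nearest multiple toward zero.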

Related

what is the most negative fraction that can be represented by a n bit binary number in 2's complement?

I am reading computer architecture by Carl Hamacher. The following lines are confusing me
If we use a 32-bit binary number to represent a signed integer in 2's complement then the range of values that can be represented is -2^31 to 2^31 - 1. //fine
The same 32-bit patterns can be interpreted as fractions in the range -1 to +1 - 2^-31 if we assume that the implied binary point is just to the right of the sign bit.
I don't understand how the range has been calculated for the fraction. I myself calculated the range for positive fractions with the highest fraction being 2^31-1. But I am unable to calculate the lowest value (mentioned as -1).
Shifting the binary point is tantamount to dividing by a power of two. So if we shift the binary point 31 positions, we are effectively representing fractions whose denominators are 2^31 and whose numerators are the unshifted integers. Thus the range becomes -2^31/2^31 to (2^31-1)/2^31, which is -1 to 1-2^-31.
Another way to look at it: In 2's complement integer representation, the bit positions have weights (left to right) of -2^31, 2^30, 2^29, …, 2^0 (where only the first weight is negative). Shifting the binary point to just after the high-order (and negated) bit -- "the sign bit" -- makes the weights -2^0, 2^-1, 2^-2, …, 2^-31 (again, only the first weight is negative). Again, it's clear that the "most negative" is 100…0, which has the value -1, the weight for the only one-bit.
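The bit-weight argument can be sketched in a few lines. The function name as_signed_fraction is hypothetical; it simply sums the weights -2^0, 2^-1, ..., 2^-31 of the bits that are set.

```python
def as_signed_fraction(bits, n=32):
    """Interpret an n-bit pattern as a two's complement fraction with the
    binary point just right of the sign bit."""
    value = 0.0
    for i in range(n):
        bit = (bits >> (n - 1 - i)) & 1
        # Only the first (sign) bit carries a negative weight.
        weight = -1.0 if i == 0 else 2.0 ** -i
        value += bit * weight
    return value

print(as_signed_fraction(0x80000000))  # -1.0, the most negative fraction
print(as_signed_fraction(0x7FFFFFFF))  # 1 - 2^-31, the largest fraction
```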

Matlab representation of floating point numbers

Matlab results for realmax('single') are ans = 3.4028e+38. I am trying to understand why this number appears from the computer's binary representation, but I am a little bit confused.
I understand that realmax('single') is the highest floating point number represented in single precision, which is 32 bits. That means the binary representation consists of 1 bit for the sign, 23 bits for the mantissa and 8 bits for the exponent. And 3.4028e+38 is the decimal representation of the highest single precision floating point number, but I don't know how that number was derived.
Now, typing in 2^128 gives me the same answer as 3.4028e+38, but I don't understand the correlation.
Can someone help me understand why 3.4028e+38 is the largest returned result for a floating point number in a 32 bit format, coming from a binary representation perspective? Thank you.
As IEEE754 specifies, the largest magnitude exponent is Emax = 127_10 = 7F_16 = 0111 1111_2. This is encoded as 254_10 = FE_16 = 1111 1110_2 in the 8 exponent bits. The exponent 255_10 = FF_16 = 1111 1111_2 is reserved for representing infinity, so 254_10 is the largest available. The exponent bias 127_10 is subtracted from the exponent bits, leading to 254_10 - 127_10 = 127_10. The largest mantissa is attained when all 23 mantissa bits are set to 1. The largest value that can be represented is then 1.11111111111111111111111_2 x 2^127 = 3.4028235_10 x 10^38.
This site lets you set the bits of the representation and see the IEEE754 value represented in decimal, binary, and hexadecimal.
Also note that the largest value is less than 2^128. You can see a more precise representation of the numbers output in MATLAB by using format long. The reason they are similar is because 1.11111111111111111111111_2 x 2^127 is close to 10_2 x 2^127 = 1_2 x 2^128.
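As a rough check of this derivation, the bit pattern can be assembled by hand and reinterpreted as a single precision float. The sketch uses Python's standard struct module rather than MATLAB; the variable names are only illustrative.

```python
import struct

# Sign 0, exponent field 1111 1110 (254), all 23 mantissa bits set.
bits = (0 << 31) | (254 << 23) | (2**23 - 1)
largest = struct.unpack(">f", struct.pack(">I", bits))[0]

# The same value from the formula: 1.11...1 (binary) times 2^127.
by_formula = (2 - 2.0**-23) * 2.0**127

print(hex(bits))               # 0x7f7fffff
print(largest)                 # 3.4028234663852886e+38
print(largest == by_formula)   # True
print(largest < 2.0**128)      # True
```

Printed to five significant digits, both this value and 2^128 read 3.4028e+38, which is why the two look identical in MATLAB's default format.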
As a precursor to the single precision binary floating point representation of a number in a computer, I start with discussing what is known as "scientific notation" for a decimal number.
Using a base 10 number system, every positive decimal number has a first non-zero leading digit in the set {1..9}. (All other digits are in the set {0..9}.) The number's decimal point may always be shifted to the immediate right of this leading digit by multiplying by an appropriate power 10^n of the number base 10. E.g. 0.012345 = 1.2345*10^n where n = -2. This means that the decimal representation of every non-zero number x can be made to take the form
x = (+/-)(i.jklm...)*10^n ; where i is in the set {1,2,3,4,5,6,7,8,9}.
The leading decimal factor (i.jklm...), known as the "mantissa" of x, is a number in the interval [1,10), i.e. greater than or equal to 1 and less than 10. The largest the mantissa can be is 9.9999... so that the real "size" of the number x is an integer stored in the exponential factor 10^n. If the number x is very large, n >> 0, while if x is very small n << 0.
We now want to revisit these ideas using the base 2 number system associated with computer storage of numbers. Computers represent a number internally using the base 2 rather than the more familiar base 10. All the digits used in a "binary" representation of a number belong to the set {0,1}. Using the same kind of thinking to represent x in a binary representation as we did in its decimal representation, we see that every positive number x has the form
x = (+/-)(i.jklm...)*2^n ; where i = 1,
while the remaining digits belong to {0,1}.
Here the leading binary factor (mantissa) i.jklm... lies in the interval [1,2), rather than the interval [1,10) associated with the mantissa in the decimal system. Here the mantissa is bounded by the binary number 1.1111..., which is always less than 2 since in practice there will never be an infinite number of digits. As before, the real "size" of the number x is stored in the integer exponential factor 2^n. When x is very large then n >> 0 and when x is very small n << 0. The exponent n is also expressed in the binary system. Therefore every digit in the binary floating point representation of x is either a 0 or a 1. Each such digit is one of the "bits" used in the computer's memory for storing x.
The standard convention for a (single precision) binary representation of x is accomplished by storing exactly 32 bits (0's or 1's) in computer memory. The first bit is used to signify the arithmetic "sign" of the number. This leaves 31 bits to be distributed between the mantissa (i.jklm...) of x and the exponential factor 2^n. (Recall i = 1 in i.jklmn... so none of the 31 bits is required for its representation.) At this point an important "trade off" comes into play:
The more bits that are dedicated to the mantissa (i.jkl...) of x, the fewer that are available to represent the exponent n in its exponential factor 2^n. Conventionally 23 bits are dedicated to the mantissa of x. (It is not hard to show that this allows approximately 7 digits of accuracy for x when regarded in the decimal system, which is adequate for most scientific work.) With the very first bit dedicated to storing the sign of x, this leaves 8 bits that can be used to represent n in the factor 2^n. Since we want to allow for very large x and very small x, the decision has been made to store 2^n in the form
2^n = 2^(m-127) ; n = m - 127,
where the exponent m is stored rather than n. Utilizing 8 bits, this means m belongs to the set of binary integers {00000000, 00000001, ..., 11111111}. Since it is easier for humans to think in the decimal system, this means that m belongs to the set of values {0, 1, ..., 255}. Subtracting 127, this would put n in the set {-127, -126, ..., 0, 1, 2, ..., 128}. However, IEEE 754 reserves the two extreme values: m = 0 encodes zero and the subnormal numbers, and m = 255 encodes infinity and NaN. For ordinary (normalized) numbers this leaves
-126 <= n <= 127.
The largest the exponential factor 2^n of our binary floating point representation of x can be is then seen to be 2^n = 2^127, or viewed in the decimal system (use any calculator to evaluate 2^127)
2^n <= 1.7014...*10^38.
Summarizing, the largest number x that can possibly be stored in single precision floating point in a computer under the IEEE format is a number in the form
x = y*(1.7014...*10^38).
Here the mantissa y lies in the (half-closed, half-open) interval [1,2). With 23 stored bits, the largest mantissa is y = 2 - 2^-23.
The largest floating point number that can be stored using a 32 bit binary floating point representation is therefore max_x = (2 - 2^-23)*2^127 = 3.4028...*10^38. This is just below 2^128 = 3.4028...*10^38, which is why Matlab shows the same leading digits for realmax('single') and for 2^128.

Gaps between successive floating point numbers

(all numbers discussed are in decimal)
Let's say we have a floating point data type of the form:
m * 10 ^ e
where m is the mantissa, and the max mantissa size is 1 digit (0 <= m <= 9);
e is the exponent, with range -1 <= e <= 1.
We say our data type's max value is 90 and its min value is 0.
BUT: that does not mean we can represent all numbers that are within these limits.
We can only represent 27 numbers (9 * 3) excluding zero.
Specifically, we can't represent 89 in this way since it has a two-digit mantissa
(and none of its digits is zero).
So, by analogy with the above description, in a float data type (in any programming language) there must be some integers between the max and min values that we cannot store in a float data type.
Is the above argument sound? If it is, please give an example of how to show this in Java or C.
Your reasoning is perfectly sound. The easiest way to show it is by example, as you did.
A non-representable example
Consider the "usual single floating point" format, as defined by IEEE-754. It has 8 exponent bits, thus a range of roughly [-2^128, 2^128].
It also has 24 mantissa bits (23 stored plus one implicit leading bit), so let's consider
67108864, 67108865 and 67108872. Those numbers are respectively 2^26, 2^26+1 and 2^26+8.
Try to normalize them to write them in the floating point format, and you'll see that
the exponent gets the value 26
the first bit disappears, because it is implicit in the IEEE-754 format that the first bit is always* 1, so you're left with 26 bits for each number
the next bits (up to the limit of 23 stored bits) make up the mantissa, so the 3 lowest bits have to be dropped...
67108864 has only zeroes in its mantissa; since its 3 lowest bits are 0 you can remove them without losing information.
67108872 has a 1 in its mantissa's last position; since its 3 lowest bits are also 0 you can still remove them without losing information.
67108865 has only zeroes and a 1 as its lowest bit, which is beyond the 23 stored bits! So the number will be rounded to either 2^26 or 2^26+8.
Thus you have an example, like 89: 67108865 is not representable in a float.
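The question asked for Java or C; the same check is sketched here in Python, using the standard struct module to force a value through the 32-bit IEEE-754 format. The helper name to_float32 is made up for this illustration.

```python
import struct

def to_float32(x):
    """Round x to the nearest IEEE-754 single and return it as a Python float."""
    return struct.unpack(">f", struct.pack(">f", float(x)))[0]

print(to_float32(67108864))                # 67108864.0, stored exactly
print(to_float32(67108865))                # 67108864.0, the trailing 1 is lost
print(to_float32(67108865) == 67108865)    # False
```

In C or Java the equivalent test would be to assign the integer to a float variable and compare it with the original integer.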
* except for subnormals, see below (expanding on the comment)
Bias
Indeed I skipped a part here. The exponent is not directly encoded in the bits that are reserved to it, it is biased. In the case of single floating points, the bias is 127.
So our 26 is actually represented by 26+127, thus 153. Stealing the following image from wikipedia :
If you take those numbers (sign, exponent and mantissa) as they are written and want to express a non-subnormal number, you get : (-1)^sign * 2^(exponent-127) * 1.mantissa
Subnormals
Once we reach the smallest possible exponent field, that is once we write it 0, we stop supposing the initial 1 and the exponent is fixed at -126 (the same as for an exponent field of 1). This way, we can represent numbers smaller than 2^-126 (by sacrificing precision, because we will have leading 0's on the mantissa).
We then have : (-1)^sign * 2^-126 * 0.mantissa
In particular, when the mantissa is all 0's, we have 0, and this is intended : now a number that has only 0's in its binary representation is read as 0. In some way, 0 is the smallest of subnormal numbers (though in practice people consider it just a special case on its own).
Other special cases are when the exponent is all 1's. If the mantissa is all 0's then you have +/- infinity (depending on the sign), and if some mantissa bits are set you have a NaN.
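The rules for bias, subnormals and special cases can be collected in a small decoder. This is only a sketch of the IEEE-754 single precision rules, which use a fixed exponent of -126 for subnormals; decode_single is a hypothetical name.

```python
import math
import struct

def decode_single(bits):
    """Decode a 32-bit pattern by hand using the IEEE-754 single rules."""
    sign = -1.0 if bits >> 31 else 1.0
    exponent = (bits >> 23) & 0xFF      # 8 biased exponent bits
    mantissa = bits & 0x7FFFFF          # 23 stored mantissa bits
    if exponent == 0xFF:                # all ones: infinity or NaN
        return sign * math.inf if mantissa == 0 else math.nan
    if exponent == 0:                   # zero and subnormals: no implicit 1
        return sign * (mantissa / 2**23) * 2.0**-126
    return sign * (1 + mantissa / 2**23) * 2.0**(exponent - 127)

print(decode_single(153 << 23))     # 67108864.0, i.e. 2^26 with biased exponent 153
print(decode_single(0x00000001))    # smallest positive subnormal, 2^-149
print(decode_single(0x7F800000))    # inf
```

The results can be compared against struct.unpack(">f", struct.pack(">I", bits))[0] for any bit pattern that is not a NaN.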
Yes, your reasoning is sound, and it should be easy to find real numbers that cannot be represented in your data type.
Consider the smallest mantissa (0) and exponent (-1) you allow:
0 * 10 ^ -1
= 0.0
The next-higher mantissa you allow is 1:
1 * 10 ^ -1
= 0.1
You cannot represent any real numbers between 0.0 and 0.1 exclusive, such as 0.05.
You should be able to express this in Java or C.
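For the toy decimal format itself, every representable value can simply be enumerated. The sketch below is in Python rather than Java or C and uses exact fractions so that the toy format is not distorted by binary rounding.

```python
from fractions import Fraction

# Toy format from the question: m * 10^e with 0 <= m <= 9 and -1 <= e <= 1.
values = sorted({Fraction(m) * Fraction(10) ** e
                 for m in range(0, 10) for e in (-1, 0, 1)})

print(len(values))                   # 28: the 27 non-zero values plus zero
print(max(values))                   # 90
print(Fraction(89) in values)        # False
print(Fraction(5, 100) in values)    # False, 0.05 falls in the gap below 0.1
```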

Maximum double value (float) possible in MATLAB (64-bit)

I'm aware that double is the default data-type in MATLAB.
When you compare two double numbers that have no fractional part, MATLAB is accurate up to the 17th digit in my testing.
a=12345678901234567 ; b=12345678901234567; isequal(a,b) --> TRUE
a=123456789012345671; b=123456789012345672; isequal(a,b) --> printed as TRUE
I have found that a conservative estimate is to use integer-valued numbers only up to the 13th digit, as other functions can become unreliable beyond that (such as ismember, or the MEX functions ismembc etc).
Is there a similar cutoff for floating values? E.g., if I use shares-outstanding for a company which can be very very large with decimal places, when do I start losing decimal accuracy?
a = 1234567.89012345678 ; b = 1234567.89012345679 ; isequal(a,b) --> printed as TRUE
a = 123456789012345.678 ; b = 123456789012345.677 ; isequal(a,b) --> printed as TRUE
isequal may not be right tool to use for comparing such numbers. I'm more concerned about up to how many places should I trust my decimal values once the integer part of a number starts growing?
It's usually not a good idea to test the equality of floating-point numbers. The behavior of binary floating-point numbers can differ drastically from what you may expect from base-10 decimals. Consider the example:
>> isequal(0.1, 0.3/3)
ans =
0
Ultimately, you have 53 bits of precision. This means that integers can be represented exactly (with no loss in accuracy) up to the number 2^53 (which is a little over 9 x 10^15). After that, well:
>> (2^53 + 1) - 2^53
ans =
0
>> 2^53 + (1 - 2^53)
ans =
1
For non-integers, you are almost never going to be representing them exactly, even for simple-looking decimals such as 0.1 (as shown in that first example). However, it still guarantees you at least 15 significant figures of precision.
This means that if you take any number and round it to the nearest number representable as a double-precision floating point, then this new number will match your original number at least up to the first 15 digits (regardless of where these digits are with respect to the decimal point).
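The same behaviour shows up in any language that uses IEEE-754 doubles. Here is a sketch in Python, whose float type is a 64-bit double; the variable names are illustrative, and the 18-digit integers are the ones from the question.

```python
limit = 2.0 ** 53

lost = (limit + 1) - limit      # 0.0: 2^53 + 1 is not representable
kept = limit + (1 - limit)      # 1.0: same terms, different order

# The asker's 18-digit integers round to the same double.
a = float(123456789012345671)
b = float(123456789012345672)

print(lost, kept)        # 0.0 1.0
print(a == b)            # True
print(0.1 == 0.3 / 3)    # False
```

Python integers are arbitrary precision, so the rounding only happens at the float(...) conversion.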
You might want to use variable precision arithmetic (VPA) in MATLAB. It computes expressions exactly up to a given digit count, which may be quite large. See here.
Check out the MATLAB function flintmax which tells you the maximum consecutive integers that can be stored in either double or single precision. From that page:
flintmax returns the largest consecutive integer in IEEE® double
precision, which is 2^53. Above this value, double-precision format
does not have integer precision, and not all integers can be
represented exactly.

0 multiply number in floating point

As you know, a single precision number is stored in memory in the following format:
(-1)^s * 1.f * 2^e
and zero would be stored like this: 1.0000000000000000 * 2 ^ -126
Now if I multiply it by another floating point number like 3.37, (-1) ^ 0 * 1.10101111 * 2 ^ 128,
the result is not 0 in reality, but in the computer it will be 0. How and why?
As pointed out here (Wikipedia, sorry ...), there are special values for the exponent which are treated differently. If the exponent is zero, the formula for calculating the value of the number is
(-1)^s * 0.f * 2^(-126) # notice 0.f instead of 1.f for other exponents
So, a floating point zero has simply all bits set to zero (i.e. f=0, s=0, e=0). The multiplication algorithms of course have to take care of this "special" exponent and set the result to zero in this case (more specifically to +Zero or -Zero accordingly ...)
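The all-zero encoding is easy to inspect. This sketch uses Python's standard struct module to look at the single precision bit patterns; single_bits is a made-up helper name.

```python
import struct

def single_bits(x):
    """Return the 32-bit IEEE-754 single pattern of x as an integer."""
    return struct.unpack(">I", struct.pack(">f", x))[0]

print(f"{single_bits(0.0):032b}")    # all 32 bits are zero
print(f"{single_bits(-0.0):032b}")   # only the sign bit is set
print(0.0 * 3.37)                    # 0.0
print(-0.0 * 3.37)                   # -0.0
```

Because the exponent field is zero, the implicit leading 1 is not applied, so the stored value really is 0 and not 1.0 * 2^-126.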
Zero is (typically) a special case in floating point representations, and in IEEE floating point, zero is represented as 0.0 * 2 ^ -126 (or whatever the exponent is—it really doesn't matter).
I'd say that the math unit of the CPU has some optimization for the "special" floating point numbers, like NaN, Infinity and 0 (note that technically in IEEE binary floating point there are two zeros, a positive and a negative one), and knows what to do in each of the three cases.
If you are interested, here http://steve.hollasch.net/cgindex/coding/ieeefloat.html there is one table that shows what happens when you sum/multiply the "special" numbers between themselves.
why: the set of floating-point numbers isn't continuous like the set R in math. Therefore some numbers can't be represented exactly and are rounded to the nearest representable number
how: it's being rounded :)
Rounding errors. Computers are finite.