Gaps between successive floating point numbers - numbers

(all numbers discussed are in decimal)
Let's say we have a floating point data type of the form:
m * 10 ^ e
where m is the mantissa, which has at most one digit (0 <= m <= 9),
and e is the exponent, with -1 <= e <= 1.
We say our data type's max value is 90 and its min value is 0.
But that does not mean we can represent all numbers within these limits.
We can only represent 27 numbers (9 * 3), excluding zero.
Specifically, we can't represent 89 this way, since it needs a two-digit mantissa
(and neither digit is zero).
So, by analogy with the above, in a float data type (in any programming language) there must be some integers between the max and min values that we cannot store.
Is the above argument sound? If it is, please give an example of how to show this in Java or C.

Your reasoning is perfectly sound. The easiest way to show it is by example, as you did.
A non-representable example
Consider the usual single-precision floating point format, as defined by IEEE-754. It has 8 exponent bits, so its finite range is a little under (-2^128, 2^128).
It also has a 24-bit mantissa (23 stored bits plus one implicit bit), so let's consider
67108864, 67108865 and 67108872. Those numbers are respectively 2^26, 2^26+1 and 2^26+8.
Try to normalize them to write them in the floating point format, and you'll see that
the exponent gets the value 26
the first bit disappears, because it is implicit in the IEEE-754 format that the first bit is always* 1, so you're left with 26 bits for each number
only the next 23 bits are stored as the mantissa, so the lowest 3 bits have to be dropped.
67108864 has only zeroes after its leading 1, so you can drop its lowest 3 bits without losing information.
67108872 has a 1 in its mantissa's last stored position and its lowest 3 bits are 0, so you can still drop them without losing information.
67108865 has a 1 as its lowest bit, which is beyond the stored bits! So the number will be rounded to a neighbouring float, here 2^26 (the next float up is 2^26+8).
Thus you have an example, like 89: 67108865 is not representable in a float. (The smallest positive integer with this problem is 2^24+1 = 16777217.)
* except for subnormals, see below (expanding on the comment)
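Since the question asked for Java or C, here is a minimal Java sketch of the check. The class name FloatGaps and the helper isExactInFloat are made up for this illustration; the only library call it relies on is Math.nextUp.

```java
public class FloatGaps {
    // True if n converts to float and back without changing value.
    // The comparison is done in long so that the saturating float-to-int
    // cast cannot hide a rounding near Integer.MAX_VALUE.
    static boolean isExactInFloat(int n) {
        return (long) (float) n == n;
    }

    public static void main(String[] args) {
        int[] samples = {16777216, 16777217, 67108864, 67108865};
        for (int n : samples) {
            float f = n; // int-to-float conversion rounds to the nearest float
            System.out.println(n + " -> " + (long) f + " exact=" + isExactInFloat(n));
        }
        // The next float above 2^26 shows how wide the gap is at that magnitude.
        System.out.println((long) Math.nextUp(67108864f));
    }
}
```

Running it prints, for each sample, the value that actually ends up in the float, so a non-representable integer shows up as a changed value.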
Bias
Indeed I skipped a part here. The exponent is not directly encoded in the bits that are reserved to it, it is biased. In the case of single floating points, the bias is 127.
So our 26 is actually stored as 26+127, thus 153.
If you take those numbers (sign, exponent and mantissa) as they are written and want to express a non-subnormal number, you get: (-1)^sign * 2^(exponent-127) * 1.mantissa
Subnormals
Once we reach the smallest possible exponent field, that is once we write it as 0, we stop supposing the initial 1, and the exponent is fixed at -126 (the same as for an exponent field of 1) rather than the -127 the bias would suggest. This way, we can represent numbers smaller than 2^-126 (by sacrificing precision, because we will have leading 0's in the mantissa).
We then have: (-1)^sign * 2^-126 * 0.mantissa
In particular, when the mantissa is all 0's, we have 0, and this is intended : now a number that has only 0's in its binary representation is read as 0. In some way, 0 is the smallest of subnormal numbers (though in practice people consider it just a special case on its own).
Other special cases are when the exponent is all 1's. If the mantissa is all 0's then you have +/- infinity (depending on the sign), and if some mantissa bits are set you have a NaN.
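The field layout described above can be inspected directly in Java. This is only a sketch: FloatFields and its method names are invented for the example, and decode() is a hand-written reading of the rules above (bias 127 for normal numbers, fixed exponent -126 for subnormals, exponent field 255 for infinity and NaN), not a library routine.

```java
public class FloatFields {
    static int signField(float x)     { return Float.floatToIntBits(x) >>> 31; }
    static int exponentField(float x) { return (Float.floatToIntBits(x) >>> 23) & 0xFF; }
    static int mantissaField(float x) { return Float.floatToIntBits(x) & 0x7FFFFF; }

    // Rebuilds the value from the three fields.
    static double decode(float x) {
        int e = exponentField(x);
        int m = mantissaField(x);
        double sign = signField(x) == 0 ? 1.0 : -1.0;
        double fraction = m / 8388608.0; // mantissa bits as a fraction of 2^23
        if (e == 255) {
            return m == 0 ? sign * Double.POSITIVE_INFINITY : Double.NaN;
        }
        if (e == 0) {
            return sign * Math.scalb(fraction, -126); // subnormal: 0.mantissa
        }
        return sign * Math.scalb(1.0 + fraction, e - 127); // normal: 1.mantissa
    }

    public static void main(String[] args) {
        float x = 67108864f; // 2^26
        System.out.println("exponent field = " + exponentField(x));
        System.out.println("mantissa field = " + mantissaField(x));
        System.out.println("decoded        = " + decode(x));
    }
}
```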

Yes, your reasoning is sound, and it should be easy to find real numbers that cannot be represented in your data type.
Consider the smallest mantissa (0) and exponent (-1) you allow:
0 * 10 ^ -1
= 0.0
The next-higher mantissa you allow is 1:
1 * 10 ^ -1
= 0.1
You cannot represent any real numbers between 0.0 and 0.1 exclusive, such as 0.05.
You should be able to express this in Java or C.
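For the toy decimal type from the question, one possible way to express this in Java is to enumerate every value m * 10^e and look for the missing ones. The class name ToyDecimalFloat is hypothetical, and BigDecimal is used only so that the decimal values are exact.

```java
import java.math.BigDecimal;
import java.util.TreeSet;

public class ToyDecimalFloat {
    // All nonzero values m * 10^e with 1 <= m <= 9 and -1 <= e <= 1.
    // A TreeSet compares with compareTo, so 9E+1 and 90 count as the same value.
    static TreeSet<BigDecimal> representable() {
        TreeSet<BigDecimal> values = new TreeSet<>();
        for (int e = -1; e <= 1; e++) {
            for (int m = 1; m <= 9; m++) {
                values.add(BigDecimal.valueOf(m).scaleByPowerOfTen(e));
            }
        }
        return values;
    }

    public static void main(String[] args) {
        TreeSet<BigDecimal> values = representable();
        System.out.println("count = " + values.size());
        System.out.println("89 representable?   " + values.contains(new BigDecimal("89")));
        System.out.println("0.05 representable? " + values.contains(new BigDecimal("0.05")));
    }
}
```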

Related

Computational arithmetic - how many bits exactly for 8 digit number needed

how many bytes (and how many bits) you need to represent number 99999999 ?
I need to know this: we have a calculator, the simplest possible, and can accommodate up to 8 digits, i.e. from 0 to 99999999 (let's forget the negatives, unless if you feel comfortable to include in your answer).
How many bits/bytes do we need to store values from 0 up to and inclusive 99999999 ?
I appreciate your help, kindly provide the theoretical background and any calculations if you can.
Thank you very much!
Because there are 8 digits and each digit can have 10 values (0, 1, …, 9), the total number of representable numbers is 10^8. To represent this many numbers in binary, we must have a number of digits N such that assigning just one of two values (0, 1) to each position gives at least as many representable numbers as we have in decimal. That is, we have to solve
2^N >= 10^8
We can take the base-2 log of both sides to get
N >= log_2(10^8) = 8 * log_2(10)
At this point, hopefully you have a calculator handy to calculate log_2(10). Note that this is equal to log_10(10)/ log_10(2) = 1/log_10(2), if your calculator does logarithms in base 10 by default. The answer comes out to:
N >= ~26.58
The smallest integer value of N that satisfies this is 27. So, 27 binary digits (bits) are required.
Short answer is 27 bits, or 4 bytes, which cover 32 bits.
Longer answer: you have to represent 10^8 values, and log2(10^8) is roughly 26.575424759. Take the ceiling of this value and you get 27. Round 27 up to a whole number of 8-bit groups and you have 32 bits, i.e. 4 bytes.
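The same calculation can be done without logarithms by asking for the position of the highest set bit of the largest value. A small Java sketch, assuming unsigned storage of the range 0..maxValue (BitsNeeded, bitsFor and bytesFor are names chosen for this example):

```java
public class BitsNeeded {
    // Number of bits needed to store every integer in 0..maxValue.
    static int bitsFor(long maxValue) {
        if (maxValue < 0) {
            throw new IllegalArgumentException("maxValue must be non-negative");
        }
        // Position of the highest set bit; at least one bit even for zero.
        return Math.max(1, 64 - Long.numberOfLeadingZeros(maxValue));
    }

    // Whole bytes needed, rounding the bit count up to a multiple of 8.
    static int bytesFor(long maxValue) {
        return (bitsFor(maxValue) + 7) / 8;
    }

    public static void main(String[] args) {
        long max = 99_999_999L;
        System.out.println(bitsFor(max) + " bits, " + bytesFor(max) + " bytes");
    }
}
```

Adding a sign bit for negative values would take the count to 28 bits, which still fits in 4 bytes.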

How is eps() calculated in MATLAB?

The eps routine in MATLAB essentially returns the positive distance between floating point numbers. It can take an optional argument, too.
My question: How does MATLAB calculate this value? (Does it use a lookup table, or does it use some algorithm to calculate it at runtime, or something else...?)
Related: how could it be calculated in any language providing bit access, given a floating point number?
Wikipedia has quite the page on it.
Specifically for MATLAB, eps is 2^(-52), as MATLAB uses double precision by default.
A double has one bit for the sign, 11 for the exponent and the remaining 52 for the fraction.
The MATLAB documentation on floating point numbers also shows this.
d = eps(x), where x has data type single or double, returns the positive distance from abs(x) to the next larger floating-point number of the same precision as x.
As not all fractions are equally closely spaced on the number line, different fractions will show different distances to the next floating-point within the same precision. Their bit representations are:
1.0 = 0 01111111111 0000000000000000000000000000000000000000000000000000
0.9 = 0 01111111110 1100110011001100110011001100110011001100110011001101
The sign for both is positive (0), the exponents are not equal and of course their fractions are vastly different. This means that the distances to the next floating point numbers are:
dec2bin(typecast(eps(1.0), 'uint64'), 64) = 0 01111001011 0000000000000000000000000000000000000000000000000000
dec2bin(typecast(eps(0.9), 'uint64'), 64) = 0 01111001010 0000000000000000000000000000000000000000000000000000
which are not the same, hence eps(0.9)~=eps(1.0).
Here is some insight into eps which will help you to write an algorithm.
See that eps(1) = 2^(-52). Now, say you want to compute the eps of 17179869183.9. Note that I have chosen a number which is 0.1 less than 2^34 (in other words, something like 2^(33.9999...)). To compute the eps of this, you can compute log2 of the number, which would be ~33.99999... as mentioned before. Take the floor() of this value and add it to -52, since eps(1) = 2^(-52) and the given number is 2^(33.999...). Therefore, log2(eps(17179869183.9)) = -52+33 = -19, i.e. eps(17179869183.9) = 2^(-19).
If you take a number which is fractionally more than 2^34, e.g., 17179869184.1, then the log2(eps(17179869184.1)) = -18. This also shows that the eps value will change for the numbers that are integer powers of your base (or radix), in this case 2. Since eps value only changes at those numbers which are integer powers of 2, we take floor of the power. You will be able to get the perfect value of eps for any number using this. I hope it is clear.
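The floor-of-log2 recipe can be sketched in Java as follows. This is an illustration, not MATLAB's actual implementation: EpsByExponent is a made-up name, Math.getExponent stands in for floor(log2(|x|)), and the sketch assumes x is a normal, finite, nonzero double (subnormals, zero, infinity and NaN would need separate handling).

```java
public class EpsByExponent {
    // eps(x) = 2^(floor(log2(|x|)) - 52) for normal, finite, nonzero doubles.
    static double eps(double x) {
        int exponent = Math.getExponent(x); // unbiased exponent of x
        return Math.scalb(1.0, exponent - 52);
    }

    public static void main(String[] args) {
        double[] samples = {1.0, 0.9, 17179869183.9, 17179869184.1};
        for (double x : samples) {
            // Math.ulp is Java's built-in counterpart, printed for comparison.
            System.out.println(x + ": " + eps(x) + " vs Math.ulp " + Math.ulp(x));
        }
    }
}
```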
MATLAB uses (along with other languages) the IEEE754 standard for representing real floating point numbers.
In this format the bits allocated for approximating the actual [1] real number, usually 32 for single or 64 for double precision, are grouped into three groups:
1 bit for determining the sign, s.
8 (or 11) bits for the exponent, e.
23 (or 52) bits for the fraction, f.
Then a real number, n, is approximated by the following three-term relation:
n = (-1)^s * 2^(e - bias) * (1 + f)
where the bias shifts the stored exponent downwards [2], so that e - bias can be negative as well as positive and the format covers magnitudes both below and above 1.
Now, the gap reflects the fact that real numbers do not map perfectly to their finite, 32- or 64-bit, representations. Moreover, a whole range of real numbers that differ from each other by less than eps maps to a single value in computer memory. I.e., if you assign values around val to the variables var_1, ..., var_n:
var_1 = val - offset
...
var_i = val;
...
var_n = val + offset
where
offset < eps(val) / 2
Then:
var_1 = var_2 = ... = var_i = ... = var_n.
The gap is determined from the second term containing the exponent (or characteristic):
2^(e - bias)
in the above relation [3], which determines the "scale" of the "line" on which the approximated numbers are located: the larger the numbers, the larger the distance between them and the less precise they are; and vice versa, the smaller the numbers, the more densely located their representations are and consequently the more accurate.
In practice, to determine the gap of a specific number, eps(number), you can start by adding / subtracting a gradually increasing small number until the initial value of the number of interest changes. The smallest increment that has an effect is roughly eps(number) / 2, because a result more than halfway to the neighbouring floating-point number is rounded to that neighbour.
To check possible implementations of MATLAB's eps (or ULP, unit in the last place, as it is called in other languages), you could search for ULP implementations in C, C++ or Java, which are the languages MATLAB is written in.
1. Real numbers are infinitely precise, i.e. they can be written with arbitrary precision, with any number of digits after the decimal point.
2. Usually around the half: in single precision 8 bits give decimal values from 0 to 2^8 - 1 = 255, and around the half in our case is 127, i.e. 2^(e - 127).
3. It can be thought that 2^(e - bias) represents the most significant digits of the number, i.e. the digits that contribute to describing how big the number is, as opposed to the least significant digits that contribute to describing its precise location. Then the larger the term containing the exponent, the smaller the significance of the 23 bits of the fraction.
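For the "bit access" part of the question, one possible approach is to step to the next bit pattern and subtract. The Java sketch below is an assumption about how this could be done, not a description of MATLAB's internals; EpsByBits is a hypothetical name, and the trick only holds for finite inputs whose magnitude is below Double.MAX_VALUE.

```java
public class EpsByBits {
    // Distance from |x| to the next larger double, found by incrementing the
    // raw bit pattern. For non-negative finite doubles, consecutive bit
    // patterns are consecutive floating-point values.
    static double eps(double x) {
        double a = Math.abs(x);
        long bits = Double.doubleToLongBits(a);
        double next = Double.longBitsToDouble(bits + 1);
        return next - a; // exact: both operands are neighbouring doubles
    }

    public static void main(String[] args) {
        System.out.println(eps(1.0));
        System.out.println(eps(0.9));
        System.out.println(eps(0.0));
    }
}
```

The same stepping works for subnormals and for zero, where the result is the smallest positive double.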

what is the most negative fraction that can be represented by a n bit binary number in 2's complement?

I am reading computer architecture by Carl Hamacher. The following lines are confusing me
If we use a 32-bit binary number to represent a signed integer in 2's complement then the range of values that can be represented is -2^31 to 2^31 - 1. //fine
The same 32-bit patterns can be interpreted as fractions in the range -1 to +1 - 2^-31 if we assume that the implied binary point is just to the right of the sign bit.
I don't understand how the range has been calculated for the fraction. I myself calculated the range for positive fractions with the highest fraction being 2^31-1. But I am unable to calculate the lowest value (mentioned as -1).
Shifting the binary point is tantamount to dividing by a power of two. So if we shift the binary point 31 positions, we are effectively representing fractions whose denominators are 2^31 and whose numerators are the unshifted integers. Thus the range becomes -2^31/2^31 to (2^31-1)/2^31, which is -1 to 1-2^-31.
Another way to look at it: In 2's complement integer representation, the bit positions have weights (left to right) of -2^31, 2^30, 2^29, ..., 2^0 (where only the first weight is negative). Shifting the binary point to just after the high-order (and negated) bit, "the sign bit", makes the weights -2^0, 2^-1, 2^-2, ..., 2^-31 (again, only the first weight is negative). Again, it's clear that the "most negative" is 100...0, which has the value -1, the weight for the only one-bit.
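The division by 2^31 can be checked with a few lines of Java. FixedPointFraction and asFraction are names invented for this sketch; a double is used for the result because it holds every such fraction exactly.

```java
public class FixedPointFraction {
    // Interprets a 32-bit two's complement pattern as a fraction whose binary
    // point sits just to the right of the sign bit: value = pattern / 2^31.
    static double asFraction(int pattern) {
        return pattern / 2147483648.0; // 2^31
    }

    public static void main(String[] args) {
        System.out.println(asFraction(Integer.MIN_VALUE)); // pattern 100...0
        System.out.println(asFraction(Integer.MAX_VALUE)); // pattern 011...1
        System.out.println(asFraction(0x40000000));        // pattern 010...0
    }
}
```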

Matlab representation of floating point numbers

Matlab results for realmax('single') are ans = 3.4028e+38. I am trying to understand why this number appears from the computer's binary representation, but I am a little bit confused.
I understand that realmax('single') is the highest floating point number represented in single precision, which is 32 bits. That means the binary representation consists of 1 bit for the sign, 23 bits for the mantissa and 8 bits for the exponent. And 3.4028e+38 is the decimal representation of the highest single precision floating point number, but I don't know how that number was derived.
Now, typing in 2^128 gives me the same answer as 3.4028e+38, but I don't understand the correlation.
Can someone help me understand why 3.4028e+38 is the largest returned result for a floating point number in a 32 bit format, coming from a binary representation perspective? Thank you.
As IEEE754 specifies, the largest magnitude exponent is Emax = 127_10 = 7F_16 = 0111 1111_2. This is encoded as 254_10 = FE_16 = 1111 1110_2 in the 8 exponent bits. The exponent 255_10 = FF_16 = 1111 1111_2 is reserved for representing infinity (and NaN), so 254_10 is the largest available. The exponent bias 127_10 is subtracted from the exponent bits, leading to 254_10 - 127_10 = 127_10. The largest mantissa is attained when all 23 mantissa bits are set to 1. The largest value that can be represented is then 1.11111111111111111111111_2 x 2^127 = 3.4028235_10 x 10^38.
This site lets you set the bits of the representation and see the IEEE754 value represented in decimal, binary, and hexadecimal.
Also note that the largest value is less than 2^128. You can see a more precise representation of the numbers output in MATLAB by using format long. The reason they are similar is that 1.11111111111111111111111_2 x 2^127 is close to 10_2 x 2^127 = 1_2 x 2^128.
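The derivation can be reproduced outside MATLAB. The Java sketch below (SingleMax and its two methods are illustrative names) builds the largest finite single from its bit fields and from the formula, using Float.intBitsToFloat and Math.scalb.

```java
public class SingleMax {
    // Sign 0, exponent field 254, all 23 mantissa bits set.
    static float largestFinite() {
        int bits = (0 << 31) | (254 << 23) | 0x7FFFFF;
        return Float.intBitsToFloat(bits);
    }

    // (2 - 2^-23) * 2^127, computed in double, where it is exact.
    static double largestFromFormula() {
        return (2.0 - Math.scalb(1.0, -23)) * Math.scalb(1.0, 127);
    }

    public static void main(String[] args) {
        System.out.println(largestFinite());
        System.out.println(largestFromFormula());
        System.out.println(Math.scalb(1.0, 128)); // 2^128, for comparison
    }
}
```

Printed at full precision, the first two values agree with each other and differ from 2^128 only in the last displayed digits, which is why MATLAB's short format shows 3.4028e+38 for both.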
As a precursor to the single precision binary floating point representation of a number in a computer, I start with discussing what is known as "scientific notation" for a decimal number.
Using a base 10 number system, every positive decimal number has a first non-zero leading digit in the set {1..9}. (All other digits are in the set {0..9}.) The number's decimal point may always be shifted to the immediate right of this leading digit by multiplying by an appropriate power 10^n of the number base 10. E.g. 0.012345 = 1.2345*10^n where n = -2. This means that the decimal representation of every non-zero number x can be made to take the form
x = (+/-)(i.jklm...)*10^n ; where i is in the set {1,2,3,4,5,6,7,8,9}.
The leading decimal factor (i.jklm...), known as the "mantissa" of x, is a number in the interval [1,10), i.e. greater than or equal to 1 and less than 10. The largest the mantissa can be is 9.9999... so that the real "size" of the number x is an integer stored in the exponential factor 10^n. If the number x is very large, n >> 0, while if x is very small n << 0.
We now want to revisit these ideas using the base 2 number system associated with computer storage of numbers. Computers represent a number internally using the base 2 rather than the more familiar base 10. All the digits used in a "binary" representation of a number belong to the set {0,1}. Using the same kind of thinking to represent x in a binary representation as we did in its decimal representation, we see that every positive number x has the form
x = (+/-)(i.jklm...)*2^n ; where i = 1,
while the remaining digits belong to {0,1}.
Here the leading binary factor (mantissa) i.jklm... lies in the interval [1,2), rather than the interval [1,10) associated with the mantissa in the decimal system. Here the mantissa is bounded by the binary number 1.1111..., which is always less than 2 since in practice there will never be an infinite number of digits. As before, the real "size" of the number x is stored in the integer exponential factor 2^n. When x is very large then n >> 0 and when x is very small n << 0. The exponent n is expressed in the binary system. Therefore every digit in the binary floating point representation of x is either a 0 or a 1. Each such digit is one of the "bits" used in the computer's memory for storing x.
The standard convention for a (single precision) binary representation of x is accomplished by storing exactly 32 bits (0's or 1's) in computer memory. The first bit is used to signify the arithmetic "sign" of the number. This leaves 31 bits to be distributed between the mantissa (i.jklm...) of x and the exponential factor 2^n. (Recall i = 1 in i.jklmn... so none of the 31 bits is required for its representation.) At this point an important "trade off" comes into play:
The more bits that are dedicated to the mantissa (i.jkl...) of x, the fewer that are available to represent the exponent n in its exponential factor 2^n. Conventionally 23 bits are dedicated to the mantissa of x. (It is not hard to show that this allows approximately 7 digits of accuracy for x when regarded in the decimal system, which is adequate for most scientific work.) With the very first bit dedicated to storing the sign of x, this leaves 8 bits that can be used to represent n in the factor 2^n. Since we want to allow for very large x and very small x, the decision has been made to store 2^n in the form
2^n = 2^(m-127) ; n = m - 127,
where the exponent m is stored rather than n. Utilizing 8 bits, this means m belongs to the set of binary integers {00000000, 00000001, ..., 11111111}. Since it is easier for humans to think in the decimal system, this means that m belongs to the set of values {0, 1, ..., 255}. The IEEE format reserves m = 0 for zero and subnormal numbers and m = 255 for infinity and NaN, so ordinary numbers use m in {1, ..., 254}. Subtracting 127, this means in turn that n belongs to the number set {-126, -125, ..., 0, 1, 2, ..., 127}, i.e.
-126 <= n <= 127.
The largest the exponential factor 2^n of our binary floating point representation of x can be is then seen to be 2^n = 2^127, or viewed in the decimal system (use any calculator to evaluate 2^127)
2^n <= 1.7014...*10^38.
Summarizing, the largest number x that can possibly be stored in single precision floating point in a computer under the IEEE format is a number in the form
x = y*(1.7014...*10^38).
Here the mantissa y lies in the (half-closed, half-open) interval [1,2). With 23 mantissa bits, its largest value is 2 - 2^-23 = 1.99999988...
From this discussion we see that the largest floating point number that can be stored using a 32 bit binary floating point representation is max_x = (2 - 2^-23)*2^127 = 3.4028...*10^38, which is what Matlab reports as realmax('single'). It lies just below 2^128 = 3.4028...*10^38, which is why the two look the same when displayed to five significant digits.

0 multiply number in floating point

As you know, a single-precision number is stored in memory in the following format:
(-1)^s * 1.f * 2^e
and zero is stored like this: 1.0000000000000000 * 2 ^ -126
Now, if I multiply it by another floating point number like 3.37, i.e. (-1) ^ 0 * 1.10101111 * 2 ^ 128,
the result should not be 0 in reality, but in the computer it is 0. How and why?
As pointed out here (Wikipedia, sorry ...), there are special values for the exponent which are treated differently. If the exponent is zero, the formula for calculating the value of the number is
(-1)^s * 0.f * 2^(-126) # notice 0.f instead of 1.f for other exponents
So, a floating point zero has simply all bits set to zero (i.e. f=0, s=0, e=0). The multiplication algorithms of course have to take care of this "special" exponent and set the result to zero in this case (more specifically to +Zero or -Zero accordingly ...)
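This is easy to observe in Java by looking at the raw bits. The sketch below uses a made-up class name, FloatZero, and relies only on Float.floatToIntBits and ordinary float multiplication.

```java
public class FloatZero {
    // Raw IEEE-754 bit pattern of a float.
    static int bitsOf(float x) {
        return Float.floatToIntBits(x);
    }

    public static void main(String[] args) {
        float zero = 0.0f;
        float negZero = -0.0f;
        // +0 has every bit clear; -0 has only the sign bit set.
        System.out.println(Integer.toHexString(bitsOf(zero)));
        System.out.println(Integer.toHexString(bitsOf(negZero)));
        // Multiplying a zero by a finite number gives a zero of the matching sign.
        System.out.println(zero * 3.37f);
        System.out.println(negZero * 3.37f);
        // The two zeros still compare as equal.
        System.out.println(zero == negZero);
    }
}
```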
Zero is (typically) a special case in floating point representations, and in IEEE floating point, zero is represented as 0.0 * 2 ^ -126 (or whatever the exponent is—it really doesn't matter).
I'll say that the math unit of the cpu has some optimization for the "special" floating point numbers, like NaN, Infinity and 0 (and note that technically in IEEE binary fp there are two 0, a positive and a negative one) and know what to do in the three cases.
If you are interested, here http://steve.hollasch.net/cgindex/coding/ieeefloat.html there is one table that shows what happens when you sum/multiply the "special" numbers between themselves.
why: the set of floating-point numbers isn't continuous like the set R in math. Therefore some numbers can't be represented exactly and are rounded to the nearest representable number.
how: it's being rounded :)
Rounding errors. Computers are finite.