How does Matlab store such a large number with only 64 bits?

The largest number in double precision (that is, 64 bit) floating point arithmetic is
1.111...110 x 2^(512) (where there are 51 1's after the radix point). This number is less than 2 x 2^(512) == 2^(513) == 8^(171) < 10^(171). Therefore, when I assign x = 10^(171), I expect that x will be stored as Inf. However, this is not the case. Calling x in the interactive console displays 1.0000e+171. The only explanation I could think of is that Matlab uses more than 64 bits to store x. But a quick check of whos x reveals that x is stored in 8 bytes.
In fact, the largest power of 10 which will not be stored as Inf is 10^308.
Can someone please explain what is going on here?

I'm sorry, I made a simple mistake here. In 64 bit arithmetic, 11 bits are used to encode the exponent. Therefore we have 2^(11) = 2048 possible exponent codes. Two of them are reserved (all zeros for zero and subnormal numbers, all ones for Inf and NaN), so with a bias of 1023 the exponents of normal numbers range from -1022 to 1023, not from -511 to 512 like I thought. Therefore the largest number in 64 bit arithmetic is 1.111...1 x 2^(1023) (with 52 1's after the radix point), which is just under 2^(1024), or about 10^(308.25) = 1.7977e+308, corroborating my experimental results.
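The corrected range can be checked directly in the console. This is a minimal sketch that assumes the standard IEEE 754 double layout; the variable names are only illustrative.

```matlab
% Largest finite double, built from its bit-level definition:
% 52 ones after the radix point, unbiased exponent 1023.
xmax = (2 - 2^-52) * 2^1023;
same_as_realmax  = (xmax == realmax);   % true
decimal_exponent = log10(realmax);      % about 308.25

% 10^308 is below realmax; 10^309 is above it and overflows to Inf.
finite_308 = isfinite(10^308);          % true
inf_309    = isinf(10^309);             % true
```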

Related

Computational arithmetic - how many bits exactly for 8 digit number needed

How many bytes (and how many bits) do you need to represent the number 99999999?
I need to know this: we have a calculator, the simplest possible, that can accommodate up to 8 digits, i.e. from 0 to 99999999 (let's forget the negatives, unless you feel comfortable including them in your answer).
How many bits/bytes do we need to store values from 0 up to and including 99999999?
I appreciate your help, kindly provide the theoretical background and any calculations if you can.
Thank you very much!
Because there are 8 digits and each digit can have 10 values (0, 1, …, 9), the total number of representable numbers is 10^8. To represent this many numbers in binary, we must have a number of digits N such that assigning just one of two values (0, 1) to each position gives at least as many representable numbers as we have in decimal. That is, we have to solve
2^N >= 10^8
We can take the base-2 log of both sides to get
N >= log_2(10^8) = 8 * log_2(10)
At this point, hopefully you have a calculator handy to calculate log_2(10). Note that this is equal to log_10(10)/ log_10(2) = 1/log_10(2), if your calculator does logarithms in base 10 by default. The answer comes out to:
N >= ~26.58
The smallest integer value of N that satisfies this is 27. So, 27 binary digits (bits) are required.
The short answer is 27 bits, or 4 bytes, which cover 32 bits.
The longer answer: you have to represent 10^8 values, and log2(10^8) is roughly 26.575424759. Take the ceiling of this value and you get 27 bits. Round 27 up to a whole number of 8-bit groups and you have 32 bits, i.e. 4 bytes.
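The same calculation, written out as a short MATLAB sketch (MATLAB is used only because the rest of this page does; the variable names are illustrative):

```matlab
n_values = 10^8;                       % values 0 .. 99999999
n_bits   = ceil(log2(n_values));       % 27
n_bytes  = ceil(n_bits / 8);           % 4

% Sanity check: 26 bits are too few, 27 bits are enough.
fits = (2^26 < n_values) && (n_values <= 2^27);   % true
```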

How is eps() calculated in MATLAB?

The eps routine in MATLAB essentially returns the positive distance between floating point numbers. It can take an optional argument, too.
My question: How does MATLAB calculate this value? (Does it use a lookup table, or does it use some algorithm to calculate it at runtime, or something else...?)
Related: how could it be calculated in any language providing bit access, given a floating point number?
Wikipedia has quite a page on it.
Specifically for MATLAB, eps is 2^(-52), as MATLAB uses double precision by default. The bit layout of a double is as follows:
It's one bit for the sign, 11 for the exponent and the remaining 52 for the fraction.
The MATLAB documentation on floating point numbers also shows this.
d = eps(x), where x has data type single or double, returns the positive distance from abs(x) to the next larger floating-point number of the same precision as x.
As not all fractions are equally closely spaced on the number line, different fractions will show different distances to the next floating-point within the same precision. Their bit representations are:
1.0 = 0 01111111111 0000000000000000000000000000000000000000000000000000
0.9 = 0 01111111110 1100110011001100110011001100110011001100110011001101
The sign of both is positive (0), the exponents are not equal and of course their fractions are vastly different. This means that the distances to the next floating point numbers are:
dec2bin(typecast(eps(1.0), 'uint64'), 64) = 0 01111001011 0000000000000000000000000000000000000000000000000000
dec2bin(typecast(eps(0.9), 'uint64'), 64) = 0 01111001010 0000000000000000000000000000000000000000000000000000
which are not the same, hence eps(0.9)~=eps(1.0).
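For the bit-access part of the question, one possible approach (a sketch, not necessarily how MATLAB implements eps internally) is to reinterpret the double as a 64-bit integer, add one to get the next larger double, and subtract. This assumes x is positive, finite and smaller than realmax.

```matlab
x    = 0.9;
bits = typecast(x, 'uint64');                  % raw 64-bit pattern of x
next = typecast(bits + uint64(1), 'double');   % next larger double
gap  = next - x;                               % spacing at x

matches_eps = (gap == eps(x));                 % true
is_2_pow_53 = (gap == 2^-53);                  % true
hex_pattern = num2hex(x);                      % '3feccccccccccccd'
```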
Here is some insight into eps which will help you to write an algorithm.
See that eps(1) = 2^(-52). Now, say you want to compute the eps of 17179869183.9. Note that I have chosen a number which is 0.1 less than 2^34 (in other words, something like 2^(33.9999...)). To compute the eps of this, you can compute log2 of the number, which would be ~33.99999... as mentioned before. Take the floor() of this number and add it to -52, since eps(1) = 2^(-52) and the given number is 2^(33.999...). Therefore, log2(eps(17179869183.9)) = -52 + 33 = -19, i.e. eps(17179869183.9) = 2^(-19).
If you take a number which is fractionally more than 2^34, e.g. 17179869184.1, then log2(eps(17179869184.1)) = -18. This also shows that the eps value changes at the numbers that are integer powers of your base (or radix), in this case 2. Since the eps value only changes at integer powers of 2, we take the floor of the power. You will be able to get the exact value of eps for any (normalized) number using this. I hope it is clear.
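The floor-of-log2 rule can be sketched as below. The two-output form of log2 is used here instead of floor(log2(x)) because it returns the integer exponent directly and avoids rounding trouble right next to a power of 2; that is a choice made for this sketch, and it only covers positive, normalized doubles.

```matlab
x = [17179869183.9, 17179869184.1, 1, 0.9];
[~, e] = log2(x);                 % x = f .* 2.^e with 0.5 <= f < 1
exponents = e - 53;               % same as floor(log2(x)) - 52
gap = 2 .^ exponents;             % candidate value for eps(x)

agrees = isequal(gap, eps(x));    % true
% exponents is [-19 -18 -52 -53]
```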
MATLAB uses (along with other languages) the IEEE754 standard for representing real floating point numbers.
In this format the bits allocated for approximating the actual (see note 1) real number, usually 32 for single or 64 for double precision, are grouped into 3 groups:
1 bit for determining the sign, s.
8 (or 11) bits for exponent, e.
23 (or 52) bits for the fraction, f.
Then a real number, n, is approximated by the following three-term relation:
n = (-1)^s * 2^(e - bias) * (1 + f)
where the bias offsets the stored exponent negatively (see note 2), so that the same unsigned field can describe both numbers between 0 and 1 (negative exponents) and numbers of 1 and above (non-negative exponents).
Now, the gap reflects the fact that real numbers do not map perfectly to their finite, 32- or 64-bit, representations. Moreover, a whole range of real numbers that differ from a representable value by less than half the gap maps to a single value in computer memory, i.e. if you assign a value val to a variable var_i
var_1 = val - offset
...
var_i = val;
...
var_n = val + offset
where
offset < eps(val) / 2
Then:
var_1 = var_2 = ... = var_i = ... = var_n.
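A small check of this absorption effect, as a sketch: the value 1000 and the factor 1/4 are arbitrary choices for illustration, picked so that the offset stays strictly below eps(val) / 2 on both sides.

```matlab
val    = 1000;
offset = eps(val) / 4;                       % strictly less than eps(val) / 2

absorbed_up   = (val + offset == val);       % true
absorbed_down = (val - offset == val);       % true
distinct      = (val + eps(val) ~= val);     % true: a full gap is visible
```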
The gap is determined by the second term, containing the exponent (or characteristic):
2^(e - bias)
in the above relation (see note 3), which determines the "scale" of the "line" on which the approximated numbers are located: the larger the numbers, the larger the distance between them and the less precise they are, and vice versa: the smaller the numbers, the more densely located their representations are and, consequently, the more accurate.
In practice, to determine the gap at a specific number, eps(number), you can start by adding / subtracting a gradually increasing small number until the stored value of the number of interest changes. Because of rounding to nearest, the value first changes once the increment reaches roughly half the gap in that (positive or negative) direction, i.e. about eps(number) / 2.
To check possible implementations of MATLAB's eps (or ULP, unit in the last place, as it is called in other languages), you could search for ULP implementations in C, C++ or Java, which are the languages MATLAB is written in.
1. Real numbers are infinitely precise, i.e. they could be written with arbitrary precision, with any number of digits after the decimal point.
2. Usually around the half: in single precision, 8 bits give decimal values from 0 to 2^8 - 1 = 255; around the half in our case is 127, i.e. 2^(e - 127).
3. It can be thought that 2^(e - bias) represents the most significant digits of the number, i.e. the digits that contribute to describe how big the number is, as opposed to the least significant digits that contribute to describe its precise location. Then the larger the term containing the exponent, the smaller the significance of the 23 bits of the fraction.

Matlab representation of floating point numbers

Matlab results for realmax('single') are ans = 3.4028e+38. I am trying to understand why this number appears from the computer's binary representation, but I am a little bit confused.
I understand that realmax('single') is the highest floating point number represented in single precision, which is 32 bits. That means the binary representation consists of 1 bit for the sign, 23 bits for the mantissa and 8 bits for the exponent. And 3.4028e+38 is the decimal representation of the highest single precision floating point number, but I don't know how that number was derived.
Now, typing in 2^128 gives me the same answer as 3.4028e+38, but I don't understand the correlation.
Can someone help me understand why 3.4028e+38 is the largest returned result for a floating point number in a 32 bit format, coming from a binary representation perspective? Thank you.
As IEEE754 specifies, the largest magnitude exponent is Emax = 127 (decimal) = 7F (hex) = 0111 1111 (binary). This is encoded as 254 (decimal) = FE (hex) = 1111 1110 (binary) in the 8 exponent bits. The exponent code 255 (decimal) = FF (hex) = 1111 1111 (binary) is reserved for representing infinity, so 254 is the largest available. The exponent bias 127 is subtracted from the exponent bits, leading to 254 - 127 = 127. The largest mantissa is attained when all 23 mantissa bits are set to 1. The largest value that can be represented is then 1.11111111111111111111111 (binary) x 2^127 = 3.4028235 x 10^38.
This site lets you set the bits of the representation and see the IEEE754 value represented in decimal, binary, and hexadecimal.
Also note that the largest value is less than 2^128. You can see a more precise representation of the numbers output in MATLAB by using format long. The reason they are similar is that 1.11111111111111111111111 (binary) x 2^127 is close to 10 (binary) x 2^127 = 1 (binary) x 2^128.
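These statements can be checked in MATLAB with a few lines. This is only a sketch; the variable names are illustrative.

```matlab
% All 23 mantissa bits set, unbiased exponent 127.
xmax = single((2 - 2^-23) * 2^127);
same = (xmax == realmax('single'));                  % true

hex_pattern = num2hex(realmax('single'));            % '7f7fffff'
below_2_128 = (double(realmax('single')) < 2^128);   % true
overflow    = isinf(single(2^128));                  % true: 2^128 does not fit in single
```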
As a precursor to the single precision binary floating point representation of a number in a computer, I start with discussing what is known as "scientific notation" for a decimal number.
Using a base 10 number system, every positive decimal number has a first non-zero leading digit in the set {1..9}. (All other digits are in the set {0..9}.) The number's decimal point may always be shifted to the immediate right of this leading digit by multiplying by an appropriate power 10^n of the number base 10. E.g. 0.012345 = 1.2345*10^n where n = -2. This means that the decimal representation of every non-zero number x can be made to take the form
x = (+/-)(i.jklm...)*10^n ; where i is in the set {1,2,3,4,5,6,7,8,9}.
The leading decimal factor (i.jklm...), known as the "mantissa" of x, is a number in the interval [1,10), i.e. greater than or equal to 1 and less than 10. The largest the mantissa can be is 9.9999... so that the real "size" of the number x is an integer stored in the exponential factor 10^n. If the number x is very large, n >> 0, while if x is very small n << 0.
We now want to revisit these ideas using the base 2 number system associated with computer storage of numbers. Computers represent a number internally using the base 2 rather than the more familiar base 10. All the digits used in a "binary" representation of a number belong to the set {0,1}. Using the same kind of thinking to represent x in a binary representation as we did in its decimal representation, we see that every positive number x has the form
x = (+/-)(i.jklm...)*2^n ; where i = 1,
while the remaining digits belong to {0,1}.
Here the leading binary factor (mantissa) i.jklm... lies in the interval [1,2), rather than the interval [1,10) associated with the mantissa in the decimal system. Here the mantissa is bounded by the binary number 1.1111..., which is always less than 2 since in practice there will never be an infinite number of digits. As before, the real "size" of the number x is stored in the integer exponential factor 2^n. When x is very large then n >> 0 and when x is very small n << 0. The exponent n is itself expressed in the binary system. Therefore every digit in the binary floating point representation of x is either a 0 or a 1. Each such digit is one of the "bits" used in the computer's memory for storing x.
The standard convention for a (single precision) binary representation of x is accomplished by storing exactly 32 bits (0's or 1's) in computer memory. The first bit is used to signify the arithmetic "sign" of the number. This leaves 31 bits to be distributed between the mantissa (i.jklm...) of x and the exponential factor 2^n. (Recall i = 1 in i.jklmn... so none of the 31 bits is required for its representation.) At this point an important "trade off" comes into play:
The more bits that are dedicated to the mantissa (i.jkl...) of x, the fewer that are available to represent the exponent n in its exponential factor 2^n. Conventionally 23 bits are dedicated to the mantissa of x. (It is not hard to show that this allows approximately 7 digits of accuracy for x when regarded in the decimal system, which is adequate for most scientific work.) With the very first bit dedicated to storing the sign of x, this leaves 8 bits that can be used to represent n in the factor 2^n. Since we want to allow for very large x and very small x, the decision has been made to store 2^n in the form
2^n = 2^(m-127) ; n = m - 127,
where the exponent m is stored rather than n. Utilizing 8 bits, this means m belongs to the set of binary integers {00000000, 00000001, ..., 11111111}. Since it is easier for humans to think in the decimal system, this means that m belongs to the set of values {0, 1, ..., 255}. The IEEE format reserves the two extreme codes, however: m = 0 encodes zero and subnormal numbers, and m = 255 encodes Inf and NaN. Subtracting 127 from the remaining values {1, ..., 254}, this means in turn that n belongs to the number set {-126, -125, ..., 0, 1, 2, ..., 127}, i.e.
-126 <= n <= 127.
The largest the exponential factor 2^n of our binary floating point representation of x can be is then seen to be 2^n = 2^127, or viewed in the decimal system (use any calculator to evaluate 2^127)
2^n <= 1.7014...*10^38.
Summarizing, the largest number x that can possibly be stored in single precision floating point in a computer under the IEEE format is a number of the form
x = y*(1.7014...*10^38).
Here the mantissa y lies in the (half-closed, half-open) interval [1,2).
With all 23 mantissa bits set to 1, y = 2 - 2^(-23), so the largest floating point number that can be stored using a 32 bit binary floating point representation is max_x = (2 - 2^(-23))*2^127 = 3.4028...*10^38. This is the value Matlab reports for realmax('single'). It lies just below 2^128, which is why typing 2^128 (evaluated in double precision) displays the same 3.4028e+38 in the short format.
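One way to check the exponent and mantissa values for yourself is to pull the stored fields out of realmax('single') and rebuild the number from them. The sketch below assumes the standard layout of 1 sign bit, 8 exponent bits and 23 fraction bits; the variable names follow the m, n and y used in the discussion.

```matlab
bits = typecast(realmax('single'), 'uint32');   % raw 32-bit pattern
m = bitshift(bits, -23);               % sign bit is 0, so this is the stored exponent: 254
f = bitand(bits, uint32(2^23 - 1));    % 23 fraction bits, all ones: 8388607

n = double(m) - 127;                   % unbiased exponent: 127
y = 1 + double(f) / 2^23;              % mantissa: 2 - 2^-23
value = y * 2^n;                       % rebuilt in double precision

same = (value == double(realmax('single')));    % true
```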

mod() operation weird behavior

I use mod() to check whether a number's hundredths digit (the 0.01 digit) is 2 or not.
if mod(5.02*100, 10) == 2
...
end
The result is that mod(5.02*100, 10) == 2 returns 0.
However, if I use mod(1.02*100, 10) == 2 or mod(20.02*100, 10) == 2, it returns 1.
The result of mod(5.02*100, 10) - 2 is
ans =
-5.6843e-14
Could this be a bug in Matlab?
The version I used is R2013a. version 8.1.0
This is not a bug in MATLAB. It is a limitation of floating point arithmetic and conversion between binary and decimal numbers. Even a simple decimal number such as 0.1 cannot be exactly represented as a binary floating point number with finite precision.
Computer floating point arithmetic is typically not exact. Although we are used to dealing with numbers in decimal format (base 10), computers store and process numbers in binary format (base 2). The IEEE standard for double precision floating point representation (see http://en.wikipedia.org/wiki/Double-precision_floating-point_format, what MATLAB uses) specifies the use of 64 bits to represent a binary number. 1 bit is used for the sign, 52 bits are used for the mantissa (the actual digits of the number), and 11 bits are used for the exponent and its sign (which specifies where the binary point goes).
When you enter a number into MATLAB, it is immediately converted to binary representation for all manipulations and arithmetic and then converted back to decimal for display and output.
Here's what happens in your example:
Convert to binary (keeping only up to 52 digits):
5.02 => 1.01000001010001111010111000010100011110101110000101e2
100 => 1.1001e6
10 => 1.01e3
2 => 1.0e1
Perform multiplication:
1.01000001010001111010111000010100011110101110000101 e2
x 1.1001 e6
--------------------------------------------------------------
0.000101000001010001111010111000010100011110101110000101
0.101000001010001111010111000010100011110101110000101
+ 1.01000001010001111010111000010100011110101110000101
-------------------------------------------------------------
1.111101011111111111111111111111111111111111111111111101e8
Cutting off at 52 digits gives 1.111101011111111111111111111111111111111111111111111e8
Note that this is not the same as 1.11110110e8 which would be 502.
Perform modulo operation: (there may actually be additional error here depending on what algorithm is used within the mod() function)
mod( 1.111101011111111111111111111111111111111111111111111e8, 1.01e3) = 1.111111111111111111111111111111111111111111100000000e0
The error is exactly -2^(-44), which is -5.6843 x 10^(-14). The conversion between decimal and binary and the rounding due to finite precision have caused a small error. In some cases, you get lucky and rounding errors cancel out and you might still get the 'right' answer, which is why you got what you expect for mod(1.02*100, 10), but in general, you cannot rely on this.
To use mod() correctly to test the particular digit of a number, use round() to round it to the nearest whole number and compensate for floating point error.
mod(round(5.02*100), 10) == 2
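The three ways of testing the digit can be compared side by side. This is a sketch; the tolerance of 1e-9 is an arbitrary choice for illustration and is not something MATLAB prescribes.

```matlab
x = 5.02 * 100;                                   % stored as slightly less than 502

raw_test     = (mod(x, 10) == 2);                 % false: exact comparison fails
rounded_test = (mod(round(x), 10) == 2);          % true: round first, then test
tol_test     = (abs(mod(x, 10) - 2) < 1e-9);      % true: compare within a tolerance

err = x - 502;                                    % about -5.6843e-14
```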
What you're encountering is a floating point error or artifact, like the commenters say. This is not a Matlab bug; it's just how floating point values work. You'd get the same results in C or Java. Floating point values are "approximate" types, so exact equality comparisons using == without some rounding or tolerance are prone to error.
>> isequal(1.02*100, 102)
ans =
1
>> isequal(5.02*100, 502)
ans =
0
It's not the case that 5.02 is the only number this happens for; many other values are affected. Here's an example that picks out several of them.
x = 1.02:1000.02;
ix = mod(x .* 100, 10) ~= 2;
disp(x(ix))
To understand the details of what's going on here (and in many other situations you'll encounter working with floats), have a read through the Wikipedia entry for "floating point", or my favorite article on it, "What Every Computer Scientist Should Know About Floating-Point Arithmetic". (That title is hyperbole; this article goes deep and I don't understand half of it. But it's a great resource.) This stuff is particularly relevant to Matlab because Matlab does everything in floating point by default.

Incorrect 64-bit division

I am multiplying a 32-bit number by itself to get its square. Then I want to keep certain bits and discard the rest, but the result comes out rounded and not exact.
For example, I want to calculate fec2b8f7 × fec2b8f7 = fd86fb26fffffe51 (in hexadecimal notation). In MATLAB it comes out like this:
>> (((uint64(hex2dec('fec2b8f7'))) * (uint64(hex2dec('fec2b8f7')))))
ans =
fd86fb26fffffe51
Which is correct. From this 64-bit hexadecimal number I want to get the highest 34 bits. To get the first 34 bits out of 64 bits you need to get rid of the last 30 bits, so I divide by 2^30:
>> (((uint64(hex2dec('fec2b8f7'))) * (uint64(hex2dec('fec2b8f7'))))) / uint64(2^30)
ans =
00000003f61bec9c
Sadly the result is not correct, as it was rounded. The correct result should be 00000003f61bec9b.
Is there a way to just get 34 bits out of a 64-bit number without rounding? I tried floor, but it can only be applied after the division has been done, by which point the result has already been rounded.
You could try bitshift:
left34 = bitshift(num64, -30)
(negative shift distance shifts to the right)
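Applied to the numbers from the question, the shift looks like the sketch below. The name num64 is the one used in the answer; the other variable names are illustrative.

```matlab
a      = uint64(hex2dec('fec2b8f7'));
num64  = a * a;                          % fd86fb26fffffe51
left34 = bitshift(num64, -30);           % drop the low 30 bits, no rounding
left34_hex = dec2hex(left34);            % '3F61BEC9B'

% For comparison: integer division rounds to nearest, which is one too large here.
rounded = num64 / uint64(2^30);          % 3f61bec9c
off_by_one = (rounded - left34 == 1);    % true
```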