Detailed implementation of IEEE754 in MATLAB? - matlab

In MATLAB,
>> format hex; 3/10, 3*0.1
ans =
3fd3333333333333
ans =
3fd3333333333334
>> 3/10 - 3*0.1
ans =
bc90000000000000
Is this result predictable? That is, can I follow some rules of floating point arithmetic and reproduce 3/10 = 3fd3333333333333 and 3*0.1 = 3fd3333333333334 by hand?

The rules are:
In MATLAB, unless specified otherwise (via constructors), all literals have double precision in the sense of the IEEE 754 standard: http://www.mathworks.com/help/matlab/matlab_prog/floating-point-numbers.html
All arithmetic operations are executed according to the usual precedence rules: http://www.mathworks.com/help/matlab/matlab_prog/operator-precedence.html
When mixing another numeric type with double in an arithmetic operation, MATLAB converts the double operand to the other numeric type before executing the operation (as opposed to C, for example, which converts in the other direction, promoting the integer to double).
By using these rules you can pretty much predict the result of any arithmetic expression (the memory layout is always little endian; bit patterns are two's complement for signed integers and IEEE 754 for floats). The alternative is to let MATLAB apply the rules for you; the results will be consistent and repeatable.
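Since any conforming IEEE 754 implementation computes the same doubles, the bit patterns above can be reproduced outside MATLAB as well. Here is a minimal Python sketch (the helper name `double_hex` is just for illustration) that prints the same hex patterns as MATLAB's format hex:

```python
import struct

def double_hex(x):
    """Return the 64-bit IEEE 754 pattern of x, like MATLAB's format hex."""
    return format(struct.unpack('<Q', struct.pack('<d', x))[0], '016x')

print(double_hex(3 / 10))           # 3fd3333333333333
print(double_hex(3 * 0.1))          # 3fd3333333333334
print(double_hex(3 / 10 - 3 * 0.1)) # bc90000000000000, i.e. -2^-54
```

The difference between the two results is exactly one unit in the last place, which is why the subtraction yields the tiny value -2^-54.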

The reason is that when the binary representation for 0.1 is created, a round-up occurs, introducing a small error:
>> 0.1
ans =
3fb999999999999a
There should be infinitely many of those 9s at the end, but the representation is cut off and the last digit rounded up. The error is small, but it becomes significant when you multiply by 3:
>> 3*0.1
ans =
3fd3333333333334
When correctly calculated by division this last digit shouldn't be 4:
>> 3/10
ans =
3fd3333333333333
It is interesting to see that this error is not big enough to cause a problem when we multiply by some other numbers smaller than 3 (though the threshold is not exactly 3):
>> 2.9/10
ans =
3fd28f5c28f5c28f
>> 2.9*0.1
ans =
3fd28f5c28f5c28f

Related

How to calculate machine epsilon in MATLAB?

I need to find the machine epsilon and I am doing the following:
eps = 1;
while 1.0 + eps > 1.0 do
eps = eps /2;
end
However, it shows me this:
Undefined function or variable 'do'.
Error in epsilon (line 3)
while 1.0 + eps > 1.0 do
What should I do?
First and foremost, there is no do keyword in MATLAB, so eliminate that from your code. Also, don't use eps as a variable name. eps is a pre-defined MATLAB function that calculates machine epsilon, which is exactly what you are trying to calculate. By creating a variable called eps, you shadow the built-in function, so any other code that relies on it will behave unexpectedly, and that's not what you want.
Use something else instead, like macheps. Also, your algorithm is slightly incorrect. You need to check for 1.0 + (macheps/2) in your while loop, not 1.0 + macheps.
In other words, do this:
macheps = 1;
while 1.0 + (macheps/2) > 1.0
macheps = macheps / 2;
end
This should give you 2.22 x 10^{-16}, which agrees with MATLAB if you type in eps in the command prompt. To double-check:
>> format long
>> macheps
macheps =
2.220446049250313e-16
>> eps
ans =
2.220446049250313e-16
Bonus
In case you didn't know, machine epsilon is the upper bound on the relative error due to floating point arithmetic. In other words, this would be the maximum difference expected between a true floating point number and one that is calculated on a computer due to the finite number of bits used to store a floating point number.
If you recall, floating point numbers are inevitably represented as binary bits on your computer (or pretty much anything digital). In terms of the IEEE 754 floating point standard, MATLAB assumes that all numerical values are of type double, which represents floating point numbers with 64 bits. You can obviously override this behaviour by explicitly casting to another type. With the IEEE 754 floating point standard, for double precision numbers, there are 52 bits that represent the fractional part of the number.
(The original answer includes a diagram of the IEEE 754 double-precision bit layout here. Source: Wikipedia)
You see that there is one bit reserved for the sign of the number, 11 bits for the exponent, and finally 52 bits for the fractional part, which adds up to 64 bits in total. The fractional part is a sum of powers of base 2 with negative exponents running from -1 down to -52: the MSB of the fraction contributes 2^{-1}, all the way down to 2^{-52} as the LSB. Essentially, machine epsilon is the resolution of a single-bit change between two floating point numbers that share the same sign and the same exponent. Technically speaking, machine epsilon equals 2^{-52}, as that is the weight of the least significant fraction bit under those conditions.
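The three fields can be pulled apart programmatically. Here is a short Python sketch (Python floats are the same IEEE 754 doubles; the function name `fields` is illustrative) that extracts the sign, biased exponent, and fraction bits:

```python
import struct

def fields(x):
    """Split a double into its IEEE 754 sign, biased exponent, and fraction."""
    bits = struct.unpack('<Q', struct.pack('<d', x))[0]
    sign = bits >> 63                # 1 bit
    exponent = (bits >> 52) & 0x7FF  # 11 bits, biased by 1023
    fraction = bits & (2**52 - 1)    # 52 bits
    return sign, exponent, fraction

print(fields(1.0))   # (0, 1023, 0): 1023 - 1023 = 0, so the value is 1.0 * 2^0
print(fields(-2.0))  # (1, 1024, 0): negative sign, exponent 1024 - 1023 = 1
```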
If you look at the code above closely, the division by 2 shifts the number one bit to the right at each iteration, starting from the whole value 1, or 2^{0}. At each step we add this shifted value to 1 and check whether the sum is still greater than 1. We keep shifting until the addition no longer registers a change: once the shifted value is too small to affect the sum, 1.0 + (macheps/2) rounds back to 1.0, which is no longer > 1.0, and the while loop exits.
Once the while loop quits, it is this threshold that defines machine epsilon. If you're curious, if you punch in 2^{-52} in the command prompt, you will get what eps is equal to:
>> 2^-52
ans =
2.220446049250313e-16
This makes sense as you are shifting one bit to the right 52 times, and the point before the loop stops would be at its LSB, which is 2^{-52}. For the sake of being complete, if you were to place a counter inside your while loop, and count up how many times the while loop executes, it will execute exactly 52 times, representing 52 bit shifts to the right:
macheps = 1;
count = 0;
while 1.0 + (macheps/2) > 1.0
macheps = macheps / 2;
count = count + 1;
end
>> count
count =
52
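Since Python floats are also IEEE 754 doubles, the same loop can be sketched there and behaves identically: it terminates after exactly 52 halvings and leaves 2^{-52}:

```python
# Halve until adding half of it to 1.0 no longer changes the sum.
macheps = 1.0
count = 0
while 1.0 + macheps / 2 > 1.0:
    macheps /= 2
    count += 1

print(macheps)  # 2.220446049250313e-16, i.e. 2**-52
print(count)    # 52
```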
It looks like you may want something like this:
eps = 1;
while (1.0 + eps > 1.0)
eps = eps /2;
end
Although there is a fine answer above, I thought I would add an Octave and MATLAB method mentioned in [1].
>> a = 1; b = 1; while a+b~=a; b= b/2; end
It is read as a plus b is not equal to a.
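For reference, the same method can be sketched in Python (not the book's exact code; any IEEE 754 double behaves the same). Note a subtlety: when the loop exits, b holds the first value for which a + b == a, which is eps/2 rather than eps, so machine epsilon is 2*b:

```python
a, b = 1.0, 1.0
while a + b != a:  # keep halving while b still registers in the sum
    b /= 2

print(2 * b)  # 2.220446049250313e-16, machine epsilon
```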
References:
[1] Alfio Quarteroni, Fausto Saleri, and Paola Gervasio. 2016. Scientific Computing with MATLAB and Octave. Springer Publishing Company, Incorporated.

mod() operation weird behavior

I use mod() to check whether a number's 0.01 (hundredths) digit is 2.
if mod(5.02*100, 10) == 2
...
end
The result is that mod(5.02*100, 10) == 2 returns 0 (false).
However, mod(1.02*100, 10) == 2 and mod(20.02*100, 10) == 2 both return 1 (true).
The result of mod(5.02*100, 10) - 2 is
ans =
-5.6843e-14
Could it be possible that this is a bug for matlab?
The version I used is R2013a. version 8.1.0
This is not a bug in MATLAB. It is a limitation of floating point arithmetic and of conversion between binary and decimal numbers. Even a simple decimal number such as 0.1 cannot be exactly represented as a binary floating point number with finite precision.
Computer floating point arithmetic is typically not exact. Although we are used to dealing with numbers in decimal format (base 10), computers store and process numbers in binary format (base 2). The IEEE standard for double precision floating point representation (see http://en.wikipedia.org/wiki/Double-precision_floating-point_format, which MATLAB uses) specifies the use of 64 bits to represent a binary number: 1 bit for the sign, 52 bits for the mantissa (the actual digits of the number), and 11 bits for the biased exponent (which specifies where the binary point goes).
When you enter a number into MATLAB, it is immediately converted to binary representation for all manipulations and arithmetic and then converted back to decimal for display and output.
Here's what happens in your example:
Convert to binary (keeping only up to 52 digits):
5.02 => 1.01000001010001111010111000010100011110101110000101e2
100 => 1.1001e6
10 => 1.01e3
2 => 1.0e1
Perform multiplication:
1.01000001010001111010111000010100011110101110000101 e2
x 1.1001 e6
--------------------------------------------------------------
0.000101000001010001111010111000010100011110101110000101
0.101000001010001111010111000010100011110101110000101
+ 1.01000001010001111010111000010100011110101110000101
-------------------------------------------------------------
1.111101011111111111111111111111111111111111111111111101e8
Cutting off at 52 digits gives 1.111101011111111111111111111111111111111111111111111e8
Note that this is not the same as 1.11110110e8 which would be 502.
Perform modulo operation: (there may actually be additional error here depending on what algorithm is used within the mod() function)
mod( 1.111101011111111111111111111111111111111111111111111e8, 1.01e3) = 1.111111111111111111111111111111111111111111100000000e0
The error is exactly -2^{-44}, which is -5.6843x10^{-14}. The conversion between decimal and binary and the rounding due to finite precision have caused a small error. In some cases you get lucky and the rounding errors cancel out, so you might still get the 'right' answer, which is why you got what you expected for mod(1.02*100, 10); but in general you cannot rely on this.
To use mod() correctly to test the particular digit of a number, use round() to round it to the nearest whole number and compensate for floating point error.
mod(round(5.02*100), 10) == 2
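The same behaviour is easy to reproduce in any IEEE 754 language; here is a Python sketch, where the % operator plays the role of mod():

```python
# 5.02 * 100 is not exactly 502: the product rounds to 502 - 2**-44.
print(5.02 * 100)             # 501.99999999999994
print((5.02 * 100) % 10 - 2)  # -5.684341886080802e-14, i.e. -2**-44

# Rounding to the nearest integer first recovers the intended value,
# so the digit test works as expected:
print(round(5.02 * 100) % 10 == 2)  # True
```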
What you're encountering is a floating point error or artifact, like the commenters say. This is not a Matlab bug; it's just how floating point values work. You'd get the same results in C or Java. Floating point values are "approximate" types, so exact equality comparisons using == without some rounding or tolerance are prone to error.
>> isequal(1.02*100, 102)
ans =
1
>> isequal(5.02*100, 502)
ans =
0
It's not the case that 5.02 is the only number this happens for; plenty of other values are affected as well. Here's an example that picks out several of them.
x = 1.02:1000.02;
ix = mod(x .* 100, 10) ~= 2;
disp(x(ix))
To understand the details of what's going on here (and in many other situations you'll encounter working with floats), have a read through the Wikipedia entry for "floating point", or my favorite article on it, "What Every Computer Scientist Should Know About Floating-Point Arithmetic". (That title is hyperbole; this article goes deep and I don't understand half of it. But it's a great resource.) This stuff is particularly relevant to Matlab because Matlab does everything in floating point by default.

Matlab indexing uses floating point numbers?

In short, my question is:
Is a double in Matlab really a double, or is it a class with the additional property to act as an integer?
And here is the context and motivation for the question :)
>> 1:4
ans =
1 2 3 4
>> class(ans)
ans =
double
Just doing this creates a double...
>> 1.00:4.00
ans =
1 2 3 4
>> class(ans)
ans =
double
...as does this, even though it's printed as integers.
The floating point nature of the numbers only shows when greater numerical uncertainty is introduced.
>> acosd(cosd(1:4))
ans =
0.999999999999900 1.999999999999947 3.000000000000045 4.000000000000041
Is a double in Matlab really a double, or is it a class with the additional property to act as an integer?
A vector defined with "integers" (which of course really are doubles) can be used to index another vector, which is usually a property of integers.
>> A = [9 8 7 6]
A =
9 8 7 6
>> idx = [4 3 2 1]
idx =
4 3 2 1
>> class(idx)
ans =
double
>> A(idx)
ans =
6 7 8 9
I also tried A(acosd(cosd(1:4))) which does not work.
It's just a double, but your command prompt format gives you the most compact view. Specifically,
format short
But you can change it to always display decimals, and lots of them, with
format longEng
There are many other options on the format help page.
Interestingly, you can use non-integer numbers as indexes with the colon operator, but it will warn. I would take this warning seriously as this indexing behavior is odd.
As I mentioned in my comments, the reason it is OK for MATLAB to use doubles for indexing has to do with the largest value of an integer that can be specified without losing precision in MATLAB. Double precision (64-bit) floating point numbers can exactly represent integers up to 2^53 (9,007,199,254,740,992) without losing any accuracy. The maximum array size allowed by MATLAB is far below this number, so there is no risk of indexing errors as a result of floating point precision.
In MATLAB, all numeric literals (i.e. numbers in the text of your program) are interpreted as double-precision. You must cast them explicitly to get any other type. It's worth remembering that IEEE floating point can exactly represent a wide range of integer values, up to FLINTMAX.
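The 2^53 limit is easy to verify in any IEEE 754 implementation; a quick Python check (Python floats are the same 64-bit doubles):

```python
# Integers up to 2**53 are exactly representable as doubles...
assert float(2**53 - 1) == 2**53 - 1   # still exact
assert float(2**53) == 2**53           # exact (a power of two)

# ...but beyond that, consecutive integers start to collide:
print(float(2**53 + 1) == float(2**53))  # True: 2**53 + 1 rounds to 2**53
```

This is why double-based indexing is safe in MATLAB: array sizes never come close to 2^53, so every valid index is represented exactly.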

Why abs(intmin) ~= -intmin in matlab

EDU>> intmin
ans =
-2147483648
EDU>> abs(intmin)
ans =
2147483647
How is this possible? There must be some sort of overflow or the definitions of these functions are mingling in strange ways.
For 2's complement signed integers of 32 bits, intmin is 0x80000000, or indeed -2147483648. However, intmax is 0x7FFFFFFF, which is only 2147483647. This means that the negation of intmin would be 2147483648, which can't be represented in 32-bit signed integers.
MATLAB actually does something odd. Under the normal rules of 2's complement, 0 - 0x80000000 should give 0x80000000 again. However, according to MATLAB, 0 - 0x80000000 = 0x7FFFFFFF. This should explain why abs(intmin) = intmax holds for MATLAB (but not necessarily in other languages).
This oddity has an interesting side-effect, however: you can assume that abs never returns a negative number.
In order to encode zero, there must be asymmetry among positive/negative two's complement integers.
Indeed, you are seeing integer overflow (saturation).
For each integer data type, there is a largest and smallest number that you can represent with that type:
When the result of an expression involving integers exceeds the maximum (or minimum) value of the data type, MATLAB maps the values that are outside the limit to the nearest endpoint. This saturation behavior explains what you are seeing rather than a weird case of overflow in binary representation (which "wraps around" in 2's complement).
Example:
>> x = int8(100)
x =
100
>> x + x
ans =
127
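MATLAB's saturating integer arithmetic can be sketched in Python with an explicit clamp. Python's own integers are arbitrary-precision and never overflow, so the clamp is simulated here; the names `saturate32`, INT32_MIN, and INT32_MAX are illustrative, not a real API:

```python
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1  # intmin/intmax for int32

def saturate32(x):
    """Clamp to the int32 range, as MATLAB does on integer overflow."""
    return max(INT32_MIN, min(INT32_MAX, x))

# abs(intmin) would be 2**31, which does not fit, so MATLAB saturates:
print(saturate32(abs(INT32_MIN)))  # 2147483647, i.e. intmax

# Two's-complement wraparound (what C-style arithmetic would do) gives
# intmin back instead:
wrapped = abs(INT32_MIN) & 0xFFFFFFFF
print(wrapped - 2**32 if wrapped >= 2**31 else wrapped)  # -2147483648
```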

Why inverse equality does not satisfy in MATLAB?

MATLAB does not seem to satisfy the matrix identity for the inverse of a product, that is:
(A*B*C)^{-1} = C^{-1} * B^{-1} * A^{-1}
in MATLAB,
if inv(A*B*C) == inv(C)*inv(B)*inv(A)
disp('satisfied')
end
The equality does not hold. When I switched to format long, I saw that the two results differ in the last decimal places, but the equality still fails when I use format rat.
Why is that so?
Very likely a floating point error. Note that the format function affects only how numbers display, not how MATLAB computes or saves them. So setting it to rat won't help the inaccuracy.
I haven't tested it, but you may try the Fractions Toolbox for exact rational number arithmetic, which should make the equality above hold exactly.
Consider this (MATLAB R2011a):
a = 1e10;
>> b = inv(a)*inv(a)
b =
1.0000e-020
>> c = inv(a*a)
c =
1.0000e-020
>> b==c
ans =
0
>> format hex
>> b
b =
3bc79ca10c924224
>> c
c =
3bc79ca10c924223
When MATLAB calculates the intermediate quantities inv(a), or a*a (whether a is a scalar or a matrix), it by default stores them as the closest double precision floating point number - which is not exact. So when these slightly inaccurate intermediate results are used in subsequent calculations, there will be round off error.
Instead of comparing floating point numbers for direct equality, such as inv(A*B*C) == inv(C)*inv(B)*inv(A), it's often better to compare the absolute difference to a threshold, such as abs(inv(A*B*C) - inv(C)*inv(B)*inv(A)) < thresh. Here thresh can be an arbitrarily small number, or some expression involving eps, which gives you the spacing between adjacent floating point numbers at the precision at which you're working.
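The same one-ulp discrepancy, and the threshold-based comparison that handles it, can be sketched in Python (floats there are IEEE 754 doubles too; the tolerance values are illustrative):

```python
import math

a = 1e10
b = (1 / a) * (1 / a)  # round inv(a), then round the product
c = 1 / (a * a)        # round a*a, then round the inverse
print(b == c)          # False: the two paths round differently by one ulp

# Compare against a relative tolerance instead of using ==:
print(abs(b - c) < 1e-12 * abs(c))        # True
# or, more idiomatically:
print(math.isclose(b, c, rel_tol=1e-12))  # True
```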
The format command only controls the display of results at the command line, not the way in which results are internally stored. In particular, format rat does not make MATLAB do calculations symbolically. For this, you might take a look at the Symbolic Math Toolbox. format hex is often even more useful than format long for diagnosing floating point precision issues such as the one you've come across.