Set mantissa length and cut off digits - matlab

I want MATLAB to behave like a floating point unit with (let's say) 5 digits, meaning that the mantissa is set to 5 digits.
Therefore I want to write a function that converts any number to the desired format, which I can apply after every arithmetical operation.
What it should do is normalize the number to zero-point-something times ten to some power, cutting off leftover digits and NOT rounding them.
It is important that the number is not only displayed that way, but actually stored with limited significant digits.
Examples:
d = 5; % 5 significant digits, meaning 5 digits mantissa
0.12345 = foo(0.123456, d)
0.12345e-1 = foo(0.0123456, d)
0.12345e+3 = foo(123.456789, d)
How does one do that? Is there any built-in MATLAB functionality that allows it?
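The truncation step itself is language-independent, so here is a minimal sketch in Python (standard library only) rather than MATLAB. The name `chop` is made up for this illustration. It assumes it is acceptable to truncate the shortest decimal string that round-trips to the stored double, and the result is still a binary double, so a value like 0.12345 is stored as the nearest representable double.

```python
from decimal import Context, ROUND_DOWN

def chop(x, d):
    """Truncate (not round) x to d significant decimal digits."""
    # prec is the number of significant digits; ROUND_DOWN truncates toward zero.
    ctx = Context(prec=d, rounding=ROUND_DOWN)
    # repr(x) is the shortest decimal string that round-trips to x, so the
    # truncation acts on the digits a user would read, e.g. "0.123456".
    return float(ctx.create_decimal(repr(x)))

print(chop(0.123456, 5))
print(chop(0.0123456, 5))
print(chop(123.456789, 5))
```

Applying such a function after every arithmetic operation emulates a 5-digit decimal mantissa only approximately, because each intermediate result is first rounded to a double.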

Related

How can 3-state bits be packed together?

I am looking for a clever solution that would allow packing at least nine 3-state 'bits' into a 16-bit integer. It should also still be possible to easily set the value of one of these 3-state 'bits'.
As an example, it could be used to encode a tic-tac-toe position, the three states being _ (empty), X (me), O (opponent) for the nine squares of the board.
Naturally, using 2 bits per square would do the job, but it would require 18 bits overall. Is there an encoding that would use at most 1.7 bits per square and still stay simple to work with?
You can store ten 3-state values in a 16-bit integer, since 3^10 = 59049 < 65536. Simply encode a 10-digit base-3 number into a 16-bit integer, and pull the digits out going the other way.
To encode each digit d, the repeated operation is n = 3*n + d. To decode the digits in the opposite order, the repeated operations are d = n % 3 and n /= 3.
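A minimal sketch of that encoding in Python. The helper names (`pack`, `unpack`, `get_trit`, `set_trit`) are made up for this illustration. `get_trit` and `set_trit` address one 3-state value directly, counting position 0 as the last digit that was packed.

```python
def pack(digits):
    """Encode a sequence of base-3 digits (each 0, 1 or 2) into one integer."""
    n = 0
    for d in digits:
        n = 3 * n + d
    return n

def unpack(n, count):
    """Decode count digits; they come out in the opposite order to pack()."""
    digits = []
    for _ in range(count):
        digits.append(n % 3)
        n //= 3
    return digits

def get_trit(n, i):
    """Read the 3-state value at position i (0 = last digit packed)."""
    return (n // 3**i) % 3

def set_trit(n, i, value):
    """Return n with the 3-state value at position i replaced by value."""
    return n + (value - get_trit(n, i)) * 3**i

board = pack([1, 2, 0, 1])
print(board, unpack(board, 4), get_trit(set_trit(board, 1, 2), 1))
```

Ten digits that are all 2 pack to 3^10 - 1 = 59048, which still fits in an unsigned 16-bit integer.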

How is eps() calculated in MATLAB?

The eps routine in MATLAB essentially returns the positive distance between floating point numbers. It can take an optional argument, too.
My question: How does MATLAB calculate this value? (Does it use a lookup table, or does it use some algorithm to calculate it at runtime, or something else...?)
Related: how could it be calculated in any language providing bit access, given a floating point number?
Wikipedia has quite the page on it.
Specifically for MATLAB, eps is 2^(-52), as MATLAB uses double precision by default.
A double has one bit for the sign, 11 for the exponent and the remaining 52 for the fraction.
The MATLAB documentation on floating point numbers also shows this.
d = eps(x), where x has data type single or double, returns the positive distance from abs(x) to the next larger floating-point number of the same precision as x.
As not all fractions are equally closely spaced on the number line, different fractions will show different distances to the next floating-point within the same precision. Their bit representations are:
1.0 = 0 01111111111 0000000000000000000000000000000000000000000000000000
0.9 = 0 01111111110 1100110011001100110011001100110011001100110011001101
The sign of both is positive (0), the exponents differ, and of course their fractions are very different. This means that the distances to the next floating-point number are:
dec2bin(typecast(eps(1.0), 'uint64'), 64) = 0 01111001011 0000000000000000000000000000000000000000000000000000
dec2bin(typecast(eps(0.9), 'uint64'), 64) = 0 01111001010 0000000000000000000000000000000000000000000000000000
which are not the same, hence eps(0.9)~=eps(1.0).
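For the "any language providing bit access" part of the question, here is a minimal sketch in Python using the standard `struct` module. The helper names are made up, and it assumes a finite, non-negative-after-abs input below the largest double. The idea is to reinterpret the double as a 64-bit integer, add 1 to get the next larger double, and subtract.

```python
import struct

def bits(x):
    """Return the (sign, exponent, fraction) bit strings of a double."""
    (n,) = struct.unpack(">Q", struct.pack(">d", x))
    b = format(n, "064b")
    return b[0], b[1:12], b[12:]

def eps_from_bits(x):
    """Distance from abs(x) to the next larger double, via integer increment."""
    ax = abs(x)
    (n,) = struct.unpack(">Q", struct.pack(">d", ax))
    (nxt,) = struct.unpack(">d", struct.pack(">Q", n + 1))
    return nxt - ax

print(bits(1.0))
print(bits(0.9))
print(eps_from_bits(1.0), eps_from_bits(0.9))
```

Whether MATLAB's eps is implemented this way is not something this sketch claims; it only reproduces the same values for ordinary inputs.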
Here is some insight into eps which will help you to write an algorithm.
See that eps(1) = 2^(-52). Now, say you want to compute the eps of 17179869183.9. Note that I have chosen a number which is 0.1 less than 2^34 (in other words, something like 2^(33.9999...)). To compute the eps of this, take log2 of the number, which is ~33.99999... as mentioned before. Take the floor() of this value and add it to -52, since eps(1) = 2^(-52) and the given number is 2^(33.999...). Therefore, log2(eps(17179869183.9)) = -52 + 33 = -19, i.e. eps(17179869183.9) = 2^(-19).
If you take a number which is fractionally more than 2^34, e.g., 17179869184.1, then log2(eps(17179869184.1)) = -18. This also shows that the eps value changes at numbers that are integer powers of your base (or radix), in this case 2. Since the eps value only changes at integer powers of 2, we take the floor of the power. You can get the exact value of eps for any number this way. I hope it is clear.
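The rule above can be sketched in Python as follows. The function name `eps_rule` is made up, and the sketch assumes a finite, nonzero, normal double. It uses `math.frexp` instead of `floor(log2(x))` because `frexp` reads the binary exponent exactly, whereas `log2` can round up for values just below a power of 2.

```python
import math

def eps_rule(x):
    """eps(x) = 2^(floor(log2(|x|)) - 52) for normal doubles."""
    # frexp gives |x| = m * 2**e with 0.5 <= m < 1, so floor(log2(|x|)) == e - 1.
    m, e = math.frexp(abs(x))
    return math.ldexp(1.0, (e - 1) - 52)

print(math.log2(eps_rule(17179869183.9)))
print(math.log2(eps_rule(17179869184.1)))
```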
MATLAB (along with other languages) uses the IEEE 754 standard for representing real floating point numbers.
In this format the bits allocated for approximating the actual [1] real number, usually 32 for single or 64 for double precision, are grouped into 3 groups:
1 bit for determining the sign, s.
8 (or 11) bits for exponent, e.
23 (or 52) bits for the fraction, f.
Then a real number, n, is approximated by the following three-term relation:
n = (-1)^s * 2^(e - bias) * (1 + f)
where the bias negatively offsets [2] the stored exponent e so that the effective exponent can also be negative, which lets the format describe numbers between 0 and 1 as well as larger ones.
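To make the three-term relation concrete, here is a minimal Python sketch that pulls s, e and the fraction out of a double and rebuilds the value. The helper names are made up, the bias of 1023 is the double-precision one, and the sketch only handles normal numbers (not zero, subnormals, Inf or NaN).

```python
import struct

def decode_double(x):
    """Split a double into sign bit s, stored exponent e and fraction f in [0, 1)."""
    (n,) = struct.unpack(">Q", struct.pack(">d", x))
    s = n >> 63
    e = (n >> 52) & 0x7FF
    f = (n & ((1 << 52) - 1)) / 2.0**52
    return s, e, f

def rebuild(s, e, f, bias=1023):
    """Evaluate (-1)^s * 2^(e - bias) * (1 + f)."""
    return (-1) ** s * 2.0 ** (e - bias) * (1 + f)

s, e, f = decode_double(0.9)
print(s, e, f, rebuild(s, e, f))
```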
Now, the gap reflects the fact that real numbers do not map perfectly to their finite, 32- or 64-bit, representations. Moreover, a whole range of real numbers that differ by less than eps in absolute value maps to a single value in computer memory, i.e. if you assign a value val to a variable var_i:
var_1 = val - offset
...
var_i = val;
...
var_n = val + offset
where
offset < eps(val) / 2
Then:
var_1 = var_2 = ... = var_i = ... = var_n.
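A small Python check of this collapse, as a sketch. It assumes Python 3.9 or later for `math.ulp`, which plays the role of MATLAB's eps(val) here, and the value 1.5 is an arbitrary choice away from a power of 2.

```python
import math

val = 1.5
gap = math.ulp(val)      # distance from val to the next larger double
offset = gap / 4         # well below gap / 2

# Both neighbours round back to val, so all three compare equal.
collapsed = (val - offset == val == val + offset)
# A full gap is large enough to reach the next representable number.
moved = (val + gap != val)

print(collapsed, moved)
```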
The gap is determined from the second term containing the exponent (or characteristic):
2^(e - bias)
in the above relation [3], which determines the "scale" of the "line" on which the approximated numbers are located: the larger the numbers, the larger the distance between them and the less precise they are, and vice versa: the smaller the numbers, the more densely located their representations are and, consequently, the more accurate.
In practice, to determine the gap of a specific number, eps(number), you can start by adding / subtracting a gradually increasing small number until the initial value of the number of interest changes. The offset at which this happens is about half the gap in that (positive or negative) direction, i.e. eps(number) / 2.
To check possible implementations of MATLAB's eps (or ULP, unit in the last place, as it is called in other languages), you could search for ULP implementations in C, C++ or Java, which are the languages MATLAB is written in.
1. Real numbers are infinitely precise, i.e. they could be written with arbitrary precision, with any number of digits after the decimal point.
2. Usually around the half: in single precision 8 bits mean decimal values from 0 to 2^8 - 1 = 255; around the half in our case is 127, i.e. 2^(e - 127).
3. It can be thought that 2^(e - bias) represents the most significant digits of the number, i.e. the digits that describe how big the number is, as opposed to the least significant digits that describe its precise location. The larger the term containing the exponent, the smaller the significance of the 23 bits of the fraction.

Matlab representation of floating point numbers

In MATLAB, realmax('single') returns ans = 3.4028e+38. I am trying to understand how this number follows from the computer's binary representation, but I am a little bit confused.
I understand that realmax('single') is the highest floating point number representable in single precision, which is 32 bits. That means the binary representation consists of 1 bit for the sign, 23 bits for the mantissa and 8 bits for the exponent. And 3.4028e+38 is the decimal representation of the highest single precision floating point number, but I don't know how that number was derived.
Now, typing in 2^128 gives me the same answer as 3.4028e+38, but I don't understand the correlation.
Can someone help me understand why 3.4028e+38 is the largest returned result for a floating point number in a 32-bit format, from a binary representation perspective? Thank you.
As IEEE 754 specifies, the largest magnitude exponent is Emax = 127 (decimal) = 7F (hex) = 0111 1111 (binary). This is encoded as 254 (decimal) = FE (hex) = 1111 1110 (binary) in the 8 exponent bits. The exponent 255 (decimal) = FF (hex) = 1111 1111 (binary) is reserved for representing infinity, so 254 is the largest available. The exponent bias 127 is subtracted from the exponent bits, leading to 254 - 127 = 127. The largest mantissa is attained when all 23 mantissa bits are set to 1. The largest value that can be represented is then 1.11111111111111111111111 (binary) x 2^127 = 3.4028235 x 10^38.
This site lets you set the bits of the representation and see the IEEE754 value represented in decimal, binary, and hexadecimal.
Also note that the largest value is less than 2^128. You can see a more precise representation of the numbers output in MATLAB by using format long. The reason they are similar is that 1.11111111111111111111111 (binary) x 2^127 is close to 10 (binary) x 2^127 = 1 x 2^128.
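The arithmetic can be checked with a short Python sketch (standard library only; the variable names are made up). A mantissa with all 23 fraction bits set equals 2 - 2^(-23), and the bit pattern 0x7F7FFFFF is sign 0, exponent bits 1111 1110 and all fraction bits 1.

```python
import struct

# All 23 fraction bits set: 1.111...1 (binary) = 2 - 2^-23, scaled by 2^127.
largest = (2 - 2.0 ** -23) * 2.0 ** 127

# The same value read back from the single-precision bit pattern.
(from_bits,) = struct.unpack(">f", bytes.fromhex("7f7fffff"))

print(f"{largest:.7e}", f"{2.0 ** 128:.7e}")
print(f"{largest:.4e}", f"{2.0 ** 128:.4e}")
```

With only five significant digits shown, both values print as 3.4028e+38, which matches what the question observed in MATLAB's default display.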
As a precursor to the single precision binary floating point representation of a number in a computer, I start with discussing what is known as "scientific notation" for a decimal number.
Using a base 10 number system, every positive decimal number has a first non-zero leading digit in the set {1..9}. (All other digits are in the set {0..9}.) The number's decimal point may always be shifted to the immediate right of this leading digit by multiplying by an appropriate power 10^n of the number base 10. E.g. 0.012345 = 1.2345*10^n where n = -2. This means that the decimal representation of every non-zero number x can be made to take the form
x = (+/-)(i.jklm...)*10^n ; where i is in the set {1,2,3,4,5,6,7,8,9}.
The leading decimal factor (i.jklm...), known as the "mantissa" of x, is a number in the interval [1,10), i.e. greater than or equal to 1 and less than 10. The largest the mantissa can be is 9.9999... so that the real "size" of the number x is an integer stored in the exponential factor 10^n. If the number x is very large, n >> 0, while if x is very small n << 0.
We now want to revisit these ideas using the base 2 number system associated with computer storage of numbers. Computers represent a number internally using the base 2 rather than the more familiar base 10. All the digits used in a "binary" representation of a number belong to the set {0,1}. Using the same kind of thinking to represent x in a binary representation as we did in its decimal representation, we see that every positive number x has the form
x = (+/-)(i.jklm...)*2^n ; where i = 1,
while the remaining digits belong to {0,1}.
Here the leading binary factor (mantissa) i.jklm... lies in the interval [1,2), rather than the interval [1,10) associated with the mantissa in the decimal system. Here the mantissa is bounded by the binary number 1.1111..., which is always less than 2 since in practice there will never be an infinite number of digits. As before, the real "size" of the number x is stored in the integer exponential factor 2^n. When x is very large then n >> 0 and when x is very small n << 0. The exponent n is also expressed in the binary system. Therefore every digit in the binary floating point representation of x is either a 0 or a 1. Each such digit is one of the "bits" used in the computer's memory for storing x.
The standard convention for a (single precision) binary representation of x is accomplished by storing exactly 32 bits (0's or 1's) in computer memory. The first bit is used to signify the arithmetic "sign" of the number. This leaves 31 bits to be distributed between the mantissa (i.jklm...) of x and the exponential factor 2^n. (Recall i = 1 in i.jklmn... so none of the 31 bits is required for its representation.) At this point an important "trade off" comes into play:
The more bits that are dedicated to the mantissa (i.jkl...) of x, the fewer that are available to represent the exponent n in its exponential factor 2^n. Conventionally 23 bits are dedicated to the mantissa of x. (It is not hard to show that this allows approximately 7 digits of accuracy for x when regarded in the decimal system, which is adequate for most scientific work.) With the very first bit dedicated to storing the sign of x, this leaves 8 bits that can be used to represent n in the factor 2^n. Since we want to allow for very large x and very small x, the decision has been made to store 2^n in the form
2^n = 2^(m-127) ; n = m - 127,
where the exponent m is stored rather than n. Utilizing 8 bits, this means m belongs to the set of binary integers {00000000, 00000001, ..., 11111111}. Since it is easier for humans to think in the decimal system, this means that m belongs to the set of values {0, 1, ..., 255}. IEEE 754 reserves m = 0 (zero and subnormal numbers) and m = 255 (infinity and NaN), so ordinary numbers use m in {1, ..., 254}. Subtracting 127, this means in turn that n belongs to the number set {-126, -125, ..., 0, 1, ..., 127}, i.e.
-126 <= n <= 127.
The largest the exponential factor 2^n of our binary floating point representation of x can be is then seen to be 2^n = 2^127, or viewed in the decimal system (use any calculator to evaluate 2^127)
2^n <= 1.7014...*10^38.
Summarizing, the largest number x that can possibly be stored in single precision floating point in a computer under the IEEE format is a number of the form
x = y*(1.7014...*10^38).
Here the mantissa y lies in the (half-closed, half-open) interval [1,2). With 23 mantissa bits, its largest value is 2 - 2^(-23) = 1.99999988...
The largest single precision number is therefore (2 - 2^(-23))*2^127 = 3.4028235*10^38, which is what realmax('single') reports. It lies just below 2^128 = 3.4028237*10^38, which is why 2^128 displays as the same 3.4028e+38 in MATLAB's short format.

TI Basic Numeric Standard

Are numeric variables following a documented standard on TI calculators ?
I've been really surprised noticing on my TI 83 Premium CE that this test actually returns true (i.e. 1) :
0.1 -> X
0.1 -> Y
0.01 -> Z
X*Y=Z
I was expecting this to fail, assuming my calculator would use something like the IEEE 754 standard to represent floating point numbers.
On the other hand, calculating 2^50+3-2^50 returns 0, showing that large integers seem to use such a standard: we see here that the big number has a limited mantissa.
TI-BASIC's = is a tolerant comparison
Try 1+10^-12=1 on your calculator. Those numbers aren't represented equally (1+10^-12-1 gives 1E-12), but you'll notice the comparison returns true: that's because = has a certain amount of tolerance. AFAICT from testing on my calculator, if the numbers are equal when rounded to ten significant digits, = will return true.
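A rough Python model of such a tolerant comparison, as a sketch only. It encodes the empirical observation above (equal after rounding to ten significant digits), not TI's documented behaviour, and the helper names and the round-half-up choice are assumptions.

```python
from decimal import Context, Decimal, ROUND_HALF_UP

def round_sig(x, digits=10):
    """Round the exact value of the float x to the given number of significant digits."""
    ctx = Context(prec=digits, rounding=ROUND_HALF_UP)
    return ctx.create_decimal(Decimal(x))

def tolerant_eq(a, b, digits=10):
    """Compare two numbers after rounding both to `digits` significant digits."""
    return round_sig(a, digits) == round_sig(b, digits)

print(0.1 * 0.1 == 0.01)             # exact comparison of binary doubles
print(tolerant_eq(0.1 * 0.1, 0.01))  # tolerant comparison
print(tolerant_eq(1 + 1e-12, 1))
```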
Secondarily,
TI-BASIC uses a proprietary BCD float format
TI floats are a BCD format that is nine bytes long, with one byte for sign and auxiliary information and 14 digits (7 bytes) of precision. The ninth byte is used for extra precision so numbers can be rounded properly.
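Both observations from the question can be reproduced with decimal arithmetic limited to 14 significant digits. The sketch below uses Python's `decimal` module as a stand-in. It is not TI's actual format or rounding logic, only a model with the same digit count, and the variable names are made up.

```python
from decimal import Context, Decimal

ti = Context(prec=14)  # assumed model: 14 significant decimal digits

# 0.1 and 0.01 are exact in decimal, so the product matches exactly.
product_matches = ti.multiply(Decimal("0.1"), Decimal("0.1")) == Decimal("0.01")

# 2^50 has 16 digits, so it is already rounded when stored with 14.
big = ti.create_decimal(2 ** 50)
difference = ti.subtract(ti.add(big, Decimal(3)), big)

print(product_matches, big, difference)
```

In this model the added 3 falls below the last kept digit of 2^50, so it is lost and the difference comes out as zero.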
See a source linked to by #doynax here for more information.

Maximum double value (float) possible in MATLAB (64-bit)

I'm aware that double is the default data-type in MATLAB.
When you compare two double numbers that have no fractional part, MATLAB is accurate up to the 17th digit in my testing.
a=12345678901234567 ; b=12345678901234567; isequal(a,b) --> TRUE
a=123456789012345671; b=123456789012345672; isequal(a,b) --> printed as TRUE
I have found a conservative estimate is to use (non-floating) numbers only up to the 13th digit, as other functions can become unreliable after that (such as ismember, or the MEX functions ismembc etc.).
Is there a similar cutoff for floating values? E.g., if I use shares-outstanding for a company which can be very very large with decimal places, when do I start losing decimal accuracy?
a = 1234567.89012345678 ; b = 1234567.89012345679 ; isequal(a,b) --> printed as TRUE
a = 123456789012345.678 ; b = 123456789012345.677 ; isequal(a,b) --> printed as TRUE
isequal may not be the right tool to use for comparing such numbers. I'm more concerned about how many decimal places I should trust once the integer part of a number starts growing.
It's usually not a good idea to test the equality of floating-point numbers. The behavior of binary floating-point numbers can differ drastically from what you may expect from base-10 decimals. Consider the example:
>> isequal(0.1, 0.3/3)
ans =
0
Ultimately, you have 53 bits of precision. This means that integers can be represented exactly (with no loss in accuracy) up to the number 2^53 (which is a little over 9 x 10^15). After that, well:
>> (2^53 + 1) - 2^53
ans =
0
>> 2^53 + (1 - 2^53)
ans =
1
For non-integers, you are almost never going to be representing them exactly, even for simple-looking decimals such as 0.1 (as shown in that first example). However, double precision still guarantees you at least 15 significant figures of precision.
This means that if you take any number and round it to the nearest number representable as a double-precision floating point, then this new number will match your original number at least up to the first 15 digits (regardless of where these digits are with respect to the decimal point).
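To see how the trustworthy decimal places shrink as the integer part grows, one can look at the spacing between adjacent doubles near a value. Here is a minimal Python sketch, assuming Python 3.9 or later for `math.ulp` (the counterpart of MATLAB's eps(x)). The helper name `decimal_places_resolved` is made up, and its result is a rough guide rather than a guarantee.

```python
import math

def decimal_places_resolved(x):
    """Rough count of digits after the decimal point that doubles near x can distinguish."""
    gap = math.ulp(x)  # spacing between x and the next larger double
    return max(0, math.floor(-math.log10(gap)))

for x in (1234567.89, 123456789012345.678):
    print(x, math.ulp(x), decimal_places_resolved(x))

# With a spacing of 2^-6 = 0.015625, these two literals land on the same double.
print(123456789012345.678 == 123456789012345.677)
```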
You might want to use variable precision arithmetic (VPA) in MATLAB. It computes expressions exactly up to a given digit count, which may be quite large. See here.
Check out the MATLAB function flintmax which tells you the maximum consecutive integers that can be stored in either double or single precision. From that page:
flintmax returns the largest consecutive integer in IEEE® double
precision, which is 2^53. Above this value, double-precision format
does not have integer precision, and not all integers can be
represented exactly.