8-bit unsigned fixed point implementation with multiplication and clamping

I'd like to represent numbers in the range [0.0, 1.0] (ideally including both endpoints) using 8-bit words.
I'd like to be able to multiply them efficiently, and addition/subtraction should ideally be clamped to [0,1] rather than overflow.
For example, if 0xFF would represent 1.0 and 0x00 would represent 0.0, then the multiplication should yield for example
0x3F (0.247) = 0x7F (0.499) * 0x7F (0.499)
I found https://courses.cs.washington.edu/courses/cse467/08au/labs/l5/fp.pdf and I think that what the paper would name U(0,8) corresponds to what I'm looking for, but I don't understand how multiplication for example would need to be implemented.
Is there a C++ library that efficiently implements such a data type, or can someone point me to the necessary math?
I don't need division, only multiplication, addition and subtraction

The fixed-point format you have chosen, U[0.8], does not include the exact endpoint value of 1. The maximum value in this format is actually 255/256 = 0.99609375. If that's close enough for you, we can talk about doing the math.
Multiplying two U[0.8] values gives a 16-bit result in U[0.16] format. To convert back to U[0.8] you must shift right by 8 bit positions. So, multiplying 0x7F times 0x7F gives 0x3F01. Shifting right by 8 bits gives the U[0.8] result of 0x3F, as desired.
Two values in U[0.8] format can be added or subtracted using normal integer operations. However, you must either prevent overflow/underflow or detect overflow/underflow in the result. To detect overflow in addition you could zero-extend both values to 16 bits, perform the addition, and check to see if the result is greater than 0xFF. If so, you could saturate and return 0xFF.
For subtraction you could compare the values before doing the subtraction, and if the result would be negative just return zero.
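The three operations described above can be sketched as follows. This is only an illustration in Python (the question asks about C++, where the same integer operations map onto uint8_t operands and a uint16_t intermediate); the function names q8_mul, q8_add_sat and q8_sub_sat are made up for this example, and the inputs are assumed to be integers in 0..255.

```python
def q8_mul(a, b):
    """Multiply two U[0.8] values: the 16-bit U[0.16] product is shifted right by 8."""
    return (a * b) >> 8

def q8_add_sat(a, b):
    """Add two U[0.8] values in a wider type and saturate at 0xFF on overflow."""
    return min(a + b, 0xFF)

def q8_sub_sat(a, b):
    """Subtract b from a, returning 0 if the result would be negative."""
    return a - b if a > b else 0

print(hex(q8_mul(0x7F, 0x7F)))      # 0x3f, as in the question
print(hex(q8_add_sat(0xF0, 0x20)))  # saturates to 0xff
print(hex(q8_sub_sat(0x10, 0x20)))  # clamps to 0x0
```

Note that the shift truncates rather than rounds, so 0xFF * 0xFF yields 0xFE, consistent with 0xFF standing for 255/256 rather than 1.0.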

Related

MATLAB numeric precision when generating a numeric sequence

I was testing an operation like this:
[input] 3.9/0.1 : 4.1/0.1
[output] 39 40
I don't know why 4.1/0.1 is approximated to 40. If I add a round(), it works as expected:
[input] 3.9/0.1 : round(4.1/0.1)
[output] 39 40 41
What's wrong with the first operation?
In this Q&A I go into detail on how the colon operator works in MATLAB to create a range. But the detail that causes the issue described in this question is not covered there.
That post includes the full code for a function that imitates exactly what the colon operator does. Let's follow that code. We start with start = 3.9/0.1, which is exactly 39, and stop = 4.1/0.1, which, due to rounding errors, is just slightly smaller than 41, and step = 1 (the default if it's not given).
It starts by computing a tolerance:
tol = 2.0*eps*max(abs(start),abs(stop));
This tolerance is intended to be used so that the stop value, if within tol of an exact number of steps, is still used, if the last step would step over it. Without a tolerance, it would be really difficult to build correct sequences using floating-point end points and step sizes.
However, then we get this test:
if start == floor(start) && step == 1
% Consecutive integers.
n = floor(stop) - start;
elseif ...
If the start value is an exact integer, and the step size is 1, then it forces the sequence to be an integer sequence. Unfortunately, it does so by taking the number of steps as the distance between floor(stop) and start. That is, it is not using the tolerance computed earlier in determining the right stop! If stop is slightly above an integer, that integer will be in the range. If stop is slightly below an integer (as in the case of the OP), that integer will not be part of the range.
It could be debated whether MATLAB should round the stop number up in this case or not. MATLAB chose not to. All of the sequences produced by the colon operator use the start and stop values exactly as given by the user. It leaves it up to the user to ensure the bounds of the sequence are as required.
However, if the colon operator hadn't special-cased the sequence of integers, the result would have been less surprising in this case. Let's add a very small number to the start value, so it's not an integer:
>> a = 3.9/0.1 : 4.1/0.1
a =
39 40
>> b = 3.9/0.1 + eps(39) : 4.1/0.1
b =
39.0000 40.0000 41.0000
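The effect of the integer special case can be reproduced outside MATLAB, since Python floats are the same IEEE-754 doubles. The sketch below mirrors only the quoted branch, n = floor(stop) - start; it is not MATLAB's full colon algorithm, and the helper name integer_branch_count is hypothetical.

```python
import math

def integer_branch_count(start, stop):
    """Number of steps as in the quoted branch: floor(stop) - start."""
    return int(math.floor(stop) - start)

start = 3.9 / 0.1   # exactly 39.0
stop = 4.1 / 0.1    # slightly below 41.0

n = integer_branch_count(start, stop)
seq = [start + k for k in range(n + 1)]
print(seq)          # 41 is missing because floor(stop) is 40

n_rounded = integer_branch_count(start, round(stop))
seq_rounded = [start + k for k in range(n_rounded + 1)]
print(seq_rounded)  # rounding the stop value first brings 41 back
```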
Floating-point numbers lose precision because they are stored in a fixed number of bits (64 bits in MATLAB by default). There are infinitely many real numbers (even within a small range such as 0.0 to 0.1), whereas an n-bit binary pattern can represent at most 2^n distinct numbers. Hence, not all real numbers can be represented; the nearest representable value is used instead, resulting in a loss of accuracy.
Neither 4.1 nor 0.1 is exactly representable, and the exact quotient of the two stored doubles is
4.1/0.1 ≈ 40.9999999999999941713291207...
which is then rounded to the nearest 64-bit double, the one just below 41. So, in essence, 4.1/0.1 < 41.0, and that's what you get from the range. If you subtract, you get 41 - 4.1/0.1 = 7.105427357601002e-15. But when you round, you get the closest integer, 41.0, as expected.
The representation scheme for 64-bit double-precision according to the IEEE-754 standard:
The most significant bit is the sign bit (S), with 0 for positive numbers and 1 for negative numbers.
The following 11 bits represent exponent (E).
The remaining 52 bits represent the fraction (F).
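To see these three fields for a concrete value, the bit pattern of a double can be unpacked directly. This is a small Python sketch using the standard struct module; decode_double is a name chosen for this example.

```python
import struct

def decode_double(value):
    """Split a 64-bit double into (sign, biased exponent, fraction) as integers."""
    (word,) = struct.unpack(">Q", struct.pack(">d", value))
    sign = word >> 63
    exponent = (word >> 52) & 0x7FF      # 11 bits, biased by 1023
    fraction = word & ((1 << 52) - 1)    # 52 bits
    return sign, exponent, fraction

print(decode_double(41.0))
print(decode_double(4.1 / 0.1))  # same sign and exponent, fraction one unit smaller
```

The two results differ by exactly one unit in the last place of the fraction, which is why 4.1/0.1 lands just below 41.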

How to convert values to N bits of resolution in MATLAB?

My computer uses 32 bits of resolution as default. I'm writing a script that involves taking measurements with a multimeter that has N bits of resolution. How do I convert the values to that?
For example, if I have a RNG that gives 1000 values
nums = randn(1,1000);
and I use an N-bit multimeter to read those values, how would I get the values to reflect that?
I currently have
meas = round(nums,N-1);
but it's giving me N digits, not N bits. The original random numbers are unbounded, but the resolution of the multimeter is the limitation; how to implement the limitation is what I'm looking for.
Edit I: I'm talking about the resolution of measurement, not the bounds of the numbers. The original values are unbounded. The accuracy of the measured values should be limited by the resolution.
Edit II: I revised the question to try to be a bit clearer.
randn doesn’t produce bounded numbers. Let’s say you are producing 32-bit integers instead:
nums = randi([0,2^32-1],1,n);
To drop the bottom 32-N bits, simply divide by an appropriate value and round (or take the floor):
nums = round(nums/(2^(32-N)));
Do note that we only use floating-point arithmetic here: the numbers are integer-valued, but they are not stored as an integer type. You can do a similar operation using actual integers if you need that.
Also, obviously, N should be lower than 32. You cannot invent new bits. If N is larger, the code above will add zero bits at the bottom of the number.
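With actual integers, dropping the bottom 32-N bits is a right shift. A minimal Python sketch, assuming unsigned 32-bit inputs and 0 < N <= 32 (drop_low_bits is a hypothetical name; it floors instead of rounding):

```python
def drop_low_bits(value, n_bits):
    """Keep the top n_bits of an unsigned 32-bit integer (floor, no rounding)."""
    if not 0 < n_bits <= 32:
        raise ValueError("n_bits must be in 1..32")
    if not 0 <= value < 2 ** 32:
        raise ValueError("value must fit in 32 unsigned bits")
    return value >> (32 - n_bits)

print(hex(drop_low_bits(0x12345678, 16)))  # 0x1234
print(hex(drop_low_bits(0xFFFFFFFF, 8)))   # 0xff
```

Flooring keeps the result within N bits; rounding the largest inputs up would need one extra bit.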
With a multimeter, it is likely that the range is something like -M V to M V with a constant resolution, and you can configure M by selecting the range.
This is fixed point math. My answer will not use it because I don't have the toolbox available, if you have it you could use it to have simpler code.
You can generate the integer values with the intended resolution, then rescale it to the intended range.
F=2^N-1 %Maximum integer value
X=randi([0,F],100,1)
X*2*M/F-M %Rescale, divide by the integer range, multiply by the intended range. Then offset by intended minimum.
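The same rescaling can be run in the other direction to quantize an existing real-valued reading to N bits. The following Python sketch assumes a symmetric range [-M, M], clamps out-of-range readings, and rounds to the nearest level; the function name quantize and its parameters are illustrative, not part of any toolbox.

```python
import math

def quantize(value, n_bits, m):
    """Snap value to the nearest of 2**n_bits levels spread evenly over [-m, m]."""
    levels = 2 ** n_bits - 1                     # maximum integer code (F above)
    clamped = max(-m, min(m, value))             # out-of-range readings saturate
    code = math.floor((clamped + m) / (2 * m) * levels + 0.5)
    return code * 2 * m / levels - m             # rescale the code back to volts

print(quantize(0.3, 3, 1.0))   # one of only 8 possible outputs
print(quantize(5.0, 8, 1.0))   # clamps to the top of the range
```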

Calculating floating points as binary

The question is :
x and y are two floating point numbers in 32-bit IEEE floating-point format
(8-bit exponent with bias 127) whose binary representation is as follows:
x: 1 10000001 00010100000000000000000
y: 0 10000010 00100001000000000000000
Compute their product z = x y and give the result in binary IEEE floating-point format.
So I've found out that x = -4.3125 and y = 9.03125. I can multiply them and get -38.947265625, but I don't know how to show the result in IEEE format. Thanks in advance for the help.
I agree with the comment that it should be done in binary, rather than by conversion to decimal and decimal multiplication. I used Exploring Binary to do the arithmetic.
The first step is to find the actual binary significands. Neither input is subnormal, so they are 1.000101 and 1.00100001.
Multiply them, getting 1.00110111100101.
Similarly, subtract the bias, binary 1111111, from the exponents, getting 10 and 11. Add those, getting 101, then add back the bias, 10000100.
The sign bit for multiplying two numbers with different sign bits will be 1.
Now pack it all back together. The significand came out in the [1,2) range, so there is no need to normalize and adjust the exponent. We are still in the normal range, so drop the 1 before the binary point in the significand. The significand is narrow enough to fit without rounding; just add enough trailing zeros.
1 10000100 00110111100101000000000
You've made it harder by converting to decimal, the way you'd have to convert it back. It's not that it can't be done that way, but it's harder by hand.
Without converting, the algorithm to multiply two floats is (roughly) this:
put the implicit 1 back (if applicable)
multiply, to full size (don't truncate) (you can get away with using just Guard and Sticky, if you know how they work)
add the exponents
xor the signs
normalize/round/handle special cases (under-/overflow)
So here, multiply (look up how binary multiply worked if you forgot)
1.00010100000000000000000 *
1.00100001000000000000000 =
1.00100001000000000000000 +
0.000100100001000000000000000 +
0.00000100100001000000000000000 =
1.00110111100101000000000000000
Add exponents (mind the bias), 2+3 = 5 in this case, so 132 = 10000100.
Xor the signs, get 1.
No rounding is necessary because the dropped bits are all zero anyway.
Result: 1 10000100 00110111100101000000000
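The steps above can be checked mechanically. The Python sketch below follows the same outline on the raw bit patterns (restore the implicit 1, multiply the significands in full, add the exponents, xor the signs). It is deliberately simplified: it assumes normal inputs and a normal result, and it truncates instead of rounding, which is exact here only because the dropped bits are all zero. The function names are made up for this example.

```python
import struct

BIAS = 127

def multiply_float32_bits(a, b):
    """Multiply two normal float32 bit patterns given as ints (truncating sketch)."""
    sign = (a >> 31) ^ (b >> 31)                          # xor the signs
    exponent = ((a >> 23) & 0xFF) + ((b >> 23) & 0xFF) - BIAS
    sig_a = (a & 0x7FFFFF) | 0x800000                     # put the implicit 1 back
    sig_b = (b & 0x7FFFFF) | 0x800000
    product = sig_a * sig_b                               # full 48-bit product
    if product >> 47:                                     # product in [2,4): normalize
        product >>= 1
        exponent += 1
    fraction = (product >> 23) & 0x7FFFFF                 # drop implicit 1 and low bits
    return (sign << 31) | (exponent << 23) | fraction

def show(word):
    s = format(word, "032b")
    return f"{s[0]} {s[1:9]} {s[9:]}"

x = int("1" "10000001" "00010100000000000000000", 2)
y = int("0" "10000010" "00100001000000000000000", 2)
z = multiply_float32_bits(x, y)
print(show(z))

# Cross-check against the hardware result by reinterpreting the bits as floats.
fx = struct.unpack(">f", struct.pack(">I", x))[0]
fy = struct.unpack(">f", struct.pack(">I", y))[0]
(reference,) = struct.unpack(">I", struct.pack(">f", fx * fy))
print(show(reference))
```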

how to get reverse(not complement or inverse) of a binary number

I am implementing the Cooley-Tukey FFT (radix-2 DIF / DIT) algorithm in MATLAB. For the bit-reversal step I need the reverse of a binary number. Can anyone suggest how I can get the reverse of a binary number (like 100111 -> 111001)? Anyone who has worked on an FFT implementation is welcome to help me with the algorithm as well.
Topic: How to do bit reversal in Matlab? .
If you're using double precision floating point ('double') numbers
which are integers, you can do this:
dr = bin2dec(fliplr(dec2bin(d,n))); % Bits in dr are in reverse order
where n is the number of bits to be reversed and where 0 <= d < 2^n.
You will experience no precision problems at all as long as the
integers are no more than 52 bits long.
And
Re: How to do bit reversal in Matlab?
How large will the numbers be that you need to reverse? May I ask what
is the purpose of it? Maybe there is a more efficient way to solve the
whole problem. If the numbers are large you can just store the bits as
a string. To reverse it just read the string backwards! Or use
fliplr().
(There may be better places to ask).
If it were VHDL I'd suggest an alias with 'REVERSE'RANGE.
Taken from the help section;
Y = swapbytes(X) reverses the byte ordering of each element in array X, converting little-endian values to big-endian (and vice versa). The input array must contain all full, noncomplex, numeric elements.
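For comparison, the string-reversal idea from the first quoted post (dec2bin, fliplr, bin2dec) looks like this in Python. This is only a sketch: reverse_bits and bit_reversal_permutation are hypothetical names, and the permutation helper assumes the FFT size is a power of two.

```python
def reverse_bits(d, n):
    """Reverse the n-bit binary representation of d, where 0 <= d < 2**n."""
    if not 0 <= d < 2 ** n:
        raise ValueError("d must fit in n bits")
    return int(format(d, f"0{n}b")[::-1], 2)

def bit_reversal_permutation(size):
    """Index order used by a radix-2 FFT of the given power-of-two size."""
    n = size.bit_length() - 1
    return [reverse_bits(i, n) for i in range(size)]

print(format(reverse_bits(0b100111, 6), "06b"))  # 111001
print(bit_reversal_permutation(8))
```

The width n matters: the same value reversed over a different number of bits gives a different result, so it should match log2 of the FFT length.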

mod() operation weird behavior

I use mod() to compare if a number's 0.01 digit is 2 or not.
if mod(5.02*100, 10) == 2
...
end
The result is that mod(5.02*100, 10) == 2 returns 0.
However, mod(1.02*100, 10) == 2 and mod(20.02*100, 10) == 2 both return 1.
The result of mod(5.02*100, 10) - 2 is
ans =
-5.6843e-14
Could it be possible that this is a bug for matlab?
The version I used is R2013a. version 8.1.0
This is not a bug in MATLAB. It is a limitation of floating point arithmetic and of the conversion between binary and decimal numbers. Even a simple decimal number such as 0.1 cannot be exactly represented as a binary floating point number with finite precision.
Computer floating point arithmetic is typically not exact. Although we are used to dealing with numbers in decimal format (base10), computers store and process numbers in binary format (base2). The IEEE standard for double precision floating point representation (see http://en.wikipedia.org/wiki/Double-precision_floating-point_format, what MATLAB uses) specifies the use of 64 bits to represent a binary number. 1 bit is used for the sign, 52 bits are used for the mantissa (the actual digits of the number), and 11 bits are used for the exponent and its sign (which specifies where the decimal place goes).
When you enter a number into MATLAB, it is immediately converted to binary representation for all manipulations and arithmetic and then converted back to decimal for display and output.
Here's what happens in your example:
Convert to binary (keeping only up to 52 digits):
5.02 => 1.01000001010001111010111000010100011110101110000101e2
100 => 1.1001e6
10 => 1.01e3
2 => 1.0e1
Perform multiplication:
1.01000001010001111010111000010100011110101110000101 e2
x 1.1001 e6
--------------------------------------------------------------
0.000101000001010001111010111000010100011110101110000101
0.101000001010001111010111000010100011110101110000101
+ 1.01000001010001111010111000010100011110101110000101
-------------------------------------------------------------
1.111101011111111111111111111111111111111111111111111101e8
Cutting off at 52 digits gives 1.111101011111111111111111111111111111111111111111111e8
Note that this is not the same as 1.11110110e8 which would be 502.
Perform modulo operation: (there may actually be additional error here depending on what algorithm is used within the mod() function)
mod( 1.111101011111111111111111111111111111111111111111111e8, 1.01e3) = 1.111111111111111111111111111111111111111111100000000e0
The error is exactly -2^-44, which is about -5.6843 × 10^-14. The conversion between decimal and binary and the rounding due to finite precision have caused a small error. In some cases, you get lucky and the rounding errors cancel out, so you might still get the 'right' answer, which is why you got what you expect for mod(1.02*100, 10), but in general you cannot rely on this.
To use mod() correctly to test the particular digit of a number, use round() to round it to the nearest whole number and compensate for floating point error.
mod(round(5.02*100), 10) == 2
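The same behaviour shows up in any language that uses IEEE-754 doubles. A short Python sketch of the failing comparison and the round() fix (Python's % on floats plays the role of mod() here; the variable names are arbitrary):

```python
import math

product = 5.02 * 100          # slightly below 502 in double precision
remainder = product % 10
error = remainder - 2

print(remainder == 2)                 # False: exact comparison fails
print(error)                          # a tiny negative number
print(round(product) % 10 == 2)       # True: round before taking the remainder
print(math.isclose(remainder, 2, abs_tol=1e-9))  # True: or compare with a tolerance
```

Rounding first is the better fit when the input is known to represent a whole number of hundredths; a tolerance is the more general tool when it is not.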
What you're encountering is a floating point error or artifact, like the commenters say. This is not a Matlab bug; it's just how floating point values work. You'd get the same results in C or Java. Floating point values are "approximate" types, so exact equality comparisons using == without some rounding or tolerance are prone to error.
>> isequal(1.02*100, 102)
ans =
1
>> isequal(5.02*100, 502)
ans =
0
It's not the case that 5.02 is the only number this happens for; several around 0 are affected. Here's an example that picks out several of them.
x = 1.02:1000.02;
ix = mod(x .* 100, 10) ~= 2;
disp(x(ix))
To understand the details of what's going on here (and in many other situations you'll encounter working with floats), have a read through the Wikipedia entry for "floating point", or my favorite article on it, "What Every Computer Scientist Should Know About Floating-Point Arithmetic". (That title is hyperbole; this article goes deep and I don't understand half of it. But it's a great resource.) This stuff is particularly relevant to Matlab because Matlab does everything in floating point by default.