Fixed Point Development - matlab

I'm working on some fixed point coding these days.
If I have a bunch of 16 bit samples from an ADC and I do multiplication with a 16 bit filter coefficient, the result could be a 32 bit fixed point number right? Now that's fine because I'm targeting a 32 bit fixed point DSP. However, if I want to multiply that by another 16 bit fixed point coefficient or something then I get overflow right? So does that mean I need to do intermediate truncation? Eventually I'll be truncating anyway because I need to send the result to a 16 bit DAC.
Does anyone have experience with doing this in MATLAB?
EDIT I do have fixed point toolbox. What I don't understand is that right now if I set up a number with a 16 bit word length, then set the max product length to 16, then multiply it by another 16 bit word it gives me an error? If I have to perform all the truncations to prevent an error how does the fixed point toolbox even really help me? I guess I'm looking for an example on how to use the fixed point toolbox to ensure best possible rounding/overflow conditions given that my inputs are 16 bits and I have 32 bit registers.
Thanks

As you noted, a 16-bit multiply can produce a 32-bit result. In what follows, I'm assuming your fixed-point notation is 16.16.
In order to perform your second multiplication, you should first shift the initial mul's result back down by 16 bits. Since the result is now back into the desired 16.16 format, you may proceed with the second mul ("...if I want to multiply that by another 16 bit fixed point coefficient..."). After this second multiplication, shift the result down by 16 bits to restore the 16.16 notation.
Before shipping the value out the DAC, I would expect that you need to leave fixed-point notation and revert to integer form. To do this, simply shift the value down by 16 bits. Before leaving fixed-point notation, you might consider rounding the result. Assuming a positive fixed-point number, this can be accomplished by adding 0.5f to the result prior to the final right shift. (In 16.16, 0.5f is 2^15.)
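As a rough illustration of the shift-after-multiply and round-before-truncate steps above, here is a minimal Python sketch. It assumes the 16.16 layout from this answer; the helper names (`to_fixed`, `fixed_mul`, `fixed_to_int_rounded`) and the sample values are made up for the example. Python integers never overflow, so this shows only the scaling logic, not the 32-bit register limits of a real DSP.

```python
FRAC_BITS = 16                 # assumed Q16.16 layout, as in the answer above
ONE = 1 << FRAC_BITS           # 1.0 in Q16.16
HALF = 1 << (FRAC_BITS - 1)    # 0.5 in Q16.16, i.e. 2^15

def to_fixed(x):
    """Convert a real number to the nearest Q16.16 value (hypothetical helper)."""
    return int(round(x * ONE))

def fixed_mul(a, b):
    """Multiply two Q16.16 values, then shift down by 16 to restore Q16.16."""
    return (a * b) >> FRAC_BITS

def fixed_to_int_rounded(a):
    """Leave fixed point: add 0.5, then shift down (non-negative inputs only)."""
    return (a + HALF) >> FRAC_BITS

sample = to_fixed(1000.0)      # illustrative ADC sample
coeff1 = to_fixed(0.5)         # illustrative filter coefficients
coeff2 = to_fixed(0.25)

y = fixed_mul(fixed_mul(sample, coeff1), coeff2)   # 125.0 in Q16.16
dac_value = fixed_to_int_rounded(y)                # plain integer for the DAC
```

On real hardware the intermediate product `a * b` is the value that has to fit in the accumulator, so that is the place to check for overflow.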
As always, sequential fixed-point arithmetic operations should be studied closely to avoid overflowing the left-hand side. The operations may be re-ordered or factored to prevent overflow. There are a number of good tutorials on the web that can help.
As for performing fixed-point math in MATLAB, the bitshift function is easy enough to use. Of course, the Fixed-Point Toolbox makes this all the easier.

Related

Marie Simulator Multiplication of fractions

I have a task to use Marie Simulator to calculate the area of a circle
requiring its radius
I know that in Marie Language there is no multiplication operator so we use multiplication by adding numbers several times so If I wanted to multiply 2*3 I could write it down like 3+3 or 2+2+2
but when using the area of a circle there is pi which is 3.14 I can't imagine how could I get it so can anyone give me the algorithm or code for that ?
thanks in advance.
MARIE does not have floating point support.
So you should refer to your course work or ask your instructors what to do, as it is not obvious.
It is, of course, possible to do floating point in software, but the complexity is extraordinary, so it is unlikely to be what they're looking for.
You could use fixed point arithmetic, fractions, or decimal.
Here's one solution that might be appropriate: multiply one of the numbers (having decimal places) by some fixed constant factor, do the arithmetic, then interpret answers accordingly.  For example, let's use 100 as the factor, so 3.14 is represented by 314.  Let's say r is 9, so we can square that (9x9=81), then multiply 81 x 314 = 25434.  Now we know that value is 100x too large, so the real answer is 254.34.  (You can choose to ignore the .34, or, round it, then ignore.  254 is still more accurate than 243 which we would get from 9x9x3.)
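The scaled-constant approach above can be sketched as follows. This is Python rather than MARIE assembly, and the names (`multiply`, `circle_area_scaled`, `SCALE`) are made up for illustration. The multiply uses only addition in a loop, which is the same idea you would hand-translate into a MARIE Add/Skipcond loop.

```python
SCALE = 100        # 3.14 is stored as 314, i.e. 3.14 * SCALE
PI_SCALED = 314

def multiply(a, b):
    """Multiply two non-negative integers using repeated addition only."""
    total = 0
    for _ in range(b):
        total += a
    return total

def circle_area_scaled(r):
    """Return r * r * 3.14, scaled up by SCALE."""
    return multiply(multiply(r, r), PI_SCALED)

area = circle_area_scaled(9)             # 100x too large
whole, hundredths = divmod(area, SCALE)  # interpretation step, done by hand on MARIE
```

The `divmod` at the end is only there to show how to read the result; on MARIE you would just print the scaled value and interpret the last two digits as the decimal part.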
Fixed point multiplies all numbers by the constant (usually a power of 2, so that the binary point is in the same bit position).  Additions are relatively straightforward, but multiplications need to interpret results by factoring in (or out) that both sources are scaled, meaning the answer is doubly scaled.
If you need to measure radius also with decimal digits, e.g. 9.5, then you could scale both 9.5 and 3.14 by 100.  Then we need 950x950, and multiply by 314.  The answer will be 100x100x100 too large, so 1000000x too large.  With this approach, 16 bits that MARIE offers will overflow, so you would need to use at least 32-bit arithmetic (not trivial on 16-bit machine).
You can use two different scaling factors, e.g. 9.5 as 95 and 3.14 as 314.  Take 95x95x314, which is 10000x too large, so interpret the answer accordingly.  Still, this will overflow MARIE's 16 bits.
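To see which of the scaling choices discussed so far fit in a 16-bit word, a quick check like the one below can help. It is a sketch only; the dictionary keys are just labels for the three cases in the text.

```python
U16_MAX = 2**16 - 1   # largest unsigned 16-bit value
S16_MAX = 2**15 - 1   # largest signed 16-bit value

# Scaled products from the text: radius 9 or 9.5, pi as 314.
candidates = {
    "r=9, pi*100": 9 * 9 * 314,
    "r*100, pi*100": 950 * 950 * 314,
    "r*10, pi*100": 95 * 95 * 314,
}

# True where the product fits in an unsigned 16-bit word.
fits_unsigned = {name: value <= U16_MAX for name, value in candidates.items()}
```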
Fractions would maintain both a numerator and denominator for all numbers.  So, 3.14 could be 314/100, and 9.5 could be 95/10, which simplify to 157/50 and 19/2.  To add you have to find a common denominator, convert, then sum numerators.  To multiply you multiply both numerators and denominators: numerator = 19x19x157, denominator = 2x2x50.  This just fits in 16-bit unsigned arithmetic, but still overflows 16-bit signed arithmetic.
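The fraction arithmetic above can be checked with a short sketch. Python's `fractions.Fraction` is used here only to do the simplification; on MARIE you would carry the numerator and denominator as two separate integers.

```python
from fractions import Fraction

r = Fraction(95, 10)       # simplifies to 19/2
pi = Fraction(314, 100)    # simplifies to 157/50

# Multiply numerators and denominators separately, as described above.
numerator = r.numerator * r.numerator * pi.numerator
denominator = r.denominator * r.denominator * pi.denominator
area = Fraction(numerator, denominator)

fits_unsigned_16 = numerator <= 2**16 - 1
fits_signed_16 = numerator <= 2**15 - 1
```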
And finally binary coded decimal is more like a string format, where numbers are stored one decimal digit per byte or per nibble (packed decimal).  Algorithms for addition and subtraction need to account for variable length inputs.
Big integer forms use a scheme similar to binary coded decimal, but are composed of much larger elements instead of single decimal digits.
All of these approaches require some thought, and the more limitations you want to remove, the more work required.  So, I'd suggest to go back to your course to find what they really want.

32-1024 bit fixed point vector arithmetic with AVX-2

For a Mandelbrot generator I want to use fixed point arithmetic going from 32 up to maybe 1024 bits as you zoom in.
Normally SSE or AVX is no help there due to the lack of add with carry, and doing normal integer arithmetic is faster. But in my case I have literally millions of pixels that all need to be computed. So I have a huge vector of values that all need to go through the same iterative formula over and over, a million times too.
So I'm not looking at doing a fixed point add/sub/mul on single values but doing it on huge vectors. My hope is that for such vector operations AVX/AVX2 can still be utilized to improve the performance despite the lack of native add with carry.
Does anyone know of a library for fixed point arithmetic on vectors, or some example code showing how to emulate add with carry on AVX/AVX2?
FP extended precision gives more bits per clock cycle (because double FMA throughput is 2/clock vs. 32x32=>64-bit at 1 or 2/clock on Intel CPUs); consider using the same tricks that Prime95 uses with FMA for integer math. With care it's possible to use FPU hardware for bit-exact integer work.
For your actual question: since you want to do the same thing to multiple pixels in parallel, probably you want to do carries between corresponding elements in separate vectors, so one __m256i holds 64-bit chunks of 4 separate bigintegers, not 4 chunks of the same integer.
Register pressure is a problem for very wide integers with this strategy. Perhaps you can usefully branch on there being no carry propagation past the 4th or 6th vector of chunks, or something, by using vpmovmskb on the compare result to generate the carry-out after each add. An unsigned add has a carry-out when a+b < a (unsigned compare).
But AVX2 only has signed integer compares (for greater-than), not unsigned. And with carry-in, (a+b+c_in) == a is possible with b=carry_in=0 or with b=0xFFF... and carry_in=1 so generating carry-out is not simple.
To solve both those problems, consider using chunks with manual wrapping to 60-bit or 62-bit or something, so they're guaranteed to be signed-positive and so carry-out from addition appears in the high bits of the full 64-bit element. (Where you can vpsrlq ymm, 62 to extract it for addition into the vector of next higher chunks.)
Maybe even 63-bit chunks would work here so carry appears in the very top bit, and vmovmskpd can check if any element produced a carry. Otherwise vptest can do that with the right mask.
This is a hand-wavy kind of brainstorm answer; I don't have any plans to expand it into a detailed answer. If anyone wants to write actual code based on this, please post your own answer so we can upvote that (if it turns out to be a useful idea at all).
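For what it's worth, the reduced-width chunk idea from the answer above can be modelled in a few lines of Python. This is a scalar sketch of what one SIMD lane would do, not AVX2 code: the 62-bit chunk width follows the answer, while the helper names (`to_limbs`, `from_limbs`, `add_limbs`) are invented for the example. In a real implementation each 64-bit lane of a `__m256i` would hold the same chunk index for a different pixel.

```python
LIMB_BITS = 62                      # payload bits per 64-bit element (from the answer)
LIMB_MASK = (1 << LIMB_BITS) - 1
MASK64 = (1 << 64) - 1              # models a 64-bit lane

def to_limbs(value, n_limbs):
    """Split a non-negative integer into little-endian 62-bit chunks."""
    return [(value >> (LIMB_BITS * i)) & LIMB_MASK for i in range(n_limbs)]

def from_limbs(limbs):
    """Reassemble an integer from little-endian 62-bit chunks."""
    return sum(limb << (LIMB_BITS * i) for i, limb in enumerate(limbs))

def add_limbs(a, b):
    """Add two chunked numbers; returns (chunks, carry_out)."""
    out = []
    carry = 0
    for x, y in zip(a, b):
        s = (x + y + carry) & MASK64   # at most 2^63 - 1, so the lane never wraps
        carry = s >> LIMB_BITS         # plays the role of vpsrlq ymm, 62
        out.append(s & LIMB_MASK)      # wrap the chunk back to 62 bits
    return out, carry
```

Because the sum of two 62-bit chunks plus a carry stays below 2^63, the carry is always visible in the top bits of the lane, which avoids the need for an unsigned compare.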
Just for kicks, without claiming that this will be actually useful, you can extract the carry bit of an addition by just looking at the upper bits of the input and output values.
unsigned result = a + b + last_carry; // add a, b and (optionally) the last carry
unsigned carry = (a & b)              // carry if both a AND b have the upper bit set
               |                      // OR
               ((a ^ b)               // upper bits of a and b are different
                & ~result);           // AND upper bit of the result is not set
carry >>= sizeof(unsigned)*8 - 1;     // shift the upper bit to the lower bit
With SSE2/AVX2 this could be implemented with two additions, 4 logic operations and one shift, and it works for arbitrary (supported) integer sizes (uint8, uint16, uint32, uint64). With AVX2 you'd need 7 uops to get 4 64-bit additions with carry-in and carry-out.
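If you want to convince yourself that the bit trick above really produces the carry, it is cheap to test exhaustively at a small width. The sketch below uses 8-bit values in Python; `add_with_carry` and `check_all` are names made up for the example, and the explicit masking stands in for fixed-width unsigned arithmetic.

```python
BITS = 8                    # small width so every input can be tried
MASK = (1 << BITS) - 1

def add_with_carry(a, b, carry_in):
    """Return (result, carry_out) using only the top-bit logic from the answer."""
    result = (a + b + carry_in) & MASK
    carry = (a & b) | ((a ^ b) & ~result)   # same expression as the C snippet
    carry = (carry & MASK) >> (BITS - 1)    # move the top bit down to bit 0
    return result, carry

def check_all():
    """Compare against exact integer addition for every 8-bit input pair."""
    for a in range(1 << BITS):
        for b in range(1 << BITS):
            for c in (0, 1):
                result, carry = add_with_carry(a, b, c)
                if (carry << BITS) | result != a + b + c:
                    return False
    return True
```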
Especially since multiplying 64x64-->128 is not possible either (but would require 4 32x32-->64 products and some additions, or 3 32x32-->64 products and even more additions, as well as special case handling), you will likely not be more efficient than with mul and adc (maybe unless register pressure is your bottleneck).
As Peter and Mystical suggested, working with smaller limbs (still stored in 64 bits) can be beneficial. On the one hand, with some trickery, you can use FMA for 52x52-->104 products. And also, you can actually add up to 2^k-1 numbers of 64-k bits before you need to carry the upper bits of the previous limbs.
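The deferred-carry point can be illustrated with a small model. This is a Python sketch under assumed parameters (k = 4 spare bits, so 60-bit limbs); the function names are hypothetical. It sums 2^k - 1 numbers limb by limb with no carry handling at all and only resolves the carries once at the end.

```python
K = 4                          # spare bits per 64-bit word (assumed for the example)
LIMB_BITS = 64 - K             # 60 payload bits per limb
LIMB_MASK = (1 << LIMB_BITS) - 1

def to_limbs(value, n_limbs):
    """Split a non-negative integer into little-endian 60-bit limbs."""
    return [(value >> (LIMB_BITS * i)) & LIMB_MASK for i in range(n_limbs)]

def lazy_sum(values, n_limbs):
    """Add limbs independently; no carries are propagated here."""
    acc = [0] * n_limbs
    for v in values:
        for i, limb in enumerate(to_limbs(v, n_limbs)):
            acc[i] += limb
    return acc

def resolve(acc):
    """Fold the accumulated words back into one integer (the single carry pass)."""
    return sum(word << (LIMB_BITS * i) for i, word in enumerate(acc))

values = [(1 << 180) - 1 - i for i in range(2**K - 1)]   # 15 large 180-bit numbers
acc = lazy_sum(values, 3)
```

Every accumulated word stays below 2^64 here, which is the property that lets a 64-bit SIMD lane hold it without wrapping.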

Numerical convergence and minimum number size

I have a program which calculates probability values (p-values), but it is entering a very large negative number into the exp function, exp(-626294.830), which evaluates to zero instead of the very small positive number that it should be.
How can I get this to evaluate as a very small floating point number?
I have tried Math::BigFloat, bignum, and bigrat, but all have failed.
Wolfram Alpha says that exp(-626294.830) is 4.08589×10^-271997... zero is a pretty close approximation to that ;-) Although you've edited and removed the context from your question, do you really need to work with such tiny numbers, or perhaps there is some way you could optimize your algorithm or scale your numbers?
Anyway, you are correct that code like Math::BigFloat->new("-626294.830")->bexp seems to take quite some time, even with the support of use Math::BigFloat lib => 'GMP';.
The only alternative I can offer at the moment is Math::Prime::Util::GMP's expreal, although you need to specify a precision to it.
use Math::Prime::Util::GMP qw/expreal/;
use Math::BigFloat;
my $e = Math::BigFloat->new(expreal(-626294.830,272000));
print $e->bnstr,"\n";
__END__
4.086e-271997
But on my machine, even that still takes ~20s to run, which brings us back to the question of potential optimization in other places.
Floating point numbers do not have infinite precision. Assuming the number is represented as an IEEE 754 double, we have 52 bits for the fraction, 11 bits for the exponent, and one bit for the sign. Due to the way exponents are encoded, the smallest positive normal number that can be represented is 2^-1022 (subnormal numbers extend this down to 2^-1074).
If we look at your number e^-626294.830, we can do a change of base and see that it equals 2^(log_2 e · -626294.830) = 2^-903552.445, which is significantly smaller than 2^-1022. Approximating your number as zero is therefore correct.
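The change of base above is easy to reproduce. The sketch below is in Python rather than Perl, purely to check the numbers; it compares the base-2 exponent of the result against the limits of an IEEE 754 double, including the subnormal limit of 2^-1074.

```python
import math
import sys

x = -626294.830

smallest_normal = sys.float_info.min        # 2^-1022
smallest_subnormal = math.ldexp(1.0, -1074) # 2^-1074, the smallest positive double

# e^x == 2^(x * log2(e)), so this is the base-2 exponent of the result.
log2_value = x * math.log2(math.e)

underflows = log2_value < -1074             # far below anything a double can hold
direct = math.exp(x)                        # what the hardware gives you
```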
Instead of calculating this value using arbitrary-precision numerics, you are likely better off solving the necessary equations by hand, then coding this in a way that does not require extreme precision. For example, it is unlikely that you need the exact value of e^-626294.830, but perhaps just the magnitude. Then, you can calculate the logarithm instead of using exp().
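As a sketch of the "work with the logarithm" suggestion: if all you need is the magnitude, you can stay in log space and recover a mantissa and decimal exponent without ever forming the tiny number. This is shown in Python for brevity; the same arithmetic works with plain doubles in Perl.

```python
import math

x = -626294.830

# log10(e^x) = x / ln(10); this fits comfortably in a double.
log10_value = x / math.log(10)

# Split into a decimal exponent and a mantissa in [1, 10).
exponent = math.floor(log10_value)
mantissa = 10 ** (log10_value - exponent)

# Formatted as a string, since the value itself is not representable as a double.
as_text = f"{mantissa:.3f}e{exponent}"
```

This agrees with the Wolfram Alpha value quoted in the other answer, and it runs instantly because it needs no arbitrary-precision arithmetic.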

iOS - rounding a float with roundf() is not working properly

I am having issue with rounding a float in iPhone application.
float f=4.845;
float s= roundf(f * 100.0)/100;
NSLog(@"Output-1: %.2f",s);
s= roundf(484.5)/100;
NSLog(@"Output-2: %.2f",s);
Output-1: 4.84
Output-2: 4.85
Let me know what the problem is and how to solve it.
The problem is that you don't yet realise one of the inherent problems with floating point: the fact that most numbers cannot be represented exactly (a).
This means that 4.845 is likely to be, in reality, something like 4.8449999999999 which, when you round it, gives you 4.84 rather than what you expect, 4.85.
And what value you end up with also depends on how you calculate it, which is why you're getting a different result.
And, of course, no floating point "inaccuracy" answer would be complete on SO without the authoritative What Every Computer Scientist Should Know About Floating-Point Arithmetic.
(a) Only sums of exact powers of two, within a certain similar range, can be exactly rendered in IEEE754. So, for example, 484.5 is
256 + 128 + 64 + 32 + 4 + 0.5 (2^8 + 2^7 + 2^6 + 2^5 + 2^2 + 2^-1).
See this answer for a more detailed look into the IEEE754 format.
As to solving it, you have a few choices. One is to use double instead of float. That gives you more precision and a greater range of numbers, but it only moves the problem further away rather than really solving it. Since 0.1 is a repeating fraction in binary, no number of bits (short of infinity) can exactly represent it.
Another choice is to use a custom library like a big decimal type, which can represent decimals of arbitrary precision (that's not infinite precision as some people are wont to suggest, since it's limited by memory). This will reduce the errors caused by the binary/decimal mismatch.
You may also want to look into NSDecimalNumber - this doesn't give you arbitrary precision but it does give a large range with accurate decimal representation.
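To show what a decimal type buys you, here is a small sketch using Python's `decimal` module. This is only an analogy for the decimal-representation idea, not the NSDecimalNumber API: the value 4.845 is held exactly as decimal digits, so rounding to two places behaves the way you would expect on paper.

```python
from decimal import Decimal, ROUND_HALF_UP

# Built from a string, so the decimal digits are stored exactly.
value = Decimal("4.845")
rounded = value.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)

# Built from a binary double, so it carries the binary approximation instead.
binary_value = Decimal(4.845)
is_exact = binary_value == value
```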
There'll still be numbers you can't represent, like PI or the square root of 2 or any other irrational number, but it should cover most cases. If you really need to handle those other values, you need to switch to symbolic numeric representations.
Unlike 484.5, which can be represented exactly as a float*, 4.845 is represented as 4.8449998 (see this calculator if you wish to try other numbers). Multiplying by one hundred gives 484.49998, which correctly rounds to 484.
* An exact representation is possible because its fractional part 0.5 is a power of two (i.e. 2^-1).
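The two outputs from the question can be reproduced outside Objective-C. The sketch below is Python and makes a few assumptions: `to_f32` emulates storing a value in a 32-bit float via `struct`, and `round_half_away` stands in for C's `roundf` for non-negative inputs. Both helper names are invented for the example.

```python
import math
import struct

def to_f32(x):
    """Round a Python float (double) to the nearest 32-bit float value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def round_half_away(x):
    """Mimic C roundf for non-negative inputs (halves round up)."""
    return math.floor(x + 0.5)

f = to_f32(4.845)                  # slightly below 4.845
scaled = to_f32(f * 100.0)         # what roundf receives in the first case

output_1 = round_half_away(scaled) / 100          # first NSLog line
output_2 = round_half_away(to_f32(484.5)) / 100   # second NSLog line
```

The first path starts from a float that is already a little under 4.845, so the scaled value lands just below 484.5; the second path starts from the literal 484.5, which is exact.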

Why my filter output is not accurate?

I am simulating a digital filter, which is 4-stage.
Stages are:
CIC
half-band
OSR: 128
Input is 4 bits and output is 24 bits. I am confused about the 24 bits output.
I use MATLAB to generate a 4-bit signed sinusoid input (using the SD tool), and simulate with ModelSim. So the output should also be a sinusoid. The issue is that the output only contains 4 distinct values.
For a 24-bit output, shouldn't we get up to 2^24 different values?
What's the reason for this? Is it due to internal bit width?
I'm not familiar with ModelSim, and I don't understand the filter terminology you used, but... are your filters linear systems? If so, an input at a given frequency will cause an output at the same frequency, though possibly with a different amplitude and phase. If your input signal is a single tone, sampled such that there are four values per cycle, the output will still have four values per cycle. Unless one of the stages performs sample rate conversion, the system is behaving as expected. As Donnie DeBoer pointed out, the word width of the calculation doesn't matter as long as it can represent the four values of the input.
Again, I am not familiar with the particulars of your system so if one of the stages does indeed perform sample rate conversion, this doesn't apply.
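A toy model of the argument above: feed a tone with four samples per cycle through a linear FIR filter and count the distinct output values once the filter has filled up. This is a Python sketch with made-up integer taps and a made-up input, not your CIC/half-band chain, and it deliberately has no sample rate conversion.

```python
def fir(samples, taps):
    """Direct-form FIR filter; samples before the start are treated as zero."""
    out = []
    for n in range(len(samples)):
        acc = 0
        for k, tap in enumerate(taps):
            if n - k >= 0:
                acc += tap * samples[n - k]
        out.append(acc)
    return out

period = [0, 7, 0, -7]          # 4-bit signed tone, four samples per cycle
samples = period * 16
taps = [3, 11, 25, 11, 3]       # hypothetical integer coefficients

output = fir(samples, taps)
steady = output[len(taps):]     # skip the start-up transient
distinct = sorted(set(steady))  # values the output actually takes
```

The output is wider than the input (it no longer fits in 4 bits), but it still repeats every four samples, so it can take at most four distinct values.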
Forgive my lack of filter knowledge, but does one of the filter stages interpolate between the input values? If not, then you're only going to get a maximum of 2^4 output values (based on the input resolution), regardless of your output resolution. Just because you output to 24-bit doesn't mean you're going to have 2^24 values... imagine running a digital square wave into a D->A converter. You have all the output resolution in the world, but you still only have 2 values.
It's actually pretty simple:
Even though you have 4 bits of input, your filter coefficients may be more than 4 bits.
Every math stage you do adds bits. If you add two 4-bit values, the answer is a 5 bit number, so that adding 0xf and 0xf doesn't overflow. When you multiply two 4-bit values, you actually need 8 bits of output to hold the answer without the possibility of overflow. By the time all the math is done, your 4-bit input apparently needs 24-bits to hold the maximum possible output.
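The bit-growth rule above is easy to check numerically. The sketch below is Python and uses unsigned worst-case values; `bits_for_sum_of_products` is a hypothetical helper, and the example coefficient width and tap count are arbitrary, not taken from your filter.

```python
max_4bit = 0xF

# Worst-case add and multiply of two unsigned 4-bit values.
sum_bits = (max_4bit + max_4bit).bit_length()
product_bits = (max_4bit * max_4bit).bit_length()

def bits_for_sum_of_products(in_bits, coeff_bits, n_taps):
    """Bits needed to hold n_taps worst-case unsigned products added together."""
    worst = n_taps * ((1 << in_bits) - 1) * ((1 << coeff_bits) - 1)
    return worst.bit_length()

# Example: 4-bit data, 16-bit coefficients, 16 taps (all illustrative numbers).
example_bits = bits_for_sum_of_products(4, 16, 16)
```

Each multiply adds roughly the coefficient width and each doubling of the number of accumulated terms adds one more bit, which is how a 4-bit input can end up needing a much wider output word.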