Fixed point multiplication "solution," crazy or viable? - fixed-point

Assume this much:
I'm using a 16.16 fixed point system.
System is 32 bit.
CPU has no floating point processor.
Overflow is pretty imminent for multiplication for anything larger than 1.0 * 0.4999
To make one last assumption... let's say the values I'm working with will not be so high as to cause overflow in this operation...
//assume that in practical application
//this assignment wouldn't be here as 2 fixed values would already exist...
fixed1 = (int)(1.2341 * 65536);
fixed2 = (int)(0.7854 * 65536);
mask1 = fixed1 & 0xFF; //mask off lower 8 bits
fixed1 >>= 8; //keep upper 24 bits... assume value here isn't too large...
answer = (((fixed2 * fixed1) >> 8) + ((fixed2 * mask1) >> 16));
So the question is... is this a stroke of genius (not to say it hasn't already been thought of or anything) or a complete waste of time?

Re-edit - because I was wrong :)
Looks like you are trying to get higher precision by using an extra var?
If you are indeed trying to increase precision, then this would work, but why not use the whole int instead of just 8-bits?
Ok, from your comments, you wanted to know how to do 64-bit precision muls on a 32-bit processor. The easiest way is if the processor underneath you has a long multiply op. If it's an ARM, you are in luck and can use long long to do your mul then shift away your out of bounds low bits and be done.
If it does not, you can still do a long long multiply and let the compiler writer do the heavy lifting of handling overflow for you. These are the easiest methods.
Failing that, you get to do 4 16-bit multiplies and a bunch of adds and shifts:
// The idea is to break the 32-bit multiply into 4 16-bit
// parts to prevent any overflow. You can break any
// multiply into factors and additions (all math here is unsigned):
//
//               (ahi16)(alo16)
//   X           (bhi16)(blo16)
//   --------------------------
//               (blo16)(alo16)   - First 32-bit product var
//       (blo16)(ahi16)<<16       - Second 32-bit product var (Don't shift here)
//       (bhi16)(alo16)<<16       - Third 32-bit product var (Don't shift here)
//   +   (bhi16)(ahi16)<<32       - Fourth 32-bit product var (Don't shift here)
//   --------------------------
//   Final Value. Here we add using add and add
//   with carry techniques to allow overflow.
Basically, we have a low product and a high product. The low product gets assigned the first partial product. You then add in the 2 middle products shifted up 16. For each overflow, you add 1 to the high product and continue. Then add the upper 16 bits of each middle product into the high product. Finally, add the last product as is into the high product.
A big pain in the butt, but it works for any arbitrary precision of values.


32-1024 bit fixed point vector arithmetic with AVX-2

For a Mandelbrot generator I want to use fixed point arithmetic going from 32 up to maybe 1024 bit as you zoom in.
Now normally SSE or AVX is no help there due to the lack of add with carry, and doing normal integer arithmetic is faster. But in my case I have literally millions of pixels that all need to be computed. So I have a huge vector of values that all need to go through the same iterative formula over and over, a million times too.
So I'm not looking at doing a fixed point add/sub/mul on single values but doing it on huge vectors. My hope is that for such vector operations AVX/AVX2 can still be utilized to improve the performance despite the lack of native add with carry.
Does anyone know of a library for fixed point arithmetic on vectors, or some example code showing how to emulate add with carry on AVX/AVX2?
FP extended precision gives more bits per clock cycle (because double FMA throughput is 2/clock vs. 32x32=>64-bit at 1 or 2/clock on Intel CPUs); consider using the same tricks that Prime95 uses with FMA for integer math. With care it's possible to use FPU hardware for bit-exact integer work.
For your actual question: since you want to do the same thing to multiple pixels in parallel, probably you want to do carries between corresponding elements in separate vectors, so one __m256i holds 64-bit chunks of 4 separate bigintegers, not 4 chunks of the same integer.
Register pressure is a problem for very wide integers with this strategy. Perhaps you can usefully branch on there being no carry propagation past the 4th or 6th vector of chunks, or something, by using vpmovmskb on the compare result to generate the carry-out after each add. An unsigned add has carry out of a+b < a (unsigned compare)
But AVX2 only has signed integer compares (for greater-than), not unsigned. And with carry-in, (a+b+c_in) == a is possible with b=carry_in=0 or with b=0xFFF... and carry_in=1 so generating carry-out is not simple.
To solve both those problems, consider using chunks with manual wrapping to 60-bit or 62-bit or something, so they're guaranteed to be signed-positive and so carry-out from addition appears in the high bits of the full 64-bit element. (Where you can vpsrlq ymm, 62 to extract it for addition into the vector of next higher chunks.)
Maybe even 63-bit chunks would work here so carry appears in the very top bit, and vmovmskpd can check if any element produced a carry. Otherwise vptest can do that with the right mask.
This is a hand-wavy kind of brainstorm answer; I don't have any plans to expand it into a detailed answer. If anyone wants to write actual code based on this, please post your own answer so we can upvote that (if it turns out to be a useful idea at all).
Just for kicks, without claiming that this will be actually useful, you can extract the carry bit of an addition by just looking at the upper bits of the input and output values.
unsigned result = a + b + last_carry; // add a, b and (optionally) the last carry
unsigned carry = (a & b)              // carry if both a AND b have the upper bit set
               |                      // OR
               ((a ^ b)               // the upper bits of a and b are different
                & ~result);           // AND the upper bit of the result is not set
carry >>= sizeof(unsigned)*8 - 1;     // shift the upper bit to the lower bit
With SSE2/AVX2 this could be implemented with two additions, 4 logic operations and one shift, and it works for arbitrary (supported) integer sizes (uint8, uint16, uint32, uint64). With AVX2 you'd need 7 uops to get 4 64-bit additions with carry-in and carry-out.
Especially since multiplying 64x64-->128 is not possible either (it would require 4 32x32-->64 products and some additions, or 3 32x32-->64 products and even more additions, as well as special case handling), you will likely not be more efficient than with mul and adc (maybe unless register pressure is your bottleneck).
As Peter and Mystical suggested, working with smaller limbs (still stored in 64 bits) can be beneficial. On the one hand, with some trickery, you can use FMA for 52x52-->104 products. And also, you can actually add up to 2^k-1 numbers of 64-k bits before you need to carry the upper bits of the previous limbs.

Ideas for a new way of storing very large numbers (for 3D coordinates)

When creating a large game world, the standard seems to be to go with a floating point number, specifically a double precision float, for world coordinates. You could use a 64 bit integer, which gives you plus or minus 9 quintillion range, (9 x 10^18), but then you'd be dealing in units of a millimetre, which isn't as comfortable as using say metres.
We all know that the bigger a floating point number gets, the less precision it has and the more likely the number will be off by a large margin. In most open world games, you don't really see maps larger than about 6 km squared: GTA V (6 by 12), Witcher 3 (two separate maps of about 4 by 4), the Fallout games (roughly comparable), the Just Cause games (slightly larger, around the 20 km squared mark), the Need for Speed games (4ish squared, maybe 5), the Elder Scrolls games (roughly the same); you get the idea. Most of these games aren't a problem for double precision floating point numbers.
But then you have games and cases where the world is particularly enormous, and I guess these are mostly space themed games, where the action takes place in solar systems, galaxies, etc. I just read a thread on reddit about how the developers said that the move to a 64 bit version of the game Space Engineers allowed them to enormously enlarge the scope of the game world, saying they can now fill about 13 astronomical units, or about 2 billion kilometres. Still, Pluto is about 50 AU away at aphelion, or 7.5 billion kilometres. This makes me wonder how games like Star Citizen or Elite Dangerous do it; I'm guessing by stitching together different maps seamlessly.
I've heard of the long double and the quad, but the long double doesn't seem to work on most architectures and compilers, and there's little information on the quad.
But that leads me to topic of ideas I'm sure everyone's had at some point, and that is to use more than one data type and using them together. For example, I've already said that with a 64 bit signed integer you can get one integer precision up to 9 quintillion. Say for example you let that 64 bit int be your metres unit. This would give you a world space of 9 quadrillion kilometres. Then all you would need is an extra two bytes, a 16 bit integer, can hold around 65,000 values, and use these two bytes as your millimetres. A value of 2 would be 2 millimetres, a value of 25 would be 2.5 centimetres, a value of 500 would be half a metre. This way you would have millimetre resolution in your world coordinates.
I'm fairly confident that adding and subtracting such composite numbers (let's call them that) won't be a problem, but multiplying and dividing seems a bit trickier. But, after all's said and done, is it even true that carrying out multiple operations on two integer types (actually four when you combine two numbers) would be faster or more efficient than floating point arithmetic?
It's called fixed point arithmetic. You can do it, with an implementation like this:
public struct BigVector3
{
    // This controls the fixed point precision: one unit of x, y or z is SCALE world units.
    private const float SCALE = 0.0001f;
    public long x;
    public long y;
    public long z;

    // Offset of this position from another one, small enough to hold in floats.
    public Vector3 GetRelativePos(BigVector3 other)
    {
        BigVector3 v = this;
        v.x -= other.x;
        v.y -= other.y;
        v.z -= other.z;
        return (Vector3)v;
    }

    public static implicit operator Vector3(BigVector3 value)
    {
        return new Vector3(value.x * SCALE, value.y * SCALE, value.z * SCALE);
    }
}
But you still end up converting back to some kind of floating-point number in the end, for rendering, for calculating physics with pretty much every major physics engine, etc.
The real solution is to increase the amount of precision in your numbers. double can handle any sub-solar-system scale with millimeter accuracy. Any game on the scale of a galaxy should really use relative positioning and math.
EDIT: "This makes me wonder how games like Star Citizen or Elite Dangerous do it"
These games don't need accurate positioning at such great distances. If you're in a space ship traveling faster than light then a millimeter is negligible. And vertex positions being inaccurate is only visible up close. You can simply render the ship interior separately from the rest of the world.

iOS - rounding a float with roundf() is not working properly

I am having issue with rounding a float in iPhone application.
float f = 4.845;
float s = roundf(f * 100.0) / 100;
NSLog(@"Output-1: %.2f", s);
s = roundf(484.5) / 100;
NSLog(@"Output-2: %.2f", s);
Output-1: 4.84
Output-2: 4.85
Let me know what the problem is here and how to solve it.
The problem is that you don't yet realise one of the inherent problems with floating point: the fact that most numbers cannot be represented exactly (a).
This means that 4.845 is likely to be, in reality, something like 4.8449999999999 which, when you round it, gives you 4.84 rather than what you expect, 4.85.
And what value you end up with also depends on how you calculate it, which is why you're getting a different result.
And, of course, no floating point "inaccuracy" answer would be complete on SO without the authoritative What Every Computer Scientist Should Know About Floating-Point Arithmetic.
(a) Only sums of exact powers of two, within a certain similar range, can be exactly rendered in IEEE754. So, for example, 484.5 is
256 + 128 + 64 + 32 + 4 + 0.5 (2^8 + 2^7 + 2^6 + 2^5 + 2^2 + 2^-1).
See this answer for a more detailed look into the IEEE754 format.
As to solving it, you have a few choices. One is to use double instead of float. That gives you more precision and greater range of numbers but only moves the problem further away rather than really solving it. Since 0.1 is a repeating fraction in binary, no number of bits (short of infinity) can exactly represent it in IEEE754.
Another choice is to use a custom library like a big decimal type, which can represent decimals of arbitrary precision (that's not infinite precision as some people are wont to suggest, since it's limited by memory). This will reduce the errors caused by the binary/decimal mismatch.
You may also want to look into NSDecimalNumber - this doesn't give you arbitrary precision but it does give a large range with accurate decimal representation.
There'll still be numbers you can't represent, like PI or the square root of 2 or any other irrational number, but it should cover most cases. If you really need to handle those other values, you need to switch to symbolic numeric representations.
Unlike 484.5 which can be represented exactly as a float* , 4.845 is represented as 4.8449998 (see this calculator if you wish to try other numbers). Multiplying by one hundred keeps the number at 484.49998, which correctly rounds to 484.
* An exact representation is possible because its fractional part 0.5 is a power of two (i.e. 2^-1).

What's the biggest number in a computer?

Just asked by my 5 year old kid: what is the biggest number in the computer?
We are not talking about max number for a specific data types, but the biggest number that a computer can represent.
Infinity is not allowed.
UPDATE: my kid always wants to print as well, so let's say the computer needs to print this number and the kid needs to know that it's a big number. Of course, in practice we won't print it because there aren't enough trees.
This question is actually a very interesting one which mathematicians have devoted a fair bit of thought to. You can read about it in this article, which is a fascinating and accessible read.
Briefly, a guy named Tibor Rado set out to find some really big, but still well-defined, numbers by defining a sequence called the Busy Beaver numbers. He defined BB(n) to be the largest number of steps any n-state Turing Machine (with a two-symbol alphabet, started on a blank tape) could take before halting. Note that this sequence is by its very nature not computable, so the numbers themselves, while well-defined, are very difficult to pin down. Here are the first few:
BB(1) = 1
BB(2) = 6
BB(3) = 21
BB(4) = 107
... wait for it ...
BB(5) = 47,176,870
BB(6) >= 8,690,333,381,690,951
BB(5) took decades to pin down exactly, and no one has any idea how big BB(6) and above really are, only that they are finite. But at least these numbers are completely well-defined mathematically, unlike "the largest number any human has ever thought of, plus one." ;)
So how about this:
The biggest number a computer can represent is the most instructions a program small enough to fit in its available memory can perform before halting.
No, wait, cubed. No, raised to the power of itself!
Bits are not numbers. You, as a programmer, give them the meaning you want, possibly numbers.
Now, I decide that 1 represents "the biggest number ever thought by a human plus one".
Errr this is a five year old?
How about something along the lines of: "I'd love to tell you but the number is so big and would take so long to say, I'd die before I finished telling you".
// wait to see
EDIT: The above is for actually storing a number and treats all media (RAM, HD, cloud etc.) as memory. Subtracting the OS footprint (measured in KB) doesn't make "roughly" less accurate...
If you want to "represent" a number in a meaningful way, then you probably want to go with what the CPU provides: unsigned 32 bit integers (roughly 4 Gigs) or unsigned 64 bit integers for most computers your kid will come into contact with.
NOTE for talking to 5-year-olds: Often, they just want a factoid. Give him a really big and very accurate number (lots of digits), like 4'294'967'295. Then, once the glazing leaves his eyes, try to see how far you can get with explaining how computers represent numbers.
EDIT #2: I once read this article: Who Can Name the Bigger Number that should provide a whole lot of interesting information for your kid. Obviously he's not your normal five-year-old. So this might get you started in a cool direction about numbers and computation.
The answer to life (and this kids question): 42
That depends on the datatype you use to represent it. The computer only stores bits (0/1). We, as developers, give the bits meaning. (65 can be a number or the letter A).
For example, I can define my datatype as 1^N where N is unsigned and represented by an array of bits of arbitrary size. The next person can come up with 10^N which would be ten times larger than my biggest number.
Sure, there would be gaps but if you don't need them, that doesn't matter.
Therefore, the question is meaningless since it doesn't have context.
Well, I had the same question earlier today, so I thought why not write a little C++ code to see where the computer is going to stop...
My laptop wasn't with me in class so I used another one. The number got very big but the program never ended; I'll run it again overnight and then share the number.
You can try it yourself; the code is stupid:
#include <stdlib.h>
#include <stdio.h>
int main() {
    int i = 0;
    /* i <= i is always true, so the loop only "ends" when i overflows,
       which is undefined behaviour for a signed int. */
    for (i = 0; i <= i; i++) {
        printf("%i\n", i);
    }
    return 0;
}
And let it run till it stops ^^
The size will obviously be limited by the total size of hard drives you manage to put into your PC. After all, you can store a number in a text file occupying all disk space.
You can have 4x2TB drives even in a simple box, so around 8TB available. If you store the number as binary, then the biggest number is about 2 to the power of 64,000,000,000,000.
If your hard drive is 1 TB (8'000'000'000'000 bits), and you would print the number that fits on it on paper as hex digits (nobody would do that, but let's assume), that's 2,000,000,000,000 hex digits.
Each page would contain 4000 hex digits (40 x 100 digits). That's 500,000,000 pages.
Now stack the pages on top of each other (let's say each page is 0.004 inches / 0.1 mm thick), then the stack would be 50 km (about 31 miles) tall.
I'll try to give a practical answer.
Common Lisp number crunching is particularly powerful. It has something called "bignums", which are integers that can be arbitrarily large, limited only by the amount of available memory.
I don't know much about theory, but as far as I understood, your question is: what is the largest number that the computer can represent (and I add: in a reasonable time, and not by printing "9" until the Earth is "eaten by the Sun"). So I had my PC do one simple calculation (in PHP or whatever language): echo pow(2,1023), resulting in 8.9884656743116E+307. So I guess this is the largest number that my PC can calculate. On the other side, I think the representation of the largest negative number can be: -0,(0)1
LE: That computed value was obtained through PHP, but I tried to figure out what's the largest number that my Windows calculator can compute, and it is pow(2, 33219) = 8.2304951207588748764521361245002E+9999. Now I guess this is the largest number my PC can handle.
I think you should be very proud that your 5 year old is already asking questions like this.
And you should continue to promote that! This is truly amazing! With that said, I would say that saying Infinity does not
count is thinking incorrectly about what numbers mean in computer memory.
I feel like this way of thinking is a handicap.
Mathematicians will never be able to write out ALL the digits of pi or Euler's number, BUT we FULLY understand them.
Pi, as an example, is perfectly represented by this infinite series: (Pi / 4) = 1 - 1/3 + 1/5 - 1/7 + 1/9 - …
Just because you literally can’t go to inf. or print every single digit in a console means nothing.
You could have printed the symbol representing pi and therefore capturing the inf. series.
Computer Algebra Systems (CAS) represent numbers symbolically all the time. Pi, for instance,
may be a Symbolic object in memory (the binary in memory did not DIRECTLY represent the number. It represents an "mathematical algorithm" for producing the answer to arbitrary precision).
Then you do some math with it, transforming from one expression to the next.
At no point in time did we not represent the number COMPLETELY.
At the end, you can do 2 things with this:
A) Evaluate the expression, turning it into a number of some kind (or Matrix or whatever).
BUT this number could very well be an approximation (say like 20 digits of pi).
B) Keep it in its symbolic form for reference. Obviously we don’t like staring at symbols because we
need to eventually turn the knobs on the apparatus.
NOTE: sometimes you can get a finite (non-irrational) number perfectly represented in memory (like number 1)
by taking limits or going to inf. Not literally having an inf. number in memory, but symbolically representing it.
Just throw this in Wolfram alpha: Lim[Exp[-x], x --> Inf]; It gives you the number 0. Which is EXACT.
In short:
It was the HUMANS need to have some binary in memory that DIRECTLY represented the number that caused
the number to degrade. Symbolically it was perfectly represented. You could design some algorithm that
just continues to calculate the next digits of pi or eulers number giving you an arbitrary amount of precision (Now, this is obviously not practical of course).
I hope this was at least somewhat useful or interesting to you, even if you disagree =)
Depends on how much the computer can handle. Although there are some times when the computer can handle numbers greater than (2^(bits-1)-1)... For example:
My computer is 64 bit (9223372036854775807), however the calculator that comes with the computer itself can handle numbers of up to 10^9999.
Many other supercomputers can exceed these limits, and the one with the most memory (bits) might as well be the one with the record (current largest number that can be held by computers).
Or, if it comes to visually seeing it on computers, you can just make a program that, on monitor, repeats writing 9 and not skips that line to form an ever-growing bunch of 9. :P
Open Chrome, click the three dots at the top, go to Tools, then Developer Tools, click on Console and type Number.MAX_VALUE

iPhone and floating point math

I have following code:
float totalSpent;
int intBudget;
float moneyLeft;
totalSpent += Amount;
moneyLeft = intBudget - totalSpent;
And this is how it looks in debugger:
Why is moneyLeft as calculated by the code above .02 different from the expression calculated by the debugger?
The expression window is correct, yet the code above produces a result that is wrong by .02. It only happens for very large numbers (yet way below the int limit).
A single-precision float has 24 bits of precision (23 stored explicitly plus an implicit leading bit). That means that every calculation is rounded to 24 binary digits. This means that if you have a computation that, say, adds a very small number to a very large number, the rounding may produce strange results.
Imagine that you are doing math in scientific notation decimal by hand, under the rule that you may only have four significant figures. Let's say I ask you to write twelve in scientific notation, with four significant figures. Remembering junior high school, you write:
1.200 × 10^1
Now I say compute the square of 12, and then add 0.5. That is easy enough:
1.440 × 10^2 + 0.005 × 10^2 = 1.445 × 10^2
How about twelve cubed plus 0.75:
1.728 × 10^3 + 0.00075 × 10^3 = 1.72875 × 10^3
But remember, I only gave you room for four significant digits, so you must round; then we get:
1.728 × 10^3 + 7.5 × 10^-1 = 1.729 × 10^3
See? The lack of precision can make the computation come out with unexpected results.
In your example, you've got 999999 in a calculation where you're trying to be precise to 0.01. log2(999999) = 19.93 and log2(0.01) = -6.64. The difference is more than 23; therefore you would need more than 23 binary digits to perform this calculation accurately.
Because floating point mathematics rounds off precision by its very nature, it is usually a bad choice for currency computation, where you must be accurate to the last cent. But are you really concerned with fractions of a cent in your application? If not, then why not do away with the decimal point altogether, and simply store cents (instead of dollars) in a 64-bit integer? 2^64¢ is more than the GDP of the entire planet.
Floating point will always produce strange results with money type calculations.
The golden rule is that floating point is good for things you measure (litres, yards, lightyears, bushels, etc.) but not for things you count, like sheep, beans, buttons, etc.
Most money calculations are to do with counting pennies, so use integer math and you won't get the strange results. Either use a fixed decimal arithmetic library (which would probably be overkill on an iPhone) or store your amounts as whole numbers of cents and only convert to $ and cents on display.