32-1024 bit fixed point vector arithmetic with AVX-2 - biginteger

For a Mandelbrot generator I want to use fixed-point arithmetic going from 32 up to maybe 1024 bits as you zoom in.
Now normally SSE or AVX is no help there due to the lack of add-with-carry, and plain scalar integer arithmetic is faster. But in my case I have literally millions of pixels that all need to be computed. So I have a huge vector of values that all need to go through the same iterative formula over and over, a million times as well.
So I'm not looking at doing a fixed point add/sub/mul on single values but doing it on huge vectors. My hope is that for such vector operations AVX/AVX2 can still be utilized to improve the performance despite the lack of native add with carry.
Does anyone know of a library for fixed-point arithmetic on vectors, or some example code showing how to emulate add-with-carry on AVX/AVX2?

FP extended precision gives more bits per clock cycle (because double FMA throughput is 2/clock vs. 32x32=>64-bit at 1 or 2/clock on Intel CPUs); consider using the same tricks that Prime95 uses with FMA for integer math. With care it's possible to use FPU hardware for bit-exact integer work.
For your actual question: since you want to do the same thing to multiple pixels in parallel, probably you want to do carries between corresponding elements in separate vectors, so one __m256i holds 64-bit chunks of 4 separate bigintegers, not 4 chunks of the same integer.
Register pressure is a problem for very wide integers with this strategy. Perhaps you can usefully branch on there being no carry propagation past the 4th or 6th vector of chunks, or something, by using vpmovmskb on the compare result to generate the carry-out after each add. An unsigned add has a carry-out if a+b < a (unsigned compare).
But AVX2 only has signed integer compares (for greater-than), not unsigned. And with carry-in, (a+b+c_in) == a is possible either with b = carry_in = 0 or with b = 0xFFF... and carry_in = 1, so generating the carry-out is not simple.
To solve both those problems, consider using chunks with manual wrapping to 60-bit or 62-bit or something, so they're guaranteed to be signed-positive and so carry-out from addition appears in the high bits of the full 64-bit element. (Where you can vpsrlq ymm, 62 to extract it for addition into the vector of next higher chunks.)
Maybe even 63-bit chunks would work here so carry appears in the very top bit, and vmovmskpd can check if any element produced a carry. Otherwise vptest can do that with the right mask.
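To make the reduced-width-limb idea concrete, here is a minimal sketch, assuming 62 value bits per 64-bit element and column-major storage (each __m256i holds one limb of four independent numbers, 32-byte aligned); the function name and layout are illustrative, not from an existing library:
#include <immintrin.h>

/* Hypothetical sketch: dst = a + b for 4 independent bigintegers at once.
 * Limb i of all four numbers sits in a[i]; least-significant limb first.  */
void bigint4_add_62(__m256i *dst, const __m256i *a, const __m256i *b, int nlimbs)
{
    const __m256i mask = _mm256_set1_epi64x((1LL << 62) - 1);
    __m256i carry = _mm256_setzero_si256();
    for (int i = 0; i < nlimbs; i++) {
        __m256i sum = _mm256_add_epi64(a[i], b[i]);   /* <= 2*(2^62-1): no 64-bit wrap */
        sum = _mm256_add_epi64(sum, carry);           /* add carry-in (0 or 1) per lane */
        carry = _mm256_srli_epi64(sum, 62);           /* carry-out sits in bits 62..63  */
        dst[i] = _mm256_and_si256(sum, mask);         /* keep the low 62 value bits     */
    }
    /* a non-zero carry here means the result needs one more limb */
}
Because every limb is at most 2^62-1, the two additions can never wrap a 64-bit element, so the carry simply appears above bit 61 and can be shifted down and fed into the next limb; no unsigned compare is needed.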
This is a hand-wavy kind of brainstorm answer; I don't have any plans to expand it into a detailed answer. If anyone wants to write actual code based on this, please post your own answer so we can upvote that (if it turns out to be a useful idea at all).

Just for kicks, without claiming that this will be actually useful, you can extract the carry bit of an addition by just looking at the upper bits of the input and output values.
unsigned result = a + b + last_carry;  // add a, b and (optionally) the last carry
unsigned carry = (a & b)               // carry if both a AND b have the upper bit set
               |                       // OR
               ((a ^ b)                // the upper bits of a and b differ AND
               & ~result);             // the upper bit of the result is not set
carry >>= sizeof(unsigned)*8 - 1;      // shift the upper bit down to bit 0
With SSE2/AVX2 this can be implemented with two additions, four logic operations and one shift, and it works for any supported element size (uint8, uint16, uint32, uint64). With AVX2 you'd need 7 uops to get four 64-bit additions with carry-in and carry-out.
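A minimal intrinsics sketch of that formula, assuming full 64-bit limbs with one limb of four independent numbers per __m256i and the carry kept as 0/1 in each lane (the helper name and layout are just for illustration):
#include <immintrin.h>

/* Hypothetical helper: returns a + b + *carry per 64-bit lane,
 * and writes the per-lane carry-out (0 or 1) back into *carry. */
static inline __m256i add64_with_carry(__m256i a, __m256i b, __m256i *carry)
{
    __m256i sum = _mm256_add_epi64(a, b);
    sum = _mm256_add_epi64(sum, *carry);                        /* two additions    */
    __m256i c = _mm256_or_si256(
        _mm256_and_si256(a, b),                                 /* (a & b)          */
        _mm256_andnot_si256(sum, _mm256_xor_si256(a, b)));      /* ((a ^ b) & ~sum) */
    *carry = _mm256_srli_epi64(c, 63);                          /* MSB -> 0/1 carry */
    return sum;
}
That is exactly the 7 uops counted above: two additions, four logic operations and one shift per vector of limbs.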
Especially since multiplying 64x64-->128 is not possible either (it would require four 32x32-->64 products and some additions, or three 32x32-->64 products and even more additions, plus special-case handling), you will likely not be more efficient than with scalar mul and adc (unless, perhaps, register pressure is your bottleneck).
As Peter and Mystical suggested, working with smaller limbs (still stored in 64 bits) can be beneficial. On the one hand, with some trickery, you can use FMA for 52x52-->104 products. On the other hand, you can add up to 2^k - 1 numbers of 64-k bits before you need to propagate the accumulated upper bits into the next limb.

Related

What accounts for most of the integer multiply instructions?

The majority of integer multiplications don't actually need multiply:
Floating-point is, and has been since the 486, normally handled by dedicated hardware.
Multiplication by a constant, such as for scaling an array index by the size of the element, can be reduced to a left shift in the common case where it's a power of two, or to a short sequence of shifts and additions in the general case (a sketch follows this list).
Multiplications associated with accessing a 2D array can often be strength-reduced to addition when they occur inside a loop.
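For illustration, here is roughly what a compiler does with a constant multiplier such as 10 (a hedged sketch, not taken from any particular compiler's output):
#include <stdint.h>

/* Multiply by the constant 10 without a multiply instruction:
 * 10*x = 8*x + 2*x = (x << 3) + (x << 1)                       */
static inline uint64_t times10(uint64_t x)
{
    return (x << 3) + (x << 1);
}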
So what's left?
Certain library functions like fwrite that take a number of elements and an element size as runtime parameters.
Exact decimal arithmetic e.g. Java's BigDecimal type.
Such forms of cryptography as require multiplication and are not handled by their own dedicated hardware.
Big integers e.g. for exploring number theory.
Other cases I'm not thinking of right now.
None of these jump out at me as wildly common, yet all modern CPU architectures include integer multiply instructions. (RISC-V omits them from the minimal version of the instruction set, but has been criticized for even going this far.)
Has anyone ever analyzed a representative sample of code, such as the SPEC benchmarks, to find out exactly what use case accounts for most of the actual uses of integer multiply (as measured by dynamic rather than static frequency)?

Can a CRC32 engine be used for computing CRC16 hashes?

I'm working with a microcontroller with native HW functions to calculate CRC32 hashes from chunks of memory, where the polynomial can be freely defined. It turns out that the system has different data-links with different bit-lengths for CRC, like 16 and 8 bit, and I intend to use the hardware engine for it.
In simple tests with online tools I've concluded that it is possible to find a 32-bit polynomial that gives the same result as an 8-bit CRC, for example:
hashing "a sample string" with 8-bit engine and poly 0xb7 yelds a result 0x97
hashing "a sample string" with 16-bit engine and poly 0xb700 yelds a result 0x9700
...32-bit engine and poly 0xb7000000 yelds a result 0x97000000
(with zero initial value and zero final xor, no reflections)
So, padding the poly with zeros and right-shifting the results seems to work.
But is it 'always' possible to find a set of parameters that make a 32-bit engine work as a 16- or 8-bit one (including poly, final XOR, init value and inversions)?
To provide more context and prevent 'bypass answers' like 'don't use the native engine': I have a scenario in a safety-critical system where it's necessary to prevent a common design error from propagating to redundant processing nodes. One solution for that is having software-based CRC calculation in one node, and hardware-based in its pair.
Yes, what you're doing will work in general for CRCs that are not reflected. The pre and post conditioning can be done very simply with code around the hardware instructions loop.
Assuming that the hardware CRC doesn't have an option for this, to do a reflected CRC you would need to reflect each input byte, and then reflect the final result. That may defeat the purpose of using a hardware CRC. (Though if your purpose is just to have a different implementation, then maybe it wouldn't.)
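As a quick sanity check of the zero-padding trick, here is a small bit-at-a-time comparison in C (assuming MSB-first processing, zero initial value and zero final XOR, matching the question's setup; the polynomial 0xB7 is the one from the example):
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Bit-at-a-time, non-reflected CRC-8: poly 0xB7, init 0, no final XOR. */
static uint8_t crc8_b7(const uint8_t *p, size_t n)
{
    uint8_t crc = 0;
    while (n--) {
        crc ^= *p++;
        for (int i = 0; i < 8; i++)
            crc = (crc & 0x80) ? (uint8_t)((crc << 1) ^ 0xB7) : (uint8_t)(crc << 1);
    }
    return crc;
}

/* Same construction with a 32-bit register and the zero-padded poly 0xB7000000. */
static uint32_t crc32_b7pad(const uint8_t *p, size_t n)
{
    uint32_t crc = 0;
    while (n--) {
        crc ^= (uint32_t)*p++ << 24;
        for (int i = 0; i < 8; i++)
            crc = (crc & 0x80000000u) ? (crc << 1) ^ 0xB7000000u : (crc << 1);
    }
    return crc;
}

int main(void)
{
    const char *s = "a sample string";
    printf("crc8  = 0x%02X\n", crc8_b7((const uint8_t *)s, strlen(s)));
    printf("crc32 = 0x%08X\n", crc32_b7pad((const uint8_t *)s, strlen(s)));
    /* the 8-bit result should reappear as the top byte of the 32-bit result */
    return 0;
}
Algebraically this works because the padded generator is the original generator multiplied by x^24, so the 32-bit remainder is just the 8-bit remainder shifted up by 24 bits.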
You don't have to guess. You can calculate it. Because CRC is a remainder of a division by an irreducible polynomial, it's a 1-to-1 function on its domain.
So, CRC16, for example, has to produce 65536 (64k) unique results if you run it over 0 through 65535.
To see if you get the same outcome by taking parts of CRC32, run it over 0 through 65535, keep the 2 bytes that you want to keep, and then see if there is any collision.
If your data has 32 bits in it, then it should not be an issue. The issue arises if you have numbers of fewer than 32 bits and you shuffle them around in a 32-bit space; their first and last bytes are not guaranteed to be uniformly distributed.
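A brute-force version of that check might look like this (a sketch; the non-reflected, zero-init, zero-xorout CRC, the example polynomial 0xB7000000, and the choice of keeping the top two bytes are all placeholders to swap for your actual parameters):
#include <stdint.h>
#include <stdio.h>

/* Non-reflected CRC-32, zero init, zero final XOR, arbitrary polynomial. */
static uint32_t crc32_msb(uint32_t poly, const uint8_t *p, size_t n)
{
    uint32_t crc = 0;
    while (n--) {
        crc ^= (uint32_t)*p++ << 24;
        for (int i = 0; i < 8; i++)
            crc = (crc & 0x80000000u) ? (crc << 1) ^ poly : (crc << 1);
    }
    return crc;
}

int main(void)
{
    static uint8_t seen[65536 / 8];          /* bitmap of 16-bit values already produced */
    int collisions = 0;

    for (uint32_t v = 0; v <= 65535; v++) {
        uint8_t msg[2] = { (uint8_t)(v >> 8), (uint8_t)v };   /* each 16-bit input      */
        uint32_t crc  = crc32_msb(0xB7000000u, msg, sizeof msg);
        uint16_t kept = (uint16_t)(crc >> 16);                /* the two bytes you keep */
        if (seen[kept >> 3] & (1u << (kept & 7)))
            collisions++;
        else
            seen[kept >> 3] |= (uint8_t)(1u << (kept & 7));
    }
    printf("collisions over all 16-bit inputs: %d\n", collisions);
    return 0;
}
A count of zero means that particular combination of polynomial and kept bytes behaves as a proper 16-bit CRC over 16-bit inputs; anything else means it doesn't.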

Compute e^x for float values in System Verilog?

I am building a neural network running on an FPGA, and the last piece of the puzzle is running a sigmoid function in hardware. This is either:
1/(1 + e^-x)
or
(atan(x) + 1) / 2
Unfortunately, x here is a float value (a real value in SystemVerilog).
Are there any tips on how to implement either of these functions in SystemVerilog?
This is really confusing to me since both of these functions are complex, and I don't even know where to begin implementing them due to the added complexity of being float values.
One simple way to do this is to create a memory/array for the function. However, that option can be quite inefficient.
x is used as the read address for the memory, and the value stored at that location is the output of the function.
Suppose value of your function is as follow. (This is just an example)
x = 0 => f(0) = 1
x = 1 => f(1) = 2
x = 2 => f(2) = 3
x = 3 => f(3) = 4
So you can create an array for this, which stores the output values.
int a[4] = '{1, 2, 3, 4};
I just did this with Vivado HLS, which allows you to write circuits in C.
Here is my C code.
#include <math.h>

void exp_array(float a[10], float b[10])
{
    int i;
    for (i = 0; i < 10; i++)
    {
        b[i] = expf(a[i]);   /* element-wise exponential */
    }
}
One remaining problem is that it seems impossible to use an unsized array here. Maybe there is another way that I don't know of.
As you seem to realize, type real is not synthesizable. You need to operate on the integer mantissa and integer exponent separately and combine them when you are done, having tracked the sign. Once you take care of e^-x, the rest should be straightforward.
Try this page for a quick explanation: https://www.geeksforgeeks.org/floating-point-representation-digital-logic/
and search on "floating point digital design" for more explanations/examples.
Do you really need a floating number for this? Is fixed point sufficient?
Considering (atan(x) + 1) / 2, quite likely the only useful values of x are those where the exponent is fairly small (if the exponent is large, atan(x) is essentially pi/2).
atan of a fixed point number can be calculated in HW fairly easily; there are cordic methods (see https://zipcpu.com/dsp/2017/08/30/cordic.html) and direct methods; see for example https://dspguru.com/dsp/tricks/fixed-point-atan2-with-self-normalization/
FPGA synthesis flows generally do not support floating-point numbers directly in the FPGA fabric; fixed point with limited precision is more commonly used.
A limited precision fixed point approach:
Use Matlab to create an array of samples for your math function such that the largest value is +/- .99999. For 8 bit precision (actually 7 with sign bit), multiply those numbers by 128, round at the decimal point and drop the fractional part. Write those numbers to a text file in 2s complement hex format. In SystemVerilog you can implement a ROM using that text file. Use $readmemh() to read these numbers into a memory style variable (one that has both a packed and unpacked dimension). Link to a tutorial:
https://projectf.io/posts/initialize-memory-in-verilog/.
Now you have a ROM with limited precision samples of your function
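If Matlab isn't handy, a small C program can generate the same kind of ROM initialization file for $readmemh (a sketch; the 256-entry depth, the Q1.7 output format, the x range of [-8, +8) and the file name sigmoid_rom.hex are all assumptions for illustration):
#include <math.h>
#include <stdio.h>

/* Write 256 sigmoid samples, scaled to signed 8-bit (Q1.7),
 * as 2s-complement hex digits, one value per line.          */
int main(void)
{
    FILE *f = fopen("sigmoid_rom.hex", "w");
    if (!f) return 1;
    for (int i = 0; i < 256; i++) {
        double x = (i - 128) / 16.0;               /* ROM address maps to x in [-8, +8) */
        double y = 1.0 / (1.0 + exp(-x));          /* sigmoid, in (0, 1)                */
        int q = (int)lround(y * 128.0);            /* scale by 128 as described above   */
        if (q > 127) q = 127;                      /* clamp: +1.0 is not representable  */
        fprintf(f, "%02x\n", (unsigned)q & 0xffu);
    }
    fclose(f);
    return 0;
}
The resulting file can then be loaded in SystemVerilog with $readmemh("sigmoid_rom.hex", rom) into a 256-entry, 8-bit-wide memory (compile the generator with -lm).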
Section 21.4, "Loading memory array data from a file", in the SystemVerilog specification provides the definition of $readmemh(). Here is that doc:
https://ieeexplore.ieee.org/document/8299595
If you need floating point, one possibility is to use a processor soft core with a floating-point unit implemented in FPGA fabric, and run software on that core. The core interfaces to the rest of the FPGA fabric over a physical bus such as AXI4-Stream. See:
https://www.xilinx.com/products/design-tools/microblaze.html to get started.
It is a very different workflow from ordinary FPGA design and uses different tools. A C or C++ compiler with math libraries (tan, exp, div, etc.) would be used along with the processor core.
Another possibility for fixed point is an FPGA with a hard processor core; Xilinx Zynq is one of them. This is a complex and powerful approach. A free book provides knowledge on how to use Zynq:
http://www.zynqbook.com/.
This workflow is even more complex than the soft-core approach because the Zynq is a more complex platform (a hard processor and FPGA integrated on one chip).
It's pretty hard to implement non-linear functions like that in hardware, and on top of that floating-point arithmetic is even more costly. It's definitely better (and recommended) to work with fixed-point arithmetic, as mentioned in the answers before. The number of precision bits in fixed-point arithmetic will depend on your accuracy requirement and error tolerance.
For hardware implementations, any non-linear function can be approximated as a piecewise-linear function, using a ROM-based implementation as described in the previous answers. The number of sample points you take from the non-linear function determines your accuracy: the more samples you store, the better the approximation. Often the number of samples you can store is restricted by the amount of fast/local memory available. In that case, to optimize memory, you can add a little extra compute and perform linear interpolation between stored samples to calculate the needed values.
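As a bit-accurate software reference for that interpolation scheme, here is a hedged sketch in C (assuming 16 linear segments over x in [-8, +8), Q4.4 input and Q1.7 output; all of these parameters are illustrative only):
#include <math.h>
#include <stdint.h>
#include <stdio.h>

#define SEGS 16                         /* linear segments across [-8, +8) */

static int8_t lut[SEGS + 1];            /* Q1.7 samples at the segment boundaries */

static void build_lut(void)
{
    for (int i = 0; i <= SEGS; i++) {
        double x = -8.0 + 16.0 * i / SEGS;             /* boundary position       */
        int q = (int)lround(128.0 / (1.0 + exp(-x)));  /* sigmoid scaled to Q1.7  */
        lut[i] = (int8_t)(q > 127 ? 127 : q);          /* clamp the +1.0 endpoint */
    }
}

/* Piecewise-linear sigmoid: x in Q4.4 covering [-8, +8), result in Q1.7. */
static int8_t sigmoid_pwl(int8_t x_q44)
{
    uint8_t u = (uint8_t)(x_q44 + 128);     /* shift [-8, +8) onto [0, 256)        */
    int idx  = u >> 4;                      /* which of the 16 segments            */
    int frac = u & 0x0f;                    /* position inside the segment, 0..15  */
    int y0 = lut[idx], y1 = lut[idx + 1];
    return (int8_t)(y0 + (((y1 - y0) * frac) >> 4));   /* linear interpolation */
}

int main(void)
{
    build_lut();
    for (int i = -128; i < 128; i += 32)
        printf("x=%6.2f  pwl=%4d/128  exact=%.4f\n",
               i / 16.0, sigmoid_pwl((int8_t)i), 1.0 / (1.0 + exp(-i / 16.0)));
    return 0;
}
The same table and interpolation arithmetic map directly onto a small ROM, a subtractor, a multiplier (or shift-add), and an adder in the FPGA fabric.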

What's the biggest number in a computer?

Just asked by my 5-year-old kid: what is the biggest number in the computer?
We are not talking about the max value of a specific data type, but the biggest number that a computer can represent.
Infinity is not allowed.
UPDATE: my kid always wants to print as well, so let's say the computer needs to print this number and the kid needs to know that it's a big number. Of course, in practice we won't print it because there aren't enough trees.
This question is actually a very interesting one which mathematicians have devoted a fair bit of thought to. You can read about it in this article, which is a fascinating and accessible read.
Briefly, a guy named Tibor Rado set out to find some really big, but still well-defined, numbers by defining a sequence called the Busy Beaver numbers. He defined BB(n) to be the largest number of steps an n-state Turing machine could take before halting, starting from a blank tape. Note that this sequence is by its very nature not computable, so the numbers themselves, while well-defined, are very difficult to pin down. Here are the first few:
BB(1) = 1
BB(2) = 6
BB(3) = 21
BB(4) = 107
... wait for it ...
BB(5) >= 8,690,333,381,690,951
No one is sure how big exactly BB(5) is, but it is finite. And no one has any idea how big BB(6) and above are. But at least these numbers are completely well-defined mathematically, unlike "the largest number any human has ever thought of, plus one." ;)
So how about this:
The biggest number a computer can represent is the most instructions a program small enough to fit in its available memory can perform before halting.
Squared.
No, wait, cubed. No, raised to the power of itself!
Dammit!
Bits are not numbers. You, as a programmer, give them the meaning you want, possibly numbers.
Now, I decide that 1 represents "the biggest number ever thought by a human plus one".
Errr this is a five year old?
How about something along the lines of: "I'd love to tell you but the number is so big and would take so long to say, I'd die before I finished telling you".
// wait to see
for (;;)
{
    printf("9");
}
roughly 2^AVAILABLE_MEMORY_IN_BITS
EDIT: The above is for actually storing a number and treats all media (RAM, HD, cloud etc.) as memory. Subtracting the OS footprint (measured in KB) doesn't make "roughly" less accurate...
If you want to "represent" a number in a meaningful way, then you probably want to go with what the CPU provides: unsigned 32 bit integers (roughly 4 Gigs) or unsigned 64 bit integers for most computers your kid will come into contact with.
NOTE for talking to 5-year-olds: Often, they just want a factoid. Give him a really big and very accurate number (lots of digits), like 4'294'967'295. Then, once the glazing leaves his eyes, try to see how far you can get with explaining how computers represent numbers.
EDIT #2: I once read this article: Who Can Name the Bigger Number that should provide a whole lot of interesting information for your kid. Obviously he's not your normal five-year-old. So this might get you started in a cool direction about numbers and computation.
The answer to life (and this kids question): 42
That depends on the datatype you use to represent it. The computer only stores bits (0/1). We, as developers, give the bits meaning. (65 can be a number or the letter A).
For example, I can define my datatype as 1^N where N is unsigned and represented by an array of bits of arbitrary size. The next person can come up with 10^N which would be ten times larger than my biggest number.
Sure, there would be gaps but if you don't need them, that doesn't matter.
Therefore, the question is meaningless since it doesn't have context.
Well, I had the same question earlier today, so I thought: why not write a little C++ code to see where the computer is going to stop...
But my laptop wasn't with me in class, so I used another one. The number got very big but it never ended; I'll run it again overnight and then share the number.
You can try it, the code is stupid:
#include <stdlib.h>
#include <stdio.h>

int main() {
    int i = 0;
    for (i = 0; i <= i; i++) {
        printf("%i\n", i);
        i++;
    }
}
And let it run till it stops ^^
The size will obviously be limited by the total size of the hard drives you manage to put into your PC. After all, you can store a number in a text file occupying all disk space.
You can have 4 x 2 TB drives even in a simple box, so around 8 TB available. If you store it as binary, then the biggest number is roughly 2^64,000,000,000,000.
If your hard drive is 1 TB (8,000,000,000,000 bits), and you printed the number that fits on it on paper as hex digits (nobody would do that, but let's assume), that's 2,000,000,000,000 hex digits.
Each page would contain 4,000 hex digits (40 x 100 digits). That's 500,000,000 pages.
Now stack the pages on top of each other (let's say each page is 0.004 inches / 0.1 mm thick): the stack would be about 50 km (roughly 31 miles) tall.
I'll try to give a practical answer.
Common Lisp number crunching is particularly powerful. It has something called "bignums", which are integers that can be arbitrarily large, limited only by the amount of available memory.
See: http://en.wikibooks.org/wiki/Common_Lisp/Advanced_topics/Numbers#Fixnums_and_Bignums
I don't know much about theory, but as far as I understood, your question is: what is the largest number that the computer can represent (and I add: in a reasonable time, and not by printing "9" until the Earth is "eaten by the Sun"). So I put my PC to work on one simple calculation (in PHP or whatever language): echo pow(2,1023) - resulting in 8.9884656743116E+307. So I guess this is the largest number that my PC can calculate. On the other side, I think the representation of the largest negative number can be: -0,(0)1
LE: That value was obtained through PHP, but I also tried to figure out the largest number my Windows calculator can compute, and it is pow(2, 33219) = 8.2304951207588748764521361245002E+9999. Now I guess this is the largest number my PC can handle.
I think you should be very proud that your 5-year-old is already asking questions like this. And you should continue to promote that! This is truly amazing! With that said, I would say that ruling out infinity is thinking incorrectly about what numbers mean in computer memory.
I feel like this way of thinking is a handicap.
Mathematicians will never be able to write out ALL the digits of pi or Euler's number, BUT we FULLY understand them. Pi, as an example, is perfectly represented by this infinite series: (Pi / 4) = 1 - 1/3 + 1/5 - 1/7 + 1/9 - …
Just because you literally can't go to infinity or print every single digit in a console means nothing. You could have printed the symbol representing pi and therefore captured the infinite series.
Computer Algebra Systems (CAS) represent numbers symbolically all the time. Pi, for instance, may be a symbolic object in memory (the binary in memory does not DIRECTLY represent the number; it represents a mathematical algorithm for producing the answer to arbitrary precision). Then you do some math with it, transforming from one expression to the next. At no point in time did we fail to represent the number COMPLETELY.
At the end, you can do 2 things with this:
A) Evaluate the expression, turning it into a number of some kind (or a matrix or whatever). BUT this number could very well be an approximation (say, 20 digits of pi).
B) Keep it in its symbolic form for reference. Obviously we don't like staring at symbols forever, because we eventually need to turn the knobs on the apparatus.
NOTE: sometimes you can get a finite (non-irrational) number perfectly represented in memory (like the number 1) by taking limits or going to infinity: not by literally having an infinite number in memory, but by representing it symbolically. Just throw this into Wolfram Alpha: Lim[Exp[-x], x --> Inf]; it gives you the number 0. Which is EXACT.
In short: it was the HUMAN need to have some binary in memory that DIRECTLY represented the number that caused the number to degrade. Symbolically it was perfectly represented. You could design some algorithm that just continues to calculate the next digits of pi or Euler's number, giving you an arbitrary amount of precision (though this is obviously not practical).
I hope this was at least somewhat useful or interesting to you, even if you disagree =)
It depends on how much the computer can handle, although sometimes the computer can handle numbers greater than 2^(bits-1)-1... For example:
My computer is 64-bit (9223372036854775807), yet the calculator that comes with the computer itself can handle numbers of up to 10^9999.
Many supercomputers exceed these limits, and the one with the most memory (bits) might well hold the record (the current largest number that can be held by a computer).
Or, if it comes to visually seeing it on a computer, you can just write a program that keeps printing 9s on the monitor without ever breaking the line, forming an ever-growing run of 9s. :P
Open Chrome, click the three-dot menu in the corner, go to More Tools > Developer Tools, click on the Console tab, and type Number.MAX_VALUE.

Why are 5381 and 33 so important in the djb2 algorithm?

The djb2 algorithm has a hash function for strings.
unsigned long hash(unsigned char *str)
{
    unsigned long hash = 5381;
    int c;
    while (c = *str++)
        hash = ((hash << 5) + hash) + c; /* hash * 33 + c */
    return hash;
}
Why are 5381 and 33 so important?
This hash function is similar to a Linear Congruential Generator (LCG, a simple class of functions that generate a series of pseudo-random numbers), which generally has the form:
X = (a * X) + c; // "mod M", where M = 2^32 or 2^64 typically
Note the similarity to the djb2 hash function... a=33, M=2^32. In order for an LCG to have a "full period" (i.e. as random as it can be), a must have certain properties:
a-1 is divisible by all prime factors of M (a-1 is 32, which is divisible by 2, the only prime factor of 2^32)
a-1 is a multiple of 4 if M is a multiple of 4 (yes and yes)
In addition, c and M are supposed to be relatively prime (which will be true for odd values of c).
So as you can see, this hash function somewhat resembles a good LCG. And when it comes to hash functions, you want one that produces a "random" distribution of hash values given a realistic set of input strings.
As for why this hash function is good for strings, I think it has a good balance of being extremely fast, while providing a reasonable distribution of hash values. But I've seen many other hash functions which claim to have much better output characteristics, but involved many more lines of code. For instance see this page about hash functions
EDIT: This good answer explains why 33 and 5381 were chosen for practical reasons.
33 was chosen because:
1) As stated before, multiplication is easy to compute using shift and add.
2) As you can see from the shift and add implementation, using 33 makes two copies of most of the input bits in the hash accumulator, and then spreads those bits relatively far apart. This helps produce good avalanching. Using a larger shift would duplicate fewer bits, using a smaller shift would keep bit interactions more local and make it take longer for the interactions to spread.
3) The shift of 5 is relatively prime to 32 (the number of bits in the register), which helps with avalanching. While there are enough characters left in the string, each bit of an input byte will eventually interact with every preceding bit of input.
4) The shift of 5 is a good shift amount when considering ASCII character data. An ASCII character can sort of be thought of as a 4-bit character type selector and a 4-bit character-of-type selector. E.g. the digits all have 0x3 in the first 4 bits. So an 8-bit shift would cause bits with a certain meaning to mostly interact with other bits that have the same meaning. A 4-bit or 2-bit shift would similarly produce strong interactions between like-minded bits. The 5-bit shift causes many of the four low order bits of a character to strongly interact with many of the 4-upper bits in the same character.
As stated elsewhere, the choice of 5381 isn't too important and many other choices should work as well here.
This is not a fast hash function, since it processes its input a character at a time and doesn't try to use instruction-level parallelism. It is, however, easy to write. Quality of the output divided by ease of writing the code is likely to hit a sweet spot.
On modern processors, multiplication is much faster than it was when this algorithm was developed and other multiplication factors (e.g. 2^13 + 2^5 + 1) may have similar performance, slightly better output, and be slightly easier to write.
Contrary to an answer above, a good non-cryptographic hash function doesn't want to produce random output. Instead, given two inputs that are nearly identical, it wants to produce widely different outputs. If your input values are randomly distributed, you don't need a good hash function; you can just use an arbitrary set of bits from your input. Some of the modern hash functions (Jenkins 3, Murmur, probably CityHash) produce a better distribution of outputs than random, given inputs that are highly similar.
On 5381, Dan Bernstein (djb2) says in this article:
[...] practically any good multiplier works. I think you're worrying about the fact that 31c + d doesn't cover any reasonable range of hash values if c and d are between 0 and 255. That's why, when I discovered the 33 hash function and started using it in my compressors, I started with a hash value of 5381. I think you'll find that this does just as well as a 261 multiplier.
The whole thread is here if you're interested.
Ozan Yigit has a page on hash functions which says:
[...] the magic of number 33 (why it works better than many other constants, prime or not) has never been adequately explained.
Maybe because 33 == 2^5 + 1 and many hashing algorithms use 2^n + 1 as their multiplier?
Credit to Jerome Berger
Update:
This seems to be borne out by the current version of the software package that djb2 originally came from: cdb
The notes I linked to describe the heart of the hashing algorithm as using h = ((h << 5) + h) ^ c to do the hashing... x << 5 is a fast hardware way to use 2^5 as the multiplier.
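For comparison, here is a minimal sketch of both variants side by side (a hedged illustration of the formulas discussed above, not the actual cdb source):
#include <stdio.h>

/* djb2, additive variant: hash = hash * 33 + c */
unsigned long djb2_add(const unsigned char *str)
{
    unsigned long hash = 5381;
    int c;
    while ((c = *str++))
        hash = ((hash << 5) + hash) + c;
    return hash;
}

/* XOR variant as used in cdb-style code: hash = (hash * 33) ^ c */
unsigned long djb2_xor(const unsigned char *str)
{
    unsigned long hash = 5381;
    int c;
    while ((c = *str++))
        hash = ((hash << 5) + hash) ^ c;
    return hash;
}

int main(void)
{
    printf("%lu\n%lu\n",
           djb2_add((const unsigned char *)"hello"),
           djb2_xor((const unsigned char *)"hello"));
    return 0;
}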