What accounts for most of the integer multiply instructions? - cpu-architecture

The majority of integer multiplications don't actually need multiply:
Floating-point is, and has been since the 486, normally handled by dedicated hardware.
Multiplication by a constant, such as for scaling an array index by the size of the element, can be reduced to a left shift in the common case where it's a power of two, or a sequence of left shifts and additions in the general case.
Multiplications associated with accessing a 2D array, can often be strength reduced to addition if it's in the context of a loop.
So what's left?
Certain library functions like fwrite that take a number of elements and an element size as runtime parameters.
Exact decimal arithmetic e.g. Java's BigDecimal type.
Such forms of cryptography as require multiplication and are not handled by their own dedicated hardware.
Big integers e.g. for exploring number theory.
Other cases I'm not thinking of right now.
None of these jump out at me as wildly common, yet all modern CPU architectures include integer multiply instructions. (RISC-V omits them from the minimal version of the instruction set, but has been criticized for even going this far.)
Has anyone ever analyzed a representative sample of code, such as the SPEC benchmarks, to find out exactly what use case accounts for most of the actual uses of integer multiply (as measured by dynamic rather than static frequency)?

Related

32-1024 bit fixed point vector arithmetic with AVX-2

For a mandelbrot generator I want to used fixed point arithmetic going from 32 up to maybe 1024 bit as you zoom in.
Now normaly SSE or AVX is no help there due to the lack of add with carry and doing normal integer arithmetic is faster. But in my case I have literally millions of pixels that all need to be computed. So I have a huge vector of values that all need to go through the same iterative formula over and over a million times too.
So I'm not looking at doing a fixed point add/sub/mul on single values but doing it on huge vectors. My hope is that for such vector operations AVX/AVX2 can still be utilized to improve the performance despite the lack of native add with carry.
Anyone know of a library for fixed point arithmetic on vectors or some example code how to do emulate add with carry on AVX/AVX2.
FP extended precision gives more bits per clock cycle (because double FMA throughput is 2/clock vs. 32x32=>64-bit at 1 or 2/clock on Intel CPUs); consider using the same tricks that Prime95 uses with FMA for integer math. With care it's possible to use FPU hardware for bit-exact integer work.
For your actual question: since you want to do the same thing to multiple pixels in parallel, probably you want to do carries between corresponding elements in separate vectors, so one __m256i holds 64-bit chunks of 4 separate bigintegers, not 4 chunks of the same integer.
Register pressure is a problem for very wide integers with this strategy. Perhaps you can usefully branch on there being no carry propagation past the 4th or 6th vector of chunks, or something, by using vpmovmskb on the compare result to generate the carry-out after each add. An unsigned add has carry out of a+b < a (unsigned compare)
But AVX2 only has signed integer compares (for greater-than), not unsigned. And with carry-in, (a+b+c_in) == a is possible with b=carry_in=0 or with b=0xFFF... and carry_in=1 so generating carry-out is not simple.
To solve both those problems, consider using chunks with manual wrapping to 60-bit or 62-bit or something, so they're guaranteed to be signed-positive and so carry-out from addition appears in the high bits of the full 64-bit element. (Where you can vpsrlq ymm, 62 to extract it for addition into the vector of next higher chunks.)
Maybe even 63-bit chunks would work here so carry appears in the very top bit, and vmovmskpd can check if any element produced a carry. Otherwise vptest can do that with the right mask.
This is a handy-wavy kind of brainstorm answer; I don't have any plans to expand it into a detailed answer. If anyone wants to write actual code based on this, please post your own answer so we can upvote that (if it turns out to be a useful idea at all).
Just for kicks, without claiming that this will be actually useful, you can extract the carry bit of an addition by just looking at the upper bits of the input and output values.
unsigned result = a + b + last_carry; // add a, b and (optionally last carry)
unsigned carry = (a & b) // carry if both a AND b have the upper bit set
| // OR
((a ^ b) // upper bits of a and b are different AND
& ~r); // AND upper bit of the result is not set
carry >>= sizeof(unsigned)*8 - 1; // shift the upper bit to the lower bit
With SSE2/AVX2 this could be implemented with two additions, 4 logic operations and one shift, but works for arbitrary (supported) integer sizes (uint8, uint16, uint32, uint64). With AVX2 you'd need 7uops to get 4 64bit additions with carry-in and carry-out.
Especially since multiplying 64x64-->128 is not possible either (but would require 4 32x32-->64 products -- and some additions or 3 32x32-->64 products and even more additions, as well as special case handling), you will likely not be more efficient than with mul and adc (maybe unless register pressure is your bottleneck).As
As Peter and Mystical suggested, working with smaller limbs (still stored in 64 bits) can be beneficial. On the one hand, with some trickery, you can use FMA for 52x52-->104 products. And also, you can actually add up to 2^k-1 numbers of 64-k bits before you need to carry the upper bits of the previous limbs.

How do you choose an optimal PlainModulus in SEAL?

I am currently learning how to use SEAL and in the parameters for BFV scheme there was a helper function for choosing the PolyModulus and CoeffModulus and however this was not provided for choosing the PlainModulus other than it should be either a prime or a power of 2 is there any way to know which optimal value to use?
In the given example the PlainModulus was set to parms.PlainModulus = new SmallModulus(256); Is there any special reason for choosing the value 256?
In BFV, the plain_modulus basically determines the size of your data type, just like in normal programming when you use 32-bit or 64-bit integers. When using BatchEncoder the data type applies to each slot in the plaintext vectors.
How you choose plain_modulus matters a lot: the noise budget consumption in multiplications is proportional to log(plain_modulus), so there are good reasons to keep it as small as possible. On the other hand, you'll need to ensure that you don't get into overflow situations during your computations, where your encrypted numbers exceed plain_modulus, unless you specifically only care about correctness of the results modulo plain_modulus.
In almost all real use-cases of BFV you should want to use BatchEncoder to not waste plaintext/ciphertext polynomial space, and this requires plain_modulus to be a prime. Therefore, you'll probably want it to be a prime, except in some toy examples.

Compute e^x for float values in System Verilog?

I am building a neural network running on an FPGA, and the last piece of the puzzle is running a sigmoid function in hardware. This is either:
1/(1 + e^-x)
or
(atan(x) + 1) / 2
Unfortunately, x here is a float value (a real value in SystemVerilog).
Are there any tips on how to implement either of these functions in SystemVerilog?
This is really confusing to me since both of these functions are complex, and I don't even know where to begin implementing them due to the added complexity of being float values.
One simpler way for this is to create a memory/array for this function. However that option can be highly inefficient.
x should be the input address for the memory and the value at that location can be the output of the function.
Suppose value of your function is as follow. (This is just an example)
x = 0 => f(0) = 1
x = 1 => f(0) = 2
x = 2 => f(0) = 3
x = 3 => f(0) = 4
So you can create an array for this, which stored the output values.
int a[4] = `{1, 2, 3, 4};
I just finished this by Vivado HLS, which allows you to write circuits in C.
Here is my C code.
#include math.h
void exp(float a[10],b[10])
{
int i;
for(i=0;i<10;i++)
{
b[i] = exp(a[i]);
}
}
But there is a question that it is impossible to create a unsized matrix. Maybe there is another way that I don't know.
As you seem to realize, type real is not synthesizable. you need to operate on the type integer mantissa and type integer exponent separately and combine them when you are done, having tracked the sign. Once you take care of (e^-x), the rest should be straight-forward.
try this page for a quick explanation: https://www.geeksforgeeks.org/floating-point-representation-digital-logic/
and search on "floating point digital design" for more explanations/examples.
Do you really need a floating number for this? Is fixed point sufficient?
Considering (atan(x) + 1) / 2, quite likely the only useful values of x are those where the exponent is fairly small. (if the exponent is large, your answer is pi/2).
atan of a fixed point number can be calculated in HW fairly easily; there are cordic methods (see https://zipcpu.com/dsp/2017/08/30/cordic.html) and direct methods; see for example https://dspguru.com/dsp/tricks/fixed-point-atan2-with-self-normalization/
FPGA design flows in which hardware (FPGA) is targeted generally do not support floating point numbers in the FPGA fabric. Fixed point with limited precision is more commonly used.
A limited precision fixed point approach:
Use Matlab to create an array of samples for your math function such that the largest value is +/- .99999. For 8 bit precision (actually 7 with sign bit), multiply those numbers by 128, round at the decimal point and drop the fractional part. Write those numbers to a text file in 2s complement hex format. In SystemVerilog you can implement a ROM using that text file. Use $readmemh() to read these numbers into a memory style variable (one that has both a packed and unpacked dimension). Link to a tutorial:
https://projectf.io/posts/initialize-memory-in-verilog/.
Now you have a ROM with limited precision samples of your function
Section 21.4 Loading memory array data from a file in the SystemVerilog specification provides the definition for $readmh(). Here is that doc:
https://ieeexplore.ieee.org/document/8299595
If you need floating point one possibility is to use a processor soft core with a floating point unit implemented in FPGA fabric, and run software on that core. The core interface to the rest of the FPGA fabric over a physical bus such as axi4 steaming. See:
https://www.xilinx.com/products/design-tools/microblaze.html to get started.
It is a very different workflow than ordinary FPGA design and uses different tools. C or C++ compiler with math libraries (tan, exp, div, etc) would be used along with the processor core.
Another possibility for fixed point is an FPGA with a hard core processor. Xilinx Zynq is one of them. This is a complex and powerful approach. A free free book provides knowledge on how to use Zynq
http://www.zynqbook.com/.
This workflow is even more complex that soft core approach because the Zynq is a more complex platform (hard processor & FPGA integrated on one chip).
Its pretty hard to implement non-linear functions like that in hardware, and on top of that floating point arithmetic is even more costly. Its definitely better(and recommended) to work with fixed point arithmetic as mentioned in answers before. The number of precision bits in fixed point arithmetic will depend on your result accuracy and the error tolerance.
For hardware implementations, any kind of non-linear function can be approximated as piecewise linear function, and use a ROM based implementation approach as described in previous answers. The number of sample points that you take from the non-linear function determines your accuracy. The more samples you store the better approximation of the function you get. Often in hardware , number of samples you can store can become restricted by the amount of fast/local memory available to you. In this case to optimize the memory resources, you can add a little extra compute resources and perform linear interpolation to calculate the needed values.

Efficient Function to Map (or Hash) Integers and Integer Ranges into Index

We are looking for the computationally simplest function that will enable an indexed look-up of a function to be determined by a high frequency input stream of widely distributed integers and ranges of integers.
It is OK if the hash/map function selection itself varies based on the specific integer and range requirements, and the performance associated with the part of the code that selects this algorithm is not critical. The number of integers/ranges of interest in most cases will be small (zero to a few thousand). The performance critical portion is in processing the incoming stream and selecting the appropriate function.
As a simple example, please consider the following pseudo-code:
switch (highFrequencyIntegerStream)
case(2) : func1();
case(3) : func2();
case(8) : func3();
case(33-122) : func4();
...
case(10,000) : func40();
In a typical example, there would be only a few thousand of the "cases" shown above, which could include a full range of 32-bit integer values and ranges. (In the pseudo code above 33-122 represents all integers from 33 to 122.) There will be a large number of objects containing these "switch statements."
(Note that the actual implementation will not include switch statements. It will instead be a jump table (which is an array of function pointers) or maybe a combination of the Command and Observer patterns, etc. The implementation details are tangential to the request, but provided to help with visualization.)
Many of the objects will contain "switch statements" with only a few entries. The values of interest are subject to real time change, but performance associated with managing these changes is not critical. Hash/map algorithms can be re-generated slowly with each update based on the specific integers and ranges of interest (for a given object at a given time).
We have searched around the internet, looking at Bloom filters, various hash functions listed on Wikipedia's "hash function" page and elsewhere, quite a few Stack Overflow questions, abstract algebra (mostly Galois theory which is attractive for its computationally simple operands), various ciphers, etc., but have not found a solution that appears to be targeted to this problem. (We could not even find a hash or map function that considered these types of ranges as inputs, much less a highly efficient one. Perhaps we are not looking in the right places or using the correct vernacular.)
The current plan is to create a custom algorithm that preprocesses the list of interesting integers and ranges (for a given object at a given time) looking for shifts and masks that can be applied to input stream to help delineate the ranges. Note that most of the incoming integers will be uninteresting, and it is of critical importance to make a very quick decision for as large a percentage of that portion of the stream as possible (which is why Bloom filters looked interesting at first (before we starting thinking that their implementation required more computational complexity than other solutions)).
Because the first decision is so important, we are also considering having multiple tables, the first of which would be inverse masks (masks to select uninteresting numbers) for the easy to find large ranges of data not included in a given "switch statement", to be followed by subsequent tables that would expand the smaller ranges. We are thinking this will, for most cases of input streams, yield something quite a bit faster than a binary search on the bounds of the ranges.
Note that the input stream can be considered to be randomly distributed.
There is a pretty extensive theory of minimal perfect hash functions that I think will meet your requirement. The idea of a minimal perfect hash is that a set of distinct inputs is mapped to a dense set of integers in 1-1 fashion. In your case a set of N 32-bit integers and ranges would each be mapped to a unique integer in a range of size a small multiple of N. Gnu has a perfect hash function generator called gperf that is meant for strings but might possibly work on your data. I'd definitely give it a try. Just add a length byte so that integers are 5 byte strings and ranges are 9 bytes. There are some formal references on the Wikipedia page. A literature search in ACM and IEEE literature will certainly turn up more.
I just ran across this library I had not seen before.
Addition
I see now that you are trying to map all integers in the ranges to the same function value. As I said in the comment, this is not very compatible with hashing because hash functions deliberately try to "erase" the magnitude information in a bit's position so that values with similar magnitude are unlikely to map to the same hash value.
Consequently, I think that you will not do better than an optimal binary search tree, or equivalently a code generator that produces an optimal "tree" of "if else" statements.
If we wanted to construct a function of the type you are asking for, we could try using real numbers where individual domain values map to consecutive integers in the co-domain and ranges map to unit intervals in the co-domain. So a simple floor operation will give you the jump table indices you're looking for.
In the example you provided you'd have the following mapping:
2 -> 0.0
3 -> 1.0
8 -> 2.0
33 -> 3.0
122 -> 3.99999
...
10000 -> 42.0 (for example)
The trick is to find a monotonically increasing polynomial that interpolates these points. This is certainly possible, but with thousands of points I'm certain you'ed end up with something much slower to evaluate than the optimal search would be.
Perhaps our thoughts on hashing integers can help a little bit. You will also find there a hashing library (hashlib.zip) based on Bob Jenkins' work which deals with integer numbers in a smart way.
I would propose to deal with larger ranges after the single cases have been rejected by the hashing mechanism.

Irrational number representation in computer

We can write a simple Rational Number class using two integers representing A/B with B != 0.
If we want to represent an irrational number class (storing and computing), the first thing came to my mind is to use floating point, which means use IEEE 754 standard (binary fraction). This is because irrational number must be approximated.
Is there another way to write irrational number class other than using binary fraction (whether they conserve memory space or not) ?
I studied jsbeuno's solution using Python: Irrational number representation in any programming language?
He's still using the built-in floating point to store.
This is not homework.
Thank you for your time.
With a cardinality argument, there are much more irrational numbers than rational ones. (and the number of IEEE754 floating point numbers is finite, probably less than 2^64).
You can represent numbers with something else than fractions (e.g. logarithmically).
jsbeuno is storing the number as a base and a radix and using those when doing calcs with other irrational numbers; he's only using the float representation for output.
If you want to get fancier, you can define the base and the radix as rational numbers (with two integers) as described above, or make them themselves irrational numbers.
To make something thoroughly useful, though, you'll end up replicating a symbolic math package.
You can always use symbolic math, where items are stored exactly as they are and calculations are deferred until they can be performed with precision above some threshold.
For example, say you performed two operations on a non-irrational number like 2, one to take the square root and then one to square that. With limited precision, you may get something like:
(√2)²
= 1.414213562²
= 1.999999999
However, storing symbolic math would allow you to store the result of √2 as √2 rather than an approximation of it, then realise that (√x)² is equivalent to x, removing the possibility of error.
Now that obviously involves a more complicated encoding that simple IEEE754 but it's not impossible to achieve.