Division of Large numbers - biginteger

Is there any faster method for division of large integers (having 1000 digits or more) other than the school method?

Wikipedia lists multiple division algorithms. See Computational complexity of mathematical operations, which lists schoolbook long division as O(n^2) and Newton's method as O(M(n)), where M(n) is the complexity of the multiplication algorithm used; that could be as good as O(n log n 2^(log* n)) asymptotically.
Note from the discussion of one of the multiplication algorithms that the best algorithm asymptotically is not necessarily the fastest for "small" inputs:
In practice the Schönhage–Strassen algorithm starts to outperform older methods such as Karatsuba and Toom–Cook multiplication for numbers beyond 2^(2^15) to 2^(2^17) (10,000 to 40,000 decimal digits). The GNU Multi-Precision Library uses it for values of at least 1728 to 7808 64-bit words (111,000 to 500,000 decimal digits), depending on architecture. There is a Java implementation of Schönhage–Strassen which uses it above 74,000 decimal digits.
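To make the Newton approach concrete, here is a minimal Python sketch, assuming a non-negative dividend a and a positive divisor b (the function names are hypothetical). It computes a fixed-point reciprocal x ≈ 2^s / b with the iteration x ← x(2 - bx), multiplies, and then corrects the quotient by a few units. For clarity every iteration runs at full precision; an implementation that actually reaches O(M(n)) doubles the working precision on each step, and at these sizes Python's built-in divmod will still be faster.

```python
def newton_reciprocal(b: int, s: int) -> int:
    """Approximate floor(2**s / b) from below using Newton's iteration x <- x*(2 - b*x)."""
    assert b > 0 and s >= b.bit_length()
    # 2**(s - bitlen(b)) lies within a factor of 2 below 2**s / b.
    x = 1 << (s - b.bit_length())
    # The relative error roughly squares each step, so about log2(s) + 1 steps suffice.
    for _ in range(s.bit_length() + 1):
        # Fixed-point form of x*(2 - b*x) with scale 2**s; never exceeds 2**s / b.
        x = (x * ((1 << (s + 1)) - b * x)) >> s
    return x


def newton_divmod(a: int, b: int) -> tuple[int, int]:
    """Return (a // b, a % b) for a >= 0, b > 0 via a Newton reciprocal."""
    if a < b:
        return 0, a
    s = a.bit_length() + 1
    x = newton_reciprocal(b, s)
    q = (a * x) >> s
    # The reciprocal is truncated, so q can be a few units too small; fix it up.
    r = a - q * b
    while r < 0:
        q -= 1
        r += b
    while r >= b:
        q += 1
        r -= b
    return q, r
```

For example, newton_divmod(10**1200 + 987654321, 10**400 + 7) returns the same pair as divmod on those arguments; the correction loops run only a handful of times because the reciprocal is accurate to within a few units.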

Related

Can float16 data type save compute cycles while computing transcendental functions?

It's clear that float16 can save bandwidth, but can float16 save compute cycles when computing transcendental functions like exp()?
If your hardware has full support for it, not just conversion to float32, then yes, definitely: for example on a GPU, on Intel Alder Lake with AVX-512 enabled, or on Sapphire Rapids (see Half-precision floating-point arithmetic on Intel chips), and apparently also on Apple M2 CPUs.
If you can do two 64-byte SIMD vectors of FMAs per clock on a core, you go twice as fast if that's 32 half-precision FMAs per vector instead of 16 single-precision FMAs.
Speed vs. precision tradeoff: only enough for FP16 is needed
Without hardware ALU support for FP16, you can only save cycles by not requiring as much precision, because you know you're eventually going to round to fp16. So you'd use lower-degree polynomial approximations, and thus fewer FMA operations, even though you're computing with float32.
BTW, exp and log are interesting for floating point because the format itself is built around an exponential representation. So you can do an exponential by converting fp->int and stuffing that integer into the exponent field of an FP bit pattern. Then, with the fractional part of your FP number, you use a polynomial approximation to get the mantissa of the result. A log implementation is the reverse: extract the exponent field and use a polynomial approximation of log of the mantissa, over a range like 1.0 to 2.0.
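To make the exponent-field trick concrete, here is a minimal Python sketch that uses the standard struct module to view float32 bit patterns. The helper names are hypothetical, and two choices are assumptions of this sketch rather than what any particular library does: 2^f uses plain Taylor coefficients instead of the minimax polynomials real implementations fit, and the mantissa log uses the atanh series in z = (m - 1)/(m + 1). The degree/terms parameters show the speed vs. precision tradeoff: by the Taylor remainder bound, degree 4 keeps the relative error of fast_exp2 below about 1e-4 (within fp16's roughly 5e-4 rounding step), while float32-level accuracy needs about degree 7.

```python
import math
import struct


def f32_bits(x: float) -> int:
    """Bit pattern of x rounded to IEEE-754 binary32."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_f32(u: int) -> float:
    """binary32 value whose bit pattern is u."""
    return struct.unpack("<f", struct.pack("<I", u))[0]


def fast_exp2(x: float, degree: int) -> float:
    """2**x: integer part goes into the exponent field, fractional part uses a polynomial.

    Assumes the result is a normal float32 (roughly -125 < x < 127).
    """
    n = round(x)   # integer part, added to the exponent field below
    f = x - n      # fractional part, in [-0.5, 0.5]
    # Taylor coefficients of 2**f = exp(f * ln 2); a real library would use a minimax fit.
    coeffs = [math.log(2) ** k / math.factorial(k) for k in range(degree + 1)]
    p = 0.0
    for c in reversed(coeffs):  # Horner evaluation
        p = p * f + c
    # p is in about [0.707, 1.414]; adding n to the exponent field multiplies it by 2**n.
    return bits_f32(f32_bits(p) + (n << 23))


def fast_log2(x: float, terms: int) -> float:
    """log2(x) for a positive, normal float32 x: exponent field plus log2 of the mantissa."""
    u = f32_bits(x)
    e = ((u >> 23) & 0xFF) - 127                # unbiased exponent
    m = bits_f32((u & 0x7FFFFF) | 0x3F800000)   # mantissa rescaled into [1.0, 2.0)
    # ln(m) = 2*atanh(z) = 2*(z + z**3/3 + z**5/5 + ...), with z in [0, 1/3)
    z = (m - 1.0) / (m + 1.0)
    z2 = z * z
    s = 0.0
    for k in reversed(range(terms)):  # Horner evaluation of the series in z**2
        s = s * z2 + 1.0 / (2 * k + 1)
    return e + 2.0 * z * s / math.log(2)
```

For instance, fast_exp2(3.0, 4) returns exactly 8.0, since the fractional part is zero and only the exponent field changes.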
See
Efficient implementation of log2(__m256d) in AVX2
Fastest Implementation of Exponential Function Using AVX
Very fast approximate Logarithm (natural log) function in C++?
vgetmantps vs andpd instructions for getting the mantissa of float
Normally you do want some FP operations, so I don't think it would be worth trying to use only 16-bit integer operations to avoid unpacking to float32 even for exp or log, which are somewhat special and intimately connected with floating point's significand * 2^exponent format, unlike sin/cos/tan or other transcendental functions.
So I think your best bet would normally still be to start by converting fp16 to fp32, if you don't have instructions like AVX-512 FP16 that can do actual FP math on it. But you can gain performance from not needing as much precision, since implementing these functions normally involves a speed vs. precision tradeoff.

What is the optimum precision to use in an arithmetic encoder?

I've implemented an arithmetic coder here - https://github.com/danieleades/arithmetic-coding
I'm struggling to find a general way to choose an optimal number of bits for representing integers within the encoder. I'm using a model where probabilities are represented as rationals.
I know that to prevent underflows/overflows, the number of bits used to represent integers within the encoder/decoder must be at least 2 bits greater than the maximum number of bits used to represent the denominator of the probabilities.
For example, if I use a maximum of 10 bits to represent the denominator of the probabilities, then to ensure the encoding/decoding works, I need to use at least MAX_DENOMINATOR_BITS + 2 = 12 bits to represent the integers.
If I were to use 32-bit integers to store these values, I would have another 10 bits up my sleeve (right?).
I've seen a couple of examples that use 12 bits for integers and 8 bits for probabilities, with a 32-bit integer type. Is this somehow optimal, or is this just a fairly generic choice?
I've found that increasing the precision above the minimum improves the compression ratio slightly (but it saturates quickly). Given that increasing the precision improves compression, what is the optimum choice? Should I simply aim to maximise the number of bits I use to represent the integers for a given denominator? Performance is a non-goal for my application, in case that's a consideration.
Is it possible to quantify the benefit of moving to, say, a 64-bit internal representation to provide a greater number of precision bits?
I've based my implementation on this (excellent) article - https://marknelson.us/posts/2014/10/19/data-compression-with-arithmetic-coding.html
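The bit budget described in the question can be checked mechanically. Below is a minimal Python sketch; the function name is hypothetical, and it assumes the two constraints in play are the one stated in the question (code bits at least denominator bits + 2) plus the usual overflow rule that the product of the range (code bits wide) and a cumulative count (denominator bits wide) must fit in one machine word.

```python
def code_bits_range(word_bits: int, freq_bits: int) -> tuple[int, int]:
    """Return (min, max) usable code-value bits for a word size and denominator width.

    Assumed constraints:
      * code_bits >= freq_bits + 2             (underflow margin, as stated in the question)
      * code_bits + freq_bits <= word_bits     (range * count must not overflow the word)
    """
    lo = freq_bits + 2
    hi = word_bits - freq_bits
    if lo > hi:
        raise ValueError("word too small for this denominator width")
    return lo, hi
```

With a 32-bit word and 10-bit denominators this gives (12, 22), so choosing 12 code bits leaves 32 - 12 - 10 = 10 bits unused, which matches the estimate in the question; the 12-bit/8-bit example could go up to 24 code bits, and a 64-bit word with 10-bit denominators allows up to 54. This only checks the overflow/underflow budget; it says nothing about which point in the range compresses best.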

Neural Networks w/ Fixed Point Parameters

Most neural networks are trained with floating point weights/biases.
Quantization methods exist to convert the weights from float to int, for deployment on smaller platforms.
Can you build neural networks from the ground up that constrain all parameters, and their updates, to integer arithmetic?
Could such networks achieve a good accuracy?
(I know a bit about fixed-point and have only some rusty NN experience from the 90's so take what I have to say with a pinch of salt!)
The general answer is yes, but it depends on a number of factors.
Bear in mind that floating-point arithmetic is basically the combination of an integer significand with an integer exponent, so it's all integer under the hood. The real question is: can you do it efficiently without floats?
Firstly, "good accuracy" is highly dependent on many factors. It's perfectly possible to perform integer operations that have finer granularity than floating-point. For example, 32-bit integers have 31 bits of magnitude (plus a sign bit), while 32-bit floats effectively have only 24 bits of significand. So provided you do not require the added precision that floats give you near zero, it's all about the types that you choose. 16-bit -- or even 8-bit -- values might suffice for much of the processing.
Secondly, accumulating the inputs to a neuron has the issue that unless you know the maximum number of inputs to a node, you cannot be sure what the upper bound is on the values being accumulated. So effectively you must specify this limit at compile time.
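Once the fan-in limit is fixed, the accumulator width follows directly. A minimal Python sketch, assuming signed two's-complement inputs and weights of value_bits bits each (the function name is hypothetical): each product has magnitude at most 2^(2·value_bits - 2), so summing max_inputs of them needs about log2(max_inputs) extra bits.

```python
def accumulator_bits(value_bits: int, max_inputs: int) -> int:
    """Signed accumulator width that cannot overflow when summing max_inputs products
    of two signed value_bits-bit integers (two's complement)."""
    # The largest product magnitude is (-2**(value_bits-1))**2 = 2**(2*value_bits - 2).
    bound = max_inputs * (1 << (2 * value_bits - 2))
    # bound.bit_length() magnitude bits plus one sign bit covers [-bound, bound].
    return bound.bit_length() + 1
```

For example, accumulator_bits(8, 256) is 24, so an int8 layer with at most 256 inputs fits in a 32-bit accumulator with 8 bits to spare; with an unbounded fan-in no fixed width gives that guarantee.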
Thirdly, the most complicated operation during the execution of a trained network is often the activation function. Again, you firstly have to think about what the range of values are within which you will be operating. You then need to implement the function without the aid of an FPU with all of the advanced mathematical functions it provides. One way to consider doing this is via lookup tables.
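A lookup-table activation might look like the following minimal Python sketch. The formats are hypothetical choices for illustration: inputs are int8 values in Q4.4 (code v represents v/16, so the range is -8 to about 7.94) and outputs are uint8 codes where c represents c/255. The table is built once with floating point (offline, e.g. on the training machine), while the run-time path is a single integer index.

```python
import math

IN_FRAC_BITS = 4   # hypothetical Q4.4 input: int8 code v represents v / 16
OUT_SCALE = 255    # hypothetical output: uint8 code c represents c / 255

# Built once, offline; floating point is fine here.
SIGMOID_LUT = [round(OUT_SCALE / (1.0 + math.exp(-v / (1 << IN_FRAC_BITS))))
               for v in range(-128, 128)]


def sigmoid_q(v: int) -> int:
    """Sigmoid of a Q4.4 int8 input, returned as a uint8 code (c / 255 approximates it)."""
    if not -128 <= v <= 127:
        raise ValueError("input outside int8 range")
    return SIGMOID_LUT[v + 128]  # integer indexing only at run time
```

Because every possible int8 input has its own entry, there is no interpolation error beyond the output rounding; wider inputs would need a coarser table plus interpolation or range clamping.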
Finally, training involves measuring the error between values and that error can often be quite small. Here is where accuracy is a concern. If the differences you are measuring are too low, they will round down to zero and this may prevent progress. One solution is to increase the resolution of the value by providing more fractional digits.
One advantage that integers have over floating-point here is their even distribution. Where floating-point numbers lose accuracy as they increase in magnitude, integers maintain a constant precision. This means that if you are trying to measure very small differences in values that are close to 1, you should have no more trouble than you would if those values were as close to 0. The same is not true for floats.
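Both points can be seen with a few lines of Python; the helper names are hypothetical. A small update of 0.001 rounds to zero with 8 fractional bits but survives with 16, and the fixed-point step (2^-16 with 16 fractional bits) is the same everywhere, whereas the gap between adjacent float32 values grows with magnitude.

```python
import struct


def to_fixed(x: float, frac_bits: int) -> int:
    """Quantize x to a fixed-point integer with frac_bits fractional bits (round to nearest)."""
    return round(x * (1 << frac_bits))


def float32_step(x: float) -> float:
    """Gap between positive float32(x) and the next larger float32."""
    u = struct.unpack("<I", struct.pack("<f", x))[0]
    cur = struct.unpack("<f", struct.pack("<I", u))[0]
    nxt = struct.unpack("<f", struct.pack("<I", u + 1))[0]
    return nxt - cur


print(to_fixed(0.001, 8))    # 0   -> the update is lost
print(to_fixed(0.001, 16))   # 66  -> the update survives
```

For comparison, float32_step(0.001) is 2^-33, float32_step(1.0) is 2^-23 and float32_step(1000.0) is 2^-14, while a 16-fractional-bit fixed-point value steps by 2^-16 at all three magnitudes.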
It's possible to train a network with higher precision types than those used to run the network if training time is not the bottleneck. You might even be able to train the network using floating-point types and run it using lower-precision integers but you need to be aware of differences in behavior that these shortcuts will bring.
In short, the problems involved are by no means insurmountable, but you need to take on some of the mental effort that would normally be saved by using floating-point. However, especially if your hardware is physically constrained, this can be a hugely beneficial approach, as floating-point arithmetic requires as much as 100 times more silicon and power than integer arithmetic.
Hope that helps.

Elisp: What is the time complexity for basic arithmetic operations using calc functions

This includes addition, subtraction, multiplication, and division.
I'm asked to analyze some algorithms that rely heavily on calling calc-eval to work. My teacher does want us to account for the complexity of basic operations when working with large numbers.
How do these arithmetic operations scale as the size of the numbers increase?

Matlab `corr` gives different results on the same dataset. Is floating-point calculation deterministic?

I am using Matlab's corr function to calculate the correlation of a dataset. While the results agree within double-precision accuracy (<10^-14), they are not exactly the same, even on the same computer, across different runs.
Is floating-point calculation deterministic? Where is the source of the randomness?
Yes and no.
Floating point arithmetic, as in a sequence of operations +, *, etc. is deterministic. However in this case, linear algebra libraries (BLAS, LAPACK, etc) are most likely being used, which may not be: for example, matrix multiplication is typically not performed as a "triple loop" as some references would have you believe, but instead matrices are split up into blocks that are optimised for maximum performance based on things like cache size. Therefore, you will get different sequences of operations, with different intermediate rounding, which will give slightly different results. Typically, however, the variation in these results is smaller than the total rounding error you are incurring.
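A minimal Python illustration of this point, using plain lists rather than any real BLAS (the function names are hypothetical, and the data is a contrived cancellation case chosen to make the effect obvious): each summation order is perfectly deterministic, but a blocked order rounds differently from a left-to-right loop.

```python
def sequential_sum(xs: list[float]) -> float:
    """Left-to-right sum, like a naive loop."""
    total = 0.0
    for x in xs:
        total += x
    return total


def blocked_sum(xs: list[float], block: int) -> float:
    """Sum each block first, then the partial sums, as a cache-blocked kernel might."""
    partials = [sequential_sum(xs[i:i + block]) for i in range(0, len(xs), block)]
    return sequential_sum(partials)


data = [1e17, 1.0, -1e17, 1.0]
print(sequential_sum(data))   # 1.0
print(blocked_sum(data, 2))   # 0.0
print((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))   # False
```

Calling either function again on the same list gives the same answer every time; only the change in order changes the result (the exact sum here is 2.0, so both orders carry rounding error).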
I have to admit, I am a little bit surprised that you get different results on the same computer, but it is difficult to know why without knowing what the library is doing (IIRC, Matlab uses the Intel BLAS libraries, so you could look at their documentation).