Modular exponentiation of 2048-bit operands using multiple 256-bit operations - RSA

I am implementing the RSA digital signature algorithm, and one of the operations needed is modular exponentiation of 2048-bit operands. The hardware I am using provides an accelerated 256-bit modular exponentiation operation. So my question is: is there an optimized way to compute the 2048-bit operation using multiple 256-bit operations?
Thanks in advance!

I second the comment that hardware restricted to computing A^b mod n for 256-bit parameters is useless for RSA with a 2048-bit modulus N.
We can't use an RSA multi-prime strategy (where N is the product of more than two primes, and the compute-intensive operations are performed modulo these smaller primes), because a product of eight or nine primes each fitting in 256 bits would be vulnerable to the elliptic-curve factoring method (ECM). Also, that would only work for the private-key operation (signature generation, or decryption).
We can't use the thing as a general multiplier, because there's a single input.
By setting n to 2^256 - 1 and b = 2 we could use the thing to compute squares of any 128-bit argument, but that represents only a small fraction of the arithmetic work in RSA, and is most certainly not worth the hassle.
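As a sanity check, here is a tiny Python sketch of that squaring trick, with pow() standing in for the hypothetical 256-bit unit (the function hw_modexp_256 is made up for illustration):

```python
# Stand-in for the 256-bit accelerator: it can only evaluate a^b mod n
# for operands of at most 256 bits.
def hw_modexp_256(a: int, b: int, n: int) -> int:
    assert max(a, b, n).bit_length() <= 256
    return pow(a, b, n)

# With n = 2**256 - 1 and b = 2, the reduction never triggers for a < 2**128
# (since a*a < 2**256 - 1), so the unit returns the exact square.
N = 2**256 - 1
a = 0xDEADBEEF_CAFEBABE_0123456789ABCDEF      # any value below 2**128
assert hw_modexp_256(a, 2, N) == a * a
```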

Related

Can float16 data type save compute cycles while computing transcendental functions?

It's clear that float16 can save bandwidth, but can float16 save compute cycles while computing transcendental functions, like exp()?
If your hardware has full support for it, not just conversion to float32, then yes, definitely: e.g. on a GPU, on Intel Alder Lake with AVX-512 enabled, or on Sapphire Rapids.
Half-precision floating-point arithmetic on Intel chips. Or apparently on Apple M2 CPUs.
If you can do two 64-byte SIMD vectors of FMAs per clock on a core, you go twice as fast if that's 32 half-precision FMAs per vector instead of 16 single-precision FMAs.
Speed vs. precision tradeoff: only enough for FP16 is needed
Without hardware ALU support for FP16, you can only save cycles by not requiring as much precision, because you know you're eventually going to round to fp16. So you'd use polynomial approximations of lower degree, thus fewer FMA operations, even though you're computing with float32.
BTW, exp and log are interesting for floating point because the format itself is built around an exponential representation. So you can do an exponential by converting fp->int and stuffing that integer into the exponent field of an FP bit pattern. Then, with the fractional part of your FP number, you use a polynomial approximation to get the mantissa of the result. A log implementation is the reverse: extract the exponent field and use a polynomial approximation of log of the mantissa, over a range like 1.0 to 2.0.
See
Efficient implementation of log2(__m256d) in AVX2
Fastest Implementation of Exponential Function Using AVX
Very fast approximate Logarithm (natural log) function in C++?
vgetmantps vs andpd instructions for getting the mantissa of float
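Here is a minimal scalar sketch of that exponent-field trick in Python, using struct to reinterpret a float32 bit pattern; the cubic coefficients are rough placeholders (not minimax-tuned), and a real SIMD version would do the same steps with intrinsics:

```python
import math
import struct

def fast_exp2(x: float) -> float:
    """Approximate 2**x: stuff the integer part of x into the float32
    exponent field and use a low-degree polynomial for the fractional part."""
    i = math.floor(x)            # integer part
    f = x - i                    # fractional part in [0, 1)
    # Rough cubic approximation of 2**f on [0, 1) (about 3 decimal digits).
    m = 1.0 + f * (0.6951786 + f * (0.2261699 + f * 0.0782651))
    # Reinterpret the float32 bit pattern and add i to the biased exponent
    # (bits 23..30), which multiplies the value by 2**i.
    bits = struct.unpack('<I', struct.pack('<f', m))[0]
    bits = (bits + (i << 23)) & 0xFFFFFFFF
    return struct.unpack('<f', struct.pack('<I', bits))[0]

print(fast_exp2(4.25), 2.0 ** 4.25)   # ~19.026 vs 19.027
```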
Normally you do want some FP operations, so I don't think it would be worth trying to use only 16-bit integer operations to avoid unpacking to float32 even for exp or log, which are somewhat special and intimately connected with floating point's significand * 2^exponent format, unlike sin/cos/tan or other transcendental functions.
So I think your best bet would normally still be to start by converting fp16 to fp32, unless you have instructions like AVX-512 FP16 that can do actual FP math on it. But you can gain performance from not needing as much precision, since implementing these functions normally involves a speed vs. precision tradeoff.

What is the optimum precision to use in an arithmetic encoder?

I've implemented an arithmetic coder here - https://github.com/danieleades/arithmetic-coding
I'm struggling to understand a general way to choose an optimal number of bits for representing integers within the encoder. I'm using a model where probabilities are represented as rationals.
I know that to prevent underflows/overflows, the number of bits used to represent integers within the encoder/decoder must be at least 2 bits greater than the maximum number of bits used to represent the denominator of the probabilities.
For example, if I use a maximum of 10 bits to represent the denominator of the probabilities, then to ensure the encoding/decoding works, I need to use at least MAX_DENOMINATOR_BITS + 2 = 12 bits to represent the integers.
If I were to use 32-bit integers to store these values, I would have another 10 bits up my sleeve (right?).
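To make the arithmetic concrete, here's roughly how I'm thinking about the bit budget (a small Python sketch, assuming the usual constraint that the range * frequency product has to fit in the integer type):

```python
def precision_budget(word_bits: int, denominator_bits: int) -> tuple[int, int]:
    """Return (min, max) precision bits for the coder state, assuming:
    (a) precision >= denominator_bits + 2, to avoid underflow, and
    (b) precision + denominator_bits <= word_bits, so that the
        range * frequency product does not overflow the integer type.
    """
    return denominator_bits + 2, word_bits - denominator_bits

# 32-bit integers, 10-bit denominators:
print(precision_budget(32, 10))   # (12, 22) -> 10 spare bits above the minimum
```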
I've seen a couple of examples that use 12 bits for integers, and 8 bits for probabilities, with a 32-bit integer type. Is this somehow optimal, or is this just a fairly generic choice?
I've found that increasing the precision above the minimum improves the compression ratio slightly (but it saturates quickly). Given that increasing the precision improves compression, what is the optimum choice? Should I simply aim to maximise the number of bits I use to represent the integers for a given denominator? Performance is a non-goal for my application, in case that's a consideration.
Is it possible to quantify the benefit of moving to, say, a 64-bit internal representation to provide a greater number of precision bits?
I've based my implementation on this (excellent) article - https://marknelson.us/posts/2014/10/19/data-compression-with-arithmetic-coding.html

Calculation of hash of a string (MD5, SHA) as a basis for CPU benchmarking

I know that there are many applications and tools available for benchmarking the computational power of CPUs, especially in terms of floating-point and integer calculations.
What I want to know is how good it is to use hashing functions such as MD5, SHA, etc. for benchmarking CPUs. Do these functions include enough floating-point and integer calculations that applying a series of those hashing functions could be a good basis for CPU benchmarking?
In case platform matters, I'm concerned with Windows and .NET.
MD5 and SHA hash functions do not use floating point at all. They are implemented entirely with integer and bitwise operations (discrete math).
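If you still want to see what such a hash-only benchmark measures, here is a minimal sketch (in Python rather than .NET, purely for illustration); note that it exercises only integer/bitwise throughput, and possibly dedicated SHA instructions, never the FPU:

```python
import hashlib
import time

def hash_throughput(algo: str = "sha256", total_mib: int = 64) -> float:
    """Rough single-core MiB/s figure for one hash algorithm.
    This only measures integer/bitwise ALU (and any SHA extensions),
    not floating-point performance."""
    data = b"\x00" * (1 << 20)            # 1 MiB buffer
    h = hashlib.new(algo)
    start = time.perf_counter()
    for _ in range(total_mib):
        h.update(data)
    h.digest()
    return total_mib / (time.perf_counter() - start)

for algo in ("md5", "sha1", "sha256"):
    print(algo, f"{hash_throughput(algo):.0f} MiB/s")
```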

When is it appropriate to use a simple modulus as a hashing function?

I need to create a 16 bit hash from a 32 bit number, and I'm trying to determine if a simple modulus 2^16 is appropriate.
The hash will be used in a 2^16 entry hash table for fast lookup of the 32 bit number.
My understanding is that if the data space has a fairly even distribution, that a simple mod 2^16 is fine - it shouldn't result in too many collisions.
In this case, my 32 bit number is the result of a modified adler32 checksum, using 2^16 as M.
So, in a general sense, is my understanding correct, that it's fine to use a simple mod n (where n is hashtable size) as a hashing function if I have an even data distribution?
And specifically, will adler32 give a random enough distribution for this?
Yes, if your 32-bit numbers are uniformly distributed over all possible values, then a modulo n of those will also be uniformly distributed over the n possible values.
Whether the results of your modified checksum algorithm are uniformly distributed is an entirely different question. That will depend on whether the input you are applying the algorithm to is long enough to roll over the sums several times. If you are applying the algorithm to short strings that don't roll over the sums, then the result will not be uniformly distributed.
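As a quick illustration of that point, this Python sketch (using the standard zlib.adler32 as a stand-in for your modified checksum) hashes a batch of short keys into a 2^16-entry table with a plain mask and counts how few buckets actually get used:

```python
import zlib
from collections import Counter

def bucket(value: int, table_bits: int = 16) -> int:
    # Simple "mod 2^16" hash: keep the low 16 bits.
    return value & ((1 << table_bits) - 1)

# For short inputs, the low half of Adler-32 is just 1 + (byte sum) mod 65521,
# so it clusters in a narrow range instead of spreading over all 2^16 buckets.
keys = [f"key{i}".encode() for i in range(100_000)]
buckets = Counter(bucket(zlib.adler32(k)) for k in keys)

print("distinct buckets used:", len(buckets), "of", 1 << 16)
print("largest bucket size:  ", max(buckets.values()))
```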
If you want a hash function, then you should use a hash function. Neither Adler-32 nor any CRC is a good hash function. There are many very fast and effective hash functions available in the public domain. You can look at CityHash.

Division of Large numbers

Is there any faster method for division of large integers (having 1000 digits or more) other than the schoolbook method?
Wikipedia lists multiple division algorithms. See Computational complexity of mathematical operations, which lists schoolbook long division as O(n^2) and Newton's method as M(n), where M(n) is the complexity of the multiplication algorithm used, which could be as good as O(n log n 2^(O(log* n))) asymptotically.
Note from the discussion of one of the multiplication algorithms that the best algorithm asymptotically is not necessarily the fastest for "small" inputs:
In practice the Schönhage–Strassen algorithm starts to outperform older methods such as Karatsuba and Toom–Cook multiplication for numbers beyond 2^(2^15) to 2^(2^17) (10,000 to 40,000 decimal digits). The GNU Multi-Precision Library uses it for values of at least 1728 to 7808 64-bit words (111,000 to 500,000 decimal digits), depending on architecture. There is a Java implementation of Schönhage–Strassen which uses it above 74,000 decimal digits.
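To make the Newton's-method approach concrete, here is a readable (and deliberately unoptimized) Python sketch that divides by computing a fixed-point reciprocal with Newton-Raphson; real libraries such as GMP use much more refined versions of the same idea:

```python
def big_divide(a: int, b: int) -> int:
    """Floor division of large non-negative integers via a Newton-Raphson
    reciprocal. A readable sketch of the idea, not a tuned implementation."""
    if b <= 0:
        raise ValueError("divisor must be positive")
    if a < b:
        return 0
    if a.bit_length() <= 64:
        return a // b                      # small operands: native division

    n = a.bit_length()
    s = 2 * n                              # fixed-point scale 2**s
    # Seed the reciprocal r ~ 2**s / b from the top <= 50 bits of b,
    # so roughly 50 bits are already correct before the first iteration.
    t = b.bit_length()
    shift = max(t - 50, 0)
    r = ((1 << 100) // (b >> shift)) << (s - shift - 100)

    # Each pass of r <- r * (2**(s+1) - b*r) >> s doubles the number of
    # correct bits, so O(log s) passes reach full precision.
    for _ in range((s // 50).bit_length() + 1):
        r = (r * ((1 << (s + 1)) - b * r)) >> s

    q = (a * r) >> s
    # The approximation can be off by a unit or two; fix it exactly.
    while (q + 1) * b <= a:
        q += 1
    while q * b > a:
        q -= 1
    return q

# Quick check against Python's built-in big-integer division:
x, y = 10**2000 + 12345, 987654321987654321
assert big_divide(x, y) == x // y
```

The point of the sketch is only that division reduces to a handful of big multiplications, which is why its asymptotic cost is M(n).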