Power of 2 approximation in fixed point

Currently, I am using a small lookup table and linear interpolation, which is quite fast and also accurate enough (max error is less than 0.001). However, I was wondering if there is an approximation that is even faster.
Since the integer part of the exponent can be extracted and handled by bit shifts, the approximation only needs to work in the range [-1, 1].
I have tried to find a Chebyshev polynomial, but could not achieve good accuracy for polynomials of low order. I could live with a max error around 0.01, I guess, but I did not get near that number. Higher-order polynomials are not an option, since they are much less efficient than my current lookup-table-based solution.

Since no specific fixed-point format was stated, I will demonstrate a possible alternative to table lookup using s15.16 fixed-point arithmetic, which is fairly commonly used. The basic idea is to split the input a into an integral portion i and a fractional portion f, such that f is in [-0.5, 0.5], then use a minimax polynomial approximation for exp2(f) on [-0.5, 0.5] and perform final scaling based on i.
Minimax approximations can be generated with tools such as Mathematica, Maple, or Sollya. If none of these tools are available, one could use a custom implementation of the Remez algorithm to generate minimax approximations.
The Horner scheme should be used to evaluate the polynomial. Since fixed-point arithmetic is used, the evaluation of the polynomial should scale operands to the maximum extent possible (i.e. without overflow) in intermediate steps, to optimize the accuracy of the computation.
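For reference, for a cubic p(f) = c3*f^3 + c2*f^2 + c1*f + c0, Horner evaluation proceeds as in the following generic sketch (plain floating-point C; the fixed-point code below follows the same pattern, but interleaves the re-scaling shifts):

/* generic Horner evaluation of a cubic polynomial */
static double horner3 (double c3, double c2, double c1, double c0, double f)
{
    double r = c3;
    r = r * f + c2; /* c3*f + c2 */
    r = r * f + c1; /* (c3*f + c2)*f + c1 */
    r = r * f + c0; /* ((c3*f + c2)*f + c1)*f + c0 */
    return r;
}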
The C code below assumes that right shifts applied to signed integer data types result in arithmetic shift operations, so negative operands are shifted appropriately. This is not guaranteed by the ISO C standard, but in my experience it works fine with various toolchains. In the worst case, inline assembly could be used to force generation of the desired arithmetic right-shift instructions.
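Should one prefer not to rely on that implementation-defined behavior, an arithmetic right shift can be emulated portably; a minimal sketch (the helper name is mine, not part of the code below):

#include <stdint.h>

/* portable arithmetic right shift of x by s bits, 0 <= s <= 31 */
static int32_t asr32 (int32_t x, int s)
{
    return (x < 0) ? (int32_t)(~(~(uint32_t)x >> s)) : (int32_t)((uint32_t)x >> s);
}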
The output of the test included with the fixed_exp2() implementation below should look as follows:
testing fixed_exp2 with inputs in [-5.96484, 15)
max. rel. err = 0.000999758
This demonstrates that the desired error bound of 0.001 is met for inputs in the interval [-5.96484, 15).
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <math.h>
/* compute exp2(a) in s15.16 fixed-point arithmetic, -16 < a < 15 */
int32_t fixed_exp2 (int32_t a)
{
    int32_t i, f, r, s;
    /* split a = i + f, such that f in [-0.5, 0.5] */
    i = (a + 0x8000) & ~0xffff; // 0.5
    f = a - i;
    s = ((15 << 16) - i) >> 16;
    /* minimax approximation for exp2(f) on [-0.5, 0.5] */
    r = 0x00000e20;                 // 5.5171669058037949e-2
    r = (r * f + 0x3e1cc333) >> 17; // 2.4261112219321804e-1
    r = (r * f + 0x58bd46a6) >> 16; // 6.9326098546062365e-1
    r = r * f + 0x7ffde4a3;         // 9.9992807353939517e-1
    return (uint32_t)r >> s;
}

double fixed_to_float (int32_t a)
{
    return a / 65536.0;
}

int main (void)
{
    double a, res, ref, err, maxerr = 0.0;
    int32_t x, start, end;

    start = 0xfffa0900;
    end = 0x000f0000;
    printf ("testing fixed_exp2 with inputs in [%g, %g)\n",
            fixed_to_float (start), fixed_to_float (end));
    for (x = start; x < end; x++) {
        a = fixed_to_float (x);
        ref = exp2 (a);
        res = fixed_to_float (fixed_exp2 (x));
        err = fabs (res - ref) / ref;
        if (err > maxerr) {
            maxerr = err;
        }
    }
    printf ("max. rel. err = %g\n", maxerr);
    return EXIT_SUCCESS;
}

Does unrolling a loop affect the accuracy of the computations within?

Summarized question: Does unrolling a loop affect the accuracy of the computations performed within the loop? And if so, why?
Elaboration and background: I am writing a compute shader using HLSL for use in a Unity project (2021.2.9f1). Parts of my code include numerical procedures and highly oscillatory functions, meaning that high computational accuracy is essential.
When comparing my results with an equivalent procedure in Python, I noticed some deviations on the order of 1e-5. This was concerning, as I did not expect such large errors to be the result of precision differences, e.g., the float precision in trigonometric or power functions in HLSL.
Ultimately, after much debugging, I now believe the choice of unrolling or not unrolling a loop to be the cause of the deviation. However, I do find this strange, as I cannot find any sources indicating that unrolling a loop affects accuracy in addition to the "space–time tradeoff".
For clarification: if my Python results are taken as the correct solution, unrolling the loop in HLSL gives me better results than not unrolling it does.
Minimal working example: Below is an MWE consisting of a C# script for Unity, the corresponding compute shader where the computations are performed, and a screenshot of my console when running in Unity (2021.2.9f1). Forgive me for a somewhat messy implementation of Newton's method, but I chose to keep it since I believe it might be a cause of this deviation. That is, if simply computing cos(x), there is no difference between the unrolled and non-unrolled versions. Nonetheless, I still fail to understand how the simple addition of [unroll(N)] in the testing kernel changes the result...
// C# for Unity
using UnityEngine;

public class UnrollTest : MonoBehaviour
{
    [SerializeField] ComputeShader CS;
    ComputeBuffer CBUnrolled, CBNotUnrolled;
    readonly int N = 3;

    private void Start()
    {
        CBUnrolled = new ComputeBuffer(N, sizeof(double));
        CBNotUnrolled = new ComputeBuffer(N, sizeof(double));

        CS.SetBuffer(0, "_CBUnrolled", CBUnrolled);
        CS.SetBuffer(0, "_CBNotUnrolled", CBNotUnrolled);

        CS.Dispatch(0, (int)((N + (64 - 1)) / 64), 1, 1);

        double[] ansUnrolled = new double[N];
        double[] ansNotUnrolled = new double[N];

        CBUnrolled.GetData(ansUnrolled);
        CBNotUnrolled.GetData(ansNotUnrolled);

        for (int i = 0; i < N; i++)
        {
            Debug.Log("Unrolled ans = " + ansUnrolled[i] +
                      " - Not Unrolled ans = " + ansNotUnrolled[i] +
                      " -- Difference is: " + (ansUnrolled[i] - ansNotUnrolled[i]));
        }

        CBUnrolled.Release();
        CBNotUnrolled.Release();
    }
}
#pragma kernel CSMain

RWStructuredBuffer<double> _CBUnrolled, _CBNotUnrolled;

// Dummy function for Newton's method
double fDummy(double k, double fnh, double h, double theta)
{
    return fnh * fnh * k * h * cos(theta) * cos(theta) - (double) tanh(k * h);
}

// Derivative of the dummy function above using a central finite difference scheme.
double dfDummy(double k, double fnh, double h, double theta)
{
    return (fDummy(k + (double) 1e-3, fnh, h, theta) - fDummy(k - (double) 1e-3, fnh, h, theta)) / (double) 2e-3;
}

// Function to solve.
double f(double fnh, double h, double theta)
{
    // Solved using Newton's method.
    int max_iter = 50;
    double epsilon = 1e-8;
    double fxn, dfxn;

    // Define initial guess for k, hereby denoted as x.
    double xn = 10.0;

    for (int n = 0; n < max_iter; n++)
    {
        fxn = fDummy(xn, fnh, h, theta);
        if (abs(fxn) < epsilon) // A solution is found.
            return xn;

        dfxn = dfDummy(xn, fnh, h, theta);
        if (dfxn == 0.0) // No solution found.
            return xn;

        xn = xn - fxn / dfxn;
    }
    // No solution found.
    return xn;
}

[numthreads(64,1,1)]
void CSMain(uint3 threadID : SV_DispatchThreadID)
{
    int N = 3;
    // ---------------
    double fnh = 0.9, h = 4.53052, theta = -0.161, dtheta = 0.01; // Example values.

    for (int i = 0; i < N; i++) // Not being unrolled
    {
        _CBNotUnrolled[i] = f(fnh, h, theta);
        theta += dtheta;
    }
    // ---------------
    fnh = 0.9, h = 4.53052, theta = -0.161, dtheta = 0.01; // Example values.

    [unroll(N)] for (int j = 0; j < N; j++) // Being unrolled.
    {
        _CBUnrolled[j] = f(fnh, h, theta);
        theta += dtheta;
    }
}
Image of Unity console when running the above
Edit: After some more testing, the deviation has been narrowed down to the following code, which gives a difference of about 1e-17 between the exact same computation unrolled vs. not unrolled. Despite the small difference, I still consider it a valid example of the issue, as I believe the two results should be equal.
[numthreads(64, 1, 1)]
void CSMain(uint3 threadID : SV_DispatchThreadID)
{
    if ((int) threadID.x != 1)
        return;

    int N = 3;
    double k = 1.0;
    // ---------------
    double fnh = 0.9, h = 4.53052, theta = -0.161, dtheta = 0.01; // Example values.

    for (int i = 0; i < N; i++) // Not being unrolled
    {
        _CBNotUnrolled[i] = (k + (double) 1e-3) * theta - (k - (double) 1e-3) * theta;
        theta += dtheta;
    }
    // ---------------
    fnh = 0.9, h = 4.53052, theta = -0.161, dtheta = 0.01; // Example values.

    [unroll(N)]
    for (int j = 0; j < N; j++) // Being unrolled.
    {
        _CBUnrolled[j] = (k + (double) 1e-3) * theta - (k - (double) 1e-3) * theta;
        theta += dtheta;
    }
}
Image of Unity console when running the edited script above
Edit 2: The following is the compiled code for the kernel given in Edit 1. Unfortunately, my experience with assembly language is limited, and I am not able to spot whether this listing shows any errors, or whether it is useful to the problem at hand.
**** Platform Direct3D 11:
Compiled code for kernel CSMain
keywords: <none>
binary blob size 648:
//
// Generated by Microsoft (R) D3D Shader Disassembler
//
//
// Note: shader requires additional functionality:
// Double-precision floating point
//
//
// Input signature:
//
// Name Index Mask Register SysValue Format Used
// -------------------- ----- ------ -------- -------- ------- ------
// no Input
//
// Output signature:
//
// Name Index Mask Register SysValue Format Used
// -------------------- ----- ------ -------- -------- ------- ------
// no Output
cs_5_0
dcl_globalFlags refactoringAllowed | enableDoublePrecisionFloatOps
dcl_uav_structured u0, 8
dcl_uav_structured u1, 8
dcl_input vThreadID.x
dcl_temps 2
dcl_thread_group 64, 1, 1
0: ine r0.x, vThreadID.x, l(1)
1: if_nz r0.x
2: ret
3: endif
4: dmov r0.xy, d(-0.161000l, 0.000000l)
5: mov r0.z, l(0)
6: loop
7: ige r0.w, r0.z, l(3)
8: breakc_nz r0.w
9: dmul r1.xyzw, r0.xyxy, d(1.001000l, 0.999000l)
10: dadd r1.xy, -r1.zwzw, r1.xyxy
11: store_structured u1.xy, r0.z, l(0), r1.xyxx
12: dadd r0.xy, r0.xyxy, d(0.010000l, 0.000000l)
13: iadd r0.z, r0.z, l(1)
14: endloop
15: store_structured u0.xy, l(0), l(0), l(-0.000000,-0.707432,0,0)
16: store_structured u0.xy, l(1), l(0), l(0.000000,-0.702312,0,0)
17: store_structured u0.xy, l(2), l(0), l(-918250586112.000000,-0.697192,0,0)
18: ret
// Approximately 0 instruction slots used
Edit 3: After reaching out to Microsoft (see https://learn.microsoft.com/en-us/an...nrolling-a-loop-affect-the-accuracy-of-t.html), they stated that the problem is more about Unity, because:
"The pragma unroll [(n)] is keil compiler which Unity uses topic"
This is driver-, hardware-, compiler-, and Unity-dependent.
In essence, the HLSL specification has somewhat looser guarantees for rounding behavior of mathematical operations than regular IEEE-754 floating point.
First, it is implementation-dependent whether operations round up or down.
IEEE-754 requires floating-point operations to produce a result that
is the nearest representable value to an infinitely-precise result,
known as round-to-nearest-even. Direct3D 10, however, defines a looser
requirement: 32-bit floating-point operations produce a result that is
within one unit-last-place (1 ULP) of the infinitely-precise result.
This means that, for example, hardware is allowed to truncate results
to 32-bit rather than perform round-to-nearest-even, as that would
result in error of at most one ULP.
See https://learn.microsoft.com/en-us/windows/win32/direct3d10/d3d10-graphics-programming-guide-resources-float-rules#32-bit-floating-point-rules
Going one step further, the HLSL compiler itself has many fast-math optimizations that can violate IEEE-754 float conformance; see, for example:
D3DCOMPILE_IEEE_STRICTNESS - Forces strict compile, which might not allow for legacy syntax. By default, the compiler disables strictness on deprecated syntax.
D3DCOMPILE_OPTIMIZATION_LEVEL3 - Directs the compiler to use the highest optimization level. If you set this constant, the compiler produces the best possible code but might take significantly longer to do so. Set this constant for final builds of an application when performance is the most important factor.
D3DCOMPILE_PARTIAL_PRECISION - Directs the compiler to perform all computations with partial precision. If you set this constant, the compiled code might run faster on some hardware.
Source: https://learn.microsoft.com/en-us/windows/win32/direct3dhlsl/d3dcompile-constants
This particularly matters for your scenario, because if optimizations are enabled, the existence of loop unrolling can trigger constant folding optimizations that reduce the computational cost of your code and change the precision of its results (potentially even improving them). Note that when constant folding occurs, the compiler has to decide how to perform rounding, and that might disagree with what your hardware FPUs would do.
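As a rough plain-C illustration (not HLSL, and compilers differ): the two algebraically equivalent forms of the expression from Edit 1 are rounded operation by operation in one case, and folded into a single constant multiply in the other, so their last bits can disagree:

#include <stdio.h>

int main (void)
{
    double k = 1.0, theta = -0.161;
    /* evaluated as written: two multiplies and a subtraction, each rounded */
    double as_written = (k + 1e-3) * theta - (k - 1e-3) * theta;
    /* what an optimizer might fold the expression into */
    double folded = 2e-3 * theta;
    printf ("as written: %.17g\nfolded:     %.17g\ndifference: %g\n",
            as_written, folded, as_written - folded);
    return 0;
}

Any nonzero difference printed here is the same kind of last-bit discrepancy observed between the unrolled and non-unrolled kernels.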
Oh, and note that IEEE-754 does not place constraints on the precision, let alone require implementation, of "additional operations" (e.g. sin, cos, tanh, atan, ln, etc.); it merely recommends them.
See a very common case where this goes wrong, with sin quantized to just 4 different values on Intel integrated graphics while having reasonable precision on other hardware: sin(x) only returns 4 different values for moderately large input on GLSL fragment shader, Intel HD4000
Also, note that Unity does not guarantee that a float in a shader is actually a 32-bit float; on certain hardware (e.g. mobile), it can even be backed by a 16-bit half or an 11-bit fixed.
High precision: float
Highest precision floating point value; generally 32 bits (just like float from regular programming languages).
...
One complication of float/half/fixed data type usage is that PC GPUs are always high precision. That is, for all the PC (Windows/Mac/Linux) GPUs, it does not matter whether you write float, half or fixed data types in your shaders. They always compute everything in full 32-bit floating point precision.
The half and fixed types only become relevant when targeting mobile
GPUs, where these types primarily exist for power (and sometimes
performance) constraints. Keep in mind that you need to test your
shaders on mobile to see whether or not you are running into
precision/numerical issues.
Even on mobile GPUs, the different precision support varies between
GPU families.
Source: https://docs.unity3d.com/Manual/SL-DataTypesAndPrecision.html
I don't believe Unity exposes compiler flags to developers; you are at its whim as to what optimizations it passes to dxc/fxc. Given that it's primarily used for games, you can bet they enable optimizations.
Source: https://forum.unity.com/threads/possible-to-set-directx-compiler-flags-in-shaders.453790/
Finally, check out "Floating-Point Determinism" by Bruce Dawson if you want an in-depth dive into this topic. I will add that this problem also exists if you want consistent results between languages (since languages can implement math functions themselves rather than using hardware intrinsics, e.g., for better precision), when cross-compiling (since different compilers/backends can optimize differently or use different system libraries), or when running managed code across different runtimes (e.g., since a JIT can perform different optimizations).

Fixed point approximation of 2^x, with input range of s5.26

How can I implement 2^x in s5.26 fixed-point arithmetic, with input values in the range [-31.9, 31.9], using a minimax polynomial approximation for exp2()?
How can I generate the polynomial using the Sollya tool mentioned in the following link?
Power of 2 approximation in fixed point
Since fixed-point arithmetic generally does not include an "infinity" encoding representing overflowed results, any implementation of exp2() for an s5.26 format will be limited to inputs in the interval (-32, 5), resulting in outputs in [0, 32).
The computation of transcendental functions typically consists of argument reduction, core approximation, and final result construction. In the case of exp2(a), a reasonable argument reduction scheme is to split a into an integer part i and a fractional part f, such that a == i + f, with f in [-0.5, 0.5]. One then computes exp2(f) and scales the result by 2^i, which corresponds to a shift in fixed-point arithmetic: exp2(a) = exp2(f) * exp2(i).
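For example, for a = 4.25 the reduction gives i = 4 and f = 0.25, so the final step computes exp2(0.25) * 2^4 ≈ 1.18920712 * 16 ≈ 19.02731384, which indeed equals exp2(4.25).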
The common design choices for the computation of exp2(f) are interpolation in tabulated values of exp2(), or polynomial approximation. Since we need 31 result bits for the largest arguments, accurate interpolation would probably need to be quadratic to keep the table size reasonable. Since many modern processors (including ones used in embedded systems) provide a fast integer multiplier, I will focus here on approximation by polynomial. For this, we want a polynomial with minimax properties, that is, one that minimizes the maximum error compared to the reference.
Both commercial and free tools offer built-in capabilities to generate minimax approximations, e.g. Mathematica's MiniMaxApproximation command, Maple's minimax command, and Sollya's fpminimax command. One might also choose to build one's own infrastructure based on the Remez algorithm, which is the approach I have used. As opposed to floating-point arithmetic, which typically uses round-to-nearest-or-even, fixed-point arithmetic is usually restricted to truncation of intermediate results. This adds additional error during expression evaluation. As a consequence, it is usually a good idea to try a heuristic-based search for small adjustments to the coefficients of the generated approximation, to partially balance those accumulating one-sided errors.
Because we need up to 31 bits in the result, and because coefficients in core approximations are typically less than unity in magnitude, we cannot use the native fixed-point precision, here s5.26, for polynomial evaluation. Instead, we want to scale up the operands in intermediate computation to fully use the available range of 32-bit integers, by dynamically adjusting the fixed-point format we are working in. For reasons of efficiency, it seems advisable to arrange the computation such that multiplications use re-normalization right shifts by 32 bits. This will often allow the elimination of explicit shifts on 32-bit processors.
Since intermediate computation uses signed data, right shifts of signed, negative operands will occur. We want those right shifts to map to arithmetic right shift instructions, something the C standard does not guarantee. But on most commonly used platforms, C compilers do what is desirable for us. Otherwise, it may be necessary to resort to intrinsics or inline assembly. I developed the code below with the Microsoft compiler on an x64 platform.
In the evaluation of the polynomial approximation for exp2(f), the original floating-point coefficients, the dynamic scaling, and the heuristic adjustments are all clearly visible in the code below. The code does not quite achieve full accuracy for large arguments: the biggest absolute error is 1.10233e-7, for the argument 0x12de9c5b = 4.71739332, where fixed_exp2() returns 0x693ab6a3 while the accurate result would be 0x693ab69c. Presumably full accuracy could be achieved by increasing the degree of the polynomial core approximation by one.
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <math.h>
/* on 32-bit architectures, there is often an instruction/intrinsic for this */
int32_t mulhi (int32_t a, int32_t b)
{
    return (int32_t)(((int64_t)a * (int64_t)b) >> 32);
}

/* compute exp2(a) in s5.26 fixed-point arithmetic */
int32_t fixed_exp2 (int32_t a)
{
    int32_t i, f, r, s;
    /* split a = i + f, such that f in [-0.5, 0.5] */
    i = (a + 0x2000000) & ~0x3ffffff; // 0.5
    f = a - i;
    s = ((5 << 26) - i) >> 26;
    f = f << 5; /* scale up for maximum accuracy in intermediate computation */
    /* approximate exp2(f)-1 for f in [-0.5, 0.5] */
    r = (int32_t)(1.53303146e-4 * (1LL << 36) + 996);
    r = mulhi (r, f) + (int32_t)(1.33887795e-3 * (1LL << 35) + 99);
    r = mulhi (r, f) + (int32_t)(9.61833261e-3 * (1LL << 34) + 121);
    r = mulhi (r, f) + (int32_t)(5.55036329e-2 * (1LL << 33) + 51);
    r = mulhi (r, f) + (int32_t)(2.40226507e-1 * (1LL << 32) + 8);
    r = mulhi (r, f) + (int32_t)(6.93147182e-1 * (1LL << 31) + 5);
    r = mulhi (r, f);
    /* add 1, scale based on integral portion of argument, round the result */
    r = ((((uint32_t)r * 2) + (uint32_t)(1.0*(1LL << 31)) + ((1U << s) / 2) + 1) >> s);
    /* when argument < -26.5, result underflows to zero */
    if (a < -0x6a000000) r = 0;
    return r;
}

/* convert from s5.26 fixed point to double-precision floating point */
double fixed_to_float (int32_t a)
{
    return a / 67108864.0;
}

int main (void)
{
    double a, res, ref, err, maxerr = 0.0;
    int32_t x, start, end;

    start = -0x7fffffff; // -31.999999985
    end = 0x14000000;    //   5.000000000
    printf ("testing fixed_exp2 with inputs in [%.9f, %.9f)\n",
            fixed_to_float (start), fixed_to_float (end));
    for (x = start; x < end; x++) {
        a = fixed_to_float (x);
        ref = exp2 (a);
        res = fixed_to_float (fixed_exp2 (x));
        err = fabs (res - ref);
        if (err > maxerr) {
            maxerr = err;
        }
    }
    printf ("max. abs. err = %g\n", maxerr);
    return EXIT_SUCCESS;
}
A table-based alternative would trade off table storage for a reduction in the amount of computation performed. Depending on the size of the L1 data cache, this may or may not increase performance. One possible approach is to tabulate 2^f - 1 for f in [0, 1). Then split the function argument into an integer i and a fraction f, such that f is in [0, 1). In order to keep the table reasonably small, use quadratic interpolation, with the coefficients of the polynomial computed on the fly from three consecutive table entries. The result is slightly adjusted by a heuristically determined offset to somewhat compensate for the truncating nature of fixed-point arithmetic.
The table is indexed by the leading bits of the fraction f. Using seven bits for the index (resulting in a table of 128+2 entries), accuracy is slightly worse than with the previous minimax polynomial approximation: the maximum absolute error is 1.74935e-7. It occurs for the argument 0x11580000 = 4.33593750, where fixed_exp2() returns 0x50c7d771, whereas the accurate result would be 0x50c7d765.
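For reference, the doubled coefficients in the code below fall out of fitting p(dx) = a*dx^2 + b*dx through the three points p(0) = 0, p(1) = f1 = T[idx+1] - T[idx], and p(2) = f2 = T[idx+2] - T[idx], with dx expressed in units of the table spacing: solving yields 2a = f2 - 2*f1 and 2b = 4*f1 - f2, so the code computes a and b at twice their nominal value and removes the extra factor of two with the final right shift.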
/* For i in [0,129]: (exp2 (i/128.0) - 1.0) * (1 << 31) */
static const uint32_t expTab [130] =
{
0x00000000, 0x00b1ed50, 0x0164d1f4, 0x0218af43,
0x02cd8699, 0x0383594f, 0x043a28c4, 0x04f1f656,
0x05aac368, 0x0664915c, 0x071f6197, 0x07db3580,
0x08980e81, 0x0955ee03, 0x0a14d575, 0x0ad4c645,
0x0b95c1e4, 0x0c57c9c4, 0x0d1adf5b, 0x0ddf0420,
0x0ea4398b, 0x0f6a8118, 0x1031dc43, 0x10fa4c8c,
0x11c3d374, 0x128e727e, 0x135a2b2f, 0x1426ff10,
0x14f4efa9, 0x15c3fe87, 0x16942d37, 0x17657d4a,
0x1837f052, 0x190b87e2, 0x19e04593, 0x1ab62afd,
0x1b8d39ba, 0x1c657368, 0x1d3ed9a7, 0x1e196e19,
0x1ef53261, 0x1fd22825, 0x20b05110, 0x218faecb,
0x22704303, 0x23520f69, 0x243515ae, 0x25195787,
0x25fed6aa, 0x26e594d0, 0x27cd93b5, 0x28b6d516,
0x29a15ab5, 0x2a8d2653, 0x2b7a39b6, 0x2c6896a5,
0x2d583eea, 0x2e493453, 0x2f3b78ad, 0x302f0dcc,
0x3123f582, 0x321a31a6, 0x3311c413, 0x340aaea2,
0x3504f334, 0x360093a8, 0x36fd91e3, 0x37fbefcb,
0x38fbaf47, 0x39fcd245, 0x3aff5ab2, 0x3c034a7f,
0x3d08a39f, 0x3e0f680a, 0x3f1799b6, 0x40213aa2,
0x412c4cca, 0x4238d231, 0x4346ccda, 0x44563ecc,
0x45672a11, 0x467990b6, 0x478d74c9, 0x48a2d85d,
0x49b9bd86, 0x4ad2265e, 0x4bec14ff, 0x4d078b86,
0x4e248c15, 0x4f4318cf, 0x506333db, 0x5184df62,
0x52a81d92, 0x53ccf09a, 0x54f35aac, 0x561b5dff,
0x5744fccb, 0x5870394c, 0x599d15c2, 0x5acb946f,
0x5bfbb798, 0x5d2d8185, 0x5e60f482, 0x5f9612df,
0x60ccdeec, 0x62055b00, 0x633f8973, 0x647b6ca0,
0x65b906e7, 0x66f85aab, 0x68396a50, 0x697c3840,
0x6ac0c6e8, 0x6c0718b6, 0x6d4f301f, 0x6e990f98,
0x6fe4b99c, 0x713230a8, 0x7281773c, 0x73d28fde,
0x75257d15, 0x767a416c, 0x77d0df73, 0x792959bb,
0x7a83b2db, 0x7bdfed6d, 0x7d3e0c0d, 0x7e9e115c,
0x80000000, 0x8163daa0
};
int32_t fixed_exp2 (int32_t x)
{
    int32_t f1, f2, dx, a, b, approx, idx, i, f;
    /* extract integer portion; 2**i is realized as a shift at the end */
    i = (x >> 26);
    /* extract fraction f so we can compute 2^f, 0 <= f < 1 */
    f = x & 0x3ffffff;
    /* index table of exp2 values using 7 most significant bits of fraction */
    idx = (uint32_t)f >> (26 - 7);
    /* difference between argument and next smaller sampling point */
    dx = f - (idx << (26 - 7));
    /* fit parabola through closest 3 sampling points; find coefficients a,b */
    f1 = (expTab[idx+1] - expTab[idx]);
    f2 = (expTab[idx+2] - expTab[idx]);
    a = f2 - (f1 << 1);
    b = (f1 << 1) - a;
    /* find function value offset for argument x by computing ((a*dx+b)*dx) */
    approx = a;
    approx = (int32_t)((((int64_t)approx)*dx) >> (26 - 7)) + b;
    approx = (int32_t)((((int64_t)approx)*dx) >> (26 - 7 + 1));
    /* combine integer and fractional parts of result, round result */
    approx = (((expTab[idx] + (uint32_t)approx + (uint32_t)(1.0*(1LL << 31)) + 22U) >> (30 - 26 - i)) + 1) >> 1;
    /* flush underflow to 0 */
    if (i < -27) approx = 0;
    return approx;
}

Different Results of normxcorr2 and normxcorr2_mex

I have images with different rotational orientations. I want to find the correct rotation angle using cross-correlation maximization. Since my image set is big, I wanted to speed up the normxcorr2 function using the mex file here.
I used the following code to calculate matched_angle:
function [matched_angle, max_corr_vecq, matched_angle_mex, max_corr_vecq_mex] = get_correct_rotation(moving, fixed)
    for theta = 360:-10:10
        rotated = imrotate(moving, theta, 'bicubic', 'crop');
        corr2d_map = normxcorr2(double(rotated), double(fixed));
        corr2d_map_mex = normxcorr2_mex(double(rotated), double(fixed), 'full');
        [max_corr_vec(theta/10), ~] = max(corr2d_map(:));
        [max_corr_vec_mex(theta/10), ~] = max(corr2d_map_mex(:));
    end
    % Interpolate correlation max vector for half-degree resolution
    max_corr_vecq = interp1(10:10:360, max_corr_vec, 0.5:0.5:360, 'spline');
    [~, matched_angle] = max(max_corr_vecq);
    matched_angle = 0.5 * matched_angle;
    % Interpolate correlation max vector for half-degree resolution
    max_corr_vecq_mex = interp1(10:10:360, max_corr_vec_mex, 0.5:0.5:360, 'spline');
    [~, matched_angle_mex] = max(max_corr_vecq_mex);
    matched_angle_mex = 0.5 * matched_angle_mex;
end
However, using those same two images (Moving Template Image & Fixed Reference Image) with the two functions normxcorr2 & normxcorr2_mex gives totally different results.
plot(0.5:0.5:360, max_corr_vecq, 'linewidth',2); hold on;
plot(0.5:0.5:360, max_corr_vecq_mex, 'linewidth',2);
legend({'MATLAB Built-in', 'MEX'});
set(gca, 'FontSize', 14, 'FontWeight', 'bold');
See Result Plot.
Does anyone have an idea what is going on? I could not find any entry regarding the accuracy of that mex file. And according to the author:
the following are equivalent:
result = normxcorr2_mex(template, image, 'full');
AND
result = normxcorr2(template, image);
except that normxcorr2_mex has 0's in the 'invalid' area along the boundary
which should not be a problem in my case, since I am only checking the max correlation value.
Since my previous answer, I have found the normxcorr2_mex library to be consistently slower (than MATLAB) and incorrect in all of my use cases.
As I really needed a C++ implementation (that I could verify with MATLAB), I created my own. The code is listed here:
/* normxcorr2_mex.cpp
*
* A MATLAB-mex wrapper around a C/C++ implementation of the Normalised Cross Correlation algorithm described
* by @dafnahaktana in https://stackoverflow.com/questions/44591037/speed-up-calculation-of-maximum-of-normxcorr2.
*
* This module uses the 'integral image' data structure described in the posted MATLAB/Octave code (based upon the
* original Industrial Light & Magic paper at http://scribblethink.org/Work/nvisionInterface/nip.pdf), but replaces
* the "naive" correlation step with a Fourier transform implementation for larger template sizes.
*
* Daniel Eaton released a MATLAB-mex library (http://www.cs.ubc.ca/research/deaton/remarks_ncc.html) with the
* same function name as this one in 2013. Indeed, I acknowledge [and flatteringly plagiarise] his interface and
* naming convention. Unfortunately, I was unable to duplicate the speed improvements (w.r.t. MATLAB's normxcorr2) he
* claimed with the image sizes I required. Curiously, I also observed different results using his library compared
* with MATLAB's built-in function (despite the two being claimed to be identical). This was also noted by others here:
* https://stackoverflow.com/questions/48641648/different-results-of-normxcorr2-and-normxcorr2-mex. This module
* does match normxcorr2 on both the MATLAB R2016b and R2017a/b versions tested, using the (accompanying) test script.
* Like Daniel's module, however, this function returns only the 'valid' region of correlation values, i.e. it
* doesn't pad the output array to match the input image size.
*
* This function is called via:
* NCC = normxcorr2_mex (TEMPLATE, A);
* Where:
* TEMPLATE - The (double precision) matrix to correlate with A.
* A - (Double precision) input matrix for correlation with the TEMPLATE. Note size(A) > size(TEMPLATE).
* NCC - is the computed normalised cross correlation coefficients of the matrices TEMPLATE and A.
* The size of the correlation coefficient matrix is given as:
*
* size(NCC) = [(Ar - TEMPLATEr + 1), (Ac - TEMPLATEc + 1)] ; where:
*
* Ar, Ac and TEMPLATEr, TEMPLATEc are the number of (rows, cols) of A and TEMPLATE respectively.
*
* This module requires the Eigen C++ library (http://eigen.tuxfamily.org/index.php?title=Main_Page) for compilation
* and may be compiled within MATLAB via:
*
* mex -I'[Path to]\eigen-3.3.5' normxcorr2_mex.cpp
*
* Since NCC is such a computationally intensive task, this module may be linked against the openMP library to exploit a
* pool of worker threads and distribute some of the embarrassingly parallel operations within across a number of CPU cores.
* Only rudimentary use is made of the library, but the following compilation option provides speedups generally
* exceeding 50%:
*
* mex -I'[Path to]\eigen-3.3.5' CXXFLAGS="$CXXFLAGS -fopenmp" LDFLAGS="$LDFLAGS -fopenmp" normxcorr2_mex.cpp
*
*
* You are free to do with this code as you wish. For this reason, it is released under the UNLICENSE model:
*
* This is free and unencumbered software released into the public domain.
*
* Anyone is free to copy, modify, publish, use, compile, sell, or
* distribute this software, either in source code form or as a compiled
* binary, for any purpose, commercial or non-commercial, and by any
* means.
*
* In jurisdictions that recognize copyright laws, the author or authors
* of this software dedicate any and all copyright interest in the
* software to the public domain. We make this dedication for the benefit
* of the public at large and to the detriment of our heirs and
* successors. We intend this dedication to be an overt act of
* relinquishment in perpetuity of all present and future rights to this
* software under copyright law.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS BE LIABLE FOR ANY CLAIM, DAMAGES OR
* OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
* ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
* OTHER DEALINGS IN THE SOFTWARE.
*
* For more information, please refer to <http://unlicense.org/>
*/
#include "mex.h"
#include <cstring>
#include <algorithm>
#include <limits>
#include <vector>
#include <cmath>
#include <complex>
#include <iostream>
#include <Eigen/Core>
#include <unsupported/Eigen/FFT>
using namespace Eigen;
// If we're compiled/linked with openMP, turn off Eigen's parallelisation
#ifdef _OPENMP
#define EIGEN_DONT_PARALLELIZE
#define EIGEN_NO_DEBUG
#endif
// For very small input templates, performing the raw 2D correlation in the spatial domain may be faster than
// the transform domain (due to the overhead that the latter involves). The decision which approach to use is
// made at runtime by comparing the size (=rows*cols) of the input TEMPLATE matrix with the following constant.
// Feel free to experiment with this value in your own application!
#define TEMPLATE_SIZE_THRESHOLD 401
// 2D Cross-correlation performed via the "naive approach" (laborious spatial domain convolution).
ArrayXXd spatialXcorr (const Ref<const ArrayXXd>& img, const Ref<const ArrayXXd>& templ)
{
int32_t r, c;
ArrayXXd xcorr2(img.rows()-templ.rows()+1, img.cols()-templ.cols()+1);
for (r=0; r<(img.rows()-templ.rows()+1); r++)
for (c=0; c<(img.cols()-templ.cols()+1); c++)
xcorr2(r,c) = (templ*img.block(r,c,templ.rows(),templ.cols())).sum();
return(xcorr2);
}
// 2D Cross-correlation performed via Fourier transform
ArrayXXd transformXcorr (const Ref<const ArrayXXd>& img, const Ref<const ArrayXXd>& templ)
{
ArrayXXd xcorr2(img.rows()-templ.rows()+1, img.cols()-templ.cols()+1);
// Copy the input arrays into a matrix the next power-of-2 up in size
int32_t nextPow2r = (int32_t)(pow(2.0, round(0.5+log((double)(img.rows()))/log(2.0))));
int32_t nextPow2c = (int32_t)(pow(2.0, round(0.5+log((double)(img.cols()))/log(2.0))));
MatrixXd imgPwr2 = MatrixXd::Zero(nextPow2r, nextPow2c);
MatrixXd templPwr2 = MatrixXd::Zero(nextPow2r, nextPow2c);
// A -> copied to top-left corner.
// TEMPLATE is rotated 180 degrees to account for rotation/flip performed during convolution.
imgPwr2.block(0, 0, img.rows(), img.cols()) = img.matrix();
templPwr2.block(0, 0, templ.rows(), templ.cols()) = (templ.matrix().colwise().reverse()).rowwise().reverse();
// Perform 2D FFTs via sequential 1D transforms (Rows first, then columns)
MatrixXcd imgFT(nextPow2r, nextPow2c), templFT(nextPow2r, nextPow2c), prodFT(nextPow2r, nextPow2c);
// Rows first...
#ifdef _OPENMP // If using parallel threads, then each thread
// must have its own copy of the eigenFFT plan.
#pragma omp parallel for schedule(dynamic)
for (int32_t r=0; r<nextPow2r; r++) { // This is unnecessary for single-threaded execution as
// each evaluation of the FFT is identical in length
VectorXcd rowVec(nextPow2c); // and data type.
FFT<double> eigenFFT;
// The creation of the plan is computationally expensive
#else // and so we do it once, outside of the loop in the single
// threaded case (to reduce the run time by a factor > 2).
VectorXcd rowVec(nextPow2c);
FFT<double> eigenFFT;
for (int32_t r=0; r<nextPow2r; r++) {
#endif
eigenFFT.fwd(rowVec, imgPwr2.row(r));
imgFT.row(r) = rowVec;
eigenFFT.fwd(rowVec, templPwr2.row(r));
templFT.row(r) = rowVec;
}
// ...then columns.
#ifdef _OPENMP
#pragma omp parallel for schedule(dynamic)
for (int32_t c=0; c<nextPow2c; c++) {
VectorXcd colVec(nextPow2r);
FFT<double> eigenFFT;
#else
VectorXcd colVec(nextPow2r);
for (int32_t c=0; c<nextPow2c; c++) {
#endif
eigenFFT.fwd(colVec, imgFT.col(c));
imgFT.col(c) = colVec;
eigenFFT.fwd(colVec, templFT.col(c));
templFT.col(c) = colVec;
}
// Multiply complex Fourier-domain matrices
prodFT = imgFT.cwiseProduct(templFT);
// Transform (complex) Fourier product back -> (real) spatial domain (2D IFFT).
// Reuse templPwr2 as the output variable for efficiency.
// Rows first (again)...
#ifdef _OPENMP
#pragma omp parallel for schedule(dynamic)
for (int32_t r=0; r<nextPow2r; r++) {
FFT<double> eigenFFT;
VectorXcd rowVec(nextPow2c);
#else
for (int32_t r=0; r<nextPow2r; r++) {
#endif
eigenFFT.inv(rowVec, prodFT.row(r));
prodFT.row(r) = rowVec;
}
// ...and lastly, columns.
#ifdef _OPENMP
#pragma omp parallel for schedule(dynamic)
for (int32_t c=0; c<nextPow2c; c++) {
FFT<double> eigenFFT;
VectorXcd colVec(nextPow2r);
#else
for (int32_t c=0; c<nextPow2c; c++) {
#endif
eigenFFT.inv(colVec, prodFT.col(c));
templPwr2.col(c) = colVec.real();
}
// Extract the valid region of correlation coefficients
xcorr2 = templPwr2.array().block(templ.rows()-1, templ.cols()-1, img.rows()-templ.rows()+1, img.cols()-templ.cols()+1);
return(xcorr2);
}
// Normalised cross-correlation top-level function
ArrayXXd normxcorr2 (const Ref<const ArrayXXd>& templ, const Ref<const ArrayXXd>& img)
{
ArrayXXd templZMean(templ.rows(), templ.cols());
ArrayXXd scalingCoeffs(img.rows() - templ.rows() +1, img.cols() - templ.cols() +1);
ArrayXXd normxcorr(img.rows()-templ.rows()+1, img.cols()-templ.cols()+1);
ArrayXXd integralImg(img.rows()+2, img.cols()+2), integralImgSq(img.rows()+2, img.cols()+2);
ArrayXXd windowMeanA = ArrayXXd::Zero(img.rows() - templ.rows() +1, img.cols() - templ.cols() +1);
ArrayXXd windowMeanASq = ArrayXXd::Zero(img.rows() - templ.rows() +1, img.cols() - templ.cols() +1);
// Calculate the standard deviation of the TEMPLATE
double templSizeRcp = 1.0/(double)(templ.rows()*templ.cols());
templZMean = templ-templ.mean();
double templateStd = sqrt((templZMean.pow(2)).sum()*templSizeRcp);
// Compute mean and standard deviation of input matrix A over the template window size. Firstly...
// Construct array for computing the integral image(s) + zero pad the edges to avoid boundary issues
integralImg.block(0, 0, 1, integralImg.cols()) = ArrayXXd::Zero(1, integralImg.cols());
integralImg.block(0, 0, integralImg.rows(), 1) = ArrayXXd::Zero(integralImg.rows(), 1);
integralImg.block(0, integralImg.cols()-1, integralImg.rows(), 1) = ArrayXXd::Zero(integralImg.rows(), 1);
integralImg.block(integralImg.rows()-1, 0, 1, integralImg.cols()) = ArrayXXd::Zero(1, integralImg.cols());
integralImgSq.block(0, 0, 1, integralImgSq.cols()) = ArrayXXd::Zero(1, integralImgSq.cols());
integralImgSq.block(0, 0, integralImgSq.rows(), 1) = ArrayXXd::Zero(integralImgSq.rows(), 1);
integralImgSq.block(0, integralImgSq.cols()-1, integralImgSq.rows(), 1) = ArrayXXd::Zero(integralImgSq.rows(), 1);
integralImgSq.block(integralImgSq.rows()-1, 0, 1, integralImgSq.cols()) = ArrayXXd::Zero(1, integralImgSq.cols());
// Calculate cumulative sum. Along the length of each row first...
for (int32_t r=0; r<img.rows(); r++) {
double sum = 0.0;
double sumSq = 0.0;
for (int32_t c=0; c<img.cols(); c++) {
sum += img(r,c);
sumSq += (img(r,c)*img(r,c));
integralImg(r+1, c+1) = sum;
integralImgSq(r+1, c+1) = sumSq;
}
}
// ...and then down each column.
for (int32_t c=1; c<=img.cols(); c++) {
double sum = 0.0;
double sumSq = 0.0;
for (int32_t r=1; r<=img.rows(); r++) {
sum += integralImg(r,c);
sumSq += integralImgSq(r,c);
integralImg(r,c) = sum;
integralImgSq(r,c) = sumSq;
}
}
// Determine start/finish indexes for the boundaries of the summed area
int32_t rStart = (int32_t)(0.5 + templ.rows()/2.0);
int32_t rEnd = img.rows() - rStart + (templ.rows() % 2);
int32_t cStart = (int32_t)(0.5 + templ.cols()/2.0);
int32_t cEnd = img.cols() - cStart + (templ.cols() % 2);
// Evaluate the sum of intensities
windowMeanA += ( integralImg.block(templ.rows(), templ.cols(), rEnd-rStart+1, cEnd-cStart+1) \
- integralImg.block(templ.rows(), 0, rEnd-rStart+1, cEnd-cStart+1) \
- integralImg.block(0, templ.cols(), rEnd-rStart+1, cEnd-cStart+1) \
+ integralImg.block(0, 0, rEnd-rStart+1, cEnd-cStart+1) )*templSizeRcp;
// Evaluate the sum of intensities (squared)
windowMeanASq += ( integralImgSq.block(templ.rows(), templ.cols(), rEnd-rStart+1, cEnd-cStart+1) \
- integralImgSq.block(templ.rows(), 0, rEnd-rStart+1, cEnd-cStart+1) \
- integralImgSq.block(0, templ.cols(), rEnd-rStart+1, cEnd-cStart+1) \
+ integralImgSq.block(0, 0, rEnd-rStart+1, cEnd-cStart+1) )*templSizeRcp;
// Calculate the standard deviation (squared) of A over the template size window
// Standard deviation = sqrt(windowMeanASq - windowMeanA.square());
scalingCoeffs = (windowMeanASq - windowMeanA.square());
// Amalgamate the element-by-element test/square root with other coefficients scaling for efficiency
for (int32_t r=0; r<scalingCoeffs.rows(); r++)
for (int32_t c=0; c<scalingCoeffs.cols(); c++)
if (scalingCoeffs(r,c) > 0)
scalingCoeffs(r,c) = templSizeRcp/(templateStd*sqrt(scalingCoeffs(r,c)));
else
scalingCoeffs(r,c) = std::numeric_limits<double>::quiet_NaN();
// Decide which 2D correlation approach to use (transform or spatial domain)
if ((templ.rows()*templ.cols()) > TEMPLATE_SIZE_THRESHOLD)
normxcorr = scalingCoeffs*transformXcorr(img, templZMean);
else
normxcorr = scalingCoeffs*spatialXcorr(img, templZMean);
return(normxcorr);
}
// ******************** Minimal MEX wrapper ********************
void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
// Check the number of arguments
if (nrhs != 2)
mexErrMsgIdAndTxt("MATLAB:normxcorr2_mex", "Usage: NCC = normxcorr2_mex (TEMPLATE, A);");
// Verify input array sizes
size_t rowsTempl = mxGetM(prhs[0]);
size_t colsTempl = mxGetN(prhs[0]);
size_t rowsA = mxGetM(prhs[1]);
size_t colsA = mxGetN(prhs[1]);
if ((rowsA <= rowsTempl) || (colsA <= colsTempl))
mexErrMsgIdAndTxt("MATLAB:normxcorr2_mex", "Size of TEMPLATE must be less than input matrix A.");
#ifdef _OPENMP
// Required for Eigen versions < 3.3 and for *some* non-compliant C++11 compilers.
// (Warn Eigen our application might be calling it from multiple threads).
initParallel();
#endif
// Perform correlation
ArrayXXd xcorr(rowsA-rowsTempl+1, colsA-colsTempl+1);
xcorr = normxcorr2 (Map<ArrayXXd>(mxGetPr(prhs[0]), rowsTempl, colsTempl), Map<ArrayXXd>(mxGetPr(prhs[1]), rowsA, colsA));
// Return data to MATLAB
plhs[0] = mxCreateDoubleMatrix(rowsA-rowsTempl+1, colsA-colsTempl+1, mxREAL);
Map<ArrayXXd> (mxGetPr(plhs[0]), xcorr.rows(), xcorr.cols()) = xcorr;
return;
}
As per the comments in the header, save the file to normxcorr2_mex.cpp and compile with:
mex -I'[Path to]\eigen-3.3.5' normxcorr2_mex.cpp
for single-threaded operation, or with:
mex -I'[Path to]\eigen-3.3.5' CXXFLAGS="$CXXFLAGS -fopenmp" LDFLAGS="$LDFLAGS -fopenmp" normxcorr2_mex.cpp
for multi-threaded openMP support.
The timing and correct operation of the code can be verified with the following MATLAB script:
% testHarness.m
%
% Verify the results of the compiled normxcorr2_mex() function against
% MATLAB's inbuilt normxcorr2() function. This takes aaaaages to run!
%% Simulation/comparison parameters
nRunsA = 50; % Number of trials for accuracy comparison
nRunsT = 30; % Number of repetitions for execution time determination
nStepsT = 50; % Number of input matrix size steps to take in execution time measurement
maxImSize = [1343 1745]; % (Deliberately non-round-number) maximum image size for tests
maxTemplSize = [248 379]; % Maximum image template size
%% Accuracy comparison
sumSqErr = zeros(1, nRunsA);
fprintf(2, 'Accuracy comparison\n');
for nRun = 1:nRunsA
fprintf('Run %d (of %d)\n', nRun, nRunsA);
% Create input images/templates of random content and size
randSizeScale = 0.02 + 0.98*rand(1, 2);
img = rand(round(maxImSize.*randSizeScale));
templ = rand(round(maxTemplSize.*randSizeScale));
% MATLAB's inbuilt function
resultMatPadded = normxcorr2(templ, img);
% Remove unwanted padding
[rTempl, cTempl] = size(templ);
[rImg, cImg] = size(img);
resultMat = resultMatPadded(rTempl:rImg, cTempl:cImg);
% MEX function
resultMex = normxcorr2_mex(templ, img);
% Compare results
sumSqErr(nRun) = sum(sum( (resultMat-resultMex).^2 ));
end
figure;
plot(sumSqErr);
title('Accuracy comparison between MATLAB and MEX normxcorr2');
xlabel('Run #');
ylabel('\Sigma |MATLAB-MEX|^2');
grid on;
%% Timing comparison
avMatT = zeros(1, nStepsT);
avMexT = zeros(1, nStepsT);
fprintf(2, 'Timing comparison\n');
for stp = 1:nStepsT
fprintf('Run %d (of %d)\n', stp, nStepsT);
% Create input images/templates of random content and progressively larger size
img = rand(round(maxImSize*stp/nStepsT));
templ = rand(round(maxTemplSize.*stp/nStepsT));
% MATLAB's function
tStart = tic;
for exec = 1:nRunsT
dummy = normxcorr2(templ, img);
end
avMatT(stp) = toc(tStart)/nRunsT;
% MEX function
tStart = tic;
for exec = 1:nRunsT
dummy = normxcorr2_mex(templ, img);
end
avMexT(stp) = toc(tStart)/nRunsT;
end
figure;
plot((1:nStepsT)/(0.01*nStepsT), avMatT, 'rx-', (1:nStepsT)/(0.01*nStepsT), avMexT, 'bo-');
title('Execution time comparison between MATLAB and MEX normxcorr2');
xlabel('Input array size [% of maximum]');
ylabel('Evaluation time [s]');
legend('MATLAB', 'MEX');
grid on;
The above C++/mex implementation and MATLAB's inbuilt normxcorr2 function agree to a level approaching the limits of the underlying double-precision data type. It turns out that the recent MATLAB normxcorr2 is hard to beat in speed, though, even when using openMP, as this comparative timing plot shows when run on my elderly i7-980 CPU.
Unfortunately I don't have an explanation, but I can confirm the issue appears to be with the library and not your implementation. I had issues building the normxcorr2_mex library with the MinGW64 compiler under Windows, which made me wary of possible variations between builds. However, builds under both Debian Linux and Windows exhibit the same (incorrect) behaviour compared to MATLAB's built-in normxcorr2 function, as shown in the plot included here.
To assist anyone else building the library under Windows, I had to coerce the C++ compiler with the following command line:
mex -O CXXFLAGS="$CXXFLAGS -std=c++03 -fpermissive" normxcorr2_mex.cpp cv_src/*.cpp
Incidentally, I also found the mex implementation to be an order of magnitude slower than MATLAB's!

RSA hardware implementation: radix-2 montgomery multiplication issues

I'm implementing RSA-1024 in hardware (Xilinx Zynq FPGA), and am unable to figure out a few curious issues. Most notably, I am finding that my implementation only works for certain base/exponent/modulus combinations, but I have not found any reason why this is the case.
Note: I am implementing the algorithm using Xilinx HLS (essentially C code that is synthesized into hardware). For the sake of this post, treat it just like a standard C implementation, except that I can have variables up to 4096 bits wide. I haven't yet parallelized it, so it should behave just like standard C code.
The Problem
My problem is that I am able to get the correct answer for certain modular exponentiation test problems, but only if the values for the base, exponent, and modulus can be written in far fewer bits than the actual 1024-bit operand width (i.e. they are zero-padded).
When I use actual 1024-bit values generated from SSH-keygen, I no longer get the correct results.
For example, if my input arguments are
uint1024_t base = 1570
uint1024_t exponent = 1019
uint1024_t modulus = 3337
I correctly get a result of 1570^1019 mod(3337) = 688
However, when I actually use values that occupy all (or approximately all) 1024 bits for the inputs...
uint1024_t base = 0x00be5416af9696937b7234421f7256f78dba8001c80a5fdecdb4ed761f2b7f955946ec920399f23ce9627f66286239d3f20e7a46df185946c6c8482e227b9ce172dd518202381706ed0f91b53c5436f233dec27e8cb46c4478f0398d2c254021a7c21596b30f77e9886e2fd2a081cadd3faf83c86bfdd6e9daad12559f8d2747
uint1024_t exponent = 0x6f1e6ab386677cdc86a18f24f42073b328847724fbbd293eee9cdec29ac4dfe953a4256d7e6b9abee426db3b4ddc367a9fcf68ff168a7000d3a7fa8b9d9064ef4f271865045925660fab620fad0aeb58f946e33bdff6968f4c29ac62bd08cf53cb8be2116f2c339465a64fd02517f2bafca72c9f3ca5bbf96b24c1345eb936d1
uint1024_t modulus = 0xb4d92132b03210f62e52129ae31ef25e03c2dd734a7235efd36bad80c28885f3a9ee1ab626c30072bb3fd9906bf89a259ffd9d5fd75f87a30d75178b9579b257b5dca13ca7546866ad9f2db0072d59335fb128b7295412dd5c43df2c4f2d2f9c1d59d2bb444e6dac1d9cef27190a97aae7030c5c004c5aea3cf99afe89b86d6d
I incorrectly get a massive number, rather than the correct answer of 29 (0x1D).
I've checked both algorithms a million times over, and have experimented with different initial values and loop bounds, but nothing seems to work.
My Implementation
I am using the standard square-and-multiply method for the modular exponentiation, and I chose to use the Tenca-Koc radix-2 algorithm for the Montgomery multiplication, detailed in pseudocode below...
/* Tenca-Koc radix-2 Montgomery multiplication */
S = 0
for i = 0 to n-1
    S = S + X[i]*Y
    if S is odd then S = S + M
    S = S/2 // right shift by one bit in radix 2
if (S >= M) then S = S - M
My Montgomery multiplication implementation is as follows:
void montMult(uint1024_t X, uint1024_t Y, uint1024_t M, uint1024_t* outData)
{
    ap_uint<2*NUM_BITS> S = 0;

    for (int i=0; i<NUM_BITS; i++)
    {
        // add product of X.get_bit(i) and Y to partial sum
        S += X[i]*Y;
        // if S is odd, add modulus to partial sum
        if (S.test(0))
            S += M;
        // right shift 1 bit (divide by 2)
        S = S >> 1;
    }
    // bring back to under 1024 bits by subtracting modulus
    if (S >= M)
        S -= M;
    // write output data
    *outData = S.range(NUM_BITS-1,0);
}
and my top-level modular exponentiation is as follows, where (switching notation!) ...
// k: number of bits
// r = 2^k (radix)
// M: base
// e: exponent
// n: modulus
// Mbar: (precomputed residue) M*r mod(n)
// xbar: (precomputed initial residue) 1*r mod(n)
void ModExp(uint1024_t M, uint1024_t e, uint1024_t n,
            uint1024_t Mbar, uint1024_t xbar, uint1024_t* out)
{
    for (int i=NUM_BITS-1; i>=0; i--)
    {
        // square
        montMult(xbar,xbar,n,&xbar);
        // multiply
        if (e.test(i)) // if (e.bit(i) == 1)
            montMult(Mbar,xbar,n,&xbar);
    }
    // undo montgomery residue transformation
    montMult(xbar,1,n,out);
}
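For completeness, the precomputed residues can also be generated in the same fixed-width arithmetic. A hypothetical sketch (the helper name is mine, not part of the original design; it assumes x < n): compute x * 2^NUM_BITS mod n by NUM_BITS doublings with conditional subtraction, then obtain Mbar as toMont(M) and xbar as toMont(1).

/* Hypothetical helper: computes (x << NUM_BITS) mod n by repeated doubling
   with conditional subtraction; one extra bit of headroom keeps 2*t
   representable. Assumes x < n. */
void toMont(uint1024_t x, uint1024_t n, uint1024_t* out)
{
    ap_uint<NUM_BITS+1> t = x;
    ap_uint<NUM_BITS+1> nw = n; // widen the modulus explicitly
    for (int i = 0; i < NUM_BITS; i++)
    {
        t = t << 1;   // t = 2*t < 2n, fits in NUM_BITS+1 bits
        if (t >= nw)
            t -= nw;
    }
    *out = t.range(NUM_BITS-1, 0);
}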
I can't for the life of me figure out why this works for everything except an actual 1024-bit value. Any help would be much appreciated.
I've replaced my answer because I was wrong. Your original code is perfectly correct. I've tested it using my own BigInteger library, which includes Montgomery arithmetic, and everything works like a charm. Here is my code:
const
base1 =
'0x00be5416af9696937b7234421f7256f78dba8001c80a5fdecdb4ed761f2b7f955946ec9203'+
'99f23ce9627f66286239d3f20e7a46df185946c6c8482e227b9ce172dd518202381706ed0f91'+
'b53c5436f233dec27e8cb46c4478f0398d2c254021a7c21596b30f77e9886e2fd2a081cadd3f'+
'af83c86bfdd6e9daad12559f8d2747';
exponent1 =
'0x6f1e6ab386677cdc86a18f24f42073b328847724fbbd293eee9cdec29ac4dfe953a4256d7e'+
'6b9abee426db3b4ddc367a9fcf68ff168a7000d3a7fa8b9d9064ef4f271865045925660fab62'+
'0fad0aeb58f946e33bdff6968f4c29ac62bd08cf53cb8be2116f2c339465a64fd02517f2bafc'+
'a72c9f3ca5bbf96b24c1345eb936d1';
modulus1 =
'0xb4d92132b03210f62e52129ae31ef25e03c2dd734a7235efd36bad80c28885f3a9ee1ab626'+
'c30072bb3fd9906bf89a259ffd9d5fd75f87a30d75178b9579b257b5dca13ca7546866ad9f2d'+
'b0072d59335fb128b7295412dd5c43df2c4f2d2f9c1d59d2bb444e6dac1d9cef27190a97aae7'+
'030c5c004c5aea3cf99afe89b86d6d';
function MontMult(X, Y, N: BigInteger): BigInteger;
var
  I: Integer;
begin
  Result:= 0;
  for I:= 0 to 1023 do begin
    if not X.IsEven then Result:= Result + Y;
    if not Result.IsEven then Result:= Result + N;
    Result:= Result shr 1;
    X:= X shr 1;
  end;
  if Result >= N then Result:= Result - N;
end;

function ModExp(B, E, N: BigInteger): BigInteger;
var
  R, MontB: BigInteger;
  I: Integer;
begin
  R:= BigInteger.PowerOfTwo(1024) mod N;
  MontB:= (B * R) mod N;
  for I:= 1023 downto 0 do begin
    R:= MontMult(R, R, N);
    if not (E shr I).IsEven then
      R:= MontMult(MontB, R, N);
  end;
  Result:= MontMult(R, 1, N);
end;

procedure TestMontMult;
var
  Base, Expo, Modulus: BigInteger;
  MontBase, MontExpo: BigInteger;
  X, Y, R: BigInteger;
  Mont: TMont;
begin
  // convert to BigInteger
  Base:= BigInteger.Parse(base1);
  Expo:= BigInteger.Parse(exponent1);
  Modulus:= BigInteger.Parse(modulus1);
  R:= BigInteger.PowerOfTwo(1024) mod Modulus;
  // Convert into Montgomery form
  MontBase:= (Base * R) mod Modulus;
  MontExpo:= (Expo * R) mod Modulus;
  Writeln;
  // MontMult test, all 3 versions output
  // '0x146005377258684F3FFD8D9A70D723BDD3A2E3A160E11B7AD35A7106D4D903AB9D14A9201'+
  // 'D0907CE2FC2E04A69656C38CE64AA0BADF2376AEFB19D8732CE2B3650466E31BB78CF24F4E3'+
  // '774A78575738B668DA0E40C8DDDA972CE101E0CADC5D4CCFF6EF2E4E97AF02F34E3AB7258A7'+
  // '323E472FC051825FFC72ADC53B0DAF3C4';
  Writeln('Using MontMult');
  Writeln(MontMult(MontMult(MontBase, MontExpo, Modulus), 1, Modulus).ToHexString);
  // same using TMont instance
  Writeln('Using TMont.Multiply');
  Mont:= TMont.GetInstance(Modulus);
  Writeln(Mont.Reduce(Mont.Multiply(MontBase, MontExpo)).ToHexString);
  Writeln('Using TMont.ModMul');
  Writeln(Mont.ModMul(Base,Expo).ToHexString);
  // ModExp test, all 3 versions output 29
  Writeln('Using ModExp');
  Writeln(ModExp(Base, Expo, Modulus).ToString);
  Writeln('Using BigInteger.ModPow');
  Writeln(BigInteger.ModPow(Base, Expo, Modulus).ToString);
  Writeln('Using TMont.ModPow');
  Writeln(Mont.ModPow(Base, Expo).ToString);
end;
Update: I finally was able to fix the issue, after I ported my design to Java to check my intermediate values in the debugger. The design ran flawlessly in Java with no modifications to the code structure, and this tipped me off as to what was going wrong.
The problem came to light after getting correct intermediate values using the Java BigInteger package. The HLS arbitrary-precision library has a fixed bit width (obviously, since it synthesizes down to hardware), whereas the software BigInteger libraries have flexible bit widths. It turns out that the addition operator treats both arguments as signed values if they are of different bit widths, despite the fact that I declared them as unsigned. Thus, when there was a 1 in the MSB of a narrower intermediate value and I tried to add it to a wider value, it treated that MSB as a sign bit and attempted to sign-extend it.
This did not happen with the Java BigInteger library, which quickly pointed me towards the problem.
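The same class of pitfall can be reproduced in plain C, where a mixed-width addition also converts the narrower signed operand first; a small standalone illustration:

#include <stdio.h>
#include <stdint.h>

int main (void)
{
    int32_t narrow = (int32_t)0x80000000; /* MSB set */
    uint64_t wide = 0;
    /* the narrower operand is converted (effectively sign-extended) first */
    uint64_t bad = wide + narrow;
    /* explicit zero-extension gives the intended unsigned interpretation */
    uint64_t good = wide + (uint64_t)(uint32_t)narrow;
    printf ("sign-extended: %016llx\nzero-extended: %016llx\n",
            (unsigned long long)bad, (unsigned long long)good);
    return 0;
}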
If anyone is interested in a Java implementation of modular exponentiation using the Tenca-Koc radix2 algorithm for montgomery multiplication, you can find the code here: https://github.com/bigbrett/MontModExp-radix2

Generating Doubles With XORShift Generator

So I am using the Wikipedia entry on XORShift generators to make a PRNG. My code is as follows.
uint32_t xor128(void) {
    static uint32_t x = 123456789;
    static uint32_t y = 362436069;
    static uint32_t z = 521288629;
    static uint32_t w = 88675123;
    uint32_t t;

    t = x ^ (x << 11);
    x = y; y = z; z = w;
    return w = w ^ (w >> 19) ^ t ^ (t >> 8);
}
My question is, how can I use this to generate doubles in the range [0, 1)?
Thanks for any help.
Just divide the returned uint32_t by the maximum uint32_t value (cast to a double). This does have an approximately one-in-four-billion chance of producing exactly 1.0, though. You could put in a test for the maximum and discard it if you wish.
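A minimal sketch of that suggestion, assuming the xor128() from the question is in scope:

#include <stdint.h>

/* scale to [0, 1]; reject the single all-ones output to get [0, 1) */
double xor128_unit (void)
{
    uint32_t r;
    do {
        r = xor128();
    } while (r == UINT32_MAX);
    return (double)r / (double)UINT32_MAX;
}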
Assuming you want a uniform distribution, and aren't too picky about randomising all of the bits for extremely small numbers:
double xor128d(void) {
    return xor128() / 4294967296.0;
}
Since xor128() cannot return 4294967296, the result cannot be exactly 1.0 -- however, if you returned a float, it might still be rounded up to 1.0f.
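To see that float hazard concretely, here is a standalone check (not part of the generator itself):

#include <stdio.h>

int main (void)
{
    double d = 4294967295u / 4294967296.0; /* largest possible scaled output */
    float f = (float)d;                    /* rounds up to 1.0f */
    printf ("double: %.17g\nfloat: %.9g\n", d, f);
    return 0;
}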
If you try to add more bits to fill the whole mantissa then you'll face the same rounding headache for doubles.
Do you want the whole mantissa randomised for all possible values? That's a little harder.
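One common approach (a sketch; it costs two generator calls per double and still does not randomise extra low-order bits of very small results) is to concatenate 53 random bits and scale by 2^-53:

#include <stdint.h>

/* fill the full 53-bit significand from two 32-bit outputs (32 + 21 bits) */
double xor128d53 (void)
{
    uint64_t hi = xor128();       /* 32 bits */
    uint64_t lo = xor128() >> 11; /* top 21 bits of a second draw */
    return ((hi << 21) | lo) * (1.0 / 9007199254740992.0); /* times 2^-53 */
}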