Fixed-point approximation of 2^x, with inputs in s5.26 fixed-point format

How can I implement 2^x in s5.26 fixed-point arithmetic, with input values in the range [-31.9, 31.9], using a minimax polynomial approximation for exp2()?
How can I generate the polynomial using the Sollya tool mentioned in the following link?
Power of 2 approximation in fixed point

Since fixed-point arithmetic generally does not include an "infinity" encoding representing overflowed results, any implementation of exp2() for an s5.26 format will be limited to inputs in the interval (-32, 5), resulting in outputs in [0, 32).
The computation of transcendental functions typically consists of argument reduction, core approximation, and final result construction. In the case of exp2(a), a reasonable argument reduction scheme is to split a into an integer part i and a fractional part f, such that a == i + f, with f in [-0.5, 0.5]. One then computes exp2(f) and scales the result by 2^i, which corresponds to a shift in fixed-point arithmetic: exp2(a) = exp2(f) * exp2(i).
The common design choices for the computation of exp2(f) are interpolation in tabulated values of exp2(), or polynomial approximation. Since we need 31 result bits for the largest arguments, accurate interpolation would probably want to use quadratic interpolation to keep the table size reasonable. Since many modern processors (including ones used in embedded systems) provide a fast integer multiplier, I will focus here on approximation by polynomial. For this, we want a polynomial with minimax properties, that is, one that minimizes the maximum error compared to the reference.
Both commercial and free tools offer built-in capabilities to generate minimax approximations, e.g. Mathematica's MiniMaxApproximation command, Maple's minimax command, and Sollya's fpminimax command. One might also choose to build one's own infrastructure based on the Remez algorithm, which is the approach I have used. As opposed to floating-point arithmetic, which typically uses round-to-nearest-or-even rounding, fixed-point arithmetic is usually restricted to truncation of intermediate results. This adds additional error during expression evaluation. As a consequence, it is usually a good idea to try a heuristic-based search for small adjustments to the coefficients of the generated approximation to partially balance those accumulating one-sided errors.
Because we need up to 31 bits in the result, and because coefficients in core approximations are typically less than unity in magnitude, we cannot use the native fixed-point precision, here s5.26, for polynomial evaluation. Instead, we want to scale up the operands in intermediate computation to fully use the available range of 32-bit integers, by dynamically adjusting the fixed-point format we are working in. For reasons of efficiency, it seems advisable to arrange the computation such that multiplications use re-normalization right shifts by 32 bits. This will often allow the elimination of explicit shifts on 32-bit processors.
Since intermediate computation uses signed data, right shifts of signed, negative operands will occur. We want those right shifts to map to arithmetic right shift instructions, something the C standard does not guarantee. But on most commonly used platforms, C compilers do what is desirable for us. Otherwise, it may be necessary to resort to intrinsics or inline assembly. I developed the code below with the Microsoft compiler on an x64 platform.
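If a target's treatment of right shifts on negative values cannot be relied upon, the shift can be emulated portably. The helper below is a sketch of my own (it is not part of the code that follows) and assumes 32-bit two's complement integers and <stdint.h>; the name asr32 is arbitrary.

/* possible portable fallback (illustrative sketch, not used by the code below):
   emulate a 32-bit arithmetic right shift without relying on the
   implementation-defined behavior of >> on negative signed operands */
static int32_t asr32 (int32_t a, int s)
{
    uint32_t u = (uint32_t)a;
    /* for negative a: complement, shift in zeros, complement again, which
       fills the vacated high bits with ones (i.e. sign extension) */
    return (a < 0) ? (int32_t)(~(~u >> s)) : (int32_t)(u >> s);
}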
In the evaluation of the polynomial approximation for exp2(f), the original floating-point coefficients, the dynamic scaling, and the heuristic adjustments are all clearly visible. The code below does not quite achieve full accuracy for large arguments. The biggest absolute error is 1.10233e-7, for the argument 0x12de9c5b = 4.71739332: fixed_exp2() returns 0x693ab6a3 while the accurate result would be 0x693ab69c. Presumably full accuracy could be achieved by increasing the degree of the polynomial core approximation by one.
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <math.h>
/* on 32-bit architectures, there is often an instruction/intrinsic for this */
int32_t mulhi (int32_t a, int32_t b)
{
return (int32_t)(((int64_t)a * (int64_t)b) >> 32);
}
/* compute exp2(a) in s5.26 fixed-point arithmetic */
int32_t fixed_exp2 (int32_t a)
{
int32_t i, f, r, s;
/* split a = i + f, such that f in [-0.5, 0.5] */
i = (a + 0x2000000) & ~0x3ffffff; // 0.5
f = a - i;
s = ((5 << 26) - i) >> 26;
f = f << 5; /* scale up for maximum accuracy in intermediate computation */
/* approximate exp2(f)-1 for f in [-0.5, 0.5] */
r = (int32_t)(1.53303146e-4 * (1LL << 36) + 996);
r = mulhi (r, f) + (int32_t)(1.33887795e-3 * (1LL << 35) + 99);
r = mulhi (r, f) + (int32_t)(9.61833261e-3 * (1LL << 34) + 121);
r = mulhi (r, f) + (int32_t)(5.55036329e-2 * (1LL << 33) + 51);
r = mulhi (r, f) + (int32_t)(2.40226507e-1 * (1LL << 32) + 8);
r = mulhi (r, f) + (int32_t)(6.93147182e-1 * (1LL << 31) + 5);
r = mulhi (r, f);
/* add 1, scale based on integral portion of argument, round the result */
r = ((((uint32_t)r * 2) + (uint32_t)(1.0*(1LL << 31)) + ((1U << s) / 2) + 1) >> s);
/* when argument < -26.5, result underflows to zero */
if (a < -0x6a000000) r = 0;
return r;
}
/* convert from s5.26 fixed point to double-precision floating point */
double fixed_to_float (int32_t a)
{
return a / 67108864.0;
}
int main (void)
{
double a, res, ref, err, maxerr = 0.0;
int32_t x, start, end;
start = -0x7fffffff; // -31.999999985
end = 0x14000000; // 5.000000000
printf ("testing fixed_exp2 with inputs in [%.9f, %.9f)\n",
fixed_to_float (start), fixed_to_float (end));
for (x = start; x < end; x++) {
a = fixed_to_float (x);
ref = exp2 (a);
res = fixed_to_float (fixed_exp2 (x));
err = fabs (res - ref);
if (err > maxerr) {
maxerr = err;
}
}
printf ("max. abs. err = %g\n", maxerr);
return EXIT_SUCCESS;
}
A table-based alternative would trade table storage for a reduction in the amount of computation performed. Depending on the size of the L1 data cache, this may or may not increase performance. One possible approach is to tabulate 2^f - 1 for f in [0, 1). Then split the function argument into an integer i and a fraction f, such that f is in [0, 1). In order to keep the table reasonably small, use quadratic interpolation, with the coefficients of the polynomial computed on the fly from three consecutive table entries. The result is slightly adjusted by a heuristically determined offset to somewhat compensate for the truncating nature of fixed-point arithmetic.
The table is indexed by leading bits of the fraction f. Using seven bits for the index (resulting in a table of 128+2 entries), accuracy is slightly worse than with the previous minimax polynomial approximation. Maximum absolute error is 1.74935e-7. It occurs for an argument of 0x11580000 = 4.33593750, where fixed_exp2() returns 0x50c7d771, whereas the accurate result would be 0x50c7d765.
/* For i in [0,129]: (exp2 (i/128.0) - 1.0) * (1 << 31) */
static const uint32_t expTab [130] =
{
0x00000000, 0x00b1ed50, 0x0164d1f4, 0x0218af43,
0x02cd8699, 0x0383594f, 0x043a28c4, 0x04f1f656,
0x05aac368, 0x0664915c, 0x071f6197, 0x07db3580,
0x08980e81, 0x0955ee03, 0x0a14d575, 0x0ad4c645,
0x0b95c1e4, 0x0c57c9c4, 0x0d1adf5b, 0x0ddf0420,
0x0ea4398b, 0x0f6a8118, 0x1031dc43, 0x10fa4c8c,
0x11c3d374, 0x128e727e, 0x135a2b2f, 0x1426ff10,
0x14f4efa9, 0x15c3fe87, 0x16942d37, 0x17657d4a,
0x1837f052, 0x190b87e2, 0x19e04593, 0x1ab62afd,
0x1b8d39ba, 0x1c657368, 0x1d3ed9a7, 0x1e196e19,
0x1ef53261, 0x1fd22825, 0x20b05110, 0x218faecb,
0x22704303, 0x23520f69, 0x243515ae, 0x25195787,
0x25fed6aa, 0x26e594d0, 0x27cd93b5, 0x28b6d516,
0x29a15ab5, 0x2a8d2653, 0x2b7a39b6, 0x2c6896a5,
0x2d583eea, 0x2e493453, 0x2f3b78ad, 0x302f0dcc,
0x3123f582, 0x321a31a6, 0x3311c413, 0x340aaea2,
0x3504f334, 0x360093a8, 0x36fd91e3, 0x37fbefcb,
0x38fbaf47, 0x39fcd245, 0x3aff5ab2, 0x3c034a7f,
0x3d08a39f, 0x3e0f680a, 0x3f1799b6, 0x40213aa2,
0x412c4cca, 0x4238d231, 0x4346ccda, 0x44563ecc,
0x45672a11, 0x467990b6, 0x478d74c9, 0x48a2d85d,
0x49b9bd86, 0x4ad2265e, 0x4bec14ff, 0x4d078b86,
0x4e248c15, 0x4f4318cf, 0x506333db, 0x5184df62,
0x52a81d92, 0x53ccf09a, 0x54f35aac, 0x561b5dff,
0x5744fccb, 0x5870394c, 0x599d15c2, 0x5acb946f,
0x5bfbb798, 0x5d2d8185, 0x5e60f482, 0x5f9612df,
0x60ccdeec, 0x62055b00, 0x633f8973, 0x647b6ca0,
0x65b906e7, 0x66f85aab, 0x68396a50, 0x697c3840,
0x6ac0c6e8, 0x6c0718b6, 0x6d4f301f, 0x6e990f98,
0x6fe4b99c, 0x713230a8, 0x7281773c, 0x73d28fde,
0x75257d15, 0x767a416c, 0x77d0df73, 0x792959bb,
0x7a83b2db, 0x7bdfed6d, 0x7d3e0c0d, 0x7e9e115c,
0x80000000, 0x8163daa0
};
int32_t fixed_exp2 (int32_t x)
{
int32_t f1, f2, dx, a, b, approx, idx, i, f;
/* extract integer portion; 2**i is realized as a shift at the end */
i = (x >> 26);
/* extract fraction f so we can compute 2^f, 0 <= f < 1 */
f = x & 0x3ffffff;
/* index table of exp2 values using 7 most significant bits of fraction */
idx = (uint32_t)f >> (26 - 7);
/* difference between argument and next smaller sampling point */
dx = f - (idx << (26 - 7));
/* fit parabola through closest 3 sampling points; find coefficients a,b */
f1 = (expTab[idx+1] - expTab[idx]);
f2 = (expTab[idx+2] - expTab[idx]);
a = f2 - (f1 << 1);
b = (f1 << 1) - a;
/* find function value offset for argument x by computing ((a*dx+b)*dx) */
approx = a;
approx = (int32_t)((((int64_t)approx)*dx) >> (26 - 7)) + b;
approx = (int32_t)((((int64_t)approx)*dx) >> (26 - 7 + 1));
/* combine integer and fractional parts of result, round result */
approx = (((expTab[idx] + (uint32_t)approx + (uint32_t)(1.0*(1LL << 31)) + 22U) >> (30 - 26 - i)) + 1) >> 1;
/* flush underflow to 0 */
if (i < -27) approx = 0;
return approx;
}

Related

Does unrolling a loop affect the accuracy of the computations within?

Summarized question: Does unrolling a loop affect the accuracy of the computations performed within the loop? And if so, why?
Elaboration and background: I am writing a compute shader in HLSL for use in a Unity project (2021.2.9f1). Parts of my code include numerical procedures and highly oscillatory functions, meaning that high computational accuracy is essential.
When comparing my results with an equivalent procedure in Python, I noticed some deviations on the order of 1e-5. This was concerning, as I did not expect such large errors to be the result of precision differences, e.g., the float precision in trigonometric or power functions in HLSL.
Ultimately, after much debugging, I now believe the choice of unrolling or not unrolling a loop to be the cause of the deviation. However, I do find this strange, as I cannot seem to find any sources indicating that unrolling a loop affects accuracy, beyond the usual "space–time tradeoff".
For clarification, if I consider my Python results as the correct solution, unrolling the loop in HLSL gives me better results than not unrolling does.
Minimal working example: Below is an MWE consisting of a C# script for Unity, the corresponding compute shader where the computations are performed, and a screenshot of my console when running in Unity (2021.2.9f1). Forgive me for a somewhat messy implementation of Newton's method, but I chose to keep it since I believe it might contribute to this deviation. That is, if I simply compute cos(x), there is no difference between the unrolled and not-unrolled versions. Nonetheless, I still fail to understand how the simple addition of [unroll(N)] in the testing kernel changes the result...
// C# for Unity
using UnityEngine;
public class UnrollTest : MonoBehaviour
{
[SerializeField] ComputeShader CS;
ComputeBuffer CBUnrolled, CBNotUnrolled;
readonly int N = 3;
private void Start()
{
CBUnrolled = new ComputeBuffer(N, sizeof(double));
CBNotUnrolled = new ComputeBuffer(N, sizeof(double));
CS.SetBuffer(0, "_CBUnrolled", CBUnrolled);
CS.SetBuffer(0, "_CBNotUnrolled", CBNotUnrolled);
CS.Dispatch(0, (int)((N + (64 - 1)) / 64), 1, 1);
double[] ansUnrolled = new double[N];
double[] ansNotUnrolled = new double[N];
CBUnrolled.GetData(ansUnrolled);
CBNotUnrolled.GetData(ansNotUnrolled);
for (int i = 0; i < N; i++)
{
Debug.Log("Unrolled ans = " + ansUnrolled[i] +
" - Not Unrolled ans = " + ansNotUnrolled[i] +
" -- Difference is: " + (ansUnrolled[i] - ansNotUnrolled[i]));
}
CBUnrolled.Release();
CBNotUnrolled.Release();
}
}
#pragma kernel CSMain
RWStructuredBuffer<double> _CBUnrolled, _CBNotUnrolled;
// Dummy function for Newtons method
double fDummy(double k, double fnh, double h, double theta)
{
return fnh * fnh * k * h * cos(theta) * cos(theta) - (double) tanh(k * h);
}
// Derivative of Dummy function above using a central finite difference scheme.
double dfDummy(double k, double fnh, double h, double theta)
{
return (fDummy(k + (double) 1e-3, fnh, h, theta) - fDummy(k - (double) 1e-3, fnh, h, theta)) / (double) 2e-3;
}
// Function to solve.
double f(double fnh, double h, double theta)
{
// Solved using Newton's method.
int max_iter = 50;
double epsilon = 1e-8;
double fxn, dfxn;
// Define initial guess for k, herby denoted as x.
double xn = 10.0;
for (int n = 0; n < max_iter; n++)
{
fxn = fDummy(xn, fnh, h, theta);
if (abs(fxn) < epsilon) // A solution is found.
return xn;
dfxn = dfDummy(xn, fnh, h, theta);
if (dfxn == 0.0) // No solution found.
return xn;
xn = xn - fxn / dfxn;
}
// No solution found.
return xn;
}
[numthreads(64,1,1)]
void CSMain(uint3 threadID : SV_DispatchThreadID)
{
int N = 3;
// ---------------
double fnh = 0.9, h = 4.53052, theta = -0.161, dtheta = 0.01; // Example values.
for (int i = 0; i < N; i++) // Not being unrolled
{
_CBNotUnrolled[i] = f(fnh, h, theta);
theta += dtheta;
}
// ---------------
fnh = 0.9, h = 4.53052, theta = -0.161, dtheta = 0.01; // Example values.
[unroll(N)] for (int j = 0; j < N; j++) // Being unrolled.
{
_CBUnrolled[j] = f(fnh, h, theta);
theta += dtheta;
}
}
Image of Unity console when running the above
Edit: After some more testing, the deviation has been narrowed down to the following code, giving a difference of about 1e-17 between the exact same code unrolled vs. not unrolled. Despite the small difference, I still consider it a valid example of the issue, as I believe they should be equal.
[numthreads(64, 1, 1)]
void CSMain(uint3 threadID : SV_DispatchThreadID)
{
if ((int) threadID.x != 1)
return;
int N = 3;
double k = 1.0;
// ---------------
double fnh = 0.9, h = 4.53052, theta = -0.161, dtheta = 0.01; // Example values.
for (int i = 0; i < N; i++) // Not being unrolled
{
_CBNotUnrolled[i] = (k + (double) 1e-3) * theta - (k - (double) 1e-3) * theta;
theta += dtheta;
}
// ---------------
fnh = 0.9, h = 4.53052, theta = -0.161, dtheta = 0.01; // Example values.
[unroll(N)]
for (int j = 0; j < N; j++) // Being unrolled.
{
_CBUnrolled[j] = (k + (double) 1e-3) * theta - (k - (double) 1e-3) * theta;
theta += dtheta;
}
}
Image of Unity console when running the edited script above
Edit 2: The following is the compiled code for the kernel given in Edit 1. Unfortunately, my experience with assembly is limited, and I cannot tell whether this listing reveals any errors, or whether it is useful for the problem at hand.
**** Platform Direct3D 11:
Compiled code for kernel CSMain
keywords: <none>
binary blob size 648:
//
// Generated by Microsoft (R) D3D Shader Disassembler
//
//
// Note: shader requires additional functionality:
// Double-precision floating point
//
//
// Input signature:
//
// Name Index Mask Register SysValue Format Used
// -------------------- ----- ------ -------- -------- ------- ------
// no Input
//
// Output signature:
//
// Name Index Mask Register SysValue Format Used
// -------------------- ----- ------ -------- -------- ------- ------
// no Output
cs_5_0
dcl_globalFlags refactoringAllowed | enableDoublePrecisionFloatOps
dcl_uav_structured u0, 8
dcl_uav_structured u1, 8
dcl_input vThreadID.x
dcl_temps 2
dcl_thread_group 64, 1, 1
0: ine r0.x, vThreadID.x, l(1)
1: if_nz r0.x
2: ret
3: endif
4: dmov r0.xy, d(-0.161000l, 0.000000l)
5: mov r0.z, l(0)
6: loop
7: ige r0.w, r0.z, l(3)
8: breakc_nz r0.w
9: dmul r1.xyzw, r0.xyxy, d(1.001000l, 0.999000l)
10: dadd r1.xy, -r1.zwzw, r1.xyxy
11: store_structured u1.xy, r0.z, l(0), r1.xyxx
12: dadd r0.xy, r0.xyxy, d(0.010000l, 0.000000l)
13: iadd r0.z, r0.z, l(1)
14: endloop
15: store_structured u0.xy, l(0), l(0), l(-0.000000,-0.707432,0,0)
16: store_structured u0.xy, l(1), l(0), l(0.000000,-0.702312,0,0)
17: store_structured u0.xy, l(2), l(0), l(-918250586112.000000,-0.697192,0,0)
18: ret
// Approximately 0 instruction slots used
Edit 3: After reaching out to Microsoft (see https://learn.microsoft.com/en-us/an...nrolling-a-loop-affect-the-accuracy-of-t.html), they stated that the problem is more about Unity, because
"The pragma unroll [(n)] is keil compiler which Unity uses topic"
This is driver-, hardware-, compiler-, and Unity-dependent.
In essence, the HLSL specification has somewhat looser guarantees for rounding behavior of mathematical operations than regular IEEE-754 floating point.
First, it is implementation-dependent whether operations round up or down.
IEEE-754 requires floating-point operations to produce a result that is the nearest representable value to an infinitely-precise result, known as round-to-nearest-even. Direct3D 10, however, defines a looser requirement: 32-bit floating-point operations produce a result that is within one unit-last-place (1 ULP) of the infinitely-precise result. This means that, for example, hardware is allowed to truncate results to 32-bit rather than perform round-to-nearest-even, as that would result in error of at most one ULP.
See https://learn.microsoft.com/en-us/windows/win32/direct3d10/d3d10-graphics-programming-guide-resources-float-rules#32-bit-floating-point-rules
Going one step further, the HLSL compiler itself has many fast-math optimizations that can violate IEEE-754 float conformance; see, for example:
D3DCOMPILE_IEEE_STRICTNESS - Forces strict compile, which might not allow for legacy syntax. By default, the compiler disables strictness on deprecated syntax.
D3DCOMPILE_OPTIMIZATION_LEVEL3 - Directs the compiler to use the highest optimization level. If you set this constant, the compiler produces the best possible code but might take significantly longer to do so. Set this constant for final builds of an application when performance is the most important factor.
D3DCOMPILE_PARTIAL_PRECISION - Directs the compiler to perform all computations with partial precision. If you set this constant, the compiled code might run faster on some hardware.
Source: https://learn.microsoft.com/en-us/windows/win32/direct3dhlsl/d3dcompile-constants
This particularly matters for your scenario, because if optimizations are enabled, the existence of loop unrolling can trigger constant folding optimizations that reduce the computational cost of your code and change the precision of its results (potentially even improving them). Note that when constant folding occurs, the compiler has to decide how to perform rounding, and that might disagree with what your hardware FPUs would do.
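To make the folding effect concrete, here is a small host-side C sketch (my illustration, not shader code): it evaluates the expression from Edit 1 both as written and in the algebraically simplified form that a folding or fast-math compiler might substitute, and prints both results so the low-order bits can be compared.

#include <stdio.h>

int main (void)
{
    double k = 1.0, h = 1e-3, theta = -0.161;
    /* as written in the shader: two sums, two products, one subtraction, each rounded */
    double as_written = (k + h) * theta - (k - h) * theta;
    /* algebraically simplified form a folding/fast-math compiler might emit */
    double simplified = 2.0 * h * theta;
    printf ("as written : %.17g\n", as_written);
    printf ("simplified : %.17g\n", simplified);
    printf ("difference : %.17g\n", as_written - simplified);
    return 0;
}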
Oh, and note that IEEE-754 does not place constraints on the precision, let alone require implementation, of "additional operations" (e.g. sin, cos, tanh, atan, ln, etc); it purely recommends them.
See this very common case where it goes wrong and sin gets quantized to 4 different values on Intel integrated graphics, but otherwise has reasonable precision on other hardware: sin(x) only returns 4 different values for moderately large input on GLSL fragment shader, Intel HD4000
Also, note that Unity does not guarantee that a float in shader is actually a 32-bit float; on certain hardware (e.g. mobile), it can even be backed by a 16-bit half or an 11-bit fixed.
High precision: float
Highest precision floating point value; generally 32 bits (just like float from regular programming languages).
...
One complication of float/half/fixed data type usage is that PC GPUs are always high precision. That is, for all the PC (Windows/Mac/Linux) GPUs, it does not matter whether you write float, half or fixed data types in your shaders. They always compute everything in full 32-bit floating point precision.
The half and fixed types only become relevant when targeting mobile GPUs, where these types primarily exist for power (and sometimes performance) constraints. Keep in mind that you need to test your shaders on mobile to see whether or not you are running into precision/numerical issues.
Even on mobile GPUs, the different precision support varies between GPU families.
Source: https://docs.unity3d.com/Manual/SL-DataTypesAndPrecision.html
I don't believe Unity exposes compiler flags to developers; you are at its whim as to what optimizations it passes to dxc/fxc. Given it's primarily used for games, you can bet they enable optimizations.
Source: https://forum.unity.com/threads/possible-to-set-directx-compiler-flags-in-shaders.453790/
Finally, check out "Floating-Point Determinism" by Bruce Dawson if you want an in-depth dive into this topic. I will add that this problem also exists if you want consistent results between languages (since languages can implement math functions themselves rather than using hardware intrinsics, e.g. for better precision), when cross-compiling (since different compilers/backends can optimize differently or use different system libraries), or when running managed code across different runtimes (e.g. since the JIT can apply different optimizations).

RSA hardware implementation: radix-2 montgomery multiplication issues

I'm implementing RSA-1024 in hardware (Xilinx Zynq FPGA), and am unable to figure out a few curious issues. Most notably, I am finding that my implementation only works for certain base/exponent/modulus combinations, but have not found any reason why this is the case.
Note: I am implementing the algorithm using Xilinx HLS (essentially C code that is synthesized into hardware). For the sake of this post, treat it just like a standard C implementation, except that I can have variables up to 4096 bits wide. I haven't yet parallelized it, so it should behave just like standard C code.
The Problem
My problem is that I am able to get the correct answer for certain modular exponentiation test problems, but only if the values for the base, exponent, and modulus can be written in much fewer bits than the actual 1024 bit operand width (i.e. they are zero padded).
When I use actual 1024-bit values generated from SSH-keygen, I no longer get the correct results.
For example, if my input arguments are
uint1024_t base = 1570
uint1024_t exponent = 1019
uint1024_t modulus = 3337
I correctly get a result of 1570^1019 mod 3337 = 688
However, when I actually use values that occupy all (or approximately all) 1024 bits for the inputs...
uint1024_t base = 0x00be5416af9696937b7234421f7256f78dba8001c80a5fdecdb4ed761f2b7f955946ec920399f23ce9627f66286239d3f20e7a46df185946c6c8482e227b9ce172dd518202381706ed0f91b53c5436f233dec27e8cb46c4478f0398d2c254021a7c21596b30f77e9886e2fd2a081cadd3faf83c86bfdd6e9daad12559f8d2747
uint1024_t exponent = 0x6f1e6ab386677cdc86a18f24f42073b328847724fbbd293eee9cdec29ac4dfe953a4256d7e6b9abee426db3b4ddc367a9fcf68ff168a7000d3a7fa8b9d9064ef4f271865045925660fab620fad0aeb58f946e33bdff6968f4c29ac62bd08cf53cb8be2116f2c339465a64fd02517f2bafca72c9f3ca5bbf96b24c1345eb936d1
uint1024_t modulus = 0xb4d92132b03210f62e52129ae31ef25e03c2dd734a7235efd36bad80c28885f3a9ee1ab626c30072bb3fd9906bf89a259ffd9d5fd75f87a30d75178b9579b257b5dca13ca7546866ad9f2db0072d59335fb128b7295412dd5c43df2c4f2d2f9c1d59d2bb444e6dac1d9cef27190a97aae7030c5c004c5aea3cf99afe89b86d6d
I incorrectly get a massive number, rather than the correct answer of 29 (0x1D)
I've checked both algorithms a million times over, and have experimented with different initial values and loop bounds, but nothing seems to work.
My Implementation
I am using the standard square and multiply method for the modular exponentiation, and I chose to use the Tenca-Koc radix-2 algorithm for the montgomery multiplication, detailed in pseudocode below...
/* Tenca-Koc radix2 montgomery multiplication */
Z = 0
for i = 0 to n-1
Z = Z + X[i]*Y
if Z is odd then Z = Z + M
Z = Z/2 // right shift by one bit in radix 2
if (Z >= M) then Z = Z - M
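For readers who want to check this recurrence on small numbers before debugging a hardware version, here is a plain-C toy rendering of the same radix-2 loop (my sketch, not the poster's HLS code). It computes X*Y*2^(-n) mod M for odd M, so shifting the result left by n (mod M) recovers the ordinary product.

#include <stdio.h>
#include <stdint.h>

/* toy radix-2 Montgomery multiplication: returns X*Y*2^(-n) mod M, M odd */
static uint32_t montmult32 (uint32_t X, uint32_t Y, uint32_t M, int n)
{
    uint64_t Z = 0;
    for (int i = 0; i < n; i++) {
        if ((X >> i) & 1) Z += Y;   /* Z = Z + X[i]*Y */
        if (Z & 1) Z += M;          /* make Z even */
        Z >>= 1;                    /* divide by 2 */
    }
    if (Z >= M) Z -= M;
    return (uint32_t)Z;
}

int main (void)
{
    uint32_t X = 1570, Y = 688, M = 3337;
    int n = 12;                     /* 12 bits cover operands below 4096 */
    uint32_t t = montmult32 (X, Y, M, n);
    uint32_t undone = (uint32_t)(((uint64_t)t << n) % M);  /* undo the 2^(-n) factor */
    printf ("montgomery product = %u, undone = %u, plain X*Y mod M = %u\n",
            t, undone, (uint32_t)(((uint64_t)X * Y) % M));
    return 0;
}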
My Montgomery multiplication implementation is as follows:
void montMult(uint1024_t X, uint1024_t Y, uint1024_t M, uint1024_t* outData)
{
ap_uint<2*NUM_BITS> S = 0;
for (int i=0; i<NUM_BITS; i++)
{
// add product of X.get_bit(i) and Y to partial sum
S += X[i]*Y;
// if S is odd, add modulus to make the partial sum even
if (S.test(0))
S += M;
// rightshift 1 bit (divide by 2)
S = S >> 1;
}
// bring back to under 1024 bits by subtracting modulus
if (S >= M)
S -= M;
// write output data
*outData = S.range(NUM_BITS-1,0);
}
and my top-level modular exponentiation is as follows, where (switching notation!) ...
// k: number of bits
// r = 2^k (radix)
// M: base
// e: exponent
// n: modulus
// Mbar: (precomputed residue) M*r mod(n)
// xbar: (precomputed initial residue) 1*r mod(n)
void ModExp(uint1024_t M, uint1024_t e, uint1024_t n,
uint1024_t Mbar, uint1024_t xbar, uint1024_t* out)
{
for (int i=NUM_BITS-1; i>=0; i--)
{
// square
montMult(xbar,xbar,n,&xbar);
// multiply
if (e.test(i)) // if (e.bit(i) == 1)
montMult(Mbar,xbar,n,&xbar);
}
// undo montgomery residue transformation
montMult(xbar,1,n,out);
}
I can't for the life of me figure out why this works for everything except an actual 1024-bit value. Any help would be much appreciated.
I've replaced my answer because I was wrong. Your original code is perfectly correct. I've tested it using my own BigInteger library, which includes Montgomery arithmetic, and everything works like a charm. Here is my code:
const
base1 =
'0x00be5416af9696937b7234421f7256f78dba8001c80a5fdecdb4ed761f2b7f955946ec9203'+
'99f23ce9627f66286239d3f20e7a46df185946c6c8482e227b9ce172dd518202381706ed0f91'+
'b53c5436f233dec27e8cb46c4478f0398d2c254021a7c21596b30f77e9886e2fd2a081cadd3f'+
'af83c86bfdd6e9daad12559f8d2747';
exponent1 =
'0x6f1e6ab386677cdc86a18f24f42073b328847724fbbd293eee9cdec29ac4dfe953a4256d7e'+
'6b9abee426db3b4ddc367a9fcf68ff168a7000d3a7fa8b9d9064ef4f271865045925660fab62'+
'0fad0aeb58f946e33bdff6968f4c29ac62bd08cf53cb8be2116f2c339465a64fd02517f2bafc'+
'a72c9f3ca5bbf96b24c1345eb936d1';
modulus1 =
'0xb4d92132b03210f62e52129ae31ef25e03c2dd734a7235efd36bad80c28885f3a9ee1ab626'+
'c30072bb3fd9906bf89a259ffd9d5fd75f87a30d75178b9579b257b5dca13ca7546866ad9f2d'+
'b0072d59335fb128b7295412dd5c43df2c4f2d2f9c1d59d2bb444e6dac1d9cef27190a97aae7'+
'030c5c004c5aea3cf99afe89b86d6d';
function MontMult(X, Y, N: BigInteger): BigInteger;
var
I: Integer;
begin
Result:= 0;
for I:= 0 to 1023 do begin
if not X.IsEven then Result:= Result + Y;
if not Result.IsEven then Result:= Result + N;
Result:= Result shr 1;
X:= X shr 1;
end;
if Result >= N then Result:= Result - N;
end;
function ModExp(B, E, N: BigInteger): BigInteger;
var
R, MontB: BigInteger;
I: Integer;
begin
R:= BigInteger.PowerOfTwo(1024) mod N;
MontB:= (B * R) mod N;
for I:= 1023 downto 0 do begin
R:= MontMult(R, R, N);
if not (E shr I).IsEven then
R:= MontMult(MontB, R, N);
end;
Result:= MontMult(R, 1, N);
end;
procedure TestMontMult;
var
Base, Expo, Modulus: BigInteger;
MontBase, MontExpo: BigInteger;
X, Y, R: BigInteger;
Mont: TMont;
begin
// convert to BigInteger
Base:= BigInteger.Parse(base1);
Expo:= BigInteger.Parse(exponent1);
Modulus:= BigInteger.Parse(modulus1);
R:= BigInteger.PowerOfTwo(1024) mod Modulus;
// Convert into Montgomery form
MontBase:= (Base * R) mod Modulus;
MontExpo:= (Expo * R) mod Modulus;
Writeln;
// MontMult test, all 3 versions output
// '0x146005377258684F3FFD8D9A70D723BDD3A2E3A160E11B7AD35A7106D4D903AB9D14A9201'+
// 'D0907CE2FC2E04A69656C38CE64AA0BADF2376AEFB19D8732CE2B3650466E31BB78CF24F4E3'+
// '774A78575738B668DA0E40C8DDDA972CE101E0CADC5D4CCFF6EF2E4E97AF02F34E3AB7258A7'+
// '323E472FC051825FFC72ADC53B0DAF3C4';
Writeln('Using MontMult');
Writeln(MontMult(MontMult(MontBase, MontExpo, Modulus), 1, Modulus).ToHexString);
// same using TMont instance
Writeln('Using TMont.Multiply');
Mont:= TMont.GetInstance(Modulus);
Writeln(Mont.Reduce(Mont.Multiply(MontBase, MontExpo)).ToHexString);
Writeln('Using TMont.ModMul');
Writeln(Mont.ModMul(Base,Expo).ToHexString);
// ModExp test, all 3 versions output 29
Writeln('Using ModExp');
Writeln(ModExp(Base, Expo, Modulus).ToString);
Writeln('Using BigInteger.ModPow');
Writeln(BigInteger.ModPow(Base, Expo, Modulus).ToString);
Writeln('Using TMont.ModPow');
Writeln(Mont.ModPow(Base, Expo).ToString);
end;
Update: I was finally able to fix the issue after I ported my design to Java to check my intermediate values in the debugger. The design ran flawlessly in Java with no modifications to the code structure, and this tipped me off as to what was going wrong.
The problem came to light after getting correct intermediate values using the Java BigInteger package. The HLS arbitrary-precision library has a fixed bit width (obviously, since it synthesizes down to hardware), whereas the software BigInteger libraries have flexible bit widths. It turns out that the addition operator treats both arguments as signed values if they have different bit widths, despite the fact that I declared them as unsigned. Thus, when there was a 1 in the MSB of an intermediate value and I tried to add it to a wider value, it treated the MSB as a sign bit and attempted to sign-extend it.
This did not happen with the Java BigInt library, which quickly pointed me towards the problem.
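As a loose plain-C analogy of the width-mixing hazard described above (my illustration, not the exact semantics of the HLS library), the snippet below shows how a set MSB turns into sign extension when a narrower quantity is reinterpreted as signed before being widened, on typical two's complement platforms:

#include <stdio.h>
#include <stdint.h>

int main (void)
{
    uint32_t narrow = 0x80000000u;           /* MSB set */
    int64_t as_signed = (int32_t)narrow;     /* widened via a signed type: sign-extends */
    int64_t as_unsigned = narrow;            /* widened as unsigned: zero-extends */
    printf ("%llx vs %llx\n",
            (unsigned long long)as_signed, (unsigned long long)as_unsigned);
    return 0;
}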
If anyone is interested in a Java implementation of modular exponentiation using the Tenca-Koc radix2 algorithm for montgomery multiplication, you can find the code here: https://github.com/bigbrett/MontModExp-radix2

Power of 2 approximation in fixed point

Currently, I am using a small lookup table and linear interpolation which is quite fast and also accurate enough (max error is less than 0.001). However I was wondering if there is an approximation which is even faster.
Since the integer part of the exponent can be extracted and handled by bit shifts, the approximation just needs to work in the range [-1, 1].
I have tried to find a Chebyshev polynomial, but could not achieve good accuracy for polynomials of low order. I could live with a max error around 0.01, I guess, but I did not get near that number. Higher-order polynomials are not an option, since they are much less efficient than my current lookup-table-based solution.
Since no specific fixed-point format was stated, I will demonstrate a possible alternative to table lookup using s15.16 fixed-point arithmetic, which is fairly commonly used. The basic idea is to split the input a into an integral portion i and a fractional portion f, such that f in [-0.5,0.5], then use a minimax polynomial approximation for exp2(f) on [-0.5, 0.5] and perform final scaling based on i.
Minimax approximations can be generated with tools such as Mathematica, Maple, or Sollya. If none of these tools are available, one could use a custom implementation of the Remez algorithm to generate minimax approximations.
The Horner scheme should be used to evaluate the polynomial. Since fixed-point arithmetic is used, the evaluation of the polynomial should scale operands to the maximum extent possible (i.e. without overflow) in intermediate steps to optimize the accuracy of the computation.
The C code below assumes that right shifts applied to signed integer data types result in arithmetic shift operations, and therefore negative operands are shifted appropriately. This is not guaranteed by the ISO C standard, but in my experience it will work fine with various tool chains. In the worst case, inline assembly could be used to force generation of the desired arithmetic right shift instructions.
The output of the test included with the fixed_exp2() implementation below should look as follows:
testing fixed_exp2 with inputs in [-5.96484, 15)
max. rel. err = 0.000999758
This demonstrates that the desired error bound of 0.001 is met for inputs in the interval [-5.96484, 15).
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <math.h>
/* compute exp2(a) in s15.16 fixed-point arithmetic, -16 < a < 15 */
int32_t fixed_exp2 (int32_t a)
{
int32_t i, f, r, s;
/* split a = i + f, such that f in [-0.5, 0.5] */
i = (a + 0x8000) & ~0xffff; // 0.5
f = a - i;
s = ((15 << 16) - i) >> 16;
/* minimax approximation for exp2(f) on [-0.5, 0.5] */
r = 0x00000e20; // 5.5171669058037949e-2
r = (r * f + 0x3e1cc333) >> 17; // 2.4261112219321804e-1
r = (r * f + 0x58bd46a6) >> 16; // 6.9326098546062365e-1
r = r * f + 0x7ffde4a3; // 9.9992807353939517e-1
return (uint32_t)r >> s;
}
double fixed_to_float (int32_t a)
{
return a / 65536.0;
}
int main (void)
{
double a, res, ref, err, maxerr = 0.0;
int32_t x, start, end;
start = 0xfffa0900;
end = 0x000f0000;
printf ("testing fixed_exp2 with inputs in [%g, %g)\n",
fixed_to_float (start), fixed_to_float (end));
for (x = start; x < end; x++) {
a = fixed_to_float (x);
ref = exp2 (a);
res = fixed_to_float (fixed_exp2 (x));
err = fabs (res - ref) / ref;
if (err > maxerr) {
maxerr = err;
}
}
printf ("max. rel. err = %g\n", maxerr);
return EXIT_SUCCESS;
}

Generating Doubles With XORShift Generator

So I am using the Wikipedia entry on xorshift generators to make a PRNG. My code is as follows.
uint32_t xor128(void) {
static uint32_t x = 123456789;
static uint32_t y = 362436069;
static uint32_t z = 521288629;
static uint32_t w = 88675123;
uint32_t t;
t = x ^ (x << 11);
x = y; y = z; z = w;
return w = w ^ (w >> 19) ^ t ^ (t >> 8);
}
My question is, how can I use this to generate doubles in the range [0, 1)?
Thanks for any help.
Just divide the returned uint32_t by the maximum uint32_t (cast as a double). This does have an approximately one in four-billion chance of being 1, though. You could put in a test for the maximum and discard it if you wish.
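A literal sketch of that suggestion (my code, not the answerer's) might look like this; UINT32_MAX comes from <stdint.h>, and xor128() is the generator defined in the question.

#include <stdint.h>

/* divide by the maximum uint32_t and discard the single value that maps to 1.0 */
double xor128_unit (void)
{
    uint32_t r;
    do {
        r = xor128 ();
    } while (r == UINT32_MAX);      /* reject so the result stays in [0, 1) */
    return r / (double)UINT32_MAX;
}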
Assuming you want a uniform distribution, and aren't too picky about randomising all of the bits for extremely small numbers:
double xor128d(void) {
return xor128() / 4294967296.0;
}
Since xor128() cannot return 4294967296, the result cannot be exactly 1.0 -- however, if you returned a float, it might still be rounded up to 1.0f.
If you try to add more bits to fill the whole mantissa then you'll face the same rounding headache for doubles.
Do you want the whole mantissa randomised for all possible values? That's a little harder.
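One common way to obtain all 53 mantissa bits (my sketch, building on the xor128() above, not part of the original answer) is to splice two 32-bit outputs into a 53-bit integer and scale by 2^-53. Note that this still does not randomise the extra low-order bits that very small results could carry, which is the harder problem alluded to above.

#include <stdint.h>

/* 53 uniformly random bits, scaled into [0, 1) */
double xor128d53 (void)
{
    uint64_t hi = xor128 ();                    /* 32 random bits */
    uint64_t lo = xor128 ();                    /* 32 more random bits */
    uint64_t bits = (hi << 21) | (lo >> 11);    /* keep 53 bits total */
    return bits * (1.0 / 9007199254740992.0);   /* scale by 2^-53 */
}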

Code or compiler: optimizing a IIR filter in C for the iPhone 4 and later

I've been profiling my almost-finished project and I'm seeing that about three-quarters of the CPU time is spent in this IIR filter function (which is currently called hundreds of thousands of times in about a second on the target hardware), so, with everything else working well, I am wondering whether it can be optimized for my specific hardware and software targets. My targets are only iPhone 4 and newer, only iOS 4.3 and newer, and only LLVM 4.x. A little bit of imprecision is probably OK if there are gains to be made.
static float filter(const float a, const float *b, const float c, float *d, const int e, const float x)
{
float return_value = 0;
d[0] = x;
d[1] = c * d[0] + a * d[1];
int j;
for (j = 2; j <= e; j++) {
return_value += (d[j] += a * (d[j + 1] - d[j - 1])) * b[j];
}
for (j = e + 1; j > 1; j--) {
d[j] = d[j - 1];
}
return (return_value);
}
Any suggestions about speeding it up are appreciated; I am also interested in your opinion on whether it is possible to optimize beyond the default compiler optimization at all. I am wondering whether this is something where NEON SIMD would help (that is new ground for me), whether VFP can be exploited, or whether LLVM autovectorization would help.
I've tried the following LLVM flags:
-ffast-math (didn't make a notable difference)
-O4 (made a big difference on the iPhone 4S with a 25% reduction in time, but no notable difference on my minimum target device the iPhone 4, improvement of which is my main goal)
-O3 -mllvm -unroll-allow-partial -mllvm -unroll-runtime -funsafe-math-optimizations -ffast-math -mllvm -vectorize -mllvm -bb-vectorize-aligned-only (LLVM autovectorization flags from Hal Finkel's slides here: http://llvm.org/devmtg/2012-04-12/Slides/Hal_Finkel.pdf, made things slower than the default LLVM optimization for an Xcode release target)
Open to other flags, different approaches, and changes to the function. I'd prefer to leave the input and return types and values alone. There is actually a discussion of using NEON intrinsic functions for FIR here: https://pixhawk.ethz.ch/_media/software/optimization/neon_support_in_the_arm_compiler.pdf but I don't have quite enough experience with its subject to successfully apply the information to my own case. Thank you for any clarification.
EDIT: My apologies for not noting this sooner. After investigating aka.nice's suggestion, I noticed that the values passed in for e, a, and c are always the same, and I know them before runtime, so approaches incorporating this information are an option.
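Given that, one low-effort experiment (my sketch, not taken from any answer; the constant values below are placeholders for the real parameters) is to bake e, a, and c into the translation unit as compile-time constants, so the optimizer can unroll and schedule the loops for the exact filter length:

/* placeholders: substitute the actual, known-in-advance filter parameters */
#define FILTER_E 8
#define FILTER_A 0.5f
#define FILTER_C 1.0f

static float filter_fixed (const float *b, float *d, const float x)
{
    float return_value = 0;
    int j;
    d[0] = x;
    d[1] = FILTER_C * d[0] + FILTER_A * d[1];
    for (j = 2; j <= FILTER_E; j++) {
        return_value += (d[j] += FILTER_A * (d[j + 1] - d[j - 1])) * b[j];
    }
    for (j = FILTER_E + 1; j > 1; j--) {
        d[j] = d[j - 1];
    }
    return return_value;
}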
Here are some transformations that could be made on the code to use vDSP routines. These transformations make use of various temporary buffers named T0, T1, and T2. Each of these is an array of float with enough space for e-1 elements.
First, use a temporary buffer to compute a * b[j]. This changes the original code:
for (j = 2; j <= e; j++) {
return_value += (d[j] += a * (d[j + 1] - d[j - 1])) * b[j];
}
to:
vDSP_vsmul(b+2, 1, &a, T0, 1, e-1);
for (j = 2; j <= e; j++)
return_value += (d[j] += (d[j+1] - d[j-1])) * T0[j-2];
Then use vDSP_vmul to compute d[j+1] * T0[j-2]:
vDSP_vsmul(b+2, 1, &a, T0, 1, e-1);
vDSP_vmul(d+3, 1, T0, 1, T1, 1, e-1);
for (j = 2; j <= e; j++)
return_value += (d[j] += T1[j-2] - d[j-1] * T0[j-2]);
Next, promote vDSP_vmul to vDSP_vma (vector multiply add) to compute d[j] + d[j+1] * T0[j-2]:
vDSP_vsmul(b+2, 1, &a, T0, 1, e-1);
vDSP_vma(d+3, 1, T0, 1, d+2, 1, T1, 1, e-1);
for (j = 2; j <= e; j++)
return_value += (d[j] = T1[j-2] - d[j-1] * T0[j-2]);
I suppose I would time that and see if there is any improvement. There are some issues:
SIMD code works best when data is 16-byte aligned. The use of array indices such as j-1 and j+1 prevents this. The ARM processors in phones are not as bad with unaligned data as some other processors, but performance will vary from model to model.
If e is large (more than a few thousand), then T0 and d may be evicted from cache during the vDSP_vma operation, and the following loop will have to reload them. There is a technique called strip mining to reduce the effect of this. I will not detail it now, but, essentially, the operation is partitioned into smaller strips of the array.
The IIR in the final loop may still bottleneck the processor. There are routines in vDSP for performing some IIRs (such as vDSP_deq22), but it is not clear whether this filter can be expressed in a way that is a good enough match to a vDSP routine to gain more performance than might be lost by the transformation.
The summation in the final loop to calculate return_value could also be removed from the loop and replaced with a vDSP routine (likely vDSP_sve), but I suspect the slack caused by the IIR will permit the additions to be done without adding significant execution time to the loop.
The above is off the top of my head; I have not tested the code. I suggest making the transformations one-by-one so you can test the code after each change and identify any errors before going on.
If you can find a satisfactory filter that is not an IIR, more performance optimizations may be available.
I'd prefer to leave the input and return types and values alone…
Nevertheless, moving your rendering from float to integer would help considerably.
Localizing that change to the implementation you present won't be useful. But if you expand it to reimplement just the FIR as integer, it can quickly pay off (unless the sizes are guaranteed to always be incredibly small -- then conversion/move times cost more). Of course, moving larger portions of the render graph to integer will introduce larger gains and require even fewer conversions.
Another consideration would be to look at the utilities in Accelerate.framework (potentially saving you from writing your own asm).
I tried this little exercise to rewrite your filter with the delay operator z.
For example, for e=4, I renamed the input u and the output y:
d1*z= u
d2*z= c*u + a*d1
d3*z= d2 + a*(d3-d1*z)
d4*z= d3 + a*(d4-d2*z)
d5*z= d4 + a*(d5-d3*z)
y = (b2*d3*z + b3*d4*z + b4*d5*z)
Note that the di are the filter states.
d3*z is the next value of d3 (it appears to be variable d2 in your code)
You can then eliminate the di to write the transfer function y/u in z.
You will then find that a minimal representation requires only e states by factoring/simplifying the above transfer function.
Denominator is z*(z-a)^3, that is a pole at 0, and another at a with multiplicity (e-1).
You can then put your filter in a standard state space matrix representation:
z*X = A*X + B*u
y = C*X + d*u
With the particular form of the poles, you can decompose the transfer function into a partial fraction expansion and obtain the matrices A & B in this special form (MATLAB-like notation):
A = [0 1 0 0;     B = [0;
     0 a 1 0;          0;
     0 0 a 1;          0;
     0 0 0 a]          1]
C & d are a bit less easy though...
They are extracted from the numerators and the direct term of the partial fraction expansion.
They are polynomials in bi, c (degree 1) and a (degree e).
For e=4, I have
C=[ a^3*b2 - a^2*b3 + a*b4 ,
-a^2*b2 + a*b3 + (c-a^2)*b4 ,
a*b2 + (c-a^2)*b3 + (-2*a^2*c-2*a^2-a+a^4)*b4 ,
(c-a^2)*b2 + (-a^2-a^2*c-a)*b3 + (-2*a*c+2*a^3)*b4 ]
d= -a*b2 - a*c*b3 + a^2*b4
If you can find the recurrence in e governing C & d, and precompute them
then the filter can be reduced to those simple vector ops:
z*X = a*[ 0; x2 ; x3 ; x4 ... xe ] + [x2 ; x3 ; x4 ... xe ; u ];
Y = C*[ x1 ; x2 ; x3 ; x4 ... xe ] + d*u
Or expressed as a function (Xnext,y)=filter(X,u,a,C,d,e) pseudo code:
y = dot_product( C , X) + d*u; // (like BLAS _DOT)
Xnext(1:e-1) = X(2:e); // this is a memcopy (like BLAS _COPY)
Xnext(e)=u;
X(1)=0;
Xnext=a*X+Xnext; // this is an inplace vector muladd (like BLAS _AXPY)
X=Xnext; // another memcopy outside the function (can be moved inside).
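A plain-C rendering of that pseudo code might look like the sketch below (my addition, untested against the original filter; C_coef, d_term, and the state array X must be precomputed as described above for the given e, a, bi, and c):

/* one sample of the reduced filter: y = C*X + d*u, then X <- A*X + B*u
   with the banded A and the unit-vector B shown above */
static float filter_ss (const float *C_coef, float d_term, float a,
                        float *X, int e, float u)
{
    float y = d_term * u;
    int i;
    for (i = 0; i < e; i++)          /* dot product C*X (like BLAS _DOT) */
        y += C_coef[i] * X[i];
    /* in-place state update: Xnext[0] = X[1], Xnext[i] = a*X[i] + X[i+1],
       Xnext[e-1] = a*X[e-1] + u (the shift plus AXPY of the pseudo code) */
    X[0] = X[1];
    for (i = 1; i < e - 1; i++)
        X[i] = a * X[i] + X[i + 1];
    X[e - 1] = a * X[e - 1] + u;
    return y;
}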
Note that if you use BLAS functions, your code will be portable to many hardware platforms, not just Apple-centric ones, and I guess the performance won't be much different.
EDIT: about partial fraction expansion
The pure partial fraction expansion would give a diagonal state space representation, and a matrix B full of 1s. This can be an interesting variant too. (filters in parallel)
My variant used above is more like a cascade or ladder (filters in series).