Code or compiler: optimizing an IIR filter in C for the iPhone 4 and later

I've been profiling my almost-finished project, and I'm seeing that about three-quarters of the CPU time is spent in this IIR filter function (which is currently called hundreds of thousands of times per second on the target hardware), so with everything else working well I am wondering whether it can be optimized for my specific hardware and software target. My targets are only iPhone 4 and newer, only iOS 4.3 and newer, only LLVM 4.x. A little bit of imprecision is probably OK if there are gains to be made.
static float filter(const float a, const float *b, const float c, float *d, const int e, const float x)
{
    float return_value = 0;

    /* Feed the new input sample in and update the first state. */
    d[0] = x;
    d[1] = c * d[0] + a * d[1];

    /* Accumulate the weighted taps while updating the filter state. */
    int j;
    for (j = 2; j <= e; j++) {
        return_value += (d[j] += a * (d[j + 1] - d[j - 1])) * b[j];
    }

    /* Shift the delay line for the next call. */
    for (j = e + 1; j > 1; j--) {
        d[j] = d[j - 1];
    }

    return return_value;
}
Any suggestions about speeding it up are appreciated. I'm also interested in your opinion on whether it is possible to optimize beyond the default compiler optimization at all. I am wondering if this is something where NEON SIMD would help (that is new ground for me), or if VFP can be exploited, or if LLVM autovectorization would help.
I've tried the following LLVM flags:
-ffast-math (didn't make a notable difference)
-O4 (made a big difference on the iPhone 4S with a 25% reduction in time, but no notable difference on my minimum target device the iPhone 4, improvement of which is my main goal)
-O3 -mllvm -unroll-allow-partial -mllvm -unroll-runtime -funsafe-math-optimizations -ffast-math -mllvm -vectorize -mllvm -bb-vectorize-aligned-only (LLVM autovectorization flags from Hal Finkel's slides here: http://llvm.org/devmtg/2012-04-12/Slides/Hal_Finkel.pdf, made things slower than the default LLVM optimization for an Xcode release target)
Open to other flags, different approaches, and changes to the function. I'd prefer to leave the input and return types and values alone. There is actually a discussion of using NEON intrinsic functions for FIR here: https://pixhawk.ethz.ch/_media/software/optimization/neon_support_in_the_arm_compiler.pdf but I don't have quite enough experience with its subject to successfully apply the information to my own case. Thank you for any insight.
EDIT My apologies for not noting this sooner. After investigating aka.nice's suggestion, I noticed that the values passed in for e, a, and c are always the same and I know them before runtime, so approaches incorporating this information are an option (a sketch of one such specialization follows).
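For instance, here is an untested sketch of baking them in so the compiler can fully unroll and schedule the fixed-length loops; E_TAPS, A_COEF, and C_COEF are placeholder names standing in for my known values of e, a, and c:

/* Placeholders for the values that never change at runtime (hypothetical). */
#define E_TAPS 32
#define A_COEF 0.48f
#define C_COEF 0.52f

static float filter_fixed(const float *b, float *d, const float x)
{
    float return_value = 0;
    int j;
    d[0] = x;
    d[1] = C_COEF * d[0] + A_COEF * d[1];
    /* e is now a compile-time constant, so the compiler can fully unroll
       this loop and keep A_COEF in a register. */
    for (j = 2; j <= E_TAPS; j++) {
        return_value += (d[j] += A_COEF * (d[j + 1] - d[j - 1])) * b[j];
    }
    for (j = E_TAPS + 1; j > 1; j--) {
        d[j] = d[j - 1];
    }
    return return_value;
}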

Here are some transformations that could be made on the code to use vDSP routines. These transformations make use of various temporary buffers named T0, T1, and T2. Each of these is an array of float with enough space for e-1 elements.
First, use a temporary buffer to compute a * b[j]. This changes the original code:
for (j = 2; j <= e; j++) {
    return_value += (d[j] += a * (d[j + 1] - d[j - 1])) * b[j];
}
to:
vDSP_vsmul(b+2, 1, &a, T0, 1, e-1);
for (j = 2; j <= e; j++)
    return_value += (d[j] += (d[j+1] - d[j-1])) * T0[j-2];
Then use vDSP_vmul to compute d[j+1] * T0[j-2]:
vDSP_vsmul(b+2, 1, &a, T0, 1, e-1);
vDSP_vmul(d+3, 1, T0, 1, T1, 1, e-1);
for (j = 2; j <= e; j++)
    return_value += (d[j] += T1[j-2] - d[j-1] * T0[j-2]);
Next, promote vDSP_vmul to vDSP_vma (vector multiply add) to compute d[j] + d[j+1] * T0[j-2]:
vDSP_vsmul(b+2, 1, &a, T0, 1, e-1);
vDSP_vma(d+3, 1, T0, 1, d+2, 1, T1, 1, e-1);
for (j = 2; j <= e; j++)
    return_value += (d[j] = T1[j-2] - d[j-1] * T0[j-2]);
I suppose I would time that and see if there is any improvement. There are some issues:
SIMD code works best when data is 16-byte aligned. The use of array indices such as j-1 and j+1 prevents this. The ARM processors in phones are not as bad with unaligned data as some other processors, but performance will vary from model to model.
If e is large (more than a few thousand), then T0 and d may be evicted from cache during the vDSP_vma operation, and the following loop will have to reload them. There is a technique called strip mining to reduce the effect of this. I will not detail it now, but, essentially, the operation is partitioned into smaller strips of the array.
The IIR in the final loop may still bottleneck the processor. There are routines in vDSP for performing some IIRs (such as vDSP_deq22), but it is not clear whether this filter can be expressed in a way that is a good enough match to a vDSP routine to gain more performance than might be lost by the transformation.
The summation in the final loop to calculate return_value could also be removed from the loop and replaced with a vDSP routine (likely vDSP_sve; a sketch follows this list), but I suspect the slack caused by the IIR will permit the additions to be done without adding significant execution time to the loop.
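For illustration, an untested sketch of that vDSP_sve idea, applied to the final loop above. With the assignment form used there, each term added to return_value is exactly the value just stored in d[j], so the running sum can be dropped from the loop and d[2] through d[e] summed in one call afterward:

for (j = 2; j <= e; j++)
    d[j] = T1[j-2] - d[j-1] * T0[j-2];  /* keep only the IIR recurrence   */
/* vDSP_sve writes (not accumulates) the sum; that is fine here because
   return_value starts at zero and this loop is its only contribution.   */
vDSP_sve(d+2, 1, &return_value, e-1);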
The above is off the top of my head; I have not tested the code. I suggest making the transformations one-by-one so you can test the code after each change and identify any errors before going on.
If you can find a satisfactory filter that is not an IIR, more performance optimizations may be available.

I'd prefer to leave the input and return types and values alone…
Nevertheless, moving your rendering from float to integer would help considerably.
Localizing that change to the implementation you present won't be useful. But if you expand it to reimplement just the FIR as integer, it can quickly pay off (unless the sizes are guaranteed to always be incredibly small -- then conversion/move times cost more). Of course, moving larger portions of the render graph to integer will introduce larger gains and require even fewer conversions.
Another consideration would be to look at the utilities in Accelerate.framework (potentially saving you from writing your own asm).

I tried this little exercise of rewriting your filter with the delay operator z.
For example, for e=4, I renamed the input u and the output y:
d1*z= u
d2*z= c*u + a*d1
d3*z= d2 + a*(d3-d1*z)
d4*z= d3 + a*(d4-d2*z)
d5*z= d4 + a*(d5-d3*z)
y = (b2*d3*z + b3*d4*z + b4*d5*z)
Note that the di are the filter states.
d3*z is the next value of d3 (it appears to be variable d2 in your code)
You can then eliminate the di to write the transfer function y/u in z.
You will then find, by factoring/simplifying the above transfer function, that a minimal representation requires only e states.
The denominator is z*(z-a)^3, that is, a pole at 0 and another at a with multiplicity (e-1).
You can then put your filter in a standard state space matrix representation:
z*X = A*X + B*u
y = C*X + d*u
With the particular form of poles, you can decompose the transfer in partial fraction expansion and obtain the matrices A & B in this special form (matlab like notations)
A = [0 1 0 0;        B = [0;
     0 a 1 0;             0;
     0 0 a 1;             0;
     0 0 0 a]             1]
C & d are a bit less easy though...
They are extracted from the numerators and the direct term of the partial fraction expansion.
They are polynomials in the bi and c (degree 1) and in a (degree e).
For e=4, I have
C = [ a^3*b2 - a^2*b3 + a*b4 ,
      -a^2*b2 + a*b3 + (c-a^2)*b4 ,
      a*b2 + (c-a^2)*b3 + (-2*a^2*c-2*a^2-a+a^4)*b4 ,
      (c-a^2)*b2 + (-a^2-a^2*c-a)*b3 + (-2*a*c+2*a^3)*b4 ]
d = -a*b2 - a*c*b3 + a^2*b4
If you can find the recurrence in e governing C & d, and precompute them
then the filter can be reduced to those simple vector ops:
z*X = a*[ 0; x2 ; x3 ; x4 ... xe ] + [x2 ; x3 ; x4 ... xe ; u ];
Y = C*[ x1 ; x2 ; x3 ; x4 ... xe ] + d*u
Or, expressed as a function (Xnext, y) = filter(X, u, a, C, d, e), in pseudo code:
y = dot_product( C , X) + d*u; // (like BLAS _DOT)
Xnext(1:e-1) = X(2:e); // this is a memcopy (like BLAS _COPY)
Xnext(e)=u;
X(1)=0;
Xnext=a*X+Xnext; // this is an in-place vector multiply-add (like BLAS _AXPY)
X=Xnext; // another memcopy outside the function (can be moved inside).
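To make the pseudo code concrete, here is a minimal untested C sketch using the CBLAS interface (available through Accelerate on iOS). Cvec, dterm, and Xnext are hypothetical names for the precomputed C vector, the direct term d, and a caller-provided scratch buffer of e floats:

#include <string.h>
#include <Accelerate/Accelerate.h> /* provides cblas_sdot/cblas_saxpy on iOS */

/* One step of (Xnext, y) = filter(X, u, a, C, d, e); returns y. */
static float filter_step(float *X, float *Xnext, const float u, const float a,
                         const float *Cvec, const float dterm, const int e)
{
    float y = cblas_sdot(e, Cvec, 1, X, 1) + dterm * u; /* y = C.X + d*u     */
    memcpy(Xnext, X + 1, (e - 1) * sizeof(float));      /* Xnext(1:e-1)=X(2:e) */
    Xnext[e - 1] = u;                                   /* Xnext(e) = u      */
    X[0] = 0.0f;                                        /* X(1) = 0          */
    cblas_saxpy(e, a, X, 1, Xnext, 1);                  /* Xnext = a*X+Xnext */
    memcpy(X, Xnext, e * sizeof(float));                /* X = Xnext         */
    return y;
}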
Note that if you use BLAS functions, your code will be portable to many platforms, not just Apple-centric ones, and I guess the performance won't be much different.
EDIT: about partial fraction expansion
The pure partial fraction expansion would give a diagonal state space representation and a matrix B full of 1s. This can be an interesting variant too (filters in parallel).
The variant I used above is more like a cascade or ladder (filters in series).


Does unrolling a loop affect the accuracy of the computations within?

Summarized question: Does unrolling a loop affect the accuracy of the computations performed within the loop? And if so, why?
Elaboration and background: I am writing a compute shader using HLSL for use in a Unity project (2021.2.9f1). Parts of my code include numerical procedures and highly oscillatory functions, meaning that high computational accuracy is essential.
When comparing my results with an equivalent procedure in Python, I noticed some deviations on the order of 1e-5. This was concerning, as I did not expect such large errors to be the result of precision differences, e.g., the float precision in trigonometric or power functions in HLSL.
Ultimately, after much debugging, I now believe the choice of unrolling or not unrolling a loop to be the cause of the deviation. However, I do find this strange, as I cannot find any sources indicating that unrolling a loop affects accuracy in addition to the usual "space-time tradeoff".
For clarification: if I consider my Python results as the correct solution, unrolling the loop in HLSL gives me better results than not unrolling it.
Minimal working example: Below is an MWE consisting of a C# script for Unity, the corresponding compute shader where the computations are performed, and a screenshot of my console when running in Unity (2021.2.9f1). Forgive me for a somewhat messy implementation of Newton's method, but I chose to keep it since I believe it might be a cause of this deviation. That is, if I simply compute cos(x), then there is no difference between the unrolled and the not-unrolled versions. Nonetheless, I still fail to understand how the simple addition of [unroll(N)] in the testing kernel changes the result...
// C# for Unity
using UnityEngine;

public class UnrollTest : MonoBehaviour
{
    [SerializeField] ComputeShader CS;
    ComputeBuffer CBUnrolled, CBNotUnrolled;
    readonly int N = 3;

    private void Start()
    {
        CBUnrolled = new ComputeBuffer(N, sizeof(double));
        CBNotUnrolled = new ComputeBuffer(N, sizeof(double));

        CS.SetBuffer(0, "_CBUnrolled", CBUnrolled);
        CS.SetBuffer(0, "_CBNotUnrolled", CBNotUnrolled);
        CS.Dispatch(0, (int)((N + (64 - 1)) / 64), 1, 1);

        double[] ansUnrolled = new double[N];
        double[] ansNotUnrolled = new double[N];
        CBUnrolled.GetData(ansUnrolled);
        CBNotUnrolled.GetData(ansNotUnrolled);

        for (int i = 0; i < N; i++)
        {
            Debug.Log("Unrolled ans = " + ansUnrolled[i] +
                      " - Not Unrolled ans = " + ansNotUnrolled[i] +
                      " -- Difference is: " + (ansUnrolled[i] - ansNotUnrolled[i]));
        }

        CBUnrolled.Release();
        CBNotUnrolled.Release();
    }
}
#pragma kernel CSMain

RWStructuredBuffer<double> _CBUnrolled, _CBNotUnrolled;

// Dummy function for Newton's method.
double fDummy(double k, double fnh, double h, double theta)
{
    return fnh * fnh * k * h * cos(theta) * cos(theta) - (double) tanh(k * h);
}

// Derivative of the dummy function above, using a central finite difference scheme.
double dfDummy(double k, double fnh, double h, double theta)
{
    return (fDummy(k + (double) 1e-3, fnh, h, theta) -
            fDummy(k - (double) 1e-3, fnh, h, theta)) / (double) 2e-3;
}

// Function to solve.
double f(double fnh, double h, double theta)
{
    // Solved using Newton's method.
    int max_iter = 50;
    double epsilon = 1e-8;
    double fxn, dfxn;

    // Define initial guess for k, hereby denoted as x.
    double xn = 10.0;

    for (int n = 0; n < max_iter; n++)
    {
        fxn = fDummy(xn, fnh, h, theta);
        if (abs(fxn) < epsilon) // A solution is found.
            return xn;
        dfxn = dfDummy(xn, fnh, h, theta);
        if (dfxn == 0.0)        // No solution found.
            return xn;
        xn = xn - fxn / dfxn;
    }

    // No solution found.
    return xn;
}

[numthreads(64,1,1)]
void CSMain(uint3 threadID : SV_DispatchThreadID)
{
    int N = 3;

    // ---------------
    double fnh = 0.9, h = 4.53052, theta = -0.161, dtheta = 0.01; // Example values.
    for (int i = 0; i < N; i++) // Not being unrolled
    {
        _CBNotUnrolled[i] = f(fnh, h, theta);
        theta += dtheta;
    }

    // ---------------
    fnh = 0.9, h = 4.53052, theta = -0.161, dtheta = 0.01; // Example values.
    [unroll(N)] for (int j = 0; j < N; j++) // Being unrolled.
    {
        _CBUnrolled[j] = f(fnh, h, theta);
        theta += dtheta;
    }
}
Image of Unity console when running the above
Edit After some more testing, the deviation has been narrowed down to the following code, giving a difference of about 1e-17 between the exact same code unrolled vs not unrolled. Despite the small difference, I still consider it a valid example of the issue, as I believe they should be equal.
[numthreads(64, 1, 1)]
void CSMain(uint3 threadID : SV_DispatchThreadID)
{
    if ((int) threadID.x != 1)
        return;

    int N = 3;
    double k = 1.0;

    // ---------------
    double fnh = 0.9, h = 4.53052, theta = -0.161, dtheta = 0.01; // Example values.
    for (int i = 0; i < N; i++) // Not being unrolled
    {
        _CBNotUnrolled[i] = (k + (double) 1e-3) * theta - (k - (double) 1e-3) * theta;
        theta += dtheta;
    }

    // ---------------
    fnh = 0.9, h = 4.53052, theta = -0.161, dtheta = 0.01; // Example values.
    [unroll(N)]
    for (int j = 0; j < N; j++) // Being unrolled.
    {
        _CBUnrolled[j] = (k + (double) 1e-3) * theta - (k - (double) 1e-3) * theta;
        theta += dtheta;
    }
}
Image of Unity console when running the edited script above
Edit 2 The following is the compiled code for the kernel given in Edit 1. Unfortunately, my experience with assembly is limited, and I cannot tell whether this listing shows any errors, or whether it is useful to the problem at hand.
**** Platform Direct3D 11:
Compiled code for kernel CSMain
keywords: <none>
binary blob size 648:
//
// Generated by Microsoft (R) D3D Shader Disassembler
//
//
// Note: shader requires additional functionality:
// Double-precision floating point
//
//
// Input signature:
//
// Name Index Mask Register SysValue Format Used
// -------------------- ----- ------ -------- -------- ------- ------
// no Input
//
// Output signature:
//
// Name Index Mask Register SysValue Format Used
// -------------------- ----- ------ -------- -------- ------- ------
// no Output
cs_5_0
dcl_globalFlags refactoringAllowed | enableDoublePrecisionFloatOps
dcl_uav_structured u0, 8
dcl_uav_structured u1, 8
dcl_input vThreadID.x
dcl_temps 2
dcl_thread_group 64, 1, 1
0: ine r0.x, vThreadID.x, l(1)
1: if_nz r0.x
2: ret
3: endif
4: dmov r0.xy, d(-0.161000l, 0.000000l)
5: mov r0.z, l(0)
6: loop
7: ige r0.w, r0.z, l(3)
8: breakc_nz r0.w
9: dmul r1.xyzw, r0.xyxy, d(1.001000l, 0.999000l)
10: dadd r1.xy, -r1.zwzw, r1.xyxy
11: store_structured u1.xy, r0.z, l(0), r1.xyxx
12: dadd r0.xy, r0.xyxy, d(0.010000l, 0.000000l)
13: iadd r0.z, r0.z, l(1)
14: endloop
15: store_structured u0.xy, l(0), l(0), l(-0.000000,-0.707432,0,0)
16: store_structured u0.xy, l(1), l(0), l(0.000000,-0.702312,0,0)
17: store_structured u0.xy, l(2), l(0), l(-918250586112.000000,-0.697192,0,0)
18: ret
// Approximately 0 instruction slots used
Edit 3 After reaching out to Microsoft (see https://learn.microsoft.com/en-us/an...nrolling-a-loop-affect-the-accuracy-of-t.html), they stated that the problem is more about Unity, because
"The pragma unroll [(n)] is keil compiler which Unity uses topic"
This is driver-, hardware-, compiler-, and Unity-dependent.
In essence, the HLSL specification has somewhat looser guarantees for rounding behavior of mathematical operations than regular IEEE-754 floating point.
First, it is implementation-dependent whether operations round up or down.
IEEE-754 requires floating-point operations to produce a result that
is the nearest representable value to an infinitely-precise result,
known as round-to-nearest-even. Direct3D 10, however, defines a looser
requirement: 32-bit floating-point operations produce a result that is
within one unit-last-place (1 ULP) of the infinitely-precise result.
This means that, for example, hardware is allowed to truncate results
to 32-bit rather than perform round-to-nearest-even, as that would
result in error of at most one ULP.
See https://learn.microsoft.com/en-us/windows/win32/direct3d10/d3d10-graphics-programming-guide-resources-float-rules#32-bit-floating-point-rules
Going one step further, the HLSL compiler itself has many fast-math optimizations that can violate IEEE-754 float conformance; see, for example:
D3DCOMPILE_IEEE_STRICTNESS - Forces strict compile, which might not allow for legacy syntax. By default, the compiler disables strictness on deprecated syntax.
D3DCOMPILE_OPTIMIZATION_LEVEL3 - Directs the compiler to use the highest optimization level. If you set this constant, the compiler produces the best possible code but might take significantly longer to do so. Set this constant for final builds of an application when performance is the most important factor.
D3DCOMPILE_PARTIAL_PRECISION - Directs the compiler to perform all computations with partial precision. If you set this constant, the compiled code might run faster on some hardware.
Source: https://learn.microsoft.com/en-us/windows/win32/direct3dhlsl/d3dcompile-constants
This particularly matters for your scenario, because if optimizations are enabled, the existence of loop unrolling can trigger constant folding optimizations that reduce the computational cost of your code and change the precision of its results (potentially even improving them). Note that when constant folding occurs, the compiler has to decide how to perform rounding, and that might disagree with what your hardware FPUs would do.
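As a hedged illustration of that point, here is a small C program (plain C rather than HLSL, since the effect is purely about IEEE-754 double rounding) comparing the expression from Edit 1 as written, which rounds each product separately, against an algebraically simplified form a constant folder might produce:

#include <stdio.h>

int main(void)
{
    double k = 1.0, theta = -0.161;
    /* As written: two separately rounded products, then a subtraction. */
    double as_written = (k + 1e-3) * theta - (k - 1e-3) * theta;
    /* One way a constant folder might simplify it: a single product. */
    double folded = 2e-3 * theta;
    /* The difference here is on the order of 1e-17, i.e., the same
       magnitude as the deviation observed between the rolled and
       unrolled loops in Edit 1. */
    printf("as written: %.20g\nfolded:     %.20g\ndiff: %g\n",
           as_written, folded, as_written - folded);
    return 0;
}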
Oh, and note that IEEE-754 does not place constraints on the precision, let alone require implementation, of "additional operations" (e.g. sin, cos, tanh, atan, ln, etc); it purely recommends them.
See a very common case where this goes wrong, with sin quantized to 4 different values on Intel integrated graphics but reasonable precision on other hardware: sin(x) only returns 4 different values for moderately large input on GLSL fragment shader, Intel HD4000
Also, note that Unity does not guarantee that a float in a shader is actually a 32-bit float; on certain hardware (e.g. mobile), it can even be backed by a 16-bit half or an 11-bit fixed.
High precision: float
Highest precision floating point value; generally 32 bits (just like float from regular programming languages).
...
One complication of float/half/fixed data type usage is that PC GPUs are always high precision. That is, for all the PC (Windows/Mac/Linux) GPUs, it does not matter whether you write float, half or fixed data types in your shaders. They always compute everything in full 32-bit floating point precision.
The half and fixed types only become relevant when targeting mobile
GPUs, where these types primarily exist for power (and sometimes
performance) constraints. Keep in mind that you need to test your
shaders on mobile to see whether or not you are running into
precision/numerical issues.
Even on mobile GPUs, the different precision support varies between
GPU families.
Source: https://docs.unity3d.com/Manual/SL-DataTypesAndPrecision.html
I don't believe Unity exposes compiler flags to developers; you are at its whim as to what optimizations it passes to dxc/fxc. Given it's primarily used for games, you can bet they enable optimizations.
Source: https://forum.unity.com/threads/possible-to-set-directx-compiler-flags-in-shaders.453790/
Finally, check out "Floating-Point Determinism" by Bruce Dawson if you want an in-depth dive into this topic. I will add that this problem also exists if you want consistent results between languages (since languages can implement math functions themselves rather than using hardware intrinsics, e.g. for better precision), when cross-compiling (since different compilers or backends can optimize differently or use different system libraries), and when running managed code across different runtimes (e.g. since the JIT can apply different optimizations).

Fixed point approximation of 2^x, with input range of s5.26

How can I implement 2^x in s5.26 fixed-point arithmetic, for input values in the range [-31.9, 31.9], using a minimax polynomial approximation for exp2()?
How can I generate the polynomial using the Sollya tool mentioned in the following link?
Power of 2 approximation in fixed point
Since fixed-point arithmetic generally does not include an "infinity" encoding representing overflowed results, any implementation of exp2() for an s5.26 format will be limited to inputs in the interval (-32, 5), resulting in outputs in [0, 32).
The computation of transcendental functions typically consists of argument reduction, core approximation, and final result construction. In the case of exp2(a), a reasonable argument reduction scheme is to split a into an integer part i and a fractional part f, such that a == i + f, with f in [-0.5, 0.5]. One then computes exp2(f) and scales the result by 2^i, which corresponds to a shift in fixed-point arithmetic: exp2(a) = exp2(f) * exp2(i).
The common design choices for the computation of exp2(f) are interpolation in tabulated values of exp2(), or polynomial approximation. Since we need 31 result bits for the largest arguments, accurate interpolation would probably want to use quadratic interpolation to keep the table size reasonable. Since many modern processors (including ones used in embedded systems) provide a fast integer multiplier, I will focus here on approximation by polynomial. For this, we want a polynomial with minimax properties, that is, one that minimizes the maximum error compared to the reference.
Both commercial and free tools offer built-in capabilities to generate minimax approximations, e.g. Mathematica's MiniMaxApproximation command, Maple's minimax command, and Sollya's fpminimax command. One might also choose to build one's own infrastructure based on the Remez algorithm, which is the approach I have used. As opposed to floating-point arithmetic, which typically uses round-to-nearest-or-even, fixed-point arithmetic is usually restricted to truncation of intermediate results. This adds additional error during expression evaluation. As a consequence, it is usually a good idea to try a heuristic-based search for small adjustments to the coefficients of the generated approximation to partially balance those accumulating one-sided errors.
Because we need up to 31 bits in the result, and because coefficients in core approximations are typically less than unity in magnitude, we cannot use the native fixed-point precision, here s5.26, for polynomial evaluation. Instead, we want to scale up the operands in intermediate computation to fully use the available range of 32-bit integers, by dynamically adjusting the fixed-point format we are working in. For reasons of efficiency, it seems advisable to arrange the computation such that multiplications use re-normalization right shifts by 32 bits. This will often allow the elimination of explicit shifts on 32-bit processors.
Since intermediate computation uses signed data, right shifts of signed, negative operands will occur. We want those right shifts to map to arithmetic right shift instructions, something the C standard does not guarantee. But on most commonly used platforms, C compilers do what is desirable for us. Otherwise, it may be necessary to resort to intrinsics or inline assembly. I developed the code below with the Microsoft compiler on an x64 platform.
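As a side note, should a target compiler not shift negative values arithmetically, a common portable fallback is a sketch like the following (my addition, not part of the original code):

#include <stdint.h>

/* Portable arithmetic right shift for two's-complement int32_t: shifting
   the complement keeps the sign-extension behavior even if the compiler
   implements signed >> as a logical shift. */
static int32_t asr32 (int32_t x, int n)
{
    return (x < 0) ? ~(~x >> n) : (x >> n);
}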
In the evaluation of the polynomial approximation for exp2(f), the original floating-point coefficients, the dynamic scaling, and the heuristic adjustments are all clearly visible. The code below does not quite achieve full accuracy for large arguments. The biggest absolute error is 1.10233e-7, for the argument of 0x12de9c5b = 4.71739332: fixed_exp2() returns 0x693ab6a3 while the accurate result would be 0x693ab69c. Presumably full accuracy could be achieved by increasing the degree of the polynomial core approximation by one.
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <math.h>

/* on 32-bit architectures, there is often an instruction/intrinsic for this */
int32_t mulhi (int32_t a, int32_t b)
{
    return (int32_t)(((int64_t)a * (int64_t)b) >> 32);
}

/* compute exp2(a) in s5.26 fixed-point arithmetic */
int32_t fixed_exp2 (int32_t a)
{
    int32_t i, f, r, s;
    /* split a = i + f, such that f in [-0.5, 0.5] */
    i = (a + 0x2000000) & ~0x3ffffff; // 0.5
    f = a - i;
    s = ((5 << 26) - i) >> 26;
    f = f << 5; /* scale up for maximum accuracy in intermediate computation */
    /* approximate exp2(f)-1 for f in [-0.5, 0.5] */
    r = (int32_t)(1.53303146e-4 * (1LL << 36) + 996);
    r = mulhi (r, f) + (int32_t)(1.33887795e-3 * (1LL << 35) + 99);
    r = mulhi (r, f) + (int32_t)(9.61833261e-3 * (1LL << 34) + 121);
    r = mulhi (r, f) + (int32_t)(5.55036329e-2 * (1LL << 33) + 51);
    r = mulhi (r, f) + (int32_t)(2.40226507e-1 * (1LL << 32) + 8);
    r = mulhi (r, f) + (int32_t)(6.93147182e-1 * (1LL << 31) + 5);
    r = mulhi (r, f);
    /* add 1, scale based on integral portion of argument, round the result */
    r = ((((uint32_t)r * 2) + (uint32_t)(1.0*(1LL << 31)) + ((1U << s) / 2) + 1) >> s);
    /* when argument < -26.5, result underflows to zero */
    if (a < -0x6a000000) r = 0;
    return r;
}

/* convert from s5.26 fixed point to double-precision floating point */
double fixed_to_float (int32_t a)
{
    return a / 67108864.0;
}

int main (void)
{
    double a, res, ref, err, maxerr = 0.0;
    int32_t x, start, end;

    start = -0x7fffffff; // -31.999999985
    end = 0x14000000;    //   5.000000000
    printf ("testing fixed_exp2 with inputs in [%.9f, %.9f)\n",
            fixed_to_float (start), fixed_to_float (end));
    for (x = start; x < end; x++) {
        a = fixed_to_float (x);
        ref = exp2 (a);
        res = fixed_to_float (fixed_exp2 (x));
        err = fabs (res - ref);
        if (err > maxerr) {
            maxerr = err;
        }
    }
    printf ("max. abs. err = %g\n", maxerr);
    return EXIT_SUCCESS;
}
A table-based alternative would trade off table storage against a reduction in the amount of computation performed. Depending on the size of the L1 data cache, this may or may not increase performance. One possible approach is to tabulate 2^f - 1 for f in [0, 1). Then split the function argument into an integer i and a fraction f, such that f is in [0, 1). In order to keep the table reasonably small, use quadratic interpolation, with the coefficients of the polynomial computed on the fly from three consecutive table entries. The result is slightly adjusted by a heuristically determined offset to somewhat compensate for the truncating nature of fixed-point arithmetic.
The table is indexed by leading bits of the fraction f. Using seven bits for the index (resulting in a table of 128+2 entries), accuracy is slightly worse than with the previous minimax polynomial approximation. Maximum absolute error is 1.74935e-7. It occurs for an argument of 0x11580000 = 4.33593750, where fixed_exp2() returns 0x50c7d771, whereas the accurate result would be 0x50c7d765.
/* For i in [0,129]: (exp2 (i/128.0) - 1.0) * (1 << 31) */
static const uint32_t expTab [130] =
{
0x00000000, 0x00b1ed50, 0x0164d1f4, 0x0218af43,
0x02cd8699, 0x0383594f, 0x043a28c4, 0x04f1f656,
0x05aac368, 0x0664915c, 0x071f6197, 0x07db3580,
0x08980e81, 0x0955ee03, 0x0a14d575, 0x0ad4c645,
0x0b95c1e4, 0x0c57c9c4, 0x0d1adf5b, 0x0ddf0420,
0x0ea4398b, 0x0f6a8118, 0x1031dc43, 0x10fa4c8c,
0x11c3d374, 0x128e727e, 0x135a2b2f, 0x1426ff10,
0x14f4efa9, 0x15c3fe87, 0x16942d37, 0x17657d4a,
0x1837f052, 0x190b87e2, 0x19e04593, 0x1ab62afd,
0x1b8d39ba, 0x1c657368, 0x1d3ed9a7, 0x1e196e19,
0x1ef53261, 0x1fd22825, 0x20b05110, 0x218faecb,
0x22704303, 0x23520f69, 0x243515ae, 0x25195787,
0x25fed6aa, 0x26e594d0, 0x27cd93b5, 0x28b6d516,
0x29a15ab5, 0x2a8d2653, 0x2b7a39b6, 0x2c6896a5,
0x2d583eea, 0x2e493453, 0x2f3b78ad, 0x302f0dcc,
0x3123f582, 0x321a31a6, 0x3311c413, 0x340aaea2,
0x3504f334, 0x360093a8, 0x36fd91e3, 0x37fbefcb,
0x38fbaf47, 0x39fcd245, 0x3aff5ab2, 0x3c034a7f,
0x3d08a39f, 0x3e0f680a, 0x3f1799b6, 0x40213aa2,
0x412c4cca, 0x4238d231, 0x4346ccda, 0x44563ecc,
0x45672a11, 0x467990b6, 0x478d74c9, 0x48a2d85d,
0x49b9bd86, 0x4ad2265e, 0x4bec14ff, 0x4d078b86,
0x4e248c15, 0x4f4318cf, 0x506333db, 0x5184df62,
0x52a81d92, 0x53ccf09a, 0x54f35aac, 0x561b5dff,
0x5744fccb, 0x5870394c, 0x599d15c2, 0x5acb946f,
0x5bfbb798, 0x5d2d8185, 0x5e60f482, 0x5f9612df,
0x60ccdeec, 0x62055b00, 0x633f8973, 0x647b6ca0,
0x65b906e7, 0x66f85aab, 0x68396a50, 0x697c3840,
0x6ac0c6e8, 0x6c0718b6, 0x6d4f301f, 0x6e990f98,
0x6fe4b99c, 0x713230a8, 0x7281773c, 0x73d28fde,
0x75257d15, 0x767a416c, 0x77d0df73, 0x792959bb,
0x7a83b2db, 0x7bdfed6d, 0x7d3e0c0d, 0x7e9e115c,
0x80000000, 0x8163daa0
};
int32_t fixed_exp2 (int32_t x)
{
    int32_t f1, f2, dx, a, b, approx, idx, i, f;

    /* extract integer portion; 2**i is realized as a shift at the end */
    i = (x >> 26);
    /* extract fraction f so we can compute 2^f, 0 <= f < 1 */
    f = x & 0x3ffffff;
    /* index table of exp2 values using 7 most significant bits of fraction */
    idx = (uint32_t)f >> (26 - 7);
    /* difference between argument and next smaller sampling point */
    dx = f - (idx << (26 - 7));
    /* fit parabola through closest 3 sampling points; find coefficients a,b */
    f1 = (expTab[idx+1] - expTab[idx]);
    f2 = (expTab[idx+2] - expTab[idx]);
    a = f2 - (f1 << 1);
    b = (f1 << 1) - a;
    /* find function value offset for argument x by computing ((a*dx+b)*dx) */
    approx = a;
    approx = (int32_t)((((int64_t)approx)*dx) >> (26 - 7)) + b;
    approx = (int32_t)((((int64_t)approx)*dx) >> (26 - 7 + 1));
    /* combine integer and fractional parts of result, round result */
    approx = (((expTab[idx] + (uint32_t)approx + (uint32_t)(1.0*(1LL << 31)) + 22U) >> (30 - 26 - i)) + 1) >> 1;
    /* flush underflow to 0 */
    if (i < -27) approx = 0;
    return approx;
}

MATLAB: How can I create autocorrelated data?

I'm looking to create a vector of autocorrelated data points in MATLAB, with the lag 1 higher than lag 2, and so on.
If I look at the lag 1 data pairs (1, 2), (3, 4), (5, 6), ..., then the correlation is relatively higher, but then at lag 2 it's reduced.
I found a way to do this in R
x <- filter(rnorm(1000), filter=rep(1,3), circular=TRUE)
However, I'm not sure how to do the same thing in MATLAB. Ideally I'd like to be able to fine-tune exactly how autocorrelated the data is.
Math:
A standard family of models for autocorrelation in stationary time series is the class of so-called "autoregressive models"; e.g., an autoregressive model with one term is known as an AR(1) and is:
y_t = a + b*y_{t-1} + e_t
AR(1) sounds simplistic, but it turns out to be a quite powerful tool; e.g., an AR(p) with p autoregressive terms is actually an AR(1) on a p-dimensional vector (check the Wikipedia page). Note also that b=1 gives a non-stationary random walk.
A more intuitive way to write what's going on (in the stationary case, with |b| < 1) is to define u = a / (1 - b) (it turns out u is the unconditional mean of the AR(1)); then, with some algebra:
y_t - u = b * ( y_{t-1} - u) + e_t
That is, the difference from the unconditional mean u gets hit with a decay term b, and then a shock term e_t gets added. (You want -1 < b < 1 for stationarity.)
Code:
Since e_t denotes the shock term, this is super easy to simulate. E.g., to simulate an AR(1):
a = 0; b = .4; sigma = 1; T = 1000;
y0 = a / (1 - b); % e.g., initialize to unconditional mean of stationary time series
y = zeros(T,1);
y(1) = a + b * y0 + randn() * sigma;
for t = 2:T
    y(t) = a + b * y(t-1) + randn() * sigma;
end
This code isn't meant to be fast, but illustrative. An AR(1) model implies a certain type of correlation structure, but by adding AR or MA terms you can fit some pretty funky stuff (MA = moving-average model).
You can test the sample autocorrelation with autocorr(y). For reference, the bible on time-series mathematics is Hamilton's book Time Series Analysis.

Fix matlab code with error

I understand that there is an error with dimensions in the line dr=(r-v*v/2)*dT, but I have little knowledge of MATLAB. Please help me fix it. The code is small and simple; maybe someone will find time to look.
function [optionPrice] = upAndOutCallOption(S,r,v,x,b,T,dT)
    t = 0;
    dr = [];
    pert = [];
    while (t < T) & (S < b)
        t = t + dT;
        dr = (r - v.*v./2).*dT;
        pert = v.*sqrt( dT ).*randn();
        S = S.*exp(dr + pert);
    end
    if S < b
        % Within barrier, so price as for a European option.
        optionPrice = exp(-r.*T).* max(0, S - x);
    else
        % Hit the barrier, so the option is withdrawn.
        optionPrice = 0;
    end
end
Call from another function of this kind:
for k = 1:amountOfOptions
    [optionPrices(k)] = upAndOutCallOption(stockPrice(k)*o, riskFreeRate(k)*o, ...
        volatility(k)*o, strike(k)*o, barrier(k)*o, timeToExpiry(k)*o, sampleRate(k)*o);
    result(k) = mean(optionPrices(k));
end
Hence the difficulties.
It's good that you know the problem is within dr = (r - v.*v./2).*dT;. The command itself has several possible problems, all related to dimensions:
Here you are doing element-wise multiplication (because of the .*) with matrices, which requires (in the case of your command) that r has the same number of rows AND columns as v (since, because of the element-wise operation, v.*v./2 has the same size as v).
Moreover, it is unnecessary to use element-wise division when dividing by a scalar; there is no need for ./2 in MATLAB, a plain /2 is enough.
And, finally, since it's element-wise multiplication again, the matrix (r - v.*v./2) must also have the same number of rows and columns as the matrix dT.
Check here for more information about Matlab's matrix operations.

Fast way to compute (1:N)'*(1:N)

I am looking for a fast way to compute
(1:N)'*(1:N)
for reasonably large N. I feel like the symmetry of the problem makes it so that actually doing the multiplications and additions is wasteful.
The question of why you want to do this really matters.
In the theoretical sense, the triangular approach suggested in the other answers will save you operations. @jgmao's answer is especially interesting in reducing multiplies.
In the practical sense, number of CPU operations is no longer the metric to minimize when writing fast code. Memory bandwidth dominates when you have so few CPU operations, so tuned cache-aware access patterns are how to make this go fast. Matrix multiplication code is implemented extremely efficiently, since it's such a common operation, and every implementation of the BLAS numeric library worth its salt will use optimized access patterns, and SIMD computation as well.
Even if you wrote straight C and reduced your op count to the theoretic minimum, you'd probably still not beat the full matrix multiply. What this boils down to is to find the numeric primitive which most closely matches your operation.
All that said, there's a BLAS operation which gets a little closer than DGEMM (matrix multiply). It's called DSYRK, the rank-k update, and it can be used for exactly A'*A. The MEX function I wrote for this a long time ago is here. I haven't messed with it in a long time, but it did work when I first wrote it, and did in fact run faster than a straight A'*A.
/* xtrx.c: calculates x'*x taking advantage of the symmetry.
   Peter Boettcher <email removed>
   Last modified: <Thu Jan 23 13:53:02 2003> */

#include "mex.h"

/* Fortran BLAS rank-k update; declared here since no BLAS header is included. */
extern void dsyrk_(const char *uplo, const char *trans, const int *n, const int *k,
                   const double *alpha, const double *a, const int *lda,
                   const double *beta, double *c, const int *ldc);

const double one = 1;
const double zero = 0;

void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    double *x, *z;
    int i, j, mrows, ncols;

    if (nrhs != 1) mexErrMsgTxt("One input required.");

    x = mxGetPr(prhs[0]);
    mrows = mxGetM(prhs[0]);
    ncols = mxGetN(prhs[0]);
    plhs[0] = mxCreateDoubleMatrix(ncols, ncols, mxREAL);
    z = mxGetPr(plhs[0]);

    /* Call the FORTRAN BLAS routine for rank k update */
    dsyrk_("U", "T", &ncols, &mrows, &one, x, &mrows, &zero, z, &ncols);

    /* Result is in the upper triangle. Copy it down to the lower part */
    for (i = 0; i < ncols; i++)
        for (j = i+1; j < ncols; j++)
            z[i*ncols + j] = z[j*ncols + i];
}
MATLAB's matrix multiplication is generally pretty fast, but here are a couple of ways to get just the upper triangular matrix. They are slower than naïvely computing v'*v (or than using a MEX wrapper that calls the more appropriate symmetric rank-k update function in BLAS), not surprisingly! Anyway, here are a few MATLAB-only solutions:
The first uses linear indexing:
% test vector
N = 1e3;
v = 1:N;
% compute upper triangle of product
[ii, jj] = find(triu(ones(N)));
upperMask = false(N,N);
upperMask(ii + N*(jj-1)) = true;
Mu = zeros(N);
Mu(upperMask) = v(ii).*v(jj); % other lines always the same computation
% validate
M = v'*v;
isequal(triu(M),Mu)
This next way won't be faster than the naive approach either, but here's another solution to compute the lower triangle with bsxfun:
Ml = bsxfun(@(x,y) [zeros(y-1,1); x(y:end)*y],v',v);
For the upper triangle:
Mu = bsxfun(@(x,y) [x(1:y)*y; zeros(numel(x)-y,1)],v',v);
isequal(triu(M),Mu)
Another solution computes the whole matrix using cumsum, for this special case (where v=1:N). This one is actually close in speed.
M = cumsum(repmat(v,[N 1]));
Maybe these can be a starting point for something better.
This is 3 times faster than (1:N).'*(1:N) provided an int32 result is acceptable (it's even faster if the numbers are small enough to use int16 instead of int32):
N = 1000;
aux = int32(1:N);
result = bsxfun(@times,aux.',aux);
Benchmarking:
>> N = 1000; aux = int32(1:N); tic, for count = 1:1e2, bsxfun(@times,aux.',aux); end, toc
Elapsed time is 0.734992 seconds.
>> N = 1000; aux = 1:N; tic, for count = 1:1e2, aux.'*aux; end, toc
Elapsed time is 2.281784 seconds.
Note that aux.'*aux cannot be used for aux = int32(1:N).
As pointed out by @DanielE.Shub, if the result is needed as a double matrix, a final cast has to be done, and in that case the gain is very small:
>> N = 1000; aux = int32(1:N); tic, for count = 1:1e2, double(bsxfun(@times,aux.',aux)); end, toc
Elapsed time is 2.173059 seconds.
Given the special ordered structure of the input, consider the case N=4:
(1:4)'*(1:4) = [1  2  3  4
                2  4  6  8
                3  6  9 12
                4  8 12 16]
You will find that the 1st row is just (1:N), and that from the second row (j=2) on, each row equals the previous row (j-1) plus (1:N).
So: 1. you do not need to do many multiplications; instead, you can generate the matrix with N*N additions. 2. Since the output is symmetric, only half of the output matrix needs to be computed, so the total computation is (N-1)+(N-2)+...+1 ≈ N^2/2 additions. A sketch of this construction follows.
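To make the counting argument concrete, here is a small illustrative sketch (in C, since the idea is language-independent; a MATLAB version would follow the same pattern): the first row is 1..N, each later upper-triangle entry costs one addition, and the lower triangle is mirrored.

#include <stdio.h>

#define N 4

int main(void)
{
    int M[N][N];
    /* First row is 1..N; each later row is the previous row plus (1:N).
       Only j >= i is computed; the lower triangle is mirrored afterward. */
    for (int j = 0; j < N; j++)
        M[0][j] = j + 1;
    for (int i = 1; i < N; i++)
        for (int j = i; j < N; j++)
            M[i][j] = M[i - 1][j] + (j + 1); /* one addition per entry */
    for (int i = 1; i < N; i++)              /* mirror: M is symmetric */
        for (int j = 0; j < i; j++)
            M[i][j] = M[j][i];
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++)
            printf("%4d", M[i][j]);
        printf("\n");
    }
    return 0;
}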