Reimplement vDSP_deq22 for Biquad IIR Filter by hand - filtering

I'm porting a filterbank that currently uses the Apple-specific (Accelerate) vDSP function vDSP_deq22 to Android (where Accelerate is not available). The filterbank is a set of bandpass filters that each return the RMS magnitude for their respective band. Currently the code (ObjectiveC++, adapted from NVDSP) looks like this:
- (float) filterContiguousData: (float *)data numFrames:(UInt32)numFrames channel:(UInt32)channel {
// Init float to store RMS volume
float rmsVolume = 0.0f;
// Provide buffer for processing
float tInputBuffer[numFrames + 2];
float tOutputBuffer[numFrames + 2];
// Copy the two frames we stored into the start of the inputBuffer, filling the rest with the current buffer data
memcpy(tInputBuffer, gInputKeepBuffer[channel], 2 * sizeof(float));
memcpy(tOutputBuffer, gOutputKeepBuffer[channel], 2 * sizeof(float));
memcpy(&(tInputBuffer[2]), data, numFrames * sizeof(float));
// Do the processing
vDSP_deq22(tInputBuffer, 1, coefficients, tOutputBuffer, 1, numFrames);
vDSP_rmsqv(tOutputBuffer, 1, &rmsVolume, numFrames);
// Copy the last two data points of each array to be put at the start of the next buffer.
memcpy(gInputKeepBuffer[channel], &(tInputBuffer[numFrames]), 2 * sizeof(float));
memcpy(gOutputKeepBuffer[channel], &(tOutputBuffer[numFrames]), 2 * sizeof(float));
return rmsVolume;
}
As seen here, deq22 implements a biquad filter on a given input vector via a recursive function. This is the description of the function from the docs:
A =: Single-precision real input vector
IA =: Stride for A.
B =: 5 single-precision inputs (filter coefficients), with stride 1.
C =: Single-precision real output vector.
IC =: Stride for C.
N =: Number of new output elements to produce.
This is what I have so far (it's in Swift, like the rest of the codebase that I already have running on Android):
// N is fixed on init to be the same size as buffer.count, below
// 'input' and 'output' are initialised with (N+2) length and filled with 0s
func getFilteredRMSMagnitudeFromBuffer(var buffer: [Float]) -> Float {
let inputStride = 1 // hardcoded for now
let outputStride = 1
input[0] = input[N]
input[1] = input[N+1]
output[0] = output[N]
output[1] = output[N+1]
// copy the current buffer into input
input[2 ... N+1] = buffer[0 ..< N]
// Not sure if this is neccessary, just here to duplicate NVDSP behaviour:
output[2 ... N+1] = [Float](count: N, repeatedValue: 0)[0 ..< N]
// Again duplicating NVDSP behaviour, can probably just start at 0:
var sumOfSquares = (input[0] * input[0]) + (input[1] * input[1])
for n in (2 ... N+1) {
let sumG = (0...2).reduce(Float(0)) { total, p in
return total + input[(n - p) * inputStride] * coefficients[p]
}
let sumH = (3...4).reduce(Float(0)) { total, p in
return total + output[(n - p + 2) * outputStride] * coefficients[p]
}
let filteredFrame = sumG - sumH
output[n] = filteredFrame
sumOfSquares = filteredFrame * filteredFrame
}
let meanSquare = sumOfSquares / Float(N + 2) // we added 2 values by hand, before the loop
let rootMeanSquare = sqrt(meanSquare)
return rootMeanSquare
}
The filter gives a different magnitude output to deq22 though, and seems to have a cyclic wurbling circular 'noise' in it (with a constant input tone, that frequency's magnitude pumps up and down).
I've checked to ensure the coefficients arrays are identical between each implementation. Each filter actually seems to "work" in that it picks up the correct frequency (and only that frequency), it's just this pumping, and that the RMS magnitude output is a lot quieter than vDSP's, often by orders of magnitude:
Naive | vDSP
3.24305e-06 0.000108608
1.57104e-06 5.53645e-05
1.96445e-06 4.33506e-05
2.05422e-06 2.09781e-05
1.44778e-06 1.8729e-05
4.28997e-07 2.72648e-05
Can anybody see an issue with my logic?
Edit: here is a gif video of the result with a constant 440Hz tone. The various green bars are the individual filter bands. The 3rd band (shown here) is the one tuned to 440Hz.
The NVDSP version just shows a constant (non-fluctuating) magnitude reading proportional to the input volume, as expected.

Ok, the line sumOfSquares = filteredFrame * filteredFrame should be a +=, not an assignment. So only the last frame was being calculated, explains a lot ;)
Feel free to use this if you want to do some biquad filtering in Swift. MIT Licence like NVDSP before it.

Related

Summing very large numbers without using toolboxes

I am trying to sum very large numbers in MATLAB, such as e^800 and e^1000 and obtain an answer.
I know that in Double-Precision, the largest number I can represent is 1.8 * 10^308, otherwise I get Inf, which I am getting when trying to sum these numbers.
My question is, how do I go about estimating an answer for sums of very, very large numbers like these without using vpa, or some other toolbox?
Should I use strings? It is possible to do this using logs? Can I represent the floats as m x 2^E and if so, how do I take a number such as e^700 and convert it to that? If the number is larger than the threshold for Inf, should I divide it by two, and store it in two different variables?
For example, how would I obtain an approximate answer for:
e^700 + e^800 + e^900 + e^1000 ?
A possible approximation is to use the rounded values of these numbers (I personally used Wolfram|Alpha), then perform "long addition" as they teach in elementary school:
function sumStr = q57847408()
% store rounded values as string:
e700r = "10142320547350045094553295952312676152046795722430733487805362812493517025075236830454816031618297136953899163768858065865979600395888785678282243008887402599998988678389656623693619501668117889366505232839133350791146179734135738674857067797623379884901489612849999201100199130430066930357357609994944589";
e800r = "272637457211256656736477954636726975796659226578982795071066647118106329569950664167039352195586786006860427256761029240367497446044798868927677691427770056726553709171916768600252121000026950958713667265709829230666049302755903290190813628112360876270335261689183230096592218807453604259932239625718007773351636778976141601237086887204646030033802";
e900r = "7328814222307421705188664731793809962200803372470257400807463551580529988383143818044446210332341895120636693403927733397752413275206079839254190792861282973356634441244426690921723184222561912289431824879574706220963893719030715472100992004193705579194389741613195142957118770070062108395593116134031340597082860041712861324644992840377291211724061562384383156190256314590053986874606962229";
e1000r = "197007111401704699388887935224332312531693798532384578995280299138506385078244119347497807656302688993096381798752022693598298173054461289923262783660152825232320535169584566756192271567602788071422466826314006855168508653497941660316045367817938092905299728580132869945856470286534375900456564355589156220422320260518826112288638358372248724725214506150418881937494100871264232248436315760560377439930623959705844189509050047074217568";
% pad to the same length with zeros on the left:
padded = pad([e700r; e800r; e900r; e1000r], 'left', '0');
% convert the padded value to an array of digits:
dig = uint8(char(padded) - '0');
% some helpful computations for later:
colSum = [0 uint8(sum(dig, 1))]; % extra 0 is to prevent overflow of MSB
remainder = mod(colSum, 10);
carry = idivide(colSum, 10, 'floor');
while any(carry) % can also be a 'for' loop with nDigit iterations (at most)
result = remainder + circshift(carry, -1);
remainder = mod(result, 10);
carry = idivide(result, 10, 'floor');
end
% remove leading zero (at most one):
if ~result(1)
result = result(2:end);
end
% convert result back to string:
sumStr = string(char(result + '0'));
This gives the (rounded) result of:
197007111401704699388887935224332312531693805861198801302702004327171116872054081548301452764017301057216669857236647803717912876737392925607579016038517631441936559738211677036898431095605804172455718237264052427496060405708350697523284591075347592055157466708515626775854212347372496361426842057599220506613838622595904885345364347680768544809390466197511254544019946918140384750254735105245290662192955421993462796807599177706158188
Typos fixed from before.
Decimal Approximation:
function [m, new_exponent] = base10_mantissa_exponent(base, exponent)
exact_exp = exponent*log10(abs(base));
new_exponent = floor(exact_exp);
m = power(10, exact_exp - new_exponent);
end
So the value e600 would become 3.7731 * 10260.
And the value 117150 would become 1.6899 * 10310.
To add these two values together, I took the difference between the two exponents and divided the mantissa of the smaller term by it. Then it's just as simple as adding the mantissas together.
mantissaA = 3.7731;
exponentA = 260;
mantissaB = 1.6899;
exponentB = 310;
diff = abs(exponentA - exponentB);
if exponentA < exponentB
mantissaA = mantissaA / (10^diff);
finalExponent = exponentB;
elseif exponentB < exponentA
mantissaB = mantissaB / (10^diff);
finalExponent = exponentA;
end
finalMantissa = mantissaA + mantissaB;
This was important for me as I was performing sums such as:
(Σ ex) / (Σ xex)
From x=1 to x=1000.

Fixed point approximation of 2^x, with input range of s5.26

How can I implement 2^x fixed-point arithmetic s5.26 and input values is in range [-31.9, 31.9] using the minimax polynomial approximation for exp2()
How to generate the polynomial using Sollya Tool mentioned in the following link
Power of 2 approximation in fixed point
Since fixed-point arithmetic generally does not include an "infinity" encoding representing overflowed results, any implementation of exp2() for an s5.26 format will be limited to inputs in the interval (-32, 5), resulting in outputs in [0, 32).
The computation of transcendental functions typically consist of argument reduction, core approximation, final result construction. In the case of exp2(a), a reasonable argument reduction scheme is to split a into integer part i and fractional part f, such that a == i + f, with f in [-0.5, 0.5]. One then computes exp2(f), and scales the result by 2i, which corresponds to shifts in fixed-point arithmetic: exp2(a) = exp2(f) * exp2(i).
The common design choices for the computation of exp2(f) are interpolation in tabulated values of exp2(), or polynomial approximation. Since we need 31 result bits for the largest arguments, accurate interpolation would probably want to use quadratic interpolation to keep the table size reasonable. Since many modern processors (including ones used in embedded systems) provide a fast integer multiplier, I will focus here on approximation by polynomial. For this, we want a polynomial with minimax properties, that is, one that minimizes the maximum error compared to the reference.
Both commercial and free tools offer built-in capabilities to generate minimax approximations, e.g. Mathematica's MiniMaxApproximation command, Maple's minimax command, and Sollya's fpminimax command. One might also chose to build one's own infrastructure based on the Remez algorithm, which is the approach I have used. As opposed to floating-point arithmetic which typically uses to-nearest-or-even rounding, fixed-point arithmetic is usually restricted to truncation of intermediate results. This adds additional error during expression evaluation. As a consequence, it is usually a good idea to try a heuristic-based search for small adjustments to the coefficients of the generated approximation to partially balance those accumulating one-sided errors.
Because we need up to 31 bits in the result, and because coefficients in core approximations are typically less than unity in magnitude, we cannot use the native fixed-point precision, here s5.26, for polynomial evaluation. Instead, we want to scale up the operands in intermediate computation to fully use the available range of 32-bit integers, by dynamically adjusting the fixed-point format we are working in. For reasons of efficiency, it seems advisable to arrange the computation such that multiplications use re-normalization right shifts by 32 bits. This will often allow the elimination of explicit shifts on 32-bit processors.
Since intermediate computation uses signed data, right shifts of signed, negative operands will occur. We want those right shifts to map to arithmetic right shift instructions, something the C standard does not guarantee. But on most commonly used platforms, C compilers do what is desirable for us. Otherwise, it may be necessary to resort to intrinsics or inline assembly. I developed the code below with the Microsoft compiler on an x64 platform.
In the evaluation of the polynomial approximation for exp2(f) the original floating-point coefficients, the dynamic scaling, and the heuristic adjustments are all clearly visible. The code below does not quite achieve full accuracy for large arguments. The biggest absolute error is 1.10233e-7, for the argument of 0x12de9c5b = 4.71739332: fixed_exp2() returns 0x693ab6a3 while the accurate result would be 0x693ab69c. Presumably full accuracy could be achieved by increasing the degree of the polynomial core approximation by one.
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <math.h>
/* on 32-bit architectures, there is often an instruction/intrinsic for this */
int32_t mulhi (int32_t a, int32_t b)
{
return (int32_t)(((int64_t)a * (int64_t)b) >> 32);
}
/* compute exp2(a) in s5.26 fixed-point arithmetic */
int32_t fixed_exp2 (int32_t a)
{
int32_t i, f, r, s;
/* split a = i + f, such that f in [-0.5, 0.5] */
i = (a + 0x2000000) & ~0x3ffffff; // 0.5
f = a - i;
s = ((5 << 26) - i) >> 26;
f = f << 5; /* scale up for maximum accuracy in intermediate computation */
/* approximate exp2(f)-1 for f in [-0.5, 0.5] */
r = (int32_t)(1.53303146e-4 * (1LL << 36) + 996);
r = mulhi (r, f) + (int32_t)(1.33887795e-3 * (1LL << 35) + 99);
r = mulhi (r, f) + (int32_t)(9.61833261e-3 * (1LL << 34) + 121);
r = mulhi (r, f) + (int32_t)(5.55036329e-2 * (1LL << 33) + 51);
r = mulhi (r, f) + (int32_t)(2.40226507e-1 * (1LL << 32) + 8);
r = mulhi (r, f) + (int32_t)(6.93147182e-1 * (1LL << 31) + 5);
r = mulhi (r, f);
/* add 1, scale based on integral portion of argument, round the result */
r = ((((uint32_t)r * 2) + (uint32_t)(1.0*(1LL << 31)) + ((1U << s) / 2) + 1) >> s);
/* when argument < -26.5, result underflows to zero */
if (a < -0x6a000000) r = 0;
return r;
}
/* convert from s5.26 fixed point to double-precision floating point */
double fixed_to_float (int32_t a)
{
return a / 67108864.0;
}
int main (void)
{
double a, res, ref, err, maxerr = 0.0;
int32_t x, start, end;
start = -0x7fffffff; // -31.999999985
end = 0x14000000; // 5.000000000
printf ("testing fixed_exp2 with inputs in [%.9f, %.9f)\n",
fixed_to_float (start), fixed_to_float (end));
for (x = start; x < end; x++) {
a = fixed_to_float (x);
ref = exp2 (a);
res = fixed_to_float (fixed_exp2 (x));
err = fabs (res - ref);
if (err > maxerr) {
maxerr = err;
}
}
printf ("max. abs. err = %g\n", maxerr);
return EXIT_SUCCESS;
}
A table-based alternative would trade-off table storage for a reduction in the amount of computation that is performed. Depending on the size of the L1 data cache, this may or may not increase performance. One possible approach is to tabulate 2f-1 for f in [0, 1). The split the function argument into an integer i and a fraction f, such that f in [0, 1). In order to keep the table reasonably small, use quadratic interpolation, with the coefficients of the polynomial computed on the fly from three consecutive table entries. The result is slightly adjusted by a heuristically determined offset to somewhat compensate for the truncating nature of fixed-point arithmetic.
The table is indexed by leading bits of the fraction f. Using seven bits for the index (resulting in a table of 128+2 entries), accuracy is slightly worse than with the previous minimax polynomial approximation. Maximum absolute error is 1.74935e-7. It occurs for an argument of 0x11580000 = 4.33593750, where fixed_exp2() returns 0x50c7d771, whereas the accurate result would be 0x50c7d765.
/* For i in [0,129]: (exp2 (i/128.0) - 1.0) * (1 << 31) */
static const uint32_t expTab [130] =
{
0x00000000, 0x00b1ed50, 0x0164d1f4, 0x0218af43,
0x02cd8699, 0x0383594f, 0x043a28c4, 0x04f1f656,
0x05aac368, 0x0664915c, 0x071f6197, 0x07db3580,
0x08980e81, 0x0955ee03, 0x0a14d575, 0x0ad4c645,
0x0b95c1e4, 0x0c57c9c4, 0x0d1adf5b, 0x0ddf0420,
0x0ea4398b, 0x0f6a8118, 0x1031dc43, 0x10fa4c8c,
0x11c3d374, 0x128e727e, 0x135a2b2f, 0x1426ff10,
0x14f4efa9, 0x15c3fe87, 0x16942d37, 0x17657d4a,
0x1837f052, 0x190b87e2, 0x19e04593, 0x1ab62afd,
0x1b8d39ba, 0x1c657368, 0x1d3ed9a7, 0x1e196e19,
0x1ef53261, 0x1fd22825, 0x20b05110, 0x218faecb,
0x22704303, 0x23520f69, 0x243515ae, 0x25195787,
0x25fed6aa, 0x26e594d0, 0x27cd93b5, 0x28b6d516,
0x29a15ab5, 0x2a8d2653, 0x2b7a39b6, 0x2c6896a5,
0x2d583eea, 0x2e493453, 0x2f3b78ad, 0x302f0dcc,
0x3123f582, 0x321a31a6, 0x3311c413, 0x340aaea2,
0x3504f334, 0x360093a8, 0x36fd91e3, 0x37fbefcb,
0x38fbaf47, 0x39fcd245, 0x3aff5ab2, 0x3c034a7f,
0x3d08a39f, 0x3e0f680a, 0x3f1799b6, 0x40213aa2,
0x412c4cca, 0x4238d231, 0x4346ccda, 0x44563ecc,
0x45672a11, 0x467990b6, 0x478d74c9, 0x48a2d85d,
0x49b9bd86, 0x4ad2265e, 0x4bec14ff, 0x4d078b86,
0x4e248c15, 0x4f4318cf, 0x506333db, 0x5184df62,
0x52a81d92, 0x53ccf09a, 0x54f35aac, 0x561b5dff,
0x5744fccb, 0x5870394c, 0x599d15c2, 0x5acb946f,
0x5bfbb798, 0x5d2d8185, 0x5e60f482, 0x5f9612df,
0x60ccdeec, 0x62055b00, 0x633f8973, 0x647b6ca0,
0x65b906e7, 0x66f85aab, 0x68396a50, 0x697c3840,
0x6ac0c6e8, 0x6c0718b6, 0x6d4f301f, 0x6e990f98,
0x6fe4b99c, 0x713230a8, 0x7281773c, 0x73d28fde,
0x75257d15, 0x767a416c, 0x77d0df73, 0x792959bb,
0x7a83b2db, 0x7bdfed6d, 0x7d3e0c0d, 0x7e9e115c,
0x80000000, 0x8163daa0
};
int32_t fixed_exp2 (int32_t x)
{
int32_t f1, f2, dx, a, b, approx, idx, i, f;
/* extract integer portion; 2**i is realized as a shift at the end */
i = (x >> 26);
/* extract fraction f so we can compute 2^f, 0 <= f < 1 */
f = x & 0x3ffffff;
/* index table of exp2 values using 7 most significant bits of fraction */
idx = (uint32_t)f >> (26 - 7);
/* difference between argument and next smaller sampling point */
dx = f - (idx << (26 - 7));
/* fit parabola through closest 3 sampling points; find coefficients a,b */
f1 = (expTab[idx+1] - expTab[idx]);
f2 = (expTab[idx+2] - expTab[idx]);
a = f2 - (f1 << 1);
b = (f1 << 1) - a;
/* find function value offset for argument x by computing ((a*dx+b)*dx) */
approx = a;
approx = (int32_t)((((int64_t)approx)*dx) >> (26 - 7)) + b;
approx = (int32_t)((((int64_t)approx)*dx) >> (26 - 7 + 1));
/* combine integer and fractional parts of result, round result */
approx = (((expTab[idx] + (uint32_t)approx + (uint32_t)(1.0*(1LL << 31)) + 22U) >> (30 - 26 - i)) + 1) >> 1;
/* flush underflow to 0 */
if (i < -27) approx = 0;
return approx;
}

matlab discrete wavelet transform wfastmod in wmulden

I am interested to derive the discrete wavelet transform for noise reduction of more than 50,000 data points. I am using wmulden - matlab tool for wavelet tranform. Under this function, wfastmcd, an another function is being called which takes only 50000 data points at a time. It would be highly helpful if anyone suggests how to partition the data point to get the transform of entire data set or if there is any other matlab tool available for these kind of calculations.
I've used a for loop to solve that one.
First of all, I've calculated how many "steps" I needed to take on my signal, on a fixed size window of 50000, like:
MAX_SAMPLES = 50000;
% mySignalSize is the size of my samples vector.
steps = ceil(mySignalSize/MAX_SAMPLES);
After that, I've applied the wmulden function "steps" times, checking every time if my step is not larger than the original signal vector size, like the following:
% Wavelet fields
level = 5;
wname = 'sym4';
tptr = 'sqtwolog';
sorh = 's';
npc_app = 'heur';
npc_fin = 'heur';
den_signal = zeros(mySignalSize,1);
for i=1:steps
if (i*MAX_SAMPLES) <= mySignalSize
x_den = wmulden(originalSignal( (((i-1) * MAX_SAMPLES) + 1) : (i*MAX_SAMPLES) ), level, wname, npc_app, npc_fin, tptr, sorh);
den_signal((((i-1) * MAX_SAMPLES) + 1):i*MAX_SAMPLES) = x_den;
else
old_step = (((i-1) * MAX_SAMPLES) + 1);
new_step = mySignalSize - old_step;
last_step = old_step + new_step;
x_den = wmulden(originalSignal( (((i-1) * MAX_SAMPLES) + 1) : last_step ), level, wname, npc_app, npc_fin, tptr, sorh);
den_signal((((i-1) * MAX_SAMPLES) + 1):last_step) = x_den;
end
end
That should do the trick.

Code or compiler: optimizing a IIR filter in C for the iPhone 4 and later

I've been profiling my almost-finished project and I'm seeing that about three-quarters of the CPU time is spent in this IIR filter function (which is called hundreds of thousands of times in about a second currently on the target hardware) so with everything else working well I am wondering if it can be optimized for my specific hardware and software target. My targets are only iPhone 4 and newer, only iOS 4.3 and newer, only LLVM 4.x. A little bit of imprecision is probably OK if there are gains to be made.
static float filter(const float a, const float *b, const float c, float *d, const int e, const float x)
{
float return_value = 0;
d[0] = x;
d[1] = c * d[0] + a * d[1];
int j;
for (j = 2; j <= e; j++) {
return_value += (d[j] += a * (d[j + 1] - d[j - 1])) * b[j];
}
for (j = e + 1; j > 1; j--) {
d[j] = d[j - 1];
}
return (return_value);
}
Any suggestions about speeding it up appreciated, also interested in your opinion if it is possible to optimize beyond the default compiler optimization at all. I am wondering if it is something where NEON SIMD would help (that is new ground for me) or if VFP can be exploited, or if LLVM autovectorization would help.
I've tried the following LLVM flags:
-ffast-math (didn't make a notable difference)
-O4 (made a big difference on the iPhone 4S with a 25% reduction in time, but no notable difference on my minimum target device the iPhone 4, improvement of which is my main goal)
-O3 -mllvm -unroll-allow-partial -mllvm -unroll-runtime -funsafe-math-optimizations -ffast-math -mllvm -vectorize -mllvm -bb-vectorize-aligned-only (LLVM autovectorization flags from Hal Finkel's slides here: http://llvm.org/devmtg/2012-04-12/Slides/Hal_Finkel.pdf, made things slower than the default LLVM optimization for an Xcode release target)
Open to other flags, different approaches, and changes to the function. I'd prefer to leave the input and return types and values alone. There is actually a discussion of using NEON intrinsic functions for FIR here: https://pixhawk.ethz.ch/_media/software/optimization/neon_support_in_the_arm_compiler.pdf but I don't have quite enough experience with its subject to successfully apply the information to my own case. Thank you for any clarification.
EDIT My apologies for not noting this sooner. After investigating aka.nice's suggestion I noticed that the values passed in for e, a and c are always the same values and I know them before runtime, so approaches incorporating this info are an option.
Here are some transformations that could be made on the code to use vDSP routines. These transformations make use of various temporary buffers named T0, T1, and T2. Each of these is an array of float with enough space for e-1 elements.
First, use a temporary buffer to compute a * b[j]. This changes the original code:
for (j = 2; j <= e; j++) {
return_value += (d[j] += a * (d[j + 1] - d[j - 1])) * b[j];
}
to:
vDSP_vsmul(b+2, 1, &a, T0, 1, e-1);
for (j = 2; j <= e; j++)
return_value += (d[j] += (d[j+1] - d[j-1])) * T0[j-2];
Then use vDSP_vmul to compute d[j+1] * T0[j-2]:
vDSP_vsmul(b+2, 1, &a, T0, 1, e-1);
vDSP_vmul(d+3, 1, T0, 1, T1, 1, e-1);
for (j = 2; j <= e; j++)
return_value += (d[j] += T1[j-2] - d[j-1] * T0[j-2];
Next, promote vDSP_vmul to vDSP_vma (vector multiply add) to compute d[j] + d[j+1] * T0[j-2]:
vDSP_vsmul(b+2, 1, &a, T0, 1, e-1);
vDSP_vma(d+3, 1, T0, 1, d+2, 1, T1, 1, e-1);
for (j = 2; j <= e; j++)
return_value += (d[j] = T1[j-2] - d[j-1] * T0[j-2];
I suppose I would time that and see if there is any improvement. There are some issues:
SIMD code works best when data is 16-byte aligned. The use of array indices such as j-1 and j+1 prevents this. The ARM processors in phones are not as bad with unaligned data as some other processors, but performance will vary from model to model.
If e is large (more than a few thousand), then T0 and d may be evicted from cache during the vDSP_vma operation, and the following loop will have to reload them. There is a technique called strip mining to reduce the effect of this. I will not detail it now, but, essentially, the operation is partitioned into smaller strips of the array.
The IIR in the final loop may still bottleneck the processor. There are routines in vDSP for performing some IIRs (such as vDSP_deq22), but it is not clear whether this filter can be expressed in a way that is a good enough match to a vDSP routine to gain more performance than might be lost by the transformation.
The summation in the final loop to calculate return_value could also be removed from the loop and replaced with a vDSP routine (likely vDSP_sve), but I suspect the slack caused by the IIR will permit the additions to be done without adding significant execution time to the loop.
The above is off the top of my head; I have not tested the code. I suggest making the transformations one-by-one so you can test the code after each change and identify any errors before going on.
If you can find a satisfactory filter that is not an IIR, more performance optimizations may be available.
I'd prefer to leave the input and return types and values alone…
Nevertheless, moving your rendering from float to integer would help considerably.
Localizing that change to the implementation you present won't be useful. But if you expand it to reimplement just the FIR as integer, it can quickly pay off (unless the sizes are guaranteed to always be incredibly small -- then conversion/move times cost more). Of course, moving larger portions of the render graph to integer will introduce larger gains and require even fewer conversions.
Another consideration would be to look at the utilities in Accelerate.framework (potentially saving you from writing your own asm).
I tried this little exercize to rewrite your filter with delay operator z
For example, for e=4, i renamed input u and output y
d1*z= u
d2*z= c*u + a*d1
d3*z= d2 + a*(d3-d1*z)
d4*z= d3 + a*(d4-d2*z)
d5*z= d4 + a*(d5-d3*z)
y = (b2*d3*z + b3*d4*z + b4*d5*z)
Note that the di are the filter states.
d3*z is the next value of d3 (it appears to be variable d2 in your code)
You can then eliminate the di to write the transfer function y/u in z.
You will then find that a minimal representation require only e states by factoring/simplifying above transfer function.
Denominator is z*(z-a)^3, that is a pole at 0, and another at a with multiplicity (e-1).
You can then put your filter in a standard state space matrix representation:
z*X = A*X + B*u
y = C*X + d*u
With the particular form of poles, you can decompose the transfer in partial fraction expansion and obtain the matrices A & B in this special form (matlab like notations)
A = [0 1 0 0; B=[0;
0 a 1 0; 0;
0 0 a 1; 0;
0 0 0 a] 1]
C & d are a bit less easy though...
They are extracted from the numerators and direct term of partial fraction expansion
They are polynomials in bi, c (degree 1) and a (degree e)
For e=4, I have
C=[ a^3*b2 - a^2*b3 + a*b4 ,
-a^2*b2 + a*b3 + (c-a^2)*b4 ,
a*b2 + (c-a^2)*b3 + (-2*a^2*c-2*a^2-a+a^4)*b4 ,
(c-a^2)*b2 + (-a^2-a^2*c-a)*b3 + (-2*a*c+2*a^3)*b4 ]
d= -a*b2 - a*c*b3 + a^2*b4
If you can find the recurrence in e governing C & d, and precompute them
then the filter can be reduced to those simple vector ops:
z*X = a*[ 0; x2 ; x3 ; x4 ... xe ] + [x2 ; x3 ; x4 ... xe ; u ];
Y = C*[ x1 ; x2 ; x3 ; x4 ... xe ] + d*u
Or expressed as a function (Xnext,y)=filter(X,u,a,C,d,e) pseudo code:
y = dot_product( C , X) + d*u; // (like BLAS _DOT)
Xnext(1:e-1) = X(2:e); // this is a memcopy (like BLAS _COPY)
Xnext(e)=u;
X(1)=0;
Xnext=a*X+Xnext; // this is an inplace vector muladd (like BLAS _AXPY)
X=Xnext; // another memcopy outside the function (can be moved inside).
Note that if you use BLAS functions, your code will be portable to many hardwares, not just Applecentric, and I guess the perf won't be much different.
EDIT: about partial fraction expansion
The pure partial fraction expansion would give a diagonal state space representation, and a matrix B full of 1s. This can be an interesting variant too. (filters in parallel)
My variant used above is more like a cascade or ladder (filters in serie).

Applying a bank of image filters in Matlab

I need to filter an image using a bank of filters in Matlab. My first attempt was to use a simple for loop to repeatedly call the "imfilter" function for each filter in the bank.
I will need to repeat this process many times for my application, so I need to this step to be as efficient as possible. Therefore, I was wondering if there was any way this operation could be vectorized to speed up the process. In an effort to simplify things, all of my filter kernels are the same size (9x9).
As an example of what I am going for, my filters are set up in a 9x9x32 element block, which needs to be applied to my image. I thought about replicating the image into a block (e.g. 100x100x32), but I'm not sure if there's a way to apply an operation like convolution without resorting to loops. Does anyone have suggestions for a good way of tackling this problem?
Other than pre allocating the space, there is not a faster way to arrive at an exact solution. If approximations are ok, then you might be able to decompose the 32 filters into a set of linear combinations of a smaller number of filters, say eight. See for instance Steerable filters.
http://people.csail.mit.edu/billf/papers/steerpaper91FreemanAdelson.pdf
edit: here is a tool to help apply filters to images.
function FiltIm = ApplyFilterBank(im,filters)
%#function FiltIm = ApplyFilterBank(im,filters)
%#
%#assume im is a single layer image, and filters is a cell array
nFilt = length(filters);
maxsz = 0;
for i = 1:nFilt
maxsz = max(maxsz,max(size(filters{i})));
end
FiltIm = zeros(size(im,1), size(im,2), nFilt);
im = padimage(im,maxsz,'symmetric');
for i = 1:nFilt
FiltIm(:,:,i) = unpadimage(imfilter(im,filters{i}),maxsz);
end
function o = padimage(i,amnt,method)
%#function o = padimage(i,amnt,method)
%#
%#padarray which operates on only the first 2 dimensions of a 3 dimensional
%#image. (of arbitrary number of layers);
%#
%#String values for METHOD
%# 'circular' Pads with circular repetion of elements.
%# 'replicate' Repeats border elements of A.
%# 'symmetric' Pads array with mirror reflections of itself.
%#
%#if(amnt) is length 1, then pad all sides same amount
%#
%#if(amnt) is length 2, then pad y direction amnt(1), and x direction amnt(2)
%#
%#if(amnt) is length 4, then pad sides unequally with order LTRB, left top right bottom
if(nargin < 3)
method = 'replicate';
end
if(length(amnt) == 1)
o = zeros(size(i,1) + 2 * amnt, size(i,2) + 2* amnt, size(i,3));
for n = 1:size(i,3)
o(:,:,n) = padarray(i(:,:,n),[amnt,amnt],method,'both');
end
end
if(length(amnt) == 2)
o = zeros(size(i,1) + 2 * amnt(1), size(i,2) + 2* amnt(2), size(i,3));
for n = 1:size(i,3)
o(:,:,n) = padarray(i(:,:,n),amnt,method,'both');
end
end
if(length(amnt) == 4)
o = zeros(size(i,1) + amnt(2) + amnt(4), size(i,2) + amnt(1) + amnt(3), size(i,3));
for n = 1:size(i,3)
o(:,:,n) = padarray(padarray(i(:,:,n),[amnt(2), amnt(1)],method,'pre'),[amnt(4), amnt(3)],method,'post');
end
end
function o = unpadimage(i,amnt)
%#un does padimage
%#if length(amnt == 1), unpad equal on each side
%#if length(amnt == 2), first amnt is left right, second up down
%#if length(amnt == 4), then [left top right bottom];
switch(length(amnt))
case 1
sx = size(i,2) - 2 * amnt;
sy = size(i,1) - 2 * amnt;
l = amnt + 1;
r = size(i,2) - amnt;
t = amnt + 1;
b = size(i,1) - amnt;
case 2
sx = size(i,2) - 2 * amnt(1);
sy = size(i,1) - 2 * amnt(2);
l = amnt(1) + 1;
r = size(i,2) - amnt(1);
t = amnt(2) + 1;
b = size(i,1) - amnt(2);
case 4
sx = size(i,2) - (amnt(1) + amnt(3));
sy = size(i,1) - (amnt(2) + amnt(4));
l = amnt(1) + 1;
r = size(i,2) - amnt(3);
t = amnt(2) + 1;
b = size(i,1) - amnt(4);
otherwise
error('illegal unpad amount\n');
end
if(any([sx,sy] < 1))
fprintf('unpadimage newsize < 0, returning []\n');
o = [];
return;
end
o = zeros(sy, sx, size(i,3));
for n = 1:size(i,3)
o(:,:,n) = i(t:b,l:r,n);
end
New Answer: Use colfilt() or block filtering style. Matlab can transform your image into large matrix where each distinct 9x9 pixel area is a single column (81 elements). Make it using im2col() method. If your image is N by M the result matrix would be 81 X (N-8)*(M-8).
Then you can concatenate all your filters to single matrix (each filter is a row) and multiply those huge matrices. This will give you the result of all filters. Now you have to reconstruct back 32 result images from the result matrix. use col2im() method.
For more information type 'doc colfilt'
This method works almost as fast as mex file and doesnt require any 'for' loop
Old answer:
Do you want to get different 32 results or single result for combination of filters?
If it is a sungle result than there is an easy way.
If you use linear filters (like convolutions) then apply filters one on another. Finally apply the resulting filter on the image. Thus image will be convolved only once.
If you filters are symmetric (x and y direction) then instead of applying the 9x9 filter apply 9x1 on y direction and 1x9 on x direction. Works a bit faster.
Finally, you can try using Mex file