Ones count system-verilog - system-verilog

I have a wire vector with 64 bits;
wire [63:0] sout;
I want to compute the sum of these bits or, equivalently, count the number of ones.
What is the best way to do this? (it should be synthesizable)

I prefer using for-loops, as they are easier to scale and require less typing (and are therefore less prone to typos).
SystemVerilog (IEEE Std 1800):
logic [$clog2($bits(sout)+1)-1:0] count_ones;
always_comb begin
count_ones = '0;
foreach(sout[idx]) begin
count_ones += sout[idx];
end
end
Verilog (IEEE Std 1364-2005):
parameter WIDTH = 64;
// NOTE: $clog2 was added in 1364-2005, not supported in 1364-1995 or 1364-2001
reg [$clog2(WIDTH+1)-1:0] count_ones;
integer idx;
always @* begin
count_ones = {WIDTH{1'b0}};
for( idx = 0; idx<WIDTH; idx = idx + 1) begin
count_ones = count_ones + sout[idx];
end
end

The $countones system function can be used. Refer to the IEEE Std 1800-2012, section "20.9 Bit vector system functions". It might not be synthesizable, but you did not list that as a requirement.

"Best" is rather subjective, but a simple and clear formulation would just be:
wire [6:0] sout_sum = sout[63] + sout[62] + ... + sout[1] + sout[0];
You might be able to think hard and come up with something that produces better synthesized results, but this is probably a good start until a timing tool says it's not good enough.
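One restructuring that is often tried when the flat sum is too slow is a balanced adder tree, where bits are summed pairwise level by level. The sketch below is only a Python reference model of that structure (the function names are made up for illustration, and it says nothing about what a given synthesis tool will actually build); it can be handy for generating expected values for a testbench.

```python
def popcount_tree(value: int, width: int = 64) -> int:
    """Count set bits by summing neighbours level by level (balanced adder tree)."""
    # Level 0: one 1-bit operand per input bit.
    level = [(value >> i) & 1 for i in range(width)]
    while len(level) > 1:
        # Pad odd-length levels with a zero so every adder has two inputs.
        if len(level) % 2:
            level.append(0)
        # Each adder sums two neighbours, so the list halves at every level.
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    return level[0]


def popcount_loop(value: int, width: int = 64) -> int:
    """Straight sum of the bits, mirroring the for-loop formulation."""
    return sum((value >> i) & 1 for i in range(width))


if __name__ == "__main__":
    sample = 0xDEADBEEF0000FFFF
    # Both models should agree for any input.
    print(popcount_tree(sample) == popcount_loop(sample))
```

For 64 inputs the tree has six levels, whereas the flat expression is written as a chain of 63 additions, although a synthesis tool may rebalance that chain on its own.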

The following solution uses a function to calculate the total number of set (to High) bits in a 64-bits wide bus:
function logic [6:0] AddBitsOfBus (
input [63:0] InBus
);
AddBitsOfBus[6:0] = '0;
for (int k = 0; k < 64; k += 1) begin // for loop
AddBitsOfBus[6:0] += {6'b00_0000, InBus[k]};
end
endfunction

The following synthesizable SystemVerilog functions do this for you:
$countbits(sout,'1); // Counts the # of 1's
$countbits(sout,'0); // Counts the # of 0's
$countones(sout); // equivalent to $countbits(sout,'1)
The logic the synthesis tools will produce is a different story.
Ref: IEEE Std 1800-2012, Section 20.9

Related

SystemVerilog dynamically accessing subarray

I am getting a compile error on line 9 (marked in the code below), so I am not sure how to access arrays dynamically. I have to build a logic [255:0] value from the received bytes.
(Looks like I have to review data types of SystemVerilog :( ).
Thanks in advance.
module test;
task test_array (logic [7:0] B);
static logic [255:0] l_ar_B;
l_ar_B[7:0] = B;
for(int i=0; i<32; i++)
l_ar_B[(i*8+7) : (i*8)] = B; // Error-[IRIPS] Illegal range in part select
$stop();
endtask
initial begin
$display("Start");
test_array(8'h11);
end
endmodule
When using range selection with the [M : N] syntax, M and N must be constants. You should use part-select addressing with the syntax [s +: W], where W is a constant for the width and s can be a variable indicating the starting bit position. The +: operator has been around since IEEE Std 1364-2001 (Verilog 2001). See
Indexing vectors and arrays with +:
for(int i=0; i<32; i++)
l_ar_B[(i*8) +: 8] = B;
Since you are doing replication, you can use l_ar_B = {32{B}}; to get the same result in a single step.
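To convince yourself that the indexed part-select loop and the replication produce the same 256-bit value, a small Python model can help. This is only a sketch of the bit manipulation, with hypothetical function names; it is not SystemVerilog and does not model 4-state values.

```python
def replicate_by_part_select(byte: int, count: int = 32, width: int = 8) -> int:
    """Model of: for (i = 0; i < count; i++) vec[(i*width) +: width] = byte;"""
    mask = (1 << width) - 1
    vec = 0
    for i in range(count):
        start = i * width                # variable start position, constant width
        vec &= ~(mask << start)          # clear the destination slice
        vec |= (byte & mask) << start    # write the byte into the slice
    return vec


def replicate_by_concat(byte: int, count: int = 32, width: int = 8) -> int:
    """Model of: vec = {count{byte}};"""
    mask = (1 << width) - 1
    vec = 0
    for _ in range(count):
        vec = (vec << width) | (byte & mask)
    return vec


if __name__ == "__main__":
    print(replicate_by_part_select(0x11) == replicate_by_concat(0x11))
```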

RSA hardware implementation: radix-2 montgomery multiplication issues

I'm implementing RSA 1024 in hardware (Xilinx Zynq FPGA), and am unable to figure out a few curious issues. Most notably, I am finding that my implementation only works for certain base/exponent/modulus combinations, but I have not found any reason why this is the case.
Note: I am implementing the algorithm using Xilinx HLS (essentially C code that is synthesized into hardware). For the sake of this post, treat it just like a standard C implementation, except that I can have variables up to 4096 bits wide. I haven't yet parallelized it, so it should behave just like standard C code.
The Problem
My problem is that I am able to get the correct answer for certain modular exponentiation test problems, but only if the values for the base, exponent, and modulus can be written in much fewer bits than the actual 1024 bit operand width (i.e. they are zero padded).
When I use actual 1024-bit values generated from SSH-keygen, I no longer get the correct results.
For example, if my input arguments are
uint1024_t base = 1570
uint1024_t exponent = 1019
uint1024_t modulus = 3337
I correctly get a result of 1570^1019 mod(3337) = 688
However, when I actually use values that occupy all (or approximately all) 1024 bits for the inputs...
uint1024_t base = 0x00be5416af9696937b7234421f7256f78dba8001c80a5fdecdb4ed761f2b7f955946ec920399f23ce9627f66286239d3f20e7a46df185946c6c8482e227b9ce172dd518202381706ed0f91b53c5436f233dec27e8cb46c4478f0398d2c254021a7c21596b30f77e9886e2fd2a081cadd3faf83c86bfdd6e9daad12559f8d2747
uint1024_t exponent = 0x6f1e6ab386677cdc86a18f24f42073b328847724fbbd293eee9cdec29ac4dfe953a4256d7e6b9abee426db3b4ddc367a9fcf68ff168a7000d3a7fa8b9d9064ef4f271865045925660fab620fad0aeb58f946e33bdff6968f4c29ac62bd08cf53cb8be2116f2c339465a64fd02517f2bafca72c9f3ca5bbf96b24c1345eb936d1
uint1024_t modulus = 0xb4d92132b03210f62e52129ae31ef25e03c2dd734a7235efd36bad80c28885f3a9ee1ab626c30072bb3fd9906bf89a259ffd9d5fd75f87a30d75178b9579b257b5dca13ca7546866ad9f2db0072d59335fb128b7295412dd5c43df2c4f2d2f9c1d59d2bb444e6dac1d9cef27190a97aae7030c5c004c5aea3cf99afe89b86d6d
I incorrectly get a massive number, rather than the correct answer of 29 (0x1D).
I've checked both algorithms a million times over, and have experimented with different initial values and loop bounds, but nothing seems to work.
My Implementation
I am using the standard square-and-multiply method for the modular exponentiation, and I chose the Tenca-Koc radix-2 algorithm for the Montgomery multiplication, detailed in pseudocode below:
/* Tenca-Koc radix2 montgomery multiplication */
Z = 0
for i = 0 to n-1
Z = Z + X[i]*Y
if Z is odd then Z = Z + M
Z = Z/2 // right shift in radix2
if (Z >= M) then Z = Z - M
My Montgomery multiplication implementation is as follows:
void montMult(uint1024_t X, uint1024_t Y, uint1024_t M, uint1024_t* outData)
{
ap_uint<2*NUM_BITS> S = 0;
for (int i=0; i<NUM_BITS; i++)
{
// add product of X.get_bit(i) and Y to partial sum
S += X[i]*Y;
// if S is odd, add modulus to partial sum
if (S.test(0))
S += M;
// rightshift 1 bit (divide by 2)
S = S >> 1;
}
// bring back to under 1024 bits by subtracting modulus
if (S >= M)
S -= M;
// write output data
*outData = S.range(NUM_BITS-1,0);
}
and my top-level modular exponentiation is as follows, where (switching notation!) ...
// k: number of bits
// r = 2^k (radix)
// M: base
// e: exponent
// n: modulus
// Mbar: (precomputed residue) M*r mod(n)
// xbar: (precomputed initial residue) 1*r mod(n)
void ModExp(uint1024_t M, uint1024_t e, uint1024_t n,
uint1024_t Mbar, uint1024_t xbar, uint1024_t* out)
{
for (int i=NUM_BITS-1; i>=0; i--)
{
// square
montMult(xbar,xbar,n,&xbar);
// multiply
if (e.test(i)) // if (e.bit(i) == 1)
montMult(Mbar,xbar,n,&xbar);
}
// undo montgomery residue transformation
montMult(xbar,1,n,out);
}
I can't for the life of me figure out why this works for everything except an actual 1024 bit value. Any help would be much appreciated
I've replaced my answer because I was wrong. Your original code is perfectly correct. I've tested it using my own BigInteger library, which includes Montgomery arithmetic, and everything works like a charm. Here is my code:
const
base1 =
'0x00be5416af9696937b7234421f7256f78dba8001c80a5fdecdb4ed761f2b7f955946ec9203'+
'99f23ce9627f66286239d3f20e7a46df185946c6c8482e227b9ce172dd518202381706ed0f91'+
'b53c5436f233dec27e8cb46c4478f0398d2c254021a7c21596b30f77e9886e2fd2a081cadd3f'+
'af83c86bfdd6e9daad12559f8d2747';
exponent1 =
'0x6f1e6ab386677cdc86a18f24f42073b328847724fbbd293eee9cdec29ac4dfe953a4256d7e'+
'6b9abee426db3b4ddc367a9fcf68ff168a7000d3a7fa8b9d9064ef4f271865045925660fab62'+
'0fad0aeb58f946e33bdff6968f4c29ac62bd08cf53cb8be2116f2c339465a64fd02517f2bafc'+
'a72c9f3ca5bbf96b24c1345eb936d1';
modulus1 =
'0xb4d92132b03210f62e52129ae31ef25e03c2dd734a7235efd36bad80c28885f3a9ee1ab626'+
'c30072bb3fd9906bf89a259ffd9d5fd75f87a30d75178b9579b257b5dca13ca7546866ad9f2d'+
'b0072d59335fb128b7295412dd5c43df2c4f2d2f9c1d59d2bb444e6dac1d9cef27190a97aae7'+
'030c5c004c5aea3cf99afe89b86d6d';
function MontMult(X, Y, N: BigInteger): BigInteger;
var
I: Integer;
begin
Result:= 0;
for I:= 0 to 1023 do begin
if not X.IsEven then Result:= Result + Y;
if not Result.IsEven then Result:= Result + N;
Result:= Result shr 1;
X:= X shr 1;
end;
if Result >= N then Result:= Result - N;
end;
function ModExp(B, E, N: BigInteger): BigInteger;
var
R, MontB: BigInteger;
I: Integer;
begin
R:= BigInteger.PowerOfTwo(1024) mod N;
MontB:= (B * R) mod N;
for I:= 1023 downto 0 do begin
R:= MontMult(R, R, N);
if not (E shr I).IsEven then
R:= MontMult(MontB, R, N);
end;
Result:= MontMult(R, 1, N);
end;
procedure TestMontMult;
var
Base, Expo, Modulus: BigInteger;
MontBase, MontExpo: BigInteger;
X, Y, R: BigInteger;
Mont: TMont;
begin
// convert to BigInteger
Base:= BigInteger.Parse(base1);
Expo:= BigInteger.Parse(exponent1);
Modulus:= BigInteger.Parse(modulus1);
R:= BigInteger.PowerOfTwo(1024) mod Modulus;
// Convert into Montgomery form
MontBase:= (Base * R) mod Modulus;
MontExpo:= (Expo * R) mod Modulus;
Writeln;
// MontMult test, all 3 versions output
// '0x146005377258684F3FFD8D9A70D723BDD3A2E3A160E11B7AD35A7106D4D903AB9D14A9201'+
// 'D0907CE2FC2E04A69656C38CE64AA0BADF2376AEFB19D8732CE2B3650466E31BB78CF24F4E3'+
// '774A78575738B668DA0E40C8DDDA972CE101E0CADC5D4CCFF6EF2E4E97AF02F34E3AB7258A7'+
// '323E472FC051825FFC72ADC53B0DAF3C4';
Writeln('Using MontMult');
Writeln(MontMult(MontMult(MontBase, MontExpo, Modulus), 1, Modulus).ToHexString);
// same using TMont instance
Writeln('Using TMont.Multiply');
Mont:= TMont.GetInstance(Modulus);
Writeln(Mont.Reduce(Mont.Multiply(MontBase, MontExpo)).ToHexString);
Writeln('Using TMont.ModMul');
Writeln(Mont.ModMul(Base,Expo).ToHexString);
// ModExp test, all 3 versions output 29
Writeln('Using ModExp');
Writeln(ModExp(Base, Expo, Modulus).ToString);
Writeln('Using BigInteger.ModPow');
Writeln(BigInteger.ModPow(Base, Expo, Modulus).ToString);
Writeln('Using TMont.ModPow');
Writeln(Mont.ModPow(Base, Expo).ToString);
end;
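The same cross-check can be scripted in any language with arbitrary-precision integers. Below is a minimal Python sketch of the radix-2 Montgomery product and the square-and-multiply loop from the question, compared against Python's built-in pow. The names mont_mult and mod_exp and the nbits parameter are my own choices for illustration, and the model assumes an odd modulus smaller than 2^nbits.

```python
def mont_mult(x: int, y: int, m: int, nbits: int) -> int:
    """Radix-2 Montgomery product: x * y * 2^(-nbits) mod m (m odd, x and y < m)."""
    z = 0
    for i in range(nbits):
        if (x >> i) & 1:      # add Y when bit i of X is set
            z += y
        if z & 1:             # make the partial sum even by adding the modulus
            z += m
        z >>= 1               # exact division by 2
    if z >= m:                # single final correction step
        z -= m
    return z


def mod_exp(base: int, exponent: int, modulus: int, nbits: int) -> int:
    """Left-to-right square-and-multiply in the Montgomery domain."""
    r = (1 << nbits) % modulus            # 1 in Montgomery form
    base_bar = (base * r) % modulus       # base in Montgomery form
    x_bar = r
    for i in range(nbits - 1, -1, -1):
        x_bar = mont_mult(x_bar, x_bar, modulus, nbits)
        if (exponent >> i) & 1:
            x_bar = mont_mult(base_bar, x_bar, modulus, nbits)
    return mont_mult(x_bar, 1, modulus, nbits)   # leave the Montgomery domain


if __name__ == "__main__":
    # Small test vector from the question, run with the full 1024-bit loop count.
    print(mod_exp(1570, 1019, 3337, 1024) == pow(1570, 1019, 3337))
```

Because Python integers grow as needed, this model cannot reproduce fixed-width effects, which makes it useful as a source of known-good intermediate values.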
Update: I finally was able to fix the issue, after I ported my design to Java to check my intermediate values in the debugger. The design ran flawlessly in Java with no modifications to the code structure, and this tipped me off as to what was going wrong.
The problem came to light after getting correct intermediate values using the BigInteger java package. The HLS arbitrary precision library has a fixed bitwidth (obviously, since it synthesizes down to hardware), whereas the software BigInteger libraries are flexible bit widths. It turns out that the addition operator treats both arguments as signed values if they are different bit-widths, despite the fact that I declared them as unsigned. Thus, when there was a 1 in the MSB of an intermediate value and I tried to add it to a greater value, it treated the MSB as a sign bit and attempted to sign extend it.
This did not happen with the Java BigInt library, which quickly pointed me towards the problem.
If anyone is interested in a Java implementation of modular exponentiation using the Tenca-Koc radix2 algorithm for montgomery multiplication, you can find the code here: https://github.com/bigbrett/MontModExp-radix2
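To illustrate the mechanism described above, here is a small Python model of adding a narrower operand to a wider accumulator, once with zero extension and once with sign extension. It only demonstrates the arithmetic effect; it is not a model of the actual HLS arbitrary-precision library, and the function names and widths are made up for the example.

```python
def zero_extend(value: int, from_bits: int, to_bits: int) -> int:
    """Widen an unsigned value: upper bits are filled with zeros."""
    return value & ((1 << from_bits) - 1)


def sign_extend(value: int, from_bits: int, to_bits: int) -> int:
    """Widen a value as if it were signed: upper bits copy the MSB."""
    value &= (1 << from_bits) - 1
    if value >> (from_bits - 1):          # MSB set, so interpret as negative
        value -= 1 << from_bits
    return value & ((1 << to_bits) - 1)   # back to an unsigned bit pattern


def add_fixed(acc: int, operand: int, operand_bits: int, acc_bits: int,
              signed_operand: bool) -> int:
    """Add a narrower operand to a wider accumulator, wrapping at acc_bits."""
    extend = sign_extend if signed_operand else zero_extend
    return (acc + extend(operand, operand_bits, acc_bits)) & ((1 << acc_bits) - 1)


if __name__ == "__main__":
    # 8-bit operand with its MSB set, added to a 16-bit accumulator.
    print(hex(add_fixed(0x0100, 0x90, 8, 16, signed_operand=False)))
    print(hex(add_fixed(0x0100, 0x90, 8, 16, signed_operand=True)))
```

The two results differ only when the MSB of the narrower operand is set, which matches the observation that small, zero-padded test values worked while full-width values did not.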

Primitive root of a number

I have tried to implement the algorithm described here to find primitive roots for a prime number.
It works for small prime numbers; however, when I try big numbers, it no longer returns correct answers.
I then noticed that a^((p-1)/p_i) tends to be a big number (it returns Inf in MATLAB), so I thought factorizing (p-1) could help, but I am failing to see how.
I wrote a small piece of code in MATLAB, and here it is.
clear all
clc
%works with prime =23,31,37,etc.
prime=761; %doesn't work for this value
F=factor(prime-1); % the factors of prime-1
for i = 2: prime-1
a=i;
tag =1;
for j= 1 :prime-1
if (isprime(j))
p_i = j;
if(mod(a^((prime-1)/p_i),prime)== 1)
tag=0;
break
else
tag = tag +1;
end
end
end
if (tag > 1 )
a %it should only print the primitive root
break
end
end
Any input is welcome.
Thanks
What MATLAB does in this case is calculate a^((p-1)/p_i) before taking the modulus. As a^((p-1)/p_i) quickly becomes too large to handle, MATLAB seems to resolve this by turning it into a floating-point number, losing some resolution and yielding the wrong result when you take the modulus.
As mentioned by @rayreng, you could use an arbitrary precision toolbox to resolve this.
Alternatively, you could split the exponentiation into parts, taking the modulus at each stage. This should be faster, as it is less memory intensive. You could dump this in a function and just call that.
% Calculates a^b mod c
i = 0;
result = 1;
while i < b
result = mod(result*a, c);
i = i + 1;
end
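The loop above needs b multiplications. A common refinement is exponentiation by squaring, which needs on the order of log2(b) steps while still reducing modulo c at every stage. Here is a sketch in Python (not MATLAB) that also applies the primitive-root test using only the prime factors of p-1. The function names are hypothetical, the trial-division factorisation is only meant for small p, and p is assumed to be an odd prime.

```python
def mod_pow(a: int, b: int, c: int) -> int:
    """Compute a^b mod c by squaring, reducing after every multiplication."""
    result = 1
    a %= c
    while b > 0:
        if b & 1:                 # current exponent bit is set
            result = (result * a) % c
        a = (a * a) % c           # square for the next bit
        b >>= 1
    return result


def prime_factors(n: int) -> set:
    """Distinct prime factors of n by trial division (fine for small n)."""
    factors = set()
    d = 2
    while d * d <= n:
        while n % d == 0:
            factors.add(d)
            n //= d
        d += 1
    if n > 1:
        factors.add(n)
    return factors


def smallest_primitive_root(p: int):
    """Smallest g with g^((p-1)/q) != 1 (mod p) for every prime factor q of p-1."""
    factors = prime_factors(p - 1)
    for g in range(2, p):
        if all(mod_pow(g, (p - 1) // q, p) != 1 for q in factors):
            return g
    return None


if __name__ == "__main__":
    for prime in (23, 31, 37, 761):
        print(prime, smallest_primitive_root(prime))
```

Note that the test only uses the prime factors of p-1, so the exponent (p-1)/q is always an integer.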

Implement FIR Filter in Verilog

I am trying to implement an FIR filter in Verilog. I have predetermined the coefficients in MATLAB. But I am not sure whether the registers will propagate properly with this code.
module fir_filter(
input clock,
input reset,
input wire[15:0] input_sample,
output reg[15:0] output_sample);
parameter N = 13;
reg signed[15:0] coeffs[12:0];
reg [15:0] holderBefore[12:0];
wire [15:0] toAdd[12:0];
always @(*)
begin
coeffs[0]=6375;
coeffs[1]=1;
coeffs[2]=-3656;
coeffs[3]=3;
coeffs[4]=4171;
coeffs[5]=4;
coeffs[6]=28404;
coeffs[7]=4;
coeffs[8]=4171;
coeffs[9]=3;
coeffs[10]=-3656;
coeffs[11]=1;
coeffs[12]=6375;
end
genvar i;
generate
for (i=0; i<N; i=i+1)
begin: mult
multiplier mult1(
.dataa(coeffs[i]),
.datab(holderBefore[i]),
.result(toAdd[i]));
end
endgenerate
always @(posedge clock or posedge reset)
begin
if(reset)
begin
holderBefore[12] <= 0;
holderBefore[11] <= 0;
holderBefore[10] <= 0;
holderBefore[9] <= 0;
holderBefore[8] <= 0;
holderBefore[7] <= 0;
holderBefore[6] <= 0;
holderBefore[5] <= 0;
holderBefore[4] <= 0;
holderBefore[3] <= 0;
holderBefore[2] <= 0;
holderBefore[1] <= 0;
holderBefore[0] <= 0;
output_sample <= 0;
end
else
begin
holderBefore[12] <= holderBefore[11];
holderBefore[11] <= holderBefore[10];
holderBefore[10] <= holderBefore[9];
holderBefore[9] <= holderBefore[8];
holderBefore[8] <= holderBefore[7];
holderBefore[7] <= holderBefore[6];
holderBefore[6] <= holderBefore[5];
holderBefore[5] <= holderBefore[4];
holderBefore[4] <= holderBefore[3];
holderBefore[3] <= holderBefore[2];
holderBefore[2] <= holderBefore[1];
holderBefore[1] <= holderBefore[0];
holderBefore[0] <= input_sample;
output_sample <= (input_sample + toAdd[0] + toAdd[1] +
toAdd[2] + toAdd[3] + toAdd[4] + toAdd[5] +
toAdd[6] + toAdd[7] + toAdd[8] + toAdd[9] +
toAdd[10] + toAdd[11] + toAdd[12]);
end
end
endmodule
Is this the best way to implement this? Is there a better way to do the addition?
Any help is greatly appreciated!
Also resources that would help are also greatly appreciated.
Area and power efficient FIR/IIR filters are the holy grail for some.
Using generate statements you have instantiated 13 multipliers. Multipliers take up quite a lot of area. It is common to only instantiate one and time multiplex it (TDM). In this case supply a clock (tick) 13 times faster than the required output rate.
Your adder chain, while it looks valid, is going to be very big and could lead to timing problems, as there could be very long ripple chains. Breaking this down over multiple cycles might result in lower area and power.
If you combine the multiplication of a sample with the addition you will have a more typical MAC architecture (Multiply Accumulate).
I would also avoid initialising constants in an always @*: since nothing on the right-hand sides of the assignments ever changes, the block may never be triggered in simulation.
For these I would use localparams, or if going down the TDM route I would create a Look up table (LUT).
always @* begin
case( program_counter )
0 : coeff = 6375;
1 : coeff = 1 ;
...
endcase
end
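To make the TDM idea concrete, here is a behavioural Python sketch of a single multiply-accumulate unit stepping through a coefficient look-up table, one tap per fast clock tick, compared against the fully parallel sum. It ignores bit widths, signedness handling and pipelining, and the names (coeff_lut, fir_tdm_mac and so on) are illustrative only.

```python
COEFFS = [6375, 1, -3656, 3, 4171, 4, 28404, 4, 4171, 3, -3656, 1, 6375]


def coeff_lut(program_counter: int) -> int:
    """Stand-in for the case-statement LUT: one coefficient per counter value."""
    return COEFFS[program_counter]


def fir_direct(delay_line: list) -> int:
    """Fully parallel form: 13 multipliers feeding one big adder."""
    return sum(c * s for c, s in zip(COEFFS, delay_line))


def fir_tdm_mac(delay_line: list) -> int:
    """One multiplier reused over 13 fast ticks, accumulating as it goes."""
    acc = 0
    for program_counter in range(len(COEFFS)):   # one tap per fast clock tick
        acc += coeff_lut(program_counter) * delay_line[program_counter]
    return acc


def run_filter(samples: list, kernel) -> list:
    """Shift each sample into the delay line and evaluate the chosen kernel."""
    delay_line = [0] * len(COEFFS)
    outputs = []
    for sample in samples:
        delay_line = [sample] + delay_line[:-1]
        outputs.append(kernel(delay_line))
    return outputs


if __name__ == "__main__":
    stimulus = [3, -1, 4, 1, -5, 9, 2, -6]
    print(run_filter(stimulus, fir_tdm_mac) == run_filter(stimulus, fir_direct))
```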
Assuming your choice of filter response is justified (5.2 dB ripple!), one approach is to trade off some response accuracy for reduced chip resources by using the canonical signed digit representation [http://en.wikipedia.org/wiki/Canonical_signed_digit] to approximate each coefficient. This strength reduction [http://en.wikipedia.org/wiki/Strength_reduction] (a compiler term) allows cheap shifts (i.e. routing) and adds to be used instead of expensive multiplies.
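As a rough illustration of the signed-digit idea, the Python sketch below converts an integer coefficient to digits in {-1, 0, 1} with no two adjacent non-zero digits, and then multiplies using only shifts, adds and subtracts. This is the exact conversion; approximating a coefficient by dropping low-weight digits is a further step that is not shown. The function names are my own.

```python
def to_csd(n: int) -> list:
    """Canonical signed digits of n, least significant first, each in {-1, 0, 1}."""
    digits = []
    while n != 0:
        if n & 1:
            digit = 2 - (n % 4)   # n % 4 == 1 gives +1, n % 4 == 3 gives -1
            n -= digit            # after this n is divisible by 4
        else:
            digit = 0
        digits.append(digit)
        n //= 2
    return digits


def csd_multiply(x: int, digits: list) -> int:
    """Multiply x by the encoded constant using shifts and adds/subtracts only."""
    acc = 0
    for shift, digit in enumerate(digits):
        if digit == 1:
            acc += x << shift
        elif digit == -1:
            acc -= x << shift
    return acc


if __name__ == "__main__":
    for coeff in (6375, -3656, 4171, 28404):
        digits = to_csd(coeff)
        nonzero = sum(1 for d in digits if d)
        # The number of non-zero digits is the number of add/subtract terms.
        print(coeff, nonzero, csd_multiply(1000, digits) == 1000 * coeff)
```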
Then due to the symmetry of the coefficients the respective samples can be summed before applying the coefficient, which significantly drops the required chip resources.[1]
There are also likely to be common factors among the coefficients; a synthesis tool for a chip target may already exploit some of these, but in firmware exploiting them can yield significant improvements.
[1] = DSP Tricks: An odd way to build a simplified FIR filter structure Richard G. Lyons
try http://www.embedded.com/design/embedded/4008837/DSP-Tricks-An-odd-way-to-build-a-simplified-FIR-filter-structure
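The symmetry trick can be sketched in a few lines of Python: samples that share a coefficient are added first and multiplied once. For the 13 symmetric taps in the question this takes 6 pre-added pairs plus the centre tap, i.e. 7 multiplies instead of 13. This is a behavioural model with made-up names that ignores bit growth in the pre-adders.

```python
COEFFS = [6375, 1, -3656, 3, 4171, 4, 28404, 4, 4171, 3, -3656, 1, 6375]


def fir_direct(delay_line: list, coeffs: list) -> int:
    """One multiply per tap."""
    return sum(c * s for c, s in zip(coeffs, delay_line))


def fir_folded(delay_line: list, coeffs: list) -> int:
    """Pre-add samples that share a coefficient (assumes coeffs are symmetric)."""
    n = len(coeffs)
    acc = 0
    for k in range(n // 2):
        # Taps k and n-1-k have the same coefficient, so add the samples first.
        acc += coeffs[k] * (delay_line[k] + delay_line[n - 1 - k])
    if n % 2:
        # An odd-length filter has one unpaired centre tap.
        acc += coeffs[n // 2] * delay_line[n // 2]
    return acc


if __name__ == "__main__":
    line = [5, -2, 7, 0, 3, 9, -4, 1, 8, -6, 2, 11, -3]
    print(fir_folded(line, COEFFS) == fir_direct(line, COEFFS))
```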

Code or compiler: optimizing a IIR filter in C for the iPhone 4 and later

I've been profiling my almost-finished project and I'm seeing that about three-quarters of the CPU time is spent in this IIR filter function (which is called hundreds of thousands of times in about a second currently on the target hardware) so with everything else working well I am wondering if it can be optimized for my specific hardware and software target. My targets are only iPhone 4 and newer, only iOS 4.3 and newer, only LLVM 4.x. A little bit of imprecision is probably OK if there are gains to be made.
static float filter(const float a, const float *b, const float c, float *d, const int e, const float x)
{
float return_value = 0;
d[0] = x;
d[1] = c * d[0] + a * d[1];
int j;
for (j = 2; j <= e; j++) {
return_value += (d[j] += a * (d[j + 1] - d[j - 1])) * b[j];
}
for (j = e + 1; j > 1; j--) {
d[j] = d[j - 1];
}
return (return_value);
}
Any suggestions about speeding it up appreciated, also interested in your opinion if it is possible to optimize beyond the default compiler optimization at all. I am wondering if it is something where NEON SIMD would help (that is new ground for me) or if VFP can be exploited, or if LLVM autovectorization would help.
I've tried the following LLVM flags:
-ffast-math (didn't make a notable difference)
-O4 (made a big difference on the iPhone 4S with a 25% reduction in time, but no notable difference on my minimum target device the iPhone 4, improvement of which is my main goal)
-O3 -mllvm -unroll-allow-partial -mllvm -unroll-runtime -funsafe-math-optimizations -ffast-math -mllvm -vectorize -mllvm -bb-vectorize-aligned-only (LLVM autovectorization flags from Hal Finkel's slides here: http://llvm.org/devmtg/2012-04-12/Slides/Hal_Finkel.pdf, made things slower than the default LLVM optimization for an Xcode release target)
Open to other flags, different approaches, and changes to the function. I'd prefer to leave the input and return types and values alone. There is actually a discussion of using NEON intrinsic functions for FIR here: https://pixhawk.ethz.ch/_media/software/optimization/neon_support_in_the_arm_compiler.pdf but I don't have quite enough experience with its subject to successfully apply the information to my own case. Thank you for any clarification.
EDIT My apologies for not noting this sooner. After investigating aka.nice's suggestion I noticed that the values passed in for e, a and c are always the same values and I know them before runtime, so approaches incorporating this info are an option.
Here are some transformations that could be made on the code to use vDSP routines. These transformations make use of various temporary buffers named T0, T1, and T2. Each of these is an array of float with enough space for e-1 elements.
First, use a temporary buffer to compute a * b[j]. This changes the original code:
for (j = 2; j <= e; j++) {
return_value += (d[j] += a * (d[j + 1] - d[j - 1])) * b[j];
}
to:
vDSP_vsmul(b+2, 1, &a, T0, 1, e-1);
for (j = 2; j <= e; j++)
return_value += (d[j] += (d[j+1] - d[j-1])) * T0[j-2];
Then use vDSP_vmul to compute d[j+1] * T0[j-2]:
vDSP_vsmul(b+2, 1, &a, T0, 1, e-1);
vDSP_vmul(d+3, 1, T0, 1, T1, 1, e-1);
for (j = 2; j <= e; j++)
return_value += (d[j] += T1[j-2] - d[j-1] * T0[j-2];
Next, promote vDSP_vmul to vDSP_vma (vector multiply add) to compute d[j] + d[j+1] * T0[j-2]:
vDSP_vsmul(b+2, 1, &a, T0, 1, e-1);
vDSP_vma(d+3, 1, T0, 1, d+2, 1, T1, 1, e-1);
for (j = 2; j <= e; j++)
return_value += (d[j] = T1[j-2] - d[j-1] * T0[j-2];
I suppose I would time that and see if there is any improvement. There are some issues:
SIMD code works best when data is 16-byte aligned. The use of array indices such as j-1 and j+1 prevents this. The ARM processors in phones are not as bad with unaligned data as some other processors, but performance will vary from model to model.
If e is large (more than a few thousand), then T0 and d may be evicted from cache during the vDSP_vma operation, and the following loop will have to reload them. There is a technique called strip mining to reduce the effect of this. I will not detail it now, but, essentially, the operation is partitioned into smaller strips of the array.
The IIR in the final loop may still bottleneck the processor. There are routines in vDSP for performing some IIRs (such as vDSP_deq22), but it is not clear whether this filter can be expressed in a way that is a good enough match to a vDSP routine to gain more performance than might be lost by the transformation.
The summation in the final loop to calculate return_value could also be removed from the loop and replaced with a vDSP routine (likely vDSP_sve), but I suspect the slack caused by the IIR will permit the additions to be done without adding significant execution time to the loop.
The above is off the top of my head; I have not tested the code. I suggest making the transformations one-by-one so you can test the code after each change and identify any errors before going on.
If you can find a satisfactory filter that is not an IIR, more performance optimizations may be available.
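One way to follow the test-after-each-change advice is to keep a plain reference model of the original function and compare every variant against it on the same inputs. The Python sketch below ports the question's loop directly and checks it against one illustrative split, in which the part that depends only on not-yet-updated values, d[j] + a*d[j+1], is computed up front, while the recurrence on the freshly updated d[j-1] stays in the loop. This split is my own example, not the exact vDSP sequence above, and the two versions can differ by floating-point rounding.

```python
import math


def filter_reference(a, b, c, d, e, x):
    """Direct port of the original C function; d (length e+2) is updated in place."""
    d[0] = x
    d[1] = c * d[0] + a * d[1]
    return_value = 0.0
    for j in range(2, e + 1):
        d[j] += a * (d[j + 1] - d[j - 1])
        return_value += d[j] * b[j]
    for j in range(e + 1, 1, -1):      # shift the state up by one element
        d[j] = d[j - 1]
    return return_value


def filter_split(a, b, c, d, e, x):
    """Variant: hoist the part that only reads not-yet-updated elements of d."""
    d[0] = x
    d[1] = c * d[0] + a * d[1]
    # Vectorisable: d[j] and d[j+1] still hold their old values here.
    t = [d[j] + a * d[j + 1] for j in range(2, e + 1)]
    return_value = 0.0
    for j in range(2, e + 1):
        d[j] = t[j - 2] - a * d[j - 1]  # serial dependency on the updated d[j-1]
        return_value += d[j] * b[j]
    for j in range(e + 1, 1, -1):
        d[j] = d[j - 1]
    return return_value


if __name__ == "__main__":
    b = [0.0, 0.0, 0.5, -0.25, 0.125, 0.75]
    d_ref = [0.0, 0.3, -0.2, 0.7, 0.1, -0.4, 0.9]
    d_new = list(d_ref)
    for sample in (1.0, -0.5, 0.25, 0.0):
        y_ref = filter_reference(0.6, b, 0.8, d_ref, 5, sample)
        y_new = filter_split(0.6, b, 0.8, d_new, 5, sample)
        print(math.isclose(y_ref, y_new, rel_tol=1e-9, abs_tol=1e-12))
```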
I'd prefer to leave the input and return types and values alone…
Nevertheless, moving your rendering from float to integer would help considerably.
Localizing that change to the implementation you present won't be useful. But if you expand it to reimplement just the FIR as integer, it can quickly pay off (unless the sizes are guaranteed to always be incredibly small -- then conversion/move times cost more). Of course, moving larger portions of the render graph to integer will introduce larger gains and require even fewer conversions.
Another consideration would be to look at the utilities in Accelerate.framework (potentially saving you from writing your own asm).
I tried this little exercise: rewriting your filter with the delay operator z.
For example, for e=4, I renamed the input u and the output y:
d1*z= u
d2*z= c*u + a*d1
d3*z= d2 + a*(d3-d1*z)
d4*z= d3 + a*(d4-d2*z)
d5*z= d4 + a*(d5-d3*z)
y = (b2*d3*z + b3*d4*z + b4*d5*z)
Note that the di are the filter states.
d3*z is the next value of d3 (it appears to be variable d2 in your code)
You can then eliminate the di to write the transfer function y/u in z.
You will then find that a minimal representation requires only e states, by factoring/simplifying the above transfer function.
Denominator is z*(z-a)^3, that is a pole at 0, and another at a with multiplicity (e-1).
You can then put your filter in a standard state space matrix representation:
z*X = A*X + B*u
y = C*X + d*u
With the particular form of poles, you can decompose the transfer in partial fraction expansion and obtain the matrices A & B in this special form (matlab like notations)
A = [0 1 0 0; B=[0;
0 a 1 0; 0;
0 0 a 1; 0;
0 0 0 a] 1]
C & d are a bit less easy though...
They are extracted from the numerators and direct term of partial fraction expansion
They are polynomials in bi, c (degree 1) and a (degree e)
For e=4, I have
C=[ a^3*b2 - a^2*b3 + a*b4 ,
-a^2*b2 + a*b3 + (c-a^2)*b4 ,
a*b2 + (c-a^2)*b3 + (-2*a^2*c-2*a^2-a+a^4)*b4 ,
(c-a^2)*b2 + (-a^2-a^2*c-a)*b3 + (-2*a*c+2*a^3)*b4 ]
d= -a*b2 - a*c*b3 + a^2*b4
If you can find the recurrence in e governing C & d, and precompute them
then the filter can be reduced to those simple vector ops:
z*X = a*[ 0; x2 ; x3 ; x4 ... xe ] + [x2 ; x3 ; x4 ... xe ; u ];
Y = C*[ x1 ; x2 ; x3 ; x4 ... xe ] + d*u
Or expressed as a function (Xnext,y)=filter(X,u,a,C,d,e) pseudo code:
y = dot_product( C , X) + d*u; // (like BLAS _DOT)
Xnext(1:e-1) = X(2:e); // this is a memcopy (like BLAS _COPY)
Xnext(e)=u;
X(1)=0;
Xnext=a*X+Xnext; // this is an inplace vector muladd (like BLAS _AXPY)
X=Xnext; // another memcopy outside the function (can be moved inside).
Note that if you use BLAS functions, your code will be portable to many hardware platforms, not just Apple-centric ones, and I guess the performance won't be much different.
EDIT: about partial fraction expansion
The pure partial fraction expansion would give a diagonal state space representation, and a matrix B full of 1s. This can be an interesting variant too. (filters in parallel)
My variant used above is more like a cascade or ladder (filters in series).