I am trying to implement an FIR filter in Verilog. I have predetermined the coefficients in MATLAB. But I am not sure whether the registers will propagate properly with this code.
module fir_filter(
input clock,
input reset,
input wire[15:0] input_sample,
output reg[15:0] output_sample);
parameter N = 13;
reg signed[15:0] coeffs[12:0];
reg [15:0] holderBefore[12:0];
wire [15:0] toAdd[12:0];
always #(*)
begin
coeffs[0]=6375;
coeffs[1]=1;
coeffs[2]=-3656;
coeffs[3]=3;
coeffs[4]=4171;
coeffs[5]=4;
coeffs[6]=28404;
coeffs[7]=4;
coeffs[8]=4171;
coeffs[9]=3;
coeffs[10]=-3656;
coeffs[11]=1;
coeffs[12]=6375;
end
genvar i;
generate
for (i=0; i<N; i=i+1)
begin: mult
multiplier mult1(
.dataa(coeffs[i]),
.datab(holderBefore[i]),
.result(toAdd[i]));
end
endgenerate
always #(posedge clock or posedge reset)
begin
if(reset)
begin
holderBefore[12] <= 0;
holderBefore[11] <= 0;
holderBefore[10] <= 0;
holderBefore[9] <= 0;
holderBefore[8] <= 0;
holderBefore[7] <= 0;
holderBefore[6] <= 0;
holderBefore[5] <= 0;
holderBefore[4] <= 0;
holderBefore[3] <= 0;
holderBefore[2] <= 0;
holderBefore[1] <= 0;
holderBefore[0] <= 0;
output_sample <= 0;
end
else
begin
holderBefore[12] <= holderBefore[11];
holderBefore[11] <= holderBefore[10];
holderBefore[10] <= holderBefore[9];
holderBefore[9] <= holderBefore[8];
holderBefore[8] <= holderBefore[7];
holderBefore[7] <= holderBefore[6];
holderBefore[6] <= holderBefore[5];
holderBefore[5] <= holderBefore[4];
holderBefore[4] <= holderBefore[3];
holderBefore[3] <= holderBefore[2];
holderBefore[2] <= holderBefore[1];
holderBefore[1] <= holderBefore[0];
holderBefore[0] <= input_sample;
output_sample <= (input_sample + toAdd[0] + toAdd[1] +
toAdd[2] + toAdd[3] + toAdd[4] + toAdd[5] +
toAdd[6] + toAdd[7] + toAdd[8] + toAdd[9] +
toAdd[10] + toAdd[11] + toAdd[12]);
end
end
endmodule
Is this the best way to implement this? is there a better way to do the addition?
Any help is greatly appreciated!
Also resources that would help are also greatly appreciated.
Area and power efficient FIR/IIR filters are the holy grail for some.
Using generate statements you have instantiated 13 multipliers. Multipliers take up quite a lot of area. It is common to only instantiate one and time multiplex it (TDM). In this case supply a clock (tick) 13 times faster than the required output rate.
Your adder chain while looking valid again is going to be very big and could lead to timing problems as there could be very long ripple chains. Breaking this down over multiple cycles might result in lower area and power.
If you combine the multiplication of a sample with the addition you will have a more typical MAC architecture (Multiply Accumulate).
I would also avoid initialising constants in an always #* as no right hand sides of arguments change this may not trigger the sensitivity list.
For these I would use localparams, or if going down the TDM route I would create a Look up table (LUT).
always #* begin
case( program_counter )
0 : coeff = 6375;
1 : coeff = 1 ;
...
endcase
end
Assuming your choice of filter response is justified (5.2dB ripple!)
Then an approach is to tradeoff some response accuracy for reduced chip resources by using Canonical signed digit representation [http://en.wikipedia.org/wiki/Canonical_signed_digit] to approximate each coefficient. This Strength reduction [http://en.wikipedia.org/wiki/Strength_reduction] (compiler term) allows efficient shifts ie routing and adds to be used instead of expensive multiplies.
Then due to the symmetry of the coefficients the respective samples can be summed before applying the coefficient, which significantly drops the required chip resources.[1]
But then there is likely to be common factors in the coefficients implemented, which for a chip target may get some optimisation but for firmware significant improvements can be made.
[1] = DSP Tricks: An odd way to build a simplified FIR filter structure Richard G. Lyons
try http://www.embedded.com/design/embedded/4008837/DSP-Tricks-An-odd-way-to-build-a-simplified-FIR-filter-structure
Related
I have a wire vector with 64 bits;
wire [63:0] sout;
I want to compute the sum of these bits or, equivalently, count the number of ones.
What is the best way to do this? (it should be synthesizable)
I prefer using for-loops as they are easier to scale and require less typing (and thereby less prone to typos).
SystemVerilog (IEEE Std 1800):
logic [$clog2($bits(sout)+1)-1:0] count_ones;
always_comb begin
count_ones = '0;
foreach(sout[idx]) begin
count_ones += sout[idx];
end
end
Verilog (IEEE Std 1364-2005):
parameter WIDTH = 64;
// NOTE: $clog2 was added in 1364-2005, not supported in 1364-1995 or 1364-2001
reg [$clog2(WIDTH+1)-1:0] count_ones;
integer idx;
always #* begin
count_ones = {WIDTH{1'b0}};
for( idx = 0; idx<WIDTH; idx = idx + 1) begin
count_ones = count_ones + sout[idx];
end
end
The $countones system function can be used. Refer to the IEEE Std 1800-2012, section "20.9 Bit vector system functions". It might not be synthesizable, but you did not list that as a requirement.
"Best" is rather subjective, but a simple and clear formulation would just be:
wire [6:0] sout_sum = sout[63] + sout[62] + ... + sout[1] + sout[0];
You might be able to think hard and come up with something that produces better synthesized results, but this is probably a good start until a timing tool says it's not good enough.
The following solution uses a function to calculate the total number of set (to High) bits in a 64-bits wide bus:
function logic [6:0] AddBitsOfBus (
input [63:0] InBus
);
AddBitsOfBus[2:0] = '0;
for (int k = 0; k < 64; k += 1) begin // for loop
AddBitsOfBus[6:0] += {6'b00_0000, InBus[k]};
end
endfunction
The following synthesizable SystemVerilog functions do this for you:
$countbits(sout,'1); // Counts the # of 1's
$countbits(sout,'0); // Counts the # of 0's
$countones(sout); // equivalent to $countbits(sout,'1)
The logic the synthesis tools will produce is a different story.
Ref: IEEE Std 1800-2012, Section 20.9
I'm trying to find an as efficient as possible way to store and call my matlab shape-functions. I have an interval x=linspace(0,20) and a position-vector
count = 10;
for i=1:count
pos(i)=rand()*length(x);
end
And now, I want to put on every position pos(j) shape functions like Gauss-kernels with compact support or hat-functions or something similar (it should be possible to change the prototype function easily). The support of the function is controlled by a so-called smoothing length h.
So I constructed a function in a .m-file like (e.g. a cubic spline)
function y = W_shape(x,pos,h)
l=length(x);
y=zeros(1,l);
if (h>0)
for i=1:l
if (-h <= x(i)-pos && x(i)-pos < -h/2)
y(i) = (x(i)-pos+h)^2;
elseif (-h/2 <= x(i)-pos && x(i)-pos <= h/2)
y(i) = -(x(i)-pos)^2 + h^2/2;
elseif (h/2 < x(i)-pos && x(i)-pos <=h)
y(i) = (x(i)-pos-h)^2;
end
end
else
error('h must be positive')
end
And then construct my functions on the interval x like
w = zeros(count,length(x));
for j=1:count
w(j,:)=W_shape(x,pos(j),h);
end
So far so good, but when I make x=linspace(0,20,10000) and count=1000, it takes my computer (Intel Core-i7) several minutes to calculate the whole stuff.
Since it should be some kind of PDE-Solver, this procedure has to be done in every time-step (under particular circumstances).
I think my problem is, that I use x as an argument for my function-call and store every function instead of store just one and shift it, but my matlab-knowledge is not that good, so any advices? Fyi: I need the integral of the areas, where two or more function-supports intersect...and when I'm done with this in 1D, I wanna do it for 2D-functions, so it has to be efficient anyways
One initial vectorization would be to remove the for loop in the W_shape function:
for i=1:l
if (-h <= x(i)-pos && x(i)-pos < -h/2)
y(i) = (x(i)-pos+h)^2;
elseif (-h/2 <= x(i)-pos && x(i)-pos <= h/2)
y(i) = -(x(i)-pos)^2 + h^2/2;
elseif (h/2 < x(i)-pos && x(i)-pos <=h)
y(i) = (x(i)-pos-h)^2;
end
end
Could become
xmpos=x-pos; % compute once and store instead of computing numerous times
inds1=(-h <= xmpos) & (xmpos < -h/2);
y(inds1) = (xmpos(inds1)+h).^2;
inds2=(-h/2 < xmpos) & (xmpos <= h/2);
y(inds2) = -(xmpos(inds2).^2 + h^2/2;
inds3=(h/2 < xmpos) & (xmpos <=h);
y(inds3) = (xmpos(inds3)-h).^2;
There is probably better optimisations than this.
EDIT:
I forgot to mention, you should use the profiler to find out what is actually slow!
profile on
run_code
profile viewer
I try to compile the following code with Dymola:
class abc
import Modelica.SIunits;
parameter SIunits.Time delta_t=0.5;
constant Real a[:]={4,2,6,-1,3,5,7,4,-3,-6};
Real x;
Integer j(start=1);
Integer k=size(a, 1);
algorithm
when {(sample(0, delta_t) and j < k),j == 1} then
x := a[j];
j := j + 1;
end when;
end abc;
and for time = 0 the variable j starts with 2. But it should start with j = 1.
Does anybody have an idea for this problem?
Keep in mind that sample(x,y) means that sample is true at x+i*y where i starts at zero. Which is to say that sample(0, ...) becomes true at time=0.
Since j starts at 1 and k is presumably more than 1, it doesn't seem unexpected to me that sample(0, delta_t) and j<k should become true at the start of the simulation.
I suspect what you want is:
class abc
import Modelica.SIunits;
parameter SIunits.Time delta_t=0.5;
constant Real a[:]={4,2,6,-1,3,5,7,4,-3,-6};
Real x;
Integer j(start=1);
Integer k=size(a, 1);
algorithm
when {(sample(delta_t, delta_t) and j < k),j == 1} then
x := a[pre(j)];
j := pre(j) + 1;
end when;
end abc;
I don't really see the point of the j==1 condition. It is true at the outset which means it doesn't "become" true then. And since j is never decremented, I don't see why it should ever return to the value 1 once it increments for the first time.
Note that I added a pre around the right-hand side values for j. If this were in an
equation section, I'm pretty sure the pre would be required. Since it is an algorithm section, it is mainly to document the intent of the code. It also makes the code robust to switching from equation to algorithm section.
Of course, there is an event at time = 0 triggered by the expression sample(0, delta_t) and j<k which becomes true.
But in older versions of Dymola there is an bug with the initialization of discrete variables. For instance even if you remove sample(0.0, delta_t) and j<k in dymola74, j will become 2 at time=0. The issue was that the pre values of when clauses, where not initialized correct. As far as I know this is corrected at least in the version FD1 2013.
I've been profiling my almost-finished project and I'm seeing that about three-quarters of the CPU time is spent in this IIR filter function (which is called hundreds of thousands of times in about a second currently on the target hardware) so with everything else working well I am wondering if it can be optimized for my specific hardware and software target. My targets are only iPhone 4 and newer, only iOS 4.3 and newer, only LLVM 4.x. A little bit of imprecision is probably OK if there are gains to be made.
static float filter(const float a, const float *b, const float c, float *d, const int e, const float x)
{
float return_value = 0;
d[0] = x;
d[1] = c * d[0] + a * d[1];
int j;
for (j = 2; j <= e; j++) {
return_value += (d[j] += a * (d[j + 1] - d[j - 1])) * b[j];
}
for (j = e + 1; j > 1; j--) {
d[j] = d[j - 1];
}
return (return_value);
}
Any suggestions about speeding it up appreciated, also interested in your opinion if it is possible to optimize beyond the default compiler optimization at all. I am wondering if it is something where NEON SIMD would help (that is new ground for me) or if VFP can be exploited, or if LLVM autovectorization would help.
I've tried the following LLVM flags:
-ffast-math (didn't make a notable difference)
-O4 (made a big difference on the iPhone 4S with a 25% reduction in time, but no notable difference on my minimum target device the iPhone 4, improvement of which is my main goal)
-O3 -mllvm -unroll-allow-partial -mllvm -unroll-runtime -funsafe-math-optimizations -ffast-math -mllvm -vectorize -mllvm -bb-vectorize-aligned-only (LLVM autovectorization flags from Hal Finkel's slides here: http://llvm.org/devmtg/2012-04-12/Slides/Hal_Finkel.pdf, made things slower than the default LLVM optimization for an Xcode release target)
Open to other flags, different approaches, and changes to the function. I'd prefer to leave the input and return types and values alone. There is actually a discussion of using NEON intrinsic functions for FIR here: https://pixhawk.ethz.ch/_media/software/optimization/neon_support_in_the_arm_compiler.pdf but I don't have quite enough experience with its subject to successfully apply the information to my own case. Thank you for any clarification.
EDIT My apologies for not noting this sooner. After investigating aka.nice's suggestion I noticed that the values passed in for e, a and c are always the same values and I know them before runtime, so approaches incorporating this info are an option.
Here are some transformations that could be made on the code to use vDSP routines. These transformations make use of various temporary buffers named T0, T1, and T2. Each of these is an array of float with enough space for e-1 elements.
First, use a temporary buffer to compute a * b[j]. This changes the original code:
for (j = 2; j <= e; j++) {
return_value += (d[j] += a * (d[j + 1] - d[j - 1])) * b[j];
}
to:
vDSP_vsmul(b+2, 1, &a, T0, 1, e-1);
for (j = 2; j <= e; j++)
return_value += (d[j] += (d[j+1] - d[j-1])) * T0[j-2];
Then use vDSP_vmul to compute d[j+1] * T0[j-2]:
vDSP_vsmul(b+2, 1, &a, T0, 1, e-1);
vDSP_vmul(d+3, 1, T0, 1, T1, 1, e-1);
for (j = 2; j <= e; j++)
return_value += (d[j] += T1[j-2] - d[j-1] * T0[j-2];
Next, promote vDSP_vmul to vDSP_vma (vector multiply add) to compute d[j] + d[j+1] * T0[j-2]:
vDSP_vsmul(b+2, 1, &a, T0, 1, e-1);
vDSP_vma(d+3, 1, T0, 1, d+2, 1, T1, 1, e-1);
for (j = 2; j <= e; j++)
return_value += (d[j] = T1[j-2] - d[j-1] * T0[j-2];
I suppose I would time that and see if there is any improvement. There are some issues:
SIMD code works best when data is 16-byte aligned. The use of array indices such as j-1 and j+1 prevents this. The ARM processors in phones are not as bad with unaligned data as some other processors, but performance will vary from model to model.
If e is large (more than a few thousand), then T0 and d may be evicted from cache during the vDSP_vma operation, and the following loop will have to reload them. There is a technique called strip mining to reduce the effect of this. I will not detail it now, but, essentially, the operation is partitioned into smaller strips of the array.
The IIR in the final loop may still bottleneck the processor. There are routines in vDSP for performing some IIRs (such as vDSP_deq22), but it is not clear whether this filter can be expressed in a way that is a good enough match to a vDSP routine to gain more performance than might be lost by the transformation.
The summation in the final loop to calculate return_value could also be removed from the loop and replaced with a vDSP routine (likely vDSP_sve), but I suspect the slack caused by the IIR will permit the additions to be done without adding significant execution time to the loop.
The above is off the top of my head; I have not tested the code. I suggest making the transformations one-by-one so you can test the code after each change and identify any errors before going on.
If you can find a satisfactory filter that is not an IIR, more performance optimizations may be available.
I'd prefer to leave the input and return types and values aloneā¦
Nevertheless, moving your rendering from float to integer would help considerably.
Localizing that change to the implementation you present won't be useful. But if you expand it to reimplement just the FIR as integer, it can quickly pay off (unless the sizes are guaranteed to always be incredibly small -- then conversion/move times cost more). Of course, moving larger portions of the render graph to integer will introduce larger gains and require even fewer conversions.
Another consideration would be to look at the utilities in Accelerate.framework (potentially saving you from writing your own asm).
I tried this little exercize to rewrite your filter with delay operator z
For example, for e=4, i renamed input u and output y
d1*z= u
d2*z= c*u + a*d1
d3*z= d2 + a*(d3-d1*z)
d4*z= d3 + a*(d4-d2*z)
d5*z= d4 + a*(d5-d3*z)
y = (b2*d3*z + b3*d4*z + b4*d5*z)
Note that the di are the filter states.
d3*z is the next value of d3 (it appears to be variable d2 in your code)
You can then eliminate the di to write the transfer function y/u in z.
You will then find that a minimal representation require only e states by factoring/simplifying above transfer function.
Denominator is z*(z-a)^3, that is a pole at 0, and another at a with multiplicity (e-1).
You can then put your filter in a standard state space matrix representation:
z*X = A*X + B*u
y = C*X + d*u
With the particular form of poles, you can decompose the transfer in partial fraction expansion and obtain the matrices A & B in this special form (matlab like notations)
A = [0 1 0 0; B=[0;
0 a 1 0; 0;
0 0 a 1; 0;
0 0 0 a] 1]
C & d are a bit less easy though...
They are extracted from the numerators and direct term of partial fraction expansion
They are polynomials in bi, c (degree 1) and a (degree e)
For e=4, I have
C=[ a^3*b2 - a^2*b3 + a*b4 ,
-a^2*b2 + a*b3 + (c-a^2)*b4 ,
a*b2 + (c-a^2)*b3 + (-2*a^2*c-2*a^2-a+a^4)*b4 ,
(c-a^2)*b2 + (-a^2-a^2*c-a)*b3 + (-2*a*c+2*a^3)*b4 ]
d= -a*b2 - a*c*b3 + a^2*b4
If you can find the recurrence in e governing C & d, and precompute them
then the filter can be reduced to those simple vector ops:
z*X = a*[ 0; x2 ; x3 ; x4 ... xe ] + [x2 ; x3 ; x4 ... xe ; u ];
Y = C*[ x1 ; x2 ; x3 ; x4 ... xe ] + d*u
Or expressed as a function (Xnext,y)=filter(X,u,a,C,d,e) pseudo code:
y = dot_product( C , X) + d*u; // (like BLAS _DOT)
Xnext(1:e-1) = X(2:e); // this is a memcopy (like BLAS _COPY)
Xnext(e)=u;
X(1)=0;
Xnext=a*X+Xnext; // this is an inplace vector muladd (like BLAS _AXPY)
X=Xnext; // another memcopy outside the function (can be moved inside).
Note that if you use BLAS functions, your code will be portable to many hardwares, not just Applecentric, and I guess the perf won't be much different.
EDIT: about partial fraction expansion
The pure partial fraction expansion would give a diagonal state space representation, and a matrix B full of 1s. This can be an interesting variant too. (filters in parallel)
My variant used above is more like a cascade or ladder (filters in serie).
I have some massive matrix computation to do in MATLAB. It's nothing complicated (see below). I'm having issues with making computation in MATLAB efficient. What I have below works but the time it takes simply isn't feasible because of the computation time.
for i = 1 : 100
for j = 1 : 20000
element = matrix{i}(j,1);
if element <= bigNum && element >= smallNum
count = count + 1;
end
end
end
Is there a way of making this faster? MATLAB is meant to be good at these problems so I would imagine so?
Thank you :).
count = 0
for i = 1:100
count = count + sum(matrix{i}(:,1) <= bigNum & matrix{i}(:,1) >= smallNum);
end
If your matrix is a matrix, then this will do:
count = sum(matrix(:) >= smallNum & matrix(:) <= bigNum);
If your matrix is really huge, use anyExceed. You can profile (check the running time) of both functions on matrix and decide.