Implementing convolution in C++ using FFTW 3

UPDATE
See my question on the underlying fundamentals over on the DSP Stack Exchange here.
UPDATE
I am still experiencing crackling in the output. The crackles are now less pronounced and are only audible when the volume is turned up.
UPDATE
Following the advice given here has removed the crackling sound from my output. I will test with other available HRIRs to see whether the convolution is indeed working properly, and I will answer this question once I've verified that my code now works.
UPDATE
I have made some progress, but I still think there is an issue with my convolution implementation.
The following is my revised program:
#define HRIR_LENGTH 512
#define WAV_SAMPLE_SIZE 256

while (signal_input_wav.read(&signal_input_buffer[0], WAV_SAMPLE_SIZE) >= WAV_SAMPLE_SIZE)
{
#ifdef SKIP_CONVOLUTION
    // Copy the input buffer over
    std::copy(signal_input_buffer.begin(),
              signal_input_buffer.begin() + WAV_SAMPLE_SIZE,
              signal_output_buffer.begin());
    signal_output_wav.write(&signal_output_buffer[0], WAV_SAMPLE_SIZE);
#else
    // Copy the first segment into the buffer, with zero padding
    for (int i = 0; i < HRIR_LENGTH; ++i)
    {
        if (i < WAV_SAMPLE_SIZE)
        {
            signal_buffer_fft_in[i] = signal_input_buffer[i];
        }
        else
        {
            signal_buffer_fft_in[i] = 0; // zero pad
        }
    }

    // DFT of the signal segment
    fftw_execute(signal_fft);

    // Convolve in the frequency domain by multiplying the filter kernel with the DFT of the signal
    for (int i = 0; i < HRIR_LENGTH; ++i)
    {
        signal_buffer_ifft_in[i] = signal_buffer_fft_out[i] * left_hrir_fft_out[i]
            - signal_buffer_fft_out[HRIR_LENGTH - i] * left_hrir_fft_out[HRIR_LENGTH - i];
        signal_buffer_ifft_in[HRIR_LENGTH - i] = signal_buffer_fft_out[i] * left_hrir_fft_out[HRIR_LENGTH - i]
            + signal_buffer_fft_out[HRIR_LENGTH - i] * left_hrir_fft_out[i];

        //double re = signal_buffer_out[i];
        //double im = signal_buffer_out[BLOCK_OUTPUT_SIZE - i];
    }

    // Inverse DFT back to the time domain
    fftw_execute(signal_ifft);

    // Normalize the data
    for (int i = 0; i < HRIR_LENGTH; ++i)
    {
        signal_buffer_ifft_out[i] = signal_buffer_ifft_out[i] / HRIR_LENGTH;
    }

    // Overlap-add method
    for (int i = 0; i < HRIR_LENGTH; ++i)
    {
        if (i < WAV_SAMPLE_SIZE)
        {
            signal_output_buffer[i] = signal_overlap_buffer[i] + signal_buffer_ifft_out[i];
        }
        else
        {
            signal_output_buffer[i] = signal_buffer_ifft_out[i];
            signal_overlap_buffer[i] = signal_output_buffer[i]; // record into the overlap buffer
        }
    }

    // Write the block to the output file
    signal_output_wav.write(&signal_output_buffer[0], HRIR_LENGTH);
#endif
}
The resulting output sound file contains crackling sounds, presumably artefacts left over from my buggy FFTW usage. Also, writing blocks of 512 samples (HRIR_LENGTH) seems to result in some aliasing: on playback the sound file sounds like a vinyl record being slowed down. Writing out blocks of size WAV_SAMPLE_SIZE (256, half of the FFT size) plays back at normal speed.
However, irrespective of this, the crackling sound remains.
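For reference, with FFTW's halfcomplex (r2r) layout the frequency-domain product is usually assembled as in the sketch below. This is an illustrative, stand-alone snippet (the function name and the use of std::vector are my own, not taken from the code above); it assumes two spectra produced by an FFTW_R2HC plan of size N, in which bins 0 and N/2 are purely real and the imaginary part of bin k is stored at index N - k.

#include <vector>

// Complex multiplication of two spectra a and b stored in FFTW's halfcomplex
// layout (output of fftw_plan_r2r_1d(..., FFTW_R2HC, ...)); result goes to dst.
void halfcomplex_multiply(const std::vector<double>& a,
                          const std::vector<double>& b,
                          std::vector<double>& dst)
{
    const int N = static_cast<int>(a.size());

    dst[0] = a[0] * b[0];                     // DC bin is purely real
    if (N % 2 == 0)
        dst[N / 2] = a[N / 2] * b[N / 2];     // Nyquist bin is purely real

    for (int k = 1; k < (N + 1) / 2; ++k)
    {
        const double re = a[k] * b[k] - a[N - k] * b[N - k];
        const double im = a[k] * b[N - k] + a[N - k] * b[k];
        dst[k]     = re;                      // real part of bin k
        dst[N - k] = im;                      // imaginary part of bin k
    }
}

The product would then go through the inverse (FFTW_HC2R) plan and be divided by N, as in the normalization step above.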
ORIGINAL
I'm trying to implement convolution using the FFTW library in C++.
I can load my filter perfectly fine, and I zero-pad both the filter (of length 512) and the input signal (of length 513) so that the output block is 1024 samples long, and I use 1024 as the FFT size.
Here is my code:
#define BLOCK_OUTPUT_SIZE 1024
#define HRIR_LENGTH 512
#define WAV_SAMPLE_SIZE 513
#define INPUT_SHIFT 511

while (signal_input_wav.read(&signal_input_buffer[0], WAV_SAMPLE_SIZE) >= WAV_SAMPLE_SIZE)
{
#ifdef SKIP_CONVOLUTION
    // Copy the input buffer over
    std::copy(signal_input_buffer.begin(),
              signal_input_buffer.begin() + WAV_SAMPLE_SIZE,
              signal_output_buffer.begin());
    signal_output_wav.write(&signal_output_buffer[0], WAV_SAMPLE_SIZE);
#else
    // Zero pad input
    for (int i = 0; i < INPUT_SHIFT; ++i)
        signal_input_buffer[WAV_SAMPLE_SIZE + i] = 0;

    // Copy to the signal convolve buffer
    for (int i = 0; i < BLOCK_OUTPUT_SIZE; ++i)
    {
        signal_buffer_in[i] = signal_input_buffer[i];
    }

    // DFT of the signal segment
    fftw_execute(signal_fft);

    // Convolve in the frequency domain by multiplying the filter kernel with the DFT of the signal
    for (int i = 1; i < BLOCK_OUTPUT_SIZE; ++i)
    {
        signal_buffer_out[i] = signal_buffer_in[i] * left_hrir_fft_in[i]
            - signal_buffer_in[BLOCK_OUTPUT_SIZE - i] * left_hrir_fft_in[BLOCK_OUTPUT_SIZE - i];
        signal_buffer_out[BLOCK_OUTPUT_SIZE - i]
            = signal_buffer_in[BLOCK_OUTPUT_SIZE - i] * left_hrir_fft_in[i]
            + signal_buffer_in[i] * left_hrir_fft_in[BLOCK_OUTPUT_SIZE - i];

        double re = signal_buffer_out[i];
        double im = signal_buffer_out[BLOCK_OUTPUT_SIZE - i];
    }

    // Inverse DFT back to the time domain
    fftw_execute(signal_ifft);

    // Normalize the data
    for (int i = 0; i < BLOCK_OUTPUT_SIZE; ++i)
    {
        signal_buffer_out[i] = signal_buffer_out[i] / i;
    }

    // Overlap and add with the previous block
    if (first_block)
    {
        first_block = !first_block;
        for (int i = 0; i < BLOCK_OUTPUT_SIZE; ++i)
        {
            signal_output_buffer[i] = signal_buffer_out[i];
        }
    }
    else
    {
        for (int i = WAV_SAMPLE_SIZE; i < BLOCK_OUTPUT_SIZE; ++i)
        {
            signal_output_buffer[i] = signal_output_buffer[i] + signal_buffer_out[i];
        }
    }

    // Write the block to the output file
    signal_output_wav.write(&signal_output_buffer[0], BLOCK_OUTPUT_SIZE);
#endif
}
In the end, the resulting output file contains garbage, but is not all zeros.
Things I have tried:
1) Using the standard complex interface fftw_plan_dft_1d with the appropriate fftw_complex type. Same issues arise.
2) Using a smaller input sample size and iterating over the zero padded blocks (overlap-add).
I also note that it's not a fault of libsndfile; toggling SKIP_CONVOLUTION does successfully result in copying the input file to the output file.

Related

Using zoomCallback, how can I "snap" the zoom to existing x values?

I'm trying to use the zoomCallback function to set up interaction between my dygraphs chart and a map chart. My x values are timestamps in seconds, but since the sample rate is about 100 Hz the timestamps are stored as floating-point numbers.
The goal is that when the dygraphs chart is zoomed in, the new x1 and x2 will be used to extract a piece of GPS track (lat, lng points). The extracted track will then be used to re-fit the map boundaries, which will look like a "zoom in" on the map chart.
In my dygraphs options I specified the callback:
zoomCallback: function(x1, x2) {
  let x1Index = graphHolder.getRowForX(x1);
  let x2Index = graphHolder.getRowForX(x2);
  // further code
}
But it looks like the zoom is not "snapped" to existing timestamp points, so both x1Index and x2Index are null. Only when I zoom out do they correctly point to row 0 and the last row of data.
So the question is: is there a way to make the zoom snap to the nearest existing x value so the row number can be returned? Or is there an alternative way to do what I want?
Thanks for any insights!
You can access the x-axis values via g.getValue(row, 0). From this you can either do a linear scan to find the first row in the range or (fancier but faster) use a binary search.
Here's a way to do the linear scan:
const [x1, x2] = g.xAxisRange();
let lowRow = null, highRow = null;
for (let i = 0; i < g.numRows(); i++) {
  if (g.getValue(i, 0) >= x1) {
    lowRow = i;
    break;
  }
}
for (let i = g.numRows() - 1; i >= 0; i--) {
  if (g.getValue(i, 0) <= x2) {
    highRow = i;
    break;
  }
}
const dataX1 = g.getValue(lowRow, 0);
const dataX2 = g.getValue(highRow, 0);
For larger data sets you might want to do a binary search using something like lodash's _.sortedIndex.
Update: Here's a binary search implementation. No promises about the exact behavior on the boundaries (i.e. whether it always returns indices that are inside the visible range or indices that contain the visible range).
function dygraphBinarySearch(g, x) {
  let low = 0;
  let high = g.numRows() - 1;
  while (high > low) {
    let i = Math.floor(low + (high - low) / 2);
    const xi = g.getValue(i, 0);
    if (xi < x) {
      low = i + 1;
    } else if (xi > x) {
      high = i - 1;
    } else {
      return i;
    }
  }
  return low;
}

function getVisibleDataRange(g) {
  const [x1, x2] = g.xAxisRange();
  let lowI = dygraphBinarySearch(g, x1);
  let highI = dygraphBinarySearch(g, x2);
  return [lowI, highI];
}

OpenCL 1D strided convolution performance

For downsampling a signal, I use a FIR filter + decimation stage (which is effectively a strided convolution). The big advantage of combining filtering and decimation is the reduced computational cost (by the decimation factor).
With a straightforward OpenCL implementation, I am not able to benefit from the decimation. Quite the contrary: the convolution with a decimation factor of 4 is 25% slower than the full convolution.
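For reference, the operation in question is an ordinary FIR convolution evaluated only at every decimationFactor-th output position. A plain scalar sketch (illustrative only; the buffer names and vector types are my own) that mirrors the kernel's indexing looks like this:

#include <vector>

// Scalar reference for the strided convolution (FIR filter + decimation).
std::vector<float> decimate_fir(const std::vector<float>& input,
                                const std::vector<float>& coefs,
                                int decimationFactor)
{
    const int taps = static_cast<int>(coefs.size());
    const int outLen = static_cast<int>(input.size()) / decimationFactor;
    std::vector<float> output(outLen, 0.0f);

    for (int posOutput = 0; posOutput < outLen; ++posOutput)
    {
        float result = 0.0f;
        for (int tap = 0; tap < taps; ++tap)
        {
            const int posInput = posOutput * decimationFactor - tap;
            if (posInput >= 0)                       // skip samples before the start
                result += input[posInput] * coefs[tap];
        }
        output[posOutput] = result;
    }
    return output;
}

Computing only every decimationFactor-th output is where the expected factor-of-decimation saving in arithmetic comes from.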
Kernel Code:
__kernel void decimation(__constant float *input,
                         __global float *output,
                         __constant float *coefs,
                         const int taps,
                         const int decimationFactor) {
    int posOutput = get_global_id(0);
    float result = 0;

    for (int tap = 0; tap < taps; tap++) {
        int posInput = (posOutput * decimationFactor) - tap;
        result += input[posInput] * coefs[tap];
    }

    output[posOutput] = result;
}
I guess it is due to uncoalesced memory access, though I cannot think of a way to fix the problem. Any ideas?
Edit: I tried Dithermaster's suggestion of splitting the problem into coalesced reads into shared local memory and a convolution from local memory:
__kernel void decimation(__constant float *input,
                         __global float *output,
                         __constant float *coefs,
                         const int taps,
                         const int decimationFactor,
                         const int bufferSize,
                         __local float *localInput) {
    const int posOutput = get_global_id(0);
    const int localSize = get_local_size(0);
    const int localId = get_local_id(0);
    const int groupId = get_group_id(0);

    const int localInputOffset = taps - 1;
    const int localInputOverlap = taps - decimationFactor;
    const int localInputSize = localInputOffset + localSize * decimationFactor;

    // 1. transfer global input data to local memory
    // read global input to local input (only overlap)
    if (localId < localInputOverlap) {
        int posInputStart = ((groupId * localSize) * decimationFactor) - (taps - 1);
        int posInput = posInputStart + localId;
        int posLocalInput = localId;
        localInput[posLocalInput] = 0.0f;
        if (posInput >= 0)
            localInput[posLocalInput] = input[posInput];
    }

    // read remaining global input to local input
    // 1. alternative: strided read
    // for (int i = 0; i < decimationFactor; i++) {
    //     int posInputStart = (groupId * localSize) * decimationFactor;
    //     int posInput = posInputStart + localId * decimationFactor - i;
    //     int posLocalInput = localInputOffset + localId * decimationFactor - i;
    //     localInput[posLocalInput] = 0.0f;
    //     if ((posInput >= 0) && (posInput < bufferSize * decimationFactor))
    //         localInput[posLocalInput] = input[posInput];
    // }

    // 2. alternative: coalesced read (in blocks of localSize)
    for (int i = 0; i < decimationFactor; i++) {
        int posInputStart = (groupId * localSize) * decimationFactor;
        int posInput = posInputStart - (decimationFactor - 1) + i * localSize + localId;
        int posLocalInput = localInputOffset - (decimationFactor - 1) + i * localSize + localId;
        localInput[posLocalInput] = 0.0f;
        if ((posInput >= 0) && (posInput < bufferSize * decimationFactor))
            localInput[posLocalInput] = input[posInput];
    }

    // 2. wait until every thread has completed
    barrier(CLK_LOCAL_MEM_FENCE);

    // 3. convolution
    if (posOutput < bufferSize) {
        float result = 0.0f;
        for (int tap = 0; tap < taps; tap++) {
            int posLocalInput = localInputOffset + (localId * decimationFactor) - tap;
            result += localInput[posLocalInput] * coefs[tap];
        }
        output[posOutput] = result;
    }
}
Big improvement! But still, the performance does not scale with the total number of operations (it is not proportional to the decimation factor):
speedup of the full convolution compared to the first approach: ~12 %
computation time for decimation compared to the full convolution:
decimation factor 2: 61 %
decimation factor 4: 46 %
decimation factor 8: 53 %
decimation factor 16: 68 %
The performance has an optimum at a decimation factor of 4. Why is that? Any ideas for further improvements?
Edit 2: Diagram with shared local memory:
Edit 3: Comparison of the performance for the 3 different implementations
Due to the amount of data overlap (66%), this could benefit from sharing data read from memory between work items within a workgroup. You could get rid of redundant reads and also make the reads coalesced. Break your kernel up into two parts: the first part does coalesced reads of all the data needed by the work group into shared local memory. Then a memory barrier to synchronize. Then in the second part do the convolutions using reads from shared local memory.
P.S. Thanks for the diagram, it helped me understand your goal more quickly than trying to read code.

OMP Atomic, why?

I've got a problem with OpenMP. I know that if you are incrementing something in a parallel block you have to put an atomic directive before that expression. But in my code there is a part I don't understand.
Why do I have to use the atomic here?
#pragma omp parallel
{
    double distance, magnitude, factor, r;
    vector_t direction;
    int i, j;

    #pragma omp for
    for (i = 0; i < n_body - 1; i++)
    {
        for (j = i + 1; j < n_body; j++)
        {
            r = SQR(bodies[i].position.x - bodies[j].position.x) + SQR(bodies[i].position.y - bodies[j].position.y);
            // avoid numerical instabilities
            if (r < EPSILON)
            {
                // this is not how nature works :-)
                r += EPSILON;
            }
            distance = sqrt(r);
            magnitude = (G * bodies[i].mass * bodies[j].mass) / (distance * distance);
            factor = magnitude / distance;
            direction.x = bodies[j].position.x - bodies[i].position.x;
            direction.y = bodies[j].position.y - bodies[i].position.y;

            // +force for body i
            #pragma omp atomic
            bodies[i].force.x += factor * direction.x;
            #pragma omp atomic
            bodies[i].force.y += factor * direction.y;

            // -force for body j
            #pragma omp atomic
            bodies[j].force.x -= factor * direction.x;
            #pragma omp atomic
            bodies[j].force.y -= factor * direction.y;
        }
    }
}
And why don't I have to use it here:
#pragma omp parallel
{
    vector_t delta_v, delta_p;
    int i;

    #pragma omp for
    for (i = 0; i < n_body; i++)
    {
        // calculate delta_v
        delta_v.x = bodies[i].force.x / bodies[i].mass * dt;
        delta_v.y = bodies[i].force.y / bodies[i].mass * dt;

        // calculate delta_p
        delta_p.x = (bodies[i].velocity.x + delta_v.x / 2.0) * dt;
        delta_p.y = (bodies[i].velocity.y + delta_v.y / 2.0) * dt;

        // update body velocity and position
        bodies[i].velocity.x += delta_v.x;
        bodies[i].velocity.y += delta_v.y;
        bodies[i].position.x += delta_p.x;
        bodies[i].position.y += delta_p.y;

        // reset forces
        bodies[i].force.x = bodies[i].force.y = 0.0;

        if (bounce)
        {
            // bounce on boundaries (i.e. it's more like billiards)
            if ((bodies[i].position.x < -body_distance_factor) || (bodies[i].position.x > body_distance_factor))
                bodies[i].velocity.x = -bodies[i].velocity.x;
            if ((bodies[i].position.y < -body_distance_factor) || (bodies[i].position.y > body_distance_factor))
                bodies[i].velocity.y = -bodies[i].velocity.y;
        }
    }
}
The code works as it is now, but I simply don't understand why.
Can you help me?
Kind Regards
Michael
In the second of the two code samples, each parallel iteration of the loop works on element [i] of the array and never looks at any neighbouring elements. Thus each iteration of the loop has no effect on any other iteration, and they can all be executed at the same time without worry.
In the first code example, however, each parallel iteration of the loop may read from and write to anywhere in the bodies array using the index [j]. This means that two threads could be trying to update the same memory location at the same time, or one thread could be writing to a location that another one is reading. To avoid race conditions you need to ensure that the writes are atomic.
When multiple threads write to the same memory location you need to use an atomic operation or a critical section to prevent a race condition. Atomic operations are faster but more restricted (e.g. they only work on plain data types with a few basic operators), but in your case you can use them.
So you have to ask yourself when threads are writing to the same memory location. In the first case you only parallelize the outer loop over i and not the inner loop over j, so you don't actually need the atomic operations on the i updates, just the ones on j.
Let's consider an example from the first case. Let's assume n_body=101 and there are 4 threads.
Thread one i = 0-24, j = 1-100, j range = 100
Thread two i = 25-49, j = 26-100, j range = 75
Thread three i = 50-74, j = 51-100, j range = 50
Thread four i = 75-99, j = 76-100, j range = 25
First of all, you can see that the threads write to some of the same memory locations. For example, all threads write to the memory locations with j = 76-100. That's why you need the atomic operation for the j updates. However, no two threads write to the same memory location through i. That's why you don't need an atomic operation for i.
In your second case you only have one loop and it is parallelized, so no two threads write to the same memory location and you don't need the atomic operations.
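As a stand-alone illustration of the point (this toy program is mine, not taken from the question), several threads update the same few array elements, so each update has to be atomic:

#include <cstdio>

int main()
{
    double shared_sum[4] = {0.0, 0.0, 0.0, 0.0};

    #pragma omp parallel for
    for (int i = 0; i < 1000000; ++i)
    {
        // Every thread keeps hitting the same four elements, so different
        // threads collide; without the atomic the final values are unpredictable.
        #pragma omp atomic
        shared_sum[i % 4] += 1.0;
    }

    std::printf("%f\n", shared_sum[0]); // 250000 with the atomic in place
    return 0;
}

Without the atomic, two += operations from different threads can interleave and updates get lost, which is exactly the race the first code sample has on the bodies[j] updates.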
That answers your question, but here are some additional comments to improve the performance of your code:
There is another important observation, independent of the atomic operations. You can see that thread one runs over j 100 times while thread four only runs over j 25 times. Therefore the load is not well distributed using schedule(static), which is typically the default schedule. For larger values of n_body this is going to get worse.
One solution is to try schedule(guided). I have not used this before, but I think it's the right solution (see OpenMP: for schedule). "A special kind of dynamic scheduling is the guided, where smaller and smaller iteration blocks are given to each task as the work progresses." According to the standard, with guided scheduling each successive block gets roughly number_of_iterations_remaining / number_of_threads iterations. So from our example that gives:
Thread one i = 0-24, j = 1-100, j range = 100
Thread two i = 25-44, j = 26-100, j range = 75
Thread three i = 45-69, j = 46-100, j range = 55
Thread four i = 60-69, j = 61-100, j range = 40
Thread one i = 70-76, j = 71-100
...
Notice now that the threads are more evenly distributed. With static scheduling the fourth thread only ran over j 25 times and now the fourth thread runs over j 40 times.
But let's look at your algorithm more carefully. In the first case you are calculating the gravitational force on each body. Here are two ways you can do this:
//method1
#pragma omp parallel for
for (int i = 0; i < n_body; i++) {
    vector_t force = 0;
    for (int j = 0; j < n_body; j++) {
        force += gravity_force(i, j);
    }
    bodies[i].force = force;
}
But the function gravity_force(i,j) = gravity_force(j,i), so you don't need to calculate it twice. So you found a faster solution:
//method2 (see your code and mine below on how to parallelize this)
for (int i = 0; i < (n_body - 1); i++) {
    for (int j = i + 1; j < n_body; j++) {
        bodies[i].force += gravity_force(i, j);
        bodies[j].force += gravity_force(i, j);
    }
}
The first method does n_body*n_body iterations and the second method does (n_body-1)*n_body/2 iterations, which to first order is n_body*n_body/2. However, the second method is much more difficult to parallelize efficiently (see your code and my code below): it has to use atomic operations and the load is not balanced. The first method does twice as many iterations, but the load is evenly distributed and it does not need any atomic operations. You should test both methods to see which one is faster.
To parallelize the second method you can do what you did:
#pragma omp parallel for schedule(static) // you should try schedule(guided)
for (int i = 0; i < (n_body - 1); i++) {
    for (int j = i + 1; j < n_body; j++) {
        //#pragma omp atomic // not necessary on i
        bodies[i].force += gravity_force(i, j);
        #pragma omp atomic  // but necessary on j
        bodies[j].force += gravity_force(i, j);
    }
}
Or a better solution is to use private copies of force like this:
#pragma omp parallel
{
    vector_t *force = new vector_t[n_body];

    #pragma omp for schedule(static)
    for (int i = 0; i < n_body; i++) force[i] = 0;

    #pragma omp for schedule(guided)
    for (int i = 0; i < (n_body - 1); i++) {
        for (int j = i + 1; j < n_body; j++) {
            force[i] += gravity_force(i, j);
            force[j] += gravity_force(i, j);
        }
    }

    #pragma omp for schedule(static)
    for (int i = 0; i < n_body; i++)
    {
        #pragma omp atomic
        bodies[i].force.x += force[i].x;
        #pragma omp atomic
        bodies[i].force.y += force[i].y;
    }

    delete[] force;
}

Boundary detection of a paper sheet with OpenCV

I am new to OpenCV. I already detect the edges of the paper sheet, but my result image is blurred after I draw lines on the edges. How can I draw lines on the edges of the paper sheet so that the image quality remains unaffected? What am I missing?
My code is below.
Many thanks.
- (void)forOpenCV
{
    if (imageView.image != nil)
    {
        cv::Mat greyMat = [self cvMatFromUIImage:imageView.image];
        vector<vector<cv::Point> > squares;
        cv::Mat img = [self debugSquares:squares :greyMat];
        imageView.image = [self UIImageFromCVMat:img];
    }
}
- (cv::Mat)debugSquares:(std::vector<std::vector<cv::Point> >)squares :(cv::Mat &)image
{
    NSLog(@"%lu", squares.size());

    // blur will enhance edge detection
    Mat blurred(image);
    medianBlur(image, blurred, 9);

    Mat gray0(image.size(), CV_8U), gray;
    vector<vector<cv::Point> > contours;

    // find squares in every color plane of the image
    for (int c = 0; c < 3; c++)
    {
        int ch[] = {c, 0};
        mixChannels(&image, 1, &gray0, 1, ch, 1);

        // try several threshold levels
        const int threshold_level = 2;
        for (int l = 0; l < threshold_level; l++)
        {
            // Use Canny instead of zero threshold level!
            // Canny helps to catch squares with gradient shading
            if (l == 0)
            {
                Canny(gray0, gray, 10, 20, 3);

                // Dilate helps to remove potential holes between edge segments
                dilate(gray, gray, Mat(), cv::Point(-1, -1));
            }
            else
            {
                gray = gray0 >= (l + 1) * 255 / threshold_level;
            }

            // Find contours and store them in a list
            findContours(gray, contours, CV_RETR_LIST, CV_CHAIN_APPROX_SIMPLE);

            // Test contours
            vector<cv::Point> approx;
            for (size_t i = 0; i < contours.size(); i++)
            {
                // approximate contour with accuracy proportional
                // to the contour perimeter
                approxPolyDP(Mat(contours[i]), approx, arcLength(Mat(contours[i]), true) * 0.02, true);

                // Note: absolute value of an area is used because
                // area may be positive or negative - in accordance with the
                // contour orientation
                if (approx.size() == 4 &&
                    fabs(contourArea(Mat(approx))) > 1000 &&
                    isContourConvex(Mat(approx)))
                {
                    double maxCosine = 0;
                    for (int j = 2; j < 5; j++)
                    {
                        double cosine = fabs(angle(approx[j % 4], approx[j - 2], approx[j - 1]));
                        maxCosine = MAX(maxCosine, cosine);
                    }
                    if (maxCosine < 0.3)
                        squares.push_back(approx);
                }
            }
        }
    }

    NSLog(@"%lu", squares.size());

    for (size_t i = 0; i < squares.size(); i++)
    {
        cv::Rect rectangle = boundingRect(Mat(squares[i]));
        if (i == squares.size() - 1) //// Detecting Rectangle here
        {
            const cv::Point *p = &squares[i][0];
            int n = (int)squares[i].size();
            NSLog(@"%d", n);
            line(image, cv::Point(507, 418), cv::Point(507 + 1776, 418 + 1372), Scalar(255, 0, 0), 2, 8);
            polylines(image, &p, &n, 1, true, Scalar(255, 255, 0), 5, CV_AA);
            fx1 = rectangle.x;
            fy1 = rectangle.y;
            fx2 = rectangle.x + rectangle.width;
            fy2 = rectangle.y + rectangle.height;
            line(image, cv::Point(fx1, fy1), cv::Point(fx2, fy2), Scalar(0, 0, 255), 2, 8);
        }
    }

    return image;
}
Instead of
Mat blurred(image);
you need to do
Mat blurred = image.clone();
This is because the first line does not copy the image, but just creates a second pointer to the same data.
When you blur the image, you are also changing the original.
What you need to do instead is create a real copy of the actual data and operate on that copy.
The OpenCV reference states:
by using a copy constructor or assignment operator, where on the right side it can
be a matrix or expression, see below. Again, as noted in the introduction, matrix assignment is O(1) operation because it only copies the header and increases the reference counter.
Mat::clone() method can be used to get a full (a.k.a. deep) copy of the matrix when you need it.
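To make the difference concrete, here is a tiny stand-alone sketch (mine, not from the answer) showing that the Mat copy constructor shares the pixel data while clone() makes an independent copy:

#include <opencv2/core.hpp>

void copy_semantics_demo()
{
    cv::Mat original = cv::Mat::zeros(4, 4, CV_8U);

    cv::Mat shallow(original);        // shares the same pixel data as 'original'
    cv::Mat deep = original.clone();  // owns its own copy of the pixels

    shallow.at<uchar>(0, 0) = 255;    // also visible through 'original'
    deep.at<uchar>(0, 0) = 128;       // leaves 'original' untouched
}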
The first problem is easily solved by doing the entire processing on a copy of the original image. That way, after you get all the points of the square you can draw the lines on the original image and it will not be blurred.
The second problem, which is cropping, can be solved by defining a ROI (region of interest) in the original image and then copying it to a new Mat. I've demonstrated that in this answer:
// Setup a Region Of Interest
cv::Rect roi;
roi.x = 50;
roi.y = 10;
roi.width = 400;
roi.height = 450;

// Crop the original image to the area defined by the ROI
cv::Mat crop = original_image(roi);
cv::imwrite("cropped.png", crop);

Perform autocorrelation with vDSP_conv from Apple Accelerate Framework

I need to perform the autocorrelation of an array (vector) but I am having trouble finding the correct way to do so. I believe that I need the method "vDSP_conv" from the Accelerate Framework, but I can't follow how to successfully set it up. The thing throwing me off the most is the need for 2 inputs. Perhaps I have the wrong function, but I couldn't find one that operated on a single vector.
The documentation can be found here
Copied from the site
vDSP_conv
Performs either correlation or convolution on two vectors; single
precision.
void vDSP_conv(const float __vDSP_signal[],
               vDSP_Stride __vDSP_signalStride,
               const float __vDSP_filter[],
               vDSP_Stride __vDSP_strideFilter,
               float       __vDSP_result[],
               vDSP_Stride __vDSP_strideResult,
               vDSP_Length __vDSP_lenResult,
               vDSP_Length __vDSP_lenFilter);
Parameters
__vDSP_signal: Input vector A. The length of this vector must be at least __vDSP_lenResult + __vDSP_lenFilter - 1.
__vDSP_signalStride: The stride through __vDSP_signal.
__vDSP_filter: Input vector B.
__vDSP_strideFilter: The stride through __vDSP_filter.
__vDSP_result: Output vector C.
__vDSP_strideResult: The stride through __vDSP_result.
__vDSP_lenResult: The length of __vDSP_result.
__vDSP_lenFilter: The length of __vDSP_filter.
For an example, just assume you have an array of float x = [1.0, 2.0, 3.0, 4.0, 5.0]. How would I take the autocorrelation of that?
The output should be something similar to float y = [5.0, 14.0, 26.0, 40.0, 55.0, 40.0, 26.0, 14.0, 5.0] //generated using Matlab's xcorr(x) function
Performing autocorrelation simply means taking the cross-correlation of a vector with itself; there is nothing fancy about it.
So in your case, do:
vDSP_conv(x, 1, x, 1, result, 1, 2*len_X-1, len_X);
Check this sample code (which performs a convolution) for more details:
http://disanji.net/iOS_Doc/#documentation/Performance/Conceptual/vDSP_Programming_Guide/SampleCode/SampleCode.html
EDIT: This borders on the ridiculous, but you do need to offset the x values into the signal buffer by a specific number of zeros.
The following is working code; just set filter to the values of x you want, and it will put everything in the correct position:
float *signal, *filter, *result;
int32_t signalStride, filterStride, resultStride;
uint32_t lenSignal, filterLength, resultLength;
uint32_t i;

filterLength = 5;
resultLength = filterLength * 2 - 1;
lenSignal = ((filterLength + 3) & 0xFFFFFFFC) + resultLength;
signalStride = filterStride = resultStride = 1;

printf("\nConvolution ( resultLength = %d, "
       "filterLength = %d )\n\n", resultLength, filterLength);

/* Allocate memory for the input operands and check its availability. */
signal = (float *) malloc(lenSignal * sizeof(float));
filter = (float *) malloc(filterLength * sizeof(float));
result = (float *) malloc(resultLength * sizeof(float));

for (i = 0; i < filterLength; i++)
    filter[i] = (float)(i + 1);

/* Zero the whole signal buffer first so the padding around the copied values
   is well defined, then place the filter values at the offset expected by
   vDSP_conv. */
for (i = 0; i < lenSignal; i++)
    signal[i] = 0.0f;
for (i = 0; i < resultLength; i++)
    if (i >= resultLength - filterLength)
        signal[i] = filter[i - filterLength + 1];

/* Correlation. */
vDSP_conv(signal, signalStride, filter, filterStride,
          result, resultStride, resultLength, filterLength);

printf("signal: ");
for (i = 0; i < lenSignal; i++)
    printf("%2.1f ", signal[i]);
printf("\n filter: ");
for (i = 0; i < filterLength; i++)
    printf("%2.1f ", filter[i]);
printf("\n result: ");
for (i = 0; i < resultLength; i++)
    printf("%2.1f ", result[i]);

/* Free allocated memory. */
free(signal);
free(filter);
free(result);
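As a sanity check (this is not part of the original answer), the vDSP output can be compared against a plain reference loop that computes the same full autocorrelation directly; the function below is a hypothetical helper of my own:

/* Plain-C reference autocorrelation for checking the vDSP result.
   x has length n; out must have length 2*n - 1. */
void autocorr_reference(const float *x, int n, float *out)
{
    for (int lag = -(n - 1); lag <= n - 1; ++lag)
    {
        float sum = 0.0f;
        for (int i = 0; i < n; ++i)
        {
            int j = i + lag;
            if (j >= 0 && j < n)
                sum += x[i] * x[j];   /* only overlapping samples contribute */
        }
        out[lag + n - 1] = sum;
    }
}

For x = [1.0, 2.0, 3.0, 4.0, 5.0] this reproduces the xcorr(x) output quoted in the question.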