OMP Atomic, why? - atomic

I've got a problem with OpenMP. I know that if you are incrementing something in a parallel block you have to set an atomic before that expression. But in my code there is a part I don't understand.
Why do I have to use the atomic here?
#pragma omp parallel
{
double distance, magnitude, factor, r;
vector_t direction;
int i, j;
#pragma omp for
for (i = 0; i < n_body - 1; i++)
{
for (j = i + 1; j < n_body; j++)
{
r = SQR (bodies[i].position.x - bodies[j].position.x) + SQR (bodies[i].position.y - bodies[j].position.y);
// avoid numerical instabilities
if (r < EPSILON)
{
// this is not how nature works :-)
r += EPSILON;
}
distance = sqrt (r);
magnitude = (G * bodies[i].mass * bodies[j].mass) / (distance * distance);
factor = magnitude / distance;
direction.x = bodies[j].position.x - bodies[i].position.x;
direction.y = bodies[j].position.y - bodies[i].position.y;
// +force for body i
#pragma omp atomic
bodies[i].force.x += factor * direction.x;
#pragma omp atomic
bodies[i].force.y += factor * direction.y;
// -force for body j
#pragma omp atomic
bodies[j].force.x -= factor * direction.x;
#pragma omp atomic
bodies[j].force.y -= factor * direction.y;
}
}
}
And why don't I have to use it here:
#pragma omp parallel
{
vector_t delta_v, delta_p;
int i;
#pragma omp for
for (i = 0; i < n_body; i++)
{
// calculate delta_v
delta_v.x = bodies[i].force.x / bodies[i].mass * dt;
delta_v.y = bodies[i].force.y / bodies[i].mass * dt;
// calculate delta_p
delta_p.x = (bodies[i].velocity.x + delta_v.x / 2.0) * dt;
delta_p.y = (bodies[i].velocity.y + delta_v.y / 2.0) * dt;
// update body velocity and position
bodies[i].velocity.x += delta_v.x;
bodies[i].velocity.y += delta_v.y;
bodies[i].position.x += delta_p.x;
bodies[i].position.y += delta_p.y;
// reset forces
bodies[i].force.x = bodies[i].force.y = 0.0;
if (bounce)
{
// bounce on boundaries (i.e. it's more like billard)
if ((bodies[i].position.x < -body_distance_factor) || (bodies[i].position.x > body_distance_factor))
bodies[i].velocity.x = -bodies[i].velocity.x;
if ((bodies[i].position.y < -body_distance_factor) || (bodies[i].position.y > body_distance_factor))
bodies[i].velocity.y = -bodies[i].velocity.y;
}
}
}
The code works as it is now, but I simply don't understand why.
Can you help me?
Kind Regards
Michael

In the second of the two code samples, each parallel iteration of the loop works on element [i] of the array and never looks at any neighbouring elements. Thus each iteration has no effect on any other iteration, and they can all be executed at the same time without worry.
In the first code example, however, each parallel iteration of the loop may read from and write to any element of the bodies array through the index [j]. This means that two threads could be trying to update the same memory location at the same time, or one thread could be writing to a location that another one is reading. To avoid race conditions you need to ensure that the writes are atomic.
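The race described above can be reproduced in miniature. The following is a minimal sketch, not the poster's program: count_pair_partners is a hypothetical stand-in for the force loop in which every pair (i, j) adds 1 to both of its elements, so a correct run leaves n - 1 in every slot.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for the force loop: every pair (i, j) adds 1 to both
// counts[i] and counts[j], so each entry must end up equal to n - 1.
std::vector<double> count_pair_partners(int n) {
    assert(n >= 0);
    std::vector<double> counts(n, 0.0);
    double *c = counts.data();
    #pragma omp parallel for
    for (int i = 0; i < n - 1; i++) {
        for (int j = i + 1; j < n; j++) {
            // Another thread may reach this same element through its own j.
            #pragma omp atomic
            c[i] += 1.0;
            // Several threads can hold the same j at the same time.
            #pragma omp atomic
            c[j] += 1.0;
        }
    }
    return counts;
}
```

Built with OpenMP enabled (for example -fopenmp on GCC or Clang), removing either atomic can produce entries below n - 1 on some runs. Without OpenMP the pragmas are ignored and the loop simply runs serially.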

When multiple threads write to the same memory location you need to use an atomic operator or a critical section to prevent a race condition. Atomic operators are faster but have more restrictions (e.g. they only operate on POD with some basic operators) but in your case you can use them.
So you have to ask yourself when threads are writing to the same memory location. In the first case you parallelize only the outer loop over i and not the inner loop over j, so each value of i belongs to exactly one thread, while the values of j overlap between threads.
Let's consider an example from the first case. Let's assume n_body=101 and there are 4 threads.
Thread one i = 0-24, j = 1-100, j range = 100
Thread two i = 25-49, j = 26-100, j range = 75
Thread three i = 50-74, j = 51-100, j range = 50
Thread four i = 75-99, j = 76-100, j range = 25
First of all you see that the threads write to some of the same memory locations. For example, all threads write to memory locations with j=76-100. That's why you need the atomic operator for j. No two threads ever share a value of i, but one thread's j can equal another thread's i: thread one (j = 1-100) updates bodies[30] through j while thread two (i = 25-49) updates the same element through i. An atomic update is only protected against other atomic updates, so the writes through i need the atomic operator as well.
In your second case you only have one loop and it's parallelized so no thread writes to the same memory location so you don't need the atomic operators.
That answers your question but here are some additional comments to improve the performance of your code:
There is another important observation independent of the atomic operators. You can see that thread one runs over j 100 times while thread 4 only runs over j 25 times. Therefore, the load is not well distributed using schedule(static) which is typically the default scheduler. For larger values of n_body this is going to get worse.
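To put numbers on that imbalance, here is a small sketch that counts the inner-loop iterations each thread performs. It assumes schedule(static) hands every thread one equal contiguous chunk of i, and the helper name inner_iterations_per_thread is made up for this illustration.

```cpp
#include <cassert>
#include <vector>

// Inner-loop iterations per thread when the outer loop i = 0 .. n_body-2 is
// split into equal contiguous chunks (an assumption about how
// schedule(static) divides the work; n_body - 1 must divide evenly here).
std::vector<long> inner_iterations_per_thread(int n_body, int n_threads) {
    assert(n_threads > 0 && (n_body - 1) % n_threads == 0);
    const int chunk = (n_body - 1) / n_threads;
    std::vector<long> work(n_threads, 0L);
    for (int t = 0; t < n_threads; t++) {
        for (int i = t * chunk; i < (t + 1) * chunk; i++) {
            work[t] += n_body - 1 - i;  // j runs from i + 1 to n_body - 1
        }
    }
    return work;
}
```

For n_body = 101 and 4 threads this gives 2200, 1575, 950 and 325 inner iterations, so the first thread does almost seven times the work of the last one.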
One solution is to try schedule(guided). I have not used this before but I think it's the right solution (see "OpenMP: for schedule"): "A special kind of dynamic scheduling is the guided where smaller and smaller iteration blocks are given to each task as the work progresses." According to the standard, with guided each successive block gets "number_of_iterations_remaining / number_of_threads" iterations. So for our example that gives
Thread one i = 0-24, j = 1-100, j range = 100
Thread two i = 25-44, j = 26-100, j range = 75
Thread three i = 45-59, j = 46-100, j range = 55
Thread four i = 60-69, j = 61-100, j range = 40
Thread one i = 70-76, j = 71-100
...
Notice now that the threads are more evenly distributed. With static scheduling the fourth thread only ran over j 25 times and now the fourth thread runs over j 40 times.
But let's look at your algorithm more carefully. In the first case you are calculating the gravitational force on each body. Here are two ways you can do this:
//method1
#pragma omp parallel for
for(int i=0; i<n_body; i++) {
vector_t force = 0;
for(int j=0; j<n_body; j++) {
force += gravity_force(i,j);
}
bodies[i].force = force;
}
But gravity_force(i,j) = -gravity_force(j,i) (the force on j is equal and opposite to the force on i), so you don't need to calculate it twice. So you found a faster solution:
//method2 (see your code and mine below on how to parallelize this)
for(int i=0; i<(n_body-1); i++) {
    for(int j=i+1; j<n_body; j++) {
        bodies[i].force += gravity_force(i,j);
        bodies[j].force -= gravity_force(i,j); // equal and opposite
    }
}
The first method does n_bodies*n_bodies iterations and the second method does (n_bodies-1)*n_bodies/2 iterations, which to first order is n_bodies*n_bodies/2. However, the second method is much more difficult to parallelize efficiently (see your code and my code below). It has to use atomic operations and the load is not balanced. The first method does twice as many iterations but the load is evenly distributed and it does not need any atomic operations. You should test both methods to see which one is faster.
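Both methods can be written out serially to check that they agree before any parallelization. This is a sketch under assumptions: Vec2, Body, forces_full and forces_pairwise are illustrative names, G is set to 1, and the EPSILON guard from the question is replaced by an assertion that no two bodies coincide.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

struct Vec2 { double x, y; };
struct Body { Vec2 position; double mass; };

// Force on body i due to body j (G set to 1 for this sketch).
Vec2 gravity_force(const std::vector<Body> &b, int i, int j) {
    double dx = b[j].position.x - b[i].position.x;
    double dy = b[j].position.y - b[i].position.y;
    double r2 = dx * dx + dy * dy;
    assert(r2 > 0.0);  // bodies must not coincide
    double factor = b[i].mass * b[j].mass / (r2 * std::sqrt(r2));
    return {factor * dx, factor * dy};
}

// Method 1: every body sums over all others; n*(n-1) force evaluations.
std::vector<Vec2> forces_full(const std::vector<Body> &b) {
    int n = static_cast<int>(b.size());
    std::vector<Vec2> f(n, Vec2{0.0, 0.0});
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            if (i == j) continue;
            Vec2 g = gravity_force(b, i, j);
            f[i].x += g.x; f[i].y += g.y;
        }
    return f;
}

// Method 2: each pair once, applied with opposite signs; n*(n-1)/2 evaluations.
std::vector<Vec2> forces_pairwise(const std::vector<Body> &b) {
    int n = static_cast<int>(b.size());
    std::vector<Vec2> f(n, Vec2{0.0, 0.0});
    for (int i = 0; i < n - 1; i++)
        for (int j = i + 1; j < n; j++) {
            Vec2 g = gravity_force(b, i, j);
            f[i].x += g.x; f[i].y += g.y;
            f[j].x -= g.x; f[j].y -= g.y;
        }
    return f;
}
```

In forces_full each iteration of i writes only f[i], which is why that outer loop can be parallelized without atomics. In forces_pairwise the write to f[j] crosses iterations of i.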
To parallelize the second method you can do what you did:
// shorthand: apply the atomic to the .x and .y components separately, as in your code
#pragma omp parallel for schedule(static) // you should try schedule(guided)
for(int i=0; i<(n_body-1); i++) {
    for(int j=i+1; j<n_body; j++) {
        #pragma omp atomic // needed: another thread can reach this body through its j
        bodies[i].force += gravity_force(i,j);
        #pragma omp atomic // needed: several threads share the same j
        bodies[j].force -= gravity_force(i,j);
    }
}
Or a better solution is to use private copies of force like this:
#pragma omp parallel
{
    // each thread gets its own zero-initialized copy of the force array
    vector_t *force = new vector_t[n_body]();
    #pragma omp for schedule(guided)
    for (int i = 0; i < (n_body - 1); i++) {
        for (int j = i + 1; j < n_body; j++) {
            force[i] += gravity_force(i, j); // private, so no atomic needed
            force[j] -= gravity_force(i, j);
        }
    }
    // every thread merges its whole private array, so this loop is NOT an omp for
    for (int i = 0; i < n_body; i++) {
        #pragma omp atomic
        bodies[i].force.x += force[i].x;
        #pragma omp atomic
        bodies[i].force.y += force[i].y;
    }
    delete[] force;
}
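The private-copy pattern can be tried out on a toy problem. This is a minimal sketch, not the n-body code itself: count_pair_partners_private is a hypothetical function in which every pair (i, j) adds 1 to both of its slots, and the merge uses a critical section instead of per-element atomics.

```cpp
#include <cassert>
#include <vector>

// Every pair (i, j) adds 1 to slot i and slot j. Each thread accumulates into
// its own array and merges it into the shared result once at the end.
std::vector<double> count_pair_partners_private(int n) {
    assert(n >= 0);
    std::vector<double> counts(n, 0.0);
    #pragma omp parallel
    {
        std::vector<double> local(n, 0.0);  // private: declared inside the region
        #pragma omp for schedule(guided)
        for (int i = 0; i < n - 1; i++) {
            for (int j = i + 1; j < n; j++) {
                local[i] += 1.0;  // no atomic needed on a private array
                local[j] += 1.0;
            }
        }
        // Every thread merges its whole array, so this is a plain loop inside
        // a critical section rather than an "omp for".
        #pragma omp critical
        {
            for (int k = 0; k < n; k++) counts[k] += local[k];
        }
    }
    return counts;
}
```

The synchronization cost drops from one atomic per pair update to one merge of n elements per thread, at the price of n extra elements of memory per thread.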

OpenCL 1D strided convolution performance

For downsampling a signal, I use a FIR filter + decimation stage (which is practically a strided convolution). The big advantage of combining filtering and decimation is the reduced computational cost (by the decimation factor).
With a straightforward OpenCL implementation, I am not able to benefit from the decimation. Quite to the contrary: the convolution with a decimation factor of 4 is 25% slower than the full convolution.
Kernel Code:
__kernel void decimation(__constant float *input,
__global float *output,
__constant float *coefs,
const int taps,
const int decimationFactor) {
int posOutput = get_global_id(0);
float result = 0;
for (int tap=0; tap<taps; tap++) {
int posInput = (posOutput * decimationFactor) - tap;
result += input[posInput] * coefs[tap];
}
output[posOutput] = result;
}
I guess it is due to the uncoalesced memory access, though I cannot think of a way to fix the problem. Any ideas?
Edit: I tried Dithermaster's solution to split the problem into coalesced reads to shared local memory and convolution from local memory:
__kernel void decimation(__constant float *input,
__global float *output,
__constant float *coefs,
const int taps,
const int decimationFactor,
const int bufferSize,
__local float *localInput) {
const int posOutput = get_global_id(0);
const int localSize = get_local_size(0);
const int localId = get_local_id(0);
const int groupId = get_group_id(0);
const int localInputOffset = taps-1;
const int localInputOverlap = taps-decimationFactor;
const int localInputSize = localInputOffset + localSize * decimationFactor;
// 1. transfer global input data to local memory
// read global input to local input (only overlap)
if (localId < localInputOverlap) {
int posInputStart = ((groupId*localSize) * decimationFactor) - (taps-1);
int posInput = posInputStart + localId;
int posLocalInput = localId;
localInput[posLocalInput] = 0.0f;
if (posInput >= 0)
localInput[posLocalInput] = input[posInput];
}
// read remaining global input to local input
// 1. alternative: strided read
// for (int i=0; i<decimationFactor; i++) {
// int posInputStart = (groupId*localSize) * decimationFactor;
// int posInput = posInputStart + localId * decimationFactor - i;
// int posLocalInput = localInputOffset + localId * decimationFactor - i;
// localInput[posLocalInput] = 0.0f;
// if ((posInput >= 0) && (posInput < bufferSize*decimationFactor))
// localInput[posLocalInput] = input[posInput];
// }
// 2. alternative: coalesced read (in blocks of localSize)
for (int i=0; i<decimationFactor; i++) {
int posInputStart = (groupId*localSize) * decimationFactor;
int posInput = posInputStart - (decimationFactor-1) + i*localSize + localId;
int posLocalInput = localInputOffset - (decimationFactor-1) + i*localSize + localId;
localInput[posLocalInput] = 0.0f;
if ((posInput >= 0) && (posInput < bufferSize*decimationFactor))
localInput[posLocalInput] = input[posInput];
}
// 2. wait until every thread completed
barrier(CLK_LOCAL_MEM_FENCE);
// 3. convolution
if (posOutput < bufferSize) {
float result = 0.0f;
for (int tap=0; tap<taps; tap++) {
int posLocalInput = localInputOffset + (localId * decimationFactor) - tap;
result += localInput[posLocalInput] * coefs[tap];
}
output[posOutput] = result;
}
}
Big improvement! But still, the performance does not correlate with the overall number of operations (it is not proportional to the decimation factor):
speedup for full convolution compared to first approach: ~12 %
computation time for decimation compared to full convolution:
decimation factor 2: 61 %
decimation factor 4: 46 %
decimation factor 8: 53 %
decimation factor 16: 68 %
The performance has an optimum at a decimation factor of 4. Why is that? Any ideas for further improvements?
Edit 2: Diagram with shared local memory:
Edit 3: Comparison of the performance for the 3 different implementations
Due to the amount of data overlap (66%), this could benefit from sharing data read from memory between work items within a workgroup. You could get rid of redundant reads and also make coalesced reads. Break your kernel up into two parts: the first part does coalesced reads for all the data needed within the work group, into shared local memory. Then a memory barrier to synchronize. Then in the second part do the convolutions using reads from shared local memory.
P.S. Thanks for the diagram, it helped me understand your goal more quickly than trying to read code.
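When comparing kernel variants like the ones above, a plain CPU reference of the strided convolution is handy for validating the output. The following is a sketch only: decimate_fir is a hypothetical helper, and it treats samples before the start of the input as zero, which is an assumption (the first kernel in the question does not guard negative indices).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU reference: output[n] = sum over tap of input[n*decimation - tap] * coefs[tap],
// treating samples before the start of the input as zero.
std::vector<float> decimate_fir(const std::vector<float> &input,
                                const std::vector<float> &coefs,
                                int decimation) {
    assert(decimation >= 1);
    const long n_in = static_cast<long>(input.size());
    const long n_out = (n_in + decimation - 1) / decimation;
    std::vector<float> output(static_cast<std::size_t>(n_out), 0.0f);
    for (long n = 0; n < n_out; n++) {
        float acc = 0.0f;
        for (long tap = 0; tap < static_cast<long>(coefs.size()); tap++) {
            long pos = n * decimation - tap;
            if (pos >= 0) acc += input[pos] * coefs[tap];
        }
        output[n] = acc;
    }
    return output;
}
```

With input 1..8, coefficients {1, 1} and a decimation factor of 2, the result is 1, 5, 9, 13: only every second output of the full convolution is computed, which is where the saving by the decimation factor comes from.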

Implementing convolution in C++ using fftw 3

UPDATE
See my fundamental based question on DSP stackexchange here
UPDATE
I am still experiencing crackling in the output. These crackles are now less pronounced and are only audible when the volume is turned up
UPDATE
Following the advice given here has removed the crackling sound from my output. I will test with other available HRIRs to see if the convolution is indeed working properly and will answer this question once I've verified that my code now works
UPDATE
I have made some progress, but I still think there is an issue with my convolution implementation.
The following is my revised program:
#define HRIR_LENGTH 512
#define WAV_SAMPLE_SIZE 256
while (signal_input_wav.read(&signal_input_buffer[0], WAV_SAMPLE_SIZE) >= WAV_SAMPLE_SIZE)
{
#ifdef SKIP_CONVOLUTION
// Copy the input buffer over
std::copy(signal_input_buffer.begin(),
signal_input_buffer.begin() + WAV_SAMPLE_SIZE,
signal_output_buffer.begin());
signal_output_wav.write(&signal_output_buffer[0], WAV_SAMPLE_SIZE);
#else
// Copy the first segment into the buffer
// with zero padding
for (int i = 0; i < HRIR_LENGTH; ++i)
{
if (i < WAV_SAMPLE_SIZE)
{
signal_buffer_fft_in[i] = signal_input_buffer[i];
}
else
{
signal_buffer_fft_in[i] = 0; // zero pad
}
}
// Dft of the signal segment
fftw_execute(signal_fft);
// Convolve in the frequency domain by multiplying filter kernel with dft signal
for (int i = 0; i < HRIR_LENGTH; ++i)
{
signal_buffer_ifft_in[i] = signal_buffer_fft_out[i] * left_hrir_fft_out[i]
- signal_buffer_fft_out[HRIR_LENGTH - i] * left_hrir_fft_out[HRIR_LENGTH - i];
signal_buffer_ifft_in[HRIR_LENGTH - i] = signal_buffer_fft_out[i] * left_hrir_fft_out[HRIR_LENGTH - i]
+ signal_buffer_fft_out[HRIR_LENGTH - i] * left_hrir_fft_out[i];
//double re = signal_buffer_out[i];
//double im = signal_buffer_out[BLOCK_OUTPUT_SIZE - i];
}
// inverse dft back to time domain
fftw_execute(signal_ifft);
// Normalize the data
for (int i = 0; i < HRIR_LENGTH; ++i)
{
signal_buffer_ifft_out[i] = signal_buffer_ifft_out[i] / HRIR_LENGTH;
}
// Overlap-add method
for (int i = 0; i < HRIR_LENGTH; ++i)
{
if (i < WAV_SAMPLE_SIZE)
{
signal_output_buffer[i] = signal_overlap_buffer[i] + signal_buffer_ifft_out[i];
}
else
{
signal_output_buffer[i] = signal_buffer_ifft_out[i];
signal_overlap_buffer[i] = signal_output_buffer[i]; // record into the overlap buffer
}
}
// Write the block to the output file
signal_output_wav.write(&signal_output_buffer[0], HRIR_LENGTH);
#endif
}
The resulting output sound file contains crackling sounds; presumably artefacts left from the buggy fftw implementation. Also, writing blocks of 512 (HRIR_LENGTH) seems to result in some aliasing, with the soundfile upon playback sounding like a vinyl record being slowed down. Writing out blocks of size WAV_SAMPLE_SIZE (256, half of the fft output) seems to playback at normal speed.
However, irrespective of this the crackling sound remains.
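One thing worth checking in the multiplication loop is the bin layout. The indexing with [HRIR_LENGTH - i] suggests FFTW's halfcomplex format, so the sketch below assumes the plans are r2r plans of kind FFTW_R2HC (the question does not show the plan creation). In that layout bin 0 and bin n/2 are purely real and each complex bin pair should be visited once, whereas the loop in the question runs i from 0 to HRIR_LENGTH - 1, which touches index HRIR_LENGTH and visits every pair twice. multiply_halfcomplex is an illustrative name, not an FFTW function.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Pointwise product of two spectra stored in FFTW's halfcomplex layout
// (r0, r1, ..., r(n/2), i((n+1)/2-1), ..., i1). Assumes both inputs came from
// r2r plans of kind FFTW_R2HC with the same even length n.
std::vector<double> multiply_halfcomplex(const std::vector<double> &a,
                                         const std::vector<double> &b) {
    assert(a.size() == b.size() && a.size() >= 2 && a.size() % 2 == 0);
    const std::size_t n = a.size();
    std::vector<double> out(n);
    out[0] = a[0] * b[0];              // DC bin is purely real
    out[n / 2] = a[n / 2] * b[n / 2];  // Nyquist bin is purely real
    for (std::size_t k = 1; k < n / 2; k++) {
        const double ar = a[k], ai = a[n - k];
        const double br = b[k], bi = b[n - k];
        out[k] = ar * br - ai * bi;      // real part
        out[n - k] = ar * bi + ai * br;  // imaginary part
    }
    return out;
}
```

Writing into a separate output array also avoids overwriting a bin before its partner has been read.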
ORIGINAL
I'm trying to implement convolution using the fftw library in C++.
I can load my filter perfectly fine, and am zero padding both the filter (of length 512) and the input signal (of length 513) in order to get a signal output block of 1024 and using this as the fft size.
Here is my code:
#define BLOCK_OUTPUT_SIZE 1024
#define HRIR_LENGTH 512
#define WAV_SAMPLE_SIZE 513
#define INPUT_SHIFT 511
while (signal_input_wav.read(&signal_input_buffer[0], WAV_SAMPLE_SIZE) >= WAV_SAMPLE_SIZE)
{
#ifdef SKIP_CONVOLUTION
// Copy the input buffer over
std::copy(signal_input_buffer.begin(),
signal_input_buffer.begin() + WAV_SAMPLE_SIZE,
signal_output_buffer.begin());
signal_output_wav.write(&signal_output_buffer[0], WAV_SAMPLE_SIZE);
#else
// Zero pad input
for (int i = 0; i < INPUT_SHIFT; ++i)
signal_input_buffer[WAV_SAMPLE_SIZE + i] = 0;
// Copy to the signal convolve buffer
for (int i = 0; i < BLOCK_OUTPUT_SIZE; ++i)
{
signal_buffer_in[i] = signal_input_buffer[i];
}
// Dft of the signal segment
fftw_execute(signal_fft);
// Convolve in the frequency domain by multiplying filter kernel with dft signal
for (int i = 1; i < BLOCK_OUTPUT_SIZE; ++i)
{
signal_buffer_out[i] = signal_buffer_in[i] * left_hrir_fft_in[i]
- signal_buffer_in[BLOCK_OUTPUT_SIZE - i] * left_hrir_fft_in[BLOCK_OUTPUT_SIZE - i];
signal_buffer_out[BLOCK_OUTPUT_SIZE - i]
= signal_buffer_in[BLOCK_OUTPUT_SIZE - i] * left_hrir_fft_in[i]
+ signal_buffer_in[i] * left_hrir_fft_in[BLOCK_OUTPUT_SIZE - i];
double re = signal_buffer_out[i];
double im = signal_buffer_out[BLOCK_OUTPUT_SIZE - i];
}
// inverse dft back to time domain
fftw_execute(signal_ifft);
// Normalize the data
for (int i = 0; i < BLOCK_OUTPUT_SIZE; ++i)
{
signal_buffer_out[i] = signal_buffer_out[i] / i;
}
// Overlap and add with the previous block
if (first_block)
{
first_block = !first_block;
for (int i = 0; i < BLOCK_OUTPUT_SIZE; ++i)
{
signal_output_buffer[i] = signal_buffer_out[i];
}
}
else
{
for (int i = WAV_SAMPLE_SIZE; i < BLOCK_OUTPUT_SIZE; ++i)
{
signal_output_buffer[i] = signal_output_buffer[i] + signal_buffer_out[i];
}
}
// Write the block to the output file
signal_output_wav.write(&signal_output_buffer[0], BLOCK_OUTPUT_SIZE);
#endif
}
In the end, the resulting output file contains garbage, but is not all zeros.
Things I have tried:
1) Using the standard complex interface fftw_plan_dft_1d with the appropriate fftw_complex type. Same issues arise.
2) Using a smaller input sample size and iterating over the zero padded blocks (overlap-add).
I also note that it's not a fault of libsndfile; toggling SKIP_CONVOLUTION does successfully result in copying the input file to the output file.
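The block bookkeeping of overlap-add can be checked independently of FFTW. The sketch below uses direct time-domain convolution per block (a deliberate substitution for the FFT, so it needs no library) and hypothetical helper names convolve and overlap_add. Each block of L input samples produces L + M - 1 output samples for a filter of length M, and the last M - 1 of them overlap the next block.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Direct linear convolution; result length is x.size() + h.size() - 1.
std::vector<double> convolve(const std::vector<double> &x, const std::vector<double> &h) {
    assert(!x.empty() && !h.empty());
    std::vector<double> y(x.size() + h.size() - 1, 0.0);
    for (std::size_t i = 0; i < x.size(); i++)
        for (std::size_t k = 0; k < h.size(); k++)
            y[i + k] += x[i] * h[k];
    return y;
}

// Overlap-add: convolve blocks of `block` samples and add each result's tail
// (h.size() - 1 samples) onto the start of the following output.
std::vector<double> overlap_add(const std::vector<double> &x,
                                const std::vector<double> &h, std::size_t block) {
    assert(block > 0 && !x.empty() && !h.empty());
    std::vector<double> y(x.size() + h.size() - 1, 0.0);
    for (std::size_t start = 0; start < x.size(); start += block) {
        std::size_t len = std::min(block, x.size() - start);
        std::vector<double> seg(x.begin() + start, x.begin() + start + len);
        std::vector<double> part = convolve(seg, h);
        for (std::size_t i = 0; i < part.size(); i++) y[start + i] += part[i];
    }
    return y;
}
```

In FFT terms, each block's transform has to hold at least L + M - 1 points or the tail wraps around. With WAV_SAMPLE_SIZE 256 and HRIR_LENGTH 512 that is 767 samples, more than the 512-point transform in the revised program holds, and only the first L samples of each block are final when the block is written out.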

How does the reversebits function of HLSL SM5 work?

I am trying to implement an inverse FFT in a HLSL compute shader and don't understand how the new reversebits function works. The shader is run under Unity3D, but that shouldn't make a difference.
The problem is that the resulting texture remains black with the exception of the leftmost one or two pixels in every row. It seems to me as if the reversebits function doesn't return the correct indexes.
My very simple code is as follows:
#pragma kernel BitReverseHorizontal
Texture2D<float4> inTex;
RWTexture2D<float4> outTex;
uint2 getTextureThreadPosition(uint3 groupID, uint3 threadID) {
uint2 pos;
pos.x = (groupID.x * 16) + threadID.x;
pos.y = (groupID.y * 16) + threadID.y;
return pos;
}
[numthreads(16,16,1)]
void BitReverseHorizontal (uint3 threadID : SV_GroupThreadID, uint3 groupID : SV_GroupID)
{
uint2 pos = getTextureThreadPosition(groupID, threadID);
uint xPos = reversebits(pos.x);
uint2 revPos = uint2(xPos, pos.y);
float4 values;
values.x = inTex[pos].x;
values.y = inTex[pos].y;
values.z = inTex[revPos].z;
values.w = 0.0f;
outTex[revPos] = values;
}
I played around with this for quite a while and found out, that if I replace the reversebits line with this one here:
uint xPos = reversebits(pos.x << 23);
it works. Although I have no idea why. Could be just coincidence. Could someone please explain to me, how I have to use the reversebits function correctly?
Are you sure you want to reverse the bits?
x = 0: reversed: x = 0
x = 1: reversed: x = 2,147,483,648
x = 2: reversed: x = 1,073,741,824
etc....
If you fetch texels from a texture using coordinates exceeding the width of the texture then you're going to get black. Unless the texture is > 1 billion texels wide (it isn't) then you're fetching well outside the border.
I am doing the same and ran into the same problem. These answers resolved it for me, but I'll give you the explanation and a whole solution.
So the solution with variable length buffers in HLSL is:
uint reversedIndx;
uint bits = 32 - (uint)log2(xLen); // bits in a uint minus log2(numberOfIndices); xLen must be a power of two
for (uint j = 0; j < xLen; j++)
reversedIndx = reversebits(j << bits);
What you found essentially pushes out all the leading zeros of your index, so you are reversing only the least significant (rightmost) bits, up to the number of bits we want.
for example:
int length = 8;
int bits = 32 - 3; // because 1 << 3 is 0b1000 and we want the inverse like a mask
int j = 6;
and since the size of an int is generally 32bits in binary j would be
j = 0b00000000000000000000000000000110;
and reversed it would be (AKA reversebits(j);)
j = 0b01100000000000000000000000000000;
Which was our error, so j bit shifted by bits would be
j = 0b11000000000000000000000000000000;
and then reversed and what we want would be
j = 0b00000000000000000000000000000011;
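The same arithmetic can be checked on the CPU. Below is a small sketch that emulates a 32-bit reversal in portable C++ and applies the shift first. reverse_bits32 and bit_reverse_index are illustrative names, not HLSL intrinsics.

```cpp
#include <cassert>
#include <cstdint>

// Portable emulation of a 32-bit bit reversal (what HLSL's reversebits does).
std::uint32_t reverse_bits32(std::uint32_t v) {
    std::uint32_t r = 0;
    for (int i = 0; i < 32; i++) {
        r = (r << 1) | (v & 1u);
        v >>= 1;
    }
    return r;
}

// Reverse only the lowest log2_len bits of index, as an FFT of length
// 2^log2_len needs.
std::uint32_t bit_reverse_index(std::uint32_t index, unsigned log2_len) {
    assert(log2_len >= 1 && log2_len <= 32);
    return reverse_bits32(index << (32u - log2_len));
}
```

Here bit_reverse_index(6, 3) returns 3, matching the worked example, and bit_reverse_index(1, 9) returns 256. A shift of 23 = 32 - 9 therefore corresponds to 512 indices per row, which would fit a 512-texel-wide texture (an inference, since the question does not state the width).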

Perform autocorrelation with vDSP_conv from Apple Accelerate Framework

I need to perform the autocorrelation of an array (vector) but I am having trouble finding the correct way to do so. I believe that I need the method "vDSP_conv" from the Accelerate Framework, but I can't follow how to successfully set it up. The thing throwing me off the most is the need for 2 inputs. Perhaps I have the wrong function, but I couldn't find one that operated on a single vector.
The documentation can be found here
Copied from the site
vDSP_conv
Performs either correlation or convolution on two vectors; single precision.
void vDSP_conv (
const float __vDSP_signal[], vDSP_Stride __vDSP_signalStride,
const float __vDSP_filter[], vDSP_Stride __vDSP_strideFilter,
float __vDSP_result[], vDSP_Stride __vDSP_strideResult,
vDSP_Length __vDSP_lenResult, vDSP_Length __vDSP_lenFilter );
Parameters
__vDSP_signal
Input vector A. The length of this vector must be at least __vDSP_lenResult + __vDSP_lenFilter - 1.
__vDSP_signalStride
The stride through __vDSP_signal.
__vDSP_filter
Input vector B.
__vDSP_strideFilter
The stride through __vDSP_filter.
__vDSP_result
Output vector C.
__vDSP_strideResult
The stride through __vDSP_result.
__vDSP_lenResult
The length of __vDSP_result.
__vDSP_lenFilter
The length of __vDSP_filter.
For an example, just assume you have an array of float x = [1.0, 2.0, 3.0, 4.0, 5.0]. How would I take the autocorrelation of that?
The output should be something similar to float y = [5.0, 14.0, 26.0, 40.0, 55.0, 40.0, 26.0, 14.0, 5.0] //generated using Matlab's xcorr(x) function
performing autocorrelation simply means you take the cross-correlation of one vector with itself. There is nothing fancy about it.
so in your case, do:
vDSP_conv(x, 1, x, 1, result, 1, 2*len_X-1, len_X);
check a sample code for more details: (which does a convolution)
http://disanji.net/iOS_Doc/#documentation/Performance/Conceptual/vDSP_Programming_Guide/SampleCode/SampleCode.html
EDIT: This borders on ridiculous, but you need to pad x with zeros: filterLength - 1 of them before the data and at least as many after it, so that the signal holds at least lenResult + lenFilter - 1 samples.
The following is working code; just set filter to the values of x you desire, and it will put the rest in the correct position:
float *signal, *filter, *result;
int32_t signalStride, filterStride, resultStride;
uint32_t lenSignal, filterLength, resultLength;
uint32_t i;
filterLength = 5;
resultLength = filterLength*2 -1;
lenSignal = ((filterLength + 3) & 0xFFFFFFFC) + resultLength;
signalStride = filterStride = resultStride = 1;
printf("\nConvolution ( resultLength = %u, "
"filterLength = %u )\n\n", resultLength, filterLength);
/* Allocate memory for the input operands and check its availability. */
/* calloc zero-fills the signal, which supplies the padding around x. */
signal = (float *) calloc(lenSignal, sizeof(float));
filter = (float *) malloc(filterLength * sizeof(float));
result = (float *) malloc(resultLength * sizeof(float));
if (signal == NULL || filter == NULL || result == NULL) {
printf("\nFailed to allocate memory for the correlation sample.\n");
exit(EXIT_FAILURE);
}
for (i = 0; i < filterLength; i++)
filter[i] = (float)(i+1);
/* Place x after filterLength - 1 leading zeros. */
for (i = 0; i < filterLength; i++)
signal[filterLength - 1 + i] = filter[i];
/* Correlation. */
vDSP_conv(signal, signalStride, filter, filterStride,
result, resultStride, resultLength, filterLength);
printf("signal: ");
for (i = 0; i < lenSignal; i++)
printf("%2.1f ", signal[i]);
printf("\n filter: ");
for (i = 0; i < filterLength; i++)
printf("%2.1f ", filter[i]);
printf("\n result: ");
for (i = 0; i < resultLength; i++)
printf("%2.1f ", result[i]);
/* Free allocated memory. */
free(signal);
free(filter);
free(result);
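For checking results away from Apple platforms, the correlation that vDSP_conv computes for a positive filter stride, result[n] = sum over p of signal[n + p] * filter[p], can be written as plain loops. This sketch does not call Accelerate, and autocorrelate is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Full autocorrelation of x, computed the way vDSP_conv computes correlation
// for a positive filter stride: result[n] = sum over p of signal[n + p] * filter[p].
std::vector<float> autocorrelate(const std::vector<float> &x) {
    assert(!x.empty());
    const std::size_t len = x.size();
    const std::size_t result_len = 2 * len - 1;
    // Zero-pad len - 1 samples on both sides so every signal[n + p] is in range.
    std::vector<float> signal(result_len + len - 1, 0.0f);
    for (std::size_t i = 0; i < len; i++) signal[len - 1 + i] = x[i];
    std::vector<float> result(result_len, 0.0f);
    for (std::size_t n = 0; n < result_len; n++)
        for (std::size_t p = 0; p < len; p++)
            result[n] += signal[n + p] * x[p];
    return result;
}
```

For x = [1, 2, 3, 4, 5] this returns [5, 14, 26, 40, 55, 40, 26, 14, 5], the xcorr output quoted in the question.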

Avg distance between points in a cluster

Sounds like I've got the concept but can't seem to get the implementation correct. I have a cluster (an ArrayList) with multiple points, and I want to calculate the average distance between them. Ex: Points in cluster (A, B, C, D, E, F, ... , N): Distance A-B, Distance A-C, Distance A-D, ... Distance A-N, Distance B-C, Distance B-D, ... Distance B-N, ...
Thanks in advance.
You don't want to double count any segment, so your algorithm should be a double for loop. The outer loop goes from A to M (you don't need to check N, because there'll be nothing left for it to connect to), each time looping from the point after curPoint to N, calculating each distance. You add all the distances, and divide by the number of pairs, n(n-1)/2. Should be pretty simple.
There aren't any standard algorithms for improving on this that I'm aware of, and this isn't a widely studied problem. I'd guess that you could get a pretty reasonable estimate (if an estimate is useful) by sampling distances from each point to a handful of others. But that's a guess.
(After seeing your code example) Here's another try:
public double avgDistanceInCluster() {
double totDistance = 0.0;
for (int i = 0; i < bigCluster.length - 1; i++) {
for (int j = i+1; j < bigCluster.length; j++) {
totDistance += distance(bigCluster[i], bigCluster[j]);
}
}
return totDistance / (bigCluster.length * (bigCluster.length - 1) / 2.0);
}
Notice that the limit for the first loop is different.
Distance between two points is probably sqrt((x1 - x2)^2 + (y1 -y2)^2).
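Put together with that distance formula, the whole computation can be sketched as follows. This assumes 2-D points, and the Point struct and function name are illustrative rather than taken from the asker's code.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Point { double x, y; };

// Mean Euclidean distance over all unordered pairs of points.
double average_pairwise_distance(const std::vector<Point> &pts) {
    assert(pts.size() >= 2);
    const std::size_t n = pts.size();
    double total = 0.0;
    for (std::size_t i = 0; i + 1 < n; i++)
        for (std::size_t j = i + 1; j < n; j++)
            total += std::hypot(pts[i].x - pts[j].x, pts[i].y - pts[j].y);
    // n * (n - 1) / 2 pairs were summed
    return total / (static_cast<double>(n) * (n - 1) / 2.0);
}
```

For the points (0,0), (3,4) and (3,0) the three distances are 5, 3 and 4, so the average is 4.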
Thanks for all the help. Sometimes, after explaining the question on a forum, the answer just pops into your mind. This is what I ended up doing.
I have a cluster of points, and I need to calculate the avg distance between the points (pairs) in the cluster. So, this is what I did. I am sure someone will come up with a better answer; if so, please drop a note. Thanks in advance.
/**
 * Calculate avg distance between points in cluster
 * @return mean of the pairwise distances, or 0.0 for fewer than two points
 */
public double avgDistanceInCluster() {
    double totDistance = 0.0;
    int pairs = 0;
    for (int i = 0; i < cluster.size() - 1; i++) {
        for (int j = i + 1; j < cluster.size(); j++) {
            // points are stored as Double, so the distance is the absolute difference
            totDistance += Math.abs(cluster.get(i) - cluster.get(j));
            pairs++;
        }
    }
    // divide by the number of pairs, not by the number of points
    return pairs == 0 ? 0.0 : totDistance / pairs;
}