Why is the call to array_view::synchronize() so slow?

I've started experimenting with C++ AMP. I've created a simple test app just to see what it can do, but the results are quite surprising to me. Consider the following code:
#include <amp.h>
#include "Timer.h"
using namespace concurrency;
int main( int argc, char* argv[] )
{
    uint32_t u32Threads = 16;
    uint32_t u32DataRank = u32Threads * 256;
    uint32_t u32DataSize = (u32DataRank * u32DataRank) / u32Threads;

    uint32_t* pu32Data = new (std::nothrow) uint32_t[ u32DataRank * u32DataRank ];
    for ( uint32_t i = 0; i < u32DataRank * u32DataRank; i++ )
    {
        pu32Data[i] = 1;
    }
    uint32_t* pu32Sum = new (std::nothrow) uint32_t[ u32Threads ];

    Timer tmr;
    tmr.Start();
    array< uint32_t, 1 > source( u32DataRank * u32DataRank, pu32Data );
    array_view< uint32_t, 1 > sum( u32Threads, pu32Sum );
    printf( "Array<> deep copy time: %.6f\n", tmr.Stop() );

    tmr.Start();
    parallel_for_each(
        sum.extent,
        [=, &source](index<1> idx) restrict(amp)
        {
            uint32_t u32Sum = 0;
            uint32_t u32Start = idx[0] * u32DataSize;
            uint32_t u32End = (idx[0] * u32DataSize) + u32DataSize;
            for ( uint32_t i = u32Start; i < u32End; i++ )
            {
                u32Sum += source[i];
            }
            sum[idx] = u32Sum;
        }
    );
    double dDuration = tmr.Stop();
    printf( "gpu computation time: %.6f\n", dDuration );

    tmr.Start();
    sum.synchronize();
    dDuration = tmr.Stop();
    printf( "synchronize time: %.6f\n", dDuration );
    printf( "first and second row sum = %u, %u\n", pu32Sum[0], pu32Sum[1] );

    tmr.Start();
    for ( uint32_t idx = 0; idx < u32Threads; idx++ )
    {
        uint32_t u32Sum = 0;
        for ( uint32_t i = 0; i < u32DataSize; i++ )
        {
            u32Sum += pu32Data[(idx * u32DataSize) + i];
        }
        pu32Sum[idx] = u32Sum;
    }
    dDuration = tmr.Stop();
    printf( "cpu computation time: %.6f\n", dDuration );
    printf( "first and second row sum = %u, %u\n", pu32Sum[0], pu32Sum[1] );

    delete [] pu32Sum;
    delete [] pu32Data;
    return 0;
}
Note that Timer is a simple timing class using QueryPerformanceCounter. Anyway, the output of the code is the following:
Array<> deep copy time: 0.089784
gpu computation time: 0.000449
synchronize time: 8.671081
first and second row sum = 1048576, 1048576
cpu computation time: 0.006647
first and second row sum = 1048576, 1048576
Why is the call to synchronize() taking so long? Is there a way to get around this? Other than that, the performance of the computation is amazing, but the synchronize() overhead makes it unusable for me.
It is also possible that I am doing something terribly wrong; if so, please tell me. Thanks in advance.

Function synchronize() is probably taking so long because it is waiting for the actual kernel to complete its work.
From the description of parallel_for_each in amp.h:
Please note that the parallel_for_each executes as if synchronous to the calling code, but in reality, it is asynchronous. I.e. once the parallel_for_each call is made and the kernel has been passed to the runtime, the [code after the parallel_for_each] continues to execute immediately by the CPU thread, while in parallel the kernel is executed by the GPU threads.
So, measuring the time spent in parallel_for_each is not particularly meaningful.
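In other words, your "gpu computation time" only measures how long it takes to queue the kernel, while synchronize() absorbs both the kernel execution and the copy back to pu32Sum. A minimal sketch of a more meaningful measurement, reusing the names from your snippet (accelerator_view::wait() is the fence that blocks until queued work completes):
accelerator_view av = accelerator().default_view;

tmr.Start();
parallel_for_each( sum.extent, [=, &source](index<1> idx) restrict(amp)
{
    /* ... kernel body as before ... */
} );
av.wait();                  // block until the queued kernel has actually finished
printf( "kernel time: %.6f\n", tmr.Stop() );

tmr.Start();
sum.synchronize();          // now this only pays for the device-to-host copy
printf( "copy-back time: %.6f\n", tmr.Stop() );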
EDIT: The way the algorithm is written, it won't benefit much from GPU acceleration. The read of source[i] is non-coalesced, and so it will be almost 16x slower than a coalesced read. It is possible to coalesce the read by using shared memory, but it is not quite trivial. I'd recommend reading up on GPU programming.
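To illustrate the access-pattern point, here is a sketch (not your code) in which adjacent threads read adjacent elements on every iteration, so the loads can coalesce. Note that it changes which elements each thread sums, which happens to be harmless here only because every element is 1; in general you would follow it with a second combining step:
parallel_for_each( sum.extent, [=, &source](index<1> idx) restrict(amp)
{
    uint32_t u32Sum = 0;
    // Stride by the thread count: on each iteration, thread t reads
    // element i + t, so the 16 loads land in adjacent locations.
    for ( uint32_t i = idx[0]; i < u32DataRank * u32DataRank; i += u32Threads )
    {
        u32Sum += source[i];
    }
    sum[idx] = u32Sum;
} );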
If you just want a simple example that demonstrates the utility of C++ AMP, try matrix multiplication.
Of course, the performance you'll observe also greatly depends on the model of your GPU hardware.

In addition to Igor's response on your specific algorithm, please note that there are multiple incorrect aspects to the way you are measuring C++ AMP performance in general (runtime initialization is not excluded, the initial JIT is not discarded, the data is not warmed up, and, as already pointed out, parallel_for_each is assumed to be synchronous), so please follow our guidelines here:
http://blogs.msdn.com/b/nativeconcurrency/archive/2011/12/28/how-to-measure-the-performance-of-c-amp-algorithms.aspx
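The gist of those guidelines, condensed into a sketch (run_kernel is a hypothetical helper wrapping the parallel_for_each above, and av and tmr are reused from the earlier sketch; the first run is discarded so JIT and runtime initialization don't pollute the number):
av.wait();                  // force runtime and device initialization up front

run_kernel( av );           // throwaway run: absorbs JIT compilation and warms the data
av.wait();

tmr.Start();
run_kernel( av );           // measured run
av.wait();                  // fence before stopping the timer
printf( "steady-state kernel time: %.6f\n", tmr.Stop() );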

Related

Unaligned access performance on Intel x86 vs AMD x86 CPUs

I have implemented a simple linear probing hash map with an array-of-structs memory layout. The struct holds the key, the value, and a flag indicating whether the entry is valid. By default, this struct gets padded by the compiler, as key and value are 64-bit integers but the flag only needs a single byte. Hence, I have also tried packing the struct, at the cost of unaligned access. I was hoping to get better performance from the packed/unaligned version due to higher memory density (we do not waste bandwidth on transferring padding bytes).
When benchmarking this hash map on an Intel Xeon Gold 5220S CPU (single-threaded, gcc 11.2, -O3 and -march=native), I see no performance difference between the padded version and the unaligned version. However, on an AMD EPYC 7742 CPU (same setup), I find a performance difference between unaligned and padded. I made a graph of the results for hash map load factors of 25 % and 50 %, with the successful query rate on the x axis (0, 25, 50, 75, 100). On Intel, the grey and blue (circle and square) lines almost overlap; the benefit of struct packing is marginal. On AMD, however, the line representing the unaligned/packed structs is consistently higher, i.e., we have more throughput.
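For concreteness, the padding in question is just the 7 trailing bytes the compiler adds to round the struct up to its 8-byte alignment; a quick check (assuming GCC/Clang on x86-64, matching the structs used in the benchmark below):
#include <cstdint>

struct Padded { uint64_t key; uint64_t value; bool is_valid; };
struct Packed { uint64_t key; uint64_t value; uint8_t is_valid; } __attribute__((__packed__));

static_assert(sizeof(Padded) == 24, "17 bytes of payload rounded up to 8-byte alignment");
static_assert(sizeof(Packed) == 17, "packing drops the 7 trailing padding bytes");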
In order to investigate this, I tried to build a smaller microbenchmark. It performs a similar workload, but without the hash map find logic (i.e., we just pick random indices in the array and advance a little from there). Please find the benchmark here:
#include <atomic>
#include <chrono>
#include <cstdint>
#include <iostream>
#include <random>
#include <vector>

void ClobberMemory() { std::atomic_signal_fence(std::memory_order_acq_rel); }

template <typename T>
void doNotOptimize(T const& val) {
  asm volatile("" : : "r,m"(val) : "memory");
}

struct PaddedStruct {
  uint64_t key;
  uint64_t value;
  bool is_valid;

  PaddedStruct() { reset(); }

  void reset() {
    key = uint64_t{};
    value = uint64_t{};
    is_valid = 0;
  }
};

struct PackedStruct {
  uint64_t key;
  uint64_t value;
  uint8_t is_valid;

  PackedStruct() { reset(); }

  void reset() {
    key = uint64_t{};
    value = uint64_t{};
    is_valid = 0;
  }
} __attribute__((__packed__));

int main() {
  const uint64_t size = 134217728;
  uint16_t repetitions = 0;
  uint16_t advancement = 0;

  std::cin >> repetitions;
  std::cout << "Got " << repetitions << std::endl;
  std::cin >> advancement;
  std::cout << "Got " << advancement << std::endl;
  std::cout << "Initializing." << std::endl;

  std::vector<PaddedStruct> padded(size);
  std::vector<PackedStruct> unaligned(size);
  std::vector<uint64_t> queries(size);

  // Initialize the structs with random values + prefault
  std::random_device rd;
  std::mt19937 gen{rd()};
  std::uniform_int_distribution<uint64_t> dist{0, 0xDEADBEEF};
  std::uniform_int_distribution<uint64_t> dist2{0, size - advancement - 1};

  for (uint64_t i = 0; i < padded.size(); ++i) {
    padded[i].key = dist(gen);
    padded[i].value = dist(gen);
    padded[i].is_valid = 1;
  }
  for (uint64_t i = 0; i < unaligned.size(); ++i) {
    unaligned[i].key = padded[i].key;
    unaligned[i].value = padded[i].value;
    unaligned[i].is_valid = 1;
  }
  for (uint64_t i = 0; i < unaligned.size(); ++i) {
    queries[i] = dist2(gen);
  }

  std::cout << "Running benchmark." << std::endl;

  ClobberMemory();
  auto start_padded = std::chrono::high_resolution_clock::now();
  PaddedStruct* padded_ptr = nullptr;
  uint64_t sum = 0;
  for (uint16_t j = 0; j < repetitions; j++) {
    for (const uint64_t& query : queries) {
      for (uint16_t i = 0; i < advancement; i++) {
        padded_ptr = &padded[query + i];
        if (padded_ptr->is_valid) [[likely]] {
          sum += padded_ptr->value;
        }
      }
      doNotOptimize(sum);
    }
  }
  ClobberMemory();
  auto end_padded = std::chrono::high_resolution_clock::now();
  uint64_t padded_runtime = static_cast<uint64_t>(
      std::chrono::duration_cast<std::chrono::milliseconds>(end_padded - start_padded).count());
  std::cout << "Padded Runtime (ms): " << padded_runtime << " (sum = " << sum << ")" << std::endl;  // print sum so it does not get optimized out

  ClobberMemory();
  auto start_unaligned = std::chrono::high_resolution_clock::now();
  uint64_t sum2 = 0;
  PackedStruct* packed_ptr = nullptr;
  for (uint16_t j = 0; j < repetitions; j++) {
    for (const uint64_t& query : queries) {
      for (uint16_t i = 0; i < advancement; i++) {
        packed_ptr = &unaligned[query + i];
        if (packed_ptr->is_valid) [[likely]] {
          sum2 += packed_ptr->value;
        }
      }
      doNotOptimize(sum2);
    }
  }
  ClobberMemory();
  auto end_unaligned = std::chrono::high_resolution_clock::now();
  uint64_t unaligned_runtime = static_cast<uint64_t>(
      std::chrono::duration_cast<std::chrono::milliseconds>(end_unaligned - start_unaligned).count());
  std::cout << "Unaligned Runtime (ms): " << unaligned_runtime << " (sum = " << sum2 << ")" << std::endl;
}
When running the benchmark, I pick repetitions = 3 and advancement = 5; i.e., after compiling and running it, enter 3 (and press enter), then enter 5 (and press enter). I updated the source code to (a) avoid loop unrolling by the compiler, which happened while repetitions/advancement were hardcoded, and (b) use pointers into the vectors, as that more closely resembles what the hash map is doing.
On the Intel CPU, I get:
Padded Runtime (ms): 13204
Unaligned Runtime (ms): 12185
On the AMD CPU, I get:
Padded Runtime (ms): 28432
Unaligned Runtime (ms): 22926
So while in this microbenchmark Intel still benefits a little from the unaligned access, on the AMD CPU both the absolute and the relative improvement are higher. I cannot explain this. In general, from what I've learned from relevant SO threads, unaligned access for a single member is just as expensive as aligned access, as long as it stays within a single cache line (1). Also in (1), a reference to (2) is given, which claims that the cache fetch width can differ from the cache line size. However, except for Linus Torvalds' mail, I could not find any other documentation of cache fetch widths in processors, and especially not for my concrete two CPUs, to figure out whether that might somehow explain this.
Does anybody have an idea why the AMD CPU benefits much more from the struct packing? If it is about reduced memory bandwidth consumption, I should be able to see the effects on both CPUs. And if the bandwidth usage is similar, I do not understand what might be causing the differences here.
Thank you so much.
(1) Relevant SO thread: How can I accurately benchmark unaligned access speed on x86_64?
(2) https://www.realworldtech.com/forum/?threadid=168200&curpostid=168779
The L1 Data Cache fetch width on the Intel Xeon Gold 5220S (and all the other Skylake/CascadeLake Xeon processors) is up to 64 naturally-aligned Bytes per cycle per load.
The core can execute two loads per cycle for any combination of size and alignment that does not cross a cacheline boundary. I have not tested all the combinations on the SKX/CLX processors, but on Haswell/Broadwell, throughput was reduced to one load per cycle whenever a load crossed a cacheline boundary, and I would assume that SKX/CLX are similar. This can be viewed as a necessary feature rather than a "penalty": a line-splitting load might need to use both ports to load a pair of adjacent lines, then combine the requested portions of the lines into a payload for the target register.
Loads that cross page boundaries have a larger performance penalty, but to measure it you have to be very careful to understand and control the locations of the page table entries for the two pages: DTLB, STLB, in the caches, or in main memory. My recollection is that the most common case is pretty fast -- partly because the "Next Page Prefetcher" is pretty good at pre-loading the PTE entry for the next page into the TLB before a sequence of loads gets to the end of the first page. The only case that is painfully slow is for stores that straddle a page boundary, and the Intel compiler works very hard to avoid this case.
I have not looked at the sample code in detail, but if I were performing this analysis, I would be careful to pin the processor frequency, measure the instruction and cycle counts, and compute the average number of instructions and cycles per update. (I usually set the core frequency to the nominal (TSC) frequency just to make the numbers easier to work with.) For the naturally-aligned cases, it should be pretty easy to look at the assembly code and estimate what the cycle counts should be. If the measurements are similar to observations for that case, then you can begin looking at the overhead of unaligned accesses in reference to a more reliable understanding of the baseline.
Hardware performance counters can be valuable for this case as well, particularly the DTLB_LOAD_MISSES events and the L1D.REPLACEMENT event. It only takes a few high-latency TLB miss or L1D miss events to skew the averages.
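A sketch of that kind of per-element cycle accounting (assuming x86 with GCC or Clang, and a core pinned at the nominal TSC frequency so that TSC ticks correspond to core cycles; the workload here is just a stand-in for one pass of the query loop):
#include <cstdint>
#include <cstdio>
#include <vector>
#include <x86intrin.h>   // __rdtsc

int main() {
    // Stand-in workload; in the real analysis this would be one pass of
    // the padded or packed query loop from the benchmark above.
    std::vector<uint64_t> v(1 << 20, 1);
    uint64_t sum = 0;

    uint64_t t0 = __rdtsc();
    for (uint64_t x : v) sum += x;
    uint64_t t1 = __rdtsc();

    // With the core pinned at the nominal (TSC) frequency, ticks == cycles.
    printf("TSC ticks per element: %.3f (sum = %llu)\n",
           (double)(t1 - t0) / (double)v.size(), (unsigned long long)sum);
}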
The number of cache-line accesses when using 24-byte data structures may be the same as when using 17-byte data structures.
Please see this blog post: https://lemire.me/blog/2022/06/06/data-structure-size-and-cache-line-accesses/
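The arithmetic behind that claim is easy to reproduce; a small sketch counting how many sequential records straddle a 64-byte line boundary for the two record sizes (both come out to 25 %):
#include <cstdint>
#include <cstdio>

int main() {
    const uint64_t line = 64;
    const uint64_t n = 1000000;
    for (uint64_t stride : {24ull, 17ull}) {                   // padded vs. packed record size
        uint64_t straddlers = 0;
        for (uint64_t i = 0; i < n; i++) {
            uint64_t first = (i * stride) / line;              // line of the first byte
            uint64_t last  = (i * stride + stride - 1) / line; // line of the last byte
            if (first != last) straddlers++;
        }
        printf("stride %2llu: %.1f%% of records span two cache lines\n",
               (unsigned long long)stride, 100.0 * straddlers / (double)n);
    }
}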

CUDA class with multidimensional pointers

I have been struggling with this class implementation now for quite a while and hope someone can help me with it.
class Material_Properties_Class_device
{
public:
    int           max_variables;
    Logical     * table_prop;
    Table_Class ** prop_table;
};
The implementation for the pointers looks like this
Material_Properties_Class **d_material_prop = new Material_Properties_Class* [4];
Logical *table_prop;

for ( int k = 1; k <= 3; k++ )
{
    cutilSafeCall(cudaMalloc((void**)&(d_material_prop[k]), sizeof(Material_Properties_Class)));
    cutilSafeCall(cudaMemcpy(d_material_prop[k], material_prop[k], sizeof(Material_Properties_Class), cudaMemcpyHostToDevice));
}

for ( int i = 1; i <= 3; i++ )
{
    cutilSafeCall(cudaMalloc((void**)&(table_prop), sizeof(Logical)));
    cudaMemcpy(&(d_material_prop[i]->table_prop), &(table_prop), sizeof(Logical*), cudaMemcpyHostToDevice);
    cudaMemcpy(table_prop, material_prop[i]->table_prop, sizeof(Logical), cudaMemcpyHostToDevice);
}

cutilSafeCall(cudaMalloc((void***)&material_prop_device, 4 * sizeof(Material_Properties_Class *)));
cutilSafeCall(cudaMemcpy(material_prop_device, d_material_prop, 4 * sizeof(Material_Properties_Class *), cudaMemcpyHostToDevice));
This implementation works, but I can't get it working for **prop_table.
I assume it must somehow follow the same principle but I just can't get my head around it.
I have already tried
Table_Class_device **prop_table = new Table_Class_device*[3];
and inserting another loop inside the second for loop:
for ( int k = 1; k <= 3; k++ )
{
    cutilSafeCall(cudaMalloc((void**)&(prop_table[k]), sizeof(Table_Class)));
    cutilSafeCall(cudaMemcpy(prop_table[k], material_prop[i]->prop_table[k], sizeof(Table_Class *), cudaMemcpyHostToDevice));
}
Help would be much appreciated.
Some magic. Maybe it'll help:
struct fading_coefficient
{
    double* frequency_array;
    double* temperature_array;
    int frequency_size;
    int temperature_size;
    double** fading_coefficients;
};

struct fading_coefficient* cuda_fading_coefficient;
double* frequency_array = NULL;
double* temperature_array = NULL;
double** fading_coefficients = NULL;
/* host-side scratch array of device row pointers; sized by temperature_size,
   since that is how many rows are copied below */
double** fading_coefficients1 = (double **)malloc(fading_coefficient->temperature_size * sizeof(double *));

/* copy the frequency array to the device */
cudaMalloc((void**)&frequency_array, fading_coefficient->frequency_size * sizeof(double));
cudaMemcpy( frequency_array, fading_coefficient->frequency_array, fading_coefficient->frequency_size * sizeof(double), cudaMemcpyHostToDevice );
free(fading_coefficient->frequency_array);

/* copy the temperature array to the device */
cudaMalloc((void**)&temperature_array, fading_coefficient->temperature_size * sizeof(double));
cudaMemcpy( temperature_array, fading_coefficient->temperature_array, fading_coefficient->temperature_size * sizeof(double), cudaMemcpyHostToDevice );
free(fading_coefficient->temperature_array);

/* copy each row to the device, collecting the device row pointers on the host,
   then copy the pointer array itself */
cudaMalloc((void***)&fading_coefficients, fading_coefficient->temperature_size * sizeof(double*));
for (int i = 0; i < fading_coefficient->temperature_size; i++)
{
    cudaMalloc((void**)&(fading_coefficients1[i]), fading_coefficient->frequency_size * sizeof(double));
    cudaMemcpy( fading_coefficients1[i], fading_coefficient->fading_coefficients[i], fading_coefficient->frequency_size * sizeof(double), cudaMemcpyHostToDevice );
    free(fading_coefficient->fading_coefficients[i]);
}
cudaMemcpy(fading_coefficients, fading_coefficients1, fading_coefficient->temperature_size * sizeof(double*), cudaMemcpyHostToDevice );

/* point the host struct at the device arrays, then copy the struct itself */
fading_coefficient->frequency_array = frequency_array;
fading_coefficient->temperature_array = temperature_array;
fading_coefficient->fading_coefficients = fading_coefficients;
cudaMalloc((void**)&cuda_fading_coefficient, sizeof(struct fading_coefficient));
cudaMemcpy( cuda_fading_coefficient, fading_coefficient, sizeof(struct fading_coefficient), cudaMemcpyHostToDevice );
This question comes up frequently. Multidimensional pointers are especially challenging.
If possible, it's recommended that you flatten multidimensional pointer usage (**) to single-dimensional pointer usage (*), and as you've seen, even that is somewhat cumbersome.
The single-dimensional case (*) is further described here. Although you seem to have already figured it out.
If you really want to handle the 2 dimensional (**) case, look here.
An example implementation for 3 dimensional case (***) is here. ("madness!")
Working with 2 and 3 dimensions this way is quite difficult. Thus the recommendation to flatten.
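To make the flattening concrete, here is a minimal host-side sketch (Table_Class is reduced to a stand-in struct and the dimensions are hypothetical):
#include <cuda_runtime.h>

struct Table_Class { double dummy[4]; };     // stand-in for the real class

int main() {
    const int rows = 3, cols = 4;            // hypothetical dimensions
    Table_Class host_tables[12] = {};        // flattened host-side table, row-major

    // One allocation and one copy replace the per-row cudaMalloc/cudaMemcpy loops.
    Table_Class* d_tables = nullptr;
    cudaMalloc(&d_tables, rows * cols * sizeof(Table_Class));
    cudaMemcpy(d_tables, host_tables, rows * cols * sizeof(Table_Class),
               cudaMemcpyHostToDevice);

    // Device code then addresses element (i, j) as d_tables[i * cols + j],
    // with no pointer chasing through a device-side pointer array.
    cudaFree(d_tables);
    return 0;
}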

Expression result unused

I got some code and I'm trying to fix some compiler warnings:
StkFrames& PRCRev :: tick( StkFrames& frames, unsigned int channel )
{
#if defined(_STK_DEBUG_)
    if ( channel >= frames.channels() - 1 ) {
        errorString_ << "PRCRev::tick(): channel and StkFrames arguments are incompatible!";
        handleError( StkError::FUNCTION_ARGUMENT );
    }
#endif

    StkFloat *samples = &frames[channel];
    unsigned int hop = frames.channels();
    for ( unsigned int i=0; i<frames.frames(); i++, samples += hop ) {
        *samples = tick( *samples );
        *samples++;                     <<<<<<<<<--------- Expression result unused.
        *samples = lastFrame_[1];
    }
    return frames;
}
I don't understand what the code is trying to do. The codebase is huge and I've already fixed quite a few issues, but googling didn't help with this one.
Any ideas?
First, look at the line which actually gives you the warning:
*samples++;
The ++ binds to the pointer, so this advances samples and dereferences the old position, but the dereferenced value is thrown away; that discarded value is the "expression result unused".
Then you assign something else through the advanced pointer:
*samples = lastFrame_[1];
I recommend you read the code inside the 'for' loop more carefully. It doesn't look very logical.
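Assuming the pointer arithmetic itself is intended (write the wet sample, step one slot, write the second output), the warning-free equivalent simply drops the unused dereference; a sketch:
for ( unsigned int i=0; i<frames.frames(); i++, samples += hop ) {
    *samples = tick( *samples );   // write the processed sample
    samples++;                     // advance the pointer; no unused dereference
    *samples = lastFrame_[1];      // write the second output sample
}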

AudioQueue Recording Audio Sample

I am currently in the process of building an application that reads in audio from my iPhone's microphone, and then does some processing and visuals. Of course I am starting with the audio stuff first, but am having one minor problem.
I am defining my sampling rate to be 44100 Hz and defining my buffer to hold 4096 samples. Which it does. However, when I print this data out and copy it into MATLAB to double-check accuracy, the sample rate I have to use is half of my iPhone-defined rate, i.e. 22050 Hz, for it to be correct.
I think it has something to do with the following code and how it is putting 2 bytes per packet: when I loop through the buffer, it is spitting out whole packets, which my code assumes are single numbers. So what I am wondering is how to split up those packets and read them as individual numbers.
- (void)setupAudioFormat {
    memset(&dataFormat, 0, sizeof(dataFormat));
    dataFormat.mSampleRate = kSampleRate;
    dataFormat.mFormatID = kAudioFormatLinearPCM;
    dataFormat.mFramesPerPacket = 1;
    dataFormat.mChannelsPerFrame = 1;
    // dataFormat.mBytesPerFrame = 2;
    // dataFormat.mBytesPerPacket = 2;
    dataFormat.mBitsPerChannel = 16;
    dataFormat.mReserved = 0;
    dataFormat.mBytesPerPacket = dataFormat.mBytesPerFrame =
        (dataFormat.mBitsPerChannel / 8) * dataFormat.mChannelsPerFrame;
    dataFormat.mFormatFlags =
        kLinearPCMFormatFlagIsSignedInteger |
        kLinearPCMFormatFlagIsPacked;
}
If what I described is unclear, please let me know. Thanks!
EDIT
Adding the code that I used to print the data
float *audioFloat = (float *)malloc(numBytes * sizeof(float));
int *temp = (int *)inBuffer->mAudioData;
int i;
float power = pow(2, 31);
for (i = 0; i < numBytes; i++) {
    audioFloat[i] = temp[i] / power;
    printf("%f ", audioFloat[i]);
}
I found the problem with what I was doing. It was a C pointer issue, and since I have never really programmed in C before, I of course got it wrong.
You cannot directly cast inBuffer->mAudioData to an int array. So what I simply did was the following:
SInt16 *buffer = (SInt16 *)inBuffer->mAudioData;  // 16-bit samples, so read SInt16s; the cast alone suffices, no malloc needed
This worked out just fine, and now my data is of the correct length and is represented properly.
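For reference, a sketch of walking the buffer sample-by-sample and normalizing to [-1, 1) floats (numBytes is assumed to be the valid byte count of the buffer, as in the question; SInt16 comes from the CoreAudio headers):
SInt16 *buffer = (SInt16 *)inBuffer->mAudioData;
int sampleCount = numBytes / sizeof(SInt16);   // 2 bytes per sample, mono
for (int i = 0; i < sampleCount; i++) {
    float normalized = buffer[i] / 32768.0f;   // 16-bit signed full scale
    printf("%f ", normalized);
}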
I saw your answer; there is also an underlying issue that yields wrong sample bytes, caused by endianness (the bytes are swapped).
-(void)feedSamplesToEngine:(UInt32)audioDataBytesCapacity audioData:(void *)audioData {
    int sampleCount = audioDataBytesCapacity / sizeof(SAMPLE_TYPE);
    SAMPLE_TYPE *samples = (SAMPLE_TYPE *)audioData;
    //SAMPLE_TYPE *sample_le = (SAMPLE_TYPE *)malloc(sizeof(SAMPLE_TYPE) * sampleCount); // for swapping endians
    std::string shorts;
    double power = pow(2, 10);
    for (int i = 0; i < sampleCount; i++)
    {
        SAMPLE_TYPE sample_le = (0xff00 & (samples[i] << 8)) | (0x00ff & (samples[i] >> 8)); // endianness issue
        char dataInterim[30];
        sprintf(dataInterim, "%f ", sample_le / power); // normalize it
        shorts.append(dataInterim);
    }
}

Why does Perl's Inline::C sort 4.0e-5 after 4.4e-5?

I built a Perl Inline::C module, but there is some oddity with the sorting. Does anyone know why it would sort like this? Why is the 4.0e-5 not first?
my $ref = [ 5.0e-5,4.2e-5,4.3e-5,4.4e-5,4.4e-5,4.2e-5,4.2e-5,4.0e-5];
use Inline C => <<'END_OF_C_CODE';
void test(SV* sv, ...) {
    I32 i;
    I32 arrayLen;
    AV* data;
    float retval;
    SV** pvalue;
    Inline_Stack_Vars;

    data = (AV*)SvRV(Inline_Stack_Item(0));  /* dereference the array reference */

    /* Determine the length of the array */
    arrayLen = av_len(data);

    /* sort */
    sortsv(AvARRAY(data), arrayLen + 1, Perl_sv_cmp_locale);

    for (i = 0; i < arrayLen + 1; i++) {
        pvalue = av_fetch(data, i, 0);  /* fetch the scalar located at i */
        retval = SvNV(*pvalue);         /* coerce the scalar into a number */
        printf("%f \n", retval);
    }
}
END_OF_C_CODE
test($ref);
0.000042
0.000042
0.000042
0.000043
0.000044
0.000044
0.000040
0.000050
Because you are sorting lexically. Try this code:
#!/usr/bin/perl
use strict;
use warnings;

my $ref = [ 5.0e-5,4.2e-5,4.3e-5,4.4e-5,4.4e-5,4.2e-5,4.2e-5,4.0e-5];

print "Perl with cmp\n";
for my $val (sort @$ref) {
    printf "%f \n", $val;
}

print "Perl with <=>\n";
for my $val (sort { $a <=> $b } @$ref) {
    printf "%f \n", $val;
}

print "C\n";
test($ref);

use Inline C => <<'END_OF_C_CODE';
void test(SV* sv, ...) {
    I32 i;
    I32 arrayLen;
    AV* data;
    float retval;
    SV** pvalue;
    Inline_Stack_Vars;

    data = (AV*)SvRV(Inline_Stack_Item(0));  /* dereference the array reference */

    /* Determine the length of the array */
    arrayLen = av_len(data);

    /* sort */
    sortsv(AvARRAY(data), av_len(data) + 1, Perl_sv_cmp_locale);

    arrayLen = av_len(data);
    for (i = 0; i < arrayLen + 1; i++) {
        pvalue = av_fetch(data, i, 0);  /* fetch the scalar located at i */
        retval = SvNV(*pvalue);         /* coerce the scalar into a number */
        printf("%f \n", retval);
    }
}
END_OF_C_CODE
Of course, lexically 0.00040 is smaller than 0.00042 as well, but you aren't comparing 0.00040 to 0.00042; you are comparing the number 0.00040 converted to a string with the number 0.00042 converted to a string. When a number gets too large or small, Perl's stringifying logic resorts to using scientific notation. So you are sorting the set of strings
"4.2e-05", "4.2e-05", "4.2e-05", "4.3e-05", "4.4e-05", "4.4e-05", "4e-05", "5e-05"
which are properly sorted. Perl happily turns those strings back into their numbers when you ask it to with the %f format in printf. You could stringify the numbers yourself, but since you have stated you want this to be faster, that would be a mistake. You should not be trying to optimize the program before you know where it is slow (premature optimization is the root of all evil*). Write your code, then run Devel::NYTProf against it to find where it is slow. If necessary, rewrite those portions in XS or Inline::C (I prefer XS). You will find that you get more speed out of choosing the right data structure than out of micro-optimizations like this.
* Knuth, Donald. Structured Programming with go to Statements, ACM Journal Computing Surveys, Vol 6, No. 4, Dec. 1974. p.268.
Perl_sv_cmp_locale is your sorting function, which I suspect does lexical comparison. Look for a numeric sorting function or write your own.
I have an answer, with help from the people over at http://www.perlmonks.org/?node_id=761015
I ran some profiling (DProf), and it's about a 4x improvement in speed:
Total Elapsed Time = 0.543205 Seconds
User+System Time = 0.585454 Seconds
Exclusive Times
%Time ExclSec CumulS #Calls sec/call Csec/c Name
100. 0.590 0.490 100000 0.0000 0.0000 test_inline_c_pkg::percent2
Total Elapsed Time = 2.151647 Seconds
User+System Time = 1.991647 Seconds
Exclusive Times
%Time ExclSec CumulS #Calls sec/call Csec/c Name
104. 2.080 1.930 100000 0.0000 0.0000 main::percent2
Here is the code
use Inline C => <<'END_OF_C_CODE';
#define SvSIOK(sv) ((SvFLAGS(sv) & (SVf_IOK|SVf_IVisUV)) == SVf_IOK)
#define SvNSIV(sv) (SvNOK(sv) ? SvNVX(sv) : (SvSIOK(sv) ? SvIVX(sv) : sv_2nv(sv)))

static I32 S_sv_ncmp(pTHX_ SV *a, SV *b) {
    const NV nv1 = SvNSIV(a);
    const NV nv2 = SvNSIV(b);
    return nv1 < nv2 ? -1 : nv1 > nv2 ? 1 : 0;
}

void test(SV* sv, ...) {
    I32 i;
    I32 arrayLen;
    AV* data;
    float retval;
    SV** pvalue;
    Inline_Stack_Vars;

    data = (AV*)SvRV(Inline_Stack_Item(0));  /* dereference the array reference */

    /* Determine the length of the array */
    arrayLen = av_len(data);

    /* sort ascending (pass the numerical comparison function S_sv_ncmp) */
    sortsv(AvARRAY(data), arrayLen + 1, S_sv_ncmp);

    for (i = 0; i < arrayLen + 1; i++) {
        pvalue = av_fetch(data, i, 0);  /* fetch the scalar located at i */
        retval = SvNV(*pvalue);         /* coerce the scalar into a number */
        printf("%f \n", retval);
    }
}
END_OF_C_CODE