Efficient way to make a Programmatic Audio Mixdown - iphone

I'm currently working on the iPhone with Audio Units and I'm playing four tracks simultaneously. To improve the performance of my setup, I thought it would be a good idea to minimize the number of Audio Units / threads, by mixing down the four tracks into one.
In the following code I process the next buffer by adding up the samples of the four tracks, clamping them to the SInt16 range, and writing them to a temporary buffer, which is later copied into the ioData.mBuffers of the Audio Unit.
Although it works, I don't have the impression that this is the most efficient way to do this.
SInt16* buffer = bufferToWriteTo;
int reads = bufferSize / sizeof(SInt16);
SInt16** files = circularBuffer->files;
float tempValue;
SInt16 values[reads];
int k, j;
int numFiles = 4;

for (k = 0; k < reads; k++)
{
    tempValue = 0.f;
    for (j = 0; j < numFiles; j++)
    {
        tempValue += files[j][packetNumber];
    }
    if (tempValue > 32767.f) tempValue = 32767.f;
    else if (tempValue < -32768.f) tempValue = -32768.f;
    values[k] = (SInt16) tempValue;
    values[k] += values[k] << 16;
    packetNumber++;
    if (packetNumber >= totalPackets) packetNumber = 0;
}
memcpy(buffer, values, bufferSize);
Any ideas or pointers to speed this up? Am I on the right track?

The biggest improvement you can get in this code comes from not using floating-point arithmetic. While the arithmetic itself is fast, the conversions that happen in the nested loops take a long time, especially on the ARM processor in the iPhone. You can achieve exactly the same results by using 'SInt32' instead of 'float' for the 'tempValue' variable.
Also, see if you can get rid of the memcpy() on the last line: perhaps you can construct 'buffer' directly, without the temporary 'values' buffer. That saves one copy, which would be a significant improvement for a function like this.
Other notes: the last two lines of the loop probably belong outside the loop, and the body of the nested loop should probably use 'k' as the second index instead of 'packetNumber', but I'm not sure about that logic.
And a last note: you're squashing the peaks of your resulting sound. While this seems like a good idea, it will sound pretty rough. You probably want to scale the result down instead of clipping it. That is, instead of this code
for (j = 0; j < numFiles; j++)
{
    tempValue += files[j][packetNumber];
}
if (tempValue > 32767.f) tempValue = 32767.f;
else if (tempValue < -32768.f) tempValue = -32768.f;
you probably want something like this:
for (j = 0; j < numFiles; j++)
{
    tempValue += files[j][packetNumber] / numFiles;
}
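Putting the integer-arithmetic and scaling suggestions together, the whole loop might look something like this sketch (the function signature and parameter passing are my own; the question keeps these as globals, and I use int16_t/int32_t in place of SInt16/SInt32 so the snippet is self-contained):

```c
#include <stdint.h>

/* Mix numFiles 16-bit tracks into one. Each track is scaled down by
   numFiles before summing, so the sum can never leave the 16-bit range
   and no clipping branch is needed. Integer accumulator only: no
   float conversions in the inner loop. */
void mixdown(int16_t **files, int numFiles, int16_t *out, int reads,
             int *packetNumber, int totalPackets)
{
    for (int k = 0; k < reads; k++) {
        int32_t sum = 0;                               /* SInt32 accumulator */
        for (int j = 0; j < numFiles; j++)
            sum += files[j][*packetNumber] / numFiles; /* scale, don't clip */
        out[k] = (int16_t)sum;                         /* write directly, no memcpy */
        if (++*packetNumber >= totalPackets)
            *packetNumber = 0;
    }
}
```

Writing into `out` directly also removes the temporary `values` buffer and the final memcpy().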
Edit: and please do not forget to measure the performance before and after each change, to see which improvement has the biggest impact. That is the best way to learn about performance: trial and measurement.

A couple of pointers, even though I'm not really familiar with iPhone development.
You could unroll the inner loop. You don't need a for loop to add 4 numbers together, although your compiler may well do this for you.
Write directly to the buffer in your for loop; the memcpy at the end does another full pass over the buffer.
Don't use a float for tempValue. Depending on the hardware, integer math is quicker, and you don't need floats for summing channels.
Remove the if/else. Digital clipping will sound horrible anyway, so try to avoid it before summing the channels together. Branching inside a loop like this should be avoided if possible.
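Combining all four points, a branch-free sketch for exactly four tracks might look like this (names loosely follow the question's code; clipping is assumed to be avoided by scaling, and int16_t/int32_t stand in for SInt16/SInt32):

```c
#include <stdint.h>

/* Unrolled, branch-free inner sum for exactly four tracks, writing
   straight into the output buffer instead of a temporary. */
void mix4(const int16_t *f0, const int16_t *f1,
          const int16_t *f2, const int16_t *f3,
          int16_t *out, int reads)
{
    for (int k = 0; k < reads; k++) {
        /* unrolled: no inner loop, no float, no branch */
        int32_t sum = f0[k] + f1[k] + f2[k] + f3[k];
        out[k] = (int16_t)(sum / 4);   /* scale instead of clip */
    }
}
```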

One thing I found when writing the audio mixing routines for my app is that incremented pointers worked much faster than indexing. Some compilers may sort this out for you (I'm not sure about the iPhone's), but this certainly gave my app a big boost in these tight loops (about 30%, if I recall).
eg: instead of this:
for (k = 0; k < reads; k++)
{
    // Use buffer[k]
}
do this:
SInt16* p = buffer;
SInt16* pEnd = buffer + reads;
while (p != pEnd)
{
    // Use *p
    p++;
}
Also, I believe the iPhone has some sort of SIMD (single instruction, multiple data) support called VFP. This would let you perform math on a number of samples in one instruction, but I know little about it on the iPhone.

Related

iOS - bitwise XOR on a vector using Accelerate.framework

I am trying to perform a bitwise XOR between a predetermined value and each element of an array.
This can clearly be done in a loop like so (in pseudocode):
int scalar = 123;
for (int i = 0; i < VECTOR_LENGTH; i++) {
    int x_or = scalar ^ a[i];
}
but I'm starting to learn about the performance enhancements by using the Accelerate.framework.
I'm looking through the docs for Accelerate.framework, but I haven't seen any way to do an element-wise bitwise XOR. Does anyone know if this is possible?
Accelerate doesn't implement the operation in question. You can pretty easily write your own vector code to do it, however. One nice approach is to use clang vector extensions:
#include <stddef.h>

typedef int vint8 __attribute__((ext_vector_type(8),aligned(4)));
typedef int vint4 __attribute__((ext_vector_type(4),aligned(4)));
typedef int vint2 __attribute__((ext_vector_type(2),aligned(4)));

int vector_xor(int *x, size_t n) {
    vint8 xor8 = 0;
    while (n >= 8) {
        xor8 ^= *(vint8 *)x;
        x += 8;
        n -= 8;
    }
    vint4 xor4 = xor8.lo ^ xor8.hi;
    vint2 xor2 = xor4.lo ^ xor4.hi;
    int xor = xor2.lo ^ xor2.hi;
    while (n > 0) {
        xor ^= *x++;
        n -= 1;
    }
    return xor ^ 123;
}
This is pretty nice because (a) it doesn't require use of intrinsics and (b) it doesn't tie you to any specific architecture. It generates pretty decent code for any architecture you compile for. On the other hand, it ties you to clang, whereas if you use intrinsics your code may work with other compilers as well.
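Note the code above reduces the whole array to a single XOR; if what you literally want is the element-wise result stored per element (as the question's loop suggests), a plain portable loop is a reasonable baseline sketch, and is exactly the kind of loop an auto-vectorizer handles well:

```c
#include <stddef.h>

/* XOR a scalar into every element of an array, in place.
   A straightforward loop like this gives the compiler the best
   chance to auto-vectorize. */
void xor_scalar(int *a, size_t n, int scalar)
{
    for (size_t i = 0; i < n; i++)
        a[i] ^= scalar;
}
```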
Stephen's answer is useful, but as you're looking at Accelerate, keep in mind that it is not a magic "go fast" library. Unless VECTOR_LENGTH is very large (say 10,000 -- EDIT: Stephen disagrees on this scale, and tends to know more about this subject than I do; see comments), the cost of the function call will often overwhelm any benefits you get. Remember, at the end of the day, Accelerate is just code. Very often, simple hand-written loops like yours (especially with good compiler optimizations) are going to be just as good or better on simple operations like xor.
But in many cases you need to let the compiler help you. Clang knows how to do all kinds of useful vector optimizations (just like in Stephen's answer) automatically. But in most cases, the default optimization setting is -Os (Fastest, Smallest). That says "clang, you may do any optimizations you want, but not if it makes the resulting binary any larger." You might notice that Stephen's example is a little larger than yours. That means that the compiler is often forbidden from applying the automatic vector optimizations it knows how to do.
But, if you switch to -Ofast, then you give clang permission to improve performance, even if it increases binary size (and on modern hardware, even mobile hardware, that is often a very good tradeoff). In the Build Settings panel, this is called "Optimization Level: Fastest, Aggressive Optimizations." In nearly every case, that is the correct setting for iOS and OS X apps. (It is not currently the default because of history; I expect that Apple will make it the default in the future.)
For more discussion on the limitations of Accelerate (wonderful library that it is), you may be interested in "Introduction to Fast Bézier (and Trying the Accelerate.framework)". I also highly recommend "What's New in the LLVM Compiler" (Session 402 from WWDC 2013), which I found even more useful than the introduction to Accelerate. Clang can do some really amazing optimizations if you get out of its way.

An algorithm for inverting a soundwave

Hi I’m making a simple program in Objective-C and I need to get the inverse of a soundwave.
I've tried searching for an algorithm for doing this but I haven't found anything. I guess it’s more complicated than just multiplying each value with -1 :P
Here’s my code so far, I’ve cast the data to int32_t to be able to manipulate it:
int32_t* samples = (int32_t*)(sourceBuffer.mData);
for (int i = 0; i < sourceBuffer.mDataByteSize / sizeof(int32_t); i++)
{
    // Add algorithm here
}
Thanks.
Multiplying by -1 should work, no? What output do you get? Bear in mind you won't hear ANY difference unless you layer the normal and inverted waves together, in which case they will cancel each other out.
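Concretely, the loop body could look like this (a sketch, assuming the samples are signed PCM; note the INT32_MIN edge case, whose negation overflows):

```c
#include <stdint.h>
#include <stddef.h>

/* Invert (phase-flip) a buffer of 32-bit samples in place. */
void invert(int32_t *samples, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        /* INT32_MIN has no positive counterpart; clamp to avoid overflow */
        if (samples[i] == INT32_MIN)
            samples[i] = INT32_MAX;
        else
            samples[i] = -samples[i];
    }
}
```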

How to speed up array concatenation?

Is there another way to do the following? I mean, the loop will persist, but can vector=[vector,sum(othervector)]; be written some other way, instead of concatenating results like this?
vector = [];
while a - b ~= 0
    othervector = sum(something') % returns a vector like [1; 3]
    vector = [vector, sum(othervector)];
    ...
end
vector = vector./100
Well, this really depends on what you are trying to do. Starting from this code, you might need to think about the actions you are performing and whether you can change that behavior. Since the snippet you present shows few dependencies (i.e. how a, b, something and vector are related), I think we can only present vague solutions.
I suspect you want to get rid of this code to circumvent the cost of constantly moving the array around as you concatenate new results into it.
First of all, make sure that the slowest portion of your application really is caused by this. Take a look at the Matlab profiler. If that portion of your code is not a major time hog, don't bother spending a lot of time on improving it (and just tell mlint to ignore that line of code).
If you can analyse your code enough to ensure that you have a constant number of iterations, you can preallocate your variables and prevent any performance penalty (i.e. write a for loop in the worst case, or better yet, truly vectorized code). Or if you can 'factor out' some variables, this might help too (move any loop invariants outside of the loop). That might look something like this:
iIteration = 1;
vector = zeros(1,100);
while a - b ~= 0
    othervector = sum(something);
    vector(iIteration) = sum(othervector);
    iIteration = iIteration + 1;
end
If the nature of your code doesn't allow this (e.g. you are iterating to attain convergence; in that case, beware of checking equality of doubles: always include a tolerance), there are some tricks you can perform to improve performance, but most of them are just rules of thumb or trying to make the best of a bad situation. In this last case, you might add some maintenance code to get slightly better performance (but what you gain in time consumption, you lose in memory usage).
Let's say, you expect the code to run 100*n iterations most of the time, you might try to do something like this:
iIteration = 0;
expectedIterations = 100;
vector = [];
while a - b ~= 0
    if mod(iIteration, expectedIterations) == 0
        vector = [vector zeros(1, expectedIterations)];
    end
    iIteration = iIteration + 1;
    vector(iIteration) = sum(sum(something));
    ...
end
vector = vector(1:iIteration); % throw away uninitialized elements
vector = vector/100;
It might not look pretty, but instead of resizing the array every iteration, the array only gets resized every 100th iteration. I haven't run this piece of code, but I've used very similar code in a former project.
If you want to optimize for speed, you should preallocate the vector and have a counter for the index as #Egon answered already.
If you just want to have a different way of writing vector=[vector,sum(othervector)];, you could use vector(end + 1) = sum(othervector); instead.

How to optimize Quartz 2D?

I have a block of code that's essentially:
for (int i = 0; i < aInt; i++) {
    CGPoint points[2] = { CGPointMake(i, 0), CGPointMake(i, bArray[i]) };
    CGContextStrokeLineSegments(myContext, points, 2);
}
which is causing a bit of a bottleneck when aInt gets large, as it's likely to do in my case. I don't know enough about Quartz 2D to know how best to optimize this. Is it better to build a single huge point array in the loop and then stroke the entire array once?
Or more ideally, I've just optimized a different part of code dealing with arrays. In doing so, I converted to using C-style arrays which sped things up tremendously. Is there a similar low-level way of doing the above?
Thanks!
I also imagine building one large array will make it faster; there will definitely be fewer calls to CGContextStrokeLineSegments.
CGPoint *points = (CGPoint*)malloc(sizeof(CGPoint) * aInt * 2);
for (int i = 0; i < aInt; i++) {
    points[i*2]     = CGPointMake(i, 0);
    points[i*2 + 1] = CGPointMake(i, bArray[i]);
}
CGContextStrokeLineSegments(myContext, points, aInt * 2);
free(points);
Yes, creating a single large array will definitely be faster than stroking each individual line segment.

Why is this C-style code 10X slower than this obj-C style code?

// Obj-C version - less than one second for 18,000 iterations
for (NSString* coordStr in splitPoints) {
    const char *buf = [coordStr UTF8String];
    sscanf(buf, "%f,%f,", &routePoints[i].latitude, &routePoints[i].longitude);
    i++;
}

// C version - over 13 seconds for 18,000 iterations
for (i = 0; buf != NULL; buf = strchr(buf, '['), ++i) {
    buf += sizeof(char);
    sscanf(buf, "%f,%f,", &routePoints[i].latitude, &routePoints[i].longitude);
}
As a corollary question, is there any way to make this loop faster?
Also see this question: Another Speed Boost Possible?
Measure, measure, measure.
Measure the code with the Sampler instrument in Instruments.
With that said, there is an obvious inefficiency in the C code compared to the Objective-C code.
Namely, fast enumeration -- the for(x in y) syntax -- is really fast and, more importantly, implies that splitPoints is an array or set that contains a bunch of data that has already been parsed into individual objects.
The strchr() call in the second loop implies that you are parsing stuff on the fly. In and of itself, strchr() is a looping operation and will consume time, more so as the number of characters between occurrences of the target character increases.
That is all conjecture, though. As with all optimizations, speculation is useless and gathering concrete data using the [rather awesome] set of tools provided is the only way to know for sure.
Once you have measured, then you can make it faster.
Having nothing to do with performance, your C code has an error in it: buf += sizeof(char) should simply be buf++. Pointer arithmetic always moves in units the size of the pointed-to type. It worked fine in this case only because sizeof(char) is 1.
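To make that concrete, here is a tiny self-contained check (the helper name is mine):

```c
#include <stddef.h>

/* Returns the byte distance covered by "p + 1" on a float pointer:
   pointer arithmetic advances in units of the pointed-to type, so the
   result is sizeof(float), not 1. */
ptrdiff_t float_step_bytes(void)
{
    float a[2];
    float *p = a;
    return (char *)(p + 1) - (char *)p;
}
```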
The Obj-C code looks like it has precomputed the split points, while the C code searches for them on each iteration. Simple answer? If N is the length of buf and M the number of split points, your two snippets have complexities of O(M) versus O(N*M); which one is slower?
Edit: it really amazed me, though, that some would think C code is axiomatically faster than any other solution.
Vectorization can be used to speed up C code.
Example:
Even faster UTF-8 character counting
(But maybe just try to avoid the function call strchr() in the loop condition.)