Why is this C-style code 10X slower than this obj-C style code? - iphone

//obj C version, with some - less than one second on 18,000 iterations
for (NSString* coordStr in splitPoints) {
char *buf = [coordStr UTF8String];
sscanf(buf, "%f,%f,", &routePoints[i].latitude, &routePoints[i].longitude);
i++;
}
//C version - over 13 seconds on 18,000 iterations
for (i = 0; buf != NULL; buf = strchr(buf,'['), ++i) {
buf += sizeof(char);
sscanf(buf, "%f,%f,", &routePoints[i].latitude, &routePoints[i].longitude);
}
As a corollary question, is there any way to make this loop faster?
Also see this question: Another Speed Boost Possible?

Measure, measure, measure.
Measure the code with the Sampler instrument in Instruments.
With that said, there is an obvious inefficiency in the C code compared to the Objective-C code.
Namely, fast enumeration -- the for(x in y) syntax -- is really fast and, more importantly, implies that splitPoints is an array or set that contains a bunch of data that has already been parsed into individual objects.
The strchr() call in the second loop implies that you are parsing stuff on the fly. In and of itself, strchr() is a looping operation and will consume time, more-so as the # of characters between occurrences of the target character increase.
That is all conjecture, though. As with all optimizations, speculation is useless and gathering concrete data using the [rather awesome] set of tools provided is the only way to know for sure.
Once you have measured, then you can make it faster.

Having nothing to do with performance, your C code has an error in it. buf += sizeof(char) should simply be buf++. Pointer arithmetic always moves in units the size of the type. It worked fine in this case because sizeof(char) was 1.

Obj C code looks like it has precomputed some split points, while the C code seeks them in each iteration. Simple answer? If N is the length of buf and M the number of your split points, it looks like your two snippets have complexities of O(M) versus O(N*M); which one's slower?
edit: Really amazed me though, that some would think C code is axiomatically faster than any other solution.

Vectorization can be used to speed up C code.
Example:
Even faster UTF-8 character counting
(But maybe just try to avoid the function call strchr() in the loop condition.)

Related

Replace two for loops using vectorization?

I have a function f(x,y) = abs(cos(x+3) * sin(y+2)) that I need to sum up using two for loops. Note: the real function is more complex, this is a toy version of it for the purposes of the question.
f = #(x,y) abs(cos(x+3) * sin(y+2));
tot = 0;
for m=1:100
for n=1:100
tot = tot + f(m,n);
end
end
disp(tot)
Output: 4.026314876227891e+03
How can I vectorize this code to get rid of the for loops and make it faster?
[n,m]=meshgrid(1:100,1:100);
tot=sum(f(m,n),'all')
However I am not sure this is any faster, you can time it. Matlab is quite fast in loops, the old truth about it being slower when you loop is outdated by 5 years or so. Most of the times the JIT compiler will find the fastest way to run it. This is one of the cases where your toy problem may hid the actual problem, as the JIT may find this toy problem easier to speed up, but not your real one, or vice versa.
You will need to time.

Replace values in an array in matlab without changing the original array

My question is that given an array A, how can you give another array identical to A except changing all negatives to 0 (without changing values in A)?
My way to do this is:
B = A;
B(B<0)=0
Is there any one-line command to do this and also not requiring to create another copy of A?
While this particular problem does happen to have a one-liner solution, e.g. as pointed out by Luis and Ian's suggestions, in general if you want a copy of a matrix with some operation performed on it, then the way to do it is exactly how you did it. Matlab doesn't allow chained operations or compound expressions, so you generally have no choice but to assign to a temporary variable in this manner.
However, if it makes you feel better, B=A is efficient as it will not result in any new allocated memory, unless / until B or A change later on. In other words, before the B(B<0)=0 step, B is simply a reference to A and takes no extra memory. This is just how matlab works under the hood to ensure no memory is wasted on simple aliases.
PS. There is nothing efficient about one-liners per se; in fact, you should avoid them if they lead to obscure code. It's better to have things defined over multiple lines if it makes the logic and intent of the algorithm clearer.
e.g, this is also a valid one-liner that solves your problem:
B = subsasgn(A, substruct('()',{A<0}), 0)
This is in fact the literal answer to your question (i.e. this is pretty much code that matlab will call under the hood for your commands). But is this clearer, more elegant code just because it's a one-liner? No, right?
Try
B = A.*(A>=0)
Explanation:
A>=0 - create matrix where each element is 1 if >= 0, 0 otherwise
A.*(A>=0) - multiply element-wise
B = A.*(A>=0) - Assign the above to B.

display variable value each n iterations with condition outside the loop

When I have to display the variable value every n iterations of a for loop I always do something along these lines:
for ii=1:1000
if mod(ii,100)==0
display(num2str(ii))
end
end
I was wondering if there is a way to move the if condition outside the loop in order to speed up the code. Or also if there is something different I could do.
You can use nested loops:
N = 1000;
n = 100;
for ii = n:n:N
for k = ii-n+1:ii-1
thingsToDo(k);
end
disp(ii)
thingsToDo(ii);
end
where thingsToDo() get the relevant counter (if needed). This a little more messy, but can save a lot of if testing.
Unless the number of tested values is much larger than the number of printed values, I would not blame the if-statement. It may not seem this way at first, but printing is indeed a fairly complex task. A variable needs to be converted and sent to an output stream which is then printing in the terminal. In case you need to speed the code up, then reduce the amount of printed data.
Normally Matlab function takes vector inputs as well. This is the case for disp and display and does only take a single function call. Further, conversion to string is unnecessary before printing. Matlab should send the data to some kind of stream anyway (which may indeed take argument of type char but this is not the same char as Matlab uses), so this is probably just a waste of time. In addition to that num2str do a lot of things to ensure typesafe conversion. You already know that display is typesafe, so all these checks are redundant.
Try this instead,
q = (1:1000)'; % assuming q is some real data in your case
disp(q(mod(q,100)==0)) % this requires a single call to disp

How to speed up array concatenation?

Instead of concatening results to this, Is there other way to do the following, I mean the loop will persist but vector=[vector,sum(othervector)]; can be gotten in any other way?
vector=[];
while a - b ~= 0
othervector = sum(something') %returns a vector like [ 1 ; 3 ]
vector=[vector,sum(othervector)];
...
end
vector=vector./100
Well, this really depends on what you are trying to do. Starting from this code, you might need to think about the actions you are doing and if you can change that behavior. Since the snippet of code you present shows little dependencies (i.e. how are a, b, something and vector related), I think we can only present vague solutions.
I suspect you want to get rid of the code to circumvent the effect of constantly moving the array around by concatenating your new results into it.
First of all, just make sure that the slowest portion of your application is caused by this. Take a look at the Matlab profiler. If that portion of your code is not a major time hog, don't bother spending a lot of time on improving it (and just say to mlint to ignore that line of code).
If you can analyse your code enough to ensure that you have a constant number of iterations, you can preallocate your variables and prevent any performance penalty (i.e. write a for loop in the worst case, or better yet really vectorized code). Or if you can `factor out' some variables, this might help also (move any loop invariants outside of the loop). So that might look something like this:
vector = zeros(1,100);
while a - b ~= 0
othervector = sum(something);
vector(iIteration) = sum(othervector);
iIteration = iIteration + 1;
end
If the nature of your code doesn't allow this (e.g. you are iterating to attain convergence; in that case, beware of checking equality of doubles: always include a tolerance), there are some tricks you can perform to improve performance, but most of them are just rules of thumb or trying to make the best of a bad situation. In this last case, you might add some maintenance code to get slightly better performance (but what you gain in time consumption, you lose in memory usage).
Let's say, you expect the code to run 100*n iterations most of the time, you might try to do something like this:
iIteration = 0;
expectedIterations = 100;
vector = [];
while a - b ~= 0
if mod(iIteration,expectedIterations) == 0
vector = [vector zeros(1,expectedIterations)];
end
iIteration = iIteration + 1;
vector(iIteration) = sum(sum(something));
...
end
vector = vector(1:iIteration); % throw away uninitialized
vector = vector/100;
It might not look pretty, but instead of resizing the array every iteration, the array only gets resized every 100th iteration. I haven't run this piece of code, but I've used very similar code in a former project.
If you want to optimize for speed, you should preallocate the vector and have a counter for the index as #Egon answered already.
If you just want to have a different way of writing vector=[vector,sum(othervector)];, you could use vector(end + 1) = sum(othervector); instead.

Efficient way to make a Programmatic Audio Mixdown

I'm currently working on the iPhone with Audio Units and I'm playing four tracks simultaneously. To improve the performance of my setup, I thought it would be a good idea to minimize the number of Audio Units / threads, by mixing down the four tracks into one.
With the following code I'm processing the next buffer by adding up the samples of the four tracks, keep them in the SInt16 range and add them to a temporary buffer, which will later on be copied into the ioData.mBuffers of the Audio Unit.
Although it works, I don't have the impression that this is the most efficient way to do this.
SInt16* buffer = bufferToWriteTo;
int reads = bufferSize/sizeof(SInt16);
SInt16** files = circularBuffer->files;
float tempValue;
SInt16 values[reads];
int k,j;
int numFiles=4;
for (k=0; k<reads; k++)
{
tempValue=0.f;
for (j=0; j<numFiles; j++)
{
tempValue += files[j][packetNumber];
}
if (tempValue > 32767.f) tempValue = 32767.f;
else if (tempValue < -32768.f) tempValue =- 32768.f;
values[k] = (SInt16) tempValue;
values[k] += values[k] << 16;
packetNumber++;
if (packetNumber >= totalPackets) packetNumber=0;
}
memcpy(buffer,values,bufferSize);
Any ideas or pointers to speed this up? Am I right?
The biggest improvement you can get from this code would be by not using floating point arithmetic. While the arithmetic by itself is fast, the conversions which happen in the nested loops, take a long time, especially on the ARM processor in the iPhone. You can achieve exactly the same results by using 'SInt32' instead of 'float' for the 'tempValue' variable.
Also, see if you can get rid of the memcpy() in the last string: perhaps you can construct the 'buffer' directly, without using a temporary buffer called 'values'. That saves one copy, which would be significant improvement for such a function.
Other notes: the last two lines of the loop probably belong outside of the loop and the body of the nested loop should use 'k' as a second index, instead of 'packetNumber', but I'm not sure about this logic.
And the last note: you're squashing the peaks of your resulting sound. While this seems like a good idea, it will sound pretty rough. You probably want to scale the result down instead of cropping it. Like that: instead of this code
for (j=0; j<numFiles; j++)
{
tempValue += files[j][packetNumber];
}
if (tempValue > 32767.f) tempValue = 32767.f;
else if (tempValue < -32768.f) tempValue =- 32768.f;
you probably want something like this:
for (j=0; j<numFiles; j++)
{
tempValue += files[j][packetNumber] / numFiles;
}
Edit: and please do not forget to measure the performance before and after, to see which one of the improvements gave the biggest impact. This is the best way to learn performance: trial and measurement
A couple of pointers even though I'm not really familliar with iPhone development.
You could unwind the inner loop. You don't need a for loop to add 4 numbers together although it might be your compiler will do this for you.
Write directly to the buffer in your for loop. memcpy at the end will do another loop to copy the buffers.
Don't use a float for tempvalue. Depending on the hardware integer math is quicker and you don't need floats for summing channels.
Remove the if/endif. Digital clipping will sound horrible anyway so try to avoid it before summing the channels together. Branching inside a loop like this should be avoided if possible.
One thing I found when writing the audio mixing routines for my app is that incremented pointers worked much faster than indexing. Some compilers may sort this out for you but - not sure on the iphone - but certainly this gave my app a big boost for these tight loops (about 30% if I recall).
eg: instead of this:
for (k=0; k<reads; k++)
{
// Use buffer[k]
}
do this:
SInt16* p=buffer;
SInt16* pEnd=buffer+reads;
while (p!=pEnd)
{
// Use *p
p++;
}
Also, I believe iPhone has some sort of SIMD (single instruction multiple data) support called VFP. This would let you perform math on a number of samples in one instruction but I know little about this on iPhone.