I have a block of code that's essentially:
for (int i = 0; i < aInt; i++) {
    CGPoint points[2] = { CGPointMake(i, 0), CGPointMake(i, bArray[i]) };
    CGContextStrokeLineSegments(myContext, points, 2);
}
which is causing a bit of a bottleneck when aInt gets large, as it's likely to do in my case. I don't know enough about Quartz 2D to know how best to optimize this. Is it better to build a single huge point array in the loop and then stroke the entire array once?
Or, more ideally: I've just optimized a different part of the code dealing with arrays, and converting to C-style arrays sped things up tremendously. Is there a similar low-level way of doing the above?
Thanks!
I also imagine making one large array will make it faster, since there will be far fewer calls into CGContextStrokeLineSegments.
CGPoint *points = (CGPoint *)malloc(sizeof(CGPoint) * aInt * 2);
for (int i = 0; i < aInt; i++) {
    points[i*2]     = CGPointMake(i, 0);
    points[i*2 + 1] = CGPointMake(i, bArray[i]);
}
CGContextStrokeLineSegments(myContext, points, aInt * 2);
free(points);
Yes, creating a single large array will definitely be faster than stroking each individual line segment.
I have two issues; please see the following code:
it = 0:0.01:360;
jt = 0:0.01:270;
LaserS = zeros(size(it,2)*size(jt,2), 2);
p = 1;
for m = it
    for n = jt
        LaserS(p,:) = [m, n];
        p = p + 1;
    end
end
It is very slow and also takes a lot of memory (about 7.7765e+009 bytes), so I can't run it. How can I improve it and solve the memory issue? I'm using Windows 7 64-bit with 8 GB RAM.
What are you trying to do? 'reshape' should solve your problem.
JT = reshape(repmat(jt, [1, numel(it)]), 1, numel(jt)*numel(it));
IT = reshape(repmat(it, [numel(jt), 1]), 1, numel(jt)*numel(it));
LaserS = [IT.', JT.'];   % columns ordered [it, jt], matching the loop
Pre-allocating the array saves you the repeated-reallocation hit, but otherwise there is no memory optimization to be had here.
You can't reduce memory usage unless you use fewer values. Can it and jt step by 0.1 instead of 0.01?
Here's a way to build your result matrix without a loop.
LaserS = [kron(it.', ones(length(jt), 1)), repmat(jt.', length(it), 1)];
This code seems to be doing "nothing" in the sense that, after it runs, you end up with the matrix LaserS with a shape of 972000000 x 2. If you really need these values to be loaded to memory at the same time, that is the size and there is not much to do about it.
My first approach, though it can't be inferred directly from the code you posted, would be to see whether you can achieve the overall goal of your program by generating the matrix data "on the fly" while you perform further processing over LaserS, instead of materializing it all up front.
Hope this helps!
This should do it:
it = it(:);
jt = jt(:);
n = numel(it);
m = numel(jt);
jt = repmat(jt, n, 1);    % jt cycles fastest, as in the loop
it = repmat(it', m, 1);
it = it(:);               % each it value repeated m times
LaserS = [it, jt];
In addition to the nice solutions presented here already: if you want to reduce memory, there's no reason to use double. You can use single and halve the memory required. You can also encode the 0.01 step as a unit step (that is, it = uint16(0:1:36000)), storing the numbers as uint16 integers and using only a quarter of the memory. And so on.
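For instance, a minimal sketch of that integer encoding (the 0.01 scale factor is applied only at the point of use):
it_enc = uint16(0:1:36000);      % encodes 0:0.01:360, 2 bytes per value
jt_enc = uint16(0:1:27000);      % encodes 0:0.01:270
m = double(it_enc(k)) * 0.01;    % recover the real value where needed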
Meshgrid will do this cleanly as well, perhaps with a reshape if you insist on having it in an (n*m)-by-2 matrix (see the sketch below). But why do you want that? It seems like you're actually after something else, and it's possible that bsxfun with a suitable binary function, e.g. bsxfun(@plus, it, jt'), would do what you want.
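A minimal meshgrid sketch (assuming you really do want the (n*m)-by-2 layout; the row order matches the original double loop):
[I, J] = meshgrid(it, jt);   % I(k,l) = it(l), J(k,l) = jt(k)
LaserS = [I(:), J(:)];       % flatten column-major: jt varies fastest, as in the loop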
I have an array of fit objects and I need to evaluate each of them with several values. Because there are over a thousand of these fit objects, looping over them and evaluating them with the values is very slow. Is there some kind of vectorized solution to this?
For example I can evaluate a single fit object by
fitArray{1,1}(400)
but what I would like to do is to evaluate multiple fit objects at a time in a way something like this:
fitArray{1:1000}(400)
Looping in Matlab is always very slow, and in this case it's really slow because I need to evaluate each of those fits with multiple values.
So is there a way to do that without looping?
Looping is not the biggest problem here. Look, for example, at the speed of fitoptions: the memory allocation is terrible, so try to do all such operations before the loop itself (fitoptions, fittype, etc.). If you use polynomial fitting and don't need the cfit structure, try polyfit instead; it should be considerably faster.
I found the answer myself. It was quite simple after all. I achieved the result I wanted by doing this:
vals = repmat({values}, size(fitArray));
evals = cellfun(@feval, fitArray, vals);
This evaluates each fit object in the cell array with the values in the corresponding cell of the vals array, so the evals array holds just the results of each fit object.
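One note: cellfun requires uniform (scalar) outputs by default, so if each fit is evaluated at several values at once, you would collect the resulting vectors in a cell array instead:
evals = cellfun(@feval, fitArray, vals, 'UniformOutput', false);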
I have two lists of timestamps and I'm trying to create a map between them that uses imu_ts as the true time and finds the nearest vicon_ts value to each entry. The output is a 3-by-d matrix where the first row is the imu_ts index, the third row is the Unix time at that index, and the second row is the index of the closest vicon_ts value at or above the timestamp in the same column.
Here's my code so far and it works, but it's really slow. I'm not sure how to vectorize it.
function tmap = sync_times(imu_ts, vicon_ts)
tstart = max(vicon_ts(1), imu_ts(1));
tstop = min(vicon_ts(end), imu_ts(end));
% trim imu data to the overlapping time window
tmap(1,:) = find(imu_ts >= tstart & imu_ts <= tstop);
tmap(3,:) = imu_ts(tmap(1,:)); % use imu_ts as ground truth
% find nearest indices in vicon data and map
vic_t = 1;
for i = 1:size(tmap,2)
    while vicon_ts(vic_t) < tmap(3,i)
        vic_t = vic_t + 1;
    end
    tmap(2,i) = vic_t;
end
The timestamps are already sorted in ascending order, so this is essentially an O(n) operation but because it's looped it runs slowly. Any vectorized ways to do the same thing?
Edit
It appears to be running faster than I expected or first measured, so this is no longer a critical issue. But I would be interested to see if there are any good solutions to this problem.
Have a look at knnsearch in MATLAB. Use the cityblock distance, and add the constraint that the matched vicon_ts point must not be below its imu_ts neighbour; if it is, take the next index. This is required because cityblock uses absolute distance. Another (preferred) option is to write your own custom distance function.
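A rough sketch of that idea (assumes the Statistics Toolbox; the post-correction enforces the "at or above" constraint described above):
v = vicon_ts(:);
idx = knnsearch(v, tmap(3,:).', 'Distance', 'cityblock');
below = v(idx) < tmap(3,:).';    % nearest neighbour fell below the imu timestamp
idx(below) = idx(below) + 1;     % take the next vicon index instead
tmap(2,:) = idx;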
I believe that your current method is sound, and I would not try to vectorize any further. Vectorization can actually be harmful when you are trying to optimize some inner loops, especially when you know more about the context of your data (e.g. that it is sorted) than the MathWorks engineers can know.
Things that I typically look for when I need to optimize a piece of code like this are:
All arrays are pre-allocated (this is the biggest driver of performance)
Fast inner loops use simple code (Matlab does pretty effective JIT on basic commands, but must interpret others.)
Take advantage of any special features of your data; e.g. since the timestamps are sorted, use algorithms that exploit that, with early exit conditions in the loops.
You're already doing all this. I recommend no change.
A good start might be to get rid of the while; try something like:
for i = 1:size(tmap,2)
    tmap(2,i) = find(vicon_ts >= tmap(3,i), 1, 'first');
end
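If your MATLAB has interp1's 'next' method (R2013b and later), the whole mapping can be done with no loop at all; a sketch assuming both timestamp vectors are sorted ascending with unique values:
% index of the first vicon_ts at or above each trimmed imu timestamp
tmap(2,:) = interp1(vicon_ts, 1:numel(vicon_ts), tmap(3,:), 'next');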
I'm currently working on the iPhone with Audio Units and I'm playing four tracks simultaneously. To improve the performance of my setup, I thought it would be a good idea to minimize the number of Audio Units / threads, by mixing down the four tracks into one.
With the following code I process the next buffer by adding up the samples of the four tracks, keeping them in the SInt16 range, and writing them to a temporary buffer, which is later copied into the ioData.mBuffers of the Audio Unit.
Although it works, I don't have the impression that this is the most efficient way to do this.
SInt16 *buffer = bufferToWriteTo;
int reads = bufferSize / sizeof(SInt16);
SInt16 **files = circularBuffer->files;
float tempValue;
SInt16 values[reads];
int k, j;
int numFiles = 4;
for (k = 0; k < reads; k++)
{
    tempValue = 0.f;
    for (j = 0; j < numFiles; j++)
    {
        tempValue += files[j][packetNumber];
    }
    if (tempValue > 32767.f) tempValue = 32767.f;
    else if (tempValue < -32768.f) tempValue = -32768.f;
    values[k] = (SInt16)tempValue;
    values[k] += values[k] << 16;
    packetNumber++;
    if (packetNumber >= totalPackets) packetNumber = 0;
}
memcpy(buffer, values, bufferSize);
Any ideas or pointers to speed this up? Am I right?
The biggest improvement you can get in this code is to stop using floating point arithmetic. While the arithmetic itself is fast, the conversions that happen in the nested loops take a long time, especially on the ARM processor in the iPhone. You can achieve exactly the same results by using 'SInt32' instead of 'float' for the 'tempValue' variable.
Also, see if you can get rid of the memcpy() on the last line: perhaps you can write into 'buffer' directly, without the temporary 'values' array. That saves one copy, which is a significant improvement for a function like this.
Other notes: the last two lines of the loop probably belong outside of the loop and the body of the nested loop should use 'k' as a second index, instead of 'packetNumber', but I'm not sure about this logic.
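Putting those suggestions together, a minimal sketch of the integer version (same variable names as the question; the packetNumber bookkeeping is kept in place even though, as noted above, it may belong elsewhere):
SInt32 acc;  // 32-bit accumulator: no float conversions, no overflow for 4 tracks
for (k = 0; k < reads; k++)
{
    acc = 0;
    for (j = 0; j < numFiles; j++)
        acc += files[j][packetNumber];
    if (acc > 32767) acc = 32767;        // clamp to the SInt16 range
    else if (acc < -32768) acc = -32768;
    buffer[k] = (SInt16)acc;             // write straight to the target buffer
    packetNumber++;
    if (packetNumber >= totalPackets) packetNumber = 0;
}
// no 'values' array and no memcpy() needed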
And the last note: you're squashing the peaks of your resulting sound. While this seems like a good idea, it will sound pretty rough. You probably want to scale the result down instead of clipping it, like this: instead of this code
for (j = 0; j < numFiles; j++)
{
    tempValue += files[j][packetNumber];
}
if (tempValue > 32767.f) tempValue = 32767.f;
else if (tempValue < -32768.f) tempValue = -32768.f;
you probably want something like this:
for (j = 0; j < numFiles; j++)
{
    tempValue += files[j][packetNumber] / numFiles;
}
Edit: and please do not forget to measure the performance before and after, to see which of the improvements has the biggest impact. This is the best way to learn about performance: trial and measurement.
A couple of pointers, even though I'm not really familiar with iPhone development.
You could unroll the inner loop: you don't need a for loop to add 4 numbers together, although your compiler may well do this for you.
Write directly to the buffer in your for loop; the memcpy at the end runs another loop over the whole buffer.
Don't use a float for tempValue. Depending on the hardware, integer math is quicker, and you don't need floats for summing channels.
Remove the if/else. Digital clipping will sound horrible anyway, so try to avoid it before summing the channels together. Branching inside a tight loop like this should be avoided if possible.
One thing I found when writing the audio mixing routines for my app is that incrementing pointers worked much faster than indexing. Some compilers may sort this out for you (I'm not sure about the iPhone's), but this gave my app a big boost in these tight loops (about 30%, if I recall).
e.g. instead of this:
for (k = 0; k < reads; k++)
{
    // Use buffer[k]
}
do this:
SInt16 *p = buffer;
SInt16 *pEnd = buffer + reads;
while (p != pEnd)
{
    // Use *p
    p++;
}
Also, I believe the iPhone has some sort of SIMD (single instruction, multiple data) support called VFP. This would let you perform math on a number of samples in one instruction, but I know little about this on the iPhone.
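For what it's worth, a hypothetical sketch of the mixdown using the Accelerate framework's vDSP routines (I'm not sure what was available on early iPhones; this ignores the circular-buffer wraparound for simplicity, and in real code the scratch buffers should not live on the stack):
#include <Accelerate/Accelerate.h>

float tmpA[reads], tmpB[reads];                  // scratch buffers
vDSP_vflt16(files[0], 1, tmpA, 1, reads);        // SInt16 -> float
for (j = 1; j < numFiles; j++) {
    vDSP_vflt16(files[j], 1, tmpB, 1, reads);
    vDSP_vadd(tmpA, 1, tmpB, 1, tmpA, 1, reads); // tmpA += track j
}
float scale = 1.0f / numFiles;                   // scale instead of clipping
vDSP_vsmul(tmpA, 1, &scale, tmpA, 1, reads);
vDSP_vfix16(tmpA, 1, buffer, 1, reads);          // float -> SInt16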
// Objective-C version: under one second for 18,000 iterations
for (NSString *coordStr in splitPoints) {
    const char *buf = [coordStr UTF8String];
    sscanf(buf, "%f,%f,", &routePoints[i].latitude, &routePoints[i].longitude);
    i++;
}
// C version: over 13 seconds for 18,000 iterations
for (i = 0; buf != NULL; buf = strchr(buf, '['), ++i) {
    buf += sizeof(char);
    sscanf(buf, "%f,%f,", &routePoints[i].latitude, &routePoints[i].longitude);
}
As a corollary question, is there any way to make this loop faster?
Also see this question: Another Speed Boost Possible?
Measure, measure, measure.
Measure the code with the Sampler instrument in Instruments.
With that said, there is an obvious inefficiency in the C code compared to the Objective-C code.
Namely, fast enumeration -- the for(x in y) syntax -- is really fast and, more importantly, implies that splitPoints is an array or set that contains a bunch of data that has already been parsed into individual objects.
The strchr() call in the second loop implies that you are parsing stuff on the fly. In and of itself, strchr() is a looping operation and will consume time, more so as the number of characters between occurrences of the target character increases.
That is all conjecture, though. As with all optimizations, speculation is useless and gathering concrete data using the [rather awesome] set of tools provided is the only way to know for sure.
Once you have measured, then you can make it faster.
Having nothing to do with performance, your C code has an error in it. buf += sizeof(char) should simply be buf++. Pointer arithmetic always moves in units the size of the type. It worked fine in this case because sizeof(char) was 1.
The Objective-C code looks like it has precomputed the split points, while the C code searches for them on each iteration. Simple answer? If N is the length of buf and M the number of your split points, it looks like your two snippets have complexities of O(M) versus O(N*M); which one's slower?
Edit: it really amazes me, though, that some would think C code is axiomatically faster than any other solution.
Vectorization can be used to speed up C code.
Example:
Even faster UTF-8 character counting
(But maybe just try to avoid the function call strchr() in the loop condition.)
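One way to reduce the per-iteration work is to replace sscanf() with strtof(), which both parses the number and reports where it stopped, so nothing gets rescanned; a sketch assuming the same "[lat,lon," layout the original C loop assumes:
// strtof() is declared in <stdlib.h>, strchr() in <string.h>
char *end;
for (i = 0; buf != NULL; buf = strchr(buf, '['), ++i) {
    buf++;                                         // skip the '['
    routePoints[i].latitude  = strtof(buf, &end);  // parses up to the ','
    routePoints[i].longitude = strtof(end + 1, &end);
    buf = end;
}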