I'm at the debugging/optimization phase of an iPhone app, and I have one bottleneck left: the only place where the program noticeably lags is in the following loop. (By the way, I've renamed the variables with letters and types. The real names are much more human-readable in the actual app, but they make little sense out of context, so I hope this is clear enough.) Here's the loop:
for (i = 0; i < xLong; i += yFloat * zShort) {
    aFloat = 0.0;
    for (int j = i; j < i + yFloat * zShort; j++) {
        aFloat = hArray[j] / kFloat;
    }
    bNSNumber = [NSNumber numberWithFloat:aFloat];
    [cNSMutableArray addObject:bNSNumber];
}
All object creation and clean-up is outside of this loop.
(It should be pretty straightforward what's happening here, but basically I have a very large array (millions of elements) and I'm going through it in chunks of yFloat*zShort length, adding up all of the elements in each chunk and inserting that final sum into another array. So if hArray is a million elements long and my chunk length is 200, I'll sum the first 200 elements, insert that total into cNSMutableArray, and move on to the next 200 elements of hArray. In the end, cNSMutableArray will be 5000 elements long.)
When the outer loop is around 25k and the inner loop is around 200, this code takes about 4 seconds to run. I would sure like to get that down as much as possible, as in the real world, the outer loop might be quite a bit larger.
Any ideas on how to speed this up?
Thanks for any ideas you have!
Have you tried using a plain C float array instead of an NSMutableArray? The overhead of creating that many NSNumber wrappers can add up.
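For instance, something along these lines; this is just a sketch, chunkLen, numChunks, and sums are illustrative names, and it assumes xLong is an exact multiple of the chunk length:
int chunkLen = (int)(yFloat * zShort);
long numChunks = xLong / chunkLen;
float *sums = malloc(numChunks * sizeof(float));   // needs <stdlib.h>
for (long c = 0; c < numChunks; c++) {
    float total = 0.0f;
    long base = c * chunkLen;
    for (int j = 0; j < chunkLen; j++) {
        total += hArray[base + j];   // plain float math, no NSNumber boxing
    }
    sums[c] = total / kFloat;        // one division per chunk
}
// ...use sums, then free(sums) when you're done.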
First off, from your description it sounds like the inner loop should read:
for (int j = i; j < i + yFloat * zShort; j++) {
    aFloat += hArray[j] / kFloat;
}
Anyway, since kFloat is not changing, you can move that out of the loop and do the division once:
for (int j = i; j < i + yFloat * zShort; j++) {
    aFloat += hArray[j];
}
aFloat /= kFloat;
That said, this can affect the accuracy of the final value. Without knowing exactly what you are doing, I don't know if that will matter.
I see that you already got a nice speedup, but here's my two cents: Floating-point division is notoriously expensive; you could precompute
float invKFloat = 1.0f / kFloat;
and then multiply by it instead of dividing by kFloat. This means you only have to do the division once overall, instead of once per pass through the outer loop.
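Combined with the hoisted-division version above, that might look like this (just a sketch):
float invKFloat = 1.0f / kFloat;     // one division, done before the loops
for (i = 0; i < xLong; i += yFloat * zShort) {
    aFloat = 0.0f;
    for (int j = i; j < i + yFloat * zShort; j++) {
        aFloat += hArray[j];
    }
    aFloat *= invKFloat;             // multiply instead of divide
    bNSNumber = [NSNumber numberWithFloat:aFloat];
    [cNSMutableArray addObject:bNSNumber];
}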
This seems like the kind of calculation that should be spun off onto a background thread.
You have several options: NSOperation is a viable alternative, but depending on your data structures it might be easier to use detachNewThreadSelector:toTarget:withObject:.
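A minimal sketch of the NSThread route; processChunks: is a hypothetical method you would write to hold the summing loop:
// Spawn a background thread running a method of your own.
[NSThread detachNewThreadSelector:@selector(processChunks:)
                         toTarget:self
                       withObject:nil];
Keep in mind that UIKit is main-thread-only, so once the background work finishes you would hand the results back with something like performSelectorOnMainThread:withObject:waitUntilDone:.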
You really want to avoid creating objects inside a tight loop. Every time you do that, you're allocating a new object on the heap, which is expensive compared with plain float arithmetic.
I have the following problem.
for i=1:3000
    selectedIndices = (i-1)*100+1 : i*100;   % the 100 rows of C used on this pass
    [~,dist(i,1)]=knnsearch(C(selectedIndices,:),C);
end
Let me explain the code above. Matrix C is a huge matrix (300000 x 1984). C(selectedIndices,:) is a subset of 100 rows of C that depends on the value of i: for i==1, the first 100 rows of C are selected; for i==2, C(101:200,:) is selected; and so on. As you can see, the second argument remains constant.
Is there any way to make this run faster? I have tried the following:
- [~,dist(i,1)]=knnsearch(C,C); % obviously runs out of memory
- Sending a bigger chunk of selectedIndices instead of just 100. This adds a little post-processing, which I am not worried about, but it doesn't help, since it takes an equivalent amount of time. For example, if I send 100 points of C at a time, it takes 60 seconds; if I send 500, it takes 380 seconds with the post-processing.
- Using parfor, so that different sets of selectedIndices are executed in parallel. It doesn't work, possibly because two copies of the big matrix C get created (I'm not sure how parfor works), but I am sure the computer becomes very slow, negating the advantage of parfor.
- Haven't tried yet: breaking both arguments into smaller chunks and sending them through parfor. Do you think this would make any difference?
I am open to any suggestion; that is, if you feel that breaking the matrix up in some different way may speed up the computation, do suggest it. In the end, I only care about finding the closest point from a set of points (here each set has 100 points) for each point in C.
I needed some help with a problem I'd been assigned in class. It's our introduction to for loops. Here is the problem:
Consider the following riddle.
This is all I have so far:
function pile = IslandBananas(numpeople, numbears)
for pilesize=1:10000000
end
I would really appreciate your input. Thank you!
I will help you, but you need to try harder than that. Also, you only need one for loop. First, think about how you would construct this algorithm. You know you have to use a for loop, so that is a start. Now let's think about what is going on in the problem.
1) You have a pile.
2) First night someone takes the pile and divides it into 3 and finds that one is left over, this means mod(pile,3) = 1.
3) But he discards the extra banana. This means (pile-1).
4) He takes a third of it, leaving two-thirds. This means (2/3)*(pile-1).
5) In the morning they take the pile and divide it into 3 and find again that one is left over, so this means mod((2/3)*(pile-1),3) = 1.
6) But they discard the extra banana. This means (2/3)*(pile-1)-1.
7) Finally, they have to each have at least one banana if it is to be the smallest pile possible. Thus, the smallest pile must be such that (1/3)*((2/3)*(pile-1)-1) = 1.
I have essentially given you the answer, the rest you can write with the formula (1/3)*((2/3)*(pile-1)-1) and a simple if statement to test for the smallest possible integer which is 1. This can be done in four lines inside of your for loop.
Now, expanding this to any number of people and any number of bears requires two simple substitutions in that formula! If your teacher demands it, this can easily be split into two nested for loops.
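For the specific numbers in the formula, the search might look like the sketch below; the 1e-9 tolerance guards against floating-point round-off, and extending it to numpeople and numbears is the substitution just mentioned:
function pile = IslandBananas(numpeople, numbears)
% Sketch for the specific case in the formula; generalizing to
% numpeople and numbears is the substitution described above.
pile = 0;                                        % fallback if nothing is found
for pilesize = 1:10000000
    % Smallest pile where everyone ends up with exactly one banana.
    if abs((1/3)*((2/3)*(pilesize-1) - 1) - 1) < 1e-9
        pile = pilesize;
        return;
    end
end
end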
I have a 300x300 matrix. I need to make a 300x300x1024 matrix where each "slice" is the original 300x300 matrix. Is there any way to do this without a loop? I tried the following:
old = G;
for j = 2:N
    G(:,:,j) = old;
end
where N is 1024, but I run out of memory.
Know any shorter routes?
Use repmat:
B = repmat(A,[m n p...])
produces a multidimensional array B composed of copies of A. The size of B is [size(A,1)*m, size(A,2)*n, size(A,3)*p, ...].
In your case,
G = repmat(old, [1 1 1024]);
will yield the result you want without the for loop. The memory issue is a completely different subject: a 300x300x1024 double matrix will "cost" you ~740 MB of memory, and that's not a lot these days. Check your memory load before you try the repmat and see why you don't have those extra ~700 MB free. Use memory and whos to see how much memory is available and which variables can be cleared.
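For example (note that memory is available on Windows only; whos works everywhere):
memory      % report how much memory MATLAB and the OS can provide (Windows only)
whos        % list every workspace variable with its size in bytes
clear old   % free a variable you no longer need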
You are likely running out of memory because you haven't pre-initialized your matrix.
If you do this first:
old = G;
G = zeros(size(old,1), size(old,2), 1024);
and then start the loop from 1 instead of 2, you will probably not run out of memory.
Why this works: you first set aside a block of memory large enough for the entire matrix. If you do not preallocate, MATLAB first sets aside enough memory for a 300x300x1 matrix. When you add the second slice, it allocates a new block for a 300x300x2 matrix and copies everything over, and so on, reallocating and copying at every step instead of reusing the memory from the earlier matrices.
This occurs often in MATLAB, so it is important never to grow your matrices inside a loop.
Quick answer is no, you will need to loop.
You might be able to do something smart like block-copying your array's memory, but you didn't even give us a language to work with.
You will probably want to make sure each entry in your matrix is a minimum size: even at one byte per entry, the matrix will require 92 MB, and if you are storing a 64-bit value, we're talking nearly a gig. If each entry is an object, that number will leap into the many-gig range in no time. Bit packing may come in handy... but again, no idea what your other constraints are.
I'm not sure if I can help much, but doubles are 64 bits each, so for 300x300x1024 of them you're talking about roughly 740 MB at a bare minimum. This could easily double if each cell involves one or two pointers to different memory locations (I don't know enough about MATLAB to tell you for sure).
If you're not running on a 64-bit machine with a few gigabytes of RAM, I don't think you have much of a chance. If you are, allocate all the memory you can to MATLAB and pray.
Sorry I can't be of more help, maybe someone else knows more tricks.
I'm trying to write simple, generic parallel code for minimizing a function in MATLAB. The idea is very simple, essentially:
parfor k = 1:N
    % (...find a good solution xcurrent with cost fxcurrent...)
    % keep best cost found so far
    fmin = min(fmin, fxcurrent);
end
This works fine, because fmin is a reduction variable, and thus I can use this construction to update the current best value.
However, I couldn't find a nice elegant way of keeping (or storing) the best current solution ("xcurrent").
How do I keep track of the best solution found so far?
In other words, if the current value is strictly smaller than fmin, how can I save xcurrent (subject to the constraints that parallel loops impose in MATLAB)?
[Of course, the serial version is trivial: just prepend
if fxcurrent < fmin
    xbest = xcurrent;
end
but this does not work in a parfor loop.]
A few approaches that come to mind:
I could just store all solutions and costs (using sliced variables), but this is hugely memory inefficient (the number of iterations N is very large, and the solutions themselves are very big).
Similarly, I could use a (set or matrix) reduction variable and do:
solutionset = [solutionset,xcurrent]
but this is almost as bad in terms of memory requirement.
I could also save xcurrent to disk every time the solution is improved.
I tried to look around for a simpler solution, but nothing was very useful.
The question seems to be well-defined (so it's not like in other problems, where the output could depend on iteration order), but I couldn't find an elegant way of doing this.
Apologies in advance if I'm missing something obvious, and thanks a lot in advance!
Thanks! So I'll copy the suggestion down here:
Just an idea: what if you write your own reduction function, basically just containing the if block and a save or output?
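A sketch of what such a custom reduction might look like; keepbetter and findsolution are hypothetical names, and it relies on parfor accepting reductions of the form x = f(x, expr) when f is associative and commutative:
best = struct('f', Inf, 'x', []);
parfor k = 1:N
    [xcurrent, fxcurrent] = findsolution(k);   % placeholder for the real search
    % Reduction call: same function, same form, on every iteration.
    best = keepbetter(best, struct('f', fxcurrent, 'x', xcurrent));
end

function out = keepbetter(a, b)
% Return whichever candidate has the lower cost.
if b.f < a.f
    out = b;
else
    out = a;
end
end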
You will presumably need to maintain multiple xcurrent structures in memory anyway, since there has to be a separate copy for each worker executing the loop body. I would try splitting your loop into an outer parallel part and an inner serial part; this lets you adjust the number of copies of xcurrent independently of the total iteration count.
The inner (serial) loop can use the normal if fxcurrent < fmin; xmin = xcurrent; end construct to update its best solution, and the outer (parallel) loop can just store all solutions using slicing. As a final step you select the best solution from your (small) set.
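Something along these lines; numouter, blocksize, and findsolution are placeholder names, not from the original post:
numouter  = 100;                     % parallel chunks (tune to taste)
blocksize = ceil(N / numouter);      % serial iterations inside each chunk
fbest = inf(numouter, 1);
xbest = cell(numouter, 1);           % sliced output: one candidate per chunk
parfor o = 1:numouter
    flocal = Inf;
    xlocal = [];
    for k = 1:blocksize
        [xcurrent, fxcurrent] = findsolution();   % placeholder for the real search
        if fxcurrent < flocal                     % the usual serial update
            flocal = fxcurrent;
            xlocal = xcurrent;
        end
    end
    fbest(o) = flocal;
    xbest{o} = xlocal;
end
[fmin, idx] = min(fbest);            % final step: best of the small set
xmin = xbest{idx};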
I have an m x n matrix of integers, where m and n are fairly big (both ~1000). I want to iterate through all of the cells and perform some operations, like accessing a particular cell and assigning its value to another cell.
However, at least in my implementation, this is rather inefficient, as I have two for loops doing Matrix(a,b) = Matrix(a,b+1) or something along those lines. Is there any other way to do this, seeing as my current implementation takes a long time to traverse about 100,000 cells and perform the operations?
Thank you
In MATLAB, it's almost always possible to avoid loops.
If you want to do Matrix(a,b) = Matrix(a,b+1), you should just do Matrix2 = Matrix(:,2:end);
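To illustrate the difference, here is a small sketch using the column-shift operation from the question as the example:
M = randi(10, 1000, 1000);           % example data

% Loop version (slow): copies one cell at a time
M2 = zeros(size(M,1), size(M,2)-1);
for a = 1:size(M,1)
    for b = 1:size(M,2)-1
        M2(a,b) = M(a,b+1);
    end
end

% Vectorized version (fast): one indexing operation
M3 = M(:, 2:end);

isequal(M2, M3)                      % true: identical results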
If you can be more precise about what you do inside the loop, I can help you more.
MATLAB uses column-major ordering of matrices in memory (unlike C). Are you sure you are iterating over the indices in the correct order? If not, try switching them and see if performance improves.
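In other words, make the row index the inner loop so that consecutive iterations touch adjacent memory; a sketch, where m, n, and the increment are placeholders:
for b = 1:n          % columns in the outer loop
    for a = 1:m      % rows in the inner loop: a column's elements are
        M(a,b) = M(a,b) + 1;   % contiguous in memory, so this order is cache-friendly
    end
end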
If you can't get rid of the for loops, one possibility would be to rewrite the expensive operations in C and compile them into a MEX file.