error in fortran program - fortran90

This is my simple fortran program
program accel
implicit none
integer, dimension(5000) ::a,b,c
integer i
real t1,t2
do i=1,5000
a(i)=i+1
b(i)=i+2
end do
call cpu_time(t1)
do i=1,5000
c(i)=a(i)*b(i)
end do
call cpu_time(t2)
write (*,*)'Elapsed CPU time = ',t2-t1,'seconds'
end program accel
but cpu time shows 0.0000 sec. why?

Short answer
Use this to display your answer:
write(*,'(A,F12.10,A)')'Elapsed CPU time = ',t2-t1,' seconds.'
Longer answer
There can be at least two reasons why you get zero as suggested in the answers of #Ernest Friedman-Hill and #Klas Lindbäck:
The computation is taking less than 0.00005 seconds
The compiler optimizes away the whole loop
In the first case, you have a few options:
You can display more digits of t2-t1 using a format like I gave you above, or alternatively you can print the result in milliseconds: 1000*(t2-t1)
Add more iterations: if you do 50000 iterations instead of 5000, it should take ten times longer.
Make each iteration longer: you can replace your multiplication by a sequence of complicated operations possibly using math functions
In the second case, you can:
Disable optimization by passing the appropriate flag to your compiler (-O0 for gfortran)
Use c somewhere in your program after the loop
I compiled your program using gfortran 4.2.1 on OS X Lion and it worked out of the box (displaying time in exponential notation) and the formatting (short answer) worked fine too. I tried enabling and disabling optimization and it worked fine too.
The accuracy of cpu_time is probably platform dependent so that may also explain different behaviors across different machines, but with all this you should be able to solve your problem.

It doesn't take very long to do 5000 multiplications -- it may simply be taking less than one unit of cpu_time()'s resolution. Crank that 5000 up to 100000 or so, and then you'll likely see something.

The optimiser has seen that c is never read, so the calculation of c can be skipped.
If you print the value of c, the loop willnot be optimised away.

Related

faster way to add many large matrix in matlab

Say I have many (around 1000) large matrices (about 1000 by 1000) and I want to add them together element-wise. The very naive way is using a temp variable and accumulates in a loop. For example,
summ=0;
for ii=1:20
for jj=1:20
summ=summ+ rand(400);
end
end
After searching on the Internet for some while, someone said it's better to do with the help of sum(). For example,
sump=zeros(400,400,400);
count=0;
for ii=1:20
for j=1:20
count=count+1;
sump(:,:,count)=rand(400);
end
end
sum(sump,3);
However, after I tested two ways, the result is
Elapsed time is 0.780819 seconds.
Elapsed time is 1.085279 seconds.
which means the second method is even worse.
So I am just wondering if there any effective way to do addition? Assume that I am working on a computer with very large memory and a GTX 1080 (CUDA might be helpful but I don't know whether it's worthy to do so since communication also takes time.)
Thanks for your time! Any reply will be highly appreciated!.
The fastes way is to not use any loops in matlab at all.
In many cases, the internal functions of matlab all well optimized to use SIMD or other acceleration techniques.
An example for using the build in functionalities to create matrices of the desired size is X = rand(sz1,...,szN).
In your explicit case sum(rand(400,400,400),3) should give you then the fastest result.

Programming practice: Does not creating variables at first lead to faster computation?

I have a, b and A.
a = some expression 1
b = some expression 2
A = a + b
vs
A = some expression 1 + some expression 2
In my code, there are not just a and b but a lot of those. By using the later method without creating variables at first, i.e. by just summing all the expressions in A, I get 1s faster in my program, total is about 11s. This is confirmed after a long of tests. So it reduces from 11s to about 10s. Is this due to just not creating variables at first? Does not creating variables at first lead to faster computation?
I need to run a lot of for loop and run ode solver and for long computation. Variables are calculated and created inside the loop. If i can get a about 10% decrease this is good.
In general (not just MATLAB).
Your first scenario these additional steps are required, which do not apply to the second scenario.
When variable is created, memory needs to be allocated where the value for the variable can be stored.
When a value is assigned to that variable, that value needs to be written to the variable's space in memory.
When the calculation is requested, the value for each variable needs to be retrieved from memory.
Many compilers optimize away these additional overheads by using various techniques, but many interpreted languages do not. (This is not a hard and fast rule though, there are smart interpreted languages and stupid compiled ones).
I do not know exactly how the internals of MATLAB works, but I do think it is
interpreted, which means that the additional steps likely will incur additional overhead.
The problem with your second scenario is that is less readable and maintainable in the long run though. It is easier to read computations and intermediate steps when variable names are used. The trick is balance performance and readability.
I'm not sure how much of a difference it would make in terms of performance, but I don't think it would be a sizeable difference. Maybe a few hundredths of a second.
You can test it for yourself by using the tic toc function.
tic
a = some expression 1
b = some expression 2
A = a + b
toc
VS
tic
A = some expression 1 + some expression 2
toc
As mentioned in the other answer, readability is the main difference. You want to keep your code as simple as possible so that if there is a problem you know exactly where it is and hopefully why there was a problem!

coding in maxima language vs lisp

I just thought of writing some function similar to Mathematica's partition function with passing option in maxima as,
listpartitionpad(l,n,k,d):= block([temp:[],gap,newl,ntemp:[]],
newl:apply(create_listpad,flatten([n,k,d,l])),
map(lambda([x],if(length(newl)>=x+n-1 and is(unique[x]#[d]))then temp:cons(part(newl,makelist(i,i,x,x+n-1)),temp)
else temp:cons(part(newl,makelist(s,s,x,length(newl))),temp)),makelist(i,i,1,length(newl),k)),
ntemp:sublist(temp,lambda([x],is(length(x)=n))),reverse(ntemp));
Usage:listpartitionpad([a,b,c,d,e,f,g],3,3,x); => [[a,b,c],[d,e,f],[g,x,x]]
Now allof the list manipulation functions are coded in lisp as I checked.
My question is that is it fine that I could code any such function in maxima language rather than in lisp or it will give me some performance issue or something else I should know that I don't yet.
I ran a simple test
:lisp(time(loop repeat 1000000))
real time : 0.850 secs
run-gbc time : 0.540 secs
child run time : 0.000 secs
gbc time : 0.310 secs
In case of another maxima based approach,
block([],for i:1 thru 1000000 do []);
Evaluation took 5.7700 seconds (5.7700 elapsed)
And this difference grows exponentially as i grows.
Is this the reason all list manipulations are coded in lisp ?
A few different points.
Maxima is mostly implemented in Lisp because (1) the implementation needs access to the internal structure of expressions, and (2) typically Lisp code is faster than Maxima code.
While it's true that Lisp code usually runs much faster than Maxima code, my advice is to use Maxima to implement algorithms unless you are very familiar with Lisp. The speed difference is probably not going to make much practical difference. The time you spend, as the developer, is more important.
About block([],for i:1 thru 1000000 do []), if its run time really is nonlinear in the loop count, that sounds like a bug. If you can verify that, feel free to open a bug report about it. http://sourceforge.net/p/maxima/bugs (you need to create an SF login to submit a bug report.)
I don't think the language of implementation can possible explain the exponential growth (which is probably quadratic rather then exponential really).
I suspect that your algorithm is suboptimal.

Why does Matlab run faster after a script is "warmed up"?

I have noticed that the first time I run a script, it takes considerably more time than the second and third time1. The "warm-up" is mentioned in this question without an explanation.
Why does the code run faster after it is "warmed up"?
I don't clear all between calls2, but the input parameters change for every function call. Does anyone know why this is?
1. I have my license locally, so it's not a problem related to license checking.
2. Actually, the behavior doesn't change if I clear all.
One reason why it would run faster after the first time is that many things are initialized once, and their results are cached and reused the next time. For example in the M-side, variables can be defined as persistent in functions that can be locked. This can also occur on the MEX-side of things.
In addition many dependencies are loaded after the first time and remain so in memory to be re-used. This include M-functions, OOP classes, Java classes, MEX-functions, and so on. This applies to both builtin and user-defined ones.
For example issue the following command before and after running your script for the first run, then compare:
[M,X,C] = inmem('-completenames')
Note that clear all does not necessarily clear all of the above, not to mention locked functions...
Finally let us not forget the role of the accelerator. Instead of interpreting the M-code every time a function is invoked, it gets compiled into machine code instructions during runtime. JIT compilation occurs only for the first invocation, so ideally the efficiency of running object code the following times will overcome the overhead of re-interpreting the program every time it runs.
Matlab is interpreted. If you don't warm up the code, you will be losing a lot of time due to interpretation instead of the actual algorithm. This can skew results of timings significantly.
Running the code at least once will enable Matlab to actually compile appropriate code segments.
Besides Matlab-specific reasons like JIT-compilation, modern CPUs have large caches, branch predictors, and so on. Warming these up is an issue for benchmarking even in assembly language.
Also, more importantly, modern CPUs usually idle at low clock speed, and only jump to full speed after several milliseconds of sustained load.
Intel's Turbo feature gets even more funky: when power and thermal limits allow, the CPU can run faster than its sustainable max frequency. So the first ~20 seconds to 1 minute of your benchmark may run faster than the rest of it, if you aren't careful to control for these factors.
Another issue not mensioned by Amro and Marc is memory (pre)allocation.
If your script does not pre-allocate its memory it's first run would be very slow due to memory allocation. Once it completed its first iteration all memory is allocated, so consecutive invokations of the script would be more efficient.
An illustrative example
for ii = 1:1000
vec(ii) = ii; %// vec grows inside loop the first time this code is executed only
end

Fastest language for FOR loops

I'm trying to figure out the best programming language for an analytical model I'm building. Primary consideration is speed at which it will run FOR loops.
Some detail:
The model needs to perform numerous (~30 per entry, over 12 cycles) operations on a set of elements from an array -- there are ~300k rows, and ~150 columns in the array. Most of these operations are logical in nature, e.g., if place(i) = 1, then j(i) = 2.
I've built an earlier version of this model using Octave -- to run it takes ~55 hours on an Amazon EC2 m2.xlarge instance (and it uses ~10 GB of memory, but I'm perfectly happy to throw more memory at it). Octave/Matlab won't do elementwise logical operations, so a large number of for loops are needed -- I'm relatively certain that I've vectorized as much as possible -- the loops that remain are necessary. I've gotten octave-multicore to work with this code, which makes some improvement (~30% speed reduction when I get it running on 8 EC2 cores), but ends up being unstable with file locking, etc.
+I'm really looking for a step change in run-time -- I know that actually using Matlab might get me as much as a 50% improvement from looking at some benchmarks, but that is cost-prohibitive. The original plan when starting this was to actually run a Monte Carlo with this, but at 55 hours a run, that's completely impractical.
The next version of this is going to be a complete rebuild from the ground up (for IP reasons I won't get in to if nothing else), so I'm completely open to any programming language. I'm most familiar with Octave/Matlab, but have dabbled in R, C, C++, Java. I'm also proficient w/ SQL if the solution involves storing the data in a database. I'll learn any language for this -- these aren't complicated functionality we're looking for, no interfacing with other programs, etc., so not too concerned about learning curve.
So with all that said, what's the fastest programming language specifically for FOR loops? From a search of SO and Google, Fortran and C bubble to the top, but looking for some more advice before diving in to one or the other.
Thanks!
This for loop looks no more complex than this when it hits the CPU:
for(int i = 0; i != 1024; i++) translates to
mov r0, 0 ;;start the counter
top:
;;some processing
add r0, r0, 1 ;;increment the counter by 1
jne top: r0, 1024 ;;jump to the loop top if we havn't hit the top of the for loop (1024 elements)
;;continue on
As you can tell, this is sufficiently simple you can't really optimize it very well[1]... Refocus towards the algorithm level.
The first cut at the problem is to look at cache locality. Look up the classic example of matrix multiplication and swapping the i and j indexes.
edit: As a second cut, I would suggest evaluating the algorithm for data-dependencies between iterations and data dependency between localities in your 'matrix' of data. It may be a good candidate for parallelization.
[1] There are some micro-optimizations possible, but those will not produce the speedsups you're looking for.
~300k * ~150 * ~30 * ~12 = ~16G iterations, right?
This number of primitive operations should complete in a matter of minutes (if not seconds) in any compiled language on any decent CPU.
Fortran, C/C++ should do it almost equally well. Even managed languages, such as Java and C#, would only fall behind by a small margin (if at all).
If you have a problem of ~16G iterations running 55 hours, this means that they are very far from being primitive (80k per second? this is ridiculous), so maybe we should know more. (as was already suggested, is disk access limiting performance? is it network access?)
As #Rotsor said, 16G operations / 55 hours is about 80,000 operations per second, or one operation every 12.5 microseconds. That's a lot of time per operation.
That means your loops are not the cause of poor performance, it's what's in the innermost loop that's taking the time. And Octave is an interpreted language. That alone means an order of magnitude slowdown.
If you want speed, you first need to be in a compiled language. Then you need to do performance tuning (aka profiling) or, just single step it in a debugger at the instruction level. That will tell you what it is actually doing in its heart of hearts. Once you've got it to where it's not wasting cycles, fancier hardware, cores, CUDA, etc. will give you a parallelism speedup. But it's silly to do that if your code is taking unnecessarily many cycles. (Here's an example of performance tuning - a 43x speedup just by trimming the fat.)
I can't believe the number of responders talking about matlab, APL, and other vectorized languages. Those are interpreters. They give you concise source code, which is not at all the same thing as fast execution. When it comes down to the bare metal, they are stuck with the same hardware as every other language.
Added: to show you what I mean, I just ran this C++ code, which does 16G operations, on this moldy old laptop, and it took 94 seconds, or about 6ns per iteration. (I can't believe you baby-sat that thing for 2 whole days.)
void doit(){
double sum = 0;
for (int i = 0; i < 1000; i++){
for (int j = 0; j < 16000000; j++){
sum += j * 3.1415926;
}
}
}
In terms of absolute speed, probably Fortran followed by C, followed by C++. In practical application, well written code in any of the three, compiled with a descent compiler should be plenty fast.
Edit- generally you are going to see much better performance with any kind of looped or forking (e.g.- if statements) code with a compiled language, versus an interpreted language.
To give an example, on a recent project I was working on, the data sizes and operations were about 3/4 the size of what you're talking about here, but like your code, had very little room for vectorization, and required significant looping. After converting the code from matlab to C++, runtimes went from 16-18 hours, down to around 25 minutes.
For what you're discussing, Fortran is probably your first choice. The closest second place is probably C++. Some C++ libraries use "expression templates" to gain some speed over C for this kind of task. It's not entirely certain those will do you any good, but C++ can be at least as fast as C, and possibly somewhat faster.
At least in theory, there's no reason Ada couldn't be competitive as well, but it's been so long since I used it for anything like this that I hesitate to recommend it -- not because it isn't good, but because I just haven't kept track of current Ada compilers well enough to comment on them intelligently.
Any compiled language should perform the loop itself on roughly equal terms.
If you can formulate your problem in its terms, you might want to look at CUDA or OpenCL and run your matrix code on the GPU - though this is less good for code with lots of conditionals.
If you want to stay on conventional CPUs, you may be able to formulate your problem in terms of SSE scatter/gather and bitmask operations.
Probably the assembly language for whatever your platform is. But compilers (especially special-purpose ones that specifically target a single platform (e.g., Analog Devices or TI DSPs)) are often as good as or better than humans. Also, compilers often know about tricks that you don't. For example, the aforementioned DSPs support zero-overhead loops and the compiler will know how to optimize code to use those loops.
Matlab will do element-wise logical operations and they are generally quite fast.
Here is a quick example on my computer (AMD Athalon 2.3GHz w/3GB) :
d=rand(300000,150);
d=floor(d*10);
>> numel(d(d==1))
ans =
4501524
>> tic;d(d==1)=10;toc;
Elapsed time is 0.754711 seconds.
>> numel(d(d==1))
ans =
0
>> numel(d(d==10))
ans =
4501524
In general I've found matlab's operators are very speedy, you just have to find ways to express your algorithms directly in terms of matrix operators.
C++ is not fast when doing matrixy things with for loops. C is, in fact, specifically bad at it. See good math bad math.
I hear that C99 has __restrict pointers that help, but don't know much about it.
Fortran is still the goto language for numerical computing.
How is the data stored? Your execution time is probably more effected by I/O (especially disk or worse, network) than by your language.
Assuming operations on rows are orthogonal, I would go with C# and use PLINQ to exploit all the parallelism I could.
Might you not be best with a hand-coded assembler insert? Assuming, of course, that you don't need portability.
That and an optimized algorithm should help (and perhaps restructuring the data?).
You might also want to try multiple algorithms and profile them.
APL.
Even though it's interpreted, its primitive operators all operate on arrays natively, therefore you rarely need any explicit loops. You write the same code, whether the data is scalar or array, and the interpreter takes care of any looping needed internally, and thus with the minimum overhead - the loops themselves are in a compiled language, and will have been heavily optimised for the specific architecture of the CPU it's running on.
Here's an example of the simplicity of array handling in APL:
A <- 2 3 4 5 6 8 10
((2|A)/A) <- 0
A
2 0 4 0 6 8 10
The first line sets A to a vector of numbers.
The second line replaces all the odd numbers in the vector with zeroes.
The third line queries the new values of A, and the fourth line is the resulting output.
Note that no explicit looping was required, as scalar operators such as '|' (remainder) automatically extend to arrays as required. APL also has built-in primitives for searching and sorting, which will probably be faster than writing your own loops for these operations.
Wikipedia has a good article on APL, which also provides links to suppliers such as IBM and Dyalog.
Any modern compiled or JITted language is going to render down to pretty much the same machine language code, giving a loop overhead of 10 nano seconds or less, per iteration, on modern processors.
Quoting #Rotsor:
If you have a problem of ~16G iterations running 55 hours, this means that they are very far from being primitive (80k per second? this is ridiculous), so maybe we should know more.
80k operations per second is around 12.5 microseconds each - a factor of 1000 greater than the loop overhead you'd expect.
Assuming your 55 hour runtime is single threaded, and if your per item operations are as simple as suggested, you should be able to (conservatively) achieve a speedup of 100x and cut it down to under an hour very easily.
If you want to run faster still, you'll want to look at writing multi-threaded solution, in which case a language that provides good support would be essential. I'd lean towards PLINQ and C# 4.0, but that's because I already know C# - YMMV.
what about a lazy loading language like clojure. it is a lisp so like most lisp dialects lacks a for loop but has many other forms that operate more idiomatically for list processing. It might help your scaling issues as well because the operations are thread safe and because the language is functional it has fewer side effects. If you wanted to find all the items in the list that were 'i' values to operate on them you might do something like this.
(def mylist ["i" "j" "i" "i" "j" "i"])
(map #(= "i" %) mylist)
result
(true false true true false true)