I'm building a library for the iPhone (Speex, but I'm sure this applies to a lot of other libs too), and the make script has an option to use fixed point instead of floating point.
As the iPhone's ARM processor has the VFP extension and handles floating-point calculations very well, do you think it's a better choice to use the fixed-point option?
If anyone has already benchmarked this and is willing to share, I'd really appreciate it.
Well, it depends on the setup of your application. Here are some guidelines:
First, try setting the optimization level to -Os (Fastest, Smallest)
Turn on Relax IEEE Compliance
If your application can easily process floating point numbers in contiguous memory locations independently, you should look at the ARM NEON intrinsics and assembly instructions; they can process up to 4 floating point numbers in a single instruction.
If you are already heavily using floating point math, try to switch some of your logic to fixed point (but keep in mind that moving from a NEON register to an integer register results in a full pipeline stall)
If you are already heavily using integer math, try changing some of your logic to floating point math.
Remember to profile before optimization
And above all, better algorithms will always beat micro-optimizations such as the above.
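To make the fixed-versus-float trade-off concrete, here is a minimal sketch in plain C of the same gain applied to audio samples both ways. The Q15 format and all function names are assumptions chosen for illustration; they are not Speex's actual internals, and nothing here uses NEON.

```c
#include <stdint.h>

/* Hypothetical helper: multiply two Q15 values (1 sign bit, 15 fraction
   bits) with rounding and saturation. Assumes >> on a negative int32_t is
   an arithmetic shift, which holds on the usual ARM and x86 compilers. */
static int16_t q15_mul(int16_t a, int16_t b)
{
    int32_t p = (int32_t)a * (int32_t)b;   /* Q30 product, fits in 32 bits */
    p = (p + (1 << 14)) >> 15;             /* round and scale back to Q15 */
    if (p > INT16_MAX) p = INT16_MAX;      /* only -1.0 * -1.0 can overflow */
    return (int16_t)p;
}

/* Fixed-point path: scale 16-bit samples in place by a Q15 gain. */
static void scale_fixed(int16_t *x, int n, int16_t gain_q15)
{
    for (int i = 0; i < n; ++i)
        x[i] = q15_mul(x[i], gain_q15);
}

/* Floating-point path: the same operation on float samples. */
static void scale_float(float *x, int n, float gain)
{
    for (int i = 0; i < n; ++i)
        x[i] *= gain;
}
```

Timing both loops on the target device over a realistic buffer size is the only reliable way to see which one wins for a given build configuration.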
If you are dealing with large blocks of sequential data, NEON is definitely the way to go.
Float or fixed, that's a good question. NEON is somewhat faster dealing with fixed, but I'd keep the native input format, since conversions take time and possibly extra memory.
Even if the lib offers different output formats as an option, it almost always means lib-internal conversions. So I guess float is the native one in this case. Stick to it.
No one prevents you from micro-optimizing better algorithms. And usually, the better the algorithm, the more performance gain can be achieved through micro-optimizations due to the pipelining on modern machines.
I'd stay away from intrinsics though. There are so many posts on the net complaining about intrinsics doing something crazy, especially when dealing with immediate values.
It can and will get very troublesome, and you can hardly optimize anything with intrinsics either.
I am working on reducing the dimensionality of a set of (Boolean) vectors using autoencoders, with both the number and the dimensionality of the vectors tending to be of the order of 10^5-10^6. Speed is not of the essence (this is meant to be a pre-computation for a clustering algorithm), but obviously one would expect the computations to take a reasonable amount of time. Seeing that the library itself is written in C++, would it be a good idea to stick to it, or to code in Java (since the rest of the code is written in Java)? Or would it not matter at all?
That question is difficult to answer. It depends on:
How computationally demanding will your code be? If the hard part is done by the library and your code only generates the input and post-processes the output, Java would be a valid choice. Compare it to Matlab: the language is very slow, but the built-in algorithms are super-fast.
How skilled are you (or your team, or your future students) in Java and C++? Consider that learning C++ takes a lot of time. If you have only a small-scale project, it could be easier to buy a bigger machine, or to wait two days instead of one, to get the results.
Do you have legacy code in one of the languages that you want to couple with or maybe re-use?
Overall, I would advise you to set up a benchmark example in whichever language you like more. Then give it a try. If the speed is OK, stick to it. If you wait too long, think about alternatives (new hardware, parallel execution, a different language).
So I have the following integral that I need to do numerically:
Int[Exp(0.5*(a*Cos(x) + b*Sin(x) + c*Cos(2x) + d*Sin(2x)))], x = 0..2Pi
The problem is that the output at any given value of x can be extremely large, e^2000, so larger than I can deal with in double precision.
I haven't had much luck googling for the following: how do you deal with large numbers in Fortran? Not high precision; I don't care whether I know the value beyond double precision, and at the end I'll just be taking the log, but I need to be able to handle the large numbers until I can take the log.
Are there integration packages that have the ability to handle arbitrarily large numbers? Mathematica clearly can, so there must be something like this out there.
Cheers
This is probably an extended comment rather than an answer but here goes anyway ...
As you've already observed Fortran isn't equipped, out of the box, with the facility for handling such large numbers as e^2000. I think you have 3 options.
Use mathematics to reduce your problem to one which does (or a number of related ones which do) fall within the numerical range that your Fortran compiler can compute.
Use Mathematica or one of the other computer algebra systems (eg Maple, SAGE, Maxima). All (I think) of these can be integrated into a Fortran program (with varying degrees of difficulty and integration).
Use a library for high-precision (often also called arbitrary-precision or multiple-precision) arithmetic. Your favourite search engine will turn up a number of these for you, some written in Fortran (and therefore easy to integrate), some written in C/C++ or other languages (and therefore slightly harder to integrate). You might start your search at Lawrence Berkeley or the GNU bignum library.
(Yes I know that I wrote that you have 3 options, but your question suggests that you aren't ready to consider this yet) You could write your own high-/arbitrary-/multiple-precision functions. Fortran provides everything you need to construct such a library, there is a lot of work already done in the field to learn from, and it might be something of interest to you.
In practice it generally makes sense to apply as much mathematics as possible to a problem before resorting to a computer; that process can not only assist in solving the problem but also guide your selection or construction of a program to solve what's left of the problem.
I agree with High Performance Mark that the best option here numerically is to use analysis to scale or simplify the result first.
I will mention that if you do want to brute force it, gfortran (as of 4.6, with the libquadmath library) has support for quadruple precision reals, which you can use by selecting the appropriate kind. As long as your answers (and the intermediate results!) don't get too much bigger than what you're describing, that may work, but it will generally be much slower than double precision.
This requires looking deeper at the problem you are trying to solve and the behavior of the underlying mathematics. To add to the good advice already provided by Mark and Jonathan, consider expanding the exponential and trig functions into Taylor series and truncating to the desired level of precision.
Also, take a step back and ask what you are trying to accomplish by calculating this value. As an example, I recently had to debug why I was getting outlandish results from a property correlation which was calculating the vapor pressure of a fluid to see if condensation was occurring. I spent a long time trying to understand what was wrong with the temperature being fed into the correlation until I realized the case causing the error was a simulation of vapor detonation. The problem was not in the numerics but in the logic of checking for condensation during a literal explosion; physically, a condensation check made no sense. The real problem was that the code was asking an unnecessary question; it already had the answer.
I highly recommend Forman Acton's Numerical Methods That (Usually) Work and Real Computing Made Real. Both focus on problems like this and suggest techniques to tame ill-mannered computations.
I am about to start a project in visual image processing and have had no experience with Matlab, AForge, or OpenCV. I was wondering if anyone has had any experience with these different software packages.
I was also wondering which of the three packages is the most efficient. I assume OpenCV, but has anyone had any experience?
Thanks
Jamie.
The question you need to ask yourself is which is more important - your time or the computer's time. If your task is really simple, you may be able to code it up in MATLAB and have it work right off the bat. MATLAB is by far the easiest for development - a scripted language with built-in memory management, a huge array of provided functions, and a great interface for displaying and manipulating data while debugging.
On the other hand, MATLAB is at least an order of magnitude slower than compiled OpenCV code for many tasks. This is especially true if you use the Intel Performance Primitives libraries.
If you know how to code in MATLAB, I would suggest writing and debugging your algorithms in that language, then porting them to C/C++ with OpenCV for speed. If there are only a couple of simple functions that you need to speed up, you can call C code from MATLAB, but it's hard to get this working right the first few times you try it, so you're probably better off just rewriting your finished code entirely in C/C++.
First, please elaborate on your project's needs. They have the biggest impact on the choice, along with other factors such as your general programming knowledge (if you haven't dealt with .NET but just with C++, AForge is not a good choice, for example).
Generally,
Both AForge and OpenCV have a built-in interface to .NET, and OpenCV also has interfaces to C++, Python, and more. Matlab might be more efficient, but if you don't have any experience with it, you will also have to learn its syntax. Take that into consideration.
Matlab probably has the largest variety of functions, but it is more complicated than the other projects. OpenCV and AForge themselves have some differences; see them described in this StackOverflow question/answers.
I worked last year on two similar projects with cars on the highway. Afaik, Matlab lets you process only one picture frame at a time (surely you could work out an algorithm to process a stream), but using Simulink you can process the stream directly.
On the other hand, I found AForge a lot friendlier and easier to use, since you can easily adjust the processing parameters from a GUI (not so fast/easy to do in Matlab/Simulink).
I'd go for AForge.NET. It's also fast enough if you're worried about processing speed (using 640x480).
If you are asking about using one of these in .NET, here is a quick summary, with my own score for each:
1 - Matlab is mostly used for simulating projects, not for the final prototype; my score: 30
2 - AForge (which I've used in many projects) is easy to use and works very well for single, one-off processing steps, as long as you do not need continuous (looping) processing such as capturing images or recognizing something in them; my score: 50
3 - OpenCV is very fast and well suited to continuous processing; for example, you can capture images from a webcam and cartoonize them instantly, without any delay. It is not as easy to use as AForge, but I like it anyway because of its speed and the MANY functions it gives us, covering mostly anything we need in programming; my score: 80
Dr.Taha - Tahasoft.net
As the title says, I'm looking for a vector/matrix library in C optimized for iPhone/iPod processors.
Or just generally fast.
---(edit)---
I'm sorry for the unclear question.
I'm looking for a fast lib for commercial iPhone/iPod games, so a GPL lib cannot be used.
However, I'll stop looking for the fastest lib; that may be meaningless.
Now (2010.06.26) the Accelerate framework is included in iOS 4, so the vDSP/BLAS functions are available.
It uses hardware features (CPU or SIMD) to accelerate floating-point operations, so you can gain superior speed (2~4.5x on average, 8x maximum) and lower energy consumption (0.25x maximum).
Thanks, everyone, for the other answers.
It depends very much on your needs. If you're just using straight floating point math, you will probably find that the compiler uses software floating point, which will be very slow. So step one is making sure that you use the hardware floating point unit that is available in the iPhone processor.
Step two is using an already well-established library. There are several; Hassan already provided you with a link to the GNU GSL, which is nice.
The next step would be to take advantage of the VFP's SIMD-like abilities. The VFP is not actually SIMD, but it does provide SIMD-like instructions for which the individual operations are performed consecutively. The advantage of still using these instructions is that your program text will be shorter, allowing better use of the instruction cache and fewer problems with mispredicted branches and so forth. I am, however, not aware of any vector library taking advantage of the VFP; you'd have to do a good search and possibly write your own if none is available.
Finally, if you still need more speed, you'll want to use the true SIMD unit in the iPhone processor. However, this unit is not a floating point unit but an integer unit. So, assuming you do want real numbers, you'll be stuck with fixed point; whether you can get away with that depends on your application. Again, I am not aware of any vector library providing fixed point arithmetic using the SIMD unit provided by the iPhone processor, so again you'd need a thorough search and possibly have to get your hands dirty yourself.
I am converting some Int16s and Int32s to float and then back again.
I'm just using a straight cast, but doing this 44100 times per second (any guesses what it's for? :) )
Is a cast efficient? Can it be done any faster?
P.S. Compile for Thumb is turned off.
There are only two ways to know.
1) Read the code the compiler generates for promoting ints to floats in your case.
2) Measure the performance of the code the compiler generates vs. other options.
To do the former, set the SDK to Device and the Active Architecture to arm, and choose Build > Show Assembly Code. Then read the compiler-generated code.
If you are smarter than a compiler then you can write your own assembly code and use it instead. Odds are you aren't.
If you are doing an operation many, many times, Instruments will do a good job at showing you how many processor samples it's taking. But Jim's point is valid, and you shouldn't dismiss it as unhelpful: in an operation involving math on floating-point numbers, compiler type promotion is the least of your worries. Chips are built to do that in two or three cycles, and compilers usually manage to make that happen. But the effects processing you're doing will probably take thousands of cycles. The promotion will be lost in the noise.
Is a cast efficient? In your case, I'd guess it's efficient enough.
Can it be done faster? Maybe...but would it be worth the effort? Have you benchmarked it and discovered a performance problem due to the cast operations?
If you're doing anything mathematically nontrivial with the floating point sample data,
I'd be really surprised if the casts turned out to be a significant bottleneck!
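For reference, a conversion loop of the kind being discussed might look like the sketch below. The function names and the scaling by 32768 (a common audio convention) are assumptions; the question only mentions straight casts. Whole buffers are converted per call so that the per-sample cost is just a cast and a multiply.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: Int16 PCM samples to float in [-1.0, 1.0). */
static void int16_to_float(const int16_t *in, float *out, size_t n)
{
    const float scale = 1.0f / 32768.0f;
    for (size_t i = 0; i < n; ++i)
        out[i] = (float)in[i] * scale;
}

/* Hypothetical helper: float back to Int16, clamping so that
   out-of-range samples saturate instead of wrapping around. */
static void float_to_int16(const float *in, int16_t *out, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        float v = in[i] * 32768.0f;
        if (v > 32767.0f) v = 32767.0f;
        if (v < -32768.0f) v = -32768.0f;
        out[i] = (int16_t)v;   /* the cast truncates toward zero */
    }
}
```

Wrapping a few thousand calls of each function in a timer, or running the app under Instruments, would show how small this cost is next to the actual effects processing.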