Fast inverse square root on the iPhone - iphone

The fast inverse square root function used by SGI/3dfx, and most notably in Quake, is often cited as being faster than the equivalent assembly instruction; however, the posts claiming that seem quite dated. I was curious about its performance on more modern hardware, and particularly on mobile devices like the iPhone. I wouldn't be surprised if the Quake sqrt is no longer a worthwhile optimization on desktop systems, but what about an iPhone project involving a lot of 3D math? Is it something that would be worthwhile to include?

No.
The NEON instruction set (like every other vector ISA*) has a hardware approximate reciprocal square root instruction that is much faster than that oft-cited "trick". Use it instead if reciprocal square root is actually a performance bottleneck in your code (as always, benchmark first; don't spend time optimizing something if you have no hard evidence that its performance matters).
You can get at it by writing your own assembly (inline or otherwise) with the vrsqrte.f32 instruction, or from C, Objective-C, or C++ by including the <arm_neon.h> header and using the vrsqrte_f32() intrinsic.
[*] On SSE it's rsqrtss/rsqrtps; on Altivec it's frsqrte/vrsqrte.
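For reference, a minimal sketch (not from the original answer) of using the intrinsic from C, with one optional Newton-Raphson refinement step via vrsqrts_f32; it assumes a NEON-capable build (e.g. compiled with -mfpu=neon):

#include <arm_neon.h>

/* Approximate 1/sqrt(x) for two floats at a time. */
static inline float32x2_t rsqrt_approx(float32x2_t x)
{
    float32x2_t est = vrsqrte_f32(x);                         /* rough hardware estimate */
    est = vmul_f32(est, vrsqrts_f32(vmul_f32(x, est), est));  /* one refinement step     */
    return est;
}

/* Usage: float32x2_t r = rsqrt_approx(vdup_n_f32(2.0f)); */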

Related

Which language should I prefer working with if I want to use the Fast Artificial Neural Network Library (FANN)?

I am working on reducing the dimensionality of a set of (Boolean) vectors using autoencoders, with both the number of vectors and their dimensionality tending to be on the order of 10^5-10^6. Even though speed is not of the essence (this is supposed to be a pre-computation for a clustering algorithm), one would obviously expect the computations to take a reasonable amount of time. Seeing how the library itself was written in C++, would it be a good idea to stick to it, or to code in Java (since the rest of the code is written in Java)? Or would it not matter at all?
That question is difficult to answer. It depends on:
How computationally demanding will your code be? If the hard part is done by the library and your code only generates the input and post-processes the output, Java would be a valid choice. Compare it to Matlab: the language is very slow but the built-in algorithms are super-fast.
How skilled are you (or your team, or your future students) in Java and C++? Consider that learning C++ takes a lot of time. If you only have a small-scale project, it could be easier to buy a bigger machine, or to wait two days instead of one to get the results.
Do you have legacy code in one of the languages that you want to couple with or maybe reuse?
Overall, I would advise you to set up a benchmark example in whichever language you like more. Then give it a try. If the speed is OK, stick to it. If you have to wait too long, think about alternatives (new hardware, parallel execution, a different language).
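If it helps, a rough C sketch of such a benchmark against FANN (the layer sizes, file name and training parameters below are placeholders, not taken from the question):

#include <stdio.h>
#include <time.h>
#include <fann.h>

int main(void)
{
    /* Toy autoencoder-like network: input, bottleneck, reconstruction. */
    struct fann *ann = fann_create_standard(3, 1000, 64, 1000);
    fann_set_activation_function_hidden(ann, FANN_SIGMOID_SYMMETRIC);
    fann_set_activation_function_output(ann, FANN_SIGMOID_SYMMETRIC);

    clock_t t0 = clock();
    /* "vectors.data" is a placeholder file in FANN's training-data format. */
    fann_train_on_file(ann, "vectors.data", 100, 10, 0.001f);
    printf("training took %.1f s\n", (double)(clock() - t0) / CLOCKS_PER_SEC);

    fann_destroy(ann);
    return 0;
}

The equivalent driver in Java (through the bindings) would quickly tell you whether the language choice matters for your data sizes.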

Different results for the same algorithm in matlab

I'm doing a linear algebra assignment to compare the performance and stability of the QR factorization algorithms Gram-Schmidt and Householder.
My doubt arises when calculating the following table:
Here, the matrices Q and R are the matrices resulting from the QR factorizations obtained by applying Gram-Schmidt and Householder to a Hilbert matrix A, I is the identity matrix of dimension N, and || * || is the Frobenius norm.
When I do the calculations on different computers, I get different results in some cases; what could this be due to? The table above corresponds to the calculations performed on a 32-bit computer and the next table to a 64-bit one:
Do these MATLAB results depend on the computer architecture on which the calculations were made?
I'm really interested in an answer if you find one!
Unfortunately plenty of things can change the numerical results...
To be efficient, some LAPACK algorithms iterate over sub-matrix blocks. For optimal efficiency, the size of the blocks has to somehow fit the size of the CPU L1/L2/L3 caches...
The block size is controlled by the LAPACK routine ILAENV; see http://www.netlib.org/lapack/lug/node120.html
Of course, if the block sizes differ, the results will differ numerically... It is possible that the LAPACK/BLAS DLLs provided by Matlab were compiled with differently tuned versions of ILAENV on the two machines, or that ILAENV has been replaced with a customized, optimized version that takes cache size into account. You could check this yourself by writing a small C program which calls ILAENV and linking it against the DLL provided by Matlab...
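As a rough sketch of that check (the "ilaenv_" name and hidden string-length arguments shown here assume the usual gfortran/f2c calling convention; the library shipped with Matlab may use different mangling):

#include <stdio.h>

/* ILAENV(ISPEC, NAME, OPTS, N1, N2, N3, N4): query LAPACK's tuning parameters. */
extern int ilaenv_(const int *ispec, const char *name, const char *opts,
                   const int *n1, const int *n2, const int *n3, const int *n4,
                   int name_len, int opts_len);

int main(void)
{
    int ispec = 1;                       /* 1 = optimal block size */
    int m = 1000, n = 1000, unused = -1;
    int nb = ilaenv_(&ispec, "DGEQRF", " ", &m, &n, &unused, &unused, 6, 1);
    printf("ILAENV block size for DGEQRF(%d,%d): %d\n", m, n, nb);
    return 0;
}

If the two machines print different block sizes, that alone can explain small numerical differences in the factorizations.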
For the underlying BLAS, it's even worse: if an optimized version is used, fused multiply-add FPU instructions might be used when available, for example, and they are not necessarily available on all FPUs. AFAIK, Matlab uses ATLAS http://math-atlas.sourceforge.net/, and you'll have to inquire about how the library was produced... You would have to track down differences in the results of basic algebra operations (like matrix*vector or matrix*matrix ...).
UPDATE: Even if ILAENV were the same, QR uses elementary rotations, so it obviously depends on the sin/cos implementation. Unfortunately, no standard says exactly how sin and cos should behave bitwise; they can be off by a few ulp from the correctly rounded result, differ from one library to another, and give different results on different architectures/compilers (they are hardwired in the x87 FPU). So unless you provide your own version of these functions (or work in Ada), compile them with specially crafted compiler options, and maybe finely control the FPU modes, there's close to no chance of getting exactly the same results on different architectures... You would also have to ask Matlab whether they took special care to ensure deterministic floating-point results when they compiled those libraries.
That depends on the Matlab implementation. Do you get the same result when you rerun on the same architecture? If yes, the problem may be caused by precision. Sometimes it is caused by the different FPUs (floating-point units) of the CPUs. You could try more 32-bit/64-bit machines with different CPUs.
The most authoritative answer would come from your Matlab provider. Just call them if you have a valid license.
According to this link:
One cause of differences is that if calculations are done with the x87 instructions, values get held at 80-bit precision. Depending on compiler optimisations, numbers may stay at 80 bits for a few operations before being truncated back to 64 bits. This can cause variations. See http://gcc.gnu.org/wiki/x87note for more info.
The gcc man page says that using SSE (instead of 387) is the default on x86-64 platforms. You should be able to force it on 32-bit using something like
-mfpmath=sse -msse -msse2
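As a small illustration (a sketch, not from the linked page), the same double-precision expression can give a different result depending on whether intermediates are held in 80-bit x87 registers or 64-bit SSE registers; whether you actually see the difference also depends on optimization level and register spilling:

#include <stdio.h>

int main(void)
{
    volatile double a = 1.0e16, b = 1.0, c = -1.0e16;
    double r = (a + b) + c;   /* with x87 the intermediate may be kept at 80 bits */
    printf("%.17g\n", r);     /* typically 1 with -mfpmath=387, 0 with -mfpmath=sse */
    return 0;
}

Build it twice, e.g. with gcc -O2 -mfpmath=387 and gcc -O2 -mfpmath=sse -msse2, and compare the output.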

Compilation optimization for iPhone: floating point or fixed point?

I'm building a library for the iPhone (speex, but I'm sure this will apply to a lot of other libs too) and the make script has an option to use fixed point instead of floating point.
As the iPhone ARM processor has the VFP extension and performs floating point calculations very well, do you think it's a better choice to use the fixed point option?
If someone has already benchmarked this and wants to share, I would be really thankful.
Well, it depends on the setup of your application; here are some guidelines:
First, try setting the optimization level to -Os (Fastest, Smallest)
Turn on Relax IEEE Compliance
If your application can easily process floating point numbers in contiguous memory locations independently, you should look at the ARM NEON intrinsics and assembly instructions; they can process up to 4 floating point numbers in a single instruction (see the sketch after this list).
If you are already heavily using floating point math, try to switch some of your logic to fixed point (but keep in mind that moving from a NEON register to an integer register results in a full pipeline stall)
If you are already heavily using integer math, try changing some of your logic to floating point math.
Remember to profile before optimizing
And above all, better algorithms will always beat micro-optimizations such as the above.
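As a sketch of the NEON point above (assuming a NEON-capable target, e.g. built with -mfpu=neon, and an array length that is a multiple of 4):

#include <arm_neon.h>

/* Scale an array of floats in place, four elements per iteration. */
void scale_array(float *data, unsigned count, float factor)
{
    float32x4_t vfactor = vdupq_n_f32(factor);   /* broadcast the scalar  */
    for (unsigned i = 0; i < count; i += 4) {
        float32x4_t v = vld1q_f32(data + i);     /* load 4 floats         */
        v = vmulq_f32(v, vfactor);               /* 4 multiplies at once  */
        vst1q_f32(data + i, v);                  /* store 4 floats        */
    }
}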
If you are dealing with large blocks of sequential data, NEON is definitely the way to go.
Float or fixed, that's a good question. NEON is somewhat faster at dealing with fixed point, but I'd keep the native input format, since conversions take time and possibly extra memory.
Even if the lib offers a different output format as an option, it almost always means lib-internal conversions. So I guess float is the native one in this case. Stick to it.
No one prevents you from micro-optimizing better algorithms. And usually, the better the algorithm, the more performance gain can be achieved through micro-optimizations, thanks to the pipelining on modern machines.
I'd stay away from intrinsics though. There are so many posts on the net complaining about intrinsics doing something crazy, especially when dealing with immediate values.
It can and will get very troublesome, and you can hardly optimize anything with intrinsics either.

How to do hardware independent parallel programming?

These days there are two main hardware environments for parallel programming: one is multi-threaded CPUs, and the other is graphics cards, which can do parallel operations on arrays of data.
The question is, given that there are two different hardware environments, how can I write a program which is parallel but independent of these two different hardware environments?
I mean that I would like to write a program, and regardless of whether I have a graphics card or a multi-threaded CPU or both, the system should automatically choose what to execute it on: the graphics card and/or the multi-threaded CPU.
Are there any software libraries/language constructs which allow this?
I know there are ways to target the graphics card directly to run code on, but my question is about how we as programmers can write parallel code without knowing anything about the hardware, with the software system scheduling it onto either the graphics card or the CPU.
If you require me to be more specific as to the platform/language, I would like the answer to be about C++ or Scala or Java.
Thanks
Martin Odersky's research group at EPFL just recently received a multi-million-euro European Research Grant to answer exactly this question. (The article contains several links to papers with more details.)
In a few years from now programs will rewrite themselves from scratch at run-time (hey, why not?)...
...as of right now (as far as I am aware) it's only viable to target related groups of parallel systems with given paradigms, and a GPU ("embarrassingly parallel") is significantly different from a "conventional" CPU (2-8 "threads"), which is significantly different from a 20k-processor supercomputer.
There are actually parallel run-times/libraries/protocols like Charm++ or MPI (think "Actors") that can scale -- with specially engineered algorithms for certain problems -- from a single CPU to tens of thousands of processors, so the above is a bit of hyperbole. However, there are enormous fundamental differences between a GPU -- or even a Cell microprocessor -- and a much more general-purpose processor.
Sometimes a square peg just doesn't fit into a round hole.
Happy coding.
OpenCL is precisely about running the same code on CPUs and GPUs, on any platform (Cell, Mac, PC...).
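As a minimal sketch of that idea in plain C (error handling trimmed; the GPU-then-CPU fallback logic here is just an illustration):

#include <stdio.h>
#ifdef __APPLE__
#include <OpenCL/opencl.h>
#else
#include <CL/cl.h>
#endif

int main(void)
{
    cl_platform_id platform;
    cl_device_id device;
    cl_int err;

    clGetPlatformIDs(1, &platform, NULL);

    /* Prefer a GPU, fall back to a CPU device if the platform has none. */
    err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
    if (err != CL_SUCCESS)
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_CPU, 1, &device, NULL);

    char name[128];
    clGetDeviceInfo(device, CL_DEVICE_NAME, sizeof name, name, NULL);
    printf("running kernels on: %s\n", name);

    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
    /* ... create a command queue, build the program, enqueue kernels ... */
    clReleaseContext(ctx);
    return 0;
}

The same kernel source then runs unchanged on whichever device was picked.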
From Java you can use JavaCL, which is an object-oriented wrapper around the OpenCL C API that will save you lots of time and effort (handles memory allocation and conversion burdens, and comes with some extras).
From Scala, there's ScalaCL, which builds upon JavaCL to completely hide away the OpenCL language: it converts some parts of your Scala program into OpenCL code at compile-time (it comes with a compiler plugin to do so).
Note that Scala has featured parallel collections as part of its standard library since 2.9.0, and they are usable in a pretty similar way to ScalaCL's OpenCL-backed parallel collections (Scala's parallel collections can be created out of regular collections with .par, while ScalaCL's parallel collections are created with .cl).
The (very) recently announced MS C++ AMP looks like the kind of thing you're after. It seems (from reading the news articles) that it's initially targeted at GPUs, but the longer-term aim seems to be to include multi-core too.
Sure. See ScalaCL for an example, though it's still alpha code at the moment. Note also that it uses some Java libraries that perform the same thing.
I will cover the more theoretical answer.
Different parallel hardware architectures implement different models of computation. Bridging between these is hard.
In the sequential world we've been happily hacking away basically the same single model of computation: the Random Access Machine. This creates a nice common language between hardware implementors and software writers.
No such single optimal model for parallel computation exists. Since the dawn of modern computers a large design space has been explored; current multicore CPUs and GPUs cover but a small fraction of this space.
Bridging these models is hard because parallel programming is essentially about performance. You typically make something work on two different models or systems by adding a layer of abstraction to hide specifics. However, it is rare that an abstraction does not come with a performance cost. This will typically land you with the lowest common denominator of both models.
And now answering your actual question. Having a computational model (language, OS, library, ...) that is independent of CPU or GPU will typically not abstract over both while retaining the full power you're used to with your CPU, due to the performance penalties. To keep everything relatively efficient the model will lean towards GPUs by restricting what you can do.
Silver lining:
What does happen is hybrid computations. Some computations are more suited for other kinds of architectures. You also rarely do only one type of computation, so that a 'sufficiently smart compiler/runtime' will be able to distinguish what part of your computation should run on what architecture.

What's the fastest vector/matrix math library in C for iPhone game?

As the title says, I'm looking for a vector/matrix library in C optimized for the iPhone/iPod processors.
Or just generally fast.
---(edit)---
I'm sorry for unclear question.
I'm looking for a fast lib for commercial games for the iPhone/iPod, so a GPL lib cannot be used.
However, I'll stop looking for the fastest lib; that may be meaningless.
Now (2010.06.26) the Accelerate framework is included in iOS 4, so the vDSP/BLAS functions are available.
It utilizes hardware features (the CPU or SIMD unit) to accelerate floating point operations, so superior speed (2~4.5x on average, 8x maximum) and lower energy consumption (down to 0.25x) can be gained by using it.
Thanks to everyone for the other answers.
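For reference, a minimal sketch of calling one of those vDSP routines from C on iOS 4+ (link against the Accelerate framework):

#include <Accelerate/Accelerate.h>
#include <stdio.h>

int main(void)
{
    float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    float b[4] = {0.5f, 0.5f, 0.5f, 0.5f};
    float c[4];

    /* c[i] = a[i] * b[i]; strides of 1, 4 elements */
    vDSP_vmul(a, 1, b, 1, c, 1, 4);

    printf("%f %f %f %f\n", c[0], c[1], c[2], c[3]);
    return 0;
}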
It depends very much on your needs. If you're just using straight floating point math, you will probably find that the compiler uses software floating point, which will be very slow. So step one is making sure that you use the hardware floating point unit that is available in the iPhone processor.
Step two is using an already well-established library; there are several, and Hassan already provided you with a link to the GNU GSL, which is nice.
The next step would be to take advantage of the VFP's SIMD-like abilities. The VFP is not actually SIMD, but it does provide SIMD-like instructions, for which the individual operations are performed consecutively. The advantage of still using these instructions is that your program text will be shorter, allowing better use of the instruction cache and fewer problems from missed branch predictions and so forth. I am, however, not aware of any vector library taking advantage of the VFP; you'd have to do a good search and possibly write your own if it's not available.
Finally, if you still need more speed, you'll want to use the true SIMD unit in the iPhone processor. However, this unit is not a floating point unit but an integer unit. So, assuming you do want real numbers, you'll be stuck with fixed point; it depends on your application whether you can get away with that. Again, I am not aware of any vector library providing fixed-point arithmetic using the SIMD unit provided by the iPhone processor, so again you'd need a thorough search and would possibly have to get your hands dirty yourself.
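If you do end up going the fixed-point route, here is a minimal sketch of what that looks like (a hypothetical Q16.16 format, not tied to any particular library):

#include <stdint.h>
#include <stdio.h>

typedef int32_t q16_16;                      /* 16 integer bits, 16 fraction bits */
#define Q16_ONE (1 << 16)

static inline q16_16 q16_from_float(float f) { return (q16_16)(f * Q16_ONE); }
static inline float  q16_to_float(q16_16 q)  { return (float)q / Q16_ONE; }

static inline q16_16 q16_mul(q16_16 a, q16_16 b)
{
    /* widen to 64 bits so the intermediate product cannot overflow */
    return (q16_16)(((int64_t)a * b) >> 16);
}

int main(void)
{
    q16_16 x = q16_from_float(1.5f);
    q16_16 y = q16_from_float(2.25f);
    printf("1.5 * 2.25 = %f\n", q16_to_float(q16_mul(x, y)));   /* ~3.375 */
    return 0;
}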