Differences in numerical behaviour between Scala versions - scala

We are in the process of upgrading to Scala 2.10. We seem to have fixed all the show stoppers, but for some of our numerical calculations the answers are a little different in a small number of cases. Are there any known changes introduced between 2.9.1 and 2.10.2 that might cause this or has anybody else seen something like this before?
Since all our small tests pass this should be some cumulative effect in iterative calculations, and even then it only affects a small percentage of cases. These are complex calculations and this is the only system for performing them, so we do not have an independent way to verify which version is right, short of doing a clean room reimplementation.

Sorry, can't post just a comment yet.
What is the magnitude of the differences you're seeing, and what kind of calculation are you doing in your code? It may not be meaningful to ask which is correct, it's FP, after all. Could you try compiling your code with strictfp? Slight differences in code gen and optimization can result in differences in when/how often 80-bit intermediates are truncated to 64 -- assuming you're running on an x86 architecture. If you have access to hardware that doesn't have 80-bit fp registers you might try running both versions on that. Even then, simple expressions like LL + S1 + S2, where LL is much greater than S1 or S2, can give different results depending on the order of the adds.

Related

Which language should I prefer working with if I want to use the Fast Artificial Neural Network Library (FANN)?

I am working on reducing dimentionality of a set of (Boolean) vectors with both the number and dimentionality of vectors tending to be of the order of 10^5-10^6 using autoencoders. Hence even though speed is not of essence (it is supposed to be a pre-computation for a clustering algorithm) but obviously one would expect that the computations take a reasonable amount of time. Seeing how the library itself was written in c++ would it be a good idea to stick to it or to code in Java (Since the rest of the code is written in Java)? Or would it not matter at all?
That question is difficult to answer. It depends on:
How computationally demanding will be your code? If the hard part is done by the library and your code is only to generate the input and post-process the output, Java would be a valid choice. Compare it to Matlab: The language is very slow but the built-in algorithms are super-fast.
How skilled are you (or your team, or your future students) in Java and C++. Consider learning C++ takes a lot of time. If you have only a small scaled project, it could be easier to buy a bigger machine or wait two days instead of one, to get the results.
Have you legacy code in one of the languages you want to couple or maybe re-use?
Overall, I would advice you to set up a benchmark example in whatever language you like more. Then give it a try. If the speed is ok, stick to it. If you wait to long, think about alternatives (new hardware, parallel execution, different language).

Numerical Integral of large numbers in Fortran 90

so I have the following Integral that i need to do numerically:
Int[Exp(0.5*(aCosx + bSinx + cCos2x + dSin2x))] x=0..2Pi
The problem is that the output at any given value of x can be extremely large, e^2000, so larger than I can deal with in double precision.
I havn't had much luck googling for the following, how do you deal with large numbers in fortran, not high precision, i dont care if i know it to beyond double precision, and at the end i'll just be taking the log, but i just need to be able to handle the large numbers untill i can take the log..
Are there integration packes that have the ability to handle arbitrarily large numbers? Mathematica clearly can.. so there must be something like this out there.
Cheers
This is probably an extended comment rather than an answer but here goes anyway ...
As you've already observed Fortran isn't equipped, out of the box, with the facility for handling such large numbers as e^2000. I think you have 3 options.
Use mathematics to reduce your problem to one which does (or a number of related ones which do) fall within the numerical range that your Fortran compiler can compute.
Use Mathematica or one of the other computer algebra systems (eg Maple, SAGE, Maxima). All (I think) of these can be integrated into a Fortran program (with varying degrees of difficulty and integration).
Use a library for high-precision (often called either arbitray-precision or multiple-precision too) arithmetic. Your favourite search engine will turn up a number of these for you, some written in Fortran (and therefore easy to integrate), some written in C/C++ or other languages (and therefore slightly harder to integrate). You might start your search at Lawrence Berkeley or the GNU bignum library.
(Yes I know that I wrote that you have 3 options, but your question suggests that you aren't ready to consider this yet) You could write your own high-/arbitrary-/multiple-precision functions. Fortran provides everything you need to construct such a library, there is a lot of work already done in the field to learn from, and it might be something of interest to you.
In practice it generally makes sense to apply as much mathematics as possible to a problem before resorting to a computer, that process can not only assist in solving the problem but guide your selection or construction of a program to solve what's left of the problem.
I agree with High Peformance Mark that the best option here numerically is to use analytics to scale or simplify the result first.
I will mention that if you do want to brute force it, gfortran (as of 4.6, with the libquadmath library) has support for quadruple precision reals, which you can use by selecting the appropriate kind. As long as your answers (and the intermediate results!) don't get too much bigger than what you're describing, that may work, but it will generally be much slower than double precision.
This requires looking deeper at the problem you are trying to solve and the behavior of the underlying mathematics. To add to the good advice already provided by Mark and Jonathan, consider expanding the exponential and trig functions into Taylor series and truncating to the desired level of precision.
Also, take a step back and ask why you are trying to accomplish by calculating this value. As an example, I recently had to debug why I was getting outlandish results from a property correlation which was calculating vapor pressure of a fluid to see if condensation was occurring. I spent a long time trying to understand what was wrong with the temperature being fed into the correlation until I realized the case causing the error was a simulation of vapor detonation. The problem was not in the numerics but in the logic of checking for condensation during a literal explosion; physically, a condensation check made no sense. The real problem was the code was asking an unnecessary question; it already had the answer.
I highly recommend Forman Acton's Numerical Methods That (Usually) Work and Real Computing Made Real. Both focus on problems like this and suggest techniques to tame ill-mannered computations.

Different results for the same algorithm in matlab

I'm doing an assignment of linear algebra, to compare the performance and stability of QR factorization algorithms Gram-Schmidt and Householder.
My doubt comes when calculating the following table:
Where the matrices Q and R are the resulting matrices of the QR factorizations by applying the Gram-Schmidt and householder to a Hilbert matrix A, I is the identity matrix of dimension N; and || * || is the Frobenius norm‎.
When I do the calculations on different computers i have different results in some cases, may be due to this?. The above table corresponds to the calculations performed in a 32-bit computer and the next table in a 64-bit:
These results in matlab involves computer architectures in which the calculations were made?
I'm really interested by an answer if you find one!
Unfortunately plenty of things can change the numerical results...
For being efficient, some LAPACK algorithm iterate on sub-matrix blocks. For optimal efficiency, the size of the blocks has to fit somehow the size of CPU L1/L2/L3 caches...
The size of block is controlled by LAPACK routine ILAENV see http://www.netlib.org/lapack/lug/node120.html
Of course, if block size differ, result will differ numerically... It is possible that the lapack/BLAS DLL provided by Matlab are compiled with a differently tuned version of ILAENV on the two machines, or that ILAENV has been replaced with a customized optimized version taking cache size into account, you could check by yourself making a small C program which call ILAENV and link it to DLL provided by Matlab...
For underlying BLAS, it's even worse: if an optimized version is used, some fused mul-add FPU instruction might be used when available by exemple, and they are not necessarily available on all FPU. AFAIK, Matlab use ATLAS http://math-atlas.sourceforge.net/, and you'll have to inquire about how the lirary was produced... You would have to track differences in the result of basic algebra operations (like matrix*vector or matrix*matrix ...).
UPDATE: Even if ILAENV were the same, QR uses elementary rotations, so it obviously depend on sin/cos implementation. Unfortunately no standard tells exactly how sin and cos should bitwise-behave, they can be off a few ulp from rounded exact result, and differ from one library to another and will give different results on different architectures/compilers (hardwired in x87 FPU). So unless you provide your own version of these functions (or work in ADA) and compile them with specially crafted compiler options, and maybe finely control the FPU modes there's close to no chance to find exactly same results on different architectures... You will also have to ask Matlab if they took special care to insure floating point deterministic results when they compiled those libraries.
That depends on matlab implementation. Do you get the same result when rerun on same architecture? If yes, this problem may caused by precision. Sometimes, it is caused by different FPU (floatingpoint process uint) of CPU. You may try on more 32-bit/64-bit with different CPU.
The best answer should be reply by your matlab provider. Just call them if you have valid license.
According this link.
one cause of difference is that if calculations are done on with the x87 instructions, it get held in 80 bit precision. depending on compiler optimisations, it numbers may stay at 80bit for a few operation before being truncated back to 64 bit. this can cause variations. see http://gcc.gnu.org/wiki/x87note for more info.
the gcc man pages says that using sse (instead of 387) is default on x86-64 platforms. you should be able to force it on 32bit using something like
-mfpmath=sse -msse -msse2

Compilation optimization for iPhone : floating point or fixed point?

I'm building a library for iphone (speex, but i'm sure it will apply to a lot of other libs too) and the make script has an option to use fixed point instead of floating point.
As the iphone ARM processor has the VFP extension and performs very well floating point calculations, do you think it's a better choice to use the fixed point option ?
If someone already benchmarked this and wants to share , i would really thank him.
Well, it depends on the setup of your application, here is some guidelines
First try turning on optimization to 0s (Fastest Smallest)
Turn on Relax IEEE Compliance
If your application can easily process floating point numbers in contiguous memory locations independently, you should look at the ARM NEON intrinsic's and assembly instructions, they can process up to 4 floating point numbers in a single instruction.
If you are already heavily using floating point math, try to switch some of your logic to fixed point (but keep in mind that moving from an NEON register to an integer register results in a full pipeline stall)
If you are already heavily using integer math, try changing some of your logic to floating point math.
Remember to profile before optimization
And above all, better algorithms will always beat micro-optimizations such as the above.
If you are dealing with large blocks of sequential data, NEON is definitely the way to go.
Float or fixed, that's a good question. NEON is somewhat faster dealing with fixed, but I'd keep the native input format since conversions take time and eventually, extra memory.
Even if the lib offers a different output formats as an option, it almost alway means lib-internal conversions. So I guess float is the native one in this case. Stick to it.
Noone prevents you from micro-optimizing better algorithms. And usually, the better the algorithm, the more performance gain can be achieved through micro-optimizations due to the pipelining on modern machines.
I'd stay away from intrinsics though. There are so many posts on the net complaining about intrinsics doing something crazy, especially when dealing with immediate values.
It can and will get very troublesome, and you can hardly optimize anything with intrinsics either.

What is the fastest int to float conversion on the iPhone?

I am converting some Int16s and Int32s to float and then back again.
I'm just using a straight cast, but doing this 44100 times per second (any guesses what its for? :) )
Is a cast efficient? Can it be done any faster?
P.S Compile for thumb is turned off.
There are only two ways to know.
1) Read the code the compiler generates for promoting ints to floats in your case.
2) Measure the performance of the code the compiler generates vs. other options.
To do the former, set the SDK to Device and the Active Architecture to arm, and choose Build > Show Assembly Code. Then read the compiler-generated code.
If you are smarter than a compiler then you can write your own assembly code and use it instead. Odds are you aren't.
If you are doing an operation many, many times, Instruments will do a good job at showing you how many processor samples it's taking. But Jim's point is valid, and you shouldn't dismiss it as unhelpful: in an operation involving math on floating-point numbers, compiler type promotion is the least of your worries. Chips are built to do that in two or three cycles, and compilers usually manage to make that happen. But the effects processing you're doing will probably take thousands of cycles. The promotion will be lost in the noise.
Is a cast efficient? In your case, I'd guess it's efficient enough.
Can it be done faster? Maybe...but would it be worth the effort? Have you benchmarked it and discovered a performance problem due to the cast operations?
If you're doing anything mathematically nontrivial with the floating point sample data,
I'd be really surprised if the casts turned out to be a significant bottleneck!