Matlab `corr` gives different results on the same dataset. Is floating-point calculation deterministic? - matlab

I am using Matlab's corr function to calculate the correlation of a dataset. While the results agree to within double-precision accuracy (differences below 10^-14), they are not exactly the same, even on the same computer across different runs.
Is floating-point calculation deterministic? Where is the source of the randomness?

Yes and no.
Floating-point arithmetic, as in a fixed sequence of operations +, *, etc., is deterministic. However, in this case linear algebra libraries (BLAS, LAPACK, etc.) are most likely being used, and they may not be: for example, matrix multiplication is typically not performed as the "triple loop" some references would have you believe; instead, matrices are split into blocks sized for maximum performance based on things like cache size, and the work is often distributed across threads. You can therefore get different sequences of operations, with different intermediate rounding, which gives slightly different results. Typically, however, the variation in these results is smaller than the total rounding error you are incurring anyway.
I have to admit I am a little surprised that you get different results on the same computer, but it is difficult to know why without knowing what the library is doing. (IIRC, Matlab uses Intel's MKL BLAS, whose multithreaded routines are not run-to-run reproducible by default; MKL documents a "conditional numerical reproducibility" mode for exactly this reason, so you could look at their documentation.)
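The order-of-operations effect is easy to reproduce outside of any BLAS. Here is a minimal Python sketch (illustrative only; not what MATLAB actually executes) comparing a straight left-to-right sum against a blocked sum of the same values:

```python
# Summing identical values in two different orders: rounding occurs after
# every addition, so the grouping can change the final bits.
vals = [0.1 * i for i in range(1000)]

# Straight left-to-right accumulation.
left_to_right = 0.0
for v in vals:
    left_to_right += v

# "Blocked" accumulation, loosely mimicking how a BLAS kernel partitions
# work into cache-sized chunks (block size chosen arbitrarily here).
block_sums = [sum(vals[i:i + 100]) for i in range(0, len(vals), 100)]
blocked = sum(block_sums)

# The two sums agree to many significant digits but need not be
# bit-for-bit identical.
print(left_to_right, blocked, left_to_right - blocked)
```

The same mechanism, with the partitioning decided at run time by the thread scheduler, is enough to make repeated runs of a multithreaded BLAS differ in the last bits.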

Related

Neural Networks w/ Fixed Point Parameters

Most neural networks are trained with floating point weights/biases.
Quantization methods exist to convert the weights from float to int, for deployment on smaller platforms.
Can you build neural networks from the ground up that constrain all parameters, and their updates, to integer arithmetic?
Could such networks achieve a good accuracy?
(I know a bit about fixed-point and have only some rusty NN experience from the 90's so take what I have to say with a pinch of salt!)
The general answer is yes, but it depends on a number of factors.
Bear in mind that floating-point arithmetic is basically the combination of an integer significand with an integer exponent so it's all integer under the hood. The real question is: can you do it efficiently without floats?
Firstly, "good accuracy" depends on many factors. It is perfectly possible for integer operations to have finer granularity than floating-point ones: 32-bit signed integers carry 31 bits of precision, while 32-bit floats have only a 24-bit significand. So provided you do not need the extra precision that floats give you near zero, it is all about the types you choose; 16-bit, or even 8-bit, values might suffice for much of the processing.
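To make the trade-off concrete, here is a small Python sketch of a hypothetical Q4.11 fixed-point format (the format, names, and scale are my own choices for illustration, not from any particular framework):

```python
# Q4.11 fixed point: 1 sign bit, 4 integer bits, 11 fractional bits in an int16.
FRAC_BITS = 11
SCALE = 1 << FRAC_BITS  # 2048 quantization steps per unit

def to_fixed(x):
    """Quantize a float to a 16-bit fixed-point integer, with saturation."""
    return max(-32768, min(32767, round(x * SCALE)))

def fixed_mul(a, b):
    """Multiply two fixed-point values; the raw product carries 2*FRAC_BITS
    fractional bits, so shift one factor's worth back out."""
    return (a * b) >> FRAC_BITS

w, x = to_fixed(0.5), to_fixed(0.25)
y = fixed_mul(w, x)
print(y / SCALE)  # 0.125
```

Every weight and activation lives in a known range, and a multiply is one integer multiply plus one shift, which is exactly the kind of operation cheap hardware does well.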
Secondly, accumulating the inputs to a neuron has the issue that unless you know the maximum number of inputs to a node, you cannot bound the values being accumulated, and so you risk overflow. Effectively you must fix this limit at compile time and size the accumulator type accordingly.
Thirdly, the most complicated operation during the execution of a trained network is often the activation function. Again, you first have to think about what the range of values is within which you will be operating. You then need to implement the function without the aid of an FPU and all the advanced mathematical functions it provides. One way to do this is via lookup tables.
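As a sketch of the lookup-table idea (the table size, input range, and Q-format here are arbitrary choices for illustration): the table is built once with float math, and at run time the activation is a clamp and an integer index.

```python
import math

FRAC_BITS = 8
SCALE = 1 << FRAC_BITS

# Precompute sigmoid over the input range [-8, 8); outside that range
# sigmoid has effectively saturated to 0 or 1 anyway.
LO, HI = -8, 8
TABLE = [round(SCALE / (1 + math.exp(-(LO + i / SCALE))))
         for i in range((HI - LO) * SCALE)]

def sigmoid_fixed(x_fixed):
    """Sigmoid on a fixed-point input: integer arithmetic only at runtime."""
    i = x_fixed - LO * SCALE            # shift into table index range
    i = max(0, min(len(TABLE) - 1, i))  # clamp (saturate) out-of-range inputs
    return TABLE[i]

# sigmoid(0) = 0.5, which is 128 in a format with 8 fractional bits.
print(sigmoid_fixed(0))
```

A 4096-entry table like this costs a few KB; coarser tables plus linear interpolation are a common middle ground when memory is tight.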
Finally, training involves measuring the error between values, and that error can often be quite small. Here is where accuracy is a concern: if the differences you are measuring are too small, they will round down to zero and this may prevent progress. One solution is to increase the resolution of the value by providing more fractional bits.
One advantage that integers have over floating-point here is their even distribution. Where floating-point numbers lose absolute precision as they grow in magnitude, integers maintain a constant absolute precision. This means that if you are trying to measure very small differences between values close to 1, you will have no more trouble than if those values were close to 0. The same is not true for floats.
It's possible to train a network with higher precision types than those used to run the network if training time is not the bottleneck. You might even be able to train the network using floating-point types and run it using lower-precision integers but you need to be aware of differences in behavior that these shortcuts will bring.
In short, the problems involved are by no means insurmountable, but you need to take on some of the mental effort that floating-point would normally save you. However, especially if your hardware is physically constrained, this can be a hugely beneficial approach, as floating-point arithmetic can require far more silicon area and power than integer arithmetic (figures as high as 100 times are sometimes quoted).
Hope that helps.

Why does deep learning not suffer from float or numerical precision errors if most of its training is on data with mean 0 and std 1?

Inspired by the question:
Why do different methods for solving Xc=y in python give different solution when they should not?
that seems to have numerical issues due to floating point, matrix inversion, and restricting values to [-1,1], what I am curious about now is why deep learning does not suffer from float or numerical precision errors if most of its training is on data with mean 0 and std 1 (I assume most of the data has been pre-processed to be in that range, and this seems roughly right given the heavy use of batch normalization). Is it because deep learning does not train by raising a polynomial to a very high degree? Is there something special about SGD? Are the popular activation functions (relu, elu, etc.) numerically stable compared to a high-degree polynomial? Or does GPU training avoid floating-point representation altogether? Why is deep learning training numerically stable?
There is nothing really magical about DL as such: it suffers from numerical errors too, all the time. However, due to the scale and the number of nonlinearities, numerical instabilities in DL usually lead to infinities or nans, not wrong answers. Consequently they are usually easy to detect. In particular, there is nothing hard about the [0,1] interval; in fact it is a great storage spot for floats, as a quarter of all representable floats actually live in [0,1]! The problem you are referring to lies in taking a huge exponent of such a number, which goes dangerously close to machine precision. None of the standard DL techniques takes the 30th power of any activation. In fact, most of the most successful DL techniques (based on sigmoids, tanhs and relus) are almost linear, so the numerical instabilities come mostly from exp operations in probability estimates.
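The claim about the [0,1] interval can be checked by counting IEEE single-precision bit patterns in a few lines of Python:

```python
import struct

def bits(x):
    """Reinterpret a float as its 32-bit IEEE-754 single-precision pattern."""
    return struct.unpack('<I', struct.pack('<f', x))[0]

# Positive finite float32 values occupy patterns 0x00000000 (0.0) up to
# just below infinity (0x7F800000); 1.0 sits at 0x3F800000.
in_unit_interval = bits(1.0) + 1   # patterns representing [0.0, 1.0]
finite = 2 * bits(float('inf'))    # positive and negative finite patterns

print(in_unit_interval / finite)   # ~0.249: about a quarter of all floats
```

This works because float32 bit patterns, viewed as unsigned integers, are monotone in the value they represent, so counting values is just subtracting patterns.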
So:
Is it about high-degree polynomials? Yes, this is the main issue, and it is not encountered in DL.
Is there something special about SGD? Not really.
Is it about activation functions? Yes, they do not allow such huge precision drops (the exponent is the exception, though, and it does lead to numerical issues).
Is the GPU avoiding floats? No, it is not; GPUs have nothing to do with it.
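To illustrate the exp point: a naive softmax overflows on large logits, while the standard max-subtraction rewrite (mathematically identical, since the shift cancels in the ratio) stays finite. A Python sketch:

```python
import math

logits = [1000.0, 1001.0, 1002.0]

# Naive softmax: exp(1000) is far beyond the double-precision range.
# Python's math.exp raises OverflowError rather than returning inf.
try:
    naive = [math.exp(z) for z in logits]
except OverflowError:
    naive = None

# Stable softmax: subtract the max logit first. The result is unchanged
# mathematically because exp(z - m) / sum(exp(z_i - m)) equals the
# original ratio for any shift m.
m = max(logits)
exps = [math.exp(z - m) for z in logits]
total = sum(exps)
stable = [e / total for e in exps]
print(stable)  # finite probabilities summing to 1
```

This is exactly the kind of instability that surfaces as inf/nan rather than as a silently wrong answer.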

What exactly is the nobalance option in Matlab eig function?

In Matlab you can call the eig function with the 'nobalance' option. What exactly does it do differently from the default?
From mathworks documentation:
Balance option, specified as one of two strings: 'balance', which enables a preliminary balancing step, or 'nobalance', which disables it. In most cases, the balancing step improves the conditioning of A to produce more accurate results. However, there are cases in which balancing produces incorrect results. Specify 'nobalance' when A contains values whose scale differs dramatically. For example, if A contains nonzero integers, as well as very small (near zero) values, then the balancing step might scale the small values to make them as significant as the integers and produce inaccurate results.
EDIT: A related function balance is said to be the default preceding step in eig.
Note a few lines in the documentation - "The ill conditioning is concentrated in the scaling matrix" .... "If a matrix contains small elements that are due to roundoff error, balancing might scale them up to make them as significant as the other elements of the original matrix."
So, my answer to @Isopycnal's question is: "nobalance suppresses the amplification of round-off errors when dealing with ill-conditioned matrices". Here are a few points that may help -
"Balancing" a matrix A is essentially performing a similarity transformation B = T\A*T, where B is called a "balanced matrix".
By balancing a well-conditioned matrix (one of reasonable scale), the ill conditioning is concentrated into the scaling matrix T. According to the documentation of eig, "In most cases, the balancing step improves the conditioning of A to produce more accurate results."
However, balancing an ill-conditioned matrix (one whose entries differ dramatically in scale) will scale up the round-off errors, because Matlab tries to make the small values (such as 1e-9) as significant as the large ones (say 1e10). The result will then be less precise.
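The eigenvalue-preserving part is easy to verify. Below is a pure-Python 2x2 sketch of the idea behind balancing, a diagonal similarity transform that equalizes row and column scales (this is not MATLAB's actual algorithm, just the core identity it relies on):

```python
import math

def eig2(a, b, c, d):
    """Eigenvalues of [[a, b], [c, d]] from the characteristic polynomial."""
    tr, det = a + d, a * d - b * c
    disc = math.sqrt(tr * tr / 4 - det)
    return tr / 2 - disc, tr / 2 + disc

# Badly scaled matrix: off-diagonal entries 18 orders of magnitude apart.
a, b, c, d = 1.0, 1e9, 1e-9, 2.0

# Balance with T = diag(1, s): B = T\A*T multiplies b by s and divides c
# by s, which leaves the eigenvalues unchanged in exact arithmetic.
s = math.sqrt(abs(c / b))  # equalizes the off-diagonal magnitudes
raw = eig2(a, b, c, d)
bal = eig2(a, b * s, c / s, d)

print(raw)  # the two pairs agree: eigenvalues are similarity-invariant
print(bal)
```

In the balanced matrix all entries are of order 1, so the downstream Hessenberg/QR steps work with well-scaled data; the danger the documentation warns about is applying the same scaling to entries that are pure round-off noise.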
I know it has something to do with the matrix decomposition algorithms Matlab picks when performing eig (e.g. pencil decomposition, LU factorization, etc.), as @EJG89 has pointed out. But it's too deeply buried in my memory to recall :( Anyone who knows how Matlab performs commands like eig, please consider expanding this answer! Thanks!
Just for completeness, the balancing method is along the lines of LAPACK's ?GEBAL and ?GEBAK routines but some testing suggests that there are some modifications as the results differ occasionally.
Balancing helps to improve conditioning via similarity transformations. However, in some cases it actually makes the problem worse. The documented cases include Hessenberg matrices and matrices containing numerical noise that the scaling amplifies to the level of the actual data. Depending on the problem, the matrix is also permuted to bring it as close to upper triangular form as possible.
The balancing algorithm can also be used via balance.m
Other relevant balancing routines buried deeper in the toolboxes are mscale.m and arescale.m from the Control System Toolbox, which offers more refined control (excuse the pun).

Difference between data processed by CPU and GPU in openCL

I have some Matlab code that uses several large MEX functions, and I want to speed things up with OpenCL (I am replacing parts of the MEX functions' code with OpenCL code using the OpenCL API). I've translated a small part of the code into an OpenCL kernel and I am already facing difficulties.
Some elements of the resulting matrix after execution on the GPU differ from the corresponding elements of the matrix produced by the original MEX function; the error is less than 0.01. This leads to a small error in the final result, but I fear the error will accumulate as I translate more code.
This is probably related to the different precision of the calculations on the CPU and the GPU. Does anyone know how to ensure the same precision? I am running 64-bit Matlab R2012b on Ubuntu 12.04. The hardware I am using is an Intel Core2 Duo E4700 and an NVIDIA GeForce GT 520.
The small differences between results on your CPU and GPU are easily explained as arising from differences in floating-point precision if you have modified your code from using double precision (64-bit) f-p numbers on the CPU to using single-precision (32-bit) f-p numbers on the GPU.
I would not call this difference an error, rather it is an artefact of doing arithmetic on computers with floating-point numbers. The results you were getting on your CPU-only code were already different from any theoretically 'true' result. Much of the art of numerical computing is in keeping the differences between theoretical and actual computations small enough (whatever the heck that means) for the entire duration of a computation. It would take more time and space than I have now to expand on this, but surprises arising from lack of understanding of what floating-point arithmetic is, and isn't, are a rich source of questions here on SO. Some of the answers to those questions are very illuminating. This one should get you started.
If you have taken care to use the same precision on both CPU and GPU, then the differences you report may be explained by the non-associativity of floating-point arithmetic: in floating-point it is not guaranteed that (a+b)+c == a+(b+c). The order of operations matters; if you have any SIMD going on, I'd bet the order of operations is not identical in the two implementations. Even if you haven't, what have you done to ensure that operations are ordered the same on both GPU and CPU?
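A concrete demonstration of that non-guarantee, emulating single precision in Python by rounding through a 32-bit float after every operation (the values are contrived to make the effect obvious):

```python
import struct

def f32(x):
    """Round a Python double to the nearest single-precision value."""
    return struct.unpack('<f', struct.pack('<f', x))[0]

a, b, c = f32(1e8), f32(-1e8), f32(1.0)

left = f32(f32(a + b) + c)   # (a + b) + c: exact cancellation first, then + 1
right = f32(a + f32(b + c))  # a + (b + c): the 1 is absorbed into -1e8,
                             # whose spacing at that magnitude is 8

print(left, right)  # 1.0 0.0 -- same operands, different grouping
```

A GPU kernel that reduces a sum in tree order and a CPU loop that reduces it left to right are doing precisely this kind of regrouping on every element.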
As to what you should do about it, that's rather up to you. You could (though I personally wouldn't recommend it) write your own routines for doing double-precision f-p arithmetic on the GPU. If you choose to do this, expect to wave goodbye to much of the speed-up that the GPU promises.
A better course of action is to ensure that your single-precision software provides sufficient accuracy for your purposes. For example, in the world I work in our original measurements from the environment are generally not accurate to more than about 3 significant figures, so any results that our codes produce have no validity after about 3 s-f. So if I can keep the errors in the 5th and lower s-fs that's good enough.
Unfortunately, from your point of view, getting enough accuracy from single-precision computations isn't necessarily achieved by globally replacing double with float and recompiling; you may (and generally will) need to implement different algorithms, ones which take more time in order to guarantee more accuracy and which do not drift as much as the computation proceeds. Again, you'll lose some of the speed advantage that GPUs promise.
A common problem is that floating-point values are kept in an 80-bit x87 CPU register instead of being truncated and stored after each operation. In these cases, the additional precision leads to deviations. So check what options your compiler offers to counter such issues. It can also be interesting to compare release and debug builds.

How can MLE Likelihood evaluations be so different if I break up a log likelihood into its sum?

This is something I noticed in Matlab when trying to do MLE. My first estimator used the log likelihood of a pdf and broke the product up into a sum. For example, the Weibull pdf f(x) = b*a*x^(a-1)*exp(-b*x^a), broken up, gives:
log_likelihood = log(b) + log(a) + (a-1)*log(x) - b*x^a
Evaluating this is wildly different from evaluating:
log_likelihood = log(b*a*x^(a-1)*exp(-b*x^a))
What is the computer doing differently in the two stages? The first one gives a much larger number (by a couple orders of magnitude).
Depending on the numbers you use, this could be a numerical issue: If you combine very large numbers with very small numbers, you can get inaccuracies due to limitations in number precision.
One possibility is that you lose some accuracy in the second case since you are operating at different scales.
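One mechanism behind such discrepancies is easy to demonstrate: in the unexpanded form, the factor exp(-b*x^a) can underflow to zero before the log is ever taken, destroying the result, while the expanded sum of logs stays finite. A Python sketch with arbitrary Weibull parameters:

```python
import math

a, b, x = 2.0, 1.0, 30.0

# Expanded form: log b + log a + (a-1) log x - b x^a
expanded = math.log(b) + math.log(a) + (a - 1) * math.log(x) - b * x ** a

# Direct form: log(b * a * x^(a-1) * exp(-b * x^a)).
# exp(-900) underflows to 0.0, so the whole product is 0 and the log fails.
inner = b * a * x ** (a - 1) * math.exp(-b * x ** a)
try:
    direct = math.log(inner)
except ValueError:            # math.log(0.0) raises a domain error
    direct = float('-inf')

print(expanded)  # about -895.9: finite and correct
print(direct)    # -inf: the information was destroyed before the log
```

Even before outright underflow, the direct form loses accuracy gradually as exp(-b*x^a) approaches the bottom of the representable range, which matches seeing "different by orders of magnitude" rather than an outright -inf.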
I work on a scientific software project implementing maximum likelihood for phylogenetic trees, and I consistently run into issues of numerical precision. Often the discrepancy is ...
between competing applications with the same values in the model,
when calculating the MLE scores by hand,
in the order of the operations in the computation.
It really all comes down to number three, even in your case. Multiplication of very small and very large numbers can cause weird results when their exponents are scaled during computation. There is a lot about this in the (in)famous "What Every Computer Scientist Should Know About Floating-Point Arithmetic". But what I've mentioned is the short version, if that's all you are interested in.
Overall, the issues you are seeing are strictly numerical issues in the representation of floating-point / double-precision numbers and in the operations used to compute the function. I'm not too familiar with MATLAB, but it may have an arbitrary-precision type that would give you better results.
Aside from that, keep them symbolic as long as possible and if you have any intuition about the variables size (as in a is always very large compared to x), then make sure you are choosing the order of parenthesis wisely.
The first equation should be better, since it deals with sums of logs and should be much more stable than the second. The x^a term makes me a bit wary, as it dominates the expression, but it would do so in practice anyway.