Obviously, float comparison is always tricky. I have a lot of assert-check in my (scientific) code, so very often I have to check for equality of sums to one, and similar issues.
Is there a quick-and easy / best-practice way of performing those checks?
The easiest way I can think of is to build a custom function for fixed tolerance float comparison, but that seems quite ugly to me. I'd prefer a built-in solution, or at least something that is extremely clear and straightforward to understand.
I think it's most likely going to have to be a function you write yourself. I use three things pretty constantly for running computational vector tests so to speak:
Maximum absolute error
return max(abs(result(:) - expected(:))) < tolerance
This calculates maximum absolute error point-wise and tells you whether that's less than some tolerance.
Maximum excessive error count
return sum( (abs(result(:) - expected(:))) < tolerance )
This returns the number of points that fall outside your tolerance range. It's also easy to modify to return percentage.
Root mean squared error
return norm(result(:) - expected(:)) < rmsTolerance
Since these and many other criteria exist for comparing arrays of floats, I would suggest writing a function which would accept the calculation result, the expected result, the tolerance and the comparison method. This way you can make your checks very compact, and it's going to be much less ugly than trying to explain what it is that you're doing in comments.
Any fixed tolerance will fail if you put in very large or very small numbers, simplest solution is to use eps to get the double precision:
abs(A-B)<eps(A)*4
The 4 is a totally arbitrary number, which is sufficient in most cases.
Don't know any special build in solution. Maybe something with using eps function?
For example as you probably know this will give False (i.e. 0) as a result:
>> 0.1 + 0.1 + 0.1 == 0.3
ans =
0
But with eps you could do the following and the result is as expected:
>> (0.1+0.1+0.1) - 0.3 < eps
ans =
1
I have had good experience with xUnit, a unit test framework for Matlab. After installing it, you can use:
assertVectorsAlmostEqual(a,b) (checks for normwise closeness between vectors; configurable absolute/relative tolerance and sane defaults)
assertElementsAlmostEqual(a,b) (same check, but elementwise on every single entry -- so [1 1e-12] and [1 -1e-9] will compare equal with the former but not with the latter).
They are well-tested, fast to use and clear enough to read. The function names are quite long, but with any decent editor (or the Matlab one) you can write them as assertV<tab>.
For those who understand both MATLAB and Python (NumPy), it would maybe be useful to check the code of the following Python functions, which do the job:
numpy.allclose(a, b, rtol=1e-05, atol=1e-08)
numpy.isclose(a, b, rtol=1e-05, atol=1e-08, equal_nan=False)
Related
I am working on a code in Least Square Non Negative solution recovery context on Matlab, and I need (with no more details because it's not that important for this question) to know the number of non zero elements in my matrices and arrays.
The function NNZ on matlab does exactly what I want, but it happens that I need more information about what Matlab thinks of a "zero element", it could be 0 itself, or the numerical zero like 1e-16 or less.
Does anybody has this information about the NNZ function, cause I couldn't get the original script
Thanks.
PS : I am not an expert on Matlab, so accept my apologies if it's a really simple task.
I tried "open nnz", on Matlab but I only get a small script of commented code lines...
Since nnz counts everything that isn't an exact zero (i.e. 1e-100 is non-zero), you just have to apply a relational operator to your data first to find how many values exceed some tolerance around zero. For a matrix A:
n = nnz(abs(A) > 1e-16);
Also, this discussion of floating-point comparison might be of interest to you.
You can add in a tolerance by doing something like:
nnz(abs(myarray)>tol);
This will create a binary array that is 1 when abs(myarray)>tol and 0 otherwise and then count the number of non-zero entries.
Obviously, float comparison is always tricky. I have a lot of assert-check in my (scientific) code, so very often I have to check for equality of sums to one, and similar issues.
Is there a quick-and easy / best-practice way of performing those checks?
The easiest way I can think of is to build a custom function for fixed tolerance float comparison, but that seems quite ugly to me. I'd prefer a built-in solution, or at least something that is extremely clear and straightforward to understand.
I think it's most likely going to have to be a function you write yourself. I use three things pretty constantly for running computational vector tests so to speak:
Maximum absolute error
return max(abs(result(:) - expected(:))) < tolerance
This calculates maximum absolute error point-wise and tells you whether that's less than some tolerance.
Maximum excessive error count
return sum( (abs(result(:) - expected(:))) < tolerance )
This returns the number of points that fall outside your tolerance range. It's also easy to modify to return percentage.
Root mean squared error
return norm(result(:) - expected(:)) < rmsTolerance
Since these and many other criteria exist for comparing arrays of floats, I would suggest writing a function which would accept the calculation result, the expected result, the tolerance and the comparison method. This way you can make your checks very compact, and it's going to be much less ugly than trying to explain what it is that you're doing in comments.
Any fixed tolerance will fail if you put in very large or very small numbers, simplest solution is to use eps to get the double precision:
abs(A-B)<eps(A)*4
The 4 is a totally arbitrary number, which is sufficient in most cases.
Don't know any special build in solution. Maybe something with using eps function?
For example as you probably know this will give False (i.e. 0) as a result:
>> 0.1 + 0.1 + 0.1 == 0.3
ans =
0
But with eps you could do the following and the result is as expected:
>> (0.1+0.1+0.1) - 0.3 < eps
ans =
1
I have had good experience with xUnit, a unit test framework for Matlab. After installing it, you can use:
assertVectorsAlmostEqual(a,b) (checks for normwise closeness between vectors; configurable absolute/relative tolerance and sane defaults)
assertElementsAlmostEqual(a,b) (same check, but elementwise on every single entry -- so [1 1e-12] and [1 -1e-9] will compare equal with the former but not with the latter).
They are well-tested, fast to use and clear enough to read. The function names are quite long, but with any decent editor (or the Matlab one) you can write them as assertV<tab>.
For those who understand both MATLAB and Python (NumPy), it would maybe be useful to check the code of the following Python functions, which do the job:
numpy.allclose(a, b, rtol=1e-05, atol=1e-08)
numpy.isclose(a, b, rtol=1e-05, atol=1e-08, equal_nan=False)
Here's some simple code that shows what I've been seeing:
A = randn(1,5e6)+1i*randn(1,5e6);
B = randn(1,5e6)+1i*randn(1,5e6);
sum(A.*conj(B)) - A*B'
sum(A.*conj(B)) - mtimes(A,B')
A*B' - mtimes(A,B')
Now, the three methods shown on the bottom are supposed to do the same thing, so the answers should be zero, right? Wrong! The differences are small, though not small enough that I would consider them negligible. In addition, the error increases as the length of A and B increases.
Does anyone know what the actual difference between these methods is? I understand that there are probably shortcuts written into the code, but I would like to quantify that if possible. Does Matlab post the differences anywhere? I've looked around, but have not found anything.
It probably has something to do with the order in which the operations are performed. For example,
sum(A.*conj(B)) - fliplr(A)*fliplr(B)'
gives a result different than
sum(A.*conj(B)) - A*B'
Or, more strikingly,
A*B' - fliplr(A)*fliplr(B)'
gives a nonzero result, of the same order as your tests.
So my bet is that depending on the method (sum or *) Matlab internally does the operations in a different order, and that may well account for the different roundoff errors you observed.
Given the size of each number, the rounding error in simple operations on this number is of the order 10^-14
You have 5*10^6 numbers, hence if you are really unlucky the rounding error can become anything upto 5*10^-8.
Your observed error is of size 10^10, which is well in the expected range.
Note that the difference is not caused by the complex transpose, but by the product of the sum vs the matrix product.
A = randn(1,5e6)+1i*randn(1,5e6);
B = randn(1,5e6)+1i*randn(1,5e6);
B1 = conj(B);
B2 = B';
isequal(B1(:),B2(:)) % This returns true
A*transpose(conj(B)) - A*B' % Hence this returns zero
sum(A.*transpose(B')) - A*B' % But this returns something like 1e-10
A similar effect occurs for non complex A and B:
N=1e6;
A = 1:N;
B=1:N;
(N * (N + 1) * (2*N + 1))/6 % This will give exactly the right answer
A*B'
fliplr(A)*fliplr(B)'
Note that the two lowest answers only vary a few hundred from eachother, whilst they are actually over 2000 from the correct answer. If this is a problem consider using the symbolic toolbox. That allows you to calculate with arbitrary precision.
My Question - part 1: What is the best way to test if a floating point number is an "integer" (in Matlab)?
My current solution for part 1: Obviously, isinteger is out, since this tests the type of an element, rather than the value, so currently, I solve the problem like this:
abs(round(X) - X) <= sqrt(eps(X))
But perhaps there is a more native Matlab method?
My Question - part 2: If my current solution really is the best way, then I was wondering if there is a general tolerance that is recommended? As you can see from above, I use sqrt(eps(X)), but I don't really have any good reason for this. Perhaps I should just use eps(X), or maybe 5 * eps(X)? Any suggestions would be most welcome.
An Example: In Matlab, sqrt(2)^2 == 2 returns False. But in practice, we might want that logical condition to return True. One can achieve this using the method described above, since sqrt(2)^2 actually equals 2 + eps(2) (ie well within the tolerance of sqrt(eps(2)). But does this mean I should always use eps(X) as my tolerance, or is there good reason to use a larger tolerance, such as 5 * eps(X), or sqrt(eps(X))?
UPDATE (2012-10-31): #FakeDIY pointed out that my question is partially a duplicate of this SO question (apologies, not sure how I missed it in my initial search). Given this I'd like to emphasize the "tolerance" part of the question (which is not covered in that link), ie is eps(X) a sensible tolerance, or should I use something larger, like 5 * eps(X), and if so, why?
UPDATE (2012-11-01): Thanks everyone for the responses. I've +1'ed all three answers as I feel they all contribute meaningfully to various aspects of the question. I'm giving the answer tick to Eric Postpischil as that answer really nailed the tolerance part of the question well (and it has the most upvotes at this point in time).
No, there is no general tolerance that is recommended, and there cannot be.
The difference between a computed result and a mathematically ideal result is a function of the operations that produced the computed result. Because those operations are specific to each application, there is no general rule for testing any property of a computed result.
To design a proper test, you must determine what errors may have occurred during computation, determine bounds on the resulting error in the computed result, and test whether the computed result differs from the ideal result (perhaps the nearest integer) by less than those bounds. You must also decide whether those bounds are sufficiently small to satisfy your application’s requirements. (Using a relaxed test that accepts as an integer something that is not an integer decreases false negatives [incorrect rejections of a result as an integer where the ideal result would be an integer] but increases false positives [incorrect acceptances of a result as an integer where the ideal result would not be an integer].)
(Note that it can even be the case the testing as if the error bounds were zero can produce false negatives: It is possible a computation produces a result that is exactly an integer when the ideal result is not an integer, so any error tolerance, even zero, will falsely report this result is an integer. If this is unacceptable for your application, then, in such a case, the computations must be redesigned.)
It is not only not possible to state, without specific knowledge of the application, a numerical tolerance that may be used, it is impossible to state whether the tolerance should be absolute, should be relative to the computed value or to a target value, should be measured in ULPs (units of least precision), or should be set in some other manner. This is because errors may be introduced into computations in a variety of ways. For example, if there is a small relative error in a and a and b are close in value, then a-b has a large relative error. Additionally, if c is large, then (a-b)*c has a large absolute error.
Its probably not the most efficient method but I would use mod for this:
a = 15.0000000000;
b = mod(a,1.0)
c = 15.0000000001;
d = mod(c,1.0)
returns b = 0 and d = 1.0000e-010
There are a number of other alternatives suggested here:
How do I test for integers in MATLAB?
I like the idea of comparing (x == floor(x)) too.
1) I have historically used your method with a simple tolerance, eps(X). The mod methods interested me though, so I benchmarked a couple using Steve Eddins timeit function.
f = #() abs(X - round(X)) <= eps(X);
g = #() X == round(X);
h = #() ~mod(X,1);
For single values, like X=1.0, yours appears to fastest:
timeit(f) = 7.3635e-006
timeit(g) = 9.9677e-006
timeit(h) = 9.9214e-006
For vectors though, like X = 1:0.01:100, the other methods are faster (though round still beats mod):
timeit(f) = 0.00076636
timeit(g) = 0.00028182
timeit(h) = 0.00040539
2) The error bound is really problem dependent. Other answers cover this much better than I am able to.
I have two arrays of data that I'm trying to amalgamate. One contains actual latencies from an experiment in the first column (e.g. 0.345, 0.455... never more than 3 decimal places), along with other data from that experiment. The other contains what is effectively a 'look up' list of latencies ranging from 0.001 to 0.500 in 0.001 increments, along with other pieces of data. Both data sets are X-by-Y doubles.
What I'm trying to do is something like...
for i = 1:length(actual_latency)
row = find(predicted_data(:,1) == actual_latency(i))
full_set(i,1:4) = [actual_latency(i) other_info(i) predicted_info(row,2) ...
predicted_info(row,3)];
end
...in order to find the relevant row in predicted_data where the look up latency corresponds to the actual latency. I then use this to created an amalgamated data set, full_set.
I figured this would be really simple, but the find function keeps failing by throwing up an empty matrix when looking for an actual latency that I know is in predicted_data(:,1) (as I've double-checked during debugging).
Moreover, if I replace find with a for loop to do the same job, I get a similar error. It doesn't appear to be systematic - using different participant data sets throws it up in different places.
Furthermore, during debugging mode, if I use find to try and find a hard-coded value of actual_latency, it doesn't always work. Sometimes yes, sometimes no.
I'm really scratching my head over this, so if anyone has any ideas about what might be going on, I'd be really grateful.
You are likely running into a problem with floating point comparisons when you do the following:
predicted_data(:,1) == actual_latency(i)
Even though your numbers appear to only have three decimal places of precision, they may still differ by very small amounts that are not being displayed, thus giving you an empty matrix since FIND can't get an exact match.
One feature of floating point numbers is that certain numbers can't be exactly represented, since they aren't an integer power of 2. This occurs with the numbers 0.1 and 0.001. If you repeatedly add or multiply one of these numbers you can see some unexpected behavior. Amro pointed out one example in his comment: 0.3 is not exactly equal to 3*0.1. This can also be illustrated by creating your look-up list of latencies in two different ways. You can use the normal colon syntax:
vec1 = 0.001:0.001:0.5;
Or you can use LINSPACE:
vec2 = linspace(0.001,0.5,500);
You'd think these two vectors would be equal to one another, but think again!:
>> isequal(vec1,vec2)
ans =
0 %# FALSE!
This is because the two methods create the vectors by performing successive additions or multiplications of 0.001 in different ways, giving ever so slightly different values for some entries in the vector. You can take a look at this technical solution for more details.
When comparing floating point numbers, you should therefore do your comparisons using some tolerance. For example, this finds the indices of entries in the look-up list that are within 0.0001 of your actual latency:
tolerance = 0.0001;
for i = 1:length(actual_latency)
row = find(abs(predicted_data(:,1) - actual_latency(i)) < tolerance);
...
The topic of floating point comparison is also covered in this related question.
You may try to do the following:
row = find(abs(predicted_data(:,1) - actual_latency(i))) < eps)
EPS is accuracy of floating-point operation.
Have you tried using a tolerance rather than == ?