Problems sampling Double-Ranges in Scala

I want to sample a numeric range (start, end, and increment provided). Why does the last element sometimes exceed my upper limit?
E.g.
(9.474 to 49.474 by 1.0).last > 49.474 // gives true
This is because the last element is 49.474000000000004
I know that I could also use until to make an exclusive range, but then I "lose" one element (the range stops at 48.474000000000004).
Is there a way to sample a range with the start and end set "exactly" to the provided bounds? (Background: the above behaviour gets me into trouble when e.g. doing interpolation with Apache Commons Math, as extrapolation is not allowed.)
What I'm currently doing to prevent this is rounding:
def roundToScale(d: Double, scale: Int) = BigDecimal(d).setScale(scale, BigDecimal.RoundingMode.HALF_EVEN).toDouble
(9.474 to 49.474 by 1.0).map(roundToScale(_: Double, 5))

That's due to double arithmetic: 9.474 + 1.0 * 40 == 49.474000000000004.
You can follow that expression from the source: https://github.com/scala/scala/blob/v2.12.5/src/library/scala/collection/immutable/NumericRange.scala#L56
If you want exact values, you should use the Int or Long types.
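You can reproduce that expression in the REPL (a quick check, using the bounds from the question):
// NumericRange computes the k-th element as start + k * step,
// and here the final term is not exactly representable as a Double.
val last = 9.474 + 40 * 1.0
println(last)           // 49.474000000000004
println(last > 49.474)  // true: the surprising .last from the question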

You should go with BigDecimal:
(BigDecimal(9.474) to BigDecimal(49.474) by BigDecimal(1.0))
.map(_.toDouble)
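A quick REPL check of the endpoints (using the bounds from the question; note that in recent Scala versions BigDecimal(d: Double) goes through the decimal string representation of d, so the bounds are captured exactly):
val range = BigDecimal(9.474) to BigDecimal(49.474) by BigDecimal(1.0)
range.head                      // 9.474
range.last                      // 49.474: the upper bound is hit exactly
range.last > BigDecimal(49.474) // false, unlike the Double range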

Related

What is the difference between defining a vector using linspace and defining a vector using steps?

I am trying to learn the basics of MATLAB. I wanted to write a MATLAB script in which I define a vector x with a step d of size 2*pi/1000, and then plot two sine functions of x: the first with a frequency of 1, and the second with a frequency of 10.3. This is what I did:
d=(2*pi/1000);
x=-pi:d:pi;
first=sin(x);
second=sin(10.3*x);
plot(x,first,x,second);
My question: what is the difference between
x=linspace(-pi,pi,1000);
and
d=(2*pi/1000);
x=-pi:d:pi;
? I am asking because I got confused: I thought they were the same, but apparently there is something wrong with my assumption. Also, is there a more efficient way to write a sine function with a given frequency?
The main difference can be summarized as predefined size vs predefined step, and your example highlights it very well indeed (1000 elements vs 1001 elements).
The linspace function produces a fixed-length vector (the length being defined by the third input argument, which defaults to 100) whose lower and upper limits are set, respectively, by the first and the second input arguments. The correct step to use is computed internally by the function itself (step = (x2 - x1) / (n - 1)).
The colon operator defines a vector of elements whose values range between the specified lower and upper limits. The step, an optional parameter that defaults to 1, determines the length of the result: the length is the number of steps that can be taken from the lower limit without passing the upper one. On a side note, this MathWorks thread contains a very interesting discussion of the colon operator's behavior with respect to floating-point values.
Another difference, related to the first one, is that linspace always includes the upper limit value while the colon operator only contains it if the specified step allows it (0:5:14 = [0 5 10]).
As a general rule, I prefer to use the former when I want to produce a vector of a predefined length (pretty obvious, isn't it?), and the latter when I need to create a sequence whose length has only marginal relevance (or none at all).
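To make the contrast concrete outside MATLAB, here is a sketch of the two constructions in Scala (linspace and stepRange are illustrative names, not library functions):
// linspace-style: fixed length n; both endpoints pinned exactly,
// interior points spaced by step = (x2 - x1) / (n - 1).
def linspace(x1: Double, x2: Double, n: Int): IndexedSeq[Double] = {
  val step = (x2 - x1) / (n - 1)
  (0 until n).map {
    case 0               => x1  // pin the lower endpoint
    case i if i == n - 1 => x2  // pin the upper endpoint
    case i               => x1 + i * step
  }
}

// colon-style: fixed step d (assumed positive); the upper limit is kept
// only if the steps happen to land on or below it.
def stepRange(x1: Double, x2: Double, d: Double): IndexedSeq[Double] =
  Iterator.from(0).map(i => x1 + i * d).takeWhile(_ <= x2).toIndexedSeq

linspace(-math.Pi, math.Pi, 1000).length                 // 1000, endpoints exact
stepRange(-math.Pi, math.Pi, 2 * math.Pi / 1000).length  // 1001 or 1000, depending
                                                         // on rounding at the top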

Scala: large calculation losing value to zero/infinity

I'm trying to calculate a perplexity value for a language model, and the calculation uses a lot of large powers. I have tried converting my calculation to log space using BigDecimal, but I'm not having any luck.
var sum = 0.0
for (ngram <- testNGrams) {
  var prob = Math.log(lm.prob(ngram.last, ngram.slice(0, ngram.size - 1)))
  if (prob == 0.0) sum = sum
  else sum = sum + prob
}
Math.pow(Math.log(Math.exp(sum)), -1.0 / wordSize.toDouble)
How can I perform such a calculation in Scala without losing my large/small values to zero/Infinity? It seems like a trivial question but I haven't managed to do it.
In the above, you can assume that the method lm.prob returns correct probabilities between 0 and 1; this has been amply tested.
Write everything in terms of log probabilities, not probabilities.
For instance, things like log(exp(sum)) just warm up your CPU while throwing away useful information. Avoid!
If you must convert to actual probabilities, do so at the very last step you can.
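A minimal sketch of the log-space version (assuming, as stated in the question, that lm.prob returns a probability in (0, 1], and using the standard definition perplexity = exp(-(1/N) * sum of log-probabilities)):
// Accumulate log-probabilities; never materialize the raw product.
val logSum = testNGrams.iterator
  .map(ngram => math.log(lm.prob(ngram.last, ngram.slice(0, ngram.size - 1))))
  .sum

// Exponentiate only at the very last step, once the exponent is small.
val perplexity = math.exp(-logSum / wordSize.toDouble)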

MATLAB decimal places in array

Consider the following example:
Bathymetry = [0,4134066;
3,3817906;
6,3343666;
9,2978725;
12,2742092;
14,2584337;
16,2415355;
18,2228054;
20,2040753;
23,1761373;
26,1514085];
Depth = [0;1;2;3;5;8;10;11.6;15];
newDepth = min(Bathymetry(:,1)):0.1:max(Bathymetry(:,1));
From this I want to find which column of 'newDepth' corresponds to 'Depth'. For example:
dd = find(newDepth==Depth(1))
dd =
1
This shows that Depth == 0 is located in the first column of newDepth. But when I apply this to all of the entries of Depth:
for i = 1:length(Depth)
  dd(i) = find(newDepth == Depth(i));
end
I receive an error:
Improper assignment with rectangular empty matrix.
Initially I couldn't understand why, but by looking at the array for newDepth, especially column 117 where newDepth == 11.6, I noticed that the value isn't equal to 11.6 but to 11.600000000000001, and thus differs from Depth(8). How can I fix this? And why does MATLAB not just write the value as 11.6? Nowhere have I specified to include the .000000000000001.
This is because there is no exact representation of 0.1 in binary. Read the wiki for more background. In binary, representing 0.1 is something like trying to write out all the decimals of one-third:
1/3 == 0.333333333333333333...
it will never be exact, no matter how many 3's you add.
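You can make the stored value visible by printing more digits than the default display shows; any IEEE 754 double behaves this way (Scala shown here, but the stored values are the same in MATLAB):
println(f"${0.1}%.20f")                     // 0.10000000000000000555
// The error compounds under arithmetic: ten 0.1s do not sum to 1.0.
println(List.fill(10)(0.1).sum == 1.0)      // false
println(f"${List.fill(10)(0.1).sum}%.20f")  // 0.99999999999999988898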
For this (and many other) reasons, I'd suggest you do not use == (which is a very stringent demand), but rather use
for ii = 1:length(Depth)
  [~, dd(ii)] = min( abs(newDepth - Depth(ii)) );
end
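For what it's worth, the same nearest-neighbour lookup in Scala looks like this (a sketch using the question's values; note that Scala indices are 0-based, so the 11.6 entry lands at index 116 rather than MATLAB's 117):
// Index of the grid value closest to the target, instead of an exact == match.
def nearestIndex(grid: IndexedSeq[Double], target: Double): Int =
  grid.indices.minBy(i => math.abs(grid(i) - target))

val newDepth = (0 to 260).map(_ * 0.1)  // the 0:0.1:26 grid
val depth = IndexedSeq(0.0, 1.0, 2.0, 3.0, 5.0, 8.0, 10.0, 11.6, 15.0)
val dd = depth.map(nearestIndex(newDepth, _))
// Vector(0, 10, 20, 30, 50, 80, 100, 116, 150)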
This problem is to do with floating-point arithmetic, which is quite complicated. I recommend you google it and read a bit; there is plenty out there explaining it. Here is a good start: http://blogs.mathworks.com/loren/2006/08/23/a-glimpse-into-floating-point-accuracy/
To solve it in your case, I would suggest rounding:
newDepth = round(newDepth * 10) / 10
The 11.600000000000001 is because the number 11.6 is not exactly representable in binary floating-point notation. This is to do with the way the hardware works rather than any limitation of MATLAB.
You want to change your comparison to something like
dd(i) = find(abs(newDepth - Depth(i))<.0000001);

Test if a floating point number is an integer in Matlab

My Question - part 1: What is the best way to test if a floating point number is an "integer" (in Matlab)?
My current solution for part 1: Obviously isinteger is out, since it tests the type of an element rather than its value, so currently I solve the problem like this:
abs(round(X) - X) <= sqrt(eps(X))
But perhaps there is a more native Matlab method?
My Question - part 2: If my current solution really is the best way, then I was wondering if there is a general tolerance that is recommended? As you can see from above, I use sqrt(eps(X)), but I don't really have any good reason for this. Perhaps I should just use eps(X), or maybe 5 * eps(X)? Any suggestions would be most welcome.
An example: in MATLAB, sqrt(2)^2 == 2 returns false, but in practice we might want that logical condition to return true. One can achieve this using the method described above, since sqrt(2)^2 actually equals 2 + eps(2) (i.e. well within the tolerance of sqrt(eps(2))). But does this mean I should always use eps(X) as my tolerance, or is there good reason to use a larger tolerance, such as 5 * eps(X) or sqrt(eps(X))?
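The same effect is easy to reproduce outside MATLAB; in Scala, for instance, math.ulp plays the role of eps (a quick REPL check):
val x = math.sqrt(2.0) * math.sqrt(2.0)
println(x == 2.0)                            // false
println(x - 2.0)                             // 4.440892098500626E-16, one ulp of 2
println(math.abs(x - 2.0) <= math.ulp(2.0))  // true: within eps(2)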
UPDATE (2012-10-31): @FakeDIY pointed out that my question is partially a duplicate of this SO question (apologies, not sure how I missed it in my initial search). Given this, I'd like to emphasize the "tolerance" part of the question (which is not covered in that link): is eps(X) a sensible tolerance, or should I use something larger, like 5 * eps(X), and if so, why?
UPDATE (2012-11-01): Thanks everyone for the responses. I've +1'ed all three answers as I feel they all contribute meaningfully to various aspects of the question. I'm giving the answer tick to Eric Postpischil as that answer really nailed the tolerance part of the question well (and it has the most upvotes at this point in time).
No, there is no general tolerance that is recommended, and there cannot be.
The difference between a computed result and a mathematically ideal result is a function of the operations that produced the computed result. Because those operations are specific to each application, there is no general rule for testing any property of a computed result.
To design a proper test, you must determine what errors may have occurred during computation, determine bounds on the resulting error in the computed result, and test whether the computed result differs from the ideal result (perhaps the nearest integer) by less than those bounds. You must also decide whether those bounds are sufficiently small to satisfy your application’s requirements. (Using a relaxed test that accepts as an integer something that is not an integer decreases false negatives [incorrect rejections of a result as an integer where the ideal result would be an integer] but increases false positives [incorrect acceptances of a result as an integer where the ideal result would not be an integer].)
(Note that it can even be the case that testing as if the error bounds were zero produces false positives: it is possible for a computation to produce a result that is exactly an integer when the ideal result is not an integer, so any error tolerance, even zero, will falsely report that result as an integer. If this is unacceptable for your application, then, in such a case, the computations must be redesigned.)
It is not only not possible to state, without specific knowledge of the application, a numerical tolerance that may be used, it is impossible to state whether the tolerance should be absolute, should be relative to the computed value or to a target value, should be measured in ULPs (units of least precision), or should be set in some other manner. This is because errors may be introduced into computations in a variety of ways. For example, if there is a small relative error in a and a and b are close in value, then a-b has a large relative error. Additionally, if c is large, then (a-b)*c has a large absolute error.
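The subtraction-then-scaling example from the last paragraph is easy to demonstrate (a Scala sketch; the particular values are arbitrary illustrations):
// a carries a tiny relative error from rounding; the subtraction cancels the
// leading digits, so the difference has a large relative error.
val a = 1.0 + 1e-15
val b = 1.0
val diff = a - b
println(diff)         // 1.1102230246251565E-15, not the ideal 1e-15

// Scaling by a large c turns that relative error into a large absolute error.
println(diff * 1e15)  // ~1.11, instead of the ideal 1.0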
It's probably not the most efficient method, but I would use mod for this:
a = 15.0000000000;
b = mod(a,1.0)
c = 15.0000000001;
d = mod(c,1.0)
returns b = 0 and d = 1.0000e-010
There are a number of other alternatives suggested here:
How do I test for integers in MATLAB?
I like the idea of comparing (x == floor(x)) too.
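The mod idea carries over directly to Scala via the % operator (a sketch; % on Doubles inherits the same floating-point caveats, so a tolerance is still advisable):
def isWholeExact(x: Double): Boolean = x % 1.0 == 0.0

def isWholeApprox(x: Double, tol: Double = 1e-9): Boolean = {
  val r = (x % 1.0).abs
  r <= tol || (1.0 - r) <= tol  // catch remainders just below 1, e.g. for 14.9999999999
}

isWholeExact(15.0)           // true
isWholeExact(15.0000000001)  // false: mod leaves ~1e-10
isWholeApprox(15.0000000001) // true, within the 1e-9 tolerance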
1) I have historically used your method with a simple tolerance, eps(X). The mod methods interested me though, so I benchmarked a couple using Steve Eddins' timeit function.
f = @() abs(X - round(X)) <= eps(X);
g = @() X == round(X);
h = @() ~mod(X,1);
For single values, like X = 1.0, yours appears to be the fastest:
timeit(f) = 7.3635e-006
timeit(g) = 9.9677e-006
timeit(h) = 9.9214e-006
For vectors though, like X = 1:0.01:100, the other methods are faster (though round still beats mod):
timeit(f) = 0.00076636
timeit(g) = 0.00028182
timeit(h) = 0.00040539
2) The error bound is really problem dependent. Other answers cover this much better than I am able to.

Why a fractional iteration step compiles into different JavaScript than an integer step

I was wondering about the slightly different JavaScript that range comprehensions in CoffeeScript compile into. Is there any reason for the following differences in the generated JavaScript?
Iterating a range by integer step
numbers = (i for i in [start..end] by 2)
compiles into:
for (i = start; i <= end; i += 2) {
  _results.push(i);
}
But iterating by a fractional step
numbers = (i for i in [start..end] by 1/2)
generates a bit more complicated JavaScript:
for (i = start, _ref = 1 / 2; start <= end ? i <= end : i >= end; i += _ref) {
  _results.push(i);
}
So why this additional start <= end condition?
You'll get similarly elaborate code if you just write numbers = (i for i in [start..end]). This is because CoffeeScript doesn't know which direction the range goes when the beginning or end is a variable. The compiler has a special optimization that outputs simpler code when a constant step is provided, but unfortunately 1/2 counts as an expression rather than a constant.
CoffeeScript doesn't know at compile time what the expression 1/2 evaluates to. A step expression could just as well be Math.random() - .5, whose sign would depend on the particular run of the script.
Therefore, it's impossible for CoffeeScript to know whether the step is negative or positive, so it keys the loop condition on the relative positioning of start and end rather than on the sign of the step.
This is constant vs. expression, rather than integer vs. fraction. When the step is a constant (such as 2), CoffeeScript knows whether step is positive at compile time and outputs the correct code for that. When the step is an expression (such as 1/2), it needs to determine whether it is positive at runtime.
Unfortunately, CoffeeScript appears to treat fractional numbers as expressions regardless of how they are written (0.5 or 1/2), so there's no simple way to avoid this problem.