Averaging a very long List[Double] without getting Infinity in Scala

I have a very long list of doubles that I need to average, but I can't sum them within the Double data type, so when I go to divide I still get Infinity.
def applyToMap(list: Map[String, List[Map[String, String]]], f: Map[String, String] => Double): Map[String, Double] = {
  val mSLD = list.mapValues(lm => lm.map(f))
  mSLD.mapValues(ld => ld.sum / ld.size)
}
This leaves me with a Map[String, Double] in which every value is Infinity.

You could use fold to compute an average as you go. Rather than doing sum / size you should count your way through the items with n, and for each one adjust your accumulator with acc = (acc * n/(n+1)) + (item * 1/(n+1))
Here’s the general scala code:
val average = seq.foldLeft((0.0, 1)) ((acc, i) => ((acc._1 + (i - acc._1) / acc._2), acc._2 + 1))._1
Taken from here.
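As a quick sanity check with a small made-up seq, the fold reproduces the ordinary mean without ever building the full sum:
val seq = Seq(1.0, 2.0, 3.0)
val average = seq.foldLeft((0.0, 1)) ((acc, i) => ((acc._1 + (i - acc._1) / acc._2), acc._2 + 1))._1
// average == 2.0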
You'd probably still have precision trouble if the list is really long, as you'd be dividing by a progressively larger number. To be really safe you should break the list into sublists, and compute the average of the sublist averages. Make sure the sublists are all the same length, though, or do a weighted average based on their sizes; a sketch follows below.
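A minimal sketch of that idea (chunkedAverage and chunkSize are my own names, not from the question): average each sublist with its elements pre-divided by the sublist size, then fold the sublist averages into a running weighted mean, so intermediate values stay on the order of the inputs themselves:
def chunkedAverage(xs: List[Double], chunkSize: Int = 1000): Double =
  xs.grouped(chunkSize).foldLeft((0.0, 0L)) { case ((mean, n), chunk) =>
    val m = chunk.size
    val chunkMean = chunk.map(_ / m).sum          // divide before summing: partial sums stay bounded
    val newN = n + m
    // weighted combination of the old mean and the new chunk's mean
    (mean * (n.toDouble / newN) + chunkMean * (m.toDouble / newN), newN)
  }._1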

Interested in implementing gandaliter's solution, I came up with the following (since I'm not on the friendliest terms with Doubles, I tried to find an easy-to-follow numeric experiment using Bytes). First, I generate 10 Bytes in the range 75 to 124, so every value stays below Byte.MaxValue but close to it, with an average of roughly 100 for easy checking:
val rnd = util.Random
val is = (1 to 10).map(i => (rnd.nextInt(50) + 75).toByte)
// = Vector(99, 122, 99, 105, 102, 104, 122, 99, 87, 114)
The first algorithm multiplies before dividing (which increases the risk of exceeding Byte.MaxValue); the second divides before multiplying, which leads to rounding errors.
def slidingAvg0 (sofar: Byte, x: Byte, cnt: Byte): (Byte, Byte) = {
  val acc: Byte = ((sofar * cnt).toByte / (cnt + 1).toByte + (x / (cnt + 1).toByte).toByte).toByte
  println(acc)
  (acc, (cnt + 1).toByte)
}
def slidingAvg1 (sofar: Byte, x: Byte, cnt: Byte): (Byte, Byte) = {
  val acc: Byte = (((sofar / (cnt + 1).toByte).toByte * cnt).toByte + (x / (cnt + 1).toByte).toByte).toByte
  println(acc)
  (acc, (cnt + 1).toByte)
}
This is the foldLeft in Scala (written with the /: operator):
scala> ((is.head, 1.toByte) /: is.tail) { case ((sofar, cnt), x) => slidingAvg0 (sofar, x, cnt)}
110
21
41
2
18
32
8
16
0
scala> ((is.head, 1.toByte) /: is.tail) { case ((sofar, cnt), x) => slidingAvg1 (sofar, x, cnt)}
110
105
104
100
97
95
89
81
83
Since 10 values are far too few to rely on the average being exactly 100, let's look at the sum as an Int:
is.map (_.toInt).sum
res65: Int = 1053
The drift is significant: the true average is about 105, but the two runs end at 0 and 83 respectively.
Whether these findings transfer from Byte/Int to Double is another question. I'm also not 100% confident that my parentheses mirror the evaluation order, but as far as I know, multiplication and division have equal precedence and associate left to right.
So the original formulas were:
acc = (acc * n/(n+1)) + (item * 1/(n+1))
acc = (acc /(n+1) *n) + (item/(n+1))
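Whether the drift carries over to Double can be probed directly; here is a rough experiment of my own (not from the answer above), running both update orders over random data and comparing them with the plain mean:
val rnd = util.Random
val xs = Vector.fill(10000)(rnd.nextDouble() * 100)
// order 1: multiply before dividing;  order 2: divide before multiplying
val avg1 = xs.tail.zipWithIndex.foldLeft(xs.head) { case (acc, (x, i)) =>
  val n = i + 1
  acc * n / (n + 1) + x / (n + 1)
}
val avg2 = xs.tail.zipWithIndex.foldLeft(xs.head) { case (acc, (x, i)) =>
  val n = i + 1
  acc / (n + 1) * n + x / (n + 1)
}
val exact = xs.sum / xs.size
// in a typical run, both orderings land within roundoff distance of exact, unlike the Byte experiment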

If I understand the OP correctly, the amount of data doesn't seem to be the problem, otherwise it wouldn't fit into memory in the first place.
So I'll concentrate on the data types only.
Summary
My suggestion is to go with BigDecimal instead of Double, especially if you are adding reasonably large values.
The only significant drawbacks are performance and a small amount of cluttered syntax.
Alternatively, you can rescale your input up front, but this degrades precision and requires special care in post-processing.
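As a rough sketch of that suggestion applied to the question's applyToMap (assuming the per-record function f is kept as-is and its Double results are merely wrapped):
def applyToMap(list: Map[String, List[Map[String, String]]],
               f: Map[String, String] => Double): Map[String, Double] = {
  val mSLD = list.mapValues(lm => lm.map(m => BigDecimal(f(m))))
  // sum and divide as BigDecimal; convert back to Double only at the very end,
  // which is safe as long as the average itself fits into a Double
  mSLD.mapValues(ld => (ld.sum / ld.size).toDouble)
}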
Double breaks at some scale
scala> :paste
// Entering paste mode (ctrl-D to finish)
val res0 = (Double.MaxValue + 1) == Double.MaxValue
val res1 = Double.MaxValue/10 == Double.MaxValue
val res2 = List.fill(11)(Double.MaxValue/10).sum
val res3 = List.fill(10)(Double.MaxValue/10).sum == Double.MaxValue
val res4 = (List.fill(10)(Double.MaxValue/10).sum + 1) == Double.MaxValue
// Exiting paste mode, now interpreting.
res0: Boolean = true
res1: Boolean = false
res2: Double = Infinity
res3: Boolean = true
res4: Boolean = true
Take a look at these simple Double arithmetic examples in your Scala REPL:
Double.MaxValue + 1 numerically cancels out: nothing gets added, so the result is still Double.MaxValue
Double.MaxValue/10 behaves as expected and does not equal Double.MaxValue
Adding Double.MaxValue/10 eleven times overflows to Infinity
Adding Double.MaxValue/10 ten times doesn't break the arithmetic and evaluates to Double.MaxValue again
The summed Double.MaxValue/10 then behaves exactly like Double.MaxValue itself: adding 1 is cancelled out again
BigDecimal works on all scales but is slower
scala> :paste
// Entering paste mode (ctrl-D to finish)
val res0 = (BigDecimal(Double.MaxValue) + 1) == BigDecimal(Double.MaxValue)
val res1 = BigDecimal(Double.MaxValue)/10 == BigDecimal(Double.MaxValue)
val res2 = List.fill(11)(BigDecimal(Double.MaxValue)/10).sum
val res3 = List.fill(10)(BigDecimal(Double.MaxValue)/10).sum == BigDecimal(Double.MaxValue)
val res4 = (List.fill(10)(BigDecimal(Double.MaxValue)/10).sum + 1) == BigDecimal(Double.MaxValue)
// Exiting paste mode, now interpreting.
res0: Boolean = false
res1: Boolean = false
res2: scala.math.BigDecimal = 197746244834854727000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
res3: Boolean = true
res4: Boolean = false
Now compare these results with the ones above from Double.
As you can see everything works as expected.
Rescaling reduces precision and can be tedious
When working at astronomic or microscopic scales, numbers are likely to overflow or underflow quickly.
It can then be appropriate to work in units other than the base units to compensate, e.g. km instead of m.
However, you then have to take special care when multiplying those numbers in formulas.
10 km * 10 km = 100 km², but converted back to base units that is
10,000 m * 10,000 m = 100,000,000 m², not 100 m²,
so the unit conversion factor gets squared in the product.
So keep this in mind.
Another trap is dealing with very diverse datasets, where numbers occur at all kinds of scales and in all kinds of quantities.
When scaling down your input domain you will lose precision, and small numbers may be cancelled out entirely.
In some scenarios these numbers don't need to be considered because of their small individual impact.
However, when such small numbers occur very frequently and are ignored every time, you introduce a large error in the end.
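A small demonstration of that absorption effect (the magnitudes are arbitrary): near 1e16 the spacing between adjacent Doubles is about 2, so a contribution of 0.001 is rounded away every single time, and a million of them add up to nothing instead of 1000.
val big = 1e16
val naive = (1 to 1000000).foldLeft(big)((acc, _) => acc + 0.001)
naive == big   // true: every small contribution was cancelled out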
So keep this in mind as well ;)
Hope this helps

Related

scala can't make an add on a long

I'm not able to do an addition on the Long type correctly. Scala (or the processor) doesn't seem to handle the sign correctly:
scala> var i="-1014570924054025346".toLong
i: Long = -1014570924054025346
scala> i=i+92233720368547758L
i: Long = -922337203685477588
scala> var i=9223372036854775807L
i: Long = 9223372036854775807
scala> i=i+5
i: Long = -9223372036854775804
The first test, where the negative number doesn't cross over into a positive one, is a problem for me.
I have not fully understood the question, but for the first example you get the expected result. What happens in the second example is that the Long value happens to be the maximum value a Long can hold (i.e. Long.MaxValue), so when you add another positive number it overflows:
scala> Long.MaxValue
res4: Long = 9223372036854775807
scala> Long.MaxValue + 1
res7: Long = -9223372036854775808 // which is Long.MinValue
scala> Long.MinValue + 4
res8: Long = -9223372036854775804 // which is the result that you get
In other words:
9223372036854775807L + 5
is equivalent to:
Long.MaxValue + 5
which is equivalent to:
Long.MinValue + 4 // because (Long.MaxValue + 1) = Long.MinValue
which equals -9223372036854775804L
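You can check that chain of equalities directly:
(Long.MaxValue + 5) == (Long.MinValue + 4)   // true: the addition simply wraps around
Long.MaxValue + 5                            // -9223372036854775804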
If you really need to use such big numbers, you might try using BigInt
scala> val x = BigInt(Long.MaxValue)
x: scala.math.BigInt = 9223372036854775807
scala> x + 1
res6: scala.math.BigInt = 9223372036854775808
scala> x + 5
res11: scala.math.BigInt = 9223372036854775812
scala> x + 10
res8: scala.math.BigInt = 9223372036854775817
scala> x * 1000
res10: scala.math.BigInt = 9223372036854775807000
scala> x * x
res9: scala.math.BigInt = 85070591730234615847396907784232501249
scala> x * x * x * x
res13: scala.math.BigInt = 7237005577332262210834635695349653859421902880380109739573089701262786560001
scala>
The documentation on BigInt is rather, err, small. However, I believe that it is basically an arbitrary-precision integer (it can support as many digits as you need). Having said that, there will probably be a limit at some point. There is a comment on BigDecimal, which has more documentation, that at about 4,934 digits there might be some deviation between BigDecimal and BigInt.
I will leave it to someone else to work out whether or not x ^ 4 is the value shown above.
Oh, I almost forgot your negative number test. I aligned the addition with the initialisation to make it easier to see that the result appears to be correct:
scala> val x = BigInt("-1014570924054025346")
x: scala.math.BigInt = -1014570924054025346
scala> x + 92233720368547758L
res15: scala.math.BigInt = -922337203685477588
scala>
As for Ints, Longs and similar data types, they are limited in size by the number of bits they occupy. On the JVM, Ints are 32 bits and Longs are 64 bits.
It is easier to visualise when you look at them in hexadecimal. A signed Byte (at 8 bits) has a maximum positive value of 0x7F (127). When you add one to it, you get 0x80 (-128). This is because we use the "Most Significant Bit" as an indicator of whether the number is positive or negative.
If the same byte were interpreted as unsigned, then 0x7F (127) would still become 0x80 when 1 is added, but since we now read it as unsigned that bit pattern means 128. We can keep adding one until we reach 0xFF (255), at which point adding another 1 brings us back to 0x00, which is of course 0.
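The same wrap-around is easy to reproduce in Scala with Byte (the JVM has no unsigned Byte, but masking with 0xFF shows the unsigned reading of the same bit pattern):
val b: Byte = 0x7F            // 127, i.e. Byte.MaxValue
val wrapped = (b + 1).toByte  // the + widens to Int, so we truncate back to Byte
// wrapped == -128, bit pattern 0x80: the most significant bit is now set
val unsigned = wrapped & 0xFF // == 128, the unsigned interpretation of 0x80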
Here are some references that explain this in much more detail:
Wikipedia - Twos complement
Cornell University - what is twos complement
Stack Overflow - what is 2s complement

find out if a number is a good number in scala

Hi, I am new to Scala and the functional programming methodology. I want to pass a number to my function and check whether it is a good number or not.
A number is a good number if every one of its digits is larger than the sum of the digits to its right.
For example:
9620  is good as (2 > 0, 6 > 2+0, 9 > 6+2+0)
The steps I am using to solve this are:
1. converting the number to a string and reversing it
2. storing all digits of the reversed number as elements of a list
3. looping with i from 1 to the length of the number minus 1
4. calculating the sum of the first i digits as num2
5. extracting the i-th digit from the list as digit1, which is the digit just to the left of the first i digits whose sum we calculated (the list is zero-indexed)
6. comparing the results of steps 4 and 5: if num2 is greater than digit1, we break out of the loop and report that it is not a good number
Please find my code below:
import scala.util.control.Breaks._

val num1 = 9521.toString.reverse
val list1 = num1.map(_.asDigit).toList
breakable {
  for (i <- 1 to num1.length - 1) {
    val num2 = num1.take(i).map(_.asDigit).sum
    val digit1 = list1(i)
    if (num2 > digit1) {
      print("number is not a good number")
      break
    }
  }
}
I know this is not the most optimized way to solve this problem. I am also looking for a way to code this using tail recursion, where I pass two numbers and get all the good numbers falling between them.
Can this be done in a more optimized way?
Thanks in advance!
No String conversions required.
val n = 9620
val isGood = Stream.iterate(n)(_ / 10)
  .takeWhile(_ > 0)
  .map(_ % 10)
  .foldLeft((true, -1)) { case ((bool, sum), digit) =>
    // the running sum starts at -1 so the rightmost digit passes its check unconditionally
    (bool && digit > sum, sum + digit)
  }._1
Here is a purely numeric version using a recursive function.
def isGood(n: Int): Boolean = {
  @annotation.tailrec
  def loop(n: Int, sum: Int): Boolean =
    (n == 0) || (n % 10 > sum && loop(n / 10, sum + n % 10))
  loop(n / 10, n % 10)
}
This should compile into an efficient loop.
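Applied to the numbers from the question, it gives the expected results (annotated by hand, not copied from a REPL):
isGood(9620)   // true:  2 > 0, 6 > 2 + 0, 9 > 6 + 2 + 0
isGood(9521)   // true
isGood(9600)   // false: the tens digit 0 is not greater than the units digit 0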
Using this function (this is an efficient approach because forall does not necessarily traverse the entire list of digits: it stops as soon as it finds the first failing condition, i.e. as soon as v(i) > v.drop(i+1).sum becomes false, while traversing the vector v from left to right):
def isGood(n: Int) = {
  val v1 = n.toString.map(_.asDigit)
  val v = if (v1.last != 0) v1 else v1.dropRight(1)
  (0 to v.size - 1).forall(i => v(i) > v.drop(i + 1).sum)
}
If we want to find good numbers in an interval of integers ranging from n1 to n2 we can use this function:
def goodNums(n1:Int,n2:Int) = (n1 to n2).filter(isGood(_))
In Scala REPL:
scala> isGood(9620)
res51: Boolean = true
scala> isGood(9600)
res52: Boolean = false
scala> isGood(9641)
res53: Boolean = false
scala> isGood(9521)
res54: Boolean = true
scala> goodNums(412,534)
res66: scala.collection.immutable.IndexedSeq[Int] = Vector(420, 421, 430, 510, 520, 521, 530, 531)
scala> goodNums(3412,5334)
res67: scala.collection.immutable.IndexedSeq[Int] = Vector(4210, 5210, 5310)
This is a more functional way. pairs is a sequence of tuples pairing each digit with the sum of the digits that follow it. It is easy to create these tuples with the drop, take and slice (a combination of drop and take) methods.
Finally, I can express my condition readably with the forall method.
val n = 9620
val str = n.toString
val pairs = for { x <- 1 until str.length } yield (str.slice(x - 1, x).toInt, str.drop(x).map(_.asDigit).sum)
pairs.forall { case (a, b) => a > b }
If you want to be functional and expressive, avoid using break. If you need to check a condition for each element, it is a good idea to move the problem onto collections so you can use forall.
It doesn't matter much here, but if you care about performance (i.e. you don't want to build the entire pairs collection when the condition already fails for the first element), you can change the for comprehension's source from a Range to a Stream:
(1 until str.length).toStream
Functional style tends to prefer monadic type things, such as maps and reduces. To make this look functional and clear, I'd do something like:
def isGood(value: Int) =
  value.toString.reverse.map(digit => Some(digit.asDigit))
    .reduceLeft[Option[Int]] {
      case (sum, Some(digit)) => sum.collectFirst { case s if s < digit => s + digit }
    }.isDefined
Instead of using tail recursion to calculate this for ranges, just generate the range and then filter over it:
def goodInRange(low: Int, high: Int) = (low to high).filter(isGood(_))

How can I check whether a Double value overflows?

I want to check whether adding some value to a Double value exceeds the Double limits or not. I tried this:
object Hello {
  def main(args: Array[String]): Unit = {
    var t = Double.MaxValue
    var t2 = t + 100000000
    if (t2 > 0) {
      println("t2 > 0: " + t2)
    } else
      println("t2 <= 0: " + t2)
  }
}
The output I get is
t2 > 0: 1.7976931348623157E308
What I actually want is to sum billions of values and check whether or not the running sum overflows at any time.
The first part of your question seems to stem from a misunderstanding of floating-point numbers.
IEEE-754 floating-point numbers do not wrap around like some finite-size integers would. Instead, they "saturate" at Double.PositiveInfinity, which represents mathematical (positive) infinity. Double.MaxValue is the largest finite positive value of doubles. The next Double after that is Double.PositiveInfinity. Adding any double (other than Double.NegativeInfinity or NaNs) to Double.PositiveInfinity yields Double.PositiveInfinity.
scala> Double.PositiveInfinity + 1
res0: Double = Infinity
scala> Double.PositiveInfinity - 1
res1: Double = Infinity
scala> Double.PositiveInfinity + Double.NaN
res2: Double = NaN
scala> Double.PositiveInfinity + Double.NegativeInfinity
res3: Double = NaN
Floating-point numbers get fewer and farther between as their magnitude grows. Double.MaxValue + 100000000 evaluates to Double.MaxValue as a result of roundoff error: Double.MaxValue is so much larger than 100000000 that the former "swallows up" the latter if you try to add them. You would need to add a Double of the order of math.pow(2, -52) * Double.MaxValue to Double.MaxValue in order to get Double.PositiveInfinity:
scala> math.pow(2,-52) * Double.MaxValue + Double.MaxValue
res4: Double = Infinity
Now, you write
What I actually want is to sum billions of values and check whether or not the running sum overflows at any time.
One possible approach is to define a function that adds the numbers recursively but stops if the running sum is an infinity or a NaN, and wraps the result in an Either[String, Double]:
import scala.collection.immutable

def sumToEither(xs: immutable.Seq[Double]): Either[String, Double] = {
  @annotation.tailrec
  def go(ys: immutable.Seq[Double], acc: Double): Double =
    if (ys.isEmpty || acc.isInfinite || acc.isNaN) acc
    else go(ys.tail, ys.head + acc)

  go(xs, 0.0) match {
    case x if x.isInfinite => Left("overflow")
    case x if x.isNaN      => Left("NaN")
    case x                 => Right(x)
  }
}
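For example, with a couple of made-up inputs:
sumToEither(Vector(1.0, 2.0, 3.0))                    // Right(6.0)
sumToEither(Vector(Double.MaxValue, Double.MaxValue)) // Left("overflow")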
In response to your question in the comments:
Actually, I want to get the total of billions of values and check if the total overflows anytime or not. Could you please tell a way to check that?
If the total overflows, the result will be either an infinity (either positive or negative), or NaN (if at some point you have added a positive and negative infinity): the easiest way is to check total.isInfinity || total.isNaN.

Why does scala return an out of range value in this modulo operation?

This is a piece of code to generate random Long values within a given range, simplified for clarity:
def getLong(min: Long, max: Long): Long = {
  if (min > max) {
    throw new IncorrectBoundsException
  }
  val rangeSize = (max - min + 1L)
  val randValue = math.abs(Random.nextLong())
  val result = (randValue % (rangeSize)) + min
  result
}
I know the results of this aren't uniform and this wouldn't work correctly for some values of min and max, but that's beside the point.
In the tests it turned out, that the following assertion isn't always true:
getLong(-1L, 1L) >= -1L
More specifically the returned value is -3. How is that even possible?
As it turns out, math.abs(x: Long): Long isn't guaranteed to always return non-negative values! There is no Long value that could represent math.abs(Long.MinValue), so instead of throwing an exception, math.abs returns Long.MinValue:
scala> Long.MinValue
res27: Long = -9223372036854775808
scala> math.abs(Long.MinValue)
res28: Long = -9223372036854775808
scala> math.abs(Long.MinValue) % 3
res29: Long = -2
scala> math.abs(Long.MinValue) % 3 + (-1)
res30: Long = -3
Which is, in my opinion, a very good example of why one should be using ScalaCheck to test at least parts of their codebase.
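For reference, one way to repair getLong without math.abs is to map the raw Long into the range with a floor modulus. This is only a sketch: it uses require instead of the original IncorrectBoundsException, and it still assumes rangeSize itself doesn't overflow.
import scala.util.Random

def getLong(min: Long, max: Long): Long = {
  require(min <= max, "incorrect bounds")
  val rangeSize = max - min + 1L
  // Math.floorMod always returns a value in [0, rangeSize), even for Long.MinValue
  Math.floorMod(Random.nextLong(), rangeSize) + min
}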

Why stream fold operation throws Out of memory exception?

I have the following simple code:
def fib(i:Long,j:Long):Stream[Long] = i #:: fib(j, i+j)
(0l /: fib(1,1).take(10000000)) (_+_)
And it throws an OutOfMemoryError.
I cannot understand why, because I think all the parts use constant memory, i.e. lazily evaluated streams and foldLeft...
This code doesn't work either:
fib(1,1).take(10000000).sum (or max, min, etc.)
How to correctly implement infinite streams and do iterative operations upon it?
Scala version: 2.9.0
Also, the Scala docs say that the foldLeft operation is memory-safe for streams:
/** Stream specialization of foldLeft which allows GC to collect
  * along the way.
  */
@tailrec
override final def foldLeft[B](z: B)(op: (B, A) => B): B = {
  if (this.isEmpty) z
  else tail.foldLeft(op(z, head))(op)
}
EDIT:
The implementation with iterators is still not useful, since it throws a StackOverflowError:
def fib(i:Long,j:Long): Iterator[Long] = Iterator(i) ++ fib(j, i + j)
How to define correctly infinite stream/iterator in Scala?
EDIT2:
I don't care about the integer overflow; I just want to understand how to create an infinite stream/iterator etc. in Scala without side effects.
The reason to use Stream instead of Iterator is so that you don't have to calculate all the small terms in the series over again. But this means that you need to store ten million stream nodes. These are pretty large, unfortunately, so that could be enough to overflow the default memory. The only realistic way to overcome this is to start with more memory (e.g. scala -J-Xmx2G). (Also, note that you're going to overflow Long by an enormous margin; the Fibonacci series increases pretty quickly.)
P.S. The iterator implementation I have in mind is completely different; you don't build it out of concatenated singleton Iterators:
def fib(i: Long, j: Long) = Iterator.iterate((i,j)){ case (a,b) => (b,a+b) }.map(_._1)
Now when you fold, past results can be discarded.
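With that definition the original fold runs in constant memory, since the iterator is consumed as it is folded (although, as noted above, the Long sum itself overflows long before 10 million terms):
fib(1, 1).take(10000000).foldLeft(0L)(_ + _)   // no OutOfMemoryError, but the Long value overflows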
The OutOfMemoryError happens independently of the fact that you use Stream. As Rex Kerr mentioned above, Stream, unlike Iterator, stores everything in memory. The difference from List is that the elements of a Stream are calculated lazily, but once you reach 10000000 of them, there will be 10000000 elements, just like in a List.
Try with new Array[Int](10000000); you will have the same problem.
To calculate the Fibonacci numbers as above you may want to use a different approach. You can take into account the fact that you only ever need to keep two numbers, rather than all the Fibonacci numbers discovered so far.
For example:
scala> def fib(i:Long,j:Long): Iterator[Long] = Iterator(i) ++ fib(j, i + j)
fib: (i: Long,j: Long)Iterator[Long]
And to get, for example, the index of the first fibonacci number exceeding 1000000:
scala> fib(1, 1).indexWhere(_ > 1000000)
res12: Int = 30
Edit: I added the following lines to cope with the StackOverflowError.
If you really want to work with the 1 millionth Fibonacci number, the iterator definition above will not work either, because of a StackOverflowError. The following is the best I have in mind at the moment:
class FibIterator extends Iterator[BigDecimal] {
  var i: BigDecimal = 1
  var j: BigDecimal = 1
  def next = {
    val temp = i
    i = i + j
    j = temp
    j
  }
  def hasNext = true
}
scala> new FibIterator().take(1000000).foldLeft(0:BigDecimal)(_ + _)
res49: BigDecimal = 82742358764415552005488531917024390424162251704439978804028473661823057748584031
0652444660067860068576582339667553466723534958196114093963106431270812950808725232290398073106383520
9370070837993419439389400053162345760603732435980206131237515815087375786729469542122086546698588361
1918333940290120089979292470743729680266332315132001038214604422938050077278662240891771323175496710
6543809955073045938575199742538064756142664237279428808177636434609546136862690895665103636058513818
5599492335097606599062280930533577747023889877591518250849190138449610994983754112730003192861138966
1418736269315695488126272680440194742866966916767696600932919528743675517065891097024715258730309025
7920682881137637647091134870921415447854373518256370737719553266719856028732647721347048627996967...
@yura's problem:
def fib(i:Long,j:Long):Stream[Long] = i #:: fib(j, i+j)
(0l /: fib(1,1).take(10000000)) (_+_)
besides using a Long which can't possibly hold the Fibonacci of 10,000,000, it does work. That is, if the foldLeft is written as:
fib(1,1).take(10000000).foldLeft(0L)(_+_)
Looking at the Stream.scala source, foldLeft() is clearly designed to let the garbage collector reclaim the stream as it goes, but /: is not overridden there.
The other answers alluded to another problem: the 10 millionth Fibonacci number is enormous, and if BigInt is used, instead of just overflowing like a Long would, absolutely enormous numbers are added to each other over and over again.
Since Stream.foldLeft is optimized for GC it does look like the way to solve for really big Fibonacci numbers, rather than using a zip or tail recursion.
// Fibonacci using BigInt
def fib(i:BigInt,j:BigInt):Stream[BigInt] = i #:: fib(j, i+j)
fib(1,0).take(10000000).foldLeft(BigInt("0"))(_+_)
Results of the above code: 10,000,000 is an 8-figure number. How many figures are in fib(10000000)? 2,089,877
fib(1,1).take(10000000) is the "this" of the method /:; the JVM is likely to consider that reference live for as long as the method runs, even though in this particular case it could get rid of it.
So you keep a reference to the head of the stream the whole time, and hence to the whole stream as you build it up to 10M elements.
You could just use recursion, which is about as simple:
def fibSum(terms: Int, i: Long = 1, j: Long = 1, total: Long = 2): Long = {
  if (terms == 2) total
  else fibSum(terms - 1, j, i + j, total + i + j)
}
With this, you can "fold" a billion elements in only a couple of seconds, but as Rex points out, summing the Fibonacci sequence overflows Long very quickly.
If you really want to know the answer to your original problem and don't mind sacrificing some accuracy, you could do this:
def fibSum(terms: Int, i: Double = 1, j: Double = 1, tot: Double = 2,
           exp: Int = 0): String = {
  if (terms == 2) "%.6f".format(tot) + " E+" + exp
  else {
    val (i1, j1, tot1, exp1) =
      if (tot + i + j > 10) (i / 10, j / 10, tot / 10, exp + 1)
      else (i, j, tot, exp)
    fibSum(terms - 1, j1, i1 + j1, tot1 + i1 + j1, exp1)
  }
}
scala> fibSum(10000000)
res54: String = 2.957945 E+2089876