I have a very long list of doubles that I need to average, but their sum overflows the Double data type, so when I go to divide I still get Infinity.
def applyToMap(list: Map[String, List[Map[String, String]]], f: Map[String, String]=>Double): Map[String,Double]={
val mSLD = list.mapValues(lm=>lm.map(f))
mSLD.mapValues(ld=> ld.sum/ld.size)
}
This leaves me with a Map[String, Double] in which every entry is Key -> Infinity.
You could use fold to compute an average as you go. Rather than doing sum / size you should count your way through the items with n, and for each one adjust your accumulator with acc = (acc * n/(n+1)) + (item * 1/(n+1))
Here’s the general scala code:
val average = seq.foldLeft((0.0, 1)) ((acc, i) => ((acc._1 + (i - acc._1) / acc._2), acc._2 + 1))._1
Taken from here.
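Applied to the question's applyToMap, that fold might look like the sketch below. The helper name runningMean is an assumption and not part of the original code, and an empty list yields 0.0 here rather than NaN.

```scala
// Running mean: the accumulator holds the mean so far, so the full sum is never formed.
// runningMean is a hypothetical helper name; an empty input returns 0.0.
def runningMean(xs: Seq[Double]): Double =
  xs.foldLeft((0.0, 1)) { case ((mean, n), x) => (mean + (x - mean) / n, n + 1) }._1

// Same signature as in the question, with sum / size replaced by the running mean.
def applyToMap(data: Map[String, List[Map[String, String]]],
               f: Map[String, String] => Double): Map[String, Double] =
  data.map { case (key, rows) => key -> runningMean(rows.map(f)) }
```

One caveat: x - mean can itself overflow if the values sit at opposite extremes of the Double range, so this is a sketch for same-signed large values rather than a general guarantee.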
You’d probably still have precision difficulty if the list is really long, as you’d be dividing by an ever larger number. To be really safe you should break the list into sublists, and compute the average of the averages of the sublists. Make sure the sublists are all the same length though, or do a weighted average based on their size.
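The sublist idea could be sketched roughly as follows. The name chunkedMean and the default chunk size of 1024 are assumptions chosen for illustration. Each chunk mean is weighted by its share of the total, so a shorter last chunk is handled correctly.

```scala
// Hypothetical helper: weighted average of per-chunk running means.
def chunkedMean(xs: Seq[Double], chunkSize: Int = 1024): Double = {
  require(xs.nonEmpty && chunkSize > 0)
  // Running mean of one chunk, as in the fold above.
  def meanOf(chunk: Seq[Double]): Double =
    chunk.foldLeft((0.0, 1)) { case ((mean, n), x) => (mean + (x - mean) / n, n + 1) }._1
  val total = xs.size.toDouble
  // Weight each chunk mean by chunkLength / totalLength, then add the weighted means.
  xs.grouped(chunkSize).map(chunk => meanOf(chunk) * (chunk.size / total)).sum
}
```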
Interested in implementing gandaliters solution, I came up with the following. (Since I'm not that familiar with Doubles, I tried to find an easy-to-follow numeric sequence with Bytes.) First, I generate 10 Bytes in the range 75 to 124, so that every value stays below Byte.MaxValue (127) and the average is around 100, for simple control:
val rnd = util.Random
val is=(1 to 10).map (i => (rnd.nextInt (50)+75).toByte)
// = Vector(99, 122, 99, 105, 102, 104, 122, 99, 87, 114)
The 1st algorithm multiplies before dividing (which increases the danger of exceeding Byte.MaxValue); the 2nd divides before multiplying, which leads to rounding errors.
def slidingAvg0 (sofar: Byte, x: Byte, cnt: Byte): (Byte, Byte) = {
val acc : Byte = ((sofar * cnt).toByte / (cnt + 1).toByte + (x/(cnt + 1).toByte).toByte).toByte
println (acc)
(acc.toByte, (cnt + 1).toByte)
}
def slidingAvg1 (sofar: Byte, x: Byte, cnt: Byte): (Byte, Byte) = {
val acc : Byte = (((sofar / (cnt + 1).toByte).toByte * cnt).toByte + (x/(cnt + 1).toByte).toByte).toByte
println (acc)
(acc.toByte, (cnt + 1).toByte)
}
This is foldLeft in scala:
((is.head, 1.toByte) /: is.tail) { case ((sofar, cnt), x) => slidingAvg0 (sofar, x, cnt)}
110
21
41
2
18
32
8
16
0
scala> ((is.head, 1.toByte) /: is.tail) { case ((sofar, cnt), x) => slidingAvg1 (sofar, x, cnt)}
110
105
104
100
97
95
89
81
83
Since 10 values are far too few to rely on the average being close to 100, let's see the sum as Int:
is.map (_.toInt).sum
res65: Int = 1053
The drift is pretty significant (the average should be about 105, but the two results are 0 and 83).
Whether the findings transfer from Bytes/Ints to Doubles is another question. I'm also not 100% confident that my parentheses mirror the evaluation order, but as far as I know multiplication and division have the same precedence and are evaluated left to right.
So the original formulas were:
acc = (acc * n/(n+1)) + (item * 1/(n+1))
acc = (acc /(n+1) *n) + (item/(n+1))
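As a quick check of whether the drift carries over to Doubles, the same ten values can be run through both the multiply-first formula and the incremental fold from the earlier answer. This is only a sketch on this one sample; the names mulFirst and incremental are assumptions.

```scala
// The ten sample values from above, as Doubles. Their Int sum is 1053.
val sample = Vector(99, 122, 99, 105, 102, 104, 122, 99, 87, 114).map(_.toDouble)

// acc = (acc * n/(n+1)) + (item * 1/(n+1)), starting from the first element.
val mulFirst = sample.tail.foldLeft((sample.head, 1)) {
  case ((acc, n), x) => (acc * n / (n + 1) + x / (n + 1), n + 1)
}._1

// Incremental form: mean + (x - mean) / n.
val incremental = sample.foldLeft((0.0, 1)) {
  case ((mean, n), x) => (mean + (x - mean) / n, n + 1)
}._1
```

With Doubles both variants stay close to 1053 / 10 on this sample, which suggests the large drift above comes from Byte truncation and integer division rather than from the formula itself.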
If I understand the OP correctly, the amount of data doesn't seem to be a problem, otherwise it wouldn't fit into memory.
So I concentrate on the data types only.
Summary
My suggestion is to go with BigDecimal instead of Double.
Especially if you are adding reasonably high values.
The only significant drawbacks are the performance and a small amount of cluttered syntax.
Alternatively you must rescale your input upfront but this will degrade precision and requires special care with post processing.
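A minimal sketch of the BigDecimal route, assuming the inputs are finite Doubles (BigDecimal cannot represent NaN or Infinity) and that converting the final mean back to Double is acceptable. The name bigDecimalMean is hypothetical.

```scala
// Hypothetical helper: accumulate in BigDecimal so the running sum cannot overflow to Infinity.
def bigDecimalMean(xs: Seq[Double]): Double = {
  require(xs.nonEmpty, "cannot average an empty sequence")
  // Every element is widened before adding; the sum may exceed Double.MaxValue safely.
  val sum = xs.foldLeft(BigDecimal(0))((acc, x) => acc + BigDecimal(x))
  // Divide in BigDecimal, then convert only the (small enough) mean back to Double.
  (sum / xs.size).toDouble
}
```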
Double breaks at some scale
scala> :paste
// Entering paste mode (ctrl-D to finish)
val res0 = (Double.MaxValue + 1) == Double.MaxValue
val res1 = Double.MaxValue/10 == Double.MaxValue
val res2 = List.fill(11)(Double.MaxValue/10).sum
val res3 = List.fill(10)(Double.MaxValue/10).sum == Double.MaxValue
val res4 = (List.fill(10)(Double.MaxValue/10).sum + 1) == Double.MaxValue
// Exiting paste mode, now interpreting.
res0: Boolean = true
res1: Boolean = false
res2: Double = Infinity
res3: Boolean = true
res4: Boolean = true
Take a look at these simple Double arithmetic examples in your Scala REPL:
In Double.MaxValue + 1 the 1 is absorbed: it is far smaller than the spacing between adjacent Doubles at that magnitude, so the result rounds back to Double.MaxValue
Double.MaxValue/10 behaves as expected and doesn't equal Double.MaxValue
Adding Double.MaxValue/10 11 times overflows to Infinity
Adding Double.MaxValue/10 10 times doesn't overflow and evaluates to Double.MaxValue again
Adding 1 to that sum is absorbed in exactly the same way as for Double.MaxValue itself
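The absorption of the + 1 can be made visible with scala.math.ulp, which returns the distance from a Double to its next larger neighbour. The value names below are assumptions for illustration.

```scala
// Spacing between Double.MaxValue and its neighbouring representable values.
val gap = math.ulp(Double.MaxValue)

// 1 is many orders of magnitude smaller than the gap, so the sum rounds back.
val absorbed = Double.MaxValue + 1 == Double.MaxValue

// Subtracting a whole gap does reach a different Double.
val stepDown = Double.MaxValue - gap < Double.MaxValue
```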
BigDecimal works on all scales but is slower
scala> :paste
// Entering paste mode (ctrl-D to finish)
val res0 = (BigDecimal(Double.MaxValue) + 1) == BigDecimal(Double.MaxValue)
val res1 = BigDecimal(Double.MaxValue)/10 == BigDecimal(Double.MaxValue)
val res2 = List.fill(11)(BigDecimal(Double.MaxValue)/10).sum
val res3 = List.fill(10)(BigDecimal(Double.MaxValue)/10).sum == BigDecimal(Double.MaxValue)
val res4 = (List.fill(10)(BigDecimal(Double.MaxValue)/10).sum + 1) == BigDecimal(Double.MaxValue)
// Exiting paste mode, now interpreting.
res0: Boolean = false
res1: Boolean = false
res2: scala.math.BigDecimal = 197746244834854727000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
res3: Boolean = true
res4: Boolean = false
Now compare these results with the ones above from Double.
As you can see everything works as expected.
Rescaling reduces precision and can be tedious
When working with astronomic or microscopic scales, numbers are likely to overflow or underflow quickly.
In that case it is appropriate to work with units other than the base units to compensate for this.
E.g. with km instead of m.
However, then you will have to take special care when multiplying those numbers in formulas, because the scale factor is squared along with the unit.
10 km * 10 km = 100 km^2, but that is not 100,000 m^2
but rather
10,000 m * 10,000 m = 100,000,000 m^2
So keep this in mind.
Another trap is when dealing with very diverse datasets where numbers exist in all kinds of scales and quantities.
When scaling down your input domain you will lose precision and small numbers may be cancelled out.
In some scenarios these numbers don't need to be considered because of their small impact.
However, when these small numbers exist in a high frequency and are ignored all the time you will introduce a large error in the end.
So keep this in mind as well ;)
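A small illustration of how many ignored small numbers add up to a large error: at 1e16 the spacing between adjacent Doubles is 2, so adding 1.0 one at a time changes nothing, while summing the small values first preserves them. The value names are assumptions.

```scala
val big = 1e16
val smalls = Seq.fill(1000)(1.0)

// Large value first: every single + 1.0 is rounded away again.
val bigFirst = smalls.foldLeft(big)(_ + _)

// Small values first: they accumulate to 1000.0 before meeting the large value.
val smallFirst = smalls.sum + big
```

The two results differ by the full 1000.0, even though each individual addition looked harmless.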
Hope this helps
I have written a custom function to get the absolute value of a Long number. Below is the code:
def absolute(x:Long):Long= x match {
case y:Long if(y<0) => -1 * y
case y if(y>=0) => y
}
println(absolute(-9223372036854775808L))
println(absolute(-2300L))
Below is the output of above program
-9223372036854775808
2300
I am not sure why it is not working for very large Long values. Any suggestions on the same?
This is just a case of integer overflow:
scala> Long.MaxValue
res0: Long = 9223372036854775807
scala> Long.MinValue
res1: Long = -9223372036854775808
Thus when you negate -9223372036854775808 you are overflowing the Long by 1 unit, causing it to wrap around (to itself!).
Also there is no need for a match here:
scala> def abs(x: Long): Long = if (x < 0) -x else x
abs: (x: Long)Long
Why not use scala.math.abs?
See scala.math
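Note that scala.math.abs wraps around for Long.MinValue in the same way, so it does not avoid the overflow. If that edge case matters, one option is to make it explicit. The names safeAbs and wideAbs are hypothetical.

```scala
// Returns None for the single value whose absolute value does not fit in a Long.
def safeAbs(x: Long): Option[Long] =
  if (x == Long.MinValue) None else Some(math.abs(x))

// Alternative: widen to BigInt, which can hold 9223372036854775808.
def wideAbs(x: Long): BigInt = BigInt(x).abs
```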
I am new to Scala and was trying out a few basic concepts. I have an integer x that I am trying to convert into a hex value using the following command
val y = Integer.toHexString(x)
This value gives me the hex number in a string format. However I want to get the hex value as a value and not a string. I could write some code for it, but I was wondering if there was some direct command available to do this? Any help is appreciated.
Edit: For example with an integer value of say x =38
val y = Integer.toHexString(38)
y is "26" which is a string. I want to use the hex value 0x26 (not the string) to do bitwise AND operations.
Hex is simply a presentation of a numerical value in base 16. You don't need a numeric value in hexadecimal representation to do bitwise operations on it. In memory, a 32-bit integer is stored in binary, which is just another representation of that same number in a different base. For example, if you have the number 4 (0100 in binary, 0x4 in hex) as a variable in Scala, you can apply bitwise operations to it using the & operator:
scala> val y = 4
y: Int = 4
scala> y & 6
res0: Int = 4
scala> y & 2
res1: Int = 0
scala> y & 0x4
res5: Int = 4
Same goes for bitwise OR (|) operations:
scala> y | 2
res2: Int = 6
scala> y | 4
res3: Int = 4
You do not need to convert the integer to a "hex value" to do bitwise operations. You can just do:
val x = 38
val y = 64
x | y
In fact, there is no such thing as a "hex value" in memory. Every integer is stored in binary. If you want to write an integer literal in hex, you can prefix it with 0x:
val x = 0x38 // 56 in decimal.
x | 0x10 // Turn on 5th bit.
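Using the value 38 from the question, a short sketch shows that the decimal literal, the hex literal and the parsed hex string are all the same Int. The value names are assumptions.

```scala
val x = 38
val hex = Integer.toHexString(x)         // the string "26"
val parsed = Integer.parseInt(hex, 16)   // parse the hex string back into an Int

// 0x26 and 38 are two spellings of the same Int literal.
val sameValue = 0x26 == x

// Bitwise AND works directly on the Int; the mask keeps the low four bits.
val lowNibble = x & 0x0F
```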
This is a piece of code to generate random Long values within a given range, simplified for clarity:
def getLong(min: Long, max: Long): Long = {
if(min > max) {
throw new IncorrectBoundsException
}
val rangeSize = (max - min + 1L)
val randValue = math.abs(Random.nextLong())
val result = (randValue % (rangeSize)) + min
result
}
I know the results of this aren't uniform and this wouldn't work correctly for some values of min and max, but that's beside the point.
In the tests it turned out that the following assertion isn't always true:
getLong(-1L, 1L) >= -1L
More specifically the returned value is -3. How is that even possible?
As it turns out, math.abs(x: Long): Long isn't guaranteed to always return non-negative values! There is no Long value that could represent math.abs(Long.MinValue), so instead of throwing an exception, math.abs returns Long.MinValue:
scala> Long.MinValue
res27: Long = -9223372036854775808
scala> math.abs(Long.MinValue)
res28: Long = -9223372036854775808
scala> math.abs(Long.MinValue) % 3
res29: Long = -2
scala> math.abs(Long.MinValue) % 3 + (-1)
res30: Long = -3
Which is, in my opinion, a very good example of why one should be using ScalaCheck to test at least parts of their codebase.
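One way to sidestep math.abs here is java.lang.Math.floorMod, whose result takes the sign of the divisor and is therefore non-negative for a positive rangeSize. The sketch below makes two assumptions: the name getLongFloorMod is hypothetical, and require stands in for the question's IncorrectBoundsException. Like the original, it is not uniform and still overflows when max - min + 1 does not fit in a Long.

```scala
import scala.util.Random

// Hypothetical variant of getLong that never passes through math.abs.
def getLongFloorMod(min: Long, max: Long): Long = {
  require(min <= max, "min must not exceed max")
  val rangeSize = max - min + 1L
  // floorMod yields a value in [0, rangeSize) even when nextLong() is Long.MinValue.
  Math.floorMod(Random.nextLong(), rangeSize) + min
}
```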
This Stackoverflow post discusses the potential problem of a numeric overflow if not appending L to a number:
Here's an example from the REPL:
scala> 100000 * 100000 // no type specified, so numbers are `int`'s
res0: Int = 1410065408
One way to avoid this problem is to use L.
scala> 100000L * 100000L
res1: Long = 10000000000
Or to specify the number's types:
scala> val x: Long = 100000
x: Long = 100000
scala> x * x
res2: Long = 10000000000
What's considered the best practice to properly specify a number's type?
You should always use L if you are using a long. Otherwise, you can still have problems:
scala> val x: Long = 10000000000
<console>:1: error: integer number too large
val x: Long = 10000000000
^
scala> val x = 10000000000L
x: Long = 10000000000
The conversion due to type ascription happens after the literal has been interpreted as Int.
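When the operands are Int variables rather than literals, there is no place to put an L suffix. Two options are widening one operand before multiplying, or using java.lang.Math.multiplyExact to fail loudly instead of wrapping. The value names here are assumptions.

```scala
import scala.util.Try

val a = 100000

// Widening one operand is enough: the multiplication is then done in Long.
val widened = a.toLong * a

// A single L-suffixed literal has the same effect.
val literal = 100000L * 100000

// multiplyExact on two Ints throws ArithmeticException on overflow instead of wrapping.
val checked = Try(Math.multiplyExact(a, a))
```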