Losing precision when moving to Spark for big decimals - scala

Below is the sample test code and its output. I see that java.math.BigDecimal stores all the digits, whereas scala.math.BigDecimal seems to lose precision and round off, and the same happens in Spark. Is there a way to set the precision, or to say "never round off"? I do not want to truncate or round in any case.
val sc = sparkSession
import java.math.BigDecimal
import sc.implicits._

// Note: the literal below is parsed as a Double first, so even java.math.BigDecimal
// stores the nearest Double's exact binary expansion rather than the written digits
val bigNum: BigDecimal = new BigDecimal(0.02498934809987987982348902384928349)
val convertedNum: scala.math.BigDecimal = scala.math.BigDecimal(bigNum)
// scala.math.BigDecimal(Double) uses the Double's shortest decimal representation
val scalaBigNum: scala.math.BigDecimal = scala.math.BigDecimal(0.02498934809987987982348902384928349)

println("Big num in java " + bigNum)
println("Converted " + convertedNum)
println("Big num in scala " + scalaBigNum)

val ds = List(scalaBigNum).toDS()
println(ds.head)
println(ds.toDF.head)
Output
Big num in java 0.0249893480998798801773208566601169877685606479644775390625
Converted 0.0249893480998798801773208566601169877685606479644775390625
Big num in scala 0.02498934809987988
0.024989348099879880
[0.024989348099879880]

Based on spark.apache.org/docs:
The precision can be up to 38, and the scale can also be up to 38 (less than or equal to the precision). The default precision and scale is (10, 0).
Scala's BigDecimal is documented here: https://www.scala-lang.org/api/2.12.5/scala/math/BigDecimal.html
But if you want a simple way out, how about converting the value to a String before
converting to a DF or DS, in order to keep the precise value? :) Just try it if you want :)
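A hedged sketch of both points, reusing sparkSession and bigNum from the question: Spark's default encoder maps scala.math.BigDecimal to DecimalType(38, 18), which is why exactly 18 fractional digits survive in the output above, and carrying the value as a String bypasses that encoder entirely.

import sparkSession.implicits._

// Spark infers decimal(38,18) for a BigDecimal column:
List(scala.math.BigDecimal(1)).toDS().toDF.printSchema()
// root
//  |-- value: decimal(38,18) (nullable = true)

// Workaround: keep the full-precision value as a String
val stringDs = List(bigNum.toString).toDS() // Dataset[String], nothing rounded
println(stringDs.head) // prints every digit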

Related

Scala BigDecimal - loss of precision

I need to do some precision calculations with big numbers and I have been trying with Scala's BigDecimal, but I have noticed a loss of precision.
As an example:
2^63 == 9223372036854775808
2^64 == 18446744073709551616
However when I do
println(BigDecimal.decimal(scala.math.pow(2, 63)).toBigIntExact())
println(BigDecimal.decimal(scala.math.pow(2, 64)).toBigIntExact())
I get
9223372036854776000 != 9223372036854775808
18446744073709552000 != 18446744073709551616
I don't know if I can get the exact BigInt.
Maybe I have to take another approach.
Could anyone help me fix this issue?
# scala.math.pow(2, 63)
res0: Double = 9.223372036854776E18
You get a Double back from math.pow, and then you pass that result to BigDecimal, which means you lost precision before you even started using the Big* classes.
If you put numbers into BigDecimal while they are still small and haven't yet lost precision (and if you use the constructors correctly), then you'll get the expected result:
# BigDecimal(2).pow(63).toBigInt
res4: BigInt = 9223372036854775808
# BigDecimal(2).pow(64).toBigInt
res5: BigInt = 18446744073709551616
# BigDecimal(2).pow(63).toBigIntExact
res6: Option[BigInt] = Some(9223372036854775808)
# BigDecimal(2).pow(64).toBigIntExact
res7: Option[BigInt] = Some(18446744073709551616)

Precision set to only 1 decimal place in Scala if the fraction digits are all zeros

Scala seems to drop the trailing decimal places and keep just one when a value like 1.00 is assigned to a BigDecimal.
How can I tell the compiler to preserve the decimal places irrespective of the input?
scala> val tmp:BigDecimal = 1.00
tmp: BigDecimal = 1.0
// was expecting 1.00
scala> val tmp:BigDecimal = 1.01
tmp: BigDecimal = 1.01
scala> val tmp:BigDecimal = 1.11
tmp: BigDecimal = 1.11
EDIT 1: Some context on why I am trying to preserve the decimal places: I work in a fintech job, where decimals up to 3 places are mandatory for all amount fields. The Scala API is leveraged by the front-end team to show customer balances, reports, etc.
From the ScalaDocs page we learn:
In most cases, the value of the BigDecimal is also rounded to the precision specified by the MathContext. To create a BigDecimal with a different precision than its MathContext, use new BigDecimal(new java.math.BigDecimal(...), mc).
And, indeed, that does appear to work.
import java.math.{BigDecimal=>JBD, MathContext=>JMC}
val a = BigDecimal(2.4) //a: scala.math.BigDecimal = 2.4
a.precision //res0: Int = 2
val b = BigDecimal(new JBD(2.4, new JMC(7))) //b: scala.math.BigDecimal = 2.400000
b.precision //res1: Int = 7
But, unfortunately, the results aren't always consistent.
import java.math.{BigDecimal=>JBD, MathContext=>JMC}
val a = BigDecimal(2.5) //a: scala.math.BigDecimal = 2.5
a.precision //res0: Int = 2
val b = BigDecimal(new JBD(2.5, new JMC(7))) //b: scala.math.BigDecimal = 2.5
b.precision //res1: Int = 2
So it would appear that precision is either set by the MathContext or it is whatever is sufficient to accurately represent the stated value, whichever is smaller. (Warning: supposition from limited observation.)
However, if what you need is a consistent presentation of the value then I suggest we move away from MathContext to StringContext.
val a :BigDecimal = 1
val b :BigDecimal = 2.2
val c :BigDecimal = 3.456
f"$a%.2f" //res0: String = 1.00
f"$b%.2f" //res1: String = 2.20
f"$c%.2f" //res2: String = 3.46 (note the rounding)

scala.math.BigDecimal : 1.2 and 1.20 are equal

How to keep precision and the trailing zero while converting a Double or a String to scala.math.BigDecimal ?
Use Case - In a JSON message, an attribute is of type String and has a value of "1.20". But while reading this attribute in Scala and converting it to a BigDecimal, I am losing the precision and it is converted to 1.2.
@Saurabh What a nice question! It is crucial that you shared the use case!
I think my answer solves it in the safest and most efficient way. In short:
Use jsoniter-scala for parsing BigDecimal values precisely.
Encoding/decoding to/from JSON strings can be defined for any numeric type on a per-codec or per-field basis. Please see the code below:
Add dependencies into your build.sbt:
libraryDependencies ++= Seq(
  "com.github.plokhotnyuk.jsoniter-scala" %% "jsoniter-scala-core" % "2.17.4",
  "com.github.plokhotnyuk.jsoniter-scala" %% "jsoniter-scala-macros" % "2.17.4" % Provided // required only at compile time
)
Define data structures, derive a codec for the root structure, parse the response body and serialize it back:
import com.github.plokhotnyuk.jsoniter_scala.core._
import com.github.plokhotnyuk.jsoniter_scala.macros._
case class Response(
  amount: BigDecimal,
  @stringified price: BigDecimal) // @stringified: this field is read from and written as a JSON string
implicit val codec: JsonValueCodec[Response] = JsonCodecMaker.make {
  CodecMakerConfig
    .withBigDecimalPrecision(34) // round to the decimal128 precision: java.math.MathContext.DECIMAL128.getPrecision
    .withBigDecimalScaleLimit(6178) // limit the scale to fit the decimal128 format: BigDecimal("0." + "0" * 33 + "1e-6143", java.math.MathContext.DECIMAL128).scale + 1
    .withBigDecimalDigitsLimit(308) // limit the number of mantissa digits parsed before rounding with the specified precision
}
val response = readFromArray[Response]("""{"amount":1000,"price":"1.20"}""".getBytes("UTF-8"))
val json = writeToArray(Response(amount = BigDecimal(1000), price = BigDecimal("1.20")))
Print results to the console and verify them:
println(response)
println(new String(json, "UTF-8"))
Response(1000,1.20)
{"amount":1000,"price":"1.20"}
Why is the proposed approach safe?
Well... Parsing of JSON is a minefield, especially when you need precise BigDecimal values afterwards. Most JSON parsers for Scala do it using Java's constructor for the string representation, which has O(n^2) complexity (where n is the number of digits in the mantissa), and they do not round results to a safe MathContext (by default, MathContext.DECIMAL128 is used for that in Scala's BigDecimal constructors and operations).
This introduces vulnerabilities under low-bandwidth DoS/DoW attacks for systems that accept untrusted input. Below is a simple example of how it can be reproduced in the Scala REPL with the latest version of the most popular JSON parser for Scala on the classpath:
...
Starting scala interpreter...
Welcome to Scala 2.12.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_222).
Type in expressions for evaluation. Or try :help.
scala> def timed[A](f: => A): A = { val t = System.currentTimeMillis; val r = f; println(s"Elapsed time (ms): ${System.currentTimeMillis - t}"); r }
timed: [A](f: => A)A
scala> timed(io.circe.parser.decode[BigDecimal]("9" * 1000000))
Elapsed time (ms): 29192
res0: Either[io.circe.Error,BigDecimal] = Right(999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999...
scala> timed(io.circe.parser.decode[BigDecimal]("1e-100000000").right.get + 1)
Elapsed time (ms): 87185
res1: scala.math.BigDecimal = 1.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000...
On a contemporary 1Gbit network, a malicious message with a 1M-digit number takes about 10ms to receive, yet can produce 29 seconds of 100% CPU load on a single core; more than 256 cores can be effectively DoS-ed at the full bandwidth rate. The last expression demonstrates how to burn a CPU core for ~1.5 minutes using a message with a 13-byte number, if subsequent + or - operations are used with Scala 2.12.8.
And jsoniter-scala takes care of all these cases for Scala 2.11.x, 2.12.x, 2.13.x, and 3.x.
Why is it the most efficient?
Throughput comparisons (operations per second, so greater is better) of JSON parsers for Scala on different JVMs, parsing an array of 128 small (up to 34-digit mantissa) BigDecimal values and a single medium (128-digit mantissa) BigDecimal value, show jsoniter-scala ahead.
The parsing routine for BigDecimal in jsoniter-scala:
- uses BigDecimal values with a compact representation for small numbers of up to 36 digits
- uses more efficient hot loops for medium numbers that have from 37 to 284 digits
- switches to a recursive algorithm with O(n^1.5) complexity for values that have more than 285 digits
Moreover, jsoniter-scala parses and serializes JSON directly from UTF-8 bytes to your data structures and back, and does it crazily fast without using run-time reflection, intermediate ASTs, strings, or hash maps, with minimal allocation and copying. Please see the results of 115 benchmarks for different data types and real-life message samples for GeoJSON, Google Maps API, OpenRTB, and Twitter API.
For Double, 1.20 is exactly the same as 1.2, so you can't convert them to different BigDecimals. For String, you are not losing precision; you can see that because res3: scala.math.BigDecimal = 1.20 and not ... = 1.2! But equals on scala.math.BigDecimal happens to be defined so that numerically equal BigDecimals are equal even though they are distinguishable.
If you want to avoid that, you could use java.math.BigDecimals for which
Unlike compareTo, this method considers two BigDecimal objects equal only if they are equal in value and scale (thus 2.0 is not equal to 2.00 when compared by this method).
For your case, res2.underlying == res3.underlying will be false.
Of course, its documentation also states
Note: care should be exercised if BigDecimal objects are used as keys in a SortedMap or elements in a SortedSet since BigDecimal's natural ordering is inconsistent with equals. See Comparable, SortedMap or SortedSet for more information.
which is probably part of the reason why the Scala designers decided on different behavior.
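A quick way to see both behaviors side by side (a small sketch with fresh values, mirroring the question's res2/res3):

val a = BigDecimal("1.2")
val b = BigDecimal("1.20")
a == b // true: scala.math.BigDecimal compares numerically
a.underlying == b.underlying // false: java.math.BigDecimal equality also compares scale
a.underlying.compareTo(b.underlying) == 0 // true: compareTo ignores scale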
I don't normally do numbers, but:
scala> import java.math.MathContext
import java.math.MathContext
scala> val mc = new MathContext(2)
mc: java.math.MathContext = precision=2 roundingMode=HALF_UP
scala> BigDecimal("1.20", mc)
res0: scala.math.BigDecimal = 1.2
scala> BigDecimal("1.2345", mc)
res1: scala.math.BigDecimal = 1.2
scala> val mc = new MathContext(3)
mc: java.math.MathContext = precision=3 roundingMode=HALF_UP
scala> BigDecimal("1.2345", mc)
res2: scala.math.BigDecimal = 1.23
scala> BigDecimal("1.20", mc)
res3: scala.math.BigDecimal = 1.20
Edit: also, https://github.com/scala/scala/pull/6884 (the MathContext set at construction is used in subsequent arithmetic):
scala> res3 + BigDecimal("0.003")
res4: scala.math.BigDecimal = 1.20
scala> BigDecimal("1.2345", new MathContext(5)) + BigDecimal("0.003")
res5: scala.math.BigDecimal = 1.2375
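If the goal is a fixed number of fraction digits rather than significant digits, another option worth knowing (a sketch of my own, not from the answer above) is setScale, which pads or rounds the scale explicitly and so preserves trailing zeros:

import scala.math.BigDecimal.RoundingMode

BigDecimal("1.2").setScale(2) // 1.20 (scale padded, no rounding needed)
BigDecimal("1.2345").setScale(2, RoundingMode.HALF_UP) // 1.23 (explicitly rounded)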

Multiply a BigDecimal with an Integer

I want to multiply a financial amount with a quantity. I know Scala uses Java's BigDecimal under the hood but the syntax doesn't seem to be the same.
val price = BigDecimal("0.01") // £0.01
val qty = 10
I tried to do this
BigDecimal(price).*(BigDecimal(qty))
But it's a compile error. If I look at Java posts on SO, you can pass an integer into BigDecimal and then multiply it like this
BigDecimal(price).multiply(BigDecimal(qty))
So how do you do this in Scala? And are there any dangers of losing precision by multiplying a decimal and an integer like this? I will need to sum a lot of these together as well.
You can actually multiply a BigDecimal with an Int using BigDecimal's multiplication operator:
def *(that: BigDecimal): BigDecimal
since the Int you will provide as its parameter will be implicitly converted to a BigDecimal based on:
implicit def int2bigDecimal(i: Int): BigDecimal
You can thus safely multiply your BigDecimal with an Int as if it was a BigDecimal and receive a BigDecimal with no loss in precision:
val price = BigDecimal("0.01")
// scala.math.BigDecimal = 0.01
val qty = 10
// Int = 10
price * qty // same as BigDecimal("0.01").*(10)
// scala.math.BigDecimal = 0.10
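Since the question also mentions summing many of these: BigDecimal sums stay exact too, via the standard Numeric instance (hypothetical line-item values of my own):

val lineItems = Seq(BigDecimal("0.01") * 10, BigDecimal("19.99") * 2, BigDecimal("5.50") * 3)
val total = lineItems.sum // scala.math.BigDecimal = 56.58, no precision lost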
You can do this:
val a = 10
val b = BigDecimal(0.1000000000001)
a * b
res0: scala.math.BigDecimal = 1.0000000000010
As you can see, you don't lose precision.
The problem is actually this:
BigDecimal(price)
price is already a BigDecimal, so the compiler doesn't know what to do! If you fix this, the first version works. The second version fails because there is no multiply method on scala.math.BigDecimal.
However, as others have pointed out, the simple solution is just
price*qty

Compute average of numbers in a text file in spark scala

Let's say I have a file with each line representing a number. How do I find the average of all the numbers in the file in Scala / Spark?
val data = sc.textFile("../../numbers.txt")
val sum = data.reduce( (x,y) => x+y )
val avg = sum/data.count()
The problem here is that x and y are strings. How do I convert them into Long within the reduce function?
You need to apply an RDD.map which parses the strings before reducing them:
val sum = data.map(_.toLong).reduce(_ + _) // toLong, as the question asks (and safer than toInt for large sums)
val avg = sum.toDouble / data.count() // toDouble, otherwise Long division truncates the average
But I think you're better off using DoubleRDDFunctions.mean instead of calculating it yourself:
val mean = data.map(_.toInt).mean()
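And if the file can contain non-integer numbers, parse to Double instead (a one-line variation on the same data RDD):

val mean = data.map(_.toDouble).mean()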