scala.math.BigDecimal: 1.2 and 1.20 are equal

How to keep precision and the trailing zero while converting a Double or a String to scala.math.BigDecimal?
Use Case - In a JSON message, an attribute is of type String and has a value of "1.20". But while reading this attribute in Scala and converting it to a BigDecimal, I am losing the precision: it is converted to 1.2.

@Saurabh What a nice question! It is crucial that you shared the use case!
I think my answer solves it in the safest and most efficient way. In short:
Use jsoniter-scala for parsing BigDecimal values precisely.
Encoding/decoding of any numeric type to/from JSON strings can be defined on a per-codec or per-field basis. Please see the code below:
Add dependencies into your build.sbt:
libraryDependencies ++= Seq(
  "com.github.plokhotnyuk.jsoniter-scala" %% "jsoniter-scala-core" % "2.17.4",
  "com.github.plokhotnyuk.jsoniter-scala" %% "jsoniter-scala-macros" % "2.17.4" % Provided // required only at compile time
)
Define data structures, derive a codec for the root structure, parse the response body and serialize it back:
import com.github.plokhotnyuk.jsoniter_scala.core._
import com.github.plokhotnyuk.jsoniter_scala.macros._
case class Response(
  amount: BigDecimal,
  @stringified price: BigDecimal)
implicit val codec: JsonValueCodec[Response] = JsonCodecMaker.make {
  CodecMakerConfig
    .withIsStringified(false) // switch it on to stringify all numeric and boolean values in this codec
    .withBigDecimalPrecision(34) // set a precision to round up to the decimal128 format: java.math.MathContext.DECIMAL128.getPrecision
    .withBigDecimalScaleLimit(6178) // limit the scale to fit the decimal128 format: BigDecimal("0." + "0" * 33 + "1e-6143", java.math.MathContext.DECIMAL128).scale + 1
    .withBigDecimalDigitsLimit(308) // limit the number of mantissa digits to be parsed before rounding with the specified precision
}
val response = readFromArray[Response]("""{"amount":1000,"price":"1.20"}""".getBytes("UTF-8"))
val json = writeToArray(Response(amount = BigDecimal(1000), price = BigDecimal("1.20")))
Print results to the console and verify them:
println(response)
println(new String(json, "UTF-8"))
Response(1000,1.20)
{"amount":1000,"price":"1.20"}
Why is the proposed approach safe?
Well... Parsing JSON is a minefield, especially when you need precise BigDecimal values afterwards. Most JSON parsers for Scala do it using Java's constructor for the string representation, which has O(n^2) complexity (where n is the number of digits in the mantissa), and do not round results to a safe MathContext (by default, MathContext.DECIMAL128 is used for that in Scala's BigDecimal constructors and operations).
This introduces vulnerabilities to low-bandwidth DoS/DoW attacks for systems that accept untrusted input. Below is a simple example of how it can be reproduced in the Scala REPL with the latest version of the most popular JSON parser for Scala on the classpath:
...
Starting scala interpreter...
Welcome to Scala 2.12.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_222).
Type in expressions for evaluation. Or try :help.
scala> def timed[A](f: => A): A = { val t = System.currentTimeMillis; val r = f; println(s"Elapsed time (ms): ${System.currentTimeMillis - t}"); r }
timed: [A](f: => A)A
scala> timed(io.circe.parser.decode[BigDecimal]("9" * 1000000))
Elapsed time (ms): 29192
res0: Either[io.circe.Error,BigDecimal] = Right(999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999...
scala> timed(io.circe.parser.decode[BigDecimal]("1e-100000000").right.get + 1)
Elapsed time (ms): 87185
res1: scala.math.BigDecimal = 1.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000...
On a contemporary 1 Gbit network, receiving a malicious message with a 1M-digit number takes about 10 ms yet produces 29 seconds of 100% CPU load on a single core; at roughly 125 such messages per second, that is ~3600 core-seconds of work per wall-clock second, so more than 256 cores can be effectively DoS-ed at the full bandwidth rate. The last expression demonstrates how to burn a CPU core for ~1.5 minutes using a message with a 13-byte number if subsequent + or - operations are used with Scala 2.12.8.
And jsoniter-scala takes care of all these cases for Scala 2.11.x, 2.12.x, 2.13.x, and 3.x.
Why is it the most efficient?
Below are throughput results (operations per second, so greater is better) of JSON parsers for Scala on different JVMs, parsing an array of 128 small BigDecimal values (with up to 34-digit mantissas) and a medium value (with a 128-digit mantissa), respectively (the charts are not reproduced here):
The parsing routine for BigDecimal in jsoniter-scala:
- uses BigDecimal values with a compact representation for small numbers of up to 36 digits
- uses more efficient hot loops for medium numbers that have from 37 to 284 digits
- switches to a recursive algorithm with O(n^1.5) complexity for values that have 285 or more digits
Moreover, jsoniter-scala parses and serializes JSON directly from UTF-8 bytes to your data structures and back, and does it crazily fast without using run-time reflection, intermediate ASTs, strings, or hash maps, with minimal allocation and copying. See the results of 115 benchmarks for different data types and real-life message samples for GeoJSON, Google Maps API, OpenRTB, and Twitter API.

For Double, 1.20 is exactly the same as 1.2, so you can't convert them to different BigDecimals. For String, you are not losing precision; you can see that because res3: scala.math.BigDecimal = 1.20 and not ... = 1.2! But equals on scala.math.BigDecimal happens to be defined so that numerically equal BigDecimals are equal even though they are distinguishable.
If you want to avoid that, you could use java.math.BigDecimals for which
Unlike compareTo, this method considers two BigDecimal objects equal only if they are equal in value and scale (thus 2.0 is not equal to 2.00 when compared by this method).
For your case, res2.underlying == res3.underlying will be false.
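To see the difference concretely, here is a minimal sketch (reconstructing the res2/res3 values implied by the question's session):
val a = scala.math.BigDecimal(1.2)            // like res2
val b = scala.math.BigDecimal("1.20")         // like res3
println(a == b)                               // true:  scala.math.BigDecimal compares numerically
println(a.underlying == b.underlying)         // false: java.math.BigDecimal.equals also compares scale
println(a.underlying.compareTo(b.underlying)) // 0:     compareTo ignores scale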
Of course, its documentation also states
Note: care should be exercised if BigDecimal objects are used as keys in a SortedMap or elements in a SortedSet since BigDecimal's natural ordering is inconsistent with equals. See Comparable, SortedMap or SortedSet for more information.
which is probably part of the reason why the Scala designers decided on different behavior.
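To see that inconsistency concretely, a small sketch (a TreeSet uses compareTo, a plain Set uses equals):
import java.math.{BigDecimal => JBigDecimal}
import scala.collection.immutable.TreeSet
// compareTo says 2.0 and 2.00 are equal, so the sorted set keeps only one of them:
println(TreeSet(new JBigDecimal("2.0"), new JBigDecimal("2.00")).size) // 1
// equals also compares scale, so the plain set keeps both:
println(Set(new JBigDecimal("2.0"), new JBigDecimal("2.00")).size)     // 2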

I don't normally do numbers, but:
scala> import java.math.MathContext
import java.math.MathContext
scala> val mc = new MathContext(2)
mc: java.math.MathContext = precision=2 roundingMode=HALF_UP
scala> BigDecimal("1.20", mc)
res0: scala.math.BigDecimal = 1.2
scala> BigDecimal("1.2345", mc)
res1: scala.math.BigDecimal = 1.2
scala> val mc = new MathContext(3)
mc: java.math.MathContext = precision=3 roundingMode=HALF_UP
scala> BigDecimal("1.2345", mc)
res2: scala.math.BigDecimal = 1.23
scala> BigDecimal("1.20", mc)
res3: scala.math.BigDecimal = 1.20
Edit: also, https://github.com/scala/scala/pull/6884
scala> res3 + BigDecimal("0.003")
res4: scala.math.BigDecimal = 1.20
scala> BigDecimal("1.2345", new MathContext(5)) + BigDecimal("0.003")
res5: scala.math.BigDecimal = 1.2375
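If that MathContext propagation is surprising (see the PR linked above), a minimal workaround sketch: construct values with the default DECIMAL128 context and round only where the result is presented:
import java.math.MathContext
val x = BigDecimal("1.20") + BigDecimal("0.003") // default MathContext keeps the digits: 1.203
println(x.round(new MathContext(3)))             // round only at the end: 1.20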

Related

Losing precision when moving to Spark for big decimals

Below is the sample test code and its output. I see that java.math.BigDecimal stores all the digits, whereas scala BigDecimal loses precision and does some rounding off, and the same is happening with Spark. Is there a way to set the precision, or to say "never round off"? I do not want to truncate or round off in any case.
val sc = sparkSession
import java.math.BigDecimal
import sc.implicits._
val bigNum : BigDecimal = new BigDecimal(0.02498934809987987982348902384928349)
val convertedNum: scala.math.BigDecimal = scala.math.BigDecimal(bigNum)
val scalaBigNum: scala.math.BigDecimal = scala.math.BigDecimal(0.02498934809987987982348902384928349)
println("Big num in java" + bigNum)
println("Converted " + convertedNum)
println("Big num in scala " + scalaBigNum)
val ds = List(scalaBigNum).toDS()
println(ds.head)
println(ds.toDF.head)
Output
Big num in java0.0249893480998798801773208566601169877685606479644775390625
Converted 0.0249893480998798801773208566601169877685606479644775390625
Big num in scala 0.02498934809987988
0.024989348099879880
[0.024989348099879880]
Based on spark.apache.org/docs
The precision can be up to 38, scale can also be up to 38 (less or equal to precision). The default precision and scale is (10, 0).
and the scala.math.BigDecimal docs are here: https://www.scala-lang.org/api/2.12.5/scala/math/BigDecimal.html
But if you want a simple way, then how about converting the value to a String before converting to a DF or DS, in order to keep the precise value? :)
Just try it if you want :)
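For instance, a minimal sketch of that suggestion, reusing the digits from the question: constructing the BigDecimal from a String keeps exactly the digits you wrote, while constructing from a Double literal captures the Double's binary artifacts.
import java.math.BigDecimal
val fromDouble = new BigDecimal(0.02498934809987987982348902384928349)
val fromString = new BigDecimal("0.02498934809987987982348902384928349")
println(fromDouble) // 0.0249893480998798801773208566601169877685606479644775390625
println(fromString) // 0.02498934809987987982348902384928349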

What is the difference between Scala Stream vs Scala List vs Scala Sequence?

I have a scenario where I get DB data in the form of a Stream of objects, and transforming it into a sequence of objects is taking time. I am looking for an alternative that takes less time.
Quick answer: a Scala stream is already a Scala sequence and does not need to be converted at all. Further explanation below...
A Scala sequence (scala.collection.Seq) is simply any collection that stores a sequence of elements in a specific order (the ordering is arbitrary, but element order doesn't change once defined).
A Scala list (scala.collection.immutable.List) is a subclass of Seq and is also the default implementation of a scala.collection.Seq. That is, Seq(1, 2, 3) is implemented as a List(1, 2, 3). Lists are strict, so any operation on a list processes all elements, one after the other, before another operation can be performed.
For example, consider this example in the Scala REPL:
$ scala
Welcome to Scala 2.12.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_171).
Type in expressions for evaluation. Or try :help.
scala> val xs = List(1, 2, 3)
xs: List[Int] = List(1, 2, 3)
scala> xs.map {x =>
| val newX = 2 * x
| println(s"Mapping value $x to $newX...")
| newX
| }.foreach {x =>
| println(s"Printing value $x")
| }
Mapping value 1 to 2...
Mapping value 2 to 4...
Mapping value 3 to 6...
Printing value 2
Printing value 4
Printing value 6
Note how each value is mapped, creating a new list (List(2, 4, 6)), before any of the values of that new list are printed out?
A Scala stream (scala.collection.immutable.Stream) is also a subclass of Seq, but it is lazy (or non-strict), meaning that the next value from the stream is only taken when required. It is often referred to as a lazy list.
To illustrate the difference between a Stream and a List, let's redo that example:
scala> val xs = Stream(1, 2, 3)
xs: scala.collection.immutable.Stream[Int] = Stream(1, ?)
scala> xs.map {x =>
| val newX = 2 * x
| println(s"Mapping value $x to $newX...")
| newX
| }.foreach {x =>
| println(s"Printing value $x")
| }
Mapping value 1 to 2...
Printing value 2
Mapping value 2 to 4...
Printing value 4
Mapping value 3 to 6...
Printing value 6
Note how, for a Stream, we only process the next element's map operation after all of the operations for the previous element have been completed? The map operation still returns a new stream (Stream(2, 4, 6)), but values are only taken when needed.
Whether a Stream performs better than a List in any particular situation will depend upon what you're trying to do. If performance is your primary goal, I suggest that you benchmark your code (using a tool such as ScalaMeter) to determine which type works best.
BTW, since both Stream and List are subclasses of Seq, it is common practice to write code that requires a sequence to utilize Seq. That way, you can supply a List or a Stream or any other Seq subclass, without having to change your code, and without having to convert lists, streams, etc. to sequences. For example:
def doSomethingWithSeq[T](seq: Seq[T]) = {
  // Process the sequence here...
}
// This works!
val list = List(1, 2, 3)
doSomethingWithSeq(list)
// This works too!
val stream = Stream(4, 5, 6)
doSomethingWithSeq(stream)
UPDATED
The performance of List vs. Stream for a groupBy operation is going to be very similar. Depending upon how it's used, a Stream can require less memory than a List, but might require a little extra CPU time. If collection performance is definitely the issue, benchmark both types of collection (see above) and measure precisely to determine the trade-offs between the two. I cannot make that determination for you. It's possible that the slowness you refer to is down to the transmission of data between the database and your application, and has nothing to do with the collection type.
For general information on Scala collection performance, refer to Collections: Performance Characteristics.
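If you want a rough comparison without setting up ScalaMeter, here is a crude System.nanoTime sketch (not a rigorous benchmark; results will vary with JIT warm-up and your data):
def time[A](label: String)(f: => A): A = {
  val t0 = System.nanoTime
  val r = f
  println(f"$label: ${(System.nanoTime - t0) / 1e6}%.1f ms")
  r
}
val data = (1 to 1000000).toList
time("List groupBy")(data.groupBy(_ % 100))
time("Stream groupBy")(data.toStream.groupBy(_ % 100))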
UPDATED 2
Also note that any type of Scala sequence will typically be processed sequentially (hence the name), by a single thread at a time. Neither List nor Stream lend themselves to parallel processing of their elements. If you need to process a collection in parallel, you'll need a parallel collection type (one of the collections in scala.collection.parallel). A scala.collection.parallel.ParSeq should process groupBy faster than a List or a Stream, but only if you have multiple cores/hyperthreads available. However, ParSeq operations do not guarantee to preserve the order of the grouped-by elements.
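A minimal sketch of that (Scala 2.12-era syntax; in 2.13+ the parallel collections live in the separate scala-parallel-collections module):
val xs = (1 to 1000000).toList
val grouped = xs.par.groupBy(_ % 10) // distributes work across cores; yields a parallel map
println(grouped.size)                // 10 groups, though ordering within groups is not guaranteed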

Where is the ceil function on scala.math.BigDecimal?

I can't find the ceil function in the places I've looked:
https://duckduckgo.com/?q=scala+bigdecimal+ceil+function&t=ffab&ia=qa
http://www.scala-lang.org/api/current/scala/math/BigDecimal.html
But I do find a ceil method on Double and Float.
I know MathContext can be made to use the ceiling rounding mode, but I can't seem to make it work:
scala> BigDecimal("1.1").round(new MathContext(0, RoundingMode.UP))
res2: scala.math.BigDecimal = 1.1
scala> BigDecimal("1.8").round(new MathContext(0, RoundingMode.UP))
res3: scala.math.BigDecimal = 1.8
scala> BigDecimal("1.8").round(new MathContext(0, RoundingMode.CEILING))
res5: scala.math.BigDecimal = 1.8
scala> BigDecimal("1.8", new MathContext(0, RoundingMode.CEILING))
res6: scala.math.BigDecimal = 1.8
scala> res6.rounded
res8: scala.math.BigDecimal = 1.8
I expected the result of all the operations above to be == BigDecimal(2).
I use a BigDecimal because these are money operations, hence I need exact values.
To answer the title question: there isn't. At least, not in the way that it exists for Double (and it doesn't really exist for Double, either, it's added implicitly).
You are specifying the precision of the MathContext as zero, which means you're saying you want unlimited precision--that is, no rounding will be done. If you want the true ceiling (rounding up to the nearest integer), then you don't want to use BigDecimal#round at all. The first parameter of the MathContext is precision, which is the number of significant digits you want to maintain. Rounding up a number like 1234567 to keep five significant digits would give you 1234600.
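A quick check of that precision behaviour (HALF_UP is the default rounding mode of MathContext):
import java.math.MathContext
println(BigDecimal(1234567).round(new MathContext(5))) // 1.2346E+6, i.e. 1234600 to five significant digits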
To find the ceiling of a number, you want to set the scale, which is the number of digits to maintain after the decimal point. In this case, you'd want a scale of zero.
import scala.math._, BigDecimal._
scala> BigDecimal("3134.1222").setScale(0, RoundingMode.CEILING)
res1: scala.math.BigDecimal = 3135
scala> BigDecimal("1.1").setScale(0, RoundingMode.CEILING)
res2: scala.math.BigDecimal = 2
scala> BigDecimal("1.8").setScale(0, RoundingMode.CEILING)
res3: scala.math.BigDecimal = 2
We can use the enrich-my-library pattern to add a ceil method to BigDecimal implicitly:
implicit class RichBigDecimal(bd: BigDecimal) {
  def ceil: BigDecimal = bd.setScale(0, RoundingMode.CEILING)
}
scala> BigDecimal("1.1").ceil
res6: scala.math.BigDecimal = 2

Count number of Strings that can be converted to Int in a List

For example, my input is:
scala> val myList = List("7842", "abf45", "abd", "56")
myList: List[String] = List(7842, abf45, abd, 56)
7842 and 56 can be converted to Int; therefore, my expected output is 2. We can assume that negative integers don't happen, so -67 is not possible.
This is what I have so far:
scala> myList.map(x => Try(x.toInt).getOrElse(-1)).count(_ > -1)
res15: Int = 2
This should work correctly, but I feel like I am missing a more elegant and readable solution, because all I have to do is count the number of successes.
I would caution against using exception handling (like Try) in control flow -- it's very slow.
Here's a solution that uses idiomatic Scala collection operations, performs well, and will not count negative numbers:
scala> val myList = List("7842", "abf45", "abd", "56")
myList: List[String] = List(7842, abf45, abd, 56)
scala> myList.count(_.forall(_.isDigit))
res8: Int = 2
EDIT: @immibis pointed out that this won't detect strings of numbers that exceed Integer.MaxValue. If this is a concern, I would recommend one of the following approaches:
import scala.util.Try
myList.count(x => Try(x.toInt).filter(_ >= 0).isSuccess)
or, if you want to keep the performance of my first answer while still handling this edge case:
import scala.util.Try
myList.count(x => x.forall(_.isDigit) && Try(x.toInt).filter(_ >= 0).isSuccess)
This is a bit shorter:
myList.count(x => Try(x.toInt).isSuccess)
Note that this solution will handle any string that can be converted to integer via .toInt, including negative numbers.
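A quick check of that difference, reusing the list but with a negative number:
import scala.util.Try
val xs = List("7842", "abf45", "abd", "-56")
println(xs.count(x => Try(x.toInt).isSuccess)) // 2: "-56" counts here
println(xs.count(_.forall(_.isDigit)))         // 1: the digits-only check rejects "-56"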
You may consider string.matches method with regex as well, to match only positive integers:
val myList = List("7842", "abf45", "abd", "-56")
// myList: List[String] = List(7842, abf45, abd, -56)
myList.count(_.matches("\\d+"))
// res18: Int = 1
If negative integers need to be counted (and take into account possible +/- signs):
myList.count(_.matches("[+-]?\\d+"))
// res17: Int = 2
Starting with Scala 2.13 and the introduction of String#toIntOption, we can count the items for which toIntOption is defined (Some(34) for "34", None for "2s3"):
List("34", "abf45", "2s3", "56").count(_.toIntOption.isDefined) // 2

What is the fastest way to sum a collection in Scala

I've tried different collections in Scala to sum their elements, and they are much slower than Java summing its arrays (with a for loop). Is there a way for Scala to be as fast as Java arrays?
I've heard that in Scala 2.8 arrays will be the same as in Java, but in practice they are much slower.
Indexing into arrays in a while loop is as fast in Scala as in Java. (Scala's "for" loop is not the low-level construct that Java's is, so that won't work the way you want.)
Thus if in Java you see
for (int i = 0; i < array.length; i++) sum += array[i];
in Scala you should write
var i = 0
while (i < array.length) {
  sum += array(i)
  i += 1
}
and if you do your benchmarks appropriately, you'll find no difference in speed.
If you have iterators anyway, then Scala is as fast as Java in most things. For example, if you have an ArrayList of doubles and in Java you add them using
for (double d : arraylist) { sum += d }
then in Scala you'll be approximately as fast--if using an equivalent data structure like ArrayBuffer--with
arraybuffer.foreach( sum += _ )
and not too far off the mark with either of
sum = (0.0 /: arraybuffer)(_ + _)
sum = arraybuffer.sum // 2.8 only
Keep in mind, though, that there's a penalty to mixing high-level and low-level constructs. For example, if you decide to start with an array but then use "foreach" on it instead of indexing into it, Scala has to wrap it in a collection (ArrayOps in 2.8) to get it to work, and often will have to box the primitives as well.
Anyway, for benchmark testing, these two functions are your friends:
def time[F](f: => F) = {
  val t0 = System.nanoTime
  val ans = f
  printf("Elapsed: %.3f\n", 1e-9 * (System.nanoTime - t0))
  ans
}
def lots[F](n: Int, f: => F): F = if (n <= 1) f else { f; lots(n - 1, f) }
For example:
val a = Array.tabulate(1000000)(_.toDouble)
val ab = new collection.mutable.ArrayBuffer[Double] ++ a
def adSum(ad: Array[Double]) = {
  var sum = 0.0
  var i = 0
  while (i < ad.length) { sum += ad(i); i += 1 }
  sum
}
// Mixed array + high-level; convenient, not so fast
scala> lots(3, time( lots(100,(0.0 /: a)(_ + _)) ) )
Elapsed: 2.434
Elapsed: 2.085
Elapsed: 2.081
res4: Double = 4.999995E11
// High-level container and operations, somewhat better
scala> lots(3, time( lots(100,(0.0 /: ab)(_ + _)) ) )
Elapsed: 1.694
Elapsed: 1.679
Elapsed: 1.635
res5: Double = 4.999995E11
// High-level collection with simpler operation
scala> lots(3, time( lots(100,{var s=0.0;ab.foreach(s += _);s}) ) )
Elapsed: 1.171
Elapsed: 1.166
Elapsed: 1.162
res7: Double = 4.999995E11
// All low level operations with primitives, no boxing, fast!
scala> lots(3, time( lots(100,adSum(a)) ) )
Elapsed: 0.185
Elapsed: 0.183
Elapsed: 0.186
res6: Double = 4.999995E11
You can now simply use sum.
val values = Array.fill[Double](numValues)(0)
val sumOfValues = values.sum
The proper Scala or functional way to do this would be:
val numbers = Array(1, 2, 3, 4, 5)
val sum = numbers.reduceLeft[Int](_+_)
Check out this link for the full explanation of the syntax:
http://www.codecommit.com/blog/scala/quick-explanation-of-scalas-syntax
I doubt this would be faster than doing it in the ways described in the other answers but I haven't tested it so I'm not sure. In my opinion this is the proper way to do it though since Scala is a functional language.
It is very difficult to explain why some code you haven't shown performs worse than some other code you haven't shown in some benchmark you haven't shown.
You may be interested in this question and its accepted answer, for one thing. But benchmarking JVM code is hard, because the JIT will optimize code in ways that are difficult to predict (which is why JIT beats traditional optimization at compile time).
Scala 2.8 Arrays are JVM/Java arrays and as such have identical performance characteristics. But that means they cannot directly have extra methods that unify them with the rest of the Scala collections. To provide the illusion that arrays have these methods, there are implicit conversions to wrapper classes that add those capabilities. If you are not careful, you'll incur inordinate overhead using those features.
In those cases where iteration overhead is critical, you can explicitly get an iterator (or maintain an integer index, for indexed sequential structures like Array or other IndexedSeq) and use a while loop, which is a language-level construct that need not operate on functions (literals or otherwise) but can compile in-line code blocks.
val l1 = List(...) // or any Iterable
val i1 = l1.iterator
while (i1.hasNext) {
  val e = i1.next
  // Do stuff with e
}
Such code will execute essentially as fast as a Java counterpart.
Timing is not the only concern.
With sum you might find an overflow issue:
scala> Array(2147483647,2147483647).sum
res0: Int = -2
in this case seeding foldLeft with a Long is preferable
scala> Array(2147483647,2147483647).foldLeft(0L)(_+_)
res1: Long = 4294967294
EDIT:
Long can be used from the beginning:
scala> Array(2147483647L,2147483647L).sum
res1: Long = 4294967294