Scala map() on a Map[..] much slower than mapValues() - scala

In a Scala program I wrote I have a scala.collection.Map that maps a String to some calculated values (in detail it's Map[String, (Double, immutable.Map[String, Double], Double)] - I know that's ugly and should (and will be) wrapped). Now, if I do this:
stats.map { case(c, (prior, pwc, denom)) => {
println(c)
...
}
}
it takes about 30 seconds to print out roughly 50 times a value of c! The println is just a test statement - the actual calculation I need was even slower (I aborted after 1 minute of complete silence). However, if I do it like this:
stats.mapValues { case (prior, pwc, denom) => {
println(prior)
...
}
}
I don't run into these performance issues ... Can anyone explain why this is happening? Am I not following some important Scala guidelines?
Thanks for the help!
Edit:
I further investigated the behaviour. My guess is that the bottleneck comes from accessin the Map datastructure. If I do the following, I have have the same performance issues:
classes.foreach{c => {
println(c)
val ps = stats(c)
}
}
Here classes is a List[String] that stores the keys of the Map externally. Without the access to stats(c) no performance losses occur.

mapValues actually returns a view on the original map, which can lead to unexpected performance issues. From this blog post:
...here is a catch: map and mapValues are different in a not-so-subtle
way. mapValues, unlike map, returns a view on the original map. This
view holds references to both the original map and to the
transformation function (here (_ + 1)). Every time the returned map
(view) is queried, the original map is first queried and the
tranformation function is called on the result.
I recommend reading the rest of that post for some more details.

Related

Scala adding elements to seq and handling futures, maps, and async behavior

I'm still a newbie in scala and don't quite yet understand the concept of Futures/Maps/Flatmaps/Seq and how to use them properly.
This is what I want to do (pseudo code):
def getContentComponents: Action[AnyContent] = Action.async {
contentComponentDTO.list().map( //Future[Seq[ContentComponentModel]] Get all contentComponents
contentComponents => contentComponents.map( //Iterate over [Seq[ContentComponentModel]
contentComponent => contentComponent.typeOf match { //Match the type of the contentComponent
case 1 => contentComponent.pictures :+ contentComponentDTO.getContentComponentPicture(contentComponent.id.get) //Future[Option[ContentComponentPictureModel]] add to _.pictures seq
case 2 => contentComponent.videos :+ contentComponentDTO.getContentComponentVideo(contentComponent.id.get) //Future[Option[ContentComponentVideoModel]] add to _.videos seq
}
)
Ok(Json.toJson(contentComponents)) //Return all the contentComponents in the end
)
}
I want to add a Future[Option[Foo]] to contentComponent.pictures: Option[Seq[Foo]] like so:
case 2 => contentComponent.pictures :+ contentComponentDTO.getContentComponentPicture(contentComponent.id.get) //contentComponent.pictures is Option[Seq[Foo]]
and return the whole contentComponent back to the front-end via json in the end.
I know this might be far away from the actual code in the end, but I hope you got the idea. Thanks!
I'll ignore your code and focus on what is short and makes sense:
I want to add a Future[Option[Foo]] to contentComponent.pictures: Option[Seq[Foo]] like so:
Let's do this, focusing on code readability:
// what you already have
val someFuture: Future[Option[Foo]] = ???
val pics: Option[Seq[Foo]] = contentComponent.pictures
// what I'm adding
val result: Future[Option[Seq[Foo]]] = someFuture.map {
case None => pics
case Some(newElement) =>
pics match {
case None => Some(Seq(newElement)) // not sure what you want here if pics is empty...
case Some(picsSequence) => Some(picsSequence :+ newElement)
}
}
And to show an example of flatMap let's say you need the result of result future in another future, just do:
val otherFuture: Future[Any] = ???
val everything: Future[Option[Seq[Foo]]] = otherFuture.flatmap { otherResult =>
// do something with otherResult i.e., the code above could be pasted in here...
result
}
My answer will attempt to help with some of the conceptual sub-questions which form parts of your overall larger question.
flatMap and for-yield
One of the points of flatMap is to help with the problem of the Pyramid of Doom. This happens when you have
structures nested within structures nested within structures ...
doA().map { resultOfA =>
doB(resultOfA).map { resolutOfB =>
doC(resultOfB).map { resultOfC =>
...
}
}
}
If you use for-yield you get flatMap out of the box and it allows you to
flatten the pyramid
so that your code looks more like a linear structure
for {
resultOfA <- doA
resultOfB <- doB(resultOfA)
resultOfC <- doC(resultOfB)
...
} yield {...}
There is a rule of thumb in software engineering that deeply nested structures are harder to debug and reason about, so
we strive to minimise the nesting. You will hit this issue especially when dealing with Futures.
Mapping over Future vs. mapping over sequence
Mapping is usually first thought in terms of iteration over a sequence, which might lead to understanding of
mapping over a Future in terms of iterating over a sequence of one. My advice would be not to use the iteration concept when
trying to understand mapping over Futures, Options etc. In these cases it might be better to think of mapping as a process of destructing the structure
so that you get at the element inside the structure. One could visualise mapping as
breaking the shell of a walnut so you get at the delicious kernel inside and then rebuilding the shell.
Futures and monads
As you try to learn more about Futures and when you begin to deal with types like Future[Option[SomeType]] you will inevitably
stumble upon documentation about monads and its cryptic terminology might scare you away. If this happens, it might help to think of monads (of which Future is a particular instance) as simply
something you can stick into a for-yield so that you can get at the
delicious walnut kernels whilst avoiding the pyramid of doom.

Scala practices: lists and case classes

I've just started using Scala/Spark and having come from a Java background and I'm still trying to wrap my head around the concept of immutability and other best practices of Scala.
This is a very small segment of code from a larger program:
intersections is RDD(Key, (String, String))
obs is (Key, (String, String))
Data is just a case class I've defined above.
val intersections = map1 join map2
var listOfDatas = List[Data]()
intersections take NumOutputs foreach (obs => {
listOfDatas ::= ParseInformation(obs._1.key, obs._2._1, obs._2._2)
})
listOfDatas foreach println
This code works and does what I need it to do, but I was wondering if there was a better way of making this happen. I'm using a variable list and rewriting it with a new list every single time I iterate, and I'm sure there has to be a better way to create an immutable list that's populated with the results of the ParseInformation method call. Also, I remember reading somewhere that instead of accessing the tuple values directly, the way I have done, you should use case classes within functions (as partial functions I think?) to improve readability.
Thanks in advance for any input!
This might work locally, but only because you are takeing locally. It will not work once distributed as the listOfDatas is passed to each worker as a copy. The better way of doing this IMO is:
val processedData = intersections map{case (key, (item1, item2)) => {
ParseInfo(key, item1, item2)
}}
processedData foreach println
A note for a new to functional dev: If all you are trying to do is transform data in an iterable (List), forget foreach. Use map instead, which runs your transformation on each item and spits out a new iterable of the results.
What's the type of intersections? It looks like you can replace foreach with map:
val listOfDatas: List[Data] =
intersections take NumOutputs map (obs => {
ParseInformation(obs._1.key, obs._2._1, obs._2._2)
})

Using `map` function on Map in Scala

First off, apologies for the lame question. I am reading the `Scala for the Impatient' religiously and trying to solve all the exercise questions (and doing some minimal exploration)
Background :
The exercise question goes like - Setup a map of prices for a number of gizmos that you covet. Then produce a second map with the same keys and the prices at a 10% discount.
Unfortunately, at this point, most parts of the scaladoc are still cryptic to me but I understand that the map function of the Map takes a function and returns another map after applying a function (I guess?) - def map[B](f: (A) ⇒ B): HashMap[B]. I tried googling but couldnt get much useful results for map function for Map in scala :-)
My Question:
As attempted in my variation 3, does using map function for this purpose make any sense or should I stick with the variation 2 which actually solves my problem.
Code :
val gizmos:Map[String,Double]=Map("Samsung Galaxy S4 Zoom"-> 1000, "Mac Pro"-> 6000.10, "Google Glass"->2000)
//1. Normal for/yield
val discountedGizmos=(for ((k,v)<-gizmos) yield (k, v*0.9)) //Works fine
//2. Variation using mapValues
val discGizmos1=gizmos.mapValues(_*0.9) //Works fine
//3. Variation using only map function
val discGizmos2=gizmos.map((_,v) =>v*0.9) //ERROR : Wrong number of parameters: expected 1
In this case, mapValues does seem the more appropriate method to use. You would use the map method when you need to perform a transformation that requires knowledge of the keys (eg. converting a product reference into a product name, say).
That said, the map method is more general as it gives you acces to both the keys and values for you to act upon, and you could emulate the mapValues method by simply transforming the values and passing the keys through untouched - and that is where you are going wrong in your code above. To use the map method correctly, you should be producing a (key, value) pair from your function, not just a key:
val discGizmos2=gizmos.map{ case (k,v) => (k,v*0.9) } // pass the key through unchanged
It can be also:
val discGizmos2 = gizmos.map(kv => (kv._1, kv._2*0.9))

What are good examples of: "operation of a program should map input values to output values rather than change data in place"

I came across this sentence in Scala in explaining its functional behavior.
operation of a program should map input of values to output values rather than change data in place
Could somebody explain it with a good example?
Edit: Please explain or give example for the above sentence in its context, please do not make it complicate to get more confusion
The most obvious pattern that this is referring to is the difference between how you would write code which uses collections in Java when compared with Scala. If you were writing scala but in the idiom of Java, then you would be working with collections by mutating data in place. The idiomatic scala code to do the same would favour the mapping of input values to output values.
Let's have a look at a few things you might want to do to a collection:
Filtering
In Java, if I have a List<Trade> and I am only interested in those trades executed with Deutsche Bank, I might do something like:
for (Iterator<Trade> it = trades.iterator(); it.hasNext();) {
Trade t = it.next();
if (t.getCounterparty() != DEUTSCHE_BANK) it.remove(); // MUTATION
}
Following this loop, my trades collection only contains the relevant trades. But, I have achieved this using mutation - a careless programmer could easily have missed that trades was an input parameter, an instance variable, or is used elsewhere in the method. As such, it is quite possible their code is now broken. Furthermore, such code is extremely brittle for refactoring for this same reason; a programmer wishing to refactor a piece of code must be very careful to not let mutated collections escape the scope in which they are intended to be used and, vice-versa, that they don't accidentally use an un-mutated collection where they should have used a mutated one.
Compare with Scala:
val db = trades filter (_.counterparty == DeutscheBank) //MAPPING INPUT TO OUTPUT
This creates a new collection! It doesn't affect anyone who is looking at trades and is inherently safer.
Mapping
Suppose I have a List<Trade> and I want to get a Set<Stock> for the unique stocks which I have been trading. Again, the idiom in Java is to create a collection and mutate it.
Set<Stock> stocks = new HashSet<Stock>();
for (Trade t : trades) stocks.add(t.getStock()); //MUTATION
Using scala the correct thing to do is to map the input collection and then convert to a set:
val stocks = (trades map (_.stock)).toSet //MAPPING INPUT TO OUTPUT
Or, if we are concerned about performance:
(trades.view map (_.stock)).toSet
(trades.iterator map (_.stock)).toSet
What are the advantages here? Well:
My code can never observe a partially-constructed result
The application of a function A => B to a Coll[A] to get a Coll[B] is clearer.
Accumulating
Again, in Java the idiom has to be mutation. Suppose we are trying to sum the decimal quantities of the trades we have done:
BigDecimal sum = BigDecimal.ZERO
for (Trade t : trades) {
sum.add(t.getQuantity()); //MUTATION
}
Again, we must be very careful not to accidentally observe a partially-constructed result! In scala, we can do this in a single expression:
val sum = (0 /: trades)(_ + _.quantity) //MAPPING INTO TO OUTPUT
Or the various other forms:
(trades.foldLeft(0)(_ + _.quantity)
(trades.iterator map (_.quantity)).sum
(trades.view map (_.quantity)).sum
Oh, by the way, there is a bug in the Java implementation! Did you spot it?
I'd say it's the difference between:
var counter = 0
def updateCounter(toAdd: Int): Unit = {
counter += toAdd
}
updateCounter(8)
println(counter)
and:
val originalValue = 0
def addToValue(value: Int, toAdd: Int): Int = value + toAdd
val firstNewResult = addToValue(originalValue, 8)
println(firstNewResult)
This is a gross over simplification but fuller examples are things like using a foldLeft to build up a result rather than doing the hard work yourself: foldLeft example
What it means is that if you write pure functions like this you always get the same output from the same input, and there are no side effects, which makes it easier to reason about your programs and ensure that they are correct.
so for example the function:
def times2(x:Int) = x*2
is pure, while
def add5ToList(xs: MutableList[Int]) {
xs += 5
}
is impure because it edits data in place as a side effect. This is a problem because that same list could be in use elsewhere in the the program and now we can't guarantee the behaviour because it has changed.
A pure version would use immutable lists and return a new list
def add5ToList(xs: List[Int]) = {
5::xs
}
There are plenty examples with collections, which are easy to come by but might give the wrong impression. This concept works at all levels of the language (it doesn't at the VM level, however). One example is the case classes. Consider these two alternatives:
// Java-style
class Person(initialName: String, initialAge: Int) {
def this(initialName: String) = this(initialName, 0)
private var name = initialName
private var age = initialAge
def getName = name
def getAge = age
def setName(newName: String) { name = newName }
def setAge(newAge: Int) { age = newAge }
}
val employee = new Person("John")
employee.setAge(40) // we changed the object
// Scala-style
case class Person(name: String, age: Int) {
def this(name: String) = this(name, 0)
}
val employee = new Person("John")
val employeeWithAge = employee.copy(age = 40) // employee still exists!
This concept is applied on the construction of the immutable collection themselves: a List never changes. Instead, new List objects are created when necessary. Use of persistent data structures reduce the copying that would happen on a mutable data structure.

How should I avoid unintentionally capturing the local scope in function literals?

I'll ask this with a Scala example, but it may well be that this affects other languages which allow hybrid imperative and functional styles.
Here's a short example (UPDATED, see below):
def method: Iterator[Int] {
// construct some large intermediate value
val huge = (1 to 1000000).toList
val small = List.fill(5)(scala.util.Random.nextInt)
// accidentally use huge in a literal
small.iterator filterNot ( huge contains _ )
}
Now iterator.filterNot works lazily, which is great! As a result, we'd expect that the returned iterator won't consume much memory (indeed, O(1)). Sadly, however, we've made a terrible mistake: since filterNot is lazy, it keeps a reference to the function literal huge contains _.
Thus while we thought that the method would require a large amount of memory while it was running, and that that memory could be freed up immediately after the termination of the method, in fact that memory is stuck until we forget the returned Iterator.
(I just made such a mistake, which took a long time to track down! You can catch such things looking at heap dumps ...)
What are best practices for avoiding this problem?
It seems that the only solution is to carefully check for function literals which survive the end of the scope, and which captured intermediate variables. This is a bit awkward if you're constructing a non-strict collection and planning on returning it. Can anyone think of some nice tricks, Scala-specific or otherwise, that avoid this problem and let me write nice code?
UPDATE: the example I'd given previously was stupid, as huynhjl's answer below demonstrates. It had been:
def method: Iterator[Int] {
val huge = (1 to 1000000).toList // construct some large intermediate value
val n = huge.last // do some calculation based on it
(1 to n).iterator map (_ + 1) // return some small value
}
In fact, now that I understand a bit better how these things work, I'm not so worried!
Are you sure you're not oversimplifying the test case? Here is what I run:
object Clos {
def method: Iterator[Int] = {
val huge = (1 to 2000000).toList
val n = huge.last
(1 to n).iterator map (_ + 1)
}
def gc() { println("GC!!"); Runtime.getRuntime.gc }
def main(args:Array[String]) {
val list = List(method, method, method)
list.foreach(m => println(m.next))
gc()
list.foreach(m => println(m.next))
list.foreach(m => println(m.next))
}
}
If I understand you correctly, because main is using the iterators even after the gc() call, the JVM would be holding onto the huge objects.
This is how I run it:
JAVA_OPTS="-verbose:gc" scala -cp classes Clos
This is what it prints towards the end:
[Full GC 57077K->57077K(60916K), 0.3340941 secs]
[Full GC 60852K->60851K(65088K), 0.3653304 secs]
2
2
2
GC!!
[Full GC 62959K->247K(65088K), 0.0610994 secs]
3
3
3
4
4
4
So it looks to me as if the huge objects were reclaimed...