Scala practices: lists and case classes - scala

I've just started using Scala/Spark and having come from a Java background and I'm still trying to wrap my head around the concept of immutability and other best practices of Scala.
This is a very small segment of code from a larger program:
intersections is RDD(Key, (String, String))
obs is (Key, (String, String))
Data is just a case class I've defined above.
val intersections = map1 join map2
var listOfDatas = List[Data]()
intersections take NumOutputs foreach (obs => {
listOfDatas ::= ParseInformation(obs._1.key, obs._2._1, obs._2._2)
})
listOfDatas foreach println
This code works and does what I need it to do, but I was wondering if there was a better way of making this happen. I'm using a variable list and rewriting it with a new list every single time I iterate, and I'm sure there has to be a better way to create an immutable list that's populated with the results of the ParseInformation method call. Also, I remember reading somewhere that instead of accessing the tuple values directly, the way I have done, you should use case classes within functions (as partial functions I think?) to improve readability.
Thanks in advance for any input!

This might work locally, but only because you are takeing locally. It will not work once distributed as the listOfDatas is passed to each worker as a copy. The better way of doing this IMO is:
val processedData = intersections map{case (key, (item1, item2)) => {
ParseInfo(key, item1, item2)
}}
processedData foreach println
A note for a new to functional dev: If all you are trying to do is transform data in an iterable (List), forget foreach. Use map instead, which runs your transformation on each item and spits out a new iterable of the results.

What's the type of intersections? It looks like you can replace foreach with map:
val listOfDatas: List[Data] =
intersections take NumOutputs map (obs => {
ParseInformation(obs._1.key, obs._2._1, obs._2._2)
})

Related

HashMap only has one element instead of three

I am filling a HashMap in Scala like so:
val hashMap = new HashMap[P, List[T]]() { list.map(x => put(x.param1, x.param1.elements)) }
The problem is that hashMap will have only a size of 1 while list has a size of 3.
What am I doing wrong here?
You''re mixing imperative commands (put, new HashMap) with functional constructs (map). This cannot behave nicely.
What you should do (if I understand your goal correctly):
list.map(x => x.param1 -> x.param1.elements).toMap[P, List[T]]
Also, beware that if several elements in your list have the same param1, only the last one will be kept, since Map can only have one value for a given key.

Why can not modify a map in foreach?

I am new to Scala and use Spark to process data. Why does the following code fail to change the categoryMap?
import scala.collection.mutable.LinkedHashMap
val catFile=sc.textFile(inputFile);
var categoryMap=LinkedHashMap[Int,Tuple2[String,Int]]()
catFile.foreach(line => {
val strs=line.split("\001");
categoryMap += (strs(0).toInt -> (strs(2),strs(3).toInt));
})
It's a good practice to try to stay away from both mutable data structures and vars. Sometimes they are needed, but mostly this kind of processing is easy to do by chaining transformation operations on collections. Also, .toMap is handy to convert a Seq containing Tuple2's to a Map.
Here's one way (that I didn't test properly):
val categoryMap = catFile map { _.split("\001") } map { array =>
(array(0).toInt, (array(2), array(3).toInt))
} toMap
Note that if there are more than one record corresponding to a key then only the last one will be present in the resulting map.
Edit: I didn't actually answer your original question - based on a quick test it results in a similar map to what my code above produces. Blind guess, you should make sure that your catFile actually contains data to process.

Using `map` function on Map in Scala

First off, apologies for the lame question. I am reading the `Scala for the Impatient' religiously and trying to solve all the exercise questions (and doing some minimal exploration)
Background :
The exercise question goes like - Setup a map of prices for a number of gizmos that you covet. Then produce a second map with the same keys and the prices at a 10% discount.
Unfortunately, at this point, most parts of the scaladoc are still cryptic to me but I understand that the map function of the Map takes a function and returns another map after applying a function (I guess?) - def map[B](f: (A) ⇒ B): HashMap[B]. I tried googling but couldnt get much useful results for map function for Map in scala :-)
My Question:
As attempted in my variation 3, does using map function for this purpose make any sense or should I stick with the variation 2 which actually solves my problem.
Code :
val gizmos:Map[String,Double]=Map("Samsung Galaxy S4 Zoom"-> 1000, "Mac Pro"-> 6000.10, "Google Glass"->2000)
//1. Normal for/yield
val discountedGizmos=(for ((k,v)<-gizmos) yield (k, v*0.9)) //Works fine
//2. Variation using mapValues
val discGizmos1=gizmos.mapValues(_*0.9) //Works fine
//3. Variation using only map function
val discGizmos2=gizmos.map((_,v) =>v*0.9) //ERROR : Wrong number of parameters: expected 1
In this case, mapValues does seem the more appropriate method to use. You would use the map method when you need to perform a transformation that requires knowledge of the keys (eg. converting a product reference into a product name, say).
That said, the map method is more general as it gives you acces to both the keys and values for you to act upon, and you could emulate the mapValues method by simply transforming the values and passing the keys through untouched - and that is where you are going wrong in your code above. To use the map method correctly, you should be producing a (key, value) pair from your function, not just a key:
val discGizmos2=gizmos.map{ case (k,v) => (k,v*0.9) } // pass the key through unchanged
It can be also:
val discGizmos2 = gizmos.map(kv => (kv._1, kv._2*0.9))

Converting a Scala Map to a List

I have a map that I need to map to a different type, and the result needs to be a List. I have two ways (seemingly) to accomplish what I want, since calling map on a map seems to always result in a map. Assuming I have some map that looks like:
val input = Map[String, List[Int]]("rk1" -> List(1,2,3), "rk2" -> List(4,5,6))
I can either do:
val output = input.map{ case(k,v) => (k.getBytes, v) } toList
Or:
val output = input.foldRight(List[Pair[Array[Byte], List[Int]]]()){ (el, res) =>
(el._1.getBytes, el._2) :: res
}
In the first example I convert the type, and then call toList. I assume the runtime is something like O(n*2) and the space required is n*2. In the second example, I convert the type and generate the list in one go. I assume the runtime is O(n) and the space required is n.
My question is, are these essentially identical or does the second conversion cut down on memory/time/etc? Additionally, where can I find information on storage and runtime costs of various scala conversions?
Thanks in advance.
My favorite way to do this kind of things is like this:
input.map { case (k,v) => (k.getBytes, v) }(collection.breakOut): List[(Array[Byte], List[Int])]
With this syntax, you are passing to map the builder it needs to reconstruct the resulting collection. (Actually, not a builder, but a builder factory. Read more about Scala's CanBuildFroms if you are interested.) collection.breakOut can exactly be used when you want to change from one collection type to another while doing a map, flatMap, etc. — the only bad part is that you have to use the full type annotation for it to be effective (here, I used a type ascription after the expression). Then, there's no intermediary collection being built, and the list is constructed while mapping.
Mapping over a view in the first example could cut down on the space requirement for a large map:
val output = input.view.map{ case(k,v) => (k.getBytes, v) } toList

Scala: What is the most efficient way convert a Map[K,V] to an IntMap[V]?

Let"s say I have a class Point with a toInt method, and I have an immutable Map[Point,V], for some type V. What is the most efficient way in Scala to convert it to an IntMap[V]? Here is my current implementation:
def pointMap2IntMap[T](points: Map[Point,T]): IntMap[T] = {
var result: IntMap[T] = IntMap.empty[T]
for(t <- points) {
result += (t._1.toInt, t._2)
}
result
}
[EDIT] I meant primarily faster, but I would also be interested in shorter versions, even if they are not obviously faster.
IntMap has a built-in factory method (apply) for this:
IntMap(points.map(p => (p._1.toInt, p._2)).toSeq: _*)
If speed is an issue, you may use:
points.foldLeft(IntMap.empty[T])((m, p) => m.updated(p._1.toInt, p._2))
A one liner that uses breakOut to obtain an IntMap. It does a map to a new collection, using a custom builder factory CanBuildFrom which the breakOut call resolves:
Map[Int, String](1 -> "").map(kv => kv)(breakOut[Map[Int, String], (Int, String), immutable.IntMap[String]])
In terms of performance, it's hard to tell, but it creates a new IntMap, goes through all the bindings and adds them to the IntMap. A handwritten iterator while loop (preceded with a pattern match to check if the source map is an IntMap) would possibly result in somewhat better performance.