Scala Collection Specific Implementation

Say I have some data in a Seq in Scala 2.10.2, e.g:
scala> val data = Seq( 1, 2, 3, 4, 5 )
data: Seq[Int] = List(1, 2, 3, 4, 5)
Now, I perform some operations and convert it to a Map
scala> val pairs = data.map( i => i -> i * 2 )
pairs: Seq[(Int, Int)] = List((1,2), (2,4), (3,6), (4,8), (5,10))
scala> val pairMap = pairs.toMap
pairMap: scala.collection.immutable.Map[Int,Int] = Map(5 -> 10, 1 -> 2, 2 -> 4, 3 -> 6, 4 -> 8)
Now say, for performance reasons, I'd like pairMap to use the HashMap implementation of Map. What's the best way to achieve this?
Ways I've considered:
Casting:
pairMap.asInstanceOf[scala.collection.immutable.HashMap[Int,Int]]
This seems a bit horrible.
Manually converting:
var hm = scala.collection.immutable.HashMap[Int,Int]()
pairMap.foreach( p => hm += p )
But this isn't very functional.
Using the builder
scala.collection.immutable.HashMap[Int,Int]( pairMap.toSeq:_* )
This works, but it's not the most readable piece of code.
Is there a better way that I'm missing? If not, which of these is the best approach?

Interesting bit: it already is an immutable HashMap.
scala> val data = Seq( 1, 2, 3, 4, 5 )
data: Seq[Int] = List(1, 2, 3, 4, 5)
scala> val pairs = data.map( i => i -> i * 2 )
pairs: Seq[(Int, Int)] = List((1,2), (2,4), (3,6), (4,8), (5,10))
scala> val pairMap = pairs.toMap
pairMap: scala.collection.immutable.Map[Int,Int] =
Map(5 -> 10, 1 -> 2, 2 -> 4, 3 -> 6, 4 -> 8)
scala> pairMap.getClass
res0: Class[_ <: scala.collection.immutable.Map[Int,Int]] =
class scala.collection.immutable.HashMap$HashTrieMap
Note: casting to a HashMap doesn't change the underlying object at all. If you want to guarantee building a HashMap (or some other specific type), then I'd recommend this:
scala> import scala.collection.immutable
import scala.collection.immutable
scala> val pairMap = immutable.HashMap(pairs: _*)
pairMap: scala.collection.immutable.HashMap[Int,Int] =
Map(5 -> 10, 1 -> 2, 2 -> 4, 3 -> 6, 4 -> 8)
If you're looking for performance improvements, you should look into using a mutable.HashMap or java.util.HashMap. Most of Scala's collections are out-performed by the native java.util collections.
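For illustration, a minimal sketch (my addition, not from the original answer) of both mutable alternatives, reusing the pairs sequence from the question:

import scala.collection.mutable

// Scala's mutable HashMap, built directly from the existing pairs
val m1 = mutable.HashMap(pairs: _*)

// a plain java.util.HashMap, filled imperatively
val m2 = new java.util.HashMap[Int, Int]()
pairs.foreach { case (k, v) => m2.put(k, v) }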

You can combine
- an explicit result type,
- map (to pair each key with its value in a tuple), and
- breakOut (to "break out" of the sequence of tuples from map and create the target type directly)
like this:
val s = Seq.range(1, 6)
val m: scala.collection.immutable.HashMap[Int, Int] =
  s.map(n => (n, n * n))(scala.collection.breakOut)
which creates the HashMap on the fly, without building an intermediate collection.
By using breakOut together with the explicit result type, an appropriate builder for map is chosen, and your target type is created directly.
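For comparison, here is the two-pass equivalent (a sketch I've added, not part of the original answer): map first builds an intermediate Seq[(Int, Int)], which the HashMap varargs factory then copies into the final map.

val m2: scala.collection.immutable.HashMap[Int, Int] =
  scala.collection.immutable.HashMap(s.map(n => (n, n * n)): _*)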

Related

Scala: How to "map" an Array[Int] to a Map[String, Int] using the "map" method?

I have the following Array[Int]: val array = Array(1, 2, 3), for which I have the following mapping relation between an Int and a String:
val a1 = array.map {
  case 1 => "A"
  case 2 => "B"
  case 3 => "C"
}
To create a Map to contain the above mapping relation, I am aware that I can use a foldLeft method:
val a2 = array.foldLeft(Map[String, Int]()) { (m, e) =>
  m + (e match {
    case 1 => ("A", 1)
    case 2 => "B" -> 2
    case 3 => "C" -> 3
  })
}
which outputs:
a2: scala.collection.immutable.Map[String,Int] = Map(A -> 1, B -> 2, C -> 3)
This is the result I want. But can I achieve the same result via the map method?
The following codes do not work:
val a3 = array.map[(String, Int), Map[String, Int]] {
  case 1 => ("A", 1)
  case 2 => ("B", 2)
  case 3 => ("C", 3)
}
The signature of map is
def map[B, That](f: A => B)
  (implicit bf: CanBuildFrom[Repr, B, That]): That
What is this CanBuildFrom[Repr, B, That]? I tried to read Tribulations of CanBuildFrom but don't really understand it. That article mentioned that Scala 2.12+ provides two implementations of map. But why couldn't I find them when using Scala 2.12.4?
I mostly use Scala 2.11.12.
Call toMap at the end of your expression:
val a3 = array.map {
  case 1 => ("A", 1)
  case 2 => ("B", 2)
  case 3 => ("C", 3)
}.toMap
I'll first define your function here for the sake of brevity in later explanation:
// worth noting that this function is effectively partial
// i.e. will throw a `MatchError` if n is not in (1, 2, 3)
def toPairs(n: Int): (String, Int) =
  n match {
    case 1 => "a" -> 1
    case 2 => "b" -> 2
    case 3 => "c" -> 3
  }
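As an aside (my addition, not part of the original answer), if the partiality is a concern, you can make it explicit with a PartialFunction and collect, which silently skips unmatched elements instead of throwing a MatchError:

val toPairsPF: PartialFunction[Int, (String, Int)] = {
  case 1 => "a" -> 1
  case 2 => "b" -> 2
  case 3 => "c" -> 3
}
val safePairs = Array(1, 2, 3, 99).collect(toPairsPF) // the 99 is dropped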
One possible way to go (as already highlighted in another answer) is to use toMap, which only works on collections of pairs:
val ns = Array(1, 2, 3)
ns.toMap // doesn't compile
ns.map(toPairs).toMap // does what you want
It is worth noting, however, that unless you are working with a lazy representation (like an Iterator or a Stream), this results in two passes over the collection and the creation of an unnecessary intermediate collection: the first pass maps toPairs over the collection, and the second turns the resulting collection of pairs into a Map (with toMap).
You can see it clearly in the implementation of toMap.
As suggested in the article you already linked (and in particular here), you can avoid this double pass in two ways:
you can leverage scala.collection.breakOut, an implementation of CanBuildFrom that you can pass to map (among others) to change the target collection, provided that you explicitly give the compiler a type hint:
val resultMap: Map[String, Int] = ns.map(toPairs)(collection.breakOut)
val resultSet: Set[(String, Int)] = ns.map(toPairs)(collection.breakOut)
otherwise, you can create a view over your collection, which wraps it in the lazy representation needed to avoid the double pass:
ns.view.map(toPairs).toMap
You can read more about implicit builder providers and views in this Q&A.
Basically, toMap (credits to Sergey Lagutin) is the right answer.
You could actually make the code a bit more compact, though (since 'A' has character code 65, i + 64 maps 1 to 'A', 2 to 'B', and so on):
val a1 = array.map { i => ((i + 64).toChar, i) }.toMap
If you run this code:
val array = Array(1, 2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 0)
val a1 = array.map { i => ((i + 64).toChar, i) }.toMap
println(a1)
You will see this on the console:
Map(E -> 5, J -> 10, F -> 6, A -> 1, @ -> 0, G -> 7, L -> 12, B -> 2, C -> 3, H -> 8, K -> 11, D -> 4)

Lift algebird aggregator to consume (and return) Map

The example in the README is very elegant:
scala> Map(1 -> Max(2)) + Map(1 -> Max(3)) + Map(2 -> Max(4))
res0: Map[Int,Max[Int]] = Map(2 -> Max(4), 1 -> Max(3))
Essentially the use of Map here is equivalent to SQL's group by.
But how do I do the same with an arbitrary Aggregator? For example, to achieve the same thing as the code above (but without the Max wrapper class):
scala> import com.twitter.algebird._
scala> val mx = Aggregator.max[Int]
mx: Aggregator[Int,Int,Int] = MaxAggregator(scala.math.Ordering$Int$@78c77)
scala> val mxOfMap = // what goes here?
mxOfMap: Aggregator[Map[Int,Int],Map[Int,Int],Map[Int,Int]] = ...
scala> mxOfMap.reduce(List(Map(1 -> 2), Map(1 -> 3), Map(2 -> 4)))
res0: Map[Int,Int] = Map(2 -> 4, 1 -> 3)
In other words, how do I convert (or "lift") an Aggregator that operates on values of type T into an Aggregator that operates on values of type Map[K,T] (for some arbitrary K)?
Looks like this can be done fairly easily for Semigroup, at least. This should be sufficient when there is no additional logic in the "prepare" or "present" phases of the aggregator that needs to be preserved (a Semigroup can be obtained from an Aggregator, discarding prepare and present).
The code to answer the original question is:
scala> val sgOfMap = Semigroup.mapSemigroup[Int,Int](mx.semigroup)
scala> val mxOfMap = Aggregator.fromSemigroup(sgOfMap)
scala> mxOfMap.reduce(List(Map(1 -> 2), Map(1 -> 3), Map(2 -> 4)))
res0: Map[Int,Int] = Map(2 -> 4, 1 -> 3)
But in practice, it would be better to start by constructing the arbitrary Semigroup directly, rather than constructing an Aggregator merely to extract the semigroup from it:
scala> import com.twitter.algebird._
scala> val mx = Semigroup.from { (x: Int, y: Int) => Math.max(x, y) }
scala> val mxOfMap = Semigroup.mapSemigroup[Int,Int](mx)
scala> mxOfMap.sumOption(List(Map(1 -> 2), Map(1 -> 3), Map(2 -> 4)))
res33: Option[Map[Int,Int]] = Some(Map(2 -> 4, 1 -> 3))
Alternatively, convert it back to an aggregator: Aggregator.fromSemigroup(mxOfMap).
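Putting the pieces together, here is a hypothetical generic helper (my sketch, using only the mapSemigroup and fromSemigroup calls shown above) that lifts any value-level Semigroup to an Aggregator over maps:

import com.twitter.algebird._

// lift a Semigroup[V] to an Aggregator that merges Map[K, V] values key-wise
def liftToMap[K, V](sg: Semigroup[V]): Aggregator[Map[K, V], Map[K, V], Map[K, V]] =
  Aggregator.fromSemigroup(Semigroup.mapSemigroup[K, V](sg))

// e.g. liftToMap[Int, Int](mx) reproduces mxOfMap from the session above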

Scala - final map after groupBy / map not sorted even when initial list sorted [duplicate]

This question already has answers here: Why does groupBy in Scala change the ordering of a list's items?
This is very simple code that can also be executed inside a Scala worksheet. It is a map-reduce kind of approach to calculating the frequency of the numbers in the list.
I am sorting the list before starting the groupBy and map operations. Even then, the list.groupBy.map operation generates a map that is not sorted, neither by number nor by frequency.
//put this code in Scala worksheet
//this list is sorted and stored in the list variable
val list = List(1,2,4,2,4,7,3,2,4).sorted
//now you can see list is sorted
list
//now applying groupBy and map operation to create frequency map
val freqMap = list.groupBy(x => x) map { case (k, v) => k -> v.length }
freqMap
groupBy doesn't guarantee any order
val list = List(1,2,4,2,4,7,3,2,4).sorted
val freqMap = list.groupBy(x => x)
Output:
freqMap: scala.collection.immutable.Map[Int,List[Int]] = Map(1 -> List(1), 2 -> List(2, 2, 2), 7 -> List(7), 3 -> List(3), 4 -> List(4, 4, 4))
groupBy takes the list and groups the elements. It builds a Map in which each key is a unique value of the list and each value is a List of all that value's occurrences in the list.
Here is the official method definition from the Scala docs:
def groupBy [K] (f: (A) ⇒ K): Map[K, Traversable[A]]
If you want to order the grouped result, you can do it with a ListMap:
scala> val freqMap = list.groupBy(x => x)
freqMap: scala.collection.immutable.Map[Int,List[Int]] = Map(1 -> List(1), 2 -> List(2, 2, 2), 7 -> List(7), 3 -> List(3), 4 -> List(4, 4, 4))
scala> import scala.collection.immutable.ListMap
import scala.collection.immutable.ListMap
scala> ListMap(freqMap.toSeq.sortBy(_._1):_*)
res0: scala.collection.immutable.ListMap[Int,List[Int]] = Map(1 -> List(1), 2 -> List(2, 2, 2), 3 -> List(3), 4 -> List(4, 4, 4), 7 -> List(7))
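Alternatively (a sketch of my own, not from the original answer), a TreeMap keeps its keys sorted by their implicit Ordering, so no explicit sortBy is needed:

import scala.collection.immutable.TreeMap

val sortedFreq = TreeMap(freqMap.toSeq: _*)
// keys now iterate in sorted order: 1, 2, 3, 4, 7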

Scala hash of stacks has only one stack for all the keys

I have the following hashmap, where each element should be mapped to a stack:
import scala.collection.mutable.{HashMap, Stack}

var pos = new HashMap[Int, Stack[Int]] withDefaultValue Stack.empty[Int]
for (i <- a.length - 1 to 0 by -1) {
  pos(a(i)).push(i)
}
If a has the elements {4, 6, 6, 4, 6, 6},
and if I add the following lines after the code above:
println("pos(0) is " + pos(0))
println("pos(4) is " + pos(4))
The output will be:
pos(0) is Stack(0, 1, 2, 3, 4, 5)
pos(4) is Stack(0, 1, 2, 3, 4, 5)
Why is this happening?
I don't want to add elements to pos(0), but only to pos(4) and pos(6) (the elements of a).
It looks like there is only one stack mapped to all the keys. I want a stack for each key.
Check the docs:
http://www.scala-lang.org/api/current/index.html#scala.collection.mutable.HashMap
The withDefaultValue method takes this value as a regular parameter; it won't be recalculated, so all your entries share the same mutable stack.
def withDefaultValue(d: B): Map[A, B]
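A quick way to see the sharing (a minimal sketch I've added, using the mutable collections from the question):

import scala.collection.mutable.{HashMap, Stack}

val m = new HashMap[Int, Stack[Int]] withDefaultValue Stack.empty[Int]
// every missing key returns the single stack instance created above
println(m(0) eq m(4)) // true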
You should use the withDefault method instead.
val pos = new HashMap[Int, Stack[Int]] withDefault (_ => Stack.empty[Int])
Edit
The above solution doesn't seem to work; I get empty stacks. Checking the sources shows that the default value is returned but never put into the map:
override def apply(key: A): B = {
  val result = findEntry(key)
  if (result eq null) default(key)
  else result.value
}
I guess one solution might be to override the apply or default method to add the entry to the map before returning it. Example for the default method:
import scala.collection.mutable

val pos = new mutable.HashMap[Int, mutable.Stack[Int]]() {
  override def default(key: Int) = {
    val newValue = mutable.Stack.empty[Int]
    this += key -> newValue
    newValue
  }
}
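Quickly checking the fixed map against the question's data (my sketch; note that even reading a missing key now inserts an empty stack as a side effect):

val a = Array(4, 6, 6, 4, 6, 6)
for (i <- a.length - 1 to 0 by -1) {
  pos(a(i)).push(i)
}
println(pos(4)) // Stack(0, 3)
println(pos(6)) // Stack(1, 2, 4, 5)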
Btw., that is the punishment for being mutable; I encourage you to use immutable data structures.
If you are looking for a more idiomatic Scala, functional-style solution without those mutable collections, consider this:
scala> val a = List(4, 6, 6, 4, 6, 6)
a: List[Int] = List(4, 6, 6, 4, 6, 6)
scala> val pos = a.zipWithIndex groupBy {_._1} mapValues { _ map (_._2) }
pos: scala.collection.immutable.Map[Int,List[Int]] = Map(4 -> List(0, 3), 6 -> List(1, 2, 4, 5))
It may look confusing at first, but if you break it down, zipWithIndex gets pairs of values and their positions, groupBy makes a map from each value to a list of entries, and mapValues is then used to turn the lists of (value, position) pairs into just lists of positions.
scala> val pairs = a.zipWithIndex
pairs: List[(Int, Int)] = List((4,0), (6,1), (6,2), (4,3), (6,4), (6,5))
scala> val pairsByValue = pairs groupBy (_._1)
pairsByValue: scala.collection.immutable.Map[Int,List[(Int, Int)]] = Map(4 -> List((4,0), (4,3)), 6 -> List((6,1), (6,2), (6,4), (6,5)))
scala> val pos = pairsByValue mapValues (_ map (_._2))
pos: scala.collection.immutable.Map[Int,List[Int]] = Map(4 -> List(0, 3), 6 -> List(1, 2, 4, 5))

Why doesn't Spark allow map-side combining with array keys?

I'm using Spark 1.3.1 and I'm curious why Spark doesn't allow using array keys on map-side combining.
Here is the relevant piece of the combineByKey function:
if (keyClass.isArray) {
  if (mapSideCombine) {
    throw new SparkException("Cannot use map-side combining with array keys.")
  }
}
Basically, for the same reason the default partitioner cannot partition array keys.
A Scala Array is just a wrapper around a Java array, and its hashCode doesn't depend on the content:
scala> val x = Array(1, 2, 3)
x: Array[Int] = Array(1, 2, 3)
scala> val h = x.hashCode
h: Int = 630226932
scala> x(0) = -1
scala> x.hashCode() == h
res3: Boolean = true
Similarly, two arrays with exactly the same content are not equal:
scala> x
res4: Array[Int] = Array(-1, 2, 3)
scala> val y = Array(-1, 2, 3)
y: Array[Int] = Array(-1, 2, 3)
scala> y == x
res5: Boolean = false
As a result, Arrays cannot be used as meaningful keys. If you're not convinced, just check what happens when you use an Array as a key for a Scala Map:
scala> Map(Array(1) -> 1, Array(1) -> 2)
res7: scala.collection.immutable.Map[Array[Int],Int] = Map(Array(1) -> 1, Array(1) -> 2)
If you want to use a collection as a key, you should use an immutable data structure like a Vector or a List.
scala> Map(Array(1).toVector -> 1, Array(1).toVector -> 2)
res15: scala.collection.immutable.Map[Vector[Int],Int] = Map(Vector(1) -> 2)
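The same fix applies to the Spark case (a hypothetical sketch; rdd stands for an assumed RDD[(Array[Int], Int)]): convert the array key to an immutable collection before any byKey operation:

// Vector has content-based equals/hashCode, so it partitions correctly
val fixed = rdd.map { case (k, v) => (k.toVector, v) }
val combined = fixed.reduceByKey(_ + _)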
See also:
SI-1607
How does HashPartitioner work?
A list as a key for PySpark's reduceByKey