How to un-nest a Spark RDD of type (String, scala.collection.immutable.Map[String, scala.collection.immutable.Map[String, Int]]) - scala

It's a nested map with contents like this when I print it to the screen:
(5, Map ( "ABCD" -> Map("3200" -> 3,
"3350.800" -> 4,
"200.300" -> 3)
(1, Map ( "DEF" -> Map("1200" -> 32,
"1320.800" -> 4,
"2100" -> 3)
I need to get something like this:
CaseClass(5, ABCD, 3200, 3)
CaseClass(5, ABCD, 3350.800, 4)
CaseClass(5, ABCD, 200.300, 3)
CaseClass(1, DEF, 1200, 32)
CaseClass(1, DEF, 1320.800, 4)
etc. etc.
Basically I need a list of case classes, so that I can map each row to a case class object and save it to Cassandra.
I have tried flatMapValues, but that un-nests the map only one level. I also used flatMap; that doesn't work either, or I'm making mistakes.
Any suggestions?

Fairly straightforward using a for-comprehension and some pattern matching to destructure things:
val in = List((5, Map("ABCD" -> Map("3200" -> 3, "3350.800" -> 4, "200.300" -> 3))),
              (1, Map("DEF" -> Map("1200" -> 32, "1320.800" -> 4, "2100" -> 3))))

case class Thing(a: Int, b: String, c: String, d: Int)

for { (index, m)       <- in
      (k, v)           <- m
      (innerK, innerV) <- v }
yield Thing(index, k, innerK, innerV)
//> res0: List[maps.maps2.Thing] = List(Thing(5,ABCD,3200,3),
// Thing(5,ABCD,3350.800,4),
// Thing(5,ABCD,200.300,3),
// Thing(1,DEF,1200,32),
// Thing(1,DEF,1320.800,4),
// Thing(1,DEF,2100,3))
So let's pick apart the for-comprehension.
(index, m) <- in
This is the same as
t <- in
(index, m) = t
In the first line t will successively be set to each element of in.
t is therefore a tuple (Int, Map(...)).
Pattern matching lets us put that pattern for the tuple on the left-hand side, and the compiler picks apart the tuple, setting index to the Int and m to the Map.
(k, v) <- m
As before this is equivalent to
u <- m
(k, v) = u
And this time u takes each element of the Map, which again is a tuple of key and value. So k is set successively to each key and v to the corresponding value.
And v is your inner map, so we do the same thing again with the inner map:
(innerK, innerV) <- v
Now we have everything we need to create the case class. yield just says make a collection of whatever is "yielded" each time through the loop.
yield Thing(index, k, innerK, innerV)
Under the hood, this just translates to a series of maps/flatMaps.
The yield is just the value Thing(index, k, innerK, innerV)
We get one of those for each element of v
v.map { x => val (innerK, innerV) = x; Thing(index, k, innerK, innerV) }
but there's an inner map per element of the outer map
m.flatMap { y => val (k, v) = y; v.map { x => val (innerK, innerV) = x; Thing(index, k, innerK, innerV) } }
(flatMap because we get a List of Lists if we just did a map and we want to flatten it to just the list of items)
Similarly, we do one of those for every element in the List
in.flatMap { z => val (index, m) = z; m.flatMap { y => val (k, v) = y; v.map { x => val (innerK, innerV) = x; Thing(index, k, innerK, innerV) } } }
Let's do that in _1, _2 style:
in.flatMap { z => z._2.flatMap { y => y._2.map { x => Thing(z._1, y._1, x._1, x._2) } } }
which produces exactly the same result. But isn't it clearer as a for-comprehension?
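To double-check the equivalence, here is a minimal, self-contained sketch (reusing the in and Thing definitions from above) that asserts both forms produce the same list:

// a sketch: both forms build the same List[Thing]
val viaFor = for {
  (index, m)       <- in
  (k, v)           <- m
  (innerK, innerV) <- v
} yield Thing(index, k, innerK, innerV)

val viaFlatMap =
  in.flatMap { z => z._2.flatMap { y => y._2.map { x => Thing(z._1, y._1, x._1, x._2) } } }

assert(viaFor == viaFlatMap) // same six Things, in the same order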

You can do it like this if you prefer collection operations:
case class Record(v1: Int, v2: String, v3: Double, v4: Int)
val data = List(
  (5, Map("ABC" ->
    Map(
      3200.0   -> 3,
      3350.800 -> 4,
      200.300  -> 3))
  ),
  (1, Map("DEF" ->
    Map(
      1200.0   -> 32,
      1320.800 -> 4,
      2100.0   -> 3))
  )
)
val rdd = sc.parallelize(data)
val result = rdd.flatMap(p => {
  p._2.toList
    .flatMap(q => q._2.toList.map(l => (q._1, l)))
    .map((p._1, _))
}).map(p => Record(p._1, p._2._1, p._2._2._1, p._2._2._2))
println(result.collect.toList)
//List(
// Record(5,ABC,3200.0,3),
// Record(5,ABC,3350.8,4),
// Record(5,ABC,200.3,3),
// Record(1,DEF,1200.0,32),
// Record(1,DEF,1320.8,4),
// Record(1,DEF,2100.0,3)
//)
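Since the original question mentions saving to Cassandra: assuming the DataStax spark-cassandra-connector is on the classpath and a matching table already exists, the resulting RDD of case classes can be written out directly. The keyspace and table names below are hypothetical:

// a sketch, assuming the spark-cassandra-connector dependency and a table
// whose columns line up with the Record fields (v1, v2, v3, v4)
import com.datastax.spark.connector._

result.saveToCassandra("my_keyspace", "records") // hypothetical keyspace/table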

Related

Scala: How to "map" an Array[Int] to a Map[String, Int] using the "map" method?

I have the following Array[Int]: val array = Array(1, 2, 3), for which I have the following mapping relation between an Int and a String:
val a1 = array.map {
  case 1 => "A"
  case 2 => "B"
  case 3 => "C"
}
To create a Map to contain the above mapping relation, I am aware that I can use a foldLeft method:
val a2 = array.foldLeft(Map[String, Int]()) { (m, e) =>
  m + (e match {
    case 1 => ("A", 1)
    case 2 => "B" -> 2
    case 3 => "C" -> 3
  })
}
which outputs:
a2: scala.collection.immutable.Map[String,Int] = Map(A -> 1, B -> 2, C -> 3)
This is the result I want. But can I achieve the same result via the map method?
The following code does not work:
val a3 = array.map[(String, Int), Map[String, Int]] {
  case 1 => ("A", 1)
  case 2 => ("B", 2)
  case 3 => ("C", 3)
}
The signature of map is
def map[B, That](f: A => B)
                (implicit bf: CanBuildFrom[Repr, B, That]): That
What is this CanBuildFrom[Repr, B, That]? I tried to read Tribulations of CanBuildFrom but don't really understand it. That article mentions that Scala 2.12+ provides two implementations of map. But how come I couldn't find them when I used Scala 2.12.4?
I mostly use Scala 2.11.12.
Call toMap at the end of your expression:
val a3 = array.map {
  case 1 => ("A", 1)
  case 2 => ("B", 2)
  case 3 => ("C", 3)
}.toMap
I'll first define your function here for the sake of brevity in later explanation:
// worth noting that this function is effectively partial
// i.e. it will throw a `MatchError` if n is not in (1, 2, 3)
def toPairs(n: Int): (String, Int) =
  n match {
    case 1 => "a" -> 1
    case 2 => "b" -> 2
    case 3 => "c" -> 3
  }
One possible way to go (as already highlighted in another answer) is to use toMap, which only works on collections of pairs:
val ns = Array(1, 2, 3)
ns.toMap // doesn't compile
ns.map(toPairs).toMap // does what you want
It is worth noting, however, that unless you are working with a lazy representation (like an Iterator or a Stream) this will result in two passes over the collection and the creation of an unnecessary intermediate collection: the first pass maps toPairs over the collection, and the second turns the whole collection of pairs into a Map (with toMap).
You can see it clearly in the implementation of toMap.
As suggested in the article you already linked (and in particular here), you can avoid this double pass in two ways:
you can leverage scala.collection.breakOut, an implementation of CanBuildFrom that you can give to map (among others) to change the target collection, provided that you explicitly provide a type hint for the compiler:
val resultMap: Map[String, Int] = ns.map(toPairs)(collection.breakOut)
val resultSet: Set[(String, Int)] = ns.map(toPairs)(collection.breakOut)
otherwise, you can create a view over your collection, which wraps it in the lazy representation you need for the operation not to result in a double pass:
ns.view.map(toPairs).toMap
You can read more about implicit builder providers and views in this Q&A.
Basically toMap (credits to Sergey Lagutin) is the right answer.
You could actually make the code a bit more compact, though (note that this produces a Map[Char, Int]; map the Char to a String if you need a Map[String, Int]):
val a1 = array.map { i => ((i + 64).toChar, i) }.toMap
If you run this code:
val array = Array(1, 2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 0)
val a1 = array.map { i => ((i + 64).toChar, i) }.toMap
println(a1)
You will see this on the console:
Map(E -> 5, J -> 10, F -> 6, A -> 1, @ -> 0, G -> 7, L -> 12, B -> 2, C -> 3, H -> 8, K -> 11, D -> 4)

Invert a Map (String -> List) in Scala

I have a Map[String, List[String]] and I want to invert it. For example, if I have something like
"1" -> List("a","b","c")
"2" -> List("a","j","k")
"3" -> List("a","c")
The result should be
"a" -> List("1","2","3")
"b" -> List("1")
"c" -> List("1","3")
"j" -> List("2")
"k" -> List("2")
I've tried this:
m.map(_.swap)
But it returns a Map[List[String], String]:
List("a","b","c") -> "1"
List("a","j","k") -> "2"
List("a","c") -> "3"
Map inversion is a little more complicated.
val m = Map("1" -> List("a","b","c")
,"2" -> List("a","j","k")
,"3" -> List("a","c"))
m flatten {case(k, vs) => vs.map((_, k))} groupBy (_._1) mapValues {_.map(_._2)}
//res0: Map[String,Iterable[String]] = Map(j -> List(2), a -> List(1, 2, 3), b -> List(1), c -> List(1, 3), k -> List(2))
Flatten the Map into a collection of tuples. groupBy creates a new Map with the old values as the new keys. Then un-tuple the values by dropping the key (previously the value) element from each pair.
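To make each step concrete, here is a sketch of the intermediate values (using m.toList.flatMap, which is equivalent to the flatten call above):

// Step 1: one (value, key) pair per list element
val pairs = m.toList.flatMap { case (k, vs) => vs.map((_, k)) }
// List((a,1), (b,1), (c,1), (a,2), (j,2), (k,2), (a,3), (c,3))

// Step 2: group by the new keys (the old values)
val grouped = pairs.groupBy(_._1)
// Map(a -> List((a,1), (a,2), (a,3)), b -> List((b,1)), ...)

// Step 3: keep only the old keys in each group
val inverted = grouped.mapValues(_.map(_._2))
// Map(a -> List(1, 2, 3), b -> List(1), c -> List(1, 3), j -> List(2), k -> List(2))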
An alternative that does not rely on strange implicit arguments of flatten, as requested by yishaiz:
val m = Map(
  "1" -> List("a","b","c"),
  "2" -> List("a","j","k"),
  "3" -> List("a","c"))
val res = (for ((digit, chars) <- m.toList; c <- chars) yield (c, digit))
  .groupBy(_._1)         // group by characters
  .mapValues(_.unzip._2) // drop redundant digits from lists
res foreach println
gives:
(j,List(2))
(a,List(1, 2, 3))
(b,List(1))
(c,List(1, 3))
(k,List(2))
A simple nested for loop may be used to invert the map so that each value in the List of values becomes a key in the inverted map, with its original key as the value:
implicit class MapInverter[T](map: Map[T, List[T]]) {
  def invert: Map[T, T] = {
    val result = collection.mutable.Map.empty[T, T]
    for ((key, values) <- map) {
      for (v <- values) {
        result += (v -> key)
      }
    }
    result.toMap
  }
}
Usage:
Map(10 -> List(3, 2), 20 -> List(16, 17, 18, 19)).invert
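Note that invert returns a Map[T, T], so if the same value appears under several keys (as "a" does in the question above), later keys overwrite earlier ones. A sketch of a grouped variant that keeps every key, matching the shape the question asked for:

implicit class MapGroupInverter[T](map: Map[T, List[T]]) {
  // collects every key under which a value appears, instead of keeping just one
  def invertGrouped: Map[T, List[T]] =
    map.toList
      .flatMap { case (k, vs) => vs.map(_ -> k) }
      .groupBy(_._1)
      .mapValues(_.map(_._2))
      .toMap
}

Map("1" -> List("a", "b"), "2" -> List("a")).invertGrouped
// Map(a -> List(1, 2), b -> List(1))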

merge sets' elements with HashMap's key in scala

I hope there is an easy way to solve this.
I have two RDDs
g.vertices
(4,Set(5, 3))
(0,Set(1, 4))
(1,Set(2))
(6,Set())
(3,Set(0))
(5,Set(2))
(2,Set(1))
maps
Map(4 -> Set(5, 3))
Map(0 -> Set(1, 4))
Map(1 -> Set(2))
Map(6 -> Set())
Map(3 -> Set(0))
Map(5 -> Set(2))
Map(2 -> Set(1))
How can I do something like this?
(4,Map(5 -> Set(2), 3 -> Set(0)))
(0,Map(1 -> Set(2), 4 -> Set(5, 3)))
(1,Map(2 -> Set(1)))
(6,Map())
(3,Map(0 -> Set(1, 4)))
(5,Map(2 -> Set(1)))
(2,Map(1 -> Set(2)))
I want to combine the maps' keys with the elements of the sets; that is, I want to replace each element of a set with the map entry whose key matches that element.
I thought about:
val maps = g.vertices.map { case (id, attr) => HashMap(id -> attr) }
g.mapVertices { case (id, data) => data.map { case vId => maps.
  map { case i if i.keySet.contains(vId) => HashMap(vId -> i.values) } } }
but I get an error:
org.apache.spark.SparkException: RDD transformations and actions can
only be invoked by the driver, not inside of other transformations;
for example, rdd1.map(x => rdd2.values.count() * x) is invalid because
the values transformation and count action cannot be performed inside
of the rdd1.map transformation. For more information, see SPARK-5063.
This is a simple use case for join. In the following code, A is the type of the keys in g.vertices, K and V are the key and value types for maps:
def joinByKeys[A, K, V](sets: RDD[(A, Set[K])], maps: RDD[Map[K, V]]): RDD[(A, Map[K, V])] = {
  val flattenSets = sets.flatMap(p => p._2.map(_ -> p._1)) // create a pair for each element of the vertex sets
  val flattenMaps = maps.flatMap(identity)                 // create an RDD with all indexed values in the Maps
  flattenMaps.join(flattenSets).map {                      // join them by their key
    case (k, (v, a)) => (a, (k, v))                        // reorder to put the vertex id first
  }.aggregateByKey(Map.empty[K, V])(_ + _, _ ++ _)         // aggregate the maps
}
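A hedged usage sketch with the sample data from the question (sc is assumed; in the actual code, sets would come from g.vertices). Note that vertex 6, whose set is empty, produces no pairs and is therefore absent from the join result:

import org.apache.spark.rdd.RDD

val sets: RDD[(Long, Set[Long])] = sc.parallelize(Seq(
  (4L, Set(5L, 3L)), (0L, Set(1L, 4L)), (1L, Set(2L)), (6L, Set.empty[Long]),
  (3L, Set(0L)), (5L, Set(2L)), (2L, Set(1L))))

val maps: RDD[Map[Long, Set[Long]]] = sets.map { case (id, s) => Map(id -> s) }

joinByKeys(sets, maps).collect.foreach(println)
// e.g. (4,Map(5 -> Set(2), 3 -> Set(0))), (0,Map(1 -> Set(2), 4 -> Set(5, 3))), ...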

Scala: Cartesian product between a variable number of Arrays

I am a newbie in Scala and would appreciate any direction or help in solving the following problem.
Input
I have a Map[String, Array[Double]] that looks like follow:
Map(foo -> Array(12, 25, 100), bar -> Array(0.1, 0.001))
The Map can contain between 1 and 10 keys (depends on the some parameters in my application).
Processing
I would like to apply a cartesian product between the arrays of all keys and generate a structure that contains all the possible combinations of all values of all arrays.
In the example above, the cartesian product will create 3x2=6 different combinations: (12, 0.1), (12, 0.001), (25, 0.1), (25, 0.001), (100, 0.1) and (100, 0.001).
For the sake of another example, in some cases I might have three keys: the first key has an Array of 4 values, the second an Array of 5 values and the third an Array of 3 values; in this case the product has to generate 4x5x3=60 different combinations.
Desired output
Something like:
Map(config1 -> (foo -> 12, bar -> 0.1), config2 -> (foo -> 12, bar -> 0.001), config3 -> (foo -> 25, bar -> 0.1), config4 -> (foo -> 25, bar -> 0.001), config5 -> (foo -> 100, bar -> 0.1), config6 -> (foo -> 100, bar -> 0.001))
You can use a for comprehension to create a cartesian product of two lists, arrays, ...
val start = Map(
  'foo -> Array(12, 25, 100),
  'bar -> Array(0.1, 0.001),
  'baz -> Array(2))

// transform arrays into lists with values paired with map key
val pairedWithKey = start.map { case (k, v) => v.map(i => k -> i).toList }

val accumulator = pairedWithKey.head.map(x => Vector(x))

val cartesianProd = pairedWithKey.tail.foldLeft(accumulator)( (acc, elem) =>
  for { x <- acc; y <- elem } yield x :+ y
)
cartesianProd foreach println
// Vector(('foo,12), ('bar,0.1), ('baz,2))
// Vector(('foo,12), ('bar,0.001), ('baz,2))
// Vector(('foo,25), ('bar,0.1), ('baz,2))
// Vector(('foo,25), ('bar,0.001), ('baz,2))
// Vector(('foo,100), ('bar,0.1), ('baz,2))
// Vector(('foo,100), ('bar,0.001), ('baz,2))
You might want to add some checks before using head and tail.
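The question asked for a Map keyed by config names; here is a sketch building that from cartesianProd with zipWithIndex (the configN naming is taken from the desired output above):

// name each combination config1, config2, ... and turn the inner
// Vector of (key, value) pairs into a Map
val configs = cartesianProd.toList.zipWithIndex.map { case (combo, i) =>
  s"config${i + 1}" -> combo.toMap
}.toMap
// e.g. configs("config1") == Map('foo -> 12, 'bar -> 0.1, 'baz -> 2)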
Since the number of Arrays is dynamic, there is no way you can get tuples as the result.
You can, however, utilize recursion for your purpose:
def process(a: Map[String, Seq[Double]]) = {
  def product(a: List[(String, Seq[Double])]): Seq[List[(String, Double)]] =
    a match {
      case (name, values) :: tail =>
        for {
          result <- product(tail)
          value <- values
        } yield (name, value) :: result
      case Nil => Seq(List())
    }
  product(a.toList)
}
val a = Map("foo" -> List(12.0, 25.0, 100.0), "bar" -> List(0.1, 0.001))
println(process(a))
Which gives a result of:
List(List((foo,12.0), (bar,0.1)), List((foo,25.0), (bar,0.1)), List((foo,100.0), (bar,0.1)), List((foo,12.0), (bar,0.001)), List((foo,25.0), (bar,0.001)), List((foo,100.0), (bar,0.001)))

Fold from Map[String,List[Int]] to Map[String,Int]

I'm fairly new to Scala and functional approaches in general. I have a Map that looks something like this:
val myMap: Map[String, List[Int]]
I want to end up something that maps the key to the total of the associated list:
val totalsMap: Map[String, Int]
My initial hunch was to use a for comprehension:
val totalsMap = for (kvPair <- myMap) {
  kvPair._2.foldLeft(0)(_+_)
}
But I have no idea what I would put in the yield() clause in order to get a map out of the for comprehension.
You can use mapValues for this:
val totalMap = myMap.mapValues(_.sum)
But mapValues will recalculate the sum every time you get a key from the Map; e.g. if you do totalMap("a") multiple times, it will recalculate the sum each time.
If you don't want this, you should use
val totalMap = myMap map {
  case (k, v) => k -> v.sum
}
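A minimal sketch making the difference visible (the println is just a side effect added for illustration; this is the lazy mapValues behavior of Scala 2.x):

val myMap = Map("a" -> List(1, 2, 3))

val lazyTotals = myMap.mapValues { xs => println(s"summing $xs"); xs.sum }
lazyTotals("a") // prints "summing List(1, 2, 3)" and returns 6
lazyTotals("a") // prints again: the sum is recomputed on every lookup

val strictTotals = myMap.map { case (k, v) => k -> v.sum }
strictTotals("a") // the sums were computed once, when the map was built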
mapValues would be more suited for this case:
val m = Map[String, List[Int]]("a" -> List(1,2,3), "b" -> List(4,5,6))
m.mapValues(_.foldLeft(0)(_+_))
res1: scala.collection.immutable.Map[String,Int] = Map(a -> 6, b -> 15)
Or without foldLeft:
m.mapValues(_.sum)
val m = Map("hello" -> Seq(1, 1, 1, 1), "world" -> Seq(1, 1))
for ((k, v) <- m) yield (k, v.sum)
yields
Map(hello -> 4, world -> 2)
The for comprehension will return whatever monadic type you give it. In this case, m is a Map, so that's what's going to come out. The yield must return a tuple. The first element (which becomes the key in each Map entry) is the word you're counting, and the second element (you guessed it, the value in each Map entry) becomes the sum of the original sequence of counts.
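For reference, the for comprehension above desugars to roughly this map call, which is why a Map comes out:

val m = Map("hello" -> Seq(1, 1, 1, 1), "world" -> Seq(1, 1))

// equivalent to: for ((k, v) <- m) yield (k, v.sum)
val totals = m.map { case (k, v) => (k, v.sum) }
// Map(hello -> 4, world -> 2)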