FlatMap values on Map - Scala

Given a Seq of tuples like:
Seq(
  ("a", Set(1, 2)),
  ("a", Set(2, 3)),
  ("b", Set(4, 6)),
  ("b", Set(5, 6))
)
I would like to groupBy and then flatMap the values to obtain something like:
Map(
  b -> Set(4, 6, 5),
  a -> Set(1, 2, 3)
)
My first implementation is:
Seq(
  ("a" -> Set(1, 2)),
  ("a" -> Set(2, 3)),
  ("b" -> Set(4, 6)),
  ("b" -> Set(5, 6))
) groupBy (_._1) mapValues (_ map (_._2)) mapValues (_.flatten.toSet)
I was wondering if there was a more efficient and maybe simpler way to achieve that result.

You were on the right track, but you can simplify a bit by using a single mapValues and combining the map and flatten:
val r = Seq(
  ("a" -> Set(1, 2)),
  ("a" -> Set(2, 3)),
  ("b" -> Set(4, 6)),
  ("b" -> Set(5, 6))
).groupBy(_._1).mapValues(_.flatMap(_._2).toSet)
I actually find this a lot more readable than the foldLeft version (but note that mapValues returns a non-strict collection, which may or may not be what you want).
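As a self-contained sketch of both variants (names here are illustrative): on Scala 2.13+ you can also collapse the grouping and merging into a single pass with groupMapReduce, which avoids building the intermediate groups of tuples.

```scala
val pairs = Seq("a" -> Set(1, 2), "a" -> Set(2, 3), "b" -> Set(4, 6), "b" -> Set(5, 6))

// groupBy, then merge each group's sets; using .map keeps the result strict
val merged = pairs.groupBy(_._1).map { case (k, vs) => k -> vs.flatMap(_._2).toSet }

// Scala 2.13+: one pass, merging the Sets with ++ as they are grouped
val mergedOnePass = pairs.groupMapReduce(_._1)(_._2)(_ ++ _)

assert(merged == Map("a" -> Set(1, 2, 3), "b" -> Set(4, 5, 6)))
assert(merged == mergedOnePass)
```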

I would use foldLeft; I think it's more readable, and you can avoid the groupBy:
val r = Seq(
  ("a", Set(1, 2)),
  ("a", Set(2, 3)),
  ("b", Set(4, 6)),
  ("b", Set(5, 6))
).foldLeft(Map[String, Set[Int]]()) {
  case (seed, (k, v)) =>
    seed.updated(k, v ++ seed.getOrElse(k, Set[Int]()))
}
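On Scala 2.13+, the same fold can be written with Map#updatedWith, which expresses the "merge into existing entry or create a new one" step directly (a sketch, same data as above):

```scala
val r = Seq(("a", Set(1, 2)), ("a", Set(2, 3)), ("b", Set(4, 6)), ("b", Set(5, 6)))
  .foldLeft(Map.empty[String, Set[Int]]) { case (seed, (k, v)) =>
    // merge v into the existing set for k, or start a new entry
    seed.updatedWith(k) {
      case Some(s) => Some(s ++ v)
      case None    => Some(v)
    }
  }

assert(r == Map("a" -> Set(1, 2, 3), "b" -> Set(4, 5, 6)))
```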

@grotrianster's answer could be refined using the Semigroup binary operation |+| of Set and Map:
import scalaz.syntax.semigroup._
import scalaz.std.map._
import scalaz.std.set._
Seq(
  ("a", Set(1, 2)),
  ("a", Set(2, 3)),
  ("b", Set(4, 6)),
  ("b", Set(5, 6))
).foldLeft(Map[String, Set[Int]]()) { case (seed, (k, v)) => seed |+| Map(k -> v) }
Using reduce instead of fold:
Seq(
  ("a", Set(1, 2)),
  ("a", Set(2, 3)),
  ("b", Set(4, 6)),
  ("b", Set(5, 6))
).map(Map(_)).reduce(_ |+| _)
Treating Set and Map as Monoids:
Seq(
  ("a", Set(1, 2)),
  ("a", Set(2, 3)),
  ("b", Set(4, 6)),
  ("b", Set(5, 6))
).map(Map(_)).toList.suml

Related

Scala grouping of Sequence of <Key, Value(Key)> to Map of <Key, Seq(Value)> [duplicate]

I have
val a = List((1,2), (1,3), (3,4), (3,5), (4,5))
I am using a.groupBy(_._1), which groups by the first element, but it gives me output as
Map(1 -> List((1,2) , (1,3)) , 3 -> List((3,4), (3,5)), 4 -> List((4,5)))
But, I want answer as
Map(1 -> List(2, 3), 3 -> List(4,5) , 4 -> List(5))
So, how can I do this?
You can do that by following up with mapValues (and a map over each value to extract the second element):
scala> a.groupBy(_._1).mapValues(_.map(_._2))
res2: scala.collection.immutable.Map[Int,List[Int]] = Map(4 -> List(5), 1 -> List(2, 3), 3 -> List(4, 5))
Make life easy with pattern match and Map#withDefaultValue:
scala> a.foldLeft(Map.empty[Int, List[Int]].withDefaultValue(Nil)) {
         case (r, (x, y)) => r.updated(x, r(x) :+ y)
       }
res0: scala.collection.immutable.Map[Int,List[Int]] =
Map(1 -> List(2, 3), 3 -> List(4, 5), 4 -> List(5))
There are two points:
Map#withDefaultValue gives you a map with a default value, so you don't need to check whether the map already contains a key.
Wherever Scala expects a function value (x1, x2, .., xn) => y, you can use a pattern-matching anonymous function case (x1, x2, .., xn) => y instead; the compiler translates it to a function automatically. See section 8.5, Pattern Matching Anonymous Functions, of the Scala specification for more information.
As of Scala 2.13, you can use groupMap, so you'd be able to write just:
// val list = List((1, 2), (1, 3), (3, 4), (3, 5), (4, 5))
list.groupMap(_._1)(_._2)
// Map(1 -> List(2, 3), 3 -> List(4, 5), 4 -> List(5))
As a variant:
a.foldLeft(Map[Int, List[Int]]()) {case (acc, (a,b)) => acc + (a -> (b::acc.getOrElse(a,List())))}
You can also do it with a foldLeft to have only one iteration.
a.foldLeft(Map.empty[Int, List[Int]])((map, t) =>
  if (map.contains(t._1)) map + (t._1 -> (t._2 :: map(t._1)))
  else map + (t._1 -> List(t._2)))
scala.collection.immutable.Map[Int,List[Int]] = Map(1 -> List(3, 2), 3 ->
List(5, 4), 4 -> List(5))
If the order of the elements in the lists matters, reverse each list once at the end; reversing inside the fold would scramble lists with more than two elements.
a.foldLeft(Map.empty[Int, List[Int]])((map, t) =>
  if (map.contains(t._1)) map + (t._1 -> (t._2 :: map(t._1)))
  else map + (t._1 -> List(t._2))
).map { case (k, v) => (k, v.reverse) }
scala.collection.immutable.Map[Int,List[Int]] = Map(1 -> List(2, 3), 3 ->
List(4, 5), 4 -> List(5))

Convert List of Maps to Map of Lists based on Map Key

Let's say I have the following list:
val myList = List(Map(1 -> 1), Map(2 -> 2), Map(2 -> 7))
I want to convert this list to a single Map of Int -> List(Int) such that if we have duplicate keys then both values should be included in the resulting value list:
Map(2 -> List(7, 2), 1 -> List(1))
I came up with this working solution but it seems excessive and clunky:
myList.foldLeft(scala.collection.mutable.Map[Int, List[Int]]()) { (result, element) =>
  for ((k, v) <- element) {
    if (result.keySet.contains(k)) {
      result(k) = v :: result(k)
    } else {
      result += (k -> List(v))
    }
  }
  result
}
Is there a better or more efficient approach here?
myList
.flatten
.groupBy(_._1)
.mapValues(_.map(_._2))
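Note that on Scala 2.13 mapValues returns a lazy MapView rather than a Map; a strict variant of the same idea, as a self-contained sketch:

```scala
val myList = List(Map(1 -> 1), Map(2 -> 2), Map(2 -> 7))

// flatten the maps into tuples, group by key, then extract the values;
// using .map instead of .mapValues keeps the result a strict Map on 2.13
val grouped = myList.flatten.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2) }

assert(grouped == Map(1 -> List(1), 2 -> List(2, 7)))
```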
You can use a simpler (but probably less efficient) code:
val myList = List(Map(1 -> 1), Map(2 -> 2), Map(2 -> 7))
val grouped = myList.flatMap(_.toList).groupBy(_._1).mapValues(l => l.map(_._2))
println(grouped)
Map(2 -> List(2, 7), 1 -> List(1))
The idea is to first get List of all tuples from all inner Maps and then group them.
Starting Scala 2.13, we can now use groupMap which is a one-pass equivalent of a groupBy followed by mapValues (as its name suggests):
// val maps = List(Map(1 -> 1), Map(2 -> 2), Map(2 -> 7))
maps.flatten.groupMap(_._1)(_._2) // Map(1 -> List(1), 2 -> List(2, 7))
This:
flattens the list of maps into a list of tuples (List((1, 1), (2, 2), (2, 7)))
groups elements based on their first tuple part (_._1) (group part of groupMap)
maps grouped values to their second tuple part (_._2) (map part of groupMap)

Getting first n distinct Key Tuples in Scala Spark

I have an RDD of tuples as follows
(a, 1), (a, 2), (b,1)
How can I get the first two tuples with distinct keys? If I do a take(2), I will get (a, 1) and (a, 2).
What I need is (a, 1), (b,1) (Keys are distinct). Values are irrelevant.
Here's what I threw together in Scala.
sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 1)))
  .reduceByKey((v1, v2) => v1) // keep one value per key (values are irrelevant)
  .collect()
Outputs
Array[(String, Int)] = Array((a,1), (b,1))
As you already have an RDD of pairs, it has extra key-value functionality provided by org.apache.spark.rdd.PairRDDFunctions. Let's make use of that.
val pairRdd = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 1)))
// RDD[(String, Int)]
val groupedRdd = pairRdd.groupByKey()
// RDD[(String, Iterable[Int])]
val requiredRdd = groupedRdd.map { case (key, iter) => (key, iter.head) }
// RDD[(String, Int)]
Or in short
sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 1)))
  .groupByKey()
  .map { case (key, iter) => (key, iter.head) }
It is easy: since the values are irrelevant, you can just use collectAsMap, which keeps only one value per key:
val data = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 1)))
data.collectAsMap().foreach(println)
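The same "one value per distinct key" logic can be sketched locally without Spark, assuming you want the first value encountered for each key (names here are illustrative):

```scala
val data = Seq(("a", 1), ("a", 2), ("b", 1))

// keep the first value seen for each key, preserving the reduceByKey-style result
val firstPerKey = data.foldLeft(Map.empty[String, Int]) {
  case (acc, (k, v)) => if (acc.contains(k)) acc else acc.updated(k, v)
}

assert(firstPerKey == Map("a" -> 1, "b" -> 1))
```

Note that a distributed reduceByKey gives no ordering guarantee on which value survives; this local sketch does, because it folds left to right.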


Sequence transformations in scala

In Scala, is there a simple way of transforming this sequence
Seq(("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5))
into this Seq(("a", 4), ("b", 7), ("c", 4))?
Thanks
I'm not sure if you meant to have Strings in the second ordinate of the tuple. Assuming Seq[(String, Int)], you can use groupBy to group the elements by the first ordinate:
Seq(("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5))
  .groupBy(_._1)
  .mapValues(_.map(_._2).sum)
  .toSeq
Otherwise, you'll need an extra .toInt.
Here is another way by unzipping and using the second item in the tuple.
val sq = Seq(("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5))
sq.groupBy(_._1)
  .transform { (k, lt) => lt.unzip._2.sum }
  .toSeq
The above code details:
scala> sq.groupBy(_._1)
res01: scala.collection.immutable.Map[String,Seq[(String, Int)]] = Map(b -> List((b,2), (b,5)), a -> List((a,1), (a,3)), c -> List((c,4)))
scala> sq.groupBy(_._1).transform {(k,lt) => lt.unzip._2.sum}
res02: scala.collection.immutable.Map[String,Int] = Map(b -> 7, a -> 4, c -> 4)
scala> sq.groupBy(_._1).transform {(k,lt) => lt.unzip._2.sum}.toSeq
res03: Seq[(String, Int)] = ArrayBuffer((b,7), (a,4), (c,4))