Reduce RDD[Map[T, V]] by merging maps - scala

I have an RDD of maps, where the maps are certain to have intersecting key sets. Each map may have 10,000s of entries.
I need to merge the maps, such that those with intersecting key sets are merged, but others are left distinct.
Here's what I have. I haven't tested that it works, but I know that it's slow.
def mergeOverlapping(maps: RDD[Map[Int, Int]])(implicit sc: SparkContext): RDD[Map[Int, Int]] = {
val in: RDD[List[Map[Int, Int]]] = maps.map(List(_))
val z = List.empty[Map[Int, Int]]
val t: List[Map[Int, Int]] = in.fold(z) { case (l, r) =>
(l ::: r).foldLeft(List.empty[Map[Int, Int]]) { case (acc, next) =>
val (overlapping, distinct) = acc.partition(_.keys.exists(next.contains))
overlapping match {
case Nil => next :: acc
case xs => (next :: xs).reduceLeft(merge) :: distinct
}
}
}
sc.parallelize(t)
}
def merge(l: Map[Int, Int], r: Map[Int, Int]): Map[Int, Int] = {
val keys = l.keySet ++ r.keySet
keys.map { k =>
(l.get(k), r.get(k)) match {
case (Some(i), Some(j)) => k -> math.min(i, j)
case (a, b) => k -> (a orElse b).get
}
}.toMap
}
The problem, as far as I can tell, is that RDD#fold is merging and re-merging maps many more times than it has to.
Is there a more efficient mechanism that I could use? Is there another way I can structure my data to make it efficient?
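One way to avoid the repeated re-merging is to first work out which maps belong together and only then merge each group once. The sketch below does this with a union-find structure over the keys. It is a minimal illustration on a local Seq rather than an RDD, the names MergeOverlappingSketch, find and union are hypothetical, and empty maps are simply dropped; a distributed formulation would still be needed for Spark and is not shown here.

```scala
import scala.collection.mutable

object MergeOverlappingSketch {
  // Same rule as the question's merge: on a key clash keep the smaller value.
  def merge(l: Map[Int, Int], r: Map[Int, Int]): Map[Int, Int] =
    r.foldLeft(l) { case (acc, (k, v)) =>
      acc.updated(k, acc.get(k).fold(v)(old => math.min(old, v)))
    }

  // Assumption: works on a local Seq, not an RDD. Empty maps are dropped.
  def mergeOverlapping(maps: Seq[Map[Int, Int]]): List[Map[Int, Int]] = {
    // Union-find over keys: parent(k) points towards the representative of k's group.
    val parent = mutable.Map.empty[Int, Int]

    def find(k: Int): Int = {
      var root = k
      while (parent.getOrElseUpdate(root, root) != root) root = parent(root)
      // Path compression: point every key on the path directly at the root.
      var cur = k
      while (cur != root) {
        val next = parent(cur)
        parent(cur) = root
        cur = next
      }
      root
    }

    def union(a: Int, b: Int): Unit = {
      val ra = find(a)
      val rb = find(b)
      if (ra != rb) parent(ra) = rb
    }

    val nonEmpty = maps.filter(_.nonEmpty)
    // Link all keys of each map into one group.
    for (m <- nonEmpty) {
      val first = m.keysIterator.next()
      m.keysIterator.foreach(union(first, _))
    }
    // Maps whose keys share a representative overlap, directly or transitively,
    // so each group is merged exactly once.
    nonEmpty
      .groupBy(m => find(m.keysIterator.next()))
      .values
      .map(_.reduce(merge))
      .toList
  }
}
```

With this shape every map takes part in exactly one reduce, instead of being partitioned and re-merged on every step of the fold.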

Related

flatmapping a nested Map in scala

Suppose I have a value someMap of type Map[String, Map[String, String]], defined as such:
val someMap =
Map(
("a1" -> Map( ("b1" -> "c1"), ("b2" -> "c2") ) ),
("a2" -> Map( ("b3" -> "c3"), ("b4" -> "c4") ) ),
("a3" -> Map( ("b5" -> "c5"), ("b6" -> "c6") ) )
)
and I would like to flatten it to something that looks like
List(
("a1","b1","c1"),("a1","b2","c2"),
("a2","b3","c3"),("a2","b4","c4"),
("a3","b5","c5"),("a3","b6","c6")
)
What is the most efficient way of doing this? I was thinking about creating a helper function that processes each (a_i -> Map[String, String]) key-value pair and returns its triples:
def helper(key: String, values: Map[String, String]): Iterable[(String, String, String)] =
  values.map { case (k, v) => (key, k, v) } // one (key, k, v) triple per inner entry
then flatmap this function over someMap. But this seems somewhat unnecessary to my novice scala eyes, so I was wondering if there was a more efficient way to parse this Map.
There's no need for a helper function; just write a nested lambda:
val result = someMap.flatMap { case (k, v) => v.map { case (k1, v1) => (k, k1, v1) } }
Or
val y = someMap.flatMap(x => x._2.map(y => (x._1, y._1, y._2)))
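The same nested flatMap/map can also be written as a for comprehension, which some readers find easier to scan. This is a small sketch using the data from the question; the object name FlattenNestedMapSketch is only there to make it self-contained.

```scala
object FlattenNestedMapSketch {
  val someMap: Map[String, Map[String, String]] = Map(
    "a1" -> Map("b1" -> "c1", "b2" -> "c2"),
    "a2" -> Map("b3" -> "c3", "b4" -> "c4"),
    "a3" -> Map("b5" -> "c5", "b6" -> "c6")
  )

  // Desugars to flatMap over the outer entries and map over the inner ones.
  val flattened: List[(String, String, String)] =
    for {
      (a, inner) <- someMap.toList
      (b, c)     <- inner
    } yield (a, b, c)
}
```

Converting the outer map with toList first makes the result a List, as in the desired output.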
Since you're asking about efficiency, the most efficient yet functional approach I can think of is using foldLeft and foldRight.
You need foldRight since :: constructs the immutable list in reverse.
someMap.foldRight(List.empty[(String, String, String)]) { case ((a, m), acc) =>
m.foldRight(acc) {
case ((b, c), acc) => (a, b, c) :: acc
}
}
Here, assuming Map.iterator.reverse is implemented efficiently, no intermediate collections are created.
Alternatively, you can use foldLeft and then reverse the result:
someMap.foldLeft(List.empty[(String, String, String)]) { case (acc, (a, m)) =>
m.foldLeft(acc) {
case (acc, (b, c)) => (a, b, c) :: acc
}
}.reverse
This way a single intermediate List is created, but you don't rely on the implementation of the reversed iterator (foldLeft uses forward iterator).
Note: one liners, such as someMap.flatMap(x => x._2.map(y => (x._1, y._1, y._2))) are less efficient, as, in addition to the temporary buffer to hold intermediate results of flatMap, they create and discard additional intermediate collections for each inner map.
UPD
Since there seems to be some confusion, I'll clarify what I mean. Here is the implementation of map, flatMap, foldLeft and foldRight from TraversableLike:
def map[B, That](f: A => B)(implicit bf: CanBuildFrom[Repr, B, That]): That = {
def builder = { // extracted to keep method size under 35 bytes, so that it can be JIT-inlined
val b = bf(repr)
b.sizeHint(this)
b
}
val b = builder
for (x <- this) b += f(x)
b.result
}
def flatMap[B, That](f: A => GenTraversableOnce[B])(implicit bf: CanBuildFrom[Repr, B, That]): That = {
def builder = bf(repr) // extracted to keep method size under 35 bytes, so that it can be JIT-inlined
val b = builder
for (x <- this) b ++= f(x).seq
b.result
}
def foldLeft[B](z: B)(op: (B, A) => B): B = {
var result = z
this foreach (x => result = op(result, x))
result
}
def foldRight[B](z: B)(op: (A, B) => B): B =
reversed.foldLeft(z)((x, y) => op(y, x))
It's clear that map and flatMap create an intermediate buffer using the corresponding builder, while foldLeft and foldRight reuse the same user-supplied accumulator object and only use iterators.
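If the goal is simply to avoid per-entry intermediate collections, one more option is to feed a single builder directly. This is a sketch of that idea rather than something from the answer above; the names FlattenWithBuilderSketch and flatten are made up for the example.

```scala
object FlattenWithBuilderSketch {
  // One builder for the whole result: nothing is allocated per inner map
  // apart from the tuples themselves, and no final reverse is needed.
  def flatten[A, B, C](nested: Map[A, Map[B, C]]): List[(A, B, C)] = {
    val out = List.newBuilder[(A, B, C)]
    nested.foreach { case (a, inner) =>
      inner.foreach { case (b, c) => out += ((a, b, c)) }
    }
    out.result()
  }
}
```

The mutation stays local to the method, so callers still see a pure function. Whether it is measurably faster than the folds would have to be benchmarked.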

How to improve this "update" function?

Suppose I've got case class A(x: Int, s: String) and need to update a List[A] using a Map[Int, String] like that:
def update(as: List[A], map: Map[Int, String]): List[A] = ???
val as = List(A(1, "a"), A(2, "b"), A(3, "c"), A(4, "d"))
val map = Map(2 -> "b1", 4 -> "d1", 5 -> "e", 6 -> "f")
update(as, map) // List(A(1, "a"), A(2, "b1"), A(3, "c"), A(4, "d1"))
I am writing update like that:
def update(as: List[A], map: Map[Int, String]): List[A] = {
@annotation.tailrec
def loop(acc: List[A], rest: List[A], map: Map[Int, String]): List[A] = rest match {
case Nil => acc
case as => as.span(a => !map.contains(a.x)) match {
case (xs, Nil) => xs reverse_::: acc // acc is built in reverse, so xs has to be reversed onto it
case (xs, y :: ys) => loop(y.copy(s = map(y.x)) :: (xs reverse_::: acc), ys, map - y.x)
}
}
loop(Nil, as, map).reverse
}
This function works fine, but it's suboptimal because it continues iterating over the input list after map is empty. Besides, it looks overcomplicated. How would you suggest improving this update function?
If you cannot make any assumptions about the List and the Map, then the best option is to iterate the former just once and in the simplest way possible; that is, using the map function.
list.map { a =>
map
.get(key = a.x)
.fold(ifEmpty = a) { s =>
a.copy(s = s)
}
}
However, if and only if, you can be sure that most of the time:
The List will be big.
The Map will be small.
The keys in the Map are a subset of the values in the List.
And all operations will be closer to the head of the List rather than the tail.
Then, you could use the following approach which should be more efficient in such cases.
def optimizedUpdate(data: List[A], updates: Map[Int, String]): List[A] = {
@annotation.tailrec
def loop(remaining: List[A], map: Map[Int, String], acc: List[A]): List[A] =
if (map.isEmpty) acc reverse_::: remaining
else remaining match {
case a :: as =>
map.get(key = a.x) match {
case None =>
loop(
remaining = as,
map,
a :: acc
)
case Some(s) =>
loop(
remaining = as,
map = map - a.x,
a.copy(s = s) :: acc
)
}
case Nil =>
acc.reverse
}
loop(remaining = data, map = updates, acc = List.empty)
}
However, note that this code is not only longer and more difficult to understand.
It is actually less efficient than the map solution if those conditions are not met; this is because the stdlib implementation "cheats" and constructs the List by mutating its tail instead of building it backwards and then reversing it as we did.
In any case, as with all things performance, the only real answer is to benchmark.
But I would go with the map solution for clarity, or with a mutable approach if you really need speed.
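For reference, a mutable variant could look roughly like the sketch below. It is an illustration only, not a benchmarked implementation: MutableUpdateSketch is a hypothetical name, and like optimizedUpdate it consumes each map key once, so only the first element with a given x is updated.

```scala
import scala.collection.mutable.ListBuffer

object MutableUpdateSketch {
  final case class A(x: Int, s: String)

  def update(as: List[A], updates: Map[Int, String]): List[A] = {
    val out = ListBuffer.empty[A] // appends at the end, so no final reverse
    var pending = updates         // keys that have not been applied yet
    var rest = as
    // Stop inspecting elements as soon as every update has been applied.
    while (rest.nonEmpty && pending.nonEmpty) {
      val a = rest.head
      pending.get(a.x) match {
        case Some(s) =>
          out += a.copy(s = s)
          pending = pending - a.x
        case None =>
          out += a
      }
      rest = rest.tail
    }
    out ++= rest // copy whatever is left unchanged
    out.toList
  }
}
```

The mutable state never escapes the method, so from the outside it behaves like the immutable versions.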
You can see the code running here.
How about
def update(as: List[A], map: Map[Int, String]): List[A] =
as.foldLeft(List.empty[A]) { (agg, elem) =>
val newA = map
.get(elem.x)
.map(a => elem.copy(s = a))
.getOrElse(elem)
newA :: agg
}.reverse

Immutable Vector or List Builder in Scala

I am running into situation where I need to extract certain entries into corresponding lists based on some condition. Here is my code
var keys = Vector[String]()
var data = Vector[String]()
for ((k, v) <- myMap) {
if (v.endsWith("abc")) { keys = keys :+ v }
if (v.endsWith("xyz")) { data = data :+ v }
}
What would be the best way to implement this logic without making keys and data as var? Is there such a thing as Immutable List builder in Scala?
For example, ImmutableList.Builder in guava (Java) https://google.github.io/guava/releases/21.0/api/docs/com/google/common/collect/ImmutableList.Builder.html
Every Scala collection comes with an append-only builder:
val keysB, dataB = Vector.newBuilder[String]
for ((k, v) <- myMap) {
if (v.endsWith("abc")) { keysB += v }
if (v.endsWith("xyz")) { dataB += v }
}
val keys = keysB.result()
val data = dataB.result()
You could just partition the values as needed.
val (keys, notKeys) = myMap.values.partition(_.endsWith("abc"))
val (data, _) = notKeys.partition(_.endsWith("xyz"))
Your keys and data collections will be Iterable[String] instead of Vector, but that's an easy change (for example with .toVector) if necessary.
What about using foldLeft?
val map: Map[Int, String] = Map(
1 -> "abc",
2 -> "xyz",
3 -> "abcxyz",
4 -> "xyzabc"
)
val r = map.foldLeft((Seq.empty[String], Seq.empty[String])) {
case ((keys, data), (k, v)) =>
if (v.endsWith("abc")) {
(keys :+ v, data)
}
else if (v.endsWith("xyz")) {
(keys, data :+ v)
}
else {
(keys, data)
}
}
r match {
case (keys, data) =>
println(s"keys: $keys")
println(s"data: $data")
}
If you're forced to use a var or mutable collection (beyond your needs for optimization), you're probably not thinking about the problem correctly.
Suppose we had a map m:
Map(1 -> "abc", 2 -> "xyz")
Now, we can use recursion to solve this problem (and I've done it in a tail recursive form here):
type Keys = Vector[String]
type Data = Vector[String]
def keyData(m: Map[Int, String]): (Keys, Data) = {
def go(keys: Keys, data: Data, m: List[(Int, String)]): (Keys, Data) =
m match {
case (k, v) :: ks if v endsWith("abc") =>
go(v +: keys, data, ks)
case (k, v) :: ks if v endsWith("xyz") =>
go(keys, v +: data, ks)
case k :: ks =>
go(keys, data, ks)
case _ => (keys, data)
}
go(Vector.empty[String], Vector.empty[String], m.toList)
}
This will take a map and produce a pair of vectors holding the string data that matches the predicates you listed. Now, suppose we wanted to abstract and partition our map elements into vectors satisfying any two predicates p: String => Boolean and q: String => Boolean. Then, we'd have something that looks like this:
type Keys = Vector[String]
type Data = Vector[String]
def keyData(m: Map[Int, String], p: String => Boolean, q: String => Boolean): (Keys, Data) = {
def go(keys: Keys, data: Data, m: List[(Int, String)]): (Keys, Data) =
m match {
case (k, v) :: ks if p(v) =>
go(v +: keys, data, ks)
case (k, v) :: ks if q(v) =>
go(keys, v +: data, ks)
case k :: ks =>
go(keys, data, ks)
case _ => (keys, data)
}
go(Vector.empty[String], Vector.empty[String], m.toList)
}
Now, we can abstract this for any key and value types K and V:
def partitionMapBy[K, V](m: Map[K, V], p: V => Boolean, q: V => Boolean): (Vector[V], Vector[V]) = {
def go(keys: Vector[V], data: Vector[V], m: List[(K, V)]): (Vector[V], Vector[V]) =
m match {
case (k, v) :: ks if p(v) =>
go(v +: keys, data, ks)
case (k, v) :: ks if q(v) =>
go(keys, v +: data, ks)
case k :: ks =>
go(keys, data, ks)
case _ => (keys, data)
}
go(Vector.empty[V], Vector.empty[V], m.toList)
}
You'll notice that there's nothing fancy going on with the recursion here. This means we can use a fold to accomplish the same thing. Here's an implementation using foldLeft:
def partitionMapBy[K, V](m: Map[K, V])(p: V => Boolean)(q: V => Boolean): (Vector[V], Vector[V]) =
m.foldLeft[(Vector[V], Vector[V])]((Vector.empty[V], Vector.empty[V])) {
case (acc @ (keys, data), (_, v)) =>
if(p(v)) (v +: keys, data)
else if(q(v)) (keys, v +: data)
else acc
}
As you can see, for m, if you let p be _ endsWith("abc") and q be _ endsWith("xyz"), then you'll have exactly what you want.
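A short usage sketch of the curried foldLeft version, applied to the two-entry map m from above. The wrapper object PartitionMapByUsage is only there to make the snippet self-contained.

```scala
object PartitionMapByUsage {
  def partitionMapBy[K, V](m: Map[K, V])(p: V => Boolean)(q: V => Boolean): (Vector[V], Vector[V]) =
    m.foldLeft((Vector.empty[V], Vector.empty[V])) {
      case (acc @ (keys, data), (_, v)) =>
        if (p(v)) (v +: keys, data)      // matches p: goes into the first vector
        else if (q(v)) (keys, v +: data) // matches q: goes into the second vector
        else acc                         // matches neither: dropped
    }

  val m = Map(1 -> "abc", 2 -> "xyz")

  // V is fixed to String by the first parameter list, so the lambdas need no annotations.
  val (keys, data) = partitionMapBy(m)(_.endsWith("abc"))(_.endsWith("xyz"))
  // keys == Vector("abc"), data == Vector("xyz")
}
```

Because values are prepended with +:, each vector ends up in the reverse of the map's iteration order; use :+ instead if the order matters.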

Change list of Eithers to two value lists in scala

How can I split a list of Eithers into two lists, one of the Left values and one of the Right values? When I use partition, it returns two lists of Eithers, not of the values. What is the simplest way to do it?
foldLeft allows you to easily write your own method:
def separateEithers[T, U](list: List[Either[T, U]]) = {
val (ls, rs) = list.foldLeft((List.empty[T], List.empty[U])) { // explicit tuple instead of relying on auto-tupling
case ((ls, rs), Left(x)) => (x :: ls, rs)
case ((ls, rs), Right(x)) => (ls, x :: rs)
}
(ls.reverse, rs.reverse)
}
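A variant of the same idea, sketched with foldRight: prepending while walking from the right keeps the original order, so the two reverse calls go away. The names SeparateEithersSketch and separate are hypothetical, and this is not necessarily faster, since foldRight on a List may reverse the list internally.

```scala
object SeparateEithersSketch {
  def separate[T, U](list: List[Either[T, U]]): (List[T], List[U]) =
    list.foldRight((List.empty[T], List.empty[U])) {
      case (Left(x), (ls, rs))  => (x :: ls, rs) // Left values go to the first list
      case (Right(x), (ls, rs)) => (ls, x :: rs) // Right values go to the second list
    }

  val input: List[Either[String, Int]] =
    List(Left("foo"), Right(1), Left("bar"), Right(2))

  val (lefts, rights) = separate(input)
  // lefts == List("foo", "bar"), rights == List(1, 2)
}
```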
You'll have to map the two resulting lists after partitioning.
val origin: List[Either[A, B]] = ???
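val (lefts, rights) = origin.partition(_.isInstanceOf[Left[_, _]])
val leftValues = lefts.map(_.asInstanceOf[Left[A, B]].value) // the field is named `a` before Scala 2.12
val rightValues = rights.map(_.asInstanceOf[Right[A, B]].value) // the field is named `b` before Scala 2.12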
If you are not happy with the casts and isInstanceOf's, you can also do it in two passes:
val leftValues = origin collect {
case Left(a) => a
}
val rightValues = origin collect {
case Right(b) => b
}
And if you are not happy with the two passes, you'll have to do it "by hand":
def myPartition[A, B](origin: List[Either[A, B]]): (List[A], List[B]) = {
val leftBuilder = List.newBuilder[A]
val rightBuilder = List.newBuilder[B]
origin foreach {
case Left(a) => leftBuilder += a
case Right(b) => rightBuilder += b
}
(leftBuilder.result(), rightBuilder.result())
}
Finally, if you don't like mutable state, you can do this:
def myPartition[A, B](origin: List[Either[A, B]]): (List[A], List[B]) = {
@annotation.tailrec
def loop(xs: List[Either[A, B]], accLeft: List[A],
accRight: List[B]): (List[A], List[B]) = {
xs match {
case Nil => (accLeft.reverse, accRight.reverse)
case Left(a) :: xr => loop(xr, a :: accLeft, accRight)
case Right(b) :: xr => loop(xr, accLeft, b :: accRight)
}
}
loop(origin, Nil, Nil)
}
If making two passes through the list is okay for you, you can use collect:
type E = Either[String, Int]
val xs: List[E] = List(Left("foo"), Right(1), Left("bar"), Right(2))
val rights = xs.collect { case Right(x) => x}
// rights: List[Int] = List(1, 2)
val lefts = xs.collect { case Left(x) => x}
// lefts: List[String] = List(foo, bar)
You can also use for comprehensions, like this:
for ( Left(v) <- xs ) yield v
and
for ( Right(v) <- xs ) yield v
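On Scala 2.13 or later, the standard library also offers partitionMap, which does the split in a single pass. The sketch below assumes that version; on older versions one of the approaches above is needed instead. The object name is made up for the example.

```scala
object PartitionMapSketch {
  val xs: List[Either[String, Int]] =
    List(Left("foo"), Right(1), Left("bar"), Right(2))

  // partitionMap (Scala 2.13+) sends Left results to the first collection
  // and Right results to the second; identity passes each Either through.
  val (lefts, rights) = xs.partitionMap(identity)
  // lefts == List("foo", "bar"), rights == List(1, 2)
}
```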

What's the idiomatic way to map producing 0 or 1 results per entry?

What's the idiomatic way to call map over a collection producing 0 or 1 result per entry?
Suppose I have:
val data = Array("A", "x:y", "d:e")
What I'd like as a result is:
val target = Array(("x", "y"), ("d", "e"))
(drop anything without a colon, split on colon and return tuples)
So in theory I think I want to do something like:
val attempt1 = data.map( arg => {
arg.split(":", 2) match {
case Array(l,r) => (l, r)
case _ => (None, None)
}
}).filter( _._1 != None )
What I'd like to do is avoid the need for the any-case and get rid of the filter.
I could do this by pre-filtering (but then I have to test the regex twice):
val attempt2 = data.filter(_.contains(":")).map( arg => {
val Array(l,r) = arg.split(":", 2)
(l,r)
})
Last, I could use Some/None and flatMap...which does get rid of the need to filter, but is it what most scala programmers would expect?
val attempt3 = data.flatMap( arg => {
arg.split(":", 2) match {
case Array(l,r) => Some((l,r))
case _ => None
}
})
It seems to me like there'd be an idiomatic way to do this in Scala, is there?
With a Regex extractor and collect :-)
scala> val R = "(.+):(.+)".r
R: scala.util.matching.Regex = (.+):(.+)
scala> Array("A", "x:y", "d:e") collect {
| case R(a, b) => (a, b)
| }
res0: Array[(String, String)] = Array((x,y), (d,e))
Edit:
If you want a map, you can do:
scala> val x: Map[String, String] = Array("A", "x:y", "d:e").collect { case R(a, b) => (a, b) }.toMap
x: Map[String,String] = Map(x -> y, d -> e)
If performance is a concern, you can use collection.breakOut (available up to Scala 2.12; it was removed in 2.13) as shown below to avoid creating an intermediate array:
scala> val x: Map[String, String] = Array("A", "x:y", "d:e").collect { case R(a, b) => (a, b) } (collection.breakOut)
x: Map[String,String] = Map(x -> y, d -> e)
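If you would rather keep the split from the question than switch to a regex, collect also works on the split result. The sketch below shows that, plus an iterator-based way to get a Map without an intermediate array that does not depend on breakOut. The object name is hypothetical. Note one difference from the regex: an input such as "x:" still yields a pair here, with an empty second element.

```scala
object CollectSplitSketch {
  val data = Array("A", "x:y", "d:e")

  // Entries without a colon split into a one-element array and are skipped by collect.
  val pairs: Array[(String, String)] =
    data.map(_.split(":", 2)).collect { case Array(l, r) => (l, r) }

  // Going through an iterator builds the Map directly, with no intermediate Array.
  val asMap: Map[String, String] =
    data.iterator.map(_.split(":", 2)).collect { case Array(l, r) => (l, r) }.toMap
}
```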