I am running into a situation where I need to extract certain entries into corresponding lists based on some condition. Here is my code:
var keys = Vector[String]()
var data = Vector[String]()
for ((k, v) <- myMap) {
if (v.endsWith("abc")) { keys = keys :+ v }
if (v.endsWith("xyz")) { data = data :+ v }
}
What would be the best way to implement this logic without declaring keys and data as var? Is there such a thing as an immutable list builder in Scala?
For example, ImmutableList.Builder in guava (Java) https://google.github.io/guava/releases/21.0/api/docs/com/google/common/collect/ImmutableList.Builder.html
Every Scala collection comes with an append-only builder:
val keysB, dataB = Vector.newBuilder[String]
for ((k, v) <- myMap) {
if (v.endsWith("abc")) { keysB += v }
if (v.endsWith("xyz")) { dataB += v }
}
val keys = keysB.result()
val data = dataB.result()
You could just partition the values as needed.
val (keys, notKeys) = myMap.values.partition(_.endsWith("abc"))
val (data, _) = notKeys.partition(_.endsWith("xyz"))
Your keys and data collections will be Iterable[String] instead of Vector, but that's an easy mod (call .toVector) if necessary.
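If Vectors are required, the conversion can be sketched as follows. The contents of myMap here are made-up sample data, not the asker's actual map.

```scala
// Hypothetical sample data standing in for the asker's myMap.
val myMap: Map[Int, String] = Map(1 -> "fooabc", 2 -> "barxyz", 3 -> "other")

// partition on Map#values yields a pair of Iterables.
val (keysIt, notKeys) = myMap.values.partition(_.endsWith("abc"))
val (dataIt, _) = notKeys.partition(_.endsWith("xyz"))

// Convert to Vector only at the end.
val keys: Vector[String] = keysIt.toVector
val data: Vector[String] = dataIt.toVector
```

Note that this walks the values twice, which is usually fine for small maps.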
What about using foldLeft?
val map: Map[Int, String] = Map(
1 -> "abc",
2 -> "xyz",
3 -> "abcxyz",
4 -> "xyzabc"
)
val r = map.foldLeft((Seq.empty[String], Seq.empty[String])) {
case ((keys, data), (k, v)) =>
if (v.endsWith("abc")) {
(keys :+ v, data)
}
else if (v.endsWith("xyz")) {
(keys, data :+ v)
}
else {
(keys, data)
}
}
r match {
case (keys, data) =>
println(s"keys: $keys")
println(s"data: $data")
}
If you're forced to use a var or mutable collection (beyond your needs for optimization), you're probably not thinking about the problem correctly.
Suppose we had a map m:
Map(1 -> "abc", 2 -> "xyz")
Now, we can use recursion to solve this problem (and I've done it in a tail recursive form here):
type Keys = Vector[String]
type Data = Vector[String]
def keyData(m: Map[Int, String]): (Keys, Data) = {
def go(keys: Keys, data: Data, m: List[(Int, String)]): (Keys, Data) =
m match {
case (k, v) :: ks if v endsWith("abc") =>
go(v +: keys, data, ks)
case (k, v) :: ks if v endsWith("xyz") =>
go(keys, v +: data, ks)
case k :: ks =>
go(keys, data, ks)
case _ => (keys, data)
}
go(Vector.empty[String], Vector.empty[String], m.toList)
}
This will take a map and produce a pair of vectors holding the string data that matches the predicates you listed. Now, suppose we wanted to abstract and partition our map elements into vectors satisfying any two predicates p: String => Boolean and q: String => Boolean. Then, we'd have something that looks like this:
type Keys = Vector[String]
type Data = Vector[String]
def keyData(m: Map[Int, String], p: String => Boolean, q: String => Boolean): (Keys, Data) = {
def go(keys: Keys, data: Data, m: List[(Int, String)]): (Keys, Data) =
m match {
case (k, v) :: ks if p(v) =>
go(v +: keys, data, ks)
case (k, v) :: ks if q(v) =>
go(keys, v +: data, ks)
case k :: ks =>
go(keys, data, ks)
case _ => (keys, data)
}
go(Vector.empty[String], Vector.empty[String], m.toList)
}
Now, we can abstract this for any key and value types K and V:
def partitionMapBy[K, V](m: Map[K, V], p: V => Boolean, q: V => Boolean): (Vector[V], Vector[V]) = {
def go(keys: Vector[V], data: Vector[V], m: List[(K, V)]): (Vector[V], Vector[V]) =
m match {
case (k, v) :: ks if p(v) =>
go(v +: keys, data, ks)
case (k, v) :: ks if q(v) =>
go(keys, v +: data, ks)
case k :: ks =>
go(keys, data, ks)
case _ => (keys, data)
}
go(Vector.empty[V], Vector.empty[V], m.toList)
}
You'll notice that there's nothing fancy going on with the recursion here. This means we can use a fold to accomplish the same thing. Here's an implementation using foldLeft:
def partitionMapBy[K, V](m: Map[K, V])(p: V => Boolean)(q: V => Boolean): (Vector[V], Vector[V]) =
m.foldLeft[(Vector[V], Vector[V])]((Vector.empty[V], Vector.empty[V])) {
case (acc @ (keys, data), (_, v)) =>
if(p(v)) (v +: keys, data)
else if(q(v)) (keys, v +: data)
else acc
}
For m, if you let p be _ endsWith("abc") and q be _ endsWith("xyz"), then you'll have exactly what you want.
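As a quick check, here is a minimal usage sketch. The function is repeated so that the snippet is self-contained, and the sample map is the two-entry m from earlier in this answer.

```scala
// Curried partitionMapBy, repeated here so the snippet runs on its own.
def partitionMapBy[K, V](m: Map[K, V])(p: V => Boolean)(q: V => Boolean): (Vector[V], Vector[V]) =
  m.foldLeft((Vector.empty[V], Vector.empty[V])) {
    case (acc @ (keys, data), (_, v)) =>
      if (p(v)) (v +: keys, data)        // value satisfies the first predicate
      else if (q(v)) (keys, v +: data)   // value satisfies the second predicate
      else acc                           // neither: keep the accumulator unchanged
  }

val m = Map(1 -> "abc", 2 -> "xyz")
val (keys, data) = partitionMapBy(m)(_.endsWith("abc"))(_.endsWith("xyz"))
// keys == Vector("abc"), data == Vector("xyz")
```

Because values are prepended with +:, larger inputs come out in reverse iteration order; use :+ instead if order matters.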
Suppose I have val someMap: Map[String, Map[String, String]] defined as such:
val someMap =
Map(
("a1" -> Map( ("b1" -> "c1"), ("b2" -> "c2") ) ),
("a2" -> Map( ("b3" -> "c3"), ("b4" -> "c4") ) ),
("a3" -> Map( ("b5" -> "c5"), ("b6" -> "c6") ) )
)
and I would like to flatten it to something that looks like
List(
("a1","b1","c1"),("a1","b2","c2"),
("a2","b3","c3"),("a2","b4","c4"),
("a3","b5","c5"),("a3","b6","c6")
)
What is the most efficient way of doing this? I was thinking about creating some helper function that processes each (a_i -> Map[String, String]) key-value pair and returns a sublist:
def helper(key: String, values: Map[String, String]): Iterable[(String, String, String)] = {
val sublist = values.map(x => (key, x._1, x._2))
sublist
}
then flatmap this function over someMap. But this seems somewhat unnecessary to my novice scala eyes, so I was wondering if there was a more efficient way to parse this Map.
No need to create a helper function; just write a nested lambda:
val result = someMap.flatMap { case (k, v) => v.map { case (k1, v1) => (k, k1, v1) } }
Or
val y = someMap.flatMap(x => x._2.map(y => (x._1, y._1, y._2)))
Since you're asking about efficiency, the most efficient yet functional approach I can think of is using foldLeft and foldRight.
You need foldRight since :: constructs the immutable list in reverse.
someMap.foldRight(List.empty[(String, String, String)]) { case ((a, m), acc) =>
m.foldRight(acc) {
case ((b, c), acc) => (a, b, c) :: acc
}
}
Here, assuming Map.iterator.reverse is implemented efficiently, no intermediate collections are created.
Alternatively, you can use foldLeft and then reverse the result:
someMap.foldLeft(List.empty[(String, String, String)]) { case (acc, (a, m)) =>
m.foldLeft(acc) {
case (acc, (b, c)) => (a, b, c) :: acc
}
}.reverse
This way a single intermediate List is created, but you don't rely on the implementation of the reversed iterator (foldLeft uses forward iterator).
Note: one liners, such as someMap.flatMap(x => x._2.map(y => (x._1, y._1, y._2))) are less efficient, as, in addition to the temporary buffer to hold intermediate results of flatMap, they create and discard additional intermediate collections for each inner map.
UPD
Since there seems to be some confusion, I'll clarify what I mean. Here is an implementation of map, flatMap, foldLeft and foldRight from TraversableLike:
def map[B, That](f: A => B)(implicit bf: CanBuildFrom[Repr, B, That]): That = {
def builder = { // extracted to keep method size under 35 bytes, so that it can be JIT-inlined
val b = bf(repr)
b.sizeHint(this)
b
}
val b = builder
for (x <- this) b += f(x)
b.result
}
def flatMap[B, That](f: A => GenTraversableOnce[B])(implicit bf: CanBuildFrom[Repr, B, That]): That = {
def builder = bf(repr) // extracted to keep method size under 35 bytes, so that it can be JIT-inlined
val b = builder
for (x <- this) b ++= f(x).seq
b.result
}
def foldLeft[B](z: B)(op: (B, A) => B): B = {
var result = z
this foreach (x => result = op(result, x))
result
}
def foldRight[B](z: B)(op: (A, B) => B): B =
reversed.foldLeft(z)((x, y) => op(y, x))
It's clear that map and flatMap create an intermediate buffer using the corresponding builder, while foldLeft and foldRight reuse the same user-supplied accumulator object, and only use iterators.
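Another way to avoid the per-inner-map intermediate collections is to work on iterators and materialize only once at the end. This is a sketch, not a benchmarked claim; the sample map below is a shortened version of the one in the question.

```scala
// Shortened sample of the nested map from the question.
val someMap: Map[String, Map[String, String]] = Map(
  "a1" -> Map("b1" -> "c1", "b2" -> "c2"),
  "a2" -> Map("b3" -> "c3", "b4" -> "c4")
)

// Iterators are lazy, so the inner map step builds no collection;
// only the final toList allocates the result.
val flat: List[(String, String, String)] =
  someMap.iterator.flatMap { case (a, inner) =>
    inner.iterator.map { case (b, c) => (a, b, c) }
  }.toList
```

The element order follows the maps' own iteration order, which is not guaranteed for general hash-based maps.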
I have an RDD of maps, where the maps are certain to have intersecting key sets. Each map may have 10,000s of entries.
I need to merge the maps, such that those with intersecting key sets are merged, but others are left distinct.
Here's what I have. I haven't tested that it works, but I know that it's slow.
def mergeOverlapping(maps: RDD[Map[Int, Int]])(implicit sc: SparkContext): RDD[Map[Int, Int]] = {
val in: RDD[List[Map[Int, Int]]] = maps.map(List(_))
val z = List.empty[Map[Int, Int]]
val t: List[Map[Int, Int]] = in.fold(z) { case (l, r) =>
(l ::: r).foldLeft(List.empty[Map[Int, Int]]) { case (acc, next) =>
val (overlapping, distinct) = acc.partition(_.keys.exists(next.contains))
overlapping match {
case Nil => next :: acc
case xs => (next :: xs).reduceLeft(merge) :: distinct
}
}
}
sc.parallelize(t)
}
def merge(l: Map[Int, Int], r: Map[Int, Int]): Map[Int, Int] = {
val keys = l.keySet ++ r.keySet
keys.map { k =>
(l.get(k), r.get(k)) match {
case (Some(i), Some(j)) => k -> math.min(i, j)
case (a, b) => k -> (a orElse b).get
}
}.toMap
}
The problem, as far as I can tell, is that RDD#fold is merging and re-merging maps many more times than it has to.
Is there a more efficient mechanism that I could use? Is there another way I can structure my data to make it efficient?
What is the best/cleanest/most-efficient way to detect changes between two Map instances? For example:
val before = Map(1 -> "foo", 2 -> "bar", 3 -> "baz")
val after = Map(1 -> "baz", 2 -> "bar", 4 -> "boo")
// not pretty:
val removed = before.keySet diff after.keySet
val added = after.filterNot { case (key, _) => before contains key }
val changed = (before.keySet intersect after.keySet).flatMap { key =>
val a = before(key)
val b = after (key)
if (a == b) None else Some(key -> (a, b))
}
Here is an idea. It probably takes O(N * log N) with N = max(before.size, after.size):
sealed trait Change[+K, +V]
case class Removed[K ](key: K) extends Change[K, Nothing]
case class Added [K, V](key: K, value : V) extends Change[K, V]
case class Updated[K, V](key: K, before: V, after: V) extends Change[K, V]
def changes[K, V](before: Map[K, V], after: Map[K, V]): Iterable[Change[K, V]] ={
val b = Iterable.newBuilder[Change[K, V]]
before.foreach { case (k, vb) =>
after.get(k) match {
case Some(va) if vb != va => b += Updated(k, vb, va)
case None => b += Removed(k)
case _ =>
}
}
after.foreach { case (k, va) =>
if (!before.contains(k)) b += Added(k, va)
}
b.result()
}
changes(before, after).foreach(println)
// Updated(1,foo,baz)
// Removed(3)
// Added(4,boo)
How can I split a list of Eithers into two lists, one with the Left values and one with the Right values? When I use partition, it returns two lists of Eithers, not of values. What is the simplest way to do it?
foldLeft allows you to easily write your own method:
def separateEithers[T, U](list: List[Either[T, U]]) = {
val (ls, rs) = list.foldLeft((List[T](), List[U]())) {
case ((ls, rs), Left(x)) => (x :: ls, rs)
case ((ls, rs), Right(x)) => (ls, x :: rs)
}
(ls.reverse, rs.reverse)
}
You'll have to map the two resulting lists after partitioning.
val origin: List[Either[A, B]] = ???
val (lefts, rights) = origin.partition(_.isInstanceOf[Left[_, _]])
// The field is called .value on Scala 2.12+; older versions used .a and .b
val leftValues = lefts.map(_.asInstanceOf[Left[A, B]].value)
val rightValues = rights.map(_.asInstanceOf[Right[A, B]].value)
If you are not happy with the casts and isInstanceOf's, you can also do it in two passes:
val leftValues = origin collect {
case Left(a) => a
}
val rightValues = origin collect {
case Right(b) => b
}
And if you are not happy with the two passes, you'll have to do it "by hand":
def myPartition[A, B](origin: List[Either[A, B]]): (List[A], List[B]) = {
val leftBuilder = List.newBuilder[A]
val rightBuilder = List.newBuilder[B]
origin foreach {
case Left(a) => leftBuilder += a
case Right(b) => rightBuilder += b
}
(leftBuilder.result(), rightBuilder.result())
}
Finally, if you don't like mutable state, you can do this:
import scala.annotation.tailrec

def myPartition[A, B](origin: List[Either[A, B]]): (List[A], List[B]) = {
@tailrec
def loop(xs: List[Either[A, B]], accLeft: List[A],
accRight: List[B]): (List[A], List[B]) = {
xs match {
case Nil => (accLeft.reverse, accRight.reverse)
case Left(a) :: xr => loop(xr, a :: accLeft, accRight)
case Right(b) :: xr => loop(xr, accLeft, b :: accRight)
}
}
loop(origin, Nil, Nil)
}
If making two passes through the list is okay for you, you can use collect:
type E = Either[String, Int]
val xs: List[E] = List(Left("foo"), Right(1), Left("bar"), Right(2))
val rights = xs.collect { case Right(x) => x}
// rights: List[Int] = List(1, 2)
val lefts = xs.collect { case Left(x) => x}
// lefts: List[String] = List(foo, bar)
Using for comprehensions, like this,
for ( Left(v) <- xs ) yield v
and
for ( Right(v) <- xs ) yield v
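On Scala 2.13 or later (an assumption about the reader's version, since the answers above predate it), the standard library's partitionMap does the split in a single pass. A minimal sketch using the same sample list as above:

```scala
val xs: List[Either[String, Int]] = List(Left("foo"), Right(1), Left("bar"), Right(2))

// partitionMap expects a function returning an Either;
// the elements already are Eithers, so identity is enough.
val (lefts, rights) = xs.partitionMap(identity)
// lefts: List[String] = List(foo, bar)
// rights: List[Int] = List(1, 2)
```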
A message class:
case class Message(username:String, content:String)
A message list:
val list = List(
Message("aaa", "111"),
Message("aaa","222"),
Message("bbb","333"),
Message("aaa", "444"),
Message("aaa", "555"))
How to group the messages by name and get the following result:
List( "aaa"-> List(Message("aaa","111"), Message("aaa","222")),
"bbb" -> List(Message("bbb","333")),
"aaa" -> List(Message("aaa","444"), Message("aaa", "555")) )
That means: if a user posts several messages in a row, group them together until another user posts. The order should be kept.
I can't think of an easy way to do this with the provided Seq methods, but you can write your own pretty concisely with a fold:
def contGroupBy[A, B](s: List[A])(p: A => B) = (List.empty[(B, List[A])] /: s) {
case (((k, xs) :: rest), y) if k == p(y) => (k, y :: xs) :: rest
case (acc, y) => (p(y), y :: Nil) :: acc
}.reverse.map { case (k, xs) => (k, xs.reverse) }
Now contGroupBy(list)(_.username) gives you what you want.
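For readers on newer Scala versions, where the /: operator is deprecated, the same idea can be sketched with foldLeft. This is a restatement of the function above rather than a different algorithm, and the sample list is a shortened version of the one in the question.

```scala
case class Message(username: String, content: String)

// Same logic as contGroupBy, written with foldLeft instead of /:.
def contGroupBy[A, B](s: List[A])(p: A => B): List[(B, List[A])] =
  s.foldLeft(List.empty[(B, List[A])]) {
    // Same key as the most recent group: prepend to that group.
    case ((k, xs) :: rest, y) if k == p(y) => (k, y :: xs) :: rest
    // Otherwise start a new group at the front.
    case (acc, y) => (p(y), y :: Nil) :: acc
  }.reverse.map { case (k, xs) => (k, xs.reverse) }

val list = List(
  Message("aaa", "111"),
  Message("aaa", "222"),
  Message("bbb", "333"),
  Message("aaa", "444"))

val grouped = contGroupBy(list)(_.username)
// The group keys, in order, are: aaa, bbb, aaa
```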
I tried to write code that works not only with Lists and can be used in operator notation. I came up with this:
object Grouper {
import collection.generic.CanBuildFrom
class GroupingCollection[A, C, CC[C]](ca: C)(implicit c2i: C => Iterable[A]) {
def groupBySep[B](f: A => B)(implicit
cbf: CanBuildFrom[C,(B, C),CC[(B,C)]],
cbfi: CanBuildFrom[C,A,C]
): CC[(B, C)] =
if (ca.isEmpty) cbf().result
else {
val iter = c2i(ca).iterator
val outer = cbf()
val inner = cbfi()
val head = iter.next()
var olda = f(head)
inner += head
for (a <- iter) {
val fa = f(a)
if (olda != fa) {
outer += olda -> inner.result
inner.clear()
}
inner += a
olda = fa
}
outer += olda -> inner.result
outer.result
}
}
implicit def GroupingCollection[A, C[A]](ca: C[A])(
implicit c2i: C[A] => Iterable[A]
): GroupingCollection[A, C[A], C] =
new GroupingCollection[A, C[A],C](ca)(c2i)
}
Can be used (with Lists, Seqs, Arrays, ...) as:
list groupBySep (_.username)
def group(lst: List[Message], out: List[(String, List[Message])] = Nil)
: List[(String, List[Message])] = lst match {
case Nil => out.reverse
case Message(u, c) :: xs =>
val (same, rest) = lst span (_.username == u)
group(rest, (u -> same) :: out)
}
Tail recursive version. Usage is simply group(list).
(List[Tuple2[String,List[Message]]]() /: list) {
case (head :: tail, msg) if msg.username == head._1 =>
(msg.username -> (msg :: head._2)) :: tail
case (xs, msg) =>
(msg.username -> List(msg)) :: xs
}.map { t => t._1 -> t._2.reverse }.reverse
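Here's another method using pattern matching and recursion. Note that it is not tail recursive, because the recursive call is wrapped in +:. It is probably not as efficient as those above either, since each run is traversed twice, once by takeWhile and once by dropWhile.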
def groupBy(msgs: List[Message]): List[(String,List[Message])] = msgs match {
case Nil => List()
case head :: tail => (head.username ->
(head :: tail.takeWhile(m => m.username == head.username))) +:
groupBy(tail.dropWhile(m => m.username == head.username))
}