Spark: How to efficiently have intersections preserving duplicates (in Scala)?

I have 2 RDDs, each of which is a collection of strings that may contain duplicates. I want to find the intersection of the two, preserving duplicates. Example:
RDD1 : a, b, b, c, c, c, c
RDD2 : a, a, b, c, c
The intersection I want is a, b, c, c, i.e. the intersection contains each element the minimum of the number of times it appears in each of the two collections.
The default intersection transformation does not preserve duplicates AFAIK. Is there a way to efficiently compute the intersection using some other transformations and/or the intersection transformation? I'm trying to avoid doing it algorithmically, which is unlikely to be as efficient as doing it the Spark way. (For the interested, I'm trying to compute Jaccard bag similarity for a set of files).

Borrowing a little from the implementation of intersection, you could do something like:
val rdd1 = sc.parallelize(Seq("a", "b", "b", "c", "c", "c", "c"))
val rdd2 = sc.parallelize(Seq("a", "a", "b", "c", "c"))
val cogrouped = rdd1.map(k => (k, null)).cogroup(rdd2.map(k => (k, null)))
val groupSize = cogrouped.map { case (key, (buf1, buf2)) => (key, math.min(buf1.size, buf2.size)) }
val finalSet = groupSize.flatMap { case (key, size) => List.fill(size)(key) }
finalSet.collect // Array(a, b, c, c)
This works because cogroup keeps duplicate occurrences of values for each key of the grouping (in this case, all of your nulls), so the buffer sizes equal the per-key occurrence counts. Also note that this does no more shuffles than the original use of intersection would.
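Since the question mentions Jaccard bag similarity, here is a minimal follow-on sketch, assuming the common definition that sums the per-element minimum counts and divides by the combined size of the two bags (adjust if you use a different denominator):
// Sum of per-key minimum counts = size of the bag intersection
val interSize = cogrouped
  .map { case (_, (buf1, buf2)) => math.min(buf1.size, buf2.size).toLong }
  .reduce(_ + _)
// Divide by the total number of elements in both bags
val bagJaccard = interSize.toDouble / (rdd1.count() + rdd2.count())
// For the example RDDs: 4.0 / 12 ≈ 0.33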

Related

Group list elements if they share a common element

I have a list of strings, each consisting of two letters. For example,
val A = List("bf", "dc", "ab", "af")
I want to collect all the letters that share a common letter (e.g., "bf" and "af" share "f") into a tuple
("a", "b", "f")
the other tuple would be
("c", "d")
so I want to return a list that looks like this
List(List("a", "b", "f"), List("c", "d"))
I got my intended result with
val A = List("bf", "dc", "ab", "af")
val B = A.flatMap(x => x.split("")).distinct
B.map(y => A.map(x => if(x.contains(y)) {x} else {""}).filter(_ !="").flatMap(_.split("")).distinct.sorted).distinct
but there must be a better way.
Your solution is not bad, but it can be simplified.
You don't have to split the strings and then flatMap them; you can just flatten the List of Strings, since a String is a sequence of Chars.
Instead of
A.map(x => if(x.contains(y)) {x} else {""}).filter(_ != "")
it would be better to write:
A.flatMap(x => if(x.contains(y)) Some(x) else None) or
A.filter(_.contains(y))
But you can also use partition and count to express it; here is my solution:
val a = List("bf", "dc", "ab", "af")
val b = a.flatten.distinct.sorted
b.partition(x => a.count(_.contains(x)) > 1)
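For the example input this yields a pair of Char lists (shown REPL-style; the res number is illustrative):
scala> b.partition(x => a.count(_.contains(x)) > 1)
res0: (List[Char], List[Char]) = (List(a, b, f),List(c, d))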
I am curious: "better" in what sense? If you care about algorithmic complexity in Big O terms, then assume the following:
the alphabet is simple, let it be 26 letters;
the strings are of arbitrary length, but much smaller than the input's length n.
Then you can do it in O(n):
def fn(in: Iterator[String]) = {
  // a(letter index) holds the group label for that letter, or -1 if unseen
  val a = Array.fill(26)(-1)
  for (s <- in) {
    val cl = s.toSeq.map(_ - 'a')
    // reuse an existing label from any letter of s, else start a new group
    val i = cl.map(a(_)).find(_ >= 0) getOrElse cl.head
    cl.foreach(a(_) = i)
  }
  // keep only labelled letters (>= 0, not just > 0) and group them by label
  a.zipWithIndex.filter(_._1 >= 0).groupBy(_._1).values.map {
    _.map(_._2 + 'a').map(_.toChar)
  }
}
scala> fn(List("bf", "dc", "ab", "af").toIterator)
res17: Iterable[Array[Char]] = List(Array(a, b, f), Array(c, d))
Back to "better". If you wanted a nifty FP solution, then some may say that we sacrifice FP here, because used a mutable variable.
It's arguable, since that mutable variable is scoped inside and the function is still pure.

Merge Two Scala Lists of Same Length, With Elements At Same Index Becoming One Element

I have two Scala lists with the same number and type of elements, like so:
val x = List("a", "b", "c")
val y = List("1", "2", "3")
The result I want is as follows:
List("a1", "b2", "c3")
How can this be done in Scala? I could figure this out using mutable structures but I think that would be unidiomatic for Scala.
Combine zip and map:
x zip y map { case (a, b) => a + b }
Strangely enough, this also works:
x zip y map (_.productIterator.mkString)
but I would strongly prefer the first version.
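An alternative worth noting: the standard library's lazyZip (Scala 2.13+; zipped on earlier versions) fuses the zip and the map, avoiding the intermediate list of pairs:
x.lazyZip(y).map(_ + _)   // Scala 2.13+: List("a1", "b2", "c3")
(x, y).zipped.map(_ + _)  // equivalent on Scala 2.12 and earlier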

Scala/Spark Ordering on multiple dimensions with reverse on specific one

I would like to sort, for example, a 2D array or RDD like the following:
val a = Array((1,1),(1,2),(1,3),(2,1),(2,2),(2,3))
To obtain an ascending sort on the first dimension (d1) and a descending one on the second (d2):
val b = Array((1,3),(1,2),(1,1),(2,3),(2,2),(2,1))
Unfortunately, when I apply reverse to the ordering, it applies to all dimensions:
a.sortBy(x => (x._1, x._2))(Ordering[(Int, Int)].reverse.on(x => (x._1, x._2)))
Array((2,3), (2,2), (2,1), (1,3), (1,2), (1,1))
So I would like to be able to sort on multiple dimensions, choosing which ones need a reverse ordering.
This post contains the answer: Scala idiom for ordering by multiple criteria
val ord1 = Ordering.by{ x:Int => x }
val ord2 = Ordering.by{ x:Int => x }.reverse
val multOrd = Ordering.by{ x:(Int,Int) => x }(Ordering.Tuple2(ord1,ord2))
a.sortBy( identity )(multOrd)
Array((1,3),(1,2),(1,1),(2,3),(2,2),(2,1))
Hope it helps.
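For what it's worth, a slightly more compact equivalent for the same tuple shape, using the built-in Int ordering directly:
val multOrd2: Ordering[(Int, Int)] = Ordering.Tuple2(Ordering.Int, Ordering.Int.reverse)
a.sortBy(identity)(multOrd2)
// Array((1,3), (1,2), (1,1), (2,3), (2,2), (2,1))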

Group pair of elements in a List

I have a list of pairs (in Scala), parallelized as a Spark RDD:
val seqRDD = sc.parallelize(Seq(("a","b"),("b","c"),("c","a"),("d","b"),("e","c"),("f","b"),("g","a"),("h","g"),("i","e"),("j","m"),("k","b"),("l","m"),("m","j")))
I group by the second element for a particular statistic and flatten the result into one list.
val checkItOut = seqRDD.groupBy(each => (each._2))
.map(each => each._2.toList)
.collect
.flatten
.toList
The output looks like this:
checkItOut: List[(String, String)] = List((c,a), (g,a), (a,b), (d,b), (f,b), (k,b), (m,j), (b,c), (e,c), (i,e), (j,m), (l,m), (h,g))
Now, what I'm trying to do is "group" all elements (not pairs) that are connected to other elements through any pair into one list.
For example:
c is with a in one pair, and a is with g in another, so (a, c, g) are connected. Then c is also with b and e, b is with a, d, f, and k, and those appear with other characters in further pairs. I want each such connected group in one list.
I know this can be done with a BFS traversal, but I'm wondering if there is an API in Spark that does this?
GraphX, Connected Components: http://spark.apache.org/docs/latest/graphx-programming-guide.html#connected-components
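A minimal sketch of that approach, assuming vertex ids derived from String.hashCode (collisions are ignored here for illustration):
import org.apache.spark.graphx.{Edge, Graph}

// Turn each pair into an edge, hashing the string labels to Long vertex ids
val edges = seqRDD.map { case (src, dst) =>
  Edge(src.hashCode.toLong, dst.hashCode.toLong, ())
}
val graph = Graph.fromEdges(edges, defaultValue = ())

// connectedComponents labels each vertex with the smallest vertex id in its component
val componentByVertex = graph.connectedComponents().vertices

// Map vertex ids back to the original labels and collect each component into a group
val labels = seqRDD.flatMap { case (a, b) => Seq(a, b) }
  .distinct()
  .map(s => (s.hashCode.toLong, s))
val groups = componentByVertex.join(labels)
  .map { case (_, (componentId, label)) => (componentId, label) }
  .groupByKey()
  .values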

Simplest way to define a type that is a sequence of a specific number of elements in Scala?

Suppose I'm doing something like the following:
val a = complicatedChainOfSteps("c")
val b = complicatedChainOfSteps("d")
I'm interested in writing code like the following (to reduce code and copy/paste errors):
val Seq(a, b) = Seq("c", "d").map(complicatedChainOfSteps(_))
but having the compiler ensure that the number of elements matches, so the following don't compile:
val Seq(a, b) = Seq("c", "d", "e").map(s => s + s)
val Seq(a, b) = Seq("c").map(s => s + s)
I know that using tuples instead ensures that the number of elements matches when performing multiple assignment (e.g., val (a, b) = ("c", "d")), but you cannot map over tuples (which makes sense, because their element types are heterogeneous).
I also know I can just define my own types for a sequence of 2 elements, a sequence of 3 elements, and so on, but is there a convenient built-in way of doing this? If not, what's the simplest way to define a type that is a sequence of a specific number of elements?
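As an illustrative sketch of the tuple route mentioned above: since a Tuple2 fixes the arity at the type level, a tiny helper gives a compile-checked map (map2 here is a hypothetical name, not a standard library method):
// Hypothetical helper: apply f to both elements of a homogeneous pair.
// The arity is part of the type, so a wrong-sized tuple won't compile.
def map2[A, B](t: (A, A))(f: A => B): (B, B) = (f(t._1), f(t._2))

val (a, b) = map2(("c", "d"))(complicatedChainOfSteps)
// map2(("c", "d", "e"))(complicatedChainOfSteps)  // does not compile: wrong arity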