How to process cogroup values? - scala

I am cogrouping two RDDs and I want to process their values. That is,
rdd1.cogroup(rdd2)
As a result of this cogrouping I get results like the following:
(ion,(CompactBuffer(100772C121, 100772C111, 6666666666),CompactBuffer(100772C121)))
Given this result, I would like to obtain all distinct pairs, e.g. for the key 'ion':
100772C121 - 100772C111
100772C121 - 6666666666
100772C111 - 6666666666
How can I do this in Scala?

You could try something like the following:
(l1 ++ l2).distinct.combinations(2).map { case Seq(x, y) => (x, y) }.toList
You would need to replace l1 and l2 with the values from your CompactBuffers. When I tried this locally, I got the following (which is what I believe you want):
scala> val l1 = List("100772C121", "100772C111", "6666666666")
l1: List[String] = List(100772C121, 100772C111, 6666666666)
scala> val l2 = List("100772C121")
l2: List[String] = List(100772C121)
scala> val combine = (l1 ++ l2).distinct.combinations(2).map { case Seq(x, y) => (x, y) }.toList
combine: List[(String, String)] = List((100772C121,100772C111), (100772C121,6666666666), (100772C111,6666666666))
If you would like all of these pairs on separate rows, you can enclose this logic within a flatMap.
EDIT: Added steps per your example above.
scala> val rdd1 = sc.parallelize(Array(("ion", "100772C121"), ("ion", "100772C111"), ("ion", "6666666666")))
rdd1: org.apache.spark.rdd.RDD[(String, String)] = ParallelCollectionRDD[0] at parallelize at <console>:12
scala> val rdd2 = sc.parallelize(Array(("ion", "100772C121")))
rdd2: org.apache.spark.rdd.RDD[(String, String)] = ParallelCollectionRDD[1] at parallelize at <console>:12
scala> val cgrp = rdd1.cogroup(rdd2).flatMap {
| case (key: String, (l1: Iterable[String], l2: Iterable[String])) =>
| (l1.toSeq ++ l2.toSeq).distinct.combinations(2).map { case Seq(x, y) => (x, y) }.toList
| }
cgrp: org.apache.spark.rdd.RDD[(String, String)] = FlatMappedRDD[4] at flatMap at <console>:16
scala> cgrp.foreach(println)
...
(100772C121,100772C111)
(100772C121,6666666666)
(100772C111,6666666666)
EDIT 2: Updated again per your use case.
scala> val cgrp = rdd1.cogroup(rdd2).flatMap {
| case (key: String, (l1: Iterable[String], l2: Iterable[String])) =>
| for { e1 <- l1.toSeq; e2 <- l2.toSeq; if (e1 != e2) }
| yield if (e1 > e2) ((e1, e2), 1) else ((e2, e1), 1)
| }.reduceByKey(_ + _)
...
((6666666666,100772C121),2)
((6666666666,100772C111),1)
((100772C121,100772C111),1)

Related


Best way to implement "zipLongest" in Scala

I need to implement a "zipLongest" function in Scala; that is, combine two sequences together as pairs, and if one is longer than the other, use a default value. (Unlike the standard zip method, which will just truncate to the shortest sequence.)
I've implemented it directly as follows:
def zipLongest[T](xs: Seq[T], ys: Seq[T], default: T): Seq[(T, T)] = (xs, ys) match {
  case (Seq(), Seq()) => Seq()
  case (Seq(), y +: rest) => (default, y) +: zipLongest(Seq(), rest, default)
  case (x +: rest, Seq()) => (x, default) +: zipLongest(rest, Seq(), default)
  case (x +: restX, y +: restY) => (x, y) +: zipLongest(restX, restY, default)
}
Is there a better way to do it?
Use zipAll:
scala> val l1 = List(1,2,3)
l1: List[Int] = List(1, 2, 3)
scala> val l2 = List("a","b")
l2: List[String] = List(a, b)
scala> l1.zipAll(l2,0,".")
res0: List[(Int, String)] = List((1,a), (2,b), (3,.))
If you want to use the same default value for the first and second seq:
scala> def zipLongest[T](xs:Seq[T], ys:Seq[T], default:T) = xs.zipAll(ys, default, default)
zipLongest: [T](xs: Seq[T], ys: Seq[T], default: T)Seq[(T, T)]
scala> val l3 = List(4,5,6,7)
l3: List[Int] = List(4, 5, 6, 7)
scala> zipLongest(l1,l3,0)
res1: Seq[(Int, Int)] = List((1,4), (2,5), (3,6), (0,7))
You can do this as a one-liner, where x and y are the default elements used to pad xs and ys respectively:
xs.padTo(ys.length, x).zip(ys.padTo(xs.length, y))
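A minimal sketch of that approach, wrapped in a helper (the name zipLongest and the two separate default parameters are my own additions, not from the answer above):
def zipLongest[A, B](xs: Seq[A], ys: Seq[B], defaultX: A, defaultY: B): Seq[(A, B)] =
  xs.padTo(ys.length, defaultX).zip(ys.padTo(xs.length, defaultY))
// zipLongest(List(1, 2, 3), List("a", "b"), 0, ".") should give List((1,a), (2,b), (3,.))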

How to flatten a collection with Spark/Scala?

In Scala I can flatten a collection using:
val array = Array(List("1,2,3").iterator,List("1,4,5").iterator)
//> array : Array[Iterator[String]] = Array(non-empty iterator, non-empty itera
//| tor)
array.toList.flatten //> res0: List[String] = List(1,2,3, 1,4,5)
But how can I do something similar in Spark?
Reading the API doc http://spark.apache.org/docs/0.7.3/api/core/index.html#spark.RDD, there does not seem to be a method that provides this functionality.
Use flatMap with identity (from Predef); this is more readable than using x => x, e.g.
myRdd.flatMap(identity)
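For example, a quick sketch on a toy RDD (assuming a live SparkContext named sc):
val nested = sc.parallelize(List(List(1, 2, 3), List(1, 4, 5)))  // RDD[List[Int]]
nested.flatMap(identity).collect()  // should give Array(1, 2, 3, 1, 4, 5)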
Try flatMap with an identity map function (y => y):
scala> val x = sc.parallelize(List(List("a"), List("b"), List("c", "d")))
x: org.apache.spark.rdd.RDD[List[String]] = ParallelCollectionRDD[1] at parallelize at <console>:12
scala> x.collect()
res0: Array[List[String]] = Array(List(a), List(b), List(c, d))
scala> x.flatMap(y => y)
res3: org.apache.spark.rdd.RDD[String] = FlatMappedRDD[3] at flatMap at <console>:15
scala> x.flatMap(y => y).collect()
res4: Array[String] = Array(a, b, c, d)

Does Scala have a statement equivalent to ML's "as" construct?

In ML, one can assign a name to a matched sub-pattern:
fun findPair n nil = NONE
  | findPair n ((head as (n1, _))::rest) =
      if n = n1 then (SOME head) else (findPair n rest)
In this code, I defined an alias for the first pair of the list and matched the contents of the pair. Is there an equivalent construct in Scala?
You can do variable binding with the @ symbol, e.g.:
scala> val wholeList @ List(x, _*) = List(1,2,3)
wholeList: List[Int] = List(1, 2, 3)
x: Int = 1
I'm sure you'll get a more complete answer later as I'm not sure how to write it recursively like your example, but maybe this variation would work for you:
scala> val pairs = List((1, "a"), (2, "b"), (3, "c"))
pairs: List[(Int, String)] = List((1,a), (2,b), (3,c))
scala> val n = 2
n: Int = 2
scala> pairs find {e => e._1 == n}
res0: Option[(Int, String)] = Some((2,b))
OK, next attempt at direct translation. How about this?
scala> def findPair[A, B](n: A, p: List[Tuple2[A, B]]): Option[Tuple2[A, B]] = p match {
| case Nil => None
| case head::rest if head._1 == n => Some(head)
| case _::rest => findPair(n, rest)
| }
findPair: [A, B](n: A, p: List[(A, B)])Option[(A, B)]
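A quick sanity check of that translation (the sample data is mine):
val pairs = List((1, "a"), (2, "b"), (3, "c"))
findPair(2, pairs)  // Some((2,b))
findPair(5, pairs)  // None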

Convert iterative two sum k to functional

I have this code in Python that finds all pairs of numbers in an array that sum to k:
def two_sum_k(array, k):
    seen = set()
    out = set()
    for v in array:
        if k - v in seen:
            out.add((min(v, k-v), max(v, k-v)))
        seen.add(v)
    return out
Can anyone help me convert this to Scala (in a functional style)? Also with linear complexity.
I think this is a classic case where a for-comprehension can provide additional clarity:
scala> def algo(xs: IndexedSeq[Int], target: Int) =
| for {
| i <- 0 until xs.length
| j <- (i + 1) until xs.length if xs(i) + xs(j) == target
| }
| yield xs(i) -> xs(j)
algo: (xs: IndexedSeq[Int], target: Int)scala.collection.immutable.IndexedSeq[(Int, Int)]
Using it:
scala> algo(1 to 20, 15)
res0: scala.collection.immutable.IndexedSeq[(Int, Int)] = Vector((1,14), (2,13), (3,12), (4,11), (5,10), (6,9), (7,8))
I think it also doesn't suffer from the problems that your algorithm has.
I'm not sure this is the clearest, but folds usually do the trick:
def two_sum_k(xs: Seq[Int], k: Int) = {
  xs.foldLeft((Set[Int](), Set[(Int, Int)]())) { case ((seen, out), v) =>
    (seen + v, if (seen contains k - v) out + ((v min k - v, v max k - v)) else out)
  }._2
}
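For example (the sample input is mine), this should mirror the Python version:
two_sum_k(List(1, 2, 3, 4, 5), 6)  // Set((2,4), (1,5)) -- element order of the Set may vary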
You could use only those x as the first element of each pair which aren't bigger than k/2 (so that k-x >= x):
def two_sum_k (xs: List[Int], k: Int): List[(Int, Int)] =
  xs.filter (x => (x <= k/2)).
    filter (x => (xs contains k-x) && (xs.indexOf (x) != xs.lastIndexOf (x))).
    map (x => (x, k-x)).distinct
My first filter on line 3 was just filter (x => xs contains k-x), which failed as pointed out in the comment by Someone Else. Now it's more complicated and doesn't find (4, 4).
scala> li
res6: List[Int] = List(2, 3, 3, 4, 5, 5)
scala> two_sum_k (li, 8)
res7: List[(Int, Int)] = List((3,5))
def twoSumK(xs: List[Int], k: Int): List[(Int, Int)] = {
  val tuples = xs.iterator map { x => (x, k - x) }
  val potentialValues = tuples map { case (a, b) => (a min b) -> (a max b) }
  val values = potentialValues filter { xs contains _._2 }
  values.toSet.toList
}
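A sample call (the input is mine), for what it's worth:
twoSumK(List(1, 2, 4, 5), 6)  // List((1,5), (2,4)) -- order may vary, since the pairs go through a Set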
Well, a direct translation would be this:
import scala.collection.mutable
def twoSumK[T : Numeric](array: Array[T], k: T) = {
  val num = implicitly[Numeric[T]]
  import num._
  val seen = mutable.HashSet[T]()
  val out: mutable.Set[(T, T)] = mutable.HashSet[(T, T)]()
  for (v <- array) {
    if (seen contains k - v) out += min(v, k - v) -> max(v, k - v)
    seen += v
  }
  out
}
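A usage sketch (the sample data is mine); thanks to the Numeric context bound it works for Int, Double, and so on:
twoSumK(Array(1, 2, 3, 4, 5), 6)    // Set((2,4), (1,5)) -- a mutable Set, element order may vary
twoSumK(Array(1.5, 2.5, 3.0), 4.0)  // Set((1.5,2.5))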
One clever way of doing it would be this:
def twoSumK[T : Numeric](array: Array[T], k: T) = {
  val num = implicitly[Numeric[T]]
  import num._
  // One can write all the rest as a one-liner
  val s1 = array.toSet
  val s2 = s1 map (k -)
  val s3 = s1 intersect s2
  s3 map (v => min(v, k - v) -> max(v, k - v))
}
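For example (the sample input is mine):
twoSumK(Array(1, 2, 3, 4, 5), 6)  // Set((1,5), (2,4), (3,3)) -- (3,3) appears because the sets ignore how often 3 occurs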
This does the trick:
def two_sum_k(xs: List[Int], k: Int): List[(Int, Int)] = {
  xs.map(a => xs.map(b => (b, a + b)).filter(_._2 == k).map(b => (b._1, a)))
    .flatten
    .collect { case (a, b) => if (a > b) (b, a) else (a, b) }
    .distinct
}
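And a sample run (the input is mine):
two_sum_k(List(1, 2, 4, 5), 6)  // List((1,5), (2,4))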
}