How to flatten a collection with Spark/Scala?

In Scala I can flatten a collection using:
val array = Array(List("1,2,3").iterator, List("1,4,5").iterator)
//> array : Array[Iterator[String]] = Array(non-empty iterator, non-empty iterator)
array.toList.flatten //> res0: List[String] = List(1,2,3, 1,4,5)
But how can I do something similar in Spark?
Reading the API doc http://spark.apache.org/docs/0.7.3/api/core/index.html#spark.RDD, there does not seem to be a method that provides this functionality.

Use flatMap and identity from Predef; this is more readable than using x => x, e.g.
myRdd.flatMap(identity)
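For instance (a minimal sketch, not from the original answer, assuming the SparkContext sc provided by spark-shell and illustrative names):
// Flatten an RDD[List[String]] into an RDD[String] with identity.
val nested = sc.parallelize(List(List("1", "2", "3"), List("1", "4", "5")))
val flat = nested.flatMap(identity)
flat.collect()  // Array(1, 2, 3, 1, 4, 5)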

Try flatMap with an identity map function (y => y):
scala> val x = sc.parallelize(List(List("a"), List("b"), List("c", "d")))
x: org.apache.spark.rdd.RDD[List[String]] = ParallelCollectionRDD[1] at parallelize at <console>:12
scala> x.collect()
res0: Array[List[String]] = Array(List(a), List(b), List(c, d))
scala> x.flatMap(y => y)
res3: org.apache.spark.rdd.RDD[String] = FlatMappedRDD[3] at flatMap at <console>:15
scala> x.flatMap(y => y).collect()
res4: Array[String] = Array(a, b, c, d)

Related

Scala: map(f) and map(_.f)

I thought that in Scala map(f) is the same as map(_.f), i.e. map(x => x.f), but it turns out it is not:
scala> val a = List(1,2,3)
val a: List[Int] = List(1, 2, 3)
scala> a.map(toString)
val res7: List[Char] = List(l, i, n)
scala> a.map(_.toString)
val res8: List[String] = List(1, 2, 3)
What happens when a.map(toString) is called? Where did the three characters l, i, and n come from?
map(f) is not the same as map(_.f()). It's the same as map(f(_)). That is, it's going to call f(x), not x.f(), for each x in the list.
So a.map(toString) should be an error because the normal toString method does not take any arguments. My guess is that in your REPL session you've defined your own toString method that takes an argument and that's the one that's being called.
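A small illustration of that point (the double helper is hypothetical, not from the question):
// map(f) passes each element to f, i.e. it behaves like map(x => f(x)).
def double(n: Int): Int = n * 2
val a = List(1, 2, 3)
a.map(double)      // List(2, 4, 6), same as a.map(x => double(x))
a.map(_.toString)  // List("1", "2", "3"): calls each element's own toString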

Expand a RDD[List[(ImmutableBytesWritable, Put)]] to RDD[(ImmutableBytesWritable, Put)] [duplicate]

This is the same question as the one at the top of this page; the flatMap(identity) / flatMap(y => y) answers above apply here as well.

Define a 2d list and append lists to it in a for loop, scala

I want to define a 2D list before a for loop and then append 1D lists to it inside the loop, like so:
var 2dEmptyList: listOf<List<String>>
for (element<-elements){
///do some stuff
2dEmptyList.plusAssign(1dlist)
}
The code above does not work. But I can't seem to find a solution for this and it is so simple!
scala> val elements = List("a", "b", "c")
elements: List[String] = List(a, b, c)
scala> val twoDimensionalList: List[List[String]] = List.empty[List[String]]
twoDimensionalList: List[List[String]] = List()
scala> val res = for (element <- elements) yield twoDimensionalList ::: List(element)
res: List[List[java.io.Serializable]] = List(List(a), List(b), List(c))
Better still:
scala> twoDimensionalList ::: elements.map(List(_))
res8: List[List[String]] = List(List(a), List(b), List(c))
If you want 2dEmptyList to be mutable, consider using scala.collection.mutable.ListBuffer:
scala> val ll = scala.collection.mutable.ListBuffer.empty[List[String]]
ll: scala.collection.mutable.ListBuffer[List[String]] = ListBuffer()
scala> ll += List("Hello")
res7: ll.type = ListBuffer(List(Hello))
scala> ll += List("How", "are", "you?")
res8: ll.type = ListBuffer(List(Hello), List(How, are, you?))
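As a sketch of the loop from the original question (the names here are assumptions, not from the post), you can append inside the loop and convert back to an immutable list at the end:
// Append one-dimensional lists in a for loop, then freeze the result.
val elements = List("a", "b", "c")
val buffer = scala.collection.mutable.ListBuffer.empty[List[String]]
for (element <- elements) {
  // ... do some stuff ...
  buffer += List(element)
}
val twoDimensionalList: List[List[String]] = buffer.toList
// List(List(a), List(b), List(c))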

How to process cogroup values?

I am cogrouping two RDDs and I want to process their values. That is,
rdd1.cogroup(rdd2)
As a result of this cogrouping I get results like the one below:
(ion,(CompactBuffer(100772C121, 100772C111, 6666666666),CompactBuffer(100772C121)))
Considering this result, I would like to obtain all distinct pairs, e.g.
For the key 'ion':
100772C121 - 100772C111
100772C121 - 6666666666
100772C111 - 6666666666
How can I do this in Scala?
You could try something like the following:
(l1 ++ l2).distinct.combinations(2).map { case Seq(x, y) => (x, y) }.toList
You would need to substitute the contents of your CompactBuffer fields for l1 and l2. When I tried this locally, I got this (which is what I believe you want):
scala> val l1 = List("100772C121", "100772C111", "6666666666")
l1: List[String] = List(100772C121, 100772C111, 6666666666)
scala> val l2 = List("100772C121")
l2: List[String] = List(100772C121)
scala> val combine = (l1 ++ l2).distinct.combinations(2).map { case Seq(x, y) => (x, y) }.toList
combine: List[(String, String)] = List((100772C121,100772C111), (100772C121,6666666666), (100772C111,6666666666))
If you would like all of these pairs on separate rows, you can enclose this logic within a flatMap.
EDIT: Added steps per your example above.
scala> val rdd1 = sc.parallelize(Array(("ion", "100772C121"), ("ion", "100772C111"), ("ion", "6666666666")))
rdd1: org.apache.spark.rdd.RDD[(String, String)] = ParallelCollectionRDD[0] at parallelize at <console>:12
scala> val rdd2 = sc.parallelize(Array(("ion", "100772C121")))
rdd2: org.apache.spark.rdd.RDD[(String, String)] = ParallelCollectionRDD[1] at parallelize at <console>:12
scala> val cgrp = rdd1.cogroup(rdd2).flatMap {
| case (key: String, (l1: Iterable[String], l2: Iterable[String])) =>
| (l1.toSeq ++ l2.toSeq).distinct.combinations(2).map { case Seq(x, y) => (x, y) }.toList
| }
cgrp: org.apache.spark.rdd.RDD[(String, String)] = FlatMappedRDD[4] at flatMap at <console>:16
scala> cgrp.foreach(println)
...
(100772C121,100772C111)
(100772C121,6666666666)
(100772C111,6666666666)
EDIT 2: Updated again per your use case.
scala> val cgrp = rdd1.cogroup(rdd2).flatMap {
| case (key: String, (l1: Iterable[String], l2: Iterable[String])) =>
| for { e1 <- l1.toSeq; e2 <- l2.toSeq; if (e1 != e2) }
| yield if (e1 > e2) ((e1, e2), 1) else ((e2, e1), 1)
| }.reduceByKey(_ + _)
...
((6666666666,100772C121),2)
((6666666666,100772C111),1)
((100772C121,100772C111),1)

Best way to implement "zipLongest" in Scala

I need to implement a "zipLongest" function in Scala; that is, combine two sequences together as pairs, and if one is longer than the other, use a default value. (Unlike the standard zip method, which will just truncate to the shortest sequence.)
I've implemented it directly as follows:
def zipLongest[T](xs: Seq[T], ys: Seq[T], default: T): Seq[(T, T)] = (xs, ys) match {
  case (Seq(), Seq())           => Seq()
  case (Seq(), y +: rest)       => (default, y) +: zipLongest(Seq(), rest, default)
  case (x +: rest, Seq())       => (x, default) +: zipLongest(rest, Seq(), default)
  case (x +: restX, y +: restY) => (x, y) +: zipLongest(restX, restY, default)
}
Is there a better way to do it?
Use zipAll:
scala> val l1 = List(1,2,3)
l1: List[Int] = List(1, 2, 3)
scala> val l2 = List("a","b")
l2: List[String] = List(a, b)
scala> l1.zipAll(l2,0,".")
res0: List[(Int, String)] = List((1,a), (2,b), (3,.))
If you want to use the same default value for the first and second seq:
scala> def zipLongest[T](xs:Seq[T], ys:Seq[T], default:T) = xs.zipAll(ys, default, default)
zipLongest: [T](xs: Seq[T], ys: Seq[T], default: T)Seq[(T, T)]
scala> val l3 = List(4,5,6,7)
l3: List[Int] = List(4, 5, 6, 7)
scala> zipLongest(l1,l3,0)
res1: Seq[(Int, Int)] = List((1,4), (2,5), (3,6), (0,7))
You can do this as a one-liner (where x and y are the default elements used to pad each side):
xs.padTo(ys.length, x).zip(ys.padTo(xs.length, y))
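A quick check of that one-liner, taking x = 0 and y = "." as the per-side default values (an assumed reading of the snippet above):
val xs = List(1, 2, 3)
val ys = List("a", "b")
xs.padTo(ys.length, 0).zip(ys.padTo(xs.length, "."))
// List((1,a), (2,b), (3,.))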