flatMap over RDD[(String, String, List[String])] - Scala

I have this problem: I have an RDD[(String, String, List[String])], and I would like to "flatMap" it to obtain an RDD[(String, String, String)].
For example, given:
val x: RDD[(String, String, List[String])] =
RDD[(a, b, List("ra", "re", "ri"))]
I would like to get:
val result: RDD[(String, String, String)] =
RDD[(a, b, ra), (a, b, re), (a, b, ri)]

Use flatMap:
val rdd = sc.parallelize(Seq(("a", "b", List("ra", "re", "ri"))))
// rdd: org.apache.spark.rdd.RDD[(String, String, List[String])] = ParallelCollectionRDD[7] at parallelize at <console>:28
rdd.flatMap{ case (x, y, z) => z.map((x, y, _)) }.collect
// res23: Array[(String, String, String)] = Array((a,b,ra), (a,b,re), (a,b,ri))

This is an alternative way of doing it, again using flatMap:
val rdd = sparkContext.parallelize(Seq(("a", "b", List("ra", "re", "ri"))))
rdd.flatMap(array => array._3.map(list => (array._1, array._2, list))).foreach(println)
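The same transformation can also be written as a for-comprehension, which the compiler desugars into the flatMap/map shown above; a minimal sketch assuming the same rdd:
val result = for {
  t <- rdd        // t: (String, String, List[String])
  s <- t._3       // one output row per element of the inner list
} yield (t._1, t._2, s)

result.collect
// Array((a,b,ra), (a,b,re), (a,b,ri))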

Related

How to find the sum of values for each index position in Spark/Scala

scala> val dataArray = Array("a,1|2|3","b,4|5|6","a,7|8|9","b,10|11|12")
dataArray: Array[String] = Array(a,1|2|3, b,4|5|6, a,7|8|9, b,10|11|12)
scala> val dataRDD = sc.parallelize(dataArray)
dataRDD: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[8] at parallelize at <console>:26
scala> val mapRDD = dataRDD.map(rec => (rec.split(",")(0),rec.split(",")(1).split("\\|")))
mapRDD: org.apache.spark.rdd.RDD[(String, Array[String])] = MapPartitionsRDD[9] at map at <console>:25
scala> mapRDD.collect
res20: Array[(String, Array[String])] = Array((a,Array(1, 2, 3)), (b,Array(4, 5, 6)), (a,Array(7, 8, 9)), (b,Array(10, 11, 12)))
scala> mapRDD.reduceByKey((value1,value2) => List(value1(0) + value2(0)))
<console>:26: error: type mismatch;
found : List[String]
required: Array[String]
mapRDD.reduceByKey((value1,value2) => List(value1(0) + value2(0)))
I tried like this further
val finalRDD = mapRDD.map(elem => (elem._1,elem._2.mkString("#")))
scala> finalRDD.reduceByKey((a,b) => (a.split("#")(0).toInt + b.split("#")(0).toInt).toString )
res31: org.apache.spark.rdd.RDD[(String, String)] = ShuffledRDD[14] at reduceByKey at <console>:26
scala> res31.collect
res32: Array[(String, String)] = Array((b,14), (a,8))
As you can see, I am not able to get the result for all indexes; my code gives the sum only for the first index.
I want the sum to be applied per index position, i.e. the sum of all a[0], the sum of all a[1], and so on. My expected output is below:
(a,(8,10,12))
(b,(14,16,18))
Please help.
Use transpose and map the result with _.sum:
import org.apache.spark.rdd.RDD

object RddExample {

  def main(args: Array[String]): Unit = {
    val spark = Constant.getSparkSess   // answerer's own helper that returns a SparkSession
    val sc = spark.sparkContext

    val dataArray = Array("a,1|2|3", "b,4|5|6", "a,7|8|9", "b,10|11|12")
    val dataRDD = sc.parallelize(dataArray)
    val mapRDD: RDD[(String, Array[Int])] = dataRDD.map(rec =>
      (rec.split(",")(0), rec.split(",")(1).split("\\|").map(_.toInt)))
    // group the value arrays by key, then transpose so each index position becomes a column to sum
    val result: Array[(String, List[Int])] = mapRDD.groupByKey().mapValues(itr => {
      itr.toList.transpose.map(_.sum)
    }).collect()
    result.foreach(println)   // println(result) would only print the array reference
  }
}
I feel that QuickSilver's answer is perfect.
I tried the approach below using reduceByKey, but I am manually adding for every index; if there are more indexes, this won't scale.
Up to mapRDD the code is the same as in my question:
.....
.....
mapRDD = code as per the question
scala> val finalRDD = mapRDD.map(elem => (elem._1, elem._2 match {
| case Array(a:String,b:String,c:String) => (a.toInt,b.toInt,c.toInt)
| case _ => (100,200,300)
| }
| )
| )
scala> finalRDD.collect
res39: Array[(String, (Int, Int, Int))] = Array((a,(1,2,3)), (b,(4,5,6)), (a,(7,8,9)), (b,(10,11,12)))
scala> finalRDD.reduceByKey((v1,v2) => (v1._1+v2._1,v1._2+v2._2,v1._3+v2._3))
res40: org.apache.spark.rdd.RDD[(String, (Int, Int, Int))] = ShuffledRDD[18] at reduceByKey at <console>:26
scala> res40.collect
res41: Array[(String, (Int, Int, Int))] = Array((b,(14,16,18)), (a,(8,10,12)))
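For reference, the reduceByKey approach can also be made index-agnostic by keeping the values as arrays and summing them element-wise with zip; a sketch that assumes the mapRDD with Array[Int] values from QuickSilver's answer above:
// element-wise sum of the value arrays, works for any number of index positions
val summed = mapRDD.reduceByKey((v1, v2) => v1.zip(v2).map { case (x, y) => x + y })
summed.mapValues(_.mkString("(", ",", ")")).collect().foreach(println)
// (a,(8,10,12))
// (b,(14,16,18))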

Scala: how to rectify the "Option" type after leftOuterJoin

Given
scala> val rdd1 = sc.parallelize(Seq(("a",1),("a",2),("b",3)))
scala> val rdd2 = sc.parallelize(Seq(("a",5),("c",6)))
scala> val rdd3 = rdd1.leftOuterJoin(rdd2)
scala> rdd3.collect()
res: Array[(String, (Int, Option[Int]))] = Array((a,(1,Some(5))), (a,(2,Some(5))), (b,(3,None)))
We can see the "Option[Int]" data type in rdd3. Is there a way to rectify this so that rdd3 can be an Array[(String, (Int, Int))]? Suppose we can specify a default value (e.g. 999) for the "None".
scala> val result = rdd3.collect()
scala> result.map(t => (t._1, (t._2._1, t._2._2.getOrElse(999))))
This should do it.
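If you prefer to stay at the RDD level instead of fixing things up after collect, the same getOrElse can be applied with mapValues; a minimal sketch using the rdd3 from above:
// replace the None side of the join with a default value on the RDD itself
val rdd4 = rdd3.mapValues { case (left, rightOpt) => (left, rightOpt.getOrElse(999)) }
rdd4.collect()
// Array((a,(1,5)), (a,(2,5)), (b,(3,999)))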

Scala: Append only Some's to immutable list

Say we're given a List[String] and a bunch of Option[String]s, call them a, b and c. I want to append the valid Options (the Somes) among a, b, c to my existing List[String]. What would be the best way to go about this using immutable structures?
I.e. I know I could use a ListBuffer and do something like:
import scala.collection.mutable.ListBuffer

def foo(a: Option[String], b: Option[String], c: Option[String]): ListBuffer[String] = {
  val existingList = ListBuffer("hey")
  a.foreach(existingList += _)
  b.foreach(existingList += _)
  c.foreach(existingList += _)
  existingList
}
but I want to use immutable structures.
Use .flatten on a list of options and append it to your list
val existingList = List(1, 2, 3)
val a = Some(4)
val b = None
val c = Some(5)
val newList = existingList ::: List(a, b, c).flatten
def foo(a: Option[String], b: Option[String], c: Option[String]): List[String] =
  List("hey") ++ a.toList ++ b.toList ++ c.toList
which is similar to flatten or flatMap.
scala> foo(Some("a"), None, Some("c"))
res1: List[String] = List(hey, a, c)
It's better to define a generic function like this:
def foo[T](xs: Option[T]*): List[T] =
  xs.toList.flatten
scala> foo(Some("a"), None, Some("c"))
res2: List[String] = List(a, c)
Let val list = List("A", "B", "C") and val opts = List(Some("X"), None, Some("Y"), None, Some("Z")). Then list ++ opts.filter(_.isDefined).map(_.get) will give a new List("A", "B", "C", "X", "Y", "Z") with all elements of list and all non-empty elements of opts.
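Since filter(_.isDefined).map(_.get) is just a longer spelling of flatten, the same result can be written more directly; a small sketch with the same list and opts:
val list = List("A", "B", "C")
val opts = List(Some("X"), None, Some("Y"), None, Some("Z"))
list ++ opts.flatten
// List(A, B, C, X, Y, Z)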

Map and Split the data based on key in Spark Scala

How can I achieve this in Scala?
val a = sc.parallelize(List(("a", "aaa$$bbb"), ("b", ("ccc$$ddd$$eee"))))
val res1 = a.mapValues(_.replaceAll("\\$\\$", "-"))
Here (after collect) I have an Array[(String, String)]:
Array[(String, String)] = Array(("a",aaa-bbb), ("b",ccc-ddd-eee))
Now I want the result to be as below:
a,aaa
a,bbb
b,ccc
b,ddd
b,eee
Thanks in advance
You can use flatMap:
res1.flatMap{ case (k, v) => v.split("-").map((k, _)) }.collect
// res7: Array[(String, String)] = Array((a,aaa), (a,bbb), (b,ccc), (b,ddd), (b,eee))
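As a side note, the intermediate replaceAll step isn't strictly needed; you can split on the original "$$" delimiter directly inside the flatMap. A sketch using the a from the question:
a.flatMap { case (k, v) => v.split("\\$\\$").map((k, _)) }.collect
// Array((a,aaa), (a,bbb), (b,ccc), (b,ddd), (b,eee))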

Apache Spark - Scala - how to FlatMap (k, {v1,v2,v3,...}) to ((k,v1),(k,v2),(k,v3),...)

I got this:
val vector: RDD[(String, Array[String])] = [("a", {v1,v2,..}),("b", {u1,u2,..})]
I want to convert it to:
RDD[(String, String)] = [("a", v1), ("a", v2), ..., ("b", u1), ("b", u2), ...]
Any idea how to do that using flatMap?
This:
vector.flatMap { case (x, arr) => arr.map((x, _)) }
Will give you:
scala> val vector = sc.parallelize(Vector(("a", Array("b", "c")), ("b", Array("d", "f"))))
vector: org.apache.spark.rdd.RDD[(String, Array[String])] = ParallelCollectionRDD[3] at parallelize at <console>:27
scala> vector.flatMap { case (x, arr) => arr.map((x, _)) }.collect
res4: Array[(String, String)] = Array((a,b), (a,c), (b,d), (b,f))
You definitely need to use flatMap as you mentioned, but in addition you need to use Scala's map as well.
For example:
val idToVectorValue: RDD[(String, String)] = vector.flatMap { case (id, values) => values.map(value => (id, value)) }
Using a single-parameter function:
vector.flatMap(data => data._2.map((data._1, _)))