Map and Split the data based on key in Spark Scala

Map and Split the data based on key in Spark Scala - scala

How can I achieve this in scala
val a = sc.parallelize(List(("a", "aaa$$bbb"), ("b", ("ccc$$ddd$$eee"))))
val res1 = a.mapValues(_.replaceAll("\\$\\$", "-"))
here I have Array[(String, String)]
Array[(String, String)] = Array(("a",aaa-bbb), ("b",ccc-ddd-eee))
Now I want the result to be as below
1,aaa
1,bbb
2,ccc
2,ddd
2,eee
Thanks in advance

You can use flatMap:
res1.flatMap{ case (k, v) => v.split("-").map((k, _)) }.collect
// res7: Array[(String, String)] = Array((a,aaa), (a,bbb), (b,ccc), (b,ddd), (b,eee))

Related

How to find the sum of values by for each as per index position in spark/scala

scala> val dataArray = Array("a,1|2|3","b,4|5|6","a,7|8|9","b,10|11|12")
dataArray: Array[String] = Array(a,1|2|3, b,4|5|6, a,7|8|9, b,10|11|12)
scala> val dataRDD = sc.parallelize(dataArray)
dataRDD: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[8] at parallelize at <console>:26
scala> val mapRDD = dataRDD.map(rec => (rec.split(",")(0),rec.split(",")(1).split("\\|")))
mapRDD: org.apache.spark.rdd.RDD[(String, Array[String])] = MapPartitionsRDD[9] at map at
<console>:25
scala> mapRDD.collect
res20: Array[(String, Array[String])] = Array((a,Array(1, 2, 3)), (b,Array(4, 5, 6)), (a,Array(7, 8, 9)), (b,Array(10, 11, 12)))
scala> mapRDD.reduceByKey((value1,value2) => List(value1(0) + value2(0)))
<console>:26: error: type mismatch;
found : List[String]
required: Array[String]
mapRDD.reduceByKey((value1,value2) => List(value1(0) + value2(0)))
I tried like this further
val finalRDD = mapRDD.map(elem => (elem._1,elem._2.mkString("#")))
scala> finalRDD.reduceByKey((a,b) => (a.split("#")(0).toInt + b.split("#")(0).toInt).toString )
res31: org.apache.spark.rdd.RDD[(String, String)] = ShuffledRDD[14] at reduceByKey at <console>:26
scala> res31.collect
res32: Array[(String, String)] = Array((b,14), (a,8))
As you can see i am not able to get the result for all indexes , my code gives the sum only for one index .
My expected output is below
I want the sum to be applied on index basis such sum of all a[0] and sum of all a[1]
(a,(8,10,12))
(b,(14,16,18))
please help

Use transpose and map the result to _.sum
import org.apache.spark.rdd.RDD
object RddExample {
def main(args: Array[String]): Unit = {
val spark = Constant.getSparkSess
val sc = spark.sparkContext
// import spark.implicits._
val dataArray = Array("a,1|2|3","b,4|5|6","a,7|8|9","b,10|11|12")
val dataRDD = sc.parallelize(dataArray)
val mapRDD : RDD[(String, Array[Int])] = dataRDD.map(rec => (rec.split(",")(0),rec.split(",")(1)
.split("\\|").map(_.toInt)))
// val mapRdd : Array[(String, Array[String])] = mapRDD.collect
val result : Array[(String, List[Int])] = mapRDD.groupByKey().mapValues(itr => {
itr.toList.transpose.map(_.sum)
}).collect()
println(result)
}
}

I feel that QuickSilver answer is perfect
I tried this approach using reduceByKey , but i am manually adding for every index , if there are more indexes, then this wont help
Till mapRDD the code is same as per my question
.....
.....
mapRDD = code as per Question
scala> val finalRDD = mapRDD.map(elem => (elem._1, elem._2 match {
| case Array(a:String,b:String,c:String) => (a.toInt,b.toInt,c.toInt)
| case _ => (100,200,300)
| }
| )
| )
scala> finalRDD.collect
res39: Array[(String, (Int, Int, Int))] = Array((a,(1,2,3)), (b,(4,5,6)), (a,(7,8,9)), (b,(10,11,12)))
scala> finalRDD.reduceByKey((v1,v2) => (v1._1+v2._1,v1._2+v2._2,v1._3+v2._3))
res40: org.apache.spark.rdd.RDD[(String, (Int, Int, Int))] = ShuffledRDD[18] at reduceByKey at <console>:26
scala> res40.collect
res41: Array[(String, (Int, Int, Int))] = Array((b,(14,16,18)), (a,(8,10,12)))

scala: how to rectify "option" type after leftOuterJoin

Given
scala> val rdd1 = sc.parallelize(Seq(("a",1),("a",2),("b",3)))
scala> val rdd2 = sc.parallelize(Seq("a",5),("c",6))
scala> val rdd3 = rdd1.leftOuterJoin(rdd2)
scala> rdd3.collect()
res: Array[(String, (Int, Option[Int]))] = Array((a,(1,Some(5))), (a,(2,Some(5))), (b,(3,None)))
We can see that the data type of "Option[Int]" in rdd3. Is there a way to rectify this so that rdd3 can be Array[String, (Int, Int)]? Suppose we can specify a value (e.g. 999) for the "None".

scala> val result = rdd3.collect()
scala> result.map(t => (t._1, (t._2._1, t._2._2.getOrElse(999))))
This should do it.

Spark: Intersection between Key-Value pair and Key RDD

I have two RDDs; rdd1 = RDD[(String, Array[String])] and rdd2 = RDD[String].
I want to remove all rdd1's where the Key is not found in rdd2.
Thank you in advance!

You can make an inner join but first you have to make the second RDD be pair rdd.
val rdd1: RDD[(String, Array[String])] = ???
val rdd2: RDD[String] = ???
val asPairRdd: RDD[(String, Unit)] = rdd2.map(s => (s, ()))
val res: RDD[(String, Array[String])] = rdd1.join(asPairRdd).map{
case (k, (v, dummy)) => (k, v)
}

Flatmap scala [String, String,List[String]]

I have this prbolem, I have an RDD[(String,String, List[String]), and I would like to "flatmap" it to obtain a RDD[(String,String, String)]:
e.g:
val x :RDD[(String,String, List[String]) =
RDD[(a,b, list[ "ra", "re", "ri"])]
I would like get:
val result: RDD[(String,String,String)] =
RDD[(a, b, ra),(a, b, re),(a, b, ri)])]

Use flatMap:
val rdd = sc.parallelize(Seq(("a", "b", List("ra", "re", "ri"))))
// rdd: org.apache.spark.rdd.RDD[(String, String, List[String])] = ParallelCollectionRDD[7] at parallelize at <console>:28
rdd.flatMap{ case (x, y, z) => z.map((x, y, _)) }.collect
// res23: Array[(String, String, String)] = Array((a,b,ra), (a,b,re), (a,b,ri))

This is an alternative way of doing it using flatMap again
val rdd = sparkContext.parallelize(Seq(("a", "b", List("ra", "re", "ri"))))
rdd.flatMap(array => array._3.map(list => (array._1, array._2, list))).foreach(println)

Apache Spark - Scala - how to FlatMap (k, {v1,v2,v3,...}) to ((k,v1),(k,v2),(k,v3),...)

I got this:
val vector: RDD[(String, Array[String])] = [("a", {v1,v2,..}),("b", {u1,u2,..})]
wanna convert to:
RDD[(String, String)] = [("a",v1), ("a",v2), ..., ("b",u1), ("b",u2), ...]
Any idea how to do that using flatMap.

This:
vector.flatMap { case (x, arr) => arr.map((x, _)) }
Will give you:
scala> val vector = sc.parallelize(Vector(("a", Array("b", "c")), ("b", Array("d", "f"))))
vector: org.apache.spark.rdd.RDD[(String, Array[String])] =
ParallelCollectionRDD[3] at parallelize at <console>:27
scala> vector.flatMap { case (x, arr) => arr.map((x, _)) }.collect
res4: Array[(String, String)] = Array((a,b), (a,c), (b,d), (b,f))

You can definitely need to use flatMap like you mentioned, but in addition, you need to use scala map as well.
For example:
val idToVectorValue: RDD[(String, String ] = vector.flatMap((id,values) => values.map(value => (id, value)))

Using single parameter function:
vector.flatMap(data => data._2.map((data._1, _)))

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

Map and Split the data based on key in Spark Scala - scala

You can use flatMap: res1.flatMap{ case (k, v) => v.split("-").map((k, _)) }.collect // res7: Array[(String, String)] = Array((a,aaa), (a,bbb), (b,ccc), (b,ddd), (b,eee))

Related

How to find the sum of values by for each as per index position in spark/scala

scala: how to rectify "option" type after leftOuterJoin

Spark: Intersection between Key-Value pair and Key RDD

Flatmap scala [String, String,List[String]]

Apache Spark - Scala - how to FlatMap (k, {v1,v2,v3,...}) to ((k,v1),(k,v2),(k,v3),...)

Categories

Resources