Spark: Intersection between Key-Value pair and Key RDD - scala

I have two RDDs; rdd1 = RDD[(String, Array[String])] and rdd2 = RDD[String].
I want to remove all entries of rdd1 whose key is not found in rdd2.
Thank you in advance!

You can use an inner join, but first you have to turn the second RDD into a pair RDD.
val rdd1: RDD[(String, Array[String])] = ???
val rdd2: RDD[String] = ???
val asPairRdd: RDD[(String, Unit)] = rdd2.map(s => (s, ()))
val res: RDD[(String, Array[String])] = rdd1.join(asPairRdd).map {
  case (k, (v, dummy)) => (k, v)
}
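To sanity-check it end to end, here is a self-contained run with made-up sample data (the values are mine, not from the question):
// Hypothetical sample data; "b" has no match in rdd2 and is dropped.
val rdd1 = sc.parallelize(Seq(("a", Array("x", "y")), ("b", Array("z"))))
val rdd2 = sc.parallelize(Seq("a", "c"))
val res = rdd1.join(rdd2.map(s => (s, ()))).mapValues { case (v, _) => v }
res.collect.foreach { case (k, v) => println(s"$k -> ${v.mkString(",")}") }
// prints: a -> x,y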

Related

How can I do this without converting the Dataset to an RDD?

Could someone help me avoid the RDD conversion here?
val qksDistribution: Array[((String, Int), Long)] = tripDataset
  .map(i => ((i.getFirstPoint.getQk.substring(0, QK_PARTITION_LEVEL), i.getProviderId), 1L))
  .rdd
  .reduceByKey(_ + _)
  .filter(_._2 > maxCountInPartition / 10)
  .collect
You can stay in the Dataset API by grouping and counting per key:
val qksDistribution: Array[((String, Int), Long)] = tripDataset
  .map(i => (i.getFirstPoint.getQk.substring(0, QK_PARTITION_LEVEL), i.getProviderId)) // no need to map to 1L
  .groupByKey(x => x) // similar to keyBy
  .count() // counts per key, as you wanted
  .filter(_._2 > maxCountInPartition / 10)
  .collect
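For reference, a minimal self-contained run of that pipeline in spark-shell (where spark.implicits._ is already imported); the Trip case class, sample values, and threshold are my stand-ins for the asker's tripDataset and maxCountInPartition:
case class Trip(qk: String, providerId: Int) // hypothetical schema
val tripDS = Seq(Trip("012345", 1), Trip("012399", 1), Trip("014500", 2)).toDS()
val QK_PARTITION_LEVEL = 4
val threshold = 1L // stand-in for maxCountInPartition / 10
val distribution: Array[((String, Int), Long)] = tripDS
  .map(t => (t.qk.substring(0, QK_PARTITION_LEVEL), t.providerId))
  .groupByKey(identity)
  .count()
  .filter(_._2 > threshold)
  .collect()
// Array(((0123,1),2)): only the (qk-prefix, provider) pair seen more than once survives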

scala: how to rectify "option" type after leftOuterJoin

Given
scala> val rdd1 = sc.parallelize(Seq(("a",1),("a",2),("b",3)))
scala> val rdd2 = sc.parallelize(Seq(("a",5),("c",6)))
scala> val rdd3 = rdd1.leftOuterJoin(rdd2)
scala> rdd3.collect()
res: Array[(String, (Int, Option[Int]))] = Array((a,(1,Some(5))), (a,(2,Some(5))), (b,(3,None)))
We can see the Option[Int] type in rdd3. Is there a way to rectify this so that rdd3 collects to Array[(String, (Int, Int))]? Suppose we can specify a default value (e.g. 999) for the None.
scala> val result = rdd3.collect()
scala> result.map(t => (t._1, (t._2._1, t._2._2.getOrElse(999))))
This should do it.
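If you would rather keep things distributed, the same default can be filled in on the RDD before collecting (a variant I am suggesting, not part of the original answer):
scala> val rdd4 = rdd3.mapValues { case (v, opt) => (v, opt.getOrElse(999)) }
scala> rdd4.collect()
res: Array[(String, (Int, Int))] = Array((a,(1,5)), (a,(2,5)), (b,(3,999)))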

Map and Split the data based on key in Spark Scala

How can I achieve this in Scala?
val a = sc.parallelize(List(("a", "aaa$$bbb"), ("b", "ccc$$ddd$$eee")))
val res1 = a.mapValues(_.replaceAll("\\$\\$", "-"))
here I have Array[(String, String)]
Array[(String, String)] = Array(("a",aaa-bbb), ("b",ccc-ddd-eee))
Now I want the result to be as below
a,aaa
a,bbb
b,ccc
b,ddd
b,eee
Thanks in advance
You can use flatMap:
res1.flatMap{ case (k, v) => v.split("-").map((k, _)) }.collect
// res7: Array[(String, String)] = Array((a,aaa), (a,bbb), (b,ccc), (b,ddd), (b,eee))
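A small variant (my suggestion, not in the original answer): if the dash-separated form is not needed elsewhere, you can split on the original "$$" delimiter directly and skip the replaceAll step:
a.flatMap { case (k, v) => v.split("\\$\\$").map((k, _)) }.collect
// Array[(String, String)] = Array((a,aaa), (a,bbb), (b,ccc), (b,ddd), (b,eee))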

Flatmap scala [String, String, List[String]]

I have this problem: I have an RDD[(String, String, List[String])], and I would like to "flatMap" it to obtain an RDD[(String, String, String)].
e.g.:
val x: RDD[(String, String, List[String])] = // contains ("a", "b", List("ra", "re", "ri"))
I would like to get:
val result: RDD[(String, String, String)] = // contains ("a", "b", "ra"), ("a", "b", "re"), ("a", "b", "ri")
Use flatMap:
val rdd = sc.parallelize(Seq(("a", "b", List("ra", "re", "ri"))))
// rdd: org.apache.spark.rdd.RDD[(String, String, List[String])] = ParallelCollectionRDD[7] at parallelize at <console>:28
rdd.flatMap{ case (x, y, z) => z.map((x, y, _)) }.collect
// res23: Array[(String, String, String)] = Array((a,b,ra), (a,b,re), (a,b,ri))
This is an alternative way of doing it, again using flatMap:
val rdd = sparkContext.parallelize(Seq(("a", "b", List("ra", "re", "ri"))))
rdd.flatMap(t => t._3.map(s => (t._1, t._2, s))).foreach(println)
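The same logic can also be written as a for-comprehension, which desugars to exactly this flatMap/map chain (my addition, equivalent to the answers above):
val result = for {
  t <- rdd
  s <- t._3
} yield (t._1, t._2, s)
result.collect
// Array((a,b,ra), (a,b,re), (a,b,ri))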

Apache Spark - Scala - how to FlatMap (k, {v1,v2,v3,...}) to ((k,v1),(k,v2),(k,v3),...)

I have this:
val vector: RDD[(String, Array[String])] = [("a", {v1, v2, ...}), ("b", {u1, u2, ...})]
and I want to convert it to:
RDD[(String, String)] = [("a", v1), ("a", v2), ..., ("b", u1), ("b", u2), ...]
Any idea how to do that using flatMap?
This:
vector.flatMap { case (x, arr) => arr.map((x, _)) }
will give you:
scala> val vector = sc.parallelize(Vector(("a", Array("b", "c")), ("b", Array("d", "f"))))
vector: org.apache.spark.rdd.RDD[(String, Array[String])] =
ParallelCollectionRDD[3] at parallelize at <console>:27
scala> vector.flatMap { case (x, arr) => arr.map((x, _)) }.collect
res4: Array[(String, String)] = Array((a,b), (a,c), (b,d), (b,f))
You definitely need to use flatMap as you mentioned, but in addition you need to use Scala's map on the inner array as well.
For example:
val idToVectorValue: RDD[(String, String)] =
  vector.flatMap { case (id, values) => values.map(value => (id, value)) }
Using single parameter function:
vector.flatMap(data => data._2.map((data._1, _)))
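Since vector is a pair RDD, flatMapValues from PairRDDFunctions expresses the same thing even more tersely (assuming the same vector as in the first answer):
vector.flatMapValues(_.toSeq).collect
// Array[(String, String)] = Array((a,b), (a,c), (b,d), (b,f))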