Spark pairRDD not working - scala

value subtractByKey is not a member of org.apache.spark.rdd.RDD[(String, LabeledPoint)]
value join is not a member of org.apache.spark.rdd.RDD[(String, LabeledPoint)]
Why is this happening? org.apache.spark.rdd.RDD[(String, LabeledPoint)] is a pair RDD, and I have already imported org.apache.spark.rdd._.

In the spark-shell this works exactly as expected, without having to import anything (the implicit conversion to PairRDDFunctions is picked up automatically):
scala> case class LabeledPoint(x: Int, y: Int, label: String)
defined class LabeledPoint
scala> val rdd1 = sc.parallelize(List("this","is","a","test")).map(label => (label, LabeledPoint(0,0,label)))
rdd1: org.apache.spark.rdd.RDD[(String, LabeledPoint)] = MapPartitionsRDD[1] at map at <console>:23
scala> val rdd2 = sc.parallelize(List("this","is","a","test")).map(label => (label, 1))
rdd2: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[3] at map at <console>:21
scala> rdd1.join(rdd2)
res0: org.apache.spark.rdd.RDD[(String, (LabeledPoint, Int))] = MapPartitionsRDD[6] at join at <console>:28
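For completeness, subtractByKey resolves through the same implicit conversion to PairRDDFunctions. A minimal sketch continuing the session above, with rdd1 and rdd2 as defined there:
// keep only the entries of rdd1 whose key does not appear in rdd2
val remaining = rdd1.subtractByKey(rdd2)   // RDD[(String, LabeledPoint)]
// empty here, because the two RDDs share all their keys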

Related

How to find the sum of values for each index position in Spark/Scala

scala> val dataArray = Array("a,1|2|3","b,4|5|6","a,7|8|9","b,10|11|12")
dataArray: Array[String] = Array(a,1|2|3, b,4|5|6, a,7|8|9, b,10|11|12)
scala> val dataRDD = sc.parallelize(dataArray)
dataRDD: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[8] at parallelize at <console>:26
scala> val mapRDD = dataRDD.map(rec => (rec.split(",")(0),rec.split(",")(1).split("\\|")))
mapRDD: org.apache.spark.rdd.RDD[(String, Array[String])] = MapPartitionsRDD[9] at map at
<console>:25
scala> mapRDD.collect
res20: Array[(String, Array[String])] = Array((a,Array(1, 2, 3)), (b,Array(4, 5, 6)), (a,Array(7, 8, 9)), (b,Array(10, 11, 12)))
scala> mapRDD.reduceByKey((value1,value2) => List(value1(0) + value2(0)))
<console>:26: error: type mismatch;
found : List[String]
required: Array[String]
mapRDD.reduceByKey((value1,value2) => List(value1(0) + value2(0)))
I then tried this further:
val finalRDD = mapRDD.map(elem => (elem._1,elem._2.mkString("#")))
scala> finalRDD.reduceByKey((a,b) => (a.split("#")(0).toInt + b.split("#")(0).toInt).toString )
res31: org.apache.spark.rdd.RDD[(String, String)] = ShuffledRDD[14] at reduceByKey at <console>:26
scala> res31.collect
res32: Array[(String, String)] = Array((b,14), (a,8))
As you can see, I am not able to get the result for all indexes; my code gives the sum only for one index.
My expected output is below. I want the sum applied per index position, i.e. the sum of all values at index 0, the sum of all values at index 1, and so on:
(a,(8,10,12))
(b,(14,16,18))
Please help.
Use transpose and map the result with _.sum:
import org.apache.spark.rdd.RDD

object RddExample {

  def main(args: Array[String]): Unit = {
    val spark = Constant.getSparkSess   // the author's helper that returns a SparkSession
    val sc = spark.sparkContext

    val dataArray = Array("a,1|2|3", "b,4|5|6", "a,7|8|9", "b,10|11|12")
    val dataRDD = sc.parallelize(dataArray)

    // split "a,1|2|3" into the key "a" and the numeric values Array(1, 2, 3)
    val mapRDD: RDD[(String, Array[Int])] = dataRDD.map(rec =>
      (rec.split(",")(0), rec.split(",")(1).split("\\|").map(_.toInt)))

    // group the arrays by key, transpose them so each index becomes a row,
    // and sum each row to get the per-index totals
    val result: Array[(String, List[Int])] = mapRDD.groupByKey().mapValues { itr =>
      itr.toList.transpose.map(_.sum)
    }.collect()

    result.foreach(println)
  }
}
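With the sample data above, result.foreach(println) should print (key order may vary):
(a,List(8, 10, 12))
(b,List(14, 16, 18))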
I feel that QuickSilver's answer is perfect.
I tried the approach below using reduceByKey, but I am manually adding for every index; if there are more indexes, this won't help.
The code up to mapRDD is the same as in the question above.
scala> val finalRDD = mapRDD.map(elem => (elem._1, elem._2 match {
| case Array(a:String,b:String,c:String) => (a.toInt,b.toInt,c.toInt)
| case _ => (100,200,300)
| }
| )
| )
scala> finalRDD.collect
res39: Array[(String, (Int, Int, Int))] = Array((a,(1,2,3)), (b,(4,5,6)), (a,(7,8,9)), (b,(10,11,12)))
scala> finalRDD.reduceByKey((v1,v2) => (v1._1+v2._1,v1._2+v2._2,v1._3+v2._3))
res40: org.apache.spark.rdd.RDD[(String, (Int, Int, Int))] = ShuffledRDD[18] at reduceByKey at <console>:26
scala> res40.collect
res41: Array[(String, (Int, Int, Int))] = Array((b,(14,16,18)), (a,(8,10,12)))
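If the number of indexes is not fixed, the manual tuple arithmetic above does not scale. A sketch of a reduceByKey variant that instead sums the value arrays element-wise, assuming the numeric mapRDD: RDD[(String, Array[Int])] from QuickSilver's answer:
// element-wise sum of the per-key arrays; works for any number of indexes
val summed = mapRDD.reduceByKey((a, b) => a.zip(b).map { case (x, y) => x + y })
summed.mapValues(_.mkString("(", ",", ")")).collect().foreach(println)
// prints, in some order:
// (a,(8,10,12))
// (b,(14,16,18))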

scala: how to rectify "option" type after leftOuterJoin

Given
scala> val rdd1 = sc.parallelize(Seq(("a",1),("a",2),("b",3)))
scala> val rdd2 = sc.parallelize(Seq(("a",5),("c",6)))
scala> val rdd3 = rdd1.leftOuterJoin(rdd2)
scala> rdd3.collect()
res: Array[(String, (Int, Option[Int]))] = Array((a,(1,Some(5))), (a,(2,Some(5))), (b,(3,None)))
We can see the Option[Int] data type in rdd3. Is there a way to rectify this so the result can be Array[(String, (Int, Int))]? Suppose we can specify a default value (e.g. 999) for the None.
scala> val result = rdd3.collect()
scala> result.map(t => (t._1, (t._2._1, t._2._2.getOrElse(999))))
This should do it.
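If you would rather keep the data distributed instead of collecting first, the same default can be applied with mapValues on the RDD itself; a minimal sketch using rdd3 from above:
// replace None with the sentinel 999 before collecting
val rdd4 = rdd3.mapValues { case (v, opt) => (v, opt.getOrElse(999)) }
rdd4.collect()
// Array((a,(1,5)), (a,(2,5)), (b,(3,999)))   -- order may vary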

Spark: Intersection between Key-Value pair and Key RDD

I have two RDDs; rdd1 = RDD[(String, Array[String])] and rdd2 = RDD[String].
I want to remove all entries of rdd1 whose key is not found in rdd2.
Thank you in advance!
You can do an inner join, but first you have to turn the second RDD into a pair RDD:
val rdd1: RDD[(String, Array[String])] = ???
val rdd2: RDD[String] = ???

val asPairRdd: RDD[(String, Unit)] = rdd2.map(s => (s, ()))

val res: RDD[(String, Array[String])] = rdd1.join(asPairRdd).map {
  case (k, (v, dummy)) => (k, v)
}
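For example, if the two RDDs are filled with some made-up sample data (values here are only for illustration):
val rdd1 = sc.parallelize(Seq(("a", Array("x", "y")), ("b", Array("z"))))
val rdd2 = sc.parallelize(Seq("a", "c"))
val asPairRdd = rdd2.map(s => (s, ()))
rdd1.join(asPairRdd).map { case (k, (v, _)) => (k, v) }.collect()
// only the "a" entry survives, because "b" has no matching key in rdd2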

Map and Split the data based on key in Spark Scala

How can I achieve the following in Scala?
val a = sc.parallelize(List(("a", "aaa$$bbb"), ("b", ("ccc$$ddd$$eee"))))
val res1 = a.mapValues(_.replaceAll("\\$\\$", "-"))
Here I have an Array[(String, String)]:
Array[(String, String)] = Array(("a",aaa-bbb), ("b",ccc-ddd-eee))
Now I want the result to be as below
1,aaa
1,bbb
2,ccc
2,ddd
2,eee
Thanks in advance
You can use flatMap:
res1.flatMap{ case (k, v) => v.split("-").map((k, _)) }.collect
// res7: Array[(String, String)] = Array((a,aaa), (a,bbb), (b,ccc), (b,ddd), (b,eee))
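Since the key is left untouched, flatMapValues expresses the same thing a bit more directly; a minimal sketch on the same res1:
res1.flatMapValues(_.split("-")).collect
// Array((a,aaa), (a,bbb), (b,ccc), (b,ddd), (b,eee))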

Why does Spark/Scala compiler fail to find toDF on RDD[Map[Int, Int]]?

Why does the following end up with an error?
scala> import sqlContext.implicits._
import sqlContext.implicits._
scala> val rdd = sc.parallelize(1 to 10).map(x => (Map(x -> 0), 0))
rdd: org.apache.spark.rdd.RDD[(scala.collection.immutable.Map[Int,Int], Int)] = MapPartitionsRDD[20] at map at <console>:27
scala> rdd.toDF
res8: org.apache.spark.sql.DataFrame = [_1: map<int,int>, _2: int]
scala> val rdd = sc.parallelize(1 to 10).map(x => Map(x -> 0))
rdd: org.apache.spark.rdd.RDD[scala.collection.immutable.Map[Int,Int]] = MapPartitionsRDD[23] at map at <console>:27
scala> rdd.toDF
<console>:30: error: value toDF is not a member of org.apache.spark.rdd.RDD[scala.collection.immutable.Map[Int,Int]]
rdd.toDF
So what exactly is happening here? toDF can convert an RDD of type (scala.collection.immutable.Map[Int,Int], Int) to a DataFrame, but not an RDD of type scala.collection.immutable.Map[Int,Int]. Why is that?
For the same reason you cannot use
sqlContext.createDataFrame((1 to 10).map(x => Map(x -> 0)))
If you take a look at the org.apache.spark.sql.SQLContext source you'll find two different implementations of the createDataFrame method:
def createDataFrame[A <: Product : TypeTag](rdd: RDD[A]): DataFrame
and
def createDataFrame[A <: Product : TypeTag](data: Seq[A]): DataFrame
As you can see, both require A to be a subclass of Product. When you call toDF on an RDD[(Map[Int,Int], Int)] it works because Tuple2 is indeed a Product. Map[Int,Int] by itself is not, hence the error.
You can make it work by wrapping Map with Tuple1:
sc.parallelize(1 to 10).map(x => Tuple1(Map(x -> 0))).toDF
Basically, it is because there is no implicit to create a DataFrame for a Map inside an RDD.
In your first example you are returning a tuple, which is a Product, and for Products there is an implicit conversion:
rddToDataFrameHolder[A <: Product : TypeTag](rdd: RDD[A])
In the second example you have a Map in your RDD, for which there is no implicit conversion.
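Equivalently, wrapping the Map in a Product of your own gives toDF something it can convert; a small sketch, where MapRow is a hypothetical case class and sqlContext.implicits._ is in scope as above:
case class MapRow(m: Map[Int, Int])
sc.parallelize(1 to 10).map(x => MapRow(Map(x -> 0))).toDF
// DataFrame with a single column: [m: map<int,int>]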