scala> val dataArray = Array("a,1|2|3","b,4|5|6","a,7|8|9","b,10|11|12")
dataArray: Array[String] = Array(a,1|2|3, b,4|5|6, a,7|8|9, b,10|11|12)
scala> val dataRDD = sc.parallelize(dataArray)
dataRDD: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[8] at parallelize at <console>:26
scala> val mapRDD = dataRDD.map(rec => (rec.split(",")(0),rec.split(",")(1).split("\\|")))
mapRDD: org.apache.spark.rdd.RDD[(String, Array[String])] = MapPartitionsRDD[9] at map at <console>:25
scala> mapRDD.collect
res20: Array[(String, Array[String])] = Array((a,Array(1, 2, 3)), (b,Array(4, 5, 6)), (a,Array(7, 8, 9)), (b,Array(10, 11, 12)))
scala> mapRDD.reduceByKey((value1,value2) => List(value1(0) + value2(0)))
<console>:26: error: type mismatch;
found : List[String]
required: Array[String]
mapRDD.reduceByKey((value1,value2) => List(value1(0) + value2(0)))
I then tried this:
val finalRDD = mapRDD.map(elem => (elem._1,elem._2.mkString("#")))
scala> finalRDD.reduceByKey((a,b) => (a.split("#")(0).toInt + b.split("#")(0).toInt).toString )
res31: org.apache.spark.rdd.RDD[(String, String)] = ShuffledRDD[14] at reduceByKey at <console>:26
scala> res31.collect
res32: Array[(String, String)] = Array((b,14), (a,8))
As you can see, I am not able to get the result for all indexes; my code gives the sum only for the first index.
I want the sum to be applied index-wise, i.e. the sum of all a[0], the sum of all a[1], and so on. My expected output is below:
(a,(8,10,12))
(b,(14,16,18))
Please help.
Use transpose and map each transposed column to its sum with _.sum:
import org.apache.spark.rdd.RDD

object RddExample {

  def main(args: Array[String]): Unit = {
    val spark = Constant.getSparkSess // project-specific helper that returns a SparkSession
    val sc = spark.sparkContext
    // import spark.implicits._

    val dataArray = Array("a,1|2|3", "b,4|5|6", "a,7|8|9", "b,10|11|12")
    val dataRDD = sc.parallelize(dataArray)
    // split off the key and convert the pipe-separated values to Int up front
    val mapRDD: RDD[(String, Array[Int])] = dataRDD.map(rec =>
      (rec.split(",")(0), rec.split(",")(1).split("\\|").map(_.toInt)))
    // val mapRdd : Array[(String, Array[String])] = mapRDD.collect

    // group by key, transpose the per-key rows into columns, and sum each column
    val result: Array[(String, List[Int])] = mapRDD.groupByKey().mapValues(itr => {
      itr.toList.transpose.map(_.sum)
    }).collect()

    result.foreach(println)
  }
}
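If you want to apply the same idea directly in the spark-shell session from the question, where the values are still Array[String], a minimal sketch (assuming mapRDD as defined there) is:
// convert each element to Int while summing the transposed columns
mapRDD.groupByKey()
  .mapValues(_.toList.transpose.map(_.map(_.toInt).sum))
  .collect()
// expected: Array((a,List(8, 10, 12)), (b,List(14, 16, 18))), ordering may vary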
I feel that QuickSilver's answer is perfect.
I tried this approach using reduceByKey, but I am manually adding for every index; if there are more indexes, this won't scale.
Up to mapRDD the code is the same as in my question:
scala> val finalRDD = mapRDD.map(elem => (elem._1, elem._2 match {
| case Array(a:String,b:String,c:String) => (a.toInt,b.toInt,c.toInt)
| case _ => (100,200,300)
| }
| )
| )
scala> finalRDD.collect
res39: Array[(String, (Int, Int, Int))] = Array((a,(1,2,3)), (b,(4,5,6)), (a,(7,8,9)), (b,(10,11,12)))
scala> finalRDD.reduceByKey((v1,v2) => (v1._1+v2._1,v1._2+v2._2,v1._3+v2._3))
res40: org.apache.spark.rdd.RDD[(String, (Int, Int, Int))] = ShuffledRDD[18] at reduceByKey at <console>:26
scala> res40.collect
res41: Array[(String, (Int, Int, Int))] = Array((b,(14,16,18)), (a,(8,10,12)))
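For reference, reduceByKey can be kept without hard-coding one field per index: if the values are already Array[Int] (as in QuickSilver's answer), the two arrays can be zipped element-wise and each pair summed. A sketch under that assumption:
// mapRDD here is RDD[(String, Array[Int])]
mapRDD.reduceByKey((a, b) => a.zip(b).map { case (x, y) => x + y }).collect()
// expected: Array((a,Array(8, 10, 12)), (b,Array(14, 16, 18))), ordering may vary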
Given
scala> val rdd1 = sc.parallelize(Seq(("a",1),("a",2),("b",3)))
scala> val rdd2 = sc.parallelize(Seq(("a",5),("c",6)))
scala> val rdd3 = rdd1.leftOuterJoin(rdd2)
scala> rdd3.collect()
res: Array[(String, (Int, Option[Int]))] = Array((a,(1,Some(5))), (a,(2,Some(5))), (b,(3,None)))
We can see the Option[Int] data type in rdd3. Is there a way to rectify this so that the collected result is Array[(String, (Int, Int))]? Suppose we can specify a value (e.g. 999) for the None.
scala> val result = rdd3.collect()
scala> result.map(t => (t._1, (t._2._1, t._2._2.getOrElse(999))))
This should do it.
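If you would rather resolve the Option on the RDD itself before collecting, a sketch using mapValues:
// replace None with the default 999 while the data is still distributed
val rdd4 = rdd3.mapValues { case (v, opt) => (v, opt.getOrElse(999)) }
rdd4.collect()   // Array[(String, (Int, Int))]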
I have two RDDs; rdd1 = RDD[(String, Array[String])] and rdd2 = RDD[String].
I want to remove all rdd1's where the Key is not found in rdd2.
Thank you in advance!
You can do an inner join, but first you have to turn the second RDD into a pair RDD.
val rdd1: RDD[(String, Array[String])] = ???
val rdd2: RDD[String] = ???
val asPairRdd: RDD[(String, Unit)] = rdd2.map(s => (s, ()))
val res: RDD[(String, Array[String])] = rdd1.join(asPairRdd).map {
  case (k, (v, dummy)) => (k, v)
}
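A small usage sketch with made-up sample data (the keys and values here are only for illustration):
val rdd1 = sc.parallelize(Seq(("a", Array("1", "2")), ("b", Array("3")), ("c", Array("4"))))
val rdd2 = sc.parallelize(Seq("a", "c"))

val asPairRdd = rdd2.map(s => (s, ()))
val res = rdd1.join(asPairRdd).map { case (k, (v, _)) => (k, v) }

res.keys.collect()   // only the keys present in rdd2 survive: "a" and "c"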
How can I achieve this in Scala?
val a = sc.parallelize(List(("a", "aaa$$bbb"), ("b", "ccc$$ddd$$eee")))
val res1 = a.mapValues(_.replaceAll("\\$\\$", "-"))
Here I have an Array[(String, String)]:
Array[(String, String)] = Array(("a",aaa-bbb), ("b",ccc-ddd-eee))
Now I want the result to be as below:
a,aaa
a,bbb
b,ccc
b,ddd
b,eee
Thanks in advance
You can use flatMap:
res1.flatMap{ case (k, v) => v.split("-").map((k, _)) }.collect
// res7: Array[(String, String)] = Array((a,aaa), (a,bbb), (b,ccc), (b,ddd), (b,eee))
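If the replaceAll step is not needed for anything else, you could also split on the original delimiter directly (a small variation, not required by the question):
a.flatMap { case (k, v) => v.split("\\$\\$").map((k, _)) }.collect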
Why does the following end up with an error?
scala> import sqlContext.implicits._
import sqlContext.implicits._
scala> val rdd = sc.parallelize(1 to 10).map(x => (Map(x -> 0), 0))
rdd: org.apache.spark.rdd.RDD[(scala.collection.immutable.Map[Int,Int], Int)] = MapPartitionsRDD[20] at map at <console>:27
scala> rdd.toDF
res8: org.apache.spark.sql.DataFrame = [_1: map<int,int>, _2: int]
scala> val rdd = sc.parallelize(1 to 10).map(x => Map(x -> 0))
rdd: org.apache.spark.rdd.RDD[scala.collection.immutable.Map[Int,Int]] = MapPartitionsRDD[23] at map at <console>:27
scala> rdd.toDF
<console>:30: error: value toDF is not a member of org.apache.spark.rdd.RDD[scala.collection.immutable.Map[Int,Int]]
rdd.toDF
So what exactly is happening here? toDF can convert an RDD of type (scala.collection.immutable.Map[Int,Int], Int) to a DataFrame, but not one of type scala.collection.immutable.Map[Int,Int]. Why is that?
For the same reason you cannot use
sqlContext.createDataFrame((1 to 10).map(x => Map(x -> 0)))
If you take a look at the org.apache.spark.sql.SQLContext source you'll find two different implementations of the createDataFrame method:
def createDataFrame[A <: Product : TypeTag](rdd: RDD[A]): DataFrame
and
def createDataFrame[A <: Product : TypeTag](data: Seq[A]): DataFrame
As you can see, both require A to be a subtype of Product. When you call toDF on an RDD[(Map[Int,Int], Int)] it works because Tuple2 is indeed a Product. Map[Int,Int] by itself is not, hence the error.
You can make it work by wrapping Map with Tuple1:
sc.parallelize(1 to 10).map(x => Tuple1(Map(x -> 0))).toDF
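If you also want a friendlier column name than the default _1, toDF accepts column names:
sc.parallelize(1 to 10).map(x => Tuple1(Map(x -> 0))).toDF("m")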
Basically, because there is no implicit to create a DataFrame for a Map inside an RDD.
In your first example you are returning a tuple, which is a Product, and for Products there is an implicit conversion:
rddToDataFrameHolder[A <: Product : TypeTag](rdd: RDD[A])
In the second example you have a Map in your RDD, for which there is no implicit conversion.
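As an alternative to Tuple1, wrapping the Map in a case class (which is also a Product) should work as well; a minimal sketch with a hypothetical wrapper named MapRow:
// any case class is a Product, so the implicit conversion applies
case class MapRow(m: Map[Int, Int])

sc.parallelize(1 to 10).map(x => MapRow(Map(x -> 0))).toDF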