Spark scala RDD traversing - scala

How can I traverse the following RDD using Spark Scala? I want to print every value present in the Seq along with its associated key.
res1: org.apache.spark.rdd.RDD[(java.lang.String, Seq[java.lang.String])] = MapPartitionsRDD[6] at groupByKey at <console>:14
I tried the following code for it.
val ss = mapfile.map(x => {
  val key = x._1
  val value = x._2.sorted
  var i = 0
  while (i < value.length) {
    (key, value(i))
    i += 1
  }
})
ss.top(20).foreach(println)

I tried to convert your code as follows:
val ss = mapfile.flatMap {
  case (key, value) => value.sorted.map((key, _))
}
ss.top(20).foreach(println)
Is this what you want?
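For reference, here is a minimal, self-contained sketch of the flatMap approach above; the local SparkContext sc and the sample data are assumptions for illustration, not from the original question.
// Minimal sketch: assumes a local SparkContext `sc` and made-up sample data
val mapfile = sc.parallelize(Seq(
  ("a", Seq("z", "x", "y")),
  ("b", Seq("q", "p"))
))
val ss = mapfile.flatMap { case (key, value) => value.sorted.map((key, _)) }
ss.collect().foreach(println)
// (a,x)
// (a,y)
// (a,z)
// (b,p)
// (b,q)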

I tried this and it works for the return type as mentioned.
val ss = mapfile.flatMap { case (key, value) => value.sorted.map((key, _)) }.groupByKey().map(x => (x._1, x._2.toSeq))
ss.top(20).foreach(println)
Note: ss is of type org.apache.spark.rdd.RDD[(java.lang.String, Seq[java.lang.String])]

Related

sort and rank in spark RDD in one file

I have a Spark RDD as below:
(maths,60)
(english,65)
(english,77)
(maths,23)
(maths,50)
I need to sort and rank the given RDD in one go, as below:
(maths,23,1)
(maths,50,2)
(maths,60,3)
(english,65,1)
(english,77,2)
I know this can be done easily using DataFrames, but I need Spark RDD code to get the solution. Please suggest.
Spark RDD functions (so-called transformations) like groupByKey and flatMap, together with the Scala List function sorted, should help in achieving it.
val rdd = spark.sparkContext.parallelize(
  Seq(("maths", 60),
      ("english", 65),
      ("english", 77),
      ("maths", 23),
      ("maths", 50)))

val result = rdd.groupByKey().flatMap(group => {
  group._2.toList
    .sorted       // sort marks
    .zipWithIndex // add the position/rank
    .map {
      case (marks, index) => (group._1, marks, index + 1)
    }
})
result.collect
// Array((english,65,1), (english,77,2), (maths,23,1), (maths,50,2), (maths,60,3))
Another RDD solution:
val df = Seq(("maths",60),("english",65),("english",77),("maths",23),("maths",50)).toDF("subject","marks")
val rdd1 = df.rdd

rdd1.groupBy(x => x(0))
  .map(x => {
    val p = x._2.toList
      .map(a => a(1))
      .map(_.toString.toInt)
      .sortWith((a1, a2) => a1 < a2)
      .zipWithIndex
      .map(b => (b._1, b._2 + 1))
    (x._1, p)
  })
  .flatMap(x => x._2.map((x._1, _)))
  .collect.foreach(println)
Results:
(english,(65,1))
(english,(77,2))
(maths,(23,1))
(maths,(50,2))
(maths,(60,3))

Convert map to mapPartitions in spark

I have code that analyzes a log file using a map transformation. The RDD then gets converted to a DataFrame.
val logData = sc.textFile("hdfs://quickstart.cloudera:8020/user/cloudera/syslog.txt")
val logDataDF = logData.map(rec => (rec.split(" ")(0), rec.split(" ")(2), rec.split(" ")(5))).toDF("month", "date", "process")
I would like to know whether I can use mapPartitions in this case instead of map.
I don't know what your use case is, but you can definitely use mapPartitions instead of map. The code below will return the same logDataDF.
val logDataDF = logData.mapPartitions(x => {
  val lst = scala.collection.mutable.ListBuffer[(String, String, String)]()
  while (x.hasNext) {
    val rec = x.next().split(" ")
    lst += ((rec(0), rec(2), rec(5)))
  }
  lst.iterator
}).toDF("month", "date", "process")
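A lazier variant (my sketch, not part of the original answer) maps the partition iterator directly instead of buffering each partition in a ListBuffer; it assumes the same three space-separated fields:
val logDataDF2 = logData.mapPartitions(iter =>
  iter.map { rec =>
    val fields = rec.split(" ")
    (fields(0), fields(2), fields(5)) // month, date, process
  }
).toDF("month", "date", "process")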

Spark accumulator empty when used in UDF

I was working on optimizing my Spark process and was trying to use a UDF with an accumulator. I have gotten the accumulator to work on its own, and was looking to see if I would get any speedup using a UDF. But instead, when I wrap the accumulator in the UDF, it remains empty. Am I doing something wrong in particular? Is something going on with lazy execution where, even with my .count, it is still not executing?
Input:
0,[0.11,0.22]
1,[0.22,0.33]
Output:
(0,0,0.11),(0,1,0.22),(1,0,0.22),(1,1,0.33)
Code
val accum = new MapAccumulator2d()
val session = SparkSession.builder().getOrCreate()
session.sparkContext.register(accum)

// Does not work - empty accumulator
val rowAccum = udf((itemId: Int, item: mutable.WrappedArray[Float]) => {
  val map = item
    .zipWithIndex
    .map(ff => ((itemId, ff._2), ff._1.toDouble))
    .toMap
  accum.add(map)
  itemId
})
dataFrame.select(rowAccum(col("itemId"), col("jaccardList"))).count

// Works
dataFrame.foreach(f => {
  val map = f.getAs[mutable.WrappedArray[Float]](1)
    .zipWithIndex
    .map(ff => ((f.getInt(0), ff._2), ff._1.toDouble))
    .toMap
  accum.add(map)
})

val list = accum.value.toList.map(f => (f._1._1, f._1._2, f._2))
Looks like the only issue here is using count to "trigger" the lazily-evaluated UDF: Spark is "smart" enough to realize that the select operation can't change the result of count and therefore doesn't really execute the UDF. Choosing a different operation (e.g. collect) shows that the UDF works and updates the accumulator.
Here's a (more concise) example:
val accum = sc.longAccumulator
val rowAccum = udf((itemId: Int) => { accum.add(itemId); itemId })
val dataFrame = Seq(1,2,3,4,5).toDF("itemId")
dataFrame.select(rowAccum(col("itemId"))).count() // won't trigger UDF
println(s"RESULT: ${accum.value}") // prints 0
dataFrame.select(rowAccum(col("itemId"))).collect() // triggers UDF
println(s"RESULT: ${accum.value}") // prints 15

Filtering One RDD based on another RDD using regex

I have two RDD's of the form:
data_wo_header: RDD[String], data_test_wo_header: RDD[String]
scala> data_wo_header.first
res2: String = 1,2,3.5,1112486027
scala> data_test_wo_header.first
res2: String = 1,2
RDD2 is smaller than RDD1. I am trying to filter RDD1 by removing the elements whose regex matches an entry in RDD2.
The 1,2 in the above example represents UserID,MovID. Since it's present in the test set, I want the new RDD with that element removed from RDD1.
I have asked a similar question before, but that approach requires an unnecessary split of the RDD.
I am trying to do something of this sort but it's not working:
def create_training(data_wo_header: RDD[String], data_test_wo_header: RDD[String]): List[String] = {
  var ratings_train = new ListBuffer[String]()
  data_wo_header.foreach(x => {
    data_test_wo_header.foreach(y => {
      if (x.indexOf(y) == 0) {
        ratings_train += x
      }
    })
  })
  val ratings_train_list = ratings_train.toList
  return ratings_train_list
}
How should I do a regex match and filter based on it?
You can use a broadcast variable to share the contents of rdd2 and then filter rdd1 based on that broadcast variable. I replicated your code and this works for me:
def create_training(data_wo_header: RDD[String], data_test_wo_header: RDD[String]): List[String] = {
  val rdd2array = sparkSession.sparkContext.broadcast(data_test_wo_header.collect())
  val training_set = data_wo_header.filter {
    case (x) => rdd2array.value.filter(y => x.matches(y)).length == 0
  }
  training_set.collect().toList
}
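One caveat on the regex (my note, not part of the original answer): String.matches requires the pattern to match the entire line, so "1,2,3.5,1112486027".matches("1,2") is false. If the intent is to drop lines from RDD1 whose leading UserID,MovID pair appears in RDD2, a prefix test on the same broadcast value may be closer to what is wanted:
val training_set = data_wo_header.filter { x =>
  !rdd2array.value.exists(y => x.startsWith(y + ","))
}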
Also, with Scala and Spark I recommend, where possible, avoiding foreach and using a more functional style with the map, flatMap, and filter functions.

Scala - Convert List[String] to tuple List[(Int, Int)]

I would like to get lines from a Source and convert each one to a tuple (Int, Int). I've done it using foreach.
val values = collection.mutable.ListBuffer[(Int, Int)]()
Source.fromFile(invitationFile.ref.file).getLines().filter(line => !line.isEmpty).foreach(line => {
  val value = line.split("\\s")
  values += ((value(0).toInt, value(1).toInt))
})
What's the best way to write the same code without using foreach?
Use map, it builds a new list for you:
Source.fromFile(invitationFile.ref.file)
  .getLines()
  .filter(line => !line.isEmpty)
  .map(line => {
    val value = line.split("\\s")
    (value(0).toInt, value(1).toInt)
  })
  .toList
foreach should be a final operation, not a transformation.
In your case, you want to use the map function:
val values = Source.fromFile(invitationFile.ref.file).getLines()
.filter(line => !line.isEmpty)
.map(line => line.split("\\s"))
.map(line => (line(0).toInt, line(1).toInt))
Using a for comprehension:
val values = for {
  line <- Source.fromFile(invitationFile.ref.file).getLines()
  if !line.isEmpty
} yield {
  val splits = line.split("\\s")
  (splits(0).toInt, splits(1).toInt)
}
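A quick sanity check of the for comprehension with in-memory lines instead of a file (sample data assumed):
val sample = List("1 2", "", "10 20")
val parsed = for {
  line <- sample
  if !line.isEmpty
} yield {
  val splits = line.split("\\s")
  (splits(0).toInt, splits(1).toInt)
}
// parsed: List[(Int, Int)] = List((1,2), (10,20))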