What's the equivalent of RDD aggregateByKey for a Spark DataFrame using Scala? - scala

I wrote the code below to compute correlations per group in a DataFrame, but I ended up using RDD aggregateByKey with a sequence operation (seqOp) and a combiner operation (combOp) to get what I needed. I would like to implement the same thing using only the Spark DataFrame API and avoid RDDs completely. While reading about DataFrames I came across agg and groupBy, but I wasn't sure how to get the same result as with the RDD version. Any help is much appreciated.
val columnIndexes = columns.indices.map(i => i + groupIndexes.length).toArray
// removing rows with nulls in group by columns like the MR version
val cleanDF = selectedDF.na.drop("any", groupByColumns)
val allCountersPerGroupRDD: RDD[(immutable.IndexedSeq[Any], Seq[Seq[CovCounter]])] = cleanDF.rdd.map(row =>
    // create key value pairs
    (groupIndexes.map(ind => row.get(ind)), columnIndexes.map(i => toDouble(row.get(i)))))
  .aggregateByKey(zeroCounters, numPartitions)(
    seqOp = (counters, newValues) => {
      for ((i, j) <- columnHalfPairedIndicesFlattened) {
        counters(i)(j).addIfNotNaN(newValues(i), newValues(j))
      }
      counters
    }, combOp = (baseCounters, otherCounters) => {
      for ((i, j) <- columnHalfPairedIndicesFlattened) {
        baseCounters(i)(j).merge(otherCounters(i)(j))
      }
      baseCounters
    })
val finalRDD: RDD[Row] = allCountersPerGroupRDD.mapPartitions { iterator =>
  iterator.flatMap { case (groupKeys, counts) =>
    columns.indices.map(ind =>
      Row.fromSeq(groupKeys ++ Seq(columns(ind)) ++ columnPairedIndicesAll(ind).map { case (i, j) =>
        getCovOrCorrFromCounters(i, j, counts, useCorrelation)
      }))
  }
}
val outDF = sparkSession.createDataFrame(finalRDD, outputSchema)

You need to write your own UDAF (user-defined aggregate function). See https://docs.databricks.com/spark/latest/spark-sql/udaf-scala.html.
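A UDAF has the same moving parts as aggregateByKey: a zero buffer, a per-row update (the seqOp), a buffer merge (the combOp), and a final evaluation. The sketch below shows that shape in plain Scala for a single pair of columns, with no Spark dependency. `CorrBuffer`, `zeroBuffer`, and `aggregatePartition` are hypothetical names, not the asker's `CovCounter`, and the running-sums formula is just one simple choice that can lose precision on large or badly scaled data. For the common case, Spark SQL also has built-in `corr`, `covar_samp`, and `covar_pop` aggregate functions in `org.apache.spark.sql.functions` that can be used inside `groupBy(...).agg(...)`.

```scala
// Hypothetical buffer holding the running sums needed for Pearson correlation.
final case class CorrBuffer(n: Long, sx: Double, sy: Double, sxx: Double, syy: Double, sxy: Double) {
  // Per-row step: plays the role of seqOp (a UDAF's update/reduce). NaN pairs are skipped.
  def add(x: Double, y: Double): CorrBuffer =
    if (x.isNaN || y.isNaN) this
    else CorrBuffer(n + 1, sx + x, sy + y, sxx + x * x, syy + y * y, sxy + x * y)

  // Cross-partition step: plays the role of combOp (a UDAF's merge).
  def merge(o: CorrBuffer): CorrBuffer =
    CorrBuffer(n + o.n, sx + o.sx, sy + o.sy, sxx + o.sxx, syy + o.syy, sxy + o.sxy)

  // Final step: plays the role of a UDAF's evaluate/finish.
  def corr: Double = {
    val cov = n * sxy - sx * sy
    val varX = n * sxx - sx * sx
    val varY = n * syy - sy * sy
    cov / math.sqrt(varX * varY)
  }
}

val zeroBuffer = CorrBuffer(0L, 0.0, 0.0, 0.0, 0.0, 0.0)

// Fold one partition's rows into a buffer, starting from the zero value.
def aggregatePartition(rows: Seq[(Double, Double)]): CorrBuffer =
  rows.foldLeft(zeroBuffer) { case (buf, (x, y)) => buf.add(x, y) }

// Two simulated partitions of the same group; y = 2x, plus one NaN row that is ignored.
val partitions = Seq(
  Seq((1.0, 2.0), (2.0, 4.0)),
  Seq((3.0, 6.0), (Double.NaN, 1.0), (4.0, 8.0))
)
val merged = partitions.map(aggregatePartition).reduce(_ merge _)
val correlation = merged.corr
```

In a typed `Aggregator`, `zeroBuffer`, `add`, `merge`, and `corr` would map onto `zero`, `reduce`, `merge`, and `finish`; in a `UserDefinedAggregateFunction` they would map onto `initialize`, `update`, `merge`, and `evaluate`.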

Related

org.apache.spark.SparkException: This RDD lacks a SparkContext error

The complete error is:
org.apache.spark.SparkException: This RDD lacks a SparkContext. It
could happen in the following cases:
(1) RDD transformations and actions are NOT invoked by the driver, but inside of other transformations; for example, rdd1.map(x =>
rdd2.values.count() * x) is invalid because the values transformation
and count action cannot be performed inside of the rdd1.map
transformation. For more information, see SPARK-5063.
(2) When a Spark Streaming job recovers from checkpoint, this exception will be hit if a reference to an RDD not defined by the
streaming job is used in DStream operations. For more information, See
SPARK-13758.
But I don't think I used a nested RDD transformation in my code.
How can I solve it?
My Scala code:
stream.foreachRDD { rdd => {
  val nRDD = rdd.map(item => item.value())
  val oldRDD = sc.textFile("hdfs://localhost:9011/recData/miniApp/mall")
  val top = oldRDD.sortBy(item => {
    val arr = item.split(' ')
    arr(0)
  }, ascending = false).take(200)
  val topRDD = sc.makeRDD(top)
  val unionRDD = topRDD.union(nRDD)
  val validRDD = unionRDD.map(item => {
      val arr = item.split(' ')
      ((arr(1), arr(2)), arr(3).toDouble)
    })
    .reduceByKey((f, s) => {
      if (f > s) f else s
    })
    .distinct()
  val ratings = validRDD.map(item => {
    Rating(item._1._2.toInt, item._1._1.toInt, item._2)
  })
  val rank = 10
  val numIterations = 5
  val model = ALS.train(ratings, rank, numIterations, 0.01)
  nRDD.map(item => {
      val arr = item.split(' ')
      arr(2)
    }).toDS()
    .distinct()
    .foreach(item => {
      println("als recommending for user " + item)
      val recommendRes = model.recommendProducts(item.toInt, 10)
      for (elem <- recommendRes) {
        println(elem)
      }
    })
  nRDD.saveAsTextFile("hdfs://localhost:9011/recData/miniApp/mall")
}
}
The error is telling you that you're missing a SparkContext. I'm guessing that the program fails on this line:
val oldRDD = sc.textFile("hdfs://localhost:9011/recData/miniApp/mall")
The documentation provides an example of obtaining a SparkSession (and, through it, the SparkContext) from the RDD itself inside foreachRDD.
From the docs:
val stream: DStream[String] = ...
stream.foreachRDD { rdd =>
  // Get the singleton instance of SparkSession
  val spark = SparkSession.builder.config(rdd.sparkContext.getConf).getOrCreate()
  import spark.implicits._
  // Do things...
}
Although you're using RDDs instead of DataFrames, the same principles should apply.

sort and rank in spark RDD in one file

I have a Spark RDD as below:
(maths,60)
(english,65)
(english,77)
(maths,23)
(maths,50)
I need to sort and rank the given RDD in one step, as below:
(maths,23,1)
(maths,50,2)
(maths,60,3)
(english,65,1)
(english,77,2)
I know this can be done easily with a DataFrame, but I need Spark RDD code for the solution. Please suggest an approach.
Spark RDD transformations such as groupByKey and flatMap, together with Scala List functions such as sorted, should help achieve this.
val rdd = spark.sparkContext.parallelize(
  Seq(("maths", 60),
    ("english", 65),
    ("english", 77),
    ("maths", 23),
    ("maths", 50)))
val result = rdd.groupByKey().flatMap(group => {
  group._2.toList
    .sorted // sort marks
    .zipWithIndex // add the position/rank
    .map {
      case (marks, index) => (group._1, marks, index + 1)
    }
})
result.collect
// Array((english,65,1), (english,77,2), (maths,23,1), (maths,50,2), (maths,60,3))
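The per-group step in the answer above can be tried on plain Scala collections before running it on a cluster. The sketch below assumes a hypothetical helper `rankMarks` and uses a local `groupBy` in place of `groupByKey`; the sorting and `zipWithIndex` logic is the same. Note that equal marks would still receive distinct consecutive ranks with this approach.

```scala
// Hypothetical helper: rank one subject's marks in ascending order, starting at 1.
def rankMarks(subject: String, marks: Iterable[Int]): List[(String, Int, Int)] =
  marks.toList.sorted.zipWithIndex.map { case (mark, idx) => (subject, mark, idx + 1) }

val input = Seq(("maths", 60), ("english", 65), ("english", 77), ("maths", 23), ("maths", 50))

// Local stand-in for groupByKey: group the marks by subject, then rank within each group.
val ranked = input
  .groupBy { case (subject, _) => subject }
  .toList
  .sortBy { case (subject, _) => subject } // only to make the output order deterministic
  .flatMap { case (subject, rows) => rankMarks(subject, rows.map(_._2)) }
```

On an RDD, the same helper could presumably be passed as `rdd.groupByKey().flatMap { case (subject, marks) => rankMarks(subject, marks) }`.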
Another RDD solution:
val df = Seq(("maths", 60), ("english", 65), ("english", 77), ("maths", 23), ("maths", 50)).toDF("subject", "marks")
val rdd1 = df.rdd
rdd1.groupBy(x => x(0))
  .map(x => {
    val p = x._2.toList.map(a => a(1)).map(_.toString.toInt).sortWith((a1, a2) => a1 < a2).zipWithIndex.map(b => (b._1, b._2 + 1))
    (x._1, p)
  })
  .flatMap(x => x._2.map((x._1, _)))
  .collect.foreach(println)
Results:
(english,(65,1))
(english,(77,2))
(maths,(23,1))
(maths,(50,2))
(maths,(60,3))

Spark accumulator empty when used in UDF

I was working on optimizing my Spark process and tried to use a UDF with an accumulator. I got the accumulator to work on its own and wanted to see whether a UDF would give any speedup. Instead, when I update the accumulator inside the UDF, it remains empty. Am I doing something wrong in particular? Is lazy execution at play, so that even with my .count the UDF is still not executed?
Input:
0,[0.11,0.22]
1,[0.22,0.33]
Output:
(0,0,0.11),(0,1,0.22),(1,0,0.22),(1,1,0.33)
Code
val accum = new MapAccumulator2d()
val session = SparkSession.builder().getOrCreate()
session.sparkContext.register(accum)
// Does not work - empty accumulator
val rowAccum = udf((itemId: Int, item: mutable.WrappedArray[Float]) => {
  val map = item
    .zipWithIndex
    .map(ff => {
      ((itemId, ff._2), ff._1.toDouble)
    }).toMap
  accum.add(map)
  itemId
})
dataFrame.select(rowAccum(col("itemId"), col("jaccardList"))).count
// Works
dataFrame.foreach(f => {
  val map = f.getAs[mutable.WrappedArray[Float]](1)
    .zipWithIndex
    .map(ff => {
      ((f.getInt(0), ff._2), ff._1.toDouble)
    }).toMap
  accum.add(map)
})
val list = accum.value.toList.map(f => (f._1._1, f._1._2, f._2))
Looks like the only issue here is using count to "trigger" the lazily-evaluated UDF: Spark is "smart" enough to realize that the select operation can't change the result of count and therefore doesn't really execute the UDF. Choosing a different operation (e.g. collect) shows that the UDF works and updates the accumulator.
Here's a (more concise) example:
val accum = sc.longAccumulator
val rowAccum = udf((itemId: Int) => { accum.add(itemId); itemId })
val dataFrame = Seq(1,2,3,4,5).toDF("itemId")
dataFrame.select(rowAccum(col("itemId"))).count() // won't trigger UDF
println(s"RESULT: ${accum.value}") // prints 0
dataFrame.select(rowAccum(col("itemId"))).collect() // triggers UDF
println(s"RESULT: ${accum.value}") // prints 15

Filtering One RDD based on another RDD using regex

I have two RDDs of the form:
data_wo_header: RDD[String], data_test_wo_header: RDD[String]
scala> data_wo_header.first
res2: String = 1,2,3.5,1112486027
scala> data_test_wo_header.first
res2: String = 1,2
RDD2 is smaller than RDD1. I am trying to filter RDD1 by removing the elements that match (by regex) an entry in RDD2.
The 1,2 in the example above represents UserID,MovID. Since that pair is present in the test set, I want a new RDD in which it has been removed from RDD1.
I have asked a similar question, but the answer required an unnecessary split of the RDD.
I am trying to do something like this, but it's not working:
def create_training(data_wo_header: RDD[String], data_test_wo_header: RDD[String]): List[String] = {
  var ratings_train = new ListBuffer[String]()
  data_wo_header.foreach(x => {
    data_test_wo_header.foreach(y => {
      if (x.indexOf(y) == 0) {
        ratings_train += x
      }
    })
  })
  val ratings_train_list = ratings_train.toList
  return ratings_train_list
}
How should I do a regex match and filter based on it?
You can use a broadcast variable to share the contents of rdd2 and then filter rdd1 against the broadcast value. I replicated your code and this works for me:
def create_training(data_wo_header: RDD[String], data_test_wo_header: RDD[String]): List[String] = {
  val rdd2array = sparkSession.sparkContext.broadcast(data_test_wo_header.collect())
  val training_set = data_wo_header.filter {
    case (x) => rdd2array.value.filter(y => x.matches(y)).length == 0
  }
  training_set.collect().toList
}
Also, with Scala and Spark I recommend avoiding foreach where possible and using a more functional style with map, flatMap and filter.
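One caveat worth checking, as far as I can tell: `String.matches` succeeds only when the pattern matches the whole string, so a test entry such as `1,2` does not match the line `1,2,3.5,1112486027`. A minimal sketch of an alternative, in plain Scala so it can be tried without a cluster, compares the `UserID,MovID` prefix of each line against a set of test keys. `ratingKey`, `testKeys`, and every sample line other than the first are made up for illustration.

```scala
// Hypothetical helper: the "UserID,MovID" prefix of a "UserID,MovID,rating,timestamp" line.
def ratingKey(line: String): String = line.split(",", 3).take(2).mkString(",")

// Whole-string regex matching: false here, because "1,2" does not cover the full line.
val wholeStringMatch = "1,2,3.5,1112486027".matches("1,2")

// Made-up test keys and data lines.
val testKeys: Set[String] = Set("1,2", "3,4")
val data = Seq(
  "1,2,3.5,1112486027",
  "1,29,3.5,1112484676",
  "3,4,4.0,1094785740",
  "12,2,5.0,1094785741"
)

// Keep only the lines whose key is not in the test set.
val training = data.filterNot(line => testKeys.contains(ratingKey(line)))
```

With Spark, the same predicate could go inside `data_wo_header.filter(...)`, with the key set built from `data_test_wo_header.collect().toSet` and shared through a broadcast variable as in the answer. Comparing whole keys also avoids treating `1,2` as a prefix of `1,29`.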

Apache Spark WordCount method not sorting Data

def wordCount(dataSet: RDD[String]): Map[String, Int] = {
  val counts = dataSet.flatMap(line => line.split(","))
    .map(word => (word, 1))
    .reduceByKey(_ + _)
    .sortBy(_._2, ascending = false)
  counts.collectAsMap()
}
This method is not sorting the final result as expected.
.sortBy(_._2, ascending = false)
The output of this method should be in descending order, but it still comes out in random order.
Any reason or solution?
The collectAsMap() method internally builds a HashMap, which is unordered, so the sort order is lost. Use collect or takeOrdered to get sorted values.
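The effect can be reproduced without Spark: the order of a sorted sequence survives in an `Array` or a `ListMap`, but a hash-based `Map` makes no such promise. The sketch below uses made-up counts. With an RDD, `counts.collect()` after `sortBy` returns the pairs in sorted order; changing the method's return type to `Array[(String, Int)]` is an assumption about what the caller can accept.

```scala
import scala.collection.immutable.{HashMap, ListMap}

// Made-up word counts standing in for the result of reduceByKey.
val counts = Seq(("spark", 3), ("scala", 7), ("rdd", 1), ("sort", 5), ("map", 2), ("key", 9))

// Sort by count, descending; this mirrors sortBy(_._2, ascending = false).
val sortedCounts = counts.sortBy { case (_, n) => -n }

// An Array (what collect() returns on an RDD) keeps the sorted order.
val asArray = sortedCounts.toArray

// A ListMap keeps insertion order, so it is one way to get an ordered Map on the driver.
val asListMap = ListMap(sortedCounts: _*)

// A hash-based Map makes no ordering promise, so the sort may be lost here.
val asHashMap = HashMap(sortedCounts: _*)
```

If a `Map` return type is required, building a `ListMap` from the collected array on the driver is one option; otherwise returning the array directly is simpler.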