sort and rank in spark RDD in one file - scala

I have a Spark RDD as below:
(maths,60)
(english,65)
(english,77)
(maths,23)
(maths,50)
I need to sort and rank the given RDD in one step, as below:
(maths,23,1)
(maths,50,2)
(maths,60,3)
(english,65,1)
(english,77,2)
I know this can be done easily using a DataFrame, but I need Spark RDD code for the solution. Please suggest.

Spark RDD functions (so-called transformations) like groupByKey and flatMap, together with Scala List functions like sorted, should help in achieving it.
val rdd = spark.sparkContext.parallelize(
  Seq(("maths", 60),
    ("english", 65),
    ("english", 77),
    ("maths", 23),
    ("maths", 50)))
val result = rdd.groupByKey().flatMap(group => {
  group._2.toList
    .sorted // sort marks in ascending order
    .zipWithIndex // add the 0-based position
    .map {
      case (marks, index) => (group._1, marks, index + 1) // 1-based rank
    }
})
result.collect
// Array((english,65,1), (english,77,2), (maths,23,1), (maths,50,2), (maths,60,3))

Another RDD solution:
import spark.implicits._ // needed for toDF outside spark-shell
val df = Seq(("maths", 60), ("english", 65), ("english", 77), ("maths", 23), ("maths", 50)).toDF("subject", "marks")
val rdd1 = df.rdd
rdd1.groupBy(x => x(0))
  .map { x =>
    // sort each subject's marks ascending and attach a 1-based rank
    val p = x._2.toList
      .map(a => a(1))
      .map(_.toString.toInt)
      .sortWith((a1, a2) => a1 < a2)
      .zipWithIndex
      .map(b => (b._1, b._2 + 1))
    (x._1, p)
  }
  .flatMap(x => x._2.map((x._1, _)))
  .collect.foreach(println)
Results:
(english,(65,1))
(english,(77,2))
(maths,(23,1))
(maths,(50,2))
(maths,(60,3))

Related

How to apply filters on spark scala dataframe view?

I am pasting a snippet here where I am facing issues with the BigQuery read. The "wherePart" has a large number of records, so the BQ call is invoked again and again. Keeping the filter outside of the BQ read would help. The idea is to first read the "mainTable" from BQ, store it in a Spark view, and then apply the "wherePart" filter to this view in Spark.
["subDate" is a function to subtract one date from another and return the number of days in between]
val Df = getFb(config, mainTable, ds)
def getFb(config: DataFrame, mainTable: String, ds: String): DataFrame = {
  val fb = config.map(row => Target.Pfb(
      row.getAs[String]("m1"),
      row.getAs[String]("m2"),
      row.getAs[Seq[Int]]("days")))
    .collect
  val wherePart = fb.map(x => (x.m1, x.m2, subDate(ds, x.days.max - 1))).
    map(x => s"(idata_${x._1} = '${x._2}' AND ds BETWEEN '${x._3}' AND '${ds}')").
    mkString(" OR ")
  val q = new Q()
  val tempView = "tempView"
  spark.readBigQueryTable(mainTable, wherePart).createOrReplaceTempView(tempView)
  val Df = q.mainTableLogs(tempView)
  Df
}
Could someone please help me here?
Are you using the spark-bigquery-connector? If so, the right syntax is:
spark.read.format("bigquery")
  .load(mainTable)
  .where(wherePart)
  .createOrReplaceTempView(tempView)

What's the equivalent of RDD aggregateByKey for a Spark DataFrame using Scala?

I wrote the code below to calculate the correlation for a DataFrame using grouping, but I eventually had to use RDD aggregateByKey, with a sequential operation and a combiner operation, to achieve what I needed. However, I want to implement the same using only Spark DataFrames and avoid RDDs completely. I tried learning about Spark DataFrames and came across the "agg" and "groupBy" functions, but wasn't sure how to achieve the same results as with the RDD. Any help here is much appreciated.
val columnIndexes = columns.indices.map(i => i + groupIndexes.length).toArray
// removing rows with nulls in group-by columns, like the MR version
val cleanDF = selectedDF.na.drop("any", groupByColumns)
val allCountersPerGroupRDD: RDD[(immutable.IndexedSeq[Any], Seq[Seq[CovCounter]])] = cleanDF.rdd.map(row =>
  // create key-value pairs
  (groupIndexes.map(ind => row.get(ind)), columnIndexes.map(i => toDouble(row.get(i)))))
  .aggregateByKey(zeroCounters, numPartitions)(
    seqOp = (counters, newValues) => {
      for ((i, j) <- columnHalfPairedIndicesFlattened) {
        counters(i)(j).addIfNotNaN(newValues(i), newValues(j))
      }
      counters
    }, combOp = (baseCounters, otherCounters) => {
      for ((i, j) <- columnHalfPairedIndicesFlattened) {
        baseCounters(i)(j).merge(otherCounters(i)(j))
      }
      baseCounters
    })
val finalRDD: RDD[Row] = allCountersPerGroupRDD.mapPartitions { iterator =>
  iterator.flatMap { case (groupKeys, counts) =>
    columns.indices.map(ind =>
      Row.fromSeq(groupKeys ++ Seq(columns(ind)) ++ columnPairedIndicesAll(ind).map { case (i, j) =>
        getCovOrCorrFromCounters(i, j, counts, useCorrelation)
      }))
  }
}
val outDF = sparkSession.createDataFrame(finalRDD, outputSchema)
See https://docs.databricks.com/spark/latest/spark-sql/udaf-scala.html. You need to make your own UDAF.
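To make the UDAF suggestion more concrete: the three pieces passed to aggregateByKey (zero value, seqOp, combOp) map onto the zero, reduce and merge methods that Spark's typed Aggregator expects, with finish producing the final value. The sketch below shows only that shape in plain Scala for a single pair of columns, so it runs without Spark. The names CorrBuffer and CorrAgg are made up for illustration and are not taken from the question's code. For a single pair of columns, the built-in corr function used with groupBy and agg may already be enough.

```scala
// Running sums needed for a Pearson correlation of two columns.
// CorrBuffer and CorrAgg are hypothetical names used only in this sketch.
final case class CorrBuffer(n: Long, sx: Double, sy: Double,
                            sxx: Double, syy: Double, sxy: Double)

object CorrAgg {
  // zero: the empty buffer (the role of zeroCounters in the RDD version)
  val zero: CorrBuffer = CorrBuffer(0L, 0.0, 0.0, 0.0, 0.0, 0.0)

  // reduce: fold one row into the buffer (the role of seqOp), skipping NaN pairs
  def reduce(b: CorrBuffer, x: Double, y: Double): CorrBuffer =
    if (x.isNaN || y.isNaN) b
    else CorrBuffer(b.n + 1, b.sx + x, b.sy + y,
                    b.sxx + x * x, b.syy + y * y, b.sxy + x * y)

  // merge: combine two partial buffers (the role of combOp)
  def merge(a: CorrBuffer, b: CorrBuffer): CorrBuffer =
    CorrBuffer(a.n + b.n, a.sx + b.sx, a.sy + b.sy,
               a.sxx + b.sxx, a.syy + b.syy, a.sxy + b.sxy)

  // finish: turn the accumulated sums into the correlation coefficient
  def finish(b: CorrBuffer): Double = {
    val cov = b.n * b.sxy - b.sx * b.sy
    val vx = b.n * b.sxx - b.sx * b.sx
    val vy = b.n * b.syy - b.sy * b.sy
    cov / math.sqrt(vx * vy)
  }

  def main(args: Array[String]): Unit = {
    val rows = Seq((1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0))
    // simulate two partitions that are aggregated separately and then merged
    val (left, right) = rows.splitAt(2)
    val partA = left.foldLeft(zero) { case (b, (x, y)) => reduce(b, x, y) }
    val partB = right.foldLeft(zero) { case (b, (x, y)) => reduce(b, x, y) }
    println(finish(merge(partA, partB)))
  }
}
```

Keeping the buffer immutable, unlike the mutable counters in the RDD version, is a design choice of this sketch; it makes merge trivially safe to call in any order.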

Eliminate for loops in Spark using Scala

I have a scenario in which I iterate over a list of DataFrames, perform the same type of operation on each using a for loop, and store the transformed DataFrame in a Map(String -> DataFrame).
for (df <- dfList)
{
  // perform some transformation of the DataFrame
  dfMap = dfMap + ("some_name" -> df)
}
This solution works fine, but in a sequential manner. I want to use async processing to achieve parallelism and better performance, so that the transformations on each df occur in parallel, making use of Spark's distributed processing capabilities.
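Check the code below.
def logic(df: DataFrame): Map[String, DataFrame] = {
  ??? // return Map[String, DataFrame]
}
val dfa: DataFrame = ??? // DataFrame 1
val dfb: DataFrame = ??? // DataFrame 2
val dfc: DataFrame = ??? // DataFrame 3
// On Scala 2.13+, .par needs the scala-parallel-collections module and
// import scala.collection.parallel.CollectionConverters._
Seq(dfa, dfb, dfc)
  .par // parallel collection
  .map(logic) // invoke logic for every DataFrame
  .reduce(_ ++ _) // final result: Map("aaa" -> dfa, "bbb" -> dfb, "ccc" -> dfc)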
Update
def writeToMap(a: Int, i: Int) = Map(a -> i)
def doOperation(a: Int)=writeToMap(a,a+10)
val list = Seq.range(0, 33)
list.par.map(x => doOperation(x))
val dfList: List[DataFrame] = ??? // your DataFrame list
val dfMap: Map[String, DataFrame] = dfList.map("some_name" -> _).toMap
.map pairs each element with its key.
.toMap collects the pairs into a Map.
Note: some_name must be unique for every DataFrame; with duplicate keys, toMap keeps only the last entry.
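Since the question mentions async, here is a minimal sketch of the same idea using scala.concurrent.Future instead of parallel collections. It uses plain Seq[Int] values as stand-ins for DataFrames so that it runs without Spark; transform and runAll are hypothetical names. With real DataFrames, keep in mind that transformations are lazy, so running them in parallel threads only pays off if each Future also triggers an action (a write, a count, and so on), in which case Spark can schedule the resulting jobs concurrently.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object ParallelTransforms {
  // Hypothetical stand-in for the per-DataFrame work; with Spark this would
  // be a function that transforms a DataFrame and runs an action on it.
  def transform(data: Seq[Int]): Seq[Int] = data.map(_ * 2)

  // Start one Future per named input, then gather the results into a Map.
  def runAll(inputs: Map[String, Seq[Int]]): Map[String, Seq[Int]] = {
    val all = Future.traverse(inputs.toSeq) { case (name, data) =>
      Future(name -> transform(data))
    }
    // Block until every transformation has finished (1 minute is an arbitrary limit).
    Await.result(all, 1.minute).toMap
  }

  def main(args: Array[String]): Unit = {
    val inputs = Map("aaa" -> Seq(1, 2), "bbb" -> Seq(3), "ccc" -> Seq(4, 5, 6))
    println(runAll(inputs))
  }
}
```

Because each result is returned as a key-value pair and collected at the end, no shared mutable map is needed, which avoids the race conditions a var dfMap would have when updated from several threads.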

Spark accumulator empty when used in UDF

I was working on optimizing my Spark process and was trying to use a UDF with an accumulator. I have gotten the accumulator to work on its own, and was looking to see if I would get any speedup using a UDF. But instead, when I wrap the accumulator in the UDF, it remains empty. Am I doing something wrong in particular? Is there something going on with lazy execution where, even with my .count, it is still not executing?
Input:
0,[0.11,0.22]
1,[0.22,0.33]
Output:
(0,0,0.11),(0,1,0.22),(1,0,0.22),(1,1,0.33)
Code
val accum = new MapAccumulator2d()
val session = SparkSession.builder().getOrCreate()
session.sparkContext.register(accum)
// Does not work - empty accumulator
val rowAccum = udf((itemId: Int, item: mutable.WrappedArray[Float]) => {
  val map = item
    .zipWithIndex
    .map(ff => {
      ((itemId, ff._2), ff._1.toDouble)
    }).toMap
  accum.add(map)
  itemId
})
dataFrame.select(rowAccum(col("itemId"), col("jaccardList"))).count
// Works
dataFrame.foreach(f => {
  val map = f.getAs[mutable.WrappedArray[Float]](1)
    .zipWithIndex
    .map(ff => {
      ((f.getInt(0), ff._2), ff._1.toDouble)
    }).toMap
  accum.add(map)
})
val list = accum.value.toList.map(f => (f._1._1, f._1._2, f._2))
Looks like the only issue here is using count to "trigger" the lazily-evaluated UDF: Spark is "smart" enough to realize that the select operation can't change the result of count and therefore doesn't really execute the UDF. Choosing a different operation (e.g. collect) shows that the UDF works and updates the accumulator.
Here's a (more concise) example:
val accum = sc.longAccumulator
val rowAccum = udf((itemId: Int) => { accum.add(itemId); itemId })
val dataFrame = Seq(1,2,3,4,5).toDF("itemId")
dataFrame.select(rowAccum(col("itemId"))).count() // won't trigger UDF
println(s"RESULT: ${accum.value}") // prints 0
dataFrame.select(rowAccum(col("itemId"))).collect() // triggers UDF
println(s"RESULT: ${accum.value}") // prints 15

Filtering One RDD based on another RDD using regex

I have two RDDs of the form:
data_wo_header: RDD[String], data_test_wo_header: RDD[String]
scala> data_wo_header.first
res2: String = 1,2,3.5,1112486027
scala> data_test_wo_header.first
res2: String = 1,2
RDD2 is smaller than RDD1. I am trying to filter RDD1 by removing the elements that match an entry in RDD2 by regex.
The 1,2 in the above example represents UserID,MovID. Since it's present in the test set, I want a new RDD with it removed from RDD1.
I have asked a similar question, but its answer requires an unnecessary split of the RDD.
I am trying to do something of this sort, but it's not working:
def create_training(data_wo_header: RDD[String], data_test_wo_header: RDD[String]): List[String] = {
  var ratings_train = new ListBuffer[String]()
  data_wo_header.foreach(x => {
    data_test_wo_header.foreach(y => {
      if (x.indexOf(y) == 0) {
        ratings_train += x
      }
    })
  })
  val ratings_train_list = ratings_train.toList
  return ratings_train_list
}
How should I do a regex match and filter based on it?
You can use a broadcast variable to share the contents of rdd2 and then filter rdd1 against the broadcast value. I adapted your code as follows:
def create_training(data_wo_header: RDD[String], data_test_wo_header: RDD[String]): List[String] = {
  // ship the (small) test RDD to every executor once
  val rdd2array = sparkSession.sparkContext.broadcast(data_test_wo_header.collect())
  val training_set = data_wo_header.filter { x =>
    // keep x only if no test entry "user,movie" is a prefix of it;
    // String.matches would require the pattern to match the whole line
    !rdd2array.value.exists(y => x.startsWith(y + ","))
  }
  training_set.collect().toList
}
Also, with Scala and Spark, I recommend avoiding foreach where possible and using a more functional style with the map, flatMap and filter functions.
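If the test set is large, scanning the whole broadcast array for every line can get slow. A possible variation, sketched here on plain Scala collections so it runs without Spark, is to derive the "UserID,MovID" key from each line and look it up in a Set. The names TrainingFilter, key and createTraining are made up for this sketch, and the sample lines other than the first are invented data. With RDDs, the same lambda would go inside data_wo_header.filter, with the Set wrapped in a broadcast variable.

```scala
object TrainingFilter {
  // Key = the first two comma-separated fields ("userId,movieId"),
  // extracted with indexOf so the whole line is never split.
  def key(line: String): String = {
    val first = line.indexOf(',')
    val second = if (first < 0) -1 else line.indexOf(',', first + 1)
    if (second < 0) line else line.substring(0, second)
  }

  // Keep only the lines whose key does not occur in the test set.
  def createTraining(data: Seq[String], test: Seq[String]): Seq[String] = {
    val testKeys: Set[String] = test.map(key).toSet
    data.filterNot(line => testKeys.contains(key(line)))
  }

  def main(args: Array[String]): Unit = {
    // Sample data: the first line is from the question, the rest is invented.
    val data = Seq("1,2,3.5,1112486027", "1,29,3.5,1112484676", "12,2,4.0,1112484819")
    val test = Seq("1,2")
    createTraining(data, test).foreach(println)
  }
}
```

Comparing whole keys rather than prefixes also avoids accidental matches such as the test entry 1,2 removing a rating for user 1 and movie 29.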