Finding Max sum of marks each year - scala

I am new to Scala and Spark. Can someone optimize the Scala code below for finding the maximum marks scored by students each year?
val m = sc.textFile("marks.csv")
val SumOfMarks = m.map(_.split(","))
  .mapPartitionsWithIndex { (idx, iter) => if (idx == 0) iter.drop(1) else iter }
  .map(l => ((l(0), l(1)), l(3).toInt))
  .reduceByKey(_ + _)
  .sortBy(line => (line._1._1, line._2), ascending = false)
var s: Int = 0
var y: String = "0"
for (i <- SumOfMarks) {
  if ((i._1._1 != y) || (i._2 == s && i._1._1 == y)) { println(i); s = i._2; y = i._1._1 }
}
Input : marks.csv
year,student,sub,marks
2016,ram,maths,90
2016,ram,physics,86
2016,ram,chemistry,88
2016,raj,maths,84
2016,raj,physics,96
2016,raj,chemistry,98
2017,raghu,maths,96
2017,raghu,physics,98
2017,raghu,chemistry,94
2017,rajesh,maths,92
2017,rajesh,physics,98
2017,rajesh,chemistry,98
Output :
2017,raghu,288
2017,rajesh,288
2016,raj,278

I am not sure what you mean exactly by "Optimised", but a more "scala-y" and "spark-y" way of doing this might be as follows:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, sum, dense_rank}
import spark.implicits._
// Read your data file as a CSV file with row headers.
val marksDF = spark.read.option("header","true").csv("marks.csv")
// Calculate the total marks for each student in each year. The new total mark column will be called "totMark"
val marksByStudentYear = marksDF.groupBy(col("year"), col("student")).agg(sum(col("marks")).as("totMark"))
// Rank the marks within each year. Highest Mark will get rank 1, second highest rank 2 and so on.
// A benefit of rank is that if two scores have the same mark, they will both get the
// same rank.
val marksRankedByYear = marksByStudentYear.withColumn("rank", dense_rank().over(Window.partitionBy("year").orderBy($"totMark".desc)))
// Finally filter so that we only have the "top scores" (rank = 1) for each year,
// order by year and student name and display the result.
val topStudents = marksRankedByYear.filter($"rank" === 1).orderBy($"year", $"student")
topStudents.show
This will produce the following output in Spark-shell:
+----+-------+-------+----+
|year|student|totMark|rank|
+----+-------+-------+----+
|2016|    raj|  278.0|   1|
|2017|  raghu|  288.0|   1|
|2017| rajesh|  288.0|   1|
+----+-------+-------+----+
If you need a CSV displayed as per your question, you can use:
topStudents.collect.map(_.mkString(",")).foreach(println)
which produces:
2016,raj,278.0,1
2017,raghu,288.0,1
2017,rajesh,288.0,1
I have broken the process up into individual steps, which allows you to see what is going on at each step by simply running show on the intermediate result. For example, to see what the spark.read.option(...) line produces, simply enter marksDF.show into spark-shell.
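For completeness, the same aggregation can also be run against a typed Dataset. This is only a sketch: the Mark case class and the inferSchema option are my additions, not part of the original answer, and the query itself is unchanged.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{dense_rank, sum}
import spark.implicits._
// Hypothetical case class describing one row of marks.csv
case class Mark(year: Int, student: String, sub: String, marks: Int)
// inferSchema gives integer columns, so the Dataset conversion is checked at analysis time.
val marksDS = spark.read.option("header", "true").option("inferSchema", "true").csv("marks.csv").as[Mark]
val topPerYear = marksDS
  .groupBy($"year", $"student")
  .agg(sum($"marks").as("totMark"))
  .withColumn("rank", dense_rank().over(Window.partitionBy("year").orderBy($"totMark".desc)))
  .filter($"rank" === 1)
  .orderBy($"year", $"student")
topPerYear.show()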
Since OP wanted an RDD version, here is one example. Probably it is not optimal, but it does give the correct result:
import org.apache.spark.rdd.RDD
// A Helper function which makes it slightly easier to view RDD content.
def dump[R] (rdd : RDD[R]) = rdd.collect.foreach(println)
val marksRdd = sc.textFile("marks.csv")
// A case class to annotate the content in the RDD
case class Report(year:Int, student:String, sub:String, mark:Int)
// Create the RDD as a series of Report objects - ignore the header.
val marksReportRdd = marksRdd.map(_.split(",")).mapPartitionsWithIndex {
(idx, iter) => if (idx == 0) iter.drop(1) else iter
}.map(r => Report(r(0).toInt,r(1),r(2),r(3).toInt))
// Group the data by year and student.
val marksGrouped = marksReportRdd.groupBy(report => (report.year, report.student))
// Calculate the total score for each student for each year by adding up the scores
// of each subject the student has taken in that year.
val totalMarkStudentYear = marksGrouped.map{ case (key, marks:Iterable[Report]) => (key, marks.foldLeft(0)((acc, rep) => acc + rep.mark))}
// Determine the highest score for each year.
val yearScoreHighest = totalMarkStudentYear.map{ case (key, score:Int) => (key._1, score) }.reduceByKey(math.max(_, _))
// Determine the list of students who have received the highest score in each year.
// This is achieved by joining the total marks each student received in each year
// to the highest score in each year.
// The join is performed on the key, which is a Tuple2(year, score).
// To achieve this, both RDD's must be mapped to produce this key with a data attribute.
// The data attribute for the highest scores is a dummy value "x".
// The data attribute for the student scores is the student's name.
val highestRankStudentByYear = totalMarkStudentYear.map{ case (key, score) => ((key._1, score), key._2)}.join (yearScoreHighest.map (k => (k, "x")))
// Finally extract the year, student name and score from the joined RDD
// Sort by year and name.
val result = highestRankStudentByYear.map{ case (key, score) => (key._1, score._1, key._2)}.sortBy( r => (r._1, r._2))
// Show the final result.
dump(result)
The result of the above is:
(2016,raj,278)
(2017,raghu,288)
(2017,rajesh,288)
As before, you can view the intermediate RDD's simply by dumping them using the dump function. NB: the dump function takes an RDD. If you want to show the content of a DataFrame or Dataset, use its show method.
There is probably a more optimal solution than the one above, but it does the job.
Hopefully the RDD version will encourage you to use DataFrames and/or DataSets if you can. Not only is the code simpler, but:
Spark can analyse DataFrame and DataSet transformations as a whole and optimise the overall process; RDD transformations are not optimised this way (they are executed one after another, exactly as written). DataFrame and DataSet based processes will therefore likely run faster (assuming you don't manually optimise the RDD equivalent).
DataSets and DataFrames support schemas to varying degrees (e.g. named columns and data typing).
DataFrames and DataSets can be queried using SQL.
DataFrame and DataSet operations/methods are more aligned with SQL constructs
DataFrames and DataSets are easier to use than RDD's
DataSets (and RDD's) offer compile time error detection.
DataSets are the future direction.
Check out these couple of links for more information:
https://data-flair.training/blogs/apache-spark-rdd-vs-dataframe-vs-dataset/
https://www.linkedin.com/pulse/apache-spark-rdd-vs-dataframe-dataset-chandan-prakash/
https://medium.com/#sachee/apache-spark-dataframe-vs-rdd-24a04d2eb1b9
or simply google "spark should i use rdd or dataframe"
All the best with your project.

Try It on SCALA spark-shell
scala> val df = spark.read.format("csv").option("header", "true").load("/CSV file location/marks.csv")
scala> df.registerTempTable("record")
scala> sql(" select year, student, marks from (select year, student, marks, RANK() over (partition by year order by marks desc) rank From ( Select year, student, SUM(marks) as marks from record group by Year, student)) where rank =1 ").show
It will generate the following table
+----+-------+-----+
|year|student|marks|
+----+-------+-----+
|2016|    raj|278.0|
|2017|  raghu|288.0|
|2017| rajesh|288.0|
+----+-------+-----+
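A side note: registerTempTable has been deprecated since Spark 2.0. A sketch of the same query with the current API (assuming the same df as above) would be:
df.createOrReplaceTempView("record")
spark.sql("""
  SELECT year, student, marks
  FROM (SELECT year, student, marks,
               RANK() OVER (PARTITION BY year ORDER BY marks DESC) AS rank
        FROM (SELECT year, student, SUM(marks) AS marks
              FROM record
              GROUP BY year, student))
  WHERE rank = 1
""").show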

import org.apache.spark.rdd.RDD
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions
//Finding Max sum of marks each year
object Marks2 {

  def getSparkContext() = {
    val conf = new SparkConf().setAppName("MaxMarksEachYear").setMaster("local")
    val sc = new SparkContext(conf)
    sc
  }

  def dump[R](rdd: RDD[R]) = rdd.collect.foreach(println)

  def main(args: Array[String]): Unit = {
    // System.setProperty("hadoop.home.dir", "D:\\Setup\\hadoop_home")
    val sc = getSparkContext()
    val inpRDD = sc.textFile("marks.csv")
    val head = inpRDD.first()
    val marksRdd = inpRDD.filter(record => !record.equals(head)).map(rec => rec.split(","))
    val marksByNameyear = marksRdd.map(rec => ((rec(0).toInt, rec(1)), rec(3).toInt))
    //marksByNameyear.cache()
    val aggMarksByYearName = marksByNameyear.reduceByKey(_ + _)
    val maxMarksByYear = aggMarksByYearName.map(s => (s._1._1, s._2)).reduceByKey(math.max(_, _))
    val markYearName = aggMarksByYearName.map(s => (s._2.toInt, s._1._2))
    val marksAndYear = maxMarksByYear.map(s => (s._2.toInt, s._1))
    // Broadcast variant of the lookup (its result is not used below; the join path is used instead).
    val tt = sc.broadcast(marksAndYear.collect().toMap)
    marksAndYear.flatMap { case (key, value) => tt.value.get(key).map { other => (other, value, key) } }
    val yearMarksName = marksAndYear.leftOuterJoin(markYearName)
    val result = yearMarksName.map(s => (s._2._1, s._2._2, s._1)).sortBy(f => f._3, true)
    //dump(markYearName)
    dump(result)
  }
}
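A small aside: SparkSession is imported above but never used. A sketch of the equivalent setup via the newer entry point (my adaptation, not part of the original answer) would be:
import org.apache.spark.sql.SparkSession
def getSparkContext() = {
  val spark = SparkSession.builder().appName("MaxMarksEachYear").master("local").getOrCreate()
  spark.sparkContext // the rest of main can keep using the RDD API unchanged
}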


Spark: Performant way to find top n values

I have a large dataset and I would like to find rows with n highest values.
id, count
id1, 10
id2, 15
id3, 5
...
The only method I can think of is using row_number without partition like
val window = Window.orderBy(desc("count"))
df.withColumn("row_number", row_number over window).filter(col("row_number") <= n)
but this is in no way performant when the data contains millions or billions of rows because it pushes the data into one partition and I get OOM.
Has anyone managed to come up with a performant solution?
I see two methods to improve your algorithm performance. First is to use sort and limit to retrieve the top n rows. The second is to develop your custom Aggregator.
Sort and Limit method
You sort your dataframe and then you take the first n rows:
val n: Int = ???
import org.apache.spark.sql.functions.desc
df.orderBy(desc("count")).limit(n)
Spark optimizes this kind of transformation sequence by first sorting each partition and taking the first n rows on each partition, then gathering those rows on a final partition and re-sorting them to take the final first n rows. You can check this by calling explain() on the transformations. You get the following execution plan:
== Physical Plan ==
TakeOrderedAndProject(limit=3, orderBy=[count#8 DESC NULLS LAST], output=[id#7,count#8])
+- LocalTableScan [id#7, count#8]
You can also see how the TakeOrderedAndProject step is executed in limit.scala in Spark's source code (case class TakeOrderedAndProjectExec, method doExecute).
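For reference, the plan above can be reproduced with something like the following (a sketch; n = 3 here only to match the limit shown in the plan):
import org.apache.spark.sql.functions.desc
val n = 3
df.orderBy(desc("count")).limit(n).explain()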
Custom Aggregator method
For custom aggregator, you create an Aggregator that will populate and update an ordered array of top n rows.
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.sql.Encoder
import scala.collection.mutable.ArrayBuffer
case class Record(id: String, count: Int)
case class TopRecords(limit: Int) extends Aggregator[Record, ArrayBuffer[Record], Seq[Record]] {

  def zero: ArrayBuffer[Record] = ArrayBuffer.empty[Record]

  def reduce(topRecords: ArrayBuffer[Record], currentRecord: Record): ArrayBuffer[Record] = {
    val insertIndex = topRecords.lastIndexWhere(p => p.count > currentRecord.count)
    if (topRecords.length < limit) {
      topRecords.insert(insertIndex + 1, currentRecord)
    } else if (insertIndex < limit - 1) {
      topRecords.insert(insertIndex + 1, currentRecord)
      topRecords.remove(topRecords.length - 1)
    }
    topRecords
  }

  def merge(topRecords1: ArrayBuffer[Record], topRecords2: ArrayBuffer[Record]): ArrayBuffer[Record] = {
    val merged = ArrayBuffer.empty[Record]
    while (merged.length < limit && (topRecords1.nonEmpty || topRecords2.nonEmpty)) {
      if (topRecords1.isEmpty) {
        merged.append(topRecords2.remove(0))
      } else if (topRecords2.isEmpty) {
        merged.append(topRecords1.remove(0))
      } else if (topRecords2.head.count < topRecords1.head.count) {
        merged.append(topRecords1.remove(0))
      } else {
        merged.append(topRecords2.remove(0))
      }
    }
    merged
  }

  def finish(reduction: ArrayBuffer[Record]): Seq[Record] = reduction

  def bufferEncoder: Encoder[ArrayBuffer[Record]] = ExpressionEncoder[ArrayBuffer[Record]]

  def outputEncoder: Encoder[Seq[Record]] = ExpressionEncoder[Seq[Record]]
}
And then you apply this aggregator on your dataframe, and flatten the aggregation result:
val n: Int = ???
import sparkSession.implicits._
df.as[Record].select(TopRecords(n).toColumn).flatMap(record => record)
Method comparison
To compare those two methods, let's say we want to take the top n rows of a dataframe that is distributed over p partitions, each partition holding around k records. The dataframe therefore has size p·k, which gives the following complexities (subject to errors):
Method            | Total number of operations   | Memory consumption (on executor) | Memory consumption (on final executor)
Current code      | O(p·k·log(p·k))              | --                               | O(p·k)
Sort and Limit    | O(p·k·log(k) + p·n·log(p·n)) | O(k)                             | O(p·n)
Custom Aggregator | O(p·k)                       | O(k) + O(n)                      | O(p·n)
So regarding the number of operations, the Custom Aggregator is the most performant. However, this method is by far the most complex and implies lots of serialization/deserialization, so it may be less performant than Sort and Limit in certain cases.
Conclusion
You have two methods to efficiently take the top n rows: Sort and Limit and Custom Aggregator. To select which one to use, you should benchmark both methods with your real dataframe. If, after benchmarking, Sort and Limit is only a bit slower than the Custom Aggregator, I would still select Sort and Limit, as its code is a lot easier to maintain.
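If it helps, a very rough way to time the two approaches on your real data could look like this (only a sketch, reusing df, n, Record and TopRecords from above; a proper benchmark should also account for caching and warm-up):
import org.apache.spark.sql.functions.desc
// Naive timing helper: forces the computation with count() and reports wall-clock time.
def time[T](label: String)(block: => T): T = {
  val start = System.nanoTime()
  val result = block
  println(s"$label took ${(System.nanoTime() - start) / 1e6} ms")
  result
}
time("sort + limit") { df.orderBy(desc("count")).limit(n).count() }
time("custom aggregator") { df.as[Record].select(TopRecords(n).toColumn).flatMap(r => r).count() }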
Convert to an RDD.
Use mapPartitions to sort the data within each partition and take N per partition.
Convert back to a DF.
Then sort and rank and take the top N overall. It is unlikely you will hit OOM.
An actual example, slightly updated: a roll-your-own approach, for posterity.
import org.apache.spark.sql.functions._
import spark.sqlContext.implicits._
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{DoubleType, IntegerType, StringType, StructField, StructType}
// 1. Create data
val data = Seq(("James ","","Smith","36636","M",33000),
("Michael ","Rose","","40288","M",14000),
("Robert ","","Williams","42114","M",40),
("Robert ","","Williams","42114","M",540),
("Robert ","","Zeedong","42114","M",40000000),
("Maria ","Anne","Jones","39192","F",300),
("Maria ","Anne","Vangelis","39192","F",1300),
("Jen","Mary","Brown","","F",-1))
val columns = Seq("firstname","middlename","lastname","dob","gender","val")
val df = data.toDF(columns:_*)
//df.show()
//2. Any number of partitions, and sort that partition. Combiner function like Hadoop.
val df2 = df.repartition(1000,col("lastname")).sortWithinPartitions(desc("val"))
//df2.rdd.glom().collect()
//3. Take top N per partition. Thus num partitions x 2 in this case. The take(n) is the top n per partition. No OOM.
val rdd2 = df2.rdd.mapPartitions(_.take(2))
//4. Ghastly Row to DF work-arounds.
val schema = new StructType()
.add(StructField("f", StringType, true))
.add(StructField("m", StringType, true))
.add(StructField("l", StringType, true))
.add(StructField("d", StringType, true))
.add(StructField("g", StringType, true))
.add(StructField("v", IntegerType, true))
val df3 = spark.createDataFrame(rdd2, schema)
//5. Sort and take top(n) = 2 and Bob's your uncle. The Reduce after Combine.
df3.sort(col("v").desc).limit(2).show()
Returns for top 2 desc:
+-------+---+-------+-----+---+--------+
|      f|  m|      l|    d|  g|       v|
+-------+---+-------+-----+---+--------+
|Robert |   |Zeedong|42114|  M|40000000|
| James |   |  Smith|36636|  M|   33000|
+-------+---+-------+-----+---+--------+
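As a variation on step 4, the Row-to-DataFrame work-around can be avoided by staying in a typed Dataset. This is only a sketch under the assumption that the columns above are kept; the Person case class and the rename of the "val" column (a reserved word in Scala) are mine:
// Hypothetical case class matching the columns above ("val" renamed to "value").
case class Person(firstname: String, middlename: String, lastname: String, dob: String, gender: String, value: Int)
val ds2 = df.withColumnRenamed("val", "value")
  .repartition(1000, col("lastname"))
  .sortWithinPartitions(desc("value"))
  .as[Person]
  .mapPartitions(_.take(2)) // top 2 per partition, no schema rebuilding needed afterwards
ds2.orderBy(col("value").desc).limit(2).show()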

How would I create bins of date ranges in spark scala?

Hi how's it going? I'm a Python developer trying to learn Spark Scala. My task is to create date range bins, and count the frequency of occurrences in each bin (histogram).
My input dataframe looks something like this
My bin edges are this (in Python):
bins = ["01-01-1990 - 12-31-1999","01-01-2000 - 12-31-2009"]
and the output dataframe I'm looking for is (counts of how many values in original dataframe per bin):
Is there anyone who can guide me on how to do this in Spark Scala? I'm a bit lost. Thank you.
Are you looking for a result like the following:
+------------------------+------------------------+
|01-01-1990 -- 12-31-1999|01-01-2000 -- 12-31-2009|
+------------------------+------------------------+
|                       3|                    null|
|                    null|                       2|
+------------------------+------------------------+
It can be achieved with a little bit of Spark SQL and the pivot function, as shown below.
Check out the left join condition.
val binRangeData = Seq(("01-01-1990","12-31-1999"),
("01-01-2000","12-31-2009"))
val binRangeDf = binRangeData.toDF("start_date","end_date")
// binRangeDf.show
val inputDf = Seq((0,"10-12-1992"),
(1,"11-11-1994"),
(2,"07-15-1999"),
(3,"01-20-2001"),
(4,"02-01-2005")).toDF("id","input_date")
// inputDf.show
binRangeDf.createOrReplaceTempView("bin_range")
inputDf.createOrReplaceTempView("input_table")
val countSql = """
SELECT concat(date_format(c.st_dt,'MM-dd-yyyy'),' -- ',date_format(c.end_dt,'MM-dd-yyyy')) as header, c.bin_count
FROM (
(SELECT
b.st_dt, b.end_dt, count(1) as bin_count
FROM
(select to_date(input_date,'MM-dd-yyyy') as date_input , * from input_table) a
left join
(select to_date(start_date,'MM-dd-yyyy') as st_dt, to_date(end_date,'MM-dd-yyyy') as end_dt from bin_range ) b
on
a.date_input >= b.st_dt and a.date_input < b.end_dt
group by 1,2) ) c"""
val countDf = spark.sql(countSql)
countDf.groupBy("bin_count").pivot("header").sum("bin_count").drop("bin_count").show
Note that since you have 2 bin ranges, 2 rows will be generated.
We can achieve this by looking at the date column and determining within which range each record falls.
// First we set up the problem
// Create a format that looks like yours
val dateFormat = java.time.format.DateTimeFormatter.ofPattern("MM-dd-yyyy")
// Get the current local date
val now = java.time.LocalDate.now
// Create a range of 1-10000 and map each to minusDays
// so we can have range of dates going 10000 days back
val dates = (1 to 10000).map(now.minusDays(_).format(dateFormat))
// Create a DataFrame we can work with.
val df = dates.toDF("date")
So far so good. We have date entries to work with, and they are like your format (MM-dd-yyyy).
Next up, we need a function which returns 1 if the date falls within range, and 0 if not. We create a UserDefinedFunction (UDF) from this function so we can apply it to all rows simultaneously across Spark executors.
// We will process each range one at a time, so we'll take it as a string
// and split it accordingly. Then we perform our tests. Using Dates is
// necessary to cater to your format.
import java.text.SimpleDateFormat
def isWithinRange(date: String, binRange: String): Int = {
val format = new SimpleDateFormat("MM-dd-yyyy")
val startDate = format.parse(binRange.split(" - ").head)
val endDate = format.parse(binRange.split(" - ").last)
val testDate = format.parse(date)
if (!(testDate.before(startDate) || testDate.after(endDate))) 1
else 0
}
// We create a udf which uses an anonymous function taking two args and
// simply pass the values to our prepared function
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.udf
def isWithinRangeUdf: UserDefinedFunction =
udf((date: String, binRange: String) => isWithinRange(date, binRange))
Now that we have our UDF set up, we create new columns in our DataFrame, then group by the given bins and sum the values (hence why we made our function evaluate to an Int).
// We define our bins List
val bins = List("01-01-1990 - 12-31-1999",
"01-01-2000 - 12-31-2009",
"01-01-2010 - 12-31-2020")
// We fold through the bins list, creating a column from each bin at a time,
// enriching the DataFrame with more columns as we go
import org.apache.spark.sql.functions.{col, lit, sum}
val withBinsDf = bins.foldLeft(df){(changingDf, bin) =>
changingDf.withColumn(bin, isWithinRangeUdf(col("date"), lit(bin)))
}
withBinsDf.show(1)
//+----------+-----------------------+-----------------------+-----------------------+
//|      date|01-01-1990 - 12-31-1999|01-01-2000 - 12-31-2009|01-01-2010 - 12-31-2020|
//+----------+-----------------------+-----------------------+-----------------------+
//|09-01-2020|                      0|                      0|                      1|
//+----------+-----------------------+-----------------------+-----------------------+
//only showing top 1 row
Finally we select our bin columns and groupBy them and sum.
val binsDf = withBinsDf.select(bins.head, bins.tail:_*)
val sums = bins.map(b => sum(b).as(b)) // keep col name as is
val summedBinsDf = binsDf.groupBy().agg(sums.head, sums.tail:_*)
summedBinsDf.show
//+-----------------------+-----------------------+-----------------------+
//|01-01-1990 - 12-31-1999|01-01-2000 - 12-31-2009|01-01-2010 - 12-31-2020|
//+-----------------------+-----------------------+-----------------------+
//|                   2450|                   3653|                   3897|
//+-----------------------+-----------------------+-----------------------+
2450 + 3653 + 3897 = 10000, so it seems our work was correct.
Perhaps I overdid it and there is a simpler solution; please let me know if you know a better way (especially for handling MM-dd-yyyy dates).
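One possible simplification (a sketch, assuming Spark 2.2+ so that to_date accepts a format string, and reusing the df and bins defined above) is to skip the UDF entirely and build each bin column with when/between:
import org.apache.spark.sql.functions.{col, lit, sum, to_date, when}
val parsed = df.withColumn("d", to_date(col("date"), "MM-dd-yyyy"))
// One aggregate column per bin: 1 if the date falls inside the bin, 0 otherwise, then summed.
val binCols = bins.map { bin =>
  val Array(start, end) = bin.split(" - ")
  sum(when(col("d").between(to_date(lit(start), "MM-dd-yyyy"), to_date(lit(end), "MM-dd-yyyy")), 1).otherwise(0)).as(bin)
}
parsed.agg(binCols.head, binCols.tail: _*).show(false)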

Using Spark 2.3.1 with Scala, Reduce Arbitrary List of Date Ranges into distinct non-overlapping ranges of dates

Given a list of date ranges, some of which overlap:
val df = Seq(
("Mike","2018-09-01","2018-09-10"), // range 1
("Mike","2018-09-05","2018-09-05"), // range 1
("Mike","2018-09-12","2018-09-12"), // range 1
("Mike","2018-09-11","2018-09-11"), // range 1
("Mike","2018-09-25","2018-09-29"), // range 4
("Mike","2018-09-21","2018-09-23"), // range 4
("Mike","2018-09-24","2018-09-24"), // range 4
("Mike","2018-09-14","2018-09-16"), // range 2
("Mike","2018-09-15","2018-09-17"), // range 2
("Mike","2018-09-05","2018-09-05"), // range 1
("Mike","2018-09-19","2018-09-19"), // range 3
("Mike","2018-09-19","2018-09-19"), // range 3
("Mike","2018-08-19","2018-08-20"), // range 5
("Mike","2018-10-01","2018-10-20"), // range 6
("Mike","2018-10-10","2018-10-30") // range 6
).toDF("name", "start", "end")
I'd like to reduce the data down to the minimum set of date ranges that completely encapsulate the above dates with no extra dates added:
+----+----------+----------+
|name|start     |end       |
+----+----------+----------+
|Mike|2018-09-01|2018-09-12|
|Mike|2018-09-14|2018-09-17|
|Mike|2018-09-19|2018-09-19|
|Mike|2018-09-21|2018-09-29|
|Mike|2018-08-19|2018-08-20|
|Mike|2018-10-01|2018-10-30|
+----+----------+----------+
EDIT: Added three new entries to the test data to account for new edge cases.
I cannot rely on the dates being in any particular order.
My best attempt at this so far:
Explode each date range into its set of individual days
Union the sets together into one big set of all the days
Sort the set into a list so the days are in order
Aggregate the individual days back into a list of lists of days.
Take the first and last day of each list as the new ranges.
The code, such as it is:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.Row
import scala.collection.immutable.NumericRange
import java.time.LocalDate
case class MyRange(start:String, end:String)
val combineRanges = udf((ranges: Seq[Row]) => {
ranges.map(r => LocalDate.parse(r(0).toString).toEpochDay to LocalDate.parse(r(1).toString).toEpochDay)
.map(_.toIndexedSeq).reduce(_ ++ _).distinct.toList.sorted
.aggregate(List.empty[Vector[Long]])((ranges:List[Vector[Long]], d:Long) => {
ranges.lastOption.find(_.last + 1 == d) match {
case Some(r:Vector[Long]) => ranges.dropRight(1) :+ (r :+ d)
case None => ranges :+ Vector(d)
}
}, _ ++ _).map(v => MyRange(LocalDate.ofEpochDay(v.head).toString, LocalDate.ofEpochDay(v.last).toString))
})
df.groupBy("name")
.agg(combineRanges(collect_list(struct($"start", $"end"))) as "ranges")
.withColumn("ranges", explode($"ranges"))
.select($"name", $"ranges.start", $"ranges.end")
.show(false)
It seems to work, but it is very ugly and probably wasteful of time and memory.
I was kind of hoping to use the scala Range class to only notionally explode the date ranges into their individual days, but I've got a feeling the sort operation forces scala's hand and makes it actually create a list of all the dates in memory.
Does anyone have a better way of doing this?
I think the easiest (and most readable) way is to explode the ranges into individual days and then aggregate back into intervals. As the number of days cannot grow too large, I think exploding is not a bottleneck here. I show a "pure Scala" solution which is then used inside a UDF which gets all the intervals from a collect_list aggregation:
import java.time.LocalDate
import java.time.temporal.ChronoUnit
def enumerateDays(start: LocalDate, end: LocalDate) = {
Iterator.iterate(start)(d => d.plusDays(1L))
.takeWhile(d => !d.isAfter(end))
.toList
}
implicit val localDateOrdering: Ordering[LocalDate] = Ordering.by(_.toEpochDay)
val combineRanges = udf((data: Seq[Row]) => {
val dateEnumerated =
data
.toSet[Row] // use Set to save memory if many spans overlap
// "explode" date spans into individual days
.flatMap { case Row(start: String, end: String) => enumerateDays(LocalDate.parse(start), LocalDate.parse(end)) }
.toVector
.sorted
// combine subsequent dates into Vectors
dateEnumerated.tail
.foldLeft(Vector(Vector(dateEnumerated.head)))((agg, curr) => {
if (ChronoUnit.DAYS.between(agg.last.last, curr) == 1L) {
agg.init :+ (agg.last :+ curr)
} else {
agg :+ Vector(curr)
}
})
// now get min/max of dates per span
.map(r => (r.min.toString, r.max.toString))
})
df.groupBy("name")
.agg(combineRanges(collect_list(struct($"start", $"end"))) as "ranges")
.withColumn("ranges", explode($"ranges"))
.select($"name", $"ranges._1".as("start"), $"ranges._2".as("end"))
.show(false)
gives
+----+----------+----------+
|name|start     |end       |
+----+----------+----------+
|Mike|2018-08-19|2018-08-20|
|Mike|2018-09-01|2018-09-12|
|Mike|2018-09-14|2018-09-17|
|Mike|2018-09-19|2018-09-19|
|Mike|2018-09-21|2018-09-29|
|Mike|2018-10-01|2018-10-30|
+----+----------+----------+
I think it's also doable with more DataFrame API logic: I would still explode using a UDF, but then use Window functions and groupBy to build the new blocks based on the number of days between 2 dates. But I think the above solution is also ok.
Here is an alternative with DFs and Spark SQL, both non-procedural and procedural by definition. You need to read it carefully and persist.
// Aspects such as caching and re-partitioning for performance are not considered. On the other hand, it all happens under the bonnet with DFs - so they say.
// Functional only.
import org.apache.spark.sql.functions._
import spark.implicits._
import java.time._
import org.apache.spark.sql.functions.{lead, lag}
import org.apache.spark.sql.expressions.Window
def toEpochDay(s: String) = LocalDate.parse(s).toEpochDay
val toEpochDayUdf = udf(toEpochDay(_: String))
val df = Seq(
("Betty","2018-09-05","2018-09-05"), ("Betty","2018-09-05","2018-09-05"),
("Betty","2018-09-05","2018-09-08"), ("Betty","2018-09-07","2018-09-10"),
("Betty","2018-09-07","2018-09-08"), ("Betty","2018-09-06","2018-09-07"),
("Betty","2018-09-10","2018-09-15"), ("Betty","2017-09-10","2017-09-15"),
("XXX","2017-09-04","2017-09-10"), ("XXX","2017-09-10","2017-09-15"),
("YYY","2017-09-04","2017-09-10"), ("YYY","2017-09-11","2017-09-15"),
("Bob","2018-09-01","2018-09-02"), ("Bob","2018-09-04","2018-09-05"),
("Bob","2018-09-06","2018-09-07"), ("Bob","2019-09-04","2019-09-05"),
("Bob","2019-09-06","2019-09-07"), ("Bob","2018-09-08","2018-09-22")
).toDF("name", "start", "end")
// Remove any duplicates - pointless to n-process these!
val df2 = df.withColumn("s", toEpochDayUdf($"start")).withColumn("e", toEpochDayUdf($"end")).distinct
df2.show(false) // The original input
df2.createOrReplaceTempView("ranges")
// Find those records encompassed by a broader time frame and hence not required for processing.
val q = spark.sql(""" SELECT *
FROM ranges r1
WHERE EXISTS (SELECT r2.name
FROM ranges r2
WHERE r2.name = r1.name
AND ((r1.s >= r2.s AND r1.e < r2.e) OR
(r1.e <= r2.e AND r1.s > r2.s))
)
""")
//q.show(false)
val df3 = df2.except(q) // Overlapping or on their own / single range records left.
//df3.show(false)
df3.createOrReplaceTempView("ranges2")
// Find those ranges that have a gap between them and the next adjacent records, before or after, i.e. records that exist on their own and are in fact per de facto the first part of the answer.
val q2 = spark.sql(""" SELECT *
FROM ranges2 r1
WHERE NOT EXISTS (SELECT r2.name
FROM ranges2 r2
WHERE r2.name = r1.name
AND (r2.e >= r1.s - 1 AND r2.s <= r1.s - 1 ) OR
(r2.s <= r1.e + 1 AND r2.e >= r1.e + 1 ))
)
""")
// Store the first set of records that exist on their own with some form of gap, first part of result overall result set.
val result1 = q2.select("name", "start", "end")
result1.show(false)
// Get the records / ranges that have overlaps to process - the second remaining set of such records to process.
val df4 = df3.except(q2)
//df4.show(false)
//Avoid Serialization errors with lag!
@transient val w = org.apache.spark.sql.expressions.Window.partitionBy("name").orderBy("e")
@transient val lag_y = lag("e", 1, -99999999).over(w)
//df.select(lag_y).map(f _).first
val df5 = df4.withColumn("new_col", lag_y)
//df5.show(false)
// Massage data to get results via easier queries, e.g. avoid issues with correlated sub-queries.
val myExpression = "s - new_col"
val df6 = df5.withColumn("result", when($"new_col" === 0, 0).otherwise(expr(myExpression)))
//df6.show(false)
df6.createOrReplaceTempView("ranges3")
val q3 = spark.sql(""" SELECT *, dense_rank() over (PARTITION BY name ORDER BY start ASC) as RANK
FROM ranges3
WHERE new_col = -99999999 OR result > 1
""")
q3.createOrReplaceTempView("rangesSTARTS")
val q4 = spark.sql(""" SELECT *
FROM ranges3
WHERE result <= 1 AND new_col <> -99999999
""")
q4.createOrReplaceTempView("rangesFOLLOWERS")
val q5 = spark.sql(""" SELECT r1.*, r2.start as next_start
FROM rangesSTARTS r1 LEFT JOIN rangesSTARTS r2
ON r2.name = r1.name
AND r2.RANK = r1.RANK + 1
""")
//q5.show(false)
val q6 = q5.withColumn("end_period", when($"next_start".isNull, "2525-01-01").otherwise($"next_start"))
//q6.show(false)
q6.createOrReplaceTempView("rangesSTARTS2")
// Second and final set of results - the head and tail of such set of range records.
val result2 = spark.sql(""" SELECT r1.name, r1.start, MAX(r2.end) as end
FROM rangesFOLLOWERS r2, rangesSTARTS2 r1
WHERE r2.name = r1.name
AND r2.end >= r1.start
AND r2.end < r1.end_period
GROUP BY r1.name, r1.start """)
result2.show(false)
val finalresult = result1.union(result2)
finalresult.show
returns:
+-----+----------+----------+
| name|     start|       end|
+-----+----------+----------+
|  Bob|2018-09-01|2018-09-02|
|Betty|2017-09-10|2017-09-15|
|  YYY|2017-09-04|2017-09-15|
|  Bob|2018-09-04|2018-09-22|
|  Bob|2019-09-04|2019-09-07|
|  XXX|2017-09-04|2017-09-15|
|Betty|2018-09-05|2018-09-15|
+-----+----------+----------+
An interesting contrast - what is better for performance and style? My last such effort for a while. Interested in your comments. You know the programming aspects better than I do, so this question provides some good comparison and some good education. The other solutions do explode, which is not how I saw it.
I think my second approach is better, but still far from perfect. It at least avoids iterating through every day in every date range, although it now processes every range multiple times. I think I'm mostly going to be processing a few large ranges instead of a bunch of little ranges, so maybe that's ok.
Given:
val ranges = Seq(
("Mike","2018-09-01","2018-09-10"),
("Mike","2018-09-05","2018-09-05"),
("Mike","2018-09-12","2018-09-12"),
("Mike","2018-09-11","2018-09-11"),
("Mike","2018-09-25","2018-09-30"),
("Mike","2018-09-21","2018-09-23"),
("Mike","2018-09-24","2018-09-24"),
("Mike","2018-09-14","2018-09-16"),
("Mike","2018-09-15","2018-09-17"),
("Mike","2018-09-05","2018-09-05"),
("Mike","2018-09-19","2018-09-19"),
("Mike","2018-09-19","2018-09-19"),
("Mike","2018-08-19","2018-08-20"),
("Mike","2018-10-01","2018-10-20"),
("Mike","2018-10-10","2018-10-30")
)
val df = ranges.toDF("name", "start", "end")
I want:
+----+----------+----------+
|name|start     |end       |
+----+----------+----------+
|Mike|2018-09-01|2018-09-12|
|Mike|2018-09-21|2018-09-30|
|Mike|2018-09-14|2018-09-17|
|Mike|2018-09-19|2018-09-19|
|Mike|2018-08-19|2018-08-20|
|Mike|2018-10-01|2018-10-30|
+----+----------+----------+
(They are not in order this time. I'm ok with that since it was never a requirement. It just happened to be an artifact of my previous approach)
// very specific helper functions to convert a date string to and from a range
implicit class MyString(s:String) {
def toFilteredInt: Int = s.filter(_.isDigit).toInt
def to(s2:String): Range = s.toFilteredInt to s2.toFilteredInt
// this only works for YYYYMMDD strings. very dangerous.
def toDateStr = s"${s.slice(0,4)}-${s.slice(4,6)}-${s.slice(6,8)}"
}
// helper functions to combine two ranges
implicit class MyRange(r:Range) {
def expand(i: Int): Range = r.head - i * r.step to r.last + i * r.step
def containsPart(r2:Range): Boolean = r.contains(r2.head) || r.contains(r2.last)
def intersects(r2:Range): Boolean = r.containsPart(r2) || r2.containsPart(r)
def combine(r2:Range): Option[Range] = {
if (r.step == r2.step && r.intersects(r2 expand 1)) {
if (r.step > 0) Some(Math.min(r.head, r2.head) to Math.max(r.last, r2.last))
else Some(Math.max(r.head, r2.head) to Math.min(r.last, r2.last))
}
else None
}
def toDateStrTuple: (String,String) = (r.start.toString.toDateStr, r.end.toString.toDateStr)
}
// combines a range to one of the ranges in a sequence if possible;
// adds it to the sequence if it can't be combined.
def addToRanges(rs:Seq[Range], r:Range): Seq[Range] = {
if (rs.isEmpty) Seq(r)
else r.combine(rs.last) match {
case Some(r:Range) => rs.init :+ r
case None => addToRanges(rs.init, r) :+ rs.last
}
}
// tries to combine every range in the sequence with every other range
// does not handle the case where combining two ranges together allows
// them to be combined with a previous range in the sequence.
// however, if we call this and nothing has been combined, we know
// we are done
def collapseOnce(rs:Seq[Range]):Seq[Range] = {
if (rs.size <= 1) rs
else addToRanges(collapseOnce(rs.init), rs.last)
}
// keep collapsing the sequence of ranges until they can't collapse
// any further
def collapseAll(rs:Seq[Range]):Seq[Range] = {
val collapsed = collapseOnce(rs)
if (rs.size == collapsed.size) rs
else collapseAll(collapsed)
}
// now our udf is much simpler
val combineRanges = udf((rows: Seq[Row]) => {
val ranges = rows.map(r => r(0).toString to r(1).toString)
collapseAll(ranges).map(_.toDateStrTuple)
})
df.groupBy("name").agg(combineRanges(collect_list(struct($"start", $"end"))) as "ranges"
).withColumn("ranges", explode($"ranges")
).select($"name", $"ranges._1" as "start", $"ranges._2" as "end").show(false)
Room for improvement:
I'm pretty sure I'd get better performance most of the time if I bailed out of collapseOnce as soon as I found a range to combine. The typical use case will be adding a single day range to the last range in the sequence.
collapseOnce and addToRanges are not yet tail recursive (see the sketch after this list for a tail-recursive variant of addToRanges).
Some of the date to string and string to date methods in my implicit classes should probably be stand alone methods. They are super specific to my problem and don't deserve to be included in general String and Range classes.
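On the tail-recursion point, a tail-recursive variant of addToRanges could look like this (a sketch that reuses the combine method from the MyRange implicit class above and is intended to behave the same way):
import scala.annotation.tailrec
// Walks the sequence from the right, carrying the already-visited suffix in an accumulator.
@tailrec
def addToRangesRec(rs: Seq[Range], r: Range, acc: Seq[Range] = Seq.empty): Seq[Range] = {
  if (rs.isEmpty) r +: acc
  else r.combine(rs.last) match {
    case Some(merged) => (rs.init :+ merged) ++ acc
    case None => addToRangesRec(rs.init, r, rs.last +: acc)
  }
}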
What about:
exploding the date range into days with a UDF, then
using analytical functions to do the magic?
Code:
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._
def enumerateDays(start: LocalDate, end: LocalDate) = {
Iterator.iterate(start)(d => d.plusDays(1L))
.takeWhile(d => !d.isAfter(end))
.toSeq
}
val udf_enumerateDays = udf( (start:String, end:String) => enumerateDays(LocalDate.parse(start), LocalDate.parse(end)).map(_.toString))
df.select($"name", explode(udf_enumerateDays($"start",$"end")).as("day"))
.distinct
.withColumn("day_prev", lag($"day",1).over(Window.partitionBy($"name").orderBy($"day")))
.withColumn("is_consecutive", coalesce(datediff($"day",$"day_prev"),lit(0))<=1)
.withColumn("group_nb", sum(when($"is_consecutive",lit(0)).otherwise(lit(1))).over(Window.partitionBy($"name").orderBy($"day")))
.groupBy($"name",$"group_nb").agg(min($"day").as("start"), max($"day").as("end"))
.drop($"group_nb")
.orderBy($"name",$"start")
.show
Result:
+----+----------+----------+
|name|     start|       end|
+----+----------+----------+
|Mike|2018-08-19|2018-08-20|
|Mike|2018-09-01|2018-09-12|
|Mike|2018-09-14|2018-09-17|
|Mike|2018-09-19|2018-09-19|
|Mike|2018-09-21|2018-09-29|
|Mike|2018-10-01|2018-10-30|
+----+----------+----------+
Here is one more solution using DFs, without UDFs.
val df = Seq(
("Mike","2018-09-01","2018-09-10"), // range 1
("Mike","2018-09-05","2018-09-05"), // range 1
("Mike","2018-09-12","2018-09-12"), // range 1
("Mike","2018-09-11","2018-09-11"), // range 1
("Mike","2018-09-25","2018-09-30"), // range 4
("Mike","2018-09-21","2018-09-23"), // range 4
("Mike","2018-09-24","2018-09-24"), // range 4
("Mike","2018-09-14","2018-09-16"), // range 2
("Mike","2018-09-15","2018-09-17"), // range 2
("Mike","2018-09-05","2018-09-05"), // range 1
("Mike","2018-09-19","2018-09-19"), // range 3
("Mike","2018-09-19","2018-09-19") // range 3
).toDF("name", "start", "end").withColumn("start",'start.cast("date")).withColumn("end",'end.cast("date"))
df.printSchema()
val df2 = df.as("t1").join(df.as("t2"), $"t1.start" =!= $"t2.start" and $"t1.end" =!= $"t2.end")
.withColumn("date_diff_start",datediff($"t1.start",$"t2.start"))
.withColumn("date_diff_end",datediff($"t1.end",$"t2.end"))
.withColumn("n1_start",when('date_diff_start===1,$"t2.start"))
.withColumn("n1_end",when('date_diff_end === -1,$"t2.end"))
.filter( 'n1_start.isNotNull or 'n1_end.isNotNull)
.withColumn( "new_start", when('n1_start.isNotNull, $"n1_start").otherwise($"t1.start"))
.withColumn( "new_end", when('n1_end.isNotNull, $"n1_end").otherwise($"t1.end"))
.select("t1.name","new_start","new_end")
.distinct
val df3= df2.alias("t3").join(df2.alias("t4"),$"t3.name" === $"t4.name")
.withColumn("x1",when($"t3.new_end"=== $"t4.new_start",1)
.when($"t3.new_start" === $"t4.new_end",1)
.otherwise(0))
.groupBy("t3.name","t3.new_start","t3.new_end")
.agg( min( when('x1===1,$"t4.new_start" ).otherwise($"t3.new_start") ).as("ns"), max(when('x1===1,$"t4.new_end").otherwise($"t3.new_end")).as("ne"))
.select("t3.name","ns","ne")
.distinct
df3.show(false)
val num_combinations = df3.count
val df4 = df.filter('start==='end).distinct.select("name","start").alias("dup")
.join(df3.alias("d4"), $"d4.name"===$"dup.name" , "leftOuter")
.withColumn("cond", ! $"dup.start".between($"ns" , $"ne"))
.filter('cond)
.groupBy("d4.name","start" ).agg(count($"start").as("count"),collect_set('start).as("dup_s1"))
.filter('count===num_combinations)
.withColumn("start",explode('dup_s1))
.withColumn("end",'start)
.select("name","start","end")
df3.union(df4).show(false)
Results:
+----+----------+----------+
|name|ns        |ne        |
+----+----------+----------+
|Mike|2018-09-21|2018-09-30|
|Mike|2018-09-01|2018-09-12|
|Mike|2018-09-14|2018-09-17|
|Mike|2018-09-19|2018-09-19|
+----+----------+----------+

In spark MLlib, How do i convert string to integer in spark scala?

As far as I know, MLlib supports only integers.
So I want to convert strings to integers in Scala.
For example, I have many reviewerID, productID values in a text file.
reviewerID productID
03905X0912 ZXASQWZXAS
0325935ODD PDLFMBKGMS
...
StringIndexer is the solution. It will fit into the ML pipeline with an estimator and transformer. Essentially, once you set the input column, it computes the frequency of each category and numbers them starting from 0. You can add IndexToString at the end of the pipeline to map back to the original strings if required.
You can look at the ML documentation on "Extracting, transforming and selecting features" for further details.
In your case it will go like:
import org.apache.spark.ml.feature.StringIndexer
val indexer = new StringIndexer().setInputCol("productID").setOutputCol("productIndex")
val indexed = indexer.fit(df).transform(df)
indexed.show()
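If you later need to map the indices back to the original strings, a minimal IndexToString sketch (the output column name is just illustrative) is:
import org.apache.spark.ml.feature.IndexToString
// IndexToString reads the label metadata that StringIndexer attached to productIndex.
val converter = new IndexToString().setInputCol("productIndex").setOutputCol("productIDOriginal")
converter.transform(indexed).show()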
You can add a new column with a unique id for each reviewerID, productID. You can add such a column in the following ways.
By using monotonicallyIncreasingId (deprecated in newer Spark versions in favour of monotonically_increasing_id()):
import spark.implicits._
import org.apache.spark.sql.functions._
val data = spark.sparkContext.parallelize(Seq(
("123xyx", "ab"),
("123xyz", "cd")
)).toDF("reviewerID", "productID")
data.withColumn("uniqueReviID", monotonicallyIncreasingId).show()
By using zipWithUniqueId:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StructField, StructType}
val rows = data.rdd.zipWithUniqueId.map {
case (r: Row, id: Long) => Row.fromSeq(id +: r.toSeq)
}
val finalDf = spark.createDataFrame(rows, StructType(StructField("uniqueRevID", LongType, false) +: data.schema.fields))
finalDf.show()
You can also do this by using row_number() in SQL syntax:
import spark.implicits._
val data = spark.sparkContext.parallelize(Seq(
("123xyx", "ab"),
("123xyz", "cd")
)).toDF("reviewerID", "productID").createOrReplaceTempView("review")
val tmpTable1 = spark.sqlContext.sql(
"select row_number() over (order by reviewerID) as id, reviewerID, productID from review")
Hope this helps!

How to add corresponding Integer values in 2 different DataFrames

I have two DataFrames in my code with the exact same dimensions, let's say 1,000,000 x 50. I need to add the corresponding values in both dataframes. How can I achieve that?
One option would be to add another column with ids, union both DataFrames and then use reduceByKey. But is there any other, more elegant way?
Thanks.
Your approach is good. Another option is to take the RDDs, zip them together, and then iterate over them to sum the columns and create a new dataframe using either of the original dataframe schemas.
Assuming the data types of all the columns are integer, this code snippet should work. Please note that this has been done in Spark 2.1.0.
import spark.implicits._
import org.apache.spark.sql.{DataFrame, Row}
val a: DataFrame = spark.sparkContext.parallelize(Seq(
(1, 2),
(3, 6)
)).toDF("column_1", "column_2")
val b: DataFrame = spark.sparkContext.parallelize(Seq(
(3, 4),
(1, 5)
)).toDF("column_1", "column_2")
// Merge rows
val rows = a.rdd.zip(b.rdd).map{
case (rowLeft, rowRight) => {
val totalColumns = rowLeft.schema.fields.size
val summedRow = for(i <- (0 until totalColumns)) yield rowLeft.getInt(i) + rowRight.getInt(i)
Row.fromSeq(summedRow)
}
}
// Create new data frame
val ab: DataFrame = spark.createDataFrame(rows, a.schema) // use any of the schemas
ab.show()
Update:
So, I tried to compare the performance of my solution against yours. I tested with 100000 rows, each row having 50 columns. In the case of your approach it has 51 columns; the extra one is the ID column. On a single machine (no cluster), my solution seems to work a bit faster.
The union and group by approach takes about 5598 milliseconds,
whereas my solution takes about 5378 milliseconds.
My assumption is the first solution takes a bit more time because of the union operation of the two dataframes.
Here are the methods which I created for testing the approaches.
def option_1()(implicit spark: SparkSession): Unit = {
import spark.implicits._
val a: DataFrame = getDummyData(withId = true)
val b: DataFrame = getDummyData(withId = true)
val allData = a.union(b)
val result = allData.groupBy($"id").agg(allData.columns.collect({ case col if col != "id" => (col, "sum") }).toMap)
println(result.count())
// result.show()
}
def option_2()(implicit spark: SparkSession): Unit = {
val a: DataFrame = getDummyData()
val b: DataFrame = getDummyData()
// Merge rows
val rows = a.rdd.zip(b.rdd).map {
case (rowLeft, rowRight) => {
val totalColumns = rowLeft.schema.fields.size
val summedRow = for (i <- (0 until totalColumns)) yield rowLeft.getInt(i) + rowRight.getInt(i)
Row.fromSeq(summedRow)
}
}
// Create new data frame
val result: DataFrame = spark.createDataFrame(rows, a.schema) // use any of the schemas
println(result.count())
// result.show()
}
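For reference, the getDummyData helper is not shown above; a sketch of what it might look like (my assumption: 100000 rows of 50 random integer columns, plus an id column when withId is true) is:
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}
import scala.util.Random
def getDummyData(withId: Boolean = false)(implicit spark: SparkSession): DataFrame = {
  val numRows = 100000
  val numCols = 50
  val baseFields = (1 to numCols).map(i => StructField(s"col_$i", IntegerType, nullable = false))
  val schema =
    if (withId) StructType(StructField("id", IntegerType, nullable = false) +: baseFields)
    else StructType(baseFields)
  val rows = spark.sparkContext.parallelize(1 to numRows).map { i =>
    val values = Seq.fill(numCols)(Random.nextInt(100))
    Row.fromSeq(if (withId) i +: values else values)
  }
  spark.createDataFrame(rows, schema)
}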