Spark: Performant way to find top n values - scala

I have a large dataset and I would like to find the rows with the n highest values.
id, count
id1, 10
id2, 15
id3, 5
...
The only method I can think of is using row_number without a partition, like
val window = Window.orderBy(desc("count"))
df.withColumn("row_number", row_number over window).filter(col("row_number") <= n)
but this is in no way performant when the data contains millions or billions of rows because it pushes the data into one partition and I get OOM.
Has anyone managed to come up with a performant solution?

I see two ways to improve your algorithm's performance. The first is to use sort and limit to retrieve the top n rows. The second is to develop a custom Aggregator.
Sort and Limit method
You sort your dataframe and then you take the first n rows:
val n: Int = ???
import org.apache.spark.sql.functions.desc
df.orderBy(desc("count")).limit(n)
Spark optimizes this sequence of transformations: it first sorts each partition and takes the first n rows of each, then gathers those candidates onto a single final partition, sorts them again and takes the final first n rows. You can check this by calling explain() on the transformations, which gives the following execution plan:
== Physical Plan ==
TakeOrderedAndProject(limit=3, orderBy=[count#8 DESC NULLS LAST], output=[id#7,count#8])
+- LocalTableScan [id#7, count#8]
You can also see how the TakeOrderedAndProject step is executed in limit.scala in Spark's source code (case class TakeOrderedAndProjectExec, method doExecute).
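Concretely, the plan above can be reproduced with a one-liner (a minimal check, assuming the df and n defined above):
df.orderBy(desc("count")).limit(n).explain()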
Custom Aggregator method
For the custom aggregator, you create an Aggregator that maintains and updates an ordered array of the top n rows.
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.sql.Encoder
import scala.collection.mutable.ArrayBuffer
case class Record(id: String, count: Int)
case class TopRecords(limit: Int) extends Aggregator[Record, ArrayBuffer[Record], Seq[Record]] {
def zero: ArrayBuffer[Record] = ArrayBuffer.empty[Record]
def reduce(topRecords: ArrayBuffer[Record], currentRecord: Record): ArrayBuffer[Record] = {
val insertIndex = topRecords.lastIndexWhere(p => p.count > currentRecord.count)
if (topRecords.length < limit) {
topRecords.insert(insertIndex + 1, currentRecord)
} else if (insertIndex < limit - 1) {
topRecords.insert(insertIndex + 1, currentRecord)
topRecords.remove(topRecords.length - 1)
}
topRecords
}
def merge(topRecords1: ArrayBuffer[Record], topRecords2: ArrayBuffer[Record]): ArrayBuffer[Record] = {
val merged = ArrayBuffer.empty[Record]
while (merged.length < limit && (topRecords1.nonEmpty || topRecords2.nonEmpty)) {
if (topRecords1.isEmpty) {
merged.append(topRecords2.remove(0))
} else if (topRecords2.isEmpty) {
merged.append(topRecords1.remove(0))
} else if (topRecords2.head.count < topRecords1.head.count) {
merged.append(topRecords1.remove(0))
} else {
merged.append(topRecords2.remove(0))
}
}
merged
}
def finish(reduction: ArrayBuffer[Record]): Seq[Record] = reduction
def bufferEncoder: Encoder[ArrayBuffer[Record]] = ExpressionEncoder[ArrayBuffer[Record]]
def outputEncoder: Encoder[Seq[Record]] = ExpressionEncoder[Seq[Record]]
}
You then apply this aggregator to your dataframe and flatten the aggregation result:
val n: Int = ???
import sparkSession.implicits._
df.as[Record].select(TopRecords(n).toColumn).flatMap(record => record)
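As a quick sanity check on a small dataframe, you can compare the aggregator output with the Sort and Limit result. This is just a hedged sketch, assuming the df, n, Record and imports from above, and it ignores ties on count:
// Both approaches should return the same n records when counts are distinct.
val viaAggregator = df.as[Record].select(TopRecords(n).toColumn).flatMap(r => r).collect().toSet
val viaSortLimit = df.orderBy(desc("count")).limit(n).as[Record].collect().toSet
assert(viaAggregator == viaSortLimit)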
Method comparison
To compare the two methods, let's say we want to take the top n rows of a dataframe that is distributed over p partitions, each partition holding around k records, so the dataframe has p·k rows in total. This gives the following complexities (subject to errors):
method            | total number of operations    | memory consumption (on executor) | memory consumption (on final executor)
------------------+-------------------------------+----------------------------------+---------------------------------------
Current code      | O(p·k·log(p·k))               | --                               | O(p·k)
Sort and Limit    | O(p·k·log(k) + p·n·log(p·n))  | O(k)                             | O(p·n)
Custom Aggregator | O(p·k)                        | O(k) + O(n)                      | O(p·n)
So regarding the number of operations, the Custom Aggregator is the most performant. However, this method is by far the most complex and involves a lot of serialization/deserialization, so it may be slower than Sort and Limit in some cases.
Conclusion
You have two methods to efficiently take the top n rows: Sort and Limit, and Custom Aggregator. To choose between them, you should benchmark both on your real dataframe. If Sort and Limit turns out to be only slightly slower than the Custom Aggregator, I would pick Sort and Limit, as its code is much easier to maintain.
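For the benchmark itself, here is a minimal sketch, assuming the df, n, Record, TopRecords and imports defined above; collect() forces the full computation so the timings are meaningful:
def time[T](label: String)(block: => T): T = {
  val start = System.nanoTime()
  val result = block
  println(s"$label took ${(System.nanoTime() - start) / 1e9} s")
  result
}
time("Sort and Limit") { df.orderBy(desc("count")).limit(n).collect() }
time("Custom Aggregator") { df.as[Record].select(TopRecords(n).toColumn).flatMap(r => r).collect() }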

Convert to an RDD.
Sort each partition via mapPartitions and take N per partition.
Convert back to a df.
Then sort globally and take the top N. It is unlikely you will hit OOM this way.
An actual example, slightly updated; a roll-your-own approach, for posterity.
import org.apache.spark.sql.functions._
import spark.sqlContext.implicits._
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{DoubleType, IntegerType, StringType, StructField, StructType}
// 1. Create data
val data = Seq(("James ","","Smith","36636","M",33000),
("Michael ","Rose","","40288","M",14000),
("Robert ","","Williams","42114","M",40),
("Robert ","","Williams","42114","M",540),
("Robert ","","Zeedong","42114","M",40000000),
("Maria ","Anne","Jones","39192","F",300),
("Maria ","Anne","Vangelis","39192","F",1300),
("Jen","Mary","Brown","","F",-1))
val columns = Seq("firstname","middlename","lastname","dob","gender","val")
val df = data.toDF(columns:_*)
//df.show()
//2. Repartition to any number of partitions and sort within each partition. A combiner step, like in Hadoop.
val df2 = df.repartition(1000,col("lastname")).sortWithinPartitions(desc("val"))
//df2.rdd.glom().collect()
//3. Take the top N per partition, i.e. num partitions x 2 rows in this case. take(n) returns the top n of each already-sorted partition. No OOM.
val rdd2 = df2.rdd.mapPartitions(_.take(2))
//4. Ghastly Row to DF work-arounds.
val schema = new StructType()
.add(StructField("f", StringType, true))
.add(StructField("m", StringType, true))
.add(StructField("l", StringType, true))
.add(StructField("d", StringType, true))
.add(StructField("g", StringType, true))
.add(StructField("v", IntegerType, true))
val df3 = spark.createDataFrame(rdd2, schema)
//5. Sort and take top(n) = 2 and Bob's your uncle. The Reduce after Combine.
df3.sort(col("v").desc).limit(2).show()
Returns for top 2 desc:
+-------+---+-------+-----+---+--------+
| f| m| l| d| g| v|
+-------+---+-------+-----+---+--------+
|Robert | |Zeedong|42114| M|40000000|
| James | | Smith|36636| M| 33000|
+-------+---+-------+-----+---+--------+
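For what it's worth, here is a hedged sketch of the same combine/reduce idea using the typed Dataset API, which avoids the Row-to-DF work-arounds of step 4. The Person case class and the rename of "val" to "value" are my own additions ("val" is a Scala keyword, so it cannot be used as a case class field name as-is):
case class Person(firstname: String, middlename: String, lastname: String,
                  dob: String, gender: String, value: Int)
// Same idea: sort within partitions, keep the local top 2, then take the global top 2.
val ds = df.withColumnRenamed("val", "value").as[Person]
val topPerPartition = ds
  .repartition(1000, col("lastname"))
  .sortWithinPartitions(desc("value"))
  .mapPartitions(_.take(2)) // top 2 per partition, no OOM
topPerPartition.orderBy(desc("value")).limit(2).show()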

Related

Scala - splitting dataframe based on number of rows

I have a spark dataframe that has approximately a million records. I'm trying to split this dataframe into multiple small dataframes where each of these dataframes has a maximum rowCount of 20,000 (Each of these dataframes should have a row count of 20,000 except the last dataframe which may or may not have 20,000). Can you help me with this? Thank you.
Ok, maybe not the most efficient way, but here it is. You can create a new column that counts every row (in case you don't have a unique id column). Here we basically iterate over the whole dataframe, selecting batches of 20k rows and adding them to a list of DataFrames.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._
import spark.implicits._
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.monotonically_increasing_id
var index = 0
val subsetSize = 20000
var listOfDF: List[DataFrame] = List()
// withColumn optional if you already have a unique id per row
val df = spark.table("your_table").withColumn("rowNum", monotonically_increasing_id())
def returnSubDF(fromIndex: Int, toIndex: Int) = {
df.filter($"rowNum" >= fromIndex && $"rowNum" < toIndex)
}
while (index <= 1000000){ // ~1 million rows in this example; in general derive the bound from df.count()
listOfDF = listOfDF :+ returnSubDF(index, index+subsetSize)
index += subsetSize
}
listOfDF.head.show()
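One caveat: monotonically_increasing_id is guaranteed to be increasing and unique, but not consecutive, so the fixed-size range filters above can produce batches smaller than 20,000. A hedged alternative, reusing the Window import above, assigns consecutive row numbers first; the unpartitioned window funnels everything through a single partition, so it trades speed for exact batch sizes:
// Assign consecutive 0-based indices, ordered by the existing rowNum column.
val w = Window.orderBy($"rowNum")
val indexed = df.withColumn("seqNum", row_number().over(w) - 1)
val total = indexed.count()
val batches: Seq[DataFrame] = (0L until total by subsetSize).map { start =>
  indexed.filter($"seqNum" >= start && $"seqNum" < start + subsetSize)
}
batches.head.show()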

Using Spark SQL joinWith, how can I join two datasets to match current records with their previous records based on date?

I am trying to join two datasets of meters readings in Spark SQL using joinWith, so that the returned type is Dataset[(Reading, Reading)]. The goal is to match each row in the first dataset (called Current) with its previous record in the second dataset (called Previous), based on a date column.
I need to first join on the meter key, and then join by comparing date, finding the next largest date that is smaller than the current reading date (i.e. the previous reading).
Here is what I have tried, but I think this is too trivial. I am also getting a 'Can't resolve' error with MAX.
val joined = Current.joinWith(
Previous,
(Current("Meter_Key") === Previous("Meter_Key"))
&& (Current("Reading_Dt_Key") > MAX(Previous("Reading_Dt_Key"))
)
Can anyone help?
I did not try LAG, though I think that would also work (see the sketch after the output below). Instead I looked at your joinWith requirement and applied some ranking logic for performance reasons; many steps in the jobs are skipped. I used different names; you can abstract, rename and drop columns as needed.
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
import spark.implicits._
case class mtr0(mtr: String, seqNum: Int)
case class mtr(mtr: String, seqNum: Int, rank: Int)
// Gen data & optimize for JOINing, just interested in max 2 records for ranked sets.
val curr0 = Seq(
mtr0("m1", 1),
mtr0("m1", 2),
mtr0("m1", 3),
mtr0("m2", 7)
).toDS
val curr1 = curr0.withColumn("rank", row_number()
.over(Window.partitionBy($"mtr").orderBy($"seqNum".desc)))
// Reduce before JOIN.
val currF=curr1.filter($"rank" === 1 ).as[mtr]
//currF.show(false)
val prevF=curr1.filter($"rank" === 2 ).as[mtr]
//prevF.show(false)
val selfDF = currF.as("curr").joinWith(prevF.as("prev"),
( col("curr.mtr") === col("prev.mtr") && (col("curr.rank") === 1) && (col("prev.rank") === 2)),"left")
// Null value evident when only 1 entry per meter.
selfDF.show(false)
returns:
+----------+----------+
|_1 |_2 |
+----------+----------+
|[m1, 3, 1]|[m1, 2, 2]|
|[m2, 7, 1]|null |
+----------+----------+
selfDF: org.apache.spark.sql.Dataset[(mtr, mtr)] = [_1: struct<mtr: string, seqNum: int ... 1 more field>, _2: struct<mtr: string, seqNum: int ... 1 more field>]
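For reference, here is a hedged sketch of the LAG variant mentioned above, on the same test data and with the same imports; it pairs every reading with its immediate predecessor per meter as a struct column, rather than returning a Dataset[(mtr, mtr)]:
// Previous reading per meter, ordered by seqNum; null when there is no previous reading.
val w = Window.partitionBy($"mtr").orderBy($"seqNum")
val withPrev = curr0.withColumn("prev", lag(struct($"mtr", $"seqNum"), 1).over(w))
withPrev.show(false)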

Finding Max sum of marks each year

I am new to Scala and Spark. Can someone optimize the Scala code below for finding the maximum marks scored by students each year?
val m=sc.textFile("marks.csv")
val SumOfMarks=m.map(_.split(",")).mapPartitionsWithIndex {(idx, iter) => if (idx == 0) iter.drop(1) else iter}.map(l=>((l(0),l(1)),l(3).toInt)).reduceByKey(_+_).sortBy(line => (line._1._1, line._2), ascending=false)
var s:Int=0
var y:String="0"
for(i<-SumOfMarks){ if((i._1._1!=y) || (i._2==s && i._1._1==y)){ println(i);s=i._2;y=i._1._1}}
Input : marks.csv
year,student,sub,marks
2016,ram,maths,90
2016,ram,physics,86
2016,ram,chemistry,88
2016,raj,maths,84
2016,raj,physics,96
2016,raj,chemistry,98
2017,raghu,maths,96
2017,raghu,physics,98
2017,raghu,chemistry,94
2017,rajesh,maths,92
2017,rajesh,physics,98
2017,rajesh,chemistry,98
Output :
2017,raghu,288
2017,rajesh,288
2016,raj,278
I am not sure what you mean exactly by "Optimised", but a more "scala-y" and "spark-y" way of doing this might be as follows:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._
// Read your data file as a CSV file with row headers.
val marksDF = spark.read.option("header","true").csv("marks.csv")
// Calculate the total marks for each student in each year. The new total mark column will be called "totMark"
val marksByStudentYear = marksDF.groupBy(col("year"), col("student")).agg(sum(col("marks")).as("totMark"))
// Rank the marks within each year. Highest Mark will get rank 1, second highest rank 2 and so on.
// A benefit of rank is that if two scores have the same mark, they will both get the
// same rank.
val marksRankedByYear = marksByStudentYear.withColumn("rank", dense_rank().over(Window.partitionBy("year").orderBy($"totMark".desc)))
// Finally filter so that we only have the "top scores" (rank = 1) for each year,
// order by year and student name and display the result.
val topStudents = marksRankedByYear.filter($"rank" === 1).orderBy($"year", $"student")
topStudents.show
This will produce the following output in Spark-shell:
+----+-------+-------+----+
|year|student|totMark|rank|
+----+-------+-------+----+
|2016| raj| 278.0| 1|
|2017| raghu| 288.0| 1|
|2017| rajesh| 288.0| 1|
+----+-------+-------+----+
If you need a CSV displayed as per your question, you can use:
topStudents.collect.map(_.mkString(",")).foreach(println)
which produces:
2016,raj,278.0,1
2017,raghu,288.0,1
2017,rajesh,288.0,1
I have broken the process up into individual steps. This will allow you to see what is going on at each step by simply running show on an intermediate result. For example to see what the spark.read.option... does, simply enter marksDF.show into spark-shell
Since OP wanted an RDD version, here is one example. Probably it is not optimal, but it does give the correct result:
import org.apache.spark.rdd.RDD
// A Helper function which makes it slightly easier to view RDD content.
def dump[R] (rdd : RDD[R]) = rdd.collect.foreach(println)
val marksRdd = sc.textFile("marks.csv")
// A case class to annotate the content in the RDD
case class Report(year:Int, student:String, sub:String, mark:Int)
// Create the RDD as a series of Report objects - ignore the header.
val marksReportRdd = marksRdd.map(_.split(",")).mapPartitionsWithIndex {
(idx, iter) => if (idx == 0) iter.drop(1) else iter
}.map(r => Report(r(0).toInt,r(1),r(2),r(3).toInt))
// Group the data by year and student.
val marksGrouped = marksReportRdd.groupBy(report => (report.year, report.student))
// Calculate the total score for each student for each year by adding up the scores
// of each subject the student has taken in that year.
val totalMarkStudentYear = marksGrouped.map{ case (key, marks:Iterable[Report]) => (key, marks.foldLeft(0)((acc, rep) => acc + rep.mark))}
// Determine the highest score for each year.
val yearScoreHighest = totalMarkStudentYear.map{ case (key, score:Int) => (key._1, score) }.reduceByKey(math.max(_, _))
// Determine the list of students who have received the highest score in each year.
// This is achieved by joining the total marks each student received in each year
// to the highest score in each year.
// The join is performed on the key, which is a Tuple2(year, score).
// To achieve this, both RDD's must be mapped to produce this key with a data attribute.
// The data attribute for the highest scores is a dummy value "x".
// The data attribute for the student scores is the student's name.
val highestRankStudentByYear = totalMarkStudentYear.map{ case (key, score) => ((key._1, score), key._2)}.join (yearScoreHighest.map (k => (k, "x")))
// Finally extract the year, student name and score from the joined RDD
// Sort by year and name.
val result = highestRankStudentByYear.map{ case (key, score) => (key._1, score._1, key._2)}.sortBy( r => (r._1, r._2))
// Show the final result.
dump(result)
The result of the above is:
(2016,raj,278)
(2017,raghu,288)
(2017,rajesh,288)
As before, you can view the intermediate RDD's simply by dumping them using the dump function. NB: the dump function takes an RDD. If you want to show the content of a DataFrame or Dataset, use its show method.
There is probably a more optimal solution than the one above, but it does the job.
Hopefully the RDD version will encourage you to use DataFrames and/or DataSets if you can. Not only is the code simpler, but:
Spark can analyse DataFrame and DataSet transformations and optimise the overall process; RDD transformations are not optimised this way (they are executed one after another as written). DataFrame and DataSet based processes will therefore likely run faster (assuming you don't manually optimise the RDD equivalent).
DataSets and DataFrames support schemas to varying degrees (e.g. named columns and data typing).
DataFrames and DataSets can be queried using SQL.
DataFrame and DataSet operations/methods are more aligned with SQL constructs
DataFrames and DataSets are easier to use than RDD's
DataSets (and RDD's) offer compile time error detection.
DataSets are the future direction.
Check out these couple of links for more information:
https://data-flair.training/blogs/apache-spark-rdd-vs-dataframe-vs-dataset/
https://www.linkedin.com/pulse/apache-spark-rdd-vs-dataframe-dataset-chandan-prakash/
https://medium.com/#sachee/apache-spark-dataframe-vs-rdd-24a04d2eb1b9
or simply google "spark should i use rdd or dataframe"
All the best with your project.
Try it in the Scala spark-shell:
scala> val df = spark.read.format("csv").option("header", "true").load("/CSV file location/marks.csv")
scala> df.registerTempTable("record")
scala> sql(" select year, student, marks from (select year, student, marks, RANK() over (partition by year order by marks desc) rank From ( Select year, student, SUM(marks) as marks from record group by Year, student)) where rank =1 ").show
It will generate the following table
+----+-------+-----+
|year|student|marks|
+----+-------+-----+
|2016| raj|278.0|
|2017| raghu|288.0|
|2017| rajesh|288.0|
+----+-------+-----+
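Note that registerTempTable is deprecated since Spark 2.0; here is a hedged equivalent of the same steps using createOrReplaceTempView and explicit subquery aliases:
val df = spark.read.format("csv").option("header", "true").load("/CSV file location/marks.csv")
df.createOrReplaceTempView("record")
spark.sql("""
  SELECT year, student, marks
  FROM (SELECT year, student, marks,
               RANK() OVER (PARTITION BY year ORDER BY marks DESC) AS rnk
        FROM (SELECT year, student, SUM(marks) AS marks FROM record GROUP BY year, student) totals
       ) ranked
  WHERE rnk = 1
""").show()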
import org.apache.spark.rdd.RDD
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions
//Finding Max sum of marks each year
object Marks2 {
def getSparkContext() = {
val conf = new SparkConf().setAppName("MaxMarksEachYear").setMaster("local")
val sc = new SparkContext(conf)
sc
}
def dump[R](rdd: RDD[R]) = rdd.collect.foreach(println)
def main(args: Array[String]): Unit = {
// System.setProperty("hadoop.home.dir", "D:\\Setup\\hadoop_home")
val sc = getSparkContext()
val inpRDD = sc.textFile("marks.csv")
val head = inpRDD.first()
val marksRdd = inpRDD.filter(record=> !record.equals(head)).map(rec => rec.split(","))
val marksByNameyear = marksRdd.map(rec =>((rec(0).toInt,rec(1)),rec(3).toInt))
//marksByNameyear.cache()
// Total marks per (year, student).
val aggMarksByYearName = marksByNameyear.reduceByKey(_+_)
// Highest total per year.
val maxMarksByYear = aggMarksByYearName.map(s => (s._1._1, s._2)).reduceByKey(math.max(_, _))
// Re-key by total mark: (mark, student) and (mark, year).
val markYearName = aggMarksByYearName.map(s => (s._2.toInt, s._1._2))
val marksAndYear = maxMarksByYear.map(s => (s._2.toInt, s._1))
// Alternative broadcast-based lookup (result not used below):
// val tt = sc.broadcast(marksAndYear.collect().toMap)
// marksAndYear.flatMap { case (key, value) => tt.value.get(key).map { other => (other, value, key) } }
// Join the top marks back to the student names.
val yearMarksName = marksAndYear.leftOuterJoin(markYearName)
val result = yearMarksName.map(s => (s._2._1, s._2._2, s._1)).sortBy(f => f._3, true)
//dump(markYearName)
dump(result)
}
}

Merge multiple Dataframes into one Dataframe in Spark [duplicate]

I have two DataFrame a and b.
a is like
Column 1 | Column 2
abc | 123
cde | 23
b is like
Column 1
1
2
I want to zip a and b (or even more) DataFrames which becomes something like:
Column 1 | Column 2 | Column 3
abc | 123 | 1
cde | 23 | 2
How can I do it?
An operation like this is not supported by the DataFrame API. It is possible to zip two RDDs, but for it to work you have to match both the number of partitions and the number of elements per partition. Assuming this is the case:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructField, StructType, LongType}
val a: DataFrame = sc.parallelize(Seq(
("abc", 123), ("cde", 23))).toDF("column_1", "column_2")
val b: DataFrame = sc.parallelize(Seq(Tuple1(1), Tuple1(2))).toDF("column_3")
// Merge rows
val rows = a.rdd.zip(b.rdd).map{
case (rowLeft, rowRight) => Row.fromSeq(rowLeft.toSeq ++ rowRight.toSeq)}
// Merge schemas
val schema = StructType(a.schema.fields ++ b.schema.fields)
// Create new data frame
val ab: DataFrame = sqlContext.createDataFrame(rows, schema)
If the above conditions are not met, the only option that comes to mind is adding an index and joining:
def addIndex(df: DataFrame) = sqlContext.createDataFrame(
// Add index
df.rdd.zipWithIndex.map{case (r, i) => Row.fromSeq(r.toSeq :+ i)},
// Create schema
StructType(df.schema.fields :+ StructField("_index", LongType, false))
)
// Add indices
val aWithIndex = addIndex(a)
val bWithIndex = addIndex(b)
// Join and clean
val ab = aWithIndex
.join(bWithIndex, Seq("_index"))
.drop("_index")
In Scala's DataFrame implementation, there is no simple way to concatenate two dataframes into one. We can work around this limitation by adding an index to each row of both dataframes and then doing an inner join on these indices. Note that monotonically_increasing_id values only line up row-for-row when both dataframes have identical partitioning; if that is not guaranteed, prefer the zipWithIndex approach above. This is my stub code of this implementation:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.monotonically_increasing_id
val a: DataFrame = sc.parallelize(Seq(("abc", 123), ("cde", 23))).toDF("column_1", "column_2")
val aWithId: DataFrame = a.withColumn("id", monotonically_increasing_id())
val b: DataFrame = sc.parallelize(Seq((1), (2))).toDF("column_3")
val bWithId: DataFrame = b.withColumn("id", monotonically_increasing_id())
aWithId.join(bWithId, "id")
A little light reading - Check out how Python does this!
What about pure SQL?
SELECT
room_name,
sender_nickname,
message_id,
row_number() over (partition by room_name order by message_id) as message_index,
row_number() over (partition by room_name, sender_nickname order by message_id) as user_message_index
from messages
order by room_name, message_id
I know the OP was using Scala, but if, like me, you need to know how to do this in pyspark, then try the Python code below. Like @zero323's first solution, it relies on RDD.zip() and will therefore fail if the two DataFrames don't have the same number of partitions and the same number of rows in each partition.
from pyspark.sql import Row
from pyspark.sql.types import StructType
def zipDataFrames(left, right):
CombinedRow = Row(*left.columns + right.columns)
def flattenRow(row):
left = row[0]
right = row[1]
combinedVals = [left[col] for col in left.__fields__] + [right[col] for col in right.__fields__]
return CombinedRow(*combinedVals)
zippedRdd = left.rdd.zip(right.rdd).map(lambda row: flattenRow(row))
combinedSchema = StructType(left.schema.fields + right.schema.fields)
return zippedRdd.toDF(combinedSchema)
joined = zipDataFrames(a, b)
