PySpark DataFrame Manipulation Efficiency - pyspark

Suppose I have the following data frame:
+----------+-----+----+-------+
|display_id|ad_id|prob|clicked|
+----------+-----+----+-------+
| 123| 989| 0.9| 0|
| 123| 990| 0.8| 1|
| 123| 999| 0.7| 0|
| 234| 789| 0.9| 0|
| 234| 777| 0.7| 0|
| 234| 769| 0.6| 1|
| 234| 798| 0.5| 0|
+----------+-----+----+-------+
I then perform the following operations to get a final data set (shown below the code):
# Add a new column with the clicked ad_id if clicked == 1, 0 otherwise
df_adClicked = df.withColumn("ad_id_clicked", when(df.clicked==1, df.ad_id).otherwise(0))
# DF -> RDD of tuples: (display_id, (ad_id, prob), clicked_ad_id)
df_blah = df_adClicked.rdd.map(lambda x : (x[0],(x[1],x[2]),x[4])).toDF(["display_id", "ad_id","clicked_ad_id"])
# Group by display_id and create column with clicked ad_id and list of tuples : (ad_id, prob)
df_blah2 = df_blah.groupby('display_id').agg(F.collect_list('ad_id'), F.max('clicked_ad_id'))
# Define function to sort list of tuples by prob and create list of only ad_ids
def sortByRank(ad_id_list):
    sortedVersion = sorted(ad_id_list, key=itemgetter(1), reverse=True)
    sortedIds = [i[0] for i in sortedVersion]
    return sortedIds
# Sort the (ad_id, prob) tuples by using udf/function and create new column ad_id_sorted
sort_ad_id = udf(lambda x : sortByRank(x), ArrayType(IntegerType()))
df_blah3 = df_blah2.withColumn('ad_id_sorted', sort_ad_id('collect_list(ad_id)'))
# Function to change clickedAdId into an array of size 1
def createClickedSet(clickedAdId):
    setOfDocs = [clickedAdId]
    return setOfDocs
clicked_set = udf(lambda y : createClickedSet(y), ArrayType(IntegerType()))
df_blah4 = df_blah3.withColumn('ad_id_set', clicked_set('max(clicked_ad_id)'))
# Select the necessary columns
finalDF = df_blah4.select('display_id', 'ad_id_sorted','ad_id_set')
+----------+--------------------+---------+
|display_id|ad_id_sorted |ad_id_set|
+----------+--------------------+---------+
|234 |[789, 777, 769, 798]|[769] |
|123 |[989, 990, 999] |[990] |
+----------+--------------------+---------+
Is there a more efficient way of doing this? This set of transformations seems to be the bottleneck in my code. I would greatly appreciate any feedback.

I haven't done any timing comparisons, but by not using any UDFs, Spark should be able to optimize the plan by itself.
# Scala equivalent of the next line (the only difference, besides the val keywords, between this Python answer and a Scala one):
# val dfad = sc.parallelize(Seq((123,989,0.9,0),(123,990,0.8,1),(123,999,0.7,0),(234,789,0.9,0),(234,777,0.7,0),(234,769,0.6,1),(234,798,0.5,0))).toDF("display_id","ad_id","prob","clicked")
dfad = sc.parallelize(((123,989,0.9,0),(123,990,0.8,1),(123,999,0.7,0),(234,789,0.9,0),(234,777,0.7,0),(234,769,0.6,1),(234,798,0.5,0))).toDF(["display_id","ad_id","prob","clicked"])
dfad.registerTempTable("df_ad")
df1 = sqlContext.sql("SELECT display_id,collect_list(ad_id) ad_id_sorted FROM (SELECT * FROM df_ad SORT BY display_id,prob DESC) x GROUP BY display_id")
+----------+--------------------+
|display_id| ad_id_sorted|
+----------+--------------------+
| 234|[789, 777, 769, 798]|
| 123| [989, 990, 999]|
+----------+--------------------+
df2 = sqlContext.sql("SELECT display_id, max(ad_id) as ad_id_set from df_ad where clicked=1 group by display_id")
+----------+---------+
|display_id|ad_id_set|
+----------+---------+
| 234| 769|
| 123| 990|
+----------+---------+
final_df = df1.join(df2,"display_id")
+----------+--------------------+---------+
|display_id| ad_id_sorted|ad_id_set|
+----------+--------------------+---------+
| 234|[789, 777, 769, 798]| 769|
| 123| [989, 990, 999]| 990|
+----------+--------------------+---------+
I didn't put the ad_id_set into an Array because you were calculating the max and max should only return 1 value. I'm sure if you really need it in an array you can make that happen.
I included the subtle Scala difference if a future someone using Scala has a similar problem.
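If the array form is really needed, a one-line follow-up like the following should do it (a PySpark sketch applied to the final_df built above; pyspark.sql.functions.array wraps the column in a single-element array):
from pyspark.sql import functions as F

# wrap the single clicked ad_id in a one-element array
final_df = final_df.withColumn("ad_id_set", F.array("ad_id_set"))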

Related

Pyspark forward and backward fill within column level

I am trying to fill missing data in a pyspark dataframe. The dataframe looks like this:
+---------+---------+-------------------+----+
| latitude|longitude| timestamplast|name|
+---------+---------+-------------------+----+
| | 4.905615|2019-08-01 00:00:00| 1|
|51.819645| |2019-08-01 00:00:00| 1|
| 51.81964| 4.961713|2019-08-01 00:00:00| 2|
| | |2019-08-01 00:00:00| 3|
| 51.82918| 4.911187| | 3|
| 51.82385| 4.901488|2019-08-01 00:00:03| 5|
+---------+---------+-------------------+----+
Within the column "name" I want to either forward fill or backward fill (whichever is necessary) to fill only "latitude" and "longitude" ("timestamplast" should not be filled). How do I do this?
Output will be:
+---------+---------+-------------------+----+
| latitude|longitude| timestamplast|name|
+---------+---------+-------------------+----+
|51.819645| 4.905615|2019-08-01 00:00:00| 1|
|51.819645| 4.905615|2019-08-01 00:00:00| 1|
| 51.81964| 4.961713|2019-08-01 00:00:00| 2|
| 51.82918| 4.911187|2019-08-01 00:00:00| 3|
| 51.82918| 4.911187| | 3|
| 51.82385| 4.901488|2019-08-01 00:00:03| 5|
+---------+---------+-------------------+----+
In Pandas this would be done as follows:
df = df.groupby("name")['longitude','latitude'].apply(lambda x : x.ffill().bfill())
How would this be done in PySpark?
I suggest you use the following two Window Specs:
from pyspark.sql import Window
w1 = Window.partitionBy('name').orderBy('timestamplast')
w2 = w1.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
Where:
w1 is the regular WindowSpec we use to calculate the forward fill, and it is the same as the following:
w1 = Window.partitionBy('name').orderBy('timestamplast').rowsBetween(Window.unboundedPreceding,0)
see the following note from the documentation for default window frames:
Note: When ordering is not defined, an unbounded window frame (rowFrame, unboundedPreceding, unboundedFollowing) is used by default. When ordering is defined, a growing window frame (rangeFrame, unboundedPreceding, currentRow) is used by default.
After the ffill, we only need to fix the null values at the very front, if any exist, so we can use a fixed Window frame (between Window.unboundedPreceding and Window.unboundedFollowing); this is more efficient than using a running Window frame since it requires only one aggregate, see SPARK-8638.
Then the x.ffill().bfill() can be handled by using coalesce + last + first based on the above two WindowSpecs:
from pyspark.sql.functions import coalesce, last, first
df.withColumn('latitude_new', coalesce(last('latitude', True).over(w1), first('latitude', True).over(w2))) \
  .select('name', 'timestamplast', 'latitude', 'latitude_new') \
  .show()
+----+-------------------+---------+------------+
|name| timestamplast| latitude|latitude_new|
+----+-------------------+---------+------------+
| 1|2019-08-01 00:00:00| null| 51.819645|
| 1|2019-08-01 00:00:01| null| 51.819645|
| 1|2019-08-01 00:00:02|51.819645| 51.819645|
| 1|2019-08-01 00:00:03| 51.81964| 51.81964|
| 1|2019-08-01 00:00:04| null| 51.81964|
| 1|2019-08-01 00:00:05| null| 51.81964|
| 1|2019-08-01 00:00:06| null| 51.81964|
| 1|2019-08-01 00:00:07| 51.82385| 51.82385|
+----+-------------------+---------+------------+
Edit: to process (ffill+bfill) on multiple columns, use a list comprehension:
cols = ['latitude', 'longitude']
df_new = df.select([ c for c in df.columns if c not in cols ] + [ coalesce(last(c,True).over(w1), first(c,True).over(w2)).alias(c) for c in cols ])
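Putting the pieces above together, a minimal end-to-end sketch (assuming a DataFrame df with the columns name, timestamplast, latitude and longitude) would look like this:
from pyspark.sql import Window
from pyspark.sql.functions import coalesce, last, first

w1 = Window.partitionBy('name').orderBy('timestamplast')
w2 = w1.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

# forward-fill with last(...) over w1, then backward-fill any leading nulls with first(...) over w2
cols = ['latitude', 'longitude']
df_new = df.select(
    [c for c in df.columns if c not in cols]
    + [coalesce(last(c, True).over(w1), first(c, True).over(w2)).alias(c) for c in cols]
)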
I got a working solution for either forward or backward fill of one target column, "longitude". I guess I could repeat the procedure for "latitude" and then again for the backward fill. Is there a more efficient way?
window = Window.partitionBy('name')\
               .orderBy('timestamplast')\
               .rowsBetween(-sys.maxsize, 0)  # this is for forward fill
               # .rowsBetween(0, sys.maxsize)  # this is for backward fill
# define the forward-filled column
filled_column = last(df['longitude'], ignorenulls=True).over(window)  # this is for forward fill
# filled_column = first(df['longitude'], ignorenulls=True).over(window)  # this is for backward fill
df = df.withColumn('mmsi_filled', filled_column)  # do the fill
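As a side note, the -sys.maxsize bounds can be replaced by the frame constants PySpark already provides; a sketch of the same window definition:
from pyspark.sql import Window

# forward fill: everything from the start of the partition up to the current row
window = (Window.partitionBy('name')
                .orderBy('timestamplast')
                .rowsBetween(Window.unboundedPreceding, Window.currentRow))
# a backward fill would use .rowsBetween(Window.currentRow, Window.unboundedFollowing) instead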

Histogram - Doing it in a parallel way

+----+----+--------+
| Id | M1 | trx |
+----+----+--------+
| 1 | M1 | 11.35 |
| 2 | M1 | 3.4 |
| 3 | M1 | 10.45 |
| 2 | M1 | 3.95 |
| 3 | M1 | 20.95 |
| 2 | M2 | 25.55 |
| 1 | M2 | 9.95 |
| 2 | M2 | 11.95 |
| 1 | M2 | 9.65 |
| 1 | M2 | 14.54 |
+----+----+--------+
With the above dataframe I can generate a histogram as below, using the following code.
A similar question is here.
val (range, counts) = df
  .select(col("trx"))
  .rdd.map(r => r.getDouble(0))
  .histogram(10)
// range: Array[Double] = Array(3.4, 5.615, 7.83, 10.045, 12.26, 14.475, 16.69, 18.905, 21.12, 23.335, 25.55)
// counts: Array[Long] = Array(2, 0, 2, 3, 0, 1, 0, 1, 0, 1)
But the issue here is: how can I create the histogram in parallel based on column 'M1'? That means I need two histogram outputs, one for the value M1 and one for M2.
First, you need to know that histogram generates two separate sequential jobs: one to detect the minimum and maximum of your data, and one to compute the actual histogram. You can check this using the Spark UI.
We can follow the same scheme to build histograms on as many columns as you wish, with only two jobs. Yet we cannot use the histogram function, which is only meant to handle one collection of doubles, so we need to implement it ourselves. The first job is dead simple.
val Row(min_trx : Double, max_trx : Double) = df.select(min('trx), max('trx)).head
Then we compute the ranges of the histogram locally. Note that I use the same ranges for all the columns; this makes it easy to compare the results between the columns (by plotting them on the same figure). Having different ranges per column would only be a small modification of this code, though.
val hist_size = 10
val hist_step = (max_trx - min_trx) / hist_size
val hist_ranges = (1 until hist_size)
  .scanLeft(min_trx)((a, _) => a + hist_step) :+ max_trx
// I add max_trx manually to avoid rounding errors that would exclude the value
That was the first part. Then, we can use a UDF to determine in what range each value ends up, and compute all the histograms in parallel with spark.
val range_index = udf((x : Double) => hist_ranges.lastIndexWhere(x >= _))
val hist_df = df
  .withColumn("rangeIndex", range_index('trx))
  .groupBy("M1", "rangeIndex")
  .count()
// And voilà, all the data you need is there.
hist_df.show()
+---+----------+-----+
| M1|rangeIndex|count|
+---+----------+-----+
| M2| 2| 2|
| M1| 0| 2|
| M2| 5| 1|
| M1| 3| 2|
| M2| 3| 1|
| M1| 7| 1|
| M2| 10| 1|
+---+----------+-----+
As a bonus, you can shape the data to use it locally (within the driver), either using the RDD API or by collecting the dataframe and modifying it in Scala.
Here is one way to do it with Spark, since this is a question about Spark ;-)
val hist_map = hist_df.rdd
  .map(row => row.getAs[String]("M1") ->
    (row.getAs[Int]("rangeIndex"), row.getAs[Long]("count")))
  .groupByKey
  .mapValues(_.toMap)
  .mapValues(hists => (0 to hist_size)  // rangeIndex runs from 0 to hist_size (the exact max lands in the last slot)
    .map(i => hists.getOrElse(i, 0L)).toArray)
  .collectAsMap
EDIT: how to build one range per column value:
Instead of computing the min and max of M1, we compute it for each value of the column with groupBy.
val min_max_map = df.groupBy("M1")
.agg(min('trx), max('trx))
.rdd.map(row => row.getAs[String]("M1") ->
(row.getAs[Double]("min(trx)"), row.getAs[Double]("max(trx)")))
.collectAsMap // maps each column value to a tuple (min, max)
Then we adapt the UDF so that it uses this map and we are done.
// for clarity, let's define a function that generates histogram ranges
def generate_ranges(min_trx : Double, max_trx : Double, hist_size : Int) = {
  val hist_step = (max_trx - min_trx) / hist_size
  (1 until hist_size).scanLeft(min_trx)((a, _) => a + hist_step) :+ max_trx
}
// and use it to generate one range per column value
val range_map = min_max_map.keys
  .map(key => key ->
    generate_ranges(min_max_map(key)._1, min_max_map(key)._2, hist_size))
  .toMap
val range_index = udf((x : Double, m1 : String) =>
  range_map(m1).lastIndexWhere(x >= _))
Finally, just replace range_index('trx) by range_index('trx, 'M1) and you will have one range per column value.
The way I do histograms with Spark is as follows:
val binEdges = 0.0 to 25.0 by 5.0
val bins = binEdges.init.zip(binEdges.tail).toDF("bin_from","bin_to")
df
  .join(bins, $"trx" >= $"bin_from" and $"trx" < $"bin_to", "right")
  .groupBy($"bin_from", $"bin_to")
  .agg(
    count($"trx").as("count")
    // add more, e.g. sum($"trx")
  )
  .orderBy($"bin_from", $"bin_to")
  .show()
gives:
+--------+------+-----+
|bin_from|bin_to|count|
+--------+------+-----+
| 0.0| 5.0| 2|
| 5.0| 10.0| 2|
| 10.0| 15.0| 4|
| 15.0| 20.0| 0|
| 20.0| 25.0| 1|
+--------+------+-----+
Now if you have more dimensions, just add them to the groupBy clause:
df
  .join(bins, $"trx" >= $"bin_from" and $"trx" < $"bin_to", "right")
  .groupBy($"M1", $"bin_from", $"bin_to")
  .agg(
    count($"trx").as("count")
  )
  .orderBy($"M1", $"bin_from", $"bin_to")
  .show()
gives:
+----+--------+------+-----+
| M1|bin_from|bin_to|count|
+----+--------+------+-----+
|null| 15.0| 20.0| 0|
| M1| 0.0| 5.0| 2|
| M1| 10.0| 15.0| 2|
| M1| 20.0| 25.0| 1|
| M2| 5.0| 10.0| 2|
| M2| 10.0| 15.0| 2|
+----+--------+------+-----+
You may tweak the code a bit to get the output you want, but this should get you started. You could also do the UDAF approach I posted here: Spark custom aggregation : collect_list+UDF vs UDAF
I think it's not easily possible using RDDs, because histogram is only available on DoubleRDD, i.e. RDDs of Double. If you really need to use the RDD API, you can do it in parallel by firing parallel jobs; this can be done using Scala's parallel collections:
import scala.collection.parallel.immutable.ParSeq
val List((rangeM1, histM1), (rangeM2, histM2)) = ParSeq("M1", "M2")
  .map(c => df.where($"M1" === c)
    .select(col("trx"))
    .rdd.map(r => r.getDouble(0))
    .histogram(10)
  ).toList
println(rangeM1.toSeq,histM1.toSeq)
println(rangeM2.toSeq,histM2.toSeq)
gives:
(WrappedArray(3.4, 5.155, 6.91, 8.665000000000001, 10.42, 12.175, 13.930000000000001, 15.685, 17.44, 19.195, 20.95),WrappedArray(2, 0, 0, 0, 2, 0, 0, 0, 0, 1))
(WrappedArray(9.65, 11.24, 12.83, 14.420000000000002, 16.01, 17.6, 19.19, 20.78, 22.37, 23.96, 25.55),WrappedArray(2, 1, 0, 1, 0, 0, 0, 0, 0, 1))
Note that the bins differ here for M1 and M2

How to select the dynamic columns during a spark join?

I am trying to join two data frames; from the first DF I need to pass a dynamic number of columns and join it with the other DF. The complexity I am facing is that I also have a case statement involving the first DF. I am able to get the desired output by creating a temp view, but I am not able to achieve the same output through the DataFrame API.
Below is the snippet I have tried, which works as expected.
// Sample DF1
val studentDF = Seq(
  (1, "Peter", "M", 15, "Tution Received"),
  (2, "Merry", "F", 14, null),
  (3, "Sam", "M", 16, "Tution Received"),
  (4, "Kat", "O", 16, null),
  (5, "Keivn", "M", 18, null)
).toDF("Enrollment", "Name", "Gender", "Age", "Notes")
//Sample DF2
val studentFees = Seq((1,"$500","Deposit"),(2, "$800","Deposit"),(3,"$200","Deposit"),(4,"$100","Deposit")).toDF("Enrollment","Fees","Notes")
studentDF.createOrReplaceTempView("STUDENT")
studentFees.createOrReplaceTempView("FEES")
val displayColumns = List("Enrollment","Name","Gender").map("a."+_).reduce(_+","+_)
val queryStr = spark.sql(s"select $displayColumns, case when a.Notes is null then b.Notes else a.Notes end as Notes, b.Fees from STUDENT a join FEES b on a.Enrollment=b.Enrollment")
queryStr.show()
+----------+-----+------+---------------+----+
|Enrollment| Name|Gender| Notes|Fees|
+----------+-----+------+---------------+----+
| 1|Peter| M|Tution Received|$500|
| 2|Merry| F| Deposit|$800|
| 3| Sam| M|Tution Received|$200|
| 4| Kat| O| Deposit|$100|
+----------+-----+------+---------------+----+
// Below is not giving the desired output
val displayColumns = List("Enrollment","Name","Gender","Notes")
val queryStr = studentDF.select(displayColumns.head, displayColumns.tail: _*).alias("a")
  .join(studentFees.as("b"), Seq("Enrollment"), "inner")
  .withColumn("Notes", when($"a.Notes".isNull, $"b.Notes").otherwise($"a.Notes"))
queryStr.show()
+----------+-----+------+---------------+----+---------------+
|Enrollment| Name|Gender|          Notes|Fees|          Notes|
+----------+-----+------+---------------+----+---------------+
| 1|Peter| M|Tution Received|$500|Tution Received|
| 2|Merry| F| Deposit|$800| Deposit|
| 3| Sam| M|Tution Received|$200|Tution Received|
| 4| Kat| O| Deposit|$100| Deposit|
+----------+-----+------+---------------+----+---------------+
// Expecting the output like below.
+----------+-----+------+---------------+----+
|Enrollment| Name|Gender| Notes|Fees|
+----------+-----+------+---------------+----+
| 1|Peter| M|Tution Received|$500|
| 2|Merry| F| Deposit|$800|
| 3| Sam| M|Tution Received|$200|
| 4| Kat| O| Deposit|$100|
+----------+-----+------+---------------+----+
Is there a better way to handle such scenarios instead of creating temp tables/views?
Thank you to everyone who read my post! I was able to find a solution to my problem.
val displayColumns = List("Enrollment","Name","Gender","Notes")
val queryStr = studentDF.select(displayColumns.head, displayColumns.tail: _*).alias("a")
  .join(studentFees.as("b"), Seq("Enrollment"), "inner")
  .select($"a.*", when($"a.Notes".isNull, $"b.Notes").otherwise($"a.Notes").as("Notes"), $"b.Fees")
  .drop($"a.Notes")

Map a multimap to columns of dataframe

Simply, I want to convert a multimap like this:
val input = Map("rownum"-> List("1", "2", "3") , "plant"-> List( "Melfi", "Pomigliano", "Torino" ), "tipo"-> List("gomme", "telaio")).toArray
into the following Spark dataframe:
+-------+--------------+-------+
|rownum | plant | tipo |
+------ +--------------+-------+
| 1 | Melfi | gomme |
| 2 | Pomigliano | telaio|
| 3 | Torino | null |
+-------+--------------+-------+
replacing missing values with "null" values. My issue is applying a map function to the RDD:
val inputRdd = sc.parallelize(input)
inputRdd.map(..).toDF()
Any suggestions? Thanks in advance
Although (see my comments) I'm really not sure the multimap format is well suited to your problem (did you have a look at the Spark XML parsing modules?).
The pivot table solution
The idea is to flatten your input table into an (elementPosition, columnName, columnValue) format:
// The max size of the multimap lists
val numberOfRows = input.map(_._2.size).max
// For each index in the list, emit a tuple of (index, multimap key, multimap value at index)
val flatRows = (0 until numberOfRows).flatMap(rowIdx => input.map({ case (colName, allColValues) => (rowIdx, colName, if(allColValues.size > rowIdx) allColValues(rowIdx) else null)}))
// Probably faster at runtime to write it this way (less iterations) :
// val flatRows = input.flatMap({ case (colName, existingValues) => (0 until numberOfRows).zipAll(existingValues, null, null).map(t => (t._1.asInstanceOf[Int], colName, t._2)) })
// To dataframe
val flatDF = sc.parallelize(flatRows).toDF("elementIndex", "colName", "colValue")
flatDF.show
Will output:
+------------+-------+----------+
|elementIndex|colName| colValue|
+------------+-------+----------+
| 0| rownum| 1|
| 0| plant| Melfi|
| 0| tipo| gomme|
| 1| rownum| 2|
| 1| plant|Pomigliano|
| 1| tipo| telaio|
| 2| rownum| 3|
| 2| plant| Torino|
| 2| tipo| null|
+------------+-------+----------+
Now this is a pivot table problem:
flatDF.groupBy("elementIndex").pivot("colName").agg(expr("first(colValue)")).drop("elementIndex").show
+----------+------+------+
| plant|rownum| tipo|
+----------+------+------+
|Pomigliano| 2|telaio|
| Torino| 3| null|
| Melfi| 1| gomme|
+----------+------+------+
This might not be the best looking solution, but it is fully scalable to any number of columns.

Randomly join two dataframes

I have two tables, one called Reasons that has 9 records and another containing IDs with 40k records.
IDs:
+------+------+
|pc_pid|pc_aid|
+------+------+
| 4569| 1101|
| 63961| 1101|
|140677| 4364|
|127113| 7|
| 96097| 480|
| 8309| 3129|
| 45218| 89|
|147036| 3289|
| 88493| 3669|
| 29973| 3129|
|127444| 3129|
| 36095| 89|
|131001| 1634|
|104731| 781|
| 79219| 244|
+------+------+
Reasons:
+-----------------+
| reasons|
+-----------------+
| follow up|
| skin chk|
| annual meet|
|review lab result|
| REF BY DR|
| sick visit|
| body pain|
| test|
| other|
+-----------------+
I want output like this
+------+------+-----------------+
|pc_pid|pc_aid|           reason|
+------+------+-----------------+
|  4569|  1101|        body pain|
| 63961|  1101|review lab result|
|140677|  4364|        body pain|
|127113|     7|       sick visit|
| 96097|   480|             test|
|  8309|  3129|            other|
| 45218|    89|        follow up|
|147036|  3289|      annual meet|
| 88493|  3669|review lab result|
| 29973|  3129|        REF BY DR|
|127444|  3129|         skin chk|
| 36095|    89|            other|
+------+------+-----------------+
In Reasons I have only 9 records and in the IDs dataframe I have 40k records; I want to assign a reason randomly to each and every ID.
The following solution tries to be more robust to the number of reasons (i.e. you can have as many reasons as you can reasonably fit in your cluster). If you just have a few reasons (like the OP asks), you can probably broadcast them or embed them in a udf and easily solve this problem.
The general idea is to create a sequential index for the reasons, add random values from 0 to N (where N is the number of reasons) to the IDs dataset, and then join the two tables using these two new columns. Here is how you can do this:
case class Reasons(s: String)
defined class Reasons
case class Data(id: Long)
defined class Data
Data will hold the IDs (simplified version of the OP) and Reasons will hold some simplified reasons.
val d1 = spark.createDataFrame( Data(1) :: Data(2) :: Data(10) :: Nil)
d1: org.apache.spark.sql.DataFrame = [id: bigint]
d1.show()
+---+
| id|
+---+
| 1|
| 2|
| 10|
+---+
val d2 = spark.createDataFrame( Reasons("a") :: Reasons("b") :: Reasons("c") :: Nil)
d2.show()
+---+
| s|
+---+
| a|
| b|
| c|
+---+
We will later need the number of reasons so we calculate that first.
val numberOfReasons = d2.count()
val d2Indexed = spark.createDataFrame(d2.rdd.map(_.getString(0)).zipWithIndex)
d2Indexed.show()
+---+---+
| _1| _2|
+---+---+
| a| 0|
| b| 1|
| c| 2|
+---+---+
val d1WithRand = d1.select($"id", (rand * numberOfReasons).cast("int").as("rnd"))
The last step is to join on the new columns and then remove them.
val res = d1WithRand.join(d2Indexed, d1WithRand("rnd") === d2Indexed("_2")).drop("_2").drop("rnd")
res.show()
+---+---+
| id| _1|
+---+---+
| 2| a|
| 10| b|
| 1| c|
+---+---+
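For reference, a PySpark sketch of the same idea (hypothetical DataFrames ids_df and reasons_df standing in for the IDs and Reasons tables; zipWithIndex is cheap here because there are only a handful of reasons):
from pyspark.sql import functions as F

# index the reasons (tiny table, so going through the RDD API is cheap here)
reasons_indexed = reasons_df.rdd.map(lambda r: r[0]).zipWithIndex().toDF(["reason", "idx"])
num_reasons = reasons_indexed.count()

# attach a random index to every ID, join on it, then drop the helper column
ids_with_rnd = ids_df.withColumn("idx", (F.rand() * num_reasons).cast("int"))
result = ids_with_rnd.join(reasons_indexed, "idx", "left").drop("idx")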
pyspark random join itself
from random import uniform  # assuming the standard-library uniform here

# shuffle data_pos by sorting on a random key, then zip the shuffled RDD with the original
data_neg = data_pos.sortBy(lambda x: uniform(1, 10000))
data_neg = data_neg.coalesce(1, False).zip(data_pos.coalesce(1, True))
The fastest way to randomly join dataA (huge dataframe) and dataB (smaller dataframe, sorted by any column):
from pyspark.sql import Window
from pyspark.sql import functions as F

dfB = dataB.withColumn(
    "index", F.row_number().over(Window.orderBy("col")) - 1
)
dfA = dataA.withColumn("index", (F.rand() * dfB.count()).cast("bigint"))
df = dfA.join(dfB, on="index", how="left").drop("index")
Since dataB is already sorted, row numbers can be assigned over the sorted window with a high degree of parallelism. F.rand() is another highly parallel function, so adding the index to dataA will be very fast as well.
If dataB is small enough, you may benefit from broadcasting it.
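A minimal sketch of that broadcast variant (same dfA/dfB as above; broadcast is just a hint to the optimizer):
from pyspark.sql import functions as F

# broadcast the smaller, indexed dataB so the join avoids shuffling the huge dataA
df = dfA.join(F.broadcast(dfB), on="index", how="left").drop("index")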
This method is better than using:
zipWithIndex: it can be very expensive to convert a dataframe to an RDD, zipWithIndex it, and then convert back to a dataframe.
monotonically_increasing_id: it needs to be used with row_number, which will collect all the partitions into a single executor.
Reference: https://towardsdatascience.com/adding-sequential-ids-to-a-spark-dataframe-fa0df5566ff6