Pyspark forward and backward fill within column level - pyspark

I try to fill missing data in a pyspark dataframe. The pyspark dataframe looks as such:
+---------+---------+-------------------+----+
| latitude|longitude| timestamplast|name|
+---------+---------+-------------------+----+
| | 4.905615|2019-08-01 00:00:00| 1|
|51.819645| |2019-08-01 00:00:00| 1|
| 51.81964| 4.961713|2019-08-01 00:00:00| 2|
| | |2019-08-01 00:00:00| 3|
| 51.82918| 4.911187| | 3|
| 51.82385| 4.901488|2019-08-01 00:00:03| 5|
+---------+---------+-------------------+----+
Within the column "name" I want to either forward fill or backward fill (whichever is necessary) to fill only "latitude" and "longitude" ("timestamplast" should not be filled). How do I do this?
Output will be:
+---------+---------+-------------------+----+
| latitude|longitude| timestamplast|name|
+---------+---------+-------------------+----+
|51.819645| 4.905615|2019-08-01 00:00:00| 1|
|51.819645| 4.905615|2019-08-01 00:00:00| 1|
| 51.81964| 4.961713|2019-08-01 00:00:00| 2|
| 51.82918| 4.911187|2019-08-01 00:00:00| 3|
| 51.82918| 4.911187| | 3|
| 51.82385| 4.901488|2019-08-01 00:00:03| 5|
+---------+---------+-------------------+----+
In Pandas this would be done as such:
df = df.groupby("name")['longitude','latitude'].apply(lambda x : x.ffill().bfill())
How would this be done in Pyspark?

I suggest you use the following two Window Specs:
from pyspark.sql import Window
w1 = Window.partitionBy('name').orderBy('timestamplast')
w2 = w1.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
Where:
w1 is the regular WinSpec we use to calculate the forward-fill which is the same as the following:
w1 = Window.partitionBy('name').orderBy('timestamplast').rowsBetween(Window.unboundedPreceding,0)
see the following note from the documentation for default window frames:
Note: When ordering is not defined, an unbounded window frame (rowFrame, unboundedPreceding, unboundedFollowing) is used by default. When ordering is defined, a growing window frame (rangeFrame, unboundedPreceding, currentRow) is used by default.
after ffill, we only need to fix the null values at the very front if exists, so we can use a fixed Window frame(Between Window.unboundedPreceding and Window.unboundedFollowing), this is more efficient than using a running Window frame since it requires only one aggregate, see SPARK-8638
Then the x.ffill().bfill() can be handled by using coalesce + last + first based on the above two WindowSpecs:
from pyspark.sql.functions import coalesce, last, first
df.withColumn('latitude_new', coalesce(last('latitude',True).over(w1), first('latitude',True).over(w2))) \
.select('name','timestamplast', 'latitude','latitude_new') \
.show()
+----+-------------------+---------+------------+
|name| timestamplast| latitude|latitude_new|
+----+-------------------+---------+------------+
| 1|2019-08-01 00:00:00| null| 51.819645|
| 1|2019-08-01 00:00:01| null| 51.819645|
| 1|2019-08-01 00:00:02|51.819645| 51.819645|
| 1|2019-08-01 00:00:03| 51.81964| 51.81964|
| 1|2019-08-01 00:00:04| null| 51.81964|
| 1|2019-08-01 00:00:05| null| 51.81964|
| 1|2019-08-01 00:00:06| null| 51.81964|
| 1|2019-08-01 00:00:07| 51.82385| 51.82385|
+----+-------------------+---------+------------+
Edit: to process (ffill+bfill) on multiple columns, use a list comprehension:
cols = ['latitude', 'longitude']
df_new = df.select([ c for c in df.columns if c not in cols ] + [ coalesce(last(c,True).over(w1), first(c,True).over(w2)).alias(c) for c in cols ])

I got a working solution for either forward or backward fill of one target name "longitude". I guess I could repeat the procedure for also "latitude" and then again for backward fill. Is there a more efficient way?
window = Window.partitionBy('name')\
.orderBy('timestamplast')\
.rowsBetween(-sys.maxsize, 0) # this is for forward fill
# .rowsBetween(0,sys.maxsize) # this is for backward fill
# define the forward-filled column
filled_column = last(df['longitude'], ignorenulls=True).over(window) # this is for forward fill
# filled_column = first(df['longitude'], ignorenulls=True).over(window) # this is for backward fill
df = df.withColumn('mmsi_filled', filled_column) # do the fill

Related

Reducing Multiple Actions/Filter optimization with Transformations in Pyspark

I have a master dataset (df) out of which I am trying to create averages based on certain filters.
F1 = df.filter((df.Identifier1>0)).groupBy().avg('Amount')
F2 = df.filter((df.Identifier1>2)).groupBy().avg('Amount')
F3 = df.filter((df.Identifier2<2)).groupBy().avg('Amount')
F4 = df.filter((df.Identifier2<4)).groupBy().avg('Amount')
#Alternatively also tried another way for avg calculation,
F1 = df.filter((df.Identifier1>0)).agg(avg(col('Amount')))
..
Post Calculating these averages, I am trying to assign them to the records in the master df into two columns A1 and A2 using the same filters used in average calculation.
df = df.withColumn("A1", when((col("Identifier1") > 0)), (F1.collect()[0][0]))
….
….
.otherwise(avg(col('Amount')))
df = df.withColumn("A2", when((col("Identifier2") <2 )), (F3.collect()[0][0]))
….
….
.otherwise(avg(col('Amount')))
I am facing two issues:
When one of the averages is Null then I get an error while calling collect() or first()
Error:
Unsupported literal type class java.util.ArrayList [null]
As there are multiple actions involved the process takes over 2hrs.
Any help on the above is welcome.
Make a column for your filter conditions, e.g.
+---+--------+-----------+-----------+------+
| ID|Category|Identifier1|Identifier2|Amount|
+---+--------+-----------+-----------+------+
| 12| A| 2| 1| 100|
| 23| B| 7| 8| 500|
| 34| C| 1| 4| 300|
+---+--------+-----------+-----------+------+
df.withColumn('group', when(df.Identifier1 > 0, array(lit(1))).otherwise(array(lit(None)))) \
.withColumn('group', when(df.Identifier1 > 2, array_union(col('group'), array(lit(2)))).otherwise(col('group'))) \
.withColumn('group', when(df.Identifier2 < 2, array_union(col('group'), array(lit(3)))).otherwise(col('group'))) \
.withColumn('group', when(df.Identifier2 < 4, array_union(col('group'), array(lit(4)))).otherwise(col('group'))) \
.withColumn('group', explode('group')) \
.groupBy('group').agg(sum('Amount').alias('sum'), avg('Amount').alias('avg')).show()
+-----+---+-----+
|group|sum| avg|
+-----+---+-----+
| 1|900|300.0|
| 3|100|100.0|
| 4|100|100.0|
| 2|500|500.0|
+-----+---+-----+
and then group by to the group.

How to get the lists' length in one column in dataframe spark?

I have a df whose 'products' column are lists like below:
+----------+---------+--------------------+
|member_srl|click_day| products|
+----------+---------+--------------------+
| 12| 20161223| [2407, 5400021771]|
| 12| 20161226| [7320, 2407]|
| 12| 20170104| [2407]|
| 12| 20170106| [2407]|
| 27| 20170104| [2405, 2407]|
| 28| 20161212| [2407]|
| 28| 20161213| [2407, 100093]|
| 28| 20161215| [1956119]|
| 28| 20161219| [2407, 100093]|
| 28| 20161229| [7905970]|
| 124| 20161011| [5400021771]|
| 6963| 20160101| [103825645]|
| 6963| 20160104|[3000014912, 6626...|
| 6963| 20160111|[99643224, 106032...|
How to add a new column product_cnt which are the length of products list? And how to filter df to get specified rows with condition of given products length ?
Thanks.
Pyspark has a built-in function to achieve exactly what you want called size. http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.size .
To add it as column, you can simply call it during your select statement.
from pyspark.sql.functions import size
countdf = df.select('*',size('products').alias('product_cnt'))
Filtering works exactly as #titiro89 described. Furthermore, you can use the size function in the filter. This will allow you to bypass adding the extra column (if you wish to do so) in the following way.
filterdf = df.filter(size('products')==given_products_length)
First question:
How to add a new column product_cnt which are the length of products list?
>>> a = [(12,20161223, [2407,5400021771]),(12,20161226,[7320,2407])]
>>> df = spark.createDataFrame(a,
["member_srl","click_day","products"])
>>> df.show()
+----------+---------+------------------+
|member_srl|click_day| products|
+----------+---------+------------------+
| 12| 20161223|[2407, 5400021771]|
| 12| 20161226|[7320, 2407, 4344]|
+----------+---------+------------------+
You can find a similar example here
>>> from pyspark.sql.types import IntegerType
>>> from pyspark.sql.functions import udf
>>> slen = udf(lambda s: len(s), IntegerType())
>>> df2 = df.withColumn("product_cnt", slen(df.products))
>>> df2.show()
+----------+---------+------------------+-----------+
|member_srl|click_day| products|product_cnt|
+----------+---------+------------------+-----------+
| 12| 20161223|[2407, 5400021771]| 2|
| 12| 20161226|[7320, 2407, 4344]| 3|
+----------+---------+------------------+-----------+
Second question:
And how to filter df to get specified rows with condition of given products length ?
You can use filter function docs here
>>> givenLength = 2
>>> df3 = df2.filter(df2.product_cnt==givenLength)
>>> df3.show()
+----------+---------+------------------+-----------+
|member_srl|click_day| products|product_cnt|
+----------+---------+------------------+-----------+
| 12| 20161223|[2407, 5400021771]| 2|
+----------+---------+------------------+-----------+

Randomly join two dataframes

I have two tables, one called Reasons that has 9 records and another containing IDs with 40k records.
IDs:
+------+------+
|pc_pid|pc_aid|
+------+------+
| 4569| 1101|
| 63961| 1101|
|140677| 4364|
|127113| 7|
| 96097| 480|
| 8309| 3129|
| 45218| 89|
|147036| 3289|
| 88493| 3669|
| 29973| 3129|
|127444| 3129|
| 36095| 89|
|131001| 1634|
|104731| 781|
| 79219| 244|
+-------------+
Reasons:
+-----------------+
| reasons|
+-----------------+
| follow up|
| skin chk|
| annual meet|
|review lab result|
| REF BY DR|
| sick visit|
| body pain|
| test|
| other|
+-----------------+
I want output like this
|pc_pid|pc_aid| reason
+------+------+-------------------
| 4569| 1101| body pain
| 63961| 1101| review lab result
|140677| 4364| body pain
|127113| 7| sick visit
| 96097| 480| test
| 8309| 3129| other
| 45218| 89| follow up
|147036| 3289| annual meet
| 88493| 3669| review lab result
| 29973| 3129| REF BY DR
|127444| 3129| skin chk
| 36095| 89| other
In the reasons I have only 9 records and in the ID dataframe I have 40k records, I want to assign reason randomly to each and every id.
The following solution tries to be more robust to the number of reasons (ie. you can have as many reasons as you can reasonably fit in your cluster). If you just have few reasons (like the OP asks), you can probably broadcast them or embed them in a udf and easily solve this problem.
The general idea is to create an index (sequential) for the reasons and then random values from 0 to N (where N is the number of reasons) on the IDs dataset and then join the two tables using these two new columns. Here is how you can do this:
case class Reasons(s: String)
defined class Reasons
case class Data(id: Long)
defined class Data
Data will hold the IDs (simplified version of the OP) and Reasons will hold some simplified reasons.
val d1 = spark.createDataFrame( Data(1) :: Data(2) :: Data(10) :: Nil)
d1: org.apache.spark.sql.DataFrame = [id: bigint]
d1.show()
+---+
| id|
+---+
| 1|
| 2|
| 10|
+---+
val d2 = spark.createDataFrame( Reasons("a") :: Reasons("b") :: Reasons("c") :: Nil)
+---+
| s|
+---+
| a|
| b|
| c|
+---+
We will later need the number of reasons so we calculate that first.
val numerOfReasons = d2.count()
val d2Indexed = spark.createDataFrame(d2.rdd.map(_.getString(0)).zipWithIndex)
d2Indexed.show()
+---+---+
| _1| _2|
+---+---+
| a| 0|
| b| 1|
| c| 2|
+---+---+
val d1WithRand = d1.select($"id", (rand * numerOfReasons).cast("int").as("rnd"))
The last step is to join on the new columns and the remove them.
val res = d1WithRand.join(d2Indexed, d1WithRand("rnd") === d2Indexed("_2")).drop("_2").drop("rnd")
res.show()
+---+---+
| id| _1|
+---+---+
| 2| a|
| 10| b|
| 1| c|
+---+---+
pyspark random join itself
data_neg = data_pos.sortBy(lambda x: uniform(1, 10000))
data_neg = data_neg.coalesce(1, False).zip(data_pos.coalesce(1, True))
The fastest way to randomly join dataA (huge dataframe) and dataB (smaller dataframe, sorted by any column):
dfB = dataB.withColumn(
"index", F.row_number().over(Window.orderBy("col")) - 1
)
dfA = dataA.withColumn("index", (F.rand() * dfB.count()).cast("bigint"))
df = dfA.join(dfB, on="index", how="left").drop("index")
Since dataB is already sorted, row numbers can be assigned over sorted window with high degree of parallelism. F.rand() is another highly parallel function, so adding index to dataA will be very fast as well.
If dataB is small enough, you may benefit from broadcasting it.
This method is better than using:
zipWithIndex: Can be very expensive to convert dataframe to rdd, zipWithIndex, and then to df.
monotonically_increasing_id: Need to be used with row_number which will collect all the partitions into a single executor.
Reference: https://towardsdatascience.com/adding-sequential-ids-to-a-spark-dataframe-fa0df5566ff6

PySpark DataFrame Manipulation Efficiency

Suppose I have the following data frame :
+----------+-----+----+-------+
|display_id|ad_id|prob|clicked|
+----------+-----+----+-------+
| 123| 989| 0.9| 0|
| 123| 990| 0.8| 1|
| 123| 999| 0.7| 0|
| 234| 789| 0.9| 0|
| 234| 777| 0.7| 0|
| 234| 769| 0.6| 1|
| 234| 798| 0.5| 0|
+----------+-----+----+-------+
I then perform the following operations to get a final data set (shown below the code) :
# Add a new column with the clicked ad_id if clicked == 1, 0 otherwise
df_adClicked = df.withColumn("ad_id_clicked", when(df.clicked==1, df.ad_id).otherwise(0))
# DF -> RDD with tuple : (display_id, (ad_id, prob), clicked)
df_blah = df_adClicked.rdd.map(lambda x : (x[0],(x[1],x[2]),x[4])).toDF(["display_id", "ad_id","clicked_ad_id"])
# Group by display_id and create column with clicked ad_id and list of tuples : (ad_id, prob)
df_blah2 = df_blah.groupby('display_id').agg(F.collect_list('ad_id'), F.max('clicked_ad_id'))
# Define function to sort list of tuples by prob and create list of only ad_ids
def sortByRank(ad_id_list):
sortedVersion = sorted(ad_id_list, key=itemgetter(1), reverse=True)
sortedIds = [i[0] for i in sortedVersion]
return(sortedIds)
# Sort the (ad_id, prob) tuples by using udf/function and create new column ad_id_sorted
sort_ad_id = udf(lambda x : sortByRank(x), ArrayType(IntegerType()))
df_blah3 = df_blah2.withColumn('ad_id_sorted', sort_ad_id('collect_list(ad_id)'))
# Function to change clickedAdId into an array of size 1
def createClickedSet(clickedAdId):
setOfDocs = [clickedAdId]
return setOfDocs
clicked_set = udf(lambda y : createClickedSet(y), ArrayType(IntegerType()))
df_blah4 = df_blah3.withColumn('ad_id_set', clicked_set('max(clicked_ad_id)'))
# Select the necessary columns
finalDF = df_blah4.select('display_id', 'ad_id_sorted','ad_id_set')
+----------+--------------------+---------+
|display_id|ad_id_sorted |ad_id_set|
+----------+--------------------+---------+
|234 |[789, 777, 769, 798]|[769] |
|123 |[989, 990, 999] |[990] |
+----------+--------------------+---------+
Is there a more efficient way of doing this? Doing this set of transformations in the way that I am seems to be the bottle neck in my code. I would greatly appreciate any feedback.
I haven't done any timing comparisons, but I would think that by not using any UDFs Spark should be able to optimally optimize itself.
#scala: val dfad = sc.parallelize(Seq((123,989,0.9,0),(123,990,0.8,1),(123,999,0.7,0),(234,789,0.9,0),(234,777,0.7,0),(234,769,0.6,1),(234,798,0.5,0))).toDF("display_id","ad_id","prob","clicked")
#^^^that's^^^ the only difference (besides putting val in front of variables) between this python response and a Scala one
dfad = sc.parallelize(((123,989,0.9,0),(123,990,0.8,1),(123,999,0.7,0),(234,789,0.9,0),(234,777,0.7,0),(234,769,0.6,1),(234,798,0.5,0))).toDF(["display_id","ad_id","prob","clicked"])
dfad.registerTempTable("df_ad")
df1 = sqlContext.sql("SELECT display_id,collect_list(ad_id) ad_id_sorted FROM (SELECT * FROM df_ad SORT BY display_id,prob DESC) x GROUP BY display_id")
+----------+--------------------+
|display_id| ad_id_sorted|
+----------+--------------------+
| 234|[789, 777, 769, 798]|
| 123| [989, 990, 999]|
+----------+--------------------+
df2 = sqlContext.sql("SELECT display_id, max(ad_id) as ad_id_set from df_ad where clicked=1 group by display_id")
+----------+---------+
|display_id|ad_id_set|
+----------+---------+
| 234| 769|
| 123| 990|
+----------+---------+
final_df = df1.join(df2,"display_id")
+----------+--------------------+---------+
|display_id| ad_id_sorted|ad_id_set|
+----------+--------------------+---------+
| 234|[789, 777, 769, 798]| 769|
| 123| [989, 990, 999]| 990|
+----------+--------------------+---------+
I didn't put the ad_id_set into an Array because you were calculating the max and max should only return 1 value. I'm sure if you really need it in an array you can make that happen.
I included the subtle Scala difference if a future someone using Scala has a similar problem.

Spark Dataframe sliding window over pair of rows

I have an eventlog in csv consisting of three columns timestamp, eventId and userId.
What I would like to do is append a new column nextEventId to the dataframe.
An example eventlog:
eventlog = sqlContext.createDataFrame(Array((20160101, 1, 0),(20160102,3,1),(20160201,4,1),(20160202, 2,0))).toDF("timestamp", "eventId", "userId")
eventlog.show(4)
|timestamp|eventId|userId|
+---------+-------+------+
| 20160101| 1| 0|
| 20160102| 3| 1|
| 20160201| 4| 1|
| 20160202| 2| 0|
+---------+-------+------+
The desired endresult would be:
|timestamp|eventId|userId|nextEventId|
+---------+-------+------+-----------+
| 20160101| 1| 0| 2|
| 20160102| 3| 1| 4|
| 20160201| 4| 1| Nil|
| 20160202| 2| 0| Nil|
+---------+-------+------+-----------+
So far I've been messing around with sliding windows but can't figure out how to compare 2 rows...
val w = Window.partitionBy("userId").orderBy(asc("timestamp")) //should be a sliding window over 2 rows...
val nextNodes = second($"eventId").over(w) //should work if there are only 2 rows
What you're looking for is lead (or lag). Using window you already defined:
import org.apache.spark.sql.functions.lead
eventlog.withColumn("nextEventId", lead("eventId", 1).over(w))
For true sliding window (like sliding average) you can use rowsBetween or rangeBetween clauses of the window definition but it is not really required here. Nevertheless example usage could be something like this:
val w2 = Window.partitionBy("userId")
.orderBy(asc("timestamp"))
.rowsBetween(-1, 0)
avg($"foo").over(w2)