PySpark get value of dataframe column with max date - pyspark

I need to create a new column in a pyspark dataframe using a column value from the row of the max date over a window. Given the dataframe below, I need to set a new column called max_adj_factor on each record for each assetId based on the adjustment factor of the most recent date.
+----------------+-------+----------+-----+
|adjustmentFactor|assetId|      date|  nav|
+----------------+-------+----------+-----+
|9.96288362069999|4000123|2019-12-20| 18.5|
|9.96288362069999|4000123|2019-12-23|18.67|
|9.96288362069999|4000123|2019-12-24| 18.6|
|9.96288362069999|4000123|2019-12-26|18.57|
|10.0449181987999|4000123|2019-12-27|18.46|
|10.0449181987999|4000123|2019-12-30|18.41|
|10.0449181987999|4000123|2019-12-31|18.34|
|10.0449181987999|4000123|2020-01-02|18.77|
|10.0449181987999|4000123|2020-01-03|19.07|
|10.0449181987999|4000123|2020-01-06|19.16|
|10.0449181987999|4000123|2020-01-07| 19.2|
+----------------+-------+----------+-----+

You can use max_by over a Window:
from pyspark.sql import functions as F, Window

df.withColumn("max_adj_factor",
              F.expr("max_by(adjustmentFactor, date)")
               .over(Window.partitionBy("assetId"))) \
  .show()
Output:
+----------------+-------+----------+-----+----------------+
|adjustmentFactor|assetId|      date|  nav|  max_adj_factor|
+----------------+-------+----------+-----+----------------+
|9.96288362069999|4000123|2019-12-20| 18.5|10.0449181987999|
|9.96288362069999|4000123|2019-12-23|18.67|10.0449181987999|
|9.96288362069999|4000123|2019-12-24| 18.6|10.0449181987999|
|9.96288362069999|4000123|2019-12-26|18.57|10.0449181987999|
|10.0449181987999|4000123|2019-12-27|18.46|10.0449181987999|
|10.0449181987999|4000123|2019-12-30|18.41|10.0449181987999|
|10.0449181987999|4000123|2019-12-31|18.34|10.0449181987999|
|10.0449181987999|4000123|2020-01-02|18.77|10.0449181987999|
|10.0449181987999|4000123|2020-01-03|19.07|10.0449181987999|
|10.0449181987999|4000123|2020-01-06|19.16|10.0449181987999|
|10.0449181987999|4000123|2020-01-07| 19.2|10.0449181987999|
+----------------+-------+----------+-----+----------------+
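If max_by isn't available in your Spark version, a sketch of the same idea uses last over a date-ordered window with an unbounded frame:
from pyspark.sql import functions as F, Window

# Frame spans the whole partition, so every row sees the adjustmentFactor of the latest date
w = (Window.partitionBy("assetId")
           .orderBy("date")
           .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing))

df.withColumn("max_adj_factor", F.last("adjustmentFactor").over(w)).show()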

Related

Increase the date based on the length of Name column in PySpark dataframe

I'm trying to add new columns based on the input Name and Date columns as below:
Input:
+------+-----------+
|Name  |Date       |
+------+-----------+
|PETER |1986-May-29|
+------+-----------+
Expected Output:
+---------+-----------+
|Character|   New_Date|
+---------+-----------+
|        P|1986-May-29|
|        E|1986-May-30|
|        T|1986-May-31|
|        E|1986-Jun-01|
|        R|1986-Jun-02|
+---------+-----------+
df_withchars = df.withColumn("Character", F.explode(F.split('Name', '')))\
    .filter(F.col('Character') != '')
df_withchars.withColumn('New_Date', (lambda x: F.date_add(x['Date'], 1) for i in range(len(x[0])))).show()
I tried the above code, but it throws NameError: name 'x' is not defined.
You can use split to create an array from the column and then posexplode to explode this array.
posexplode is similar to the explode function, but it returns one additional column - the position/index of each item. That means it will give you the numbers 0 to 4 in your particular example. You can then add this number to the date using the date_add function.
Before we start, let's import the relevant functions:
from pyspark.sql.types import StructField, StructType, DateType, StringType
from pyspark.sql.functions import split, col, date_add, posexplode
from datetime import date
Then create a data frame:
sample_data = [('PETER', date.today())]
sample_schema = StructType([
    StructField('Name', StringType(), True),
    StructField('Date', DateType(), True),
])
df = spark.createDataFrame(data=sample_data, schema=sample_schema)
Final step is to use split and posexplode functions to get the index of each character, and then add the index to date.
df \
    .withColumn('SplittedName', split('Name', "(?!$)")) \
    .select('Name', 'Date', posexplode('SplittedName')) \
    .withColumn('NewDate', date_add('Date', col('pos'))) \
    .show()
Note that in the above code I've used the regex pattern (?!$). You could use the empty string "" as well; however, that returns an additional empty item in the resulting SplittedName array, which would require an extra step just to remove this empty, irrelevant item.
The result:
EDIT:
To get the resulting date in the desired format, you just need to use date_format (don't forget to import the function first) as follows:
df \
    .withColumn('SplittedName', split('Name', "(?!$)")) \
    .select('Name', 'Date', posexplode('SplittedName')) \
    .withColumn('NewDate', date_format(date_add('Date', col('pos')), "yyyy-MMM-dd")) \
    .show()
New result:
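As mentioned above, splitting on the empty string "" instead of (?!$) needs one extra step to drop the empty item. A sketch of that variant (same imports as above; it assumes the empty item lands at the end of the array, so the pos values of the real characters are unaffected):
(df
    .withColumn('SplittedName', split('Name', ''))
    .select('Name', 'Date', posexplode('SplittedName'))
    # drop the empty item produced by splitting on ""
    .filter(col('col') != '')
    .withColumn('NewDate', date_add('Date', col('pos')))
    .show())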

Extract specific string from a column in pyspark dataframe

I have the below pyspark dataframe.
column_a
name,age,pct_physics,country,class
name,age,pct_chem,class
pct_math,class
I have to extract only the part of the string that begins with pct and discard the rest.
Expected output:
column_a
pct_physics
pct_chem
pct_math
How can I achieve this in pyspark?
Use the regexp_extract function.
Example:
from pyspark.sql.functions import regexp_extract, col

df.withColumn("output", regexp_extract(col("column_a"), "(pct_.*?),", 1)).show(10, False)
#+----------------------------------+-----------+
#|column_a                          |output     |
#+----------------------------------+-----------+
#|name,age,pct_physics,country,class|pct_physics|
#|name,age,pct_chem,class           |pct_chem   |
#+----------------------------------+-----------+
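Note that the pattern (pct_.*?), relies on a comma following the pct_ field. If the pct_ field can also be the last one in the string (no trailing comma), a slight variation that captures everything up to the next comma or the end of the string works as well; this is a sketch, not part of the original answer:
from pyspark.sql.functions import regexp_extract, col

# capture "pct_" followed by everything up to (but not including) the next comma
df.withColumn("output", regexp_extract(col("column_a"), "(pct_[^,]+)", 1)).show(10, False)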

pyspark: change string to timestamp

I have a column in String format, and some rows are null.
I append a fixed timestamp to put it into the following form so I can convert it into a timestamp.
Before:
date
null
22-04-2020
After:
date
01-01-1990 23:59:59.000
22-04-2020 23:59:59.000
df = df.withColumn('date', F.concat (df.date, F.lit(" 23:59:59.000")))
df = df.withColumn('date', F.when(F.col('date').isNull(), '01-01-1990 23:59:59.000').otherwise(F.col('date')))
df.withColumn("date", F.to_timestamp(F.col("date"),"MM-dd-yyyy HH mm ss SSS")).show(2)
But after this, the date column becomes null.
Can anyone help me solve this, or convert the string to a timestamp directly?
Your timestamp format should start with dd-MM, not MM-dd, and you're also missing some colons and dots in the time part. Try the code below:
df.withColumn("date", F.to_timestamp(F.col("date"),"dd-MM-yyyy HH:mm:ss.SSS")).show()
+-------------------+
|               date|
+-------------------+
|1990-01-01 23:59:59|
|2020-04-22 23:59:59|
+-------------------+
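For completeness, a minimal sketch that combines the null handling from the question with the corrected format (same column name as in the question):
from pyspark.sql import functions as F

df = (df
      # fill nulls with the sentinel date before appending the time part
      .withColumn("date", F.coalesce(F.col("date"), F.lit("01-01-1990")))
      .withColumn("date", F.concat(F.col("date"), F.lit(" 23:59:59.000")))
      # parse with the corrected day-first pattern
      .withColumn("date", F.to_timestamp("date", "dd-MM-yyyy HH:mm:ss.SSS")))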

add new column in a dataframe depending on another dataframe's row values

I need to add a new column to dataframe DF1, but the new column's value should be calculated using the values of other columns present in that DF. Which of the other columns are to be used is given in another dataframe, DF2.
e.g. DF1 -
+----------+---------+-----------+------------+
|protocolNo|serialNum|testMethod |testProperty|
+----------+---------+-----------+------------+
|Product1  | AB      |testMethod1| TP1        |
|Product2  | CD      |testMethod2| TP2        |
+----------+---------+-----------+------------+
DF2 -
+------+----+------------------------+------------+
|action|type|value                   |exploded    |
+------+----+------------------------+------------+
|append|hash|[protocolNo]            |protocolNo  |
|append|text|_                       |_           |
|append|hash|[serialNum,testProperty]|serialNum   |
|append|hash|[serialNum,testProperty]|testProperty|
+------+----+------------------------+------------+
Now, the values in the exploded column of DF2 will be column names of DF1 whenever the value of the type column is hash.
Required -
A new column should be created in DF1, and its value should be calculated as below:
hash[protocolNo]_hash[serialNumTestProperty] ~~~ where, in place of each column name, its corresponding row value should be used.
e.g. for Row1 of DF1, the column value should be
hash[Product1]_hash[ABTP1]
which will result in something like abc-df_egh-45e after hashing.
The above procedure should be followed for each and every row of DF1.
I've tried using map and withColumn with a UDF on DF1, but inside the UDF the outer dataframe's values are not accessible (it gives a NullPointerException), and I'm also not able to pass a DataFrame as input to a UDF.
Input DFs would be DF1 and DF2 as mentioned above.
Desired Output DF-
+----------+---------+-----------+------------+----------------+
|protocolNo|serialNum|testMethod |testProperty|newColumn       |
+----------+---------+-----------+------------+----------------+
|Product1  | AB      |testMethod1| TP1        | abc-df_egh-4je |
|Product2  | CD      |testMethod2| TP2        | dfg-df_ijk-r56 |
+----------+---------+-----------+------------+----------------+
The newColumn values shown are after hashing.
Instead of working with DF2 directly, you can translate DF2 into case class specifications, e.g.
case class Spec(columnName: String, inputColumns: Seq[String], action: String, types: String*)
Create instances of the above class:
val specifications = Seq(
  Spec("new_col_name", Seq("serialNum", "testProperty"), "hash", "append")
)
Then you can process the columns as below:
val transformed = specifications
  .foldLeft(DF1)((df: DataFrame, spec: Spec) => df.transform(transformColumn(spec)))

def transformColumn(spec: Spec)(df: DataFrame): DataFrame = {
  spec.types.foldLeft(df)((df: DataFrame, t: String) =>
    t match {
      case "append" => df // match on spec.action here and append the result with df.withColumn
    }
  )
}
The syntax may not be exactly correct.
Since DF2 holds the column names that will be used to calculate the new column in DF1, I have made the assumption that DF2 will not be a huge DataFrame.
The first step is to filter DF2 and get the column names that we want to pick from DF1.
val hashColumns = DF2.filter(col("type") === "hash").select("exploded").collect
Now hashColumns holds the columns that we want to use to calculate the hash in newColumn. hashColumns is an Array of Row, and we need it to be a Column that can be applied while creating newColumn in DF1.
val newColumnHash = hashColumns.map(f => hash(col(f.getString(0)))).reduce(concat_ws("_", _, _))
The above line converts each Row to a Column with the hash function applied to it, and we reduce them while concatenating with _. Now the task becomes simple: we just need to apply this to DF1.
DF1.withColumn("newColumn", newColumnHash).show(false)
Hope this helps!
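For reference, a rough PySpark sketch of the same collect-and-concat idea (like the Scala code above, it hashes each listed column separately rather than the serialNum+testProperty concatenation shown in the question):
from pyspark.sql import functions as F

# collect the DF1 column names flagged as "hash" in DF2 (assumed to be small)
hash_cols = [r["exploded"] for r in DF2.filter(F.col("type") == "hash").select("exploded").collect()]

# hash each collected column and join the results with "_"
new_column_hash = F.concat_ws("_", *[F.hash(F.col(c)).cast("string") for c in hash_cols])

DF1.withColumn("newColumn", new_column_hash).show(truncate=False)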

How to filter a second pyspark dataframe based on date value from other pyspark dataframe?

I have a DataFrame where load_date_time is populated. I want to filter this dataframe using the max(create_date) from some other DataFrame.
I have tried to do the following.
df2_max_create_date = df2.select("create_date").agg(F.max(df_dsa["create_date"]))
df2_max_create_date.show()
+----------------+
|max(create_date)|
+----------------+
|      2019-11-10|
+----------------+
Then I try to filter the first dataframe based on this date. It has a timestamp column called load_date_time.
df_delta = df1.where(F.col('load_date_time') > (F.lit(df2_max_create_date)))
But I am getting the error below.
AttributeError: 'DataFrame' object has no attribute '_get_object_id'
You can get the max date into a Python variable by calling collect:
max_create_date = df2.select(F.max("create_date")).collect()[0][0]
df_delta = df1.where(F.col('load_date_time') > max_create_date)
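If you would rather not pull the value to the driver, an alternative sketch (using the same column names) cross-joins the one-row aggregate and filters on it:
from pyspark.sql import functions as F

# one-row dataframe holding the max create_date
max_df = df2.agg(F.max("create_date").alias("max_create_date"))

# the cross join adds a single constant row, so it stays cheap
df_delta = (df1.crossJoin(max_df)
               .where(F.col("load_date_time") > F.col("max_create_date"))
               .drop("max_create_date"))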