Identifying values that revert in Spark - pyspark

I have a Spark DataFrame of customers as shown below.
#SparkR code
customers <- data.frame(custID = c("001", "001", "001", "002", "002", "002", "002"),
date = c("2017-02-01", "2017-03-01", "2017-04-01", "2017-01-01", "2017-02-01", "2017-03-01", "2017-04-01"),
value = c('new', 'good', 'good', 'new', 'good', 'new', 'bad'))
customers <- createDataFrame(customers)
display(customers)
custID| date | value
--------------------------
001 | 2017-02-01| new
001 | 2017-03-01| good
001 | 2017-04-01| good
002 | 2017-01-01| new
002 | 2017-02-01| good
002 | 2017-03-01| new
002 | 2017-04-01| bad
In the first month's observation for a custID, the customer gets a value of 'new'. Thereafter they are classified as 'good' or 'bad'. However, it is possible for a customer to revert from 'good' or 'bad' back to 'new' if they open a second account. When this happens I want to tag the customer with '2' instead of '1' to indicate that they opened a second account, as shown below. How can I do this in Spark? Either SparkR or PySpark commands would work.
#What I want to get
custID| date | value | tag
--------------------------------
001 | 2017-02-01| new | 1
001 | 2017-03-01| good | 1
001 | 2017-04-01| good | 1
002 | 2017-01-01| new | 1
002 | 2017-02-01| good | 1
002 | 2017-03-01| new | 2
002 | 2017-04-01| bad | 2

In pyspark:
from pyspark.sql import SparkSession
from pyspark.sql import functions as f
from pyspark.sql.types import IntegerType
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()
# df is equal to your customers dataframe
df = spark.read.load('file:///home/zht/PycharmProjects/test/text_file.txt', format='csv', header=True, sep='|').cache()
# rank the 'new' rows per custID by date: 1 for the first account, 2 for the second
df_new = df.filter(df['value'] == 'new').withColumn('tag', f.rank().over(Window.partitionBy('custID').orderBy('date')))
# rows that are not 'new' get no tag yet
df = df_new.union(df.filter(df['value'] != 'new').withColumn('tag', f.lit(None).cast(IntegerType())))
# collect_list skips nulls, so popping the last element carries the latest tag forward
df = df.withColumn('tag', f.collect_list('tag').over(Window.partitionBy('custID').orderBy('date'))) \
    .withColumn('tag', f.udf(lambda x: x.pop(), IntegerType())('tag'))
df.show()
And output:
+------+----------+-----+---+
|custID| date|value|tag|
+------+----------+-----+---+
| 001|2017-02-01| new| 1|
| 001|2017-03-01| good| 1|
| 001|2017-04-01| good| 1|
| 002|2017-01-01| new| 1|
| 002|2017-02-01| good| 1|
| 002|2017-03-01| new| 2|
| 002|2017-04-01| bad| 2|
+------+----------+-----+---+
By the way, pandas can do this easily.
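For comparison, here is a more compact variant (a sketch, not part of the answer above): the tag can be computed directly as a running count of 'new' rows per customer. The name customers_df is assumed to refer to the original customers DataFrame.
from pyspark.sql import functions as f
from pyspark.sql.window import Window

# Cumulative count of 'new' rows per custID, ordered by date:
# 1 for the first account, 2 after a revert to 'new', and so on.
w = Window.partitionBy('custID').orderBy('date')
tagged = customers_df.withColumn(
    'tag', f.sum(f.when(f.col('value') == 'new', 1).otherwise(0)).over(w)
)
tagged.show()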

This can be done using the following piece of code:
Filter out all the records with "new":
createOrReplaceTempView(customers, "df")
df_new <- sql("select * from df where value = 'new'")
createOrReplaceTempView(df_new, "df_new")
df_new <- sql("select *, row_number() over (partition by custID order by date) as tag from df_new")
createOrReplaceTempView(df_new, "df_new")
df <- sql("select custID, date, value, max(tag) as tag from
    (select t1.*, t2.tag from df t1 left outer join df_new t2 on
     t1.custID = t2.custID and t1.date >= t2.date) t group by custID, date, value")
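The same SQL approach can also be run from PySpark. Below is a rough sketch, assuming an existing spark session and using customers_df as a placeholder name for the question's DataFrame (the view name df_new_ranked is likewise assumed).
# Rough PySpark equivalent of the SQL approach above.
customers_df.createOrReplaceTempView("df")
spark.sql("select * from df where value = 'new'").createOrReplaceTempView("df_new")
spark.sql("""
    select *, row_number() over (partition by custID order by date) as tag
    from df_new
""").createOrReplaceTempView("df_new_ranked")
spark.sql("""
    select custID, date, value, max(tag) as tag
    from (select t1.*, t2.tag
          from df t1
          left outer join df_new_ranked t2
            on t1.custID = t2.custID and t1.date >= t2.date) t
    group by custID, date, value
""").show()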

Related

Merge multiple spark rows inside dataframe by ID into one row based on update_time

We need to merge multiple rows into a single record per ID using PySpark. If there are multiple updates to a column, we have to select the one with the latest update.
Please note, NULL means no update was made to that column in that instance.
So basically we have to create a single row with the consolidated updates made to the records.
For example, if this is the dataframe ...
Looking for a similar answer, but in PySpark: "Merge rows in a spark scala Dataframe".
------------------------------------------------------------
| id | column1 | column2 | updated_at |
------------------------------------------------------------
| 123 | update1 | <*no-update*> | 1634228709 |
| 123 | <*no-update*> | 80 | 1634228724 |
| 123 | update2 | <*no-update*> | 1634229000 |
expected output is -
------------------------------------------------------------
| id | column1 | column2 | updated_at |
------------------------------------------------------------
| 123 | update2 | 80 | 1634229000 |
Let's say that our input dataframe is:
+---+-------+----+----------+
|id |col1 |col2|updated_at|
+---+-------+----+----------+
|123|null |null|1634228709|
|123|null |80 |1634228724|
|123|update2|90 |1634229000|
|12 |update1|null|1634221233|
|12 |null |80 |1634228333|
|12 |update2|null|1634221220|
+---+-------+----+----------+
What we want is to convert updated_at to TimestampType, then order by id and by updated_at in descending order:
df = df.withColumn("updated_at", F.col("updated_at").cast(TimestampType())).orderBy(
F.col("id"), F.col("updated_at").desc()
)
that gives us:
+---+-------+----+-------------------+
|id |col1 |col2|updated_at |
+---+-------+----+-------------------+
|12 |null |80 |2021-10-14 18:18:53|
|12 |update1|null|2021-10-14 16:20:33|
|12 |update2|null|2021-10-14 16:20:20|
|123|update2|90 |2021-10-14 18:30:00|
|123|null |80 |2021-10-14 18:25:24|
|123|null |null|2021-10-14 18:25:09|
+---+-------+----+-------------------+
Now group by id and take the first non-null value in each column (or null if there is none):
exp = [F.first(x, ignorenulls=True).alias(x) for x in df.columns[1:]]
df = df.groupBy(F.col("id")).agg(*exp)
And the result is:
+---+-------+----+-------------------+
|id |col1 |col2|updated_at |
+---+-------+----+-------------------+
|123|update2|90 |2021-10-14 18:30:00|
|12 |update1|80 |2021-10-14 18:18:53|
+---+-------+----+-------------------+
Here's the full example code:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.types import TimestampType
if __name__ == "__main__":
    spark = SparkSession.builder.master("local").appName("Test").getOrCreate()
    data = [
        (123, None, None, 1634228709),
        (123, None, 80, 1634228724),
        (123, "update2", 90, 1634229000),
        (12, "update1", None, 1634221233),
        (12, None, 80, 1634228333),
        (12, "update2", None, 1634221220),
    ]
    columns = ["id", "col1", "col2", "updated_at"]
    df = spark.createDataFrame(data, columns)
    df = df.withColumn("updated_at", F.col("updated_at").cast(TimestampType())).orderBy(
        F.col("id"), F.col("updated_at").desc()
    )
    exp = [F.first(x, ignorenulls=True).alias(x) for x in df.columns[1:]]
    df = df.groupBy(F.col("id")).agg(*exp)
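A possible variant (a sketch, not from the original answer): instead of relying on the row order surviving the groupBy, take the last non-null value per column over a window ordered by updated_at, then de-duplicate by id. It assumes df is the input DataFrame with columns id, col1, col2, updated_at (i.e. before the groupBy step).
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Last non-null value per column across the whole id partition, ordered by updated_at.
w = (
    Window.partitionBy("id")
    .orderBy("updated_at")
    .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
)
merged = df.select(
    "id",
    *[F.last(c, ignorenulls=True).over(w).alias(c) for c in df.columns[1:]],
).dropDuplicates(["id"])
merged.show(truncate=False)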

Spark scala create multiple columns from array column

Creating multiple columns from an array column.
Dataframe
Car name | details
Toyota | [[year,2000],[price,20000]]
Audi | [[mpg,22]]
Expected dataframe
Car name | year | price | mpg
Toyota | 2000 | 20000 | null
Audi | null | null | 22
You can try this
Let's define the data
scala> val carsDF = Seq(("toyota",Array(("year", 2000), ("price", 100000))), ("Audi", Array(("mpg", 22)))).toDF("car", "details")
carsDF: org.apache.spark.sql.DataFrame = [car: string, details: array<struct<_1:string,_2:int>>]
scala> carsDF.show(false)
+------+-----------------------------+
|car |details |
+------+-----------------------------+
|toyota|[[year,2000], [price,100000]]|
|Audi |[[mpg,22]] |
+------+-----------------------------+
Splitting the data & accessing the values in the data
scala> carsDF.withColumn("split", explode($"details")).withColumn("col", $"split"("_1")).withColumn("val", $"split"("_2")).select("car", "col", "val").show
+------+-----+------+
| car| col| val|
+------+-----+------+
|toyota| year| 2000|
|toyota|price|100000|
| Audi| mpg| 22|
+------+-----+------+
Define the list of columns that are required
scala> val colNames = Seq("mpg", "price", "year", "dummy")
colNames: Seq[String] = List(mpg, price, year, dummy)
Pivoting on the above-defined column names gives the required output.
Supplying the new column names as a sequence makes them a single point of input.
scala> weDF.groupBy("car").pivot("col", colNames).agg(avg($"val")).show
+------+----+--------+------+-----+
| car| mpg| price| year|dummy|
+------+----+--------+------+-----+
|toyota|null|100000.0|2000.0| null|
| Audi|22.0| null| null| null|
+------+----+--------+------+-----+
This seems a more elegant and easy way to achieve the output.
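For reference, a rough PySpark sketch of the same explode-and-pivot idea. The DataFrame construction and the struct field names key/value are assumptions here, not taken from the Scala example.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Build an equivalent DataFrame with an array-of-structs column.
cars = spark.createDataFrame(
    [("toyota", [("year", 2000), ("price", 100000)]), ("Audi", [("mpg", 22)])],
    "car string, details array<struct<key:string,value:int>>",
)

# Explode the array, pull out key/value, then pivot the keys into columns.
exploded = cars.select("car", F.explode("details").alias("kv")) \
               .select("car", F.col("kv.key").alias("col"), F.col("kv.value").alias("val"))
exploded.groupBy("car").pivot("col", ["mpg", "price", "year"]).agg(F.first("val")).show()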
You can do it like this:
import org.apache.spark.sql.functions.{col, lit, when}
val df: DataFrame = Seq(
("toyota",Array(("year", 2000), ("price", 100000))),
("toyota",Array(("year", 2001)))
).toDF("car", "details")
+------+-------------------------------+
|car |details |
+------+-------------------------------+
|toyota|[[year, 2000], [price, 100000]]|
|toyota|[[year, 2001]] |
+------+-------------------------------+
val newdf = df
.withColumn("year", when(col("details")(0)("_1") === lit("year"), col("details")(0)("_2")).otherwise(col("details")(1)("_2")))
.withColumn("price", when(col("details")(0)("_1") === lit("price"), col("details")(0)("_2")).otherwise(col("details")(1)("_2")))
.drop("details")
newdf.show()
+------+----+------+
| car|year| price|
+------+----+------+
|toyota|2000|100000|
|toyota|2001| null|
+------+----+------+

How to populate Dataframe values based on data in another dataframe

Lookup DF:
+--------------------+------------------+
| seller_name| codes|
+--------------------+------------------+
| BlueR |[5944, 5813, 5812]|
| jack |[4814, 5734, 5968]|
| Cwireless |[7349, 7399, 5999]|
| Tea |[4899, 5813, 8398]|
Base DF:
seller_name | raw_code
BlueR | 5813
jack | 5968
Cwireless | 7865
Tea | 5999
Tea | 5813
blueR | 5678
jack | 9999
Tea | null
If the seller_name in the Base DF is present in the Lookup DF, and the raw_code for that seller is present in the Lookup DF codes, then the value should be retained; but if the raw_code is anything other than the elements in the tuple of the Lookup DF, it should be replaced by the first element in the tuple for that seller.
Edit: if the seller_name of the Base DF is not present in the Lookup DF, then the raw_code value should be retained as it is.
Expected Output DF:
seller_name | revised_code
blueR | 5813
jack | 5968
Cwireless | 7349
Tea | 4899
Tea | 5813
blueR | 5678
jack | 4814
Tea | 4899
How can I implement this feature?
Broadcast the small lookUpDf and left join it with baseDf, then use a udf function to check whether the raw_code is contained in codes: if it is, return the raw_code, else the first value of the codes array.
import org.apache.spark.sql.functions._
def retainUdf = udf((rawCode: Int, codes:Seq[Int]) => if(codes == null || codes.isEmpty) rawCode else if(codes.contains(rawCode)) rawCode else codes.head)
baseDf.join(broadcast(lookUpDf), Seq("seller_name"), "left")
.select(col("seller_name"), retainUdf(col("raw_code"), col("codes")).as("raw_code"))
which should give you
+-----------+--------+
|seller_name|raw_code|
+-----------+--------+
|BlueR |5813 |
|jack |5968 |
|Cwireless |7349 |
|Tea |4899 |
|Tea |5813 |
|blueR |5678 |
|jack |4814 |
+-----------+--------+
I hope the answer is helpful
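For what it's worth, here is a PySpark sketch of the same idea without a UDF (the names base_df and lookup_df are assumed); it also covers the Tea row whose raw_code is null.
from pyspark.sql import functions as F

# Keep raw_code when it appears in the seller's codes array, otherwise fall back to
# the first element; sellers with no lookup entry keep their raw_code unchanged.
result = (
    base_df.join(F.broadcast(lookup_df), "seller_name", "left")
    .withColumn(
        "revised_code",
        F.when(F.col("codes").isNull(), F.col("raw_code"))
         .when(F.expr("array_contains(codes, raw_code)"), F.col("raw_code"))
         .otherwise(F.col("codes")[0]),
    )
    .select("seller_name", "revised_code")
)
result.show()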

Spark Dataframe - Write a new record for a change in VALUE for a particular KEY group

Need to write a row when there is a change in the "AMT" column for a particular "KEY" group.
E.g.:
Scenario 1: For KEY=2, the first change is 90 to 20, so a record needs to be written with value (20 - 90).
Similarly, the next change for the same key group is 20 to 30.5, so another record needs to be written with value (30.5 - 20).
Scenario 2: For KEY=1, there is only one record for this KEY group, so write it as is.
Scenario 3: For KEY=3, the same AMT value exists twice, so write it once.
How can this be implemented? Using window functions or groupBy/agg functions?
Sample Input Data :
val DF1 = List((1,34.6),(2,90.0),(2,90.0),(2,20.0),(2,30.5),(3,89.0),(3,89.0)).toDF("KEY", "AMT")
DF1.show(false)
+-----+-------------------+
|KEY |AMT |
+-----+-------------------+
|1 |34.6 |
|2 |90.0 |
|2 |90.0 |
|2 |20.0 |----->[ 20.0 - 90.0 = -70.0 ]
|2 |30.5 |----->[ 30.5 - 20.0 = 10.5 ]
|3 |89.0 |
|3 |89.0 |
+-----+-------------------+
Expected Values :
scala> df2.show()
+----+--------------------+
|KEY | AMT |
+----+--------------------+
| 1 | 34.6 |-----> As Is
| 2 | -70.0 |----->[ 20.0 - 90.0 = -70.0 ]
| 2 | 10.5 |----->[ 30.5 - 20.0 = 10.5 ]
| 3 | 89.0 |-----> As Is, with one record only
+----+--------------------+
I have tried to solve it in PySpark, not in Scala.
from pyspark.sql.functions import lag
from pyspark.sql.window import Window
w1 = Window.partitionBy("KEY").orderBy("KEY")
DF4 = spark.createDataFrame([(1,34.6),(2,90.0),(2,90.0),(2,20.0),(2,30.5),(3,89.0),(3,89.0)],["KEY", "AMT"])
DF4.createOrReplaceTempView('keyamt')
# keys whose AMT never changes: keep one distinct row as-is
DF7 = spark.sql('select distinct key, amt from keyamt where key in (select key from (select key, count(distinct(amt)) dist from keyamt group by key) t where dist = 1)')
# for the remaining keys, take the difference from the previous AMT in the group
DF8 = DF4.join(DF7, DF4['KEY'] == DF7['KEY'], 'leftanti').withColumn('new_col', lag('AMT', 1).over(w1).cast('double'))
DF9 = DF8.withColumn('new_col1', DF8['AMT'] - DF8['new_col'].cast('double'))
DF9.filter(DF9['new_col1'] != 0).select(DF9['KEY'], DF9['new_col1']).union(DF7).orderBy(DF9['KEY']).show()
Output:
+---+--------+
|KEY|new_col1|
+---+--------+
| 1| 34.6|
| 2| -70.0|
| 2| 10.5|
| 3| 89.0|
+---+--------+
You can implement your logic using a window function combined with when, lead, and monotonically_increasing_id() for ordering, and the withColumn API, as below:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._
val windowSpec = Window.partitionBy("KEY").orderBy("rowNo")
val tempdf = DF1.withColumn("rowNo", monotonically_increasing_id())
tempdf.select($"KEY", when(lead("AMT", 1).over(windowSpec).isNull || (lead("AMT", 1).over(windowSpec) - $"AMT") === lit(0.0), $"AMT").otherwise(lead("AMT", 1).over(windowSpec) - $"AMT").as("AMT")).show(false)
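For completeness, a PySpark sketch that reproduces the four expected rows. It assumes df holds the sample DF1 data and, like the Scala answer above, relies on monotonically_increasing_id() to reflect the input order.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Compare each AMT with the previous one in its KEY group; emit the difference for
# rows where the value changed, and emit a single as-is row for groups that never change.
w = Window.partitionBy("KEY").orderBy("row_id")
wk = Window.partitionBy("KEY")

result = (
    df.withColumn("row_id", F.monotonically_increasing_id())
      .withColumn("prev_amt", F.lag("AMT").over(w))
      .withColumn(
          "n_changes",
          F.sum(
              F.when(F.col("prev_amt").isNotNull() & (F.col("AMT") != F.col("prev_amt")), 1)
               .otherwise(0)
          ).over(wk),
      )
      .where(
          ((F.col("n_changes") == 0) & F.col("prev_amt").isNull())
          | (F.col("prev_amt").isNotNull() & (F.col("AMT") != F.col("prev_amt")))
      )
      .withColumn(
          "AMT",
          F.when(F.col("prev_amt").isNull(), F.col("AMT"))
           .otherwise(F.col("AMT") - F.col("prev_amt")),
      )
      .orderBy("KEY", "row_id")
      .select("KEY", "AMT")
)
result.show()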

Create a new column based on date checking

I have two dataframes in Scala:
df1 =
ID Field1
1 AAA
2 BBB
4 CCC
and
df2 =
PK start_date_time
1 2016-10-11 11:55:23
2 2016-10-12 12:25:00
3 2016-10-12 16:20:00
I also have a variable start_date with the format yyyy-MM-dd equal to 2016-10-11.
I need to create a new column check in df1 based on the following condition: If PK is equal to ID AND the year, month and day of start_date_time are equal to start_date, then check is equal to 1, otherwise 0.
The result should be this one:
df1 =
ID Field1 check
1 AAA 1
2 BBB 0
4 CCC 0
In my previous question I had two dataframes and it was suggested to use joining and filtering. However, in this case it won't work. My initial idea was to use udf, but I am not sure how to make it work for this case.
You can combine join and withColumn for this case, i.e. first join with df2 on the ID column and then use when/otherwise syntax to modify the check column:
import org.apache.spark.sql.functions.{lit, to_date, when}
val df2_date = df2.withColumn("date", to_date(df2("start_date_time"))).withColumn("check", lit(1)).select($"PK".as("ID"), $"date", $"check")
df1.join(df2_date, Seq("ID"), "left").withColumn("check", when($"date" === "2016-10-11", $"check").otherwise(0)).drop("date").show
+---+------+-----+
| ID|Field1|check|
+---+------+-----+
| 1| AAA| 1|
| 2| BBB| 0|
| 4| CCC| 0|
+---+------+-----+
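A rough PySpark sketch of this first option (df1 and df2 are assumed to be the equivalent PySpark DataFrames):
from pyspark.sql import functions as F

# Derive a DateType column from start_date_time, tag potential matches with 1,
# left join on ID, then zero out check where the date is not the target start_date.
df2_date = (
    df2.withColumn("date", F.to_date("start_date_time"))
       .withColumn("check", F.lit(1))
       .select(F.col("PK").alias("ID"), "date", "check")
)
(df1.join(df2_date, ["ID"], "left")
    .withColumn("check", F.when(F.col("date") == "2016-10-11", F.col("check")).otherwise(0))
    .drop("date")
    .show())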
Or, as another option, first filter df2 and then join it back with df1 on the ID column:
val df2_date = (df2.withColumn("date", to_date(df2("start_date_time"))).
filter($"date" === "2016-10-11").
withColumn("check", lit(1)).
select($"PK".as("ID"), $"date", $"check"))
df1.join(df2_date, Seq("ID"), "left").drop("date").na.fill(0).show
+---+------+-----+
| ID|Field1|check|
+---+------+-----+
| 1| AAA| 1|
| 2| BBB| 0|
| 4| CCC| 0|
+---+------+-----+
In case you have a date like 2016-OCT-11, you can convert it to a java.sql.Date for comparison as follows:
val format = new java.text.SimpleDateFormat("yyyy-MMM-dd")
val parsed = format.parse("2016-OCT-11")
val date = new java.sql.Date(parsed.getTime())
// date: java.sql.Date = 2016-10-11
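And a small PySpark-side sketch of the same conversion (reusing the assumed df1 and df2_date names from the sketch above): parse the string in Python and compare against the DateType column via its ISO form.
import datetime
from pyspark.sql import functions as F

# Python's strptime matches month abbreviations case-insensitively, so '2016-OCT-11'
# parses fine; normalizing it to ISO form lets it be compared against the DateType
# column directly (Spark casts the string literal to a date in the comparison).
start_date = datetime.datetime.strptime("2016-OCT-11", "%Y-%b-%d").date().isoformat()
(df1.join(df2_date, ["ID"], "left")
    .withColumn("check", F.when(F.col("date") == start_date, F.col("check")).otherwise(0))
    .drop("date")
    .show())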