Separating string and numeric values in PySpark

I have the following dataframe. There are several IDs which have either a numeric or a string value. If an ID has a string_value, like "B", its numeric_value is "NULL" as a string; vice versa for IDs with a numeric value, e.g. ID "D".
ID string_value numeric_value timestamp
0 B On NULL 1632733508
1 B Off NULL 1632733508
2 A Inactive NULL 1632733511
3 A Active NULL 1632733512
4 D NULL 450 1632733513
5 D NULL 431 1632733515
6 C NULL 20 1632733518
7 C NULL 30 1632733521
Now I want to separate the dataframe into a new one per ID, using an existing list containing all the unique IDs. Afterwards each new dataframe, like "B" in this example, should drop the column that only holds "NULL" values. So if B has a string_value, the numeric_value column should be dropped.
ID string_value timestamp
0 B On 1632733508
1 B Off 1632733508
After that, the value column should be renamed to the ID ("B") and the ID column should be dropped.
B timestamp
0 On 1632733508
1 Off 1632733508
As described, the same procedure should be applied to the numeric values, in this case ID "D":
ID numeric_value timestamp
0 D 450 1632733513
1 D 431 1632733515
D timestamp
0 450 1632733513
1 431 1632733515
It is important to preserve the original data types within the value column.

Assuming your dataframe is called df and your list of IDs is ids, you can write a function which does what you need and call it for every ID.
The function applies the required filter and then selects the needed columns, using the ID as an alias.
from pyspark.sql import functions as f

ids = ["B", "A", "D", "C"]

def split_df(df, id):
    # keep only the rows for this ID, then collapse the two value columns
    # into a single column named after the ID
    return df.filter(f.col("ID") == id).select(
        f.coalesce(f.col("string_value"), f.col("numeric_value")).alias(id),
        f.col("timestamp"),
    )

dfs = [split_df(df, id) for id in ids]

for df in dfs:
    df.show()
Output:
+---+----------+
| B| timestamp|
+---+----------+
| On|1632733508|
|Off|1632733508|
+---+----------+
+--------+----------+
| A| timestamp|
+--------+----------+
|Inactive|1632733511|
| Active|1632733512|
+--------+----------+
+---+----------+
| D| timestamp|
+---+----------+
|450|1632733513|
|431|1632733515|
+---+----------+
+---+----------+
| C| timestamp|
+---+----------+
| 20|1632733518|
| 30|1632733521|
+---+----------+
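If you would rather look the resulting frames up by their ID later instead of iterating over a list, a dict comprehension works just as well (a small sketch; dfs_by_id is just an illustrative name):
# keep the per-ID frames addressable by ID instead of in a plain list
dfs_by_id = {i: split_df(df, i) for i in ids}
dfs_by_id["B"].show()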

Related

How to match column values of one table with column names of another table in PySpark

I have two dataframes.
dataframe_1
Id  qname   qval
01  Mango   [100,200]
01  Banana  [500,400,800]
dataframe_2
reqId  Mango  Banana  Orange  Apple
1000   100    500     NULL    NULL
1001   200    500     NULL    NULL
1002   200    800     NULL    NULL
1003   900    1100    NULL    NULL
Expected result:
Id  ReqId
01  1000
01  1001
01  1002
Please give me some idea. I need to match every qname and qval of dataframe_1 against the columns of dataframe_2, ignoring the NULL columns of dataframe_2, and get all the matching reqId values from dataframe_2.
Note: all qname/qval pairs of a particular Id in dataframe_1 should match the correspondingly named columns of dataframe_2, ignoring nulls. For example, Id 01 has two qname/qval pairs; both should match the corresponding columns of dataframe_2.
The logic is:
1. In df2, pair "reqId" with each column's value.
2. In df2, introduce a dummy column with some constant value and group by it, so all values end up in one group.
3. Unpivot df2.
4. Join df1 with the processed df2.
5. For each element in the "qval" list, filter the corresponding "reqId" from the joined df2 column.
6. Group by "Id" and explode "reqId".
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, IntegerType

df1 = spark.createDataFrame(data=[["01","Mango",[100,200]],["01","Banana",[500,400,800]],["02","Banana",[800,1100]]], schema=["Id","qname","qval"])
df2 = spark.createDataFrame(data=[[1000,100,500,None,None],[1001,200,500,None,None],[1002,200,800,None,None],[1003,900,1100,None,None]], schema="reqId int,Mango int,Banana int,Orange int,Apple int")

# Pair every value with its "reqId"
for c in df2.columns:
    if c != "reqId":
        df2 = df2.withColumn(c, F.array(c, "reqId"))

# Collapse all rows into a single group so every column becomes a list of [value, reqId] pairs
df2 = df2.withColumn("dummy", F.lit(0)) \
    .groupBy("dummy") \
    .agg(*[F.collect_list(c).alias(c) for c in df2.columns]) \
    .drop("dummy", "reqId")

# Unpivot: one row per original column, holding that column's list of pairs
stack_cols = ", ".join([f"{c}, '{c}'" for c in df2.columns])
df2 = df2.selectExpr(f"stack({len(df2.columns)},{stack_cols}) as (qval2, qname2)")

# For every [value, reqId] pair, keep the reqId if the value occurs in "qval"
@F.udf(returnType=ArrayType(IntegerType()))
def compare_qvals(qval, qval2):
    return [x[1] for x in qval2 if x[0] in qval]

df_result = df1.join(df2, on=(df1.qname == df2.qname2)) \
    .withColumn("reqId", compare_qvals("qval", "qval2")) \
    .groupBy("Id") \
    .agg(F.flatten(F.array_distinct(F.collect_list("reqId"))).alias("reqId")) \
    .withColumn("reqId", F.explode("reqId"))
Output:
+---+-----+
|Id |reqId|
+---+-----+
|01 |1000 |
|01 |1001 |
|01 |1002 |
|02 |1002 |
|02 |1003 |
+---+-----+
PS: to cover the case of multiple "Id"s, I have added some extra data to the sample dataset, hence the output has some extra rows.

PySpark withColumn that uses column data from another row

I have a dataframe like this:
order_type  customer_id  order_id  related_order_id
purchase    123          abc       null
return      123          bcd       null
purchase    234          xyz       null
return      234          zzz       null
I want to fill in the related_order_id column with the order_id of the related purchase, only for rows where order_type is return. A return and a purchase row can be related by their customer_id.
I've tried to use withColumn(), but I haven't figured out a way that would allow me to also look at other rows and their column data.
The end result should look something like:
order_type  customer_id  order_id  related_order_id
purchase    123          abc       null
return      123          bcd       abc
purchase    234          xyz       null
return      234          zzz       xyz
You can use the lag() function to use data from the previous row.
Assuming a return is always preceded by a purchase, you can do:
from pyspark.sql.window import Window
from pyspark.sql import functions as F
from pyspark.sql.functions import col
w = Window().partitionBy("customer_id").orderBy("order_type")

df.withColumn("related_order_id",
              F.when(col("order_type") == "return", F.lag(col("order_id")).over(w))
               .otherwise(col("related_order_id"))).show()
Output:
+----------+-----------+--------+----------------+
|order_type|customer_id|order_id|related_order_id|
+----------+-----------+--------+----------------+
| purchase| 123| abc| null|
| return| 123| bcd| abc|
| purchase| 234| xyz| null|
| return| 234| zzz| xyz|
+----------+-----------+--------+----------------+
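If the "return is always preceded by a purchase" ordering assumption does not hold for your data, a self-join on customer_id is another option. A rough sketch, assuming each customer has at most one purchase row (purchases and returns_filled are illustrative names):
from pyspark.sql import functions as F

# purchases keyed by customer, keeping only the order_id we want to copy over
purchases = df.filter(F.col("order_type") == "purchase") \
    .select("customer_id", F.col("order_id").alias("purchase_order_id"))

# attach the purchase order_id to every return of the same customer
returns_filled = df.filter(F.col("order_type") == "return") \
    .join(purchases, on="customer_id", how="left") \
    .withColumn("related_order_id", F.col("purchase_order_id")) \
    .drop("purchase_order_id")

# put purchases and the enriched returns back together
result = df.filter(F.col("order_type") == "purchase").unionByName(returns_filled)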

Find for each row the first non-null value in a group of columns and the column name

I have a dataframe like this:
col1 col2 col3 Other
====================================
NULL 1 2 A
3 4 5 B
NULL NULL NULL C
and I would like to get the following result, with these rules:
For each row, find the first non-NULL value, put it in FirstValue and put its column name in ColName
If all values in a row are NULL, FirstValue and ColName are set to NULL
Keep the Other column
Expected result:
FirstValue ColName Other
====================================
1 col2 A
3 col1 B
NULL NULL C
You can use coalesce:
val df2 = df.select(
coalesce(df.columns.dropRight(1).map(col):_*).as("FirstValue"),
coalesce(df.columns.dropRight(1).map(c => when(col(c).isNotNull, lit(c))):_*).as("ColName"),
col("Other")
)
df2.show
+----------+-------+-----+
|FirstValue|ColName|Other|
+----------+-------+-----+
| 1| col2| A|
| 3| col1| B|
| null| null| C|
+----------+-------+-----+
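For what it's worth, a rough PySpark equivalent of the same coalesce idea, assuming the value columns are col1, col2, col3 as in the example:
from pyspark.sql import functions as F

value_cols = ["col1", "col2", "col3"]  # columns to scan, per the example

df2 = df.select(
    F.coalesce(*[F.col(c) for c in value_cols]).alias("FirstValue"),
    # when() without otherwise() yields NULL, so coalesce picks the name of the first non-null column
    F.coalesce(*[F.when(F.col(c).isNotNull(), F.lit(c)) for c in value_cols]).alias("ColName"),
    F.col("Other"),
)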

Apache spark aggregation: aggregate column based on another column value

I am not sure if I am asking this correctly, and maybe that is why I haven't found the correct answer so far. Anyway, if it turns out to be a duplicate I will delete this question.
I have following data:
id | last_updated | count
__________________________
1 | 20190101 | 3
1 | 20190201 | 2
1 | 20190301 | 1
I want to group this data by the "id" column, get the max value of "last_updated", and for the "count" column keep the value from the row where "last_updated" has its max value. So in that case the result should look like this:
id | last_updated | count
__________________________
1 | 20190301 | 1
So I imagine it will look like that:
df
.groupBy("id")
.agg(max("last_updated"), ... ("count"))
Is there any function I can use to get "count" based on the "last_updated" column?
I am using spark 2.4.0.
Thanks for any help
You have two options; the first is the better one, as far as I understand.
OPTION 1
Use a window partitioned by the ID and create a column with the max value over that window. Then keep the rows where the desired column equals the max value, and finally drop the original column and rename the max column as desired.
val w = Window.partitionBy("id")
df.withColumn("max", max("last_updated").over(w))
.where("max = last_updated")
.drop("last_updated")
.withColumnRenamed("max", "last_updated")
OPTION 2
You can perform a join with the original dataframe after grouping:
df.groupBy("id")
.agg(max("last_updated").as("last_updated"))
.join(df, Seq("id", "last_updated"))
QUICK EXAMPLE
INPUT
df.show
+---+------------+-----+
| id|last_updated|count|
+---+------------+-----+
| 1| 20190101| 3|
| 1| 20190201| 2|
| 1| 20190301| 1|
+---+------------+-----+
OUTPUT
Option 1
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
val w = Window.partitionBy("id")
df.withColumn("max", max("last_updated").over(w))
.where("max = last_updated")
.drop("last_updated")
.withColumnRenamed("max", "last_updated")
+---+-----+------------+
| id|count|last_updated|
+---+-----+------------+
| 1| 1| 20190301|
+---+-----+------------+
Option 2
df.groupBy("id")
.agg(max("last_updated").as("last_updated")
.join(df, Seq("id", "last_updated")).show
+---+------------+-----+
| id|last_updated|count|
+---+------------+-----+
|  1|    20190301|    1|
+---+------------+-----+
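If you need the same thing in PySpark, Option 1 translates roughly like this (a sketch, assuming the same column names):
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy("id")

# keep only the row(s) whose last_updated equals the per-id maximum
result = df.withColumn("max", F.max("last_updated").over(w)) \
    .where(F.col("max") == F.col("last_updated")) \
    .drop("last_updated") \
    .withColumnRenamed("max", "last_updated")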

Create a new column based on date checking

I have two dataframes in Scala:
df1 =
ID Field1
1 AAA
2 BBB
4 CCC
and
df2 =
PK start_date_time
1 2016-10-11 11:55:23
2 2016-10-12 12:25:00
3 2016-10-12 16:20:00
I also have a variable start_date with the format yyyy-MM-dd equal to 2016-10-11.
I need to create a new column check in df1 based on the following condition: If PK is equal to ID AND the year, month and day of start_date_time are equal to start_date, then check is equal to 1, otherwise 0.
The result should be this one:
df1 =
ID Field1 check
1 AAA 1
2 BBB 0
4 CCC 0
In my previous question I had two dataframes and it was suggested to use joining and filtering. However, in this case that won't work. My initial idea was to use a udf, but I am not sure how to make it work for this case.
You can combine join and withColumn for this case, i.e. first join with df2 on the ID column and then use when/otherwise to fill the check column:
import org.apache.spark.sql.functions._

val df2_date = df2.withColumn("date", to_date(df2("start_date_time")))
  .withColumn("check", lit(1))
  .select($"PK".as("ID"), $"date", $"check")

df1.join(df2_date, Seq("ID"), "left")
  .withColumn("check", when($"date" === "2016-10-11", $"check").otherwise(0))
  .drop("date")
  .show
+---+------+-----+
| ID|Field1|check|
+---+------+-----+
| 1| AAA| 1|
| 2| BBB| 0|
| 4| CCC| 0|
+---+------+-----+
Another option: first filter df2, then join it back with df1 on the ID column:
val df2_date = (df2.withColumn("date", to_date(df2("start_date_time"))).
filter($"date" === "2016-10-11").
withColumn("check", lit(1)).
select($"PK".as("ID"), $"date", $"check"))
df1.join(df2_date, Seq("ID"), "left").drop("date").na.fill(0).show
+---+------+-----+
| ID|Field1|check|
+---+------+-----+
| 1| AAA| 1|
| 2| BBB| 0|
| 4| CCC| 0|
+---+------+-----+
In case you have a date like 2016-OCT-11, you can convert it to a java.sql.Date for the comparison as follows:
val format = new java.text.SimpleDateFormat("yyyy-MMM-dd")
val parsed = format.parse("2016-OCT-11")
val date = new java.sql.Date(parsed.getTime())
// date: java.sql.Date = 2016-10-11
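The equivalent conversion in Python, if you end up building the comparison value on the driver side (a small sketch):
from datetime import datetime

# %b matches abbreviated month names; strptime matches them case-insensitively
parsed = datetime.strptime("2016-OCT-11", "%Y-%b-%d").date()
# parsed: datetime.date(2016, 10, 11)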