Issue while applying aggregate functions like count, sum on columns having null values in PySpark

We have the following input dataframe.
df1
Dep |Gender|Salary|DOB |Place
Finance |Male |5000 |2009-02-02 00:00:00|UK
HR |Female|6000 |2006-02-02 00:00:00|null
HR |Male |14200 |null |US
IT |Male |null |2008-02-02 00:00:00|null
IT |Male |55555 |2008-02-02 00:00:00|UK
Marketing|Female|12200 |2005-02-02 00:00:00|UK
Used the following code to find the count:
df = df1.groupBy(df1['Dep'])
df2 = df.agg({'Salary':'count'})
df2.show()
The result is:
Dep |count(Salary)
Finance |1
HR |2
Marketing|1
IT |1
The expected result is shown below.
Dep |count(Salary)
Finance |1
HR |2
Marketing|1
IT |2
The issue comes from the 4th row, where Salary is null: the count operation does not include null values.
I would appreciate your help in solving this issue.

You can replace the null values before aggregating:
df1 \
    .na.fill({'Salary': 0}) \
    .groupBy('Dep') \
    .agg({'Salary': 'count'})
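Alternatively, if the intent is simply to count every row per department, nulls included (so IT would be 2), a minimal sketch, assuming the same df1 as above, is to count a literal instead of the Salary column:
from pyspark.sql import functions as F

# count(lit(1)) counts all rows in the group, whereas count('Salary') skips nulls
df1.groupBy('Dep') \
    .agg(F.count(F.lit(1)).alias('count')) \
    .show()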

Related

Get last value of previous partition/group in pyspark

I have a dataframe looking like this (just some example values):
| id | timestamp                  | mode    | trip | journey | value |
| 1  | 2021-09-12 23:59:19.717000 | walking | 1    | 1       | 1.21  |
| 1  | 2021-09-12 23:59:38.617000 | walking | 1    | 1       | 1.36  |
| 1  | 2021-09-12 23:59:38.617000 | driving | 2    | 1       | 1.65  |
| 2  | 2021-09-11 23:52:09.315000 | walking | 4    | 6       | 1.04  |
I want to create new columns filled with the previous and next mode, something like this:
| id | timestamp                  | mode    | trip | journey | value | prev    | next    |
| 1  | 2021-09-12 23:59:19.717000 | walking | 1    | 1       | 1.21  | bus     | driving |
| 1  | 2021-09-12 23:59:38.617000 | walking | 1    | 1       | 1.36  | bus     | driving |
| 1  | 2021-09-12 23:59:38.617000 | driving | 2    | 1       | 1.65  | walking | walking |
| 2  | 2021-09-11 23:52:09.315000 | walking | 4    | 6       | 1.0   | walking | driving |
I have tried to partition by id, trip, journey and mode, ordered by timestamp. Then I tried to use lag() and lead(), but I am not sure these can reach into other partitions. I came across Window.unboundedPreceding and Window.unboundedFollowing; however, I am not sure I completely understand how they work. My thinking is that if I partition the data as explained above, I always just need the last value of mode from the previous partition, and to fill the next column I could reorder the partition from ascending to descending on the timestamp and do the same. However, I am unsure how to get the last value of the previous partition.
I have tried this:
from pyspark.sql import Window
from pyspark.sql.functions import col, first

w = Window.partitionBy("id", "journey", "trip").orderBy(col("timestamp").asc())
w_prev = w.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
df = df.withColumn("prev", first("mode").over(w_prev))
Code examples and explanations using PySpark would be much appreciated!
Based on what I could understand, you could do something like this: create a partition based on id and journey (within each journey there are multiple trips), order by trip and lastly by timestamp, and then simply use lead and lag to get the output.
from pyspark.sql import Window
from pyspark.sql import functions as F

w = Window().partitionBy('id', 'journey').orderBy('trip', 'timestamp')
df.withColumn('prev', F.lag('mode', 1).over(w)) \
    .withColumn('next', F.lead('mode', 1).over(w)) \
    .show(truncate=False)
Output:
+---+--------------------------+-------+----+-------+-----+-------+-------+
|id |timestamp |mode |trip|journey|value|prev |next |
+---+--------------------------+-------+----+-------+-----+-------+-------+
|1 |2021-09-12 23:59:19.717000|walking|1 |1 |1.21 |null |walking|
|1 |2021-09-12 23:59:38.617000|walking|1 |1 |1.36 |walking|driving|
|1 |2021-09-12 23:59:38.617000|driving|2 |1 |1.65 |walking|null |
|2 |2021-09-11 23:52:09.315000|walking|4 |6 |1.04 |null |null |
+---+--------------------------+-------+----+-------+-----+-------+-------+
EDIT:
Okay, as the OP asked, you can do this to achieve it:
# Used for taking the latest record from same id, trip, journey
w = Window().partitionBy('id', 'trip', 'journey').orderBy(F.col('timestamp').desc())
# Used to calculate prev and next mode
w1 = Window().partitionBy('id', 'journey').orderBy('trip')
# First take only the latest rows for a particular combination of id, trip, journey
# Second, use the filtered rows to get prev and next modes
df2 = df.withColumn('rn', F.row_number().over(w)) \
.filter(F.col('rn') == 1) \
.withColumn('prev', F.lag('mode', 1).over(w1)) \
.withColumn('next', F.lead('mode', 1).over(w1)) \
.drop('rn')
df2.show(truncate=False)
Output:
+---+--------------------------+-------+----+-------+-----+-------+-------+
|id |timestamp |mode |trip|journey|value|prev |next |
+---+--------------------------+-------+----+-------+-----+-------+-------+
|1 |2021-09-12 23:59:38.617000|walking|1 |1 |1.36 |null |driving|
|1 |2021-09-12 23:59:38.617000|driving|2 |1 |1.65 |walking|null |
|2 |2021-09-11 23:52:09.315000|walking|4 |6 |1.04 |null |null |
+---+--------------------------+-------+----+-------+-----+-------+-------+
# Finally, join the calculated DF with the original DF to get prev and next mode
final_df = df.alias('a').join(df2.alias('b'), ['id', 'trip', 'journey'], how='left') \
.select('a.*', 'b.prev', 'b.next')
final_df.show(truncate=False)
Output:
+---+----+-------+--------------------------+-------+-----+-------+-------+
|id |trip|journey|timestamp |mode |value|prev |next |
+---+----+-------+--------------------------+-------+-----+-------+-------+
|1 |1 |1 |2021-09-12 23:59:19.717000|walking|1.21 |null |driving|
|1 |1 |1 |2021-09-12 23:59:38.617000|walking|1.36 |null |driving|
|1 |2 |1 |2021-09-12 23:59:38.617000|driving|1.65 |walking|null |
|2 |4 |6 |2021-09-11 23:52:09.315000|walking|1.04 |null |null |
+---+----+-------+--------------------------+-------+-----+-------+-------+
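Note: in the expected output above, the first row's prev is bus, which implies reaching back past the current journey/trip boundary. If that behaviour is wanted, one possible variation (a sketch, not part of the answer above) is to compute lag/lead over a window partitioned by id alone on the per-trip rows df2, then join back to the original rows as before:
from pyspark.sql import Window
from pyspark.sql import functions as F

# One window per id, ordered across journeys and trips, so lag/lead
# can reach into the previous/next trip of the same id
w_id = Window.partitionBy('id').orderBy('journey', 'trip', 'timestamp')

df3 = df2.drop('prev', 'next') \
    .withColumn('prev', F.lag('mode', 1).over(w_id)) \
    .withColumn('next', F.lead('mode', 1).over(w_id))

final_df = df.join(df3.select('id', 'trip', 'journey', 'prev', 'next'),
                   on=['id', 'trip', 'journey'], how='left')
final_df.show(truncate=False)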

Align multiple dataframes in pyspark

I have these 4 spark dataframes:
order,device,count_1
101,201,2
102,202,4
order,device,count_2
101,201,10
103,203,100
order,device,count_3
104,204,111
103,203,10
order,device,count_4
101,201,4
104,204,11
I want to create a resultant dataframe as:
order,device,count_1,count_2,count_3,count_4
101,201,2,10,,4
102,202,4,,,
103,203,,100,10,
104,204,,,111,11
Is this a case of UNION or JOIN or APPEND? How do I get the final resultant df?
You can think of UNION as combining tables by rows, so the number of rows will likely increase. JOIN combines tables by columns. I'm not sure what you mean by APPEND, but in this case, you would want JOIN.
Try:
import spark.implicits._

val df1 = Seq((101,201,2), (102,202,4)).toDF("order", "device", "count_1")
val df2 = Seq((101,201,10), (103,203,100)).toDF("order", "device", "count_2")
val df3 = Seq((104,204,111), (103,203,10)).toDF("order", "device", "count_3")
val df4 = Seq((101,201,4), (104,204,11)).toDF("order", "device", "count_4")
val df12 = df1.join(df2, Seq("order", "device"),"fullouter")
df12.show(false)
val df123 = df12.join(df3, Seq("order", "device"),"fullouter")
df123.show(false)
val df1234 = df123.join(df4, Seq("order", "device"),"fullouter")
df1234.show(false)
returns:
+-----+------+-------+-------+-------+-------+
|order|device|count_1|count_2|count_3|count_4|
+-----+------+-------+-------+-------+-------+
|101 |201 |2 |10 |null |4 |
|102 |202 |4 |null |null |null |
|103 |203 |null |100 |10 |null |
|104 |204 |null |null |111 |11 |
+-----+------+-------+-------+-------+-------+
As you can see, the comments are flawed and the first answer is incorrect.
This was done in Scala; it should be easy to do in PySpark.
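The same chain of full outer joins in PySpark might look like the sketch below (a reduce over the four dataframes; the names df1..df4 mirror the Scala snippet above):
from functools import reduce
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df1 = spark.createDataFrame([(101, 201, 2), (102, 202, 4)], ["order", "device", "count_1"])
df2 = spark.createDataFrame([(101, 201, 10), (103, 203, 100)], ["order", "device", "count_2"])
df3 = spark.createDataFrame([(104, 204, 111), (103, 203, 10)], ["order", "device", "count_3"])
df4 = spark.createDataFrame([(101, 201, 4), (104, 204, 11)], ["order", "device", "count_4"])

# Fold the full outer joins over the shared keys
result = reduce(lambda left, right: left.join(right, ["order", "device"], "fullouter"),
                [df1, df2, df3, df4])
result.orderBy("order").show()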

how to get oldest month data using datetime column in spark dataframe(Scala)?

I have a Spark dataframe with two columns: one is id and the second is col_datetime, as you can see in the dataframe given below. How can I filter the dataframe based on col_datetime to get the oldest month's data? I want to achieve the result dynamically because I have 20-odd dataframes.
INPUT DF:
import spark.implicits._
import org.apache.spark.sql.functions._
val data = Seq((1 , "2020-07-02 00:00:00.0"),(2 , "2020-08-02 00:00:00.0"),(3 , "2020-09-02 00:00:00.0"),(4 , "2020-10-02 00:00:00.0"),(5 , "2020-11-02 00:00:00.0"),(6 , "2020-12-02 00:00:00.0"),(7 , "2021-01-02 00:00:00.0"),(8 , "2021-02-02 00:00:00.0"),(9 , "2021-03-02 00:00:00.0"),(10, "2021-04-02 00:00:00.0"),(11, "2021-05-02 00:00:00.0"),(12, "2021-06-02 00:00:00.0"),(13, "2021-07-22 00:00:00.0"))
val dfFromData1 = data.toDF("ID","COL_DATETIME").withColumn("COL_DATETIME",to_timestamp(col("COL_DATETIME")))
+------+---------------------+
|ID |COL_DATETIME |
+------+---------------------+
|1 |2020-07-02 00:00:00.0|
|2 |2020-08-02 00:00:00.0|
|3 |2020-09-02 00:00:00.0|
|4 |2020-10-02 00:00:00.0|
|5 |2020-11-02 00:00:00.0|
|6 |2020-12-02 00:00:00.0|
|7 |2021-01-02 00:00:00.0|
|8 |2021-02-02 00:00:00.0|
|9 |2021-03-02 00:00:00.0|
|10 |2021-04-02 00:00:00.0|
|11 |2021-05-02 00:00:00.0|
|12 |2021-06-02 00:00:00.0|
|13 |2021-07-22 00:00:00.0|
+------+---------------------+
OUTPUT:
DF1: Oldest month data
+------+---------------------+
|ID |COL_DATETIME |
+------+---------------------+
|1 |2020-07-02 00:00:00.0|
+------+---------------------+
DF2: Latest months' data after removing the oldest month's data from the original DF.
+------+---------------------+
|ID |COL_DATETIME |
+------+---------------------+
|2 |2020-08-02 00:00:00.0|
|3 |2020-09-02 00:00:00.0|
|4 |2020-10-02 00:00:00.0|
|5 |2020-11-02 00:00:00.0|
|6 |2020-12-02 00:00:00.0|
|7 |2021-01-02 00:00:00.0|
|8 |2021-02-02 00:00:00.0|
|9 |2021-03-02 00:00:00.0|
|10 |2021-04-02 00:00:00.0|
|11 |2021-05-02 00:00:00.0|
|12 |2021-06-02 00:00:00.0|
|13 |2021-07-22 00:00:00.0|
+------+---------------------+
Logic/approach:
Step 1: calculate the minimum datetime of the col_datetime column for the given dataframe and assign it to a mindate variable.
Let's assume I will get
var mindate = "2020-07-02 00:00:00.0"
val mindate = dfFromData1.select(min("COL_DATETIME")).first()
print(mindate)
Result:
mindate : org.apache.spark.sql.Row = [2020-07-02 00:00:00.0]
[2020-07-02 00:00:00.0]
Step 2: get the end date of the month using mindate. I haven't written the code for this part (getting enddatemonth from mindate) yet.
val enddatemonth = "2020-07-31 00:00:00.0"
Step 3: now I can use the enddatemonth variable to filter the Spark dataframe into DF1 and DF2 based on conditions.
Even when I tried to filter the dataframe based on mindate, I got an error:
val DF1 = dfFromData1.where(col("COL_DATETIME") <= enddatemonth)
val DF2 = dfFromData1.where(col("COL_DATETIME") > enddatemonth)
Error: <console>:166: error: type mismatch;
 found   : org.apache.spark.sql.Row
 required: org.apache.spark.sql.Column
val DF1 = dfFromData1.where(col("COL_DATETIME") <= mindate)
Thanks...!!
A similar approach, but I find it cleaner to just deal with months.
Idea: just as we have an epoch for seconds, compute one for months.
// epochMonth is a running month index (only its relative order matters),
// so the minimum value identifies the oldest month
val dfWithEpochMonth = dfFromData1.
  withColumn("year", year($"COL_DATETIME")).
  withColumn("month", month($"COL_DATETIME")).
  withColumn("epochMonth", (($"year" - 1970 - 1) * 12) + $"month")
Now your df will look like:
+---+-------------------+----+-----+----------+
| ID| COL_DATETIME|year|month|epochMonth|
+---+-------------------+----+-----+----------+
| 1|2020-07-02 00:00:00|2020| 7| 595|
| 2|2020-08-02 00:00:00|2020| 8| 596|
| 3|2020-09-02 00:00:00|2020| 9| 597|
| 4|2020-10-02 00:00:00|2020| 10| 598|
Now, you can calculate min epochMonth and filter directly.
val minEpochMonth = dfWithEpochMonth.select(min("epochMonth")).first().apply(0).toString().toInt
val df1 = dfWithEpochMonth.where($"epochMonth" <= minEpochMonth)
val df2 = dfWithEpochMonth.where($"epochMonth" > minEpochMonth)
You can drop unnecessary columns.
To address your error message:
val mindate = dfFromData1.select(min("COL_DATETIME")).first()
val mindateString = mindate.apply(0).toString()
Now you can use mindateString to filter.

Join with uneven columns

I have two dataframes structured the following way:
|Source|#Users|#Clicks|Hour|Type
and
Type|Total # Users|Hour
I'd like to join these dataframes based on hour; however, the first dataframe is at a deeper granularity than the second and therefore has more rows. Basically I want a dataframe where I have
|Source|#Users|#Clicks|Hour|Type|Total # Users
where the total # users is from the second dataframe. Any suggestions? I think I maybe want to use map?
Edit:
Here's an example
DF1
|Source|#Users|#Clicks|Hour|Type
|Prod1 |50 |3 |01 |Internet
|Prod2 |10 |2 |07 |iOS
|Prod3 |1 |50 |07 |Internet
|Prod2 |3 |2 |07 |Internet
|Prod3 |8 |2 |05 |Internet
DF2
|Type |Total #Users|Hour
|Internet|100 |01
|iOS |500 |01
|Internet|300 |07
|Internet|15 |05
|iOS |20 |07
Result
|Source|#Users|#Clicks|Hour|Type |Total #Users
|Prod1 |50 |3 |01 |Internet|100
|Prod2 |10 |2 |07 |iOS |20
|Prod3 |1 |50 |07 |Internet|300
|Prod2 |3 |2 |07 |Internet|300
|Prod3 |8 |2 |05 |Internet|15
That's a left join you're trying to do:
df1.join(df2, (df1.Hour == df2.Hour) & (df1.Type == df2.Type), "left_outer")
Short version: a left join keeps all the rows from df1 and joins them with the matching rows of df2 when the condition holds (nulls if there is no match, duplicates if there are multiple matches).
More info on Pyspark join
More info on SQL Joins types
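For completeness, here is a runnable PySpark sketch built from the sample data above; it uses the list-of-column-names form of join, a small variation that avoids duplicated Hour and Type columns in the result (the names df1 and df2 mirror the tables above):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df1 = spark.createDataFrame(
    [("Prod1", 50, 3, "01", "Internet"),
     ("Prod2", 10, 2, "07", "iOS"),
     ("Prod3", 1, 50, "07", "Internet"),
     ("Prod2", 3, 2, "07", "Internet"),
     ("Prod3", 8, 2, "05", "Internet")],
    ["Source", "#Users", "#Clicks", "Hour", "Type"])

df2 = spark.createDataFrame(
    [("Internet", 100, "01"), ("iOS", 500, "01"),
     ("Internet", 300, "07"), ("Internet", 15, "05"), ("iOS", 20, "07")],
    ["Type", "Total #Users", "Hour"])

# Joining on a list of column names keeps a single Hour and Type column
result = df1.join(df2, on=["Hour", "Type"], how="left")
result.show()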

How to create pairs of nodes in Spark?

I have the following DataFrame in Spark and Scala:
group nodeId date
1 1 2016-10-12T12:10:00.000Z
1 2 2016-10-12T12:00:00.000Z
1 3 2016-10-12T12:05:00.000Z
2 1 2016-10-12T12:30:00.000Z
2 2 2016-10-12T12:35:00.000Z
I need to group records by group, sort them in ascending order by date and make pairs of sequential nodeId. Also, date should be converted to Unix epoch.
This can be better explained with the expected output:
group nodeId_1 nodeId_2 date
1 2 3 2016-10-12T12:00:00.000Z
1 3 1 2016-10-12T12:05:00.000Z
2 1 2 2016-10-12T12:30:00.000Z
This is what I did so far:
df
.groupBy("group")
.agg($"nodeId",$"date")
.orderBy(asc("date"))
But I don't know how to create pairs of nodeId.
You can use a Window function with the lead built-in function to create the pairs, and the to_utc_timestamp built-in function to convert the date. Finally, you have to filter out the unpaired rows, as you don't need them in the output.
The following program implements the explanation above; I have added comments for clarity.
import org.apache.spark.sql.expressions._
def windowSpec = Window.partitionBy("group").orderBy("date") //defining window function grouping by group and ordering by date
import org.apache.spark.sql.functions._
df.withColumn("date", to_utc_timestamp(col("date"), "Asia/Kathmandu")) //converting the date to epoch datetime you can choose other timezone as required
.withColumn("nodeId_2", lead("nodeId", 1).over(windowSpec)) //using window for creating pairs
.filter(col("nodeId_2").isNotNull) //filtering out the unpaired rows
.select(col("group"), col("nodeId").as("nodeId_1"), col("nodeId_2"), col("date")) //selecting as required final dataframe
.show(false)
You should get the final dataframe as required
+-----+--------+--------+-------------------+
|group|nodeId_1|nodeId_2|date |
+-----+--------+--------+-------------------+
|1 |2 |3 |2016-10-12 12:00:00|
|1 |3 |1 |2016-10-12 12:05:00|
|2 |1 |2 |2016-10-12 12:30:00|
+-----+--------+--------+-------------------+
I hope the answer is helpful.
Note: to get the correct date I have used Asia/Kathmandu as the timezone.
If I understand your requirement correctly, you can use a self-join on group and a < inequality condition on nodeId:
val df = Seq(
(1, 1, "2016-10-12T12:10:00.000Z"),
(1, 2, "2016-10-12T12:00:00.000Z"),
(1, 3, "2016-10-12T12:05:00.000Z"),
(2, 1, "2016-10-12T12:30:00.000Z"),
(2, 2, "2016-10-12T12:35:00.000Z")
).toDF("group", "nodeId", "date")
df.as("df1").join(
df.as("df2"),
$"df1.group" === $"df2.group" && $"df1.nodeId" < $"df2.nodeId"
).select(
$"df1.group", $"df1.nodeId", $"df2.nodeId",
when($"df1.date" < $"df2.date", $"df1.date").otherwise($"df2.date").as("date")
)
// +-----+------+------+------------------------+
// |group|nodeId|nodeId|date |
// +-----+------+------+------------------------+
// |1 |1 |3 |2016-10-12T12:05:00.000Z|
// |1 |1 |2 |2016-10-12T12:00:00.000Z|
// |1 |2 |3 |2016-10-12T12:00:00.000Z|
// |2 |1 |2 |2016-10-12T12:30:00.000Z|
// +-----+------+------+------------------------+