I have this dataframe below:
+--------+----------+----------+
|SID |Date |Attribute |
+--------+----------+----------+
|1001 |2021-01-01|Y |
|1001 |2021-05-31|N |
|1001 |2021-05-15|N |
|1002 |2021-05-31|N |
|1002 |2021-04-06|N |
|1003 |2021-01-01|Y |
|1003 |2021-02-01|N |
|1004 |2021-03-30|N |
+--------+----------+----------+
I'm trying to get a result like the one below.
+--------+----------+----------+
|SID |Date |Attribute |
+--------+----------+----------+
|1001 |2021-01-01|Y |
|1002 |2021-05-31|N |
|1002 |2021-04-06|N |
|1003 |2021-01-01|Y |
|1004 |2021-03-30|N |
+--------+----------+----------+
I want to exclude a record when a duplicate SID has Y in the Attribute of one of its rows, but keep all the records for a SID if its Attribute values are only N.
I think a window partition with a filter can help, but I'm not sure how to do it with the conditions I mentioned. Is there any way this can be achieved in PySpark? I saw a similar post, but it was for Scala SQL and not for PySpark.
from pyspark.sql import Window
import pyspark.sql.functions as F
# Window over each SID, ordered by Date, spanning every row in the group
h = Window.partitionBy('SID').orderBy('Date').rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
(
    # Propagate the first Attribute value (by Date) to every row of its SID
    df.withColumn('filt', F.first('Attribute').over(h))
    # Keep only rows whose Attribute matches that first value, then drop the helper column
    .filter(F.col('Attribute') == F.col('filt')).drop('filt')
).show()
+----+----------+---------+
| SID| Date|Attribute|
+----+----------+---------+
|1001|2021-01-01| Y|
|1002|2021-04-06| N|
|1002|2021-05-31| N|
|1003|2021-01-01| Y|
|1004|2021-03-30| N|
+----+----------+---------+
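Note that this works because the 'Y' row happens to be the earliest Date within each SID. If that ordering isn't guaranteed, a variant that doesn't depend on the date at all (a minimal sketch, not part of the original answer) is to compare against the per-SID maximum of Attribute, since 'Y' sorts after 'N':
from pyspark.sql import Window
import pyspark.sql.functions as F

# Unordered window over each SID; the max of Attribute is 'Y' whenever any row has 'Y'
w = Window.partitionBy('SID')

(
    df.withColumn('max_attr', F.max('Attribute').over(w))
      # keep the Y row when one exists, otherwise keep all the N rows
      .filter(F.col('Attribute') == F.col('max_attr'))
      .drop('max_attr')
).show()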
Let's say I have the following pandas dataframe with a non-standard timestamp column that is not in datetime format. I need to add a new column, converting it into a 24-hourly-based timestamp for time-series visualization purposes, by:
df['timestamp(24hrs)'] = round(df['timestamp(sec)']/(24*3600))
and get this:
+----------------+----+-----+
|timestamp(24hrs)|User|count|
+----------------+----+-----+
|0.0 |U100|435 |
|1.0 |U100|1091 |
|2.0 |U100|992 |
|3.0 |U100|980 |
|4.0 |U100|288 |
|8.0 |U100|260 |
|9.0 |U100|879 |
|10.0 |U100|875 |
|11.0 |U100|911 |
|13.0 |U100|628 |
|14.0 |U100|642 |
|16.0 |U100|631 |
|17.0 |U100|233 |
... ... ...
|267.0 |U100|1056 |
|269.0 |U100|878 |
|270.0 |U100|256 |
+----------------+----+-----+
Now I noticed that some records' timestamps are missing, and I need to impute those missing data:
timestamp(24hrs) in continuous order
count value by 0
Expected output:
+----------------+----+-----+
|timestamp(24hrs)|User|count|
+----------------+----+-----+
|0.0 |U100|435 |
|1.0 |U100|1091 |
|2.0 |U100|992 |
|3.0 |U100|980 |
|4.0 |U100|288 |
|5.0 |U100|0 |
|6.0 |U100|0 |
|7.0 |U100|0 |
|8.0 |U100|260 |
|9.0 |U100|879 |
|10.0 |U100|875 |
|11.0 |U100|911 |
|12.0 |U100|0 |
|13.0 |U100|628 |
|14.0 |U100|642 |
|15.0 |U100|0 |
|16.0 |U100|631 |
|17.0 |U100|233 |
... ... ...
|267.0 |U100|1056 |
|268.0 |U100|0 |
|269.0 |U100|878 |
|270.0 |U100|256 |
+----------------+----+-----+
Any idea how I can do this? Based on this answer for standard timestamps, I imagine I need to create a new timestamp(24hrs) column spanning from the start to the end of the previous one and do a left join() & crossJoin(), but I couldn't manage it yet.
I've tried the following unsuccessfully:
import pyspark.sql.functions as F
all_dates_df = df.selectExpr(
"sequence(min(timestamp(24hrs)), max(timestamp(24hrs)), interval 1 hour) as hour"
).select(F.explode("timestamp(24hrs)").alias("timestamp(24hrs)"))
all_dates_df.show()
result_df = all_dates_df.crossJoin(
df.select("UserName").distinct()
).join(
df,
["count", "timestamp(24hrs)"],
"left"
).fillna(0)
result_df.show()
The sequence function works on integer (and date/timestamp) values, not doubles, so you need to cast the column to integer and then cast back to double if you want to keep it as a double.
from pyspark.sql.types import IntegerType, DoubleType

# build the full range of hourly buckets from min to max, then cast back to double
df_seq = (df.withColumn('time_int', F.col('timestamp(24hrs)').cast(IntegerType()))
            .select(F.explode(F.sequence(F.min('time_int'), F.max('time_int'))).alias('timestamp(24hrs)'))
            .select(F.col('timestamp(24hrs)').cast(DoubleType())))

# pair every bucket with every user, left join the original data, and fill missing counts with 0
df = (df_seq.crossJoin(df.select("User").distinct())
            .join(df, on=['User', 'timestamp(24hrs)'], how='left')
            .fillna(0))
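If each User can cover a different time range and you'd rather not crossJoin one global range onto every user, a per-user variant is possible. This is just a sketch under that assumption, not part of the original answer:
import pyspark.sql.functions as F
from pyspark.sql.types import IntegerType, DoubleType

# build each user's own continuous range of hourly buckets
per_user_seq = (df.withColumn('time_int', F.col('timestamp(24hrs)').cast(IntegerType()))
                  .groupBy('User')
                  .agg(F.sequence(F.min('time_int'), F.max('time_int')).alias('seq'))
                  .select('User', F.explode('seq').alias('time_int'))
                  .select('User', F.col('time_int').cast(DoubleType()).alias('timestamp(24hrs)')))

# left join the original counts and fill the gaps with 0
result = (per_user_seq.join(df, on=['User', 'timestamp(24hrs)'], how='left')
                      .fillna(0, subset=['count'])
                      .orderBy('User', 'timestamp(24hrs)'))
result.show()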
I am rewriting legacy SAS code in PySpark. In one of those blocks, the SAS code uses the lag function. The way I understood the notes, an ID is a duplicate if it has two intake dates that are less than 4 days apart.
/*Next create flag if the same ID has two intake dates less than 4 days apart*/
/*Data MUST be sorted by ID and DESCENDING IntakeDate!!!*/
data duplicates (drop= lag_ID lag_IntakeDate);
set df2;
by ID;
lag_ID = lag(ID);
lag_INtakeDate = lag(IntakeDate);
if ID = lag_ID then do;
intake2TIME = intck('day', lag_IntakeDate, IntakeDate);
end;
if 0 <= abs(intake2TIME) < 4 then DUPLICATE = 1;
run;
/* If DUPLICATE = 1, then it is a duplicate and eventually will be dropped. */
I tried to meet the condition described in the comments: I pulled, via SQL, the ID and intake dates ordered by ID and descending intake date:
SELECT ID, intakeDate, col3, col4
from df order by ID, intakeDate DESC
I googled the lag equivalent and this is what I found:
https://www.educba.com/pyspark-lag/
However, I have not used window functions before, and the concept the site introduces doesn't quite make sense to me yet, though I tried the following to check whether my understanding of WHERE EXISTS might work:
SELECT *
FROM df
WHERE EXISTS (
SELECT *
FROM df v2
WHERE df.ID = v2.ID AND DATEDIFF(df.IntakeDate, v2.IntakeDate) > 4 )
/* not sure about the second condition, though */
Initial df
+-----------+------------------+
| Id| IntakeDate|
+-----------+------------------+
| 5.0| 2021-04-14|
| 5.0| 2021-05-06|
| 5.0| 2021-05-08|
| 10.0| 2021-04-21|
| 10.0| 2021-05-25|
| 14.0| 2021-03-08|
| 14.0| 2021-03-09|
| 14.0| 2021-09-30|
| 14.0| 2022-04-08|
| 15.0| 2021-04-27|
| 15.0| 2021-05-18|
| 15.0| 2022-01-17|
| 26.0| 2021-08-27|
| 26.0| 2021-09-17|
+-----------+------------------+
The expected df will have a row dropped when the next intake date is less than 4 days after the prior date:
+-----------+------------------+
| Id| IntakeDate|
+-----------+------------------+
| 5.0| 2021-04-14|
| 5.0| 2021-05-06| row to drop
| 5.0| 2021-05-08|
| 10.0| 2021-04-21|
| 10.0| 2021-05-25|
| 14.0| 2021-03-08| row to drop
| 14.0| 2021-03-09|
| 14.0| 2021-09-30|
| 14.0| 2022-04-08|
| 15.0| 2021-04-27|
| 15.0| 2021-05-18|
| 15.0| 2022-01-17|
| 26.0| 2021-08-27|
| 26.0| 2021-09-17|
+-----------+------------------+
Please try the following code:
from pyspark.sql import Window
import pyspark.sql.functions as F

lead_over_id = Window.partitionBy('Id').orderBy('IntakeDate')

df = (df
    # lag with a negative offset gives the *next* intake date within each Id
    .withColumn('lead_1_date', F.lag('IntakeDate', -1).over(lead_over_id))
    # days from this intake to the next one (null for the last intake of an Id)
    .withColumn('date_diff', F.datediff('lead_1_date', 'IntakeDate'))
    # keep a row when there is no later intake, or the next intake is at least 4 days away
    .where((F.col('date_diff') >= 4) | F.col('date_diff').isNull())
    .drop('lead_1_date', 'date_diff')
)
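As a side note, F.lag('IntakeDate', -1) is simply the lead of the column; if you find it easier to read, the same filter can be written with F.lead (an equivalent sketch, same assumptions as above):
import pyspark.sql.functions as F
from pyspark.sql import Window

w = Window.partitionBy('Id').orderBy('IntakeDate')

df_kept = (df
    .withColumn('next_date', F.lead('IntakeDate', 1).over(w))
    # keep a row when there is no later intake, or the next intake is at least 4 days away
    .where(F.col('next_date').isNull() | (F.datediff('next_date', 'IntakeDate') >= 4))
    .drop('next_date')
)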
I have the following parquet file:
+--------------+------------+-------+
|gf_cutoff | country_id |gf_mlt |
+--------------+------------+-------+
|2020-12-14 |DZ |5 |
|2020-08-06 |DZ |4 |
|2020-07-03 |DZ |4 |
|2020-12-14 |LT |1 |
|2020-08-06 |LT |1 |
|2020-07-03    |LT          |1      |
+--------------+------------+-------+
As you can see, it is partitioned by country_id and ordered by gf_cutoff DESC. What I want to do is compare gf_mlt values to check whether the value has changed. To do that, I want to compare the most recent gf_cutoff with the second most recent one.
An example of this case would be to compare:
2020-12-14 DZ 5
with
2020-08-06 DZ 4
If the value for the most recent date differs from the second row's value, I want to write the most recent value (which is 5 for DZ) into a new column, and put True in another column if the value has changed, or False if it has not changed.
After doing this comparison, delete the rows for the older dates.
For DZ the value has changed, and for LT it hasn't changed because it is 1 the whole time.
So the output would be like this:
+--------------+------------+-------+------------+-----------+
|gf_cutoff | country_id |gf_mlt | Has_change | old_value |
+--------------+------------+-------+------------+-----------+
|2020-12-14 |DZ |5 | True | 4 |
|2020-12-14    |LT          |1      | False      | 1         |
+--------------+------------+-------+------------+-----------+
If you need more explanation, just tell me.
You can use lag over an appropriate window to get the previous value, keep only the most recent row per country_id using row_number over another window ordered by gf_cutoff descending, and compare the two values to get the changed flag:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val df2 = df.withColumn(
  "last_value",
  lag("gf_mlt", 1).over(Window.partitionBy("country_id").orderBy("gf_cutoff"))
).withColumn(
  "rn",
  row_number().over(Window.partitionBy("country_id").orderBy(desc("gf_cutoff")))
).filter("rn = 1").withColumn(
  "changed",
  $"gf_mlt" =!= $"last_value"  // true when the most recent value differs from the previous one
).drop("rn")
df2.show
+----------+----------+------+----------+-------+
| gf_cutoff|country_id|gf_mlt|last_value|changed|
+----------+----------+------+----------+-------+
|2020-12-14| DZ| 5| 4| true|
|2020-12-14| LT| 1| 1| false|
+----------+----------+------+----------+-------+
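If you end up needing this from PySpark rather than Scala, a rough equivalent (a sketch only, using the column names from your expected output) would be:
import pyspark.sql.functions as F
from pyspark.sql import Window

w_asc = Window.partitionBy('country_id').orderBy('gf_cutoff')
w_desc = Window.partitionBy('country_id').orderBy(F.desc('gf_cutoff'))

df2 = (df
    .withColumn('old_value', F.lag('gf_mlt', 1).over(w_asc))  # value at the previous cutoff
    .withColumn('rn', F.row_number().over(w_desc))            # 1 = most recent cutoff
    .filter(F.col('rn') == 1)
    # null when a country only has a single cutoff
    .withColumn('Has_change', F.col('gf_mlt') != F.col('old_value'))
    .drop('rn')
)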
I have an input dataframe as below with id, app, and customer columns.
Input dataframe
+--------------------+-----+---------+
| id|app |customer |
+--------------------+-----+---------+
|id1 | fw| WM |
|id1 | fw| CS |
|id2 | fw| CS |
|id1 | fe| WM |
|id3 | bc| TR |
|id3 | bc| WM |
+--------------------+-----+---------+
Expected output
Using pivot and aggregate - make the app values column names and put the aggregated customer names as a list in the dataframe.
Expected dataframe
+--------------------+----------+-------+----------+
| id| bc | fe| fw |
+--------------------+----------+-------+----------+
|id1 | 0 | WM| [WM,CS]|
|id2 | 0 | 0| [CS] |
|id3 | [TR,WM] | 0| 0 |
+--------------------+----------+-------+----------+
What have I tried?
val newDF =
df.groupBy("id").pivot("app").agg(expr("coalesce(first(customer),0)")).drop("app").show()
+--------------------+-----+-------+------+
| id|bc | fe| fw|
+--------------------+-----+-------+------+
|id1 | 0 | WM| WM|
|id2 | 0 | 0| CS|
|id3 | TR | 0| 0|
+--------------------+-----+-------+------+
Issue: In my query, I am not able to get the list of customers like [WM,CS] for "id1" under "fw" (as shown in the expected output); only "WM" is coming. Similarly, for "id3" only "TR" appears, whereas a list with the value [TR,WM] should appear under "bc" for "id3".
I need your suggestion on how to get the list of customers under each app.
You can use collect_list if you can live with an empty list in the cells where you wanted a zero:
df.groupBy("id").pivot("app").agg(collect_list("customer")).show
+---+--------+----+--------+
| id| bc| fe| fw|
+---+--------+----+--------+
|id3|[TR, WM]| []| []|
|id1| []|[WM]|[CS, WM]|
|id2| []| []| [CS]|
+---+--------+----+--------+
Using concat_ws we can flatten the array into a comma-separated string, which removes the square brackets:
df.groupBy("id").pivot("app").agg(concat_ws(",",collect_list("customer")))
Consider the following dataframe:
+-------+-----------+-------+
| rid| createdon| count|
+-------+-----------+-------+
| 124| 2017-06-15| 1 |
| 123| 2017-06-14| 2 |
| 123| 2017-06-14| 1 |
+-------+-----------+-------+
I need to sum the count column across rows that have the same createdon and rid.
Therefore the resultant dataframe should be as follows:
+-------+-----------+-------+
| rid| createdon| count|
+-------+-----------+-------+
| 124| 2017-06-15| 1 |
| 123| 2017-06-14| 3 |
+-------+-----------+-------+
I am using Spark 2.0.2.
I have tried agg, conditions inside select, etc., but couldn't find a solution. Can anyone help me?
Try this
import org.apache.spark.sql.{functions => func}
df.groupBy($"rid", $"createdon").agg(func.sum($"count").alias("count"))
this should do what you want:
import org.apache.spark.sql.functions.sum
df
.groupBy($"rid",$"createdon")
.agg(sum($"count").as("count"))
.show
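In case you are working from PySpark instead, the equivalent is a plain groupBy with a sum aggregation (minimal sketch):
import pyspark.sql.functions as F

df.groupBy('rid', 'createdon').agg(F.sum('count').alias('count')).show()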