PySpark: Identify duplicates based on partial matches from a column

I am looking for an approach to identify duplicates based on a partial match in a PySpark DataFrame column, rank them, and tag the id of the first match in a newly generated column.
Input DataFrame:
id  name             class
1   Roger Fernandes  12
2   Kevin Kingsely   11
3   Fernandes Roger  13
4   Jack Sparrow     14
5   Roger thinker    16
6   Ro seman         17
Output DataFrame:
id  name            class  duplicate  duplicate_id
1   Roger Ferna     12     yes        1
2   Kevin Kingsely  11     no         None
3   Ferna Roger     13     yes        1
4   Jack Sparrow    14     no         None
5   Roger think     16     yes        1
6   Ro seman        17     no         None
Note: we can consider a partial match to be any overlap with a weightage of more than 50%.
Tried:
from pyspark.sql import Window
from pyspark.sql.functions import rank

partition_dataframe = Window.partitionBy(["name"]).orderBy(order_by_column)
data_frame = data_frame.withColumn("Duplicated_Rank", rank().over(partition_dataframe))
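This window only ranks exact name matches, so partial matches are never grouped together. As a rough illustration of one possible direction (not a verified solution: the token split on spaces, the >= 0.5 reading of ">50%", and the duplicate_id column name are assumptions), a token-overlap self-join can tag each row with the smallest earlier id whose name shares at least half of its tokens; tagging the first occurrence with its own id, as in the expected output, would still need an extra step:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, "Roger Fernandes", 12), (2, "Kevin Kingsely", 11),
     (3, "Fernandes Roger", 13), (4, "Jack Sparrow", 14),
     (5, "Roger thinker", 16), (6, "Ro seman", 17)],
    ["id", "name", "class"])

# Split each name into lower-cased tokens.
tokens = df.select("id", F.split(F.lower("name"), " ").alias("tokens"))

# Compare every row against rows with a smaller id and measure token overlap.
pairs = (tokens.alias("l")
         .join(tokens.alias("r"), F.col("r.id") < F.col("l.id"))
         .withColumn("overlap",
                     F.size(F.array_intersect(F.col("l.tokens"), F.col("r.tokens")))
                     / F.size(F.col("l.tokens")))
         .filter(F.col("overlap") >= 0.5)  # ">50%" threshold, interpreted loosely here
         .groupBy(F.col("l.id").alias("id"))
         .agg(F.min(F.col("r.id")).alias("duplicate_id")))

result = (df.join(pairs, "id", "left")
          .withColumn("duplicate",
                      F.when(F.col("duplicate_id").isNotNull(), "yes").otherwise("no"))
          .orderBy("id"))
result.show()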

Related

Create a range of dates in a pyspark DataFrame

I have the following abstracted DataFrame (my original DF has 60 billion+ lines):
Id Date Val1 Val2
1 2021-02-01 10 2
1 2021-02-05 8 4
2 2021-02-03 2 0
1 2021-02-07 12 5
2 2021-02-05 1 3
My expected output is:
Id Date Val1 Val2
1 2021-02-01 10 2
1 2021-02-02 10 2
1 2021-02-03 10 2
1 2021-02-04 10 2
1 2021-02-05 8 4
1 2021-02-06 8 4
1 2021-02-07 12 5
2 2021-02-03 2 0
2 2021-02-04 2 0
2 2021-02-05 1 3
Basically, what I need is: if Val1 or Val2 changes within a period of time, all the dates between those two dates must carry the values from the previous date. (To see this more clearly, look at Id 2.)
I know that I can do this in many ways (window functions, UDFs, ...), but my doubt is: since my original DF has more than 60 billion rows, what is the best approach for this processing?
I think the best approach (performance-wise) is to perform an inner join (probably with broadcasting). If you are worried about the number of records, I suggest you run them in batches (split by record count, by date, or even by a random key). The general idea is just to avoid running everything at once.
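As a rough sketch of that join idea (not part of the original answer; the calendar bounds, column names, and the use of lead to compute each record's validity window are assumptions), a small calendar DataFrame covering the full date range can be broadcast and joined against each record's [Date, next Date) interval:

from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, "2021-02-01", 10, 2), (1, "2021-02-05", 8, 4), (2, "2021-02-03", 2, 0),
     (1, "2021-02-07", 12, 5), (2, "2021-02-05", 1, 3)],
    ["Id", "Date", "Val1", "Val2"]).withColumn("Date", F.to_date("Date"))

# Each record is valid from its own Date up to (but excluding) the next Date for the same Id.
w = Window.partitionBy("Id").orderBy("Date")
bounded = df.withColumn("next_date", F.lead("Date").over(w))

# Small calendar covering the whole range -- cheap to broadcast.
calendar = spark.sql(
    "SELECT explode(sequence(to_date('2021-02-01'), to_date('2021-02-07'))) AS cal_date")

filled = (bounded
          .join(F.broadcast(calendar),
                (F.col("cal_date") >= F.col("Date")) &
                (F.col("cal_date") < F.coalesce(F.col("next_date"),
                                                F.date_add(F.col("Date"), 1))))
          .select("Id", F.col("cal_date").alias("Date"), "Val1", "Val2"))

filled.orderBy("Id", "Date").show()

The same join could then be run in batches (by Id ranges or by date windows) to avoid processing all 60 billion rows at once.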

aggregating with a condition in groupby spark dataframe

I have a dataframe
id lat long lag_lat lag_long detector lag_interval gpsdt lead_gpsdt
1 12 13 12 13 1 [1.5,3.5] 4 4.5
1 12 13 12 13 1 null 4.5 5
1 12 13 12 13 1 null 5 5.5
1 12 13 12 13 1 null 5.5 6
1 13 14 12 13 2 null 6 6.5
1 13 14 13 14 2 null 6.5 null
2 13 14 13 14 2 [0.5,1.5] 2.5 3.5
2 13 14 13 14 2 null 3.5 4
2 13 14 13 14 2 null 4 null
So I want to apply a condition while aggregating after a groupBy on col("id") and col("detector"): if lag_interval in that group has any non-null value, then in the aggregation I want two columns, one being
min("lag_interval.col1") and the other max("lead_gpsdt").
If the above condition is not met, then I want
min("gpsdt"), max("lead_gpsdt")
Using this approach I want to get the data with the condition applied:
df.groupBy("detector","id").agg(first("lat-long").alias("start_coordinate"),
last("lat-long").alias("end_coordinate"),struct(min("gpsdt"), max("lead_gpsdt")).as("interval"))
output
id interval start_coordinate end_coordinate
1 [1.5,6] [12,13] [13,14]
1 [6,6.5] [13,14] [13,14]
2 [0.5,4] [13,14] [13,14]
For more explanation:
If we look at one of the groups that groupBy("id", "detector") takes out:
if, in that group of data, at least one of the values in col("lag_interval") is not null, then we need to use an aggregation like min("lag_interval.col1"), max("lead_gpsdt").
This condition applies to the set of data below:
id lat long lag_lat lag_long detector lag_interval gpsdt lead_gpsdt
1 12 13 12 13 1 [1.5,3.5] 4 4.5
1 12 13 12 13 1 null 4.5 5
1 12 13 12 13 1 null 5 5.5
1 12 13 12 13 1 null 5.5 6
And if all the values of col("lag_interval") are null in that group of data, then we need the aggregation output as
min("gpsdt"), max("lead_gpsdt")
This condition applies to the set of data below:
id lat long lag_lat lag_long detector lag_interval gpsdt lead_gpsdt
1 13 14 12 13 2 null 6 6.5
1 13 14 13 14 2 null 6.5 null
The conditional dilemma that you have can be solved by using the simple when built-in function, as suggested below:
import org.apache.spark.sql.functions._

df.groupBy("id", "detector")
  .agg(
    struct(
      when(isnull(min("lag_interval.col1")), min("gpsdt")).otherwise(min("lag_interval.col1")).as("min"),
      max("lead_gpsdt").as("max")
    ).as("interval")
  )
This should give you the output:
+---+--------+----------+
|id |detector|interval |
+---+--------+----------+
|2 |2 |[0.5, 4.0]|
|1 |2 |[6.0, 6.5]|
|1 |1 |[1.5, 6.0]|
+---+--------+----------+
And I guess you must already have an idea of how to do first("lat-long").alias("start_coordinate") and last("lat-long").alias("end_coordinate"), as you have done.
I hope the answer is helpful.

Combining rows in a Spark dataframe

If I have an input as below:
sno name time
1 hello 1
1 hello 2
1 hai 3
1 hai 4
1 hai 5
1 how 6
1 how 7
1 are 8
1 are 9
1 how 10
1 how 11
1 are 12
1 are 13
1 are 14
I want to combine the fields having similar values in name as the below output format:
sno name timestart timeend
1 hello 1 2
1 hai 3 5
1 how 6 7
1 are 8 9
1 how 10 11
1 are 12 14
The input will be sorted according to time, and only the records that have the same name over consecutive time intervals must be merged.
I am trying to do this using Spark but cannot figure out a way to do it with Spark functions, since I am new to Spark. Any suggestions on the approach will be appreciated.
I tried thinking of writing a user-defined function and applying maps to the data frame, but I could not come up with the right logic for the function.
PS: I am trying to do this using Scala Spark.
One way to do so would be to use a plain SQL query.
Let's say df is your input dataframe.
val viewName = "dataframe"
df.createOrReplaceTempView(viewName)

def query(viewName: String): String =
  s"SELECT sno, name, MIN(time) AS timestart, MAX(time) AS timeend FROM $viewName GROUP BY sno, name"

spark.sql(query(viewName))
You can of course do the same with the DataFrame API. This would be something like:
df.groupBy($"sno", $"name")
  .agg(min($"time").as("timestart"), max($"time").as("timeend"))

Tableau Pivot Rows into Columns

I have a table structure like this:
Department Employee Class Period Qty1 Qty2 Qty3
----------------------------------------------------
Dept1 John 1 1st 1 2 3
Dept1 John 1 2nd 11 22 33
Dept1 Mary 1 1st 2 3 4
Dept1 Mary 1 2nd 22 33 44
Dept2 Joe 1 1st 3 4 5
Dept2 Joe 1 2nd 33 44 55
Dept2 Paul 1 1st 4 5 6
Dept2 Paul 1 2nd 44 55 66
In a view I'd like to display the format as such:
Class / Period
1
Department Employee 1st 2nd
----------------------------------------------
Dept1 John 1 2 3 11 22 33
Dept1 Mary 2 3 4 22 33 44
Dept2 Joe 3 4 5 33 44 55
Dept2 Paul 4 5 6 44 55 66
I can't seem to find a way to do this. I have Class and Period as Columns and Department and Employee as Rows, then drag Qty1, Qty2, Qty3 to the Text mark, but the format becomes:
Class / Period
1
Department Employee 1st 2nd
----------------------------------------------
Dept1 John 1 11
2 22
3 33
Dept1 Mary 2 22
3 33
4 44
Dept2 Joe 3 33
4 44
5 55
Dept2 Paul 4 44
5 55
6 66
How do I turn those rows under each employee to sub-columns under Period?
I think this is what you're trying to achieve.
A lot of times when you see a repeating column in a database table, Qty1, Qty2, Qty3, it is a sign that you really want multiple rows each with a single Qty (and repeating the other information) -- At least when you are building reports. That way you can have rows with any number of instances of Qty, and you can also easily aggregate all the Qty together when needed.
There are situations where you may want to stick with a repeating field design. But if you do want to reshape the data, you can do that in Tableau's data connection window by selecting the columns you want to pull out into a single field and selecting the pivot command.

KDB: Aggregate across consecutive rows with common label

I would like to sum across consecutive rows that share the same label. Any very simple ways to do this?
Example: I start with this table...
qty flag
1 OFF
3 ON
2 ON
2 OFF
9 OFF
4 ON
... and would like to generate...
qty flag
1 OFF
5 ON
11 OFF
4 ON
One method:
q)show t:flip`qty`flag!(1 3 2 2 9 4;`OFF`ON`ON`OFF`OFF`ON)
qty flag
--------
1 OFF
3 ON
2 ON
2 OFF
9 OFF
4 ON
q)show result:select sum qty by d:sums differ flag,flag from t
d flag1| qty
----------| ---
1 OFF | 1
2 ON | 5
3 OFF | 11
4 ON | 4
Then to get it in the format you require:
q)`qty`flag#0!result
qty flag
--------
1 OFF
5 ON
11 OFF
4 ON