Add an increasing number for each group in a column of a Spark dataframe - Scala

I have a dataframe with two columns, "Id" and "Category". For each category, I want to label-encode the column "Id", so the expected outcome is the column "Enc_id" shown below:
Id Category Enc_id
a1 A 0
a2 A 1
b1 B 0
c1 C 0
c2 C 1
a3 A 2
b2 B 1
b3 B 2
b4 B 3
b4 B 3
b3 B 2
Here, the Id may not be unique, so there may be duplicated rows. I thought of creating a window with partitionBy(Category) and then applying the label encoding (StringIndexer) over this window, but it didn't work. Any hint, please?

You can use a window function together with the substring function and calculate the rank:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._

val window = Window.partitionBy($"Category", substring($"Id", 1, 1)).orderBy("Id")

df.withColumn("Enc_id", rank().over(window) - 1) // -1 to start the rank from 0
  .show(false)
Output:
+---+--------+------+
|Id |Category|Enc_id|
+---+--------+------+
|a1 |A |0 |
|a2 |A |1 |
|a3 |A |2 |
|c1 |C |0 |
|c2 |C |1 |
|b1 |B |0 |
|b2 |B |1 |
|b3 |B |2 |
|b4 |B |3 |
+---+--------+------+
Update 1:
For the updated case with duplicate Ids:
df1.groupBy("Id", "Category")
.agg(collect_list("Category") as "list_category")
.withColumn("Enc_id", rank().over(window) - 1)
.withColumn("Category", explode($"list_category"))
.drop("list_category")
.show(false)
Output:
+---+--------+------+
|Id |Category|Enc_id|
+---+--------+------+
|a1 |A |0 |
|a2 |A |1 |
|a3 |A |2 |
|c1 |C |0 |
|c2 |C |1 |
|b1 |B |0 |
|b2 |B |1 |
|b3 |B |2 |
|b3 |B |2 |
|b4 |B |3 |
|b4 |B |3 |
+---+--------+------+
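As an alternative for the duplicate case, here is a minimal sketch (assuming Enc_id should simply follow the sort order of Id within each Category): dense_rank assigns the same value to identical Ids directly, without the groupBy/explode round trip.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// dense_rank gives equal Ids the same rank and leaves no gaps,
// so the duplicated rows (b3, B) and (b4, B) keep Enc_id 2 and 3
val w = Window.partitionBy($"Category").orderBy($"Id")
df1.withColumn("Enc_id", dense_rank().over(w) - 1).show(false)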

Related

Using PySpark, Incrementing frequency in data, for a specific number?

I have a dataset like:
a
c
c
d
b
a
a
d
d
c
c
b
a
b
I want to add a column that looks like the one below. Whenever 'c' is reached, the new column is set to zero, and the value is then increased by one for the rows that follow. Is there a way to do this using PySpark?
a 1
c 0
c 0
d 2
b 2
a 2
a 2
d 2
d 2
c 0
c 0
b 3
a 3
b 3
I have tried the code below but it is not working.
from pyspark.sql.functions import col, when, lag, sum
s = df.filter(col("col") == 'c')
df = df.withColumn("new", when(s.neq(lag("s", 1).over()), sum("s").over(Window.orderBy("index"))).otherwise(0))
The following solution uses PySpark SQL functions to implement the logic requested above.
Set-Up
Create a DataFrame to mimic the example provided
from pyspark.sql import functions as F
from pyspark.sql.window import Window

df = spark.createDataFrame(
    [('a',), ('c',), ('c',), ('d',), ('b',),
     ('a',), ('a',), ('d',), ('d',), ('c',),
     ('c',), ('b',), ('a',), ('b',)],
    ['id'])
Output
+---+
|id |
+---+
|a |
|c |
|c |
|d |
|b |
|a |
|a |
|d |
|d |
|c |
|c |
|b |
|a |
|b |
+---+
Logic
Calculate a row number to preserve the original order:
df = df.withColumn("row_num", F.row_number().over(Window.orderBy(F.monotonically_increasing_id())))
Use row number to determine the preceding id value (the lag). There is no preceding id for the first row so the lag results in a null - set this missing value to "c".
df = df.withColumn("lag_id", F.lag("id",1).over(Window.orderBy("row_num")))
df = df.na.fill(value="c", subset=['lag_id'])
Output
+---+--------------+------+
|id | row_num |lag_id|
+---+--------------+------+
|a |1 |c |
|c |2 |a |
|c |3 |c |
|d |4 |c |
|b |5 |d |
|a |6 |b |
|a |7 |a |
|d |8 |a |
|d |9 |d |
|c |10 |d |
|c |11 |c |
|b |12 |c |
|a |13 |b |
|b |14 |a |
+---+--------------+------+
Determine order (sequence) for rows that immediately follow a row where id = "c"
df_sequence = df.filter((df.id != "c") & (df.lag_id == "c"))
df_sequence = df_sequence.withColumn("sequence", F.row_number().over(Window.orderBy("row_num")))
Output
+---+--------------+------+--------+
|id | row_num |lag_id|sequence|
+---+--------------+------+--------+
|a |1 |c |1 |
|d |4 |c |2 |
|b |12 |c |3 |
+---+--------------+------+--------+
Join the sequence DF to the original DF
df_joined = df.alias("df1").join(df_sequence.alias("df2"),
on="row_num",
how="leftouter")\
.select(df["*"],df_sequence["sequence"])
)
Set sequence to 0 when id = "c"
df_joined = df_joined.withColumn('sequence', F.when(df_joined.id == "c", 0)
                                              .otherwise(df_joined.sequence))
Output
+---+--------------+------+--------+
|id | row_num |lag_id|sequence|
+---+--------------+------+--------+
|a |1 |c |1 |
|c |2 |a |0 |
|c |3 |c |0 |
|d |4 |c |2 |
|b |5 |d |null |
|a |6 |b |null |
|a |7 |a |null |
|d |8 |a |null |
|d |9 |d |null |
|c |10 |d |0 |
|c |11 |c |0 |
|b |12 |c |3 |
|a |13 |b |null |
|b |14 |a |null |
+---+--------------+------+--------+
Forward fill the sequence values:
df_final = df_joined.withColumn('sequence', F.last('sequence', ignorenulls=True).over(Window.orderBy("row_num")))
Final Output
+---+--------------+------+--------+
|id | row_num |lag_id|sequence|
+---+--------------+------+--------+
|a |1 |c |1 |
|c |2 |a |0 |
|c |3 |c |0 |
|d |4 |c |2 |
|b |5 |d |2 |
|a |6 |b |2 |
|a |7 |a |2 |
|d |8 |a |2 |
|d |9 |d |2 |
|c |10 |d |0 |
|c |11 |c |0 |
|b |12 |c |3 |
|a |13 |b |3 |
|b |14 |a |3 |
+---+--------------+------+--------+

Loop through large dataframe in Pyspark - alternative

df_hrrchy
|lefId |Lineage |
|-------|--------------------------------------|
|36326 |["36326","36465","36976","36091","82"]|
|36121 |["36121","36908","36976","36091","82"]|
|36380 |["36380","36465","36976","36091","82"]|
|36448 |["36448","36465","36976","36091","82"]|
|36683 |["36683","36465","36976","36091","82"]|
|36949 |["36949","36908","36976","36091","82"]|
|37349 |["37349","36908","36976","36091","82"]|
|37026 |["37026","36908","36976","36091","82"]|
|36879 |["36879","36465","36976","36091","82"]|
df_trans
|tranID | T_Id |
|-----------|-------------------------------------------------------------------------|
|1000540 |["36121","36326","37349","36949","36380","37026","36448","36683","36879"]|
df_creds
|T_Id |T_val |T_Goal |Parent_T_Id |Parent_Val |parent_Goal|
|-------|-------|-------|---------------|----------------|-----------|
|36448 |100 |1 |36465 |200 |1 |
|36465 |200 |1 |36976 |300 |2 |
|36326 |90 |1 |36465 |200 |1 |
|36091 |500 |19 |82 |600 |4 |
|36121 |90 |1 |36908 |200 |1 |
|36683 |90 |1 |36465 |200 |1 |
|36908 |200 |1 |36976 |300 |2 |
|36949 |90 |1 |36908 |200 |1 |
|36976 |300 |2 |36091 |500 |19 |
|37026 |90 |1 |36908 |200 |1 |
|37349 |100 |1 |36908 |200 |1 |
|36879 |90 |1 |36465 |200 |1 |
|36380 |90 |1 |36465 |200 |1 |
Desired Result
|T_id |children                                  |T_Val|T_Goal|parent_T_id|parent_Goal|trans_id|
|-----|------------------------------------------|-----|------|-----------|-----------|--------|
|36091|["36976"]                                 |500  |19    |82         |4          |1000540 |
|36465|["36448","36326","36683","36879","36380"] |200  |1     |36976      |2          |1000540 |
|36908|["36121","36949","37026","37349"]         |200  |1     |36976      |2          |1000540 |
|36976|["36465","36908"]                         |300  |2     |36091      |19         |1000540 |
|36683|null                                      |90   |1     |36465      |1          |1000540 |
|37026|null                                      |90   |1     |36908      |1          |1000540 |
|36448|null                                      |100  |1     |36465      |1          |1000540 |
|36949|null                                      |90   |1     |36908      |1          |1000540 |
|36326|null                                      |90   |1     |36465      |1          |1000540 |
|36380|null                                      |90   |1     |36465      |1          |1000540 |
|36879|null                                      |90   |1     |36465      |1          |1000540 |
|36121|null                                      |90   |1     |36908      |1          |1000540 |
|37349|null                                      |100  |1     |36908      |1          |1000540 |
Code Tried
from pyspark.sql import functions as F
from pyspark.sql import DataFrame
from pyspark.sql.functions import explode, collect_set, expr, col, collect_list, array_contains, lit
from functools import reduce

for row in df_transactions.rdd.toLocalIterator():
    # def find_nodemap(row):
    dfs = []
    df_hy_set = (df_hrrchy.filter(df_hrrchy.lefId.isin(row["T_ds"]))
                 .select(explode("Lineage").alias("Terrs"))
                 .agg(collect_set(col("Terrs")).alias("hierarchy_list"))
                 .select(F.lit(row["trans_id"]).alias("trans_id"), "hierarchy_list")
                 )
    df_childrens = (df_creds.join(df_hy_set, expr("array_contains(hierarchy_list, T_id)"))
                    .select("T_id", "T_Val", "T_Goal", "parent_T_id", "parent_Goal", "trans_id")
                    .groupBy("parent_T_id").agg(collect_list("T_id").alias("children"))
                    )
    df_filter_creds = (df_creds.join(df_hy_set, expr("array_contains(hierarchy_list, T_id)"))
                       .select("T_id", "T_val", "T_Goal", "parent_T_id", "parent_Goal", "trans_id")
                       )
    df_nodemap = (df_filter_creds.alias("A").join(df_childrens.alias("B"), col("A.T_id") == col("B.parent_T_id"), "left")
                  .select("A.T_id", "B.children", "A.T_val", "A.terr_Goal", "A.parent_T_id", "A.parent_Goal", "A.trans_id")
                  )
    display(df_nodemap)
    # dfs.append(df_nodemap)
    # df = reduce(DataFrame.union, dfs)
    # display(df)
My problem: it's a bad design. df_trans has millions of rows, and looping through the dataframe takes forever. Can I do this without looping? I tried a couple of other methods but was not able to get the desired result.
You certainly need to process the entire DataFrame as a batch rather than iterating row by row.
The key point is to "reverse" df_hrrchy, i.e. from the parent lineage obtain the list of children for every T_Id:
import org.apache.spark.sql.functions._
import spark.implicits._

val df_children = df_hrrchy
  .withColumn("children", slice($"Lineage", lit(1), size($"Lineage") - 1)) // every element except the last
  .withColumn("parents", slice($"Lineage", 2, 999999))                     // every element except the first
  .select(explode(arrays_zip($"children", $"parents")).as("rels"))         // (child, parent) pairs
  .distinct
  .groupBy($"rels.parents".as("T_Id"))
  .agg(collect_set($"rels.children").as("children"))
df_children.show(false)
+-----+-----------------------------------+
|T_Id |children |
+-----+-----------------------------------+
|36091|[36976] |
|36465|[36448, 36380, 36326, 36879, 36683]|
|36976|[36465, 36908] |
|82 |[36091] |
|36908|[36949, 37349, 36121, 37026] |
+-----+-----------------------------------+
Then expand the list of T_Ids in df_trans and also include all T_Ids from the hierarchy:
val df_trans_map = df_trans.withColumn("T_Id", explode($"T_Id"))
  .join(df_hrrchy, array_contains($"Lineage", $"T_Id"))
  .select($"tranID", explode($"Lineage").as("T_Id"))
  .distinct
df_trans_map.show(false)
+-------+-----+
|tranID |T_Id |
+-------+-----+
|1000540|36976|
|1000540|82 |
|1000540|36091|
|1000540|36465|
|1000540|36326|
|1000540|36121|
|1000540|36908|
|1000540|36380|
|1000540|36448|
|1000540|36683|
|1000540|36949|
|1000540|37349|
|1000540|37026|
|1000540|36879|
+-------+-----+
With this, it is just a simple join to obtain the final result:
df_trans_map.join(df_creds, Seq("T_Id"))
  .join(df_children, Seq("T_Id"), "left_outer")
  .show(false)
+-----+-------+-----+------+-----------+----------+-----------+-----------------------------------+
|T_Id |tranID |T_val|T_Goal|Parent_T_Id|Parent_Val|parent_Goal|children |
+-----+-------+-----+------+-----------+----------+-----------+-----------------------------------+
|36976|1000540|300 |2 |36091 |500 |19 |[36465, 36908] |
|36091|1000540|500 |19 |82 |600 |4 |[36976] |
|36465|1000540|200 |1 |36976 |300 |2 |[36448, 36380, 36326, 36879, 36683]|
|36326|1000540|90 |1 |36465 |200 |1 |null |
|36121|1000540|90 |1 |36908 |200 |1 |null |
|36908|1000540|200 |1 |36976 |300 |2 |[36949, 37349, 36121, 37026] |
|36380|1000540|90 |1 |36465 |200 |1 |null |
|36448|1000540|100 |1 |36465 |200 |1 |null |
|36683|1000540|90 |1 |36465 |200 |1 |null |
|36949|1000540|90 |1 |36908 |200 |1 |null |
|37349|1000540|100 |1 |36908 |200 |1 |null |
|37026|1000540|90 |1 |36908 |200 |1 |null |
|36879|1000540|90 |1 |36465 |200 |1 |null |
+-----+-------+-----+------+-----------+----------+-----------+-----------------------------------+
You need to rewrite this to use the full cluster; using a localIterator means that you aren't fully utilizing the cluster for shared work.
The code below was not run, as you didn't provide a workable data set to test with. If you do, I'll run it to make sure it's sound.
from pyspark.sql import functions as F
from pyspark.sql import DataFrame
from pyspark.sql.functions import explode, collect_set, expr, col, collect_list, array_contains, lit
from functools import reduce

# uses explode: I know this will create a lot of short-lived records, but the flip side is that it
# uses the entire cluster to complete the work instead of the driver.
df_trans_expld = df_trans.select(df_trans.tranID, explode(df_trans.T_Id).alias("T_Id"))

# uses explode
df_hrrchy_expld = df_hrrchy.select(df_hrrchy.lefId, explode(df_hrrchy.Lineage).alias("Lineage"))

# uses the exploded data to join, which is the same as a filter
df_hy_set = (df_trans_expld.join(df_hrrchy_expld, df_hrrchy_expld.lefId == df_trans_expld.T_Id, "left")
             .select("trans_id")
             .agg(collect_set(col("Lineage")).alias("hierarchy_list"))
             .select(F.lit(col("trans_id")).alias("trans_id"), "hierarchy_list"))

# logic unchanged from here down
df_childrens = (df_creds.join(df_hy_set, expr("array_contains(hierarchy_list, T_id)"))
                .select("T_id", "T_Val", "T_Goal", "parent_T_id", "parent_Goal", "trans_id")
                .groupBy("parent_T_id").agg(collect_list("T_id").alias("children"))
                )
df_filter_creds = (df_creds.join(df_hy_set, expr("array_contains(hierarchy_list, T_id)"))
                   .select("T_id", "T_val", "T_Goal", "parent_T_id", "parent_Goal", "trans_id")
                   )
df_nodemap = (df_filter_creds.alias("A").join(df_childrens.alias("B"), col("A.T_id") == col("B.parent_T_id"), "left")
              .select("A.T_id", "B.children", "A.T_val", "A.terr_Goal", "A.parent_T_id", "A.parent_Goal", "A.trans_id")
              )
# no need to append/union data as it's now just one dataframe, df_nodemap
I'd have to look into this more, but I'm pretty sure your existing code pulls all the data through the driver, which will really slow things down; this version makes use of all executors to complete the work.
There may be another optimization to get rid of the array_contains (and use a join instead). I'd have to look at the explain plan to see if you could get even more performance out of it. I don't remember off the top of my head; you are avoiding a shuffle, so it may be better as is.
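For reference, an untested sketch (in Scala, mirroring the first answer above) of what that explode-plus-equi-join could look like in place of the array_contains join; names such as df_trans_leaf are made up for illustration:
// explode the transaction's leaf ids so the hierarchy lookup becomes a plain equi-join,
// which lets Spark pick sort-merge/shuffle-hash strategies instead of a nested-loop join
val df_trans_leaf = df_trans.select($"tranID", explode($"T_Id").as("T_Id"))
val df_trans_map = df_trans_leaf
  .join(df_hrrchy, df_trans_leaf("T_Id") === df_hrrchy("lefId")) // equi-join on the leaf id
  .select($"tranID", explode($"Lineage").as("T_Id"))             // expand to every ancestor in the lineage
  .distinct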

How can I get output as below in Spark Scala

I have data like below:
|A |B |C  |D  |
|--|--|---|---|
|1 |A |Day|D1 |
|1 |A |Tim|1am|
|1 |A |Tim|3am|
Need to create like this:
|A |B |Day|Tim1|Tim2|
|--|--|---|----|----|
|1 |A |D1 |1am |3am |
Can you help me with how to get this in Spark Scala?
You can add the row numbers for the duplicates first and then do the pivot.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val w1 = Window.partitionBy("A", "B", "C").orderBy("D")
val w2 = Window.partitionBy("A", "B", "C")
val df1 = df0.withColumn("row_num", row_number().over(w1)).withColumn("max_num", max("row_num").over(w2))
df1.show(false)
//+---+---+---+---+-------+-------+
//|A |B |C |D |row_num|max_num|
//+---+---+---+---+-------+-------+
//|1 |A |Tim|1am|1 |2 |
//|1 |A |Tim|3am|2 |2 |
//|1 |A |Day|D1 |1 |1 |
//+---+---+---+---+-------+-------+
val df2 = df1.withColumn("C", expr("if(max_num != 1, concat(C, row_num), C)"))
df2.show(false)
//+---+---+----+---+-------+-------+
//|A |B |C |D |row_num|max_num|
//+---+---+----+---+-------+-------+
//|1 |A |Tim1|1am|1 |2 |
//|1 |A |Tim2|3am|2 |2 |
//|1 |A |Day |D1 |1 |1 |
//+---+---+----+---+-------+-------+
val df3 = df2.groupBy("A", "B").pivot("C").agg(first("D"))
df3.show(false)
//+---+---+---+----+----+
//|A |B |Day|Tim1|Tim2|
//+---+---+---+----+----+
//|1 |A |D1 |1am |3am |
//+---+---+---+----+----+
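As a side note, when the pivoted column names are known up front (here Day, Tim1 and Tim2), passing them to pivot explicitly is a small optimization sketch that skips the extra job Spark otherwise runs to collect the distinct values:
// assumes the set of pivoted names is fixed and known in advance
val df3 = df2.groupBy("A", "B").pivot("C", Seq("Day", "Tim1", "Tim2")).agg(first("D"))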

Spark Scala, merging two columnar dataframes duplicating the second dataframe each time

I want to merge two columns or two dataframes like this:
df1
+--+
|id|
+--+
|1 |
|2 |
|3 |
+--+
df2 --> this one can be a list as well
+--+
|m |
+--+
|A |
|B |
|C |
+--+
I want to have the following as the resulting table:
+--+--+
|id|m |
+--+--+
|1 |A |
|1 |B |
|1 |C |
|2 |A |
|2 |B |
|2 |C |
|3 |A |
|3 |B |
|3 |C |
+--+--+
def crossJoin(right: org.apache.spark.sql.Dataset[_]): org.apache.spark.sql.DataFrame
Using the crossJoin function you can get the same result. Please check the code below.
scala> dfa.show
+---+
| id|
+---+
| 1|
| 2|
| 3|
+---+
scala> dfb.show
+---+
| m|
+---+
| A|
| B|
| C|
+---+
scala> dfa.crossJoin(dfb).orderBy($"id".asc).show(false)
+---+---+
|id |m |
+---+---+
|1 |B |
|1 |A |
|1 |C |
|2 |A |
|2 |B |
|2 |C |
|3 |C |
|3 |B |
|3 |A |
+---+---+
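If the second input is just a Scala list rather than a dataframe (as the question allows), here is a small sketch of lifting it into a dataframe first; the list ms and the column name "m" are assumptions for illustration:
import spark.implicits._

// hypothetical list standing in for df2
val ms = Seq("A", "B", "C")
val dfb = ms.toDF("m")

dfa.crossJoin(dfb).orderBy($"id".asc, $"m".asc).show(false)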

How to replace null values with above/below not null value on same column in Data-frame using spark?

I'm trying to replace null or invalid values present in a column with the nearest non-null value above or below it in the same column. For example:
Name|Place|row_count
a |a1 |1
a |a2 |2
a |a2 |3
|d1 |4
b |a2 |5
c |a2 |6
| |7
| |8
d |c1 |9
In this case, I am trying to replace all the NULL values in the column "Name": the 1st NULL should be replaced with 'a' and the 2nd NULL with 'c'; in the column "Place", the NULL should be replaced with 'a2'.
When we try to replace the NULL in the 8th cell of the 'Place' column, it should likewise be replaced with its nearest non-null value, 'a2'.
Required result:
If we select the NULL in the 8th cell of the 'Place' column for replacement, the result will be
Name|Place|row_count
a |a1 |1
a |a2 |2
a |a2 |3
|d1 |4
b |a2 |5
c |a2 |6
| |7
|a2 |8
d |c1 |9
If we select the NULL in the 4th cell of the 'Name' column for replacement, the result will be
Name|Place|row_count
a |a1 |1
a |a2 |2
a |a2 |3
a |d1 |4
b |a2 |5
c |a2 |6
| |7
| |8
d |c1 |9
Window functions come in handy to solve this issue. For the sake of simplicity, I'm focusing on just the name column. If the previous row is null, I'm using the next row's value; you can change this order according to your needs. The same approach needs to be applied to the other columns as well (see the sketch after the output below).
import spark.implicits._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
val df = Seq(("a", "a1", "1"),
("a", "a2", "2"),
("a", "a2", "3"),
("d1", null, "4"),
("b", "a2", "5"),
("c", "a2", "6"),
(null, null, "7"),
(null, null, "8"),
("d", "c1", "9")).toDF("name", "place", "row_count")
val window = Window.orderBy("row_count")
val lagNameWindowExpression = lag('name, 1).over(window)
val leadNameWindowExpression = lead('name, 1).over(window)
val nameConditionExpression = when($"name".isNull.and('previous_name_col.isNull), 'next_name_col)
.when($"name".isNull.and('previous_name_col.isNotNull), 'previous_name_col).otherwise($"name")
df.select($"*", lagNameWindowExpression as 'previous_name_col, leadNameWindowExpression as 'next_name_col)
.withColumn("name", nameConditionExpression).drop("previous_name_col", "next_name_col")
.show(false)
Output
+----+-----+---------+
|name|place|row_count|
+----+-----+---------+
|a |a1 |1 |
|a |a2 |2 |
|a |a2 |3 |
|d1 |null |4 |
|b |a2 |5 |
|c |a2 |6 |
|c |null |7 |
|d |null |8 |
|d |c1 |9 |
+----+-----+---------+
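For completeness, a sketch of the "same approach for the other columns" applied to the place column (untested; names mirror the example above):
val lagPlaceWindowExpression = lag('place, 1).over(window)
val leadPlaceWindowExpression = lead('place, 1).over(window)

// same rule as for name: prefer the previous non-null value, otherwise fall back to the next one
val placeConditionExpression = when($"place".isNull.and('previous_place_col.isNull), 'next_place_col)
  .when($"place".isNull.and('previous_place_col.isNotNull), 'previous_place_col).otherwise($"place")

df.select($"*", lagPlaceWindowExpression as 'previous_place_col, leadPlaceWindowExpression as 'next_place_col)
  .withColumn("place", placeConditionExpression).drop("previous_place_col", "next_place_col")
  .show(false)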