Using PySpark, how can I increment a counter in data that resets at a specific value?

I have a dataset like:
a
c
c
d
b
a
a
d
d
c
c
b
a
b
I want to add a column that looks like the one below: whenever 'c' is reached, the new column is zero, and the counter increases by one for each block of non-'c' rows that follows. Is there a way to do this using PySpark?
a 1
c 0
c 0
d 2
b 2
a 2
a 2
d 2
d 2
c 0
c 0
b 3
a 3
b 3
I have tried the code below, but it is not working.
from pyspark.sql.functions import col, when, lag, sum
s = df.filter(col("col") == 'c')
df = df.withColumn("new", when(s.neq(lag("s", 1).over()), sum("s").over(Window.orderBy("index"))).otherwise(0))

The following solution uses PySpark SQL functions to implement the logic requested above.
Set-Up
Create a DataFrame to mimic the example provided
from pyspark.sql import functions as F
from pyspark.sql.window import Window

df = spark.createDataFrame(
    [('a',), ('c',), ('c',), ('d',), ('b',), ('a',), ('a',),
     ('d',), ('d',), ('c',), ('c',), ('b',), ('a',), ('b',)],
    ['id'])
Output
+---+
|id |
+---+
|a |
|c |
|c |
|d |
|b |
|a |
|a |
|d |
|d |
|c |
|c |
|b |
|a |
|b |
+---+
Logic
Calculate a row number to capture the original row order (monotonically_increasing_id is used as a proxy for the input order)
df = df.withColumn("row_num", F.row_number().over(Window.orderBy(F.monotonically_increasing_id())))
Use row number to determine the preceding id value (the lag). There is no preceding id for the first row so the lag results in a null - set this missing value to "c".
df = df.withColumn("lag_id", F.lag("id",1).over(Window.orderBy("row_num")))
df = df.na.fill(value="c", subset=['lag_id'])
Output
+---+-------+------+
|id |row_num|lag_id|
+---+-------+------+
|a  |1      |c     |
|c  |2      |a     |
|c  |3      |c     |
|d  |4      |c     |
|b  |5      |d     |
|a  |6      |b     |
|a  |7      |a     |
|d  |8      |a     |
|d  |9      |d     |
|c  |10     |d     |
|c  |11     |c     |
|b  |12     |c     |
|a  |13     |b     |
|b  |14     |a     |
+---+-------+------+
Determine order (sequence) for rows that immediately follow a row where id = "c"
df_sequence = df.filter((df.id != "c") & (df.lag_id == "c"))
df_sequence = df_sequence.withColumn("sequence", F.row_number().over(Window.orderBy("row_num")))
Output
+---+-------+------+--------+
|id |row_num|lag_id|sequence|
+---+-------+------+--------+
|a  |1      |c     |1       |
|d  |4      |c     |2       |
|b  |12     |c     |3       |
+---+-------+------+--------+
Join the sequence DF to the original DF
df_joined = df.alias("df1").join(df_sequence.alias("df2"),
                                 on="row_num",
                                 how="leftouter") \
              .select(df["*"], df_sequence["sequence"])
Set sequence to 0 when id = "c"
df_joined = df_joined.withColumn('sequence', F.when(df_joined.id == "c", 0)
                                              .otherwise(df_joined.sequence))
Output
+---+-------+------+--------+
|id |row_num|lag_id|sequence|
+---+-------+------+--------+
|a  |1      |c     |1       |
|c  |2      |a     |0       |
|c  |3      |c     |0       |
|d  |4      |c     |2       |
|b  |5      |d     |null    |
|a  |6      |b     |null    |
|a  |7      |a     |null    |
|d  |8      |a     |null    |
|d  |9      |d     |null    |
|c  |10     |d     |0       |
|c  |11     |c     |0       |
|b  |12     |c     |3       |
|a  |13     |b     |null    |
|b  |14     |a     |null    |
+---+-------+------+--------+
Forward fill the sequence values (using last with ignorenulls=True over a window ordered by row_num)
df_final = df_joined.withColumn('sequence', F.last('sequence', ignorenulls=True).over(Window.orderBy("row_num")))
Final Output
+---+-------+------+--------+
|id |row_num|lag_id|sequence|
+---+-------+------+--------+
|a  |1      |c     |1       |
|c  |2      |a     |0       |
|c  |3      |c     |0       |
|d  |4      |c     |2       |
|b  |5      |d     |2       |
|a  |6      |b     |2       |
|a  |7      |a     |2       |
|d  |8      |a     |2       |
|d  |9      |d     |2       |
|c  |10     |d     |0       |
|c  |11     |c     |0       |
|b  |12     |c     |3       |
|a  |13     |b     |3       |
|b  |14     |a     |3       |
+---+-------+------+--------+
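For reference, the steps above can also be condensed into a single chain. The sketch below follows the same idea but replaces the join and forward fill with a running sum over a "start of block" flag; it assumes the same df with an id column as created in the set-up.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Window over the whole frame ordered by the original row order
# (no partitionBy: fine for a small example, Spark will warn about a single partition)
w = Window.orderBy("row_num")

df_final = (
    df.withColumn("row_num", F.row_number().over(Window.orderBy(F.monotonically_increasing_id())))
      # lag of id; the missing lag on the first row is treated as "c"
      .withColumn("lag_id", F.coalesce(F.lag("id", 1).over(w), F.lit("c")))
      # flag rows that start a new block, i.e. a non-"c" row immediately after a "c"
      .withColumn("is_start", ((F.col("id") != "c") & (F.col("lag_id") == "c")).cast("int"))
      # running count of block starts gives the sequence; "c" rows are forced to 0
      .withColumn("sequence", F.when(F.col("id") == "c", F.lit(0))
                               .otherwise(F.sum("is_start").over(w)))
      .drop("is_start")
)
df_final.show()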

Related

How can I make a unique match when joining two Spark dataframes on different columns?

I have two Spark (Scala) dataframes:
First:
+---+----+-----------+---------+-------+
|id |zone|zone_father|father_id|country|
+---+----+-----------+---------+-------+
|2  |1   |123        |1        |0      |
|2  |2   |123        |1        |0      |
|3  |3   |1          |2        |0      |
|2  |4   |123        |1        |0      |
|3  |5   |2          |2        |0      |
|3  |6   |4          |2        |0      |
|3  |7   |19         |2        |0      |
+---+----+-----------+---------+-------+
Second:
+-------+---+----+----------+
|country|id |zone|zone_value|
+-------+---+----+----------+
|0      |2  |1   |7         |
|0      |2  |2   |7         |
|0      |2  |4   |8         |
|0      |0  |0   |2         |
+-------+---+----+----------+
Then I need the following logic:
1 -> If => first.id = second.id && first.zone = second.zone
2 -> Else if => first.father_id = second.id && first.zone_father = second.zone
3 -> Otherwise (neither rule 1 nor rule 2 matches) => first.country = second.zone
And the expected result would be:
+---+----+-----------+---------+-------+----------+
|id |zone|zone_father|father_id|country|zone_value|
+---+----+-----------+---------+-------+----------+
|2  |1   |123        |1        |0      |7         |
|2  |2   |123        |1        |0      |7         |
|3  |3   |1          |2        |0      |7         |
|2  |4   |123        |1        |0      |8         |
|3  |5   |2          |2        |0      |7         |
|3  |6   |4          |2        |0      |8         |
|3  |7   |19         |2        |0      |2         |
+---+----+-----------+---------+-------+----------+
I tried to join both dataframes, but due to the "or" operation, more than one result is returned for each row, because the last rule matches regardless of the result of the other two.
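One common pattern for this kind of prioritized matching is to join on the disjunction of all three rules, tag each candidate match with the priority of the rule it satisfied, and keep only the best match per row of the first dataframe. Below is a hedged PySpark sketch of that idea (the question uses Scala, but it translates directly); first and second are assumed to be the two dataframes shown above.
from pyspark.sql import functions as F, Window

f = first.alias("f")
s = second.alias("s")

rule1 = (F.col("f.id") == F.col("s.id")) & (F.col("f.zone") == F.col("s.zone"))
rule2 = (F.col("f.father_id") == F.col("s.id")) & (F.col("f.zone_father") == F.col("s.zone"))
rule3 = F.col("f.country") == F.col("s.zone")

result = (
    f.join(s, rule1 | rule2 | rule3, "left")
     # priority of the rule that produced this candidate match
     .withColumn("prio", F.when(rule1, 1).when(rule2, 2).otherwise(3))
     # keep only the best-priority match per row of `first`
     # (assumes the columns of `first` identify a row; add a surrogate key if they do not)
     .withColumn("rn", F.row_number().over(
         Window.partitionBy("f.id", "f.zone", "f.zone_father", "f.father_id", "f.country")
               .orderBy("prio")))
     .filter(F.col("rn") == 1)
     .select("f.*", F.col("s.zone_value"))
)
result.show()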

Pyspark get predecessor value

I have a dataset similar to this one
+---+---+---+------+-----+
|exp|pid|mat|pskey |order|
+---+---+---+------+-----+
|1  |CR |P  |1-CR-P|1    |
|1  |M  |C  |1-M-C |2    |
|1  |CR |C  |1-CR-C|3    |
|1  |PP |C  |1-PP-C|4    |
|2  |CR |P  |2-CR-P|1    |
|2  |CR |P  |2-CR-P|1    |
|2  |M  |C  |2-M-C |2    |
|2  |CR |C  |2-CR-C|3    |
|2  |CR |C  |2-CR-C|3    |
|2  |CR |C  |2-CR-C|3    |
|2  |CR |C  |2-CR-C|3    |
|2  |CR |C  |2-CR-C|3    |
|2  |PP |C  |2-PP-C|4    |
|2  |PP |C  |2-PP-C|4    |
|2  |PP |C  |2-PP-C|4    |
|2  |PP |C  |2-PP-C|4    |
|2  |PP |C  |2-PP-C|4    |
|3  |M  |C  |3-M-C |2    |
|4  |CR |P  |4-CR-P|1    |
|4  |M  |C  |4-M-C |2    |
|4  |CR |C  |4-CR-C|3    |
|4  |PP |C  |4-PP-C|4    |
+---+---+---+------+-----+
What I need is to get the pskey of the predecessor within the same exp, given the following relation:
order 1 -> no predecessor
order 2 -> no predecessor
order 3 -> [1,2]
order 4 -> [3]
And add those values to a new column called predecessor
The expected result would be like:
+---+---+---+------+-----+----------------------------------------+
|exp|pid|mat|pskey |order|predecessor |
+---+---+---+------+-----+----------------------------------------+
|1 |CR |P |1-CR-P|1 |null |
|1 |M |C |1-M-C |2 |null |
|1 |CR |C |1-CR-C|3 |[1-CR-P, 1-M-C ] |
|1 |PP |C |1-PP-C|4 |[1-CR-C] |
|3 |M |C |3-M-C |2 |null |
|2 |CR |P |2-CR-P|1 |null |
|2 |CR |P |2-CR-P|1 |null |
|2 |M |C |2-M-C |2 |null |
|2 |CR |C |2-CR-C|3 |[2-CR-P, 2-M-C] |
|2 |CR |C |2-CR-C|3 |[2-CR-P, 2-M-C] |
|2 |CR |C |2-CR-C|3 |[2-CR-P, 2-M-C] |
|2 |CR |C |2-CR-C|3 |[2-CR-P, 2-M-C] |
|2 |CR |C |2-CR-C|3 |[2-CR-P, 2-M-C] |
|2 |PP |C |2-PP-C|4 |[2-CR-C] |
|2 |PP |C |2-PP-C|4 |[2-CR-C] |
|2 |PP |C |2-PP-C|4 |[2-CR-C] |
|2 |PP |C |2-PP-C|4 |[2-CR-C] |
|2 |PP |C |2-PP-C|4 |[2-CR-C] |
|4 |CR |P |4-CR-P|1 |null |
|4 |M |C |4-M-C |2 |null |
|4 |CR |C |4-CR-C|3 |[4-CR-P, 4-M-C] |
|4 |PP |C |4-PP-C|4 |[4-CR-C] |
+---+---+---+------+-----+----------------------------------------+
I am quite new to pyspark so I have no clue how to manage it.
The different cases of order are handled with when, and the values are aggregated with collect_set to get unique identifiers:
from pyspark.sql import functions as F, Window

df2 = df.withColumn(
    "predecessor",
    F.when(
        F.col("order") == 3,
        F.collect_set(F.col("pskey")).over(
            Window.partitionBy("exp").orderBy("order").rangeBetween(-2, -1)
        ),
    ).when(
        F.col("order") == 4,
        F.collect_set(F.col("pskey")).over(
            Window.partitionBy("exp").orderBy("order").rangeBetween(-1, -1)
        ),
    ),
)
The result:
df2.show(truncate=False)
+---+---+---+------+-----+----------------+
|exp|pid|mat|pskey |order|predecessor |
+---+---+---+------+-----+----------------+
|1 |CR |P |1-CR-P|1 |null |
|1 |M |C |1-M-C |2 |null |
|1 |CR |C |1-CR-C|3 |[1-CR-P, 1-M-C ]|
|1 |PP |C |1-PP-C|4 |[1-CR-C] |
|3 |M |C |3-M-C |2 |null |
|2 |CR |P |2-CR-P|1 |null |
|2 |CR |P |2-CR-P|1 |null |
|2 |M |C |2-M-C |2 |null |
|2 |CR |C |2-CR-C|3 |[2-CR-P, 2-M-C ]|
|2 |CR |C |2-CR-C|3 |[2-CR-P, 2-M-C ]|
|2 |CR |C |2-CR-C|3 |[2-CR-P, 2-M-C ]|
|2 |CR |C |2-CR-C|3 |[2-CR-P, 2-M-C ]|
|2 |CR |C |2-CR-C|3 |[2-CR-P, 2-M-C ]|
|2 |PP |C |2-PP-C|4 |[2-CR-C] |
|2 |PP |C |2-PP-C|4 |[2-CR-C] |
|2 |PP |C |2-PP-C|4 |[2-CR-C] |
|2 |PP |C |2-PP-C|4 |[2-CR-C] |
|2 |PP |C |2-PP-C|4 |[2-CR-C] |
|4 |CR |P |4-CR-P|1 |null |
|4 |M |C |4-M-C |2 |null |
+---+---+---+------+-----+----------------+
only showing top 20 rows

Add increasing number for each group in column of Spark dataframe

I have a dataframe with 2 columns "Id" and "category". For each category, I want to label encode the column "Id", so the expected outcome will be the column "Enc_id" like this
Id Category Enc_id
a1 A 0
a2 A 1
b1 B 0
c1 C 0
c2 C 1
a3 A 2
b2 B 1
b3 B 2
b4 B 3
b4 B 3
b3 B 2
Here, the Id may not be unique, so that there may be duplicated rows. I thought of creating a window to partitionBy(category), then apply the label encoding (StringIndexer) over this window but it didn't work. Any hint, please?
You can use a window function, partitioned by Category (and the first character of Id via substring), and calculate the rank:
val window = Window.partitionBy($"Category", substring($"Id", 1, 1)).orderBy("Id")

df.withColumn("Enc_id", rank().over(window) - 1) // -1 to start the rank from 0
  .show(false)
Output:
+---+--------+------+
|Id |Category|Enc_id|
+---+--------+------+
|a1 |A |0 |
|a2 |A |1 |
|a3 |A |2 |
|c1 |C |0 |
|c2 |C |1 |
|b1 |B |0 |
|b2 |B |1 |
|b3 |B |2 |
|b4 |B |3 |
+---+--------+------+
Update 1:
For the updated case with duplicate Ids:
df1.groupBy("Id", "Category")
   .agg(collect_list("Category") as "list_category")
   .withColumn("Enc_id", rank().over(window) - 1)
   .withColumn("Category", explode($"list_category"))
   .drop("list_category")
   .show(false)
Output:
+---+--------+------+
|Id |Category|Enc_id|
+---+--------+------+
|a1 |A |0 |
|a2 |A |1 |
|a3 |A |2 |
|c1 |C |0 |
|c2 |C |1 |
|b1 |B |0 |
|b2 |B |1 |
|b3 |B |2 |
|b3 |B |2 |
|b4 |B |3 |
|b4 |B |3 |
+---+--------+------+
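Since the question itself is framed in PySpark terms (a window plus StringIndexer), here is a hedged PySpark sketch of the same idea; dense_rank gives duplicated Ids within a category the same rank, so the groupBy/explode workaround from Update 1 is not needed. It assumes a dataframe named df with columns Id and Category.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# dense_rank assigns the same rank to duplicate Ids; ranks start at 1, hence the - 1
w = Window.partitionBy("Category").orderBy("Id")
df_enc = df.withColumn("Enc_id", F.dense_rank().over(w) - 1)
df_enc.show()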

Spark Scala: merging two single-column dataframes, duplicating the second dataframe for each row of the first

I want to merge 2 columns or 2 dataframes like
df1
+--+
|id|
+--+
|1 |
|2 |
|3 |
+--+
df2 --> this one can be a list as well
+--+
|m |
+--+
|A |
|B |
|C |
+--+
I want to have as resulting table
+--+--+
|id|m |
+--+--+
|1 |A |
|1 |B |
|1 |C |
|2 |A |
|2 |B |
|2 |C |
|3 |A |
|3 |B |
|3 |C |
+--+--+
def crossJoin(right: org.apache.spark.sql.Dataset[_]): org.apache.spark.sql.DataFrame
Using the crossJoin function you can get the same result. Please check the code below.
scala> dfa.show
+---+
| id|
+---+
| 1|
| 2|
| 3|
+---+
scala> dfb.show
+---+
| m|
+---+
| A|
| B|
| C|
+---+
scala> dfa.crossJoin(dfb).orderBy($"id".asc).show(false)
+---+---+
|id |m |
+---+---+
|1 |B |
|1 |A |
|1 |C |
|2 |A |
|2 |B |
|2 |C |
|3 |C |
|3 |B |
|3 |A |
+---+---+
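For completeness, the PySpark equivalent is the same one-liner (a sketch assuming dataframes named df1 and df2 as in the question):
# every id is paired with every m; orderBy just makes the output easier to read
result = df1.crossJoin(df2).orderBy("id", "m")
result.show()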

PySpark: Creating a column with number of timesteps to an event

I have a dataframe that looks as follows:
|id |val1|val2|
+---+----+----+
|1 |1 |0 |
|1 |2 |0 |
|1 |3 |0 |
|1 |4 |0 |
|1 |5 |5 |
|1 |6 |0 |
|1 |7 |0 |
|1 |8 |0 |
|1 |9 |9 |
|1 |10 |0 |
|1 |11 |0 |
|2 |1 |0 |
|2 |2 |0 |
|2 |3 |0 |
|2 |4 |0 |
|2 |5 |0 |
|2 |6 |6 |
|2 |7 |0 |
|2 |8 |8 |
|2 |9 |0 |
+---+----+----+
only showing top 20 rows
I want to create a new column with the number of rows until a non-zero value appears in val2; this should be done grouped/partitioned by 'id'. If the event never happens, I need to put -1 in the steps field.
|id |val1|val2|steps|
+---+----+----+----+
|1 |1 |0 |4 |
|1 |2 |0 |3 |
|1 |3 |0 |2 |
|1 |4 |0 |1 |
|1 |5 |5 |0 | event
|1 |6 |0 |3 |
|1 |7 |0 |2 |
|1 |8 |0 |1 |
|1 |9 |9 |0 | event
|1 |10 |0 |-1 | no further events for this id
|1 |11 |0 |-1 | no further events for this id
|2 |1 |0 |5 |
|2 |2 |0 |4 |
|2 |3 |0 |3 |
|2 |4 |0 |2 |
|2 |5 |0 |1 |
|2 |6 |6 |0 | event
|2 |7 |0 |1 |
|2 |8 |8 |0 | event
|2 |9 |0 |-1 | no further events for this id
+---+----+----+----+
only showing top 20 rows
Your requirement seems easy, but implementing it in Spark while preserving immutability is a difficult task. I suggest using a recursive function to generate the steps column. Below I have tried to sketch a recursive way using a udf function.
import org.apache.spark.sql.functions._

//udf function to populate the steps column
def stepsUdf = udf((values: Seq[Row]) => {
  //sort the collected structs in reverse order according to the val1 column
  val val12 = values.sortWith(_.getAs[Int]("val1") > _.getAs[Int]("val1"))
  //select the first element of the sorted list
  val val12Head = val12.head
  //generate the steps value for the first element of the collected list
  val prevStep = if (val12Head.getAs("val2") != 0) 0 else -1
  //generate the first output struct
  val listSteps = List(steps(val12Head.getAs("val1"), val12Head.getAs("val2"), prevStep))
  //recursive function for generating the steps column
  def recursiveSteps(vals: List[Row], previousStep: Int, listStep: List[steps]): List[steps] = vals match {
    case x :: y =>
      //event row, so the steps column should be 0
      if (x.getAs("val2") != 0) {
        recursiveSteps(y, 0, listStep :+ steps(x.getAs("val1"), x.getAs("val2"), 0))
      }
      //no further event after the last event change
      else if (x.getAs("val2") == 0 && previousStep == -1) {
        recursiveSteps(y, previousStep, listStep :+ steps(x.getAs("val1"), x.getAs("val2"), previousStep))
      }
      //val2 is 0 after the event change, so increment the steps column
      else {
        recursiveSteps(y, previousStep + 1, listStep :+ steps(x.getAs("val1"), x.getAs("val2"), previousStep + 1))
      }
    case Nil => listStep
  }
  //call the recursive function
  recursiveSteps(val12.tail.toList, prevStep, listSteps)
})

df
  .groupBy("id") // group by the id column
  .agg(stepsUdf(collect_list(struct("val1", "val2"))).as("stepped")) // call the udf on the collected structs of val1 and val2
  .withColumn("stepped", explode(col("stepped"))) // generate rows from the list returned by the udf
  .select(col("id"), col("stepped.*")) // final desired output
  .sort("id", "val1") // optional step, just for viewing
  .show(false)
where steps is a case class
case class steps(val1: Int, val2: Int, steps: Int)
which should give you
+---+----+----+-----+
|id |val1|val2|steps|
+---+----+----+-----+
|1 |1 |0 |4 |
|1 |2 |0 |3 |
|1 |3 |0 |2 |
|1 |4 |0 |1 |
|1 |5 |5 |0 |
|1 |6 |0 |3 |
|1 |7 |0 |2 |
|1 |8 |0 |1 |
|1 |9 |9 |0 |
|1 |10 |0 |-1 |
|1 |11 |0 |-1 |
|2 |1 |0 |5 |
|2 |2 |0 |4 |
|2 |3 |0 |3 |
|2 |4 |0 |2 |
|2 |5 |0 |1 |
|2 |6 |6 |0 |
|2 |7 |0 |1 |
|2 |8 |8 |0 |
|2 |9 |0 |-1 |
+---+----+----+-----+
I hope the answer is helpful
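As a side note, if val1 is a consecutive step index within each id (as it is in the example), the same result can be obtained without a UDF by looking ahead to the next event with a window. A hedged PySpark sketch under that assumption:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# forward-looking window: from the current row to the end of the id group
w = Window.partitionBy("id").orderBy("val1") \
          .rowsBetween(Window.currentRow, Window.unboundedFollowing)

df_steps = (
    df.withColumn("next_event", F.min(F.when(F.col("val2") != 0, F.col("val1"))).over(w))
      # distance to the next event; -1 when no event follows within the id
      .withColumn("steps", F.when(F.col("next_event").isNull(), F.lit(-1))
                            .otherwise(F.col("next_event") - F.col("val1")))
      .drop("next_event")
)
df_steps.sort("id", "val1").show(truncate=False)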