Using the lag function in Spark Scala to bring values from another column

I have a DataFrame like the following, except that the real one has several different values in the "person" column.
val df_beginning = Seq(("2022-06-06", "person1", 1),
("2022-06-13", "person1", 1),
("2022-06-20", "person1", 1),
("2022-06-27", "person1", 0),
("2022-07-04", "person1", 0),
("2022-07-11", "person1", 1),
("2022-07-18", "person1", 1),
("2022-07-25", "person1", 0),
("2022-08-01", "person1", 0),
("2022-08-08", "person1", 1),
("2022-08-15", "person1", 1),
("2022-08-22", "person1", 1),
("2022-08-29", "person1", 1))
.toDF("week", "person", "person_active_flag")
.orderBy($"week")
I want to create a new column containing the week in which the current chain of person_active_flag = 1 started. In the end, it would look something like this:
val df_beginning = Seq(("2022-06-06", "person1", 1, "2022-06-06"),
("2022-06-13", "person1", 1, "2022-06-06"),
("2022-06-20", "person1", 1, "2022-06-06"),
("2022-06-27", "person1", 0, "0"),
("2022-07-04", "person1", 0, "0"),
("2022-07-11", "person1", 1, "2022-07-11"),
("2022-07-18", "person1", 1, "2022-07-11"),
("2022-07-25", "person1", 0, "0"),
("2022-08-01", "person1", 0, "0"),
("2022-08-08", "person1", 1, "2022-08-08"),
("2022-08-15", "person1", 1, "2022-08-08"),
("2022-08-22", "person1", 1, "2022-08-08"),
("2022-08-29", "person1", 1, "2022-08-08"))
.toDF("week", "person", "person_active_flag", "chain_beginning")
.orderBy($"week")
But I have not been able to do it. I have tried some variations of the code below, but it doesn't give me the right answer. Can someone show me how to do this, please?
val w = Window.partitionBy($"person").orderBy($"week".asc)
df_beginning
.withColumn("beginning_chain",
when($"person_active_flag" === 1 && (lag($"person_active_flag", 1).over(w) === 0 || lag($"person_active_flag", 1).over(w).isNull), 1).otherwise(0)
)
.withColumn("first_week", when($"beginning_chain" === 1, $"week"))
.withColumn("beginning_chain_week",
when($"person_active_flag" === 1 && lag($"person_active_flag", 1).over(w).isNull, $"first_week")
.when($"person_active_flag" === 1 && lag($"person_active_flag", 1).over(w) === 0, $"first_week")
.when($"person_active_flag" === 1 && lag($"person_active_flag", 1).over(w) === 1, lag($"first_week", 1).over(w))
// .when($"person_active_flag" === 1 && lag($"person_active_flag", 1).over(w) === 1, "test")
.otherwise(0)
)
.d

Use the lag function to add a helper column switch_flag that shows when the flag changed from the previous week.
Then mark week_beginning only for rows where the flag switched from 0 to 1.
Finally, use last(col, ignoreNulls = true) to extend week_beginning to all rows where the person is active.
Final query:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val window = Window.partitionBy($"person").orderBy($"week")
df_beginning
.withColumn("switch_flag", $"person_active_flag" - coalesce(lag($"person_active_flag", 1).over(window), lit(0)))
.withColumn("week_beginning_ind", when($"switch_flag" === 1, $"week"))
.withColumn("week_beginning", when($"person_active_flag" === 1, last($"week_beginning_ind", true).over(window)))
.show
+----------+-------+------------------+-----------+------------------+--------------+
| week| person|person_active_flag|switch_flag|week_beginning_ind|week_beginning|
+----------+-------+------------------+-----------+------------------+--------------+
|2022-06-06|person1| 1| 1| 2022-06-06| 2022-06-06|
|2022-06-13|person1| 1| 0| null| 2022-06-06|
|2022-06-20|person1| 1| 0| null| 2022-06-06|
|2022-06-27|person1| 0| -1| null| null|
|2022-07-04|person1| 0| 0| null| null|
|2022-07-11|person1| 1| 1| 2022-07-11| 2022-07-11|
|2022-07-18|person1| 1| 0| null| 2022-07-11|
|2022-07-25|person1| 0| -1| null| null|
|2022-08-01|person1| 0| 0| null| null|
|2022-08-08|person1| 1| 1| 2022-08-08| 2022-08-08|
|2022-08-15|person1| 1| 0| null| 2022-08-08|
|2022-08-22|person1| 1| 0| null| 2022-08-08|
|2022-08-29|person1| 1| 0| null| 2022-08-08|
+----------+-------+------------------+-----------+------------------+--------------+
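As a side note, the same result can also be reached by numbering the chains with a running sum of chain-start markers and then taking the earliest week inside each chain. This is only a sketch built on the df_beginning above; the chain_start and chain_id column names are illustrative, not part of the answer:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val w = Window.partitionBy($"person").orderBy($"week")
df_beginning
  // 1 on the first week of every active chain, 0 otherwise
  .withColumn("chain_start",
    when($"person_active_flag" === 1 &&
         coalesce(lag($"person_active_flag", 1).over(w), lit(0)) === 0, 1).otherwise(0))
  // running count of chain starts gives each chain (and its trailing zero weeks) an id
  .withColumn("chain_id", sum($"chain_start").over(w))
  // earliest week inside the chain, filled in only for active rows
  .withColumn("chain_beginning",
    when($"person_active_flag" === 1,
         min($"week").over(Window.partitionBy($"person", $"chain_id"))))
  .show()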

Related

Spark scala column level mismatches from 2 dataframes

I have 2 dataframes
val df1 = Seq((1, "1","6"), (2, "10","8"), (3, "6","4")).toDF("id", "value1","value2")
val df2 = Seq((1, "1","6"), (2, "5","4"), (4, "3","1")).toDF("id", "value1","value2")
and I want to find the column-level difference between them.
The output should look like:
id,value1_df1,value1_df2,diff_value1,value2_df1,value2_df2,diff_value2
1, 1 ,1 , 0 , 6 ,6 ,0
2, 10 ,5 , 5 , 8 ,4 ,4
3, 6 ,3 , 1 , 4 ,1 ,3
Likewise, I have hundreds of columns and want to compute the difference between the same columns in the two DataFrames; the columns are dynamic.
Maybe this will help:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.IntegerType

val spark = SparkSession.builder.appName("Test").master("local[*]").getOrCreate()
import spark.implicits._
var df1 = Seq((1, "1", "6"), (2, "10", "8"), (3, "6", "4")).toDF("id", "value1", "value2")
var df2 = Seq((1, "1", "6"), (2, "5", "4"), (3, "3", "1")).toDF("id", "value1", "value2")
df1.columns.foreach(column => {
df1 = df1.withColumn(column, df1.col(column).cast(IntegerType))
})
df2.columns.foreach(column => {
df2 = df2.withColumn(column, df2.col(column).cast(IntegerType))
})
df1 = df1.withColumnRenamed("id", "df1_id")
df2 = df2.withColumnRenamed("id", "df2_id")
df1.show()
df2.show()
So far you have two DataFrames with the value columns (value1, value2, and so on):
df1:
+------+------+------+
|df1_id|value1|value2|
+------+------+------+
| 1| 1| 6|
| 2| 10| 8|
| 3| 6| 4|
+------+------+------+
df2:
+------+------+------+
|df2_id|value1|value2|
+------+------+------+
| 1| 1| 6|
| 2| 5| 4|
| 3| 3| 1|
+------+------+------+
Now we are going to join them based on id:
var df3 = df1.alias("df1").join(df2.alias("df2"), $"df1.df1_id" === $"df2.df2_id")
And last, we take all columns of df1/df2 except the id (it's important that both have the same columns) and create a new diff column for each:
df1.columns.tail.foreach(col => {
val new_col_name = s"${col}-diff"
val df_a_col = s"df1.${col}"
val df_b_col = s"df2.${col}"
df3 = df3.withColumn(new_col_name, df3.col(df_a_col) - df3.col(df_b_col))
})
df3.show()
Result:
+------+------+------+------+------+------+-----------+-----------+
|df1_id|value1|value2|df2_id|value1|value2|value1-diff|value2-diff|
+------+------+------+------+------+------+-----------+-----------+
| 1| 1| 6| 1| 1| 6| 0| 0|
| 2| 10| 8| 2| 5| 4| 5| 4|
| 3| 6| 4| 3| 3| 1| 3| 3|
+------+------+------+------+------+------+-----------+-----------+
This is the result, and it's dynamic, so you can add whatever valueX columns you want.
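A small variation, in case the repeated withColumn calls on a var are undesirable: the diff columns can also be built in one select. This is just a sketch under the same assumptions (df1 and df2 already cast to IntegerType and with the id columns renamed as above):
import org.apache.spark.sql.functions.col

// every column except the renamed id is a value column to diff
val valueCols = df1.columns.filterNot(_ == "df1_id")
val joined = df1.alias("df1").join(df2.alias("df2"), $"df1.df1_id" === $"df2.df2_id")
// one diff column per value column, all computed in a single select
val diffCols = valueCols.map(c => (col(s"df1.$c") - col(s"df2.$c")).as(s"$c-diff"))
joined.select(col("df1.df1_id") +: diffCols: _*).show()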

How to apply conditional counts (with reset) to grouped data in PySpark?

I have PySpark code that effectively groups rows numerically and increments the group when a certain condition is met. I'm having trouble figuring out how to transform this code, efficiently, into one that can be applied to groups.
Take this sample dataframe df
df = sqlContext.createDataFrame(
[
(33, [], '2017-01-01'),
(33, ['apple', 'orange'], '2017-01-02'),
(33, [], '2017-01-03'),
(33, ['banana'], '2017-01-04')
],
('ID', 'X', 'date')
)
This code achieves what I want for this sample df, which is to order by date and to create groups ('grp') that increment when the size column goes back to 0.
df \
.withColumn('size', size(col('X'))) \
.withColumn(
"grp",
sum((col('size') == 0).cast("int")).over(Window.orderBy('date'))
).show()
This is partly based on Pyspark - Cumulative sum with reset condition
Now what I am trying to do is apply the same approach to a dataframe that has multiple IDs - achieving a result that looks like
df2 = sqlContext.createDataFrame(
[
(33, [], '2017-01-01', 0, 1),
(33, ['apple', 'orange'], '2017-01-02', 2, 1),
(33, [], '2017-01-03', 0, 2),
(33, ['banana'], '2017-01-04', 1, 2),
(55, ['coffee'], '2017-01-01', 1, 1),
(55, [], '2017-01-03', 0, 2)
],
('ID', 'X', 'date', 'size', 'group')
)
edit for clarity
1) For the first date of each ID - the group should be 1 - regardless of what shows up in any other column.
2) However, for each subsequent date, I need to check the size column. If the size column is 0, then I increment the group number. If it is any non-zero, positive integer, then I continue the previous group number.
I've seen a few ways to handle this in pandas, but I'm having difficulty understanding the applications in PySpark and the ways in which grouped data is different in pandas vs. Spark (e.g. do I need to use something called UDAFs?).
Create a column zero_or_first by checking whether the size is zero or the row is the first row. Then sum.
df2 = sqlContext.createDataFrame(
[
(33, [], '2017-01-01', 0, 1),
(33, ['apple', 'orange'], '2017-01-02', 2, 1),
(33, [], '2017-01-03', 0, 2),
(33, ['banana'], '2017-01-04', 1, 2),
(55, ['coffee'], '2017-01-01', 1, 1),
(55, [], '2017-01-03', 0, 2),
(55, ['banana'], '2017-01-01', 1, 1)
],
('ID', 'X', 'date', 'size', 'group')
)
w = Window.partitionBy('ID').orderBy('date')
df2 = df2.withColumn('row', F.row_number().over(w))
df2 = df2.withColumn('zero_or_first', F.when((F.col('size')==0)|(F.col('row')==1), 1).otherwise(0))
df2 = df2.withColumn('grp', F.sum('zero_or_first').over(w))
df2.orderBy('ID').show()
Here's the output. You can see that column group == grp, where group is the expected result.
+---+---------------+----------+----+-----+---+-------------+---+
| ID| X| date|size|group|row|zero_or_first|grp|
+---+---------------+----------+----+-----+---+-------------+---+
| 33| []|2017-01-01| 0| 1| 1| 1| 1|
| 33| [banana]|2017-01-04| 1| 2| 4| 0| 2|
| 33|[apple, orange]|2017-01-02| 2| 1| 2| 0| 1|
| 33| []|2017-01-03| 0| 2| 3| 1| 2|
| 55| [coffee]|2017-01-01| 1| 1| 1| 1| 1|
| 55| [banana]|2017-01-01| 1| 1| 2| 0| 1|
| 55| []|2017-01-03| 0| 2| 3| 1| 2|
+---+---------------+----------+----+-----+---+-------------+---+
I added a window function, and created an index within each ID. Then I expanded the conditional statement to also reference that index. The following seems to produce my desired output dataframe - but I am interested in knowing if there is a more efficient way to do this.
window = Window.partitionBy('ID').orderBy('date')
df \
.withColumn('size', size(col('X'))) \
.withColumn('index', rank().over(window).alias('index')) \
.withColumn(
"grp",
sum(((col('size') == 0) | (col('index') == 1)).cast("int")).over(window)
).show()
which yields
+---+---------------+----------+----+-----+---+
| ID| X| date|size|index|grp|
+---+---------------+----------+----+-----+---+
| 33| []|2017-01-01| 0| 1| 1|
| 33|[apple, orange]|2017-01-02| 2| 2| 1|
| 33| []|2017-01-03| 0| 3| 2|
| 33| [banana]|2017-01-04| 1| 4| 2|
| 55| [coffee]|2017-01-01| 1| 1| 1|
| 55| []|2017-01-03| 0| 2| 2|
+---+---------------+----------+----+-----+---+

Scala Spark GroupBy Aggregate maintain order of dates while sorting list

val df = Seq((1221, 1, "Boston", "9/22/18 14:00"), (1331, 1, "New York", "8/10/18 14:00"), (1442, 1, "Toronto", "10/15/19 14:00"), (2041, 2, "LA", "1/2/18 14:00"), (2001, 2,"San Fransisco", "5/20/18 15:00"), (3001, 3, "San Jose", "6/02/18 14:00"), (3121, 3, "Seattle", "9/12/18 16:00"), (34562, 3, "Utah", "12/12/18 14:00"), (3233, 3, "Boston", "8/31/18 14:00"), (4120, 4, "Miami", "1/01/18 14:00"), (4102, 4, "Cincinati", "7/21/19 14:00"), (4201, 4, "Washington", "5/10/18 23:00"), (4301, 4, "New Jersey", "3/27/18 15:00"), (4401, 4, "Raleigh", "11/14/18 14:00")).toDF("id", "group_id", "place", "date")
This is a simple df:
| id|group_id| place| date|
+-----+--------+-------------+--------------+
| 1221| 1| Boston| 9/22/18 14:00|
| 1331| 1| New York| 8/10/18 14:00|
| 1442| 1| Toronto|10/15/19 14:00|
| 2041| 2| LA| 1/2/18 14:00|
| 2001| 2|San Fransisco| 5/20/18 15:00|
| 3001| 3| San Jose| 6/02/18 14:00|
| 3121| 3| Seattle| 9/12/18 16:00|
| 4562| 3| Utah|12/12/18 14:00|
| 3233| 3| Boston| 8/31/18 14:00|
| 4120| 4| Miami| 1/01/18 14:00|
| 4102| 4| Cincinati| 7/21/19 14:00|
| 4201| 4| Washington| 5/10/18 23:00|
| 4301| 4| New Jersey| 3/27/18 15:00|
| 4401| 4| Raleigh|11/14/18 14:00|
+-----+--------+-------------+--------------+
I want to group by "group_id" and collect the dates in ascending order (earliest date first).
Needed output:
+--------+----+--------+--------------+----+-------------+--------------+----+----------+--------------+----+---------+--------------+
|group_id|id_1| venue_1| date_1|id_2| venue_2| date_2|id_3| venue_3| date_3|id_4| venue_4| date_4|
+--------+----+--------+--------------+----+-------------+--------------+----+----------+--------------+----+---------+--------------+
| 1|1331|New York|08/10/18 14:00|1221| Boston|09/22/18 14:00|1442| Toronto|10/15/19 14:00|null| null| null|
| 3|3001|San Jose|06/02/18 14:00|3233| Boston|08/31/18 14:00|3121| Seattle|09/12/18 16:00|4562| Utah|12/12/18 14:00|
| 4|4120| Miami|01/01/18 14:00|4301| New Jersey|03/27/18 15:00|4201|Washington|05/10/18 23:00|4102|Cincinati|07/21/19 14:00|
| 2|2041| LA| 01/2/18 14:00|2001|San Fransisco|05/20/18 15:00|null| null| null|null| null| null|
+--------+----+--------+--------------+----+-------------+--------------+----+----------+--------------+----+---------+--------------+
The code I am using:
//for sorting by date to preserve order
val df2 = df.repartition(col("group_id")).sortWithinPartitions("date")
val finalDF = df2.groupBy(df("group_id"))
  .agg(
    collect_list(df("id")).alias("id_list"),
    collect_list(df("place")).alias("venue_name_list"),
    collect_list(df("date")).alias("date_list"))
  .selectExpr(
    "group_id",
    "id_list[0] as id_1", "venue_name_list[0] as venue_1", "date_list[0] as date_1",
    "id_list[1] as id_2", "venue_name_list[1] as venue_2", "date_list[1] as date_2",
    "id_list[2] as id_3", "venue_name_list[2] as venue_3", "date_list[2] as date_3",
    "id_list[3] as id_4", "venue_name_list[3] as venue_4", "date_list[3] as date_4")
But the output is:
+--------+-----+-------+--------------+----+-------------+--------------+----+----------+-------------+----+----------+-------------+
|group_id| id_1|venue_1| date_1|id_2| venue_2| date_2|id_3| venue_3| date_3|id_4| venue_4| date_4|
+--------+-----+-------+--------------+----+-------------+--------------+----+----------+-------------+----+----------+-------------+
| 1| 1442|Toronto|10/15/19 14:00|1331| New York| 8/10/18 14:00|1221| Boston|9/22/18 14:00|null| null| null|
| 3|34562| Utah|12/12/18 14:00|3001| San Jose| 6/02/18 14:00|3233| Boston|8/31/18 14:00|3121| Seattle|9/12/18 16:00|
| 4| 4120| Miami| 1/01/18 14:00|4401| Raleigh|11/14/18 14:00|4301|New Jersey|3/27/18 15:00|4201|Washington|5/10/18 23:00|
| 2| 2041| LA| 1/2/18 14:00|2001|San Fransisco| 5/20/18 15:00|null| null| null|null| null| null|
+--------+-----+-------+--------------+----+-------------+--------------+----+----------+-------------+----+----------+-------------+
Observation:
If the dates are zero-padded, e.g. "9/22/18 14:00" written as "09/22/18 14:00" (a leading '0' added to single-digit months and days), the code works properly, i.e. the date order is maintained. Any solution is welcome! Thank you.
Format the date using the to_timestamp function and sort in the aggregation using sort_array, like this:
import org.apache.spark.sql.functions._
val df = Seq((1221, 1, "Boston", "9/22/18 14:00"), (1331, 1, "New York", "8/10/18 14:00"), (1442, 1, "Toronto", "10/15/19 14:00"), (2041, 2, "LA", "1/2/18 14:00"), (2001, 2, "San Fransisco", "5/20/18 15:00"), (3001, 3, "San Jose", "6/02/18 14:00"), (3121, 3, "Seattle", "9/12/18 16:00"), (34562, 3, "Utah", "12/12/18 14:00"), (3233, 3, "Boston", "8/31/18 14:00"), (4120, 4, "Miami", "1/01/18 14:00"), (4102, 4, "Cincinati", "7/21/19 14:00"), (4201, 4, "Washington", "5/10/18 23:00"), (4301, 4, "New Jersey", "3/27/18 15:00"), (4401, 4, "Raleigh", "11/14/18 14:00"))
.toDF("id", "group_id", "place", "date")
val df2 = df.withColumn("MyDate", to_timestamp($"date", "MM/dd/yy HH:mm"))
val finalDF = df2.groupBy(df("group_id"))
.agg(collect_list(df2("id")).alias("id_list"),
collect_list(df2("place")).alias("venue_name_list"),
sort_array(collect_list(df2("MyDate"))).alias("date_list")).
selectExpr("group_id",
"id_list[0] as id_1",
"venue_name_list[0] as venue_1",
"date_list[0] as date_1",
"id_list[1] as id_2",
"venue_name_list[1] as venue_2",
"date_list[1] as date_2",
"id_list[2] as id_3",
"venue_name_list[2] as venue_3",
"date_list[2] as date_3",
"id_list[3] as id_4",
"venue_name_list[3] as venue_4",
"date_list[3] as date_4")
finalDF.show()
As you've already figured out, sorting by unformatted StringType dates is the root of the problem. Here's one approach that first generates TimestampType dates, then creates a StructType column of the wanted columns in a "suitable field order" for sorting:
val finalDF = df.
withColumn("dateFormatted", to_timestamp($"date", "MM/dd/yy HH:mm")).
groupBy($"group_id").agg(
sort_array(collect_list(struct($"dateFormatted", $"id", $"place"))).as("sorted_arr")
).
selectExpr(
"group_id",
"sorted_arr[0].id as id_1", "sorted_arr[0].place as venue_1", "sorted_arr[0].dateFormatted as date_1",
"sorted_arr[1].id as id_2", "sorted_arr[1].place as venue_2", "sorted_arr[1].dateFormatted as date_2",
"sorted_arr[2].id as id_3", "sorted_arr[2].place as venue_3", "sorted_arr[2].dateFormatted as date_3",
"sorted_arr[3].id as id_4", "sorted_arr[3].place as venue_4", "sorted_arr[3].dateFormatted as date_4"
)
finalDF.show
// +--------+----+--------+-------------------+----+-------------+-------------------+----+----------+-------------------+-----+-------+-------------------+
// |group_id|id_1| venue_1| date_1|id_2| venue_2| date_2|id_3| venue_3| date_3| id_4|venue_4| date_4|
// +--------+----+--------+-------------------+----+-------------+-------------------+----+----------+-------------------+-----+-------+-------------------+
// | 1|1331|New York|2018-08-10 14:00:00|1221| Boston|2018-09-22 14:00:00|1442| Toronto|2019-10-15 14:00:00| null| null| null|
// | 3|3001|San Jose|2018-06-02 14:00:00|3233| Boston|2018-08-31 14:00:00|3121| Seattle|2018-09-12 16:00:00|34562| Utah|2018-12-12 14:00:00|
// | 4|4120| Miami|2018-01-01 14:00:00|4301| New Jersey|2018-03-27 15:00:00|4201|Washington|2018-05-10 23:00:00| 4401|Raleigh|2018-11-14 14:00:00|
// | 2|2041| LA|2018-01-02 14:00:00|2001|San Fransisco|2018-05-20 15:00:00|null| null| null| null| null| null|
// +--------+----+--------+-------------------+----+-------------+-------------------+----+----------+-------------------+-----+-------+-------------------+
A couple of notes:
Forming the StructType column is necessary to make sure corresponding columns will be sorted together
Struct field dateFormatted is placed first so that sort_array will sort the array in the wanted order
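If the zero-padded string format from the wanted output is preferred over full timestamps, the date columns can be formatted back afterwards. A minimal sketch, assuming the finalDF built just above:
import org.apache.spark.sql.functions.{col, date_format}

// format each date_N column back to the "MM/dd/yy HH:mm" style shown in the wanted output
val dateCols = Seq("date_1", "date_2", "date_3", "date_4")
val formattedDF = dateCols.foldLeft(finalDF) { (acc, c) =>
  acc.withColumn(c, date_format(col(c), "MM/dd/yy HH:mm"))
}
formattedDF.show(false)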

How to calculate connections of the node in Spark 2

I have the following DataFrame df:
val df = Seq(
(1, 0, 1, 0, 0), (1, 4, 1, 0, 4), (2, 2, 1, 2, 2),
(4, 3, 1, 4, 4), (4, 5, 1, 4, 4)
).toDF("from", "to", "attr", "type_from", "type_to")
+-----+-----+----+---------------+---------------+
|from |to |attr|type_from |type_to |
+-----+-----+----+---------------+---------------+
| 1| 0| 1| 0| 0|
| 1| 4| 1| 0| 4|
| 2| 2| 1| 2| 2|
| 4| 3| 1| 4| 4|
| 4| 5| 1| 4| 4|
+-----+-----+----+---------------+---------------+
I want to count the number of ingoing and outgoing links for each node only when the type of from node is the same as the type of to node (i.e. the values of type_from and type_to).
The cases when to and from are equal should be excluded.
This is how I calculate the number of outgoing links, based on this answer proposed by user8371915:
val df_out = df
  .where($"type_from" === $"type_to" && $"from" =!= $"to")
  .groupBy($"from" as "nodeId", $"type_from" as "type")
  .agg(count("*") as "numLinks")
  .na.fill(0)
df_out.show()
Of course, I can repeat the same calculation for the incoming links and then join the results. But is there any shorter solution?
val df_in = df
  .where($"type_from" === $"type_to" && $"from" =!= $"to")
  .groupBy($"to" as "nodeId", $"type_to" as "type")
  .agg(count("*") as "numLinks")
  .na.fill(0)
df_in.show()
val df_result = df_out.join(df_in, Seq("nodeId", "type"), "rightouter")
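For what it's worth, one possible shorter formulation (a sketch, not taken from the original thread) is to stack both endpoints of every qualifying edge with a union and aggregate once, producing the in- and out-counts in a single pass:
import org.apache.spark.sql.functions.{count, lit, when}

val edges = df.where($"type_from" === $"type_to" && $"from" =!= $"to")
// each edge contributes an "out" row for its source node and an "in" row for its target node
val stacked = edges.select($"from" as "nodeId", $"type_from" as "type", lit("out") as "direction")
  .union(edges.select($"to" as "nodeId", $"type_to" as "type", lit("in") as "direction"))
stacked
  .groupBy($"nodeId", $"type")
  .agg(
    count(when($"direction" === "out", 1)) as "outLinks",
    count(when($"direction" === "in", 1)) as "inLinks")
  .show()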

Spark Scala: how to get the result set with a condition on grouped data

Is there a way to group the DataFrame using its own schema?
This produces data in the format:
Country | Class | Name | age
US, 1,'aaa',21
US, 1,'bbb',20
BR, 2,'ccc',30
AU, 3,'ddd',20
....
I would want to do something like:
Country | Class 1 Students | Class 2 Students
US , 2, 0
BR , 0, 1
....
Condition 1: group by country.
Condition 2: get only the class 1 and class 2 values.
This is the source code:
val df = Seq(("US", 1, "AAA",19),("US", 1, "BBB",20),("KR", 2, "CCC",29),
("AU", 3, "DDD",18)).toDF("country", "class", "name","age")
df.groupBy("country").agg(count($"name") as "Cnt")
You should use the pivot function.
val df = Seq(("US", 1, "AAA",19),("US", 1, "BBB",20),("KR", 2, "CCC",29),
("AU", 3, "DDD",18)).toDF("country", "class", "name","age")
df.groupBy("country").pivot("class").agg(count($"name") as "Cnt").show
+-------+---+---+---+
|country| 1| 2| 3|
+-------+---+---+---+
| AU| 0| 0| 1|
| US| 2| 0| 0|
| KR| 0| 1| 0|
+-------+---+---+---+
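Since the wanted output only needs classes 1 and 2, the pivot values can also be restricted explicitly and the resulting columns renamed. A sketch using the same df (the renamed column names are illustrative):
import org.apache.spark.sql.functions.count

df.groupBy("country")
  .pivot("class", Seq(1, 2))          // keep only the class 1 and class 2 columns
  .agg(count($"name") as "Cnt")
  .na.fill(0)
  .withColumnRenamed("1", "Class 1 Students")
  .withColumnRenamed("2", "Class 2 Students")
  .show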