How to find the next occurring item from current row in a data frame using Spark Windowing? - scala

I have the following Dataframe:
+------+----------+-------------+--------------------+---------+-----+----------+
|ID |MEM_ID | BFS | SVC_DT |TYP |SEQ |BFS_SEQ |
+------+----------+----------------------------------+---------+-----+----------+
|105771|29378668 | BRIMONIDINE | 2019-02-04 00:00:00|PD |1 |1 |
|105772|29378668 | BRIMONIDINE | 2019-04-04 00:00:00|PD |2 |2 |
|105773|29378668 | BRIMONIDINE | 2019-04-17 00:00:00|RV |3 |3 |
|105774|29378668 | TIMOLOL | 2019-04-17 00:00:00|RV |4 |1 |
|105775|29378668 | BRIMONIDINE | 2019-04-22 00:00:00|PD |5 |4 |
|105776|29378668 | TIMOLOL | 2019-04-22 00:00:00|PD |6 |2 |
+------+----------+----------------------------------+---------+-----+----------+
For every row, I have to find the occurrence of next 'PD' Typ at BFS level from the current row and populate its associated ID as a new column named 'NEXT_PD_TYP_ID'
The output I am expecting is:
+------+---------+-------------+--------------------+----+-----+---------+---------------+
|ID |MEM_ID | BFS | SVC_DT |TYP |SEQ |BFS_SEQ |NEXT_PD_TYP_ID |
+------+---------+----------------------------------+----+-----+---------+---------------+
|105771|29378668 | BRIMONIDINE | 2019-02-04 00:00:00|PD |1 |1 |105772 |
|105772|29378668 | BRIMONIDINE | 2019-04-04 00:00:00|PD |2 |2 |105775 |
|105773|29378668 | BRIMONIDINE | 2019-04-17 00:00:00|RV |3 |3 |105775 |
|105774|29378668 | TIMOLOL | 2019-04-17 00:00:00|RV |4 |1 |105776 |
|105775|29378668 | BRIMONIDINE | 2019-04-22 00:00:00|PD |5 |4 |null |
|105776|29378668 | TIMOLOL | 2019-04-22 00:00:00|PD |6 |2 |null |
+------+---------+----------------------------------+----+-----+---------+---------------+
Need help.
I have tried using the conditional aggregation: max(when), however since it has more than one 'PD' the max is returning only one value for all the rows.
No error messages

I hope this helps.
I created a new column with ID's of TYP === PD. I called it TYPPDID.
Then I used Window frame ranging from next row to unbounded following row and got the first not-null TYPPDID
orderBy("ID") in the end is only to show records in order.
import org.apache.spark.sql.functions._
val df = Seq(
("105771", "BRIMONIDINE", "PD"),
("105772", "BRIMONIDINE", "PD"),
("105773", "BRIMONIDINE","RV"),
("105774", "TIMOLOL", "RV"),
("105775", "BRIMONIDINE", "PD"),
("105776", "TIMOLOL", "PD")
).toDF("ID", "BFS", "TYP").withColumn("TYPPDID", when($"TYP" === "PD", $"ID"))
df: org.apache.spark.sql.DataFrame = [ID: string, BFS: string ... 2 more fields]
scala> df.show
+------+-----------+---+-------+
| ID| BFS|TYP|TYPPDID|
+------+-----------+---+-------+
|105771|BRIMONIDINE| PD| 105771|
|105772|BRIMONIDINE| PD| 105772|
|105773|BRIMONIDINE| RV| null|
|105774| TIMOLOL| RV| null|
|105775|BRIMONIDINE| PD| 105775|
|105776| TIMOLOL| PD| 105776|
+------+-----------+---+-------+
scala> val overColumns = Window.partitionBy("BFS").orderBy("ID").rowsBetween(1, Window.unboundedFollowing)
overColumns: org.apache.spark.sql.expressions.WindowSpec = org.apache.spark.sql.expressions.WindowSpec#eb923ef
scala> df.withColumn("NEXT_PD_TYP_ID",first("TYPPDID", true).over(overColumns)).orderBy("ID").show(false)
+------+-----------+---+-------+-------+
|ID |BFS |TYP|TYPPDID|NEXT_PD_TYP_ID|
+------+-----------+---+-------+-------+
|105771|BRIMONIDINE|PD |105771 |105772 |
|105772|BRIMONIDINE|PD |105772 |105775 |
|105773|BRIMONIDINE|RV |null |105775 |
|105774|TIMOLOL |RV |null |105776 |
|105775|BRIMONIDINE|PD |105775 |null |
|105776|TIMOLOL |PD |105776 |null |
+------+-----------+---+-------+-------+

Related

How to add some values in a dataframe in Scala Spark?

Here is the dataframe I have for now, suppose there are totally 4 days{1,2,3,4}:
+-------------+----------+------+
| key | Time | Value|
+-------------+----------+------+
| 1 | 1 | 1 |
| 1 | 2 | 2 |
| 1 | 4 | 3 |
| 2 | 2 | 4 |
| 2 | 3 | 5 |
+-------------+----------+------+
And what I want is
+-------------+----------+------+
| key | Time | Value|
+-------------+----------+------+
| 1 | 1 | 1 |
| 1 | 2 | 2 |
| 1 | 3 | null |
| 1 | 4 | 3 |
| 2 | 1 | null |
| 2 | 2 | 4 |
| 2 | 3 | 5 |
| 2 | 4 | null |
+-------------+----------+------+
If there is some ways that can help me get this?
Say df1 is our main table:
+---+----+-----+
|key|Time|Value|
+---+----+-----+
|1 |1 |1 |
|1 |2 |2 |
|1 |4 |3 |
|2 |2 |4 |
|2 |3 |5 |
+---+----+-----+
We can use the following transformations:
val data = df1
// we first group by and aggregate the values to a sequence between 1 and 4 (your number)
.groupBy("key")
.agg(sequence(lit(1), lit(4)).as("Time"))
// we explode the sequence, thus creating all 'Time' per 'key'
.withColumn("Time", explode(col("Time")))
// finally, we join with our main table on 'key' and 'Time'
.join(df1, Seq("key", "Time"), "left")
To get this output:
+---+----+-----+
|key|Time|Value|
+---+----+-----+
|1 |1 |1 |
|1 |2 |2 |
|1 |3 |null |
|1 |4 |3 |
|2 |1 |null |
|2 |2 |4 |
|2 |3 |5 |
|2 |4 |null |
+---+----+-----+
Which should be what you are looking for, good luck!

Delete values lower than cummax on multiple spark dataframe columns in scala

I am having a data frame as shown below. The number of signals are more than 100, so there will be more than 100 columns in the data frame.
+---+------------+--------+--------+--------+
|id | date|signal01|signal02|signal03|......
+---+------------+--------+--------+--------+
|050|2021-01-14 |1 |3 |1 |
|050|2021-01-15 |null |4 |2 |
|050|2021-02-02 |2 |3 |3 |
|051|2021-01-14 |1 |3 |0 |
|051|2021-01-15 |2 |null |null |
|051|2021-02-02 |3 |3 |2 |
|051|2021-02-03 |1 |3 |1 |
|052|2021-03-03 |1 |3 |0 |
|052|2021-03-05 |3 |3 |null |
|052|2021-03-06 |2 |null |2 |
|052|2021-03-16 |3 |5 |5 |.......
+-------------------------------------------+
I have to find out cummax of each signal and then compare with respective signal columns and delete the signal records which are having value lower than cummax and null values.
step1. find cumulative max for each signal with respect to id column.
step2. delete the records which are having lower value than cummax for each signal.
step3. Take count of records which are having cummax less than signal value(excluded of null) for each signals with respect to id.
After the count the final output should be as shown below.
+---+------------+--------+--------+--------+
|id | date|signal01|signal02|signal03|.....
+---+------------+--------+--------+--------+
|050|2021-01-14 |1 | 3 | 1 |
|050|2021-01-15 |null | null | 2 |
|050|2021-02-02 |2 | 3 | 3 |
|
|051|2021-01-14 |1 | 3 | 0 |
|051|2021-01-15 |2 | null | null |
|051|2021-02-02 |3 | 3 | 2 |
|051|2021-02-03 |null | 3 | null |
|
|052|2021-03-03 |1 | 3 | 0 |
|052|2021-03-05 |3 | 3 | null |
|052|2021-03-06 |null | null | 2 |
|052|2021-03-16 |3 | 5 | 5 | ......
+----------------+--------+--------+--------+
I have tried by using window function as below and it worked for almost all records.
val w = Window.partitionBy("id").orderBy("date").rowsBetween(Window.unboundedPreceding, Window.currentRow)
val signalList01 = ListBuffer[Column]()
signalList01.append(col("id"), col("date"))
for (column <- signalColumns) {
// Applying the max non null aggregate function on each signal column
signalList01 += (col(column), max(column).over(w).alias(column+"_cummax")) }
val cumMaxDf = df.select(signalList01: _*)
But I am getting error values as shown below for few records.
Is there any idea about how this error records in the cummax column? Any leads appreciated!
Just giving out hints here (as you suggested) to help you unblock the situation, but --WARNING-- haven't tested the code !
the code you provided in the comments looks good. It'll get you your max column
val nw_df = original_df.withColumn("singal01_cummax", sum(col("singal01")).over(windowCodedSO))
now, you need to be able to compare the two values in "singal01" and "singal01_cummax". A function like this, maybe:
def takeOutRecordsLessThanCummax (signal:Int, singal_cummax: Int) : Any =
{ if (signal == null || signal < singal_cummax) null
else singal_cummax }
since we'll be applying it to columns, we'll wrap it up in a UDF
val takeOutRecordsLessThanCummaxUDF : UserDefinedFunction = udf {
(i:Int, j:Int) => takeOutRecordsLessThanCummax(i,j)
}
and then, you can combine everything above so it can be applicable on your original dataframe. Something like this could work:
val signal_cummax_suffix = "_cummax"
val result = original_df.columns.foldLeft(original_df)(
(dfac, colname) => dfac
.withColumn(colname.concat(signal_cummax_suffix),
sum(col(colname)).over(windowCodedSO))
.withColumn(colname.concat("output"),
takeOutRecordsLessThanCummaxUDF(col(colname), col(colname.concat(signal_cummax_suffix))))
)

How to rank dataframe depending on a group of rows in a column?

I have this dataframe :
+-----+----------+---------+
|num |Timestamp |frequency|
+-----+----------+---------+
|20.0 |1632899456|4 |
|20.0 |1632901256|4 |
|20.0 |1632901796|4 |
|20.0 |1632899155|4 |
|10.0 |1632901743|2 |
|10.0 |1632899933|2 |
|91.0 |1632899756|1 |
|32.0 |1632900776|1 |
|41.0 |1632900176|1 |
+-----+----------+---------+
I want to add a column containing the rank of each frequency. The new dataframe would be like this :
+-----+----------+---------+------------+
|num |Timestamp |frequency|rank |
+-----+----------+---------+------------+
|20.0 |1632899456|4 |1 |
|20.0 |1632901256|4 |1 |
|20.0 |1632901796|4 |1 |
|20.0 |1632899155|4 |1 |
|10.0 |1632901743|2 |2 |
|10.0 |1632899933|2 |2 |
|91.0 |1632899756|1 |3 |
|32.0 |1632900776|1 |3 |
|41.0 |1632900176|1 |3 |
+-----+----------+---------+------------+
I am using Spark version 2.4.3 and SQLContext, with scala language.
You can use dense_rank:
import org.apache.spark.sql.expressions.Window
val df2 = df.withColumn("rank", dense_rank().over(Window.orderBy(desc("frequency")))

Scala Spark Join Dataframe in loop

I am trying to join DataFrames on the fly in loop. I am using a properties file to get the column details to use in the final data frame.
Properties file -
a01=status:single,perm_id:multi
a02=status:single,actv_id:multi
a03=status:single,perm_id:multi,actv_id:multi
............................
............................
For each row in the properties file, I need to create a DataFrame and save it in a file. Loading the properties file using PropertiesReader. if the mode is single then I need to get only the column value from the table. But if multi, then I need to get the list of values.
val propertyColumn = properties.get("a01") //a01 value we are getting as an argument. This might be a01,a02 or a0n
val columns = propertyColumn.toString.split(",").map(_.toString)
act_det table -
+-------+--------+-----------+-----------+-----------+------------+
|id |act_id |status |perm_id |actv_id | debt_id |
+-------+--------+-----------+-----------+-----------+------------+
| 1 |1 | 4 | 1 | 10 | 1 |
+-------+--------+-----------+-----------+-----------+------------+
| 2 |1 | 4 | 2 | 20 | 2 |
+-------+--------+-----------+-----------+-----------+------------+
| 3 |1 | 4 | 3 | 30 | 1 |
+-------+--------+-----------+-----------+-----------+------------+
| 4 |2 | 4 | 5 | 10 | 3 |
+-------+--------+-----------+-----------+-----------+------------+
| 5 |2 | 4 | 6 | 20 | 1 |
+-------+--------+-----------+-----------+-----------+------------+
| 6 |2 | 4 | 7 | 30 | 1 |
+-------+--------+-----------+-----------+-----------+------------+
| 7 |3 | 4 | 1 | 10 | 3 |
+-------+--------+-----------+-----------+-----------+------------+
| 8 |3 | 4 | 5 | 20 | 1 |
+-------+--------+-----------+-----------+-----------+------------+
| 9 |3 | 4 | 2 | 30 | 3 |
+-------+--------+-----------+-----------+------------+-----------+
Main DataFrame -
val data = sqlContext.sql("select * from act_det")
I want the following output -
For a01 -
+-------+--------+-----------+
|act_id |status |perm_id |
+-------+--------+-----------+
| 1 | 4 | [1,2,3] |
+-------+--------+-----------+
| 2 | 4 | [5,6,7] |
+-------+--------+-----------+
| 3 | 4 | [1,5,2] |
+-------+--------+-----------+
For a02 -
+-------+--------+-----------+
|act_id |status |actv_id |
+-------+--------+-----------+
| 1 | 4 | [10,20,30]|
+-------+--------+-----------+
| 2 | 4 | [10,20,30]|
+-------+--------+-----------+
| 3 | 4 | [10,20,30]|
+-------+--------+-----------+
For a03 -
+-------+--------+-----------+-----------+
|act_id |status |perm_id |actv_id |
+-------+--------+-----------+-----------+
| 1 | 4 | [1,2,3] |[10,20,30] |
+-------+--------+-----------+-----------+
| 2 | 4 | [5,6,7] |[10,20,30] |
+-------+--------+-----------+-----------+
| 3 | 4 | [1,5,2] |[10,20,30] |
+-------+--------+-----------+-----------+
But the data frame creation process should be dynamic.
I have tried below code but I am not able to implement the join logic for the DataFrames in loop.
val finalDF:DataFrame = ??? //empty dataframe
for {
column <- columns
} yeild {
val eachColumn = column.toString.split(":").map(_.toString)
val columnName = eachColumn(0)
val mode = eachColumn(1)
if(mode.equalsIgnoreCase("single")) {
data.select($"act_id", $"status").distinct
//I want to join finalDF with data.select($"act_id", $"status").distinct
} else if(mode.equalsIgnoreCase("multi")) {
data.groupBy($"act_id").agg(collect_list($"perm_id").as("perm_id"))
//I want to join finalDF with data.groupBy($"act_id").agg(collect_list($"perm_id").as("perm_id"))
}
}
Any advice or guidance would be greatly appreciated.
Check below code.
scala> df.show(false)
+---+------+------+-------+-------+-------+
|id |act_id|status|perm_id|actv_id|debt_id|
+---+------+------+-------+-------+-------+
|1 |1 |4 |1 |10 |1 |
|2 |1 |4 |2 |20 |2 |
|3 |1 |4 |3 |30 |1 |
|4 |2 |4 |5 |10 |3 |
|5 |2 |4 |6 |20 |1 |
|6 |2 |4 |7 |30 |1 |
|7 |3 |4 |1 |10 |3 |
|8 |3 |4 |5 |20 |1 |
|9 |3 |4 |2 |30 |3 |
+---+------+------+-------+-------+-------+
Defining primary keys
scala> val primary_key = Seq("act_id").map(col(_))
primary_key: Seq[org.apache.spark.sql.Column] = List(act_id)
Configs
scala> configs.foreach(println)
/*
(a01,status:single,perm_id:multi)
(a02,status:single,actv_id:multi)
(a03,status:single,perm_id:multi,actv_id:multi)
*/
Constructing Expression.
scala>
val columns = configs
.map(c => {
c._2
.split(",")
.map(c => {
val cc = c.split(":");
if(cc.tail.contains("single"))
first(col(cc.head)).as(cc.head)
else
collect_list(col(cc.head)).as(cc.head)
}
)
})
/*
columns: scala.collection.immutable.Iterable[Array[org.apache.spark.sql.Column]] = List(
Array(first(status, false) AS `status`, collect_list(perm_id) AS `perm_id`),
Array(first(status, false) AS `status`, collect_list(actv_id) AS `actv_id`),
Array(first(status, false) AS `status`, collect_list(perm_id) AS `perm_id`, collect_list(actv_id) AS `actv_id`)
)
*/
Final Result
scala> columns.map(c => df.groupBy(primary_key:_*).agg(c.head,c.tail:_*)).map(_.show(false))
+------+------+---------+
|act_id|status|perm_id |
+------+------+---------+
|3 |4 |[1, 5, 2]|
|1 |4 |[1, 2, 3]|
|2 |4 |[5, 6, 7]|
+------+------+---------+
+------+------+------------+
|act_id|status|actv_id |
+------+------+------------+
|3 |4 |[10, 20, 30]|
|1 |4 |[10, 20, 30]|
|2 |4 |[10, 20, 30]|
+------+------+------------+
+------+------+---------+------------+
|act_id|status|perm_id |actv_id |
+------+------+---------+------------+
|3 |4 |[1, 5, 2]|[10, 20, 30]|
|1 |4 |[1, 2, 3]|[10, 20, 30]|
|2 |4 |[5, 6, 7]|[10, 20, 30]|
+------+------+---------+------------+

how to rename the Columns Produced by count() function in Scala

I have the below df:
+------+-------+--------+
|student| vars|observed|
+------+-------+--------+
| 1| ABC | 19|
| 1| ABC | 1|
| 2| CDB | 1|
| 1| ABC | 8|
| 3| XYZ | 3|
| 1| ABC | 389|
| 2| CDB | 946|
| 1| ABC | 342|
|+------+-------+--------+
I wanted to add a new frequency column groupBy two columns "student", "vars" in SCALA.
val frequency = df.groupBy($"student", $"vars").count()
This code generates a "count" column with the frequencies BUT losing observed column from the df.
I would like to create a new df as follows without losing "observed" column
+------+-------+--------+------------+
|student| vars|observed|total_count |
+------+-------+--------+------------+
| 1| ABC | 9|22
| 1| ABC | 1|22
| 2| CDB | 1|7
| 1| ABC | 2|22
| 3| XYZ | 3|3
| 1| ABC | 8|22
| 2| CDB | 6|7
| 1| ABC | 2|22
|+------+-------+-------+--------------+
You cannot do this directly but there are couple of ways,
You can join original df with count df. check here
You collect the observed column while doing aggregation and explode it again
With explode:
val frequency = df.groupBy("student", "vars").agg(collect_list("observed").as("observed_list"),count("*").as("total_count")).select($"student", $"vars",explode($"observed_list").alias("observed"), $"total_count")
scala> frequency.show(false)
+-------+----+--------+-----------+
|student|vars|observed|total_count|
+-------+----+--------+-----------+
|3 |XYZ |3 |1 |
|2 |CDB |1 |2 |
|2 |CDB |946 |2 |
|1 |ABC |389 |5 |
|1 |ABC |342 |5 |
|1 |ABC |19 |5 |
|1 |ABC |1 |5 |
|1 |ABC |8 |5 |
+-------+----+--------+-----------+
We can use Window functions as well
val windowSpec = Window.partitionBy("student","vars")
val frequency = df.withColumn("total_count", count(col("student")) over windowSpec)
.show
+-------+----+--------+-----------+
|student|vars|observed|total_count|
+-------+----+--------+-----------+
|3 |XYZ |3 |1 |
|2 |CDB |1 |2 |
|2 |CDB |946 |2 |
|1 |ABC |389 |5 |
|1 |ABC |342 |5 |
|1 |ABC |19 |5 |
|1 |ABC |1 |5 |
|1 |ABC |8 |5 |
+-------+----+--------+-----------+