How to add some values in a dataframe in Scala Spark?

Here is the dataframe I have for now. Suppose there are 4 days in total: {1, 2, 3, 4}.
+-------------+----------+------+
| key | Time | Value|
+-------------+----------+------+
| 1 | 1 | 1 |
| 1 | 2 | 2 |
| 1 | 4 | 3 |
| 2 | 2 | 4 |
| 2 | 3 | 5 |
+-------------+----------+------+
And what I want is
+-------------+----------+------+
| key | Time | Value|
+-------------+----------+------+
| 1 | 1 | 1 |
| 1 | 2 | 2 |
| 1 | 3 | null |
| 1 | 4 | 3 |
| 2 | 1 | null |
| 2 | 2 | 4 |
| 2 | 3 | 5 |
| 2 | 4 | null |
+-------------+----------+------+
Is there a way to get this?

Say df1 is our main table:
+---+----+-----+
|key|Time|Value|
+---+----+-----+
|1 |1 |1 |
|1 |2 |2 |
|1 |4 |3 |
|2 |2 |4 |
|2 |3 |5 |
+---+----+-----+
We can use the following transformations:
import org.apache.spark.sql.functions.{col, explode, lit, sequence}
val data = df1
  // first, group by 'key' and build the sequence 1 to 4 (your number of days) for each key
  .groupBy("key")
  .agg(sequence(lit(1), lit(4)).as("Time"))
  // explode the sequence, creating one row per 'Time' for each 'key'
  .withColumn("Time", explode(col("Time")))
  // finally, left join with the main table on 'key' and 'Time'
  .join(df1, Seq("key", "Time"), "left")
To get this output:
+---+----+-----+
|key|Time|Value|
+---+----+-----+
|1 |1 |1 |
|1 |2 |2 |
|1 |3 |null |
|1 |4 |3 |
|2 |1 |null |
|2 |2 |4 |
|2 |3 |5 |
|2 |4 |null |
+---+----+-----+
Which should be what you are looking for, good luck!
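A variation on the same idea, offered only as a sketch and not as the method used above: build the full key/Time grid by cross-joining the distinct keys with a range of days, then left join the original rows. The names days and filled, the app name, and the local master setting are assumptions made for this illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Local session for the sketch; in spark-shell this returns the existing session.
val spark = SparkSession.builder().master("local[*]").appName("fill-missing-times").getOrCreate()
import spark.implicits._

val df1 = Seq((1, 1, 1), (1, 2, 2), (1, 4, 3), (2, 2, 4), (2, 3, 5))
  .toDF("key", "Time", "Value")

// Days 1 to 4 as a one-column DataFrame (the upper bound of range is exclusive).
val days = spark.range(1, 5).select(col("id").cast("int").as("Time"))

// Every (key, Time) combination, then a left join to pull in the existing values.
val filled = df1.select("key").distinct()
  .crossJoin(days)
  .join(df1, Seq("key", "Time"), "left")
  .orderBy("key", "Time")

filled.show()
```

This keeps the list of days in one place, which may be convenient if the day range comes from another table instead of a fixed 1 to 4.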

Related

DB2/AS400 SQL Pivot

I have a problem with pivot tables ....
I don't understand what to do ...
My table is as follows:
|CODART|MONTH|QT |
|------|-----|----|
|ART1 |1 |100 |
|ART2 |1 |30 |
|ART3 |1 |30 |
|ART1 |2 |10 |
|ART4 |2 |40 |
|ART3 |4 |50 |
|ART5 |4 |60 |
I would like to get a summary table by month:
|CODART|1 |2 |3 |4 |5 |6 |7 |8 |9 |10 |11 |12 |
|------|---|---|---|---|---|---|---|---|---|---|---|---|
|ART1 |100|10 | | | | | | | | | | |
|ART2 |30 | | | | | | | | | | | |
|ART3 |30 | | |50 | | | | | | | | |
|ART4 | |40 | | | | | | | | | | |
|ART5 | | | |60 | | | | | | | | |
|TOTAL |160|50 | |110| | | | | | | | |
Too many requests? :-)
Thanks for the support
WITH MYTAB (CODART, MONTH, QT) AS
(
VALUES
('ART1', 1, 100)
, ('ART2', 1, 30)
, ('ART3', 1, 30)
, ('ART1', 2, 10)
, ('ART4', 2, 40)
, ('ART3', 4, 50)
, ('ART5', 4, 60)
)
SELECT
CASE GROUPING (CODART) WHEN 0 THEN CODART ELSE 'TOTAL' END AS CODART
, SUM (CASE MONTH WHEN 1 THEN QT END) AS "1"
, SUM (CASE MONTH WHEN 2 THEN QT END) AS "2"
, SUM (CASE MONTH WHEN 3 THEN QT END) AS "3"
, SUM (CASE MONTH WHEN 4 THEN QT END) AS "4"
---
, SUM (CASE MONTH WHEN 12 THEN QT END) AS "12"
FROM MYTAB T
GROUP BY ROLLUP (T.CODART)
ORDER BY GROUPING (T.CODART), T.CODART
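|CODART|1 |2 |3 |4 |12 |
|------|---|---|---|---|---|
|ART1 |100|10 | | | |
|ART2 |30 | | | | |
|ART3 |30 | | |50 | |
|ART4 | |40 | | | |
|ART5 | | | |60 | |
|TOTAL |160|50 | |110| |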

Delete values lower than cummax on multiple spark dataframe columns in scala

I have a data frame as shown below. There are more than 100 signals, so the data frame has more than 100 columns.
+---+------------+--------+--------+--------+
|id | date|signal01|signal02|signal03|......
+---+------------+--------+--------+--------+
|050|2021-01-14 |1 |3 |1 |
|050|2021-01-15 |null |4 |2 |
|050|2021-02-02 |2 |3 |3 |
|051|2021-01-14 |1 |3 |0 |
|051|2021-01-15 |2 |null |null |
|051|2021-02-02 |3 |3 |2 |
|051|2021-02-03 |1 |3 |1 |
|052|2021-03-03 |1 |3 |0 |
|052|2021-03-05 |3 |3 |null |
|052|2021-03-06 |2 |null |2 |
|052|2021-03-16 |3 |5 |5 |.......
+-------------------------------------------+
I have to find the cumulative max (cummax) of each signal, compare it with the respective signal column, and delete the signal values that are lower than the cummax, as well as null values.
Step 1. Find the cumulative max for each signal with respect to the id column.
Step 2. Delete the records that have a lower value than the cummax for each signal.
Step 3. Count the records (excluding nulls) whose cummax is less than the signal value, for each signal with respect to id.
After the count, the final output should be as shown below.
+---+------------+--------+--------+--------+
|id | date|signal01|signal02|signal03|.....
+---+------------+--------+--------+--------+
|050|2021-01-14 |1 | 3 | 1 |
|050|2021-01-15 |null | null | 2 |
|050|2021-02-02 |2 | 3 | 3 |
|
|051|2021-01-14 |1 | 3 | 0 |
|051|2021-01-15 |2 | null | null |
|051|2021-02-02 |3 | 3 | 2 |
|051|2021-02-03 |null | 3 | null |
|
|052|2021-03-03 |1 | 3 | 0 |
|052|2021-03-05 |3 | 3 | null |
|052|2021-03-06 |null | null | 2 |
|052|2021-03-16 |3 | 5 | 5 | ......
+----------------+--------+--------+--------+
I tried using a window function as below, and it worked for almost all records.
import scala.collection.mutable.ListBuffer
import org.apache.spark.sql.Column
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, max}
val w = Window.partitionBy("id").orderBy("date").rowsBetween(Window.unboundedPreceding, Window.currentRow)
val signalList01 = ListBuffer[Column]()
signalList01 ++= Seq(col("id"), col("date"))
for (column <- signalColumns) {
  // apply the max aggregate (which ignores nulls) over the window to each signal column
  signalList01 ++= Seq(col(column), max(column).over(w).alias(column + "_cummax"))
}
val cumMaxDf = df.select(signalList01.toSeq: _*)
But I am getting wrong values, as shown below, for a few records.
Any idea how these wrong records end up in the cummax column? Any leads appreciated!
Just giving out hints here (as you suggested) to help unblock you, but WARNING: I haven't tested the code!
The code you provided in the comments looks good. It'll get you your cummax column:
// windowCodedSO: the window spec from your code (presumably partitioned by id and ordered by date)
val nw_df = original_df.withColumn("signal01_cummax", max(col("signal01")).over(windowCodedSO))
Now you need to be able to compare the two values in "signal01" and "signal01_cummax". A function like this, maybe:
// boxed Integer parameters, so that null inputs actually reach the function
def takeOutRecordsLessThanCummax(signal: Integer, signalCummax: Integer): Option[Int] =
  if (signal == null || signalCummax == null || signal.intValue < signalCummax.intValue) None
  else Some(signal.intValue)
Since we'll be applying it to columns, we'll wrap it in a UDF:
val takeOutRecordsLessThanCummaxUDF: UserDefinedFunction = udf {
  (i: Integer, j: Integer) => takeOutRecordsLessThanCummax(i, j)
}
Then you can combine everything above so it applies to your original dataframe. Something like this could work:
val signal_cummax_suffix = "_cummax"
// only the signal columns, not id and date
val signalColumns = original_df.columns.filter(_.startsWith("signal"))
val result = signalColumns.foldLeft(original_df)(
  (dfac, colname) => dfac
    .withColumn(colname.concat(signal_cummax_suffix),
      max(col(colname)).over(windowCodedSO))
    .withColumn(colname.concat("output"),
      takeOutRecordsLessThanCummaxUDF(col(colname), col(colname.concat(signal_cummax_suffix))))
)
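For comparison, here is a sketch of the same masking step done with built-in column functions only, so no UDF is needed. It is an illustration under assumptions, not a tested solution for the real data: the signal columns are taken to be integers whose names start with "signal", the dates are ISO-formatted strings (so string order equals date order), and the sample rows are the id 051 rows from the question with only two signals.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, max, when}

// Local session for the sketch; in spark-shell this returns the existing session.
val spark = SparkSession.builder().master("local[*]").appName("cummax-sketch").getOrCreate()
import spark.implicits._

// Option[Int] produces nullable integer columns.
val df = Seq[(String, String, Option[Int], Option[Int])](
  ("051", "2021-01-14", Some(1), Some(3)),
  ("051", "2021-01-15", Some(2), None),
  ("051", "2021-02-02", Some(3), Some(3)),
  ("051", "2021-02-03", Some(1), Some(3))
).toDF("id", "date", "signal01", "signal02")

// Assumption: every signal column name starts with "signal".
val signalColumns = df.columns.filter(_.startsWith("signal"))

// Running max per id in date order, up to and including the current row.
val w = Window.partitionBy("id").orderBy("date")
  .rowsBetween(Window.unboundedPreceding, Window.currentRow)

// Keep a value only when it is not below the running max.
// A null value makes the comparison null, so `when` (without otherwise) yields null.
val masked = signalColumns.foldLeft(df) { (acc, c) =>
  acc.withColumn(c, when(col(c) >= max(col(c)).over(w), col(c)))
}

masked.orderBy("id", "date").show()
```

Because the running max includes the current row, a surviving value always equals its cummax, so the separate _cummax columns are only needed if you want to inspect them.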

Scala Spark Join Dataframe in loop

I am trying to join DataFrames on the fly in a loop. I am using a properties file to get the column details to use in the final data frame.
Properties file -
a01=status:single,perm_id:multi
a02=status:single,actv_id:multi
a03=status:single,perm_id:multi,actv_id:multi
............................
............................
For each row in the properties file, I need to create a DataFrame and save it to a file. I load the properties file using PropertiesReader. If the mode is single, I only need the column value from the table; if it is multi, I need the list of values.
val propertyColumn = properties.get("a01") //a01 value we are getting as an argument. This might be a01,a02 or a0n
val columns = propertyColumn.toString.split(",").map(_.toString)
act_det table -
+-------+--------+-----------+-----------+-----------+------------+
|id |act_id |status |perm_id |actv_id | debt_id |
+-------+--------+-----------+-----------+-----------+------------+
| 1 |1 | 4 | 1 | 10 | 1 |
+-------+--------+-----------+-----------+-----------+------------+
| 2 |1 | 4 | 2 | 20 | 2 |
+-------+--------+-----------+-----------+-----------+------------+
| 3 |1 | 4 | 3 | 30 | 1 |
+-------+--------+-----------+-----------+-----------+------------+
| 4 |2 | 4 | 5 | 10 | 3 |
+-------+--------+-----------+-----------+-----------+------------+
| 5 |2 | 4 | 6 | 20 | 1 |
+-------+--------+-----------+-----------+-----------+------------+
| 6 |2 | 4 | 7 | 30 | 1 |
+-------+--------+-----------+-----------+-----------+------------+
| 7 |3 | 4 | 1 | 10 | 3 |
+-------+--------+-----------+-----------+-----------+------------+
| 8 |3 | 4 | 5 | 20 | 1 |
+-------+--------+-----------+-----------+-----------+------------+
| 9 |3 | 4 | 2 | 30 | 3 |
+-------+--------+-----------+-----------+------------+-----------+
Main DataFrame -
val data = sqlContext.sql("select * from act_det")
I want the following output -
For a01 -
+-------+--------+-----------+
|act_id |status |perm_id |
+-------+--------+-----------+
| 1 | 4 | [1,2,3] |
+-------+--------+-----------+
| 2 | 4 | [5,6,7] |
+-------+--------+-----------+
| 3 | 4 | [1,5,2] |
+-------+--------+-----------+
For a02 -
+-------+--------+-----------+
|act_id |status |actv_id |
+-------+--------+-----------+
| 1 | 4 | [10,20,30]|
+-------+--------+-----------+
| 2 | 4 | [10,20,30]|
+-------+--------+-----------+
| 3 | 4 | [10,20,30]|
+-------+--------+-----------+
For a03 -
+-------+--------+-----------+-----------+
|act_id |status |perm_id |actv_id |
+-------+--------+-----------+-----------+
| 1 | 4 | [1,2,3] |[10,20,30] |
+-------+--------+-----------+-----------+
| 2 | 4 | [5,6,7] |[10,20,30] |
+-------+--------+-----------+-----------+
| 3 | 4 | [1,5,2] |[10,20,30] |
+-------+--------+-----------+-----------+
But the data frame creation process should be dynamic.
I have tried the code below, but I am not able to implement the join logic for the DataFrames in the loop.
val finalDF: DataFrame = ??? // empty dataframe
for {
  column <- columns
} yield {
  val eachColumn = column.toString.split(":").map(_.toString)
  val columnName = eachColumn(0)
  val mode = eachColumn(1)
  if (mode.equalsIgnoreCase("single")) {
    data.select($"act_id", $"status").distinct
    // I want to join finalDF with data.select($"act_id", $"status").distinct
  } else if (mode.equalsIgnoreCase("multi")) {
    data.groupBy($"act_id").agg(collect_list($"perm_id").as("perm_id"))
    // I want to join finalDF with data.groupBy($"act_id").agg(collect_list($"perm_id").as("perm_id"))
  }
}
Any advice or guidance would be greatly appreciated.
Check below code.
scala> df.show(false)
+---+------+------+-------+-------+-------+
|id |act_id|status|perm_id|actv_id|debt_id|
+---+------+------+-------+-------+-------+
|1 |1 |4 |1 |10 |1 |
|2 |1 |4 |2 |20 |2 |
|3 |1 |4 |3 |30 |1 |
|4 |2 |4 |5 |10 |3 |
|5 |2 |4 |6 |20 |1 |
|6 |2 |4 |7 |30 |1 |
|7 |3 |4 |1 |10 |3 |
|8 |3 |4 |5 |20 |1 |
|9 |3 |4 |2 |30 |3 |
+---+------+------+-------+-------+-------+
Defining primary keys
scala> val primary_key = Seq("act_id").map(col(_))
primary_key: Seq[org.apache.spark.sql.Column] = List(act_id)
Configs
scala> configs.foreach(println)
/*
(a01,status:single,perm_id:multi)
(a02,status:single,actv_id:multi)
(a03,status:single,perm_id:multi,actv_id:multi)
*/
Constructing Expression.
scala>
val columns = configs
.map(c => {
c._2
.split(",")
.map(c => {
val cc = c.split(":");
if(cc.tail.contains("single"))
first(col(cc.head)).as(cc.head)
else
collect_list(col(cc.head)).as(cc.head)
}
)
})
/*
columns: scala.collection.immutable.Iterable[Array[org.apache.spark.sql.Column]] = List(
Array(first(status, false) AS `status`, collect_list(perm_id) AS `perm_id`),
Array(first(status, false) AS `status`, collect_list(actv_id) AS `actv_id`),
Array(first(status, false) AS `status`, collect_list(perm_id) AS `perm_id`, collect_list(actv_id) AS `actv_id`)
)
*/
Final Result
scala> columns.map(c => df.groupBy(primary_key:_*).agg(c.head,c.tail:_*)).map(_.show(false))
+------+------+---------+
|act_id|status|perm_id |
+------+------+---------+
|3 |4 |[1, 5, 2]|
|1 |4 |[1, 2, 3]|
|2 |4 |[5, 6, 7]|
+------+------+---------+
+------+------+------------+
|act_id|status|actv_id |
+------+------+------------+
|3 |4 |[10, 20, 30]|
|1 |4 |[10, 20, 30]|
|2 |4 |[10, 20, 30]|
+------+------+------------+
+------+------+---------+------------+
|act_id|status|perm_id |actv_id |
+------+------+---------+------------+
|3 |4 |[1, 5, 2]|[10, 20, 30]|
|1 |4 |[1, 2, 3]|[10, 20, 30]|
|2 |4 |[5, 6, 7]|[10, 20, 30]|
+------+------+---------+------------+
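Since the question asks for one saved DataFrame per properties entry, it may help to keep each result tied to its config key. The following is a sketch along the lines of the answer above, with some assumptions: configs is a plain Map standing in for the loaded properties file, toAggregate and results are names made up for this example, and debt_id is left out of the sample data for brevity.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, collect_list, first}

// Local session for the sketch; in spark-shell this returns the existing session.
val spark = SparkSession.builder().master("local[*]").appName("config-aggregates").getOrCreate()
import spark.implicits._

val df = Seq(
  (1, 1, 4, 1, 10), (2, 1, 4, 2, 20), (3, 1, 4, 3, 30),
  (4, 2, 4, 5, 10), (5, 2, 4, 6, 20), (6, 2, 4, 7, 30),
  (7, 3, 4, 1, 10), (8, 3, 4, 5, 20), (9, 3, 4, 2, 30)
).toDF("id", "act_id", "status", "perm_id", "actv_id")

// Hypothetical stand-in for the properties file: key -> "column:mode,column:mode".
val configs: Map[String, String] = Map(
  "a01" -> "status:single,perm_id:multi",
  "a02" -> "status:single,actv_id:multi",
  "a03" -> "status:single,perm_id:multi,actv_id:multi"
)

// "single" keeps one value per group, anything else collects all values into a list.
def toAggregate(entry: String): Column = {
  val Array(name, mode) = entry.trim.split(":", 2)
  if (mode.equalsIgnoreCase("single")) first(col(name)).as(name)
  else collect_list(col(name)).as(name)
}

// One aggregated DataFrame per config key; no join is needed.
val results: Map[String, DataFrame] = configs.map { case (key, spec) =>
  val aggs = spec.split(",").map(toAggregate)
  key -> df.groupBy("act_id").agg(aggs.head, aggs.tail: _*)
}

results("a03").show(false)
```

Each entry of results can then be written out under its own key. Note that collect_list does not guarantee the order of the collected values.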

How to find the next occurring item from current row in a data frame using Spark Windowing?

I have the following Dataframe:
+------+----------+-------------+--------------------+---------+-----+----------+
|ID |MEM_ID | BFS | SVC_DT |TYP |SEQ |BFS_SEQ |
+------+----------+----------------------------------+---------+-----+----------+
|105771|29378668 | BRIMONIDINE | 2019-02-04 00:00:00|PD |1 |1 |
|105772|29378668 | BRIMONIDINE | 2019-04-04 00:00:00|PD |2 |2 |
|105773|29378668 | BRIMONIDINE | 2019-04-17 00:00:00|RV |3 |3 |
|105774|29378668 | TIMOLOL | 2019-04-17 00:00:00|RV |4 |1 |
|105775|29378668 | BRIMONIDINE | 2019-04-22 00:00:00|PD |5 |4 |
|105776|29378668 | TIMOLOL | 2019-04-22 00:00:00|PD |6 |2 |
+------+----------+----------------------------------+---------+-----+----------+
For every row, I have to find the occurrence of next 'PD' Typ at BFS level from the current row and populate its associated ID as a new column named 'NEXT_PD_TYP_ID'
The output I am expecting is:
+------+---------+-------------+--------------------+----+-----+---------+---------------+
|ID |MEM_ID | BFS | SVC_DT |TYP |SEQ |BFS_SEQ |NEXT_PD_TYP_ID |
+------+---------+----------------------------------+----+-----+---------+---------------+
|105771|29378668 | BRIMONIDINE | 2019-02-04 00:00:00|PD |1 |1 |105772 |
|105772|29378668 | BRIMONIDINE | 2019-04-04 00:00:00|PD |2 |2 |105775 |
|105773|29378668 | BRIMONIDINE | 2019-04-17 00:00:00|RV |3 |3 |105775 |
|105774|29378668 | TIMOLOL | 2019-04-17 00:00:00|RV |4 |1 |105776 |
|105775|29378668 | BRIMONIDINE | 2019-04-22 00:00:00|PD |5 |4 |null |
|105776|29378668 | TIMOLOL | 2019-04-22 00:00:00|PD |6 |2 |null |
+------+---------+----------------------------------+----+-----+---------+---------------+
Need help.
I have tried using the conditional aggregation: max(when), however since it has more than one 'PD' the max is returning only one value for all the rows.
No error messages
I hope this helps.
I created a new column, TYPPDID, holding the ID of each row where TYP === "PD" (null otherwise).
Then I used a window frame ranging from the next row to unbounded following and took the first non-null TYPPDID.
The orderBy("ID") at the end is only there to show the records in order.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
val df = Seq(
("105771", "BRIMONIDINE", "PD"),
("105772", "BRIMONIDINE", "PD"),
("105773", "BRIMONIDINE","RV"),
("105774", "TIMOLOL", "RV"),
("105775", "BRIMONIDINE", "PD"),
("105776", "TIMOLOL", "PD")
).toDF("ID", "BFS", "TYP").withColumn("TYPPDID", when($"TYP" === "PD", $"ID"))
df: org.apache.spark.sql.DataFrame = [ID: string, BFS: string ... 2 more fields]
scala> df.show
+------+-----------+---+-------+
| ID| BFS|TYP|TYPPDID|
+------+-----------+---+-------+
|105771|BRIMONIDINE| PD| 105771|
|105772|BRIMONIDINE| PD| 105772|
|105773|BRIMONIDINE| RV| null|
|105774| TIMOLOL| RV| null|
|105775|BRIMONIDINE| PD| 105775|
|105776| TIMOLOL| PD| 105776|
+------+-----------+---+-------+
scala> val overColumns = Window.partitionBy("BFS").orderBy("ID").rowsBetween(1, Window.unboundedFollowing)
overColumns: org.apache.spark.sql.expressions.WindowSpec = org.apache.spark.sql.expressions.WindowSpec#eb923ef
scala> df.withColumn("NEXT_PD_TYP_ID",first("TYPPDID", true).over(overColumns)).orderBy("ID").show(false)
+------+-----------+---+-------+-------+
|ID |BFS |TYP|TYPPDID|NEXT_PD_TYP_ID|
+------+-----------+---+-------+-------+
|105771|BRIMONIDINE|PD |105771 |105772 |
|105772|BRIMONIDINE|PD |105772 |105775 |
|105773|BRIMONIDINE|RV |null |105775 |
|105774|TIMOLOL |RV |null |105776 |
|105775|BRIMONIDINE|PD |105775 |null |
|105776|TIMOLOL |PD |105776 |null |
+------+-----------+---+-------+-------+
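The example above keeps only three of the question's columns. As a sketch of how the same window might look on the fuller data, the version below partitions by MEM_ID as well as BFS and orders by SEQ. Those choices are assumptions about the real data (that the search should stay within one member and that SEQ reflects row order), and the helper column is folded into the window expression.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, first, when}

// Local session for the sketch; in spark-shell this returns the existing session.
val spark = SparkSession.builder().master("local[*]").appName("next-pd-sketch").getOrCreate()
import spark.implicits._

val df = Seq(
  (105771, 29378668, "BRIMONIDINE", "PD", 1),
  (105772, 29378668, "BRIMONIDINE", "PD", 2),
  (105773, 29378668, "BRIMONIDINE", "RV", 3),
  (105774, 29378668, "TIMOLOL", "RV", 4),
  (105775, 29378668, "BRIMONIDINE", "PD", 5),
  (105776, 29378668, "TIMOLOL", "PD", 6)
).toDF("ID", "MEM_ID", "BFS", "TYP", "SEQ")

// Frame: from the row after the current one to the end of the partition.
val following = Window.partitionBy("MEM_ID", "BFS").orderBy("SEQ")
  .rowsBetween(1, Window.unboundedFollowing)

// ID of the first following row whose TYP is "PD"; null when there is none.
val result = df.withColumn(
  "NEXT_PD_TYP_ID",
  first(when(col("TYP") === "PD", col("ID")), ignoreNulls = true).over(following)
)

result.orderBy("SEQ").show(false)
```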

very specific requirement for outlier treatment in Spark Dataframe

I have a very specific requirement for outlier treatment in a Spark Dataframe (Scala).
I want to treat just the first outlier group and make its size equal to that of the second group.
Input:
+------+-----------------+------+
|market|responseVariable |blabla|
+------+-----------------+------+
|A |r1 | da |
|A |r1 | ds |
|A |r1 | s |
|A |r1 | f |
|A |r1 | v |
|A |r2 | s |
|A |r2 | s |
|A |r2 | c |
|A |r3 | s |
|A |r3 | s |
|A |r4 | s |
|A |r5 | c |
|A |r6 | s |
|A |r7 | s |
|A |r8 | s |
+------+-----------------+------+
Now, per market and responseVariable, I want to treat just the first outlier.
Group per market and responseVariable:
+------+-----------------+------+
|market|responseVariable |count |
+------+-----------------+------+
|A |r1 | 5 |
|A |r2 | 3 |
|A |r3 | 2 |
|A |r4 | 1 |
|A |r5 | 1 |
|A |r6 | 1 |
|A |r7 | 1 |
|A |r8 | 1 |
+------+-----------------+------+
I want to treat the outlier for the group market=A and responseVariable=r1 in the actual dataset. I want to randomly remove records from group 1 until its count equals that of group 2.
Expected output:
+------+-----------------+------+
|market|responseVariable |blabla|
+------+-----------------+------+
|A |r1 | da |
|A |r1 | s |
|A |r1 | v |
|A |r2 | s |
|A |r2 | s |
|A |r2 | c |
|A |r3 | s |
|A |r3 | s |
|A |r4 | s |
|A |r5 | c |
|A |r6 | s |
|A |r7 | s |
|A |r8 | s |
+------+-----------------+------+
group:
+------+-----------------+------+
|market|responseVariable |count |
+------+-----------------+------+
|A |r1 | 3 |
|A |r2 | 3 |
|A |r3 | 2 |
|A |r4 | 1 |
|A |r5 | 1 |
|A |r6 | 1 |
|A |r7 | 1 |
|A |r8 | 1 |
+------+-----------------+------+
I want to repeat this for multiple markets.
You will need to know the names and counts of the first and second groups, which can be found as below
import org.apache.spark.sql.functions._
val first_two_values = df.groupBy("market", "responseVariable")
  .agg(count("blabla").as("count"))
  .orderBy($"count".desc)
  .take(2).map(row => (row(1) -> row(2))).toList
val rowsToFilter = first_two_values(0)._1
val countsToFilter = first_two_values(1)._2
Once you know the first two groups, you need to filter out the extra rows from the first group. This can be done by generating a row_number and dropping the rows beyond the second group's count, as below
import org.apache.spark.sql.expressions._
val windowSpec = Window.partitionBy("market", "responseVariable").orderBy("blabla")
df.withColumn("rank", row_number().over(windowSpec))
  .withColumn("rank", when(col("rank") > countsToFilter && col("responseVariable") === rowsToFilter, false).otherwise(true))
  .filter(col("rank"))
  .drop("rank")
  .show(false)
This should fulfil your requirement.
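The question also asks for random removal and for repeating the treatment per market. A possible sketch of that, using only window functions, is shown below. It is an illustration under assumptions, not a tested solution: market B and its rows are invented sample data, and the names groupSizes, caps and trimmed are made up for this example.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, count, lit, max, rand, row_number, when}

// Local session for the sketch; in spark-shell this returns the existing session.
val spark = SparkSession.builder().master("local[*]").appName("outlier-sketch").getOrCreate()
import spark.implicits._

val df = Seq(
  ("A", "r1", "da"), ("A", "r1", "ds"), ("A", "r1", "s"), ("A", "r1", "f"), ("A", "r1", "v"),
  ("A", "r2", "s"), ("A", "r2", "s"), ("A", "r2", "c"),
  ("A", "r3", "s"), ("A", "r3", "s"),
  ("A", "r4", "s"),
  // Market B is invented here to show the per-market behaviour.
  ("B", "q1", "x"), ("B", "q1", "y"), ("B", "q1", "z"), ("B", "q1", "w"),
  ("B", "q2", "x"), ("B", "q2", "y")
).toDF("market", "responseVariable", "blabla")

val byMarket = Window.partitionBy("market")

// Size of each (market, responseVariable) group, ranked by size within its market,
// plus the size of the second-largest group (null if the market has a single group).
val groupSizes = df.groupBy("market", "responseVariable")
  .agg(count(lit(1)).as("cnt"))
  .withColumn("sizeRank", row_number().over(byMarket.orderBy(col("cnt").desc)))
  .withColumn("secondCnt", max(when(col("sizeRank") === 2, col("cnt"))).over(byMarket))

// Only the largest group is capped; every other group keeps all its rows.
val caps = groupSizes.select(
  col("market"),
  col("responseVariable"),
  when(col("sizeRank") === 1 && col("secondCnt").isNotNull, col("secondCnt"))
    .otherwise(col("cnt")).as("cap")
)

// Number the rows of each group in random order and drop those beyond the cap.
val inGroup = Window.partitionBy("market", "responseVariable").orderBy(rand())
val trimmed = df.join(caps, Seq("market", "responseVariable"))
  .withColumn("rn", row_number().over(inGroup))
  .filter(col("rn") <= col("cap"))
  .drop("rn", "cap")

trimmed.groupBy("market", "responseVariable").count().orderBy("market", "responseVariable").show(false)
```

Which rows survive in the largest group differs from run to run because of rand(); pass a seed to rand if you need more repeatable behaviour.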