Spark Scala: set the end date for an ID from the start_date of that ID's updated record

I want to create a new column end_date that, for a given id, holds the start_date of the next (updated) record for the same id, using Spark Scala.
Consider the following dataframe:
+---+-----+----------+
|id |Value|start_date|
+---+-----+----------+
|1  |a    |1/1/2018  |
|2  |b    |1/1/2018  |
|3  |c    |1/1/2018  |
|4  |d    |1/1/2018  |
|1  |e    |10/1/2018 |
+---+-----+----------+
Initially, id=1 has start_date 1/1/2018 and value a. On 10/1/2018 (start_date), the value of id=1 became e. So I have to add a new column end_date, set it to 10/1/2018 for the first id=1 record, and leave it NULL for all other records.
The result should look like this:
+---+-----+----------+---------+
|id |Value|start_date|end_date |
+---+-----+----------+---------+
|1  |a    |1/1/2018  |10/1/2018|
|2  |b    |1/1/2018  |NULL     |
|3  |c    |1/1/2018  |NULL     |
|4  |d    |1/1/2018  |NULL     |
|1  |e    |10/1/2018 |NULL     |
+---+-----+----------+---------+
I am using spark 2.3.
Can anyone help me out here please

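With the window function "lead":
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.lead
import spark.implicits._ // toDF and the $"..." syntax (already in scope in spark-shell)
val df = List(
(1, "a", "1/1/2018"),
(2, "b", "1/1/2018"),
(3, "c", "1/1/2018"),
(4, "d", "1/1/2018"),
(1, "e", "10/1/2018")
).toDF("id", "Value", "start_date")
// One window per id, ordered by start_date
val idWindow = Window.partitionBy($"id")
.orderBy($"start_date")
// lead(..., 1) reads start_date from the next row of the same id (null if there is none)
val result = df.withColumn("end_date", lead($"start_date", 1).over(idWindow))
result.show(false)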
Output:
+---+-----+----------+---------+
|id |Value|start_date|end_date |
+---+-----+----------+---------+
|3 |c |1/1/2018 |null |
|4 |d |1/1/2018 |null |
|1 |a |1/1/2018 |10/1/2018|
|1 |e |10/1/2018 |null |
|2 |b |1/1/2018 |null |
+---+-----+----------+---------+
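One caveat about the answer above: start_date is a string, so the window is ordered lexicographically. That happens to work for the sample data, but "2/1/2018" sorts after "10/1/2018" as a string. A minimal sketch of ordering by a parsed date instead, assuming the dates are day/month/year (the "d/M/yyyy" pattern, the helper column start_dt, and the extra row with value f are assumptions for illustration, not part of the question):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, lead, to_date}

val spark = SparkSession.builder().master("local[*]").appName("end-date-sketch").getOrCreate()
import spark.implicits._

val df = List(
  (1, "a", "1/1/2018"),
  (1, "e", "10/1/2018"),
  (1, "f", "2/1/2018") // hypothetical extra row, added to expose the string-ordering problem
).toDF("id", "Value", "start_date")

// Assumption: day/month/year. Use "M/d/yyyy" if the data is month/day/year.
val parsed = df.withColumn("start_dt", to_date(col("start_date"), "d/M/yyyy"))

// Order each id's rows by the parsed date rather than by the raw string
val idWindow = Window.partitionBy(col("id")).orderBy(col("start_dt"))

val result = parsed
  .withColumn("end_date", lead(col("start_date"), 1).over(idWindow))
  .drop("start_dt")

result.show(false)
```

With the parsed date, row a gets end_date 2/1/2018 and row f gets 10/1/2018; ordering by the raw string would instead place f last.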

DB2/AS400 SQL Pivot

I have a problem with pivot tables ....
I don't understand what to do ...
My table is as follows:
|CODART|MONTH|QT |
|------|-----|----|
|ART1 |1 |100 |
|ART2 |1 |30 |
|ART3 |1 |30 |
|ART1 |2 |10 |
|ART4 |2 |40 |
|ART3 |4 |50 |
|ART5 |4 |60 |
I would like to get a summary table by month:
|CODART|1 |2 |3 |4 |5 |6 |7 |8 |9 |10 |11 |12 |
|------|---|---|---|---|---|---|---|---|---|---|---|---|
|ART1 |100|10 | | | | | | | | | | |
|ART2 |30 | | | | | | | | | | | |
|ART3 |30 | | |50 | | | | | | | | |
|ART4 | |40 | | | | | | | | | | |
|ART5 | | | |60 | | | | | | | | |
|TOTAL |160|50 | |110| | | | | | | | |
Too many requests? :-)
Thanks for the support
WITH MYTAB (CODART, MONTH, QT) AS
(
VALUES
('ART1', 1, 100)
, ('ART2', 1, 30)
, ('ART3', 1, 30)
, ('ART1', 2, 10)
, ('ART4', 2, 40)
, ('ART3', 4, 50)
, ('ART5', 4, 60)
)
SELECT
CASE GROUPING (CODART) WHEN 0 THEN CODART ELSE 'TOTAL' END AS CODART
, SUM (CASE MONTH WHEN 1 THEN QT END) AS "1"
, SUM (CASE MONTH WHEN 2 THEN QT END) AS "2"
, SUM (CASE MONTH WHEN 3 THEN QT END) AS "3"
, SUM (CASE MONTH WHEN 4 THEN QT END) AS "4"
---
, SUM (CASE MONTH WHEN 12 THEN QT END) AS "12"
FROM MYTAB T
GROUP BY ROLLUP (T.CODART)
ORDER BY GROUPING (T.CODART), T.CODART
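|CODART|1  |2  |3  |4  |12 |
|------|---|---|---|---|---|
|ART1  |100|10 |   |   |   |
|ART2  |30 |   |   |   |   |
|ART3  |30 |   |   |50 |   |
|ART4  |   |40 |   |   |   |
|ART5  |   |   |   |60 |   |
|TOTAL |160|50 |   |110|   |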

Fixing hierarchy data with table transformation (Hive, scala, spark)

I have a task with working with hierarchical data, but the source data contains errors in the hierarchy, namely: some parent-child links are broken. I have an algorithm for reestablishing such connections, but I have not yet been able to implement it on my own.
Example:
Initial data is
+------+----+----------+-------+
| NAME | ID | PARENTID | LEVEL |
+------+----+----------+-------+
| A1 | 1 | 2 | 1 |
| B1 | 2 | 3 | 2 |
| C1 | 18 | 4 | 3 |
| C2 | 3 | 5 | 3 |
| D1 | 4 | NULL | 4 |
| D2 | 5 | NULL | 4 |
| D3 | 10 | 11 | 4 |
| E1 | 11 | NULL | 5 |
+------+----+----------+-------+
Schematically it looks like:
As you can see, connections with C1 and D3 are lost here.
In order to restore connections, I need to apply the following algorithm for this table:
If, for some NAME, the ID does not appear in the PARENTID column (like ID = 18 and ID = 10), then create a new 'parent' row with LEVEL = (current LEVEL - 1) and PARENTID = (current ID), and take its ID and NAME from a node on the level above whose ID is smaller than the current ID.
Result must be like:
+------+----+----------+-------+
| NAME | ID | PARENTID | LEVEL |
+------+----+----------+-------+
| A1 | 1 | 2 | 1 |
| B1 | 2 | 3 | 2 |
| B1 | 2 | 18 | 2 |#
| C1 | 18 | 4 | 3 |
| C2 | 3 | 5 | 3 |
| C2 | 3 | 10 | 3 |#
| D1 | 4 | NULL | 4 |
| D2 | 5 | NULL | 4 |
| D3 | 10 | 11 | 4 |
| E1 | 11 | NULL | 5 |
+------+----+----------+-------+
Rows marked with # are the newly created rows. And the new schema looks like:
Are there any ideas on how to do this algorithm in spark/scala? Thanks!
You can build a createdRows dataframe from your current dataframe and union the two to obtain the final dataframe.
You can build this createdRows dataframe in several steps:
The first step is to get the IDs (and LEVEL) that are not in the PARENTID column. You can use a self left anti join to do that.
Then, rename the ID column to PARENTID and update the LEVEL column, decreasing it by 1.
Then, get the ID and NAME columns of the new rows by joining with your input dataframe on the LEVEL column.
Finally, apply your condition ID < PARENTID.
You end up with the following code, where dataframe is the dataframe with your initial data:
import org.apache.spark.sql.functions.col
val createdRows = dataframe
// if for some NAME the ID is not in the PARENTID column (like ID = 18, 10)
.select("LEVEL", "ID")
.filter(col("LEVEL") > 1) // Remove root node from created rows
.join(dataframe.select("PARENTID"), col("PARENTID") === col("ID"), "left_anti")
// then create a row with a 'parent' with LEVEL = (current LEVEL - 1) and PARENTID = (current ID)
.withColumnRenamed("ID", "PARENTID")
.withColumn("LEVEL", col("LEVEL") - 1)
// and take ID and NAME
.join(dataframe.select("NAME", "ID", "LEVEL"), Seq("LEVEL"))
// such that the current ID < ID of the node from the LEVEL above.
.filter(col("ID") < col("PARENTID"))
val result = dataframe
.unionByName(createdRows)
.orderBy("NAME", "PARENTID") // Optional, if you want an ordered result
And in result dataframe you get:
+----+---+--------+-----+
|NAME|ID |PARENTID|LEVEL|
+----+---+--------+-----+
|A1 |1 |2 |1 |
|B1 |2 |3 |2 |
|B1 |2 |18 |2 |
|C1 |18 |4 |3 |
|C2 |3 |5 |3 |
|C2 |3 |10 |3 |
|D1 |4 |null |4 |
|D2 |5 |null |4 |
|D3 |10 |11 |4 |
|E1 |11 |null |5 |
+----+---+--------+-----+
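The answer leaves the construction of dataframe to the reader. As a minimal sketch for trying it locally, the input can be built in memory and the first step (the left anti join) checked on its own. The Option[Int] encoding of NULL parents and the name orphans are assumptions for illustration; the real data would come from Hive:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[*]").appName("hierarchy-sketch").getOrCreate()
import spark.implicits._

// Assumption: NULL parents are modeled as None so that PARENTID becomes a nullable int column
val dataframe = Seq[(String, Int, Option[Int], Int)](
  ("A1", 1, Some(2), 1),
  ("B1", 2, Some(3), 2),
  ("C1", 18, Some(4), 3),
  ("C2", 3, Some(5), 3),
  ("D1", 4, None, 4),
  ("D2", 5, None, 4),
  ("D3", 10, Some(11), 4),
  ("E1", 11, None, 5)
).toDF("NAME", "ID", "PARENTID", "LEVEL")

// Step 1 only: nodes below the root whose ID never appears as a PARENTID
val orphans = dataframe
  .select("LEVEL", "ID")
  .filter(col("LEVEL") > 1)
  .join(dataframe.select("PARENTID"), col("PARENTID") === col("ID"), "left_anti")

orphans.show(false)
```

For the sample data this leaves IDs 18 and 10, the two nodes the question identifies as disconnected.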

Creating a new column based on a window and a condition in Spark

INITIAL DATA FRAME:
+------------------------------+----------+-------+
| Timestamp | Property | Value |
+------------------------------+----------+-------+
| 2019-09-01T01:36:57.000+0000 | X | N |
| 2019-09-01T01:37:39.000+0000 | A | 3 |
| 2019-09-01T01:42:55.000+0000 | X | Y |
| 2019-09-01T01:53:44.000+0000 | A | 17 |
| 2019-09-01T01:55:34.000+0000 | A | 9 |
| 2019-09-01T01:57:32.000+0000 | X | N |
| 2019-09-01T02:59:40.000+0000 | A | 2 |
| 2019-09-01T02:00:03.000+0000 | A | 16 |
| 2019-09-01T02:01:40.000+0000 | X | Y |
| 2019-09-01T02:04:03.000+0000 | A | 21 |
+------------------------------+----------+-------+
FINAL DATA FRAME:
+------------------------------+----------+-------+---+
| Timestamp | Property | Value | X |
+------------------------------+----------+-------+---+
| 2019-09-01T01:37:39.000+0000 | A | 3 | N |
| 2019-09-01T01:53:44.000+0000 | A | 17 | Y |
| 2019-09-01T01:55:34.000+0000 | A | 9 | Y |
| 2019-09-01T02:00:03.000+0000 | A | 16 | N |
| 2019-09-01T02:04:03.000+0000 | A | 21 | Y |
| 2019-09-01T02:59:40.000+0000 | A | 2 | Y |
+------------------------------+----------+-------+---+
Basically, I have a Timestamp, a Property, and a Value field. The Property can be either A or X, and each row has a value. I would like a new DataFrame with a fourth column named X, based on the values of the X property:
I go through the rows from the earliest to the latest.
When I encounter a row with the X property, I store its value and insert it into the X column.
IF I encounter an A-property row: I insert the stored value from the previous step into the X column.
ELSE (meaning I encounter an X-property row): I update the stored value (since it is more recent) and insert the new stored value into the X column.
I keep doing so until I have gone through the whole dataframe.
I remove the rows with the X property to get the final dataframe shown above.
I am sure there is some way to do this efficiently with a Window function.
Create a Temp column that holds the Value for X rows and null for A rows. Then use a window to take the last non-null Temp value up to the current row. Finally, keep only the rows with Property "A".
scala> val df = Seq(
| ("2019-09-01T01:36:57.000+0000", "X", "N"),
| ("2019-09-01T01:37:39.000+0000", "A", "3"),
| ("2019-09-01T01:42:55.000+0000", "X", "Y"),
| ("2019-09-01T01:53:44.000+0000", "A", "17"),
| ("2019-09-01T01:55:34.000+0000", "A", "9"),
| ("2019-09-01T01:57:32.000+0000", "X", "N"),
| ("2019-09-01T02:59:40.000+0000", "A", "2"),
| ("2019-09-01T02:00:03.000+0000", "A", "16"),
| ("2019-09-01T02:01:40.000+0000", "X", "Y"),
| ("2019-09-01T02:04:03.000+0000", "A", "21")
| ).toDF("Timestamp", "Property", "Value").withColumn("Temp", when($"Property" === "X", $"Value").otherwise(null))
df: org.apache.spark.sql.DataFrame = [Timestamp: string, Property: string ... 2 more fields]
scala> df.show(false)
+----------------------------+--------+-----+----+
|Timestamp |Property|Value|Temp|
+----------------------------+--------+-----+----+
|2019-09-01T01:36:57.000+0000|X |N |N |
|2019-09-01T01:37:39.000+0000|A |3 |null|
|2019-09-01T01:42:55.000+0000|X |Y |Y |
|2019-09-01T01:53:44.000+0000|A |17 |null|
|2019-09-01T01:55:34.000+0000|A |9 |null|
|2019-09-01T01:57:32.000+0000|X |N |N |
|2019-09-01T02:59:40.000+0000|A |2 |null|
|2019-09-01T02:00:03.000+0000|A |16 |null|
|2019-09-01T02:01:40.000+0000|X |Y |Y |
|2019-09-01T02:04:03.000+0000|A |21 |null|
+----------------------------+--------+-----+----+
scala> val overColumns = Window.orderBy("Timestamp").rowsBetween(Window.unboundedPreceding, Window.currentRow)
overColumns: org.apache.spark.sql.expressions.WindowSpec = org.apache.spark.sql.expressions.WindowSpec@1b759662
scala> df.withColumn("X", last($"Temp",true).over(overColumns)).show(false)
+----------------------------+--------+-----+----+---+
|Timestamp |Property|Value|Temp|X |
+----------------------------+--------+-----+----+---+
|2019-09-01T01:36:57.000+0000|X |N |N |N |
|2019-09-01T01:37:39.000+0000|A |3 |null|N |
|2019-09-01T01:42:55.000+0000|X |Y |Y |Y |
|2019-09-01T01:53:44.000+0000|A |17 |null|Y |
|2019-09-01T01:55:34.000+0000|A |9 |null|Y |
|2019-09-01T01:57:32.000+0000|X |N |N |N |
|2019-09-01T02:00:03.000+0000|A |16 |null|N |
|2019-09-01T02:01:40.000+0000|X |Y |Y |Y |
|2019-09-01T02:04:03.000+0000|A |21 |null|Y |
|2019-09-01T02:59:40.000+0000|A |2 |null|Y |
+----------------------------+--------+-----+----+---+
scala> df.withColumn("X", last($"Temp",true).over(overColumns)).filter($"Property" === "A").show(false)
+----------------------------+--------+-----+----+---+
|Timestamp |Property|Value|Temp|X |
+----------------------------+--------+-----+----+---+
|2019-09-01T01:37:39.000+0000|A |3 |null|N |
|2019-09-01T01:53:44.000+0000|A |17 |null|Y |
|2019-09-01T01:55:34.000+0000|A |9 |null|Y |
|2019-09-01T02:00:03.000+0000|A |16 |null|N |
|2019-09-01T02:04:03.000+0000|A |21 |null|Y |
|2019-09-01T02:59:40.000+0000|A |2 |null|Y |
+----------------------------+--------+-----+----+---+
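The REPL session above can be condensed into a short script. The following is a minimal sketch of the same forward-fill idea without the intermediate Temp column; the names events, byTime, xValue and filled are assumptions for illustration, and only the first four rows of the sample are used. It relies on the timestamps being ISO-8601 strings with the same offset, so that string order matches time order:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, last, when}

val spark = SparkSession.builder().master("local[*]").appName("forward-fill-sketch").getOrCreate()
import spark.implicits._

val events = Seq(
  ("2019-09-01T01:36:57.000+0000", "X", "N"),
  ("2019-09-01T01:37:39.000+0000", "A", "3"),
  ("2019-09-01T01:42:55.000+0000", "X", "Y"),
  ("2019-09-01T01:53:44.000+0000", "A", "17")
).toDF("Timestamp", "Property", "Value")

// No partitionBy here, so Spark moves all rows into a single partition for this window
val byTime = Window.orderBy(col("Timestamp"))
  .rowsBetween(Window.unboundedPreceding, Window.currentRow)

// X rows contribute their Value; A rows contribute null (when without otherwise)
val xValue = when(col("Property") === "X", col("Value"))

val filled = events
  .withColumn("X", last(xValue, true).over(byTime)) // true = ignore nulls, i.e. forward-fill
  .filter(col("Property") === "A")

filled.show(false)
```

On larger data, the un-partitioned window is the part to watch: if the rows can be split by some key (a device or session id, say), adding partitionBy on that key avoids funneling everything through one partition.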

How to count the number of missing values in each row of a data frame -spark scala?

I want to count the number of missing values in each row of a data frame in spark scala.
Code:
val samplesqlDF = spark.sql("SELECT * FROM sampletable")
samplesqlDF.show()
Input Dataframe:
------------------------------------------------------------------
| name | age | degree | Place |
| -----------------------------------------------------------------|
| Ram | | MCA | Bangalore |
| | 25 | | |
| | 26 | BE | |
| Raju | 21 | Btech | Chennai |
-----------------------------------------------------------------
The Output Data frame (Row Level Count) as follows:
-----------------------------------------------------------------
| name | age | degree | Place | rowcount |
| ----------------------------------------------------------------|
| Ram | | MCA | Bangalore | 1 |
| | 25 | | | 3 |
| | 26 | BE | | 2 |
| Raju | 21 | Btech | Chennai | 0 |
-----------------------------------------------------------------
I am a beginner to scala and spark. Thanks in advance.
Looks like you want to get the null count in a dynamic way. Check this out
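import org.apache.spark.sql.functions.{col, when}
import spark.implicits._ // for toDF (already in scope in spark-shell)
val df = Seq(("Ram",null,"MCA","Bangalore"),(null,"25",null,null),(null,"26","BE",null),("Raju","21","Btech","Chennai")).toDF("name","age","degree","Place")
df.show(false)
// Add one <column>_null flag per column: 1 if the value is null, 0 otherwise
val df2 = df.columns.foldLeft(df)( (df,c) => df.withColumn(c+"_null", when(col(c).isNull,1).otherwise(0) ) )
df2.createOrReplaceTempView("student")
// Build "name_null+age_null+... as null_count" and select it next to the original columns
val sql_str_null = df.columns.map( x => x+"_null").mkString(" ","+"," as null_count ")
val sql_str_full = df.columns.mkString( "select ", ",", " , " + sql_str_null + " from student")
spark.sql(sql_str_full).show(false)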
Output:
+----+----+------+---------+----------+
|name|age |degree|Place |null_count|
+----+----+------+---------+----------+
|Ram |null|MCA |Bangalore|1 |
|null|25 |null |null |3 |
|null|26 |BE |null |2 |
|Raju|21 |Btech |Chennai |0 |
+----+----+------+---------+----------+
Another possibility, which also counts empty strings ("") as missing and avoids foldLeft, just to demonstrate the point:
import org.apache.spark.sql.functions._
val df = Seq(("Ram",null,"MCA","Bangalore"),(null,"25",null,""),(null,"26","BE",null),("Raju","21","Btech","Chennai")).toDF("name","age","degree","place")
// Count per row the null or "" columns!
val null_counter = Seq("name", "age", "degree", "place").map(x => when(col(x) === "" || col(x).isNull , 1).otherwise(0)).reduce(_ + _)
val df2 = df.withColumn("nulls_cnt", null_counter)
df2.show(false)
returns:
+----+----+------+---------+---------+
|name|age |degree|place |nulls_cnt|
+----+----+------+---------+---------+
|Ram |null|MCA |Bangalore|1 |
|null|25 |null | |3 |
|null|26 |BE |null |2 |
|Raju|21 |Btech |Chennai |0 |
+----+----+------+---------+---------+
A simplified version of the one suggested by @stack0114106 is
val df = Seq(("Ram",null,"MCA","Bangalore"),(null,"25",null,null),
(null,"26","BE",null),("Raju","21","Btech","Chennai"))
.toDF("name","age","degree","Place")
.withColumn("null_count", lit(0))
val df2 = df.columns.foldLeft(df)((df,c) =>
df.withColumn("null_count",
when(col(c).isNull,$"null_count" + 1).otherwise($"null_count")
)
)
df2.show(false)
the output is
+----+----+------+---------+----------+
|name|age |degree|Place |null_count|
+----+----+------+---------+----------+
|Ram |null|MCA |Bangalore|1 |
|null|25 |null |null |3 |
|null|26 |BE |null |2 |
|Raju|21 |Btech |Chennai |0 |
+----+----+------+---------+----------+
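The answers above can be combined: take the column list from df.columns, as in the foldLeft versions, and build a single summed expression, as in the reduce version. A minimal sketch, assuming both null and "" should count as missing (the names missingCount and missing_cnt are assumptions for illustration):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, when}

val spark = SparkSession.builder().master("local[*]").appName("null-count-sketch").getOrCreate()
import spark.implicits._

val df = Seq(
  ("Ram", null, "MCA", "Bangalore"),
  (null, "25", null, ""),
  (null, "26", "BE", null),
  ("Raju", "21", "Btech", "Chennai")
).toDF("name", "age", "degree", "place")

// 1 for a null or empty-string cell, 0 otherwise, summed across every column of df
val missingCount = df.columns
  .map(c => when(col(c).isNull || col(c) === "", 1).otherwise(0))
  .reduce(_ + _)

val counted = df.withColumn("missing_cnt", missingCount)
counted.show(false)
```

Because the count is one column expression, it adds a single projection regardless of how many columns the dataframe has, and no column names are hard-coded.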

GroupBy based on conditions in Spark dataframe

I have two dataframes.
Dataframe1 contains key/value pairs:
+------+-----------------+
| Key | Value |
+------+-----------------+
| key1 | Column1 |
+------+-----------------+
| key2 | Column2 |
+------+-----------------+
| key3 | Column1,Column3 |
+------+-----------------+
Second dataframe:
This is the actual dataframe on which I need to apply the groupBy operation:
+---------+---------+---------+--------+
| Column1 | Column2 | Column3 | Amount |
+---------+---------+---------+--------+
| A | A1 | XYZ | 100 |
+---------+---------+---------+--------+
| A | A1 | XYZ | 100 |
+---------+---------+---------+--------+
| A | A2 | XYZ | 10 |
+---------+---------+---------+--------+
| A | A3 | PQR | 100 |
+---------+---------+---------+--------+
| B | B1 | XYZ | 200 |
+---------+---------+---------+--------+
| B | B2 | PQR | 280 |
+---------+---------+---------+--------+
| B | B3 | XYZ | 20 |
+---------+---------+---------+--------+
Dataframe1 contains the Key and Value columns.
The job has to take each key from Dataframe1, look up its value, and run a groupBy on Dataframe2 using the column(s) named in that value:
Dframe = df.groupBy($"key").sum("amount").show()
Expected output: one dataframe per key in Dataframe1.
For key1:
d1 = df.groupBy($"key1").sum("amount").show()
which has to be equivalent to: df.groupBy($"Column1").sum("Amount").show()
+---+-----+
| A | 310 |
+---+-----+
| B | 500 |
+---+-----+
For key2:
d2 = df.groupBy($"key2").sum("amount").show()
which has to be equivalent to: df.groupBy($"Column2").sum("Amount").show()
Result:
+----+-----+
| A1 | 200 |
+----+-----+
| A2 | 10 |
+----+-----+
For key3:
d3 = df.groupBy($"key3").sum("amount").show()
Result:
+---+-----+-----+
| A | XYZ | 210 |
+---+-----+-----+
| A | PQR | 100 |
+---+-----+-----+
| B | XYZ | 220 |
+---+-----+-----+
| B | PQR | 280 |
+---+-----+-----+
In the future, if I add more keys, it has to produce a dataframe for each of them as well. Can someone help me?
Given the key/value dataframe (which I suggest you do not build as a dataframe from the source data; the reason is given below)
+----+---------------+
|Key |Value |
+----+---------------+
|key1|Column1 |
|key2|Column2 |
|key3|Column1,Column3|
+----+---------------+
and actual dataframe as
+-------+-------+-------+------+
|Column1|Column2|Column3|Amount|
+-------+-------+-------+------+
|A |A1 |XYZ |100 |
|A |A1 |XYZ |100 |
|A |A2 |XYZ |10 |
|A |A3 |PQR |100 |
|B |B1 |XYZ |200 |
|B |B2 |PQR |280 |
|B |B3 |XYZ |20 |
+-------+-------+-------+------+
I would suggest collecting the first dataframe to the driver as an array of key/value pairs (the grouping column names have to be known on the driver anyway):
val maps = df1.rdd.map(row => row(0) -> row(1)).collect()
Then loop over the pairs:
import org.apache.spark.sql.functions._
for(kv <- maps){
df2.groupBy(kv._2.toString.split(",").map(col): _*).agg(sum($"Amount")).show(false)
//you can store the results in separate dataframes or write them to files or database
}
You should get the following outputs:
+-------+-----------+
|Column1|sum(Amount)|
+-------+-----------+
|B |500 |
|A |310 |
+-------+-----------+
+-------+-----------+
|Column2|sum(Amount)|
+-------+-----------+
|A2 |10 |
|B2 |280 |
|B1 |200 |
|B3 |20 |
|A3 |100 |
|A1 |200 |
+-------+-----------+
+-------+-------+-----------+
|Column1|Column3|sum(Amount)|
+-------+-------+-----------+
|B |PQR |280 |
|B |XYZ |220 |
|A |PQR |100 |
|A |XYZ |210 |
+-------+-------+-----------+
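The comment in the loop above mentions storing the results in separate dataframes. One way to do that is to keep them in a Map keyed by the lookup key, so that keys added later are picked up without code changes. A minimal sketch, assuming df1 is small enough to collect; the names grouped and total are assumptions for illustration:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, sum}

val spark = SparkSession.builder().master("local[*]").appName("groupby-keys-sketch").getOrCreate()
import spark.implicits._

val df1 = Seq(
  ("key1", "Column1"),
  ("key2", "Column2"),
  ("key3", "Column1,Column3")
).toDF("Key", "Value")

val df2 = Seq(
  ("A", "A1", "XYZ", 100), ("A", "A1", "XYZ", 100), ("A", "A2", "XYZ", 10),
  ("A", "A3", "PQR", 100), ("B", "B1", "XYZ", 200), ("B", "B2", "PQR", 280),
  ("B", "B3", "XYZ", 20)
).toDF("Column1", "Column2", "Column3", "Amount")

// One aggregated dataframe per key; trim guards against "Column1, Column3" style values
val grouped: Map[String, DataFrame] = df1.collect().map { row =>
  val key = row.getString(0)
  val groupCols = row.getString(1).split(",").map(c => col(c.trim))
  key -> df2.groupBy(groupCols: _*).agg(sum(col("Amount")).as("total"))
}.toMap

grouped.foreach { case (key, result) =>
  println(key)
  result.show(false)
}
```

Each entry is still a lazy dataframe, so nothing is computed until it is shown, written, or collected.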