I am using spark-sql 2.4.1. How can I do different joins depending on the value of a column?
I need to get the map_val lookup values for the given value columns, as shown below.
Sample data:
val data = List(
("20", "score", "school", "2018-03-31", 14 , 12),
("21", "score", "school", "2018-03-31", 13 , 13),
("22", "rate", "school", "2018-03-31", 11 , 14),
("21", "rate", "school", "2018-03-31", 13 , 12)
)
val df = data.toDF("id", "code", "entity", "date", "value1", "value2")
df.show
+---+-----+------+----------+------+------+
| id| code|entity| date|value1|value2|
+---+-----+------+----------+------+------+
| 20|score|school|2018-03-31| 14| 12|
| 21|score|school|2018-03-31| 13| 13|
| 22| rate|school|2018-03-31| 11| 14|
| 21| rate|school|2018-03-31| 13| 12|
+---+-----+------+----------+------+------+
Lookup dataset rateDs:
val rateDs = List(
("21","2018-01-31","2018-06-31", 12 ,"C"),
("21","2018-01-31","2018-06-31", 13 ,"D")
).toDF("id","start_date","end_date", "map_code","map_val")
rateDs.show
+---+----------+----------+--------+-------+
| id|start_date| end_date|map_code|map_val|
+---+----------+----------+--------+-------+
| 21|2018-01-31|2018-06-31| 12| C|
| 21|2018-01-31|2018-06-31| 13| D|
+---+----------+----------+--------+-------+
Joining with the lookup table to get the map_val column, based on start_date and end_date:
val resultDs = df.filter(col("code").equalTo(lit("rate"))).join(rateDs ,
(
df.col("date").between(rateDs.col("start_date"), rateDs.col("end_date"))
.and(rateDs.col("id").equalTo(df.col("id")))
//.and(rateDs.col("mapping_value").equalTo(df.col("mean")))
)
, "left"
)
//.drop("start_date")
//.drop("end_date")
resultDs.show
+---+----+------+----------+------+------+----+----------+----------+--------+-------+
| id|code|entity| date|value1|value2| id|start_date| end_date|map_code|map_val|
+---+----+------+----------+------+------+----+----------+----------+--------+-------+
| 21|rate|school|2018-03-31| 13| 12| 21|2018-01-31|2018-06-31| 13| D|
| 21|rate|school|2018-03-31| 13| 12| 21|2018-01-31|2018-06-31| 12| C|
+---+----+------+----------+------+------+----+----------+----------+--------+-------+
The expected output should be:
+---+----+------+----------+------+------+----+----------+----------+--------+-------+
| id|code|entity| date|value1|value2| id|start_date| end_date|map_code|map_val|
+---+----+------+----------+------+------+----+----------+----------+--------+-------+
| 21|rate|school|2018-03-31| D | C | 21|2018-01-31|2018-06-31| 13| D|
| 21|rate|school|2018-03-31| D | C | 21|2018-01-31|2018-06-31| 12| C|
+---+----+------+----------+------+------+----+----------+----------+--------+-------+
Please let me know if any more details are needed.
Try this -
Create a lookup map per id before the join, then use it to replace the values:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val newRateDS = rateDs.withColumn("lookUpMap",
  map_from_entries(collect_list(struct(col("map_code"), col("map_val"))).over(Window.partitionBy("id")))
)
newRateDS.show(false)
/**
* +---+----------+----------+--------+-------+------------------+
* |id |start_date|end_date |map_code|map_val|lookUpMap |
* +---+----------+----------+--------+-------+------------------+
* |21 |2018-01-31|2018-06-31|12 |C |[12 -> C, 13 -> D]|
* |21 |2018-01-31|2018-06-31|13 |D |[12 -> C, 13 -> D]|
* +---+----------+----------+--------+-------+------------------+
*/
val resultDs = df.filter(col("code").equalTo(lit("rate"))).join(broadcast(newRateDS),
  newRateDS("id") === df("id") && df("date").between(newRateDS("start_date"), newRateDS("end_date")),
  "left"
)
resultDs.withColumn("value1", expr("coalesce(lookUpMap[value1], value1)"))
.withColumn("value2", expr("coalesce(lookUpMap[value2], value2)"))
.show(false)
/**
* +---+----+------+----------+------+------+----+----------+----------+--------+-------+------------------+
* |id |code|entity|date |value1|value2|id |start_date|end_date |map_code|map_val|lookUpMap |
* +---+----+------+----------+------+------+----+----------+----------+--------+-------+------------------+
* |22 |rate|school|2018-03-31|11 |14 |null|null |null |null |null |null |
* |21 |rate|school|2018-03-31|D |C |21 |2018-01-31|2018-06-31|13 |D |[12 -> C, 13 -> D]|
* |21 |rate|school|2018-03-31|D |C |21 |2018-01-31|2018-06-31|12 |C |[12 -> C, 13 -> D]|
* +---+----+------+----------+------+------+----+----------+----------+--------+-------+------------------+
*/
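If you don't need the helper columns afterwards, you can drop them right after the replacement (a small follow-up sketch, not part of the original answer):
// Optional cleanup: keep only the business columns after the lookup replacement
resultDs.withColumn("value1", expr("coalesce(lookUpMap[value1], value1)"))
  .withColumn("value2", expr("coalesce(lookUpMap[value2], value2)"))
  .drop("lookUpMap", "map_code", "map_val", "start_date", "end_date")
  .show(false)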
Related
This is similar to the example above, but I am looking for a column-wise comparison of N columns between two data frames.
Below is a sample with 5 rows and 3 columns, with EMPID as the primary key.
How can I do this comparison in Spark?
InputDf1:
|EMPID |Dept | Salary
--------------------------
|1 |HR | 100
|2 |IT | 200
|3 |Finance | 250
|4 |Accounts | 200
|5 |IT | 150
InputDF2:
|EMPID |Dept |Salary
------------------------------
|1 |HR | 100
|2 |IT | 200
|3 |FIN | 250
|4 |Accounts | 150
|5 |IT | 150
Expected Result DF:
|EMPID |Dept |Dept |status |Salary |Salary |status
--------------------------------------------------------------------
|1 |HR |HR | TRUE | 100 | 100 | TRUE
|2 |IT |IT | TRUE | 200 | 200 | TRUE
|3 |Finance |FIN | False | 250 | 250 | TRUE
|4 |Accounts |Accounts | TRUE | 200 | 150 | FALSE
|5 |IT |IT | TRUE | 150 | 150 | TRUE
You can do a join on EMPID and compare the resulting columns:
val result = df1.alias("df1").join(
df2.alias("df2"), "EMPID"
).select(
$"EMPID",
$"df1.Dept", $"df2.Dept",
($"df1.Dept" === $"df2.Dept").as("status"),
$"df1.Salary", $"df2.Salary",
($"df1.Salary" === $"df2.Salary").as("status")
)
result.show
+-----+--------+--------+------+------+------+------+
|EMPID| Dept| Dept|status|Salary|Salary|status|
+-----+--------+--------+------+------+------+------+
| 1| HR| HR| true| 100| 100| true|
| 2| IT| IT| true| 200| 200| true|
| 3| Finance| FIN| false| 250| 250| true|
| 4|Accounts|Accounts| true| 200| 150| false|
| 5| IT| IT| true| 150| 150| true|
+-----+--------+--------+------+------+------+------+
Note that you may wish to rename the columns, because duplicate column names cannot be referenced unambiguously later.
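For example, you could alias each side so that every output column gets a unique name (a minimal sketch, reusing the same df1/df2 aliases and assuming spark.implicits._ is in scope):
// A sketch: alias every output column so it can be referenced later without ambiguity
val resultRenamed = df1.alias("df1").join(df2.alias("df2"), "EMPID").select(
  $"EMPID",
  $"df1.Dept".as("Dept_1"), $"df2.Dept".as("Dept_2"),
  ($"df1.Dept" === $"df2.Dept").as("Dept_status"),
  $"df1.Salary".as("Salary_1"), $"df2.Salary".as("Salary_2"),
  ($"df1.Salary" === $"df2.Salary").as("Salary_status")
)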
You can use a join and then iterate over df1.columns to select the desired output columns:
val df_final = df1.alias("df1")
.join(df2.alias("df2"), "EMPID")
.select(
Seq(col("EMPID")) ++
df1.columns.filter(_ != "EMPID")
.flatMap(c =>
Seq(
col(s"df1.$c").as(s"df1_$c"),
col(s"df2.$c").as(s"df2_$c"),
(col(s"df1.$c") === col(s"df2.$c")).as(s"status_$c")
)
): _*
)
df_final.show
//+-----+--------+--------+-----------+----------+----------+-------------+
//|EMPID|df1_Dept|df2_Dept|status_Dept|df1_Salary|df2_Salary|status_Salary|
//+-----+--------+--------+-----------+----------+----------+-------------+
//| 1| HR| HR| true| 100| 100| true|
//| 2| IT| IT| true| 200| 200| true|
//| 3| Finance| FIN| false| 250| 250| true|
//| 4|Accounts|Accounts| true| 200| 150| false|
//| 5| IT| IT| true| 150| 150| true|
//+-----+--------+--------+-----------+----------+----------+-------------+
You could also do it as shown below:
//Source data
val df = Seq((1,"HR",100),(2,"IT",200),(3,"Finance",250),(4,"Accounts",200),(5,"IT",150)).toDF("EMPID","Dept","Salary")
val df1 = Seq((1,"HR",100),(2,"IT",200),(3,"Fin",250),(4,"Accounts",150),(5,"IT",150)).toDF("EMPID","Dept","Salary")
//joins and other operations
val finalDF = df.as("d").join(df1.as("d1"),Seq("EMPID"),"inner")
  .withColumn("DeptStatus",$"d.Dept" === $"d1.Dept")
  .withColumn("SalaryStatus",$"d.Salary" === $"d1.Salary")
  .selectExpr("EMPID","d.Dept","d1.Dept","DeptStatus as Status",
              "d.Salary","d1.Salary","SalaryStatus as Status")
display(finalDF)
The display(finalDF) output matches the expected result above.
There are 1500+ columns in a dataframe, organized into several families of columns with the same prefix, e.g. col1, col2, col3, col4, ..., then c1, c2, c3, c4, ..., then column1, column2, column3, column4, ..., and so on. Based on certain logic and conditions, a subset of one family needs to be updated/modified. Suppose the logic is:
col(i) = 2 * column(i) + 3 * c(i)
and
col(i) = 2 * column(i) + col(i)
where i ranges between the lower and upper limits of the subset. For example, the subset of col1-col4 could be col1, col2, col3.
I need an expression with which the above two operations can be applied.
Here is an example I have for the simpler expression:
col(i) = col(i) + 1
Example:
scala> val original_df = Seq((1,2,3,4,9,8,7,6),(2,3,4,5,8,7,6,5),(3,4,5,6,7,6,5,4),(4,5,6,7,6,5,4,3),(5,6,7,8,5,4,3,2),(6,7,8,9,4,3,2,1)).toDF("col1","col2","col3","col4","c1","c2","c3","c4")
original_df: org.apache.spark.sql.DataFrame = [col1: int, col2: int ... 6 more fields]
scala> original_df.show()
+----+----+----+----+---+---+---+---+
|col1|col2|col3|col4| c1| c2| c3| c4|
+----+----+----+----+---+---+---+---+
| 1| 2| 3| 4| 9| 8| 7| 6|
| 2| 3| 4| 5| 8| 7| 6| 5|
| 3| 4| 5| 6| 7| 6| 5| 4|
| 4| 5| 6| 7| 6| 5| 4| 3|
| 5| 6| 7| 8| 5| 4| 3| 2|
| 6| 7| 8| 9| 4| 3| 2| 1|
+----+----+----+----+---+---+---+---+
scala> val requiredColumns = original_df.columns.zipWithIndex.filter(_._2 < 3).map(_._1).toSet
requiredColumns: scala.collection.immutable.Set[String] = Set(col1, col2, col3)
scala> val allColumns = original_df.columns
allColumns: Array[String] = Array(col1, col2, col3, col4, c1, c2, c3, c4)
scala> val columnExpr = allColumns.filterNot(requiredColumns(_)).map(col(_)) ++ requiredColumns.map(c => (col(c) + 1).alias(c))
columnExpr: Array[org.apache.spark.sql.Column] = Array(col4, c1, c2, c3, c4, (col1 + 1) AS `col1`, (col2 + 1) AS `col2`, (col3 + 1) AS `col3`)
scala> original_df.select(columnExpr:_*).show(false)
+----+---+---+---+---+----+----+----+
|col4|c1 |c2 |c3 |c4 |col1|col2|col3|
+----+---+---+---+---+----+----+----+
|4 |9 |8 |7 |6 |2 |3 |4 |
|5 |8 |7 |6 |5 |3 |4 |5 |
|6 |7 |6 |5 |4 |4 |5 |6 |
|7 |6 |5 |4 |3 |5 |6 |7 |
|8 |5 |4 |3 |2 |6 |7 |8 |
|9 |4 |3 |2 |1 |7 |8 |9 |
+----+---+---+---+---+----+----+----+
So in this case I need expressions for the two rules below:
col(i) = 2 * column(i) + 3 * c(i)
and
col(i) = 2 * column(i) + col(i)
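Following the same pattern as the col(i) = col(i) + 1 example, a minimal sketch for the two rules could look like the code below. It assumes a hypothetical dataframe fullDf that contains all three families (col1..colN, c1..cN, column1..columnN); the toy original_df above does not have the columnX family.
import org.apache.spark.sql.functions.col

val idx = 1 to 3                               // subset of indices to update

// rule 1: col(i) = 2 * column(i) + 3 * c(i)
val rule1 = idx.map(i => (col(s"column$i") * 2 + col(s"c$i") * 3).alias(s"col$i"))

// rule 2: col(i) = 2 * column(i) + col(i)
val rule2 = idx.map(i => (col(s"column$i") * 2 + col(s"col$i")).alias(s"col$i"))

// keep every other column unchanged and overwrite only the selected colX columns
val updatedNames = idx.map(i => s"col$i").toSet
val untouched    = fullDf.columns.filterNot(updatedNames).map(col)

fullDf.select(untouched ++ rule1: _*)          // or untouched ++ rule2 for the second rule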
Given the following DataFrame:
+----+--------+--------+-----+------+------+------+
|name|platform|group_id|width|height| x| y|
+----+--------+--------+-----+------+------+------+
| a| plat_a| 0|500.0|1000.0|250.41|500.01|
| a| plat_a| 0|250.0| 500.0|125.75| 250.7|
| a| plat_a| 0|300.0| 800.0| 120.0| 111.7|
| b| plat_b| 0|500.0|1000.0| 250.5|500.67|
| b| plat_b| 1|400.0| 800.0|100.67|200.67|
| b| plat_b| 1|800.0|1600.0|201.07|401.07|
+----+--------+--------+-----+------+------+------+
I would like to group by name, platform, group_id and count, using the following column logic:
//normalizing value to percent with 2 digit precision
new_x = Math.round(x / width * 100.0) / 100.0
new_y = Math.round(y / height * 100.0) / 100.0
So the output DataFrame would be:
+----+--------+--------+------+------+-----+
|name|platform|group_id| new_x| new_y|count|
+----+--------+--------+------+------+-----+
| a| plat_a| 0| 0.5| 0.5| 2|
| a| plat_a| 0| 0.4| 0.13| 1|
| b| plat_b| 0| 0.5| 0.5| 1|
| b| plat_b| 1| 0.25| 0.25| 2|
+----+--------+--------+------+------+-----+
How should I approach this problem?
This should be a fairly straightforward groupBy and count:
import org.apache.spark.sql.functions._
df.withColumn("new_x", round($"x" / $"width" * 100.0 ) / 100.0)
.withColumn("new_y", round($"y" / $"height" * 100.0 ) / 100.0)
.groupBy("name", "platform", "group_id", "new_x", "new_y")
.count()
.show(false)
Output:
+----+--------+--------+-----+-----+-----+
|name|platform|group_id|new_x|new_y|count|
+----+--------+--------+-----+-----+-----+
|a |plat_a |0 |0.5 |0.5 |2 |
|b |plat_b |0 |0.5 |0.5 |1 |
|b |plat_b |1 |0.25 |0.25 |2 |
|a |plat_a |0 |0.4 |0.14 |1 |
+----+--------+--------+-----+-----+-----+
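If you prefer, the rounding can also be written with the two-argument round, which takes the number of decimal places directly; for these positive values it should give the same result as the Math.round-style expression above:
import org.apache.spark.sql.functions._

df.withColumn("new_x", round($"x" / $"width", 2))
  .withColumn("new_y", round($"y" / $"height", 2))
  .groupBy("name", "platform", "group_id", "new_x", "new_y")
  .count()
  .show(false)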
I have to fill the leading null values of a column with the first non-null value that follows them. This logic applies only to the first run of consecutive null values in the column.
I have a dataframe similar to the one below.
//I replaced null with 0 in the value column
val df = Seq( (0,"exA",30), (0,"exB",22), (0,"exC",19), (16,"exD",13),
(5,"exE",28), (6,"exF",26), (0,"exG",12), (13,"exH",53))
.toDF("value", "col2", "col3")
scala> df.show(false)
+-----+----+----+
|value|col2|col3|
+-----+----+----+
|0 |exA |30 |
|0 |exB |22 |
|0 |exC |19 |
|16 |exD |13 |
|5 |exE |28 |
|6 |exF |26 |
|0 |exG |12 |
|13 |exH |53 |
+-----+----+----+
From this dataframe I am expecting the result below:
scala> df.show(false)
+-----+----+----+
|value|col2|col3|
+-----+----+----+
|16 |exA |30 | // value changed from 0 to 16
|16 |exB |22 | // value changed from 0 to 16
|16 |exC |19 | // value changed from 0 to 16
|16 |exD |13 |
|5 |exE |28 |
|6 |exF |26 |
|0 |exG |12 | // value should not be changed here
|13 |exH |53 |
+-----+----+----+
Please help me solve this.
You can use a Window function for this purpose:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val df = Seq( (0,"exA",30), (0,"exB",22), (0,"exC",19), (16,"exD",13),
    (5,"exE",28), (6,"exF",26), (0,"exG",12), (13,"exH",53))
  .toDF("value", "col2", "col3")

val w = Window.orderBy($"col2".desc)
df.withColumn("Result", last(when($"value" === 0, null).otherwise($"value"), ignoreNulls = true).over(w))
  .orderBy($"col2")
  .show(10)
Will result in
+-----+----+----+------+
|value|col2|col3|Result|
+-----+----+----+------+
| 0| exA| 30| 16|
| 0| exB| 22| 16|
| 0| exC| 19| 16|
| 16| exD| 13| 16|
| 5| exE| 28| 5|
| 6| exF| 26| 6|
| 0| exG| 12| 13|
| 13| exH| 53| 13|
+-----+----+----+------+
The expression df.orderBy($"col2") is needed only to show the final results in the right order. You can skip it if you don't care about the final order.
UPDATE
To get exactly what you need, you'll need slightly more complicated code:
val w = Window.orderBy($"col2")
val w2 = Window.orderBy($"col2".desc)
df.withColumn("IntermediateResult", first(when($"value" === 0, null).otherwise($"value"), ignoreNulls = true).over(w))
.withColumn("Result", when($"IntermediateResult".isNull, last($"IntermediateResult", ignoreNulls = true).over(w2)).otherwise($"value"))
.orderBy($"col2")
.show(10)
+-----+----+----+------------------+------+
|value|col2|col3|IntermediateResult|Result|
+-----+----+----+------------------+------+
| 0| exA| 30| null| 16|
| 0| exB| 22| null| 16|
| 0| exC| 19| null| 16|
| 16| exD| 13| 16| 16|
| 5| exE| 28| 16| 5|
| 6| exF| 26| 16| 6|
| 0| exG| 12| 16| 0|
| 13| exH| 53| 16| 13|
+-----+----+----+------------------+------+
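One thing to keep in mind (not part of the original answer): Window.orderBy without a partitionBy moves all rows into a single partition, which Spark warns about and which won't scale to large data. If your real data has a grouping key, partition the windows by it, e.g. with a hypothetical group_col:
// Hypothetical: partition the windows by a grouping key if your data has one
val w = Window.partitionBy($"group_col").orderBy($"col2")
val w2 = Window.partitionBy($"group_col").orderBy($"col2".desc)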
I think you need to take the first non-null or non-zero value based on the order of col2. Please find the script below. I registered the dataframe as a temporary view so I can write SQL against it.
val df = Seq( (0,"exA",30), (0,"exB",22), (0,"exC",19), (16,"exD",13),
(5,"exE",28), (6,"exF",26), (0,"exG",12), (13,"exH",53))
.toDF("value", "col2", "col3")
df.createOrReplaceTempView("table_df")

spark.sql("""
  with cte as (
    select *, row_number() over(order by col2) as rno from table_df
  )
  select
    case
      when value = 0 and rno < (select min(rno) from cte where value != 0)
        then (select value from cte where rno = (select min(rno) from cte where value != 0))
      else value
    end as value,
    col2,
    col3
  from cte
""").show(df.count.toInt, false)
Please let me know if you have any questions.
I added a new column with an incremental id to your DF:
import org.apache.spark.sql.functions._
val df_1 = Seq((0,"exA",30),
(0,"exB",22),
(0,"exC",19),
(16,"exD",13),
(5,"exE",28),
(6,"exF",26),
(0,"exG",12),
(13,"exH",53))
.toDF("value", "col2", "col3")
.withColumn("UniqueID", monotonically_increasing_id)
Filter the DF to keep only non-zero values:
val df_2 = df_1.filter("value != 0")
Create a variable limit to bound the first N rows that we need, and a variable nVal for the first non-zero value:
val limit = df_2.agg(min("UniqueID")).collect().map(_(0)).mkString("").toInt + 1
val nVal = df_1.limit(limit).agg(max("value")).collect().map(_(0)).mkString("").toInt
Create a DF with a column of the same name ("value"), set conditionally:
val df_4 = df_1.withColumn("value", when(($"UniqueID" < limit), nVal).otherwise($"value"))
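Finally, if you want the dataframe back in its original shape, you can drop the helper column (a small follow-up, not in the original answer):
// Drop the helper id column once the leading zeros have been replaced
val df_5 = df_4.drop("UniqueID")
df_5.show(false)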
I have an input Spark dataframe named df as
+---------------+---+---+---+-----------+
|Main_CustomerID| P1| P2| P3|Total_Count|
+---------------+---+---+---+-----------+
| 725153| 1| 0| 2| 3|
| 873008| 0| 0| 3| 3|
| 625109| 1| 1| 0| 2|
+---------------+---+---+---+-----------+
Here, Total_Count is the sum of P1, P2 and P3, where P1, P2, P3 are the product names. I need to find the frequency of each product by dividing the product values by Total_Count. I need to create a new Spark dataframe named frequencyTable as follows:
+---------------+------------------+---+------------------+-----------+
|Main_CustomerID| P1| P2| P3|Total_Count|
+---------------+------------------+---+------------------+-----------+
| 725153|0.3333333333333333|0.0|0.6666666666666666| 3|
| 873008| 0.0|0.0| 1.0| 3|
| 625109| 0.5|0.5| 0.0| 2|
+---------------+------------------+---+------------------+-----------+
I have done this in Scala as follows:
val df_columns = df.columns.toSeq
var frequencyTable = df
for (index <- df_columns) {
if (index != "Main_CustomerID" && index != "Total_Count") {
frequencyTable = frequencyTable.withColumn(index, df.col(index) / df.col("Total_Count"))
}
}
But I'd prefer to avoid this for loop because my df is large. What is an optimized solution?
If you have a dataframe such as
val df = Seq(
("725153", 1, 0, 2, 3),
("873008", 0, 0, 3, 3),
("625109", 1, 1, 0, 2)
).toDF("Main_CustomerID", "P1", "P2", "P3", "Total_Count")
+---------------+---+---+---+-----------+
|Main_CustomerID|P1 |P2 |P3 |Total_Count|
+---------------+---+---+---+-----------+
|725153 |1 |0 |2 |3 |
|873008 |0 |0 |3 |3 |
|625109 |1 |1 |0 |2 |
+---------------+---+---+---+-----------+
You can simply use foldLeft over the columns other than Main_CustomerID and Total_Count, i.e. over P1, P2 and P3:
val df_columns = (df.columns.toSet - "Main_CustomerID" - "Total_Count").toList
df_columns.foldLeft(df){(tempdf, colName) => tempdf.withColumn(colName, df.col(colName) / df.col("Total_Count"))}.show(false)
which should give you
+---------------+------------------+---+------------------+-----------+
|Main_CustomerID|P1 |P2 |P3 |Total_Count|
+---------------+------------------+---+------------------+-----------+
|725153 |0.3333333333333333|0.0|0.6666666666666666|3 |
|873008 |0.0 |0.0|1.0 |3 |
|625109 |0.5 |0.5|0.0 |2 |
+---------------+------------------+---+------------------+-----------+
I hope the answer is helpful
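As a variation on the same idea, a single select (a sketch) avoids chaining withColumn calls and keeps the original column order:
import org.apache.spark.sql.functions.col

// Divide every product column by Total_Count in one projection
val freqCols = df.columns.map {
  case c @ ("Main_CustomerID" | "Total_Count") => col(c)
  case c => (col(c) / col("Total_Count")).as(c)
}
df.select(freqCols: _*).show(false)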