How to update a column based on a condition (a value in a group)? - scala

I have the following df:
+---+----+-----+
|sno|dept|color|
+---+----+-----+
| 1| fn| red|
| 2| fn| blue|
| 3| fn|green|
+---+----+-----+
If any of the color column values is red, then all values of the color column should be updated to red, as below:
+---+----+-----+
|sno|dept|color|
+---+----+-----+
| 1| fn| red|
| 2| fn| red|
| 3| fn| red|
+---+----+-----+
I could not figure it out. Please help; I have tried the following code:
val gp = jdbcDF.filter($"dept".contains("fn"))
//.withColumn("newone", when($"dept" === "fn", "RED").otherwise("NULL"))
gp.show()
gp.map(row => {
  val row1 = row.getAs[String](1)
  var row2 = row.getAs[String](2)
  val make = if (row1 == "fn") row2 = "red"
  Row(row(0), row(1), make)
}).collect().foreach(println)

Given:
val df = Seq(
  (1, "fn", "red"),
  (2, "fn", "blue"),
  (3, "fn", "green"),
  (4, "aa", "blue"),
  (5, "aa", "green"),
  (6, "bb", "red"),
  (7, "bb", "red"),
  (8, "aa", "blue")
).toDF("id", "fn", "color")
Do the calculation:
val redOrNot = df.groupBy("fn")
  .agg(collect_set('color) as "values")
  .withColumn("hasRed", array_contains('values, "red"))
// when without an otherwise gives null for groups with no red
val colorPicker = when('hasRed, "red")
val result = df.join(redOrNot, "fn")
  .withColumn("resultColor", colorPicker)
  .withColumn("color", coalesce('resultColor, 'color)) // coalesce skips the nulls, which gives the answer
  .select('id, 'fn, 'color)
The result looks as follows (which seems to be the answer):
scala> result.show
+---+---+-----+
| id| fn|color|
+---+---+-----+
| 1| fn| red|
| 2| fn| red|
| 3| fn| red|
| 4| aa| blue|
| 5| aa|green|
| 6| bb| red|
| 7| bb| red|
| 8| aa| blue|
+---+---+-----+
You can chain when operators and have a default value with otherwise. Consult the scaladoc of the when operator.
I think you could do something very similar (and perhaps more efficient) using window functions or user-defined aggregate functions (UDAFs), but... well... I don't currently know how to do it. Leaving the comment here to inspire others ;-)
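For what it's worth, here is a minimal sketch of the window-based variant hinted at above (my own untested assumption, with spark.implicits._ in scope): partition by fn and flag the groups that contain red.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{max, when}

// Flag every "fn" partition that contains at least one red row,
// then overwrite color for all rows in such partitions.
val byFn = Window.partitionBy("fn")
val windowed = df
  .withColumn("hasRed", max(when($"color" === "red", 1).otherwise(0)).over(byFn) === 1)
  .withColumn("color", when($"hasRed", "red").otherwise($"color"))
  .drop("hasRed")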
p.s. Learnt a lot! Thanks for the idea!

An efficient solution that doesn't require expensive grouping:
// All groups with `red`
df.where($"color" === "red").select($"fn".alias("fn_")).distinct
  // Join with the input
  .join(df.as("df"), $"fn" === $"fn_", "rightouter")
  // Replace `color`
  .withColumn("color", when($"fn_".isNull, $"color").otherwise(lit("red")))
  .drop("fn_")

You are conditionally updating the DataFrame if it satisfies a certain property. In this case the property is "the color column contains 'red'". The idiomatic way to express this is to filter with the desired predicate and then determine whether any rows satisfy it. There is no need for a join.
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.DataFrame
def makeAllRedIfAnyAreRed(df: DataFrame) = {
  val containsRed = df.filter(df("color") === "red").count() > 0
  if (containsRed) df.withColumn("color", lit("red")) else df
}
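A quick usage sketch (my addition, assuming spark.implicits._ is in scope and using the three-row dataframe from the question):
// Hypothetical usage on the question's single-department dataframe.
val jdbcDF = Seq((1, "fn", "red"), (2, "fn", "blue"), (3, "fn", "green")).toDF("sno", "dept", "color")
makeAllRedIfAnyAreRed(jdbcDF).show()
// Every row comes back with color = "red", because at least one row was red.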

Spark 2.2.0:
Sample dataframe (taken from the examples above):
val df = Seq(
  (1, "fn", "red"),
  (2, "fn", "blue"),
  (3, "fn", "green"),
  (4, "aa", "blue"),
  (5, "aa", "green"),
  (6, "bb", "red"),
  (7, "bb", "red"),
  (8, "aa", "blue")
).toDF("id", "dept", "color")
Create a UDF to perform the update by checking the condition:
val replace_val = udf((x: String, y: String) =>
  if (Option(x).getOrElse("").equalsIgnoreCase("fn") && !y.equalsIgnoreCase("red")) "red" else y)
val final_df = df.withColumn("color", replace_val($"dept", $"color"))
final_df.show()
Output: the three dept = "fn" rows all come back with color red; all other rows are unchanged.
Spark 1.6:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.udf

val conf = new SparkConf().setMaster("local").setAppName("My app")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
// For implicit conversions like converting RDDs to DataFrames
import sqlContext.implicits._

val df = sc.parallelize(Seq(
  (1, "fn", "red"),
  (2, "fn", "blue"),
  (3, "fn", "green"),
  (4, "aa", "blue"),
  (5, "aa", "green"),
  (6, "bb", "red"),
  (7, "bb", "red"),
  (8, "aa", "blue")
)).toDF("id", "dept", "color")

val replace_val = udf((x: String, y: String) =>
  if (Option(x).getOrElse("").equalsIgnoreCase("fn") && !y.equalsIgnoreCase("red")) "red" else y)
val final_df = df.withColumn("color", replace_val($"dept", $"color"))
final_df.show()

Since the filtered dataframe may contain only a few rows, I'm adding a solution that combines isin() with withColumn().
Sample DataFrame
val df = Seq(
  (1, "fn", "red"),
  (2, "fn", "blue"),
  (3, "fn", "green"),
  (4, "aa", "blue"),
  (5, "aa", "green"),
  (6, "bb", "red"),
  (7, "bb", "red"),
  (8, "aa", "blue")
).toDF("id", "dept", "color")
Now let's pick only the depts that have at least one red row and place them in a broadcast variable, like below:
val depts = sc.broadcast(df.filter($"color" === "red").select(collect_set("dept")).first.getSeq[String](0))
Update the color to red for records in the filtered depts.
isin() takes varargs, so expand the sequence with depts.value: _*.
// Create a new column with a different name (clr) to make the difference visible
val result = df.withColumn("clr",
  when($"dept".isin(depts.value: _*), lit("red")).otherwise($"color"))
result.show()
+---+----+-----+-----+
| id|dept|color| clr|
+---+----+-----+-----+
| 1| fn| red| red|
| 2| fn| blue| red|
| 3| fn|green| red|
| 4| aa| blue| blue|
| 5| aa|green|green|
| 6| bb| red| red|
| 7| bb| red| red|
| 8| aa| blue| blue|
+---+----+-----+-----+
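If you want to overwrite the original color column instead of adding clr, a minimal variant of the same logic (my addition, reusing the depts broadcast above) would be:
// Overwrite "color" directly instead of creating the extra "clr" column.
val overwritten = df.withColumn("color",
  when($"dept".isin(depts.value: _*), lit("red")).otherwise($"color"))
overwritten.show()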

Related

Spark Dataframe with pivot and different aggregation, based on the column value (measure_type) - Scala

I have a spark dataframe of this type:
scala> val data = Seq((1, "k1", "measureA", 2), (1, "k1", "measureA", 4), (1, "k1", "measureB", 5), (1, "k1", "measureB", 7), (1, "k1", "measureC", 7), (1, "k1", "measureC", 1), (2, "k1", "measureB", 8), (2, "k1", "measureC", 9), (2, "k2", "measureA", 5), (2, "k2", "measureC", 5), (2, "k2", "measureC", 8))
data: Seq[(Int, String, String, Int)] = List((1,k1,measureA,2), (1,k1,measureA,4), (1,k1,measureB,5), (1,k1,measureB,7), (1,k1,measureC,7), (1,k1,measureC,1), (2,k1,measureB,8), (2,k1,measureC,9), (2,k2,measureA,5), (2,k2,measureC,5), (2,k2,measureC,8))
scala> val rdd = spark.sparkContext.parallelize(data)
rdd: org.apache.spark.rdd.RDD[(Int, String, String, Int)] = ParallelCollectionRDD[22] at parallelize at <console>:27
scala> val df = rdd.toDF("ts","key","measure_type","value")
df: org.apache.spark.sql.DataFrame = [ts: int, key: string ... 2 more fields]
scala> df.show
+---+---+------------+-----+
| ts|key|measure_type|value|
+---+---+------------+-----+
| 1| k1| measureA| 2|
| 1| k1| measureA| 4|
| 1| k1| measureB| 5|
| 1| k1| measureB| 7|
| 1| k1| measureC| 7|
| 1| k1| measureC| 1|
| 2| k1| measureB| 8|
| 2| k1| measureC| 9|
| 2| k2| measureA| 5|
| 2| k2| measureC| 5|
| 2| k2| measureC| 8|
+---+---+------------+-----+
I want to pivot on measure_type and apply different aggregation types to the value, depending on measure_type:
measureA -> sum
measureB -> avg
measureC -> max
Then, get the following output dataframe:
+---+---+--------+--------+--------+
| ts|key|measureA|measureB|measureC|
+---+---+--------+--------+--------+
| 1| k1| 6| 6| 7|
| 2| k1| null| 8| 9|
| 2| k2| 5| null| 8|
+---+---+--------+--------+--------+
Thanks a lot.
val ddf = df.groupBy("ts", "key").agg(
  sum(when(col("measure_type") === "measureA", col("value"))).as("measureA"),
  avg(when(col("measure_type") === "measureB", col("value"))).as("measureB"),
  max(when(col("measure_type") === "measureC", col("value"))).as("measureC"))
And the results are:
scala> ddf.show(false)
+---+---+--------+--------+--------+
|ts |key|measureA|measureB|measureC|
+---+---+--------+--------+--------+
|2 |k2 |5 |null |8 |
|2 |k1 |null |8.0 |9 |
|1 |k1 |6 |6.0 |7 |
+---+---+--------+--------+--------+
I think it's tedious to do with the traditional pivot function, since it limits you to a single aggregate function.
Here is what I would do: map a pre-defined list of aggregate functions over the dataframe, which produces three extra columns (one per aggregate function), then create another column that picks the right value based on measure_type as you mentioned, and finally drop the three intermediate columns.
import org.apache.spark.sql.functions._
import org.apache.spark.sql.Column
import spark.implicits._

val df = Seq((1, "k1", "measureA", 2), (1, "k1", "measureA", 4), (1, "k1", "measureB", 5), (1, "k1", "measureB", 7), (1, "k1", "measureC", 7), (1, "k1", "measureC", 1), (2, "k1", "measureB", 8), (2, "k1", "measureC", 9), (2, "k2", "measureA", 5), (2, "k2", "measureC", 5), (2, "k2", "measureC", 8)).toDF("ts", "key", "measure_type", "value")

val mapping: Map[String, Column => Column] = Map(
  "sum" -> sum, "avg" -> avg, "max" -> max)

val groupBy = Seq("ts", "key", "measure_type")
val aggregate = Seq("value")
val operations = Seq("sum", "avg", "max")
val exprs = aggregate.flatMap(c => operations.map(f => mapping(f)(col(c))))

val df2 = df.groupBy(groupBy.map(col): _*).agg(exprs.head, exprs.tail: _*)

val df3 = df2.withColumn("new_column",
    when($"measure_type" === "measureA", $"sum(value)")
      .when($"measure_type" === "measureB", $"avg(value)")
      .otherwise($"max(value)"))
  .drop("sum(value)")
  .drop("avg(value)")
  .drop("max(value)")
df3 is the dataframe that you need.
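If you then want the wide shape shown in the question, one possible final step (a sketch on my part, assuming first() is an acceptable way to pick the single value per cell) is a plain pivot on df3:
// Hypothetical final step: pivot df3 into one column per measure_type.
val wide = df3.groupBy("ts", "key")
  .pivot("measure_type")
  .agg(first("new_column"))
wide.show()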

Spark: map columns of a dataframe to their ID of the distinct elements

I have the following dataframe with two columns, A and B, of string type:
val df = (
  spark
    .createDataFrame(
      Seq(
        ("a1", "b1"),
        ("a1", "b2"),
        ("a1", "b2"),
        ("a2", "b3")
      )
    )
).toDF("A", "B")
I create maps between the distinct elements of each column and a set of integers:
val mapColA = (
  df
    .select("A")
    .distinct
    .rdd
    .zipWithIndex
    .collectAsMap
)
val mapColB = (
  df
    .select("B")
    .distinct
    .rdd
    .zipWithIndex
    .collectAsMap
)
Now I want to create new columns in the dataframe by applying those maps to their corresponding columns. For one map only, this would be:
df.select("A").map(x=>mapColA.get(x)).show()
However, I don't understand how to apply each map to its corresponding column and create two new columns (e.g. with withColumn). The expected result would be:
val result = (
  spark
    .createDataFrame(
      Seq(
        ("a1", "b1", 1, 1),
        ("a1", "b2", 1, 2),
        ("a1", "b2", 1, 2),
        ("a2", "b3", 2, 3)
      )
    )
).toDF("A", "B", "idA", "idB")
Could you help me?
If I understood correctly, this can be achieved using dense_rank:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.dense_rank

// Note: Window.orderBy without partitionBy moves all rows into a single partition.
val df2 = df.withColumn("idA", dense_rank().over(Window.orderBy("A")))
  .withColumn("idB", dense_rank().over(Window.orderBy("B")))
df2.show
+---+---+---+---+
| A| B|idA|idB|
+---+---+---+---+
| a1| b1| 1| 1|
| a1| b2| 1| 2|
| a1| b2| 1| 2|
| a2| b3| 2| 3|
+---+---+---+---+
If you want to stick with your original code, you can make some modifications:
val mapColA = df.select("A").distinct().rdd.map(r=>r.getAs[String](0)).zipWithIndex.collectAsMap
val mapColB = df.select("B").distinct().rdd.map(r=>r.getAs[String](0)).zipWithIndex.collectAsMap
val df2 = df.map(r => (r.getAs[String](0), r.getAs[String](1), mapColA.get(r.getAs[String](0)), mapColB.get(r.getAs[String](1)))).toDF("A","B", "idA", "idB")
df2.show
+---+---+---+---+
| A| B|idA|idB|
+---+---+---+---+
| a1| b1| 1| 2|
| a1| b2| 1| 0|
| a1| b2| 1| 0|
| a2| b3| 0| 1|
+---+---+---+---+

Join two dataframes with different records and size in Spark

It seems this issue has been asked a couple of times, but the solutions suggested in previous questions are not working for me.
I have two dataframes with different dimensions, as shown in the picture below. The second table (second) was part of the first table (first), but after some processing I added one more column, column4. Now I want to join these two tables so that I get the third table (Required) after joining.
Things I tried:
I tried a couple of different solutions, but none of them worked for me.
I tried
val required =first.join(second, first("PDE_HDR_CMS_RCD_NUM") === second("PDE_HDR_CMS_RCD_NUM") , "left_outer")
Also I tried
val required = first.withColumn("SEQ", when(second.col("PDE_HDR_FILE_ID") === (first.col("PDE_HDR_FILE_ID").alias("PDE_HDR_FILE_ID1")), second.col("uniqueID")).otherwise(lit(0)))
In the second attempt I used .alias after I got an error that says:
Error occured during extract process. Error:
org.apache.spark.sql.AnalysisException: Resolved attribute(s) uniqueID#775L missing from.
Thanks for taking the time to read my question.
To generate the wanted result, you should join the two tables on column(s) that are row-identifying in your first table. Assuming c1 + c2 + c3 uniquely identifies each row in the first table, here's an example using a partial set of your sample data:
import org.apache.spark.sql.functions._
import spark.implicits._
val df1 = Seq(
  (1, "e", "o"),
  (4, "d", "t"),
  (3, "f", "e"),
  (2, "r", "r"),
  (6, "y", "f"),
  (5, "t", "g"),
  (1, "g", "h"),
  (4, "f", "j"),
  (6, "d", "k"),
  (7, "s", "o")
).toDF("c1", "c2", "c3")

val df2 = Seq(
  (3, "f", "e", 444),
  (5, "t", "g", 555),
  (7, "s", "o", 666)
).toDF("c1", "c2", "c3", "c4")
df1.join(df2, Seq("c1", "c2", "c3"), "left_outer").show
// +---+---+---+----+
// | c1| c2| c3| c4|
// +---+---+---+----+
// | 1| e| o|null|
// | 4| d| t|null|
// | 3| f| e| 444|
// | 2| r| r|null|
// | 6| y| f|null|
// | 5| t| g| 555|
// | 1| g| h|null|
// | 4| f| j|null|
// | 6| d| k|null|
// | 7| s| o| 666|
// +---+---+---+----+

How to replace values in RDD 1 per keys in RDD 2?

Here are two RDDs.
Table1-pair(key,value)
val table1 = sc.parallelize(Seq(("1", "a"), ("2", "b"), ("3", "c")))
//RDD[(String, String)]
Table2-Arrays
val table2 = sc.parallelize(Array(Array("1", "2", "d"), Array("1", "3", "e")))
//RDD[Array[String]]
I am trying to change elements of table2, such as "1" to "a", using the keys and values in table1. My expected result is as follows:
RDD[Array[String]] = (Array(Array("a", "b", "d"), Array("a", "c", "e")))
Is there a way to make this possible?
If so, would it be efficient using a huge dataset?
I think we can do better with dataframes while avoiding joins, as joins might involve shuffling of data.
import org.apache.spark.sql.functions.{collect_list, explode, udf}
import spark.implicits._

val table1 = spark.sparkContext.parallelize(Seq(("1", "a"), ("2", "b"), ("3", "c"))).collectAsMap()
// Broadcasting so that the mapping is available on all nodes
val brodcastedMapping = spark.sparkContext.broadcast(table1)

val table2 = spark.sparkContext.parallelize(Array(Array("1", "2", "d"), Array("1", "3", "e")))

def changeMapping(value: String): String = {
  brodcastedMapping.value.getOrElse(value, value)
}
val changeMappingUDF = udf(changeMapping(_: String))

table2.toDF.withColumn("exploded", explode($"value"))
  .withColumn("new", changeMappingUDF($"exploded"))
  .groupBy("value")
  .agg(collect_list("new").as("mappedCol"))
  .select("mappedCol").rdd.map(r => r.toSeq.toArray.map(_.toString))
Let me know if it suits your requirement otherwise I can modify it as needed.
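For completeness, a minimal RDD-only sketch using the same broadcast map (my addition, reusing brodcastedMapping and table2 from above, with no DataFrame round trip):
// Map each element of every array through the broadcast lookup,
// falling back to the original value when there is no mapping.
val replaced = table2.map(_.map(v => brodcastedMapping.value.getOrElse(v, v)))
replaced.collect().foreach(arr => println(arr.mkString("Array(", ", ", ")")))
// Array(a, b, d)
// Array(a, c, e)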
You can do that with Datasets:
package dataframe

import org.apache.spark.sql.SparkSession

/**
 * #author vaquar.khan#gmail.com
 */
object Test {

  case class table1Class(key: String, value: String)
  case class table2Class(key: String, value: String, value1: String)

  def main(args: Array[String]) {
    val spark =
      SparkSession.builder()
        .appName("DataFrame-Basic")
        .master("local[4]")
        .getOrCreate()
    import spark.implicits._
    //
    val table1 = Seq(
      table1Class("1", "a"), table1Class("2", "b"), table1Class("3", "c"))
    val df1 = spark.sparkContext.parallelize(table1, 4).toDF()
    df1.show()

    val table2 = Seq(
      table2Class("1", "2", "d"), table2Class("1", "3", "e"))
    val df2 = spark.sparkContext.parallelize(table2, 4).toDF()
    df2.show()
    //
    df1.createOrReplaceTempView("A")
    df2.createOrReplaceTempView("B")

    spark.sql("select d1.key, d1.value, d2.value1 from A d1 inner join B d2 on d1.key = d2.key").show()

    //TODO
    /* need to fix query
    spark.sql( "select * from ( "+ //B1.value,B1.value1,A.value
      " select A.value,B.value,B.value1 "+
      " from B "+
      " left join A "+
      " on B.key = A.key ) B2 "+
      " left join A " +
      " on B2.value = A.key" ).show()
    */
  }
}
Results:
+---+-----+
|key|value|
+---+-----+
| 1| a|
| 2| b|
| 3| c|
+---+-----+
+---+-----+------+
|key|value|value1|
+---+-----+------+
| 1| 2| d|
| 1| 3| e|
+---+-----+------+
+-----+-----+------+
|value|value|value1|
+-----+-----+------+
| 1| a| d|
| 1| a| e|
+-----+-----+------+
Is there a way to make this possible?
Yes. Use Datasets (not RDDs, which are less effective and expressive), join them together, and select the fields you like.
val table1 = Seq(("1", "a"), ("2", "b"), ("3", "c")).toDF("key", "value")
scala> table1.show
+---+-----+
|key|value|
+---+-----+
| 1| a|
| 2| b|
| 3| c|
+---+-----+
val table2 = sc.parallelize(
Array(Array("1", "2", "d"), Array("1", "3", "e"))).
toDF("a").
select($"a"(0) as "a0", $"a"(1) as "a1", $"a"(2) as "a2")
scala> table2.show
+---+---+---+
| a0| a1| a2|
+---+---+---+
| 1| 2| d|
| 1| 3| e|
+---+---+---+
scala> table2.join(table1, $"key" === $"a0").select($"value" as "a0", $"a1", $"a2").show
+---+---+---+
| a0| a1| a2|
+---+---+---+
| a| 2| d|
| a| 3| e|
+---+---+---+
Repeat for the other a columns and union together. While repeating the code, you'll notice the pattern that will make the code generic.
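For reference, here is one way the repeated step could be made generic (a sketch on my part, using one left join per key column rather than a union; coalesce comes from org.apache.spark.sql.functions):
import org.apache.spark.sql.functions.coalesce

// Fold over the key-bearing columns, joining table1 once per column and
// substituting the mapped value where a match exists.
val keyCols = Seq("a0", "a1")
val replaced = keyCols.foldLeft(table2) { (acc, c) =>
  // Rename table1's columns per iteration to keep the join unambiguous.
  val lookup = table1.withColumnRenamed("key", s"${c}_key").withColumnRenamed("value", s"${c}_value")
  acc.join(lookup, acc(c) === lookup(s"${c}_key"), "left")
    .withColumn(c, coalesce(lookup(s"${c}_value"), acc(c)))
    .drop(s"${c}_key", s"${c}_value")
}
replaced.show()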
If so, would it be efficient using a huge dataset?
Yes (again). We're talking Spark here and a huge dataset is exactly why you chose Spark, isn't it?

How to combine two spark data frames in sorted order

I want to combine two dataframes a and b into a dataframe c that is sorted on a column.
val a = Seq(("a", 1), ("c", 2), ("e", 3)).toDF("char", "num")
val b = Seq(("b", 4), ("d", 5)).toDF("char", "num")
val c = // how do I sort on char column?
Here is the result I want:
a.show() b.show() c.show()
+----+---+ +----+---+ +----+---+
|char|num| |char|num| |char|num|
+----+---+ +----+---+ +----+---+
| a| 1| | b| 4| | a| 1|
| c| 2| | d| 5| | b| 4|
| e| 3| +----+---+ | c| 2|
+----+---+ | d| 5|
| e| 3|
+----+---+
Simply put, you can sort() each dataframe and union() them.
val a = Seq(("a", 1), ("c", 2), ("e", 3)).toDF("char", "num").sort($"char")
val b = Seq(("b", 4), ("d", 5)).toDF("char", "num").sort($"char")
val c = a.union(b).sort($"char")
If you want to union multiple dataframes, you can try it this way:
val df1 = sc.parallelize(List(
  (50, 2, "arjun"),
  (34, 4, "bob")
)).toDF("age", "children", "name")

val df2 = sc.parallelize(List(
  (51, 3, "jane"),
  (35, 5, "bob")
)).toDF("age", "children", "name")

val df3 = sc.parallelize(List(
  (50, 2, "arjun"),
  (34, 4, "bob")
)).toDF("age", "children", "name")

val result = Seq(df1, df2, df3)
val res_union = result.reduce(_ union _).sort($"age", $"name", $"children")
res_union.show()