Read values from a table and apply conditions in Spark - Scala

I have a dataframe, df1:
+------+--------+--------+--------+
| Name | value1 | value2 | value3 |
+------+--------+--------+--------+
| A    | 100    | null   | 200    |
| B    | 10000  | 300    | 10     |
| C    | null   | 10     | 100    |
+------+--------+--------+--------+
second dataframe, df2:
+------+------+
| Col1 | col2 |
+------+------+
| X    | 1000 |
| Y    | 2002 |
| Z    | 3000 |
+------+------+
I want to read the values from df1 (value1, value2 and value3) and apply conditions to df2, adding new columns:
cond1: when Name = A and col2 > value1, flag Y, otherwise N
cond2: when Name = B and col2 > value2, flag Y, otherwise N
cond3: when Name = C and col2 > value1 and col2 > value3, flag Y, otherwise N
My attempted source code:
df2.withColumn("cond1", when($"col2" > value1, lit("Y")).otherwise(lit("N")))
df2.withColumn("cond2", when($"col2" > value2, lit("Y")).otherwise(lit("N")))
df2.withColumn("cond3", when($"col2" > value1 && $"col2" > value3, lit("Y")).otherwise(lit("N")))
Expected output:
+------+------+-------+-------+-------+
| Col1 | col2 | cond1 | cond2 | cond3 |
+------+------+-------+-------+-------+
| X    | 1000 | Y     | Y     | Y     |
| Y    | 2002 | N     | Y     | Y     |
| Z    | 3000 | Y     | Y     | Y     |
+------+------+-------+-------+-------+

If I understand your question correctly, you can join the two dataframes and create the condition columns as shown below. A couple of notes:
1) With the described conditions, null in df1 is replaced with Int.MinValue to simplify the integer comparison
2) Since df1 is small, a broadcast join is used to minimize sorting/shuffling for better performance
// in spark-shell these imports are already in scope
import org.apache.spark.sql.functions._
import spark.implicits._

val df1 = Seq(
  ("A", 100, Int.MinValue, 200),
  ("B", 10000, 300, 10),
  ("C", Int.MinValue, 10, 100)
).toDF("Name", "value1", "value2", "value3")

val df2 = Seq(
  ("A", 1000),
  ("B", 2002),
  ("C", 3000),
  ("A", 5000),
  ("A", 150),
  ("B", 250),
  ("B", 12000),
  ("C", 50)
).toDF("Col1", "col2")

val df3 = df2.join(broadcast(df1), df2("Col1") === df1("Name")).select(
  df2("Col1"),
  df2("col2"),
  when(df2("col2") > df1("value1"), "Y").otherwise("N").as("cond1"),
  when(df2("col2") > df1("value2"), "Y").otherwise("N").as("cond2"),
  when(df2("col2") > df1("value1") && df2("col2") > df1("value3"), "Y").otherwise("N").as("cond3")
)

df3.show
+----+-----+-----+-----+-----+
|Col1| col2|cond1|cond2|cond3|
+----+-----+-----+-----+-----+
| A| 1000| Y| Y| Y|
| B| 2002| N| Y| N|
| C| 3000| Y| Y| Y|
| A| 5000| Y| Y| Y|
| A| 150| Y| Y| N|
| B| 250| N| N| N|
| B|12000| Y| Y| Y|
| C| 50| Y| Y| N|
+----+-----+-----+-----+-----+
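If you would rather keep the nulls in df1 instead of baking Int.MinValue into the data, here is a minimal sketch of the same join that substitutes the sentinel only at comparison time. This assumes a df1 whose value columns still contain the original nulls; the names low and df3b are just illustrative:
import org.apache.spark.sql.functions._

// Substitute Int.MinValue for null thresholds only inside the comparison
val low = lit(Int.MinValue)
val df3b = df2.join(broadcast(df1), df2("Col1") === df1("Name")).select(
  df2("Col1"),
  df2("col2"),
  when(df2("col2") > coalesce(df1("value1"), low), "Y").otherwise("N").as("cond1"),
  when(df2("col2") > coalesce(df1("value2"), low), "Y").otherwise("N").as("cond2"),
  when(df2("col2") > coalesce(df1("value1"), low) &&
       df2("col2") > coalesce(df1("value3"), low), "Y").otherwise("N").as("cond3")
)
A null threshold then behaves the same as in the answer above (the comparison always passes), but df1 itself stays untouched.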

You can create a rowNo column in both dataframes as below:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._
val tempdf1 = df1.withColumn("rowNo", row_number().over(Window.orderBy("Name")))
val tempdf2 = df2.withColumn("rowNo", row_number().over(Window.orderBy("Col1")))
Then you can join them on the created column as below:
val joinedDF = tempdf2.join(tempdf1, Seq("rowNo"), "left")
Finally, you can use select and the when function to get the final dataframe:
joinedDF.select($"Col1",
  $"col2",
  when($"col2" > $"value1" || $"value1".isNull, "Y").otherwise("N").as("cond1"),
  when($"col2" > $"value2" || $"value2".isNull, "Y").otherwise("N").as("cond2"),
  when(($"col2" > $"value1" && $"col2" > $"value3") || $"value3".isNull, "Y").otherwise("N").as("cond3"))
You should have your desired dataframe:
+----+----+-----+-----+-----+
|Col1|col2|cond1|cond2|cond3|
+----+----+-----+-----+-----+
|X |1000|Y |Y |Y |
|Y |2002|N |Y |Y |
|Z |3000|Y |Y |Y |
+----+----+-----+-----+-----+
I hope the answer is helpful.

Related

Scala Spark, fill an entire column with a char value

I am trying to read a column, and if anywhere in that column there is a "Y", I will fill the new column with "Y"; otherwise I will fill it with "N".
Desired output when the Value column contains a "Y":
+--------------+---------------------+-------------------+
|Date          | Value               | HasChanged        |
+--------------+---------------------+-------------------+
|2020-12-14    | N                   | Y                 |
|2020-12-14    | Y                   | Y                 |
|2020-12-14    | N                   | Y                 |
|2020-12-14    | N                   | Y                 |
+--------------+---------------------+-------------------+
Desired output when it does not:
+--------------+---------------------+-------------------+
|Date          | Value               | HasChanged        |
+--------------+---------------------+-------------------+
|2020-12-14    | N                   | N                 |
|2020-12-14    | N                   | N                 |
|2020-12-14    | N                   | N                 |
|2020-12-14    | N                   | N                 |
+--------------+---------------------+-------------------+
I am trying with this:
val df1 = df.withColumn("HasChanged", when($"Value" === "Y", lit("Y")).otherwise("N"))
But that only changes the rows where there is a Y, and what I want is to change the entire column. How can I do it?
You don't actually need when; you can just use the max function, since "Y" > "N":
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

val df1 = df.withColumn("HasChanged", max(col("Value")).over(Window.orderBy()))
df1.show
//+----------+-----+----------+
//| Date|Value|HasChanged|
//+----------+-----+----------+
//|2020-12-14| N| Y|
//|2020-12-14| Y| Y|
//|2020-12-14| N| Y|
//|2020-12-14| N| Y|
//+----------+-----+----------+
You need to check whether Value = "Y" in any of the rows. You can do that by taking the maximum of the comparison boolean over a window, which will be true if one or more rows are true, and false if every row is false.
import org.apache.spark.sql.expressions.Window

val df1 = df.withColumn(
  "HasChanged",
  when(max($"Value" === "Y").over(Window.orderBy()), "Y").otherwise("N")
)
df1.show
+----------+-----+----------+
| Date|Value|HasChanged|
+----------+-----+----------+
|2020-12-14| N| Y|
|2020-12-14| Y| Y|
|2020-12-14| N| Y|
|2020-12-14| N| Y|
+----------+-----+----------+
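If pulling every row into a single unpartitioned window is a concern, here is a minimal sketch of an alternative (assuming the same df as above; the val names are just illustrative) that computes the flag once and stamps the whole column with a literal:
import org.apache.spark.sql.functions._

// Check once whether any row has Value == "Y", then fill the whole column with a literal
val hasY = df.filter($"Value" === "Y").limit(1).count() > 0
val df1 = df.withColumn("HasChanged", lit(if (hasY) "Y" else "N"))
This trades the window for one extra (short-circuited) pass over the data, which may or may not matter depending on the size of df.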

reverse effect of explode function

In Scala with Spark 2.4, I would like to filter the values inside the arrays in a column.
From
+---+------------+
| id| letter|
+---+------------+
| 1|[x, xxx, xx]|
| 2|[yy, y, yyy]|
+---+------------+
To
+---+-------+
| id| letter|
+---+-------+
| 1|[x, xx]|
| 2|[yy, y]|
+---+-------+
I thought of using explode + filter
val res = Seq(("1", Array("x", "xxx", "xx")), ("2", Array("yy", "y", "yyy"))).toDF("id", "letter")
res.withColumn("tmp", explode(col("letter"))).filter(length(col("tmp")) < 3).drop(col("letter")).show()
And I'm getting
+---+---+
| id|tmp|
+---+---+
| 1| x|
| 1| xx|
| 2| yy|
| 2| y|
+---+---+
How do I zip/groupBy back by id?
Or maybe there is a better, more optimised solution?
You can filter the array without explode() in Spark 2.4:
res.withColumn("letter", expr("filter(letter, x -> length(x) < 3)")).show()
Output:
+---+-------+
| id| letter|
+---+-------+
| 1|[x, xx]|
| 2|[yy, y]|
+---+-------+
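As a side note, if you happen to be on Spark 3.0+ rather than 2.4, the same higher-order function is also exposed in the Scala DSL, so you can avoid the SQL expression string. A sketch of that variant:
import org.apache.spark.sql.functions.{col, filter, length}

// Spark 3.0+: filter(column, lambda) keeps only the array elements matching the predicate
res.withColumn("letter", filter(col("letter"), x => length(x) < 3)).show()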
In Spark 2.4+, higher-order functions (filter) are the way to go; alternatively, use collect_list:
res.withColumn("tmp",explode(col("letter")))
.filter(length(col("tmp")) < 3)
.drop(col("letter"))
// aggregate back
.groupBy($"id")
.agg(collect_list($"tmp").as("letter"))
.show()
gives:
+---+-------+
| id| letter|
+---+-------+
| 1|[x, xx]|
| 2|[yy, y]|
+---+-------+
As this introduces a shuffle, it's better to use a UDF for that:
def filter_arr(maxLength: Int) = udf((arr: Seq[String]) => arr.filter(str => str.size <= maxLength))

res
  .select($"id", filter_arr(maxLength = 2)($"letter").as("letter"))
  .show()
gives:
+---+-------+
| id| letter|
+---+-------+
| 1|[x, xx]|
| 2|[yy, y]|
+---+-------+

Apply QuantileDiscretizer to all columns in a DataFrame

Assume that I have a dataframe with id and 100 columns. I want to apply QuantileDiscretizer to each column and return a new dataframe with the id column alongside new columns holding the discretized values.
Example for two columns only:
Input
id | col1 | col2
---|------|-----
 0 | 18.0 | 20.0
 1 | 19.0 | 30.0
 2 |  8.0 | 35.0
 3 |  5.0 | 10.0
 4 |  2.2 |  5.0
Output
id | col1Disc | col2Disc
---|----------|---------
 0 | 2        | 2
 1 | 2        | 3
 2 | 1        | 3
 3 | 2        | 1
 4 | 0        | 0
You can use the Pipeline API:
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.QuantileDiscretizer

val df = Seq(
  (0, 18.0, 20.0), (1, 19.0, 30.0), (2, 8.0, 35.0), (3, 5.0, 10.0), (4, 2.2, 5.0)
).toDF("id", "col1", "col2")

val pipeline = new Pipeline().setStages(for {
  c <- df.columns
  if c != "id"
} yield new QuantileDiscretizer().setInputCol(c).setOutputCol(s"${c}Disc"))

val result = pipeline.fit(df).transform(df)
result.drop(df.columns.diff(Seq("id")): _*).show
+---+--------+--------+
| id|col1Disc|col2Disc|
+---+--------+--------+
| 0| 1.0| 1.0|
| 1| 1.0| 1.0|
| 2| 1.0| 1.0|
| 3| 0.0| 0.0|
| 4| 0.0| 0.0|
+---+--------+--------+
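Note that QuantileDiscretizer defaults to two buckets, which is why the result above only contains 0.0 and 1.0. If you want output closer to the question's example, you can set the bucket count explicitly; a small sketch (the choice of 4 buckets here is only illustrative, not taken from the question):
// Same pipeline, but with an explicit number of buckets per column
val pipeline4 = new Pipeline().setStages(for {
  c <- df.columns
  if c != "id"
} yield new QuantileDiscretizer()
  .setInputCol(c)
  .setOutputCol(s"${c}Disc")
  .setNumBuckets(4))

val result4 = pipeline4.fit(df).transform(df)
result4.drop(df.columns.diff(Seq("id")): _*).show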

Add new column to dataframe based on previous values and condition

I have a sample dataframe. After grouping by level1 and date I got the resulting dataframe:
val group_df = qwe.groupBy($"level1",$"date").agg(sum("rel_amount").as("amount"))
+------+----------+------+
|level1| date|amount|
+------+----------+------+
| A|2016-03-31| 100|
| A|2016-02-28| 100|
| A|2016-01-31| 400|
| A|2015-12-31| 500|
| A|2015-11-30| 1200|
| A|2015-10-31| 1300|
| A|2014-12-31| 600|
| B|2016-03-31| 10|
| B|2016-02-28| 300|
| B|2016-01-31| 423|
| B|2015-12-31| 501|
| B|2015-11-30| 234|
| B|2015-10-31| 1234|
| B|2014-12-31| 3456|
+------+----------+------+
Now I want to add an extra column (Previous) holding the year-end amount of the previous year for each group.
For example, for level1 = A and date = 2016-03-31, the value should be 500 because that is the amount for 2015-12-31.
Similarly, for date = 2015-12-31 the value should be 600, the amount for 2014-12-31. I need to calculate the previous year-end amount for each row.
Expected output :
+------+----------+------+--------+
|level1| date|amount|Previous|
+------+----------+------+--------+
| A|2016-03-31| 100| 500|
| A|2016-02-28| 100| 500|
| A|2016-01-31| 400| 500|
| A|2015-12-31| 500| 600|
| A|2015-11-30| 1200| 600|
| A|2015-10-31| 1300| 600|
| A|2014-12-31| 600| 600|
| B|2016-03-31| 10| 501|
| B|2016-02-28| 300| 501|
| B|2016-01-31| 423| 501|
| B|2015-12-31| 501| 3456|
| B|2015-11-30| 234| 3456|
| B|2015-10-31| 1234| 3456|
| B|2014-12-31| 3456| 3456|
+------+----------+------+--------+
Can someone help me with this?
One approach would be to use a UDF that manipulates the date column as a String to create a new column holding the previous end-of-year date:
val df = Seq(
  ("A", "2016-03-31", 100),
  ("A", "2016-02-28", 100),
  ("A", "2016-01-31", 400),
  ("A", "2015-12-31", 500),
  ("A", "2015-11-30", 1200),
  ("A", "2015-10-31", 1300),
  ("A", "2014-12-31", 600),
  ("B", "2016-03-31", 10),
  ("B", "2016-02-28", 300),
  ("B", "2016-01-31", 423),
  ("B", "2015-12-31", 501),
  ("B", "2015-11-30", 234),
  ("B", "2015-10-31", 1234),
  ("B", "2014-12-31", 3456)
).toDF("level1", "date", "amount")

import org.apache.spark.sql.functions._

def previousEOY = udf((d: String) => (d.substring(0, 4).toInt - 1).toString + "-12-31")

val df2 = df.withColumn("previous_eoy", previousEOY($"date"))
To take advantage of standard SQL's scalar-subquery capability, I'm switching to a Spark TempView (note that max() is used in the subquery simply to guarantee a single-row result):
df2.createOrReplaceTempView("dfView")
val df3 = spark.sqlContext.sql("""
SELECT
level1, date, amount, (
SELECT max(amount) FROM dfView v2
WHERE v2.level1 = v1.level1 AND v2.date = v1.previous_eoy
) previous
FROM
dfView v1
""")
df3.show
+------+----------+------+--------+
|level1| date|amount|previous|
+------+----------+------+--------+
| A|2016-03-31| 100| 500|
| A|2016-02-28| 100| 500|
| A|2016-01-31| 400| 500|
| A|2015-12-31| 500| 600|
| A|2015-11-30| 1200| 600|
| A|2015-10-31| 1300| 600|
| A|2014-12-31| 600| null|
| B|2016-03-31| 10| 501|
| B|2016-02-28| 300| 501|
| B|2016-01-31| 423| 501|
| B|2015-12-31| 501| 3456|
| B|2015-11-30| 234| 3456|
| B|2015-10-31| 1234| 3456|
| B|2014-12-31| 3456| null|
+------+----------+------+--------+
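The question's expected output fills the oldest year-end rows (where no earlier year exists) with their own amount rather than null. If that behaviour is wanted, a small sketch on top of df3 that happens to reproduce it for this data (the name df4 is just illustrative):
import org.apache.spark.sql.functions._

// Fall back to the row's own amount when there is no earlier year-end row
val df4 = df3.withColumn("previous", coalesce($"previous", $"amount"))
df4.show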
// "ss" is the SparkSession (in spark-shell you can use the predefined "spark" instead)
val amount = ss.sparkContext.parallelize(Seq(("B", "2014-12-31", 3456))).toDF("level1", "dateY", "amount")
val yearStr = udf((date: String) => { (date.substring(0, 4).toInt - 1) + "-12-31" })
val df3 = amount.withColumn("p", yearStr($"dateY"))
df3.show()
df3.createOrReplaceTempView("dfView")
val df4 = df3.filter(s => s.getString(1).contains("12-31")).select($"dateY".as("p"), $"level1", $"amount".as("am"))
df4.show
df3.join(df4, Seq("p", "level1"), "left_outer").orderBy("level1", "amount").drop($"p").show()
First, create a dataframe that maps each year to its year-end value. Then join that back into your original dataframe where the year matches; a sketch of this idea is shown below.
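A minimal sketch of that idea with DataFrame joins, assuming the group_df from the question (level1, date as a string, amount); the intermediate names are just illustrative:
import org.apache.spark.sql.functions._

// Year-end rows, keyed by the year they close (e.g. 2015-12-31 -> "2015")
val yearEnd = group_df
  .filter(substring($"date", 6, 5) === "12-31")
  .select($"level1".as("eoyLevel1"), substring($"date", 1, 4).as("eoyYear"), $"amount".as("Previous"))

// Join each row to the year-end amount of the previous year
val withPrevious = group_df
  .withColumn("prevYear", (substring($"date", 1, 4).cast("int") - 1).cast("string"))
  .join(yearEnd, $"level1" === $"eoyLevel1" && $"prevYear" === $"eoyYear", "left")
  .drop("eoyLevel1", "eoyYear", "prevYear")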

Splitting a row into multiple rows in spark-shell

I have imported data into a Spark dataframe in spark-shell. The data looks like this:
Col1 | Col2 | Col3 | Col4
A1 | 11 | B2 | a|b;1;0xFFFFFF
A1 | 12 | B1 | 2
A2 | 12 | B2 | 0xFFF45B
Here in Col4 the values are of different kinds, and I want to separate them (suppose "a|b" is of type alphabets, "1" or "2" is of type digits, and "0xFFFFFF" or "0xFFF45B" is of type hexadecimal):
So the output should be:
Col1 | Col2 | Col3 | alphabets | digits | hexadecimal
A1 | 11 | B2 | a | 1 | 0xFFFFFF
A1 | 11 | B2 | b | 1 | 0xFFFFFF
A1 | 12 | B1 | | 2 |
A2 | 12 | B2 | | | 0xFFF45B
I hope I've made my query clear; I am using spark-shell. Thanks in advance.
Edit, after getting this answer about how to make a backreference in regexp_replace:
You can use regexp_replace with a backreference, then split twice and explode. It is, IMO, cleaner than my original solution:
val df = List(
("A1" , "11" , "B2" , "a|b;1;0xFFFFFF"),
("A1" , "12" , "B1" , "2"),
("A2" , "12" , "B2" , "0xFFF45B")
).toDF("Col1" , "Col2" , "Col3" , "Col4")
val regExStr = "^([A-z|]+)?;?(\\d+)?;?(0x.*)?$"
val res = df
  .withColumn("backrefReplace",
    split(regexp_replace('Col4, regExStr, "$1;$2;$3"), ";"))
  .select('Col1, 'Col2, 'Col3,
    explode(split('backrefReplace(0), "\\|")).as("letter"),
    'backrefReplace(1).as("digits"),
    'backrefReplace(2).as("hexadecimal")
  )
res.show
+----+----+----+------+------+-----------+
|Col1|Col2|Col3|letter|digits|hexadecimal|
+----+----+----+------+------+-----------+
| A1| 11| B2| a| 1| 0xFFFFFF|
| A1| 11| B2| b| 1| 0xFFFFFF|
| A1| 12| B1| | 2| |
| A2| 12| B2| | | 0xFFF45B|
+----+----+----+------+------+-----------+
You still need to replace empty strings with null, though...
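For example, something along these lines (a small sketch on top of the res above; the name cleaned is just illustrative):
import org.apache.spark.sql.functions._

// when(...) without otherwise yields null for non-matching rows, turning "" into null
val cleaned = Seq("letter", "digits", "hexadecimal").foldLeft(res) { (acc, c) =>
  acc.withColumn(c, when(col(c) =!= "", col(c)))
}
cleaned.show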
Previous Answer (somebody might still prefer it):
Here is a solution that sticks to DataFrames but is also quite messy. You can first use regexp_extract three times (is it possible to do fewer with a backreference?), and finally split on "|" and explode. Note that you need a coalesce for explode to return everything (you might still want to change the empty strings in letter to null in this solution).
val res = df
  .withColumn("alphabets", regexp_extract('Col4, "(^[A-z|]+)?", 1))
  .withColumn("digits", regexp_extract('Col4, "^([A-z|]+)?;?(\\d+)?;?(0x.*)?$", 2))
  .withColumn("hexadecimal", regexp_extract('Col4, "^([A-z|]+)?;?(\\d+)?;?(0x.*)?$", 3))
  .withColumn("letter",
    explode(
      split(
        coalesce('alphabets, lit("")),
        "\\|"
      )
    )
  )
res.show
+----+----+----+--------------+---------+------+-----------+------+
|Col1|Col2|Col3| Col4|alphabets|digits|hexadecimal|letter|
+----+----+----+--------------+---------+------+-----------+------+
| A1| 11| B2|a|b;1;0xFFFFFF| a|b| 1| 0xFFFFFF| a|
| A1| 11| B2|a|b;1;0xFFFFFF| a|b| 1| 0xFFFFFF| b|
| A1| 12| B1| 2| null| 2| null| |
| A2| 12| B2| 0xFFF45B| null| null| 0xFFF45B| |
+----+----+----+--------------+---------+------+-----------+------+
Note: The regexp part could be so much better with backreference, so if somebody knows how to do it, please comment!
I'm not sure this is doable while staying 100% with DataFrames; here's a (somewhat messy?) solution using RDDs for the split itself:
import org.apache.spark.sql.functions._
import sqlContext.implicits._
// we switch to RDD to perform the split of Col4 into 3 columns
val rddWithSplitCol4 = input.rdd.map { r =>
val indexToValue = r.getAs[String]("Col4").split(';').map {
case s if s.startsWith("0x") => 2 -> s
case s if s.matches("\\d+") => 1 -> s
case s => 0 -> s
}
val newCols: Array[String] = indexToValue.foldLeft(Array.fill[String](3)("")) {
case (arr, (index, value)) => arr.updated(index, value)
}
(r.getAs[String]("Col1"), r.getAs[Int]("Col2"), r.getAs[String]("Col3"), newCols(0), newCols(1), newCols(2))
}
// switch back to Dataframe and explode alphabets column
val result = rddWithSplitCol4
.toDF("Col1", "Col2", "Col3", "alphabets", "digits", "hexadecimal")
.withColumn("alphabets", explode(split(col("alphabets"), "\\|")))
result.show(truncate = false)
// +----+----+----+---------+------+-----------+
// |Col1|Col2|Col3|alphabets|digits|hexadecimal|
// +----+----+----+---------+------+-----------+
// |A1 |11 |B2 |a |1 |0xFFFFFF |
// |A1 |11 |B2 |b |1 |0xFFFFFF |
// |A1 |12 |B1 | |2 | |
// |A2 |12 |B2 | | |0xFFF45B |
// +----+----+----+---------+------+-----------+