Spark scala - add a json column from other columns conditionally - scala

With a dataframe such as :
+-----+-----+-----+-----+-----+-----+
|old_a|new_a|    a|old_b|new_b|    b|
+-----+-----+-----+-----+-----+-----+
|    6|    7| true|    6|    6|false|
|    1|    1|false|   12|    8| true|
|    1|    2| true|    2|    8| true|
|    1| null| true|    2|    8| true|
+-----+-----+-----+-----+-----+-----+
Note: 'a' is true when 'new_a' differs from 'old_a'; the same for 'b'.
I'd like to add a json column built from some of the other columns, following this rule:
"if 'a' is true, the value of the 'new_a' column must be added to the new json, and the same for 'b'",
which would produce the following dataframe:
+-----+-----+-----+-----+-----+-----+-----------------------+
|old_a|new_a|    a|old_b|new_b|    b|json                   |
+-----+-----+-----+-----+-----+-----+-----------------------+
|    6|    7| true|    6|    6|false|{ "a" : 7 }            |
|    1|    1|false|   12|    8| true|{ "b" : 8 }            |
|    1|    2| true|    2|    8| true|{ "a" : 2, "b" : 8 }   |
|    1| null| true|    2|    8| true|{ "a" : null, "b" : 8 }|
+-----+-----+-----+-----+-----+-----+-----------------------+
Is there a way to achieve that without UDFs?
If not, what would be the best way to write the UDF so it won't be too costly?
Thanks

Use the to_json & struct functions.
By default, to_json drops fields whose value is null; for this reason I have also converted the new_a column's datatype to string (both variants are shown below).
new_a datatype integer
scala> df.show(false)
+-----+-----+-----+-----+-----+-----+
|old_a|new_a|a |old_b|new_b|b |
+-----+-----+-----+-----+-----+-----+
|6 |7 |true |6 |6 |false|
|1 |1 |false|12 |8 |true |
|1 |2 |true |2 |8 |true |
|1 |null |true |2 |8 |true |
+-----+-----+-----+-----+-----+-----+
scala> df.printSchema
root
|-- old_a: integer (nullable = false)
|-- new_a: integer (nullable = true)
|-- a: boolean (nullable = false)
|-- old_b: integer (nullable = false)
|-- new_b: integer (nullable = false)
|-- b: boolean (nullable = false)
scala> df.withColumn("json",when($"a" && $"b",to_json(struct($"new_a",$"new_b"))).when($"a",to_json(struct($"new_a"))).otherwise(to_json(struct($"new_b")))).show(false)
+-----+-----+-----+-----+-----+-----+---------------------+
|old_a|new_a|a |old_b|new_b|b |json |
+-----+-----+-----+-----+-----+-----+---------------------+
|6 |7 |true |6 |6 |false|{"new_a":7} |
|1 |1 |false|12 |8 |true |{"new_b":8} |
|1 |2 |true |2 |8 |true |{"new_a":2,"new_b":8}|
|1 |null |true |2 |8 |true |{"new_b":8} |
+-----+-----+-----+-----+-----+-----+---------------------+
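The cast to string is not shown in the transcript. A minimal sketch of one way to build the string variant used in the next run (coalescing nulls to the literal text "null" is my assumption, chosen to reproduce the "new_a":"null" output further below; the transcript simply keeps calling the result df):
scala> import org.apache.spark.sql.functions.{coalesce, lit}
scala> val dfString = df.withColumn("new_a", coalesce($"new_a".cast("string"), lit("null")))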
new_a datatype string
scala> df.show(false)
+-----+-----+-----+-----+-----+-----+
|old_a|new_a|a |old_b|new_b|b |
+-----+-----+-----+-----+-----+-----+
|6 |7 |true |6 |6 |false|
|1 |1 |false|12 |8 |true |
|1 |2 |true |2 |8 |true |
|1 |null |true |2 |8 |true |
+-----+-----+-----+-----+-----+-----+
scala> df.printSchema
root
|-- old_a: integer (nullable = false)
|-- new_a: string (nullable = true)
|-- a: boolean (nullable = false)
|-- old_b: integer (nullable = false)
|-- new_b: integer (nullable = false)
|-- b: boolean (nullable = false)
scala> df.withColumn("json",when($"a" && $"b",to_json(struct($"new_a",$"new_b"))).when($"a",to_json(struct($"new_a"))).otherwise(to_json(struct($"new_b")))).show(false)
+-----+-----+-----+-----+-----+-----+--------------------------+
|old_a|new_a|a |old_b|new_b|b |json |
+-----+-----+-----+-----+-----+-----+--------------------------+
|6 |7 |true |6 |6 |false|{"new_a":"7"} |
|1 |1 |false|12 |8 |true |{"new_b":8} |
|1 |2 |true |2 |8 |true |{"new_a":"2","new_b":8} |
|1 |null |true |2 |8 |true |{"new_a":"null","new_b":8}|
+-----+-----+-----+-----+-----+-----+--------------------------+
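On Spark 3.0+ there may be a simpler route that avoids the string cast altogether: to_json accepts JSON options, and setting ignoreNullFields to false should keep null fields in the output (this is an assumption to verify against your Spark version):
df.withColumn("json",
  when($"a" && $"b", to_json(struct($"new_a", $"new_b"), Map("ignoreNullFields" -> "false")))
    .when($"a", to_json(struct($"new_a"), Map("ignoreNullFields" -> "false")))
    .otherwise(to_json(struct($"new_b"), Map("ignoreNullFields" -> "false")))
).show(false)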

A solution to generalize Srinivas' solution, for when we don't know the number of old/new column pairs.
(Note: something I didn't mention is that columns 'a' and 'b' were there to tell whether the value changed between old_a and new_a, respectively old_b and new_b.)
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{col, struct, to_json, when}

val df = Seq(
  (null, "a", "b", "b"),
  ("a", null, "b", "b"),
  ("a", "a2", "b", "b"),
  ("a", "a2", "b", "b2"),
  (null, null, "b", "b2")
).toDF("old_a", "new_a", "old_b", "new_b")

// replace null with an empty string so it doesn't clash with the null values we set deliberately later
val df2 = df.na.fill("", df.columns)
df2.show()

val colNames = df2.columns.map(name => name.stripPrefix("old_").stripPrefix("new_")).distinct

val res = colNames.foldLeft(df2) { (tempDF, colName) =>
  tempDF.withColumn(colName,
    when(col(s"old_$colName").equalTo(col(s"new_$colName")), null)
      .otherwise(col(s"new_$colName"))
  )
}

val cols: Array[Column] = colNames.map(col(_))
val resWithJson = res.withColumn("json", to_json(struct(cols: _*)))
resWithJson.show(false)
output :
+-----+-----+-----+-----+
|old_a|new_a|old_b|new_b|
+-----+-----+-----+-----+
| | a| b| b|
| a| | b| b|
| a| a2| b| b|
| a| a2| b| b2|
| | | b| b2|
+-----+-----+-----+-----+
+-----+-----+-----+-----+----+----+-------------------+
|old_a|new_a|old_b|new_b|a |b |json |
+-----+-----+-----+-----+----+----+-------------------+
| |a |b |b |a |null|{"a":"a"} |
|a | |b |b | |null|{"a":""} |
|a |a2 |b |b |a2 |null|{"a":"a2"} |
|a |a2 |b |b2 |a2 |b2 |{"a":"a2","b":"b2"}|
| | |b |b2 |null|b2 |{"b":"b2"} |
+-----+-----+-----+-----+----+----+-------------------+

Related

Find min value for every 5 hour interval

My df
val df = Seq(
("1", 1),
("1", 1),
("1", 2),
("1", 4),
("1", 5),
("1", 6),
("1", 8),
("1", 12),
("1", 12),
("1", 13),
("1", 14),
("1", 15),
("1", 16)
).toDF("id", "time")
In this case the first interval starts at hour 1, so every row up to 6 (1 + 5) is part of this interval.
But 8 - 1 > 5, so the second interval starts at 8 and goes up to 13.
Then 14 - 8 > 5, so the third one starts, and so on.
The desired result
+---+----+--------+
|id |time|min_time|
+---+----+--------+
|1 |1 |1 |
|1 |1 |1 |
|1 |2 |1 |
|1 |4 |1 |
|1 |5 |1 |
|1 |6 |1 |
|1 |8 |8 |
|1 |12 |8 |
|1 |12 |8 |
|1 |13 |8 |
|1 |14 |14 |
|1 |15 |14 |
|1 |16 |14 |
+---+----+--------+
I'm trying to do it using the min function, but I don't know how to account for this condition.
val window = Window.partitionBy($"id").orderBy($"time")
df
  .select($"id", $"time")
  .withColumn("min_time", when(($"time" - min($"time").over(window)) <= 5, min($"time").over(window)).otherwise($"time"))
  .show(false)
What I get:
+---+----+--------+
|id |time|min_time|
+---+----+--------+
|1 |1 |1 |
|1 |1 |1 |
|1 |2 |1 |
|1 |4 |1 |
|1 |5 |1 |
|1 |6 |1 |
|1 |8 |8 |
|1 |12 |12 |
|1 |12 |12 |
|1 |13 |13 |
|1 |14 |14 |
|1 |15 |15 |
|1 |16 |16 |
+---+----+--------+
You can go with your first idea of using an aggregation function over a window. But instead of using some combination of Spark's built-in functions, you can define your own Spark user-defined aggregate function (UDAF).
Analysis
As you correctly supposed, we should use a kind of min function over a window. On the rows of this window, we want to implement the following rule:
Given rows sorted by time, if the current row's time minus the previous row's min_time is greater than 5, then the current row's min_time should be the current row's time; otherwise the current row's min_time should be the previous row's min_time.
However, with the aggregate functions provided by Spark, we can't access the previous row's min_time. There is a lag function, but it only gives access to values already present in previous rows. As the previous row's min_time is not yet computed, we can't access it.
Thus we have to define our own aggregate function.
Solution
Defining a tailor-made aggregate function
To define our aggregate function, we need to create an object that extends the Aggregator abstract class. Below is the complete implementation:
import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.sql.{Encoder, Encoders}

object MinByInterval extends Aggregator[Integer, Integer, Integer] {

  def zero: Integer = null

  def reduce(buffer: Integer, time: Integer): Integer = {
    if (buffer == null || time - buffer > 5) time else buffer
  }

  def merge(b1: Integer, b2: Integer): Integer = {
    throw new NotImplementedError("should not use as general aggregation")
  }

  def finish(reduction: Integer): Integer = reduction

  def bufferEncoder: Encoder[Integer] = Encoders.INT

  def outputEncoder: Encoder[Integer] = Encoders.INT
}
We use Integer for the input, buffer and output types. We chose Integer because it is a nullable Int. We could have used Option[Int], however Spark's documentation advises not to recreate objects in aggregator methods for performance reasons, which is what would happen if we used a complex type like Option.
We implement the rule defined in Analysis section in reduce method:
def reduce(buffer: Integer, time: Integer): Integer = {
  if (buffer == null || time - buffer > 5) time else buffer
}
Here time is the value of the time column for the current row, and buffer is the previously computed value, corresponding to the min_time column of the previous row. As our window sorts rows by time, time is always greater than or equal to buffer. The null buffer case only happens when treating the first row.
The merge method is not used when the aggregate function is applied over a window, so we don't implement it.
The finish method is the identity, as we don't need to perform any final computation on the aggregated value, and both the buffer and output encoders are Encoders.INT.
Calling user defined aggregate function
Now we can call our user defined aggregate function with the following code:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, udaf}
val minTime = udaf(MinByInterval)
val window = Window.partitionBy("id").orderBy("time")
df.withColumn("min_time", minTime(col("time")).over(window))
Run
Given the input dataframe in the question, we get:
+---+----+--------+
|id |time|min_time|
+---+----+--------+
|1 |1 |1 |
|1 |1 |1 |
|1 |2 |1 |
|1 |4 |1 |
|1 |5 |1 |
|1 |6 |1 |
|1 |8 |8 |
|1 |12 |8 |
|1 |12 |8 |
|1 |13 |8 |
|1 |14 |14 |
|1 |15 |14 |
|1 |16 |14 |
+---+----+--------+
Input data
val df = Seq(
("1", 1),
("1", 1),
("1", 2),
("1", 4),
("1", 5),
("1", 6),
("1", 8),
("1", 12),
("1", 12),
("1", 13),
("1", 14),
("1", 15),
("1", 16),
("2", 4),
("2", 8),
("2", 10),
("2", 11),
("2", 11),
("2", 12),
("2", 13),
("2", 20)
).toDF("id", "time")
The data must be sorted, otherwise the result will be incorrect.
val window = Window.partitionBy($"id").orderBy($"time")
df
  .withColumn("min", row_number().over(window))
  .as[Row]
  .map(_.getMin)
  .show(40)
Then I create a case class. var min is used to hold the minimum value and is only updated when the conditions are met.
case class Row(id: String, time: Int, min: Int) {
  def getMin: Row = {
    if (time - Row.min > 5 || Row.min == -99 || min == 1) {
      Row.min = time
    }
    Row(id, time, Row.min)
  }
}

object Row {
  var min: Int = -99
}
Result
+---+----+---+
| id|time|min|
+---+----+---+
| 1| 1| 1|
| 1| 1| 1|
| 1| 2| 1|
| 1| 4| 1|
| 1| 5| 1|
| 1| 6| 1|
| 1| 8| 8|
| 1| 12| 8|
| 1| 12| 8|
| 1| 13| 8|
| 1| 14| 14|
| 1| 15| 14|
| 1| 16| 14|
| 2| 4| 4|
| 2| 8| 4|
| 2| 10| 10|
| 2| 11| 10|
| 2| 11| 10|
| 2| 12| 10|
| 2| 13| 10|
| 2| 20| 20|
+---+----+---+

Spark: explode multiple columns into one

Is it possible to explode multiple columns into one new column in spark? I have a dataframe which looks like this:
userId varA varB
1 [0,2,5] [1,2,9]
desired output:
userId bothVars
1 0
1 2
1 5
1 1
1 2
1 9
What I have tried so far:
val explodedDf = df.withColumn("bothVars", explode($"varA")).drop("varA")
.withColumn("bothVars", explode($"varB")).drop("varB")
which doesn't work. Any suggestions are much appreciated.
You could wrap the two arrays into one and flatten the nested array before exploding it, as shown below:
val df = Seq(
  (1, Seq(0, 2, 5), Seq(1, 2, 9)),
  (2, Seq(1, 3, 4), Seq(2, 3, 8))
).toDF("userId", "varA", "varB")

df.
  select($"userId", explode(flatten(array($"varA", $"varB"))).as("bothVars")).
  show
// +------+--------+
// |userId|bothVars|
// +------+--------+
// | 1| 0|
// | 1| 2|
// | 1| 5|
// | 1| 1|
// | 1| 2|
// | 1| 9|
// | 2| 1|
// | 2| 3|
// | 2| 4|
// | 2| 2|
// | 2| 3|
// | 2| 8|
// +------+--------+
Note that flatten is available on Spark 2.4+.
Use array_union and then the explode function. Note that array_union also removes duplicate values, which is why the second 2 for userId 1 does not appear in the output below.
scala> df.show(false)
+------+---------+---------+
|userId|varA |varB |
+------+---------+---------+
|1 |[0, 2, 5]|[1, 2, 9]|
|2 |[1, 3, 4]|[2, 3, 8]|
+------+---------+---------+
scala> df
.select($"userId",explode(array_union($"varA",$"varB")).as("bothVars"))
.show(false)
+------+--------+
|userId|bothVars|
+------+--------+
|1 |0 |
|1 |2 |
|1 |5 |
|1 |1 |
|1 |9 |
|2 |1 |
|2 |3 |
|2 |4 |
|2 |2 |
|2 |8 |
+------+--------+
array_union is available in Spark 2.4+

How would I repeat each row in a Scala dataframe N times

Here is the before of the dataframe:
and here is the after:
notice how the repeated rows are all next to each other, as opposed to the whole dataframe simply being appended again at the end.
Thanks
Try array_repeat with the struct function, then explode the array.
Example:
df.show()
/*
+----+----+
|col1|col2|
+----+----+
| 1| 4|
| 2| 5|
| 3| 6|
+----+----+
*/
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
df.withColumn("arr",explode(array_repeat(struct(df.columns.head,df.columns.tail:_*),7))).
select("arr.*").
toDF("col1","col2").
show(100,false)
/*
+----+----+
|col1|col2|
+----+----+
|1 |4 |
|1 |4 |
|1 |4 |
|1 |4 |
|1 |4 |
|1 |4 |
|1 |4 |
|2 |5 |
|2 |5 |
|2 |5 |
|2 |5 |
|2 |5 |
|2 |5 |
|2 |5 |
|3 |6 |
|3 |6 |
|3 |6 |
|3 |6 |
|3 |6 |
|3 |6 |
|3 |6 |
+----+----+
*/
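If N is held in a variable rather than hard-coded, array_repeat also has an overload taking a Column count (Spark 2.4+); a sketch assuming the same df and a hypothetical repeat count n:
import org.apache.spark.sql.functions._

val n = 7 // hypothetical repeat count
df.withColumn("arr", explode(array_repeat(struct(df.columns.head, df.columns.tail: _*), lit(n)))).
  select("arr.*").
  show(100, false)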
Here's a function which duplicates a DataFrame:
import org.apache.spark.sql.DataFrame

def repeatRows(df: DataFrame, numRepeats: Int): DataFrame = {
  (1 until numRepeats).foldLeft(df)((growingDF, _) => growingDF.union(df))
}
The problem of having the resulting DataFrame sorted is separate from the duplication process, and hence wasn't included in the function, but can be easily achieved afterwards.
So let's take your problem:
// Problem setup
val someDF = Seq((1,4),(2,4),(3,6)).toDF("col1","col2")
// Duplicate followed by sort
val duplicatedSortedDF = repeatRows(someDF, 3).sort("col1")
// Show result
duplicatedSortedDF.show()
+----+----+
|col1|col2|
+----+----+
| 1| 4|
| 1| 4|
| 1| 4|
| 2| 4|
| 2| 4|
| 2| 4|
| 3| 6|
| 3| 6|
| 3| 6|
+----+----+
And there you have it.

Update the value of a array of column depend on the value of different array of column in same dataframe in Spark-Scala

There are almost 1500+ columns in a dataframe, containing different sets of columns with names like col1, col2, col3, col4, ..., then c1, c2, c3, c4, ..., then column1, column2, column3, column4, ..., etc. Now, based on certain logic and conditions, a subset of the same-named columns needs to be updated/modified. Suppose the logic is like:
col(i) = 2 * column(i) + 3 * c(i)
and
col(i) = 2 * column(i) + col(i)
where i ranges between the lower and upper limits of the column subset. Here the subset of col1-col4 can be col1, col2, col3.
I need an expression with which the two operations mentioned above can be written.
I am giving an example for the simpler expression below:
col(i) = col(i) + 1
Example :
scala> val original_df = Seq((1,2,3,4,9,8,7,6),(2,3,4,5,8,7,6,5),(3,4,5,6,7,6,5,4),(4,5,6,7,6,5,4,3),(5,6,7,8,5,4,3,2),(6,7,8,9,4,3,2,1)).toDF("col1","col2","col3","col4","c1","c2","c3","c4")
original_df: org.apache.spark.sql.DataFrame = [col1: int, col2: int ... 6 more fields]
scala> original_df.show()
+----+----+----+----+---+---+---+---+
|col1|col2|col3|col4| c1| c2| c3| c4|
+----+----+----+----+---+---+---+---+
| 1| 2| 3| 4| 9| 8| 7| 6|
| 2| 3| 4| 5| 8| 7| 6| 5|
| 3| 4| 5| 6| 7| 6| 5| 4|
| 4| 5| 6| 7| 6| 5| 4| 3|
| 5| 6| 7| 8| 5| 4| 3| 2|
| 6| 7| 8| 9| 4| 3| 2| 1|
+----+----+----+----+---+---+---+---+
scala> val requiredColumns = original_df.columns.zipWithIndex.filter(_._2 < 3).map(_._1).toSet
requiredColumns: scala.collection.immutable.Set[String] = Set(col1, col2, col3)
scala> val allColumns = original_df.columns
allColumns: Array[String] = Array(col1, col2, col3, col4, c1, c2, c3, c4)
scala> val columnExpr = allColumns.filterNot(requiredColumns(_)).map(col(_)) ++ requiredColumns.map(c => (col(c) + 1).alias(c))
columnExpr: Array[org.apache.spark.sql.Column] = Array(col4, c1, c2, c3, c4, (col1 + 1) AS `col1`, (col2 + 1) AS `col2`, (col3 + 1) AS `col3`)
scala> original_df.select(columnExpr:_*).show(false)
+----+---+---+---+---+----+----+----+
|col4|c1 |c2 |c3 |c4 |col1|col2|col3|
+----+---+---+---+---+----+----+----+
|4 |9 |8 |7 |6 |2 |3 |4 |
|5 |8 |7 |6 |5 |3 |4 |5 |
|6 |7 |6 |5 |4 |4 |5 |6 |
|7 |6 |5 |4 |3 |5 |6 |7 |
|8 |5 |4 |3 |2 |6 |7 |8 |
|9 |4 |3 |2 |1 |7 |8 |9 |
+----+---+---+---+---+----+----+----+
So in this case I need an expression for the two rules below:
col(i) = 2 * column(i) + 3 * c(i)
and
col(i) = 2 * column(i) + col(i)
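Following the same select/columnExpr pattern as the example above, here is one possible sketch of the first rule. The example dataframe only has colN and cN columns, so the sketch pairs those two sets by their shared index; where columnN columns also exist, adjust the prefixes accordingly (this pairing is an assumption, not something stated in the question):
import org.apache.spark.sql.functions.col

val requiredColumns = Set("col1", "col2", "col3")
val columnExpr = original_df.columns.filterNot(requiredColumns).map(col(_)) ++
  requiredColumns.toSeq.map { c =>
    val i = c.stripPrefix("col")                 // pair colN with cN via the shared index N
    (col(c) * 2 + col(s"c$i") * 3).alias(c)      // stands in for 2 * column(i) + 3 * c(i)
  }
original_df.select(columnExpr: _*).show(false)
The second rule has the same shape, e.g. (col(s"column$i") * 2 + col(c)).alias(c), assuming columnN columns exist in the real data.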

Iterate Over a Dataframe as each time column is passing to do transformation

I have a dataframe with 100 columns and column names like col1, col2, col3, .... I want to apply certain transformations to the column values when conditions match. I can store the column names in an array of strings, pass each element of the array to withColumn and, based on a when condition, transform the values of that column vertically.
But the question is: since a DataFrame is immutable, each updated version needs to be stored in a new variable, and that new dataframe needs to be passed to withColumn for the next iteration.
Is there any way to create an array of dataframes, so that each new dataframe can be stored as an element of the array and iterated over by index?
Or is there any other way to handle the same?
var arr_df : Array[DataFrame] = new Array[DataFrame](60)
--> This throws error "not found type DataFrame"
val df(0) = df1.union(df2)
for(i <- 1 to 99){
val df(i) = df(i-1).withColumn(col(i), when(col(i)> 0, col(i) +
1).otherwise(col(i)))
Here col(i) is an array of strings that stores the names of the columns of the original dataframe.
As an example:
scala> val original_df = Seq((1,2,3,4),(2,3,4,5),(3,4,5,6),(4,5,6,7),(5,6,7,8),(6,7,8,9)).toDF("col1","col2","col3","col4")
original_df: org.apache.spark.sql.DataFrame = [col1: int, col2: int ... 2 more fields]
scala> original_df.show()
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
| 1| 2| 3| 4|
| 2| 3| 4| 5|
| 3| 4| 5| 6|
| 4| 5| 6| 7|
| 5| 6| 7| 8|
| 6| 7| 8| 9|
+----+----+----+----+
I want to iterate over 3 columns: col1, col2, col3. If the value of a column is greater than 3, then it will be updated by +1.
Check below code.
scala> df.show(false)
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
|1 |2 |3 |4 |
|2 |3 |4 |5 |
|3 |4 |5 |6 |
|4 |5 |6 |7 |
|5 |6 |7 |8 |
|6 |7 |8 |9 |
+----+----+----+----+
scala> val requiredColumns = df.columns.zipWithIndex.filter(_._2 < 3).map(_._1).toSet
requiredColumns: scala.collection.immutable.Set[String] = Set(col1, col2, col3)
scala> val allColumns = df.columns
allColumns: Array[String] = Array(col1, col2, col3, col4)
scala> val columnExpr = allColumns.filterNot(requiredColumns(_)).map(col(_)) ++ requiredColumns.map(c => when(col(c) > 3, col(c) + 1).otherwise(col(c)).as(c))
scala> df.select(columnExpr:_*).show(false)
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
|1 |2 |3 |4 |
|2 |3 |5 |5 |
|3 |5 |6 |6 |
|5 |6 |7 |7 |
|6 |7 |8 |8 |
|7 |8 |9 |9 |
+----+----+----+----+
If I understand you right, you are trying to do a dataframe-wise operation; you don't need to iterate for this. I can show you how it can be done in PySpark; it can probably be translated to Scala.
from pyspark.sql import functions as F
tst= sqlContext.createDataFrame([(1,7,0),(1,8,4),(1,0,10),(5,1,90),(7,6,0),(0,3,11)],schema=['col1','col2','col3'])
expr = [F.when(F.col(coln)>3,F.col(coln)+1).otherwise(F.col(coln)).alias(coln) for coln in tst.columns if 'col3' not in coln]
tst1= tst.select(*expr)
results:
tst1.show()
+----+----+
|col1|col2|
+----+----+
| 1| 8|
| 1| 9|
| 1| 0|
| 6| 1|
| 8| 7|
| 0| 3|
+----+----+
This should give you the desired result
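For reference, a rough Scala translation of the same idea (a sketch; it assumes the original_df from the question and, to mirror the PySpark example, keeps only the transformed columns while skipping col4):
import org.apache.spark.sql.functions.{col, when}

val expr = original_df.columns
  .filterNot(_ == "col4")                                              // drop the column we leave untouched
  .map(c => when(col(c) > 3, col(c) + 1).otherwise(col(c)).alias(c))
original_df.select(expr: _*).show(false)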
You can iterate over all columns and apply the condition in a single line, as below:
original_df.select(original_df.columns.map(c => (when(col(c) > lit(3), col(c)+1).otherwise(col(c))).alias(c)):_*).show()
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
| 1| 2| 3| 5|
| 2| 3| 5| 6|
| 3| 5| 6| 7|
| 5| 6| 7| 8|
| 6| 7| 8| 9|
| 7| 8| 9| 10|
+----+----+----+----+
You can use foldLeft whenever you want to make changes to multiple columns, as below:
val original_df = Seq(
  (1, 2, 3, 4),
  (2, 3, 4, 5),
  (3, 4, 5, 6),
  (4, 5, 6, 7),
  (5, 6, 7, 8),
  (6, 7, 8, 9)
).toDF("col1", "col2", "col3", "col4")

// Filter the columns that you want to update
val columns = original_df.columns

columns.foldLeft(original_df) { (acc, colName) =>
  acc.withColumn(colName, when(col(colName) > 3, col(colName) + 1).otherwise(col(colName)))
}.show(false)
Output:
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
|1 |2 |3 |5 |
|2 |3 |5 |6 |
|3 |5 |6 |7 |
|5 |6 |7 |8 |
|6 |7 |8 |9 |
|7 |8 |9 |10 |
+----+----+----+----+
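If only col1, col2 and col3 should be updated (as the question asks), you could restrict the fold to those names, for example (a sketch using the same original_df):
val toUpdate = Set("col1", "col2", "col3")
toUpdate.foldLeft(original_df) { (acc, colName) =>
  acc.withColumn(colName, when(col(colName) > 3, col(colName) + 1).otherwise(col(colName)))
}.show(false)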