Mock Spark column functions in Scala

My code uses the monotonically_increasing_id function in Scala:
val df = List(("oleg"), ("maxim")).toDF("first_name")
.withColumn("row_id", monotonically_increasing_id)
I want to mock it in my unit test so that it returns integers 0, 1, 2, 3, ...
In my spark-shell it returns the desired result.
scala> df.show
+----------+------+
|first_name|row_id|
+----------+------+
|      oleg|     0|
|     maxim|     1|
+----------+------+
But in my scala applications the results are different.
How can I mock column functions?

Mocking such a function so that it produces a sequence is not simple. Spark is a parallel computing engine, so accessing the data in sequence is complicated.
Here is a solution you could try.
Let's define a function that zips a dataframe:
def zip(df : DataFrame, name : String) = {
df.withColumn(name, monotonically_increasing_id)
}
Then let's rewrite the function we want to test using this zip function by default:
def fun(df : DataFrame,
zipFun : (DataFrame, String) => DataFrame = zip) : DataFrame = {
zipFun(df, "id_row")
}
// let's see what it does
fun(spark.range(5).toDF).show()
+---+----------+
| id|    id_row|
+---+----------+
|  0|         0|
|  1|         1|
|  2|8589934592|
|  3|8589934593|
|  4|8589934594|
+---+----------+
It's the same as before. Let's write a new function that uses zipWithIndex from the RDD API. It's a bit tedious because we have to go back and forth between the two APIs.
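import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.types.{LongType, StructField}
def zip2(df : DataFrame, name : String) = {
  // zipWithIndex numbers the rows 0, 1, 2, ... across all partitions
  val rdd = df.rdd.zipWithIndex
    .map{ case (row, i) => Row.fromSeq(row.toSeq :+ i) }
  val newSchema = df.schema.add(StructField(name, LongType, false))
  df.sparkSession.createDataFrame(rdd, newSchema)
}
fun(spark.range(5).toDF, zip2).show()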
+---+------+
| id|id_row|
+---+------+
|  0|     0|
|  1|     1|
|  2|     2|
|  3|     3|
|  4|     4|
+---+------+
You can adapt zip2, for instance multiplying i by 2, to get what you want.
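For context on why the first output jumps from 1 to 8589934592: the Spark documentation for monotonically_increasing_id describes the generated value as the partition ID in the upper 31 bits and the record number within each partition in the lower 33 bits. The plain Scala sketch below (no Spark needed; the object and method names are made up for illustration) reproduces that arithmetic, assuming rows 0-1 were in partition 0 and rows 2-4 in partition 1.

```scala
// Hypothetical helper that mirrors the documented bit layout of
// monotonically_increasing_id; it is not part of the Spark API.
object MonotonicIdLayout {
  // Partition index goes in the upper bits, record number in the lower 33 bits.
  def id(partitionIndex: Int, recordNumber: Long): Long =
    (partitionIndex.toLong << 33) + recordNumber

  def main(args: Array[String]): Unit = {
    // Assumed layout of the 5 rows: 2 rows in partition 0, 3 rows in partition 1.
    val ids = Seq(id(0, 0), id(0, 1), id(1, 0), id(1, 1), id(1, 2))
    ids.foreach(println)
  }
}
```

This is why the ids are unique and increasing but not consecutive as soon as the data spans more than one partition, and why a test that expects 0, 1, 2, ... only passes on single-partition input.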

Based on the answer from @Oli, I came up with the following workaround:
val df = List(("oleg"), ("maxim")).toDF("first_name")
.withColumn("row_id", monotonically_increasing_id)
.withColumn("test_id", row_number().over(Window.orderBy("row_id")))
It solves my problem but I'm still interested in mocking column functions.

I mock my Spark functions with this code:
val s = typedLit[Timestamp](Timestamp.valueOf("2021-05-07 15:00:46.394"))
implicit val ds = DefaultAnswer(CALLS_REAL_METHODS)
withObjectMocked[functions.type] {
when(functions.current_timestamp()).thenReturn(s)
// spark logic
}

Related

How to create a map column to count occurrences without udaf

I would like to create a Map column which counts the number of occurrences.
For instance:
+---+----+
|  b|   a|
+---+----+
|  1|   b|
|  2|null|
|  1|   a|
|  1|   a|
+---+----+
would result in
+---+--------------------+
|  b|                 res|
+---+--------------------+
|  1|[a -> 2.0, b -> 1.0]|
|  2|                  []|
+---+--------------------+
For the moment, in Spark 2.4.6, I was able to make it using udaf.
While bumping to Spark3 I was wondering if I could get rid of this udaf (I tried using the new method aggregate without success)
Is there an efficient way to do it?
(For the efficiency part, I am able to test easily)
Here is a Spark 3 solution:
import org.apache.spark.sql.functions._
df.groupBy($"b",$"a").count()
.groupBy($"b")
.agg(
map_from_entries(
collect_list(
when($"a".isNotNull,struct($"a",$"count"))
)
).as("res")
)
.show()
gives:
+---+----------------+
|  b|             res|
+---+----------------+
|  1|[b -> 1, a -> 2]|
|  2|              []|
+---+----------------+
Here is the solution using Aggregator:
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.sql.functions._
import org.apache.spark.sql.Encoder
val countOcc = new Aggregator[String, Map[String,Int], Map[String,Int]] with Serializable {
def zero: Map[String,Int] = Map.empty.withDefaultValue(0)
def reduce(b: Map[String,Int], a: String) = if(a!=null) b + (a -> (b(a) + 1)) else b
def merge(b1: Map[String,Int], b2: Map[String,Int]) = {
val keys = b1.keys.toSet.union(b2.keys.toSet)
keys.map{ k => (k -> (b1(k) + b2(k))) }.toMap
}
def finish(b: Map[String,Int]) = b
def bufferEncoder: Encoder[Map[String,Int]] = implicitly(ExpressionEncoder[Map[String,Int]])
def outputEncoder: Encoder[Map[String, Int]] = implicitly(ExpressionEncoder[Map[String, Int]])
}
val countOccUDAF = udaf(countOcc)
df
.groupBy($"b")
.agg(countOccUDAF($"a").as("res"))
.show()
gives:
+---+----------------+
|  b|             res|
+---+----------------+
|  1|[b -> 1, a -> 2]|
|  2|              []|
+---+----------------+
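One thing to watch in the Aggregator above: reduce and merge look up keys with b(a) and b1(k), which only works while the map still carries withDefaultValue(0). I would assume a buffer that has been round-tripped through an encoder comes back as a plain Map without that default, so a more defensive variant uses getOrElse. The following plain Scala sketch (hypothetical object name, no Spark dependency) shows the same reduce/merge logic written that way.

```scala
// Plain Scala sketch of the counting logic; CountOccLogic is a made-up name
// and this is not the Aggregator itself, only the functions it would call.
object CountOccLogic {
  type Counts = Map[String, Int]

  val zero: Counts = Map.empty

  // Add one occurrence of `a`, ignoring nulls as in the original reduce.
  def reduce(b: Counts, a: String): Counts =
    if (a != null) b + (a -> (b.getOrElse(a, 0) + 1)) else b

  // Sum the counts key by key; a key missing on one side counts as 0.
  def merge(b1: Counts, b2: Counts): Counts =
    (b1.keySet ++ b2.keySet)
      .map(k => k -> (b1.getOrElse(k, 0) + b2.getOrElse(k, 0)))
      .toMap

  def main(args: Array[String]): Unit = {
    val left = Seq[String]("b", null, "a").foldLeft(zero)(reduce)
    val right = Seq[String]("a").foldLeft(zero)(reduce)
    println(merge(left, right))
  }
}
```

Keeping the logic in ordinary functions like these also makes it possible to unit test the aggregation without starting a SparkSession.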
You could always use collect_list with a UDF, but only if your groupings are not too large:
val udf_histo = udf((x:Seq[String]) => x.groupBy(identity).mapValues(_.size))
df.groupBy($"b")
.agg(
collect_list($"a").as("as")
)
.select($"b",udf_histo($"as").as("res"))
.show()
gives:
+---+----------------+
|  b|             res|
+---+----------------+
|  1|[b -> 1, a -> 2]|
|  2|              []|
+---+----------------+
This should be faster than a UDAF; see "Spark custom aggregation: collect_list+UDF vs UDAF".
We can achieve this in Spark 2.4:
//GET THE COUNTS
val groupedCountDf = originalDf.groupBy("b","a").count
//CREATE MAPS FOR EVERY COUNT | EMPTY MAP FOR NULL KEY
//AGGREGATE THEM AS ARRAY
val dfWithArrayOfMaps = groupedCountDf
.withColumn("newMap", when($"a".isNotNull, map($"a",$"count")).otherwise(map()))
.groupBy("b").agg(collect_list($"newMap") as "multimap")
//EXPRESSION TO CONVERT ARRAY[MAP] -> MAP
val mapConcatExpr = expr("aggregate(multimap, map(), (k, v) -> map_concat(k, v))")
val finalDf = dfWithArrayOfMaps.select($"b", mapConcatExpr.as("merged_data"))
Here is a solution with a single groupBy and a slightly complex SQL expression. This solution works for Spark 2.4+:
df.groupBy("b")
.agg(expr("sort_array(collect_set(a)) as set"),
expr("sort_array(collect_list(a)) as list"))
.withColumn("res",
expr("map_from_arrays(set,transform(set, x -> size(filter(list, y -> y=x))))"))
.show()
Output:
+---+------+---------+----------------+
|  b|   set|     list|             res|
+---+------+---------+----------------+
|  1|[a, b]|[a, a, b]|[a -> 2, b -> 1]|
|  2|    []|       []|              []|
+---+------+---------+----------------+
The idea is to collect the data from column a twice: once into a set and once into a list. Then, with the help of transform, the number of occurrences of each element of the set is counted in the list. Finally, the set and the counts are combined with map_from_arrays.
However I cannot say if this approach is really faster than a UDAF.
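The set-and-list idea can be checked on local collections first. This is a plain Scala sketch, not Spark code: distinct.sorted stands in for sort_array(collect_set(a)), count stands in for size(filter(...)), and zip followed by toMap stands in for map_from_arrays. The object and method names are made up.

```scala
// Local-collection analogue of the SQL expression; names are illustrative.
object SetListCount {
  def occurrences(values: Seq[String]): Map[String, Int] = {
    // collect_list and collect_set both skip nulls, so drop them here too.
    val list = values.filter(_ != null)
    val set = list.distinct.sorted
    // For each distinct element, count how often it appears in the list.
    val counts = set.map(x => list.count(_ == x))
    set.zip(counts).toMap
  }

  def main(args: Array[String]): Unit = {
    println(occurrences(Seq[String]("b", null, "a", "a")))
    println(occurrences(Seq[String](null)))
  }
}
```

Note that the counting step scans the whole list once per distinct element, which is one reason this approach may not beat a UDAF on groups with many distinct values.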

create a simple DF after reading a parquet file

I am new to Scala and I am having trouble writing some simple Spark code. I have this DF that I get after reading a parquet file:
ID Timestamp
1 0
1 10
1 11
2 20
3 15
And what I want is to create a DF result from the first DF (if the ID = 2 for example, the timestamp should be multiplied by two). So, I created a new class :
case class OutputData(id: bigint, timestamp:bigint)
And here is my code :
val tmp = spark.read.parquet("/user/test.parquet").select("id", "timestamp")
val outputData:OutputData = tmp.map(x:Row => {
var time_result
if (x.getString("id") == 2) {
time_result = x.getInt(2)* 2
}
if (x.getString("id") == 1) {
time_result = x.getInt(2) + 10
}
OutputData2(x.id, time_result)
})
case class OutputData2(id: bigint, timestamp:bigint)
Can you help me please ?
To make the implementation easier, you can convert your DataFrame to a typed Dataset using a case class and then process it with object notation, instead of accessing the Row each time you want the value of some element. Apart from that, since your input and output have the same format, you can use the same case class instead of defining two.
Code looks like:
// Sample input data
val df = Seq(
(1, 0L),
(1, 10L),
(1, 11L),
(2, 20L),
(3, 15L)
).toDF("ID", "Timestamp")
df.show()
// Case class as helper
case class OutputData(ID: Integer, Timestamp: Long)
val newDF = df.as[OutputData].map(record=>{
val newTime = if(record.ID == 2) record.Timestamp*2 else record.Timestamp // identify your id and apply logic based on that
OutputData(record.ID, newTime)// return same format with updated values
})
newDF.show()
The output of above code:
// original
+---+---------+
| ID|Timestamp|
+---+---------+
|  1|        0|
|  1|       10|
|  1|       11|
|  2|       20|
|  3|       15|
+---+---------+
// new one
+---+---------+
| ID|Timestamp|
+---+---------+
|  1|        0|
|  1|       10|
|  1|       11|
|  2|       40|
|  3|       15|
+---+---------+
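The answer above only handles ID 2, while the question's attempt also adds 10 when the ID is 1. Assuming that is the intended rule, the per-record logic can be pulled into a small pure function that is easy to test without a SparkSession and then called from inside the map over the Dataset. The names below are illustrative, and the case class uses Int rather than Integer for simplicity.

```scala
// Sketch of the per-record rule as a pure function; TimestampRules and
// adjust are made-up names, and the ID 1 rule is inferred from the question.
object TimestampRules {
  case class OutputData(ID: Int, Timestamp: Long)

  def adjust(record: OutputData): OutputData = record.ID match {
    case 2 => record.copy(Timestamp = record.Timestamp * 2)  // double for ID 2
    case 1 => record.copy(Timestamp = record.Timestamp + 10) // shift for ID 1
    case _ => record                                         // leave others as-is
  }

  def main(args: Array[String]): Unit = {
    val input = Seq(OutputData(1, 0L), OutputData(2, 20L), OutputData(3, 15L))
    input.map(adjust).foreach(println)
  }
}
```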

How to add a column collection based on the maximum and minimum values in a dataframe

I've got this DataFrame
val for_df = Seq((5,7,"5k-7k"),(4,8,"4k-8k"),(6,12,"6k-2k")).toDF("min","max","salary")
I want to convert 5k-7k to 5,6,7 and 4k-8k to 4,5,6,7,8.
Original DataFrame:
Desired DataFrame
a.select("min","max","salary")
.as[(Integer,Integer,String)]
.map{
case(min,max,salary) =>
(min,max,salary.split("-").flatMap(x => {
for(i <- 0 to x.length-1) yield (i)
}))
}.toDF("1","2","3").show()
You need to create a UDF to expand the limits. The following UDF will convert 5k-7k to 5,6,7 and 4k-8k to 4,5,6,7,8, and so on:
import org.apache.spark.sql.functions._
val inputDF = sc.parallelize(List((5,7,"5k-7k"),(4,8,"4k-8k"),(6,12,"6k-12k"))).toDF("min","max","salary")
val extendUDF = udf((str: String) => {
val nums = str.replace("k","").split("-").map(_.toInt)
(nums(0) to nums(1)).toList.mkString(",")
})
val output = inputDF.withColumn("salary_level", extendUDF($"salary"))
Output:
scala> output.show
+---+---+------+----------------+
|min|max|salary|    salary_level|
+---+---+------+----------------+
|  5|  7| 5k-7k|           5,6,7|
|  4|  8| 4k-8k|       4,5,6,7,8|
|  6| 12|6k-12k|6,7,8,9,10,11,12|
+---+---+------+----------------+
You can easily do this with a udf.
// The following defines a UDF in Spark which creates a list as per your requirement.
val makeRangeLists = udf( (min: Int, max: Int) => List.range(min, max+1) )
val input = sc.parallelize(List((5,7,"5k-7k"),
(4,8,"4k-8k"),(6,12,"6k-12k"))).toDF("min","max","salary")
// Create a new column using the UDF and pass the max and min columns.
input.withColumn("salary_level", makeRangeLists($"min", $"max")).show
Here is one quick option with a UDF:
import org.apache.spark.sql.functions
val toSalary = functions.udf((value: String) => {
val array = value.filterNot(_ == 'k').split("-").map(_.trim.toInt).sorted
val (startSalary, endSalary) = (array.headOption, array.tail.headOption)
(startSalary, endSalary) match {
case (Some(s), Some(e)) => (s to e).toList.mkString(",")
case _ => ""
}
})
for_df.withColumn("salary_level", toSalary($"salary")).drop("salary")
Input
+---+---+------+
|min|max|salary|
+---+---+------+
|  5|  7| 5k-7k|
|  4|  8| 4k-8k|
|  6| 12| 6k-2k|
+---+---+------+
Result
+---+---+------------+
|min|max|salary_level|
+---+---+------------+
|  5|  7|       5,6,7|
|  4|  8|   4,5,6,7,8|
|  6| 12|   2,3,4,5,6|
+---+---+------------+
First you remove the k and split your string by the dash. Then you get the start and end salary and build a range between them.
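The string handling inside these UDFs can be tried as an ordinary Scala function before wrapping it in udf(...). A minimal sketch, assuming the input always looks like "<low>k-<high>k"; the object and method names are made up, and malformed input returns an empty string as in the last answer. Unlike that answer, this version does not sort the two bounds, so a reversed range such as 6k-2k yields an empty string.

```scala
import scala.util.Try

// Illustrative helper; SalaryRange and expand are not from any library.
object SalaryRange {
  def expand(salary: String): String = {
    // "5k-7k" -> Array("5", "7")
    val parts = salary.replace("k", "").split("-").map(_.trim)
    // Parse both bounds; any non-numeric part makes the whole parse fail.
    Try(parts.map(_.toInt)).toOption match {
      case Some(Array(low, high)) => (low to high).mkString(",")
      case _                      => ""
    }
  }

  def main(args: Array[String]): Unit = {
    Seq("5k-7k", "4k-8k", "6k-12k").foreach(s => println(expand(s)))
  }
}
```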

Writing Spark UDAFs in Scala to return Array type as output

I have a dataframe as below -
val myDF = Seq(
(1,"A",100),
(1,"E",300),
(1,"B",200),
(2,"A",200),
(2,"C",300),
(2,"D",100)
).toDF("id","channel","time")
myDF.show()
+---+-------+----+
| id|channel|time|
+---+-------+----+
|  1|      A| 100|
|  1|      E| 300|
|  1|      B| 200|
|  2|      A| 200|
|  2|      C| 300|
|  2|      D| 100|
+---+-------+----+
For each id, I want the channels sorted by time in ascending order. I want to implement a UDAF for this logic.
I would like to call this UDAF as -
scala > spark.sql("""select customerid , myUDAF(customerid,channel,time) group by customerid """).show()
Output dataframe should look like -
+---+-------+
| id|channel|
+---+-------+
|  1|[A,B,E]|
|  2|[D,A,C]|
+---+-------+
I am trying to write a UDAF but am unable to implement it -
import org.apache.spark.sql.expressions.MutableAggregationBuffer
import org.apache.spark.sql.expressions.UserDefinedAggregateFunction
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
class myUDAF extends UserDefinedAggregateFunction {
// This is the input fields for your aggregate function
override def inputSchema : org.apache.spark.sql.types.Structype =
Structype(
StructField("id" , IntegerType)
StructField("channel", StringType)
StructField("time", IntegerType) :: Nil
)
// This is the internal fields we would keep for computing the aggregate
// output
override def bufferSchema : Structype =
Structype(
StructField("Sequence", ArrayType(StringType)) :: Nil
)
// This is the output type of my aggregate function
override def dataType : DataType = ArrayType(StringType)
// no comments here
override def deterministic : Booelan = true
// initialize
override def initialize(buffer: MutableAggregationBuffer) : Unit = {
buffer(0) = Seq("")
}
}
Please help.
This will do it (no need to define your own UDF):
df.groupBy("id")
.agg(sort_array(collect_list( // NOTE: sort based on the first element of the struct
struct("time", "channel"))).as("stuff"))
.select("id", "stuff.channel")
.show(false)
+---+---------+
|id |channel  |
+---+---------+
|1  |[A, B, E]|
|2  |[D, A, C]|
+---+---------+
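The trick relies on sort_array comparing structs field by field, so putting time first makes it the sort key. The same idea on local Scala tuples, as a sketch (plain collections, not Spark; the object and method names are illustrative):

```scala
// Local analogue of sort_array(collect_list(struct(time, channel))).
object SortChannels {
  // rows are (id, channel, time) triples, as in myDF.
  def channelsByTime(rows: Seq[(Int, String, Int)]): Map[Int, Seq[String]] =
    rows.groupBy(_._1).map { case (id, group) =>
      // Build (time, channel) pairs so the default tuple ordering sorts by time first.
      val sorted = group.map { case (_, channel, time) => (time, channel) }.sorted
      id -> sorted.map(_._2)
    }

  def main(args: Array[String]): Unit = {
    val rows = Seq((1, "A", 100), (1, "E", 300), (1, "B", 200),
                   (2, "A", 200), (2, "C", 300), (2, "D", 100))
    channelsByTime(rows).toSeq.sortBy(_._1).foreach(println)
  }
}
```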
I would not write a UDAF for that. In my experience UDAFs are rather slow, especially with complex types. I would use the collect_list and UDF approach:
val sortByTime = udf((rws:Seq[Row]) => rws.sortBy(_.getInt(0)).map(_.getString(1)))
myDF
.groupBy($"id")
.agg(collect_list(struct($"time",$"channel")).as("channel"))
.withColumn("channel", sortByTime($"channel"))
.show()
+---+---------+
| id|  channel|
+---+---------+
|  1|[A, B, E]|
|  2|[D, A, C]|
+---+---------+
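A much simpler way without a UDF. Note that this relies on collect_list preserving the order set by orderBy, which Spark does not guarantee after the shuffle introduced by groupBy, so verify the result on your data.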
import org.apache.spark.sql.functions._
myDF.orderBy($"time".asc).groupBy($"id").agg(collect_list($"channel") as "channel").show()

Spark: Add column to dataframe conditionally

I am trying to take my input data:
A    B      C
--------------
4    blah   2
2           3
56   foo    3
And add a column to the end based on whether B is empty or not:
A    B      C    D
--------------------
4    blah   2    1
2           3    0
56   foo    3    1
I can do this easily by registering the input dataframe as a temp table, then typing up a SQL query.
But I'd really like to know how to do this with just Scala methods and not having to type out a SQL query within Scala.
I've tried .withColumn, but I can't get that to do what I want.
Try withColumn with the function when as follows:
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._ // for `toDF` and $""
import org.apache.spark.sql.functions._ // for `when`
val df = sc.parallelize(Seq((4, "blah", 2), (2, "", 3), (56, "foo", 3), (100, null, 5)))
.toDF("A", "B", "C")
val newDf = df.withColumn("D", when($"B".isNull or $"B" === "", 0).otherwise(1))
newDf.show() shows
+---+----+---+---+
|  A|   B|  C|  D|
+---+----+---+---+
|  4|blah|  2|  1|
|  2|    |  3|  0|
| 56| foo|  3|  1|
|100|null|  5|  0|
+---+----+---+---+
I added the (100, null, 5) row for testing the isNull case.
I tried this code with Spark 1.6.0 but as commented in the code of when, it works on the versions after 1.4.0.
My bad, I had missed one part of the question.
Best, cleanest way is to use a UDF.
Explanation within the code.
// create some example data...BY DataFrame
// note, third record has an empty string
case class Stuff(a:String,b:Int)
val d= sc.parallelize(Seq( ("a",1),("b",2),
("",3) ,("d",4)).map { x => Stuff(x._1,x._2) }).toDF
// now the good stuff.
import org.apache.spark.sql.functions.udf
// function that returns 0 if the string is empty, 1 otherwise
val func = udf( (s:String) => if(s.isEmpty) 0 else 1 )
// create new dataframe with added column named "notempty"
val r = d.select( $"a", $"b", func($"a").as("notempty") )
scala> r.show
+---+---+--------+
|  a|  b|notempty|
+---+---+--------+
|  a|  1|       1|
|  b|  2|       1|
|   |  3|       0|
|  d|  4|       1|
+---+---+--------+
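Note that s.isEmpty throws a NullPointerException if the column contains null, which the when-based answer handles explicitly with isNull. A null-safe version of the function body might look like this (plain Scala sketch; the names are made up, and treating null like an empty string is an assumption based on the first answer's output):

```scala
// Illustrative null-safe predicate that could be wrapped with udf(...).
object NotEmptyFlag {
  // Returns 0 for null or empty strings, 1 otherwise.
  def flag(s: String): Int = if (s == null || s.isEmpty) 0 else 1

  def main(args: Array[String]): Unit = {
    Seq[String]("a", "", null).foreach(s => println(flag(s)))
  }
}
```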
How about something like this?
val newDF = df.filter($"B" === "").take(1) match {
case Array() => df
case _ => df.withColumn("D", $"B" === "")
}
Using take(1) should have a minimal performance hit.