How to use a non-UDF method in Spark? - scala

I have code which looks like this:
myDF.map { x =>
  val inp = MyUtils.doSomething(x.value) // accepts an Int value and returns an Int
  MyInfo(inp)
}
Here MyUtils.doSomething is a normal function (not a UDF) in my Spark Scala code, and it works fine.
But when I do this:
val DF = myDF.withColumn("value", lit(MyUtils.doSomething(col("value").asInstanceOf[Int].toInt)))
it shows the error
class org.apache.spark.sql.Column cannot be cast to class java.lang.Integer
How can I fix this? Is there any way I could get the underlying value of col("value"), so that I could use it in my doSomething function?
Not sure why col("value").asInstanceOf[Int].toInt is not giving an Int value?

Not sure why col("value").asInstanceOf[Int].toInt is not giving an Int value?
Well, how would you want to cast a Column object to an integer? asInstanceOf just makes the compiler pretend that an object of type Column is an Integer, so instead of a compile-time error you get an exception at runtime. You should write your code in a way that you don't need asInstanceOf at all. About your first consideration: a UDF is basically a function that Spark serializes, ships to the executors, and applies to columns, so you have to do it like this:
import org.apache.spark.sql.functions._

val doSomethingUdf = udf(MyUtils.doSomething)
// if doSomething is defined as a method, i.e. "def doSomething ...",
// then it is better to eta-expand it explicitly:
// val doSomethingUdf = udf(MyUtils.doSomething _)
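A minimal sketch of how the wrapped UDF could then replace the failing withColumn call (assuming doSomething takes an Int and returns an Int):
// apply the UDF to the column; Spark evaluates it row by row on the executors
val DF = myDF.withColumn("value", doSomethingUdf(col("value")))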

Related

Writing unit test for a spark scala function that takes no arguments?

I am writing unit tests for Scala functions where I pass a mocked Spark DataFrame to the function and then use assertSmallDataFrameEquality(actualDF, expectedDF) to check whether my function transforms the data correctly.
Recently I came across a function that takes no arguments and returns a Column. Since it does not expect any argument, how should I write a test case for it? My function is given below.
def arriveDateMinusRuleDays: Column = {
  expr(s"date_sub(${Columns.ARRIVE_DATE}, ${Columns.RULE_DAYS})")
}
The test blueprint is written below:
test("arrive date minus rule days") {
  import spark.implicits._

  val today = Date.valueOf(LocalDate.now)
  val inputDF = Seq(
    (Y, today, 0, 80852),
    (S, today, 1, 18851))
    .toDF(FLAG, ARRIVE_DT, RULE_DAYS, ITEM_NBR)

  val actualOutput = DataAggJob.arriveDateMinusRuleDays() // How to pass my column values to this function?
  // val expectedOutput
  assertMethod(actualOutput, expectedOutput)
  // print(actualOutput)
}
You don't need to test each individual function. The purpose of a unit test is to assert the contract between the implementation and its downstream consumers, not implementation details.
If your job returns the expected output given the input, then it is working correctly, regardless of what that particular function is doing. The function should really just be made private to avoid confusion.
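For illustration, a minimal sketch of what such a contract-level test could look like. The DataAggJob.run entry point, the column names, and the output column are assumptions for the sketch; assertSmallDataFrameEquality is the helper already used in the question:
test("job subtracts rule days from arrive date") {
  import spark.implicits._

  val today = Date.valueOf(LocalDate.now)
  val yesterday = Date.valueOf(LocalDate.now.minusDays(1))

  // input as the job would receive it (column names are assumed for the sketch)
  val inputDF = Seq(("Y", today, 0, 80852), ("S", today, 1, 18851))
    .toDF("flag", "arrive_date", "rule_days", "item_nbr")

  // expected output: arrive_date shifted back by rule_days
  val expectedDF = Seq(
    ("Y", today, 0, 80852, today),
    ("S", today, 1, 18851, yesterday))
    .toDF("flag", "arrive_date", "rule_days", "item_nbr", "arrive_date_minus_rule_days")

  // hypothetical entry point that applies the whole transformation under test
  val actualDF = DataAggJob.run(inputDF)

  assertSmallDataFrameEquality(actualDF, expectedDF)
}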

How to map a dataframe of datetime strings in spark to a dataframe of booleans?

I basically want to check whether every value in my dataframe of dates is in the correct format, "MM/dd/yy".
val df: DataFrame = spark.read.csv("----")
However, whenever I apply the map function:
df.map(x => right_format(x)).show()
and try to show this new dataframe/dataset, I get a non-serializable error.
Does anyone know why?
I've tried to debug by using the intellij debugger, but to no avail.
Expected results: dataframe of boolean values
Actual results: Nonserializable error.
Does the non-serializable error say something like SparkContext is not serializable?
map runs in a distributed manner, and Spark will attempt to serialize the right_format function definition and send it to all the nodes.
It looks like right_format is defined in the same scope as objects such as your SparkContext instance (for example, is all of this inside your main() method?).
To get around this, I think you could do one of two things:
Define right_format() within the map block:
df.map(x => {
  // x is a Row here; right_format is local to the closure, so nothing
  // from the enclosing driver scope needs to be serialized
  def right_format(elem: Row): Boolean = { ... }
  right_format(x)
}).show()
Define a standalone object (or a trait) of helper functions that includes the definition of right_format.
Spark will serialize the closure referencing this object and send it to all the nodes. This should solve the issue you're facing.
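A minimal sketch of the second option, under a couple of assumptions: the helper object name is hypothetical, and the date strings are taken from the first column of the DataFrame:
import java.text.SimpleDateFormat
import org.apache.spark.sql.Row

// standalone helper object: referencing it from the closure below does not
// drag the SparkContext or anything else from the driver into the closure
object DateFormatHelpers {
  def rightFormat(value: String): Boolean = {
    val parser = new SimpleDateFormat("MM/dd/yy")
    parser.setLenient(false)
    try { parser.parse(value); true }
    catch { case _: Exception => false }
  }
}

// usage sketch: map each row's first column to a Boolean
// (requires `import spark.implicits._` in scope for the Boolean encoder)
val boolDF = df.map((row: Row) => DateFormatHelpers.rightFormat(row.getString(0)))
boolDF.show()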

How to find the mean of values in an array in Scala - Apache Spark

I have an array of values as shown below:
scala> number.take(5)
res1: Array[Any] = Array(908.76, 901.74, 83.71, 39.36, 234.64)
I need to find the mean value of the array using an RDD method.
I have tried using the number.mean() method, but it keeps giving me the following error:
error: could not find implicit value for parameter num: Numeric[Any]
I am new to Spark, please provide some suggestions. Thank you.
That's not Spark related. The compiler gives you a hint - there is no .mean() method for Array[Any], because it requires the elements of the Array to be Numeric.
It would work if it were an Array of Doubles or Ints.
number.take(5) returned Array[Any] because somewhere above it you provided no guarantee that the Array would contain only Numeric elements.
If you can't provide that guarantee, then you have to map over the array and explicitly convert all the values to Double or another Numeric type of your choice.
import scala.util.Try

implicit class AnyExtended(value: Any) {
  def toDoubleO: Option[Double] = Try(value.toString.toDouble).toOption
}

val array: Array[Double] = number.take(5).flatMap(_.toDoubleO)
val mean: Double = array.sum / array.length
Note that instead of using plain .toDouble I've written an implicit extension, because .toDouble can fail and throw an exception. Instead, we wrap the conversion in Try and turn it into an Option - in case of an exception we get None, and that value is skipped from the computation of the mean thanks to flatMap.
If you are happy to convert to a DataFrame, then Spark will do this for you with minimal effort.
val number = List(908.76, 901.74, 83.71, 39.36, 234.64)
val numberRDD = sc.parallelize(number)
numberRDD.toDF("x").agg(avg(col("x"))).show
This will produce the answer 433.642.
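Alternatively, staying with the RDD API as the question asks: assuming number is an RDD[Any], once the values are converted to Double the built-in mean() from DoubleRDDFunctions works directly. A sketch under that assumption:
import scala.util.Try

// assuming `number` is an RDD[Any]: parse each value to Double, dropping anything non-numeric
val doubles = number.flatMap(v => Try(v.toString.toDouble).toOption)

// mean() is available on RDD[Double] via DoubleRDDFunctions
val mean: Double = doubles.mean()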

How to construct and persist a reference object per worker in a Spark 2.3.0 UDF?

In a Spark 2.3.0 Structured Streaming job I need to append a column to a DataFrame whose value is derived from an existing column in the same row.
I want to define this transformation in a UDF and use withColumn to build the new DataFrame.
Doing this transform requires consulting a very-expensive-to-construct reference object -- constructing it once per record yields unacceptable performance.
What is the best way to construct and persist this object once per worker node so it can be referenced repeatedly for every record in every batch? Note that the object is not serializable.
My current attempts have revolved around subclassing UserDefinedFunction to add the expensive object as a lazy member, and providing an alternate constructor to this subclass that does the initialization normally performed by the udf function. So far I've been unable to get it to do the kind of type coercion that udf does - some deep type inference wants objects of type org.apache.spark.sql.Column when my transformation lambda works on a String for input and output.
Something like this:
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.types._
import org.apache.spark.sql.types.DataType
class ExpensiveReference {
  def ExpensiveReference() = ... // Very slow
  def transformString(in: String) = ... // Fast
}

class PersistentValUDF(f: AnyRef, dataType: DataType, inputTypes: Option[Seq[DataType]])
  extends UserDefinedFunction(f: AnyRef, dataType: DataType, inputTypes: Option[Seq[DataType]]) {

  lazy val ExpensiveReference = new ExpensiveReference()

  def PersistentValUDF() {
    this(((in: String) => ExpensiveReference.transformString(in)): (String => String), StringType, Some(List(StringType)))
  }
}
The further I dig into this rabbit hole the more I suspect there's a better way to accomplish this that I'm overlooking. Hence this post.
Edit:
I tested initializing a reference lazily in an object declared inside the UDF; this triggers reinitialization. Example code and object:
class IntBox {
  var valu = 0
  def increment: Unit = {
    valu = valu + 1
  }
  def get: Int = valu
}

val altUDF = udf((input: String) => {
  object ExpensiveRef {
    lazy val box = new IntBox
    def transform(in: String): String = {
      box.increment
      in + box.get.toString
    }
  }
  ExpensiveRef.transform(input)
})
The above UDF always appends 1; so the lazy object is being reinitialized per-record.
I found this post whose Option 1 I was able to turn into a workable solution. The end result ended up being similar to Jacek Laskowski's answer, but with a few tweaks:
Pull the object definition outside of the UDF's scope. Even though the member is lazy, it will still be reinitialized if the object is defined inside the UDF.
Move the transform function off of the object and into the UDF's lambda (required to avoid serialization errors)
Capture the object's lazy member in the closure of the UDF lambda
Something like this:
object ExpensiveReference {
  lazy val ref = ...
}

val persistentUDF = udf((input: String) => {
  /* transform code that references ExpensiveReference.ref */
})
DISCLAIMER Let me have a go at this, but please consider it a work in progress (downvotes are a big no-no :))
What I'd do would be to use a Scala object with a lazy val for the expensive reference.
object ExpensiveReference {
  lazy val ref = ???
  def transform(in: String) = {
    // use ref here
  }
}
With the object, whatever you do on a Spark executor (be it part of a UDF or any other computation) will instantiate ExpensiveReference.ref at the very first access. You could access it directly or as part of transform.
Again, it does not really matter whether you do this in a UDF, a UDAF, or any other transformation. The point is that once a computation happens on a Spark executor, the "very-expensive-to-construct reference object" is constructed only once per executor, rather than once per record.
It could be in a UDF (just to make it clearer).
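For illustration, a minimal end-to-end sketch combining the lazy object with a withColumn call. The ExpensiveModel class, the DataFrame df, and the "value" column are hypothetical stand-ins for the question's setup:
import org.apache.spark.sql.functions.{col, udf}

// hypothetical stand-in for the expensive, non-serializable reference
class ExpensiveModel {
  def transformString(in: String): String = in.reverse
}

object ExpensiveReference {
  // constructed at most once per executor JVM, on first access
  lazy val ref: ExpensiveModel = new ExpensiveModel
}

// the lambda only references the object by name, so the non-serializable
// instance itself is never shipped from the driver to the executors
val persistentUDF = udf((input: String) => ExpensiveReference.ref.transformString(input))

val withDerived = df.withColumn("derived", persistentUDF(col("value")))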

Scala UDF with multiple parameters used in Pyspark

I have a UDF written in Scala that I'd like to be able to call through a Pyspark session. The UDF takes two parameters: a string column value and a second string parameter. I've been able to successfully call the UDF if it takes only a single parameter (the column value), but I'm struggling to call it when multiple parameters are required. Here's what I've been able to do so far in Scala and then through Pyspark:
Scala UDF:
class SparkUDFTest() extends Serializable {
  def stringLength(columnValue: String, columnName: String): Int = {
    LOG.info("Column name is: " + columnName)
    columnValue.length
  }
}
When using this in Scala, I've been able to register and use this UDF:
Scala main class:
val udfInstance = new SparkUDFTest()
val stringLength = spark.sqlContext.udf.register("stringlength", udfInstance.stringLength _)
val newDF = df.withColumn("name", stringLength(col("email"), lit("email")))
The above works successfully. Here's the attempt through Pyspark:
def testStringLength(colValue, colName):
    testpackage = "com.test.example.udf.SparkUDFTest"
    udfInstance = sc._jvm.java.lang.Thread.currentThread().getContextClassLoader().loadClass(testpackage).newInstance().stringLength().apply
    return Column(udfInstance(_to_seq(sc, [colValue], _to_java_column), colName))
Call the UDF in Pyspark:
df.withColumn("email", testStringLength("email", lit("email")))
Doing the above and making some adjustments in Pyspark gives me the following errors:
py4j.Py4JException: Method getStringLength([]) does not exist
or
java.lang.ClassCastException: com.test.example.udf.SparkUDFTest$$anonfun$stringLength$1 cannot be cast to scala.Function1
or
TypeError: 'Column' object is not callable
I was able to modify the UDF to take just a single parameter (the column value), call it successfully, and get back a new DataFrame.
Scala UDF Class
class SparkUDFTest() extends Serializable {
  def testStringLength(): UserDefinedFunction = udf(stringLength _)

  def stringLength(columnValue: String): Int = columnValue.length
}
Updated Python code:
def testStringLength(colValue, colName):
    testpackage = "com.test.example.udf.SparkUDFTest"
    udfInstance = sc._jvm.java.lang.Thread.currentThread().getContextClassLoader().loadClass(testpackage).newInstance().testStringLength().apply
    return Column(udfInstance(_to_seq(sc, [colValue], _to_java_column)))
The above works successfully. I'm still struggling to call the UDF when it takes an extra parameter. How can the second parameter be passed to the UDF through Pyspark?
I was able to resolve this by using currying. First I registered the UDF as:
def testStringLength(columnName: String): UserDefinedFunction =
  udf((colValue: String) => stringLength(colValue, columnName))
Then called the UDF from Pyspark:
udfInstance = sc._jvm.java.lang.Thread.currentThread().getContextClassLoader().loadClass(testpackage).newInstance().testStringLength("email").apply
df.withColumn("email", Column(udfInstance(_to_seq(sc, [col("email")], _to_java_column))))
This can be cleaned up a bit more, but it's how I got it to work.
Edit: The reason I went with currying is that even when I used lit on the second argument I wanted to pass to the UDF as a String, I kept experiencing the "TypeError: 'Column' object is not callable" error. In Scala I did not experience this issue. I am not sure why this happens in Pyspark; it could be due to some complication between the Python interpreter and the Scala code. It's still unclear, but currying works for me.
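Putting the curried approach together, a minimal sketch of what the Scala side could look like (the class name is carried over from the snippets above; treat this as an assumption, not the exact production code):
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.udf

class SparkUDFTest() extends Serializable {

  // original two-parameter function (columnName was only used for logging
  // in the question's snippet, which is omitted here)
  def stringLength(columnValue: String, columnName: String): Int =
    columnValue.length

  // curried factory: the constant second argument is bound on the Scala side,
  // so Pyspark only has to supply the single column argument
  def testStringLength(columnName: String): UserDefinedFunction =
    udf((colValue: String) => stringLength(colValue, columnName))
}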