I have a UDF written in Scala that I'd like to be able to call through a Pyspark session. The UDF takes two parameters: a string column value and a second string parameter. I've been able to successfully call the UDF when it takes only a single parameter (the column value), but I'm struggling to call it when multiple parameters are required. Here's what I've been able to do so far in Scala and then through Pyspark:
Scala UDF:
class SparkUDFTest() extends Serializable {
  def stringLength(columnValue: String, columnName: String): Int = {
    LOG.info("Column name is: " + columnName)
    columnValue.length
  }
}
When using this in Scala, I've been able to register and use this UDF:
Scala main class:
val udfInstance = new SparkUDFTest()
val stringLength = spark.sqlContext.udf.register("stringlength", udfInstance.stringLength _)
val newDF = df.withColumn("name", stringLength(col("email"), lit("email")))
The above works successfully. Here's the attempt through Pyspark:
def testStringLength(colValue, colName):
    testpackage = "com.test.example.udf.SparkUDFTest"
    udfInstance = sc._jvm.java.lang.Thread.currentThread().getContextClassLoader().loadClass(testpackage).newInstance().stringLength().apply
    return Column(udfInstance(_to_seq(sc, [colValue], _to_java_column), colName))
Call the UDF in Pyspark:
df.withColumn("email", testStringLength("email", lit("email")))
Doing the above and making some adjustments in Pyspark gives me one of the following errors:
py4j.Py4JException: Method getStringLength([]) does not exist
or
java.lang.ClassCastException: com.test.example.udf.SparkUDFTest$$anonfun$stringLength$1 cannot be cast to scala.Function1
or
TypeError: 'Column' object is not callable
I was able to modify the UDF to take just a single parameter (the column value) and was able to successfully call it and get back a new Dataframe.
Scala UDF Class
class SparkUDFTest() extends Serializable {
  def testStringLength(): UserDefinedFunction = udf(stringLength _)
  def stringLength(columnValue: String): Int = {
    LOG.info("Column value is: " + columnValue)
    columnValue.length
  }
}
Updating Python code:
def testStringLength(colValue, colName):
    testpackage = "com.test.example.udf.SparkUDFTest"
    udfInstance = sc._jvm.java.lang.Thread.currentThread().getContextClassLoader().loadClass(testpackage).newInstance().testStringLength().apply
    return Column(udfInstance(_to_seq(sc, [colValue], _to_java_column)))
The above works successfully. I'm still struggling to call the UDF when it takes an extra parameter. How can the second parameter be passed to the UDF through Pyspark?
I was able to resolve this by using currying. First I registered the UDF as:
def testStringLength(columnName: String): UserDefinedFunction =
  udf((colValue: String) => stringLength(colValue, columnName))
Called the UDF
udfInstance = sc._jvm.java.lang.Thread.currentThread().getContextClassLoader().loadClass(testpackage).newInstance().testStringLength("email").apply
df.withColumn("email", Column(udfInstance(_to_seq(sc, [col("email")], _to_java_column))))
This can be cleaned up a bit more but it's how I got it to work.
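Cleaned up, the Scala side of the curried approach might look like the sketch below (the same package and class names as above are assumed; the logging call is omitted so the sketch stays self-contained):

import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.udf

class SparkUDFTest() extends Serializable {
  // The two-argument implementation stays an ordinary method.
  def stringLength(columnValue: String, columnName: String): Int =
    columnValue.length

  // Factory: binds the second argument on the Scala side and returns a
  // one-argument UDF, so Pyspark only has to pass in the column.
  def testStringLength(columnName: String): UserDefinedFunction =
    udf((columnValue: String) => stringLength(columnValue, columnName))
}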
Edit: The reason I went with currying is that even when I used 'lit' on the second argument that I wanted to pass to the UDF as a String, I kept experiencing the "TypeError: 'Column' object is not callable" error. In Scala I did not run into this issue. I'm not sure why it happens in Pyspark; it may be due to some complication between the Python interpreter and the Scala code. It's still unclear, but currying works for me.
Related
I have code which looks like this:
myDF.map{ x =>
val inp = MyUtils.doSomething(x.value) //accepts Int values and return Int
MyInfo(inp)
}
Here MyUtils.doSomething is a normal function (not a UDF) in my Spark Scala code. It works fine.
But when I do this:
val DF = myDF.withColumn("value", lit(MyUtils.doSomething(col("value").asInstanceOf[Int].toInt)))
why is it showing the error
class org.apache.spark.sql.Column cannot be cast to class java.lang.Integer
How can I fix this? Is there any way I can get the underlying value of col("value") so that I can use it in my doSomething function?
I'm not sure why col("value").asInstanceOf[Int].toInt isn't giving an Int value.
Well, how do you want to cast Column("colName", 21, false) to an Integer? asInstanceOf doesn't convert anything; it just makes the compiler ignore the fact that an object of type Column is not an integer, so you get exceptions at runtime instead. You should structure your code so that you never need asInstanceOf. As for your first snippet: a UDF is basically a function that Spark serializes, ships to the executors, and applies to columns, so you have to do it like this:
import org.apache.spark.sql.functions._
val doSomethingUdf = udf(MyUtils.doSomething)
// if doSomething is defined as a method "def doSomething ..."
// then it would be better to do
// udf(MyUtils.doSomething _)
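With the UDF in hand, the withColumn line from the question would then look something like this (a sketch reusing myDF and MyUtils.doSomething from the question; the function is wrapped as a UDF instead of being called directly on the Column):

import org.apache.spark.sql.functions.{col, udf}

val doSomethingUdf = udf(MyUtils.doSomething _)
val DF = myDF.withColumn("value", doSomethingUdf(col("value")))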
I have the following user-defined function in Scala:
val returnKey: UserDefinedFunction = udf((key: String) => {
val abc: String = key
abc
})
Now, I want to unit test whether it is returning correct or not. How do I write the Unit test for it. This is what I tried.
class CommonTest extends FunSuite with Matchers {
  test("Invalid String Test") {
    val key = "Test Key"
    val returnedKey = returnKey(col(key))
    returnedKey should equal (key)
  }
}
But since returnKey is a UDF, I am not sure how to call it or how to test this particular scenario.
A UserDefinedFunction is effectively a wrapper around your Scala function that can be used to transform Column expressions. In other words, the UDF given in the question wraps a function of String => String to create a function of Column => Column.
I usually pick 1 of 2 different approaches to testing UDFs.
Test the UDF in a spark plan. In other words, create a test DataFrame and apply the UDF to it. Then collect the DataFrame and check its contents.
// In your test (assumes spark.implicits._ is in scope for toDS/toDF and as[String])
import spark.implicits._

val testDF = Seq("Test Key", "", null).toDS().toDF("s")
val result = testDF.select(returnKey(col("s"))).as[String].collect.toSet
result should be(Set("Test Key", "", null))
Notice that this lets us test all our edge cases in a single spark plan. In this case, I have included tests for the empty string and null.
Extract the Scala function being wrapped by the UDF and test it as you would any other Scala function.
def returnKeyImpl(key: String) = {
val abc: String = key
abc
}
val returnKey = udf(returnKeyImpl _)
Now we can test returnKeyImpl by passing in strings and checking the string output.
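For instance, a plain ScalaTest check in the same FunSuite/Matchers style as the question might look like this (a sketch; it assumes returnKeyImpl is in scope and needs no Spark session):

import org.scalatest.{FunSuite, Matchers}

class ReturnKeyImplTest extends FunSuite with Matchers {
  test("returnKeyImpl returns its input unchanged") {
    returnKeyImpl("Test Key") should equal("Test Key")
    returnKeyImpl("") should equal("")
  }
}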
Which is better?
There is a trade-off between these two approaches, and my recommendation is different depending on the situation.
If you are doing a larger test on bigger datasets, I would recommend testing the UDF in a Spark job.
Testing the UDF in a Spark job can raise issues that you wouldn't catch by only testing the underlying Scala function. For example, if your underlying Scala function relies on a non-serializable object, then Spark will be unable to serialize the UDF and ship it to the workers, and you will get an exception.
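As an illustration (not from the answer; the Config class and column names here are made up), a UDF that closes over a non-serializable object is easy to unit test as a plain function but fails as soon as it is evaluated in a Spark plan:

// e.g. in spark-shell, with a DataFrame testDF that has a string column "s"
import org.apache.spark.sql.functions.{col, udf}

class Config(val suffix: String) // note: does not extend Serializable

val config = new Config("-key")
val appendSuffix = udf((s: String) => s + config.suffix)

// (s: String) => s + config.suffix is trivial to test directly, but
// testDF.select(appendSuffix(col("s"))).show() fails with
// "Task not serializable" because config cannot be shipped to the workers.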
On the other hand, starting spark jobs in every unit test for every UDF can be quite slow. If you are only doing a small unit test, it will likely be faster to just test the underlying Scala function.
I have a requirement that a Spark UDF has to be overloaded; I know that UDF overloading is not supported in Spark. To overcome this limitation, I tried to create a UDF that accepts Any, finds the actual datatype inside the UDF, calls the respective method for the computation, and returns the value accordingly. When doing so I got this error:
Exception in thread "main" java.lang.UnsupportedOperationException: Schema for type Any is not supported
at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:789)
at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:724)
at scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:56)
at org.apache.spark.sql.catalyst.ScalaReflection$class.cleanUpReflectionObjects(ScalaReflection.scala:906)
at org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:46)
at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:723)
at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:720)
at org.apache.spark.sql.UDFRegistration.register(UDFRegistration.scala:213)
at com.experian.spark_jobs.Test$.main(Test.scala:9)
at com.experian.spark_jobs.Test.main(Test.scala)
Below is the sample code:
import org.apache.spark.sql.SparkSession
object Test {
def main(args: Array[String]): Unit = {
val spark = SparkSession.builder().master("local[*]").appName("test").getOrCreate()
spark.udf.register("testudf", testudf _)
spark.sql("create temporary view testView as select testudf(1, 2) as a").show()
spark.sql("select testudf(a, 5) from testView").show()
}
def testudf(a: Any, b: Any) = {
if (a.isInstanceOf[Integer] && b.isInstanceOf[Integer]) {
add(a.asInstanceOf[Integer], b.asInstanceOf[Integer])
} else if (a.isInstanceOf[java.math.BigDecimal] && b.isInstanceOf[java.math.BigDecimal]) {
add(a.asInstanceOf[java.math.BigDecimal], b.asInstanceOf[java.math.BigDecimal])
}
}
def add(decimal: java.math.BigDecimal, decimal1: java.math.BigDecimal): java.math.BigDecimal = {
decimal.add(decimal1)
}
def add(integer: Integer, integer1: Integer): Integer = {
integer + integer1
}
}
Is it possible to meet the above requirement? If not, please suggest a better approach.
Note: Spark Version - 2.4.0
The problem with working with DataFrames (untyped) is that any kind of compile-time polymorphism is very painful. Ideally, having the column types would let you build your UDFs with the specific "add" implementation, as if you were working with Monoids. But the Spark DataFrame API is very far from this world. Working with Datasets or with Frameless helps a lot.
In your example, to examine the type at runtime you will need AnyRef instead of Any. That should work.
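To illustrate the "specific add implementation" point above (this is a sketch, not the answer's code): if the column types are known up front, you can register one typed UDF per type instead of a single UDF over Any, e.g.

import org.apache.spark.sql.functions.udf

// spark is assumed to be an existing SparkSession, as in the question's main method.
val addInts = udf((a: Int, b: Int) => a + b)
val addDecimals = udf((a: java.math.BigDecimal, b: java.math.BigDecimal) => a.add(b))

spark.udf.register("add_ints", addInts)
spark.udf.register("add_decimals", addDecimals)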
I basically want to check whether every value in my dataframe of dates is in the correct format, "MM/dd/yy".
val df: DataFrame = spark.read.csv("----")
However, whenever I apply the map function:
df.map(x => right_format(x)).show()
and try to show this new dataframe/dataset, I'm getting a nonserializable error.
Does anyone know why?
I've tried to debug by using the intellij debugger, but to no avail.
Expected results: dataframe of boolean values
Actual results: Nonserializable error.
Does the non-serializable error say something like "SparkContext is not serializable"?
map runs in a distributed manner, and Spark will attempt to serialize the right_format function and send it to all the nodes.
It looks like right_format is defined in the same scope as objects such as your SparkContext instance (for example, is all this in your main() method call?).
To get around this, I think you could do 1 of 2 things -
Define right_format() within the map block
df.map(x => {
def right_format(elem) = {...}
right_format(x)
}
).show()
Define an abstract object or a trait of helper functions that includes the function def for right_format.
Spark will serialize this object and send it to all the nodes. This should solve the issue that you're facing.
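A minimal sketch of the second option might look like the following. The body of right_format is assumed here (the question only says it checks the "MM/dd/yy" format); the point is that the helper lives in a standalone object rather than next to the SparkContext, so the closure Spark serializes doesn't drag the driver-side scope along with it:

import java.text.SimpleDateFormat
import scala.util.Try

object DateHelpers extends Serializable {
  // Assumed implementation: true if the string parses strictly as MM/dd/yy.
  def right_format(elem: String): Boolean = {
    val fmt = new SimpleDateFormat("MM/dd/yy")
    fmt.setLenient(false)
    Try(fmt.parse(elem)).isSuccess
  }
}

// Usage, assuming the date is in the first column of the csv and
// spark.implicits._ is in scope for the Boolean encoder:
// df.map(row => DateHelpers.right_format(row.getString(0))).show()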
I have an issue: I want to check whether an array of strings contains a string present in another column. I am currently using the code below, which gives an error.
.withColumn("is_designer_present", when(array_contains(col("list_of_designers"),$"dept_resp"),1).otherwise(0))
error:
java.lang.RuntimeException: Unsupported literal type class org.apache.spark.sql.ColumnName dept_resp
at org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:77)
You can write a udf function to get your job done:
import org.apache.spark.sql.functions._
def stringContains = udf((array: collection.mutable.WrappedArray[String], str: String) => array.contains(str))
df.withColumn("is_designer_present", when(stringContains(col("list_of_designers"), $"dept_resp"),1).otherwise(0))
You can return the appropriate value from the udf function itself so that you don't have to use the when function:
import org.apache.spark.sql.functions._
def stringContains = udf((array: collection.mutable.WrappedArray[String], str: String) => if (array.contains(str)) 1 else 0)
df.withColumn("is_designer_present", stringContains(col("list_of_designers"), $"dept_resp"))
With Spark 1.6 you can wrap your array_contains() as a string into the expr() function:
import org.apache.spark.sql.functions.expr
.withColumn("is_designer_present",
when(expr("array_contains(list_of_designers, dept_resp)"), 1).otherwise(0))
This form of array_contains inside the expr can accept a column as the second argument.
I know that this is a somewhat old question, but I found myself in a similar predicament and found the following solution. It uses Spark native functions (so it doesn't suffer from UDF-related performance regressions) and does not rely on string expressions (which are hard to maintain).
import org.apache.spark.sql.Column
import org.apache.spark.sql.catalyst.expressions.ArrayContains

def array_contains_column(arrayColumn: Column, valueColumn: Column): Column = {
  new Column(ArrayContains(arrayColumn.expr, valueColumn.expr))
}
// ...
df.withColumn(
"is_designer_present",
when(
array_contains_column(col("list_of_designers"),col("dept_resp")),
1
).otherwise(0)
)
You can do it without UDFs using explode.
.withColumn("exploCol", explode($"dept_resp"))
.withColumn("aux", when($"exploCol" === col("list_of_designers"), 1).otherwise(0))
.drop("exploCol")
.groupBy($"dep_rest") //all cols except aux
.agg(sum($"aux") as "result")
And there you go: if result > 0, then list_of_designers contains the dept_resp value.