Why does query with UDF fail with "Task not serializable" exception? - scala

I have created a UDF and I am trying to apply it to the result of a coalesce inside a join.
Ideally I would like to do this during a join:
def foo(value: Double): Double = {
  value / 100
}

val foo = udf(foo _)

df.join(.....)
  .withColumn("value", foo(coalesce(new Column("valueA"), new Column("valueB"))))
But I am getting the exception Task not serializable.
Is there a way to work around that?

Use a lambda function to make it serializable. This example works fine.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{coalesce, col, udf}

import spark.implicits._ // assumes a SparkSession named spark is in scope (e.g. spark-shell); needed for toDF

val central: DataFrame = Seq(
  (1, Some(2014)),
  (2, None)
).toDF("key", "year1")

val other1: DataFrame = Seq(
  (1, 2016),
  (2, 2015)
).toDF("key", "year2")

val fooUDF = udf { v: Double => v / 100 }

val result = central.join(other1, Seq("key"))
  .withColumn("value", fooUDF(coalesce(col("year1"), col("year2"))))

But I am getting the exception Task not serializable.
The reason for the infamous "Task not serializable" exception is that def foo(value: Double): Double is part of an unserializable owning object (for example, one that holds a SparkSession, which indirectly references an unserializable SparkContext).
A solution is to define the method as part of a "standalone" object that has no references to unserializable values.
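For example, a minimal sketch of that approach (the object name, the second DataFrame and the join key are illustrative placeholders, not taken from the question):
import org.apache.spark.sql.functions.{coalesce, col, udf}

// The standalone object holds no reference to the SparkSession or SparkContext,
// so the function it exposes can be shipped to executors safely.
object UdfHelpers extends Serializable {
  def foo(value: Double): Double = value / 100
}

val fooUDF = udf(UdfHelpers.foo _)

df.join(other, Seq("key")) // "other" and "key" stand in for the elided join in the question
  .withColumn("value", fooUDF(coalesce(col("valueA"), col("valueB"))))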
Is there a way to work around that?
See the other answer by @firas.

Related

Spark job completes without executing udf

I've been having an issue with a long, complicated spark job which contains a udf.
The issue I've been having is that the udf doesn't seem to get called properly, although there is no error message.
I know it isn't being called properly because the output gets written, but everything the udf was supposed to calculate is NULL, and no print statements appear when debugging locally.
The only lead is that this code previously worked using different input data, meaning the error must have something to do with the input.
The change in inputs mostly means different column names are used, which is addressed in the code.
Print statements are executed given the first, 'working' input.
Both inputs are created using the same series of steps from the same database, and by inspection there doesn't appear to be a problem with either one.
I've never experienced this sort of behaviour before, and any leads on what might cause it would be appreciated.
The code is monolithic and inflexible - I'm working on refactoring, but it's not an easy piece to break apart. This is a short version of what is happening:
package mypackage

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.util._
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.sql.types._
import scala.collection.{Map => SMap}

object MyObject {
  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession.builder()
      .appName("my app")
      .config("spark.master", "local")
      .getOrCreate()
    import spark.implicits._

    val bigInput = spark.read.parquet("inputname.parquet")
    val reference_table = spark.read.parquet("reference_table.parquet")
    val exchange_rate = spark.read.parquet("reference_table.parquet")

    val bigInput2 = bigInput
      .filter($"column1" === "condition1")
      .join(joinargs)
      .drop(dropargs)

    val bigInput3 = bigInput
      .filter($"column2" === "condition2")
      .join(joinargs)
      .drop(dropargs)

    <continue for many lines...>

    def mapper1(
      arg1: String,
      arg2: Double,
      arg3: Integer
    ): List[Double] = {
      exchange_rate.map(
        List(idx1, idx2, idx3),
        r.toSeq.toList
          .drop(idx4)
          .take(arg2)
      )
    }

    def mapper2(){}
    ...
    def mapper5(){}

    def my_udf(
      arg0: Integer,
      arg1: String,
      arg2: Double,
      arg3: Integer,
      ...
      arg20: String
    ): Double = {
      println("I'm actually doing something!")
      val result1 = mapper1(arg1, arg2, arg3)
      val result2 = mapper2(arg4, arg5, arg6, arg7)
      ...
      val result5 = mapper5(arg18, arg19, arg20)
      result1.take(arg0)
        .zipAll(result1, 0.0, 0.0)
        .map(x => x._1 * x._2)
        ....
        .zipAll(result5, 0.0, 0.0)
        .foldLeft(0.0)(_ + _)
    }

    spark.udf.register("myUDF", my_udf _)

    val bigResult1 = bigInputFinal.withColumn("Newcolumnname",
      callUDF(
        "myUDF",
        $"col1",
        ...
        $"col20"
      )
    )

    <postprocessing>

    bigResultFinal
      .filter(<configs>)
      .select(<column names>)
      .write
      .format("parquet")
  }
}
To recap
This code runs to completion on each of two input files.
The udf only appears to execute on the first file.
There are no error messages or anything using the second file, although all non-udf logic appears to complete successfully.
Any help greatly appreciated!
Here the UDF is not being called because Spark is lazy: it does not evaluate the UDF unless you run an action on the DataFrame. You can force this by calling an action on the resulting DataFrame.
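For example, a minimal sketch based on the snippet above, which configures a parquet write but never runs a terminal action (the output path is a placeholder):
// Completing the write with save(...) is an action, so it forces the UDF to run.
bigResultFinal
  .write
  .format("parquet")
  .save("output_path")

// Any other action also forces evaluation, for example:
bigResult1.count()
bigResult1.show(5)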

scala spark conversion error when creating dataframe

I am a newbie in Scala, so please be patient.
I have this code.
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.evaluation._
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.evaluation.ClusteringEvaluator
// create spark session
implicit val spark = SparkSession.builder().appName("clustering").getOrCreate()
import spark.implicits._ // needed for the .map and .toDF calls below

// read file
val fileName = """file:///some_location/head_sessions_sample.csv"""

// create DF from file
val df = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load(fileName)

def inputKmeans(df: DataFrame, spark: SparkSession): DataFrame = {
  try {
    val a = df.select("id", "start_ts", "duration", "ip_dist")
      .map(r => (r.getInt(0), Vectors.dense(r.getDouble(1), r.getDouble(2), r.getDouble(3))))
      .toDF("id", "features")
    a
  } catch {
    case e: java.lang.ClassCastException => spark.emptyDataFrame
  }
}

val t = inputKmeans(df, spark).filter(_ != null)
t.foreach(r =>
  if (r.get(0) != null)
    println(r.get(0)))
For the moment, I want to ignore my conversion errors, but somehow I still have them:
2018-09-24 11:26:22 ERROR Executor:91 - Exception in task 0.0 in stage 4.0 (TID 6) java.lang.ClassCastException: java.lang.Long cannot be cast to java.lang.Double
I don't think there is any point in giving a snapshot of the CSV. At this point, I just want to ignore conversion errors.
Any ideas why this is happening?
As mentioned in the comment, the issue is that the values are not of Double type.
val a = df.select("id", "start_ts", "duration", "ip_dist").map(r => (r.getInt(0), Vectors.dense(r.getDouble(1), r.getDouble(2), r.getDouble(3)))).toDF("id", "features")
Either extract the values with the correct data type, i.e. Long (you can also provide the schema explicitly using a case class and apply it to the DataFrame).
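A minimal sketch of this first option, casting the columns up front so that getDouble no longer meets a Long at runtime (column names come from the question; the helper name is illustrative):
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DoubleType

def inputKmeansCast(df: DataFrame, spark: SparkSession): DataFrame = {
  import spark.implicits._
  df.select(
      col("id").cast("int"),
      col("start_ts").cast(DoubleType),
      col("duration").cast(DoubleType),
      col("ip_dist").cast(DoubleType))
    .map(r => (r.getInt(0), Vectors.dense(r.getDouble(1), r.getDouble(2), r.getDouble(3))))
    .toDF("id", "features")
}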
Or use the VectorAssembler to convert the columns into features. This is the easier and recommended approach.
import org.apache.spark.ml.feature.VectorAssembler

def inputKmeans(df: DataFrame, spark: SparkSession): DataFrame = {
  val assembler = new VectorAssembler()
    .setInputCols(Array("start_ts", "duration", "ip_dist"))
    .setOutputCol("features")
  val output = assembler.transform(df).select("id", "features")
  output
}
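Usage then mirrors the question (a brief sketch; show is only there to confirm the result):
val features = inputKmeans(df, spark)
features.show(5)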
I think I discovered the problem: the "try catch" is placed at the level of the DataFrame creation, not at the level of the conversion. As a consequence, it catches problems related to creating the DataFrame, not conversion issues, which only surface later when an action forces the conversion to run.
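A minimal sketch of that behaviour, using the DataFrame from the question: the exception escapes the try block because the map is lazy, and the cast only runs when an action forces evaluation.
val safe: DataFrame = try {
  // Nothing is evaluated here, so the ClassCastException cannot be caught here.
  df.select("id", "start_ts", "duration", "ip_dist")
    .map(r => (r.getInt(0), Vectors.dense(r.getDouble(1), r.getDouble(2), r.getDouble(3))))
    .toDF("id", "features")
} catch {
  case _: java.lang.ClassCastException => spark.emptyDataFrame // never reached for this error
}

safe.show() // the cast actually executes here, so this is where the exception surfaces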

Rewrite scala code to be more functional

I am trying to teach myself Scala whilst at the same time trying to write code that is idiomatic of a functional language, i.e. write better, more elegant, functional code.
I have the following code that works OK:
import org.apache.spark.SparkConf
import org.apache.spark.sql.{DataFrame, SparkSession}
import java.time.LocalDate
object DataFrameExtensions_ {
  implicit class DataFrameExtensions(df: DataFrame) {
    def featuresGroup1(groupBy: Seq[String], asAt: LocalDate): DataFrame = { df }
    def featuresGroup2(groupBy: Seq[String], asAt: LocalDate): DataFrame = { df }
  }
}
import DataFrameExtensions_._
val spark = SparkSession.builder().config(new SparkConf().setMaster("local[*]")).enableHiveSupport().getOrCreate()
import spark.implicits._
val df = Seq((8, "bat"),(64, "mouse"),(-27, "horse")).toDF("number", "word")
val groupBy = Seq("a","b")
val asAt = LocalDate.now()
val dataFrames = Seq(df.featuresGroup1(groupBy, asAt),df.featuresGroup2(groupBy, asAt))
The last line bothers me though. The two functions (featuresGroup1, featuresGroup2) both have the same signature:
scala> :type df.featuresGroup1(_,_)
(Seq[String], java.time.LocalDate) => org.apache.spark.sql.DataFrame
scala> :type df.featuresGroup2(_,_)
(Seq[String], java.time.LocalDate) => org.apache.spark.sql.DataFrame
and take the same vals as parameters, so I assume I can write that line in a more functional way (perhaps using .map somehow), such that I write the parameter list just once and pass it to both functions. I can't figure out the syntax though. I thought maybe I could construct a list of those functions, but that doesn't work:
scala> Seq(featuresGroup1, featuresGroup2)
<console>:23: error: not found: value featuresGroup1
Seq(featuresGroup1, featuresGroup2)
^
<console>:23: error: not found: value featuresGroup2
Seq(featuresGroup1, featuresGroup2)
^
Can anyone help?
I thought maybe I could construct a list of those functions but that doesn't work:
Why are you writing just featuresGroup1/2 here when you already had the correct syntax df.featuresGroup1(_,_) just above?
Seq(df.featuresGroup1(_,_), df.featuresGroup2(_,_)).map(_(groupBy, asAt))
df.featuresGroup1 _ should work as well.
df.featuresGroup1 by itself would work if you had an expected type, e.g.
val dataframes: Seq[(Seq[String], LocalDate) => DataFrame] =
Seq(df.featuresGroup1, df.featuresGroup2)
but in this specific case providing the expected type is more verbose than using lambdas.
I thought maybe I could construct a list of those functions but that doesn't work
You need to explicitly perform eta expansion to turn methods into functions (they are not the same in Scala), by using an underscore operator:
val funcs = Seq(df.featuresGroup1 _, df.featuresGroup2 _)
or by using placeholders:
val funcs = Seq(df.featuresGroup1(_, _), df.featuresGroup2(_, _))
And you are absolutely right about using map operator:
val dataFrames = funcs.map(f => f(groupBy, asAt))
I strongly recommend against using implicits of type String or Seq: if they are used in multiple places, they lead to subtle bugs that are not immediately obvious from the code, and the code becomes prone to breaking when it is moved around.
If you want to use implicits, wrap them in a custom type:
case class DfGrouping(groupBy: Seq[String]) extends AnyVal
implicit val grouping: DfGrouping = DfGrouping(Seq("a", "b"))
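A hypothetical consumer of that wrapper, just to show how the implicit would be picked up (the featuresFor name and its shape are assumptions, not from the question):
// The grouping no longer appears in the call site; it is resolved from the implicit val above.
def featuresFor(df: DataFrame, asAt: LocalDate)(implicit g: DfGrouping): Seq[DataFrame] =
  Seq(df.featuresGroup1(g.groupBy, asAt), df.featuresGroup2(g.groupBy, asAt))

val dataFrames = featuresFor(df, asAt)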
Why not just create a function in DataFrameExtensions to do so?
def getDataframeGroups(groupBy: Seq[String], asAt: LocalDate): Seq[DataFrame] =
  Seq(featuresGroup1(groupBy, asAt), featuresGroup2(groupBy, asAt))
I think you could create a list of functions as below:
val funcs: List[DataFrame => (Seq[String], java.time.LocalDate) => org.apache.spark.sql.DataFrame] =
  List(_.featuresGroup1, _.featuresGroup2)

funcs.map(x => x(df)(groupBy, asAt))
It seems you have a list of functions which convert a DataFrame to another DataFrame. If that is the pattern, you could go a little bit further with Endo in Scalaz
I like this answer best, courtesy of Alexey Romanov.
import org.apache.spark.SparkConf
import org.apache.spark.sql.{DataFrame, SparkSession}
import java.time.LocalDate
object DataFrameExtensions_ {
  implicit class DataFrameExtensions(df: DataFrame) {
    def featuresGroup1(groupBy: Seq[String], asAt: LocalDate): DataFrame = { df }
    def featuresGroup2(groupBy: Seq[String], asAt: LocalDate): DataFrame = { df }
  }
}
import DataFrameExtensions_._
val spark = SparkSession.builder().config(new SparkConf().setMaster("local[*]")).enableHiveSupport().getOrCreate()
import spark.implicits._
val df = Seq((8, "bat"),(64, "mouse"),(-27, "horse")).toDF("number", "word")
val groupBy = Seq("a","b")
val asAt = LocalDate.now()
Seq(df.featuresGroup1(_,_), df.featuresGroup2(_,_)).map(_(groupBy, asAt))

Removing newlines in a DataFrame field with udf function gives TypeTag Error

val trim: String => String = _.trim.replace("[\\r\\n]", "")

def main(args: Array[String]) {
  val spark = ... ...
  import spark.implicits._
  val trimUDF = udf[String, String](trim)
  val df = spark.read.json(df_path) ...
  val fixed_dblogs_df = df.withColumn("qp_new", trimUDF('qp)) ...
}
When I run this code I get a compile time error:
No TypeTag available for String
The error points to the line where I define the udf function. I have no idea why this is happening. I have used udf functions before, but this one produces the error. I am using Spark 2.1.1, and that's it.
The purpose of the code is to remove all the newlines from one of my columns, which is a StringType; I just want it to contain no newlines.
Is there some reason you're using a UDF instead of the regexp_replace builtin?
val fixed_dblogs_df = df.withColumn("qp_new", regexp_replace('qp, "[\\r\\n]", "") ...)
UDFs break Spark's plan optimization.
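For completeness, a self-contained sketch of that suggestion (the path is a placeholder; the column name comes from the question):
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, regexp_replace}

val spark = SparkSession.builder().appName("trim-newlines").getOrCreate()
val df = spark.read.json("df_path") // placeholder path

// regexp_replace is evaluated natively by Catalyst, so no UDF (and no TypeTag) is involved.
val fixed_dblogs_df = df.withColumn("qp_new", regexp_replace(col("qp"), "[\\r\\n]", ""))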

How do I dynamically create a UDF in Spark?

I have a DataFrame where I want to create multiple UDFs dynamically to determine if certain rows match. I am just testing one example right now. My test code looks like the following.
//create the dataframe
import spark.implicits._
val df = Seq(("t","t"), ("t", "f"), ("f", "t"), ("f", "f")).toDF("n1", "n2")
//create the scala function
def filter(v1: Seq[Any], v2: Seq[String]): Int = {
  for (i <- 0 until v1.length) {
    if (!v1(i).equals(v2(i))) {
      return 0
    }
  }
  return 1
}
//create the udf
import org.apache.spark.sql.functions.udf
val fudf = udf(filter(_: Seq[Any], _: Seq[String]))
//apply the UDF
df.withColumn("filter1", fudf(Seq($"n1"), Seq("t"))).show()
However, when I run the last line, I get the following error.
:30: error: not found: value df
df.withColumn("filter1", fudf($"n1", Seq("t"))).show()
^
:30: error: type mismatch;
found : Seq[String]
required: org.apache.spark.sql.Column
df.withColumn("filter1", fudf($"n1", Seq("t"))).show()
^
Any ideas on what I'm doing wrong? Note, I am on Scala v2.11.x and Spark 2.0.x.
On another note, if we can solve this "dynamic" UDF question/concern, my use case would be to add them to the dataframe. With some test code as follows, it takes forever (it doesn't even finish, I had to ctrl-c to break out). I'm guessing doing a bunch of .withColumn in a for-loop is a bad idea in Spark. If so, please let me know and I'll abandon this approach altogether.
import spark.implicits._
val df = Seq(("t", "t"), ("t", "f"), ("f", "t"), ("f", "f")).toDF("n1", "n2")

import org.apache.spark.sql.functions.udf
val fudf = udf((x: String) => if (x.equals("t")) 1 else 0)

var df2 = df
for (i <- 0 until 10000) {
  df2 = df2.withColumn("filter" + i, fudf($"n1"))
}
Enclose "t" in lit()
df.withColumn("filter1", fudf($"n1", Seq(lit("t")))).show()
Try registering the UDF on the sqlContext. See: Spark 2.0 UDF registration
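A brief sketch of that suggestion with the Spark 2.x API, registering the same comparison by name and calling it through callUDF (the registered name is illustrative):
import org.apache.spark.sql.functions.{array, callUDF, col, lit}

spark.udf.register("seqFilter", (v1: Seq[String], v2: Seq[String]) =>
  if (v1.zip(v2).forall { case (a, b) => a == b }) 1 else 0)

df.withColumn("filter1", callUDF("seqFilter", array(col("n1")), array(lit("t")))).show()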