Task serialisation error when using UDF - scala

I use IntelliJ IDEA to execute the code shown below. The content of df is the following:
+------+------+
|nodeId|   p_i|
+------+------+
|    26|0.6914|
|    29|0.6914|
|   474|   0.0|
|    65|0.4898|
|   191|0.4445|
|   418|0.4445|
+------+------+
I get Task serialization error at line result.show() when I run this code:
class MyUtils extends Serializable {
  def calculate(spark: SparkSession, df: DataFrame): DataFrame = {
    def myFunc(a: Double): String = {
      var result: String = "-"
      if (a > 1) {
        result = "A"
      }
      return result
    }
    val myFuncUdf = udf(myFunc _)
    val result = df.withColumn("role", myFuncUdf(df("p_i")))
    result.show()
    result
  }
}
Why do I get this error?
Update:
This is how I run the code:
object Processor extends App {
  // ...
  val mu = new MyUtils()
  var result = mu.calculate(spark, df)
}

I had to add extends Serializable to the declaration of the MyUtils class.
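For what it's worth, an alternative that avoids the issue altogether (a sketch, not part of my original fix) is to keep the function in a standalone object, so the closure passed to udf does not capture the enclosing class at all:

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.udf

object MyFuncs {
  // A pure function with no reference to MyUtils, so only this function is serialized.
  def myFunc(a: Double): String = if (a > 1) "A" else "-"
}

class MyUtils {
  def calculate(spark: SparkSession, df: DataFrame): DataFrame = {
    val myFuncUdf = udf(MyFuncs.myFunc _)
    val result = df.withColumn("role", myFuncUdf(df("p_i")))
    result.show()
    result
  }
}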

Processing List of nested column in spark huge dataframe in scala

I want to store a list of nested JSON in a Spark DataFrame and also process that column. I also need operations such as updating some value or deleting a row.
{
  "studentName": "abc",
  "mailId": "abc#gmail.com",
  "class": 7,
  "scoreBoard": [
    {"subject": "Math",      "score": 90, "grade": "A"},
    {"subject": "Science",   "score": 82, "grade": "A"},
    {"subject": "History",   "score": 80, "grade": "A"},
    {"subject": "Hindi",     "score": 75, "grade": "B"},
    {"subject": "English",   "score": 80, "grade": "A"},
    {"subject": "Geography", "score": 80, "grade": "A"}
  ]
}
I am trying to process the scoreBoard field from the above data: find the top five subjects, delete the row with the lowest score, and also change the grade of some subjects.
case class Student(subject: String, score: Long, grade: String)

var studentTest = spark.read.json("**/testStudent.json")

val studentSchema = ArrayType(new StructType()
  .add("subject", StringType)
  .add("score", LongType)
  .add("grade", StringType))

val parseStudentUDF = udf((scoreBoard: Seq[Row]) => {
  // do data processing and return the updated data
  ListBuffer(Student(subject, score, grade), ...) // pseudocode
}, studentSchema)

studentTest = studentTest.withColumn("scoreBoard", parseStudentUDF(col("scoreBoard")))
I am not sure how to convert a Seq[Row] to a DataFrame inside a UDF, or how to process the Seq to sort the data and delete a row.
Is there any way to do this?
A different approach is also acceptable.
This approach uses Spark DataFrames and Spark SQL; I hope it helps.
package tests

import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession

object ProcessingList {

  val spark = SparkSession
    .builder()
    .appName("ProcessingList")
    .master("local[*]")
    .config("spark.sql.shuffle.partitions", "4") // Change to a more reasonable default number of partitions for our data
    .config("spark.app.id", "ProcessingList")    // To silence Metrics warning
    .getOrCreate()

  val sc = spark.sparkContext
  val sqlContext = spark.sqlContext
  val input = "/home/cloudera/files/tests/list_processing.json"

  def main(args: Array[String]): Unit = {
    Logger.getRootLogger.setLevel(Level.ERROR)
    try {
      import org.apache.spark.sql.functions._

      val studentTest = sqlContext
        .read
        .json(input)

      studentTest
        .filter(col("grade").isNotNull)
        .select(col("grade"), col("score"), col("subject"))
        .cache()
        .createOrReplaceTempView("student_test")

      sqlContext
        .sql(
          """SELECT grade, score, subject
            |FROM student_test
            |ORDER BY score DESC
            |LIMIT 5
            |""".stripMargin)
        .show()

      // To have the opportunity to view the web console of Spark: http://localhost:4041/
      println("Type whatever to the console to exit......")
      scala.io.StdIn.readLine()
    } finally {
      sc.stop()
      println("SparkContext stopped")
      spark.stop()
      println("SparkSession stopped")
    }
  }
}
+-----+-----+---------+
|grade|score| subject|
+-----+-----+---------+
| A| 90| Math|
| A| 82| Science|
| A| 80| History|
| A| 80| English|
| A| 80|Geography|
+-----+-----+---------+
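If the scoreBoard entries are still nested in an array, as in the question's JSON, a small extra step flattens them before querying. This is my own sketch, assuming the schema shown in the question:

import org.apache.spark.sql.functions._

// One row per subject: explode the scoreBoard array and pull out its fields.
val flattened = sqlContext.read.json(input)
  .select(col("studentName"), explode(col("scoreBoard")).as("sb"))
  .select(col("studentName"), col("sb.subject"), col("sb.score"), col("sb.grade"))

flattened.createOrReplaceTempView("student_test")

The same SELECT ... ORDER BY score DESC LIMIT 5 query above then works against this flattened view.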
Regards.
Firstly, the comment by mvasyliv is good in my opinion. Building on it, you can use plain Scala collection methods like .filter(); you don't need Spark for that (see the Scala collections API for how to use them). You can also write a UnaryTransformer, which transforms a specific column and inserts a new one. See the simple special-characters remover below as an example; notice the outputDataType method and createTransformFunc, which is based on the .map collection method.
import org.apache.spark.ml.UnaryTransformer
import org.apache.spark.ml.util.{DefaultParamsWritable, Identifiable}
import org.apache.spark.sql.types.{ArrayType, DataType, StringType}

class SpecialCharsRemover(override val uid: String)
  extends UnaryTransformer[Seq[String], Seq[String], SpecialCharsRemover] with DefaultParamsWritable {

  def this() = this(Identifiable.randomUID("tokenPermutationGenerator"))

  override protected def createTransformFunc: Seq[String] => Seq[String] = (tokensWithSpecialChars: Seq[String]) => {
    tokensWithSpecialChars.map(token => {
      removeSpecialCharsImpl(token)
    })
  }

  private def removeSpecialCharsImpl(token: String): String = {
    if (token.equals("")) {
      return ""
    }
    // remove special characters
    var tempToken = token
    tempToken = tempToken.replace(",", "")
    tempToken = tempToken.replace("'", "")
    tempToken = tempToken.replace("_", "")
    tempToken = tempToken.replace("-", "")
    tempToken = tempToken.replace("!", "")
    tempToken = tempToken.replace(".", "")
    tempToken = tempToken.replace("?", "")
    tempToken = tempToken.replace(":", "")
    tempToken = tempToken.replace(")", "")
    tempToken = tempToken.replace("(", "")
    tempToken = tempToken.replace("‘", "")
    tempToken = tempToken.replace("}", "")
    tempToken = tempToken.replace("{", "")
    tempToken = tempToken.replace("[", "")
    tempToken = tempToken.replace("]", "")
    tempToken = tempToken.replace("®", "")
    tempToken = ThesaurusUtils.stemToken(tempToken) // ThesaurusUtils comes from the author's own code base
    tempToken
  }

  override protected def outputDataType: DataType = new ArrayType(StringType, false)
}
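A usage sketch for the transformer above (my addition; the DataFrame and column names are hypothetical):

// Assumes tokenizedDf has an array<string> column named "tokens".
val remover = new SpecialCharsRemover()
  .setInputCol("tokens")
  .setOutputCol("cleanTokens")

val cleaned = remover.transform(tokenizedDf)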
Or you can register an arbitrary function as a UDF (Java code):
ds.sparkSession().sqlContext().udf().register("THE_BOB", (UDF1<String, String>) this::getSomeBob, DataTypes.StringType);
private String getSomeBob(String text) {
    return "bob";
}
then call it with:
bobColumn = functions.callUDF("THE_BOB", bobColumn);
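For completeness, roughly the same registration in Scala (a sketch I am adding, not from the original answer; the column name is hypothetical):

import org.apache.spark.sql.functions.{callUDF, col}

// Register the function under a name, then refer to it by that name in column expressions.
spark.udf.register("THE_BOB", (text: String) => "bob")

val withBob = df.withColumn("bob", callUDF("THE_BOB", col("someTextColumn")))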

spark recursively fix circular references in class

My initial data structure contains self-references which are not supported by spark:
initial.toDF
java.lang.UnsupportedOperationException: cannot have circular references in class, but got the circular reference
The initial data structure:
case class FooInitial(bar:String, otherSelf:Option[FooInitial])
val initial = Seq(FooInitial("first", Some(FooInitial("i1", Some(FooInitial("i2", Some(FooInitial("finish", None))))))))
To fix it, a semantically similar and desirable representation could be:
case class Inner(value:String)
case class Foo(bar:String, otherSelf:Option[Seq[Inner]])
val first = Foo("first", None)
val intermediate1 = Inner("i1")//Foo("i1", None)
val intermediate2 = Inner("i2")//Foo("i2", None)
val finish = Foo("finish", Some(Seq(intermediate1, intermediate2)))
val basic = Seq(first, finish)
basic.foreach(println)
val df = basic.toDF
df.printSchema
df.show
+------+------------+
| bar| otherSelf|
+------+------------+
| first| null|
|finish|[[i1], [i2]]|
+------+------------+
What is a nice functional way to convert from the initial to the other, non-self-referencing representation?
This recursively dereferences the objects:
import scala.annotation.tailrec
import scala.collection.mutable.ListBuffer

class MyCollector {
  val intermediateElems = new ListBuffer[Foo]

  def addElement(initialElement: FooInitial): MyCollector = {
    intermediateElems += Foo(initialElement.bar, None)
    intermediateElems ++= addIntermediateElement(initialElement.otherSelf, ListBuffer.empty[Foo])
    this
  }

  @tailrec private def addIntermediateElement(intermediate: Option[FooInitial], l: ListBuffer[Foo]): ListBuffer[Foo] = {
    intermediate match {
      case None => l
      case Some(s) => {
        l += Foo(s.bar + "_inner", None)
        addIntermediateElement(s.otherSelf, l)
      }
    }
  }
}

initial.foldLeft(new MyCollector)((myColl, stay) => myColl.addElement(stay)).intermediateElems.toArray.foreach(println)
The result is a List of:
Foo(first,None)
Foo(i1_inner,None)
Foo(i2_inner,None)
Foo(finish_inner,None)
which now nicely works for spark.
NOTE: this is not a 1:1 answer to what I asked initially, but it is good enough for me for now.
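A purely functional alternative (my own sketch for comparison, reusing the same case classes) flattens the chain with a tail-recursive helper instead of a mutable collector:

import scala.annotation.tailrec

// Walk the otherSelf chain and collect non-self-referencing Foo values.
@tailrec
def flattenChain(current: Option[FooInitial], acc: List[Foo]): List[Foo] = current match {
  case None    => acc.reverse
  case Some(f) => flattenChain(f.otherSelf, Foo(f.bar + "_inner", None) :: acc)
}

val flat = initial.flatMap(f => Foo(f.bar, None) :: flattenChain(f.otherSelf, Nil))
// flat: List(Foo(first,None), Foo(i1_inner,None), Foo(i2_inner,None), Foo(finish_inner,None))
flat.toDF().show()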

passing UDF to a method or class

I have a UDF say
val testUDF = udf { s: String => s.toUpperCase }
I want to create this UDF in a separate method, or maybe something else like an implementation class, and pass it to another class which uses it. Is that possible?
Say suppose I have a class A
class A(df: DataFrame) {
  def testMethod(): DataFrame = {
    val demo = df.select(testUDF(col("col1")))
    demo
  }
}
Class A should be able to use the UDF. Can this be achieved?
Given a dataframe as
+----+
|col1|
+----+
|abc |
|dBf |
|Aec |
+----+
And a udf function
import org.apache.spark.sql.functions._
val testUDF = udf{s: String=>s.toUpperCase}
You can definitely use that udf function from another class as
val demo = df.select(testUDF(col("col1")).as("upperCasedCol"))
which should give you
+-------------+
|upperCasedCol|
+-------------+
|ABC |
|DBF |
|AEC |
+-------------+
But I would suggest using the built-in functions where possible, as a udf requires the column values to be serialized and deserialized, which consumes more time and memory than the built-in functions. A udf should be the last choice.
You can use upper function for your case
val demo = df.select(upper(col("col1")).as("upperCasedCol"))
This will generate the same output as the original udf function
I hope the answer is helpful
Updated
Since your question asks how to call a udf function defined in another class or object, here is one way.
Suppose you have an object where you define the udf function, or the upper-based function I suggested, as
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._

object UDFs {
  def testUDF = udf { s: String => s.toUpperCase }
  def testUpper(column: Column) = upper(column)
}
Your A class is as in your question; I have just added another function:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

class A(df: DataFrame) {
  def testMethod(): DataFrame = {
    val demo = df.select(UDFs.testUDF(col("col1")))
    demo
  }

  def usingUpper() = {
    df.select(UDFs.testUpper(col("col1")))
  }
}
Then you can call the functions from main as below
import org.apache.spark.sql.SparkSession

object TestUpper {
  def main(args: Array[String]): Unit = {
    val sparkSession = SparkSession.builder().appName("Simple Application")
      .master("local")
      .config("", "")
      .getOrCreate()

    import sparkSession.implicits._

    val df = Seq(
      ("abc"),
      ("dBf"),
      ("Aec")
    ).toDF("col1")

    val a = new A(df)
    // calling udf function
    a.testMethod().show(false)
    // calling upper function
    a.usingUpper().show(false)
  }
}
I hope this is helpful.
If I understand correctly, you would actually like some kind of factory to create this user-defined function for a specific class A.
This could be achieved using a type class which gets injected implicitly.
E.g. (I had to define UDF and DataFrame to be able to test this)
type UDF = String => String

case class DataFrame(col: String) {
  def select(in: String) = s"col:$col, in:$in"
}

trait UDFFactory[A] {
  def testUDF: UDF
}

implicit object UDFFactoryA extends UDFFactory[AClass] {
  def testUDF: UDF = _.toUpperCase
}

class AClass(df: DataFrame) {
  def testMethod(implicit factory: UDFFactory[AClass]) = {
    val demo = df.select(factory.testUDF(df.col))
    println(demo)
  }
}
val a = new AClass(DataFrame("test"))
a.testMethod // prints 'col:test, in:TEST'
Like you mentioned, create your UDF in an object body or companion class,
val myUDF = udf((str:String) => { str.toUpperCase })
Then for some dataframe df do this,
val res = df.withColumn("NEWCOLNAME", myUDF(col("OLDCOLNAME")))
This will change something like this,
+-------------------+
| OLDCOLNAME |
+-------------------+
| abc |
+-------------------+
to
+-------------------+-------------------+
| OLDCOLNAME | NEWCOLNAME |
+-------------------+-------------------+
| abc | ABC |
+-------------------+-------------------+
Let me know if this helped, Cheers.
Yes, that's possible, since functions are objects in Scala and can be passed around:
import org.apache.spark.sql.expressions.UserDefinedFunction

class A(df: DataFrame, testUdf: UserDefinedFunction) {
  def testMethod(): DataFrame = {
    df.select(testUdf(col("col1")))
  }
}
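A minimal usage sketch (my addition, assuming df has a col1 column as in the earlier examples) showing the udf being handed in:

import org.apache.spark.sql.functions.udf

val testUdf = udf { s: String => s.toUpperCase }

// The udf is constructed outside the class and simply passed in.
val a = new A(df, testUdf)
a.testMethod().show(false)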

NullPointerException when using UDF in Spark

I have a DataFrame in Spark such as this one:
var df = List(
  (1, "{NUM.0002}*{NUM.0003}"),
  (2, "{NUM.0004}+{NUM.0003}"),
  (3, "END(6)"),
  (4, "END(4)")
).toDF("CODE", "VALUE")
+----+---------------------+
|CODE| VALUE|
+----+---------------------+
| 1|{NUM.0002}*{NUM.0003}|
| 2|{NUM.0004}+{NUM.0003}|
| 3| END(6)|
| 4| END(4)|
+----+---------------------+
My task is to iterate through the VALUE column and do the following: check whether there is a substring such as {NUM.XXXX}, get the XXXX number, find the row where $"CODE" === XXXX, and replace the {NUM.XXXX} substring with the VALUE string from that row.
I would like the dataframe to look like this in the end:
+----+--------------------+
|CODE| VALUE|
+----+--------------------+
| 1|END(4)+END(6)*END(6)|
| 2| END(4)+END(6)|
| 3| END(6)|
| 4| END(4)|
+----+--------------------+
This is the best I've come up with:
val process = udf((ln: String) => {
  var newln = ln
  while (newln contains "{NUM.") {
    var num = newln.slice(newln.indexOf("{") + 5, newln.indexOf("}")).toInt
    var new_value = df.where($"CODE" === num).head.getAs[String](1)
    newln = newln.replace(newln.slice(newln.indexOf("{"), newln.indexOf("}") + 1), new_value)
  }
  newln
})

var df2 = df.withColumn("VALUE", when('VALUE contains "{NUM.", process('VALUE)).otherwise('VALUE))
Unfortunately, I get a NullPointerException when I try to filter/select/save df2, but no error when I just show df2. I believe the error appears because I access the DataFrame df inside the UDF, but I need to access it on every iteration, so I can't pass it as an input. I've also tried saving a copy of df inside the UDF, but I don't know how to do that. What can I do here?
Any suggestions to improve the algorithm are very welcome! Thanks!
I wrote something which works, but it is probably not very optimized. The NullPointerException comes from referencing df inside the UDF: the UDF runs on the executors, where that DataFrame reference is null, so a DataFrame cannot be used from within a UDF. Instead, I do recursive joins on the initial DataFrame to replace the NUMs by END. Here is the code:
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

case class Data(code: Long, value: String)

def main(args: Array[String]): Unit = {
  val sparkSession: SparkSession = SparkSession.builder().master("local").getOrCreate()

  val data = Seq(
    Data(1, "{NUM.0002}*{NUM.0003}"),
    Data(2, "{NUM.0004}+{NUM.0003}"),
    Data(3, "END(6)"),
    Data(4, "END(4)"),
    Data(5, "{NUM.0002}")
  )
  val initialDF = sparkSession.createDataFrame(data)

  val endDF = initialDF.filter(!(col("value") contains "{NUM"))
  val numDF = initialDF.filter(col("value") contains "{NUM")

  val resultDF = endDF.union(replaceNumByEnd(initialDF, numDF))
  resultDF.show(false)
}

val parseNumUdf = udf((value: String) => {
  if (value.contains("{NUM")) {
    val regex = """.*?\{NUM\.(\d+)\}.*""".r
    value match {
      case regex(code) => code.toLong
    }
  } else {
    -1L
  }
})

val replaceUdf = udf((value: String, replacement: String) => {
  val regex = """\{NUM\.(\d+)\}""".r
  regex.replaceFirstIn(value, replacement)
})

def replaceNumByEnd(initialDF: DataFrame, currentDF: DataFrame): DataFrame = {
  if (currentDF.count() == 0) {
    currentDF
  } else {
    val numDFWithCode = currentDF
      .withColumn("num_code", parseNumUdf(col("value")))
      .withColumnRenamed("code", "code_original")
      .withColumnRenamed("value", "value_original")

    val joinedDF = numDFWithCode.join(initialDF, numDFWithCode("num_code") === initialDF("code"))

    val replacedDF = joinedDF.withColumn("value_replaced", replaceUdf(col("value_original"), col("value")))

    val nextDF = replacedDF.select(col("code_original").as("code"), col("value_replaced").as("value"))

    val endDF = nextDF.filter(!(col("value") contains "{NUM"))
    val numDF = nextDF.filter(col("value") contains "{NUM")

    endDF.union(replaceNumByEnd(initialDF, numDF))
  }
}
If you need more explanation, don't hesitate.
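Another option, when the lookup table is small enough to collect to the driver, is to turn CODE -> VALUE into a plain Map, broadcast it, and resolve the references inside a single udf. This is my own sketch, not part of the answer above, and it assumes the usual spark session is in scope and that the references contain no cycles:

import org.apache.spark.sql.functions._

// Collect the lookup table to the driver; only reasonable for a small DataFrame.
val lookup = df.collect().map(r => r.getInt(0) -> r.getString(1)).toMap
val lookupB = spark.sparkContext.broadcast(lookup)

val numPattern = """\{NUM\.(\d+)\}""".r

// Repeatedly substitute {NUM.xxxx} with the referenced VALUE until none are left.
val resolve = udf { (value: String) =>
  var current = value
  while (numPattern.findFirstIn(current).isDefined) {
    current = numPattern.replaceAllIn(current, m => lookupB.value(m.group(1).toInt))
  }
  current
}

val resolved = df.withColumn("VALUE", resolve(col("VALUE")))
resolved.show(false)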

Spark scala udf error for if else

I am trying to define a Spark Scala UDF with the function getTime, but I am getting the error: illegal start of declaration. What might be wrong with the syntax? I also want to return the date and, if there is a parse exception, return some error string instead of null.
def getTime=udf((x:String) : java.sql.Timestamp => {
  if (x.toString() == "") return null
  else {
    val format = new SimpleDateFormat("yyyy-MM-dd' 'HH:mm:ss");
    val d = format.parse(x.toString());
    val t = new Timestamp(d.getTime());
    return t
  }
})
Thank you!
The return type of the udf is inferred and should not be specified. Change the first lines of the code to:
def getTime = udf((x: String) => {
  // your code
})
This should get rid of the error.
The following is fully working code written in a functional style and making use of Scala constructs:
val data: Seq[String] = Seq("", null, "2017-01-15 10:18:30")
val ds = spark.createDataset(data).as[String]
import java.text.SimpleDateFormat
import java.sql.Timestamp
val fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
// ********HERE is the udf completely re-written: **********
val f = udf((input: String) => {
Option(input).filter(_.nonEmpty).map(str => new Timestamp(fmt.parse(str).getTime)).orNull
})
val ds2 = ds.withColumn("parsedTimestamp", f($"value"))
The following is the output:
+-------------------+--------------------+
| value| parsedTimestamp|
+-------------------+--------------------+
| | null|
| null| null|
|2017-01-15 10:18:30|2017-01-15 10:18:...|
+-------------------+--------------------+
You should be using Scala datatypes, not Java datatypes. It would go like this:
def getTime(x: String): Timestamp = {
//your code here
}
You can easily do it in this way:
def getTimeFunction(timeAsString: String): java.sql.Timestamp = {
  if (timeAsString.isEmpty)
    null
  else {
    val format = new SimpleDateFormat("yyyy-MM-dd' 'HH:mm:ss")
    val date = format.parse(timeAsString)
    val time = new Timestamp(date.getTime())
    time
  }
}

val getTimeUdf = udf(getTimeFunction _)
Then use this getTimeUdf accordingly.
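To also cover the part of the question about returning an error string instead of null on a parse exception: because a column can only hold one type, one workaround (a sketch of my own, with a hypothetical column name) is to return everything as String and use scala.util.Try:

import java.text.SimpleDateFormat
import java.sql.Timestamp
import org.apache.spark.sql.functions.{col, udf}
import scala.util.Try

val getTimeOrErrorUdf = udf { (timeAsString: String) =>
  if (timeAsString == null || timeAsString.isEmpty) "ERROR: empty input"
  else {
    val format = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
    Try(new Timestamp(format.parse(timeAsString).getTime).toString)
      .getOrElse(s"ERROR: could not parse '$timeAsString'")
  }
}

// Hypothetical column name "time_string".
df.withColumn("parsedOrError", getTimeOrErrorUdf(col("time_string"))).show(false)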