Spark: UDF not reading already defined value - scala

I have a function written that I am trying to apply to a dataframe via a UDF. It applies a category based on the value in a particular column. The function makes use of a value defined earlier in my code. The code looks like this:
object myFuncs extends App {
  val sc = new SparkContext()
  val sqlContext = new org.apache.spark.sql.SQLContext(sc)

  val categories = List(10.0, 20.0)

  def makeCategory(value: Double): String = {
    if (value < categories(0)) "< 10"
    else if (value >= categories(0) && value < categories(1)) "10 to 20"
    else ">= 20"
  }

  val myFunc = udf(makeCategory _)

  val df = sqlContext.parquetFile("hdfs:/to/my/file.parquet")
    .withColumn("category", myFunc(col("myColumn")))
}
This produces a NullPointerException when it tries to read the categories variable inside the function. This works fine if I explicitly define the categories variable inside the function. Ultimately, I want to pass that in as an arg so I can't define it inside the function.
Any explanation why it won't read values defined outside the function in the UDF? Any suggestion on how to make this work without explicitly defining the values in the function? I tried using the lit function to pass the categories as an argument, but it didn't accept a List as a lit value.

The simple solution is to pass the categories into the UDF call itself; then it will work fine. You have to change your function as follows:
def makeCategory(value: Double, categoriesString: String): String = {
  val categories = categoriesString.split(",").map(_.toDouble)
  if (value < categories(0)) "< 10"
  else if (value >= categories(0) && value < categories(1)) "10 to 20"
  else ">= 20"
}
Now you can register this function as a UDF, but you have to use it like the following (the second argument must be wrapped in lit so it becomes a Column):
val df = sqlContext.parquetFile("hdfs:/to/my/file.parquet")
  .withColumn("category", myFunc(col("myColumn"), lit("10,20")))
Hopefully this helps in your case.
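As for the "why" part of the question: a likely explanation (hedged, since it depends on how the job is launched) is the extends App. Vals in a scala.App object are initialized through DelayedInit, so when the enclosing object is serialized into the UDF's closure and deserialized on an executor, categories can still be null there. The Spark documentation recommends defining a main method instead of extending scala.App; a minimal sketch:
object myFuncs {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext()
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    val categories = List(10.0, 20.0) // initialized normally, before the UDF closure captures it
    // ... define makeCategory, register the UDF and build the DataFrame as above ...
  }
}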

Related

pass accumulators to spark udf

This is a simplified version of what I am trying to do. I want to do some counting inside my UDF. One way of doing it, I think, is to pass Long accumulators to the UDF and increment the accumulators inside the if/else branches of the deserializeProtobuf function, but I am not able to get the syntax working. Can anyone help me with that? Is there any better way?
def deserializeProtobuf(raw_data: Array[Byte]) = {
  val input_stream = new ByteArrayInputStream(raw_data)
  val parsed_data = CustomClass.parseFrom(input_stream)
  if (/* condition 1 related to parsed_data */) {
    // increment variable1
  } else if (/* condition 2 related to parsed_data */) {
    // increment variable2
  } else {
    // increment variable3
  }
}
val decode = udf(deserializeProtobuf _)
val deserialized_data = ds.withColumn("data", decode(col("protobufData")))
I have done something like this before. If you are doing heavy lifting in your CustomClass, one thing I can suggest is to broadcast it; you can also instantiate your metrics on the broadcast variable.
Now, coming to the counting part: I tried accumulators, but it was quite difficult to manage them inside a UDF to get a correct count over a window, so I used spark-metrics instead and sent the counts at a regular interval.
Use this: https://github.com/groupon/spark-metrics
Make sure to initialise the metrics at broadcast-variable creation time; from that point on, the copied variable will report to the same metrics.
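A minimal sketch of the broadcast idea, under the assumption that the heavy parsing object is serializable; ExpensiveParser and its parse method are made-up placeholders, and ds is the dataset from the question:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{udf, col}

// Hypothetical stand-in for an expensive-to-construct parser around CustomClass.
class ExpensiveParser extends Serializable {
  def parse(raw: Array[Byte]): String = new String(raw, "UTF-8") // placeholder logic
}

val spark = SparkSession.builder().getOrCreate()

// Build the heavy object once on the driver and ship it to every executor.
val parserBc = spark.sparkContext.broadcast(new ExpensiveParser())

val decode = udf { (raw: Array[Byte]) =>
  parserBc.value.parse(raw) // each task reuses the broadcast instance
}

val deserialized = ds.withColumn("data", decode(col("protobufData")))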
You shouldn't have to pass the accumulator to the UDF:
import org.apache.spark.util.{AccumulatorV2, LongAccumulator}
import org.apache.spark.sql.functions.{udf, col}

var acc1: LongAccumulator = null

def my_udf = udf((arg1: String) => {
  // ... the rest of the per-row logic ...
  acc1.add(1)
  arg1 // the UDF must return a value for the new column
})

val spark = SparkSession...
acc1 = spark.sparkContext.longAccumulator("acc1")

... withColumn("col_name", my_udf(col("...")))
// some action here to cause the withColumn to execute
System.err.println(s"${acc1.value}")
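Applying that pattern to the deserializeProtobuf case from the question might look roughly like this; it is only a sketch, and CustomClass, condition1 and condition2 stand in for the question's parsing logic. Keep in mind that accumulator updates made inside transformations can be over-counted if tasks are retried:
import java.io.ByteArrayInputStream
import org.apache.spark.util.LongAccumulator
import org.apache.spark.sql.functions.{udf, col}

var count1: LongAccumulator = null
var count2: LongAccumulator = null
var count3: LongAccumulator = null

def decode = udf { (raw: Array[Byte]) =>
  val parsed = CustomClass.parseFrom(new ByteArrayInputStream(raw))
  if (condition1(parsed)) count1.add(1)
  else if (condition2(parsed)) count2.add(1)
  else count3.add(1)
  parsed.toString // return something the new column can hold
}

count1 = spark.sparkContext.longAccumulator("count1")
count2 = spark.sparkContext.longAccumulator("count2")
count3 = spark.sparkContext.longAccumulator("count3")

val deserialized = ds.withColumn("data", decode(col("protobufData")))
deserialized.count() // an action is needed before the accumulators hold their values
println(s"${count1.value}, ${count2.value}, ${count3.value}")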

Matching Column name from Csv file in spark scala

I want to take the headers (column names) from my CSV file and match them against my existing headers.
I am using the code below:
val cc = sparksession.read.csv(filepath).take(1)
It's giving me a value like:
Array([id,name,salary])
and I have created one more static schema, which gives me a value like this:
val ss = Array("id", "name", "salary")
and then I'm trying to compare the column names using an if condition:
if (cc == ss) {
  println("matched")
} else {
  println("not matched")
}
I guess that due to the [] and () mismatch it always goes to the else part. Is there any other way to compare these values without considering the [] and ()?
First, for convenience, set the header option to true when reading the file:
val df = sparksession.read.option("header", true).csv(filepath)
Get the column names and define the expected column names:
val cc = df.columns
val ss = Array("id", "name", "salary")
To check if the two match (not considering the ordering):
if (cc.toSet == ss.toSet) {
  println("matched")
} else {
  println("not matched")
}
If the order is relevant, then the condition can be done as follows (you can't use Array here but Seq works):
cc.toSeq == ss.toSeq
or use a deep array comparison:
cc.deep == ss.deep
First of all, I think you are trying to compare an Array[org.apache.spark.sql.Row] with an Array[String]. I believe you should change how you load the headers, to something like: val cc = spark.read.format("csv").option("header", "true").load(fileName).columns.
Then you could compare using cc.deep == ss.deep.
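A compact sketch of that approach; sameElements is used here as an order-sensitive alternative to .deep, which is no longer available in recent Scala versions (fileName and the expected header are assumed as above):
val cc = spark.read.format("csv").option("header", "true").load(fileName).columns
val ss = Array("id", "name", "salary")

if (cc.sameElements(ss)) println("matched")
else println("not matched")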
The code below worked for me:
val cc = spark.read.csv("filepath").take(1)(0).toString
The above code gave the output as the String [id,name,salary].
Then I created one static schema as
val ss = "[id,name,salary]"
and then wrote the if/else conditions.

Spark UDF as function parameter, UDF is not in function scope

I have a few UDFs that I'd like to pass as function arguments along with dataframes.
One way to do this might be to create the UDF within the function, but that would create and destroy several instances of the UDF without reusing it, which might not be the best way to approach this problem.
Here's a sample piece of code -
val lkpUDF = udf { (i: Int) => if (i > 0) 1 else 0 }

val df = inputDF1
  .withColumn("new_col", lkpUDF(col("c1")))
val df2 = inputDF2
  .withColumn("new_col", lkpUDF(col("c1")))
Instead of doing the above, I'd ideally want to do something like this -
val lkpUDF = udf { (i: Int) => if (i > 0) 1 else 0 }

def appendCols(df: DataFrame, lkpUDF: ?): DataFrame = {
  df.withColumn("new_col", lkpUDF(col("c1")))
}
val df = appendCols(inputDF, lkpUDF)
The above UDF is pretty simple, but in my case it can return a primitive type or a user defined case class type. Any thoughts/ pointers would be much appreciated. Thanks.
Your function with the appropriate signature needs to be this:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.UserDefinedFunction
import org.apache.spark.sql.functions.col

def appendCols(df: DataFrame, func: UserDefinedFunction): DataFrame = {
  df.withColumn("new_col", func(col("c1")))
}
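Usage then stays exactly as sketched in the question, for example:
val result = appendCols(inputDF1, lkpUDF)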
The Scala REPL is quite helpful here, since it shows the type of an initialized value:
scala> val lkpUDF = udf{(i: Int) => if (i > 0) 1 else 0}
lkpUDF: org.apache.spark.sql.UserDefinedFunction = UserDefinedFunction(<function1>,IntegerType,List(IntegerType))
Also, if the function that you pass into the udf wrapper has an Any return type (which will be the case if the function can return either a primitive or a user-defined case class), creating the UDF will fail with an exception like:
java.lang.UnsupportedOperationException: Schema for type Any is not supported
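On the case-class point: a concrete case class return type works, because Spark can derive a schema for Product types; it is only an Any return type that fails. A small sketch (the Code case class here is made up for illustration):
import org.apache.spark.sql.functions.udf

case class Code(label: String, value: Int)

// Produces a struct column with fields "label" and "value".
val caseClassUDF = udf { (i: Int) => if (i > 0) Code("positive", 1) else Code("non-positive", 0) }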

Create new column with function in Spark Dataframe

I'm trying to figure out the new dataframe API in Spark. It seems like a good step forward, but I'm having trouble doing something that should be pretty simple. I have a dataframe with 2 columns, "ID" and "Amt". As a generic example, say I want to return a new column called "Code" that contains a code based on the value of "Amt". I can write a function something like this:
def coder(myAmt: Integer): String = {
  if (myAmt > 100) "Little"
  else "Big"
}
When I try to use it like this:
val myDF = sqlContext.parquetFile("hdfs:/to/my/file.parquet")
myDF.withColumn("Code", coder(myDF("Amt")))
I get type mismatch errors
found : org.apache.spark.sql.Column
required: Integer
I've tried changing the input type on my function to org.apache.spark.sql.Column, but then I start getting compile errors in the function because it wants a Boolean in the if statement.
Am I doing this wrong? Is there a better/another way to do this than using withColumn?
Thanks for your help.
Let's say you have an "Amt" column in your schema:
import org.apache.spark.sql.functions._
val myDF = sqlContext.parquetFile("hdfs:/to/my/file.parquet")
val coder: (Int => String) = (arg: Int) => {if (arg < 100) "little" else "big"}
val sqlfunc = udf(coder)
myDF.withColumn("Code", sqlfunc(col("Amt")))
I think withColumn is the right way to add a column
We should avoid defining udf functions wherever possible, because of the overhead of serializing and deserializing the column values.
You can achieve the same result with the built-in when function, as below:
import org.apache.spark.sql.functions.when
val myDF = sqlContext.parquetFile("hdfs:/to/my/file.parquet")
myDF.withColumn("Code", when(myDF("Amt") < 100, "Little").otherwise("Big"))
Another way of doing this:
You can still write an ordinary function, but given the error above, you need to wrap it in udf and define it as a value.
Example:
val coder = udf((myAmt: Integer) => {
  if (myAmt > 100) "Little"
  else "Big"
})
Now this statement works perfectly:
myDF.withColumn("Code", coder(myDF("Amt")))

passing a code block to method without execution

I have the following code:
import java.io._
import com.twitter.chill.{Input, Output, ScalaKryoInstantiator}
import scala.reflect.ClassTag
object serializer {
  val instantiator = new ScalaKryoInstantiator
  instantiator.setRegistrationRequired(false)
  val kryo = instantiator.newKryo()

  def load[T](file: _ => _, name: String, cls: Class[T]): T = {
    if (java.nio.file.Files.notExists(new File(name).toPath())) {
      val temp = file
      val baos = new FileOutputStream(name)
      val output = new Output(baos, 4096)
      kryo.writeObject(output, temp)
      temp.asInstanceOf[T]
    }
    else {
      println("loading from " + name)
      val baos = new FileInputStream(name)
      val input = new Input(baos)
      kryo.readObject(input, cls)
    }
  }
}
I want to use it in this way:
val mylist = serializer.load((1 to 100000).toList,"allAdj.bin",classOf[List[Int]])
I don't want to run (1 to 100000).toList every time, so I want to pass it to the serializer and let it decide whether to compute and serialize it on the first run, or load it from the file on later runs.
The problem is that the code block runs first in my code. How can I pass the code block without executing it?
P.S. Is there any Scala tool that does exactly this for me?
To have parameters not be evaluated before being passed, use pass-by-name, like this:
def method(param: => ParamType)
Whatever you pass won't be evaluated at the time you pass it, but it will be evaluated each time you use param, which might not be what you want either. To have it evaluated only the first time you use it, do this:
def method(param: => ParamType) = {
  lazy val p: ParamType = param
  // ... use only p in the body ...
}
Then use only p in the body. The first time p is used, param will be evaluated and the value will be stored. All other uses of p will use the stored value.
Note that this happens every time you invoke method. That is, if you call method twice, it won't use the "stored" value of p -- it will evaluate it again on first use. If you want to "pre-compute" something, then perhaps you'd be better off with a class instead?
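Put together, a sketch of how the load method from the question could look with a by-name parameter, assuming it replaces the method inside the serializer object (so kryo is in scope); the asInstanceOf is no longer needed because the parameter is typed as T:
import java.io.{File, FileInputStream, FileOutputStream}
import com.twitter.chill.{Input, Output}

def load[T](file: => T, name: String, cls: Class[T]): T = {
  if (java.nio.file.Files.notExists(new File(name).toPath())) {
    val temp = file // the by-name argument is evaluated only here, on the cache-miss path
    val output = new Output(new FileOutputStream(name), 4096)
    kryo.writeObject(output, temp)
    output.close()
    temp
  } else {
    println("loading from " + name)
    val input = new Input(new FileInputStream(name))
    val result = kryo.readObject(input, cls)
    input.close()
    result
  }
}

// Usage stays the same; the list is now only built when the cache file does not exist yet.
val mylist = serializer.load((1 to 100000).toList, "allAdj.bin", classOf[List[Int]])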