UDF usage in spark - scala

I have a custom UDF registered in Spark. When I try to use that UDF, it throws an error saying it cannot be accessed.
I tried it like this:
spark.udf.register("rssi_weightage", FilterMap.rssi_weightage)
val filterop = input_data.groupBy($"tagShortID", $"Timestamp", $"ListenerShortID", $"rootOrgID", $"subOrgID").agg(first(rssi_weightage($"RSSI").as("RSSI_Weight")))
It shows an error at first(rssi_weightage($"RSSI")): rssi_weightage not found.
Any help will be appreciated.

This is not how you use the UDF; the actual UDF is the return value of spark.udf.register. So you can do:
val udf_rssi_weightage = spark.udf.register("rssi_weightage", FilterMap.rssi_weightage)
val filterop = input_data.groupBy($"tagShortID", $"Timestamp", $"ListenerShortID", $"rootOrgID", $"subOrgID").agg(first(udf_rssi_weightage($"RSSI")).as("RSSI_Weight"))
But in your case you do not need to register the UDF at all; just use org.apache.spark.sql.functions.udf to convert a regular function into a UDF:
val udf_rssi_weightage = udf(FilterMap.rssi_weightage)
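For completeness, a minimal usage sketch (assuming FilterMap.rssi_weightage is a plain Scala function, e.g. Double => Double, so that udf can infer its types):
import org.apache.spark.sql.functions.{udf, first}

val udf_rssi_weightage = udf(FilterMap.rssi_weightage)
val filterop = input_data
  .groupBy($"tagShortID", $"Timestamp", $"ListenerShortID", $"rootOrgID", $"subOrgID")
  .agg(first(udf_rssi_weightage($"RSSI")).as("RSSI_Weight"))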

I suppose you have an issue with the way you're defining the udf function.
The next snippet takes a slightly different approach: the udf is defined directly as an anonymous function:
import org.apache.spark.sql.functions._
val data = sqlContext.read.json(sc.parallelize(Seq("{'foo' : 'Bar'}", "{'foo': 'Baz'}")))
val example = Seq("Bar", "Bazzz")
val urbf = udf { foo: String => if (example.contains(foo)) 1 else 0 }
data.select($"foo", urbf($"foo")).show
+---+--------+
|foo|UDF(foo)|
+---+--------+
|Bar|       1|
|Baz|       0|
+---+--------+

Related

Write transformation in typesafe way

How do I write the below code in a typesafe manner in Spark Scala with the Dataset API:
val schema: StructType = Encoders.product[CaseClass].schema
//read json from a file
val readAsDataSet: Dataset[CaseClass] = sparkSession.read.option("mode", mode).schema(schema).json(path).as[CaseClass]
//below code needs to be written in type safe way:
val someDF= readAsDataSet.withColumn("col1",explode(col("col_to_be_exploded")))
.select(from_unixtime(col("timestamp").divide(1000))
.as("date"), col("col1"))
As someone in the comments said, you can create a Dataset[CaseClass] and do your operations there. Let's set it up:
import spark.implicits._
case class MyTest (timestamp: Long, col_explode: Seq[String])
val df = Seq(
  MyTest(1673850366000L, Seq("some", "strings", "here")),
  MyTest(1271850365998L, Seq("pasta", "with", "cream")),
  MyTest(611850366000L, Seq("tasty", "food"))
).toDF("timestamp", "col_explode").as[MyTest]
df.show(false)
+-------------+---------------------+
|timestamp    |col_explode          |
+-------------+---------------------+
|1673850366000|[some, strings, here]|
|1271850365998|[pasta, with, cream] |
|611850366000 |[tasty, food]        |
+-------------+---------------------+
Typically, you can do many operations with the map function and the Scala language.
A map function returns the same number of elements as the input has. The explode function that you're using, however, does not return the same number of elements. You can implement this behaviour using the flatMap function.
So, using the Scala language and the flatMap function together, you can do something like this:
import java.time.LocalDateTime
import java.time.ZoneOffset
case class Exploded(datetime: String, exploded: String)

val output = df.flatMap { case MyTest(timestamp, col_explode) =>
  col_explode.map { value =>
    val date = LocalDateTime.ofEpochSecond(timestamp / 1000, 0, ZoneOffset.UTC).toString
    Exploded(date, value)
  }
}
output.show(false)
+-------------------+--------+
|datetime           |exploded|
+-------------------+--------+
|2023-01-16T06:26:06|some    |
|2023-01-16T06:26:06|strings |
|2023-01-16T06:26:06|here    |
|2010-04-21T11:46:05|pasta   |
|2010-04-21T11:46:05|with    |
|2010-04-21T11:46:05|cream   |
|1989-05-22T14:26:06|tasty   |
|1989-05-22T14:26:06|food    |
+-------------------+--------+
As you can see, we've created a second case class called Exploded which we use to type our output dataset. The output dataset has the type org.apache.spark.sql.Dataset[Exploded], so everything is completely type safe.
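Note that LocalDateTime.toString produces the ISO format shown above. If you want the "yyyy-MM-dd HH:mm:ss" formatting that from_unixtime in the original question produces by default, you can format explicitly inside the flatMap (a sketch, still assuming UTC):
import java.time.format.DateTimeFormatter

val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")
// replaces the `val date = ...` line inside the flatMap above
val date = LocalDateTime.ofEpochSecond(timestamp / 1000, 0, ZoneOffset.UTC).format(fmt)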

How to convert org.apache.spark.sql.ColumnName to string,Decimal type in Spark Scala?

I have a JSON like below
{"name":"method1","parameter1":"P1name","parameter2": 1.0}
I am loading my JSON file
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val df = sqlContext.read.json("C:/Users/test/Desktop/te.txt")
scala> df.show()
+-------+----------+----------+
|   name|parameter1|parameter2|
+-------+----------+----------+
|method1|    P1name|       1.0|
+-------+----------+----------+
I have a function like below:
def method1(P1: String, P2: Double) = {
  print(P1)
  print(P2)
}
I am calling method1 based on the column name; after executing the code below it should execute method1.
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._
df.withColumn("methodCalling", when($"name" === "method1", method1($"parameter1",$"parameter2")).otherwise(when($"name" === "method2", method2($"parameter1",$"parameter2")))).show(false)
But I am getting the below error.
<console>:63: error: type mismatch;
found : org.apache.spark.sql.ColumnName
required: String
Please let me know how to convert the org.apache.spark.sql.ColumnName data type to String.
When you pass arguments as
method1($"parameter1",$"parameter2")
you are passing Column objects to the function, not primitive datatypes. So I would suggest you change method1 and method2 into udf functions if you want to apply primitive-datatype manipulations inside them. A udf function has to return a value for each row of the new column.
import org.apache.spark.sql.functions._
def method1 = udf((P1: String, P2: Double) => {
  print(P1)
  print(P2)
  P1 + P2
})

def method2 = udf((P1: String, P2: Double) => {
  print(P1)
  print(P2)
  P1 + P2
})
Then your withColumn API call should work properly:
df.withColumn("methodCalling", when($"name" === "method1", method1($"parameter1",$"parameter2")).otherwise(when($"name" === "method2", method2($"parameter1",$"parameter2")))).show(false)
Note: udf functions perform data serialization and deserialization in order to process column values row-wise, which adds overhead and increases memory usage. Built-in Spark functions should be used whenever possible.
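For instance, if method1 only concatenates parameter1 and parameter2 (which is what P1+P2 does in the udf above), the same result can be obtained with built-in functions alone (a sketch):
import org.apache.spark.sql.functions.{when, concat, col}

df.withColumn("methodCalling",
  when($"name" === "method1", concat(col("parameter1"), col("parameter2").cast("string")))
).show(false)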
You can try it like this:
scala> def method1(P1:String, P2:Double): Int = {
| println(P1)
| println(P2)
| 0
| }
scala> def method2(P1:String, P2:Double): Int = {
| println(P1)
| println(P2)
| 1
| }
df.withColumn("methodCalling", when($"name" === "method1", method1(df.select($"parameter1").map(_.getString(0)).collect.head,df.select($"parameter2").map(_.getDouble(0)).collect.head))
.otherwise(when($"name" === "method2", method2(df.select($"parameter1").map(_.getString(0)).collect.head,df.select($"parameter2").map(_.getDouble(0)).collect.head)))).show
//output
P1name
1.0
+-------+----------+----------+-------------+
|   name|parameter1|parameter2|methodCalling|
+-------+----------+----------+-------------+
|method1|    P1name|       1.0|            0|
+-------+----------+----------+-------------+
You have to return something from your method, otherwise it will return Unit and give an error after printing the result:
java.lang.RuntimeException: Unsupported literal type class scala.runtime.BoxedUnit ()
at org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:75)
at org.apache.spark.sql.functions$.lit(functions.scala:101)
at org.apache.spark.sql.functions$.when(functions.scala:1245)
... 50 elided
Thanks.
I think you just want to read the JSON and, based on that, call the methods.
Since you have already created a dataframe, you can do something like:
df.map(row => (row.getString(0), row.getString(1), row.getDouble(2))).collect
  .foreach { x =>
    x._1.trim.toLowerCase match {
      case "method1" => method1(x._2, x._3)
      //case "method2" => method2(x._2, x._3)
      //case _ => methodn(x._2, x._3)
    }
  }
// Output : P1name1.0
// Because you used `print` and not `println` ;)

passing UDF to a method or class

I have a UDF say
val testUDF = udf { s: String => s.toUpperCase }
I want to create this UDF in a separate method, or maybe in something else like an implementation class, and pass it to another class which uses it. Is this possible?
Say, suppose I have a class A:
class A(df: DataFrame) {
  def testMethod(): DataFrame = {
    val demo = df.select(testUDF(col))
  }
}
Class A should be able to use the UDF. Can this be achieved?
Given a dataframe as
+----+
|col1|
+----+
|abc |
|dBf |
|Aec |
+----+
And a udf function
import org.apache.spark.sql.functions._
val testUDF = udf{s: String=>s.toUpperCase}
You can definitely use that udf function from another class as
val demo = df.select(testUDF(col("col1")).as("upperCasedCol"))
which should give you
+-------------+
|upperCasedCol|
+-------------+
|ABC          |
|DBF          |
|AEC          |
+-------------+
But I would suggest you use other functions where possible, as a udf requires columns to be serialized and deserialized, which consumes more time and memory than the built-in functions. A udf should be the last choice.
You can use upper function for your case
val demo = df.select(upper(col("col1")).as("upperCasedCol"))
This will generate the same output as the original udf function
I hope the answer is helpful
Updated
Since your question asks how to call a udf function defined in another class or object, here is the method.
Suppose you have an object where you defined the udf function, or the function I suggested, as:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._
object UDFs {
  def testUDF = udf { s: String => s.toUpperCase }
  def testUpper(column: Column) = upper(column)
}
Your A class is as in your question; I just added another function:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._
class A(df: DataFrame) {
  def testMethod(): DataFrame = {
    val demo = df.select(UDFs.testUDF(col("col1")))
    demo
  }

  def usingUpper() = {
    df.select(UDFs.testUpper(col("col1")))
  }
}
Then you can call the functions from main as below
import org.apache.spark.sql.SparkSession
object TestUpper {
  def main(args: Array[String]): Unit = {
    val sparkSession = SparkSession.builder().appName("Simple Application")
      .master("local")
      .config("", "")
      .getOrCreate()

    import sparkSession.implicits._

    val df = Seq(
      ("abc"),
      ("dBf"),
      ("Aec")
    ).toDF("col1")

    val a = new A(df)
    //calling udf function
    a.testMethod().show(false)
    //calling upper function
    a.usingUpper().show(false)
  }
}
I hope this is helpful.
If I understand correctly you would actually like some kind of factory to create this user-defined-function for a specific class A.
This could be achieved using a type class which gets injected implicitly.
E.g. (I had to define UDF and DataFrame to be able to test this)
type UDF = String => String

case class DataFrame(col: String) {
  def select(in: String) = s"col:$col, in:$in"
}

trait UDFFactory[A] {
  def testUDF: UDF
}

implicit object UDFFactoryA extends UDFFactory[AClass] {
  def testUDF: UDF = _.toUpperCase
}

class AClass(df: DataFrame) {
  def testMethod(implicit factory: UDFFactory[AClass]) = {
    val demo = df.select(factory.testUDF(df.col))
    println(demo)
  }
}
val a = new AClass(DataFrame("test"))
a.testMethod // prints 'col:test, in:TEST'
As you mentioned, create a method exactly like your UDF in your object body or companion class:
val myUDF = udf((str:String) => { str.toUpperCase })
Then, for some dataframe df, do this:
val res = df.withColumn("NEWCOLNAME", myUDF(col("OLDCOLNAME")))
This will change something like this,
+----------+
|OLDCOLNAME|
+----------+
|       abc|
+----------+
to
+----------+----------+
|OLDCOLNAME|NEWCOLNAME|
+----------+----------+
|       abc|       ABC|
+----------+----------+
Let me know if this helped, Cheers.
Yes, that's possible, as functions are objects in Scala and can be passed around:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.col

class A(df: DataFrame, testUdf: UserDefinedFunction) {
  def testMethod(): DataFrame = {
    df.select(testUdf(col("col1"))) // apply the injected udf to whichever column you need
  }
}
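A minimal usage sketch for this constructor-injection approach (the udf and the column name "col1" are just for illustration):
import org.apache.spark.sql.functions.udf

val toUpperUdf = udf { s: String => s.toUpperCase }
val a = new A(df, toUpperUdf)
a.testMethod().show(false)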

Spark Rdd handle different fields of each row by field name

I am new to Spark and Scala, and I'm now stuck on a problem: how to handle a different field of each row by field name, and then put the result into a new RDD.
This is my pseudo code:
val newRdd = df.rdd.map(x => {
  def Random1 => random(1, 10000)     //pseudo
  def Random2 => random(10000, 20000) //pseudo
  x.schema.map(y => {
    if (y.name == "XXX1")
      x.getAs[y.dataType](y.name)) = Random1
    else if (y.name == "XXX2")
      x.getAs[y.dataType](y.name)) = Random2
    else
      x.getAs[y.dataType](y.name)) //pseudo, keep the same
  })
})
There are at least 2 problems in the above:
in the second map, "x.getAs" is invalid syntax
how to put the result into a new RDD
I have been searching on the net for a long time, but with no luck. Please help or give me some ideas on how to achieve this.
Thanks Ramesh Maharjan, it works now.
def randomString(len: Int): String = {
  val rand = new scala.util.Random(System.nanoTime)
  val sb = new StringBuilder(len)
  val ab = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
  for (i <- 0 until len) {
    sb.append(ab(rand.nextInt(ab.length)))
  }
  sb.toString
}

def testUdf = udf((value: String) => randomString(2))
val df = sqlContext.createDataFrame(Seq((1,"Android"), (2, "iPhone")))
df.withColumn("_2", testUdf(df("_2")))
+---+---+
| _1| _2|
+---+---+
| 1| F3|
| 2| Ag|
+---+---+
If you are intending to keep only certain fields, "XXX1" and "XXX2", then the simple select function should do the trick:
df.select("XXX1", "XXX2")
and then convert that to an RDD, as in the one-liner below.
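For example (a minimal sketch):
val newRdd = df.select("XXX1", "XXX2").rdd  // RDD[org.apache.spark.sql.Row]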
If you are intending something else, then your x.getAs should look as below, with a concrete type parameter rather than y.dataType:
val random1 = x.getAs[String](y.name)
It seems that you are trying to change the values in the columns "XXX1" and "XXX2".
For that, a simple udf function and withColumn should do the trick.
A simple udf function is as below:
def testUdf = udf((value: String) => {
  //do your logic here; whatever you return from here becomes the value of the new column for that row
})
And you can call the udf function as
df.withColumn("XXX1", testUdf(df("XXX1")))
Similarly, you can do it for "XXX2", as in the sketch below.
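A minimal sketch chaining the same pattern for both columns (testUdf as defined above; put whatever random-value logic you need inside it):
val newDf = df
  .withColumn("XXX1", testUdf(df("XXX1")))
  .withColumn("XXX2", testUdf(df("XXX2")))
val newRdd = newDf.rdd  // only if you still need an RDD afterwards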

Spark scala udf error for if else

I am trying to define a Spark Scala UDF with the function getTime, but I am getting the error: illegal start of declaration. What might be wrong in the syntax? The UDF should return the date, and if there is a parse exception it should return some error string instead of returning null.
def getTime=udf((x:String) : java.sql.Timestamp => {
if (x.toString() == "") return null
else { val format = new SimpleDateFormat("yyyy-MM-dd' 'HH:mm:ss");
val d = format.parse(x.toString());
val t = new Timestamp(d.getTime()); return t
}})
Thank you!
The return type for the udf is derived and should not be specified. Change the first line of code to:
def getTime = udf((x: String) => {
  // your code
})
This should get rid of the error.
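For reference, a minimal corrected version of the original udf (a sketch; note also that the return keyword should not be used inside the lambda, since it attempts a non-local return from the enclosing method):
import java.text.SimpleDateFormat
import java.sql.Timestamp
import org.apache.spark.sql.functions.udf

def getTime = udf((x: String) => {
  if (x == null || x.isEmpty) null
  else {
    val format = new SimpleDateFormat("yyyy-MM-dd' 'HH:mm:ss")
    new Timestamp(format.parse(x).getTime)
  }
})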
The following is fully working code, written in a functional style and making use of Scala constructs:
val data: Seq[String] = Seq("", null, "2017-01-15 10:18:30")
val ds = spark.createDataset(data).as[String]
import java.text.SimpleDateFormat
import java.sql.Timestamp
val fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
// ********HERE is the udf completely re-written: **********
val f = udf((input: String) => {
Option(input).filter(_.nonEmpty).map(str => new Timestamp(fmt.parse(str).getTime)).orNull
})
val ds2 = ds.withColumn("parsedTimestamp", f($"value"))
The following is the output:
+-------------------+--------------------+
|              value|     parsedTimestamp|
+-------------------+--------------------+
|                   |                null|
|               null|                null|
|2017-01-15 10:18:30|2017-01-15 10:18:...|
+-------------------+--------------------+
You should be using Scala datatypes, not Java datatypes. It would go like this:
def getTime(x: String): Timestamp = {
//your code here
}
You can easily do it this way:
import java.text.SimpleDateFormat
import java.sql.Timestamp
import org.apache.spark.sql.functions.udf

def getTimeFunction(timeAsString: String): java.sql.Timestamp = {
  if (timeAsString.isEmpty)
    null
  else {
    val format = new SimpleDateFormat("yyyy-MM-dd' 'HH:mm:ss")
    val date = format.parse(timeAsString)
    val time = new Timestamp(date.getTime())
    time
  }
}
val getTimeUdf = udf(getTimeFunction _)
Then use this getTimeUdf accordingly, as in the sketch below.
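For example (a sketch; the column name "event_time" is hypothetical, and the Try-based variant is just one way to cover the "return some string on a parse exception" part of the question, at the cost of that column becoming a String rather than a Timestamp):
import scala.util.Try
import org.apache.spark.sql.functions.{udf, col}

// plain usage of the udf defined above
val withTime = df.withColumn("parsedTime", getTimeUdf(col("event_time")))

// hypothetical variant that returns an error marker instead of failing on a parse exception
val getTimeOrErrorUdf = udf((s: String) =>
  if (s == null || s.isEmpty) "EMPTY_INPUT"
  else Try(getTimeFunction(s).toString).getOrElse("PARSE_ERROR")
)
val withTimeOrError = df.withColumn("parsedTimeOrError", getTimeOrErrorUdf(col("event_time")))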