Define return value in Spark Scala UDF

Imagine the following code:
def myUdf(arg: Int) = udf((vector: MyData) => {
  // complex logic that returns a Double
})
How can I define the return type for myUdf so that people looking at the code will know immediately that it returns a Double?

I see two ways to do it: either define a method first and then lift it to a function:
def myMethod(vector: MyData): Double = {
  // complex logic that returns a Double
}
val myUdf = udf(myMethod _)
or define a function first with an explicit type:
val myFunction: Function1[MyData, Double] = (vector: MyData) => {
  // complex logic that returns a Double
}
val myUdf = udf(myFunction)
I normally use the first approach for my UDFs.

Spark's functions object defines several overloaded udf methods with the signature shape static <RT, A1, ..., A10> UserDefinedFunction.
You can specify the input/output data types in square brackets as follows:
def myUdf(arg: Int) = udf[Double, MyData]((vector: MyData) => {
  // complex logic that returns a Double
})

You can pass a type parameter to udf, but you must, somewhat counter-intuitively, pass the return type first, followed by the input types, as in [ReturnType, ArgTypes...], at least as of Spark 2.3.x. Using the original example (which appears to be a curried function, based on arg):
def myUdf(arg: Int) = udf[Double, Seq[Int]]((vector: Seq[Int]) => {
  13.37 // whatever
})

There is nothing special about UDFs with lambda functions; they behave just like ordinary Scala lambdas (see Specifying the lambda return type in Scala below), so you could do:
def myUdf(arg: Int) = udf(((vector: MyData) => {
  // complex logic that returns a Double
}): (MyData => Double))
or instead explicitly define your function:
def myFuncWithArg(arg: Int): MyData => Double = {
  def myFunc(vector: MyData): Double = {
    // complex logic that returns a Double. Use arg here.
  }
  myFunc _
}
def myUdf(arg: Int) = udf(myFuncWithArg(arg))
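Whichever form you choose, the resulting UserDefinedFunction is applied to columns as usual. A minimal usage sketch, assuming a hypothetical DataFrame df with a suitable "vector" column:

import org.apache.spark.sql.functions.col

// df and its "vector" column are assumptions for illustration
val scored = df.withColumn("score", myUdf(42)(col("vector")))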

Related

Passing empty-parameter functions as arguments

Here are two functions to be used as UDFs:
def nextString(): String = Random.nextString(10)
def plusOne(a: Int): Int = a + 1
def udfString = udf(nextString)
def udfInt = udf(plusOne)
If I try to use them with withColumn, udfInt works perfectly fine, but udfString throws: can't use Char in Schema.
Probably because it infers (Int) => Int for udfInt, which is what udf expects, but treats nextString as a plain String value, which leads to the assumption that I am trying to extract Chars when I apply the function.
It will work if I do something like:
def myUDF = udf(() => nextString)
which seems ugly for something that simple. Is there any way to pass nextString as a function, not as a String?
When you write the following code:
def udfString = udf(nextString)
it's the same as writing
val s = nextString
def udfString = udf(s)
This compiles because a String can itself be used as a function Int => Char (via the implicit conversion to a sequence of characters, which extends Int => Char).
You can tell the compiler that you are passing the function itself to udf by using eta-expansion:
def udfString = udf(nextString _)
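A minimal sketch of the difference in plain Scala (no Spark required); the first line compiles only because of the implicit String-to-character-sequence conversion:

import scala.util.Random

def nextString(): String = Random.nextString(10)

val asChars: Int => Char = nextString()     // the *result* is passed and viewed as Int => Char
val asIntended: () => String = nextString _ // eta-expansion passes the function itself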

Scala - evaluate function calls sequentially until one returns

I have a few 'legacy' endpoints that can return the Data I'm looking for.
def mainCall(id): Data {
  maybeMyDataInEndpoint1(id: UUID): DataA
  maybeMyDataInEndpoint2(id: UUID): DataB
  maybeMyDataInEndpoint3(id: UUID): DataC
}
null can be returned if no DataX is found.
The return types of the methods differ, but there is a convert method for turning each DataX into a unified Data.
The endpoints are not Scala-ish.
What is the best Scala approach to evaluate those method calls sequentially until I have the value I need?
In pseudocode I would do something like:
val myData = maybeMyDataInEndpoint1 getOrElse maybeMyDataInEndpoint2 getOrElse maybeMyDataInEndpoint3
I'd use an easier approach, though the other answers use more elaborate language features.
Just use Option() to catch the null and chain with orElse. I'm assuming methods convertX(d: DataX): Data for explicit conversion. As the data might not be found at all, we return an Option:
def mainCall(id: UUID): Option[Data] =
  Option(maybeMyDataInEndpoint1(id)).map(convertA)
    .orElse(Option(maybeMyDataInEndpoint2(id)).map(convertB))
    .orElse(Option(maybeMyDataInEndpoint3(id)).map(convertC))
Maybe you can lift these methods into a List of functions and use collectFirst, like:
val fs = List(maybeMyDataInEndpoint1 _, maybeMyDataInEndpoint2 _, maybeMyDataInEndpoint3 _)
val f = (a: UUID) => fs.collectFirst {
  case u if u(a) != null => u(a)
}
f(myUUID)
(Note that the matching endpoint is invoked twice here, once in the guard and once in the body, and that the endpoints need a common result type for the List to typecheck.)
The best Scala approach IMHO is to do things in the most straightforward way.
To handle optional values (or nulls from Java land), use Option.
To sequentially evaluate a list of methods, fold over a Seq of functions.
To convert from one data type to another, use either (1.) implicit conversions or (2.) regular functions depending on the situation and your preference.
(Edit) Assuming implicit conversions:
def legacyEndpoint[A](endpoint: UUID => A)(implicit convert: A => Data) =
  (id: UUID) => Option(endpoint(id)).map(convert)

val legacyEndpoints = Seq(
  legacyEndpoint(maybeMyDataInEndpoint1),
  legacyEndpoint(maybeMyDataInEndpoint2),
  legacyEndpoint(maybeMyDataInEndpoint3)
)

def mainCall(id: UUID): Option[Data] =
  legacyEndpoints.foldLeft(Option.empty[Data])(_ orElse _(id))
(Edit) Using explicit conversions:
def legacyEndpoint[A](endpoint: UUID => A)(convert: A => Data) =
  (id: UUID) => Option(endpoint(id)).map(convert)

val legacyEndpoints = Seq(
  legacyEndpoint(maybeMyDataInEndpoint1)(fromDataA),
  legacyEndpoint(maybeMyDataInEndpoint2)(fromDataB),
  legacyEndpoint(maybeMyDataInEndpoint3)(fromDataC)
)
... // same as before
Here is one way to do it.
(1) You can make your convert methods implicit (or wrap them into implicit wrappers) for convenience.
(2) Then use Stream to build a chain from the method calls. You should give type inference a hint that you want your stream to contain Data elements (not DataX, as returned by the legacy methods), so that the appropriate implicit convert is applied to the result of each legacy method call.
(3) Since Stream is lazy and evaluates its tail "by name", only the first method gets called so far. At this point you can apply a lazy filter to skip null results.
(4) Now you can actually evaluate the chain, getting the first non-null result with headOption.
(HACK) Unfortunately, Scala's type inference (at the time of writing, v2.12.4) is not powerful enough to allow using the #:: stream method unless you guide it every step of the way. Using cons makes inference happy but is cumbersome. Building the stream with the vararg apply method of the companion object is not an option either, since Scala does not support "by-name" varargs yet. In my example below I use a combination of the stream and toLazyData methods: stream is a generic helper that builds streams from 0-arg functions, and toLazyData is an implicit "by-name" conversion designed to interplay with the implicit convert functions that convert from DataX to Data.
Here is the demo that demonstrates the idea with more detail:
object Demo {
  case class Data(value: String)
  class DataA
  class DataB
  class DataC

  def maybeMyDataInEndpoint1(id: String): DataA = {
    println("maybeMyDataInEndpoint1")
    null
  }
  def maybeMyDataInEndpoint2(id: String): DataB = {
    println("maybeMyDataInEndpoint2")
    new DataB
  }
  def maybeMyDataInEndpoint3(id: String): DataC = {
    println("maybeMyDataInEndpoint3")
    new DataC
  }

  implicit def convert(data: DataA): Data = if (data == null) null else Data(data.toString)
  implicit def convert(data: DataB): Data = if (data == null) null else Data(data.toString)
  implicit def convert(data: DataC): Data = if (data == null) null else Data(data.toString)

  implicit def toLazyData[T](value: => T)(implicit convert: T => Data): () => Data = () => convert(value)

  def stream[T](xs: (() => T)*): Stream[T] = {
    xs.toStream.map(_())
  }

  def main(args: Array[String]): Unit = {
    val chain = stream(
      maybeMyDataInEndpoint1("1"),
      maybeMyDataInEndpoint2("2"),
      maybeMyDataInEndpoint3("3")
    )
    val result = chain.filter(_ != null).headOption.getOrElse(Data("default"))
    println(result)
  }
}
This prints:
maybeMyDataInEndpoint1
maybeMyDataInEndpoint2
Data(Demo$DataB@16022d9d)
Here maybeMyDataInEndpoint1 returns null, so maybeMyDataInEndpoint2 needs to be invoked, delivering a DataB; maybeMyDataInEndpoint3 never gets invoked since we already have the result.
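As a side note, Stream is deprecated in favor of LazyList as of Scala 2.13; under that assumption, a sketch of the equivalent helper:

// Scala 2.13+ variant of the stream helper above
def lazyChain[T](xs: (() => T)*): LazyList[T] =
  xs.to(LazyList).map(_())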
I think @g.krastev's answer is perfectly good for your use case and you should accept it. I'm just expanding a bit on it to show how you can make the last step slightly better with cats.
First, the boilerplate:
import java.util.UUID
final case class DataA(i: Int)
final case class DataB(i: Int)
final case class DataC(i: Int)
type Data = Int
def convertA(a: DataA): Data = a.i
def convertB(b: DataB): Data = b.i
def convertC(c: DataC): Data = c.i
def maybeMyDataInEndpoint1(id: UUID): DataA = DataA(1)
def maybeMyDataInEndpoint2(id: UUID): DataB = DataB(2)
def maybeMyDataInEndpoint3(id: UUID): DataC = DataC(3)
This is basically what you have, written so that you can copy/paste it into the REPL and have it compile.
Now, let's first declare a way to turn each of your endpoints into something safe and unified:
def makeSafe[A, B](evaluate: UUID ⇒ A, f: A ⇒ B): UUID ⇒ Option[B] =
  id ⇒ Option(evaluate(id)).map(f)
With this in place, you can, for example, call the following to turn maybeMyDataInEndpoint1 into a UUID => Option[Data]:
makeSafe(maybeMyDataInEndpoint1, convertA)
The idea is now to turn your endpoints into a list of UUID => Option[Data] and fold over that list. Here's your list:
val endpoints = List(
  makeSafe(maybeMyDataInEndpoint1, convertA),
  makeSafe(maybeMyDataInEndpoint2, convertB),
  makeSafe(maybeMyDataInEndpoint3, convertC)
)
You can now fold over it manually, which is what @g.krastev did:
def mainCall(id: UUID): Option[Data] =
  endpoints.foldLeft(None: Option[Data])(_ orElse _(id))
If you're fine with a cats dependency, the notion of folding over a list of options is just a concrete use case of a common pattern (the interaction of Foldable and Monoid):
import cats._
import cats.implicits._
def mainCall(id: UUID): Option[Data] = endpoints.foldMap(_(id))
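One caveat: the default Monoid[Option[A]] in cats combines the inner values with Semigroup[A], so with Data = Int this foldMap sums every non-null hit rather than keeping the first one. If first-wins semantics matter, a hedged sketch using Foldable's collectFirstSome (available in recent cats versions, via the imports above), which stops at the first defined result:

def mainCallFirst(id: UUID): Option[Data] =
  endpoints.collectFirstSome(_(id))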
There are other ways to make this nicer still, but they might be overkill in this context; I'd probably declare a type class to turn any type into a Data, say, to give makeSafe a cleaner type signature.
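For the curious, a rough sketch of that type class idea; every name below (ToData, toData) is made up for illustration:

// Hypothetical type class: evidence that A can be turned into Data
trait ToData[A] {
  def toData(a: A): Data
}

object ToData {
  implicit val dataA: ToData[DataA] = new ToData[DataA] { def toData(a: DataA): Data = a.i }
  implicit val dataB: ToData[DataB] = new ToData[DataB] { def toData(b: DataB): Data = b.i }
  implicit val dataC: ToData[DataC] = new ToData[DataC] { def toData(c: DataC): Data = c.i }
}

// makeSafe now needs only one type parameter
def makeSafe[A](evaluate: UUID => A)(implicit ev: ToData[A]): UUID => Option[Data] =
  id => Option(evaluate(id)).map(ev.toData)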

Creating a modified `filter` function

Consider the filter function.
I am interested in the following modifications of the filter function, if possible:
We know for a collection we can do:
case class People(val age: Int)
val a: List[People] = ...
a.filter(i => i.age == 10)
Or more simply:
a.filter(_.age == 10)
Is there any simple way I can define another, modified filter that works just like the following (no underscore)?
a.myfilter1(age == 10)
The filter function does not work when its argument is not a Boolean. Suppose I want to create a modified filter that, when given a non-Boolean, translates it to an equality check automatically. Here is an example:
val anotherPerson: People = ...
a.myFilter2(anotherPerson)
I want the above myFilter2 to be translated as follows:
a.filter(_.equals(anotherPerson))
Using implicit def:
case class MyFilterable[T](seq: Seq[T]) {
  def suchAFilter(v: Any): Seq[T] = {
    seq.filter(v.equals)
  }
}

implicit def strongFilter[T](seq: Seq[T]): MyFilterable[T] = {
  MyFilterable(seq)
}
println(List(1,2,3).suchAFilter(2))
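For what it's worth, the same extension is more idiomatically written as an implicit class (Scala 2.10+), which fuses the wrapper and the implicit def into one definition (note that implicit classes must live inside an object, class, or trait):

implicit class MyFilterOps[T](val seq: Seq[T]) extends AnyVal {
  def suchAFilter(v: Any): Seq[T] = seq.filter(v.equals)
}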

Spark Key/Value filter Function

I have data in a Key Value pairing. I am trying to apply a filter function to the data that looks like:
def filterNum(x: Int): Boolean = {
  if (decimalArr.contains(x)) return true
  else return false
}
My Spark code that has:
val numRDD = columnRDD.filter(x => filterNum(x(0)))
but that won't work, and when I send in the following:
val numRDD = columnRDD.filter(x => filterNum(x))
I get the error:
<console>:23: error: type mismatch;
found : (Int, String)
required: Int
val numRDD = columnRDD.filter(x => filterNum(x))
I have also tried other things, like changing the inputs to the function.
This is because RDD.filter is passing in the Key-Value Tuple, (Int, String), and filterNum is expecting an Int, which is why the first attempt works: tuple(index) pulls out the value at that index of the tuple.
You could change your filter function to be
def filterNum(x: (Int, String)): Boolean = {
  if (decimalArr.contains(x._1)) return true
  else return false
}
Although, personally, I would use a more terse version, since the true/false is already baked into contains and you can use the expression directly:
columnRDD.filter(decimalArr.contains(_._1))
Or, if you don't like the underscore syntax:
columnRDD.filter(x=>decimalArr.contains(x._1))
Also, do not use return in Scala; the last evaluated expression is the return value automatically.
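For example, the tuple-accepting filterNum above collapses to a single expression:

def filterNum(x: (Int, String)): Boolean = decimalArr.contains(x._1)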

Specifying the lambda return type in Scala

Note: this is a theoretical question, I am not trying to fix anything, nor am I trying to achieve any effect for a practical purpose
When creating a lambda in Scala using the (arguments)=>expression syntax, can the return type be explicitly provided?
Lambdas are no different from methods in that both are specified as expressions, but as far as I understand it, the return type of a method is easily defined with the def name(arguments): ReturnType = expression syntax.
Consider this (illustrative) example:
def sequence(start: Int, next: Int => Int): () => Int = {
  var x: Int = start
  // How can I denote that this function should return an integer?
  () => {
    var result: Int = x
    x = next(x)
    result
  }
}
You can always declare the type of an expression by appending : and the type. So, for instance:
((x: Int) => x.toString): (Int => String)
This is useful if you, for instance, have a big complicated expression and you don't want to rely upon type inference to get the types straight.
{
  if (foo(y)) x => Some(bar(x))
  else x => None
}: (Int => Option[Bar])
// Without type ascription, need (x: Int)
But it's probably even clearer if you assign the result to a temporary variable with a specified type:
val fn: Int => Option[Bar] = {
  if (foo(y)) x => Some(bar(x))
  else _ => None
}
Let's say you have this function:
def mulF(a: Int, b: Int): Long = {
  a.toLong * b
}
The same function can be written as a lambda with defined input and output types:
val mulLambda: (Int, Int) => Long = (x: Int, y: Int) => { x.toLong * y }
x => x: SomeType
I did not know the answer myself, as I never had the need for it, but my gut feeling was that this would work, and trying it in a worksheet confirmed it.
Edit: I provided this answer before the examples above existed. It is true that it is not needed in the concrete example, but in the rare cases where you do need it, the syntax I showed will work.