Is it possible to use Option with spark UDF - scala

I'd like to use Option as input type for my functions.
udf((oa: Option[String], ob: Option[String]) => …)
to handle null values in a more functional way.
Is there a way to do that?

As far as I know, it is not directly possible. Nothing stops you from wrapping the arguments in Option yourself:
udf((oa: String, ob: String) => (Option(oa), Option(ob)) match {
...
})
using Dataset encoders:
val df = Seq(("a", None), ("b", Some("foo"))).toDF("oa", "ob")
df.as[(Option[String], Option[String])]
or adding some implicit conversions:
implicit def asOption[T](value: T) : Option[T] = Option(value)
def foo(oa: Option[String], ob: Option[String]) = {
oa.flatMap(a => ob.map(b => s"$a - $b"))
}
def wrap[T, U, V](f: (Option[T], Option[U]) => V) =
(t: T, u: U) => f(Option(t), Option(u))
val foo_ = udf(wrap(foo))
df.select(foo_($"oa", $"ob"))
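The wrapping idea can be checked without Spark on the classpath. The following is only a minimal sketch: the object name OptionWrapSketch and the values fooPlain, both and withNull are made up for illustration, and foo is written as a function value rather than a method. It shows that Option(null) becomes None, so a null argument flows through the Option-based logic as an empty value.

```scala
object OptionWrapSketch {
  // Lifts a function over Options into one over plain (possibly null) values.
  def wrap[T, U, V](f: (Option[T], Option[U]) => V): (T, U) => V =
    (t: T, u: U) => f(Option(t), Option(u))

  // Option-based logic, same idea as foo above (written as a function value here).
  val foo: (Option[String], Option[String]) => Option[String] =
    (oa, ob) => oa.flatMap(a => ob.map(b => s"$a - $b"))

  // Plain two-argument function, the shape that would be handed to udf.
  val fooPlain: (String, String) => Option[String] = wrap(foo)

  val both: Option[String] = fooPlain("a", "b")      // both inputs present
  val withNull: Option[String] = fooPlain("a", null) // null is turned into None
}
```

Here both is Some("a - b") and withNull is None. Passing fooPlain to udf is the step that still needs Spark.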

scala function with implicit parameters

I have scala function as below,
scala> def getOrders: (String, String) => Seq[String] = (user: String, apiToken: String) => Seq.empty[String]
def getOrders: (String, String) => Seq[String]
scala> getOrders("prayagupd", "A1B2C3")
val res0: Seq[String] = List()
I want to pass in a third parameter as an implicit parameter, but that does not seem possible for a function.
Here's what I want achieved using a method,
scala> def getOrders(user: String, apiToken: String)(implicit clientType: String) = Seq.empty[String]
def getOrders
(user: String, apiToken: String)(implicit clientType: String): Seq[String]
scala> implicit val clientType: String = "android"
implicit val clientType: String = "android"
scala> getOrders("prayagupd", "A1B2C3")
val res2: Seq[String] = List()
It does not seem possible because the apply method of a function is predefined and won't accept an extra implicit parameter.
scala> new Function2[String, String, Seq[String]] {
def apply(user: String, apiToken: String): Seq[String] = Seq.empty
}
val res4: (String, String) => Seq[String] = <function2>
Overloading does not do the trick either,
scala> new Function2[String, String, Seq[String]] {
def apply(user: String, apiToken: String): Seq[String] = Seq.empty
def apply(user: String, apiToken: String)(implicit clientType: String) = Seq("order1")
}
val res9: (String, String) => Seq[String] = <function2>
scala> implicit val clientType: String = "device"
implicit val clientType: String = "device"
scala> res9("prayagupd", "apiToken")
val res10: Seq[String] = List()
Is it that implicits are not recommended at all for functions or I'm missing something?
Experimentally, your function might be expressed as follows without the implicit:
scala> def getOrders: (String, String) => (String) => Seq[String] = (user: String, apiToken: String) => (clientType: String) => Seq.empty[String]
def getOrders: (String, String) => String => Seq[String]
Poking around on that... it doesn't like implicit anywhere in there that might give you what you want.
An answer to a related question suggests the reason: getOrders "... is a method, not a function, and eta-expansion (which converts methods to functions) is not attempted until after implicit application." It seems that implicits are resolved at a method level, not a function level.
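If the goal is just to end up with a plain (String, String) => Seq[String] value, one possible workaround is to keep the implicit on a method and build the function from it, so the implicit is resolved once, at the point where the function is created. This is a sketch under that assumption; the object name ImplicitCaptureSketch, the value getOrdersFn and the returned strings are invented for illustration.

```scala
object ImplicitCaptureSketch {
  // Method with an implicit parameter list, as in the question.
  def getOrders(user: String, apiToken: String)(implicit clientType: String): Seq[String] =
    Seq(s"$user via $clientType")

  implicit val clientType: String = "android"

  // The implicit is looked up here, where the lambda body calls the method,
  // so the resulting value is an ordinary two-argument function.
  val getOrdersFn: (String, String) => Seq[String] =
    (user, apiToken) => getOrders(user, apiToken)

  val orders: Seq[String] = getOrdersFn("prayagupd", "A1B2C3")
}
```

The trade-off is that the function value is fixed to whichever implicit was in scope when it was defined; callers of getOrdersFn cannot supply a different clientType.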

Scala: reflection and case classes

The following code succeeds, but is there a better way of doing the same thing? Perhaps something specific to case classes? In the following code, for each field of type String in my simple case class, the code goes through my list of instances of that case class and finds the length of the longest string of that field.
case class CrmContractorRow(
id: Long,
bankCharges: String,
overTime: String,
name$id: Long,
mgmtFee: String,
contractDetails$id: Long,
email: String,
copyOfVisa: String)
object Go {
def main(args: Array[String]) {
val a = CrmContractorRow(1,"1","1",4444,"1",1,"1","1")
val b = CrmContractorRow(22,"22","22",22,"55555",22,"nine long","22")
val c = CrmContractorRow(333,"333","333",333,"333",333,"333","333")
val rows = List(a,b,c)
c.getClass.getDeclaredFields.filter(p => p.getType == classOf[String]).foreach{f =>
f.setAccessible(true)
println(f.getName + ": " + rows.map(row => f.get(row).asInstanceOf[String].length).max)
}
}
}
Result:
bankCharges: 3
overTime: 3
mgmtFee: 5
email: 9
copyOfVisa: 3
If you want to do this kind of thing with Shapeless, I'd strongly suggest defining a custom type class that handles the complicated part and allows you to keep that stuff separate from the rest of your logic.
In this case it sounds like the tricky part of what you're specifically trying to do is getting the mapping from field names to string lengths for all of the String members of a case class. Here's a type class that does that:
import shapeless._, shapeless.labelled.FieldType
trait StringFieldLengths[A] { def apply(a: A): Map[String, Int] }
object StringFieldLengths extends LowPriorityStringFieldLengths {
implicit val hnilInstance: StringFieldLengths[HNil] =
new StringFieldLengths[HNil] {
def apply(a: HNil): Map[String, Int] = Map.empty
}
implicit def caseClassInstance[A, R <: HList](implicit
gen: LabelledGeneric.Aux[A, R],
sfl: StringFieldLengths[R]
): StringFieldLengths[A] = new StringFieldLengths[A] {
def apply(a: A): Map[String, Int] = sfl(gen.to(a))
}
implicit def hconsStringInstance[K <: Symbol, T <: HList](implicit
sfl: StringFieldLengths[T],
key: Witness.Aux[K]
): StringFieldLengths[FieldType[K, String] :: T] =
new StringFieldLengths[FieldType[K, String] :: T] {
def apply(a: FieldType[K, String] :: T): Map[String, Int] =
sfl(a.tail).updated(key.value.name, a.head.length)
}
}
sealed class LowPriorityStringFieldLengths {
implicit def hconsInstance[K, V, T <: HList](implicit
sfl: StringFieldLengths[T]
): StringFieldLengths[FieldType[K, V] :: T] =
new StringFieldLengths[FieldType[K, V] :: T] {
def apply(a: FieldType[K, V] :: T): Map[String, Int] = sfl(a.tail)
}
}
This looks complex, but once you start working with Shapeless a bit you learn to write this kind of thing in your sleep.
Now you can write the logic of your operation in a relatively straightforward way:
def maxStringLengths[A: StringFieldLengths](as: List[A]): Map[String, Int] =
as.map(implicitly[StringFieldLengths[A]].apply).foldLeft(
Map.empty[String, Int]
) {
case (x, y) => x.foldLeft(y) {
case (acc, (k, v)) =>
acc.updated(k, acc.get(k).fold(v)(accV => math.max(accV, v)))
}
}
And then (given rows as defined in the question):
scala> maxStringLengths(rows).foreach(println)
(bankCharges,3)
(overTime,3)
(mgmtFee,5)
(email,9)
(copyOfVisa,3)
This will work for absolutely any case class.
If this is a one-off thing, you might as well use runtime reflection, or you could use the Poly1 approach in Giovanni Caporaletti's answer—it's less generic and it mixes up the different parts of the solution in a way I don't prefer, but it should work just fine. If this is something you're doing a lot of, though, I'd suggest the approach I've given here.
If you want to use shapeless to get the string fields of a case class and avoid reflection you can do something like this:
import shapeless._
import labelled._
trait lowerPriorityfilterStrings extends Poly2 {
implicit def default[A] = at[Vector[(String, String)], A] { case (acc, _) => acc }
}
object filterStrings extends lowerPriorityfilterStrings {
implicit def caseString[K <: Symbol](implicit w: Witness.Aux[K]) = at[Vector[(String, String)], FieldType[K, String]] {
case (acc, x) => acc :+ (w.value.name -> x)
}
}
val gen = LabelledGeneric[CrmContractorRow]
val a = CrmContractorRow(1,"1","1",4444,"1",1,"1","1")
val b = CrmContractorRow(22,"22","22",22,"55555",22,"nine long","22")
val c = CrmContractorRow(333,"333","333",333,"333",333,"333","333")
val rows = List(a,b,c)
val result = rows
// get for each element a Vector of (fieldName -> stringField) pairs for the string fields
.map(r => gen.to(r).foldLeft(Vector[(String, String)]())(filterStrings))
// get the maximum for each "column"
.reduceLeft((best, row) => best.zip(row).map {
case (kv1 @ (_, v1), (_, v2)) if v1.length > v2.length => kv1
case (_, kv2) => kv2
})
result foreach { case (k, v) => println(s"$k: $v") }
You probably want to use Scala reflection:
import scala.reflect.runtime.universe._
val rm = runtimeMirror(getClass.getClassLoader)
val instanceMirrors = rows map rm.reflect
typeOf[CrmContractorRow].members collect {
  case m: MethodSymbol if m.isCaseAccessor && m.returnType =:= typeOf[String] =>
    val maxValue = instanceMirrors map (_.reflectField(m).get.asInstanceOf[String]) maxBy (_.length)
    println(s"${m.name}: $maxValue")
}
So that you can avoid issues with cases like:
case class CrmContractorRow(id: Long, bankCharges: String, overTime: String, name$id: Long, mgmtFee: String, contractDetails$id: Long, email: String, copyOfVisa: String) {
val unwantedVal = "jdjd"
}
Cheers
I have refactored your code to something more reusable:
import scala.reflect.ClassTag
case class CrmContractorRow(
id: Long,
bankCharges: String,
overTime: String,
name$id: Long,
mgmtFee: String,
contractDetails$id: Long,
email: String,
copyOfVisa: String)
object Go{
def main(args: Array[String]) {
val a = CrmContractorRow(1,"1","1",4444,"1",1,"1","1")
val b = CrmContractorRow(22,"22","22",22,"55555",22,"nine long","22")
val c = CrmContractorRow(333,"333","333",333,"333",333,"333","333")
val rows = List(a,b,c)
val initEmptyColumns = List.fill(a.productArity)(List())
def aggregateColumns[Tin:ClassTag,Tagg](rows: Iterable[Product], aggregate: Iterable[Tin] => Tagg) = {
val columnsWithMatchingType = (0 until rows.head.productArity).filter {
index => rows.head.productElement(index) match {case t: Tin => true; case _ => false}
}
def columnIterable(col: Int) = rows.map(_.productElement(col)).asInstanceOf[Iterable[Tin]]
columnsWithMatchingType.map(index => (index,aggregate(columnIterable(index))))
}
def extractCaseClassFieldNames[T: scala.reflect.ClassTag] = {
scala.reflect.classTag[T].runtimeClass.getDeclaredFields.filter(!_.isSynthetic).map(_.getName)
}
val agg = aggregateColumns[String,String] (rows,_.maxBy(_.length))
val fieldNames = extractCaseClassFieldNames[CrmContractorRow]
agg.map{case (index,value) => fieldNames(index) + ": "+ value}.foreach(println)
}
}
Using shapeless would get rid of the .asInstanceOf, but the essence would be the same. The main problem with the given code was that it was not re-usable since the aggregation logic was mixed with the reflection logic to get the field names.
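For completeness, on Scala 2.13 or later the field names are available from the Product interface itself through productElementNames, so neither Java reflection nor Shapeless is strictly needed. The sketch below assumes 2.13+ and uses a small made-up Row case class and a hypothetical maxStringLengths helper in place of CrmContractorRow, to keep it short.

```scala
object ProductNamesSketch {
  // Small stand-in for CrmContractorRow (hypothetical, for illustration only).
  final case class Row(id: Long, name: String, email: String)

  // For every String field, the length of the longest value across all rows.
  def maxStringLengths(rows: Seq[Product]): Map[String, Int] =
    rows
      // pair each field name with its value, for every row
      .flatMap(r => r.productElementNames.zip(r.productIterator).toList)
      // keep only String-valued fields and measure them
      .collect { case (field, s: String) => field -> s.length }
      // group by field name and keep the largest length per field
      .groupMapReduce(_._1)(_._2)((a, b) => math.max(a, b))

  val rows = List(Row(1, "Ann", "ann@x.org"), Row(2, "Bartholomew", "b@x.org"))
  val result: Map[String, Int] = maxStringLengths(rows)
}
```

With these two rows, result is Map("name" -> 11, "email" -> 9). The match on s: String is a runtime check, so this is less type safe than the Shapeless version.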

How does find function in Map work in Scala?

I am new to Scala. This is the code that I have written.
object Main extends App {
val mp: Map[String, String] = Map[String, String]("a"->"a", "b"->"b", "c"->"c", "d"->"d")
val s: Option[(String, String)] = mp.find((a: String, b: String) => {
if(a == "c" && b == "c") {
true
}
else {
false
}
})
println(s)
}
I am getting the following error.
error: type mismatch;
found : (String, String) => Boolean
required: ((String, String)) => Boolean
What am I doing wrong?
You need to change
mp.find((a: String, b: String) =>
to either
mp.find(((a: String, b: String)) =>
or
mp.find { case (a: String, b: String) =>
What you have coded is a function expecting two parameters, but find will pass in only one, which is a Pair (also called Tuple2). The extra parentheses, or the case keyword inside braces, are ways of specifying that the function takes a single parameter, which is an instance of a Pair.
The problem is that find expects a function that takes a single argument, a Tuple2 in this case, and returns a Boolean: ((String, String)) => Boolean. However, what you have there is a function that takes two arguments, a and b, not a tuple (the parentheses matter): (String, String) => Boolean.
Here is one way to fix it. In this case I use pattern matching to extract arguments:
object Main extends App {
val mp: Map[String, String] = Map[String, String]("a"->"a", "b"->"b", "c"->"c", "d"->"d")
val s: Option[(String, String)] = mp.find{ case(a, b) => a == "c" && b == "c" }
println(s)
}
Alternatively, you could also do:
val s: Option[(String, String)] = mp.find(t => t._1 == "c" && t._2 == "c")
Either would print:
Some((c,c))
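If you already have a two-argument predicate and would rather not rewrite it, the standard library can adapt it for you: Function2 has a tupled method, and Function.tupled does the same thing. A small sketch, where the object name FindTupledSketch and the value isC are made up for illustration:

```scala
object FindTupledSketch {
  val mp: Map[String, String] = Map("a" -> "a", "b" -> "b", "c" -> "c", "d" -> "d")

  // An ordinary two-argument predicate: (String, String) => Boolean.
  val isC: (String, String) => Boolean = (a, b) => a == "c" && b == "c"

  // tupled adapts it to ((String, String)) => Boolean, the shape find expects.
  val found: Option[(String, String)] = mp.find(isC.tupled)

  // Function.tupled is the equivalent helper on the Function companion object.
  val foundAgain: Option[(String, String)] = mp.find(Function.tupled(isC))
}
```

Both found and foundAgain are Some((c,c)).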

Replacing options in Scala with default values if there is nothing there

Suppose I have an item of the type (Option[Long], Option[Long], Option[Long], Option[Long], Option[Long]), and I want to convert it to an item of type (Long, Long, Long, Long, Long). I want each coordinate to contain the value of the option (if the option contains a "Some" value), or be zero otherwise.
Usually if I have an item of type Option[Long], I'd do something like
item match {
case Some(n) => n
case None => 0
}
But I can't do that with a 5 coordinate item unless I want to list out all 32 possibilities. What can I do instead?
Simple solution:
item match {
case (a, b, c, d, e) => (a.getOrElse(0L), b.getOrElse(0L), c.getOrElse(0L), d.getOrElse(0L), e.getOrElse(0L))
}
Obviously this isn't very generic. For that you'll probably want to look at Shapeless but I'll leave that answer to the resident experts. ;)
Using Shapeless you could do:
import shapeless._
import syntax.std.tuple._
import poly._
object defaultValue extends Poly1 {
implicit def defaultOptionLong = at[Option[Long]](_.getOrElse(0L))
}
val tuple : (Option[Long], Option[Long], Option[Long], Option[Long], Option[Long]) =
(Some(1L), None, Some(3L), Some(4L), None)
tuple.map(defaultValue)
// (Long, Long, Long, Long, Long) = (1,0,3,4,0)
You need to explicitly specify the type Option[Long] if you don't use Option.apply (see this question).
(Option(1L), Option(2L)).map(defaultValue)
// (Long, Long) = (1,2)
(Some(3L), Some(4L)).map(defaultValue) // does not compile
val t : (Option[Long], Option[Long]) = (Some(3L), Some(4L))
t.map(defaultValue)
// (Long, Long) = (3,4)
(Option(5), None).map(defaultValue) // does not compile
val t2: (Option[Long], Option[Long]) = (Option(5), None)
t2.map(defaultValue)
// (Long, Long) = (5,0)
We could also provide default values for other types:
object defaultValue extends Poly1 {
implicit def caseLong = at[Option[Long]](_.getOrElse(0L))
implicit def caseInt = at[Option[Int]](_.getOrElse(0))
implicit def caseString = at[Option[String]](_.getOrElse("scala"))
}
val tuple2 : (Option[Int], Option[Long], Option[String]) = (None, None, None)
tuple2.map(defaultValue)
// (Int, Long, String) = (0,0,scala)
Edit: The need to explicitly declare Some(5L) as Option[Long] can be avoided by using generics in the poly function:
object defaultValue extends Poly1 {
implicit def caseLong[L <: Option[Long]] = at[L](_.getOrElse(0L))
implicit def caseInt[I <: Option[Int]] = at[I](_.getOrElse(0))
implicit def caseString[S <: Option[String]] = at[S](_.getOrElse("scala"))
}
(Some("A"), Some(1), None: Option[Int], None: Option[String]).map(defaultValue)
// (String, Int, Int, String) = (A,1,0,scala)
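You can also use a for comprehension, as long as each fallback is itself an Option (orElse takes an Option, not a bare value):
val res = for {
a <- item._1.orElse(Some(0L))
b <- item._2.orElse(Some(0L))
c <- item._3.orElse(Some(0L))
d <- item._4.orElse(Some(0L))
e <- item._5.orElse(Some(0L))
} yield (a, b, c, d, e)
// res is an Option[(Long, Long, Long, Long, Long)]; it is always a Some because every step has a fallback
Not the nicest, but easy to implement and understand.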
Another possible solution (note that the elements are Long, so the pattern must match Long rather than Int):
item.productIterator.collect{
case Some(a: Long) => a
case _ => 0L
}.toList match {
case List(a,b,c,d,e) => (a,b,c,d,e)
case _ => (0L,0L,0L,0L,0L) //or throw exception depending on your logic
}
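If the tuple shape is fixed, it may be worth pinning the input and output types down in a small helper so the compiler checks that every coordinate really ends up as a Long. A minimal sketch, where the names OptionDefaultsSketch, Opt5, orZero and defaults are made up for illustration:

```scala
object OptionDefaultsSketch {
  type Opt5 = (Option[Long], Option[Long], Option[Long], Option[Long], Option[Long])

  // Default for a single coordinate.
  def orZero(o: Option[Long]): Long = o.getOrElse(0L)

  // Destructure once, apply the default to each coordinate.
  def defaults(item: Opt5): (Long, Long, Long, Long, Long) = {
    val (a, b, c, d, e) = item
    (orZero(a), orZero(b), orZero(c), orZero(d), orZero(e))
  }

  val result: (Long, Long, Long, Long, Long) =
    defaults((Some(1L), None, Some(3L), Some(4L), None))
}
```

Here result is (1,0,3,4,0). Unlike the productIterator version, nothing is cast or matched at runtime.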

Scala match function against variable

When I'm matching on values of case classes, such as:
sealed abstract class Op
case class UOp[T, K](f: T => K) extends Op
case class BOp[T, Z, K](f: (T, Z) => K) extends Op
like this:
def f(op: Op): Int =
op match
{
case BOp(g) => g(1,2)
case UOp(g) => g(0)
}
the compiler infers it as
val g: (Nothing, Nothing) => Any
val g: Nothing => Any
Why am I getting Nothing as the type? Is it because of JVM type erasure? Are there elegant ways to match functions against variables?
I came up with this "hackish" solution, maybe there are other ways or cleaner ways to do this still without relying on reflection.
Define a few partial functions which will handle various args:
scala> val f: PartialFunction[Any, String] = { case (x: Int, y: String) => y * x }
f: PartialFunction[Any,String] = <function1>
scala> val g: PartialFunction[Any, String] = { case x: Int => x.toString }
g: PartialFunction[Any,String] = <function1>
scala> def h: PartialFunction[Any, BigDecimal] = { case (a: Int, b: Double, c: Long) => BigDecimal(a) + b + c }
h: PartialFunction[Any,BigDecimal]
scala> val l: List[PartialFunction[Any, Any]] = f :: g :: h :: Nil
l: List[PartialFunction[Any,Any]] = List(<function1>, <function1>, <function1>)
Check which functions can handle different inputs:
scala> l.map(_.isDefinedAt(1))
res0: List[Boolean] = List(false, true, false)
scala> l.map(_.isDefinedAt((1, "one")))
res1: List[Boolean] = List(true, false, false)
Given input find and apply a function:
scala> def applyFunction(input: Any): Option[Any] = {
| l find (_.isDefinedAt(input)) map (_ (input))
| }
applyFunction: (input: Any)Option[Any]
scala> applyFunction(1)
res1: Option[Any] = Some(1)
scala> applyFunction((2, "one"))
res2: Option[Any] = Some(oneone)
scala> applyFunction("one")
res3: Option[Any] = None
scala> applyFunction(1, 1.1, 9L)
res10: Option[Any] = Some(11.1)
This looks quite type unsafe and there must be better ways to do this.
I think the magnet pattern should handle this well, in a more type-safe manner.
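On the original question: the Nothing does not look like an erasure effect. Op itself carries no type parameters, so when you match on a plain Op nothing constrains T, Z and K, and the compiler appears to fall back to Nothing for the function's argument types and Any for its result. If the operations only ever work on Int, which is an assumption on my part based on g(1,2) and g(0), the simplest fix is to put the concrete types into the case classes. A sketch, with the object name TypedOps and the method name eval invented here:

```scala
object TypedOps {
  sealed trait Op
  // Concrete function types instead of free type parameters.
  final case class UOp(f: Int => Int) extends Op
  final case class BOp(f: (Int, Int) => Int) extends Op

  // g now has a usable type in each branch, so it can be applied directly.
  def eval(op: Op): Int = op match {
    case BOp(g) => g(1, 2)
    case UOp(g) => g(0)
  }

  val sum: Int = eval(BOp(_ + _))  // applies the binary op to 1 and 2
  val succ: Int = eval(UOp(_ + 1)) // applies the unary op to 0
}
```

Here sum is 3 and succ is 1. If the operations need to stay generic, the type parameters would have to appear on Op as well so that the match has something to infer them from.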