Spark SQL UDF from a string which represents scala code at runtime - scala

I need to be able to register a UDF from a string which I will get from a web service, i.e. at run time I call a web service to get the Scala code which constitutes the UDF, compile it and register it as a UDF in the Spark context. As an example, let's say my web service returns the following Scala code in a JSON response -
(row: Row, field: String) => {
  import scala.util.{Try, Success, Failure}

  val index: Int = Try(row.fieldIndex(field)) match {
    case Success(_) => 1
    case Failure(_) => 0
  }
  index
}
I want to compile this code on the fly and then register it as a UDF. I have already explored multiple options such as using toolbox, Twitter's Eval util, etc., but found that I need to explicitly specify the argument types of the method while creating an instance, for example -
val code =
  q"""
    (a: String, b: String) => {
      a + b
    }
  """
val compiledCode = toolBox.compile(code)
val compiledFunc = compiledCode().asInstanceOf[(String, String) => Option[Any]]
This UDF takes two strings as arguments, hence I need to specify the types while creating the object, like
compiledCode().asInstanceOf[(String, String) => Option[Any]]
The other option I explored is
https://stackoverflow.com/a/34371343/1218856
In both cases I have to know the number of arguments, the argument types and the return type beforehand to instantiate the code as a method. But in my case, as the UDFs are created by my users, I have no control over the number of arguments and their types, so I would like to know if there is any way I can register the UDF by compiling the Scala code without knowing the argument number and type information.
In a nutshell, I get the code as a string, compile it and register it as a UDF without knowing the type information.

I think you'd be much better off not trying to generate and execute code directly, but instead defining a different kind of expression language and executing that. Something like ANTLR could help you write the grammar of that expression language and generate the parser and the abstract syntax trees, or you could even use Scala's parser combinators. It's of course more work, but it's also a far less risky and error-prone way of allowing custom function execution.
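To make that concrete, here is a rough sketch of the expression-language idea using Scala's parser combinators. Everything here is illustrative: the tiny grammar (numeric literals, identifiers that refer to field values, + and *), the Map-based environment, and the object and method names are assumptions, not a drop-in solution.

import scala.util.parsing.combinator.JavaTokenParsers

object TinyExprParser extends JavaTokenParsers {
  type Env = Map[String, Double]

  // expr := term { "+" term }
  def expr: Parser[Env => Double] =
    term ~ rep("+" ~> term) ^^ { case t ~ ts => (env: Env) => ts.foldLeft(t(env))(_ + _(env)) }

  // term := factor { "*" factor }
  def term: Parser[Env => Double] =
    factor ~ rep("*" ~> factor) ^^ { case f ~ fs => (env: Env) => fs.foldLeft(f(env))(_ * _(env)) }

  // factor := number | field reference | parenthesised expression
  def factor: Parser[Env => Double] =
    floatingPointNumber ^^ (n => (_: Env) => n.toDouble) |
    ident ^^ (name => (env: Env) => env(name)) |
    "(" ~> expr <~ ")"

  def compile(source: String): Env => Double = parseAll(expr, source) match {
    case Success(f, _)      => f
    case failure: NoSuccess => sys.error(failure.msg)
  }
}

// The compiled function could then back a UDF over known columns, e.g.:
// val f = TinyExprParser.compile("price * qty + 1")
// spark.udf.register("customExpr", (price: Double, qty: Double) => f(Map("price" -> price, "qty" -> qty)))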

Related

Nested JSON in scala, Circe, Slick

I have nested JSON in my database. I have figured out the case classes for it. I am using circe, Slick and Akka HTTP in my web API application.
My case classes are:
case class Sen(
  sentences: Array[File]
)

case class File(
  content: String
)
I have written GetResult instances for the same nesting, but I have problems with the array in the case class.
implicit lazy val getFile = GetResult(r => Array[File](r.<<))
implicit lazy val SenObj = GetResult(r => Sen(getFile(r)))
Can anyone tell me how to solve this?
Following is the error I get while compiling
Error:diverging implicit expansion for type slick.jdbc.GetResult[T]
starting with method createGetTuple22 in object GetResult
implicit lazy val getFile = GetResult(r => Array[File](r.<<))
Your definition of getFile is manually constructing an Array, and specifically you're asking for an Array[File]. There's no GetResult[File], meaning that r.<< won't be able to convert a column value into a File.
Can anyone tell me how to solve this?
You'll at least need a GetResult[File] defined.
However, it's not clear from the question how the JSON part is intended to work:
Perhaps you have a column containing text which your application treats as JSON. If that's the case, I suggest doing JSON array conversion outside of your Slick code layer. That will keep the mapping to and from the database straightforward.
Or perhaps you have a JSON type in your database and you're using database-specific operations. In that case, I guess it'll depend on what control you have there, and it probably does make sense to try to do JSON-array operations at the Slick layer. (That's the case for Postgres, for example, via the slick-pg library.)
But that's a different question.
As a general note, I suggest always being explicit about the types of GetResult you are defining:
implicit lazy val getFile: GetResult[Array[File]] =
  GetResult(r => Array[File](r.<<))
implicit lazy val SenObj: GetResult[Sen] =
  GetResult(r => Sen(getFile(r)))
...to be clear about what instances you have available. I find that helps in debugging these situations.
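As a sketch of the first option (this assumes the sentences column stores a JSON array of objects and that circe does the decoding; the column name, query and error handling are made up for illustration):

import io.circe.generic.auto._
import io.circe.parser.decode
import slick.jdbc.PostgresProfile.api._

// Case classes as in the question.
case class File(content: String)
case class Sen(sentences: Array[File])

// Slick only ever sees a plain String column; the JSON handling happens afterwards.
def toSen(rawJson: String): Either[io.circe.Error, Sen] =
  decode[Seq[File]](rawJson).map(files => Sen(files.toArray))

// val rawRows = sql"select sentences from sen_table".as[String]
// rawRows.map(_.map(toSen))  // DBIO of decoded Sen values (or decoding errors)

That keeps the database mapping trivial and pushes the JSON concerns into ordinary, testable code.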

using jmock in Scala with type parameterized function

I want to test a function that writes output from an RDD in Scala Spark.
Part of this test is mocking a map on an RDD, using jMock:
val customerRdd = mockery.mock(classOf[RDD[Customer]], "rdd1")
val transformedRddToWrite = mockery.mock(classOf[RDD[TransformedCustomer]], "rdd2")

mockery.checking(new Expectations() {{
  // ...
  oneOf(customerRdd).map(
    `with`(Expectations.any(classOf[Customer => TransformedCustomer]))
  )
  will(Expectations.returnValue(transformedRddToWrite))
  // ...
}})
However, whenever I try to run this test, I get the following error:
not all parameters were given explicit matchers: either all parameters must be specified by matchers or all must be specified by values, you cannot mix matchers and values
despite the fact that I have specified matchers for all parameters to .map.
How do I fix this? Can jMock support matching on Scala function arguments with implicit ClassTags?
I thought jMock had been abandoned since 2012, but if you like it, more power to you. One of the issues is that map requires a ClassTag[U] according to its signature:
def map[U: ClassTag](f: T => U): RDD[U]
where U is the return type of your function.
I am going to heavily assume that if you want to make this work with a Java mocking framework, you should go under the assumption that map's signature is effectively public <U> RDD<U> map(scala.Function1<T, U>, scala.reflect.ClassTag<U>); in other words, the implicit ClassTag is a second parameter that also needs a matcher.
Hope that works.
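Under that assumption, a sketch of the expectation with matchers for both parameters (the ClassTag supplied explicitly as the second parameter list on the Scala side) might look like this; purely illustrative, not something I have run:

import scala.reflect.ClassTag

mockery.checking(new Expectations() {{
  oneOf(customerRdd).map(
    `with`(Expectations.any(classOf[Customer => TransformedCustomer]))
  )(
    `with`(Expectations.any(classOf[ClassTag[TransformedCustomer]]))
  )
  will(Expectations.returnValue(transformedRddToWrite))
}})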

spark register expression for SQL DSL

How can I access a Catalyst expression (not a regular UDF) via the Spark SQL Scala DSL API?
http://geospark.datasyslab.org only allows for text-based execution:
GeoSparkSQLRegistrator.registerAll(sparkSession)

var stringDf = sparkSession.sql(
  """
    |SELECT ST_SaveAsWKT(countyshape)
    |FROM polygondf
  """.stripMargin)
When I try to use the Scala SQL DSL, e.g.
df.withColumn("foo", ST_Point(col("x"), col("y")))
I get a type mismatch error: expected Column, got ST_Point.
What do I need to change to properly register the Catalyst expression as something which is callable directly via the Scala SQL DSL API?
Edit
The Catalyst expressions are all registered via https://github.com/DataSystemsLab/GeoSpark/blob/fadccf2579e4bbe905b2c28d5d1162fdd72aa99c/sql/src/main/scala/org/datasyslab/geosparksql/UDF/UdfRegistrator.scala#L38:
Catalog.expressions.foreach(f => sparkSession.sessionState.functionRegistry.createOrReplaceTempFunction(f.getClass.getSimpleName.dropRight(1), f))
Edit 2
import org.apache.spark.sql.geosparksql.expressions.ST_Point
val myPoint = udf((x: Double, y: Double) => ST_Point _)
fails with:
_ must follow method; cannot follow org.apache.spark.sql.geosparksql.expressions.ST_Point.type
You can access expressions that aren't exposed in the org.apache.spark.sql.functions package using the expr method. It doesn't actually give you a UDF-like object in Scala, but it does allow you to write the rest of your query using the Dataset API.
Here's an example from the docs:
// get the number of words of each length
df.groupBy(expr("length(word)")).count()
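Applied to the function in the question, and assuming GeoSparkSQLRegistrator.registerAll(sparkSession) has been called so that ST_SaveAsWKT is in the function registry (and that polygondf is available as a DataFrame), that would look something like:

import org.apache.spark.sql.functions.expr

// Hypothetical usage: call the registered GeoSpark function through expr().
val wktDf = polygondf.withColumn("wkt", expr("ST_SaveAsWKT(countyshape)"))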
Here's another method you can use to call the UDF; it's what I've done so far:
.withColumn("locationPoint", callUDF("ST_Point", col("longitude"), col("latitude")))
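If you want something that reads more like a normal function in the DSL, one option is a thin wrapper around callUDF (assuming the function is registered under the name "ST_Point", as in the GeoSpark registration code above):

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{callUDF, col}

// Hypothetical helper: delegates to the function registered by GeoSpark.
def ST_Point(x: Column, y: Column): Column = callUDF("ST_Point", x, y)

val withPoint = df.withColumn("locationPoint", ST_Point(col("longitude"), col("latitude")))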

Cast a DataFrame Entry to a Case Class with Any-type Member [duplicate]

I recently moved from Spark 1.6 to Spark 2.X, and I would like to move - where possible - from DataFrames to Datasets as well. I tried code like this:
case class MyClass(a : Any, ...)
val df = ...
df.map(x => MyClass(x.get(0), ...))
As you can see, MyClass has a field of type Any, as I do not know at compile time the type of the field I retrieve with x.get(0). It may be a Long, String, Int, etc.
However, when I try to execute code similar to what you see above, I get an exception:
java.lang.ClassNotFoundException: scala.Any
With some debugging, I realized that the exception is raised not because my data is of type Any, but because MyClass has a field of type Any. So how can I use Datasets then?
Unless you're interested in limited and ugly workarounds like Encoders.kryo:
import org.apache.spark.sql.Encoders
case class FooBar(foo: Int, bar: Any)
spark.createDataset(
  sc.parallelize(Seq(FooBar(1, "a")))
)(Encoders.kryo[FooBar])
or
spark.createDataset(
  sc.parallelize(Seq(FooBar(1, "a"))).map(x => (x.foo, x.bar))
)(Encoders.tuple(Encoders.scalaInt, Encoders.kryo[Any]))
you don't. All fields / columns in a Dataset have to be of a known, homogeneous type for which there is an implicit Encoder in scope. There is simply no place for Any there.
The UDT API provides a bit more flexibility and allows for limited polymorphism, but it is private, not fully compatible with the Dataset API, and comes with a significant performance and storage penalty.
If, for a given execution, all values are of the same type, you can of course create specialized classes and make a decision which one to use at run time, as sketched below.
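For example (a minimal sketch with hypothetical class names, assuming all values in a given run share one concrete runtime type):

import org.apache.spark.sql.Dataset
import spark.implicits._

case class FooBarString(foo: Int, bar: String)
case class FooBarLong(foo: Int, bar: Long)

// Inspect the runtime type once and pick a concretely typed case class.
// (A real implementation would handle the empty case and unsupported types.)
def asDataset(rows: Seq[(Int, Any)]): Dataset[_] = rows.head._2 match {
  case _: String => rows.map { case (foo, bar) => FooBarString(foo, bar.asInstanceOf[String]) }.toDS()
  case _: Long   => rows.map { case (foo, bar) => FooBarLong(foo, bar.asInstanceOf[Long]) }.toDS()
}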

Convert a Seq[String] to a case class in a typesafe way

I have written a parser which transforms a String to a Seq[String] following some rules. This will be used in a library.
I am trying to transform this Seq[String] to a case class. The case class would be provided by the user (so there is no way to guess what it will be).
I have thought of the Shapeless library because it seems to implement the right features and seems mature, but I have no idea how to proceed.
I have found this question with an interesting answer, but I don't see how to adapt it to my needs. Indeed, in that answer there is only one type to parse (String), and the library iterates inside the String itself. It probably requires a deep change in the way things are done, and I have no clue how.
Moreover, if possible, I want to make this process as easy as possible for the user of my library. So, if possible, unlike the answer linked above, the HList type would be guessed from the case class itself (although, according to my research, it seems the compiler needs this information).
I am a bit new to the type system and all these beautiful things; if anyone is able to give me advice on how to do this, I would be very happy!
Kind Regards
--- EDIT ---
As ziggystar requested, here is a possible shape of the needed signatures:
// Let's say we are just parsing a CSV.
#onUserSide
case class UserClass(i: Int, j: Int, s: String)
val list = Seq("1,2,toto", "3,4,titi")

// User transforms his case class to a function with something like:
val f = UserClass.curried

// The function created above is injected into the parser:
val parser = new Parser(f)

// The Strings to convert to case classes are provided as an argument to the parse() method:
val finalResult: Seq[UserClass] = parser.parse(list)

// The transformation is done in three steps inside the parse() method:
// 1/ first we have: val list = Seq("1,2,toto", "3,4,titi")
// 2/ then we have a call to internalParserImplementedSomewhereElse(list);
//    val parseResult is now equal to Seq(Seq("1", "2", "toto"), Seq("3", "4", "titi"))
// 3/ finally Shapeless does its magic trick and we have Seq(UserClass(1,2,"toto"), UserClass(3,4,"titi"))
#insideTheLibrary
class Parser[A](function: A) {

  // The internal parser takes each String provided through the argument of the method and
  // transforms it into a Seq[String], so the Seq[String] provided becomes a Seq[Seq[String]].
  private def internalParserImplementedSomewhereElse(l: Seq[String]): Seq[Seq[String]] = {
    ...
  }

  /*
   * A and B are both related to the case class provided by the user:
   * - A is the type of the case class as a function,
   * - B is the type of the original case class (can be guessed from type A).
   */
  private def convert2CaseClass[B](list: Seq[String]): B = {
    // do something with Shapeless
    // I don't know what to put inside ???
  }

  def parse(l: Seq[String]) = {
    val parseResult: Seq[Seq[String]] = internalParserImplementedSomewhereElse(l)
    val finalResult = parseResult.map(convert2CaseClass)
    finalResult // it is a Seq[CaseClassProvidedByUser]
  }
}
Inside the library, some implicits would be available to convert each String to the correct type as guessed by Shapeless (similar to the answer proposed in the link above), like string.toInt, string.toDouble, and so on.
Maybe there are other ways to design it; it's just what I have in mind after playing with Shapeless for a few hours.
This uses a very simple library called product-collections:
import com.github.marklister.collections.io._
case class UserClass(i:Int, j:Int, s:String)
val csv = Seq("1,2,toto", "3,4,titi").mkString("\n")
csv: String =
1,2,toto
3,4,titi
CsvParser(UserClass).parse(new java.io.StringReader(csv))
res28: Seq[UserClass] = List(UserClass(1,2,toto), UserClass(3,4,titi))
And to serialize the other way:
scala> res28.csvIterator.toList
res30: List[String] = List(1,2,"toto", 3,4,"titi")
product-collections is oriented towards CSV and a java.io.Reader, hence the shims above.
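If you would rather stay with Shapeless than pull in a CSV library, the usual Generic-based pattern looks roughly like the sketch below. It is only a sketch: the names are illustrative, only Int and String fields are handled, and there is no error handling for short rows or bad numbers.

import shapeless._

// Per-primitive parser.
trait Read[A] { def read(s: String): A }
object Read {
  def instance[A](f: String => A): Read[A] = new Read[A] { def read(s: String): A = f(s) }
  implicit val readInt: Read[Int] = instance(_.toInt)
  implicit val readString: Read[String] = instance(s => s)
}

// Builds the HList representation of a case class from a row of Strings.
trait FromRow[L <: HList] { def apply(row: Seq[String]): L }
object FromRow {
  implicit val hnilFromRow: FromRow[HNil] =
    new FromRow[HNil] { def apply(row: Seq[String]): HNil = HNil }
  implicit def hconsFromRow[H, T <: HList](implicit rh: Read[H], rt: FromRow[T]): FromRow[H :: T] =
    new FromRow[H :: T] { def apply(row: Seq[String]): H :: T = rh.read(row.head) :: rt(row.tail) }
}

// Glues Generic and FromRow together; the extra class lets the caller name only the case class.
class RowTo[A] {
  def apply[L <: HList](row: Seq[String])(implicit gen: Generic.Aux[A, L], fromRow: FromRow[L]): A =
    gen.from(fromRow(row))
}
def rowTo[A] = new RowTo[A]

case class UserClass(i: Int, j: Int, s: String)
val users = Seq(Seq("1", "2", "toto"), Seq("3", "4", "titi")).map(row => rowTo[UserClass](row))
// users: Seq[UserClass] = List(UserClass(1,2,toto), UserClass(3,4,titi))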
This answer will not tell you how to do exactly what you want, but it will solve your problem. I think you're overcomplicating things.
What is it you want to do? It appears to me that you're simply looking for a way to serialize and deserialize your case classes, i.e. convert your Scala objects to a generic string format and the generic string format back to Scala objects. You seem to already have the serialization step defined, and you're asking how to do the deserialization.
There are a few serialization/deserialization options available for Scala. You do not have to hack away with Shapeless or Scalaz to do it yourself. Try to take a look at these solutions:
Java serialization/deserialization. The regular serialization/deserialization facilities provided by the Java environment. Requires explicit casting and gives you no control over the serialization format, but it's built in and doesn't require much work to implement.
JSON serialization: there are many libraries that provide JSON generation and parsing for Scala. Take a look at play-json, spray-json and Argonaut, for example.
The Scala Pickling library is a more general library for serialization/deserialization. Out of the box it comes with a binary format and a JSON format, but you can create your own formats.
Out of these solutions, at least play-json and Scala Pickling use macros to generate serializers and deserializers for you at compile time. That means that they should both be typesafe and performant.
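As a tiny illustration of that macro-based derivation in play-json (assuming, for the sake of the example, that each record can be represented as a JSON object rather than a raw CSV line):

import play.api.libs.json._

case class UserClass(i: Int, j: Int, s: String)

// Json.reads derives a Reads[UserClass] deserializer at compile time.
implicit val userClassReads: Reads[UserClass] = Json.reads[UserClass]

val decoded: Seq[UserClass] =
  Seq("""{"i":1,"j":2,"s":"toto"}""", """{"i":3,"j":4,"s":"titi"}""")
    .map(raw => Json.parse(raw).as[UserClass])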