spark register expression for SQL DSL - scala

How can I access a catalyst expression (not a regular UDF) in the Spark SQL Scala DSL API?
http://geospark.datasyslab.org only allows for text-based execution:
GeoSparkSQLRegistrator.registerAll(sparkSession)
var stringDf = sparkSession.sql(
  """
    |SELECT ST_SaveAsWKT(countyshape)
    |FROM polygondf
  """.stripMargin)
When I try to use the SQL Scala DSL, e.g.
df.withColumn("foo", ST_Point(col("x"), col("y")))
I get a type mismatch error: expected Column, got ST_Point.
What do I need to change to properly register the catalyst expression as something which is callable directly via the Scala SQL DSL API?
Edit
Catalyst expressions are all registered via https://github.com/DataSystemsLab/GeoSpark/blob/fadccf2579e4bbe905b2c28d5d1162fdd72aa99c/sql/src/main/scala/org/datasyslab/geosparksql/UDF/UdfRegistrator.scala#L38:
Catalog.expressions.foreach(f => sparkSession.sessionState.functionRegistry.createOrReplaceTempFunction(f.getClass.getSimpleName.dropRight(1), f))
Edit 2
import org.apache.spark.sql.geosparksql.expressions.ST_Point
val myPoint = udf((x: Double, y: Double) => ST_Point _)
fails with:
_ must follow method; cannot follow org.apache.spark.sql.geosparksql.expressions.ST_Point.type

You can access expressions that aren't exposed in the org.apache.spark.sql.functions package using the expr method. It doesn't actually give you a UDF-like object in Scala, but it does allow you to write the rest of your query using the Dataset API.
Here's an example from the docs:
// get the number of words of each length
df.groupBy(expr("length(word)")).count()
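Applied to the question, and assuming the GeoSpark functions were registered by GeoSparkSQLRegistrator.registerAll as shown above (so that ST_Point is resolvable by name), a minimal sketch would be:

import org.apache.spark.sql.functions.expr

// ST_Point is resolved by name at analysis time, so no Scala-level wrapper is needed
val withPoint = df.withColumn("foo", expr("ST_Point(x, y)"))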

Here's another way to call the UDF, and what I've done so far:
.withColumn("locationPoint", callUDF("ST_Point", col("longitude"), col("latitude")))

Related

When would request.attrs.get in a Scala Play custom filter return None?

In an answer to "How to run filter on demand scala play framework", the following code is suggested:
// in your filter
val handlerDef: Option[HandlerDef] = request.attrs.get(Router.Attrs.HandlerDef)
I'm not sure what's happening here - is it safe to call .get on this val (to get it out of the Option)? In what scenarios would this code result in a None (i.e., when would Router.Attrs.HandlerDef not be present)?
I'm working with Scala and Play Framework 2.6.
According to the Route modifier tags documentation:
Please be aware that the HandlerDef request attribute exists only when using a router generated by Play from a routes file. This attribute is not added when the routes are defined in code, for example using the Scala SIRD or Java RoutingDsl. In this case request.attrs.get(HandlerDef) will return None in Scala or null in Java. Keep this in mind when creating filters.
Hence, if you are using a routes file, then Router.Attrs.HandlerDef should always be available. As a shorthand, instead of
val handlerDef: HandlerDef = request.attrs.get(Router.Attrs.HandlerDef).get
you can use the apply sugar like so:
val handlerDef: HandlerDef = request.attrs(Router.Attrs.HandlerDef)
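For illustration, a minimal filter sketch under Play 2.6 (the class name and the logging are placeholders; it assumes your routes come from a routes file, and falls back gracefully when they don't):

import javax.inject.Inject
import akka.stream.Materializer
import play.api.mvc.{Filter, RequestHeader, Result}
import play.api.routing.{HandlerDef, Router}
import scala.concurrent.Future

class HandlerLoggingFilter @Inject()(implicit val mat: Materializer) extends Filter {
  def apply(next: RequestHeader => Future[Result])(request: RequestHeader): Future[Result] = {
    // None only when the route was defined in code (Scala SIRD / Java RoutingDsl) rather than a routes file
    request.attrs.get(Router.Attrs.HandlerDef) match {
      case Some(handlerDef: HandlerDef) =>
        play.api.Logger.info(s"Routing to ${handlerDef.controller}.${handlerDef.method}")
      case None =>
        play.api.Logger.info("No HandlerDef for this request")
    }
    next(request)
  }
}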

Spark SQL UDF from a string which represents scala code at runtime

I need to be able to register a UDF from a string which I will get from a web service, i.e. at run time I call a web service to get the Scala code which constitutes the UDF, compile it and register it as a UDF in the Spark context. As an example, let's say my web service returns the following Scala code in a JSON response:
(row: Row, field: String) => {
  import scala.util.{Try, Success, Failure}
  val index: Int = Try(row.fieldIndex(field)) match {
    case Success(_) => 1
    case Failure(_) => 0
  }
  index
}
I want to compile this code on the fly and then register it as a UDF. I have already tried multiple options, such as the reflection toolbox, Twitter's Eval util etc., but found that I need to explicitly specify the argument types of the method while creating an instance, for example:
val code =
  q"""
    (a: String, b: String) => {
      a + b
    }
  """
val compiledCode = toolBox.compile(code)
val compiledFunc = compiledCode().asInstanceOf[(String, String) => Option[Any]]
This UDF takes two strings as arguments, hence I need to specify the types while creating the object, like:
compiledCode().asInstanceOf[(String, String) => Option[Any]]
The other option I explored is
https://stackoverflow.com/a/34371343/1218856
In both cases I have to know the number of arguments, the argument types and the return type beforehand to instantiate the code as a method. But in my case, as the UDFs are created by my users, I have no control over the number of arguments and their types, so I would like to know if there is any way I can register the UDF by compiling the Scala code without knowing the argument number and type information.
In a nutshell: I get the code as a string, compile it, and register it as a UDF without knowing the type information.
I think you'd be much better off not trying to generate/execute code directly but defining a different kind of expression language and executing that. Something like ANTLR could help you with writing the grammar of that expression language and generating the parser and the abstract syntax trees. Or even Scala's parser combinators. It's of course more work, but also a far less risky and error-prone way of allowing custom function execution.
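For illustration, here is a sketch of that idea using Scala's parser combinators. The grammar is deliberately tiny and made up: it only supports comparing one field against a numeric literal, and it evaluates against a Map rather than a Row:

import scala.util.parsing.combinator.JavaTokenParsers

object FieldPredicateParser extends JavaTokenParsers {
  // grammar: <ident> ("==" | "!=") <number>
  def predicate: Parser[Map[String, Double] => Boolean] =
    ident ~ ("==" | "!=") ~ floatingPointNumber ^^ {
      case field ~ "==" ~ num => (row: Map[String, Double]) => row.get(field).contains(num.toDouble)
      case field ~ _    ~ num => (row: Map[String, Double]) => !row.get(field).contains(num.toDouble)
    }

  def compile(source: String): Map[String, Double] => Boolean =
    parseAll(predicate, source) match {
      case Success(f, _)      => f
      case failure: NoSuccess => sys.error(failure.msg)
    }
}

// The compiled function has a fixed, known signature, so it can be wrapped
// in an ordinary typed Spark UDF without any reflection:
// val p = FieldPredicateParser.compile("age == 42")
// p(Map("age" -> 42.0))  // true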

How to use Window aggregates on strongly typed Spark Datasets?

I'm slowly trying to adapt to the new (strongly typed) Dataset[U] from Spark 2.x, but struggling to maintain the type info when using Window functions.
case class Measurement(nb:Long,x:Double)
With ds being a Dataset[Measurement], I would like to do something like
ds.map { m => (m, sum($"x").over(Window.orderBy($"nb"))) }
But this will not work, as it gives me a Dataset[(Measurement, Column)] instead of a Dataset[(Measurement, Double)].
Using withColumn gives me a Dataset[Row], so I'm losing the type info:
ds.withColumn("cumsum", sum($"x").over(Window.orderBy($"nb")))
So, is there a better way to use Window functions on strongly typed Datasets?
As you are adding a new column to your dataset, I guess there is no choice but to use the dataframe.as[NewType] method.
More information can be found here: How to add a column to Dataset without converting from a DataFrame and accessing it?
More information on Window functions can be found in this blog article: Window Functions in Spark SQL by Databricks
You can use the as[U] method to convert a DataFrame (or Dataset[Row]) to a Dataset[U].
For your specific case, withColumn yields the three flat columns nb, x and cumsum, so the target type needs three matching fields, for example:
ds.withColumn("cumsum", sum($"x").over(Window.orderBy($"nb"))).as[(Long, Double, Double)]
Hope it helps

Register UDF to SqlContext from Scala to use in PySpark

Is it possible to register a UDF (or function) written in Scala to use in PySpark?
E.g.:
val mytable = sc.parallelize(1 to 2).toDF("spam")
mytable.registerTempTable("mytable")
def addOne(m: Integer): Integer = m + 1
// Spam: 1, 2
In Scala, the following is now possible:
val UDFaddOne = sqlContext.udf.register("UDFaddOne", addOne _)
val mybiggertable = mytable.withColumn("moreSpam", UDFaddOne(mytable("spam")))
// Spam: 1, 2
// moreSpam: 2, 3
I would like to use "UDFaddOne" in PySpark like
%pyspark
mytable = sqlContext.table("mytable")
UDFaddOne = sqlContext.udf("UDFaddOne") # does not work
mybiggertable = mytable.withColumn("+1", UDFaddOne(mytable("spam"))) # does not work
Background: we are a team of developers, some coding in Scala and some in Python, and we would like to share already-written functions. It would also be possible to save it into a library and import it.
As far as I know PySpark doesn't provide any equivalent of the callUDF function, and because of that it is not possible to access a registered UDF directly.
The simplest solution here is to use raw SQL expression:
mytable.withColumn("moreSpam", expr("UDFaddOne({})".format("spam")))
## OR
sqlContext.sql("SELECT *, UDFaddOne(spam) AS moreSpam FROM mytable")
## OR
mytable.selectExpr("*", "UDFaddOne(spam) AS moreSpam")
This approach is rather limited, so if you need to support more complex workflows you should build a package and provide complete Python wrappers. You'll find an example UDAF wrapper in my answer to Spark: How to map Python with Scala or Java User Defined Functions?
The following worked for me (basically a summary of multiple places including the link provided by zero323):
In scala:
package com.example

import org.apache.spark.sql.functions.udf

object udfObj extends Serializable {
  def createUDF = {
    udf((x: Int) => x + 1)
  }
}
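As a quick sanity check on the Scala side (df here is a hypothetical DataFrame with an integer column named spam), the same UDF object can be applied directly:

import org.apache.spark.sql.functions.col

// UserDefinedFunction.apply(cols: Column*) returns a Column
val plusOne = com.example.udfObj.createUDF()
df.withColumn("moreSpam", plusOne(col("spam"))).show()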
In Python (assume sc is the SparkContext; if you are using Spark 2.0 you can get it from the SparkSession):
from py4j.java_gateway import java_import
from pyspark.sql.column import Column, _to_java_column

jvm = sc._gateway.jvm
java_import(jvm, "com.example")

def udf_f(col):
    # convert the PySpark Column to a Java Column, apply the Scala UDF, and wrap the result back up
    return Column(jvm.com.example.udfObj.createUDF().apply(_to_java_column(col)))
And of course make sure the jar built from the Scala code is added using --jars and --driver-class-path.
So what happens here:
We create a function inside a serializable object which returns the UDF in Scala (I am not 100% sure Serializable is required; it was required for me for more complex UDFs, so it could be because it needs to pass Java objects).
In Python we access the internal JVM (this is a private member, so it could change in the future, but I see no way around it) and import our package using java_import.
We access the createUDF function and call it. This returns an object which has an apply method (Scala functions are actually Java objects with an apply method). The input to the apply method is a Java column. The result of applying it to a column is a new Java column, so we need to wrap it in the Python Column class to make it usable with withColumn.

SQL DSL for Scala

I am struggling to create a SQL DSL for Scala. The DSL is an extension to Querydsl, which is a popular Query abstraction layer for Java.
I am struggling now with really simple expressions like the following
user.firstName == "Bob" || user.firstName == "Ann"
As Querydsl already supports an expression model which can be used here, I decided to provide conversions from proxy objects to Querydsl expressions. In order to use the proxies I create an instance like this:
import com.mysema.query.alias.Alias._
var user = alias(classOf[User])
With the following implicit conversions I can convert proxy instances and proxy property call chains into Querydsl expressions:
import com.mysema.query.alias.Alias._
import com.mysema.query.types.expr._
import com.mysema.query.types.path._
object Conversions {
  def not(b: EBoolean): EBoolean = b.not()
  implicit def booleanPath(b: Boolean): PBoolean = $(b)
  implicit def stringPath(s: String): PString = $(s)
  implicit def datePath(d: java.sql.Date): PDate[java.sql.Date] = $(d)
  implicit def dateTimePath(d: java.util.Date): PDateTime[java.util.Date] = $(d)
  implicit def timePath(t: java.sql.Time): PTime[java.sql.Time] = $(t)
  implicit def comparablePath(c: Comparable[_]): PComparable[_] = $(c)
  implicit def simplePath(s: Object): PSimple[_] = $(s)
}
Now I can construct expressions like this
import com.mysema.query.alias.Alias._
import com.mysema.query.scala.Conversions._
var user = alias(classOf[User])
var predicate = (user.firstName like "Bob") or (user.firstName like "Ann")
I am struggling with the following problem: eq and ne are already available as methods on every Scala object (they are defined on AnyRef), so the implicit conversions aren't triggered when they are used.
This problem can be generalized as follows: when using method names that are already available on Scala types, such as eq, ne, startsWith etc., one needs some kind of escaping to trigger the implicit conversions.
I am considering the following
Uppercase
var predicate = (user.firstName LIKE "Bob") OR (user.firstName LIKE "Ann")
This is for example the approach in Circumflex ORM, a very powerful ORM framework for Scala with similar DSL aims. But this approach would be inconsistent with the query keywords (select, from, where etc), which are lowercase in Querydsl.
Some prefix
var predicate = (user.firstName :like "Bob") :or (user.firstName :like "Ann")
The context of the predicate usage is something like this
var user = alias(classOf[User])
query().from(user)
  .where((user.firstName like "Bob") or (user.firstName like "Ann"))
  .orderBy(user.firstName asc)
  .list(user)
Do you see better options or a different approach for SQL DSL construction for Scala?
So the question basically boils down to two cases:
Is it possible to trigger an implicit type conversion when using a method that already exists on the superclass (e.g. eq)?
If it is not possible, what would be the most Scalaesque syntax to use for methods like eq and ne?
EDIT
We got Scala support in Querydsl working by using alias instances and a $-prefix based escape syntax. Here is a blog post on the results: http://blog.mysema.com/2010/09/querying-with-scala.html
There was a very good talk at Scala Days: Type-safe SQL embedded in Scala by Christoph Wulf.
See the video here: Type-safe SQL embedded in Scala by Christoph Wulf
Mr Westkämper - I was pondering this problem, and I wondered if it would be possible to use 'tracer' objects, where the basic data types such as Int and String would be extended so that they contained source information, and the results of combining them would likewise hold within themselves their sources and the nature of the combination.
For example, your user.firstName method would return a TracerString, which extends String, but which also indicates that the String corresponds to a column in a relation. The == method would be overridden such that it returns an EqualityTracerBoolean which extends Boolean. This would preserve the standard Scala semantics. However, the constructor for EqualityTracerBoolean would record the fact that the result of the expression was derived by comparing a column in a relation to a string constant. Your 'where' method could then analyse the EqualityTracerBoolean object returned by the conditional expression evaluated over a dummy argument in order to derive the expression used to create it.
There would have to be overriding defs for the inequality operators, as well as plus and minus for Ints, and whatever else you wished to represent from SQL, plus corresponding tracer classes for each of these. It would be a bit of a project!
Anyway, I decided not to bother, and use squeryl instead.
I didn't have the exact same problem with jOOQ, as I'm using slightly more verbose operator names: equal, notEqual, etc. instead of eq, ne. On the other hand, there is a val operator in jOOQ for explicitly creating bind values, which I had to overload as value, since val is a keyword in Scala. Is overloading operators an option for you? I documented my attempts at running jOOQ in Scala here:
http://lukaseder.wordpress.com/2011/12/11/the-ultimate-sql-dsl-jooq-in-scala/
Just like you, I had also thought about capitalising all keywords in a major release (including SELECT, FROM, etc). But that leaves an open question about whether "compound" keywords should be split into two method calls or connected by an underscore: GROUP().BY() or GROUP_BY(), WHEN().MATCHED().THEN().UPDATE() or WHEN_MATCHED_THEN_UPDATE(). Since the result is not really satisfying, I guess it's not worth breaking backwards compatibility for such a fix, even if the two-method-call option would look very, very nice in Scala, as . and () can be omitted. So maybe jOOQ and QueryDSL should both be "wrapped" (as opposed to "extended") by a dedicated Scala API?
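To make the "wrapped by a dedicated Scala API" idea concrete, here is a self-contained sketch (all names are made up; this is neither Querydsl's nor jOOQ's API). Symbolic operators such as === never collide with eq/ne/== inherited from Any, so no escaping or implicit trickery is needed:

// Simplified stand-ins for a path and a predicate in some query model
case class StringPath(name: String) {
  def ===(value: String): Predicate  = Predicate(s"$name = '$value'")
  def like(value: String): Predicate = Predicate(s"$name like '$value'")
}

case class Predicate(sql: String) {
  def or(other: Predicate): Predicate  = Predicate(s"(${this.sql} or ${other.sql})")
  def and(other: Predicate): Predicate = Predicate(s"(${this.sql} and ${other.sql})")
}

val user = StringPath("firstName")
val predicate = (user === "Bob") or (user === "Ann")
// predicate.sql == "(firstName = 'Bob' or firstName = 'Ann')"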
What about decompiling the bytecode at runtime? I started to write such a tool:
http://h2database.com/html/jaqu.html#natural_syntax
I know it's a hack, so please don't vote -1 :-) I just wanted to mentioned it. It's a relatively novel approach. Instead of decompiling at runtime, it might be possible to do it at compile time using an annotation processor, not sure if that's possible using Scala (and not sure if it's really possible with Java, but Project Lombok seems to do something like that).