Register UDF to SqlContext from Scala to use in PySpark - scala

Is it possible to register a UDF (or function) written in Scala to use in PySpark?
E.g.:
val mytable = sc.parallelize(1 to 2).toDF("spam")
mytable.registerTempTable("mytable")
def addOne(m: Integer): Integer = m + 1
// Spam: 1, 2
In Scala, the following is now possible:
val UDFaddOne = sqlContext.udf.register("UDFaddOne", addOne _)
val mybiggertable = mytable.withColumn("moreSpam", UDFaddOne(mytable("spam")))
// Spam: 1, 2
// moreSpam: 2, 3
I would like to use "UDFaddOne" in PySpark like
%pyspark
mytable = sqlContext.table("mytable")
UDFaddOne = sqlContext.udf("UDFaddOne") # does not work
mybiggertable = mytable.withColumn("+1", UDFaddOne(mytable("spam"))) # does not work
Background: We are a team of developers, some coding in Scala and some in Python, and we would like to share already-written functions. It would also be possible to save them into a library and import them.

As far as I know, PySpark doesn't provide any equivalent of the callUDF function, and because of that it is not possible to access a registered UDF directly.
The simplest solution here is to use a raw SQL expression:
from pyspark.sql.functions import expr

mytable.withColumn("moreSpam", expr("UDFaddOne({})".format("spam")))
## OR
sqlContext.sql("SELECT *, UDFaddOne(spam) AS moreSpam FROM mytable")
## OR
mytable.selectExpr("*", "UDFaddOne(spam) AS moreSpam")
This approach is rather limited, so if you need to support more complex workflows you should build a package and provide complete Python wrappers. You'll find an example UDAF wrapper in my answer to Spark: How to map Python with Scala or Java User Defined Functions?
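For instance, a minimal sketch of how the Python side can hide the SQL-only access behind a small helper (the add_one_column name is mine; it assumes the Scala side has already registered UDFaddOne on the shared SQLContext):
from pyspark.sql.functions import expr

def add_one_column(df, in_col, out_col):
    # relies on the Scala side having already run sqlContext.udf.register("UDFaddOne", addOne _)
    return df.withColumn(out_col, expr("UDFaddOne({})".format(in_col)))

mybiggertable = add_one_column(sqlContext.table("mytable"), "spam", "moreSpam")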

The following worked for me (basically a summary of multiple places, including the link provided by zero323):
In Scala:
package com.example

import org.apache.spark.sql.functions.udf

object udfObj extends Serializable {
  def createUDF = {
    udf((x: Int) => x + 1)
  }
}
In Python (assume sc is the SparkContext; if you are using Spark 2.0 you can get it from the SparkSession):
from py4j.java_gateway import java_import
from pyspark.sql.column import Column, _to_java_column, _to_seq

jvm = sc._gateway.jvm
java_import(jvm, "com.example")

def udf_f(col):
    # the Scala UDF's apply method expects a Seq of JVM Columns,
    # so convert the PySpark Column before calling it
    scala_udf = jvm.com.example.udfObj.createUDF()
    return Column(scala_udf.apply(_to_seq(sc, [col], _to_java_column)))
And of course, make sure the jar built from the Scala code is added with --jars and --driver-class-path when you submit the job.
So what happens here:
We create a function inside a serializable object which returns the udf in Scala (I am not 100% sure Serializable is required; it was required for me for a more complex UDF, so it could be because it needed to pass Java objects).
In Python we access the internal JVM (this is a private member, so it could change in the future, but I see no way around it) and import our package using java_import.
We access the createUDF function and call it. This returns an object that has an apply method (functions in Scala are actually Java objects with an apply method). apply expects a Seq of JVM columns, which is why the PySpark Column is converted with _to_seq/_to_java_column before the call. The result is a new JVM column, which we wrap with PySpark's Column so it can be used with withColumn.
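A minimal usage sketch on the PySpark side (assuming mytable is registered as in the question and the jar is on the classpath):
from pyspark.sql.functions import col

mytable = sqlContext.table("mytable")
mybiggertable = mytable.withColumn("moreSpam", udf_f(col("spam")))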

Related

Databricks notebooks + Repos spark session scoping breakdown

I'm using Databricks, and I have a repo containing a basic Python module in which I define a class. I'm able to import and access the class and its methods from the Databricks notebook.
One of the methods in the class looks like this (simplified):
def read_raw_json(self):
    self.df = spark.read.format("json").load(f"{self.base_savepath}/{self.resource}/{self.resource}*.json")
When I execute this particular method in the Databricks notebook, it gives me a NameError saying 'spark' is not defined. The Databricks runtime starts up with a Spark session stored in a variable called spark, and I assumed any methods executed in that runtime would inherit it from the parent scope.
Anyone know why this isn't the case?
EDIT: I was able to get it to work by passing the spark variable from the notebook into my class at instantiation. But I don't want to call this an answer yet, because I'm not sure why I needed to.
Python files (unlike notebooks) don't have spark initialized in them.
When the imported function runs, Python raises a NameError as it tries to resolve read_raw_json's reference to spark, which is an unknown name in the module.
You can modify the Python file like this and everything will work fine:
from pyspark.sql import SparkSession

def read_raw_json(self):
    spark = SparkSession.builder.getOrCreate()
    self.df = spark.read.format("json").load(f"{self.base_savepath}/{self.resource}/{self.resource}*.json")
The spark variable is defined only in the top-level notebook, not in the packages. You have two choices:
Pass the spark instance as an argument to the read_raw_json function:
def read_raw_json(self, spark):
    self.df = spark.read.format("json").load(f"{self.base_savepath}/{self.resource}/{self.resource}*.json")
Get the active session:
from pyspark.sql import SparkSession

def read_raw_json(self):
    spark = SparkSession.getActiveSession()
    self.df = spark.read.format("json").load(f"{self.base_savepath}/{self.resource}/{self.resource}*.json")
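For completeness, here's a sketch of the constructor-injection variant mentioned in the question's edit (the class name and type hints are illustrative, not from the original code):
from pyspark.sql import SparkSession

class RawJsonReader:
    def __init__(self, spark: SparkSession, base_savepath: str, resource: str):
        # the notebook passes its own spark variable in at instantiation
        self.spark = spark
        self.base_savepath = base_savepath
        self.resource = resource

    def read_raw_json(self):
        self.df = self.spark.read.format("json").load(
            f"{self.base_savepath}/{self.resource}/{self.resource}*.json"
        )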

How to develop and test python transforms code locally?

What is the recommended way to develop and test python transforms code locally, given that the input datasets fit into memory of the local machine?
The simplest way, which doesn't require you to mock the transforms package, is to extract your logic into a plain Python/PySpark function that receives dataframes as input and returns a dataframe, e.g.:
# yourtransform.py
from transforms.api import transform_df, Input, Output
from my_business_logic import magic_super_complex_computation

@transform_df(
    Output("/foo/bar/out_dataset"),
    input1=Input("/foo/bar/input1"),
    input2=Input("/foo/bar/input2"))
def my_transform(input1, input2):
    return magic_super_complex_computation(input1, input2)
You can now import magic_super_complex_computation in your test and test it with just pyspark, e.g.:
from my_business_logic import magic_super_complex_computation

def test_magic_super_complex_computation(spark_context):
    df1 = load_my_data_as_df(spark_context, "input1")
    df2 = load_my_data_as_df(spark_context, "input2")
    result = magic_super_complex_computation(df1, df2).collect()
    assert len(result) == 123
Do note that this requires you to provide a valid Spark context as a fixture in your pytest setup (or whatever testing framework you are using).
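A minimal sketch of such a fixture, assuming pytest and a local master (the names are illustrative):
# conftest.py
import pytest
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark_context():
    spark = (SparkSession.builder
             .master("local[2]")
             .appName("transforms-tests")
             .getOrCreate())
    yield spark.sparkContext
    spark.stop()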

spark register expression for SQL DSL

How can I access a catalyst expression (not a regular UDF) in the Spark SQL Scala DSL API?
http://geospark.datasyslab.org only allows for text-based execution:
GeoSparkSQLRegistrator.registerAll(sparkSession)

var stringDf = sparkSession.sql(
  """
    |SELECT ST_SaveAsWKT(countyshape)
    |FROM polygondf
  """.stripMargin)
When I try to use the Scala SQL DSL,
df.withColumn("foo", ST_Point(col("x"), col("y")))
I get a type mismatch error: expected Column, got ST_Point.
What do I need to change to properly register the catalyst expression as something that is callable directly via the Scala SQL DSL API?
Edit: the catalyst expressions are all registered via https://github.com/DataSystemsLab/GeoSpark/blob/fadccf2579e4bbe905b2c28d5d1162fdd72aa99c/sql/src/main/scala/org/datasyslab/geosparksql/UDF/UdfRegistrator.scala#L38:
Catalog.expressions.foreach(f=>sparkSession.sessionState.functionRegistry.createOrReplaceTempFunction(f.getClass.getSimpleName.dropRight(1),f))
Edit 2:
import org.apache.spark.sql.geosparksql.expressions.ST_Point
val myPoint = udf((x: Double, y:Double) => ST_Point _)
fails with:
_ must follow method; cannot follow org.apache.spark.sql.geosparksql.expressions.ST_Point.type
You can access expressions that aren't exposed in the org.apache.spark.sql.functions package using the expr method. It doesn't actually give you a UDF-like object in Scala, but it does allow you to write the rest of your query using the Dataset API.
Here's an example from the docs:
// get the number of words of each length
df.groupBy(expr("length(word)")).count()
Here's another method you can use to call the UDF; it's what I've done so far:
import org.apache.spark.sql.functions.{callUDF, col}

df.withColumn("locationPoint", callUDF("ST_Point", col("longitude"), col("latitude")))

Using a module with udf defined inside freezes pyspark job - explanation?

Here is the situation:
We have a module where we define some functions that return a pyspark.sql.DataFrame (DF). To get those DFs we use some pyspark.sql.functions.udf defined either in the same file or in helper modules. When we actually write the job for PySpark to execute, we only import functions from the modules (we provide a .zip file via --py-files) and then just save the dataframe to HDFS.
The issue is that when we do this, the udf function freezes our job. The nasty fix we found was to define the udf functions inside the job and pass them to the imported functions from our module.
The other fix I found here is to define a class:
from pyspark.sql.functions import udf

class Udf(object):
    def __init__(s, func, spark_type):
        s.func, s.spark_type = func, spark_type

    def __call__(s, *args):
        return udf(s.func, s.spark_type)(*args)
Then I use this to define my udfs in the module. This works!
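For example, a usage sketch of that wrapper (the function, type, and column names here are mine, assuming df has an integer spam column):
from pyspark.sql.types import IntegerType

# the real pyspark udf is only created when add_one(...) is called,
# i.e. inside the running job where an active SparkContext exists
add_one = Udf(lambda x: x + 1, IntegerType())
df = df.withColumn("spam_plus_one", add_one(df["spam"]))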
Can anybody explain why we have this problem in the first place? And why this fix (the last one with the class definition) works?
Additional info: PySpark 2.1.0, deploying the job on YARN in cluster mode.
Thanks!
The accepted answer in the link you posted above says: "My work around is to avoid creating the UDF until Spark is running and hence there is an active SparkContext." Looks like your issue is with serializing the UDF.
Make sure the UDF functions in your helper classes are static methods or global functions, and define the udf inside the public functions that you import elsewhere:
from pyspark.sql.functions import udf

class Helperclass(object):
    @staticmethod
    def my_udf_todo(...):
        ...

def public_function_that_is_imported_elsewhere(...):
    todo_udf = udf(Helperclass.my_udf_todo, RETURN_SCHEMA)
    ...
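To illustrate the timing difference the quoted answer describes, here's a small sketch (names are mine, not from the question) contrasting the eager module-level pattern that tends to freeze with the lazy one:
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

# eager: created at import time, possibly before an active SparkContext exists
# add_one = udf(lambda x: x + 1, IntegerType())

# lazy: created only when the function is called inside the running job
def with_one_added(df, colname):
    add_one = udf(lambda x: x + 1, IntegerType())
    return df.withColumn(colname + "_plus_one", add_one(df[colname]))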

Spark ML - Save OneVsRestModel

I am in the middle of refactoring my code to take advantage of DataFrames, Estimators, and Pipelines. I was originally using MLlib Multiclass LogisticRegressionWithLBFGS on RDD[LabeledPoint]. I am enjoying learning and using the new API, but I am not sure how to save my new model and apply it on new data.
Currently, the ML implementation of LogisticRegression only supports binary classification. I am instead using OneVsRest, like so:
val lr = new LogisticRegression().setFitIntercept(true)
val ovr = new OneVsRest()
ovr.setClassifier(lr)
val ovrModel = ovr.fit(training)
I would now like to save my OneVsRestModel, but this does not seem to be supported by the API. I have tried:
ovrModel.save("my-ovr") // Cannot resolve symbol save
ovrModel.models.foreach(_.save("model-" + _.uid)) // Cannot resolve symbol save
Is there a way to save this, so I can load it in a new application for making new predictions?
Spark 2.0.0
OneVsRestModel implements MLWritable, so it should be possible to save it directly. The method shown below can still be useful for saving the individual models separately.
Spark < 2.0.0
The problem here is that models returns an Array of ClassificationModel[_, _], not an Array of LogisticRegressionModel (or MLWritable). To make it work you'll have to be specific about the types:
import org.apache.spark.ml.classification.LogisticRegressionModel

ovrModel.models.zipWithIndex.foreach {
  case (model: LogisticRegressionModel, i: Int) =>
    model.save(s"model-${model.uid}-$i")
}
or to be more generic:
import org.apache.spark.ml.util.MLWritable

ovrModel.models.zipWithIndex.foreach {
  case (model: MLWritable, i: Int) =>
    model.save(s"model-${model.uid}-$i")
}
Unfortunately, as of now (Spark 1.6), OneVsRestModel doesn't implement MLWritable, so it cannot be saved on its own.
Note:
All models in the OneVsRestModel seem to use the same uid, hence the need for an explicit index. It will also be useful for identifying the model later.