Using a module with udf defined inside freezes pyspark job - explanation? - pyspark

Here is the situation:
We have a module where we define some functions that return pyspark.sql.DataFrame (DF). To get those DF we use some pyspark.sql.functions.udf defined either in the same file or in helper modules. When we actually write job for pyspark to execute we only import functions from modules (we provide a .zip file to --py-files) and then just save the dataframe to hdfs.
Issue is that when we do this, the udf function freezes our job. The nasty fix we found was to define udf functions inside the job and provide them to imported functions from our module.
The other fix I found here is to define a class:
from pyspark.sql.functions import udf
class Udf(object):
def __init__(s, func, spark_type):
s.func, s.spark_type = func, spark_type
def __call__(s, *args):
return udf(s.func, s.spark_type)(*args)
Then use this to define my udf in module. This works!
Can anybody explain why we have this problem in the first place? And why this fix (the last one with the class definition) works?
Additional info: PySpark 2.1.0. Deploying job on yarn in cluster mode.
Thanks!

The accepted answer to the link you posted above, says, "My work around is to avoid creating the UDF until Spark is running and hence there is an active SparkContext." Looks like your issue is with serializing the UDF.
Make sure the UDF functions in your helper classes are static methods or global functions. And inside the public functions that you import elsewhere, you can define the udf.
class Helperclass(object):
#staticmethod
def my_udf_todo(...):
...
def public_function_that_is_imported_elsewhere(...):
todo_udf = udf(Helperclass.my_udf_todo, RETURN_SCHEMA)
...

Related

Databricks notebooks + Repos spark session scoping breakdown

I'm using databricks, and I have a repo in which I have a basic python module within which I define a class. I'm able to import and access the class and its methods from the databricks notebook.
One of the methods within the class within the module looks like this (simplified)
def read_raw_json(self):
self.df = spark.read.format("json").load(f"{self.base_savepath}/{self.resource}/{self.resource}*.json")
When I execute this particular method within the databricks notebook it gives me a NameError that 'spark' is not defined. The databricks runtime instantiates with a spark session stored in a variable called "spark". I assumed any methods executed in that runtime would inherit from the parent scope.
Anyone know why this isn't the case?
EDIT: I was able to get it to work by passing it the spark variable from within the notebook as an object to my class instantiation. But I don't want to call this an answer yet, because I'm not sure why I needed to.
python files (not notebooks) don't have spark initiated at them.
when you import the function python raises a NameError as it tried to understand the read_raw_json which references an unknown spark object.
you can modify the python file like this and everything will work fine:
from pyspark.sql import SparkSession
def read_raw_json(self):
spark = SparkSession.builder.getOrCreate()
self.df = spark.read.format("json").load(f"{self.base_savepath}/{self.resource}/{self.resource}*.json")
The spark variable is defined only in the top-level notebook, but not in the packages. You have two choices:
Pass the spark instance as an argument of the read_raw_json function:
def read_raw_json(self, spark):
self.df = spark.read.format("json").load(f"{self.base_savepath}/{self.resource}/{self.resource}*.json")
get active session:
from pyspark.sql import SparkSession
def read_raw_json(self):
spark = SparkSession.getActiveSession()
self.df = spark.read.format("json").load(f"{self.base_savepath}/{self.resource}/{self.resource}*.json")

How to pass variable arguments to my scala program?

I am very new to scala spark. Here I have a wordcount program wherein I pass the inputfile as an argument instead of hardcoding it and reading it. But when I run the program I get an error Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException : 0
I think it's because I have not mentioned the argument I am taking in the main class but don't know how to do so.
I tried running the program as is and also tried changing the run configurations. i do not know how to pass the filename (in code) as an argument in my main class
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.sql.types.{StructType,StructField,StringType};
import org.apache.spark.sql.Row;
object First {
def main(args : Array[String]): Unit = {
val filename = args(0)
val cf = new SparkConf().setAppName("Tutorial").setMaster("local")
val sc = new SparkContext(cf)
val input = sc.textFile(filename)
val w = input.flatMap(line => line.split(" ")).map(word=>
(word,1)).reduceByKey(_ + _)
w.collect.foreach(println)
w.saveAsTextFile(args(1))
}
}
I wish to run this program by passing the right arguments (input file and save output file as arguments) in my main class. I am using scala eclipse IDE. I do not know what changes to make in my program please help me out here as I am new.
In the run configuration for the project, there is an option right next to main called '(x)=Arguments' where you can pass in arguments to main in the 'Program Arguments' section.
Additionally, you may print args.length to see the number of arguments your code is actually receiving after doing the above.
It appears you are running Spark on Windows, so I'm not sure if this will work exactly as-is, but you can definitely pass arguments like any normal command line application. The only difference is that you have to pass the arguments AFTER specifying the Spark-related parameters.
For example, the JAR filename is the.jar and the main object is com.obrigado.MyMain, then you could run a Spark submit job like so: spark-submit --class com.obrigado.MyMain the.jar path/to/inputfile. I believe args[0] should then be path/to/inputfile.
However, like any command-line program, it's generally better to use POSIX-style arguments (or at least named arguments), and there are several good ones out there. Personally, I love using Scallop as it's easy to use and doesn't seem to interfere with Spark's own CLI parsing library.
Hopefully this fixes your issue!

Why does Spark application work in spark-shell but fail with "org.apache.spark.SparkException: Task not serializable" in Eclipse?

With the purpose of save a file (delimited by |) into a DataFrame, I have developed the next code:
val file = sc.textFile("path/file/")
val rddFile = file.map(a => a.split("\\|")).map(x => ArchivoProcesar(x(0), x(1), x(2), x(3))
val dfInsumos = rddFile.toDF()
My case class used for the creation of my DataFrame is defined as followed:
case class ArchivoProcesar(nombre_insumo: String, tipo_rep: String, validado: String, Cargado: String)
I have done some functional tests using spark-shell, and my code works fine, generating the DataFrame correctly. But when I executed my program into eclipse, it throws me the next error:
Is it something missing inside my scala class that I'm using and running with eclipse. Or what could be the reason that my functions works correctly at the spark-shell but not in my eclipse app?
Regards.
I have done some functional tests using spark-shell, and my code works fine, generating the DataFrame correctly.
That's because spark-shell takes care of creating an instance of SparkContext for you. spark-shell then makes sure that references to SparkContext are not from "sensitive places".
But when I executed my program into eclipse, it throws me the next error:
Somewhere in your Spark application you hold a reference to org.apache.spark.SparkContext that is not serializable and so holds your Spark computation back from being serialized and send across the wire to executors.
As #T. Gawęda has mentioned in a comment:
I think that ArchivoProcesar is a nested class and as a nested class has a reference to the outer class that has a property of type SparkContext
So while copying the code from spark-shell to Eclipse you have added some additional lines that you don't show thinking that they are not necessary which happens to be quite the contrary. Find any places where you create and reference SparkContext and you will find the root cause of your issue.
I can see that the Spark processing happens inside ValidacionInsumos class that main method uses. I think the affecting method is LeerInsumosAValidar that does map transformation and that's where you should seek the answer.
Your case class must have public scope. You can't have ArchivoProcesar inside a class

Register UDF to SqlContext from Scala to use in PySpark

Is it possible to register a UDF (or function) written in Scala to use in PySpark ?
E.g.:
val mytable = sc.parallelize(1 to 2).toDF("spam")
mytable.registerTempTable("mytable")
def addOne(m: Integer): Integer = m + 1
// Spam: 1, 2
In Scala, the following is now possible:
val UDFaddOne = sqlContext.udf.register("UDFaddOne", addOne _)
val mybiggertable = mytable.withColumn("moreSpam", UDFaddOne(mytable("spam")))
// Spam: 1, 2
// moreSpam: 2, 3
I would like to use "UDFaddOne" in PySpark like
%pyspark
mytable = sqlContext.table("mytable")
UDFaddOne = sqlContext.udf("UDFaddOne") # does not work
mybiggertable = mytable.withColumn("+1", UDFaddOne(mytable("spam"))) # does not work
Background: We are a team of developpers, some coding in Scala and some in Python, and would like to share already written functions. It would also be possible to save it into a library and import it.
As far as I know PySpark doesn't provide any equivalent of the callUDF function and because of that it is not possible to access registered UDF directly.
The simplest solution here is to use raw SQL expression:
mytable.withColumn("moreSpam", expr("UDFaddOne({})".format("spam")))
## OR
sqlContext.sql("SELECT *, UDFaddOne(spam) AS moreSpam FROM mytable")
## OR
mytable.selectExpr("*", "UDFaddOne(spam) AS moreSpam")
This approach is rather limited so if you need to support more complex workflows you should build a package and provide complete Python wrappers. You'll find and example UDAF wrapper in my answer to Spark: How to map Python with Scala or Java User Defined Functions?
The following worked for me (basically a summary of multiple places including the link provided by zero323):
In scala:
package com.example
import org.apache.spark.sql.functions.udf
object udfObj extends Serializable {
def createUDF = {
udf((x: Int) => x + 1)
}
}
in python (assume sc is the spark context. If you are using spark 2.0 you can get it from the spark session):
from py4j.java_gateway import java_import
from pyspark.sql.column import Column
jvm = sc._gateway.jvm
java_import(jvm, "com.example")
def udf_f(col):
return Column(jvm.com.example.udfObj.createUDF().apply(col))
And of course make sure the jar created in scala is added using --jars and --driver-class-path
So what happens here:
We create a function inside a serializable object which returns the udf in scala (I am not 100% sure Serializable is required, it was required for me for more complex UDF so it could be because it needed to pass java objects).
In python we use access the internal jvm (this is a private member so it could be changed in the future but I see no way around it) and import our package using java_import.
We access the createUDF function and call it. This creates an object which has the apply method (functions in scala are actually java objects with the apply method). The input to the apply method is a column. The result of applying the column is a new column so we need to wrap it with the Column method to make it available to withColumn.

Scala Unit Testing - Mocking an implicitly wrapped function

I have a question concerning unit tests that I'm trying to achieve using Mockito in Scala. I've also looked up ScalaMock but it sounds like the feature is not provided as well. I suppose that maybe I'm looking from a narrow way to the solution and there might be a different perspective or approach to what im doing so all your opinions are welcomed.
Basically, I want to mock a function that is available to the object using implicit conversion, and I don't have any control to change how that is done. Since I'm a user to the library. The concrete example is similar to the following scenario
rdd: RDD[T] = //existing RDD
sqlContext: SQLContext = //existing sqlcontext
import sqlContext.implicits._
rdd.toDF()
/*toDF() doesn't originally exist at RDD but is implicitly added when importing sqlContext.implicits._*/
Now In the testing, I'm mocking the rdd and the sqlContext and I want to mock the toDF() function. I Can't mock the function toDF() since it doesn't exist on the RDD level. Even if I do a simple trick, importing the mocked sqlContext.implicit._ I get an error that any function that is not publicaly available to the object can't be mocked. I even tried to mock the code that is implicitly executed until toDF() but I get stuck with Final/Pivate[in accessible] classes that I also can't mock. Your suggestions are more than welcomed. Thanks in advance :)