Scala Unspecified Value Parameters

I want to extend the SparkSession class in Spark. I copied the constructor of the original SparkSession, partially reproduced here:
class SparkSession private(
    @transient val sparkContext: SparkContext,
    @transient private val existingSharedState: Option[SharedState],
    @transient private val parentSessionState: Option[SessionState],
    @transient private[sql] val extensions: SparkSessionExtensions)
  extends Serializable with Closeable with Logging { self =>

  private[sql] def this(sc: SparkContext) {
    this(sc, None, None, new SparkSessionExtensions)
  }

  // other implementations
}
Here's my attempt at extending it:
class CustomSparkSession private(
    @transient override val sparkContext: SparkContext,
    @transient private val existingSharedState: Option[SharedState],
    @transient private val parentSessionState: Option[SessionState],
    @transient override private[sql] val extensions: SparkSessionExtensions)
  extends SparkSession {
  // implementation
}
But I get an error on the SparkSession part of extends SparkSession:
Unspecified value parameters: sc: SparkContext
I know that it's coming from the this constructor in the original SparkSession, but I'm not sure how, or even whether, I can extend this properly. Any ideas?

When you write class Foo extends Bar you are actually (1) creating a default (no-argument) constructor for class Foo, and (2) calling a default constructor of class Bar.
Consequently, if you have something like class Bar(bar: String), you can't just write class Foo extends Bar, because there is no default constructor to call, you need to pass a parameter for bar. So, you could write something like
class Foo(bar: String) extends Bar(bar)
This is why you are seeing this error - you are trying to call a constructor for SparkSession, but not passing any value for sc.
But you have a bigger problem. That private keyword you see next to SparkSession (and another one before this) means that the constructor is ... well ... private. You cannot call it. In other words, this class cannot be subclassed (outside the sql package), so you should look for another way to achieve what you are trying to do.
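One workable alternative, for example, is composition: hold a SparkSession and put your extra helpers next to it. Below is a minimal sketch under the assumption that you only need additional helper methods; CustomSession and readCsv are illustrative names, not Spark API. (If the real goal is the extensions parameter, SparkSession.builder().withExtensions(...) is the supported hook for that.)
import org.apache.spark.sql.{DataFrame, SparkSession}

// Wrap the session instead of subclassing it; delegate to it as needed.
class CustomSession(val spark: SparkSession) {
  def readCsv(path: String): DataFrame =
    spark.read.option("header", "true").csv(path)
}

// Usage:
// val custom = new CustomSession(SparkSession.builder().getOrCreate())
// val df = custom.readCsv("/some/path.csv")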

Related

Task not serializable when class is serializable

I have the following class in Scala
case class A(a: Int, b: Int) extends Serializable
when I try the following in Spark 2.4 (via Databricks)
val textFile = sc.textFile(...)
val df = textFile.map(_=>new A(2,3)).toDF()
(Edit: the error happens when I call df.collect() or register it as a table)
I get org.apache.spark.SparkException: Task not serializable
what am I missing?
I've tried adding encoders:
implicit def AEncoder: org.apache.spark.sql.Encoder[A] =
  org.apache.spark.sql.Encoders.kryo[A]
and
import spark.implicits._
import org.apache.spark.sql.Encoders
Edit: I have also tried:
val df = textFile.map(_=>new A(2,3)).collect()
but no luck so far.
Sometimes this occurs intermittently on Databricks, which is most annoying. Restart the cluster and try again; I have had this error occasionally, and after a restart it did not occur.
You can directly parse the file as Dataset with the case class you have.
case class A(a:Int,b:Int) extends Serializable
val testRDD = spark.sparkContext.textFile("file:///test_file.csv")
val testDS = testRDD.map(line => line.split(",")).map(line_cols => A(line_cols(0).toInt, line_cols(1).toInt)).toDS()
// res23: org.apache.spark.sql.Dataset[A] = [a: int, b: int]

Avoid import tax when using spark implicits

In my testing, I have a test trait to provide spark context:
trait SparkTestTrait {
  lazy val spark: SparkSession = SparkSession.builder().getOrCreate()
}
The problem is that I need to add an import in every test function:
test("test1) {
import spark.implicits._
}
I managed to reduce this to one per file by adding the following to the SparkTestTrait:
object testImplicits extends SQLImplicits {
  protected override def _sqlContext: SQLContext = spark.sqlContext
}
and then in the constructor of the implementing file:
import testImplicits._
However, I would prefer to have these implicits imported to all classes implementing SparkTestTrait (I can't have SparkTestTrait extend SQLImplicits because the implementing classes already extend an abstract class).
Is there a way to do this?
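For reference, here is a minimal self-contained sketch of the per-file workaround described above (assuming ScalaTest's FunSuite and that a master is configured elsewhere, as in the question; ExampleTest is an illustrative name), with the open question being whether even the single remaining import can be avoided:
import org.apache.spark.sql.{SQLContext, SQLImplicits, SparkSession}
import org.scalatest.FunSuite

trait SparkTestTrait {
  lazy val spark: SparkSession = SparkSession.builder().getOrCreate()

  // One import per file instead of one per test function.
  object testImplicits extends SQLImplicits {
    protected override def _sqlContext: SQLContext = spark.sqlContext
  }
}

class ExampleTest extends FunSuite with SparkTestTrait {
  import testImplicits._

  test("toDF is available without a per-test import") {
    val df = Seq(1, 2, 3).toDF("n")
    assert(df.count() == 3)
  }
}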

Create SparkSQL UDF with non serializable objects

I'm trying to write a UDF that I would like to use on Hive tables in an sqlContext. Is it in any way possible to include objects from other libraries that are not serializable? Here's a minimal example of what does not work:
def myUDF(s: String) = {
  import sun.misc.BASE64Encoder
  val coder = new BASE64Encoder
  val encoded = coder.encode(s.getBytes("UTF-8"))
  encoded
}
I register the function in the spark shell as a udf function:
val encoding = sqlContext.udf.register("encoder", myUDF)
If I try to run it on a table "test"
sqlContext.sql("SELECT encoder(colname) from test").show()
I get the error
org.apache.spark.SparkException: Task not serializable
object not serializable (class: sun.misc.BASE64Encoder, value: sun.misc.BASE64Encoder@4a7f9a94)
Is there a workaround for this? I tried embedding myUDF in an object and in a class but that didn't work either.
You can try defining the udf function as
def encoder = udf((s: String) => {
  import sun.misc.BASE64Encoder
  val coder = new BASE64Encoder
  val encoded = coder.encode(s.getBytes("UTF-8"))
  encoded
})
And call the udf function as
dataframe.withColumn("encoded", encoder(col("id"))).show
Updated
As @santon has pointed out, the BASE64Encoder is initialized for each row in the dataframe, which might lead to performance issues. The solution is to create a static object holding the BASE64Encoder and call it within the udf function.
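A sketch of that update follows, assuming a dataframe with a string column "id"; java.util.Base64 is used here as a stand-in for the JDK-internal sun.misc.BASE64Encoder, and Base64Holder is an illustrative name:
import org.apache.spark.sql.functions.{col, udf}

// Keep the encoder in a singleton object so it is created once per executor
// JVM (and never captured in the task closure) instead of once per row.
object Base64Holder {
  lazy val coder = java.util.Base64.getEncoder
}

val encoder = udf((s: String) => Base64Holder.coder.encodeToString(s.getBytes("UTF-8")))

dataframe.withColumn("encoded", encoder(col("id"))).show()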

How can I get the current SparkSession in any place of the codes?

I have created a session in the main() function, like this:
val sparkSession = SparkSession.builder.master("local[*]").appName("Simple Application").getOrCreate()
Now if I want to configure the application or access the properties, I can use the local variable sparkSession in the same function.
What if I want to access this sparkSession elsewhere in the same project, like project/module/.../.../xxx.scala. What should I do?
Once a session has been created (anywhere), you can safely use
SparkSession.builder.getOrCreate()
to get the (same) session anywhere in the code, as long as the session is still alive. Spark maintains a single active session, so unless it was stopped or crashed, you'll get the same one.
Edit: builder is not callable, as mentioned in the comments.
Since 2.2.0 you can access the active SparkSession through:
/**
 * Returns the active SparkSession for the current thread, returned by the builder.
 *
 * @since 2.2.0
 */
def getActiveSession: Option[SparkSession] = Option(activeThreadSession.get)
or the default SparkSession:
/**
 * Returns the default SparkSession that is returned by the builder.
 *
 * @since 2.2.0
 */
def getDefaultSession: Option[SparkSession] = Option(defaultSession.get)
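For example, a minimal sketch of how another module might pick up the already-created session (falling back to the builder if none exists yet):
import org.apache.spark.sql.SparkSession

// Reuse the session created in main(); fall back to the builder otherwise.
val spark: SparkSession = SparkSession.getActiveSession
  .orElse(SparkSession.getDefaultSession)
  .getOrElse(SparkSession.builder.getOrCreate())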
When the SparkSession variable has been defined as
val sparkSession = SparkSession.builder.master("local[*]").appName("Simple Application").getOrCreate()
this variable is going to refer to only one SparkSession, since it is a val. And you can always pass it to different classes for them to access as well:
val newClassCall = new NewClass(sparkSession)
Now you can use the same sparkSession in that new class as well.
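A short sketch of that pattern (rowCount is an illustrative method added for this example):
import org.apache.spark.sql.SparkSession

class NewClass(spark: SparkSession) {
  // Uses the passed-in session to query a table.
  def rowCount(table: String): Long = spark.table(table).count()
}

val newClassCall = new NewClass(sparkSession)
val n = newClassCall.rowCount("some_table")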
This is an old question and there are a couple of answers that are good enough, but I would like to give one more approach that can be used to make it work.
You can create a trait that extends Serializable and creates the Spark session as a lazy variable; then, throughout your project, every object that extends that trait gets access to the SparkSession instance.
Code as below:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.DataFrame

trait SparkSessionWrapper extends Serializable {
  lazy val spark: SparkSession = {
    SparkSession.builder().appName("TestApp").getOrCreate()
  }
}

// object with the main method; it extends SparkSessionWrapper
object App extends SparkSessionWrapper {
  def main(args: Array[String]): Unit = {
    val readdf = ReadFileProcessor.ReadFile("testpath")
    readdf.createOrReplaceTempView("TestTable")
    val viewdf = spark.sql("Select * from TestTable")
  }
}

object ReadFileProcessor extends SparkSessionWrapper {
  def ReadFile(path: String): DataFrame = {
    val df = spark.read.format("csv").load(path)
    df
  }
}
As you are extending SparkSessionWrapper in both of the objects you created, the Spark session gets initialized the first time the spark variable is encountered in the code, and you can then refer to it from any object that extends the trait without passing it as a parameter to the method. This works and gives you an experience similar to a notebook.
Update: If you want it to be even more generic and need to set a custom app name based on the type of workflow you are running, you can do it as below:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.DataFrame

trait SparkSessionWrapper extends Serializable {
  lazy val spark: SparkSession = {
    createSparkSession(appname)
  }

  def appname: String

  def createSparkSession(appname: String): SparkSession = {
    SparkSession.builder().appName(appname).master("local[*]").getOrCreate()
  }
}

// object with the main method; it extends SparkSessionWrapper
object App extends SparkSessionWrapper {
  def main(args: Array[String]): Unit = {
    val readdf = ReadFileProcessor.ReadFile("testpath")
    readdf.createOrReplaceTempView("TestTable")
    val viewdf = spark.sql("Select * from TestTable")
  }

  override def appname: String = "ReadFile"
}

object ReadFileProcessor extends SparkSessionWrapper {
  def ReadFile(path: String): DataFrame = {
    val df = spark.read.format("csv").load(path)
    df
  }

  override def appname: String = "ReadcsvFile"
}
The only difference is that you need to declare an abstract method inside the trait and then override it in whichever startup class you use, in order to provide the value.

How to share SparkContext with methods that need it implicitly

I have the following method:
def loadData(a:String, b:String)(implicit sparkContext: SparkContext) : RDD[Result]
I am trying to test it using this SharedSparkContext: https://github.com/holdenk/spark-testing-base/wiki/SharedSparkContext.
So, I made my test class extend SharedSparkContext:
class Ingest$Test extends FunSuite with SharedSparkContext
And within the test method I made this call:
val res: RDD[Result] = loadData("x", "y")
However, I am getting this error:
Error:(113, 64) could not find implicit value for parameter sparkContext: org.apache.spark.SparkContext
val result: RDD[Result] = loadData("x", "y")
So how can I make the SparkContext from the testing method visible?
EDIT:
I don't see how this question is related to Understanding implicit in Scala.
What is the variable name of your SparkContext? If it is sc, as is typically the case, you will have to alias it to the implicit value the method is looking for via implicit val sparkContext = sc, and then call your method in the same scope.
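For example, a sketch of such a test (assuming the loadData method and Result type from the question are in scope, and spark-testing-base is on the test classpath; IngestTest is an illustrative name):
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.scalatest.FunSuite
import com.holdenkarau.spark.testing.SharedSparkContext

class IngestTest extends FunSuite with SharedSparkContext {
  test("loadData resolves the implicit SparkContext") {
    // SharedSparkContext exposes the context as `sc`; alias it as an implicit
    // value so the compiler can fill in loadData's implicit parameter.
    implicit val sparkContext: SparkContext = sc
    val res: RDD[Result] = loadData("x", "y")
    res.count() // force evaluation so the job actually runs
  }
}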