Passing SparkSession as a function parameter (Spark Scala)

I'm in the process of generating tables using Spark Scala and I am concerned about efficiency.
Would passing the SparkSession around as a function parameter make my program slower? Is it any slower than calling SparkSession.getOrCreate?
I am using YARN as master.
Thanks in advance.

You can create the Spark session once and pass it around without losing any performance.
However, it is a little inconvenient to modify method signatures to pass in a session object. You can avoid that by simply calling getOrCreate inside the functions to obtain the same global session without passing it. When getOrCreate is first called it sets the current session as the default (SparkSession.setDefaultSession) and returns that same session to subsequent getOrCreate calls.
import org.apache.spark.sql.SparkSession

val spark: SparkSession = SparkSession.builder
  .appName("test")
  .master("local[2]")
  .getOrCreate()

// pass the session in explicitly
function1(spark)

// or obtain the same global session without passing it
def function2(): Unit = {
  val s = SparkSession.builder.getOrCreate()
}
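For completeness, function1 is not defined in the answer above; a minimal sketch of what it might look like, assuming it just reads and shows a table (the table name is a placeholder):

def function1(spark: SparkSession): Unit = {
  // "some_table" is a hypothetical table name used only for illustration
  val df = spark.read.table("some_table")
  df.show()
}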

Related

Databricks notebooks + Repos spark session scoping breakdown

I'm using databricks, and I have a repo in which I have a basic python module within which I define a class. I'm able to import and access the class and its methods from the databricks notebook.
One of the methods within the class within the module looks like this (simplified)
def read_raw_json(self):
    self.df = spark.read.format("json").load(f"{self.base_savepath}/{self.resource}/{self.resource}*.json")
When I execute this particular method within the databricks notebook it gives me a NameError that 'spark' is not defined. The databricks runtime instantiates with a spark session stored in a variable called "spark". I assumed any methods executed in that runtime would inherit from the parent scope.
Anyone know why this isn't the case?
EDIT: I was able to get it to work by passing it the spark variable from within the notebook as an object to my class instantiation. But I don't want to call this an answer yet, because I'm not sure why I needed to.
Python files (as opposed to notebooks) don't have spark initialized in them.
When the method runs, Python raises a NameError because read_raw_json references a spark object that doesn't exist in the module's scope.
You can modify the Python file like this and everything will work fine:
from pyspark.sql import SparkSession

def read_raw_json(self):
    spark = SparkSession.builder.getOrCreate()
    self.df = spark.read.format("json").load(f"{self.base_savepath}/{self.resource}/{self.resource}*.json")
The spark variable is defined only in the top-level notebook, but not in the packages. You have two choices:
Pass the spark instance as an argument of the read_raw_json function:
def read_raw_json(self, spark):
    self.df = spark.read.format("json").load(f"{self.base_savepath}/{self.resource}/{self.resource}*.json")
Get the active session:
from pyspark.sql import SparkSession

def read_raw_json(self):
    spark = SparkSession.getActiveSession()
    self.df = spark.read.format("json").load(f"{self.base_savepath}/{self.resource}/{self.resource}*.json")

Spark context not invoking stop() and shutdown hook not called when saving to s3

I'm reading data from parquet files, processing it, and then saving the result to S3. The problem is relevant only to the last (saving) part.
When running locally, after saving the data, the sparkContext's stop() is not invoked, and the shutdown hook is not called. I need to manually invoke/call them by clicking the IDE's (IntelliJ) stop button.
When saving to a local folder, the process finishes correctly.
When running on EMR, the process finishes correctly.
Changing the format/header/etc. doesn't solve the problem.
Removing any/all of the transformations/joins doesn't solve the problem.
The problem occurs with both DataFrames and Datasets.
EDIT: Following the comments, I tried using both s3:// and s3a:// prefixes, as well as separating the commands, but the issue remains.
Example code:
package test
import org.apache.spark.sql.SparkSession
object test {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder
      .appName("s3_output_test")
      .master("local[*]")
      .getOrCreate()

    spark
      .read
      .parquet("path-to-input-data")
      // .transform(someFunc)
      // .join(someDF)
      .coalesce(1)
      .write
      .format("csv")
      .option("header", true)
      .save("s3a://bucket-name/output-folder")

    spark.stop()
  }
}
Any ideas on a solution or how to further debug the issue will be greatly appreciated, thanks!
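For reference, a minimal sketch of the same job with the session bound to a val and stop() placed in a finally block, so the shutdown call is reached even if the write throws; this only makes the shutdown path explicit and is not claimed to resolve the hang described above (the paths and bucket name are the placeholders from the question):

import org.apache.spark.sql.SparkSession

object S3OutputTest {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("s3_output_test")
      .master("local[*]")
      .getOrCreate()
    try {
      spark.read
        .parquet("path-to-input-data")
        .coalesce(1)
        .write
        .format("csv")
        .option("header", true)
        .save("s3a://bucket-name/output-folder")
    } finally {
      // always stop the session, even if the write fails
      spark.stop()
    }
  }
}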

Registering Classes with Kryo via SparkSession in Spark 2+

I'm migrating from Spark 1.6 to 2.3.
I need to register custom classes with Kryo. So what I see here: https://spark.apache.org/docs/2.3.1/tuning.html#data-serialization
val conf = new SparkConf().setMaster(...).setAppName(...)
conf.registerKryoClasses(Array(classOf[MyClass1], classOf[MyClass2]))
val sc = new SparkContext(conf)
The problem is... everywhere else in Spark 2+ instructions, it indicates that SparkSession is the way to go for everything... and if you need SparkContext it should be through spark.sparkContext and not as a stand-alone val.
So now I use the following (and have wiped any trace of conf, sc, etc. from my code)...
val spark = SparkSession.builder.appName("myApp").getOrCreate()
My question: where do I register classes with Kryo if I don't use SparkConf or SparkContext directly?
I see spark.kryo.classesToRegister here: https://spark.apache.org/docs/2.3.1/configuration.html#compression-and-serialization
I have a pretty extensive conf.json to set spark-defaults.conf, but I want to keep it generalizable across apps, so I don't want to register classes here.
When I look here: https://spark.apache.org/docs/2.3.1/api/scala/index.html#org.apache.spark.sql.SparkSession
It makes me think I can do something like the following to augment my spark-defaults.conf:
val spark =
SparkSession
.builder
.appName("myApp")
.config("spark.kryo.classesToRegister", "???")
.getOrCreate()
But what is ??? if I want to register org.myorg.myapp.{MyClass1, MyClass2, MyClass3}? I can't find an example of this use.
Would it be:
.config("spark.kryo.classesToRegister", "MyClass1,MyClass2,MyClass3")
or
.config("spark.kryo.classesToRegister", "class org.myorg.mapp.MyClass1,class org.myorg.mapp.MyClass2,class org.myorg.mapp.MyClass3")
or something else?
EDIT
When I try testing different formats in spark-shell via spark.conf.set("spark.kryo.classesToRegister", "any,any2,any3"), I never get any error messages, no matter what I put in the string any,any2,any3.
I tried each of the following formats for any:
"org.myorg.myapp.myclass"
"myclass"
"class org.myorg.myapp.myclass"
I can't tell if any of these successfully registered anything.
Have you tried the following? It should work, since registerKryoClasses is part of the SparkConf API; the only thing missing is that you need to plug the SparkConf into the SparkSession:
private lazy val sparkConf = new SparkConf()
  .setAppName("spark_basic_rdd")
  .setMaster("local[*]")
  .registerKryoClasses(...)

private lazy val sparkSession = SparkSession.builder()
  .config(sparkConf)
  .getOrCreate()
And if you need a Spark Context you can call:
private lazy val sparkContext: SparkContext = sparkSession.sparkContext
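On the string format the question asks about: the Spark configuration docs describe spark.kryo.classesToRegister as a comma-separated list of class names, so fully qualified names with no "class " prefix are what the setting expects. A minimal sketch using the class names from the question (registration only has an effect when the Kryo serializer is enabled, hence the extra config line):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .appName("myApp")
  // enable Kryo serialization; classesToRegister is only used by the Kryo serializer
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .config("spark.kryo.classesToRegister",
    "org.myorg.myapp.MyClass1,org.myorg.myapp.MyClass2,org.myorg.myapp.MyClass3")
  .getOrCreate()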

Accessing Spark.SQL

I am new to Spark. Following an example in a book, I found that the command below was giving the error shown. What would be the best way to run a Spark SQL command while coding in Spark in general?
scala> // Use SQL to create another DataFrame containing the account summary records
scala> val acSummary = spark.sql("SELECT accNo, sum(tranAmount) as TransTotal FROM trans GROUP BY accNo")
<console>:37: error: not found: value spark
I tried importing org.apache.spark.SparkContext and using the sc object, but no luck.
Assuming you're in the spark-shell, first get a SQLContext like this:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
Then you can do:
val acSummary = sqlContext.sql("SELECT accNo, sum(tranAmount) as TransTotal FROM trans GROUP BY accNo")
So the value spark that is available in spark-shell is actually an instance of SparkSession (https://spark.apache.org/docs/2.0.2/api/scala/index.html#org.apache.spark.sql.SparkSession)
val spark = SparkSession.builder().getOrCreate()
will give you one.
What version are you using? It appears you're in the shell, and this should work, but only in Spark 2+; otherwise you have to use sqlContext.sql.
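If you are on Spark 2+, a minimal sketch of the SparkSession route, assuming a DataFrame named transDF has already been built and needs to be registered as the temporary view trans before the query can see it (transDF is a hypothetical name):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// transDF is assumed to have been created earlier, e.g. from a file or an RDD
transDF.createOrReplaceTempView("trans")

val acSummary = spark.sql(
  "SELECT accNo, sum(tranAmount) AS TransTotal FROM trans GROUP BY accNo")
acSummary.show()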

Spark 2.0 Scala - RDD.toDF()

I am working with Spark 2.0 Scala. I am able to convert an RDD to a DataFrame using the toDF() method.
val rdd = sc.textFile("/pathtologfile/logfile.txt")
val df = rdd.toDF()
But for the life of me I cannot find where this is in the API docs. It is not under RDD, but it is under Dataset (link 1). However, I have an RDD, not a Dataset.
Also, I can't see it under implicits (link 2).
So please help me understand why toDF() can be called for my RDD. Where is this method being inherited from?
It's coming from here:
Spark 2 API
Explanation: if you import sqlContext.implicits._, you get an implicit method that converts an RDD to a DatasetHolder (rddToDatasetHolder), and then you call toDF on the DatasetHolder.
Yes, you should import the sqlContext implicits like this, before you call toDF on your RDDs:
val sqlContext = spark.sqlContext // or new org.apache.spark.sql.SQLContext(sc) in older code
import sqlContext.implicits._
val df = rdd.toDF()
Yes, I finally found peace of mind on this issue. It was troubling me badly, and this post is a life saver. I was trying to generically load data from log files into a mutable List of case class objects, with the idea of finally converting the list into a DataFrame. However, because the list was mutable and Spark 2.1.1 changed the toDF implementation, the list was not getting converted. I even considered saving the data to a file and loading it back using .read, but five minutes ago this post saved my day.
I did it exactly as described: after loading the data into the mutable list, I immediately used
import spark.sqlContext.implicits._
val df = <mutable list object>.toDF
df.show()
I have done just this with Spark 2, and it worked:
import spark.implicits._ // already in scope automatically in spark-shell
val orders = sc.textFile("/user/gd/orders")
val ordersDF = orders.toDF()
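Putting the pieces together, a minimal self-contained sketch of the Spark 2 pattern (the case class, split logic, and column layout are hypothetical; the log file path is taken from the question):

import org.apache.spark.sql.SparkSession

case class LogLine(level: String, message: String) // hypothetical schema

val spark = SparkSession.builder.appName("toDF-example").getOrCreate()
import spark.implicits._ // brings the RDD-to-DatasetHolder conversion into scope, enabling .toDF

val rdd = spark.sparkContext
  .textFile("/pathtologfile/logfile.txt")
  .map(_.split("\t"))           // assumes tab-separated records
  .map(a => LogLine(a(0), a(1)))

val df = rdd.toDF()
df.show()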