How to submit Spark jobs to Apache Livy?

I am trying to understand how to submit a Spark job to Apache Livy.
I added the following dependencies to my pom.xml:
<dependency>
    <groupId>com.cloudera.livy</groupId>
    <artifactId>livy-api</artifactId>
    <version>0.3.0</version>
</dependency>
<dependency>
    <groupId>com.cloudera.livy</groupId>
    <artifactId>livy-scala-api_2.11</artifactId>
    <version>0.3.0</version>
</dependency>
Then I have the following code in Spark that I want to submit to Livy on request.
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

object Test {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("Test")
      .master("local[*]")
      .getOrCreate()

    import spark.sqlContext.implicits._
    implicit val sparkContext = spark.sparkContext
    // ...
  }
}
Then I have the following code that creates a LivyClient instance and uploads the application code to the Spark context:
val client = new LivyClientBuilder()
  .setURI(new URI(livyUrl))
  .build()

try {
  client.uploadJar(new File(testJarPath)).get()
  client.submit(new Test())
} finally {
  client.stop(true)
}
However, the problem is that the code of Test is not adapted to be used with Apache Livy.
How can I adjust the code of the Test object so that I can run client.submit(new Test())?

Your Test class needs to implement Livy's Job interface and override its call method; that is where you get access to the JobContext and, through it, the SparkContext. You can then pass an instance of Test to the submit method.
You don't have to create the SparkSession yourself: Livy creates it on the cluster, and you can access that context in your call method.
You can find more detailed information on Livy's programmatic API here: https://livy.incubator.apache.org/docs/latest/programmatic-api.html
Here's a sample implementation of the Test class:
import com.cloudera.livy.{Job, JobContext}

class Test extends Job[Int] {
  override def call(jc: JobContext): Int = {
    val spark = jc.sparkSession()
    // Do anything with the SparkSession
    1 // Return value
  }
}
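On the client side, submit then returns a JobHandle that you can block on for the result. A minimal sketch based on the snippets above (livyUrl and testJarPath are the same placeholders as in the question):

import java.io.File
import java.net.URI
import com.cloudera.livy.LivyClientBuilder

val client = new LivyClientBuilder()
  .setURI(new URI(livyUrl))
  .build()

try {
  // Ship the jar that contains the Test class to the Livy-managed Spark context
  client.uploadJar(new File(testJarPath)).get()

  // submit returns a JobHandle[Int]; get() blocks until the job completes
  val result: Int = client.submit(new Test()).get()
  println(s"Job result: $result")
} finally {
  client.stop(true)
}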

Related

Spark Session Dispose after Unit test for specified file is Done

I'm writing unit tests for Spark Scala code and facing this issue: when I run the unit test files separately everything works, but when I run all of the unit tests in the module using Maven, the test cases fail.
How can we create a local Spark instance, or a mock, for unit tests?
Cannot call methods on a stopped SparkContext. This stopped SparkContext was created at:
org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:947)
Methods I tried:
Creating a private SparkSession in each unit test file.
Creating a common SparkSession trait for all unit test files.
Calling spark.stop() at the end of each file, and also removing it from all of them.
To reproduce: make two unit test files and try to execute them together; both files should have a SparkSession.
import org.apache.spark.SparkContext
import org.apache.spark.sql.{SQLContext, SparkSession}
import org.scalatest.flatspec.AnyFlatSpec

class test1 extends AnyFlatSpec {
  val spark: SparkSession = SparkSession.builder
    .master("local[*]")
    .getOrCreate()
  val sc: SparkContext = spark.sparkContext
  val sqlCont: SQLContext = spark.sqlContext

  "test1" should "take spark session spark context and sql context" in {
    // do something
  }
}

class test2 extends AnyFlatSpec {
  val spark: SparkSession = SparkSession.builder
    .master("local[*]")
    .getOrCreate()
  val sc: SparkContext = spark.sparkContext
  val sqlCont: SQLContext = spark.sqlContext

  "test2" should "take spark session spark context and sql context" in {
    // do something
  }
}
When you run these independently, each file works fine, but when you run them together using mvn test they fail.
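One pattern that avoids the stopped-context error (a sketch of my own, not from the original post; SharedSparkSession is a hypothetical helper name) is to let every suite reuse one lazily created session and never call spark.stop() between suites, leaving shutdown to JVM exit:

import org.apache.spark.sql.SparkSession
import org.scalatest.flatspec.AnyFlatSpec

// Hypothetical shared holder: the session is created once per JVM and reused
// by every suite, so no suite ever sees a stopped SparkContext.
object SharedSparkSession {
  lazy val spark: SparkSession = SparkSession.builder
    .master("local[*]")
    .appName("unit-tests")
    .getOrCreate()
}

class test1 extends AnyFlatSpec {
  val spark: SparkSession = SharedSparkSession.spark

  "test1" should "reuse the shared session" in {
    assert(!spark.sparkContext.isStopped)
  }
}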

How to write spark-submit log files with Scala code?

I am trying to build a Scala-based jar file that uses log4j to write logs. Executing the code below with spark-shell works fine (logs are printed to the console). But when I try to make it write to a log file (with spark-shell or spark-submit), only the line with logging.info is printed. I want to set the log level to DEBUG. Here is my code:
import org.apache.log4j
import org.apache.spark.sql.SparkSession
import org.apache.log4j.{Level, Logger, PatternLayout, Priority, RollingFileAppender}
import java.time
import java.time.format.DateTimeFormatter

trait SparkContextProvider {
  def spark: SparkSession
}

trait Logs extends SparkContextProvider {
  lazy val logging: log4j.Logger = Logger.getLogger(getClass.getName)
  lazy val applicationId: String = spark.sparkContext.applicationId

  val appender = new RollingFileAppender()
  appender.setAppend(true)
  appender.setMaxFileSize("50MB")
  appender.setMaxBackupIndex(10)
  appender.setFile("/usr/spark-3.0.2/app-logs/spark-" + applicationId + ".log")
  appender.activateOptions()

  val layOut = new PatternLayout()
  layOut.setConversionPattern("%d{yyyy-MM-dd HH:mm:ss} %-5p %c{1}:%L - %m%n")
  appender.setLayout(layOut)

  logging.addAppender(appender)
  logging.setLevel(Level.DEBUG)
}

object DataExtractionProcess extends Logs {
  def Main(): Unit = {
    logging.info("hello test world")
  }

  override def spark: SparkSession = SparkSession.builder
    .appName("PredictiveDataOperation")
    .getOrCreate()
}
I trigger the job with DataExtractionProcess.Main().
I also tried to set the log level with:
//Logger.getLogger("org.apache.spark").setLevel(Level.DEBUG)
//Logger.getRootLogger().setLevel(Level.DEBUG)
//spark.sparkContext.setLogLevel("all")
But there was no change in the log file.
Thanks for the help.
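One hedged thing to check (an assumption on my part, not from the post), assuming log4j 1.x: make the appender's own threshold explicit and attach it to the root logger as well, so DEBUG events from loggers other than getClass.getName also reach the file. A minimal sketch:

import org.apache.log4j.{Level, Logger}

// Assumption: `appender` is the RollingFileAppender built in the Logs trait above.
// RollingFileAppender inherits a threshold from AppenderSkeleton; setting it
// explicitly rules out the appender itself dropping DEBUG events.
appender.setThreshold(Level.DEBUG)

// Attaching the appender to the root logger (and lowering its level) also captures
// DEBUG output from other classes. Note this makes Spark's own logging very verbose.
Logger.getRootLogger.addAppender(appender)
Logger.getRootLogger.setLevel(Level.DEBUG)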

When is onQueryTerminated triggered in Apache Spark StreamingQueryListeners?

I'm developing a custom StreamingQueryListener and I'd like to trigger its onQueryTerminated method in a test.
This is what I tried implementing:
import org.apache.spark.sql.{ SQLContext, SparkSession }
import org.apache.spark.sql.execution.streaming.MemoryStream
import org.apache.spark.sql.functions.{ col, to_date }
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.scalatest.flatspec.AnyFlatSpec

class MyListener extends StreamingQueryListener {
  override def onQueryStarted(event: StreamingQueryListener.QueryStartedEvent): Unit = {}
  override def onQueryProgress(event: StreamingQueryListener.QueryProgressEvent): Unit = {}
  override def onQueryTerminated(event: StreamingQueryListener.QueryTerminatedEvent): Unit =
    println(event.exception)
}

class ListenerSpec extends AnyFlatSpec {
  it should "trigger onQueryTerminated" in {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    spark.streams.addListener(new MyListener())

    implicit val sqlContext: SQLContext = spark.sqlContext
    import spark.implicits._

    val stream = MemoryStream[Int]
    stream.addData(Seq(1, 3, 4))

    val query = stream
      .toDF()
      .withColumn("columnDoesntExist", to_date(col("names")))
      .writeStream
      .format("console")
      .start()

    query.awaitTermination()
  }
}
However, this doesn't work: it raises an AnalysisException, but the onQueryTerminated method isn't triggered by the termination of the streaming query.
In what situations is that method triggered, and when is event.exception a Some(exception)?
Update
The following code successfully triggers the execution of onQueryTerminated:
val exceptionUdf = udf(() => throw new Exception())

val query = stream
  .toDF()
  .withColumn("exception", exceptionUdf())
  .writeStream
  .format("console")
  .start()
Refer to the accepted answer for an explanation as to why.
According to the book "Stream Processing with Apache Spark" (published by O'Reilly), the onQueryTerminated method gets
"Called when a streaming query is stopped. The event contains id and runId fields that correlate with the start event. It also provides an exception field that contains an exception if the query failed due to an error."
As you are getting an AnalysisException, your query never actually started. It only got to the first of the four phases of the Catalyst optimizer, "Analysis", and was never transformed into runnable code. More details can be found in the documentation of the Catalyst Optimizer.
The AnalysisException just means that there is an issue in the code related to the catalog, which is exactly what you intended to do: refer to a column that does not exist (in the catalog).
If you want to trigger the execution of the onQueryTerminated method, you need working code that fails while it is already running (e.g. by providing input of the wrong data type).
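For completeness, a minimal sketch of my own (building on the classes from the question) showing that a query which starts successfully also fires onQueryTerminated on a clean stop(), just with event.exception being None:

import org.apache.spark.sql.{ SQLContext, SparkSession }
import org.apache.spark.sql.execution.streaming.MemoryStream

val spark = SparkSession.builder().master("local[*]").getOrCreate()
spark.streams.addListener(new MyListener())

implicit val sqlContext: SQLContext = spark.sqlContext
import spark.implicits._

val stream = MemoryStream[Int]
stream.addData(Seq(1, 2, 3))

// No analysis error here, so the query actually starts running.
val query = stream
  .toDF()
  .writeStream
  .format("console")
  .start()

query.processAllAvailable() // let the batch be processed
query.stop()                // graceful stop => onQueryTerminated fires with exception = None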

How to write and update with the Kudu API in Spark 2.1

I want to write and update data through the Kudu API.
These are the Maven dependencies:
<dependency>
    <groupId>org.apache.kudu</groupId>
    <artifactId>kudu-client</artifactId>
    <version>1.1.0</version>
</dependency>
<dependency>
    <groupId>org.apache.kudu</groupId>
    <artifactId>kudu-spark2_2.11</artifactId>
    <version>1.1.0</version>
</dependency>
In the following code, I am not sure what to pass as the KuduContext parameter.
My code in spark2-shell:
val kuduContext = new KuduContext("master:7051")
The same error also occurs in Spark 2.1 Streaming:
import org.apache.kudu.spark.kudu._
import org.apache.kudu.client._

val sparkConf = new SparkConf().setAppName("DirectKafka").setMaster("local[*]")
val ssc = new StreamingContext(sparkConf, Seconds(2))
val messages = KafkaUtils.createDirectStream("")

messages.foreachRDD(rdd => {
  val spark = SparkSession.builder.config(rdd.sparkContext.getConf).getOrCreate()
  import spark.implicits._
  val bb = spark.read.options(Map("kudu.master" -> "master:7051", "kudu.table" -> "table")).kudu // good
  val kuduContext = new KuduContext("master:7051") // error
})
Then the error:
org.apache.spark.SparkException: Only one SparkContext may be running in this JVM (see SPARK-2243). To ignore this error, set spark.driver.allowMultipleContexts = true. The currently running SparkContext was created at:
org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:860)
Update your version of Kudu to the latest one (currently 1.5.0). The KuduContext takes the SparkContext as an input parameter in later versions, and that should prevent this problem.
Also, do the initial Spark setup outside of foreachRDD: in the code you provided, move both spark and kuduContext out of the foreach. You also do not need to create a separate SparkConf; the newer SparkSession API alone is enough.
val spark = SparkSession.builder.appName("DirectKafka").master("local[*]").getOrCreate()
import spark.implicits._

val kuduContext = new KuduContext("master:7051", spark.sparkContext)
val bb = spark.read.options(Map("kudu.master" -> "master:7051", "kudu.table" -> "table")).kudu

val messages = KafkaUtils.createDirectStream("")
messages.foreachRDD(rdd => {
  // do something with the bb table and messages
})
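Since the question is ultimately about writing and updating through the Kudu API, here is a minimal sketch of how that is typically done with kudu-spark's KuduContext (the table name and DataFrame are placeholders, and the exact method set depends on your Kudu version):

import org.apache.spark.sql.DataFrame

// df is assumed to match the schema of the target Kudu table.
def writeToKudu(kuduContext: KuduContext, df: DataFrame): Unit = {
  kuduContext.insertRows(df, "table")  // insert new rows (fails on duplicate keys)
  kuduContext.updateRows(df, "table")  // update existing rows by primary key
  kuduContext.upsertRows(df, "table")  // insert or update in a single call
}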

Best practice to create SparkSession object in Scala to use both in unittest and spark-submit

I have written a transform method from DataFrame to DataFrame, and I also want to test it with ScalaTest.
As you know, in Spark 2.x with the Scala API, you can create a SparkSession object as follows:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .config("spark.master", "local[2]")
  .getOrCreate()
This code works fine with unit tests. But when I run this code with spark-submit, the cluster options do not work.
For example,
spark-submit --master yarn --deploy-mode client --num-executors 10 ...
does not create any executors.
I have found that the spark-submit arguments are applied when I remove the config("spark.master", "local[2]") part of the above code.
But without the master setting, the unit test code does not work.
I tried to split the Spark (SparkSession) object creation into a test part and a main part, but there are so many code blocks that need spark, for example import spark.implicits._ and spark.createDataFrame(rdd, schema).
Is there any best practice for writing code that creates the spark object both for tests and for running with spark-submit?
One way is to create a trait which provides the SparkContext/SparkSession, and use that in your test cases, like so:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{SQLContext, SparkSession}

trait SparkTestContext {
  private val master = "local[*]"
  private val appName = "testing"
  System.setProperty("hadoop.home.dir", "c:\\winutils\\")

  private val conf: SparkConf = new SparkConf()
    .setMaster(master)
    .setAppName(appName)
    .set("spark.driver.allowMultipleContexts", "false")
    .set("spark.ui.enabled", "false")

  val ss: SparkSession = SparkSession.builder().config(conf).enableHiveSupport().getOrCreate()
  val sc: SparkContext = ss.sparkContext
  val sqlContext: SQLContext = ss.sqlContext
}
And your test class header then looks like this, for example:
class TestWithSparkTest extends BaseSpec with SparkTestContext with Matchers {
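For example, a suite that mixes in the trait could look like the following sketch (ExampleSpec and its DataFrame are made up for illustration; ss comes from SparkTestContext):

import org.scalatest.{FlatSpec, Matchers}

class ExampleSpec extends FlatSpec with SparkTestContext with Matchers {
  import ss.implicits._

  "a DataFrame built from a Seq" should "have the expected row count" in {
    val df = Seq(("a", 1), ("b", 2)).toDF("word", "count")
    df.count() shouldBe 2
  }
}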
I made a version where Spark will close correctly after tests.
import org.apache.spark.sql.{SQLContext, SparkSession}
import org.apache.spark.{SparkConf, SparkContext}
import org.scalatest.{BeforeAndAfterAll, FunSuite, Matchers}

trait SparkTest extends FunSuite with BeforeAndAfterAll with Matchers {
  var ss: SparkSession = _
  var sc: SparkContext = _
  var sqlContext: SQLContext = _

  override def beforeAll(): Unit = {
    val master = "local[*]"
    val appName = "MyApp"
    val conf: SparkConf = new SparkConf()
      .setMaster(master)
      .setAppName(appName)
      .set("spark.driver.allowMultipleContexts", "false")
      .set("spark.ui.enabled", "false")
    ss = SparkSession.builder().config(conf).getOrCreate()
    sc = ss.sparkContext
    sqlContext = ss.sqlContext
    super.beforeAll()
  }

  override def afterAll(): Unit = {
    sc.stop()
    super.afterAll()
  }
}
The spark-submit command with the parameter --master yarn sets YARN as the master, and this conflicts with the master("x") call in your code, even if you use master("yarn").
If you want to use import sparkSession.implicits._ features like toDF, toDS or other functions, you can just use a local sparkSession variable created like below:
val spark = SparkSession.builder().appName("YourName").getOrCreate()
without setting master("x") in the code; spark-submit --master yarn then takes effect on the cluster, not just on your local machine.
My advice: do not use a global sparkSession in your code. That may cause errors or exceptions.
Hope this helps you.
Good luck!
How about defining an object in which a method creates a singleton instance of SparkSession, such as MySparkSession.get(), and passing it as a parameter to each of your unit tests?
In your main method, you can create a separate SparkSession instance, which can have a different configuration.
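A minimal sketch of what that could look like (MySparkSession is the hypothetical name used in this answer; the configuration values are placeholders):

import org.apache.spark.sql.SparkSession

// Hypothetical helper: unit tests call MySparkSession.get() to obtain a shared
// local session, while the production main() builds its own session without a
// hard-coded master, so spark-submit options still take effect.
object MySparkSession {
  private var instance: SparkSession = _

  def get(): SparkSession = synchronized {
    if (instance == null) {
      instance = SparkSession.builder()
        .appName("unit-tests")
        .master("local[2]") // only relevant when no session exists yet
        .getOrCreate()
    }
    instance
  }
}

object Main {
  def main(args: Array[String]): Unit = {
    // No master here: spark-submit --master yarn (etc.) decides where this runs.
    val spark = SparkSession.builder().appName("MyApp").getOrCreate()
    // ... pass `spark` (or MySparkSession.get() in tests) into your transform methods ...
  }
}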