Spark session Null Pointer with Checkpointing - scala

I have enabled checkpointing, which saves the logs to S3.
If there are NO files in the checkpoint directory, Spark Streaming works fine and I can see log files appearing in the checkpoint directory. Then I kill Spark Streaming and restart it. This time, I start getting a NullPointerException for the Spark session.
In short: if there are NO log files in the checkpoint directory, Spark Streaming works fine. However, as soon as I restart Spark Streaming WITH log files in the checkpoint directory, I start getting a NullPointerException on the Spark session.
Below is the code:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

// s3Config and EventHubClient are defined elsewhere in the project.
object asf {
  val microBatchInterval = 5
  val sparkSession = SparkSession
    .builder()
    .appName("Streaming")
    .getOrCreate()
  val conf = new SparkConf(true)
  //conf.set("spark.streaming.receiver.writeAheadLog.enable", "true")
  val sparkContext = SparkContext.getOrCreate(conf)
  val checkpointDirectory = "s3a://bucketname/streaming-checkpoint"

  def main(args: Array[String]): Unit = {
    println("Spark session: " + sparkSession)
    val ssc = StreamingContext.getOrCreate(checkpointDirectory,
      () => {
        createStreamingContext(sparkContext, microBatchInterval, checkpointDirectory, sparkSession)
      }, s3Config.getConfig())
    ssc.start()
    ssc.awaitTermination()
  }

  def createStreamingContext(sparkContext: SparkContext, microBatchInterval: Int, checkpointDirectory: String, spark: SparkSession): StreamingContext = {
    println("Spark session inside: " + spark)
    val ssc = new StreamingContext(sparkContext, Seconds(microBatchInterval))
    //TODO: StorageLevel.MEMORY_AND_DISK_SER
    val lines = ssc.receiverStream(new EventHubClient(StorageLevel.MEMORY_AND_DISK_SER))
    lines.foreachRDD { rdd =>
      val df = spark.read.json(rdd)
      df.show()
    }
    ssc.checkpoint(checkpointDirectory)
    ssc
  }
}
And again, the very first time I run this code (with no log files in the checkpoint directory), I can see the DataFrame being printed out.
If I run with log files in the checkpoint directory, I don't even see
println("Spark session inside: " + spark)
getting printed, even though it IS printed the first time. The error:
Exception in thread "main" java.lang.NullPointerException
at org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:111)
at org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:109)
at org.apache.spark.sql.DataFrameReader.<init>(DataFrameReader.scala:549)
at org.apache.spark.sql.SparkSession.read(SparkSession.scala:605)
And the error is happening at:
val df = spark.read.json(rdd)
Edit: I added this line:
conf.set("spark.streaming.stopGracefullyOnShutdown","true")
and it still did not make a difference; I am still getting the NullPointerException.

To answer my own question, this works:
lines.foreachRDD {
  rdd => {
    val sqlContext: SQLContext = SparkSession.builder.config(rdd.sparkContext.getConf).getOrCreate().sqlContext
    val df = sqlContext.read.json(rdd)
    df.show()
  }
}
Passing a Spark session built from rdd.sparkContext works.

Just to put it explicitly for the benefit of newcomers: this is an anti-pattern. Creating a Dataset inside a transformation is not allowed!
As Michel mentioned, the executors won't have access to the SparkSession.
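For comparison, the pattern shown in the Spark Streaming programming guide for using DataFrames inside a stream is a lazily instantiated singleton SparkSession that is looked up inside foreachRDD rather than captured in the checkpointed closure. A minimal sketch of that pattern, assuming (as in the question) that the receiver emits JSON strings:

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

/** Lazily instantiated singleton SparkSession (per the streaming programming guide). */
object SparkSessionSingleton {
  @transient private var instance: SparkSession = _

  def getInstance(sparkConf: SparkConf): SparkSession = {
    if (instance == null) {
      instance = SparkSession.builder.config(sparkConf).getOrCreate()
    }
    instance
  }
}

// Used inside createStreamingContext (where `lines` is defined), instead of the captured `spark` value:
lines.foreachRDD { rdd =>
  val session = SparkSessionSingleton.getInstance(rdd.sparkContext.getConf)
  val df = session.read.json(rdd) // assumes rdd is an RDD[String] of JSON documents
  df.show()
}

The point is the same as in the workaround above: nothing restored from the checkpoint is relied on to hold a live session; it is re-created (or looked up) on the driver for every batch.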

Related

Spark Listener execute hook on onJobComplete on Executors?

I have a simple Spark job which reads CSV data from S3, transforms it, partitions it by date, and saves it to the local file system.
I have a CSV file on S3 with the content below.
sample input: japan, 01-01-2020, weather, provider, device
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{SaveMode, SparkSession}

case class WeatherReport(country: String, date: String, event: String, provide: String, device: String)

object SampleSpark extends App {
  val conf = new SparkConf()
    .setAppName("processing")
    .setIfMissing("spark.master", "local[*]")
    .setIfMissing("spark.driver.host", "localhost")
  val sc = new SparkContext(conf)
  val spark = SparkSession.builder.config(conf).getOrCreate()
  import spark.implicits._

  val baseRdd = sc.textFile("s3a://mybucket/sample/*.csv")
  val weatherDataFrame = baseRdd
    .filter(_.trim.nonEmpty)
    .map { line =>
      val f = line.split(",").map(_.trim)
      WeatherReport(f(0), f(1), f(2), f(3), f(4))
    }
    .toDF()

  weatherDataFrame.write.partitionBy("date")
    .mode(SaveMode.Append)
    .format("com.databricks.spark.csv")
    .save("outputDirectory")
}
The files get saved under "outputDirectory/date=01-01-2020/part-*" with more than one part file.
I want to merge the part files, drop the date= prefix so the path looks like "outputDirectory/01-01-2020/output.csv", and copy the result to S3.
How is it possible to do that?
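For the merge and rename steps in isolation: if the part files end up somewhere the driver can read (a shared filesystem, or the local filesystem when running with local[*]), the Hadoop FileSystem API can concatenate the parts and strip the date= prefix on the driver. A rough sketch under those assumptions; the bucket name is a placeholder, and FileUtil.copyMerge exists in Hadoop 2.x but was removed in Hadoop 3, where you would have to concatenate the streams yourself:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

object MergeAndUploadSketch {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    val localFs = FileSystem.getLocal(conf)
    val s3Fs = new Path("s3a://mybucket/merged").getFileSystem(conf)

    // One subdirectory per partition value, e.g. outputDirectory/date=01-01-2020
    for (status <- localFs.listStatus(new Path("outputDirectory")) if status.isDirectory) {
      val plainDate = status.getPath.getName.stripPrefix("date=") // "01-01-2020"
      val target = new Path(s"s3a://mybucket/merged/$plainDate/output.csv")
      // Concatenate every part file in the directory into a single target file.
      FileUtil.copyMerge(localFs, status.getPath, s3Fs, target, false, conf, null)
    }
  }
}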
I thought of using a SparkListener like the one below, but I guess it would only run on the driver, while the files would be present on the executors.
sparkContext.addListener(new SparkListener {
  override def onJobEnd(jobEnd: SparkListenerJobEnd) {
    renameDirectory()
    mergePartFilesToSingleFiles()
    uploadFileToS3()
  }
})
Is there a way to run a post-job-completion hook on the executors and the driver that would sync all of their local files to S3?
You can run post-execution hooks on executors by registering a TaskCompletionListener:
// call this from code that runs on an executor, such as the mapper that builds WeatherReport
val taskContext = TaskContext.get
taskContext.addTaskCompletionListener(customTaskCompletionListener)
Reference:
https://spark.apache.org/docs/latest/api/java/org/apache/spark/TaskContext.html#addTaskCompletionListener-scala.Function1-
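To make the fragment above self-contained, here is a minimal sketch of registering the hook from inside a task. The upload logic itself is only a placeholder (uploadLocalFilesToS3 is a hypothetical helper, not a Spark or Hadoop API), and the hook fires once per task, on the executor that ran it:

import org.apache.spark.{SparkConf, SparkContext, TaskContext}
import org.apache.spark.util.TaskCompletionListener

object TaskHookSketch extends App {
  val sc = new SparkContext(
    new SparkConf().setAppName("task-hook").setIfMissing("spark.master", "local[*]"))

  // Hypothetical helper: copy whatever this executor wrote locally up to S3.
  def uploadLocalFilesToS3(): Unit = { /* S3 client code would go here */ }

  sc.textFile("s3a://mybucket/sample/*.csv")
    .mapPartitions { iter =>
      // Runs on the executor, once per task: register the hook before processing.
      TaskContext.get.addTaskCompletionListener(new TaskCompletionListener {
        override def onTaskCompletion(context: TaskContext): Unit = uploadLocalFilesToS3()
      })
      iter.map(_.trim)
    }
    .saveAsTextFile("outputDirectory")
}

Note that this still only covers files visible to that executor; merging the parts into a single output.csv per date would have to happen afterwards, wherever all the parts are reachable.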

Spark streaming with Kafka: code inside foreachRDD is not executing if I have the function executed LOCALLY

I have set up Spark 2.2 locally and am working with Scala.
The Spark session config is below:
val sparkSession = SparkSession
  .builder()
  .appName("My application")
  .config("es.nodes", "localhost:9200")
  .config("es.index.auto.create", true)
  .config("spark.streaming.backpressure.initialRate", "1")
  .config("spark.streaming.kafka.maxRatePerPartition", "7")
  .master("local[2]")
  .enableHiveSupport()
  .getOrCreate()
I am running Spark on my local machine. When I do:
kafkaStream.foreachRDD(rdd => {
  calledFunction(rdd)
})

def calledFunction(rdd: RDD[ConsumerRecord[String, String]]): Unit = {
  rdd.foreach(r => {
    print("hello")
  })
}
For the above code, "hello" is not printed on my local machine, but all the jobs are queued up.
If I change my code to:
kafkaStream.foreachRDD(rdd => {
  rdd.foreach(r => {
    print("hello")
  })
})
then it prints "hello" to the console.
Can you please help me figure out what the issue is?
When running with Spark 1.6 it prints hello to the console. For reference, here is the sample code:
val message = KafkaUtils.createStream[Array[Byte], String, DefaultDecoder, StringDecoder](
  ssc,
  kafkaConf,
  Map("test" -> 1),
  StorageLevel.MEMORY_ONLY
)
val lines = message.map(_._2)
lines.foreachRDD(rdd => { calledFunction(rdd) })

def calledFunction(rdd: RDD[String]): Unit = {
  rdd.foreach(r => {
    print("hello")
  })
}
Hope this helps. I am not able to reproduce the same issue with Spark 2.0 right now, due to a dependency mismatch.

How to create a stop condition on Spark streaming?

I want to use Spark Streaming to read data from HDFS. The idea is that another program will keep uploading new files to an HDFS directory, which my Spark Streaming job would process. However, I also want an end condition, that is, a way in which the program uploading the files can signal the Spark Streaming program that it is done uploading all the files.
For a simple example, take the program from here. The code is shown below. Assuming another program is uploading those files, how can the end condition be programmatically signalled by that program (not requiring us to press CTRL+C) to the Spark Streaming program?
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingWordCount {
  def main(args: Array[String]) {
    if (args.length < 2) {
      System.err.println("Usage StreamingWordCount <input-directory> <output-directory>")
      System.exit(0)
    }
    val inputDir = args(0)
    val output = args(1)
    val conf = new SparkConf().setAppName("Spark Streaming Example")
    val streamingContext = new StreamingContext(conf, Seconds(10))
    val lines = streamingContext.textFileStream(inputDir)
    val words = lines.flatMap(_.split(" "))
    val wc = words.map(x => (x, 1))
    wc.foreachRDD(rdd => {
      val counts = rdd.reduceByKey((x, y) => x + y)
      counts.saveAsTextFile(output)
      val collectedCounts = counts.collect
      collectedCounts.foreach(c => println(c))
    })
    println("StreamingWordCount: streamingContext start")
    streamingContext.start()
    println("StreamingWordCount: await termination")
    streamingContext.awaitTermination()
    println("StreamingWordCount: done!")
  }
}
OK, I got it. Basically, you create another thread from which you call ssc.stop(), to signal the stream processing to stop. For example, like this:
val ssc = new StreamingContext(sparkConf, Seconds(1))
//////////////////////////////////////////////////////////////////////
val thread = new Thread {
  override def run {
    ....
    // On reaching the end condition
    ssc.stop()
  }
}
thread.start
//////////////////////////////////////////////////////////////////////
val lines = ssc.textFileStream("inputDir")
.....
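To make the end condition concrete, here is a minimal sketch built around the word-count example above. It assumes the uploading program signals completion by writing a marker file named _DONE into the input directory; the marker name and the five-second polling interval are assumptions, not part of the original answer:

import org.apache.hadoop.fs.Path
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingWordCountWithStop {
  def main(args: Array[String]): Unit = {
    val inputDir = args(0)
    val conf = new SparkConf().setAppName("Spark Streaming Example")
    val ssc = new StreamingContext(conf, Seconds(10))

    val wc = ssc.textFileStream(inputDir).flatMap(_.split(" ")).map((_, 1))
    wc.foreachRDD(rdd => rdd.reduceByKey(_ + _).collect().foreach(println))

    // Watcher thread: stop the streaming context once the uploader writes the marker file.
    new Thread {
      override def run(): Unit = {
        val marker = new Path(inputDir, "_DONE")
        val fs = marker.getFileSystem(ssc.sparkContext.hadoopConfiguration)
        while (!fs.exists(marker)) Thread.sleep(5000)
        // Stop gracefully so in-flight batches finish before shutdown.
        ssc.stop(stopSparkContext = true, stopGracefully = true)
      }
    }.start()

    ssc.start()
    ssc.awaitTermination()
  }
}

awaitTermination() returns once the watcher thread has called ssc.stop(), so the driver exits cleanly without CTRL+C.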

Stopping Spark Streaming: exception in the cleaner thread but it will continue to run

I'm working on a Spark Streaming application, and I'm just trying to get a simple example of a Kafka direct stream working:
package com.username

import _root_.kafka.serializer.StringDecoder
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.kafka._
import org.apache.spark.streaming.{Seconds, StreamingContext}

object MyApp extends App {
  val topic = args(0)   // 1 topic
  val brokers = args(1) // localhost:9092
  val spark = SparkSession.builder().master("local[2]").getOrCreate()
  val sc = spark.sparkContext
  val ssc = new StreamingContext(sc, Seconds(1))
  val topicSet = topic.split(",").toSet
  val kafkaParams = Map[String, String]("metadata.broker.list" -> brokers)
  val directKafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicSet)

  // Just print out the data within the topic
  val parsers = directKafkaStream.map(v => v)
  parsers.print()

  ssc.start()
  val endTime = System.currentTimeMillis() + (5 * 1000) // 5 second loop
  while (System.currentTimeMillis() < endTime) {
    // write something to the topic
    Thread.sleep(1000) // 1 second pause between iterations
  }
  ssc.stop()
}
This mostly works: whatever I write into the Kafka topic gets included in the streaming batch and gets printed out. My only concern is what happens at ssc.stop():
dd/mm/yy hh:mm:ss WARN FileSystem: exception in the cleaner thread but it will continue to run
java.lang.InterruptedException
at java.lang.Object.wait(Native Method)
at java.lang.ReferenceQueue.remove(ReferenceQueue.java:143)
at java.lang.ReferenceQueue.remove(ReferenceQueue.java:164)
at org.apache.hadoop.fs.FileSystem$Statistics$StatisticsDataReferenceCleaner.run(FileSystem.java:2989)
at java.lang.Thread.run(Thread.java:748)
This exception doesn't cause my app to fail or exit, though. I know I could wrap ssc.stop() in a try/catch block to suppress it, but looking at the API docs leads me to believe that this is not its intended behavior. I've been looking around online for a solution, but nothing involving Spark mentions this exception. Is there any way for me to properly fix this?
I encountered the same problem when starting the process directly with sbt run. But if I package the project and start it with YOUR_SPARK_PATH/bin/spark-submit --class [classname] --master local[4] [package_path], it works correctly. Hope this helps.

spark dataframe write to file using scala

I am trying to read a file and add two extra columns: 1. a sequence number and 2. the filename.
When I run the Spark job in the Scala IDE, the output is generated correctly, but when I run it in PuTTY in local or cluster mode, the job gets stuck at stage 2 (save at File_Process). There is no progress even if I wait for an hour. I am testing on 1 GB of data.
Below is the code I am using:
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.{Row, SaveMode, SparkSession}
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.{LongType, StructField, StructType}

object File_Process {
  Logger.getLogger("org").setLevel(Level.ERROR)

  val spark = SparkSession
    .builder()
    .master("yarn")
    .appName("File_Process")
    .getOrCreate()

  // SEED and filename are defined elsewhere in the project.
  def main(arg: Array[String]): Unit = {
    val FileDF = spark.read
      .csv("/data/sourcefile/")
    val rdd = FileDF.rdd.zipWithIndex().map(indexedRow => Row.fromSeq((indexedRow._2.toLong + SEED + 1) +: indexedRow._1.toSeq))
    val FileDFWithSeqNo = StructType(Array(StructField("UniqueRowIdentifier", LongType)).++(FileDF.schema.fields))
    val datasetnew = spark.createDataFrame(rdd, FileDFWithSeqNo)
    val dataframefinal = datasetnew.withColumn("Filetag", lit(filename))
    dataframefinal.write
      .mode("overwrite")
      .format("com.databricks.spark.csv")
      .option("delimiter", "|")
      .save("/data/text_file/")
    spark.stop()
  }
}
If I remove the logic that adds the sequence number, the code works fine.
The code for creating the sequence number is:
val rdd = FileDF.rdd.zipWithIndex().map(indexedRow => Row.fromSeq((indexedRow._2.toLong + SEED + 1) +: indexedRow._1.toSeq))
val FileDFWithSeqNo = StructType(Array(StructField("UniqueRowIdentifier", LongType)).++(FileDF.schema.fields))
val datasetnew = spark.createDataFrame(rdd, FileDFWithSeqNo)
Thanks in advance.
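If a strictly consecutive sequence number is not actually required, one alternative worth noting is monotonically_increasing_id, which produces a unique (but not consecutive) 64-bit id per row without the extra zipWithIndex-plus-createDataFrame pass. A minimal sketch under that assumption, with SEED and filename assumed to be defined elsewhere as in the question:

import org.apache.spark.sql.functions.{lit, monotonically_increasing_id}

// Unique but NOT consecutive ids; sufficient when the column only has to identify rows.
val datasetnew = FileDF.withColumn("UniqueRowIdentifier", monotonically_increasing_id() + lit(SEED + 1))
val dataframefinal = datasetnew.withColumn("Filetag", lit(filename))

Note that withColumn appends the id column at the end instead of placing it first, which may or may not matter for the downstream consumers of the CSV.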