I am having trouble stopping a streaming context after a condition has been met inside a foreachRDD. Any time the ssc.stop() inside function foo is executed, I get an InterruptedException.
Simplified Code:
def main(){
var sc = new SparkContext(new SparkConf().setAppName("appname").setMaster("local"))
foo(123,sc)
//foo(312,sc) can I call foo again here?
sc.stop()
}
def foo(param1: Integer, sc: SparkContext){
val ssc = new StreamingContext(sc, Seconds(1))
var res = 0
//dummy data, but with the actual datatypes (not relevant to the error I get in this code)
val inputData: mutable.Queue[RDD[Int]] = mutable.Queue()
val inputStream: InputDStream[Int] = ssc.queueStream(inputData)
inputData += sc.makeRDD(List(1, 2))
val rdds_list = some_other_fn(inputStream, param1) //returns a DStream
rdds_list.foreachRDD((rdd) => {
def foo1(rdd: RDD[<some_type_2>]) = {
if (condition1) {
println("condition satisfied!") //prints correctly
res = do_stuff(rdd) //executes correctly
println("result: " + res) //executes correctly (and output is as intended)
}else{
println("stopping streaming context!")
ssc.stop(stopSparkContext = false) //error occurs here
}
}
foo1(rdd)
})
ssc.start()
ssc.awaitTermination()
res
}
Error log:
condition satisfied!
result: 124124
stopping streaming context!
[error] (pool-11-thread-1) java.lang.Error: java.lang.InterruptedException
java.lang.Error: java.lang.InterruptedException
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1155)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.InterruptedException
at java.lang.Object.wait(Native Method)
at java.lang.Thread.join(Thread.java:1252)
at java.lang.Thread.join(Thread.java:1326)
at org.apache.spark.util.AsynchronousListenerBus.stop(AsynchronousListenerBus.scala:160)
at org.apache.spark.streaming.scheduler.JobScheduler.stop(JobScheduler.scala:98)
at org.apache.spark.streaming.StreamingContext.stop(StreamingContext.scala:573)
at org.apache.spark.streaming.StreamingContext.stop(StreamingContext.scala:555)
at edu.gatech.cse8803.main.Main$$anonfun$testClustering$1.foo$1(Main.scala:315)
at edu.gatech.cse8803.main.Main$$anonfun$testClustering$1.apply(Main.scala:318)
at edu.gatech.cse8803.main.Main$$anonfun$testClustering$1.apply(Main.scala:306)
at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1.apply(DStream.scala:534)
at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1.apply(DStream.scala:534)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:42)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
at scala.util.Try$.apply(Try.scala:161)
at org.apache.spark.streaming.scheduler.Job.run(Job.scala:32)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply$mcV$sp(JobScheduler.scala:176)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:176)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:176)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:175)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
I tried using ssc.stop(stopSparkContext = true, stopGracefully = true) but I get this:
WARN scheduler.JobGenerator -
Timed out while stopping the job generator (timeout = 10000)
after foo is called, and the program just gets stuck (i.e., it does not complete and I have to Ctrl+C it).
Is this the correct way to stop a streaming context? Also, if I wanted to call foo multiple times, should I make any changes? I understand that there should be only one SparkContext per application, which is why I am trying to re-use it. Or should I close the SparkContext by setting stopSparkContext to true?
My environment:
sbt v1.0
Scala 2.10.5
Spark 1.3.1
Edit: I looked at other similar questions and tried all their answers - still no luck! :(
The stack trace shows that while the Spark driver was waiting for the batch job to finish, you were stopping the StreamingContext from inside the rdds_list.foreachRDD closure, which is executed by that very StreamingContext. The context needs to be stopped from outside that closure.
Also, contexts should not be created and stopped this frequently.
What I would recommend is the following:
Create the StreamingContext in main() and pass it to foo(...),
which changes foo's signature to:
def foo(param1: Integer, ssc: StreamingContext)
To safely close both contexts in a streaming application, register a shutdown hook like this:
sys.ShutdownHookThread {
  // Executes when a shutdown signal is received by the app
  log.info("Gracefully stopping the streaming application")
  // Stop the StreamingContext first; stopSparkContext = true also stops the SparkContext
  ssc.stop(stopSparkContext = true, stopGracefully = true)
  log.info("Application stopped")
}
But if you need to stop from program logic, stop the StreamingContext together with the SparkContext.
This makes your main() look like:
def main() {
  val sc = new SparkContext(new SparkConf().setAppName("appname").setMaster("local"))
  val ssc = new StreamingContext(sc, Seconds(1))
  sys.ShutdownHookThread {
    // Executes when a shutdown signal is received by the app
    log.info("Gracefully stopping the streaming application")
    ssc.stop(stopSparkContext = true, stopGracefully = true) // also stops the SparkContext
    log.info("Application stopped")
  }
  foo(123, ssc)
  // Stop the StreamingContext (and with it the SparkContext) once the work is done
  ssc.stop(stopSparkContext = true, stopGracefully = true)
}
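If the stop really has to be triggered by a condition observed inside foreachRDD (as in the original foo), one pattern that avoids the InterruptedException is to only set a flag from the batch code and let the driver's main thread perform the actual stop via awaitTerminationOrTimeout. The sketch below illustrates the idea; the socket source, the rdd.isEmpty end condition and the object name are placeholders of mine, not part of the original code:
import java.util.concurrent.atomic.AtomicBoolean
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}
object StopFromDriver {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("appname").setMaster("local[2]"))
    val ssc = new StreamingContext(sc, Seconds(1))
    // Flag set from inside foreachRDD (whose body runs on the driver), checked by the loop below
    val shouldStop = new AtomicBoolean(false)
    val stream = ssc.socketTextStream("localhost", 9999) // placeholder source
    stream.foreachRDD { rdd =>
      if (rdd.isEmpty()) shouldStop.set(true) // placeholder end condition
      else rdd.foreach(println)
    }
    ssc.start()
    // Poll instead of awaitTermination() so the stop is issued from this thread,
    // not from the job scheduler's worker thread that runs the foreachRDD body
    while (!ssc.awaitTerminationOrTimeout(1000)) {
      if (shouldStop.get()) ssc.stop(stopSparkContext = false, stopGracefully = true)
    }
    sc.stop()
  }
}
Because the stop is not issued from the scheduler thread that is executing the batch, the scheduler never interrupts itself while shutting down, which is what produces the java.lang.InterruptedException in the trace above.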
I thought this would work, but it actually fails.
import math._
import org.apache.spark.sql.SparkSession
object Position {
def main(args: Array[String]): Unit = {
// create a SparkSession from the Spark configuration
val spark = SparkSession.builder().getOrCreate()
// read the CSV files into DataFrames
val file1 = spark.read.csv("file:///home/aaron/Downloads/taxi_gps.txt")
val file2 = spark.read.csv("file:///home/aaron/Downloads/district.txt")
// rename the columns
val new_file1= file1.withColumnRenamed("_c4","lat")
.withColumnRenamed("_c5","lon")
val new_file2= file2.withColumnRenamed("_c0","dis")
.withColumnRenamed("_1","lat")
.withColumnRenamed("_2","lon")
.withColumnRenamed("_c3","r")
// haversine great-circle distance in km
def haversine(lat1:Double, lon1:Double, lat2:Double, lon2:Double): Double ={
val R = 6372.8 //radius in km
val dLat=(lat2 - lat1).toRadians
val dLon=(lon2 - lon1).toRadians
val a = pow(sin(dLat/2),2) + pow(sin(dLon/2),2) * cos(lat1.toRadians) * cos(lat2.toRadians)
val c = 2 * asin(sqrt(a))
R * c
}
// count taxis within each district's radius
new_file2.foreach(row => {
val district = row.getAs[Float]("dis")
val lon = row.getAs[Float]("lon")
val lat = row.getAs[Float]("lat")
val distance = row.getAs[Float]("r")
var temp = 0
new_file1.foreach(taxi => {
val taxiLon = taxi.getAs[Float]("lon")
val taxiLat = taxi.getAs[Float]("lat")
if(haversine(lat,lon,taxiLat,taxiLon) <= distance) {
temp+=1
}
})
println(s"district:$district temp=$temp")
})
}
}
Here are the results:
20/06/07 23:04:11 ERROR SparkContext: Error initializing SparkContext.
org.apache.spark.SparkException: A master URL must be set in your configuration
......
20/06/07 23:04:11 ERROR Utils: Uncaught exception in thread main
java.lang.NullPointerException
......
20/06/07 23:04:11 INFO SparkContext: Successfully stopped SparkContext
Exception in thread "main" org.apache.spark.SparkException: A master URL must be set in your configuration
Since this seems to be a Spark issue, I am not sure whether using a DataFrame inside another DataFrame's foreach is the only mistake in this program.
I am not familiar with Scala and Spark, so this is quite a tough question for me. I hope you can help me, thanks!
Your exception says org.apache.spark.SparkException: A master URL must be set in your configuration, so you need to set the master URL when building the SparkSession.
I assume you are running the code in an IDE. If so, please replace val spark= SparkSession.builder().getOrCreate() with val spark= SparkSession.builder().master("local[*]").getOrCreate() in your code.
Or, if you are executing this code using spark-submit, try adding --master yarn.
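For reference, a minimal sketch of the fixed setup for an IDE run is below; the appName is my own placeholder, and the .master("local[*]") line should be dropped (or overridden with --master) when the job is submitted through spark-submit:
import org.apache.spark.sql.SparkSession
object Position {
  def main(args: Array[String]): Unit = {
    // Without a master URL, SparkContext initialization fails with
    // "A master URL must be set in your configuration"
    val spark = SparkSession.builder()
      .appName("taxi-position") // placeholder name
      .master("local[*]")       // for local/IDE runs only
      .getOrCreate()
    val file1 = spark.read.csv("file:///home/aaron/Downloads/taxi_gps.txt")
    file1.show(5)
    spark.stop()
  }
}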
I want to use spark streaming for reading data from the HDFS. The idea is that another program will keep on uploading new files to an HDFS directory, which my spark streaming job would process. However, I also want to have an end condition. That is, a way in which the program uploading files to the HDFS can signal the spark streaming program, that it is done uploading all the files.
For a simple example, take the program from here. The code is shown below. Assuming another program is uploading those files, how can the end condition be programmatically signalled by that program (not requiring us to press Ctrl+C) to the Spark Streaming program?
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
object StreamingWordCount {
def main(args: Array[String]) {
if (args.length < 2) {
System.err.println("Usage StreamingWordCount <input-directory> <output-directory>")
System.exit(0)
}
val inputDir=args(0)
val output=args(1)
val conf = new SparkConf().setAppName("Spark Streaming Example")
val streamingContext = new StreamingContext(conf, Seconds(10))
val lines = streamingContext.textFileStream(inputDir)
val words = lines.flatMap(_.split(" "))
val wc = words.map(x => (x, 1))
wc.foreachRDD(rdd => {
val counts = rdd.reduceByKey((x, y) => x + y)
counts.saveAsTextFile(output)
val collectedCounts = counts.collect
collectedCounts.foreach(c => println(c))
}
)
println("StreamingWordCount: streamingContext start")
streamingContext.start()
println("StreamingWordCount: await termination")
streamingContext.awaitTermination()
println("StreamingWordCount: done!")
}
}
OK, I got it. Basically, you create another thread from which you call ssc.stop() to signal the stream processing to stop. For example, like this:
val ssc = new StreamingContext(sparkConf, Seconds(1))
//////////////////////////////////////////////////////////////////////
val thread = new Thread {
  override def run(): Unit = {
    ....
    // On reaching the end condition
    ssc.stop()
  }
}
thread.start()
//////////////////////////////////////////////////////////////////////
val lines = ssc.textFileStream("inputDir")
.....
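To connect this back to the question, the uploading program can send the signal by writing a marker file (for example _DONE) into the input directory when it has finished, and the extra thread simply polls for it. The following is only a sketch of a helper that could be dropped into the StreamingWordCount object above, under that assumption; the marker name, the poll interval and the helper name are arbitrary:
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.streaming.StreamingContext
// Starts a background thread that stops the StreamingContext once the
// uploading program has written a "_DONE" marker file into the input directory
def stopWhenMarkerAppears(ssc: StreamingContext, inputDir: String): Unit = {
  val monitor = new Thread {
    override def run(): Unit = {
      val fs = FileSystem.get(URI.create(inputDir), new Configuration())
      val marker = new Path(inputDir, "_DONE")
      while (!fs.exists(marker)) {
        Thread.sleep(5000) // poll every 5 seconds (arbitrary)
      }
      // Graceful stop lets in-flight batches finish before shutting down
      ssc.stop(stopSparkContext = true, stopGracefully = true)
    }
  }
  monitor.setDaemon(true)
  monitor.start()
}
Call stopWhenMarkerAppears(streamingContext, inputDir) just before streamingContext.start() in the word-count example above; the uploader then only has to create <input-directory>/_DONE when it is done. Note that textFileStream will also see the marker file itself; an empty marker is harmless, but a separate signal directory would be cleaner.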
I have enabled checkpointing, which saves the logs to S3.
If there are NO files in the checkpoint directory, spark streaming works fine and I can see log files appearing in the checkpoint directory. Then I kill spark streaming and restart it. This time, I start getting NullPointerException for spark session.
In short, if there are NO log files in the checkpoint directory, spark streaming works fine. However as soon as I restart spark streaming WITH log files in the checkpoint directory, I start getting null pointer exception on spark session.
Below is the code:
object asf {
def main(args: Array[String]): Unit = {
val microBatchInterval = 5
val sparkSession = SparkSession
.builder()
.appName("Streaming")
.getOrCreate()
val conf = new SparkConf(true)
//conf.set("spark.streaming.receiver.writeAheadLog.enable", "true")
val sparkContext = SparkContext.getOrCreate(conf)
val checkpointDirectory = "s3a://bucketname/streaming-checkpoint"
println("Spark session: " + sparkSession)
val ssc = StreamingContext.getOrCreate(checkpointDirectory,
() => {
createStreamingContext(sparkContext, microBatchInterval, checkpointDirectory, sparkSession)
}, s3Config.getConfig())
ssc.start()
ssc.awaitTermination()
}
def createStreamingContext(sparkContext: SparkContext, microBatchInterval: Int, checkpointDirectory: String, spark: SparkSession): StreamingContext = {
println("Spark session inside: " + spark)
val ssc: org.apache.spark.streaming.StreamingContext = new StreamingContext(sparkContext, Seconds(microBatchInterval))
//TODO: StorageLevel.MEMORY_AND_DISK_SER
val lines = ssc.receiverStream(new EventHubClient(StorageLevel.MEMORY_AND_DISK_SER))
lines.foreachRDD {
rdd => {
val df = spark.read.json(rdd)
df.show()
}
}
ssc.checkpoint(checkpointDirectory)
ssc
}
}
And again, the very first time I run this code (with no log files in the checkpoint directory), I can see the data frame being printed out.
But if I run with log files in the checkpoint directory, I don't even see
println("Spark session inside: " + spark)
getting printed, even though it IS printed the first time. The error:
Exception in thread "main" java.lang.NullPointerException
at org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:111)
at org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:109)
at org.apache.spark.sql.DataFrameReader.<init>(DataFrameReader.scala:549)
at org.apache.spark.sql.SparkSession.read(SparkSession.scala:605)
And the error is happening at:
val df = spark.read.json(rdd)
Edit: I added this line:
conf.set("spark.streaming.stopGracefullyOnShutdown","true")
and it still did not make a difference; I am still getting the NullPointerException.
To answer my own question, this works:
lines.foreachRDD {
rdd => {
val sqlContext:SQLContext = SparkSession.builder.config(rdd.sparkContext.getConf).getOrCreate().sqlContext
val df = sqlContext.read.json(rdd)
df.show()
}
}
Passing a Spark session built from rdd.sparkContext works.
Just to put it explicitly for the benefit of newcomers: this is an anti-pattern. Creating a Dataset inside a transformation is not allowed!
As Michel mentioned, executors won't have access to the SparkSession.
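For completeness, the Spark Streaming programming guide recommends a lazily-initialized singleton for exactly this situation, so the session is rebuilt on demand after a restart from the checkpoint instead of being captured in the checkpointed closure. A minimal sketch of that pattern (the object name follows the guide; adapt as needed):
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
// Lazily instantiated singleton SparkSession, safe to use inside foreachRDD
// even when the StreamingContext is recovered from a checkpoint
object SparkSessionSingleton {
  @transient private var instance: SparkSession = _
  def getInstance(sparkConf: SparkConf): SparkSession = {
    if (instance == null) {
      instance = SparkSession.builder.config(sparkConf).getOrCreate()
    }
    instance
  }
}
Inside the batch it is then used as val spark = SparkSessionSingleton.getInstance(rdd.sparkContext.getConf) followed by spark.read.json(rdd), which is equivalent to the fix above.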
I'm working on a Spark-Streaming application, I'm just trying to get a simple example of a Kafka Direct Stream working:
package com.username
import _root_.kafka.serializer.StringDecoder
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.kafka._
import org.apache.spark.streaming.{Seconds, StreamingContext}
object MyApp extends App {
val topic = args(0) // 1 topic
val brokers = args(1) //localhost:9092
val spark = SparkSession.builder().master("local[2]").getOrCreate()
val sc = spark.sparkContext
val ssc = new StreamingContext(sc, Seconds(1))
val topicSet = topic.split(",").toSet
val kafkaParams = Map[String, String]("metadata.broker.list" -> brokers)
val directKafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicSet)
// Just print out the data within the topic
val parsers = directKafkaStream.map(v => v)
parsers.print()
ssc.start()
val endTime = System.currentTimeMillis() + (5 * 1000) // 5 second loop
while(System.currentTimeMillis() < endTime){
//write something to the topic
Thread.sleep(1000) // 1 second pause between iterations
}
ssc.stop()
}
This mostly works: whatever I write into the Kafka topic gets included in the streaming batch and printed out. My only concern is what happens at ssc.stop():
dd/mm/yy hh:mm:ss WARN FileSystem: exception in the cleaner thread but it will continue to run
java.lang.InterruptedException
at java.lang.Object.wait(Native Method)
at java.lang.ReferenceQueue.remove(ReferenceQueue.java:143)
at java.lang.ReferenceQueue.remove(ReferenceQueue.java:164)
at org.apache.hadoop.fs.FileSystem$Statistics$StatisticsDataReferenceCleaner.run(FileSystem.java:2989)
at java.lang.Thread.run(Thread.java:748)
This exception doesn't cause my app to fail or exit, though. I know I could wrap ssc.stop() in a try/catch block to suppress it, but looking at the API docs leads me to believe this is not the intended behavior. I've been looking around online for a solution, but nothing involving Spark mentions this exception. Is there any way for me to properly fix this?
I encountered the same problem when starting the process directly with sbt run. But if I package the project and start it with YOUR_SPARK_PATH/bin/spark-submit --class [classname] --master local[4] [package_path], it works correctly. Hope this helps.
I have a Spark Streaming application written in Scala that uses Spark SQL and attempts to register a UDF after getting an RDD. I get the error below. Is it not possible to register UDFs in a Spark Streaming app?
Here is the code snippet that throws the error:
sessionStream.foreachRDD((rdd: RDD[(String)], time: Time) => {
val sqlcc = SqlContextSingleton.getInstance(rdd.sparkContext)
sqlcc.udf.register("getUUID", () => java.util.UUID.randomUUID().toString)
...
})
Here is the error thrown when I attempt to register the function:
Exception in thread "pool-6-thread-6" java.lang.NoSuchMethodError: scala.reflect.api.JavaUniverse.runtimeMirror(Ljava/lang/ClassLoader;)Lscala/reflect/api/JavaUniverse$JavaMirror;
at com.ignitionone.datapipeline.ClusterApp$$anonfun$CreateCheckpointStreamContext$1.apply(ClusterApp.scala:173)
at com.ignitionone.datapipeline.ClusterApp$$anonfun$CreateCheckpointStreamContext$1.apply(ClusterApp.scala:164)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:42)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
at scala.util.Try$.apply(Try.scala:161)
at org.apache.spark.streaming.scheduler.Job.run(Job.scala:32)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply$mcV$sp(JobScheduler.scala:176)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:176)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:176)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:175)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
sessionStream.foreachRDD((rdd: RDD[Event], time: Time) => {
val f = (t: Long) => t - t % 60000
val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
import sqlContext.implicits._
val df = rdd.toDF()
val per_min = udf(f)
val grouped = df.groupBy(per_min(df("created_at")) as "created_at",
df("blah"),
df("status")
).agg(sum("price") as "price",sum("payout") as "payout", sum("counter") as "counter")
...
})
The snippet above is working fine for me.