My code works in local mode, but with YARN (client or cluster mode) it stops with this error:
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 0.0 failed 4 times, most recent failure: Lost task 1.3 in stage 0.0 (TID 6, hadoopdatanode, executor 1): java.io.IOException: java.lang.NullPointerException
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1353)
at org.apache.spark.rdd.ParallelCollectionPartition.readObject(ParallelCollectionRDD.scala:70)
I don't understand why it works in local mode but not with YARN. The problem comes from the declaration of the SparkContext inside rdd.foreach.
I need a SparkContext inside executeAlgorithm, and because a SparkContext is not serializable I have to get it inside rdd.foreach.
here is my main object :
def main(args: Array[String]): Unit = {
  val sparkConf = new SparkConf().setAppName("scTest")
  val sparkContext = new SparkContext(sparkConf)
  val sparkSession = org.apache.spark.sql.SparkSession.builder
    .appName("sparkSessionTest")
    .getOrCreate
  val IDList = List("ID1", "ID2", "ID3")
  val IDListRDD = sparkContext.parallelize(IDList)
  IDListRDD.foreach(idString => {
    val sc = SparkContext.getOrCreate(sparkConf)
    executeAlgorithm(idString, sc)
  })
}
Thank you in advance
The rdd.foreach{} block normally gets executed on an executor somewhere in your cluster. In local mode, though, the driver and the executor share the same JVM instance and can access each other's classes and objects living in heap memory,
which leads to unpredictable behavior. Therefore you can't, and shouldn't, call driver-side objects such as the SparkContext, RDDs, DataFrames, etc. from an executor. Please see the following links for more information:
Apache Spark : When not to use mapPartition and foreachPartition?
Caused by: java.lang.NullPointerException at org.apache.spark.sql.Dataset
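If executeAlgorithm really needs a SparkContext, one option (a minimal sketch, assuming executeAlgorithm can drive its own Spark jobs and the ID list is small) is to drop the parallelize/foreach and loop over the IDs on the driver, so the SparkContext never has to reach an executor:

// Sketch only: iterate over the plain Scala list on the driver instead of an RDD,
// so executeAlgorithm receives the driver's SparkContext.
val IDList = List("ID1", "ID2", "ID3")
IDList.foreach { idString =>
  executeAlgorithm(idString, sparkContext) // runs on the driver, where sparkContext is valid
}

Each call can still launch distributed work through that SparkContext; only the outer loop over the IDs becomes sequential on the driver.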
I am trying to write the output from this code to an S3 bucket using df.write in a Databricks notebook, but I am getting this error:
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 20.0 failed 4 times, most recent failure: Lost task 0.3 in stage 20.0 (TID 50, 10.239.78.116, executor 0): java.nio.file.AccessDeniedException: tfstest/0/outputFile/_started_7476690454845668203: regular upload on tfstest/0/outputFile/_started_7476690454845668203: com.amazonaws.services.s3.model.AmazonS3Exception: Access Denied; request: PUT https://tfs-databricks-sys-test.s3.amazonaws.com tfstest/0/outputFile/_started_7476690454845668203 {}
val outputFile = "s3a://tfsdl-ghd-wb/raidnd/Incte_19&20.csv" // OUTPUT path
val df3 = spark.read.option("header", "true").csv("s3a://tfsdl-ghd-wb/raidnd/rawdata.csv")
val writer = df3.coalesce(1).write.csv("outputFile")

filtinp.foreach(x => {
  val (com1, avg1) = com1Average(filtermp, x)
  val (com2, avg2) = com2Average(filtermp, x)
})

def getFileSystem(path: String): FileSystem = {
  val hconf = new Configuration() // initialize new hadoop configuration
  new Path(path).getFileSystem(hconf) // get new filesystem to handle data
}
It seems to be a permissions issue, but I'm able to write to this S3 bucket when doing a union of two dataframes in the same location, for example, so I don't get what's really happening here. Also, note that the host shown in the error, https://tfs-databricks-sys-test.s3.amazonaws.com, is not the path I'm trying to use.
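One detail worth double-checking in the snippet above (a sketch, not a confirmed diagnosis): write.csv("outputFile") writes to the literal relative path "outputFile" rather than the S3 URI stored in the outputFile variable, which may be why the error points at a bucket you did not intend to target. Passing the variable would look like:

// Sketch: write to the S3 path held in the outputFile variable instead of the literal string.
val outputFile = "s3a://tfsdl-ghd-wb/raidnd/Incte_19&20.csv" // OUTPUT path
val df3 = spark.read.option("header", "true").csv("s3a://tfsdl-ghd-wb/raidnd/rawdata.csv")
df3.coalesce(1).write.option("header", "true").csv(outputFile)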
I am having some trouble stopping a streaming context after a condition has been met inside a foreachRDD. Any time ssc.stop() inside function foo is executed, I get an InterruptedException.
Simplified Code:
def main() {
  var sc = new SparkContext(new SparkConf().setAppName("appname").setMaster("local"))
  foo(123, sc)
  //foo(312, sc) can I call foo again here?
  sc.stop()
}

def foo(param1: Integer, sc: SparkContext): Int = {
  val ssc = new StreamingContext(sc, Seconds(1))
  var res = 0
  //dummy data, but actual datatypes (not relevant to the error I get in this code)
  val inputData: mutable.Queue[RDD[Int]] = mutable.Queue()
  val inputStream: InputDStream[Int] = ssc.queueStream(inputData)
  inputData += sc.makeRDD(List(1, 2))

  val rdds_list = some_other_fn(inputStream, param1) //returns a DStream
  rdds_list.foreachRDD((rdd) => {
    def foo1(rdd: RDD[<some_type_2>]) = {
      if (condition1) {
        println("condition satisfied!") //prints correctly
        res = do_stuff(rdd) //executes correctly
        println("result: " + res) //executes correctly (and output is as intended)
      } else {
        println("stopping streaming context!")
        ssc.stop(stopSparkContext = false) //error occurs here
      }
    }
    foo1(rdd)
  })

  ssc.start()
  ssc.awaitTermination()
  res
}
Error log:
condition satisfied!
result: 124124
stopping streaming context!
[error] (pool-11-thread-1) java.lang.Error: java.lang.InterruptedException
java.lang.Error: java.lang.InterruptedException
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1155)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.InterruptedException
at java.lang.Object.wait(Native Method)
at java.lang.Thread.join(Thread.java:1252)
at java.lang.Thread.join(Thread.java:1326)
at org.apache.spark.util.AsynchronousListenerBus.stop(AsynchronousListenerBus.scala:160)
at org.apache.spark.streaming.scheduler.JobScheduler.stop(JobScheduler.scala:98)
at org.apache.spark.streaming.StreamingContext.stop(StreamingContext.scala:573)
at org.apache.spark.streaming.StreamingContext.stop(StreamingContext.scala:555)
at edu.gatech.cse8803.main.Main$$anonfun$testClustering$1.foo$1(Main.scala:315)
at edu.gatech.cse8803.main.Main$$anonfun$testClustering$1.apply(Main.scala:318)
at edu.gatech.cse8803.main.Main$$anonfun$testClustering$1.apply(Main.scala:306)
at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1.apply(DStream.scala:534)
at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1.apply(DStream.scala:534)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:42)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
at scala.util.Try$.apply(Try.scala:161)
at org.apache.spark.streaming.scheduler.Job.run(Job.scala:32)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply$mcV$sp(JobScheduler.scala:176)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:176)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:176)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:175)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
I tried using ssc.stop(stopSparkContext = true, stopGracefully = true) but I get this:
WARN scheduler.JobGenerator -
Timed out while stopping the job generator (timeout = 10000)
after foo is called, and the program just gets stuck (i.e. it does not complete and I have to Ctrl+C it).
Is this the correct way to stop a streaming context? Also, if I wanted to call foo multiple times, should I make any changes? I understand that there should be only one SparkContext per application, which is why I am trying to reuse it, or should I close the SparkContext by setting stopSparkContext to true?
My environment:
sbt v1.0
Scala 2.10.5
Spark 1.3.1
Edit: Looked at other similar questions, tried all their answers - still no luck! :(
The trace shows that, while the Spark driver was waiting for the job to finish, you stopped the StreamingContext from inside the rdds_list foreachRDD, which is being executed by that very StreamingContext. It should be stopped separately.
Also, contexts shouldn't be created and closed with such frequency.
What I would recommend is the following:
Initiate the StreamingContext in main() and pass it to foo(...),
which changes foo's signature to
def foo(param1: Integer, ssc: StreamingContext)
To safely close both contexts when the streaming application shuts down, use something like:
sys.ShutdownHookThread {
  //Executes when a shutdown signal is received by the app
  log.info("Gracefully stopping Spark Context")
  ssc.stop(stopSparkContext = true, stopGracefully = true) //stops the SparkContext as well
  log.info("Application stopped")
}
But if you need to stop programmatically, stop the StreamingContext together with the SparkContext.
Which will make your main() look like:
def main() {
  var sc = new SparkContext(new SparkConf().setAppName("appname").setMaster("local"))
  val ssc = new StreamingContext(sc, Seconds(1))
  sys.ShutdownHookThread {
    //Executes when a shutdown signal is received by the app
    log.info("Gracefully stopping Spark Context")
    ssc.stop(stopSparkContext = true, stopGracefully = true) //stops the SparkContext as well
    log.info("Application stopped")
  }
  foo(123, ssc)
  ssc.stop(stopSparkContext = true, stopGracefully = true)
}
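If the job really has to stop once the condition inside foreachRDD is met, one pattern (a sketch only, reusing the placeholder names from the question: some_other_fn, condition1, do_stuff) is to record the decision in a driver-side flag and stop the StreamingContext from the thread that started it, since the foreachRDD body itself runs on the driver:

import java.util.concurrent.atomic.AtomicBoolean

def foo(param1: Integer, ssc: StreamingContext): Int = {
  val shouldStop = new AtomicBoolean(false)
  var res = 0

  val inputData: mutable.Queue[RDD[Int]] = mutable.Queue()
  val inputStream: InputDStream[Int] = ssc.queueStream(inputData)
  inputData += ssc.sparkContext.makeRDD(List(1, 2))

  val rdds_list = some_other_fn(inputStream, param1)
  rdds_list.foreachRDD { rdd =>
    if (condition1) {
      res = do_stuff(rdd)
    } else {
      shouldStop.set(true) // only flag it; never call ssc.stop() from inside foreachRDD
    }
  }

  ssc.start()
  // The thread that started the context polls the flag and stops it from here.
  while (!shouldStop.get()) {
    Thread.sleep(1000)
  }
  ssc.stop(stopSparkContext = false, stopGracefully = true)
  res
}

Because only the StreamingContext is stopped, main() can in principle build a fresh StreamingContext on the same SparkContext for another call to foo.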
I use PySpark Streaming with checkpointing enabled.
The first launch is successful, but on restart it crashes with the error:
INFO scheduler.DAGScheduler: ResultStage 6 (runJob at PythonRDD.scala:441) failed in 1,160 s due to Job aborted due to stage failure: Task 0 in stage 6.0 failed 4 times, most recent failure: Lost task 0.3 in stage 6.0 (TID 86, h-1.e-contenta.com, executor 2): org.apache.spark.api.python.PythonException:
Traceback (most recent call last):
  File "/data1/yarn/nm/usercache/appcache/application_1481115309392_0229/container_1481115309392_0229_01_000003/pyspark.zip/pyspark/worker.py", line 163, in main
    func, profiler, deserializer, serializer = read_command(pickleSer, infile)
  File "/data1/yarn/nm/usercache/appcache/application_1481115309392_0229/container_1481115309392_0229_01_000003/pyspark.zip/pyspark/worker.py", line 56, in read_command
    command = serializer.loads(command.value)
  File "/data1/yarn/nm/usercache/appcache/application_1481115309392_0229/container_1481115309392_0229_01_000003/pyspark.zip/pyspark/serializers.py", line 431, in loads
    return pickle.loads(obj, encoding=encoding)
ImportError: No module named ...
Python modules are added via SparkContext.addPyFile():
def create_streaming():
    """
    Create streaming context and processing functions
    :return: StreamingContext
    """
    sc = SparkContext(conf=spark_config)
    zip_path = zip_lib(PACKAGES, PY_FILES)
    sc.addPyFile(zip_path)
    ssc = StreamingContext(sc, BATCH_DURATION)

    stream = KafkaUtils.createStream(ssc=ssc, zkQuorum=','.join(ZOOKEEPER_QUORUM),
                                     groupId='new_group',
                                     topics={topic: 1})
    stream.checkpoint(BATCH_DURATION)
    stream = stream \
        .map(lambda x: process(ujson.loads(x[1]), geo_data_bc_value)) \
        .foreachRDD(lambda_log_writer(topic, schema_bc_value))

    ssc.checkpoint(STREAM_CHECKPOINT)
    return ssc

if __name__ == '__main__':
    ssc = StreamingContext.getOrCreate(STREAM_CHECKPOINT, lambda: create_streaming())
    ssc.start()
    ssc.awaitTermination()
Sorry, it was my mistake.
Try this:
if __name__ == '__main__':
    ssc = StreamingContext.getOrCreate('', None)
    ssc.sparkContext.addPyFile()
    ssc.start()
    ssc.awaitTermination()
I have been trying to parse data from a DStream obtained from Spark Streaming (TCP) and send it to Elasticsearch. I am getting the error org.elasticsearch.hadoop.rest.EsHadoopInvalidRequest: Found unrecoverable error [127.0.0.1:9200] returned Bad Request(400) - failed to parse; Bailing out..
The following is my code:
import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.SparkContext
import org.apache.spark.serializer.KryoSerializer
import org.apache.spark.SparkContext._
import org.elasticsearch.spark._
import org.elasticsearch.spark.rdd.EsSpark
import com.fasterxml.jackson.databind.ObjectMapper
import org.apache.spark.TaskContext
import org.elasticsearch.common.transport.InetSocketTransportAddress

object Test {
  case class createRdd(Message: String, user: String)

  def main(args: Array[String]) {
    val mapper = new ObjectMapper()
    val sparkConf = new SparkConf().setAppName("NetworkWordCount").setMaster("local[*]")
    sparkConf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    sparkConf.set("es.nodes", "localhost:9200")
    sparkConf.set("es.index.auto.create", "true")
    // Create a local StreamingContext with a batch interval of 10 seconds
    val ssc = new StreamingContext(sparkConf, Seconds(10))
    /* Create a DStream that will connect to a hostname and port, like localhost 9999. As stated earlier, the DStream is created from the StreamingContext, which in turn is created from the SparkConf. */
    val lines = ssc.socketTextStream("localhost", 9998)
    // Using this DStream (lines) we will perform a transformation or output operation.
    val words = lines.map(_.split(" "))
    words.foreachRDD(_.saveToEs("spark/test"))
    ssc.start()            // Start the computation
    ssc.awaitTermination() // Wait for the computation to terminate
  }
}
The following is the error:
16/10/17 11:02:30 INFO Executor: Running task 0.0 in stage 1.0 (TID 1)
16/10/17 11:02:30 INFO BlockManager: Found block input-0-1476682349200 locally
16/10/17 11:02:30 INFO Version: Elasticsearch Hadoop v5.0.0.BUILD.SNAPSHOT [4282a0194a]
16/10/17 11:02:30 INFO EsRDDWriter: Writing to [spark/test]
16/10/17 11:02:30 ERROR TaskContextImpl: Error in TaskCompletionListener
org.elasticsearch.hadoop.rest.EsHadoopInvalidRequest: Found unrecoverable error [127.0.0.1:9200] returned Bad Request(400) - failed to parse; Bailing out..
at org.elasticsearch.hadoop.rest.RestClient.processBulkResponse(RestClient.java:250)
at org.elasticsearch.hadoop.rest.RestClient.bulk(RestClient.java:202)
at org.elasticsearch.hadoop.rest.RestRepository.tryFlush(RestRepository.java:220)
at org.elasticsearch.hadoop.rest.RestRepository.flush(RestRepository.java:242)
at org.elasticsearch.hadoop.rest.RestRepository.close(RestRepository.java:267)
at org.elasticsearch.hadoop.rest.RestService$PartitionWriter.close(RestService.java:120)
at org.elasticsearch.spark.rdd.EsRDDWriter$$anonfun$write$1.apply(EsRDDWriter.scala:42)
at org.elasticsearch.spark.rdd.EsRDDWriter$$anonfun$write$1.apply(EsRDDWriter.scala:42)
at org.apache.spark.TaskContext$$anon$1.onTaskCompletion(TaskContext.scala:123)
at org.apache.spark.TaskContextImpl$$anonfun$markTaskCompleted$1.apply(TaskContextImpl.scala:97)
at org.apache.spark.TaskContextImpl$$anonfun$markTaskCompleted$1.apply(TaskContextImpl.scala:95)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:95)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
I am coding in Scala. I am unable to find the reason for the error. Please help me out with this exception.
Thank you.
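One thing that stands out in the code (a guess, not a confirmed diagnosis): lines.map(_.split(" ")) produces an RDD of bare Array[String] per batch, and Elasticsearch cannot parse a top-level array as a document, which can surface as a 400 "failed to parse". A sketch that maps each line into the otherwise unused createRdd case class first, assuming a "<message> <user>" line format, would be:

// Sketch only: turn each line into a case class with named fields before saveToEs,
// assuming each line looks like "<message> <user>".
val docs = lines.map { line =>
  val parts = line.split(" ", 2)
  createRdd(parts(0), if (parts.length > 1) parts(1) else "")
}
docs.foreachRDD(_.saveToEs("spark/test"))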
I am trying to submit a Spark Streaming + Kafka job which just reads lines of strings from a Kafka topic. However, I am getting the following exception:
15/07/24 22:39:45 ERROR TaskSetManager: Task 0 in stage 2.0 failed 4 times; aborting job
Exception in thread "Thread-49" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 (TID 73, 10.11.112.93): java.lang.NoSuchMethodException: kafka.serializer.StringDecoder.<init>(kafka.utils.VerifiableProperties)
java.lang.Class.getConstructor0(Class.java:2892)
java.lang.Class.getConstructor(Class.java:1723)
org.apache.spark.streaming.kafka.KafkaReceiver.onStart(KafkaInputDStream.scala:106)
org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:121)
org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:106)
org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$9.apply(ReceiverTracker.scala:264)
org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$9.apply(ReceiverTracker.scala:257)
org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121)
org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
org.apache.spark.scheduler.Task.run(Task.scala:54)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
When I checked the Spark jar files used by DSE, I see that it uses kafka_2.10-0.8.0.jar, which does have that constructor. Not sure what is causing the error. Here is my consumer code:
val sc = new SparkContext(sparkConf)
val streamingContext = new StreamingContext(sc, SLIDE_INTERVAL)

val topicMap = kafkaTopics.split(",").map((_, numThreads.toInt)).toMap
val accessLogsStream = KafkaUtils.createStream(streamingContext, zooKeeper, "AccessLogsKafkaAnalyzer", topicMap)

val accessLogs = accessLogsStream.map(_._2).map(log => ApacheAccessLog.parseLogLine(log)).cache()
UPDATE: This exception seems to happen only when I submit the job. If I use the Spark shell to run the job by pasting in the code, it works fine.
I was facing the same issue with my custom decoder. I added the following constructor, which resolved the issue.
public YourDecoder(VerifiableProperties verifiableProperties)
{
}
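For reference, a minimal Scala sketch of such a decoder (the class name is hypothetical; the point is the single VerifiableProperties constructor parameter, which the KafkaReceiver in the stack trace looks up via reflection):

import kafka.serializer.Decoder
import kafka.utils.VerifiableProperties

// Hypothetical decoder: the (props: VerifiableProperties) constructor is what
// the reflective lookup in KafkaReceiver.onStart expects to find.
class MyStringDecoder(props: VerifiableProperties = null) extends Decoder[String] {
  override def fromBytes(bytes: Array[Byte]): String = new String(bytes, "UTF-8")
}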