I use PySpark Streaming with checkpointing enabled.
The first launch is successful, but on restart the job crashes with this error:
INFO scheduler.DAGScheduler: ResultStage 6 (runJob at PythonRDD.scala:441) failed in 1,160 s due to Job aborted due to stage failure: Task 0 in stage 6.0 failed 4 times, most recent failure: Lost task 0.3 in stage 6.0 (TID 86, h-1.e-contenta.com, executor 2): org.apache.spark.api.python.PythonException:
Traceback (most recent call last):
File"/data1/yarn/nm/usercache/appcache/application_1481115309392_0229/container_1481115309392_0229_01_000003/pyspark.zip/pyspark/worker.py", line 163, in main
func, profiler, deserializer, serializer = read_command(pickleSer, infile)
File"/data1/yarn/nm/usercache/appcache/application_1481115309392_0229/container_1481115309392_0229_01_000003/pyspark.zip/pyspark/worker.py", line 56, in read_command
command = serializer.loads(command.value)
File"/data1/yarn/nm/usercache/appcache/application_1481115309392_0229/container_1481115309392_0229_01_000003/pyspark.zip/pyspark/serializers.py", line 431, in loads return pickle.loads(obj, encoding=encoding)
ImportError: No module named ...
The Python modules are added via SparkContext.addPyFile():
def create_streaming():
    """
    Create streaming context and processing functions
    :return: StreamingContext
    """
    sc = SparkContext(conf=spark_config)
    zip_path = zip_lib(PACKAGES, PY_FILES)
    sc.addPyFile(zip_path)
    ssc = StreamingContext(sc, BATCH_DURATION)

    stream = KafkaUtils.createStream(ssc=ssc,
                                     zkQuorum=','.join(ZOOKEEPER_QUORUM),
                                     groupId='new_group',
                                     topics={topic: 1})
    stream.checkpoint(BATCH_DURATION)

    stream = stream \
        .map(lambda x: process(ujson.loads(x[1]), geo_data_bc_value)) \
        .foreachRDD(lambda_log_writer(topic, schema_bc_value))

    ssc.checkpoint(STREAM_CHECKPOINT)
    return ssc
if __name__ == '__main__':
    ssc = StreamingContext.getOrCreate(STREAM_CHECKPOINT, lambda: create_streaming())
    ssc.start()
    ssc.awaitTermination()
Sorry, it was my mistake. Files added with addPyFile() inside create_streaming() are not restored when the StreamingContext is recovered from the checkpoint, so the zip has to be re-added on the recovered context after getOrCreate().
Try this:
if __name__ == '__main__':
    ssc = StreamingContext.getOrCreate(STREAM_CHECKPOINT, create_streaming)
    # re-add the dependency zip on the recovered context, since addPyFile()
    # calls made inside create_streaming() are not replayed from the checkpoint
    ssc.sparkContext.addPyFile(zip_lib(PACKAGES, PY_FILES))
    ssc.start()
    ssc.awaitTermination()
I'm trying to read some JSON files from HDFS with Spark Structured Streaming and then send out an HTTP call for each micro-batch.
If I use recursiveFileLookup = true, the code works, but if I set it to false, it doesn't.
The schema seems to work fine; I'm really not sure what the issue is.
import spark.implicits._

val schema = new StructType()
...

val streaming_df = spark.readStream
  .schema(schema)
  .option("mode", "DROPMALFORMED")
  .option("maxFileAge", "90d")
  .option("maxFilesPerTrigger", 10)
  .option("recursiveFileLookup", false)
  .option("pathGlobFilter", "*.json.gz")
  .json(pathToJSONResource)

// scalastyle:off println
println("df is streaming: " + streaming_df.isStreaming)
println("df schema:")
streaming_df.printSchema()
// scalastyle:on println

val exploded_df = streaming_df
  .withColumn("messages", explode($"messages"))

exploded_df.writeStream
  .foreachBatch(batchHttpCall _)
  .outputMode(outputMode)
  .option("checkpointLocation", checkpointLocation)
  .start()
  .awaitTermination()
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 1.0 failed 4 times, most recent failure: Lost task 2.3 in stage 1.0 (TID 81) (ac3k9x2111.bdp.bdata.ai executor 2): java.lang.ArrayIndexOutOfBoundsException: 0
jsonPathResources looks like this:
json-resource-path: "xxx/version=1.0.0/*"
with the folder structure looking like:
/version=1.0.0/year=2022/month=02/day=18/hour=17/json.gz
Apparently you need to remove the glob pattern from the path and just end it with a /.
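For illustration, a minimal sketch of what that looks like with the reader above (the path value is adapted from the question, and spark and schema are reused from the earlier snippet, so treat it as an assumption rather than the confirmed configuration):

// path ends with "/" and contains no glob; pathGlobFilter still restricts which files are read,
// and partition discovery can pick up the year=/month=/day=/hour= subdirectories
val pathToJSONResource = "xxx/version=1.0.0/"

val streaming_df = spark.readStream
  .schema(schema)
  .option("recursiveFileLookup", false)
  .option("pathGlobFilter", "*.json.gz")
  .json(pathToJSONResource)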
I'm trying to write the output of this code to an S3 bucket with df.write from a Databricks notebook, but I'm getting this error:
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 20.0 failed 4 times, most recent failure: Lost task 0.3 in stage 20.0 (TID 50, 10.239.78.116, executor 0): java.nio.file.AccessDeniedException: tfstest/0/outputFile/_started_7476690454845668203: regular upload on tfstest/0/outputFile/_started_7476690454845668203: com.amazonaws.services.s3.model.AmazonS3Exception: Access Denied; request: PUT https://tfs-databricks-sys-test.s3.amazonaws.com tfstest/0/outputFile/_started_7476690454845668203 {}
val outputFile = "s3a://tfsdl-ghd-wb/raidnd/Incte_19&20.csv" // OUTPUT path
val df3 = spark.read.option("header","true").csv("s3a://tfsdl-ghd-wb/raidnd/rawdata.csv")
val writer = df3.coalesce(1).write.csv("outputFile")
filtinp.foreach(x => {
  val (com1, avg1) = com1Average(filtermp, x)
  val (com2, avg2) = com2Average(filtermp, x)
})

def getFileSystem(path: String): FileSystem = {
  val hconf = new Configuration() // initialize new hadoop configuration
  new Path(path).getFileSystem(hconf) // get new filesystem to handle data
}
It seems to be a permissions issue, but I'm able to write to this S3 bucket when doing a Union, for example, with two dataframes in the same location, so I don't get what's really happening here. Also, note that the host shown in the error, https://tfs-databricks-sys-test.s3.amazonaws.com, is not the path I'm trying to use.
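One detail that may be worth double-checking (an observation based only on the snippet above, not a confirmed fix): write.csv is given the string literal "outputFile" rather than the outputFile variable, so the write does not actually target the s3a:// path defined at the top. A minimal sketch of the write using the variable:

// pass the variable holding the s3a:// path, not the literal string "outputFile"
df3.coalesce(1).write.csv(outputFile)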
My code works in local mode, but with YARN (client or cluster mode) it stops with this error:
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 0.0 failed 4 times, most recent failure: Lost task 1.3 in stage 0.0 (TID 6, hadoopdatanode, executor 1): java.io.IOException: java.lang.NullPointerException
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1353)
at org.apache.spark.rdd.ParallelCollectionPartition.readObject(ParallelCollectionRDD.scala:70)
I don't understand why it works in local mode but not with YARN. The problem comes from the declaration of the SparkContext inside rdd.foreach.
I need a SparkContext inside executeAlgorithm, and because a SparkContext is not serializable I have to get it inside rdd.foreach.
Here is my main object:
def main(args: Array[String]): Unit = {
  val sparkConf = new SparkConf().setAppName("scTest")
  val sparkContext = new SparkContext(sparkConf)
  val sparkSession = org.apache.spark.sql.SparkSession.builder
    .appName("sparkSessionTest")
    .getOrCreate

  val IDList = List("ID1", "ID2", "ID3")
  val IDListRDD = sparkContext.parallelize(IDList)

  IDListRDD.foreach(idString => {
    val sc = SparkContext.getOrCreate(sparkConf)
    executeAlgorithm(idString, sc)
  })
}
Thank you in advance
The rdd.foreach{} block normally gets executed on an executor somewhere in your cluster. In local mode, though, driver and executor share the same JVM instance and can access each other's classes and instances living in heap memory, which leads to unpredictable behavior.
Therefore you can't, and shouldn't, call driver-side objects such as SparkContext, RDDs, DataFrames, etc. from an executor node; see the following link for more information:
Apache Spark : When not to use mapPartition and foreachPartition?
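As an illustration of the point above, a minimal sketch of an alternative (assuming executeAlgorithm only needs the driver-side SparkContext from the question): keep the loop on the driver instead of inside rdd.foreach, so no driver-side object is ever referenced from executor code.

// runs entirely on the driver: no SparkContext is captured inside an RDD closure
val IDList = List("ID1", "ID2", "ID3")
IDList.foreach { idString =>
  executeAlgorithm(idString, sparkContext) // sparkContext from main, used on the driver only
}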
Caused by: java.lang.NullPointerException at org.apache.spark.sql.Dataset
I am trying to build an ALS model using spark.mllib.recommendation.
I am getting a null pointer exception, yet I do not see any null values in the columns I am using. Help needed.
import org.apache.spark.mllib.recommendation.Rating
import org.apache.spark.ml.recommendation.ALS
val path = "DataPath"
val data = spark.read.json(path)
data.printSchema()
data.createOrReplaceTempView("reviews")
val raw_reviews = spark.sql("Select reviewerID, cast(asin as int) as ProductID, overall from reviews")
raw_reviews.printSchema()
import org.apache.spark.ml.feature.StringIndexer
val stringindexer = new StringIndexer()
  .setInputCol("reviewerID")
  .setOutputCol("userID")
val modelc = stringindexer.fit(raw_reviews)
val df = modelc.transform(raw_reviews)
val Array(training,test) = df.randomSplit(Array(0.8,0.2))
val als = new ALS().setMaxIter(5).setRegParam(0.01).setUserCol("userID").setItemCol("ProductID").setRatingCol("overall")
val model = als.fit(training)
org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 30.0 failed 1 times, most recent failure: Lost task 1.0 in stage 30.0 (TID 94, localhost): java.lang.NullPointerException: Value at index 1 is null
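One thing worth checking (an assumption, not a confirmed diagnosis): cast(asin as int) silently returns null for any asin value that is not a valid integer, which would produce exactly "Value at index 1 is null" even if the raw columns contain no nulls. A quick check against the temp view created above:

// count rows where the cast to int produced null although asin itself is not null
spark.sql(
  """SELECT count(*) AS bad_rows
    |FROM reviews
    |WHERE asin IS NOT NULL AND CAST(asin AS INT) IS NULL""".stripMargin
).show()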
I am trying to submit a Spark Streaming + Kafka job which just reads lines of string from a Kafka topic. However, I am getting the following exception:
15/07/24 22:39:45 ERROR TaskSetManager: Task 0 in stage 2.0 failed 4 times; aborting job
Exception in thread "Thread-49" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 (TID 73, 10.11.112.93): java.lang.NoSuchMethodException: kafka.serializer.StringDecoder.&lt;init&gt;(kafka.utils.VerifiableProperties)
java.lang.Class.getConstructor0(Class.java:2892)
java.lang.Class.getConstructor(Class.java:1723)
org.apache.spark.streaming.kafka.KafkaReceiver.onStart(KafkaInputDStream.scala:106)
org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:121)
org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:106)
org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$9.apply(ReceiverTracker.scala:264)
org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$9.apply(ReceiverTracker.scala:257)
org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121)
org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
org.apache.spark.scheduler.Task.run(Task.scala:54)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
When I checked the Spark jar files used by DSE, I saw that it uses kafka_2.10-0.8.0.jar, which does have that constructor. I'm not sure what is causing the error. Here is my consumer code:
val sc = new SparkContext(sparkConf)
val streamingContext = new StreamingContext(sc, SLIDE_INTERVAL)
val topicMap = kafkaTopics.split(",").map((_, numThreads.toInt)).toMap
val accessLogsStream = KafkaUtils.createStream(streamingContext, zooKeeper, "AccessLogsKafkaAnalyzer", topicMap)
val accessLogs = accessLogsStream.map(_._2).map(log => ApacheAccessLog.parseLogLine(log)).cache()
UPDATE: This exception seems to happen only when I submit the job. If I run the job in the spark shell by pasting the code, it works fine.
I was facing the same issue with my custom decoder. I added the following constructor, which resolved the issue.
public YourDecoder(VerifiableProperties verifiableProperties)
{
}
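For reference, a minimal Scala sketch of such a decoder (the class name is hypothetical); the old Kafka receiver instantiates the decoder reflectively via Class.getConstructor, so it needs exactly this one-argument constructor:

import kafka.serializer.Decoder
import kafka.utils.VerifiableProperties

// hypothetical custom decoder: the single VerifiableProperties constructor is what the
// receiver looks up via reflection, hence the NoSuchMethodException when it is missing
class MyStringDecoder(props: VerifiableProperties = null) extends Decoder[String] {
  override def fromBytes(bytes: Array[Byte]): String = new String(bytes, "UTF-8")
}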