DSE Spark Streaming+Kafka NoSuchMethodError - apache-kafka

I am trying to submit a Spark Streaming + Kafka job that just reads lines of strings from a Kafka topic. However, I am getting the following exception:
15/07/24 22:39:45 ERROR TaskSetManager: Task 0 in stage 2.0 failed 4 times; aborting job
Exception in thread "Thread-49" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 (TID 73, 10.11.112.93): java.lang.NoSuchMethodException: kafka.serializer.StringDecoder.(kafka.utils.VerifiableProperties)
java.lang.Class.getConstructor0(Class.java:2892)
java.lang.Class.getConstructor(Class.java:1723)
org.apache.spark.streaming.kafka.KafkaReceiver.onStart(KafkaInputDStream.scala:106)
org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:121)
org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:106)
org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$9.apply(ReceiverTracker.scala:264)
org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$9.apply(ReceiverTracker.scala:257)
org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121)
org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
org.apache.spark.scheduler.Task.run(Task.scala:54)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
When I checked the Spark jar files used by DSE, I see that it uses kafka_2.10-0.8.0.jar, which does have that constructor, so I am not sure what is causing the error. Here is my consumer code:
val sc = new SparkContext(sparkConf)
val streamingContext = new StreamingContext(sc, SLIDE_INTERVAL)
val topicMap = kafkaTopics.split(",").map((_, numThreads.toInt)).toMap
val accessLogsStream = KafkaUtils.createStream(streamingContext, zooKeeper, "AccessLogsKafkaAnalyzer", topicMap)
val accessLogs = accessLogsStream.map(_._2).map(log => ApacheAccessLog.parseLogLine(log)).cache()
UPDATE: This exception seems to happen only when I submit the job. If I run the same code by pasting it into the Spark shell, it works fine.

I was facing the same issue with my custom decoder. I added the following constructor, which resolved the issue.
public YourDecoder(VerifiableProperties verifiableProperties)
{
}
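For reference, a minimal Scala sketch of such a decoder against the Kafka 0.8 Decoder trait (the class name MyLogDecoder and the UTF-8 payload are assumptions, not from the original post):
import kafka.serializer.Decoder
import kafka.utils.VerifiableProperties

// Hypothetical custom decoder: the VerifiableProperties constructor parameter must
// be present, even if unused, because Spark's KafkaReceiver instantiates the decoder
// reflectively with exactly that signature.
class MyLogDecoder(props: VerifiableProperties = null) extends Decoder[String] {
  override def fromBytes(bytes: Array[Byte]): String = new String(bytes, "UTF-8")
}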

Related

Spark java.lang.ArrayIndexOutOfBoundsException when setting recursiveFileLookup to true, but not when false

I'm trying to read in some JSON files from HDFS with Spark Structured Streaming and then send out an HTTP call.
If I use recursiveFileLookup = true, the code works, but if I set it to false it doesn't.
The schema seems to work fine. I'm really not sure what the issue is.
import spark.implicits._
val schema = new StructType()
...
val streaming_df = spark.readStream
.schema(schema)
.option("mode", "DROPMALFORMED")
.option("maxFileAge", "90d")
.option("maxFilesPerTrigger", 10)
.option("recursiveFileLookup", false)
.option("pathGlobFilter", "*.json.gz")
.json(pathToJSONResource)
// scalastyle:off println
println("df is streaming:" + streaming_df.isStreaming)
println("df schema: " + streaming_df.printSchema)
// scalastyle:on println
val exploded_df = streaming_df
.withColumn("messages", explode($"messages"))
exploded_df.writeStream
.foreachBatch(batchHttpCall _)
.outputMode(outputMode)
.option("checkpointLocation", checkpointLocation)
.start()
.awaitTermination()
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 1.0 failed 4 times, most recent failure: Lost task 2.3 in stage 1.0 (TID 81) (ac3k9x2111.bdp.bdata.ai executor 2): java.lang.ArrayIndexOutOfBoundsException: 0
pathToJSONResource looks like this:
json-resource-path: "xxx/version=1.0.0/*"
with the folder schema looking like
/version=1.0.0/year=2022/month=02/day=18/hour=17/json.gz
Apparently you need to remove the glob pattern from the path and just end it with a /.
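A minimal sketch of the corrected value (hypothetical, only the shape of the path matters):
// Hypothetical corrected value: no trailing "/*" glob, directory path ends with "/";
// pathGlobFilter ("*.json.gz") still restricts which files are picked up.
val pathToJSONResource = "xxx/version=1.0.0/"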

Error writing file to S3: org.apache.spark.SparkException: Job aborted

I'm trying to write the output of this code to an S3 bucket using df.write from a Databricks notebook, but I'm getting this error:
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 20.0 failed 4 times, most recent failure: Lost task 0.3 in stage 20.0 (TID 50, 10.239.78.116, executor 0): java.nio.file.AccessDeniedException: tfstest/0/outputFile/_started_7476690454845668203: regular upload on tfstest/0/outputFile/_started_7476690454845668203: com.amazonaws.services.s3.model.AmazonS3Exception: Access Denied; request: PUT https://tfs-databricks-sys-test.s3.amazonaws.com tfstest/0/outputFile/_started_7476690454845668203 {}
val outputFile = "s3a://tfsdl-ghd-wb/raidnd/Incte_19&20.csv" // OUTPUT path
val df3 = spark.read.option("header","true").csv("s3a://tfsdl-ghd-wb/raidnd/rawdata.csv")
val writer = df3.coalesce(1).write.csv("outputFile") // note: passes the literal string "outputFile", not the outputFile variable above
filtinp.foreach(x => {
val (com1, avg1) = com1Average(filtermp, x)
val (com2, avg2) = com2Average(filtermp, x)
})
def getFileSystem(path: String): FileSystem = {
  val hconf = new Configuration() // initialize new hadoop configuration
  new Path(path).getFileSystem(hconf) // get the filesystem that handles this path
}
It seems to be a permissions issue, but I'm able to write to this S3 bucket when doing a union, for example, with two dataframes in the same location, so I don't get what's really happening here. Also, note that the path shown in the error, https://tfs-databricks-sys-test.s3.amazonaws.com, is not the path I'm trying to use.
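As a side note, a sketch of what the write may have been intended to look like, assuming the goal was the s3a:// location held in the outputFile variable rather than the literal relative path "outputFile" (and assuming the bucket policy allows PUTs from this cluster):
// Sketch only: write the coalesced CSV to the S3 location held in outputFile;
// the header option mirrors the read above.
df3.coalesce(1)
  .write
  .option("header", "true")
  .csv(outputFile)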

distribute sparkContext error on yarn cluster

My code works in local mode, but with YARN (client or cluster mode), it stops with this error:
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 0.0 failed 4 times, most recent failure: Lost task 1.3 in stage 0.0 (TID 6, hadoopdatanode, executor 1): java.io.IOException: java.lang.NullPointerException
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1353)
at org.apache.spark.rdd.ParallelCollectionPartition.readObject(ParallelCollectionRDD.scala:70)
I don't understand why it works in local mode but not with YARN. The problem comes from the declaration of the SparkContext inside rdd.foreach.
I need a SparkContext inside executeAlgorithm, and because a SparkContext is not serializable I have to get it inside the rdd.foreach.
here is my main object :
def main(args: Array[String]): Unit = {
  val sparkConf = new SparkConf().setAppName("scTest")
  val sparkContext = new SparkContext(sparkConf)
  val sparkSession = org.apache.spark.sql.SparkSession.builder
    .appName("sparkSessionTest")
    .getOrCreate

  val IDList = List("ID1", "ID2", "ID3")
  val IDListRDD = sparkContext.parallelize(IDList)

  IDListRDD.foreach(idString => {
    // this closure runs on an executor, not the driver
    val sc = SparkContext.getOrCreate(sparkConf)
    executeAlgorithm(idString, sc)
  })
}
Thank you in advance
The rdd.foreach {} block normally gets executed on an executor somewhere in your cluster. In local mode, though, the driver and the executor share the same JVM instance and can access each other's classes and instances living in heap memory,
which causes unpredictable behavior. Therefore you can't, and shouldn't, make calls from an executor to driver-side objects such as SparkContext, RDDs, DataFrames, etc. See the following links for more information, and the sketch after them:
Apache Spark : When not to use mapPartition and foreachPartition?
Caused by: java.lang.NullPointerException at org.apache.spark.sql.Dataset
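A minimal sketch of the driver-side alternative, assuming the ID list is small enough to iterate on the driver and that executeAlgorithm only needs the driver's SparkContext:
// Keep the SparkContext on the driver and loop over the plain Scala list there,
// instead of calling SparkContext.getOrCreate inside rdd.foreach on an executor.
val IDList = List("ID1", "ID2", "ID3")
IDList.foreach { idString =>
  executeAlgorithm(idString, sparkContext)
}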

ClassCastException when using two Dataframes in a UDF

I am relatively new to both object-oriented and functional programming, so please forgive me if I am asking a dumb question here. I have searched thoroughly and tried to find an answer myself for hours, but to no avail.
I am working with Spark 2 in Scala and have the following problem.
I have two DataFrames, both of which I would like to pass to a UDF that should output one resultant DataFrame. For simplicity's sake I have created two "test" dataframes as input.
val seqs = List(("1234","abaab"), ("1235","aaab")).toDF
val actions = List(1,2).toDF
The actions list dataframe will be converted to an Array of Strings e.g.:
val actionsARR = actions.rdd.map(x=>(x.getInt(0)+96).toChar.toString).collect()
We then have a UDF to count the occurrences of the actions in the seqs:
def countActions(sequence: String, actions: Array[String]): Array[Int] = {
  actions.map(x => x.r.findAllIn(sequence).size)
}
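For instance, a quick sanity check against the example data above (values assumed from the seqs and actionsARR shown earlier):
countActions("abaab", Array("a", "b")) // returns Array(3, 2): three "a"s and two "b"s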
Put together with:
val results = seqs.map(x=>(x.getString(0), countActions(x.getString(1),actionsARR))).toDF("sequence_u_key","action_counter")
This works fine in the Spark shell, running one command after the other. I now try to embed the code in a further UDF which will accept DataFrames as input:
def testFun(seqsin: DataFrame, actionsin: DataFrame): DataFrame = {
  seqsin.map(x => (x.getString(0), countActions(x.getString(1), actionsARR))).toDF("sequence_u_key", "action_counter")
}
Calling this with:
testFun(seqs,actions).show
works; however, it is currently not yet using the DataFrame actionsin but the array already created from it, called actionsARR.
Of course I want it to take the actionsin DataFrame and convert it to the Array within the UDF, so I tried:
def testFun(seqsin: DataFrame, actionsin: DataFrame): DataFrame = {
  val acts = actionsin.rdd.map(y => (y.getInt(0) + 96).toChar.toString).collect()
  seqsin.map(x => (x.getString(0), countActions(x.getString(1), acts))).toDF("sequence_u_key", "action_counter")
}
But when I call the function with my DataFrames as input I get:
testFun(seqs,actions).show
17/07/19 07:31:32 ERROR Executor: Exception in task 0.0 in stage 38.0 (TID 64) java.lang.ClassCastException
17/07/19 07:31:32 WARN TaskSetManager: Lost task 0.0 in stage 38.0 (TID 64, localhost, executor driver): java.lang.ClassCastException
17/07/19 07:31:32 ERROR TaskSetManager: Task 0 in stage 38.0 failed 1 times; aborting job
17/07/19 07:31:32 WARN TaskSetManager: Lost task 1.0 in stage 38.0 (TID 65, localhost, executor driver): TaskKilled (killed intentionally)
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 38.0 failed 1 times, most recent failure: Lost task 0.0 in stage 38.0 (TID 64, localhost, executor driver): java.lang.ClassCastException
Driver stacktrace:
  at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
  at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
  at scala.Option.foreach(Option.scala:257)
  at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:1918)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:1931)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:1944)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:1958)
  at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:935)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
  at org.apache.spark.rdd.RDD.collect(RDD.scala:934)
  at testFun(:44)
  ... 52 elided
Caused by: java.lang.ClassCastException
Maybe I am doing something fundamentally wrong by trying to pass two DataFrames to a function, converting one to an array and then using them in the function? Or maybe I am missing something simple?
Oh, I can successfully create a function that takes a single DataFrame and converts and returns it as an array.
Any help would be appreciated.
Best regards,
James

pyspark streaming restore from checkpoint

I use PySpark Streaming with checkpointing enabled.
The first launch is successful, but on restart it crashes with this error:
INFO scheduler.DAGScheduler: ResultStage 6 (runJob at PythonRDD.scala:441) failed in 1,160 s due to Job aborted due to stage failure: Task 0 in stage 6.0 failed 4 times, most recent failure: Lost task 0.3 in stage 6.0 (TID 86, h-1.e-contenta.com, executor 2): org.apache.spark.api.python.PythonException:
Traceback (most recent call last):
File"/data1/yarn/nm/usercache/appcache/application_1481115309392_0229/container_1481115309392_0229_01_000003/pyspark.zip/pyspark/worker.py", line 163, in main
func, profiler, deserializer, serializer = read_command(pickleSer, infile)
File"/data1/yarn/nm/usercache/appcache/application_1481115309392_0229/container_1481115309392_0229_01_000003/pyspark.zip/pyspark/worker.py", line 56, in read_command
command = serializer.loads(command.value)
File"/data1/yarn/nm/usercache/appcache/application_1481115309392_0229/container_1481115309392_0229_01_000003/pyspark.zip/pyspark/serializers.py", line 431, in loads return pickle.loads(obj, encoding=encoding)
ImportError: No module named ...
The Python modules are added via the SparkContext's addPyFile():
def create_streaming():
    """
    Create streaming context and processing functions
    :return: StreamingContext
    """
    sc = SparkContext(conf=spark_config)
    zip_path = zip_lib(PACKAGES, PY_FILES)
    sc.addPyFile(zip_path)

    ssc = StreamingContext(sc, BATCH_DURATION)
    stream = KafkaUtils.createStream(ssc=ssc,
                                     zkQuorum=','.join(ZOOKEEPER_QUORUM),
                                     groupId='new_group',
                                     topics={topic: 1})
    stream.checkpoint(BATCH_DURATION)
    stream = stream \
        .map(lambda x: process(ujson.loads(x[1]), geo_data_bc_value)) \
        .foreachRDD(lambda_log_writer(topic, schema_bc_value))

    ssc.checkpoint(STREAM_CHECKPOINT)
    return ssc

if __name__ == '__main__':
    ssc = StreamingContext.getOrCreate(STREAM_CHECKPOINT, lambda: create_streaming())
    ssc.start()
    ssc.awaitTermination()
Sorry, it was my mistake.
Try this:
if __name__ == '__main__':
    ssc = StreamingContext.getOrCreate(STREAM_CHECKPOINT, lambda: create_streaming())
    # Re-add the zipped modules on the restored context: the addPyFile call made
    # inside create_streaming() is not replayed when the context is restored from
    # the checkpoint, which is what causes the ImportError above.
    ssc.sparkContext.addPyFile(zip_lib(PACKAGES, PY_FILES))
    ssc.start()
    ssc.awaitTermination()