ClassCastException when using two DataFrames in a UDF - Scala

I am relatively new to both object-oriented and functional programming, so please forgive me if I am asking a dumb question here. I have searched thoroughly and tried to find an answer myself for hours, but to no avail.
I am working with Spark 2 in Scala and have the following problem.
I have two DataFrames, both of which I would like to pass to a UDF that should output one resultant DataFrame. For simplicity's sake I have created two "test" DataFrames as input.
val seqs = List(("1234","abaab"), ("1235","aaab")).toDF
val actions = List(1,2).toDF
The actions DataFrame will be converted to an Array of Strings, e.g.:
val actionsARR = actions.rdd.map(x=>(x.getInt(0)+96).toChar.toString).collect()
We then have a UDF to count the occurrences of the actions in the seqs:
def countActions(sequence: String, actions: Array[String]): Array[Int] = {
  actions.map(x => x.r.findAllIn(sequence).size)
}
Putting this together with:
val results = seqs.map(x=>(x.getString(0), countActions(x.getString(1),actionsARR))).toDF("sequence_u_key","action_counter")
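Worked through on the test data above (hedging slightly, since the ordering of the collect() output could in principle vary), this gives:
// actionsARR                        -> Array("a", "b")   // (1 + 96).toChar = 'a', (2 + 96).toChar = 'b'
// countActions("abaab", actionsARR) -> Array(3, 2)       // three 'a's, two 'b's
// countActions("aaab", actionsARR)  -> Array(3, 1)       // three 'a's, one 'b'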
This works fine in the spark shell, running one command after the other. I now try to embed the code in a further UDF which will accept DataFrames as input:
def testFun(seqsin: DataFrame, actionsin: DataFrame): DataFrame = {
  seqsin.map(x => (x.getString(0), countActions(x.getString(1), actionsARR))).toDF("sequence_u_key", "action_counter")
}
Calling this with:
testFun(seqs,actions).show
works; however, it is currently not using the DataFrame actionsin but the array already created from it, actionsARR.
Of course I want it to take the actionsin DataFrame and convert it to the array within the UDF, so I tried:
def testFun(seqsin: DataFrame, actionsin: DataFrame): DataFrame = {
  val acts = actionsin.rdd.map(y => (y.getInt(0) + 96).toChar.toString).collect()
  seqsin.map(x => (x.getString(0), countActions(x.getString(1), acts))).toDF("sequence_u_key", "action_counter")
}
But when I call the function with my DataFrames as input I get:
testFun(seqs,actions).show

17/07/19 07:31:32 ERROR Executor: Exception in task 0.0 in stage 38.0 (TID 64) java.lang.ClassCastException
17/07/19 07:31:32 WARN TaskSetManager: Lost task 0.0 in stage 38.0 (TID 64, localhost, executor driver): java.lang.ClassCastException
17/07/19 07:31:32 ERROR TaskSetManager: Task 0 in stage 38.0 failed 1 times; aborting job
17/07/19 07:31:32 WARN TaskSetManager: Lost task 1.0 in stage 38.0 (TID 65, localhost, executor driver): TaskKilled (killed intentionally)
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 38.0 failed 1 times, most recent failure: Lost task 0.0 in stage 38.0 (TID 64, localhost, executor driver): java.lang.ClassCastException
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1918)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1931)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1944)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1958)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:935)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.RDD.collect(RDD.scala:934)
at testFun(:44)
... 52 elided
Caused by: java.lang.ClassCastException
Maybe I am doing something fundamentally wrong by trying to pass two DataFrames to a function, converting one to an array and then using them in the function? Or maybe I am missing something simple?
Oh, and I can successfully create a function that takes a single DataFrame, converts it and returns it as an array.
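For reference, a minimal sketch of what that single-DataFrame helper might look like (assuming the same single-Int-column layout as actions; the function name is mine, not from the original code):

import org.apache.spark.sql.DataFrame

// sketch: collect a one-Int-column DataFrame into an Array[String] of action characters
def toActionArray(actionsin: DataFrame): Array[String] =
  actionsin.rdd.map(y => (y.getInt(0) + 96).toChar.toString).collect()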
Any help would be appreciated
Best Regards
James

Related

spark stateful streaming with checkpoint + kafka producer

How can I integrate a Kafka producer with Spark stateful streaming that uses checkpointing along with StreamingContext.getOrCreate?
I read this post: How to write spark streaming DF to Kafka topic and implemented the method mentioned in this post: Spark and Kafka integration patterns
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.kafka.common.serialization.StringSerializer
import java.util.Properties
class KafkaSink(createProducer: () => KafkaProducer[String, String]) extends Serializable {
  lazy val producer = createProducer()
  def send(topic: String, value: String): Unit = producer.send(new ProducerRecord(topic, value))
}

object KafkaSink {
  def apply(): KafkaSink = {
    val f = () => {
      val kafkaProducerProps: Properties = {
        val props = new Properties()
        props.put("bootstrap.servers", "127.0.0.1:9092")
        props.setProperty("batch.size", "8192")
        props.put("key.serializer", classOf[StringSerializer].getName)
        props.put("value.serializer", classOf[StringSerializer].getName)
        props.setProperty("request.timeout.ms", "60000")
        props
      }
      val producer = new KafkaProducer[String, String](kafkaProducerProps)
      producer
    }
    new KafkaSink(f)
  }
}
and
package webmetric
import org.apache.kafka.clients.consumer.ConsumerConfig
import org.apache.kafka.common.serialization.{ByteArraySerializer, StringDeserializer, StringSerializer}
import org.apache.log4j.{Level, Logger}
import org.apache.spark.SparkConf
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}
import org.apache.spark.streaming.{Seconds, State, StateSpec, StreamingContext}
import java.util.Properties
import scala.concurrent.Future;
object RecoverableJsonProcess {

  def createContext(checkpointDirectory: String): StreamingContext = {
    // If you do not see this printed, that means the StreamingContext has been loaded
    // from the checkpoint
    println("Creating new context")
    val sparkConf = new SparkConf().setAppName("RecoverableNetworkWordCount").setMaster("local[2]")
    // Create the context with a 4 second batch interval
    val ssc = new StreamingContext(sparkConf, Seconds(4))
    ssc.checkpoint(checkpointDirectory)
    ...
    ...
    val globalKafkaSink = ssc.sparkContext.broadcast(KafkaSink())
    val mappingFunc = ...
    val stateDstream = xitemPairs.mapWithState(StateSpec.function(mappingFunc))
    stateDstream.foreachRDD { rdd =>
      rdd.foreach { message =>
        globalKafkaSink.value.send("mytopic", message.toString())
      }
    }
    stateDstream.print()
    ssc
  }

  def main(args: Array[String]): Unit = {
    Logger.getLogger("org.apache.spark").setLevel(Level.WARN)
    System.setProperty("hadoop.home.dir", "H:\\work\\spark\\")
    val checkpointDirectory = "H:/work/spark/chk2"
    val ssc = StreamingContext.getOrCreate(checkpointDirectory,
      () => createContext(checkpointDirectory))
    ssc.start()
    ssc.awaitTermination()
  }
}
I did this and it works on the first run, which creates a new context. But when it tries to get the context from the checkpoint directory on a subsequent run, it raises the following error when using the Kafka producer:
22/04/20 09:34:44 ERROR Executor: Exception in task 0.0 in stage 99.0 (TID 35)
java.lang.ClassCastException: org.apache.spark.util.SerializableConfiguration cannot be cast to webmetric.MySparkKafkaProducer
at webmetric.RecoverableJsonProcess$.$anonfun$createContext$8(RecoverableJsonProcess.scala:92)
at webmetric.RecoverableJsonProcess$.$anonfun$createContext$8$adapted(RecoverableJsonProcess.scala:90)
at scala.collection.AbstractIterator.foreach(Iterator.scala:932)
at org.apache.spark.rdd.RDD.$anonfun$foreach$2(RDD.scala:1012)
at org.apache.spark.rdd.RDD.$anonfun$foreach$2$adapted(RDD.scala:1012)
at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2242)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
22/04/20 09:34:44 WARN TaskSetManager: Lost task 0.0 in stage 99.0 (TID 35) (hajibaba.PC executor driver): java.lang.ClassCastException: org.apache.spark.util.SerializableConfiguration cannot be cast to webmetric.MySparkKafkaProducer
at webmetric.RecoverableJsonProcess$.$anonfun$createContext$8(RecoverableJsonProcess.scala:92)
at webmetric.RecoverableJsonProcess$.$anonfun$createContext$8$adapted(RecoverableJsonProcess.scala:90)
at scala.collection.AbstractIterator.foreach(Iterator.scala:932)
at org.apache.spark.rdd.RDD.$anonfun$foreach$2(RDD.scala:1012)
at org.apache.spark.rdd.RDD.$anonfun$foreach$2$adapted(RDD.scala:1012)
at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2242)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
22/04/20 09:34:44 ERROR TaskSetManager: Task 0 in stage 99.0 failed 1 times; aborting job
22/04/20 09:34:44 ERROR JobScheduler: Error running job streaming job 1650431044000 ms.0
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 99.0 failed 1 times, most recent failure: Lost task 0.0 in stage 99.0 (TID 35) (hajibaba.PC executor driver): java.lang.ClassCastException: org.apache.spark.util.SerializableConfiguration cannot be cast to webmetric.MySparkKafkaProducer
at webmetric.RecoverableJsonProcess$.$anonfun$createContext$8(RecoverableJsonProcess.scala:92)
at webmetric.RecoverableJsonProcess$.$anonfun$createContext$8$adapted(RecoverableJsonProcess.scala:90)
at scala.collection.AbstractIterator.foreach(Iterator.scala:932)
at org.apache.spark.rdd.RDD.$anonfun$foreach$2(RDD.scala:1012)
at org.apache.spark.rdd.RDD.$anonfun$foreach$2$adapted(RDD.scala:1012)
at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2242)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2251)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2200)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2199)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2199)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1078)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1078)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1078)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2438)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2380)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2369)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:868)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2202)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2223)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2242)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2267)
at org.apache.spark.rdd.RDD.$anonfun$foreach$1(RDD.scala:1012)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
at org.apache.spark.rdd.RDD.foreach(RDD.scala:1010)
at webmetric.RecoverableJsonProcess$.$anonfun$createContext$7(RecoverableJsonProcess.scala:90)
at webmetric.RecoverableJsonProcess$.$anonfun$createContext$7$adapted(RecoverableJsonProcess.scala:89)
at org.apache.spark.streaming.dstream.DStream.$anonfun$foreachRDD$2(DStream.scala:629)
at org.apache.spark.streaming.dstream.DStream.$anonfun$foreachRDD$2$adapted(DStream.scala:629)
at org.apache.spark.streaming.dstream.ForEachDStream.$anonfun$generateJob$2(ForEachDStream.scala:51)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:12)
at org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:417)
at org.apache.spark.streaming.dstream.ForEachDStream.$anonfun$generateJob$1(ForEachDStream.scala:51)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:12)
at scala.util.Try$.apply(Try.scala:209)
at org.apache.spark.streaming.scheduler.Job.run(Job.scala:39)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.$anonfun$run$1(JobScheduler.scala:256)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:12)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:256)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassCastException: org.apache.spark.util.SerializableConfiguration cannot be cast to webmetric.MySparkKafkaProducer
at webmetric.RecoverableJsonProcess$.$anonfun$createContext$8(RecoverableJsonProcess.scala:92)
at webmetric.RecoverableJsonProcess$.$anonfun$createContext$8$adapted(RecoverableJsonProcess.scala:90)
at scala.collection.AbstractIterator.foreach(Iterator.scala:932)
at org.apache.spark.rdd.RDD.$anonfun$foreach$2(RDD.scala:1012)
at org.apache.spark.rdd.RDD.$anonfun$foreach$2$adapted(RDD.scala:1012)
at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2242)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
... 3 more
I dug around this a little bit and found out that streaming doesn't support broadcast variables in this way. Take a look at this issue on the official Spark issue tracker. If I understood correctly, broadcast variables cause exceptions when recovering from a checkpoint, since the actual data is lost on the executor side and recovery from the driver side is not possible. There are some workarounds for this; I found this one, which looks promising. The example there uses an accumulator, but a similar implementation should work for a broadcast variable, roughly like the sketch below.
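A hedged sketch of that pattern adapted to a broadcast variable (the holder object and its name are hypothetical; it follows the lazily-instantiated-singleton approach used in Spark's recoverable streaming examples):

import org.apache.spark.SparkContext
import org.apache.spark.broadcast.Broadcast

// Lazily (re)create the broadcast KafkaSink instead of capturing it at context-creation time,
// so it can be rebuilt after the StreamingContext is restored from a checkpoint.
object KafkaSinkHolder {
  @volatile private var instance: Broadcast[KafkaSink] = _

  def getInstance(sc: SparkContext): Broadcast[KafkaSink] = {
    if (instance == null) {
      synchronized {
        if (instance == null) {
          instance = sc.broadcast(KafkaSink())
        }
      }
    }
    instance
  }
}

// Inside createContext, resolve the broadcast within foreachRDD via rdd.sparkContext:
stateDstream.foreachRDD { rdd =>
  val sink = KafkaSinkHolder.getInstance(rdd.sparkContext)
  rdd.foreach { message =>
    sink.value.send("mytopic", message.toString())
  }
}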

distribute sparkContext error on yarn cluster

My code works in local mode, but with YARN (client or cluster mode), it stops with this error:
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 0.0 failed 4 times, most recent failure: Lost task 1.3 in stage 0.0 (TID 6, hadoopdatanode, executor 1): java.io.IOException: java.lang.NullPointerException
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1353)
at org.apache.spark.rdd.ParallelCollectionPartition.readObject(ParallelCollectionRDD.scala:70)
I don't understand why it works in local mode but not with YARN. The problem comes from the declaration of the SparkContext inside the rdd.foreach.
I need a SparkContext inside executeAlgorithm, and because a SparkContext is not serializable I have to get it inside the rdd.foreach.
Here is my main object:
def main(args: Array[String]): Unit = {
  val sparkConf = new SparkConf().setAppName("scTest")
  val sparkContext = new SparkContext(sparkConf)
  val sparkSession = org.apache.spark.sql.SparkSession.builder
    .appName("sparkSessionTest")
    .getOrCreate

  val IDList = List("ID1", "ID2", "ID3")
  val IDListRDD = sparkContext.parallelize(IDList)
  IDListRDD.foreach(idString => {
    val sc = SparkContext.getOrCreate(sparkConf)
    executeAlgorithm(idString, sc)
  })
}
Thank you in advance
The rdd.foreach{} block normally gets executed on an executor somewhere in your cluster. In local mode, however, the driver and the executor share the same JVM instance and can access each other's classes and objects living in heap memory, which leads to unpredictable behaviour. Therefore you can't, and shouldn't, make calls from an executor to driver-side objects such as the SparkContext, RDDs, DataFrames, etc. See the following links for more information, and the sketch after them for one way to restructure the code:
Apache Spark : When not to use mapPartition and foreachPartition?
Caused by: java.lang.NullPointerException at org.apache.spark.sql.Dataset
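A minimal sketch of that restructuring for the code above (assuming executeAlgorithm only ever needs the driver-side SparkContext): drive the loop from the driver instead of from inside rdd.foreach.

// keep SparkContext usage on the driver: iterate over the plain list instead of an RDD
val IDList = List("ID1", "ID2", "ID3")
IDList.foreach { idString =>
  executeAlgorithm(idString, sparkContext)
}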

Why can't I use sort or orderby on DataFrame?

When I try to sort a DataFrame:
val df1 = df.toDF().sort(desc("sourceId"))
I get:
17/11/07 15:15:37 ERROR Executor: Exception in task 3.0 in stage 114.0 (TID 218)
com.esotericsoftware.kryo.KryoException: java.lang.IllegalArgumentException: Class is not registered: scala.math.Ordering$$anon$4
Note: To register this class use: kryo.register(scala.math.Ordering$$anon$4.class);
Serialization trace:
ord (org.apache.spark.util.BoundedPriorityQueue)
at com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:101)
at com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
at com.twitter.chill.SomeSerializer.write(SomeSerializer.scala:21)
at com.twitter.chill.SomeSerializer.write(SomeSerializer.scala:19)
at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
at org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:312)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:364)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.IllegalArgumentException: Class is not registered: scala.math.Ordering$$anon$4
I've also tried orderBy, but neither works.
What is the issue here? Do I have to import scala.math.Ordering?
It looks like you're using spark.kryo.registrationRequired:
spark.kryo.registrationRequired true
Please either set it to false:
spark.kryo.registrationRequired false
or add the required classes to spark.kryo.classesToRegister.
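A minimal sketch of both options when the configuration is set in code rather than in spark-defaults.conf (the config keys are standard Spark settings; the class name is taken from the error message):

import org.apache.spark.SparkConf

// option 1: relax the requirement
val conf = new SparkConf()
  .set("spark.kryo.registrationRequired", "false")

// option 2: keep it strict and register the classes named in the exception instead
// conf.registerKryoClasses(Array(Class.forName("scala.math.Ordering$$anon$4")))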

DSE Spark Streaming+Kafka NoSuchMethodError

I am trying to submit a Spark Streaming + Kafka job which just reads lines of strings from a Kafka topic. However, I am getting the following exception:
15/07/24 22:39:45 ERROR TaskSetManager: Task 0 in stage 2.0 failed 4 times; aborting job
Exception in thread "Thread-49" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 (TID 73, 10.11.112.93): java.lang.NoSuchMethodException: kafka.serializer.StringDecoder.(kafka.utils.VerifiableProperties)
java.lang.Class.getConstructor0(Class.java:2892)
java.lang.Class.getConstructor(Class.java:1723)
org.apache.spark.streaming.kafka.KafkaReceiver.onStart(KafkaInputDStream.scala:106)
org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:121)
org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:106)
org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$9.apply(ReceiverTracker.scala:264)
org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$9.apply(ReceiverTracker.scala:257)
org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121)
org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
org.apache.spark.scheduler.Task.run(Task.scala:54)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
When I checked the Spark jar files used by DSE, I saw that it uses kafka_2.10-0.8.0.jar, which does have that constructor. I am not sure what is causing the error. Here is my consumer code:
val sc = new SparkContext(sparkConf)
val streamingContext = new StreamingContext(sc, SLIDE_INTERVAL)

val topicMap = kafkaTopics.split(",").map((_, numThreads.toInt)).toMap
val accessLogsStream = KafkaUtils.createStream(streamingContext, zooKeeper, "AccessLogsKafkaAnalyzer", topicMap)

val accessLogs = accessLogsStream.map(_._2).map(log => ApacheAccessLog.parseLogLine(log)).cache()
UPDATE: This exception seems to happen only when I submit the job. If I run the same code by pasting it into the spark shell, it works fine.
I was facing the same issue with my custom decoder. I added the following constructor, which resolved the issue.
public YourDecoder(VerifiableProperties verifiableProperties)
{
}
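For reference, a rough Scala equivalent of such a decoder (the class name is hypothetical); the important part is the VerifiableProperties constructor parameter, which the Kafka receiver instantiates via reflection:

import kafka.serializer.Decoder
import kafka.utils.VerifiableProperties

// hypothetical custom decoder with the constructor signature the receiver looks up reflectively
class MyStringDecoder(props: VerifiableProperties = null) extends Decoder[String] {
  override def fromBytes(bytes: Array[Byte]): String = new String(bytes, "UTF-8")
}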

Exception when using collect on Spark Dataframe that is a union of Selects of another dataframe

My code:
import org.apache.spark.sql._
import org.apache.spark.sql.types._
def yearFrame(x: String): org.apache.spark.sql.DataFrame = {
  val csv0 = sc.textFile("Data/Casos_Notificados_Dengue_01_" + x + ".csv")
  val csv = sc.textFile("Data/*" + x + ".csv")
  val rdd = csv.mapPartitionsWithIndex(
    ((i, iterator) => if (i == 0 && iterator.hasNext) {
      iterator.next
      iterator.next
      iterator
    } else iterator), true)
  var schemaArray = csv0.collect()(1).split(",")
  schemaArray(0) = "NU_NOTIF" // Fixing the header change from 2011 to 2012
  val schema =
    StructType(
      schemaArray.map(fieldName =>
        if (fieldName == "NU_NOTIF") StructField(fieldName, StringType, false)
        else StructField(fieldName, StringType, true)))
  val rowRDD = rdd.map(_.split(",(?=([^\"]*\"[^\"]*\")*[^\"]*$)")).map(p => Row.fromSeq(p))
  // Apply the schema to the RDD.
  val schemaRDD = sqlContext.applySchema(rowRDD, schema)
  // Register the SchemaRDD as a table.
  schemaRDD.registerTempTable("casos")
  // SQL statements can be run by using the sql methods provided by sqlContext.
  val r = sqlContext.sql("SELECT NU_NOTIF,NU_ANO,Long_WGS84,Lat_WGS84 FROM casos")
  return r
}
val years = List("2010","2011","2012","2013","2014")
val allTables = years.map(x => yearFrame(x))
val finalTables = allTables.reduce(_.unionAll(_))
This executes without a problem, so let's say I now want to get all rows with NU_ANO = 2014:
scala> val a = finalTables.filter("NU_ANO = 2014")
a: org.apache.spark.sql.DataFrame = [NU_NOTIF: string, NU_ANO: string, Long_WGS84: string, Lat_WGS84: string]
scala> a.first
15/05/28 11:42:59 ERROR Executor: Exception in task 0.0 in stage 91.0 (TID 287)
java.lang.ArrayIndexOutOfBoundsException
15/05/28 11:42:59 ERROR TaskSetManager: Task 0 in stage 91.0 failed 1 times; aborting job
15/05/28 11:42:59 ERROR Executor: Exception in task 1.0 in stage 91.0 (TID 288)
java.lang.ArrayIndexOutOfBoundsException
15/05/28 11:42:59 ERROR Executor: Exception in task 3.0 in stage 91.0 (TID 290)
org.apache.spark.TaskKilledException
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:194)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 91.0 failed 1 times, most recent failure: Lost task 0.0 in stage 91.0 (TID 287, localhost): java.lang.ArrayIndexOutOfBoundsException
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1204)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1193)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1192)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at scala.Option.foreach(Option.scala:245)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
scala> a.schema
res116: org.apache.spark.sql.types.StructType = StructType(StructField(NU_NOTIF,StringType,true), StructField(NU_ANO,StringType,true), StructField(Long_WGS84,StringType,true), StructField(Lat_WGS84,StringType,true))
scala> a.count
15/05/28 11:43:13 ERROR Executor: Exception in task 1.0 in stage 92.0 (TID 293)
java.lang.ArrayIndexOutOfBoundsException
15/05/28 11:43:13 ERROR TaskSetManager: Task 1 in stage 92.0 failed 1 times; aborting job
15/05/28 11:43:13 ERROR Executor: Exception in task 2.0 in stage 92.0 (TID 294)
java.lang.ArrayIndexOutOfBoundsException
15/05/28 11:43:13 ERROR Executor: Exception in task 3.0 in stage 92.0 (TID 295)
java.lang.ArrayIndexOutOfBoundsException
15/05/28 11:43:13 ERROR Executor: Exception in task 4.0 in stage 92.0 (TID 296)
org.apache.spark.TaskKilledException
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:194)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 92.0 failed 1 times, most recent failure: Lost task 1.0 in stage 92.0 (TID 293, localhost): java.lang.ArrayIndexOutOfBoundsException
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1204)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1193)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1192)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at scala.Option.foreach(Option.scala:245)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
So that didn't work. Let's try collecting the data and iterating with a map:
scala> finalTable
finalTable finalTables
scala> finalTables.count
res118: Long = 226570
scala> finalTables.collect()
15/05/28 11:45:59 ERROR Executor: Exception in task 1.0 in stage 96.0 (TID 351)
java.lang.ArrayIndexOutOfBoundsException
15/05/28 11:45:59 ERROR TaskSetManager: Task 1 in stage 96.0 failed 1 times; aborting job
org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 96.0 failed 1 times, most recent failure: Lost task 1.0 in stage 96.0 (TID 351, localhost): java.lang.ArrayIndexOutOfBoundsException
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1204)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1193)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1192)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at scala.Option.foreach(Option.scala:245)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
Now, I do know I did something wrong, and in case anyone wonders, the error is the same if I try to pick only "r" or any other DataFrame, with the exception of the "SchemaRDD". Has anyone encountered a similar problem and/or has a suggestion?
I am using Spark 1.3.1, by the way.
It seems that doing a textFile over a group of files was what I was doing wrong; not because of any problem with Scala or Spark, but rather because the files start with only a bogus "ç" character on the first line. I will post below the modified code that worked.
import org.apache.spark.sql._
import org.apache.spark.sql.types._

def g(x: String, y: String = "2010"): org.apache.spark.sql.DataFrame = {
  val csv = sc.textFile("Data/Casos_Notificados_Dengue_" + x + "_" + y + ".csv")
  val rdd = csv.mapPartitionsWithIndex(
    ((i, iterator) => if (i == 0 && iterator.hasNext) {
      iterator.next
      iterator.next
      iterator
    } else iterator), true)
  var schemaArray = csv.collect()(1).split(",")
  schemaArray(0) = "NU_NOTIF" // Fixing the header change from 2011 to 2012
  val schema =
    StructType(
      schemaArray.map(fieldName =>
        if (fieldName == "NU_NOTIF") StructField(fieldName, StringType, false)
        else StructField(fieldName, StringType, true)))
  val rowRDD = rdd.map(_.split(",(?=([^\"]*\"[^\"]*\")*[^\"]*$)")).map(p => Row.fromSeq(p))
  val schemaRDD = sqlContext.applySchema(rowRDD, schema)
  schemaRDD.registerTempTable("casos")
  val r = sqlContext.sql("SELECT NU_NOTIF,NU_ANO,Long_WGS84,Lat_WGS84 FROM casos")
  return r
}

val months = List[String]("01","02","03","04","05","06","07","08","09","10","11","12")
val years = List(List("2010",months),List("2011",months),List("2012",months),List("2013",months),List("2014",List("01","02","03")))
val allTables = years.map(x => (x(1).asInstanceOf[List[String]]).map(y => g(y.toString(), x(0).toString())).reduce(_.unionAll(_)))
val finalTable = allTables.reduce(_.unionAll(_))
Although this is not the best way, it is enough for the purposes of a prototype; ideally the files would be pre-processed as they are downloaded. A defensive variant of the row parsing is sketched below.
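As an extra safeguard against the ArrayIndexOutOfBoundsException, one could also drop rows whose column count does not match the schema before building the Row objects (a hedged sketch, not part of the original fix; it would replace the rowRDD line inside g):

// defensive variant: keep only lines that split into the expected number of columns
val expectedCols = schema.fields.length
val rowRDD = rdd
  .map(_.split(",(?=([^\"]*\"[^\"]*\")*[^\"]*$)", -1))
  .filter(_.length == expectedCols)
  .map(p => Row.fromSeq(p))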