Checkpoint data corruption in Spark Streaming - scala

I am testing checkpointing and write ahead logs with the basic Spark Streaming code below. I am checkpointing into a local directory. After starting and stopping the application a few times (using Ctrl-C), it refuses to start, apparently because of some data corruption in the checkpoint directory. I am getting:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 80.0 failed 1 times, most recent failure: Lost task 0.0 in stage 80.0 (TID 17, localhost): com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: 13994
at com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:137)
at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:670)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:781)
at org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229)
at org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:192)
Full code:
import org.apache.hadoop.conf.Configuration
import org.apache.spark._
import org.apache.spark.streaming._

object ProtoDemo {
  def createContext(dirName: String) = {
    val conf = new SparkConf().setAppName("mything")
    conf.set("spark.streaming.receiver.writeAheadLog.enable", "true")

    val ssc = new StreamingContext(conf, Seconds(1))
    ssc.checkpoint(dirName)

    val lines = ssc.socketTextStream("127.0.0.1", 9999)
    val words = lines.flatMap(_.split(" "))
    val pairs = words.map(word => (word, 1))
    val wordCounts = pairs.reduceByKey(_ + _)

    val runningCounts = wordCounts.updateStateByKey[Int] {
      (values: Seq[Int], oldValue: Option[Int]) =>
        val s = values.sum
        Some(oldValue.fold(s)(_ + s))
    }

    // Print the first ten elements of each RDD generated in this DStream to the console
    runningCounts.print()
    ssc
  }

  def main(args: Array[String]) = {
    val hadoopConf = new Configuration()
    val dirName = "/tmp/chkp"
    val ssc = StreamingContext.getOrCreate(dirName, () => createContext(dirName), hadoopConf)
    ssc.start()
    ssc.awaitTermination()
  }
}

Basically, what you are describing is a driver failure scenario. For this to work, depending on the cluster you are running on, you have to follow the instructions below so that the deployment infrastructure monitors the driver process and relaunches the driver if it fails.
Configuring automatic restart of the application driver - To automatically recover from a driver failure, the deployment infrastructure that is used to run the streaming application must monitor the driver process and relaunch the driver if it fails. Different cluster managers have different tools to achieve this.
Spark Standalone - A Spark application driver can be submitted to run within the Spark Standalone cluster (see cluster deploy mode), that is, the application driver itself runs on one of the worker nodes. Furthermore, the Standalone cluster manager can be instructed to supervise the driver, and relaunch it if the driver fails either due to non-zero exit code, or due to failure of the node running the driver. See cluster mode and supervise in the Spark Standalone guide for more details.
YARN - Yarn supports a similar mechanism for automatically restarting an application. Please refer to YARN documentation for more details.
Mesos - Marathon has been used to achieve this with Mesos.
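For example, on Spark Standalone the supervision is requested when submitting the driver in cluster mode; a minimal illustration (the master URL, class name and jar path are placeholders, not taken from the question):

spark-submit \
  --master spark://master-host:7077 \
  --deploy-mode cluster \
  --supervise \
  --class ProtoDemo \
  /path/to/app.jar

With --supervise, the Standalone master relaunches the driver if it exits with a non-zero code or if the node running it fails.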
You also need to configure write ahead logs as below; there are special instructions for S3 which you need to follow.
While using S3 (or any file system that does not support flushing) for write ahead logs, please remember to enable
spark.streaming.driver.writeAheadLog.closeFileAfterWrite
spark.streaming.receiver.writeAheadLog.closeFileAfterWrite.
See Spark Streaming Configuration for more details.
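For illustration, a minimal sketch of setting these properties on the SparkConf in createContext (only needed if the WAL directory is on S3 or another file system without flush support):

val conf = new SparkConf().setAppName("mything")
// Receiver WAL, as already enabled in the question
conf.set("spark.streaming.receiver.writeAheadLog.enable", "true")
// Close WAL files after each write; needed for S3-like file systems
conf.set("spark.streaming.driver.writeAheadLog.closeFileAfterWrite", "true")
conf.set("spark.streaming.receiver.writeAheadLog.closeFileAfterWrite", "true")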

This looks more like a Kryo serializer issue than checkpoint corruption.
In the code example (including the GitHub project), Kryo serialization is not configured.
Since it is not configured, a KryoException should not be possible.
When using write ahead logs and restoring from a checkpoint directory, all of the Spark configuration is read from the checkpoint.
In your example, the createContext method is not called when starting from the checkpoint.
I assume another application was tested earlier with the same checkpoint directory, with the Kryo serializer configured, and the current application now fails to restore from that checkpoint.
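If that is the case, the simplest check is to start from a clean checkpoint directory; alternatively, pin the serializer explicitly in createContext so every run writes and restores with the same configuration. A rough sketch (an assumption, not from the original code):

val conf = new SparkConf().setAppName("mything")
conf.set("spark.streaming.receiver.writeAheadLog.enable", "true")
// Make the serializer explicit so a fresh start and a checkpoint restore agree
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")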

Related

Why is a streaming query still up and running after StreamingQueryManager.awaitAnyTermination?

I want to terminate the Spark mapping after a specific time. I'm using sqlContext.streams.awaitAnyTermination(long timeoutMs) for that, but the mapping is not stopping after the given timeout.
I have tried to read from Azure Event Hub and provided 2 min (120000 ms) as the timeout for the awaitAnyTermination method, but the mapping is not stopping on the Azure Databricks cluster.
Below is my code. I'm reading from Azure Event Hub, writing to the console, and passing 120000 ms to awaitAnyTermination.
import org.apache.spark.eventhubs._
import org.apache.spark.eventhubs.ConnectionStringBuilder

// Event hub configurations
// Replace values below with yours
val connStr = ConnectionStringBuilder()
  .setNamespaceName("iisqaeventhub")
  .setEventHubName("devsource")
  .setSasKeyName("RootManageSharedAccessKey")
  .setSasKey("saskey")
  .build

val customEventhubParameters = EventHubsConf(connStr)
  .setMaxEventsPerTrigger(5)
  .setStartingPosition(EventPosition.fromEndOfStream)

// reading from the Azure event hub
val incomingStream = spark.readStream.format("eventhubs").options(customEventhubParameters.toMap).load()

// write to console
val query = incomingStream.writeStream
  .outputMode("append")
  .format("console")
  .start()

// awaitAnyTermination for shutting down the query
sqlContext.streams.awaitAnyTermination(120000)
I am expecting that the mapping should have ended after the timeout. There is no error, but the mapping is not stopping.
tl;dr Works as designed.
From the official documentation:
awaitAnyTermination(timeoutMs: Long): Boolean
Returns whether any query has terminated or not (multiple may have terminated).
In other words, no streaming query is going to be terminated at any point in time (before or after the timeoutMs) unless there is an exception or stop.
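If the goal is to shut everything down once the timeout expires, one option is to use the Boolean return value and stop the still-active queries yourself; a rough sketch (not from the original answer):

// Wait up to 2 minutes; returns true only if some query terminated by itself.
val anyTerminated = spark.streams.awaitAnyTermination(120000)
if (!anyTerminated) {
  // Nothing stopped on its own within the timeout, so stop the remaining queries.
  spark.streams.active.foreach(_.stop())
}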
When using Databricks and prototyping, this is what I use to stop Spark Structured Streaming apps in a separate notebook pane:
import org.apache.spark.streaming._
StreamingContext.getActive.foreach { _.stop(stopSparkContext = false) }

Flink - producing to kinesis not working

I'm trying to run a simple program which reads from one kinesis stream, does a trivial transformation, and writes the result to another kinesis stream.
Running locally on Flink 1.4.0 (this is the version supported on EMR currently, so no way of upgrading).
Here is the code:
def main(args: Array[String]) {
  val env = StreamExecutionEnvironment.getExecutionEnvironment

  val consumerConfig = new Properties()
  consumerConfig.put(AWSConfigConstants.AWS_REGION, "us-east-1")
  consumerConfig.put(ConsumerConfigConstants.STREAM_INITIAL_POSITION, "LATEST")

  val kinesisMaps = env.addSource(new FlinkKinesisConsumer[String](
    "source-stream", new SimpleStringSchema, consumerConfig))

  val jsonMaps = kinesisMaps.map { jsonStr => JSON.parseFull(jsonStr).get.asInstanceOf[Map[String, String]] }
  val values = jsonMaps.map(jsonMap => jsonMap("field_name"))

  values.print()

  val producerConfig = new Properties()
  producerConfig.put(AWSConfigConstants.AWS_REGION, "us-east-1")

  val kinesisProducer = new FlinkKinesisProducer[String](new SimpleStringSchema, producerConfig)
  kinesisProducer.setFailOnError(true)
  kinesisProducer.setDefaultStream("target-stream")
  kinesisProducer.setDefaultPartition("0")

  values.addSink(kinesisProducer)

  // execute program
  env.execute("Flink Kinesis")
}
If I comment out the producing code, the program runs as expected and prints the correct values.
As soon as I add the producer code, I get the following exception:
org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.producer.DaemonException: The child process has been shutdown and can no longer accept messages.
at org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.producer.Daemon.add(Daemon.java:176)
at org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.producer.KinesisProducer.addUserRecord(KinesisProducer.java:477)
at org.apache.flink.streaming.connectors.kinesis.FlinkKinesisProducer.invoke(FlinkKinesisProducer.java:248)
at org.apache.flink.streaming.api.functions.sink.SinkFunction.invoke(SinkFunction.java:52)
at org.apache.flink.streaming.api.operators.StreamSink.processElement(StreamSink.java:56)
at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.pushToOperator(OperatorChain.java:549)
at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:524)
at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:504)
at org.apache.flink.streaming.runtime.tasks.OperatorChain$BroadcastingOutputCollector.collect(OperatorChain.java:608)
at org.apache.flink.streaming.runtime.tasks.OperatorChain$BroadcastingOutputCollector.collect(OperatorChain.java:569)
at org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:831)
at org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:809)
at org.apache.flink.streaming.api.operators.StreamMap.processElement(StreamMap.java:41)
at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.pushToOperator(OperatorChain.java:549)
at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:524)
at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:504)
at org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:831)
at org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:809)
at org.apache.flink.streaming.api.operators.StreamMap.processElement(StreamMap.java:41)
at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.pushToOperator(OperatorChain.java:549)
at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:524)
at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:504)
at org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:831)
at org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:809)
at org.apache.flink.streaming.api.operators.StreamSourceContexts$NonTimestampContext.collect(StreamSourceContexts.java:104)
at org.apache.flink.streaming.api.operators.StreamSourceContexts$NonTimestampContext.collectWithTimestamp(StreamSourceContexts.java:111)
at org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher.emitRecordAndUpdateState(KinesisDataFetcher.java:486)
at org.apache.flink.streaming.connectors.kinesis.internals.ShardConsumer.deserializeRecordForCollectionAndUpdateState(ShardConsumer.java:264)
at org.apache.flink.streaming.connectors.kinesis.internals.ShardConsumer.run(ShardConsumer.java:210)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
Any idea what's the cause of this?
Apparently, this is an issue with the old version of the Amazon KPL which is used in Flink 1.4.
There are at least two possible solutions for this:
Upgrade to Flink version 1.5. You can still use it on EMR if you install it as described under the section Custom EMR Installation:
https://ci.apache.org/projects/flink/flink-docs-release-1.5/ops/deployment/aws.html
Build the Kinesis connector for Flink 1.4 with newer AWS dependencies: I cherry-picked the AWS dependency changes in the connector's pom.xml from 1.5 and built the connector with them. It looks like it's working as expected.

Spark Cluster: How to print out the content of RDD on each worker node

I just started learning Apache Spark and wanted to know why this is not working for me.
I am running Spark 2.1 and have started a master and a worker (not local). This is my code:
object SimpleApp {
  def main(args: Array[String]) {
    val file = [FILELOCATION]
    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)

    val textFile = sc.textFile(file)
    val counts = textFile.flatMap(line => line.split(" "))
      .map(word => (word.toLowerCase.toCharArray.toList.sorted.mkString, 1))
      .reduceByKey(_ + _)

    counts.map(println)
    counts.foreach(println)
    val countCollect = counts.collect()

    sc.stop()
  }
}
I cannot seem to get the worker nodes to print out their contents in stdout. Even if I set the master and worker to local, it does not seem to work.
Am I understanding something wrong here?
If you want to print something in an executor, a normal println will do; it will print the output to that executor's stdout.
You can actually view the worker status, application status, each worker's stderr/stdout, the RDD distribution and more at localhost:8080 in a browser on the master machine. Click on a worker ID to view its logs (stdout, stderr). If you want to see the actual distribution and status, click on the running application and then on the Application Detail UI link; it shows the complete details of your application.
If you only want to view the worker UI, go to localhost:8081 on the worker machine.
Whenever you submit a Spark job, the tasks (instructions) for the job go from the driver to the executors. The driver can be running on the same node that you are currently logged onto (local and YARN client mode) or on another node (the application master).
All actions return their result back to the driver, so if you are logged onto the machine where the driver runs, you can see the output. But you cannot see the output on the executor nodes, since any print statement is printed on the console of the respective machine. You can instead do an rdd.saveAsTextFile(), and it will save each partition into the directory separately; that way you can see the contents of each partition.
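A rough sketch of both options (the output path is just an example):

// Option 1: bring the data back to the driver and print it there.
// Suitable only for small RDDs, since collect() pulls everything to the driver.
counts.collect().foreach(println)

// Option 2: write each partition to its own part file and inspect the directory.
counts.saveAsTextFile("/tmp/word-counts-output")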

Spark AWS emr checkpoint location

I'm running a Spark job on EMR and need to create a checkpoint. I tried using S3 but got this error message:
17/02/24 14:34:35 ERROR ApplicationMaster: User class threw exception:
java.lang.IllegalArgumentException: Wrong FS: s3://spark-jobs/checkpoint/31d57e4f-dbd8-4a50-ba60-0ab1d5b7b14d/connected-components-e3210fd6/2, expected: hdfs://ip-172-18-13-18.ec2.internal:8020
java.lang.IllegalArgumentException: Wrong FS: s3://spark-jobs/checkpoint/31d57e4f-dbd8-4a50-ba60-0ab1d5b7b14d/connected-components-e3210fd6/2, expected: hdfs://ip-172-18-13-18.ec2.internal:8020
Here is my sample code
...
val sparkConf = new SparkConf().setAppName("spark-job")
  .set("spark.default.parallelism", (CPU * 3).toString)
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[Member], classOf[GraphVertex], classOf[GraphEdge]))
  .set("spark.dynamicAllocation.enabled", "true")

implicit val sparkSession = SparkSession.builder().config(sparkConf).getOrCreate()
sparkSession.sparkContext.setCheckpointDir("s3://spark-jobs/checkpoint")
....
How can I checkpoint on AWS EMR?
There's a now-fixed bug in Spark which meant you could only checkpoint to the default FS, not any other one (like S3). It's fixed in master; I don't know about backports.
If it makes you feel any better: the way checkpointing works (write, then rename()) is slow enough on an object store that you may find yourself better off checkpointing locally and then doing the upload to S3 yourself.
There is a fix in the master branch to allow checkpointing to S3 too. I was able to build against it and it worked, so this should be part of the next release.
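Until you are on a release with that fix, a simple workaround on EMR is to checkpoint to the cluster's default file system (HDFS) and copy the data to S3 afterwards if you need it there. A minimal sketch (the HDFS path is a placeholder):

// Use the default FS (HDFS on EMR), which is what the "Wrong FS" error expects.
sparkSession.sparkContext.setCheckpointDir("hdfs:///tmp/spark-checkpoints")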
Try something with AWS authentication like:
val hadoopConf: Configuration = new Configuration()
hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
hadoopConf.set("fs.s3n.awsAccessKeyId", "id-1")
hadoopConf.set("fs.s3n.awsSecretAccessKey", "secret-key")

StreamingContext.getOrCreate(checkPointDir,
  () => { createStreamingContext(checkPointDir, config) }, hadoopConf)

Spark Streaming gets stopped without errors after ~1 minute

When I do spark-submit for the Spark Streaming job, I can see that it's running for approximately 1 minute, and then it's stopped with the final status SUCCEEDED:
16/11/16 18:58:16 INFO yarn.Client: Application report for application_XXXX_XXX (state: RUNNING)
16/11/16 18:58:17 INFO yarn.Client: Application report for application_XXXX_XXX (state: FINISHED)
I don't understand why it gets stopped, since I expect it to run indefinitely and be triggered by messages received from the Kafka queue. In the logs I can see all the println outputs, and there are no errors.
This is a short extract from the code:
val conf = new SparkConf().setAppName("MYTEST")
val sc = new SparkContext(conf)
sc.setCheckpointDir("~/checkpointDir")

val ssc = new StreamingContext(sc, Seconds(batch_interval_seconds))

val rootLogger = Logger.getRootLogger()
rootLogger.setLevel(Level.ERROR)

println("Dividing the topic into partitions.")
val inputKafkaTopicMap = inputKafkaTopic.split(",").map((_, kafkaNumThreads)).toMap
val messages = KafkaUtils.createStream(ssc, zkQuorum, group, inputKafkaTopicMap).map(_._2)

messages.foreachRDD(msg => {
  msg.foreach(s => {
    if (s != null) {
      //val result = ... processing goes here
      //println(result)
    }
  })
})

// Start the streaming context in the background.
ssc.start()
This is my spark-submit command:
/usr/bin/spark-submit --master yarn --deploy-mode cluster --driver-memory 10g --executor-memory 10g --num-executors 2 --conf "spark.executor.extraJavaOptions=-XX:+UseG1GC \
-XX:+AlwaysPreTouch" --class org.test.StreamingRunner test.jar param1 param2
When I open the Resource Manager, I see that no job is RUNNING and the Spark Streaming job is marked as FINISHED.
Your code is missing a call to ssc.awaitTermination to block the driver thread.
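A minimal sketch of the fix, at the end of the code shown above:

// Start the streaming context in the background.
ssc.start()
// Block the driver thread until the context is stopped; without this, main()
// returns and YARN marks the application FINISHED after about a minute.
ssc.awaitTermination()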
Unfortunately, there's no easy way to see the printouts from inside your map function on the console, since those function calls are happening inside YARN executors. Cloudera Manager provides a decent look at the logs, though, and if you really need them collected on the driver you can write to a location in HDFS and then scrape the various logs from there yourself. If the information that you want to track is purely numeric, you might also consider using an Accumulator.
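A rough sketch of the accumulator approach, assuming Spark 2.x where sc.longAccumulator is available (the accumulator name and the counted condition are just examples):

// Driver side: a named accumulator that also shows up in the Spark UI.
val processedRecords = sc.longAccumulator("processedRecords")

messages.foreachRDD { rdd =>
  rdd.foreach { s =>
    if (s != null) {
      // Executor side: count instead of println, which only reaches the
      // executor's local stdout.
      processedRecords.add(1)
    }
  }
  // Driver side: the running total can be printed or logged here.
  println(s"Processed so far: ${processedRecords.value}")
}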