Datasource V2 Reader (Spark Structured Streaming) - offsets out of order - scala

I am currently implementing two custom readers using the V2 api for a spark structured streaming job. After the job runs for ~30-60 minutes, it bombs with:
Caused by: java.lang.RuntimeException: Offsets committed out of order: 608799 followed by 2982
I am repurposing the examples found here, and it is bombing at line 206.
Instead of using the Twitter stream provided in the example, I am implementing it for JMS and SQS.
My question is: has anyone encountered this issue? Or is there something wrong with that implementation?
Code snippet:
override def commit(end: Offset): Unit = {
  internalLog(s"** commit($end) lastOffsetCommitted: $lastOffsetCommitted")

  val newOffset = TwitterOffset.convert(end).getOrElse(
    sys.error(s"TwitterStreamMicroBatchReader.commit() received an offset ($end) that did not " +
      s"originate with an instance of this class")
  )

  val offsetDiff = (newOffset.offset - lastOffsetCommitted.offset).toInt
  if (offsetDiff < 0) {
    sys.error(s"Offsets committed out of order: $lastOffsetCommitted followed by $end")
  }

  tweetList.trimStart(offsetDiff)
  lastOffsetCommitted = newOffset
}
I can't find an answer through my usual outlets. I did, however, see this. One point that was made is to delete the checkpoint data, which doesn't seem like a viable solution in a production system. The other was that the source system doesn't maintain offset information. I was under the impression that Spark would handle the offset information by itself. If this second point is the problem, how can I ensure that the source system handles this paradigm?
Please let me know if I can provide more information.
Edit: Looking at the MicroBatchReader interface, the documentation for commit says:
/**
 * Informs the source that Spark has completed processing all data for offsets less than or
 * equal to `end` and will only request offsets greater than `end` in the future.
 */
void commit(Offset end);
So the question becomes: why is Spark sending me commit offsets that have already been committed?
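For illustration, a defensive variant of the commit() above that tolerates a replayed offset would look like the sketch below. Note this is only an experiment: swallowing the error hides whatever is corrupting the offsets rather than fixing it.

// Sketch only: treat a commit for an offset at or below lastOffsetCommitted as a no-op
// instead of erroring out. This masks the underlying problem rather than solving it.
override def commit(end: Offset): Unit = {
  val newOffset = TwitterOffset.convert(end).getOrElse(
    sys.error(s"commit() received an offset ($end) that did not originate with this class"))

  val offsetDiff = (newOffset.offset - lastOffsetCommitted.offset).toInt
  if (offsetDiff < 0) {
    // With a healthy checkpoint store this should never happen.
    internalLog(s"Ignoring stale commit: $lastOffsetCommitted followed by $end")
  } else {
    tweetList.trimStart(offsetDiff)
    lastOffsetCommitted = newOffset
  }
}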

Answering my own question in case it helps someone.
I should have added more information to the question: this job is running on EMR and is using EFS to checkpoint data.
The problem occurred when I used Amazon's amazon-efs-utils to mount EFS. For some reason, each worker was unable to see the other workers' reads and writes, as if EFS had not been mounted at all.
The solution was to switch to nfs-utils to mount EFS (per AWS instructions) so that each worker could accurately read the checkpoint data.
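For context, the checkpoint location is just the path passed when starting the query, and it has to resolve to the same shared storage on the driver and every worker. A minimal sketch, with placeholder format names and paths rather than my actual job:

import org.apache.spark.sql.SparkSession

// Placeholder app name, source format, and paths; the point is that the checkpoint
// directory must be the same shared filesystem on every node.
val spark = SparkSession.builder().appName("jms-sqs-reader").getOrCreate()

val stream = spark.readStream
  .format("my.custom.jms.source") // hypothetical custom V2 source
  .load()

val query = stream.writeStream
  .format("parquet")
  .option("path", "/mnt/efs/output")
  .option("checkpointLocation", "/mnt/efs/checkpoints/jms-sqs-reader")
  .start()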

Related

Fault tolerance in Flink file Sink

I am using Flink streaming with the Kafka consumer connector (FlinkKafkaConsumer) and the file sink (StreamingFileSink) in cluster mode with an exactly-once policy.
The file sink writes the files to the local disk.
I’ve noticed that if a job fails and automatic restart is on, the task managers look for the leftover files from the last failed job (hidden files).
Obviously, since the tasks can be assigned to different task managers, this results in more failures, over and over again.
The only solution I found so far is to delete the hidden files and resubmit the job.
If I understand it right (and please correct me if I'm wrong), the events in the hidden files were not committed to the bootstrap-server, so there is no data loss.
Is there a way to force Flink to ignore the files that were already written? Or maybe there is a better way to implement the solution (perhaps with savepoints)?
I got a very detailed answer on the Flink mailing list. TL;DR: in order to implement exactly-once, I have to use some kind of distributed FS.
The full answer:
A local filesystem is not the right choice for what you are trying to achieve. I don't think you can achieve a true exactly once policy in this setup. Let me elaborate on why.
The interesting bit is how it behaves on checkpoints. The behavior is controlled by a RollingPolicy. As you have not said which format you use, let's assume the row format first. For a row format, the default rolling policy (when to change a file from in-progress to pending) is: the file is rolled if it reaches 128 MB, is older than 60 sec, or has not been written to for 60 sec. It does not roll on a checkpoint. Moreover, the StreamingFileSink considers the filesystem a durable sink that can be accessed after a restore. That implies it will try to append to this file when restoring from a checkpoint/savepoint.
Even if you rolled the files on every checkpoint, you might still face the problem of leftovers, because the StreamingFileSink moves the files from pending to complete only after the checkpoint is completed. If a failure happens between finishing the checkpoint and moving the files, it will not be able to move them after a restore (it would do so if it had access).
Lastly, a completed checkpoint will contain the offsets of records that were processed successfully end-to-end, which means records that are assumed committed by the StreamingFileSink. These can be records written to an in-progress file with a pointer in the StreamingFileSink's checkpointed metadata, records in a "pending" file with an entry in the StreamingFileSink's checkpointed metadata that the file has been completed, or records in "finished" files.[1]
Therefore as you can see there are multiple scenarios when the StreamingFileSink has to access the files after a restart.
The last thing: you mentioned committing to the "bootstrap-server". Bear in mind that Flink does not use offsets committed back to Kafka for guaranteeing consistency. It can write those offsets back, but only for monitoring/debugging purposes. Flink stores/restores the processed offsets from its checkpoints.[3]
Let me know if it helped. I tried my best ;) BTW I highly encourage reading the linked sources as they try to describe all that in a more structured way.
I am also cc'ing Kostas, who knows more about the StreamingFileSink than I do, so he can maybe correct me somewhere.
[1] https://ci.apache.org/projects/flink/flink-docs-release-1.10/dev/connectors/streamfile_sink.html
[2] https://ci.apache.org/projects/flink/flink-docs-release-1.10/dev/connectors/kafka.html
[3] https://ci.apache.org/projects/flink/flink-docs-release-1.10/dev/connectors/kafka.html#kafka-consumers-offset-committing-behaviour-configuration
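Not part of the mailing-list answer, just an attempt to make it concrete: a minimal Flink job sketch that enables exactly-once checkpointing, reads from Kafka, and writes with a row-format StreamingFileSink to a distributed filesystem. The broker, topic, path, intervals, and names are placeholders; the builder calls follow the Flink 1.10 docs linked above.

import java.util.Properties
import java.util.concurrent.TimeUnit

import org.apache.flink.api.common.serialization.{SimpleStringEncoder, SimpleStringSchema}
import org.apache.flink.core.fs.Path
import org.apache.flink.streaming.api.CheckpointingMode
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink
import org.apache.flink.streaming.api.functions.sink.filesystem.rollingpolicies.DefaultRollingPolicy
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer

object ExactlyOnceFileSinkJob {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    // Exactly-once checkpoints; the 60 s interval is an arbitrary placeholder.
    env.enableCheckpointing(60000, CheckpointingMode.EXACTLY_ONCE)

    val props = new Properties()
    props.setProperty("bootstrap.servers", "broker:9092") // placeholder
    props.setProperty("group.id", "file-sink-job")        // placeholder

    // Flink restores offsets from its own checkpoints; committing them back to Kafka
    // (enabled below) is only useful for monitoring.
    val source = new FlinkKafkaConsumer[String]("my_topic", new SimpleStringSchema(), props)
    source.setCommitOffsetsOnCheckpoints(true)

    // Row-format sink writing to a distributed filesystem (placeholder path), with the
    // default-style rolling policy described above: 128 MB part size, 60 s rollover/inactivity.
    val sink: StreamingFileSink[String] = StreamingFileSink
      .forRowFormat(new Path("hdfs:///data/output"), new SimpleStringEncoder[String]("UTF-8"))
      .withRollingPolicy(
        DefaultRollingPolicy.builder()
          .withMaxPartSize(128L * 1024 * 1024)
          .withRolloverInterval(TimeUnit.SECONDS.toMillis(60))
          .withInactivityInterval(TimeUnit.SECONDS.toMillis(60))
          .build[String, String]())
      .build()

    env.addSource(source).addSink(sink)
    env.execute("exactly-once-file-sink")
  }
}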

Acknowledgement Kafka Producer Apache Beam

How do I get the records for which an acknowledgement was received in Apache Beam KafkaIO?
Basically, I want all the records for which I didn't get an acknowledgement to go to a BigQuery table so that I can retry them sometime later. I used the following code snippet from the docs:
.apply(KafkaIO.<Long, String>read()
    .withBootstrapServers("broker_1:9092,broker_2:9092")
    .withTopic("my_topic") // use withTopics(List<String>) to read from multiple topics.
    .withKeyDeserializer(LongDeserializer.class)
    .withValueDeserializer(StringDeserializer.class)

    // Above four are required configuration. returns PCollection<KafkaRecord<Long, String>>
    // Rest of the settings are optional:

    // you can further customize the KafkaConsumer used to read the records by adding more
    // settings for ConsumerConfig, e.g.:
    .updateConsumerProperties(ImmutableMap.of("group.id", "my_beam_app_1"))

    // set event times and watermark based on LogAppendTime. To provide a custom
    // policy see withTimestampPolicyFactory(). withProcessingTime() is the default.
    .withLogAppendTime()

    // restrict reader to committed messages on Kafka (see method documentation).
    .withReadCommitted()

    // offsets consumed by the pipeline can be committed back.
    .commitOffsetsInFinalize()

    // finally, if you don't need Kafka metadata, you can drop it.
    .withoutMetadata() // PCollection<KV<Long, String>>
)
.apply(Values.<String>create()) // PCollection<String>
By default, Beam IOs are designed to keep attempting to write/read/process elements until they succeed (batch pipelines will fail after repeated errors).
What you are referring to is usually called a dead letter queue (DLQ): the failed records are captured and added to a PCollection, a Pub/Sub topic, a queuing service, etc. This is often desirable, as it allows a streaming pipeline to make progress (not block) when errors are encountered writing some records, while still allowing the ones which succeed to be written.
Unfortunately, unless I am mistaken, there is no dead letter queue implemented in KafkaIO. It may be possible to modify KafkaIO to support this. There was some discussion on the Beam mailing list, with ideas proposed for implementing it.
I suspect it may be possible to add this to KafkaWriter, catching the records that failed and outputting them to another PCollection. If you choose to implement this, please also contact the Beam community mailing list; they will be able to help make sure the change covers the necessary requirements so that it can be merged and makes sense as a whole for Beam.
Your pipeline can then write those records elsewhere (e.g. to a different sink). Of course, if that secondary sink simultaneously has an outage/issue, you would need another DLQ.
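To sketch the general dead-letter idea in plain Scala (conceptual only, not Beam's API, since KafkaIO does not expose this today; all names are made up):

// Records that fail to write are captured with their error and routed to a secondary store
// (e.g. a BigQuery table) for a later retry, while successful records proceed normally.
final case class FailedRecord(key: Long, value: String, error: String)

def writeWithDeadLetter(records: Seq[(Long, String)],
                        writeToKafka: ((Long, String)) => Unit,
                        writeToDeadLetter: FailedRecord => Unit): Unit =
  records.foreach { case rec @ (key, value) =>
    try writeToKafka(rec)
    catch {
      case e: Exception =>
        // Instead of failing the whole pipeline, park the record for a later retry.
        writeToDeadLetter(FailedRecord(key, value, e.getMessage))
    }
  }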

Kafka connect error handling and improved logging

I was trying to leverage some enhancements to Kafka Connect in the 2.0.0 release, as specified by KIP-298 (https://cwiki.apache.org/confluence/display/KAFKA/KIP-298%3A+Error+Handling+in+Connect), and I came across this good blog post by Robin: https://www.confluent.io/blog/kafka-connect-deep-dive-error-handling-dead-letter-queues.
Here are my questions
I have set errors.tolerance=all in my connector config. If I understand correctly, it will not fail for bad records and will move forward. Is my understanding correct?
In my case, the consumer doesn't fail and stays in the RUNNING state (which is expected), but the consumer offsets don't move forward for the partitions with the bad records. Any guess why this may be happening?
I have set errors.log.include.messages and errors.log.enable to true for my connector but I don't see any additional logging for the failed records. The logs are similar to what I used to see before enabling these properties. I didn't see any message like this https://github.com/apache/kafka/blob/5a95c2e1cd555d5f3ec148cc7c765d1bb7d716f9/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/errors/LogReporter.java#L67
Some Context:
In my connector, I do some transformations and validations for every record, and if any of these fail, I throw a RetriableException. Earlier I was throwing a RuntimeException, but I changed to RetriableException after reading the comments for the RetryWithToleranceOperator class.
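To illustrate the pattern (a simplified, sink-task-shaped sketch with made-up names, not my actual connector code):

import java.util
import org.apache.kafka.connect.errors.RetriableException
import org.apache.kafka.connect.sink.{SinkRecord, SinkTask}

class MySinkTask extends SinkTask {
  override def version(): String = "0.1"
  override def start(props: util.Map[String, String]): Unit = ()
  override def stop(): Unit = ()

  override def put(records: util.Collection[SinkRecord]): Unit = {
    val it = records.iterator()
    while (it.hasNext) {
      val record = it.next()
      try {
        validateAndTransform(record) // hypothetical helper; the real logic is elided
      } catch {
        case e: Exception =>
          // Surfacing the failure as RetriableException asks the framework to retry.
          throw new RetriableException(s"Failed to process record from ${record.topic()}", e)
      }
    }
  }

  // Placeholder for the transformations/validations described above.
  private def validateAndTransform(record: SinkRecord): Unit = ()
}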
I have tried to keep it brief but let me know if any additional context is required.
Thanks so much in advance!

End-to-end exactly once semantics in spark structured streaming

I am trying to understand if end-to-end exactly once semantics is compromised in spark structured streaming in the below scenario.
Scenario: a structured streaming job with a Kafka source and a file sink is started. Kafka has 16 partitions and I am reading with 16 executors. I interrupted the job at a moment when a particular batch was incomplete: 8 out of 16 tasks had completed, and 8 output files had been generated. When I ran the job again, the batch started and read the data from the same offset range as the previous incomplete batch, producing 16 output files. The 8 output files of the incomplete batch therefore resulted in duplicates, and this has been confirmed by data comparison.
Regarding streaming end-to-end exactly-once, I recommend reading this post on Flink (a similar framework to Spark).
Briefly: store the source/sink state when a checkpoint event occurs.
The rest of the answer is from the Flink post:
So let’s put all of these different pieces together:
Once all of the operators complete their pre-commit, they issue a commit.
If at least one pre-commit fails, all others are aborted, and we roll back to the previous successfully-completed checkpoint.
After a successful pre-commit, the commit must be guaranteed to eventually succeed — both our operators and our external system need to make this guarantee. If a commit fails (for example, due to an intermittent network issue), the entire Flink application fails, the application restarts according to the user’s restart strategy, and there is another commit attempt. This process is critical because if the commit does not eventually succeed, data loss occurs.
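A conceptual sketch of that protocol as an interface (my own illustration, not Flink's or Spark's actual API):

// Conceptual only: the two-phase commit steps the bullets above describe.
trait TwoPhaseCommitSink[TXN] {
  def beginTransaction(): TXN          // open a transaction for the current checkpoint period
  def preCommit(txn: TXN): Unit        // flush on the checkpoint barrier; may still be aborted
  def commit(txn: TXN): Unit           // after a successful checkpoint; must eventually succeed
  def abort(txn: TXN): Unit            // roll back if the checkpoint (or any pre-commit) fails
}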

Reliability issues with Checkpointing/WAL in Spark Streaming 1.6.0

Description
We have a Spark Streaming 1.5.2 application in Scala that reads JSON events from a Kinesis Stream, does some transformations/aggregations and writes the results to different S3 prefixes. The current batch interval is 60 seconds. We have 3000-7000 events/sec. We’re using checkpointing to protect us from losing aggregations.
It’s been working well for a while, recovering from exceptions and even cluster restarts. We recently recompiled the code for Spark Streaming 1.6.0, only changing the library dependencies in the build.sbt file. After running the code in a Spark 1.6.0 cluster for several hours, we’ve noticed the following:
“Input Rate” and “Processing Time” volatility has increased substantially (see the screenshots below) in 1.6.0.
Every few hours, there’s an "Exception thrown while writing record: BlockAdditionEvent … to the WriteAheadLog. java.util.concurrent.TimeoutException: Futures timed out after [5000 milliseconds]" exception (see the complete stack trace below), coinciding with a drop to 0 events/sec for specific batches (lasting minutes).
After doing some digging, I think the second issue looks related to this Pull Request. The initial goal of the PR: “When using S3 as a directory for WALs, the writes take too long. The driver gets very easily bottlenecked when multiple receivers send AddBlock events to the ReceiverTracker. This PR adds batching of events in the ReceivedBlockTracker so that receivers don’t get blocked by the driver for too long.”
We are checkpointing in S3 in Spark 1.5.2 and there are no performance/reliability issues. We’ve tested checkpointing in Spark 1.6.0 in S3 and local NAS and in both cases we’re receiving this exception. It looks like when it takes more than 5 seconds to checkpoint a batch, this exception arises and we’ve checked that the events for that batch are lost forever.
Questions
Is the increase in “Input Rate” and “Processing Time” volatility expected in Spark Streaming 1.6.0 and is there any known way of improving it?
Do you know of any workaround apart from these two?
1) To guarantee that it takes less than 5 seconds for the checkpointing sink to write all files. In my experience, you cannot guarantee that with S3, even for small batches. For local NAS, it depends on who’s in charge of infrastructure (difficult with cloud providers).
2) Increase the spark.streaming.driver.writeAheadLog.batchingTimeout property value.
Would you expect to lose any events in the described scenario? I'd think that if batch checkpointing fails, the shard/receiver Sequence Numbers wouldn't be increased and it would be retried at a later time.
Spark 1.5.2 Statistics - Screenshot
Spark 1.6.0 Statistics - Screenshot
Full Stack Trace
16/01/19 03:25:03 WARN ReceivedBlockTracker: Exception thrown while writing record: BlockAdditionEvent(ReceivedBlockInfo(0,Some(3521),Some(SequenceNumberRanges(SequenceNumberRange(StreamEventsPRD,shardId-000000000003,49558087746891612304997255299934807015508295035511636018,49558087746891612304997255303224294170679701088606617650), SequenceNumberRange(StreamEventsPRD,shardId-000000000004,49558087949939897337618579003482122196174788079896232002,49558087949939897337618579006984380295598368799020023874), SequenceNumberRange(StreamEventsPRD,shardId-000000000001,49558087735072217349776025034858012188384702720257294354,49558087735072217349776025038332464993957147037082320914), SequenceNumberRange(StreamEventsPRD,shardId-000000000009,49558088270111696152922722880993488801473174525649617042,49558088270111696152922722884455852348849472550727581842), SequenceNumberRange(StreamEventsPRD,shardId-000000000000,49558087841379869711171505550483827793283335010434154498,49558087841379869711171505554030816148032657077741551618), SequenceNumberRange(StreamEventsPRD,shardId-000000000002,49558087853556076589569225785774419228345486684446523426,49558087853556076589569225789389107428993227916817989666))),BlockManagerBasedStoreResult(input-0-1453142312126,Some(3521)))) to the WriteAheadLog.
java.util.concurrent.TimeoutException: Futures timed out after [5000 milliseconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:107)
at org.apache.spark.streaming.util.BatchedWriteAheadLog.write(BatchedWriteAheadLog.scala:81)
at org.apache.spark.streaming.scheduler.ReceivedBlockTracker.writeToLog(ReceivedBlockTracker.scala:232)
at org.apache.spark.streaming.scheduler.ReceivedBlockTracker.addBlock(ReceivedBlockTracker.scala:87)
at org.apache.spark.streaming.scheduler.ReceiverTracker.org$apache$spark$streaming$scheduler$ReceiverTracker$$addBlock(ReceiverTracker.scala:321)
at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$receiveAndReply$1$$anon$1$$anonfun$run$1.apply$mcV$sp(ReceiverTracker.scala:500)
at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1230)
at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$receiveAndReply$1$$anon$1.run(ReceiverTracker.scala:498)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Source Code Extract
...
// Function to create a new StreamingContext and set it up
def setupContext(): StreamingContext = {
  ...
  // Create a StreamingContext
  val ssc = new StreamingContext(sc, Seconds(batchIntervalSeconds))

  // Create a Kinesis DStream
  val data = KinesisUtils.createStream(ssc,
    kinesisAppName, kinesisStreamName,
    kinesisEndpointUrl, RegionUtils.getRegionByEndpoint(kinesisEndpointUrl).getName(),
    InitialPositionInStream.LATEST, Seconds(kinesisCheckpointIntervalSeconds),
    StorageLevel.MEMORY_AND_DISK_SER_2, awsAccessKeyId, awsSecretKey)
  ...

  ssc.checkpoint(checkpointDir)
  ssc
}

// Get or create a streaming context.
val ssc = StreamingContext.getActiveOrCreate(checkpointDir, setupContext)
ssc.start()
ssc.awaitTermination()
Following zero323's suggestion about posting my comment as an answer:
Increasing spark.streaming.driver.writeAheadLog.batchingTimeout solved the checkpointing timeout issue. We did it after making sure we had room for it, and we have been testing it for a while now, so I only recommend increasing it after careful consideration.
DETAILS
We used these 2 settings in $SPARK_HOME/conf/spark-defaults.conf:
spark.streaming.driver.writeAheadLog.allowBatching true
spark.streaming.driver.writeAheadLog.batchingTimeout 15000
Originally, we only had spark.streaming.driver.writeAheadLog.allowBatching set to true.
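For reference, the equivalent settings applied programmatically through SparkConf would look like this (the app name is a placeholder):

import org.apache.spark.SparkConf

// The same two settings set in code instead of spark-defaults.conf; values are the ones we used.
val conf = new SparkConf()
  .setAppName("kinesis-streaming-app")
  .set("spark.streaming.driver.writeAheadLog.allowBatching", "true")
  .set("spark.streaming.driver.writeAheadLog.batchingTimeout", "15000") // milliseconds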
Before the change, we had reproduced the issue mentioned in the question ("...ReceivedBlockTracker: Exception thrown while writing record...") in a testing environment. It occurred every few hours. After the change, the issue disappeared. We ran it for several days before moving to production.
We had found that the getBatchingTimeout() method of the WriteAheadLogUtils class had a default value of 5000ms, as seen here:
def getBatchingTimeout(conf: SparkConf): Long = {
  conf.getLong(DRIVER_WAL_BATCHING_TIMEOUT_CONF_KEY, defaultValue = 5000)
}