End-to-end exactly-once semantics in Spark Structured Streaming

I am trying to understand whether end-to-end exactly-once semantics is compromised in Spark Structured Streaming in the scenario below.
Scenario: a Structured Streaming job with a Kafka source and a file sink is started. Kafka has 16 partitions and I am reading with 16 executors. I interrupted the job at a moment when a particular batch was incomplete: 8 out of 16 tasks had completed, so 8 output files had been generated. When I run the job again, the batch restarts and reads from the same offset range as the previous incomplete batch, producing 16 output files. The 8 output files from the incomplete batch are now duplicates, and this has been confirmed by data comparison.
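For reference, a minimal sketch of the kind of job described (broker, topic, and paths are placeholders, not taken from the question): a Kafka source written to a file sink with a checkpoint location, which is where Spark records the offset range of each micro-batch.

```scala
import org.apache.spark.sql.SparkSession

object KafkaToFiles {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("kafka-to-files")
      .getOrCreate()

    // Kafka source: 16 topic partitions -> 16 tasks per micro-batch
    val df = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092") // placeholder
      .option("subscribe", "events")                    // placeholder topic
      .load()

    // File sink: the checkpoint location stores the offsets of each batch,
    // the output path receives the data files plus the sink's metadata log.
    val query = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
      .writeStream
      .format("parquet")
      .option("path", "/data/out")                       // placeholder output dir
      .option("checkpointLocation", "/data/checkpoints") // placeholder checkpoint dir
      .start()

    query.awaitTermination()
  }
}
```

As far as I understand, the file sink also keeps a `_spark_metadata` log under the output path listing only the files of committed batches; readers that go through Spark honor that log, so leftover files from an interrupted batch show up as duplicates only when the directory is listed or read directly, bypassing the metadata.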

Regarding streaming end-to-end exactly-once, I recommend reading this post on Flink (a framework similar to Spark).
Briefly: the source/sink state is stored when a checkpoint event occurs.
The rest of this answer is quoted from the Flink post.
So let’s put all of these different pieces together:
Once all of the operators complete their pre-commit, they issue a commit.
If at least one pre-commit fails, all others are aborted, and we roll back to the previous successfully-completed checkpoint.
After a successful pre-commit, the commit must be guaranteed to eventually succeed — both our operators and our external system need to make this guarantee. If a commit fails (for example, due to an intermittent network issue), the entire Flink application fails, the application restarts according to the user’s restart strategy, and there is another commit attempt. This process is critical because if the commit does not eventually succeed, data loss occurs.
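As a rough illustration of that flow, here is a conceptual sketch (the trait and all names are made up for illustration; this is not Flink's or Spark's actual API): pre-commit every participant, commit only if all pre-commits succeeded, otherwise abort and fall back to the last checkpoint.

```scala
// Conceptual sketch of the two-phase commit described above.
trait TwoPhaseParticipant {
  def preCommit(): Boolean // stage side effects durably, but don't publish yet
  def commit(): Unit       // publish; must eventually succeed (retried on failure)
  def abort(): Unit        // discard whatever was staged
}

def checkpoint(participants: Seq[TwoPhaseParticipant]): Boolean = {
  // forall short-circuits on the first failed pre-commit
  val allPrepared = participants.forall(_.preCommit())
  if (allPrepared) {
    // All pre-commits succeeded: commits must now be driven to completion.
    participants.foreach(_.commit())
    true
  } else {
    // At least one pre-commit failed: abort everyone and roll back
    // to the previous successfully-completed checkpoint.
    participants.foreach(_.abort())
    false
  }
}
```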

Related

Spring Batch + Kafka: KafkaItemReader run forever?

I want to make something that monitors a Kafka topic continuously and then executes a batch job when a message comes in (hitting some REST API and storing the response). I set something up with KafkaItemReader; however, it shuts down if it doesn't receive a message for 30 seconds, based on pollTimeout. How can I make it run indefinitely? Since this is not an obvious option, I'm wondering if I am using the right tool for the job.
Likely answer: you are not supposed to do this.
That's correct. Batch processing is about processing finite data sets. If your data source is an infinite stream of records and you want to monitor it continuously, then a streaming solution is more appropriate for your use case.
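For example, a plain Kafka consumer loop (sketch below; broker, group, and topic names are placeholders) keeps polling indefinitely instead of shutting down on a poll timeout the way a batch-oriented reader does:

```scala
import java.time.Duration
import java.util.Properties
import org.apache.kafka.clients.consumer.KafkaConsumer
import scala.jdk.CollectionConverters._

val props = new Properties()
props.put("bootstrap.servers", "broker:9092") // placeholder
props.put("group.id", "topic-monitor")        // placeholder
props.put("key.deserializer",
  "org.apache.kafka.common.serialization.StringDeserializer")
props.put("value.deserializer",
  "org.apache.kafka.common.serialization.StringDeserializer")

val consumer = new KafkaConsumer[String, String](props)
consumer.subscribe(java.util.Collections.singletonList("incoming-topic")) // placeholder topic

// Runs forever: an empty poll simply loops again instead of ending the job.
while (true) {
  val records = consumer.poll(Duration.ofSeconds(1))
  for (record <- records.asScala) {
    // e.g. call the REST API here and store the response
    println(s"offset=${record.offset()} value=${record.value()}")
  }
}
```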

Kafka - different configuration settings

I am going through the documentation, and there seem to be a lot of moving parts with respect to message processing, like exactly-once processing and at-least-once processing, and the settings are scattered here and there. There doesn't seem to be a single place that documents, even roughly, the properties that need to be configured for exactly-once and at-least-once processing.
I know there are many moving parts involved and it always depends. However, as I mentioned before, what are the settings that need to be configured, at a minimum, to provide exactly-once, at-most-once, and at-least-once processing?
You might be interested in the first part of the Kafka FAQ, which describes some approaches to avoiding duplication during data production (i.e. on the producer side):
Exactly-once semantics has two parts: avoiding duplication during data production and avoiding duplicates during data consumption.
There are two approaches to getting exactly-once semantics during data production:
Use a single writer per partition, and every time you get a network error check the last message in that partition to see if your last write succeeded.
Include a primary key (UUID or something) in the message and deduplicate on the consumer.
If you do one of these things, the log that Kafka hosts will be duplicate-free. However, reading without duplicates depends on some co-operation from the consumer too. If the consumer is periodically checkpointing its position, then if it fails and restarts it will restart from the checkpointed position. Thus if the data output and the checkpoint are not written atomically, it will be possible to get duplicates here as well. This problem is particular to your storage system. For example, if you are using a database you could commit these together in a transaction. The HDFS loader Camus that LinkedIn wrote does something like this for Hadoop loads. The other alternative that doesn't require a transaction is to store the offset with the data loaded and deduplicate using the topic/partition/offset combination.
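A minimal sketch of that last alternative, assuming a relational store with a unique key on (topic, partition, offset) so that replayed records become no-ops (the table name, connection string, and SQL dialect are assumptions):

```scala
import java.sql.DriverManager
import org.apache.kafka.clients.consumer.ConsumerRecord

// Assumes a table like:
//   CREATE TABLE events (topic TEXT, partition INT, "offset" BIGINT, payload TEXT,
//                        PRIMARY KEY (topic, partition, "offset"));
// A re-consumed record hits the primary key and is skipped, so replays are idempotent.
def store(record: ConsumerRecord[String, String]): Unit = {
  val conn = DriverManager.getConnection("jdbc:postgresql://db:5432/app") // placeholder
  try {
    val stmt = conn.prepareStatement(
      """INSERT INTO events (topic, partition, "offset", payload)
        |VALUES (?, ?, ?, ?)
        |ON CONFLICT (topic, partition, "offset") DO NOTHING""".stripMargin)
    stmt.setString(1, record.topic())
    stmt.setInt(2, record.partition())
    stmt.setLong(3, record.offset())
    stmt.setString(4, record.value())
    stmt.executeUpdate()
  } finally {
    conn.close()
  }
}
```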

Fault tolerance in Flink file Sink

I am using Flink streaming with Kafka consumer connector (FlinkKafkaConsumer) and file Sink (StreamingFileSink) in a cluster mode with exactly once policy.
The file sink writes the files to the local disk.
I’ve noticed that if a job fails and automatic restart is on, the task managers look for the leftover files from the last failed job (hidden files).
Obviously, since the tasks can be assigned to different task managers, this adds up to more and more failures.
The only solution I have found so far is to delete the hidden files and resubmit the job.
If I understand correctly (and please correct me if I'm wrong), the events in the hidden files were not committed to the bootstrap-server, so there is no data loss.
Is there a way, forcing Flink to ignore the files that were written already? Or maybe there is a better way to implement the solution (maybe somehow with savepoints)?
I got a very detailed answer on the Flink mailing list. TL;DR: in order to implement exactly-once, I have to use some kind of distributed FS.
The full answer:
A local filesystem is not the right choice for what you are trying to achieve. I don't think you can achieve a true exactly once policy in this setup. Let me elaborate on why.
The interesting bit is how it behaves on checkpoints. The behavior is controlled by a RollingPolicy. As you have not said which format you use, let's assume the row format first. For a row format, the default rolling policy (when to change a file from in-progress to pending) is that the file is rolled if it reaches 128 MB, is older than 60 seconds, or has not been written to for 60 seconds. It does not roll on a checkpoint. Moreover, StreamingFileSink considers the filesystem a durable sink that can be accessed after a restore. That implies that it will try to append to this file when restoring from a checkpoint/savepoint.
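For illustration, a sketch of a row-format StreamingFileSink configured with that style of rolling policy, with exactly-once checkpointing enabled and writing to a distributed filesystem (paths and intervals are placeholders):

```scala
import java.util.concurrent.TimeUnit
import org.apache.flink.api.common.serialization.SimpleStringEncoder
import org.apache.flink.core.fs.Path
import org.apache.flink.streaming.api.CheckpointingMode
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink
import org.apache.flink.streaming.api.functions.sink.filesystem.rollingpolicies.DefaultRollingPolicy
import org.apache.flink.streaming.api.scala._

val env = StreamExecutionEnvironment.getExecutionEnvironment
env.enableCheckpointing(60000, CheckpointingMode.EXACTLY_ONCE)

// Row-format sink with a rolling policy like the defaults described above:
// roll at 128 MB, after 60 s, or after 60 s of inactivity.
val sink: StreamingFileSink[String] = StreamingFileSink
  .forRowFormat(new Path("hdfs:///data/out"), new SimpleStringEncoder[String]("UTF-8"))
  .withRollingPolicy(
    DefaultRollingPolicy.builder()
      .withMaxPartSize(128 * 1024 * 1024)
      .withRolloverInterval(TimeUnit.SECONDS.toMillis(60))
      .withInactivityInterval(TimeUnit.SECONDS.toMillis(60))
      .build())
  .build()

// yourStream.addSink(sink)  // attach to the stream that produces the records
```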
Even if you rolled the files on every checkpoint, you might still face the problem of leftovers, because the StreamingFileSink moves the files from pending to complete only after the checkpoint is completed. If a failure happens between finishing the checkpoint and moving the files, it will not be able to move them after a restore (it would do so if it had access).
Lastly a completed checkpoint will contain offsets of records that were processed successfully end-to-end, which means records that are assumed committed by the StreamingFileSink. This can be records written to an in-progress file with a pointer in a StreamingFileSink checkpointed metadata, records in a "pending" file with an entry in a StreamingFileSink checkpointed metadata that this file has been completed or records in "finished" files.[1]
Therefore as you can see there are multiple scenarios when the StreamingFileSink has to access the files after a restart.
The last thing: you mentioned committing to the "bootstrap-server". Bear in mind that Flink does not use offsets committed back to Kafka for guaranteeing consistency. It can write those offsets back, but only for monitoring/debugging purposes. Flink stores/restores the processed offsets from its checkpoints.[3]
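To illustrate that point (broker and topic names are placeholders): with checkpointing enabled, the consumer's offset commit back to Kafka is just an optional monitoring toggle, while recovery uses the offsets stored in the checkpoint.

```scala
import java.util.Properties
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer

val props = new Properties()
props.setProperty("bootstrap.servers", "broker:9092") // placeholder
props.setProperty("group.id", "flink-job")            // placeholder

val kafkaSource = new FlinkKafkaConsumer[String]("events", new SimpleStringSchema(), props)

// With checkpointing enabled, this only controls whether offsets are written
// back to Kafka for monitoring; restores always use the offsets in the checkpoint.
kafkaSource.setCommitOffsetsOnCheckpoints(true)

// env.addSource(kafkaSource)  // attach to the job's execution environment
```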
Let me know if it helped. I tried my best ;) BTW I highly encourage reading the linked sources as they try to describe all that in a more structured way.
I am also cc'ing Kostas, who knows more about the StreamingFileSink than I do, so he can maybe correct me somewhere.
[1] https://ci.apache.org/projects/flink/flink-docs-release-1.10/dev/connectors/streamfile_sink.html
[2] https://ci.apache.org/projects/flink/flink-docs-release-1.10/dev/connectors/kafka.html
[3] https://ci.apache.org/projects/flink/flink-docs-release-1.10/dev/connectors/kafka.html#kafka-consumers-offset-committing-behaviour-configuration

Avoid Data Loss While Processing Messages from Kafka

I'm looking for the best approach for designing my Kafka consumer. Basically, I would like to know the best way to avoid data loss in case there are any exceptions/errors while processing the messages.
My use case is as below.
a) The reason I am using a SERVICE to process the messages is that in the future I am planning to write an ERROR PROCESSOR application which would run at the end of the day and try to process the failed messages again (not all messages, but messages which fail because of dependencies, such as a missing parent).
b) I want to make sure there is zero message loss, so I will save the message to a file in case there are any issues while saving it to the DB.
c) In the production environment there can be multiple instances of the consumer and services running, so there is a high chance that multiple applications try to write to the same file.
Q-1) Is writing to a file the only option to avoid data loss?
Q-2) If it is the only option, how do I make sure multiple applications can write to the same file and read from it at the same time? Please consider that in the future, once the error processor is built, it might be reading messages from the same file while another application is trying to write to it.
ERROR PROCESSOR - Our source follows an event-driven mechanism and there is a high chance that the dependent event (for example, the parent entity for something) sometimes gets delayed by a couple of days. So in that case, I want my ERROR PROCESSOR to process the same messages multiple times.
I've run into something similar before. So, diving straight into your questions:
Not necessarily; you could perhaps send those messages back to Kafka in a new topic (let's say, error-topic). Then, when your error processor is ready, it can just listen to this error-topic and consume those messages as they come in.
I think this question has been addressed in the response to the first one. So, instead of writing to and reading from a file and opening multiple file handles to do this concurrently, Kafka might be a better choice, as it is designed for exactly such problems.
Note: The following point is just some food for thought based on my limited understanding of your problem domain. So, you may just choose to ignore this safely.
One more point worth considering on your design for the service component - You might as well consider merging points 4 and 5 by sending all the error messages back to Kafka. That will enable you to process all error messages in a consistent way as opposed to putting some messages in the error DB and some in Kafka.
EDIT: Based on the additional information on the ERROR PROCESSOR requirement, here's a diagrammatic representation of the solution design.
I've deliberately kept the output of the ERROR PROCESSOR abstract for now just to keep it generic.
I hope this helps!
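A minimal sketch of the error-topic idea from this answer (the topic name and the process function are placeholders): when processing a record fails, publish the original payload to a separate topic for the ERROR PROCESSOR to consume later.

```scala
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

// `producer` is a KafkaProducer[String, String] configured elsewhere;
// `process` stands in for the service call that may fail (e.g. missing parent).
def handle(record: ConsumerRecord[String, String],
           producer: KafkaProducer[String, String],
           process: String => Unit): Unit = {
  try {
    process(record.value())
  } catch {
    case _: Exception =>
      // Send the original payload to the error topic instead of losing it;
      // the end-of-day ERROR PROCESSOR consumes "error-topic" and retries.
      producer.send(new ProducerRecord[String, String]("error-topic", record.key(), record.value()))
  }
}
```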
If you don't commit the consumed message before writing to the database, then nothing would be lost while Kafka retains the message. The tradeoff of that would be that if the consumer did commit to the database, but a Kafka offset commit fails or times out, you'd end up consuming records again and potentially have duplicates being processed in your service.
Even if you did write to a file, you wouldn't be guaranteed ordering unless you opened a file per partition and ensured all consumers only ran on a single machine (because you'd be preserving state there, which isn't fault-tolerant). Deduplication would still need to be handled as well.
Also, rather than writing your own consumer to a database, you could look into the Kafka Connect framework. For validating a message, you could similarly deploy a Kafka Streams application to filter bad messages from an input topic into a separate topic to send to the DB.
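To make the first point concrete, a sketch with auto-commit disabled so the offset is committed only after the database write succeeds; a crash in between re-delivers the records, giving at-least-once delivery, so the write itself should ideally be idempotent (broker, group, topic, and the writeToDb stub are placeholders):

```scala
import java.time.Duration
import java.util.Properties
import org.apache.kafka.clients.consumer.KafkaConsumer
import scala.jdk.CollectionConverters._

def writeToDb(value: String): Unit = () // placeholder: the actual (idempotent) DB insert goes here

val props = new Properties()
props.put("bootstrap.servers", "broker:9092") // placeholder
props.put("group.id", "db-writer")            // placeholder
props.put("enable.auto.commit", "false")      // commit manually, only after the DB write
props.put("key.deserializer",
  "org.apache.kafka.common.serialization.StringDeserializer")
props.put("value.deserializer",
  "org.apache.kafka.common.serialization.StringDeserializer")

val consumer = new KafkaConsumer[String, String](props)
consumer.subscribe(java.util.Collections.singletonList("input-topic")) // placeholder topic

while (true) {
  val records = consumer.poll(Duration.ofSeconds(1))
  if (!records.isEmpty) {
    records.asScala.foreach(r => writeToDb(r.value()))
    consumer.commitSync() // acknowledge only once the batch is safely in the DB
  }
}
```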

Does draining pipelines trigger ack's to PubSub for failing elements?

I have a streaming pipeline in Apache Beam 2.5 that subscribes to a PubSub subscription, parses CSV files received as messages from that subscription, applies some trivial processing to the data, and then stores the results in BigQuery.
Occasionally, the producer of the data sent to the PubSub topic changes the CSV file format (columns are added/removed/renamed) without telling us. When this happens the CSV parsing DoFn (luckily) starts failing and the Pipeline gets stuck retrying processing the element (the pipeline's system lag starts increasing monotonically).
Google's documentation promotes the use of the Drain functionality as the "nice" way to stop a Pipeline, outlining that in this way "in-flight" elements won't risk being lost. Does this mean that all in-flight elements, even the failing ones, will be "committed" when their bundle closes, thus triggering an ACK to PubSub?
In my case, I'd like the failing element NOT to be ACK'ed, so that after the pipeline is fixed, the failing element will be processed when redelivered.
Another answerer linked a question with a popular (though not yet accepted) answer which states that Dataflow will ACK the message once the "results of the bundle (outputs and state mutations etc) have been durably committed" for the bundle the message belongs to (At what stage does Dataflow/Apache Beam ack a pub/sub message?).
It's important to note that Dataflow needs to commit state when there are stateful operations in your pipeline. For example, with windowing, Dataflow needs to stash your data somewhere while it waits for the window to pass, at which point it pulls the state back out and sends it off to the next part of your pipeline.
This behavior actually matches what I've observed using Dataflow in production for a few years now. We used to have a stateless pipeline (no windowing, etc.) and it NACKed messages when exceptions occurred in any part of the pipeline. When we added windowing, we noticed it ACKing the messages even though the window the message belonged to had not yet passed (and nothing had been output at the end of the pipeline into the sink).
Therefore, the situation you're concerned about, where messages are ACKed even though they are "bad", will occur in pipelines that have stateful operations, because a message won't be deemed "bad" by your code until after it has been ACKed so that it can be durably committed. The situation won't occur, and you can safely rely on a NACK for these "bad" messages, if your pipeline has no stateful operations (and all stateless operations finish within the ACK deadline you've configured for your Pub/Sub subscription).
If this is a problem for you, because you have stateful operations in your pipeline, I'd suggest one of two things:
Add validation before the Pub/Sub message is published, such that no "bad" messages will enter your pipeline, or
Break up your pipeline into two pipelines, one stateless and one stateful, such that messages will only be deemed "bad" in the first pipeline, and can be retried later when the pipeline is updated to no longer deem the message "bad" or the message is discarded through other means if it isn't needed
According to some related discussions [1], ACKs only happen when a bundle succeeds. In your case, the bundle already fails, which means it won't succeed before the drain, so I don't think we should expect ACKs.
[1] At what stage does Dataflow/Apache Beam ack a pub/sub message?