Fault tolerance in Flink file Sink

I am using Flink streaming with Kafka consumer connector (FlinkKafkaConsumer) and file Sink (StreamingFileSink) in a cluster mode with exactly once policy.
The file sink writes the files to the local disk.
I’ve noticed that if a job fails and automatic restart is on, the task managers look for the leftover files from the last failed job (hidden files).
Since the tasks can be assigned to different task managers, this leads to repeated failures.
The only solution I found so far is to delete the hidden files and resubmit the job.
If I get it right (and please correct me if I am wrong), the events in the hidden files were not committed to the bootstrap-server, so there is no data loss.
Is there a way, forcing Flink to ignore the files that were written already? Or maybe there is a better way to implement the solution (maybe somehow with savepoints)?

I got a very detailed answer on the Flink mailing list. TL;DR: in order to implement exactly once, I have to use some kind of distributed FS.
The full answer:
A local filesystem is not the right choice for what you are trying to achieve. I don't think you can achieve a true exactly once policy in this setup. Let me elaborate on why.
The interesting bit is how it behaves on checkpoints. The behavior is controlled by a RollingPolicy. As you have not said what format you use, let's assume you use row format first. For a row format, the default rolling policy (which decides when to change a file from in-progress to pending) rolls the file if it reaches 128MB, is older than 60 sec, or has not been written to for 60 sec. It does not roll on a checkpoint. Moreover, StreamingFileSink considers the filesystem a durable sink that can be accessed after a restore. That implies that it will try to append to this file when restoring from a checkpoint/savepoint.
Even if you rolled the files on every checkpoint, you still might face the problem of leftovers, because the StreamingFileSink moves the files from pending to complete after the checkpoint is completed. If a failure happens between finishing the checkpoint and moving the files, it will not be able to move them after a restore (it would do so if it had access to them).
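The default row-format rolling rule described above can be written down as a small predicate. This is only a sketch in Python, not Flink's implementation: the function name should_roll and its parameters are hypothetical, and the thresholds are simply the ones quoted in the answer.

```python
# Hypothetical sketch of the default row-format rolling rule quoted above.
# Not Flink code: the names and the signature are made up for illustration.

MAX_PART_SIZE_BYTES = 128 * 1024 * 1024  # 128 MB
ROLLOVER_INTERVAL_S = 60                 # maximum age of a part file
INACTIVITY_INTERVAL_S = 60               # maximum time without a write


def should_roll(size_bytes: int, created_at: float,
                last_write_at: float, now: float) -> bool:
    """Return True if the in-progress file should become pending."""
    too_big = size_bytes >= MAX_PART_SIZE_BYTES
    too_old = now - created_at >= ROLLOVER_INTERVAL_S
    idle = now - last_write_at >= INACTIVITY_INTERVAL_S
    return too_big or too_old or idle
```

Note that a checkpoint is not one of the inputs, which is the point the answer makes: with this policy a file can still be in progress when a checkpoint completes.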
Lastly a completed checkpoint will contain offsets of records that were processed successfully end-to-end, which means records that are assumed committed by the StreamingFileSink. This can be records written to an in-progress file with a pointer in a StreamingFileSink checkpointed metadata, records in a "pending" file with an entry in a StreamingFileSink checkpointed metadata that this file has been completed or records in "finished" files.[1]
Therefore as you can see there are multiple scenarios when the StreamingFileSink has to access the files after a restart.
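To make the restart scenarios concrete, here is a minimal sketch of the two file operations a sink of this kind needs after a restore: cutting an in-progress file back to the length recorded in the checkpoint, and moving a pending file to its finished name. It is written in Python for brevity and is not Flink's implementation; the function names, the file names, and the assumption that the checkpointed "pointer" is a byte length are all illustrative.

```python
import os


def restore_in_progress(path: str, checkpointed_length: int) -> None:
    """Discard bytes written after the last completed checkpoint."""
    # Opening fails with FileNotFoundError if the file is not on this machine.
    with open(path, "r+b") as f:
        f.truncate(checkpointed_length)


def commit_pending(pending_path: str, final_path: str) -> None:
    """Move a pending file to its finished name.

    Safe to repeat after a restart: if the move already happened,
    the call does nothing.
    """
    if os.path.exists(pending_path):
        os.replace(pending_path, final_path)
    elif not os.path.exists(final_path):
        # Neither file is visible, e.g. the task restarted on another host.
        raise FileNotFoundError(pending_path)
```

If the restarted task lands on a machine whose local disk does not hold these files, both calls fail, which matches the repeated failures described in the question.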
One last thing: you mentioned "committing to the bootstrap-server". Bear in mind that Flink does not use offsets committed back to Kafka for guaranteeing consistency. It can write those offsets back, but only for monitoring/debugging purposes. Flink stores/restores the processed offsets from its checkpoints.[2]
Let me know if it helped. I tried my best ;) BTW I highly encourage reading the linked sources as they try to describe all that in a more structured way.
I am also cc'ing Kostas, who knows more about the StreamingFileSink than I do, so he can maybe correct me somewhere.
[1] https://ci.apache.org/projects/flink/flink-docs-release-1.10/dev/connectors/streamfile_sink.html
[2] https://ci.apache.org/projects/flink/flink-docs-release-1.10/dev/connectors/kafka.html


Minimizing failure without impacting recovery when building processes on top of Kafka

I am working with a microservice that consumes messages from Kafka. It does some processing on the message and then inserts the result in a database. Only then am I acknowledging the message with Kafka.
It is required that I keep data loss to an absolute minimum while keeping recovery quick (avoiding reprocessing messages, because that is expensive).
I realized that if there was to be some kind of failure, like my microservice would crash, my messages would be reprocessed. So I thought to add some kind of 'checkpoint' to my process by writing the state of the transformed message to the file and reading from it after a failure. I thought this would mean that I could move my Kafka commit to an earlier stage, only after writing to the file is successful.
But then, upon further thinking, I realized that if there was to be a failure on the file system, I might not find my files; e.g. a cloud file service still has a chance of failure even if the marketed availability is >99%. I might end up in an inconsistent state where I have data in my Kafka topic (which is inaccessible because the Kafka offset has been committed) but I have lost my file on the file system. This made me realize that I should send the Kafka commit at a later stage.
So now, considering the above two design decisions, it feels like there is a tradeoff between not missing data and minimizing time to recover from failure. Am I being unrealistic in my concerns? Is there some design pattern that I can follow to minimize the tradeoffs? How do I reason about this situation? Here I thought that maybe the Saga pattern is appropriate, but am I overcomplicating things?
If you are that concerned about reprocessing data, you could always follow the paradigm of storing the offsets outside of Kafka.
For example, in your consumer-worker reading loop:
MessageAndOffset = getMsg();
//do your things
saveOffsetInQueueToDB(offset);
saveOffsetInQueueToDB is responsible for adding the offset to a Queue/List, or whatever. This operation is only done once the message has been correctly processed.
Periodically, when a certain number of offsets are stored, or when shutdown is captured, you could implement another function that stores the offsets for each topic/partition in:
An external database.
An external SLA-backed storage system, such as S3 or Azure Blobs.
Internal (disk) and remote loggers.
If you are concerned about failures, you could use a combination of two of those three options (or even use all three).
Storing these in a "memory buffer" allows the operation to be async, so there's no need for a new transfer/connection to the database/datalake/log for each processed message.
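The buffering idea could look roughly like the sketch below. It is an illustration under assumptions, not part of any Kafka client: the class OffsetBuffer, its methods, and the flush callback are hypothetical names, and instead of a full list it keeps only the highest offset per topic/partition, which assumes messages within a partition are processed in order.

```python
from typing import Callable, Dict, Tuple

TopicPartition = Tuple[str, int]


class OffsetBuffer:
    """Collects processed offsets in memory and flushes them in batches."""

    def __init__(self, flush: Callable[[Dict[TopicPartition, int]], None],
                 flush_every: int = 100) -> None:
        self._flush = flush              # e.g. writes to a database or blob store
        self._flush_every = flush_every  # number of messages between flushes
        self._latest: Dict[TopicPartition, int] = {}
        self._pending = 0

    def record(self, topic: str, partition: int, offset: int) -> None:
        """Call only after the message has been processed successfully."""
        key = (topic, partition)
        self._latest[key] = max(offset, self._latest.get(key, -1))
        self._pending += 1
        if self._pending >= self._flush_every:
            self.flush()

    def flush(self) -> None:
        """Persist the current offsets; also call this on shutdown."""
        if self._pending == 0:
            return
        self._flush(dict(self._latest))
        self._pending = 0
```

Offsets recorded since the last flush are lost on a crash, so up to flush_every messages may be reprocessed; a smaller value trades throughput for less rework.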
If there's a crash, you could read all messages from the beginning (easiest way is just changing the group.id and setting from beginning) but discarding those whose offset is included in the database, avoiding the reprocess. For example by adding a condition in your loop (yep pseudocode again):
MessageAndOffset = getMsg();
if (offset.notIncluded(offsetListFromDB))
//do your things
You could implement a more performant algorithm than a "non-included" check by storing only the last read offset for each partition in a HashMap and then checking whether each incoming message's offset is bigger than the stored one for its partition. For example, partition 0's last offset was 558 and partition 1's was 600:
//offsetMap = {[0,558],[1,600]}
MessageAndOffset = getMsg();
//get partition => 0
if (offset > offsetMap.get(partition))
//do your things
This way, you guarantee that only the non-processed messages from each partition will be processed.
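A runnable version of the per-partition check, as a sketch: messages are modelled here as plain (partition, offset, payload) tuples, which is an assumption made for illustration rather than any client library's message type, and the function name unprocessed is hypothetical.

```python
from typing import Dict, Iterable, Iterator, Tuple

Message = Tuple[int, int, str]  # (partition, offset, payload)


def unprocessed(messages: Iterable[Message],
                last_offsets: Dict[int, int]) -> Iterator[Message]:
    """Yield only messages whose offset is beyond the stored one."""
    for partition, offset, payload in messages:
        # Partitions with no stored offset are treated as never processed.
        if offset > last_offsets.get(partition, -1):
            yield partition, offset, payload
```

With offset_map = {0: 558, 1: 600}, a replay from the beginning skips everything up to and including offset 558 on partition 0 and offset 600 on partition 1.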
Regarding file system failures, that is why Kafka runs as a cluster: fault tolerance in Kafka comes from copying the partition data to other brokers, which are known as replicas.
So if you have 5 brokers and a replication factor of 5, for example, all 5 brokers must fail at the same time (assuming the brokers are on separate hosts) before any data is lost. Even 4 brokers could fail at the same time without losing any data.
With that replication factor, every broker holds a copy of every partition. If a filesystem error occurs on one of the brokers, the others still hold all the information.
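How many simultaneous broker failures are survivable depends on the replication factor, not on the broker count alone. The toy model below illustrates that. It is a deliberate simplification: the round-robin placement in assign_replicas is an assumption and not Kafka's actual assignment algorithm, and both function names are hypothetical.

```python
from typing import Dict, List, Set


def assign_replicas(num_partitions: int, brokers: List[int],
                    replication_factor: int) -> Dict[int, Set[int]]:
    """Place each partition on `replication_factor` consecutive brokers."""
    n = len(brokers)
    return {
        p: {brokers[(p + i) % n] for i in range(replication_factor)}
        for p in range(num_partitions)
    }


def survives(assignment: Dict[int, Set[int]], failed: Set[int]) -> bool:
    """True if every partition still has at least one live replica."""
    return all(replicas - failed for replicas in assignment.values())
```

In this model, 5 brokers with replication factor 5 survive any 4 failures, while replication factor 3 on the same 5 brokers only guarantees survival of any 2.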

Apache Samza flush table update to changelog immediately

If I specify a changelog backing for a RocksDB Table in Samza, is there a configuration to change the async write time to the changelog? I want to reduce it to a shorter time. I cannot see anything in the Config reference.
The scenario I want is to write to a changelog from a stream after bridging a legacy JMS connection. This legacy connection provides partial updates, and I want to merge the partial updates into a fuller message, building a cache of these messages in the Samza streaming application and writing them down to a changelog.
If I use a changelog configured with stores.store-name.changelog, then changes I make to the Samza API Table are eventually written to the changelog, but not quickly enough for my needs, so I want to configure the maximum wait time before they propagate to the changelog.
Alternatively, it seems that using withSideInputs to bootstrap my table each time and then using sendTo will update faster, and I can keep a LocalStore to read and write the cache too and always have the changelog as the golden source.
The reason I want the changelog to write quickly too is because other applications are reading from this changelog.
Yes, you can configure how often changes are committed to the changelog using the config:
Writes to the store will then be flushed when the commit happens:
profileTable.put(message.key, message.value)
A note on this: higher volumes of input appear to result in changes going to the changelog topic before this commit interval (in milliseconds) elapses. Also be careful not to set it too low, as that will slow down overall throughput massively at higher volumes.
You can also use the low-level API to commit a particular stream task: the TaskCoordinator provides a commit API for committing manually.

Kafka - different configuration settings

I am going through the documentation, and there seem to be a lot of moving parts with respect to message processing, like exactly-once processing and at-least-once processing. The settings are scattered here and there. There doesn't seem to be a single place that documents the properties that need to be configured, roughly, for exactly-once processing and at-least-once processing.
I know there are many moving parts involved and it always depends. However, as I mentioned before, what are the minimum settings to configure in order to provide exactly-once, at-most-once, and at-least-once processing?
You might be interested in the first part of Kafka FAQ that describes some approaches on how to avoid duplication on data production (i.e. on producer side):
Exactly once semantics has two parts: avoiding duplication during data
production and avoiding duplicates during data consumption.
There are two approaches to getting exactly once semantics during data
production:
1. Use a single-writer per partition and every time you get a network
error check the last message in that partition to see if your last
write succeeded
2. Include a primary key (UUID or something) in the
message and deduplicate on the consumer.
If you do one of these things, the log that Kafka hosts will be
duplicate-free. However, reading without duplicates depends on some
co-operation from the consumer too. If the consumer is periodically
checkpointing its position then if it fails and restarts it will
restart from the checkpointed position. Thus if the data output and
the checkpoint are not written atomically it will be possible to get
duplicates here as well. This problem is particular to your storage
system. For example, if you are using a database you could commit
these together in a transaction. The HDFS loader Camus that LinkedIn
wrote does something like this for Hadoop loads. The other alternative
that doesn't require a transaction is to store the offset with the
data loaded and deduplicate using the topic/partition/offset
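The two consumer-side options in the quote (committing output and position together in a transaction, and deduplicating on topic/partition/offset) can be sketched with a local SQLite database. This is a minimal illustration under assumptions, not the method used by Camus or any Kafka client: the table layout and the names open_store, save, and resume_offset are made up for the example.

```python
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS results ("
        "topic TEXT, partition_id INTEGER, msg_offset INTEGER, payload TEXT, "
        "PRIMARY KEY (topic, partition_id, msg_offset))")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS offsets ("
        "topic TEXT, partition_id INTEGER, next_offset INTEGER, "
        "PRIMARY KEY (topic, partition_id))")
    conn.commit()
    return conn


def resume_offset(conn: sqlite3.Connection, topic: str, partition_id: int) -> int:
    """Offset to restart from; 0 if nothing has been stored yet."""
    row = conn.execute(
        "SELECT next_offset FROM offsets WHERE topic = ? AND partition_id = ?",
        (topic, partition_id)).fetchone()
    return row[0] if row else 0


def save(conn: sqlite3.Connection, topic: str, partition_id: int,
         msg_offset: int, payload: str) -> None:
    """Store the output and the new position in one transaction."""
    with conn:  # commits both statements, or rolls both back on error
        # The primary key makes a redelivered message a no-op.
        conn.execute("INSERT OR IGNORE INTO results VALUES (?, ?, ?, ?)",
                     (topic, partition_id, msg_offset, payload))
        next_offset = max(resume_offset(conn, topic, partition_id), msg_offset + 1)
        conn.execute("INSERT OR REPLACE INTO offsets VALUES (?, ?, ?)",
                     (topic, partition_id, next_offset))
```

After a crash, the consumer would seek to resume_offset(...) for each partition instead of relying on the position committed to Kafka.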

End-to-end exactly once semantics in spark structured streaming

I am trying to understand if end-to-end exactly once semantics is compromised in spark structured streaming in the below scenario.
Scenario: A structured streaming job with a Kafka source and a file sink is started. Kafka has 16 partitions and I am reading with 16 executors. I interrupted the job at a moment when a particular batch was incomplete: 8 out of 16 tasks had completed and 8 output files had been generated. If I run the job again, the batch starts over and reads the data from the same offset range as the previous incomplete batch, producing 16 output files. The 8 output files of the incomplete batch now result in duplicates, which I have confirmed by comparing the data.
Regarding end-to-end exactly-once streaming, I recommend reading this post on Flink (a framework similar to Spark).
Briefly: store the source and sink state when a checkpoint event occurs.
The rest of this answer is from the Flink post.
So let’s put all of these different pieces together:
Once all of the operators complete their pre-commit, they issue a commit.
If at least one pre-commit fails, all others are aborted, and we roll back to the previous successfully-completed checkpoint.
After a successful pre-commit, the commit must be guaranteed to eventually succeed — both our operators and our external system need to make this guarantee. If a commit fails (for example, due to an intermittent network issue), the entire Flink application fails, the application restarts according to the user’s restart strategy, and there is another commit attempt. This process is critical because if the commit does not eventually succeed, data loss occurs.
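The pre-commit and commit steps above follow the shape of a two-phase commit. The sketch below shows that shape only; it is not Flink's implementation, and the Participant class and run_checkpoint function are hypothetical names introduced for illustration.

```python
from typing import List


class Participant:
    """Stands in for an operator or external system taking part in a checkpoint."""

    def __init__(self, name: str, fail_pre_commit: bool = False) -> None:
        self.name = name
        self.fail_pre_commit = fail_pre_commit
        self.state = "idle"

    def pre_commit(self) -> None:
        if self.fail_pre_commit:
            raise RuntimeError(f"{self.name}: pre-commit failed")
        self.state = "pre-committed"

    def commit(self) -> None:
        self.state = "committed"

    def abort(self) -> None:
        self.state = "aborted"


def run_checkpoint(participants: List[Participant]) -> bool:
    """Commit only if every participant pre-commits; otherwise abort all."""
    try:
        for p in participants:
            p.pre_commit()
    except RuntimeError:
        # One failure aborts everyone; the caller rolls back to the
        # previous successfully completed checkpoint.
        for p in participants:
            p.abort()
        return False
    for p in participants:
        p.commit()
    return True
```

The sketch leaves out the retry-after-restart behaviour described in the last paragraph of the post: in a real system a failed commit has to be attempted again until it succeeds.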

Clean up changelog topic backing session windows

We are aggregating in session windows by using the following code:
.aggregate(..., ..., ...)
The state store that is created for us automatically is backed by a changelog topic with cleanup.policy=compact.
When redeploying our topology, we found that restoring the state store took much longer than expected (10+ minutes). The explanation seems to be that even though a session has been closed, it is still present in the changelog topic.
We noticed that session windows have a default maintain duration of one day but even after the inactivity + maintain durations have been exceeded, it does not look like messages are removed from the changelog topic.
a) Do we need to manually delete "old" (by our definition) messages to keep the size of the changelog topic under control? (This may be the case as hinted to by [1].)
b) Would it be possible to somehow have the changelog topic created with cleanup.policy=compact,delete and would that even make sense?
[1] A session store seems to be created internally by Kafka Stream's UnwindowedChangelogTopicConfig (and not WindowedChangelogTopicConfig) which may make this comment from Kafka Streams - reducing the memory footprint for large state stores relevant: "For non-windowed store, there is no retention policy. The underlying topic is compacted only. Thus, if you know, that you don't need a record anymore, you would need to delete it via a tombstone. But it's a little tricky to achieve... – Matthias J. Sax Jun 27 '17 at 22:07"
You are hitting a bug. I just created a ticket for this: https://issues.apache.org/jira/browse/KAFKA-7101.
I would recommend that you modify the topic config manually to fix the issue in your deployment.