Kafka - how to avoid losing data in emergency situations - apache-kafka

Recently, we had a production incident when Kafka consumers were repeatedly processing the same Kafka records again and again, and Kafka was rebalancing all the time. But I do not want to write here about this issue - we resolved it (by lowering the max-poll-records) and it works fine, now.
But the incident made me wonder - could we have lost some messages during this incident?
For instance: The documentation for auto-offset-reset says that this parameter applies "...if an offset is out of range". According to Kafka auto.offset.reset query it may happen e.g. "if the Consumer offset is less than the smallest offset". That is, if we had auto-offset-reset=latest and topic cleanup was triggered during the incident, we could have lost all the unprocessed data in the topic (because the offset would be set to the end of the topic, in this case). Therefore, IMO, it is never a good idea to have auto-offset-reset=latest if you need at-least-once delivery.
Actually, there are plenty of other situations where there is a threat of data loss in Kafka if not everything is set up correctly. For instance:
When the schema registry is not available, messages can get lost:
How to avoid losing messages with Kafka streams
After application restart, unprocessed messages are skipped despite that auto-offset-reset=earliest. We had this problem too in a topic (=not in every topic). Perhaps this is the same case.
etc.
Is there a cook-book how to set everything related to Kafka properly in order to make the application robust (with respect to Kafka) and prevent data loss? We've set up everything we consider important, but I'm not sure that we haven't overlooked something. And I cannot imagine all bad things that are possible in order to prevent them. For instance:
We have Kafka consumers with the same groupId running in different (geographically separated) networks. Does it matter? Nowadays probably not, but in the past probably yes, according to this answer.

Related

Apache Kafka: large retention time vs. fast read of last value

Dear Apache Kafka friends,
I have a use case for which I am looking for an elegant solution:
Data is published in a Kafka-Topic at a relatively high rate. There are two competing requirements
all records should be kept for 7 days (which is configured by min.compaction.lag)
applications should read the "last status" from the topic during their initialization phase
LogCompaction is enabled in order for the "last state" to be available in the topic.
Now comes the problem. If an application wants to initialize itself from the topic, it has to read a lot of records to get the last state for all keys (the entire topic content must be processed). But this is not performant possible with the amount of records.
Idea
A streaming process streams the data of the topic into a corresponding ShortTerm topic which has a much shorter min.compaction.lag time (1 hour). The applications initialize themselves from this topic.
Risk
The streaming process is a potential source of errors. If it temporarily fails, the applications will no longer receive the latest status.
My Question
Are there any other possible solutions to satisfy the two requirements. Did I maybe miss a Kafa concept that helps to handle these competing requirements?
Any contribution is welcome. Thank you all.
If you don't have a strict guarantee how frequently each key will be updated, you cannot do anything else as you proposed.
To avoid the risk that the downstream app does not get new updates (because the data replication jobs stalls), I would recommend to only bootstrap an app from the short term topic, and let it consume from the original topic afterwards. To not miss any updates, you can sync the switch over as follows:
On app startup, get the replication job's committed offsets from the original topic.
Get the short term topic's current end-offsets (because the replication job will continue to write data, you just need a fixed stopping point).
Consume the short term topic from beginning to the captured end offsets.
Resume consuming from the original topic using the captured committed offsets (from step 1) as start point.
This way, you might read some messages twice, but you won't lose any updates.
To me, the two requirements you have mentioned together with the requirement for new consumers are not competing. In fact, I do not see any reason why you should keep a message of an outdated key in your topic for 7 days, because
New consumers are only interested in the latest message of a key.
Already existing consumers will have processed the message within 1 hour (as taken from your comments).
Therefore, my understanding is that your requirement "all records should be kept for 7 days" can be replaced by "each consumer should have enough time to consume the message & the latest message for each key should be kept for 7 days".
Please correct me if I am wrong and explain which consumer actually does need "all records for 7 days".
If that is the case you could do the following:
Enable log compaction as well as time-based retention to 7 days for this topic
Fine-tune the compaction frequency to be very eager, meaning to keep as little as possible outdated messages for a key.
Set min.compaction.lag to 1 hour such that all consumers have the chance to keep up.
That way, new consumers will read (almost) only the latest message for each key. If that is not performant enough, you can try increasing the partitions and consumer threads of your consumer groups.

Kafka: Is it good practice too keep topic offset in database?

I have started learning kafka. I don't have much idea of live project where kafka is used.
Wanted to know if offset can be saved in database apart from committing in broker?
I think it should always be saved otherwise some record will be missed or re-processed.
Taking an example if offset is not saved in database, when application(consumer) is deployed or restarted during that time if some message is sent to broker at that time, that will be missed as when consumer will be up it will read next onward record or(from start)
the short answer to your question is "its complicated" :-)
the long answer to your question is something like:
kafka (without extra configuration and/or careful design of your code) is an at-least-once system (see official documentation). this means that yes, your consumer may see a particular set of records more than once. this wont happen on a graceful shutdown/rebalance, but will definitely happen if your application crashes.
newer versions of kafka support so called "exactly once". this involves configuring your clients differently (and a significant performance and latency hit), and the guarantees only ever hold if all your inputs and outputs are from/to the exact same kafka cluster. so if your consumer does anything like call an external HTTP API or insert into a database in response to seeing a kafka record we are back to at-least-once.
if your outputs go to a transactional system (like a classic ACID database) a common pattern would be to start a transaction, and in that transaction record both your outputs and the consumer offsets (you would also need to change your code to restore from these DB offsets and not the kafka default). this has better guarantees (but still wont help if your code interacts with non-transactional systems, like making an HTTP call)
another common design pattern to overcome at-least-once is to somehow "tag" every operation you do (record you produce, http call you make ...) with some UUID that derives from the original kafka records comsumed to produce this output. this means if your consumer sees the same record again, it will perform the same operations again, and repeat the same "tag" value. this shifts the burden to downstream systems that must now remember (at least for some period of time) all the "tags" they have seen so they could disregard a repeat operation, or somehow design all your operations to be idempotent

dealing with Kafka's exactly once processing edge-cases

Folks,
Trying to do a POC for processing messages using Kafka for an implementation which absolutely requires only once processing. Example: as a payment system, process a credit card transaction only once
What edge cases should we protect against?
One failure scenario covered here is:
1.) If a consumer fails, and does not commit that it has read through a particular offset, the message will be read again.
Lets say consumers live in Kubernetes pods, and one of the hosts goes offline. We will potentially have messages that have been processed, but not marked as processed in Kafka before the pods went away due to underlying hardware issue. Do i understand this error scenario correctly?
Are there other failure scenarios which we need to fully understand on the producer/consumer side when thinking of Kafka doing only-once processing?
Thanks!
im going to basically repeat and exand on an answer i gave here:
a few scenarios can result in duplication:
consumers only periodically checkpoint their positions. a consumer crash can result in duplicate processing of some range or records
producers have client-side timeouts. this means the producer may think a request timed out and re-transmit while broker-side it actually succeeded.
if you mirror data between kafka clusters thats usually done with a producer + consumer pair of some sort that can lead to more duplication.
there are also scenarios that end in data loss - look up "unclean leader election" (disabling that trades with availability).
also - kafka "exactly once" configurations only work if all you inputs, outputs, and side effects happen on the same kafka cluster. which often makes it of limited use in real life.
there are a few kafka features you could try using to reduce the likelihood of this happening to you:
set enable.idempotence to true in your producer configs (see https://kafka.apache.org/documentation/#producerconfigs) - incurs some overhead
use transactions when producing - incurs overhead and adds latency
set transactional.id on the producer in case your fail over across machines - gets complicated to manage at scale
set isolation.level to read_committed on the consumer - adds latency (needs to be done in combination with 2 above)
shorten auto.commit.interval.ms on the consumer - just reduces the window of duplication, doesnt really solve anything. incurs overhead at really low values.
I have to say that as someone who's been maintaining a VERY large kafka installation for the past few years I'd never use a bank that relied on kafka for its core transaction processing though ...

Failed to rebalance error in Kafka Streams with more than one topic partition

Works fine when source topic partition count = 1. If I bump up the partitions to any value > 1, I see the below error. Applicable to both Low level as well as the DSL API. Any pointers ? What could be missing ?
org.apache.kafka.streams.errors.StreamsException: stream-thread [StreamThread-1] Failed to rebalance
at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:410)
at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:242)
Caused by: org.apache.kafka.streams.errors.StreamsException: task [0_1] Store in-memory-avg-store's change log (cpu-streamz-in-memory-avg-store-changelog) does not contain partition 1
at org.apache.kafka.streams.processor.internals.ProcessorStateManager.register(ProcessorStateManager.java:185)
at org.apache.kafka.streams.processor.internals.ProcessorContextImpl.register(ProcessorContextImpl.java:123)
at org.apache.kafka.streams.state.internals.InMemoryKeyValueStoreSupplier$MemoryStore.init(InMemoryKeyValueStoreSupplier.java:102)
at org.apache.kafka.streams.state.internals.InMemoryKeyValueLoggedStore.init(InMemoryKeyValueLoggedStore.java:56)
at org.apache.kafka.streams.state.internals.MeteredKeyValueStore.init(MeteredKeyValueStore.java:85)
at org.apache.kafka.streams.processor.internals.AbstractTask.initializeStateStores(AbstractTask.java:81)
at org.apache.kafka.streams.processor.internals.StreamTask.<init>(StreamTask.java:119)
at org.apache.kafka.streams.processor.internals.StreamThread.createStreamTask(StreamThread.java:633)
at org.apache.kafka.streams.processor.internals.StreamThread.addStreamTasks(StreamThread.java:660)
at org.apache.kafka.streams.processor.internals.StreamThread.access$100(StreamThread.java:69)
at org.apache.kafka.streams.processor.internals.StreamThread$1.onPartitionsAssigned(StreamThread.java:124)
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.onJoinComplete(ConsumerCoordinator.java:228)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.joinGroupIfNeeded(AbstractCoordinator.java:313)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.ensureActiveGroup(AbstractCoordinator.java:277)
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.poll(ConsumerCoordinator.java:259)
at org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:1013)
at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:979)
at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:407)
It's an operational issue. Kafka Streams does not allow to change the number of input topic partitions during its "life time".
If you stop a running Kafka Streams application, change the number of input topic partitions, and restart your app it will break (with the error you see above). It is tricky to fix this for production use cases and it is highly recommended to not change the number of input topic partitions (cf. comment below). For POC/demos it's not difficult to fix though.
In order to fix this, you should reset your application using Kafka's application reset tool:
http://docs.confluent.io/current/streams/developer-guide.html#application-reset-tool
https://www.confluent.io/blog/data-reprocessing-with-kafka-streams-resetting-a-streams-application/
Using the application reset tool, has the disadvantage that you wipe out your whole application state. Thus, in order to get your application into the same state as before, you need to reprocess the whole input topic from beginning. This is of course only possible, if all input data is still available and nothing got deleted by brokers that applying topic retention time/size policy.
Furthermore you should note, that adding partitions to input topics changes the topic's partitioning schema (be default hash-based partitioning by key). Because Kafka Streams assumes that input topics are correctly partitioned by key, if you use the reset tool and reprocess all data, you might get wrong result as "old" data is partitioned differently than "new" data (ie, data written after adding the new partitions). For production use cases, you would need to read all data from your original topic and write it into a new topic (with increased number of partitions) to get your data partitioned correctly (or course, this step might change the ordering of records with different keys -- what should not be an issue usually -- just wanted to mention it). Afterwards you can use the new topic as input topic for your Streams app.
This repartitioning step can also be done easily within you Streams application by using operator through("new_topic_with_more_partitions") directly after reading the original topic and before doing any actual processing.
In general however, it is recommended to over partition your topics for production use cases, such that you will never need to change the number of partitions later on. The overhead of over partitioning is rather small and saves you a lot of hassle later on. This is a general recommendation if you work with Kafka -- it's not limited to Streams use cases.
One more remark:
Some people might suggest to increase the number of partitions of Kafka Streams internal topics manually. First, this would be a hack and is not recommended for certain reasons.
It might be tricky to figure out what the right number is, as it depends on various factors (as it's a Stream's internal implementation detail).
You also face the problem of breaking the partitioning scheme, as described in the paragraph above. Thus, you application most likely ends up in an inconsistent state.
In order to avoid inconsistent application state, Streams does not delete any internal topics or changes the number of partitions of internal topics automatically, but fails with the error message you reported. This ensure, that the user is aware of all implications by doing the "cleanup" manually.
Btw: For upcoming Kafka 0.10.2 this error message got improved: https://github.com/apache/kafka/blob/0.10.2/streams/src/main/java/org/apache/kafka/streams/processor/internals/InternalTopicManager.java#L100-L103

Reliable usage of DirectKafkaAPI

I am pllaned to develop a reliable streamig application based on directkafkaAPI..I will have one producer and another consumer..I wnated to know what is the best approach to achieve the reliability in my consumer?..I can employ two solutions..
Increasing the retention time of messages in Kafka
Using writeahead logs
I am abit confused regarding the usage of writeahead logs in directkafka API as there is no receiver..but in the documentation it indicates..
"Exactly-once semantics: The first approach uses Kafka’s high level API to store consumed offsets in Zookeeper. This is traditionally the way to consume data from Kafka. While this approach (in combination with write ahead logs) can ensure zero data loss (i.e. at-least once semantics), there is a small chance some records may get consumed twice under some failures. "
so I wanted to know what is the best approach..if it suffices to increase the TTL of messages in kafka or I have to also enable write ahead logs..
I guess it would be good practice if I avoid one of the above since the backup data (retentioned messages, checkpoint files) can be lost and then recovery could face failure..
Direct Approach eliminates the duplication of data problem as there is no receiver, and hence no need for Write Ahead Logs. As long as you have sufficient Kafka retention, messages can be recovered from Kafka.
Also, Direct approach by default supports exactly-once message delivery semantics, it does not use Zookeeper. Offsets are tracked by Spark Streaming within its checkpoints.