How to make restart-able producer? - apache-kafka

Recent versions of Kafka support exactly-once semantics (EoS). To support this, Kafka writes extra control records (transaction markers) to the log. As a result, if you print the offsets of the messages at your consumer, they won't necessarily be sequential. This makes it harder to poll a topic and read the last committed message.
In my case, consumer printed something like this
Offset-0 0
Offset-2 1
Offset-4 2
Problem: To write a restartable producer, I poll the topic and read the content of the last message. In this case, the last message would be at offset 5, which is not a valid consumer record. Hence, I see errors in my code.
I can use the solution provided at: Getting the last message sent to a kafka topic. The only difference is that instead of using consumer.seek(partition, last_offset - 1), I would use consumer.seek(partition, last_offset - 2). This resolves my issue immediately, but it's not an ideal solution.
What would be the most reliable way to get the last committed message for a consumer written in Java? OR
Is it possible to use a local state store for a partition? OR
What is the recommended way to store the last message so that it survives a producer failure? OR
Are Kafka connectors restartable? Is there any specific API that I can use to make producers restartable?
FYI: I am not looking for a quick fix.

In my case, multiple producers push data to one big topic, so reading the entire topic would be a nightmare.
The solution I found is to maintain another topic, e.g. "P1_Track", where the producer can store metadata. Within a single transaction, the producer sends data to both the big topic and P1_Track.
When I restart the producer, it reads P1_Track and figures out where to start from.
I am also thinking about storing the last committed message in a database and using it when the producer process restarts.

Related

What happens to the kafka messages if the microservice crashes before kafka commit?

I am new to Kafka. I have a Java microservice using Kafka Streams that consumes and processes messages from a Kafka topic written by a producer. The commit interval has been set using auto.commit.interval.ms. My question is: if the microservice crashes before a commit, what happens to the messages that were processed but not committed? Will there be duplicate records? And if so, how can this duplication be resolved?
Kafka has exactly-once-semantics which guarantees the records will get processed only once. Take a look at this section of Spring Kafka's docs for more details on the Spring support for that. Also, see this section for the support for transactions.
Kafka provides various delivery semantics. Which one you choose depends on the use case you are implementing.
If your concern is that the consumer service must not lose messages, you should go with at-least-once delivery semantics.
Now answering your question on the basis of at-least-once delivery semantics:
If your consumer service crashes before committing the offset of a Kafka message, the message is streamed again once the consumer service is up and running. This is because the offset for the partition was not committed. The offset for a partition is committed once the message has been processed by the consumer. In simple words, a committed offset says that the message has been processed, and Kafka will not send the committed message again for the same partition.
At-least-once delivery semantics are usually good enough for use cases where data duplication is not a big issue or where deduplication is possible on the consumer side. For example, with a unique key in each message, a duplicate message can be rejected when it is written to the database.
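To illustrate the unique-key idea, here is a minimal sketch in plain Java. The class IdempotentSink and its in-memory map are hypothetical stand-ins for a database table with a unique constraint on the message id; none of this is Kafka API.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory stand-in for a database table keyed by a unique message id.
class IdempotentSink {
    private final Map<String, String> rows = new HashMap<>();

    // Returns true if the row was written, false if the id was already present.
    boolean write(String messageId, String payload) {
        // putIfAbsent returns null only when no row existed for this id.
        return rows.putIfAbsent(messageId, payload) == null;
    }

    int size() {
        return rows.size();
    }
}

class DedupDemo {
    public static void main(String[] args) {
        IdempotentSink sink = new IdempotentSink();
        System.out.println(sink.write("order-1", "created")); // true
        System.out.println(sink.write("order-2", "created")); // true
        // Same message redelivered after a crash: rejected as a duplicate.
        System.out.println(sink.write("order-1", "created")); // false
        System.out.println(sink.size());                      // 2
    }
}
```

With a real database the same effect would typically come from a primary key or unique index, with the duplicate insert being rejected or ignored.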
There are three main types of delivery semantics:
At most once:
Offsets are committed as soon as the message is received by the consumer.
This is risky: if the processing goes wrong, the message is lost.
At least once:
Offsets are committed after the message is processed, so this is usually the preferred option.
If the processing goes wrong, the message is read again because its offset has not been committed.
The drawback is that a message may be processed more than once, so make sure your processing is idempotent. (Yes, your application has to handle duplicates; Kafka won't help here.)
Idempotent means that processing the same message again does not affect your system.
Exactly once:
Can be achieved for Kafka-to-Kafka communication using the Kafka Streams API.
It's not your case.
You can choose semantics from above as per your requirement.
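The difference between the first two semantics can be shown with a small simulation. This is only a sketch in plain Java: the list stands in for a partition, the integer `committed` for the committed offset, and the single simulated crash at `crashAt` is an assumption made for the example. No Kafka client is involved.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class DeliverySimulation {
    // Reads `log` from offset 0 and crashes once while handling offset `crashAt`,
    // then restarts from the last committed offset.
    // commitFirst = true  -> commit before processing (at most once)
    // commitFirst = false -> commit after processing  (at least once)
    static List<String> run(List<String> log, int crashAt, boolean commitFirst) {
        List<String> processed = new ArrayList<>();
        int committed = 0;       // next offset to read after a restart
        int position = 0;        // current read position
        boolean crashed = false; // the crash happens only once
        while (position < log.size()) {
            String message = log.get(position);
            boolean crashNow = position == crashAt && !crashed;
            if (commitFirst) {
                committed = position + 1;
                if (crashNow) {          // crash after commit, before processing
                    crashed = true;
                    position = committed;
                    continue;
                }
                processed.add(message);
            } else {
                processed.add(message);
                if (crashNow) {          // crash after processing, before commit
                    crashed = true;
                    position = committed;
                    continue;
                }
                committed = position + 1;
            }
            position++;
        }
        return processed;
    }

    public static void main(String[] args) {
        List<String> log = Arrays.asList("a", "b", "c");
        System.out.println(run(log, 1, true));  // [a, c]       "b" is lost
        System.out.println(run(log, 1, false)); // [a, b, b, c] "b" is processed twice
    }
}
```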

Removing one message from a topic in Kafka

I'm new at using Kafka and I have one question. Can I delete only ONE message from a topic if I know the topic, the offset and the partition? And if not is there any alternative?
It is not possible to remove a single message from a Kafka topic, even if you know its partition and offset.
Keep in mind that Kafka is not a key/value store; a topic is an append-only(!) log that represents a stream of data.
If you are looking for alternatives to removing a single message, you may:
Have your consumer clients ignore that message
Enable log compaction and send a tombstone message (a message with the same key and a null value)
Write a simple job (KafkaStreams) to consume the data, filter out that one message and produce all messages to a new topic.
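To make the compaction option more concrete, here is a rough sketch of what a fully compacted topic looks like to a reader. It is a plain-Java model, not the broker's actual algorithm: the class CompactionSketch and the {key, value} array layout are made up for the example. Note that a tombstone removes every earlier record with that key, so this only amounts to deleting "one message" if the key is unique to that message.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class CompactionSketch {
    // Each record is {key, value}; a null value is a tombstone for that key.
    // Returns the latest value per key, which is what survives compaction.
    static Map<String, String> compact(String[][] records) {
        Map<String, String> latest = new LinkedHashMap<>();
        for (String[] record : records) {
            String key = record[0];
            String value = record[1];
            if (value == null) {
                latest.remove(key);     // tombstone: drop the key entirely
            } else {
                latest.put(key, value); // newer value replaces the older one
            }
        }
        return latest;
    }

    public static void main(String[] args) {
        String[][] records = {
            {"k1", "v1"},
            {"k2", "v2"},
            {"k1", "v3"},
            {"k2", null}, // tombstone for k2
        };
        System.out.println(compact(records)); // {k1=v3}
    }
}
```

On a real broker, compaction runs in the background and tombstones themselves are kept for a while before being removed, so readers may still see older records until the log cleaner has run.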

Get latest values from a topic on consumer start, then continue normally

We have a Kafka producer that produces keyed messages at a very high frequency to topics whose retention time is 10 hours. These messages are real-time updates, and the key is the ID of the element whose value has changed. So the topic acts as a changelog and will have many duplicate keys.
Now, what we're trying to achieve is that when a Kafka consumer launches, regardless of the last known state (new consumer, crashed, restart, etc..), it will somehow construct a table with the latest values of all the keys in a topic, and then keeps listening for new updates as normal, keeping the minimum load on Kafka server and letting the consumer do most of the job. We tried many ways and none of them seems the best.
What we tried:
1 changelog topic + 1 compact topic:
The producer sends the same message to both topics, wrapped in a transaction to ensure that both sends succeed.
Consumer launches and requests the latest offset of the changelog topic.
Consumes the compacted topic from beginning to construct the table.
Continues consuming the changelog since the requested offset.
Cons:
Duplicates in the compacted topic are very likely, even with log compaction configured to run as frequently as possible.
Twice the number of topics on the Kafka server.
KSQL:
With KSQL we either have to write a KTable back out as a topic so that consumers can see it (extra topics), or consumers need to run a KSQL SELECT against the KSQL REST server to query the table (not as fast and performant as the Kafka APIs).
Kafka Consumer API:
Consumer starts and consumes the topic from beginning. This worked perfectly, but the consumer has to consume the 10 hours change log to construct the last values table.
Kafka Streams:
By using KTables as following:
KTable<Integer, MarketData> tableFromTopic = streamsBuilder.table("topic_name", Consumed.with(Serdes.Integer(), customSerde));
KTable<Integer, MarketData> filteredTable = tableFromTopic.filter((key, value) -> keys.contains(value.getRiskFactorId()));
Kafka Streams will create 1 topic on the Kafka server per KTable (named {consumer_app_id}-{topic_name}-STATE-STORE-0000000000-changelog), which will result in a huge number of topics since we have a big number of consumers.
From what we have tried, it looks like we need to either increase the server load, or the consumer launch time. Isn't there a "perfect" way to achieve what we're trying to do?
Thanks in advance.
By using KTables, Kafka Streams will create 1 topic on the Kafka server per KTable, which will result in a huge number of topics since we have a big number of consumers.
If you are just reading an existing topic into a KTable (via StreamsBuilder#table()), then no extra topics are being created by Kafka Streams. Same for KSQL.
It would help if you could clarify what exactly you want to do with the KTable(s). Apparently you are doing something that does result in additional topics being created?
1 changelog topic + 1 compact topic:
Why were you thinking about having two separate topics? Normally, changelog topics should always be compacted. And given your use case description, I don't see a reason why it should not be:
Now, what we're trying to achieve is that when a Kafka consumer launches, regardless of the last known state (new consumer, crashed, restart, etc..), it will somehow construct a table with the latest values of all the keys in a topic, and then keeps listening for new updates as normal [...]
Hence compaction would be very useful for your use case. It would also prevent this problem you described:
Consumer starts and consumes the topic from beginning. This worked perfectly, but the consumer has to consume the 10 hours change log to construct the last values table.
Note that, to reconstruct the latest table values, all three of Kafka Streams, KSQL, and the Kafka Consumer must read the table's underlying topic completely (from beginning to end). If that topic is NOT compacted, this might indeed take a long time depending on the data volume, topic retention settings, etc.
From what we have tried, it looks like we need to either increase the server load, or the consumer launch time. Isn't there a "perfect" way to achieve what we're trying to do?
Without knowing more about your use case, particularly what you want to do with the KTable(s) once they are populated, my answer would be:
Make sure the "changelog topic" is also compacted.
Try KSQL first. If this doesn't satisfy your needs, try Kafka Streams. If this doesn't satisfy your needs, try the Kafka Consumer.
For example, I wouldn't use the Kafka Consumer if it is supposed to do any stateful processing with the "table" data, because the Kafka Consumer lacks built-in functionality for fault-tolerant stateful processing.
Consumer starts and consumes the topic from beginning. This worked
perfectly, but the consumer has to consume the 10 hours change log to
construct the last values table.
The first time your application starts up, what you said is correct.
To avoid this on every restart, store the key-value data in a file.
For example, you might want to use a persistent map (like MapDB).
Since you give the consumer a group.id and commit the offset either periodically or after each record is stored in the map, the next time your application restarts it will resume reading from the last committed offset for that group.id.
So the long loading time occurs only the first time. As long as you have the file, you don't need to consume from the beginning.
If the file is missing or has been deleted, just call seekToBeginning on the KafkaConsumer and build it again.
You need to store these key-value pairs somewhere for retrieval anyway, so why shouldn't it be a persistent store?
If you want to use Kafka Streams for whatever reason, then an alternative (not as simple as the above) is to use a persistent store.
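A minimal sketch of the file-backed idea, using only the JDK. It deliberately differs from the description above in one respect: instead of relying on the offset committed for the group.id, it stores the resume offset in the same file as the values, which is a simplification I swapped in for the example. The class LatestValueStore, the reserved key __next_offset and the String[][] changelog standing in for the topic are all hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Properties;

class LatestValueStore {
    // Reserved name for the resume position; assumes no record key equals it.
    private static final String OFFSET_KEY = "__next_offset";
    private final Properties table = new Properties();
    private final Path file;

    // Loads the snapshot if the file exists; otherwise starts empty at offset 0.
    LatestValueStore(Path file) {
        this.file = file;
        if (Files.exists(file)) {
            try (InputStream in = Files.newInputStream(file)) {
                table.load(in);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }

    long nextOffset() {
        return Long.parseLong(table.getProperty(OFFSET_KEY, "0"));
    }

    // Keeps only the latest value per key and remembers where to resume.
    void apply(long offset, String key, String value) {
        table.setProperty(key, value);
        table.setProperty(OFFSET_KEY, Long.toString(offset + 1));
    }

    String get(String key) {
        return table.getProperty(key);
    }

    // Writes values and offset together; not crash-safe (no atomic rename).
    void flush() {
        try (OutputStream out = Files.newOutputStream(file)) {
            table.store(out, "latest values and resume offset");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}

class LatestValueStoreDemo {
    public static void main(String[] args) {
        Path file = Paths.get(System.getProperty("java.io.tmpdir"),
                "latest-values-" + System.nanoTime() + ".properties");
        String[][] changelog = { {"id1", "10"}, {"id2", "20"}, {"id1", "11"}, {"id2", "21"} };

        // First launch: no file, so start from the beginning; stop after two records.
        LatestValueStore firstRun = new LatestValueStore(file);
        for (int offset = (int) firstRun.nextOffset(); offset < 2; offset++) {
            firstRun.apply(offset, changelog[offset][0], changelog[offset][1]);
        }
        firstRun.flush();

        // Restart: the table and the resume offset come back from the file.
        LatestValueStore secondRun = new LatestValueStore(file);
        System.out.println(secondRun.nextOffset()); // 2
        for (int offset = (int) secondRun.nextOffset(); offset < changelog.length; offset++) {
            secondRun.apply(offset, changelog[offset][0], changelog[offset][1]);
        }
        System.out.println(secondRun.get("id1") + " " + secondRun.get("id2")); // 11 21
        file.toFile().delete();
    }
}
```

With a real consumer, nextOffset() would be the position passed to seek for the partition, and a missing file would correspond to the seekToBeginning case.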
For example, a persistent global store.
streamsBuilder.addGlobalStore(Stores.keyValueStoreBuilder(Stores.persistentKeyValueStore(topic), keySerde, valueSerde), topic, Consumed.with(keySerde, valueSerde), this::updateValue);
P.S.: There will be a file called .checkpoint in the directory, which stores the offsets. If the topic is deleted in the meantime, you get an OffsetOutOfRangeException. You may want to guard against this, perhaps by using an UncaughtExceptionHandler.
Refer to https://stackoverflow.com/a/57301986/2534090 for more.
Finally,
For this use case, it is better to use a Consumer with a persistent file rather than Streams, because it is simpler.

Kafka stream application not consume data after restart

After I restarted our Kafka cluster, my Kafka Streams application stopped receiving messages from the input topic and I got a "can't create internal topic" exception. After some research, I reset the application and the input topic with the Kafka tool kafka-streams-application-reset.sh.
Unfortunately, it didn't resolve the problem and I got the exception again.
From the error message, you can infer that the topic already exists and thus cannot be created. The reason for the failure is that the existing topic does not have the expected number of partitions (it has 1 instead of 150); if the number of partitions matched, Kafka Streams would just use the existing topic.
This can happen if you have topic auto-creation enabled at the brokers (and the topic was created with the wrong number of partitions), or if the number of partitions of your input topic changed. Kafka Streams does not automatically change the number of partitions of the repartition topic, because this might result in data corruption and thus lead to incorrect results.
One way to fix this is to manually delete this topic. Note that this might result in data loss, and you should only do it if you know that it is what you want.
Another (better) way would be to reset the application cleanly using bin/kafka-streams-application-reset.sh in combination with KafkaStreams#cleanUp().
Because you need to clean up the application and users should be aware of the implications, Kafka Streams fails in order to make the user aware of the issue, instead of "automagically" taking actions that might be undesired from the user's point of view.
Check out the docs for more details. There is also a blog post that explains application reset in detail:
https://kafka.apache.org/11/documentation/streams/developer-guide/app-reset-tool.html
https://www.confluent.io/blog/data-reprocessing-with-kafka-streams-resetting-a-streams-application/

Kafka only once consumption guarantee

In some answers on Stack Overflow, and on the web in general, I see the idea that Kafka does not support consumption acknowledgement or that exactly-once consumption is hard to achieve.
For example, in the following entry,
Is there any reason to use RabbitMQ over Kafka?, I can read the following statements:
RabbitMQ will keep all states about consumed/acknowledged/unacknowledged messages while Kafka doesn't
or
Exactly once guarantees are hard to get with Kafka.
This is not what I understand by reading the official Kafka documentation at:
https://kafka.apache.org/documentation/#design_consumerposition
The documentation above states that Kafka does not use a traditional acknowledgement implementation (as RabbitMQ does). Instead, it relies on the relationship between partition, consumer, and offset...
This makes the equivalent of message acknowledgements very cheap
Could somebody please explain why an "only once consumption guarantee" is difficult to achieve in Kafka, and how Kafka differs in this respect from more traditional message brokers such as RabbitMQ? What am I missing?
If you mean exactly once, the problem is as follows.
A Kafka consumer, as you may know, uses a polling mechanism: consumers ask the server for messages. Also recall that the consumer commits message offsets, that is, it tells the cluster which offset it expects next. So, imagine what could happen.
The consumer polls for messages and gets the message with offset = 1.
A) If the consumer commits that offset immediately, before processing the message, it can crash and never receive that message again, because the offset was already committed. On the next poll, Kafka returns the message with offset = 2. This is what is called at-most-once semantics.
B) If the consumer processes the message first and then commits the offset, the consumer may crash after processing the message but before committing. In that case, the next poll returns the same message with offset = 1 again, and the message is processed twice. This is what is called at-least-once semantics.
To achieve exactly once, you need to process the message and commit its offset in one atomic operation, where you always do both or neither. This is not so easy. One way to do it (if possible) is to store the result of the processing together with the offset of the message that produced that result. Then, when the consumer starts, it looks up the last processed offset outside Kafka and seeks to that offset.
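As a rough illustration of storing the result together with the offset, here is a sketch in plain Java. The class OffsetTrackingStore is hypothetical and keeps everything in memory; in practice the two fields would be written in one database transaction. The running total is chosen on purpose because adding a value twice would corrupt it.

```java
class OffsetTrackingStore {
    private long total = 0;      // the processing result (a running sum)
    private long nextOffset = 0; // first offset that has not been applied yet

    // Applies the value and advances the offset in one step.
    // Returns false, changing nothing, if this offset was already applied.
    synchronized boolean apply(long offset, long value) {
        if (offset < nextOffset) {
            return false; // redelivered message: skip it
        }
        total += value;
        nextOffset = offset + 1;
        return true;
    }

    synchronized long total() {
        return total;
    }

    synchronized long nextOffset() {
        return nextOffset;
    }
}

class ExactlyOnceDemo {
    public static void main(String[] args) {
        long[] partition = {5, 7, 9}; // message values at offsets 0, 1, 2
        OffsetTrackingStore store = new OffsetTrackingStore();

        store.apply(0, partition[0]);
        store.apply(1, partition[1]);
        // Simulated crash and restart: offset 1 is delivered again but ignored.
        System.out.println(store.apply(1, partition[1])); // false

        // On restart, the consumer would seek to store.nextOffset().
        for (int offset = (int) store.nextOffset(); offset < partition.length; offset++) {
            store.apply(offset, partition[offset]);
        }
        System.out.println(store.total());      // 21
        System.out.println(store.nextOffset()); // 3
    }
}
```

With a real consumer, the value returned by nextOffset() is what you would pass to seek for the partition on startup, instead of relying on the offset committed in Kafka.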