I am new to Kafka and am developing a personal project with a few services that communicate with each other through Kafka; I am using Confluent to host Kafka remotely.
Everything works fine, but when I start up a server it tries to process all the old messages in the topics, which were generated while I was testing the system.
I would like to avoid this because it is time-consuming, and those messages were already processed the last time the server was up. Is there any way to prevent this in the development environment?
Am I even using Kafka correctly? Are there good practices that I missed?
By "server", I assume you mean consumer. The broker server doesn't process data, only stores it.
If you have auto.offset.reset=earliest + enable.auto.commit=false + are not committing the records in your code (or are overall using a new group.id each time), this is the expected behavior since your group.id is not tracking already consumed data.
Since you're now in a situation where you have processed data but no stored offsets, first set a static group.id; then your options include:
Re-process all the data, accepting the duplicates, perhaps adding a conditional filter in your consumer code to skip already-handled records.
Skip all processed and unprocessed data and only consume records produced after the consumer starts, either by setting a new group.id with auto.offset.reset=latest, or by using consumer.seekToEnd() / the kafka-consumer-groups CLI tool (a minimal sketch follows below). The downside of relying on auto.offset.reset=latest is that if the consumer group is idle for too long it expires, and you jump back to the end of the topic even though there may still be unprocessed data.
Manually find the offsets of the last processed record in each partition and consumer.seek() to those offsets.
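For the "skip everything and start fresh" option, here is a minimal sketch with the plain Java consumer. The topic name "events", the bootstrap server, and the static group id are placeholders, and the SASL/security settings you would need for Confluent Cloud are omitted.

import java.time.Duration;
import java.util.Collection;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class SkipToLatest {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-static-group");   // static group id, reused on every start
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "latest");
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Collections.singletonList("events"), new ConsumerRebalanceListener() {
            @Override
            public void onPartitionsRevoked(Collection<TopicPartition> partitions) { }

            @Override
            public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
                consumer.seekToEnd(partitions);   // jump past everything already in the topic
            }
        });

        while (true) {
            for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                System.out.println(record.value());   // only brand-new records arrive here
            }
            consumer.commitSync();   // track progress under the static group id
        }
    }
}

Note that seeking to the end on every assignment also skips anything produced while the consumer was down; for a one-off reset in a development environment, running kafka-consumer-groups with --reset-offsets --to-latest against the stopped group may be more convenient.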
We are streaming messages to a Kafka topic at a rate of a few hundred per second. Each message has a timestamp and a payload. Ultimately, we would like to aggregate one hour's worth of data - based on the timestamp of the message - into parquet files and upload them to a cheap remote storage (object-store).
A naive approach would be to have the consumer simply read the messages from the topic and do the aggregation/roll-up in memory, and once there is one hour's worth of data, generate and upload the parquet file.
However, if the consumer crashes or needs to be restarted, we would lose all data since the beginning of the current hour, whether we use enable.auto.commit=true or enable.auto.commit=false with a manual commit after each batch of messages.
A simple solution for the Consumer could be to keep reading until one hour's worth of data is in memory, generate and upload the parquet file, and only then call commitAsync() or commitSync() (using enable.auto.commit=false, and possibly an external store to keep track of the offsets).
But this would leave millions of messages uncommitted for at least one hour. I am wondering whether Kafka even allows delaying the commit for that many messages / that long (I seem to remember having read about this somewhere, but for the life of me I cannot find it again).
Actual questions:
a) Is there a limit to the number of messages (or the duration) that can go uncommitted before Kafka considers the Consumer broken or stops giving it additional messages? Such a limit seems counter-intuitive, though, since otherwise what would be the purpose of enable.auto.commit=false and managing the offsets in the Consumer (e.g. with the help of an external database)?
b) In terms of robustness/redundancy and scalability, it would be great to have more than one Consumer in the consumer group; if I understand correctly, it is never possible to have more than one Consumer per partition. If we then run more than one Consumer and configure multiple partitions per topic, we cannot do this kind of aggregation/roll-up, since messages will now be distributed across Consumers. The only way to work around this would be an additional (external) temporary storage for all the messages belonging to one such one-hour group, correct?
You can configure Kafka Streams with a TimestampExtractor to aggregate data into different types of time windows.
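As a rough sketch, a payload-based extractor plus a one-hour tumbling window could look like the following. The topic name, the MyEvent payload class with its getTimestamp() accessor, and the myEventSerde() helper are assumptions for illustration, and TimeWindows.ofSizeWithNoGrace requires Kafka Streams 3.0+ (older versions use TimeWindows.of).

import java.time.Duration;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.TimeWindows;
import org.apache.kafka.streams.processor.TimestampExtractor;

public class HourlyAggregation {

    // Use the event time embedded in the message instead of the broker/producer timestamp.
    public static class PayloadTimestampExtractor implements TimestampExtractor {
        @Override
        public long extract(ConsumerRecord<Object, Object> record, long partitionTime) {
            return ((MyEvent) record.value()).getTimestamp();   // MyEvent is a placeholder payload class
        }
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "hourly-aggregation");   // placeholder app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_TIMESTAMP_EXTRACTOR_CLASS_CONFIG, PayloadTimestampExtractor.class);

        StreamsBuilder builder = new StreamsBuilder();
        builder.stream("events", Consumed.with(Serdes.String(), myEventSerde()))   // myEventSerde() is an assumed serde factory
               .groupByKey(Grouped.with(Serdes.String(), myEventSerde()))
               .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofHours(1)))
               .count();   // or any other per-hour aggregation

        new KafkaStreams(builder.build(), props).start();
    }
}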
into parquet files and upload them to a cheap remote storage (object-store).
The Kafka Connect S3 sink or Pinterest's Secor tool already does this.
I have a Spring Boot app that uses Kafka via the Spring Kafka module. A neighboring team periodically sends data in JSON format to a compacted topic that serves as a "snapshot" of their internal database at a certain moment in time. But sometimes the team updates the contract without notification, our DTOs don't reflect the recent changes, and deserialization fails.
Because our listener containers are configured as batched, with the default BATCH ack mode and BatchLoggingErrorHandler, we find from time to time that our Kubernetes pod with the consumer is full of errors, and we can't simply re-read the topic with fresh DTOs after redeploying the microservice, since the last offset in every partition has been committed and we can't change the group.id (InfoSec department policy) to use auto.offset.reset = "earliest" as a workaround.
So, is there a way to programmatically reposition every consumer in a consumer group to the initial offset of its assigned partitions? If so, I think we could write a REST endpoint which, when called, triggers a "reprocessing from scratch".
See the documentation.
If your listener extends AbstractConsumerSeekAware, you can perform all kinds of seek operations (e.g. initial seek during initialization, arbitrary seeks between polls).
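As a rough sketch (assuming Spring for Apache Kafka 2.6+, which adds the seekToBeginning() convenience method; the topic name, group id, SnapshotDto type, and REST mapping are placeholders):

import java.util.List;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.kafka.listener.AbstractConsumerSeekAware;
import org.springframework.stereotype.Component;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RestController;

@Component
class SnapshotListener extends AbstractConsumerSeekAware {

    @KafkaListener(topics = "team-snapshot", groupId = "our-fixed-group")   // names are placeholders
    public void onBatch(List<SnapshotDto> batch) {
        // normal batch processing of the snapshot records
    }

    public void rewind() {
        // Queues a seek to the beginning of all assigned partitions;
        // it is applied on the next poll of the listener container.
        seekToBeginning();
    }
}

@RestController
class ReprocessController {

    private final SnapshotListener listener;

    ReprocessController(SnapshotListener listener) {
        this.listener = listener;
    }

    @PostMapping("/reprocess")   // calling this endpoint triggers "reprocessing from scratch"
    public void reprocess() {
        listener.rewind();
    }
}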
We have a Kafka producer that produces keyed messages at a very high frequency to topics whose retention time is 10 hours. These messages are real-time updates, and the key used is the ID of the element whose value has changed. So the topic acts as a changelog and will have many duplicate keys.
Now, what we're trying to achieve is that when a Kafka consumer launches, regardless of the last known state (new consumer, crashed, restart, etc.), it will somehow construct a table with the latest values of all the keys in a topic, and then keep listening for new updates as normal, keeping the load on the Kafka server to a minimum and letting the consumer do most of the job. We tried many approaches and none of them seems ideal.
What we tried:
1 changelog topic + 1 compacted topic:
The producer sends the same message to both topics wrapped in a transaction to assure successful send.
Consumer launches and requests the latest offset of the changelog topic.
Consumes the compacted topic from the beginning to construct the table.
Continues consuming the changelog from the requested offset.
Cons:
Having duplicates in the compacted topic is very likely, even with the log compaction frequency set as high as possible.
Twice the number of topics on the Kafka server.
KSQL:
With KSQL we either have to write a KTable back out as a topic so that the consumer can see it (extra topics), or we need consumers to execute a KSQL SELECT against the KSQL REST server and query the table (not as fast and performant as the Kafka APIs).
Kafka Consumer API:
The consumer starts and consumes the topic from the beginning. This worked perfectly, but the consumer has to consume the 10-hour changelog to construct the latest-values table.
Kafka Streams:
By using KTables as follows:
KTable<Integer, MarketData> tableFromTopic = streamsBuilder.table("topic_name", Consumed.with(Serdes.Integer(), customSerde));
KTable<Integer, MarketData> filteredTable = tableFromTopic.filter((key, value) -> keys.contains(value.getRiskFactorId()));
Kafka Streams will create one topic on the Kafka server per KTable (named {consumer_app_id}-{topic_name}-STATE-STORE-0000000000-changelog), which will result in a huge number of topics since we have a big number of consumers.
From what we have tried, it looks like we need to either increase the server load, or the consumer launch time. Isn't there a "perfect" way to achieve what we're trying to do?
Thanks in advance.
By using KTables, Kafka Streams will create one topic on the Kafka server per KTable, which will result in a huge number of topics since we have a big number of consumers.
If you are just reading an existing topic into a KTable (via StreamsBuilder#table()), then no extra topics are being created by Kafka Streams. Same for KSQL.
It would help if you could clarify what exactly you want to do with the KTable(s). Apparently you are doing something that does result in additional topics being created?
1 changelog topic + 1 compacted topic:
Why were you thinking about having two separate topics? Normally, changelog topics should always be compacted. And given your use case description, I don't see a reason why it should not be:
Now, what we're trying to achieve is that when a Kafka consumer launches, regardless of the last known state (new consumer, crashed, restart, etc.), it will somehow construct a table with the latest values of all the keys in a topic, and then keep listening for new updates as normal [...]
Hence compaction would be very useful for your use case. It would also prevent this problem you described:
The consumer starts and consumes the topic from the beginning. This worked perfectly, but the consumer has to consume the 10-hour changelog to construct the latest-values table.
Note that, to reconstruct the latest table values, all three of Kafka Streams, KSQL, and the Kafka Consumer must read the table's underlying topic completely (from beginning to end). If that topic is NOT compacted, this might indeed take a long time depending on the data volume, topic retention settings, etc.
From what we have tried, it looks like we need to either increase the server load, or the consumer launch time. Isn't there a "perfect" way to achieve what we're trying to do?
Without knowing more about your use case, particularly what you want to do with the KTable(s) once they are populated, my answer would be:
Make sure the "changelog topic" is also compacted (a minimal AdminClient sketch follows below).
Try KSQL first. If this doesn't satisfy your needs, try Kafka Streams. If this doesn't satisfy your needs, try the Kafka Consumer.
For example, I wouldn't use the Kafka Consumer if it is supposed to do any stateful processing with the "table" data, because the Kafka Consumer lacks built-in functionality for fault-tolerant stateful processing.
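For the first point, here is a minimal sketch of turning on compaction for an existing topic with the Java AdminClient. The topic name and broker address are placeholders; you can achieve the same with the kafka-configs or kafka-topics command-line tools.

import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;
import org.apache.kafka.common.config.TopicConfig;

public class CompactChangelogTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (Admin admin = Admin.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "changelog-topic");
            AlterConfigOp setCompact = new AlterConfigOp(
                    new ConfigEntry(TopicConfig.CLEANUP_POLICY_CONFIG, TopicConfig.CLEANUP_POLICY_COMPACT),
                    AlterConfigOp.OpType.SET);
            // Switch the topic's cleanup.policy to "compact" so old values per key are eventually removed.
            admin.incrementalAlterConfigs(Map.of(topic, List.of(setCompact))).all().get();
        }
    }
}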
The consumer starts and consumes the topic from the beginning. This worked perfectly, but the consumer has to consume the 10-hour changelog to construct the latest-values table.
The first time your application starts up, what you said is correct.
To avoid this on every restart, store the key-value data in a file.
For example, you might want to use a persistent map (like MapDB).
Since you give the consumer a group.id and commit the offset either periodically or after each record is stored in the map, the next time your application restarts it will read from the last committed offset for that group.id.
So the problem of taking a lot of time occurs only initially (on the first run). As long as you have the file, you don't need to consume from the beginning.
If the file is not there or has been deleted, just seekToBeginning in the KafkaConsumer and build it again.
You need to store these key-values somewhere for retrieval anyway, so why can't it be a persistent store? A rough sketch of this approach follows below.
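Here is a rough sketch using MapDB 3.x as the persistent map. The topic name, the Integer key / byte[] value types, the file name, and the connection settings are placeholders for illustration.

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import java.util.concurrent.ConcurrentMap;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;
import org.apache.kafka.common.serialization.IntegerDeserializer;
import org.mapdb.DB;
import org.mapdb.DBMaker;
import org.mapdb.Serializer;

public class LatestValueTable {
    public static void main(String[] args) {
        // Persistent map that survives restarts; keyed by the element id.
        DB db = DBMaker.fileDB("latest-values.db").transactionEnable().make();
        ConcurrentMap<Integer, byte[]> latest =
                db.hashMap("latest", Serializer.INTEGER, Serializer.BYTE_ARRAY).createOrOpen();

        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "table-builder");      // fixed group id
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");  // only matters on the first run
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, IntegerDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());

        try (KafkaConsumer<Integer, byte[]> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("market-data"));
            while (true) {
                for (ConsumerRecord<Integer, byte[]> record : consumer.poll(Duration.ofSeconds(1))) {
                    latest.put(record.key(), record.value());   // keep only the newest value per key
                }
                db.commit();           // flush the map to disk...
                consumer.commitSync(); // ...then record progress under the group id
            }
        }
    }
}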
If you want to use Kafka Streams for whatever reason, then an alternative (not as simple as the above) is to use a persistent backing store.
For example, a persistent global store.
streamsBuilder.addGlobalStore(
        Stores.keyValueStoreBuilder(Stores.persistentKeyValueStore(topic), keySerde, valueSerde),
        topic,
        Consumed.with(keySerde, valueSerde),
        this::updateValue);
P.S.: There will be a file called .checkpoint in the state store directory which stores the offsets. If the topic is deleted in the meantime, you get an OffsetOutOfRangeException. You may want to handle this, perhaps by using an UncaughtExceptionHandler.
Refer to https://stackoverflow.com/a/57301986/2534090 for more.
Finally, it is better to use a Consumer with a persistent file rather than Streams for this, because of the simplicity it offers.
Is it possible to have multiple copies of an application listen to the same Kafka group/topic so that only one is reading it at a time, but the others will start working if the main one crashes or stops reading?
I need to make an application highly available but can't tolerate doubling the traffic to the data store on the other end of the application by having multiple copies actively running.
FYI: technically I'm using MapR Streams, but it adheres to the Kafka API and functionality, in case anyone knows a MapR Streams-specific feature that helps the situation.
It is possible. If multiple consumers are in the same consumer group, then when the group subscribes to a topic, Kafka does the partition assignment for your consumers: one partition can only be consumed by one consumer in the same group.
So you could give your topic only one partition; then only one consumer consumes messages and the others sit idle. Once that consumer is shut down, it triggers a group rebalance: Kafka does the partition assignment again, and in your case a new consumer takes over the work, processing messages from the last offset committed by the old consumer.
And if your use case supports parallel processing, you can run many processes (apps) doing the same work and give the topic multiple partitions. They will be assigned different partitions and process different messages, which speeds up processing and also tolerates failover. As said above, if some consumers fail, Kafka takes care of it for you and assigns their partitions to other working consumers. So everything will be OK.
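A rough sketch of the single-active-consumer setup follows. The topic name "jobs", the group id, and the broker address are placeholders; run the same program on each node and give the topic exactly one partition.

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class StandbyConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "ha-app");   // same group id in every copy
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("jobs"));
            // With a single-partition topic, only one instance in the group gets the partition;
            // the other copies poll but receive nothing until a rebalance hands it to them.
            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                    // write to the downstream data store; only the active instance reaches this point
                }
                consumer.commitSync();   // a standby that takes over resumes from this offset
            }
        }
    }
}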
I am trying to implement a simple Producer-->Kafka-->Consumer application in Java. I am able to produce as well as consume messages successfully, but the problem occurs when I restart the consumer: some of the already consumed messages are picked up again by the consumer from Kafka (not all messages, but a few of the last consumed ones).
I have set autooffset.reset=largest in my consumer and my autocommit.interval.ms property is set to 1000 milliseconds.
Is this "redelivery of some already consumed messages" a known problem, or are there other settings that I am missing here?
Basically, is there a way to ensure none of the previously consumed messages are getting picked up/consumed by the consumer?
Kafka uses Zookeeper to store consumer offsets. Since Zookeeper operations are pretty slow, it's not advisable to commit the offset after every message is consumed.
It's possible to add a shutdown hook to the consumer that manually commits the topic offset before exit. However, this won't help in certain situations (like a JVM crash or kill -9). To guard against those situations, I'd advise implementing custom commit logic that commits the offset locally after processing each message (to a file or a local database), and also commits the offset to Zookeeper every 1000 ms. Upon consumer startup, both of these locations should be queried, and the maximum of the two values should be used as the consumption offset.
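The answer above targets the old Zookeeper-based high-level consumer; here is the same "local + remote offset" idea roughly transposed to the newer Java consumer, where the remote offset lives in Kafka rather than Zookeeper. The topic, partition 0, the local file path, and the process() method are placeholders for illustration.

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class DualCommitConsumer {
    public static void main(String[] args) throws Exception {
        Path offsetFile = Paths.get("offset-0");                // local commit location (placeholder)
        TopicPartition tp = new TopicPartition("events", 0);    // placeholder topic/partition

        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-group");
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.assign(Collections.singletonList(tp));

            // On startup: resume from the larger of the locally stored and the broker-side offset.
            long local = Files.exists(offsetFile) ? Long.parseLong(Files.readString(offsetFile).trim()) : -1L;
            OffsetAndMetadata remote = consumer.committed(Collections.singleton(tp)).get(tp);
            long start = Math.max(local, remote != null ? remote.offset() : -1L);
            if (start >= 0) {
                consumer.seek(tp, start);
            }

            long lastRemoteCommit = System.currentTimeMillis();
            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                    process(record);                                                    // your processing logic (placeholder)
                    Files.writeString(offsetFile, Long.toString(record.offset() + 1));  // local commit per message
                }
                if (System.currentTimeMillis() - lastRemoteCommit >= 1000) {            // remote commit every ~1000 ms
                    consumer.commitSync();
                    lastRemoteCommit = System.currentTimeMillis();
                }
            }
        }
    }

    private static void process(ConsumerRecord<String, String> record) {
        // placeholder for real processing
    }
}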