Kafka data reads and offset management with sink - apache-kafka

What happens when the consumer reads data from Kafka but fails to write it into the sink? Let's say I read the data from Kafka, applied some transformation to it, and finally stored the result in a database. If everything works, the final result ends up in the database. But suppose that for some reason the database isn't available. What happens to the data I read from Kafka? When I restart my application, can I read the same data again since I failed to store it in the sink? Or will Kafka mark this data as read and not allow me to read it again?
Can you also tell me what the property enable.auto.commit=true is used for?

There's a part of the metadata in Kafka called consumer offsets. Each message within a partition has a unique offset, an integer value that increases monotonically with each message.
So, in the scenario you've described:
If you committed the offset BEFORE writing to the database, then you will not be able to read those messages again.
But if you commit the offset AFTER writing to the database, then you will be able to re-read those messages.
enable.auto.commit=true, as the name suggests, automatically commits consumer offsets at a regular interval defined by the auto.commit.interval.ms parameter, which defaults to 5000 ms (5 seconds). So, if these default values are used, the offsets will be committed every 5 seconds regardless of whether the messages have landed in the destination or not.
So you would need to control the commits from your code and set enable.auto.commit to false if you want guaranteed delivery.
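As a rough illustration, here is a minimal sketch of that pattern with the Java consumer; the topic name, group id, and the writeToDatabase call are hypothetical stand-ins for your own sink logic:
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class CommitAfterSink {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "db-sink-group");
        props.put("enable.auto.commit", "false"); // commit only after the sink write succeeds
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("input-topic"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    writeToDatabase(record.value()); // if this throws, we never commit
                }
                // Offsets are committed only after every record in the batch reached the database,
                // so a crash before this line means the records are re-read on restart.
                consumer.commitSync();
            }
        }
    }

    private static void writeToDatabase(String value) {
        // hypothetical sink write; replace with real JDBC/ORM code
    }
}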
Hope this helps!

Related

How is it possible to aggregate messages from a Kafka topic based on duration (e.g. 1h)?

We are streaming messages to a Kafka topic at a rate of a few hundred per second. Each message has a timestamp and a payload. Ultimately, we would like to aggregate one hour's worth of data, based on the timestamp of the messages, into Parquet files and upload them to cheap remote storage (an object store).
A naive approach would be to have the consumer simply read the messages from the topic and do the aggregation/roll-up in memory, and once there is one hour's worth of data, generate and upload the Parquet file.
However, if the consumer crashes or needs to be restarted, we would lose all data since the beginning of the current hour, whether we use enable.auto.commit=true or enable.auto.commit=false and commit manually after each batch of messages.
A simple solution for the consumer could be to keep reading until one hour's worth of data is in memory, generate (and upload) the Parquet file, and only then call commitAsync() or commitSync() (using enable.auto.commit=false and an external store to keep track of the offsets).
But this would leave millions of messages uncommitted for at least one hour. I am wondering whether Kafka even allows "delaying" the commit for so many messages / for so long (I seem to remember reading about this somewhere, but for the life of me I cannot find it again).
Actual questions:
a) Is there a limit to the number of messages (or duration) that can remain uncommitted before Kafka considers the consumer broken or stops giving it additional messages? This seems counter-intuitive, though, since otherwise what would be the purpose of enable.auto.commit=false and managing the offsets in the consumer (with, e.g., the help of an external database)?
b) In terms of robustness/redundancy and scalability, it would be great to have more than one consumer in the consumer group; if I understand correctly, it is never possible to have more than one consumer per partition within a group. If we then run more than one consumer and configure multiple partitions per topic, we cannot do this kind of aggregation/roll-up, since the messages will now be distributed across consumers. The only way to work around this would be an additional (external) temporary store for all the messages belonging to such a one-hour group, correct?
You can configure Kafka Streams with a TimestampExtractor to aggregate data into different types of time windows.
As for "into parquet files and upload them to a cheap remote storage (object-store)": the Kafka Connect S3 sink and Pinterest's Secor tool already do this.
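To make the first point concrete, here is a minimal Kafka Streams sketch of a one-hour tumbling-window aggregation keyed on a timestamp taken from the record itself. It assumes a recent (3.x) Kafka Streams version; the topic names, the count() roll-up, and the PayloadTimestampExtractor are illustrative placeholders, not from the question:
import java.time.Duration;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.kstream.TimeWindows;
import org.apache.kafka.streams.processor.TimestampExtractor;

public class HourlyRollup {

    // Hypothetical extractor: a real one would parse the payload's own timestamp field.
    public static class PayloadTimestampExtractor implements TimestampExtractor {
        @Override
        public long extract(ConsumerRecord<Object, Object> record, long partitionTime) {
            return record.timestamp(); // stand-in; return the embedded payload timestamp instead
        }
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "hourly-rollup");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_TIMESTAMP_EXTRACTOR_CLASS_CONFIG, PayloadTimestampExtractor.class);

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> input =
                builder.stream("input-topic", Consumed.with(Serdes.String(), Serdes.String()));

        input.groupByKey(Grouped.with(Serdes.String(), Serdes.String()))
             .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofHours(1))) // one-hour tumbling windows
             .count() // replace count() with whatever roll-up feeds the Parquet writer
             .toStream((window, count) -> window.key() + "@" + window.window().start())
             .to("hourly-counts", Produced.with(Serdes.String(), Serdes.Long()));

        new KafkaStreams(builder.build(), props).start();
    }
}
Offsets and window state are managed by Kafka Streams itself, which sidesteps the "millions of uncommitted messages" concern in the question.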

How the offset should be configured for a new consumer group for an existing topic when source connectors can't be paused

We have an existing topic where the data gets published by a JDBC source connector using the incrementing + timestamp mode (https://docs.confluent.io/current/connect/kafka-connect-jdbc/source-connector/index.html#incremental-query-modes).
We have existing consumer groups which consume data from some existing topics. Now we are introducing a new consumer group (call it group K) which should consume data from the same existing topics and write it to a database. As a first step, we have an initial data migration workflow that takes a dump of the source database and copies it to the destination database before we start consuming messages from the existing topics.
Now, when the consumer group starts, what offset should it start with?
One option is to use latest. But the problem is that the existing source connectors will keep publishing data to the existing topics while the initial data migration is being done for this new consumer group. In our case we have tens of tables to migrate, and there can be a gap where a table dump has already been taken but changes are still being made to the source database, so data keeps getting added to the topics. So there is a chance that we miss processing some records.
We don't have the option to pause the source connectors, which would solve the problem for us.
If we use earliest, we will end up processing all the old data from the Kafka topics, which is not required since we have done the initial data migration.
We want to maintain only one source connector regardless of the number of consumer groups.
I was going through the Kafka consumer APIs, like offsetsForTimes/seek, which let you start from a timestamp. I can note down the time before the initial data migration and seek to it once the consumer group has started and partitions are assigned. But I couldn't find any docs saying whether the timestamp is GMT-based or something else. Is it OK to use this API by passing the number of milliseconds elapsed since the epoch?
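For what it's worth, a minimal sketch of that timestamp-based start with the Java consumer: offsetsForTimes takes UTC epoch milliseconds, and the topic name, group id, and recorded timestamp below are assumptions rather than values from the question:
import java.time.Duration;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndTimestamp;
import org.apache.kafka.common.TopicPartition;

public class SeekToMigrationStart {
    public static void main(String[] args) {
        // Epoch milliseconds noted down just before the table dumps were taken (assumed value).
        final long migrationStartMs = 1_600_000_000_000L;

        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "group-k");
        props.put("enable.auto.commit", "false");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("existing-topic"), new ConsumerRebalanceListener() {
                @Override
                public void onPartitionsRevoked(Collection<TopicPartition> partitions) { }

                @Override
                public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
                    Map<TopicPartition, Long> query = new HashMap<>();
                    partitions.forEach(tp -> query.put(tp, migrationStartMs));
                    // offsetsForTimes returns, per partition, the earliest offset whose
                    // record timestamp is >= the requested epoch-millisecond time.
                    Map<TopicPartition, OffsetAndTimestamp> offsets = consumer.offsetsForTimes(query);
                    offsets.forEach((tp, offsetAndTs) -> {
                        if (offsetAndTs != null) {
                            consumer.seek(tp, offsetAndTs.offset());
                        }
                    });
                    // Note: this seeks on every rebalance; a real implementation would only do it
                    // when the group has no committed offsets yet.
                }
            });

            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                records.forEach(record -> handle(record)); // hypothetical processing
            }
        }
    }

    private static void handle(ConsumerRecord<String, String> record) {
        // write to the destination database here
    }
}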
If I understand this sentence correctly: "If we use offset latest we might lose some data as source connectors might have written some data to the topic during initial data migration" - the topic will end up having some initial-load data and CDC data mixed up, so there is no offset that cleanly separates the two. Therefore, you will not get far by setting any particular offset.
I see the following options:
Have your consumer group K filter out the initial-load data and read from earliest
Produce the initial-load data to a dedicated topic
If possible, perform the initial load outside of business hours so that no CDC data is flowing (maybe over a weekend or bank holiday)

Options for getting a specific item's produce and consume timestamps?

Suppose I'm debugging an issue involving a single specific message that was produced and consumed, and I want to know when this message was produced and when it was consumed. What are my options for getting this info?
I guess when I construct a message I could include the current time within it, and when my consumer gets a message, it could write out a log entry.
But suppose I have many producer and consumer classes and none of the code does these things. Is there something already existing in Kafka that could support finding out this info about a specific message without having to touch the implementation of these producers and consumers, something like the __consumer_offsets topic?
Kafka has built-in timestamp support for produced messages, and this timestamp can be accessed via the timestamp method of ConsumerRecord.
It can be configured with the broker config (log.message.timestamp.type) or the topic-level config (message.timestamp.type). Its default value is CreateTime. You can also set it to LogAppendTime.
CreateTime: the timestamp is assigned when the producer record is created (before sending).
LogAppendTime: the broker overrides the timestamp with its current local time and appends the message to the log.
IMHO, for the consume timestamp your only option is to capture the current system time after message processing has finished.
For more information about timestamps you can check this.
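For reference, a small sketch of reading the built-in timestamp (and capturing a consume time yourself) with the Java consumer; the topic and group names are placeholders:
import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class TimestampInspector {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "timestamp-inspector");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("my-topic"));
            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                    // timestamp() carries CreateTime or LogAppendTime depending on message.timestamp.type
                    System.out.printf("offset=%d produced/appended=%s (%s) consumed=%s%n",
                            record.offset(),
                            Instant.ofEpochMilli(record.timestamp()),
                            record.timestampType(),
                            Instant.now()); // the "consume time" must be captured by the application itself
                }
            }
        }
    }
}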
When it comes to consumption, there's no explicit record of when a message was consumed (also bear in mind that a single message can be consumed multiple times, e.g. by consumers in different consumer groups).
However, there are a few possible ways to track it on your own:
register the timestamps after the records have been received (after the .poll(...) call returns),
if using consumer groups, monitor the consumer group's offsets or read the values in __consumer_offsets (that would require you to deserialize an internal format - see this answer for details); bear in mind that the timestamps of records in this topic correspond to the consumers' commit timestamps, so they need to commit often enough to provide the right granularity,
log compaction + a custom implementation: send a message with the same key and a value that marks the consumption timestamp (however, the original message might still be re-read before compaction happens).

Get latest values from a topic on consumer start, then continue normally

We have a Kafka producer that produces keyed messages at a very high frequency to topics whose retention time is 10 hours. These messages are real-time updates, and the key used is the ID of the element whose value has changed. So the topic acts as a changelog and will have many duplicate keys.
Now, what we're trying to achieve is that when a Kafka consumer launches, regardless of its last known state (new consumer, crashed, restarted, etc.), it will somehow construct a table with the latest values of all the keys in the topic, and then keep listening for new updates as normal, keeping the load on the Kafka server to a minimum and letting the consumer do most of the work. We tried many approaches and none of them seems ideal.
What we tried:
1 changelog topic + 1 compact topic:
The producer sends the same message to both topics, wrapped in a transaction to ensure a successful send.
The consumer launches and requests the latest offset of the changelog topic.
It consumes the compacted topic from the beginning to construct the table.
It continues consuming the changelog from the requested offset.
Cons:
Having duplicates in the compacted topic is very likely, even with the log compaction frequency set as high as possible.
Twice the number of topics on the Kafka server.
KSQL:
With KSQL we either have to rewrite a KTable as a topic so that the consumer can see it (extra topics), or we need consumers to execute a KSQL SELECT against the KSQL REST server and query the table (not as fast and performant as the Kafka APIs).
Kafka Consumer API:
Consumer starts and consumes the topic from beginning. This worked perfectly, but the consumer has to consume the 10 hours change log to construct the last values table.
Kafka Streams:
By using KTables as follows:
KTable<Integer, MarketData> tableFromTopic = streamsBuilder.table("topic_name", Consumed.with(Serdes.Integer(), customSerde));
KTable<Integer, MarketData> filteredTable = tableFromTopic.filter((key, value) -> keys.contains(value.getRiskFactorId()));
Kafka Streams will create one topic on the Kafka server per KTable (named {consumer_app_id}-{topic_name}-STATE-STORE-0000000000-changelog), which will result in a huge number of topics since we have a big number of consumers.
From what we have tried, it looks like we need to either increase the server load, or the consumer launch time. Isn't there a "perfect" way to achieve what we're trying to do?
Thanks in advance.
By using KTables, Kafka Streams will create 1 topic on Kafka server per KTable, which will result in a huge number of topics since we have a big number of consumers.
If you are just reading an existing topic into a KTable (via StreamsBuilder#table()), then no extra topics are being created by Kafka Streams. Same for KSQL.
It would help if you could clarify what exactly you want to do with the KTable(s). Apparently you are doing something that does result in additional topics being created?
1 changelog topic + 1 compact topic:
Why were you thinking about having two separate topics? Normally, changelog topics should always be compacted. And given your use case description, I don't see a reason why it should not be:
Now, what we're trying to achieve is that when a Kafka consumer launches, regardless of the last known state (new consumer, crashed, restart, etc..), it will somehow construct a table with the latest values of all the keys in a topic, and then keeps listening for new updates as normal [...]
Hence compaction would be very useful for your use case. It would also prevent this problem you described:
Consumer starts and consumes the topic from beginning. This worked perfectly, but the consumer has to consume the 10 hours change log to construct the last values table.
Note that, to reconstruct the latest table values, all three of Kafka Streams, KSQL, and the Kafka Consumer must read the table's underlying topic completely (from beginning to end). If that topic is NOT compacted, this might indeed take a long time depending on the data volume, topic retention settings, etc.
From what we have tried, it looks like we need to either increase the server load, or the consumer launch time. Isn't there a "perfect" way to achieve what we're trying to do?
Without knowing more about your use case, particularly what you want to do with the KTable(s) once they are populated, my answer would be:
Make sure the "changelog topic" is also compacted.
Try KSQL first. If this doesn't satisfy your needs, try Kafka Streams. If this doesn't satisfy your needs, try the Kafka Consumer.
For example, I wouldn't use the Kafka Consumer if it is supposed to do any stateful processing with the "table" data, because the Kafka Consumer lacks built-in functionality for fault-tolerant stateful processing.
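As a hedged illustration of the first recommendation (making the changelog topic compacted), here is a small sketch using the Java AdminClient to create such a topic; the topic name, partition count, and replication factor are assumptions:
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;

public class CreateCompactedChangelog {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");

        try (Admin admin = Admin.create(props)) {
            NewTopic topic = new NewTopic("changelog-topic", 6, (short) 1)
                    // With compaction, only the latest value per key is retained, so a restarting
                    // consumer has far less data to read than with a purely time-retained changelog.
                    .configs(Map.of(TopicConfig.CLEANUP_POLICY_CONFIG, TopicConfig.CLEANUP_POLICY_COMPACT));
            admin.createTopics(List.of(topic)).all().get();
        }
    }
}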
Consumer starts and consumes the topic from beginning. This worked perfectly, but the consumer has to consume the 10 hours change log to construct the last values table.
During the first startup of your application, what you said is correct.
To avoid this on every restart, store the key-value data in a file.
For example, you might want to use a persistent map (like MapDB).
Since you give the consumer a group.id and commit the offset either periodically or after each record is stored in the map, the next time your application restarts it will read from the last committed offset for that group.id.
So the problem of taking a lot of time occurs only during the first run. As long as you have the file, you don't need to consume from the beginning.
If the file is not there or has been deleted, just seekToBeginning in the KafkaConsumer and build it again.
You need to store these key-values somewhere for retrieval anyway, so why not use a persistent store?
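A minimal sketch of that consumer-plus-local-table approach: a plain in-memory map stands in for the persistent map, and the topic and group names are placeholders:
import java.time.Duration;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class LatestValueTable {
    public static void main(String[] args) {
        // Stand-in for a persistent map (e.g. MapDB); if its backing file were lost you would
        // call consumer.seekToBeginning(consumer.assignment()) to rebuild the table.
        Map<Integer, String> latestValues = new HashMap<>();

        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "latest-value-table"); // committed offsets let restarts resume where they left off
        props.put("enable.auto.commit", "false");
        props.put("auto.offset.reset", "earliest");  // the very first start reads the whole changelog once
        props.put("key.deserializer", "org.apache.kafka.common.serialization.IntegerDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<Integer, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("changelog-topic"));
            while (true) {
                ConsumerRecords<Integer, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<Integer, String> record : records) {
                    latestValues.put(record.key(), record.value()); // keep only the newest value per key
                }
                consumer.commitSync(); // commit after the map (and its backing file) has been updated
            }
        }
    }
}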
If you want to use Kafka Streams for whatever reason, an alternative (not as simple as the above) is to use a persistent backing store.
For example, a persistent global store:
streamsBuilder.addGlobalStore(
    Stores.keyValueStoreBuilder(Stores.persistentKeyValueStore(topic), keySerde, valueSerde),
    topic,
    Consumed.with(keySerde, valueSerde),
    this::updateValue);
P.S.: There will be a file called .checkpoint in the state directory which stores the offsets. If the topic is deleted in the middle, you get an OffsetOutOfRangeException; you may want to handle this, perhaps with an UncaughtExceptionHandler.
Refer to https://stackoverflow.com/a/57301986/2534090 for more.
Finally, it is better to use a plain consumer with a persistent file rather than Streams for this, because of the simplicity it offers.

Having a Kafka Consumer read a single message at a time

We have Kafka set up to be able to process messages in parallel on several servers. But every message must be processed exactly once (and by only one server). We have this up and running and it's working fine.
Now, the problem for us is that the Kafka consumers read messages in batches for maximal efficiency. This leads to a problem if/when processing fails, the server shuts down, or whatever, because then we lose data that was about to be processed.
Is there a way to get the consumer to only read one message at a time and let Kafka keep the unprocessed messages? Something like: the consumer pulls one message -> processes it -> commits the offset when done, repeat. Is this feasible using Kafka? Any thoughts/ideas?
Thanks!
You might try setting max.poll.records to 1.
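A small sketch of that setting combined with a commit after each record; the topic and group names are placeholders and process() stands in for the real work:
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class OneMessageAtATime {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "one-at-a-time");
        props.put("enable.auto.commit", "false");
        props.put("max.poll.records", "1"); // each poll() returns at most one record
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("work-topic"));
            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                    process(record);       // placeholder for the real processing
                    consumer.commitSync(); // commit only after this record is fully processed
                }
            }
        }
    }

    private static void process(ConsumerRecord<String, String> record) {
        // do the actual work here
    }
}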
You mention exactly-once processing, but then you're worried about losing data. I'm assuming you're just worried about the edge case when one of your servers fails and you lose data?
I don't think there's a way to consume exactly one message at a time. Looking through the consumer configurations, there only seems to be an option for setting the maximum bytes a consumer can fetch from Kafka, not the number of messages.
fetch.message.max.bytes
But if you're worried about losing data completely: if you never commit the offset, Kafka will not mark it as committed and the data won't be lost.
Reading through the Kafka documentation about delivery semantics:
So effectively Kafka guarantees at-least-once delivery by default and allows the user to implement at most once delivery by disabling retries on the producer and committing its offset prior to processing a batch of messages. Exactly-once delivery requires co-operation with the destination storage system but Kafka provides the offset which makes implementing this straight-forward.
So exactly-once processing is not something that Kafka enables by default. It requires you to store the offset whenever you write the output of your processing to storage.
But this can be handled more simply and generally by simply letting the consumer store its offset in the same place as its output... As an example of this, our Hadoop ETL that populates data in HDFS stores its offsets in HDFS with the data it reads so that it is guaranteed that either data and offsets are both updated or neither is.
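For illustration, a hedged sketch of that pattern with a relational sink, where the processed value and the source offset are written in one database transaction; the JDBC URL, table names, and column names are assumptions:
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import org.apache.kafka.clients.consumer.ConsumerRecord;

public class TransactionalSink {
    // Write the output and the source offset atomically, so on restart the application
    // can read the stored offset, seek() to it, and never double-apply a record.
    static void writeWithOffset(ConsumerRecord<String, String> record) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:postgresql://localhost/app", "user", "pass")) {
            conn.setAutoCommit(false);
            try (PreparedStatement out = conn.prepareStatement(
                         "INSERT INTO results(value) VALUES (?)");
                 PreparedStatement off = conn.prepareStatement(
                         "UPDATE offsets SET last_offset = ? WHERE kafka_topic = ? AND kafka_partition = ?")) {
                out.setString(1, record.value());
                out.executeUpdate();
                off.setLong(1, record.offset());
                off.setString(2, record.topic());
                off.setInt(3, record.partition());
                off.executeUpdate();
                conn.commit(); // either both rows are written or neither is
            } catch (Exception e) {
                conn.rollback();
                throw e;
            }
        }
    }
}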
I hope that helps.
It depends on which client you are going to use. For C++ and Python, it is possible to consume ONE message each time.
For Python, I used https://github.com/mumrah/kafka-python. The following code can consume one message each time:
message = self.__consumer.get_message(block=False, timeout=self.IterTimeout, get_partition_info=True)
__consumer is a SimpleConsumer object.
See my question and answer here: How to stop Python Kafka Consumer in program?
For C++, I am using https://github.com/edenhill/librdkafka. The following code can consume one message each time:
while (m_bRunning)
{
    // Start to read messages from the local queue.
    RdKafka::Message *msg = m_consumer->consume(m_topic, m_partition, 1000);
    msg_consume(msg);
    delete msg;
    m_consumer->poll(0);
}
m_consumer is a pointer to the C++ Consumer object (C++ API).
Hope this helps.