Capturing data that is just about to get discarded from a Kafka topic? - apache-kafka

My Kafka topic's retention period is 7 days, but I need to push data that is expiring because of the retention period to a new Kafka topic or some other storage.
So is there any method by which I can access the data that is going to be deleted after 7 days, just before it gets deleted? Or a way to set up some process that will automatically push the data that is about to be deleted somewhere else?

Since Kafka 0.10, each message has a timestamp. Simply set up a consumer group that starts every hour, processes each topic partition from the initial offset (auto.offset.reset=earliest), and pushes to the new topic the messages whose timestamps fall within the upcoming one-hour-wide expiration window; the consumer group then stops and is restarted one hour later.
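Below is a minimal sketch of such an hourly job, assuming a local broker at localhost:9092 and hypothetical topic names events (source) and events-archive (destination). It copies to the archive topic every record whose timestamp falls within the last hour of the 7-day retention window; a real job would poll in a loop until it catches up rather than doing a single poll.

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class ExpiringMessageArchiver {
    public static void main(String[] args) {
        String sourceTopic = "events";               // hypothetical source topic
        String archiveTopic = "events-archive";      // hypothetical archive topic
        long retentionMs = 7L * 24 * 60 * 60 * 1000; // 7-day topic retention
        long windowMs = 60L * 60 * 1000;             // 1-hour expiration window

        Properties cProps = new Properties();
        cProps.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        cProps.put(ConsumerConfig.GROUP_ID_CONFIG, "expiring-archiver");
        cProps.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
        cProps.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        cProps.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        Properties pProps = new Properties();
        pProps.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        pProps.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        pProps.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(cProps);
             KafkaProducer<String, String> producer = new KafkaProducer<>(pProps)) {
            consumer.subscribe(Collections.singletonList(sourceTopic));
            // Anything older than this cutoff will be deleted within the next hour.
            long cutoff = System.currentTimeMillis() - (retentionMs - windowMs);
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(10));
            for (ConsumerRecord<String, String> record : records) {
                if (record.timestamp() <= cutoff) {
                    producer.send(new ProducerRecord<>(archiveTopic, record.key(), record.value()));
                }
            }
            producer.flush();
            consumer.commitSync();
        }
    }
}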

Related

Why did all the offsets disappear for consumers?

I have a service with Kafka consumers. Previously, I created and closed consumers after receiving records every time. I made a change and started using resume/pause without closing consumers (with ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG = false and consumer.commitSync(offsetAndMetadataMap);). The service worked great all week. After 7 days it was restarted. After the restart, all offsets disappeared and the consumers began to receive all the old records. How could this happen? Where did the offsets go?
I guess the consumers of that consumer group were not up for the 7 days before the restart?
The internal offsets topic, which contains the committed offsets of your groups, is defined with both the compact and delete cleanup policies: it compacts the records to keep only the last value per key, and it also deletes old records from the topic. The default offsets topic retention is 7 days; see
KAFKA-3806: Increase offsets retention default to 7 days (KIP-186) #4648
It is configurable like any other topic configuration.
Offset expiration semantics has slightly changed in this version. According to the new semantics, offsets of partitions in a group will not be removed while the group is subscribed to the corresponding topic and is still active (has active consumers). If group becomes empty all its offsets will be removed after default offset retention period (or the one set by broker) has passed (unless the group becomes active again). Offsets associated with standalone (simple) consumers, that do not use Kafka group management, will be removed after default offset retention period (or the one set by broker) has passed since their last commit.
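If you want to verify whether a group's committed offsets are still present on the brokers (for example, after the offsets retention period has passed), here is a minimal sketch using the Java AdminClient; the broker address localhost:9092 and the group id my-group are placeholders.

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

import java.util.Map;
import java.util.Properties;

public class CheckGroupOffsets {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // Fetch whatever committed offsets the broker still has for this group.
            Map<TopicPartition, OffsetAndMetadata> offsets =
                    admin.listConsumerGroupOffsets("my-group")
                         .partitionsToOffsetAndMetadata()
                         .get();
            if (offsets.isEmpty()) {
                System.out.println("No committed offsets found - they may have expired.");
            } else {
                offsets.forEach((tp, om) -> System.out.println(tp + " -> " + om.offset()));
            }
        }
    }
}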

What consumer offset will be set if auto.offset.reset=earliest but topic has no messages

I have Kafka server version 2.4 and have set log.retention.hours=168 (so that messages in the topic get deleted after 7 days) and auto.offset.reset=earliest (so that if the consumer doesn't find its last committed offset, it processes from the beginning). And since I am using Kafka 2.4, the default value offsets.retention.minutes=10080 applies (I am not setting this property in my application).
My Topic data is : 1,2,3,4,5,6,7,8,9,10
current consumer offset before shutting down consumer: 10
End offset:10
last committed offset by consumer: 10
So let's say my consumer has not been running for the past 7 days and I start it on the 8th day. My last committed offset will have expired (due to the offsets.retention.minutes=10080 property) and the topic messages will also have been deleted (due to the log.retention.hours=168 property).
So I want to know what consumer offset will be set by the auto.offset.reset=earliest property now.
Although no data is available in the Kafka topic, your brokers still know the "next" offset within that partition. In your case, the first and last offset of this topic is 10, even though it does not contain any data.
Therefore, your consumer which already has committed offset 10 will try to read 11 when started again, independent of the consumer configuration auto.offset.reset.
Your example gets even more interesting when your topic has had offsets, say, until 15 while the consumer was shut down after committing offset 10. Now, imagine all offsets were removed from the topic due to the retention policy. If you then start your consumer, only then does the consumer configuration auto.offset.reset come into effect, as stated in the documentation:
"What to do when there is no initial offset in Kafka or if the current offset does not exist any more on the server (e.g. because that data has been deleted)"
As long as the Kafka topic is empty there is no offset "set" for the consumer. The consumer just tries to find the next available offset, either based on
the last committed offset or,
in case the last committed offset does not exist anymore, the configuration given through auto.offset.reset.
Just as an additional note: even though the messages seem to get cleaned up by the retention policy, you may still see some data in the topic; see "Data still remains in Kafka topic even after retention time/size".
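For reference, a minimal consumer sketch showing where auto.offset.reset fits in; the broker address, topic name my-topic and group id my-group are placeholders, and the setting is only consulted when the group has no valid committed offset.

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class EarliestResetConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-group");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        // Only used when the group has NO valid committed offset for a partition,
        // e.g. the committed offset expired or points to data that has been deleted.
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic"));
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
            System.out.println("Fetched " + records.count() + " records");
        }
    }
}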
Once the consumer group's committed offsets get deleted from the offsets log, auto.offset.reset will take precedence and consumers will start consuming data from the beginning.
My Topic data is : 1,2,3,4,5,6,7,8,9,10
If the topic has the above data, the consumer will start from the beginning, and all records 1 to 10 will be consumed.
My Topic data is : 11,12,13,14,15,16,17,18,19,20
In this case, if the old data has been purged due to retention, the consumer will reset the offset to earliest (the earliest offset available at that time) and start consuming from there; in this scenario it will consume all of 11 to 20 (since 1 to 10 were purged).

Kafka partitioning with respect to timestamp to process only messages in the partition for the current hour

I am working on a messaging system where messages generated by users have a time parameter.
I have a consumer that runs a job every hour, looks for messages whose time matches the current hour, and sends these messages.
I want to partition the topic "message" based on this time/timestamp in groups of one hour per partition, so that my consumer only processes one partition every hour instead of going through all messages every hour.
As of now I have a producer that produces messages as key-value pairs, where the key is the time rounded off to the hour (a sketch of this keying is shown after the questions below).
I have two questions:
I can create 2-3 partitions by specifying this in the Kafka settings, but how can I have a separate partition for every hour slot?
Should I rather create a new topic for every hour and ask the consumer to only listen to the topic of the current hour?
For example, I create a topic named, say, "2020-7-23-10", which will contain all messages that need to be delivered between 10 and 11 AM on 23 July 2020, so I can just subscribe to this topic and process them.
Or I can create a topic named "messages", partition it based on the time, and force my consumer to process only a particular partition.
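For illustration, here is a minimal sketch of the hour-keyed producer described in the question, assuming a local broker and the topic "message". Note that with the default partitioner several hour keys can hash to the same partition, so an hourly consumer may still have to filter records by key.

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Properties;

public class HourKeyedProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Key = delivery time rounded down to the hour, e.g. "2020-07-23-10".
            // The default partitioner hashes the key, so all records for the same
            // hour end up in the same partition.
            String hourKey = ZonedDateTime.now().format(DateTimeFormatter.ofPattern("yyyy-MM-dd-HH"));
            producer.send(new ProducerRecord<>("message", hourKey, "hello at " + hourKey));
            producer.flush();
        }
    }
}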

Is it possible to consume Kafka messages some time after arrival?

I would like to consume events from a Kafka topic some time after they arrive. The time at which I want an event to be consumed is in the payload of the message. Is it possible to achieve something like that in Kafka? What are the drawbacks of it?
Practical example: a message M is produced at 12:10, arrives at my Kafka topic at 12:11, and I want the consumer to poll it at 12:41 (30 minutes after arrival).
Kafka has a default retention period of 7 days for all topics. You can therefore consume up to a week's worth of data at any moment, the drawback being network saturation if you are constantly doing this.
If you want to consume data that is not at the latest offset, then for any new consumer group you would set auto.offset.reset=earliest. Otherwise, for existing groups, you would need to use the kafka-consumer-groups --reset-offsets command in order to re-consume an already consumed record.
Sometimes you may want to start from the beginning of a topic, for example if you have a compacted topic, in order to rebuild the "deltas" of the data within it; look up the "Stream / Table Duality".
The time on which I want the event to be consumed is in the payload of the message
Since KIP-32, every message has a timestamp outside the payload, by the way.
I want the consumer to poll it ... (30 minutes after arrival)
Sure, you can start a consumer whenever, as long as the data is within the retention window, you will get that event.
There isn't a way to finely control when that happens other than actually starting your consumer at that time, for example 30 minutes later. You could play with max.poll.records and max.poll.interval.ms, but I find that anything larger than a few seconds really isn't a use case for Kafka.
For example, you could instead have a TimerTask around a consumer thread, or a Spark or MapReduce job scheduled with an Oozie/Airflow task that reads a maximum number of records.
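As an illustration of the "consume 30 minutes after arrival" idea, here is a minimal sketch that delays processing based on each record's broker timestamp; the broker address, topic and group id are placeholders, and sleeping on the poll thread like this can trip max.poll.interval.ms in a real deployment.

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class DelayedConsumer {
    private static final long DELAY_MS = 30L * 60 * 1000; // process 30 minutes after arrival

    public static void main(String[] args) throws InterruptedException {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "delayed-consumer");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("events"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    long waitMs = record.timestamp() + DELAY_MS - System.currentTimeMillis();
                    if (waitMs > 0) {
                        // Record is younger than 30 minutes: wait before processing it.
                        // Fine for a sketch; a real consumer should pause() the partition
                        // or tune max.poll.interval.ms instead of sleeping here.
                        Thread.sleep(waitMs);
                    }
                    System.out.printf("Processing %s (arrived %d)%n", record.value(), record.timestamp());
                }
                consumer.commitSync();
            }
        }
    }
}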

Kafka consumer is getting a few (not all) old messages (that were already processed earlier)

We have topics with retention set to 7 days (168 hours). Messages are consumed in real time as and when the producer sends them. Everything was working as expected. However, recently on a production server, DevOps accidentally changed the time zone from PST to EST as part of an OS patch.
After the Kafka server restart, we saw a few (not all of them, but random) old messages being consumed by the consumers. We asked DevOps to change it back to PST and restart. The old messages re-appeared again this weekend as well.
We have not seen this problem in lower environments (Dev, QA, Stage etc).
Kafka version: kafka_2.12-0.11.0.2
Any help is highly appreciated.
Adding more info... Recently our CentOS had a patch update and somehow the admins changed the timezone from PST to EST and started the Kafka servers... After that our consumers started seeing messages from offset 0. After debugging, I found the timezone change, and the admins changed back from EST to PST after 4 days. Our message producers were sending messages regularly before and after the timezone changes. After the timezone change from EST back to PST, the Kafka servers were restarted and I am seeing the below warning.
This log appeared when we changed back from EST to PST (server.log):
[2018-06-13 18:36:34,430] WARN Found a corrupted index file due to requirement failed: Corrupt index found, index file (/app/kafka_2.12-0.11.0.2/data/__consumer_offsets-21/00000000000000002076.index) has non-zero size but the last offset is 2076 which is no larger than the base offset 2076.}. deleting /app/kafka_2.12-0.11.0.2/data/__consumer_offsets-21/00000000000000002076.timeindex, /app/kafka_2.12-0.11.0.2/data/__consumer_offsets-21/00000000000000002076.index, and /app/kafka_2.12-0.11.0.2/data/__consumer_offsets-21/00000000000000002076.txnindex and rebuilding index... (kafka.log.Log)
We restarted the consumers 3 days after the timezone change back from EST to PST and started seeing consumer messages from offset 0 again.
As of Kafka v2.3.0, you can set
"enable.auto.commit" : "true",// default is true as well
"auto.commit.interval.ms" : "1000"
This means that after every 1 second a consumer is going to commit its offset to Kafka, or every time data is fetched from the specified topic it will commit the latest offset.
So once your Kafka consumer has started and 1 second has elapsed, it will not re-read messages that it has already received and committed. This setting does not require the Kafka server to be restarted.
I think this is because you restarted the program before new offsets were committed.
Managing offsets
For each consumer group, Kafka maintains the committed offset for each partition being consumed. When a consumer processes a message, it doesn't remove it from the partition. Instead, it just updates its current offset using a process called committing the offset.
If a consumer fails after processing a message but before committing its offset, the committed offset information will not reflect the processing of the message. This means that the message will be processed again by the next consumer in that group to be assigned the partition.
Committing offsets automatically
The easiest way to commit offsets is to let the Kafka consumer do it automatically. This is simple, but it gives less control than committing manually. By default, a consumer automatically commits offsets every 5 seconds, regardless of the progress the consumer is making towards processing the messages. In addition, when the consumer calls poll(), this also causes the latest offset returned from the previous call to poll() to be committed (because it's probably been processed).
If the committed offset overtakes the processing of the messages and there is a consumer failure, it's possible that some messages might not be processed. This is because processing restarts at the committed offset, which is later than the last message to be processed before the failure. For this reason, if reliability is more important than simplicity, it's usually best to commit offsets manually.
Committing offsets manually
If enable.auto.commit is set to false, the consumer commits its offsets manually. It can do this either synchronously or asynchronously. A common pattern is to commit the offset of the latest processed message based on a periodic timer. This pattern means that every message is processed at least once, but the committed offset never overtakes the progress of messages that are actively being processed. The frequency of the periodic timer controls the number of messages that can be reprocessed following a consumer failure. Messages are retrieved again from the last saved committed offset when the application restarts or when the group rebalances.
The committed offset is the offset of the messages from which processing is resumed. This is usually the offset of the most recently processed message plus one.
From this article, which I think is very helpful.
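To make the manual-commit pattern described above concrete, here is a minimal sketch (not taken from the article); the broker address, topic and group id are placeholders.

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class ManualCommitConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "manual-commit-group");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false"); // commit manually

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.println("Processed " + record.value());
                    // Commit the offset of the last processed message plus one, so the
                    // group resumes right after it on restart or rebalance.
                    consumer.commitSync(Collections.singletonMap(
                            new TopicPartition(record.topic(), record.partition()),
                            new OffsetAndMetadata(record.offset() + 1)));
                }
            }
        }
    }
}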