I have a Kafka Streams application (Kafka v1.1.0) with multiple (24) topics. Four of these topics are source topics and the remaining ones are destination topics. The application seems to have reprocessed data after the system time was changed to an earlier date. I am using the default broker configs, i.e.:
auto.offset.reset = latest
offsets.retention.minutes = 1440 #1 day
log.retention.hours = 168 #7 days
I have looked into the following links in detail, along with the sub-links posted in the answers:
1) Kafka Stream reprocessing old messages on rebalancing
2) How does an offset expire for an Apache Kafka consumer group?
3) https://cwiki.apache.org/confluence/display/KAFKA/KIP-186%3A+Increase+offsets+retention+default+to+7+days
The following JIRA discussion also states this issue:
https://issues.apache.org/jira/browse/KAFKA-3806
After reading up on this I have established an understanding of the cases in which stream consumers might reprocess data.
However, with the default configs mentioned above (the ones used in my setup), if offsets are lost, i.e. offsets.retention.minutes has expired, then the consumer would rebalance and start from the latest committed offset (which wouldn't exist), and any new incoming data would be processed as is. In this scenario there shouldn't be any data reprocessing and hence no duplicates.
In the case of a system time change, however, offsets might become inconsistent, i.e. it is possible for an offset of a source topic to have a CommitTime of an earlier date after a CommitTime of a later date. In this case, if a topic has low traffic and receives no data for more than offsets.retention.minutes, its offset would no longer be available, while another topic with high traffic would still have its offset in the __consumer_offsets topic.
How would the stream consumer behave in this scenario? Is there a chance of duplication? I am really confused about it. Any help would be really appreciated.
Related
Following up on this question, I would like to know the semantics between consumer groups and offset expiry. In general, I'm curious to know how the Kafka protocol determines that some specific offset (for a consumer-group, topic, partition combination) has expired. Is it based on the periodic commits from consumers that are part of the group protocol, or does the offset-expiry clock only start after all consumers are deemed dead/closed? I'm thinking this could have repercussions when dealing with topic partitions to which data isn't produced frequently. In my case, we have a consumer group reading from a fairly idle topic (not much data produced). Since the consumer group doesn't periodically commit any offsets, can we ever be in danger of losing previously committed offsets? For example, when some unforeseen rebalance happens, the topic partitions could get reassigned with the offset commits lost, and this could cause the consumer to read data from the earliest (configured auto.offset.reset) point.
For user-topics, offset expiry / topic retention is completely decoupled from consumer-group offsets. Segments do not "reopen" when a consumer accesses them.
At a minimum, segment.bytes, retention.ms (or retention.minutes/retention.hours), and retention.bytes all determine when log segments get deleted.
For the internal __consumer_offsets topic, offsets.retention.minutes controls when committed offsets are removed (also in coordination with its segment.bytes).
The LogCleaner thread, not the consumers, actively removes closed segments on a periodic basis. If a consumer is lagging considerably and requests offsets from a segment that has already been deleted, then auto.offset.reset gets applied.
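If it helps, here is a minimal sketch (broker address and topic name are placeholders) that reads those per-topic settings with the Java AdminClient, just to illustrate that segment deletion is driven by topic-level configs and not by consumer-group offsets:

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.Config;
import org.apache.kafka.common.config.ConfigResource;

public class ShowTopicRetention {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        try (AdminClient admin = AdminClient.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "my-topic"); // placeholder
            Config config = admin.describeConfigs(Collections.singleton(topic)).all().get().get(topic);
            // These per-topic settings drive log-segment deletion,
            // independently of any consumer-group offsets.
            System.out.println("retention.ms    = " + config.get("retention.ms").value());
            System.out.println("retention.bytes = " + config.get("retention.bytes").value());
            System.out.println("segment.bytes   = " + config.get("segment.bytes").value());
        }
    }
}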
I have a service that connects to Kafka as a message consumer, and for every message I read I commit that message's offset, so that if my service shuts down and restarts it will resume reading from the last read message onwards. My understanding is that the committed offset will be maintained by Kafka.
Now my question is, do I have to worry about the offset? Can Kafka somehow lose that information and, when the service restarts, start reading messages from the beginning or the end of the topic, depending on my initial offset config? Or if Kafka loses my offset, will it also have lost all messages in the topic, so that it is alright to read from the beginning?
Note: I use spring-kafka on the service, but not sure if that is relevant to the question.
In most cases where you have an active consumer (with manual or auto-committing), you don't need to worry about it.
The case where you do need to consider the behavior of the auto.offset.reset setting is when the offsets.retention.minutes time on the broker has elapsed while your consumer group(s) are inactive. When this happens, Kafka compacts the __consumer_offsets topic and removes any offsets stored for those inactive groups.
Losing offsets doesn't affect the source topic. Your client topic(s) have their own independent retention settings, and their messages can be removed as well (or not), depending on how you've configured them.
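For what it's worth, here is a rough sketch of the consumer side (group id, topic, and broker address are made up): auto.offset.reset is only consulted when the group has no valid committed offset, and committing after each record keeps the stored offset current, exactly as you describe.

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class CommittingConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-service");              // placeholder
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
        // Only used when no committed offset exists for the group, e.g. a brand-new
        // group or one whose offsets expired after offsets.retention.minutes.
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic"));        // placeholder
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(500);
                for (ConsumerRecord<String, String> record : records) {
                    // process record ...
                }
                consumer.commitSync(); // commit after processing, as in your setup
            }
        }
    }
}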
I have a strange issue that I cannot figure out how to resolve. I have a Kafka Streams app (2.1.0) that reads from a topic with around 40 partitions. The partitions use a range partition policy, so at the moment some of them can be completely empty.
My issue is that during the downtime of the app one of those empty partitions was activated and a number of events were written to it. When the app came back up, though, it read all the events from the other partitions but ignored the events already stored in the previously empty partition (the app has OffsetResetPolicy LATEST for the specific topic). On top of that, when newer messages arrived on that partition it did consume them and somehow bypassed the previous ones.
My assumption is that __consumer_offsets does not have any entry for that partition when restoring, but how can I avoid this situation without losing events? I mean, the topic already exists with the specified number of partitions.
Does this sound familiar to anybody? Am I missing something? Do I need to set some parameter in Kafka? I cannot figure out why this is happening.
This is expected behaviour.
Your empty partition does not have a committed offset in __consumer_offsets. If there are no committed offsets for a partition, the policy specified in auto.offset.reset is used to decide at which offset to start consuming events.
If auto.offset.reset is set to LATEST, your Streams app will start consuming at the latest offset in the partition, i.e., after the events that were added during the downtime, so it will only consume events written to the partition after the downtime.
If auto.offset.reset is set to EARLIEST, your Streams app will start from the earliest offset in the partition and read also the events written to the partition during downtime.
As @mazaneica mentioned in a comment to your question, auto.offset.reset only affects partitions without a committed offset. So your non-empty partitions will be fine, i.e., the Streams app will consume events from where it stopped before the downtime.
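If you do want to pick up the events that arrived while the app was down, one option (sketch only; the topic name and serdes are assumptions) is to override the reset policy for that source when building the topology, rather than changing the global default:

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.Consumed;

public class ResetPolicyExample {
    public static Topology buildTopology() {
        StreamsBuilder builder = new StreamsBuilder();
        // Partitions of this source that have no committed offset start from the
        // earliest available offset; partitions with committed offsets are unaffected.
        builder.stream("my-input-topic", // placeholder
                       Consumed.with(Serdes.String(), Serdes.String())
                               .withOffsetResetPolicy(Topology.AutoOffsetReset.EARLIEST))
               .foreach((key, value) -> { /* process record ... */ });
        return builder.build();
    }
}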
Setup
myTopic has a single partition.
consumer_group is my Spring Boot app using the spring-kafka client, and there is always a single consumer for that consumer group. spring-kafka version 1.1.8.RELEASE.
I have a single broker node in kafka. Kafka version 0.10.1.1
When I query a particular consumer_group using Burrow, I see 15 offset entries for the same topic.
Observations
curl http://burrow-node:8000/v3/kafka/mykafka-1/consumer/my_consumer_grp
"myTopic":[
{"offsets":[
{"offset":6671,"timestamp":1533099130556,"lag":0},
{"offset":6671,"timestamp":1533099135556,"lag":0},
{"offset":6671,"timestamp":1533099140558,"lag":0},
{"offset":6671,"timestamp":1533099145558,"lag":0},
{"offset":6671,"timestamp":1533099150557,"lag":0},
{"offset":6671,"timestamp":1533099155558,"lag":0},
{"offset":6671,"timestamp":1533099160561,"lag":0},
{"offset":6671,"timestamp":1533099165559,"lag":0},
{"offset":6671,"timestamp":1533099170560,"lag":0},
{"offset":6671,"timestamp":1533099175561,"lag":0},
{"offset":6671,"timestamp":1533099180562,"lag":0},
{"offset":6671,"timestamp":1533099185562,"lag":0},
{"offset":6671,"timestamp":1533099190563,"lag":0},
{"offset":6671,"timestamp":1533099195562,"lag":0},
{"offset":6671,"timestamp":1533099200564,"lag":0}
]
More Observations
When I restarted the app, I didn't find a new offset entry being created; only the timestamp kept updating, which is probably due to auto.commit.interval.ms.
When I started producing/consuming, I saw the offset and lag change in one of the entries; later the other entries caught up, which made me think they are replicas.
offsets.retention.minutes is at its default of 1440.
Questions
Why do we have 15 offset entries in burrow reports?
If they are replicas, why does a single-partition topic get split up into 14 different replicas under __consumer_offsets? Is there any documentation for this?
If they are NOT replicas, what else are they?
Here's my understanding, based on the docs. Burrow stores a configurable number of committed offsets. It's a rolling window: every time a consumer commits, Burrow stores the committed offset and the lag at the time of the commit. What you are seeing is likely the result of a storage config something like this (culled from burrow.toml):
[storage.default]
class-name="inmemory"
workers=20
intervals=15
expire-group=604800
min-distance=1
Note that intervals is set to 15.
I believe this feature is simply to provide some history of consumer group commits and associated lags, and has nothing to do with replicas.
EDIT:
The Consumer Lag Evaluation Rules page on the Burrow wiki explains this functionality in more detail. In short, this configurable window of offset/lag data is used to calculate consumer group status.
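For context, the ~5 second spacing of the timestamps in your output is consistent with a consumer that auto-commits on a fixed interval (the auto.commit.interval.ms you mention); a rough sketch of the relevant consumer settings, with assumed values, is below.

import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;

public class CommitIntervalSketch {
    public static Properties consumerProps() {
        Properties props = new Properties();
        // Assumed values: with auto-commit every 5 seconds the group coordinator
        // receives a commit roughly every 5s even when the offset does not move,
        // and Burrow samples each commit into its rolling window of `intervals` entries.
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "true");
        props.put(ConsumerConfig.AUTO_COMMIT_INTERVAL_MS_CONFIG, "5000");
        return props;
    }
}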
I am relatively new to Kafka. I have done a bit of experimenting with it, but a few things are unclear to me regarding consumer offsets. From what I have understood so far, when a consumer starts, the offset from which it will start reading is determined by the configuration setting auto.offset.reset (correct me if I am wrong).
Now say for example that there are 10 messages (offsets 0 to 9) in the topic, and a consumer happened to consume 5 of them before it went down (or before I killed the consumer). Then say I restart that consumer process. My questions are:
If the auto.offset.reset is set to earliest, is it always going to start consuming from offset 0?
If the auto.offset.reset is set to latest, is it going to start consuming from offset 5?
Is the behavior regarding this kind of scenario always deterministic?
Please don't hesitate to comment if anything in my question is unclear.
It is a bit more complex than you described.
The auto.offset.reset config kicks in ONLY if your consumer group does not have a valid offset committed somewhere (the two currently supported offset storages are Kafka and ZooKeeper), and it also depends on what sort of consumer you use.
If you use the high-level Java consumer, then imagine the following scenarios:
You have a consumer in a consumer group group1 that has consumed 5 messages and died. The next time you start this consumer, it won't even use the auto.offset.reset config; it will continue from the place where it died, because it will just fetch the stored offset from the offset storage (Kafka or ZooKeeper, as mentioned).
You have messages in a topic (like you described) and you start a consumer in a new consumer group group2. There is no offset stored anywhere, and this time the auto.offset.reset config decides whether to start from the beginning of the topic (earliest) or from the end of the topic (latest).
One more thing that affects which offset values correspond to the earliest and latest configs is the log retention policy. Imagine you have a topic with retention configured to 1 hour. You produce 5 messages, and then an hour later you produce 5 more. The latest offset will still be the same as in the previous example, but the earliest one can no longer be 0, because Kafka will already have removed those messages, and thus the earliest available offset will be 5.
None of the above applies to the SimpleConsumer; every time you run it, it decides where to start from using the auto.offset.reset config.
If you use a Kafka version older than 0.9, you have to replace earliest and latest with smallest and largest.
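As a small illustration of the difference between the two scenarios (broker address, group id, topic, and partition number are placeholders), you can check whether a group already has a stored offset, which is exactly what decides if auto.offset.reset will be consulted:

import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class CommittedOffsetCheck {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "group1");                  // placeholder
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            OffsetAndMetadata committed = consumer.committed(new TopicPartition("my-topic", 0)); // placeholder
            if (committed == null) {
                System.out.println("no stored offset -> auto.offset.reset decides where to start");
            } else {
                System.out.println("stored offset " + committed.offset() + " -> auto.offset.reset is ignored");
            }
        }
    }
}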
Just an update: from Kafka 0.9 onwards, Kafka uses a new Java version of the consumer, and the auto.offset.reset parameter values have changed; from the manual:
What to do when there is no initial offset in Kafka or if the current offset does not exist any more on the server (e.g. because that data has been deleted):
earliest: automatically reset the offset to the earliest offset
latest: automatically reset the offset to the latest offset
none: throw exception to the consumer if no previous offset is found for the consumer's group
anything else: throw exception to the consumer.
I spent some time to find this after checking the accepted answer, so I thought it might be useful for the community to post it.
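And a small sketch of the none option (broker, group, and topic names are placeholders): if no committed offset exists for the group, poll() fails with NoOffsetForPartitionException instead of silently jumping to either end of the log, so the application can decide explicitly where to start.

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.NoOffsetForPartitionException;
import org.apache.kafka.common.serialization.StringDeserializer;

public class NoneResetExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "fresh-group");             // placeholder
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "none");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic"));        // placeholder
            consumer.poll(1000); // throws if the group has no committed offset here
        } catch (NoOffsetForPartitionException e) {
            // No committed offset and no fallback policy: seek explicitly instead,
            // e.g. seekToBeginning(...) or seekToEnd(...).
            System.out.println("No committed offset: " + e.getMessage());
        }
    }
}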
Furthermore, there's offsets.retention.minutes. If the time since the last commit is greater than offsets.retention.minutes, then auto.offset.reset also kicks in.