Stream processing from a specific offset to an end offset

Is it possible to do kafka stream processing from a specific offset of input topic to an end offset?
I have a Kafka Streams application which consumes an input topic, but for some reason it failed. I fixed the issue and started it again, but now it consumes from the latest offset of the input topic. I know the offset of the input topic up to which the application had processed before the failure. How can I process the input topic from one offset to another? I am using Confluent Platform 5.1.2.

Kafka Streams supports only two possible values for auto.offset.reset: "earliest" or "latest". You can't set it to a specific offset in your application code.
There is, however, an option when resetting the application. If you use the application reset tool, you can pass the --to-offset parameter with the specific offset; it will reset the application's input topics to that point:
<path-to-confluent>/bin/kafka-streams-application-reset --application-id app1 --input-topics a,b --to-offset 1000
You can find the details in the documentation:
https://docs.confluent.io/5.1.2/streams/developer-guide/app-reset-tool.html
If you are fixing a bug, it is usually better to reset to the earliest offset, if possible.
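Note that after running the reset tool you should also wipe the application's local state before restarting, otherwise it may be inconsistent with the reset offsets. A minimal sketch (the application id must match the one passed to the reset tool; broker address is a placeholder and topology details are omitted):

import java.util.Properties;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

StreamsBuilder builder = new StreamsBuilder();
// ... build your topology on the builder here ...

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "app1"); // must match --application-id
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

KafkaStreams streams = new KafkaStreams(builder.build(), props);
streams.cleanUp(); // wipe local state stores so they are rebuilt from the reset offsets
streams.start();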

Related

Is there a common offset value that spans across Kafka partitions?

I am experimenting with Kafka as an SSE (server-sent events) holder on the server side, and I want "replay capability". Say each Kafka topic is of the form events.<username> and has a retention policy that deletes items older than some time X.
Now what I want is an API that looks like
GET /events/offset=n
offset would be the last offset processed by the client; if not specified, it defaults to the latest offset + 1, which means no new results. It can also be earliest, which represents the earliest possible entry. The offset needs to exist, as a security-through-obscurity check.
My suspicion is that for this to work correctly, the topic must remain in ONE partition and cannot scale horizontally. However, because the topics are tied to a username, distribution across brokers would still be handled by the fact that the topics themselves are different.
If you want to retain the event sequence for each of the per-user topics, then yes, you have to use exactly one partition per user: Kafka only guarantees message ordering within a single partition, not across multiple partitions.
The earliest and latest options you mention are already supported by any basic Kafka consumer configuration. The specific-offset case you have to handle manually: seek to the given offset, and return nothing if the first message you receive back does not match the requested offset.
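A minimal sketch of that specific-offset check with the plain Java consumer (topic name, partition, and offset are placeholder values); the partition is assigned directly, so no consumer group is involved:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

long requestedOffset = 42L;                                // offset supplied by the client
TopicPartition tp = new TopicPartition("events.alice", 0); // single-partition per-user topic

try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
    consumer.assign(Collections.singletonList(tp)); // direct assignment, no group management
    consumer.seek(tp, requestedOffset);
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
    if (!records.isEmpty()) {
        ConsumerRecord<String, String> first = records.iterator().next();
        if (first.offset() != requestedOffset) {
            // the requested offset no longer exists (e.g. deleted by retention):
            // return nothing to the client
        }
    }
}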

Kafka: how does consumer offsets work with dynamically created group ids?

In Kafka, each consumer group is identified by a unique group.id property, and each consumer group manages its own offsets (stored in the __consumer_offsets topic). What happens to these offsets if I always start my consumer service with a dynamically generated group.id value?
Will this new consumer group always read from the beginning of the topic since it has no offset, or will 'auto.offset.reset' take effect?
If you generate a new group.id each time your application starts, the consumer will rely on auto.offset.reset to find its starting position, because there won't be any stored offsets for a brand-new group.
With auto.offset.reset, you can instruct the consumer to start either from the beginning of the log (earliest) or from the end (latest).
Note that at startup you can also control the position in your application logic and explicitly seek to an arbitrary position, based on whatever criteria you want.
A relatively common pattern is to start from a position derived from time, for example seeking to 1 hour ago or to the start of the day. This can be done using offsetsForTimes() and seek(), as in the sketch below.
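A minimal sketch of the time-based pattern (topic, partition, and lookback window are placeholder values):

import java.time.Duration;
import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndTimestamp;
import org.apache.kafka.common.TopicPartition;

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("group.id", "my-app-" + System.currentTimeMillis()); // dynamically generated group
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
    TopicPartition tp = new TopicPartition("my-topic", 0);
    consumer.assign(Collections.singletonList(tp));

    // Find the first offset whose timestamp is at or after "1 hour ago"
    long oneHourAgo = System.currentTimeMillis() - Duration.ofHours(1).toMillis();
    Map<TopicPartition, OffsetAndTimestamp> offsets =
            consumer.offsetsForTimes(Collections.singletonMap(tp, oneHourAgo));

    OffsetAndTimestamp ot = offsets.get(tp);
    if (ot != null) {                   // null if no message at/after that timestamp
        consumer.seek(tp, ot.offset()); // start reading from roughly 1 hour ago
    }
}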

Kafka auto.offset.reset query

My project uses Kafka version 0.10.2. I am setting enable.auto.commit=false and auto.offset.reset=latest in the consumer. If the consumer is restarted after maintenance, it reads again from the first offset instead of waiting for new messages at the latest offset. Why is this happening? Have I misunderstood the configurations?
My requirement is the consumer should not auto commit and should read only the new messages put into the topic when it is active.
Just because you aren't auto-committing doesn't guarantee there are no manual commits.
Regardless, auto.offset.reset=latest will never send the consumer group to the beginning of the topic. It sounds like whatever Kafka tool or library you are using is calling consumer.seekToBeginning() on its own.
For understanding purposes: the consumer property auto.offset.reset determines what to do when there is no valid offset in Kafka for the consumer's consumer group, which happens in the following scenarios:
– when a particular consumer group starts for the first time
– when the consumer's committed offset is less than the smallest available offset
– when the consumer's committed offset is greater than the last offset
The value can be one of:
– earliest: automatically reset the offset to the earliest available offset
– latest: automatically reset the offset to the latest available offset
– none: throw an exception if no previous offset can be found for the consumer group
The default is latest. A configuration sketch follows below.
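In code, auto.offset.reset is just another consumer property; a minimal sketch of setting it (broker address and group name are placeholders):

import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;

Properties props = new Properties();
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-group");
props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false"); // no automatic commits
// Only consulted when the group has no valid committed offset:
props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "latest"); // or "earliest" / "none"
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
        "org.apache.kafka.common.serialization.StringDeserializer");
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
        "org.apache.kafka.common.serialization.StringDeserializer");

KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);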

kafka subscribe commit offset manually

I am using Kafka 0.9 and am confused by the behavior of subscribe.
Why does it expect group.id with subscribe?
Do we need to commit the offset manually using commitSync? Even if I don't do that, I see that it always starts from the latest.
Is there a way to replay the messages from the beginning?
Why does it expect group.id with subscribe?
The concept of consumer groups is used by Kafka to enable parallel consumption of topics: every message will be delivered once per consumer group, no matter how many consumers that group actually contains. This is why the group parameter is mandatory; without a group, Kafka would not know how this consumer should be treated in relation to other consumers that might subscribe to the same topic.
Whenever you start a consumer, it joins a consumer group, and based on how many other consumers are in that group it is assigned partitions to read from. For these partitions it then checks whether a last read offset is known; if one is found, it starts reading messages from that point.
If no offset is found, the parameter auto.offset.reset controls whether reading starts at the earliest or latest message in the partition.
Do we need to commit the offset manually using commitSync? Even if I don't do that, I see that it always starts from the latest.
Whether or not you need to commit the offset depends on the value you choose for the parameter enable.auto.commit. By default this is set to true, which means the consumer will automatically commit its offset regularly (how often is defined by auto.commit.interval.ms). If you set this to false, then you will need to commit the offsets yourself.
This default behavior is probably also what is causing your "problem" of the consumer always starting at the latest message: since the offset was auto-committed, that offset will be used on the next start. A sketch of manual committing follows.
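A minimal sketch of manual offset committing (topic and group names are placeholders): with enable.auto.commit=false, offsets are only stored when commitSync() is called explicitly:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("group.id", "my-group");
props.put("enable.auto.commit", "false"); // we commit ourselves
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
    consumer.subscribe(Collections.singletonList("my-topic"));
    while (true) {
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
        for (ConsumerRecord<String, String> record : records) {
            System.out.println(record.offset() + ": " + record.value()); // your processing here
        }
        consumer.commitSync(); // commit only after the batch has been processed
    }
}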
Is there a way to replay the messages from the beginning?
If you want to start reading from the beginning every time, you can call seekToBeginning, which will reset to the first message in all subscribed partitions if called without parameters, or just for the partitions that you pass in, as sketched below.
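A minimal sketch (topic and group names are placeholders); note that with subscribe(), partitions are only assigned during the first poll, so a common approach is to seek from the rebalance callback:

import java.util.Collection;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("group.id", "replay-group");
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
consumer.subscribe(Collections.singletonList("my-topic"), new ConsumerRebalanceListener() {
    @Override
    public void onPartitionsRevoked(Collection<TopicPartition> partitions) { }

    @Override
    public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
        consumer.seekToBeginning(partitions); // replay from the first available message
    }
});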

What determines Kafka consumer offset?

I am relatively new to Kafka. I have done a bit of experimenting with it, but a few things are unclear to me regarding consumer offset. From what I have understood so far, when a consumer starts, the offset it will start reading from is determined by the configuration setting auto.offset.reset (correct me if I am wrong).
Now say for example that there are 10 messages (offsets 0 to 9) in the topic, and a consumer happened to consume 5 of them before it went down (or before I killed the consumer). Then say I restart that consumer process. My questions are:
If the auto.offset.reset is set to earliest, is it always going to start consuming from offset 0?
If the auto.offset.reset is set to latest, is it going to start consuming from offset 5?
Is the behavior regarding this kind of scenario always deterministic?
Please don't hesitate to comment if anything in my question is unclear.
It is a bit more complex than you described.
The auto.offset.reset config kicks in ONLY if your consumer group does not have a valid offset committed somewhere (the two supported offset storages are currently Kafka and ZooKeeper), and it also depends on what sort of consumer you use.
If you use the high-level Java consumer, then imagine the following scenarios:
You have a consumer in a consumer group group1 that has consumed 5 messages and died. The next time you start this consumer, it won't even use the auto.offset.reset config; it will continue from the place it died, because it just fetches the stored offset from the offset storage (Kafka or ZooKeeper, as mentioned).
You have messages in a topic (as you described) and you start a consumer in a new consumer group group2. There is no offset stored anywhere, and this time the auto.offset.reset config will decide whether to start from the beginning of the topic (earliest) or from the end of the topic (latest).
One more thing that affects which offset values correspond to the earliest and latest configs is the log retention policy. Imagine you have a topic with retention configured to 1 hour. You produce 5 messages, and then an hour later you post 5 more. The latest offset will remain the same as in the previous example, but the earliest one won't be 0, because Kafka will already have removed the first messages, and thus the earliest available offset will be 5.
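You can observe this directly: the consumer API exposes beginningOffsets() and endOffsets(), which report what earliest and latest currently resolve to for a partition (topic name is a placeholder):

import java.util.Collections;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
    List<TopicPartition> tps = Collections.singletonList(new TopicPartition("my-topic", 0));
    // After retention has deleted old segments, the beginning offset moves forward.
    System.out.println("earliest: " + consumer.beginningOffsets(tps));
    System.out.println("latest:   " + consumer.endOffsets(tps));
}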
None of the above applies to the SimpleConsumer; every time you run it, it will decide where to start from using the auto.offset.reset config.
If you use a Kafka version older than 0.9, you have to replace earliest and latest with smallest and largest.
Just an update: from Kafka 0.9 onward, Kafka uses the new Java version of the consumer, and the auto.offset.reset parameter names have changed. From the manual:
What to do when there is no initial offset in Kafka or if the current offset does not exist any more on the server (e.g. because that data has been deleted):
– earliest: automatically reset the offset to the earliest offset
– latest: automatically reset the offset to the latest offset
– none: throw exception to the consumer if no previous offset is found for the consumer's group
– anything else: throw exception to the consumer.
I spent some time finding this after checking the accepted answer, so I thought it might be useful for the community to post it here.
Furthermore, there's offsets.retention.minutes: if the time since the last commit exceeds offsets.retention.minutes, the committed offset expires and auto.offset.reset kicks in as well.