I have set up the Kafka spout for a Storm pipeline. I want to read neither from the latest offset nor from the beginning. Is there a way to start from a configurable offset instead of the one stored in ZooKeeper? Storm lets us read from the latest offset or from the beginning, and neither fits my case.
Use case:
Offset 0: deployed the topology.
Offset 50: changed the topology.
Offset 100: detected that the changed topology has a bug, and want to restart from offset 50.
How can I achieve this?
KafkaSpout reads the last committed offset from ZooKeeper. If there is no offset in ZooKeeper, it uses the configured startOffsetTime. The default configuration of KafkaSpout is the following:
public long startOffsetTime = kafka.api.OffsetRequest.EarliestTime();
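If you change the value of startOffsetTime and set KafkaConfig.ignoreZkOffsets = true, I think you can make the consumer start from a specific position. Note that startOffsetTime is a timestamp (or the earliest/latest sentinel), so the spout starts from the offset Kafka resolves for that time rather than from an arbitrary offset number.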
If ignoreZkOffsets equals true, the spout will always begin reading from the offset defined by KafkaConfig.startOffsetTime as described above.
Also, have a look at this article: How do I accurately get offsets of messages for a certain timestamp using OffsetRequest?
Kafka enable.auto.commit is set to false and the Spark version is 2.4.
When using the latest offset, do we need to find the last offset manually and pass it to createDirectStream() in the Spark application, or will it pick up the latest offset automatically? In either case, do we need to look up the last offset ourselves?
Is there any difference between using SparkSession.readStream.format("kafka")... and KafkaUtils.createDirectStream()?
When using the earliest offset option, will it pick up the offset automatically?
Here is my attempt to answer your questions
Ques 1: enable.auto.commit is a Kafka parameter; if it is set to false, you have to commit (read: update) your offset to the checkpoint directory manually. If your application restarts, it looks into the checkpoint directory and starts reading from the last committed offset + 1. The same is mentioned here https://jaceklaskowski.gitbooks.io/apache-kafka/content/kafka-properties-enable-auto-commit.html by jaceklaskowski. There is no need to specify the offset anywhere in your Spark application; all you need is the checkpoint directory. Also, remember that offsets are maintained per partition, per topic, per consumer, so it would be unreasonable for Spark to expect developers/users to provide them.
Ques 2: spark.readStream is a generic method for reading from streaming sources such as TCP sockets, Kafka topics, etc., while KafkaUtils is a dedicated class for integrating Spark with Kafka, so I assume it is more optimised if you are using Kafka topics as the source. I usually use KafkaUtils myself, though I haven't done any performance benchmarks. Both can subscribe to more than one topic: KafkaUtils takes a collection of topics, and the Kafka source for readStream accepts a comma-separated list in its subscribe option.
Ques 3: earliest offset means your consumer starts reading from the oldest record available. For example, if your topic is new (no cleanup has happened yet) or cleanup is not configured for the topic, it starts from offset 0. If cleanup is configured and all records up to offset 2000 have been removed, records are read from offset 2001, while the topic may have records up to offset 10000 (this assumes there is only one partition; in topics with multiple partitions the offset values will differ per partition). See the section on batch queries here https://spark.apache.org/docs/2.2.0/structured-streaming-kafka-integration.html for more details.
If you take a look at documentation of kafka connector for Spark, you can find most of the answers.
Here is the documentation for the startingOffsets option of the Kafka connector; the last part covers streaming queries.
The start point when a query is started, either "earliest" which is from the earliest offsets, "latest" which is just from the latest offsets, or a json string specifying a starting offset for each TopicPartition. In the json, -2 as an offset can be used to refer to earliest, -1 to latest. Note: For batch queries, latest (either implicitly or by using -1 in json) is not allowed. For streaming queries, this only applies when a new query is started, and that resuming will always pick up from where the query left off. Newly discovered partitions during a query will start at earliest.
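To make the JSON form of startingOffsets concrete, here is a minimal sketch in plain Java that builds such a string. The class name StartingOffsetsJson, the helper toJson and the topic name events are made up for illustration; only the -2 / -1 sentinels come from the documentation quoted above. The resulting string is what you would pass as the value of the startingOffsets option.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper, not part of Spark or Kafka.
final class StartingOffsetsJson {
    // Sentinel values described in the quoted documentation.
    static final long EARLIEST = -2L;
    static final long LATEST = -1L;

    /** Renders topic -> (partition -> offset) as a JSON object string. */
    static String toJson(Map<String, Map<Integer, Long>> offsets) {
        StringJoiner topics = new StringJoiner(",", "{", "}");
        for (Map.Entry<String, Map<Integer, Long>> topic : offsets.entrySet()) {
            StringJoiner partitions = new StringJoiner(",", "{", "}");
            for (Map.Entry<Integer, Long> p : topic.getValue().entrySet()) {
                // Partition ids are JSON keys, so they are quoted; offsets are plain numbers.
                partitions.add("\"" + p.getKey() + "\":" + p.getValue());
            }
            topics.add("\"" + topic.getKey() + "\":" + partitions);
        }
        return topics.toString();
    }

    public static void main(String[] args) {
        Map<Integer, Long> partitions = new LinkedHashMap<>();
        partitions.put(0, 50L);       // start partition 0 at offset 50
        partitions.put(1, EARLIEST);  // start partition 1 at the earliest offset
        Map<String, Map<Integer, Long>> offsets = new LinkedHashMap<>();
        offsets.put("events", partitions);
        System.out.println(toJson(offsets));
    }
}
```

With partition 0 at offset 50 and partition 1 at earliest, this prints {"events":{"0":50,"1":-2}}. As the quote says, the value only matters when a new query is started; a resumed query continues from where it left off.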
If stored offsets are available, it always resumes from them; otherwise it asks Kafka for the earliest or latest offset. This should hold for both direct streams and structured streams.
I see that you mentioned the enable.auto.commit option, and I just want to make sure you're aware of the following quote from the same documentation site I linked above.
Note that the following Kafka params cannot be set and the Kafka source or sink will throw an exception: enable.auto.commit: Kafka source doesn’t commit any offset.
My project uses Kafka 0.10.2. I am setting enable.auto.commit=false and auto.offset.reset=latest in the consumer. If the consumer is restarted after maintenance, it reads again from the first offset instead of waiting for new messages at the latest offset. Why is this happening? Have I misunderstood the configuration?
My requirement is that the consumer should not auto-commit and should read only the new messages put into the topic while it is active.
Just because you aren't auto-committing doesn't guarantee there are no manual commits.
Regardless, auto.offset.reset=latest will never send the consumer group to the beginning of the topic. It sounds like whatever Kafka tool / library you are using is calling consumer.seekToBeginning on its own.
For understanding purposes: the consumer property auto.offset.reset determines what to do if there is no valid offset in Kafka for the consumer's consumer group, in the following scenarios:
– When a particular Consumer Group starts the first time
– If the Consumer offset is less than the smallest offset
– If the Consumer offset is greater than the last offset
▪ The value can be one of:
– earliest: Automatically reset the offset to the earliest available
– latest: Automatically reset to the latest offset available
– none: Throw an exception if no previous offset can be found for the consumer's group
▪ The default is latest
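As a rough illustration of the configuration discussed here, the sketch below collects the two settings from the question into a java.util.Properties object, which is what you would hand to the KafkaConsumer constructor. The class name, the broker address localhost:9092 and the group id are placeholders I made up, and the key/value deserializer settings a real consumer needs are left out for brevity.

```java
import java.util.Properties;

// Hypothetical helper; only the property keys and values are real Kafka consumer settings.
final class LatestOnlyConsumerConfig {
    static Properties build(String bootstrapServers, String groupId) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", bootstrapServers);
        props.setProperty("group.id", groupId);
        // No automatic commits: offsets are stored only if the application commits them itself.
        props.setProperty("enable.auto.commit", "false");
        // Consulted only when the group has no valid committed offset.
        props.setProperty("auto.offset.reset", "latest");
        return props;
    }

    public static void main(String[] args) {
        Properties props = build("localhost:9092", "my-group");
        System.out.println("auto.offset.reset = " + props.getProperty("auto.offset.reset"));
        System.out.println("enable.auto.commit = " + props.getProperty("enable.auto.commit"));
    }
}
```

Keep in mind that auto.offset.reset is only a fallback: if anything commits an offset for the group, manually or through a framework, a restart resumes from that committed offset and not from the latest one.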
Is it possible to do Kafka Streams processing from a specific offset of the input topic to an end offset?
I have a Kafka Streams application that consumes an input topic, but for some reason it failed. I fixed the issue and started it again, but it started consuming from the latest offset of the input topic. I know the offset up to which the application has processed the input topic. Now, how can I process the input topic from one offset to another? I am using Confluent Platform 5.1.2.
By default, KStreams supports two possible values for auto.offset.reset. It could be either "earliest" or "latest". You can't set it to a specific offset in your application code.
There is an option when resetting the application. If you use the application reset script, you can pass the --to-offset option with the specific offset, and the application will be reset to that point.
<path-to-confluent>/bin/kafka-streams-application-reset --application-id app1 --input-topics a,b --to-offset 1000
You can find the details in the documentation:
If you are fixing bugs, it is better to reset to the earliest state if possible.
I hope I am not making a mistake, but I remember that the Kafka documentation mentioned that with the high-level APIs you can't start reading messages from a specific offset, and that this would change.
Is it possible now using the high level APIs to read messages from a specific partition and a specific offset? Could you please give me an example how to do it?
I am using kafka
Thanks in advance.
You can do that with Kafka 0.9:
public void seek(TopicPartition partition, long offset)
Overrides the fetch offsets that the consumer will use on the next poll(timeout). If this API is invoked for the same partition more than once, the latest offset will be used on the next poll(). Note that you may lose data if this API is arbitrarily used in the middle of consumption, to reset the fetch offsets.
Kafka can use ZooKeeper to store offsets for each consumer group. If you configure your consumer to commit offsets to ZooKeeper, then you just need to manually set the starting offset for the topic and partition under ZooKeeper for your consumer group.
You need to connect to ZooKeeper and use the set command:
set /consumers/[groupId]/offsets/[topic]/[partitionId] -> long (offset)
E.g. setting offset 10 for partition 0 of topicname for the spark-app consumer group:
set /consumers/spark-app/offsets/topicname/0 10
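If you need to do this for several partitions, it may help to generate the commands instead of typing them. The sketch below only builds the strings from the path layout shown above; the class and method names are my own, and it does not talk to ZooKeeper at all.

```java
// Hypothetical helper that formats the znode path and the zkCli "set" command
// for a consumer group's stored offset. It performs no I/O.
final class ZkOffsetPath {
    /** Builds /consumers/[groupId]/offsets/[topic]/[partitionId]. */
    static String offsetPath(String groupId, String topic, int partition) {
        return String.format("/consumers/%s/offsets/%s/%d", groupId, topic, partition);
    }

    /** Builds the "set <path> <offset>" line to paste into the ZooKeeper shell. */
    static String setCommand(String groupId, String topic, int partition, long offset) {
        return "set " + offsetPath(groupId, topic, partition) + " " + offset;
    }

    public static void main(String[] args) {
        // Print one command per partition, all reset to offset 10.
        for (int partition = 0; partition < 3; partition++) {
            System.out.println(setCommand("spark-app", "topicname", partition, 10L));
        }
    }
}
```

The first line printed is the same command as in the example above; stop the consumers of the group before changing the stored offsets so they do not overwrite your value on their next commit.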
When a consumer starts consuming messages from Kafka, it always starts from the last committed offset. If this last committed offset is not valid for any reason, the consumer applies the logic determined by the auto.offset.reset configuration property.
Hope this helps.
I am relatively new to Kafka. I have done a bit of experimenting with it, but a few things are unclear to me regarding consumer offset. From what I have understood so far, when a consumer starts, the offset it will start reading from is determined by the configuration setting auto.offset.reset (correct me if I am wrong).
Now say for example that there are 10 messages (offsets 0 to 9) in the topic, and a consumer happened to consume 5 of them before it went down (or before I killed the consumer). Then say I restart that consumer process. My questions are:
If the auto.offset.reset is set to earliest, is it always going to start consuming from offset 0?
If the auto.offset.reset is set to latest, is it going to start consuming from offset 5?
Is the behavior regarding this kind of scenario always deterministic?
Please don't hesitate to comment if anything in my question is unclear.
It is a bit more complex than you described.
The auto.offset.reset config kicks in ONLY if your consumer group does not have a valid offset committed somewhere (the two supported offset storages are currently Kafka and ZooKeeper), and it also depends on what sort of consumer you use.
If you use the high-level Java consumer, imagine the following scenarios:
You have a consumer in a consumer group group1 that has consumed 5 messages and died. Next time you start this consumer it won't even use that auto.offset.reset config and will continue from the place it died because it will just fetch the stored offset from the offset storage (Kafka or ZK as I mentioned).
You have messages in a topic (like you described) and you start a consumer in a new consumer group group2. There is no offset stored anywhere and this time the auto.offset.reset config will decide whether to start from the beginning of the topic (earliest) or from the end of the topic (latest)
One more thing that affects which offsets the earliest and latest settings correspond to is the log retention policy. Imagine you have a topic with retention configured to 1 hour. You produce 5 messages, and then an hour later you post 5 more messages. The latest offset will still be the same as in the previous example, but the earliest one can no longer be 0, because Kafka will already have removed the first messages, and thus the earliest available offset will be 5.
None of the above applies to SimpleConsumer: every time you run it, it decides where to start from using the auto.offset.reset config.
If you use a Kafka version older than 0.9, you have to replace earliest and latest with smallest and largest.
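The rules in this answer can be sketched as a small decision function. This is a simplified model in plain Java, not Kafka's actual code: the class, the enum and the parameter names are my own, and a real consumer signals the "none" case with its own exception types rather than IllegalStateException.

```java
import java.util.OptionalLong;

// Simplified model of where a high-level consumer starts reading one partition.
final class StartOffsetModel {
    enum Reset { EARLIEST, LATEST, NONE }

    /**
     * @param committed offset stored for the consumer group, empty if there is none
     * @param logStart  earliest offset still retained in the partition
     * @param logEnd    offset the next produced message will receive
     * @param reset     the auto.offset.reset setting
     */
    static long resolve(OptionalLong committed, long logStart, long logEnd, Reset reset) {
        if (committed.isPresent()) {
            long offset = committed.getAsLong();
            if (offset >= logStart && offset <= logEnd) {
                return offset; // valid stored offset: auto.offset.reset is not consulted
            }
        }
        // No stored offset, or it points outside the retained log.
        switch (reset) {
            case EARLIEST: return logStart;
            case LATEST:   return logEnd;
            default: throw new IllegalStateException("no valid offset and auto.offset.reset=none");
        }
    }

    public static void main(String[] args) {
        // group1: consumed and committed 5 of the 10 messages, then restarted.
        System.out.println(resolve(OptionalLong.of(5), 0, 10, Reset.LATEST));      // 5
        // group2: brand-new group, nothing stored.
        System.out.println(resolve(OptionalLong.empty(), 0, 10, Reset.EARLIEST));  // 0
        System.out.println(resolve(OptionalLong.empty(), 0, 10, Reset.LATEST));    // 10
        // Retention removed offsets 0..4, so "earliest" now means 5.
        System.out.println(resolve(OptionalLong.empty(), 5, 10, Reset.EARLIEST));  // 5
    }
}
```

So for the question as asked: with a committed offset of 5, the restarted consumer continues from 5 under either setting, and the setting only decides the outcome when nothing valid is stored.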
Just an update: from Kafka 0.9 onwards, Kafka uses a new Java version of the consumer, and the auto.offset.reset parameter values have changed. From the manual:
What to do when there is no initial offset in Kafka or if the current offset does not exist any more on the server (e.g. because that data has been deleted):
earliest: automatically reset the offset to the earliest offset
latest: automatically reset the offset to the latest offset
none: throw exception to the consumer if no previous offset is found for the consumer's group
anything else: throw exception to the consumer.
I spent some time to find this after checking the accepted answer, so I thought it might be useful for the community to post it.
Furthermore, there is offsets.retention.minutes: if the time since the last commit exceeds offsets.retention.minutes, the committed offset is discarded and auto.offset.reset kicks in as well.