Kafka offset in Spark - Scala

Kafka enable.auto.commit is set to false and the Spark version is 2.4.
When using the latest offset, do we need to find the last offset manually and pass it to KafkaUtils.createDirectStream() in the Spark application, or will it take the latest offset automatically? In either case, do we need to find the last offset details manually?
Is there any difference between using SparkSession.readStream.format("kafka")... and KafkaUtils.createDirectStream()?
When using the earliest offset option, will it pick up the offset automatically?

Here is my attempt to answer your questions.
Ques 1: enable.auto.commit is a Kafka consumer parameter. When it is set to false the consumer does not commit offsets on its own, so the application has to persist them itself, which in Spark usually means the checkpoint directory. If your application restarts, it will look into the checkpoint directory and resume from where it left off (the record after the last one it processed). The same is mentioned here https://jaceklaskowski.gitbooks.io/apache-kafka/content/kafka-properties-enable-auto-commit.html by jaceklaskowski. There is no need to specify the offset anywhere in your Spark application; all you need is the checkpoint directory. Also, remember that offsets are maintained per partition, per topic, per consumer group, so it would be unreasonable for Spark to expect developers/users to provide them.
Ques 2: spark.readStream is the generic Structured Streaming entry point for reading from streaming sources such as a TCP socket or Kafka topics, while KafkaUtils is a DStream-API class dedicated to integrating Spark with Kafka, so I assume it is more optimised when Kafka topics are the source. I usually use KafkaUtils myself, though I haven't done any performance benchmarks. Both can subscribe to more than one topic: KafkaUtils takes a collection of topics, and readStream accepts a comma-separated list of topics in its subscribe option.
Ques 3: earliest offset means your consumer will start reading from the oldest record available. For example, if your topic is new (no cleanup has happened) or cleanup is not configured for the topic, it will start from offset 0. If cleanup is configured and all records up to offset 2000 have been removed, records will be read from offset 2001, while the topic may have records up to offset 10000 (this assumes there is only one partition; in topics with multiple partitions the offset will differ per partition). See the section on batch queries here https://spark.apache.org/docs/2.2.0/structured-streaming-kafka-integration.html for more details.

If you take a look at the documentation of the Kafka connector for Spark, you can find most of the answers.
Here is the documentation of the startingOffsets option of the Kafka connector; the last part is about streaming queries.
The start point when a query is started, either "earliest" which is from the earliest offsets, "latest" which is just from the latest offsets, or a json string specifying a starting offset for each TopicPartition. In the json, -2 as an offset can be used to refer to earliest, -1 to latest. Note: For batch queries, latest (either implicitly or by using -1 in json) is not allowed. For streaming queries, this only applies when a new query is started, and that resuming will always pick up from where the query left off. Newly discovered partitions during a query will start at earliest.
If stored offsets are available, they will always be picked up; otherwise Spark will ask Kafka for the earliest or latest offset. This should be true for both types of streams: direct and structured streams both consider stored offsets.
I see that you mentioned the enable.auto.commit option, and I just want to make sure you're aware of the following quote from the same documentation site I provided above.
Note that the following Kafka params cannot be set and the Kafka source or sink will throw an exception: enable.auto.commit: Kafka source doesn’t commit any offset.
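To make the JSON form of startingOffsets from the quote above concrete, here is a minimal sketch that builds such a string from a map of per-partition offsets. The object name StartingOffsets and the helper toJson are made up for illustration and are not part of Spark; only the JSON shape and the -2 / -1 sentinels come from the documentation quoted above.

```scala
object StartingOffsets {
  // Sentinel values described in the startingOffsets documentation.
  val Earliest: Long = -2L
  val Latest: Long = -1L

  /** Renders Map(topic -> Map(partition -> offset)) as a JSON string such as
    * {"topicA":{"0":23,"1":-1}}. Topics and partitions are sorted so the
    * output is deterministic. Assumption: topic names need no JSON escaping.
    */
  def toJson(offsets: Map[String, Map[Int, Long]]): String =
    offsets.toSeq.sortBy(_._1).map { case (topic, partitions) =>
      val body = partitions.toSeq.sortBy(_._1)
        .map { case (partition, offset) => "\"" + partition + "\":" + offset }
        .mkString(",")
      "\"" + topic + "\":{" + body + "}"
    }.mkString("{", ",", "}")
}
```

The resulting string would then be passed as .option("startingOffsets", json) on a new streaming query; as the quote says, it is not consulted when a query resumes from a checkpoint.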

Related

Kafka auto.offset.reset query

My project uses Kafka version 0.10.2. I am setting enable.auto.commit=false and auto.offset.reset=latest in the consumer. If the consumer is restarted after maintenance, it reads again from the first offset instead of waiting for new messages at the latest offset. Any reason why this is happening? Have I misunderstood the configuration?
My requirement is that the consumer should not auto-commit and should read only the new messages put into the topic while it is active.
Just because you aren't auto-committing doesn't guarantee there are no manual commits.
Regardless, auto.offset.reset=latest will never send the consumer group to the beginning of the topic. It sounds like whatever Kafka tool / library you are using is calling consumer.seekToBeginning on its own.
For reference, the consumer property auto.offset.reset determines what to do if there is no valid offset in Kafka for the consumer’s consumer group, which happens in the following scenarios:
– When a particular consumer group starts for the first time
– If the consumer offset is less than the smallest offset
– If the consumer offset is greater than the last offset
▪ The value can be one of:
– earliest: Automatically reset the offset to the earliest available
– latest: Automatically reset to the latest offset available
– none: Throw an exception if no previous offset can be found for the consumer group
▪ The default is latest
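The rules above can be sketched as a small pure function. This is only a model of the decision, not Kafka client code: the names OffsetReset, startPosition, logStart and logEnd are invented for the example, and treating an out-of-range commit under none as an error is an assumption of the sketch.

```scala
object OffsetReset {
  sealed trait Policy
  case object Earliest extends Policy
  case object Latest extends Policy
  case object NoReset extends Policy // models auto.offset.reset=none

  /** Picks the position a consumer starts from.
    * committed: offset stored for the consumer group, if any
    * logStart:  smallest offset still available on the broker
    * logEnd:    offset the next produced record will get
    */
  def startPosition(committed: Option[Long], logStart: Long, logEnd: Long,
                    policy: Policy): Either[String, Long] =
    committed match {
      // A committed offset inside the log is always used; the policy is ignored.
      case Some(o) if o >= logStart && o <= logEnd => Right(o)
      // No commit, or a commit outside the log: the policy decides.
      case _ =>
        policy match {
          case Earliest => Right(logStart)
          case Latest   => Right(logEnd)
          case NoReset  => Left("no valid offset and auto.offset.reset=none")
        }
    }
}
```

For instance, startPosition(Some(5L), 0L, 10L, OffsetReset.Latest) yields Right(5), which is why latest alone never sends a group with a valid commit anywhere else.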

Kafka Burrow reports multiple offsets for the same consumer_group and same topic

Setup
myTopic has a single partition.
consumer_group is my spring-boot app using the spring-kafka client, and there is always a single consumer for that consumer group. spring-kafka version is 1.1.8.RELEASE.
I have a single broker node in Kafka. Kafka version is 0.10.1.1.
When I query a particular consumer_group using Burrow, I see 15 offset entries for the same topic.
Observations
curl http://burrow-node:8000/v3/kafka/mykafka-1/consumer/my_consumer_grp
"myTopic":[
{"offsets":[
{"offset":6671,"timestamp":1533099130556,"lag":0},
{"offset":6671,"timestamp":1533099135556,"lag":0},
{"offset":6671,"timestamp":1533099140558,"lag":0},
{"offset":6671,"timestamp":1533099145558,"lag":0},
{"offset":6671,"timestamp":1533099150557,"lag":0},
{"offset":6671,"timestamp":1533099155558,"lag":0},
{"offset":6671,"timestamp":1533099160561,"lag":0},
{"offset":6671,"timestamp":1533099165559,"lag":0},
{"offset":6671,"timestamp":1533099170560,"lag":0},
{"offset":6671,"timestamp":1533099175561,"lag":0},
{"offset":6671,"timestamp":1533099180562,"lag":0},
{"offset":6671,"timestamp":1533099185562,"lag":0},
{"offset":6671,"timestamp":1533099190563,"lag":0},
{"offset":6671,"timestamp":1533099195562,"lag":0},
{"offset":6671,"timestamp":1533099200564,"lag":0}
]
More Observations
When I restarted the app, no new offset entry was created; only the timestamps kept updating, which is probably due to auto.commit.interval.ms.
When I started producing/consuming, I saw the offset and lag change in one of the entries; later the other entries caught up, which makes me think they are replicas.
offsets.retention.minutes is at its default of 1440.
Questions
Why do we have 15 offset entries in burrow reports?
If they are replicas, why does a single-partition topic get split up into 14 different replicas under __consumer_offsets? Is there any documentation for this?
If they are NOT replicas, what else are they?
Here's my understanding, based on the docs. Burrow stores a configurable number of committed offsets as a rolling window: every time a consumer commits, Burrow stores the committed offset and the lag at the time of the commit. What you are seeing is likely the result of a storage config something like this (culled from burrow.toml):
[storage.default]
class-name="inmemory"
workers=20
intervals=15
expire-group=604800
min-distance=1
Note that intervals is set to 15.
I believe this feature is simply to provide some history of consumer group commits and associated lags, and has nothing to do with replicas.
EDIT:
The Consumer Lag Evaluation Rules page on the Burrow wiki explains this functionality in more detail. In short, this configurable window of offset/lag data is used to calculate consumer group status.
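As a rough illustration of the rolling window described above, here is a minimal sketch. It is not Burrow's implementation (Burrow is written in Go); CommitWindow, Commit and record are hypothetical names, and the only thing taken from the answer is that the newest intervals entries are kept.

```scala
object CommitWindow {
  final case class Commit(offset: Long, timestampMs: Long, lag: Long)

  /** Appends a commit and keeps only the newest `intervals` entries. */
  def record(window: Vector[Commit], commit: Commit, intervals: Int): Vector[Commit] =
    (window :+ commit).takeRight(intervals)

  /** Replays `count` commits of the same offset, spaced `stepMs` apart. */
  def replay(offset: Long, startMs: Long, stepMs: Long, count: Int, intervals: Int): Vector[Commit] =
    (0 until count).foldLeft(Vector.empty[Commit]) { (window, i) =>
      record(window, Commit(offset, startMs + i * stepMs, 0L), intervals)
    }
}
```

Replaying 20 commits of offset 6671 with intervals = 15 leaves 15 entries that all carry offset 6671 and differ only in timestamp, which matches the shape of the output in the question: an idle consumer that keeps committing the same offset fills the window with identical offsets.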

Spark structured streaming query always starts with auto.offset.reset=earliest even though auto.offset.reset=latest is set

I have a weird issue with trying to read data from Kafka using Spark structured streaming.
My use case is to be able to read from a topic from the largest/latest offset available.
My read configs:
val data = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "some xyz server")
  .option("subscribe", "sampletopic")
  .option("auto.offset.reset", "latest")
  .option("startingOffsets", "latest")
  .option("kafkaConsumer.pollTimeoutMs", 20000)
  .option("failOnDataLoss","false")
  .option("maxOffsetsPerTrigger",20000)
  .load()
My write configs:
data
  .writeStream
  .outputMode("append")
  .queryName("test")
  .format("parquet")
  .option("checkpointLocation", "s3://somecheckpointdir")
  .start("s3://outpath").awaitTermination()
Versions used:
spark-core_2.11 : 2.2.1
spark-sql_2.11 : 2.2.1
spark-sql-kafka-0-10_2.11 : 2.2.1
I have done my research online and in the Kafka documentation (https://kafka.apache.org/0100/documentation.html).
I am using the new consumer APIs and, as the documentation suggests, I just need to set auto.offset.reset to "latest" or startingOffsets to "latest" to ensure that my Spark job starts consuming from the latest offset available per partition in Kafka.
I am also aware that auto.offset.reset only kicks in when a new query is started for the first time, and not on a restart of the application, in which case it will continue to read from the last saved offset.
I am using S3 for checkpointing my offsets, and I see them being generated under s3://somecheckpointdir.
The issue I am facing is that the Spark job always reads from the earliest offset when the application is started for the first time, even though the latest option is specified in the code, and I see this in the Spark logs:
auto.offset.reset = earliest
I have not seen posts related to this particular issue.
I would like to know if I am missing something here and if someone has seen this behavior before. Any help/direction will indeed be useful. Thank you.
All Kafka configurations should be set with the kafka. prefix. Hence the correct option key would be kafka.auto.offset.reset.
However, you should never set auto.offset.reset. Instead, "set the source option startingOffsets to specify where to start instead. Structured Streaming manages which offsets are consumed internally, rather than rely on the kafka Consumer to do it. This will ensure that no data is missed when new topics/partitions are dynamically subscribed. Note that startingOffsets only applies when a new streaming query is started, and that resuming will always pick up from where the query left off." [1]
[1] http://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#kafka-specific-configurations
Update:
So I have done some testing on a local Kafka instance with a controlled set of messages going into Kafka. I see the expected behavior when the property startingOffsets is set to earliest or latest.
But the logs always show the property being picked up as earliest, which is a little misleading:
auto.offset.reset=earliest, even though I am not setting it.
Thank you.
As per the documentation, you cannot set auto.offset.reset in Spark Structured Streaming. To start from latest, set the source option startingOffsets (earliest or latest) instead. Structured Streaming manages which offsets are consumed internally, rather than relying on the Kafka consumer to do it. This ensures that no data is missed when new topics/partitions are dynamically subscribed.
The documentation clearly says that the following fields can't be set and the Kafka source or sink will throw an exception:
group.id
auto.offset.reset
key.deserializer
value.deserializer
key.serializer
value.serializer
enable.auto.commit
interceptor.classes
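The prefix rule and the list above can be modelled as a small classifier over option keys. This is a sketch of the documented behaviour, not Spark's source code: KafkaSourceOptions, classify and the three outcome names are hypothetical, and the assumption that keys without the kafka. prefix are never forwarded to the Kafka client is labelled as such in the comments.

```scala
object KafkaSourceOptions {
  // Settings the documentation lists as not settable on the Kafka source or sink.
  val unsupported: Set[String] = Set(
    "group.id", "auto.offset.reset", "key.deserializer", "value.deserializer",
    "key.serializer", "value.serializer", "enable.auto.commit", "interceptor.classes")

  sealed trait Outcome
  case object NotForwarded extends Outcome  // assumption: handled (or ignored) by the Spark source only
  case object PassedToKafka extends Outcome // forwarded to the Kafka client without the prefix
  case object Rejected extends Outcome      // documented to throw an exception

  def classify(key: String): Outcome =
    if (!key.startsWith("kafka.")) NotForwarded
    else if (unsupported.contains(key.stripPrefix("kafka."))) Rejected
    else PassedToKafka
}
```

Under this model the question's .option("auto.offset.reset", "latest") never reaches the Kafka consumer at all, while kafka.auto.offset.reset would be rejected, so startingOffsets is the only knob that takes effect.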
For Structured Streaming, you can set startingOffsets to earliest so that a new query consumes from the earliest available offset. The following will do the trick:
.option("startingOffsets", "earliest")
However, note that this is effective only for newly created queries:
startingOffsets
The start point when a query is started, either "earliest" which is
from the earliest offsets, "latest" which is just from the latest
offsets, or a json string specifying a starting offset for each
TopicPartition. In the json, -2 as an offset can be used to refer to
earliest, -1 to latest. Note: For batch queries, latest (either
implicitly or by using -1 in json) is not allowed. For streaming
queries, this only applies when a new query is started, and that
resuming will always pick up from where the query left off. Newly
discovered partitions during a query will start at earliest.
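Alternatively, you might also choose to change the consumer group every time (note that group.id is on the list above of options that cannot be set, so this depends on a Spark version that accepts kafka.group.id):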
.option("kafka.group.id", "newGroupID")

Reading from earlier offset Apache Storm

I have set up the Kafka spout for a Storm pipeline. I want to read the data neither from the latest offset nor from the beginning. Is there any way to start reading from a configurable offset rather than the one stored in ZooKeeper? Storm provides ways to read from the latest offset or from the beginning; I do not want either.
Use case:
Offset 0: deployed the topology.
Offset 50: changed the topology.
Offset 100: detected that the recent topology has a bug; want to restart from offset 50.
How can I achieve this?
KafkaSpout will read the last committed offset from ZooKeeper. If there is no offset in ZooKeeper, it will use the configured startOffsetTime. The default configuration of KafkaSpout is the following:
public long startOffsetTime = kafka.api.OffsetRequest.EarliestTime();
If you change the value of startOffsetTime and set KafkaConfig.ignoreZkOffsets = true, I think you can make the consumer start from the specific offset.
If ignoreZkOffsets equals true, the spout will always begin reading from the offset defined by KafkaConfig.startOffsetTime as described above.
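The precedence just described can be written down as a tiny function. This is a sketch only, in Scala rather than Storm's Java: SpoutStart and firstOffset are hypothetical names, and offsetAtStartTime stands for whatever offset startOffsetTime resolves to, which a real spout looks up from Kafka.

```scala
object SpoutStart {
  /** zkOffset:          offset committed in ZooKeeper, if any
    * ignoreZkOffsets:   mirrors KafkaConfig.ignoreZkOffsets
    * offsetAtStartTime: offset that startOffsetTime resolves to (assumed already looked up)
    */
  def firstOffset(zkOffset: Option[Long], ignoreZkOffsets: Boolean, offsetAtStartTime: Long): Long =
    if (ignoreZkOffsets) offsetAtStartTime     // the stored offset is ignored entirely
    else zkOffset.getOrElse(offsetAtStartTime) // the stored offset wins when present
}
```

In the use case from the question, firstOffset(Some(100L), ignoreZkOffsets = true, 50L) gives 50, provided startOffsetTime is set to a timestamp that resolves to offset 50.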
Also, have a look at this question: How do I accurately get offsets of messages for a certain timestamp using OffsetRequest?

What determines Kafka consumer offset?

I am relatively new to Kafka. I have done a bit of experimenting with it, but a few things are unclear to me regarding consumer offset. From what I have understood so far, when a consumer starts, the offset it will start reading from is determined by the configuration setting auto.offset.reset (correct me if I am wrong).
Now say for example that there are 10 messages (offsets 0 to 9) in the topic, and a consumer happened to consume 5 of them before it went down (or before I killed the consumer). Then say I restart that consumer process. My questions are:
If the auto.offset.reset is set to earliest, is it always going to start consuming from offset 0?
If the auto.offset.reset is set to latest, is it going to start consuming from offset 5?
Is the behavior regarding this kind of scenario always deterministic?
Please don't hesitate to comment if anything in my question is unclear.
It is a bit more complex than you described.
The auto.offset.reset config kicks in ONLY if your consumer group does not have a valid offset committed somewhere (2 supported offset storages now are Kafka and Zookeeper), and it also depends on what sort of consumer you use.
If you use a high-level Java consumer, then imagine the following scenarios:
You have a consumer in a consumer group group1 that has consumed 5 messages and died. Next time you start this consumer it won't even use that auto.offset.reset config and will continue from the place it died because it will just fetch the stored offset from the offset storage (Kafka or ZK as I mentioned).
You have messages in a topic (like you described) and you start a consumer in a new consumer group group2. There is no offset stored anywhere and this time the auto.offset.reset config will decide whether to start from the beginning of the topic (earliest) or from the end of the topic (latest)
One more thing that affects which offset values correspond to earliest and latest is the log retention policy. Imagine you have a topic with retention configured to 1 hour. You produce 5 messages, and then an hour later you post 5 more messages. The latest offset will still remain the same as in the previous example, but the earliest one can no longer be 0 because Kafka will already have removed those messages, so the earliest available offset will be 5.
None of the above applies to SimpleConsumer: every time you run it, it will decide where to start from using the auto.offset.reset config.
If you use a Kafka version older than 0.9, you have to replace earliest and latest with smallest and largest.
Just an update: from Kafka 0.9 onward, Kafka uses a new Java version of the consumer and the auto.offset.reset parameter values have changed. From the manual:
What to do when there is no initial offset in Kafka or if the current
offset does not exist any more on the server (e.g. because that data
has been deleted):
earliest: automatically reset the offset to the earliest offset
latest: automatically reset the offset to the latest offset
none: throw exception to the consumer if no previous offset is found
for the consumer's group
anything else: throw exception to the consumer.
I spent some time to find this after checking the accepted answer, so I thought it might be useful for the community to post it.
Furthermore, there's offsets.retention.minutes. If the time since the last commit is greater than offsets.retention.minutes, then auto.offset.reset also kicks in.
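A minimal sketch of that last point, assuming the simplified rule stated here (a commit expires once it is older than offsets.retention.minutes). OffsetRetention, Committed and surviving are names made up for the example, and the broker's real expiry logic may differ between Kafka versions.

```scala
object OffsetRetention {
  final case class Committed(offset: Long, committedAtMs: Long)

  /** Returns the committed offset only while it is within the retention window. */
  def surviving(commit: Option[Committed], nowMs: Long, retentionMinutes: Long): Option[Long] = {
    val retentionMs = retentionMinutes * 60L * 1000L
    commit.filter(c => nowMs - c.committedAtMs <= retentionMs).map(_.offset)
  }
}
```

Once surviving returns None, the consumer group is in the same position as a brand-new group, so auto.offset.reset decides where it starts.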