KafkaIO - Different behaviors for enable.auto.commit set to true and commitOffsetsInFinalize when used with groupId - apache-beam

We have an Apache Beam pipeline that reads messages from a given Kafka topic and does further processing. The pipeline uses the FlinkRunner. Below are the three cases we have tried:
Case 1: No group id specified:
Beam creates a new consumer for every run and thus reads from the latest topic offset. It only reads messages that are produced after the consumer starts, so data could be lost during the interval between stopping and restarting the pipeline.
Case 2: Group id specified and set enable.auto.commit to true
Beam starts re-processing messages from the time the pipeline was stopped and starts reading the messages that were not committed to kafka for the given groupid.
New group id again starts listening to messages from latest topic offset and starts committing messages
.withConsumerConfigUpdates(ImmutableMap.of("enable.auto.commit", true))
.withConsumerConfigUpdates(ImmutableMap.of("group.id", "testGroupId"))
Case 3: Group id specified with commitOffsetsInFinalize()
Ideally I would expect the same behavior as Case 2 here but instead I see behavior similar to Case 1 where there is a potential data loss between stop and restart of a pipeline.
.withConsumerConfigUpdates(ImmutableMap.of("group.id", "testGroupId"))
.commitOffsetsInFinalize()
From the documentation of KafkaIO I do see that offsets are committed back to kafka when checkpoints are finalized as per: https://github.com/apache/beam/blob/master/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/KafkaIO.java#L1098
We would like to understand:
Why Case 2 is not behaving like Case 3 on stopping and re-starting the pipeline?
In which cases should we set enable.auto.commit to true, and in which should we use commitOffsetsInFinalize?

It was a bug, reported in https://github.com/apache/beam/issues/22631, fixed in https://github.com/apache/beam/pull/22633
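The three cases in the question come down to whether a committed offset exists for the group when the pipeline restarts. The following is a toy model in plain Python, not the Beam or Kafka API; the function names are hypothetical, and it assumes a reader with no committed offset starts at the end of the log, as described in Case 1.

```python
from typing import Optional


def restart_offset(committed: Optional[int], latest: int) -> int:
    """Offset a restarted reader begins at: the committed offset if one exists, else the log end."""
    return latest if committed is None else committed


def records_skipped(stop_position: int, committed: Optional[int], latest: int) -> int:
    """Number of records between where the pipeline stopped and where it resumes."""
    return max(0, restart_offset(committed, latest) - stop_position)


# The pipeline stopped with offset 100 as the next record to read;
# 40 more records arrived while it was down, so the log end is 140.
print(records_skipped(100, None, 140))  # Case 1: nothing committed -> 40 skipped
print(records_skipped(100, 100, 140))   # committed offset found -> 0 skipped
```

If the last commit lags slightly behind the stop position (for example a committed offset of 95), the model reports 0 skipped records: some records are reprocessed rather than lost.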

Related

Kafka consumer - how to recognized offset skipping/missing offsets?

Setup:
We have a Debezium/Kafka Connect setup with a Debezium Oracle producer and a Confluent JDBC consumer/sink.
Starting position / background / problem:
Due to high traffic we have decreased the log.retention.minutes to 1h which is suitable in 99% of the time.
But in some rare cases one of the kafka consumers gets a slow down and can't keep up any longer. In that case messages will be deleted in Kafka (due to the aforementioned retention period) before they were picked up and handled by the consumer.
In the default config, the consumer will then skip the missing records by choosing the earliest available offset. This leads to inconsistencies on the target side.
Question:
How to handle those situations (if raising the log.retention.minutes isn't an option)?
Note: We would be fine, if the consumer would just throw an exception/stop/etc in case it can't find a message for its given offset.
What we've tried so far...
We tried setting auto.offset.reset to none for the consumer and expected the consumer to stop in case it can't find an offset. In theory this should work. In practice it immediately throws an exception when the consumer gets instantiated because there's no first/initial offset.
Final thoughts
So is there another config parameter we could use? (Something like "throw exception if offset is missing/skipped, but not on first start"?) Or is there a JMX metric we could monitor in case a consumer is skipping messages?
setting auto.offset.reset to none for the consumer and expected the consumer to stop in case it can't find an offset
That's what it'll do, yes.
In practice it immediately throws an exception when the consumer gets instantiated because there's no first/initial offset
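You'll need to actually initialize the group first, then seek it to the earliest offset. E.g. kafka-consumer-groups --reset-offsets --to-earliest --group connect-<name> (the tool also needs --bootstrap-server, a --topic or --all-topics scope, and --execute to actually apply the reset).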
Something like "throw exception if offset is missing/skipped, but not on first start"?)
There's nothing to differentiate auto.offset.reset between "first" and "next" starts. But, you could create the connector with consumer.override.auto.offset.reset=earliest, then wait for it to be running, then set it back to none with a PUT /config call. Then repeat whenever it stops running again.
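A minimal sketch of the two config payloads that this toggle would send, assuming the connector config is handled as a plain dictionary. The helper name and the topic name are hypothetical; only the override key comes from the answer above, and the HTTP call itself is left out.

```python
import json

OVERRIDE_KEY = "consumer.override.auto.offset.reset"


def with_offset_reset(config: dict, policy: str) -> dict:
    """Return a copy of a connector config with the consumer reset policy overridden."""
    if policy not in ("earliest", "latest", "none"):
        raise ValueError(f"unsupported auto.offset.reset value: {policy!r}")
    updated = dict(config)  # copy, so the original config is left untouched
    updated[OVERRIDE_KEY] = policy
    return updated


# Hypothetical sink connector config; "example-topic" is a placeholder.
base = {
    "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
    "topics": "example-topic",
}

bootstrap = with_offset_reset(base, "earliest")  # used when creating the connector
strict = with_offset_reset(bootstrap, "none")    # sent once the connector is running

print(json.dumps(strict, indent=2))
```

The `strict` dictionary is what the body of the PUT /config call would contain after the connector has committed its first offsets.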
JMX metric we could monitor in case a consumer is skipping messages
Not that I know of; the metrics are mostly reporting bytes processed. You'd have to additionally track how many bytes you expect it to read.
You'd need other monitoring solutions to detect log segments being deleted on the broker, and tracking those offset ranges compared to the offsets your consumer is currently reading.
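The comparison described here can be sketched as follows, assuming some external monitoring already collects, per partition, the group's committed offset and the broker's current log start offset. How those numbers are collected is outside this sketch, and the function name is hypothetical.

```python
from typing import Dict


def skipped_records(committed: Dict[int, int], log_start: Dict[int, int]) -> Dict[int, int]:
    """Per partition, count records deleted by retention before the consumer read them.

    committed[p] is the next offset the group will read on partition p;
    log_start[p] is the earliest offset still available on the broker.
    """
    return {
        partition: log_start[partition] - offset
        for partition, offset in committed.items()
        if log_start.get(partition, 0) > offset
    }


committed = {0: 1500, 1: 900}
log_start = {0: 1200, 1: 1000}

# Partition 0 is fine; partition 1 has lost offsets 900..999.
print(skipped_records(committed, log_start))  # {1: 100}
```

A non-empty result could be used to raise an alert or stop the connector before the consumer silently resets to the earliest available offset.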

kafka offset in spark

Kafka enable.auto.commit is set to false and Spark version is 2.4
If using the latest offset, do we need to manually find the last offset details and pass them to createDirectStream() in the Spark application, or will it automatically take the latest offset? In any case, do we need to find the last offset details manually?
Is there any difference between using SparkSession.readStream.format("kafka")... and KafkaUtils.createDirectStream()?
When using earliest offset option, will it consider the offset automatically?
Here is my attempt to answer your questions
Ques 1: enable.auto.commit is a Kafka parameter; if it is set to false, you have to commit (read: update) your offsets manually, in this case to the checkpoint directory. If your application restarts, it will look into the checkpoint directory and start reading from the last committed offset + 1. The same is mentioned here https://jaceklaskowski.gitbooks.io/apache-kafka/content/kafka-properties-enable-auto-commit.html by jaceklaskowski. There is no need to specify the offset anywhere in your Spark application; all you need is the checkpoint directory. Also, remember that offsets are maintained per partition, per topic, per consumer, so it would be unreasonable for Spark to expect developers/users to provide them.
Ques 2: spark.readStream is a generic method for reading data from streaming sources such as TCP sockets, Kafka topics, etc., while KafkaUtils is a dedicated class for integrating Spark with Kafka, so I assume it is more optimised if you are using Kafka topics as the source. I usually use KafkaUtils myself, though I haven't done any performance benchmarks. If I am not wrong, KafkaUtils can be used to subscribe to more than one topic as well, while readStream cannot.
Ques 3: The earliest offset means your consumer will start reading from the oldest record available. For example, if your topic is new (no cleanup has happened) or cleanup is not configured for the topic, it will start from offset 0. If cleanup is configured and all records up to offset 2000 have been removed, records will be read from offset 2001, while the topic may have records up to offset 10000 (this assumes there is only one partition; in topics with multiple partitions the offset values will differ per partition). See the section on batch queries here https://spark.apache.org/docs/2.2.0/structured-streaming-kafka-integration.html for more details.
If you take a look at documentation of kafka connector for Spark, you can find most of the answers.
Documentation about startingOffsets option for Kafka connector, last part is about streaming queries.
The start point when a query is started, either "earliest" which is from the earliest offsets, "latest" which is just from the latest offsets, or a json string specifying a starting offset for each TopicPartition. In the json, -2 as an offset can be used to refer to earliest, -1 to latest. Note: For batch queries, latest (either implicitly or by using -1 in json) is not allowed. For streaming queries, this only applies when a new query is started, and that resuming will always pick up from where the query left off. Newly discovered partitions during a query will start at earliest.
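As a small illustration of the JSON form described in this quote, the sketch below builds such a string from a nested dictionary. The helper and the topic names are hypothetical; only the -2 and -1 sentinels come from the documentation quoted above.

```python
import json

EARLIEST = -2  # sentinel for "earliest", per the quoted documentation
LATEST = -1    # sentinel for "latest"


def starting_offsets(spec: dict) -> str:
    """Build a startingOffsets JSON string from {topic: {partition: offset}}."""
    return json.dumps(
        {topic: {str(p): offset for p, offset in parts.items()} for topic, parts in spec.items()},
        separators=(",", ":"),
    )


print(starting_offsets({"topicA": {0: 23, 1: EARLIEST}, "topicB": {0: LATEST}}))
# {"topicA":{"0":23,"1":-2},"topicB":{"0":-1}}
```

The resulting string is what would be passed as the value of the startingOffsets option; partition numbers become string keys because JSON object keys must be strings.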
If you have offsets, it will always pick up offsets if they're available, otherwise it will ask Kafka for earliest or latest offset. This should be true for both types of streams, direct and structured streams should consider offsets.
I see that you mentioned the enable.auto.commit option, and I just want to make sure you're aware of the following quote from the same documentation site I linked above.
Note that the following Kafka params cannot be set and the Kafka source or sink will throw an exception: enable.auto.commit: Kafka source doesn’t commit any offset.

kafka Burrow reports Multiple Offsets for same consumer_group and same topic

SetUp
myTopic has a single partition.
consumer_group is my spring-boot app using spring-kafka client and there is always a single consumer for that consumer group. spring-kafka version 1.1.8 RELEASE
I have a single broker node in kafka. Kafka version 0.10.1.1
When I query a particular consumer_group using burrow, I see 15 offset entries for same topic.
Observations
curl http://burrow-node:8000/v3/kafka/mykafka-1/consumer/my_consumer_grp
"myTopic":[
{"offsets":[
{"offset":6671,"timestamp":1533099130556,"lag":0},
{"offset":6671,"timestamp":1533099135556,"lag":0},
{"offset":6671,"timestamp":1533099140558,"lag":0},
{"offset":6671,"timestamp":1533099145558,"lag":0},
{"offset":6671,"timestamp":1533099150557,"lag":0},
{"offset":6671,"timestamp":1533099155558,"lag":0},
{"offset":6671,"timestamp":1533099160561,"lag":0},
{"offset":6671,"timestamp":1533099165559,"lag":0},
{"offset":6671,"timestamp":1533099170560,"lag":0},
{"offset":6671,"timestamp":1533099175561,"lag":0},
{"offset":6671,"timestamp":1533099180562,"lag":0},
{"offset":6671,"timestamp":1533099185562,"lag":0},
{"offset":6671,"timestamp":1533099190563,"lag":0},
{"offset":6671,"timestamp":1533099195562,"lag":0},
{"offset":6671,"timestamp":1533099200564,"lag":0}
]
More Observations
When I restarted the app again, I didn't find a new offset entry to be created, except the timestamp kept on updating; which is probably due to the auto.commit.interval.ms;
When I started producing/consuming; I saw the changes in offset and lag in one of the offsets; later on the other offsets caught up; which makes me think those are replicas;
offsets.retention.minutes is at its default of 1440
Questions
Why do we have 15 offset entries in burrow reports?
If they are replicas, why does a single-partition topic get split up into 14 different replicas under __consumer_offsets? Is there any documentation for this?
If they are NOT replicas, what else are they?
Here's my understanding, based on the docs. Burrow stores a configurable number of committed offsets. It's a rolling window. Every time a consumer commits, burrow stores the committed offset and lag at time of commit. What you are seeing is likely the result of having applied a Storage config something like this (culled from burrow.iml):
[storage.default]
class-name="inmemory"
workers=20
intervals=15
expire-group=604800
min-distance=1
Note that intervals is set to 15.
I believe this feature is simply to provide some history of consumer group commits and associated lags, and has nothing to do with replicas.
EDIT:
The Consumer Lag Evaluation Rules page on the Burrow wiki explains this functionality in more detail. In short, this configurable window of offset/lag data is used to calculate consumer group status.
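The rolling window described above can be pictured with a toy model. This is not Burrow's implementation, only a sketch of a fixed-size window of commits; the class and method names are hypothetical, and the 5-second commit spacing is taken from the timestamps in the question.

```python
from collections import deque


class OffsetWindow:
    """Keeps only the most recent `intervals` commits for one partition."""

    def __init__(self, intervals: int = 15):
        self.entries = deque(maxlen=intervals)  # oldest entries fall off automatically

    def record_commit(self, offset: int, timestamp_ms: int, log_end: int) -> None:
        self.entries.append(
            {"offset": offset, "timestamp": timestamp_ms, "lag": max(0, log_end - offset)}
        )


window = OffsetWindow(intervals=15)

# An idle consumer that auto-commits the same offset every 5 seconds.
for i in range(40):
    window.record_commit(6671, 1533099130556 + i * 5000, log_end=6671)

print(len(window.entries))                    # 15
print({e["offset"] for e in window.entries})  # {6671}
```

Forty commits of an unchanged offset leave exactly 15 entries that differ only in timestamp, which matches the shape of the output in the question.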

Consume messages without committing from Kafka 10 consumer

I have a requirement to read messages from a topic, batch them and push the batch to an external system. If the batch fails for any reason, I need to consume the same set of messages again and repeat the process. So for every batch, the from and to offsets for each partition are stored in a database. To achieve this, I create one Kafka consumer per partition by assigning the partition to the reader; based on the previously stored offsets, the consumers seek to that position and start reading. I have turned off auto commit and I don't commit offsets from the consumer. For every batch, I create a new consumer per partition, read messages from the last offset stored and publish to the external system. Do you see any problems in consuming messages without committing offsets and using the same consumer group across batches, given that at any point there won't be more than one consumer per partition?
Your design seems reasonable to me.
Committing offsets to Kafka is just a convenient built-in mechanism within Kafka to keep track of offsets. However, there is no requirement whatsoever to use it -- you can use any other mechanism to track offsets, too (like using a DB as in your case).
Furthermore, if you assign partitions manually, there will be no group management anyway. So parameter group.id has no effect. See http://docs.confluent.io/current/clients/consumer.html for more details.
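The design in the question can be sketched without Kafka at all. In this toy model a list stands in for one partition's log and a dictionary stands in for the database table of offsets; where a real consumer would assign the partition and seek, the sketch simply slices the list. All names are hypothetical.

```python
from typing import Callable, Dict, List


def run_batch(
    log: List[str],
    store: Dict[int, int],
    partition: int,
    size: int,
    push: Callable[[List[str]], bool],
) -> bool:
    """Read one batch from the stored offset; advance the stored offset only on success."""
    start = store.get(partition, 0)       # "seek" to the externally stored position
    batch = log[start:start + size]
    if batch and push(batch):
        store[partition] = start + len(batch)
        return True
    return False                          # offset untouched, so the same batch is re-read


log = [f"m{i}" for i in range(5)]
store: Dict[int, int] = {}
attempts: List[List[str]] = []


def flaky_push(batch: List[str]) -> bool:
    attempts.append(list(batch))
    return len(attempts) > 1              # the first push fails, later ones succeed


run_batch(log, store, 0, 3, flaky_push)   # fails: nothing is recorded
run_batch(log, store, 0, 3, flaky_push)   # replays m0..m2, then succeeds
print(attempts)
print(store)
```

Because the position lives only in the external store, nothing needs to be committed to Kafka for the replay to work.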
With Kafka version 2 I achieved this behaviour without needing a database to store the offsets.
The following is a configuration for spring-boot-kafka, but it should also work with any Kafka consumer API:
spring:
  kafka:
    bootstrap-servers: ...
    consumer:
      value-deserializer: ...
      max-poll-records: 1000
      enable-auto-commit: false
      fetch-min-size: 262144 # 1/4 mb..
      group-id: ...
      fetch-max-wait: 10000 # we will consume every 10s or when 1/4 mb or 1000 records are accumulated.
      auto-offset-reset: earliest
    listener:
      type: batch
      concurrency: 7
      ack-mode: manual
This gives me the messages in batches of at most 1000 records (depending on load). I then write these records asynchronously to a database and count how many success callbacks I get. If the number of successful writes equals the received batch size, I acknowledge the batch, i.e. I commit the offset. This design was very reliable even in a high-load production environment.
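The acknowledge-only-if-everything-succeeded rule can be sketched like this. The `write` and `ack` callables are hypothetical stand-ins for the database write and for the listener's manual acknowledgment; a thread pool is used here only to imitate asynchronous writes.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence


def process_batch(records: Sequence, write: Callable[[object], bool], ack: Callable[[], None]) -> bool:
    """Write all records concurrently; acknowledge the batch only if every write succeeded."""

    def safe_write(record) -> bool:
        try:
            return bool(write(record))
        except Exception:
            return False  # a failed write counts as a missing success callback

    with ThreadPoolExecutor(max_workers=4) as pool:
        successes = sum(pool.map(safe_write, records))

    if successes == len(records):
        ack()  # commit the offsets for this batch
        return True
    return False  # no ack: the batch stays uncommitted


acked = []
ok = process_batch([1, 2, 3], lambda r: True, lambda: acked.append("batch-1"))
failed = process_batch([1, 2, 3], lambda r: r != 2, lambda: acked.append("batch-2"))
print(ok, failed, acked)
```

Only the first batch is acknowledged; the second one contains a failed write, so its offsets are never committed.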

What determines Kafka consumer offset?

I am relatively new to Kafka. I have done a bit of experimenting with it, but a few things are unclear to me regarding consumer offset. From what I have understood so far, when a consumer starts, the offset it will start reading from is determined by the configuration setting auto.offset.reset (correct me if I am wrong).
Now say for example that there are 10 messages (offsets 0 to 9) in the topic, and a consumer happened to consume 5 of them before it went down (or before I killed the consumer). Then say I restart that consumer process. My questions are:
If the auto.offset.reset is set to earliest, is it always going to start consuming from offset 0?
If the auto.offset.reset is set to latest, is it going to start consuming from offset 5?
Is the behavior regarding this kind of scenario always deterministic?
Please don't hesitate to comment if anything in my question is unclear.
It is a bit more complex than you described.
The auto.offset.reset config kicks in ONLY if your consumer group does not have a valid offset committed somewhere (2 supported offset storages now are Kafka and Zookeeper), and it also depends on what sort of consumer you use.
If you use a high-level java consumer then imagine following scenarios:
You have a consumer in a consumer group group1 that has consumed 5 messages and died. Next time you start this consumer it won't even use that auto.offset.reset config and will continue from the place it died because it will just fetch the stored offset from the offset storage (Kafka or ZK as I mentioned).
You have messages in a topic (like you described) and you start a consumer in a new consumer group group2. There is no offset stored anywhere and this time the auto.offset.reset config will decide whether to start from the beginning of the topic (earliest) or from the end of the topic (latest)
One more thing that affects what offset value will correspond to earliest and latest configs is log retention policy. Imagine you have a topic with retention configured to 1 hour. You produce 5 messages, and then an hour later you post 5 more messages. The latest offset will still remain the same as in previous example but the earliest one won't be able to be 0 because Kafka will already remove these messages and thus the earliest available offset will be 5.
Everything mentioned above is not related to SimpleConsumer and every time you run it, it will decide where to start from using the auto.offset.reset config.
If you use a Kafka version older than 0.9, you have to replace earliest and latest with smallest and largest.
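The decision described in this answer can be written down as a small function. This is a toy model of the rule, not the Kafka client; the names are hypothetical, and `log_end` is taken to be the offset of the next record to be written.

```python
from typing import Optional


class NoOffsetError(Exception):
    """Raised when there is no usable committed offset and the policy is not earliest/latest."""


def resolve_start(committed: Optional[int], log_start: int, log_end: int, policy: str) -> int:
    """Pick the offset a consumer group starts from on (re)start."""
    # A valid committed offset always wins; auto.offset.reset is not consulted.
    if committed is not None and log_start <= committed <= log_end:
        return committed
    if policy == "earliest":
        return log_start
    if policy == "latest":
        return log_end
    raise NoOffsetError(f"no valid offset and auto.offset.reset={policy!r}")


# Ten messages (offsets 0..9); the group consumed five and committed offset 5.
print(resolve_start(5, 0, 10, "earliest"))     # 5
print(resolve_start(5, 0, 10, "latest"))       # 5

# A brand-new group with nothing committed.
print(resolve_start(None, 0, 10, "earliest"))  # 0
print(resolve_start(None, 0, 10, "latest"))    # 10
```

With retention having removed offsets 0 to 4, `resolve_start(None, 5, 10, "earliest")` returns 5, which is the log-retention effect described above.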
Just an update: from Kafka 0.9 onward, Kafka uses a new Java version of the consumer, and the auto.offset.reset parameter values have changed. From the manual:
What to do when there is no initial offset in Kafka or if the current
offset does not exist any more on the server (e.g. because that data
has been deleted):
earliest: automatically reset the offset to the earliest offset
latest: automatically reset the offset to the latest offset
none: throw exception to the consumer if no previous offset is found
for the consumer's group
anything else: throw exception to the consumer.
I spent some time to find this after checking the accepted answer, so I thought it might be useful for the community to post it.
Furthermore, there's offsets.retention.minutes. If the time since the last commit is greater than offsets.retention.minutes, then auto.offset.reset also kicks in.