Kafka consumer API in Java has to fetch messages based on offset

We have a requirement that the consumer should be able to go back to any point in the message stream and reprocess messages.
It looks like setting an offset in the high-level consumer is difficult, as it requires setting the offset in ZooKeeper and restarting the consumer.
The simple consumer supports this, but we would need to handle broker leader election, broker leader failure, etc. ourselves.
The new consumer API provides this, but it looks like it is still in beta.
So we might have to choose the simple consumer. Are there any known issues with the simple consumer?

Related

Kafka producer buffering

Suppose there is a producer which is running and I start a consumer a few minutes later. I noticed that the consumer will consume old messages that have already been produced by the producer, but I don't want that to happen. How can I do that? Are there any config parameters in the broker that can be set to solve this problem?
It really depends on the use case, and you didn't provide much information about the architecture. For instance, once the consumer is up, is it a long-running consumer, or does it just wake up for a short while and consume newly arriving messages?
You can take any of the following approaches:
Filter ConsumerRecord by timestamp, so you automatically throw away messages that were produced more than a configurable time ago.
In my team we're using ephemeral groups. That is, each time the service goes up, we generate a new group id for the consumer group, setting auto.offset.reset to latest.
Seek to timestamp: since Kafka 0.10.1 you can seek to the position for a given time. Use consumer.offsetsForTimes to get the offset of each topic partition for the desired time, and then use consumer.seek to move to the given offset.
If you use a consumer group, but never commit to Kafka, then each time a consumer is assigned a topic partition, it will start consuming according to the auto.offset.reset policy...
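The first two approaches above can be sketched roughly as follows. This is a minimal illustration, not the answerer's actual code: the class and method names are hypothetical, plain timestamps stand in for ConsumerRecord objects, and only the property keys group.id and auto.offset.reset are real consumer settings. The resulting Properties would be passed to whatever consumer you construct.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;
import java.util.UUID;

/** Sketch of two "skip old messages" approaches; all names here are hypothetical. */
class FreshStartSketch {

    /** Consumer properties for an ephemeral group: a new group.id on every start. */
    static Properties ephemeralGroupProps(String servicePrefix) {
        Properties props = new Properties();
        // A fresh group id has no committed offsets, so auto.offset.reset applies.
        props.setProperty("group.id", servicePrefix + "-" + UUID.randomUUID());
        // With no committed offset, start from the end of each partition.
        props.setProperty("auto.offset.reset", "latest");
        return props;
    }

    /** True if a record timestamp is at most maxAgeMs older than nowMs. */
    static boolean isFresh(long recordTimestampMs, long nowMs, long maxAgeMs) {
        return nowMs - recordTimestampMs <= maxAgeMs;
    }

    /** Keeps only fresh timestamps; stands in for filtering consumed records. */
    static List<Long> keepFresh(List<Long> timestampsMs, long nowMs, long maxAgeMs) {
        List<Long> kept = new ArrayList<>();
        for (long ts : timestampsMs) {
            if (isFresh(ts, nowMs, maxAgeMs)) {
                kept.add(ts);
            }
        }
        return kept;
    }
}
```

One trade-off to keep in mind: ephemeral groups leave a new group behind on every restart, while timestamp filtering still reads the old messages and only discards them afterwards.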

Should Kafka consumers be started before producers?

When I have a Kafka console producer produce some messages and then start a consumer, I do not get those messages.
However, I do receive messages produced by the producer after the consumer has been started. Should Kafka consumers be started before producers?
--from-beginning seems to give all messages, including ones that have already been consumed.
Please help me with this, both at the console level and with a Java client example, for starting the producer first and then consuming by starting a consumer.
Kafka stores messages for a configurable amount of time. The default is a week. Consumers do not need to be "available" to receive messages, but they do need to know where they should start reading from.
The console consumer by default starts at the latest offset for all partitions. So if you're not actively producing data, you see nothing as a consumer. You can specify a group flag for the console consumer or for a Java client; the group is what tracks which offsets have been read within the Kafka protocol, and where a read request will resume from if you stop a consumer in that group.
Otherwise, I think you can only give an offset along with a single partition to consume from

Preventing message loss with Kafka High Level Consumer 0.8.x

A typical kafka consumer looks like the following:
kafka-broker ---> kafka-consumer ----> downstream-consumer like Elastic-Search
And according to the documentation for Kafka High Level Consumer:
The ‘auto.commit.interval.ms’ setting is how often updates to the
consumed offsets are written to ZooKeeper
It seems that there can be message loss if the following two things happen:
Offsets are committed just after some messages are retrieved from kafka brokers.
Downstream consumers (say Elastic-Search) fail to process the most recent batch of messages OR the consumer process itself is killed.
It would perhaps be most ideal if the offsets are not committed automatically based on a time interval but they are committed by an API. This would make sure that the kafka-consumer can signal the committing of offsets only after it receives an acknowledgement from the downstream-consumer that they have successfully consumed the messages. There could be some replay of messages (if kafka-consumer dies before committing offsets) but there would at least be no message loss.
Please let me know if such an API exists in the High Level Consumer.
Note: I am aware of the Low Level Consumer API in 0.8.x version of Kafka but I do not want to manage everything myself when all I need is just one simple API in High Level Consumer.
Ref:
AutoCommitTask.run(), look for commitOffsetsAsync
SubscriptionState.allConsumed()
There is a commitOffsets() API in the High Level Consumer API that can be used to solve this.
Also set the option "auto.commit.enable" to "false" so that the Kafka consumer never commits offsets automatically.
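The commit-after-acknowledgement pattern described above can be sketched as follows. This is only an illustration under stated assumptions: DownstreamSink and OffsetCommitter are hypothetical interfaces standing in for the downstream system and for whatever call commits offsets (commitOffsets() in the answer above), and offsets are modeled as a simple counter rather than real Kafka state.

```java
import java.util.List;

/** Hypothetical stand-in for the downstream system (for example a search index). */
interface DownstreamSink {
    /** Returns true only when the whole batch was acknowledged. */
    boolean deliver(List<String> batch);
}

/** Hypothetical stand-in for the call that commits consumed offsets. */
interface OffsetCommitter {
    void commit(long nextOffset);
}

class CommitAfterAckSketch {

    /**
     * Delivers one batch and commits only after the sink acknowledges it.
     * Returns the offset to resume from: advanced on success, unchanged on failure.
     */
    static long processBatch(long startOffset, List<String> batch,
                             DownstreamSink sink, OffsetCommitter committer) {
        if (batch.isEmpty()) {
            return startOffset;
        }
        if (!sink.deliver(batch)) {
            // No commit: the same batch is read again later (replay, not loss).
            return startOffset;
        }
        long next = startOffset + batch.size();
        committer.commit(next);
        return next;
    }
}
```

If the process dies between deliver and commit, the batch is delivered again after restart, so the downstream side should tolerate duplicates.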

kafka consumer sessions timing out

We have an application in which a consumer reads a message and the thread does a number of things, including database accesses, before a message is produced to another topic. The time between consuming and producing the message on the thread can be several minutes. Once the message is produced to the new topic, a commit is done to indicate we are done with work on the consumer queue message. Auto commit is disabled for this reason.
I'm using the high-level consumer, and what I'm noticing is that the ZooKeeper and Kafka sessions time out because it takes too long before we do anything on the consumer queue. Kafka ends up rebalancing every time the thread goes back to read more from the consumer queue, and after a while it starts to take a long time before a consumer reads a new message.
I can set the ZooKeeper session timeout very high so that this is not a problem, but then I have to adjust the rebalance parameters accordingly, and Kafka won't pick up a new consumer for a while, among other side effects.
What are my options to solve this problem? Is there a way to heartbeat to Kafka and ZooKeeper to keep both happy? Do I still have these same issues if I use a simple consumer?
It sounds like your problems boil down to relying on the high-level consumer to manage the last-read offset. Using a simple consumer would solve that problem since you control the persistence of that offset. Note that all the high-level consumer commit does is store the last read offset in zookeeper. There's no other action taken and the message you just read is still there in the partition and is readable by other consumers.
With the Kafka simple consumer, you have much more control over when and how that offset storage takes place. You can even persist that offset somewhere other than ZooKeeper (a database, for example).
The bad news is that while the simple consumer itself is simpler than the high-level consumer, there's a lot more work you have to do code-wise to make it work. You'll also have to write code to access multiple partitions - something the high-level consumer does quite nicely for you.
I think the issue is that the consumer's poll method is what triggers the consumer's heartbeat request, so while the thread is not polling, even with an increased session timeout, the heartbeat will not reach the coordinator. Because of the skipped heartbeats, the coordinator marks the consumer dead. Consumer rejoining is also very slow, especially in the case of a single consumer.
I faced a similar issue, and to solve it I had to change the following parameters in the consumer config properties:
session.timeout.ms=
request.timeout.ms=more than session timeout
You also have to add the following property to server.properties on the Kafka broker node:
group.max.session.timeout.ms =
You can see the following link for more detail.
http://grokbase.com/t/kafka/users/16324waa50/session-timeout-ms-limit
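The relationship between these three settings can be checked with a small helper like the one below. This is a hedged sketch: the class and method are hypothetical, the answer above does not give concrete values, and the two rules encoded here are the ones the answer states (request.timeout.ms greater than session.timeout.ms) plus the assumption that the broker's group.max.session.timeout.ms acts as an upper bound on the session timeout a consumer may request.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that sanity-checks the timeout settings against each other. */
class TimeoutConfigCheck {

    /** Returns a list of human-readable problems; empty means the values are consistent. */
    static List<String> problems(long sessionTimeoutMs,
                                 long requestTimeoutMs,
                                 long brokerMaxSessionTimeoutMs) {
        List<String> out = new ArrayList<>();
        if (requestTimeoutMs <= sessionTimeoutMs) {
            out.add("request.timeout.ms must be greater than session.timeout.ms");
        }
        if (sessionTimeoutMs > brokerMaxSessionTimeoutMs) {
            out.add("session.timeout.ms exceeds the broker's group.max.session.timeout.ms");
        }
        return out;
    }
}
```

For example, calling problems(600000, 610000, 900000) with these purely illustrative numbers returns an empty list, while swapping the first two arguments reports the request timeout problem.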

Apache Kafka Consumer group and Simple Consumer

I am new to Kafka. What I've understood so far regarding the consumer is that there are basically two types of implementation:
1) The High level consumer/consumer group
2) Simple Consumer
The most important part about the high-level abstraction is that it is used when you don't care about handling the offset yourself, while the simple consumer provides much better control over offset management. What confuses me is what to do if I want to run a consumer in a multithreaded environment and also want to have control over the offset. If I use a consumer group, does that mean I must read from the last offset stored in ZooKeeper? Is that the only option I have?
For the most part, the high-level consumer API does not let you control the offset directly.
When the consumer group is first created, you can tell it whether to start with the oldest or newest message that kafka has stored using the auto.offset.reset property.
You can also control when the high-level consumer commits new offsets to zookeeper by setting auto.commit.enable to false.
Since the high-level consumer stores the offsets in zookeeper, your app could access zookeeper directly and manipulate the offsets - but it would be outside of the high-level consumer API.
Your question was a little confusing but you can use the simple consumer in a multi-threaded environment. That's what the high-level consumer does.
In Apache Kafka 0.9 and 0.10 the consumer group management is handled entirely within the Kafka application by a Broker (for coordination) and a topic (for state storage).
When a consumer group first subscribes to a topic the setting of auto.offset.reset determines where consumers begin to consume messages (http://kafka.apache.org/documentation.html#newconsumerconfigs)
You can register a ConsumerRebalanceListener to receive a notification when a particular consumer is assigned topics/partitions.
Once the consumer is running, you can use seek, seekToBeginning and seekToEnd to get messages from a specific offset. seek affects the next poll for that consumer, and the new position is stored on the next commit (e.g. commitSync, commitAsync, or when auto.commit.interval.ms elapses, if auto commit is enabled).
The consumer javadocs mention more specific situations: http://kafka.apache.org/0100/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html
You can combine the group management provided by Kafka with manual management of offsets via seek(..) once partitions are assigned.