We have an application that a consumer reads a message and the thread does a number of things, including database accesses before a message is produced to another topic. The time between consuming and producing the message on the thread can take several minutes. Once message is produced to new topic, a commit is done to indicate we are done with work on the consumer queue message. Auto commit is disabled for this reason.
I'm using the high level consumer and what I'm noticing is that zookeeper and kafka sessions timeout because it is taking too long before we do anything on consumer queue so kafka ends up rebalancing every time the thread goes back to read more from consumer queue and it starts to take a long time before a consumer reads a new message after a while.
I can set zookeeper session timeout very high to not make that a problem but then i have to adjust the rebalance parameters accordingly and kafka won't pickup a new consumer for a while among other side effects.
What are my options to solve this problem? Is there a way to heartbeat to kafka and zookeeper to keep both happy? Do i still have these same issues if i were to use a simple consumer?
It sounds like your problems boil down to relying on the high-level consumer to manage the last-read offset. Using a simple consumer would solve that problem since you control the persistence of that offset. Note that all the high-level consumer commit does is store the last read offset in zookeeper. There's no other action taken and the message you just read is still there in the partition and is readable by other consumers.
With the kafka simple consumer, you have much more control over when and how that offset storage takes place. You can even persist that offset somewhere other than Zookeeper (a data base, for example).
The bad news is that while the simple consumer itself is simpler than the high-level consumer, there's a lot more work you have to do code-wise to make it work. You'll also have to write code to access multiple partitions - something the high-level consumer does quite nicely for you.
I think issue is consumer's poll method trigger consumer's heartbeat request. And when you increase session.timeout. Consumer's heartbeat will not reach to coordinator. Because of this heartbeat skipping, coordinator mark consumer dead. And also consumer rejoining is very slow especially in case of single consumer.
I have faced a similar issue and to solve that I have to change following parameter in consumer config properties
session.timeout.ms=
request.timeout.ms=more than session timeout
Also you have to add following property in server.properties at kafka broker node.
group.max.session.timeout.ms =
You can see the following link for more detail.
http://grokbase.com/t/kafka/users/16324waa50/session-timeout-ms-limit
Related
If I have a service that connects to kafka as a message consumer, and every message I read I send a commit to that message offset, so that if my service shutsdown and restarts it will start reading from the last read message onwards. My understanding is that the committed offset will be maintained by kafka.
Now my question is, do I have to worry about the offset? Can kafka somehow lose that information and when the service restarts start reading messages from the beginning of the topic or the end of it depending on my initial offset config? Or if kafka loses my offset it will also have lost all messages in the topic so that it is alright to read from the beginning?
Note: I use spring-kafka on the service, but not sure if that is relevant to the question.
In most cases where you have an active consumer (with manual or auto-committing), you don't need to worry about it.
The cases where you do need to consider the behavior of auto.offset.reset setting is when the offsets.retention.minutes time on the broker has elapsed while your consumer group(s) are inactive. When this happens, Kafka compacts the __consumer_offsets topic and removes any offsets stored for those inactive groups
Losing offsets doesn't affect the source topic. Your client topic(s) have their own independent retention settings, and its message can be removed as well (or not), depending on how you've configured it.
Issue we were facing:
In our system we were logging a ticket in database with status NEW and also putting it in the kafka queue for further processing. The processors pick those tickets from kafka queue, do processing and update the status accordingly. We found that some tickets are left in NEW state forever. So we were guessing whether tickets are failing to get produced in the queue or are no getting consumed.
Message loss / duplication scenarios (and some other related points):
So I started to dig exhaustively to know in what all ways we can face message loss and duplication in Kafka. Below I have listed all possible message loss and duplication scenarios that I can find in this post:
How data loss can occur in different approaches to handle all replicas down
Handle by waiting for leader to come online
Messages sent between all replica down and leader comes online are lost.
Handle by electing new broker as a leader once it comes online
If new broker is out of sync from previous leader, all data written between the
time where this broker went down and when it was elected the new leader will be
lost. As additional brokers come back up, they will see that they have committed
messages that do not exist on the new leader and drop those messages.
How data loss can occur when leader goes down, while other replicas may be up
In this case, the Kafka controller will detect the loss of the leader and elect a new leader from the pool of in sync replicas. This may take a few seconds and result in LeaderNotAvailable errors from the client. However, no data loss will occur as long as producers and consumers handle this possibility and retry appropriately.
When a consumer may miss to consume a message
If Kafka is configured to keep messages for a day and a consumer is down for a period of longer than a day, the consumer will lose messages.
Evaluating different approaches to consumer consistency
Message might not be processed when consumer is configured to receive each message at most once
Message might be duplicated / processed twice when consumer is configured to receive each message at least once
No message is processed multiple times or left unprocessed if consumer is configured to receive each message exactly once.
Kafka provides below guarantees as long as you are producing to one partition and consuming from one partition. All guarantees are off if you are reading from the same partition using two consumers or writing to the same partition using two producers.
Kafka makes the following guarantees about data consistency and availability:
Messages sent to a topic partition will be appended to the commit log in the order they are sent,
a single consumer instance will see messages in the order they appear in the log,
a message is ‘committed’ when all in sync replicas have applied it to their log, and
any committed message will not be lost, as long as at least one in sync replica is alive.
Approach I came up with:
After reading several articles, I felt I should do following:
If message is not enqueued, producer should resend
For this producer should listen for acknowledgement for each message sent. If no ackowledement is received, it can retry sending message
Producer should be async with callback:
As explained in last example here
How to avoid duplicates in case of producer retries sending
To avoid duplicates in queue, set enable.idempotence=true in producer configs. This will make producer ensure that exactly one copy of each message is sent. This requires following properties set on producer:
max.in.flight.requests.per.connection<=5
retries>0
acks=all (Obtain ack when all brokers has committed message)
Producer should be transactional
As explained here.
Set transactional id to unique id:
producerProps.put("transactional.id", "prod-1");
Because we've enabled idempotence, Kafka will use this transaction id as part of its algorithm to deduplicate any message this producer sends, ensuring idempotency.
Use transactions semantics: init, begin, commit, close
As explained here:
producer.initTransactions();
try {
producer.beginTransaction();
producer.send(record1);
producer.send(record2);
producer.commitTransaction();
} catch(ProducerFencedException e) {
producer.close();
} catch(KafkaException e) {
producer.abortTransaction();
}
Consumer should be transactional
consumerProps.put("isolation.level", "read_committed");
This ensures that consumer don't read any transactional messages before the transaction completes.
Manually commit offset in consumer
As explained here
Process record and save offsets atomically
Say by atomically saving both record processing output and offsets to any database. For this we need to set auto commit of database connection to false and manually commit after persisting both processing output and offset. This also requires setting enable.auto.commit to false.
Read initial offset (say to read after recovery from cache) from database
Seek consumer to this offset and then read from that position.
Doubts I have:
(Some doubts might be primary and can be resolved by implementing code. But I want words from experienced kafka developer.)
Does the consumer need to read the offset from database only for initial (/ first after consumer recovery) read or for all reads? I feel it needs to read offset from database only on restarts, as explained here
Do we have to opt for manual partitioning? Does this approach works only with auto partitioning off? I have this doubt because this example explains storing offset in MySQL by specifying partitions explicitly.
Do we need both: Producer side kafka transactions and consumer side database transactions (for storing offset and processing records atomically)? I feel for producer idempotence, we need producer to have unique transaction id and for that we need to use kafka transactional api (init, begin, commit). And as a counterpart, consumer also need to set isolation.level to read_committed. However can we ensure no message loss and duplicate processing without using kafka transactions? Or they are absolutely necessary?
Should we persist offset to external db as explained above and here
or send offset to transaction as explained here (also I didnt get what does it exactly mean by sending offset to transaction)
or follow sync async commit combo explained here.
I feel message loss / duplication scenarios 1 and 2 are handled by points 1 to 4 of approach I explained above.
I feel message loss / duplication scenario 3 is handled by point 6 of approach I explained above.
How do we implement different consumer consistency approaches as stated in message loss / duplication scenario 4? Is their any configuration or it needs to be implemented inside custom logic inside consumer?
Message loss / duplication scenario 5 says: "Kafka provides below guarantees as long as you are producing to one partition and consuming from one partition."? Is it something to concern about while building correct system?
Is any consideration unnecessary/redundant in the approach I came up with above? Also did I miss any necessary consideration? Did I miss any message loss / duplication scenarios?
Is their any other standard / recommended / preferable approach to ensure no message loss and duplicate processing than what I have thought above?
Do I have to actually code above approach using kafka APIs? or is there any high level API built atop kafka API which allows to easily ensure no message loss and duplicate processing?
Looking at issue we were facing (as stated at very beginning), we were thinking if we can recover any lost/unprocessed messages from files in which kafka stores messages. However that isnt correct, right?
(Extremely sorry for such an exhaustive post but wanted to write question which will ask all related question at one place allowing to build big picture of how to build system around kafka.)
I just want to know how the Consumer is able to consume data when the producer is down. Let's say Producer keeps sending logs to the consumer at a steady rate and then the producer goes down from 8AM- 6PM. How does the consumer work in such a case and is there a way that the consumer can get the data that would have been sent during 8am - 6pm if the producer was up.
In Apache Kafka there is no relationship between how producer and consumer behaves.
Acting as a messaging system, Kafka allows to decoupling producer from a consumer providing an asynchronous communication channel.
The producer can send messages at its own pace and the consumer can read these messages in real time or later at its own pace (different from the producer one).
The messages are saved in a topic living in the Kafka cluster, and each message has a position in the topic partition (offset).
Of course, it's possible to tune when messages are deleted from the topic if the consumer isn't online for long time reading the messages.
You can set to store messages for very long time (days, weeks, months) and after that they will be deleted; or you can set to store messages based on time (so deleting the ones older than a time).
Furthermore, the consumer is also able to rewind the stream of messages in the topic, actually re-reading the messages if needed.
Finally, the consumer can also seek to a specific position in the topic partition based on offset or specifiying a time.
The Kafka doc has a nice diagram which I copied below. It shows the novelty of Kafka in a succinct way.
Without Kafka, the situation is something like this. We have multiple servers, e.g. Frontend servers, DB servers, Chat servers etc. On the other side, we have probably different metrics and monitoring tools (e.g. DB monitor, UI monitor etc.). Direct one-to-one communications between different servers and collectors might work out for smaller systems, but it breaks down pretty quickly after the system has surpassed a a certain threshold, in terms of scalability. Kafka solves this problem by decoupling the senders and receivers. Both of them talk through the Kafka brokers instead of talking to each other.
So, in your case the consumer would simply ask the broker if there's any new data on the topic it's subscribing to. As the producer is down, and assuming there is no data in the queue, broker would reply, there's nothing to be consumed.. So, the consumer would be perpetually polling in a fixed interval, in an endless loop and do nothing. Whenever the producer comes up and starts pumping out data, consumer would start receiving (and processing) it. There are more involved use cases when you might be losing data if retention period for particular topic is over, and the consumer hasn't processed the backlog. But I don't think that's a concern for you at this point of your journey.
I'm managing a kafka queue using a common consumer group across multiple machines. Now I also need to show the current content of the queue. How do I read only those messages within the group which haven't been read, yet making those messages again readable by other consumers in the group which actually processes those messages. Any help would be appreciated.
In Kafka, the notion of "reading" messages from a topic and that of "consuming" them are the same thing. At a high level, the only thing that makes a "consumed" message unavailable to a consumer is that consumer setting its read offset to a value beyond that of the message in question. Thus, you can turn off the autocommit feature of your consumers and avoid committing offsets in cases where you'd like only to "read" but not to "consume".
A good proxy for getting "all messages which haven't been read" is to compare the latest committed offset to the highwater mark offset per partition. This provides a notion of "lag" that indicates how far behind a given consumer is in its consumption of a partition. The fetch_consumer_lag CLI function in pykafka is a good example of how to do this.
In Kafka, a partition can be consumed by only one consumer in a group i.e. if your topic has 10 partitions and you spawned 20 consumers with same groupId, then only 10 will be connected to Kafka and remaining 10 will be sitting idle. A new consumer will be identified by Kafka only in case one of the existing consumer dies or does not poll from the topic.
AFAIK, I don't think you can do what I understand you want to do within a consumer group. You can obviously create another groupId and process message based on the information gathered by first consumer group.
Kafka now has a KStream.peek() method
See proposal "Add KStream peek method".
It's not 100% clear to me from the docs that this prevents consuming of message that's peeked from the topic, but I can't see how you could use it in any crash-safe, robust way unless it does.
See also:
Handling consumer rebalance when implementing synchronous auto-offset commit
High-Level Consumer and peeking messages
I think that you can use publish-subscribe model. Then each consumer has own offset and could consume all messages for itself.
I am trying to implement a simple Producer-->Kafka-->Consumer application in Java. I am able to produce as well as consume the messages successfully, but the problem occurs when I restart the consumer, wherein some of the already consumed messages are again getting picked up by consumer from Kafka (not all messages, but a few of the last consumed messages).
I have set autooffset.reset=largest in my consumer and my autocommit.interval.ms property is set to 1000 milliseconds.
Is this 'redelivery of some already consumed messages' a known problem, or is there any other settings that I am missing here?
Basically, is there a way to ensure none of the previously consumed messages are getting picked up/consumed by the consumer?
Kafka uses Zookeeper to store consumer offsets. Since Zookeeper operations are pretty slow, it's not advisable to commit offset after consumption of every message.
It's possible to add shutdown hook to consumer that will manually commit topic offset before exit. However, this won't help in certain situations (like jvm crash or kill -9). To guard againts that situations, I'd advise implementing custom commit logic that will commit offset locally after processing each message (file or local database), and also commit offset to Zookeeper every 1000ms. Upon consumer startup, both these locations should be queried, and maximum of two values should be used as consumption offset.