I have a batch job which writes data to a Kafka topic. Every message carries the data and a job identifier.
On the consumer side, I want to read only the messages that belong to this job. After the job has finished and all of its messages have been consumed, the consumer side has to do some post-processing.
1) If it is guaranteed that no other messages will be produced during the job, how can I tell that the job has finished and that all the messages produced by it have been consumed (taking into account multiple partitions and asynchrony)?
2) If it is NOT guaranteed that no other messages will be produced during the job, the noise can be skipped, I believe.
Thanks
I'm assuming the job_id is constant. In that case, you can put a check in your consumer to shut down after n consecutive polls return empty records from Kafka. n will depend on your ingestion rate and consumer poll interval.
I am only talking about the first case here. Mind you, this is just an idea and I have never tried it myself.
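A rough sketch of that empty-poll check, assuming consumer is an already-subscribed KafkaConsumer<String, String>; the threshold of 5, the poll timeout, and the process()/job-id filtering are placeholders:

int emptyPolls = 0;
final int maxEmptyPolls = 5;                      // "n": tune to your ingestion rate
while (emptyPolls < maxEmptyPolls) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
    if (records.isEmpty()) {
        emptyPolls++;                             // nothing arrived in this poll
        continue;
    }
    emptyPolls = 0;                               // reset on any activity
    for (ConsumerRecord<String, String> record : records) {
        process(record);                          // process only records with the matching job_id
    }
    consumer.commitSync();
}
// n consecutive empty polls: assume the job has finished, start post-processing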
You can use endOffsets() to get the end offsets of all the partitions and then, after every message, loop over them to check whether the current positions have caught up with those end offsets. If they all match, you have reached the end.
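A minimal sketch of that check, assuming consumer is the assigned KafkaConsumer<String, String>, the end offsets are captured once the producing job has finished, and the check runs after each poll (endOffsets() returns the offset of the next message to be written, so a partition is exhausted once position() reaches it):

Set<TopicPartition> partitions = consumer.assignment();
Map<TopicPartition, Long> endOffsets = consumer.endOffsets(partitions);   // capture once, after the job finished

boolean done = true;
for (TopicPartition tp : partitions) {
    if (consumer.position(tp) < endOffsets.get(tp)) {
        done = false;                             // this partition still has unread messages
        break;
    }
}
if (done) {
    // every assigned partition has been read up to its end offset: start post-processing
}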
Related: Is there an out-of-the-box mechanism in Kafka to produce a record that shall not be processed before a given timestamp, regardless of the contents of the topic?
Thanks.
There isn't; Kafka consumes by offset, not by timestamp. But that doesn't prevent your consumer from handing off messages to some secondary "priority queue" ordered by timestamp, backed by a scheduler thread that checks for the next event to be processed.
The only problem is that if you process a message at offset "O+1" with time "T-1" while waiting on the message at offset "O" with time "T", and your consumer crashes and loses that in-memory queue, then you're effectively skipping messages (committing O+1 without processing O).
Depending on your application, this might be okay, since you could re-produce the records to queue them again. You can also seek to specific offsets or look up (approximate) offsets by a particular timestamp (which defaults to the time the record was produced).
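For the seek-by-timestamp part, a hedged sketch, assuming consumer is an assigned KafkaConsumer and targetTimestampMs is the epoch-millis point you want to start from:

Map<TopicPartition, Long> query = new HashMap<>();
for (TopicPartition tp : consumer.assignment()) {
    query.put(tp, targetTimestampMs);                      // look up each partition at the same timestamp
}
Map<TopicPartition, OffsetAndTimestamp> result = consumer.offsetsForTimes(query);
for (Map.Entry<TopicPartition, OffsetAndTimestamp> e : result.entrySet()) {
    if (e.getValue() != null) {                            // null if no message at or after that timestamp
        consumer.seek(e.getKey(), e.getValue().offset());  // earliest offset whose timestamp >= target
    }
}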
I have a use case regarding consuming records with a Kafka consumer.
For instance,
I have 1 topic with 1 partition. Currently it has 10 records, and while the first 10 records are being consumed, another 10 records are written to the partition.
1) myConsumer polls for the first time and returns the first 10 records, say records 0-9.
2) It processes all the records successfully.
3) It invokes commitAsync() to commit the last offset to Kafka.
4) The commit response is still in flight; it may turn out to be a success or a failure.
5) But since the commit is asynchronous, it continues to poll for the next batch.
Now, how do Kafka or the consumer's poll know that the next read has to start from the 10th position, given that the commitAsync request has not yet completed?
Please help me understand this concept.
Committing an offset tells the broker that the consumer has processed the corresponding message successfully. The consumer itself is aware of its own progress (except at startup, where it gets its last committed offset from the broker).
At step 5 in your description, the offset commit is in progress. So:
The broker does not know that records 0-9 have been processed.
The consumer itself has read the messages, so it knows that it has read messages 0-9 and will read from the 10th onwards next.
Possible Scenarios
Let's say the commit fails for (0-9). If your next batch, say (10-15), is processed and committed successfully, then there is no harm done, since the broker is told that processing up to 15 is complete.
Let's say the commit fails for (0-9). Your next batch, (10-15), is processed, and before committing, the consumer goes down. When your consumer is brought back up, it takes its state from the broker (which has a commit for neither batch), so it will start reading from the 0th message again.
You can come up with several other scenarios as well. I guess the bottom line is that the committed offset only comes into the picture when your consumer is restarted for whatever reason and has to get its last processed offset from the Kafka broker.
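To make the interplay concrete, a rough sketch of the poll loop from the question (process() is a placeholder): the in-memory position advances as soon as poll() returns the records, while commitAsync() only updates the broker-side committed offset and reports failures through its callback:

while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
    for (ConsumerRecord<String, String> record : records) {
        process(record);                               // your processing logic
    }
    // the consumer's position has already moved past the last polled record,
    // so the next poll() fetches from there regardless of the commit outcome
    consumer.commitAsync((offsets, exception) -> {
        if (exception != null) {
            // commit failed: the broker still holds the older committed offset;
            // that only matters if the consumer restarts or a rebalance happens
        }
    });
}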
I see from the logs that the exact same message is consumed 665 times. Why does this happen?
I also see this in the logs:
Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member.
This means that the time between subsequent calls to poll() was longer than the configured session.timeout.ms, which typically implies
that the poll loop is spending too much time message processing. You can address this either by increasing the session
timeout or by reducing the maximum size of batches returned in poll() with max.poll.records.
Consumer properties
group.id=someGroupId
bootstrap.servers=kafka:9092
enable.auto.commit=false
key.deserializer=org.apache.kafka.common.serialization.StringDeserializer
value.deserializer=org.apache.kafka.common.serialization.StringDeserializer
session.timeout.ms=30000
max.poll.records=20
PS: Is it possible to consume only a specific number of messages, like 10 or 50 or 100, from the 1000 that are in the queue?
I was looking at the 'fetch.max.bytes' config, but it seems to be about message size rather than the number of messages.
Thanks
The answer lies in understanding the following concepts:
session.timeout.ms
heartbeats
max.poll.interval.ms
In your case, your consumer receives a message via poll() but is not able to complete its processing within max.poll.interval.ms. It is therefore assumed hung by the broker, a rebalance of the partitions happens, and this consumer loses ownership of all its partitions. It is marked dead and is no longer part of the consumer group.
Then, when your consumer completes the processing and calls poll() again, two things happen:
The commit fails, as the consumer no longer owns the partitions.
The broker identifies that the consumer is up again, so a rebalance is triggered; the consumer rejoins the consumer group, starts owning partitions, and requests messages from the broker. Since the earlier message was not marked as committed (refer to #1 above, the failed commit) and is still pending processing, the broker delivers the same message to the consumer again.
The consumer again takes a long time to process, and since it is unable to finish within max.poll.interval.ms, 1. and 2. keep repeating in a loop.
To fix the problem, you can increase the max.poll.interval.ms to a large enough value based on how much time your consumer needs for processing. Then your consumer will not get marked as dead and will not receive duplicate messages.
However, the real fix is to check your processing logic and try to reduce the processing time.
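If you do go the configuration route, the change is roughly the following, in the same properties format as the config above (the 10-minute value is purely illustrative; size it to your worst-case processing time per poll batch):

# default is 300000 (5 minutes); raise it above your worst-case processing time per batch
max.poll.interval.ms=600000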
The fix is described in the message you pasted:
You can address this either by increasing the session timeout or by
reducing the maximum size of batches returned in poll() with
max.poll.records.
The reason is that a timeout is reached before your consumer is able to process and commit the message. When your Kafka consumer "commits", it's basically acknowledging receipt of the previous message, advancing the offset, and therefore moving on to the next message. But if that timeout has passed (as is the case for you), the consumer's commit isn't effective because it's happening too late; then the next time the consumer asks for a message, it's given the same message again.
Some of your options are to:
Increase session.timeout.ms=30000, so the consumer has more time to process the messages
Decrease max.poll.records=20 so the consumer has fewer messages to work on before the timeout occurs. But this doesn't really apply to you, because your consumer is already working on only a single message
Or turn on enable.auto.commit, which probably also isn't the best solution for you because it might result in dropped messages, as mentioned below:
If we allowed offsets to auto commit as in the previous example
messages would be considered consumed after they were given out by the
consumer, and it would be possible that our process could fail after
we have read messages into our in-memory buffer but before they had
been inserted into the database.
Source: https://kafka.apache.org/090/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html
I have a Kafka consumer project which consumes data from a specific Kafka topic. 90% of the records are processed as soon as I get them, but I have to delay the processing of some of the records (10%).
Since these records need to be delayed, I can't commit them, which may cause Kafka to reassign the partitions to other nodes. To avoid that, I could read the same topic twice and delay the data-fetching part in the second consumer, but that requires deserializing everything twice and so comes with an overhead.
Is it possible to read records using a single consumer but have two separate commits? It would basically be similar to having two different consumers in terms of commits: consumer.poll would be called from a single consumer, but there would be two consumer.commitSync calls for each batch. That would help me avoid the extra deserialization and also the network cost.
Below are the things you can do to achieve this:
Create a pipeline with two topics (T1, T2): push 90% of the messages to topic T1 and the remaining 10% to topic T2.
Make your Kafka consumer configurable, i.e. you can easily pass the polling interval, batch size, and batch timeout whenever you start your consumer.
Work out the trigger logic, or, if consumption of your second topic is time-based, schedule a cron job that starts and stops your consumer for topic T2 when required.
Regarding consumer groups, you can place both of your topics in the same group or in different groups; it's completely your choice.
This way you keep the topics clean, and every time you need to process the messages you can do it easily, having set up the pipeline just once.
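A minimal sketch of the producer side of such a pipeline; the topic names T1/T2, needsDelay(), and the message accessors are placeholders for your own routing logic:

String topic = needsDelay(message) ? "T2" : "T1";      // the ~10% delayed records go to T2, the rest to T1
producer.send(new ProducerRecord<>(topic, message.key(), message.value()));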
Let's say there is a batch API for performing a list of tasks, List[T]. In order to do the job, all the tasks need to be pushed to Kafka. There are 2 ways to do that:
1) Pushing the List as a single message to Kafka
2) Pushing each individual task T to Kafka
I believe approach 1 would be better, since I don't have to push messages to Kafka multiple times for a single batch call. Can someone please tell me if there is any harm in such an approach?
A Kafka producer can batch together individual messages sent within a short time window (the particular config is linger.ms), so the cost of sending individual messages is probably a lot lower than you think.
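For reference, the producer-side batching knobs look roughly like this in properties form (values are illustrative only):

# wait up to 10 ms for more messages before sending a batch
linger.ms=10
# maximum bytes per batch per partition
batch.size=16384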
Probably a more important factor to consider is how the consumer is going to consume the messages. What should happen if the consumer cannot process one of the tasks, for example? If the consumer is just going to call some other batch-based API which succeeds or fails as a batch, then a single message containing a list of tasks would be a perfectly good fit. On the other hand, if the consumer ultimately has to process tasks individually, then sending individual messages is probably a better fit, and will probably save you from having to implement some sort of retry logic in your consumer, because you can probably configure Kafka to behave with the semantics you need.
Starting from Kafka v0.11 you can also use transactions in the producer to publish your entire batch atomically, i.e. you begin the transaction, publish your tasks message by message, and finally commit the transaction. Even though the messages may be sent to Kafka in multiple batches, they will only become visible to consumers once you commit the transaction, as long as your consumers are running in read_committed mode.
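A hedged sketch of that transactional publish; it assumes producer has transactional.id configured, and the topic name and serialize() helper are placeholders:

producer.initTransactions();
try {
    producer.beginTransaction();
    for (Task task : tasks) {
        producer.send(new ProducerRecord<>("tasks-topic", serialize(task)));
    }
    producer.commitTransaction();      // tasks become visible to read_committed consumers only here
} catch (KafkaException e) {
    producer.abortTransaction();       // none of the tasks become visible
    throw e;
}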
Option 1 is the preferred method in Kafka so long as the entire batch should always stay together. If you publish a List of records as a batch then they will be stored as a batch, they will be (optionally) compressed as a batch yielding better compression, and they will be fetched by consumers as a batch yielding fewer fetch requests.
If you send individual messages then you will have to give them a common key, or they will get spread out over different partitions and possibly be consumed out of order, or by different consumers of the consumer group.
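If you do go with individual messages, using the batch's job id as the message key is a simple way to get that; a short sketch, with the topic name, jobId, and serialize() as placeholders:

for (Task task : tasks) {
    // same key -> same partition, so the tasks stay in order and go to one consumer of the group
    producer.send(new ProducerRecord<>("tasks-topic", jobId, serialize(task)));
}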