Having a Kafka Consumer read a single message at a time - apache-kafka

We have Kafka set up so that messages can be processed in parallel by several servers. But every message must be processed exactly once (and by only one server). We have this up and running and it’s working fine.
Now, the problem for us is that the Kafka consumers read messages in batches for maximal efficiency. This becomes a problem if/when processing fails, the server shuts down, or whatever, because then we lose data that was about to be processed.
Is there a way to get the consumer to read only one message at a time and let Kafka keep the unprocessed messages? Something like: consumer pulls one message -> processes it -> commits the offset when done, repeat. Is this feasible with Kafka? Any thoughts/ideas?
Thanks!

You might try setting max.poll.records to 1.
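For the record, a minimal sketch of how that could look with the Java consumer, with auto-commit disabled so the offset is only committed after the record has been processed (the servers, group id and topic name below are placeholders):
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
public class OneAtATimeConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        props.put("group.id", "my-processing-group");     // placeholder
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("enable.auto.commit", "false"); // commit manually, only after processing succeeds
        props.put("max.poll.records", "1");       // each poll() returns at most one record
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic")); // placeholder topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    process(record);       // your processing logic
                    consumer.commitSync(); // offset committed only once the record is processed
                }
            }
        }
    }
    private static void process(ConsumerRecord<String, String> record) {
        System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
    }
}
Note that max.poll.records only caps how many records poll() hands back to your code; the consumer may still fetch and buffer more records from the broker internally.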

You mention wanting exactly-once processing, but then you're worried about losing data. I'm assuming you're just worried about the edge case where one of your servers fails and you lose data?
I don't think there's a way to consume only one message at a time. Looking through the consumer configurations, there only seems to be an option for setting the maximum number of bytes a consumer can fetch from Kafka, not a number of messages:
fetch.message.max.bytes
But if you're worried about losing data completely: as long as you never commit the offset, Kafka will not mark it as consumed and the message won't be lost.
Reading through the Kafka documentation about delivery semantics:
So effectively Kafka guarantees at-least-once delivery by default and
allows the user to implement at most once delivery by disabling
retries on the producer and committing its offset prior to processing
a batch of messages. Exactly-once delivery requires co-operation with
the destination storage system but Kafka provides the offset which
makes implementing this straight-forward.
So exactly-once processing is not something that Kafka enables out of the box. It requires you to store the offset yourself whenever you write the output of your processing to storage:
But this can be handled more simply and generally by simply letting
the consumer store its offset in the same place as its output...As an example of this,
our Hadoop ETL that populates data in HDFS stores its offsets in HDFS
with the data it reads so that it is guaranteed that either data and
offsets are both updated or neither is.
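To make the quoted idea concrete, here is a rough sketch of storing the offset in the same place as the output, assuming the output goes to a relational database over JDBC; the connection URL, table names and schema are made up for illustration, and the consumer_offsets table is assumed to already hold one row per topic/partition. Auto-commit is disabled and the offset is never committed back to Kafka; the database transaction is the source of truth:
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
public class OffsetsStoredWithOutput {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        props.put("group.id", "etl-group");                // placeholder
        props.put("enable.auto.commit", "false");          // offsets live in the DB, never in Kafka
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
             Connection db = DriverManager.getConnection("jdbc:postgresql://localhost/etl")) { // placeholder URL
            db.setAutoCommit(false);
            consumer.subscribe(Collections.singletonList("events")); // placeholder topic
            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                    try (PreparedStatement out = db.prepareStatement(
                             "INSERT INTO processed_output(data) VALUES (?)");
                         PreparedStatement off = db.prepareStatement(
                             "UPDATE consumer_offsets SET last_offset = ? WHERE topic = ? AND kafka_partition = ?")) {
                        out.setString(1, record.value());
                        out.executeUpdate();
                        off.setLong(1, record.offset());
                        off.setString(2, record.topic());
                        off.setInt(3, record.partition());
                        off.executeUpdate();
                        db.commit();   // output row and offset are persisted together, or not at all
                    } catch (Exception e) {
                        db.rollback(); // neither the output nor the offset is persisted
                        throw e;
                    }
                }
            }
        }
    }
}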
I hope that helps.

It depends on what client you are going to use. For C++ and Python, it is possible to consume ONE message at a time.
For Python, I used https://github.com/mumrah/kafka-python. The following code consumes one message at a time:
message = self.__consumer.get_message(block=False, timeout=self.IterTimeout, get_partition_info=True)
__consumer is a SimpleConsumer object.
See my question and answer here: How to stop Python Kafka Consumer in program?
For C++, I am using https://github.com/edenhill/librdkafka. The following code consumes one message at a time:
while (m_bRunning)
{
    // Start to read messages from the local queue.
    RdKafka::Message *msg = m_consumer->consume(m_topic, m_partition, 1000);
    msg_consume(msg);
    delete msg;
    m_consumer->poll(0);
}
m_consumer is a pointer to the C++ Consumer object (C++ API).
Hope this helps.

Related

How to consume Kafka's messages on a single consumer?

I need to implement a system where, when the application starts, a thread consumes all the messages generated while the service was shut down. In parallel, the application must consume new messages, starting from the last message read by the thread that is in charge of consuming the old messages.
Is there a solution to this problem in Kafka?
I'm not mentioning the language I'm using because I think this is a Kafka feature.
EDIT:
Suppose we start the machine with the consumers at 18:00. The consumer assigned to read old messages must take all messages from 00:00 to 18:00, and in parallel the other consumers start reading messages from 18:00 onward.
This is how consumers work by default. You also have to be mindful of message retention: if that process doesn't restart within a certain amount of time, you might lose messages. Kafka can retain data forever, but it costs $$$; you need to find the right retention for you.
From your comment, what you describe (multiple consumers consuming the same messages) happens when they have different consumer group ids. If you use the same consumer group, messages won't be processed twice during normal operation.
I need to warn you: Kafka is a very complex technology; do not use it unless you properly understand how consumers and producers work in detail. I would suggest you pick up, at a bare minimum, the Kafka Definitive Guide before using it, unless you are OK with all kinds of failure scenarios.
Also, by default Kafka guarantees at-least-once delivery. If you want to be sure that you process messages exactly once, please read Exactly-Once Semantics Are Possible: Here’s How Kafka Does It, and know that this also depends on what you do while processing messages. If you touch a database, it might be better to use something on the DB side that guarantees uniqueness (a kind of idempotency) so each message is processed only once.
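As a sketch of that idempotency idea, assuming PostgreSQL and a table whose primary key is the record's topic/partition/offset (all the names below are made up):
// Assumed table, created roughly as:
//   CREATE TABLE processed (topic TEXT, kafka_partition INT, kafka_offset BIGINT,
//                           payload TEXT, PRIMARY KEY (topic, kafka_partition, kafka_offset));
import java.sql.Connection;
import java.sql.PreparedStatement;
import org.apache.kafka.clients.consumer.ConsumerRecord;
public class IdempotentWriter {
    // Re-delivered records hit the primary key and are silently skipped, so
    // at-least-once delivery still results in exactly-once effects in the DB.
    public static void write(Connection db, ConsumerRecord<String, String> record) throws Exception {
        try (PreparedStatement ps = db.prepareStatement(
                "INSERT INTO processed(topic, kafka_partition, kafka_offset, payload) " +
                "VALUES (?, ?, ?, ?) ON CONFLICT DO NOTHING")) {
            ps.setString(1, record.topic());
            ps.setInt(2, record.partition());
            ps.setLong(3, record.offset());
            ps.setString(4, record.value());
            ps.executeUpdate();
        }
    }
}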

What is the best practice to retry messages from Dead letter Queue for Kafka

We are using Kafka as the messaging system between microservices. We have a Kafka consumer listening to a particular topic and then publishing the data into another topic to be picked up by a Kafka connector that is responsible for publishing it into some data storage.
We are using Apache Avro as serialization mechanism.
We need to enable a DLQ to add fault tolerance to the Kafka consumer and the Kafka connector.
A message could move to the DLQ for multiple reasons:
Bad format
Bad Data
Throttling due to a high volume of messages, so some messages could move to the DLQ
Publishing to the data store failed due to connectivity.
For the 3rd and 4th points above, we would like to retry the messages from the DLQ.
What is the best practice here? Please advise.
Only push to the DLQ records that cause non-retryable errors, that is, point 1 (bad format) and point 2 (bad data) in your example. For the format of the DLQ records, a good approach is to:
push to the DLQ the exact same Kafka record value and key as the original one, and do not wrap it inside any kind of envelope. This makes it much easier to reprocess with other tools during troubleshooting (e.g. with a new version of a deserializer).
add a bunch of Kafka headers to communicate metadata about the error (see the sketch after this list); a few typical examples would be:
original topic name, partition, offset and Kafka timestamp of this record
exception or error message
name and version of the application that failed to process that record
time of the error
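A rough sketch of what pushing one failed record to such a DLQ could look like with the Java producer; the DLQ topic, application name and header names are just conventions made up for illustration:
import java.nio.charset.StandardCharsets;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
public class DlqPublisher {
    // Forwards the failed record to the DLQ unchanged (same key and value),
    // carrying the error context in headers.
    public static void sendToDlq(KafkaProducer<byte[], byte[]> producer,
                                 ConsumerRecord<byte[], byte[]> failed,
                                 Exception error) {
        ProducerRecord<byte[], byte[]> dlqRecord =
            new ProducerRecord<>("my-service-dlq", failed.key(), failed.value()); // placeholder DLQ topic
        dlqRecord.headers()
            .add("dlq.original.topic", failed.topic().getBytes(StandardCharsets.UTF_8))
            .add("dlq.original.partition", String.valueOf(failed.partition()).getBytes(StandardCharsets.UTF_8))
            .add("dlq.original.offset", String.valueOf(failed.offset()).getBytes(StandardCharsets.UTF_8))
            .add("dlq.original.timestamp", String.valueOf(failed.timestamp()).getBytes(StandardCharsets.UTF_8))
            .add("dlq.error.message", String.valueOf(error.getMessage()).getBytes(StandardCharsets.UTF_8))
            .add("dlq.application", "my-service 1.2.3".getBytes(StandardCharsets.UTF_8)) // placeholder name/version
            .add("dlq.error.time", String.valueOf(System.currentTimeMillis()).getBytes(StandardCharsets.UTF_8));
        producer.send(dlqRecord);
    }
}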
Typically I use one single DLQ topic per service or application (not one per inbound topic, not a shared one across services). That tends to keep things independent and manageable.
Oh, and you probably want to put some monitoring and alert on the inbound traffic to the DLQ topic ;)
Point 3 (high volume) should, IMHO, be dealt with using some sort of auto-scaling, not with a DLQ. Try to always over-estimate (a bit) the number of partitions of the input topic, since the maximum number of instances of your service you can start is limited by that. Too many messages are not going to overload your service, since Kafka consumers explicitly poll for more messages when they decide to, so they never ask for more than the app can process. If there is a peak of messages, they will simply keep piling up in the upstream Kafka topic.
Point 4 (connectivity) should be retried directly from the source topic, without any DLQ involved, since the error is transient. Dropping the message to a DLQ and picking up the next one is not going to solve anything since, well, the connectivity issue will still be present and the next message would likely be dropped as well. Reading, or not reading, a record from Kafka does not make it go away, so a record stored there is easy to read again later. You can program your service to move forward to the next inbound record only if it successfully writes a resulting record to the outbound topic (see Kafka transactions: reading a topic actually involves a write operation, since the new consumer offsets need to be persisted, so you can tell your program to persist the new offsets and the output records as part of the same atomic transaction).
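A sketch of that read-process-write transaction with the Java clients; the topics, group id and transactional id are placeholders, and error handling (aborting the transaction on failure) is omitted for brevity. Transactions need Kafka 0.11+, and the groupMetadata() form of sendOffsetsToTransaction needs a 2.5+ client:
import java.time.Duration;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.TopicPartition;
public class TransactionalForwarder {
    public static void main(String[] args) {
        Properties cProps = new Properties();
        cProps.put("bootstrap.servers", "localhost:9092"); // placeholder
        cProps.put("group.id", "forwarder");                // placeholder
        cProps.put("enable.auto.commit", "false");
        cProps.put("isolation.level", "read_committed");
        cProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        cProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        Properties pProps = new Properties();
        pProps.put("bootstrap.servers", "localhost:9092"); // placeholder
        pProps.put("transactional.id", "forwarder-tx-1");   // placeholder, must be stable per instance
        pProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        pProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(cProps);
             KafkaProducer<String, String> producer = new KafkaProducer<>(pProps)) {
            producer.initTransactions();
            consumer.subscribe(Collections.singletonList("inbound")); // placeholder topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                if (records.isEmpty()) continue;
                producer.beginTransaction();
                Map<TopicPartition, OffsetAndMetadata> offsets = new HashMap<>();
                for (ConsumerRecord<String, String> record : records) {
                    producer.send(new ProducerRecord<>("outbound", record.key(), record.value())); // placeholder topic
                    offsets.put(new TopicPartition(record.topic(), record.partition()),
                                new OffsetAndMetadata(record.offset() + 1));
                }
                // The output records and the new consumer offsets commit atomically, or not at all.
                producer.sendOffsetsToTransaction(offsets, consumer.groupMetadata());
                producer.commitTransaction();
            }
        }
    }
}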
Kafka is more like a storage system (with just 2 operations: sequential reads and sequential writes) than a messaging queue; it's good at persistence, data replication, throughput, scale... (...and hype ;) ). It tends to be really good for representing data as a sequence of events, as in "event sourcing". If the needs of this microservice setup are mostly asynchronous point-to-point messaging, and if most scenarios would rather favor super low latency and choose to drop messages rather than reprocess old ones (as the 4 points listed seem to suggest), maybe a lossy in-memory queuing system like Redis queues is more appropriate?
With regard to retrying messages in the DLQ, you could attach a Lambda to it and put your retry logic in there. Not ideal, as I prefer the original answer, but if you're under heavy load like I am and can't depend on the web service you're calling, many messages end up in the DLQ for temporary reasons and have to be retried after 24 hours, etc.
We also use a redrive policy on the original SQS queue to retry up to 4 times after 20 minutes pass. Here's an example of how to set that up in our Terraform script:
resource "aws_sqs_queue" "backoffice_dlq_queue" {
name = "backoffice-dlq-queue.fifo"
fifo_queue = true
content_based_deduplication = true
deduplication_scope = "queue"
receive_wait_time_seconds = 10
message_retention_seconds = 14 * 86400 // 14 days
kms_master_key_id = aws_kms_key.sns_kms_key.id
tags = local.common_tags
}
resource "aws_sqs_queue" "backoffice_queue" {
depends_on = [aws_sqs_queue.backoffice_dlq_queue]
name = "backoffice-queue.fifo"
fifo_queue = true
content_based_deduplication = true
deduplication_scope = "queue"
message_retention_seconds = 172800
visibility_timeout_seconds = 300
receive_wait_time_seconds = 10
kms_master_key_id = aws_kms_key.sns_kms_key.id
tags = local.common_tags
redrive_policy = jsonencode({
deadLetterTargetArn = aws_sqs_queue.backoffice_dlq_queue.arn
maxReceiveCount = 4
})
redrive_allow_policy = jsonencode({
redrivePermission = "byQueue",
sourceQueueArns = ["${aws_sqs_queue.backoffice_dlq_queue.arn}"]
})
}

What atomicity guarantees - if any - does Kafka have regarding batch writes?

We're now moving one of our services from pushing data through legacy communication tech to Apache Kafka.
The current logic is to send a message to IBM MQ and retry if errors occur. I want to repeat that, but I don't have any idea about what guarantees the broker provides in that scenario.
Let's say I send 100 messages in a batch via the producer in the Java client library. Assuming the batch reaches the cluster, is there a possibility that only part of it is accepted (e.g. a disk is full, or some partitions I touch in my write are under-replicated)? Can I detect that problem from my producer and retry only those messages that weren't accepted?
I searched for "kafka atomicity guarantee" but came up empty; maybe there's a well-known term for it.
When you say you send 100 messages in one batch, do you mean you want to control this number of messages, or are you OK with letting the producer batch a certain number of messages and then send the batch?
Because I'm not sure you can control the number of produced messages in one producer batch; the API will queue them and batch them for you, but without a guarantee of batching them all together (I'll check that, though).
If you're OK with letting the API batch a certain number of messages for you, here are some clues about how they are acknowledged.
When dealing with the producer, Kafka comes with some reliability regarding writes (including "batch writes").
As stated in this SlideShare post: https://www.slideshare.net/miguno/apache-kafka-08-basic-training-verisign (83)
The original list of messages is partitioned (randomly if the default partitioner is used) based on their destination partitions/topics, i.e. split into smaller batches.
Each post-split batch is sent to the respective leader broker/ISR (the individual send()’s happen sequentially), and each is acked by its respective leader broker according to request.required.acks
So, regarding atomicity: I'm not sure the whole batch will be seen as atomic given the above behavior. Maybe you can make sure to send your batch of messages using the same key for each message, so they all go to the same partition and thus may effectively become atomic.
If you need more clarity about acknowledgment rules when producing, here is how it works, as stated at https://docs.confluent.io/current/clients/producer.html:
You can control the durability of messages written to Kafka through the acks setting.
The default value of "1" requires an explicit acknowledgement from the partition leader that the write succeeded.
The strongest guarantee that Kafka provides is with "acks=all", which guarantees that not only did the partition leader accept the write, but it was successfully replicated to all of the in-sync replicas.
You can also look at the producer's enable.idempotence behavior if you aim to have no duplicates while producing.
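For illustration, a small producer sketch putting those settings together (the broker address and topic are placeholders); the per-record callback is also where you can see which individual sends were not accepted despite retries and decide what to do with them:
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
public class DurableProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        props.put("acks", "all");                // wait for all in-sync replicas
        props.put("enable.idempotence", "true"); // broker de-duplicates retried batches
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 100; i++) {
                ProducerRecord<String, String> record =
                    new ProducerRecord<>("my-topic", "key-" + i, "message-" + i); // placeholder topic
                producer.send(record, (metadata, exception) -> {
                    if (exception != null) {
                        // this individual record was not accepted; log it, re-send it, etc.
                        System.err.println("failed: " + record.key() + " -> " + exception);
                    }
                });
            }
            producer.flush(); // block until all buffered records have been sent or have failed
        }
    }
}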
Yannick

How to make a restart-able producer?

The latest version of Kafka supports exactly-once semantics (EoS). To support this, extra details are added to each message. This means that at your consumer, if you print the offsets of the messages, they won't necessarily be sequential. This makes it harder to poll a topic to read the last committed message.
In my case, the consumer printed something like this:
Offset-0 0
Offset-2 1
Offset-4 2
Problem: in order to write a restart-able producer, I poll the topic and read the content of the last message. In this case, the last message would be offset #5, which is not a valid consumer record. Hence, I see errors in my code.
I can use the solution provided at: Getting the last message sent to a kafka topic. The only problem is that, instead of using consumer.seek(partition, last_offset-1), I would use consumer.seek(partition, last_offset-2). This can immediately resolve my issue, but it's not an ideal solution.
What would be the most reliable and best solution to get the last committed message for a consumer written in Java? OR
Is it possible to use a local state store for a partition? OR
What is the most recommended way to store the last message to withstand producer failure? OR
Are Kafka connectors restartable? Is there any specific API that I can use to make producers restartable?
FYI: I am not looking for a quick fix.
In my case, multiple producers push data to one big topic. Therefore, reading the entire topic would be a nightmare.
The solution that I found is to maintain another topic, i.e. "P1_Track", where a producer can store metadata. Within a transaction, a producer sends data to the one big topic and to P1_Track.
When I restart a producer, it will read P1_Track and figure out where to start from.
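A rough sketch of that transaction with the Java producer; the topic names and the metadata format are placeholders for whatever you store in P1_Track:
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
public class TrackedProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        props.put("transactional.id", "producer-P1");      // one stable id per producer instance
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.initTransactions();
            producer.beginTransaction();
            try {
                // The data record and its tracking metadata are committed together, or not at all.
                producer.send(new ProducerRecord<>("big-topic", "record-key", "payload"));        // placeholder
                producer.send(new ProducerRecord<>("P1_Track", "P1", "last-written=record-key")); // placeholder format
                producer.commitTransaction();
            } catch (Exception e) {
                producer.abortTransaction();
                throw e;
            }
        }
    }
}
On restart, the producer would read P1_Track with isolation.level=read_committed so it only sees metadata from committed transactions.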
I am also thinking about storing the last committed message in a database and using it when the producer process restarts.

Kafka only once consumption guarantee

I see in some answers around Stack Overflow, and in general on the web, the idea that Kafka does not support consumption acknowledgment, or that exactly-once consumption is hard to achieve.
For example, in the following entry,
Is there any reason to use RabbitMQ over Kafka?, I can read the following statements:
RabbitMQ will keep all states about consumed/acknowledged/unacknowledged messages while Kafka doesn't
or
Exactly once guarantees are hard to get with Kafka.
This is not what I understand by reading the official Kafka documentation at:
https://kafka.apache.org/documentation/#design_consumerposition
That documentation states that Kafka does not use a traditional acknowledgment implementation (as RabbitMQ does). Instead, it relies on the relationship between partition, consumer and offset...
This makes the equivalent of message acknowledgements very cheap
Could somebody please explain why an "only once consumption guarantee" in Kafka is difficult to achieve? And how does this differ between Kafka and other, more traditional message brokers such as RabbitMQ? What am I missing?
If you mean exactly once, the problem is this.
Kafka consumers, as you may know, use a polling mechanism; that is, consumers ask the server for messages. Also, you need to recall that consumers commit message offsets; that is, a consumer tells the cluster what the next expected offset is. So, imagine what could happen.
The consumer polls for messages and gets the message with offset = 1.
A) If the consumer commits that offset immediately, before processing the message, it can then crash and will never receive that message again, because the offset was already committed; on the next poll, Kafka will return the message with offset = 2. This is what they call at-most-once semantics.
B) If the consumer processes the message first and then commits the offset, what could happen is that the consumer crashes after processing the message but before committing; in that case, the next poll will get the same message with offset = 1 again, and that message will be processed twice. This is what they call at-least-once.
In order to achieve exactly once, you need to process the message and commit the offset in one atomic operation, where you always do both or neither. This is not so easy. One way to do this (if possible) is to store the result of the processing along with the offset of the message that generated that result. Then, when the consumer starts, it looks up the last processed offset outside Kafka and seeks to that offset.
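As a rough sketch of that last step with the Java consumer; the topic, partition and the lookup of the stored offset are placeholders for your own external store:
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
public class ResumeFromStoredOffset {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        props.put("group.id", "resume-demo");              // placeholder
        props.put("enable.auto.commit", "false");          // offsets are managed outside Kafka
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        long lastProcessedOffset = loadLastProcessedOffsetFromStore(); // e.g. from the store holding the results
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition tp = new TopicPartition("events", 0); // placeholder topic/partition
            consumer.assign(Collections.singletonList(tp));      // manual assignment, no group rebalance
            consumer.seek(tp, lastProcessedOffset + 1);          // resume right after the last processed record
            while (true) {
                consumer.poll(Duration.ofSeconds(1)).forEach(r ->
                    System.out.printf("offset=%d value=%s%n", r.offset(), r.value()));
            }
        }
    }
    private static long loadLastProcessedOffsetFromStore() {
        return 42L; // stand-in for a real lookup in the external store
    }
}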