Handling catastrophic failover in Kafka

Handling catastrophic failover in Kafka - apache-kafka

Let's imaging a simple message processing pipeline, like on the image below:
A group of consumers listens to a topic, picks messages one by one, does some sort of processing and sends them over to the next topic.
Some messages crash the consumer or make it stuck forever (so then a liveness probe kills the consumer after timeout).
In this case a consumer is not able to commit the offset, so the malicious message gets picked up by another consumer. And also makes it crash.
Ideally we want to move the message to a dead letter topic after N such attempts.
This can be achieved by introducing a shared storage:
But this creates coupling between the services and introduces a Single Point of Failure (SPOF) which is the shared database.
I'm looking for ideas on how to work this around with stateless services.

If your context is correct with this approach (that's something you should judge, as I'm only trying to give a suggestion), please consider decoupling the consumption and the processing.
In your case, the consumer is stopped, not because it was not able to read from kafka, and/or the kafka broker wasn't able to provide messages, but because the processing of the message was too slow and/or unsuccesful.
The consumer, in fact, was correctly receiving the messages. It was the processing of them that made it be declared dead.
First of all, the KafkaConsumer javadoc block regarding this (just above the constructor summary). The second option is the one quoted here
2. Decouple Consumption and Processing
Another alternative is to have one or more consumer threads that do
all data consumption and hands off ConsumerRecords instances to a
blocking queue consumed by a pool of processor threads that actually
handle the record processing. This option likewise has pros and cons:
PRO: This option allows independently scaling the number of consumers
and processors. This makes it possible to have a single consumer that
feeds many processor threads, avoiding any limitation on partitions.
CON: Guaranteeing order across the processors requires particular care
as the threads will execute independently an earlier chunk of data may
actually be processed after a later chunk of data just due to the luck
of thread execution timing. For processing that has no ordering
requirements this is not a problem.
CON: Manually committing the position becomes harder as it requires
that all threads co-ordinate to ensure that processing is complete for
that partition.**
Esentially, works like this. The consumer keeps reading and gives the responsibility of the processing and process-timeout management to the processor threads .
The error handling of the message processing would be responsibility of the processor threads as well. For example, if a timeout is thrown or an exception occurs, the processor will send the message to your defined "dead" queue, or whatever management of this you wish to perform, without involving the consumer. Regardless of the processor threads' success or fail, the consumer will continue its job and never be considered dead for not calling poll() in the specified timeout.
You should control the amount of messages the consumer retrieves in its poll call in order not to saturate the processors. Its a game regarding how fast the processors finish their job, how many messages the consumer retrieves (max.poll.records) at each iteration, and what's the specified timeout for the consumer.
Decoupled workflow
The first element to be quoted is the queue (with a limited size, which you should also manage in order not getting too filled - OOM).
This queue would be the link between consumer and processor threads, essentially a buffer that could dynamically get bigger or smaller depending on the specific word load at each time; It would manage overloads, something like a dam, or barrier, to find a similarity.
----->WORKERTHREAD1
KAFKA <------> CONSUMER ----> QUEUE -----|
----->WORKERTHREAD2
What you get is a second queue-lag mechanism:
1. Kafka Consumer LAG (the messages still to be read from the partition/topic)
2. Queue LAG (received messages still need to be processed)
--->WORKERTHREAD1
KAFKA <--(LAG)--> CONSUMER ----> QUEUE --(LAG)--|
--->WORKERTHREAD2
The queue could be some kind of synchronized queue, such a ConcurrentLinkedQueue. for example. Or you could manage yourself the synchronization with a customized queue.
Essentially, the duties would be divided, and the consumer is given the easiest one (as its the one that is most crucial).
Responsibilities:
Consumer
consume-->send to queue
Workers
read from queue|-->[manage timeout]
|==>PROCESS MESSAGE ==> send to topic
|-->[handle failed messages]
You should also manage if the processor threads die/deadlock; but usually those mechanisms are already implemented in most of ThreadPool variants.
I suggest the workers to share a unique KafkaProducer; The producer is thread safe and since the output topic would be the same for the group of consumers, this would also increase its performance. Also from the Kafka Producer javadoc:
The producer is thread safe and sharing a single producer instance
across threads will generally be faster than having multiple
instances.
In resume, each consumer thread feeds n processor threads. Some variants could be:
- 1 consumer - 1 worker (no processing paralellization, just division of duties)
- 1 consumer - 2 workers
- 1 consumer - 4 workers
- 2 consumers - 4 workers (2 for each)
- 2 consumers - 8 workers (4 for each)
...
Read carefully the pros and contras from this mechanism in the javadoc, and judge if this could be a solution to your specific case.
In my oppinion, there's a PRO that doesn't get reflected in the docs, which is the root of this answer/suggestion:
Consumption shouldn't be affected by processing. This approach avoids any consumer thread being considered dead due to a slow processing of the messages, and offers an extra "safety-window" thanks to the queue. I'm not saying that, at the point in which all processors fail for every message, or the queue hits maximum size, for example, the consumer would continue happily as if that didn't affect it; It will in fact be stopped by processing, but much, much later and due to bigger reasons that couldn't be avoided. This approach offers some extra time, or extra shield, for that to happen. Just like a dam can fail if it can't hold any more water.
Well, hope you take this as a suggestion, and may it be helpful somehow. It may avoid most of the dead consumer issues you're having. If well managed, it's a good approach for 24/7 real time data workflow.

Related

Single distributed system to handle large and small transactions

I have a kafka topic. The producer publishes 2 kinds of messages to this topic. Large messages which take more time to process and then small or fast processing messages. The small messages are of large volume (80%). The consumer receives these messages and sends these messages to our processing system. Our processing system have set of microservices deployed in Kubernetes environment as pods (which provides option to scaling).
I have to get the overall processing time as 200ms per transaction and system processing speed of (with scaling) to 10000 tps.
Now what is the better way to design this system in such way that small messages are processed with no blockage from large messages. Or is there a way to isolate the large messages in same channel without impacting processing small messages. Looking for your valuable inputs.
I have put a sample control flow of our system
.
The one option which I have is that consumer diverts the large message to one system and small messages to other system. But this doesn't seem like a good design and nightmare to maintain 2 systems with same functionalities. Also this could lead improper resource allocation.

I will assume large message and small messages can be processed out of order. Otherwise small messages will have to wait for large message and there is no parallelization possible.
I will also assume, you can not change producer to write large messages to another topic. Otherwise, you can just ask producers to send large messages to a different topic, with lesser number of consumers, so large messages will not block small messages.
Ok, with above two assumptions, following is the simplest solution:
On the consumer, if you read a small message, forward it to the message parser as you are doing today.
On the consumer, if you read a large message, instead of forwarding to the message parser, send it to another topic. Let's call it "Large Message Topic"
Configure limited number of consumers on the "Large Message Topic" to read and process large messages.
Alternatively, you will have to take control of commit offset, and add a little more complexity to your consumer code. You can use the solution below:
Disable auto commit, don't call commit on consumer after reading each batch.
If you read a small message, forward it to the message parser as you are doing today.
If you read large messages, send them to another thread/thread pool on your consumer process, that will forward it to the message parser. This thread pool processes in coming messages in a sequence, and keeps track of last offset completed.
Once in a while, you call commit with offset = min (consumer offset, large message offset)

How to implement fair scheduling between multiple tennants writing to 1 stream

As of now I have single Kafka Topic with 10 partitions. We have 10000 clients who keep dumping uncontrolled data into streams. The problem currently is that
A tenant with out any notice (or little notice) floods the topic
now the messages from other tenants suffer --> because their messages (handful) are queued behind and will take several hours to get their turn for processing
Question:
Can I somehow read may be 1k messages per tenant and roundrobin --> essentially like fair scheduling of Hadoop yarn
Can Apache pulsar help me in this? If yes then is there any example you can point me to?
I went through: https://www.confluent.io/blog/prioritize-messages-in-kafka/ already; but given the volume of clients it may not be practical to have 100k partitions etc.

I'm not aware of any way to get what you want out of the box. You could probably have the consumer pause some partitions to prioritize consumption from the ones with more messages (for example, by checking the lag per partition after every few poll iterations).
I'm not familiar enough with Apache Pulsar to have a clear answer.

I have a similar problem: a single customer can monopolize the resources and delay execution from all other customers, just because their events arrived first.
On a different application with a low amount of messages, we just load all the events in memory, creating a in-memory queue for every customer and then dequeuing up to N events from each customer queue and re-queue them again into a different queue, lets call it the re-ordered queue. The re-ordered queue has a capacity limit. (lets say...100*N), so no additional elements are queue until there is space. This guarantees equal treatment to all customers.
I am facing the same problem now with an application that processes billions of messages. The solution above is impossible; there is just not enough RAM. We can't keep all the data in memory. Creating a topic for each customer also sounds overkill; specially if you have a variable set of active customers at any given point in time. Nevertheless, Pulsar seems to handle well thousands, even millions, of topics.
So the technique above may work well for you (and for me).
Just read from thousands of topics... write to another topic a limited number of messages and then wait for it to have "space" to continue enqueuing.

Confluent Cloud - Can't see significant multithreading improvements in poll()

I tried consuming messages from Confluent Cloud cluster using 2 approaches and I'm getting almost the same total poll() time for both-
Single threaded- Use only one consumer to sequentially read from all TopicPartitions
Multithreaded- Spawn multiple Consumers (equal to no. of TopicPartitions), assign each partition on separate consumer manually (using #assign()). Run these threads in parallel and do a #poll(). Processing of messages will also be done by the thread itself.
There is some speed increase in 2nd approach but it's mostly due to the processing part which happens in parallel. The time taken by poll() method (to fetch all ConsumerRecords, excluding the processing) is almost same in both cases.
My question- Is the poll() method in KafkaConsumer blocking on the server side in some way? Multiple consumers polling in parallel vs a single consumer polling sequentially is giving almost similar poll time. The only performance increase seems to be happening due to the processing part which happens after fetching all ConsumerRecords using poll().
NOTE: I am not using the consumer group functionality here as it is not suitable to our use-case. As mentioned, I'm manually assigning the TopicPartitions.

What could be advised when trying to do multi threading or parallel consumers are these two blogs - Confluent Cloud - Can't see significant multithreading improvements in poll() and https://www.confluent.io/blog/kafka-consumer-multi-threaded-messaging/. I was informed that the main point of running multiple concurrent consumers in a group is generally not about improving the performance of the fetch from Kafka, but improving processing throughput.

How to manage Kafka transactional producer objects in request oriented applications

What is the best practice for managing Kafka producer objects in request oriented (e.g. http or RPC servers) applications, when configured as transactional producers? Specifically, how to share producer objects among serving threads, and how to define the transactional.id configuration value for those objects?
In non-transactional usage, producer objects are thread safe and it is common to share one object among all request serving threads. It is also straightforward to setup transactional producer objects to be used by kafka consumer threads, just instantiating one object for each consumer thread works well.
Combining transactional producers with request oriented applications appears to be more complicated, as the life-cycle of serving threads is usually dynamically controlled by a thread pool. I can think of a few options, all with downsides:
Share a single object, protected against concurrency by some kind of mutex. Contention under load would probably be a serious problem.
Instantiate a producer object for each request coming in. KafkaProducer objects are slow to initialize, as they maintain network connections, threads, and other heavyweight objects; paying this cost for each request seems impractical.
Maintain a pool of producer objects, and lease one for each request. The main downside I can see is the amount of machinery required. It is also unclear how to configure transactional.id for these objects, as their lifecycle does not map cleanly to a shard identifier in a partitioned, stateful, application as the documentation says.
Are there other options? Is there an optimal approach?

TL;DR
The transactional id is for preventing duplicates caused by zombie processes in the read-process-write pattern where you read from and produce to kafka topics. For request oriented applications, e.g. messages being produced by an incoming http request, transactional id doesn't bring any benefit (of course you still need to assign one if you want to use transactions and shouldn't be repeated between producers in the same process or different processes in your cluster)
Long answer
As the docs say, transactional producers are not thread safe
As is hinted at in the example, there can be only one open transaction per producer. All messages sent between the beginTransaction() and commitTransaction() calls will be part of a single transaction
so as you correctly explained there can't be concurrent access to the producer so we must pick one of the three options you described.
For this answer I'm going to assume that request oriented applications corresponds to http requests as the mechanism is triggering a message being produced with a transaction (actually, more than one message, otherwise will be enough with idempotent producers and transactions won't be needed)
In terms of correctness all of them are ok as, option 1 would work but depending on your application throughput it could have a high contention, option 2 will also work but you will pay the price of a higher latency and won't be very efficient.
IMHO I think option 3 could be the best since is a compromise between of the two previous options, although of course requires a more careful implementation than just opening a new producer each time.
Transactional id
The question that remains is how to assign a transactional id to the producer, specially in the last case (although both options 1 and 3 share the same concern, since in both cases we are reusing a producer with the same transactional id to handle different requests).
To answer this we first need to understand that the goal of transactional.id is to protect us from having duplicate message being produced caused by zombie processes (a process that hangs for a while, e.g. bc of a long gc pause, and is considered dead but after a while comes back and continues), this is called zombie fencing.
An important detail to understand the need of zombie fencing is understanding in which use case it could happen and this is the read-process-write pattern where you read from a topic, process the element and write to an output topic and the offset topic, which give us atomicity and Exactly-once semantics (if you are not doing any side effects on the process step).
Idempotent producers prevent us from having duplicates caused by producer retries (where the message was persisted by the broker but the ack wasn't received by the producer) and two-phase commit within kafka (where we are not only writing to the output but also marked the message as consumed by also producing to the offset topic) prevent us from having duplicates caused by consuming the message more than once (if the process crashes after producing to the output topic but before committing the offset).
There is still a subtle case where a duplicate can be introduced and it is a zombie producer, which is fenced by monotonically increasing an epoch each time a producer calls initTransactions that will be send with every message the producer sends.
So, for a producer to be fenced, another producer should have being started with the same transaction id, the key here is explained by Jason Gustafson in this talk
"what we are looking for is a guarantee that for each input partition there is only a single write that is responsible for reading that data and writing the output"
This means the transactional.id is assigned in terms of the partition is being consumed in the "read-process-write" pattern.
So if a process that has assigned partition 0 of topic A is considered dead, a rebalance will kick off and the new process that is assigned should create a producer with the same transactional.id, that's why it should be something like this <prefix><group>.<topic>.<partition> as described in this answer, where the partition is part of the transactional.id. This also means a producer per partition assigned, which could also represent an overhead depending on how many topics and partitions your consumers are being assigned.
This slides from the talk clarifies this situation
Transactional id before process crash
Transactional id reassigned to other process after crash
Transactional id in http requests
Going back to your original question, http requests won't follow the read-process-write pattern where zombies can introduce duplicates, because each http request will be unique, even if you introduce a unique identifier it will be a different message from the point of view of the transactional producer.
In this case I would argue that you may still have value using the transactional producer if you want the atomicity of writing to two different topics, but you can choose a random transactional id for option 2, or reuse it for options 1 and 3.
UPDATE
My answer is outdated since is based in an old version of kafka.
The overhead of having one producer per partition described before was a concern that was tackled in KIP-447
This architecture does not scale well as the number of input partitions increases. Every producer come with separate memory buffers, a separate thread, separate network connections. This limits the performance of the producer since we cannot effectively use the output of multiple tasks to improve batching. It also causes unneeded load on brokers since there are more concurrent transactions and more redundant metadata management.
This is the main difference as explained in this post
When the partition assignment is finalized after a consumer group rebalance, the first step for the consumer is to always get the next offset to begin fetching data. With this observation, the OffsetFetch protocol protection is enhanced, such that when a consumer group has pending transactional offsets associated with one partition, the OffsetFetch call can be blocked until the associated transaction completes. Previously, the “outdated” offset data would be returned and the application allowed to continue immediately.
Whit this new feature, the use of transactional.id is no longer clear to me.
Although it is still unclear why fencing requires both blocking the poll if there are pending transactions while it seems to me that the sending the consumer group metadata should be enough (I assume a zombie producer will be fenced by commiting with an old generation.id for that group.id, the generation.id being bumped with each rebalance) it seems the transactional.id doesn't play a major role anymore. e.g. spring docs says
With mode V1, the producer is "fenced" if another instance with the same transactional.id is started. Spring manages this by using a Producer for each group.id/topic/partition; when a rebalance occurs a new instance will use the same transactional.id and the old producer is fenced.
With mode V2, it is not necessary to have a producer for each group.id/topic/partition because consumer metadata is sent along with the offsets to the transaction and the broker can determine if the producer is fenced using that information instead.

Kafka as a message queue for long running tasks

I am wondering if there is something I am missing about my set up to facilitate long running jobs.
For my purposes it is ok to have At most once message delivery, this means it is not required to think about committing offsets (or at least it is ok to commit each message offset upon receiving it).
I have the following in order to achieve the competing consumer pattern:
A topic
X consumers in the same group
P partitions in a topic (where P >= X always)
My problem is that I have messages that can take ~15 minutes (but this may fluctuate by up to 50% lets say) in order to process. In order to avoid consumers having their partition assignments revoked I have increased the value of max.poll.interval.ms to reflect this.
However this comes with some negative consequences:
if some message exceeds this length of time then in a worst case scenario a the consumer processing this message will have to wait up to the value of max.poll.interval.ms for a rebalance
if I need to scale and increase the number of consumers based on load then any new consumers might also have to wait the value of max.poll.interval.ms for a rebalance to occur in order to process any new messages
As it stands at the moment I see that I can proceed as follows:
Set max.poll.interval.ms to be a small value and accept that every consumer processing every message will time out and go through the process of having assignments revoked and waiting a small amount of time for a rebalance
However I do not like this, and am considering looking at alternative technology for my message queue as I do not see any obvious way around this.
Admittedly I am new to Kafka, and it is just a gut feeling that the above is not desirable.
I have used RabbitMQ in the past for these scenarios, however we need Kafka in our architecture for other purposes at the moment and it would be nice not to have to introduce another technology if Kafka can achieve this.
I appreciate any advise that anybody can offer on this subject.

Using Kafka as a Job queue for scheduling long running process is not a good idea as Kafka is not a queue in the strictest sense and semantics for failure handling and retries are limited. Though you might be able to achieve a compromise by playing around with certain configuration for rebalance or timeout, it is likely to remain brittle design. Simple answer is that Kafka was not designed for these kind of usecases.
The idea of max.poll.interval.ms is to prevent livelock situation (see), but in your case, consumer will send a false positive to the Kafka broker and will trigger a rebalance as there is no way to distinguish between a livelock and a legitimate long process.
You should think about the tradeoffs between living with the negative consequences you mentioned Vs. introducing a new technology which helps you to model a job queue in a better way. For a more complex usecase, check out how slack is doing it.

The way we got around the issues we were having was as suggested in the comments.
We decided to decouple the message processing from the consumer polling.
On each worker/consumer there were 2 threads, one for doing the actual processing and the other for phoning home to Kafka periodically.
We also did some work with trying to reduce the processing times for messages.
However some messages still take time that can be measured in minutes.
This has worked for us now for some time with no issues.
Thanks for this suggestions in comments #Donal

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse