Kafka Streams handling timeout on a cluster

In a Kafka-based distributed JVM application running in several instances, I need to act on the event of "not receiving" a certain message in a specific Kafka topic for a certain configurable amount of time (this timeout value is driven by the business logic and is subject to change).
How can I accomplish this in a cluster-safe way?

Is the goal to trace the latency of the end-to-end flow, or is there some trigger that causes a second message to be expected within some configurable time?
If tracking latency, some options include:
1. Add a timestamp to the message. When the message is received, the latency can be calculated and used.
2. Add a UUID, a timestamp, and the current component to the message, and delegate message tracking to a separate service partitioned on UUID.
If some trigger causes a second message to be expected, some options include:
1. Partition the relevant topic in a way that guarantees the expected message will either arrive or not arrive at only one JVM (similar to option 2 above). This allows a list of expected messages to be kept in memory. Remove the expected messages when they are received, and every N seconds handle the 'not received' messages.
2. Keep track of the expected messages in a data store (DB/distributed cache). When a message is received, remove its record. Periodically, handle the 'not received' messages.
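The in-memory variant could look roughly like the sketch below. It is only an illustration under assumptions: the class name ExpectedMessageTracker and its methods are made up for this example, the Kafka consumer that would call them is left out, and the current time is passed in explicitly so the behaviour is easy to follow.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper (not part of Kafka): tracks messages expected before a deadline.
class ExpectedMessageTracker {
    // message id -> deadline in epoch milliseconds
    private final Map<String, Long> deadlines = new ConcurrentHashMap<>();

    // Called when the trigger occurs: a message with this id must arrive within ttlMillis.
    void expect(String id, long nowMillis, long ttlMillis) {
        deadlines.put(id, nowMillis + ttlMillis);
    }

    // Called when the expected message arrives; returns false if the id was not tracked.
    boolean received(String id) {
        return deadlines.remove(id) != null;
    }

    // Called every N seconds: removes and returns the ids whose deadline has passed.
    List<String> sweep(long nowMillis) {
        List<String> expired = new ArrayList<>();
        Iterator<Map.Entry<String, Long>> it = deadlines.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> entry = it.next();
            if (entry.getValue() <= nowMillis) {
                expired.add(entry.getKey());
                it.remove();
            }
        }
        return expired;
    }

    public static void main(String[] args) {
        ExpectedMessageTracker tracker = new ExpectedMessageTracker();
        tracker.expect("a", 0L, 1_000L);
        tracker.expect("b", 0L, 5_000L);
        tracker.received("a");                       // "a" arrived in time
        System.out.println(tracker.sweep(5_000L));   // ids that timed out
    }
}
```

This only stays cluster-safe if the topic is partitioned so that the trigger and the expected message for a given id always reach the same instance. Otherwise the map would have to live in a shared store, as in the second option.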
Given the details in the comment, one way to approach this is with a callback-style design. Messages can be routed to a specific server by setting a partition key. By adding an intermediate topic/service partitioned on UUID, it should be possible to achieve this as follows:
1. Send Message A to ttl_routing_service. Message A should include a UUID, a TTL, where to send the message (the functional topic), and what to do on expiry.
2. The routing service picks up the message and tracks some metadata (e.g. the TTL and what to do on timeout) in a local cache, or starts a delayed coroutine, and then routes Message A to the functional topic, including the UUID.
3. On completion of Message A processing, a message can be sent to ttl_routing_service with the UUID, cancelling the coroutine or removing the cached record.
4. If the record is not removed in time, 'what to do on expiry' is performed.


Reconsume Kafka Message that failed during processing due to DB error

I am new to Kafka and would like to seek advice on what is the best practice to handle such scenario.
I have a Spring Boot application that has a consumer method listening for messages via the @KafkaListener annotation. Once a message arrives, the consumer method processes it, which simply means performing database updates to different tables via JdbcTemplate.
If the updates to the tables are successful, I will manually commit the message by calling the acknowledge() method. If the database update fails, instead of calling the acknowledge() method, I will call the nack() method with a given duration (E.g. 10 seconds) such that the message will reappear again to be consumed.
Things to note
I am not concerned with the ordering of the messages. Whatever event comes I just have to consume and process it, that's all.
I am only given a topic (no retryable topic and no dead letter topic)
Here is the problem
If I use the above approach, my consumer becomes inconsistent. Say I call the nack() method with a duration of 1 minute, meaning that after 1 minute the same message will reappear.
Within this minute, there could be "x" incoming messages to be consumed and processed. What I observed is that none of these messages are consumed and processed.
What I want to know
Hence, I hope someone will advise me what I am doing wrongly and what is the best practice / way to handle such scenarios.
Records are always received in order; there is no way to defer the current record until later, but continue to process other records after this one when consuming from a single topic.
Kafka topics are a linear log and not a queue.
You would need to send it to another topic; the @RetryableTopic (non-blocking retries) feature is specifically designed for this use case.
You could also increase the container concurrency so at least you could continue to process records from other partitions.
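To make the difference concrete, here is a small model of the non-blocking idea using plain in-memory collections instead of real topics. It is not Spring Kafka code. The names NonBlockingRetryDemo and consumeMain are invented for this sketch, and the predicate stands in for the database update succeeding or failing.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical model: a List plays the main topic, a Deque plays the retry topic.
class NonBlockingRetryDemo {
    // Consumes the main "topic" in order. A record whose processing fails is
    // forwarded to retryTopic instead of blocking the records behind it.
    static List<String> consumeMain(List<String> mainTopic,
                                    Deque<String> retryTopic,
                                    Predicate<String> process) {
        List<String> done = new ArrayList<>();
        for (String record : mainTopic) {
            if (process.test(record)) {
                done.add(record);            // success: the offset would be committed
            } else {
                retryTopic.addLast(record);  // failure: hand over to the retry topic
            }
        }
        return done;
    }

    public static void main(String[] args) {
        Deque<String> retryTopic = new ArrayDeque<>();
        // Pretend the database update fails only for record "m2".
        List<String> done = consumeMain(List.of("m1", "m2", "m3"), retryTopic,
                record -> !record.equals("m2"));
        System.out.println("processed=" + done + " retry=" + retryTopic);
        // prints: processed=[m1, m3] retry=[m2]
    }
}
```

With nack() the loop would instead stop at "m2" and wait, which is why the records behind it are not consumed during the sleep.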

Message order issue in single consumer connected to ActiveMQ Artemis queue

Is there any possibility of a message-ordering issue with a single queue consumer and multiple producers?
producer1 publishes message m1 at 2021-06-27 02:57:44.513 and producer2 publishes message m2 at 2021-06-27 02:57:44.514 to the same queue, worker_consumer_queue. Client code connected to the queue and configured as a single consumer should receive the messages in order, m1 first and then m2, correct? Sometimes the messages are received in the wrong order. The version is ActiveMQ Artemis 2.17.0.
Even though I mentioned multiple producers, the messages are published one after another from the same thread, using the property blockOnDurableSend=false.
I create and close a producer for each message published. Within the same JVM, my assumption is that the queue preserves the order in which messages are published, whether from the same thread or from different threads, even with async sends. The timestamp is getJMSTimestamp(). Does async publishing also maintain an internal queue that preserves order?
If you use blockOnDurableSend=false you're basically saying you don't strictly care about the order or even if the message makes it to the broker at all. Using blockOnDurableSend=false basically means "fire and forget."
Furthermore, the JMSTimestamp is not when the message is actually sent, as noted in the javax.jms.Message JavaDoc:
The JMSTimestamp header field contains the time a message was handed off to a provider to be sent. It is not the time the message was actually transmitted, because the actual send may occur later due to transactions or other client-side queueing of messages.
With more than one producer there is no guarantee that the messages will be processed in order.
Multiple producers, ActiveMQ Artemis, and one consumer form a distributed system, and the lack of a global clock is a significant characteristic of distributed systems.
Even if the producers and ActiveMQ Artemis were on the same machine and used the same clock, ActiveMQ Artemis might not receive the messages in the same order in which the producers created and sent them, because the time to create a message and the time to send a message both include variable latencies.
The easiest solution is to trust the order in which ActiveMQ Artemis receives the messages, adding a timestamp with an interceptor or enabling the ingress timestamp; see ARTEMIS-2919 for further details.
If the easiest solution doesn't work, the distributed solution is to implement a total-ordering algorithm for distributed systems, such as Lamport timestamps.
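For reference, a Lamport clock is small enough to sketch in a few lines. The class below is a generic illustration and not something ActiveMQ Artemis provides. The name LamportClock is made up here, and the producers would have to attach the returned value to each message themselves (for example as a message property).

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical helper: one instance per process (producer or consumer).
class LamportClock {
    private final AtomicLong counter = new AtomicLong();

    // Local event or send: advance the clock and return the timestamp to attach.
    long tick() {
        return counter.incrementAndGet();
    }

    // Receive: move past the sender's timestamp, counting the receive as an event.
    long onReceive(long remoteTimestamp) {
        return counter.updateAndGet(local -> Math.max(local, remoteTimestamp) + 1);
    }

    long current() {
        return counter.get();
    }

    public static void main(String[] args) {
        LamportClock producer1 = new LamportClock();
        LamportClock producer2 = new LamportClock();
        long m1 = producer1.tick();   // producer1 sends m1
        producer2.onReceive(m1);      // producer2 learns about m1 before sending
        long m2 = producer2.tick();   // producer2 sends m2
        System.out.println("m1=" + m1 + " m2=" + m2);   // prints: m1=1 m2=3
    }
}
```

Note that the counters only order events that are causally related. To get a total order, ties between equal timestamps are usually broken with a fixed process identifier, and two producers that never exchange messages get no meaningful relative order from this alone.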
Well, as it seems, this is not a bug within Artemis; with a one-millisecond difference it is more likely network lag or something similar.
As a workaround, you could create an algorithm in which a received message waits for ~100 ms before it is actually processed (whatever you want to do with the message), and checks whether another message has arrived in the meantime that your application received later but that was sent earlier. So basically, have your own receiver queue with a delay.
If there is a message that was sent earlier, you could simply move it up in your own algorithm. You could also consider rejecting the first message back to your bus; depending on your queue and topic settings, you would be able to receive it again afterwards.
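A delayed receiver queue of this kind might be sketched as follows. The class ReorderBuffer and its fields are invented for the illustration, the time is passed in explicitly, and the sketch assumes the producers' send timestamps are comparable, which (as the other answer points out) is not guaranteed across machines.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical helper: holds messages briefly and hands them out in send order.
class ReorderBuffer {
    static final class Msg {
        final String body;
        final long sentAt;       // producer-side timestamp, e.g. JMSTimestamp
        final long receivedAt;   // consumer-side arrival time
        Msg(String body, long sentAt, long receivedAt) {
            this.body = body;
            this.sentAt = sentAt;
            this.receivedAt = receivedAt;
        }
    }

    private final long holdMillis;
    // Ordered by send time, so the earliest-sent pending message is at the head.
    private final PriorityQueue<Msg> pending =
            new PriorityQueue<>(Comparator.comparingLong((Msg m) -> m.sentAt));

    ReorderBuffer(long holdMillis) {
        this.holdMillis = holdMillis;
    }

    void onReceive(String body, long sentAt, long nowMillis) {
        pending.add(new Msg(body, sentAt, nowMillis));
    }

    // Releases messages in send order, as long as the earliest-sent pending
    // message has been held for at least holdMillis.
    List<String> release(long nowMillis) {
        List<String> out = new ArrayList<>();
        while (!pending.isEmpty()
                && nowMillis - pending.peek().receivedAt >= holdMillis) {
            out.add(pending.poll().body);
        }
        return out;
    }

    public static void main(String[] args) {
        ReorderBuffer buffer = new ReorderBuffer(100);
        buffer.onReceive("m2", 514, 0);   // sent later, arrived first
        buffer.onReceive("m1", 513, 5);   // sent earlier, arrived second
        System.out.println(buffer.release(105));   // prints: [m1, m2]
    }
}
```

The hold time is a trade-off: every message is delayed by at least that long, and anything that arrives later than the hold window can still come out of order.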

How to process events which are out of order using Kafka Streams

I have an application where events are sent to a Kafka topic based on user actions such as User Login, the user's intermediate actions (optional), and User Logout. Each event carries some information in an event object along with the userId; for example, a Login event has loginTime, an Add Note event (an intermediate action) has notes, and a Logout event has logoutTime. The requirement is to aggregate the information from all these events into one object after receiving the Logout event for each user, and send it downstream.
For various reasons (network delay, multiple event producers), events may not arrive in order (the User Logout event may arrive before an intermediate event). So the question is how to handle such scenarios. I cannot wait for intermediate events after receiving the User Logout event, since intermediate events are optional and depend on the user's actions.
The only option I can think of is to wait for some time after receiving the User Logout event, process any intermediate events received within that wait time, and then send the processed event, but I am not sure how to achieve this.
Kafka does not guarantee order across a topic; it guarantees order within a partition. A topic can have more than one partition, and within a consumer group each partition is consumed by exactly one consumer; that is how Kafka achieves scalability. So what you are experiencing is normal behavior (it isn't a bug, nor is it necessarily related to network delay).
What you can do is make sure that all messages you want to process in order are sent to the same partition. You can do that by setting the number of partitions to 1, which is the crudest way. When you send a message with a producer, by default Kafka looks at the key, hashes it, and uses that hash to decide which partition the message goes to. So you can make sure that all messages that must stay in order share the same key: the hashes will then be the same and the messages will go to the same partition. You can also implement a custom partitioner and override the default way Kafka chooses the partition. That way, the messages will arrive in order.
If you cannot do any of this, then you will receive events out of order and will have to think about how to consume them out of order, but that is not a Kafka question.
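The key-to-partition idea can be illustrated with a few lines of plain Java. This is a simplified stand-in, not Kafka's actual partitioner: Kafka's default partitioner hashes the serialized key with murmur2, whereas this sketch uses String.hashCode(), and the names KeyPartitioning and partitionFor are made up.

```java
// Hypothetical illustration of hash-based partition selection.
class KeyPartitioning {
    // Same key and same partition count always give the same partition.
    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    public static void main(String[] args) {
        int partitions = 6;
        // All events of one user share the key, so they share the partition.
        System.out.println(partitionFor("user-42", partitions)
                == partitionFor("user-42", partitions));   // prints: true
    }
}
```

The mapping depends on the partition count, so adding partitions to an existing topic changes where new messages for a key are sent.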
If you are not able to preserve the order of events (so that Logout is always the last event), you can meet your requirements using the Processor API from Kafka Streams. The Kafka Streams DSL can be combined with the Processor API (see the Kafka Streams documentation for details).
You can have several partitions, but all events for a particular user have to be sent to the same partition.
You have to implement a custom Processor/Transformer.
Your processor puts each event/activity into a state store (aggregating all events from a particular user under the same key).
The Processor API gives you the ability to create a kind of scheduler (a Punctuator).
You can schedule a check of each user's events every X seconds. If the Logout happened long enough ago, you take all the events/activities, aggregate them, and send the result downstream.
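The logic of such a processor can be modelled without Kafka Streams as shown below. This is a sketch under assumptions: the class SessionAggregator is invented, a HashMap stands in for the state store, process() plays the role of the processor callback, and punctuate() plays the role of the scheduled Punctuator. The real Processor API types are deliberately not used here.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical model of "buffer per user, emit some time after Logout".
class SessionAggregator {
    // Per-user state, standing in for a state store entry.
    static final class Session {
        final List<String> events = new ArrayList<>();
        long logoutSeenAt = -1;   // -1 until the Logout event arrives
    }

    private final Map<String, Session> store = new HashMap<>();
    private final long graceMillis;

    SessionAggregator(long graceMillis) {
        this.graceMillis = graceMillis;
    }

    // Rough analogue of the processor callback: buffer every event under its userId.
    void process(String userId, String eventType, long nowMillis) {
        Session session = store.computeIfAbsent(userId, k -> new Session());
        session.events.add(eventType);
        if ("Logout".equals(eventType)) {
            session.logoutSeenAt = nowMillis;
        }
    }

    // Rough analogue of the punctuator: emit and delete every session whose
    // Logout was seen at least graceMillis ago.
    Map<String, List<String>> punctuate(long nowMillis) {
        Map<String, List<String>> emitted = new HashMap<>();
        Iterator<Map.Entry<String, Session>> it = store.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Session> entry = it.next();
            Session session = entry.getValue();
            if (session.logoutSeenAt >= 0
                    && nowMillis - session.logoutSeenAt >= graceMillis) {
                emitted.put(entry.getKey(), session.events);
                it.remove();
            }
        }
        return emitted;
    }

    public static void main(String[] args) {
        SessionAggregator aggregator = new SessionAggregator(5_000);
        aggregator.process("u1", "Login", 0);
        aggregator.process("u1", "Logout", 1_000);
        aggregator.process("u1", "AddNote", 2_000);   // late intermediate event
        System.out.println(aggregator.punctuate(6_000));
    }
}
```

The late AddNote event is still included because it arrived within the grace period after Logout. Events arriving after the session was emitted would start a new entry and need separate handling.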
As said in other answers, in Kafka order is maintained on a per-partition basis.
Since you are talking about user events, why don't you use the UserID as your Kafka message key? That way, all events related to a specific user will always be ordered (provided they are produced by a single producer).
You should ensure (by design) that only one Kafka producer pushes all the user change events to the given topic. In this way, you can avoid out-of-order messages due to multiple producers.
On the streams side, you might also want to look at windows in Kafka Streams. Tumbling windows, for example, are non-overlapping and of fixed size. You aggregate records over a period of time.
You may then want to sort the aggregated records by their timestamp (you said you have a logout time, a login time, etc.) and act accordingly.
Simple and effective solution
Use synchronous send and set delivery.timeout.ms and retries to a maximum value.
To ensure fault tolerance set acks=all with min.insync.replicas=2 (topic configuration) and use a single producer to push to that topic.
You should also set max.block.ms to some max value so that your send() does not return immediately if there is an error in fetching the metadata (for example, when Kafka is down).
Benchmark the synchronous send with your rate and check to see if it meets your requirements or benchmark number.
This ensures that a message that came first is sent first to Kafka and then the next message is not sent until the previous message is successfully acknowledged.
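The producer-side settings named above could be collected as in the sketch below. Only the property keys come from the answer. The concrete values (the maximum int/long values) are an assumption chosen to illustrate "a maximum value", the class name OrderedProducerConfig is invented, and bootstrap servers and serializers are left out. min.insync.replicas=2 is a topic-level setting and therefore does not appear here.

```java
import java.util.Properties;

// Hypothetical helper that builds the producer settings discussed above.
class OrderedProducerConfig {
    static Properties build() {
        Properties props = new Properties();
        props.setProperty("acks", "all");   // wait for all in-sync replicas
        props.setProperty("retries", String.valueOf(Integer.MAX_VALUE));
        props.setProperty("delivery.timeout.ms", String.valueOf(Integer.MAX_VALUE));
        props.setProperty("max.block.ms", String.valueOf(Long.MAX_VALUE));
        return props;
    }

    public static void main(String[] args) {
        build().forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

These properties would then be passed to the producer together with the connection and serializer settings, and each send would be followed by a blocking get() on the returned future.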
If your benchmark figure is not met, try adding a back-pressure mechanism such as an in-memory/persistent queue.
Add event to a queue in Thread-1
Peek (not dequeue) event from the queue in Thread-2
Call producer.send(...).get() in Thread-2
Dequeue the event in Thread-2
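The peek, send, dequeue sequence of Thread-2 might look like the following sketch. The names OrderedForwarder and drain are invented, and the predicate stands in for producer.send(...).get() completing without an exception, so no Kafka client is needed to follow the logic.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.function.Predicate;

// Hypothetical sender loop: an event leaves the queue only after it was delivered.
class OrderedForwarder {
    // Sends queued events in order and returns how many were delivered.
    // Stops at the first failed send, leaving that event at the head of the queue.
    static int drain(Queue<String> queue, Predicate<String> send) {
        int delivered = 0;
        String head;
        while ((head = queue.peek()) != null) {   // peek, do not dequeue yet
            if (!send.test(head)) {
                break;                            // keep the event for a later retry
            }
            queue.poll();                         // dequeue only after success
            delivered++;
        }
        return delivered;
    }

    public static void main(String[] args) {
        Queue<String> queue = new ConcurrentLinkedQueue<>();   // filled by Thread-1
        queue.add("e1");
        queue.add("e2");
        queue.add("e3");
        int delivered = drain(queue, event -> !event.equals("e2"));   // e2 fails
        System.out.println("delivered=" + delivered + " remaining=" + queue);
        // prints: delivered=1 remaining=[e2, e3]
    }
}
```

Because the failed event stays at the head, nothing behind it can overtake it, which is the ordering property the steps are after. If the process crashes between the send and the dequeue, the event is sent again on restart, so consumers should tolerate duplicates.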
The key is to make your frontend tracker send ordered events to the backend service, which then produces the events to Kafka.
You can achieve that by batching the events and sending a batch to the backend only after the previous batch has been successfully delivered.

Maintain Ordering Guarantees With Kafka Consumer Retries

I'm in the process of coming up with an architecture for consumer retries in a Kafka based data processing pipeline. We're using Kafka producers and consumers and are thinking of retry topics on which messages will be sent if they error out on consumption. There will be consumers running on these retry topics at a certain cadence.
I read many reference architectures, but none talk about how to maintain ordering guarantees during message consumption failures. Let me give an example:
Our Kafka messages contain payload that has an object and an operation type (which could either be CREATE/UPDATE/DELETE). We partition messages on object_id to make sure that the operations on that object are ordered. However, if a message fails on consumption, should you automatically flag the subsequent messages with the same object_id as failed, without even attempting to process them? And how do you maintain that state?
Are there any reference architectures that address this?
Yes, you need to have a mechanism in place wherein if one message with the same object_id fails and goes to retry, then all subsequent messages with the same object_id also go to retry directly.
I suggest using a cache to coordinate this: whenever a message goes to retry, increment a counter keyed by object_id. Similarly, whenever a message is successfully consumed from the retry topic, decrement that counter.
Now, before attempting to consume a message, you need only check whether a counter with a value >0 exists for its object_id, and if so, send the message directly to retry.
If there are multiple levels of retry topics, maintain a distributed key-value cache whose key is the object_id and whose value is the level of the retry topic.
On consumption of a message, check against this cache, and if the object_id is present, send the event directly to that topic.
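The counter logic can be sketched as below. The class RetryGate and its method names are invented for the illustration, and a ConcurrentHashMap stands in for the cache. With several consumer processes the counters would have to live in a shared cache instead, and the check followed by the send is not atomic in this sketch.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper: counts messages per object_id that are waiting in retry.
class RetryGate {
    private final Map<String, Integer> pending = new ConcurrentHashMap<>();

    // Check before processing: true means "send straight to the retry topic".
    boolean mustGoToRetry(String objectId) {
        return pending.getOrDefault(objectId, 0) > 0;
    }

    // Call whenever a message for this object is sent to the retry topic.
    void onSentToRetry(String objectId) {
        pending.merge(objectId, 1, Integer::sum);
    }

    // Call when a message for this object was consumed successfully from retry.
    // The entry is removed once the counter would reach zero.
    void onRetrySucceeded(String objectId) {
        pending.computeIfPresent(objectId, (key, count) -> count > 1 ? count - 1 : null);
    }

    public static void main(String[] args) {
        RetryGate gate = new RetryGate();
        gate.onSentToRetry("obj-1");                          // CREATE failed
        System.out.println(gate.mustGoToRetry("obj-1"));      // prints: true
        gate.onRetrySucceeded("obj-1");
        System.out.println(gate.mustGoToRetry("obj-1"));      // prints: false
    }
}
```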
The easiest would be to have a blocking retry policy here: that is, do not use retry topics, but instead block in the consumer (sleep the thread for some time) and then retry the same message. In this case the order is always guaranteed.
If you opt to use the retry-topic instead, you are gonna have a lot of headaches making sure the order is guaranteed.

Spring Cloud Stream Kafka - How to implement idempotency to support distributed transaction management (eventual consistency)

I have the following typical scenario:
An order service used to purchase products. Acts as the commander of the distributed transaction.
A product service with the list of products and its stock.
A payment service.
   Orders DB           Products DB
       |                    |
---------------     ----------------     ----------------
| OrderService |    | ProductService |   | PaymentService |
---------------     ----------------     ----------------
       |                    |                    |
       |         --------------------            |
       ----------| Kafka orders topic |-----------
                 --------------------
The normal flow would be:
The user orders a product.
Order service creates an order in DB and publishes a message in Kafka topic "orders" to reserve a product (PRODUCT_RESERVE_REQUEST).
Product service decreases the product stock one unit in its DB and publishes a message in "orders" saying PRODUCT_RESERVED
Order service gets the PRODUCT_RESERVED message and orders the payment publishing a message PAYMENT_REQUESTED
Payment service orders the payment and answers with a message PAYED
Order service reads the PAYED message and marks the order as COMPLETED, finishing the transaction.
I am having trouble dealing with error cases. For example, let's assume this:
Payment service fails to charge for the product, so it publishes a message PAYMENT_FAILED
Order service reacts publishing a message UNDO_PRODUCT_RESERVATION
Product service increases the stock in the DB to cancel the reservation and publishes PRODUCT_UNRESERVATION_COMPLETED
Order service finishes the transaction saving the final state of the order as CANCELLED_PAYMENT_FAILED.
In this scenario imagine that for whatever reason, order service publishes a UNDO_PRODUCT_RESERVATION message but doesn't receive the PRODUCT_UNRESERVATION_COMPLETED message, so it retries publishing another UNDO_PRODUCT_RESERVATION message.
Now, imagine that those two UNDO_PRODUCT_RESERVATION messages for the same order end up arriving to ProductService. If I process both of them I could end up setting an invalid stock for the product.
In this scenario how can I implement idempotency?
Following Artem's instructions I can now detect duplicated messages (by checking the message header) and ignore them but there may still be situations like the following where I shouldn't ignore the duplicated messages:
Product service gets the message and starts processing it but crashes before updating the stock.
Order Service doesn't get a response so it retries sending UNDO_PRODUCT_RESERVATION
Product service knows this is a duplicated message BUT, in this case it should repeat the processing again.
Can you help me come up with a way to support this scenario as well? How could I distinguish when I should discard the message or reprocess it?
We used spring-integration-kafka to produce and consume messages with Kafka in our microservices. In our case, we send org.springframework.messaging.Message objects to topics and get the same type back from the topics after deserialization from a byte array. Besides the payload, which is the actual object you want to transfer from one microservice to another, a Message carries header values such as message-id and sent-time. We use the unique message-id value to implement idempotency.
On the producer side, you must implement some logic to ensure that the message-id of the Message is the same when it is produced multiple times. This depends on your produce logic. In our case, we use Publishing Events Using Local Transactions, which is very well described in the blog https://www.nginx.com/blog/event-driven-data-management-microservices/ by Chris Richardson. With this approach we can recreate the Message object with the same message-id on the producer side.
On the consumer side, we persist all consumed message-id values to a database and check these ids before processing a received message. If we see a message whose id is in our persistent store, we simply ignore it.
In your case, to implement idempotency:
you should keep a unique identifier with each message,
on the producer side, you must generate the same identifier when the message is produced multiple times,
on the consumer side, you must check the received id to detect whether the message has been consumed before or not.
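The consumer-side check can be sketched as follows. This is an in-memory illustration with invented names (IdempotentStockHandler, undoReservation), not the Spring Integration code described above. A HashSet stands in for the persistent store of processed ids, and the synchronized method stands in for a database transaction that would update the stock and record the message id together.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical consumer-side handler for UNDO_PRODUCT_RESERVATION messages.
class IdempotentStockHandler {
    private final Map<String, Integer> stock = new HashMap<>();
    private final Set<String> processedIds = new HashSet<>();

    IdempotentStockHandler(String productId, int initialStock) {
        stock.put(productId, initialStock);
    }

    // Applies the undo once per messageId; returns false for a duplicate.
    // The stock change and the id are recorded in the same critical section,
    // standing in for a single database transaction.
    synchronized boolean undoReservation(String messageId, String productId) {
        if (processedIds.contains(messageId)) {
            return false;                            // already applied: ignore
        }
        stock.merge(productId, 1, Integer::sum);     // give the unit back
        processedIds.add(messageId);                 // remember the message id
        return true;
    }

    synchronized int stockOf(String productId) {
        return stock.getOrDefault(productId, 0);
    }

    public static void main(String[] args) {
        IdempotentStockHandler handler = new IdempotentStockHandler("p1", 4);
        handler.undoReservation("msg-1", "p1");
        handler.undoReservation("msg-1", "p1");      // retried duplicate
        System.out.println(handler.stockOf("p1"));   // prints: 5
    }
}
```

If the id is stored only together with the stock update, a crash before the update leaves no trace of the id, so a retried message is simply processed as if it were new. That is one possible way to tell "discard" from "reprocess" in the crash scenario from the question, assuming both writes share one local transaction.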
Regarding the second scenario, which is described in the UPDATE:
I think you should change your approach a little. If you want to implement a publish-subscribe mechanism, which is more suitable in a microservices architecture, you shouldn't wait for a response on the producer side. In your scenario, you wait for another message to learn whether the consumer consumed the message, and if it was not consumed, you send it again.
How about the implementation below;
On the producer side, you send messages to Kafka within a transaction in the producer. You should provide a mechanism here to send messages to Kafka only if the transaction on the producer side is committed. This is an atomicity issue, and the link I gave above shows how to solve it.
On the consumer side, you poll messages from the Kafka topic one by one, in order, and you get the next message only when the current message has been consumed. If it has not been consumed, you shouldn't get the next message, because the next message might be related to the current one, and consuming it may corrupt the consistency of your data. It is not the producer's concern when a message is not consumed. On the consumer side, you should provide retry and replay mechanisms for consuming messages.
I think you shouldn't wait for a response on the producer side. Kafka is a very smart tool, and with its offset-commit capability, as a consumer you don't have to treat a message as consumed just because you polled it from the topic. If you have a problem while processing a message, you simply don't commit the offset that would move you on to the next message.
With the implementation described above, you don't have a problem like "How could I distinguish when I should discard the message or reprocess it?"
Actually, because of the complications you mentioned about organizing a transaction across multiple microservices over Apache Kafka, I developed another concept and wrote a blog post about it.
If you reach a level of complication where a Kafka solution might not be feasible anymore, you might find it an interesting read. It is too long to explain here, but basically it fully uses a J2EE container following microservice principles, with full transaction support between the microservices, with the help of Spring Boot and Netflix.
Micro Services Fanout and Transaction Problems and Solutions with Spring Boot and Netflix