Is batch message sending possible with Sink.actorRefWithAck? - scala

I'm using Akka Streams and I came across Sink.actorRefWithAck. I understand that it sends a message and only tries pulling in another element from the stream when an acknowledgement for the previous message has been received. Is there a way to batch-process messages with this sink? Example: pull five messages and only pull the next five once the first five have been acknowledged. I've thought about something like
source.grouped(5).to(Sink.actorRefWithAck(...))
But that would require the receiver to change to work with sequences, which let's assume is out of the question.

No, that is not possible with Sink.actorRefWithAck() while keeping individual messages being queued in the actor mailbox rather than the entire batch.
One idea to queue up messages in the actor inbox more eagerly would be to use source.mapAsync(n)(ask-actor).to(Sink.ignore). This would send n to the actor and then as soon as the first one gets a response from the actor, it would pull and enqueue a new element.

Related

Reconsume Kafka Message that failed during processing due to DB error

I am new to Kafka and would like to seek advice on what is the best practice to handle such scenario.
Scenario:
I have a spring boot application that has a consumer method that is listening for messages via the #KafkaListner annotation. Once an incoming message has occurred, the consumer method will process the message, which simply performs database updates to different tables via JdbcTemplate.
If the updates to the tables are successful, I will manually commit the message by calling the acknowledge() method. If the database update fails, instead of calling the acknowledge() method, I will call the nack() method with a given duration (E.g. 10 seconds) such that the message will reappear again to be consumed.
Things to note
I am not concerned with the ordering of the messages. Whatever event comes I just have to consume and process it, that's all.
I am only given a topic (no retryable topic and no dead letter topic)
Here is the problem
If I do the above method, my consumer becomes inconsistent. Let's say if I call the nack() method with a duration of 1min, meaning to say after 1 min, the same message will reappear.
Within this 1 min, there could "x" number of incoming messages to be consumed and processed. The observation made was none of these messages are getting consumed and processed.
What I want to know
Hence, I hope someone will advise me what I am doing wrongly and what is the best practice / way to handle such scenarios.
Thanks!
Records are always received in order; there is no way to defer the current record until later, but continue to process other records after this one when consuming from a single topic.
Kafka topics are a linear log and not a queue.
You would need to send it to another topic; the #RetryableTopic (non-blocking retrties) feature is specifically designed for this use case.
https://docs.spring.io/spring-kafka/docs/current/reference/html/#retry-topic
You could also increase the container concurrency so at least you could continue to process records from other partitions.

How to process events which are out of order using Kafka Streams

I have an application where events are sent on a Kafka topic based on user actions like User Login, user's Intermediate actions (optional) and User Logout. Each event has some information in a event object along with userId , for example a Login Event has loginTime; Add Note has notes (Intermediate actions). Similarly a Logout event has logoutTime. The requirement is to aggregate information from all these events into one object after receiving the Logout event for each user & send it on downstream.
Due to some reasons (Network delay, multiple event producer) events may not come in order (User Logout event may come before Intermediate event), So the question is how to handle such scenarios? I can not wait for Intermediate events after receiving User Logout event since Intermediate events are optional depending on user's actions.
The only option which I think here, is to wait for some time after receiving User Logout event, process Intermediate events if received within that wait time & send processed event, but again not sure how to achieve this.
Kafka does not guarantee order on topic, it guarantee order on partition. One topic can have more than one partition so every consumer that is consuming your topic will consume one partition. That is how kafka is achieving scalability. So what you are experiencing is normal behavior (it isn't bug or related to network delay or something like that). What you can do is to make sure that all messages that you want to proceed in order are sent to the same partition. You can do that by setting number of partitions to 1, that is the dumbest way. When you send message with producer, by default kafka take a look into key, take hash of it and by that hash know on which partition should send a message. You can make sure that for all messages, the key is the same. That way all hashes of keys will be the same and all messages will go to the same partition. Also, you can implement custom partitioner and override default way how kafka choose on which partition message will go. In this way, all messages will arrive in order. If you cannot do any of this actions, then you will receive events out of order and you will have to think about a way how to consume them out of order but that is not question related to kafka.
If you are not able to preserve order of event (that Logout will be last event),
you can achieve your requirements using ProcesorApi from Kafka Streams. Kafka Streams DSL can be combine with Processor API (more details here).
You can have several partitions, but all events for particular user has to be send to same Partition.
You have to implement custom Processor/Transformer.
Your processor will be put each event/activity in state store (aggregate all event from particular user under same key).
Processor API gives you ability to create some kind of scheduler (Punctuator).
You can schedule to check every X seconds events for particular user. If Logout was long ago, you get all events/activities and make some aggregation and send results to downstreams.
As said in other answers, in Kafka order is maintained on per-partition basis.
Since you are talking about user events, why don't you make UserID as your Kafka topic key? So, that all events related to a specific user will always be ordered (provided they are produced by a single producer).
You should ensure (by design) that only one Kafka producer pushes all the user change events to the given topic. In this way, you can avoid out-of order messages due to multiple producers.
From streams, you might also want to look at Windows in Kafka streams. Tumbling windows for example is non-overlapping and fixed size. You aggregate records over a period of time.
Now you may want to sort the aggregated by their timestamp (or you said you have logout time, login time etc) and act accordingly.
Simple and effective solution
Use synchronous send and set delivery.timeout.ms and retries to a maximum value.
To ensure fault tolerance set acks=all with min.insync.replicas=2 (topic configuration) and use a single producer to push to that topic.
You should also set max.block.ms to some max value so that your send() does not return immediately if there is an error in fetching the metadata (for example, when Kafka is down).
Benchmark the synchronous send with your rate and check to see if it meets your requirements or benchmark number.
This ensures that a message that came first is sent first to Kafka and then the next message is not sent until the previous message is successfully acknowledged.
If your benchmark figure is not met, try having a back-pressure
mechanism like in-memory/persistent queue.
Add event to a queue in Thread-1
Peek (not dequeue) event from the queue in Thread-2
Call producer.send(...).get() in Thread-2
Dequeue the event in Thread-2
The key is to make your frontend tracker to send ordered events to the backend service which then produces events to kafka.
You can achieve that by batching the events, and sending the batched events to the backend only after the previous batched events are successfully delivered.

Producing a batch message

Let's say there is a batch API for performing tasks List[T]. In order to do the job all the tasks needs to be pushed to kafka. There are 2 ways to do that :
1) Pushing List as a message in kafka
2) Pushing individual task T in kafka
I believe approach 1 would be better since i don't have to push the messages to kafka mutiple times for a single batch call. Can some one please tell me if there is any harm in such approach ?
A Kafka producer can batch together individual messages sent within a short time window (the particular config is linger.ms), so the cost of sending individual messages is probably a lot lower than you think.
Probably a more important factor to consider is how the consumer is going to consume messages. What should happen if the consumer cannot process one of the tasks, for example? If the consumer is just just going to call some other batch-based API which succeeds or fails as a batch, the a single message containing a list of tasks would be a perfectly good fit. On the other hand if the consumer ultimately has to process tasks individually then sending individual messages is probably a better fit, and will probably save you from having to implement some sort of retry logic in your consumer, because you can probably configure Kafka to behave with the semantics you need.
Starting from Kafka v0.11 you can also use transactions in the producer to publish your entire batch atomically. i.e. you begin the transaction, then publish your tasks message by message, finally you commit the transaction. Even though the messages can be sent to kafka in multiple batches, they will only become visible to consumers once you commit the transaction, as long as your consumers are running in read-committed mode.
Option 1 is the preferred method in Kafka so long as the entire batch should always stay together. If you publish a List of records as a batch then they will be stored as a batch, they will be (optionally) compressed as a batch yielding better compression, and they will be fetched by consumers as a batch yielding fewer fetch requests.
If you send individual messages then you will have to give them a common key or they will get spread out over different partitions and possibly be sent out of order, or to different consumers of a consumer group.

Can we add a message for future processing in MSMQ

I am trying to create a MSMQ solution and for certain message I want them to be processed after 6PM only so is there a way in MSMQ so that the message is processed in Future?
A queue is a first-in, first-out data structure. If your application needs to process some messages after 6 PM, move the messages to a different queue that is only processed after 6 PM.
You could get the application to Peek the message first to read the property that lets you know it's a post-6pm message and act accordingly. If it's after 6pm, receive the message; if it isn't then Peek Next.

How to prevent actor mailbox growing in Scala?

As for as I know, the mailboxes of Scala actors have no size limit. So, if an actor reads messages from its mailbox slower than others send messages to that mailbox, then it eventually creates a memory leak.
How can we make sure that it not does happen? Should we limit the mailbox size anyway ? What are the best practices to prevent the mailbox growing?
Instead of having a push strategy where producers send directly messages to consumers, you could use a pull strategy, where consumers request messages from producers.
To be sure that the reply is almost instantaneous, producers can produce a limited number of data in advance. When they receive a request, first they send one of the pregenerated data, then they generate a new one.
You could also use Akka actors, which provide bounded mailbox.