As far as I know, the mailboxes of Scala actors have no size limit. So, if an actor reads messages from its mailbox more slowly than others send messages to it, the mailbox grows without bound and eventually behaves like a memory leak.
How can we make sure this does not happen? Should we limit the mailbox size anyway? What are the best practices for keeping mailboxes from growing?
Instead of a push strategy, where producers send messages directly to consumers, you could use a pull strategy, where consumers request messages from producers.
To make sure the reply is almost instantaneous, producers can generate a limited number of items in advance. When they receive a request, they first send one of the pregenerated items and then generate a new one, as in the sketch below.
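For illustration, here is a minimal sketch of the pull strategy with classic Akka actors; the message types, buffer size, and refill policy are assumptions:

    import akka.actor.{Actor, ActorRef}

    case object Request
    case class Work(payload: String)

    class Producer extends Actor {
      // Pre-generate a small buffer so replies to Request are near-instant.
      private var buffer: List[Work] = List.fill(10)(Work("pregenerated"))
      def receive = {
        case Request =>
          buffer match {
            case head :: tail =>
              sender() ! head
              buffer = tail :+ Work("freshly generated") // top the buffer back up
            case Nil =>
              sender() ! Work("generated on demand")
          }
      }
    }

    class Consumer(producer: ActorRef) extends Actor {
      producer ! Request // pull the first item
      def receive = {
        case Work(payload) =>
          // ... process payload ...
          producer ! Request // pull the next item only when ready
      }
    }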
You could also use Akka actors, which provide bounded mailboxes.
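With classic Akka, for example, a bounded mailbox is declared in configuration and attached to the actor; the mailbox name and capacity below are assumptions:

    // In application.conf:
    //   bounded-mailbox {
    //     mailbox-type = "akka.dispatch.BoundedMailbox"
    //     mailbox-capacity = 1000
    //     mailbox-push-timeout-time = 10s
    //   }
    import akka.actor.{Actor, ActorSystem, Props}

    class Worker extends Actor {
      def receive = { case msg => /* ... process msg ... */ }
    }

    val system = ActorSystem("demo")
    // Senders block for up to the push timeout when the mailbox is full.
    val worker = system.actorOf(Props(new Worker).withMailbox("bounded-mailbox"))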
I have a Kafka topic. The producer publishes two kinds of messages to this topic: large messages that take more time to process, and small, fast-processing messages. The small messages make up the bulk of the volume (about 80%). The consumer receives these messages and forwards them to our processing system, a set of microservices deployed as pods in a Kubernetes environment (which gives us the option to scale).
I have to achieve an overall processing time of 200 ms per transaction and, with scaling, a system throughput of 10,000 tps.
What is the best way to design this system so that small messages are processed without being blocked by large messages? Or is there a way to isolate the large messages within the same channel without impacting the processing of small messages? Looking for your inputs.
I have attached a sample control flow of our system as an image.
One option I have is for the consumer to divert large messages to one system and small messages to another. But this doesn't seem like a good design: it would be a nightmare to maintain two systems with the same functionality, and it could also lead to improper resource allocation.
I will assume that large and small messages can be processed out of order; otherwise, small messages have to wait for large ones and no parallelization is possible.
I will also assume you cannot change the producer to write large messages to another topic. Otherwise, you could simply ask producers to send large messages to a different topic with a smaller number of consumers, so that large messages would not block small messages.
OK, with the above two assumptions, the following is the simplest solution (a code sketch follows the steps):
On the consumer, if you read a small message, forward it to the message parser as you are doing today.
On the consumer, if you read a large message, instead of forwarding it to the message parser, send it to another topic; let's call it the "Large Message Topic".
Configure a limited number of consumers on the "Large Message Topic" to read and process the large messages.
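Here is a rough sketch of such a routing consumer in Scala with the Kafka clients; the topic names, size threshold, and forwardToMessageParser are placeholders for your existing setup:

    import java.time.Duration
    import java.util.Properties
    import scala.jdk.CollectionConverters._
    import org.apache.kafka.clients.consumer.{ConsumerRecord, KafkaConsumer}
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

    // One Properties object for both clients keeps the sketch short;
    // each client simply ignores the settings it doesn't use.
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    props.put("group.id", "router")
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    val consumer = new KafkaConsumer[String, String](props)
    val producer = new KafkaProducer[String, String](props)
    val largeThreshold = 64 * 1024 // assumed size cutoff, in bytes

    def forwardToMessageParser(r: ConsumerRecord[String, String]): Unit = ??? // your existing path

    consumer.subscribe(java.util.Collections.singletonList("main-topic"))
    while (true) {
      for (record <- consumer.poll(Duration.ofMillis(500)).asScala) {
        if (record.serializedValueSize > largeThreshold)
          producer.send(new ProducerRecord("large-message-topic", record.key, record.value))
        else
          forwardToMessageParser(record)
      }
    }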
Alternatively, you can take control of the commit offset yourself, which adds a little more complexity to your consumer code:
Disable auto-commit, and don't call commit on the consumer immediately after reading each batch.
If you read a small message, forward it to the message parser as you are doing today.
If you read large messages, hand them to another thread/thread pool in your consumer process, which will forward them to the message parser. This thread pool processes incoming messages in sequence and keeps track of the last offset completed.
Once in a while, call commit with offset = min(consumer offset, large-message offset), as sketched below.
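A sketch of that commit step, assuming you separately track the last fully processed offset of the inline (small-message) path and of the large-message thread pool:

    import java.util.Collections
    import org.apache.kafka.clients.consumer.{KafkaConsumer, OffsetAndMetadata}
    import org.apache.kafka.common.TopicPartition

    def maybeCommit(consumer: KafkaConsumer[String, String],
                    tp: TopicPartition,
                    smallDone: Long,  // last offset completed on the inline path
                    largeDone: Long   // last offset completed by the thread pool
                   ): Unit = {
      val safe = math.min(smallDone, largeDone)
      // Committed offsets point at the *next* record to read, hence the +1.
      consumer.commitSync(Collections.singletonMap(tp, new OffsetAndMetadata(safe + 1)))
    }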
I have a List of roughly 400,000 ids that I need to send to Kafka. I don't know whether the best option is to split them into n messages of x ids each, or whether it is better to send them in one message after adjusting the limits as described in this post:
How can I send large messages with Kafka (over 15MB)?
This is a very generic question, and the answer depends on how you want to process the data.
If your consumer is capable of processing each id entry quickly, then you can put a lot of them into a single message.
On the other hand, if the processing is slow, it's better to publish more, smaller messages (across multiple partitions), so that if you use consumer groups you don't run into group-membership loss events, etc.
Not to forget, there's also a limit on message size (as you've linked), with a default of around 1 MB.
In other words, you might need to run performance tests on your own side, as it's hard to make a decision with so little data.
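For what it's worth, a sketch of the batching variant; the batch size, topic name, and loadIds helper are assumptions you would tune and replace:

    import java.util.Properties
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](props)

    val ids: List[String] = loadIds() // hypothetical loader for the ~400,000 ids
    ids.grouped(1000).foreach { batch => // 1,000 ids per message, well under the 1 MB default
      producer.send(new ProducerRecord("ids-topic", batch.mkString(",")))
    }
    producer.flush()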
I'm using Akka Streams and I came across Sink.actorRefWithAck. I understand that it sends a message and only tries pulling in another element from the stream when an acknowledgement for the previous message has been received. Is there a way to batch-process messages with this sink? Example: pull five messages and only pull the next five once the first five have been acknowledged. I've thought about something like
source.grouped(5).to(Sink.actorRefWithAck(...))
But that would require the receiver to change to work with sequences, which let's assume is out of the question.
No, that is not possible with Sink.actorRefWithAck() if you want individual messages, rather than the entire batch, queued in the actor mailbox.
One idea to queue messages into the actor's mailbox more eagerly would be source.mapAsync(n)(ask-actor).to(Sink.ignore). This sends n elements to the actor, and as soon as the first one gets a response from the actor, it pulls and enqueues a new element.
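A sketch of that idea; the receiving actor, parallelism, and timeout below are illustrative:

    import akka.Done
    import akka.actor.{Actor, ActorSystem, Props}
    import akka.pattern.ask
    import akka.stream.scaladsl.{Sink, Source}
    import akka.util.Timeout
    import scala.concurrent.duration._

    class Receiver extends Actor {
      def receive = { case msg =>
        // ... process msg ...
        sender() ! Done // the ask's reply doubles as the ack
      }
    }

    implicit val system: ActorSystem = ActorSystem("demo")
    implicit val timeout: Timeout = 5.seconds

    val target = system.actorOf(Props(new Receiver))

    Source(1 to 100)
      .mapAsync(5)(element => target ? element) // up to 5 asks in flight at once
      .runWith(Sink.ignore)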
I'm trying to replace RabbitMQ with Apache Kafka, and while planning I bumped into several conceptual problems.
First, we are using RabbitMQ with a per-user queue policy, meaning each user has one queue. This suits our needs because each user represents some job to be done for that particular user, and if one user causes a problem, the other users' queues are unaffected because the queues are separated. (By "problem" I mean: messages in the queue are dispatched to users via HTTP requests. If a user refuses to receive a message (server down, perhaps?), it goes back into a retry queue, so no messages are lost unless the queue itself goes down.)
Now, Kafka is fault-tolerant and failure-safe because it writes to disk, and that is exactly why I am trying to bring Kafka into our architecture. But there are problems with my plans.
First, I was thinking of creating one topic per user, so each user would have their own topic. (What problems will this cause? My maximum estimate is around 1-5 million topics.)
Second, if I decide to go with topics based on operation, partitioned by a hash of the user id: if one user is not currently consuming messages, will all the users in that partition have to wait? What would be the best way to structure this situation?
So, in conclusion: 1-5 million users, and we do not want one user to block a large number of other users from being processed. Having a topic per user would solve this issue, but it seems there might be a problem with ZooKeeper if such a large number of topics is created. (Is this true?)
What would be the best way to structure this, considering scalability?
First, I was thinking of creating one topic per user, so each user would have their own topic. (What problems will this cause? My maximum estimate is around 1-5 million topics.)
I would advise against modeling like this.
Google around for "kafka topic limits", and you will find the relevant considerations for this subject. I think you will find you won't want to make millions of topics.
Second, if I decide to go with topics based on operation, partitioned by a hash of the user id
Yes, have a single topic for these messages and then route them based on the relevant field, like user_id or conversation_id. This field can be carried on the message and also serve as the ProducerRecord key, which determines which partition in the topic the message is destined for. I would not put the operation in the topic name, but in the message itself.
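For instance (the topic name and payload type are assumptions), keying the record by the user id pins all of a user's messages to a single partition, in order:

    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

    def sendForUser(producer: KafkaProducer[String, String],
                    userId: String, payload: String): Unit =
      // The key (userId) is hashed to pick the partition, so one user's
      // messages stay together and ordered on a single partition.
      producer.send(new ProducerRecord("user-messages", userId, payload))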
if one user is not currently consuming messages, will all the users in that partition have to wait? What would be the best way to structure this situation?
This depends on how the users are consuming messages. You could set up a timeout after which the message is routed to some "failed" topic, or send messages to users UDP-style, without acks. There are many ways to model this, and it's tough to offer advice without knowing how your consumers forward messages to your clients.
Also, if you are using Kafka Streams, take note of the StreamPartitioner interface. This interface appears in the KStream and KTable methods that materialize messages to a topic, and it may be useful in a chat application where you have clients idling on a specific TCP connection.
In Akka, when an actor dies while processing a message (inside onReceive(...) { ... }), that message is lost. Is there a way to guarantee losslessness? Is there a way to configure Akka to always persist messages before delivering them to onReceive, so that they can be recovered and replayed when the actor does die?
Perhaps something like a persistent mailbox?
Yes, take a look at Akka Persistence, in particular AtLeastOnceDelivery. This stores messages on the sender side in order to also cover losses during the delivery process, because otherwise the message might never reach the destination mailbox.
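A minimal sketch along the lines of the Akka Persistence documentation; the persistence id and message/event types are illustrative:

    import akka.actor.ActorSelection
    import akka.persistence.{AtLeastOnceDelivery, PersistentActor}

    case class Msg(deliveryId: Long, payload: String)
    case class Confirm(deliveryId: Long)

    sealed trait Event
    case class MsgSent(payload: String) extends Event
    case class MsgConfirmed(deliveryId: Long) extends Event

    class ReliableSender(destination: ActorSelection)
        extends PersistentActor with AtLeastOnceDelivery {

      override def persistenceId = "reliable-sender-1"

      override def receiveCommand = {
        case payload: String => persist(MsgSent(payload))(updateState)
        case Confirm(id)     => persist(MsgConfirmed(id))(updateState)
      }

      override def receiveRecover = { case evt: Event => updateState(evt) }

      private def updateState(evt: Event): Unit = evt match {
        case MsgSent(payload) => deliver(destination)(id => Msg(id, payload)) // retried until confirmed
        case MsgConfirmed(id) => confirmDelivery(id)
      }
    }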