The picture below depicts my basic use-case using message groups and Spring-based JMS consumers.
Please note, here the concurrency refers to the config set as shown below:
defaultJmsListenerContainerFactory.setConcurrency("3-10");
Would the G1 and G2 listeners receive messages concurrently for their respective groups?
Would a message from one group ever have to wait for the dispatch of a message in another group?
Generally speaking, multiple consumers receiving grouped messages can receive them concurrently. However, there are caveats...
The core JMS client implementation actually consumes messages from a local data structure that is filled with messages asynchronously, based on the consumerWindowSize, which is 1 MiB (1024 * 1024 bytes) by default. If a consumer is receiving messages from a large, contiguous group and its "window" fills up, then the broker will not be able to dispatch any more messages to it and will have to wait for the consumer to acknowledge messages before it can dispatch more. Once that block of grouped messages has been dispatched, the broker will be able to dispatch messages from other groups to other consumers.
This is also explained in the documentation (although in a bit less detail).
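To make the effect of a full window concrete, here is a toy model in plain Java. It is not Artemis code: the WindowedDispatch class, its Consumer type, and a window measured in messages rather than bytes are all assumptions made for illustration. The queue is dispatched strictly from the head, so a message for G2 that sits behind a run of G1 messages has to wait until the G1 consumer frees some of its window.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// Toy model only: the names and the message-count window are hypothetical, not Artemis APIs.
final class WindowedDispatch {

    /** A consumer bound to one group, with a bounded local buffer (stand-in for consumerWindowSize). */
    static final class Consumer {
        final String group;
        final int window;
        final Deque<String> buffer = new ArrayDeque<>();

        Consumer(String group, int window) {
            this.group = group;
            this.window = window;
        }

        boolean hasCredit() {
            return buffer.size() < window;
        }

        /** Simulates acknowledging the oldest buffered message, which frees window space. */
        void ackOne() {
            buffer.poll();
        }
    }

    /**
     * Dispatches messages from the head of the queue and stops as soon as the
     * head message's consumer has no window left. Each queue entry is {group, body}.
     * Returns how many messages were dispatched.
     */
    static int dispatch(Deque<String[]> queue, List<Consumer> consumers) {
        int sent = 0;
        while (!queue.isEmpty()) {
            String[] head = queue.peek();
            Consumer target = null;
            for (Consumer c : consumers) {
                if (c.group.equals(head[0])) {
                    target = c;
                }
            }
            if (target == null || !target.hasCredit()) {
                break; // head-of-line blocking: nothing behind the head is dispatched
            }
            target.buffer.add(head[1]);
            queue.poll();
            sent++;
        }
        return sent;
    }
}
```

With a queue of three G1 messages followed by one G2 message and a window of 2, the first call to dispatch delivers two G1 messages and the G2 message waits. After the G1 consumer acknowledges one message, the next call delivers the remaining G1 message and then the G2 message.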
Related
Scenario
10 Kafka consumers within the same consumer group.
The topic has 10 partitions, which means each partition is automatically assigned to a single consumer within the group.
Messages are sent to partitions on a round-robin basis.
Every now and then, a message will take much longer to process than other messages.
On such occasions, there's a chance the next message is assigned to a consumer that is still busy working while other consumers are free.
Question
Does Kafka support a mechanism to automatically send a message to a partition whose consumer is free?
If it doesn't, what is the common approach to this scenario?
Although you could implement a custom Assignor class, by default consumption is based only on assignment, not on load; load information is not communicated back to the group coordinator. Plus, constantly shuffling assignments based on load would likely cause frequent group rebalances, making consumption even slower.
Regarding length of processing, I am not aware of any way your consumer could inspect a message before partitions are assigned and the records are polled. Therefore, you'd need to decouple your processing logic from the actual poll loop if you'd like to improve processing times.
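One way to decouple processing from the poll loop is to hand records to a worker pool. The sketch below uses only the JDK: each inner list stands in for the result of one poll() call, and DecoupledPollLoop and drain are hypothetical names, not Kafka client APIs. With a real consumer you would also have to decide when offsets are committed, since records can now finish out of order; that part is deliberately left out.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Sketch only: DecoupledPollLoop and drain are illustrative names, not part of the Kafka client.
final class DecoupledPollLoop {

    /**
     * Submits every record to a fixed pool of workers so that one slow record
     * does not hold up the records polled after it. Blocks until all submitted
     * records have been handled (or one minute has passed).
     */
    static <T> void drain(List<List<T>> batches, int workers, Consumer<T> handler) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        for (List<T> batch : batches) {      // one batch stands in for one poll() result
            for (T rec : batch) {
                pool.execute(() -> handler.accept(rec));
            }
        }
        pool.shutdown();                     // accept no new work, finish what was submitted
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

The poll loop itself stays cheap because it only enqueues work; the trade-off is that per-partition ordering is no longer preserved by the handler unless records from one partition are routed to the same worker.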
I have a single-node ActiveMQ instance with two competing consumers connected to a topic. The topic subscription is shared, as per the JMS 2.0 specification. A shared subscription does guarantee that only one of the subscribers (using the same subscription name) gets each message. But what I noticed is that it does not guarantee that the second message is delivered only after the first one is acknowledged. If the first consumer takes time to acknowledge its message, the second message is delivered to the free consumer even before the first consumer has sent its acknowledgement to the broker. Is this standard behaviour? And is there a way to stop the broker from delivering the second message before the first one has been acknowledged?
ActiveMQ Artemis supports exclusive queues. These are special queues that route all messages to only one consumer at a time.
The obvious drawback of exclusive queues is that you cannot scale out the consumers to improve consumption, as only one consumer is technically active.
However, I would suggest taking a look at message grouping to scale out your solution. Message groups are useful when you want all messages with a certain value of a property to be processed serially by the same consumer, without stopping the delivery of messages with a different value of that property to other consumers.
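The pinning behaviour of message groups can be modelled in a few lines. In JMS the group is named by the JMSXGroupID message property; the GroupRouter class below is a hypothetical stand-in for the broker's bookkeeping, not Artemis code, and the round-robin choice of consumer for a new group is an assumption.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of group-to-consumer pinning; not an Artemis or JMS API.
final class GroupRouter {
    private final int consumerCount;
    private final Map<String, Integer> pinned = new HashMap<>();
    private int next = 0;

    GroupRouter(int consumerCount) {
        this.consumerCount = consumerCount;
    }

    /**
     * Returns the index of the consumer that handles the given group.
     * The first message of a group picks a consumer round-robin (an assumption);
     * every later message of that group goes to the same consumer.
     */
    int route(String groupId) {
        Integer existing = pinned.get(groupId);
        if (existing != null) {
            return existing;
        }
        int chosen = next % consumerCount;
        next++;
        pinned.put(groupId, chosen);
        return chosen;
    }
}
```

Messages of one group are therefore processed serially by one consumer, while messages of other groups can go to other consumers at the same time.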
Are Kafka and SQS the same?
I see that both are messaging queue systems and are event-based.
Do they serve the same purpose? If not, how are they different?
Apache Kafka and Amazon SQS are both used for message streaming, but they are not the same.
Apache Kafka follows the publish-subscribe model: a producer sends an event/message to a topic, and one or more consumers subscribe to that topic to receive it. A topic is divided into partitions for parallel streaming. Consumers are organised into consumer groups. When a message is read from a partition, its offset is committed so that the consumer group knows it has already been read, which avoids inconsistent reads under concurrency. However, another consumer group can still read that message from the partition.
Amazon SQS, by contrast, is queue-based, and a queue can be created in any AWS region. Producers push messages to the queue and consumers pull them from it; several consumers can poll the same queue, but each message is handed to only one of them at a time. That's why SQS is described as pull-based. SQS queues come in two types: FIFO and Standard.
There is another AWS service, Amazon SNS, which is publish-subscribe based like Kafka, but SNS has no message retention policy. It is meant for immediate notifications such as email or SMS. It can only push a message to subscribers that are available; otherwise the message is lost. However, combining SQS with SNS overcomes this drawback. Amazon SNS with SQS is called the fanout pattern. In this pattern, a message published to an SNS topic is distributed to multiple SQS queues in parallel, and the SQS queues provide persistence because SQS has a retention policy: it can keep a message for up to 14 days (default 4 days). Amazon SQS with SNS can achieve high-throughput parallel streaming and can replace Apache Kafka.
Yes, they are two messaging systems, but there is a big difference:
Kafka
Kafka is a highly scalable system and suits high workloads where you want to send messages in batches (to get good message throughput).
A Kafka topic consists of a number of partitions that can be read fully in parallel by different consumers in one consumer group, which gives very good performance.
For example, if you need to build a high-load streaming system, Kafka is really suitable for it.
SQS
SQS is an Amazon managed service (so you do not have to maintain the infrastructure yourself).
SQS is better for eventing, where a message (event) is picked up by some client and then removed from the queue.
In my experience SQS is not as fast as Kafka and doesn't suit high workloads; it's much more suitable for eventing where the number of events per second is modest.
For example, if you want to react to an S3 file upload (to start some processing of the file), SQS is very good.
SQS and Kafka are both messaging systems.
The primary differences are:
Ordering at scale.
Kafka - Produced messages are always consumed in order within a partition, irrespective of the number of items in the queue.
SQS - "A FIFO queue looks through the first 20k messages to determine available message groups. This means that if you have a backlog of messages in a single message group, you can't consume messages from other message groups that were sent to the queue at a later time until you successfully consume the messages from the backlog"
Limit on the Number of groups/topic/partitions
Kafka - The limit is quite high, but the number of topics/partitions is usually on the order of thousands (which can increase depending on the cluster size). SQS - "There is no quota to the number of message groups within a FIFO queue."
Deduplication - Kafka does not support deduplication of data when the same data is produced multiple times. SQS tries to deduplicate messages based on the dedup-id and the dedup-interval. "Assuming that the producer receives at least one acknowledgement before the deduplication interval expires, multiple retries neither affect the ordering of messages nor introduce duplicates."
Partition management. Kafka - Partitions are created, added, and managed by the user. SQS - The service controls the number of partitions and can increase or decrease it depending on the load and usage pattern.
Dead letter queue - Kafka does not have the concept of a DL queue (one can be explicitly created and maintained by the user, though). SQS supports a DL queue natively.
Overall, to summarise the points above, we can say that SQS is meant for offloading background tasks to an async pipeline. Kafka is much more scalable and should be used as a stream processing pipeline.
SQS is a queue. You have a list of messages that would need to be processed by some other part of the application. A message would ideally be processed once by one processor and would be marked as processed and removed from the queue. The purpose of the queue is to coordinate and distribute the processing of messages among the different processors.
Kafka is more similar to Kinesis which is used mainly for data streaming. Messages are stored in topics for other components to read. Any component can listen to topics and/or read all messages at any time. The main purpose is to allow the efficient delivery of messages to any number of recipients and allow the continuous streaming of data between components in a dynamic and elastic way.
From a bird's-eye view, there is one main difference:
Kafka follows the pub-sub model. If a producer sends a single message to a Kafka topic that has 2 consumers in different consumer groups, both consumers will receive the message.
SQS is more like the competing-consumers pattern. If a producer sends a message and the SQS queue has 2 consumers, only one consumer will receive the message. The other one won't get it if the first consumer has processed the message successfully. The second consumer has a chance to receive the message only if the visibility timeout expires, i.e., the first consumer was not able to process and delete the message within the visibility timeout.
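The visibility-timeout behaviour described above can be modelled without any AWS dependency. The VisibilityQueue class below is hypothetical and takes the current time as a parameter so that the behaviour is easy to follow; it is a sketch of the idea, not the SQS API.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory model of a visibility timeout; not the SQS client API.
final class VisibilityQueue {
    private final long timeoutMs;
    // message -> time (ms) from which it is visible to consumers again
    private final Map<String, Long> visibleFrom = new LinkedHashMap<>();

    VisibilityQueue(long timeoutMs, List<String> messages) {
        this.timeoutMs = timeoutMs;
        for (String m : messages) {
            visibleFrom.put(m, 0L);
        }
    }

    /** Returns the first visible message and hides it for the timeout, or null if none is visible. */
    String receive(long nowMs) {
        for (Map.Entry<String, Long> e : visibleFrom.entrySet()) {
            if (e.getValue() <= nowMs) {
                e.setValue(nowMs + timeoutMs);
                return e.getKey();
            }
        }
        return null;
    }

    /** Called by a consumer after successful processing; the message is gone for good. */
    void delete(String message) {
        visibleFrom.remove(message);
    }
}
```

If the first consumer calls delete before the timeout expires, no other consumer ever sees the message; if it does not, the message becomes visible again and a second consumer can receive it.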
Imagine a scenario where we have 3 partitions belonging to 3 different topics on a machine that runs a Kafka process/broker. This broker will receive messages for all three partitions. It will store them in different log subdirectories. My question is: how does the Kafka broker schedule these writes? How does it decide which partition/topic will be written next?
As for ordering across requests, the image below shows roughly how the broker internally handles produce requests:
There are a number of network threads that pull bytes off the network layer and convert them into internal requests. These requests are then put in a FIFO request queue, from which the I/O threads pull them and append the contained messages to the relevant partitions. So, in short, messages are processed in the order in which they are received.
Looking through the code, I am unsure whether there may be potential for a race condition here, where a smaller request could "overtake" a large request that was sent immediately before it. However, even if this were possible, it is an extremely unlikely fringe case that I can't see ever occurring for a single producer. Maybe someone with a better understanding of the code can weigh in here?
As for the ordering of batched messages within one request: the request stores messages internally in a HashMap keyed by TopicPartition. As far as I am aware, a Scala HashMap does not preserve the insertion order of its elements, so I don't think there are any guarantees about the order in which multiple partitions in one request get processed. That is fine, as ordering is only guaranteed to be preserved within a partition.
Within each partition, messages are processed in the order they were given to the producer before sending.
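The FIFO request queue and the per-partition guarantee can be illustrated with a small single-threaded model. This is a simplification under stated assumptions: the real broker uses several network and I/O threads, and RequestQueueModel, Produce, and appendAll are hypothetical names rather than broker code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.TreeMap;

// Simplified, single-threaded model; class and method names are hypothetical, not broker code.
final class RequestQueueModel {

    /** One produce request carrying a single message for one partition. */
    static final class Produce {
        final String partition;
        final String message;

        Produce(String partition, String message) {
            this.partition = partition;
            this.message = message;
        }
    }

    /**
     * Drains the request queue in FIFO order and appends each message to the
     * log of its partition. Returns the resulting logs keyed by partition.
     */
    static Map<String, List<String>> appendAll(Queue<Produce> requests) {
        Map<String, List<String>> logs = new TreeMap<>();
        Produce request;
        while ((request = requests.poll()) != null) {
            logs.computeIfAbsent(request.partition, p -> new ArrayList<>()).add(request.message);
        }
        return logs;
    }
}
```

Each partition's log ends up in arrival order, while nothing is implied about the relative order of messages in different partitions.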
I have a distributed system that reads messages from RabbitMQ. I now need to process no more than N msgs/s.
For example: imagine service A, which sends text messages (SMS). This service can only handle 100 msgs/s. My system has multiple consumers that read data from RabbitMQ. Each message needs to be processed first and then be sent to service A. Processing time varies from message to message.
So the question:
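Is it possible to configure a queue to dispatch no more than 100 msgs/s to multiple consumers?
You can use the prefetch count of the qos method to limit throughput to your consumers. Note that it caps the number of unacknowledged messages outstanding on a consumer rather than the number of messages per second, so it throttles throughput only indirectly. With the Java client, for example:
channel.basicQos(100);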
See also: http://www.rabbitmq.com/blog/2012/05/11/some-queuing-theory-throughput-latency-and-bandwidth/
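A prefetch limit bounds the messages in flight, not the rate, so a hard cap of 100 msgs/s usually needs a limiter on the consumer side as well. The token bucket below is a different technique from the qos setting and is only a sketch: TokenBucket is a hypothetical class, time is passed in explicitly, and with several consumer processes each one would need its own share of the budget (for example 100 divided by the number of consumers) or a shared limiter.

```java
// Hypothetical client-side limiter; not part of any RabbitMQ client library.
final class TokenBucket {
    private final double ratePerSecond;
    private final double capacity;
    private double tokens;
    private long lastNanos;

    TokenBucket(double ratePerSecond, double capacity, long nowNanos) {
        this.ratePerSecond = ratePerSecond;
        this.capacity = capacity;
        this.tokens = capacity;   // start full, so an initial burst up to capacity is allowed
        this.lastNanos = nowNanos;
    }

    /**
     * Refills the bucket for the time elapsed since the last call, then takes
     * one token if available. Returns false when the caller should wait
     * before forwarding the next message.
     */
    synchronized boolean tryAcquire(long nowNanos) {
        double refill = (nowNanos - lastNanos) / 1_000_000_000.0 * ratePerSecond;
        tokens = Math.min(capacity, tokens + refill);
        lastNanos = nowNanos;
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false;
    }
}
```

A consumer would call tryAcquire(System.nanoTime()) before sending each message to service A and delay the acknowledgement until a token is available; combined with a small prefetch count, this keeps unprocessed messages on the broker instead of in the consumer.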