Using Single SQS for multiple subscribers based on message identifier - publish-subscribe

We have an application where multiple publishers write to a publisher Kafka topic. This data is then propagated to a subscriber-specific topic, and each subscriber consumes the data from the topic assigned to it.
We want to use SQS for the same purpose, but the issue is that we would again need one SQS queue for each subscriber.
Managing these multiple queues will be a problem, and whenever no data is published for a subscriber, the queue assigned to it will sit idle.
Is there any way I can use a single SQS queue from which all subscribers can consume messages based on a message identifier?
The challenges that need to be covered in this design are:
Each subscriber must get only its own messages, based on the identifier.
There must be no added latency when one publisher publishes very few messages and another publishes millions.
We can have one SQS queue for each publisher, but a single SQS queue for all subscribers of that publisher.
Can anyone suggest an architecture using a similar implementation?
Thanks

I think you can achieve it by setting up a single SQS queue. You would want to set up a Lambda trigger on that queue, which will serve as a Service Manager (SM). The SM will have a static JSON file that defines the mapping between each message identifier and its subscriber/worker. The SM will receive an SQS message event, find the message attribute used as the identifier, and then look it up in the JSON file to find the corresponding subscriber. If a subscriber is found, the SM will invoke it.
Consider using an SQS batch trigger.
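A minimal sketch of the Service Manager described above might look like the following. The attribute name `subscriberId`, the routing table contents, and the in-process subscribers are all hypothetical; a real SM would more likely invoke another Lambda or an HTTP endpoint. The failure response assumes partial batch responses (`ReportBatchItemFailures`) are enabled on the trigger.

```python
import json

# Hypothetical mapping from message identifier to subscriber name. In the
# design above this would be loaded from the static JSON file.
ROUTING_TABLE = json.loads('{"orders": "order_worker", "billing": "billing_worker"}')

# Hypothetical in-process subscribers that just collect message bodies.
received = {"order_worker": [], "billing_worker": []}
SUBSCRIBERS = {
    "order_worker": received["order_worker"].append,
    "billing_worker": received["billing_worker"].append,
}


def handler(event, context=None):
    """Route each SQS record to the subscriber mapped to its identifier."""
    failures = []
    for record in event.get("Records", []):
        attrs = record.get("messageAttributes", {})
        # "subscriberId" is an assumed attribute name, not an SQS built-in.
        identifier = attrs.get("subscriberId", {}).get("stringValue")
        subscriber = SUBSCRIBERS.get(ROUTING_TABLE.get(identifier))
        if subscriber is None:
            # No mapping found: report the record as failed so SQS can
            # retry it or move it to a dead letter queue.
            failures.append({"itemIdentifier": record["messageId"]})
            continue
        subscriber(record["body"])
    return {"batchItemFailures": failures}
```

Note that the SM is still a single consumer of the queue, so a publisher sending millions of messages can delay a low-volume one unless the Lambda concurrency is high enough.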

Related

Are SQS and Kafka same?

Are Kafka and SQS the same?
I see that both are message queue systems and are event-based.
Do they serve the same purpose? If not, how are they different?
Apache Kafka and Amazon SQS are both used for messaging, but they are not the same.
Apache Kafka follows the publish-subscribe model: a producer sends an event/message to a topic, and one or more consumers subscribe to that topic to receive it. A topic is divided into partitions for parallel streaming. There is also the concept of a consumer group. When a message is read from a partition of a topic, its offset is committed to mark it as already read by that consumer group, which avoids inconsistent reads when consuming concurrently. However, other consumer groups can still read that message from the partition.
Amazon SQS, by contrast, follows the queue model, and a queue can be created in any AWS region where SQS is available. You push messages to a queue, and consumers pull messages from it; several consumers can poll the same queue, but each message is meant to be processed by only one of them. That's why SQS is pull-based. SQS queues come in two types: FIFO and Standard.
There is another AWS service, Amazon SNS, which is publish-subscribe based like Kafka, but SNS has no message retention policy. It is meant for instant delivery such as email, SMS, etc. It can only push a message to subscribers while they are available; otherwise the message can be lost. However, combining SQS with SNS overcomes this drawback. Amazon SNS with SQS is called the fanout pattern. In this pattern, a message published to an SNS topic is distributed to multiple SQS queues in parallel, and the SQS queues provide persistence, because SQS has a retention policy. It can persist a message for up to 14 days (default 4 days). Amazon SNS with SQS can achieve high-throughput parallel streaming and can replace Apache Kafka.
Yes, they are both messaging systems, but there is a big difference:
Kafka
Kafka is a highly scalable system and suits high workloads where you want to send messages in batches (to get good message throughput).
A Kafka topic consists of a number of partitions, which can be read fully in parallel by different consumers in one consumer group, and that gives very good performance.
For example, if you need to build a high-load streaming system, Kafka is really suitable for it.
SQS
SQS is an Amazon managed service (so you do not have to maintain the infrastructure yourself).
SQS is better for eventing, when a message (event) needs to be picked up by some client and then removed from the queue once it has been processed.
In my experience SQS is not as fast as Kafka and does not fit high workloads; it is much more suitable for eventing where the number of events per second is not very high.
For example, if you want to react to an S3 file upload (to start processing that file), SQS is very good.
SQS and Kafka are both messaging systems.
The primary differences are:
Ordering at scale.
Kafka - Produced messages are always consumed in order within a partition, irrespective of the number of items in the queue.
SQS - "A FIFO queue looks through the first 20k messages to determine available message groups. This means that if you have a backlog of messages in a single message group, you can't consume messages from other message groups that were sent to the queue at a later time until you successfully consume the messages from the backlog"
Limit on the number of groups/topics/partitions.
Kafka - Although the limit is quite high, the number of topics/partitions is usually in the order of thousands (which can increase depending on the cluster size).
SQS - "There is no quota to the number of message groups within a FIFO queue."
Deduplication. Kafka does not deduplicate data when the same data is produced multiple times. SQS tries to deduplicate messages based on the deduplication ID and the deduplication interval. "Assuming that the producer receives at least one acknowledgement before the deduplication interval expires, multiple retries neither affect the ordering of messages nor introduce duplicates."
Partition management. Kafka - Partitions are created, added, and managed by the user. SQS - Controls the number of partitions itself and can increase or decrease it depending on the load and usage pattern.
Dead letter queue. Kafka does not have the concept of a DL queue (though one can be explicitly created and maintained by the user). SQS supports a DL queue natively.
Overall, to summarise the points above, SQS is meant for offloading background tasks to an async pipeline. Kafka is much more scalable and should be used as a stream processing pipeline.
SQS is a queue. You have a list of messages that would need to be processed by some other part of the application. A message would ideally be processed once by one processor and would be marked as processed and removed from the queue. The purpose of the queue is to coordinate and distribute the processing of messages among the different processors.
Kafka is more similar to Kinesis which is used mainly for data streaming. Messages are stored in topics for other components to read. Any component can listen to topics and/or read all messages at any time. The main purpose is to allow the efficient delivery of messages to any number of recipients and allow the continuous streaming of data between components in a dynamic and elastic way.
From a bird's-eye view, there is one main difference.
Kafka is used for the pub-sub model. If a producer sends a single message and a Kafka topic has 2 consumers in different consumer groups, both consumers will receive the message.
SQS is more like the competing consumers pattern. If a producer sends a message and the SQS queue has 2 consumers, only one consumer will receive the message. The other one won't get the message if the 1st consumer has processed it successfully. The 2nd consumer has a chance to receive the message only if the message's visibility timeout expires, i.e., the 1st consumer is not able to process and delete the message within the visibility timeout.
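The difference between the two delivery models can be illustrated with a small in-memory simulation. This is only a sketch: the `Queue` and `Topic` classes are hypothetical stand-ins, not the SQS or Kafka client APIs, and the queue ignores visibility timeouts and redelivery.

```python
from collections import defaultdict


class Queue:
    """Competing consumers: each message goes to exactly one consumer."""

    def __init__(self, messages):
        self._messages = list(messages)

    def receive(self):
        # Hand out the next message and remove it, or return None if empty.
        return self._messages.pop(0) if self._messages else None


class Topic:
    """Retained log with per-group offsets: every group sees every message."""

    def __init__(self, messages):
        self._log = list(messages)
        self._offsets = defaultdict(int)

    def poll(self, group_id):
        offset = self._offsets[group_id]
        if offset >= len(self._log):
            return None
        self._offsets[group_id] = offset + 1  # "commit" the new position
        return self._log[offset]


# Two consumers on one queue split the messages between them.
queue = Queue(["m1", "m2"])
queue_results = {"consumer_a": queue.receive(), "consumer_b": queue.receive()}

# Two consumer groups on one topic each read the full log independently.
topic = Topic(["m1", "m2"])
topic_results = {
    "group_a": [topic.poll("group_a"), topic.poll("group_a")],
    "group_b": [topic.poll("group_b"), topic.poll("group_b")],
}
```

Here `consumer_a` and `consumer_b` each end up with a different message, while `group_a` and `group_b` both see `m1` and `m2`.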

Allow consumption of same message by different instances of the same service from a Kafka topic

I have several instances of the same service subscribed to a Kafka topic. A producer publishes 1 message to a topic. I want this message to be consumed by all instances. When an instance is started, the messages should be read from the end of the topic/partitions. I don't want the instances to receive messages which were published before the service is started (but it won't be a big problem if some old messages are processed by the service). I don't want the instances to lose messages if they are disconnected from Kafka for some time or Kafka is down, which means that I need to commit offsets periodically. A message can be processed twice; that is not a big problem.
1. Is the following the best way to achieve the described behavior: generate a new Kafka group id using a new Guid or timestamp for each instance each time the instance is started?
2. What are the disadvantages of the approach described in item 1 above?
It is enough to do two things. First, each instance of the service should have its own group.id. That guarantees that each of them will read all published messages and will receive published messages after reconnecting. This id is per instance, and there is no need to regenerate it on start. Second, each instance should have the property auto.offset.reset=latest, which is also the default. This guarantees that the consumer will not read messages that were published before the first start of the instance.
Of course, your instances need to commit offsets after processing the messages.
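As a rough sketch of the two settings above, the configuration for each instance could be built like this. The function name, the `service_name`/`instance_id` parameters, and the host-name default are assumptions for illustration. The keys are standard Kafka consumer property names, and the resulting dict could be passed to a client such as `confluent_kafka.Consumer`.

```python
import socket


def build_consumer_config(bootstrap_servers, service_name, instance_id=None):
    """Build a consumer config with a stable, per-instance group.id."""
    # Assumption: the host name is stable across restarts of an instance.
    # Any other stable per-instance identifier would work as well.
    instance_id = instance_id or socket.gethostname()
    return {
        "bootstrap.servers": bootstrap_servers,
        # One consumer group per instance, so every instance gets every message.
        "group.id": f"{service_name}-{instance_id}",
        # Start from the end of the log when the group has no committed offset.
        "auto.offset.reset": "latest",
        # Commit manually, after each message has been processed.
        "enable.auto.commit": False,
    }
```

Because the group id is stable, an instance that restarts resumes from its last committed offset instead of skipping to the end, which is what prevents message loss while it was disconnected.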

Kafka Consumer API vs Streams API for event filtering

Should I use the Kafka Consumer API or the Kafka Streams API for this use case? I have a topic with a number of consumer groups consuming off it. This topic contains one type of event which is a JSON message with a type field buried internally. Some messages will be consumed by some consumer groups and not by others, one consumer group will probably not be consuming many messages at all.
My question is:
Should I use the Consumer API and, on each event, read the type field and drop or process the event based on it?
Or should I filter using the Streams API, with the filter method and a predicate?
After I consume an event, the plan is to process it (DB delete, update, or other, depending on the service); then, if there is a failure, I will produce to a separate queue which I will re-process later.
Thank you.
This seems more a matter of opinion. I personally would go with Streams/KSQL, as there is likely less code that you would have to maintain. You can have another intermediary topic that contains the cleaned-up data, to which you can then attach a Connect sink, other consumers, or other Streams and KSQL processes. Using Streams you can scale a single application across different machines, store state, have standby replicas, and more, which would be a PITA to do all yourself.
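For comparison, the Consumer API option from the question (read the type field, then drop or process) can be sketched as a plain predicate applied to each consumed message. The nested path `payload.meta.type` is hypothetical, since the question does not show the message layout, and the Kafka client calls are left out so that only the filtering logic is shown.

```python
import json


def extract_type(raw_message, path=("payload", "meta", "type")):
    """Return the nested type field, or None if it is missing or malformed."""
    try:
        node = json.loads(raw_message)
    except (TypeError, ValueError):
        return None
    for key in path:
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    # Only string types are treated as valid.
    return node if isinstance(node, str) else None


def select_events(raw_messages, wanted_types):
    """Keep only the messages whose type this consumer group handles."""
    return [m for m in raw_messages if extract_type(m) in wanted_types]
```

Each consumer group would call `select_events` with its own set of types. Every group still reads and parses every message, which is the cost of filtering on the consumer side.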

Pub-Sub messaging system for two way communication

Hi, I have the system as captured in the image. I'm planning to adopt a reliable messaging system, but I'm a bit confused over which one to use. The detailed flow of data and my requirements are explained below.
Step 1: Data from the System is given to the Publisher.
Step 2: The Publisher simply pushes the data to the Topic Based Messaging System.
Step 3: There will be more than one subscriber for each topic, and subscribers should get notified as soon as there are entries in the messaging system.
Step 4: Subscribers process the data and update the status back to the messaging system.
Step 5: The Publisher should get notified of the processed messages and acknowledge the System which gave the data.
So, my question is: can I use RabbitMQ or Kafka for the "Topic Based Messaging System"? My main requirement here is to update the status back from the subscribers, and the publisher should also get a notification for the status update. (I'm not much bothered about throughput, performance, or scalability AT THIS POINT of TIME.) Another concern of mine is data recovery/HA.
How about having an N+1 topic system: one topic for publishing messages, which would be consumed by N subscribers, and N topics for acknowledgements, one per subscriber?
Your "System" could subscribe to all N acknowledgement topics and verify whether all the subscribers processed the original message that was published by the producer.
Each message in Kafka, for example, has a message key, and the same message key can be used to correlate the original message with its subscriber-specific acknowledgement.
Does this achieve what you want in your system?
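The correlation step in this N+1 design can be sketched as a small tracker that the "System" feeds with acknowledgements read from the N acknowledgement topics. The `AckTracker` class and its method names are hypothetical, and the actual topic consumption is left out.

```python
class AckTracker:
    """Track which subscribers have acknowledged each message key."""

    def __init__(self, subscribers):
        self._expected = frozenset(subscribers)
        self._acked = {}

    def on_published(self, message_key):
        # Start tracking a message as soon as it is published.
        self._acked.setdefault(message_key, set())

    def on_ack(self, message_key, subscriber):
        """Record an ack; return True once every subscriber has acked."""
        if subscriber not in self._expected:
            raise ValueError(f"unknown subscriber: {subscriber}")
        self._acked.setdefault(message_key, set()).add(subscriber)
        return self.is_complete(message_key)

    def is_complete(self, message_key):
        return self._acked.get(message_key, set()) == self._expected
```

When `on_ack` returns `True` for a key, the publisher can acknowledge the System that supplied the original data. For recovery after a restart, the tracker state would need to be persisted or rebuilt by re-reading the acknowledgement topics.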

SQS: How to forward message to subscriber based on a certain key

I have a validation service which takes in validation requests and publishes them to an SQS queue. Now, based on the type of validation request, I want to forward the message to that specific service.
So basically, I have one producer and multiple consumers, but essentially, one message is to be consumed by only one consumer.
What approach should I use? Should I have a different SQS queue for each service, or can I do this using a single queue based on message type?
As I see it, you have three options:
The first option, like you say, is to have a unique consumer for each message type. This is the approach we use, and we have thousands of queues and many different message types.
The second option would be to decorate the message being pushed onto SQS with something that indicates its desired consumer, then have a generic consumer in your application that forwards the message on to the right consumer. This approach is generally seen as an anti-pattern, though, and I would personally agree.
Thirdly, you could take advantage of SNS filtering, but that's only if you use SNS right now; otherwise you'd have to invest some time to set it up and make it work.
Hope that helps!
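To make the third option more concrete, here is a sketch of what SNS subscription filter policies for this case could look like, together with a local matcher for trying them out. The attribute name `validation_type` and the queue names are hypothetical. The matcher only mimics exact string matching, not the full SNS filter policy syntax. With boto3, each policy would be passed as a JSON string in the subscription's `FilterPolicy` attribute.

```python
import json


def filter_policy_for(validation_type):
    # An SNS filter policy maps an attribute name to a list of accepted values.
    return {"validation_type": [validation_type]}


def matches(policy, message_attributes):
    """Exact-match check only; real SNS policies support more operators."""
    return all(
        message_attributes.get(name) in allowed
        for name, allowed in policy.items()
    )


# One SQS queue per consuming service, each subscribed with its own policy.
policies = {
    "address-queue": filter_policy_for("address"),
    "payment-queue": filter_policy_for("payment"),
}


def route(message_attributes):
    """Return the queues whose policy accepts a message with these attributes."""
    return [q for q, policy in policies.items() if matches(policy, message_attributes)]


# The JSON string that would be stored as the subscription's FilterPolicy.
address_policy_json = json.dumps(policies["address-queue"])
```

With this setup the producer publishes once to the SNS topic, and each message is delivered only to the queue of the service that handles its type.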