Are SQS and Kafka same? - apache-kafka

Are Kafka and SQS the same?
I see that both are message queue systems and are event-based.
Do they serve the same purpose? If not, how are they different?

Apache Kafka and Amazon SQS are both used for messaging, but they are not the same.
Apache Kafka follows the publish-subscribe model: a producer sends an event/message to a topic, and one or more consumers subscribe to that topic to receive it. A topic is split into partitions to allow parallel streaming. Consumers are organized into consumer groups. When a message is read from a partition, the consumer group commits its offset to record that the group has already read it, which avoids inconsistent reads when consumers run concurrently. Other consumer groups can still read the same message from the partition.
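To make the consumer-group behaviour concrete, here is a minimal in-memory sketch, not real Kafka client code. The `PartitionLog` class and the group names `billing` and `analytics` are hypothetical; the only point is that each group tracks its own committed position on the same log.

```python
from collections import defaultdict


class PartitionLog:
    """Toy model of one partition: an append-only list plus one committed offset per group."""

    def __init__(self):
        self.records = []
        self.committed = defaultdict(int)  # group id -> next offset to read

    def append(self, value):
        self.records.append(value)
        return len(self.records) - 1  # offset of the new record

    def poll(self, group, max_records=10):
        # Read from the group's committed position; nothing is removed from the log.
        start = self.committed[group]
        return self.records[start:start + max_records]

    def commit(self, group, count):
        # Advance only this group's position.
        self.committed[group] += count


log = PartitionLog()
for event in ["order-created", "order-paid", "order-shipped"]:
    log.append(event)

billing = log.poll("billing")
log.commit("billing", len(billing))
billing_again = log.poll("billing")   # nothing new for the billing group
analytics = log.poll("analytics")     # a different group still sees every record
```

The commit affects only the group that made it, which is why a second group can read the same messages later.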
Amazon SQS, by contrast, follows the queue model, and a queue is created in a specific AWS Region. You push messages to a queue and consumers pull messages from it; several consumers may poll the same queue, but each message is handed to only one of them at a time. That is why SQS is described as pull-based. SQS queues come in two types: FIFO and Standard.
AWS also offers Amazon SNS, which is publish-subscribe based like Kafka, but SNS has no message retention policy. It is meant for immediate notification such as email, SMS, etc. It pushes messages to subscribers, and if a subscriber stays unavailable the message is eventually lost. Combining SNS with SQS overcomes this drawback; this is called the fanout pattern. In this pattern, a message published to an SNS topic is distributed to multiple SQS queues in parallel, and the SQS queues provide persistence because SQS has a retention policy: it can keep a message for up to 14 days (default 4 days). Amazon SQS with SNS can achieve high-throughput parallel streaming and, for some use cases, can replace Apache Kafka.
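The fanout pattern can be sketched in a few lines of plain Python. This is an in-memory illustration only, not the AWS SDK; the `Topic` and `Queue` classes and the queue names are made up for the example.

```python
from collections import deque


class Queue:
    """Stand-in for an SQS queue: holds messages until a consumer pulls them."""

    def __init__(self, name):
        self.name = name
        self.messages = deque()

    def send(self, body):
        self.messages.append(body)

    def receive(self):
        return self.messages.popleft() if self.messages else None


class Topic:
    """Stand-in for an SNS topic: pushes every published message to all subscribers."""

    def __init__(self):
        self.subscribers = []

    def subscribe(self, queue):
        self.subscribers.append(queue)

    def publish(self, body):
        for queue in self.subscribers:
            queue.send(body)


orders = Topic()
email_queue = Queue("email")
audit_queue = Queue("audit")
orders.subscribe(email_queue)
orders.subscribe(audit_queue)

orders.publish("order-42 placed")

from_email = email_queue.receive()
from_audit = audit_queue.receive()
```

Each queue holds its own copy, so a slow or offline consumer of one queue does not affect the other.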

Yes, they are two messaging systems, but there is a big difference:
Kafka
Kafka is a highly scalable system and suits heavy workloads where you want to send messages in batches (to get good message throughput).
A Kafka topic consists of a number of partitions, which can be read fully in parallel by different consumers in one consumer group, and that gives very good performance.
For example, if you need to build a high-load streaming system, Kafka is well suited to it.
SQS
SQS is an Amazon managed service (so you do not have to support infrastructure by yourself).
SQS is better for eventing, where a client needs to pick up a message (event) and the message is then removed from the queue once the client has processed and deleted it.
In my experience SQS is not as fast as Kafka and does not fit heavy workloads; it is much more suitable for eventing where the number of events per second is modest.
For example, if you want to react to an S3 file upload (to start processing the file), SQS is a very good fit.

SQS and Kafka are both messaging systems.
The primary differences are:
1. Ordering at scale.
Kafka: Produced messages are always consumed in order within a partition, irrespective of the number of items in it.
SQS: "A FIFO queue looks through the first 20k messages to determine available message groups. This means that if you have a backlog of messages in a single message group, you can't consume messages from other message groups that were sent to the queue at a later time until you successfully consume the messages from the backlog"
2. Limit on the number of groups/topics/partitions.
Kafka: Although the limit is quite high, the number of topics/partitions is usually in the order of thousands (which can increase depending on the cluster size).
SQS: "There is no quota to the number of message groups within a FIFO queue."
3. Deduplication.
Kafka: Does not deduplicate data if the application produces the same data multiple times (the idempotent producer only guards against duplicates caused by producer retries).
SQS: FIFO queues deduplicate messages based on the deduplication ID within the deduplication interval. "Assuming that the producer receives at least one acknowledgement before the deduplication interval expires, multiple retries neither affect the ordering of messages nor introduce duplicates."
4. Partition management.
Kafka: Partitions are created, added, and managed by the user.
SQS: The service controls the number of partitions and can increase or decrease it depending on the load and usage pattern.
5. Dead letter queue.
Kafka: Has no built-in concept of a DL queue (one can be explicitly created and maintained by the user, though).
SQS: Supports a DL queue natively.
Overall, to summarise the points above, SQS is meant for offloading background tasks to an async pipeline. Kafka is much more scalable and should be used as a stream processing pipeline.
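The deduplication point above can be illustrated with a small sketch. This is not the SQS API; `DedupQueue` is a hypothetical class, and the 300-second default mirrors the 5-minute deduplication interval of SQS FIFO queues. The clock is passed in explicitly so the behaviour is easy to follow.

```python
class DedupQueue:
    """Toy FIFO queue that drops repeats of a deduplication ID within an interval."""

    def __init__(self, interval_seconds=300):
        self.interval = interval_seconds
        self.seen = {}       # dedup id -> time the id was last accepted
        self.messages = []   # accepted message bodies, in order

    def send(self, dedup_id, body, now):
        accepted_at = self.seen.get(dedup_id)
        if accepted_at is not None and now - accepted_at < self.interval:
            return False     # duplicate inside the interval: silently dropped
        self.seen[dedup_id] = now
        self.messages.append(body)
        return True


queue = DedupQueue()
first = queue.send("payment-7", "charge card", now=0)
retry = queue.send("payment-7", "charge card", now=10)    # producer retry
late = queue.send("payment-7", "charge card", now=301)    # interval has expired
```

A retry inside the interval is absorbed, while the same ID sent after the interval is treated as a new message.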

SQS is a queue. You have a list of messages that would need to be processed by some other part of the application. A message would ideally be processed once by one processor and would be marked as processed and removed from the queue. The purpose of the queue is to coordinate and distribute the processing of messages among the different processors.
Kafka is more similar to Kinesis which is used mainly for data streaming. Messages are stored in topics for other components to read. Any component can listen to topics and/or read all messages at any time. The main purpose is to allow the efficient delivery of messages to any number of recipients and allow the continuous streaming of data between components in a dynamic and elastic way.

From a bird's-eye view, there is one main difference.
Kafka is used for the pub-sub model. If a producer sends a single message and a Kafka topic has 2 consumers in different consumer groups, both consumers will receive the message.
SQS is more like the competing-consumers pattern. If a producer sends a message and the SQS queue has 2 consumers, only one consumer will receive the message. The other one won't get it if the 1st consumer processes the message successfully. The 2nd consumer has a chance to receive the message only if the visibility timeout expires, i.e., the 1st consumer is not able to process and delete the message within the visibility timeout.
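A rough simulation of the visibility timeout may help here. It is a sketch under simplifying assumptions (a hypothetical `VisibilityQueue` class, an explicit `now` argument instead of a real clock), not SQS client code.

```python
class VisibilityQueue:
    """Toy queue where a received message is hidden until its visibility timeout expires."""

    def __init__(self, visibility_timeout):
        self.visibility_timeout = visibility_timeout
        self.messages = {}   # message id -> [body, time at which it becomes visible again]
        self.next_id = 0

    def send(self, body):
        self.messages[self.next_id] = [body, 0]
        self.next_id += 1

    def receive(self, now):
        for message_id, entry in self.messages.items():
            if entry[1] <= now:
                entry[1] = now + self.visibility_timeout  # hide it from other consumers
                return message_id, entry[0]
        return None

    def delete(self, message_id):
        # The consumer deletes the message after processing it successfully.
        self.messages.pop(message_id, None)


queue = VisibilityQueue(visibility_timeout=30)
queue.send("resize image")

consumer_1 = queue.receive(now=0)       # consumer 1 gets the message
consumer_2 = queue.receive(now=5)       # consumer 2 sees nothing while it is hidden
after_timeout = queue.receive(now=31)   # consumer 1 never deleted it, so it reappears
queue.delete(after_timeout[0])
finally_empty = queue.receive(now=100)
```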

Related

Why components involved in request response flow should not consume messages on a kafka topic?

As part of a design decision at my client site, the components (microservices) involved in the HTTP request-response flow are allowed to produce messages on a Kafka topic, but not allowed to consume messages from a Kafka topic.
Such components (microservices) can read and write the database, can talk to other components, and can produce messages on a topic, but cannot consume messages from a Kafka topic.
Instead, the design suggests writing separate utilities that consume messages from Kafka topics and store them in the database. Components (microservices) involved in the request-response flow will read that information from the database.
What are the design flaws if such components (microservices) consume Kafka topics? Why does the design suggest writing separate utilities to consume Kafka topics and store the data in the database, so that components can read that information from the database?
Kafka topics are divided into partitions, and for each consumer group, the partitions are distributed among the various consumers in that group. Each consumer is responsible for consuming the messages in the partitions it gets assigned.
Presumably, your request handling machines are clustered and load balanced. There are two ways you might have those machines subscribe to Kafka topics, and both of those ways are broken:
You could put your request handling machines in different consumer groups. In this case, each one will have to consume all of the messages. That is probably not what you want, which is typically to have each consumer pull from the queue and have each message processed only once. Also, the consumers will be out of sync and will process the messages at different rates.
You could put your request handling machines in the same consumer group. In this case, each one will only have access to the partitions that it is assigned. Every machine will see a different message stream. This, of course, is also not what you want. Clients would get different results depending on which machine the load balancer directed them to.
If you want all of your request handling machines to pull from the same queue of messages across the whole topic, then they need to communicate with a single consumer that is assigned all the partitions.
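The two broken setups and the suggested one can be compared with a small sketch. The round-robin `assign_partitions` helper is a simplification (Kafka's real assignors are configurable and more involved), and the names `web-1`, `web-2`, and `db-writer` are hypothetical.

```python
def assign_partitions(partitions, consumers):
    """Spread partitions over the consumers of ONE consumer group, round-robin."""
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(partitions):
        owner = consumers[index % len(consumers)]
        assignment[owner].append(partition)
    return assignment


partitions = [0, 1, 2, 3]
web_servers = ["web-1", "web-2"]

# Same group: each request handler sees only part of the topic.
same_group = assign_partitions(partitions, web_servers)

# Different groups: every request handler must consume the whole topic.
separate_groups = {
    server: assign_partitions(partitions, [server])[server]
    for server in web_servers
}

# Dedicated utility: one consumer reads everything and writes it to the database.
dedicated_consumer = assign_partitions(partitions, ["db-writer"])
```

With the dedicated consumer, every request handler reads the same data from the database regardless of which machine the load balancer picks.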

Message queue (like RabbitMQ) or Kafka for Microservices?

We are starting a new project, where we are evaluating the tech stack for asynchronous communication between microservices. We are considering RabbitMQ and Kafka for this.
Can anyone shed some light on the key considerations for choosing between these two?
Thanks
The selection depends on what exactly your microservices need. Each has something the other does not.
RabbitMQ in a nutshell
Who are the players:
Consumer
Publisher
Exchange
Route
The flow starts with the Publisher, which sends a message to an exchange. The exchange is a middleware layer that knows how to route the message to a queue. Consumers define which queue they consume from (by defining a binding). RabbitMQ pushes the message to the consumer, and once it has been consumed and the acknowledgment has arrived, the message is removed from the queue.
Any piece in this system can be scaled out: producers, consumers, and RabbitMQ itself, which can be clustered and made highly available.
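As an illustration of how an exchange decides where a message goes, here is a sketch of the matching rule used by one exchange type, the topic exchange: in a binding pattern `*` stands for exactly one dot-separated word and `#` for zero or more words. The function below is a stand-alone re-implementation for illustration, not RabbitMQ code, and the routing keys are made up.

```python
def topic_matches(pattern, routing_key):
    """Return True if a topic-exchange binding pattern matches a routing key."""

    def match(pattern_words, key_words):
        if not pattern_words:
            return not key_words
        head = pattern_words[0]
        if head == "#":
            # '#' may swallow zero or more words of the routing key.
            return any(
                match(pattern_words[1:], key_words[skip:])
                for skip in range(len(key_words) + 1)
            )
        if not key_words:
            return False
        if head == "*" or head == key_words[0]:
            return match(pattern_words[1:], key_words[1:])
        return False

    return match(pattern.split("."), routing_key.split("."))


one_word = topic_matches("orders.*", "orders.created")
too_deep = topic_matches("orders.*", "orders.eu.created")
any_depth = topic_matches("orders.#", "orders.eu.created")
zero_words = topic_matches("orders.#", "orders")
```

A queue bound with `orders.#` would therefore receive all order events, while one bound with `orders.*` only receives keys with exactly one word after `orders`.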
Kafka
Who are the players
Consumer / Consumer groups
Producer
Kafka source connect
Kafka sink connect
Topic and topic partition
Kafka stream
Broker
Zookeeper
Kafka is a robust system with several players in the game, but once you understand the flow well, it becomes easy to manage and work with.
A producer sends a message record to a topic. A topic is a category or feed name to which records are published, and it can be partitioned for better performance. Consumers subscribe to a topic and pull messages from it. When a topic is partitioned, each partition is read by one consumer instance within the group; all instances of the same consumer together are called a consumer group.
In Kafka, messages remain in the topic even after they have been consumed (how long is defined by the retention policy).
Kafka also uses sequential disk I/O, which boosts its performance and makes it a leading option for queue implementations and a safe choice for big data use cases.
Use Kafka if you need
Time travel/durable/commit log
Many consumers for the same message
High throughput
Stream processing
Replicability
High availability
Message order
Use RabbitMQ if you need:
flexible routing
Priority Queue
A standard protocol message queue
In order to select a message broker, I think this list could be really helpful.
- Supported programming languages: You probably should pick one that supports a variety of programming languages.
- Supported messaging standards: Does the message broker support any standards, such as AMQP and STOMP, or is it proprietary?
- Messaging order: Does the message broker preserve the ordering of messages?
- Delivery guarantees: What kind of delivery guarantees does the broker make?
- Persistence: Are messages persisted to disk and able to survive broker crashes?
- Durability: If a consumer reconnects to the message broker, will it receive the messages that were sent while it was disconnected?
- Scalability: How scalable is the message broker?
- Latency: What is the end-to-end latency?
- Competing consumers: Does the message broker support competing consumers?
| Kafka | RabbitMQ |
|---|---|
| It's a distributed streaming platform, working on the pub-sub model. | It's a message broker that works on the pub-sub and queue-based models. |
| No out-of-the-box support for retries and DLQ. | Supports retries and DLQ out of the box (via DLX). |
| Consumers can't filter messages specifically. | Topic exchange and headers exchange facilitate consumer-based message filtering. |
| Messages are retained until their retention period expires. | Messages are gone as soon as they are read and acknowledged. |
| Does not support scheduled or delayed message routing. | Supports scheduled and delayed routing of messages. |
| Scales horizontally. | Scales vertically. |
| Pull-based approach. | Push-based approach. |
| Supports event replay with consumer groups. | No way to replay events. |

What is the difference between pulsar and kafka in regards to consumption?

In order to consume data from Kafka, we can have multiple consumers on a topic, totally decoupled. Then what is meant by "no shared consumption" on the page (https://streaml.io/blog/pulsar-streaming-queuing) that lists the differences between Kafka and Pulsar?
In his blog, Sijie is referring to shared messaging as queuing. With queuing messaging, multiple consumers are created to receive messages from a single topic. Which consumer gets the message is completely random.
The issue with implementing this messaging pattern with Kafka lies in the way that Kafka consumers mark that they've consumed a message. Kafka consumers use what's called a high watermark for consumer offsets. That means that a consumer can only say, "I've processed up to this point" rather than, "I've processed this message."
Consider the scenario in which multiple Kafka consumers from the same consumer group were processing from the same topic partition and one of the consumers fails due to an exception while the others succeed. Because Kafka does not have a built-in way to acknowledge only a single message, and only uses a high-water mark, the failed message would be erroneously marked as consumed when in fact it failed and needs to be either reprocessed or published to an error queue, etc.
In order to avoid this situation, you would need to have just a single consumer per partition, which limits the consumption throughput of the topic. That in turn requires you to increase the number of partitions in order to meet your throughput needs.
There is a detailed explanation in this blog post
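The difference between "processed up to this point" and "processed this message" can be shown with a toy comparison. The data and helper names below are made up for illustration; offset 1 plays the role of the message whose processing failed.

```python
# Outcome of processing four messages from one partition (offset -> succeeded?).
results = {0: True, 1: False, 2: True, 3: True}


def done_by_offset_commit(results, committed):
    """With one committed offset, every offset below it counts as consumed."""
    return [offset for offset in sorted(results) if offset < committed]


def done_by_individual_ack(results):
    """With per-message acknowledgment, only successful messages count as consumed."""
    return [offset for offset, ok in sorted(results.items()) if ok]


# The consumer that handled offset 3 commits "everything up to and including 3".
committed = max(results) + 1

silently_skipped = [
    offset
    for offset in done_by_offset_commit(results, committed)
    if not results[offset]
]
redeliverable = [
    offset
    for offset in sorted(results)
    if offset not in done_by_individual_ack(results)
]
```

In the offset-commit view the failed message at offset 1 is hidden behind the commit, while in the per-message view it remains unacknowledged and can be redelivered.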

Making Kafka producer and Consumer synchronous

I have one Kafka producer and one consumer. The Kafka producer publishes to one topic, and the data is picked up and some processing is done. The Kafka consumer reads from another topic that reports whether the processing of the data from topic 1 was successful or not, i.e. topic 2 holds success or failure messages. Right now I start my consumer and then publish the data to topic 1. I want to make the producer and consumer synchronous, i.e. once the producer publishes the data, the consumer should read the success or failure message for that data, and only then should the producer proceed with the next set of data.
Apache Kafka and Publish/Subscribe messaging in general seek to decouple producers and consumers through the use of streaming async events. What you are describing is more like a batch job or a synchronous Remote Procedure Call (RPC) where the Producer and Consumer are explicitly coupled together. The standard Apache Kafka Producer/Consumer APIs do not support this Message Exchange Pattern, but you can always write your own simple wrapper on top of the Kafka APIs that uses Correlation IDs, Consumption ACKs, and Request/Response messages to make your own interface that behaves as you wish.
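A wrapper of that kind might look roughly like the sketch below. It uses two in-memory `queue.Queue` objects as stand-ins for topic 1 and topic 2 and a thread as the processing service, so it shows only the correlation-ID idea, not Kafka client code. The names `send_and_wait` and `processor` and the "negative payload fails" rule are assumptions made for the example.

```python
import queue
import threading
import uuid

requests = queue.Queue()    # stands in for topic 1 (data to process)
responses = queue.Queue()   # stands in for topic 2 (success/failure messages)


def processor():
    """Pretend processing service: reads topic 1, reports the outcome on topic 2."""
    while True:
        message = requests.get()
        if message is None:          # shutdown signal for this sketch
            break
        status = "success" if message["payload"] >= 0 else "failure"
        responses.put({"correlation_id": message["correlation_id"], "status": status})


def send_and_wait(payload, timeout=5.0):
    """Publish one request, then block until the reply with the same correlation ID arrives."""
    correlation_id = str(uuid.uuid4())
    requests.put({"correlation_id": correlation_id, "payload": payload})
    while True:
        reply = responses.get(timeout=timeout)   # raises queue.Empty on timeout
        if reply["correlation_id"] == correlation_id:
            return reply["status"]
        # Replies for other requests are simply skipped in this sketch.


worker = threading.Thread(target=processor, daemon=True)
worker.start()

statuses = [send_and_wait(payload) for payload in [1, -1, 2]]

requests.put(None)
worker.join()
```

Because `send_and_wait` blocks per message, throughput is limited to one round trip at a time, which is the trade-off of forcing a synchronous pattern onto an asynchronous system.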
Short Answer : You can't do that, Kafka doesn't provide that support.
Long Answer: As Hans explained, the publish/subscribe messaging model keeps publishers and subscribers completely unaware of each other, and I believe that is where the power of this model lies. A producer can produce without worrying about whether there is any consumer, and a consumer can consume without worrying about how many producers there are.
The closest you can get is to make your producer synchronous, which means you wait until your message has been received and acknowledged by the broker.
If you want to do that, flush after every send.

How does Kafka message processing scale in publish-subscribe mode?

All, forgive me, I am a newbie just beginning with Kafka. I have been reading the Kafka documentation about the difference between traditional messaging systems like ActiveMQ and Kafka.
As the documentation puts it,
traditional messaging systems cannot scale message processing,
since
Publish-subscribe allows you broadcast data to multiple processes, but
has no way of scaling processing since every message goes to every
subscriber.
I think this makes sense to me.
But for Kafka, the documentation says it can scale message processing even in publish-subscribe mode. (Please correct me if I am wrong. Thanks.)
The consumer group concept in Kafka generalizes these two concepts. As
with a queue the consumer group allows you to divide up processing
over a collection of processes (the members of the consumer group). As
with publish-subscribe, Kafka allows you to broadcast messages to
multiple consumer groups.
The advantage of Kafka's model is that every topic has both these
properties—it can scale processing and is also multi-subscriber—there
is no need to choose one or the other.
So my question is: how does Kafka do it? I mean scaling the processing in publish-subscribe mode. Thanks.
The main unique features in Kafka that enable scalable pub/sub are:
1. Partitioning individual topics and spreading the active partitions across multiple brokers in the cluster to take advantage of more machines, disks, and cache memory. Producers and consumers often connect to many or all nodes in the cluster, not just a single master node for a given topic/queue.
2. Storing all messages in a sequential commit log and not deleting them when consumed. This leads to more sequential reads and writes, and relieves the broker of keeping track of different copies of messages, deleting individual messages, handling fragmentation, and tracking which consumer has acknowledged consuming which messages.
3. Enabling smart parallel processing of individual consumers and consumer groups in a way that each parallel message stream can come from the distributed partitions mentioned in #1, while offloading the offset management and partition assignment logic onto the clients themselves. Kafka scales with more consumers because the consumers do some of the work (unlike most other pub/sub brokers, where the bulk of the work is done in the broker).
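The partitioning idea in point 1 can be sketched as follows. This is an in-memory illustration: `zlib.crc32` is used here only as a deterministic stand-in hash (Kafka's default partitioner uses a different hash function), and the keys and events are made up.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 3


def partition_for(key, num_partitions=NUM_PARTITIONS):
    """Map a record key to a partition; the same key always lands in the same partition."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


partitions = defaultdict(list)   # partition number -> ordered list of records
events = [
    ("user-1", "login"),
    ("user-2", "login"),
    ("user-1", "click"),
    ("user-1", "logout"),
]
for key, value in events:
    partitions[partition_for(key)].append((key, value))

# All events for one key sit in one partition, in the order they were produced.
user_1_partition = partition_for("user-1")
user_1_events = [
    value for key, value in partitions[user_1_partition] if key == "user-1"
]
total_records = sum(len(records) for records in partitions.values())
```

Since each partition is an independent ordered log, different partitions can live on different brokers and be read by different consumers in parallel, while per-key order is still preserved.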