Delay Kafka event sending, using Outbox pattern and CDC - apache-kafka

I would like to delay sending Kafka messages of a given topic by 5 minutes. To ensure the messages will eventually be sent, I need to persist them in a database before sending (Outbox pattern). Now, is there any CDC solution that provides a delayed reaction to changes in the database? Does Debezium allow such a thing?

Related

Kafka transaction management at producer end

I am looking for how Kafka behaves when the producer is running in a transaction.
I have Oracle database insert operations running in the same transaction, which are rolled back if the transaction is rolled back.
How does the Kafka producer behave in case of a transaction rollback?
Will the message be rolled back, or does Kafka not support rollback?
I know JMS messages are committed to the queue only when the transaction is committed. I am looking for a similar solution, if it is supported.
Note: the producer code is written using Spring Boot.
You are trying to update two systems:
update a record in your Oracle database
send an event to Apache Kafka
This is a challenge because you would like it to be atomic: either everything gets executed or nothing, otherwise you will end up with inconsistencies between your database and Kafka.
You might send a Kafka message even though the database transaction was rolled back.
Or the other way around (if you send the message just after the commit): you might commit the database transaction and crash (for some reason) just before sending the Kafka event.
One of the simplest solutions is to use the outbox pattern:
Let's say you want to update an order table and send an orderEvent to Kafka.
Instead of sending the event to Kafka in the same transaction,
you save it to a database table (the outbox) using the same transaction as the order update.
A separate process reads from the outbox table and makes sure the event is sent to Kafka (with at-least-once semantics).
Your consumer needs to be idempotent.
In this post, I explain in more detail how to implement this solution:
https://mirakl.tech/sending-kafka-message-in-a-transactional-way-34d6d19bb7b2
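As a rough sketch of this pattern with Spring (the question mentions Spring Boot), assuming an outbox table with id, topic, payload and sent columns; the table, topic and bean names here are illustrative, and @Scheduled needs scheduling enabled on the application:

    import org.springframework.jdbc.core.JdbcTemplate;
    import org.springframework.kafka.core.KafkaTemplate;
    import org.springframework.scheduling.annotation.Scheduled;
    import org.springframework.stereotype.Service;
    import org.springframework.transaction.annotation.Transactional;

    @Service
    public class OrderService {
        private final JdbcTemplate jdbc;
        private final KafkaTemplate<String, String> kafka;

        public OrderService(JdbcTemplate jdbc, KafkaTemplate<String, String> kafka) {
            this.jdbc = jdbc;
            this.kafka = kafka;
        }

        // 1. Update the order and record the event in the outbox, in ONE database transaction.
        @Transactional
        public void updateOrder(long orderId, String newStatus, String orderEventJson) {
            jdbc.update("UPDATE orders SET status = ? WHERE id = ?", newStatus, orderId);
            jdbc.update("INSERT INTO outbox (topic, payload, sent) VALUES (?, ?, 0)",
                    "order-events", orderEventJson);
        }

        // 2. A separate relay publishes pending outbox rows to Kafka. This is at-least-once:
        //    a crash between the send and the UPDATE only causes a resend on the next run,
        //    which is why the consumer must be idempotent.
        @Scheduled(fixedDelay = 1000)
        public void relayOutbox() {
            jdbc.query("SELECT id, topic, payload FROM outbox WHERE sent = 0 ORDER BY id", rs -> {
                long id = rs.getLong("id");
                try {
                    kafka.send(rs.getString("topic"), rs.getString("payload")).get();
                    jdbc.update("UPDATE outbox SET sent = 1 WHERE id = ?", id);
                } catch (Exception e) {
                    // leave the row unsent; it will be retried on the next run
                }
            });
        }
    }

For the 5-minute delay asked about in the question at the top, one option (not shown above) would be to give the outbox a created_at column and have the relay only pick rows older than 5 minutes, which delays the send without needing CDC.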

Using kafka for CQRS

I've been reading a lot about Kafka's use as an event store and as a potentially good candidate for CQRS.
I was wondering: since messages in Kafka have a limited retention time, how will events be replayed after the messages have been deleted from the disk where Kafka retains them?
Logically, when these messages are stored externally from Kafka (after reading them from Kafka topics) in a database (SQL/NoSQL), that would make more sense as an event store than Kafka itself.
Given the above, and assuming my understanding is correct, what is the real use case for Kafka in CQRS, when the original intent of Kafka was just a high-throughput messaging system?
You can use Kafka as an event store for CQRS. You can use Kafka Streams to process all events generated by commands, keep a snapshot of your entities in a changelog topic, and store that changelog in one or more NoSQL databases that meet your requirements. All events can also be stored in a database (PostgreSQL). What's important to know is that Kafka can be used as a store (it stores files in a highly available way) or as a message queue.
Retention time: You can set the retention time as long as you want or even keep messages forever in the topic.
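As a minimal sketch of that configuration, assuming a local broker and a topic named events (both illustrative), retention.ms can be set to -1 with the AdminClient so the broker never deletes messages by age:

    import org.apache.kafka.clients.admin.Admin;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.AlterConfigOp;
    import org.apache.kafka.clients.admin.ConfigEntry;
    import org.apache.kafka.common.config.ConfigResource;

    import java.util.List;
    import java.util.Map;
    import java.util.Properties;

    public class KeepForever {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            try (Admin admin = Admin.create(props)) {
                ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "events");
                // retention.ms = -1 disables time-based deletion for this topic
                AlterConfigOp keepForever = new AlterConfigOp(
                        new ConfigEntry("retention.ms", "-1"), AlterConfigOp.OpType.SET);
                admin.incrementalAlterConfigs(Map.of(topic, List.of(keepForever))).all().get();
            }
        }
    }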
Using Kafka as the data store: sure, you can. There is a feature named log compaction. Consider the following scenario:
Insert product with ID=10, Name=Apple, Price=10
Insert product with ID=20, Name=Orange, Price=20
Update product with ID=10, Price becomes 30
When log compaction is turned on for a topic, a background job will periodically clean up messages on that topic. This job checks whether several messages share the same key and keeps only the latest one. With the above scenario, the messages written to Kafka will have the following format:
Message 1: Key=10, Name=Apple, Price=10
Message 2: Key=20, Name=Orange, Price=20
Message 3: Key=10, Name=Apple, Price=30 (every update now includes all fields, so it is self-contained)
After the log compaction, the topic will become:
Message 1: Key=20, Name=Orange, Price=20
Message 2: Key=10, Name=Apple, Price=30 (only the latest record for Key=10 is kept)
In practice, this log compaction feature is what lets Kafka serve as persistent data storage.
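As a minimal sketch of how to get such a topic, assuming a local broker (topic name and partition count are illustrative), log compaction is enabled by creating the topic with cleanup.policy=compact:

    import org.apache.kafka.clients.admin.Admin;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.NewTopic;
    import org.apache.kafka.common.config.TopicConfig;

    import java.util.List;
    import java.util.Map;
    import java.util.Properties;

    public class CreateCompactedTopic {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            try (Admin admin = Admin.create(props)) {
                // cleanup.policy=compact keeps only the latest record per key after compaction
                NewTopic products = new NewTopic("products", 3, (short) 1)
                        .configs(Map.of(TopicConfig.CLEANUP_POLICY_CONFIG,
                                        TopicConfig.CLEANUP_POLICY_COMPACT));
                admin.createTopics(List.of(products)).all().get();
            }
        }
    }

Note that compaction runs in the background, so duplicates for a key may still be visible for a while before the cleaner removes them.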

Are SQS and Kafka same?

Are Kafka and SQS the same?
I see that both are messaging queue systems and are event-based.
Do they serve the same purpose? If not, how are they different?
Apache Kafka and Amazon SQS are both used for message streaming, but they are not the same.
Apache Kafka follows the publish/subscribe model: a producer sends an event/message to a topic, and one or more consumers subscribe to that topic to get the event/message. Within a topic you find partitions, which allow parallel streaming. There is also the concept of consumer groups: when a message is read from a partition of a topic, its offset is committed to record that it has already been read by that consumer group, which avoids inconsistent reads under concurrency. However, other consumer groups can still read that message from the partition, as illustrated in the sketch after this answer.
Amazon SQS, on the other hand, is built around queues, and a queue can be created in any AWS region. You push messages to a queue, and each message is delivered to a single consumer, which pulls it from the queue. That's why SQS is pull-based. SQS queues come in two types: FIFO and Standard.
There is another AWS service, Amazon SNS, which is publish/subscribe-based like Kafka, but SNS has no message retention policy. It is meant for instant delivery such as email, SMS, etc. It can only push a message to subscribers that are available at that moment; otherwise the message is lost. However, SQS combined with SNS can overcome this drawback. Amazon SNS with SQS is called the fanout pattern: a message published to an SNS topic is distributed to multiple SQS queues in parallel, and the SQS queues ensure persistence, because SQS has a retention policy; it can persist messages for up to 14 days (4 days by default). Amazon SQS with SNS can achieve high-throughput parallel streaming and can replace Apache Kafka in some cases.
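The consumer-group behaviour described above can be seen with the plain Java consumer. A minimal sketch, assuming a local broker and a topic named orders (both illustrative): two copies started with the same group.id split the partitions between them, while a copy started with a different group.id reads all the messages again.

    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;

    public class GroupConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "billing"); // the consumer group
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                    "org.apache.kafka.common.serialization.StringDeserializer");
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                    "org.apache.kafka.common.serialization.StringDeserializer");
            props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(List.of("orders"));
                while (true) {
                    for (ConsumerRecord<String, String> r : consumer.poll(Duration.ofSeconds(1))) {
                        System.out.printf("%s-%d@%d: %s%n",
                                r.topic(), r.partition(), r.offset(), r.value());
                    }
                    consumer.commitSync(); // records this group's progress; other groups are unaffected
                }
            }
        }
    }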
Yes, they are two messaging systems, but there is a big difference:
Kafka
Kafka is a pretty scalable system and fits high workloads where you want to send messages in batches (to get good message throughput).
A Kafka topic consists of a number of partitions which can be read completely in parallel by different consumers in one consumer group, and that gives very good performance.
For example, if you need to build a heavily loaded streaming system, Kafka is really suitable for it.
SQS
SQS is an Amazon-managed service (so you do not have to maintain the infrastructure yourself).
SQS is better for eventing, when you need some client to catch a message (event), after which the message is automatically popped out of the queue.
In my experience, SQS is not as fast as Kafka and does not fit high workloads; it is much more suitable for eventing where the number of events per second is not that high.
For example, if you want to react to an S3 file upload (to start some processing of that file), SQS is very good.
SQS and Kafka are both messaging systems.
The primary differences are :
Ordering at scale.
Kafka - within a partition, produced messages are always consumed in order, irrespective of the number of items in the queue.
SQS - "A FIFO queue looks through the first 20k messages to determine available message groups. This means that if you have a backlog of messages in a single message group, you can't consume messages from other message groups that were sent to the queue at a later time until you successfully consume the messages from the backlog"
Limit on the number of groups/topics/partitions
Kafka - although the limit is quite high, the number of topics/partitions is usually in the order of thousands (which can increase depending on the cluster size). SQS - "There is no quota to the number of message groups within a FIFO queue."
Deduplication - Kafka does not support deduplication of data in case the same data is produced multiple times. SQS tries to deduplicate messages based on the dedup ID and the deduplication interval. "Assuming that the producer receives at least one acknowledgement before the deduplication interval expires, multiple retries neither affect the ordering of messages nor introduce duplicates."
Partition management. Kafka - partitions are created and managed by the user. SQS - controls the number of partitions itself and can increase or decrease it depending on the load and usage pattern.
Dead-letter queue - Kafka does not have the concept of a DL queue (though it can be explicitly created and maintained by the user). SQS inherently supports a DL queue by itself.
Overall, to summarise the points above, we can say that SQS is meant for offloading background tasks to an async pipeline, while Kafka is much more scalable and should be used as a stream-processing pipeline.
SQS is a queue. You have a list of messages that would need to be processed by some other part of the application. A message would ideally be processed once by one processor and would be marked as processed and removed from the queue. The purpose of the queue is to coordinate and distribute the processing of messages among the different processors.
Kafka is more similar to Kinesis which is used mainly for data streaming. Messages are stored in topics for other components to read. Any component can listen to topics and/or read all messages at any time. The main purpose is to allow the efficient delivery of messages to any number of recipients and allow the continuous streaming of data between components in a dynamic and elastic way.
At a bird's-eye view, there is one main difference:
Kafka is used for the pub/sub model. If a producer sends a single message and a Kafka topic has 2 consumers (in different consumer groups), both consumers will receive the message.
SQS is more like the competing-consumers pattern. If a producer sends a message and the SQS queue has 2 consumers, only one consumer will receive the message. The other one won't get it if the first consumer has processed the message successfully. The second consumer only has a chance to receive the message if the visibility timeout expires, i.e. the first consumer was not able to process the message and delete it within the visibility timeout.
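A minimal sketch of the SQS side with the AWS SDK for Java v2, assuming an existing queue URL (illustrative): a received message stays invisible to other consumers for the visibility timeout, and only the delete call removes it for good; if the first consumer never deletes it, it becomes visible to the second consumer again.

    import software.amazon.awssdk.services.sqs.SqsClient;
    import software.amazon.awssdk.services.sqs.model.DeleteMessageRequest;
    import software.amazon.awssdk.services.sqs.model.Message;
    import software.amazon.awssdk.services.sqs.model.ReceiveMessageRequest;

    public class SqsWorker {
        public static void main(String[] args) {
            String queueUrl = "https://sqs.eu-west-1.amazonaws.com/123456789012/my-queue"; // illustrative
            try (SqsClient sqs = SqsClient.create()) {
                ReceiveMessageRequest receive = ReceiveMessageRequest.builder()
                        .queueUrl(queueUrl)
                        .maxNumberOfMessages(1)
                        .visibilityTimeout(30)   // seconds the message is hidden from other consumers
                        .waitTimeSeconds(10)     // long polling
                        .build();
                for (Message m : sqs.receiveMessage(receive).messages()) {
                    process(m.body());
                    // Deleting acknowledges successful processing; skipping this makes the
                    // message reappear after the visibility timeout (competing consumers).
                    sqs.deleteMessage(DeleteMessageRequest.builder()
                            .queueUrl(queueUrl)
                            .receiptHandle(m.receiptHandle())
                            .build());
                }
            }
        }

        private static void process(String body) {
            System.out.println("processing: " + body);
        }
    }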

Making Kafka producer and Consumer synchronous

I have one Kafka producer and one consumer. The producer publishes to one topic, the data is taken, and some processing is done. The consumer reads from another topic to learn whether the processing of the data from topic 1 was successful or not, i.e. topic 2 carries success or failure messages. Right now I start my consumer and then publish the data to topic 1. I want to make the producer and consumer synchronous: once the producer publishes the data, the consumer should read the success or failure message for that data, and only then should the producer proceed with the next set of data.
Apache Kafka and Publish/Subscribe messaging in general seek to de-couple producers and consumers through the use of streaming async events. What you are describing is more like a batch job or a synchronous Remote Procedure Call (RPC) where the Producer and Consumer are explicitly coupled together. The standard Apache Kafka Producer/Consumer APIs do not support this Message Exchange Pattern, but you can always write your own simple wrapper on top of the Kafka APIs that uses Correlation IDs, Consumption ACKs, and Request/Response messages to build an interface that behaves as you wish.
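As a rough sketch of such a wrapper on top of the plain Kafka clients, assuming hypothetical requests and replies topics and a correlationId header (all names here are illustrative, not a standard API):

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.header.Header;

    import java.nio.charset.StandardCharsets;
    import java.time.Duration;
    import java.util.List;
    import java.util.UUID;

    public class KafkaRequestReply {
        // Sends one request, then blocks until a reply carrying the same correlation id arrives.
        public static String requestAndAwaitReply(KafkaProducer<String, String> producer,
                                                  KafkaConsumer<String, String> replyConsumer,
                                                  String payload) throws Exception {
            // In real code the reply consumer should already be subscribed and assigned
            // before the request goes out, so an early reply cannot be missed.
            replyConsumer.subscribe(List.of("replies"));

            String correlationId = UUID.randomUUID().toString();
            ProducerRecord<String, String> request = new ProducerRecord<>("requests", payload);
            request.headers().add("correlationId", correlationId.getBytes(StandardCharsets.UTF_8));
            producer.send(request).get(); // block until the broker acknowledges the request

            while (true) {
                for (ConsumerRecord<String, String> reply : replyConsumer.poll(Duration.ofSeconds(1))) {
                    Header h = reply.headers().lastHeader("correlationId");
                    if (h != null && correlationId.equals(new String(h.value(), StandardCharsets.UTF_8))) {
                        return reply.value(); // the success/failure outcome for this request
                    }
                }
            }
        }
    }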
Short answer: you can't do that, Kafka doesn't provide that support.
Long answer: as Hans explained, the publish/subscribe messaging model keeps publishers and subscribers completely unaware of each other, and I believe that is where the power of this model lies. A producer can produce without worrying about whether there is any consumer, and a consumer can consume without worrying about how many producers there are.
The closest you can get is making your producer synchronous, which means waiting until your message is received and acknowledged by the broker.
If you want to do that, flush after every send (or block on the future returned by send), as sketched below.
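A minimal sketch of a synchronous send with the plain Java producer, assuming a local broker and a topic named topic1 (both illustrative): blocking on the future returned by send means the next record is only produced after the broker has acknowledged the previous one.

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.clients.producer.RecordMetadata;

    import java.util.Properties;

    public class SyncProducer {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                    "org.apache.kafka.common.serialization.StringSerializer");
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                    "org.apache.kafka.common.serialization.StringSerializer");
            props.put(ProducerConfig.ACKS_CONFIG, "all"); // wait for the full acknowledgement

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                RecordMetadata md = producer
                        .send(new ProducerRecord<>("topic1", "key", "payload"))
                        .get();          // block until the broker acks this record
                System.out.printf("written to %s-%d@%d%n", md.topic(), md.partition(), md.offset());
                producer.flush();        // alternatively, flush to drain any buffered records
            }
        }
    }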

Email notification when Kafka producer and consumer goes down

I have developed a data pipeline using Kafka. Right now I have one type of producer and two types of consumers set up in the cluster.
Producer: gets the messages from a Windows server.
Consumer: Consumer A uses Spark Streaming to transform and present a real-time view. Consumer B stores the RAW data, which might be useful for building the schema at a later stage.
For various reasons, starting with the network, the consumers may not receive any data, and it is also possible that a consumer process might die in case of a system failure.
I would be interested to know if there is a way to implement something that sends an email notification when a consumer stops receiving messages or the consumer thread dies altogether. Does Kafka or ZooKeeper provide a way of doing this?
Right now I am thinking of checking whether the target system is receiving messages or not. But in the future, if the number of targets increases, it will become really complex to write email notification systems for individual targets.