How to ensure exactly-once processing in Kafka for older version 0.10.0

I have a scenario where I need to call an API on a notification service, which will send an email to the user.
Based on the message that arrives on the topic, I need to consume it and call the API.
But if the consumer fails in between, then due to the at-least-once nature of the consumer configuration it's hard to avoid duplicate processing.
Sorry for the newbie question. I tried many blogs, but most of them use a newer version of Kafka, while our infrastructure still runs an older version.
How can we achieve this behavior?
Links that I referred earlier :
If exactly-once semantics are impossible, what theoretical constraint is Kafka relaxing?
I also saw the documentation where they mention storing the processing result and the offset in the same storage. But my scenario is an API call: imagine the consumer calls the API and dies before committing the offset; the next poll will get that message and call the API again. That way the email will be sent twice.
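For illustration, here is roughly the deduplication approach I was imagining: record a stable per-message ID together with the email send, and skip messages already recorded. This is only a sketch; the `DedupStore` and `NotificationClient` interfaces are hypothetical stand-ins, and I'm using the `poll(long)` signature from the 0.10 consumer API:

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class EmailConsumer {
    interface DedupStore {                      // hypothetical persistent store (e.g. a DB table)
        boolean alreadyProcessed(String messageId);
        void markProcessed(String messageId);
    }

    interface NotificationClient {              // hypothetical client for the email API
        void sendEmail(String payload);
    }

    static void pollLoop(KafkaConsumer<String, String> consumer,
                         DedupStore store, NotificationClient api) {
        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(500);
            for (ConsumerRecord<String, String> record : records) {
                // topic-partition-offset is stable across redeliveries, so it
                // works as a deduplication key after a crash and re-poll.
                String id = record.topic() + "-" + record.partition() + "-" + record.offset();
                if (store.alreadyProcessed(id)) {
                    continue;                   // already handled before the crash: skip
                }
                api.sendEmail(record.value());
                store.markProcessed(id);        // a crash between these two lines can
                                                // still duplicate, unless the send and the
                                                // mark happen atomically (one transaction)
            }
            consumer.commitSync();              // commit offsets only after processing
        }
    }
}
```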
Any help is appreciated.

Related

How to check the progress status of messages in Kafka?

I have designed a REST POST API in Java which publishes a message to a particular Kafka topic, let's say "ProductTopic".
In the background, a microservice listens to this "ProductTopic" topic, consumes the messages, and saves them to the DB. Now I would like to write a GET REST API to see the progress of the job (i.e. its output): how many messages have been successfully consumed and how many are still pending, so that the end user has an idea of what's happening.
Is there a way to achieve this? I searched a lot on Google, but all I found were command-line queries to see the consumption of messages, with no Java implementation example available from the Confluent side. Any help would be appreciated.
You should check the consumer lag for your service's consumer group. Lag is approximately endOffset - currentOffset. You can find examples here
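For example, a minimal sketch of computing that lag with the Java AdminClient (the bootstrap server and group id are assumptions, and listOffsets requires a reasonably recent client, 2.5 or later):

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.ListOffsetsResult;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

import java.util.Map;
import java.util.Properties;
import java.util.stream.Collectors;

public class ConsumerLagChecker {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumption
        try (AdminClient admin = AdminClient.create(props)) {
            // Offsets the group has committed so far ("product-consumers" is a placeholder)
            Map<TopicPartition, OffsetAndMetadata> committed =
                admin.listConsumerGroupOffsets("product-consumers")
                     .partitionsToOffsetAndMetadata().get();

            // End offsets for the same partitions
            Map<TopicPartition, OffsetSpec> latest = committed.keySet().stream()
                .collect(Collectors.toMap(tp -> tp, tp -> OffsetSpec.latest()));
            Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> ends =
                admin.listOffsets(latest).all().get();

            // Lag per partition is endOffset - committedOffset; sum it for the job total
            long totalPending = 0;
            for (TopicPartition tp : committed.keySet()) {
                long lag = ends.get(tp).offset() - committed.get(tp).offset();
                System.out.printf("%s lag=%d%n", tp, lag);
                totalPending += lag;
            }
            System.out.println("Messages still pending: " + totalPending);
        }
    }
}
```

Your GET endpoint can return that total together with the per-partition numbers to show consumed vs. pending.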

Mixing communication methods for microservices

I am working on a project which will be a better version of an old project. We want it to be scalable so it can handle high load, so we decided to go with microservices instead of a monolith. Then I started to research microservices: how they communicate, common design patterns, and other things. Since I want my services to be scalable, event-based communication made sense to me, so I decided to use Kafka for this purpose.
We have many more services in the system, but to simplify my question let's say I have two types of services: work-node and master-node. I want both of them to be scalable. For now they communicate over Kafka.
My question: in one case I want to publish an event (produce a message on a topic) from the master-node and receive that event (consume from the topic) on all work-nodes. But in another case I need to send a message to one specific work-node. To cover the first case, all my work-nodes have different group IDs in Kafka, so when a message is published on a topic they all get it. I know that I cannot address a message to a specific consumer with Kafka, and since my nodes are scalable and their number can increase or decrease depending on the load, creating a topic for each node does not seem like a good idea. My first solution was adding the work-node ID to the message so the other work-nodes can ignore it. It works, but I don't think it is a good solution. My second solution is to send an HTTP request whenever I need to reach a specific node, but I don't know whether mixing two communication methods is a good idea.
What do you think about this problem? Is there a better solution that I am missing, or is my whole design wrong?
Kafka is not an appropriate technology for the use case you describe. I would recommend using Cadence Workflow, which natively supports routing tasks to specific nodes, as well as dozens of other features that messaging systems lack.
Feel free to join Cadence Workflow slack channel if you have specific questions.
I think you should be able to. Consider the regular Kafka flow: you have some consumer groups subscribed to the topic, and the producer doesn't send a message to a specific partition unless you specify one.
Now think about a scenario where you produce messages to specific partitions based on your own algorithm, as in the sketch after this list:
1. Message received from A
2. Some algorithm (e.g. a hash code) always yields 0 for A
3. Message sent to partition 0
4. Consumer 1 is connected to partition 0
5. Only Consumer 1 gets the messages coming from A
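A minimal sketch of that keyed routing with the Java producer (the topic name "work-requests" and the node id are placeholders):

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class WorkNodeRouter {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumption
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Key by the work-node id: the default partitioner hashes the key,
            // so every message for "work-node-7" lands on the same partition.
            producer.send(new ProducerRecord<>("work-requests", "work-node-7", "do-this-task"));

            // Or pin the partition explicitly (here partition 0) if you manage
            // the node-to-partition mapping yourself.
            producer.send(new ProducerRecord<>("work-requests", 0, "work-node-7", "do-this-task"));
        }
    }
}
```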

Using a Kafka consumer so that a message is consumed with exactly-once semantics

I am new to Kafka and I am seeking guidance on how to use Kafka in order to implement the following message pattern:
First, I want the message to be asynchronous, and furthermore it needs to be "consumed": a single consumer should consume it, and other consumers won't be able to consume it thereafter.
A use case of this message pattern is when you have multiple instances of a "delivery service" and you want only one of these instances to consume the message (this assumes one cannot leverage idempotency for some reason).
Can someone please advise how to configure the Kafka Consumer in order to achieve the above?
I think you're essentially looking to use Kafka as a traditional message queue (e.g. RabbitMQ), where the message gets removed after consumption. There has been quite a lot of debate on this; as is always the case, there are merits and demerits on both sides of the fence.
The answers on this post are more or less against the idea ...
However...
This article talks about an approach on how you could possibly try and make it work. The messages won't really be deleted but the approach is quite similar. It is a fairly comprehensive post that covers the overhead and the optimisations that you could explore to make it more efficient.
I hope this helps!
Great question, and it's something a lot of us struggle with when deploying and using Kafka. In fact, a number of projects I have worked on tried to use Kafka for the use case you describe, with very little success.
In a nutshell, there are a few Message Exchange Patterns that you come across when dealing with messaging:
Request->Reply
Publish/Subscribe
Queuing (which is what you are trying to do)
Without digging too deep into why, Kafka was really built simply for Publish/Subscribe. There are other products that implement the other features separately and one that actually does all three.
So a question I have for you is would you be open to using something other than Kafka for this project?
You may use Spring Kafka to do this. Spring Kafka takes care of a lot of configuration and boilerplate code. Check out the example here: https://www.baeldung.com/spring-kafka. This should get you started.
Also, you may want to read up on how Kafka actually works. The messages that you publish to topics in Kafka are natively asynchronous: your producers don't worry about who consumes them or what happens to the messages once published.
The consumers in your delivery services should then subscribe to the topics. If you want a message to be consumed by only one delivery service instance, the consumers of your delivery services should be in the same group (same group ID). Kafka makes sure that a message consumed by one consumer in a group won't be delivered to the other consumers in that group, as sketched below.
The default message retention period is seven days which is configurable in Kafka.
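A minimal sketch of that setup (the topic name "deliveries" and the group id are placeholders): every delivery-service instance runs the same consumer code with the same group.id, and Kafka assigns each partition to only one instance at a time.

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class DeliveryServiceConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");      // assumption
        props.put("group.id", "delivery-service");             // identical on every instance
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("deliveries"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // Only this instance sees records from its assigned partitions.
                    System.out.printf("Handling delivery: %s%n", record.value());
                }
            }
        }
    }
}
```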

Does it make sense to use Apache Kafka for this Scenario?

There are several applications which have to be integrated together, and they have to exchange issues. One of them will receive the issue, do something with it, and later change the status of this issue, and the other applications involved with this issue should get the new information. This continues until the issue reaches the final status, Closed. The problem is that the issue has to be mapped, because these applications do not all support the same data format.
I'm not sure whether to always send the whole issue or just the new status as an event.
How does Kafka support data transformation?
What if my issue has an attachment? (>5 MB)
Thanks for your advice.
Yes, it does make sense.
Kafka can do transformations through both the Kafka Streams API, and KSQL which is a streaming SQL engine built on top of Kafka Streams.
Typically Kafka is used for smaller messages; one pattern to consider for larger content is to store it in an object store (e.g. S3, or similar depending on your chosen architecture) and reference a pointer to it in your Kafka message.
I'm not sure whether to always send the whole issue or just the new status as an event.
You can do this either way. If you send the whole issue, and then publish all subsequent updates to the same issue as Kafka messages that share a common message key (perhaps a unique issue ID), then you can configure your Kafka topic as a compacted topic, and the brokers will automatically delete any older copies of the data to save disk space.
If you choose to send only deltas (changes), then you need to be careful to set a retention period long enough that the initial complete record never expires while the issue is still open and receiving updates. The default retention period is 7 days.
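For the first option, a minimal sketch of creating such a compacted topic with the Java AdminClient (the topic name "issues" and broker address are assumptions):

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;

import java.util.Collections;
import java.util.Map;
import java.util.Properties;

public class CreateIssuesTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumption
        try (AdminClient admin = AdminClient.create(props)) {
            NewTopic issues = new NewTopic("issues", 3, (short) 1)
                // Compaction keeps only the latest message per key (the issue ID),
                // so older copies of the same issue are eventually deleted.
                .configs(Map.of(TopicConfig.CLEANUP_POLICY_CONFIG,
                                TopicConfig.CLEANUP_POLICY_COMPACT));
            admin.createTopics(Collections.singletonList(issues)).all().get();
        }
    }
}
```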
How does Kafka support data transformation?
Yes. In Kafka Connect via Single Message Transforms (SMT), or in Kafka Streams using native Streams code (in Java).
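For instance, a minimal Kafka Streams sketch that maps issues from one application's format to another's (the topic names and the field rename are hypothetical; a real mapper would parse and re-serialize the payload against each application's schema):

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

import java.util.Properties;

public class IssueFormatMapper {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "issue-format-mapper");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumption
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> issues = builder.stream("issues-app-a");
        issues
            // Hypothetical transformation: rename app A's "state" field to
            // app B's "status" field on the way through.
            .mapValues(value -> value.replace("\"state\":", "\"status\":"))
            .to("issues-app-b");

        new KafkaStreams(builder.build(), props).start();
    }
}
```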
What if my issue has an attachment? (>5 MB)
You can configure Kafka for large messages, but if they are much larger than 5 or 10 MB then it's usually better to follow the claim check pattern: store them external to Kafka and publish just a reference link to the externally stored data, so the consumer can retrieve the attachment out of band from Kafka.
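A minimal sketch of that claim check pattern (the ObjectStore interface is a hypothetical stand-in for S3 or similar, and the topic name "issues" is a placeholder):

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.util.UUID;

public class AttachmentPublisher {
    // Hypothetical abstraction over your object store (S3, GCS, ...).
    interface ObjectStore {
        void put(String key, byte[] data);
    }

    static void publishIssueWithAttachment(ObjectStore store,
                                           KafkaProducer<String, String> producer,
                                           String issueId, byte[] attachment) {
        // Store the large attachment outside Kafka...
        String ref = "attachments/" + issueId + "/" + UUID.randomUUID();
        store.put(ref, attachment);

        // ...and publish only a small reference to it; the consumer fetches
        // the attachment out of band using this key.
        String payload = "{\"issueId\":\"" + issueId + "\",\"attachmentRef\":\"" + ref + "\"}";
        producer.send(new ProducerRecord<>("issues", issueId, payload));
    }
}
```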

Is Kafka ready for production use?

I have an application in production that has to process several gigabytes of messages per day. I like the Kafka architecture and performance a lot; it perfectly fits my needs.
I'd like to replace my messaging layer with Kafka at some point. Is the 0.7.1 version good enough for production use in terms of stability and consistency in performance?
It is definitely in use at several big-data companies already, including LinkedIn, where it was created (and later open-sourced), and Tumblr. Tumblr alone handles many gigabytes of messages per day, and I'm sure LinkedIn is way up there too. You can see a list of companies known to currently use it here:
https://cwiki.apache.org/confluence/display/KAFKA/Powered+By
Also, be sure to subscribe to their mailing list; there are lots of people actively trying it out and using it in production environments.
I'm sure it can handle whatever volume you can throw at it.
There is one critical feature I think Kafka is missing before it is ready for production:
"Flushing messages to disk if the producer can't reach any Kafka broker"
The issue was filed a long time ago here:
https://issues.apache.org/jira/browse/KAFKA-156
This feature would make the complete Kafka event pipeline even more robust for use cases where the producer always has to be able to send events, for example when you track pageviews or like-button clicks and you don't want to miss any events even if all Kafka brokers are unreachable.
I must agree with Dave. Kafka is a good tool, but it is missing some basic features; some of them can be implemented manually, but then you have to ask what Kafka itself provides. Some of the missing things are:
(As Dave said) Flushing messages to disk when the producer fails to send them
The consumers' ability to track which messages were handled (not just consumed) and which weren't, in case of a restart
Monitoring - a way to get the current status of the entities in the system, like the current size of the queue in the producer or the read/write pace at the brokers (these can be done, but are not part of the tool)
I have used Kafka for quite some time. Using the native Java and Python clients would be preferred.
I had to struggle a lot to find a proper Node.js client, and literally rewrote my whole code many times using different clients because they had a lot of bugs. I finally settled on franz-kafka for Node.js.
Apart from that, maintaining the consumer offsets is a bit difficult. Kafka is missing some good features, like the exchanges that exist in the AMQP-based Apache Qpid or RabbitMQ.
Still, since it's distributed, supports offline messages, and the performance is really impressive, I too preferred it :)