Using a Kafka consumer in order for a message to be consumed by exactly once semantics - apache-kafka

I am new to Kafka and I am seeking guidance on how to use Kafka in order to implement the following message pattern:
First, I want the message to be asynchronous and furthermore it needs to be "consumed" i.e. a single consumer should consume it and other consumers won't be able to consume it thereafter.
A use case of this message pattern is when you have multiple instances of a "delivery service" and you want only one of these instances to consume the message (this assumes one cannot leverage idempotency for some reason).
Can someone please advise how to configure the Kafka Consumer in order to achieve the above?

I think you're essentially looking to use Kafka as a traditional message queue (e.g. Rabbit MQ) where in the message gets removed after consumption. There has been quite a lot of debate on this. As it is always the case, there are merits and demerits on both sides of the fence.
The answers on this post are more or less against the idea ...
However...
This article talks about an approach on how you could possibly try and make it work. The messages won't really be deleted but the approach is quite similar. It is a fairly comprehensive post that covers the overhead and the optimisations that you could explore to make it more efficient.
I hope this helps!

Great question and its something a lot of us struggle with when deploying and using Kafka. In fact, there are a number of times where a project I was working on tried to use Kafka for the use case you described with very little success.
In a nutshell, there are a few Message Exchange Patterns that you come across when dealing with messaging:
Request->Reply
Publish/Subscribe
Queuing (which is what you are trying to do)
Without digging too deep into why, Kafka was really built simply for Publish/Subscribe. There are other products that implement the other features separately and one that actually does all three.
So a question I have for you is would you be open to using something other than Kafka for this project?

You may use spring kafka to do this. Spring Kafka takes care of lot of configurations and boiler plate code. Check example here https://www.baeldung.com/spring-kafka. This should get your started.
Also, you may need to read on how Kafka actually works. The messages that you publish to the Topics in Kafka are natively asynchronous. Your producers don't worry about who consumes it or what happens to the messages once published.
Then consumers in your delivery services should subscribe to the topics. If you want your delivery services to consume a message only once, then the consumers for your delivery services should be in the same group (same group id). Kafka takes care of making sure that the message that was consumed by one of the Consumers (in a same group) won't be available to other Consumers.
The default message retention period is seven days which is configurable in Kafka.

Related

Mixing communication methods for microservices

I am working on a project which is actually will be a better version of an old project. We want it to be scalable to be able to deal with high load. So we decided to go with microservices instead of monolithic. Then I started to do research about microservices, how they communicate, common design patterns and other things. Since I want my services to be scalable, event based communication made sense to me. So I decided to use kafka for this purpose.
We have much more services in the system but to simplify my question lets say I have 2 types of services which are work-node and master-node. I want both of them to be scalable. For now they are communicating over kafka.
My question : for a case I want to publish an event (produce a message on a topic) from master-node and get that event (consume from the topic) from all work-nodes. But for an other case I need to send a message to specific work-node. To be able to cover first case, all my work-nodes have different group ids in kafka and when a message published on a topic they all get that message. I know that I am not able to send a message to specific consumer with kafka. Since my nodes are scalable and their number can increase or decrease depending on the load, creating a topic for each node does not seem a good idea. My first solution was adding work-node id in message. So other work-nodes can ignore that message. Well it works but I don't think it is a good solution. My second solution is sending http request if I am going to send a message to specific node. But I don't know mixing 2 communication methods is a good solution.
What do you guys think about this problem. Is there a better solution that I am missing ? Or my whole design is going wrong ?
Kafka is not an appropriate technology for the use case you describe. I would recommend using Cadence Workflow which natively supports routing tasks to specific nodes as well as dozens of other features that messaging systems lack.
Feel free to join Cadence Workflow slack channel if you have specific questions.
I think you should able to. Consider regular Kafka flow. You have some consumer groups subscribed to the topic. Producer doesn't send message to specific partition until you specify.
Now think about the scenario that you produce some message based on your algorithm to the specific partitions.
Message received from A
some kind of algorithm like hashcode generated always 0 for A
Message send to Partition 0
Consumer 1 connected to Partiton 0
Only Consumer 1 gets the message coming from A

Designing a real-time data pipeline for an e-commerce web site [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 2 years ago.
Improve this question
I want to learn Apache Kafka. I read articles and documents but I could not figure out how Kafka works. There are lots of questions in my mind :( I want to create a Kafka cluster and develop some code for preparing data engineering interviews. But, I am stuck. Any help would be appreciated. I will try to explain my questions in an example scenario.
For instance, there is a popular e-commerce company. They have a huge amount of web traffic. The web site is running on AWS. The mobile applications are also using AWS services.
The marketing department wants to observe the efficiency of their advertisement actions like email, SMS etc. They also want to follow important real-time metrics (sold products, page views, active users in the last n minutes etc) in a dashboard.
First, the campaign automation system sends personalized campaign emails to target customers. When a user clicks the link in the advertisement email, the browser is opening the e-commerce web site.
On the background, the website developers should send a clickstream event to the Kafka cluster with the related parameters (like customer id, advertisement id, source_medium etc).
How can the backend developers send a message to the Kafka cluster when a user loads the web site? Should developers send a post request or get request? Are they other alternatives?
Then data engineers should direct this clickstream message to the storage layer. (for example AWS S3). Will this cause too many small files in AWS S3 buckets? May this slow down the execution of data flows?
Data engineers need to develop a data pipeline in order to do real-time analysis. Which technologies should data engineers use? (Kafka Connect, Kafka Streams, Producer and Consumer etc)
Kafka topics may have lots of messages. One message can be consumed by different consumers. A consumer reads the message from the Kafka topic. Then, another consumer can read it, even after a while. So data engineers need to manage offsets in order to consume all messages one and only one. How can they manage offsets properly?
All clickstream events should be consumed.
All clickstream events should be consumed for once. If a product view event is consumed more than once, the dashboard will not show the correct product view count.
Do developers need to manage offsets manually? Or is there any technology/way which manages offsets automatically?
Event order can be important. The marketing department wants to see the category view durations. For instance, a user views 10 books in the ebooks category. Ten events were created. User is on the same category page until his/her first action. So data engineers need to calculate the duration between the first event and the last event.
However, Kafka is a queue and there is not any order in it. Producers can send data to Kafka asynchronously. How can data engineers calculate the durations correctly?
What happens if a producer sends an event to Kafka after the total elapsed duration was calculated.
Note: View duration may fit better to content web sites. For example, Netflix marketing users want to analyze the content view durations and percentages. If a user opens a movie and watched just five minutes, the marketing department may consider that the user does not like the movie.
Thanks in advance
You have really asked several unrelated questions here. Firstly, Kafka has a lot of free documentation available for it, along with many high quality 'getting started' blocks and paid books and courses. I would definitely start there. You might still have questions, but at least you will have a better awareness of the platform and you can ask questions in a lot more focused ways, which will hopefully get a much better answer. Start with the official docs. Personally, I learned Kafka by reading the Effective Kafka book, but I'm sure there are many others.
Going through your list of questions.
How can the backend developers send a message to the Kafka cluster when a user loads the web site? Should developers send a post request or get request? Are they other alternatives?
The website would typically publish an event. This is done by opening a client connection to a set of Kafka brokers and publishing a record to some topic. You mention POST/GET requests: this is not how Kafka generally works — the clients establish persistent connections to a cluster of brokers. However, if you preferred programming model is REST, Confluent does provide a Kafka REST Proxy for this use case.
Then data engineers should direct this clickstream message to the storage layer. (for example AWS S3). Will this cause too many small files in AWS S3 buckets? May this slow down the execution of data flows?
It depends how you write to S3. You may develop a custom consumer application that stages writes in a different persistent layer and then writes to S3 in batches. Kafka Connect also has an Amazon S3 connector that moves data in chunks.
Data engineers need to develop a data pipeline in order to do real-time analysis. Which technologies should data engineers use? (Kafka Connect, Kafka Streams, Producer and Consumer etc)
There is no correct answer here. All of the technologies you have listed are valid and may be used to a similar effect. Both Connect and Streams are quite popular for this types of applications; however, you can just as easily write a custom consumer application for all your needs.
Kafka topics may have lots of messages. One message can be consumed by different consumers. A consumer reads the message from the Kafka topic. Then, another consumer can read it, even after a while. So data engineers need to manage offsets in order to consume all messages one and only one. How can they manage offsets properly?
In the simplest case, Kafka offset management is automatic and the default behaviour allows for at-least once delivery, whereby a record will be delivered again if the first processing attempt failed. This may lead to duplicate effects (counting a clickstream event twice, as you described) but this is addressed by making your consumer idempotent. This is a fairly complex topic; there is great answer on Quora that covers the issue of exactly-once delivery in detail.
Event order can be important. The marketing department wants to see the category view durations. For instance, a user views 10 books in the ebooks category. Ten events were created. User is on the same category page until his/her first action. So data engineers need to calculate the duration between the first event and the last event.
The concept of order is backed into Kafka. Kafka's topics are sharded into partitions, where each partition is a totally-ordered, unbounded stream of records. Records may be strictly ordered provided they are published to the same partition. This is achieved by assigning them the same key, which the Kafka client hashes behind the scenes to arrive at the partition index. Any two records that have the same key will occupy the same partition, and will therefore be ordered.
Welcome to stackoverflow! I will answer a few of your questions, however you should go through the Kafka documentation for such things, if you are facing any problem while implementing it, then you should post here.
How can developers send data to a Kafka cluster? You have talked about producers, but I guess you haven't read about them, the developers will have to use a producer to produce an event to a Kafka topic.You can read more about a Kafka producer in the documentation.
To direct the messages to a storage layer, Kafka consumers will be used.
Note : Kafka Connect can be used instead of Kafka producer and consumer in some scenarios, Kafka connect has source connectors and sink connectors instead of producer and consumer.
For real time data analysis, Kafka Streams or KSQL can be used. These cannot be explained in an answer, I recommend you go through the documentation.
A single Kafka topic can have multiple consumer groups, and every consumer group has a different offset, you can tweak the configuration to use or not to use these offsets for every consumer group.
You can change various configurations such as Ack = All, to guarantee at least once and at most once semantics. Again you should go through the documentation to understand this completely.
You can maintain message order in Kafka as well, for that to happen, your consumers will have to wait for the acknowledgement from Kafka after every message has been sent, obviously this will slow down the process but you will have to compromise one of the things.
I haven't understood your requirements related to the last point, but I guess you should go through Kafka Streams and KSQL documentation once, as you can manage your window size for analysis over there.
I have tried to answer most of your questions in brief but to understand it completely, obviously you will have to go through the documentation in detail.
Agree with the answers above. The questions you ask are reasonably straightforward and likely answered in the official documentation.
As per one of the replies, there are lots of excellent books and tutorials online. I recently wrote a summary of educational resources on Kafka which you might find useful.
Based on your scenario, this will be a straightforward stream processing application with an emitter and a few consumers.
The clickstream event would be published onto the Kafka cluster through a Kafka client library. It's not clear what language the website is written in, but there is likely a library available for that language. The web server connects to Kafka brokers and publishes a message every time the user performs some action of significance.
You mention that order matters. Kafka has inherent support for ordered messages. All you need to do is publish related messages with the same key, for example the username of the customer or their ID. Kafka then ensures that those messages will appear in the order that they were published.
You say that multiple consumers will be reading the same stream. This is easily achieved by giving each set of consumers a different group.id. Kafka keeps a separate set of committed offsets for each consumer group (Kafka's terminology for a related set of consumers), so that one group can process messages independently of another. For committing offsets, the easiest approach is to use the automatic offset commit mode that is enabled by default. This way records will not be committed until your consumer is finished with them, and if a consumer fails midway through processing a batch of records, those records will be redelivered.

Is Kafka suitable for running a public API?

I have an event stream that I want to publish. It's partitioned into topics, continually updates, will need to scale horizontally (and not having a SPOF is nice), and may require replaying old events in certain circumstances. All the features that seem to match Kafka's capabilities.
I want to publish this to the world through a public API that anyone can connect to and get events. Is Kafka a suitable technology for exposing as a public API?
I've read the Documentation page, but not gone any deeper yet. ACLs seem to be sensible.
My concerns
Consumers will be anywhere in the world. I can't see that being a problem seeing Kafka's architecture. The rate of messages probably won't be more than 10 per second.
Is integration with zookeeper an issue?
Are there any arguments against letting subscriber clients connect that I don't control?
Are there any arguments against letting subscriber clients connect that I don't control?
One of the issues that I would consider is possible group.id collisions.
Let's say that you have one single topic to be used by the world for consuming your messages.
Now if one of your clients has a multi-node system and wants to avoid reading the same message twice, they would set the same group.id to both nodes, forming a consumer group.
But, what if someone else in the world uses the same group.id? They would affect the first client, causing it to lose messages. There seems to be no security at that level.

Kafka consumer does not continue from where it left off

I'm currently using consumer groups to read messages from kafka. I have noticed however, that if my consumer goes down and I bring it back up again, it does not consume messages from where it left off. After reading the documentation here, it seems like I would have to implement this functionality myself. I know there's an autooffset.reset config for the consumer, but that just seems to allow me to either consume everything from the beginning, or consume from the last message currently on the queue. Is my understanding correct? That I would have to implement this myself? Or am I missing something here. It seems like a pretty basic feature that any queueing system should provide out of the box.
The version I'm using is 0.8.1.1 with scala version 2.10.
Based on the link, you're trying to use SimpleConsumer. With SimpleConsumer you need to take care of low level details like managing offsets by yourself. It is more difficult but allows to have more control on how data is consumed.
If all you want is just to read data without worrying much about low level details, take a look at HighLevelConsumer:
https://cwiki.apache.org/confluence/display/KAFKA/Consumer+Group+Example

Is Kafka ready for production use?

I have an application in production that has to process several gigabytes of messages per day. I like the Kafka architecture and performance a lot; it perfectly fits my needs.
I'd like to replace my messaging layer with Kafka at some point. Is the 0.7.1 version good enough for production use in terms of stability and consistency in performance?
It is definitely in use at several Big Data companies already, including LinkedIn, where it was created (and later open sourced), and Tumblr. Just Tumblr by itself handles many gigabytes of messages per day. I'm sure LinkedIn is way up there too. You can see a list of companies known to currently use it here:
https://cwiki.apache.org/confluence/display/KAFKA/Powered+By
Also, be sure to subscribe to their mailing list, there are lots of people actively trying it out and using it in production environments.
I'm sure it can handle whatever volume you can throw at it.
There is one critical feature I think Kafka is missing before it is ready for production.
"Flushing messages to disc if the producer can't reach any Kafka broker"
The issue has been filed a long time ago here:
https://issues.apache.org/jira/browse/KAFKA-156
This feature will makes the complete Kafka event pipline even more robust for some use-cases when the producer always has to be able to send events. For example when you track pageviews or like-button clicks and you don't want to miss any events, even if all Kafka brokers are unreachable.
I must agree with Dave, Kafka is a good tool but it missing some basic features which some can be done manually but then you need to think what Kafka provide. some missing things are:
(As Dave said) Flushing messages to disk when the producer fail to send them
Consumers ability to track which messages were handled (not just consumed) and which wasn't in case of a restart.
Monitoring - a way to receive the current status of the entities in the system like the current size of the queue in the producer or the write\read pace at the brokers (those can be done but are not part of the tool).
I have used kafka for quite sometime. Using native java and python clients would be preferred.
I had to struggle a lot finding a proper node.js client. literally re-wrote my whole code many a times using different clients as they had lot of bugs.
Finally settled with franz-kafka for node.js.
Apart from that maintaining the consumer offsets is a bit difficult. It is missing some good features like exchanges that exist in AMQP based Apache Qpid or RabbitMQ
Since it's distributed, supports offline messages and the performance is really impressive. I too preferred it :)