How to read and process messages in a fault-tolerant way? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 10 months ago.
I am new to Kafka, and I am trying to understand how a Kafka Consumer can consume and process a message from a Kafka topic without losing messages if the consumer fails in between.
For Example:
There is a Kafka topic called cart_checkout which just holds an event when a user successfully checks out their cart. All the consumer will do is get an event and send an email to the user saying that their items are checked out.
This is the happy path:
Consumer gets an event from the topic.
Read offset is committed to Kafka.
Consumer calls app specific function to send email.
Repeat.
What happens if the app fails during Step 3 ?
If the consumer then starts up again, it will miss that event, am I right? (Since the read offset was already committed.)
The consumer can rewind but how will it know to rewind ?
RabbitMQ solution:
It seems easier to solve in RabbitMQ, since we can control the ACKs for a message. But in Kafka, if we commit the read offset we lose the message when the app restarts, and if we don't commit the read offset then the same message is sent to multiple consumers.
What is the best possible way to deal with this problem ?

As you mentioned, consumer offsets can also be managed manually, which you will need in this case to avoid duplicate emails (by default the delivery guarantee is at least once) or missing emails, as in your use case.
To control offsets manually, auto-commit must be disabled (enable.auto.commit = false).
Now regarding the second part of your question where you mention:
if we don't commit the read-offset then the same message is sent to multiple consumers
This understanding is not fully correct. In Kafka, each consumer controls its own offsets, and consumers in the same consumer group do not share partitions (each partition is assigned to exactly one consumer in the group), so a message will not be processed by another consumer in the same group (in Kafka, consumers poll messages; nothing is pushed to them). Only consumers in a different consumer group would also read the message, and in that case it is the intention that they read it anyway; that is by design in Kafka. It is also important to understand that a single poll call can return multiple messages, and a plain commitSync or commitAsync commits the offsets of all messages returned by that poll. If you want to avoid possible email duplication, you might want to commit more specific offsets; check the API here: https://kafka.apache.org/30/javadoc/index.html?org/apache/kafka/clients/consumer/KafkaConsumer.html
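To make this concrete, here is a minimal sketch of the process-then-commit pattern (assumptions: a local broker at localhost:9092, the cart_checkout topic from the question, a made-up group id, and a placeholder sendEmail function):

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class CheckoutEmailConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumption: local broker
        props.put("group.id", "email-sender");              // hypothetical group id
        props.put("enable.auto.commit", "false");           // take over offset management
        props.put("key.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("cart_checkout"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    sendEmail(record.value());  // process FIRST ...
                    // ... THEN commit just this record's offset (+1 = next offset to read).
                    // If the app crashes before this commit, the record is redelivered
                    // on restart instead of being lost.
                    consumer.commitSync(Collections.singletonMap(
                        new TopicPartition(record.topic(), record.partition()),
                        new OffsetAndMetadata(record.offset() + 1)));
                }
            }
        }
    }

    static void sendEmail(String checkoutEvent) { /* placeholder for app logic */ }
}
```

Committing record.offset() + 1 per record trades throughput for safety; committing once per poll batch is faster but redelivers the whole batch after a failure.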
As you're learning, this is a common misunderstanding, and there are some good free resources that clarify the concept. I suggest the design section of the official documentation (https://kafka.apache.org/documentation/#theconsumer); this free book chapter is also specific to what you're doing now: https://www.oreilly.com/library/view/kafka-the-definitive/9781491936153/ch04.html. Good luck.

Related

Kafka consumer disconnect Event

Is there any way we can detect a crash or shutdown of a consumer?
I want the Kafka server to publish an event to all Kafka clients (producers, consumers, ...) when this situation occurs.
Is it possible?
Kafka keeps track of the consumed offsets per consumer group on a special internal topic (__consumer_offsets). You could set up a dedicated "monitoring service" that constantly consumes from that internal topic and triggers any notification/alerting mechanisms as needed, so that your publishers and other consumers are notified programmatically. This other SO question has more details about that.
Depending on your use case, lag monitoring is also a really good way to know if your consumers are falling behind and/or have crashed. There are multiple solutions for that out there, or again, you could build your own to customize alerting/notification behavior.
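A sketch of build-your-own lag monitoring, assuming a broker at localhost:9092 and a hypothetical group id email-sender (uses the kafka-clients AdminClient; listOffsets needs Kafka 2.5+):

```java
import java.util.Map;
import java.util.Properties;
import java.util.stream.Collectors;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.ListOffsetsResult;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class LagMonitor {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // assumption: local broker

        try (AdminClient admin = AdminClient.create(props)) {
            // Offsets the group has committed so far.
            Map<TopicPartition, OffsetAndMetadata> committed =
                admin.listConsumerGroupOffsets("email-sender")  // hypothetical group id
                     .partitionsToOffsetAndMetadata().get();

            // Latest (log-end) offset of each partition the group reads.
            Map<TopicPartition, OffsetSpec> latestSpec = committed.keySet().stream()
                .collect(Collectors.toMap(tp -> tp, tp -> OffsetSpec.latest()));
            Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> latest =
                admin.listOffsets(latestSpec).all().get();

            // Lag per partition = log-end offset minus committed offset.
            // A steadily growing lag (or one that never shrinks) suggests a
            // stalled or crashed consumer; feed this into your alerting.
            committed.forEach((tp, om) -> System.out.printf("%s lag=%d%n",
                tp, latest.get(tp).offset() - om.offset()));
        }
    }
}
```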

Notify consumer about new events in Kafka topics

Is there a way to notify a consumer about new events being published to Kafka topics it has subscribed to while the consumer is not actively listening? I know the question itself seems confusing, but I was wondering whether it is really necessary to have one process running continuously in order to consume messages. I think the consumer process would be simpler if we knew when a message is available to read.
Consumers read messages by polling the topic, so fundamentally, you must have a process running continuously. If the consumer does not poll within the value of the property max.poll.interval.ms, the consumer will leave the group. A hallmark feature of event-driven architectures is that consumers and producers are decoupled; the consumer does not know whether the producer even exists. Therefore, there is no way to know when a message is available to read without actively polling.

Kafka operations [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 2 years ago.
Hi, I'm very new to Kafka operations. All I understand is that event data is stored in so-called topics. These topics are like logs; they are written to disk and even replicated.
What are producers and consumers? Are they essentially just parts of the application, like microservices, where one produces data and another requests data?
My question is what exactly is the difference between a conventional database and Kafka topics?
Is it just that the data type is different?
In databases, objects are stored and in topics events are stored? They are both written to hard disk?
What problem does Kafka actually solve?
There are some problems with decentralised microservices that have dependencies across services.
How does Kafka solve this problem?
Thanks everyone
First off, producers and consumers can be part of the same application. You don't need to have "microservices" to use Kafka.
one producers data and another requests data?
Yes
what exactly is the difference between a conventional database and Kafka topics?
Unclear what you consider a "conventional" database, but Kafka itself has no query capabilities nor any defined record schema. Such features are enabled by external tooling.
They are both written to hard disk?
Not all databases write to disk, but Kafka does.
What problem does Kafka actually solve?
There are use cases mentioned on the website, but the original goal was log/metric aggregation into a data lake, not intra-service communication.
But if you have a point-to-point-to-point dependency chain, you need to ensure all applications in that chain are up, whereas with a replicated log they can instead fail occasionally and pick up from where they stopped reading.
Data is stored in so called topics. These topics are like logs and are written to disk and even duplicated.
Data in Kafka is seen as events. Each event usually represents that something happened. The event is stored in a given topic on a Kafka broker. The topic can be seen as a way to organize data into categories.
What are producers and consumers?
Producers create events and submit them to Kafka brokers which then store these events in the appropriate topic. Consumers can consume from the aforementioned data, pulling the events that were created by a producer.
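A minimal producer sketch using the Java client (broker address, topic name, and payload are made-up placeholders):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class CheckoutEventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // assumption: local broker
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("acks", "all");  // wait for full replication before acknowledging

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            // Key = customer id, so all events for one customer land in one
            // partition; the broker appends the value to that partition's log.
            producer.send(
                new ProducerRecord<>("cart_checkout", "customer-42",
                    "{\"event\":\"checkout\",\"cartId\":\"c-1\"}"),
                (metadata, exception) -> {
                    if (exception != null) exception.printStackTrace();
                });
        } // close() flushes any buffered records
    }
}
```

Any consumer subscribed to cart_checkout can then pull this event at its own pace; the producer neither knows nor cares who reads it.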
My question is what exactly is the difference between a conventional database and Kafka topics?
Hard to define "conventional", but no, Kafka is not a conventional database, and you will probably often find yourself using other databases alongside it. Kafka is primarily suited to capturing real-time events and storing them in order to direct them elsewhere in real time (historical retrieval is also possible).
What problem does Kafka actually solve?
Handling anything that requires event streaming. It does so durably and provides a large amount of guarantees and flexibility in handling large amounts of data.
In conclusion: I would suggest you start by going through the first part of the documentation found at Kafka Documentation.
If you really want to dive in, you can also find a book titled Kafka: The Definitive Guide.

Designing a real-time data pipeline for an e-commerce web site [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 2 years ago.
I want to learn Apache Kafka. I read articles and documents but I could not figure out how Kafka works. There are lots of questions in my mind :( I want to create a Kafka cluster and develop some code for preparing data engineering interviews. But, I am stuck. Any help would be appreciated. I will try to explain my questions in an example scenario.
For instance, there is a popular e-commerce company. They have a huge amount of web traffic. The web site is running on AWS. The mobile applications are also using AWS services.
The marketing department wants to observe the efficiency of their advertisement actions like email, SMS etc. They also want to follow important real-time metrics (sold products, page views, active users in the last n minutes etc) in a dashboard.
First, the campaign automation system sends personalized campaign emails to target customers. When a user clicks the link in the advertisement email, the browser is opening the e-commerce web site.
On the background, the website developers should send a clickstream event to the Kafka cluster with the related parameters (like customer id, advertisement id, source_medium etc).
How can the backend developers send a message to the Kafka cluster when a user loads the web site? Should developers send a POST request or a GET request? Are there other alternatives?
Then data engineers should direct this clickstream message to the storage layer. (for example AWS S3). Will this cause too many small files in AWS S3 buckets? May this slow down the execution of data flows?
Data engineers need to develop a data pipeline in order to do real-time analysis. Which technologies should data engineers use? (Kafka Connect, Kafka Streams, Producer and Consumer etc)
Kafka topics may have lots of messages. One message can be consumed by different consumers. A consumer reads the message from the Kafka topic. Then, another consumer can read it, even after a while. So data engineers need to manage offsets in order to consume each message once and only once. How can they manage offsets properly?
All clickstream events should be consumed.
All clickstream events should be consumed exactly once. If a product view event is consumed more than once, the dashboard will not show the correct product view count.
Do developers need to manage offsets manually? Or is there any technology/way which manages offsets automatically?
Event order can be important. The marketing department wants to see the category view durations. For instance, a user views 10 books in the ebooks category. Ten events were created. User is on the same category page until his/her first action. So data engineers need to calculate the duration between the first event and the last event.
However, Kafka is a queue, and there is no guaranteed global order in it. Producers can send data to Kafka asynchronously. How can data engineers calculate the durations correctly?
What happens if a producer sends an event to Kafka after the total elapsed duration has already been calculated?
Note: View duration may fit better to content web sites. For example, Netflix marketing users want to analyze the content view durations and percentages. If a user opens a movie and watched just five minutes, the marketing department may consider that the user does not like the movie.
Thanks in advance
You have really asked several unrelated questions here. Firstly, Kafka has a lot of free documentation available, along with many high-quality 'getting started' blogs and paid books and courses. I would definitely start there. You might still have questions, but at least you will have a better awareness of the platform and can ask questions in a much more focused way, which will hopefully get much better answers. Start with the official docs. Personally, I learned Kafka by reading the Effective Kafka book, but I'm sure there are many others.
Going through your list of questions.
How can the backend developers send a message to the Kafka cluster when a user loads the web site? Should developers send a post request or get request? Are they other alternatives?
The website would typically publish an event. This is done by opening a client connection to a set of Kafka brokers and publishing a record to some topic. You mention POST/GET requests: this is not how Kafka generally works; the clients establish persistent connections to a cluster of brokers. However, if your preferred programming model is REST, Confluent does provide a Kafka REST Proxy for this use case.
Then data engineers should direct this clickstream message to the storage layer. (for example AWS S3). Will this cause too many small files in AWS S3 buckets? May this slow down the execution of data flows?
It depends how you write to S3. You may develop a custom consumer application that stages writes in a different persistent layer and then writes to S3 in batches. Kafka Connect also has an Amazon S3 connector that moves data in chunks.
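For illustration, a hypothetical Kafka Connect S3 sink configuration (property names follow Confluent's S3 sink connector; the bucket, region, and topic names are placeholders). flush.size controls how many records go into each S3 object, which is the knob that addresses the small-files concern:

```json
{
  "name": "clickstream-s3-sink",
  "config": {
    "connector.class": "io.confluent.connect.s3.S3SinkConnector",
    "topics": "clickstream",
    "tasks.max": "2",
    "s3.bucket.name": "my-clickstream-bucket",
    "s3.region": "eu-west-1",
    "storage.class": "io.confluent.connect.s3.storage.S3Storage",
    "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
    "flush.size": "10000"
  }
}
```

With flush.size set to 10000, the connector accumulates ten thousand records per partition before writing one object, rather than one tiny file per message.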
Data engineers need to develop a data pipeline in order to do real-time analysis. Which technologies should data engineers use? (Kafka Connect, Kafka Streams, Producer and Consumer etc)
There is no correct answer here. All of the technologies you have listed are valid and may be used to a similar effect. Both Connect and Streams are quite popular for these types of applications; however, you can just as easily write a custom consumer application for all your needs.
Kafka topics may have lots of messages. One message can be consumed by different consumers. A consumer reads the message from the Kafka topic. Then, another consumer can read it, even after a while. So data engineers need to manage offsets in order to consume each message once and only once. How can they manage offsets properly?
In the simplest case, Kafka offset management is automatic, and the default behaviour allows for at-least-once delivery, whereby a record will be delivered again if the first processing attempt fails. This may lead to duplicate effects (counting a clickstream event twice, as you described), but it is addressed by making your consumer idempotent. This is a fairly complex topic; there is a great answer on Quora that covers the issue of exactly-once delivery in detail.
Event order can be important. The marketing department wants to see the category view durations. For instance, a user views 10 books in the ebooks category. Ten events were created. User is on the same category page until his/her first action. So data engineers need to calculate the duration between the first event and the last event.
The concept of order is baked into Kafka. Kafka's topics are sharded into partitions, where each partition is a totally-ordered, unbounded stream of records. Records are strictly ordered provided they are published to the same partition. This is achieved by assigning them the same key, which the Kafka client hashes behind the scenes to arrive at the partition index. Any two records that have the same key will occupy the same partition and will therefore be ordered.
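Kafka's real default partitioner hashes the key bytes with murmur2; the simplified stand-in below (using String.hashCode(), purely illustrative) demonstrates the invariant that matters: equal keys always map to the same partition.

```java
public class KeyPartitioning {
    // Simplified stand-in for Kafka's default partitioner (which uses murmur2
    // over the serialized key bytes). Masking with 0x7fffffff keeps the hash
    // non-negative before taking it modulo the partition count.
    static int partitionFor(String key, int numPartitions) {
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        int partitions = 6;
        // Two events keyed by the same customer id always land in the same
        // partition, so Kafka preserves their relative order.
        int p1 = partitionFor("customer-42", partitions);
        int p2 = partitionFor("customer-42", partitions);
        System.out.println(p1 == p2); // prints true: same key -> same partition
    }
}
```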
Welcome to Stack Overflow! I will answer a few of your questions; however, you should go through the Kafka documentation for such things, and if you face a specific problem while implementing it, post it here.
How can developers send data to a Kafka cluster? You have talked about producers, but I guess you haven't read about them yet: the developers will have to use a producer to publish events to a Kafka topic. You can read more about the Kafka producer in the documentation.
To direct the messages to a storage layer, Kafka consumers will be used.
Note: Kafka Connect can be used instead of a Kafka producer and consumer in some scenarios; Kafka Connect has source connectors and sink connectors in place of producers and consumers.
For real time data analysis, Kafka Streams or KSQL can be used. These cannot be explained in an answer, I recommend you go through the documentation.
A single Kafka topic can have multiple consumer groups, and every consumer group tracks its own offsets; you can tweak the configuration for how each group commits and uses these offsets.
You can change various configurations, such as acks=all, to control the delivery guarantees (at-least-once vs. at-most-once semantics). Again, you should go through the documentation to understand this completely.
You can maintain message order in Kafka as well. For that to happen, your producers will have to wait for the acknowledgement from Kafka after each message is sent; obviously this will slow down the process, but that is the trade-off.
I haven't understood your requirements related to the last point, but I guess you should go through Kafka Streams and KSQL documentation once, as you can manage your window size for analysis over there.
I have tried to answer most of your questions in brief but to understand it completely, obviously you will have to go through the documentation in detail.
Agree with the answers above. The questions you ask are reasonably straightforward and likely answered in the official documentation.
As per one of the replies, there are lots of excellent books and tutorials online. I recently wrote a summary of educational resources on Kafka which you might find useful.
Based on your scenario, this will be a straightforward stream processing application with an emitter and a few consumers.
The clickstream event would be published onto the Kafka cluster through a Kafka client library. It's not clear what language the website is written in, but there is likely a library available for that language. The web server connects to Kafka brokers and publishes a message every time the user performs some action of significance.
You mention that order matters. Kafka has inherent support for ordered messages. All you need to do is publish related messages with the same key, for example the username of the customer or their ID. Kafka then ensures that those messages will appear in the order in which they were published.
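As an illustrative sketch (plain Java, no Kafka API): once one user's category-view events arrive in order on a single partition, the dwell-time calculation from the question reduces to the difference between the first and last event timestamps.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;

public class CategoryDwellTime {
    // viewTimestamps: one user's view events for one category, in arrival
    // order (guaranteed by publishing them all with the same key, e.g. the
    // user id, so they share a partition).
    static Duration dwellTime(List<Instant> viewTimestamps) {
        if (viewTimestamps.size() < 2) return Duration.ZERO;
        return Duration.between(viewTimestamps.get(0),
                                viewTimestamps.get(viewTimestamps.size() - 1));
    }

    public static void main(String[] args) {
        Instant start = Instant.parse("2023-01-01T10:00:00Z");
        List<Instant> views = List.of(start,
                                      start.plusSeconds(30),
                                      start.plusSeconds(90));
        System.out.println(dwellTime(views).getSeconds()); // prints 90
    }
}
```

A late event (the "what if an event arrives after the duration was calculated" case) would simply extend the window; in practice a stream processor such as Kafka Streams handles this with windowing and a grace period rather than hand-rolled logic like this.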
You say that multiple consumers will be reading the same stream. This is easily achieved by giving each set of consumers a different group.id. Kafka keeps a separate set of committed offsets for each consumer group (Kafka's terminology for a related set of consumers), so that one group can process messages independently of another. For committing offsets, the easiest approach is to use the automatic offset commit mode that is enabled by default: the offsets for a batch of records are only committed on a subsequent poll, so if a consumer fails midway through processing a batch, those records will be redelivered.

When to close a producer or consumer

Lately, we have been having some performance issues with our Kafka consumers and producers. We use the Kafka Java API from Scala. What is considered good practice with regard to opening and closing consumer and producer objects? I believe this is a quite open-ended question and the right answer is always "it depends", but I am trying to reason about it.
Can consumers be long-running connections and left open?
Should producers be closed whenever we are done producing messages?
Can consumers be long-running connections and left open?
In general, yes.
In detail: depending on your consumer configuration.
If your consumers are members of a consumer group, they should certainly be closed when no longer needed, to trigger a rebalance at the earliest possible time.
If your consumers use auto-committing of offsets, they would otherwise keep committing every auto.commit.interval.ms (5 seconds by default), possibly wasting resources.
Otherwise they can stay open, but why waste resources?
Should producers be closed whenever we are done producing messages?
In general, yes.
It depends on your design, but if you can say that at a certain point you won't be sending any more messages, then you can close the producer. That does not mean you should close and re-create a producer after every sent message.
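A common pattern is one long-lived producer per application, closed exactly once via try-with-resources, so close() (which flushes any buffered records) always runs. A sketch, with the broker address and topic as placeholders:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class BatchJob {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // assumption: local broker
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");

        // Reuse ONE producer for the whole batch; it is thread-safe and
        // batches sends internally, so per-message create/close wastes
        // connections and defeats batching.
        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 1000; i++) {
                producer.send(new ProducerRecord<>("events", "key-" + i, "value-" + i));
            }
        } // close() flushes outstanding records and releases sockets/buffers
    }
}
```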