Is Kafka suitable for running a public API? - apache-kafka

I have an event stream that I want to publish. It's partitioned into topics, continually updates, will need to scale horizontally (and not having a SPOF is nice), and may require replaying old events in certain circumstances. All of these features seem to match Kafka's capabilities.
I want to publish this to the world through a public API that anyone can connect to and get events. Is Kafka a suitable technology for exposing as a public API?
I've read the Documentation page, but not gone any deeper yet. ACLs seem to be sensible.
My concerns
Consumers will be anywhere in the world. Given Kafka's architecture, I can't see that being a problem. The rate of messages probably won't be more than 10 per second.
Is integration with ZooKeeper an issue?
Are there any arguments against letting subscriber clients connect that I don't control?

Are there any arguments against letting subscriber clients connect that I don't control?
One of the issues that I would consider is possible group.id collisions.
Let's say that you have one single topic to be used by the world for consuming your messages.
Now if one of your clients has a multi-node system and wants to avoid reading the same message twice, they would set the same group.id on both nodes, forming a consumer group.
But what if someone else in the world uses the same group.id? They would join the same consumer group and receive a share of the messages, so the first client would miss those messages. There seems to be no security at that level.
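To make the collision concrete, here is a minimal sketch in plain Python. It does not talk to Kafka; `assign_partitions` is a hypothetical round-robin stand-in for the group's partition assignment, and the member names are made up for illustration.

```python
from collections import defaultdict

def assign_partitions(partitions, members):
    """Spread partitions round-robin over the sorted members of one group."""
    assignment = defaultdict(list)
    ordered = sorted(members)
    for i, partition in enumerate(partitions):
        assignment[ordered[i % len(ordered)]].append(partition)
    return dict(assignment)

def partitions_of_client_a(assignment):
    """Collect every partition owned by one of client A's nodes."""
    return sorted(
        p
        for member, owned in assignment.items()
        if member.startswith("clientA")
        for p in owned
    )

partitions = [0, 1, 2, 3]

# Client A runs two nodes under one group.id.
before = assign_partitions(partitions, ["clientA-node1", "clientA-node2"])

# An unrelated client picks the same group.id and joins the same group.
after = assign_partitions(
    partitions, ["clientA-node1", "clientA-node2", "stranger-node1"]
)

client_a_before = partitions_of_client_a(before)
client_a_after = partitions_of_client_a(after)
print(client_a_before)  # all four partitions
print(client_a_after)   # one partition now goes to the stranger
```

In this toy model client A silently stops seeing one partition as soon as the stranger joins, which is the failure mode described above.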

Related

Mixing communication methods for microservices

I am working on a project that will be a better version of an old project. We want it to be scalable so that it can deal with high load, so we decided to go with microservices instead of a monolith. Then I started researching microservices: how they communicate, common design patterns and other things. Since I want my services to be scalable, event-based communication made sense to me, so I decided to use Kafka for this purpose.
We have many more services in the system, but to simplify my question let's say I have two types of services: work-node and master-node. I want both of them to be scalable. For now they communicate over Kafka.
My question: in one case I want to publish an event (produce a message on a topic) from the master-node and receive that event (consume from the topic) on all work-nodes. In another case I need to send a message to a specific work-node. To cover the first case, all my work-nodes have different group IDs in Kafka, so when a message is published on a topic they all get it. I know that I cannot send a message to a specific consumer with Kafka. Since my nodes are scalable and their number can increase or decrease depending on the load, creating a topic for each node does not seem like a good idea. My first solution was adding the work-node ID to the message, so that other work-nodes can ignore it. It works, but I don't think it is a good solution. My second solution is sending an HTTP request when I need to message a specific node, but I don't know whether mixing two communication methods is a good idea.
What do you think about this problem? Is there a better solution that I am missing? Or is my whole design going wrong?
Kafka is not an appropriate technology for the use case you describe. I would recommend using Cadence Workflow which natively supports routing tasks to specific nodes as well as dozens of other features that messaging systems lack.
Feel free to join Cadence Workflow slack channel if you have specific questions.
I think you should be able to. Consider the regular Kafka flow. You have some consumer groups subscribed to the topic. The producer doesn't send a message to a specific partition unless you specify one.
Now think about a scenario in which you produce messages to specific partitions based on your own algorithm:
Message received from A
Some kind of algorithm, like a hash code, always generates 0 for A
Message sent to Partition 0
Consumer 1 is connected to Partition 0
Only Consumer 1 gets the message coming from A
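The steps above can be sketched roughly as follows. This is plain Python, not a Kafka client: CRC32 is only a stand-in for whatever hash the real producer uses, and the partition-to-consumer mapping is a hypothetical static assignment.

```python
import zlib

NUM_PARTITIONS = 3

def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a key to a partition with a stable hash (CRC32 is a stand-in)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

# Hypothetical static assignment: exactly one consumer per partition.
consumer_for_partition = {0: "consumer-1", 1: "consumer-2", 2: "consumer-3"}

def route(key: str) -> str:
    """Return the consumer that would receive a message with this key."""
    return consumer_for_partition[partition_for(key)]

# Every message keyed "A" lands on the same partition, hence the same consumer.
targets = {route("A") for _ in range(5)}
print(targets)
```

Note that this only holds while the assignment of partitions to consumers stays fixed. If nodes are added or removed, a rebalance can move a partition to a different consumer.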

Designing a real-time data pipeline for an e-commerce web site [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 2 years ago.
I want to learn Apache Kafka. I read articles and documents but I could not figure out how Kafka works. There are lots of questions in my mind :( I want to create a Kafka cluster and develop some code for preparing data engineering interviews. But, I am stuck. Any help would be appreciated. I will try to explain my questions in an example scenario.
For instance, there is a popular e-commerce company. They have a huge amount of web traffic. The web site is running on AWS. The mobile applications are also using AWS services.
The marketing department wants to observe the efficiency of their advertisement actions like email, SMS etc. They also want to follow important real-time metrics (sold products, page views, active users in the last n minutes etc) in a dashboard.
First, the campaign automation system sends personalized campaign emails to target customers. When a user clicks the link in the advertisement email, the browser is opening the e-commerce web site.
On the background, the website developers should send a clickstream event to the Kafka cluster with the related parameters (like customer id, advertisement id, source_medium etc).
How can the backend developers send a message to the Kafka cluster when a user loads the web site? Should developers send a POST request or a GET request? Are there other alternatives?
Then data engineers should direct this clickstream message to the storage layer. (for example AWS S3). Will this cause too many small files in AWS S3 buckets? May this slow down the execution of data flows?
Data engineers need to develop a data pipeline in order to do real-time analysis. Which technologies should data engineers use? (Kafka Connect, Kafka Streams, Producer and Consumer etc)
Kafka topics may have lots of messages. One message can be consumed by different consumers. A consumer reads the message from the Kafka topic. Then, another consumer can read it, even after a while. So data engineers need to manage offsets in order to consume all messages one and only one. How can they manage offsets properly?
All clickstream events should be consumed.
All clickstream events should be consumed for once. If a product view event is consumed more than once, the dashboard will not show the correct product view count.
Do developers need to manage offsets manually? Or is there any technology/way which manages offsets automatically?
Event order can be important. The marketing department wants to see the category view durations. For instance, a user views 10 books in the ebooks category. Ten events were created. User is on the same category page until his/her first action. So data engineers need to calculate the duration between the first event and the last event.
However, Kafka is a queue and there is not any order in it. Producers can send data to Kafka asynchronously. How can data engineers calculate the durations correctly?
What happens if a producer sends an event to Kafka after the total elapsed duration was calculated?
Note: View duration may fit better to content web sites. For example, Netflix marketing users want to analyze the content view durations and percentages. If a user opens a movie and watched just five minutes, the marketing department may consider that the user does not like the movie.
Thanks in advance
You have really asked several unrelated questions here. Firstly, Kafka has a lot of free documentation available for it, along with many high-quality 'getting started' blogs and paid books and courses. I would definitely start there. You might still have questions, but at least you will have a better awareness of the platform and you can ask questions in a much more focused way, which will hopefully get you much better answers. Start with the official docs. Personally, I learned Kafka by reading the Effective Kafka book, but I'm sure there are many others.
Going through your list of questions.
How can the backend developers send a message to the Kafka cluster when a user loads the web site? Should developers send a POST request or a GET request? Are there other alternatives?
The website would typically publish an event. This is done by opening a client connection to a set of Kafka brokers and publishing a record to some topic. You mention POST/GET requests: this is not how Kafka generally works; the clients establish persistent connections to a cluster of brokers. However, if your preferred programming model is REST, Confluent does provide a Kafka REST Proxy for this use case.
Then data engineers should direct this clickstream message to the storage layer. (for example AWS S3). Will this cause too many small files in AWS S3 buckets? May this slow down the execution of data flows?
It depends how you write to S3. You may develop a custom consumer application that stages writes in a different persistent layer and then writes to S3 in batches. Kafka Connect also has an Amazon S3 connector that moves data in chunks.
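As a rough illustration of the batching idea, here is a minimal sketch in plain Python. `BatchingSink` and `write_batch` are hypothetical names; in a real pipeline `write_batch` would upload one object to S3, whereas here it just appends to a list so the example is self-contained.

```python
from typing import Callable, List

class BatchingSink:
    """Buffer records and hand them to `write_batch` in chunks."""

    def __init__(self, write_batch: Callable[[List[str]], None], max_records: int):
        self.write_batch = write_batch
        self.max_records = max_records
        self.buffer: List[str] = []

    def add(self, record: str) -> None:
        """Buffer one record and flush once the batch is full."""
        self.buffer.append(record)
        if len(self.buffer) >= self.max_records:
            self.flush()

    def flush(self) -> None:
        """Write whatever is buffered as a single batch."""
        if self.buffer:
            self.write_batch(self.buffer)
            self.buffer = []

written_objects: List[List[str]] = []  # stands in for objects uploaded to S3
sink = BatchingSink(written_objects.append, max_records=4)
for i in range(10):
    sink.add(f"click-{i}")
sink.flush()  # write the final partial batch

print([len(batch) for batch in written_objects])
```

Ten events end up in three objects instead of ten. A real implementation would probably also flush on a timer, so that a quiet period does not hold records back indefinitely.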
Data engineers need to develop a data pipeline in order to do real-time analysis. Which technologies should data engineers use? (Kafka Connect, Kafka Streams, Producer and Consumer etc)
There is no single correct answer here. All of the technologies you have listed are valid and may be used to similar effect. Both Connect and Streams are quite popular for these types of applications; however, you can just as easily write a custom consumer application for all your needs.
Kafka topics may have lots of messages. One message can be consumed by different consumers. A consumer reads the message from the Kafka topic. Then, another consumer can read it, even after a while. So data engineers need to manage offsets in order to consume all messages one and only one. How can they manage offsets properly?
In the simplest case, Kafka offset management is automatic and the default behaviour allows for at-least-once delivery, whereby a record will be delivered again if the first processing attempt failed. This may lead to duplicate effects (counting a clickstream event twice, as you described), but this is addressed by making your consumer idempotent. This is a fairly complex topic; there is a great answer on Quora that covers the issue of exactly-once delivery in detail.
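One common way to make a consumer idempotent is to remember which event IDs have already been applied. The sketch below assumes each event carries a unique `event_id` field, which is an assumption on my part and not something Kafka provides for you.

```python
from collections import Counter

class IdempotentViewCounter:
    """Count product views, ignoring events whose ID was already processed."""

    def __init__(self):
        self.seen_event_ids = set()
        self.views = Counter()

    def handle(self, event: dict) -> bool:
        """Apply the event once; return False for a duplicate delivery."""
        if event["event_id"] in self.seen_event_ids:
            return False
        self.seen_event_ids.add(event["event_id"])
        self.views[event["product_id"]] += 1
        return True

counter = IdempotentViewCounter()
deliveries = [
    {"event_id": "e1", "product_id": "book-1"},
    {"event_id": "e2", "product_id": "book-1"},
    {"event_id": "e2", "product_id": "book-1"},  # redelivered after a failure
    {"event_id": "e3", "product_id": "book-2"},
]
for event in deliveries:
    counter.handle(event)

print(dict(counter.views))
```

In a real system the set of seen IDs would have to live in durable storage and be updated together with the count; an in-memory set is only enough to show the idea.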
Event order can be important. The marketing department wants to see the category view durations. For instance, a user views 10 books in the ebooks category. Ten events were created. User is on the same category page until his/her first action. So data engineers need to calculate the duration between the first event and the last event.
The concept of order is baked into Kafka. Kafka's topics are sharded into partitions, where each partition is a totally-ordered, unbounded stream of records. Records may be strictly ordered provided they are published to the same partition. This is achieved by assigning them the same key, which the Kafka client hashes behind the scenes to arrive at the partition index. Any two records that have the same key will occupy the same partition, and will therefore be ordered.
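Here is a small model of that behaviour in plain Python. Partitions are simulated as append-only lists and CRC32 stands in for the client's hash function, so treat it as an illustration of the principle and not of the real client.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 4

def partition_for(key: str) -> int:
    """Stable key hash (CRC32 as a stand-in) reduced to a partition index."""
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS

# partition index -> append-only list of (key, value) records
partition_logs = defaultdict(list)

def publish(key: str, value: str) -> None:
    """Append the record to the partition chosen by its key."""
    partition_logs[partition_for(key)].append((key, value))

publish("user-1", "view book-1")
publish("user-2", "view phone-7")
publish("user-1", "view book-2")
publish("user-1", "view book-3")

# All of user-1's events sit in one partition, in publication order.
user1_partition = partition_for("user-1")
user1_events = [v for k, v in partition_logs[user1_partition] if k == "user-1"]
print(user1_events)
```

Events of different users may interleave within a partition, but the relative order of one user's events is preserved, which is what a per-user duration calculation needs.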
Welcome to Stack Overflow! I will answer a few of your questions. However, you should go through the Kafka documentation for such things, and post here if you run into a problem while implementing it.
How can developers send data to a Kafka cluster? You have talked about producers, but I guess you haven't read about them. The developers will have to use a producer to produce an event to a Kafka topic. You can read more about the Kafka producer in the documentation.
To direct the messages to a storage layer, Kafka consumers will be used.
Note: Kafka Connect can be used instead of a Kafka producer and consumer in some scenarios. Kafka Connect has source connectors and sink connectors instead of producers and consumers.
For real time data analysis, Kafka Streams or KSQL can be used. These cannot be explained in an answer, I recommend you go through the documentation.
A single Kafka topic can have multiple consumer groups, and every consumer group has its own offsets. You can tweak the configuration to use or not to use these offsets for every consumer group.
You can change various configurations, such as the producer's acks=all setting, to choose between at-least-once and at-most-once semantics. Again, you should go through the documentation to understand this completely.
You can maintain message order in Kafka as well. For that to happen, your producers will have to wait for the acknowledgement from Kafka after every message has been sent. Obviously this will slow down the process, but you will have to compromise on one of the two.
I haven't understood your requirements related to the last point, but I guess you should go through Kafka Streams and KSQL documentation once, as you can manage your window size for analysis over there.
I have tried to answer most of your questions in brief but to understand it completely, obviously you will have to go through the documentation in detail.
Agree with the answers above. The questions you ask are reasonably straightforward and likely answered in the official documentation.
As per one of the replies, there are lots of excellent books and tutorials online. I recently wrote a summary of educational resources on Kafka which you might find useful.
Based on your scenario, this will be a straightforward stream processing application with an emitter and a few consumers.
The clickstream event would be published onto the Kafka cluster through a Kafka client library. It's not clear what language the website is written in, but there is likely a library available for that language. The web server connects to Kafka brokers and publishes a message every time the user performs some action of significance.
You mention that order matters. Kafka has inherent support for ordered messages. All you need to do is publish related messages with the same key, for example the username of the customer or their ID. Kafka then ensures that those messages will appear in the order that they were published.
You say that multiple consumers will be reading the same stream. This is easily achieved by giving each set of consumers a different group.id. Kafka keeps a separate set of committed offsets for each consumer group (Kafka's terminology for a related set of consumers), so that one group can process messages independently of another. For committing offsets, the easiest approach is to use the automatic offset commit mode that is enabled by default. This way records will not be committed until your consumer is finished with them, and if a consumer fails midway through processing a batch of records, those records will be redelivered.
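To illustrate both points (independent offsets per group, and redelivery after a failure), here is a toy model in plain Python. It is modelled as commit-after-processing, which is the behaviour the paragraph describes; the names `committed` and `poll_and_process` are hypothetical and no Kafka client is involved.

```python
log = ["m0", "m1", "m2", "m3", "m4"]         # one partition of a topic
committed = {"dashboard": 0, "archiver": 0}  # committed offset per group.id

def poll_and_process(group_id, handler, batch_size=3):
    """Process one batch; commit only if every record in it succeeded."""
    start = committed[group_id]
    batch = log[start:start + batch_size]
    for record in batch:
        handler(record)  # an exception here leaves the offset uncommitted
    committed[group_id] = start + len(batch)
    return batch

processed = []
fail_once = {"m1"}

def flaky_handler(record):
    """Fail the first time "m1" is seen, succeed afterwards."""
    if record in fail_once:
        fail_once.discard(record)
        raise RuntimeError(f"failed on {record}")
    processed.append(record)

try:
    poll_and_process("dashboard", flaky_handler)
except RuntimeError:
    pass  # simulated crash midway through the batch

poll_and_process("dashboard", flaky_handler)  # restart: batch is redelivered
poll_and_process("archiver", lambda record: None, batch_size=2)

print(processed)
print(committed)
```

The "dashboard" group sees "m0" twice because the first batch was never committed, while the "archiver" group advances on its own schedule. That duplicate is why at-least-once delivery is usually paired with an idempotent consumer.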

Pub/Sub and consumer aware publishing. Stop producing when nobody is subscribed

I'm trying to find a messaging system that supports the following use case.
Producer registers topic namespace
client subscribes to topic
first client triggers notification on producer to start producing
new client with subscription to the same topic receives data (potentially conflated, similar to hot/cold observables in RX world)
When the last client goes away, unsubscribe or crash, notify the producer to stop producing to said topic
I am aware that, according to the pub/sub pattern, a producer is defined to be blissfully unaware of the existence of consumers, meaning that my use case simply does not fit the pub/sub paradigm.
So far I have looked into Kafka, Redis, NATS.io and Amazon SQS, but without much success. I've been thinking about a few possible ways to solve this, but haven't found anything that satisfies my needs yet.
One option that springs to mind, for bullet 2), is to model a request/reply pattern, as detailed (among others) on the NATS page, to have the producer listen to clients. A client would then publish a 'subscribe' message into the system that the producer would pick up on a wildcard subscription. This however leaves one big problem, which is unsubscribing. Assuming the consumer stops as it should, publishing an unsubscribe message just like the subscribe would work. But in the case of a crash or similar, this won't work.
I'd be grateful for any ideas, references or architectural patterns/best practices that satisfy the above.
I've been doing quite a bit of research over the past week but haven't come across any satisfying Q&A or articles. Either I'm approaching it entirely wrong, or there just doesn't seem to be much out there which would surprise me as to me, this appears to be a fairly common scenario that applies to many domains.
thanks in advance
Chris
//edit
An actual simple use-case that I have at hand is stock quote distribution.
Quotes come from external source
subscribe to stock A quotes from external system when the first end-user looks at stock A
Stop receiving quotes for stock A from external system when no more end-users look at said stock
RabbitMQ has internal events that you can use with the Event Exchange Plugin. Events such as consumer.created or consumer.deleted could be used to trigger actions at your server level: for example, checking the remaining number of consumers using the RabbitMQ Management API and taking an action, such as closing a topic, based on your use case.
I don't know of any messaging system with consumer-presence-based publishing. It gets even worse because you'll need some kind of heartbeat mechanism to handle consumer crashes.
So here are my two cents. I'm not sure if you're looking for an out-of-the-box solution, but if not, you could wrap your application around a ZooKeeper cluster to handle all your use cases.
Simply use watchers on ephemeral nodes to detect when you have no more consumers (including crashes), and put a watcher on a 'consumers' path to be notified when you get consumers.
On the consumer side, you would have to register your zk node ID whenever a consumer starts.
It's not so complicated to do, and zk is not the only solution for this; you might use other consensus techs as well.
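To show the shape of the idea without a ZooKeeper cluster, here is a toy registry in plain Python. It imitates ephemeral nodes with leases that vanish unless renewed; `SubscriberRegistry` and its callbacks are hypothetical names, and none of this is the ZooKeeper or Curator API.

```python
class SubscriberRegistry:
    """Toy stand-in for ephemeral nodes: a lease disappears unless renewed."""

    def __init__(self, ttl, on_first, on_last):
        self.ttl = ttl
        self.on_first = on_first  # called when a topic gains its first subscriber
        self.on_last = on_last    # called when a topic loses its last subscriber
        self.leases = {}          # (topic, client_id) -> expiry time

    def _count(self, topic):
        return sum(1 for (t, _client) in self.leases if t == topic)

    def heartbeat(self, topic, client_id, now):
        """Register a subscriber, or renew its lease if already registered."""
        if self._count(topic) == 0:
            self.on_first(topic)
        self.leases[(topic, client_id)] = now + self.ttl

    def expire(self, now):
        """Drop leases that were not renewed in time (crashed or gone clients)."""
        dead = [key for key, expiry in self.leases.items() if expiry <= now]
        for key in dead:
            del self.leases[key]
        for topic in {t for t, _client in dead}:
            if self._count(topic) == 0:
                self.on_last(topic)

events = []
registry = SubscriberRegistry(
    ttl=10,
    on_first=lambda topic: events.append(("start", topic)),
    on_last=lambda topic: events.append(("stop", topic)),
)
registry.heartbeat("stock-A", "alice", now=0)  # first subscriber: start producing
registry.heartbeat("stock-A", "bob", now=5)
registry.heartbeat("stock-A", "bob", now=12)   # bob renews; alice has crashed
registry.expire(now=12)                        # alice's lease lapses, bob remains
registry.expire(now=30)                        # bob's lease lapses: stop producing

print(events)
```

The producer starts on the first registration and stops only when the last lease has lapsed, so a crashed client is handled the same way as a clean unsubscribe. With ZooKeeper the session timeout plays the role of the lease here.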
A starting point for ZooKeeper:
https://zookeeper.apache.org/doc/r3.1.2/zookeeperStarted.html
(I strongly advise using the Curator API, which handles a lot of recipes smoothly.)
Yannick
Unfortunately you haven't specified the business use case that you are trying to solve with these requirements. From the sound of it, you don't want a pub/sub system but an orchestration one.
I would recommend checking out Cadence Workflow, which is capable of supporting your listed requirements and many more orchestration use cases.
Here is a strawman design that satisfies your requirements:
Any new subscriber sends an event to a workflow, with the TopicName as the workflowID, to subscribe. If a workflow with the given ID doesn't exist, it is automatically started.
Any subscriber sends another signal to unsubscribe.
When no subscribers are left, the workflow exits.
The publisher sends an event to the workflow to deliver to subscribers.
The workflow delivers the event to the subscribers using an activity.
If a workflow with the given TopicName isn't running, publishing an event to it is going to fail.
Cadence offers a lot of other advantages over using queues for task processing.
Built-in exponential retries with unlimited expiration interval
Failure handling. For example, it allows executing a task that notifies another service if both updates couldn't succeed during a configured interval.
Support for long-running heartbeating operations
Ability to implement complex task dependencies. For example, to implement chaining of calls or compensation logic in case of unrecoverable failures (SAGA)
Complete visibility into the current state of the update. For example, when using queues all you know is whether there are some messages in a queue, and you need an additional DB to track the overall progress. With Cadence every event is recorded.
Ability to cancel an update in flight.
Distributed CRON support
See the presentation that goes over Cadence programming model.

Using a Kafka consumer in order for a message to be consumed by exactly once semantics

I am new to Kafka and I am seeking guidance on how to use Kafka in order to implement the following message pattern:
First, I want the message to be asynchronous and furthermore it needs to be "consumed" i.e. a single consumer should consume it and other consumers won't be able to consume it thereafter.
A use case of this message pattern is when you have multiple instances of a "delivery service" and you want only one of these instances to consume the message (this assumes one cannot leverage idempotency for some reason).
Can someone please advise how to configure the Kafka Consumer in order to achieve the above?
I think you're essentially looking to use Kafka as a traditional message queue (e.g. RabbitMQ), wherein the message gets removed after consumption. There has been quite a lot of debate on this. As is always the case, there are merits and demerits on both sides of the fence.
The answers on this post are more or less against the idea ...
However...
This article talks about an approach on how you could possibly try and make it work. The messages won't really be deleted but the approach is quite similar. It is a fairly comprehensive post that covers the overhead and the optimisations that you could explore to make it more efficient.
I hope this helps!
Great question, and it's something a lot of us struggle with when deploying and using Kafka. In fact, a number of times a project I was working on tried to use Kafka for the use case you described, with very little success.
In a nutshell, there are a few Message Exchange Patterns that you come across when dealing with messaging:
Request->Reply
Publish/Subscribe
Queuing (which is what you are trying to do)
Without digging too deep into why, Kafka was really built simply for Publish/Subscribe. There are other products that implement the other features separately and one that actually does all three.
So a question I have for you is would you be open to using something other than Kafka for this project?
You may use Spring Kafka to do this. Spring Kafka takes care of a lot of configuration and boilerplate code. Check the example here: https://www.baeldung.com/spring-kafka. This should get you started.
Also, you may need to read up on how Kafka actually works. The messages that you publish to topics in Kafka are natively asynchronous. Your producers don't worry about who consumes them or what happens to the messages once published.
Then the consumers in your delivery services should subscribe to the topics. If you want your delivery services to consume a message only once, then the consumers for your delivery services should be in the same group (same group ID). Kafka makes sure that a message consumed by one of the consumers in a group won't be delivered to the other consumers in that group.
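A rough model of why this works, in plain Python: within one group, each partition is owned by exactly one member, so each message reaches exactly one instance. The instance names and the modulo assignment are made up for illustration and are not Kafka's actual assignor.

```python
NUM_PARTITIONS = 6
instances = ["delivery-1", "delivery-2", "delivery-3"]  # all share one group ID

# Within a group, each partition is owned by exactly one member.
owner = {p: instances[p % len(instances)] for p in range(NUM_PARTITIONS)}

# (payload, partition) pairs as they would sit in the topic
messages = [(f"order-{i}", i % NUM_PARTITIONS) for i in range(12)]

received_by = {name: [] for name in instances}
for payload, partition in messages:
    received_by[owner[partition]].append(payload)

all_received = [p for items in received_by.values() for p in items]
print({name: len(items) for name, items in received_by.items()})
```

Keep in mind that this describes the normal case only. If an instance fails before its offsets are committed, its messages can be redelivered to another member of the group, so strict exactly-once handling still needs extra care.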
The default message retention period is seven days, which is configurable in Kafka.

Is Kafka ready for production use?

I have an application in production that has to process several gigabytes of messages per day. I like the Kafka architecture and performance a lot; it perfectly fits my needs.
I'd like to replace my messaging layer with Kafka at some point. Is the 0.7.1 version good enough for production use in terms of stability and consistency in performance?
It is definitely in use at several Big Data companies already, including LinkedIn, where it was created (and later open sourced), and Tumblr. Just Tumblr by itself handles many gigabytes of messages per day. I'm sure LinkedIn is way up there too. You can see a list of companies known to currently use it here:
https://cwiki.apache.org/confluence/display/KAFKA/Powered+By
Also, be sure to subscribe to their mailing list, there are lots of people actively trying it out and using it in production environments.
I'm sure it can handle whatever volume you can throw at it.
There is one critical feature I think Kafka is missing before it is ready for production.
"Flushing messages to disc if the producer can't reach any Kafka broker"
The issue has been filed a long time ago here:
https://issues.apache.org/jira/browse/KAFKA-156
This feature will make the complete Kafka event pipeline even more robust for use cases where the producer always has to be able to send events. For example, when you track pageviews or like-button clicks and you don't want to miss any events, even if all Kafka brokers are unreachable.
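Until something like that exists in the client, it can be approximated in application code. Below is a minimal sketch in plain Python; `SpoolingProducer` is a hypothetical wrapper, and `send` stands for whatever call publishes to Kafka and is assumed to raise ConnectionError when no broker is reachable.

```python
import json
import os
import tempfile

class SpoolingProducer:
    """Send events through `send`; append them to a local file when it fails."""

    def __init__(self, send, spool_path):
        self.send = send  # hypothetical callable that raises ConnectionError
        self.spool_path = spool_path

    def publish(self, event):
        """Try to send; on a connection failure, keep the event on disk."""
        try:
            self.send(event)
        except ConnectionError:
            with open(self.spool_path, "a", encoding="utf-8") as spool:
                spool.write(json.dumps(event) + "\n")

    def replay(self):
        """Resend spooled events once the brokers are reachable again."""
        if not os.path.exists(self.spool_path):
            return 0
        with open(self.spool_path, encoding="utf-8") as spool:
            events = [json.loads(line) for line in spool if line.strip()]
        for event in events:
            self.send(event)
        os.remove(self.spool_path)
        return len(events)

delivered = []
state = {"brokers_up": False}

def fake_send(event):
    """Stand-in for a real producer call."""
    if not state["brokers_up"]:
        raise ConnectionError("no broker reachable")
    delivered.append(event)

spool_file = os.path.join(tempfile.mkdtemp(), "events.spool")
producer = SpoolingProducer(fake_send, spool_file)
producer.publish({"type": "pageview", "page": "/home"})
producer.publish({"type": "like", "item": 42})

state["brokers_up"] = True
replayed = producer.replay()
print(replayed, delivered)
```

If a replay is interrupted halfway, the spool file is kept and some events would be sent again on the next attempt, so downstream consumers should tolerate duplicates.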
I must agree with Dave. Kafka is a good tool, but it is missing some basic features. Some of them can be built manually, but then you need to ask what Kafka itself provides. Some missing things are:
(As Dave said) Flushing messages to disk when the producer fails to send them
The ability for consumers to track which messages were handled (not just consumed) and which weren't, in case of a restart.
Monitoring: a way to receive the current status of the entities in the system, like the current size of the queue in the producer or the write/read pace at the brokers (these can be done but are not part of the tool).
I have used Kafka for quite some time. Using the native Java and Python clients would be preferred.
I had to struggle a lot to find a proper Node.js client. I literally rewrote my whole code many times using different clients, as they had a lot of bugs.
I finally settled on franz-kafka for Node.js.
Apart from that, maintaining the consumer offsets is a bit difficult. It is missing some good features, like exchanges, that exist in the AMQP-based Apache Qpid or RabbitMQ.
It's distributed, supports offline messages, and the performance is really impressive, so I too preferred it :)