Using a Kafka state store changelog topic to share state - apache-kafka

I am trying to do a small POC on event sourcing in the e-commerce domain. So I have an order service, a customer contact service and a delivery service. Now, for logistical reasons, some customer contact information should be available to the delivery person so that they can call the customer in case they are not at home, etc. Obviously, customers can update their contact information, and the delivery should ideally show the latest contact info. I am using Kafka as the messaging framework and state store.
When I create a state store by reading events related to an aggregate's state changes (e.g. customer contact info changes) and applying them to the aggregate, the state store is backed by a changelog topic (e.g. contact-service-customer-contact-changelog). Now, if I need some of this data in another service (e.g. the delivery service), can I use the same changelog topic to create another state store local to that service? In the available literature this type of topic is termed "internal", so it seems we are not supposed to use it for anything other than rebuilding the state store for instances of the original service. Should we instead re-publish the state store updates to a new topic for other services to join with this data? Or is there another way to address this use case?

It's called an internal topic because it's managed by the Kafka Streams library. This implies that the topic name might change, and thus it might not be safe to consume the topic. Nothing prevents you from consuming it, though.
To avoid messing up the state, you should never write to this topic. Reading from it is "conceptually safe".
It's hard to give advice here; you have to make your own decision whether to read the topic directly or to create a topic manually and write the same data a second time.
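If you go the second route, the contact service can republish its table updates to an explicitly named output topic rather than letting other services read the internal changelog. A minimal Kafka Streams sketch, assuming hypothetical topic and store names ("customer-contact-events", "customer-contact-store", "customer-contact-updates"):

```java
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.state.KeyValueStore;

StreamsBuilder builder = new StreamsBuilder();

// Materialize the customer contact state as before; the changelog topic
// backing "customer-contact-store" remains internal to this service.
KTable<String, String> contacts = builder.table(
        "customer-contact-events",
        Materialized.<String, String, KeyValueStore<Bytes, byte[]>>as(
                "customer-contact-store"));

// Additionally publish every update to an explicitly named public topic,
// which the delivery service can consume (and materialize locally) without
// depending on an internal topic name that Kafka Streams may change.
contacts.toStream().to("customer-contact-updates");
```

The second write costs extra storage, but it decouples the delivery service from the contact service's internals, which is exactly the trade-off described above.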

Related

Keeping services in sync in a kafka event driven backbone

Say I am using Kafka as the event-driven backbone for all my microservices in my system design. Many microservices use the events data to populate their internal databases.
Now there is a requirement where I need to create a new service that uses some of the events data. The service will only be able to consume events published after it goes live and hence will be missing a lot of earlier data. I want a strategy such that I don't have to backfill my internal databases by writing scripts.
What are some cool strategies that do not create a huge load on Kafka and do not require a lot of scripting to backfill data into every new service I create?
There are a few strategies you can use here, depending on how you publish data to a Kafka topic. Here are a few ideas (both are sketched in code below):
First, you can set the retention of a Kafka topic to be forever, meaning that it will store all the data. This is OK, as Kafka is built for this purpose as well. See this. By doing this, any new service that comes alive can start consuming data from the start.
Second, if you are using Kafka to publish the latest state of a given entity/aggregate, you can also consider configuring the topic to be compacted. This will keep at least the latest state of your entity/aggregate on the topic, and new consumers that start listening on the topic will have less data to read. However, your consumers still need to know how to process multiple messages per entity/aggregate, as you cannot guarantee there will be exactly one message per key in the topic.
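As a sketch of both ideas, with hypothetical topic names, here is how the two retention styles could be set up with the Java AdminClient (the TopicConfig constants are part of the regular kafka-clients library):

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;

public class TopicSetup {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // Idea 1: keep every event forever (retention.ms = -1).
            NewTopic fullHistory = new NewTopic("order-events", 3, (short) 1)
                    .configs(Map.of(TopicConfig.RETENTION_MS_CONFIG, "-1"));

            // Idea 2: log compaction keeps at least the latest record per key,
            // so new consumers bootstrap from far less data.
            NewTopic latestState = new NewTopic("customer-state", 3, (short) 1)
                    .configs(Map.of(TopicConfig.CLEANUP_POLICY_CONFIG,
                                    TopicConfig.CLEANUP_POLICY_COMPACT));

            admin.createTopics(List.of(fullHistory, latestState)).all().get();
        }
    }
}
```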

Designing a real-time data pipeline for an e-commerce web site [closed]

Closed. This question needs to be more focused. It is not currently accepting answers. Closed 2 years ago.
I want to learn Apache Kafka. I read articles and documents but I could not figure out how Kafka works. There are lots of questions in my mind :( I want to create a Kafka cluster and develop some code for preparing data engineering interviews. But, I am stuck. Any help would be appreciated. I will try to explain my questions in an example scenario.
For instance, there is a popular e-commerce company. They have a huge amount of web traffic. The web site is running on AWS. The mobile applications are also using AWS services.
The marketing department wants to observe the efficiency of their advertisement actions like email, SMS etc. They also want to follow important real-time metrics (sold products, page views, active users in the last n minutes etc) in a dashboard.
First, the campaign automation system sends personalized campaign emails to target customers. When a user clicks the link in the advertisement email, the browser opens the e-commerce web site.
In the background, the website developers should send a clickstream event to the Kafka cluster with the related parameters (like customer id, advertisement id, source_medium, etc.).
How can the backend developers send a message to the Kafka cluster when a user loads the web site? Should developers send a POST request or GET request? Are there other alternatives?
Then data engineers should direct this clickstream message to the storage layer (for example AWS S3). Will this cause too many small files in AWS S3 buckets? Might this slow down the execution of data flows?
Data engineers need to develop a data pipeline in order to do real-time analysis. Which technologies should data engineers use? (Kafka Connect, Kafka Streams, Producer and Consumer etc)
Kafka topics may have lots of messages. One message can be consumed by different consumers. A consumer reads the message from the Kafka topic. Then, another consumer can read it, even after a while. So data engineers need to manage offsets in order to consume all messages once and only once. How can they manage offsets properly?
All clickstream events should be consumed.
All clickstream events should be consumed exactly once. If a product view event is consumed more than once, the dashboard will not show the correct product view count.
Do developers need to manage offsets manually? Or is there any technology/way which manages offsets automatically?
Event order can be important. The marketing department wants to see the category view durations. For instance, a user views 10 books in the ebooks category. Ten events are created. The user stays on the same category page until his/her next action. So data engineers need to calculate the duration between the first event and the last event.
However, Kafka is a queue and there is no ordering guarantee in it. Producers can send data to Kafka asynchronously. How can data engineers calculate the durations correctly?
What happens if a producer sends an event to Kafka after the total elapsed duration has been calculated?
Note: View duration may fit content web sites better. For example, Netflix marketing users want to analyze content view durations and percentages. If a user opens a movie and watches just five minutes, the marketing department may conclude that the user does not like the movie.
Thanks in advance
You have really asked several unrelated questions here. Firstly, Kafka has a lot of free documentation available for it, along with many high-quality 'getting started' blogs and paid books and courses. I would definitely start there. You might still have questions, but at least you will have a better awareness of the platform and can ask questions in much more focused ways, which will hopefully get much better answers. Start with the official docs. Personally, I learned Kafka by reading the Effective Kafka book, but I'm sure there are many others.
Going through your list of questions:
How can the backend developers send a message to the Kafka cluster when a user loads the web site? Should developers send a POST request or GET request? Are there other alternatives?
The website would typically publish an event. This is done by opening a client connection to a set of Kafka brokers and publishing a record to some topic. You mention POST/GET requests: this is not how Kafka generally works; the clients establish persistent connections to a cluster of brokers. However, if your preferred programming model is REST, Confluent does provide a Kafka REST Proxy for this use case.
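For illustration, a minimal sketch of such a publish from a Java backend, assuming a hypothetical "clickstream" topic and JSON payload; keying by customer id is optional here, but it keeps one user's events together on one partition:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
    // One record per page view, published over the client's persistent
    // broker connection rather than via an HTTP POST/GET.
    String payload = "{\"customerId\":\"42\",\"adId\":\"a-17\",\"sourceMedium\":\"email\"}";
    producer.send(new ProducerRecord<>("clickstream", "42", payload));
}
```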
Then data engineers should direct this clickstream message to the storage layer (for example AWS S3). Will this cause too many small files in AWS S3 buckets? Might this slow down the execution of data flows?
It depends on how you write to S3. You may develop a custom consumer application that stages writes in a different persistence layer and then writes to S3 in batches. Kafka Connect also has an Amazon S3 sink connector that moves data in chunks.
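As a sketch, a Kafka Connect S3 sink configuration along these lines controls how many records go into each S3 object (property names follow the Confluent S3 sink connector; verify them against the connector documentation for your version):

```properties
name=clickstream-s3-sink
connector.class=io.confluent.connect.s3.S3SinkConnector
tasks.max=2
topics=clickstream
s3.bucket.name=my-clickstream-bucket
s3.region=eu-west-1
storage.class=io.confluent.connect.s3.storage.S3Storage
format.class=io.confluent.connect.s3.format.json.JsonFormat
# flush.size is the main lever against the "too many small files" problem:
# this many records are accumulated before one S3 object is written.
flush.size=10000
```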
Data engineers need to develop a data pipeline in order to do real-time analysis. Which technologies should data engineers use? (Kafka Connect, Kafka Streams, Producer and Consumer etc)
There is no single correct answer here. All of the technologies you have listed are valid and may be used to similar effect. Both Connect and Streams are quite popular for these types of applications; however, you can just as easily write a custom consumer application for all your needs.
Kafka topics may have lots of messages. One message can be consumed by different consumers. A consumer reads the message from the Kafka topic. Then, another consumer can read it, even after a while. So data engineers need to manage offsets in order to consume all messages once and only once. How can they manage offsets properly?
In the simplest case, Kafka offset management is automatic, and the default behaviour allows for at-least-once delivery, whereby a record will be delivered again if the first processing attempt failed. This may lead to duplicate effects (counting a clickstream event twice, as you described), but this is addressed by making your consumer idempotent. This is a fairly complex topic; there is a great answer on Quora that covers the issue of exactly-once delivery in detail.
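One common at-least-once pattern, sketched below with hypothetical topic and group names: disable auto-commit and commit offsets only after processing, so a crash leads to redelivery rather than data loss (which is why the processing itself must be idempotent):

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

Properties props = new Properties();
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ConsumerConfig.GROUP_ID_CONFIG, "dashboard-aggregator");
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
// Commit offsets ourselves, only after processing succeeded: at-least-once.
props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");

try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
    consumer.subscribe(List.of("clickstream"));
    while (true) {
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
        for (ConsumerRecord<String, String> record : records) {
            // Stand-in for real (idempotent) processing: a crash before the
            // commit below means these records are redelivered and re-processed.
            System.out.println(record.key() + " -> " + record.value());
        }
        consumer.commitSync(); // mark the batch consumed only after processing
    }
}
```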
Event order can be important. The marketing department wants to see the category view durations. For instance, a user views 10 books in the ebooks category. Ten events are created. The user stays on the same category page until his/her next action. So data engineers need to calculate the duration between the first event and the last event.
The concept of order is baked into Kafka. Kafka's topics are sharded into partitions, where each partition is a totally ordered, unbounded stream of records. Records are strictly ordered provided they are published to the same partition. This is achieved by assigning them the same key, which the Kafka client hashes behind the scenes to arrive at the partition index. Any two records that have the same key will occupy the same partition, and will therefore be ordered.
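For reference, a simplified sketch of how the Java client's default partitioner maps a key to a partition (the real implementation also handles unkeyed records); Utils.murmur2 and Utils.toPositive are helpers from the kafka-clients library:

```java
import org.apache.kafka.common.utils.Utils;

public class KeyPartitioning {
    // Same key bytes always hash to the same partition index, which is
    // what gives per-key ordering across publishes.
    static int partitionFor(byte[] keyBytes, int numPartitions) {
        return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
    }
}
```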
Welcome to Stack Overflow! I will answer a few of your questions; however, you should go through the Kafka documentation for such things. If you face a problem while implementing it, then you should post here.
How can developers send data to a Kafka cluster? You have talked about producers, but I guess you haven't read about them: the developers will have to use a producer to publish an event to a Kafka topic. You can read more about the Kafka producer in the documentation.
To direct the messages to a storage layer, Kafka consumers will be used.
Note: Kafka Connect can be used instead of a Kafka producer and consumer in some scenarios; Kafka Connect has source connectors and sink connectors instead of producers and consumers.
For real-time data analysis, Kafka Streams or KSQL can be used. These cannot be explained in a single answer; I recommend you go through the documentation.
A single Kafka topic can have multiple consumer groups, and every consumer group has its own offsets; you can tweak the configuration to use or not use these offsets for every consumer group.
You can change various configurations, such as acks=all, to choose between at-least-once and at-most-once semantics. Again, you should go through the documentation to understand this completely.
You can maintain message order in Kafka as well. For that to happen, your producers will have to wait for the acknowledgement from Kafka after every message is sent; obviously this slows things down, but that is the trade-off you have to make.
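A sketch of the producer settings the previous paragraph alludes to; exact values depend on your throughput and ordering requirements:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;

Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
// Wait for all in-sync replicas to acknowledge each record.
props.put(ProducerConfig.ACKS_CONFIG, "all");
// Idempotence: retries cannot introduce duplicates or reorder records.
props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");
// Stricter (and slower) alternative: one in-flight request at a time, so a
// retried message can never overtake a later one.
props.put(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, "1");
```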
I haven't fully understood your requirements related to the last point, but I suggest you go through the Kafka Streams and KSQL documentation, as you can manage your window size for analysis there.
I have tried to answer most of your questions briefly, but to understand them completely you will obviously have to go through the documentation in detail.
Agree with the answers above. The questions you ask are reasonably straightforward and likely answered in the official documentation.
As per one of the replies, there are lots of excellent books and tutorials online. I recently wrote a summary of educational resources on Kafka which you might find useful.
Based on your scenario, this will be a straightforward stream processing application with an emitter and a few consumers.
The clickstream event would be published onto the Kafka cluster through a Kafka client library. It's not clear what language the website is written in, but there is likely a client library available for it. The web server connects to the Kafka brokers and publishes a message every time the user performs some action of significance.
You mention that order matters. Kafka has inherent support for ordered messages. All you need to do is publish related messages with the same key, for example the customer's username or ID. Kafka then ensures that those messages will appear in the order in which they were published.
You say that multiple consumers will be reading the same stream. This is easily achieved by giving each set of consumers a different group.id. Kafka keeps a separate set of committed offsets for each consumer group (Kafka's terminology for a related set of consumers), so that one group can process messages independently of another. For committing offsets, the easiest approach is to use the automatic offset commit mode that is enabled by default. This way records will not be committed until your consumer is finished with them, and if a consumer fails midway through processing a batch of records, those records will be redelivered.
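A minimal sketch of the group.id separation, with hypothetical group names; both consumers independently receive every record on the topic:

```java
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class Groups {
    // Same connection settings, different group.id per logical application.
    static KafkaConsumer<String, String> consumerInGroup(String groupId) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, groupId);
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(List.of("clickstream"));
        return consumer;
    }

    public static void main(String[] args) {
        // Each group commits its own offsets, so the dashboard can process in
        // real time while the archiver lags hours behind, independently.
        KafkaConsumer<String, String> dashboard = consumerInGroup("dashboard");
        KafkaConsumer<String, String> archiver = consumerInGroup("s3-archiver");
    }
}
```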

Kafka Microservice Proper Use Cases

In my new work's project, I discovered that instead of directly making POST/PUT API calls from one microservice to another, a microservice would produce a message to Kafka, which is then consumed by a single other microservice.
For example, the Order microservice would publish a record to the "pending-order" topic, which would then be consumed by the Inventory microservice (no other consumer). In turn, after consuming the record and doing some processing, the Inventory microservice would produce a record to "processed-order", which would then be consumed only by the Order microservice.
Is this a correct use case? Or is it better to just do API calls between microservices in this case?
There are two strong use cases for Kafka in a microservice-based application:
You need to make a state change in multiple microservices as part of a single end-user activity. If you do this by calling all the appropriate microservice APIs sequentially or in parallel, there are two issues:
Firstly, you lose atomicity, i.e. you cannot guarantee "all or nothing". It is entirely possible that the call to microservice A succeeds but the call to service B fails, and that would lead to permanently inconsistent data. Secondly, in a cloud environment unpredictable latency and network timeouts are not uncommon, so when you make multiple calls as part of a single request, the probability of one of those calls being delayed or failing is higher, hurting the user experience.
Hence, the general recommendation is: write the user action atomically to a Kafka topic as an event, and have multiple consumer groups, one for each interested microservice, consume the event and make the state change in their own database. If the action is triggered by the user from a UI, you would also need to provide a "read your own writes" guarantee, where the user expects to see their data immediately after writing. Therefore, you'd write the event first to the local database of the first microservice and then do log-based event sourcing (using an appropriate Kafka connector) to transfer the event data to Kafka. This enables you to show the data to the user from the local DB. You may also need to update a cache, a search index, a distributed file system etc., and all of these can be done by consuming the Kafka events published by the individual microservices.
It is not uncommon that you need to pull data from multiple microservices to perform some activity or to aggregate data and display it to the user. This, in general, is not recommended because of the latency and timeout issues mentioned above. Instead, it is usually recommended to precompute those aggregates in the microservice's local DB, based on the Kafka events published by the other microservices as they change their own state. This allows you to serve the aggregate data to the user much faster. This is called the materialized view pattern, sketched below.
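A toy sketch of such a materialized view, with hypothetical topic and group names; the in-memory map stands in for the microservice's local database:

```java
import java.time.Duration;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

Properties props = new Properties();
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ConsumerConfig.GROUP_ID_CONFIG, "order-view-builder");
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

// Precomputed aggregate, keyed by customer id. Reads are served from here
// instead of fanning out to other microservices at request time.
Map<String, Long> ordersPerCustomer = new HashMap<>();

try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
    consumer.subscribe(List.of("order-events"));
    while (true) {
        for (ConsumerRecord<String, String> rec : consumer.poll(Duration.ofMillis(500))) {
            // Fold each order event into the view as it arrives.
            ordersPerCustomer.merge(rec.key(), 1L, Long::sum);
        }
    }
}
```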
The only point to remember here is that writing to the Kafka log or broker and reading from it is asynchronous, and there may be a small time delay.
A microservice as a consumer seems fishy to me. You probably mean that listeners to that topic would consume the message and perhaps call your second microservice, i.e. the Inventory microservice.
Yes, the model is fine, especially when you want asynchronous behaviour and lots of traffic handled through it.
Imagine a scenario where you have more than one microservice to call from one endpoint. Here you either need an aggregation layer that aggregates your services so you can call it once, or you can publish several messages to Kafka, which will then do the job.
Also think about read services: if you need to call a microservice to read some data from somewhere else, then you can't use Kafka.
It all depends on your requirements and design.

How do I limit a Kafka client to only access one customer's data?

I'm evaluating Apache Kafka for publishing some event streams (and commands) between my services running on machines.
However, most of those machines are owned by customers, on their premises, and connected to their networks.
I don't want a machine owned by one customer to have access to another customer's data.
I see that Kafka has an access control module, which looks like it lets you restrict a client's access based on topic.
So, I could create a topic per customer and restrict each customer to just their own topic. This seems like a bad idea that I could regret in the future, because I've seen recommendations to keep the number of Kafka topics in the low thousands at most.
Another design is to create a partition per customer. However, I don't see a way to restrict access if I do that.
Is there a way out of this quandary?

Building and querying state in Apache Kafka: Kafka Stream?

I am building an Apache Kafka cluster for my needs, and most of it is stateless. But there is one situation for which I really need state.
To explain, let's say I am storing every pharmacy store that opens and the transactions that happen at each and every store. So a store opens with an initial count of each medicine. As medicines are sold and more medicines are stocked, the state continually changes.
While Kafka serves my need of keeping up with live transactions in real time, I also need to be able to build the pharmacy store state and query it to find out, at any given point, the count of a given medicine in a store. Is that possible? Is that what Kafka Streams is used for?
Yes, you can use Kafka Streams to build an application that consumes a Kafka topic and maintains a continuously updated, queryable store tracking, as in your example, the current drug inventory.
Check out the documentation to get started: http://docs.confluent.io/current/streams/index.html
Also check out these examples using Kafka Streams' "Interactive Queries" feature:
https://github.com/confluentinc/examples/blob/3.1.x/kafka-streams/src/main/java/io/confluent/examples/streams/interactivequeries/WordCountInteractiveQueriesExample.java
https://github.com/confluentinc/examples/blob/3.1.x/kafka-streams/src/main/java/io/confluent/examples/streams/interactivequeries/kafkamusic/KafkaMusicExample.java
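For the pharmacy scenario specifically, a minimal sketch, assuming a hypothetical "pharmacy-transactions" topic keyed by "storeId:medicineId" whose values are signed quantity deltas (restocks positive, sales negative); note the store query API shown here is the one from recent Kafka versions, which differs slightly from the 3.1-era examples linked above:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StoreQueryParameters;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.QueryableStoreTypes;
import org.apache.kafka.streams.state.ReadOnlyKeyValueStore;

StreamsBuilder builder = new StreamsBuilder();

// Sum the quantity deltas per key into a queryable state store.
builder.stream("pharmacy-transactions", Consumed.with(Serdes.String(), Serdes.Long()))
       .groupByKey(Grouped.with(Serdes.String(), Serdes.Long()))
       .reduce(Long::sum,
               Materialized.<String, Long, KeyValueStore<Bytes, byte[]>>as("inventory-store"));

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "pharmacy-inventory");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

KafkaStreams streams = new KafkaStreams(builder.build(), props);
streams.start();

// Interactive query (once the instance is RUNNING): the current count of a
// given medicine in a given store.
ReadOnlyKeyValueStore<String, Long> store = streams.store(
        StoreQueryParameters.fromNameAndType("inventory-store",
                QueryableStoreTypes.keyValueStore()));
Long count = store.get("store-17:ibuprofen-200mg");
```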