Which messages should be published to a Kafka topic, and when?

I have a few services, like Catalog Service, Customer Service, Recommendations Service, Order Taking Service and so on; each service has its own keyspace in a Cassandra database.
I have two questions:
1 - For a change in a service: should I first publish the change event (or record) to Kafka and then consume it from that same service in order to update its database, or should I update the database first and then publish the record to Kafka?
2 - How do I choose which changes to publish to Kafka? Should I publish all updates, even those of no interest to other services, like "attribute X updated to Y for product Z"?

1) I would suggest you always try to read your own writes. Which operation is more likely to succeed: a replicated ack from Kafka, or a durable Cassandra upsert? If you think Kafka is more durable, then you'd write there first, then use a tool like Kafka Connect to write it down to Cassandra (assuming you really need Cassandra over a GlobalKTable; that's up for debate).
2) There's no straightforward answer. If you think the data will ever be consumed in ways that might be relevant, then produce it. Think of it like an audit log of any and all events. If you want to build an idempotent system that always knows the latest state of any product and all changes that happened, then you can either store the whole object each time as (id, product) pairs, where you holistically update the entire product, or you can store each delta of what changed and rebuild state from that.
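The two event-modelling styles in that last point can be sketched in a few lines. This is a plain-Python illustration (the real clients would be Kafka producers/consumers, typically in Java); the names `rebuild_from_snapshots`, `rebuild_from_deltas`, and the product records are invented for the example.

```python
def rebuild_from_snapshots(events):
    """Latest state per product when each event carries the whole object."""
    state = {}
    for product_id, product in events:
        state[product_id] = product          # last write wins
    return state

def rebuild_from_deltas(events):
    """Latest state per product when each event carries only changed fields."""
    state = {}
    for product_id, delta in events:
        state.setdefault(product_id, {}).update(delta)
    return state

snapshots = [("p1", {"name": "Mug", "price": 5}),
             ("p1", {"name": "Mug", "price": 6})]
deltas = [("p1", {"name": "Mug", "price": 5}),
          ("p1", {"price": 6})]            # only the changed attribute

assert rebuild_from_snapshots(snapshots) == {"p1": {"name": "Mug", "price": 6}}
assert rebuild_from_deltas(deltas) == {"p1": {"name": "Mug", "price": 6}}
```

Both styles converge to the same state; the snapshot style makes each message self-contained (and works well with compacted topics), while the delta style keeps messages small but requires replaying from the beginning to rebuild state.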

Related

How to make sure two microservices are in sync

I have a Kubernetes solution with lots of microservices. Each microservice has its own database, and to send data between the services we use Kafka.
I have one microservice that generates lots of orders and order lines.
These are saved to the order service's own database, and every change should be pushed to Kafka using a Kafka connector setup.
Another microservice holds items and prices. All changes are saved to tables in this service's database, and changes are pushed to their own topic using the Kafka connector.
Now I have a third microservice (the Calculater) that calculates something based on the data from the previously mentioned services. Right now it just consumes changes from the order, order line, item and price topics, and when it's time, it calculates.
The Calculater microservice is scheduled to do the calculation at a certain time each day. But before doing the calculation, I'd like to know if the service is up to date with data from the other two microservices.
Is there some kind of best practice on how to do this?
What this should ensure is that I haven't lost a change. Let's say an order line's quantity was changed from 100 to 1. Then I want to make sure I have received that change before I start calculating.
If you want to know whether the orders and items microservices have all their data published to Kafka prior to having the Calculater execute its logic, that is quite application specific, and it may be hard to come up with a good answer without more details. The Kafka connector that sends the orders, order lines, and so on from the database to Kafka is presumably some kind of CDC connector (i.e., it listens to DB table changes and publishes them to Kafka)? If so, most likely you will need some way to compare the latest message in Kafka with the latest updated row to know if the connector has sent all DB updates to Kafka. There may be connectors that expose that information somehow, or you may have to implement something yourself.
On the other side, if what you want to know is whether the Calculater has read all the messages that have been published to Kafka by the other services, that is easier. You just need to get the high watermarks (the latest offset in each topic partition) and check that the Calculater's consumer has actually reached them (so there is no lag). I guess, though, that the topics are continuously updated, so most likely there will always be some lag; there is nothing you can do about that.
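The lag check described above reduces to simple arithmetic per partition: high watermark minus committed offset. A minimal sketch, with made-up partition and offset values; with a real client you would fetch these from the broker (e.g. via the Java consumer's `endOffsets` and `committed` calls).

```python
def total_lag(high_watermarks, committed_offsets):
    """Sum of (high watermark - committed offset) across all partitions."""
    lag = 0
    for partition, hw in high_watermarks.items():
        committed = committed_offsets.get(partition, 0)
        lag += max(0, hw - committed)
    return lag

# Keys are (topic, partition) pairs; values are offsets (invented here).
hw = {("orders", 0): 120, ("orders", 1): 95, ("items", 0): 40}
committed = {("orders", 0): 120, ("orders", 1): 90, ("items", 0): 40}

lag = total_lag(hw, committed)
assert lag == 5          # 5 messages not yet consumed on orders, partition 1
ready = lag == 0         # the Calculater could gate its daily run on this
```

In practice you would sample this just before the scheduled run; since the topics keep receiving data, "ready" might mean lag below some threshold rather than exactly zero.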

Modelling a Kafka cluster

I have an API endpoint that accepts events with a specific user ID and some other data. I want those events broadcasted to some external locations and I wanted to explore using Kafka as a solution for that.
I have the following requirements:
Events with the same UserID should be delivered in order to the external locations.
Events should be persisted.
If a single external location is failing, that shouldn't delay delivery to other locations.
Initially, from some reading I did, it felt like I want to have N consumers where N is the number of external locations I want to broadcast to. That should fulfill requirement (3). I also probably want one producer, my API, that will push events to my Kafka cluster. Requirement (2) should come in automatically with Kafka.
I was more confused regarding how to model the internal Kafka cluster side of things. Again, from the reading I did, it sounds like it's a bad practice to have millions of topics, so having a single topic for each userID is not an option. The other option I read about is having one partition for each userID (let's say M partitions). That would allow requirement (1) to happen out of the box, if I understand correctly. But that would also mean I have M brokers, is that correct? That also sounds unreasonable.
What would be the best way to fulfill all requirements? As a start, I plan on hosting this with a local Kafka cluster.
You are correct that one topic per user is not ideal. Partition count is independent of broker count, so keying messages by UserID, which makes all of a user's events hash to the same partition, is the better design: it gives you per-user ordering (requirement 1) without millions of topics or brokers.
"If a single external location is failing, that shouldn't delay delivery to other locations."
This is standard consumer-group behavior, not topic/partition design: give each external location its own consumer group. Each group tracks its own offsets, so a slow or failing location does not block the others.
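The key-to-partition mapping that makes per-user ordering work can be illustrated in a few lines. A stable stdlib checksum stands in for the hash here; Kafka's default partitioner actually uses murmur2 on the serialized key bytes, but the property is the same.

```python
import zlib

NUM_PARTITIONS = 12   # chosen freely; independent of broker count

def partition_for(user_id: str) -> int:
    """Map a user ID to a partition, as a keyed producer would."""
    return zlib.crc32(user_id.encode()) % NUM_PARTITIONS

# The same user always maps to the same partition...
assert partition_for("user-42") == partition_for("user-42")

# ...so per-user ordering holds with only 12 partitions, regardless of
# how many users (or brokers) there are.
partitions = {partition_for(f"user-{i}") for i in range(1000)}
assert partitions <= set(range(NUM_PARTITIONS))
```

Within a partition, Kafka preserves append order, so all events for one user are delivered to each consumer group in the order they were produced.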

Kafka Streams DSL over Kafka Consumer API

Recently, in an interview, I was asked a question about Kafka Streams; more specifically, the interviewer wanted to know why/when you would use the Kafka Streams DSL over the plain Kafka Consumer API to read and process streams of messages. I could not provide a convincing answer, and I am wondering if others who have used these two styles of stream processing can share their thoughts/opinions. Thanks.
As usual, it depends on the use case when to use the Kafka Streams API and when to use the plain KafkaProducer/Consumer. I would not dare to select one over the other in general terms.
First of all, Kafka Streams is built on top of Kafka producers/consumers, so everything that is possible with Kafka Streams is also possible with plain consumers/producers.
I would say the Kafka Streams API is less complex but also less flexible compared to the plain consumers/producers. Now we could start a long discussion on what "less" means.
When developing with the Kafka Streams API, you can jump directly into your business logic, applying methods like filter, map, join, or aggregate, because all the consuming and producing is abstracted away behind the scenes.
When you are developing applications with plain consumers/producers, you need to think about how you build your clients at the level of subscribe, poll, send, flush, etc.
If you want even less complexity (but also less flexibility), ksqlDB is another option you can choose to build your Kafka applications.
Here are some of the scenarios where you might prefer the Kafka Streams over the core Producer / Consumer API:
It allows you to build a complex processing pipeline with much ease. So, let's assume (a contrived example) you have a topic containing customer orders, and you want to filter the orders based on a delivery city and save them into a DB table for persistence and an Elasticsearch index for a quick search experience. In such a scenario, you'd consume the messages from the source topic, filter out the unnecessary orders by city using the Streams DSL filter function, store the filtered data in a separate Kafka topic (using KStream.to() or KTable.to()), and finally, using Kafka Connect, have the messages stored in the database table and Elasticsearch. You can do the same thing using the core Producer / Consumer API, but it would be much more coding.
In a data processing pipeline, you can do the consume-process-produce in the same transaction. So, in the above example, Kafka will ensure exactly-once semantics and transactions from the source topic up to the DB and Elasticsearch. There won't be any duplicate messages introduced due to network glitches and retries. This feature is especially useful when you are doing aggregates, such as the count of orders per product. In such scenarios, duplicates will always give you a wrong result.
You can also enrich your incoming data with very low latency. Let's assume, in the above example, you want to enrich the order data with the customer's email address from your stored customer data. In the absence of Kafka Streams, what would you do? You'd probably invoke a REST API for each incoming order over the network, which would definitely be an expensive operation impacting your throughput. In such a case, you might want to store the required customer data in a compacted Kafka topic and load it into the streaming application using a KTable or GlobalKTable. Now all you need to do is a simple local lookup in the KTable for the customer's email address. Note that the KTable data here will be stored in the embedded RocksDB which comes with Kafka Streams, and as the KTable is backed by a Kafka topic, the data in the streaming application will be continuously updated in real time. In other words, there won't be stale data. This is essentially an example of the materialized view pattern.
Let's say you want to join two different streams of data. So, in the above example, you want to process only the orders that have successful payments, and the payment data is coming through another Kafka topic. Now, it may happen that the payment gets delayed or the payment event comes before the order event. In such a case, you may want to do a one-hour windowed join, so that if the order and the corresponding payment events come within a one-hour window, the order is allowed to proceed down the pipeline for further processing. As you can see, you need to store the state for a one-hour window, and that state will be stored in the RocksDB of Kafka Streams.
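The windowed-join idea in that last scenario can be sketched without any Kafka machinery at all. This is plain Python standing in for the Streams DSL join (which in Java would be `KStream.join` with a `JoinWindows` of one hour); the events and timestamps are invented for illustration.

```python
WINDOW = 3600  # one hour, in seconds

def windowed_join(orders, payments, window=WINDOW):
    """Return the ids of orders whose payment falls within the join window.

    Each event is an (id, epoch_seconds) pair; either side may arrive first,
    so the comparison uses the absolute time difference.
    """
    paid = []
    for oid, o_ts in orders:
        for pid, p_ts in payments:
            if oid == pid and abs(o_ts - p_ts) <= window:
                paid.append(oid)
                break
    return paid

orders = [("o1", 1000), ("o2", 2000)]
payments = [("o1", 2500),          # 25 min after the order: joins
            ("o2", 9000)]          # ~2 h after the order: dropped

assert windowed_join(orders, payments) == ["o1"]
```

The Streams DSL does the equivalent matching incrementally as events arrive, buffering each side in RocksDB for the window duration instead of holding full lists in memory.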

Collect users activity in Kafka?

I want to provide a fast way to get a user's availability status.
Reads from storage must be as fast as possible.
So I chose Redis for storing each user's availability status.
Besides that, I need to store more extended information about available users, such as region, time of login, etc.
For this purpose I have Kafka, where this data is stored.
The question is: how do I synchronise Kafka and Redis?
What should the sequence be: first store the online-user event in Kafka, then sink it to Redis?
Or store it in Redis first and write to Kafka asynchronously?
I am worried about the latency of the sink operation between Kafka and Redis.
As I understood from the question, you want to store only the user and user status in Redis, and the complete profile in Kafka.
I am not sure about the reason for choosing Kafka as your primary source of all data, or how you are planning to use the data stored there.
If data storage in Kafka is really important to you, then I'd suggest updating your primary database first (Kafka or any other) and then updating the cache.
In this case, you need to do a synchronous produce to Kafka and, once it is successful, update your cache.
As your read operations are only from Redis, read performance will not be impacted.
Opting for a sync producer might add a little overhead because of the acknowledgement, compared to async.
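The suggested sequence, primary store first, cache second, can be sketched as below. Both stores are modelled as plain dicts; the append and assignment stand in for a synchronous Kafka produce (blocking until the broker ack) and a Redis SET, and `publish_status` is an invented name for the example.

```python
kafka_log = []        # stands in for the Kafka topic (primary store)
redis_cache = {}      # stands in for Redis (fast read-side cache)

def publish_status(user_id, status):
    """Write to the primary store first; update the cache only on success."""
    # 1. Sync produce: a real sync producer blocks until the broker acks.
    kafka_log.append((user_id, status))
    acked = True
    # 2. Only after the ack, update the read-side cache.
    if acked:
        redis_cache[user_id] = status
    return acked

publish_status("u1", "online")
assert redis_cache["u1"] == "online"
assert kafka_log == [("u1", "online")]
```

Ordering this way means a failed Kafka write never leaves the cache ahead of the primary store; the cost is the small synchronous-ack latency on the write path, while reads stay Redis-fast.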

Using Apache Kafka to maintain data integrity across databases in microservices architecture

Has anyone used Apache Kafka to maintain data integrity across a microservice architecture in which each service has its own database? I have been searching around, and there were some posts that mentioned using Kafka, but I'm looking for more details on how Kafka was used. Do you have to write code for the producer and consumer (say, with the Customer database as producer and the Orders database as consumer, so that if a customer is deleted in the Customer database, the Orders database somehow needs to know, so it can delete all orders for that customer as well)?
Yes, you'll need to write that processing code.
For example, one database would be connected to a CDC reader that emits all changes to a stream (the producer), which could be fed into a KTable or a custom consumer to write upserts/deletes into a local cache of another service. The reason it ought to be a cache rather than a database is that when the service restarts, you potentially miss some events or duplicate others, so the source of the materialized view should ideally be Kafka itself (via a compacted topic).
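The consumer side of that pattern, applying change events to a local materialized view, including the customer-deletion cascade from the question, can be sketched as follows. The event shape here is invented for illustration (a real CDC connector such as Debezium has its own envelope format), and the view is a plain dict standing in for the local cache.

```python
def apply_event(view, event):
    """Apply one CDC-style change event to the local materialized view."""
    op, key, value = event["op"], event["key"], event.get("value")
    if op == "upsert":
        view[key] = value
    elif op == "delete":
        view.pop(key, None)
        # Cascade: also drop records that reference the deleted customer.
        stale = [k for k, v in view.items()
                 if isinstance(v, dict) and v.get("customer") == key]
        for k in stale:
            del view[k]
    return view

events = [
    {"op": "upsert", "key": "cust-1", "value": {"name": "Ada"}},
    {"op": "upsert", "key": "ord-9", "value": {"customer": "cust-1"}},
    {"op": "delete", "key": "cust-1"},
]

view = {}
for e in events:
    apply_event(view, e)
assert view == {}   # the customer and their order are both gone
```

Because upserts are idempotent and deletes are no-ops when the key is absent, replaying the compacted topic from the beginning after a restart converges the view to the same state, which is exactly why the answer recommends treating it as a rebuildable cache.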