How to make sure two microservices are in sync - apache-kafka

I have a Kubernetes solution with lots of microservices. Each microservice has its own database, and to send data between the services we use Kafka.
I have one microservice that generates lots of orders and order lines.
These are saved to the order service's own database, and every change is pushed to Kafka using a Kafka connector setup.
Another microservice handles items and prices. All changes are saved to tables in this service's database, and changes are pushed to their own topics using the Kafka connector.
Now I have a third microservice (the Calculater) that calculates something based on the data from the previously mentioned services. Right now it just consumes changes from the order, order line, item and price topics, and when it's time, it calculates.
The Calculater microservice is scheduled to do the calculation at a certain time each day. But before doing the calculation, I'd like to know whether the service is up to date with the data from the other two microservices.
Is there some kind of best practice for how to do this?
What this should ensure is that I haven't lost a change. Let's say an order line's quantity was changed from 100 to 1; I want to make sure I have received that change before I start calculating.

If you want to know whether the orders and items microservices have published all their data to Kafka before the Calculater executes its logic, that is quite application specific, and it is hard to come up with a good answer without more details. Is the Kafka connector that sends the order, order line and other messages from the database to Kafka some kind of CDC connector (i.e., it listens to DB table changes and publishes them to Kafka)? If so, you will most likely need some way to compare the latest message in Kafka with the latest updated row in the database to know whether the connector has sent all DB updates to Kafka. Some connectors may expose that information somehow; otherwise you may have to implement something yourself.
On the other hand, if what you want to know is whether the Calculater has read all the messages that have been published to Kafka by the other services, that is easier. You just need to get the high watermarks (the latest offset in each topic partition) and check that the Calculater's consumer group has actually consumed up to them (so there is no lag). Since the topics are presumably updated continuously, there will most likely always be some lag, but there is not much you can do about that.
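If it helps, here is a minimal sketch of that check using the Java AdminClient and consumer APIs; the bootstrap address and the Calculater's consumer group id are assumptions you would replace with your own values.

import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;

public class CalculaterLagCheck {
    public static void main(String[] args) throws Exception {
        String bootstrap = "localhost:9092";   // assumption: your broker address
        String groupId = "calculater";         // assumption: the Calculater's consumer group id

        Properties adminProps = new Properties();
        adminProps.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrap);

        Properties consumerProps = new Properties();
        consumerProps.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrap);
        consumerProps.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());
        consumerProps.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());

        try (AdminClient admin = AdminClient.create(adminProps);
             KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(consumerProps)) {

            // Offsets the Calculater's group has committed so far
            Map<TopicPartition, OffsetAndMetadata> committed =
                admin.listConsumerGroupOffsets(groupId).partitionsToOffsetAndMetadata().get();

            // High watermarks (latest offsets) of the same partitions
            Map<TopicPartition, Long> endOffsets = consumer.endOffsets(committed.keySet());

            long totalLag = 0;
            for (Map.Entry<TopicPartition, Long> e : endOffsets.entrySet()) {
                long lag = e.getValue() - committed.get(e.getKey()).offset();
                System.out.printf("%s lag=%d%n", e.getKey(), lag);
                totalLag += lag;
            }
            System.out.println(totalLag == 0 ? "Calculater is caught up" : "Calculater is behind by " + totalLag + " records");
        }
    }
}

The same numbers are what kafka-consumer-groups.sh --bootstrap-server ... --describe --group <group> reports in its LAG column, if you prefer checking from the command line.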

Related

kafka consumer to store history of events in a data store

We are working with Kafka as an event streaming platform. So far, there is one producer of data and 3 consumers, each of them subscribed to one or several topics in Kafka. This is working perfectly fine. FYI, the Kafka retention period is set to 5s since we don't need to persist the events longer than that.
Right now, we have a new use case coming up: persist all the events of the latest 20 minutes (in another data store) for post-analysis (mainly for training purposes). So this new Kafka consumer should subscribe to all existing topics. We only want to persist the history of the latest 20 minutes of events in the data store, and not all the events of a session (which can represent several hours or days). The targeted throughput is 170 KB/s, which over 20 minutes is almost 1M messages to be persisted.
We are wondering which architecture pattern is suited for such a situation. This is not a nominal use case compared to the current ones, so we don't want to reduce the performance of the system in order to handle it. Our idea is to empty the topics as fast as we can, push the data into a queue, and have another app with a different rate in charge of reading the data from the queue and persisting it into the store.
We would greatly appreciate any experience or feedback on managing such a use case, especially about the expiration/purge mechanism to use. For sure we need something highly available and scalable.
Regards
You could use Kafka Connect with topics.regex=.* to consume everything and write to one location, but you'll end up with a really high total lag, especially if you keep adding new topics.
If you have retention.ms=5000, then I don't know whether Kafka is the proper tool for your use case, but perhaps you could ingest into Splunk or Elasticsearch or another time-series system where you can properly slice by 20-minute windows.
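For reference, a connector properties sketch along those lines; the Elasticsearch sink is only an example, and the connector class, connection URL and connector name are placeholders for whatever sink you actually use. The relevant part is topics.regex.

name=archive-everything-sink
connector.class=io.confluent.connect.elasticsearch.ElasticsearchSinkConnector
tasks.max=3
topics.regex=.*
connection.url=http://elasticsearch:9200
key.ignore=true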

Modelling a Kafka cluster

I have an API endpoint that accepts events with a specific user ID and some other data. I want those events broadcast to some external locations, and I wanted to explore using Kafka as a solution for that.
I have the following requirements:
Events with the same UserID should be delivered in order to the external locations.
Events should be persisted.
If a single external location is failing, that shouldn't delay delivery to other locations.
Initially, from some reading I did, it felt like I want to have N consumers where N is the number of external locations I want to broadcast to. That should fulfill requirement (3). I also probably want one producer, my API, that will push events to my Kafka cluster. Requirement (2) should come in automatically with Kafka.
I was more confused regarding how to model the internal Kafka cluster side of things. Again, from the reading I did, it sounds like it's a bad practice to have millions of topics, so having a single topic for each userID is not an option. The other option I read about is having one partition for each userID (let's say M partitions). That would allow requirement (1) to happen out of the box, if I understand correctly. But that would also mean I have M brokers, is that correct? That also sounds unreasonable.
What would be the best way to fulfill all requirements? As a start, I plan on hosting this with a local Kafka cluster.
You are correct that one topic per user is not ideal.
Partition count is not dependent upon broker count (a single broker can host many partitions), so partitioning a single topic by UserID is the better design: use the UserID as the message key, and all of a user's events land in the same partition, which gives you per-user ordering.
"If a single external location is failing, that shouldn't delay delivery to other locations."
This is standard consumer-group behavior, not topic/partition design: give each external location its own consumer group, and each group commits its own offsets independently.
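A sketch of the producer side (topic name, serializers and bootstrap address are placeholders): using the UserID as the record key is what pins a user's events to one partition and preserves their order.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class EventPublisher {
    private final KafkaProducer<String, String> producer;

    public EventPublisher(String bootstrap) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrap);
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.ACKS_CONFIG, "all");   // wait for replicated acks (persistence, requirement 2)
        this.producer = new KafkaProducer<>(props);
    }

    public void publish(String userId, String eventJson) {
        // Same key -> same partition -> per-user ordering (requirement 1)
        producer.send(new ProducerRecord<>("user-events", userId, eventJson));
    }
}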

Streaming on-demand data onto Kafka topics based on consumer requests

We are a source system, and we have a couple of downstream systems that require our data. Currently we publish events onto Kafka topics whenever there is a change in the source system, for them to consume and apply to their tables (all delta updates).
Apart from subscribing to the Kafka topics, our downstream systems currently access our database directly once in a while to do a complete on-demand refresh of their tables and make sure data is in sync, since a full refresh is occasionally needed when data seems out of sync for some reason.
We are planning to stop giving direct access to our database. How can we achieve this? Is there a way for consumers to request the data they need, for example by passing a request to us so that we publish the corresponding stream of data for them to consume and sync their tables, or pull the bulk data into memory to perform their tasks?
We have written RESTful APIs to provide data on request, but they only expose small data volumes; that won't work when we need to send millions of records to consumers. I believe the only way is to stream the data over Kafka, but with Kafka, how can we respond to a consumer's request and pump only that specific data onto the topics for them to consume?
You have the option of setting the retention policy on any topic to keep messages forever with:
retention.ms=-1
see the docs
In that case you could store the entire change log in the same manner that you currently do. Then, if a consumer needs to re-materialize the entire history, they can start at the first offset and go from there, without you having to produce a specialized dataset.
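A rough sketch of that re-materialization on the consumer side (topic name, bootstrap address and deserializers are placeholders): assign the partitions explicitly, seek to the beginning, and replay the whole change log.

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import java.util.stream.Collectors;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class FullRefresh {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");   // one-off replay, no offsets committed
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            List<TopicPartition> partitions = consumer.partitionsFor("orders").stream()
                    .map(p -> new TopicPartition(p.topic(), p.partition()))
                    .collect(Collectors.toList());
            consumer.assign(partitions);
            consumer.seekToBeginning(partitions);   // replay the entire retained change log

            ConsumerRecords<String, String> records;
            // Crude stop condition for a one-off refresh: quit once a poll comes back empty
            while (!(records = consumer.poll(Duration.ofSeconds(5))).isEmpty()) {
                for (ConsumerRecord<String, String> r : records) {
                    // upsert r.key() / r.value() into the downstream table here
                }
            }
        }
    }
}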

Which messages should be published to a Kafka topic, and when?

I have a few services, like Catalog Service, Customer Service, Recommendations Service, Order Taking Service and so on; each service has its own keyspace in a Cassandra database.
I have two questions:
1 - For a change in a service: should I first publish the change event (or record) to Kafka and then consume it from that same service in order to update its database, or should I update its database first and then publish the record to Kafka?
2 - How do I choose which changes to publish to Kafka? Should I publish all updates to Kafka, even those of no interest to other services, like "attribute X updated to Y for product Z"?
1) I would suggest you always try to be able to read your own writes. Which operation is more likely to succeed: a replicated ack from Kafka, or a durable Cassandra upsert? If you think Kafka is more durable, then write there first and use a tool like Kafka Connect to write the data down to Cassandra (assuming you really need Cassandra over a global KTable; that's up for debate).
2) There's no straightforward answer. If you think the data will ever be consumed in ways that might be relevant, then produce it. Think of it as an audit log of any and all events. If you want to build an idempotent system that always knows the latest state of any product and all the changes that happened, you can either store the whole object each time as (id, product) pairs, where each update replaces the entire product, or store each delta of what changed and rebuild the state from those.
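To illustrate the "(id, product) pairs" option from (2), a hedged sketch: the topic name, partition/replication counts and payload are made up, but the idea is to key every record by the product id and let log compaction keep at least the latest full state per product.

import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;

public class ProductStateTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // Compacted topic: Kafka retains at least the latest record per key (product id)
            NewTopic topic = new NewTopic("product-state", 6, (short) 3)
                    .configs(Map.of(TopicConfig.CLEANUP_POLICY_CONFIG, TopicConfig.CLEANUP_POLICY_COMPACT));
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}

Producers then send the whole product as the value, keyed by its id, e.g. producer.send(new ProducerRecord<>("product-state", productId, productJson)).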

Can Debezium ensure all events from the same transaction are published at the same time?

I'm starting to explore the use of change data capture to convert the database changes from a legacy and commercial application (which I cannot modify) into events that could be consumed by other systems. Simplifying my real case, let's say that there will be two tables involved, order with the order header details and order_line with the details of each of the products requested.
My current understanding is that events from the two tables will be published to two different Kafka topics, and that I should aggregate them using kafka-streams or ksql. I've seen there are different options to define the window that will be used to select all the related events; however, it is not clear to me how I can be sure that all the events coming from the same database transaction are already in the topic, so that I don't miss any of them.
Is Debezium able to ensure this (that all events from the same transaction are published), or could it happen that, for example, Debezium crashes while publishing the events and only part of the ones generated by the same transaction end up in Kafka?
If so, what's the recommended approach to handle this?
Thanks
Debezium stores the positions of the transaction log entries it has fully processed in Kafka (via Kafka Connect's offset storage), and it uses these positions to resume its work after a crash or similar situations. In the cases where Debezium does lose its position entirely, it will rebuild its state by taking a snapshot of the database again.
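To make that first part concrete: when Debezium runs on Kafka Connect in distributed mode, those positions live in Connect's offset storage topic, configured in the worker properties roughly as below; the topic names are just the conventional examples, adjust them to your setup.

bootstrap.servers=localhost:9092
group.id=connect-cluster
# Debezium's source positions (log offsets) are persisted here and read back on restart
offset.storage.topic=connect-offsets
config.storage.topic=connect-configs
status.storage.topic=connect-status
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter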