Can Debezium ensure all events from the same transaction are published at the same time? - apache-kafka

I'm starting to explore the use of change data capture to convert the database changes from a legacy and commercial application (which I cannot modify) into events that could be consumed by other systems. Simplifying my real case, let's say that there will be two tables involved, order with the order header details and order_line with the details of each of the products requested.
My current understanding is that events from the two tables will be published into two different Kafka topics and I should aggregate them using kafka-streams or ksql. I've seen there are different options to define the window that will be used to select all the related events, however it is not clear to me how I could be sure that all the events coming from the same database transaction are already in the topic, so that I do not miss any of them.
Is Debezium able to ensure this (that all events from the same transaction are published), or could it happen that, for example, Debezium crashes while publishing the events and only part of the ones generated by the same transaction end up in Kafka?
If so, what's the recommended approach to handle this?
Thanks

Debezium stores the positions (offsets) of the transaction log entries it has completely read in Kafka, and it uses these stored positions to resume its work after a crash or a similar interruption. In situations where Debezium does lose its position, it will rebuild its state by taking a snapshot of the database again.
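As an illustration of the aggregation mentioned in the question, here is a minimal Kafka Streams sketch of joining the two change-event topics. The topic names, String serdes, and the extractOrderId/mergeOrderAndLine helpers are placeholders, and note that a KStream-KTable join like this emits output per order_line event as it arrives: it does not by itself tell you that every event from the original database transaction has already been published.

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;

public class OrderJoinSketch {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "order-aggregator");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();

        // Debezium publishes one topic per captured table; these names are placeholders
        KTable<String, String> orders = builder.table("dbserver.inventory.order");
        KStream<String, String> orderLines = builder.stream("dbserver.inventory.order_line");

        orderLines
            // re-key each order_line event by the id of its parent order
            .selectKey((lineId, lineJson) -> extractOrderId(lineJson))
            // join each line with the current state of its order header
            .join(orders, (lineJson, orderJson) -> mergeOrderAndLine(orderJson, lineJson))
            .to("order-with-lines");

        new KafkaStreams(builder.build(), props).start();
    }

    // hypothetical helpers: parse the parent order id out of the line payload,
    // and merge the two payloads into one output record
    private static String extractOrderId(String lineJson) {
        throw new UnsupportedOperationException("placeholder");
    }

    private static String mergeOrderAndLine(String orderJson, String lineJson) {
        throw new UnsupportedOperationException("placeholder");
    }
}
```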

Related

How to make sure two microservices are in sync

I have a kubernetes solution with lots of microservices. Each microservice has its own database and to send data between the services we use kafka.
I have one microservice that generates lots of orders and order lines.
These are saved to the order service's own database and every change should be pushed to Kafka using a Kafka connector setup.
Another microservice holds items and prices. All changes are saved to tables in this service's database and changes are pushed to their own topics using the Kafka connector.
Now I have a third microservice (the calculater) that calculates something based on the data from the previously mentioned services. Right now it just consumes changes from the order, order line, item and price topics, and when it's time it calculates.
The Calculater microservice is scheduled to do the calculation at a certain time each day, but before doing the calculation I'd like to know if the service is up to date with data from the other two microservices.
Is there some kind of best practice on how to do this?
What this should make sure is that I haven't lost a change. Let's say an order line's quantity was changed from 100 to 1; then I want to make sure I have received that change before I start calculating.
If you want to know whether the orders and items microservices have all their data published to Kafka prior to having the calculater execute its logic, that is quite application-specific and it may be hard to come up with a good answer without more details. Is the Kafka connector that sends the orders, order lines and so on from the database to Kafka some kind of CDC connector (i.e. it basically listens to DB table changes and publishes them to Kafka)? If so, most likely you will need some way to compare the latest message in Kafka with the latest row updated in order to know whether the connector has sent all DB updates to Kafka. There may be connectors that expose that information somehow, or you may have to implement something yourself.
On the other side, if what you want is to know whether the calculater has read all the messages that have been published to Kafka by the other services, that is easier. You just need to get the high watermarks (the latest offset of each topic partition) and check that the calculater consumer has actually reached them (so there is no lag). I guess, though, that the topics are continuously updated, so most likely there will always be some lag, but there is nothing you can do about that.
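To make that lag check concrete, here is a rough sketch using the Kafka AdminClient; the group id ("calculater") and broker address are assumptions:

```java
import java.util.Map;
import java.util.Properties;
import java.util.stream.Collectors;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.ListOffsetsResult;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class LagCheck {

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // offsets the calculater consumer group has committed so far
            Map<TopicPartition, OffsetAndMetadata> committed =
                admin.listConsumerGroupOffsets("calculater").partitionsToOffsetAndMetadata().get();

            // the current end offset (high watermark) of each of those partitions
            Map<TopicPartition, OffsetSpec> request = committed.keySet().stream()
                .collect(Collectors.toMap(tp -> tp, tp -> OffsetSpec.latest()));
            Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> endOffsets =
                admin.listOffsets(request).all().get();

            // lag per partition = end offset - committed offset
            committed.forEach((tp, offset) -> {
                long lag = endOffsets.get(tp).offset() - offset.offset();
                System.out.printf("%s lag=%d%n", tp, lag);
            });
        }
    }
}
```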

Schema registry incompatible changes

All the documentation clearly describes how to handle compatible changes with Schema Registry using compatibility types.
But how do you introduce incompatible changes without directly disturbing the downstream consumers, so that they can migrate at their own pace?
We have the following situation (see image) where the producer is producing the same message in both schema versions:
[image: the producer publishing the same message in both schema versions]
The problem is how to migrate the apps and the sink connector in a controlled way, where business continuity is important and the consumers are not allowed to process the same message again (in the new format).
consumers are not allowed to process the same message again (in the new format).
Your consumers need to be aware of the old format while consuming the new one; they need to understand what it means to consume the "same message". That's up to you to code, not something Connect or other consumers can automatically determine, with or without a Registry.
In my experience, the best approach to prevent duplicate record processing across various topics is to persist unique ids (UUIDs) as part of each record, across all schema versions, and then query some source of truth for whether a record has been processed already or not. When it has not been processed, insert its id into that system after the record has been processed.
This may require placing a stream processing application that filters already-processed records out of a topic before the sink connector consumes it.
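A rough sketch of such a filtering step is below, assuming the record key carries the unique id and using a local state store as the record of what has already been seen (in practice that record might live in an external source of truth, as described above); the topic and store names are placeholders:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.ValueTransformerWithKey;
import org.apache.kafka.streams.kstream.ValueTransformerWithKeySupplier;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.StoreBuilder;
import org.apache.kafka.streams.state.Stores;

public class DedupFilterSketch {

    public static StreamsBuilder buildTopology() {
        StreamsBuilder builder = new StreamsBuilder();

        // persistent store of ids that have already been passed through
        StoreBuilder<KeyValueStore<String, Long>> seenIds = Stores.keyValueStoreBuilder(
            Stores.persistentKeyValueStore("seen-ids"), Serdes.String(), Serdes.Long());
        builder.addStateStore(seenIds);

        ValueTransformerWithKeySupplier<String, String, String> dedup =
            () -> new ValueTransformerWithKey<String, String, String>() {
                private KeyValueStore<String, Long> store;

                @SuppressWarnings("unchecked")
                @Override
                public void init(ProcessorContext context) {
                    store = (KeyValueStore<String, Long>) context.getStateStore("seen-ids");
                }

                @Override
                public String transform(String recordId, String value) {
                    if (store.get(recordId) != null) {
                        return null;                          // already seen: drop it below
                    }
                    store.put(recordId, System.currentTimeMillis());
                    return value;                             // first occurrence: pass through
                }

                @Override
                public void close() { }
            };

        builder.<String, String>stream("orders-v2")           // topic written in the new schema
            .transformValues(dedup, "seen-ids")
            .filter((recordId, value) -> value != null)
            .to("orders-v2-deduped");                         // topic the sink connector reads

        return builder;
    }
}
```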
I figure what you are looking for is some kind of equivalent to a topic offset, but one spanning multiple topics. Technically this is not provided by Kafka, and with good reason I'd like to add. The solution would be very specific to each use case, but I figure it all boils down to introducing your own functional offset attribute in both streams.
Consumers will have to maintain state with regard to which messages have been processed, and when switching to another topic filter out messages that were already processed from the other topic. You could use your own sequence numbering or timestamps to keep track of progress across topics (see the sketch below). Using a sequence makes it easier to keep track of progress, as only one value needs to be stored on the consumer side, while using UUIDs or other non-sequential ids will potentially require a more complex state-keeping mechanism.
Keep in mind that switching to a new topic will probably mean that lots of messages will have to be skipped and depending on the amount this might cause a delay that you need to be willing to accept.
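For the sequence-number approach, a minimal consumer-side sketch could look like the following; the topic name, how the sequence is extracted, and how lastProcessedSeq is persisted are all assumptions:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class SequenceAwareConsumer {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "orders-consumer");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        // highest functional sequence number processed from the old topic;
        // in a real application this would be loaded from durable storage
        long lastProcessedSeq = loadLastProcessedSeq();

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders-v2"));        // the new topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    long seq = extractSequence(record);      // hypothetical: read seq from header or payload
                    if (seq <= lastProcessedSeq) {
                        continue;                            // already handled via the old topic: skip
                    }
                    process(record);
                    lastProcessedSeq = seq;                  // in a real app, persist with the side effects
                }
            }
        }
    }

    private static long loadLastProcessedSeq() { return 0L; }                             // placeholder
    private static long extractSequence(ConsumerRecord<String, String> record) { return 0L; } // placeholder
    private static void process(ConsumerRecord<String, String> record) { }                // placeholder
}
```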

How can I use Kafka like a relational database

Good day, and sorry for my poor English. I have an issue; can you help me understand how I can use Kafka and Kafka Streams like a database?
My problem is that I have some microservices and each service has its data in its own database. For reporting purposes I need to collect the data in one place, and for this I chose Kafka. I use Debezium, maybe you know it (change data capture); each table in the relational database becomes a topic in Kafka. And I wrote an application with Kafka Streams (I joined the streams with each other), so far so good. Example: I have topics for ORDER and ORDER_DETAILS; after a while some event will come that should join against these topics, but the problem is that I don't know when this event will come, maybe after minutes, months or years. How can I get the data in the ORDER and ORDER_DETAIL topics after a month or a year? Is it the right way to save data in a topic indefinitely? Can you give me some advice, maybe there are some solutions.
The event will come as soon as there is a change in the database.
Typically, the changes to the database tables are pushed as messages to the topic.
Each and every update to the database will be a Kafka message. Since there is a message for every update, you might be interested in only the latest value (update) for any given key, which will usually be the primary key.
In this case, you can maintain the topic infinitely (retention.ms=-1) but compact (cleanup.policy=compact) it in order to save space.
You may also be interested in configuring segment.ms and/or segment.bytes for further tuning the topic retention.
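As an illustration, such a compacted, indefinitely retained topic could be created programmatically like this (the topic name, partition and replica counts, and the segment.ms value are just assumptions; the same settings can also be applied with the standard command-line tools):

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;

public class CreateCompactedTopic {

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            NewTopic orderTopic = new NewTopic("ORDER", 6, (short) 3).configs(Map.of(
                TopicConfig.CLEANUP_POLICY_CONFIG, TopicConfig.CLEANUP_POLICY_COMPACT, // keep only the latest value per key
                TopicConfig.RETENTION_MS_CONFIG, "-1",                                 // never delete by time
                TopicConfig.SEGMENT_MS_CONFIG, "86400000"));                           // roll segments daily so compaction can run
            admin.createTopics(List.of(orderTopic)).all().get();
        }
    }
}
```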

Is there any way to ensure that duplicate records are not inserted into a Kafka topic?

I have been trying to implement a queuing mechanism using Kafka where I want to ensure that duplicate records are not inserted into the topic I created.
I found that iteration is possible in the consumer. Is there any way we can do this in the producer thread as well?
This is known as exactly-once processing.
You might be interested in the first part of the Kafka FAQ, which describes some approaches to avoiding duplication during data production (i.e. on the producer side):
Exactly once semantics has two parts: avoiding duplication during data production and avoiding duplicates during data consumption.
There are two approaches to getting exactly once semantics during data production:
1. Use a single-writer per partition and every time you get a network error check the last message in that partition to see if your last write succeeded.
2. Include a primary key (UUID or something) in the message and deduplicate on the consumer.
If you do one of these things, the log that Kafka hosts will be duplicate-free. However, reading without duplicates depends on some co-operation from the consumer too. If the consumer is periodically checkpointing its position then if it fails and restarts it will restart from the checkpointed position. Thus if the data output and the checkpoint are not written atomically it will be possible to get duplicates here as well. This problem is particular to your storage system. For example, if you are using a database you could commit these together in a transaction. The HDFS loader Camus that LinkedIn wrote does something like this for Hadoop loads. The other alternative that doesn't require a transaction is to store the offset with the data loaded and deduplicate using the topic/partition/offset combination.
I think there are two improvements that would make this a lot easier:
1. Producer idempotence could be done automatically and much more cheaply by optionally integrating support for this on the server.
2. The existing high-level consumer doesn't expose a lot of the more fine grained control of offsets (e.g. to reset your position). We will be working on that soon.
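As a side note, the first improvement mentioned in that FAQ has since been implemented as the idempotent producer (Kafka 0.11+), where the broker de-duplicates internal retries from the same producer session. A minimal sketch of enabling it, assuming String-serialized records and a local broker:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class IdempotentProducerSketch {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // broker-side de-duplication of retried sends from this producer instance
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");
        props.put(ProducerConfig.ACKS_CONFIG, "all");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // even if this send is retried internally after a network error,
            // the record will be written to the partition only once
            producer.send(new ProducerRecord<>("my-topic", "key-1", "value-1"));
        }
    }
}
```

Note that this only removes duplicates caused by the producer's own retries; application-level re-sends still need a key-based deduplication strategy like the one quoted above.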

Ingesting data from REST api to Kafka

I have many REST APIs to pull data from different data sources, and now I want to publish these REST responses to different Kafka topics. I also want to make sure that duplicate data is not produced.
Are there any tools available to do this kind of operation?
So, in general, a Kafka processing pipeline should be able to handle messages that are sent multiple times. Exactly-once delivery of Kafka messages is a feature that's only been around since mid-2017 (given that I'm writing this in Jan 2018) and Kafka 0.11, so in general, unless you're on the super bleeding edge with your Kafka installation, your pipeline should be able to handle multiple deliveries of the same message.
That's your pipeline, of course. Now you have a problem where a data source may deliver the same message multiple times to your HTTP -> Kafka microservice.
Theoretically you should design your pipeline to be idempotent: multiple applications of the same change message should only affect the data once. This is, of course, easier said than done. But if you manage this then "problem solved": just send duplicate messages through and it doesn't matter. This is probably the best thing to aim for, regardless of whatever once-only-delivery, CAP-theorem-bending magic KIP-98 does. (And if you don't get why this is super magic, well, here's a homework topic :) )
Let's say your input data is posts about users. If your posted data includes some kind of updated_at date, you could create a transaction-log Kafka topic. Set the key to be the user ID and the values to be all the (say) updated_at fields applied to that user. When you're processing an HTTP POST, look up the user in a local KTable for that topic and check whether your post has already been recorded. If it has already been recorded, then don't produce the change into Kafka.
Even without the updated_at field you could save the user document in the KTable. If Kafka is a stream of transaction-log data (the database turned inside out), then KTables are the stream turned right side out: a database again. If the current value in the KTable (the accumulation of all applied changes) matches the object you were given in your post, then you've already applied the changes.
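A rough sketch of the updated_at approach described above, using a KTable materialized from the transaction-log topic and queried through Kafka Streams interactive queries before producing; the topic, store, and class names are placeholders:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StoreQueryParameters;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.QueryableStoreTypes;
import org.apache.kafka.streams.state.ReadOnlyKeyValueStore;

public class UserUpdateDedup {

    private final ReadOnlyKeyValueStore<String, Long> lastUpdateByUser;

    public UserUpdateDedup(Properties streamsProps) {
        StreamsBuilder builder = new StreamsBuilder();
        // KTable over the transaction-log topic: latest updated_at per user id
        builder.table("user-updates",
            Materialized.<String, Long, KeyValueStore<Bytes, byte[]>>as("last-update-by-user")
                .withKeySerde(Serdes.String())
                .withValueSerde(Serdes.Long()));

        KafkaStreams streams = new KafkaStreams(builder.build(), streamsProps);
        streams.start();

        // in a real application, wait until the streams instance reaches RUNNING before querying
        lastUpdateByUser = streams.store(StoreQueryParameters.fromNameAndType(
            "last-update-by-user", QueryableStoreTypes.keyValueStore()));
    }

    /** Called from the HTTP POST handler before producing the change into Kafka. */
    public boolean alreadyRecorded(String userId, long updatedAt) {
        Long seen = lastUpdateByUser.get(userId);
        return seen != null && seen >= updatedAt;
    }
}
```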