Filter Repeated Messages In Kafka

PREFACE:
In our organization we're trying to use Kafka to solve a problem that involves capturing changes in an Oracle database and sending them through Kafka. It is, in fact, a CDC pipeline; we are using Kafka Connect for that.
We capture the changes in Oracle using Oracle Flashback queries, which give us the timestamp of each change and the operation involved (insert, delete, update).
Once a change is made to a table we observe, the connector publishes it to a topic, and we then read this topic using Kafka Streams.
The problem is that sometimes identical rows appear in the Flashback query: an update that changed nothing still triggers a Flashback change, and if the table has 100 columns while we watch only 20, we see repeated rows whenever none of those 20 fields has changed.
We use Flashback to get changed rows (including deleted ones). In the connector we use timestamp+incrementing mode (the timestamp is obtained from the versions_starttime field of the Flashback query).
Important: we can't touch the database beyond this; that is, we can't create triggers instead of using the existing Flashback scheme.
THE QUESTION
We're trying to filter records in Kafka: if a (key, value) pair is identical in content to one we've already seen, we want to discard it. Note that this is not exactly-once semantics; the record may be repeated with large timestamp differences.
If I use a KTable to check the last value of each record, how efficient will this be after a long period?
The internal state store of the consumers is handled by RocksDB and a backing changelog topic in Kafka, and if I use a non-windowed KTable this state could end up being very large.
What is considered a good approach in this scenario, one that doesn't overload the consumers' internal state store but can still tell whether the current record was already processed some time ago?
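As a middle ground between an unbounded KTable and no state at all, a common pattern is windowed deduplication: keep only a hash of the last value seen per key and expire entries after a retention period, which bounds the state. In Kafka Streams this would typically be done with a windowed state store inside a transformer; the sketch below simulates just the logic in plain Python (the class name, TTL, and hashing choice are illustrative assumptions, not an existing API):

```python
import hashlib
import time


class DedupFilter:
    """Sketch of windowed deduplication: remember a hash of the last
    value seen per key, and evict entries older than a TTL so the
    state stays bounded (the same idea as a windowed store in
    Kafka Streams)."""

    def __init__(self, ttl_seconds=24 * 3600, clock=time.time):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable for testing
        self.state = {}  # key -> (value_hash, last_seen_timestamp)

    def is_duplicate(self, key, value):
        now = self.clock()
        # Evict expired entries to keep the store from growing forever.
        self.state = {k: v for k, v in self.state.items()
                      if now - v[1] <= self.ttl}
        value_hash = hashlib.sha256(repr(value).encode()).hexdigest()
        previous = self.state.get(key)
        self.state[key] = (value_hash, now)
        return previous is not None and previous[0] == value_hash
```

The trade-off is explicit: a record repeated after more than the TTL will pass through again, which matches the question's observation that repeats arrive with large timestamp differences, so the TTL must be chosen larger than the longest expected repeat gap you care about.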

Related

Schema registry incompatible changes

All the documentation clearly describes how to handle compatible changes with Schema Registry using compatibility types.
But how do you introduce incompatible changes without disturbing the downstream consumers directly, so that they can migrate at their own pace?
We have the following situation (see image), where the producer is producing the same message in both schema versions:
The problem is how to migrate the apps and the sink connector in a controlled way, where business continuity is important and the consumers are not allowed to process the same message again (in the new format).
Your consumers need to be aware of the old format while consuming the new one; they need to understand what it means to consume the "same message". That's up to you to code, not something Connect or other consumers can automatically determine, with or without a Registry.
In my experience, the best approach to prevent duplicate record processing across various topics is to persist unique ids (UUIDs) as part of each record, across all schema versions, and then query some source of truth for what has already been processed. For records not yet processed, insert their ids into that system after the records have been handled.
This may require placing a stream processing application that filters already-processed records out of a topic before the sink connector consumes it.
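A minimal sketch of that idea, with an in-memory set standing in for the external source of truth (all names here are invented for illustration; a real deployment would use a database or key-value store and make the check-and-mark step atomic):

```python
class ProcessedIdStore:
    """Stand-in for a durable source of truth (e.g. a database table
    or a Redis set). In production the membership check and the insert
    should be atomic, or at least idempotent."""

    def __init__(self):
        self._ids = set()

    def seen(self, record_id):
        return record_id in self._ids

    def mark(self, record_id):
        self._ids.add(record_id)


def filter_unprocessed(records, store):
    """Yield only records whose id has not been processed yet,
    marking each one as processed after it is emitted."""
    for record in records:
        if not store.seen(record["id"]):
            yield record
            store.mark(record["id"])
```

Because the id travels with the record across both schema versions, the same filter works no matter which topic or format a duplicate arrives in.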
I figure what you are looking for is a kind of equivalent to a topic offset, but one spanning multiple topics. Technically this is not provided by Kafka, and with good reason, I'd like to add. The solution will be very specific to each use case, but I figure it all boils down to introducing your own functional offset attribute in both streams.
Consumers will have to maintain state regarding which messages have been processed, so that when switching to another topic they can filter out messages already processed from the first one. You could use your own sequence numbering or timestamps to keep track of progress across topics. Using a sequence makes tracking progress easier, since only one value needs to be stored on the consumer end; UUIDs or other non-sequential ids will potentially require a more complex state-keeping mechanism.
Keep in mind that switching to a new topic will probably mean that lots of messages will have to be skipped and depending on the amount this might cause a delay that you need to be willing to accept.
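The sequence-based variant can be sketched as follows (the `seq` attribute is the hypothetical application-level "functional offset" described above, not anything Kafka provides):

```python
def consume_with_sequence(messages, last_processed_seq):
    """Sketch of sequence-based progress tracking across topics: each
    message carries an application-level 'seq' attribute, and a
    consumer switching to a new topic skips everything at or below
    the highest sequence it has already processed elsewhere."""
    processed = []
    for msg in messages:
        if msg["seq"] <= last_processed_seq:
            continue  # already handled via the other topic
        processed.append(msg)
        last_processed_seq = msg["seq"]  # single value to persist
    return processed, last_processed_seq
```

Only `last_processed_seq` needs to survive restarts, which is why a sequence is cheaper to track than a set of UUIDs.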

In Kafka, how to handle deleted rows from source table that are already reflected in Kafka topic?

I am using a JDBC source connector in timestamp+incrementing mode to fetch a table from Postgres, using Kafka Connect. Updates to the data are reflected in the Kafka topic, but the deletion of records has no effect. So, my questions are:
Is there some way to handle deleted records?
How do I handle records that are deleted but still present in the Kafka topic?
The recommendation is to either 1) adjust your source database to be append/update-only as well, marking deletions via a boolean or timestamp column that is filtered out when Kafka Connect queries the table.
If your database is running out of space, you can then delete old records, which should already have been processed by Kafka.
Or 2) use CDC tools to capture delete events immediately, rather than missing them in a periodic table scan. Debezium is a popular option for Postgres.
A Kafka topic can be seen as an "append-only" log. It keeps all messages for as long as you like, but Kafka is not built to delete individual messages from a topic.
In the scenario you are describing, it is common for the downstream application (consuming the topic) to handle the information about a deleted record.
As an alternative, you could set the cleanup.policy of your topic to compact, which means it will eventually keep only the latest value for each key. If you define the key of a message as the primary key of the Postgres table, your topic will eventually delete a record when you produce a message with the same key and a null value (a tombstone). However, I am not sure whether your connector is flexible enough to do this.
Depending on what you do with the data in the Kafka topic, this still might not solve your problem, as the downstream application will read both records: the original one and the null message marking the deletion.
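The eventual effect of compaction plus tombstones can be illustrated with a small simulation (this is not the broker's actual cleaner code, just the outcome it converges to):

```python
def compact(log):
    """Simulate what cleanup.policy=compact eventually produces: keep
    only the latest value per key, and drop a key entirely once its
    latest record is a tombstone (a null value)."""
    latest = {}
    for key, value in log:
        latest[key] = value  # later records overwrite earlier ones
    # After compaction plus tombstone removal, deleted keys disappear.
    return {k: v for k, v in latest.items() if v is not None}
```

Note the caveat from the answer: a consumer reading the topic before compaction runs will still see the original record followed by the tombstone, so it must interpret null values as deletions itself.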

How can I use Kafka like a relational database

Good day, and sorry for my poor English. I have an issue; can you help me understand how I can use Kafka and Kafka Streams like a database?
My problem is that I have some microservices, and each service keeps its data in its own database. For reporting purposes I need to collect the data in one place, and for this I chose Kafka. I use Debezium (change data capture), so each table in the relational databases becomes a topic in Kafka. I wrote an application with Kafka Streams (joining the streams to each other), and so far so good. For example, I have topics for ORDER and ORDER_DETAILS; at some point an event will arrive that needs to join against these topics, but I don't know when that event will come, maybe after minutes, or months, or years. How can I still get the data in the ORDER and ORDER_DETAILS topics after a month or a year? Is it right to keep data in a topic indefinitely? Can you give me some advice, or maybe there are existing solutions?
The event will come as soon as there is a change in the database.
Typically, the changes to the database tables are pushed as messages to the topic.
Each and every update to the database will be a Kafka message. Since there is a message for every update, you might be interested in only the latest value (update) for any given key, which will usually be the primary key.
In this case, you can maintain the topic infinitely (retention.ms=-1) but compact (cleanup.policy=compact) it in order to save space.
You may also be interested in configuring segment.ms and/or segment.bytes for further tuning the topic retention.

Can Debezium ensure all events from the same transaction are published at the same time?

I'm starting to explore the use of change data capture to convert the database changes from a legacy and commercial application (which I cannot modify) into events that could be consumed by other systems. Simplifying my real case, let's say that there will be two tables involved, order with the order header details and order_line with the details of each of the products requested.
My current understanding is that events from the two tables will be published into two different Kafka topics and I should aggregate them using Kafka Streams or ksql. I've seen there are different options to define the window that will be used to select all the related events; however, it is not clear to me how I can be sure that all the events coming from the same database transaction are already in the topic, so that I do not miss any of them.
Is Debezium able to ensure this (all events from same transaction are published) or it could happen that, for example, Debezium crashes while publishing the events and only part of the ones generated by the same transaction are in Kafka?
If so, what's the recommended approach to handle this?
Thanks
Debezium stores the positions of the transaction logs it has completely read in Kafka, and it uses these positions to resume its work after a crash or similar situation. In other cases that may sometimes happen, where Debezium loses its position, it will restore it by taking a snapshot of the database again.

JDBC sink connector insert/upsert based on max timestamp?

I'm very new to Kafka Connect.
I am inserting records from multiple sources into one table.
In some cases, it may be possible for some records to reach before others.
Since I cannot control which source will pull which record first, I want to add a check on the timestamp key of the record.
I have a key called "LastModified_timestamp" in my schema where I store the timestamp of the latest state of my record.
I want to add a check to my JDBC sink connector so that it upserts a record based on comparing the value of LastModified_timestamp.
I want to ignore records with an older timestamp and only upsert/insert the latest one, but I couldn't find any configuration to achieve this.
Is there any way by which I can achieve this?
Will writing a custom query help in this case?
The JDBC Sink connector does not support this kind of feature. You have two options to consider:
Single Message Transforms (SMTs) - these apply logic to records as they pass through Kafka Connect. SMTs are great for things like dropping columns, changing datatypes, etc., BUT they are not appropriate for more complex processing and logic, including logic that needs to span multiple records as yours does here.
Process the data in the source Kafka topic first, to apply the necessary logic. You can do this with Kafka Streams, KSQL, or several other stream processing frameworks (e.g. Spark, Flink, etc.). You'd need some kind of stateful logic that could work out whether a record was older than one already processed.
Can you describe more about your upstream source for the data? It might be there's a better way to orchestrate the data coming through to enforce the ordering.
A final idea would be to land all records in your target DB and then use logic in the database query that consumes it to select the most recent record (based on LastModified_timestamp) for a given key.
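To illustrate what a "custom query" approach in the target database would need to do: a timestamp-guarded upsert, where an incoming row only overwrites the stored one if its timestamp is newer. The sketch below uses SQLite's upsert syntax because it runs standalone (the table and column names are invented; Postgres's `INSERT ... ON CONFLICT` clause supports the same `WHERE` guard):

```python
import sqlite3

# Hypothetical table mirroring the sink target: one row per business
# key, with a last_modified column guarding against out-of-order writes.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE records (
    id TEXT PRIMARY KEY,
    payload TEXT,
    last_modified INTEGER)""")


def upsert_if_newer(record_id, payload, last_modified):
    """Insert the record, or update it only when the incoming timestamp
    is newer than what the table already holds; stale records are
    silently ignored."""
    conn.execute(
        """INSERT INTO records (id, payload, last_modified)
           VALUES (?, ?, ?)
           ON CONFLICT(id) DO UPDATE SET
               payload = excluded.payload,
               last_modified = excluded.last_modified
           WHERE excluded.last_modified > records.last_modified""",
        (record_id, payload, last_modified))


upsert_if_newer("a", "new-state", 200)
upsert_if_newer("a", "stale-state", 100)  # ignored: older timestamp
row = conn.execute(
    "SELECT payload, last_modified FROM records WHERE id = 'a'"
).fetchone()
```

Because the comparison happens inside the database, this works even when multiple sources write concurrently, which the connector-side options above cannot guarantee on their own.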
Disclaimer: I work for Confluent, the company behind the open-source KSQL project.