Confluent JDBC source Connector incremental starting row - apache-kafka

I want to use Confluent's JDBC source connector to retrieve data from a SQL Server table into Kafka.
I want to use the incrementing mode to start retrieving data from the table only from the moment the connector starts running:
connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
mode=incrementing
incrementing.column.name=id_column_name
When I run this connector, it retrieves all the existing rows from the table, not just the ones inserted after that point in time. I've been checking the connector configuration properties, but I can't seem to find a setting for this situation.
The table doesn't contain any Timestamp values, so I can't use the timestamp.initial and timestamp.column.name properties. It does include a Datetime column, but I don't think that is useful in this case.
How can I do this?
Any help would be greatly appreciated.

You can try to use query-based ingest or manually seed the offsets topic with the appropriate value.
Source: Kafka Connect Deep Dive – JDBC Source Connector by Robin Moffatt
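As a rough illustration of the second option (seeding the offsets topic), here is a minimal Java sketch, assuming a distributed worker whose offsets topic is named connect-offsets and the key/value layout described in that post; the connector name, table identifier, and starting id are placeholders. In practice you should copy the key byte-for-byte from a record the connector has already written (e.g. by reading connect-offsets first), and restart the connector afterwards so it picks up the new offset.
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;
import java.util.Properties;

public class SeedJdbcSourceOffset {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");                 // adjust to your cluster
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        // Offset key written by the JDBC source connector:
        // ["<connector-name>",{"protocol":"1","table":"<table>"}] -- copy it verbatim from connect-offsets.
        String key = "[\"jdbc-source-mssql\",{\"protocol\":\"1\",\"table\":\"dbo.my_table\"}]";

        // Offset value: the highest id the connector should treat as already processed,
        // i.e. the current MAX(id_column_name) at the moment you want ingestion to start.
        String value = "{\"incrementing\":12345}";

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("connect-offsets", key, value)).get();
        }
    }
}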

Related

Is there a way to configure Debezium to store in Kafka not all the changes from the database but only certain ones?

I have MongoDB and I need to send the changes from a certain query to a Kafka broker. I heard that Debezium tracks changes from the database and stores them in Kafka. But is there a way to configure that process to store not all the changes that happen in the database, but only certain ones?
You can perform some filtering using Debezium's single message transform (SMT) plugin for Kafka Connect. Check its documentation to see if it has the features that you need: https://debezium.io/documentation/reference/stable/transformations/filtering.html
Depending on the source technology, you could.
When using PostgreSQL as a source, for example, you can define which operations to include in the PG publication that Debezium reads.
More info in the Debezium docs.

How do we check how many records have been loaded so far into the DB from a Kafka topic?

I'm trying to load data from a Kafka topic into Postgres using the JDBC sink connector. Now, how do we know how many records have been loaded so far into Postgres? As of now I keep checking the number of records in the DB using a SQL query. Is there any other way I can know about it?
Kafka Connect doesn't track this. I see nothing wrong with SELECT COUNT(*) on the table; however, this doesn't exclude other processes writing to that table as well.
It is not possible in Kafka itself, because once you have sunk the records into the target DB, Kafka has already done its job. But you can track the number of records you are writing from the SinkRecord collections your sink task receives, and write the count to a local file or insert it into a Kafka state store.
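If you do stick with polling the database as the first answer suggests, a tiny JDBC sketch can do the counting; the connection URL, credentials, and table name below are placeholders.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class SinkRowCount {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:postgresql://localhost:5432/mydb";   // placeholder connection details
        try (Connection conn = DriverManager.getConnection(url, "postgres", "postgres");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM my_sink_table")) {
            rs.next();
            System.out.println("Rows in sink table so far: " + rs.getLong(1));
        }
    }
}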

Why does Kafka Connect work?

I'm trying to wrap my head around how Kafka Connect works and I can't understand one particular thing.
From what I have read and watched, I understand that Kafka Connect allows you to send data into Kafka using Source Connectors and read data from Kafka using Sink Connectors. And the great thing about this is that Kafka Connect somehow abstracts away all the platform-specific things and all you have to care about is having proper connectors. E.g. you can use a PostgreSQL Source Connector to write to Kafka and then use Elasticsearch and Neo4J Sink Connectors in parallel to read the data from Kafka.
My question is: how does this abstraction work? Why are Source and Sink connectors written by different people able to work together? In order to read data from Kafka and write them anywhere, you have to expect some fixed message structure/schema, right? E.g. how does an Elasticsearch Sink know in advance what kind of messages a PostgreSQL Source would produce? What if I replaced the PostgreSQL Source with a MySQL Source? Would the produced messages have the same structure?
It would be logical to assume that Kafka requires some kind of fixed message structure, but according to the documentation the SourceRecord that is sent to Kafka does not necessarily have a fixed structure:
...can have arbitrary structure and should be represented using org.apache.kafka.connect.data objects (or primitive values). For example, a database connector might specify the sourcePartition as a record containing { "db": "database_name", "table": "table_name" } and the sourceOffset as a Long containing the timestamp of the row.
In order to read data from Kafka and write them anywhere, you have to expect some fixed message structure/schema, right?
Exactly. Refer to the Javadoc on the Struct and Schema classes of the Connect API, as well as the Converter interface.
Of course, those are not strict requirements, but without them the framework doesn't work across different sources and sinks. This is no different from the contract between producers and consumers regarding serialization.
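To make that contract concrete, here is a minimal sketch (topic, field names, and the partition/offset maps are made up for illustration) of what a source connector hands to the framework: an explicit Schema plus a Struct conforming to it, wrapped in a SourceRecord. A sink connector on the other side only ever sees this schema-and-value pair, after the configured Converter has serialized and deserialized it, never anything PostgreSQL- or MySQL-specific.
import org.apache.kafka.connect.data.Schema;
import org.apache.kafka.connect.data.SchemaBuilder;
import org.apache.kafka.connect.data.Struct;
import org.apache.kafka.connect.source.SourceRecord;
import java.util.Map;

public class ConnectContractExample {
    public static void main(String[] args) {
        // The schema describes the value in Connect's own type system,
        // independent of the source database's native types.
        Schema valueSchema = SchemaBuilder.struct()
                .name("com.example.User")                     // illustrative schema name
                .field("id", Schema.INT64_SCHEMA)
                .field("email", Schema.STRING_SCHEMA)
                .build();

        Struct value = new Struct(valueSchema)
                .put("id", 42L)
                .put("email", "user@example.com");

        // sourcePartition/sourceOffset are free-form maps the connector uses to resume;
        // they are bookkeeping, not part of the data contract that sinks consume.
        Map<String, ?> sourcePartition = Map.of("db", "database_name", "table", "table_name");
        Map<String, ?> sourceOffset = Map.of("timestamp", 1700000000000L);

        SourceRecord record = new SourceRecord(
                sourcePartition, sourceOffset,
                "users-topic",                                // illustrative topic name
                valueSchema, value);                          // a key schema/key pair can be added the same way

        System.out.println(record);
    }
}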

JDBC sink connector insert/upsert based on max timestamp?

I'm very new to Kafka Connect.
I am inserting records from multiple sources into one table.
In some cases, it may be possible for some records to reach before others.
Since I cannot control which source will pull which record first, I want to add a check on the timestamp key of the record.
I have a key called "LastModified_timestamp" in my schema where I store the timestamp of the latest state of my record.
I want to add a check to my JDBC sink connector where I can upsert a record based on comparing the value of LastModified_timestamp.
I want to ignore the records that have an older timestamp and only upsert/insert the latest one. I couldn't find any configuration to achieve this.
Is there any way by which I can achieve this?
Will writing a custom query help in this case?
The JDBC Sink connector does not support this kind of feature. You have two options to consider:
Single Message Transforms (SMTs) - these apply logic to records as they pass through Kafka Connect. SMTs are great for things like dropping columns, changing datatypes, etc., but they are not appropriate for more complex processing and logic, including logic that needs to span multiple records, as yours does here.
Process the data in the source Kafka topic first, to apply the necessary logic. You can do this with Kafka Streams, KSQL, and several other stream processing frameworks (e.g. Spark, Flink, etc). You'd need some kind of stateful logic that could work out whether a record is older than one already processed; see the sketch after this answer.
Can you describe more about your upstream source for the data? It might be there's a better way to orchestrate the data coming through to enforce the ordering.
A final idea would be to land all records in your target DB and then use logic in the database query that consumes it to select the most recent record (based on LastModified_timestamp) for a given key.
Disclaimer: I work for Confluent, the company behind the open-source KSQL project.
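A rough sketch of option 2 with Kafka Streams, assuming the topic values are JSON strings that carry the LastModified_timestamp field; the topic names are placeholders and the crude Jackson parse stands in for whatever serde you actually use. It keeps only the record with the newest LastModified_timestamp per key, and the JDBC sink connector (insert.mode=upsert) then reads the cleaned topic instead of the raw one.
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.Produced;
import java.util.Properties;

public class LatestByTimestamp {
    private static final ObjectMapper MAPPER = new ObjectMapper();

    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();
        builder.stream("records-raw", Consumed.with(Serdes.String(), Serdes.String()))
               .groupByKey(Grouped.with(Serdes.String(), Serdes.String()))
               // Keep whichever version of the record carries the newer LastModified_timestamp.
               .reduce((current, next) -> timestamp(next) >= timestamp(current) ? next : current)
               .toStream()
               // Point the JDBC sink connector at this topic instead of the raw one.
               .to("records-latest", Produced.with(Serdes.String(), Serdes.String()));

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "latest-by-timestamp");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        new KafkaStreams(builder.build(), props).start();
    }

    // Crude extraction of the timestamp from a JSON value; replace with your real serde/type.
    private static long timestamp(String json) {
        try {
            return MAPPER.readTree(json).get("LastModified_timestamp").asLong();
        } catch (Exception e) {
            throw new RuntimeException("Record without a readable LastModified_timestamp", e);
        }
    }
}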

Querying MySQL tables using Apache Kafka

I am trying to use Kafka Streams for achieving a use-case.
I have two tables in MySQL - User and Account. And I am getting events from MySQL into Kafka using a Kafka MySQL connector.
I need to get all user-IDs within an account from within Kafka itself.
So I was planning to use a KStream on the MySQL output topic, process it to form an output, and publish it to a topic with the account-id as the key and the userIds, separated by commas (,), as the value.
Then I can use interactive queries to get all userIds for an account-id, with the get() method of the ReadOnlyKeyValueStore class.
Is this the right way to do this? Is there a better way?
Can KSQL be used here?
You can use Kafka Connect to stream data in from MySQL, e.g. using Debezium. From there you can use Kafka Streams, or KSQL, to transform the data, including re-keying (which I think is what you're looking to do here), as well as joining it to other streams.
If you ingest the data from MySQL into a topic with log compaction set then you are guaranteed to always have the latest value for every key in the topic.
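A rough Kafka Streams sketch of that re-keying plus an interactive query; topic names are placeholders, and it assumes the User topic has already been flattened to (userId, accountId) string pairs (with Debezium you would first extract those fields from the change-event payload). It re-keys by account-id, aggregates the userIds into a comma-separated list, and exposes the result through a queryable store.
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StoreQueryParameters;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.QueryableStoreTypes;
import org.apache.kafka.streams.state.ReadOnlyKeyValueStore;
import java.util.Properties;

public class UsersByAccount {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();
        builder.stream("mysql-users", Consumed.with(Serdes.String(), Serdes.String()))
               // Re-key from userId to accountId, keeping the userId as the value.
               .map((userId, accountId) -> KeyValue.pair(accountId, userId))
               .groupByKey(Grouped.with(Serdes.String(), Serdes.String()))
               // Aggregate the userIds per account into a comma-separated list,
               // materialized as a store that interactive queries can read.
               .aggregate(
                       () -> "",
                       (accountId, userId, userIds) -> userIds.isEmpty() ? userId : userIds + "," + userId,
                       Materialized.<String, String, KeyValueStore<Bytes, byte[]>>as("users-by-account")
                               .withKeySerde(Serdes.String())
                               .withValueSerde(Serdes.String()));

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "users-by-account");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();

        // Interactive query: once the app reaches the RUNNING state, look up the userIds for an account.
        ReadOnlyKeyValueStore<String, String> store = streams.store(
                StoreQueryParameters.fromNameAndType("users-by-account", QueryableStoreTypes.keyValueStore()));
        System.out.println(store.get("some-account-id"));
    }
}
Note that this naive aggregation appends the userId again on every change event for that user, so a real implementation would de-duplicate (e.g. aggregate into a set) before joining the values back together.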
I would take a look at Striim if you want built-in CDC and interactive continuous SQL queries on the streaming data in one UI. More info here:
http://www.striim.com/blog/2017/08/making-apache-kafka-processing-preparation-kafka/