Kafka JDBC connector not picking up new commits - apache-kafka

I am currently using a Kafka JDBC connector to poll records from an Oracle db. The connector properties are set to use timestamp mode and we have provided a simple select query in the properties (not using a where clause) - based on my understanding this should work.
However currently when instantiating the connector I can see the initial query does pull out all of the records it should and does publish them to the Kafka consumer - but any new commits to the oracle db are not picked up and the connector just sits polling without finding any new info, and maintaining its offset.
No exceptions are being thrown in the connector, and no indication of a problem other than it is not picking up the new commits in the db.
One thing of note, which i have been unable to prove makes a difference, is that the fields in the oracle db are all nullable. But i have tested changing that for the timestamp field, and it had no effect and the same behaviour continued. I have also tested in bulk mode and it works fine and does pick up new commits, though I cannot use bulk mode as we cannot duplicate the records for the system.
Does anyone have any idea why the connector is unable to pick up new commits for timestamp mode?

What does your properties file look like? You need to make sure to use an incrementing column or a time stamp column.
If you you are using a time stamp column, is it getting updated on the commit?
Regarding nulls, You can tweak your query to coalesce the null column to a value. Alternatively, I think there is a setting to allow nullable columns.

Related

Kafka connect: How to handle database schema/table changes

Wondering if there is a documented process on how to handle database schema changes. I am using Debezium source connector for postgres and confluent JDBC Sink connector to replicate database changes. I need to do some changes in database as below
Add new columns to existing table
Modify database column type and update name.
I am not sure what is the best way to do this. Solution that I can think if is
Stop source connector
Wait for sinks to consume all messages
Upgrade the databases
Start source and sink connector
Debezium will automatically add new fields in the record schema for new columns. So you would update your consumer and downstream systems first to prepare for those events. No need to stop the source...
If you change types and names, then you might run into backwards incompatible schema changes, and these operations are generally not recommended. Instead, always add new columns but "deprecate" and dont use the old ones. After you are done reading events from those old columns in all other systems, then drop those columns.

Mimic `schema_only` snapshot mode with Debezium PostgreSQL connector

The MySQL connector has a snapshot mode schema_only:
the connector runs a snapshot of the schemas and not the data. This setting is useful when you do not need the topics to contain a consistent snapshot of the data but need them to have only the changes since the connector was started.
The PostgreSQL connector does not have the schema_only option. I am wondering if the following pattern would work to mimic that capability:
Use initial for snapshot mode.
Then, for every included table, add the following config:
"snapshot.select.statement.overrides": "schema.table1,schema.table2...",
"snapshot.select.statement.overrides.schema.table1": "SELECT * FROM [schema].[table1] LIMIT 0"
"snapshot.select.statement.overrides.schema.table2": "SELECT * FROM [schema].[table2] LIMIT 0"
...
According to the docs, snapshot.select.statement.overrides:
Specifies the table rows to include in a snapshot. Use the property if you want a snapshot to include only a subset of the rows in a table. This property affects snapshots only. It does not apply to events that the connector reads from the log.
It seems as though the above procedure would include the relevant schemas for debezium, while preventing any read events from being emitted.
Are there complications with this technique that I am not taking into account?

Is it right use-case of KSql

I am using KStreams where I need to de-duplicate the data. Source ingests duplicated data due to many reasons i.e data itself duplicate, re-partitioning.
Currently using Redis for this use-case where data is stored something as below
id#object list-of-applications-processed-this-id-and-this-object
As KSQL is implemented on top of RocksDB which is also a Key-Value database, can I use KSql for this use case?
At the time of successful processing, I would add an entry to KSQL. At the time of reception, I will have to check the existence of the id in KSQL.
Is it correct use case as per KSql design in the event processing world?
If you want to use to use ksqlDB as a cache, you can create a TABLE using the topic as data source. Note that a CREATE TABLE statement by itself, does only declare a schema (it does not pull in any data into ksqlDB yet).
CREATE TABLE inputTable <schemaDefinition> WITH(kafka_topic='...');
Check out the docs for more details: https://docs.ksqldb.io/en/latest/developer-guide/ksqldb-reference/create-table/
To pull in the data, you can create a second table via:
CREATE TABLE cache AS SELECT * FROM inputTable;
This will run a query in the background, that read the input data and puts the result into the ksqlDB server. Because the query is a simple SELECT * it effectively pulls in all data from the topic. You can now issue "pull queries" (i.e, lookups) against the result to use TABLE cache as desired: https://docs.ksqldb.io/en/latest/developer-guide/ksqldb-reference/select-pull-query/
Future work:
We are currently working on adding "source tables" (cf https://github.com/confluentinc/ksql/pull/7474) that will make this setup simpler. If you declare a source table, you can do the same with a single statement instead of two:
CREATE SOURCE TABLE cache <schemaDefinition> WITH(kafka_topic='...');

Kafka Connect and Custom Queries

I'm interested in using the Kafka Source JDBC connector to perform to a publish to Kafka, for when an Invoice gets created. On the source end, it's broken up into 2 tables Invoice, and InvoiceLine.
Is this possible, using custom queries. What would the query look like?
Also since its polling, what gets published could contain one or more invoices in a topic?
Thanks
Yes, you can use custom queries. From the docs:
Custom Query: The source connector supports using custom queries instead of copying whole tables. With a custom query, one of the other update automatic update modes can be used as long as the necessary WHERE clause can be correctly appended to the query. Alternatively, the specified query may handle filtering to new updates itself; however, note that no offset tracking will be performed (unlike the automatic modes where incrementing and/or timestamp column values are recorded for each record), so the query must track offsets itself.

DB2 updated rows since last check

I want to periodically export data from db2 and load it in another database for analysis.
In order to do this, I would need to know which rows have been inserted/updated since the last time I've exported things from a given table.
A simple solution would probably be to add a timestamp to every table and use that as a reference, but I don't have such a TS at the moment, and I would like to avoid adding it if possible.
Is there any other solution for finding the rows which have been added/updated after a given time (or something else that would solve my issue)?
There is an easy option for a timestamp in Db2 (for LUW) called
ROW CHANGE TIMESTAMP
This is managed by Db2 and could be defined as HIDDEN so existing SELECT * FROM queries will not retrieve the new row which would cause extra costs.
Check out the Db2 CREATE TABLE documentation
This functionality was originally added for optimistic locking but can be used for such situations as well.
There is a similar concept for Db2 z/OS - you have to check that out as I have not tried this one.
Of cause there are other ways to solve it like Replication etc.
That is not possible if you do not have a timestamp column. With a timestamp, you can know which are new or modified rows.
You can also use the TimeTravel feature, in order to get the new values, but that implies a timestamp column.
Another option, is to put the tables in append mode, and then get the rows after a given one. However, this option is not sure after a reorg, and affects the performance and space utilisation.
One possible option is to use SQL replication, but that needs extra tables for staging.
Finally, another option is to read the logs, with the db2ReadLog API, but that implies a development. Also, just appliying the archived logs into the new database is possible, however the database will remain in roll forward pending.