Kafka Connect: re-read entire file for ksqlDB debugging, or can ksqlDB insert all existing events after query creation? - apache-kafka

I'm starting to develop with ksqlDB alongside Kafka Connect. Kafka Connect is awesome and everything works well, and it deliberately does not re-read records it detects it has already read in the past (extremely useful for production). But for developing and debugging ksqlDB queries it is necessary to replay the data, as ksqlDB only creates table entries on the fly from emitted changes; if nothing is replayed, the query under test stays empty. Any advice on how to replay a CSV file with Kafka Connect after the file has been ingested for the first time? Or maybe ksqlDB has the possibility to re-read the whole topic after the table is created. Does anyone have an answer for a beginner?

Create the source connector with a different name, or give the CSV file a new name. Both should cause it to be re-read.
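For example (a minimal sketch; the question doesn't say which connector is used, so this assumes the stock FileStreamSourceConnector with placeholder names and paths), either a new connector name or a new file path gives Kafka Connect a fresh offset key, so the file is read again:

# a new connector name means offsets are tracked under a fresh key
name=csv-replay-2
connector.class=org.apache.kafka.connect.file.FileStreamSourceConnector
# alternatively, point at a renamed copy of the CSV file
file=/data/input-copy.csv
topic=csv_topic
tasks.max=1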

I also had a problem in my configuration file, and ksqlDB did not recognize the
SET 'auto.offset.reset'='earliest';
option. Use the command above (note the single quotes) to force ksqlDB to re-read the whole topic from the beginning after a table/stream creation command. You have to set it manually every time you connect via the ksql CLI or a client.
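As a rough end-to-end sketch (stream, topic and column names are made up, and it assumes the CSV connector writes comma-delimited values into csv_topic):

-- make the session read topics from the beginning
SET 'auto.offset.reset'='earliest';

-- hypothetical stream over the topic the CSV connector writes to
CREATE STREAM csv_stream (id INT, name VARCHAR, amount DOUBLE)
  WITH (KAFKA_TOPIC='csv_topic', VALUE_FORMAT='DELIMITED');

-- the push query now sees all existing records, not just new ones
SELECT * FROM csv_stream EMIT CHANGES;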

Related

debezium connector failover mechanism

I'm learning about Debezium connectors and I'm using Debezium for PostgreSQL. I have a small question to clarify.
Imagine a situation like this: I have a Debezium connector for a table called tableA, and changes happening on that table are published to a topic called topicA. The connector works without any issue and changes are published to the topic. Now suppose that for some reason I need to delete my connector and start a new connector with the same configuration, for the same table, publishing to the same topic. So there is a time gap between when I stop my connector and when I start a new one with the same config. What happens to the data that changes on tableA during that time?
Will it resume from where it stopped, or what will happen?
Dushan, the answer depends on how the connector stops. The various scenarios are described here:
https://debezium.io/documentation/reference/stable/connectors/postgresql.html#postgresql-kafka-connect-process-stops-gracefully
In the ideal scenario, the Log Sequence Number (LSN) is recorded in Kafka Connect's offsets topic. Unless that topic is re-created or its messages expire, the stored LSN is kept, and on restart the connector will resume from that position.
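As a rough illustration only (hostnames, credentials, table and slot names are placeholders, and the property names follow the Debezium PostgreSQL connector; 1.x uses database.server.name where 2.x uses topic.prefix): re-creating the connector with the same name and the same replication slot lets it resume from the stored offsets, while the slot retains the WAL produced during the gap:

name=inventory-connector
connector.class=io.debezium.connector.postgresql.PostgresConnector
database.hostname=localhost
database.port=5432
database.user=debezium
database.password=dbz
database.dbname=inventory
# the replication slot keeps WAL while the connector is stopped
slot.name=debezium_slot
table.include.list=public.tableA
topic.prefix=dbserver1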

FilePulse SourceConnector

I would like to continuously read a CSV file into ksqlDB with the FilePulse source connector, but it does not work correctly:
a) the connector reads the file only once, or
b) the connector reads all data from the file, but in that case there are duplicates in the Kafka topic (every time the connector reads the appended file it inserts all the data from the file into the topic, not only the changed data).
Are there any options to solve this (to continuously read only appended data from the file, or to remove duplicates in the Kafka topic)?
Thank you
To my knowledge, the file source connector doesn't track the file content. The connector only sees a modified file, so it reads the whole thing on any update. Otherwise, reading the file only once is the expected behaviour; handle this in your processing logic (resetting your consumer offsets if you need to reprocess), for example by making a table in ksqlDB so that only the latest value per key is kept.
If you want to tail a file for appends, other options like the spooldir connector, or Filebeat/Fluentd would be preferred (and are actually documented as being production-grade solutions for reading files into Kafka)
Disclaimer: I'm the author of Connect FilePulse
Connect FilePulse is probably not the best solution for continuously reading files, and as already mentioned in other answers it might be a good idea to use solutions like Filebeat, Fluentd or Logstash.
But FilePulse actually does support continuous reading, using the LocalRowFileInputReader with the reader's property read.max.wait.ms. Here is an older answer to a similar question: Stack Overflow: How can be configured kafka-connect-file-pulse for continuous reading of a text file?
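As a very rough sketch (only LocalRowFileInputReader and read.max.wait.ms come from the answer above; the remaining property names and values are assumptions and differ between FilePulse versions, so check the Connect FilePulse documentation before using this):

name=filepulse-tail-csv
connector.class=io.streamthoughts.kafka.connect.filepulse.source.FilePulseSourceConnector
# reader that keeps the file open and waits for newly appended rows
tasks.reader.class=io.streamthoughts.kafka.connect.filepulse.fs.reader.LocalRowFileInputReader
# maximum time the reader waits for more data before considering the file completed
read.max.wait.ms=60000
# directory and file pattern to scan (exact property names vary by version)
fs.listing.directory.path=/data/input
file.filter.regex.pattern=.*\.csv$
topic=csv_topic
tasks.max=1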

How do we check the number of records loaded so far onto the DB from a Kafka topic?

I'm trying to load data from a Kafka topic into Postgres using the JDBC sink connector. Now, how do we know how many records have been loaded into Postgres so far? As of now I keep checking the number of records in the DB using a SQL query. Is there any other way I can find this out?
Kafka Connect doesn't track this. I see nothing wrong with SELECT COUNT(*) on the table; however, that doesn't exclude other processes writing to that table as well.
It is not possible in Kafka itself, because once you have sunk the records into the target DB, Kafka has already done its job. But you can track the number of records you are writing yourself, for example by counting the SinkRecord collections and writing the count to a local file or into a Kafka-backed state store.

In Kafka, how to handle deleted rows from source table that are already reflected in Kafka topic?

I am using a JDBC source connector with mode timestamp+incrementing to fetch a table from Postgres, using Kafka Connect. Updates to the data are reflected in the Kafka topic, but the deletion of records has no effect. So, my questions are:
Is there some way to handle deleted records?
How do I handle records that were deleted in the source but are still present in the Kafka topic?
The recommendation is to either 1) adjust your source database to be append/update only as well, e.g. via a boolean or timestamp column that marks rows as deleted and is filtered out when Kafka Connect queries the table (sketched below, after this answer).
If your database is running out of space, you can then delete old records, which should already have been processed by Kafka.
Or 2) use CDC tools to capture delete events immediately rather than missing them in a periodic table scan. Debezium is a popular option for Postgres.
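A minimal sketch of the soft-delete idea from option 1 (table, column and id values are made up; it assumes the connector's timestamp column is updated_at, so the soft delete flows through timestamp+incrementing mode):

-- mark rows as deleted instead of removing them
ALTER TABLE my_table ADD COLUMN is_deleted BOOLEAN NOT NULL DEFAULT FALSE;

-- the application "deletes" a row by flagging it and bumping the timestamp column
UPDATE my_table SET is_deleted = TRUE, updated_at = NOW() WHERE id = 42;

-- the connector's query (or a view it reads from) can then filter flagged rows out:
-- SELECT * FROM my_table WHERE is_deleted = FALSE;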
A Kafka topic can be seen as an "append-only" log. It keeps all messages for as long as you like, but Kafka is not built to delete individual messages out of a topic.
In the scenario you are describing, it is common for the downstream application (consuming the topic) to handle the information about a deleted record.
As an alternative, you could set the cleanup.policy of your topic to compact, which means it will eventually keep only the latest value for each key (see the sketch below). If you now define the key of a message as the primary key of the Postgres table, your topic will eventually drop the record once you produce a message with the same key and a null value (a tombstone) into the topic. However, I am not sure whether your connector is flexible enough to do this.
Depending on what you do with the data in the Kafka topic, this might still not solve your problem, as the downstream application will still read both records: the original one and the null message representing the deletion.
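For illustration, the compaction route boils down to topic-level configuration like the following (a sketch only; whether you apply it at topic creation or alter an existing topic depends on your setup):

# keep only the latest value per key once the log cleaner runs
cleanup.policy=compact
# how long tombstones (null values) are retained before removal (default is 24 hours)
delete.retention.ms=86400000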

Confluent JDBC source Connector incremental starting row

I want to use Confluent's JDBC source connector to retrieve data from a SQL Server table into Kafka.
I want to use the incrementing mode to start retrieving data from the table only from the moment the connector starts running:
connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
mode=incrementing
incrementing.column.name=id_column_name
When I run this connector, it starts retrieving all the rows from the table, not just the ones inserted after that point in time. I've been checking the connector configuration properties, but I can't seem to find an option for this situation.
The table doesn't contain any timestamp values, so I can't use the timestamp.initial and timestamp.column.name properties. It does include a datetime column, however, but I think that is not useful in this case.
How can I do this?
Any help would be greatly appreciated.
You can try to use query-based ingest (sketched below) or manually seed the offsets topic with the appropriate value.
Source: Kafka Connect Deep Dive – JDBC Source Connector by Robin Moffatt
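A rough sketch of the query-based approach (table name, starting id and topic prefix are made up; the subquery wrapping is a common way to combine a WHERE clause with incremental mode, but verify it against the connector documentation for your version):

connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
mode=incrementing
incrementing.column.name=id_column_name
# hypothetical query that skips all rows at or below the current maximum id (1000 here);
# the connector appends its own incremental WHERE clause, hence the subquery
query=SELECT * FROM (SELECT * FROM my_table WHERE id_column_name > 1000) AS t
topic.prefix=my_topic
tasks.max=1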