JDBC Source Connector in Incrementing Mode Skips Records (but then doesn't...) - PostgreSQL

I'm facing an issue with a JDBC source connector in (strictly) incrementing mode. The dataset the connector pulls from has consecutive ids starting from id=1.
On the initial deployment of the connector, it ingests only about 1/3 to 1/2 of the expected records in the dataset. Note that I am sure a brand new consumer group is created on this "initial deployment of the connector". The ids of the skipped records appear random. I'm determining the "expected records" by running the source SQL directly against the db containing the dataset. I've encountered the skipping in datasets where ids go up to 10k, in datasets where ids go over 130k, and a few in between. What's weirder is that when I force re-ingestion of the exact same data by writing a message to the connector's offsets topic and re-deploying the connector, I do not encounter the skipping; all expected records make it to the topic.
How can that be?
Other notes:
- There is no ordering or CTE in my SQL
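For anyone debugging something similar: for a source connector the committed offset lives in the Connect worker's offsets topic (whatever offset.storage.topic is set to, commonly connect-offsets), not in __consumer_offsets. A minimal sketch, with a placeholder bootstrap server and the common default topic name, that dumps what the connector has committed:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class DumpConnectOffsets {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "offsets-topic-inspector");  // throwaway group
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Topic name is whatever offset.storage.topic is set to on the Connect worker.
            consumer.subscribe(List.of("connect-offsets"));
            for (int i = 0; i < 10; i++) { // poll a few times, then stop
                for (ConsumerRecord<String, String> rec : consumer.poll(Duration.ofSeconds(2))) {
                    // Key identifies the connector/table; value holds the committed incrementing offset.
                    System.out.printf("key=%s value=%s%n", rec.key(), rec.value());
                }
            }
        }
    }
}
```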

Related

Kafka S3 connector "skips" records if flush.size is set > 1

I set up a Kafka S3 sink connector to consume messages from a Kafka topic and dump them into MinIO in Parquet format. Finally, I query from Dremio to verify the integrity of the data pipeline.
The Kafka topic consists of 12 partitions and each partition contains a varying number of records.
What I've found out is that:
- If I set flush.size=1, I get all records (one per Parquet file) in MinIO, and the query in Dremio returns the correct number of records.
- If I set flush.size > 1, I don't get the exact total number of records in MinIO and the Dremio query; I always get fewer. The larger flush.size is set, the more records are skipped, and if flush.size is set large enough, whole partitions are skipped as well.
I understand that it's probably not skipping records: the connector is waiting for more new records to fill up the buffer before flushing to S3. But that won't work here; if the data is end-of-day, I'd have to wait 24 hours to get yesterday's data dumped to MinIO.
I am looking for a parameter that triggers a timeout and then forces a flush to S3.
I tried rotate.interval.ms, but it only checks the timestamp span between the first and the last record. It will not trigger a timeout and force a flush if no new record is injected into Kafka.
Is there any parameter to trigger a timeout and force a flush to S3?
It seems that all rotate-interval parameters expect a new record to trigger the evaluation of the flush condition, whether span-based or scheduled.
That's not going to serve the purpose I mentioned. We want to time out and force a flush without depending on a new record being processed.
rotate.schedule.interval.ms works.
I had made a typo in the sink properties and put rotate.scheduled.interval.ms.
Once I corrected the typo, it asked me to specify a timezone, and then everything worked as expected: I got all 10071 records across all 12 partitions.
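For reference, a hedged sketch of what the relevant sink configuration might look like, submitted to the Connect REST API (the connector name, REST URL, topic, bucket, and MinIO endpoint are placeholders; rotate.schedule.interval.ms rotates on wall-clock time and requires timezone to be set):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ApplyS3SinkConfig {
    public static void main(String[] args) throws Exception {
        // Placeholder connector name and Connect REST URL; adjust for your deployment.
        String url = "http://localhost:8083/connectors/my-s3-sink/config";

        String config = """
            {
              "connector.class": "io.confluent.connect.s3.S3SinkConnector",
              "storage.class": "io.confluent.connect.s3.storage.S3Storage",
              "topics": "my-topic",
              "s3.bucket.name": "my-bucket",
              "store.url": "http://minio:9000",
              "format.class": "io.confluent.connect.s3.format.parquet.ParquetFormat",
              "flush.size": "1000",
              "rotate.schedule.interval.ms": "60000",
              "timezone": "UTC"
            }
            """;

        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(config))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```

flush.size still caps the number of records per file; the scheduled rotation closes and uploads whatever is buffered at each wall-clock interval.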

How do we check how many records have been loaded so far from a Kafka topic into the db?

I'm trying to load data from a Kafka topic into Postgres using the JDBC sink connector. Now, how do we know how many records have been loaded into Postgres so far? As of now I keep checking the number of records in the db using a SQL query. Is there any other way I can find out?
Kafka Connect doesn't track this. I see nothing wrong with SELECT COUNT(*) on the table; however, this doesn't exclude other processes writing to that table as well.
It is not possible in Kafka, because once you have sunk the records into the target DB, Kafka has already done its job. But you can track the number of records you are writing from the sink's record collections and write that count to a local file, or insert it into a Kafka state store.
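To make the SELECT COUNT(*) suggestion concrete, a minimal sketch (the connection URL, credentials, and table name are placeholders, and it assumes the PostgreSQL JDBC driver is on the classpath):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class CountSinkedRows {
    public static void main(String[] args) throws Exception {
        // Placeholder connection details and table name; adjust for your sink.
        String url = "jdbc:postgresql://localhost:5432/mydb";
        try (Connection conn = DriverManager.getConnection(url, "myuser", "mypassword");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM my_sink_table")) {
            rs.next();
            // Counts everything in the table, including rows written by other processes.
            System.out.println("rows loaded so far: " + rs.getLong(1));
        }
    }
}
```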

How to interpret the active record count metric from a Kafka Connect source task?

I have a Kafka Connect source connector (JDBC Postgres connector) and I can view the kafka_connect_source_task_source_record_active_count_avg metric from it. The graph shows occasional steps of 100 records; if I change the metric from _avg to _max, the steps are indeed of size 100.
I am unsure how to interpret this information, though. Does this mean that right now (at the end of the chart) there are over 1100 records that have not been committed to Kafka, and they've been that way for weeks? I'm wondering why this value doesn't decrease. The connector is on a very active database, so it wouldn't surprise me if it's always "behind" (is that the right word for it?). But I'd like to know if it's at least always working through the backlog of messages, or if it's "accumulating" certain messages that are never being committed to Kafka for one reason or another, and whether this number reflects those "stuck" messages.
source-record-active-count: The most recent number of records that have been produced by this task but not yet completely written to Kafka.
https://cwiki.apache.org/confluence/display/KAFKA/KIP-196%3A+Add+metrics+to+Kafka+Connect+framework#KIP196:AddmetricstoKafkaConnectframework-SourceTaskMetrics
You could also plot the offsets of the topic you're producing to, to see if they follow the same steps of increase.
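A hedged sketch of how those end offsets could be checked with the AdminClient (bootstrap servers and topic name are placeholders):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.ListOffsetsResult;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.common.TopicPartition;

public class LogTopicEndOffsets {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder

        try (AdminClient admin = AdminClient.create(props)) {
            String topic = "my-source-topic"; // placeholder
            int partitionCount = admin.describeTopics(List.of(topic))
                    .allTopicNames().get().get(topic).partitions().size();

            Map<TopicPartition, OffsetSpec> request = new HashMap<>();
            for (int p = 0; p < partitionCount; p++) {
                request.put(new TopicPartition(topic, p), OffsetSpec.latest());
            }

            // Latest (end) offsets per partition; sampling this over time shows whether
            // the producer side advances in the same 100-record steps as the metric.
            ListOffsetsResult result = admin.listOffsets(request);
            for (TopicPartition tp : request.keySet()) {
                long endOffset = result.partitionResult(tp).get().offset();
                System.out.println(tp + " end offset = " + endOffset);
            }
        }
    }
}
```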
The alternative would be to use Debezium rather than the JDBC source

How should the offset be configured for a new consumer group on an existing topic when source connectors can't be paused?

We have an existing topic where the data gets published by a JDBC source connector using incrementing + timestamp mode (https://docs.confluent.io/current/connect/kafka-connect-jdbc/source-connector/index.html#incremental-query-modes).
We have existing consumer groups which consume data from some existing topics. Now we are introducing a new consumer group (call this group K) which should consume data from the same existing topics and write to a database. As a first step, we have an initial data migration workflow that takes a dump of the source database and copies the dump to the destination database before we start consuming messages from the existing topics.
Now, when the consumer group starts, I am wondering what offset it should start with.
One option is to use latest. But the problem is that the existing source connectors will be publishing data to the existing topics while the initial data migration is being done for this new consumer group. In our case we have tens of tables to be migrated, and there could be a gap where the table dump was taken but changes are still being made to the source database, so data will keep getting added to the topics. So there is a chance that we may miss processing some records.
We don't have the option to pause the source connectors, which would solve the problem for us.
If we use offset earliest, we will end up processing all the old data from the Kafka topics, which is not required as we have already done an initial data migration.
We want to maintain only one source connector regardless of the number of consumer groups.
I was going through Kafka consumer APIs like offsetsForTimes/seek, which take a timestamp. I can note down the time before the initial data migration and call consumer.seek once the consumer group has started and partitions are assigned. But I couldn't find any docs saying whether the timestamp is GMT-based or something else. Is it OK to use this API by passing the time as the number of milliseconds elapsed since the epoch?
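A minimal sketch of that seek-by-timestamp approach (topic, group, and bootstrap server are placeholders); the timestamps passed to offsetsForTimes are plain epoch milliseconds, the same timezone-independent representation Kafka uses for record timestamps:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndTimestamp;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class SeekToMigrationStart {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "group-k");                  // placeholder
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        // Time noted just before the initial data migration started (epoch milliseconds).
        long migrationStartMs = Instant.parse("2020-01-01T00:00:00Z").toEpochMilli();

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("my-existing-topic")); // placeholder
            consumer.poll(Duration.ofSeconds(1)); // trigger partition assignment (may need more polls)

            Map<TopicPartition, Long> timestamps = new HashMap<>();
            for (TopicPartition tp : consumer.assignment()) {
                timestamps.put(tp, migrationStartMs);
            }

            // offsetsForTimes returns the earliest offset whose record timestamp is >= the given time.
            Map<TopicPartition, OffsetAndTimestamp> offsets = consumer.offsetsForTimes(timestamps);
            offsets.forEach((tp, ot) -> {
                if (ot != null) {
                    consumer.seek(tp, ot.offset());
                } else {
                    consumer.seekToEnd(List.of(tp)); // no records after that time in this partition
                }
            });

            // ... normal poll loop from here on
        }
    }
}
```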
If I understand this sentence correctly: "If we use offset latest we might lose some data as source connectors might have written some data to the topic during initial data migration", the topic will end up having data from the initial loads and CDC data mixed up, so there is no offset that clearly separates the two. Therefore, you will not get far setting any particular offset.
I see the following options:
- Have your consumer group K filter out initial-load data and read from earliest
- Produce the initial-load data to a dedicated topic
- If possible, perform the initial load outside of business hours so that no CDC data is flowing (maybe over a weekend or bank holidays)

In Kafka, how to handle deleted rows from a source table that are already reflected in a Kafka topic?

I am using a JDBC source connector with mode timestamp+incrementing to fetch a table from Postgres, using Kafka Connect. Updates to the data are reflected in the Kafka topic, but deletion of records has no effect. So, my questions are:
Is there some way to handle deleted records?
How do I handle records that are deleted but still present in the Kafka topic?
The recommendation is to either 1) adjust your source database to be append/update only as well, via a boolean or timestamp column that is filtered out when Kafka Connect queries the table (see the sketch below).
If your database is running out of space, then you can delete old records, which should already have been processed by Kafka.
Option 2) Use CDC tools to capture delete events immediately rather than missing them in a periodic table scan. Debezium is a popular option for Postgres.
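For option 1, a hedged sketch of what the JDBC source connector config could look like with a soft-delete flag filtered out via the connector's query option (the table, column, topic, and connection details are placeholders):

```java
import java.util.Map;

public class SoftDeleteSourceConfig {
    public static void main(String[] args) {
        // Placeholder table/column names. The connector wraps the query to append its own
        // timestamp+incrementing WHERE clause, so the soft-delete filter goes in a sub-select.
        Map<String, String> config = Map.of(
                "connector.class", "io.confluent.connect.jdbc.JdbcSourceConnector",
                "connection.url", "jdbc:postgresql://localhost:5432/mydb",
                "mode", "timestamp+incrementing",
                "incrementing.column.name", "id",
                "timestamp.column.name", "updated_at",
                "query", "SELECT * FROM (SELECT * FROM my_table WHERE deleted = false) AS t",
                "topic.prefix", "my_table_topic"
        );
        config.forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

Once the flagged rows have been picked up by the connector, a separate periodic job can physically delete them from Postgres.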
A Kafka topic can be seen as an "append-only" log. It keeps all messages for as long as you like, but Kafka is not built to delete individual messages out of a topic.
In the scenario you are describing, it is common for the downstream application (consuming the topic) to handle the information about a deleted record.
As an alternative, you could set the cleanup.policy of your topic to compact, which means it will eventually keep only the latest value for each key. If you now define the key of a message as the primary key of the Postgres table, your topic will eventually delete the record when you produce a message with the same key and a null value into the topic. However, I am not sure if your connector is flexible enough to do this.
Depending on what you do with the data in the Kafka topic, this could still not be a solution to your problem, as the downstream application will still read both records: the original one and the null message marking the deletion.
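As an illustration of the compaction approach, a minimal sketch that produces such a tombstone (the row's primary key as the message key, with a null value) into the topic (topic name, key, and bootstrap server are placeholders):

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class SendDeleteTombstone {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Key = primary key of the deleted Postgres row; value = null marks it as a tombstone.
            // With cleanup.policy=compact, compaction eventually drops earlier messages for this key,
            // but consumers that read before compaction still see both the original and the null message.
            producer.send(new ProducerRecord<>("my_table_topic", "42", null));
            producer.flush();
        }
    }
}
```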