Kafka Connect: Topic shows 3x the number of events expected - apache-kafka

We are using Kafka Connect JDBC to sync tables between two databases (Debezium would be perfect for this but is out of the question).
The sync generally works fine, but it seems there are 3x as many events / messages stored in the topics as expected.
What could be the reason for this?
Some additional information:
The target database contains the expected number of rows (i.e., the count of messages in the topics divided by 3).
Most of the topics are split into 3 partitions (the key is set via an SMT; the DefaultPartitioner is used).
JDBC Source Connector
{
  "name": "oracle_source",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
    "connection.url": "jdbc:oracle:thin:@dbdis01.allesklar.de:1521:stg_cdb",
    "connection.user": "****",
    "connection.password": "****",
    "schema.pattern": "BBUCH",
    "topic.prefix": "oracle_",
    "table.whitelist": "cdc_companies, cdc_partners, cdc_categories, cdc_additional_details, cdc_claiming_history, cdc_company_categories, cdc_company_custom_fields, cdc_premium_custom_field_types, cdc_premium_custom_fields, cdc_premiums, cdc, cdc_premium_redirects, intermediate_oz_data, intermediate_oz_mapping",
    "table.types": "VIEW",
    "mode": "timestamp+incrementing",
    "incrementing.column.name": "id",
    "timestamp.column.name": "ts",
    "key.converter": "org.apache.kafka.connect.converters.IntegerConverter",
    "value.converter": "org.apache.kafka.connect.json.JsonConverter",
    "validate.non.null": false,
    "numeric.mapping": "best_fit",
    "db.timezone": "Europe/Berlin",
    "transforms": "createKey, extractId, dropTimestamp, deleteTransform",
    "transforms.createKey.type": "org.apache.kafka.connect.transforms.ValueToKey",
    "transforms.createKey.fields": "id",
    "transforms.extractId.type": "org.apache.kafka.connect.transforms.ExtractField$Key",
    "transforms.extractId.field": "id",
    "transforms.dropTimestamp.type": "org.apache.kafka.connect.transforms.ReplaceField$Value",
    "transforms.dropTimestamp.blacklist": "ts",
    "transforms.deleteTransform.type": "de.meinestadt.kafka.DeleteTransformation"
  }
}
JDBC Sink Connector
{
  "name": "postgres_sink",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
    "connection.url": "jdbc:postgresql://writer.branchenbuch.psql.integration.meinestadt.de:5432/branchenbuch",
    "connection.user": "****",
    "connection.password": "****",
    "key.converter": "org.apache.kafka.connect.converters.IntegerConverter",
    "value.converter": "org.apache.kafka.connect.json.JsonConverter",
    "value.schemas.enable": true,
    "insert.mode": "upsert",
    "pk.mode": "record_key",
    "pk.fields": "id",
    "delete.enabled": true,
    "auto.create": true,
    "auto.evolve": true,
    "topics.regex": "oracle_cdc_.*",
    "transforms": "dropPrefix",
    "transforms.dropPrefix.type": "org.apache.kafka.connect.transforms.RegexRouter",
    "transforms.dropPrefix.regex": "oracle_cdc_(.*)",
    "transforms.dropPrefix.replacement": "$1"
  }
}
Strange Topic Count

This isn't an answer per se, but it's easier to format here than in the comments box.
It's not clear why you'd be getting duplicates. Some possibilities would be:
You have more than one instance of the connector running
You have one instance of the connector running but have previously run other instances which loaded the same data to the topic
Data is coming from multiple tables and being merged into one topic (not possible here based on your config, but it could be a possibility if you were using a Single Message Transform to modify the target topic name)
In terms of investigation I would suggest:
Isolate the problem by splitting the connector into one connector per table (see the sketch below).
Examine each topic and locate examples of the duplicate messages. See if there is a pattern to which topics have duplicates. KSQL will be useful here:
SELECT ROWKEY, COUNT(*) FROM source GROUP BY ROWKEY HAVING COUNT(*) > 1
I'm guessing at ROWKEY (the key of the Kafka message) - you'll know your data and which columns should be unique and can be used to detect duplicates.
Once you've found a duplicate message, use kafkacat to examine the duplicate instances. Do they have the exact same Kafka message timestamp?
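For the first suggestion, a minimal single-table variant of the source connector might look like this (a sketch based on the config above; the connector name is a hypothetical of mine, and the credentials and transform chain from the original config would carry over unchanged):
{
  "name": "oracle_source_cdc_companies",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
    "connection.url": "jdbc:oracle:thin:@dbdis01.allesklar.de:1521:stg_cdb",
    "schema.pattern": "BBUCH",
    "topic.prefix": "oracle_",
    "table.whitelist": "cdc_companies",
    "table.types": "VIEW",
    "mode": "timestamp+incrementing",
    "incrementing.column.name": "id",
    "timestamp.column.name": "ts"
  }
}
Repeating this per table makes it easy to see whether the triplication follows a particular connector or table.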
For more back and forth, Stack Overflow isn't such an appropriate platform - I'd recommend heading to http://cnfl.io/slack and the #connect channel.

Related

MirrorSourceConnector: override consumer key.serializer property

I am trying to run a MirrorSourceConnector to replicate a topic from cluster A to cluster B.
After creating the connector and consuming the first message, I noticed that the mirrored topic's key and value are always serialized as a ByteArray, which in the case of the key is a bit of a problem when doing transformations with a custom class.
After checking the MirrorSourceConfig class on GitHub, I found out that with the source.admin. and target.admin. prefixes I could basically add consumer and producer properties. But it seems to make no difference (in the logs I could still see that the ByteArray serializer is being used).
My connector config looks like this:
{"target.cluster.status.storage.replication.factor": "-1",
"connector.class": "org.apache.kafka.connect.mirror.MirrorSourceConnector",
"auto.create.mirror.topics.enable": true,
"offset-syncs.topic.replication.factor": "1",
"replication.factor": "1",
"sync.topic.acls.enabled": "false",
"topics": "test-topic",
"target.cluster.config.storage.replication.factor": "-1",
"source.cluster.alias": "source-cluster-dev",
"source.cluster.bootstrap.servers": "source-cluster-dev:9092",
"target.cluster.offset.storage.replication.factor": "-1",
"target.cluster.alias": "target-cluster-dev",
"target.cluster.security.protocol": "PLAINTEXT",
"header.converter": "org.apache.kafka.connect.converters.ByteArrayConverter",
"value.converter": "org.apache.kafka.connect.converters.ByteArrayConverter",
"key.converter": "org.apache.kafka.connect.storage.StringConverter",
"name": "test-mirror-connector",
"source.admin.key.deserializer": "org.apache.kafka.common.serialization.StringDeserializer",
"source.admin.value.deserializer":"org.apache.kafka.common.serialization.ByteArrayDeserializer",
"target.admin.key.serializer": "org.apache.kafka.common.serialization.StringDeserializer",
"target.admin.value.serializer":"org.apache.kafka.common.serialization.ByteArrayDeserializer",
"target.cluster.bootstrap.servers": "target-cluster-dev:9092"}
Is there a way to override the consumer and producer serializer/deserializer properties, or any other way to make the mirror topic exactly the same as the source topic in terms of serialization?

Debezium heartbeat table not updating

There is already a question Debezium Heartbeat Action not firing but it did not resolve my issue.
Here is my source connector config for Postgres. It generates heartbeat events every 5 seconds; I have confirmed that by checking the Kafka topic. The issue is that it does not update the row in the database heartbeat table. Any suggestions?
{
  "name": "postgres-localdb-source-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "tasks.max": "1",
    "database.hostname": "postgres",
    "database.port": "5432",
    "slot.name": "debezium",
    "database.user": "postgres",
    "database.password": "postgres",
    "database.dbname": "postgres",
    "database.server.name": "dbserver2",
    "database.history.kafka.bootstrap.servers": "kafka:9092",
    "database.history.kafka.topic": "schema-changes.dbserver2",
    "schema.include": "inventory",
    "tables.include": "customers,heartbeat",
    "publication.autocreate.mode": "filtered",
    "max.batch.size": "20480",
    "max.queue.size": "81920",
    "poll.interval.ms": "100",
    "heartbeat.interval.ms": "5000",
    "heartbeat.action.query": "INSERT INTO heartbeat (id, ts) VALUES (1, NOW()) ON CONFLICT(id) DO UPDATE SET ts=EXCLUDED.ts;"
  }
}
Try to share your heartbeat table's DDL. Does your heartbeat table have a primary key? Debezium only tracks updates and deletes if the table has a PK defined. Also share your Debezium version, because these properties change from version to version.
Try an UPDATE without a WHERE clause to test whether the problem is in your query. Check whether your heartbeat table is in the public or the inventory schema, and add the schema as a prefix in your query:
UPDATE inventory.heartbeat SET ts = NOW();
In tables.include, add a schema prefix for each table:
"tables.include": "inventory.customers,inventory.heartbeat",
Also try changing tables.include to table.include.list (note the singular table). Source: https://debezium.io/documentation/reference/1.6/connectors/mysql.html#:~:text=connector%20configuration%20property.-,table.include.list,-empty%20string
"table.include.list": "inventory.customers,inventory.heartbeat",

How to migrate consumer offsets using MirrorMaker 2.0?

With Kafka 2.7.0, I am using MirrorMaker 2.0 as a Kafka Connect connector to replicate all the topics from the primary Kafka cluster to the backup cluster.
All the topics are being replicated perfectly except __consumer_offsets. Below is the connector configuration:
{
  "name": "test-connector",
  "config": {
    "connector.class": "org.apache.kafka.connect.mirror.MirrorSourceConnector",
    "topics.blacklist": "some-random-topic",
    "replication.policy.separator": "",
    "source.cluster.alias": "",
    "target.cluster.alias": "",
    "exclude.internal.topics": "false",
    "tasks.max": "10",
    "key.converter": "org.apache.kafka.connect.converters.ByteArrayConverter",
    "value.converter": "org.apache.kafka.connect.converters.ByteArrayConverter",
    "source.cluster.bootstrap.servers": "xx.xx.xxx.xx:9094",
    "target.cluster.bootstrap.servers": "yy.yy.yyy.yy:9094",
    "topics": "test-topic-from-primary,primary-kafka-connect-offset,primary-kafka-connect-config,primary-kafka-connect-status,__consumer_offsets"
  }
}
In a similar question here, the accepted answer says the following:
Add this in your consumer.config:
exclude.internal.topics=false
And add this in your producer.config:
client.id=__admin_client
Where do I add these in my configuration?
The Connector Configuration Properties here do not include a property named client.id; I have set the value of exclude.internal.topics to false, though.
Is there something I am missing here?
UPDATE
I learned that Kafka 2.7 and above support automated consumer offset sync using MirrorCheckpointTask, as mentioned here.
I have created a connector for this with the below configuration:
{
  "name": "mirror-checkpoint-connector",
  "config": {
    "connector.class": "org.apache.kafka.connect.mirror.MirrorCheckpointConnector",
    "sync.group.offsets.enabled": "true",
    "source.cluster.alias": "",
    "target.cluster.alias": "",
    "exclude.internal.topics": "false",
    "tasks.max": "10",
    "key.converter": "org.apache.kafka.connect.converters.ByteArrayConverter",
    "value.converter": "org.apache.kafka.connect.converters.ByteArrayConverter",
    "source.cluster.bootstrap.servers": "xx.xx.xxx.xx:9094",
    "target.cluster.bootstrap.servers": "yy.yy.yyy.yy:9094",
    "topics": "__consumer_offsets"
  }
}
Still no luck.
Is this the correct approach? Is there something missing?
You do not want to replicate __consumer_offsets. The offsets from the source cluster will not be the same in the destination cluster, for various reasons.
MirrorMaker 2 provides the ability to do offset translation instead. It will populate the destination cluster with translated offsets generated from the source cluster. https://cwiki.apache.org/confluence/display/KAFKA/KIP-545%3A+support+automated+consumer+offset+sync+across+clusters+in+MM+2.0
__consumer_offsets is ignored by default
topics.exclude = [.*[\-\.]internal, .*\.replica, __.*]
You'll need to override this config.
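If you still want the raw topic copied (the offset translation above remains the recommended route), a sketch of the override, assuming Kafka 2.7's topics.exclude property (older releases used topics.blacklist) and dropping only the default __.* pattern:
{
  "topics": "__consumer_offsets",
  "topics.exclude": ".*[\\-\\.]internal,.*\\.replica"
}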

Kafka Connect - Not working for UPDATE operation

I am new to Kafka Connect source and sink connectors. I created an application to transfer table data from one schema (Schema1) to another schema (Schema2), using Oracle as the database. I successfully transferred data/rows for the INSERT operation from table Schema1.Header to table Schema2.Header, but the UPDATE operation is not working with the below config.
SOURCE Config:
{
  "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
  "connection.url": "jdbc:oracle:thin:@localhost:1524:XE",
  "connection.user": "USER",
  "connection.password": "user1234",
  "dialect.name": "OracleDatabaseDialect",
  "topic.prefix": "Schema1.Header",
  "incrementing.column.name": "SC_NO",
  "mode": "incrementing",
  "query": "SELECT * FROM (SELECT HEADER_V1.* FROM Schema1.Header HEADER_V1 INNER JOIN Schema1.LINE_V1 LINE_V1 ON HEADER_V1.SC_NO = LINE_V1.SC_NO AND LINE_V1.CLNAME_CODE ='XXXXXX' AND HEADER_V1.ITEM_TYPE = 'XXX')",
  "transforms": "ReplaceField",
  "transforms.ReplaceField.type": "org.apache.kafka.connect.transforms.ReplaceField$Value",
  "transforms.ReplaceField.blacklist": "col_3,col_10"
}
SINK Config:
{
  "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
  "connection.url": "jdbc:oracle:thin:@localhost:1524:XE",
  "connection.user": "USER2",
  "connection.password": "user21234",
  "dialect.name": "OracleDatabaseDialect",
  "topics": "Schema1.Header",
  "table.name.format": "Schema2.Header",
  "tasks.max": "1"
}
Kindly help me to fix this issue.
Note: I need to do all CRUD operations on the Schema1 tables only; using Kafka Connect, I am transferring that data to the Schema2 tables. Newly inserted data/rows get transferred, but updated data/rows are not transferred via Kafka Connect. What do I have to do to achieve this?
According to this blog, you need to set the mode to timestamp (or better, timestamp+incrementing if you want both new and updated rows) in your source config.
In addition, you then need to specify timestamp.column.name, which should point to a timestamp column that is updated every time the row is updated.
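For example, the changed lines in the source config might look like this (a sketch; LAST_MODIFIED_TS is a hypothetical column name, standing in for a timestamp column that your application must set on every insert and update):
"mode": "timestamp+incrementing",
"incrementing.column.name": "SC_NO",
"timestamp.column.name": "LAST_MODIFIED_TS"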

kafka jdbc source connector error in query mode [duplicate]

This question already has an answer here:
How to add explicit WHERE clause in Kafka Connect JDBC Source connector
(1 answer)
Closed 3 years ago.
I have a JdbcSourceConnector in Kafka that uses a query to stream data from a database, but I have a problem with the query I wrote for selecting the data.
I tested the query in Postgres psql and also in DBeaver, and it works fine there, but in the Kafka config it produces an SQL syntax error.
Error
ERROR Failed to run query for table TimestampIncrementingTableQuerier{name='null', query='select "Users".* from "Users" join "SchoolUserPivots" on "Users".id = "SchoolUserPivots".user_id where school_id = 1 and role_id = 2', topicPrefix='teacher', timestampColumn='"Users".updatedAt', incrementingColumn='id'}: {} (io.confluent.connect.jdbc.source.JdbcSourceTask:221)
org.postgresql.util.PSQLException: ERROR: syntax error at or near "WHERE"
Config JSON
{
  "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
  "timestamp.column.name": "\"Users\".updatedAt",
  "incrementing.column.name": "id",
  "connection.password": "123",
  "tasks.max": "1",
  "query": "select \"Users\".* from \"Users\" join \"SchoolUserPivots\" on \"Users\".id = \"SchoolUserPivots\".user_id where school_id = 1 and role_id = 2",
  "timestamp.delay.interval.ms": "5000",
  "mode": "timestamp+incrementing",
  "topic.prefix": "teacher",
  "connection.user": "user",
  "name": "SourceTeacher",
  "connection.url": "jdbc:postgresql://ip:5432/school",
  "value.converter": "org.apache.kafka.connect.json.JsonConverter",
  "key.converter": "org.apache.kafka.connect.json.JsonConverter"
}
You can't use "mode": "timestamp+incrementing" with a custom query that includes a WHERE clause.
See https://www.confluent.io/blog/kafka-connect-deep-dive-jdbc-source-connector for more details, as well as https://github.com/confluentinc/kafka-connect-jdbc/issues/566. That GitHub issue suggests one workaround: wrap your query in a subselect.
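Applied to your config, that workaround might look like this (a sketch: the original statement is wrapped in a subselect so the connector can append its own WHERE clause to the outer query, and the timestamp column is then referenced without the table qualifier):
{
  "mode": "timestamp+incrementing",
  "timestamp.column.name": "updatedAt",
  "incrementing.column.name": "id",
  "query": "SELECT * FROM (select \"Users\".* from \"Users\" join \"SchoolUserPivots\" on \"Users\".id = \"SchoolUserPivots\".user_id where school_id = 1 and role_id = 2) u"
}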