Unable to stream data from MySQL to Postgres using Kafka - postgresql

I am trying Kafka for the first time and set up Kafka cluster using AWS MSK. The objective is to stream data from MySQL server to Postgresql.
I used debezium MySQL connector for source and Confluent JDBC connector for the sink.
MySQL config:
"connector.class": "io.debezium.connector.mysql.MySqlConnector",
"database.server.id": "1",
"tasks.max": "3",
"internal.key.converter.schemas.enable": "false",
"transforms.unwrap.add.source.fields": "ts_ms",
"key.converter.schemas.enable": "false",
"internal.key.converter": "org.apache.kafka.connect.json.JsonConverter",
"internal.value.converter.schemas.enable": "false",
"value.converter.schemas.enable": "false",
"internal.value.converter": "org.apache.kafka.connect.json.JsonConverter",
"value.converter": "org.apache.kafka.connect.json.JsonConverter",
"key.converter": "org.apache.kafka.connect.json.JsonConverter",
"transforms": "unwrap",
"transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState"
After registering the Mysql connector, its status is "running" and capturing the changes being made in MySQL table and showing result in consumer console in the following format:
{"id":5,"created_at":1594910329000,"userid":"asldnl3r234mvnkk","amount":"B6Eg","wallet_type":"CDW"}
My first issue: in table "amount" column is of type "decimal" and contains numeric value but in consumer console why it is showing as alphanumeric value?
For Postgresql as target DB, I used JDBC sink connector, with following config:
"name": "postgres-connector-db08",
"connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
"tasks.max": "1",
"key.converter": "org.apache.kafka.connect.storage.StringConverter",
"key.converter.schemas.enable": "false",
"value.converter": "org.apache.kafka.connect.json.JsonConverter",
"value.converter.schemas.enable": "false",
"topics": "mysql-cash.kafka_test.test",
"connection.url": "jdbc:postgresql://xxxxxx:5432/test?currentSchema=public",
"connection.user": "xxxxxx",
"connection.password": "xxxxxx",
"insert.mode": "upsert",
"auto.create": "true",
"auto.evolve": "true"
After registering JDBC connector when I check status it gives an error:
{"name":"postgres-connector-db08","connector":{"state":"RUNNING","worker_id":"x.x.x.x:8083"},"tasks":[{"id":0,"state":"FAILED","worker_id":"x.x.x.x:8083","trace":"org.apache.kafka.connect.errors.ConnectException: Exiting WorkerSinkTask due to unrecoverable exception.
org.apache.kafka.connect.runtime.WorkerSinkTask.deliverMessages(WorkerSinkTask.java:561)
org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:322)
org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:224)
org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:192)
org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:177)
org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:227)
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
java.util.concurrent.FutureTask.run(FutureTask.java:266)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
java.lang.Thread.run(Thread.java:748)\nCaused by: org.apache.kafka.connect.errors.ConnectException: Sink connector 'postgres-connector-db08' is configured with 'delete.enabled=false' and 'pk.mode=none' and therefore requires records with a non-null Struct value and non-null Struct schema, but found record at (topic='mysql-cash.kafka_test.test',partition=0,offset=0,timestamp=1594909233389) with a HashMap value and null value schema.
io.confluent.connect.jdbc.sink.RecordValidator.lambda$requiresValue$2(RecordValidator.java:83)
io.confluent.connect.jdbc.sink.BufferedRecords.add(BufferedRecords.java:82)
io.confluent.connect.jdbc.sink.JdbcDbWriter.write(JdbcDbWriter.java:66)
io.confluent.connect.jdbc.sink.JdbcSinkTask.put(JdbcSinkTask.java:74)
org.apache.kafka.connect.runtime.WorkerSinkTask.deliverMessages(WorkerSinkTask.java:539)
... 10 more
"}],"type":"sink"}
Why this error is coming? Is something I missed in the sink config?

https://docs.confluent.io/kafka-connect-jdbc/current/sink-connector/index.html#data-mapping
The sink connector requires knowledge of schemas, so you should use a suitable converter e.g. the Avro converter that comes with Schema Registry, or the JSON converter with schemas enabled.
Since the JSON is plain (has no schema) and the connector is configured with "value.converter.schemas.enable": "false" (JSON converter with schemas disabled), Avro converter should be set up with Schema Registry: https://www.confluent.io/blog/kafka-connect-deep-dive-converters-serialization-explained/#applying-schema

Answer about first issue. "Why decimal is dropped in alpha-numerical format?"
The conversion of decimal depends on decimal.handling.mode configuration.
Specifies how the connector should handle values for DECIMAL and NUMERIC columns: precise (the default) represents them precisely using java.math.BigDecimal values represented in change events in a binary form; or double represents them using double values, which may result in a loss of precision but will be far easier to use. string option encodes values as formatted string which is easy to consume but a semantic information about the real type is lost.
https://debezium.io/documentation/reference/0.10/connectors/mysql.html#decimal-values
If there's no proper conversion configure for you, you can also create custom converter.
https://debezium.io/documentation/reference/stable/development/converters.html
If lucky you can find some open-source converters to solve this issue.

Related

Configure Debezium CDC -> Kafka -> JDBC Sink (Multiple tables) Question

We have around 100 tables in SQL server DB(Application DB) which needs to be synced to SQL server DB(for Analytics) in near Realtime.
Future use case: Scale the Proof of Concept for 30 Source DBs to one destination DB(for Analytics) in near Realtime.
I am thinking to use one sink connector or few sink connectors for multiple tables. Please let me know if this is a good idea.
But I am not sure how to configure the sink to cater for multiple tables especially that each table might have its own primary key. Internet seems to have very simple examples of sink connector but not addressing complex use cases.
Debezium CDC(Source) config
{ "name": "wwi",
"config": {
"connector.class": "io.debezium.connector.sqlserver.SqlServerConnector",
"database.dbname": "************************",
"database.history": "io.debezium.relational.history.MemoryDatabaseHistory",
"database.hostname": "**********************",
"database.password": "**********************",
"database.port": "1433",
"database.server.name": "******",
"database.user": "*********",
"decimal.handling.mode": "string",
"key.converter": "org.apache.kafka.connect.json.JsonConverter",
"key.converter.schemas.enable": "true",
"snapshot.mode": "schema_only",
"table.include.list": "Sales.Orders,Warehouse.StockItems",
"tasks.max": "1",
"tombstones.on.delete": "false",
"transforms": "route,unwrap",
"transforms.route.regex": "([^.]+)\\.([^.]+)\\.([^.]+)",
"transforms.route.replacement": "$3",
"transforms.route.type": "org.apache.kafka.connect.transforms.RegexRouter",
"transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
"value.converter.schemas.enable": "true",
"value.convertor": "org.apache.kafka.connect.json.JsonConverter"
}
}
JDBC Sink config
{
"name": "sqlsinkcon",
"config": {
"connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
"topics": "orders",
"tasks.max": "1",
"auto.evolve": "true",
"connection.user": "********",
"auto.create": "true",
"connection.url": "jdbc:sqlserver://************",
"insert.mode": "upsert",
"pk.mode":"record_key",
"pk.fields":"OrderID",
"db.name": "kafkadestination"
}
}
The sink will write one table, per consumed topic. topics or topics.regex can be used to consume multiple topics at once.
Regarding scalability (or at least, fault tolerance), I prefer one sink task, with one topic (therefore writing to one table). Otherwise, if you consume multiple topics, and connector fails, then it'll potentially crash all the task threads due to the consumer rebalancing.
Also, using JSON / plaintext formats in Kafka isn't most optimal in terms of network bandwidth. I'd suggest a binary format like Avro or Protobuf.

Kafka MongoDB Sink only one record from table

I am using ksqlDB, Where I have created a table from the stream. when I fire select query in that table it gives me all the record properly. Now I want to sink that table in MongoDB. I am also able to create a sink between the Kafka table to MongoDB. But somehow it sinks only one record into it(MongoDB). Whereas in the table I have 100 records. Below is my MongoDB sink connector.
{
"name": "MongoSinkConnectorConnector_1",
"config": {
"connector.class": "com.mongodb.kafka.connect.MongoSinkConnector",
"key.converter": "org.apache.kafka.connect.storage.StringConverter",
"value.converter": "io.confluent.connect.avro.AvroConverter",
"topics": "FEEDS",
"connection.uri": "mongodb://xxx:xxx#x.x.x.x:27017/",
"database": "xxx",
"max.num.retries": "1000000",
"writemodel.strategy": "com.mongodb.kafka.connect.sink.writemodel.strategy.UpdateOneTimestampsStrategy",
"value.projection.type": "allowlist",
"value.projection.list": "id",
"document.id.strategy": "com.mongodb.kafka.connect.sink.processor.id.strategy.PartialValueStrategy",
"buffer.capacity": "20000",
"value.converter.schema.registry.url": "http://x.x.x.x:8081",
"key.converter.schemas.enable": "false",
"insert.mode": "upsert"
}
}
I could not able to understand that, what's the reason behind that. Any help appreciated. Thank you
you can set the "batch.size" property in the Sink Connector property and also you can have a better write model strategy that you can read through Mongo DB Source Connector Official documentation https://docs.confluent.io/cloud/current/connectors/cc-mongo-db-source.html.

Schema Changes Not Replicating with Kafka Connect

The goal is: MySQL -> Kafka -> MySQL. The sink destination should be up to date with production.
Inserting and deleting records is fine, but I am having issues with schema changes, such as a dropped column. The changes are not replicating to the sink destination.
My source:
{
"name": "hub-connector",
"config": {
"connector.class": "io.debezium.connector.mysql.MySqlConnector",
"database.hostname": "mysql",
"database.port": "3306",
"database.user": "debezium",
"database.password": "dbz",
"database.server.id": "42",
"database.server.name": "study",
"database.include.list": "companyHub",
"database.history.kafka.bootstrap.servers": "broker:29092",
"database.history.kafka.topic": "dbhistory.study",
"include.schema.changes": "true"
}
}
My sink
{
"name": "companyHub-sink",
"config": {
"connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
"tasks.max": "1",
"connection.url": "jdbc:mysql://172.18.141.102:3306/Hub",
"connection.user": "user",
"connection.password": "passaword",
"topics": "study.companyHub.countryTaxes, study.companyHub.addresses",
"auto.create": "true",
"auto.evolve": "true",
"delete.enabled": "true",
"insert.mode": "upsert",
"pk.fields": "id",
"pk.mode": "record_key",
"transforms": "dropPrefix, unwrap",
"transforms.dropPrefix.type":"org.apache.kafka.connect.transforms.RegexRouter",
"transforms.dropPrefix.regex":"study.companyHub.(.*)$",
"transforms.dropPrefix.replacement":"$1",
"transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
"transforms.unwrap.drop.tombstones": "false",
"transforms.unwrap.delete.handling.mode": "rewrite"
}
}
Any help would be greatly appreciated!
Auto-creation and Auto-evoluton
Tip
Make sure the JDBC user has the appropriate permissions for DDL.
If auto.create is enabled, the connector can CREATE the destination table if it is found to be missing. The creation takes place online with records being consumed from the topic, since the connector uses the record schema as a basis for the table definition. Primary keys are specified based on the key configuration settings.
If auto.evolve is enabled, the connector can perform limited auto-evolution by issuing ALTER on the destination table when it encounters a record for which a column is found to be missing. Since data-type changes and removal of columns can be dangerous, the connector does not attempt to perform such evolutions on the table. Addition of primary key constraints is also not attempted. In contrast, if auto.evolve is disabled no evolution is performed and the connector task fails with an error stating the missing columns.
For both auto-creation and auto-evolution, the nullability of a column is based on the optionality of the corresponding field in the schema, and default values are also specified based on the default value of the corresponding field if applicable. We use the following mapping from Connect schema types to database-specific types:
Schema Type MySQL Oracle PostgreSQL SQLite SQL Server Vertica
Auto-creation or auto-evolution is not supported for databases not mentioned here.
Important
For backwards-compatible table schema evolution, new fields in record schemas must be optional or have a default value. If you need to delete a field, the table schema should be manually altered to either drop the corresponding column, assign it a default value, or make it nullable.

How to use the Kafka Connect JDBC to source PostgreSQL with multiple schemas that contain tables with the same name?

I need to source data from a PostgreSQL database with ~2000 schemas. All schemas contain the same tables (it is a multi-tenant application).
The connector is configured as following:
{
"name": "postgres-source",
"connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
"timestamp.column.name": "updated",
"incrementing.column.name": "id",
"connection.password": "********",
"tasks.max": "1",
"mode": "timestamp+incrementing",
"topic.prefix": "postgres-source-",
"connection.user": "*********",
"poll.interval.ms": "3600000",
"numeric.mapping": "best_fit",
"connection.url": "jdbc:postgresql://*******:5432/*******",
"table.whitelist": "table1",
"value.converter": "org.apache.kafka.connect.json.JsonConverter",
"value.converter.schemas.enable": "false",
"key.converter": "org.apache.kafka.connect.json.JsonConverter",
"key.converter.schemas.enable":"false"
}
With this configuration, I get this error:
"The connector uses the unqualified table name as the topic name and has detected duplicate unqualified table names. This could lead to mixed data types in the topic and downstream processing errors. To prevent such processing errors, the JDBC Source connector fails to start when it detects duplicate table name configurations"
Apparently the connector does not want to publish data from multiple tables with the same name to a single topic.
This does not matter to me, it could go to a single topic or multiple topics (one for each schema).
As an additional info, if I add:
"schema.pattern": "schema1"
to the config, the connector works and the data from the specified schema and table is copied.
Is there a way to copy multiple schemas that contain tables with the same name?
Thank you

ElasticsearchSinkConnector Failed to deserialize data to Avro

I created the simplest kafka sink connector config and I'm using confluent 4.1.0:
{
"connector.class":
"io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
"type.name": "test-type",
"tasks.max": "1",
"topics": "dialogs",
"name": "elasticsearch-sink",
"key.ignore": "true",
"connection.url": "http://localhost:9200",
"schema.ignore": "true"
}
and in the topic I save the messages in JSON
{ "topics": "resd"}
But in the result I get an error:
Caused by: org.apache.kafka.common.errors.SerializationException: Error deserializing Avro message for id -1
Caused by: org.apache.kafka.common.errors.SerializationException: Unknown magic byte!
As cricket_007 says, you need to tell Connect to use Json deserialiser, if that's the format your data is in. Add this to your connector configuration:
"value.converter": "org.apache.kafka.connect.json.JsonConverter",
"value.converter.schemas.enable": "false",
"key.converter": "org.apache.kafka.connect.json.JsonConverter",
"key.converter.schemas.enable": "false"
That error happens because it's trying to read non Confluent Schema Registry encoded Avro messages.
If the topic data is Avro, it needs to use the Schema Registry.
Otherwise, if topic data is JSON, then you've started the connect cluster with AvroConverter on your keys or values in the property file, where you need to use the JsonConverter instead