Kafka connect path format not working properly - apache-kafka

I create a connector with the script below, but in S3 I see a partition format of /year=2015/month=12/day=07/hour=15/. Is there a way to get the 'dt'=YYYY-MM-dd/'hour'=HH/ format instead?
curl -X POST \
-H "Content-Type: application/json" \
--data '{
"name": "content.logging.test",
"config": {
"topics": "content.logging",
"connector.class": "io.confluent.connect.s3.S3SinkConnector",
"storage.class": "io.confluent.connect.s3.storage.S3Storage",
"format.class": "io.confluent.connect.s3.format.json.JsonFormat",
"s3.region": "ap-northeast-1",
"s3.bucket.name": "kafka-connect-test",
"locale": "en-US",
"timezone": "UTC",
"tasks.max": 1,
"flush.size": 10,
"partitioner.class": "io.confluent.connect.storage.partitioner.HourlyPartitioner",
"partition.duration.ms": 3600000,
"path.format": "'dt'=YYYY-MM-dd/'hour'=HH/"
}
}' http://$CONNECT_REST_ADVERTISED_HOST_NAME:8083/connectors

You should use the TimeBasedPartitioner if you want path.format to be honored; the HourlyPartitioner is equivalent to a TimeBasedPartitioner with a fixed path.format of 'year'=YYYY/'month'=MM/'day'=dd/'hour'=HH, so the path.format you set is ignored.
https://docs.confluent.io/kafka-connect-s3-sink/current/index.html#partitioning-records-into-s3-objects
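For example, a minimal sketch of the partitioner-related settings, keeping the rest of the config above unchanged (the Record timestamp extractor is an assumption; pick whichever extractor suits your data):
"partitioner.class": "io.confluent.connect.storage.partitioner.TimeBasedPartitioner",
"partition.duration.ms": 3600000,
"path.format": "'dt'=YYYY-MM-dd/'hour'=HH",
"locale": "en-US",
"timezone": "UTC",
"timestamp.extractor": "Record"
With this, objects should land under keys like .../dt=2015-12-07/hour=15/ instead of the year=/month=/day=/hour= layout.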

Related

Debezium NPE while creating oracle source connector

I am creating a Debezium Oracle source connector with the curl POST command below, but I am getting a NullPointerException and cannot work out where the problem is.
curl command
curl -X POST -H "Content-Type: application/json" --data '{
"name": "debez_ora_cdc",
"config": {
"connector.class": "io.debezium.connector.oracle.OracleConnector",
"tasks.max": "1",
"connection.url": "jdbc:oracle:thin:#host_ip:port:test",
"connection.user": "logminer_user",
"connection.password": "logminer_user",
"table.include.list": "logminer_user.users",
"database.server.name": "server1",
"database.tablename.case.insensitive": "true",
"database.hostname": "host_ip",
"database.port": "1522",
"database.user": "logminer_user",
"database.password": "logminer_user",
"database.dbname": "test",
"database.history.kafka.bootstrap.servers": "localhost:9092",
"database.history.kafka.topic": "server1.oracle.history",
"database.history.skip.unparseable.ddl": "true",
"include.schema.changes": "true",
"snapshot.mode": "initial",
"errors.log.enable": "true"
}
}' http://localhost:8083/connectors
Exception
{"name":"debez_ora_cdc","connector":{"state":"RUNNING","worker_id":"null:-1"},"tasks":[{"id":0,"state":"FAILED","worker_id":"null:-1","trace":"org.apache.kafka.connect.errors.ConnectException: An exception occurred in the change event producer. This connector will be stopped.\n\tat io.debezium.pipeline.ErrorHandler.setProducerThrowable(ErrorHandler.java:50)\n\tat io.debezium.pipeline.ChangeEventSourceCoordinator.lambda$start$0(ChangeEventSourceCoordinator.java:116)\n\tat java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)\n\tat java.util.concurrent.FutureTask.run(FutureTask.java:266)\n\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)\n\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)\n\tat java.lang.Thread.run(Thread.java:748)\nCaused by: io.debezium.DebeziumException: java.lang.NullPointerException\n\tat io.debezium.pipeline.source.AbstractSnapshotChangeEventSource.execute(AbstractSnapshotChangeEventSource.java:85)\n\tat io.debezium.pipeline.ChangeEventSourceCoordinator.doSnapshot(ChangeEventSourceCoordinator.java:155)\n\tat io.debezium.pipeline.ChangeEventSourceCoordinator.executeChangeEventSources(ChangeEventSourceCoordinator.java:137)\n\tat io.debezium.pipeline.ChangeEventSourceCoordinator.lambda$start$0(ChangeEventSourceCoordinator.java:109)\n\t... 5 more\nCaused by: java.lang.NullPointerException\n\tat io.debezium.relational.TableEditorImpl.columnWithName(TableEditorImpl.java:46)\n\tat io.debezium.relational.TableEditorImpl.hasColumnWithName(TableEditorImpl.java:50)\n\tat io.debezium.relational.TableEditorImpl.lambda$updatePrimaryKeys$0(TableEditorImpl.java:103)\n\tat java.util.ArrayList.removeIf(ArrayList.java:1413)\n\tat io.debezium.relational.TableEditorImpl.updatePrimaryKeys(TableEditorImpl.java:102)\n\tat io.debezium.relational.TableEditorImpl.create(TableEditorImpl.java:267)\n\tat io.debezium.relational.Tables.lambda$overwriteTable$2(Tables.java:192)\n\tat io.debezium.util.FunctionalReadWriteLock.write(FunctionalReadWriteLock.java:84)\n\tat io.debezium.relational.Tables.overwriteTable(Tables.java:186)\n\tat io.debezium.jdbc.JdbcConnection.readSchema(JdbcConnection.java:1209)\n\tat io.debezium.connector.oracle.OracleSnapshotChangeEventSource.readTableStructure(OracleSnapshotChangeEventSource.java:181)\n\tat io.debezium.connector.oracle.OracleSnapshotChangeEventSource.readTableStructure(OracleSnapshotChangeEventSource.java:35)\n\tat io.debezium.relational.RelationalSnapshotChangeEventSource.doExecute(RelationalSnapshotChangeEventSource.java:114)\n\tat io.debezium.pipeline.source.AbstractSnapshotChangeEventSource.execute(AbstractSnapshotChangeEventSource.java:76)\n\t... 8 more\n"}],"type":"source"}

Debezium should only read new changes

Even though I'm using snapshot.mode:schema_only, I'm getting the complete records of the database, whereas I only want the new ones. Are there any other modifications I should make?
This is the config of my source connector:
curl -i -X POST -H "Accept:application/json" -H "Content-Type:application/json" \
localhost:8083/connectors/ \
-d '{ "name": "inventory-connector",
"config": { "connector.class":"io.debezium.connector.mysql.MySqlConnector",
"tasks.max": "1",
"snapshot.mode":"schema_only",
"database.hostname": "----",
"database.port": "3306",
"database.user": "---",
"database.password": "----",
"database.server.id": "1",
"database.server.name": "data",
"database.whitelist": "---",
"database.history.kafka.bootstrap.servers": "kafka:9092",
"database.history.kafka.topic": "dbhistory.---"
}
}'

How to replicate all changes from source to destination db using debezium and confluent-sink-connector running on docker

Below is my Dockerfile for Kafka Connect JDBC and the MySQL driver:
FROM debezium/connect:1.3
ENV KAFKA_CONNECT_JDBC_DIR=$KAFKA_CONNECT_PLUGINS_DIR/kafka-connect-jdbc
ENV MYSQL_DRIVER_VERSION 8.0.20
ARG KAFKA_JDBC_VERSION=5.5.0
RUN curl -k -SL "https://dev.mysql.com/get/Downloads/Connector-J/mysql-connector-java-${MYSQL_DRIVER_VERSION}.tar.gz" \
| tar -xzf - -C /kafka/libs --strip-components=1 mysql-connector-java-8.0.20/mysql-connector-java-${MYSQL_DRIVER_VERSION}.jar
RUN mkdir $KAFKA_CONNECT_JDBC_DIR && cd $KAFKA_CONNECT_JDBC_DIR &&\
curl -sO https://packages.confluent.io/maven/io/confluent/kafka-connect-jdbc/$KAFKA_JDBC_VERSION/kafka-connect-jdbc-$KAFKA_JDBC_VERSION.jar
docker build . --tag kafka kafka-connect-sink
Below is my source db json
curl -i -X POST -H "Accept:application/json" -H "Content-Type:application/json" 192.168.99.102:8083/connectors/ -d '{
"name": "inventory-connector",
"config": {
"connector.class": "io.debezium.connector.mysql.MySqlConnector",
"tasks.max": "1",
"database.hostname": "mysql",
"database.port": "3306",
"database.user": "debezium",
"database.password": "dbz",
"database.server.id": "184054",
"database.server.name": "dbserver1",
"database.include.list": "inventory",
"database.history.kafka.bootstrap.servers": "kafka:9092",
"database.history.kafka.topic": "dbhistory.inventory"
}
}'
Below is my destination db sink json
curl -i -X POST -H "Accept:application/json" -H "Content-Type:application/json" 192.168.99.102:8083/connectors/ -d '{
"name": "inventory-connector-sink",
"config": {
"connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
"tasks.max": "1",
"connection.url": "jdbc:mysql://192.168.0.104:3306/pk?useSSL=false",
"connection.user": "pavan",
"connection.password": "root",
"topics": "dbserver1.inventory.customers",
"table.name.format": "pk.customers",
"auto.create": "true",
"auto.evolve": "true",
"delete.enabled": "true",
"insert.mode": "upsert",
"pk.fields": "id",
"pk.mode": "record_key",
"transforms": "unwrap",
"transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
"transforms.unwrap.drop.tombstones": "false",
"transforms.unwrap.delete.handling.mode": "rewrite"
}
}'
curl -i -X POST -H "Accept:application/json" -H "Content-Type:application/json" 192.168.99.102:8083/connectors/ -d '{
"name": "inventory-connector-sink-addresses",
"config": {
"connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
"tasks.max": "1",
"connection.url": "jdbc:mysql://192.168.0.104:3306/pk?useSSL=false",
"connection.user": "pavan",
"connection.password": "root",
"topics": "dbserver1.inventory.addresses",
"table.name.format": "pk.addresses",
"auto.create": "true",
"auto.evolve": "true",
"delete.enabled": "true",
"insert.mode": "upsert",
"pk.fields": "id",
"pk.mode": "record_key",
"transforms": "unwrap",
"transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
"transforms.unwrap.drop.tombstones": "false",
"transforms.unwrap.delete.handling.mode": "rewrite"
}
}'
With this configuration I need a separate sink connector for each topic, but the problem is that I have 100+ tables to replicate into the destination DB. Is there any way to do this in a single JSON configuration so that I can subscribe to all topics?
You can use the topics (or topics.regex) property to define the list of topics to consume, and the table.name.format property of the JDBC Sink connector or the RegexRouter SMT (or a combination of them) to override the destination table names:
curl -i -X POST -H "Accept:application/json" -H "Content-Type:application/json" 192.168.99.102:8083/connectors/ -d '{
"name": "inventory-connector-sink-addresses",
"config": {
"connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
"tasks.max": "1",
"connection.url": "jdbc:mysql://192.168.0.104:3306/pk?useSSL=false",
"connection.user": "pavan",
"connection.password": "root",
"topics": "dbserver1.inventory.addresses,dbserver1.inventory.customers",
"auto.create": "true",
"auto.evolve": "true",
"delete.enabled": "true",
"insert.mode": "upsert",
"pk.fields": "",
"pk.mode": "record_key",
"transforms": "route,unwrap",
"transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
"transforms.unwrap.drop.tombstones": "false",
"transforms.unwrap.delete.handling.mode": "rewrite",
"transforms.route.type": "org.apache.kafka.connect.transforms.RegexRouter",
"transforms.route.regex": "([^.]+)\\.([^.]+)\\.([^.]+)",
"transforms.route.replacement": "pk.$3"
}
}'
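If listing 100+ topics explicitly is impractical, a sketch of the same idea with topics.regex in place of topics (assuming all the Debezium topics follow the dbserver1.inventory.<table> naming; only the changed property is shown, and topics must then be removed since the two settings are mutually exclusive):
"topics.regex": "dbserver1\\.inventory\\..*",
The RegexRouter transform above still rewrites each consumed topic to pk.<table>, which the sink uses as the destination table name.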

Debezium Connector Outbox Transform

I am trying to use a MySQL source connector with the Outbox SMT supported by Debezium, with the following config. I am using the latest JARs of debezium-core and the Debezium MySQL connector (1.1).
curl -i -X POST -H "Accept:application/json" -H "Content-Type:application/json" localhost:8083/connectors/ -d '{
"name": "debezium-mysql-connector",
"config": {
"connector.class": "io.debezium.connector.mysql.MySqlConnector",
"tasks.max": "1",
"database.hostname": "MySql",
"database.port": "3306",
"database.user": "**",
"database.password": "**",
"database.server.id": "1033113244",
"database.server.name": "anomaly-changelog",
"database.whitelist": "anomaly",
"database.history.kafka.bootstrap.servers": "Kafka:9092",
"database.history.kafka.topic": "anomaly.schema.history",
"transforms": "outbox,reroute",
"transforms.reroute.type": "org.apache.kafka.connect.transforms.RegexRouter",
"transforms.reroute.regex": "(.*)",
"transforms.reroute.replacement": "$1-SMT",
"transforms.outbox.type": "io.debezium.transforms.outbox.EventRouter"
}
}'
But I am still getting the following error:
{"error_code":400,"message":"Connector configuration is invalid and contains the following 2 error(s):\nInvalid value io.debezium.transforms.outbox.EventRouter for configuration transforms.outbox.type: Class io.debezium.transforms.outbox.EventRouter could not be found.\nInvalid value null for configuration transforms.outbox.type: Not a Transformation}
I don't see why it is not being recognized.
You can try something like this:
curl -i -X POST -H "Accept:application/json" -H "Content-Type:application/json" demo:8083/connectors/ -d '{
"name": "order-connector",
"config": {
"connector.class": "io.debezium.connector.mysql.MySqlConnector",
"tasks.max": "1",
"database.hostname": "mariadb_order",
"database.port": "3306",
"database.user": "root",
"database.password": "***",
"database.server.id": "223344",
"database.server.name": "orderdbserver",
"table.whitelist": "orderdb.outbox",
"transforms": "outbox",
"transforms.outbox.type": "io.debezium.transforms.outbox.EventRouter",
"database.history.kafka.bootstrap.servers": "kafka:9092",
"database.history.kafka.topic": "dbhistory.orderdb",
"transforms.outbox.table.fields.additional.placement": "aggregateid:envelope:id"
}
}'
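If the EventRouter class still cannot be found, it is worth double-checking that the Debezium MySQL plugin (which bundles debezium-core, where the EventRouter SMT lives) is actually on the worker's plugin path. A quick check, assuming the Connect REST API is reachable on localhost:8083:
curl -s localhost:8083/connector-plugins
If io.debezium.connector.mysql.MySqlConnector does not appear in the output, the worker is not loading the Debezium JARs at all, and none of the transforms shipped with them will resolve.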

Kafka Connect JDBC source loading millions of records from MSSQL server throws out of memory error

I have tried to load 77 million records from an MSSQL server into a Kafka topic through the Kafka Connect JDBC source.
I tried a batch approach with batch.max.rows set to 1000. In this case, after 1000 records, it throws out of memory. Please share suggestions on how to make it work.
Below are the connector configurations I tried:
curl -X POST http://test.com:8083/connectors -H "Content-Type: application/json" -d '{
"name": "mssql_jdbc_rsitem_pollx",
"config": {
"connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
"connection.url": "jdbc:sqlserver://test:1433;databaseName=xxx",
"connection.user": "xxxx",
"connection.password": "xxxx",
"topic.prefix": "mssql-rsitem_pollx-",
"mode":"incrementing",
"table.whitelist" : "test",
"timestamp.column.name": "itemid",
"max.poll.records" :"100",
"max.poll.interval.ms":"3000",
"validate.non.null": false
}
}'
curl -X POST http://test.com:8083/connectors -H "Content-Type: application/json" -d '{
"name": "mssql_jdbc_test_polly",
"config": {
"connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
"tasks.max": "10",
"connection.url": "jdbc:sqlserver://test:1433;databaseName=xxx;defaultFetchSize=10000;useCursorFetch=true",
"connection.user": "xxxx",
"connection.password": "xxxx",
"topic.prefix": "mssql-rsitem_polly-",
"mode":"incrementing",
"table.whitelist" : "test",
"timestamp.column.name": "itemid",
"poll.interval.ms":"86400000",
"validate.non.null": false
}
}'
Try to increase the Java heap size; on the command line, write:
export KAFKA_HEAP_OPTS="-Xms1g -Xmx2g"
You can change the "-Xmx2g" part to match your capacity.
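For example, a minimal sketch assuming the Connect worker is launched from a shell with the stock connect-distributed script (script names and properties paths vary by installation):
export KAFKA_HEAP_OPTS="-Xms1g -Xmx2g"
connect-distributed /etc/kafka/connect-distributed.properties
The variable has to be set in the environment of the process that starts the worker; if Connect runs as a service or inside a container, apply the setting there instead.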