I have the following Kafka connector config:
{
"name": "some-topic-connector",
"config": {
"connector.class": "io.confluent.connect.hdfs.HdfsSinkConnector",
"tasks.max": "1",
"topics": "some-topic",
"hdfs.url": "hdfs://hadoopams1",
"logs.dir": "apps/kafka-connect-preview/some-topic.logs",
"topics.dir": "apps/kafka-connect-preview/some-topic.db",
"hadoop.conf.dir": "/etc/hadoop/conf",
"flush.size": "1000000",
"rotate.interval.ms": "3600000",
"rotate.schedule.interval.ms": "86400000",
"hive.integration": "true",
"hive.metastore.uris": "thrift://metastore-1.hadoop-1.foobar.com:9083",
"hive.database": "preview",
"locale": "en_GB",
"timezone": "Europe/Berlin",
"key.converter": "org.apache.kafka.connect.storage.StringConverter",
"value.converter": "io.confluent.connect.avro.AvroConverter",
"value.converter.schema.registry.url": "http://schema-registry.preview.foobar.com",
"schema.compatibility": "BACKWARD",
"partitioner.class": "io.confluent.connect.storage.partitioner.TimeBasedPartitioner",
"partition.duration.ms": "86400000",
"path.format": "'dt'=YYYYMMdd",
"partition.field.name": "dt"
}
}
I've verified that the data is written to HDFS successfully, but for some reason the table in Hive is not being created. From the logs, I can't see any errors in Kafka Connect.
What am I doing wrong? Is there some configuration or a requirement that I'm missing?
There is a known issue (feature?) where the HdfsSinkConnector does not create a table in Hive if the logs.dir and topics.dir directories already exist. This can, for example, happen if you decided to enable Hive integration at some point after the connector was already created.
There is also a pull request that fixes this issue, but it has not been merged.
So either
you build your own HdfsSinkConnector based on the pull request mentioned above,
you rename the directories, recreate the connector, wait until the Hive tables have been created and then move the data back (difficult in a production environment, of course),
or you create the table manually (a sketch follows below).
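If you go the manual route, a minimal sketch of the Hive DDL could look like this, assuming the default Avro format, the connector's <topics.dir>/<topic> layout and the paths from the config above; the table name and the two columns are placeholders and have to be replaced with your topic's actual Avro fields:
CREATE EXTERNAL TABLE preview.some_topic (
-- placeholder columns, replace with the fields of the topic's Avro schema
id STRING,
payload STRING
)
PARTITIONED BY (dt STRING)
STORED AS AVRO
LOCATION 'hdfs://hadoopams1/apps/kafka-connect-preview/some-topic.db/some-topic';
-- pick up the partitions the connector has already written
MSCK REPAIR TABLE preview.some_topic;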
Related
We have around 100 tables in a SQL Server DB (application DB) which need to be synced to a SQL Server DB (for analytics) in near real time.
Future use case: scale the proof of concept to 30 source DBs feeding one destination DB (for analytics) in near real time.
I am thinking of using one sink connector or a few sink connectors for multiple tables. Please let me know if this is a good idea.
But I am not sure how to configure the sink to cater for multiple tables, especially since each table might have its own primary key. The internet seems to have only very simple sink connector examples that do not address complex use cases.
Debezium CDC (Source) config
{ "name": "wwi",
"config": {
"connector.class": "io.debezium.connector.sqlserver.SqlServerConnector",
"database.dbname": "************************",
"database.history": "io.debezium.relational.history.MemoryDatabaseHistory",
"database.hostname": "**********************",
"database.password": "**********************",
"database.port": "1433",
"database.server.name": "******",
"database.user": "*********",
"decimal.handling.mode": "string",
"key.converter": "org.apache.kafka.connect.json.JsonConverter",
"key.converter.schemas.enable": "true",
"snapshot.mode": "schema_only",
"table.include.list": "Sales.Orders,Warehouse.StockItems",
"tasks.max": "1",
"tombstones.on.delete": "false",
"transforms": "route,unwrap",
"transforms.route.regex": "([^.]+)\\.([^.]+)\\.([^.]+)",
"transforms.route.replacement": "$3",
"transforms.route.type": "org.apache.kafka.connect.transforms.RegexRouter",
"transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
"value.converter.schemas.enable": "true",
"value.convertor": "org.apache.kafka.connect.json.JsonConverter"
}
}
JDBC Sink config
{
"name": "sqlsinkcon",
"config": {
"connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
"topics": "orders",
"tasks.max": "1",
"auto.evolve": "true",
"connection.user": "********",
"auto.create": "true",
"connection.url": "jdbc:sqlserver://************",
"insert.mode": "upsert",
"pk.mode":"record_key",
"pk.fields":"OrderID",
"db.name": "kafkadestination"
}
}
The sink connector writes one table per consumed topic. topics or topics.regex can be used to consume multiple topics at once.
Regarding scalability (or at least fault tolerance), I prefer one sink task with one topic (and therefore writing to one table). Otherwise, if one connector consumes multiple topics and fails, it can potentially take down all of its task threads due to consumer rebalancing.
Also, using JSON / plaintext formats in Kafka isn't optimal in terms of network bandwidth. I'd suggest a binary format like Avro or Protobuf.
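If you do want a single sink connector for several tables, a minimal sketch could look like the following, assuming the Debezium record key contains each table's primary key columns: with pk.mode set to record_key and pk.fields left out, the JDBC sink uses all fields of the key struct, so every table keeps its own key. The connector name, topic regex and connection details are placeholders:
{
"name": "multi-table-sink",
"config": {
"connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
"topics.regex": "Orders|StockItems",
"tasks.max": "1",
"connection.url": "jdbc:sqlserver://<host>:1433;databaseName=kafkadestination",
"connection.user": "********",
"connection.password": "********",
"insert.mode": "upsert",
"pk.mode": "record_key",
"table.name.format": "${topic}",
"auto.create": "true",
"auto.evolve": "true"
}
}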
I am using MongoSourceConnector to connect a Kafka topic with a MongoDB collection. For a single database with a single Kafka topic it works fine, but is there any way I could connect multiple Mongo databases to a single Kafka topic?
If you are running Kafka Connect in distributed mode, then you can create another connector config file with the configuration shown below.
I am not really sure about multiple databases and a single Kafka topic, but you can certainly listen to the change streams of multiple databases and push the data to topics. Since topic creation depends on database_name.collection_name, you will end up with more topics.
You can provide a regex in the pipeline to listen to multiple databases.
"pipeline": "[{\"$match\":{\"$and\":[{\"ns.db\":{\"$regex\":/^database-names_.*/}},{\"ns.coll\":{\"$regex\":/^collection_name$/}}]}}]"
Here is the complete Kafka connector configuration.
Mongo to Kafka source connector
{
"name": "mongo-to-kafka-connect",
"config": {
"connector.class": "com.mongodb.kafka.connect.MongoSourceConnector",
"publish.full.document.only": "true",
"tasks.max": "3",
"key.converter.schemas.enable": "false",
"topic.creation.enable": "true",
"poll.await.time.ms": 1000,
"poll.max.batch.size": 100,
"topic.prefix": "any prefix for topic name",
"output.json.formatter": "com.mongodb.kafka.connect.source.json.formatter.SimplifiedJson",
"connection.uri": "mongodb://<username>:<password>#ip:27017,ip:27017,ip:27017,ip:27017/?authSource=admin&replicaSet=xyz&tls=true",
"value.converter.schemas.enable": "false",
"copy.existing": "true",
"topic.creation.default.replication.factor": 3,
"topic.creation.default.partitions": 3,
"topic.creation.compacted.cleanup.policy": "compact",
"value.converter": "org.apache.kafka.connect.storage.StringConverter",
"key.converter": "org.apache.kafka.connect.storage.StringConverter",
"mongo.errors.log.enable": "true",
"heartbeat.interval.ms": 10000,
"pipeline": "[{\"$match\":{\"$and\":[{\"ns.db\":{\"$regex\":/^database-names_.*/}},{\"ns.coll\":{\"$regex\":/^collection_name$/}}]}}]"
}
}
You can get more details from the official docs:
https://www.mongodb.com/docs/kafka-connector/current/source-connector/
https://docs.confluent.io/platform/current/connect/index.html
I want to sync two tables from two distinct database services with Kafka Connect Source and Kafka Connect Sink.
Kafka Connect Source reads data from the source database and publishes changes to a topic named TOP1, and Kafka Connect Sink subscribes to TOP1 and should write the changes into the destination database.
The source and destination databases are MSSQL, and I use the Debezium connector for SQL Server.
I created Kafka Connect Source with following configuration:
{
"name": "sql-source",
"config": {
"connector.class": "io.debezium.connector.sqlserver.SqlServerConnector",
"tasks.max": "1",
"value.converter": "io.confluent.connect.avro.AvroConverter",
"errors.log.enable": "true",
"errors.log.include.messages": "true",
"database.server.name": "TEST",
"database.dbname": "test_source",
"database.hostname": "172.1x.xx.xx",
"database.port": "1433",
"database.user": "sa",
"database.password": "xxxxx",
"database.instance": "MSSQLSERVER",
"database.history.kafka.bootstrap.servers": "kafka01.xxxx.dev:9092",
"database.history.kafka.topic": "schema-changes.inventory",
"value.converter.schema.registry.url":"http://kafka01.xxxx.dev:8081"
}
}
This works great and publishes any change (insert, update, delete) with the following schema:
{
"before": {...},
"after": {...},
"source": {...}
}
But how should I create the Kafka Connect Sink configuration so that the destination database contains exactly the same data as the source database?
When a record is inserted in the source, the same record should be inserted in the destination; when a record is deleted in the source, the same record should be deleted from the destination; and an update in the source should result in an update in the destination.
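A minimal sketch of a JDBC sink that mirrors inserts, updates and deletes could look like the one below. It assumes the Debezium envelope is unwrapped with the ExtractNewRecordState SMT on the sink side, that the record key carries the primary key, and that tombstones are not dropped, since delete.enabled requires pk.mode=record_key and relies on tombstone records. The connector name, topic, hosts and credentials are placeholders:
{
"name": "sql-sink",
"config": {
"connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
"topics": "TEST.dbo.my_table",
"tasks.max": "1",
"connection.url": "jdbc:sqlserver://<host>:1433;databaseName=test_destination",
"connection.user": "sa",
"connection.password": "xxxxx",
"transforms": "unwrap",
"transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
"transforms.unwrap.drop.tombstones": "false",
"insert.mode": "upsert",
"delete.enabled": "true",
"pk.mode": "record_key",
"auto.create": "true",
"auto.evolve": "true",
"key.converter": "io.confluent.connect.avro.AvroConverter",
"key.converter.schema.registry.url": "http://kafka01.xxxx.dev:8081",
"value.converter": "io.confluent.connect.avro.AvroConverter",
"value.converter.schema.registry.url": "http://kafka01.xxxx.dev:8081"
}
}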
I am using ksqlDB, where I have created a table from a stream. When I fire a select query on that table it returns all the records properly. Now I want to sink that table into MongoDB. I am also able to create a sink from the Kafka table to MongoDB, but somehow it sinks only one record into MongoDB, whereas the table has 100 records. Below is my MongoDB sink connector.
{
"name": "MongoSinkConnectorConnector_1",
"config": {
"connector.class": "com.mongodb.kafka.connect.MongoSinkConnector",
"key.converter": "org.apache.kafka.connect.storage.StringConverter",
"value.converter": "io.confluent.connect.avro.AvroConverter",
"topics": "FEEDS",
"connection.uri": "mongodb://xxx:xxx#x.x.x.x:27017/",
"database": "xxx",
"max.num.retries": "1000000",
"writemodel.strategy": "com.mongodb.kafka.connect.sink.writemodel.strategy.UpdateOneTimestampsStrategy",
"value.projection.type": "allowlist",
"value.projection.list": "id",
"document.id.strategy": "com.mongodb.kafka.connect.sink.processor.id.strategy.PartialValueStrategy",
"buffer.capacity": "20000",
"value.converter.schema.registry.url": "http://x.x.x.x:8081",
"key.converter.schemas.enable": "false",
"insert.mode": "upsert"
}
}
I am not able to understand the reason behind this. Any help is appreciated. Thank you.
You can set the "batch.size" property in the sink connector configuration, and you can also choose a better write model strategy; you can read about these in the official MongoDB connector documentation: https://docs.confluent.io/cloud/current/connectors/cc-mongo-db-source.html
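For illustration, a sketch of how those two knobs could be set in the sink config above; the batching property appears as max.batch.size in the MongoDB sink connector docs (check your version), the value 100 is an arbitrary assumption, and ReplaceOneDefaultStrategy is just one alternative strategy:
{
"max.batch.size": "100",
"writemodel.strategy": "com.mongodb.kafka.connect.sink.writemodel.strategy.ReplaceOneDefaultStrategy"
}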
I am using Debezium Postgres connector to capture changes in postgres tables.
The documentation for this connector
https://debezium.io/documentation/reference/connectors/postgresql.html
mentions a configuration parameter
table.include.list
However, when I set the value of this parameter to 'config.abc', changes from both tables in the config schema (namely abc and def) are still getting streamed.
The reason I want to do this is that I want to create 2 separate connectors, one for each of the 2 tables, to split the load and get faster change data streaming.
Is this a known issue? Is there any way to overcome this?
I have the same problem here (Debezium version 1.1.0.Final and PostgreSQL 11.13.0). Below is my configuration:
{
"connector.class": "io.debezium.connector.postgresql.PostgresConnector",
"transforms.unwrap.delete.handling.mode": "rewrite",
"slot.name": "debezium_planning",
"transforms": "unwrap,extractInt",
"include.schema.changes": "false",
"decimal.handling.mode": "string",
"database.schema": "partsmaster",
"transforms.unwrap.drop.tombstones": "false",
"transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
"value.converter": "org.apache.kafka.connect.json.JsonConverter",
"key.converter": "org.apache.kafka.connect.converters.LongConverter",
"database.user": "************",
"database.dbname": "smartng-db",
"transforms.extractInt.type": "org.apache.kafka.connect.transforms.ExtractField$Key",
"database.server.name": "ASLog.m3d.smartng-db",
"database.port": "***********",
"plugin.name": "pgoutput",
"column.exclude.list": "create_date, create_user, change_date, change_user",
"transforms.extractInt.field": "planning_id",
"database.hostname": "************",
"database.password": "*************",
"name": "ASLog.m3d.source-postgres-smartng-planning",
"transforms.unwrap.add.fields": "op,table,lsn,source.ts_ms",
"table.include.list": "partsmaster.planning",
"snapshot.mode": "never"
}
A change in another table, partsmaster.basic, is causing this connector to fail because the attribute planning_id is not available in partsmaster.basic.
I need separate connectors for each table. table.include.list is not working, and neither is table.exclude.list.
My last resort would be to use a separate Postgres publication for each connector. But I still hope I can find an immediate configuration solution here, or be pointed to what I am missing. In a previous version I used table.whitelist without problems.
Solution:
Create a separate publication for each table:
CREATE PUBLICATION planning_publication FOR TABLE partsmaster.planning;
Drop the previous replication slot:
select pg_drop_replication_slot('debezium_planning');
Change your connector configurations:
{
"slot.name": "dbz_planning",
"publication.name": "planning_publication"
}
et voilà!
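Following the same pattern, a second connector for partsmaster.basic would get its own publication, its own replication slot and its own key field for the ExtractField transform. The names below (basic_publication, dbz_basic, basic_id) are assumptions for illustration, and only the settings that differ from the planning connector are shown:
CREATE PUBLICATION basic_publication FOR TABLE partsmaster.basic;
{
"name": "ASLog.m3d.source-postgres-smartng-basic",
"slot.name": "dbz_basic",
"publication.name": "basic_publication",
"table.include.list": "partsmaster.basic",
"transforms.extractInt.field": "basic_id"
}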