Multiple MongoDB collections to a Kafka topic - mongodb

The application writes data every month to a new collection (for example, journal_2205, journal_2206). Is it possible to configure the connector so that it reads the oplog from each new collection and writes to one topic? I use this connector:
https://www.mongodb.com/docs/kafka-connector/current/source-connector/
Thank you!

Yes, this is possible: you can listen to change streams from multiple MongoDB collections. You just need to provide a regex for the collection names in the pipeline, and you can even provide a regex for the database names if you have multiple databases.
"pipeline": "[{\"$match\":{\"$and\":[{\"ns.db\":{\"$regex\":/^database-name$/}},{\"ns.coll\":{\"$regex\":/^journal_.*/}}]}}]"
You can even use $nin to exclude any database whose change stream you don't want to listen to.
"pipeline": "[{\"$match\":{\"$and\":[{\"ns.db\":{\"$regex\":/^database-name$/,\"$nin\":[/^any_database_name$/]}},{\"ns.coll\":{\"$regex\":/^journal_.*/}}]}}]"
Here is the complete Kafka connector configuration.
Mongo to Kafka source connector
{
  "name": "mongo-to-kafka-connect",
  "config": {
    "connector.class": "com.mongodb.kafka.connect.MongoSourceConnector",
    "publish.full.document.only": "true",
    "tasks.max": "3",
    "key.converter.schemas.enable": "false",
    "topic.creation.enable": "true",
    "poll.await.time.ms": 1000,
    "poll.max.batch.size": 100,
    "topic.prefix": "any prefix for topic name",
    "output.json.formatter": "com.mongodb.kafka.connect.source.json.formatter.SimplifiedJson",
    "connection.uri": "mongodb://<username>:<password>@ip:27017,ip:27017,ip:27017,ip:27017/?authSource=admin&replicaSet=xyz&tls=true",
    "value.converter.schemas.enable": "false",
    "copy.existing": "true",
    "topic.creation.default.replication.factor": 3,
    "topic.creation.default.partitions": 3,
    "topic.creation.compacted.cleanup.policy": "compact",
    "value.converter": "org.apache.kafka.connect.storage.StringConverter",
    "key.converter": "org.apache.kafka.connect.storage.StringConverter",
    "mongo.errors.log.enable": "true",
    "heartbeat.interval.ms": 10000,
    "pipeline": "[{\"$match\":{\"$and\":[{\"ns.db\":{\"$regex\":/^database-name$/}},{\"ns.coll\":{\"$regex\":/^journal_.*/}}]}}]"
  }
}
You can get more details from official docs.
https://www.mongodb.com/docs/kafka-connector/current/source-connector/
https://docs.confluent.io/platform/current/connect/index.html
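The question also asks for everything to end up in one topic. The source connector publishes each collection's change events to its own topic, named from topic.prefix, the database, and the collection, so one way to collapse those into a single topic is a RegexRouter transform on the source connector. This is only a sketch under that naming assumption; the transform name, regex, and target topic suffix are illustrative and should be adjusted to your prefix and database:
{
  "transforms": "routeJournals",
  "transforms.routeJournals.type": "org.apache.kafka.connect.transforms.RegexRouter",
  "transforms.routeJournals.regex": "(.*)\\.journal_.*",
  "transforms.routeJournals.replacement": "$1.journal"
}
With that in place, events from journal_2205, journal_2206, and so on would all be routed to a single <prefix>.<db>.journal topic instead of one topic per collection.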

Related

Only Map objects supported in absence of schema for record conversion to BigQuery format

I'm streaming data from Postgres to Kafka to BigQuery. Most tables in PG have a primary key, so most tables/topics have an Avro key and value schema, and these all go to BigQuery fine.
I do have a couple of tables that do not have a PK and consequently have no Avro key schema.
When I create a sink connector for those tables, the connector errors with:
Caused by: com.wepay.kafka.connect.bigquery.exception.ConversionConnectException: Only Map objects supported in absence of schema for record conversion to BigQuery format.
If I remove the 'key.converter' config, then I get a 'Top-level Kafka Connect schema must be of type 'struct'' error.
How do I handle this?
Here's the connector config for reference,
{
  "project": "staging",
  "defaultDataset": "data_lake",
  "keyfile": "<redacted>",
  "keySource": "JSON",
  "sanitizeTopics": "true",
  "kafkaKeyFieldName": "_kid",
  "autoCreateTables": "true",
  "allowNewBigQueryFields": "true",
  "upsertEnabled": "false",
  "bigQueryRetry": "5",
  "bigQueryRetryWait": "120000",
  "bigQueryPartitionDecorator": "false",
  "name": "hd-sink-bq",
  "connector.class": "com.wepay.kafka.connect.bigquery.BigQuerySinkConnector",
  "tasks.max": "1",
  "key.converter": "io.confluent.connect.avro.AvroConverter",
  "key.converter.schema.registry.url": "<redacted>",
  "key.converter.basic.auth.credentials.source": "USER_INFO",
  "key.converter.schema.registry.basic.auth.user.info": "<redacted>",
  "value.converter": "io.confluent.connect.avro.AvroConverter",
  "value.converter.schema.registry.url": "<redacted>",
  "value.converter.basic.auth.credentials.source": "USER_INFO",
  "value.converter.schema.registry.basic.auth.user.info": "<redacted>",
  "topics": "public.event_issues",
  "errors.tolerance": "all",
  "errors.log.include.messages": "true",
  "errors.deadletterqueue.topic.name": "connect.bq-sink.deadletter",
  "errors.deadletterqueue.topic.replication.factor": "1",
  "errors.deadletterqueue.context.headers.enable": "true",
  "transforms": "tombstoneHandler",
  "offset.flush.timeout.ms": "300000",
  "transforms.dropNullRecords.predicate": "isNullRecord",
  "transforms.dropNullRecords.type": "org.apache.kafka.connect.transforms.Filter",
  "transforms.tombstoneHandler.behavior": "drop_warn",
  "transforms.tombstoneHandler.type": "io.aiven.kafka.connect.transforms.TombstoneHandler"
}
In my case, I used to handle this by using a predicate, as follows:
{
  ...
  "predicates.isTombstone.type": "org.apache.kafka.connect.transforms.predicates.RecordIsTombstone",
  "predicates": "isTombstone",
  "transforms.x.predicate": "isTombstone",
  "transforms.x.negate": true
  ...
}
This is as per the docs here, and transforms.x.negate will skip such tombstone records.
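A closely related variant of this predicate approach, if the goal is simply to drop tombstone (null-value) records before they reach the BigQuery sink, is the built-in Filter transform guarded by the RecordIsTombstone predicate. A minimal sketch; the names dropTombstones and isTombstone are illustrative, and predicates require Kafka Connect 2.6 or newer:
{
  "transforms": "dropTombstones",
  "transforms.dropTombstones.type": "org.apache.kafka.connect.transforms.Filter",
  "transforms.dropTombstones.predicate": "isTombstone",
  "predicates": "isTombstone",
  "predicates.isTombstone.type": "org.apache.kafka.connect.transforms.predicates.RecordIsTombstone"
}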

Kafka MongoDB sink Connector exception handling

I have created a connector from Kafka to MongoDB to sink the data. In some cases I get bad data on my topic, and when that topic is sunk to the DB it gives me a duplicate key error because of an index I created.
In this case I want to move that data to the DLQ, but it is not moving the record.
This is my connector; can anyone please help me with this?
{
  "name": "test_1",
  "config": {
    "connector.class": "com.mongodb.kafka.connect.MongoSinkConnector",
    "topics": "test",
    "connection.uri": "xxx",
    "database": "test",
    "collection": "test_record",
    "key.converter": "org.apache.kafka.connect.storage.StringConverter",
    "key.converter.schemas.enable": "false",
    "value.converter": "io.confluent.connect.avro.AvroConverter",
    "value.converter.schemas.enable": "true",
    "value.converter.schema.registry.url": "http://xxx:8081",
    "document.id.strategy.overwrite.existing": "true",
    "document.id.strategy": "com.mongodb.kafka.connect.sink.processor.id.strategy.ProvidedInKeyStrategy",
    "transforms": "hk",
    "transforms.hk.type": "org.apache.kafka.connect.transforms.HoistField$Key",
    "transforms.hk.field": "_id",
    "writemodel.strategy": "com.mongodb.kafka.connect.sink.writemodel.strategy.UpdateOneTimestampsStrategy",
    "write.method": "upsert",
    "errors.tolerance": "all",
    "errors.deadletterqueue.topic.name": "dlq_sink",
    "errors.deadletterqueue.context.headers.enable": true,
    "errors.retry.delay.max.ms": 60000,
    "errors.retry.timeout": 300000
  }
}
Thanks,

Allow only some columns in SingleStore Kafka Connect

I am using Kafka to send my CDC data, which is collected by Debezium, to a SingleStore database, and I am using this Kafka Connect JSON:
{
  "name": "my-connector",
  "config": {
    "connector.class": "com.singlestore.kafka.SingleStoreSinkConnector",
    "tasks.max": "1",
    "transforms": "dropPrefix,unwrap",
    "transforms.dropPrefix.type": "org.apache.kafka.connect.transforms.RegexRouter",
    "transforms.dropPrefix.regex": "dbserver1.inventory.(.*)",
    "transforms.dropPrefix.replacement": "$1",
    "errors.tolerance": "all",
    "errors.log.enable": "true",
    "errors.log.include.messages": "true",
    "transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
    "topics": "dbserver1.inventory.addresses",
    "connection.ddlEndpoint": "memsql:3306",
    "connection.database": "test",
    "connection.user": "root",
    "connection.password": "password",
    "insert.mode": "upsert",
    "tableKey.primary.keyName": "id",
    "fields.whitelist": "id,city",
    "auto.create": "true",
    "auto.evolve": "true",
    "transforms.unwrap.delete.handling.mode": "rewrite",
    "transforms.unwrap.add.fields": "ts_ms",
    "singlestore.metadata.allow": true,
    "singlestore.metadata.table": "kafka_connect_transaction_metadata"
  }
}
I want the SingleStore database to only receive and save data from the columns id and city,
but apparently
"fields.whitelist": "id,city",
does not work in this kind of Kafka Connect sink like it does in the JDBC sink connector. How can I manage this?
It's me again. It looks like you should be able to use Arcion Cloud as your CDC tool; it will allow you to filter for specific columns within a table and then apply the inserts/updates/deletes to SingleStore.
https://docs.arcion.io/docs/references/filter-reference/
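If you would rather stay within Kafka Connect, one technique commonly used for column filtering (it is not part of the answer above, so verify it in your setup) is the built-in ReplaceField SMT, which keeps only the listed value fields before the record reaches the sink. A sketch, chained after the unwrap transform from the config above; the transform name keepColumns is illustrative, and older Connect versions call the include option whitelist:
{
  "transforms": "dropPrefix,unwrap,keepColumns",
  "transforms.keepColumns.type": "org.apache.kafka.connect.transforms.ReplaceField$Value",
  "transforms.keepColumns.include": "id,city"
}
This filters the record value itself, so only id and city ever reach the SingleStore connector.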

Error handling for invalid JSON in Kafka sink connector

I have a sink connector for MongoDB that takes JSON from a topic and puts it into a MongoDB collection. But when I send invalid JSON from a producer to that topic (e.g. with an invalid special character ") => {"id":1,"name":"\"}, the connector stops. I tried using errors.tolerance = all, but the same thing happens. What should happen is that the connector skips and logs the invalid JSON and keeps running. My distributed-mode connector config is as follows:
{
  "name": "sink-mongonew_test1",
  "config": {
    "connector.class": "com.mongodb.kafka.connect.MongoSinkConnector",
    "topics": "error7",
    "connection.uri": "mongodb://****:27017",
    "database": "abcd",
    "collection": "abc",
    "type.name": "kafka-connect",
    "key.ignore": "true",
    "document.id.strategy": "com.mongodb.kafka.connect.sink.processor.id.strategy.PartialValueStrategy",
    "value.projection.list": "id",
    "value.projection.type": "whitelist",
    "writemodel.strategy": "com.mongodb.kafka.connect.sink.writemodel.strategy.UpdateOneTimestampsStrategy",
    "delete.on.null.values": "false",
    "key.converter": "org.apache.kafka.connect.json.JsonConverter",
    "value.converter": "org.apache.kafka.connect.json.JsonConverter",
    "key.converter.schemas.enable": "false",
    "value.converter.schemas.enable": "false",
    "errors.tolerance": "all",
    "errors.log.enable": "true",
    "errors.log.include.messages": "true",
    "errors.deadletterqueue.topic.name": "crm_data_deadletterqueue",
    "errors.deadletterqueue.topic.replication.factor": "1",
    "errors.deadletterqueue.context.headers.enable": "true"
  }
}
Since Apache Kafka 2.0, Kafka Connect has included error handling options, including the functionality to route messages to a dead letter queue, a common technique in building data pipelines.
https://www.confluent.io/blog/kafka-connect-deep-dive-error-handling-dead-letter-queues/
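For reference, these are the properties in the config above that belong to that error-handling framework; they only take effect on Kafka Connect 2.0 or newer:
{
  "errors.tolerance": "all",
  "errors.log.enable": "true",
  "errors.log.include.messages": "true",
  "errors.deadletterqueue.topic.name": "crm_data_deadletterqueue",
  "errors.deadletterqueue.topic.replication.factor": "1",
  "errors.deadletterqueue.context.headers.enable": "true"
}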
As commented, you're using connect-api-1.0.1.*.jar (version 1.0.1), so that explains why those properties are not working.
Your alternatives, outside of running a newer version of Kafka Connect, include NiFi or Spark Structured Streaming.

Partition By Multiple Nested Fields in Kafka Connect HDFS Sink

We are running the Kafka HDFS sink connector (version 5.2.1) and need the HDFS data to be partitioned by multiple nested fields. The data in the topics is stored as Avro and has nested elements. However, Connect cannot recognize the nested fields and throws an error that the field cannot be found. Below is the connector configuration we are using. Doesn't the HDFS sink connector support partitioning by nested fields? I can partition using non-nested fields.
{
  "connector.class": "io.confluent.connect.hdfs.HdfsSinkConnector",
  "topics.dir": "/projects/test/kafka/logdata/coss",
  "avro.codec": "snappy",
  "flush.size": "200",
  "connect.hdfs.principal": "test@DOMAIN.COM",
  "rotate.interval.ms": "500000",
  "logs.dir": "/projects/test/kafka/tmp/wal/coss4",
  "hdfs.namenode.principal": "hdfs/_HOST@HADOOP.DOMAIN",
  "hadoop.conf.dir": "/etc/hdfs",
  "topics": "test1",
  "connect.hdfs.keytab": "/etc/hdfs-qa/test.keytab",
  "hdfs.url": "hdfs://nameservice1:8020",
  "hdfs.authentication.kerberos": "true",
  "name": "hdfs_connector_v1",
  "key.converter": "org.apache.kafka.connect.storage.StringConverter",
  "value.converter": "io.confluent.connect.avro.AvroConverter",
  "value.converter.schema.registry.url": "http://myschema:8081",
  "partition.field.name": "meta.ID,meta.source,meta.HH",
  "partitioner.class": "io.confluent.connect.storage.partitioner.FieldPartitioner"
}
I added nested field support for the TimestampPartitioner, but the FieldPartitioner still has an outstanding PR:
https://github.com/confluentinc/kafka-connect-storage-common/pull/67
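Until that PR is merged, a workaround that is sometimes used (it is not part of the original answer, so verify it against your data) is to flatten the nested Avro structure with the built-in Flatten SMT so the FieldPartitioner only has to deal with top-level field names. A sketch, assuming an underscore delimiter and the meta.* fields from the config above; the transform name flatten is illustrative:
{
  "transforms": "flatten",
  "transforms.flatten.type": "org.apache.kafka.connect.transforms.Flatten$Value",
  "transforms.flatten.delimiter": "_",
  "partition.field.name": "meta_ID,meta_source,meta_HH",
  "partitioner.class": "io.confluent.connect.storage.partitioner.FieldPartitioner"
}
Keep in mind this also flattens the records that are written to HDFS, which may or may not be acceptable for downstream consumers.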