Error handling for invalid JSON in Kafka sink connector - MongoDB

I have a sink connector for MongoDB that takes JSON from a topic and puts it into a MongoDB collection. But when I send invalid JSON from a producer to that topic (e.g. with an unescaped special character ") => {"id":1,"name":"\"}, the connector stops. I tried using errors.tolerance = all, but the same thing happens. What should happen is that the connector skips and logs the invalid JSON and keeps running. My distributed-mode connector is as follows:
{
"name": "sink-mongonew_test1",
"config": {
"connector.class": "com.mongodb.kafka.connect.MongoSinkConnector",
"topics": "error7",
"connection.uri": "mongodb://****:27017",
"database": "abcd",
"collection": "abc",
"type.name": "kafka-connect",
"key.ignore": "true",
"document.id.strategy": "com.mongodb.kafka.connect.sink.processor.id.strategy.PartialValueStrategy",
"value.projection.list": "id",
"value.projection.type": "whitelist",
"writemodel.strategy": "com.mongodb.kafka.connect.sink.writemodel.strategy.UpdateOneTimestampsStrategy",
"delete.on.null.values": "false",
"key.converter": "org.apache.kafka.connect.json.JsonConverter",
"value.converter": "org.apache.kafka.connect.json.JsonConverter",
"key.converter.schemas.enable": "false",
"value.converter.schemas.enable": "false",
"errors.tolerance": "all",
"errors.log.enable": "true",
"errors.log.include.messages": "true",
"errors.deadletterqueue.topic.name": "crm_data_deadletterqueue",
"errors.deadletterqueue.topic.replication.factor": "1",
"errors.deadletterqueue.context.headers.enable": "true"
}
}

Since Apache Kafka 2.0, Kafka Connect has included error handling options, including the functionality to route messages to a dead letter queue, a common technique in building data pipelines.
https://www.confluent.io/blog/kafka-connect-deep-dive-error-handling-dead-letter-queues/
As commented, you're using connect-api-1.0.1.*.jar (version 1.0.1), which explains why those properties are not working.
Your alternatives, outside of running a newer version of Kafka Connect, include NiFi or Spark Structured Streaming.
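For reference, on a Kafka Connect 2.0+ worker, the error-handling subset of your config (unchanged from what you posted; the DLQ topic name is just your example) is all that is needed for conversion failures like the broken JSON above to be logged and routed to the dead letter queue instead of killing the task:
"errors.tolerance": "all",
"errors.log.enable": "true",
"errors.log.include.messages": "true",
"errors.deadletterqueue.topic.name": "crm_data_deadletterqueue",
"errors.deadletterqueue.topic.replication.factor": "1",
"errors.deadletterqueue.context.headers.enable": "true"
Note that these settings cover the converter and transform stages; errors thrown while actually writing to MongoDB are not routed to the DLQ by the framework.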

Related

Only Map objects supported in absence of schema for record conversion to BigQuery format

I'm streaming data from Postgres to Kafka to BigQuery. Most tables in PG have a primary key, so most tables/topics have an Avro key and value schema, and these all go to BigQuery fine.
I do have a couple of tables that do not have a PK, and consequently have no Avro key schema.
When I create a sink connector for those tables, the connector errors with:
Caused by: com.wepay.kafka.connect.bigquery.exception.ConversionConnectException: Only Map objects supported in absence of schema for record conversion to BigQuery format.
If I remove the 'key.converter' config, I then get a 'Top-level Kafka Connect schema must be of type 'struct'' error.
How do I handle this?
Here's the connector config for reference,
{
"project": "staging",
"defaultDataset": "data_lake",
"keyfile": "<redacted>",
"keySource": "JSON",
"sanitizeTopics": "true",
"kafkaKeyFieldName": "_kid",
"autoCreateTables": "true",
"allowNewBigQueryFields": "true",
"upsertEnabled": "false",
"bigQueryRetry": "5",
"bigQueryRetryWait": "120000",
"bigQueryPartitionDecorator": "false",
"name": "hd-sink-bq",
"connector.class": "com.wepay.kafka.connect.bigquery.BigQuerySinkConnector",
"tasks.max": "1",
"key.converter": "io.confluent.connect.avro.AvroConverter",
"key.converter.schema.registry.url": "<redacted>",
"key.converter.basic.auth.credentials.source": "USER_INFO",
"key.converter.schema.registry.basic.auth.user.info": "<redacted>",
"value.converter": "io.confluent.connect.avro.AvroConverter",
"value.converter.schema.registry.url": "<redacted>",
"value.converter.basic.auth.credentials.source": "USER_INFO",
"value.converter.schema.registry.basic.auth.user.info": "<redacted>",
"topics": "public.event_issues",
"errors.tolerance": "all",
"errors.log.include.messages": "true",
"errors.deadletterqueue.topic.name": "connect.bq-sink.deadletter",
"errors.deadletterqueue.topic.replication.factor": "1",
"errors.deadletterqueue.context.headers.enable": "true",
"transforms": "tombstoneHandler",
"offset.flush.timeout.ms": "300000",
"transforms.dropNullRecords.predicate": "isNullRecord",
"transforms.dropNullRecords.type": "org.apache.kafka.connect.transforms.Filter",
"transforms.tombstoneHandler.behavior": "drop_warn",
"transforms.tombstoneHandler.type": "io.aiven.kafka.connect.transforms.TombstoneHandler"
}
In my case, I handled this by using a predicate, as follows:
{
...
"predicates.isTombstone.type":
"org.apache.kafka.connect.transforms.predicates.RecordIsTombstone",
"predicates": "isTombstone",
"transforms.x.predicate":"isTombstone",
"transforms.x.negate":true
...
}
This is as per the docs here; with transforms.x.negate set, transform x is skipped for such tombstone records.
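Note also that in the connector config above, dropNullRecords is configured but not listed in transforms, and the isNullRecord predicate it references is never declared under predicates, so neither takes effect. A minimal sketch (reusing those names, with the standard Filter SMT and RecordIsTombstone predicate) that actually drops tombstone records before they reach the BigQuery converter would be:
"predicates": "isNullRecord",
"predicates.isNullRecord.type": "org.apache.kafka.connect.transforms.predicates.RecordIsTombstone",
"transforms": "dropNullRecords",
"transforms.dropNullRecords.type": "org.apache.kafka.connect.transforms.Filter",
"transforms.dropNullRecords.predicate": "isNullRecord"
Here negate is left unset, because the goal is for Filter to drop records when the predicate matches; negate=true is what you use when a transform should apply only to non-tombstone records.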

Produce Avro messages in Confluent Control Center UI

To develop a data transfer application, I first need to define the key/value Avro schemas. The producer application won't be developed until the Avro schemas are defined.
I cloned a topic and its key/value Avro schemas that are already working, and also cloned the JDBC sink connector. I simply changed the topic and connector names.
Then I copied an existing message that had previously been sent and sunk successfully, and produced it using the Confluent Topic Message UI producer.
But it fails with the error: "Unknown magic byte!"
Caused by: org.apache.kafka.common.errors.SerializationException: Unknown magic byte!
at io.confluent.kafka.serializers.AbstractKafkaSchemaSerDe.getByteBuffer(AbstractKafkaSchemaSerDe.java:250)
at io.confluent.kafka.serializers.AbstractKafkaAvroDeserializer$DeserializationContext.<init>(AbstractKafkaAvroDeserializer.java:323)
at io.confluent.kafka.serializers.AbstractKafkaAvroDeserializer.deserializeWithSchemaAndVersion(AbstractKafkaAvroDeserializer.java:164)
at io.confluent.connect.avro.AvroConverter$Deserializer.deserialize(AvroConverter.java:172)
at io.confluent.connect.avro.AvroConverter.toConnectData(AvroConverter.java:107)
... 17 more
[2022-07-25 03:45:42,385] INFO Stopping task (io.confluent.connect.jdbc.sink.JdbcSinkTask)
From reading other questions, it seems the message has to be serialized using the schema.
Unknown magic byte with kafka-avro-console-consumer
Is it possible to send a message to a topic with Avro key/value schemas using the Confluent Topic UI?
Any idea whether the Avro schemas need information that depends on the connector/source, or whether the namespace depends on the topic name?
This is my key schema; the topic's name is knov_03:
{
"connect.name": "dbserv1.MY_DB_SCHEMA.ps_sap_incoming.Key",
"fields": [
{
"name": "id_sap_incoming",
"type": "long"
}
],
"name": "Key",
"namespace": "dbserv1.MY_DB_SCHEMA.ps_sap_incoming",
"type": "record"
}
Connector:
{
"name": "knov_05",
"config": {
"connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
"tasks.max": "1",
"key.converter": "io.confluent.connect.avro.AvroConverter",
"value.converter": "io.confluent.connect.avro.AvroConverter",
"topics": "knov_03",
"connection.url": "jdbc:mysql://eXXXXX:3306/MY_DB_SCHEMA?useSSL=FALSE&nullCatalogMeansCurrent=true",
"connection.user": "USER",
"connection.password": "PASSWORD",
"insert.mode": "upsert",
"delete.enabled": "true",
"pk.mode": "record_key",
"pk.fields": "id_sap_incoming",
"auto.create": "true",
"auto.evolve": "true",
"value.converter.schema.registry.url": "http://schema-registry:8081",
"key.converter.schema.registry.url": "http://schema-registry:8081"
}
}
Thanks.

Kafka MongoDB sink Connector exception handling

I have created a connector from Kafka to MongoDB to sink the data. In some cases I get bad data on my topic, and when that topic is sunk to the DB it causes a duplicate key error because of an index I created.
In this case I want that record to be moved to the DLQ, but it is not being moved.
This is my connector; can anyone please help me with this?
{
"name": "test_1",
"config": {
"connector.class": "com.mongodb.kafka.connect.MongoSinkConnector",
"topics": "test",
"connection.uri": "xxx",
"database": "test",
"collection": "test_record",
"key.converter": "org.apache.kafka.connect.storage.StringConverter",
"key.converter.schemas.enable": "false",
"value.converter": "io.confluent.connect.avro.AvroConverter",
"value.converter.schemas.enable": "true",
"value.converter.schema.registry.url": "http://xxx:8081",
"document.id.strategy.overwrite.existing": "true",
"document.id.strategy": "com.mongodb.kafka.connect.sink.processor.id.strategy.ProvidedInKeyStrategy",
"transforms": "hk",
"transforms.hk.type": "org.apache.kafka.connect.transforms.HoistField$Key",
"transforms.hk.field": "_id",
"writemodel.strategy": "com.mongodb.kafka.connect.sink.writemodel.strategy.UpdateOneTimestampsStrategy",
"write.method": "upsert",
"errors.tolerance":"all",
"errors.deadletterqueue.topic.name":"dlq_sink",
"errors.deadletterqueue.context.headers.enable":true,
"errors.retry.delay.max.ms": 60000,
"errors.retry.timeout": 300000
}
}
Thanks,
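One thing worth checking: the Connect framework's errors.tolerance/DLQ settings only cover failures in the converter and transform stages, not errors raised while writing to MongoDB (such as duplicate key errors). The MongoDB sink connector has its own error-handling overrides for that. A sketch, assuming a connector version (roughly 1.4 or newer, per my reading of the MongoDB docs) that supports the mongo.errors.* properties; please verify against the version you run:
"errors.tolerance": "all",
"errors.log.enable": "true",
"errors.deadletterqueue.topic.name": "dlq_sink",
"errors.deadletterqueue.context.headers.enable": "true",
"mongo.errors.tolerance": "all",
"mongo.errors.log.enable": "true"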

Multiple MongoDB collections to one Kafka topic

The application writes data every month to a new collection (for example, journal_2205, journal_2206). Is it possible to configure the connector so that it reads the oplog from each new collection and writes to one topic? I use this connector:
https://www.mongodb.com/docs/kafka-connector/current/source-connector/
Thank you!
Yes, this is possible: you can listen to multiple change streams from multiple Mongo collections. You just need to provide a regex for the collection names in the pipeline; you can even provide a regex for the database names if you have multiple databases.
"pipeline": "[{\"$match\":{\"$and\":[{\"ns.db\":{\"$regex\":/^database-name$/}},{\"ns.coll\":{\"$regex\":/^journal_.*/}}]}}]"
Using $nin, you can even exclude any database that you don't want to listen to for change streams.
"pipeline": "[{\"$match\":{\"$and\":[{\"ns.db\":{\"$regex\":/^database-name$/,\"$nin\":[/^any_database_name$/]}},{\"ns.coll\":{\"$regex\":/^journal_.*/}}]}}]"
Here is the complete Kafka connector configuration.
Mongo to Kafka source connector
{
"name": "mongo-to-kafka-connect",
"config": {
"connector.class": "com.mongodb.kafka.connect.MongoSourceConnector",
"publish.full.document.only": "true",
"tasks.max": "3",
"key.converter.schemas.enable": "false",
"topic.creation.enable": "true",
"poll.await.time.ms": 1000,
"poll.max.batch.size": 100,
"topic.prefix": "any prefix for topic name",
"output.json.formatter": "com.mongodb.kafka.connect.source.json.formatter.SimplifiedJson",
"connection.uri": "mongodb://<username>:<password>#ip:27017,ip:27017,ip:27017,ip:27017/?authSource=admin&replicaSet=xyz&tls=true",
"value.converter.schemas.enable": "false",
"copy.existing": "true",
"topic.creation.default.replication.factor": 3,
"topic.creation.default.partitions": 3,
"topic.creation.compacted.cleanup.policy": "compact",
"value.converter": "org.apache.kafka.connect.storage.StringConverter",
"key.converter": "org.apache.kafka.connect.storage.StringConverter",
"mongo.errors.log.enable": "true",
"heartbeat.interval.ms": 10000,
"pipeline": "[{\"$match\":{\"$and\":[{\"ns.db\":{\"$regex\":/^database-name$/}},{\"ns.coll\":{\"$regex\":/^journal_.*/}}]}}]"
}
}
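If the regex-literal syntax inside the pipeline ever gives you JSON parsing trouble, an equivalent pipeline using the string form of $regex (an untested sketch; substitute your own database name) would be:
"pipeline": "[{\"$match\":{\"$and\":[{\"ns.db\":{\"$regex\":\"^database-name$\"}},{\"ns.coll\":{\"$regex\":\"^journal_.*\"}}]}}]"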
You can get more details from the official docs:
https://www.mongodb.com/docs/kafka-connector/current/source-connector/
https://docs.confluent.io/platform/current/connect/index.html

Allow only some columns in SingleStore Kafka Connect

I am using Kafka to send my CDC data, which is collected by Debezium, to a SingleStore database, and I am using this Kafka Connect JSON:
{
"name": "my-connector",
"config": {
"connector.class":"com.singlestore.kafka.SingleStoreSinkConnector",
"tasks.max":"1",
"transforms": "dropPrefix,unwrap",
"transforms.dropPrefix.type": "org.apache.kafka.connect.transforms.RegexRouter",
"transforms.dropPrefix.regex": "dbserver1.inventory.(.*)",
"transforms.dropPrefix.replacement": "$1",
"errors.tolerance": "all",
"errors.log.enable": "true",
"errors.log.include.messages": "true",
"transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
"topics":"dbserver1.inventory.addresses",
"connection.ddlEndpoint" : "memsql:3306",
"connection.database" : "test",
"connection.user" : "root",
"connection.password": "password",
"insert.mode": "upsert",
"tableKey.primary.keyName" : "id",
"fields.whitelist": "id,city",
"auto.create": "true",
"auto.evolve": "true",
"transforms.unwrap.delete.handling.mode":"rewrite",
"transforms.unwrap.add.fields": "ts_ms",
"singlestore.metadata.allow": true,
"singlestore.metadata.table": "kafka_connect_transaction_metadata"
}
}
I want the SingleStore database to only receive and save data from the columns id and city, but apparently
"fields.whitelist": "id,city",
does not work in this kind of Kafka Connect sink the way it does in the JDBC sink connector. How can I manage this?
It's me again; it looks like you should be able to use Arcion Cloud as your CDC tool. It will allow you to filter for specific columns within a table and then apply the inserts/updates/deletes to SingleStore.
https://docs.arcion.io/docs/references/filter-reference/
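Another option worth trying inside Kafka Connect itself is the standard ReplaceField SMT, which trims the record value down to the listed fields before it reaches the sink. This is an untested sketch (the transform name keepColumns is made up, and whether the SingleStore sink behaves well with the trimmed schema is an assumption to verify); on older Connect versions the property is called whitelist instead of include:
"transforms": "dropPrefix,unwrap,keepColumns",
"transforms.keepColumns.type": "org.apache.kafka.connect.transforms.ReplaceField$Value",
"transforms.keepColumns.include": "id,city"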