Allow only some columns in SingleStore Kafka Connect - apache-kafka

I am using Kafka to send CDC data collected by Debezium to a SingleStore database, with this Kafka Connect JSON config:
{
"name": "my-connector",
"config": {
"connector.class":"com.singlestore.kafka.SingleStoreSinkConnector",
"tasks.max":"1",
"transforms": "dropPrefix,unwrap",
"transforms.dropPrefix.type": "org.apache.kafka.connect.transforms.RegexRouter",
"transforms.dropPrefix.regex": "dbserver1.inventory.(.*)",
"transforms.dropPrefix.replacement": "$1",
"errors.tolerance": "all",
"errors.log.enable": "true",
"errors.log.include.messages": "true",
"transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
"topics":"dbserver1.inventory.addresses",
"connection.ddlEndpoint" : "memsql:3306",
"connection.database" : "test",
"connection.user" : "root",
"connection.password": "password",
"insert.mode": "upsert",
"tableKey.primary.keyName" : "id",
"fields.whitelist": "id,city",
"auto.create": "true",
"auto.evolve": "true",
"transforms.unwrap.delete.handling.mode":"rewrite",
"transforms.unwrap.add.fields": "ts_ms",
"singlestore.metadata.allow": true,
"singlestore.metadata.table": "kafka_connect_transaction_metadata"
}
}
I want the SingleStore database to receive and store only the id and city columns, but apparently
"fields.whitelist": "id,city",
does not work with this connector the way it does with the JDBC sink connector. How can I manage this?

It's me again. It looks like you should be able to use Arcion Cloud as your CDC tool. It lets you filter for specific columns within a table and then apply the inserts/updates/deletes to SingleStore.
https://docs.arcion.io/docs/references/filter-reference/
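Another option that may be worth trying, as an untested sketch: fields.whitelist is a JDBC sink connector setting, so the column filtering could instead be moved into the transform chain with Kafka Connect's built-in ReplaceField SMT, placed after unwrap. The alias keepCols is hypothetical. The include property is the name used by newer Kafka versions; older versions call it whitelist. Because the unwrap transform in the question is configured to add __deleted and __ts_ms, those fields would also be dropped unless they are listed as well. Only the changed and added properties are shown:

```json
{
  "transforms": "dropPrefix,unwrap,keepCols",
  "transforms.keepCols.type": "org.apache.kafka.connect.transforms.ReplaceField$Value",
  "transforms.keepCols.include": "id,city"
}
```

With auto.create enabled, the table should then be created from the reduced record schema, though that behavior is an assumption about the SingleStore connector and has not been verified here.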

Related

Only Map objects supported in absence of schema for record conversion to BigQuery format

I'm streaming data from Postgres to Kafka to BigQuery. Most tables in PG have a primary key, so most tables/topics have an Avro key and value schema, and these all go to BigQuery fine.
I do have a couple of tables that do not have a PK, and consequently have no Avro key schema.
When I create a sink connector for those tables, the connector fails with:
Caused by: com.wepay.kafka.connect.bigquery.exception.ConversionConnectException: Only Map objects supported in absence of schema for record conversion to BigQuery format.
If I remove the 'key.converter' config, I get a 'Top-level Kafka Connect schema must be of type 'struct'' error instead.
How do I handle this?
Here's the connector config for reference:
{
"project": "staging",
"defaultDataset": "data_lake",
"keyfile": "<redacted>",
"keySource": "JSON",
"sanitizeTopics": "true",
"kafkaKeyFieldName": "_kid",
"autoCreateTables": "true",
"allowNewBigQueryFields": "true",
"upsertEnabled": "false",
"bigQueryRetry": "5",
"bigQueryRetryWait": "120000",
"bigQueryPartitionDecorator": "false",
"name": "hd-sink-bq",
"connector.class": "com.wepay.kafka.connect.bigquery.BigQuerySinkConnector",
"tasks.max": "1",
"key.converter": "io.confluent.connect.avro.AvroConverter",
"key.converter.schema.registry.url": "<redacted>",
"key.converter.basic.auth.credentials.source": "USER_INFO",
"key.converter.schema.registry.basic.auth.user.info": "<redacted>",
"value.converter": "io.confluent.connect.avro.AvroConverter",
"value.converter.schema.registry.url": "<redacted>",
"value.converter.basic.auth.credentials.source": "USER_INFO",
"value.converter.schema.registry.basic.auth.user.info": "<redacted>",
"topics": "public.event_issues",
"errors.tolerance": "all",
"errors.log.include.messages": "true",
"errors.deadletterqueue.topic.name": "connect.bq-sink.deadletter",
"errors.deadletterqueue.topic.replication.factor": "1",
"errors.deadletterqueue.context.headers.enable": "true",
"transforms": "tombstoneHandler",
"offset.flush.timeout.ms": "300000",
"transforms.dropNullRecords.predicate": "isNullRecord",
"transforms.dropNullRecords.type": "org.apache.kafka.connect.transforms.Filter",
"transforms.tombstoneHandler.behavior": "drop_warn",
"transforms.tombstoneHandler.type": "io.aiven.kafka.connect.transforms.TombstoneHandler"
}
In my case, I handled this by using a predicate, as follows:
{
...
"predicates.isTombstone.type":
"org.apache.kafka.connect.transforms.predicates.RecordIsTombstone",
"predicates": "isTombstone",
"transforms.x.predicate":"isTombstone",
"transforms.x.negate":true
...
}
This is as per the docs. With transforms.x.negate set to true, transform x is applied only to records that are not tombstones, so tombstone records skip it.
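As a minimal sketch of another way to use the same predicate: pairing it with the built-in Filter transform (without negate) drops tombstone records outright before they reach the sink. The alias dropTombstones is hypothetical, and predicates require Kafka Connect 2.6 or later:

```json
{
  "transforms": "dropTombstones",
  "transforms.dropTombstones.type": "org.apache.kafka.connect.transforms.Filter",
  "transforms.dropTombstones.predicate": "isTombstone",
  "predicates": "isTombstone",
  "predicates.isTombstone.type": "org.apache.kafka.connect.transforms.predicates.RecordIsTombstone"
}
```

This only affects records with a null value. Whether tombstones are what trigger the error in the question is an assumption.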

Ignore updating missing fields with Confluent JDBC Connector

Say I have a Postgres database table with the fields "id", "flavor" and "item". I also have a Kafka topic with two messages (let's ignore the kafka-key and assume the ID is in the value for now. Schema-definition is also omitted):
{"id": 1, "flavor": "chocolate"}
{"id": 1, "item": "cookie"}
Now I'd like to use the Confluent JDBC (Sink) Connector for persisting the kafka messages in UPSERT-mode, hoping to get to the following end result in the database:
id | flavor | item
----------------------
1 | chocolate | cookie
What I did get, however, was this:
id | flavor | item
----------------------
1 | null | cookie
I assume that's because the second message uses an UPDATE-statement where it infers the "null" values for fields that weren't provided, and writes those null-values over my actual data.
Is there a way to get to my desired result by changing the configuration of either the Confluent JDBC Connector or PostgreSQL 12? Failing that, is there another reasonably well-supported PostgreSQL-compatible connector out there that can do this?
Here's my connector configuration (connection details obviously redacted):
{
"name": "sink-jdbc-upsertstest",
"config": {
"connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
"tasks.max": "1",
"topics": "TEST-upserts",
"connection.url": "jdbc:postgresql://host:port/database",
"connection.user": "user",
"connection.password": "password",
"dialect.name": "ExtendedPostgreSqlDatabaseDialect",
"table.name.format": "upsert-test",
"batch.size": "100",
"insert.mode": "upsert",
"auto.evolve": "true",
"auto.create": "true",
"pk.mode": "record_value",
"pk.fields": "id",
"key.converter": "org.apache.kafka.connect.storage.StringConverter",
"value.converter": "org.apache.kafka.connect.json.JsonConverter",
"value.converter.schemas.enable": "true",
"errors.deadletterqueue.topic.name": "dlq_upserts",
"errors.tolerance": "all",
"errors.deadletterqueue.topic.replication.factor": "1",
"errors.deadletterqueue.context.headers.enable": "true"
}
}

Delete document with MongoSinkConnector

I'm able to insert/update documents in MongoDB, but I'm struggling to delete documents.
This is how the data is recorded in the Kafka topic by a Debezium connector from a SQL Server source (the last row is what a DELETE operation looks like):
{"user_code":1001} {"user_code":1001,"first_name":"Sally","last_name":"Thomas","email":"sally.thomas@acme.com"}
{"user_code":1002} {"user_code":1002,"first_name":"George","last_name":"Bailey","email":"gbailey@foobar.com"}
{"user_code":1003} {"user_code":1003,"first_name":"Edward","last_name":"Walker","email":"ed@walker.com"}
{"user_code":1003} null
In this example, even after the null value, the document with user_code 1003 is still in MongoDB.
Below is how I'm configuring my MongoSinkConnector (I've already tried both mongodb.delete.on.null.values and delete.on.null.values, but neither worked):
{
"name": "inventory-connector-sink-2",
"config": {
"connector.class" : "com.mongodb.kafka.connect.MongoSinkConnector",
"tasks.max" : "1",
"topics": "server1.dbo.customers",
"connection.uri": "mongodb://root:root@mongo:27017/",
"database": "testDB",
"collection": "customers_2",
"database.history.kafka.bootstrap.servers" : "kafka:9092",
"database.history.kafka.topic": "schema-changes.inventory",
"key.converter": "org.apache.kafka.connect.json.JsonConverter",
"key.converter.schemas.enable": false,
"value.converter": "org.apache.kafka.connect.json.JsonConverter",
"value.converter.schemas.enable": false,
"document.id.strategy": "com.mongodb.kafka.connect.sink.processor.id.strategy.FullKeyStrategy",
"writemodel.strategy":"com.mongodb.kafka.connect.sink.writemodel.strategy.ReplaceOneBusinessKeyStrategy",
"mongodb.delete.on.null.values": true
}
}
I've also tried using PartialValueStrategy but no luck.
PS: I'm working with confluentinc/cp-kafka-connect docker image for my sink connector.

Kafka-confluent: How to use pk.mode=record_key for upsert and delete mode in JDBC sink connector?

With Confluent Kafka, how can we use upsert with a CSV file as the source while using pk.mode=record_key for a composite key in the MySQL table? Upsert mode works when using pk.mode=record_value. Is there any additional configuration that needs to be done?
I get this error when I try pk.mode=record_key: Caused by: org.apache.kafka.connect.errors.ConnectException: Need exactly one PK column defined since the key schema for records is a primitive type.
Below is my JDBC sink connector configuration:
{
"name": "<name>",
"config": {
"connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
"tasks.max": "1",
"topics": "<topic name>",
"connection.url": "<url>",
"connection.user": "<user name>",
"connection.password": "*******",
"insert.mode": "upsert",
"batch.size": "50000",
"table.name.format": "<table name>",
"pk.mode": "record_key",
"pk.fields": "field1,field2",
"auto.create": "true",
"auto.evolve": "true",
"max.retries": "10",
"retry.backoff.ms": "3000",
"mode": "bulk",
"key.converter": "org.apache.kafka.connect.storage.StringConverter",
"value.converter": "io.confluent.connect.avro.AvroConverter",
"value.converter.schemas.enable": "true",
"value.converter.schema.registry.url": "http://localhost:8081"
}
}
You need to use a pk.mode of record_value.
This means: take the field(s) from the value of the message and use them as the primary key in the target table and for UPSERT purposes.
If you set record_key, it will try to take the key field(s) from the Kafka message key. Unless you've actually got the values in your message key, this is not the setting that you want to use.
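Applied to the configuration in the question, that would look roughly like this. Only the primary-key settings are shown; field1 and field2 are the placeholders from the question and must exist as fields in the message value:

```json
{
  "insert.mode": "upsert",
  "pk.mode": "record_value",
  "pk.fields": "field1,field2"
}
```

Note that the JDBC sink's delete.enabled option requires pk.mode=record_key, so handling deletes would need a message key that actually carries the key fields as a struct rather than a plain string.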
These might help you further:
Kafka Connect JDBC Sink deep-dive: Working with Primary Keys
📹https://rmoff.dev/kafka-jdbc-video
📹https://rmoff.dev/ksqldb-jdbc-sink-video

Error handling for invalid JSON in kafka sink connector

I have a sink connector for MongoDB that takes JSON from a topic and puts it into a MongoDB collection. But when I send invalid JSON from a producer to that topic (e.g. with an invalid special character ") => {"id":1,"name":"\"}, the connector stops. I tried using errors.tolerance = all, but the same thing happens. I expect the connector to skip and log the invalid JSON and keep running. My distributed-mode connector is as follows:
{
"name": "sink-mongonew_test1",
"config": {
"connector.class": "com.mongodb.kafka.connect.MongoSinkConnector",
"topics": "error7",
"connection.uri": "mongodb://****:27017",
"database": "abcd",
"collection": "abc",
"type.name": "kafka-connect",
"key.ignore": "true",
"document.id.strategy": "com.mongodb.kafka.connect.sink.processor.id.strategy.PartialValueStrategy",
"value.projection.list": "id",
"value.projection.type": "whitelist",
"writemodel.strategy": "com.mongodb.kafka.connect.sink.writemodel.strategy.UpdateOneTimestampsStrategy",
"delete.on.null.values": "false",
"key.converter": "org.apache.kafka.connect.json.JsonConverter",
"value.converter": "org.apache.kafka.connect.json.JsonConverter",
"key.converter.schemas.enable": "false",
"value.converter.schemas.enable": "false",
"errors.tolerance": "all",
"errors.log.enable": "true",
"errors.log.include.messages": "true",
"errors.deadletterqueue.topic.name": "crm_data_deadletterqueue",
"errors.deadletterqueue.topic.replication.factor": "1",
"errors.deadletterqueue.context.headers.enable": "true"
}
}
Since Apache Kafka 2.0, Kafka Connect has included error handling options, including the functionality to route messages to a dead letter queue, a common technique in building data pipelines.
https://www.confluent.io/blog/kafka-connect-deep-dive-error-handling-dead-letter-queues/
As commented, you're using connect-api-1.0.1.*.jar, i.e. version 1.0.1, which explains why those properties are not working.
Your alternatives outside of running a newer version of Kafka Connect include NiFi or Spark Structured Streaming.