We are running the Kafka HDFS sink connector (version 5.2.1) and need the HDFS data to be partitioned by multiple nested fields. The data in the topics is stored as Avro and has nested elements. However, Connect cannot recognize the nested fields and throws an error that the field cannot be found. Below is the connector configuration we are using. Doesn't the HDFS sink connector support partitioning by nested fields? I can partition using non-nested fields.
{
"connector.class": "io.confluent.connect.hdfs.HdfsSinkConnector",
"topics.dir": "/projects/test/kafka/logdata/coss",
"avro.codec": "snappy",
"flush.size": "200",
"connect.hdfs.principal": "test#DOMAIN.COM",
"rotate.interval.ms": "500000",
"logs.dir": "/projects/test/kafka/tmp/wal/coss4",
"hdfs.namenode.principal": "hdfs/_HOST#HADOOP.DOMAIN",
"hadoop.conf.dir": "/etc/hdfs",
"topics": "test1",
"connect.hdfs.keytab": "/etc/hdfs-qa/test.keytab",
"hdfs.url": "hdfs://nameservice1:8020",
"hdfs.authentication.kerberos": "true",
"name": "hdfs_connector_v1",
"key.converter": "org.apache.kafka.connect.storage.StringConverter",
"value.converter": "io.confluent.connect.avro.AvroConverter",
"value.converter.schema.registry.url": "http://myschema:8081",
"partition.field.name": "meta.ID,meta.source,meta.HH",
"partitioner.class": "io.confluent.connect.storage.partitioner.FieldPartitioner"
}
I added nested field support for the TimestampPartitioner, but nested field support for the FieldPartitioner still has an outstanding PR:
https://github.com/confluentinc/kafka-connect-storage-common/pull/67
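Until that PR is merged, one possible workaround (not verified here) is to flatten the nested struct with Kafka Connect's built-in Flatten SMT so that the partition fields become top-level fields the FieldPartitioner can see; note that this also changes the shape of the records written to HDFS. The flattened field names below (meta_ID etc.) assume Flatten's "_" delimiter:
{
...
"transforms": "flatten",
"transforms.flatten.type": "org.apache.kafka.connect.transforms.Flatten$Value",
"transforms.flatten.delimiter": "_",
"partitioner.class": "io.confluent.connect.storage.partitioner.FieldPartitioner",
"partition.field.name": "meta_ID,meta_source,meta_HH"
...
}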
I'm streaming data from Postgres to Kafka to BigQuery. Most tables in Postgres have a primary key, so most tables/topics have both an Avro key and value schema, and these all go to BigQuery fine.
I do have a couple of tables that do not have a PK and consequently have no Avro key schema.
When I create a sink connector for those tables, the connector fails with:
Caused by: com.wepay.kafka.connect.bigquery.exception.ConversionConnectException: Only Map objects supported in absence of schema for record conversion to BigQuery format.
If I remove the 'key.converter' config, then I get a 'Top-level Kafka Connect schema must be of type 'struct'' error.
How do I handle this?
Here's the connector config for reference,
{
"project": "staging",
"defaultDataset": "data_lake",
"keyfile": "<redacted>",
"keySource": "JSON",
"sanitizeTopics": "true",
"kafkaKeyFieldName": "_kid",
"autoCreateTables": "true",
"allowNewBigQueryFields": "true",
"upsertEnabled": "false",
"bigQueryRetry": "5",
"bigQueryRetryWait": "120000",
"bigQueryPartitionDecorator": "false",
"name": "hd-sink-bq",
"connector.class": "com.wepay.kafka.connect.bigquery.BigQuerySinkConnector",
"tasks.max": "1",
"key.converter": "io.confluent.connect.avro.AvroConverter",
"key.converter.schema.registry.url": "<redacted>",
"key.converter.basic.auth.credentials.source": "USER_INFO",
"key.converter.schema.registry.basic.auth.user.info": "<redacted>",
"value.converter": "io.confluent.connect.avro.AvroConverter",
"value.converter.schema.registry.url": "<redacted>",
"value.converter.basic.auth.credentials.source": "USER_INFO",
"value.converter.schema.registry.basic.auth.user.info": "<redacted>",
"topics": "public.event_issues",
"errors.tolerance": "all",
"errors.log.include.messages": "true",
"errors.deadletterqueue.topic.name": "connect.bq-sink.deadletter",
"errors.deadletterqueue.topic.replication.factor": "1",
"errors.deadletterqueue.context.headers.enable": "true",
"transforms": "tombstoneHandler",
"offset.flush.timeout.ms": "300000",
"transforms.dropNullRecords.predicate": "isNullRecord",
"transforms.dropNullRecords.type": "org.apache.kafka.connect.transforms.Filter",
"transforms.tombstoneHandler.behavior": "drop_warn",
"transforms.tombstoneHandler.type": "io.aiven.kafka.connect.transforms.TombstoneHandler"
}
In my case, I used to handle this by using a predicate, as follows:
{
...
"predicates.isTombstone.type":
"org.apache.kafka.connect.transforms.predicates.RecordIsTombstone",
"predicates": "isTombstone",
"transforms.x.predicate":"isTombstone",
"transforms.x.negate":true
...
}
This is as per the docs here, and transforms.x.negate will skip such tombstone records.
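For completeness, a minimal sketch (untested) of how the dropNullRecords / isNullRecord entries that already appear in the config above could be wired up, so that tombstone (null-value) records are filtered out before the BigQuery conversion:
{
...
"transforms": "dropNullRecords",
"transforms.dropNullRecords.type": "org.apache.kafka.connect.transforms.Filter",
"transforms.dropNullRecords.predicate": "isNullRecord",
"predicates": "isNullRecord",
"predicates.isNullRecord.type": "org.apache.kafka.connect.transforms.predicates.RecordIsTombstone"
...
}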
I got errors while trying to connect Kafka topics to Postgres using the JDBC sink connector.
These are the error logs (see images below) that I got when I tried with the following configuration:
{
"name": "temperature_jdbcsink",
"config" : {
"connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
"task.max": "1",
"topics": "temperature",
"key.converter": "io.confluent.connect.avro.AvroConverter",
"value.converter": "io.confluent.connect.avro.AvroConverter",
"key.converter.schema.registry.url": "http://localhost:8081",
"value.converter.schema.registry.url": "http://localhost:8081",
"transforms": "Flatten, RenameFields",
"transfores.Flatten.type":"org.apache.kafka.connect.transforms.Flatten$value",
"transforms.Flatten.deliniter":"_",
"transforms.RenameFields.type": "org.apache.kafka.connect.transforms.ReplaceField$value",
"transforms.RenameFields.renames": "value:value,timestamp:timestamp",
"connection.url": "jdbc:postgresql://localhost:5432/jdbcsink",
"connection.user": "postgres",
"connection.password": "postgres",
"insert.mode": "upsert",
"batch.size":"2",
"table.name.format": "temperature",
"pk.mode":"none",
"db.timezone": "Asia/Kolkata"
}
}
https://i.stack.imgur.com/fXqO3.png
https://i.stack.imgur.com/V5Btk.png
The error says "Unknown magic byte". This suggests there is data in the topic that wasn't produced using the Confluent Avro serializer.
For example, are your keys really Avro? That isn't common when storing data with the JDBC sink, since database primary keys are typically plain string or integer types. Therefore, use the respective converter for those, not Avro.
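For example, a minimal sketch (assuming the keys are plain strings rather than Avro) that keeps the Avro converter for the value only:
{
...
"key.converter": "org.apache.kafka.connect.storage.StringConverter",
"value.converter": "io.confluent.connect.avro.AvroConverter",
"value.converter.schema.registry.url": "http://localhost:8081"
...
}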
To develop a data transfer application, I first need to define the key/value Avro schemas. The producer application will not be developed until the Avro schemas are defined.
I cloned a topic and its key/value Avro schemas that are already working, and also cloned the JDBC sink connector. I simply changed the topic and connector names.
Then I copied an existing message that had previously been sent to the sink successfully, using the Confluent Topic Message UI Producer.
But it fails with the error: "Unknown magic byte!"
Caused by: org.apache.kafka.common.errors.SerializationException: Unknown magic byte!
at io.confluent.kafka.serializers.AbstractKafkaSchemaSerDe.getByteBuffer(AbstractKafkaSchemaSerDe.java:250)
at io.confluent.kafka.serializers.AbstractKafkaAvroDeserializer$DeserializationContext.<init>(AbstractKafkaAvroDeserializer.java:323)
at io.confluent.kafka.serializers.AbstractKafkaAvroDeserializer.deserializeWithSchemaAndVersion(AbstractKafkaAvroDeserializer.java:164)
at io.confluent.connect.avro.AvroConverter$Deserializer.deserialize(AvroConverter.java:172)
at io.confluent.connect.avro.AvroConverter.toConnectData(AvroConverter.java:107)
... 17 more
[2022-07-25 03:45:42,385] INFO Stopping task (io.confluent.connect.jdbc.sink.JdbcSinkTask)
Reading other questions it seems the message has to be serialized using the schema.
Unknown magic byte with kafka-avro-console-consumer
Is it possible to send a message to a topic with Avro key/value schemas using the Confluent Topic UI?
Any idea if the Avro schemas need information that depends on the connector/source, or if the namespace depends on the topic name?
This is my key schema, and the topic's name is knov_03:
{
"connect.name": "dbserv1.MY_DB_SCHEMA.ps_sap_incoming.Key",
"fields": [
{
"name": "id_sap_incoming",
"type": "long"
}
],
"name": "Key",
"namespace": "dbserv1.MY_DB_SCHEMA.ps_sap_incoming",
"type": "record"
}
Connector:
{
"name": "knov_05",
"config": {
"connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
"tasks.max": "1",
"key.converter": "io.confluent.connect.avro.AvroConverter",
"value.converter": "io.confluent.connect.avro.AvroConverter",
"topics": "knov_03",
"connection.url": "jdbc:mysql://eXXXXX:3306/MY_DB_SCHEMA?useSSL=FALSE&nullCatalogMeansCurrent=true",
"connection.user": "USER",
"connection.password": "PASSWORD",
"insert.mode": "upsert",
"delete.enabled": "true",
"pk.mode": "record_key",
"pk.fields": "id_sap_incoming",
"auto.create": "true",
"auto.evolve": "true",
"value.converter.schema.registry.url": "http://schema-registry:8081",
"key.converter.schema.registry.url": "http://schema-registry:8081"
}
}
Thanks.
I'm loading Avro files into GCS using the Kafka GCS connector. In my schema in the Schema Registry I have logical types on some of my columns, but it seems like they're not being carried over to the files. How can logical types from a schema be transferred to the Avro files?
Here is my connector configuration for what it's worth:
{
"connector.class": "io.confluent.connect.gcs.GcsSinkConnector",
"confluent.topic.bootstrap.servers": "kafka.internal:9092",
"flush.size": "200000",
"tasks.max": "300",
"topics": "prod_ny, prod_vr",
"group.id": "gcs_sink_connect",
"value.converter.value.subject.name.strategy": "io.confluent.kafka.serializers.subject.RecordNameStrategy",
"gcs.credentials.json": "---",
"confluent.license: "---",
"value.converter.schema.registry.url": "http://p-og.prod:8081",
"gcs.bucket.name": "kafka_load",
"format.class": "io.confluent.connect.gcs.format.avro.AvroFormat",
"gcs.part.size": "5242880",
"confluent.topic.replication.factor": "1",
"name": "gcs_sink_prod",
"value.converter": "io.confluent.connect.avro.AvroConverter",
"storage.class": "io.confluent.connect.gcs.storage.GcsStorage",
"key.converter": "org.apache.kafka.connect.storage.StringConverter",
"auto.offset.reset": "latest"
}
I have a sink connector for MongoDB that takes JSON from a topic and puts it into a MongoDB collection. But when I send invalid JSON from a producer to that topic (e.g. with an invalid special character ") => {"id":1,"name":"\"}, the connector stops. I tried using errors.tolerance = all, but the same thing happens. What should happen is that the connector skips the invalid JSON, logs it, and keeps running. My distributed-mode connector config is as follows:
{
"name": "sink-mongonew_test1",
"config": {
"connector.class": "com.mongodb.kafka.connect.MongoSinkConnector",
"topics": "error7",
"connection.uri": "mongodb://****:27017",
"database": "abcd",
"collection": "abc",
"type.name": "kafka-connect",
"key.ignore": "true",
"document.id.strategy": "com.mongodb.kafka.connect.sink.processor.id.strategy.PartialValueStrategy",
"value.projection.list": "id",
"value.projection.type": "whitelist",
"writemodel.strategy": "com.mongodb.kafka.connect.sink.writemodel.strategy.UpdateOneTimestampsStrategy",
"delete.on.null.values": "false",
"key.converter": "org.apache.kafka.connect.json.JsonConverter",
"value.converter": "org.apache.kafka.connect.json.JsonConverter",
"key.converter.schemas.enable": "false",
"value.converter.schemas.enable": "false",
"errors.tolerance": "all",
"errors.log.enable": "true",
"errors.log.include.messages": "true",
"errors.deadletterqueue.topic.name": "crm_data_deadletterqueue",
"errors.deadletterqueue.topic.replication.factor": "1",
"errors.deadletterqueue.context.headers.enable": "true"
}
}
Since Apache Kafka 2.0, Kafka Connect has included error handling options, including the functionality to route messages to a dead letter queue, a common technique in building data pipelines.
https://www.confluent.io/blog/kafka-connect-deep-dive-error-handling-dead-letter-queues/
As commented, you're using connect-api-1.0.1.*.jar, i.e. version 1.0.1, which explains why those properties are not working.
Your alternatives, outside of running a newer version of Kafka Connect, include NiFi or Spark Structured Streaming.
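For reference, a minimal sketch (assuming a Kafka Connect worker on 2.0 or newer) of the error-handling subset already present in the config above; on connect-api 1.0.1 these properties are simply ignored:
{
...
"errors.tolerance": "all",
"errors.log.enable": "true",
"errors.log.include.messages": "true",
"errors.deadletterqueue.topic.name": "crm_data_deadletterqueue",
"errors.deadletterqueue.topic.replication.factor": "1"
...
}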