I have data in a Kafka topic that is Avro-serialised and compressed with the zstd codec. To transfer this data to S3 I have created an S3SinkConnector with the config below:
{
"connector.class": "io.confluent.connect.s3.S3SinkConnector",
"s3.region": "ap-south-1",
"topics.dir": "0/test_debezium_sept_12_mon_5/public/test_table",
"flush.size": "10000",
"tasks.max": "1",
"s3.part.size": "67108864",
"timezone": "Asia/Calcutta",
"rotate.interval.ms": "60000",
"locale": "en_GB",
"format.class": "io.confluent.connect.s3.format.parquet.ParquetFormat",
"value.converter": "io.confluent.connect.avro.AvroConverter",
"s3.bucket.name": "zeta-aws-aps1-metis-0-s3-pvt",
"key.converter": "io.confluent.connect.avro.AvroConverter",
"partition.duration.ms": "86400000",
"schema.compatibility": "NONE",
"topics": "cdc_test_debezium_sept_12_mon_5.public.test_table",
"parquet.codec": "gzip",
"connect.meta.data": "true",
"value.converter.schema.registry.url": {{url}},
"partitioner.class": "io.confluent.connect.storage.partitioner.TimeBasedPartitioner",
"name": "cdc_test_debezium_sept_12_mon_5.public.test_table_cdc_zeta-aws-aps1-metis-0-s3-pvt_ap-south-1_sink",
"storage.class": "io.confluent.connect.s3.storage.S3Storage",
"path.format": "'date'=YYYY-MM-dd",
"rotate.schedule.interval.ms": "180000",
"timestamp.extractor": "RecordField",
"key.converter.schema.registry.url": "{{url}}",
"timestamp.field": "cdc_source_ts_ms"
}
The above S3SinkConnector fails with the following error:
Caused by: org.apache.kafka.connect.errors.DataException: Failed to deserialize data for topic cdc_test_debezium_sept_12_mon_5.public.test_table to Avro:
    at io.confluent.connect.avro.AvroConverter.toConnectData(AvroConverter.java:118)
    at org.apache.kafka.connect.storage.Converter.toConnectData(Converter.java:87)
    at org.apache.kafka.connect.runtime.WorkerSinkTask.lambda$convertAndTransformRecord$0(WorkerSinkTask.java:492)
    at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execAndRetry(RetryWithToleranceOperator.java:146)
    at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execAndHandleError(RetryWithToleranceOperator.java:180)
    ... 13 more
Caused by: org.apache.kafka.common.errors.SerializationException: Error deserializing Avro message for id 1106
Caused by: java.io.EOFException
NOTE: If I disable compression on the producer/Kafka side then the S3 connector works fine. The issue occurs only when compression is enabled on the Kafka side.
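For reference, "compression on the producer/Kafka side" here usually means the compression codec set on the producing client. A minimal, hedged sketch of how that is typically enabled for a Connect source connector via the standard client-override mechanism (the worker's override policy must allow producer overrides; how compression was actually enabled in this setup is an assumption):
{
  "producer.override.compression.type": "zstd"
}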
I'm trying to use S3SinkConnector with the following settings:
{
"connector.class": "io.confluent.connect.s3.S3SinkConnector",
"format.class": "io.confluent.connect.s3.format.parquet.ParquetFormat",
"flush.size": 1,
"s3.bucket.name": "*****",
"s3.object.tagging": "true",
"s3.region": "us-east-2",
"aws.access.key.id": "*****",
"aws.secret.access.key": "*****",
"s3.part.retries": 5,
"s3.retry.backoff.ms": 1000,
"behavior.on.null.values": "ignore",
"keys.format.class": "io.confluent.connect.s3.format.json.JsonFormat",
"headers.format.class": "io.confluent.connect.s3.format.json.JsonFormat",
"store.kafka.headers": "true",
"store.kafka.keys": "true",
"topics": "***",
"storage.class": "io.confluent.connect.s3.storage.S3Storage",
"topics.dir": "kafka-backup",
"value.converter": "io.confluent.connect.json.JsonSchemaConverter",
"value.converter.schema.registry.url": "http://schema-registry:8081",
"key.converter": "org.apache.kafka.connect.storage.StringConverter",
"partitioner.class": "io.confluent.connect.storage.partitioner.HourlyPartitioner",
"locale": "en-US",
"timezone": "UTC",
"timestamp.extractor": "Record"
}
The records in Kafka are stored in JSON format and were written via io.confluent.connect.json.JsonSchemaConverter, so all messages have a strict schema.
When the sink connector tries to read records from Kafka, I get the exception "Avro schema must be a record."
I don't understand why I get this error, since I don't use any Avro format anywhere.
The full stack trace:
org.apache.kafka.connect.errors.ConnectException: Exiting WorkerSinkTask due to unrecoverable exception.
at org.apache.kafka.connect.runtime.WorkerSinkTask.deliverMessages(WorkerSinkTask.java:631)
at org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:333)
at org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:234)
at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:203)
at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:188)
at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:243)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.lang.IllegalArgumentException: Avro schema must be a record.
at org.apache.parquet.avro.AvroSchemaConverter.convert(AvroSchemaConverter.java:124)
at org.apache.parquet.avro.AvroParquetWriter.writeSupport(AvroParquetWriter.java:150)
at org.apache.parquet.avro.AvroParquetWriter.access$200(AvroParquetWriter.java:36)
at org.apache.parquet.avro.AvroParquetWriter$Builder.getWriteSupport(AvroParquetWriter.java:182)
at org.apache.parquet.hadoop.ParquetWriter$Builder.build(ParquetWriter.java:563)
at io.confluent.connect.s3.format.parquet.ParquetRecordWriterProvider$1.write(ParquetRecordWriterProvider.java:102)
at io.confluent.connect.s3.format.S3RetriableRecordWriter.write(S3RetriableRecordWriter.java:46)
at io.confluent.connect.s3.format.KeyValueHeaderRecordWriterProvider$1.write(KeyValueHeaderRecordWriterProvider.java:107)
at io.confluent.connect.s3.TopicPartitionWriter.writeRecord(TopicPartitionWriter.java:562)
at io.confluent.connect.s3.TopicPartitionWriter.checkRotationOrAppend(TopicPartitionWriter.java:311)
at io.confluent.connect.s3.TopicPartitionWriter.executeState(TopicPartitionWriter.java:254)
at io.confluent.connect.s3.TopicPartitionWriter.write(TopicPartitionWriter.java:205)
at io.confluent.connect.s3.S3SinkTask.put(S3SinkTask.java:234)
at org.apache.kafka.connect.runtime.WorkerSinkTask.deliverMessages(WorkerSinkTask.java:601)
Version of S3 connector: 10.3.0
Version of Kafka Connect: 7.0.1
You need to either not use ParquetFormat, or you need to produce Avro. ParquetFormat requires Avro (see the source of the S3 sink connector).
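For example, a minimal sketch of the "don't use ParquetFormat" option, keeping the JSON Schema data as-is and writing JSON files instead (only the keys that change from the config above are shown; the registry URL is the same placeholder):
{
  "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
  "value.converter": "io.confluent.connect.json.JsonSchemaConverter",
  "value.converter.schema.registry.url": "http://schema-registry:8081"
}
Alternatively, keep ParquetFormat but have the producers write Avro and use io.confluent.connect.avro.AvroConverter as the value.converter, so that the value schema handed to the Parquet writer is an Avro record.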
I have created a connector to sink data from Kafka to MongoDB. In some cases the wrong data lands on my topic, and when that topic is synced to the database it causes a duplicate-key error because of the index I created.
In that case I want the offending record to be moved to the DLQ, but it is not being moved there.
This is my connector config; can anyone please help me with this?
{
"name": "test_1",
"config": {
"connector.class": "com.mongodb.kafka.connect.MongoSinkConnector",
"topics": "test",
"connection.uri": "xxx",
"database": "test",
"collection": "test_record",
"key.converter": "org.apache.kafka.connect.storage.StringConverter",
"key.converter.schemas.enable": "false",
"value.converter": "io.confluent.connect.avro.AvroConverter",
"value.converter.schemas.enable": "true",
"value.converter.schema.registry.url": "http://xxx:8081",
"document.id.strategy.overwrite.existing": "true",
"document.id.strategy": "com.mongodb.kafka.connect.sink.processor.id.strategy.ProvidedInKeyStrategy",
"transforms": "hk",
"transforms.hk.type": "org.apache.kafka.connect.transforms.HoistField$Key",
"transforms.hk.field": "_id",
"writemodel.strategy": "com.mongodb.kafka.connect.sink.writemodel.strategy.UpdateOneTimestampsStrategy",
"write.method": "upsert",
"errors.tolerance":"all",
"errors.deadletterqueue.topic.name":"dlq_sink",
"errors.deadletterqueue.context.headers.enable":true,
"errors.retry.delay.max.ms": 60000,
"errors.retry.timeout": 300000
}
}
Thanks,
Good morning
When I work with an HDFS2 sink connector integrated with Hive, the database table gets the name of the topic. Is there a way to choose the name of the table?
This is the config of my connector:
{
"connector.class": "io.confluent.connect.hdfs.HdfsSinkConnector",
"hive.integration": "true",
"hive.database": "databaseEze",
"hive.metastore.uris": "thrift://server1.dc.es.arioto:9083",
"transforms.InsertField.timestamp.field": "carga",
"flush.size": "100000",
"tasks.max": "2",
"timezone": "Europe/Paris",
"transforms": "RenameField,InsertField,carga_format",
"rotate.interval.ms": "900000",
"locale": "en-GB",
"logs.dir": "/logs",
"format.class": "io.confluent.connect.hdfs.avro.AvroFormat",
"transforms.InsertField.type": "org.apache.kafka.connect.transforms.InsertField$Value",
"transforms.RenameField.type": "org.apache.kafka.connect.transforms.ReplaceField$Value",
"value.converter": "io.confluent.connect.avro.AvroConverter",
"key.converter": "org.apache.kafka.connect.storage.StringConverter",
"transforms.RenameField.renames": "var1:Test1,var2:Test2,var3:test3",
"transforms.carga_format.type": "org.apache.kafka.connect.transforms.TimestampConverter$Value",
"transforms.carga_format.target.type": "string",
"transforms.carga_format.format": "yyyyMMdd",
"hadoop.conf.dir": "/etc/hadoop/",
"schema.compatibility": "BACKWARD",
"topics": "Skiel-Tracking-Replicator",
"hdfs.url": "hdfs://database/user/datavaseEze/",
"transforms.InsertField.topic.field": "ds_topic",
"partition.field.name": "carga",
"transforms.InsertField.partition.field": "test_partition",
"value.converter.schema.registry.url": "http://schema-registry-eze-dev.ocjc.serv.dc.es.arioto",
"partitioner.class": "io.confluent.connect.storage.partitioner.FieldPartitioner",
"name": "KAFKA-HDFS-HIVE-TEST",
"transforms.fx_carga_format.field": "carga",
"transforms.InsertField.offset.field": "test_offset"
}
With that config the table is named **Skiel-Tracking-Replicator**, and I want the table name to be **d9nvtest**.
You can use the RegexRouter Single Message Transform to modify the topic name.
{
"transforms" : "renameTopic",
"transforms.renameTopic.type" : "org.apache.kafka.connect.transforms.RegexRouter",
"transforms.renameTopic.regex" : "Skiel-Tracking-Replicator",
"transforms.renameTopic.replacement": "d9nvtest"
}
See https://rmoff.net/2020/12/11/twelve-days-of-smt-day-4-regexrouter/
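Note that the connector above already defines "transforms": "RenameField,InsertField,carga_format", so the new alias needs to be appended to that chain rather than replace it; roughly, reusing the names from the config above:
{
  "transforms": "RenameField,InsertField,carga_format,renameTopic",
  "transforms.renameTopic.type": "org.apache.kafka.connect.transforms.RegexRouter",
  "transforms.renameTopic.regex": "Skiel-Tracking-Replicator",
  "transforms.renameTopic.replacement": "d9nvtest"
}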
While using RegexRouter with kafka-connect-hdfs, this issue occurs: https://github.com/confluentinc/kafka-connect-hdfs/issues/236
The last comment there states that the two are conceptually incompatible.
I'm loading Avro files into GCS using the Kafka GCS connector. In my schema in the Schema Registry I have logical types on some of my columns, but it seems like they're not being transferred to the files. How can logical types from a schema be carried over to the Avro files?
Here is my connector configuration for what it's worth:
{
"connector.class": "io.confluent.connect.gcs.GcsSinkConnector",
"confluent.topic.bootstrap.servers": "kafka.internal:9092",
"flush.size": "200000",
"tasks.max": "300",
"topics": "prod_ny, prod_vr",
"group.id": "gcs_sink_connect",
"value.converter.value.subject.name.strategy": "io.confluent.kafka.serializers.subject.RecordNameStrategy",
"gcs.credentials.json": "---",
"confluent.license: "---",
"value.converter.schema.registry.url": "http://p-og.prod:8081",
"gcs.bucket.name": "kafka_load",
"format.class": "io.confluent.connect.gcs.format.avro.AvroFormat",
"gcs.part.size": "5242880",
"confluent.topic.replication.factor": "1",
"name": "gcs_sink_prod",
"value.converter": "io.confluent.connect.avro.AvroConverter",
"storage.class": "io.confluent.connect.gcs.storage.GcsStorage",
"key.converter": "org.apache.kafka.connect.storage.StringConverter",
"auto.offset.reset": "latest"
}
I have over 50 source connectors reading from SQL Server, but two of them are going into an error state. Please tell me what the reason could be, as we have limited access to the Kafka server.
{
"name": "xxxxxxxxxxxxx",
"connector": {
"state": "RUNNING",
"worker_id": "xxxxxxxxxxxxxx:8083"
},
"tasks": [
{
"state": "FAILED",
"trace": "org.apache.kafka.connect.errors.ConnectException: Tolerance exceeded in error handler\n\tat org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execAndHandleError(RetryWithToleranceOperator.java:178)\n\tat org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execute(RetryWithToleranceOperator.java:104)\n\tat org.apache.kafka.connect.runtime.TransformationChain.apply(TransformationChain.java:44)\n\tat org.apache.kafka.connect.runtime.WorkerSourceTask.sendRecords(WorkerSourceTask.java:292)\n\tat org.apache.kafka.connect.runtime.WorkerSourceTask.execute(WorkerSourceTask.java:228)\n\tat org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:175)\n\tat org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:219)\n\tat java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)\n\tat java.util.concurrent.FutureTask.run(FutureTask.java:266)\n\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)\n\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)\n\tat java.lang.Thread.run(Thread.java:748)\nCaused by: org.apache.kafka.connect.errors.DataException: Schema required for [updating schema metadata]\n\tat org.apache.kafka.connect.transforms.util.Requirements.requireSchema(Requirements.java:31)\n\tat org.apache.kafka.connect.transforms.SetSchemaMetadata.apply(SetSchemaMetadata.java:64)\n\tat org.apache.kafka.connect.runtime.TransformationChain.lambda$apply$0(TransformationChain.java:44)\n\tat org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execAndRetry(RetryWithToleranceOperator.java:128)\n\tat org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execAndHandleError(RetryWithToleranceOperator.java:162)\n\t... 11 more\n",
"id": 0,
"worker_id": "xxxxxxxxxxxxx:8083"
}
],
"type": "source"
}
Source Connector configurations:
{
"name": "xxxxxxxx",
"config": {
"connector.class": "io.debezium.connector.sqlserver.SqlServerConnector",
"database.history.kafka.topic": "dbhistory.fullfillment.ecom",
"transforms": "unwrap,setSchemaName",
"internal.key.converter.schemas.enable": "false",
"offset.storage.partitons": "2",
"include.schema.changes": "false",
"table.whitelist": "dbo.abc",
"decimal.handling.mode": "double",
"transforms.unwrap.drop.tombstones": "false",
"transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
"value.converter": "io.confluent.connect.avro.AvroConverter",
"key.converter": "org.apache.kafka.connect.json.JsonConverter",
"transforms.setSchemaName.schema.name": "com.data.meta.avro.abc",
"database.dbname": "xxxxxx",
"database.user": "xxxxxx",
"database.history.kafka.bootstrap.servers": "xxxxxxxxxxxx",
"database.server.name": "xxxxxxx",
"database.port": "xxxxxx",
"transforms.setSchemaName.type": "org.apache.kafka.connect.transforms.SetSchemaMetadata$Value",
"key.converter.schemas.enable": "false",
"value.converter.schema.registry.url": "http://xxxxxxxxxx:8081",
"internal.key.converter": "org.apache.kafka.connect.json.JsonConverter",
"database.hostname": "xxxxxxx",
"database.password": "xxxxxxx",
"internal.value.converter.schemas.enable": "false",
"internal.value.converter": "org.apache.kafka.connect.json.JsonConverter",
"name": "xxxxxxxxxxx"
}
}
If you look at the stack trace in the trace field and replace the \n and \t characters with newlines and tabs, you will see:
org.apache.kafka.connect.errors.ConnectException: Tolerance exceeded in error handler
at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execAndHandleError(RetryWithToleranceOperator.java:178)
at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execute(RetryWithToleranceOperator.java:104)
at org.apache.kafka.connect.runtime.TransformationChain.apply(TransformationChain.java:44)
at org.apache.kafka.connect.runtime.WorkerSourceTask.sendRecords(WorkerSourceTask.java:292)
at org.apache.kafka.connect.runtime.WorkerSourceTask.execute(WorkerSourceTask.java:228)
at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:175)
at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:219)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.kafka.connect.errors.DataException: Schema required for [updating schema metadata]
at org.apache.kafka.connect.transforms.util.Requirements.requireSchema(Requirements.java:31)
at org.apache.kafka.connect.transforms.SetSchemaMetadata.apply(SetSchemaMetadata.java:64)
at org.apache.kafka.connect.runtime.TransformationChain.lambda$apply$0(TransformationChain.java:44)
at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execAndRetry(RetryWithToleranceOperator.java:128)
at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execAndHandleError(RetryWithToleranceOperator.java:162)
... 11 more
So the cause of your error is thrown from the SetSchemaMetadata Single Message Transform: org.apache.kafka.connect.errors.DataException: Schema required for [updating schema metadata]
I would check the configuration on your connectors, isolate the ones that have failed, and look at the Single Message Transform configuration. This issue might be relevant.
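One common way to hit "Schema required" with a Debezium source is a tombstone (null-value, schema-less) record reaching the SMT; note that the config above sets transforms.unwrap.drop.tombstones to false, so tombstones do pass through. A hedged sketch, assuming tombstones are the culprit, that applies the SMT only to non-tombstone records using the Connect predicates mechanism (Kafka 2.6+); only the relevant keys are shown:
{
  "transforms": "unwrap,setSchemaName",
  "predicates": "isTombstone",
  "predicates.isTombstone.type": "org.apache.kafka.connect.transforms.predicates.RecordIsTombstone",
  "transforms.setSchemaName.predicate": "isTombstone",
  "transforms.setSchemaName.negate": "true"
}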