Can I write a custom Kafka Connect transform for converting JSON to Avro? - apache-kafka

I want to use kafka-connect-hdfs for writing schemaless JSON records from Kafka to an HDFS file.
If I use JsonConverter as the key/value converter, it does not work. But if I use StringConverter, it writes the JSON as an escaped string.
For example:
actual json -
{"name":"test"}
data written to hdfs file -
"{\"name\":\"test\"}"
expected output to hdfs file -
{"name":"test"}
Is there any way I can achieve this, or do I have to use it with a schema only?
Below is the exception I get when I try to use JsonConverter:
[2017-09-06 14:40:19,344] ERROR Task hdfs-sink-0 threw an uncaught and unrecoverable exception (org.apache.kafka.connect.runtime.WorkerTask:148)
org.apache.kafka.connect.errors.DataException: JsonConverter with schemas.enable requires "schema" and "payload" fields and may not contain additional fields. If you are trying to deserialize plain JSON data, set schemas.enable=false in your converter configuration.
at org.apache.kafka.connect.json.JsonConverter.toConnectData(JsonConverter.java:308)
at org.apache.kafka.connect.runtime.WorkerSinkTask.convertMessages(WorkerSinkTask.java:406)
at org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:250)
at org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:180)
at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:148)
at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:146)
at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:190)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Configuration of quickstart-hdfs.properties:
name=hdfs-sink
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
tasks.max=1
topics=test_hdfs_avro
hdfs.url=hdfs://localhost:9000
flush.size=1
key.converter=org.apache.kafka.connect.storage.StringConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
Configuration of connect-avro-standalone.properties:
bootstrap.servers=localhost:9092
schemas.enable=false
key.converter.schemas.enable=false
value.converter.schemas.enable=false

When you specify a converter in your connector's configuration properties, you need to include all the properties pertaining to that converter, regardless of whether such properties are also included in the worker's config.
In the above example, you'll need to specify both:
value.converter=org.apache.kafka.connect.json.JsonConverter
value.converter.schemas.enable=false
in quickstart-hdfs.properties.
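Putting it together, the sink connector's properties file would then look something like the following sketch (the same values shown in the question, with the converter override added; adjust topics, URLs, and names to your environment):
name=hdfs-sink
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
tasks.max=1
topics=test_hdfs_avro
hdfs.url=hdfs://localhost:9000
flush.size=1
key.converter=org.apache.kafka.connect.storage.StringConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
value.converter.schemas.enable=false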
FYI, JSON export is coming up in the HDFS Connector soon. Track the related pull request here: https://github.com/confluentinc/kafka-connect-hdfs/pull/196
Update: JsonFormat has been merged into the master branch.
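Once you are on a version that includes it, the output format is selected with the connector's format.class property; something along these lines should work (class name assumed from the connector's package layout, so check the docs for your version):
format.class=io.confluent.connect.hdfs.json.JsonFormat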

Related

Kafka SMT keeps failing to extract JSON field to use as message key

I am using the lenses.io S3 source connector to read JSON files and am trying to set the message key using an SMT.
Here is the config used for the connector on AWS MSK:
connector.class=io.lenses.streamreactor.connect.aws.s3.source.S3SourceConnector
tasks.max=1
topics=topic_3
connect.s3.vhost.bucket=true
connect.s3.aws.auth.mode=Credentials
connect.s3.aws.access.key=<<access key>>
connect.s3.aws.region=eu-central-1
connect.s3.aws.secret.key=<<secret key>>
schema.enable=false
connect.s3.kcql=INSERT INTO topic_3 SELECT * FROM bucket1:json STOREAS `JSON` WITH_FLUSH_COUNT = 1
aws.region=eu-central-1
aws.custom.endpoint=https://s3.eu-central-1.amazonaws.com
transforms.createKey.type=org.apache.kafka.connect.transforms.ValueToKey
transforms=createKey
key.converter.schemas.enable=false
transforms.createKey.fields=id
value.converter.schemas.enable=false
value.converter=org.apache.kafka.connect.json.JsonConverter
key.converter=org.apache.kafka.connect.storage.StringConverter
I can't get the SMT to work and am running into the below error:
[Worker-0d3e3af50908b12ee] [2022-04-13 11:43:08,461] ERROR [dev2-s3-source-connector-4|task-0] Error encountered in task dev2-s3-source-connector-4-0. Executing stage 'TRANSFORMATION' with class 'org.apache.kafka.connect.transforms.ValueToKey'. (org.apache.kafka.connect.runtime.errors.LogReporter:66)
[Worker-0d3e3af50908b12ee] org.apache.kafka.connect.errors.DataException: Only Map objects supported in absence of schema for [copying fields from value to key], found: java.lang.String
P.S. If the SMT entries are removed from the config, then the JSON files are read into the Kafka topic with no issues (but the message key is empty).

Implementing custom AvroConverter for confluent kafka-connect-s3

I am using Confluent's Kafka Connect S3 sink for copying data from Apache Kafka to AWS S3.
The problem is that I have Kafka data in Avro format which is NOT using Confluent Schema Registry's Avro serializer, and I cannot change the Kafka producer. So I need to deserialize the existing Avro data from Kafka and then persist it in Parquet format in AWS S3. I tried using Confluent's AvroConverter as the value converter, like this:
value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=http://localhost/api/v1/avro
And I am getting this error:
Caused by: org.apache.kafka.connect.errors.DataException: Failed to deserialize data for topic dcp-all to Avro:
at io.confluent.connect.avro.AvroConverter.toConnectData(AvroConverter.java:110)
at org.apache.kafka.connect.storage.Converter.toConnectData(Converter.java:86)
at org.apache.kafka.connect.runtime.WorkerSinkTask.lambda$convertAndTransformRecord$2(WorkerSinkTask.java:488)
at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execAndRetry(RetryWithToleranceOperator.java:128)
at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execAndHandleError(RetryWithToleranceOperator.java:162)
Caused by: org.apache.kafka.common.errors.SerializationException: Error deserializing Avro message for id -1
Caused by: org.apache.kafka.common.errors.SerializationException: Unknown magic byte!
As far as I understand, "io.confluent.connect.avro.AvroConverter" will only work if the data was written to Kafka using Confluent Schema Registry's Avro serializer, hence this error. So my question is: do I need to implement a generic AvroConverter in this case? And if so, how do I extend the existing source code - https://github.com/confluentinc/kafka-connect-storage-cloud?
Any help here will be appreciated.
You don't need to extend that repo. You just need to implement a Converter (part of Apache Kafka), shade it into a JAR, then place it on your Connect worker's CLASSPATH, as BlueApron did for Protobuf.
Or see if this works - https://github.com/farmdawgnation/registryless-avro-converter
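For reference, here is a minimal sketch of what such a registry-less Converter could look like. It assumes the writer schema is known up front and is passed in via a hypothetical avro.reader.schema converter property, it relies on Confluent's AvroData helper to map Avro values to Connect schema/value pairs, and it only implements the sink (read) path; treat it as a starting point rather than the exact approach taken by the projects linked above.
import io.confluent.connect.avro.AvroData;
import java.util.Map;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DecoderFactory;
import org.apache.kafka.connect.data.Schema;
import org.apache.kafka.connect.data.SchemaAndValue;
import org.apache.kafka.connect.errors.DataException;
import org.apache.kafka.connect.storage.Converter;

public class RegistrylessAvroConverter implements Converter {

    private org.apache.avro.Schema avroSchema;
    private AvroData avroData;

    @Override
    public void configure(Map<String, ?> configs, boolean isKey) {
        // Hypothetical property: the writer schema as a JSON string, e.g.
        // value.converter.avro.reader.schema={"type":"record", ...}
        String schemaJson = (String) configs.get("avro.reader.schema");
        avroSchema = new org.apache.avro.Schema.Parser().parse(schemaJson);
        avroData = new AvroData(100); // schema cache size
    }

    @Override
    public SchemaAndValue toConnectData(String topic, byte[] value) {
        if (value == null) {
            return SchemaAndValue.NULL;
        }
        try {
            // Plain Avro binary, no Confluent wire format (no magic byte / schema id).
            GenericDatumReader<GenericRecord> reader = new GenericDatumReader<>(avroSchema);
            GenericRecord record = reader.read(null, DecoderFactory.get().binaryDecoder(value, null));
            // AvroData maps Avro types to Kafka Connect Schema/Struct types.
            return avroData.toConnectData(avroSchema, record);
        } catch (Exception e) {
            throw new DataException("Failed to deserialize Avro record from topic " + topic, e);
        }
    }

    @Override
    public byte[] fromConnectData(String topic, Schema schema, Object value) {
        // Only the read path is needed for a sink connector in this sketch.
        throw new UnsupportedOperationException("Serialization is not implemented in this sketch");
    }
}
Build it against connect-api and Confluent's Avro converter library (which provides AvroData), shade the result into a JAR on the worker's CLASSPATH or plugin path, then reference the class in value.converter.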
NOT using Confluent Schema Registry
Then what registry are you using? Each one that I know of has configurations to interface with the Confluent one

SF_KAFKA_CONNECTOR name is empty or invalid error using Confluent Cloud and Snowflake Kafka Connector

I have a cluster running in Confluent Cloud and am able to Produce and Consume data using other applications. However, when I try to hook up the Snowflake Kafka Connector I receive these errors:
[2019-10-15 22:12:08,979] INFO Creating connector source-snowflake of type com.snowflake.kafka.connector.SnowflakeSinkConnector (org.apache.kafka.connect.runtime.Worker)
[2019-10-15 22:12:08,983] INFO Instantiated connector source-snowflake with version 0.5.1 of type class com.snowflake.kafka.connector.SnowflakeSinkConnector (org.apache.kafka.connect.runtime.Worker)
[2019-10-15 22:12:08,986] INFO
[SF_KAFKA_CONNECTOR] Snowflake Kafka Connector Version: 0.5.1 (com.snowflake.kafka.connector.Utils)
[2019-10-15 22:12:09,029] INFO
[SF_KAFKA_CONNECTOR] SnowflakeSinkConnector:start (com.snowflake.kafka.connector.SnowflakeSinkConnector)
[2019-10-15 22:12:09,030] ERROR
[SF_KAFKA_CONNECTOR] name is empty or invalid. It should match Snowflake object identifier syntax. Please see the documentation. (com.snowflake.kafka.connector.Utils)
[2019-10-15 22:12:09,033] ERROR WorkerConnector{id=source-snowflake} Error while starting connector (org.apache.kafka.connect.runtime.WorkerConnector)
com.snowflake.kafka.connector.internal.SnowflakeKafkaConnectorException:
[SF_KAFKA_CONNECTOR] Exception: Invalid input connector configuration
[SF_KAFKA_CONNECTOR] Error Code: 0001
[SF_KAFKA_CONNECTOR] Detail: input kafka connector configuration is null, missing required values, or wrong input value
at com.snowflake.kafka.connector.internal.SnowflakeErrors.getException(SnowflakeErrors.java:347)
at com.snowflake.kafka.connector.internal.SnowflakeErrors.getException(SnowflakeErrors.java:306)
at com.snowflake.kafka.connector.Utils.validateConfig(Utils.java:400)
at com.snowflake.kafka.connector.SnowflakeSinkConnector.start(SnowflakeSinkConnector.java:131)
at org.apache.kafka.connect.runtime.WorkerConnector.doStart(WorkerConnector.java:111)
at org.apache.kafka.connect.runtime.WorkerConnector.start(WorkerConnector.java:136)
at org.apache.kafka.connect.runtime.WorkerConnector.transitionTo(WorkerConnector.java:196)
at org.apache.kafka.connect.runtime.Worker.startConnector(Worker.java:252)
at org.apache.kafka.connect.runtime.distributed.DistributedHerder.startConnector(DistributedHerder.java:1079)
at org.apache.kafka.connect.runtime.distributed.DistributedHerder.access$1300(DistributedHerder.java:117)
at org.apache.kafka.connect.runtime.distributed.DistributedHerder$15.call(DistributedHerder.java:1095)
at org.apache.kafka.connect.runtime.distributed.DistributedHerder$15.call(DistributedHerder.java:1091)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Here is my scrubbed Snowflake config file:
{
  "name":"snowsink",
  "config":{
    "connector.class":"com.snowflake.kafka.connector.SnowflakeSinkConnector",
    "tasks.max":"8",
    "topics":"tp-snow-test",
    "buffer.count.records":"100",
    "buffer.flush.time":"60",
    "buffer.size.bytes":"65536",
    "snowflake.url.name":"xxxxxxx.east-us-2.azure.snowflakecomputing.com",
    "snowflake.user.name":"svc_cc_strm",
    "snowflake.private.key":"<key>",
    "snowflake.private.key.passphrase":"<password>",
    "snowflake.database.name":"testdb",
    "snowflake.schema.name":"test1",
    "key.converter":"org.apache.kafka.connect.storage.StringConverter",
    "value.converter":"com.snowflake.kafka.connector.records.SnowflakeJsonConverter"
  }
}
Any ideas? Thanks.
The name of the connector should be a valid SQL identifier in Snowflake. So many of the Kafka topic examples have dashes in them that I got this same error when I first tried the Snowflake Kafka connector.
According to the documentation, a Snowflake pipe is created using the connector_name specified, and pipe names must be valid SQL identifiers.
The connector creates one pipe for each topic partition. The name is:
SNOWFLAKE_KAFKA_CONNECTOR_PIPE_.
Also from the same doc page at "Fields in the Configuration File" for name:
Application name. This must be unique across all Kafka connectors used by the customer. This name must be a valid Snowflake unquoted identifier.
If the topic has a dash in it, then it will need to be mapped to a table name that is also a proper SQL identifier in your connector config; otherwise the connector will try to create a table with the same name as the topic and fail on the "-" in the name.
You need to change the name of your connector (source-snowflake) to remove the - from it (so that it matches this validation pattern).
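For example, renaming the connector to something without dashes in the JSON config would look like this (the exact name is hypothetical; letters, digits, and underscores are fine in an unquoted identifier):
"name":"snowflake_sink",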
You need to have the below entry in your config file, below the topics entry:
"topics":"tp-snow-test",
"snowflake.topic2table.map": "tp-snow-test:TestKafkaTable",

MongoDB Kafka Connector not generating the message key with the Mongo document id

I'm using the beta release of the MongoDB Kafka Connector to publish from MongoDB to a Kafka topic.
Messages are generated into Kafka but their key is null when it should be the document id:
This is my connect standalone config:
bootstrap.servers=xxx:9092
# The converters specify the format of data in Kafka and how to translate it into Connect data. Every Connect user will
# need to configure these based on the format they want their data in when loaded from or stored into Kafka
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
# Converter-specific settings can be passed in by prefixing the Converter's setting with the converter you want to apply
# it to
key.converter.schemas.enable=false
value.converter.schemas.enable=false
# The internal converter used for offsets and config data is configurable and must be specified, but most users will
# always want to use the built-in default. Offset and config data is never visible outside of Kafka Connect in this format.
internal.key.converter=org.apache.kafka.connect.json.JsonConverter
internal.value.converter=org.apache.kafka.connect.json.JsonConverter
internal.key.converter.schemas.enable=false
internal.value.converter.schemas.enable=false
And the mongodb source properties:
name=mongo-source
connector.class=com.mongodb.kafka.connect.MongoSourceConnector
tasks.max=1
# Connection and source configuration
connection.uri=mongodb+srv://xxx
database=mydb
collection=mycollection
topic.prefix=someprefix
poll.max.batch.size=1000
poll.await.time.ms=5000
# Change stream options
pipeline=[]
batch.size=0
change.stream.full.document=updateLookup
collation=
Below is an example of a message's String value:
"{\"_id\": {\"_data\": \"xxx\"}, \"operationType\": \"replace\", \"clusterTime\": {\"$timestamp\": {\"t\": 1564140389, \"i\": 1}}, \"fullDocument\": {\"_id\": \"5\", \"name\": \"Some Client\", \"clientId\": \"someclient\", \"clientSecret\": \"1234\", \"whiteListedIps\": [], \"enabled\": true, \"_class\": \"myproject.Client\"}, \"ns\": {\"db\": \"mydb\", \"coll\": \"mycollection\"}, \"documentKey\": {\"_id\": \"5\"}}"
I tried using a transform to extract it from the value, specifically from the documentKey field:
transforms=InsertKey
transforms.InsertKey.type=org.apache.kafka.connect.transforms.ValueToKey
transforms.InsertKey.fields=documentKey
But got an exception:
Caused by: org.apache.kafka.connect.errors.DataException: Only Struct objects supported for [copying fields from value to key], found: java.lang.String
at org.apache.kafka.connect.transforms.util.Requirements.requireStruct(Requirements.java:52)
at org.apache.kafka.connect.transforms.ValueToKey.applyWithSchema(ValueToKey.java:79)
at org.apache.kafka.connect.transforms.ValueToKey.apply(ValueToKey.java:65)
Any ideas on how to generate a key with the document id?
According to the exception that is thrown:
Caused by: org.apache.kafka.connect.errors.DataException: Only Struct objects supported for [copying fields from value to key], found: java.lang.String
at org.apache.kafka.connect.transforms.util.Requirements.requireStruct(Requirements.java:52)
at org.apache.kafka.connect.transforms.ValueToKey.applyWithSchema(ValueToKey.java:79)
at org.apache.kafka.connect.transforms.ValueToKey.apply(ValueToKey.java:65)
Unfortunately, the MongoDB connector you are using does not create a proper schema.
The connector creates the record with both the key and value schemas as String.
Check this line: how the record is created by the connector. That is the reason why you can't apply the transformation to it.
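To illustrate the constraint, here is a small standalone sketch (topic and field names are taken from the question, everything else is made up for the illustration) showing that ValueToKey can copy fields from a Struct value but throws the quoted DataException when the value schema is just a String:
import java.util.Collections;
import org.apache.kafka.connect.data.Schema;
import org.apache.kafka.connect.data.SchemaBuilder;
import org.apache.kafka.connect.data.Struct;
import org.apache.kafka.connect.source.SourceRecord;
import org.apache.kafka.connect.transforms.ValueToKey;

public class ValueToKeyDemo {
    public static void main(String[] args) {
        ValueToKey<SourceRecord> smt = new ValueToKey<>();
        smt.configure(Collections.singletonMap("fields", "documentKey"));

        // A Struct value with a proper schema: the SMT copies documentKey into the record key.
        Schema valueSchema = SchemaBuilder.struct()
                .field("documentKey", Schema.STRING_SCHEMA)
                .build();
        Struct value = new Struct(valueSchema).put("documentKey", "{\"_id\": \"5\"}");
        SourceRecord structRecord = new SourceRecord(null, null, "someprefix.mydb.mycollection",
                valueSchema, value);
        System.out.println(smt.apply(structRecord).key()); // a Struct holding documentKey

        // A plain String value with a String schema (what the connector currently produces):
        // throws DataException: Only Struct objects supported for [copying fields from value to key]
        SourceRecord stringRecord = new SourceRecord(null, null, "someprefix.mydb.mycollection",
                Schema.STRING_SCHEMA, "{\"documentKey\": {\"_id\": \"5\"}}");
        smt.apply(stringRecord);
        smt.close();
    }
}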
This should be supported in release 1.3.0:
https://jira.mongodb.org/browse/KAFKA-40

org.apache.kafka.connect.errors.DataException: Invalid JSON for record default value: null

I have a Kafka Avro Topic generated using KafkaAvroSerializer.
My standalone properties are as below.
I am using Confluent 4.0.0 to run Kafka Connect.
key.converter=io.confluent.connect.avro.AvroConverter
value.converter=io.confluent.connect.avro.AvroConverter
key.converter.schema.registry.url=<schema_registry_hostname>:8081
value.converter.schema.registry.url=<schema_registry_hostname>:8081
key.converter.schemas.enable=true
value.converter.schemas.enable=true
internal.key.converter=org.apache.kafka.connect.json.JsonConverter
internal.value.converter=org.apache.kafka.connect.json.JsonConverter
internal.key.converter.schemas.enable=false
internal.value.converter.schemas.enable=false
When I run the Kafka connector for the HDFS sink in standalone mode, I get this error message:
[2018-06-27 17:47:41,746] ERROR WorkerSinkTask{id=camus-email-service-0} Task threw an uncaught and unrecoverable exception (org.apache.kafka.connect.runtime.WorkerTask)
org.apache.kafka.connect.errors.DataException: Invalid JSON for record default value: null
at io.confluent.connect.avro.AvroData.defaultValueFromAvro(AvroData.java:1640)
at io.confluent.connect.avro.AvroData.toConnectSchema(AvroData.java:1527)
at io.confluent.connect.avro.AvroData.toConnectSchema(AvroData.java:1410)
at io.confluent.connect.avro.AvroData.toConnectSchema(AvroData.java:1290)
at io.confluent.connect.avro.AvroData.toConnectData(AvroData.java:1014)
at io.confluent.connect.avro.AvroConverter.toConnectData(AvroConverter.java:88)
at org.apache.kafka.connect.runtime.WorkerSinkTask.convertMessages(WorkerSinkTask.java:454)
at org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:287)
at org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:198)
at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:166)
at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:170)
at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:214)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
[2018-06-27 17:47:41,748] ERROR WorkerSinkTask{id=camus-email-service-0} Task is being killed and will not recover until manually restarted ( org.apache.kafka.connect.runtime.WorkerTask)
[2018-06-27 17:52:19,554] INFO Kafka Connect stopping (org.apache.kafka.connect.runtime.Connect).
When I use kafka-avro-console-consumer, passing the schema registry, the Kafka messages are deserialized.
i.e.:
/usr/bin/kafka-avro-console-consumer --bootstrap-server <kafka-host>:9092 --topic <KafkaTopicName> --property schema.registry.url=<schema_registry_hostname>:8081
Changing the "subscription" column's datatype to Union datatype fixed the issue. Avroconverters were able to deserialize the messages.
I think your Kafka key is null, which is not Avro.
Or it is some other type but malformed, and not converted to a RECORD datatype. See the AvroData source code:
case RECORD: {
if (!jsonValue.isObject()) {
throw new DataException("Invalid JSON for record default value: " + jsonValue.toString());
}
UPDATE: According to your comment, you can see this is true:
$ curl -X GET localhost:8081/subjects/<kafka-topic>-key/versions/latest
{"subject":"<kafka-topic>-key","version":2,"id":625,"schema":"\"bytes\""}
In any case, the HDFS connector does not natively store the key, so try not deserializing the key at all rather than using Avro:
key.converter=org.apache.kafka.connect.converters.ByteArrayConverter
Also, your console consumer is not printing the key, so your test isn't adequate. You need to add --property print.key=true
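For example, extending the console-consumer command from the question (same placeholder host names):
/usr/bin/kafka-avro-console-consumer --bootstrap-server <kafka-host>:9092 --topic <KafkaTopicName> --property schema.registry.url=<schema_registry_hostname>:8081 --property print.key=true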