Kafka Sink HDFS Unrecognized token - apache-kafka

I'm trying to write JSON with Kafka HDFS Sink.
I have the following properties (connect-standalone.properties):
key.converter.schemas.enable = false
value.converter.schemas.enable = false
schemas.enable=false
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
And in my connector properties:
format.class=io.confluent.connect.hdfs.json.JsonFormat
And I got the following exception:
org.apache.kafka.connect.errors.DataException: Converting byte[] to Kafka Connect data failed due to serialization error
...
Caused by: org.apache.kafka.common.errors.SerializationException: com.fasterxml.jackson.core.JsonParseException: Unrecognized token 'test': was expecting 'null', 'true', 'false' or NaN at [Source: (byte[])"test" line: 1 column: 11]
My JSON is valid.
How can I solve it please?
I also tried with a sample JSON like:
{"key":"value"}
and still get the same error.
Thanks.

According to the error, not all the messages in the topic are actually JSON objects. The latest messages might be valid, or the Kafka values might be valid (but not the keys), yet the error shows that the converter tried to read a plain string, (byte[])"test", which is not valid JSON.
If you only want text data in HDFS, you could use the String format; however, that won't have Hive integration:
format.class=io.confluent.connect.hdfs.string.StringFormat
If you do want to use Hive with this format, you would need to define the JSON SerDe yourself.
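To illustrate the point about the plain string, here is a minimal Python sketch (not the connector's own code) that checks whether raw message bytes parse as JSON. The helper name is_json and the sample payloads are assumptions made up for this example. Note that the bare token test is rejected, while the quoted string "test" is accepted.

```python
import json


def is_json(raw: bytes) -> bool:
    """Return True if the raw bytes parse as a JSON document."""
    try:
        json.loads(raw)
        return True
    except ValueError:
        # json.JSONDecodeError and UnicodeDecodeError both subclass ValueError.
        return False


# Hypothetical payloads, as they might sit in the topic as keys or values.
samples = [b'{"key":"value"}', b'"test"', b'test']
results = {raw: is_json(raw) for raw in samples}

for raw, ok in results.items():
    print(raw, "valid JSON" if ok else "NOT valid JSON")
```

Running a check like this over both the keys and the values of the topic (for example from a console consumer dump) should show which records the JsonConverter will reject.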

Related

kafka SMT keeps failing to extract json field to use as message key

I am using the lenses.io S3 source connector to read JSON files and am trying to set the message key using an SMT.
Here is the config used for connector on AWS MSK
connector.class=io.lenses.streamreactor.connect.aws.s3.source.S3SourceConnector
tasks.max=1
topics=topic_3
connect.s3.vhost.bucket=true
connect.s3.aws.auth.mode=Credentials
connect.s3.aws.access.key=<<access key>>
connect.s3.aws.region=eu-central-1
connect.s3.aws.secret.key=<<secret key>>
schema.enable=false
connect.s3.kcql=INSERT INTO topic_3 SELECT * FROM bucket1:json STOREAS `JSON` WITH_FLUSH_COUNT = 1
aws.region=eu-central-1
aws.custom.endpoint=https://s3.eu-central-1.amazonaws.com
transforms.createKey.type=org.apache.kafka.connect.transforms.ValueToKey
transforms=createKey
key.converter.schemas.enable=false
transforms.createKey.fields=id
value.converter.schemas.enable=false
value.converter=org.apache.kafka.connect.json.JsonConverter
key.converter=org.apache.kafka.connect.storage.StringConverter
I can't get the SMT to work and am running into the error below:
[Worker-0d3e3af50908b12ee] [2022-04-13 11:43:08,461] ERROR [dev2-s3-source-connector-4|task-0] Error encountered in task dev2-s3-source-connector-4-0. Executing stage 'TRANSFORMATION' with class 'org.apache.kafka.connect.transforms.ValueToKey'. (org.apache.kafka.connect.runtime.errors.LogReporter:66)
[Worker-0d3e3af50908b12ee] org.apache.kafka.connect.errors.DataException: Only Map objects supported in absence of schema for [copying fields from value to key], found: java.lang.String
P.S. If the SMT settings are removed from the config, the JSON files are read into the Kafka topic with no issues (but the message key is empty).

Strange behaviour when using AvroConverter in Confluent Sink Connector

I was running the Confluent (v5.5.1) S3 sink connector with the config below:
"value.converter":"io.confluent.connect.avro.AvroConverter",
"value.converter.schema.registry.url":"http://registryurl",
"value.converter.value.subject.name.strategy":"io.confluent.kafka.serializers.subject.RecordNameStrategy",
......
And got the error below in the log:
DEBUG Sending GET with input null to http://registryurl/schemas/ids/309?fetchMaxId=false
DEBUG Sending POST with input {......} to http://registryurl/MyRecordName?deleted=true
Caused by: org.apache.kafka.common.errors.SerializationException: Error retrieving Avro value schema version for id 309
Caused by: io.confluent.kafka.schemaregistry.client.rest.exceptions.RestClientException: Subject not found.; error code: 40401
Two things baffle me here:
Why does the sink connector send an additional POST request to the schema registry, given that it is just a consumer? I successfully received messages with a standard Kafka consumer, which ONLY sends a GET request to the schema registry.
According to these docs and the official documentation, the schema subject should look like SubjectNamingStrategy-value or -key. Judging by the log, however, the request is not suffixed with "-value". I tried all 3 strategies and found that ONLY the default TopicNameStrategy works as expected.
I would appreciate it if anyone could shed some light here.
Thanks a lot.
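For reference, here is a small Python sketch of how the three subject name strategies derive a subject, as far as I understand the Confluent documentation: only TopicNameStrategy appends -key or -value, while RecordNameStrategy uses the fully qualified record name on its own. The function subject_name and the topic name my-topic are hypothetical and exist only for this illustration; this is not the serializer's actual code.

```python
def subject_name(strategy: str, topic: str, record_name: str, is_key: bool = False) -> str:
    """Model of how each strategy builds the schema registry subject."""
    suffix = "key" if is_key else "value"
    if strategy == "TopicNameStrategy":
        # Default: subject depends on the topic and on key/value position.
        return f"{topic}-{suffix}"
    if strategy == "RecordNameStrategy":
        # Subject is the fully qualified record name, with no suffix.
        return record_name
    if strategy == "TopicRecordNameStrategy":
        # Subject combines topic and record name, again with no suffix.
        return f"{topic}-{record_name}"
    raise ValueError(f"unknown strategy: {strategy}")


for strategy in ("TopicNameStrategy", "RecordNameStrategy", "TopicRecordNameStrategy"):
    print(strategy, "->", subject_name(strategy, "my-topic", "MyRecordName"))
```

Under this reading, a lookup for MyRecordName without a "-value" suffix is what RecordNameStrategy would be expected to produce.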

How should I understand this Kafka Avro Schema Registry Error - Error retrieving Avro unknown schema?

My Kafka consumer sometimes throws this kind of error:
org.apache.kafka.common.errors.SerializationException: Error retrieving Avro unknown schema for id 42
Caused by: io.confluent.kafka.schemaregistry.client.rest.exceptions.RestClientException: Unrecognized token 'upstream': was expecting (JSON String, Number, Array, Object or token 'null', 'true' or 'false')
at [Source: (sun.net.www.protocol.http.HttpURLConnection$HttpInputStream); line: 1, column: 10]; error code: 50005
at io.confluent.kafka.schemaregistry.client.rest.RestService.sendHttpRequest(RestService.java:293)
at io.confluent.kafka.schemaregistry.client.rest.RestService.httpRequest(RestService.java:352)
at io.confluent.kafka.schemaregistry.client.rest.RestService.getId(RestService.java:660)
at io.confluent.kafka.schemaregistry.client.rest.RestService.getId(RestService.java:642)
It is unclear to me what the root cause of this is.
Is it because the consumer cannot access the schema registry URL (something like an authentication error), or because the Kafka message is bad (not compatible with the schema in the registry)?
What does "Unrecognized token 'upstream'" mean?

How to ignore error result in Kafka Connect Elasticsearch

I am trying to run Kafka Connect for Elasticsearch.
By mistake, I wrote a wrong record to the Kafka topic.
I have now fixed that issue and am inserting correct values, but the Elasticsearch connector is still throwing an error on the previous record in the topic.
Here is the error:
Caused by: org.apache.kafka.connect.errors.DataException: Converting byte[] to Kafka Connect data failed due to serialization error
Caused by: org.apache.kafka.common.errors.SerializationException: com.fasterxml.jackson.core.JsonParseException: Unrecognized token 'lambdaDemo0': was expecting ('true', 'false' or 'null')
at [Source: (byte[])"lambdaDemo0-9749-0e710000fd04"; line: 1, column: 13]
Is there any way I can ignore the older record in the topic and tell Kafka Connect to pick up the latest record?
I tried to delete the topic. It is marked for deletion, but the records are still present in the topic.
I tried the two properties below, but they don't seem to be working:
drop.invalid.message=true
behavior.on.malformed.documents=ignore
Please suggest how I can clean up the wrong record in the topic.
You can tell Kafka Connect to just skip bad records:
errors.tolerance = all
Optionally, you can route these messages to another topic (known as a dead letter queue) for inspection by adding:
errors.tolerance = all
errors.deadletterqueue.topic.name = my-dlq-topic
These settings are valid for Kafka Connect with any connector that is failing in the serialisation/deserialisation stage of processing. For more information see this article.
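The behaviour these settings ask for can be sketched in a few lines of Python. This is only a model of the idea (skip records that fail deserialisation and set them aside), not Kafka Connect's implementation; the function process, its tolerance argument, and the sample records are assumptions made for the example.

```python
import json


def process(records, tolerance="none"):
    """Deserialise records; with tolerance 'all', set bad ones aside instead of failing."""
    delivered, dead_letters = [], []
    for raw in records:
        try:
            delivered.append(json.loads(raw))
        except ValueError:
            if tolerance != "all":
                # Default behaviour: the first bad record stops the task.
                raise
            # Stand-in for routing the raw bytes to a dead letter queue topic.
            dead_letters.append(raw)
    return delivered, dead_letters


# Hypothetical topic contents: one bad record followed by a good one.
records = [b'lambdaDemo0-9749-0e710000fd04', b'{"id": 1}']
delivered, dead_letters = process(records, tolerance="all")
print("delivered:", delivered)
print("dead letters:", dead_letters)
```

With the default tolerance, the same input raises on the first record and nothing after it is processed, which matches the situation described in the question.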

MongoDB Kafka Connector not generating the message key with the Mongo document id

I'm using the beta release of the MongoDB Kafka Connector to publish from MongoDB to a Kafka topic.
Messages are produced to Kafka, but their key is null when it should be the document id.
This is my Connect standalone config:
bootstrap.servers=xxx:9092
# The converters specify the format of data in Kafka and how to translate it into Connect data. Every Connect user will
# need to configure these based on the format they want their data in when loaded from or stored into Kafka
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
# Converter-specific settings can be passed in by prefixing the Converter's setting with the converter you want to apply
# it to
key.converter.schemas.enable=false
value.converter.schemas.enable=false
# The internal converter used for offsets and config data is configurable and must be specified, but most users will
# always want to use the built-in default. Offset and config data is never visible outside of Kafka Connect in this format.
internal.key.converter=org.apache.kafka.connect.json.JsonConverter
internal.value.converter=org.apache.kafka.connect.json.JsonConverter
internal.key.converter.schemas.enable=false
internal.value.converter.schemas.enable=false
And the mongodb source properties:
name=mongo-source
connector.class=com.mongodb.kafka.connect.MongoSourceConnector
tasks.max=1
# Connection and source configuration
connection.uri=mongodb+srv://xxx
database=mydb
collection=mycollection
topic.prefix=someprefix
poll.max.batch.size=1000
poll.await.time.ms=5000
# Change stream options
pipeline=[]
batch.size=0
change.stream.full.document=updateLookup
collation=
Below is an example of a message's String value:
"{\"_id\": {\"_data\": \"xxx\"}, \"operationType\": \"replace\", \"clusterTime\": {\"$timestamp\": {\"t\": 1564140389, \"i\": 1}}, \"fullDocument\": {\"_id\": \"5\", \"name\": \"Some Client\", \"clientId\": \"someclient\", \"clientSecret\": \"1234\", \"whiteListedIps\": [], \"enabled\": true, \"_class\": \"myproject.Client\"}, \"ns\": {\"db\": \"mydb\", \"coll\": \"mycollection\"}, \"documentKey\": {\"_id\": \"5\"}}"
I tried using a transform to extract it from the value, specifically from the documentKey field:
transforms=InsertKey
transforms.InsertKey.type=org.apache.kafka.connect.transforms.ValueToKey
transforms.InsertKey.fields=documentKey
But got an exception:
Caused by: org.apache.kafka.connect.errors.DataException: Only Struct objects supported for [copying fields from value to key], found: java.lang.String
at org.apache.kafka.connect.transforms.util.Requirements.requireStruct(Requirements.java:52)
at org.apache.kafka.connect.transforms.ValueToKey.applyWithSchema(ValueToKey.java:79)
at org.apache.kafka.connect.transforms.ValueToKey.apply(ValueToKey.java:65)
Any ideas on how to generate a key with the document id?
According to the exception that is thrown:
Caused by: org.apache.kafka.connect.errors.DataException: Only Struct objects supported for [copying fields from value to key], found: java.lang.String
at org.apache.kafka.connect.transforms.util.Requirements.requireStruct(Requirements.java:52)
at org.apache.kafka.connect.transforms.ValueToKey.applyWithSchema(ValueToKey.java:79)
at org.apache.kafka.connect.transforms.ValueToKey.apply(ValueToKey.java:65)
Unfortunately, the MongoDB connector that you use doesn't create the schema properly.
The connector creates the record with both the key and value schemas set to String.
Check this line to see how the record is created by the connector. That is the reason why you can't apply the transformation to it.
This should be supported in release 1.3.0:
https://jira.mongodb.org/browse/KAFKA-40
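To make the failure concrete, here is a rough Python model of what happens. The function value_to_key is a hypothetical stand-in for the transform, not its real code, and raw_value is an abbreviated version of the message value from the question. Because the value arrives as one opaque string, there are no fields to copy; only after the string is parsed into a structure can documentKey be picked out.

```python
import json


def value_to_key(value, fields):
    """Hypothetical stand-in for ValueToKey: copy the named fields from a structured value."""
    if not isinstance(value, dict):
        raise TypeError(
            f"Only structured values supported, found: {type(value).__name__}"
        )
    return {field: value[field] for field in fields}


# Abbreviated form of the message value shown in the question (a JSON string).
raw_value = '{"_id": {"_data": "xxx"}, "operationType": "replace", "documentKey": {"_id": "5"}}'

try:
    value_to_key(raw_value, ["documentKey"])
    failure = None
except TypeError as exc:
    failure = str(exc)  # the string value has no fields to copy

# Once the string is parsed into a structure, the field can be extracted.
key = value_to_key(json.loads(raw_value), ["documentKey"])
print(failure)
print(key)
```

A consumer-side workaround along these lines (parse the value, then read documentKey) is one option until the connector emits structured records.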