org.apache.kafka.connect.errors.DataException: Invalid JSON for record default value: null - apache-kafka

I have a Kafka Avro Topic generated using KafkaAvroSerializer.
My standalone properties are as below.
I am using Confluent 4.0.0 to run Kafka connect.
key.converter=io.confluent.connect.avro.AvroConverter
value.converter=io.confluent.connect.avro.AvroConverter
key.converter.schema.registry.url=<schema_registry_hostname>:8081
value.converter.schema.registry.url=<schema_registry_hostname>:8081
key.converter.schemas.enable=true
value.converter.schemas.enable=true
internal.key.converter=org.apache.kafka.connect.json.JsonConverter
internal.value.converter=org.apache.kafka.connect.json.JsonConverter
internal.key.converter.schemas.enable=false
internal.value.converter.schemas.enable=false
When I run Kafka connectors for hdfs sink in standalone mode, I get this error message:
[2018-06-27 17:47:41,746] ERROR WorkerSinkTask{id=camus-email-service-0} Task threw an uncaught and unrecoverable exception (org.apache.kafka.connect.runtime.WorkerTask)
org.apache.kafka.connect.errors.DataException: Invalid JSON for record default value: null
at io.confluent.connect.avro.AvroData.defaultValueFromAvro(AvroData.java:1640)
at io.confluent.connect.avro.AvroData.toConnectSchema(AvroData.java:1527)
at io.confluent.connect.avro.AvroData.toConnectSchema(AvroData.java:1410)
at io.confluent.connect.avro.AvroData.toConnectSchema(AvroData.java:1290)
at io.confluent.connect.avro.AvroData.toConnectData(AvroData.java:1014)
at io.confluent.connect.avro.AvroConverter.toConnectData(AvroConverter.java:88)
at org.apache.kafka.connect.runtime.WorkerSinkTask.convertMessages(WorkerSinkTask.java:454)
at org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:287)
at org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:198)
at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:166)
at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:170)
at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:214)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
[2018-06-27 17:47:41,748] ERROR WorkerSinkTask{id=camus-email-service-0} Task is being killed and will not recover until manually restarted ( org.apache.kafka.connect.runtime.WorkerTask)
[2018-06-27 17:52:19,554] INFO Kafka Connect stopping (org.apache.kafka.connect.runtime.Connect).
When I use kafka-avro-console-consumer passing the schema registry, I get the Kafka messages deserialized.
i.e.:
/usr/bin/kafka-avro-console-consumer --bootstrap-server <kafka-host>:9092 --topic <KafkaTopicName> --property schema.registry.url=<schema_registry_hostname>:8081

Changing the "subscription" column's datatype to Union datatype fixed the issue. Avroconverters were able to deserialize the messages.

I think your Kafka key is null, which is not Avro.
Or it is some other type but malformed, and not converted to a RECORD datatype. See AvroData source code
case RECORD: {
if (!jsonValue.isObject()) {
throw new DataException("Invalid JSON for record default value: " + jsonValue.toString());
}
UPDATE According to your comment, then you can see this is true
$ curl -X GET localhost:8081/subjects/<kafka-topic>-key/versions/latest
{"subject":"<kafka-topic>-key","version":2,"id":625,"schema":"\"bytes\""}
In any case, HDFS Connect does not natively store the key, so try not deserializing the key at all rather than using Avro.
key.converter=org.apache.kafka.connect.converters.ByteArrayConverter
Also, your console consumer is not printing the key, so your test isn't adequate. You need to add --property print.key=true

Related

MongoDB Kafka Connect can't send large kafka messages

I am trying to send json large data (more than 1 Mo ) from MongoDB with kafka connector, it's worked well for small data , but I got the following error when working with big json data :
[2022-09-27 11:13:48,290] ERROR [source_mongodb_connector|task-0] WorkerSourceTask{id=source_mongodb_connector-0} Task threw an uncaught and unrecoverable exception. Task is being killed and will not recover until manually restarted (org.apache.kafka.connect.runtime.WorkerTask:195)
org.apache.kafka.connect.errors.ConnectException: Unrecoverable exception from producer send callback
at org.apache.kafka.connect.runtime.WorkerSourceTask.maybeThrowProducerSendException(WorkerSourceTask.java:290)
at org.apache.kafka.connect.runtime.WorkerSourceTask.sendRecords(WorkerSourceTask.java:351)
at org.apache.kafka.connect.runtime.WorkerSourceTask.execute(WorkerSourceTask.java:257)
at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:188)
at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:243)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
Caused by: org.apache.kafka.common.errors.RecordTooLargeException: The message is 2046979 bytes when serialized which is larger than 1048576, which is the value of the max.request.size configuration.
I tried to configure Topic , here is the description
*hadoop#vps-data1 ~/kafka $ bin/kafka-configs.sh --bootstrap-server 192.168.13.80:9092,192.168.13.81:9092,192.168.13.82:9092 --entity-type topics --entity-name prefix.large.topicData --describe
Dynamic configs for topic prefix.large.topicData are:
max.message.bytes=1280000 sensitive=false synonyms={DYNAMIC_TOPIC_CONFIG:max.message.bytes=1280000, STATIC_BROKER_CONFIG:message.max.bytes=419430400, DEFAULT_CONFIG:message.max.bytes=1048588}
Indeed I configured producer, consumer and server properties file but the same problem still stacking ....
Any help will be appreciated
The solution is to configure kafka and kafka connect properties

Error while consuming AVRO Kafka Topic from KSQL Stream

I created some dummydata as a Stream in KSQLDB with
VALUE_FORMAT='JSON' TOPIC='MYTOPIC'
The Setup is over Docker-compose. I am running a Kafka Broker, Schema-registry, ksqldbcli, ksqldb-server, zookeeper
Now I want to consume these records from the topic.
My first and last approach was over the commandline with following command
docker run --net=host --rm confluentinc/cp-schema-registry:5.0.0 kafka-avro-console-consumer
--bootstrap-server localhost:29092 --topic DXT --from-beginning --max-messages 10
--property print.key=true --property print.value=true
--value-deserializer io.confluent.kafka.serializers.KafkaAvroDeserializer
--key-deserializer org.apache.kafka.common.serialization.StringDeserializer
But that just returns the error
[2021-04-22 21:45:42,926] ERROR Unknown error when running consumer: (kafka.tools.ConsoleConsumer$:76)
org.apache.kafka.common.errors.SerializationException: Error deserializing Avro message for id -1
Caused by: org.apache.kafka.common.errors.SerializationException: Unknown magic byte!
I also tried it with different use cases in Java Spring but with no prevail. I just cannot consume the created topics.
If I would need to define my own schema, where should I do that and what would be the easiest way because I just created a stream in Ksqldb?
Is there an easy to follow example. I did not specifiy anything else when I created the stream like in the quickstart example on Ksqldb.io. (I added the schema-registry in my deployment)
As I am a noob that is sitting here for almost 10 hours any help would be appreciated.
Edit: I found that pure JSON does not need the Schema-registry with ksqldb. Here.
But how to deserialize it?
If you've written JSON data to the topic then you can read it with the kafka-console-consumer.
The error you're getting (Error deserializing Avro message for id -1…Unknown magic byte!) is because you're using the kafka-avro-console-consumer which attempts to deserialise the topic data as Avro - which it isn't, hence the error.
You can also use PRINT DXT; from within ksqlDB.

Issues reading AVRO encoded messages (created by KSQL stream) with Kafka Connect

there's something weird happening when we are creating AVRO messages through KSQL and try to consume them by using Kafka Connect. A bit of context:
Source data
A 3rd party provider is producing data on one of our Kafka clusters as JSON (so far, so good). We actually see the data coming in.
Data Transformation
As our internal systems require data to be encoded in AVRO, we created a KSQL cluster that transforms the incoming data into AVRO by creating the following stream in KSQL:
{
"ksql": "
CREATE STREAM src_stream (browser_name VARCHAR)
WITH (KAFKA_TOPIC='json_topic', VALUE_FORMAT='JSON');
CREATE STREAM sink_stream WITH (KAFKA_TOPIC='avro_topic',VALUE_FORMAT='AVRO', PARTITIONS=1, REPLICAS=3) AS
SELECT * FROM src_stream;
",
"streamsProperties": {
"ksql.streams.auto.offset.reset": "earliest"
}
}
(so far, so good)
We see the data being produced from the JSON topic onto the AVRO topic, as the offset increases.
We then create a Kafka connector in a (new) Kafka Connect cluster. As some context, we are using multiple Kafka Connect clusters (with the same properties for those clusters), and as such we have a Kafka Connect cluster running for this data, but an exact copy of the cluster for other AVRO data (1 is for analytics, 1 for our business data).
The sink for this connector is BigQuery, we're using the Wepay BigQuery Sink Connector 1.2.0. Again, so far, so good. Our business cluster is running fine with this connector and the AVRO topics on the business cluster are streaming into BigQuery.
When we try to consume the AVRO topic created by our KSQL statement earlier however, we see an exception being thrown :/
The exception is the following:
org.apache.kafka.connect.errors.ConnectException: Tolerance exceeded in error handler
at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execAndHandleError(RetryWithToleranceOperator.java:178)
at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execute(RetryWithToleranceOperator.java:104)
at org.apache.kafka.connect.runtime.WorkerSinkTask.convertAndTransformRecord(WorkerSinkTask.java:510)
at org.apache.kafka.connect.runtime.WorkerSinkTask.convertMessages(WorkerSinkTask.java:490)
at org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:321)
at org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:225)
at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:193)
at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:175)
at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:219)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.kafka.connect.errors.DataException: dpt_video_event-created_v2
at io.confluent.connect.avro.AvroConverter.toConnectData(AvroConverter.java:98)
at org.apache.kafka.connect.runtime.WorkerSinkTask.lambda$convertAndTransformRecord$0(WorkerSinkTask.java:510)
at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execAndRetry(RetryWithToleranceOperator.java:128)
at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execAndHandleError(RetryWithToleranceOperator.java:162)
... 13 more
Caused by: org.apache.kafka.common.errors.SerializationException: Error retrieving Avro schema for id 0
Caused by: io.confluent.kafka.schemaregistry.client.rest.exceptions.RestClientException: Schema not found; error code: 40403
at io.confluent.kafka.schemaregistry.client.rest.RestService.sendHttpRequest(RestService.java:209)
at io.confluent.kafka.schemaregistry.client.rest.RestService.httpRequest(RestService.java:235)
at io.confluent.kafka.schemaregistry.client.rest.RestService.getId(RestService.java:415)
at io.confluent.kafka.schemaregistry.client.rest.RestService.getId(RestService.java:408)
at io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient.getSchemaByIdFromRegistry(CachedSchemaRegistryClient.java:123)
at io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient.getBySubjectAndId(CachedSchemaRegistryClient.java:190)
at io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient.getById(CachedSchemaRegistryClient.java:169)
at io.confluent.kafka.serializers.AbstractKafkaAvroDeserializer.deserialize(AbstractKafkaAvroDeserializer.java:121)
at io.confluent.kafka.serializers.AbstractKafkaAvroDeserializer.deserializeWithSchemaAndVersion(AbstractKafkaAvroDeserializer.java:243)
at io.confluent.connect.avro.AvroConverter$Deserializer.deserialize(AvroConverter.java:134)
at io.confluent.connect.avro.AvroConverter.toConnectData(AvroConverter.java:85)
at org.apache.kafka.connect.runtime.WorkerSinkTask.lambda$convertAndTransformRecord$0(WorkerSinkTask.java:510)
at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execAndRetry(RetryWithToleranceOperator.java:128)
at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execAndHandleError(RetryWithToleranceOperator.java:162)
at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execute(RetryWithToleranceOperator.java:104)
at org.apache.kafka.connect.runtime.WorkerSinkTask.convertAndTransformRecord(WorkerSinkTask.java:510)
at org.apache.kafka.connect.runtime.WorkerSinkTask.convertMessages(WorkerSinkTask.java:490)
at org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:321)
at org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:225)
at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:193)
at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:175)
at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:219)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Which, to us, indicates that Kafka Connect is reading the message, decodes the AVRO and tries to fetch the schema with ID 0 from the schema registry. Obviously, schema IDs in the schema registry are always > 0.
We're currently stuck in trying to identify the issue here. It looks like KSQL is encoding the message with schema ID 0, but we're unable to find the cause for that :/
Any help is appreciated!
BR,
Patrick
UPDATE:
We have implemented a basic consumer for the AVRO messages and that consumer is correctly identifying the schema in the AVRO messages (ID: 3), so it seems to be rekated to Kafka Connect, instead of the actual KSQL / AVRO messages.
Obviously, schema IDs in the schema registry are always > 0... It looks like KSQL is encoding the message with schema ID 0, but we're unable to find the cause for that
The AvroConverter does a "dumb check" that only looks that the consumed bytes start with a magic byte of 0x0. The next 4 bytes are the ID.
If you are using key.converter=AvroConverter and your keys start like 0x00000 in hex, then the ID would be shown as 0 in the logs, and the lookup would fail.
Last I checked, KSQL doesn't output keys in Avro format, so you will want to check the properties of your connector.

Configure Apache Kafka sink jdbc connector

I want to send the data sent to the topic to a postgresql-database. So I follow this guide and have configured the properties-file like this:
name=transaction-sink
connector.class=io.confluent.connect.jdbc.JdbcSinkConnector
tasks.max=1
topics=transactions
connection.url=jdbc:postgresql://localhost:5432/db
connection.user=db-user
connection.password=
auto.create=true
insert.mode=insert
table.name.format=transaction
pk.mode=none
I start the connector with
./bin/connect-standalone etc/schema-registry/connect-avro-standalone.properties etc/kafka-connect-jdbc/sink-quickstart-postgresql.properties
The sink-connector is created but does not start due to this error:
Caused by: org.apache.kafka.common.errors.SerializationException: Error deserializing Avro message for id -1
Caused by: org.apache.kafka.common.errors.SerializationException: Unknown magic byte!
The schema is in avro-format and registered and I can send (produce) messages to the topic and read (consume) from it. But I can't seem to sent it to the database.
This is my ./etc/schema-registry/connect-avro-standalone.properties
key.converter=io.confluent.connect.avro.AvroConverter
key.converter.schema.registry.url=http://localhost:8081
value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=http://localhost:8081
This is a producer feeding the topic using the java-api:
properties.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
properties.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, KafkaAvroSerializer.class);
properties.put(AbstractKafkaAvroSerDeConfig.SCHEMA_REGISTRY_URL_CONFIG, "http://localhost:8081");
try (KafkaProducer<String, Transaction> producer = new KafkaProducer<>(properties)) {
Transaction transaction = new Transaction();
transaction.setFoo("foo");
transaction.setBar("bar");
UUID uuid = UUID.randomUUID();
final ProducerRecord<String, Transaction> record = new ProducerRecord<>(TOPIC, uuid.toString(), transaction);
producer.send(record);
}
I'm verifying data is properly serialized and deserialized using
./bin/kafka-avro-console-consumer --bootstrap-server localhost:9092 \
--property schema.registry.url=http://localhost:8081 \
--topic transactions \
--from-beginning --max-messages 1
The database is up and running.
This is not correct:
The unknown magic byte can be due to a id-field not part of the schema
What that error means that the message on the topic was not serialised using the Schema Registry Avro serialiser.
How are you putting data on the topic?
Maybe all the messages have the problem, maybe only some—but by default this will halt the Kafka Connect task.
You can set
"errors.tolerance":"all",
to get it to ignore messages that it can't deserialise. But if all of them are not correctly Avro serialised this won't help and you need to serialise them correctly, or choose a different Converter (e.g. if they're actually JSON, use the JSONConverter).
These references should help you more:
https://www.confluent.io/blog/kafka-connect-deep-dive-converters-serialization-explained
https://www.confluent.io/blog/kafka-connect-deep-dive-error-handling-dead-letter-queues
http://rmoff.dev/ksldn19-kafka-connect
Edit :
If you are serialising the key with StringSerializer then you need to use this in your Connect config:
key.converter=org.apache.kafka.connect.storage.StringConverter
You can set it at the worker (global property, applies to all connectors that you run on it), or just for this connector (i.e. put it in the connector properties itself, it will override the worker settings)

Error retrieving Avro schema for id 1, Subject not found.; error code: 40401

Caused by: org.apache.kafka.common.errors.SerializationException: Error retrieving Avro schema for id 1
Caused by: io.confluent.kafka.schemaregistry.client.rest.exceptions.RestClientException: Subject not found.; error code: 40401
Confluent Version 4.1.0
I am consuming data from a couple of topics(topic_1, topic_2) using KTable, joining the data and then pushing the data onto another topic(topic_out) using KStream. (Ktable.toStream())
The data is in avro format
When I check the schema by using
curl -X GET http://localhost:8081/subjects/
I find
topic_1-value
topic_1-key
topic_2-value
topic_2-key
topic_out-value
but there is no subject with topic_out-key. Why is it not created?
output from topic_out:
kafka-avro-console-consumer --bootstrap-server localhost:9092 --from-beginning --property print.key=true --topic topic_out
"code1 " {"code":{"string":"code1 "},"personid":{"string":"=NA="},"agentoffice":{"string":"lic1 "},"status":{"string":"a"},"sourcesystem":{"string":"ILS"},"lastupdate":{"long":1527240990138}}
I can see the key being generated, but no subject for key.
Why is subject with key required?
I am feeding this topic to another connector (hdfs-sink) to push the data to hdfs but it fails with below error
Caused by: org.apache.kafka.common.errors.SerializationException: Error retrieving Avro schema for id 5\nCaused by: io.confluent.kafka.schemaregistry.client.rest.exceptions.RestClientException: Subject not found.; error code: 40401
when I look at the schema-registry.logs, I can see:
[2018-05-24 15:40:06,230] INFO 127.0.0.1 - -
[24/May/2018:15:40:06 +0530] "POST /subjects/topic_out-key?deleted=true HTTP/1.1" 404 51 9 (io.confluent.rest-utils.requests:77)
any idea why the subject topic_out-key not being created?
any idea why the subject topic_out-key not being created
Because the Key of your Kafka Streams output is a String, not an Avro encoded string.
You can verify that using kafka-console-consumer instead and adding --property print.value=false and not seeing any special characters compared to the same command when you do print the value (this is showing the data is binary Avro)
From Kafka Connect, you must use Kafka's StringConverter class for key.converter property rather than the Confluent Avro one