MongoDB Kafka Connect can't send large kafka messages - apache-kafka

I am trying to send json large data (more than 1 Mo ) from MongoDB with kafka connector, it's worked well for small data , but I got the following error when working with big json data :
[2022-09-27 11:13:48,290] ERROR [source_mongodb_connector|task-0] WorkerSourceTask{id=source_mongodb_connector-0} Task threw an uncaught and unrecoverable exception. Task is being killed and will not recover until manually restarted (org.apache.kafka.connect.runtime.WorkerTask:195)
org.apache.kafka.connect.errors.ConnectException: Unrecoverable exception from producer send callback
at org.apache.kafka.connect.runtime.WorkerSourceTask.maybeThrowProducerSendException(WorkerSourceTask.java:290)
at org.apache.kafka.connect.runtime.WorkerSourceTask.sendRecords(WorkerSourceTask.java:351)
at org.apache.kafka.connect.runtime.WorkerSourceTask.execute(WorkerSourceTask.java:257)
at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:188)
at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:243)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
Caused by: org.apache.kafka.common.errors.RecordTooLargeException: The message is 2046979 bytes when serialized which is larger than 1048576, which is the value of the max.request.size configuration.
I tried to configure Topic , here is the description
*hadoop#vps-data1 ~/kafka $ bin/kafka-configs.sh --bootstrap-server 192.168.13.80:9092,192.168.13.81:9092,192.168.13.82:9092 --entity-type topics --entity-name prefix.large.topicData --describe
Dynamic configs for topic prefix.large.topicData are:
max.message.bytes=1280000 sensitive=false synonyms={DYNAMIC_TOPIC_CONFIG:max.message.bytes=1280000, STATIC_BROKER_CONFIG:message.max.bytes=419430400, DEFAULT_CONFIG:message.max.bytes=1048588}
Indeed I configured producer, consumer and server properties file but the same problem still stacking ....
Any help will be appreciated

The solution is to configure kafka and kafka connect properties

Related

Unable to run connect-distributed.sh

I get the below error when I am executing the command.. My kafka messages are not flowing down into hdfs
/bin/connect-distributed.sh /etc/schema-registry/connect-avro-distributed.properties /kafka-connect/quickstart-hdfs.properties
ERROR Failed to start task local-file-source-0 (org.apache.kafka.connect.runtime.Worker:456)
org.apache.kafka.common.config.ConfigException: Invalid value io.confluent.kafka.serializers.subject.TopicNameStrategy for configuration key.subject.name.strategy: Class io.confluent.kafka.serializers.subject.TopicNameStrategy could not be found.
at org.apache.kafka.common.config.ConfigDef.parseType(ConfigDef.java:718)
at org.apache.kafka.common.config.ConfigDef$ConfigKey.<init>(ConfigDef.java:1063)
at org.apache.kafka.common.config.ConfigDef.define(ConfigDef.java:148)
at org.apache.kafka.common.config.ConfigDef.define(ConfigDef.java:168)
at org.apache.kafka.common.config.ConfigDef.define(ConfigDef.java:207)
at org.apache.kafka.common.config.ConfigDef.define(ConfigDef.java:369)
at org.apache.kafka.common.config.ConfigDef.define(ConfigDef.java:382)
at io.confluent.kafka.serializers.AbstractKafkaSchemaSerDeConfig.baseConfigDef(AbstractKafkaSchemaSerDeConfig.java:153)
at io.confluent.connect.avro.AvroConverterConfig.<init>(AvroConverterConfig.java:27)
at io.confluent.connect.avro.AvroConverter.configure(AvroConverter.java:63)
at org.apache.kafka.connect.runtime.isolation.Plugins.newConverter(Plugins.java:263)
at org.apache.kafka.connect.runtime.Worker.startTask(Worker.java:434)
at org.apache.kafka.connect.runtime.distributed.DistributedHerder.startTask(DistributedHerder.java:865)
at org.apache.kafka.connect.runtime.distributed.DistributedHerder.access$1600(DistributedHerder.java:110)
at org.apache.kafka.connect.runtime.distributed.DistributedHerder$13.call(DistributedHerder.java:880)
at org.apache.kafka.connect.runtime.distributed.DistributedHerder$13.call(DistributedHerder.java:876)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
connect-distributedonly takes one property file, not two. The quickstart files are meant to be used with connect-standalone
The actual error you're getting about the topic name strategy is likely an issue/bug with the version of the Avro serializers you're using

Is there a way to using kafka schema registry without magic byte?

I'm trying to make my applications work using the schema registry from confluent but at this point I'm not in total control of the producers, you can even see them as legacy applications that simply are not bound to the confluent products.
I was looking at the confluent information and it seems all the messages should include in the payload a Magic Byte and Schema ID
https://docs.confluent.io/3.2.0/schema-registry/docs/serializer-formatter.html
or else when I try to consume it I get an error:
[2020-09-25 13:12:09,008] ERROR WorkerSinkTask{id=s3_parquet_connector-0} Task threw an uncaught and unrecoverable exception (org.apache.kafka.connect.runtime.WorkerTask)
org.apache.kafka.connect.errors.ConnectException: Tolerance exceeded in error handler
at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execAndHandleError(RetryWithToleranceOperator.java:178)
at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execute(RetryWithToleranceOperator.java:104)
at org.apache.kafka.connect.runtime.WorkerSinkTask.convertAndTransformRecord(WorkerSinkTask.java:491)
at org.apache.kafka.connect.runtime.WorkerSinkTask.convertMessages(WorkerSinkTask.java:468)
at org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:324)
at org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:228)
at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:200)
at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:184)
at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:234)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.kafka.connect.errors.DataException: Failed to deserialize data for topic com.obj_pos to Protobuf:
at io.confluent.connect.protobuf.ProtobufConverter.toConnectData(ProtobufConverter.java:123)
at org.apache.kafka.connect.storage.Converter.toConnectData(Converter.java:87)
at org.apache.kafka.connect.runtime.WorkerSinkTask.lambda$convertAndTransformRecord$1(WorkerSinkTask.java:491)
at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execAndRetry(RetryWithToleranceOperator.java:128)
at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execAndHandleError(RetryWithToleranceOperator.java:162)
... 13 more
Caused by: org.apache.kafka.common.errors.SerializationException: Error deserializing Protobuf message for id -1
Caused by: org.apache.kafka.common.errors.SerializationException: Unknown magic byte!
[2020-09-25 13:12:09,010] ERROR WorkerSinkTask{id=s3_parquet_connector-0} Task is being killed and will not recover until manually restarted (org.apache.kafka.connect.runtime.WorkerTask)
my question is, if there is a way of somehow either disable this magic byte check or if I could create a kafka stream that would just append a this 5 bytes to the initial message so that afterwards I could consume it with a consumer that would connect to the schema registry.
What is happening is that the producer is out of my control so I would need somehow to be able to deserialize messages that do not contain those 5 bytes because they are produced by producers that don't rely on the confluent serializers/de-serializers
they are produced by producers that don't rely on the confluent serializers
Then the problem isn't the Registry.
You shouldn't be using the Converters written by Confluent to consume the messages, as those are bound to the Registry, and there is no way to skip it.
You would instead use the BlueApron ones (assuming the data is really protobuf), or write your own Converter classes.

Issues reading AVRO encoded messages (created by KSQL stream) with Kafka Connect

there's something weird happening when we are creating AVRO messages through KSQL and try to consume them by using Kafka Connect. A bit of context:
Source data
A 3rd party provider is producing data on one of our Kafka clusters as JSON (so far, so good). We actually see the data coming in.
Data Transformation
As our internal systems require data to be encoded in AVRO, we created a KSQL cluster that transforms the incoming data into AVRO by creating the following stream in KSQL:
{
"ksql": "
CREATE STREAM src_stream (browser_name VARCHAR)
WITH (KAFKA_TOPIC='json_topic', VALUE_FORMAT='JSON');
CREATE STREAM sink_stream WITH (KAFKA_TOPIC='avro_topic',VALUE_FORMAT='AVRO', PARTITIONS=1, REPLICAS=3) AS
SELECT * FROM src_stream;
",
"streamsProperties": {
"ksql.streams.auto.offset.reset": "earliest"
}
}
(so far, so good)
We see the data being produced from the JSON topic onto the AVRO topic, as the offset increases.
We then create a Kafka connector in a (new) Kafka Connect cluster. As some context, we are using multiple Kafka Connect clusters (with the same properties for those clusters), and as such we have a Kafka Connect cluster running for this data, but an exact copy of the cluster for other AVRO data (1 is for analytics, 1 for our business data).
The sink for this connector is BigQuery, we're using the Wepay BigQuery Sink Connector 1.2.0. Again, so far, so good. Our business cluster is running fine with this connector and the AVRO topics on the business cluster are streaming into BigQuery.
When we try to consume the AVRO topic created by our KSQL statement earlier however, we see an exception being thrown :/
The exception is the following:
org.apache.kafka.connect.errors.ConnectException: Tolerance exceeded in error handler
at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execAndHandleError(RetryWithToleranceOperator.java:178)
at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execute(RetryWithToleranceOperator.java:104)
at org.apache.kafka.connect.runtime.WorkerSinkTask.convertAndTransformRecord(WorkerSinkTask.java:510)
at org.apache.kafka.connect.runtime.WorkerSinkTask.convertMessages(WorkerSinkTask.java:490)
at org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:321)
at org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:225)
at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:193)
at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:175)
at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:219)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.kafka.connect.errors.DataException: dpt_video_event-created_v2
at io.confluent.connect.avro.AvroConverter.toConnectData(AvroConverter.java:98)
at org.apache.kafka.connect.runtime.WorkerSinkTask.lambda$convertAndTransformRecord$0(WorkerSinkTask.java:510)
at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execAndRetry(RetryWithToleranceOperator.java:128)
at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execAndHandleError(RetryWithToleranceOperator.java:162)
... 13 more
Caused by: org.apache.kafka.common.errors.SerializationException: Error retrieving Avro schema for id 0
Caused by: io.confluent.kafka.schemaregistry.client.rest.exceptions.RestClientException: Schema not found; error code: 40403
at io.confluent.kafka.schemaregistry.client.rest.RestService.sendHttpRequest(RestService.java:209)
at io.confluent.kafka.schemaregistry.client.rest.RestService.httpRequest(RestService.java:235)
at io.confluent.kafka.schemaregistry.client.rest.RestService.getId(RestService.java:415)
at io.confluent.kafka.schemaregistry.client.rest.RestService.getId(RestService.java:408)
at io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient.getSchemaByIdFromRegistry(CachedSchemaRegistryClient.java:123)
at io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient.getBySubjectAndId(CachedSchemaRegistryClient.java:190)
at io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient.getById(CachedSchemaRegistryClient.java:169)
at io.confluent.kafka.serializers.AbstractKafkaAvroDeserializer.deserialize(AbstractKafkaAvroDeserializer.java:121)
at io.confluent.kafka.serializers.AbstractKafkaAvroDeserializer.deserializeWithSchemaAndVersion(AbstractKafkaAvroDeserializer.java:243)
at io.confluent.connect.avro.AvroConverter$Deserializer.deserialize(AvroConverter.java:134)
at io.confluent.connect.avro.AvroConverter.toConnectData(AvroConverter.java:85)
at org.apache.kafka.connect.runtime.WorkerSinkTask.lambda$convertAndTransformRecord$0(WorkerSinkTask.java:510)
at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execAndRetry(RetryWithToleranceOperator.java:128)
at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execAndHandleError(RetryWithToleranceOperator.java:162)
at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execute(RetryWithToleranceOperator.java:104)
at org.apache.kafka.connect.runtime.WorkerSinkTask.convertAndTransformRecord(WorkerSinkTask.java:510)
at org.apache.kafka.connect.runtime.WorkerSinkTask.convertMessages(WorkerSinkTask.java:490)
at org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:321)
at org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:225)
at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:193)
at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:175)
at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:219)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Which, to us, indicates that Kafka Connect is reading the message, decodes the AVRO and tries to fetch the schema with ID 0 from the schema registry. Obviously, schema IDs in the schema registry are always > 0.
We're currently stuck in trying to identify the issue here. It looks like KSQL is encoding the message with schema ID 0, but we're unable to find the cause for that :/
Any help is appreciated!
BR,
Patrick
UPDATE:
We have implemented a basic consumer for the AVRO messages and that consumer is correctly identifying the schema in the AVRO messages (ID: 3), so it seems to be rekated to Kafka Connect, instead of the actual KSQL / AVRO messages.
Obviously, schema IDs in the schema registry are always > 0... It looks like KSQL is encoding the message with schema ID 0, but we're unable to find the cause for that
The AvroConverter does a "dumb check" that only looks that the consumed bytes start with a magic byte of 0x0. The next 4 bytes are the ID.
If you are using key.converter=AvroConverter and your keys start like 0x00000 in hex, then the ID would be shown as 0 in the logs, and the lookup would fail.
Last I checked, KSQL doesn't output keys in Avro format, so you will want to check the properties of your connector.

org.apache.kafka.connect.errors.DataException: Invalid JSON for record default value: null

I have a Kafka Avro Topic generated using KafkaAvroSerializer.
My standalone properties are as below.
I am using Confluent 4.0.0 to run Kafka connect.
key.converter=io.confluent.connect.avro.AvroConverter
value.converter=io.confluent.connect.avro.AvroConverter
key.converter.schema.registry.url=<schema_registry_hostname>:8081
value.converter.schema.registry.url=<schema_registry_hostname>:8081
key.converter.schemas.enable=true
value.converter.schemas.enable=true
internal.key.converter=org.apache.kafka.connect.json.JsonConverter
internal.value.converter=org.apache.kafka.connect.json.JsonConverter
internal.key.converter.schemas.enable=false
internal.value.converter.schemas.enable=false
When I run Kafka connectors for hdfs sink in standalone mode, I get this error message:
[2018-06-27 17:47:41,746] ERROR WorkerSinkTask{id=camus-email-service-0} Task threw an uncaught and unrecoverable exception (org.apache.kafka.connect.runtime.WorkerTask)
org.apache.kafka.connect.errors.DataException: Invalid JSON for record default value: null
at io.confluent.connect.avro.AvroData.defaultValueFromAvro(AvroData.java:1640)
at io.confluent.connect.avro.AvroData.toConnectSchema(AvroData.java:1527)
at io.confluent.connect.avro.AvroData.toConnectSchema(AvroData.java:1410)
at io.confluent.connect.avro.AvroData.toConnectSchema(AvroData.java:1290)
at io.confluent.connect.avro.AvroData.toConnectData(AvroData.java:1014)
at io.confluent.connect.avro.AvroConverter.toConnectData(AvroConverter.java:88)
at org.apache.kafka.connect.runtime.WorkerSinkTask.convertMessages(WorkerSinkTask.java:454)
at org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:287)
at org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:198)
at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:166)
at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:170)
at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:214)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
[2018-06-27 17:47:41,748] ERROR WorkerSinkTask{id=camus-email-service-0} Task is being killed and will not recover until manually restarted ( org.apache.kafka.connect.runtime.WorkerTask)
[2018-06-27 17:52:19,554] INFO Kafka Connect stopping (org.apache.kafka.connect.runtime.Connect).
When I use kafka-avro-console-consumer passing the schema registry, I get the Kafka messages deserialized.
i.e.:
/usr/bin/kafka-avro-console-consumer --bootstrap-server <kafka-host>:9092 --topic <KafkaTopicName> --property schema.registry.url=<schema_registry_hostname>:8081
Changing the "subscription" column's datatype to Union datatype fixed the issue. Avroconverters were able to deserialize the messages.
I think your Kafka key is null, which is not Avro.
Or it is some other type but malformed, and not converted to a RECORD datatype. See AvroData source code
case RECORD: {
if (!jsonValue.isObject()) {
throw new DataException("Invalid JSON for record default value: " + jsonValue.toString());
}
UPDATE According to your comment, then you can see this is true
$ curl -X GET localhost:8081/subjects/<kafka-topic>-key/versions/latest
{"subject":"<kafka-topic>-key","version":2,"id":625,"schema":"\"bytes\""}
In any case, HDFS Connect does not natively store the key, so try not deserializing the key at all rather than using Avro.
key.converter=org.apache.kafka.connect.converters.ByteArrayConverter
Also, your console consumer is not printing the key, so your test isn't adequate. You need to add --property print.key=true

Strange error on Kafka broker

In our production Kafka broker I found this strange error in server.log. Due to this message sending in one of the topic was impacted. Producer was getting error "Partition count is 0: should refresh metadata". Kafka version 0.10.0.1. Open JDK Java 1.8
Can anyone help me out as to what could this mean?
[2018-01-10 17:23:51,411] ERROR Processor got uncaught exception. (kafka.network.Processor)
java.lang.NoClassDefFoundError: Could not initialize class java.net.IDN
at javax.net.ssl.SNIHostName.<init>(SNIHostName.java:175)
at sun.security.ssl.ServerNameExtension.<init>(ServerNameExtension.java:137)
at sun.security.ssl.HelloExtensions.<init>(HelloExtensions.java:78)
at sun.security.ssl.HandshakeMessage$ClientHello.<init>(HandshakeMessage.java:250)
at sun.security.ssl.ServerHandshaker.processMessage(ServerHandshaker.java:217)
at sun.security.ssl.Handshaker.processLoop(Handshaker.java:979)
at sun.security.ssl.Handshaker$1.run(Handshaker.java:919)
at sun.security.ssl.Handshaker$1.run(Handshaker.java:916)
at java.security.AccessController.doPrivileged(Native Method)
at sun.security.ssl.Handshaker$DelegatedTask.run(Handshaker.java:1369)
at org.apache.kafka.common.network.SslTransportLayer.runDelegatedTasks(SslTransportLayer.java:336)
at org.apache.kafka.common.network.SslTransportLayer.handshakeUnwrap(SslTransportLayer.java:414)
at org.apache.kafka.common.network.SslTransportLayer.handshake(SslTransportLayer.java:270)
at org.apache.kafka.common.network.KafkaChannel.prepare(KafkaChannel.java:62)
at org.apache.kafka.common.network.Selector.pollSelectionKeys(Selector.java:338)
at org.apache.kafka.common.network.Selector.poll(Selector.java:291)
at kafka.network.Processor.poll(SocketServer.scala:476)
at kafka.network.Processor.run(SocketServer.scala:416)
at java.lang.Thread.run(Thread.java:745)