How to use Single Message Transforms with Kafka Connect JDBC Source Connector and multiple tables? - apache-kafka

I want to set the message key when importing tables with the Kafka Connect JDBC Source Connector.
How can Single Message Transforms (SMTs) in a Kafka Connect source connector be targeted at the right fields when multiple tables are read by the JDBC connector? An SMT needs a column name, which may differ from table to table.
I don't see a way to filter SMT definitions by table name or similar. The configuration sample below works fine as long as only one table is involved.
But what do you do if you have several tables, e.g. User, Order, Product?
"table.whitelist" : "User"
"transforms":"createKey,extract",
"transforms.createKey.type":"org.apache.kafka.connect.transforms.ValueToKey",
"transforms.createKey.fields":"user_id",
"transforms.extract.type":"org.apache.kafka.connect.transforms.ExtractField\$Key",
"transforms.extract.field":"user_id",
When a worker task with this configuration encounters a table that has no user_id column, it crashes and remains in the FAILED state:
org.apache.kafka.connect.errors.ConnectException: Tolerance exceeded in error handler
at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execAndHandleError(RetryWithToleranceOperator.java:178)
at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execute(RetryWithToleranceOperator.java:104)
at org.apache.kafka.connect.runtime.TransformationChain.apply(TransformationChain.java:50)
at org.apache.kafka.connect.runtime.WorkerSourceTask.sendRecords(WorkerSourceTask.java:293)
at org.apache.kafka.connect.runtime.WorkerSourceTask.execute(WorkerSourceTask.java:229)
at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:175)
at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:219)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NullPointerException
at org.apache.kafka.connect.transforms.ValueToKey.applyWithSchema(ValueToKey.java:85)
at org.apache.kafka.connect.transforms.ValueToKey.apply(ValueToKey.java:65)
at org.apache.kafka.connect.runtime.TransformationChain.lambda$apply$0(TransformationChain.java:50)
at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execAndRetry(RetryWithToleranceOperator.java:128)
at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execAndHandleError(RetryWithToleranceOperator.java:162)
... 11 more
This is plausible, since there seems to be no way to scope a transform by table or target topic, or is there? I would expect a capability to restrict transforms to a given table or topic, e.g. something like
transforms.<topic-name>.createKey.type
Am I missing something or is it a Connect restriction?

It is not possible to apply SMTs only to specific topics: a transform is a connector-level configuration, which means it is applied to every message the connector processes.
I would recommend creating a separate connector for each table so that each SMT is applied only to the topics it is meant for; see the sketch below.
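For illustration, a minimal sketch of what the split could look like, with each definition submitted separately to the Connect REST API. The connector names, the connection.url, the Order key column order_id, and the omission of the remaining JDBC source settings (mode, poll interval, etc.) are placeholders and simplifications, not part of the original setup:
{
  "name": "jdbc-source-user",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
    "connection.url": "jdbc:postgresql://db:5432/mydb",
    "table.whitelist": "User",
    "transforms": "createKey,extract",
    "transforms.createKey.type": "org.apache.kafka.connect.transforms.ValueToKey",
    "transforms.createKey.fields": "user_id",
    "transforms.extract.type": "org.apache.kafka.connect.transforms.ExtractField$Key",
    "transforms.extract.field": "user_id"
  }
}
{
  "name": "jdbc-source-order",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
    "connection.url": "jdbc:postgresql://db:5432/mydb",
    "table.whitelist": "Order",
    "transforms": "createKey,extract",
    "transforms.createKey.type": "org.apache.kafka.connect.transforms.ValueToKey",
    "transforms.createKey.fields": "order_id",
    "transforms.extract.type": "org.apache.kafka.connect.transforms.ExtractField$Key",
    "transforms.extract.field": "order_id"
  }
}
This way a ValueToKey transform never sees a record from a table that lacks its configured column, which avoids the NullPointerException above.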

Related

Apache Nifi ConsumeKafkaRecord_2_6 consuming message from topic where key and value are avro serialized using confluent schema registry

I am using nifi to build a dataflow with the following setup:
apache nifi 1.14.1
kafka 2.13-2.7.1
confluent schema registry
I am also using the ConsumeKafkaRecord_2_6 processor to process messages from a topic where the key and the value were both serialized using Avro; the schemas for the key and the value are stored in the Confluent Schema Registry. But the processor fails to parse the messages because there is no way, that I can see, to specify that both key and value are Avro-serialized with their schemas stored in the Confluent Schema Registry. The convention for naming the schemas is usually [topic name]-value and [topic name]-key. I can read the messages just fine using kcat (formerly kafkacat):
kcat -b broker1:9092,broker2:9092,broker3:9092 -t mytopic -s avro -r http://schema-registry_url.com -p 0
Is there a way to read such messages or am I supposed to add my own processor to nifi? Here's a trace of the error:
org.apache.nifi.serialization.MalformedRecordException: Error while getting next record. Root cause: org.apache.avro.AvroRuntimeException: Malformed data. Length is negative: -62
at org.apache.nifi.avro.AvroRecordReader.nextRecord(AvroRecordReader.java:52)
at org.apache.nifi.serialization.RecordReader.nextRecord(RecordReader.java:50)
at sun.reflect.GeneratedMethodAccessor559.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.nifi.controller.service.StandardControllerServiceInvocationHandler.invoke(StandardControllerServiceInvocationHandler.java:254)
at org.apache.nifi.controller.service.StandardControllerServiceInvocationHandler.access$100(StandardControllerServiceInvocationHandler.java:38)
at org.apache.nifi.controller.service.StandardControllerServiceInvocationHandler$ProxiedReturnObjectInvocationHandler.invoke(StandardControllerServiceInvocationHandler.java:240)
at com.sun.proxy.$Proxy192.nextRecord(Unknown Source)
at org.apache.nifi.processors.kafka.pubsub.ConsumerLease.writeRecordData(ConsumerLease.java:549)
at org.apache.nifi.processors.kafka.pubsub.ConsumerLease.lambda$processRecords$3(ConsumerLease.java:342)
at java.util.HashMap$KeySpliterator.forEachRemaining(HashMap.java:1556)
at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:647)
at org.apache.nifi.processors.kafka.pubsub.ConsumerLease.processRecords(ConsumerLease.java:329)
at org.apache.nifi.processors.kafka.pubsub.ConsumerLease.poll(ConsumerLease.java:188)
at org.apache.nifi.processors.kafka.pubsub.ConsumeKafkaRecord_2_6.onTrigger(ConsumeKafkaRecord_2_6.java:472)
at org.apache.nifi.processor.AbstractProcessor.onTrigger(AbstractProcessor.java:27)
at org.apache.nifi.controller.StandardProcessorNode.onTrigger(StandardProcessorNode.java:1202)
at org.apache.nifi.controller.tasks.ConnectableTask.invoke(ConnectableTask.java:214)
at org.apache.nifi.controller.scheduling.QuartzSchedulingAgent$2.run(QuartzSchedulingAgent.java:137)
at org.apache.nifi.engine.FlowEngine$2.run(FlowEngine.java:110)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.avro.AvroRuntimeException: Malformed data. Length is negative: -62
at org.apache.avro.io.BinaryDecoder.doReadBytes(BinaryDecoder.java:336)
at org.apache.avro.io.BinaryDecoder.readString(BinaryDecoder.java:263)
at org.apache.avro.io.ResolvingDecoder.readString(ResolvingDecoder.java:201)
at org.apache.avro.generic.GenericDatumReader.readString(GenericDatumReader.java:430)
at org.apache.nifi.avro.NonCachingDatumReader.readString(NonCachingDatumReader.java:51)
at org.apache.avro.generic.GenericDatumReader.readMapKey(GenericDatumReader.java:335)
at org.apache.avro.generic.GenericDatumReader.readMap(GenericDatumReader.java:321)
at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:177)
at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:152)
at org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:240)
at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:230)
at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:174)
at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:152)
at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:144)
at org.apache.nifi.avro.AvroReaderWithExplicitSchema.nextAvroRecord(AvroReaderWithExplicitSchema.java:92)
at org.apache.nifi.avro.AvroRecordReader.nextRecord(AvroRecordReader.java:39)
... 27 common frames omitted
(Screenshots attached of the ConsumeKafkaRecord_2_6 processor configuration (two images), the AvroReader, and the SchemaRegistry controller service.)
If the data is already serialized by one of the Confluent serializers, you should prefer the "Confluent Content-Encoded Schema Reference" option in the AvroReader, since the schema ID is embedded within each record and the correct subject/version will be resolved from it.
Otherwise, the "Schema Name" or "Schema Text" option will either perform a lookup against the registry or use a literal schema, but the deserializer will then still expect plain Avro record bytes, which appears to be the cause of the Malformed data. Length is negative ... error.

Is there a way to use the Kafka Schema Registry without the magic byte?

I'm trying to make my applications work with the Confluent Schema Registry, but I'm not in full control of the producers; you can think of them as legacy applications that are simply not tied to the Confluent products.
Looking at the Confluent documentation, it seems every message payload should include a magic byte and a schema ID:
https://docs.confluent.io/3.2.0/schema-registry/docs/serializer-formatter.html
Otherwise, when I try to consume the messages, I get an error:
[2020-09-25 13:12:09,008] ERROR WorkerSinkTask{id=s3_parquet_connector-0} Task threw an uncaught and unrecoverable exception (org.apache.kafka.connect.runtime.WorkerTask)
org.apache.kafka.connect.errors.ConnectException: Tolerance exceeded in error handler
at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execAndHandleError(RetryWithToleranceOperator.java:178)
at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execute(RetryWithToleranceOperator.java:104)
at org.apache.kafka.connect.runtime.WorkerSinkTask.convertAndTransformRecord(WorkerSinkTask.java:491)
at org.apache.kafka.connect.runtime.WorkerSinkTask.convertMessages(WorkerSinkTask.java:468)
at org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:324)
at org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:228)
at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:200)
at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:184)
at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:234)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.kafka.connect.errors.DataException: Failed to deserialize data for topic com.obj_pos to Protobuf:
at io.confluent.connect.protobuf.ProtobufConverter.toConnectData(ProtobufConverter.java:123)
at org.apache.kafka.connect.storage.Converter.toConnectData(Converter.java:87)
at org.apache.kafka.connect.runtime.WorkerSinkTask.lambda$convertAndTransformRecord$1(WorkerSinkTask.java:491)
at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execAndRetry(RetryWithToleranceOperator.java:128)
at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execAndHandleError(RetryWithToleranceOperator.java:162)
... 13 more
Caused by: org.apache.kafka.common.errors.SerializationException: Error deserializing Protobuf message for id -1
Caused by: org.apache.kafka.common.errors.SerializationException: Unknown magic byte!
[2020-09-25 13:12:09,010] ERROR WorkerSinkTask{id=s3_parquet_connector-0} Task is being killed and will not recover until manually restarted (org.apache.kafka.connect.runtime.WorkerTask)
My question is whether there is a way to either disable this magic-byte check, or to build a Kafka Streams application that simply prepends those 5 bytes to each incoming message so that I can afterwards consume the topic with a consumer that talks to the Schema Registry.
What is happening is that the producers are out of my control, so I need a way to deserialize messages that do not contain those 5 bytes, because they are produced by producers that don't rely on the Confluent serializers/deserializers.
they are produced by producers that don't rely on the confluent serializers
Then the problem isn't the Registry.
You shouldn't be using the Converters written by Confluent to consume the messages, as those are bound to the Registry, and there is no way to skip it.
You would instead use the BlueApron converter (assuming the data really is Protobuf), or write your own Converter class; a configuration sketch follows below.
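A rough sketch of what the sink converter settings could look like with the BlueApron converter. The converter class and the protoClassName property come from the blueapron/kafka-connect-protobuf-converter project (verify the exact names against its README for your version), and the generated class com.example.protos.ObjPos is a made-up placeholder:
"key.converter": "org.apache.kafka.connect.storage.StringConverter",
"value.converter": "com.blueapron.connect.protobuf.ProtobufConverter",
"value.converter.protoClassName": "com.example.protos.ObjPos",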

io.debezium.DebeziumException: The db history topic or its content is fully or partially missing

I am facing frequent issues with the database history topic, which is created by the connector itself. There is a temporary workaround (changing the name of the db history topic), which I tried, but it is not a good way to handle it. Also, retention.bytes on the topic is set to -1.
This is the error stack.
ERROR WorkerSourceTask{id=cdcit.ventures.sandbox.streamdomain.streamsubdomain.order-filter-0} Task threw an uncaught and unrecoverable exception (org.apache.kafka.connect.runtime.WorkerTask)
io.debezium.DebeziumException: The db history topic or its content is fully or partially missing. Please check database history topic configuration and re-execute the snapshot.
at io.debezium.relational.HistorizedRelationalDatabaseSchema.recover(HistorizedRelationalDatabaseSchema.java:47)
at io.debezium.connector.sqlserver.SqlServerConnectorTask.start(SqlServerConnectorTask.java:87)
at io.debezium.connector.common.BaseSourceTask.start(BaseSourceTask.java:101)
at org.apache.kafka.connect.runtime.WorkerSourceTask.execute(WorkerSourceTask.java:213)
at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:184)
at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:234)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
[2020-09-04 19:12:26,445] ERROR WorkerSourceTask{id=cdcit.ventures.sandbox.streamdomain.streamsubdomain.order-filter-0} Task is being killed and will not recover until manually restarted (org.apache.kafka.connect.runtime.WorkerTask)
You must use a single database history topic per connector, and that topic must not be used by more than one connector.
Please change the value of the "name" parameter in the connector configuration ("connector.properties") to a new, unique name.
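For illustration, a minimal Debezium SQL Server source configuration in which both the connector name and the database history topic are unique to this one connector; the hostnames, credentials, and topic names below are placeholders, not taken from the original setup:
{
  "name": "sqlserver-order-filter-connector",
  "config": {
    "connector.class": "io.debezium.connector.sqlserver.SqlServerConnector",
    "database.hostname": "sqlserver",
    "database.port": "1433",
    "database.user": "debezium",
    "database.password": "********",
    "database.dbname": "ventures",
    "database.server.name": "cdcit.ventures.sandbox",
    "database.history.kafka.bootstrap.servers": "kafka:9092",
    "database.history.kafka.topic": "dbhistory.cdcit.ventures.sandbox.order-filter"
  }
}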
Thanks.

Issues reading AVRO encoded messages (created by KSQL stream) with Kafka Connect

There's something weird happening when we create AVRO messages through KSQL and then try to consume them using Kafka Connect. A bit of context:
Source data
A 3rd party provider is producing data on one of our Kafka clusters as JSON (so far, so good). We actually see the data coming in.
Data Transformation
As our internal systems require data to be encoded in AVRO, we created a KSQL cluster that transforms the incoming data into AVRO by creating the following stream in KSQL:
{
"ksql": "
CREATE STREAM src_stream (browser_name VARCHAR)
WITH (KAFKA_TOPIC='json_topic', VALUE_FORMAT='JSON');
CREATE STREAM sink_stream WITH (KAFKA_TOPIC='avro_topic',VALUE_FORMAT='AVRO', PARTITIONS=1, REPLICAS=3) AS
SELECT * FROM src_stream;
",
"streamsProperties": {
"ksql.streams.auto.offset.reset": "earliest"
}
}
(so far, so good)
We see the data being produced from the JSON topic onto the AVRO topic, as the offset increases.
We then create a Kafka connector in a (new) Kafka Connect cluster. For context, we are running multiple Kafka Connect clusters (with the same properties), so we have one Kafka Connect cluster for this data and an exact copy of that cluster for other AVRO data (one for analytics, one for our business data).
The sink for this connector is BigQuery; we're using the WePay BigQuery Sink Connector 1.2.0. Again, so far, so good: our business cluster is running fine with this connector, and the AVRO topics on that cluster are streaming into BigQuery.
However, when we try to consume the AVRO topic created by our KSQL statement earlier, we see an exception being thrown :/
The exception is the following:
org.apache.kafka.connect.errors.ConnectException: Tolerance exceeded in error handler
at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execAndHandleError(RetryWithToleranceOperator.java:178)
at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execute(RetryWithToleranceOperator.java:104)
at org.apache.kafka.connect.runtime.WorkerSinkTask.convertAndTransformRecord(WorkerSinkTask.java:510)
at org.apache.kafka.connect.runtime.WorkerSinkTask.convertMessages(WorkerSinkTask.java:490)
at org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:321)
at org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:225)
at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:193)
at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:175)
at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:219)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.kafka.connect.errors.DataException: dpt_video_event-created_v2
at io.confluent.connect.avro.AvroConverter.toConnectData(AvroConverter.java:98)
at org.apache.kafka.connect.runtime.WorkerSinkTask.lambda$convertAndTransformRecord$0(WorkerSinkTask.java:510)
at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execAndRetry(RetryWithToleranceOperator.java:128)
at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execAndHandleError(RetryWithToleranceOperator.java:162)
... 13 more
Caused by: org.apache.kafka.common.errors.SerializationException: Error retrieving Avro schema for id 0
Caused by: io.confluent.kafka.schemaregistry.client.rest.exceptions.RestClientException: Schema not found; error code: 40403
at io.confluent.kafka.schemaregistry.client.rest.RestService.sendHttpRequest(RestService.java:209)
at io.confluent.kafka.schemaregistry.client.rest.RestService.httpRequest(RestService.java:235)
at io.confluent.kafka.schemaregistry.client.rest.RestService.getId(RestService.java:415)
at io.confluent.kafka.schemaregistry.client.rest.RestService.getId(RestService.java:408)
at io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient.getSchemaByIdFromRegistry(CachedSchemaRegistryClient.java:123)
at io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient.getBySubjectAndId(CachedSchemaRegistryClient.java:190)
at io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient.getById(CachedSchemaRegistryClient.java:169)
at io.confluent.kafka.serializers.AbstractKafkaAvroDeserializer.deserialize(AbstractKafkaAvroDeserializer.java:121)
at io.confluent.kafka.serializers.AbstractKafkaAvroDeserializer.deserializeWithSchemaAndVersion(AbstractKafkaAvroDeserializer.java:243)
at io.confluent.connect.avro.AvroConverter$Deserializer.deserialize(AvroConverter.java:134)
at io.confluent.connect.avro.AvroConverter.toConnectData(AvroConverter.java:85)
at org.apache.kafka.connect.runtime.WorkerSinkTask.lambda$convertAndTransformRecord$0(WorkerSinkTask.java:510)
at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execAndRetry(RetryWithToleranceOperator.java:128)
at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execAndHandleError(RetryWithToleranceOperator.java:162)
at org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execute(RetryWithToleranceOperator.java:104)
at org.apache.kafka.connect.runtime.WorkerSinkTask.convertAndTransformRecord(WorkerSinkTask.java:510)
at org.apache.kafka.connect.runtime.WorkerSinkTask.convertMessages(WorkerSinkTask.java:490)
at org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:321)
at org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:225)
at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:193)
at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:175)
at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:219)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
This, to us, indicates that Kafka Connect reads the message, decodes the AVRO, and tries to fetch the schema with ID 0 from the Schema Registry. Obviously, schema IDs in the Schema Registry are always > 0.
We're currently stuck trying to identify the issue. It looks like KSQL is encoding the message with schema ID 0, but we're unable to find the cause for that :/
Any help is appreciated!
BR,
Patrick
UPDATE:
We have implemented a basic consumer for the AVRO messages, and that consumer correctly identifies the schema in the AVRO messages (ID: 3), so the issue seems to be related to Kafka Connect rather than the actual KSQL / AVRO messages.
Obviously, schema IDs in the schema registry are always > 0... It looks like KSQL is encoding the message with schema ID 0, but we're unable to find the cause for that
The AvroConverter does a "dumb check" that only looks that the consumed bytes start with a magic byte of 0x0. The next 4 bytes are the ID.
If you are using key.converter=AvroConverter and your keys start like 0x00000 in hex, then the ID would be shown as 0 in the logs, and the lookup would fail.
Last I checked, KSQL doesn't output keys in Avro format, so you will want to check the properties of your connector.
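For illustration only, a converter fragment consistent with that advice, treating keys as plain strings and values as Confluent-framed Avro; the registry URL is a placeholder:
"key.converter": "org.apache.kafka.connect.storage.StringConverter",
"value.converter": "io.confluent.connect.avro.AvroConverter",
"value.converter.schema.registry.url": "http://schema-registry:8081",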

Kafka Connect export multiple event types from same topic

I am trying to use a new feature (https://www.confluent.io/blog/put-several-event-types-kafka-topic/) for storing two different types of events on the same topic. I am using Confluent version 4.1.0 and have set the properties below to make this happen:
properties.put(KafkaAvroSerializerConfig.VALUE_SUBJECT_NAME_STRATEGY,TopicRecordNameStrategy.class.getName());
properties.put("value.multi.type", true);
Data is written to the topic without issues and can be read by a Kafka Streams application as generic Avro records. On the Kafka Schema Registry, two new entries are also created, one for each event type hosted on that topic.
The problem I am facing is that I cannot export this data from the topic using Kafka Connect. In the simplest case, when I use a File Sink Connector as below
{
  "name": "sink-connector",
  "config": {
    "topics": "source-topic",
    "connector.class": "org.apache.kafka.connect.file.FileStreamSinkConnector",
    "tasks.max": 1,
    "key.converter": "org.apache.kafka.connect.storage.StringConverter",
    "key.converter.schema.registry.url": "http://kafka-schema-registry:8081",
    "value.converter": "io.confluent.connect.avro.AvroConverter",
    "value.converter.schema.registry.url": "http://kafka-schema-registry:8081",
    "value.subject.name.strategy": "io.confluent.kafka.serializers.subject.TopicRecordNameStrategy",
    "file": "/tmp/sink-file.txt"
  }
}
I get an error from the connector that looks like a serialization error coming from the AvroConverter, like the one shown here:
org.apache.kafka.connect.errors.DataException: source-topic
at io.confluent.connect.avro.AvroConverter.toConnectData(AvroConverter.java:95)
at org.apache.kafka.connect.runtime.WorkerSinkTask.convertMessages(WorkerSinkTask.java:468)
at org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:301)
at org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:205)
at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:173)
at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:170)
at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:214)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.kafka.common.errors.SerializationException: Error retrieving Avro schema for id 2
Caused by: io.confluent.kafka.schemaregistry.client.rest.exceptions.RestClientException: Subject not found.; error code: 40401
at io.confluent.kafka.schemaregistry.client.rest.RestService.sendHttpRequest(RestService.java:202)
at io.confluent.kafka.schemaregistry.client.rest.RestService.httpRequest(RestService.java:229)
at io.confluent.kafka.schemaregistry.client.rest.RestService.lookUpSubjectVersion(RestService.java:296)
at io.confluent.kafka.schemaregistry.client.rest.RestService.lookUpSubjectVersion(RestService.java:284)
at io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient.getVersionFromRegistry(CachedSchemaRegistryClient.java:125)
at io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient.getVersion(CachedSchemaRegistryClient.java:236)
at io.confluent.kafka.serializers.AbstractKafkaAvroDeserializer.deserialize(AbstractKafkaAvroDeserializer.java:152)
at io.confluent.kafka.serializers.AbstractKafkaAvroDeserializer.deserializeWithSchemaAndVersion(AbstractKafkaAvroDeserializer.java:194)
at io.confluent.connect.avro.AvroConverter$Deserializer.deserialize(AvroConverter.java:120)
at io.confluent.connect.avro.AvroConverter.toConnectData(AvroConverter.java:83)
at org.apache.kafka.connect.runtime.WorkerSinkTask.convertMessages(WorkerSinkTask.java:468)
at org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:301)
at org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:205)
at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:173)
at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:170)
at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:214)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Note that the Schema Registry has an Avro schema with ID 2 and another with ID 3 that describe the two event types hosted on the same topic. The same issue arises when using the JDBC connector.
So how do I handle this case in order to export data from my Kafka cluster to an external system? Am I missing something in my configuration? Is it possible to have a topic with multiple types of events and export them through Kafka Connect?
Found the solution. My code was passing the key as a String and the value as Avro. The Hive sink, while reading, tried to look up an Avro schema for the key and was not able to find it.
Adding the properties
key.converter=org.apache.kafka.connect.storage.StringConverter
key.converter.schema.registry.url=http://localhost:8081
helped resolve the issue.