Kafka Connect - From JSON records to Avro files in HDFS - apache-kafka

My current setup contains Kafka, HDFS, Kafka Connect, and a Schema Registry, all in networked Docker containers.
The Kafka topic contains simple JSON data without a schema:
{
  "repo_name": "ironbee/ironbee"
}
The Schema Registry contains a JSON Schema describing the data in the Kafka Topic:
{"$schema": "https://json-schema.org/draft/2019-09/schema",
"$id": "http://example.com/example.json",
"type": "object",
"title": "Root Schema",
"required": [
"repo_name"
],
"properties": {
"repo_name": {
"type": "string",
"default": "",
"title": "The repo_name Schema",
"examples": [
"ironbee/ironbee"
]
}
}}
What I am trying to achieve is a connection that reads the JSON data from the topic and dumps it into files in HDFS (Avro or Parquet). This is my connector configuration:
{
  "name": "kafka to hdfs",
  "connector.class": "io.confluent.connect.hdfs3.Hdfs3SinkConnector",
  "topics": "repo",
  "hdfs.url": "hdfs://namenode:9000",
  "flush.size": 3,
  "confluent.topic.bootstrap.servers": "kafka-1:19092,kafka-2:29092,kafka-3:39092",
  "key.converter": "org.apache.kafka.connect.storage.StringConverter",
  "key.converter.schemas.enable": "false",
  "value.converter": "io.confluent.connect.json.JsonSchemaConverter",
  "value.converter.schemas.enable": "false",
  "value.converter.schema.registry.url": "http://schema-registry:8081"
}
If I try to read the raw JSON value via the StringConverter (no schema used) and dump it into Avro files, it works, resulting in
Key=null Value={my json} tuples,
so no usable structure at all.
When I try to use my schema via the JsonSchemaConverter I get the errors
“Converting byte[] to Kafka Connect data failed due to serialization error of topic”
“Unknown magic byte”
I think there is something wrong with the configuration of my connection, but after a week of trying everything, my Google skills have reached their limits.
All the code is available here: https://github.com/SDU-minions/7-Scalable-Systems-Project/tree/dev/Kafka

raw JSON value via the StringConverter (no schema used)
The schemas.enable property only exists on the JsonConverter. Strings don't have schemas, and JSON Schema records always have a schema, so the property doesn't exist on either of the converters you are using.
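For reference, when schemas.enable=true on the JsonConverter, each record value is expected to carry an inline schema/payload envelope roughly like the sketch below (the struct name is illustrative, not something your setup requires):

{
  "schema": {
    "type": "struct",
    "name": "repo",
    "optional": false,
    "fields": [
      { "field": "repo_name", "type": "string", "optional": false }
    ]
  },
  "payload": {
    "repo_name": "ironbee/ironbee"
  }
}

That envelope is unrelated to the Schema Registry, which is why the property means nothing to the JsonSchemaConverter.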
When I try to use my schema via the JsonSchemaConverter I get the errors
Your producer needs to use the Confluent JSON Schema serializer. Otherwise, the data doesn't get sent to Kafka with the "magic byte" referred to in your error.
I personally haven't tried converting JSON Schema records to Avro directly in Connect. The usual pattern is either to produce Avro directly, or to convert within ksqlDB, for example into a new Avro topic, which is then consumed by Connect.
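For completeness, once the topic is actually produced with the Confluent JSON Schema serializer, a sink configuration along these lines should at least deserialize it correctly (this is just a sketch of your config with the unsupported schemas.enable properties removed, not something I have run against HDFS myself):

{
  "name": "kafka to hdfs",
  "connector.class": "io.confluent.connect.hdfs3.Hdfs3SinkConnector",
  "topics": "repo",
  "hdfs.url": "hdfs://namenode:9000",
  "flush.size": 3,
  "confluent.topic.bootstrap.servers": "kafka-1:19092,kafka-2:29092,kafka-3:39092",
  "key.converter": "org.apache.kafka.connect.storage.StringConverter",
  "value.converter": "io.confluent.connect.json.JsonSchemaConverter",
  "value.converter.schema.registry.url": "http://schema-registry:8081"
}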

Related

Not able to override consumer config in azure iot hub sink connector

I'm making an Azure IoT Hub sink connector using the Microsoft connector class, with an Avro converter on the connector.
I want to use KafkaAvroDeserializer on the consumer to deserialize the Avro data coming from the topic, but I'm unable to override the value deserializer.
I'm using consumer.override.value.deserializer, as seen in the logs.
Could anyone please suggest a way out?
My config is below; in particular I have set:
"consumer.value.deserializer": "io.confluent.kafka.serializers.KafkaAvroDeSerializer"
The deserializer I'm getting is a byte-array deserializer, and I want it to be KafkaAvroDeserializer.
I am making an Azure IoT Hub sink connector and I'm getting an error deserializing Avro data from the Kafka topic.
{
  "config": {
    "IotHub.ConnectionString": "connectionString",
    "IotHub.MessageDeliveryAcknowledgement": "None",
    "confluent.topic.bootstrap.servers": "server",
    "confluent.topic.replication.factor": "1",
    "connector.class": "com.microsoft.azure.iot.kafka.connect.sink.IotHubSinkConnector",
    "consumer.override.auto.register.schemas": "true",
    "consumer.override.id.compatibility.strict": "false",
    "consumer.override.latest.compatibility.strict": "false",
    "consumer.override.schema.registry.url": "registryUrl",
    "consumer.value.deserializer": "io.confluent.kafka.serializers.KafkaAvroDeSerializer",
    "key.converter": "org.apache.kafka.connect.storage.StringConverter",
    "name": "TEST1",
    "tasks.max": "1",
    "topics": "testtopicazure3",
    "value.converter": "io.confluent.connect.avro.AvroConverter",
    "value.converter.auto.register.schemas": "true",
    "value.converter.schema.registry.url": "registryUrl"
  }
}
Getting error:
Caused by: org.apache.kafka.common.errors.SerializationException: Error deserializing Avro message for id -1
Caused by: org.apache.kafka.common.errors.SerializationException: Unknown magic byte!
In Connect, you only set value.converter, not consumer client deserializers:
value.converter=io.confluent.connect.avro.AvroConverter
And all of your consumer.override prefixes should be value.converter instead.
https://docs.confluent.io/kafka-connectors/self-managed/userguide.html#configuring-key-and-value-converters
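In other words, something along these lines (a sketch of your own config with the consumer-level deserializer and consumer.override settings replaced by converter settings):

{
  "config": {
    "connector.class": "com.microsoft.azure.iot.kafka.connect.sink.IotHubSinkConnector",
    "name": "TEST1",
    "tasks.max": "1",
    "topics": "testtopicazure3",
    "IotHub.ConnectionString": "connectionString",
    "IotHub.MessageDeliveryAcknowledgement": "None",
    "confluent.topic.bootstrap.servers": "server",
    "confluent.topic.replication.factor": "1",
    "key.converter": "org.apache.kafka.connect.storage.StringConverter",
    "value.converter": "io.confluent.connect.avro.AvroConverter",
    "value.converter.auto.register.schemas": "true",
    "value.converter.schema.registry.url": "registryUrl"
  }
}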

KSQLDB Push Queries Fail to Deserialize Data - Schema Lookup Performed with Wrong Schema ID

I'm not certain what I could be missing.
I have set up a Kafka broker server, with a Zookeeper and a distributed Kafka Connect.
For schema management, I have set up an Apicurio Schema Registry instance.
I also have KSQLDB set up.
The following I can confirm is working as expected:
My source JDBC connector successfully pushed table data into the topic stss.market.info.public.ice_symbols.
Problem:
Inside the KSQLDB server, I have successfully created a table from the topic stss.market.info.public.ice_symbols
Here is the detail of the table created
The problem I'm facing is that a push query against this table returns no data. Deserialization of the data fails due to an unsuccessful lookup of the Avro schema in the Apicurio Registry.
The Apicurio Registry logs reveal that KSQLDB calls the Apicurio Registry to fetch the deserialization schema using a schema ID of 0 instead of 5, which is the ID of the schema I have registered in the registry.
The KSQLDB server logs also confirm the 404 HTTP response seen in the Apicurio logs, as shown in the image below.
Expectation:
I expect KSQLDB queries against the table to perform the schema lookup with an ID of 5, not 0. I'm guessing I'm probably missing some configuration.
Here is the image of the schema registered in the Apicurio Registry.
Here is also my source connector configuration. It has the appropriate schema lookup strategy configured, although I don't believe KSQLDB requires this when deserializing its table data. This configuration should only be relevant to capturing the table data and to its validation and storage in the topic stss.market.info.public.ice_symbols.
{
  "name": "new.connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "plugin.name": "pgoutput",
    "database.hostname": "172.17.203.10",
    "database.port": "6000",
    "database.user": "postgres",
    "database.password": "123",
    "database.dbname": "stss_market_info",
    "database.server.name": "stss.market.info",
    "table.include.list": "public.ice_symbols",
    "message.key.columns": "public.ice_symbols:name",
    "snapshot.mode": "always",
    "transforms": "unwrap,extractKey",
    "transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
    "transforms.extractKey.type": "org.apache.kafka.connect.transforms.ExtractField$Key",
    "transforms.extractKey.field": "name",
    "value.converter": "io.apicurio.registry.utils.converter.AvroConverter",
    "value.converter.apicurio.registry.url": "http://local-server:8080/apis/registry/v2",
    "value.converter.apicurio.registry.auto-register": true,
    "value.converter.apicurio.registry.find-latest": true,
    "value.apicurio.registry.as-confluent": true,
    "name": "new.connector",
    "value.converter.schema.registry.url": "http://local-server:8080/apis/registry/v2"
  }
}
Thanks in advance for any assistance.
You can specify the "VALUE_SCHEMA_ID=5" property in the WITH clause when you create a stream/table.
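For example, something like this sketch (the key column declaration is a guess based on your extractKey field, and I haven't verified this against an Apicurio registry; the VALUE_SCHEMA_ID entry in the WITH clause is the relevant part):

CREATE TABLE ice_symbols (
  name VARCHAR PRIMARY KEY
) WITH (
  KAFKA_TOPIC='stss.market.info.public.ice_symbols',
  VALUE_FORMAT='AVRO',
  VALUE_SCHEMA_ID=5
);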

Kafka Connect FileStreamSink connector does not include the KEY in the output file

I am trying a simple file sink connector to extract data from a topic. The generated file does not include the event key, and I am not able to find a setting that enables that. Eventually the goal will be to load the file using a source connector and reproduce the same sample data, so the event key is very important.
Thanks
{
  "name": "save-seed-data",
  "config": {
    "connector.class": "org.apache.kafka.connect.file.FileStreamSinkConnector",
    "tasks.max": "1",
    "name": "save-seed-data",
    "topics": "FIRM",
    "file": "/tmp/FIRM.txt",
    "key.converter": "org.apache.kafka.connect.storage.StringConverter",
    "key.converter.schemas.enable": "false",
    "value.converter": "org.apache.kafka.connect.storage.StringConverter",
    "value.converter.schemas.enable": "false"
  }
}
Not sure where you found that the key should be in the output, since the source code only references the value.
You can download and use a message transform to move the key over into the value, though.
https://github.com/jcustenborder/kafka-connect-transform-archive
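If you go the transform route, the sink config would look roughly like this (a sketch based on your config; com.github.jcustenborder.kafka.connect.archive.Archive is, as far as I recall, the transformation class that project provides, and it wraps the key, value, topic and timestamp into a new value struct):

{
  "name": "save-seed-data",
  "config": {
    "connector.class": "org.apache.kafka.connect.file.FileStreamSinkConnector",
    "tasks.max": "1",
    "topics": "FIRM",
    "file": "/tmp/FIRM.txt",
    "key.converter": "org.apache.kafka.connect.storage.StringConverter",
    "value.converter": "org.apache.kafka.connect.storage.StringConverter",
    "transforms": "archive",
    "transforms.archive.type": "com.github.jcustenborder.kafka.connect.archive.Archive"
  }
}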
Also worth mentioning that the FileStream source connector does not parse the data either; each line it reads only goes into the value.
Generally, using kafkacat is much more straightforward for dumping/loading data from files.

How to configure a Kafka Connect sink connector for an Exasol database

I am trying to set up a Kafka sink connector for writing to an Exasol database.
I have followed this article: https://www.confluent.io/blog/simplest-useful-kafka-connect-data-pipeline-world-thereabouts-part-1/
Since I could not find any similar sink connector class for Exasol, I tried to use the jar from https://github.com/exasol/kafka-connect-jdbc-exasol/tree/master/kafka-connect-exasol/jars [copied this jar into
$confluent_dir/share/java/kafka-connect-jdbc] and used the Dialect class inside it as the connector class name in my config JSON file below.
I have created a JSON file for the configuration as below:
{
  "name": "jdbc_sink_mysql_dev_02",
  "config": {
    "_comment": "The JDBC connector class. Don't change this if you want to use the JDBC Source.",
    "connector.class": "com.exasol.connect.jdbc.dailect.ExasolDatabaseDialect",
    "_comment": "How to serialise the value of keys - here use the Confluent Avro serialiser. Note that the JDBC Source Connector always returns null for the key",
    "key.converter": "io.confluent.connect.avro.AvroConverter",
    "_comment": "Since we're using Avro serialisation, we need to specify the Confluent schema registry at which the created schema is to be stored. NB Schema Registry and Avro serialiser are both part of Confluent Platform.",
    "key.converter.schema.registry.url": "http://localhost:8081",
    "_comment": "As above, but for the value of the message. Note that these key/value serialisation settings can be set globally for Connect and thus omitted for individual connector configs to make them shorter and clearer",
    "value.converter": "io.confluent.connect.avro.AvroConverter",
    "value.converter.schema.registry.url": "http://localhost:8081",
    "_comment": " --- JDBC-specific configuration below here --- ",
    "_comment": "JDBC connection URL. This will vary by RDBMS. Consult your manufacturer's handbook for more information",
    "connection.url": "jdbc:exa:<myhost>:<myport> <myuser>/<mypassword>",
    "_comment": "Which table(s) to include",
    "table.whitelist": "<my_table_name>",
    "_comment": "Pull all rows based on an timestamp column. You can also do bulk or incrementing column-based extracts. For more information, see http://docs.confluent.io/current/connect/connect-jdbc/docs/source_config_options.html#mode",
    "mode": "timestamp",
    "_comment": "Which column has the timestamp value to use?",
    "timestamp.column.name": "update_ts",
    "_comment": "If the column is not defined as NOT NULL, tell the connector to ignore this",
    "validate.non.null": "false",
    "_comment": "The Kafka topic will be made up of this prefix, plus the table name",
    "topic.prefix": "mysql-"
  }
}
I am trying to load this connector with the below command:
./bin/confluent load jdbc_sink_mysql_dev_02 -d <my_configuration_json_file_path>
P.S. My Confluent version is 5.1.0.
In a similar fashion I have created a MySQL source connector for reading data from MySQL, and it's working well. My use case demands writing that data to the Exasol database using a sink connector.
I am not getting any exceptions, but Kafka is not reading any messages.
Any pointers or help with configuring such a sink connector to write to an Exasol database would be appreciated.

Exception while deserializing Avro data using ConfluentSchemaRegistry?

I am new to Flink and Kafka. I am trying to deserialize Avro data using the Confluent Schema Registry. I have already installed Flink and Kafka on an EC2 machine. Also, the "test" topic has been created before running the code.
Code Path: https://gist.github.com/mandar2174/5dc13350b296abf127b92d0697c320f2
The code does the following operations as part of the implementation:
1) Create a Flink DataStream object using a list of User elements (User is an Avro-generated class).
2) Write the DataStream source to Kafka using AvroSerializationSchema.
3) Read the data from Kafka using ConfluentRegistryAvroDeserializationSchema, reading the schema from the Confluent Schema Registry.
Command to run flink executable jar:
./bin/flink run -c com.streaming.example.ConfluentSchemaRegistryExample /opt/flink-1.7.2/kafka-flink-stream-processing-assembly-0.1.jar
Exception while running code:
java.io.IOException: Unknown data format. Magic number does not match
at org.apache.flink.formats.avro.registry.confluent.ConfluentSchemaRegistryCoder.readSchema(ConfluentSchemaRegistryCoder.java:55)
at org.apache.flink.formats.avro.RegistryAvroDeserializationSchema.deserialize(RegistryAvroDeserializationSchema.java:66)
at org.apache.flink.streaming.util.serialization.KeyedDeserializationSchemaWrapper.deserialize(KeyedDeserializationSchemaWrapper.java:44)
at org.apache.flink.streaming.connectors.kafka.internal.KafkaFetcher.runFetchLoop(KafkaFetcher.java:140)
at org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumerBase.run(FlinkKafkaConsumerBase.java:665)
at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:94)
at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:58)
at org.apache.flink.streaming.runtime.tasks.SourceStreamTask.run(SourceStreamTask.java:99)
at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:300)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:704)
at java.lang.Thread.run(Thread.java:748)
The Avro schema I am using for the User class is below:
{
  "type": "record",
  "name": "User",
  "namespace": "com.streaming.example",
  "fields": [
    {
      "name": "name",
      "type": "string"
    },
    {
      "name": "favorite_number",
      "type": ["int", "null"]
    },
    {
      "name": "favorite_color",
      "type": ["string", "null"]
    }
  ]
}
Can someone point out what steps I am missing as part of deserializing Avro data using the Confluent Schema Registry?
How you wrote the Avro data matters: the producer needs to use the Registry as well in order for the deserializer that depends on it to work.
However, a ConfluentRegistryAvroSerializationSchema class is still only an open PR in Flink.
The workaround, I believe, would be to use AvroDeserializationSchema, which does not depend on the Registry.
If you did want to use the Registry in the producer code, then you'd have to do so outside of Flink until that PR is merged.