Kafka connector to read from CSV and convert into Avro - apache-kafka

Is there any Kafka connector which reads from CSV and converts it into Avro before pushing it to the topic?
I have gone through the well-known https://github.com/jcustenborder/kafka-connect-spooldir, but it only reads and pushes to the topic.
I am planning to modify the code base for my custom use, but before I make the changes I just wanted to check whether such a connector is already available.

kafka-connect-spooldir does do exactly what you describe. When you run it, you just need to set Kafka Connect to use the Avro converter. For example:
"key.converter": "io.confluent.connect.avro.AvroConverter",
"key.converter.schema.registry.url": "http://schema-registry:8081",
"value.converter": "io.confluent.connect.avro.AvroConverter",
"value.converter.schema.registry.url": "http://schema-registry:8081",
See https://www.confluent.io/blog/kafka-connect-deep-dive-converters-serialization-explained for more information about how converters and connectors relate.
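For example, a full connector config along these lines will read CSV files and write Avro to the topic. The paths, topic name, and registry URL are placeholders, and the SpoolDir property names can differ between versions, so treat this as a sketch and check the connector's documentation:
{
  "name": "csv-to-avro-source",
  "config": {
    "connector.class": "com.github.jcustenborder.kafka.connect.spooldir.SpoolDirCsvSourceConnector",
    "topic": "orders",
    "input.path": "/data/input",
    "finished.path": "/data/finished",
    "error.path": "/data/error",
    "input.file.pattern": ".*\\.csv",
    "csv.first.row.as.header": "true",
    "schema.generation.enabled": "true",
    "key.converter": "io.confluent.connect.avro.AvroConverter",
    "key.converter.schema.registry.url": "http://schema-registry:8081",
    "value.converter": "io.confluent.connect.avro.AvroConverter",
    "value.converter.schema.registry.url": "http://schema-registry:8081"
  }
}
Setting the converters on the connector, as sketched here, overrides whatever converters are configured on the worker.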
Edit in response to your comment:
When I am using kafka-console-consumer I am seeing data as
103693(2018-03-11T09:19:17Z Sugar - assa8.7
when I am using kafka-avro-console-consumer the format is
{"order_id":{"string":"1035"},"customer_id":{"string":"93"},"order_ts":{"string":"2018-03-11T09:19:17Z"},"product":{"string":"Sugar - assa"},"order_total_usd":{"string":"8.7"}}.
This shows that it is Avro data on your topic. The whole point of kafka-avro-console-consumer is that it decodes the binary Avro data and renders it in a readable format. The output from kafka-console-consumer shows the raw Avro bytes, parts of which may look human-readable (Sugar - assa) but others clearly not (103693).

Related

PLC4X OPC UA - Kafka Connector

I want to use the PLC4X connector (https://www.confluent.io/hub/apache/kafka-connect-plc4x-plc4j) to connect OPC UA (Prosys Simulation Server) with Kafka.
However, I cannot find any documentation that describes the Kafka Connect configuration options.
I tried to connect to the Prosys OPC UA simulation server and then stream the data to a Kafka topic.
I managed to simply send the data and consume it; however, I want to use a schema and the Avro converter.
The output from my Python sink connector looks like this, which also seems a bit strange to me:
b'Struct{fields=Struct{ff=-5.4470555688606E8,hhh=Sean Ray MD},timestamp=1651838599206}'
How can I use the PLC4X connector with the Avro converter and a schema?
Thanks!
{
  "connector.class": "org.apache.plc4x.kafka.Plc4xSourceConnector",
  "default.topic": "plcTestTopic",
  "connectionString": "opcua.tcp://127.0.0.1:12345",
  "tasks.max": "2",
  "sources": "machineA",
  "sources.machineA.connectionString": "opcua:tcp://127.0.0.1:12345",
  "sources.machineA.jobReferences": "jobA",
  "jobs": "jobA",
  "jobs.jobA.interval": "5000",
  "jobs.jobA.fields": "job1,job2",
  "jobs.jobA.fields.job1": "ns=2;i=2",
  "jobs.jobA.fields.job2": "ns=2;i=3"
}
When using a schema with Avro and the Confluent Schema Registry, the following settings should be used. You can also choose to use different settings for keys and values.
key.converter=io.confluent.connect.avro.AvroConverter
value.converter=io.confluent.connect.avro.AvroConverter
key.converter.schema.registry.url=http://127.0.0.1:8081
value.converter.schema.registry.url=http://127.0.0.1:8081
key.converter.schemas.enable=true
value.converter.schemas.enable=true
Sample configuration files are also available in the PLC4X GitHub repository.
https://github.com/apache/plc4x/tree/develop/plc4j/integrations/apache-kafka/config
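If you prefer to configure the converters on the connector itself rather than in the worker properties, the same settings can go into the connector's JSON config alongside the PLC4X-specific properties you already have (the registry URL here is a placeholder), e.g.:
{
  "connector.class": "org.apache.plc4x.kafka.Plc4xSourceConnector",
  "key.converter": "io.confluent.connect.avro.AvroConverter",
  "key.converter.schema.registry.url": "http://127.0.0.1:8081",
  "value.converter": "io.confluent.connect.avro.AvroConverter",
  "value.converter.schema.registry.url": "http://127.0.0.1:8081"
}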

Debezium topics and schema registry subject descriptions

When I create a Debezium connector, it creates many Kafka topics and Schema Registry subjects.
I am not sure what these topics and subjects are or what their purpose is.
My connector configuration:
{
  "connector.class": "io.debezium.connector.mysql.MySqlConnector",
  "snapshot.locking.mode": "minimal",
  "database.user": "XXXXX",
  "tasks.max": "3",
  "database.history.kafka.bootstrap.servers": "XX:9092",
  "database.history.kafka.topic": "history.cdc.fkw.supply.marketplace.fk_sp_generic_checklist",
  "database.server.name": "cdc.fkw.supply.marketplace.fk_sp_generic_checklist",
  "heartbeat.interval.ms": "5000",
  "database.port": "3306",
  "table.whitelist": "fk_sp_generic_checklist.entity_checklist",
  "database.hostname": "abc.kcloud.in",
  "database.password": "XXXXXX",
  "database.history.kafka.recovery.poll.interval.ms": "5000",
  "name": "cdc.fkw.supply.marketplace1.fk_sp_generic_checklist.connector",
  "database.history.skip.unparseable.ddl": "true",
  "errors.tolerance": "all",
  "database.whitelist": "fk_sp_generic_checklist",
  "snapshot.mode": "when_needed"
}
Subjects got created in schema registry:
1) __debezium-heartbeat.cdc.fkw.supply.marketplace.fk_sp_generic_checklist-key
2) __debezium-heartbeat.cdc.fkw.supply.marketplace.fk_sp_generic_checklist-value
3) cdc.fkw.supply.marketplace.fk_sp_generic_checklist-key
4) cdc.fkw.supply.marketplace.fk_sp_generic_checklist-value
5) cdc.fkw.supply.marketplace.fk_sp_generic_checklist.fk_sp_generic_checklist.entity_checklist-key
6) cdc.fkw.supply.marketplace.fk_sp_generic_checklist.fk_sp_generic_checklist.entity_checklist-value
7) tr.cdc.fkw.supply.marketplace.fk_sp_generic_checklist.fk_sp_generic_checklist.entity_checklist-value
The Kafka topics which got created are:
1) __debezium-heartbeat.cdc.fkw.supply.marketplace.fk_sp_generic_checklist
2) cdc.fkw.supply.marketplace.fk_sp_generic_checklist
3) cdc.fkw.supply.marketplace.fk_sp_generic_checklist.fk_sp_generic_checklist.entity_checklist
4) history.cdc.fkw.supply.marketplace.fk_sp_generic_checklist
Questions:
What is the purpose of the subjects and topics based on my above connector configuration?
What if I deleted my connector and then created a new one with the same name and the same database/tables? Will the data be ingested from the beginning?
Is there a way to delete the entire connector and create a new one with the same name but as a fresh connector? (This is in case I messed up some settings and want to delete the existing data and start fresh.)
What is the purpose of the [...] topics based on my above connector configuration
Obviously, Debezium reads each database table into its own topic.
Otherwise, you seem to have asked this already - What are the extra topics created when creating a Debezium source connector
[purpose of the] subjects
The subjects are all created because of your key.converter and value.converter configs (which are not shown). They are optional; for example, if you had configured JsonConverter instead of using the Schema Registry, no subjects would be created.
You have a -key and a -value subject for each topic that the connector is using; they map to the Kafka record key and value. This is not unique to Debezium. The tr.cdc... one seems to be extra and doesn't refer to anything in the config shown, nor does it have an associated topic name.
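For reference, worker or connector settings along these lines are what typically produce those -key and -value Avro subjects, one pair per topic under the default subject naming strategy (the registry URL is a placeholder):
key.converter=io.confluent.connect.avro.AvroConverter
key.converter.schema.registry.url=http://schema-registry:8081
value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=http://schema-registry:8081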
Sidenote: Avro keys are usually discouraged unless you have a specific purpose for them; keys are often IDs or simple values used for comparison, partitioning, and compaction. If you modify a complex Avro key in any way (e.g. add/remove/rename fields), then consumers that expected those records to stay in order with previous records carrying the same key values will have issues.
Delete and re-create ... Will it start from the beginning
With the same name, no. Source connectors use the internal Kafka Connect offsets topic, and I assume the Debezium history topic also comes into effect. You would need to manually change these events to reset which database records are read.
Delete and start fresh.
Yes, deletes are possible. Refer to the Connect REST API DELETE HTTP method, then read point (2) above.
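For example, a sketch assuming the Connect REST API is reachable on localhost:8083 (adjust host and port to your deployment), using the connector name from your config:
curl -X DELETE http://localhost:8083/connectors/cdc.fkw.supply.marketplace1.fk_sp_generic_checklist.connector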

Problems with Avro deserialization in Kafka sink connectors

I'm trying to read data from DB2 using Kafka and then write it to HDFS. I use the distributed Confluent Platform with the standard JDBC and HDFS connectors.
As the HDFS connector needs to know the schema, it requires Avro data as input. Thus, I have to specify the following Avro converters for the data fed to Kafka (in etc/kafka/connect-distributed.properties):
key.converter=io.confluent.connect.avro.AvroConverter
key.converter.schema.registry.url=http://localhost:8081
value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=http://localhost:8081
I then run my JDBC connector and check with the console-avro-consumer that I can successfully read the data fetched from the DB2.
However, when I launch the HDFS Connector, it does not work anymore. Instead, it outputs SerializationException:
Error deserializing Avro message for id -1
... Unknown magic byte!
To check if this is a problem with the HDFS connector, I tried to use a simple FileSink connector instead. However, I saw exactly the same exception when using the FileSink (and the file itself was created but stayed empty).
I then carried out the following experiment: Instead of using avro converter for the key and value I used json converters:
key.converter=org.apache.kafka.connect.json.JsonConverter
key.converter.schema.enable=false
value.converter=org.apache.kafka.connect.json.JsonConverter
value.converter.schema.enable=false
This fixed the problem with the FileSink connector, i.e., the whole pipeline from DB2 to the file worked fine. However, for the HDFS connector this solution is infeasible as the connector needs the schema and consequently avro format as an input.
It feels to me that the deserialization of avro format in the sink connectors is not implemented properly as the console-avro-consumer can still successfully read the data.
Does anyone have any idea of what could be the reason of this behavior? I'd also appreciate an idea of a simple fix for this!
check with the console-avro-consumer that I can successfully read the data fetched
I'm guessing you didn't add --property print.key=true --from-beginning when you did that.
It's possible that the latest values are Avro, but Connect is clearly failing somewhere on the topic, so you need to scan it to find out where that happens.
If using JsonConverter works, and the data is actually readable JSON on disk, then it sounds like the JDBC connector actually wrote JSON, not Avro.
If you are able to pinpoint the offset of the bad message, you can use the regular console consumer with the connector group id set, then add --max-messages along with a partition and offset specified to skip those events.
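A sketch of both steps with the plain console consumer; the topic name, partition, offset, and group id are placeholders, and sink connectors typically use a consumer group named connect-<connector name>, so verify yours before committing offsets:
# Scan from the start, printing keys, to find where the non-Avro data begins
kafka-console-consumer --bootstrap-server localhost:9092 --topic db2-table-topic \
  --from-beginning --property print.key=true
# Once the bad offset is known, read past it with the connector's group id so its offset moves on
kafka-console-consumer --bootstrap-server localhost:9092 --topic db2-table-topic \
  --group connect-hdfs-sink --partition 0 --offset 42 --max-messages 1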

kafka connect hdfs sink connector is failing

I'm trying to use Kafka connect sink to write files from Kafka to HDFS.
My properties look like:
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
flush.size=3
format.class=io.confluent.connect.hdfs.parquet.ParquetFormat
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
schema.compatability=BACKWARD
key.converter.schemas.enabled=false
value.converter.schemas.enabled=false
schemas.enable=false
And when I try to run the connector I get the following exception:
org.apache.kafka.connect.errors.DataException: JsonConverter with schemas.enable requires "schema" and "payload" fields and may not contain additional fields. If you are trying to deserialize plain JSON data, set schemas.enable=false in your converter configuration.
I'm using Confluent version 4.0.0.
Any suggestions please?
My understanding of this issue is that if you set schemas.enable=true, you tell Kafka Connect that you would like to include the schema in the messages it transfers. In this case, a message does not have a plain JSON format; instead, it first describes the schema and then attaches the payload (i.e., the actual data) that corresponds to that schema (read about Avro formatting). This leads to the conflict: on the one hand you've specified JsonConverter for your data, on the other hand you ask Kafka Connect to include the schema in the messages. To fix this, you can either use AvroConverter with schemas.enable=true or JsonConverter with schemas.enable=false.
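For illustration, with schemas.enable=true the JsonConverter expects every message to carry a schema/payload envelope, roughly like this hypothetical record:
{
  "schema": {
    "type": "struct",
    "fields": [
      {"field": "id", "type": "int32", "optional": false},
      {"field": "name", "type": "string", "optional": true}
    ],
    "optional": false,
    "name": "example.Record"
  },
  "payload": {"id": 42, "name": "sugar"}
}
Plain JSON such as {"id": 42, "name": "sugar"} can only be read with schemas.enable=false.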

How to dump avro data from Kafka topic and read it back in Java/Scala

We need to export production data from a Kafka topic to use it for testing purposes: the data is written in Avro and the schema is placed on the Schema registry.
We tried the following strategies:
Using kafka-console-consumer with StringDeserializer or BinaryDeserializer. We were unable to obtain a file which we could parse in Java: we always got exceptions when parsing it, suggesting the file was in the wrong format.
Using kafka-avro-console-consumer: it generates JSON which also includes some raw bytes, for example when deserializing a BigDecimal. We didn't even know which parsing option to choose (it is not Avro, it is not JSON).
Other unsuitable strategies:
deploying a special Kafka consumer would require us to package and place that code on some production server, since we are talking about our production cluster. It just takes too long. After all, isn't the Kafka console consumer already a consumer with configurable options?
Potentially suitable strategies
Using a Kafka Connect sink. We didn't find a simple way to reset the consumer offset, since apparently the connector-created consumer is still active even when we delete the sink.
Isn't there a simple, easy way to dump the content of the values (not the schema) of a Kafka topic containing Avro data to a file so that it can be parsed? I expect this to be achievable using kafka-console-consumer with the right options, plus the correct Avro Java API.
for example, using kafka-console-consumer... We were unable to obtain a file which we could parse in Java: we always got exceptions when parsing it, suggesting the file was in the wrong format.
You wouldn't use the regular console consumer. You would use kafka-avro-console-consumer, which deserializes the binary Avro data into JSON for you to read on the console. You can redirect its output to a file (e.g. > topic.txt) to read it later.
If you did use the console consumer, you can't parse the Avro immediately, because you still need to extract the schema ID from the data (4 bytes after the first "magic byte"), then use the Schema Registry client to retrieve the schema, and only then will you be able to deserialize the messages. Any Avro library you use to read the file the console consumer writes expects one entire schema to be placed in the header of the file, not just an ID pointing at the registry on every line. (The basic Avro library doesn't know anything about the registry either.)
The only things configurable about the console consumer are the formatter and the registry. You can add decoders by additionally exporting them onto the CLASSPATH.
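For example, a typical invocation (topic name and addresses are placeholders) that dumps the decoded records to a file:
kafka-avro-console-consumer --bootstrap-server localhost:9092 \
  --topic prod-topic --from-beginning \
  --property schema.registry.url=http://localhost:8081 > topic.txt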
in such a format that you can re-read it from Java?
Why not just write a Kafka consumer in Java? See the Schema Registry documentation.
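A minimal sketch of such a consumer, assuming the Confluent kafka-avro-serializer dependency is on the classpath; the broker, registry, topic, and group names are placeholders:
import io.confluent.kafka.serializers.KafkaAvroDeserializer;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class AvroTopicDumper {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");      // placeholder broker
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "avro-dump");                    // throwaway group id
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, KafkaAvroDeserializer.class.getName());
        props.put("schema.registry.url", "http://localhost:8081");                 // placeholder registry

        try (KafkaConsumer<String, GenericRecord> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("prod-topic"));           // placeholder topic
            // One bounded poll for a quick dump; loop if you need the whole topic
            ConsumerRecords<String, GenericRecord> records = consumer.poll(Duration.ofSeconds(5));
            for (ConsumerRecord<String, GenericRecord> record : records) {
                System.out.println(record.value());                                // GenericRecord prints as JSON-like text
            }
        }
    }
}
Each value comes back as an Avro GenericRecord, so you can also pull individual fields with record.value().get("fieldName") instead of printing the whole record.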
package and place that code in some production server
Not entirely sure why this is a problem. If you could SSH proxy or VPN into the production network, then you don't need to deploy anything there.
How do you export this data
Since you're using the Schema Registry, I would suggest using one of the Kafka Connect libraries
Included ones are for Hadoop, S3, Elasticsearch, and JDBC. I think there's a FileSink Connector as well
We didn't find a simple way to reset the consumer offset
The connector name controls whether a new consumer group is formed in distributed mode. You only need a single consumer, so I would suggest a standalone connector, where you can set the offset.storage.file.filename property to control how the offsets are stored.
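A sketch of the relevant standalone worker properties (addresses and file path are placeholders):
bootstrap.servers=localhost:9092
key.converter=io.confluent.connect.avro.AvroConverter
key.converter.schema.registry.url=http://localhost:8081
value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=http://localhost:8081
offset.storage.file.filename=/tmp/connect-dump.offsets
This offsets file is only used by the standalone worker; in distributed mode offsets go to internal Kafka topics instead.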
KIP-199 discusses resetting consumer offsets for Connect, but the feature isn't implemented.
However, did you see Kafka 0.11 how to reset offsets?
Alternative options include Apache NiFi or StreamSets; both integrate with the Schema Registry and can parse Avro data to transport it to numerous systems.
One option to consider, along with cricket_007's, is to simply replicate data from one cluster to another. You can use Apache Kafka MirrorMaker to do this, or Replicator from Confluent. Both give the option of selecting certain topics to be replicated from one cluster to another, such as a test environment.