How to dump Avro data from a Kafka topic and read it back in Java/Scala

We need to export production data from a Kafka topic to use it for testing purposes: the data is written in Avro and the schema is stored in the Schema Registry.
We tried the following strategies:
Using kafka-console-consumer with StringDeserializer or BinaryDeserializer. We were unable to obtain a file which we could parse in Java: we always got exceptions when parsing it, suggesting the file was in the wrong format.
Using kafka-avro-console-consumer: it generates JSON that also includes some raw bytes, for example when deserializing a BigDecimal. We didn't even know which parser to choose (the output is not Avro, and it is not plain JSON).
Other unsuitable strategies:
Deploying a special Kafka consumer would require us to package that code and place it on some production server, since we are talking about our production cluster. That would take too long. After all, isn't the Kafka console consumer already a consumer with configurable options?
Potentially suitable strategies:
Using a Kafka Connect sink. We didn't find a simple way to reset the consumer offset, since the consumer created by the connector apparently stays active even after we delete the sink.
Isn't there a simple, easy way to dump the content of the value (not the schema) of a Kafka topic containing Avro data to a file so that it can be parsed? I expect this to be achievable using kafka-console-consumer with the right options, plus the correct Avro Java API.

for example, using kafka-console-consumer... We were unable to obtain a file which we could parse in Java: we always got exceptions when parsing it, suggesting the file was in the wrong format.
You wouldn't use the regular console consumer. You would use kafka-avro-console-consumer, which deserializes the binary Avro data into JSON for you to read on the console. You can redirect the output to a file by appending > topic.txt to the command.
If you did use the console consumer, you can't parse the Avro immediately because you still need to extract the schema ID from the data (the 4 bytes after the first "magic byte"), then use the Schema Registry client to retrieve the schema, and only then will you be able to deserialize the messages. Any Avro library you use to read the file that the console consumer writes expects one entire schema in the file header, not just an ID on every line that points into the registry. (The basic Avro library doesn't know anything about the registry either.)
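To make the header layout concrete, here is a minimal sketch in plain Java, assuming the usual Schema Registry framing of one magic byte (0) followed by a 4-byte big-endian schema ID and then the Avro binary body. The class and method names (WireFormatHeader, schemaId, avroPayload) are hypothetical, and the sketch stops before the registry lookup and the actual Avro decoding.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Hypothetical helper: splits a registry-framed message into schema ID and Avro body.
final class WireFormatHeader {
    static final byte MAGIC_BYTE = 0x0; // assumed framing marker
    static final int HEADER_SIZE = 5;   // 1 magic byte + 4-byte schema ID

    // Returns the schema ID stored in bytes 1..4 of the message.
    static int schemaId(byte[] message) {
        if (message == null || message.length < HEADER_SIZE) {
            throw new IllegalArgumentException("Message too short for a 5-byte header");
        }
        ByteBuffer buffer = ByteBuffer.wrap(message); // ByteBuffer is big-endian by default
        if (buffer.get() != MAGIC_BYTE) {
            throw new IllegalArgumentException("Unknown magic byte");
        }
        return buffer.getInt();
    }

    // Returns the bytes after the header; these still need the writer schema to be decoded.
    static byte[] avroPayload(byte[] message) {
        schemaId(message); // reuse the header validation
        return Arrays.copyOfRange(message, HEADER_SIZE, message.length);
    }

    public static void main(String[] args) {
        byte[] message = {0, 0, 0, 0, 42, 2, 4}; // made-up bytes: header for ID 42, then 2 body bytes
        System.out.println(schemaId(message));
        System.out.println(avroPayload(message).length);
    }
}
```

With the ID in hand, the schema would still have to be fetched from the registry before the remaining bytes can be handed to an Avro decoder. Note also that a text dump from the console consumer may not preserve these bytes exactly, so the sketch is best applied to the raw byte[] values returned by a consumer.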
The only thing configurable about the console consumer is the formatter and the registry. You can add decoders by additionally exporting them into the CLASSPATH
in such a format that you can re-read it from Java?
Why not just write a Kafka consumer in Java? See Schema Registry documentation
package and place that code in some production server
Not entirely sure why this is a problem. If you could SSH proxy or VPN into the production network, then you don't need to deploy anything there.
How do you export this data
Since you're using the Schema Registry, I would suggest using one of the Kafka Connect libraries
Included ones are for Hadoop, S3, Elasticsearch, and JDBC. I think there's a FileSink Connector as well
We didn't find a simple way to reset the consumer offset
The connector name controls if a new consumer group is formed in distributed mode. You only need a single consumer, so I would suggest standalone connector, where you can set offset.storage.file.filename property to control how the offsets are stored.
KIP-199 discusses resetting consumer offsets for Connect, but the feature isn't implemented.
However, have you seen how to reset offsets in Kafka 0.11?
Alternative options include Apache NiFi or StreamSets; both integrate with the Schema Registry and can parse Avro data to transport it to numerous systems.

One option to consider, along with cricket_007's, is to simply replicate data from one cluster to another. You can use Apache Kafka MirrorMaker to do this, or Replicator from Confluent. Both give the option of selecting certain topics to be replicated from one cluster to another, such as a test environment.

Related

Sending Avro messages to Kafka

I have an app that produces an array of messages in raw JSON periodically. I was able to convert that to Avro using the avro-tools. I did that because I needed the messages to include schema due to the limitations of Kafka-Connect JDBC sink. I can open this file on notepad++ and see that it includes the schema and a few lines of data.
Now I would like to send this to my central Kafka Broker and then use Kafka Connect JDBC sink to put the data in a database. I am having a hard time understanding how I should be sending these Avro files I have to my Kafka Broker. Do I need a schema registry for my purposes? I believe Kafkacat does not support Avro so I suppose I will have to stick with the kafka-producer.sh that comes with the Kafka installation (please correct me if I am wrong).
Question is: Can someone please share the steps to produce my Avro file to a Kafka broker without getting Confluent involved?
Thanks,
To use the Kafka Connect JDBC Sink, your data needs an explicit schema. The converter that you specify in your connector configuration determines where the schema is held. This can either be embedded within the JSON message (org.apache.kafka.connect.json.JsonConverter with schemas.enable=true) or held in the Schema Registry (one of io.confluent.connect.avro.AvroConverter, io.confluent.connect.protobuf.ProtobufConverter, or io.confluent.connect.json.JsonSchemaConverter).
To learn more about this see https://www.confluent.io/blog/kafka-connect-deep-dive-converters-serialization-explained
To write an Avro message to Kafka you should serialise it as Avro and store the schema in the Schema Registry. There is a Go client library to use with examples.
without getting Confluent involved.
It's not entirely clear what you mean by this. The Kafka Connect JDBC Sink is written by Confluent. The best way to manage schemas is with the Schema Registry. If you don't want to use the Schema Registry then you can embed the schema in your JSON message but it's a suboptimal way of doing things.
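As an illustration of the embedded-schema option, here is a minimal sketch in plain Java of the envelope that JsonConverter expects when schemas are enabled: a top-level "schema" object describing the fields and a "payload" object carrying the values. The record name example.Record, the two fields, and the class JsonEnvelopeSketch are made up for the example, and the string building stands in for a proper JSON library.

```java
// Hypothetical sketch: builds one JSON message with its schema embedded.
final class JsonEnvelopeSketch {

    // Returns a {"schema": ..., "payload": ...} envelope for a made-up two-field record.
    static String envelope(int id, String name) {
        String schema = "{\"type\":\"struct\",\"optional\":false,\"name\":\"example.Record\",\"fields\":["
                + "{\"field\":\"id\",\"type\":\"int32\",\"optional\":false},"
                + "{\"field\":\"name\",\"type\":\"string\",\"optional\":true}]}";
        String payload = "{\"id\":" + id + ",\"name\":\"" + escape(name) + "\"}";
        return "{\"schema\":" + schema + ",\"payload\":" + payload + "}";
    }

    // Escapes only backslashes and double quotes; a real producer should use a JSON library.
    static String escape(String text) {
        return text.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    public static void main(String[] args) {
        System.out.println(envelope(1, "alice"));
    }
}
```

Every message repeats the full schema, which is the main reason this approach is considered less efficient than keeping the schema in a registry.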

In a Kafka connector, how do I get the bootstrap server address my Kafka Connect is currently using?

I'm developing a Kafka sink connector on my own. My deserializer is JsonConverter. However, when someone sends invalid JSON data to my connector's topic, I want to skip this record and forward it to a specific topic of my company.
My confusion is: I can't find any API to get my Connect worker's bootstrap.servers. (I know it's in Confluent's etc directory, but hard-coding the path to "connect-distributed.properties" just to get bootstrap.servers is not a good idea.)
So my question is: is there another way to get the value of bootstrap.servers conveniently in my connector program?
Instead of trying to send the "bad" records from a SinkTask to Kafka, you should use the dead letter queue feature that was added in Kafka Connect 2.0.
You can configure the Connect runtime to automatically dump records that failed to be processed to a configured topic acting as a DLQ.
For more details, see the KIP that added this feature.

Problems with Avro deserialization in Kafka sink connectors

I'm trying to read data from DB2 using Kafka and then to write it to HDFS. I use distributed confluent platform with standard JDBC and HDFS connectors.
As the HDFS connector needs to know the schema, it requires avro data as an input. Thus, I have to specify the following avro converters for the data fed to Kafka (in etc/kafka/connect-distributed.properties):
key.converter=io.confluent.connect.avro.AvroConverter
key.converter.schema.registry.url=http://localhost:8081
value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=http://localhost:8081
I then run my JDBC connector and check with the console-avro-consumer that I can successfully read the data fetched from the DB2.
However, when I launch the HDFS Connector, it does not work anymore. Instead, it outputs SerializationException:
Error deserializing Avro message for id -1
... Unknown magic byte!
To check if this is a problem with the HDFS connector, I tried to use a simple FileSink connector instead. However, I saw exactly the same exception when using the FileSink (and the file itself was created but stayed empty).
I then carried out the following experiment: Instead of using avro converter for the key and value I used json converters:
key.converter=org.apache.kafka.connect.json.JsonConverter
key.converter.schema.enable=false
value.converter=org.apache.kafka.connect.json.JsonConverter
value.converter.schema.enable=false
This fixed the problem with the FileSink connector, i.e., the whole pipeline from DB2 to the file worked fine. However, for the HDFS connector this solution is infeasible as the connector needs the schema and consequently avro format as an input.
It feels to me that the deserialization of avro format in the sink connectors is not implemented properly as the console-avro-consumer can still successfully read the data.
Does anyone have any idea of what could be the reason of this behavior? I'd also appreciate an idea of a simple fix for this!
check with the console-avro-consumer that I can successfully read the data fetched
I'm guessing you didn't add --property print.key=true --from-beginning when you did that.
It's possible that the latest values are Avro, but Connect is clearly failing somewhere on the topic, so you need to scan it to find out where that happens.
If using JsonConverter works, and the data is actually readable JSON on disk, then it sounds like the JDBC Connector actually wrote JSON, not Avro.
If you are able to pinpoint the offset of the bad message, you can use the regular console consumer with the connector group id set, then add --max-messages along with a partition and offset specified to skip those events.
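One way to picture the scan is the following minimal sketch in plain Java. It assumes the raw keys or values have already been read as byte arrays (for instance with a consumer using a byte-array deserializer) and reports the first entry that does not start with the magic byte 0 that the Avro deserializer checks for. The names MagicByteScan, looksLikeRegistryAvro and firstSuspectIndex are hypothetical, and treating null entries (tombstones) as harmless is an assumption of the sketch.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.OptionalInt;

// Hypothetical helper: finds the first record that cannot be registry-framed Avro.
final class MagicByteScan {

    // True when the bytes carry the 5-byte header (magic byte 0 plus a 4-byte schema ID).
    static boolean looksLikeRegistryAvro(byte[] data) {
        return data != null && data.length >= 5 && data[0] == 0x0;
    }

    // Returns the index of the first non-null entry that fails the header check, if any.
    static OptionalInt firstSuspectIndex(List<byte[]> records) {
        for (int i = 0; i < records.size(); i++) {
            byte[] data = records.get(i);
            if (data != null && !looksLikeRegistryAvro(data)) { // null entries are skipped (assumption)
                return OptionalInt.of(i);
            }
        }
        return OptionalInt.empty();
    }

    public static void main(String[] args) {
        List<byte[]> records = Arrays.asList(
                new byte[] {0, 0, 0, 0, 1, 2},                     // framed Avro with made-up bytes
                "{\"id\":1}".getBytes(StandardCharsets.UTF_8));    // plain JSON, starts with '{'
        System.out.println(firstSuspectIndex(records));
    }
}
```

In a real scan the list index would be replaced by the partition and offset of each record, and the same check would be run on keys as well as values, since a non-Avro key is enough to make the converter fail.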

Kafka-connect sink task ignores file offset storage property

I'm experiencing quite weird behavior working with Confluent JDBC connector. I'm pretty sure that it's not related to Confluent stack, but to Kafka-connect framework itself.
So, I define offset.storage.file.filename property as default /tmp/connect.offsets and run my sink connector. Obviously, I expect connector to persist offsets in the given file (it doesn't exist on file system, but it should be automatically created, right?). Documentation says:
offset.storage.file.filename
The file to store connector offsets in. By storing offsets on disk, a standalone process can be stopped and started on a single node and resume where it previously left off.
But Kafka behaves in a completely different manner.
It checks if the given file exists.
If it does not, Kafka just ignores it and persists offsets in a Kafka topic.
If I create the given file manually, reading fails anyway (EOFException) and offsets are persisted in the topic again.
Is it a bug or, more likely, do I not understand how to work with this configuration? I understand the difference between the two approaches to persisting offsets, and file storage is more convenient for my needs.
The offset.storage.file.filename is only used in source connectors, in standalone mode. It is used to place a bookmark on the input data source and remember where it stopped reading it. The created file contains something like the file line number (for a file source) or a table row number (for jdbc source or databases in general).
When running Kafka Connect in distributed mode, this file is replaced by a Kafka topic named by default connect-offsets which should be replicated in order to tolerate failures.
As far as sink connectors are concerned, no matter which plugin or mode (standalone/distributed) is used, they all store where they last stopped reading their input topic in an internal topic named __consumer_offsets, like any Kafka consumer. This makes it possible to use traditional tools such as the kafka-consumer-groups.sh command-line tool to see how much the sink connector is lagging.
The Confluent Kafka replicator, despite being a source connector, is probably an exception because it reads from a remote Kafka and may use a Kafka consumer, but only one cluster will maintain those original consumer group offsets.
I agree that the documentation is not clear: this setting is required whatever the connector type is (source or sink), but it is only used by source connectors. The reason behind this design decision is that a single Kafka Connect worker (I mean a single JVM process) can run multiple connectors, potentially both source and sink connectors. Said differently, this setting is a worker-level setting, not a connector setting.
The property offset.storage.file.filename only applies to workers of source connectors running in standalone mode. If you are seeing Kafka persist offsets in a Kafka topic for a source, you are running in distributed mode. You should be launching your connector with the provided script connect-standalone. There's a description of the different modes here. Instructions on running in the different modes are here.

Unable to push Avro data to HDFS using Confluent Platform

I have a system pushing Avro data in to multiple Kafka topics.
I want to push that data to HDFS. I came across Confluent but am not sure how I can send data to HDFS without starting kafka-avro-console-producer.
Steps I performed:
I have my own Kafka and ZooKeeper running, so I just started Confluent's Schema Registry.
I started kafka-connect-hdfs after changing topic name.
This step is also successful. It's able to connect to HDFS.
After this I started pushing data to Kafka but the messages were not being pushed to HDFS.
Please help. I'm new to Confluent.
You can avoid using the kafka-avro-console-producer and use your own producer to send messages to the topics, but we strongly encourage you to use the Confluent Schema Registry (https://github.com/confluentinc/schema-registry) to manage your schemas and use the Avro serializer that is bundled with the Schema Registry to keep your Avro data consistent. There's a nice writeup on the rationale for why this is a good idea to do here.
If you are able to send messages that were produced with the kafka-avro-console-producer to HDFS, then your problem is likely in the kafka-connect-hdfs connector not being able to deserialize the data. I assume you are going through the quickstart guide. The best results will come from you using the same serializer on both sides (in and out of Kafka) if you are intending to write Avro to HDFS. How this process works is described in this documentation.