I need a kafka sink connector allowing the user to persist topic content as .CSV files. I've been investigating a while.
Confluent provides FileSink connector which - as far as I understood - doesn't support CSV format.
I've been playing a bit with
this project but the sink task seems not implemented. Alongside, this one actually implements just the Source part.
Does a project with this capability currently exists?
Related
Can someone help me with the best possible way to dump data from a kafka topic to Google Cloud storage?
I would like to build a near real time pipeline which is capable to creating multiple files based on time or size cuts.
Data in kafka topic is in JSON format.
Confluent offers a GCS Kafka Connect sink, but you could also try using Google DataFlow / Apache Beam to do the same.
Is Kafka Spool Directory connector suitable for loading streaming data (log) into Kafka in production ? Can it be run in distributed mode ? Is there any other connector that can be used since filestream source connector is not suitable for production ?
Does this match your requirements?
provides the capability to watch a directory for files and read the data as new files are written to the input directory.
Do you have CSV or JSON files?
If so, then you can use the Spooldir connector
It can be argued that something like Flume, Logastash, Filebeat, FluentD, Syslog, GELF, or other log solutions are more appropriately suited for your purposes of collecting logs into Kafka
I am fetching JSON data from a Kafka topic. I need to dump this data onto GCS (Google Cloud Storage) into a directory, wherein the directory name will be fetched from the value of "ID" in the JSON data.
I googled and did not find any similar use case wherein Kafka Connect can be used to interpret the JSON data and create directories dynamically based on the value from the JSON data.
Can this be achieved using Kafka Connect?
You can use Kafka Connect GCS sink connector which is provided by Confluent.
The Google Cloud Storage (GCS) connector, currently available as a
sink, allows you to export data from Kafka topics to GCS objects in
various formats. In addition, for certain data layouts, GCS connector
exports data by guaranteeing exactly-once delivery semantics to
consumers of the GCS objects it produces.
Here's an example configuration for the connector:
name=gcs-sink
connector.class=io.confluent.connect.gcs.GcsSinkConnector
tasks.max=1
topics=gcs_topic
gcs.bucket.name=#bucket-name
gcs.part.size=5242880
flush.size=3
gcs.credentials.path=#/path/to/credentials/keys.json
storage.class=io.confluent.connect.gcs.storage.GcsStorage
format.class=io.confluent.connect.gcs.format.avro.AvroFormat
partitioner.class=io.confluent.connect.storage.partitioner.DefaultPartitioner
schema.compatibility=BACKWARD
confluent.topic.bootstrap.servers=localhost:9092
confluent.topic.replication.factor=1
# Uncomment and insert license for production use
# confluent.license=
You can find more details for installation and configuration in the link I've provided above.
This isn't really possible out-of-the-box using most connectors. Instead, you can implement your own Kafka Connect sink task that processes Kafka records and then writes them to the correct GCS directories based on your JSON.
Here's the method you'd override in the connector.
Here's a link to the source code for the AWS S3 sink connector.
We need to export production data from a Kafka topic to use it for testing purposes: the data is written in Avro and the schema is placed on the Schema registry.
We tried the following strategies:
Using kafka-console-consumer with StringDeserializer or BinaryDeserializer. We were unable to obtain a file which we could parse in Java: we always got exceptions when parsing it, suggesting the file was in the wrong format.
Using kafka-avro-console-consumer: it generates a json which includes also some bytes, for example when deserializing BigDecimal. We didn't even know which parsing option to choose (it is not avro, it is not json)
Other unsuitable strategies:
deploying a special kafka consumer would require us to package and place that code in some production server, since we are talking about our production cluster. It is just too long. After all, isn't kafka console consumer already a consumer with configurable options?
Potentially suitable strategies
Using a kafka connect Sink. We didn't find a simple way to reset the consumer offset since apparently the connector created consumer is still active even when we delete the sink
Isn't there a simply, easy way to dump the content of the value (not the schema) of a Kafka topic containing avro data to a file so that it can be parsed? I expect this to be achievable using kafka-console-consumer with the right options, plus using the correct Java Api of Avro.
for example, using kafka-console-consumer... We were unable to obtain a file which we could parse in Java: we always got exceptions when parsing it, suggesting the file was in the wrong format.
You wouldn't use regular console consumer. You would use kafka-avro-console-consumer which deserializes the binary avro data into json for you to read on the console. You can redirect > topic.txt to the console to read it.
If you did use the console consumer, you can't parse the Avro immediately because you still need to extract the schema ID from the data (4 bytes after the first "magic byte"), then use the schema registry client to retrieve the schema, and only then will you be able to deserialize the messages. Any Avro library you use to read this file as the console consumer writes it expects one entire schema to be placed at the header of the file, not only an ID pointing to anything in the registry at every line. (The basic Avro library doesn't know anything about the registry either)
The only thing configurable about the console consumer is the formatter and the registry. You can add decoders by additionally exporting them into the CLASSPATH
in such a format that you can re-read it from Java?
Why not just write a Kafka consumer in Java? See Schema Registry documentation
package and place that code in some production server
Not entirely sure why this is a problem. If you could SSH proxy or VPN into the production network, then you don't need to deploy anything there.
How do you export this data
Since you're using the Schema Registry, I would suggest using one of the Kafka Connect libraries
Included ones are for Hadoop, S3, Elasticsearch, and JDBC. I think there's a FileSink Connector as well
We didn't find a simple way to reset the consumer offset
The connector name controls if a new consumer group is formed in distributed mode. You only need a single consumer, so I would suggest standalone connector, where you can set offset.storage.file.filename property to control how the offsets are stored.
KIP-199 discusses reseting consumer offsets for Connect, but feature isn't implemented.
However, did you see Kafka 0.11 how to reset offsets?
Alternative options include Apache Nifi or Streamsets, both integrate into the Schema Registry and can parse Avro data to transport it to numerous systems
One option to consider, along with cricket_007's, is to simply replicate data from one cluster to another. You can use Apache Kafka Mirror Maker to do this, or Replicator from Confluent. Both give the option of selecting certain topics to be replicated from one cluster to another- such as a test environment.
I have a system pushing Avro data in to multiple Kafka topics.
I want to push that data to HDFS. I came across confluent but am not sure how can I send data to HDFS without starting kafka-avro-console-producer.
Steps I performed:
I have my own Kafka and ZooKeeper running so i just started schema registry of confluent.
I started kafka-connect-hdfs after changing topic name.
This step is also successful. It's able to connect to HDFS.
After this I started pushing data to Kafka but the messages were not being pushed to HDFS.
Please help. I'm new to Confluent.
You can avoid using the kafka-avro-console-producer and use your own producer to send messages to the topics, but we strongly encourage you to use the Confluent Schema Registry (https://github.com/confluentinc/schema-registry) to manage your schemas and use the Avro serializer that is bundled with the Schema Registry to keep your Avro data consistent. There's a nice writeup on the rationale for why this is a good idea to do here.
If you are able to send messages that were produced with the kafka-avro-console-producer to HDFS, then your problem is likely in the kafka-connect-hdfs connector not being able to deserialize the data. I assume you are going through the quickstart guide. The best results will come from you using the same serializer on both sides (in and out of Kafka) if you are intending to write Avro to HDFS. How this process works is described in this documentation.