I'm trying to use this connector to pull messages from Kafka and write them to HDFS. It works fine as long as the file doesn't already exist, but if it does, it throws a FileAlreadyExistsException. Is there a way to append to an already-existing file using this connector? I'm using an HdfsFlow.dataWithPassThrough flow, and it takes an HdfsWritingSettings, but that only lets you set an overwrite boolean; there's no append option.
I have messages in Kafka as JSON, like
{"name":"abc"}
When I apply a sink connector with the JSON converter to the FileStream sink connector, I get messages written as
{name=abc}
which is not valid JSON. I tried the simple String converter, but it made no difference.
Can someone please help me with this? I want the message written to the file exactly as it appears in the topic.
FileStreamSink always writes the Connect Struct's toString() output, and it is not meant to be used in production use cases.
It does not support a format.class=JsonFormat setting the way the S3 or HDFS sinks do.
As a workaround, you could run Minio as an S3 replacement, or you could use a different sink connector altogether, depending on what you actually want to do with that data. For example, Mongo or JDBC sinks, which respectively offer their own export tooling and can search/analyze your data faster than flat files.
I have learned to use Kafka Connect to consume CSV files using the Confluent SpoolDir connector. Is there any way to use this connector (or does another such connector exist) for "Ctrl+A"-delimited files?
The records in the (source) files I want to use are separated by newlines, whereas the columns are separated by "Ctrl+A".
You could use the File Pulse connector to ingest the files. You'd just need to install it (for example with the Confluent Hub client) and set up the respective configuration.
You need to configure the DelimitedRowFilter filter so that the files are parsed correctly; it should be possible to use "Ctrl+A" as the separator there.
When everything is set up correctly, you can copy the files into the configured directory. At the configured scan interval the files are read and their content is written to a topic. A rough configuration sketch is shown below.
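A minimal, assumption-laden sketch of what such a source configuration might look like. The topic name is a placeholder, the directory/reader settings are omitted, and the separator property of DelimitedRowFilter has been renamed between File Pulse releases (separator vs. delimiter), so double-check everything against the version you install:

    name=ctrl-a-file-source
    connector.class=io.streamthoughts.kafka.connect.filepulse.source.FilePulseSourceConnector
    topic=ctrl-a-records
    # parse each line with the DelimitedRowFilter
    filters=ParseLine
    filters.ParseLine.type=io.streamthoughts.kafka.connect.filepulse.filter.DelimitedRowFilter
    # \u0001 is the Java properties escape for the Ctrl-A (SOH) character;
    # in some File Pulse versions this property is called "delimiter" instead
    filters.ParseLine.separator=\u0001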
I was able to do this with the SpoolDir connector by creating a SpoolDirCsvSourceConnector and setting csv.separator.char=01 (01 is the ASCII code for Ctrl-A) in the file-source properties.
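For reference, a rough sketch of such a file-source properties file; the paths, topic name, file pattern, and schema-generation setting here are placeholder assumptions rather than values taken from the question above:

    name=ctrl-a-spooldir-source
    connector.class=com.github.jcustenborder.kafka.connect.spooldir.SpoolDirCsvSourceConnector
    topic=ctrl-a-records
    input.path=/data/input
    finished.path=/data/finished
    error.path=/data/error
    input.file.pattern=.*\.txt
    # 01 is the ASCII code for the Ctrl-A (SOH) column separator
    csv.separator.char=01
    # let the connector infer the key/value schemas from the files
    schema.generation.enabled=true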
We are using the Kafka Connect S3 sink connector to consume from Kafka and load data into S3 buckets. Now I want to load the data from those S3 buckets into AWS Redshift using the COPY command, and for that I'm creating my own custom connector. The use case is: load the data that lands in S3 into Redshift synchronously, then have the S3 connector replace the existing file, and then have our custom connector load the new data into Redshift again.
How can I do this using Confluent Kafka Connect, or is there a better approach for the same task?
Thanks in advance!
If you want the data in Redshift, you should probably just use the JDBC Sink Connector and download the Redshift JDBC driver into the kafka-connect-jdbc directory.
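A minimal sketch of what that sink configuration could look like; the cluster endpoint, database, credentials, and topic name are placeholders, and you may need different insert.mode or schema settings for your data:

    name=redshift-jdbc-sink
    connector.class=io.confluent.connect.jdbc.JdbcSinkConnector
    topics=my-topic
    # placeholder Redshift endpoint; the Redshift JDBC driver jar must be on the connector's classpath
    connection.url=jdbc:redshift://my-cluster.abc123.us-east-1.redshift.amazonaws.com:5439/dev
    connection.user=awsuser
    connection.password=secret
    insert.mode=insert
    auto.create=true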
Otherwise, rather than writing a connector, you could use an S3 event notification to trigger a Lambda function that performs the Redshift upload.
Alternatively, if you are simply looking to query the S3 data, you could use Athena instead, without dealing with any databases.
But basically, sink connectors don't communicate with one another. They are independent tasks designed to consume from a topic and write to a destination, not necessarily to trigger external, downstream systems.
If you want synchronous behaviour from Kafka to Redshift, then the S3 sink connector is not the right option.
If you use the S3 sink connector, the data first lands in S3 and you then have to run the COPY command externally to push it into Redshift (the COPY command is extra overhead).
Also, no custom code or validation can run before the data is pushed to Redshift.
A Redshift sink based on the native JDBC library is comparably fast to the S3 COPY command.
We need to export production data from a Kafka topic to use it for testing purposes: the data is written in Avro, and the schema is registered in the Schema Registry.
We tried the following strategies:
Using kafka-console-consumer with a StringDeserializer or BinaryDeserializer. We were unable to obtain a file which we could parse in Java: we always got exceptions when parsing it, suggesting the file was in the wrong format.
Using kafka-avro-console-consumer: it generates JSON that also includes raw bytes, for example when deserializing BigDecimal values. We didn't even know which parser to choose (it is not Avro, and it is not valid JSON).
Other unsuitable strategies:
deploying a special Kafka consumer would require us to package and place that code on some production server, since we are talking about our production cluster. That just takes too long. After all, isn't kafka-console-consumer already a consumer with configurable options?
Potentially suitable strategies:
Using a Kafka Connect sink. We didn't find a simple way to reset the consumer offsets, since apparently the consumer created by the connector stays active even after we delete the sink.
Isn't there a simple, easy way to dump the content of the values (not the schema) of a Kafka topic containing Avro data to a file so that it can be parsed? I expect this to be achievable using kafka-console-consumer with the right options, plus the correct Avro Java API.
for example, using kafka-console-consumer... We were unable to obtain a file which we could parse in Java: we always got exceptions when parsing it, suggesting the file was in the wrong format.
You wouldn't use the regular console consumer. You would use kafka-avro-console-consumer, which deserializes the binary Avro data into JSON for you to read on the console. You can redirect its output to a file (> topic.txt) to read it later.
If you did use the plain console consumer, you can't parse the Avro immediately, because you still need to extract the schema ID from the data (the 4 bytes after the first "magic byte"), then use a Schema Registry client to retrieve the schema, and only then will you be able to deserialize the messages. Any Avro library you use to read the file as the console consumer writes it expects one complete schema at the header of the file, not just an ID pointing into the registry on every line. (The basic Avro library doesn't know anything about the registry, either.)
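To make that wire format concrete, here is a minimal Java sketch of the layout; the method and variable names are assumptions, and the registry lookup is left as a comment:

    import java.nio.ByteBuffer;

    public class WireFormatPeek {
        // "value" is assumed to be the raw byte[] value of one Kafka message
        // serialized with the Confluent Avro serializer.
        static void inspect(byte[] value) {
            ByteBuffer buffer = ByteBuffer.wrap(value);
            byte magicByte = buffer.get();   // always 0 for Confluent-serialized Avro
            int schemaId = buffer.getInt();  // the 4-byte schema ID, big-endian
            byte[] avroPayload = new byte[buffer.remaining()];
            buffer.get(avroPayload);         // the actual Avro binary body
            // You would now look up schemaId in the Schema Registry to get the
            // writer schema, and only then decode avroPayload with an Avro reader.
            System.out.println("magic=" + magicByte + ", schemaId=" + schemaId
                    + ", payloadBytes=" + avroPayload.length);
        }
    }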
The only things configurable about the console consumer are the formatter and the registry. You can add decoders by additionally exporting them onto the CLASSPATH.
in such a format that you can re-read it from Java?
Why not just write a Kafka consumer in Java? See the Schema Registry documentation.
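A minimal sketch of such a consumer, using the KafkaAvroDeserializer so the registry lookup and wire-format handling are done for you; the broker address, registry URL, group ID, and topic name are placeholders:

    import io.confluent.kafka.serializers.KafkaAvroDeserializer;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;

    public class AvroTopicDumper {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker:9092");            // placeholder broker
            props.put("group.id", "avro-dump");
            props.put("auto.offset.reset", "earliest");
            props.put("key.deserializer", StringDeserializer.class.getName());
            props.put("value.deserializer", KafkaAvroDeserializer.class.getName());
            props.put("schema.registry.url", "http://registry:8081"); // placeholder registry

            try (KafkaConsumer<String, GenericRecord> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("my-avro-topic")); // placeholder topic
                ConsumerRecords<String, GenericRecord> records = consumer.poll(Duration.ofSeconds(5));
                for (ConsumerRecord<String, GenericRecord> record : records) {
                    // GenericRecord#toString prints a JSON-like rendering of the value
                    System.out.println(record.value());
                }
            }
        }
    }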
package and place that code in some production server
Not entirely sure why this is a problem. If you could SSH proxy or VPN into the production network, then you don't need to deploy anything there.
How do you export this data
Since you're using the Schema Registry, I would suggest using one of the Kafka Connect libraries
Included ones are for Hadoop, S3, Elasticsearch, and JDBC. I think there's a FileStream sink connector as well.
We didn't find a simple way to reset the consumer offset
The connector name controls whether a new consumer group is formed in distributed mode. You only need a single consumer, so I would suggest a standalone connector, where you can set the offset.storage.file.filename property to control how the offsets are stored.
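A bare-bones sketch of a standalone worker configuration showing where that property lives; the broker, registry URL, converters, and file path are placeholders you would adapt:

    # connect-standalone worker properties (placeholders throughout)
    bootstrap.servers=broker:9092
    key.converter=org.apache.kafka.connect.storage.StringConverter
    value.converter=io.confluent.connect.avro.AvroConverter
    value.converter.schema.registry.url=http://registry:8081
    # standalone mode keeps connector offsets in this local file
    offset.storage.file.filename=/tmp/connect.offsets

You would then start it with connect-standalone worker.properties your-connector.properties.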
KIP-199 discusses resetting consumer offsets for Connect, but the feature isn't implemented.
However, did you see how to reset offsets in Kafka 0.11?
Alternative options include Apache NiFi or StreamSets; both integrate with the Schema Registry and can parse Avro data in order to transport it to numerous systems.
One option to consider, along with cricket_007's, is to simply replicate the data from one cluster to another. You can use Apache Kafka MirrorMaker to do this, or Replicator from Confluent. Both give you the option of selecting certain topics to be replicated from one cluster to another, such as a test environment.
I need to append streaming data into HDFS using Flume. Without overwriting the existing log file, I need to append the streaming data to an existing file in HDFS. Could you please provide links to the MR code for the same?
Flume does not overwrite existing data in an HDFS directory by default. That is because Flume saves incoming data to files whose names have the sink timestamp appended, such as
Flume.2345234523, so if you run Flume again against the same HDFS directory, it will simply create another file under the same HDFS path.
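For reference, a minimal sketch of an HDFS sink configuration showing the knobs that control file naming and rolling; the agent, channel, and sink names and the path are placeholders:

    # sketch: agent "a1" with HDFS sink "k1" attached to channel "c1" (names are placeholders)
    a1.sinks.k1.type = hdfs
    a1.sinks.k1.channel = c1
    a1.sinks.k1.hdfs.path = hdfs://namenode:8020/flume/events
    a1.sinks.k1.hdfs.filePrefix = events
    a1.sinks.k1.hdfs.fileType = DataStream
    # roll to a new file every 10 minutes or 128 MB, whichever comes first
    a1.sinks.k1.hdfs.rollInterval = 600
    a1.sinks.k1.hdfs.rollSize = 134217728
    a1.sinks.k1.hdfs.rollCount = 0

Each roll closes the current file and opens a new timestamped one, so existing files are never overwritten.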