Kafka Connect FileStreamSink connector removes quotation marks and changes colon to equal sign for JSON message

Summary
When I stream this with the console producer
{"id":1337,"status":"example_topic_1 success"}
I get this from my FileStreamSink connector:
/data/example_topic_1.txt
{id=1337, status=example_topic_1 success}
This is a major problem for me, because the original JSON message cannot be recovered without making assumptions about where the quotes used to be. How can I output the messages to a file, while preserving the quotation marks?
Details
First, I start my file sink connector.
# sh bin/connect-standalone.sh \
> config/worker.properties \
> config/connect-file-sink-example_topic_1.properties
Second, I start the console consumer (also built into Kafka) so that I have easy visual confirmation that the messages are coming through correctly.
# sh bin/kafka-console-consumer.sh \
> --bootstrap-server kafka_broker:9092 \
> --topic example_topic_1
Finally, I start a console producer for sending messages, and I enter a message.
# sh bin/kafka-console-producer.sh \
> --broker-list kafka_broker:9092 \
> --topic example_topic_1
From the console consumer, the message pops out correctly, with quotes.
{"id":1337,"status":"example_topic_1 success"}
But I get this from the FileStreamSink connector:
/data/example_topic_1.txt
{id=1337, status=example_topic_1 success}
My Configuration
config/worker.properties
offset.storage.file.filename=/tmp/example.offsets
bootstrap.servers=kafka_broker:9092
offset.flush.interval.ms=10000
key.converter=org.apache.kafka.connect.storage.StringConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
value.converter.schemas.enable=false
config/connect-file-sink-example_topic_1.properties
name=file-sink-example_topic_1
connector.class=FileStreamSink
tasks.max=1
file=/data/example_topic_1.txt
topics=example_topic_1

Since you don't actually want to parse the JSON data, but just pass it straight through as a lump of text, you need to use the StringConverter:
key.converter=org.apache.kafka.connect.storage.StringConverter
value.converter=org.apache.kafka.connect.storage.StringConverter
This article explains more about the nuances of converters: https://rmoff.net/2019/05/08/when-a-kafka-connect-converter-is-not-a-converter/. It shows an example of what you're trying to do, although it uses kafkacat in place of the console producer/consumer.
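For completeness, a minimal sketch of how the corrected config/worker.properties could look (same values as in the question; only the converter lines change, and value.converter.schemas.enable is no longer needed because nothing gets parsed):
offset.storage.file.filename=/tmp/example.offsets
bootstrap.servers=kafka_broker:9092
offset.flush.interval.ms=10000
key.converter=org.apache.kafka.connect.storage.StringConverter
value.converter=org.apache.kafka.connect.storage.StringConverter
With that in place the sink writes the raw message value to /data/example_topic_1.txt unchanged, quotes and all.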

Related

unable to read avro message via kafka-avro-console-consumer (end goal read it via spark streaming)

(End goal) Before trying out whether I could eventually read Avro data out of the Confluent Platform using Spark streaming, like some have described here: Integrating Spark Structured Streaming with the Confluent Schema Registry,
I wanted to verify whether I could use the command below to read the messages:
$ kafka-avro-console-consumer \
> --topic my-topic-produced-using-file-pulse-xml \
> --from-beginning \
> --bootstrap-server localhost:9092 \
> --property schema.registry.url=http://localhost:8081
I receive this error message, "Unknown magic byte":
Processed a total of 1 messages
[2020-09-10 12:59:54,795] ERROR Unknown error when running consumer: (kafka.tools.ConsoleConsumer$:76)
org.apache.kafka.common.errors.SerializationException: Unknown magic byte!
Note: the message can be read like this (using kafka-console-consumer instead of kafka-avro-console-consumer):
kafka-console-consumer \
--bootstrap-server localhost:9092 --group my-group-console \
--from-beginning \
--topic my-topic-produced-using-file-pulse-xml
The message was produced using Confluent Connect File Pulse (1.5.2) reading an XML file (streamthoughts/kafka-connect-file-pulse).
Please help here:
Did I use kafka-avro-console-consumer wrong?
I tried the "deserializer" properties options described here: https://stackoverflow.com/a/57703102/4582240, but they did not help.
I have not been brave enough to start Spark streaming to read the data yet.
The File Pulse 1.5.2 properties I used are below (added 11/09/2020 for completeness).
name=connect-file-pulse-xml
connector.class=io.streamthoughts.kafka.connect.filepulse.source.FilePulseSourceConnector
topic= my-topic-produced-using-file-pulse-xml
tasks.max=1
# File types
fs.scan.filters=io.streamthoughts.kafka.connect.filepulse.scanner.local.filter.RegexFileListFilter
file.filter.regex.pattern=.*\\.xml$
task.reader.class=io.streamthoughts.kafka.connect.filepulse.reader.XMLFileInputReader
force.array.on.fields=sometagNameInXml
# File scanning
fs.cleanup.policy.class=io.streamthoughts.kafka.connect.filepulse.clean.LogCleanupPolicy
fs.scanner.class=io.streamthoughts.kafka.connect.filepulse.scanner.local.LocalFSDirectoryWalker
fs.scan.directory.path=/tmp/kafka-connect/xml/
fs.scan.interval.ms=10000
# Internal Reporting
internal.kafka.reporter.bootstrap.servers=localhost:9092
internal.kafka.reporter.id=connect-file-pulse-xml
internal.kafka.reporter.topic=connect-file-pulse-status
# Track file by name
offset.strategy=name
If you are getting Unknown Magic Byte with the consumer, then the producer didn't use the Confluent AvroSerializer, and might have pushed Avro data that doesn't use the Schema Registry.
Without seeing the Producer code or consuming and inspecting the data in binary format, it is difficult to know which is the case.
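One quick way to check is to dump the raw bytes of one message and look at how it starts. A rough sketch, using the local broker from your commands; if the value had been written with the Confluent Avro serializer, it would begin with a 0x00 magic byte followed by a 4-byte schema ID:
kafka-console-consumer --bootstrap-server localhost:9092 \
  --topic my-topic-produced-using-file-pulse-xml \
  --from-beginning --max-messages 1 | hexdump -C | head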
The message was produced using confluent connect file-pulse
Did you use value.converter with the AvroConverter class?
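If the goal really is Confluent-Avro on the topic, the connector (or worker) configuration would need something along these lines; a sketch, assuming the Confluent AvroConverter is installed on the Connect worker and the Schema Registry is at localhost:8081:
value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=http://localhost:8081
Only with that converter in place will kafka-avro-console-consumer be able to deserialize the records.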

Missing required argument "[zookeeper]"

I'm trying to start a consumer using Apache Kafka. It used to work well, but I had to format my PC and reinstall everything, and now when I try to run this:
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --from-beginning
this is what I'm getting:
Missing required argument "[zookeeper]"
Option Description
------ -----------
--blacklist <blacklist> Blacklist of topics to exclude from
consumption.
--bootstrap-server <server to connect
to>
--consumer.config <config file> Consumer config properties file.
--csv-reporter-enabled If set, the CSV metrics reporter will
be enabled
--delete-consumer-offsets If specified, the consumer path in
zookeeper is deleted when starting up
--formatter <class> The name of a class to use for
formatting kafka messages for
display. (default: kafka.tools.
DefaultMessageFormatter)
--from-beginning If the consumer does not already have
an established offset to consume
from, start with the earliest
message present in the log rather
than the latest message.
--key-deserializer <deserializer for
key>
--max-messages <Integer: num_messages> The maximum number of messages to
consume before exiting. If not set,
consumption is continual.
--metrics-dir <metrics directory> If csv-reporter-enable is set, and
this parameter isset, the csv
metrics will be outputed here
--new-consumer Use the new consumer implementation.
--property <prop>
--skip-message-on-error If there is an error when processing a
message, skip it instead of halt.
--timeout-ms <Integer: timeout_ms> If specified, exit if no message is
available for consumption for the
specified interval.
--topic <topic> The topic id to consume on.
--value-deserializer <deserializer for
values>
--whitelist <whitelist> Whitelist of topics to include for
consumption.
--zookeeper <urls> REQUIRED: The connection string for
the zookeeper connection in the form
host:port. Multiple URLS can be
given to allow fail-over.
My guess is that there's some kind of problem with the ZooKeeper connection port, because it's telling me to specify the port that ZooKeeper has to use to connect to Kafka. I'm not sure of this, though, and I don't know how to figure out which port to specify if that is the problem. Any suggestions?
Thanks in advance for the help.
It looks like you are using an old version of the Kafka tools that requires you to set --new-consumer if you want to connect directly to the brokers.
I'd recommend picking a recent version of Kafka so you only need to specify --bootstrap-server like in your example: http://kafka.apache.org/downloads
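For reference, a sketch of the different invocations (flag names depend on the Kafka version; the ZooKeeper address is assumed to be the default localhost:2181):
# Old tools (0.9/0.10): either the old ZooKeeper-based consumer...
bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic test --from-beginning
# ...or opt in to the new consumer explicitly
bin/kafka-console-consumer.sh --new-consumer --bootstrap-server localhost:9092 --topic test --from-beginning
# Recent tools: --bootstrap-server alone is enough
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --from-beginning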

Kafka streams word count application

I'm playing around with the Kafka Streams API (Kafka version: 0.10.2.0), trying to make a simple word count example work: Wordcount App gist. I'm running both the console producer and console consumer:
./kafka-console-producer.sh -topic input-topic --broker-list localhost:9092
./kafka-console-consumer.sh --topic output-topic --bootstrap-server localhost:9092 --from-beginning
I start the application and everything seems to be working fine, but when I type some strings into the console producer, the consumer receives nothing at all. If I change the app to do a simple toUpperCase on the input, the consumer receives the stream (modified to upper case) fine:
//The following code works fine:
val uppercasedWithMapValues: KStream[String, String] = textLines.mapValues(_.toUpperCase())
uppercasedWithMapValues.to("output-topic")
Does anyone know why I'm receiving nothing from the word count example? Should I specify any serializer on the consumer? In my last test the console consumer processed the messages that I sent through the console but didn't show them; see the output below:
➜ bin ./kafka-console-consumer.sh \
--topic output-topic \
--bootstrap-server localhost:9092 \
--from-beginning
[2017-08-02 07:48:20,187]WARN Error while fetching metadata with correlation id 2 :
{output-topic=LEADER_NOT_AVAILABLE} (org.apache.kafka.clients.NetworkClient)
[2017-08-02 07:48:20,197] WARN The following subscribed topics are not assigned
to any members in the group console-consumer-91651 : [output-topic]
(org.apache.kafka.clients.consumer.internals.ConsumerCoordinator)
^CProcessed a total of 7 messages
KStream works because it doesn't use caching. For the KTable you have to wait a bit, or set cache.max.bytes.buffering to 0 (but not in production code!).
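A minimal sketch of where that setting would go, assuming the gist builds its configuration from a java.util.Properties object (shown as Java; the equivalent props.put call works the same from Scala, and the application id and bootstrap servers here are placeholders):
import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "wordcount-app");     // placeholder id
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
// Disable record caching so KTable updates are forwarded downstream immediately (testing only)
props.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, "0");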

Unable to read from file through Kafka producer

I am trying to read a file using the Kafka producer. ZooKeeper and the broker server are running. I am able to read input from the command prompt using the Kafka producer and consumer with the commands below:
Kafka Producer
kafka-console-producer --topic incoming --broker localhost:9092
Kafka Consumer
kafka-console-consumer --topic incoming --zookeeper localhost:2181
To read from a file, I tried the command line arguments below:
kafka-console-producer -–broker-list localhost:9092 -–topic incoming --new-producer < C:\abc.txt
but it produced below error -
û is not a recognized option
I googled the message; the advice is about correcting the producer command, which looks correct to me.
For Kafka 0.10 you don't need to pass the --new-producer flag. Note also that the "û is not a recognized option" error suggests the dashes in front of broker-list and topic are en-dashes rather than plain ASCII hyphens (which can happen when a command is copied from a formatted document), so retype them as plain --. The following command works for me:
kafka-console-producer.sh --broker-list localhost:9092 --topic incoming < C:\abc.txt

checking kafka data if compressed

The documentation says to add the line compression.codec=gzip in producer.properties to make the messages compressed.
However, when I open a data file such as 0000000000000.log, the data does not look like it is compressed. How should I check whether the data in Kafka is already compressed?
P.S.: Before every test I stop the Kafka cluster and ZooKeeper, delete all of the data in kafka-logs and ZooKeeper, then start the servers again and create a new topic.
The Java ProducerConfig class has changed for this config.
public static final String COMPRESSION_TYPE_CONFIG = "compression.type";
I've successfully produced messages with the Java client (0.8.2.1) using ProducerConfig.COMPRESSION_TYPE_CONFIG and it works fine.
Example:
props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "gzip");
You can also set compression.type=gzip at the broker level in server.properties, in which case the broker stores the data compressed regardless of what the producer sends.
Update for the CLI tool
Read the usage for the command line tools:
chrisblack:kafka:% ./bin/kafka-console-producer.sh
...
--compression-codec [compression-codec] The compression codec: either 'none',
'gzip', 'snappy', or 'lz4'.If
specified without value, then it
defaults to 'gzip'
...
--new-producer Use the new producer implementation.
--producer-property <producer_prop> A mechanism to pass user-defined
properties in the form key=value to
the producer.
--property <prop> A mechanism to pass user-defined
properties in the form key=value to
the message reader. This allows
custom configuration for a user-
defined message reader.
...
Try running a similar command from the shell:
./bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test_compression --compression-codec
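To then verify what actually landed on disk, the DumpLogSegments tool can decode the log segment and report the codec per message; a sketch, assuming a log directory of /tmp/kafka-logs and partition 0 of the topic above (adjust the path to your setup):
./bin/kafka-run-class.sh kafka.tools.DumpLogSegments \
  --deep-iteration \
  --files /tmp/kafka-logs/test_compression-0/00000000000000000000.log
Compressed messages should show up with a compresscodec entry such as GZIPCompressionCodec instead of NoCompressionCodec.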