Shows invalid characters while consuming using kafka console consumer - apache-kafka

While consuming from the Kafka topic using Kafka console consumer or kt(GoLang CLI tool for Kafka), I am getting invalid characters.
...
\u0000\ufffd?\u0006app\u0000\u0000\u0000\u0000\u0000\u0000\u003e#\u0001
\u0000\u000cSec-39\u001aSome Actual Value Text\ufffd\ufffd\ufffd\ufffd\ufffd
\ufffd\u0015#\ufffd\ufffd\ufffd\ufffd\ufffd\ufff
...
Even though Kafka connect can actually sink the proper data to an SQL database.

Given that you say
Kafka connect can actually sink the proper data to an SQL database.
my assumption would be that you're using Avro serialization for the data on the topic. Kafka Connect configured correctly will take the Avro data and deserialise it.
However, console tools such as kafka-console-consumer, kt, kafkacat et al do not support Avro, and so you get a bunch of weird characters if you use them to read data from a topic that is Avro-encoded.
To read Avro data to the command line you can use kafka-avro-console-consumer:
kafka-avro-console-consumer \
--bootstrap-server kafka:29092 \
--topic test_topic_avro \
--property schema.registry.url=http://schema-registry:8081
Edit: Adding a suggestion from @CodeGeas too:
Alternatively, reading data using REST Proxy can be done with the following:
# Create a consumer for JSON data
curl -X POST -H "Content-Type: application/vnd.kafka.v2+json" \
-H "Accept: application/vnd.kafka.v2+json" \
--data '{"name": "my_consumer_instance", "format": "avro", "auto.offset.reset": "earliest"}' \
http://kafka-rest-instance:8082/consumers/my_json_consumer
# Subscribe the consumer to a topic
curl -X POST -H "Content-Type: application/vnd.kafka.v2+json" \
--data '{"topics":["YOUR-TOPIC-NAME"]}' \
http://kafka-rest-instance:8082/consumers/my_json_consumer/instances/my_consumer_instance/subscription
# Then consume some data from a topic using the base URL in the first response.
curl -X GET -H "Accept: application/vnd.kafka.avro.v2+json" \
http://kafka-rest-instance:8082/consumers/my_json_consumer/instances/my_consumer_instance/records
To delete the consumer afterwards:
curl -X DELETE -H "Accept: application/vnd.kafka.avro.v2+json" \
http://kafka-rest-instance:8082/consumers/my_json_consumer/instances/my_consumer_instance

By default, the console consumer tool deserializes both the message key and value using the ByteArrayDeserializer, and then tries to print the data to the command line using the default formatter.
This tool, however, allows you to customize the deserializers and the formatter used. See the following extract from the help output:
--formatter <String: class> The name of a class to use for
formatting kafka messages for
display. (default: kafka.tools.
DefaultMessageFormatter)
--property <String: prop> The properties to initialize the
message formatter. Default
properties include:
print.timestamp=true|false
print.key=true|false
print.value=true|false
key.separator=<key.separator>
line.separator=<line.separator>
key.deserializer=<key.deserializer>
value.deserializer=<value.
deserializer>
Users can also pass in customized
properties for their formatter; more
specifically, users can pass in
properties keyed with 'key.
deserializer.' and 'value.
deserializer.' prefixes to configure
their deserializers.
--key-deserializer <String:
deserializer for key>
--value-deserializer <String:
deserializer for values>
Using these settings, you should be able to change the output to be what you want.
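For example, if the topic values had been written with Kafka's LongSerializer, an invocation could look roughly like this (broker address, topic name, and deserializer classes are illustrative assumptions, not taken from the question):
# Print keys as strings and values as longs (adjust the classes to match your data)
kafka-console-consumer \
--bootstrap-server localhost:9092 \
--topic my_topic \
--from-beginning \
--property print.key=true \
--key-deserializer org.apache.kafka.common.serialization.StringDeserializer \
--value-deserializer org.apache.kafka.common.serialization.LongDeserializer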

Related

Kafka: Source-Connector to Topic mapping is Flakey

I have the following Kafka connector configuration (below). I have created the "member" topic already (30 partitions). The problem is that I will install the connector and it will work; i.e.
curl -d "@mobiledb-member.json" -H "Content-Type: application/json" -X PUT https://TTGSSQA0VRHAP81.ttgtpmg.net:8085/connectors/mobiledb-member-connector/config
curl -s https://TTGSSQA0VRHAP81.ttgtpmg.net:8085/connectors/member-connector/topics
returns:
{"member-connector":{"topics":["member"]}}
the status call returns no errors:
curl -s https://TTGSSQA0VRHAP81.ttgtpmg.net:8085/connectors/mobiledb-member-connector/status
{"name":"member-connector","connector":{"state":"RUNNING","worker_id":"ttgssqa0vrhap81.***.net:8085"},"tasks":[{"id":0,"state":"RUNNING","worker_id":"ttgssqa0vrhap81.***.net:8085"}],"type":"source"}
... but at other times, I will install a similar connector config and it will return no topics.
{"member-connector":{"topics":[]}}
Yet the status shows no errors and the Connector logs show no clues as to why this "connector to topic" mapping isn't working. Why aren't the logs helping out?
Connector configuration:
{
  "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
  "transforms.createKey.type": "org.apache.kafka.connect.transforms.ValueToKey",
  "connection.url": "jdbc:sqlserver:****;",
  "connection.user": "***",
  "connection.password": "***",
  "transforms": "createKey",
  "table.poll.interval.ms": "120000",
  "key.converter.schemas.enable": "false",
  "value.converter.schemas.enable": "false",
  "poll.interval.ms": "5000",
  "value.converter": "org.apache.kafka.connect.json.JsonConverter",
  "key.converter": "org.apache.kafka.connect.json.JsonConverter",
  "name": "member-connector",
  "tasks.max": "4",
  "query": "SELECT * FROM member_kafka_test",
  "table.types": "TABLE",
  "topic.prefix": "member",
  "mode": "timestamp+incrementing",
  "transforms.createKey.fields": "member_id",
  "incrementing.column.name": "member_id",
  "timestamp.column.name": "update_datetime"
}
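A quick way to verify whether records are actually reaching the topic is the plain console consumer; because the connector uses query mode, the JDBC source publishes directly to the topic.prefix value ("member"). The broker address below is an assumption:
# Read a few records from the connector's output topic
kafka-console-consumer \
--bootstrap-server localhost:9092 \
--topic member \
--from-beginning \
--max-messages 5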

How does JDBC sink connector inserts values into postgres database

I'm using the JDBC sink connector to load data from a Kafka topic into a Postgres database.
Here is my configuration:
curl --location --request PUT 'http://localhost:8083/connectors/sink_1/config' \
--header 'Content-Type: application/json' \
--data-raw '{
  "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
  "connection.url": "jdbc:postgresql://localhost:5432/postgres",
  "connection.user": "user",
  "connection.password": "passwd",
  "tasks.max": "10",
  "topics": "<topic_name_same_as_tablename>",
  "insert.mode": "insert",
  "key.converter": "org.apache.kafka.connect.converters.ByteArrayConverter",
  "value.converter": "org.apache.kafka.connect.json.JsonConverter",
  "quote.sql.identifiers": "never",
  "errors.tolerance": "all",
  "errors.deadletterqueue.topic.name": "failed_records",
  "errors.deadletterqueue.topic.replication.factor": "1",
  "errors.log.enable": "true"
}'
In my table, I have 100k+ records, so I tried partitioning the topic into 10 partitions and setting tasks.max to 10 to speed up the loading process, which was much faster than with a single partition.
Can someone help me understand how the sink connector loads data into Postgres? Which form of insert statement will it use: approach-1 or approach-2? If it is approach-1, can we achieve approach-2? If yes, how?

unable to read avro message via kafka-avro-console-consumer (end goal read it via spark streaming)

(End goal) Before trying out whether I could eventually read Avro data out of the Confluent Platform using Spark Structured Streaming, as some have described here: Integrating Spark Structured Streaming with the Confluent Schema Registry,
I'd like to verify whether I could use the command below to read the messages:
$ kafka-avro-console-consumer \
> --topic my-topic-produced-using-file-pulse-xml \
> --from-beginning \
> --bootstrap-server localhost:9092 \
> --property schema.registry.url=http://localhost:8081
I receive this error message: Unknown magic byte
Processed a total of 1 messages
[2020-09-10 12:59:54,795] ERROR Unknown error when running consumer: (kafka.tools.ConsoleConsumer$:76)
org.apache.kafka.common.errors.SerializationException: Unknown magic byte!
[2020-09-10 12:59:54,795] ERROR Unknown error when running consumer: (kafka.tools.ConsoleConsumer$:76)
org.apache.kafka.common.errors.SerializationException: Unknown magic byte!
Note: the message can be read like this (using kafka-console-consumer instead of kafka-avro-console-consumer):
kafka-console-consumer \
--bootstrap-server localhost:9092 --group my-group-console \
--from-beginning \
--topic my-topic-produced-using-file-pulse-xml
The messages were produced using Confluent Connect file-pulse (1.5.2) reading an XML file (streamthoughts/kafka-connect-file-pulse).
Please help here:
Did I use kafka-avro-console-consumer wrong?
I tried the "deserializer" properties options described here: https://stackoverflow.com/a/57703102/4582240, but they did not help.
I have not dared to start Spark Streaming to read the data yet.
The file-pulse 1.5.2 properties I used are below (added 11/09/2020 for completeness).
name=connect-file-pulse-xml
connector.class=io.streamthoughts.kafka.connect.filepulse.source.FilePulseSourceConnector
topic= my-topic-produced-using-file-pulse-xml
tasks.max=1
# File types
fs.scan.filters=io.streamthoughts.kafka.connect.filepulse.scanner.local.filter.RegexFileListFilter
file.filter.regex.pattern=.*\\.xml$
task.reader.class=io.streamthoughts.kafka.connect.filepulse.reader.XMLFileInputReader
force.array.on.fields=sometagNameInXml
# File scanning
fs.cleanup.policy.class=io.streamthoughts.kafka.connect.filepulse.clean.LogCleanupPolicy
fs.scanner.class=io.streamthoughts.kafka.connect.filepulse.scanner.local.LocalFSDirectoryWalker
fs.scan.directory.path=/tmp/kafka-connect/xml/
fs.scan.interval.ms=10000
# Internal Reporting
internal.kafka.reporter.bootstrap.servers=localhost:9092
internal.kafka.reporter.id=connect-file-pulse-xml
internal.kafka.reporter.topic=connect-file-pulse-status
# Track file by name
offset.strategy=name
If you are getting Unknown Magic Byte with the consumer, then the producer didn't use the Confluent AvroSerializer, and might have pushed Avro data that doesn't use the Schema Registry.
Without seeing the Producer code or consuming and inspecting the data in binary format, it is difficult to know which is the case.
The message was produced using confluent connect file-pulse
Did you use value.converter with the AvroConverter class?
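If not, a minimal sketch of what that could look like in the connector (or worker) properties, assuming the Confluent AvroConverter is available and the Schema Registry runs at the URL from the question:
# Serialize record values as Avro and register their schemas with the Schema Registry
value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=http://localhost:8081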

Registering Schema ID with Topic using confluent_kafka for python

The only answer I have gotten so far is that you have to give the schema and the topic the same name, and this should link them together. But after registering a schema with the name test_topic like:
{
  "type": "record",
  "name": "test_topic",
  "namespace": "com.test",
  "doc": "My test schema",
  "fields": [
    {
      "name": "name",
      "type": "string"
    }
  ]
}
and running the following command, it inserts without a problem.
curl -X POST -H "Content-Type: application/vnd.kafka.json.v2+json" -H "Accept: application/vnd.kafka.v2+json" --data '{"records":[{"value":{"name": "My first name"}}]}' "http://localhost/topics/test_topic"
But when I run the following command as well, it inserts without giving any error (note: I changed the property name):
curl -X POST -H "Content-Type: application/vnd.kafka.json.v2+json" -H "Accept: application/vnd.kafka.v2+json" --data '{"records":[{"value":{"test": "My first name"}}]}' "http://localhost/topics/test_topic"
I would have expected an error message saying that my data does not match the schema for this topic...
My schema ID is 10, so I know the schema is registered and working, but that is not very useful at the moment.
Python code:
from confluent_kafka import Producer
import socket
import json

conf = {'bootstrap.servers': 'localhost:9092', 'client.id': socket.gethostname()}
producer = Producer(conf)

def acked(err, msg):
    if err is not None:
        print(f'Failed to deliver message: {str(msg)}, {str(err)}')
    else:
        print(f'Message produced: {str(msg)}')

producer.produce("test_topic", key="key", value=json.dumps({"test": name}).encode('ascii'), callback=acked)
producer.poll(5)
you have to give the schema and the topic the same name, and then this should link them together
That's not quite how the Schema Registry works.
Each kafka record has a key and a value.
The Registry has subjects, which are not strictly mapped to topics.
However, the Kafka clients (de)serializer implementation will use both topic-key and topic-value subject names to register/extract schemas from the registry.
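For example, with the clients' default subject naming strategy, a value schema produced to the question's topic would be registered under the subject test_topic-value (and a key schema under test_topic-key). Listing the registry's subjects shows this (registry URL is an assumption):
# List the subjects known to the Schema Registry
curl http://localhost:8081/subjects
# example response: ["test_topic-value"]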
Clients cannot tell the registry what ID to put the schema at; that logic is handled server-side.
I'm not sure I understand what your post has to do with the REST Proxy, but you're posting plain JSON and not telling it that the data should be Avro (you're using the incorrect header).
If Avro is used, the content type will be application/vnd.kafka.avro.v2+json.
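For illustration, a sketch of what an Avro produce request through the REST Proxy could look like, embedding the schema from above inline (the host and topic follow the question's example; treat the exact payload as an assumption to adapt):
# Produce an Avro record, letting the REST Proxy validate it against the embedded schema
curl -X POST \
-H "Content-Type: application/vnd.kafka.avro.v2+json" \
-H "Accept: application/vnd.kafka.v2+json" \
--data '{"value_schema": "{\"type\": \"record\", \"name\": \"test_topic\", \"fields\": [{\"name\": \"name\", \"type\": \"string\"}]}", "records": [{"value": {"name": "My first name"}}]}' \
"http://localhost/topics/test_topic"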

Kafka Connect FileStreamSink connector removes quotation marks and changes colon to equal sign for JSON message

Summary
When I stream this with the console producer
{"id":1337,"status":"example_topic_1 success"}
I get this from my file sink connector:
/data/example_topic_1.txt
{id=1337, status=example_topic_1 success}
This is a major problem for me, because the original JSON message cannot be recovered without making assumptions about where the quotes used to be. How can I output the messages to a file, while preserving the quotation marks?
Details
First, I start my file sink connector.
# sh bin/connect-standalone.sh \
> config/worker.properties \
> config/connect-file-sink-example_topic_1.properties
Second, I start the console consumer (also built into Kafka) so that I have easy visual confirmation that the messages are coming through correctly.
# sh bin/kafka-console-consumer.sh \
> --bootstrap-server kafka_broker:9092 \
> --topic example_topic_1
Finally, I start a console producer for sending messages, and I enter a message.
# sh bin/kafka-console-producer.sh \
> --broker-list kafka_broker:9092 \
> --topic example_topic_1
From the console consumer, the message pops out correctly, with quotes.
{"id":1337,"status":"example_topic_1 success"}
But I get this from the FileStreamSink connector:
/data/example_topic_1.txt
{id=1337, status=example_topic_1 success}
My Configuration
config/worker.properties
offset.storage.file.filename=/tmp/example.offsets
bootstrap.servers=kafka_broker:9092
offset.flush.interval.ms=10000
key.converter=org.apache.kafka.connect.storage.StringConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
value.converter.schemas.enable=false
config/connect-file-sink-example_topic_1.properties
name=file-sink-example_topic_1
connector.class=FileStreamSink
tasks.max=1
file=/data/example_topic_1.txt
topics=example_topic_1
Since you don't actually want to parse the JSON data, but just pass it straight through as a lump of text, you need to use the StringConverter:
key.converter=org.apache.kafka.connect.storage.StringConverter
value.converter=org.apache.kafka.connect.storage.StringConverter
This article explains more about the nuances of converters: https://rmoff.net/2019/05/08/when-a-kafka-connect-converter-is-not-a-converter/. It shows an example of what you're trying to do, although it uses kafkacat in place of the console producer/consumer.
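Applied to the question's config/worker.properties, the change would look roughly like this (a sketch; the schemas.enable line is dropped because StringConverter does not use it):
offset.storage.file.filename=/tmp/example.offsets
bootstrap.servers=kafka_broker:9092
offset.flush.interval.ms=10000
# Pass keys and values through as plain strings so the JSON stays untouched
key.converter=org.apache.kafka.connect.storage.StringConverter
value.converter=org.apache.kafka.connect.storage.StringConverter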