Processing data streams in Kafka

I am trying to set up a Kafka consumer to process data from Kafka Streams. I was able to set up the connection to the stream and the data is visible, but it's a mixture of special characters and ASCII.
I am using the built-in Kafka console consumer, but have also tried the confluent-kafka Python client. The only requirements are to use the SASL_PLAINTEXT security protocol with SCRAM-SHA-256. I am open to other methods of parsing the output as well (not Java, if possible).
Kafka Console
bin/kafka-console-consumer.sh --bootstrap-server server:9092 \
--topic TOPIC --from-beginning --consumer.config=consumer.properties
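The consumer.properties just carries the SASL/SCRAM settings, along these lines (credentials are placeholders):
security.protocol=SASL_PLAINTEXT
sasl.mechanism=SCRAM-SHA-256
sasl.jaas.config=org.apache.kafka.common.security.scram.ScramLoginModule required \
  username="<username>" \
  password="<password>";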
Confluent Kafka Python
from confluent_kafka import Consumer

topic = "TOPIC"
conf = {
    "bootstrap.servers": "server:9092",
    "group.id": "group",
    "security.protocol": "SASL_PLAINTEXT",
    "sasl.mechanisms": "SCRAM-SHA-256",
    # sasl.username / sasl.password omitted here
}

c = Consumer(conf)
c.subscribe([topic])

try:
    while True:
        message = c.poll(1.0)
        if message is None:
            continue
        if message.error():
            print(message.error())
            continue
        print(message.value())
finally:
    c.close()
Output
PLE9K1PKH3S0MAY38ChangeRequest : llZYMEgVmq2CHG:Infra RequestKSUSMAINCHANGEKC-10200-FL01DATA_MISSINGCHGUSD
DATA_MISSINGDATA_MISSINGUSD
CANCEL
▒▒12SLM:Measurement"Schedule(1 = 0)USDUSD▒▒▒
l▒l▒V?▒▒▒
llZYMEgVmq
company_team team_nameTEAM###SGP000000140381PPL000002020234
Latha M▒>▒>▒ChangeRequest
hello:1234543534 cloud abcdef▒▒▒
▒Ի▒
▒▒▒
John Smithjs12345SGP000000140381▒NPPL000002020234
▒Ի▒
I am trying to parse the data to standard output initially, but the end goal is to get the parsed data into a database. Any advice would be appreciated.

It seems like your messages are encoded in a binary format. To print them you will need to set up a binary decoder and pass them through it. If they were produced with a specific schema, you may also need to deserialize the objects using the Schema Registry, which holds the schema for the given topic. You are looking at something along the lines of:
message_bytes = io.BytesIO(message.value())
decoder = BinaryDecoder(message_bytes)
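For that to produce readable records you also need an Avro DatumReader built from the writer schema; a minimal sketch with the avro package, assuming the schema file is available locally (the file name is a placeholder):
import io
import avro.schema
from avro.io import BinaryDecoder, DatumReader

schema = avro.schema.parse(open("value-schema.avsc").read())  # placeholder path to the writer schema
reader = DatumReader(schema)

message_bytes = io.BytesIO(message.value())
decoder = BinaryDecoder(message_bytes)
record = reader.read(decoder)  # a Python dict matching the schema
print(record)
If the messages were produced with the Confluent Avro serializer, each value also starts with a 5-byte Schema Registry header (magic byte plus schema id), in which case confluent-kafka's Schema Registry-aware Avro deserializer is the easier route.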

As jaivalis has mentioned, there appears to be a mismatch between your producers and the consumer you are using to ingest the data. Kafka Streams exposes two properties for controlling the serialization and deserialization of data that passes through the topology: default.key.serde and default.value.serde. I recommend reviewing your streams application's configuration to find a suitable deserializer for the consumer to use.
https://kafka.apache.org/documentation/#streamsconfigs
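For example, the relevant entries in that configuration would look something like this (the classes shown here are only illustrative; yours may differ):
default.key.serde=org.apache.kafka.common.serialization.Serdes$StringSerde
default.value.serde=io.confluent.kafka.streams.serdes.avro.GenericAvroSerde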
Do note, however, that these serdes may be overridden by your streams application's implementation. Be sure to review the implementation as well to ensure you have found the correct serialization format.
https://kafka.apache.org/21/documentation/streams/developer-guide/datatypes.html#overriding-default-serdes

Related

Who is responsible for Avro Schema serialization in Kafka queue?

I am learning the basics of Avro schema serialization. I understand that both keys and values can have their own Avro schemas. However, what I am confused about is how the serialization process actually works.
Do you specify the Avro schemas to use at the time of creating a topic? This way, the producer can post a message using plain json text and the kafka server knows how to serialize/deserialize it. Likewise, the consumer can obtain a record in plain json text.
Or, do you specify the schemas to use at the time of posting a message to a topic?
Finally, let's say I define my schemas in mykeyschema.avsc and myvalueschema.avsc. Would appreciate an example of how to use the schemas either from the command line kafka tools or as curl scripts (for rest proxy). Thanks.
Topics are independent of schemas; they need not be (and often are not) defined together.
Most importantly, Kafka only knows about byte arrays; the clients decide serialization format. If you choose to pay for Confluent Server, for example, only then can you force Kafka to accept only Avro bytes (obviously, this adds latency to your requests because the records are being deserialized by the server to do the validation, but this is the trade-off for "protecting the topic from bad actors").
That being said, producers are the ones sending data. They often are responsible for registering the schema, based on what is sent. Consumers can then decide to use that schema or define their own projection of those fields (Avro requires both reader and writer schema for deserialization).
example of how to use the schemas either from the command line kafka tools
kafka-avro-console-producer --topic foobar \
--property value.schema="$(jq -rc . example-schema.avsc)" \
--bootstrap-server localhost:9092 --sync
And rather than typing out long JSON payloads, you can redirect a JSON file with line-separated records into it:
kafka-avro-console-producer ... < records.json
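If example-schema.avsc were, say, a record with a single string field called name, records.json would simply be newline-delimited JSON matching it:
{"name": "alice"}
{"name": "bob"}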
REST Proxy
When you make POST requests to send data, you can provide the key/value schema as JSON-encoded values (not ideal, since it makes every request much larger than necessary), or you can pre-register the schema, which returns an ID that you can use instead.
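For example, a v2 produce request with an inline value schema might look like this (host, port, and the toy schema are placeholders):
curl -X POST http://localhost:8082/topics/foobar \
  -H "Content-Type: application/vnd.kafka.avro.v2+json" \
  -d '{
        "value_schema": "{\"type\": \"record\", \"name\": \"User\", \"fields\": [{\"name\": \"name\", \"type\": \"string\"}]}",
        "records": [{"value": {"name": "alice"}}]
      }'
The response includes a value_schema_id, which later requests can pass instead of the full schema.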
https://docs.confluent.io/platform/current/kafka-rest/api.html

How to deserialize avro message using mirrormaker?

I want to replicate a Kafka topic to an Azure Event Hub.
The messages are in Avro format and use a schema that is behind a Schema Registry with USER_INFO authentication.
Using a java client to connect to kafka, I can use a KafkaAvroDeserializer to deserialize the message correctly.
But this configuration doesn't seem to work with MirrorMaker.
Is it possible to deserialize the Avro message using MirrorMaker before sending it?
Cheers
For MirrorMaker 1, the consumer deserializer properties are hard-coded.
Unless you plan on re-serializing the data into a different format when the producer sends data to Event Hub, you should stick to using the default ByteArrayDeserializer.
If you did want to manipulate the messages in any way, that would need to be done with a MirrorMakerMessageHandler subclass.
For MirrorMaker2, you can use AvroConverter followed by some transforms properties, but still ByteArrayConverter would be preferred for a one-to-one byte copy.
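For the preferred byte-for-byte copy with the dedicated MirrorMaker 2 driver, a minimal mm2.properties sketch might look like this (aliases, hosts, and topic are placeholders; MirrorMaker 2 defaults to ByteArrayConverter, so no converter settings should be needed):
clusters = source, target
source.bootstrap.servers = source-kafka:9092
target.bootstrap.servers = eventhub-namespace.servicebus.windows.net:9093
source->target.enabled = true
source->target.topics = my-topic
# Event Hubs additionally needs its SASL_SSL client settings under the target.* prefix (omitted here)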

MQTT converter trouble performing transformation on values

I have been struggling for some days to convert values coming from an MQTT source connector from a ByteArray to a String. Our standalone properties file looks like this:
# The converters specify the format of data in Kafka and how to translate it into Connect data. Every Connect user will
# need to configure these based on the format they want their data in when loaded from or stored into Kafka
value.converter=org.apache.kafka.connect.converters.ByteArrayConverter
key.converter=org.apache.kafka.connect.storage.StringConverter
# Converter-specific settings can be passed in by prefixing the Converter's setting with the converter we want to apply
# it to
key.converter.schemas.enable=true
value.converter.schemas.enable=true
When we try to perform some transformation on the values, it looks like Kafka treats the data as ByteArrays, so it is not possible to perform any operation on the values. Is there a way to convert the ByteArray values to Strings (e.g. with the StringConverter)?
What we tried was to change the parameters as follows:
converter.encoding=UTF-8
value.converter=org.apache.kafka.connect.storage.StringConverter
We had no luck. Our input is a Python string similar to:
{'id': 42, 'cost': 4000}
Any suggestion how to configure the property file?
EDIT: As requested, I am providing more information. We moved to distributed mode with a cluster of 3 brokers, and we start the connector as follows:
{
  "name": "MqttSourceConnector",
  "config": {
    "connector.class": "io.confluent.connect.mqtt.MqttSourceConnector",
    "tasks.max": "2",
    "key.converter": "org.apache.kafka.connect.storage.StringConverter",
    "value.converter": "org.apache.kafka.connect.converters.ByteArrayConverter",
    "mqtt.server.uri": "ssl://XXXXXX.XXXX.amazonaws.com:1883",
    "mqtt.topics": "/mqtt/topic",
    "mqtt.ssl.trust.store.path": "XXXXXXXXXX.jks",
    "mqtt.ssl.trust.store.password": "XXXXX",
    "mqtt.ssl.key.store.path": "XXXXXXXXX.jks",
    "mqtt.ssl.key.store.password": "XXXXXX",
    "mqtt.ssl.key.password": "XXXXXXXX",
    "max.retry.time.ms": 86400000,
    "mqtt.connect.timeout.seconds": 2,
    "mqtt.keepalive.interval.seconds": 4
  }
}
What I receive as key/value is:
/mqtt/topic:"stringified json with kafka topic and key"
What I would like to see:
Topic: kafka.topic.from.the.mqtt.paylod.string
Key: key_from_mqtt_string
Of course the MQTT payload should be JSON for us, but I can't manage to convert it.
Single Message Transforms usually require a schema in the data.
Connectors like this one, and others connecting from message queues, generally require the ByteArrayConverter. After that you need to apply a schema to the data, at which point you can start to manipulate the fields.
I wrote about one way of doing this here, in which you ingest the raw bytes to a Kafka topic and then apply a schema using a stream processor (ksqlDB in my example, but you can use other options if you want).

What should I use: Kafka Streams, the Kafka Consumer API, or Kafka Connect?

I would like to know what would be best for me: Kafka Streams, the Kafka Consumer API, or Kafka Connect?
I want to read data from a topic, do some processing, and write the results to a database. I have written consumers, but I feel I could write a Kafka Streams application instead and use its stateful processors to perform any changes and write to the database, which would eliminate my consumer code and leave only the DB code to write.
The databases I want to insert my records into are:
HDFS - (insert raw JSON)
MSSQL - (processed json)
Another option is Kafka Connect, but I have found there is no JSON support as of now for the HDFS sink and JDBC sink connectors (I don't want to write Avro), and creating a schema is also a pain for complex nested messages.
Or should I write a custom Kafka Connect connector to do this?
So I need your opinion on whether I should write a Kafka consumer, a Kafka Streams application, or a Kafka Connect connector.
And what will be better in terms of performance and have less overhead?
You can use a combination of them all
I have tried the HDFS sink for JSON but was not able to use org.apache.kafka.connect.json.JsonConverter
Not clear why not. But I would assume you forgot to set schemas.enable=false.
when I set org.apache.kafka.connect.storage.StringConverter it works but it writes the json object in string escaped format. For eg. {"name":"hello"} is written into hdfs as "{\"name\":\"hello\"}"
Yes, it will string-escape the JSON
The processing I want to do is basic validation and a few field-value transformations
Kafka Streams or the Consumer API is capable of validation. Connect is capable of Single Message Transforms (SMTs).
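For instance, a single field-rename SMT on a sink connector is just a few properties like these (the names here are made up):
transforms=renameFields
transforms.renameFields.type=org.apache.kafka.connect.transforms.ReplaceField$Value
transforms.renameFields.renames=old_name:new_name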
In some use cases you need to "duplicate" data onto Kafka: process your "raw" topic with a consumer, produce the cleaned records back into a "cleaned" topic, and from there use Kafka Connect to write to a database or filesystem.
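A rough sketch of that pattern in Python with confluent-kafka (topic names, group id, and the validation rule are made up):
import json
from confluent_kafka import Consumer, Producer

consumer = Consumer({"bootstrap.servers": "server:9092", "group.id": "cleaner"})
producer = Producer({"bootstrap.servers": "server:9092"})
consumer.subscribe(["raw-topic"])

try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        record = json.loads(msg.value())
        # hypothetical validation and field transformation
        if "name" not in record:
            continue
        record["name"] = record["name"].strip().lower()
        producer.produce("cleaned-topic", json.dumps(record).encode("utf-8"))
        producer.poll(0)  # serve delivery callbacks
finally:
    producer.flush()
    consumer.close()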
Welcome to Stack Overflow! Please take the tour: https://stackoverflow.com/tour
Please post a precise question rather than asking for opinions; that makes the site clearer, and opinions are not answers (they are subject to personal preference). Asking something like "How to use Kafka Connect with JSON" would fit this site.
Also, please show some research.
The least overhead would be a plain Kafka consumer. Kafka Streams and Kafka Connect are built on the Kafka consumer, so you can always achieve lower overhead with the consumer alone, but you will also lose their benefits (fault tolerance, ease of use, support, etc.).
First, it depends on what your processing is. Aggregation? Counting? Validation? You can use Kafka Streams to do the processing and write the result to a new topic, in the format you want.
Then you can use Kafka Connect to send the data to your database. You are not forced to use Avro; you can use another format for the key/value, see:
Kafka Connect HDFS Sink for JSON format using JsonConverter
Kafka Connect not outputting JSON

Two strange bytes at the beginning of each Kafka message produced by my Kafka connector

I am developing a Kafka connector which simply creates a message for each line of a file retrieved from an external API. It works nicely, but when I try to consume the messages I see two strange bytes at the beginning of each value. I can reproduce the problem with the console consumer and with my Kafka Streams processor.
�168410002,OpenX Market,459980962,OpenX_Bidder_Order_merkur_bidder_800x250,313115722,OpenX_Bidder_ANY_LI_merkur_800x250_550,106800839362,OpenX_Bidder_Creative_merkur_800x250_2,10
The source files are good and even printlns before creating the SourceRecord don't show these two bytes. I used a struct with one field before and now use a simple String schema but I still have the same problem:
def convert(line: String, ...) = {
  ...
  val record = new SourceRecord(
    Partition.sole(partition),
    offset.forConnectApi,
    topic,
    Schema.STRING_SCHEMA,
    line
  )
  ...
So in the above code, if I add println(line) no strange chars are shown.
It looks like you used the AvroConverter or the JsonConverter in your setup. Try using the StringConverter that ships with Kafka as key.converter and value.converter in the Connect worker. That will encode the data as strings, which shouldn't have this extra stuff in them.
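In worker properties form, that would be something like:
key.converter=org.apache.kafka.connect.storage.StringConverter
value.converter=org.apache.kafka.connect.storage.StringConverter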