MQTT Kafka Source connector : funny byte characters - apache-kafka

I am following https://github.com/kaiwaehner/kafka-connect-iot-mqtt-connector-example to connect Mosquitto and Kafka with the MQTT source connector. The data sent by the Mosquitto publisher reaches both the Mosquitto subscriber and the Kafka consumer, but the key and value fields of the ConsumerRecord in the Kafka consumer have some extra bytes prepended.
Below are the code snippets and the outputs I'm getting.
mqttPublisher.py
while v3 < 3:
    data3 = {
        "time": str(datetime.datetime.now().time()),
        "val": v3
    }
    client.publish("sensor/dist", json.dumps(data3), qos=2)
    v3 += 1
    time.sleep(2)
mqttSubscriber.py
def on_message_print(client, userdata, message):
    print(message.topic, message.payload)

subscribe.callback(on_message_print, "sensor/#", hostname="localhost")
kafkaConsumer.py
consumer = KafkaConsumer('mqtt.',
                         bootstrap_servers=['localhost:9092'])
for message in consumer:
    print(message)
Output: mqttSubscriber.py
sensor/dist b'{"time": "12:44:30.817462", "val": 0}'
sensor/dist b'{"time": "12:44:32.820040", "val": 1}'
sensor/dist b'{"time": "12:44:34.822657", "val": 2}'
Output: kafkaConsumer.py
ConsumerRecord(topic='mqtt.', partition=0, offset=225, timestamp=1545117270870, timestamp_type=0, key=b'\x00\x00\x00\x00\x01\x16sensor/dist', value=b'\x00\x00\x00\x00\x02J{"time": "12:44:30.817462", "val": 0}', headers=[('mqtt.message.id', b'0'), ('mqtt.qos', b'0'), ('mqtt.retained', b'false'), ('mqtt.duplicate', b'false')], checksum=None, serialized_key_size=17, serialized_value_size=43, serialized_header_size=62)
ConsumerRecord(topic='mqtt.', partition=0, offset=226, timestamp=1545117272821, timestamp_type=0, key=b'\x00\x00\x00\x00\x01\x16sensor/dist', value=b'\x00\x00\x00\x00\x02J{"time": "12:44:32.820040", "val": 1}', headers=[('mqtt.message.id', b'0'), ('mqtt.qos', b'0'), ('mqtt.retained', b'false'), ('mqtt.duplicate', b'false')], checksum=None, serialized_key_size=17, serialized_value_size=43, serialized_header_size=62)
ConsumerRecord(topic='mqtt.', partition=0, offset=227, timestamp=1545117274824, timestamp_type=0, key=b'\x00\x00\x00\x00\x01\x16sensor/dist', value=b'\x00\x00\x00\x00\x02J{"time": "12:44:34.822657", "val": 2}', headers=[('mqtt.message.id', b'0'), ('mqtt.qos', b'0'), ('mqtt.retained', b'false'), ('mqtt.duplicate', b'false')], checksum=None, serialized_key_size=17, serialized_value_size=43, serialized_header_size=62)
What is causing the above prepending of extra bytes in the Kafka Consumer?
Thanks in advance.

As part of the demo, you're starting a Schema Registry:
Start Kafka Connect and dependencies (Kafka, Zookeeper, Schema Registry):
confluent start connect
If you look at the first 5 bytes, you'll see they start with a 0 byte, followed by four more bytes representing an integer. That is the Schema Registry wire format: a magic byte, then the 4-byte schema ID, then the Avro-encoded data.
See the Schema Registry Wire Format documentation, and try running curl localhost:8081/subjects to see whether it lists subjects for your topic, named <topic>-key and <topic>-value.
If you don't want Avro, you need to edit your Kafka Connect properties file to use different converters, and use confluent start only to get Kafka and Zookeeper running.
Or, if you want Python to deserialize the Avro, refer to the confluent-kafka-python repo on GitHub.
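To see what the extra bytes are, the frame can be taken apart by hand. The following is a minimal sketch, assuming the key and value were written with a plain Avro string schema (the byte counts in the output above are consistent with that, but the question does not confirm it). The function names are hypothetical, and this is not the confluent-kafka-python API.

```python
import struct


def split_confluent_frame(raw: bytes):
    """Split a Schema Registry framed message into (schema_id, avro_payload)."""
    if len(raw) < 5 or raw[0] != 0:
        raise ValueError("not in Schema Registry wire format")
    # Bytes 1-4 hold the schema ID as a big-endian unsigned 32-bit integer.
    (schema_id,) = struct.unpack(">I", raw[1:5])
    return schema_id, raw[5:]


def read_avro_string(payload: bytes) -> str:
    """Decode one Avro string: a zigzag varint length, then UTF-8 bytes.

    Assumption: the writer schema is a bare Avro "string".
    """
    acc, shift, pos = 0, 0, 0
    while True:
        byte = payload[pos]
        pos += 1
        acc |= (byte & 0x7F) << shift
        if not byte & 0x80:
            break
        shift += 7
    length = (acc >> 1) ^ -(acc & 1)  # undo the zigzag encoding
    return payload[pos:pos + length].decode("utf-8")


# Key bytes copied from the ConsumerRecord output in the question.
key = b"\x00\x00\x00\x00\x01\x16sensor/dist"
schema_id, body = split_confluent_frame(key)
print(schema_id, read_avro_string(body))
```

For the key in the question this gives schema ID 1 and the string sensor/dist; the \x16 byte is the Avro length prefix (zigzag-encoded 11). For real use, a Schema Registry aware deserializer is the safer choice, since it looks up the actual writer schema.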

Related

Debezium before field is null for update operation

I am running Debezium Server, and my config file is below:
debezium.sink.type=kafka
debezium.source.connector.class=io.debezium.connector.postgresql.PostgresConnector
debezium.source.offset.storage.file.filename=org.apache.kafka.connect.storage.FileOffsetBackingStore
debezium.source.offset.flush.interval.ms=0
debezium.source.database.hostname=localhost
debezium.source.database.port=5432
debezium.source.plugin.name=pgoutput
debezium.source.database.user=casdoor
debezium.source.database.password=casdoor
debezium.source.database.dbname=casdoor
debezium.source.database.server.name=casdoor
debezium.source.database.schema.name=public
debezium.source.database.table.name=user
debezium.source.key.converter=org.apache.kafka.connect.json.JsonConverter
debezium.source.value.converter=org.apache.kafka.connect.json.JsonConverter
debezium.sink.kafka.producer.key.serializer=org.apache.kafka.common.serialization.StringSerializer
debezium.sink.kafka.producer.value.serializer=org.apache.kafka.common.serialization.StringSerializer
quarkus.log.console.json=false
But the update events in Kafka have no before part.
As shown below, before is always null even though the operation is an update:
{
  "schema": Object{...},
  "payload": {
    "before": null,
    "after": Object{...},
    "source": Object{...},
    "op": "u",
    "ts_ms": 1652168353005,
    "transaction": null
  }
}

kafka connect sink to mongo only last result with delay

I have an aggregation query that groups page views by country and pushes the results to an output topic.
The results are then written to MongoDB by a Kafka sink connector:
{
  "connector.class": "MongoDbAtlasSink",
  "name": "confluent-mongodb-sink",
  "input.data.format": "JSON",
  "connection.host": "ip",
  "topics": "viewPageCountByUsers",
  "max.num.retries": "3",
  "retries.defer.timeout": "5000",
  "max.batch.size": "0",
  "database": "test",
  "collection": "ViewPagesCountByUsers",
  "tasks.max": "1"
}
The problem is that this data arrives very frequently and puts a heavy load on MongoDB. How can I configure Kafka Connect to send only the last value per key as a batch, for example with a 5-second delay?
Example: it's pointless to update the database 5 times:
{countryID:7, viewCount: 111}
{countryID:7, viewCount: 112}
{countryID:7, viewCount: 113}
{countryID:7, viewCount: 114}
{countryID:7, viewCount: 115}
If I could send only the last result per key with a 5-second delay, I would need to update just once.
// collect a batch for 5 seconds, then flush:
{countryID:7, viewCount: 115}
{countryID:8, viewCount: 573}
How can I do this?
Sink connectors just take whatever is in the topic, generally without batching.
You'd need to use a stream processor such as Kafka Streams or ksqlDB to run a windowed aggregation, then output to a new topic, which the sink connector would read from.
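The effect of such a window can be illustrated in plain Python. This is only a sketch of the "keep the last value per key" logic, not Kafka Streams or ksqlDB code; the function name is hypothetical, and the 5-second window boundary is assumed to be handled by the stream processor.

```python
def latest_per_key(records, key_field="countryID"):
    """Collapse one window's records to the most recent record for each key."""
    latest = {}
    for record in records:
        # A later record overwrites an earlier one that has the same key.
        latest[record[key_field]] = record
    return list(latest.values())


# Records that arrived within one (assumed) 5-second window.
window = [
    {"countryID": 7, "viewCount": 111},
    {"countryID": 7, "viewCount": 112},
    {"countryID": 8, "viewCount": 573},
    {"countryID": 7, "viewCount": 115},
]
print(latest_per_key(window))
```

Only one record per country survives the window, so the sink connector reading the output topic would write to MongoDB once per key per window instead of once per update.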

kafka mongodb sink connector issue while writing to mongodb

I am facing an issue while writing to MongoDB using the MongoDB Kafka sink connector. I am using MongoDB v5.0.3 and Strimzi Kafka v2.8.0. I have added p1/mongo-kafka-connect-1.7.0-all.jar and p2/mongodb-driver-core-4.5.0.jar to the Connect cluster's plugin path. I created the connector using the config below:
{
  "name": "mongo-sink",
  "config": {
    "topics": "sinktest2",
    "connector.class": "com.mongodb.kafka.connect.MongoSinkConnector",
    "tasks.max": "1",
    "connection.uri": "mongodb://mm-0.mongoservice.st.svc.cluster.local:27017,mm-1.mongoservice.st.svc.cluster.local:27017",
    "database": "sinkdb",
    "collection": "sinkcoll",
    "mongo.errors.tolerance": "all",
    "mongo.errors.log.enable": true,
    "errors.log.include.messages": true,
    "errors.deadletterqueue.topic.name": "sinktest2.deadletter",
    "errors.deadletterqueue.context.headers.enable": true
  }
}
root@ubuntuserver-0:/persistent# curl http://localhost:8083/connectors/mongo-sink/status
{"name":"mongo-sink","connector":{"state":"RUNNING","worker_id":"localhost:8083"},"tasks":[{"id":0,"state":"RUNNING","worker_id":"localhost:8083"}],"type":"sink"}
When I check the status right after creating the connector, it shows RUNNING, but when I start sending records to the Kafka topic, the connector runs into issues. The connector status then looks like this:
root@ubuntuserver-0:/persistent# curl http://localhost:8083/connectors/mongo-sink/status
{
"name":"mongo-sink",
"connector":{
"state":"RUNNING",
"worker_id":"localhost:8083"
},
"tasks":[
{
"id":0,
"state":"FAILED",
"worker_id":"localhost:8083",
"trace":"org.apache.kafka.connect.errors.ConnectException: Tolerance exceeded in error handler\n\tat org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execAndHandleError(RetryWithToleranceOperator.java:206)\n\tat org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execute(RetryWithToleranceOperator.java:132)\n\tat org.apache.kafka.connect.runtime.WorkerSinkTask.convertAndTransformRecord(WorkerSinkTask.java:496)\n\tat org.apache.kafka.connect.runtime.WorkerSinkTask.convertMessages(WorkerSinkTask.java:473)\n\tat org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:328)\n\tat org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:232)\n\tat org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:201)\n\tat org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:182)\n\tat org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:231)\n\tat java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)\n\tat java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)\n\tat java.base/java.lang.Thread.run(Thread.java:829)\nCaused by: org.apache.kafka.connect.errors.DataException: Converting byte[] to Kafka Connect data failed due to serialization error: \n\tat org.apache.kafka.connect.json.JsonConverter.toConnectData(JsonConverter.java:324)\n\tat org.apache.kafka.connect.storage.Converter.toConnectData(Converter.java:87)\n\tat org.apache.kafka.connect.runtime.WorkerSinkTask.convertValue(WorkerSinkTask.java:540)\n\tat org.apache.kafka.connect.runtime.WorkerSinkTask.lambda$convertAndTransformRecord$2(WorkerSinkTask.java:496)\n\tat 
org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execAndRetry(RetryWithToleranceOperator.java:156)\n\tat org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execAndHandleError(RetryWithToleranceOperator.java:190)\n\t... 13 more\nCaused by: org.apache.kafka.common.errors.SerializationException: com.fasterxml.jackson.core.io.JsonEOFException: Unexpected end-of-input: expected close marker for Object (start marker at [Source: (byte[])\"{ \"; line: 1, column: 1])\n at [Source: (byte[])\"{ \"; line: 1, column: 4]\nCaused by: com.fasterxml.jackson.core.io.JsonEOFException: Unexpected end-of-input: expected close marker for Object (start marker at [Source: (byte[])\"{ \"; line: 1, column: 1])\n at [Source: (byte[])\"{ \"; line: 1, column: 4]\n\tat com.fasterxml.jackson.core.base.ParserMinimalBase._reportInvalidEOF(ParserMinimalBase.java:664)\n\tat com.fasterxml.jackson.core.base.ParserBase._handleEOF(ParserBase.java:486)\n\tat com.fasterxml.jackson.core.base.ParserBase._eofAsNextChar(ParserBase.java:498)\n\tat com.fasterxml.jackson.core.json.UTF8StreamJsonParser._skipWSOrEnd2(UTF8StreamJsonParser.java:3033)\n\tat com.fasterxml.jackson.core.json.UTF8StreamJsonParser._skipWSOrEnd(UTF8StreamJsonParser.java:3003)\n\tat com.fasterxml.jackson.core.json.UTF8StreamJsonParser.nextFieldName(UTF8StreamJsonParser.java:989)\n\tat com.fasterxml.jackson.databind.deser.std.BaseNodeDeserializer.deserializeObject(JsonNodeDeserializer.java:249)\n\tat com.fasterxml.jackson.databind.deser.std.JsonNodeDeserializer.deserialize(JsonNodeDeserializer.java:68)\n\tat com.fasterxml.jackson.databind.deser.std.JsonNodeDeserializer.deserialize(JsonNodeDeserializer.java:15)\n\tat com.fasterxml.jackson.databind.ObjectMapper._readTreeAndClose(ObjectMapper.java:4270)\n\tat com.fasterxml.jackson.databind.ObjectMapper.readTree(ObjectMapper.java:2734)\n\tat org.apache.kafka.connect.json.JsonDeserializer.deserialize(JsonDeserializer.java:64)\n\tat 
org.apache.kafka.connect.json.JsonConverter.toConnectData(JsonConverter.java:322)\n\tat org.apache.kafka.connect.storage.Converter.toConnectData(Converter.java:87)\n\tat org.apache.kafka.connect.runtime.WorkerSinkTask.convertValue(WorkerSinkTask.java:540)\n\tat org.apache.kafka.connect.runtime.WorkerSinkTask.lambda$convertAndTransformRecord$2(WorkerSinkTask.java:496)\n\tat org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execAndRetry(RetryWithToleranceOperator.java:156)\n\tat org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execAndHandleError(RetryWithToleranceOperator.java:190)\n\tat org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execute(RetryWithToleranceOperator.java:132)\n\tat org.apache.kafka.connect.runtime.WorkerSinkTask.convertAndTransformRecord(WorkerSinkTask.java:496)\n\tat org.apache.kafka.connect.runtime.WorkerSinkTask.convertMessages(WorkerSinkTask.java:473)\n\tat org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:328)\n\tat org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:232)\n\tat org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:201)\n\tat org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:182)\n\tat org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:231)\n\tat java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)\n\tat java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)\n\tat java.base/java.lang.Thread.run(Thread.java:829)\n"
}
],
"type":"sink"
}
I am writing a sample JSON record to the Kafka topic:
./kafka-console-producer.sh --topic sinktest2 --bootstrap-server sample-kafka-kafka-bootstrap:9093 --producer.config /persistent/client.txt < /persistent/emp.json
emp.json is the file below:
{
  "employee": {
    "name": "abc",
    "salary": 56000,
    "married": true
  }
}
I don't see any logs in the connector pod, and no database or collection is created in MongoDB.
Please help me resolve this issue. Thank you!
I think you are missing some configuration parameters, such as the converters and the schemas.enable settings.
Update your config to add the following:
"key.converter":"org.apache.kafka.connect.json.JsonConverter",
"value.converter":"org.apache.kafka.connect.json.JsonConverter",
"key.converter.schemas.enable": "false",
"value.converter.schemas.enable": "false",
If you are using Kafka Connect on Kubernetes, you can create the sink connector as shown below. Create a file with a name like mongo-sink-connector.yaml:
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaConnector
metadata:
  name: mongodb-sink-connector
  labels:
    strimzi.io/cluster: my-connect-cluster
spec:
  class: com.mongodb.kafka.connect.MongoSinkConnector
  tasksMax: 2
  config:
    connection.uri: "mongodb://root:password@mongodb-0.mongodb-headless.default.svc.cluster.local:27017"
    database: test
    collection: sink
    topics: sink-topic
    key.converter: org.apache.kafka.connect.json.JsonConverter
    value.converter: org.apache.kafka.connect.json.JsonConverter
    key.converter.schemas.enable: false
    value.converter.schemas.enable: false
Execute the command:
$ kubectl apply -f mongo-sink-connector.yaml
You should see the output:
kafkaconnector.kafka.strimzi.io/mongodb-sink-connector created
Before starting the producer, check the status of the connector and verify that the topic has been created, as follows:
Status:
[kafka@my-connect-cluster-connect-5d47fb574-69xpv kafka]$ curl http://localhost:8083/connectors/mongodb-sink-connector/status
{"name":"mongodb-sink-connector","connector":{"state":"RUNNING","worker_id":"IP-ADDRESS:8083"},"tasks":[{"id":0,"state":"RUNNING","worker_id":"IP-ADDRESS:8083"},{"id":1,"state":"RUNNING","worker_id":"IP-ADDRESS:8083"}],"type":"sink"}
[kafka@my-connect-cluster-connect-5d47fb574-69xpv kafka]$
Check topic creation; you should see sink-topic in the list:
[kafka@my-connect-cluster-connect-5d47fb574-69xpv kafka]$ bin/kafka-topics.sh --bootstrap-server my-cluster-kafka-bootstrap:9092 --list
__consumer_offsets
__strimzi-topic-operator-kstreams-topic-store-changelog
__strimzi_store_topic
connect-cluster-configs
connect-cluster-offsets
connect-cluster-status
sink-topic
Now, go to the Kafka server and run the producer:
[kafka@my-cluster-kafka-0 kafka]$ bin/kafka-console-producer.sh --broker-list my-cluster-kafka-bootstrap:9092 --topic sink-topic
Successful execution will show you a prompt like > to enter/test the data
>{"employee": {"name": "abc", "salary": 56000, "married": true}}
>
In another terminal, connect to the Kafka server and start a consumer to verify the data:
[kafka@my-cluster-kafka-0 kafka]$ bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic sink-topic --from-beginning
{"employee": {"name": "abc", "salary": 56000, "married": true}}
If you see this data, everything is working fine. Now let us check MongoDB. Connect to your MongoDB server and check:
rs0:PRIMARY> use test
switched to db test
rs0:PRIMARY> show collections
sink
rs0:PRIMARY> db.sink.find()
{ "_id" : ObjectId("6234a4a0dad1a2638f57a6b2"), "employee" : { "name" : "abc", "salary" : NumberLong(56000), "married" : true } }
Et voilà!
You're hitting a serialization exception. I'll break the message out a bit:
com.fasterxml.jackson.core.io.JsonEOFException: Unexpected end-of-input:
expected close marker for Object (start marker at [Source: (byte[])"{ "; line: 1, column: 1])
at [Source: (byte[])"{ "; line: 1, column: 4]
Caused by: com.fasterxml.jackson.core.io.JsonEOFException:
Unexpected end-of-input: expected close marker for Object (start marker at [Source: (byte[])"{ "; line: 1, column: 1])
at [Source: (byte[])"{ "; line: 1, column: 4]
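"expected close marker for Object" suggests to me that the parser expects the entire JSON object on one line rather than pretty-printed. kafka-console-producer sends each line of its input as a separate record, so the first record produced from the pretty-printed emp.json holds only the opening brace, which matches the (byte[])"{ " source shown in the error. Put the whole object on a single line instead: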
{"employee": {"name": "abc", "salary": 56000, "married": true}}
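One way to get from the pretty-printed file to a single-line record is to re-serialize it before producing. This is a minimal sketch using only the standard library; the function name is hypothetical, and reading emp.json from disk is left out so the example runs on its own.

```python
import json


def to_single_line(pretty_json: str) -> str:
    """Re-serialize a JSON document so it contains no line breaks."""
    # json.dumps without an indent argument emits the document on one line.
    return json.dumps(json.loads(pretty_json))


# Same content as emp.json from the question.
pretty = """{
  "employee": {
    "name": "abc",
    "salary": 56000,
    "married": true
  }
}"""
print(to_single_line(pretty))
```

The printed line can then be fed to kafka-console-producer as one record.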

how to access latest offset of topic in confluent kafka rest proxy to calculate lag

In the Confluent Kafka REST Proxy we can get the last committed offset of a particular consumer group, but how can we get the latest offset of the topic in order to calculate the lag?
You can use Kafka REST Proxy to fetch the latest offset committed for a particular partition. According to the Confluent Docs,
GET /consumers/(string: group_name)/instances/(string: instance)/offsets
Get the last committed offsets for the given partitions (whether the
commit happened by this process or another).
Note that this request must be made to the specific REST proxy
instance holding the consumer instance.
Parameters:
group_name (string) -- The name of the consumer group
instance (string) -- The ID of the consumer instance
Request JSON Array of Objects:
partitions -- A list of partitions to find the last committed offsets for
partitions[i].topic (string) -- Name of the topic
partitions[i].partition (int) -- Partition ID
Response JSON Array of Objects:
offsets -- A list of committed offsets
offsets[i].topic (string) -- Name of the topic for which an offset was committed
offsets[i].partition (int) -- Partition ID for which an offset was committed
offsets[i].offset (int) -- Committed offset
offsets[i].metadata (string) -- Metadata for the committed offset
Status Codes:
404 Not Found --
Error code 40402 -- Partition not found
Error code 40403 -- Consumer instance not found
Example Request:
GET /consumers/testgroup/instances/my_consumer/offsets HTTP/1.1
Host: proxy-instance.kafkaproxy.example.com
Accept: application/vnd.kafka.v2+json, application/vnd.kafka+json, application/json
{
  "partitions": [
    {
      "topic": "test",
      "partition": 0
    },
    {
      "topic": "test",
      "partition": 1
    }
  ]
}
Example Response:
HTTP/1.1 200 OK
Content-Type: application/vnd.kafka.v2+json
{"offsets":
  [
    {
      "topic": "test",
      "partition": 0,
      "offset": 21,
      "metadata":""
    },
    {
      "topic": "test",
      "partition": 1,
      "offset": 31,
      "metadata":""
    }
  ]
}
Looks like there is an early access feature for this: https://docs.confluent.io/platform/current/kafka-rest/api.html#get--clusters-cluster_id-consumer-groups-consumer_group_id-lags
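Once both numbers are available, the lag itself is simple arithmetic: end offset minus committed offset, per partition. The sketch below assumes the committed offsets come from a response shaped like the example above and that the end offsets were obtained some other way; the end-offset values and the function name are made up for illustration.

```python
def partition_lag(committed, end_offsets):
    """Return lag per (topic, partition): end offset minus committed offset."""
    lag = {}
    for tp, end in end_offsets.items():
        # Assumption: a partition without a committed offset counts from 0.
        lag[tp] = max(end - committed.get(tp, 0), 0)
    return lag


# Committed offsets, shaped like the example response above.
response = {"offsets": [
    {"topic": "test", "partition": 0, "offset": 21, "metadata": ""},
    {"topic": "test", "partition": 1, "offset": 31, "metadata": ""},
]}
committed = {(o["topic"], o["partition"]): o["offset"] for o in response["offsets"]}

# Hypothetical latest offsets for the same partitions.
end_offsets = {("test", 0): 25, ("test", 1): 31}
print(partition_lag(committed, end_offsets))
```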

How can I acquire the JSON data from Kafka using SparkStreaming

I use Kafka to monitor changes to a local file and Spark Streaming to analyse them, but I can't extract the data from Kafka because it is in JSON format.
When I run the command bin/kafka-console-consumer.sh --bootstrap-server master:9092,slave1:9092,slave2:9092 --topic kafka-streaming --from-beginning,
the format of the data is:
{
"schema": {
"type": "string",
"optional": false
},
"payload": "{\"like_count\": 594, \"view_count\": 49613, \"user_name\": \" w\", \"play_url\": \"http://upic/2019/04/08/12/BMjAxOTA0MDgxMjQ4MTlfMjA3ODc2NTY2XzEyMDQzOTQ0MTc4XzJfMw==_b_Bfa330c5ca9009708aaff0167516a412d.mp4?tag=1-1555248600-h-0-gjcfcmzmef-954d5652f100c12e\", \"description\": \"ţ ų ඣ 9 9 9 9\", \"cover\": \"http://uhead/AB/2016/03/09/18/BMjAxNjAzMDkxODI1MzNfMjA3ODc2NTY2XzJfaGQ5OQ==.jpg\", \"video_id\": 5235997527237673952, \"comment_count\": 39, \"download_url\": \"http://2019/04/08/12/BMjAxOTA0MDgxMjQ4MTlfMjA3ODc2NTY2XzEyMDQzOTQ0MTc4XzJfMw==_b_Bfa330c5ca9009708aaff0167516a412d.mp4?tag=1-1555248600-h-1-zdpjkouqke-5862405191e4c1e4\", \"user_id\": 207876566, \"video_create_time\": \"2019-04-08 12:48:21\", \"user_sex\": \"F\"}"
}
The Spark version is 2.3.0 and the Kafka version is 1.1.0. The spark-streaming-kafka version is 0-10_2.11-2.3.0.
The JSON data in the payload field is what I want to process and analyse. How can I change the code to get at that JSON data?
Use org.apache.kafka.common.serialization.StringDeserializer and org.apache.kafka.common.serialization.StringSerializer for consuming from and sending to the Kafka topic, respectively.
This way you get a String on consumption, which can easily be converted to a JSON object using a JSON parser.
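Note that in the sample record the payload field is itself a JSON-encoded string, so it needs a second parse after the envelope is parsed. The following is a sketch in plain Python rather than Spark code; the function name and the shortened sample record are hypothetical, and the same two-step parse would presumably be applied to each record value in the stream.

```python
import json


def extract_payload(record_value: str) -> dict:
    """Parse the schema/payload envelope, then parse the payload string."""
    envelope = json.loads(record_value)
    # "payload" holds JSON text, not a nested object, so parse it again.
    return json.loads(envelope["payload"])


# A shortened stand-in for the record shown in the question.
sample = json.dumps({
    "schema": {"type": "string", "optional": False},
    "payload": json.dumps({"like_count": 594, "view_count": 49613, "user_sex": "F"}),
})
data = extract_payload(sample)
print(data["like_count"], data["view_count"])
```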