Registering Schema ID with Topic using confluent_kafka for python - apache-kafka

The only answer I have gotten so far is that you have to give the schema and the topic the same name, and then this should link them together. But after registering a schema with the name test_topic like:
{
  "type": "record",
  "name": "test_topic",
  "namespace": "com.test",
  "doc": "My test schema",
  "fields": [
    {
      "name": "name",
      "type": "string"
    }
  ]
}
and then running the following command, the record is inserted without a problem.
curl -X POST -H "Content-Type: application/vnd.kafka.json.v2+json" -H "Accept: application/vnd.kafka.v2+json" --data '{"records":[{"value":{"name": "My first name"}}]}' "http://localhost/topics/test_topic"
But when I run the following command as well, it also inserts without giving any error (note: I changed the property name):
curl -X POST -H "Content-Type: application/vnd.kafka.json.v2+json" -H "Accept: application/vnd.kafka.v2+json" --data '{"records":[{"value":{"test": "My first name"}}]}' "http://localhost/topics/test_topic"
I would have expected an error message saying that my data does not match the schema for this topic...
My schema ID is 10, so I know the schema is registered, but that is not very useful at the moment.
Python Code:
from confluent_kafka import Producer
import socket
import json

conf = {'bootstrap.servers': 'localhost:9092', 'client.id': socket.gethostname()}
producer = Producer(conf)

def acked(err, msg):
    if err is not None:
        print(f'Failed to deliver message: {str(msg)}, {str(err)}')
    else:
        print(f'Message produced: {str(msg)}')

name = "My first name"  # example value so the snippet runs
producer.produce("test_topic", key="key", value=json.dumps({"test": name}).encode('ascii'), callback=acked)
producer.poll(5)

you have to give the schema and the topic the same name, and then this should link them together
That's not quite how the Schema Registry works.
Each Kafka record has a key and a value.
The Registry has subjects, which are not strictly mapped to topics.
However, the Kafka clients' (de)serializer implementations will use the topic-key and topic-value subject names to register and look up schemas in the Registry.
Clients cannot tell the Registry what ID to assign a schema; that is calculated server-side.
I'm not sure I understand what your post has to do with the REST Proxy, but you're posting plain JSON and not telling it that the data should be Avro (you're using the incorrect header).
If Avro is used, the content type must be application/vnd.kafka.avro.v2+json, and the request body must include the schema itself or the ID of an already-registered schema.
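For reference, here is a sketch of producing Avro through the REST Proxy against your already-registered schema (the REST Proxy URL and schema ID 10 are taken from your post; the Schema Registry at localhost:8081 is an assumption, so adjust hostnames to your setup). A record whose fields do not match the schema should then be rejected with a conversion error instead of being accepted silently:
# List the subjects the serializers have registered (topic-key / topic-value naming)
curl "http://localhost:8081/subjects"
# Produce to test_topic through the REST Proxy, referencing the registered schema by ID
curl -X POST \
    -H "Content-Type: application/vnd.kafka.avro.v2+json" \
    -H "Accept: application/vnd.kafka.v2+json" \
    --data '{"value_schema_id": 10, "records": [{"value": {"name": "My first name"}}]}' \
    "http://localhost/topics/test_topic"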

Related

Kafka: Source-Connector to Topic mapping is Flakey

I have the following Kafka connector configuration (below). I have created the "member" topic already (30 partitions). The problem is that I will install the connector and it will work; i.e.
curl -d "@mobiledb-member.json" -H "Content-Type: application/json" -X PUT https://TTGSSQA0VRHAP81.ttgtpmg.net:8085/connectors/mobiledb-member-connector/config
curl -s https://TTGSSQA0VRHAP81.ttgtpmg.net:8085/connectors/member-connector/topics
returns:
{"member-connector":{"topics":["member"]}}
the status call returns no errors:
curl -s https://TTGSSQA0VRHAP81.ttgtpmg.net:8085/connectors/mobiledb-member-connector/status
{"name":"member-connector","connector":{"state":"RUNNING","worker_id":"ttgssqa0vrhap81.***.net:8085"},"tasks":[{"id":0,"state":"RUNNING","worker_id":"ttgssqa0vrhap81.***.net:8085"}],"type":"source"}
... but at other times, I will install a similar connector config and it will return no topics.
{"member-connector":{"topics":[]}}
Yet the status shows no errors and the Connector logs show no clues as to why this "connector to topic" mapping isn't working. Why aren't the logs helping out?
Connector configuration.
{
  "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
  "transforms.createKey.type": "org.apache.kafka.connect.transforms.ValueToKey",
  "connection.url": "jdbc:sqlserver:****;",
  "connection.user": "***",
  "connection.password": "***",
  "transforms": "createKey",
  "table.poll.interval.ms": "120000",
  "key.converter.schemas.enable": "false",
  "value.converter.schemas.enable": "false",
  "poll.interval.ms": "5000",
  "value.converter": "org.apache.kafka.connect.json.JsonConverter",
  "key.converter": "org.apache.kafka.connect.json.JsonConverter",
  "name": "member-connector",
  "tasks.max": "4",
  "query": "SELECT * FROM member_kafka_test",
  "table.types": "TABLE",
  "topic.prefix": "member",
  "mode": "timestamp+incrementing",
  "transforms.createKey.fields": "member_id",
  "incrementing.column.name": "member_id",
  "timestamp.column.name": "update_datetime"
}

How does the JDBC sink connector insert values into a Postgres database

I'm using the JDBC sink connector to load data from a Kafka topic into a Postgres database.
Here is my configuration:
curl --location --request PUT 'http://localhost:8083/connectors/sink_1/config' \
--header 'Content-Type: application/json' \
--data-raw '{
  "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
  "connection.url": "jdbc:postgresql://localhost:5432/postgres",
  "connection.user": "user",
  "connection.password": "passwd",
  "tasks.max": "10",
  "topics": "<topic_name_same_as_tablename>",
  "insert.mode": "insert",
  "key.converter": "org.apache.kafka.connect.converters.ByteArrayConverter",
  "value.converter": "org.apache.kafka.connect.json.JsonConverter",
  "quote.sql.identifiers": "never",
  "errors.tolerance": "all",
  "errors.deadletterqueue.topic.name": "failed_records",
  "errors.deadletterqueue.topic.replication.factor": "1",
  "errors.log.enable": "true"
}'
My table has 100k+ records, so I tried partitioning the topic into 10 partitions and setting tasks.max to 10 to speed up the loading process, which was much faster compared to a single partition.
Can someone help me understand how the sink connector loads data into Postgres? What will the insert statement look like: approach-1 or approach-2? If it is approach-1, can we achieve approach-2, and if yes, how?

Kafka connect MongoDB Kafka connector using JSONConverter not working

I'm trying to configure the Kafka connector to use MongoDB as the source and send the records into Kafka topics.
I've successfully done so, but I'm trying to do it with the JSONConverter in order to also save the schema with the payload.
My problem is that the connector is saving the data as follows:
{ "schema": { "type": "string" } , "payload": "{....}" }
In other words, it's automatically assuming the actual JSON is a string and it's saving the schema as String.
This is how I'm setting up the connector:
curl -X POST http://localhost:8083/connectors -H "Content-Type: application/json" -d '{
  "name": "newtopic",
  "config": {
    "tasks.max": 1,
    "connector.class": "com.mongodb.kafka.connect.MongoSourceConnector",
    "key.converter": "org.apache.kafka.connect.json.JsonConverter",
    "key.converter.schemas.enabled": "true",
    "value.converter": "org.apache.kafka.connect.json.JsonConverter",
    "value.converter.schemas.enabled": "true",
    "connection.uri": "[MONGOURL]",
    "database": "dbname",
    "collection": "collname",
    "pipeline": "[]",
    "topic.prefix": "",
    "publish.full.document.only": "true"
  }
}'
Am I missing something for the configuration? Is it simply not able to guess the schema of the document stored in MongoDB, so it goes with String?
There's nothing to guess. Mongo is schemaless, last I checked, so the schema is just a string or bytes. I would suggest using the AvroConverter, or setting schemas.enable to false on the JsonConverter.
You may also want to try using Debezium to see if you get different results.
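As a sketch of that second option, the converter properties in your connector config would change to the following (note the property name is schemas.enable, not schemas.enabled):
"key.converter": "org.apache.kafka.connect.json.JsonConverter",
"key.converter.schemas.enable": "false",
"value.converter": "org.apache.kafka.connect.json.JsonConverter",
"value.converter.schemas.enable": "false"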
The latest version 1.3 of the MongoDB connector should solve this issue.
It even provides an option for inferring the source schema.
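If you upgrade, a rough sketch of the relevant source-side options (property names as documented for the 1.3 connector; double-check them against your version) would be:
"output.format.value": "schema",
"output.schema.infer.value": "true"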

Shows invalid characters while consuming using kafka console consumer

While consuming from the Kafka topic using the Kafka console consumer or kt (a Golang CLI tool for Kafka), I am getting invalid characters.
...
\u0000\ufffd?\u0006app\u0000\u0000\u0000\u0000\u0000\u0000\u003e#\u0001
\u0000\u000cSec-39\u001aSome Actual Value Text\ufffd\ufffd\ufffd\ufffd\ufffd
\ufffd\u0015#\ufffd\ufffd\ufffd\ufffd\ufffd\ufff
...
Even though Kafka connect can actually sink the proper data to an SQL database.
Given that you say
Kafka connect can actually sink the proper data to an SQL database.
my assumption would be that you're using Avro serialization for the data on the topic. Kafka Connect configured correctly will take the Avro data and deserialise it.
However, console tools such as kafka-console-consumer, kt, kafkacat et al do not support Avro, and so you get a bunch of weird characters if you use them to read data from a topic that is Avro-encoded.
To read Avro data to the command line you can use kafka-avro-console-consumer:
kafka-avro-console-consumer \
    --bootstrap-server kafka:29092 \
    --topic test_topic_avro \
    --property schema.registry.url=http://schema-registry:8081
Edit: adding a suggestion from @CodeGeas too:
Alternatively, reading the data using the REST Proxy can be done with the following:
# Create a consumer instance that reads Avro data
curl -X POST -H "Content-Type: application/vnd.kafka.v2+json" \
    -H "Accept: application/vnd.kafka.v2+json" \
    --data '{"name": "my_consumer_instance", "format": "avro", "auto.offset.reset": "earliest"}' \
    http://kafka-rest-instance:8082/consumers/my_json_consumer
# Subscribe the consumer to a topic
curl -X POST -H "Content-Type: application/vnd.kafka.v2+json" \
    --data '{"topics":["YOUR-TOPIC-NAME"]}' \
    http://kafka-rest-instance:8082/consumers/my_json_consumer/instances/my_consumer_instance/subscription
# Then consume some data from the topic, using the base URI from the first response
curl -X GET -H "Accept: application/vnd.kafka.avro.v2+json" \
    http://kafka-rest-instance:8082/consumers/my_json_consumer/instances/my_consumer_instance/records
To delete the consumer afterwards:
curl -X DELETE -H "Accept: application/vnd.kafka.avro.v2+json" \
    http://kafka-rest-instance:8082/consumers/my_json_consumer/instances/my_consumer_instance
By default, the console consumer tool deserializes both the message key and value using the ByteArrayDeserializer, and then prints the data to the command line using the default formatter.
This tool, however, allows you to customize the deserializers and the formatter used. See the following extract from the help output:
--formatter <String: class>            The name of a class to use for
                                         formatting kafka messages for
                                         display. (default: kafka.tools.
                                         DefaultMessageFormatter)
--property <String: prop>              The properties to initialize the
                                         message formatter. Default
                                         properties include:
                                           print.timestamp=true|false
                                           print.key=true|false
                                           print.value=true|false
                                           key.separator=<key.separator>
                                           line.separator=<line.separator>
                                           key.deserializer=<key.deserializer>
                                           value.deserializer=<value.
                                             deserializer>
                                         Users can also pass in customized
                                           properties for their formatter; more
                                           specifically, users can pass in
                                           properties keyed with 'key.
                                           deserializer.' and 'value.
                                           deserializer.' prefixes to configure
                                           their deserializers.
--key-deserializer <String:
    deserializer for key>
--value-deserializer <String:
    deserializer for values>
Using these settings, you should be able to change the output to be what you want.
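For example, as a sketch only (the broker address and topic are placeholders, and the deserializer choices have to match how the data was actually produced):
kafka-console-consumer \
    --bootstrap-server kafka:29092 \
    --topic test_topic \
    --from-beginning \
    --property print.key=true \
    --property key.separator=" | " \
    --key-deserializer org.apache.kafka.common.serialization.StringDeserializer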

How do I create the json for creating a distributed Kafka Connect Instance with a transformation?

Using standalone mode, I create a connector and my customized transformation like this:
name=rabbitmq-source
connector.class=com.github.jcustenborder.kafka.connect.rabbitmq.RabbitMQSourceConnector
tasks.max=1
rabbitmq.host=rabbitmq-server
rabbitmq.queue=answers
kafka.topic=net.gutefrage.answers
transforms=extractFields
transforms.extractFields.type=net.gutefrage.connector.transforms.ExtractFields$Value
transforms.extractFields.fields=body,envelope.routingKey
transforms.extractFields.structName=net.gutefrage.events
But for a distributed connector, what is the syntax for the PUT request to the Connect REST API? I cannot find any example in the docs.
I've already tried a couple of things, like:
cat <<EOF >/tmp/connector
{
  "name": "rabbitmq-source",
  "config": {
    "connector.class": "com.github.jcustenborder.kafka.connect.rabbitmq.RabbitMQSourceConnector",
    "tasks.max": "1",
    "rabbitmq.host": "rabbitmq-server",
    "rabbitmq.queue": "answers",
    "kafka.topic": "net.gutefrage.answers",
    "transforms": "extractFields",
    "transforms.extractFields": {
      "type": "net.gutefrage.connector.transforms.ExtractFields$Value",
      "fields": "body,envelope.routingKey",
      "structName": "net.gutefrage.events"
    }
  }
}
EOF
curl -vs --stderr - -X POST -H "Content-Type: application/json" --data @/tmp/connector "http://localhost:8083/connectors"
rm /tmp/connector
Nor did this work:
{
  "name": "rabbitmq-source",
  "config": {
    "connector.class": "com.github.jcustenborder.kafka.connect.rabbitmq.RabbitMQSourceConnector",
    "tasks.max": "1",
    "rabbitmq.host": "rabbitmq-server",
    "rabbitmq.queue": "answers",
    "kafka.topic": "net.gutefrage.answers",
    "transforms": "extractFields",
    "transforms.extractFields.type": "net.gutefrage.connector.transforms.ExtractFields$Value",
    "transforms.extractFields.fields": "body,envelope.routingKey",
    "transforms.extractFields.structName": "net.gutefrage.events"
  }
}
For the last variant I get the following error:
{"error_code":400,"message":"Connector configuration is invalid and contains the following 1 error(s):\nInvalid value class net.gutefrage.connector.transforms.ExtractFields for configuration transforms.extractFields.type: Error getting config definition from Transformation: null\nYou can also find the above list of errors at the endpoint `/{connectorType}/config/validate`"}
Please note that with the properties format it works nicely (using Landoop's Create New Connector UI in fast-data-dev). Interestingly, Landoop's UI feature 'translate to curl' produces the very same JSON as my second example.
Update
To be sure that it's not a problem with Landoop, Docker, or my custom transformation, I've started ZooKeeper, a broker, the Schema Registry and Kafka Connect in distributed mode with the standard distributed properties from COP 3.3.0:
bin/connect-distributed etc/schema-registry/connect-avro-distributed.properties
which logs
[2017-09-13 14:07:52,930] INFO Loading plugin from: /opt/connectors/confluent-oss-gf-assembly-1.0.jar (org.apache.kafka.connect.runtime.isolation.DelegatingClassLoader:176)
[2017-09-13 14:07:53,711] INFO Registered loader: PluginClassLoader{pluginLocation=file:/opt/connectors/confluent-oss-gf-assembly-1.0.jar} (org.apache.kafka.connect.runtime.isolation.DelegatingClassLoader:199)
[2017-09-13 14:07:53,711] INFO Added plugin 'com.github.jcustenborder.kafka.connect.rabbitmq.RabbitMQSourceConnector' (org.apache.kafka.connect.runtime.isolation.DelegatingClassLoader:132)
[2017-09-13 14:07:53,712] INFO Added plugin 'net.gutefrage.connector.transforms.ExtractFields$Key' (org.apache.kafka.connect.runtime.isolation.DelegatingClassLoader:132)
[2017-09-13 14:07:53,712] INFO Added plugin 'net.gutefrage.connector.transforms.ExtractFields$Value' (org.apache.kafka.connect.runtime.isolation.DelegatingClassLoader:132)
All good so far. Then I created a connector config:
cat <<EOF >/tmp/connector
{
  "name": "rabbitmq-source",
  "config": {
    "connector.class": "com.github.jcustenborder.kafka.connect.rabbitmq.RabbitMQSourceConnector",
    "tasks.max": "1",
    "rabbitmq.host": "rabbitmq-server",
    "rabbitmq.queue": "answers",
    "kafka.topic": "net.gutefrage.answers",
    "transforms": "extractFields",
    "transforms.extractFields.type": "org.apache.kafka.connect.transforms.ExtractField$Value",
    "transforms.extractFields.field": "body"
  }
}
EOF
Please note that here I now use the standard (bundled) ExtractField transform.
When I post that with curl -vs --stderr - -X POST -H "Content-Type: application/json" --data @/tmp/connector "http://localhost:8083/connectors"
I get the same kind of error:
{"error_code":400,"message":"Connector configuration is invalid and contains the following 1 error(s):\nInvalid value class org.apache.kafka.connect.transforms.ExtractField for configuration transforms.extractFields.type: Error getting config definition from Transformation: null\nYou can also find the above list of errors at the endpoint `/{connectorType}/config/validate`"}
Make sure the $Value in transforms.extractFields.type=net.gutefrage.connector.transforms.ExtractFields$Value is not expanded as a variable by bash when you build the JSON with cat <<EOF. Note that the class name in your error message is missing the $Value suffix, which is exactly what happens when an unquoted heredoc expands $Value to an empty string. Preventing that expansion worked for me.
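For example, a sketch of the same heredoc with a quoted delimiter so that bash leaves $Value untouched:
cat <<'EOF' >/tmp/connector
{
  "name": "rabbitmq-source",
  "config": {
    "connector.class": "com.github.jcustenborder.kafka.connect.rabbitmq.RabbitMQSourceConnector",
    "tasks.max": "1",
    "rabbitmq.host": "rabbitmq-server",
    "rabbitmq.queue": "answers",
    "kafka.topic": "net.gutefrage.answers",
    "transforms": "extractFields",
    "transforms.extractFields.type": "net.gutefrage.connector.transforms.ExtractFields$Value",
    "transforms.extractFields.fields": "body,envelope.routingKey",
    "transforms.extractFields.structName": "net.gutefrage.events"
  }
}
EOF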
If you want to run the Kafka Connect worker in standalone mode, then you must start the worker and supply the worker configuration file and one or more connector configuration files. All of those configuration files are in Java properties format, so the first configuration sample you provided is the correct format:
name=rabbitmq-source
connector.class=com.github.jcustenborder.kafka.connect.rabbitmq.RabbitMQSourceConnector
tasks.max=1
rabbitmq.host=rabbitmq-server
rabbitmq.queue=answers
kafka.topic=net.gutefrage.answers
transforms=extractFields
transforms.extractFields.type=net.gutefrage.connector.transforms.ExtractFields$Value
transforms.extractFields.fields=body,envelope.routingKey
transforms.extractFields.structName=net.gutefrage.events
If you want to run the Kafka Connect worker in distributed mode, then you will have to first start the distributed worker and then create the connector as a second step using the REST API: either a POST request to the /connectors endpoint with a JSON document containing the name and the config (matching the format of your second JSON document), or a PUT request to /connectors/{name}/config with just the config object:
{
  "name": "rabbitmq-source",
  "config": {
    "connector.class": "com.github.jcustenborder.kafka.connect.rabbitmq.RabbitMQSourceConnector",
    "tasks.max": "1",
    "rabbitmq.host": "rabbitmq-server",
    "rabbitmq.queue": "answers",
    "kafka.topic": "net.gutefrage.answers",
    "transforms": "extractFields",
    "transforms.extractFields.type": "net.gutefrage.connector.transforms.ExtractFields$Value",
    "transforms.extractFields.fields": "body,envelope.routingKey",
    "transforms.extractFields.structName": "net.gutefrage.events"
  }
}
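As a sketch (assuming the worker is at localhost:8083 as in your curl commands), the two variants look like this:
# Create the connector: POST the name + config wrapper
curl -X POST -H "Content-Type: application/json" \
    --data @/tmp/connector "http://localhost:8083/connectors"
# Or create/update it idempotently: PUT only the config object to /connectors/<name>/config
curl -X PUT -H "Content-Type: application/json" \
    --data '{
      "connector.class": "com.github.jcustenborder.kafka.connect.rabbitmq.RabbitMQSourceConnector",
      "tasks.max": "1",
      "rabbitmq.host": "rabbitmq-server",
      "rabbitmq.queue": "answers",
      "kafka.topic": "net.gutefrage.answers",
      "transforms": "extractFields",
      "transforms.extractFields.type": "net.gutefrage.connector.transforms.ExtractFields$Value",
      "transforms.extractFields.fields": "body,envelope.routingKey",
      "transforms.extractFields.structName": "net.gutefrage.events"
    }' \
    "http://localhost:8083/connectors/rabbitmq-source/config"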
The Confluent CLI, included in Confluent's Open Source Platform, is a developer tool that helps you quickly get started by running a ZooKeeper instance, a Kafka broker, the Confluent Schema Registry, the REST Proxy, and a Connect worker in distributed mode. When you load a connector, you specify the connector configuration as either a JSON file or a properties file; the CLI converts the latter to JSON using jq.
However, the error you reported is:
{
"error_code":400,
"message":"Connector configuration is invalid and contains the following 1 error(s):\nInvalid value class net.gutefrage.connector.transforms.ExtractFields for configuration transforms.extractFields.type: Error getting config definition from Transformation: null\nYou can also find the above list of errors at the endpoint `/{connectorType}/config/validate`"
}
The important part of this error message is "Error getting config definition from Transformation: null". Although this is a bit too cryptic, it means that the config() method of the net.gutefrage.connector.transforms.ExtractFields Java class is returning null.
Make sure that the net.gutefrage.connector.transforms.ExtractFields$Value string you specified is the correct fully qualified name for the nested static class Value, and that the Value class fully and correctly implements the org.apache.kafka.connect.transforms.Transformation<R extends ConnectRecord<R>> interface. Note that the config() method must return a non-null ConfigDef object.
Take a look at this example of a Single Message Transform (SMT) that ships with Apache Kafka, or Robin's blog post for other examples.
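To iterate on this without repeatedly creating the connector, you can also hit the validate endpoint mentioned in the error message. Roughly (the worker address is assumed, and the last path segment is the connector class name; the body only needs the configs you want checked plus connector.class):
curl -X PUT -H "Content-Type: application/json" \
    --data '{
      "connector.class": "com.github.jcustenborder.kafka.connect.rabbitmq.RabbitMQSourceConnector",
      "transforms": "extractFields",
      "transforms.extractFields.type": "net.gutefrage.connector.transforms.ExtractFields$Value"
    }' \
    "http://localhost:8083/connector-plugins/RabbitMQSourceConnector/config/validate"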
To use the JSON format for the connector config with the CP Connect CLI, the jq tool has to be installed on the machine where the Kafka Connect cluster is running.
E.g. for Landoop's fast-data-dev environment you'll have to run:
docker exec rabbitmqconnect_fast-data-dev_1 apk add --no-cache jq
Then this will work:
docker exec rabbitmqconnect_fast-data-dev_1 /opt/confluent-3.3.0/bin/confluent config rabbitmq-source -d /tmp/connector-config.json
This doesn't solve the issue when using the connector REST endpoint though.
With fast-data-dev you can build a JAR file for any connector, and then just add it to the classpath with the instructions at
https://github.com/Landoop/fast-data-dev#enable-additional-connectors
The UI will auto-detect the new connector and provide you with instructions when you hit NEW for the new connector at:
http://localhost:3030/kafka-connect-ui
Since fast-data-dev already comes with a generic MQTT sink connector, that would also be worth trying out. See the instructions at http://docs.datamountaineer.com/en/latest/mqtt-sink.html
You would effectively need to do
connect.mqtt.kcql=INSERT INTO /answers SELECT body FROM net.gutefrage.answers
As this is a generic MQTT connector, you will possibly need to add the RabbitMQ client library using the enable-additional-connectors instructions.