Kafka Connect MongoDB sink connector using kafka-avro-console-producer

I'm trying to write some documents to MongoDB using the Kafka Connect MongoDB connector. I've managed to set up all the required components and start the connector, but when I send a message to Kafka using kafka-avro-console-producer, Kafka Connect gives me the following error:
org.apache.kafka.connect.errors.DataException: Error: `operationType` field is doc is missing.
I've tried adding this field to the message, but then Kafka Connect asks me to include a documentKey field as well. It seems I need to include some extra fields beyond the payload defined in my schema, but I can't find comprehensive documentation. Does anyone have an example of a Kafka message payload (using kafka-avro-console-producer) that goes through a Kafka -> Kafka Connect -> MongoDB pipeline?
Here is an example of one of the messages I'm sending to Kafka (by the way, kafka-avro-console-consumer is able to consume them):
./kafka-avro-console-producer --broker-list kafka:9093 --topic sampledata --property value.schema='{"type":"record","name":"myrecord","fields":[{"name":"field1","type":"string"}]}'
{"field1": "value1"}
And here is the sink connector configuration:
{
  "name": "mongo-sink",
  "config": {
    "connector.class": "com.mongodb.kafka.connect.MongoSinkConnector",
    "value.converter": "io.confluent.connect.avro.AvroConverter",
    "value.converter.schema.registry.url": "http://schemaregistry:8081",
    "connection.uri": "mongodb://cadb:27017",
    "database": "cognitive_assistant",
    "collection": "topicData",
    "topics": "sampledata6",
    "change.data.capture.handler": "com.mongodb.kafka.connect.sink.cdc.mongodb.ChangeStreamHandler"
  }
}

I've just managed to make the connector work: I deleted the change.data.capture.handler property from the connector configuration and it works now. (The ChangeStreamHandler expects messages shaped like MongoDB change-stream events, with fields such as operationType and documentKey, which is why plain records were being rejected.)
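For reference, the working configuration is the same as above minus the CDC handler (a sketch; host names, database, and topic values are copied from the question):

```json
{
  "name": "mongo-sink",
  "config": {
    "connector.class": "com.mongodb.kafka.connect.MongoSinkConnector",
    "value.converter": "io.confluent.connect.avro.AvroConverter",
    "value.converter.schema.registry.url": "http://schemaregistry:8081",
    "connection.uri": "mongodb://cadb:27017",
    "database": "cognitive_assistant",
    "collection": "topicData",
    "topics": "sampledata6"
  }
}
```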

FlinkKafkaConsumer/Producer & Confluent Avro schema registry: Validation failed & Compatibility mode writes invalid schema

Hello all, I'm struggling with (de)serializing a simple Avro schema together with the Schema Registry.
The setup:
2 Flink jobs written in java (one consumer, one producer)
1 confluent schema registry for schema validation
1 kafka cluster for messaging
The target:
The producer should send a message serialized with ConfluentRegistryAvroSerializationSchema, which includes updating and validating the schema.
The consumer should then deserialize the message into an object with the received schema, using ConfluentRegistryAvroDeserializationSchema.
So far so good:
If I configure my subject on the Schema Registry to be FORWARD-compatible, the producer writes the correct Avro schema to the registry, but it fails with the following error (even if I completely and permanently delete the subject first):
Failed to send data to Kafka: Schema being registered is incompatible with an earlier schema for subject "my.awesome.MyExampleEntity-value"
The schema was successfully written:
{
  "subject": "my.awesome.MyExampleEntity-value",
  "version": 1,
  "id": 100028,
  "schema": "{\"type\":\"record\",\"name\":\"input\",\"namespace\":\"my.awesome.MyExampleEntity\",\"fields\":[{\"name\":\"serialNumber\",\"type\":\"string\"},{\"name\":\"editingDate\",\"type\":\"int\",\"logicalType\":\"date\"}]}"
}
Following this, I tried setting the compatibility to NONE.
If I do so, I can produce data to Kafka, but:
The schema registry has a new version of my schema looking like this:
{
  "subject": "my.awesome.MyExampleEntity-value",
  "version": 2,
  "id": 100031,
  "schema": "\"bytes\""
}
Now I can produce data, but the consumer is not able to deserialize this schema, emitting the following error:
Caused by: org.apache.avro.AvroTypeException: Found bytes, expecting my.awesome.MyExampleEntity
...
I'm currently not sure where exactly the problem is.
Even if I completely and permanently delete the subject (including schemas), my producer should work fine from scratch, registering a whole "new" subject with its schema.
On the other hand, if I set the compatibility to "NONE", the schema exchange should work anyway, registering a schema which the consumer can read.
Can anybody help me out here?
(According to the latest Confluent docs, NONE means schema compatibility checks are disabled.)
The whole problem with serialisation was the use of the following flags in the Kafka config:
"schema.registry.url"
"key.serializer"
"key.deserializer"
"value.serializer"
"value.deserializer"
Setting these flags in Flink, even though they are logically correct, leads to undebuggable schema-validation and serialisation chaos.
Omitting all of these flags makes it work fine.
The registry URL needs to be set in ConfluentRegistryAvro(De)serializationSchema only.
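In other words, the Properties object handed to Flink's Kafka source/sink should carry only connection-level settings (a sketch; host and group names are placeholders):

```properties
bootstrap.servers=kafka:9092
group.id=my-flink-consumer
# Deliberately omitted: key.serializer, value.serializer, key.deserializer,
# value.deserializer, schema.registry.url -- the registry URL is passed to
# ConfluentRegistryAvro(De)serializationSchema instead.
```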

Error while consuming AVRO Kafka Topic from KSQL Stream

I created some dummy data as a stream in ksqlDB with
VALUE_FORMAT='JSON' TOPIC='MYTOPIC'
The setup uses Docker Compose. I am running a Kafka broker, Schema Registry, ksqldb-cli, ksqldb-server, and ZooKeeper.
Now I want to consume these records from the topic.
My first (and so far only) approach was via the command line, with the following command:
docker run --net=host --rm confluentinc/cp-schema-registry:5.0.0 kafka-avro-console-consumer
--bootstrap-server localhost:29092 --topic DXT --from-beginning --max-messages 10
--property print.key=true --property print.value=true
--value-deserializer io.confluent.kafka.serializers.KafkaAvroDeserializer
--key-deserializer org.apache.kafka.common.serialization.StringDeserializer
But that just returns the error
[2021-04-22 21:45:42,926] ERROR Unknown error when running consumer: (kafka.tools.ConsoleConsumer$:76)
org.apache.kafka.common.errors.SerializationException: Error deserializing Avro message for id -1
Caused by: org.apache.kafka.common.errors.SerializationException: Unknown magic byte!
I also tried various approaches in Java Spring, but to no avail. I just cannot consume the created topics.
If I need to define my own schema, where should I do that, and what would be the easiest way, given that I just created a stream in ksqlDB?
Is there an easy-to-follow example? I did not specify anything else when I created the stream, just like in the quickstart example on ksqldb.io (I added the Schema Registry to my deployment).
As I am a noob who has been sitting here for almost 10 hours, any help would be appreciated.
Edit: I found that pure JSON does not need the Schema Registry with ksqlDB.
But how to deserialize it?
If you've written JSON data to the topic then you can read it with the kafka-console-consumer.
The error you're getting (Error deserializing Avro message for id -1 … Unknown magic byte!) is because you're using kafka-avro-console-consumer, which attempts to deserialise the topic data as Avro, which it isn't; hence the error.
You can also use PRINT DXT; from within ksqlDB.
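The "magic byte" the error refers to is the first byte of the Confluent Avro wire format: 0x00, followed by a 4-byte big-endian schema ID, then the Avro payload. A plain JSON record fails that check immediately. A minimal sketch of the framing (the function and sample messages are illustrative, not the deserializer's actual code):

```python
import struct

def has_confluent_avro_header(message: bytes) -> bool:
    """Return True if the message starts with the Confluent wire-format
    header: magic byte 0x00 followed by a 4-byte big-endian schema ID."""
    if len(message) < 5 or message[0] != 0x00:
        return False
    schema_id = struct.unpack(">I", message[1:5])[0]
    # a real deserializer would now fetch schema `schema_id` from the registry
    return True

# A JSON record, as written by a ksqlDB stream with VALUE_FORMAT='JSON':
print(has_confluent_avro_header(b'{"COL1": "v"}'))   # False -> "Unknown magic byte!"
# An Avro record framed by the Confluent serializer (schema ID 42, payload elided):
print(has_confluent_avro_header(b"\x00\x00\x00\x00\x2apayload"))  # True
```

This is why the same topic works with kafka-console-consumer (no framing expected) but not with kafka-avro-console-consumer.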

Configure Apache Kafka sink jdbc connector

I want to send the data from the topic to a PostgreSQL database, so I followed this guide and configured the properties file like this:
name=transaction-sink
connector.class=io.confluent.connect.jdbc.JdbcSinkConnector
tasks.max=1
topics=transactions
connection.url=jdbc:postgresql://localhost:5432/db
connection.user=db-user
connection.password=
auto.create=true
insert.mode=insert
table.name.format=transaction
pk.mode=none
I start the connector with
./bin/connect-standalone etc/schema-registry/connect-avro-standalone.properties etc/kafka-connect-jdbc/sink-quickstart-postgresql.properties
The sink-connector is created but does not start due to this error:
Caused by: org.apache.kafka.common.errors.SerializationException: Error deserializing Avro message for id -1
Caused by: org.apache.kafka.common.errors.SerializationException: Unknown magic byte!
The schema is in Avro format and registered, and I can send (produce) messages to the topic and read (consume) them from it. But I can't seem to send them to the database.
This is my ./etc/schema-registry/connect-avro-standalone.properties
key.converter=io.confluent.connect.avro.AvroConverter
key.converter.schema.registry.url=http://localhost:8081
value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=http://localhost:8081
This is the producer feeding the topic using the Java API:
properties.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
properties.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, KafkaAvroSerializer.class);
properties.put(AbstractKafkaAvroSerDeConfig.SCHEMA_REGISTRY_URL_CONFIG, "http://localhost:8081");

try (KafkaProducer<String, Transaction> producer = new KafkaProducer<>(properties)) {
    Transaction transaction = new Transaction();
    transaction.setFoo("foo");
    transaction.setBar("bar");
    UUID uuid = UUID.randomUUID();
    final ProducerRecord<String, Transaction> record = new ProducerRecord<>(TOPIC, uuid.toString(), transaction);
    producer.send(record);
}
I'm verifying data is properly serialized and deserialized using
./bin/kafka-avro-console-consumer --bootstrap-server localhost:9092 \
--property schema.registry.url=http://localhost:8081 \
--topic transactions \
--from-beginning --max-messages 1
The database is up and running.
This is not correct:
The unknown magic byte can be due to a id-field not part of the schema
What that error actually means is that the message on the topic was not serialised using the Schema Registry Avro serialiser.
How are you putting data on the topic?
Maybe all the messages have the problem, maybe only some—but by default this will halt the Kafka Connect task.
You can set
"errors.tolerance":"all",
to get it to ignore messages that it can't deserialise. But if none of the messages are correctly Avro-serialised this won't help; you need to serialise them correctly, or choose a different converter (e.g. if they're actually JSON, use the JsonConverter).
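If you do go the errors.tolerance route, a fuller error-handling block in the sink's properties file might look like this (a sketch; the dead-letter topic name is an assumption):

```properties
errors.tolerance=all
# route unprocessable records to a dead letter queue instead of silently skipping them
errors.deadletterqueue.topic.name=dlq-transactions
errors.deadletterqueue.topic.replication.factor=1
errors.deadletterqueue.context.headers.enable=true
```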
These references should help you more:
https://www.confluent.io/blog/kafka-connect-deep-dive-converters-serialization-explained
https://www.confluent.io/blog/kafka-connect-deep-dive-error-handling-dead-letter-queues
http://rmoff.dev/ksldn19-kafka-connect
Edit:
If you are serialising the key with StringSerializer then you need to use this in your Connect config:
key.converter=org.apache.kafka.connect.storage.StringConverter
You can set it at the worker level (a global property that applies to all connectors run on it), or just for this connector (i.e. put it in the connector properties itself, where it overrides the worker setting).
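Applied to the question's setup, the connector-level override could go straight into sink-quickstart-postgresql.properties (a sketch; only the key.converter line is new relative to the question's config):

```properties
name=transaction-sink
connector.class=io.confluent.connect.jdbc.JdbcSinkConnector
topics=transactions
# the key was produced with StringSerializer, so read it back as a plain string:
key.converter=org.apache.kafka.connect.storage.StringConverter
```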

Kafka Connect issue

I installed Apache Kafka (Confluent distribution) on CentOS 7 and am trying to run the FileStream Kafka Connect connector in distributed mode, but I got the error below:
[2017-08-10 05:26:27,355] INFO Added alias 'ValueToKey' to plugin 'org.apache.kafka.connect.transforms.ValueToKey' (org.apache.kafka.connect.runtime.isolation.DelegatingClassLoader:290)
Exception in thread "main" org.apache.kafka.common.config.ConfigException: Missing required configuration "internal.key.converter" which has no default value.
at org.apache.kafka.common.config.ConfigDef.parseValue(ConfigDef.java:463)
at org.apache.kafka.common.config.ConfigDef.parse(ConfigDef.java:453)
at org.apache.kafka.common.config.AbstractConfig.<init>(AbstractConfig.java:62)
at org.apache.kafka.common.config.AbstractConfig.<init>(AbstractConfig.java:75)
at org.apache.kafka.connect.runtime.WorkerConfig.<init>(WorkerConfig.java:197)
at org.apache.kafka.connect.runtime.distributed.DistributedConfig.<init>(DistributedConfig.java:289)
at org.apache.kafka.connect.cli.ConnectDistributed.main(ConnectDistributed.java:65)
This is now resolved by updating workers.properties as described in http://docs.confluent.io/current/connect/userguide.html#connect-userguide-distributed-config
Command used:
/home/arun/kafka/confluent-3.3.0/bin/connect-distributed.sh ../../../properties/file-stream-demo-distributed.properties
Filestream properties file (workers.properties):
name=file-stream-demo-distributed
connector.class=org.apache.kafka.connect.file.FileStreamSourceConnector
tasks.max=1
file=/tmp/demo-file.txt
bootstrap.servers=localhost:9092,localhost:9093,localhost:9094
config.storage.topic=demo-2-distributed
offset.storage.topic=demo-2-distributed
status.storage.topic=demo-2-distributed
key.converter=org.apache.kafka.connect.json.JsonConverter
key.converter.schemas.enable=true
value.converter=org.apache.kafka.connect.json.JsonConverter
value.converter.schemas.enable=true
internal.key.converter=org.apache.kafka.connect.json.JsonConverter
internal.key.converter.schemas.enable=false
internal.value.converter=org.apache.kafka.connect.json.JsonConverter
internal.value.converter.schemas.enable=false
group.id=""
I added the properties below and the command went through without any errors.
bootstrap.servers=localhost:9092,localhost:9093,localhost:9094
config.storage.topic=demo-2-distributed
offset.storage.topic=demo-2-distributed
status.storage.topic=demo-2-distributed
group.id=""
But now, when I run the consumer command, I am unable to see the messages from /tmp/demo-file.txt. Is there a way I can check whether the messages were published to Kafka topics and partitions?
kafka-console-consumer --zookeeper localhost:2181 --topic demo-2-distributed --from-beginning
I believe I am missing something really basic here. Can someone please help?
You need to define unique topics for the Kafka Connect framework to store its config, offsets, and status.
In your workers.properties file change these parameters to something like the following:
config.storage.topic=demo-2-distributed-config
offset.storage.topic=demo-2-distributed-offset
status.storage.topic=demo-2-distributed-status
These topics are used to store the state and configuration metadata of Connect, not the messages of the connectors that run on top of Connect. Do not use a console consumer on any of these three topics and expect to see your messages.
The messages are stored in the topic configured in the connector configuration JSON via the parameter called "topics".
An example file-sink-config.json file:
{
  "name": "MyFileSink",
  "config": {
    "topics": "mytopic",
    "connector.class": "org.apache.kafka.connect.file.FileStreamSinkConnector",
    "tasks.max": 1,
    "key.converter": "org.apache.kafka.connect.storage.StringConverter",
    "value.converter": "org.apache.kafka.connect.storage.StringConverter",
    "file": "/tmp/demo-file.txt"
  }
}
Once the distributed worker is running, you need to apply the config file to it using curl, like so:
curl -X POST -H "Content-Type: application/json" --data @file-sink-config.json http://localhost:8083/connectors
After that, the config will be safely stored in the config topic you created, for all distributed workers to use. Make sure the config topic (and the status and offset topics) will not expire messages, or you will lose your connector configuration when they do.

pykafka cannot connect to Kafka broker

When I use pykafka to connect to a Kafka cluster via the following code:
from pykafka import KafkaClient
client = KafkaClient(hosts="10.0.0.101:9092")
I got the following exception:
raise Exception('Unable to connect to a broker to fetch metadata.')
Exception: Unable to connect to a broker to fetch metadata.
But when I was using the command line such as:
kafka-console-producer --broker-list 10.0.0.101:9092 --topic userCND
it works fine, just with a warning message:
WARN Property topic is not valid (kafka.utils.VerifiableProperties)
What version of Kafka are you using? pykafka currently only supports 0.8.2, not 0.9.0.
You may want to use the REST API instead. Learn more about it here:
http://docs.confluent.io/2.0.0/kafka-rest/docs/index.html