Debezium can not capture change from MongoDB - mongodb

I am using debezium mongo connnect 1.4.2 on Kafka connect 2.2. And it seems 'collection.include.list' configuration is preventing Debezium getting the collection data change. If I delete the collection.include.list config, the capture will start work. But will apply on all the collections which I don't want.
Can anyone send me some example about how collection.include.list can be configured? I tried to input '<db_name>[.]<collection_name>' , however I keep on got this warning and no data was captured.
[2021-04-03 07:58:21,971] WARN After applying the include/exclude list filters, no changes will be captured. Please check your configuration! (io.debezium.connector.mongodb.MongoDbSchema:96)
My config is like below:
{
"name": "pipeline-mongo-connector",
"config": {
"connector.class": "io.debezium.connector.mongodb.MongoDbConnector",
"mongodb.hosts": "xxxx_host:3717",
"mongodb.name": "pipeline_mongo",
"mongodb.user": "xxxxxxx",
"mongodb.password":"xxxxxx",
"collection.include.list": "prod-datapipeline[.]*"
}
}
Thanks!

Related

How to get full document when using kafka mongodb source connector when tracking update operations on a collection?

I am using Kakfa MongoDB Source Connector [https://www.confluent.io/hub/mongodb/kafka-connect-mongodb] with confluent platform v5.4. Below is my MongoDB Source Connector config
{
"name": "mongodb-replica-set-connector",
"config": {
"tasks.max": 1,
"connector.class": "com.mongodb.kafka.connect.MongoSourceConnector",
"key.converter": "org.apache.kafka.connect.storage.StringConverter",
"value.converter": "org.apache.kafka.connect.storage.StringConverter",
"connection.uri": "mongodb://<username:password>#<MongoDB-Server-IP-Or-DNS>/<DB-Name>?ssl=false&authSource=<DB-Name>&retryWrites=true&w=majority",
"database": "<DB-Name>",
"collection": "<Collection-Name>",
"topic.prefix": ""
}
}
I am getting full and correct document details when a record in inserted into the specified collection. But when I perform delete or update operation, I do not get the full document. Below is the screenshot for delete and update operation from a stream which reads the topic specified in the config. My questions is - What should I specify in the config so I get full document when the update operation is performed? Is there any way to get the info like id or key for the document which was deleted?
You can set "change.stream.full.document": "updateLookup" in the Kafka connector to get a fullDocument field for both inserts and updates.
Deletes are also sent with this approach, but the fullDocument field will be empty.
Source: https://docs.mongodb.com/kafka-connector/current/source-connector/configuration-properties/change-stream/#std-label-source-configuration-change-stream
Use the property publish.full.document.only": "true" in the MongoDB Connector config for getting the full document in case any create and update operation is done on the MongoDB collection. Delete operations cannot be tracked as it does not go with the idea of the CDC(change data capture) concept. Only the changes (create/update) in the data can be captured.

Kafka connect MongoDB Kafka connector using JSONConverter not working

I'm trying to configure the Kafka connector to use mongoDB as the source and send the records into Kakfa topics.
I've successfully done so, but I'm trying to do it with the JSONConverter in order to also save the schema with the payload.
My problem is that the connector is saving the data as follows:
{ "schema": { "type": "string" } , "payload": "{....}" }
In other words, it's automatically assuming the actual JSON is a string and it's saving the schema as String.
This is how I'm setting up the connector:
curl -X POST http://localhost:8083/connectors -H "Content-Type: application/json" -d '{
"name": "newtopic",
"config": {
"tasks.max":1,
"connector.class":"com.mongodb.kafka.connect.MongoSourceConnector",
"key.converter":"org.apache.kafka.connect.json.JsonConverter",
"key.converter.schemas.enabled": "true",
"value.converter":"org.apache.kafka.connect.json.JsonConverter",
"value.converter.schemas.enabled": "true",
"connection.uri":"[MONGOURL]",
"database":"dbname",
"collection":"collname",
"pipeline":"[]",
"topic.prefix": "",
"publish.full.document.only": "true"
}}'
Am I missing something for the configuration? Is it simply not able to guess the schema of the document stored in MongoDB, so it goes with String?
There's nothing to guess. Mongo is schemaless, last I checked, so the schema is a string or bytes. I would suggest using AvroConverter or setting schema enabled to false
You may also want to try using Debezium to see if you get different results
The latest version 1.3 of the MongoDB connector should solve this issue.
It even provides an option for inferring the source schema.

Kafka connect JDBC source connector not working

Hello Everyone,
I am using Kafka JDBC Source connector using for postgres. Following is my connector configuration. Some how it is not bringing any data. What is wrong in this configuration?
{
"name": "test-connection",
"config": {
"connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
"mode": "timestamp",
"timestamp.column.name": "TEST_DT",
"topic.prefix": "test",
"connection.password": "xxxxxx",
"validate.non.null": "false",
"connection.user": "xxxxxx",
"table.whitelist": "test.test",
"connection.url": "jdbc:postgresql://xxxx:5432/xxxx?ssl=true&stringtype=unspecified",
"name": "test-connection"
},
"tasks": [],
"type": "source"
}
Do I need to create the topic or does it get generated automatically?
I expect the data to be flowing based on the example but the data is not flowing.Following is the log I see in the kafka connect. But, no data is flowing in.
Log
[2019-07-07 20:52:37,465] INFO WorkerSourceTask{id=test-connection-0} Committing offsets (org.apache.kafka.connect.runtime.WorkerSourceTask)
[2019-07-07 20:52:37,465] INFO WorkerSourceTask{id=test-connection-0} flushing 0 outstanding messages for offset commit (org.apache.kafka.connect.runtime.WorkerSourceTask)
Do I need to create the topic or does it get generated automatically?
it generates automatically with the "test" prefix you set in "topic.prefix": "test"
so your topic is called "testtest-connection" or "testtest.test"
it is possible that you are using Avro schema, if so, you have to consume the topic with Avro consumer.
I faced exactly the same problem and there was no error in the log, though I was adding/modifying the records in postgres and it was not sending any messages. Was getting the same log messages in INFO mode as you mentioned. Here I resolved it and possibly one or all of these might be causing this. So please check what was the issue at your end. If it solves your issue, please accept this as the answer.
"table.whitelist" : "public.abcd" <-- This property you ensure you give the schema name also explicitly e.g. I gave "public" as my "abcd" table was in this schema.
Usually the Database when we run (e.g. via Docker) the timezone is in UTC mode and if you are in a timezone more than that then while querying the data internally it would put the filter condition such a way that your data is filtered out. To overcome the best way is your timestamp column should be "timestamp with timezone" this solved my issue. Another variation I did was i inserted the data and gave the value of this column as "now() -interval '3 days'" to ensure the data is old and immediately it flow to Topic. Well, best is to give timestamp with timezone instead of this hack.
Finally another possible solution could be while giving the connector config you tell what timezone is your postgres db is. You may google for the property. As point-2 solved my issue so i didn't try this.
CREATE TABLE public.abcd (
id SERIAL PRIMARY KEY,
title VARCHAR(100) not NULL,
update_ts TIMESTAMP with time zone default now() not null
);
My config which worked. in case needed.
{
"name": "jdbc_source_connector_postgresql_004",
"config": {
"connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
"connection.url": "jdbc:postgresql://192.168.99.116:30000/mydb",
"connection.user": "sachin",
"connection.password": "123456",
"topic.prefix": "thesach004_",
"poll.interval.ms" : 1000,
"mode":"timestamp",
"table.whitelist" : "public.abcd",
"timestamp.column.name": "update_ts",
"validate.non.null": false,
"transforms":"createKey,extractInt",
"transforms.createKey.type":"org.apache.kafka.connect.transforms.ValueToKey",
"transforms.createKey.fields":"id",
"transforms.extractInt.type":"org.apache.kafka.connect.transforms.ExtractField$Key",
"transforms.extractInt.field":"id"
}
}
-$achin

Using a custom converter with Kafka Connect?

I'm trying to use a custom converter with Kafka Connect and I cannot seem to get it right. I'm hoping someone has experience with this and could help me figure it out !
Initial situation
my custom converter's class path is custom.CustomStringConverter.
to avoid any mistakes, my custom converter is currently just a copy/paste of the pre-existing StringConverter (of course, this will change when I'll get it to work).
https://github.com/apache/kafka/blob/trunk/connect/api/src/main/java/org/apache/kafka/connect/storage/StringConverter.java
I have a kafka connect cluster of 3 nodes, The nodes are running confluent's official docker images (confluentinc/cp-kafka-connect:3.3.0).
Each node is configured to load a jar with my converter in it (using a docker volume).
What happens ?
When the connectors start, they correctly load the jars and find the custom converter. Indeed, this is what I see in the logs :
[2017-10-10 13:06:46,274] INFO Registered loader: PluginClassLoader{pluginLocation=file:/opt/custom-connectors/custom-converter-1.0-SNAPSHOT.jar} (org.apache.kafka.connect.runtime.isolation.DelegatingClassLoader:199)
[2017-10-10 13:06:46,274] INFO Added plugin 'custom.CustomStringConverter' (org.apache.kafka.connect.runtime.isolation.DelegatingClassLoader:132)
[...]
[2017-10-10 13:07:43,454] INFO Added aliases 'CustomStringConverter' and 'CustomString' to plugin 'custom.CustomStringConverter' (org.apache.kafka.connect.runtime.isolation.DelegatingClassLoader:293)
I then POST a JSON config to one of the connector nodes to create my connector :
{
"name": "hdfsSinkCustom",
"config": {
"topics": "yellow",
"tasks.max": "1",
"key.converter": "org.apache.kafka.connect.storage.StringConverter",
"value.converter": "custom.CustomStringConverter",
"connector.class": "io.confluent.connect.hdfs.HdfsSinkConnector",
"hdfs.url": "hdfs://hdfs-namenode:8020/hdfs-sink",
"topics.dir": "yellow_storage",
"flush.size": "1",
"rotate.interval.ms": "1000"
}
}
And receive the following reply :
{
"error_code": 400,
"message": "Connector configuration is invalid and contains the following 1 error(s):\nInvalid value custom.CustomStringConverter for configuration value.converter: Class custom.CustomStringConverter could not be found.\nYou can also find the above list of errors at the endpoint `/{connectorType}/config/validate`"
}
What am I missing ?
If I try running Kafka Connect stadnalone, the error message is the same.
Has anybody faced this already ? What am I missing ?
Ok, I found out the solution thanks to Philip Schmitt on the Kafka Users mailing list.
He mentioned this issue: https://issues.apache.org/jira/projects/KAFKA/issues/KAFKA-6007 , which is indeed the problem I am facing.
To quote him:
To test this, I simply copied my SMT jar to the folder of the connector I was using and adjusted the plugin.path property.
Indeed, I got rid of this error by putting the converter in the connector's folder.
I also tried something else: create a custom connector and use that custom connector with the custom converter, both loaded as plugins. It also works.
Summary: converters are loaded by the connector. If your connector is a plugin, your converter should be as well. If you connector is not a plugin (bundled with your kafka connect distrib), your converter should not be either.

How do I create the json for creating a distributed Kafka Connect Instance with a transformation?

Using standalone mode I create a connector and my customized transformation like that:
name=rabbitmq-source
connector.class=com.github.jcustenborder.kafka.connect.rabbitmq.RabbitMQSourceConnector
tasks.max=1
rabbitmq.host=rabbitmq-server
rabbitmq.queue=answers
kafka.topic=net.gutefrage.answers
transforms=extractFields
transforms.extractFields.type=net.gutefrage.connector.transforms.ExtractFields$Value
transforms.extractFields.fields=body,envelope.routingKey
transforms.extractFields.structName=net.gutefrage.events
But for a distributed connector what is the syntax for the PUT request to the Connect REST API? I cannot find any example in the docs.
Already tried a couple of things like:
cat <<EOF >/tmp/connector
{
"name": "rabbitmq-source",
"config": {
"connector.class": "com.github.jcustenborder.kafka.connect.rabbitmq.RabbitMQSourceConnector",
"tasks.max": "1",
"rabbitmq.host": "rabbitmq-server",
"rabbitmq.queue": "answers",
"kafka.topic": "net.gutefrage.answers",
"transforms": "extractFields",
"transforms.extractFields": {
"type": "net.gutefrage.connector.transforms.ExtractFields$Value",
"fields": "body,envelope.routingKey",
"structName": "net.gutefrage.events"
}
}
}
EOF
curl -vs --stderr - -X POST -H "Content-Type: application/json" --data #/tmp/connector "http://localhost:8083/connectors"
rm /tmp/connector
or also this did not work:
{
"name": "rabbitmq-source",
"config": {
"connector.class": "com.github.jcustenborder.kafka.connect.rabbitmq.RabbitMQSourceConnector",
"tasks.max": "1",
"rabbitmq.host": "rabbitmq-server",
"rabbitmq.queue": "answers",
"kafka.topic": "net.gutefrage.answers",
"transforms": "extractFields",
"transforms.extractFields.type": "net.gutefrage.connector.transforms.ExtractFields$Value",
"transforms.extractFields.fields": "body,envelope.routingKey",
"transforms.extractFields.structName": "net.gutefrage.events"
}
}
For the last variant I get the following error:
{"error_code":400,"message":"Connector configuration is invalid and contains the following 1 error(s):\nInvalid value class net.gutefrage.connector.transforms.ExtractFields for configuration transforms.extractFields.type: Error getting config definition from Transformation: null\nYou can also find the above list of errors at the endpoint `/{connectorType}/config/validate`"}
Please note that with the properties format it works nicely (using Landoops Create New Connector UI in fast-data-dev. Interesting that the Landoop's Ui feature 'translate to curl' produces the very same json as my second example)
Update
To be sure that it's not a problem with Landoop, docker and with my custom transformation, I've started zookeeper, broker, schema registry and Kafka Connect in distributed mode with the standard distributed properties from COP 3.3.0
bin/connect-distributed etc/schema-registry/connect-avro-distributed.properties
which logs
[2017-09-13 14:07:52,930] INFO Loading plugin from: /opt/connectors/confluent-oss-gf-assembly-1.0.jar (org.apache.kafka.connect.runtime.isolation.DelegatingClassLoader:176)
[2017-09-13 14:07:53,711] INFO Registered loader: PluginClassLoader{pluginLocation=file:/opt/connectors/confluent-oss-gf-assembly-1.0.jar} (org.apache.kafka.connect.runtime.isolation.DelegatingClassLoader:199)
[2017-09-13 14:07:53,711] INFO Added plugin 'com.github.jcustenborder.kafka.connect.rabbitmq.RabbitMQSourceConnector' (org.apache.kafka.connect.runtime.isolation.DelegatingClassLoader:132)
[2017-09-13 14:07:53,712] INFO Added plugin 'net.gutefrage.connector.transforms.ExtractFields$Key' (org.apache.kafka.connect.runtime.isolation.DelegatingClassLoader:132)
[2017-09-13 14:07:53,712] INFO Added plugin 'net.gutefrage.connector.transforms.ExtractFields$Value' (org.apache.kafka.connect.runtime.isolation.DelegatingClassLoader:132)
All good so far. Then I created a connector config:
cat <<EOF >/tmp/connector
{
"name": "rabbitmq-source",
"config": {
"connector.class": "com.github.jcustenborder.kafka.connect.rabbitmq.RabbitMQSourceConnector",
"tasks.max": "1",
"rabbitmq.host": "rabbitmq-server",
"rabbitmq.queue": "answers",
"kafka.topic": "net.gutefrage.answers",
"transforms": "extractFields",
"transforms.extractFields.type": "org.apache.kafka.connect.transforms.ExtractField$Value",
"transforms.extractFields.field": "body"
}
}
EOF
Please note that there I do now use the standard (bundled) extract field transform.
When I post that with curl -vs --stderr - -X POST -H "Content-Type: application/json" --data #/tmp/connector "http://localhost:8083/connectors"
I get the same
{"error_code":400,"message":"Connector configuration is invalid and contains the following 1 error(s):\nInvalid value class org.apache.kafka.connect.transforms.ExtractField for configuration transforms.extractFields.type: Error getting config definition from Transformation: null\nYou can also find the above list of errors at the endpoint `/{connectorType}/config/validate`"}*
Make sure the $Value in transforms.extractFields.type=net.gutefrage.connector.transforms.ExtractFields$Value is not interpreted as a variable by the bash command cat. It worked for me.
If you want to run the Kafka Connect worker in standalone mode, then you must start the worker and supply the worker configuration file and one or more connector configuration files. All of those configuration files are in Java properties format, so the first configuration sample you provided is the correct format:
name=rabbitmq-source
connect.class=com.github.jcustenborder.kafka.connect.rabbitmq.RabbitMQSourceConnector
tasks.max=1
rabbitmq.host=rabbitmq-server
rabbitmq.queue=answers
kafka.topic=net.gutefrage.answers
transforms=extractFields
transforms.extractFields.type=net.gutefrage.connector.transforms.ExtractFields$Value
transforms.extractFields.fields=body,envelope.routingKey
transforms.extractFields.structName=net.gutefrage.events
If you want to run the Kafka Connect worker in distributed mode, then you will have to first start the distributed worker and then create the connector as a second step using the REST API and a PUT request with a JSON document to the /connectors endpoint. That JSON document would match the format of you're second JSON document:
{
"name": "rabbitmq-source",
"config": {
"connector.class": "com.github.jcustenborder.kafka.connect.rabbitmq.RabbitMQSourceConnector",
"tasks.max": "1",
"rabbitmq.host": "rabbitmq-server",
"rabbitmq.queue": "answers",
"kafka.topic": "net.gutefrage.answers",
"transforms": "extractFields",
"transforms.extractFields.type": "net.gutefrage.connector.transforms.ExtractFields$Value",
"transforms.extractFields.fields": "body,envelope.routingKey",
"transforms.extractFields.structName": "net.gutefrage.events"
}
}
The Confluent CLI, included in Confluent's Open Source Platform that includes Kafka, is a developer tool to help you quickly get started by running a Zookeeper instance, a Kafka broker, the Confluent Schema Registry, the REST proxy, and a Connect worker in distributed mode. When you load a connector, you specify the connector configuration as either a JSON file or property file, converting the latter to a JSON format using jq.
However, the error you reported is:
{
"error_code":400,
"message":"Connector configuration is invalid and contains the following 1 error(s):\nInvalid value class net.gutefrage.connector.transforms.ExtractFields for configuration transforms.extractFields.type: Error getting config definition from Transformation: null\nYou can also find the above list of errors at the endpoint `/{connectorType}/config/validate`"
}
The important part of this error message is "Error getting config definition from Transformation: null". Although this is a bit too cryptic, it means that the config() method of the net.gutefrage.connector.transforms.ExtractFields Java class is returning null.
Make sure that the net.gutefrage.connector.transforms.ExtractFields$Valuestring you specified is the correct fully qualified name for the nested static class Value, and that the Value class fully and correctly implements the org.apache.kafka.connect.transforms.Transformation<? extends ConnectRecord<R>> interface. Note that the config() method must return a non-null ConfigDef object.
Take a look at this example of a Single Message Transform (SMT) that ships with Apache Kafka, or Robin's blog post for other examples.
For using json format of connector config and the CP connect CLI, the jq tool has to be installed on the machine where the Kafka-Connect Cluster is running.
E.g. for Landoops fast-data-dev environment you'll have to
docker exec rabbitmqconnect_fast-data-dev_1 apk add --no-cache jq
Then this will work:
docker exec rabbitmqconnect_fast-data-dev_1 /opt/confluent-3.3.0/bin/confluent config rabbitmq-source -d /tmp/connector-config.json
This doesn't solve the issue when using the connector REST endpoint though.
With fast-data-dev you can build a JAR file for any connector, and then just add it in the classpath with the instructions at
https://github.com/Landoop/fast-data-dev#enable-additional-connectors
The UI will auto-detect the new connector - and provide you instructions when you hit NEW for the new connector at:
http://localhost:3030/kafka-connect-ui
What would also be worth trying - as fast-data-dev comes already with an generic MQTT sink connector, is trying it out. See instructions at http://docs.datamountaineer.com/en/latest/mqtt-sink.html
You would effectively need to do
connect.mqtt.kcql=INSERT INTO /answers SELECT body FROM net.gutefrage.answers
As this is a generic MQTT connector - possibly you will need to add the rabbitmq client library using the enable-additional-connectors instructions