Debezium Server MongoDB connector Enabling the Signalling doesn't work - debezium

I run Debezium Server on k8s with MongoDB as the source and Kinesis as the sink. After enabling signalling by setting
DEBEZIUM_SOURCE_SIGNAL_DATA_COLLECTION="test.debezium_signaling"
DEBEZIUM_SOURCE_COLLECTION_INCLUDE_LIST="<some other existing collections>,test.debezium_signaling"
and restarting the pod and inserting
{
  "_id": {
    "$oid": "***"
  },
  "type": "execute-snapshot",
  "data": "{'data-collections': ['test.new_collection']}"
}
nothing happens, and Debezium seems to be stuck at checking current members of replica set at ...
There aren't any errors. Any idea what could be the issue?
I also enabled the debug logs, but there is nothing indicating that the signaling record is picked up by the server.
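For comparison, this is a minimal sketch of how a signal insert can be issued from mongosh, assuming the usual Debezium signal shape where data is a string containing valid JSON (double quotes); the connection string is a placeholder, and this is a sketch rather than a confirmed fix for the problem above.
# Hypothetical signal insert via mongosh; the connection string is a placeholder.
# The "data" field is a string of valid JSON (note the escaped double quotes),
# which is an assumption about the expected format rather than a verified fix.
mongosh "mongodb://<replica-set-host>:27017/test" --eval '
  db.debezium_signaling.insertOne({
    type: "execute-snapshot",
    data: "{\"data-collections\": [\"test.new_collection\"], \"type\": \"incremental\"}"
  })
'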

Related

Why is my MSK connector in a failed state?

I'm using AWS MSK and trying to create a Kafka connector using the Confluent Kafka SQS source connector.
Having uploaded the SQS source connector plugin (in zip form), I go to the MSK console and try to create the connector, specifying my existing MSK cluster and choosing my plugin.
After some time, the following error message appears:
There is an issue with the connector.
Code: UnknownError.Unknown
Message: The last operation failed. Retry the operation.
Connector is in Failed state
The rather useless error message means I don't know where to look next.
I tried the CloudWatch logs, but although my cluster is configured to send logs there, there is nothing related to this error.
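One possible place to look, as a sketch under the assumption that the aws kafkaconnect CLI commands are available to you; the connector ARN is a placeholder copied from the MSK console:
# List MSK Connect connectors, then ask for the state description of the failed one.
aws kafkaconnect list-connectors
aws kafkaconnect describe-connector --connector-arn "<connector ARN from the MSK console>"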

List All the Kafka connect workers running in a group in distributed mode

I want to list all the Kafka connect workers running in a single group in distributed mode.
My use case is that if a worker closes or is killed for any reason, I can recreate a worker to join that group again. Also, all the workers will be running on different machines on a network, so I need central logging of all the available workers in a group.
But I couldn't find a way to list all the workers (the leader node and the follower nodes). Could somebody tell me how I could do that?
When a worker leaves or joins a Kafka Connect cluster, the tasks get rebalanced automatically. Therefore you should only need to monitor the Kafka Connect worker process in isolation on each host machine and restart it if you see a failure.
You can find out which workers are currently executing a connector's tasks using the REST API:
$ curl -s "http://localhost:8083/connectors?expand=info&expand=status" | jq '.'
"status": {
"name": "source-datagen-01",
"connector": {
"state": "RUNNING",
"worker_id": "kafka-connect-03:8083"
},
"tasks": [
{
"id": 0,
"state": "RUNNING",
"worker_id": "kafka-connect-03:8083"
},
{
"id": 1,
"state": "RUNNING",
"worker_id": "kafka-connect-02:8083"
},
{
"id": 2,
"state": "RUNNING",
"worker_id": "kafka-connect-01:8083"
},
{
"id": 3,
"state": "RUNNING",
"worker_id": "kafka-connect-03:8083"
},
I don't believe there's a REST API call to view the workers in the cluster regardless of connector execution.
You could parse the Kafka Connect worker logs in the cluster to determine when workers leave and join the cluster, and also inspect the Kafka Connect status topic's messages - both are shown in this doc. However, as noted, Kafka Connect itself will do the rebalancing when a worker leaves or joins.
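For example, here is a minimal sketch of inspecting the status topic with the console consumer, assuming the common default topic name connect-status (the actual name is whatever status.storage.topic is set to in your worker config):
# Read the Kafka Connect status topic; connector/task state changes are
# written here as workers join, leave, and rebalance.
# "connect-status" is an assumed default and may differ in your cluster.
kafka-console-consumer \
  --bootstrap-server localhost:9092 \
  --topic connect-status \
  --from-beginning \
  --property print.key=true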

Is there a way to use Kafka Connect with REST Proxy?

Kafka Connect source and sink connectors provide a practically ideal feature set for configuring a data pipeline without writing any code. In my case I wanted to use it to integrate data from several DB servers (producers) located on the public Internet.
However, some producers don't have direct access to the Kafka brokers, as their network/firewall configuration allows traffic to a specific host only (port 443). And unfortunately I cannot really change these settings.
My thought was to use the Confluent REST Proxy, but I learned that Kafka Connect uses the KafkaProducer API, so it needs direct access to the brokers.
I found a couple of possible workarounds, but none is perfect:
SSH tunnel - as described in: Consume from a Kafka Cluster through SSH Tunnel
Use the REST Proxy but replace Kafka Connect with custom producers, as mentioned in How can we configure kafka producer behind a firewall/proxy?
Use an SSHL demultiplexer to route the traffic to a broker (but just one broker)
Has anyone faced similar challenge? How did you solve it?
Sink Connectors (ones that write to external systems) do not use the Producer API.
That being said, you could use some HTTP Sink Connector that issues POST requests to the REST Proxy endpoint. It's not ideal, but it would address the problem. Note: This means you have two clusters - one that you are consuming from in order to issue HTTP requests via Connect, and the other behind the proxy.
Overall, I don't see how the question is unique to Connect, since you'd have similar issues with any other attempt to write the data to Kafka through the only open HTTPS port.
As @OneCricketeer recommended, I tried the HTTP Sink Connector with REST Proxy approach.
I managed to configure the Confluent HTTP Sink connector, as well as an alternative one (github.com/llofberg/kafka-connect-rest), to work with the Confluent REST Proxy.
I'm adding the connector configurations in case they save some time for anyone trying this approach.
Confluent HTTP Sink connector
{
  "name": "connector-sink-rest",
  "config": {
    "topics": "test",
    "tasks.max": "1",
    "connector.class": "io.confluent.connect.http.HttpSinkConnector",
    "headers": "Content-Type:application/vnd.kafka.json.v2+json",
    "http.api.url": "http://rest:8082/topics/test",
    "key.converter": "org.apache.kafka.connect.storage.StringConverter",
    "value.converter": "org.apache.kafka.connect.storage.StringConverter",
    "value.converter.schemas.enable": "false",
    "batch.prefix": "{\"records\":[",
    "batch.suffix": "]}",
    "batch.max.size": "1",
    "regex.patterns": "^~$",
    "regex.replacements": "{\"value\":~}",
    "regex.separator": "~",
    "confluent.topic.bootstrap.servers": "localhost:9092",
    "confluent.topic.replication.factor": "1"
  }
}
Kafka Connect REST connector
{
  "name": "connector-sink-rest-v2",
  "config": {
    "connector.class": "com.tm.kafka.connect.rest.RestSinkConnector",
    "tasks.max": "1",
    "topics": "test",
    "rest.sink.url": "http://rest:8082/topics/test",
    "rest.sink.method": "POST",
    "rest.sink.headers": "Content-Type:application/vnd.kafka.json.v2+json",
    "transforms": "velocityEval",
    "transforms.velocityEval.type": "org.apache.kafka.connect.transforms.VelocityEval$Value",
    "transforms.velocityEval.template": "{\"records\":[{\"value\":$value}]}",
    "transforms.velocityEval.context": "{}"
  }
}
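Both configurations wrap each record into the REST Proxy v2 produce payload. As a sanity check, you can exercise the same endpoint and payload shape by hand; the record value below is just an illustrative placeholder:
# Manually produce one record through the REST Proxy to confirm the URL,
# Content-Type, and payload shape the connectors are expected to send.
curl -X POST \
  -H "Content-Type: application/vnd.kafka.json.v2+json" \
  --data '{"records":[{"value":{"hello":"world"}}]}' \
  http://rest:8082/topics/test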

Distributed Official Mongodb Kafka Source Connector with Multiple tasks Not working

I am running Apache Kafka on my Windows machine with two Kafka Connect workers (ports 8083 and 8084) and one topic with three partitions (replication factor of one).
My issue is that I can see the fail-over to the other Kafka Connect worker whenever I shut one of them down, but load balancing is not happening, because the number of tasks is always ONE.
I am using the official MongoDB Kafka connector as a source (change stream) with tasks.max=6.
I tried updating MongoDB with multiple threads so that it would push more data into Kafka Connect and perhaps make Kafka Connect create more tasks. Even under a higher volume of data, the task count remains one.
How did I confirm that only one task is running? Through the API "http://localhost:8083/connectors/mongodb-connector/status":
Response:
{
  "name": "mongodb-connector",
  "connector": {
    "state": "RUNNING",
    "worker_id": "xx.xx.xx.xx:8083"
  },
  "tasks": [
    {
      "id": 0,
      "state": "RUNNING",
      "worker_id": "xx.xx.xx.xx:8083"
    }
  ],
  "type": "source"
}
Am I missing something here? Why are more tasks not created?
It seems this is the behavior of the official MongoDB Kafka source connector. This is the answer I got on another forum from Ross Lawley (MongoDB developer):
Prior to 1.2.0 only a single task was supported by the sink connector.
The source connector still only supports a single task, because it uses a single change stream cursor. This is enough to watch and publish changes cluster-wide, database-wide, or down to a single collection.
I raised this ticket: https://jira.mongodb.org/browse/KAFKA-121 and got the following response:
The source connector will only ever produce a single task.
This is by design, as the source connector is backed by a change stream. Change streams internally use the same data as the replication engine and as such should be able to scale as the database does.
There are no plans to allow multiple cursors; however, should you feel that this is not meeting your requirements, you can configure multiple connectors, and each would have its own change stream cursor.
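To illustrate that last suggestion, here is a minimal sketch of registering two separate source connectors, each watching a different collection so that each gets its own change stream cursor. The connector names, connection URI, database, and collection names are placeholders, and the properties shown are only the essentials, not a complete configuration:
# Two source connectors, each with its own change stream cursor.
# All names, hosts, and collections below are placeholders.
curl -X POST -H "Content-Type: application/json" http://localhost:8083/connectors -d '{
  "name": "mongodb-connector-orders",
  "config": {
    "connector.class": "com.mongodb.kafka.connect.MongoSourceConnector",
    "connection.uri": "mongodb://mongo1:27017,mongo2:27017/?replicaSet=rs0",
    "database": "shop",
    "collection": "orders"
  }
}'

curl -X POST -H "Content-Type: application/json" http://localhost:8083/connectors -d '{
  "name": "mongodb-connector-customers",
  "config": {
    "connector.class": "com.mongodb.kafka.connect.MongoSourceConnector",
    "connection.uri": "mongodb://mongo1:27017,mongo2:27017/?replicaSet=rs0",
    "database": "shop",
    "collection": "customers"
  }
}'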

Kafka topic seems to function first time only. Why?

I am working with Kafka Connect (using the Confluent implementation) and am seeing a strange behavior. I configure a source connector to pull data from a DB table and populate a topic. This works.
But if I delete the topic, remove the source config, and then re-add the config (perhaps adding another column to the query), the topic does not get populated. If I change the topic name to something I haven't used before, it works. I am using Postman to set the configuration, though I don't believe that matters here.
My Connect config:
{
  "name": "my-jdbc-connector",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
    "connection.url": "jdbc:db2://db2server.mycompany.com:4461/myDB",
    "connection.user": "dbUser",
    "connection.password": "dbPass",
    "dialect.name": "Db2DatabaseDialect",
    "mode": "timestamp",
    "query": "select fname, lname, custId, custRegion, lastUpdate from CustomerMaster",
    "timestamp.column.name": "lastUpdate",
    "table.types": "TABLE",
    "topic.prefix": "master.customer"
  }
}
The Kafka JDBC connector keeps a high watermark on the timestamp column, i.e. lastUpdate in your case. That watermark doesn't depend on the topic; even if you delete the JDBC connector and recreate it with the same name, it will still use the same high watermark, because the stored offset is keyed by the connector name. So even if you recreate the topic, the data will not be loaded again.
There are ways to reprocess the whole data set; you can follow any of these:
Drop the topic and delete the JDBC connector, then recreate the topic and create the JDBC connector with a different name, or
delete the JDBC connector and recreate it with the same name but with "mode": "bulk". It will dump the whole DB table into the topic again; once it has loaded, you can update the mode back to timestamp (a sketch of doing this through the Connect REST API follows below), or
update lastUpdate for all records to the current timestamp.
Please refer to the JDBC source connector configuration details:
https://docs.confluent.io/current/connect/kafka-connect-jdbc/source-connector/source_config_options.html
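For the bulk-mode route, here is a minimal sketch using the Connect REST API; the connector name matches the config above, and the abbreviated config body is illustrative rather than complete:
# Delete the connector, then recreate it under the same name with "mode": "bulk".
# After the table has been dumped into the topic, PUT the config again with
# "mode": "timestamp" to resume incremental loading. Config is abbreviated.
curl -X DELETE http://localhost:8083/connectors/my-jdbc-connector

curl -X PUT -H "Content-Type: application/json" \
  http://localhost:8083/connectors/my-jdbc-connector/config -d '{
    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
    "connection.url": "jdbc:db2://db2server.mycompany.com:4461/myDB",
    "mode": "bulk",
    "query": "select fname, lname, custId, custRegion, lastUpdate from CustomerMaster",
    "topic.prefix": "master.customer"
  }'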