I am running the Confluent JDBC source connector to read from a DB table and publish to a Kafka topic. The connector is started by a job scheduler, and I need to stop the connector after it has published all the rows from the DB table. Any idea how to stop it gracefully?
You can use the REST API to pause (or delete) a connector:
PUT /connectors/:name/pause
There is no "notification" that all records have been loaded, though. With the JDBC source you can instead run bulk mode with a long poll interval (say, a whole week) and schedule the connector's deletion separately.
To pause it, run this from a command shell (that has curl installed):
curl -X PUT <host>:8083/connectors/<connector_name>/pause
To resume it again, use:
curl -X PUT <host>:8083/connectors/<connector_name>/resume
To see whether it's paused or not, use:
curl <host>:8083/connectors/<connector_name>/status | jq
the "jq" part makes it more readable.
I am running a Debezium connector to PostgreSQL. The snapshot.mode I use is initial, since I don't want to resnapshot just because the connector has been restarted. However, during development I want to restart the process, as the messages expire from Kafka before they have been read.
If I delete and recreate the connector via Kafka Connect REST API, this doesn't do anything, as the information in the offset/status/config topics is preserved. I have to delete and recreate them when restarting the whole connect cluster to trigger another snapshot.
Am I missing a more convenient way of doing this?
You will also need a new name for the connector, as well as a new database.server.name in the connector config, since the stored offset information is keyed by those values. It should be almost like deploying a connector for the first time again.
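As a rough sketch, re-registering could look like the following; the connector name, server name, and connection details here are hypothetical placeholders, not values from the original setup:
# placeholders: <host>, <user>, <password>, <db>; the names inventory-connector-v2 / inventory-v2 are hypothetical
curl -X POST -H "Content-Type: application/json" -d '{"name":"inventory-connector-v2",
"config":{"connector.class":"io.debezium.connector.postgresql.PostgresConnector",
"database.hostname":"<host>",
"database.port":"5432",
"database.user":"<user>",
"database.password":"<password>",
"database.dbname":"<db>",
"database.server.name":"inventory-v2",
"snapshot.mode":"initial"
}}' localhost:8083/connectors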
I am new to Kafka Connect and have been exploring it for about a week. I have created and updated MongoDB documents via the MongoDB connector with curl commands. I am struggling a bit to understand the concept and implementation of the following:
1) We are registering the connector with a unique name via a curl command every time before producing a message. How can this be automated? For example, if I pass data from my application to the producer, should I call the curl command for each and every request?
2) I need to maintain a history collection, so I need to use two collections and two topics (one for updating and one for creating). How will I manage this with the curl configuration?
My curl update configuration is below:
curl -X POST -H "Content-Type: application/json" -d '{"name":"test-students-update",
"config":{"topics":"topicData",
"connector.class":"com.mongodb.kafka.connect.MongoSinkConnector",
"tasks.max":"1",
"connection.uri":"mongodb://localhost:27017",
"database":"quickstart",
"collection":"topicData",
"key.converter":"org.apache.kafka.connect.storage.StringConverter",
"value.converter":"org.apache.kafka.connect.storage.StringConverter",
"key.converter.schemas.enable":"false",
"value.converter.schemas.enable":"false",
"document.id.strategy":"com.mongodb.kafka.connect.sink.processor.id.strategy.BsonOidStrategy",
"document.id.strategy":"com.mongodb.kafka.connect.sink.processor.id.strategy.PartialValueStrategy",
"value.projection.list":"tokenNumber",
"value.projection.type":"whitelist",
"writemodel.strategy":"com.mongodb.kafka.connect.sink.writemodel.strategy.UpdateOneTimestampsStrategy"
}}' localhost:8083/connectors
I'm not sure what you mean by automating the curl command for your MongoDB sink connector, or why you would need to run the curl command every time. Kindly clarify.
The existing MongoDB sink connector can easily be installed from Confluent Hub and run as a standalone service to handle data upserts.
You can have a look at https://www.mongodb.com/docs/kafka-connector/current/sink-connector/fundamentals/#std-label-kafka-sink-fundamentals
> every time with unique name before producing the message
This is not necessary. Post the connector once and it'll start a Kafka consumer and wait for data, just like any other Kafka client.
> pass the data from my application to producer should I call the curl command for each and every request
As stated, no.
> How it will be automated
You don't necessarily need to use curl. If you're using Kubernetes, there are CRDs for KafkaConnect. Otherwise, Terraform providers work with the Connect API as well. Or you can continue to use curl in some CI/CD pipeline, but it only needs to be run once to start the connector.
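For example, in a CI/CD pipeline the create-or-update endpoint makes the call idempotent, so re-running the pipeline is harmless; the file name below is hypothetical, and the file holds only the "config" object (without the outer "name" wrapper):
# mongo-sink-config.json is a hypothetical file containing only the config map
curl -X PUT -H "Content-Type: application/json" -d @mongo-sink-config.json localhost:8083/connectors/test-students-update/config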
> need to pass two collection and two topics (one for updating and one for creating).
The collection field in the connector can only reference one collection. Therefore, you'd need two separate connectors, and all Kafka events would be inserted or updated into those individual collections without referencing one another, unless your schema model uses ObjectId references.
Alternatively, redesign your producer to send to one topic; inserts, updates (and deletes) can then happen against one collection based on the key of the record.
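For the two-connector approach, the second sink could mirror the configuration above but point at the other topic and collection; the topic and collection names here (topicHistory, studentsHistory) are purely hypothetical:
# second sink connector; topicHistory and studentsHistory are placeholder names
curl -X POST -H "Content-Type: application/json" -d '{"name":"test-students-create",
"config":{"topics":"topicHistory",
"connector.class":"com.mongodb.kafka.connect.MongoSinkConnector",
"tasks.max":"1",
"connection.uri":"mongodb://localhost:27017",
"database":"quickstart",
"collection":"studentsHistory",
"key.converter":"org.apache.kafka.connect.storage.StringConverter",
"value.converter":"org.apache.kafka.connect.storage.StringConverter"
}}' localhost:8083/connectors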
I'm using the Debezium Postgres connector v1.4.2.Final.
I'm using snapshot.mode=initial_only, where I only want to get the table(s) snapshot and not stream the incremental changes. Once the snapshot is completed, I want to stop/kill the connector. How can I find out if the snapshotting is complete and that it's safe to kill the connector?
I'm using this to be able to add new tables to an existing connector. To do that, I'm trying the following:
1. Kill the original connector (snapshot.mode=initial).
2. Start a new connector with snapshot.mode=initial_only for the new tables.
3. Stop the new connector once snapshotting is complete.
4. Start the original connector after adding the new tables to table.whitelist.
Please check the JMX metrics. Verify whether the SnapshotCompleted metric, https://debezium.io/documentation/reference/1.5/connectors/postgresql.html#connectors-snaps-metric-snapshotcompleted_postgresql, would suit your needs.
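A minimal sketch of polling that metric from a shell, assuming the Connect worker exposes JMX on port 9999 and a jmxterm uber-jar is available locally (the jar name is a placeholder, and <server-name> is the connector's database.server.name):
# prints "true" once the initial snapshot has finished; JMX port and jar name are assumptions
echo 'get -b debezium.postgres:type=connector-metrics,context=snapshot,server=<server-name> SnapshotCompleted' | java -jar jmxterm-uber.jar -l localhost:9999 -n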
I have found that my topic, despite running and operating, doesn't register events occurring in my MongoDB.
Every time I insert or modify a record, I no longer get any output from the kafka-console-consumer command.
Is there a way to clear Kafka's cache/offsets, maybe?
The source and sink connectors are up and running, and the entire cluster is healthy. Everything worked as usual, but every couple of weeks I see this coming back, or when I log into my Mongo cloud from another location.
The --partition 0 parameter didn't help, nor did changing retention.ms to 1.
I checked both connectors' status and got RUNNING:
curl localhost:8083/connectors | jq
curl localhost:8083/connectors/monit_people/status | jq
Running `docker-compose logs connect`, I found:
WARN Failed to resume change stream: Resume of change stream was not possible, as the resume point may no longer be in the oplog. 286
If the resume token is no longer available then there is the potential for data loss.
Saved resume tokens are managed by Kafka and stored with the offset data.
When running Connect in standalone mode offsets are configured using the:
`offset.storage.file.filename` configuration.
When running Connect in distributed mode the offsets are stored in a topic.
Use the `kafka-consumer-groups.sh` tool with the `--reset-offsets` flag to reset offsets.
Resetting the offset will allow for the connector to be resume from the latest resume token.
Using `copy.existing=true` ensures that all data will be outputted by the connector but it will duplicate existing data.
Future releases will support a configurable `errors.tolerance` level for the source connector and make use of the `postBatchResumeToken
This issue requires more practice with Confluent Platform, so for now I rebuilt the entire environment by removing the containers with:
docker system prune -a -f --volumes
docker container stop $(docker container ls -a -q -f "label=io.confluent.docker")
After running docker-compose up -d, everything is up and working.
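As the warning above suggests, in distributed mode the saved resume token lives in Connect's offsets topic, so before rebuilding everything it can be worth inspecting (and, if needed, resetting) those stored offsets. A sketch, assuming the default distributed offsets topic name connect-offsets:
# dump the stored source offsets (which include the MongoDB resume token); the topic name is the default and may differ
kafka-console-consumer --bootstrap-server localhost:9092 --topic connect-offsets --from-beginning --property print.key=true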
How do you start multiple Kafka connectors in a Kafka Connect distributed worker setup (running on 3 different servers)?
Right now I need 4 Kafka connectors in this distributed worker setup (same group.id).
Currently, I am adding one connector at a time using the following curl command:
curl -X POST -H "Content-type: application/json" -d '<my_single_connector_config>' 'http://localhost:8083/connectors'
Issue:
For each new connector I add, the previous/existing connector(s) restart along with the new one.
Question:
How should I start/create all these new connectors with one REST call in a distributed worker mode?
Is there any way to have all connector configs in a single REST call, like an array of connector configs?
I tried to search for the same but didn't come across any workaround for this.
Thanks.
> For each new connector I add, the previous/existing connector(s) restart along with the new one.
Yes, that's the current behaviour of Kafka Connect. For further discussion see:
https://issues.apache.org/jira/browse/KAFKA-5505
https://cwiki.apache.org/confluence/display/KAFKA/Incremental+Cooperative+Rebalancing%3A+Support+and+Policies
> How should I start/create all these new connectors with one REST call in a distributed worker mode?
> Is there any way to have all connector configs in a single REST call, like an array of connector configs?
You can't do it in a single REST call.
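You can, however, script the individual calls; a sketch, assuming each connector config lives in its own JSON file (the file-name pattern is hypothetical):
# POST each config in turn; the Connect REST API accepts only one connector per request
for f in connector-*.json; do curl -X POST -H "Content-Type: application/json" -d @"$f" http://localhost:8083/connectors; done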
If you want to isolate your connectors from each other when creating/updating them, you can just run multiple distributed clusters.
So instead of 1 distributed Connect cluster running 3 connectors, you could have 3 distributed Connect clusters each running 1 connector.
Remember that in practice a 'distributed cluster' could consist of just a single node, and indeed all of them could run on the same machine. You'd scale out for resilience and throughput capacity.
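A sketch of what an extra cluster on the same machine might look like: a second worker properties file whose group.id, rest.port, and three internal storage topics (offset.storage.topic, config.storage.topic, status.storage.topic) all differ from the first cluster's, then:
# start an additional worker with its own properties file (file name hypothetical)
connect-distributed.sh second-worker.properties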