Kafka Connect - MongoDB Sink Connector - mongodb

I am new to Kafka Connect and have been exploring it for about a week. I have created and updated documents in MongoDB via the MongoDB sink connector using curl commands. I am struggling a bit to understand the concept and implementation of the points below.
1) We register the connector with a curl command, each time with a unique name, before producing the message. How can this be automated? For example, if I pass data from my application to the producer, should I call the curl command for each and every request?
2) I need to maintain a history collection, and for that I need two collections and two topics (one for updating and one for creating). How will I manage this with the curl configuration?
My curl update configuration is pasted below:
curl -X POST -H "Content-Type: application/json" -d '{"name":"test-students-update",
"config":{"topics":"topicData",
"connector.class":"com.mongodb.kafka.connect.MongoSinkConnector",
"tasks.max":"1",
"connection.uri":"mongodb://localhost:27017",
"database":"quickstart",
"collection":"topicData",
"key.converter":"org.apache.kafka.connect.storage.StringConverter",
"value.converter":"org.apache.kafka.connect.storage.StringConverter",
"key.converter.schemas.enable":"false",
"value.converter.schemas.enable":"false",
"document.id.strategy":"com.mongodb.kafka.connect.sink.processor.id.strategy.BsonOidStrategy",
"document.id.strategy":"com.mongodb.kafka.connect.sink.processor.id.strategy.PartialValueStrategy",
"value.projection.list":"tokenNumber",
"value.projection.type":"whitelist",
"writemodel.strategy":"com.mongodb.kafka.connect.sink.writemodel.strategy.UpdateOneTimestampsStrategy"
}}' localhost:8083/connectors

I am not sure what you mean by automating the curl command for your MongoDB sink connector, or why you would need to run the curl command every time. Kindly clarify.
The existing MongoDB sink connector can easily be installed from Confluent Hub and run as a standalone service to handle data upserts.
You can have a look at https://www.mongodb.com/docs/kafka-connector/current/sink-connector/fundamentals/#std-label-kafka-sink-fundamentals
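For example, a rough installation sketch, assuming you use the confluent-hub CLI and a local Connect worker (the version tag, port, and the sink-config.json file name are illustrative; the file would hold a config like the one in the question):
# install the MongoDB connector plugin into the worker's plugin path
confluent-hub install mongodb/kafka-connect-mongodb:latest
# restart the Connect worker, then register the connector once with a single curl call
curl -X POST -H "Content-Type: application/json" -d @sink-config.json localhost:8083/connectors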

every time with unique name before producing the message
This is not necessary. Post the connector once and it'll start a Kafka consumer and wait for data, just like any other Kafka client.
pass the data from my application to producer should I call the curl command for each and every request
As stated, no.
How it will be automated
You don't necessarily need to use curl. If you're using Kubernetes, there are CRDs for KafkaConnect. Otherwise, Terraform providers work with the Connect API as well. Or you can continue to use curl in some CI/CD pipeline, but it only needs to run once to start the connector; an idempotent example is sketched below.
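For instance, a minimal sketch of a one-shot registration step for a CI/CD job. It uses PUT /connectors/<name>/config, which creates the connector if it does not exist and updates it otherwise, so re-running the pipeline is harmless (the host and config values are copied from the question and are illustrative):
curl -X PUT -H "Content-Type: application/json" -d '{
  "connector.class":"com.mongodb.kafka.connect.MongoSinkConnector",
  "topics":"topicData",
  "connection.uri":"mongodb://localhost:27017",
  "database":"quickstart",
  "collection":"topicData"
}' localhost:8083/connectors/test-students-update/config
Note that, unlike POST /connectors, the PUT endpoint takes only the bare config object, without the "name"/"config" wrapper.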
need to pass two collection and two topics (one for updating and one for creating).
The collection field in the connector can only reference one collection. Therefore, you'd need two separate connectors, and then all Kafka events would be inserted or updated into those individual collections without referencing one another, unless your schema model uses ObjectId references.
Alternatively, redesign your producer to send to one topic; then inserts, updates (and deletes) can happen in one collection based on the key of the record. A sketch of the two-connector option follows.
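As a rough illustration of the two-connector route (the connector name, topic, and collection below are made-up examples), the "create" path would be a second, separately registered sink alongside your existing "update" connector:
curl -X POST -H "Content-Type: application/json" -d '{"name":"test-students-create",
"config":{"topics":"studentCreateTopic",
"connector.class":"com.mongodb.kafka.connect.MongoSinkConnector",
"tasks.max":"1",
"connection.uri":"mongodb://localhost:27017",
"database":"quickstart",
"collection":"studentsHistory",
"key.converter":"org.apache.kafka.connect.storage.StringConverter",
"value.converter":"org.apache.kafka.connect.storage.StringConverter"
}}' localhost:8083/connectors
Both connectors are registered once and then keep consuming from their respective topics.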

Related

Is there a way of telling a sink connector in Kafka Connect how to look for schema entries

I have successfully set up Kafka Connect in distributed mode locally with the Confluent BigQuery connector. The topics are being made available to me by another party; I am simply moving these topics into my Kafka Connect on my local machine, and then to the sink connector (and thus into BigQuery).
Because the topics are created by someone else, the schema registry is also managed by them. So in my config I set "schema.registry.url":https://url-to-schema-registry, but we have multiple topics which all use the same schema entry, located at, let's say, https://url-to-schema-registry/subjects/generic-entry-value/versions/1.
What is happening, however, is that Connect is looking for the schema entry based on the topic name. So let's say my topic is my-topic. Connect is looking for the entry at this URL: https://url-to-schema-registry/subjects/my-topic-value/versions/1. But instead, I want to use the entry located at https://url-to-schema-registry/subjects/generic-entry-value/versions/1, and I want to do so for any and all topics.
How can I make this change? I have tried looking at this doc: https://docs.confluent.io/platform/current/schema-registry/serdes-develop/index.html#configuration-details as well as this class: https://github.com/confluentinc/schema-registry/blob/master/schema-serializer/src/main/java/io/confluent/kafka/serializers/subject/TopicRecordNameStrategy.java
but this looks to be a config parameter for the schema registry itself (which I have no control over), not the sink connector. Unless I'm not configuring something correctly.
Is there a way for me to configure my sink connector to look for a specified schema entry like generic-entry-value/versions/..., instead of the default format topic-name-value/versions/...?
The strategy is configurable at the connector level.
e.g.
value.converter.value.subject.name.strategy=...
The only built-in strategies, however, are for Topic and/or RecordName lookups. You'll need to write your own strategy class for a static lookup of "generic-entry" if you cannot otherwise copy this "generic-entry-value" schema into new subjects.
e.g.
# save the raw schema from this endpoint to a file
curl ... https://url-to-schema-registry/subjects/generic-entry-value/versions/1/schema > schema.json
# wrap that raw schema in a {"schema": "..."} envelope (the registration endpoint expects it),
# then upload it where "new-entry" is the name of the other topic
curl -X POST -H "Content-Type: application/vnd.schemaregistry.v1+json" -d @schema.json https://url-to-schema-registry/subjects/new-entry-value/versions
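For completeness, a hedged sketch of where the strategy override goes in the sink connector's JSON config; RecordNameStrategy is one of the real built-ins, and a custom class would be set in the same property (the registry URL is the placeholder from the question):
"value.converter":"io.confluent.connect.avro.AvroConverter",
"value.converter.schema.registry.url":"https://url-to-schema-registry",
"value.converter.value.subject.name.strategy":"io.confluent.kafka.serializers.subject.RecordNameStrategy"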

How can we dump a Kafka topic into Presto

I need to push a JSON file into a Kafka topic, connect the topic in Presto, and structure the JSON data into a queryable table.
I am following this tutorial https://prestodb.io/docs/current/connector/kafka-tutorial.html#step-2-load-data
I am not able to understand how this command will work.
$ ./kafka-tpch load --brokers localhost:9092 --prefix tpch. --tpch-type tiny
Suppose I have created a test topic in Kafka using a producer. How will the tpch data be generated for this topic?
If you already have a topic, you should skip to step 3 where it actually sets up the topics to query via Presto
kafka-tpch load creates new topics with the specified prefix
The above command creates a tpch schema and loads various tables under it, which can be used for testing purposes. If you want to work with your actual Kafka topics, you need to list them in etc/catalog/kafka.properties under kafka.table-names. If you simply provide a topic name without a prefix (such as test_topic), it will land in the "default" schema. However, if you specify a topic name with a prefix (such as test_schema.test_topic), then the topic will appear under test_schema. When querying with Presto, you can provide this schema name. A sample catalog file is sketched below.
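For instance, a minimal etc/catalog/kafka.properties sketch, assuming a single local broker and the example topic names above (broker address and topic names are illustrative):
connector.name=kafka
kafka.nodes=localhost:9092
# topics without a prefix land in the "default" schema; prefixed ones get their own schema
kafka.table-names=test_topic,test_schema.test_topic
kafka.hide-internal-columns=false
You could then query, for example, SELECT * FROM kafka.test_schema.test_topic in Presto.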

How to stop/terminate confluent JDBC source connector?

I am running the confluent JDBC source connector to read from a DB table and publish to a Kafka Topic. The Connector is started by a Job-scheduler and I need to stop the connector after it has published all the rows from the DB table. Any idea how to stop it gracefully ?
You can use the REST API to pause (or delete) a connector
PUT /connectors/:name/pause
There is no "notification" to know when all records have been loaded, though, so for the JDBC source you can also run bulk mode with a long poll interval (say a whole week), then schedule the connector deletion.
to pause it, run this from a command shell (that has curl installed):
curl -X PUT <host>:8083/connectors/<connector_name>/pause
to resume back again you use:
curl -X PUT <host>:8083/connectors/<connector_name>/resume
to see whether it's paused or not, use:
curl <host>:8083/connectors/<connector_name>/status | jq
the "jq" part makes it more readable.

Starting multiple connectors in Kafka Connect within a single distributed worker?

How do I start multiple Kafka connectors in Kafka Connect within a single distributed worker (running on 3 different servers)?
Right now I have a need of 4 Kafka Connectors in this distributed worker(same group.id).
Currently, I am adding one connector at a time using the following curl command.
curl -X POST -H "Content-type: application/json" -d '<my_single_connector_config>' 'http://localhost:8083/connectors'
Issue:
For each new connector I add, the previous/existing connector(s) restart along with the new connector.
Question:
How should I start/create all these new connectors with one REST call in a distributed worker mode?
Is there any way to have all connector configs in a single REST call, like an array of connector configs?
I tried to search for the same but didn't come across any workaround for this.
Thanks.
For each new connector I add, the previous/existing connector(s) restart along with the new connector.
Yes, that's the current behaviour of Kafka Connect. For further discussion see:
https://issues.apache.org/jira/browse/KAFKA-5505
https://cwiki.apache.org/confluence/display/KAFKA/Incremental+Cooperative+Rebalancing%3A+Support+and+Policies
How should I start/create all these new connectors with one REST call in a distributed worker mode?
Is there any way to have all connector configs in a single REST call, like an array of connector configs?
You can't do it in a single REST call, but you can script the individual calls, as sketched below.
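A minimal sketch, assuming each connector config sits in its own JSON file under a connectors/ directory (the directory layout and host are illustrative):
for config in connectors/*.json; do
  curl -X POST -H "Content-Type: application/json" \
    -d @"$config" http://localhost:8083/connectors
done
Each POST still triggers a rebalance, so this does not avoid the restarts; it only saves you from issuing the calls by hand.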
If you want to isolate your connectors from each other when creating/updating them, you can just run multiple distributed clusters.
So instead of 1 distributed Connect cluster running 3 connectors, you could have 3 distributed Connect clusters each running 1 connector.
Remember, in practice a 'distributed cluster' could consist of just a single node, and indeed all of them could run on the same machine. You'd scale out for resilience and throughput capacity.

How to delete Kafka topic using Kafka REST Proxy?

How can I delete a Kafka topic using the Kafka REST Proxy? I tried the following command, but it returns an error message:
curl -X DELETE XXX.XX.XXX.XX:9092/topics/test_topic
If it's impossible, then how do I delete the messages and update the schema of a topic?
According to the API Reference documentation, you cannot delete topics via the REST Proxy, and I agree with that decision, because such a destructive operation should not be available via an interface that is exposed to the outside.
The topic deletion operation can be performed on the server where the broker runs, using the command-line utility. See How to Delete a topic in apache kafka.
You can update the schema for a message when you publish it using the POST /topics/(string: topic_name) REST endpoint. If the schema for the new messages is not backward compatible with the older messages in the same topic, you will have to configure your Schema Registry to allow publishing of incompatible messages, otherwise you will get an error.
See the "Example Avro request" here:
http://docs.confluent.io/3.1.1/kafka-rest/docs/api.html#post--topics-(string-topic_name)
To see how to configure Schema Registry for forward, backward, or no compatibility, see the documentation here:
http://docs.confluent.io/3.1.1/schema-registry/docs/api.html#compatibility
I confirmed that topic deletion is supported from version 5.5.0 or higher (REST Proxy API v3), and my test worked as expected. An example request is sketched below the link.
https://docs.confluent.io/current/kafka-rest/api.html#topic
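For reference, a hedged example of the v3 calls (the REST Proxy host/port are placeholders, and the cluster ID comes from the first request):
# look up the cluster id
curl http://localhost:8082/v3/clusters
# delete the topic (returns 204 No Content on success)
curl -X DELETE http://localhost:8082/v3/clusters/<cluster_id>/topics/test_topic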