Kafka Sink Connector with custom consumer-group name - apache-kafka

In Kafka Connect, all sink connectors use a different consumer group, following the naming convention connect-<connector_name>. But I want to use a custom name as the prefix. (We can do this per connector via the sink config's name property, but I'm looking to set it by default.)
I tried to set this up in the consumer.properties file, but it had no effect.
Does anyone know how to set it? Also, what happens if I set a single group for all my sink connectors?

Sink tasks always have the connect- prefix for their ConsumerConfig group.id:
https://issues.apache.org/jira/browse/KAFKA-4400
consumer.properties is used (optionally) by kafka-console-consumer, not by the Connect API.
What happens if I set a single group for all my sink connectors?
You mean a single connector with one name? Then you'd want tasks.max to be equal to the total number of partitions of all the topics it's consuming.
If you mean multiple connectors, then you can't; all connectors within the same Connect cluster need a unique name/connector.class pair.

You can override any consumer or producer property. You have to set connector.client.config.override.policy = All in the worker configuration (the default is None). Then you can override the consumer's group.id for your task via the consumer.override.group.id property. For example:
{
  "name": "Elasticsearch",
  "config": {
    "consumer.override.group.id": "testgroup",
    "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
    "topics": "orders",
    "tasks.max": 1,
    "connection.url": "http://elasticsearch:9200",
    "type.name": "kafkaconnect",
    "key.ignore": "true",
    "schema.ignore": "false",
    "transforms": "renameTopic",
    "transforms.renameTopic.type": "org.apache.kafka.connect.transforms.RegexRouter",
    "transforms.renameTopic.regex": "orders",
    "transforms.renameTopic.replacement": "orders-latest"
  }
}
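If the override is applied, the sink's tasks should join the testgroup consumer group instead of the default connect-Elasticsearch one. A quick way to confirm this (a sketch, assuming a broker reachable on localhost:9092) is to list and describe the consumer groups:
kafka-consumer-groups --bootstrap-server localhost:9092 --list | grep testgroup
kafka-consumer-groups --bootstrap-server localhost:9092 --describe --group testgroup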
Documentation is here
If you run Kafka Connect in Docker using the image confluentinc/cp-kafka-connect-base, you can set this configuration via the environment variable CONNECT_CONNECTOR_CLIENT_CONFIG_OVERRIDE_POLICY.
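For example, in a docker-compose service definition (a minimal sketch; the broker address and the other worker settings shown are assumptions, not taken from this thread):
services:
  kafka-connect:
    image: confluentinc/cp-kafka-connect-base
    environment:
      CONNECT_BOOTSTRAP_SERVERS: "kafka:9092"
      CONNECT_GROUP_ID: "connect-cluster"
      CONNECT_CONFIG_STORAGE_TOPIC: "connect-configs"
      CONNECT_OFFSET_STORAGE_TOPIC: "connect-offsets"
      CONNECT_STATUS_STORAGE_TOPIC: "connect-status"
      CONNECT_KEY_CONVERTER: "org.apache.kafka.connect.json.JsonConverter"
      CONNECT_VALUE_CONVERTER: "org.apache.kafka.connect.json.JsonConverter"
      CONNECT_REST_ADVERTISED_HOST_NAME: "kafka-connect"
      # Enables per-connector client overrides such as consumer.override.group.id
      CONNECT_CONNECTOR_CLIENT_CONFIG_OVERRIDE_POLICY: "All"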

Related

KSQLDB Push Queries Fail to Deserialize Data - Schema Lookup Performed with Wrong Schema ID

I'm not certain what I could be missing.
I have set up a Kafka broker server, with a Zookeeper and a distributed Kafka Connect.
For schema management, I have set up an Apicurio Schema Registry instance.
I also have KSQLDB set up.
The following I can confirm is working as expected:
My source JDBC connector successfully pushed table data into the topic stss.market.info.public.ice_symbols
Problem:
Inside the KSQLDB server, I have successfully created a table from the topic stss.market.info.public.ice_symbols
Here is the detail of the table created
The problem I'm facing is that when performing a push query against this table, it returns no data. Deserialization of the data fails due to an unsuccessful lookup of the Avro schema in the Apicurio Registry.
Looking at the Apicurio Registry logs reveals that KSQLDB calls the Apicurio Registry to fetch the deserialization schema using a schema ID of 0 instead of 5, which is the ID of the schema I have registered in the registry.
The KSQLDB server logs also confirm the 404 HTTP response seen in the Apicurio logs, as shown in the image below.
Expectation:
I expect KSQLDB queries against the table to perform the schema lookup with an ID of 5, not 0. I'm guessing I'm probably missing some configuration.
Here is the image of the schema registered in the Apicurio Registry.
Here is also my source connector configuration. It has the appropriate schema lookup strategy configured, although I don't believe KSQLDB requires this when deserializing its table data. This configuration should only be relevant to capturing the table data and to its validation and storage in the topic stss.market.info.public.ice_symbols.
{
  "name": "new.connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "plugin.name": "pgoutput",
    "database.hostname": "172.17.203.10",
    "database.port": "6000",
    "database.user": "postgres",
    "database.password": "123",
    "database.dbname": "stss_market_info",
    "database.server.name": "stss.market.info",
    "table.include.list": "public.ice_symbols",
    "message.key.columns": "public.ice_symbols:name",
    "snapshot.mode": "always",
    "transforms": "unwrap,extractKey",
    "transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
    "transforms.extractKey.type": "org.apache.kafka.connect.transforms.ExtractField$Key",
    "transforms.extractKey.field": "name",
    "value.converter": "io.apicurio.registry.utils.converter.AvroConverter",
    "value.converter.apicurio.registry.url": "http://local-server:8080/apis/registry/v2",
    "value.converter.apicurio.registry.auto-register": true,
    "value.converter.apicurio.registry.find-latest": true,
    "value.apicurio.registry.as-confluent": true,
    "name": "new.connector",
    "value.converter.schema.registry.url": "http://local-server:8080/apis/registry/v2"
  }
}
Thanks in advance for any assistance.
You can specify the "VALUE_SCHEMA_ID=5" property in the WITH clause when you create a stream/table.
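For example (a sketch, assuming ksqlDB 0.24 or later, where VALUE_SCHEMA_ID is supported for schema inference; the key column name and key format are assumptions based on the connector's ExtractField$Key transform, and the value columns are inferred from the registered schema):
CREATE TABLE ice_symbols (
  name VARCHAR PRIMARY KEY   -- key column produced by the connector's ExtractField$Key SMT
) WITH (
  KAFKA_TOPIC='stss.market.info.public.ice_symbols',
  KEY_FORMAT='KAFKA',
  VALUE_FORMAT='AVRO',
  VALUE_SCHEMA_ID=5          -- use the schema registered under ID 5 for the value
);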

How to pass data when meets a condition from MongoDB to a Kafka topic with a source connector and a pipeline property?

I'm working on a source connector to watch for changes in a Mongo collection and push them to a Kafka topic. This works nicely until I add the requirement to only publish to the Kafka topic when a specific condition is met (name=Kathe). That is, I need to put data in the topic only if the update changes the name to Kathe.
My connector's config looks like:
{
  "connection.uri": "xxxxxx",
  "connector.class": "com.mongodb.kafka.connect.MongoSourceConnector",
  "key.converter": "org.apache.kafka.connect.json.JsonConverter",
  "key.converter.schemas.enable": "false",
  "value.converter": "org.apache.kafka.connect.json.JsonConverter",
  "value.converter.schemas.enable": "false",
  "topic.prefix": "qu",
  "database": "sample_analytics",
  "collection": "customers",
  "copy.existing": "true",
  "pipeline": "[{\"$match\":{\"name\":\"Kathe\"}}]",
  "publish.full.document.only": "true",
  "flush.timeout.ms": "15000"
}
I have also tried with:
"pipeline":"[{\"$match\":{\"name\":{ \"$eq\":\"Kathe\"}}}]"
But it is not producing messages when the condition is met.
Am I making a mistake?
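One thing worth checking (an assumption based on how MongoDB change-stream events are shaped, not something confirmed in this thread): the pipeline is applied to the change-stream event, where the updated document is nested under fullDocument, so the match stage may need to reference that field, e.g.:
"pipeline": "[{\"$match\":{\"fullDocument.name\":\"Kathe\"}}]"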

Kafka Sink how to map fields to db with different topic and table schema name

I am currently setting up a Kafka sink connector with the topic name waiting-room, while my DB schema is called waiting_room. So I am trying to map the topic messages to the DB schema, but I do not see any data entering the database.
So I tried the following scenario:
Since the table schema is waiting_room, I tried to add quote.sql.identifier=ALWAYS, since it quotes the table name and would allow the Kafka sink to quote it so it can map to the table, but I did not see quote.sql.identifier=ALWAYS take effect in the Kafka sink. Do both the table schema and the Kafka sink need to be quoted in order to map it, or how can I map a table schema with an underscore and have Kafka map to it?
Then, if I change table.name.format=waiting-room and set the DB schema to gt.namespace."waiting-room", I do not see my Kafka sink get updated; instead my table.name.format stays waiting_room and the connector's status is 404 Not Found.
Is there a way to map and have data enter the DB when the topic and table names are different?
Try to use Kafka Connect SMT RegexRouter:
{
  "tasks.max": "1",
  "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
  "connection.url": "'"$URL"'",
  "topics": "waiting-room",
  "transforms": "route",
  "transforms.route.type": "org.apache.kafka.connect.transforms.RegexRouter",
  "transforms.route.regex": "waiting-room",
  "transforms.route.replacement": "gt.namespace.waiting_room",
  "errors.tolerance": "all",
  "errors.log.enable": "true",
  "errors.log.include.messages": "true"
}

Kafka-connect add more topics on the fly

I have an elasticsearch kafka-connect connector consuming some topics.
With the following configuration:
{
  "connection.url": "https://my-es-cluster:443",
  "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
  "key.converter": "org.apache.kafka.connect.storage.StringConverter",
  "key.ignore": "true",
  "topics": "topic1,topic2",
  ...
}
Can I add more topics to it while it's running?
What will happen?
What if I remove some topics from the list and add them again later?
I'd like to add a new topic3 here:
{
...
"topics": "topic1,topic2,topic3",
...
}
What if I remove topic2? Will the other topics be re-consumed?
{
...
"topics": "topic1,topic3",
...
}
Since you already have Kafka and Kafka Connect running, you can use the Kafka Connect REST API and check that yourself: https://docs.confluent.io/current/connect/references/restapi.html
If you add a new topic (topic3), all messages currently in that topic (subject to the retention policy) will be consumed.
PUT http://kafka-connect:8083/connectors/my-test-connector/config
{
...
"topics": "topic1,topic2,topic3",
...
}
Check status and config of this connector:
GET http://kafka-connect:8083/connectors/my-test-connector
If you want to disable some topic, just use PUT to update the config for that connector.
PUT http://kafka-connect:8083/connectors/my-test-connector/config
{
...
"topics": "topic1,topic3",
...
}
Nothing will change for topic1 and topic3. Only topic2 will not be consumed any more.
But if you add it back, messages from topic2 will be consumed from the last committed offset, not from the beginning.
The last committed offset is stored per consumer group; it does not matter that you removed the topic from the config for a while.
In this case, the consumer group will be connect-my-test-connector.
Even if you delete the connector (DELETE http://kafka-connect:8083/connectors/my-test-connector) and then create it again with the same name, the offsets will be preserved, and consumption will continue from the point where you deleted it (mind the retention policy; it's usually 7 days).
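To see the offsets the sink has committed for that group, you can describe it with the standard tooling (a sketch, assuming a broker reachable on localhost:9092):
kafka-consumer-groups --bootstrap-server localhost:9092 --describe --group connect-my-test-connector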

Kafka-connect issue

I installed Apache Kafka (Confluent) on CentOS 7 and am trying to run the FileStream Kafka Connect connector in distributed mode, but I was getting the error below:
[2017-08-10 05:26:27,355] INFO Added alias 'ValueToKey' to plugin 'org.apache.kafka.connect.transforms.ValueToKey' (org.apache.kafka.connect.runtime.isolation.DelegatingClassLoader:290)
Exception in thread "main" org.apache.kafka.common.config.ConfigException: Missing required configuration "internal.key.converter" which has no default value.
at org.apache.kafka.common.config.ConfigDef.parseValue(ConfigDef.java:463)
at org.apache.kafka.common.config.ConfigDef.parse(ConfigDef.java:453)
at org.apache.kafka.common.config.AbstractConfig.<init>(AbstractConfig.java:62)
at org.apache.kafka.common.config.AbstractConfig.<init>(AbstractConfig.java:75)
at org.apache.kafka.connect.runtime.WorkerConfig.<init>(WorkerConfig.java:197)
at org.apache.kafka.connect.runtime.distributed.DistributedConfig.<init>(DistributedConfig.java:289)
at org.apache.kafka.connect.cli.ConnectDistributed.main(ConnectDistributed.java:65)
This has now been resolved by updating workers.properties as described in http://docs.confluent.io/current/connect/userguide.html#connect-userguide-distributed-config
Command used:
/home/arun/kafka/confluent-3.3.0/bin/connect-distributed.sh ../../../properties/file-stream-demo-distributed.properties
Filestream properties file (workers.properties):
name=file-stream-demo-distributed
connector.class=org.apache.kafka.connect.file.FileStreamSourceConnector
tasks.max=1
file=/tmp/demo-file.txt
bootstrap.servers=localhost:9092,localhost:9093,localhost:9094
config.storage.topic=demo-2-distributed
offset.storage.topic=demo-2-distributed
status.storage.topic=demo-2-distributed
key.converter=org.apache.kafka.connect.json.JsonConverter
key.converter.schemas.enable=true
value.converter=org.apache.kafka.connect.json.JsonConverter
value.converter.schemas.enable=true
internal.key.converter=org.apache.kafka.connect.json.JsonConverter
internal.key.converter.schemas.enable=false
internal.value.converter=org.apache.kafka.connect.json.JsonConverter
internal.value.converter.schemas.enable=false
group.id=""
I added the properties below and the command went through without any errors.
bootstrap.servers=localhost:9092,localhost:9093,localhost:9094
config.storage.topic=demo-2-distributed
offset.storage.topic=demo-2-distributed
status.storage.topic=demo-2-distributed
group.id=""
But now, when I run the consumer command, I am unable to see the messages from /tmp/demo-file.txt. Please let me know if there is a way I can check whether the messages are published to Kafka topics and partitions.
kafka-console-consumer --zookeeper localhost:2181 --topic demo-2-distributed --from-beginning
I believe I am missing something really basic here. Can someone please help?
You need to define unique topics for the Kafka Connect framework to store its config, offsets, and status.
In your workers.properties file, change these parameters to something like the following:
config.storage.topic=demo-2-distributed-config
offset.storage.topic=demo-2-distributed-offset
status.storage.topic=demo-2-distributed-status
These topics are used to store the state and configuration metadata of Connect, not the messages of the connectors that run on top of Connect. Do not use the console consumer on any of these three topics and expect to see the messages.
The messages are stored in the topic configured in the connector configuration JSON with the parameter called "topics".
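To check whether messages actually reach a data topic, consume that data topic directly rather than the internal ones, e.g. for the mytopic topic used in the sink example below (assuming a broker on localhost:9092):
kafka-console-consumer --bootstrap-server localhost:9092 --topic mytopic --from-beginning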
Example file-sink-config.json file
{
  "name": "MyFileSink",
  "config": {
    "topics": "mytopic",
    "connector.class": "org.apache.kafka.connect.file.FileStreamSinkConnector",
    "tasks.max": 1,
    "key.converter": "org.apache.kafka.connect.storage.StringConverter",
    "value.converter": "org.apache.kafka.connect.storage.StringConverter",
    "file": "/tmp/demo-file.txt"
  }
}
Once the distributed worker is running, you need to apply the config file to it using curl, like so:
curl -X POST -H "Content-Type: application/json" --data @file-sink-config.json http://localhost:8083/connectors
After that, the config will be safely stored in the config topic you created, for all distributed workers to use. Make sure the config topic (and the status and offset topics) does not expire messages, or you will lose your connector configuration when it does.
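If you create the internal topics manually, make them compacted so the configuration never expires; a sketch for the config topic (the broker address is an assumption, and older CLI versions use --zookeeper instead of --bootstrap-server):
kafka-topics --bootstrap-server localhost:9092 --create --topic demo-2-distributed-config --partitions 1 --replication-factor 1 --config cleanup.policy=compact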