Kafka-connect add more topics on the fly - apache-kafka

I have an Elasticsearch Kafka Connect connector consuming some topics, with the following configuration:
{
  "connection.url": "https://my-es-cluster:443",
  "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
  "key.converter": "org.apache.kafka.connect.storage.StringConverter",
  "key.ignore": "true",
  "topics": "topic1,topic2",
  ...
}
Can I add more topics to it while it's running?
What will happen?
What if I remove some topics from the list and add them again later?
I'd like to add a new topic3 here:
{
...
"topics": "topic1,topic2,topic3",
...
}
What if I remove topic2? Will the other topics be re-consumed?
{
...
"topics": "topic1,topic3",
...
}

Since you already have Kafka and Kafka Connect running, you can use the Kafka Connect REST API and check this yourself: https://docs.confluent.io/current/connect/references/restapi.html
If you add a new topic (topic3), all messages currently in that topic (subject to the retention policy) will be consumed.
PUT http://kafka-connect:8083/connectors/my-test-connector/config
{
...
"topics": "topic1,topic2,topic3",
...
}
Check status and config of this connector:
GET http://kafka-connect:8083/connectors/my-test-connector
If you want to stop consuming some topic, just use PUT to update the config for that connector.
PUT http://kafka-connect:8083/connectors/my-test-connector/config
{
...
"topics": "topic1,topic3",
...
}
Nothing will change for topic1 and topic3; only topic2 will no longer be consumed.
But if you add it back later, messages from topic2 will be consumed from the last committed offset, not from the beginning.
The last committed offset is stored per consumer group, regardless of the fact that you removed the topic from the config for a while.
In this case, the consumer group will be connect-my-test-connector.
Even if you delete the connector (DELETE http://kafka-connect:8083/connectors/my-test-connector) and then create it again with the same name, the offsets are kept, and consumption will continue from the point at which you deleted it (mind the retention policy; it's usually 7 days).
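If you prefer to script these calls, here is a minimal sketch using Python's requests library (assuming it is installed and that the kafka-connect:8083 host and connector name above are reachable). Note that PUT /connectors/{name}/config replaces the whole configuration, so send the complete config, not just the changed key:
import requests

base = "http://kafka-connect:8083/connectors/my-test-connector"

# PUT replaces the entire connector config, so include every setting, not only "topics".
config = {
    "connection.url": "https://my-es-cluster:443",
    "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
    "key.converter": "org.apache.kafka.connect.storage.StringConverter",
    "key.ignore": "true",
    "topics": "topic1,topic2,topic3",
}
requests.put(base + "/config", json=config).raise_for_status()

# Check the connector's config and status afterwards.
print(requests.get(base).json())
print(requests.get(base + "/status").json())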

Related

How to reprocess messages in Kafka Connect?

I am working with a Kafka sink connector which reads from a Kafka topic and puts the data into a target database (in my case a Neo4j instance). The messages need to be processed strictly sequentially since they are not idempotent. My question is: if for some reason an exception occurs, e.g. 1. the database goes down, 2. connectivity to the DB is lost, 3. schema parsing fails, how can we reprocess the message?
I understand we can run with the errors.tolerance=none configuration and redirect failed messages to a dead letter queue. But my question is: is there any way we can process a selected message again? Also, is there any audit mechanism to track how many messages are processed, or to seek from a given offset (without a manual offset reset)?
Below is my connector configuration. Also, please suggest if there are better data integration technologies apart from Kafka connectors to sink the data into a target database.
{
  "topics": "mytopic",
  "connector.class": "streams.kafka.connect.sink.Neo4jSinkConnector",
  "tasks.max": "1",
  "key.converter.schemas.enable": "true",
  "values.converter.schemas.enable": "true",
  "errors.retry.timeout": "-1",
  "errors.retry.delay.max.ms": "1000",
  "errors.tolerance": "none",
  "errors.deadletterqueue.topic.name": "deadletter-topic",
  "errors.deadletterqueue.topic.replication.factor": 1,
  "errors.deadletterqueue.context.headers.enable": true,
  "key.converter": "org.apache.kafka.connect.storage.StringConverter",
  "key.converter.enhanced.avro.schema.support": true,
  "value.converter.enhanced.avro.schema.support": true,
  "value.converter": "io.confluent.connect.avro.AvroConverter",
  "value.converter.schema.registry.url": "https://schema-url/",
  "value.converter.basic.auth.credentials.source": "USER_INFO",
  "value.converter.basic.auth.user.info": "user:pass",
  "errors.log.enable": true,
  "schema.ignore": "false",
  "errors.log.include.messages": true,
  "neo4j.server.uri": "neo4j://my-ip:7687/neo4j",
  "neo4j.authentication.basic.username": "neo4j",
  "neo4j.authentication.basic.password": "neo4j",
  "neo4j.encryption.enabled": false,
  "neo4j.topic.cypher.mytopic": "MERGE (p:Loc_Con{name: event.geography.name})"
}
For non-fatal exceptions, the connector will write to a dead letter topic.
You'd need another connector or some other consumer to read that topic in order to process that data. Since it's a topic, there's no straightforward way to "process a selected message".
JMX metrics or Neo4j database metrics should both be able to tell you approximately how many messages have been processed over time.
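A minimal sketch of such a consumer, assuming the confluent-kafka Python client and a placeholder broker address; it prints the record headers, which carry the failure context when errors.deadletterqueue.context.headers.enable is true:
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",  # placeholder: point at your cluster
    "group.id": "deadletter-inspector",     # hypothetical group name
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["deadletter-topic"])

try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None:
            continue
        if msg.error():
            print("Consumer error:", msg.error())
            continue
        # Headers carry the original topic/partition/offset and exception details;
        # the value is the record that failed in the sink.
        print(msg.headers())
        print(msg.value())
finally:
    consumer.close()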

Kafka Sink Connector with custom consumer-group name

In Kafka Connect, all the sink connectors use a different consumer group with the naming convention connect-<connector_name>. But I want to use a custom name as the prefix. (We can do this in the sink config's name property, but I'm looking to set it by default.)
I tried to set this up in the consumer.properties file, but with no luck.
Does anyone know how to set it? Also, what happens if I set a single group for all my sink connectors?
Sink tasks always have the connect- prefix for their ConsumerConfig group.id:
https://issues.apache.org/jira/browse/KAFKA-4400
consumer.properties is (optionally) used by kafka-console-consumer, not by the Connect API.
"What happens if I set a single group for all my sink connectors?"
You mean a single connector with one name? Then you'd want tasks.max to be equal to the total number of partitions across all the topics it's consuming.
If you mean multiple connectors, then you can't; all connectors within the same Connect cluster need a unique name/connector.class pair.
You can override any consumer or producer property. You have to set connector.client.config.override.policy=All in the worker configuration (the default is None). Then you can override the consumer group.id for your task via the property consumer.override.group.id. For example:
{
  "name": "Elasticsearch",
  "config": {
    "consumer.override.group.id": "testgroup",
    "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
    "topics": "orders",
    "tasks.max": 1,
    "connection.url": "http://elasticsearch:9200",
    "type.name": "kafkaconnect",
    "key.ignore": "true",
    "schema.ignore": "false",
    "transforms": "renameTopic",
    "transforms.renameTopic.type": "org.apache.kafka.connect.transforms.RegexRouter",
    "transforms.renameTopic.regex": "orders",
    "transforms.renameTopic.replacement": "orders-latest"
  }
}
The documentation is here.
If you run Kafka Connect in Docker from the image confluentinc/cp-kafka-connect-base, you can set this configuration via the environment variable CONNECT_CONNECTOR_CLIENT_CONFIG_OVERRIDE_POLICY.

Debezium connector with TimescaleDB extension

I'm having trouble detecting changes on a PostgreSQL hypertable (TimescaleDB extension).
Setup:
I have PostgreSQL (ver. 11.10) installed with the TimescaleDB (ver. 1.7.1) extension.
I have 2 tables I want to monitor with the Debezium (ver. 1.3.1) connector installed on Kafka Connect, for the purpose of CDC (Change Data Capture).
The tables are table1 and table2hyper, where table2hyper is a hypertable.
After creating the Debezium connector in Kafka Connect, I can see 2 topics created (one for each table):
(A) kconnect.public.table1
(B) kconnect.public.table2hyper
When consuming messages with kafka-console-consumer for topic A, I can see the messages after a row update in table1.
But when consuming messages from topic B (table2hyper changes), nothing is emitted after, for example, a row update in the table2hyper table.
Initially, the Debezium connector takes a snapshot of the rows from the table2hyper table and sends them to topic B (I can see the messages in topic B when using kafka-console-consumer), but changes that I make after the initial snapshot are not emitted.
Why am I unable to see subsequent changes (after the initial snapshot) from table2hyper?
Connector creation payload:
{
  "name": "alarm-table-connector7",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "tasks.max": "1",
    "database.hostname": "xxx",
    "database.port": "5432",
    "database.user": "xxx",
    "database.password": "xxx",
    "database.dbname": "xxx",
    "database.server.name": "kconnect",
    "database.whitelist": "public.dev_db",
    "table.include.list": "public.table1, public.table2hyper",
    "plugin.name": "pgoutput",
    "tombstones.on.delete": "true",
    "slot.name": "slot3",
    "transforms": "unwrap",
    "transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
    "transforms.unwrap.drop.tombstones": "false",
    "transforms.unwrap.delete.handling.mode": "rewrite",
    "transforms.unwrap.add.fields": "table,lsn,op"
  }
}
Thx in advance!
After trying for a while, I did not succeed in streaming data from a hypertable with the Debezium connector. I was using version 1.3.1, and upgrading to the latest 1.4.1 did not help.
However, I did succeed with the Confluent JDBC connector.
As far as my research and testing go, this is the conclusion (feel free to correct me if necessary):
Debezium works on ordinary tables for INSERT, UPDATE and DELETE events.
The Confluent JDBC connector captures only INSERT events (unless you combine some columns for detecting changes), and works on both ordinary and hyper (TimescaleDB) tables.
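As a rough illustration of the second point, a sketch of a JDBC source connector using timestamp+incrementing mode to detect both inserts and updates; the connector name, column names, topic prefix, and connection details are hypothetical placeholders, and the config is posted with Python's requests library:
import requests

jdbc_source = {
    "name": "timescale-jdbc-source",  # hypothetical connector name
    "config": {
        "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
        "connection.url": "jdbc:postgresql://xxx:5432/xxx?user=xxx&password=xxx",
        "table.whitelist": "table2hyper",
        "mode": "timestamp+incrementing",
        "timestamp.column.name": "updated_at",  # hypothetical column used to detect changes
        "incrementing.column.name": "id",       # hypothetical strictly increasing id column
        "topic.prefix": "jdbc.",
        "poll.interval.ms": "5000",
    },
}
requests.post("http://kafka-connect:8083/connectors", json=jdbc_source).raise_for_status()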
We have never tested Debezium with TimescaleDB. I recommend you check whether the TimescaleDB updates are present in the logical replication slot. If yes, it should be technically possible to have Debezium process the events. If not, then it is not possible at all.
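One minimal way to perform that check, sketched with psycopg2 and placeholder connection details: create a throwaway slot with the built-in test_decoding plugin, update a row in table2hyper from another session, then peek at the slot. Keep in mind that TimescaleDB typically routes hypertable writes to internal chunk tables, so any changes may appear under _timescaledb_internal rather than table2hyper:
import psycopg2

conn = psycopg2.connect(host="xxx", port=5432, dbname="xxx", user="xxx", password="xxx")
conn.autocommit = True
cur = conn.cursor()

# Throwaway logical slot using the standard test_decoding output plugin.
cur.execute("SELECT pg_create_logical_replication_slot('debug_slot', 'test_decoding')")

input("Now UPDATE a row in table2hyper from another session, then press Enter...")

# Peek without consuming; each row shows which table the change was recorded against.
cur.execute("SELECT lsn, xid, data FROM pg_logical_slot_peek_changes('debug_slot', NULL, NULL)")
for lsn, xid, data in cur.fetchall():
    print(lsn, xid, data)

# Clean up the throwaway slot.
cur.execute("SELECT pg_drop_replication_slot('debug_slot')")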

How to move a topic from one broker to another broker in kafka?

I first tried to see if I can create a topic on a particular broker, but it looks like this is not possible, even if I mention the broker host in the bootstrap servers:
admin_client = AdminClient({
"bootstrap.servers": "xxx1.com:9092,xxx2.com:9092"
})
futmap=admin_client.create_topics(topic_list)
The program arbitrarily picks one of the 5 brokers that I have as the leader broker for the topic. I am trying to understand why this happens.
I am also trying to see if I can reassign the topic leader to another broker. I know it may be possible through the kafka-reassign-partitions command-line script, but I wanted to do it programmatically using Python and the confluent-kafka package. Is it possible to do this programmatically? I did not find a reassign-partitions function in the AdminClient class of the confluent-kafka package.
Thanks
I have finally got the solution for this; the documentation of the confluent-kafka Python package is not adequate for it. But one good thing about open source is that you can read the code and figure it out. So, to create the topic on a particular broker, I had to write the create-topic code as below. Please note that I have used replica_assignment instead of replication_factor; these two are mutually exclusive. If you use replication_factor, the partitions will be assigned by Kafka; you can control the assignment through replica_assignment. However, I am sure that this will get re-assigned in case of a rebalancing/re-assignment of partitions, but that can also be handled through the on_revoke event. For now, this works for me.
from confluent_kafka.admin import NewTopic

def createTopic(admin_client, topics):
    # topic_name = topics
    topic_name = ['rajib1_test_xxx_topic']
    # One partition, with its replicas pinned to brokers 262 and 261
    # (replica_assignment is mutually exclusive with replication_factor).
    replica_assignment = [[262, 261]]
    topic_list = [NewTopic(topic, num_partitions=1, replica_assignment=replica_assignment)
                  for topic in topic_name]
    futmap = admin_client.create_topics(topic_list)
    # Wait for each operation to finish.
    for topic, f in futmap.items():
        try:
            f.result()  # The result itself is None
            print("Topic {} created".format(topic))
        except Exception as e:
            print("Failed to create topic {}: {}".format(topic, e))
    # return futmap
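A hypothetical way to call the helper above, reusing the bootstrap servers from the question (adjust them and the broker ids to your cluster):
from confluent_kafka.admin import AdminClient

admin_client = AdminClient({
    "bootstrap.servers": "xxx1.com:9092,xxx2.com:9092"
})
createTopic(admin_client, ["rajib1_test_xxx_topic"])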
You could also use the kafka-reassign-partitions.sh tool that comes with Kafka to move the replicas of a topic to another broker.
For example, if you want your (in this example single-replicated and single-partitioned) topic "test" to be located on broker 1, you can first define a plan (named replicachange.json):
{
  "partitions": [
    {"topic": "test", "partition": 0, "replicas": [1]}
  ],
  "version": 1
}
and then execute it using:
kafka-reassign-partitions.sh --zookeeper localhost:2181 --execute \
--reassignment-json-file replicachange.json

Kafka-connect issue

I installed Apache Kafka (Confluent) on CentOS 7 and am trying to run the FileStream Kafka Connect connector in distributed mode, but I was getting the error below:
[2017-08-10 05:26:27,355] INFO Added alias 'ValueToKey' to plugin 'org.apache.kafka.connect.transforms.ValueToKey' (org.apache.kafka.connect.runtime.isolation.DelegatingClassLoader:290)
Exception in thread "main" org.apache.kafka.common.config.ConfigException: Missing required configuration "internal.key.converter" which has no default value.
at org.apache.kafka.common.config.ConfigDef.parseValue(ConfigDef.java:463)
at org.apache.kafka.common.config.ConfigDef.parse(ConfigDef.java:453)
at org.apache.kafka.common.config.AbstractConfig.<init>(AbstractConfig.java:62)
at org.apache.kafka.common.config.AbstractConfig.<init>(AbstractConfig.java:75)
at org.apache.kafka.connect.runtime.WorkerConfig.<init>(WorkerConfig.java:197)
at org.apache.kafka.connect.runtime.distributed.DistributedConfig.<init>(DistributedConfig.java:289)
at org.apache.kafka.connect.cli.ConnectDistributed.main(ConnectDistributed.java:65)
This is now resolved by updating workers.properties as described in http://docs.confluent.io/current/connect/userguide.html#connect-userguide-distributed-config
Command used:
/home/arun/kafka/confluent-3.3.0/bin/connect-distributed.sh ../../../properties/file-stream-demo-distributed.properties
Filestream properties file (workers.properties):
name=file-stream-demo-distributed
connector.class=org.apache.kafka.connect.file.FileStreamSourceConnector
tasks.max=1
file=/tmp/demo-file.txt
bootstrap.servers=localhost:9092,localhost:9093,localhost:9094
config.storage.topic=demo-2-distributed
offset.storage.topic=demo-2-distributed
status.storage.topic=demo-2-distributed
key.converter=org.apache.kafka.connect.json.JsonConverter
key.converter.schemas.enable=true
value.converter=org.apache.kafka.connect.json.JsonConverter
value.converter.schemas.enable=true
internal.key.converter=org.apache.kafka.connect.json.JsonConverter
internal.key.converter.schemas.enable=false
internal.value.converter=org.apache.kafka.connect.json.JsonConverter
internal.value.converter.schemas.enable=false
group.id=""
I added the properties below and the command went through without any errors.
bootstrap.servers=localhost:9092,localhost:9093,localhost:9094
config.storage.topic=demo-2-distributed
offset.storage.topic=demo-2-distributed
status.storage.topic=demo-2-distributed
group.id=""
But now when I run the consumer command, I am unable to see the messages in /tmp/demo-file.txt. Please let me know if there is a way I can check whether the messages are published to Kafka topics and partitions.
kafka-console-consumer --zookeeper localhost:2181 --topic demo-2-distributed --from-beginning
I believe I am missing something really basic here. Can someone please help?
You need to define unique topics for the Kafka Connect framework to store its config, offsets, and status.
In your workers.properties file, change these parameters to something like the following:
config.storage.topic=demo-2-distributed-config
offset.storage.topic=demo-2-distributed-offset
status.storage.topic=demo-2-distributed-status
These topics are used to store the state and configuration metadata of Connect, not the messages of any connector that runs on top of Connect. Do not use a console consumer on any of these three topics and expect to see the messages.
The messages are stored in the topic configured in the connector configuration JSON with the parameter called "topics".
Example file-sink-config.json file
{
  "name": "MyFileSink",
  "config": {
    "topics": "mytopic",
    "connector.class": "org.apache.kafka.connect.file.FileStreamSinkConnector",
    "tasks.max": 1,
    "key.converter": "org.apache.kafka.connect.storage.StringConverter",
    "value.converter": "org.apache.kafka.connect.storage.StringConverter",
    "file": "/tmp/demo-file.txt"
  }
}
Once the distributed worker is running, you need to apply the config file to it using curl, like so:
curl -X POST -H "Content-Type: application/json" --data @file-sink-config.json http://localhost:8083/connectors
After that, the config will be safely stored in the config topic you created for all distributed workers to use. Make sure the config topic (and the status and offset topics) does not expire messages, or you will lose your connector configuration when it does.
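To double-check that records are actually arriving in the connector's data topic ("mytopic" in the example above), a minimal sketch with the confluent-kafka Python client, using placeholder broker and group names:
from confluent_kafka import Consumer

c = Consumer({
    "bootstrap.servers": "localhost:9092",  # placeholder: your broker list
    "group.id": "verify-mytopic",           # hypothetical inspection group
    "auto.offset.reset": "earliest",
})
c.subscribe(["mytopic"])

# Poll a few times and print whatever is there.
for _ in range(10):
    msg = c.poll(1.0)
    if msg is not None and not msg.error():
        print(msg.partition(), msg.offset(), msg.value())
c.close()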