Kafka Connect - Delete Connector with configs? - apache-kafka

I know how to delete a Kafka connector, as described in Kafka Connect - How to delete a connector.
But I am not sure whether it also deletes/erases the connector-specific configs, offsets, and status from the *.storage.topic topics for that worker.
For example:
Let's say I delete a connector named "connector-abc-1.0.0", and the Kafka Connect worker was started with the following config.
offset.storage.topic=<topic.name>.internal.offsets
config.storage.topic=<topic.name>.internal.configs
status.storage.topic=<topic.name>.internal.status
Now, after the DELETE call for that connector, will all records for that specific connector be erased from the above internal topics?
So that I can create a new connector with the "same name" on the same worker but with a different config (different offset.start or connector.class)?

When you delete a connector, the offsets are retained in the offsets topic.
If you recreate the connector with the same name, it will re-use the offsets from the previous execution (even if the connector was deleted in between).

Since Kafka topics are append-only, the only way the messages in those Connect topics would be removed is if a record were published with the connector name as the message key and null as the value.
You could inspect those topics using the console consumer (including --property print.key=true) to see what data is in them, and keep the consumer running while you delete a connector.
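For example, a rough sketch (assuming a broker at localhost:9092 and the offsets topic name from the worker config above; adjust both to your environment):
kafka-console-consumer.sh --bootstrap-server localhost:9092 \
  --topic <topic.name>.internal.offsets \
  --from-beginning \
  --property print.key=true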
You can PUT a new config at /connectors/{name}/config, but which stored offsets apply depends on the connector type (sink / source); for example, sink connectors track their position in the internal Kafka __consumer_offsets topic, while source connectors use the offset.storage.topic (in distributed mode).
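As a sketch (assuming the Connect REST API is reachable at localhost:8083 and reusing the connector name from the question; the JSON body here is only a placeholder, not a working config):
curl -X PUT -H "Content-Type: application/json" \
  --data '{"connector.class": "...", "tasks.max": "1"}' \
  http://localhost:8083/connectors/connector-abc-1.0.0/config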
"same name" on same worker but different config(different offset.start or connector.class)?
I'm not sure changing connector.class would be a good idea with the above in mind since it'd change the connector behavior completely. offset.start isn't a property I'm aware of, so you'll need to see the documentation of that specific connector class to know what it does.

Related

Kafka connect - completely removing a connector

My question is split into two parts. I've read Kafka Connect - Delete Connector with configs?. I'd like to completely remove a connector, with offsets and all, so I can recreate it with the same name later. Is this possible? To my understanding, a tombstone message would kill this connector indefinitely.
The second part is: is there a way to have the kafka-connect container automatically delete all the connectors it created when it is brought down?
Thanks
There is no such command to completely clean up connector state. For sink connectors, you can use kafka-consumer-groups to reset their offsets. For source connectors, it's not as straightforward, as you'll need to manually produce data into the Connect-managed offsets topic.
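For a sink connector, something along these lines should work (a sketch; the broker address, connector name, and topic are placeholders, and Connect sink connectors use a consumer group named connect-<connector name>):
kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --group connect-my-sink-connector \
  --topic my-topic \
  --reset-offsets --to-earliest --execute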
The config and status topics also persist historical data, but shouldn't prevent you from recreating the connector with the same name/details.
The Connect containers published by Confluent and Debezium always use distributed mode. You'll need to override the entrypoint of the container to use standalone mode so that connector metadata is not persisted in Kafka topics (this won't be fault tolerant, but it'll be fine for testing).

Kafka Connect writes data to non-existing topic

Does Kafka Connect create the topic on the fly if it doesn't exist (but is provided as a destination), or does it fail to copy messages to it?
I need to create such topics on the fly or programmatically (Java API) at least, not manually using scripts.
I searched for this info, but it seems topics have to be created before migration.
Kafka Connect doesn't really control this.
There's a setting in Kafka that enables/disables automatic topic creation.
If this is turned on, Kafka Connect will create its own topics; if not, you have to create them yourself.
By default, Kafka will not create a new topic when a consumer subscribes to a non-existing topic. You should enable auto.create.topics.enable=true in your Kafka server configuration file, which enables auto-creation of topics on the server.
Once you turn on this feature, Kafka will automatically create topics on the fly: when an application tries to connect to a non-existing topic, Kafka will create that topic automatically.
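For instance, in the broker's server.properties (a sketch; this is a broker-side setting, not a Connect worker setting):
auto.create.topics.enable=true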

Kafka Connect: Connectors Disappear when Worker shuts down [duplicate]

I am facing the issue below after changing some Kafka-related properties and restarting the cluster.
In the Kafka consumer, there were 5 consumer jobs running.
If we make some important property change, then on restarting the cluster some/all of the existing consumer jobs are not able to start.
Ideally all the consumer jobs should start,
since they take their metadata from the system topics below.
config.storage.topic
offset.storage.topic
status.storage.topic
First, a bit of background. Kafka stores all of its data in topics, but those topics (or rather the partitions that make up a topic) are append-only logs that would grow forever unless something is done. To prevent this, Kafka has the ability to clean up topics in two ways: retention and compaction. Topics configured to use retention will retain data for a configurable length of time: the broker is free to remove any log messages that are older than this. Topics configured to use compaction require every message have a key, and the broker will always retain the last known message for every distinct key. Compaction is extremely handy when each message (i.e., key/value pair) represents the last known state for the key; since consumers are reading the topic to get the last known state for each key, they will eventually get to that last state a bit faster if older states are removed.
Which cleanup policy a broker will use for a topic depends on several things. Every topic created implicitly or explicitly will use retention by default, though you can change this in a couple of ways:
change the global log.cleanup.policy broker setting, affecting only topics created after that point; or
specify the cleanup.policy topic-specific setting when you create or modify a topic
Now, Kafka Connect uses several internal topics to store connector configurations, offsets, and status information. These internal topics must be compacted topics so that (at least) the last configuration, offset, and status for each connector are always available. Since Kafka Connect never uses older configurations, offsets, and status, it's actually a good thing for the broker to remove them from the internal topics.
Before Kafka 0.11.0.0, the recommended process was to manually create these internal topics using the correct topic-specific settings. You could rely upon the broker to auto-create them, but that is problematic for several reasons, not the least of which is that the three internal topics should have different numbers of partitions.
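For example, the config storage topic might be created like this (a sketch; the topic name, partition count, and replication factor are illustrative, and older versions of the tool take --zookeeper instead of --bootstrap-server):
kafka-topics.sh --zookeeper localhost:2181 --create \
  --topic connect-configs \
  --partitions 1 --replication-factor 3 \
  --config cleanup.policy=compact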
If these internal topics are not compacted, the configurations, offsets, and status info will be cleaned up and removed after the retention period has elapsed. By default this retention period is 24 hours! That means that if you restart Kafka Connect more than 24 hours after deploying / updating a connector configuration, that connector's configuration may have been purged and it will appear as if the connector configuration never existed.
So, if you didn't create these internal topics correctly, simply use the topic admin tool to update the topic's settings as described in the documentation.
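Something like the following should turn on compaction for an existing topic (a sketch; substitute your topic name, and use --bootstrap-server instead of --zookeeper on newer Kafka versions):
kafka-configs.sh --zookeeper localhost:2181 --alter \
  --entity-type topics --entity-name connect-configs \
  --add-config cleanup.policy=compact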
BTW, not properly creating these internal topics is a very common problem, so much so that Kafka Connect 0.11.0.0 will be able to automatically create these internal topics using the correct settings without relying upon broker auto-creation of topics.
In 0.11.0 you will still have to rely upon manual creation or broker auto-creation for topics that source connectors write to. This is not ideal, and so there's a proposal to change Kafka Connect to automatically create the topics for the source connectors while giving the source connectors control over the settings. Hopefully that improvement makes it into 0.11.1.0 so that Kafka Connect is even easier to use.

How to enable Kafka sink connector to insert data from topics to tables as and when sink is up

I have developed a kafka-sink-connector (using confluent-oss-3.2.0-2.11 and the Connect framework) for my data store (Amppol ADS), which stores data from Kafka topics into corresponding tables in my store.
Everything works as expected as long as the Kafka servers and ADS servers are up and running.
I need help/suggestions about a specific use case where events are being ingested into Kafka topics while the underlying sink component (ADS) is down.
The expectation here is that whenever the sink servers come back up, records that were ingested earlier into the Kafka topics should be inserted into the tables.
Kindly advise how to handle such a case.
Is there any support available in the Connect framework for this? Or at least some references would be a great help.
Sink connector offsets are maintained in the __consumer_offsets topic on Kafka against your connector name, and when the sink connector restarts it will pick up messages from the Kafka server starting from the previous offset it had stored in __consumer_offsets.
So you don't have to worry about managing offsets; it's all done by the workers in the Connect framework. In your scenario, just restart your sink connector. If the messages were pushed to Kafka by your source connector and are still available in Kafka, the sink connector can be started/restarted at any time.
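For instance, a restart through the Connect REST API might look like this (a sketch assuming the worker listens on localhost:8083 and a hypothetical connector named my-sink-connector):
curl -X POST http://localhost:8083/connectors/my-sink-connector/restart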

Kafka-connect sink task ignores file offset storage property

I'm experiencing quite weird behavior working with the Confluent JDBC connector. I'm pretty sure it's not related to the Confluent stack, but to the Kafka Connect framework itself.
So, I set the offset.storage.file.filename property to the default /tmp/connect.offsets and run my sink connector. Obviously, I expect the connector to persist offsets in the given file (it doesn't exist on the file system, but it should be created automatically, right?). The documentation says:
offset.storage.file.filename
The file to store connector offsets in. By storing offsets on disk, a standalone process can be stopped and started on a single node and resume where it previously left off.
But Kafka behaves in completely different manner.
It checks if the given file exists.
If it doesn't, Kafka just ignores it and persists offsets in a Kafka topic.
If I create the given file manually, reading fails anyway (EOFException) and offsets are persisted in the topic again.
Is it a bug or, more likely, do I not understand how to work with these configurations? I understand the difference between the two approaches to persisting offsets, and file storage is more convenient for my needs.
The offset.storage.file.filename is only used in source connectors, in standalone mode. It is used to place a bookmark on the input data source and remember where it stopped reading it. The created file contains something like the file line number (for a file source) or a table row number (for jdbc source or databases in general).
When running Kafka Connect in distributed mode, this file is replaced by a Kafka topic named by default connect-offsets which should be replicated in order to tolerate failures.
As far as sink connectors are concerned, no matter which plugin or mode (standalone/distributed) is used, they all store where they last stopped reading their input topics in an internal topic named __consumer_offsets, like any Kafka consumer. This allows you to use traditional tools like the kafka-consumer-groups.sh command-line tool to see how much the sink connector is lagging.
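For example (a sketch; assumes a broker at localhost:9092 and a hypothetical connector named my-sink-connector, whose consumer group is connect-my-sink-connector):
kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --describe --group connect-my-sink-connector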
The Confluent Kafka Replicator, despite being a source connector, is probably an exception because it reads from a remote Kafka cluster and may use a Kafka consumer, but only one cluster will maintain those original consumer group offsets.
I agree that the documentation is not clear: this setting is required whatever the connector type is (source or sink), but it is only used by source connectors. The reason behind this design decision is that a single Kafka Connect worker (meaning a single JVM process) can run multiple connectors, potentially both source and sink connectors. Put differently, this is a worker-level setting, not a connector setting.
The property offset.storage.file.filename only applies to workers of source connectors running in standalone mode. If you are seeing Kafka persist offsets in a Kafka topic for a source, you are running in distributed mode. You should be launching your connector with the provided script connect-standalone. There's a description of the different modes here. Instructions on running in the different modes are here.
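For instance, a standalone launch might look like this (a sketch; both property file names are placeholders, with offset.storage.file.filename set in the worker properties file):
connect-standalone worker.properties my-source-connector.properties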