Kafka Connect: Connectors Disappear when Worker shutsdown [duplicate] - apache-kafka

I am facing the below issue on changing some properties related to kafka and re-starting the cluster.
In kafka Consumer, there were 5 consumer jobs are running .
If we make some important property change , and on restarting cluster some/all the existing consumer jobs are not able to start.
Ideally all the consumer jobs should start ,
since it will take the meta-data info from the below System-topics .
config.storage.topic
offset.storage.topic
status.storage.topic

First, a bit of background. Kafka stores all of its data in topics, but those topics (or rather the partitions that make up a topic) are append-only logs that would grow forever unless something is done. To prevent this, Kafka has the ability to clean up topics in two ways: retention and compaction. Topics configured to use retention will retain data for a configurable length of time: the broker is free to remove any log messages that are older than this. Topics configured to use compaction require every message have a key, and the broker will always retain the last known message for every distinct key. Compaction is extremely handy when each message (i.e., key/value pair) represents the last known state for the key; since consumers are reading the topic to get the last known state for each key, they will eventually get to that last state a bit faster if older states are removed.
Which cleanup policy a broker will use for a topic depends on several things. Every topic created implicitly or explicitly will use retention by default, though you can change a couple of ways:
change the globally log.cleanup.policy broker setting, affecting only topics created after that point; or
specify the cleanup.policy topic-specific setting when you create or modify a topic
Now, Kafka Connect uses several internal topics to store connector configurations, offsets, and status information. These internal topics must be compacted topics so that (at least) the last configuration, offset, and status for each connector are always available. Since Kafka Connect never uses older configurations, offsets, and status, it's actually a good thing for the broker to remove them from the internal topics.
Before Kafka 0.11.0.0, the recommended process is to manually create these internal topics using the correct topic-specific settings. You could rely upon the broker to auto-create them, but that is problematic for several reasons, not the least of which is that the three internal topics should have different numbers of partitions.
If these internal topics are not compacted, the configurations, offsets, and status info will be cleaned up and removed after the retention period has elapsed. By default this retention period is 24 hours! That means that if you restart Kafka Connect more than 24 hours after deploying / updating a connector configuration, that connector's configuration may have been purged and it will appear as if the connector configuration never existed.
So, if you didn't create these internal topics correctly, simply use the topic admin tool to update the topic's settings as described in the documentation.
BTW, not properly creating these internal topics is a very common problem, so much so that Kafka Connect 0.11.0.0 will be able to automatically create these internal topics using the correct settings without relying upon broker auto-creation of topics.
In 0.11.0 you will still have to rely upon manual creation or broker auto-creation for topics that source connectors write to. This is not ideal, and so there's a proposal to change Kafka Connect to automatically create the topics for the source connectors while giving the source connectors control over the settings. Hopefully that improvement makes it into 0.11.1.0 so that Kafka Connect is even easier to use.

Related

Kafka Connect best practices for topic compaction

I am using Debezium which makes of Kafka Connect.
Kafka Connect exposes a couple of topics that need to be created:
OFFSET_STORAGE_TOPIC
This environment variable is required when running the Kafka Connect service. Set this to the name of the Kafka topic where the Kafka Connect services in the group store connector offsets. The topic should have many partitions, be highly replicated (e.g., 3x or more) and should be configured for compaction.
STATUS_STORAGE_TOPIC
This environment variable should be provided when running the Kafka Connect service. Set this to the name of the Kafka topic where the Kafka Connect services in the group store connector status. The topic can have multiple partitions, should be highly replicated (e.g., 3x or more) and should be configured for compaction.
Does anyone have any specific recommended compaction configs for these topics?
e.g.
is it enough to set just:
cleanup.policy: compact
unclean.leader.election.enable: true
or also:
min.compaction.lag.ms: 60000
segment.ms: 1800000
min.cleanable.dirty.ratio: 0.01
delete.retention.ms: 100
The defaults should be fine, and Connect will create/configure those topics on its own unless you preconfigure those topics with those settings.
These are the only cases when I can think of when to adjust the compaction settings
a connect-group lingering on the topic longer than you want it to be. For example, a source connector doesn't start immediately after a long downtime because it's processing the offsets topic
your Connect cluster doesn't accurately report its state, or the tasks do not rebalance appropriately (because the status topic is in a bad state)
The __consumer_offsets (compacted) topic is what is used for Sink connectors, and would be configured separately for all consumers, not only Connect

Apache Kafka: how to configure message buffering properly

I run a system comprising an InfluxDB, a Kafka Broker and data sources (sensors) producing time series data. The purpose of the broker is to protect the database from inbound event overload and as a format-agnostic platform for ingesting data. The data is transferred from Kafka to InfluxDB via Apache Camel routes.
I would like to use Kafka a intermediate message buffer in case a Camel route crashes or becomes unavailable - which is the most often error in the system. Up to now, I didn’t achieve to configure Kafka in a manner that inbound messages remain available for later consumption.
How do I configure it properly?
The messages will retain in Kafka topics based on its retention policies (you can choose between time or byte size limits) as described in the Topic Configurations. With
cleanup.policy=delete
Retention.ms=-1
the messages will in a Kafka topic will never be deleted.
Then your camel consumer will be able to re-read all messages (offsets) if you select a new consumer group or reset the offsets of the existing consumer group. Otherwise, your camel consumer might auto commit the messages (check corresponding consumer configuration) and it will not be possible to re-read offsets again for the same consumer group.
To limit the consumption rate of the camel consumer you may adjust configurations like maxPollRecords or fetchMaxBytes which are described in the docs.

Connect consumers jobs are getting deleted when restarting the cluster

I am facing the below issue on changing some properties related to kafka and re-starting the cluster.
In kafka Consumer, there were 5 consumer jobs are running .
If we make some important property change , and on restarting cluster some/all the existing consumer jobs are not able to start.
Ideally all the consumer jobs should start ,
since it will take the meta-data info from the below System-topics .
config.storage.topic
offset.storage.topic
status.storage.topic
First, a bit of background. Kafka stores all of its data in topics, but those topics (or rather the partitions that make up a topic) are append-only logs that would grow forever unless something is done. To prevent this, Kafka has the ability to clean up topics in two ways: retention and compaction. Topics configured to use retention will retain data for a configurable length of time: the broker is free to remove any log messages that are older than this. Topics configured to use compaction require every message have a key, and the broker will always retain the last known message for every distinct key. Compaction is extremely handy when each message (i.e., key/value pair) represents the last known state for the key; since consumers are reading the topic to get the last known state for each key, they will eventually get to that last state a bit faster if older states are removed.
Which cleanup policy a broker will use for a topic depends on several things. Every topic created implicitly or explicitly will use retention by default, though you can change a couple of ways:
change the globally log.cleanup.policy broker setting, affecting only topics created after that point; or
specify the cleanup.policy topic-specific setting when you create or modify a topic
Now, Kafka Connect uses several internal topics to store connector configurations, offsets, and status information. These internal topics must be compacted topics so that (at least) the last configuration, offset, and status for each connector are always available. Since Kafka Connect never uses older configurations, offsets, and status, it's actually a good thing for the broker to remove them from the internal topics.
Before Kafka 0.11.0.0, the recommended process is to manually create these internal topics using the correct topic-specific settings. You could rely upon the broker to auto-create them, but that is problematic for several reasons, not the least of which is that the three internal topics should have different numbers of partitions.
If these internal topics are not compacted, the configurations, offsets, and status info will be cleaned up and removed after the retention period has elapsed. By default this retention period is 24 hours! That means that if you restart Kafka Connect more than 24 hours after deploying / updating a connector configuration, that connector's configuration may have been purged and it will appear as if the connector configuration never existed.
So, if you didn't create these internal topics correctly, simply use the topic admin tool to update the topic's settings as described in the documentation.
BTW, not properly creating these internal topics is a very common problem, so much so that Kafka Connect 0.11.0.0 will be able to automatically create these internal topics using the correct settings without relying upon broker auto-creation of topics.
In 0.11.0 you will still have to rely upon manual creation or broker auto-creation for topics that source connectors write to. This is not ideal, and so there's a proposal to change Kafka Connect to automatically create the topics for the source connectors while giving the source connectors control over the settings. Hopefully that improvement makes it into 0.11.1.0 so that Kafka Connect is even easier to use.

Kafka 0.10.0.1 partition reassignment after broker failure

I'm testing kafka's partition reassignment as a precursor to launching a production system. I have several topics with 9 partitions each and a replication factor of 3. I've killed one of the brokers to simulate a failure condition and verified that some topics became under replicated (verification done via a fork of yahoo's kafka manager modified to allow adding a version 0.10.0.1 cluster).
I then started a new broker with a different id. I would now like to distribute partitions to this new broker. I attempted to use kafka manager's reassign partitions functionality however that did not work (possibly due to an improperly modified fork).
I saw that kafka comes with a bin/kafka-reassign-partitions.sh script but the docs say that I have to manually write out the partition reassignments for each topic in json format. Is there a way to handle this without manually deciding on which brokers partitions must go?
Hmm what a coincidence that I was doing exactly the same thing today. I don't have an answer you're probably going to like but I achieved what I wanted in the end.
Ultimately, what I did was executed the kafka-reassign-partitions command with what the same tool proposed for a reassignment. But whatever it generated I just replaced the new broker id with the old failed broker id. For some reason the generated json moved everything around.
This will fail (or rather never complete) because the old broker has passed on. I then had to delete the reassignment operation in zookeeper (znode: admin/reassign_partitions or something).
Then I restarted kafka on the new broker and it magically picked up as leader of the partition that was looking for a new replacement leader.
I'll let you know if everything is still working tomorrow and if I still have a job ;-)

Increase number of partitions in a Kafka topic from a Kafka client

I'm a new user of Apache Kafka and I'm still getting to know the internals.
In my use case, I need to increase the number of partitions of a topic dynamically from the Kafka Producer client.
I found other similar questions regarding increasing the partition size, but they utilize the zookeeper configuration. But my kafkaProducer has only the Kafka broker config, but not the zookeeper config.
Is there any way I can increase the number of partitions of a topic from the Producer side? I'm running Kafka version 0.10.0.0.
As of Kafka 0.10.0.1 (latest release): As Manav said it is not possible to increase the number of partitions from the Producer client.
Looking ahead (next releases): In an upcoming version of Kafka, clients will be able to perform some topic management actions, as outlined in KIP-4. A lot of the KIP-4 functionality is already completed and available in Kafka's trunk; the code in trunk as of today allows client to create and to delete topics. But unfortunately, for your use case, increasing the number of partitions is still not possible yet -- this is in scope for KIP-4 (see Alter Topics Request) but is not completed yet.
TL;DR: The next versions of Kafka will allow you to increase the number of partitions of a Kafka topic, but this functionality is not yet available.
It is not possible to increase the number of partitions from the Producer client.
Any specific use case use why you cannot use the broker to achieve this ?
But my kafkaProducer has only the Kafka broker config, but not the
zookeeper config.
I don't think any client will let you change the broker config. You can only access (read) the server side config at max.
Your producer can provide different keys for ProducerRecord's. The broker would place them in different partitions. For example, if you need two partitions, use keys "abc" and "xyz".
This can be done in version 0.9 as well.