Kafka Connect best practices for topic compaction - apache-kafka

I am using Debezium which makes of Kafka Connect.
Kafka Connect exposes a couple of topics that need to be created:
OFFSET_STORAGE_TOPIC
This environment variable is required when running the Kafka Connect service. Set this to the name of the Kafka topic where the Kafka Connect services in the group store connector offsets. The topic should have many partitions, be highly replicated (e.g., 3x or more) and should be configured for compaction.
STATUS_STORAGE_TOPIC
This environment variable should be provided when running the Kafka Connect service. Set this to the name of the Kafka topic where the Kafka Connect services in the group store connector status. The topic can have multiple partitions, should be highly replicated (e.g., 3x or more) and should be configured for compaction.
Does anyone have any specific recommended compaction configs for these topics?
e.g.
is it enough to set just:
cleanup.policy: compact
unclean.leader.election.enable: true
or also:
min.compaction.lag.ms: 60000
segment.ms: 1800000
min.cleanable.dirty.ratio: 0.01
delete.retention.ms: 100

The defaults should be fine, and Connect will create/configure those topics on its own unless you preconfigure those topics with those settings.
These are the only cases when I can think of when to adjust the compaction settings
a connect-group lingering on the topic longer than you want it to be. For example, a source connector doesn't start immediately after a long downtime because it's processing the offsets topic
your Connect cluster doesn't accurately report its state, or the tasks do not rebalance appropriately (because the status topic is in a bad state)
The __consumer_offsets (compacted) topic is what is used for Sink connectors, and would be configured separately for all consumers, not only Connect

Related

Does a kafka consumer machine need to run zookeeper?

So my question is this: If i have a server running Kafka (And zookeeper), and another machine only consuming messages, does the consumer machine need to run zookeeper too? Or does the server take care of all?
No.
Role of Zookeeper in Kafka is:
Broker registration: (cluster membership) with heartbeats mechanism to keep the list current
Storing topic configuration: which topics exist, how many partitions each
has, where are the replicas, who is the preferred leader, list of ISR for
partitions
Electing controller: The controller is one of the brokers and is responsible for maintaining the leader/follower relationship for all the partitions.
So Zookeeper is required only for kafka broker. There is no need to have Zookeper on the producer or consumer side.
The consumer does not need zookeeper
You have not mentioned which version of Kafka or the clients you're using.
Kafka consumers using 0.8 store their offsets in Zookeeper, so it is required for them. However, no, you would not run Zookeeper and consumers on the same server
From 0.9 and later, clients are separate from needing it (unless you want to manage external connections to Zookeeper on your own for storing data)

Kafka - consumers / producers works with all Zookeper instances down

I've configured a cluster of Kafka brokers and a cluster of Zk instances using kafka_2.11-1.1.0 distribution archive.
For Kafka brokers I've configured config/server.properties
broker.id=1,2,3
zookeeper.connect=box1:2181,box2:2181,box3:2181
For Zk instances I've configured config/zookeeper.properties:
server.1=box1:2888:3888
server.2=box3:2888:3888
server.3=box3:2888:3888
I've created a basic producer and a basic consumer and I don't know why I am able to write messages / read messages even if I shut down all the Zookeeper
instances and have all the Kafka brokers up and running.
Even booting up new consumers, producers works without any issue.
I thought having a quorum of Zk instances is a vital point for a Kafka cluster.
For both consumer and producer, I've used following configuration:
bootrapServers=box1:9092,box2:9092,box3:9092
Thanks
I thought having a quorum of Zk instances is a vital point for a Kafka cluster.
Zookeeper quorum is vital for managing partition lists, leaders, etc. In general, ZK is necessary for management that is done by the cluster coordinator in the cluster.
Basically, right now (with ZK down), you cannot modify topics (as the partition metadata is stored in ZK), start up / shut down brokers (as they use ZK for discovery) and other similar operations.
Even booting up new consumers, producers works without any issue.
Producer/consumer operations reach out to brokers only. The broker instance can still append to the log, and can still communicate with other brokers to have replication. So it is possible to send a message, get it received by broker and saved to disk, with other brokers replicating (as they are continuously sending fetch requests to the leader (and they know who this partition's leader is because they saved that data when ZK was still running)).

Does Kafka broker store metadata?

Does Kafka broker store metadata which producer API uses (e.g. which partitions are leader for a topic etc.)? As per my understanding this metadata is stored in Zookeeper , is it correct? If it is true then how Brokers are updated by Zookeeper with latest information?
All Kafka brokers can answer a metadata request that describes the current state of the cluster: what topics there are, which partitions those topics have, which broker is the leader for those partitions etc.
ZooKeeper is responsible for:
Electing a controller broker - and making sure there is only one
Cluster membership - allowing brokers to join a cluster
Topic configuration - which topics exist, how many partitions each has, where are the replicas, who is the preferred leader, what configuration overrides are set for each topic
Quotas - how much data is each client allowed to read and write
ACLs - who is allowed to read and write to which topic
There is regular communication between Kafka and ZooKeeper such that ZooKeeper knows a Kafka broker is still alive (ZooKeeper heartbeat mechanism) and also in response to events such as a topic being created or a replica falling out of sync for a topic-partition.
Kafka is a distributed system and is built to use Zookeeper which is responsible for controller election, topic configuration, clustering etc.
More precisely, Zookeeper initiates controller election. The controller broker is a single broker in the Kafka cluster which takes care of leader broker and followers for every partition. When a particular broker is taken down, the controller lets other replicas know (in order to handle partition leaders etc). Moreover, when the controller fails then Zookeeper initiates new elections in order to elect the new broker which will act as the controller.
Furthermore, Zookeeper knows which brokers are part of the Kafka cluster and which are still alive. Similarly, it is also aware of topic-specific information such as which topics exist, how many partitions each has, where are the replicas and so on.
Zookeeper also stores information regarding quotas and ACLs, i.e. what volume of data each client is allowed to consume/produce and also, who is allowed to consume or produce from a particular topic.

Why is my kafka topic not consumable with a broker down?

My issue is that I have a three broker Kafka Cluster and an availability requirement to have access to consume and produce to a topic when one or two of my three brokers is down.
I also have a reliability requirement to have a replication factor of 3. These seem to be conflicting requirements to me. Here is how my problem manifests:
I create a new topic with replication factor 3
I send several messages to that topic
I kill one of my brokers to simulate a broker issue
I attempt to consume the topic I created
My consumer hangs
I review my logs and see the error:
Number of alive brokers '2' does not meet the required replication factor '3' for the offsets topic
If I set all my broker's offsets.topic.replication.factor setting to 1, then I'm able to produce and consume my topics, even if I set the topic level replication factor to 3.
Is this an okay configuration? Or can you see any pitfalls in setting things up this way?
You only need as many brokers as your replication factor when creating the topic.
I'm guessing in your case, you start with a fresh cluster and no consumers have connected yet. In this case, the __consumer_offsets internal topic does not exist as it is only created when it's first needed. So first connect a consumer for a moment and then kill one of the brokers.
Apart from that, in order to consume you only need 1 broker up, the leader for the partition.

Kafka Connect: Connectors Disappear when Worker shutsdown [duplicate]

I am facing the below issue on changing some properties related to kafka and re-starting the cluster.
In kafka Consumer, there were 5 consumer jobs are running .
If we make some important property change , and on restarting cluster some/all the existing consumer jobs are not able to start.
Ideally all the consumer jobs should start ,
since it will take the meta-data info from the below System-topics .
config.storage.topic
offset.storage.topic
status.storage.topic
First, a bit of background. Kafka stores all of its data in topics, but those topics (or rather the partitions that make up a topic) are append-only logs that would grow forever unless something is done. To prevent this, Kafka has the ability to clean up topics in two ways: retention and compaction. Topics configured to use retention will retain data for a configurable length of time: the broker is free to remove any log messages that are older than this. Topics configured to use compaction require every message have a key, and the broker will always retain the last known message for every distinct key. Compaction is extremely handy when each message (i.e., key/value pair) represents the last known state for the key; since consumers are reading the topic to get the last known state for each key, they will eventually get to that last state a bit faster if older states are removed.
Which cleanup policy a broker will use for a topic depends on several things. Every topic created implicitly or explicitly will use retention by default, though you can change a couple of ways:
change the globally log.cleanup.policy broker setting, affecting only topics created after that point; or
specify the cleanup.policy topic-specific setting when you create or modify a topic
Now, Kafka Connect uses several internal topics to store connector configurations, offsets, and status information. These internal topics must be compacted topics so that (at least) the last configuration, offset, and status for each connector are always available. Since Kafka Connect never uses older configurations, offsets, and status, it's actually a good thing for the broker to remove them from the internal topics.
Before Kafka 0.11.0.0, the recommended process is to manually create these internal topics using the correct topic-specific settings. You could rely upon the broker to auto-create them, but that is problematic for several reasons, not the least of which is that the three internal topics should have different numbers of partitions.
If these internal topics are not compacted, the configurations, offsets, and status info will be cleaned up and removed after the retention period has elapsed. By default this retention period is 24 hours! That means that if you restart Kafka Connect more than 24 hours after deploying / updating a connector configuration, that connector's configuration may have been purged and it will appear as if the connector configuration never existed.
So, if you didn't create these internal topics correctly, simply use the topic admin tool to update the topic's settings as described in the documentation.
BTW, not properly creating these internal topics is a very common problem, so much so that Kafka Connect 0.11.0.0 will be able to automatically create these internal topics using the correct settings without relying upon broker auto-creation of topics.
In 0.11.0 you will still have to rely upon manual creation or broker auto-creation for topics that source connectors write to. This is not ideal, and so there's a proposal to change Kafka Connect to automatically create the topics for the source connectors while giving the source connectors control over the settings. Hopefully that improvement makes it into 0.11.1.0 so that Kafka Connect is even easier to use.