How to distribute messages to all partitions of the topic defined by `offset.storage.topic` in Kafka Connect

I have deployed Debezium using the Docker image pulled with docker pull debezium/connect.
In the documentation provided at https://hub.docker.com/r/debezium/connect, the description of the OFFSET_STORAGE_TOPIC environment variable is as follows:
This environment variable is required when running the Kafka Connect
service. Set this to the name of the Kafka topic where the Kafka
Connect services in the group store connector offsets. The topic must
have a large number of partitions (e.g., 25 or 50), be highly
replicated (e.g., 3x or more) and should be configured for compaction.
I've created the required topic named mydb-connect-offsets with 25 partitions and a replication factor of 5.
The deployment is successful and everything is working fine. A sample message in the mydb-connect-offsets topic looks like this. The key is ["sample-connector",{"server":"mydatabase"}] and the value is
{
  "transaction_id": null,
  "lsn_proc": 211534539955768,
  "lsn_commit": 211534539955768,
  "lsn": 211534539955768,
  "txId": 709459398,
  "ts_usec": 1675076680361908
}
As the key is fixed, all the messages are going to the same partition of the topic. My question is: why does the documentation say that the topic must have a large number of partitions when only one partition is going to be used eventually? Also, what needs to be done to distribute the messages across all partitions?

The offsets are keyed by connector name because the offsets of a given connector must stay ordered, so they all go to a single partition.
The large partition count is there to handle offset storage for many distinct connectors in parallel, not just one.
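To see why a single connector pins everything to one partition, you can reproduce the key hashing that the producer's default partitioner applies to keyed records. A minimal sketch (kafka-clients on the classpath; only sample-connector comes from the question, the other connector names are made up for illustration):

import java.nio.charset.StandardCharsets;
import org.apache.kafka.common.utils.Utils;

public class OffsetKeyToPartition {
    public static void main(String[] args) {
        int numPartitions = 25; // partitions of mydb-connect-offsets
        // Offset keys look like ["<connector-name>",{<source partition>}]
        String[] keys = {
            "[\"sample-connector\",{\"server\":\"mydatabase\"}]",
            "[\"orders-connector\",{\"server\":\"ordersdb\"}]",      // hypothetical
            "[\"inventory-connector\",{\"server\":\"inventorydb\"}]" // hypothetical
        };
        for (String key : keys) {
            byte[] keyBytes = key.getBytes(StandardCharsets.UTF_8);
            // Same hash-and-modulo the default partitioner uses for keyed records
            int partition = Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
            System.out.println(key + " -> partition " + partition);
        }
    }
}

Each connector's key is constant, so its offsets always land on the same partition; only a fleet of connectors spreads the load across the 25 partitions.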

Related

Kafka Connect best practices for topic compaction

I am using Debezium, which makes use of Kafka Connect.
Kafka Connect requires a couple of topics to be created:
OFFSET_STORAGE_TOPIC
This environment variable is required when running the Kafka Connect service. Set this to the name of the Kafka topic where the Kafka Connect services in the group store connector offsets. The topic should have many partitions, be highly replicated (e.g., 3x or more) and should be configured for compaction.
STATUS_STORAGE_TOPIC
This environment variable should be provided when running the Kafka Connect service. Set this to the name of the Kafka topic where the Kafka Connect services in the group store connector status. The topic can have multiple partitions, should be highly replicated (e.g., 3x or more) and should be configured for compaction.
Does anyone have any specific recommended compaction configs for these topics?
e.g.
is it enough to set just:
cleanup.policy: compact
unclean.leader.election.enable: true
or also:
min.compaction.lag.ms: 60000
segment.ms: 1800000
min.cleanable.dirty.ratio: 0.01
delete.retention.ms: 100
The defaults should be fine, and Connect will create and configure those topics on its own unless you pre-create them yourself with specific settings.
These are the only cases I can think of where you'd adjust the compaction settings:
a Connect group's data lingering in the topic longer than you want it to. For example, a source connector doesn't start immediately after a long downtime because it is still reading through the offsets topic
your Connect cluster doesn't accurately report its state, or the tasks do not rebalance appropriately (because the status topic is in a bad state)
The __consumer_offsets topic (also compacted) is what is used for sink connectors, and it would be configured separately for all consumers, not only Connect.
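If you do decide to pre-create the topics yourself rather than letting Connect do it, a minimal sketch with the Java AdminClient, assuming a broker at localhost:9092 and leaving everything except cleanup.policy at broker defaults (the topic name is just an example):

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;

public class PreCreateConnectTopics {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address
        try (AdminClient admin = AdminClient.create(props)) {
            // Compacted offsets topic; 25 partitions / RF 3 follow the Debezium docs' suggestion
            NewTopic offsets = new NewTopic("connect-offsets", 25, (short) 3)
                    .configs(Collections.singletonMap(
                            TopicConfig.CLEANUP_POLICY_CONFIG,
                            TopicConfig.CLEANUP_POLICY_COMPACT));
            admin.createTopics(Collections.singletonList(offsets)).all().get();
        }
    }
}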

Kafka Cluster - issue with one broker not being utilized

I have a Kafka cluster with 3 brokers and 3 ZooKeeper nodes running. We recently added a 4th broker. When we brought it up as a new cluster, a few partitions were stored on the 4th broker as expected. The replication factor for all topics is 3, and each topic has 10 partitions.
Later, whenever we bring down the whole Kafka cluster for maintenance and bring it back up, all topic partitions end up on the first 3 brokers and no partitions are stored on the 4th broker. (Note: due to a bug, we had to use a new log directory every time Kafka was brought up, pretty much like a new cluster.)
I can see that all 4 brokers are available in ZooKeeper (when I do ls /brokers/ids I can see 4 broker ids), but partitions are not distributed to the 4th broker.
But when I trigger a partition reassignment to move a few partitions to the 4th broker, it works fine and the 4th broker starts storing the given partitions. Both producer and consumer are able to send and fetch data from the 4th broker. I can't find the reason why this storage imbalance is happening among the Kafka brokers. Please share your suggestions.
When we brought it up as a new cluster, a few partitions were stored on the 4th broker as expected.
This should only be expected when you create new topics or expand partitions of existing ones. Topics do not automatically relocate to new brokers.
had to use a new log directory every time Kafka was brought up
That might explain why data is missing. It's unclear what bug you're running into, but this step shouldn't be necessary.
when I trigger a partition reassignment to move a few partitions to the 4th broker, it works fine and the 4th broker starts storing the given partitions. Both producer and consumer are able to send and fetch data from the 4th broker
This is the correct way to expand a cluster, and it sounds like it's working as expected.
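The question predates it, but on Kafka 2.4+ the same reassignment can also be driven from the Java AdminClient instead of the reassignment tool. A rough sketch, with a hypothetical topic name and broker ids:

import java.util.Arrays;
import java.util.Collections;
import java.util.Optional;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewPartitionReassignment;
import org.apache.kafka.common.TopicPartition;

public class MovePartitionToNewBroker {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address
        try (AdminClient admin = AdminClient.create(props)) {
            // Put the replicas of my-topic partition 0 on brokers 2, 3 and 4
            // (topic name and broker ids are placeholders)
            admin.alterPartitionReassignments(Collections.singletonMap(
                    new TopicPartition("my-topic", 0),
                    Optional.of(new NewPartitionReassignment(Arrays.asList(2, 3, 4)))
            )).all().get();
        }
    }
}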

Nifi: Create a Kafka topic with PublishKafka with a specific number of partitions/replication-factor

I am using Apache NiFi version 1.10.0. I have put some data into Kafka from NiFi using the PublishKafka_2_0 processor. I have three Kafka brokers running. The data is arriving from NiFi, but the topic that gets created has a replication factor of 1 and 1 partition.
How can I change the default replication factor and partition count used when PublishKafka creates a new topic? In other words, I want the processor to create new topics with partitions=3 and replication-factor=3 instead of 1.
I understand that this can be changed after the topic is created but I would like it to be done dynamically at creation.
If I understand your setup correctly, you are relying on the client side for topic creation, i.e. topics are created when NiFi attempts to produce/consume/fetch metadata for a non-existent topic. In this scenario, Kafka will use num.partitions and default.replication.factor settings for a new topic that are defined in broker config. (Kafka defaults to 1 for both.) Currently, updating these values in server.properties is the only way to control auto-created topics' configuration.
KIP-487 is being worked on to allow producers to control topic creation (as opposed to it being a server-side, one-size-fits-all decision), but even in that implementation there is no plan for a client to control the number of partitions or the replication factor.
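For example, if every auto-created topic on the broker should end up with 3 partitions and 3 replicas, the broker's server.properties would need something like the following (note these defaults apply to all auto-created topics on that broker, not only the ones NiFi produces to):

num.partitions=3
default.replication.factor=3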

Kafka - is it possible to alter cluster size while keeping the change transparent to Producers and Consumers?

I am investigating Kafka to assess its suitability for our use case. Can you please help me understand how flexible Kafka is with altering the size of an existing cluster, specifically adding brokers to an existing cluster without tearing it down? Is there anything to be taken care of when doing this?
Adding servers to a Kafka cluster is easy, just assign them a unique
broker id and start up Kafka on your new servers. However these new
servers will not automatically be assigned any data partitions, so
unless partitions are moved to them they won't be doing any work until
new topics are created. So usually when you add machines to your
cluster you will want to migrate some existing data to these machines.
Refer here
Kafka supports:
Expanding your cluster
Automatically migrating data to new machines
Custom partition assignment and migration
Yes, you can increase the number of partitions of a topic using command line or AdminClient without restarting the cluster processes.
Example:
bin/kafka-topics.sh --zookeeper zk_host:port --alter --topic testtopic1 --partitions 20
Please be aware that adding partitions doesn't re-partition existing data; records that were already written stay in their original partitions.
Kafka only allows you to increase the number of partitions; you can't decrease it. If you have to reduce the number of partitions, you need to delete and recreate the topic.
As for your question about how producers/consumers behave with newly added partitions:
Kafka has a property metadata.max.age.ms for producers and consumers which defaults to 300000.
metadata.max.age.ms : The period of time in milliseconds after which we force a refresh of metadata even if we haven't seen any partition leadership changes to proactively discover any new brokers or partitions.
After that interval, the metadata is refreshed and any newly added partitions will be detected by producers/consumers.
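If waiting up to five minutes is too long, the refresh interval can be lowered on the client. A minimal producer sketch (the broker address and the 60-second value are just examples):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

public class FastMetadataRefreshProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address
        // Refresh metadata every 60s instead of the 300000 ms (5 min) default,
        // so newly added partitions are picked up sooner
        props.put(ProducerConfig.METADATA_MAX_AGE_CONFIG, "60000");
        try (KafkaProducer<String, String> producer =
                     new KafkaProducer<>(props, new StringSerializer(), new StringSerializer())) {
            // ... send records as usual
        }
    }
}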

Increase number of partitions in a Kafka topic from a Kafka client

I'm a new user of Apache Kafka and I'm still getting to know the internals.
In my use case, I need to increase the number of partitions of a topic dynamically from the Kafka Producer client.
I found other similar questions regarding increasing the number of partitions, but they use the ZooKeeper configuration. My KafkaProducer only has the Kafka broker config, not the ZooKeeper config.
Is there any way I can increase the number of partitions of a topic from the Producer side? I'm running Kafka version 0.10.0.0.
As of Kafka 0.10.0.1 (the latest release): as Manav said, it is not possible to increase the number of partitions from the Producer client.
Looking ahead (next releases): in an upcoming version of Kafka, clients will be able to perform some topic management actions, as outlined in KIP-4. A lot of the KIP-4 functionality is already completed and available in Kafka's trunk; the code in trunk as of today allows clients to create and delete topics. But unfortunately, for your use case, increasing the number of partitions is still not possible yet -- this is in scope for KIP-4 (see Alter Topics Request) but is not completed yet.
TL;DR: The next versions of Kafka will allow you to increase the number of partitions of a Kafka topic, but this functionality is not yet available.
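For anyone reading this on a later release: this did eventually land, and with kafka-clients 1.0+ the AdminClient can grow a topic directly. A minimal sketch, assuming a broker at localhost:9092 and a placeholder topic name:

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewPartitions;

public class GrowTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address
        try (AdminClient admin = AdminClient.create(props)) {
            // Increase my-topic to 20 partitions in total; existing records are not moved
            admin.createPartitions(
                    Collections.singletonMap("my-topic", NewPartitions.increaseTo(20))
            ).all().get();
        }
    }
}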
It is not possible to increase the number of partitions from the Producer client.
Is there a specific use case for why you cannot use the broker to achieve this?
My KafkaProducer only has the Kafka broker config, not the ZooKeeper config.
I don't think any client will let you change the broker config. At most, you can read the server-side config.
Your producer can provide different keys for its ProducerRecords; the partitioner will then hash them and place the records in different partitions. For example, if you want to use two partitions, use keys "abc" and "xyz". Note that this only spreads records across the partitions the topic already has; it does not add partitions.
This can be done in version 0.9 as well.
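A minimal sketch of that (the broker address and topic name are placeholders):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class KeyedProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address
        try (KafkaProducer<String, String> producer =
                     new KafkaProducer<>(props, new StringSerializer(), new StringSerializer())) {
            // The default partitioner hashes the key, so "abc" and "xyz" will usually
            // land on different (existing) partitions of the topic
            producer.send(new ProducerRecord<>("my-topic", "abc", "first record"));
            producer.send(new ProducerRecord<>("my-topic", "xyz", "second record"));
        }
    }
}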