Nifi: Create a Kafka topic with PublishKafka with a specific number of partitions/replication-factor - apache-kafka

I am using Apache NiFi version 1.10.0. I have put some data into Kafka from NiFi using the PublishKafka_2_0 processor. I have three Kafka brokers running alongside NiFi. I am getting the data from NiFi, but the topic that is created has a replication factor of 1 and a single partition.
How can I change the default replication factor and number of partitions used when PublishKafka creates a new topic? In other words, I want the processor to create new topics with partitions=3 and replication-factor=3 instead of 1.
I understand that this can be changed after the topic is created, but I would like it to be done dynamically at creation.

If I understand your setup correctly, you are relying on the client side for topic creation, i.e. topics are created when NiFi attempts to produce/consume/fetch metadata for a non-existent topic. In this scenario, Kafka will use num.partitions and default.replication.factor settings for a new topic that are defined in broker config. (Kafka defaults to 1 for both.) Currently, updating these values in server.properties is the only way to control auto-created topics' configuration.
KIP-487 is being worked on to allow producers to control topic creation (as opposed to a server-side, one-for-all verdict), but even in that implementation there is no plan for a client to control the number of partitions or replication factor.
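If pre-creating the topic before NiFi publishes to it is an option, you can control partitions and replication factor directly. A minimal sketch using the Kafka AdminClient (the topic name and broker addresses are placeholders, not taken from the question):

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.Collections;
import java.util.Properties;

public class PreCreateTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092,broker2:9092,broker3:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            // 3 partitions, replication factor 3, instead of the broker defaults of 1/1
            NewTopic topic = new NewTopic("nifi-output-topic", 3, (short) 3);
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}

Alternatively, raising num.partitions and default.replication.factor in every broker's server.properties changes what auto-created topics get, as described above.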

Related

how to distribute messages to all partitions in topic defined by `offset.storage.topic` in kafka connect

I have deployed Debezium using the Docker image pulled with docker pull debezium/connect.
In the documentation provided at https://hub.docker.com/r/debezium/connect, the description for one of the environment variables, OFFSET_STORAGE_TOPIC, is as follows:
This environment variable is required when running the Kafka Connect
service. Set this to the name of the Kafka topic where the Kafka
Connect services in the group store connector offsets. The topic must
have a large number of partitions (e.g., 25 or 50), be highly
replicated (e.g., 3x or more) and should be configured for compaction.
I've created the required topic named mydb-connect-offsets with 25 partitions and replication factor of 5.
The deployment is successful and everything is working fine. A sample message in mydb-connect-offsets topic looks like this. The key is ["sample-connector",{"server":"mydatabase"}] and value is
{
  "transaction_id": null,
  "lsn_proc": 211534539955768,
  "lsn_commit": 211534539955768,
  "lsn": 211534539955768,
  "txId": 709459398,
  "ts_usec": 1675076680361908
}
As the key is fixed, all the messages are going to the same partition of the topic. My question is: why does the documentation say that the topic must have a large number of partitions when only one partition is going to be used eventually? Also, what needs to be done to distribute the messages across all partitions?
The offsets are keyed by connector name because they must be ordered.
The large partition count is to manage offset storage of many distinct connectors in parallel, not only one.
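For context, the default Kafka partitioner chooses a partition by hashing the record key, so a fixed key always maps to the same partition. A sketch of that calculation (the key bytes and partition count are illustrative, not the exact serialized form Connect uses):

import org.apache.kafka.common.utils.Utils;
import java.nio.charset.StandardCharsets;

public class KeyToPartition {
    public static void main(String[] args) {
        byte[] key = "[\"sample-connector\",{\"server\":\"mydatabase\"}]".getBytes(StandardCharsets.UTF_8);
        int numPartitions = 25;
        // same formula the default partitioner applies to keyed records
        int partition = Utils.toPositive(Utils.murmur2(key)) % numPartitions;
        System.out.println("this key always maps to partition " + partition);
    }
}

So with a single connector there is only one key and one active partition; more connectors in the same Connect cluster mean more keys and therefore more partitions in use.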

Kafka Connect best practices for topic compaction

I am using Debezium, which makes use of Kafka Connect.
Kafka Connect exposes a couple of topics that need to be created:
OFFSET_STORAGE_TOPIC
This environment variable is required when running the Kafka Connect service. Set this to the name of the Kafka topic where the Kafka Connect services in the group store connector offsets. The topic should have many partitions, be highly replicated (e.g., 3x or more) and should be configured for compaction.
STATUS_STORAGE_TOPIC
This environment variable should be provided when running the Kafka Connect service. Set this to the name of the Kafka topic where the Kafka Connect services in the group store connector status. The topic can have multiple partitions, should be highly replicated (e.g., 3x or more) and should be configured for compaction.
Does anyone have any specific recommended compaction configs for these topics?
e.g.
is it enough to set just:
cleanup.policy: compact
unclean.leader.election.enable: true
or also:
min.compaction.lag.ms: 60000
segment.ms: 1800000
min.cleanable.dirty.ratio: 0.01
delete.retention.ms: 100
The defaults should be fine, and Connect will create and configure those topics on its own unless you pre-create the topics with those settings yourself.
These are the only cases I can think of where you would adjust the compaction settings:
A Connect group lingering on the topic longer than you want it to; for example, a source connector doesn't start immediately after a long downtime because it is still processing the offsets topic.
Your Connect cluster doesn't accurately report its state, or the tasks do not rebalance appropriately (because the status topic is in a bad state).
The __consumer_offsets (compacted) topic is what is used for sink connectors, and it would be configured separately for all consumers, not only Connect.
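If you do choose to pre-create these topics rather than letting Connect do it, a hedged sketch of creating the offsets topic with compaction enabled via the Kafka AdminClient (the topic name, partition count and replication factor are assumptions mirroring the Debezium guidance quoted above):

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.Collections;
import java.util.Map;
import java.util.Properties;

public class CreateConnectOffsetsTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            // compacted offsets topic: many partitions, replicated, cleanup.policy=compact
            NewTopic offsets = new NewTopic("connect-offsets", 25, (short) 3)
                    .configs(Map.of("cleanup.policy", "compact"));
            admin.createTopics(Collections.singletonList(offsets)).all().get();
        }
    }
}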

Apache Kafka: how to configure message buffering properly

I run a system comprising an InfluxDB instance, a Kafka broker, and data sources (sensors) producing time-series data. The purpose of the broker is to protect the database from inbound event overload and to serve as a format-agnostic platform for ingesting data. The data is transferred from Kafka to InfluxDB via Apache Camel routes.
I would like to use Kafka as an intermediate message buffer in case a Camel route crashes or becomes unavailable, which is the most frequent error in the system. So far I have not managed to configure Kafka in a way that keeps inbound messages available for later consumption.
How do I configure it properly?
Messages are retained in Kafka topics based on the topics' retention policies (you can choose between time and byte-size limits), as described in the Topic Configurations. With
cleanup.policy=delete
retention.ms=-1
the messages in a Kafka topic will never be deleted.
Then your Camel consumer will be able to re-read all messages (offsets) if you select a new consumer group or reset the offsets of the existing consumer group. Otherwise, your Camel consumer might auto-commit the messages (check the corresponding consumer configuration) and it will not be possible to re-read offsets again for the same consumer group.
To limit the consumption rate of the Camel consumer you may adjust configurations like maxPollRecords or fetchMaxBytes, which are described in the docs.
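To make the consumer side concrete, those Camel options map onto underlying Kafka consumer properties. A hedged sketch of a configuration that re-reads retained messages from the beginning and throttles each poll (broker address, group id and limits are illustrative):

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;
import java.util.Properties;

public class BufferedConsumerConfig {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "influxdb-writer");    // a new group id starts from the reset point
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");  // read retained messages from the start
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");    // commit only after a successful write to InfluxDB
        props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, "100");        // limits records returned per poll (maxPollRecords)
        props.put(ConsumerConfig.FETCH_MAX_BYTES_CONFIG, "1048576");     // caps bytes fetched per request (fetchMaxBytes)
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // subscribe and poll as usual; unprocessed messages stay in the topic until retention expires
        }
    }
}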

How exactly Apache Nifi ConsumeKafka_1_0 processor works

I have a NiFi cluster, and Kafka is also installed there.
I created one topic with 5 partitions and started consuming that topic with one group ID, so that each partition gets unique messages.
I then created 5 ConsumeKafka_1_0 processors with the intent of getting unique messages on each consumer side. But only 2 of the ConsumeKafka_1_0 processors are consuming all the messages; the rest are sitting idle.
Next I started 5 command-line Kafka consumers, and I could see that all the partitions were getting messages and I was able to consume them from the command-line consumers in round-robin fashion.
Also, I described the Kafka consumer group, and what I saw was that only 2 of the NiFi ConsumeKafka_1_0 processors are consuming all 5 partitions and the rest are idle; see the snapshot.
Would you please let me know what I am doing wrong here with the NiFi consumer processor?
Note - I am using NiFi version 1.5 and Kafka version 1.0.
I've written this article which explains how the integration with Kafka works:
https://bryanbende.com/development/2016/09/15/apache-nifi-and-apache-kafka
The Apache Kafka client (used by NiFi) is what assigns partitions to the consumers.
Typically if you had a 5 node NiFi cluster, with 1 ConsumeKafka processor on the canvas with 1 concurrent task, then each node would be consuming 1 partition.
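To see the same assignment behaviour outside NiFi, here is a hedged sketch with a plain Kafka consumer: every consumer created with the same group.id receives a subset of the topic's partitions, and assignment() shows which ones after the first poll (broker address, group id and topic name are placeholders):

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class ShowAssignment {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-group");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic"));
            consumer.poll(Duration.ofSeconds(5));        // joining the group triggers partition assignment
            System.out.println(consumer.assignment());   // the partitions this group member was given
        }
    }
}

Running several copies of this with the same group.id (analogous to concurrent tasks on a ConsumeKafka processor) spreads the 5 partitions across the running instances.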

Increase number of partitions in a Kafka topic from a Kafka client

I'm a new user of Apache Kafka and I'm still getting to know the internals.
In my use case, I need to increase the number of partitions of a topic dynamically from the Kafka Producer client.
I found other similar questions regarding increasing the number of partitions, but they utilize the ZooKeeper configuration. But my kafkaProducer has only the Kafka broker config, not the ZooKeeper config.
Is there any way I can increase the number of partitions of a topic from the Producer side? I'm running Kafka version 0.10.0.0.
As of Kafka 0.10.0.1 (the latest release): as Manav said, it is not possible to increase the number of partitions from the Producer client.
Looking ahead (next releases): in an upcoming version of Kafka, clients will be able to perform some topic management actions, as outlined in KIP-4. A lot of the KIP-4 functionality is already completed and available in Kafka's trunk; the code in trunk as of today allows clients to create and delete topics. Unfortunately, for your use case, increasing the number of partitions is not possible yet -- it is in scope for KIP-4 (see Alter Topics Request) but is not completed.
TL;DR: The next versions of Kafka will allow you to increase the number of partitions of a Kafka topic, but this functionality is not yet available.
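For readers on newer Kafka versions: the client-side topic management described above did eventually ship, and the AdminClient can now grow a topic's partition count. A minimal sketch (topic name and broker address are placeholders):

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewPartitions;

import java.util.Collections;
import java.util.Properties;

public class GrowTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            // raise the total partition count of an existing topic to 6
            admin.createPartitions(Collections.singletonMap("my-topic", NewPartitions.increaseTo(6)))
                 .all().get();
        }
    }
}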
It is not possible to increase the number of partitions from the Producer client.
Is there any specific use case why you cannot use the broker to achieve this?
But my kafkaProducer has only the Kafka broker config, not the ZooKeeper config.
I don't think any client will let you change the broker config. At most, you can read the server-side config.
Your producer can provide different keys for its ProducerRecords; the partitioner will then spread them across the topic's existing partitions (different keys may still hash to the same partition). For example, to spread records across two partitions you could use keys such as "abc" and "xyz".
This can be done in version 0.9 as well.
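A small sketch of that keyed producing (topic, broker address and keys are illustrative); note that the partitioner spreads different keys over the topic's existing partitions rather than creating new ones:

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;
import java.util.Properties;

public class KeyedProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("my-topic", "abc", "value-1"));  // key "abc"
            producer.send(new ProducerRecord<>("my-topic", "xyz", "value-2"));  // key "xyz" may land on another partition
        }
    }
}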