Kafka Connect configuration and the "consumer." prefix - apache-kafka

I was hoping to get some clarification on the kafka connect configuration properties here https://docs.confluent.io/current/connect/userguide.html
We were having issues connecting our Confluent Connect cluster to our Kafka Connect instance. As far as I could tell, all our settings were configured correctly, but we didn't have any luck.
After extensive googling, we discovered that prefixing the configuration properties with "consumer." seems to fix the issue. There is a mention of that prefix here https://docs.confluent.io/current/connect/userguide.html#overriding-producer-and-consumer-settings
I am having a hard time wrapping my head around the prefix and how the properties are picked up by Connect and used. It was my assumption that the Java API client used by Kafka Connect will pick up the connection properties from the properties file. It might have some hard-coded configuration properties that can be overridden by specifying the values in the properties file. Is this not correct? The doc linked above mentions
All new producer configs and new consumer configs can be overridden by prefixing them with producer. or consumer.
What are the new configs? The link on that page just takes me to the list of all the configs. The doc mentions
Occasionally, you may have an application that needs to adjust the default settings. One example is a standalone process that runs a log file connector
as the use case for the prefix override, but this is a Connect cluster, so how does that use case apply? I appreciate your time if you have read this far.

The word "new" is probably misleading. Apache Kafka is currently at version 2.3, and back in 0.8 and 0.9 a "new" producer and consumer API was added. These are now just the standard producer and consumer, but the "new" label has hung around.
In terms of overriding configuration, it is as you say; you can prefix any of the standard consumer/producer configs in the Kafka Connect worker with consumer. (for a sink) or producer. (for a source).
Note that as of Apache Kafka 2.3 you can also override these per connector, as detailed in this post : https://www.confluent.io/blog/kafka-connect-improvements-in-apache-kafka-2-3

This post is old, but I'll answer it for people who face the same difficulty:
By "new configs" they mean any custom consumer or producer configs.
There are two levels:
Worker side: the worker has a consumer to read the configs, status and offsets of each connector, and a producer to write status and offsets (not to be confused with the __consumer_offsets topic; the offsets topic is only for source connectors). To override those configs:
consumer.* (example: consumer.max.poll.records=10)
producer.* (example: producer.batch.size=10000)
Connector side: a connector inherits the worker config by default. To override consumer/producer configs for a single connector, use:
consumer.override.* (example: consumer.override.max.poll.records=100)
producer.override.* (example: producer.override.batch.size=20000)
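
To make the two levels concrete, here is a minimal Python sketch of how prefixed properties could be layered into the final consumer settings. It only illustrates the precedence described above and is not the code Kafka Connect itself runs; the helper names strip_prefix and effective_consumer_config, the broker address and the connector name are all hypothetical.

```python
def strip_prefix(props, prefix):
    """Return the entries of props whose key starts with prefix, prefix removed."""
    return {k[len(prefix):]: v for k, v in props.items() if k.startswith(prefix)}


def effective_consumer_config(worker_props, connector_props):
    """Layer connector-level overrides on top of worker-level consumer settings.

    Hypothetical helper that mimics the precedence described in the answer;
    it is a simplified model, not Kafka Connect's implementation.
    """
    # Assumption: only bootstrap.servers is carried over from the worker
    # in this simplified model.
    config = {"bootstrap.servers": worker_props["bootstrap.servers"]}
    # Worker-level "consumer." entries act as defaults for every connector.
    config.update(strip_prefix(worker_props, "consumer."))
    # Connector-level "consumer.override." entries take precedence.
    config.update(strip_prefix(connector_props, "consumer.override."))
    return config


# Placeholder worker and connector properties, reusing the examples above.
worker = {
    "bootstrap.servers": "broker1:9092",
    "consumer.max.poll.records": "10",
    "producer.batch.size": "10000",
}
connector = {
    "name": "example-sink",
    "consumer.override.max.poll.records": "100",
}

print(effective_consumer_config(worker, connector))
```

Note that per-connector overrides are only honored when the worker's connector.client.config.override.policy setting permits them.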

Related

How to pass Apache Kafka Mirrormaker2 config for the producer

I am currently testing MirrorMaker to replicate data between two clusters. Unfortunately, it seems the producer config is not picked up by the individual producers as documented in https://github.com/apache/kafka/blob/trunk/connect/mirror/README.md.
My configuration file simplified:
clusters=INPUT,BACKUP
INPUT.consumer.compression.type=lz4
BACKUP.producer.compression.type=lz4
INPUT->BACKUP.enabled = true
INPUT->BACKUP.topics=mytopic.*
...
Then the log output when running mirrormaker2 (connect-mirror-maker.sh mirrormaker.properties) does not show this option:
INFO ProducerConfig values:
...
compression.type = none
...
The Kafka version in use is 2.7.1.
How can I pass the settings correctly, so the producer is correctly compressing? I also need to pass a few other settings, but once this works it should do for the other settings too.
Two potential solutions:
Enable connector.client.config.override.policy in mm2 workers' property file. You need to follow https://docs.confluent.io/platform/current/connect/references/allconfigs.html#override-the-worker-configuration closely.
Launch a Kafka Connect cluster and create MirrorSourceConnector and MirrorCheckpointConnector one by one with producer configs overridden. You will still need to refer to the official Confluent documentation above. I picked this approach and it works.
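
As a rough illustration of the second approach, the following Python sketch assembles the JSON body one might POST to the Connect REST API (the /connectors endpoint) to create a MirrorSourceConnector with a producer override. The function name mirror_source_payload, the connector name and the broker addresses are placeholders invented for this example; the cluster aliases and topic pattern are taken from the question. It assumes the worker sets connector.client.config.override.policy to a value that allows the override.

```python
import json


def mirror_source_payload(name, source_alias, target_alias,
                          source_bootstrap, target_bootstrap,
                          topics, producer_overrides):
    """Build the request body for creating a MirrorSourceConnector.

    Hypothetical helper: each entry in producer_overrides is emitted with
    the "producer.override." prefix so it applies to this connector only.
    """
    config = {
        "connector.class": "org.apache.kafka.connect.mirror.MirrorSourceConnector",
        "source.cluster.alias": source_alias,
        "target.cluster.alias": target_alias,
        "source.cluster.bootstrap.servers": source_bootstrap,
        "target.cluster.bootstrap.servers": target_bootstrap,
        "topics": topics,
    }
    for key, value in producer_overrides.items():
        config["producer.override." + key] = value
    return {"name": name, "config": config}


# Placeholder hosts and connector name; replace with real values.
payload = mirror_source_payload(
    name="input-to-backup-source",
    source_alias="INPUT",
    target_alias="BACKUP",
    source_bootstrap="input-broker:9092",
    target_bootstrap="backup-broker:9092",
    topics="mytopic.*",
    producer_overrides={"compression.type": "lz4"},
)

print(json.dumps(payload, indent=2))
```

A MirrorCheckpointConnector would be created the same way with its own connector.class value.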

Disable mirrormaker2 offset-sync topics on source kafka cluster

We're using MirrorMaker2 to replicate some topics from one kerberized kafka cluster to another kafka cluster (strictly unidirectional). We don't control the source kafka cluster and we're given only access to describe and read specific topics that are to be consumed.
MirrorMaker2 creates and maintains a topic (mm2-offset-syncs) in the source cluster to encode cluster-to-cluster offset mappings for each topic-partition being replicated and also creates an AdminClient in the source cluster to handle ACL/Config propagation. Because MM2 needs authorization to create and write to these topics in the source cluster, or to perform operations through AdminClient, I'm trying to understand why/if we need these mechanisms in our scenario.
My question is:
In a strictly unidirectional scenario, what is the usefulness of this source-cluster offset-sync topic to Mirrormaker?
If indeed it's superfluous, is it possible to disable it or operate mm2 without access to create/produce to this topic?
If ACL and Config propagation is disabled, is it safe to assume that the AdminClient is not used for anything else?
In the MirrorMaker code, the offset-sync topic is created by MirrorSourceConnector when it starts and then maintained by the MirrorSourceTask. The same happens with the AdminClient in the MirrorSourceConnector.
I have found no way to toggle off these features but honestly I might be missing something in my line of thought.
Kafka 3.0 introduced an option that stops MM2 from creating the mm2-offset-syncs topic in the source cluster and makes it use a topic in the target cluster instead.
This comes from KIP-716: https://cwiki.apache.org/confluence/display/KAFKA/KIP-716%3A+Allow+configuring+the+location+of+the+offset-syncs+topic+with+MirrorMaker2
Pull-request:
https://issues.apache.org/jira/browse/KAFKA-12379
https://github.com/apache/kafka/pull/10221
Tim Berglund noted this KIP-716 in Kafka 3.0 release: https://www.youtube.com/watch?v=7SDwWFYnhGA&t=462s
So, to make MM2 operate on the mm2-offset-syncs topic in the target cluster, you should:
set option src->dst.offset-syncs.topic.location = target
manually create mm2-offset-syncs.dst.internal topic in the target cluster
start MM2
src and dst are example aliases; replace them with yours.
Keep in mind: if the mm2-offset-syncs.dst.internal topic is not created manually in the target cluster, MM2 still tries to create this topic in the source cluster.
In a one-directional replication setup this topic is useless, because it is empty all the time, but MM2 requires it anyway.

Kafka 2.0 - Kafka Connect Sink - Creating a Kafka Producer

We are currently on HDF (Hortonworks Dataflow) 3.3.1 which bundles Kafka 2.0.0 and are trying to use Kafka Connect in distributed mode to launch a Google Cloud PubSub Sink connector.
We are planning on sending back some metadata into a Kafka Topic and need to integrate a Kafka producer into the flush() function of the Sink task java code.
Would this have a negative impact on the process where Kafka Connect commits back the offsets to Kafka (as we would be adding the overhead of running a Kafka producer before the flush)?
Also, how does Kafka Connect get the Bootstrap servers list from the configuration when it is not specified in the Connector Properties for either the sink or the source? I need to use the same Bootstrap server list to start the producer.
Currently I am changing the config for the sink connector, adding bootstrap server list as a property and parsing it in the Java code for the connector. I would like to use bootstrap server list from the Kafka Connect worker properties if that is possible.
Kindly help on this.
Thanks in advance.
need to integrate a Kafka producer into the flush() function of the Sink task java code
There is no producer instance exposed in the SinkTask API...
Would this have a negative impact on the process where Kafka Connect commits back the offsets to Kafka (as we would be adding the overhead of running a Kafka producer before the flush)?
I mean, you can add whatever code you want. As far as negative impacts go, that's up to you to benchmark on your own infrastructure. Obviously, adding more blocking code makes the other processes slower overall.
how does Kafka Connect get the Bootstrap servers list from the configuration when it is not specified in the Connector Properties for either the sink or the source?
Sinks and sources are not workers. Look at connect-distributed.properties
I would like to use bootstrap server list from the Kafka Connect worker properties if that is possible
It's not possible. Adding extra properties to the sink/source configs is the only way. (Feel free to open a Kafka JIRA requesting a feature that exposes the worker configs, though.)

How to override the Kafka Topic configurations in MongoDB Source Connector?

I am using MongoDB Source Connector to get the data from a MongoDB collection into Kafka. What this connector does is that it automatically creates a topic using the following naming convention:
[prefix_provided_in_the_connector_properties].[db_name].[collection_name]
In the MongoDB Source Connector's documentation, there is no mention of overriding the topic configuration such as number of partitions or replication factor. I have the following questions:
Is it possible to override the topic configs in the connector.properties file?
If not, is it then done on Kafka's end? If so, can we configure each topic's settings individually, or will it globally affect all the topics?
Thank you!
Sounds like you have auto.create.topics.enable=true on your brokers. It is recommended to disable this and enforce manual topic creation.
Connect only creates internal topics for itself. Source connectors should ideally have their topics created ahead of time; otherwise, you get the defaults set in the broker's server.properties. Changing those values will not change existing topics.
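
As a small sketch of creating the topic ahead of time, the Python below derives the topic name from the naming convention quoted in the question and assembles a kafka-topics.sh command line with explicit partition and replication settings. The helper names, the prefix/database/collection values and the broker address are made up for illustration, and the command is only built, not executed. It assumes a Kafka release whose kafka-topics.sh accepts --bootstrap-server.

```python
def mongo_topic_name(prefix, database, collection):
    """Topic name following the [prefix].[db_name].[collection_name] convention."""
    return ".".join([prefix, database, collection])


def create_topic_command(topic, partitions, replication_factor, bootstrap_server):
    """Argument list for pre-creating a topic with explicit settings."""
    return [
        "kafka-topics.sh", "--create",
        "--bootstrap-server", bootstrap_server,
        "--topic", topic,
        "--partitions", str(partitions),
        "--replication-factor", str(replication_factor),
    ]


# Hypothetical values; substitute your own prefix, database and collection.
topic = mongo_topic_name("myprefix", "shop", "orders")
command = create_topic_command(topic, partitions=6, replication_factor=3,
                               bootstrap_server="broker1:9092")

print(" ".join(command))
```

Because each topic is created by its own command, the settings apply per topic and do not change the broker-wide defaults.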

Kafka-connect sink task ignores file offset storage property

I'm experiencing quite weird behavior working with Confluent JDBC connector. I'm pretty sure that it's not related to Confluent stack, but to Kafka-connect framework itself.
So, I define offset.storage.file.filename property as default /tmp/connect.offsets and run my sink connector. Obviously, I expect connector to persist offsets in the given file (it doesn't exist on file system, but it should be automatically created, right?). Documentation says:
offset.storage.file.filename
The file to store connector offsets in. By storing offsets on disk, a standalone process can be stopped and started on a single node and resume where it previously left off.
But Kafka behaves in a completely different manner.
It checks if the given file exists.
If it does not, Kafka just ignores it and persists offsets in a Kafka topic.
If I create the given file manually, reading fails anyway (EOFException) and offsets are persisted in the topic again.
Is it a bug or, more likely, do I not understand how to work with this configuration? I understand the difference between the two approaches to persisting offsets, and file storage is more convenient for my needs.
The offset.storage.file.filename is only used in source connectors, in standalone mode. It is used to place a bookmark on the input data source and remember where it stopped reading it. The created file contains something like the file line number (for a file source) or a table row number (for jdbc source or databases in general).
When running Kafka Connect in distributed mode, this file is replaced by a Kafka topic named by default connect-offsets which should be replicated in order to tolerate failures.
As far as sink connectors are concerned, no matter which plugin or mode (standalone/distributed) is used, they all store where they last stopped reading their input topic in an internal topic named __consumer_offsets, like any Kafka consumer. This makes it possible to use traditional tools such as the kafka-consumer-groups.sh command-line tool to see how much the sink connector is lagging.
The Confluent Kafka replicator, despite being a source connector, is probably an exception because it reads from a remote Kafka and may use a Kafka consumer, but only one cluster will maintain those original consumer group offsets.
I agree that the documentation is not clear: this setting is required whatever the connector type is (source or sink), but it is only used by source connectors. The reason behind this design decision is that a single Kafka Connect worker (I mean a single JVM process) can run multiple connectors, potentially both source and sink connectors. Said differently, this setting is a worker-level setting, not a connector setting.
The property offset.storage.file.filename only applies to workers of source connectors running in standalone mode. If you are seeing Kafka persist offsets in a Kafka topic for a source, you are running in distributed mode. You should be launching your connector with the provided script connect-standalone. There's a description of the different modes here. Instructions on running in the different modes are here.
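
The rules from the two answers above can be summarized in a short Python sketch. The function offset_location is a hypothetical helper written for this summary, not part of any Kafka API, and the connector names are placeholders. The consumer group name connect-<connector name> is the default group Kafka Connect uses for a sink connector, which is a detail the answers do not state.

```python
def offset_location(connector_type, mode, connector_name):
    """Describe where a connector's offsets end up, per the answers above."""
    if mode not in ("standalone", "distributed"):
        raise ValueError("unknown mode: " + mode)
    if connector_type == "sink":
        # Sink connectors behave like ordinary consumers in either mode.
        return "__consumer_offsets (consumer group connect-" + connector_name + ")"
    if connector_type == "source":
        if mode == "standalone":
            return "local file set by offset.storage.file.filename"
        return "Kafka topic set by offset.storage.topic"
    raise ValueError("unknown connector type: " + connector_type)


# Placeholder connector names used only for this demonstration.
for kind, mode in [("sink", "standalone"), ("source", "standalone"),
                   ("source", "distributed")]:
    print(kind, mode, "->", offset_location(kind, mode, "jdbc-example"))
```

For the sink connector in the question, this is why the file setting appears to be ignored even in standalone mode.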