Set offset.storage.topic in connect-distributed.properties to use a different server - apache-kafka

I'm consuming from a Kafka topic using Kafka Connect to load data into Splunk. I do not want the offsets to be stored on the same bootstrap server. I know we can set the offset.storage.topic property, but is there something like an offset.storage.server property?
Below are a few properties from my connect-distributed.properties config.
bootstrap.servers=[server]
group.id=connect-cluster
plugin.path=/opt/kafka_2.12-2.2.0/plugins
offset.storage.topic=connect-offsets
offset.storage.replication.factor=3

No, I don't believe this is possible.
One option, if you really don't want to write offsets back to the cluster, is to run Kafka Connect in standalone mode and have it use a local file for offsets instead. If you do that, you can only run a single instance of the worker, so you don't get any of the scalability and fault-tolerance benefits that distributed mode provides.
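For illustration, a minimal standalone worker config might look like the following. This is only a sketch; the server address, converters, paths, and flush interval are placeholders you would adapt:
bootstrap.servers=[server]
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
# offsets go to a local file instead of a Kafka topic
offset.storage.file.filename=/tmp/connect.offsets
offset.flush.interval.ms=10000
plugin.path=/opt/kafka_2.12-2.2.0/plugins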

Related

Kafka sink from multiple independent brokers

I want to aggregate changes from multiple databases into one, so I thought I would run a Debezium connector and a Kafka server/broker next to each database, and use a Kafka sink connector to consume from all those Kafka clusters to write into one database.
The question is: can I use a single instance of a Kafka sink connector to consume, at the same time, from multiple Kafka brokers that are independent (not part of a cluster)?
Running a Kafka broker next to each database sounds very complicated, and a single Kafka Connect worker that connects to different Kafka clusters does not seem to be supported, as far as I can see.
If you go down this path, it may make more sense to use something like Kafka MirrorMaker to copy your local topics to a single main Kafka cluster, and then use a Kafka Connect Sink to read all the copied topics from one worker and write to a central DB.
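As a rough sketch of that MirrorMaker approach (this assumes MirrorMaker 2, which ships with Kafka 2.4+; the cluster names and addresses below are placeholders), an mm2.properties could look like:
clusters = regionA, central
regionA.bootstrap.servers = regionA-kafka:9092
central.bootstrap.servers = central-kafka:9092
regionA->central.enabled = true
regionA->central.topics = .*
You would run one such process per source region (e.g. with connect-mirror-maker.sh mm2.properties) and point your sink connector only at the central cluster.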
Ultimately, running a broker next to each source database is pretty complicated. From what you described, it sounds like you have some connectivity between your different databases, but it is limited and possibly prone to disconnects. Have you considered alternative designs:
DB Replication: Use your DB vendor's native async replication to simply copy the data to a single target DB. The remote region is always read-only, and replication should not slow down your source DB (this depends on the DB, of course). Async DB replication can usually handle some network disconnections and latency.
Local Debezium: Run a process with Debezium next to each DB, and save all events to a file. Copy the files to some central server or to a cloud storage service like S3. Finally, import these files into a central DB. This would basically skip Kafka completely.
You can point the Connect worker property files at whatever bootstrap.servers you want.
That bootstrap.servers list itself, however, must refer to a single "cluster" (even if it is just one broker), which is determined by the brokers' zookeeper.connect property.

Kafka 2.0 - Kafka Connect Sink - Creating a Kafka Producer

We are currently on HDF (Hortonworks DataFlow) 3.3.1, which bundles Kafka 2.0.0, and are trying to use Kafka Connect in distributed mode to launch a Google Cloud Pub/Sub sink connector.
We are planning on sending back some metadata into a Kafka Topic and need to integrate a Kafka producer into the flush() function of the Sink task java code.
Would this have a negative impact on the process where Kafka Connect commits the offsets back to Kafka (as we would be adding the overhead of running a Kafka producer before the flush)?
Also, how does Kafka Connect get the bootstrap servers list from the configuration when it is not specified in the connector properties for either the sink or the source? I need to use the same bootstrap server list to start the producer.
Currently I am changing the config for the sink connector, adding the bootstrap server list as a property and parsing it in the Java code for the connector. I would like to use the bootstrap server list from the Kafka Connect worker properties if that is possible.
Kindly help on this.
Thanks in advance.
need to integrate a Kafka producer into the flush() function of the Sink task java code
There is no producer instance exposed in the SinkTask API...
Would this have a negative impact on the process where Kafka Connect commits back the offsets to Kafka (as we would be adding a overhead of running a Kafka producer before the flush).
I mean, you can add whatever code you want. As far as negative impacts go, that's up to you to benchmark on your own infrastructure. Obviously, adding more blocking code makes the other processes slower overall.
how does Kafka Connect get the Bootstrap servers list from the configuration when it is not specified in the Connector Properties for either the sink or the source?
Sinks and sources are not workers; the bootstrap servers come from the worker config. Look at connect-distributed.properties.
I would like to use bootstrap server list from the Kafka Connect worker properties if that is possible
It's not possible. Adding extra properties to the sink/source configs is the only way. (Feel free to open a Kafka JIRA requesting such a feature of exposing the worker configs, though.)
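A rough sketch of that workaround is below. The property name metadata.bootstrap.servers, the metadata topic name, and the class name are made up for illustration, and this is not an official way to reach the worker config; it simply reads a custom connector property and builds a plain producer from it:
import java.util.Collection;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringSerializer;
import org.apache.kafka.connect.sink.SinkRecord;
import org.apache.kafka.connect.sink.SinkTask;

public class MetadataEmittingSinkTask extends SinkTask {
    private KafkaProducer<String, String> metadataProducer;

    @Override
    public void start(Map<String, String> props) {
        // "metadata.bootstrap.servers" is a custom property you add to the
        // connector config yourself; Connect does not expose the worker's
        // own bootstrap.servers to tasks.
        Properties p = new Properties();
        p.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, props.get("metadata.bootstrap.servers"));
        p.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        p.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        metadataProducer = new KafkaProducer<>(p);
    }

    @Override
    public void put(Collection<SinkRecord> records) {
        // deliver records to the external system (e.g. Cloud Pub/Sub) here
    }

    @Override
    public void flush(Map<TopicPartition, OffsetAndMetadata> currentOffsets) {
        // runs on the offset-commit path, so keep this lightweight
        metadataProducer.send(new ProducerRecord<>("connector-metadata",
                "partitions-flushed", String.valueOf(currentOffsets.size())));
        metadataProducer.flush();
    }

    @Override
    public void stop() {
        if (metadataProducer != null) {
            metadataProducer.close();
        }
    }

    @Override
    public String version() {
        return "1.0";
    }
}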

How to override the Kafka Topic configurations in MongoDB Source Connector?

I am using MongoDB Source Connector to get the data from a MongoDB collection into Kafka. What this connector does is that it automatically creates a topic using the following naming convention:
[prefix_provided_in_the_connector_properties].[db_name].[collection_name]
In the MongoDB Source Connector's documentation, there is no mention of overriding the topic configuration such as number of partitions or replication factor. I have the following questions:
Is it possible to override the topic configs in the connector.properties file?
If not, is it then done on Kafka's end? If so, can we configure each topic's settings individually, or will it globally affect all topics?
Thank you!
Sounds like you have auto.create.topics.enable=true on your brokers. It is recommended to disable this and enforce manual topic creation.
Connect only creates internal topics for itself. Source connectors should ideally have their topics created ahead of time; otherwise, you get the defaults set in the broker's server.properties. Changing those values will not change existing topics.
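For example, assuming the connector will write to a topic named myprefix.mydb.mycollection (the prefix, database, and collection names are placeholders), the topic could be pre-created with the standard CLI. Kafka 2.2+ accepts --bootstrap-server; older versions use --zookeeper instead:
kafka-topics.sh --bootstrap-server [server]:9092 --create --topic myprefix.mydb.mycollection --partitions 6 --replication-factor 3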

Kafka-connect sink task ignores file offset storage property

I'm experiencing quite weird behavior working with the Confluent JDBC connector. I'm pretty sure it's not related to the Confluent stack, but to the Kafka Connect framework itself.
So, I set the offset.storage.file.filename property to the default /tmp/connect.offsets and run my sink connector. Obviously, I expect the connector to persist offsets in the given file (it doesn't exist on the file system, but it should be created automatically, right?). The documentation says:
offset.storage.file.filename
The file to store connector offsets in. By storing offsets on disk, a standalone process can be stopped and started on a single node and resume where it previously left off.
But Kafka Connect behaves in a completely different manner.
It checks whether the given file exists.
If it doesn't, Kafka Connect just ignores it and persists offsets in a Kafka topic.
If I create the file manually, reading fails anyway (EOFException) and offsets are persisted in the topic again.
Is this a bug or, more likely, am I misunderstanding how to work with this configuration? I understand the difference between the two approaches to persisting offsets, and file storage is more convenient for my needs.
The offset.storage.file.filename property is only used by source connectors, in standalone mode. It is used to place a bookmark on the input data source and remember where the connector stopped reading it. The created file contains something like the file line number (for a file source) or a table row number (for the JDBC source, or databases in general).
When running Kafka Connect in distributed mode, this file is replaced by a Kafka topic, named connect-offsets by default, which should be replicated in order to tolerate failures.
As far as sink connectors are concerned, no matter which plugin or mode (standalone/distributed) is used, they all store where they last stopped reading their input topics in an internal topic named __consumer_offsets, like any other Kafka consumer. This allows traditional tools like the kafka-consumer-groups.sh command-line tool to be used to see how much the sink connector is lagging.
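For example, a sink connector's consumer group is named connect-<connector name> by default, so for a hypothetical connector called my-jdbc-sink the lag could be checked with:
kafka-consumer-groups.sh --bootstrap-server [server]:9092 --describe --group connect-my-jdbc-sink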
The Confluent Kafka replicator, despite being a source connector, is probably an exception because it reads from a remote Kafka and may use a Kafka consumer, but only one cluster will maintain those original consumer group offsets.
I agree that the documentation is not clear: this setting is required whatever the connector type is (source or sink), but it is only used by source connectors. The reason behind this design decision is that a single Kafka Connect worker (meaning a single JVM process) can run multiple connectors, potentially both source and sink connectors. Said differently, this is a worker-level setting, not a connector setting.
The offset.storage.file.filename property only applies to workers of source connectors running in standalone mode. If you are seeing Kafka persist offsets in a Kafka topic for a source, you are running in distributed mode. You should be launching your connector with the provided connect-standalone script; the Kafka Connect documentation describes the different modes and gives instructions on running in each of them.
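For example, a standalone worker is typically launched like this (the file names are placeholders):
bin/connect-standalone.sh config/connect-standalone.properties my-connector.properties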

How do we increase the maximum fetch size for KafkaSpout

I am using KafkaSpout and would like to increase the maximum fetch size. Can you please share your thoughts?
You can adjust the message fetch size in one of two ways: either by changing the Kafka spout's configuration programmatically, or by altering the configuration of the Kafka server directly.
To change the spout's configuration, the Kafka spout documentation tells us to set the spout config's fetchSizeBytes variable. Assuming this is being done in Java:
spoutConfig.fetchSizeBytes = maximumFetchSize;
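For context, a sketch using the storm-kafka SpoutConfig API (the package is org.apache.storm.kafka in Storm 1.x; the ZooKeeper address, topic, ZK root, consumer id, and fetch size below are placeholders) might look like:
import org.apache.storm.kafka.BrokerHosts;
import org.apache.storm.kafka.KafkaSpout;
import org.apache.storm.kafka.SpoutConfig;
import org.apache.storm.kafka.ZkHosts;

BrokerHosts hosts = new ZkHosts("zookeeper-host:2181");
SpoutConfig spoutConfig = new SpoutConfig(hosts, "my-topic", "/kafka-spout", "my-consumer-id");
// raise the per-fetch size to 2 MB; it should not exceed the broker's message.max.bytes
spoutConfig.fetchSizeBytes = 2 * 1024 * 1024;
KafkaSpout kafkaSpout = new KafkaSpout(spoutConfig);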
For the other option, i.e. adjusting the Kafka configuration, first find Kafka's server.properties, then change that configuration, and finally restart the Kafka server.
The Kafka configuration will most likely be in the config directory under Kafka's home directory. Adjust the appropriate configuration value by adding something similar to the following line to server.properties:
message.max.bytes=some_new_integer_value
Then restart the Kafka process. Best practice is to repeat this change on all brokers in the cluster.