Kafka Connect: can multiple standalone connectors write to the same HDFS directory? - apache-kafka

For our pipeline, we have about 40 topics (10-25 partitions each) that we want to write into the same HDFS directory using HDFS 3 Sink Connectors in standalone mode (distributed doesn't work for our current setup). We have tried running all the topics on one connector but encounter problems recovering offsets if it needs to be restarted.
If we divide the topics among different standalone connectors, can they all write into the same HDFS directory? Since the connectors then organize all files in HDFS by topic, I don't think this should be an issue but I'm wondering if anyone has experience with this setup.
Basic example:
Connector-1 config
name=connect-1
connector.class=io.confluent.connect.hdfs3.Hdfs3SinkConnector
topics=topic1
hdfs.url=hdfs://kafkaOutput
Connector-2 config
name=connect-2
connector.class=io.confluent.connect.hdfs3.Hdfs3SinkConnector
topics=topic2
hdfs.url=hdfs://kafkaOutput

distributed doesn't work for our current setup
You should be able to run connect-distributed on the exact same nodes where connect-standalone runs.
We have tried running all the topics on one connector but encounter problems recovering offsets if it needs to be restarted
Yeah, I would suggest not bundling all topics into one connector.
If we divide the topics among different standalone connectors, can they all write into the same HDFS directory?
That is my personal recommendation, and yes, they can, because the HDFS path is named after the topic and further split by the partitioning scheme.
Note: The same applies to the other storage connectors (S3 and GCS).
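As a rough illustration of dividing the topics, here is a minimal Python sketch that splits a topic list into groups and renders one standalone connector config per group. The helper names, the group size of 5 and the placeholder topic names are assumptions made for the example; only the four property keys come from the question.

```python
# Hypothetical helper: split topics into groups and render one standalone
# connector config per group. Only the property keys come from the question.

def chunk(topics, size):
    """Split a list of topics into consecutive groups of at most `size`."""
    return [topics[i:i + size] for i in range(0, len(topics), size)]


def render_connector_config(index, topics, hdfs_url):
    """Return the text of a .properties file for one connector."""
    lines = [
        f"name=connect-{index}",  # each connector needs its own unique name
        "connector.class=io.confluent.connect.hdfs3.Hdfs3SinkConnector",
        "topics=" + ",".join(topics),
        f"hdfs.url={hdfs_url}",
    ]
    return "\n".join(lines) + "\n"


def build_configs(topics, per_connector, hdfs_url):
    """Render one config per group of topics, numbering connectors from 1."""
    groups = chunk(topics, per_connector)
    return [
        render_connector_config(i, group, hdfs_url)
        for i, group in enumerate(groups, start=1)
    ]


if __name__ == "__main__":
    all_topics = [f"topic{n}" for n in range(1, 41)]  # 40 placeholder names
    for text in build_configs(all_topics, 5, "hdfs://kafkaOutput"):
        print(text)
```

Each topic ends up in exactly one connector, so no two connectors write files for the same topic directory. If several standalone workers share a host, each one would presumably also need its own offsets file and REST port in its worker properties.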

Related

Where do user-supplied Kafka connectors live?

We've got a managed Kafka setup (Confluent platform, Kafka connect 5.5.1), streaming data from ~40 topics across 8 to 10 connectors. A few weeks ago I noticed that for some of those topics, we don't have any consumers assigned. The consumers which should be reading from or writing to those topics are ones that our org has written and have not changed in months.
Looking through our connector hosts (AWS EC2 instances) I actually cannot see where our connector JAR files exist - which surprises me a lot. We've got all the other connectors there, and when I used confluent hub to install the BigQuery connector that got put under /usr/share/java as one would expect.
Where should home-grown connectors live on the filesystem?
For the record, when I query :8083 using the appropriate calls I can see the connector and it does have an allegedly-running task.
They are picked up from the Java CLASSPATH and plugin.path.
As for where they should live: anywhere that the user account running the Connect process has permission to read those files.

Kafka file stream connect and stream API

I am working on the file stream connector. I have more than ten million records spread across files (not a single file; they are partitioned by account number). I have to load these files into a topic and update my streams. I have gone through standalone streams and have the following questions.
Looking at the data set, I have two account numbers, each with 5 rows. I need to group them into two rows keyed by acctNbr.
How do I write my source connector to read the file and apply the grouping logic?
My brokers run on Linux machines X, Y, Z. After developing the source connector, should my JAR file be deployed on every broker (if I run in distributed mode)?
I have only a 30-minute window to load the dropped files into the topic. What parameters can I tune to meet that window? FYI, this topic would have more than 50 partitions on a 3-broker setup.
Data set:
{"acctNbr":"1234567","secNbr":"AAPL","date":"2010-01-01","currentPrice":"10","availQnty":"10"}
{"acctNbr":"1234567","secNbr":"AAPL","date":"2010-01-02","currentPrice":"10","availQnty":"10"}
{"acctNbr":"1234567","secNbr":"AAPL","date":"2010-01-03","currentPrice":"10","availQnty":"10"}
{"acctNbr":"1234567","secNbr":"AAPL","date":"2010-01-04","currentPrice":"10","availQnty":"10"}
{"acctNbr":"1234567","secNbr":"AAPL","date":"2010-01-05","currentPrice":"10","availQnty":"10"}
{"acctNbr":"abc3355","secNbr":"AAPL","date":"2010-01-01","currentPrice":"10","availQnty":"10"}
{"acctNbr":"abc3355","secNbr":"AAPL","date":"2010-01-02","currentPrice":"10","availQnty":"10"}
{"acctNbr":"abc3355","secNbr":"AAPL","date":"2010-01-03","currentPrice":"10","availQnty":"10"}
{"acctNbr":"abc3355","secNbr":"AAPL","date":"2010-01-04","currentPrice":"10","availQnty":"10"}
{"acctNbr":"abc3355","secNbr":"AAPL","date":"2010-01-05","currentPrice":"10","availQnty":"10"}
how to write my source connector to read the file and get the grouping logic
The FileStream connector cannot do this. It was only intended as an example for writing your own connectors. In other words, do not use it in production.
That being said, you can use alternative solutions like Flume, Filebeat, Fluentd, NiFi, Streamsets, etc, etc, to glob your filepaths, then send all records line-by-line into a Kafka topic.
post-development of source connector, my jar file should it deploy in every broker
You should not run Connect on any broker. The Connect servers are called workers.
have only 30 mins window to extract file drop to the topic?
It is not clear where this number came from. Any of the methods listed above watches for new files continuously, without any defined window.
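For the grouping itself, here is a minimal Python sketch of the logic, independent of any Connect or Streams API. It assumes one JSON object per line, as in the data set shown in the question; the function name and the shortened sample are made up for illustration.

```python
import json
from collections import defaultdict


def group_by_account(lines):
    """Group JSON-lines records by their acctNbr field.

    Returns a dict mapping each account number to the list of its records,
    in the order they were read.
    """
    grouped = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)
        grouped[record["acctNbr"]].append(record)
    return dict(grouped)


# Shortened sample in the same shape as the data set in the question.
SAMPLE = [
    '{"acctNbr":"1234567","secNbr":"AAPL","date":"2010-01-01","currentPrice":"10","availQnty":"10"}',
    '{"acctNbr":"1234567","secNbr":"AAPL","date":"2010-01-02","currentPrice":"10","availQnty":"10"}',
    '{"acctNbr":"abc3355","secNbr":"AAPL","date":"2010-01-01","currentPrice":"10","availQnty":"10"}',
]

if __name__ == "__main__":
    for account, records in group_by_account(SAMPLE).items():
        print(account, len(records))
```

An alternative is to send each line to the topic unchanged with acctNbr as the record key and do the aggregation downstream in a stream processor, which keeps the ingestion step simple.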

Is a Kafka topic linked with ZooKeeper, and will the topic disappear if ZooKeeper is changed?

I was working with Kafka. I downloaded the zookeeper, extracted and started it.
Then I downloaded Kafka, extracted the zipped file and started Kafka. Everything was working good. I created few topics and I was able to send and receive messages. After that I stopped Kafka and Zookeeper. Then I read that Kafka itself provides Zookeeper. So I started Zookeeper that was provided with Kafka. However the data directory for it was different, and then I started Kafka from same configuration file and same data directory location. However after starting Kafka I could not find the topics that I had created.
I just want to know: does this mean the metadata about the topics is maintained by ZooKeeper? I searched the Kafka documentation but could not find anything detailed.
https://kafka.apache.org/documentation/
Check this documentation provided by confluent. According to this Apache Kafka® uses ZooKeeper to store persistent cluster metadata and is a critical component of the Confluent Platform deployment. For example, if you lost the Kafka data in ZooKeeper, the mapping of replicas to Brokers and topic configurations would be lost as well, making your Kafka cluster no longer functional and potentially resulting in total data loss.
So, the answer to your question is yes: the purpose of ZooKeeper is to store relevant metadata about the Kafka brokers, topics, etc.
Also, since you have just started working with Kafka and ZooKeeper, I would like to mention this: by default, Kafka stores its data in a temp location, which may get deleted on system reboot, so you should change that as well.
The answer to the question in your title is yes.
1) Initially you started a standalone ZooKeeper from its own zip file and created topics. The metadata for those topics was stored in that ZooKeeper instance, in its data directory.
2) The second time, you started the ZooKeeper that ships with Kafka, which uses a different data directory. This new ZooKeeper instance has no information about the topics you created previously, so you need to create them again.
3) In case 1, if you start the standalone ZooKeeper again with its original data directory, you should not need to create the topics again, provided that directory has not been deleted in the meantime.
Put simply: you created two separate ZooKeeper instances, and topics are not shared between them.

Kafka-connect sink task ignores file offset storage property

I'm experiencing quite weird behavior working with Confluent JDBC connector. I'm pretty sure that it's not related to Confluent stack, but to Kafka-connect framework itself.
So, I define offset.storage.file.filename property as default /tmp/connect.offsets and run my sink connector. Obviously, I expect connector to persist offsets in the given file (it doesn't exist on file system, but it should be automatically created, right?). Documentation says:
offset.storage.file.filename
The file to store connector offsets in. By storing offsets on disk, a standalone process can be stopped and started on a single node and resume where it previously left off.
But Kafka behaves in completely different manner.
It checks if the given file exists.
If it does not, Kafka just ignores it and persists offsets in a Kafka topic.
If I create given file manually, reading fails anyway (EOFException) and offsets are being persisted in topic again.
Is it a bug or, more likely, do I not understand how to work with this configuration? I understand the difference between the two approaches to persisting offsets, and file storage is more convenient for my needs.
The offset.storage.file.filename is only used in source connectors, in standalone mode. It is used to place a bookmark on the input data source and remember where it stopped reading it. The created file contains something like the file line number (for a file source) or a table row number (for jdbc source or databases in general).
When running Kafka Connect in distributed mode, this file is replaced by a Kafka topic named by default connect-offsets which should be replicated in order to tolerate failures.
As far as sink connectors are concerned, no matter which plugin or mode (standalone/distributed) is used, they all store the position where they last stopped reading their input topic in an internal topic named __consumer_offsets, like any Kafka consumer. This makes it possible to use traditional tools such as the kafka-consumer-groups.sh command-line tool to see how much the sink connector is lagging.
The Confluent Kafka replicator, despite being a source connector, is probably an exception because it reads from a remote Kafka and may use a Kafka consumer, but only one cluster will maintain those original consumer group offsets.
I agree that the documentation is not clear: this setting is required whatever the connector type is (source or sink), but it is only used by source connectors. The reason behind this design decision is that a single Kafka Connect worker (I mean a single JVM process) can run multiple connectors, potentially both source and sink connectors. Said differently, this is a worker-level setting, not a connector setting.
The property offset.storage.file.filename only applies to workers of source connectors running in standalone mode. If you are seeing Kafka persist offsets in a Kafka topic for a source, you are running in distributed mode. You should be launching your connector with the provided script connect-standalone. The Kafka Connect documentation describes the different modes and gives instructions for running in each.
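To summarize the answers above, here is a small Python sketch that maps a connector type and worker mode to the place where its offsets end up. The function is purely illustrative and not part of any Kafka API. It assumes the default sink consumer group name, which is connect- followed by the connector name.

```python
def offset_location(connector_type, worker_mode, connector_name):
    """Describe where Kafka Connect keeps offsets for a connector.

    connector_type: "source" or "sink"
    worker_mode:    "standalone" or "distributed"
    """
    if worker_mode not in ("standalone", "distributed"):
        raise ValueError(f"unknown worker mode: {worker_mode}")
    if connector_type == "sink":
        # Sinks read through a regular consumer group in either mode.
        return f"__consumer_offsets, consumer group connect-{connector_name}"
    if connector_type == "source":
        if worker_mode == "standalone":
            return "local file set by offset.storage.file.filename"
        return "Kafka topic set by offset.storage.topic"
    raise ValueError(f"unknown connector type: {connector_type}")


if __name__ == "__main__":
    for ctype in ("source", "sink"):
        for mode in ("standalone", "distributed"):
            print(ctype, mode, "->", offset_location(ctype, mode, "my-jdbc"))
```

For a sink connector named my-jdbc (a placeholder name), the lag could then be inspected by describing the group connect-my-jdbc with kafka-consumer-groups.sh.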

Which directory does apache kafka store the data in broker nodes

I can see a property in config/server.properties called log.dir. Does this mean Kafka uses the same directory for storing both logs and data?
Kafka topics are "distributed and partitioned append only logs". The log.dir parameter defines where topics (i.e., data) are stored.
It is not related to application/broker logging.
The default log.dir is /tmp/kafka-logs which you may want to change in case your OS has a /tmp directory cleaner.
log.dir or log.dirs in config/server.properties specifies the directories in which the log data is kept.
The server log directory is kafka_base_dir/logs by default. You could modify it by specifying another directory for 'kafka.logs.dir' in log4j.properties.
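To check which data directories a given broker config resolves to, here is a minimal Python sketch. It assumes a simple key=value file without line continuations, and it relies on log.dirs taking precedence over log.dir, with /tmp/kafka-logs as the fallback default. The function names are made up for the example.

```python
def parse_properties(text):
    """Parse simple key=value lines, ignoring blanks and comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or line.startswith("!"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def data_dirs(props):
    """Return the broker data directories: log.dirs, else log.dir, else default."""
    value = props.get("log.dirs") or props.get("log.dir") or "/tmp/kafka-logs"
    return [d.strip() for d in value.split(",") if d.strip()]


if __name__ == "__main__":
    sample = "# broker settings\nbroker.id=0\nlog.dirs=/data/kafka-1,/data/kafka-2\n"
    print(data_dirs(parse_properties(sample)))
```

The broker's own application logs are a separate matter and are controlled through the log4j settings mentioned above.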
log.dir in server.properties is the place where the Kafka broker stores the commit logs containing your data. Typically this will be your high-speed disk mount for mission-critical use cases.
For application/broker logging you can use general log4j logging to get the event logging in your custom location. Below are the variables to do this.
-Dlog4j.configuration=file:<configuration file with log rolling, logging level etc.> & -Dkafka.logs.dir=<path to logs>
Mathias described the locations of the logs and the data well. However, the data directory is designed for internal processing by the Kafka engine; to store and manipulate the data, you could use Kafka Connect. Kafka Connect is a tool for scalably and reliably streaming data between Apache Kafka and other systems.
It makes it simple to define connectors that move large amounts of data into and out of Kafka. Kafka Connect can ingest entire databases, making the data available for stream processing, or sink the data of one or more topics to another system or database for further analysis.