Kafka FileStream connector and Streams API - apache-kafka

I am working on the FileStream connector. I have more than ten million records in the files (it's not a single file; it is partitioned by account number). I have to load these files into a topic and update my streams. I have gone through standalone Streams, and I have the following questions and need help achieving this.
Looking at the data set, I have two account numbers, and each account has 5 rows. I would need to group them into two rows, keyed by acctNbr.
How do I write my source connector to read the files and apply the grouping logic?
My brokers are running on Linux machines X, Y, Z. After developing the source connector, should my JAR file be deployed on every broker (if I start running in distributed mode)?
I have only a 30-minute window to extract the file drop into the topic. What parameters are there to tune so that the load fits in that window? FYI, this topic will have more than 50 partitions and a 3-broker setup.
Data set:
{"acctNbr":"1234567","secNbr":"AAPL","date":"2010-01-01","currentPrice":"10","availQnty":"10"}
{"acctNbr":"1234567","secNbr":"AAPL","date":"2010-01-02","currentPrice":"10","availQnty":"10"}
{"acctNbr":"1234567","secNbr":"AAPL","date":"2010-01-03","currentPrice":"10","availQnty":"10"}
{"acctNbr":"1234567","secNbr":"AAPL","date":"2010-01-04","currentPrice":"10","availQnty":"10"}
{"acctNbr":"1234567","secNbr":"AAPL","date":"2010-01-05","currentPrice":"10","availQnty":"10"}
{"acctNbr":"abc3355","secNbr":"AAPL","date":"2010-01-01","currentPrice":"10","availQnty":"10"}
{"acctNbr":"abc3355","secNbr":"AAPL","date":"2010-01-02","currentPrice":"10","availQnty":"10"}
{"acctNbr":"abc3355","secNbr":"AAPL","date":"2010-01-03","currentPrice":"10","availQnty":"10"}
{"acctNbr":"abc3355","secNbr":"AAPL","date":"2010-01-04","currentPrice":"10","availQnty":"10"}
{"acctNbr":"abc3355","secNbr":"AAPL","date":"2010-01-05","currentPrice":"10","availQnty":"10"}

How do I write my source connector to read the files and apply the grouping logic?
The FileStream connector cannot do this; it was never intended for such a purpose and exists only as an example for writing your own connectors. In other words, do not use it in production.
That being said, you can use alternative solutions like Flume, Filebeat, Fluentd, NiFi, StreamSets, etc., to glob your file paths and then send all records line by line into a Kafka topic.
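If the end goal is one aggregated record per account, that grouping belongs in your Kafka Streams application, not in the connector. Below is a minimal sketch, assuming the raw lines land in a topic named positions and are plain JSON strings as shown above (the topic names, broker address, and regex-based key extraction are illustrative assumptions, not something from the question):

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.Produced;

public class AcctGrouping {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "acct-grouping-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "X:9092"); // assumed broker address

        StreamsBuilder builder = new StreamsBuilder();
        builder.stream("positions", Consumed.with(Serdes.String(), Serdes.String())) // assumed input topic
               // re-key each record by the acctNbr field pulled out of the JSON value;
               // a real application would use a proper JSON serde instead of a regex
               .selectKey((key, value) -> value.replaceAll(".*\"acctNbr\":\"([^\"]+)\".*", "$1"))
               .groupByKey(Grouped.with(Serdes.String(), Serdes.String()))
               // collapse the five rows per account; this sketch simply keeps the latest row
               .reduce((oldValue, newValue) -> newValue)
               .toStream()
               .to("positions-by-account", Produced.with(Serdes.String(), Serdes.String()));

        new KafkaStreams(builder.build(), props).start();
    }
}

A real application would typically replace the reduce with an aggregate() that actually merges the rows per account rather than keeping only the latest one.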
After developing the source connector, should my JAR file be deployed on every broker?
You should not run Connect on any broker. The Connect servers are called workers, and they run separately from the brokers.
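Your connector JAR goes on the plugin.path of every Connect worker instead. As a rough sketch of a distributed worker configuration, with placeholder hosts, topic names, and paths (illustrative assumptions, not from the question):

# connect-distributed.properties (illustrative values only)
bootstrap.servers=X:9092,Y:9092,Z:9092
group.id=connect-cluster
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
config.storage.topic=connect-configs
offset.storage.topic=connect-offsets
status.storage.topic=connect-status
# directory holding your custom connector JARs, on every worker
plugin.path=/opt/connect-plugins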
I have only a 30-minute window to extract the file drop into the topic
It's not clear where this number comes from. All of the methods listed above watch for new files continuously, without any defined window.
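If throughput within that window ever does become a concern, the usual knobs are on the producer side of whichever tool you pick. For Kafka Connect, producer settings can be overridden in the worker config with a producer. prefix; the values below are illustrative starting points, not tested recommendations:

# worker config: producer overrides (illustrative values)
producer.batch.size=262144
producer.linger.ms=50
producer.compression.type=lz4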

Related

Is it possible for multiple Kafka Connect clusters to read from the same file and write to the same topic with SpoolDirCsvSourceConnector?

I'm using SpoolDirCsvSourceConnector to load CSV data into one Kafka topic. My CSV input file is around 3-4 GB, and I have only run the connector on a single machine, so throughput is low.
EDIT: I have to consume the .csv file. The provider sends me one big .csv file daily.
Would it be possible to run the connector on multiple machines against the same file to increase throughput? These machines may or may not be able to see each other, but they will all be able to connect to the same Kafka cluster. If possible, I want to avoid splitting the CSV file into parts.
Unless you split the file into parts, you can only use a single instance of the connector.
Where is the data coming from? Do you have the option of consuming it directly into Kafka instead of via a CSV file? e.g. if it's from a database have you considered reading from the database directly using Kafka Connect instead?
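For reference, a single-instance SpoolDirCsvSourceConnector configuration looks roughly like the following (paths, topic, and connector name are placeholders; check the connector's documentation for the exact property names):

name=csv-source
connector.class=com.github.jcustenborder.kafka.connect.spooldir.SpoolDirCsvSourceConnector
topic=csv-topic
input.path=/data/incoming
finished.path=/data/finished
error.path=/data/error
input.file.pattern=.*\\.csv
csv.first.row.as.header=true
schema.generation.enabled=true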

Is there a way to know how many records are written by a Kafka Connect sink?

I use the HDFS sink connector and I want to know how many records are put to HDFS.
In the logs there is a line when the connector starts to write a file to HDFS, but not how many records it contains.
For example:
INFO Opening record writer for: hdfs://hdfs/path/+tmp/table/partition=2020-02-27/19955b52-8189-4f70-94b5-46d579cd1505_tmp.avro (io.confluent.connect.hdfs.avro.AvroRecordWriterProvider)
Is it possible by extending the connector itself?
I use the Kafka Connect HDFS 2 sink.
Out of the box, not that I know of (of course, it's open source, and you could look). Each file would have a variable amount of data, so metric tracking wouldn't be all that useful.
I cannot recall whether debug or trace logs expose that information.
You can use Hive, Spark, or the HDFS CLI to inspect each file, though.
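If a one-off count per file is enough, you can also read each Avro file back and count the records programmatically. A sketch using the plain Avro and Hadoop client APIs (the HDFS path is a placeholder):

import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.mapred.FsInput;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

public class CountAvroRecords {
    public static void main(String[] args) throws Exception {
        // path to one of the files the sink wrote; placeholder value
        Path path = new Path("hdfs://hdfs/path/table/partition=2020-02-27/file.avro");
        try (DataFileReader<GenericRecord> reader =
                 new DataFileReader<>(new FsInput(path, new Configuration()),
                                      new GenericDatumReader<GenericRecord>())) {
            long count = 0;
            while (reader.hasNext()) {  // iterate once over every record in the file
                reader.next();
                count++;
            }
            System.out.println(path + ": " + count + " records");
        }
    }
}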

Kafka Connect: can multiple standalone connectors write to the same HDFS directory?

For our pipeline, we have about 40 topics (10-25 partitions each) that we want to write into the same HDFS directory using HDFS 3 Sink Connectors in standalone mode (distributed doesn't work for our current setup). We have tried running all the topics on one connector but encounter problems recovering offsets if it needs to be restarted.
If we divide the topics among different standalone connectors, can they all write into the same HDFS directory? Since the connectors then organize all files in HDFS by topic, I don't think this should be an issue but I'm wondering if anyone has experience with this setup.
Basic example:
Connector-1 config
name=connect-1
connector.class=io.confluent.connect.hdfs3.Hdfs3SinkConnector
topics=topic1
hdfs.url=hdfs://kafkaOutput
Connector-2 config
name=connect-2
connector.class=io.confluent.connect.hdfs3.Hdfs3SinkConnector
topics=topic2
hdfs.url=hdfs://kafkaOutput
distributed doesn't work for our current setup
You should be able to run connect-distributed on exactly the same nodes where connect-standalone is run.
We have tried running all the topics on one connector but encounter problems recovering offsets if it needs to be restarted
Yeah, I would suggest not bundling all topics into one connector.
If we divide the topics among different standalone connectors, can they all write into the same HDFS directory?
That is my personal recommendation, and yes, they can, because the HDFS path is named by the topic name and further split by the partitioning scheme.
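For example, with the configs above, both connectors write under the same URL but into separate per-topic subtrees, with file names that encode the topic, partition, and offset range (the layout below assumes the default topics.dir and partitioner settings):

hdfs://kafkaOutput/topics/topic1/partition=0/topic1+0+0000000000+0000000999.avro
hdfs://kafkaOutput/topics/topic2/partition=0/topic2+0+0000000000+0000000999.avro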
Note: the same applies to all other storage connectors (S3 & GCS).

Flume agent producing multiple .tmp files when data is sent in succession

I have a Flume agent running in CDH 5.8.3. It creates multiple .tmp files when writing to HDFS if more than 3 valid files are sent. There is an interceptor that routes valid XMLs to the appropriate topic before the HDFS sink. This agent is using Flafka. The interceptor and Kafka are working correctly.
agent.sinks.hdfs_valid.channel=valid_channel
agent.sinks.hdfs_valid.type=hdfs
agent.sinks.hdfs_valid.writeFormat=Text
agent.sinks.hdfs_valid.hdfs.fileType=DataStream
agent.sinks.hdfs_valid.hdfs.filePrefix=event
agent.sinks.hdfs_valid.hdfs.fileSuffix=.xml
agent.sinks.hdfs_valid.hdfs.path=locationoffile/%{time}
agent.sinks.hdfs_valid.hdfs.idleTimeout=900
agent.sinks.hdfs_valid.hdfs.rollInterval=3600
agent.sinks.hdfs_valid.hdfs.kerberosPrincipal=authentication#example.com
agent.sinks.hdfs_valid.hdfs.kerberosKeytab=locationofkeytab
agent.sinks.hdfs_valid.hdfs.rollSize=0
agent.sinks.hdfs_valid.hdfs.rollCount=0
agent.sinks.hdfs_valid.hdfs.callTimeout=100000
Okay, so, interestingly enough: our Kafka topic was set to 20 partitions. When Flume consumes from it, the first 10 partitions are consumed from one IP and it opens one .tmp file, and the second 10 partitions are consumed from another IP and it opens a second .tmp file. This appears to be an internal function of Flume. All data arrived correctly despite having two .tmp files open.

Kafka-connect sink task ignores file offset storage property

I'm experiencing quite weird behavior working with the Confluent JDBC connector. I'm pretty sure it's not related to the Confluent stack, but to the Kafka Connect framework itself.
So, I define the offset.storage.file.filename property as the default /tmp/connect.offsets and run my sink connector. Obviously, I expect the connector to persist offsets in the given file (it doesn't exist on the file system, but it should be created automatically, right?). The documentation says:
offset.storage.file.filename
The file to store connector offsets in. By storing offsets on disk, a standalone process can be stopped and started on a single node and resume where it previously left off.
But Kafka behaves in a completely different manner.
It checks whether the given file exists.
If it doesn't, Kafka just ignores it and persists offsets in a Kafka topic.
If I create the file manually, reading fails anyway (EOFException) and offsets are persisted in the topic again.
Is this a bug or, more likely, do I misunderstand how to work with this configuration? I understand the difference between the two approaches to persisting offsets, and file storage is more convenient for my needs.
The offset.storage.file.filename property is only used by source connectors, in standalone mode. It places a bookmark on the input data source and remembers where the connector stopped reading it. The created file contains something like a file line number (for a file source) or a table row number (for a JDBC source, or databases in general).
When running Kafka Connect in distributed mode, this file is replaced by a Kafka topic, named connect-offsets by default, which should be replicated in order to tolerate failures.
As far as sink connectors are concerned, no matter which plugin or mode (standalone/distributed) is used, they all store where they last stopped reading their input topic in an internal topic named __consumer_offsets, like any Kafka consumer. This allows you to use traditional tools like the kafka-consumer-groups.sh command line to see how much a sink connector is lagging.
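For example, a sink connector's consumer group is named connect-<connector name> by default, so something along these lines shows its lag (the broker address and connector name are placeholders):

kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --describe --group connect-my-jdbc-sink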
Confluent's Kafka Replicator, despite being a source connector, is probably an exception, because it reads from a remote Kafka cluster and may use a Kafka consumer, but only one cluster will maintain those original consumer group offsets.
I agree that the documentation is not clear: this setting is required whatever the connector type (source or sink), but it is only used by source connectors. The reason behind this design decision is that a single Kafka Connect worker (meaning a single JVM process) can run multiple connectors, potentially both source and sink connectors. Put differently, this is a worker-level setting, not a connector setting.
The offset.storage.file.filename property only applies to workers of source connectors running in standalone mode. If you are seeing Kafka persist offsets in a Kafka topic for a source connector, you are running in distributed mode. You should be launching your connector with the provided connect-standalone script. There's a description of the different modes here. Instructions for running in the different modes are here.
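In standalone mode, the worker properties and one or more connector properties files are passed together on the command line, along these lines (file names are placeholders):

bin/connect-standalone.sh worker.properties my-connector.properties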