I want to send my Google Chrome history to Kafka.
My basic idea is to use my local data located in
C:/Users/master/AppData/Local/Google/Chrome/User Data/Default/history
To do so, I want to use the Kafka file source connector.
But how can I send newly added Chrome history entries after I start the Kafka source connector?
Is there any way to track changes to the source file so the Kafka broker can pick them up?
Indeed, you can use the FileStreamSourceConnector to achieve that. You do not need anything else.
Once you start the FileStreamSourceConnector, it will hook onto the specified file, so whenever new data is appended to the file, the connector will automatically produce it to the topic.
From the documentation linked below:
This connector will read only one file and send the data within that file to Kafka. It will then watch the file for appended updates only. Any modification of file lines already sent to Kafka will not be reprocessed.
This may help you: Read File Data with Connect
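For reference, a minimal standalone config for this connector might look like the following sketch (the file path, connector name, and topic name are placeholders, not your actual values):

# my-file-source.properties (hypothetical file name)
name=chrome-history-source
connector.class=FileStreamSource
tasks.max=1
file=/path/to/history.log
topic=chrome-history

You would pass this file to bin/connect-standalone.sh together with a worker config such as config/connect-standalone.properties.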
I am working on the file stream connector. I have more than ten million records in the file (it's not a single file; it's partitioned by account #). I have to load these files into the topic and update my streams. I have gone through stand-alone Streams, and I have the following questions and need help to achieve this.
Looking at the data set, I have two account #s, each with 5 rows; I would need to group them into two rows keyed by acctNbr.
How do I write my source connector to read the files and apply the grouping logic?
My brokers are running on Linux machines X, Y, Z. After developing the source connector, should my jar file be deployed on every broker (if I run in distributed mode)?
I have only a 30-minute window to extract the files and drop them to the topic. What parameters are available to tune the logic to get my working window down? FYI, this topic will have more than 50 partitions and a 3-broker setup.
Data set:
{"acctNbr":"1234567","secNbr":"AAPL","date":"2010-01-01","currentPrice":"10","availQnty":"10"}
{"acctNbr":"1234567","secNbr":"AAPL","date":"2010-01-02","currentPrice":"10","availQnty":"10"}
{"acctNbr":"1234567","secNbr":"AAPL","date":"2010-01-03","currentPrice":"10","availQnty":"10"}
{"acctNbr":"1234567","secNbr":"AAPL","date":"2010-01-04","currentPrice":"10","availQnty":"10"}
{"acctNbr":"1234567","secNbr":"AAPL","date":"2010-01-05","currentPrice":"10","availQnty":"10"}
{"acctNbr":"abc3355","secNbr":"AAPL","date":"2010-01-01","currentPrice":"10","availQnty":"10"}
{"acctNbr":"abc3355","secNbr":"AAPL","date":"2010-01-02","currentPrice":"10","availQnty":"10"}
{"acctNbr":"abc3355","secNbr":"AAPL","date":"2010-01-03","currentPrice":"10","availQnty":"10"}
{"acctNbr":"abc3355","secNbr":"AAPL","date":"2010-01-04","currentPrice":"10","availQnty":"10"}
{"acctNbr":"abc3355","secNbr":"AAPL","date":"2010-01-05","currentPrice":"10","availQnty":"10"}
How do I write my source connector to read the files and apply the grouping logic?
The FileStream connector cannot do this, and it was not intended for such a purpose; it is only an example for writing your own connectors. In other words, do not use it in production.
That being said, you can use alternative solutions like Flume, Filebeat, Fluentd, NiFi, StreamSets, etc., to glob your file paths and send all records line-by-line into a Kafka topic. The grouping itself is typically handled downstream once the records are in the topic, e.g. in Kafka Streams; see the sketch below.
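A minimal Kafka Streams sketch of that downstream step, re-keying each record by its acctNbr so all rows for an account land in the same partition (the topic names, application id, and the naive JSON parsing are assumptions for illustration; a real job would use a JSON serde):

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

import java.util.Properties;

public class AcctRekey {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "acct-rekey");  // hypothetical app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "X:9092");   // one of the brokers from the question
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> lines = builder.stream("positions-raw");  // hypothetical input topic

        // Re-key each JSON line by its acctNbr field; once keyed,
        // groupByKey()/aggregate() can collapse the rows per account if needed.
        lines.selectKey((key, value) -> extractAcctNbr(value))
             .to("positions-by-acct");  // hypothetical output topic

        new KafkaStreams(builder.build(), props).start();
    }

    // Naive substring parse, kept short for the sketch.
    private static String extractAcctNbr(String json) {
        String marker = "\"acctNbr\":\"";
        int start = json.indexOf(marker) + marker.length();
        return json.substring(start, json.indexOf('"', start));
    }
}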
After developing the source connector, should my jar file be deployed on every broker?
You should not run Connect on any broker. The Connect servers are called workers.
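On the jar question: the connector jar goes on every Connect worker (not on any broker), typically via plugin.path. A partial distributed-worker sketch, assuming hypothetical paths (required settings such as converters and the internal config/offset/status storage topics are omitted):

# config/connect-distributed.properties (one copy per worker machine)
bootstrap.servers=X:9092,Y:9092,Z:9092
# workers sharing a group.id form one Connect cluster
group.id=file-loader-cluster
# drop the connector jar here on every worker
plugin.path=/opt/connect-plugins

Start a worker on each machine with bin/connect-distributed.sh config/connect-distributed.properties.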
I have only a 30-minute window to extract the files and drop them to the topic?
It's not clear where this number comes from. All of the methods listed above watch for new files continuously, without any defined window. If raw throughput is the concern, the standard producer tuning settings apply; see the sketch below.
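Since the question asks which parameters exist, these are the usual producer-side throughput knobs (shown here as producer.-prefixed overrides in a Connect worker config; the values are illustrative starting points, not recommendations):

# larger batches plus a small linger improve batching efficiency
producer.batch.size=65536
producer.linger.ms=50
# compression trades CPU for network/disk throughput
producer.compression.type=lz4
# acks=all is safest; lower values trade durability for latency
producer.acks=1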
I set everything up as recommended for the quick start. I used a text file containing one sentence as the source/producer. When I launch a consumer console for the first time, I am able to read the sentence (JSON format) from the file, but when I add something to the file, it does not show up in the consumer console. When I use the producer console to add something to the topic, it shows up right away in the consumer console. What could be the problem?
zookeeper UP
Connector UP
consumer UP
producer UP
Kafka UP
Kafka doesn't watch files for changes. You would need to program your own code to detect file modifications on disk and then restart the producer thread to pick up those changes; a sketch of the detection step follows.
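A minimal sketch of that detection step using Java's standard WatchService (the directory is a placeholder, and the producing logic is left as a comment):

import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardWatchEventKinds;
import java.nio.file.WatchEvent;
import java.nio.file.WatchKey;
import java.nio.file.WatchService;

public class FileChangeWatcher {
    public static void main(String[] args) throws Exception {
        Path dir = Paths.get("/data/input");  // placeholder: directory containing the watched file
        WatchService watcher = FileSystems.getDefault().newWatchService();
        dir.register(watcher, StandardWatchEventKinds.ENTRY_MODIFY);

        while (true) {
            WatchKey key = watcher.take();  // blocks until the OS reports a change
            for (WatchEvent<?> event : key.pollEvents()) {
                System.out.println("Modified: " + event.context());
                // Re-read the file from the last recorded offset and
                // produce the new lines to Kafka here.
            }
            key.reset();  // re-arm the key, or no further events are delivered
        }
    }
}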
Alternatively, use the kafka-connect-spooldir connector, available on GitHub.
I created a new topic and had placed the file at the wrong path, so I had to edit these files:
bin/connect-standalone.sh
config/connect-standalone.properties
config/connect-file-source.properties
config/connect-file-sink.properties
In those files, edit these lines:
topic=my_created_topic
file=PATH_TO_MY_SOURCE_FILE
Everything is working perfectly now, yay!
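For reference, the standalone worker from the quickstart is then launched with the worker config followed by the connector configs:

bin/connect-standalone.sh config/connect-standalone.properties config/connect-file-source.properties config/connect-file-sink.properties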
I have a Flume agent running in CDH 5.8.3. It creates multiple .tmp files when writing to HDFS if more than 3 valid files are sent. There is an interceptor that routes valid XMLs to the appropriate topic before the HDFS sink. This agent is using Flafka. The interceptor and Kafka are working correctly.
agent.sinks.hdfs_valid.channel=valid_channel
agent.sinks.hdfs_valid.type=hdfs
agent.sinks.hdfs_valid.writeFormat=Text
agent.sinks.hdfs_valid.hdfs.fileType=DataStream
agent.sinks.hdfs_valid.hdfs.filePrefix=event
agent.sinks.hdfs_valid.hdfs.fileSuffix=.xml
agent.sinks.hdfs_valid.hdfs.path=locationoffile/%{time}
agent.sinks.hdfs_valid.hdfs.idleTimeout=900
agent.sinks.hdfs_valid.hdfs.rollInterval=3600
agent.sinks.hdfs_valid.hdfs.kerberosPrincipal=authentication#example.com
agent.sinks.hdfs_valid.hdfs.kerberosKeytab=locationofkeytab
agent.sinks.hdfs_valid.hdfs.rollSize=0
agent.sinks.hdfs_valid.hdfs.rollCount=0
agent.sinks.hdfs_valid.hdfs.callTimeout=100000
Okay, so, interestingly enough: our Kafka topic was set to 20 partitions. When Flume consumes from it, the first 10 partitions are consumed from one IP, which opens one .tmp; the second 10 partitions are consumed from another IP, which opens a second .tmp. This appears to be an internal function of Flume. All data arrived correctly despite having two .tmp files open.
I was following step #7 (Use Kafka Connect to import/export data) at this link:
http://kafka.apache.org/documentation.html#quickstart
It was working well until I deleted the 'test.txt' file, mainly because that's how log4j files work: after a certain time, the file gets rotated, i.e., it is renamed and a new file with the same name starts getting written to.
But after I deleted 'test.txt', the connector stopped working. I restarted the connector, broker, ZooKeeper, etc., but the new lines from 'test.txt' are not going to the 'connect-test' topic and therefore are not going to the 'test.sink.txt' file.
How can I fix this?
The connector keeps track of its last read location in the file, so if it crashes while reading the file, it can continue where it left off.
The problem is that you deleted the file without resetting the offsets to 0, so it basically doesn't see any new data: it is waiting for new data to appear starting at a specific character count from the beginning of the file.
The work-around is to reset the offsets. If you are using Connect in standalone mode, the offsets are stored in /tmp/connect.offsets by default; just delete them from there.
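With the quickstart defaults that looks something like this (the path is set by offset.storage.file.filename in connect-standalone.properties):

# stop the standalone worker, then remove its stored offsets
rm /tmp/connect.offsets
# restart; the connector will read the file from the beginning again
bin/connect-standalone.sh config/connect-standalone.properties config/connect-file-source.properties config/connect-file-sink.properties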
In the long term, we need a better file connector :)