How to read only new changes from a file using a Kafka producer - apache-kafka

I am currently using a Windows machine and am able to read a whole file through the command prompt using a Kafka producer and consumer. I need to get only the recent changes in a file and use them as input for Apache Flink. I tried using this link, but due to a Kafka client jar mismatch issue I was not able to use it.
In my current approach, each time I call my producer it loads the whole file, and we need to run it every time to see the changes made to the file. I thought of using threads and some way of comparing file differences in Java code, but is there any way of doing this with Kafka only?

I had a similar problem recently (but on Linux) and solved it the following way:
tail -f somefile.log | kafka-console-producer.sh ...
In your case you can try some Windows alternatives to Linux's tail: 13 Ways to Tail a Log File on Windows & Linux
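For example, PowerShell's Get-Content can follow a file much like tail -f does. A minimal sketch, assuming Kafka's Windows scripts are available and the log file, broker address, and topic name are placeholders:
Get-Content -Path .\somefile.log -Wait -Tail 0 | .\bin\windows\kafka-console-producer.bat --broker-list localhost:9092 --topic your-topic
Here -Tail 0 skips the existing contents, so only lines appended after the command starts are forwarded to the producer.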

Related

Unable to change default /etc/kafka/connect-log4j.properties location for different kafka connectors

I am using multiple Kafka connectors, but every connector writes its log to the same connect.log file. I want each connector to write to a different log file. For that, during startup I need to change the default /etc/kafka/connect-log4j.properties file, but I am unable to change it.
Sample Start Script:
/usr/bin/connect-standalone ../properties/sample-worker.properties ../properties/sample-connector.properties > /dev/null 2>&1 &
Is there any way to change the default /etc/kafka/connect-log4j.properties file during the startup of the connectors?
Kafka uses log4j and has an environment variable for overriding its configuration:
export KAFKA_LOG4J_OPTS="-Dlog4j.configuration=file:///some/other/log4j.properties"
connect-standalone.sh ...
Generally, it would be best to use connect-distributed and a log aggregation tool like the ELK stack to parse through the log events of the different connectors.
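For example, each standalone worker could be started with its own log4j file (the per-connector file names below are hypothetical, and each worker would also need its own REST port and offsets file, which is not shown here):
export KAFKA_LOG4J_OPTS="-Dlog4j.configuration=file:///etc/kafka/connector-a-log4j.properties"
/usr/bin/connect-standalone ../properties/sample-worker.properties ../properties/connector-a.properties > /dev/null 2>&1 &

export KAFKA_LOG4J_OPTS="-Dlog4j.configuration=file:///etc/kafka/connector-b-log4j.properties"
/usr/bin/connect-standalone ../properties/other-worker.properties ../properties/connector-b.properties > /dev/null 2>&1 &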

can't start the Kafka server in my kafka directory

I want to start Kafka for the very first time, but when I tried to run this in my kafka_2.13-2.8.0 directory: bin\windows\zookeeper-server-start.bat ..\..\config\zookeeper.properties
why does it return \Kafka\kafka_2.13-2.8.0\bin\windows\../ ../config/log4j.properties was unexpected at this time?
I already followed this tutorial to install Kafka: https://www.youtube.com/watch?v=bYVyRh4C94E&t=303s
It's a known error in the Kafka log4j settings, especially if the install path contains spaces or non-alphanumeric characters.
If you really want to run Kafka on Windows, you should use WSL2 or Docker anyway. Otherwise, assuming you do get the bat file working, you'd eventually run into other errors that crash the broker.
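For what it's worth, if the install path is something like C:\Program Files\..., moving Kafka to a path without spaces often avoids that particular error (the paths below are illustrative):
REM assuming the install was moved to C:\kafka
cd C:\kafka
bin\windows\zookeeper-server-start.bat config\zookeeper.properties
bin\windows\kafka-server-start.bat config\server.properties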

Is there any way to send Chrome history logs to Kafka?

I want to send my Google Chrome history to Kafka.
My basic idea is to use my local data located in
C:/Users/master/AppData/Local/Google/Chrome/User Data/Default/history
To do so, I want to use the Kafka file source connector.
But how can I send newly added Chrome history log entries after I run the Kafka source connector?
Is there any way to track changes to the source file so the Kafka broker can acknowledge them?
Indeed you can use FileStreamSourceConnector to achieve that. You do not need anything else.
Once you start FileStreamSourceConnector, it will hook to the specified file. So, whenever new data is appended to the file, your connector will automatically produce to the topic.
From the documentation linked below:
This connector will read only one file and send the data within that file to Kafka. It will then watch the file for appended updates only. Any modification of file lines already sent to Kafka will not be reprocessed.
This may help you: Read File Data with Connect
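As a rough sketch, a standalone FileStreamSource config pointing at that path might look like this (the connector name and topic below are placeholders; note the connector treats the file as plain text and only picks up appended lines):
name=chrome-history-source
connector.class=FileStreamSource
tasks.max=1
file=C:/Users/master/AppData/Local/Google/Chrome/User Data/Default/history
topic=chrome-history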

Kafka connect not working for file streaming

I have been using Kafka Connect for the Confluent Platform using the following guide:
Kafka connect quickstart
But it doesn't update the sink file anymore; any changes in the source file are not written to the Kafka topic.
I have already deleted all tmp files, but there is no change.
Thanks in advance
Start up a new file source connector with a new location for storing the offsets. This connector is meant as a demo and really doesn't handle anything except a simple file that only receives appended updates. Note that you shouldn't be doing anything with this connector other than a simple demo. Have a look at the connector hub if you need something for production.
To the OP: I had this same issue about 5 minutes ago, but when I restarted the connector it was fine; both test.sink.txt and the consumer received the newly added lines. So, in a nutshell, just restart your connector.
If FileStreamSource/Sink stops working after it initially worked fine, and you have already restarted ZooKeeper, the Kafka server, and the connector but it still does not work, then the problem is with the connect.offsets file in the Kafka directory.
You should delete it and create a new empty one.
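In standalone mode that offsets location comes from the worker configuration, so an alternative to deleting the file is to point the worker at a fresh one (the path below is just an example):
# in connect-standalone.properties (worker config)
offset.storage.file.filename=/tmp/connect-new.offsets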
I faced the same problem before, but correcting the paths of the input and output files in the properties files as shown below worked for me. It then streamed from the input file (test.txt) to the output file (test.sink.txt).
name=local-file-source
connector.class=FileStreamSource
tasks.max=1
file=/home/mypath/kafka/test.txt
topic=connect-test

name=local-file-sink
connector.class=FileStreamSink
tasks.max=1
file=/home/mypath/kafka/test.sink.txt
topics=connect-test

Load a 1GB file to a Kafka producer directly from my local machine

I have experimented with the basic examples of publishing random messages from a producer to a consumer via the command line.
Now I want to publish the 1GB of data present on my local machine. For that, I am struggling to load that 1GB of data into the producer.
Please help me out.
You can simply dump a file into a Kafka topic with a simple shell redirection. Assuming 1.xml is the 1GB file, you can use the following command.
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test123 < ./1.xml
But please make sure that you set the following properties in the broker configuration (server.properties):
socket.request.max.bytes, socket.receive.buffer.bytes, socket.send.buffer.bytes.
You need to set max.message.bytes for the test123 topic if your message size is big.
Also change the Xmx parameter in kafka-console-producer.sh to avoid an out-of-memory issue.
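For instance, the topic-level limit can be raised with kafka-configs and the producer heap with KAFKA_HEAP_OPTS (the values below are illustrative; on newer Kafka versions --bootstrap-server replaces --zookeeper):
bin/kafka-configs.sh --zookeeper localhost:2181 --alter --entity-type topics --entity-name test123 --add-config max.message.bytes=10485760
KAFKA_HEAP_OPTS="-Xmx1G" bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test123 < ./1.xml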
These are the general steps to load data into Kafka.
We will be able to understand more if you provide the error.
A couple of approaches can help:
1) You can use big data platforms like Flume, which are built for such use cases.
2) If you want to implement your own code, you can use the Apache Commons IO library, which will help you capture events when a new file arrives in a folder (capture events happening inside a directory); once you have that, you can call the code that publishes the data to Kafka.
3) In our project we use the Logstash API to do the same: it fetches from a folder, publishes data from the files to Kafka, and then processes it through Storm. A minimal config sketch follows below.
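A minimal Logstash pipeline along those lines might look like this (the path, broker address, and topic are placeholders; it assumes the standard file input and kafka output plugins):
input {
  file {
    path => "/home/mypath/incoming/*.log"
  }
}
output {
  kafka {
    bootstrap_servers => "localhost:9092"
    topic_id => "test123"
  }
}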