How can kafka-connect-file-pulse be configured for continuous reading of a text file? - apache-kafka

I have FilePulse configured correctly, so that when I create a file inside the scanned folder, it reads it and ingests it into the topic.
Now I need continuous reading of each of the files in that folder, since they are continually being updated.
Do I have to change any property in the properties file?
My filePulseTxtFile.properties:
name=connect-file-pulse-txt
connector.class=io.streamthoughts.kafka.connect.filepulse.source.FilePulseSourceConnector
topic=lineas-fichero
tasks.max=1
# File types
fs.scan.filters=io.streamthoughts.kafka.connect.filepulse.scanner.local.filter.RegexFileListFilter
file.filter.regex.pattern=.*\\.log$
task.reader.class=io.streamthoughts.kafka.connect.filepulse.reader.RowFileInputReader
# File scanning
fs.cleanup.policy.class=io.streamthoughts.kafka.connect.filepulse.clean.LogCleanupPolicy
fs.scanner.class=io.streamthoughts.kafka.connect.filepulse.scanner.local.LocalFSDirectoryWalker
fs.scan.directory.path=/home/ec2-user/parser/scanDirKafka
fs.scan.interval.ms=10000
# Internal Reporting
internal.kafka.reporter.bootstrap.servers=localhost:9092
internal.kafka.reporter.id=connect-file-pulse-txt
internal.kafka.reporter.topic=connect-file-pulse-status
# Track file by name
offset.strategy=name
Thanks a lot!

Continuous reading is only supported by the RowFileInputReader, which you can configure with the read.max.wait.ms property - the maximum time, in milliseconds, to wait for more bytes after hitting the end of the file.
For example, if you set that property to 10000, the reader will wait 10 seconds for new lines to be appended to the file before considering it completed.
Also, note that as long as there are tasks processing files, new files added to the source directory will not be selected. However, you can configure allow.tasks.reconfiguration.after.timeout.ms to force all tasks to be restarted after a given period so that new files will be scheduled.
Finally, take care to set the tasks.max property correctly so that all files can be processed in parallel (a task can only process one file at a time).
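As a rough sketch (using the property names described above), the additions to your properties file might look like the following; the values are illustrative, not recommendations:
# Wait up to 10 seconds at EOF for new lines before marking the file completed
read.max.wait.ms=10000
# Periodically restart tasks so that newly created files get scheduled
allow.tasks.reconfiguration.after.timeout.ms=300000
# Allow several files to be read in parallel (one file per task)
tasks.max=4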

Related

rsyslog - timestamp setting - log file(imfile)

I applied the following settings in the rsyslog .conf, and it can read logs from the file.
module(load="imfile" PollingInterval="10")
input(
  type="imfile"
  File="/root/loggingFile.log"
  Tag="tag2"
  Severity="notice"
  Facility="local0"
)
The logs in the file look exactly like the following - plain lines, nothing extra (like a timestamp):
trying1
trying2
trying3
...
tryingNew
It can read the logs, but every time a new line is written, it re-reads the whole file from the beginning.
When I set freshStartTail="on", new lines are not read at all.
How can I solve this problem?

Filebeat To Send Entire Log Files

So I'm trying to have Filebeat send the entire log file as one event instead of every line as an event, but it's not working. This is my Filebeat setup:
multiline.pattern: ^\[
multiline.negate: true
multiline.match: after
and this is an example of a log file that I have:
2020-02-03 16:03:25,038 INFO Initial blacklisted packages:
2020-02-03 16:03:25,039 INFO Initial whitelisted packages:
2020-02-03 16:03:25,039 INFO Starting unattended upgrades script
but Filebeat sends every line as an event, and I need to send the whole thing as one event instead of separate events.
Any idea what I'm doing wrong here?
Thanks in advance for any help!
You probably don't actually want to send the whole log as one event (one record); it doesn't make much sense. If you just want to upload a whole file, check the Logstash file input (not Filebeat). It can do that for you.
And your multiline.pattern (which decides where the text is split into events) could be something like ^[0-9]{4}-[0-9]{2}-[0-9]{2}.
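As a hedged sketch, a Filebeat input combining that pattern with your multiline settings could look like the following (this assumes a reasonably recent Filebeat where inputs are declared under filebeat.inputs; the path is hypothetical):
filebeat.inputs:
- type: log
  paths:
    - /var/log/unattended-upgrades/unattended-upgrades.log
  # lines NOT starting with a yyyy-mm-dd timestamp are appended to the previous event
  multiline.pattern: '^[0-9]{4}-[0-9]{2}-[0-9]{2}'
  multiline.negate: true
  multiline.match: after
With negate: true and match: after, every line that does not start with a timestamp is glued onto the preceding timestamped line, so each log entry plus its continuation lines becomes one event.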

Sending file descriptors from Unix Sockets

I am using the Linux sendmsg() function to send a file descriptor to another process over a Unix socket, along with some data payload. I make multiple calls to sendmsg(). In the companion recvmsg() call inside the destination process, I get the file descriptor using something like "fdptr = (int *) CMSG_DATA(cmsg); memcpy(fdptr, myfds, NUM_FD * sizeof(int));". What I am noticing is that each time I look at the file descriptor, it is a DIFFERENT number than it was in the prior recvmsg() call.
My question: is the destination process holding open a bunch of descriptors to the same file/hardware? Do I need to close the descriptors?
What would happen if I did not copy the descriptor with "memcpy(fdptr, myfds, NUM_FD * sizeof(int));" and essentially 'left it inside' CMSG_DATA(cmsg)? Would there be some descriptor with an unknown number sitting out there? Had I not copied it out, I would never have seen that it was essentially yet another descriptor number.

Why does this code work successfully with Enumerator.fromFile?

I wrote the file transferring code as follows:
val fileContent: Enumerator[Array[Byte]] = Enumerator.fromFile(file)
val size = file.length.toString
file.delete // (1) THE FILE IS TEMPORARY SO SHOULD BE DELETED
SimpleResult(
  header = ResponseHeader(200, Map(CONTENT_LENGTH -> size, CONTENT_TYPE -> "application/pdf")),
  body = fileContent)
This code works successfully, even if the file size is rather large (2.6 MB),
but I'm confused, because my understanding is that .fromFile() is a wrapper around fromCallBack() and that SimpleResult actually reads the file buffered, yet the file is deleted before that happens.
My naive assumption is that java.io.File.delete waits until the file is released once the chunked reading has completed, but I have never heard of such behaviour in the Java File class.
Or .fromFile() has already loaded all lines into the Enumerator instance, but that would go against the fromCallBack() spec, I think.
Does anybody know about this mechanism?
I'm guessing you are on some kind of Unix system, OS X or Linux for example.
On a Unix-y system you can actually delete a file that is open: any filesystem entry is just a link to the actual file, and so is the file handle you get when you open a file. The file contents won't become unreachable/deleted until the last link to them is removed.
So: the file will no longer show up in the filesystem after you do file.delete, but you can still read it using the InputStream that was created in Enumerator.fromFile(file), since that created a file handle. (On Linux you can actually find it through the special /proc filesystem which, among other things, lists the file handles of each running process.)
On Windows I think you will get an error though, so if it is to run on multiple platforms you should probably test your webapp on Windows as well.
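If you want to see the mechanism in isolation, here is a minimal Scala sketch (assuming a Unix-like filesystem; the path is made up): it opens a stream, deletes the file, and still reads the content through the open handle.
import java.io.{File, FileInputStream, FileOutputStream}

object DeleteWhileReading extends App {
  val file = new File("/tmp/demo-delete-while-reading.txt") // hypothetical temp file
  val out = new FileOutputStream(file)
  out.write("hello from a soon-to-be-unlinked file".getBytes("UTF-8"))
  out.close()

  val in = new FileInputStream(file)    // the open handle keeps the inode alive
  println(s"deleted: ${file.delete()}") // unlinks the name; data is still reachable

  val buf = new Array[Byte](1024)
  val n = in.read(buf)
  println(new String(buf, 0, n, "UTF-8")) // still prints the content
  in.close() // last reference gone; the kernel can now reclaim the blocks
}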

How can I validate an image file in Perl?

How would I validate that a jpg file is a valid image file? We are having files written to a directory via FTP, but we seem to be picking up each file before it has finished being written, creating invalid images. I need to be able to identify when a file is no longer being written to. Any ideas?
Easiest way might just be to write the file to a temporary directory and then move it to the real directory after the write is finished.
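A rough Perl sketch of that idea (the paths are hypothetical, and it assumes the temporary and final directories are on the same filesystem so the move is an atomic rename, e.g. run from a post-upload hook):
use strict;
use warnings;
use File::Copy qw(move);

# The uploader writes into the temporary directory; the pickup job
# only ever looks in the final directory.
my $tmp   = '/srv/ftp/incoming_tmp/photo.jpg';
my $final = '/srv/ftp/incoming/photo.jpg';

# On the same filesystem move() boils down to rename(), so readers of
# $final never see a half-written file.
move($tmp, $final) or die "move failed: $!";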
Or you could check the JPEG::Error documentation:
JPEG::Error
[arguments: none] If the file reference remains undefined after a call to new, the file is to be considered not parseable by this module, and one should issue some error message and go to another file. An error message explaining the reason of the failure can be retrieved with the Error method:
EDIT:
Image::TestJPG might be even better.
You're solving the wrong problem, I think.
What you should be doing is figuring out how to tell when whatever FTPd you're using is done writing the file - that way when you come to have the same problem for (say) GIFs, DOCs or MPEGs, you don't have to fix it again.
Precisely how you do that depends rather crucially on which FTPd you're running, and on which OS. Some do, I believe, have hooks you can set to trigger when an upload's done.
If you can run your own FTPd, Net::FTPServer or POE::Component::Server::FTP are customizable to do the right thing.
In the absence of that:
1) try tailing the logs with a Perl script that looks for 'upload complete' messages
2) use something like lsof or fuser to check whether anything is locking a file before you try and copy it.
Again, looking at the FTP issue rather than the JPG issue:
I check the timestamp on the file to make sure it hasn't been modified in the last X (5) minutes - that way I can be reasonably sure it has finished uploading.
# time in seconds that the file was last modified
my $last_modified = (stat("$path/$file"))[9];
# get the time in secs since epoch (ie 1970)
my $epoch_time = time();
# ensure file's not been modified during the last 5 mins, ie still uploading
unless ( $last_modified >= ($epoch_time - 300)) {
# move / edit or what ever
}
I had something similar come up once; more or less what I did was to poll the file size until it stops changing (shown here as Perl):
my $old_image_size = 0;
my $current_image_size;
# keep polling until two consecutive size checks agree, i.e. the upload has settled
while ( ($current_image_size = -s $image_file) != $old_image_size ) {
    $old_image_size = $current_image_size;
    sleep 10;
}
process_image($image_file);
Have the FTP process set the readonly flag, then only work with files that have the readonly flag set.
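A hedged one-line sketch of that check in Perl (process_image is a hypothetical pickup routine, and -w only tests writability for the effective user, which is a rough proxy for "the FTP process has cleared the write bits"):
# skip files that are still writable, i.e. possibly still being uploaded
process_image($file) unless -w $file;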