Why does this code work successfully with Enumerator.fromFile? - scala

I wrote the file-transfer code as follows:
val fileContent: Enumerator[Array[Byte]] = Enumerator.fromFile(file)
val size = file.length.toString
file.delete // (1) THE FILE IS TEMPORARY, SO IT SHOULD BE DELETED
SimpleResult(
  header = ResponseHeader(200, Map(CONTENT_LENGTH -> size, CONTENT_TYPE -> "application/pdf")),
  body = fileContent)
This code works successfully, even when the file is rather large (2.6 MB).
I'm confused, though, because my understanding is that .fromFile() is a wrapper around fromCallback(), and that SimpleResult reads the file in buffered chunks, yet the file is deleted before that happens.
My naive assumption is that java.io.File.delete waits until the file is released after the chunked reading completes, but I have never heard of the Java File class behaving that way. Alternatively, .fromFile() might have already loaded the whole file into the Enumerator instance, but that would go against the fromCallback() contract, I think.
Does anybody know how this mechanism works?

I'm guessing you are on some kind of Unix system, OSX or Linux for example.
On a Unix-like system you can delete a file that is still open: a filesystem entry is just a link to the actual file contents, and so is the file handle you get when you open the file. The contents don't become unreachable/deleted until the last link to them is removed.
So: the file will no longer show up in the filesystem after you call file.delete, but you can still read it through the InputStream that Enumerator.fromFile(file) created, since that opened a file handle. (On Linux you can even find it through the special /proc filesystem, which, among other things, lists the file handles of each running process.)
On Windows I think you will get an error instead, so if this has to run on multiple platforms you should test your webapp on Windows as well.
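For illustration, here is a minimal self-contained Scala sketch of that behaviour (the names are made up; this is not the Play code from the question, and it assumes a Unix-like filesystem):

import java.io.{File, FileInputStream}
import java.nio.file.Files
import scala.io.Source

object DeletedFileRead extends App {
  // Create a throwaway file just for the demonstration.
  val file = File.createTempFile("demo", ".txt")
  Files.write(file.toPath, "still readable".getBytes("UTF-8"))

  val in = new FileInputStream(file)  // opens a handle: another link to the file's data
  println(file.delete())              // true  - the directory entry is gone
  println(file.exists())              // false - no longer visible in the filesystem

  // The already-open handle still reaches the data until it is closed.
  println(Source.fromInputStream(in, "UTF-8").mkString) // prints "still readable"
  in.close()                          // last reference released; the space is reclaimed
}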

Related

How can kafka-connect-file-pulse be configured for continuous reading of a text file?

I have FilePulse configured correctly, so that when I create a file inside the watched folder, it reads it and ingests it into the topic.
Now I need continuous reading of each of the files in that folder, since they are continually being updated.
Do I have to change any property in the properties file?
My filePulseTxtFile.properties:
name=connect-file-pulse-txt
connector.class=io.streamthoughts.kafka.connect.filepulse.source.FilePulseSourceConnector
topic=lineas-fichero
tasks.max=1
# File types
fs.scan.filters=io.streamthoughts.kafka.connect.filepulse.scanner.local.filter.RegexFileListFilter
file.filter.regex.pattern=.*\\.log$
task.reader.class=io.streamthoughts.kafka.connect.filepulse.reader.RowFileInputReader
# File scanning
fs.cleanup.policy.class=io.streamthoughts.kafka.connect.filepulse.clean.LogCleanupPolicy
fs.scanner.class=io.streamthoughts.kafka.connect.filepulse.scanner.local.LocalFSDirectoryWalker
fs.scan.directory.path=/home/ec2-user/parser/scanDirKafka
fs.scan.interval.ms=10000
# Internal Reporting
internal.kafka.reporter.bootstrap.servers=localhost:9092
internal.kafka.reporter.id=connect-file-pulse-txt
internal.kafka.reporter.topic=connect-file-pulse-status
# Track file by name
offset.strategy=name
Thanks a lot!
Continuous reading is only supported by the RowFileInputReader, which you can configure with the read.max.wait.ms property: the maximum time, in milliseconds, to wait for more bytes after hitting the end of the file.
For example, if you set that property to 10000, the reader will wait 10 seconds for new lines to be appended to the file before considering it completed.
Also, note that as long as there are tasks processing files, newly added files in the source directory will not be selected. However, you can configure allow.tasks.reconfiguration.after.timeout.ms to force all tasks to be restarted after a given period so that new files get scheduled.
Finally, take care to set the tasks.max property so that all files can be processed in parallel (a task can only process one file at a time).
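Put together, the additions to a properties file like the one above might look roughly like this (the property names come from the answer; the values are placeholders to tune for your setup):
# RowFileInputReader: wait up to 10 seconds for new bytes after reaching end of file
read.max.wait.ms=10000
# Periodically restart tasks so newly created files get scheduled (example value)
allow.tasks.reconfiguration.after.timeout.ms=60000
# A task can only read one file at a time, so allow several files in parallel
tasks.max=4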

Sending file descriptors from Unix Sockets

I am using the Linux sendmsg() function to send a file descriptor to another process over a Unix socket, along with some data payload. I make multiple calls to sendmsg(). In the companion recvmsg() call inside the destination process, I get the file descriptor using something like "fdptr = (int *) CMSG_DATA(cmsg); memcpy(fdptr, myfds, NUM_FD * sizeof(int));". What I am noticing is that each time I look at the file descriptor, it is a DIFFERENT number than it was in the prior recvmsg() call.
My question: is the destination process holding open a bunch of descriptors to the same file/hardware? Do I need to close the descriptors?
What would happen if I did not copy the descriptor out with "memcpy(fdptr, myfds, NUM_FD * sizeof(int));" and essentially 'left it inside' CMSG_DATA(cmsg)? Would there be some descriptor with an unknown number sitting out there? Had I not copied it out, I would never have seen that it was yet another descriptor number.
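For reference, a minimal sketch of a receiving side (variable names such as sock and payload are placeholders, and it assumes exactly one descriptor per message). The descriptor copied out of CMSG_DATA is a new descriptor installed in the receiving process by the kernel, so each one needs its own close() when you are done with it:

#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Sketch: receive one descriptor sent with SCM_RIGHTS over a Unix socket. */
static int recv_one_fd(int sock)
{
    char payload[1];                       /* the regular data part          */
    union {                                /* control buffer, suitably       */
        char buf[CMSG_SPACE(sizeof(int))]; /* aligned for struct cmsghdr     */
        struct cmsghdr align;
    } u;

    struct iovec iov = { .iov_base = payload, .iov_len = sizeof payload };
    struct msghdr msg = { 0 };
    msg.msg_iov        = &iov;
    msg.msg_iovlen     = 1;
    msg.msg_control    = u.buf;
    msg.msg_controllen = sizeof u.buf;

    if (recvmsg(sock, &msg, 0) < 0)
        return -1;

    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg == NULL || cmsg->cmsg_level != SOL_SOCKET
                     || cmsg->cmsg_type  != SCM_RIGHTS)
        return -1;

    int fd;
    memcpy(&fd, CMSG_DATA(cmsg), sizeof fd);   /* copy OUT of the message    */

    /* fd is a brand-new descriptor in this process; close(fd) once you are
     * finished with it, or every recvmsg() call will leak one descriptor.   */
    return fd;
}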

Using ZIPFoundation without URL

In my macOS app I download an encrypted .zip file to disk. I decrypt this file and keep the decrypted version in memory as Data. For security reasons the decrypted .zip is only kept in memory.
I can successfully use ZIPFoundation's closure-based reading to extract the file contents in memory, but only by using a URL pointing to the (decrypted) .zip on disk:
guard let archive = Archive(url: url!, accessMode: .read) else { return }
Is there any way I can use the library with data that only exists in memory? If not, can you point me towards a library that can handle this?
I have already tried DataCompression, but I couldn't make it work.
There's an open (not yet merged) pull request that adds in-memory processing of ZIP archives to ZIP Foundation.
Sadly there are still some unresolved issues with in-memory writing of archives. The reading part uses fmemopen and should already work.
While the PR is not finished yet, you can have a look here: https://github.com/weichsel/ZIPFoundation/pull/78/
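Based on that PR, usage might look roughly like the sketch below. The Archive(data:accessMode:) initializer is taken from the PR and may change before it is merged, so treat this as an assumption, not the released API:

import Foundation
import ZIPFoundation

// Sketch only: assumes the in-memory Archive(data:accessMode:) initializer
// proposed in PR #78 is available in the ZIPFoundation build you link against.
func extractEntries(from zipData: Data) {
    guard let archive = Archive(data: zipData, accessMode: .read) else { return }
    for entry in archive {
        var fileData = Data()
        // Closure-based reading works the same way as with a URL-backed archive.
        _ = try? archive.extract(entry) { chunk in
            fileData.append(chunk)
        }
        print("\(entry.path): \(fileData.count) bytes")
    }
}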

Sinatra example code to download a large file

I started using Sinatra. Right now I'm using the following code to handle file downloads.
It works great for small files, but for large files (> 500 MB) the connection disconnects in the middle.
dpath = "/some root path to file"
get '/getfile/:path' do |path|
  s = path.to_s
  s.gsub!("-*-", "/")
  fn = s.split("/").last
  s = dpath + "/" + s
  send_file s, :filename => fn
end
Two things:
What does your validate method do? If it's trying to open the file in memory, you might be running out of RAM on your server (especially with large files).
Where are you setting fn? It's a local variable inside the get block, and there's nothing setting it in your code example.

How can I validate an image file in Perl?

How would I validate that a .jpg file is a valid image file? We have files being written to a directory via FTP, but we seem to be picking up a file before it has finished being written, which produces invalid images. I need to be able to identify when a file is no longer being written to. Any ideas?
Easiest way might just be to write the file to a temporary directory and then move it to the real directory after the write is finished.
Or you could check JPEG::Error:
[arguments: none] If the file reference remains undefined after a call to new, the file is to be considered not parseable by this module, and one should issue some error message and go to another file. An error message explaining the reason of the failure can be retrieved with the Error method:
EDIT:
Image::TestJPG might be even better.
You're solving the wrong problem, I think.
What you should be doing is figuring out how to tell when whatever FTP daemon you're using is done writing the file; that way, when you hit the same problem with (say) GIFs, DOCs or MPEGs, you don't have to fix it again.
Precisely how you do that depends rather crucially on which FTP daemon you're running on which OS. Some do, I believe, have hooks you can set to trigger when an upload is done.
If you can run your own FTP daemon, Net::FTPServer or POE::Component::Server::FTP are customizable to do the right thing.
In the absence of that:
1) try tailing the logs with a Perl script that looks for 'upload complete' messages
2) use something like lsof or fuser to check whether anything still has the file open before you try to copy it (see the sketch below).
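A sketch of option 2, assuming the fuser utility from psmisc is available (the path and the sub name are made up for the example):
# fuser exits 0 when at least one process has the file open, non-zero otherwise;
# -s keeps it from printing anything.
sub upload_finished {
    my ($file) = @_;
    return system('fuser', '-s', $file) != 0;
}

my $incoming = '/srv/ftp/incoming/photo.jpg';   # hypothetical path
if (upload_finished($incoming)) {
    # nothing has the file open any more - safe to validate / copy it
}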
Again looking at the FTP issue rather than the JPG issue: I check the timestamp on the file to make sure it hasn't been modified in the last X (5) minutes; that way I can be reasonably sure the upload has finished.
# time in seconds that the file was last modified
my $last_modified = (stat("$path/$file"))[9];
# current time in seconds since the epoch (i.e. 1970)
my $epoch_time = time();
# ensure the file hasn't been modified during the last 5 minutes, i.e. isn't still uploading
unless ( $last_modified >= ($epoch_time - 300) ) {
    # move / edit or whatever
}
I had something similar come up once; more or less what I did was poll the file size until it stopped changing:
my $old_size = -1;
my $current_size = -s $image_file;   # current file size in bytes
while ($current_size != $old_size) {
    $old_size = $current_size;
    sleep 10;                        # give the upload time to make progress
    $current_size = -s $image_file;
}
process_image($image_file);          # size was stable for 10 seconds
Have the FTP process set the read-only flag once an upload completes, then only work with files that have the read-only flag set.
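A quick Perl sketch of that idea (it assumes the FTP side removes write permission when an upload finishes; process_image is a placeholder):
foreach my $file (glob "$path/*.jpg") {
    next if -w $file;        # still writable => assume it is still being uploaded
    process_image($file);    # placeholder for your real handling
}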