How can I resume downloads in Perl? - perl

I have a project that depends on some other binaries being downloaded from the web at install time. What I do for this is:
if ( file-present-in-src/)
# skip that file
else
# use wget to download the file
The problem with this approach is that when I interrupt a download midway and invoke the script again, the partially downloaded file is skipped too, which is not what I want. I also want wget to resume the download of the partially downloaded file.
How should I go about it?
Possible solutions I could think of:
Download to a temporary file, say download_tmp, and move it to the original file name if the download succeeds.
Handle $SIG{'INT'} to write proper cleanup code.
But neither of these helps resume a partial download.
Any insights?

First, I don't understand what this has to do with Perl, since you're using wget to do the downloading ... You could use libwww-perl (perldoc LWP) and have more control over the download process.
Then I second your idea of downloading to a "tmp" filename and moving the file on success.
However, I think you need to go further and verify the integrity of the files. Computing an MD5 or SHA hash of the downloaded file is very easy; compare it with the one you're expecting. You can keep a short file on the server containing the checksum (filename.md5). Declare success only when you have a match.
Note that catching all the signals and generally trying to make the process unkillable, and then expecting it to have worked is bound to fail at one point or another. There could be a network timeout, a crash, power failure, configuration problem on the server ... you should instead assume downloads can fail, because they will, and code so that your process can recover.
Finally, you're not telling us what kind of binaries you're downloading and what you're doing with them. Since you use wget I'm going to assume you're on Unix; you should consider using RPM+Yum or the like, which handle all this for you. RPMs are easy to write, really.
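A minimal Perl sketch of the "download to a temp name, verify, then move" idea from this answer. It assumes the expected SHA-256 hex digest has already been obtained (for example from a small checksum file on the server); promote_if_valid and its arguments are hypothetical names, not part of any library.

```perl
use strict;
use warnings;
use Digest::SHA;
use File::Copy qw(move);

# Hypothetical helper: give the temp file its final name only when its
# SHA-256 digest matches the expected hex string.
sub promote_if_valid {
    my ($tmp, $final, $expected_hex) = @_;

    # 'b' makes addfile read the file in binary mode.
    my $actual = Digest::SHA->new(256)->addfile($tmp, 'b')->hexdigest;
    return 0 unless lc($actual) eq lc($expected_hex);

    move($tmp, $final) or die "Cannot move $tmp to $final: $!";
    return 1;
}
```

On a mismatch the temp file is left where it is, so the caller can decide whether to retry, resume, or delete it.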

Use your first approach:
download to "FileName".tmp
move "FileName".tmp to "FileName" (move, not copy!)
once a day, clean out all .tmp files (paranoia rulez)

You could just use wget's -N and -c options and remove the entire "if file exists" logic.
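One way this could be combined with the temp-location idea, sketched in Perl under a few assumptions: only wget's -c (continue) and -P (directory prefix) options are used, the URL ends in a plain file name, and fetch_with_resume, the .partial directory, and the injectable $run callback are all hypothetical names chosen for illustration.

```perl
use strict;
use warnings;
use File::Basename qw(basename);
use File::Copy qw(move);
use File::Path qw(make_path);

# Hypothetical helper: fetch $url into $dest_dir, keeping unfinished
# downloads in "$dest_dir/.partial" so wget -c can pick them up later.
# $run is injectable for testing; by default it shells out via system().
sub fetch_with_resume {
    my ($url, $dest_dir, $run) = @_;
    $run ||= sub { system(@_) == 0 };

    my $name  = basename($url);    # assumes the URL ends in a file name
    my $final = "$dest_dir/$name";
    return 1 if -e $final;         # only complete files ever get this name

    my $staging = "$dest_dir/.partial";
    make_path($staging);

    # -c continues a partial file; -P sets the directory wget saves into.
    $run->('wget', '-c', '-P', $staging, $url) or return 0;

    move("$staging/$name", $final) or die "Cannot move $name into place: $!";
    return 1;
}
```

Because an interrupted run leaves the partial file under .partial rather than under its final name, the "skip if present" check stays safe and the next run resumes instead of skipping.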

Related

Skipping the errors and recording downloads in kdb

I am trying to use a kdb q script to download files from a remote source.
How can I make the downloads keep going if there is an error?
Also, how can I record which files were downloaded on Linux when there are other files in the same directory?
Here is my code:
file:("abc.csv";"def.csv");
dbdir:"/home/terry/";
dlFunc:{
system "download.sh abc.com user \"get /remote/path/",x,"\" ",dbdir};
dlFunc each file;
If you're asking how to continue downloading other files if one file fails then you can put a protected eval around your dlFunc each file, e.g.
@[dlFunc;;()]each file;
You could capture the list of failed files using something like:
badfiles:();
{@[dlFunc;x;{y;badfiles,:enlist x}x]}each file;
Then inspect the badfiles list afterwards. The ones that succeeded would be:
file except badfiles

Is it possible for Perl's move to succeed but return 0?

Every function in the perl File::Copy module is supposed to return 1 in case of success and 0 in case of failure.
In my case, I have noticed (using whatever logs I had) that move returns 0 even when the operation succeeds (the files are actually moved), with $! set to "No such file or directory".
Has anyone noticed such an issue before?
If move returns 0, trying to rename the file failed, and then either trying to copy it failed or trying to unlink the original file after copying it failed. I don't see other possibilities, at least in File::Copy version 2.33.
You may want to just try the rename and, if needed, the copy and unlink yourself, if you need better error reporting.
What version of File::Copy are you using? What version of perl? What operating system?
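A rough sketch of doing the rename, copy, and unlink by hand so that the failing step is visible. verbose_move is a hypothetical name; this does not reproduce everything File::Copy's move does, it only shows the three steps with separate error messages.

```perl
use strict;
use warnings;
use File::Copy qw(copy);

# Hypothetical helper: move a file step by step and say which step failed.
sub verbose_move {
    my ($from, $to) = @_;

    # Cheapest case: a plain rename (works within one filesystem).
    return 1 if rename $from, $to;
    my $rename_err = "$!";

    # Fall back to copy + unlink, reporting each failure separately.
    copy($from, $to)
        or die "rename failed ($rename_err); copy failed: $!";
    unlink $from
        or die "copied $from to $to, but could not remove the original: $!";
    return 1;
}
```

If another process removes or renames the source between steps, the message from the last step shows exactly where things went wrong, which a bare 0 from move cannot.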
From File::Copy, on copy
If an error occurs in setting permissions, cp will return 0, regardless of whether the file was successfully copied.
While this is for copy, the move may also copy the file and then delete it (if it can't rename it).
There are yet other possibilities, that involve other processes interfering with the file.

What am I screwing up trying to download particular file types with wget?

I am attempting to regularly archive a few file types hosted on a community website where our admin has been MIA for years, in case he dies or just stops paying for the hosting.
I am able to download all of the files I need using wget -r -np -nd -e robots=off -l 0 URL but this leaves me with about 60,000 extra files to waste time both downloading and deleting.
I am really only looking for files with the extensions "tbt" and "zip". When I add in -A tbt,zip to the input, wget then only downloads a single file, "index.html.tmp". It immediately deletes this file because it doesn't match the file type specified, and then the process stops entirely, with wget announcing that it is finished. It does not attempt to download any of the other files that it grabs when the -A flag is not included.
What am I doing wrong? Why does specifying file types in the way that I did cause it to finish after only looking at one file?
Possibly you're hitting the same problem I've hit when trying to do something similar. When using --accept, wget determines whether a link refers to a file or a directory based on whether or not it ends with a /.
For example, say I have a directory named files, and a web page that has:
<a href="/files">Lots o' files!</a>
If I were to request this with wget -r, then wget would happily GET /files, see that it was an HTML document containing a bunch of links, and continue to download those links.
However, if I add -A zip to my command line, and run wget with --debug, I see:
appending ‘http://localhost:8080/files’ to urlpos.
[...]
Deciding whether to enqueue "http://localhost:8080/files".
http://localhost:8080/files (files) does not match acc/rej rules.
Decided NOT to load it.
In other words, wget thinks this is a file (no trailing /) and it doesn't match our acceptance criteria, so it gets rejected.
If I modify the remote file so that it looks like...
<a href="/files/">Lots o' files!</a>
...then wget will follow the link and download files as desired.
I don't think there's a great solution to this problem if you need to use wget. As I mentioned in my comment, there are other tools available that may handle this situation more gracefully.
It's also possible you're experiencing a different issue; the output from adding --debug to your command line should clarify things in that case.
I also experienced this issue, on a page where all the download links looked something like this: filedownload.ashx?name=file.mp3. The solution was to match for both the linked file, and the downloaded file. So my wget accept flag looked like this: -A 'ashx,mp3'. I also used the --trust-server-names flag. This catches all the .ashx that are linked in the webpage, then when wget does the second check, all the mp3 files that were downloaded will stay.
As an alternative to --trust-server-names, you may also find the --content-disposition flag helpful. Both flags help rename the file that gets downloaded from filedownload.ashx?name=file.mp3 to just file.mp3.

How to use Archive::Extract safely - against zip bombs or similar?

Problem outline:
I need to allow uploading ZIP files (and tgz and other compressed directory trees) via a web form
the zip files should be extracted so their content can be handled
I am planning to use Archive::Extract for the extraction
there are things like ZIP BOMBS and the like...
From the manual
Archive::Extract can use either pure perl modules or command line
programs under the hood. Some of the pure perl modules (like
Archive::Tar and Compress::unLZMA) take the entire contents of the
archive into memory, which may not be feasible on your system.
Consider setting the global variable $Archive::Extract::PREFER_BIN to
1 , which will prefer the use of command line programs and won't
consume so much memory.
The questions are:
When I set $Archive::Extract::PREFER_BIN = 1, am I sufficiently protected against zip-bomb-like things?
$Archive::Extract::PREFER_BIN protects me against excessive memory usage, but are the standard unzip, tar -z, and unrar binaries safe against zip-bomb-like attacks?
If not, how can I safely handle an uploaded compressed directory tree? (There is more than one file inside the archive, e.g. a zip.)
$Archive::Extract::PREFER_BIN = 1 doesn't protect you against zip bombs; you are just passing the problem to your system's binary unzip tool.
This SO question may help you. I like the idea of running a second process with ulimit.
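A hedged sketch of the "second process with ulimit" idea, using only core Perl and /bin/sh. run_limited is a hypothetical helper, the limit values are placeholders, and the unit of ulimit -f depends on the shell (POSIX sh counts 512-byte blocks, some shells use 1024).

```perl
use strict;
use warnings;

# Hypothetical helper: run @cmd in a child shell with a CPU-time limit
# (seconds) and a per-file size limit (in ulimit -f blocks).
# Returns true only when the command exits with status 0.
sub run_limited {
    my ($cpu_seconds, $file_blocks, @cmd) = @_;

    # The limits are set inside the child shell, then the shell is
    # replaced by the real command, which inherits them.
    my $script = 'ulimit -t "$1" && ulimit -f "$2" && shift 2 && exec "$@"';
    return system('/bin/sh', '-c', $script, 'sh',
                  $cpu_seconds, $file_blocks, @cmd) == 0;
}

# Example call (paths are placeholders):
# run_limited(30, 204800, 'unzip', '-q', 'upload.zip', '-d', 'extract_dir')
#     or warn "extraction failed or hit a limit\n";
```

Note that ulimit -f caps the size of each file written, not the total, so an archive with very many medium-sized files would still need a separate check such as a disk quota or a dedicated scratch filesystem.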

how to check for activity or lack thereof on a unix file directory using perl or unix commands

Scenario:
I have a process where many files are being copied (scp'd) to a DestinationServer by Host1, Host2, Host3, and Host4, for example. They all go to the same common directory: DestinationServer:/home/target. All the files are unique, so no files will be overwritten. Host1-Host4 will each have a cron job that launches their scp script to DestinationServer. The caveat is that the hosts are in different time zones and locations, so they will finish at different times.
Need:
Since the files are being scp'd to Destination:/home/target, what is the best way to programmatically check when those scp's from the other Hosts are done??
Options:
My options are to programmatically do this either in perl or shell if possible.
What do I look for, what unix commands or perl modules could I use to help determine when the processes would finish? Any ideas, examples would be great! Thanks.
Use a Maildir kind of approach: copy all files to a temporary directory, then after the transfer is complete have the originating host perform a rename into the target directory via ssh. That way when a file appears in the target directory, you know that it is complete.
I suggest this because if you just scp files into the target directory and monitor the directory in whatever way, you cannot distinguish a complete transfer from an interrupted scp command or a network failure.
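The sender side of this approach might look roughly like the Perl sketch below. deliver, the staging directory argument, and the injectable $run callback are hypothetical; it assumes the staging and target directories live on the same remote filesystem, so that mv is a plain rename.

```perl
use strict;
use warnings;
use File::Basename qw(basename);

# Hypothetical sender-side helper. $run is injectable for testing; by
# default it executes the command and reports whether it exited with 0.
sub deliver {
    my ($file, $host, $incoming_dir, $target_dir, $run) = @_;
    $run ||= sub { system(@_) == 0 };
    my $name = basename($file);

    # Step 1: copy into a staging directory the consumer never looks at.
    $run->('scp', '-q', $file, "$host:$incoming_dir/$name") or return 0;

    # Step 2: rename into place, so the file shows up in the target
    # directory all at once. ssh hands this to the remote shell, so names
    # with spaces or shell metacharacters would need extra quoting.
    $run->('ssh', $host, 'mv', "$incoming_dir/$name", "$target_dir/$name")
        or return 0;
    return 1;
}
```

If scp is interrupted, the rename never runs, so the consumer on the destination only ever sees complete files in the target directory.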
SGI::FAM, Sys::Gamin
A similar alternative to Jouni's approach is to use semaphore files. Before scp-ing its files, the originating host puts up a semaphore file, and removes it when finished. Once the semaphore file is gone, you know the transfer is done.