Change output filename from WGET when using input file option - perl

I have a Perl script that I wrote that gathers some image URLs, writes them to an input file, and then runs wget with the --input-file option. This works perfectly... or at least it did as long as the image filenames were unique.
I have a new company sending me data, and they use a very troublesome naming scheme: every file has the same name, 0.jpg, just in different folders.
For example:
cdn.blah.com/folder/folder/202793000/202793123/0.jpg
cdn.blah.com/folder/folder/198478000/198478725/0.jpg
cdn.blah.com/folder/folder/198594000/198594080/0.jpg
When I run my script with this, wget works fine and downloads all the images, but they end up named 0.jpg.1, 0.jpg.2, 0.jpg.3, etc. I can't simply count them and rename them afterwards, because individual files can be broken, unavailable, whatever.
I tried running wget once per file with -O, but it's embarrassingly slow: start the program, connect to the site, download, exit the program, thousands of times over. It's an hour versus minutes.
So I'm trying to find a way to control the output filenames from wget without it taking so long. The original approach works so well that I don't want to change it too much unless necessary, but I am open to suggestions.
Additional:
LWP::Simple is too simple for this. Yes, it works, but very slowly, and it has the same problem as running individual wget commands: each get() or getstore() call makes the system reconnect to the server. The files are so small (60 kB on average) and so numerous (1851 for this one test file alone) that the connection time is considerable.
The filename I will be using can be built with /\/(\d+)\/(\d+\.jpg)/i, where the new filename is simply $1$2, giving 2027931230.jpg. Not really important for this question.
I'm now looking at LWP::UserAgent with LWP::ConnCache, but it times out and/or hangs on my PC. I will need to adjust the timeout and retry values. The inaugural run of the code downloaded 693 images (43 MB) in just a couple of minutes before it hung. Using LWP::Simple, I only got 200 images in 5 minutes.
use LWP::UserAgent;
use LWP::ConnCache;

# INPUTFILE is assumed to be opened earlier in the script
chomp(my @filelist = <INPUTFILE>);

my $browser = LWP::UserAgent->new;
$browser->conn_cache(LWP::ConnCache->new());  # reuse connections between requests

foreach (@filelist) {
    # Build a unique name from the last two path components of the URL,
    # e.g. .../202793123/0.jpg becomes 2027931230.jpg
    next unless /\/(\d+)\/(\d+\.jpg)/i;
    my $newfilename = $1 . $2;

    # $folder is the output directory, defined elsewhere in the script
    my $response = $browser->mirror($_, $folder . $newfilename);
    die 'response failure' if $response->is_error();
}

LWP::Simple's getstore function lets you specify a URL to fetch and a filename to store the data in. It's an excellent module for many of the same use cases as wget, but with the benefit of being a Perl module (i.e. no need to shell out or spawn child processes).
use LWP::Simple;

# Grab the filename from the end of the URL
my $filename = (split '/', $url)[-1];

# If the file exists, increment its name
while (-e $filename)
{
    $filename =~ s{ (\d+)[.]jpg }{ $1 + 1 . '.jpg' }ex
        or die "Unexpected filename encountered";
}

getstore($url, $filename);
The question doesn't specify exactly what kind of renaming scheme you need, but this will work for the examples given by simply incrementing the filename until the current directory doesn't contain that filename.
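If you want the naming scheme from the question instead, the same getstore() call can be combined with the regex the asker already has. A minimal sketch, assuming @filelist holds the URLs and $folder is the output directory (both names taken from the asker's code):

use LWP::Simple;  # getstore() is exported by default; is_success() comes along via HTTP::Status

for my $url (@filelist) {
    # Build a unique name from the last two path components,
    # e.g. .../202793123/0.jpg -> 2027931230.jpg
    next unless $url =~ m{/(\d+)/(\d+\.jpg)}i;
    my $new_name = $1 . $2;

    my $status = getstore($url, $folder . $new_name);
    warn "Failed to fetch $url (status $status)\n" unless is_success($status);
}

Note that LWP::Simple does not reuse connections, so for thousands of small files the LWP::UserAgent + LWP::ConnCache approach above is still the faster option.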

Related

How to find the age of the file in SFTP using perl?

I am connecting to SFTP and downloading a file using Perl. I want to download only files that were created/modified within the last hour.
Below is a code snippet.
use strict;
use Net::SFTP::Foreign;

my $sftp_conn = Net::SFTP::Foreign->new('test.sftp.com', user => 'test', password => 'test123');

my $a1 = Net::SFTP::Foreign::Attributes->new();
my $a2 = $sftp_conn->stat('/inbox/tested.txt')
    or die "remote stat command failed: " . $sftp_conn->status;

$sftp_conn->get("/inbox/tested.txt", "/tmp");
Here I want to check when the file was last modified and work out its age in hours.
You are on the right track. Calling ->stat on the connection object returns a Net::SFTP::Foreign::Attributes object. You can then call ->mtime on it to get the modification time.
my $attr = $sftp_conn->stat('/inbox/tested.txt')
or die "remote stat command failed: ".$sftp_conn->status;
print $attr->mtime;
There is no need to create an empty object first. You don't need the following line. You probably copied it from the SYNOPSIS in the docs, but that's just to show different ways of using that module. You can delete it.
my $a1 = Net::SFTP::Foreign::Attributes->new();
I don't know which format the mtime will be in, so I can't tell you how to do the comparison. There is nothing about that in the docs, in the code of the module or in the tests.
A quick Google search suggested "YYYYMMDDhhmmss", but that might not be the right one. Just try it. If it's a Unix timestamp, you can simply compare it to time or time - 3600, but if it's a string, you will need to parse it. Time::Piece is a useful module that ships with core Perl and can do that.
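If the mtime does turn out to be a plain Unix epoch timestamp (which is what the SFTP protocol itself uses for file attributes), the age check could look like this sketch; the host, credentials, and paths are the placeholders from the question:

use Net::SFTP::Foreign;

my $sftp = Net::SFTP::Foreign->new('test.sftp.com', user => 'test', password => 'test123');

my $attr = $sftp->stat('/inbox/tested.txt')
    or die "remote stat command failed: " . $sftp->status;

# Age in hours, assuming mtime is seconds since the epoch
my $age_hours = (time() - $attr->mtime) / 3600;

# Only fetch files modified within the last hour
if ($age_hours <= 1) {
    $sftp->get('/inbox/tested.txt', '/tmp/tested.txt')
        or die "get failed: " . $sftp->error;
}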

How to run a local program with user input in Perl

I'm trying to get user input from a web page written in Perl and send it to a local program (blastp), then display the results.
This is what I have right now:
(input code)
print $q->p, "Your database: $bd",
$q->p, "Your protein is: $prot",
$q->p, "Executing...";
print $q->p, system("blastp","-db $bd","-query $prot","-out results.out");
Now, I've done a little research, but I can't quite grasp how you're supposed to do things like this in Perl. I've tried opening a file, writing to it, and sending it over to blastp as an input, but I was unsuccessful.
For reference, this line produces a successful output file:
kold@sazabi ~/BLAST/pataa $ blastp -db pataa -query ../teste.fs -out results.out
I may need to force the bd to load from an absolute path, but that shouldn't be difficult.
edit: Yeah, the DBs have an environment variable; that's fixed. OK, all I need is to get the input into a file, pass it to the command, and then print the output file to the CGI page.
edit2: For clarification:
I am receiving user input in $prot. I want to pass it to blastp via -query, have blastp execute, and then show the user the results.out file (or just a link to it, since blastp can output HTML).
EDIT:
All right, I fixed everything I needed to fix. The big problem was not seeing what was going wrong: I had to install Capture::Tiny and print out STDERR, which was when I realized the environment variable wasn't being set correctly, so BLAST wasn't finding my databases. Thanks for all the help!
Write $prot to the file. Assuming you need to do it as-is without processing the text to split it or something:
For a fixed file name (may be problematic):
use File::Slurp;
write_file("../teste.fs", $prot, "\n") or print_error_to_web();
# Implement the latter to print error in nice HTML format
For a temp file (better):
use File::Temp qw(tempfile);

# $template is a filename template such as "blastXXXXXX" (the Xs get replaced);
# UNLINK => 1 removes the temp file automatically when the program exits
my ($fh, $filename) = tempfile($template, DIR => "..", UNLINK => 1);
# You can also create a temp directory, which is even better, via tempdir()
print $fh "$prot\n";
close $fh;
Step 2: Run your command as you indicated:
my $rc = system("$BLASTP_PATH/blastp", "-db", "pataa",
                "-query", "../teste.fs", "-out", "results.out");
# Process $rc for errors
# Use qx[] instead of system() if you want to capture
# standard output of the command
Step 3: Read the output file in:
use File::Slurp;
my $out_file_text = read_file("results.out");
Step 4: Send it back in the web response:
print $q->p, $out_file_text;
The above code has multiple issues (e.g. you need better file/directory paths, more error handling, etc.), but it should start you on the right track.
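Since the asker's final fix involved Capture::Tiny, here is a hedged sketch of how capturing the external command's output around the system() call might look; $BLASTP_PATH and the temp file $filename are the variables assumed from the steps above:

use Capture::Tiny qw(capture);

# Capture stdout/stderr of the external command so problems such as a
# missing BLASTDB environment variable actually show up somewhere.
my ($stdout, $stderr, $rc) = capture {
    system("$BLASTP_PATH/blastp", "-db", "pataa",
           "-query", $filename, "-out", "results.out");
};

# $rc is the raw wait status from system(); non-zero means failure
warn "blastp failed (status $rc): $stderr" if $rc != 0;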

Perl '-s' file test operator problem

I'm debugging a piece of code which uses the Perl '-s' function to get the size of some files.
my $File1 = '/myfolder/.../mysubfolder1/document.pdf';
my $File2 = '/myfolder/.../mysubfolder2/document.pdf';
my $File3 = '/myfolder/.../mysubfolder1/document2.pdf';  # $File3 is actually a link to /myfolder/.../mysubfolder2/document.pdf, aka $File2
The code which is buggy is:
my $size = int((-s $File)/1024);
Where $File is replaced with $File1 through $File3.
For some reason I can't explain, this does not work on every file.
It works for $File1 and $File3 but not for $File2. I could understand it if both $File2 and $File3 failed; that would mean the file /myfolder/.../mysubfolder2/document.pdf is somehow corrupt.
I even added an if (-e $File) { ... } test before the -s to be sure the file exists, and all three files do exist.
There is something even stranger: there is an .htaccess in /myfolder/.../mysubfolder1/ but no .htaccess in /myfolder/.../mysubfolder2/. If it were the other way around, I would think the .htaccess was somehow blocking the -s call.
Any thoughts?
If -s fails, it returns undef and sets the error in $!. What is $!?
I suppose that if you check the size of that file with stat you will get something less than 1024 bytes :)
Your int((-s $fn)/1024) will return 0 if the size is less than 1024.
To address the end of your question: the .htaccess file controls access to files requested through the web server. Once the user requests a URL that executes a valid, permissible CGI/whatever script (I'm assuming your Perl code runs in a web context), THAT script has absolutely no permissioning issues regarding .htaccess (unless you actually code your Perl to read its contents and respect them explicitly by hand).
The only permissioning that can screw up your Perl file is the file system permissions in your OS.
To get the file size, your web user needs:
Execute permission on the directory containing the file
Possibly, read permission on the directory containing the file (not sure if the file size is stored in the inode?)
Possibly, read permission on the file itself.
If all your 3 files (2 good and 1 bad) are in the same directory, check the file's read permissions.
If they are in different directories, check the file's read perms AND directory perms.
Change int((-s $file)/1024) to sprintf('%.0f', (-s $file)/1024) and you'll see something then; the file is probably under 1024 bytes, so int() happily returns 0.
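A minimal sketch combining the suggestions above: check whether -s actually returned a size (and report $! if not) before rounding to kilobytes. $File2 is the path from the question:

my $bytes = -s $File2;

if (!defined $bytes) {
    # -s returned undef: the underlying stat() failed, and $! says why
    # (bad path, permissions on the file or a parent directory, ...)
    die "Cannot stat $File2: $!";
}

# int() truncates anything under 1024 bytes to 0; round instead
my $kb = sprintf('%.0f', $bytes / 1024);
print "$File2 is $kb KB ($bytes bytes)\n";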

Is it possible to send POST parameters to a CGI script without another HTTP request?

I'm attempting to run a CGI script in the current environment from another Perl module. Everything works well using standard system calls for GET requests. POST is fine too, until the parameter list gets too long; then the parameters get cut off.
Has anyone run into this problem, or do you have any suggestions for other ways to attempt this?
The following are somewhat simplified for clarity. There is more error checking, etc.
For GET requests and POST requests w/o parameters, I do the following:
# $query is a CGI object.
my $perl = $^X;
my $cgi = $cgi_script_location; # /path/file.cgi
system {$perl} $cgi;
Parameters are passed through the QUERY_STRING environment variable. STDOUT is captured by the calling script, so whatever the CGI script prints behaves as normal. This part works.
For POST requests with parameters the following works, but seemingly limits my available query length:
# $query is a CGI object.
my $perl = $^X;
my $cgi = $cgi_script_location; # /path/file.cgi
# Gather parameters into a URL-escaped string suitable
# to pass to a CGI script ran from the command line.
# Null characters are handled properly.
# e.g., param1=This%20is%20a%20string&param2=42&... etc.
# This works.
my $param_string = $self->get_current_param_string();
# Various ways to do this, but system() doesn't pass any
# parameters (different question).
# Using qx// and printing the return value works as well.
open(my $cgi_pipe, "|$perl $cgi");
print {$cgi_pipe} $param_string;
close($cgi_pipe);
This method works for short parameter lists, but if the entire command gets close to 1000 characters, the parameter list is cut short. This is why I attempted to save the parameters to a file: to avoid shell limitations.
If I dump the parameter list from the executed CGI script I get something like the following:
param1=blah
... a bunch of other parameters ...
paramN=whatever
p <-- cut off after 'p'. There are more parameters.
Other things I've done that didn't help or work
Followed the CGI troubleshooting guide
Saved the parameters to a file using CGI->save(), passing that file to the CGI script. Only the first parameter is read using this method.
$> perl index.cgi < temp-param-file
Saved $param_string to a file, passing that file to the CGI script just like above. Same limitations as passing the commands through the command line; still gets cut off.
Made sure $CGI::POST_MAX is acceptably high (it's -1).
Made sure the CGI's command-line processing was working. (:no_debug is not set)
Ran the CGI from the command line with the same parameters. This works.
Leads
Obviously, this seems like a character limit of the shell Perl is using to execute the command, but it wasn't resolved by passing the parameters through a file.
Passing parameters to system as a single string, from HTTP input, is extremely dangerous.
From perldoc -f system,
If there is only one scalar argument, the argument is checked for shell metacharacters, and if there are any, the entire argument is passed to the system's command shell for parsing (this is /bin/sh -c on Unix platforms, but varies on other platforms). If there are no shell metacharacters in the argument,..
In other words, if I pass in arguments like -e printf("working..."); rm -rf /; I can delete information from your disk (everything, if your web server is running as root). If you choose to do this, make sure you call system("perl", @cgi) instead.
The argument length issue you're running into may be an OS limitation (described at http://www.in-ulm.de/~mascheck/various/argmax/):
There are different ways to learn the upper limit:
command: getconf ARG_MAX
system header: ARG_MAX in e.g. <[sys/]limits.h>
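From Perl itself, the same limit can usually be queried through the core POSIX module (a sketch; on platforms where the limit is indeterminate, sysconf may return undef):

use POSIX qw(sysconf _SC_ARG_MAX);

# Maximum combined length of arguments + environment for exec()
my $arg_max = sysconf(_SC_ARG_MAX);
print defined $arg_max ? "ARG_MAX is $arg_max bytes\n"
                       : "ARG_MAX is not available on this platform\n";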
Saving to a temp file is risky: multiple calls to the CGI might save to the same file, creating a race condition where one user's parameters might be used by another user's process.
You might try opening a file handle to the process and passing the parameters on standard input instead, e.g. open my $cgi_pipe, '|-', 'perl', $cgi or die $!; print {$cgi_pipe} $param_string;
I didn't want to do this, but I've gone with the most direct approach and it works. I'm tricking the environment to think the request method is GET so that the called CGI script will read its input from the QUERY_STRING environment variable it expects. Like so:
$ENV{'QUERY_STRING'} = $long_parameter_string . '&' . $ENV{'QUERY_STRING'};
$ENV{'REQUEST_METHOD'} = 'GET';
system {$perl_exec} $cgi_script;
I'm worried about potential problems this may cause, but I can't think of what this would harm, and it works well so far. But, because I'm worried, I thought I'd ask the horde if they saw any potential problems:
Are there any problems handling a POST request as a GET request on the server
I'll save marking this as the official answer until people have confirmed or at least debated it on the above post.
It turns out that the problem is actually related to the difference in Content-Length between the original parameters and the parameter string I cobbled together. I didn't realize that the CGI module uses this value from the original headers as the limit on how much input to read (makes sense!). Apparently the extra escaping I was doing added some characters.
The trick in my solution is simply to piece together the parameter string I'll be passing and set the environment variable the CGI module checks for the content length to the length of that string.
Here's the final working code:
use CGI::Util qw(escape);

my $params;
foreach my $param (sort $query->param) {
    my $escaped_param = escape($param);
    foreach my $value ($query->param($param)) {
        $params .= "$escaped_param=" . escape("$value") . "&";
    }
}
foreach (keys %{$query->{'.fieldnames'}}) {
    $params .= ".cgifields=" . escape("$_") . "&";
}

# This is the trick.
$ENV{'CONTENT_LENGTH'} = length($params);

open(my $cgi_pipe, "| $perl $cgi_script") || die("Cannot fork CGI: $!");
local $SIG{PIPE} = sub { warn "spooler pipe broke" };
print {$cgi_pipe} $params;
warn("param chars: " . length($params));
close($cgi_pipe) || warn "Error: CGI exited with value $?";
Thanks for all the help!

Can I get the MD5sum of a directory with Perl?

I am writing a Perl script (on Windows) that uses File::Find to index a network file system. It works great, but it takes a very long time to crawl the file system. I was thinking it would be nice to somehow get a checksum of a directory before traversing it, and if the checksum matches the checksum taken on a previous run, skip the directory. This would eliminate a lot of processing, since the files on this file system do not change often.
On my AIX box, I use this command:
csum -h MD5 /directory
which returns something like this:
5cfe4faf4ad739219b6140054005d506 /directory
The command takes very little time:
time csum -h MD5 /directory
5cfe4faf4ad739219b6140054005d506 /directory
real 0m0.00s
user 0m0.00s
sys 0m0.00s
I have searched CPAN for a module that will do this, but it looks like all the modules will give me the MD5sum for every file in a directory, not for the directory itself.
Is there a way to get the MD5sum of a directory in Perl, or even in Windows for that matter, since I could call a Win32 command from Perl?
Thanks in advance!
Can you just read the last-modified dates of the files and folders? Surely that's going to be faster than building MD5s?
In order to get a checksum you must read the files, which means you will need to walk the file system, putting you back in the same boat you are trying to get out of.
From what I know, you cannot get an MD5 of a directory. md5sum on other systems complains when you give it a directory. csum is most likely giving you a hash of the top-level directory file's contents, not traversing the tree.
You can grab the modified times for the files and hash them how you like by doing something like this:
sub dirModified {
    my ($dir) = @_;

    opendir(my $dh, $dir) or return;
    my @dircontents = readdir($dh);
    closedir($dh);

    foreach my $item (@dircontents) {
        my $path = "$dir/$item";
        if ( -f $path ) {
            # -M gives the age in days since last modification
            print -M($path) . " : $item - do stuff here\n";
        } elsif ( -d $path && $item !~ /^\.+$/ ) {
            dirModified($path);
        }
    }
}
Yes, it will take some time to run.
In addition to the other good answers, let me add this: if you want a checksum, then please use a checksum algorithm instead of a (broken!) hash function.
I don't think you need a cryptographically secure hash function in your file indexer; instead you need a way to detect changes in the directory listings without storing the entire listing. Checksum algorithms do exactly that: they return a different output when the input changes, and they can be faster since they are simpler than hash functions.
It is true that a user could change a directory in a way that the checksum would not detect. However, a user would have to change the file names like this on purpose, since normal changes to file names will (with high probability) give different checksums. Is it really necessary to guard against this "attack"?
One should always consider the consequences of each attack and choose the appropriate tools.
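A minimal sketch of the metadata-fingerprint idea discussed above: digest the names, sizes, and mtimes of a directory's entries (core Digest::MD5 is used here purely for convenience; any checksum would do) and compare the result with the value stored on the previous run before deciding whether to descend into that directory.

use Digest::MD5;

# Fingerprint a single directory from its listing metadata only;
# no file contents are read, so this is cheap.
sub dir_fingerprint {
    my ($dir) = @_;
    opendir(my $dh, $dir) or die "Cannot open $dir: $!";
    my $md5 = Digest::MD5->new;
    for my $entry (sort grep { $_ ne '.' && $_ ne '..' } readdir $dh) {
        my ($size, $mtime) = (stat("$dir/$entry"))[7, 9];
        $md5->add(join '|', $entry, $size // 0, $mtime // 0);
    }
    closedir $dh;
    return $md5->hexdigest;
}

# %seen would be loaded from the previous run's index (an assumption);
# skip a directory whose fingerprint has not changed:
# next if ($seen{$dir} // '') eq dir_fingerprint($dir);

Like the csum behaviour speculated about above, this only reflects the directory's immediate entries, so a change deep inside a subtree is only noticed once you descend to the directory that actually changed.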
I did one of these in Python if you're interested:
http://akiscode.com/articles/sha-1directoryhash.shtml