Download only new/modified files with Perl (or wget)

I have a Perl script which downloads a large number of files from a remote server. I'd like to avoid hammering the server, so I'd like to avoid downloading a file if it hasn't been modified since my last check. Is there a good way to do this, either in Perl or with a shell script?
Can I get the server to send HTTP 304 rather than HTTP 200 for unmodified files?

Yes, use LWP::UserAgent and pay special attention to the mirror method. This is also available in the procedural LWP::Simple as the mirror function.
From LWP's POD:
This method will get the document identified by $url and store it in file called $filename. If the file already exists, then the request will contain an "If-Modified-Since" header matching the modification time of the file. If the document on the server has not changed since this time, then nothing happens. If the document has been updated, it will be downloaded again. The modification time of the file will be forced to match that of the server.
The return value is the response object.
HTTP 304 is the response code the server will return if you pass the If-Modified-Since test and your copy is fresh. LWP does this internally with mirror -- you needn't worry about it.
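For example, a minimal sketch using LWP::UserAgent's mirror method (the URL and local filename here are placeholders, not from the original question):
use strict;
use warnings;
use LWP::UserAgent;

# Placeholder URL and local filename -- substitute your own.
my $url      = 'http://example.com/data/file.txt';
my $filename = 'file.txt';

my $ua  = LWP::UserAgent->new;
my $res = $ua->mirror($url, $filename);

if ($res->code == 304) {
    print "Not modified, keeping the local copy.\n";
} elsif ($res->is_success) {
    print "Downloaded a fresh copy to $filename.\n";
} else {
    print "Error: ", $res->status_line, "\n";
}
mirror stores the body straight to disk and sets the file's modification time from the server's Last-Modified header, which is what makes the If-Modified-Since check work on the next run.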

This is based on Evan Carrol's answer, but I'm going to elaborate in case this is useful for someone else. I stubbed out the response section; I doubt that part of my code will be interesting.
#!/usr/bin/perl -w
require HTTP::Date;
require LWP::UserAgent;
require Date::Parse;

my $lastChecked = '2009-01-01';

my $ua = LWP::UserAgent->new;
$ua->default_header(
    'If-Modified-Since' => HTTP::Date::time2str(Date::Parse::str2time($lastChecked))
);

my $response = $ua->get('http://example.com/');

if ($response->code == 304) {
    print "No changes.\n";
} elsif ($response->is_success) {
    print $response->decoded_content;
} else {
    print "Response was error " . $response->code . ": '" . $response->status_line . "'\n";
}

Related

Perl get request returns empty response, maybe session related?

I was using an open source tool called SimTT which fetches a URL for a table tennis league and then calculates the probable results (e.g. rankings of teams and players). Unfortunately the page has moved to a different URL.
I downloaded the source and fixed the parsing of the page, but at the moment I can only download the page manually and then read it from a file.
Below is an excerpt of my code to retrieve the page. It prints success, but the response is empty. Unfortunately I'm not very familiar with Perl and web techniques, but in Wireshark I could see that one of the last things sent was a new session key. I'm not sure whether the problem is related to cookies, SSL or something like that.
It would be very nice if someone could help me to get access. I know that there are some people out there who would like to use the tool.
So here's the code:
use LWP::UserAgent ();
use HTTP::Request;
use HTTP::Cookies;

my $ua = LWP::UserAgent->new(keep_alive => 1);
$ua->agent('Mozilla/5.0');
$ua->cookie_jar({});

my $request = HTTP::Request->new('GET',
    'https://www.mytischtennis.de/clicktt/ByTTV/18-19/ligen/Bezirksoberliga/gruppe/323819/mannschaftsmeldungen/vr');
my $response = $ua->request($request);

if ($response->is_success) {
    print "Success: ", $response->decoded_content;
}
else {
    die $response->status_line;
}
Either there is some rudimentary anti-bot protection at the server, or the server is misconfigured or otherwise broken. It looks like it expects an Accept-Encoding header in the request, which LWP does not send by default. The value of this header does not really seem to matter: the server will send the content compressed with gzip if the client claims to support it, but it will send uncompressed data if the client only offers a compression method unknown to the server.
With this knowledge one can change the code like this:
my $request = HTTP::Request->new('GET',
'https://www.mytischtennis.de/clicktt/ByTTV/18-19/ligen/Bezirksoberliga/gruppe/323819/mannschaftsmeldungen/vr',
[ 'Accept-Encoding' => 'foobar' ]
);
With this simple change the code currently works for me. Note that it might stop working at any time if the server setup changes, i.e. it might then need other workarounds.
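For reference, here is a sketch of how the fixed request slots into the original script (everything except the added Accept-Encoding header is taken from the question; the 'foobar' value is arbitrary):
use strict;
use warnings;
use LWP::UserAgent ();
use HTTP::Request;
use HTTP::Cookies;

my $ua = LWP::UserAgent->new(keep_alive => 1);
$ua->agent('Mozilla/5.0');
$ua->cookie_jar({});

# The Accept-Encoding value is arbitrary -- the server only seems to care
# that the header is present at all.
my $request = HTTP::Request->new(
    'GET',
    'https://www.mytischtennis.de/clicktt/ByTTV/18-19/ligen/Bezirksoberliga/gruppe/323819/mannschaftsmeldungen/vr',
    [ 'Accept-Encoding' => 'foobar' ],
);

my $response = $ua->request($request);
if ($response->is_success) {
    print "Success: ", $response->decoded_content;
}
else {
    die $response->status_line;
}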

Better way to proxy an HTTP request using Perl HTTP::Response and LWP?

I need a Perl CGI script that fetches a URL and then returns the result of the fetch - the status, headers and content - unaltered to the CGI environment so that the "proxied" URL is returned by the web server to the user's browser as if they'd accessed the URL directly.
I'm running my script from cgi-bin in an Apache web server on an Ubuntu 14.04 host, but this question should be independent of server platform - anything that can run Perl CGI scripts should be able to do it.
I've tried using LWP::UserAgent::request() and I've got very close. It returns an HTTP::Response object that contains the status code, headers and content, and even has an "as_string" method that turns it into a human-readable form. The problem from a CGI perspective is that as_string renders the status as "HTTP/1.1 200 OK" rather than "Status: 200 OK", so the Apache server doesn't recognise the output as a valid CGI response.
I can fix this by using other methods in HTTP::Response to split out the various parts, but there seems to be no public way of getting at the encapsulated HTTP::Headers object in order to call its as_string method; instead I have to hack into the Perl blessed object hash and yank out the private "_headers" member directly. To me this seems slightly evil, so is there a better way?
Here's some code to illustrate the above. If you put it in your cgi-bin directory then you can call it as
http://localhost/cgi-bin/lwp-test?url=http://localhost/&http-response=1&show=1
You can use a different URL for testing if you want. If you set http-response=0 (or drop the param altogether) then you get the working piece-by-piece solution. If you set show=0 (or drop it) then the proxied request is returned by the script. Apache will return the proxied page if you have http-response=0 and will choke with a 500 Internal Server Error if it's 1.
#!/usr/bin/perl
use strict;
use warnings;
use CGI::Simple;
use HTTP::Request;
use HTTP::Response;
use LWP::UserAgent;

my $q   = CGI::Simple->new();
my $ua  = LWP::UserAgent->new();
my $req = HTTP::Request->new(GET => $q->param('url'));
my $res = $ua->request($req);

# print a text/plain header if called with "show=1" in the query string
# so the proxied URL response is shown in the browser, otherwise just output
# the proxied response as if it were ours.
if ($q->param('show')) {
    print $q->header("text/plain");
    print "\n";
}

if ($q->param('http-response')) {
    # This prints the status as "HTTP/1.1 200 OK", not "Status: 200 OK".
    print $res->as_string;
} else {
    # This works correctly as a proxy, but using {_headers} to get at
    # the private encapsulated HTTP::Headers object seems a bit evil.
    # There must be a better way!
    print "Status: ", $res->status_line, "\n";
    print $res->{_headers}->as_string;
    print "\n";
    print $res->content;
}
Please bear in mind that this script was written purely to demonstrate how to forward an HTTP::Response object to the CGI environment and bears no resemblance to my actual application.
You can avoid reaching into the internals of the response object at $res->{_headers} by using the $res->headers method, which returns the actual HTTP::Headers instance that is used. HTTP::Response inherits that method from HTTP::Message.
It would then look like this:
print "Status: ", $res->status_line, "\n";
print $res->headers->as_string;
That looks less evil, though it's still not pretty.
As simbabque pointed out, HTTP::Response has a headers method through inheritance from HTTP::Message. We can tidy up the handling of the status code by using the header method to push the status line into the embedded HTTP::Headers object, then use headers_as_string to print out the headers more cleanly. Here's the final script:
#!/usr/bin/perl
use strict;
use warnings;
use CGI::Simple;
use HTTP::Request;
use HTTP::Response;
use LWP::UserAgent;

my $q   = CGI::Simple->new();
my $ua  = LWP::UserAgent->new();
my $req = HTTP::Request->new(GET => $q->param('url'));
my $res = $ua->request($req);

# print a text/plain header if called with "show=1" in the query string
# so the proxied URL response is shown in the browser, otherwise just output
# the proxied response as if it were ours.
if ($q->param('show')) {
    print $q->header("text/plain");
}

# $res->as_string returns the status in an "HTTP/1.1 200 OK" line rather than
# a "Status: 200 OK" header field, so it can't be used for a CGI response.
# We therefore have a little more work to do...

# convert status from line to header field
$res->header("Status", $res->status_line);

# now print headers and content - don't forget a blank line between the two
print $res->headers_as_string, "\n", $res->content;

Cancel Download using WWW::Mechanize in Perl

I have written a Perl script which would check a list of URLs and connect to them by sending a GET request.
Now, let's say that one of these URLs points to a very big file, for instance one larger than 100 MB.
When a request is sent to download this file using this:
my $mech = WWW::Mechanize->new();
my $url  = "http://somewebsitename.com/very_big_file.txt";
$mech->get($url);
Once the GET request is sent, it will start downloading the file. I want this to be cancelled using WWW::Mechanize. How can I do that?
I checked the documentation of this Perl Module here:
http://metacpan.org/pod/WWW::Mechanize
However, I could not find a method which would help me do this.
Thanks.
Aborting a GET request
Using the :content_cb option, you can provide a callback function to get() that will be executed for each chunk of response content received from the server. You can set* the chunk size (in bytes) using the :read_size_hint option. These options are documented in LWP::UserAgent (get() in WWW::Mechanize is just an overridden version of the same method in LWP::UserAgent).
The following request will be aborted after reading 1024 bytes of response content:
use WWW::Mechanize;

sub callback {
    my ($data, $response, $protocol) = @_;
    die "Too much data";
}

my $mech = WWW::Mechanize->new;
my $url  = 'http://www.example.com';
$mech->get($url, ':content_cb' => \&callback, ':read_size_hint' => 1024);
print $mech->response()->header('X-Died');
Output:
Too much data at ./mechanize line 12.
Note that the die in the callback does not cause the program itself to die; it simply sets the X-Died header in the response object. You can add the appropriate logic to your callback to determine under what conditions a request should be aborted.
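For example, a callback might track how many bytes have arrived and abort only once a threshold is exceeded (the 1 MB limit and the 8192-byte read size here are arbitrary choices):
use WWW::Mechanize;

my $limit    = 1024 * 1024;    # arbitrary 1 MB cap
my $received = 0;

sub limited_callback {
    my ($data, $response, $protocol) = @_;
    $received += length $data;
    die "Exceeded $limit bytes" if $received > $limit;
    # otherwise do nothing and let the download continue
}

my $mech = WWW::Mechanize->new;
$mech->get('http://www.example.com',
    ':content_cb'     => \&limited_callback,
    ':read_size_hint' => 8192,
);

if (my $died = $mech->response()->header('X-Died')) {
    warn "Download aborted: $died\n";
}
If you need the partial content, accumulate $data yourself inside the callback, since with :content_cb the response object's content is not populated.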
Don't even fetch the URL if the content is too large
Based on your comments, it sounds like what you really want is to never send the request in the first place if the content is too large. That is quite different from aborting a GET request midway through: you can fetch the Content-Length header with a HEAD request and decide what to do based on its value:
my @urls = qw(http://www.example.com http://www.google.com);

foreach my $url (@urls) {
    $mech->head($url);
    if ($mech->success) {
        my $length = $mech->response()->header('Content-Length') // 0;
        next if $length > 1024;
        $mech->get($url);
    }
}
Note that according to the HTTP spec, applications should set the Content-Length header. This does not mean that they will (hence the default value of 0 in my code example).
* According to the documentation, the protocol module "will try to read data from the server in chunks of this size", but I don't think it's guaranteed.

Perl CGI with HTTP Status Codes

I have the following validation in a CGI script that checks for the GET method and should return a 405 HTTP status code if the GET method is not used. Unfortunately it is still returning a 200 OK status when using POST or PUT.
my ($buffer);

# Read in text
$ENV{'REQUEST_METHOD'} =~ tr/a-z/A-Z/;
if ($ENV{'REQUEST_METHOD'} eq "GET")
{
    $buffer = $ENV{'QUERY_STRING'};
}
else
{
    $cgi->$header->status('405 Method Not Allowed')
    print $cgi->header('text/plain');
}
I am still new to CGI programming, so I figured someone here could toss me a bone about working with CGI and HTTP status returns. A pointer to a good CGI doc would be awesome, as most of what search turns up is either the CPAN docs (already read a few times) or really old tutorials that are not object oriented.
The CPAN docs are more than enough for CGI. If you want newer tutorials, don't use CGI; use one of the MVC frameworks (Catalyst, Dancer2, Mojo, etc.).
You can send a 405 status if you change:
$cgi->$header->status('405 Method Not Allowed');
print $cgi->header('text/plain');
to this:
print $cgi->header(
    -type   => 'text/plain',
    -status => '405 Method Not Allowed'
);
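Put together as a complete script, a minimal sketch might look like this (the plain-text bodies and the use of CGI.pm itself are assumptions based on the question; the status-bearing header call is the fix from above):
#!/usr/bin/perl
use strict;
use warnings;
use CGI;

my $cgi = CGI->new;
my $buffer;

if (uc($ENV{'REQUEST_METHOD'} // '') eq 'GET') {
    $buffer = $ENV{'QUERY_STRING'};
    print $cgi->header('text/plain');
    print "OK\n";
}
else {
    # The status has to be part of the header that is printed,
    # otherwise the server defaults to 200 OK.
    print $cgi->header(
        -type   => 'text/plain',
        -status => '405 Method Not Allowed',
    );
    print "Method Not Allowed\n";
}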

How to detect a changed webpage?

In my application, I fetch webpages periodically using LWP. Is there any way to check whether a webpage has changed between two consecutive fetches, other than explicitly comparing the content? Is there any signature (say, a CRC) generated at the lower protocol layers which can be extracted and compared against older signatures to detect possible changes?
There are two possible approaches. One is to use a digest of the page, e.g.
use strict;
use warnings;
use Digest::MD5 'md5_hex';
use LWP::UserAgent;

# fetch the page, etc.

my $digest = md5_hex $response->decoded_content;
if ( $digest ne $saved_digest ) {
    # the page has changed.
}
Another option is to use an HTTP ETag, if the server provides one for the resource requested. You can simply store it and then set your request headers to include an If-None-Match field on subsequent requests. If the server ETag has remained the same, you'll get a 304 Not Modified status and an empty response body. Otherwise you'll get the new page. (And new ETag.) See Entity Tags in RFC2616.
Of course, the server could be lying, and sending the same ETag even though the content has changed. There's no way to know unless you look.
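A sketch of the ETag approach could look like this (the URL is a placeholder, and persisting $saved_etag between runs is left out):
use strict;
use warnings;
use LWP::UserAgent;

my $url = 'http://www.example.com/';
my $ua  = LWP::UserAgent->new;

# $saved_etag is the ETag remembered from an earlier fetch, if any.
my $saved_etag;

my $res = $ua->get(
    $url,
    ( defined $saved_etag ? ( 'If-None-Match' => $saved_etag ) : () ),
);

if ( $res->code == 304 ) {
    # Not modified -- keep using the copy you already have.
}
elsif ( $res->is_success ) {
    $saved_etag = $res->header('ETag');    # remember for next time
    my $content = $res->decoded_content;
    # ... process the new page
}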
You should use the If-Modified-Since request header, noting the gotchas in the RFC. You send this header with the request. If the server supports it and thinks the content is newer, it sends it to you. If it thinks you have the most recent version, it returns a 304 with no message body.
However, as other answers have noted, the server doesn't have to tell you the truth, so you're sometimes stuck downloading the content and checking yourself. Many dynamic things will always claim to have new content because many developers have never thought about supporting basic HTTP things in their web apps.
For the LWP bits, you can create a single request with an extra header:
use HTTP::Request;
use LWP::UserAgent;

my $ua      = LWP::UserAgent->new;
my $request = HTTP::Request->new( GET => $url );

# $time should be an HTTP date string, e.g. from HTTP::Date's time2str
$request->header( 'If-Modified-Since' => $time );
$ua->request( $request );
For all requests, you can set a request handler:
$ua->add_handler(
    request_send => sub {
        my ($request, $ua, $h) = @_;
        # ... look up time from local store
        $request->header( 'If-Modified-Since' => $time );
    }
);
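The "local store" can be as simple as a hash of last-fetch times keyed by URL. Here is one possible sketch (the %last_fetch hash and the response_done bookkeeping are my additions, not part of LWP itself; a real application would persist the hash between runs):
use HTTP::Date qw(time2str);

# Hypothetical in-memory store of last successful fetch times, keyed by URL.
my %last_fetch;

$ua->add_handler(
    request_send => sub {
        my ($request, $ua, $h) = @_;
        my $time = $last_fetch{ $request->uri };
        $request->header( 'If-Modified-Since' => time2str($time) ) if $time;
        return;
    }
);

$ua->add_handler(
    response_done => sub {
        my ($response, $ua, $h) = @_;
        $last_fetch{ $response->request->uri } = time
            if $response->is_success;
        return;
    }
);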
However, LWP can do most of this for you with mirror if you want to save the files:
$ua->mirror( $url, $filename );