How to obtain the 301/302 website redirect location from the http response and follow it? - perl

I have been trying to obtain the 301/302 redirect location from the HTTP response using Perl's WWW::Mechanize, but I have been having problems extracting it from the response using things like $response->header and so on.
Can anyone help with extracting the redirect location from the HTTP responses of websites that use 301 or 302 redirects?
I know what I want to do and how to do it once I have the redirect location URL, as I have done more complex things with Mechanize before; I'm just having real problems getting the location (or any other response fields) out of the HTTP response.
Your help would be much appreciated, Many thanks, CM

WWW::Mechanize should automatically follow redirects (unless you've told it not to via requests_redirectable), so you should not need to do anything.
EDIT: just to demonstrate:
DB<4> $mech = WWW::Mechanize->new;
DB<5> $mech->get('http://www.preshweb.co.uk/linkedin');
DB<6> x $mech->uri;
0 URI::http=SCALAR(0x903f990)
-> 'http://www.linkedin.com/in/bigpresh'
... as you can see, WWW::Mechanize followed the redirect, and ended up at the destination, automatically.
Updated with another example as requested:
DB<15> $mech = WWW::Mechanize->new;
DB<16> $mech->get('http://jjbsports.com/');
DB<17> x $mech->uri;
0 URI::http=SCALAR(0x90988f0)
-> 'http://www.jjbsports.com/'
DB<18> x substr $mech->content, 0, 40;
0 '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML'
DB<19> x $mech->title;
0 'JJB Sports | Trainers, Clothing, Football Kits, Football Boots, Running'
As you can see, it followed the redirect, and $mech->content is returning the content of the page. Does that help at all?
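If you still want to see the Location header itself (for logging, say), the redirect responses that Mechanize followed are kept on the final response object. A minimal sketch, assuming a reasonably recent LWP where HTTP::Response exposes redirects():
use strict;
use warnings;
use WWW::Mechanize;

my $mech = WWW::Mechanize->new;
$mech->get('http://www.preshweb.co.uk/linkedin');

# redirects() returns the HTTP::Response for each hop that led to the final page
for my $hop ( $mech->response->redirects ) {
    printf "%s %s -> %s\n",
        $hop->code,                # e.g. 301 or 302
        $hop->request->uri,        # URL that was requested
        $hop->header('Location');  # where that hop pointed
}
print "Final URL: ", $mech->uri, "\n";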

If it's a redirect, WWW::Mechanize uses $mech->redirect_ok() while request()ing to decide whether to follow the redirect URL (this is an LWP method).
Note -
WWW::Mechanize's constructor pushes POST on to the agent's
requests_redirectable list
So you won't have to worry about pushing POST onto the requests_redirectable list yourself.
If you want to be absolutely certain that the program is following your redirects, and to log every redirect to a log file (or similar), you can use LWP's simple_request and HTTP::Response's is_redirect to detect redirects yourself, something like this:
use strict;
use warnings;
use WWW::Mechanize;
use HTTP::Request;
use URI;

my $mech = WWW::Mechanize->new();
$mech->stack_depth(0);
# simple_request() does not follow redirects, so the 3xx response is visible here
my $resp = $mech->simple_request( HTTP::Request->new( GET => 'http://www.googl.com/' ) );
if ( $resp->is_redirect ) {
    my $location = $resp->header("Location");
    my $uri      = URI->new($location);
    print "Got redirected to URL - $uri\n";
    $mech->get($uri);
    print $mech->content;
}
is_redirect is true for any 3xx status code, so it will detect both 301 and 302 responses.
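If the target may bounce through several hops, the same idea can go in a loop with a hop limit and a log line per redirect; this is a sketch only, with the starting URL taken from the example above and the limit chosen arbitrarily:
use strict;
use warnings;
use WWW::Mechanize;
use HTTP::Request;
use URI;

my $mech = WWW::Mechanize->new();
my $url  = 'http://www.googl.com/';
my $hops = 10;                                     # arbitrary safety limit

while ( $hops-- > 0 ) {
    my $resp = $mech->simple_request( HTTP::Request->new( GET => $url ) );
    last unless $resp->is_redirect;
    # Location may be relative, so resolve it against the current URL
    $url = URI->new_abs( $resp->header('Location'), $url )->as_string;
    print STDERR "Redirected to $url\n";           # or append to your log file
}
print "Landed on $url\n";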

Related

Mojolicious not following redirection from webarchive.org

I'm using Mojolicious's DOM and UserAgent to get the source of a page from Webarchive.org, parse it, and import it into a Dotclear database (using webarchive as a backup).
In the source, there are "Previous" and "Next" links allowing you to get to the different posts originally made on the blog.
The Perl script I have developed is supposed to run through those links to import all pages of this blog's snapshot.
It first gets the source of the first post of the blog, parses it, puts the result in a local DB, and follows the link under "Next" to do the same thing on the next post, until there are no more "Next" posts.
That's the basic idea.
But the trick is that the link I get from the source is not the link Webarchive has.
Webarchive's links to snapshots look like this:
http://web.archive.org/web/20131012182412/http://www.mytarget.com/post?mypost
The big number between "web" and the original URL is (I guess) the date the snapshot was made. The trick is that it changes with each snapshot, and although it may appear on one post, the next post may have been snapshotted on another date, so the URL won't match.
When I click the link I get from the source, it brings me to webarchive.org, which automatically searches for the page I pass and redirects me to it.
But when I try to get the source via the get() function of Mojolicious, it just gets the "Page not found" page of webarchive.
So here is my question: is there a way to make Mojolicious follow webarchive's redirection? I activated max_redirects(5) on my UserAgent, but it still does the same thing.
Here is my code:
sub main {
    my ($url) = @_;
    my $ua = Mojo::UserAgent->new;
    $ua = $ua->max_redirects(5);
    my $dom = $ua->get($url)->res->dom;
    # ... treatment and parsing of the source ...
    return $nextUrl;
}
my $nextUrl = "http://web.archive.org/web/20131012182412/http://www.mytarget.com/post?mypost";
my $secondUrl;
while ($nextUrl) {
    $secondUrl = main($nextUrl);
    $nextUrl = $secondUrl;
}
Thanks in advance...
I've finally found a way around it.
I use this piece of code to follow the URL and get the final URL reached:
use LWP::UserAgent qw();

my $ua  = LWP::UserAgent->new;
my $ret = $ua->get($url);
# ->request->uri is the URI of the final request, i.e. the URL reached after
# LWP has followed any redirects; concatenating "" forces stringification.
$url = $ret->request->uri . "";
print "URL returned: $url\n";
Then I use that URL to get the source code and parse it.
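For what it's worth, the same thing can be done without leaving Mojolicious: once max_redirects is set, the transaction returned by get() describes the final request, so its URL is the address the redirects ended up at. A minimal sketch, using the example URL from the question:
use Mojo::UserAgent;

my $ua = Mojo::UserAgent->new->max_redirects(5);
my $tx = $ua->get('http://web.archive.org/web/20131012182412/http://www.mytarget.com/post?mypost');

# The transaction holds the final request after redirects have been followed
my $final_url = $tx->req->url->to_abs;
print "URL returned: $final_url\n";

# The DOM of the final page is then available as usual
my $dom = $tx->res->dom;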

How to redirect from one CGI to another

I am sending data from A.cgi to B.cgi. B.cgi updates the data in the database and is supposed to redirect back to A.cgi, at which point A.cgi should display the updated data. I added the following code to B.cgi to do the redirect, immediately after the database update:
$url = "http://Travel/cgi-bin/A.cgi/";
print "Location: $url\n\n";
exit();
After successfully updating the database, the page simply prints
Location: http://Travel/cgi-bin/A.cgi/
and stays on B.cgi, without ever getting redirected to A.cgi. How can I make the redirect work?
Location: is a header, and headers must come before all ordinary output; that's probably your problem. But doing this manually is unnecessarily complicated anyway; you would be better off using the redirect function of CGI.pm.
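If you do want to do it by hand, the redirect headers have to be the very first thing the script prints, before any Content-Type header or page output; a minimal sketch of what B.cgi would need to emit:
#!/usr/bin/perl
use strict;
use warnings;

# ... database update happens here ...

my $url = "http://Travel/cgi-bin/A.cgi";
# Print the redirect headers before any other output, then stop.
print "Status: 302 Found\n";
print "Location: $url\n\n";
exit;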
Use CGI's redirect method:
use CGI;

my $url = "http://Travel/cgi-bin/A.cgi";
my $q = CGI->new;
print $q->redirect($url);

Handling 404 and internal server errors with perl WWW::Mechanize

I am using WWW::Mechanize to crawl sites, and it works great except that sometimes it will hit a page that returns error code 404 or 500 (Not Found or Internal Server Error), and then my script will just exit and stop running. This is really messing with my data collection, so is there any way WWW::Mechanize will let me catch these errors and see what kind of error code was returned (i.e. 404, 500, etc.)? Thanks for the help!
You need to disable autocheck:
my $mech = WWW::Mechanize->new( autocheck => 0 );
$mech->get("http://somedomain.com");
if ( $mech->success() ) {
    ...
}
else {
    print "status is: " . $mech->status;
}
Also, as an aside, have a look at WWW::Mechanize::Cached::GZip and WWW::Mechanize::Cached to speed up your development when testing your mech scripts.
Turn off autocheck and manually check status(), which returns the HTTP status code of the response.
This is a 3-digit number like 200 for OK, 404 for Not Found, and so on.
use strict;
use warnings;
use WWW::Mechanize;
my $url = 'http://...';
my $mech = WWW::Mechanize->new(autocheck => 0);
$mech->get($url);
print $mech->status();
See http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html for Status Code Definitions.
If the status code is 400 or above, then you got an error.
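For a crawl, the usual pattern is to log the failures and carry on instead of dying; a minimal sketch, with the URL list purely illustrative:
use strict;
use warnings;
use WWW::Mechanize;

my $mech = WWW::Mechanize->new( autocheck => 0 );
my @urls = ( 'http://example.com/a', 'http://example.com/b' );   # placeholder URLs

for my $url (@urls) {
    $mech->get($url);
    if ( $mech->success ) {
        # ... collect data from $mech->content here ...
    }
    else {
        # A 404 or 500 no longer kills the script; just record it and move on
        warn "Failed to fetch $url: ", $mech->status, "\n";
    }
}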

How to read a web page content which may itself be redirected to another url?

I'm using this code to read the web page content:
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;
my $response = $ua->post($url);
if ($response->is_success) {
    my $content = $response->content;
    ...
But if $url is pointing to a moved page, then $response->is_success returns false. How do I get the content of the redirected page easily?
You need to chase the redirect itself.
if ($response->is_redirect()) {
    $url = $response->header('Location');
    # goto try_again
}
You may want to put this in a while loop and use "next" instead of "goto". You may also want to log it, limit the number of redirections you are willing to chase, etc.
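A sketch of that loop, assuming the request should stay a POST, with placeholder form data and an arbitrary limit of five redirects:
use strict;
use warnings;
use LWP::UserAgent;

my $ua   = LWP::UserAgent->new;
my $url  = 'http://example.com/form';    # placeholder URL
my %form = ( foo => 'bar' );             # placeholder form data

my $response;
for my $hop ( 1 .. 5 ) {                 # give up after 5 redirects
    $response = $ua->post( $url, \%form );
    if ( $response->is_redirect ) {
        $url = $response->header('Location');
        warn "Redirect $hop: now trying $url\n";
        next;
    }
    last;
}
if ( $response->is_success ) {
    my $content = $response->content;
    # ... use $content ...
}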
[update]
OK I just noticed there is an easier way to do this. From the man page of LWP::UserAgent:
$ua->requests_redirectable
$ua->requests_redirectable( \@requests )
This reads or sets the object's list of request names that
"$ua->redirect_ok(...)" will allow redirection for. By default,
this is "['GET', 'HEAD']", as per RFC 2616. To change to include
'POST', consider:
push @{ $ua->requests_redirectable }, 'POST';
So yeah, maybe just do that. :-)
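A minimal sketch of that approach, with placeholder URL and form data:
use strict;
use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;
# Allow LWP to follow redirects for POST requests as well as GET/HEAD
push @{ $ua->requests_redirectable }, 'POST';

my $response = $ua->post( 'http://example.com/form', { foo => 'bar' } );   # placeholders
if ( $response->is_success ) {
    # This is the content of the page reached after any redirects
    my $content = $response->content;
}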

Why can't my Perl script print cookie values?

When I visit usatoday.com with IE, cookie files are automatically created in my Temporary Internet Files folder. So why doesn't the following Perl script capture anything?
use strict;
use warnings;
use WWW::Mechanize;
use HTTP::Cookies;

my $browser = WWW::Mechanize->new();
my $response = $browser->get('http://www.usatoday.com');
my $cookie_jar = $browser->cookie_jar( HTTP::Cookies->new() );
$cookie_jar->extract_cookies($response);
my $cookie_content = $cookie_jar->as_string;
print $cookie_content;
For some other sites like amazon.com, google.com, and yahoo.com the script works well, and it seems to me that usatoday.com also sends cookie information to the browser, so why am I getting different results? Is there something I'm missing?
Any ideas? Thanks!
USA Today uses JavaScript to set the cookie. WWW::Mechanize does not parse or run JavaScript.
If you need to crawl the site with a cookie, you could analyze http://i.usatoday.net/_common/_scripts/gel/lib/core/core.js and the other JS files, determine exactly how the cookie is created, and create one yourself programmatically.
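Once you know what that JavaScript would have set, you can plant an equivalent cookie in Mechanize's jar yourself before fetching the page; a minimal sketch in which the cookie name, value, and lifetime are purely hypothetical:
use strict;
use warnings;
use WWW::Mechanize;

my $mech = WWW::Mechanize->new();   # comes with an HTTP::Cookies jar by default

# set_cookie arguments: version, key, value, path, domain, port,
#                       path_spec, secure, maxage, discard
$mech->cookie_jar->set_cookie(
    0, 'example_cookie', 'example_value',   # hypothetical name and value
    '/', '.usatoday.com', undef,
    0, 0, 86400, 0,
);

$mech->get('http://www.usatoday.com');
print $mech->cookie_jar->as_string;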