How to read the content of a web page that may itself be redirected to another URL? - perl

I'm using this code to read the web page content:
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;
my $response = $ua->post($url);
if ($response->is_success) {
    my $content = $response->content;
    ...
But if $url is pointing to moved page then $response->is_success is returning false. Now how do I get the content of redirected page easily?

You need to chase the redirect itself.
if ($response->is_redirect()) {
    $url = $response->header('Location');
    # goto try_again
}
You may want to put this in a while loop and use "next" instead of "goto". You may also want to log it, limit the number of redirections you are willing to chase, etc.
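A minimal sketch of that loop, re-issuing the POST each time as in the question (the cap of 7 redirects is an arbitrary choice):
use URI;

my $tries = 0;
my $response = $ua->post($url);
while ( $response->is_redirect && $tries++ < 7 ) {
    # The Location header may be relative, so resolve it against the base
    $url = URI->new_abs( $response->header('Location'), $response->base );
    $response = $ua->post($url);
}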
[update]
OK I just noticed there is an easier way to do this. From the man page of LWP::UserAgent:
$ua->requests_redirectable
$ua->requests_redirectable( \@requests )
    This reads or sets the object's list of request names that
    "$ua->redirect_ok(...)" will allow redirection for. By default,
    this is "['GET', 'HEAD']", as per RFC 2616. To change to include
    'POST', consider:
        push @{ $ua->requests_redirectable }, 'POST';
So yeah, maybe just do that. :-)
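Applied to the original snippet, that would look something like this (a sketch; $url is the page from the question):
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;
push @{ $ua->requests_redirectable }, 'POST';   # follow redirects after POST too

my $response = $ua->post($url);
if ( $response->is_success ) {
    my $content = $response->content;
    # ...
}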

Related

Mojolicious not following redirection from webarchive.org

I'm using Mojolicious DOM and UserAgent to get the source of a page from Webarchive.org, parse it, and import it into a Dotclear database (using webarchive as a backup).
In the source, there are "Previous" and "Next" links that lead to the different posts originally made on the blog.
The Perl script I have developed is supposed to run through those links to import all pages of this blog's snapshot.
It first gets the source of the first post of the blog, parses it, puts the result in a local DB, and gets the link under "Next" to do the same thing on the next post, until there are no more "Next" posts.
So much for the basics.
But the trick is that the link I get from the source is not the link Webarchive has.
Webarchive's links to snapshots go like this :
http://web.archive.org/web/20131012182412/http://www.mytarget.com/post?mypost
The big number between "web" and the original URL is (I guess) the date the snapshot was made. The trick is that it changes with each snapshot, and although it may appear on one post, the next post may have been snapshotted on another date, so the URL won't fit.
When I click on the link I get from the source, it brings me to webarchive.org, which automatically searches for the page I pass and redirects me to it.
But when I try to get the source via the get() function of Mojolicious, it just gets the "Page not found" page of webarchive.
So here is my question: is there a way to let Mojolicious follow webarchive's redirection? I activated max_redirects(5) on my UserAgent, but it still does the same thing.
Here is my code :
sub main {
    my ($url) = @_;
    my $ua = Mojo::UserAgent->new;
    $ua = $ua->max_redirects(5);
    my $dom = $ua->get($url)->res->dom;
    # ... treatment and parsing of the source ...
    return $nextUrl;
}
my $nextUrl = "http://web.archive.org/web/20131012182412/http://www.mytarget.com/post?mypost";
my $secondUrl;
while ($nextUrl) {
    $secondUrl = main($nextUrl);
    $nextUrl = $secondUrl;
}
Thanks in advance...
I've finally found a workaround.
I use this piece of code to follow the URL and get the final URL reached:
use LWP::UserAgent qw();

my $ua = LWP::UserAgent->new;
my $ret = $ua->get($url);
$url = $ret->request->uri . "";   # concatenating "" stringifies the URI object
print "URL returned: $url\n";
Then I use that URL to get the source code and fetch it.
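Put together with the original sub, the idea looks like this (a sketch; the treatment and parsing stay as before):
use LWP::UserAgent;
use Mojo::UserAgent;

# Resolve the final URL with LWP first ...
my $lwp = LWP::UserAgent->new;
my $final_url = $lwp->get($url)->request->uri . "";

# ... then fetch and parse it with Mojolicious as before
my $ua  = Mojo::UserAgent->new->max_redirects(5);
my $dom = $ua->get($final_url)->res->dom;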

How to redirect from one CGI to another

I am sending data from A.cgi to B.cgi. B.cgi updates the data in the database and is supposed to redirect back to A.cgi, at which point A.cgi should display the updated data. I added the following code to B.cgi to do the redirect, immediately after the database update:
$url = "http://Travel/cgi-bin/A.cgi/";
print "Location: $url\n\n";
exit();
After successfully updating the database, the page simply prints
Location: http://Travel/cgi-bin/A.cgi/
and stays on B.cgi, without ever getting redirected to A.cgi. How can I make the redirect work?
Location: is a header, and headers must come before all ordinary output; that's probably your problem. But doing this manually is unnecessarily complicated anyway; you would be better off using the redirect function of CGI.pm.
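If you do want to do it manually, the Location header must be the very first thing the script prints; a minimal B.cgi sketch (the database code is elided):
#!/usr/bin/perl
use strict;
use warnings;

# ... update the database first ...

# Print the redirect header before ANY other output; a Content-Type
# header or a stray print before it turns "Location: ..." into page text.
print "Location: http://Travel/cgi-bin/A.cgi\n\n";
exit;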
Use CGI's redirect method:
use CGI;

my $url = "http://Travel/cgi-bin/A.cgi";
my $q = CGI->new;
print $q->redirect($url);

How to obtain the 301/302 website redirect location from the http response and follow it?

I have been trying to obtain the 301/302 redirect location from the HTTP response using Perl's Mechanize (WWW::Mechanize), but I have been having problems extracting it from the response using things like $response->header and so on.
Can anyone help with extracting the redirect location from the HTTP responses of websites that use 301 or 302 redirects, please?
I know what I want to do and how to do it once I have this redirection location URL as I have done more complex things with Mechanize before, but I'm just having real problems with getting the location (or any other response fields) from the http response.
Your help would be much appreciated, Many thanks, CM
WWW::Mechanize should automatically follow redirects (unless you've told it not to via requests_redirectable), so you should not need to do anything.
EDIT: just to demonstrate:
DB<4> $mech = WWW::Mechanize->new;
DB<5> $mech->get('http://www.preshweb.co.uk/linkedin');
DB<6> x $mech->uri;
0 URI::http=SCALAR(0x903f990)
-> 'http://www.linkedin.com/in/bigpresh'
... as you can see, WWW::Mechanize followed the redirect, and ended up at the destination, automatically.
Updated with another example as requested:
DB<15> $mech = WWW::Mechanize->new;
DB<16> $mech->get('http://jjbsports.com/');
DB<17> x $mech->uri;
0 URI::http=SCALAR(0x90988f0)
-> 'http://www.jjbsports.com/'
DB<18> x substr $mech->content, 0, 40;
0 '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML'
DB<19> x $mech->title;
0 'JJB Sports | Trainers, Clothing, Football Kits, Football Boots, Running'
As you can see, it followed the redirect, and $mech->content is returning the content of the page. Does that help at all?
If it's a redirect, WWW::Mechanize will use $mech->redirect_ok() while request()ing to follow the redirect URL (this is an LWP method).
Note:
WWW::Mechanize's constructor pushes POST onto the agent's
requests_redirectable list
So you wouldn't have to worry about pushing POST to the requests_redirectable list.
If you want to be absolutely certain that the program is redirecting your URLs and log every redirect in a log file (or something), you can use LWP's simple_request and HTTP::Response's is_redirect to detect redirects, something like this:
use WWW::Mechanize;
use HTTP::Request;
use URI;

my $mech = WWW::Mechanize->new();
$mech->stack_depth(0);
my $resp = $mech->simple_request( HTTP::Request->new( GET => 'http://www.googl.com/' ) );
if ( $resp->is_redirect ) {
    my $location = $resp->header("Location");
    my $uri = URI->new($location);
    print "Got redirected to URL - $uri\n";
    $mech->get($uri);
    print $mech->content;
}
is_redirect will detect both 301 and 302 response codes.
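If you just want to see where a normal get() was redirected, you can also inspect the redirect chain that HTTP::Response keeps (a sketch with a placeholder URL):
use WWW::Mechanize;

my $mech = WWW::Mechanize->new;
$mech->get('http://www.example.com/');

# Each intermediate 301/302 response is kept on the final response
for my $r ( $mech->response->redirects ) {
    print $r->request->uri, " -> ", $r->header('Location'), "\n";
}
print "Final URL: ", $mech->uri, "\n";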

feedpp and session ID

We are using Perl and the CPAN module XML::FeedPP to parse RSS feeds.
The Perl script runs through the different items of the RSS feeds and saves the links to the database, like this:
my $response = $ua->get($url);
if ($response->is_success) {
    my $feed = XML::FeedPP->new( $response->content, -type => 'string' );
    foreach my $item ( $feed->get_item() ) {
        my $link = $item->link();
        [...]
$url contains the URL to an RSS Feed, like http://my.domain/RSS/feeds.xml
in this case, $item->link() will contain links to the RSS article, like http://my.domain/topic/myarticle.html
The problem is that some web servers (which provide the RSS feeds) do an HTTP redirect in order to add a session ID to the URL, like this: http://my.domain/RSS/feeds.xml;jsessionid=4C989B1DB91D706C3E46B6E30427D5CD.
The strange thing is that FeedPP seems to add this session ID to the link of every item, so $item->link() contains links to the RSS articles like http://my.domain/topic/myarticle.html;jsessionid=4C989B1DB91D706C3E46B6E30427D5CD,
even if the original link does not contain a session ID.
Is there a way to turn off that behavior of FeedPP?
Thank you for any kind of help.
I took a look through http://metacpan.org/pod/XML::FeedPP but didn't see any way to have the link() method trim those session IDs for you. (I'm using XML::FeedPP in one of my scripts and the site I happen to be parsing doesn't use session IDs.)
So I think the answer is no, not currently. You could try contacting the author or filing a bug.
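In the meantime you can strip the session ID yourself after reading each link; a minimal sketch (the regex assumes jsessionid-style path parameters, as in your example):
foreach my $item ( $feed->get_item() ) {
    my $link = $item->link();
    $link =~ s/;jsessionid=[^;?#]*//i;   # drop the session-ID path parameter
    # ... save $link to the database ...
}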
IMHO, the behavior is correct: URI components which follow a semicolon are defined as part of the path (a parameter for interpreting that path segment), so when the URI is used to turn a relative URL into an absolute URI, they need to be copied as well.
You are expecting them to behave like '&' query parameters, but they are not equivalent.
https://rt.cpan.org/Ticket/Display.html?id=73895

How do I use Perl's LWP to log in to a web application?

I would like to write a script to log in to a web application and then move to other parts of the application:
use HTTP::Request::Common qw(POST);
use LWP::UserAgent;
use Data::Dumper;

my $ua = LWP::UserAgent->new( keep_alive => 1 );
my $req = POST "http://example.com:5002/index.php",
    [ user_name     => 'username',
      user_password => 'password',
      module        => 'Users',
      action        => 'Authenticate',
      return_module => 'Users',
      return_action => 'Login',
    ];
my $res = $ua->request($req);
print Dumper(\$res);
if ( $res->is_success ) {
    print $res->as_string;
}
When I try this code I am not able to log in to the application. The HTTP status code returned is 302 (Found), but with no data.
If I post the username/password with all the required fields, it should return the home page of the application and keep the connection alive so I can move to other parts of the application.
You may be able to use WWW::Mechanize for this purpose:
Mech supports performing a sequence of page fetches including following links and submitting forms. Each fetched page is parsed and its links and forms are extracted. A link or a form can be selected, form fields can be filled and the next page can be fetched. Mech also stores a history of the URLs you've visited, which can be queried and revisited.
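A sketch of the same login with Mech (the field names are taken from your POST; with_fields picks the form that contains them, and this assumes the login page serves a real HTML form):
use WWW::Mechanize;

my $mech = WWW::Mechanize->new;
$mech->get('http://example.com:5002/index.php');

# Find the form containing these fields, fill them in, and submit it
$mech->submit_form(
    with_fields => {
        user_name     => 'username',
        user_password => 'password',
    },
);
print $mech->content if $mech->success;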
I'm guessing that LWP isn't following the redirect:
push @{ $ua->requests_redirectable }, 'POST';
Any reason why you're not using WWW::Mechanize?
I've used LWP to log in to plenty of web sites and do stuff with the content, so there should be no problem doing what you want. Your code looks good so far, but there are two things I'd suggest:
As mentioned, you may need to make the requests redirectable
You may also need to enable cookies:
$ua->cookie_jar( {} );
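Putting both suggestions together with your original request (a sketch):
use LWP::UserAgent;

my $ua = LWP::UserAgent->new( keep_alive => 1 );
$ua->cookie_jar( {} );                           # keep the session cookie
push @{ $ua->requests_redirectable }, 'POST';    # follow the 302 after login

my $res = $ua->post( "http://example.com:5002/index.php",
    [ user_name     => 'username',
      user_password => 'password',
      module        => 'Users',
      action        => 'Authenticate',
      return_module => 'Users',
      return_action => 'Login',
    ] );
print $res->as_string if $res->is_success;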
Hope this helps