Site scraping in Perl using WWW::Mechanize

I am using WWW::Mechanize in Perl for a site-scraping application.
I have run into difficulties when trying to log in to a particular site via WWW::Mechanize. I have gone through several WWW::Mechanize examples, but I couldn't figure out my issue.
My code is below:
#!/usr/bin/perl -w
use strict;
use WWW::Mechanize;
use HTTP::Cookies;
use Crypt::SSLeay;

my $agent = WWW::Mechanize->new(noproxy => 0);
$agent->cookie_jar(HTTP::Cookies->new());
$agent->agent('Mozilla/5.0');
$agent->proxy(['https', 'http', 'ftp'], 'http://proxy.rcapl.com:3128');
$agent->get("http://www.facebook.com");

my $re = $agent->submit_form(
    form_number => 1,
    fields      => {
        Email  => 'xyz@gmail.com',
        Passwd => 'xyz',
    },
);
print $re->content();
When I run the code, it says:
Error POSTing https://www.facebook.com/login.php?login_attempt=1: Not Implemented at ./test.pl line 11
Can anybody tell me what's going wrong in the code? Do I need to set all the parameters that Facebook sends for the login?

The proxy is faulty:
Error GETing http://www.facebook.com: Can't connect to proxy.rcapl.com:3128 (Bad hostname) at so11406791.pl line 11.
The program works for me without calling the proxy method. Remove this.
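If a proxy really is required in your environment, a minimal sketch of an alternative (assuming the proxy settings are available in the usual http_proxy/https_proxy environment variables) is to let LWP pick them up via env_proxy() instead of hard-coding a hostname that does not resolve:
#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;

my $agent = WWW::Mechanize->new( autocheck => 0 );
$agent->agent('Mozilla/5.0');
$agent->env_proxy();   # inherited from LWP::UserAgent: read proxy settings from the environment

$agent->get('http://www.facebook.com');
print 'GET failed with status ', $agent->status, "\n" unless $agent->success;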

Related

Perl - mechanize

I have the following code that works just fine.
#!/usr/bin/perl -w
use strict;
use LWP 6.03;
use URI;

my $browser = LWP::UserAgent->new;
my $url = URI->new('http://www.google.com/search');
$url->query_form(
    'hl'  => 'en',      # Google's interface-language parameter is 'hl', not 'h1'
    'num' => '100',
    'q'   => 'glass',
);
my $response = $browser->get($url,
    'User-Agent'      => 'Mozilla/4.76 [en] (win98; U)',
    'Accept'          => 'image/gif, image/x-bitmap, image/jpeg, image/pjpeg, image/png, */*',
    'Accept-Charset'  => 'iso-8859-1,*',
    'Accept-Language' => 'en-US',
);
if ($response->content =~ m/glass/i) {
    print "Success";
    open my $out, '>', 'gglass' or die "Cannot open gglass: $!";
    print $out $response->content;
    close $out;
} else {
    print "complete failure";
}
I have another piece of code that also works fine.
It uses the following:
#!/usr/bin/perl -w
use strict;
use WWW::Mechanize;
use HTML::TokeParser;
When I look up the documentation for my code on CPAN, it tells me that the libraries I am using are deprecated. Even though the code works on my system, the style of programming is apparently being abandoned. The documentation points me to something I have never used, and I don't know whether that will soon be abandoned as well. What is the current, popular way to scrape a website? I don't want to be stuck with antiquated tools and tactics that leave me in the previous century. If you could provide a piece of code similar to the first example, that would be nice, so I could compare the two.
Your documentation is wrong. None of LWP, URI, WWW::Mechanize, or HTML::TokeParser is deprecated. Mechanize works just fine in general for crawling. I would replace HTML::TokeParser with something that handles HTML parsing in a declarative fashion, though: Web::Query is splendid, and HTML::TreeBuilder::XPath is nice.
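As a rough sketch of that declarative style (assuming HTML::TreeBuilder::XPath is installed; the URL and XPath expression are only placeholders for illustration):
#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;
use HTML::TreeBuilder::XPath;

my $mech = WWW::Mechanize->new();
$mech->get('http://example.com/');   # placeholder URL

# Build a tree from the fetched page and query it with XPath
my $tree = HTML::TreeBuilder::XPath->new_from_content( $mech->content );
for my $link ( $tree->findnodes('//a[@href]') ) {
    printf "%s -> %s\n", $link->as_text, $link->attr('href');
}
$tree->delete;   # free the parse tree when done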
However, concerning your code example: Google's terms of use forbid scraping. Use their API instead!

Handling 404 and internal server errors with perl WWW::Mechanize

I am using WWW::Mechanize to crawl sites, and it works great except that sometimes it will hit a page that returns error code 404 or 500 (Not Found or Internal Server Error), and then my script will just exit and stop running. This really messes with my data collection, so is there any way to have WWW::Mechanize catch these errors and let me see what kind of error code was returned (i.e. 404, 500, etc.)? Thanks for the help!
You need to disable autocheck:
my $mech = WWW::Mechanize->new( autocheck => 0 );
$mech->get("http://somedomain.com");
if ( $mech->success() ) {
    ...
}
else {
    print "status is: " . $mech->status;
}
Also, as an aside, have a look at WWW::Mechanize::Cached::GZip and WWW::Mechanize::Cached to speed up your development when testing your mech scripts.
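For instance, a minimal sketch with WWW::Mechanize::Cached (assuming the module and a default cache backend are installed); it is a drop-in subclass of WWW::Mechanize, so repeated GETs of the same URL during development are served from a local cache:
use strict;
use warnings;
use WWW::Mechanize::Cached;

# Same interface as WWW::Mechanize, but responses are cached locally,
# so re-running the script does not re-fetch unchanged pages.
my $mech = WWW::Mechanize::Cached->new( autocheck => 0 );
$mech->get("http://somedomain.com");
print $mech->status, "\n";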
Turn off autocheck and manually check status(), which returns the HTTP status code of the response.
This is a 3-digit number like 200 for OK, 404 for Not Found, and so on.
use strict;
use warnings;
use WWW::Mechanize;
my $url = 'http://...';
my $mech = WWW::Mechanize->new(autocheck => 0);
$mech->get($url);
print $mech->status();
See http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html for Status Code Definitions.
If the status code is 400 or above, then you got an error.
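If you prefer named checks over comparing raw numbers, a small sketch using HTTP::Status (which ships with libwww-perl) might look like this; the URL is just a placeholder:
use strict;
use warnings;
use WWW::Mechanize;
use HTTP::Status qw(is_success is_error status_message);

my $mech = WWW::Mechanize->new(autocheck => 0);
$mech->get('http://example.com/might-be-missing');   # placeholder URL

my $code = $mech->status;
if ( is_success($code) ) {
    print "OK\n";
}
elsif ( is_error($code) ) {
    # 4xx and 5xx responses land here, e.g. "404 Not Found"
    printf "Failed: %d %s\n", $code, status_message($code);
}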

WWW::Mechanize Form Select

I am attempting to log in to YouTube with WWW::Mechanize and use forms() to print out all the forms on the page after logging in. My script logs in successfully and also successfully navigates to youtube.com/inbox; however, for some reason Mechanize cannot see any forms at youtube.com/inbox. It just returns blank. Here is my code:
#!"C:\Perl64\bin\perl.exe" -T
use strict;
use warnings;
use CGI;
use CGI::Carp qw/fatalsToBrowser/;
use WWW::Mechanize;
use Data::Dumper;
my $q = CGI->new;
$q->header();
my $url = 'https://www.google.com/accounts/ServiceLogin?uilel=3&service=youtube&passive=true&continue=http://www.youtube.com/signin%3Faction_handle_signin%3Dtrue%26nomobiletemp%3D1%26hl%3Den_US%26next%3D%252Findex&hl=en_US&ltmpl=sso';
my $mechanize = WWW::Mechanize->new(autocheck => 1);
$mechanize->agent_alias( 'Windows Mozilla' );
$mechanize->get($url);
$mechanize->submit_form(
form_id => 'gaia_loginform',
fields => { Email => 'myemail',Passwd => 'mypassword' },
);
die unless ($mechanize->success);
$url = 'http://www.youtube.com/inbox';
$mechanize->get($url);
$mechanize->form_id('comeposeform');
my $page = $mechanize->content();
print Dumper($mechanize->forms());
Mechanize is unable to see any forms at youtube.com/inbox; however, as I said, I can print all of the forms from the initial link, no matter what I change it to...
Thanks in advance.
As always, one of the best debugging approaches is to print what you get and check if it is what you were expecting. This applies to your problem too.
In your case, if you print $mechanize->content() you'll see that you didn't get the page you're expecting. YouTube wants you to follow a JavaScript redirect in order to complete your cross-domain login action. You have multiple options here:
parse the returned content manually – i.e. /location\.replace\("(.+?)"/
try to have your code parse JavaScript (have a look at WWW::Scripter)
[recommended] use the YouTube API for managing your inbox
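As a rough sketch of the first option, continuing from the question's code after submit_form() and assuming the redirect really does appear as a location.replace("...") call in the returned content:
# Look for a JavaScript redirect in the response body and follow it manually.
my $html = $mechanize->content();
if ( $html =~ /location\.replace\("(.+?)"\)/ ) {
    my $target = $1;
    # The URL is embedded in a JavaScript string, so unescape \xNN sequences first
    $target =~ s/\\x([0-9a-fA-F]{2})/chr(hex($1))/ge;
    $mechanize->get($target);
}
print Dumper($mechanize->forms());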

Why can't my Perl script print cookie values?

When I visit usatoday.com with IE, cookie files are automatically created in my Temporary Internet Files folder. But why doesn't the following Perl script capture anything?
use strict;
use warnings;
use WWW::Mechanize;
use HTTP::Cookies;    # needed for the explicit HTTP::Cookies->new() call below

my $browser = WWW::Mechanize->new();
my $response = $browser->get( 'http://www.usatoday.com' );
my $cookie_jar = $browser->cookie_jar(HTTP::Cookies->new());
$cookie_jar->extract_cookies( $response );
my $cookie_content = $cookie_jar->as_string;
print $cookie_content;
For some other sites like amazon.com, google.com and yahoo.com the script works well. It seems to me that usatoday.com also sends cookie information to the browser, so why am I getting different results? Is there something I'm missing?
Any ideas? Thanks!
USA Today uses JavaScript to set the cookie, and WWW::Mechanize does not parse or run JavaScript.
If you need to crawl the site with a cookie, you could analyze http://i.usatoday.net/_common/_scripts/gel/lib/core/core.js and the other JS files, determine how exactly the cookie is created, and create one yourself programmatically.
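A minimal sketch of that last idea, i.e. injecting a cookie by hand before crawling; the cookie name, value and lifetime below are placeholders, since the real values would have to come from reading the site's JavaScript:
use strict;
use warnings;
use WWW::Mechanize;

my $browser = WWW::Mechanize->new();   # Mechanize sets up an HTTP::Cookies jar by default

# HTTP::Cookies::set_cookie arguments:
# version, key, value, path, domain, port, path_spec, secure, maxage, discard
$browser->cookie_jar->set_cookie(
    0, 'example_cookie', 'example_value',   # placeholder name and value
    '/', '.usatoday.com', undef,
    0, 0, 86400, 0,
);

my $response = $browser->get('http://www.usatoday.com');
print $browser->cookie_jar->as_string;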

How can I keep WWW::Mechanize from following redirects?

I have a Perl script that uses WWW::Mechanize to read from a file and perform some automated tasks on a website. However, the website responds with a 302 redirect every time I request a certain page. I don't want to be redirected (the page it redirects to takes too long to respond); I just want to loop through the file and call the first link over and over. I can't figure out how to make WWW::Mechanize NOT follow redirects. Any suggestions?
WWW::Mechanize is a subclass of LWP::UserAgent, so you can use any LWP::UserAgent method:
my $mech = WWW::Mechanize->new();
$mech->requests_redirectable([]);
WWW::Mechanize is a subclass of LWP::UserAgent; you can set the max_redirect or requests_redirectable options in the constructor as you would with LWP::UserAgent.
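A small sketch of the constructor approach; both options are LWP::UserAgent constructor options that WWW::Mechanize passes through, so this assumes a reasonably recent libwww-perl:
use strict;
use warnings;
use WWW::Mechanize;

# max_redirect => 0 stops redirects from being followed at all;
# autocheck => 0 keeps the 302 response from being treated as a fatal error.
my $mech = WWW::Mechanize->new(
    autocheck    => 0,
    max_redirect => 0,
);

$mech->get('http://www.depesz.com/test/redirect');
print $mech->status, "\n";    # prints 302 instead of following the redirect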
You can use $agent->max_redirect( 0 ), like in this example:
#!/usr/bin/perl -w
use strict;
use WWW::Mechanize;
my $agent = WWW::Mechanize->new( 'autocheck' => 1, 'onerror' => undef, );
$agent->max_redirect( 0 );
$agent->get('http://www.depesz.com/test/redirect');
printf("Got HTTP/%s from %s.\n", $agent->response->code, $agent->uri);
$agent->max_redirect( 1 );
$agent->get('http://www.depesz.com/test/redirect');
printf("Got HTTP/%s from %s.\n", $agent->response->code, $agent->uri);
When running it prints:
Got HTTP/302 from http://www.depesz.com/test/redirect.
Got HTTP/200 from http://www.depesz.com/.
So, with max_redirect(0), it clearly doesn't follow redirects.