Download web page using WWW::Mechanize::Firefox - perl

I'm trying to scrape a website using WWW::Mechanize::Firefox, but whenever I try to get the data it displays JavaScript code and the data I need is not there. If I inspect the element in Firefox, the data I need is there.
Here's my current code:
#!/usr/bin/perl
use 5.010;
use strict;
use warnings;
use WWW::Mechanize::Firefox;
my $mech = WWW::Mechanize::Firefox->new();
$mech->get('link_goes_here');
$mech->allow( javascript => 0 );
$mech->content_encoding();
$mech->save_content('source.html');

OK, so you have a page that builds its content using JavaScript. Presumably you have chosen WWW::Mechanize::Firefox instead of WWW::Mechanize because it includes support for rendering pages that are built with JavaScript.
And yet, in your code you explicitly turn that JavaScript support off:
$mech->allow( javascript => 0 );
I can't test this theory because you haven't told us which URL you are using, but I bet you get a better result if you change that line to:
$mech->allow( javascript => 1 );
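For reference, here is a minimal sketch of the script with that change applied (keeping the placeholder URL and output filename from the question); moving the allow() call before get() should ensure JavaScript is already enabled when the page loads:
#!/usr/bin/perl
use 5.010;
use strict;
use warnings;
use WWW::Mechanize::Firefox;

my $mech = WWW::Mechanize::Firefox->new();

# Enable JavaScript before fetching, so the page can build its content
$mech->allow( javascript => 1 );
$mech->get('link_goes_here');

# Save the rendered page, including content generated by JavaScript
$mech->save_content('source.html');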

Related

Replace authenticated user agent login/ page scrape using Perl and Mojolicious

I am trying to port some old web scraping scripts written using older Perl modules to work using only Mojolicious.
I have written a few basic scripts with Mojo, but I am puzzled about how an authenticated login against a secure login site should be handled in a Mojo::UserAgent script. Unfortunately the only example I can see in the documentation is for basic authentication without forms.
The Perl script I am trying to convert to work with Mojo::UserAgent is as follows:
#!/usr/bin/perl
use LWP;
use LWP::Simple;
use LWP::Debug qw(+);
use LWP::Protocol::https;
use WWW::Mechanize;
use HTTP::Cookies;
# login first before navigating to pages
# Create our automated browser and set up to handle cookies
my $agent = WWW::Mechanize->new();
$agent->cookie_jar(HTTP::Cookies->new());
$agent->agent_alias( 'Windows IE 6' ); #tell the website who we are (old!)
# get login page
$agent->get("https://reg.mysite.com")
$agent->success or die $agent->response->status_line;
# complete the user name and password form
$agent->form_number (1);
$agent->field (username => "user1");
$agent->field (password => "pass1");
$agent->click();
#try to get member's only content page from main site on basis we are now "logged in"
$agent->get("http://www.mysite.com/memberpagesonly1");
$agent->success or die $agent->response->status_line;
my $member_page = $agent->content();
print "$member_page\n";
So the above works fine. How to convert to do the same job in Mojolicious?
Mojolicious is a web application framework. While Mojo::UserAgent works well as a low-level HTTP user agent, and provides facilities that are unavailable from LWP (in particular native support for asynchronous requests and IPv6), neither is as convenient to use as WWW::Mechanize for web scraping.
WWW::Mechanize subclasses LWP::UserAgent to interface with the internet, and uses HTML::Form to process the forms it finds. Mojo::UserAgent has no facility for processing HTML forms, so building the corresponding HTTP requests is not at all straightforward. Information such as the HTTP method used (GET or POST), the names of the form fields, and the insertion of default values for hidden fields are all handled automatically by HTML::Form and are left to the programmer if you restrict yourself to Mojo::UserAgent.
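To illustrate what that means in practice, here is a rough sketch of a manual login with Mojo::UserAgent, assuming the form posts to a /login path with username and password fields; those are guesses you would have to confirm by reading the HTML yourself, since there is no HTML::Form to do it for you:
use strict;
use warnings;
use Mojo::UserAgent;

# Cookies are kept automatically; follow redirects after the login POST
my $ua = Mojo::UserAgent->new->max_redirects(5);

# Build the form POST by hand; the path and field names are assumptions
my $tx = $ua->post('https://reg.mysite.com/login' => form => {
    username => 'user1',
    password => 'pass1',
});
die $tx->res->message unless $tx->res->is_success;

# Fetch the members-only page with the same (now authenticated) agent
print $ua->get('http://www.mysite.com/memberpagesonly1')->res->body;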
It seems to me that even trying to use Mojo::UserAgent in combination with HTML::Form is problematic, as the former requires a Mojo::Transaction::HTTP object to represent the submission of a filled-in form, whereas the latter generates HTTP::Request objects for use with LWP.
In short, unless you are willing to largely rewrite WWW::Mechanize, I think there is no way to reimplement your software using Mojolicious modules.
You can use WWW::Mechanize to talk to web servers, and Mojo::DOM to parse the pages it brings back. The best of both worlds... :)
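A small sketch of that combination, assuming the login has already succeeded as in the original script and that the data of interest sits in (hypothetical) h2.member-name elements:
use strict;
use warnings;
use WWW::Mechanize;
use Mojo::DOM;

my $agent = WWW::Mechanize->new();
# ... log in exactly as in the original script ...
$agent->get('http://www.mysite.com/memberpagesonly1');

# Hand the fetched HTML to Mojo::DOM and use CSS selectors to pull out data
my $dom = Mojo::DOM->new( $agent->content );
print $_->text, "\n" for $dom->find('h2.member-name')->each;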

How to tell if a webpage exists?

Just for fun, I'm writing a Perl program to check if a given website exists. For my purposes, a website exists if I can go into my browser, punch in the url and get a meaningful webpage (meaning not an error or "failed to open page" message). What would be the best way to go about doing this? Eventually I would like to be able to give my program a list of hundreds of urls.
I'm thinking about just pinging each of the urls on my list to see if they exist; however, I don't really know too much about networking so is this the best way to do it?
Using Library for WWW in Perl (LWP):
#!/usr/bin/perl
use LWP::Simple;
my $url = 'http://www.mytestsite.com/';
if (head($url)) {
print "Page exists\n";
} else {
print "Page does not exist\n";;
}
There is no such protocol as "pinging web pages" for existence. You actually have to request the resource and if it's served up, it exists. There are several ways to go about it, here are a couple:
Retrieving web pages with LWP
Checking for an existing web page can be as simple as:
#!/usr/bin/env perl
use strict;
use warnings;
use LWP::Simple qw(head);
head('http://www.perlmeme.org') or die 'Unable to get page';
The command-line equivalent is lwp-request (also installed as HEAD). A HEAD request returns only the resource headers, such as the content size, and will be quicker than fetching the whole page.
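Since the question mentions eventually checking hundreds of URLs, the same head() call can simply be run in a loop; the list below is a placeholder, and in practice you would read it from a file:
#!/usr/bin/env perl
use strict;
use warnings;
use LWP::Simple qw(head);

# Placeholder list of URLs to check
my @urls = (
    'http://www.perlmeme.org',
    'http://www.example.com',
);

for my $url (@urls) {
    # head() sends an HTTP HEAD request and returns the headers on success
    print head($url) ? "$url exists\n" : "$url does not exist\n";
}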

WWW::Mechanize won't work after submit

At my work I build a lot of WordPress sites and I also do a lot of cutting and pasting. In order to streamline this process I'm trying to make a crawler that can fill out and submit form information to WordPress. However, I can't get the crawler to operate correctly in the WordPress admin panel once I'm past the login.
I know the login form submission works because I've gotten the page back before. But this script doesn't seem to return the "settings" page, which is what I want. I've been using this site as a guide for how to use Mechanize: www.higherpass.com/Perl/Tutorials/Using-Www-mechanize/3/ but I could use some additional pointers. Here is my Perl script; I've tried a few variations but I just need to be pointed in the right direction.
Thanks!
use strict;
use warnings;
use WWW::Mechanize;

my $m = WWW::Mechanize->new();
my $url2 = 'http://www.moversbatonrougela.com/wp-admin/options-general.php';
my $url  = 'http://www.moversbatonrougela.com/wp-admin';
$m->get($url);
$m->form_name('loginform');
$m->set_fields('username' => 'user', 'password' => 'password');
$m->submit();
my $response = $m->get($url2);
print $response->decoded_content();
Put the two lines below just before $m->submit();. Since WWW::Mechanize is a subclass of LWP::UserAgent, you can use any of LWP's methods.
$m->add_handler("request_send", sub { shift->dump; return });
$m->add_handler("response_done", sub { shift->dump; return });
The above enables request/response logging in your code. Look out for the request/response status codes, i.e. 200 (OK), 302 (Redirect), etc. The $m->get() request is probably being redirected, or the machine's IP is blocked by the server. If it's a redirect, you can probably use $m->redirect_ok() to follow the redirect URL, or use $m->requests_redirectable (an LWP method) to control which requests have their redirects followed. The logs should show something like this:
HTTP/1.1 200 OK
OR
HTTP/1.1 302 Found
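If the log shows a 302 after the login POST that Mechanize is not following, one possible tweak (an assumption on my part, since LWP by default only follows redirects for GET and HEAD) is to add POST to the redirectable methods before submitting:
# Allow redirects after POST requests as well (LWP::UserAgent method)
push @{ $m->requests_redirectable }, 'POST';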
If none of the above works, try the following alternative to $m->submit():
my $inputobject = $m->current_form()->find_input( undef, 'submit' );
$m->click_button(input => $inputobject);

Getting error in accessing a link using WWW::Mechanize

I'm getting the following error when following a JavaScript link using Perl's WWW::Mechanize.
Error GETing javascript:submt_os('2','contact%20info','contact%20info'):Protocol scheme 'javascript' is not supported
This is my code:
#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;

my $mech = WWW::Mechanize->new();
my $uri = "http://tinyurl.com/76xv4ld";
$mech->get($uri);
# error on this link
$mech->follow_link( text => 'Contact Information');
print $mech->content();
Once I get the page, I want to click Contact Information.
Is there any other way to click Contact Information?
You can't follow a javascript link with WWW::Mechanize. Even if you had a javascript interpreter you'd need complete DOM support for anything non-trivial.
So you need to script a web browser. I use Selenium in my testing, which is quite bulky and requires Java. You might want to investigate WWW::Mechanize::Firefox. I've not used it, but it does provide a Mechanize-style interface to Firefox.
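A rough sketch of the same script using WWW::Mechanize::Firefox instead (untested, and it assumes Firefox is running with the MozRepl extension that the module drives):
#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize::Firefox;

# The real browser executes the javascript: link, so follow_link can work
my $mech = WWW::Mechanize::Firefox->new();
$mech->get('http://tinyurl.com/76xv4ld');
$mech->follow_link( text => 'Contact Information' );
print $mech->content();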

Perl WWW::Mechanize::Firefox POST() Implementation

Can I get some help on how to submit a POST with the necessary variables using WWW::Mechanize::Firefox? I've installed all the Perl modules and the Firefox plugin, and tested that I can connect to a given host and get responses... my question is how to submit a POST request. In the documentation Corion says he may never implement it. This seems odd; I'm hoping I can use the inherited behaviour from Mechanize, but can't find any examples. A simple example would help me tremendously.
my $mech = WWW::Mechanize::Firefox->new();
$mech->allow( javascript => 1 ); # enable JavaScript
# http
$mech->get("http://www.example.com");
my $c = $mech->content;
Is there a mech->post() option I am simply missing?
many thanks in advance.
R
Normally you would just set the fields and submit the form like this:
$mech->get('http://www.website.com');
$mech->submit_form(
    with_fields => {
        user => 'me',
        pass => 'secret',
    }
);
That is: get a page, fill out a form, submit the form.
If you're going to skip those steps by sending a raw POST, you don't need WWW::Mechanize::Firefox at all.
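For completeness, a sketch of a raw POST with plain WWW::Mechanize, which inherits post() from LWP::UserAgent; the URL and field names here are placeholders:
use strict;
use warnings;
use WWW::Mechanize;

my $mech = WWW::Mechanize->new();

# Send the form data directly as a POST request, no browser involved
$mech->post(
    'http://www.example.com/login',
    { user => 'me', pass => 'secret' },
);
print $mech->content;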