I'm getting the following error when following a JavaScript link using Perl's WWW::Mechanize.
Error GETing javascript:submt_os('2','contact%20info','contact%20info'):Protocol scheme 'javascript' is not supported
This is my code:
#!/usr/bin/perl
use WWW::Mechanize;
my $mech = WWW::Mechanize->new();
$uri="http://tinyurl.com/76xv4ld";
$mech->get($uri);
# error on this link
$mech->follow_link( text => 'Contact Information');
print $mech->content();
Once I get the page, I want to click Contact Information.
Is there any other way to click Contact Information?
You can't follow a javascript: link with WWW::Mechanize. Even if you had a JavaScript interpreter, you'd need complete DOM support for anything non-trivial.
So - you need to script a web browser. I use Selenium in my testing, which is quite bulky and requires Java. You might want to investigate WWW::Mechanize::Firefox. I've not used it, but it does provide a Mechanize-style interface to Firefox.
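For example, a minimal, untested sketch with WWW::Mechanize::Firefox (which needs Firefox plus the mozrepl extension) might look like this:
#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize::Firefox;

my $mech = WWW::Mechanize::Firefox->new();
$mech->get('http://tinyurl.com/76xv4ld');

# Because a real browser is driving the page, following the javascript:
# link should run its onclick handler instead of failing with
# "Protocol scheme 'javascript' is not supported".
$mech->follow_link( text => 'Contact Information' );
print $mech->content();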
Related
I'm trying to scrape a website using WWW::Mechanize::Firefox, but whenever I try to get the data, it displays JavaScript code and the data I need is not there. If I inspect the element in Firefox, the data I need is there.
Here's my current code:
#!/usr/bin/perl
use 5.010;
use strict;
use warnings;
use WWW::Mechanize::Firefox;
my $mech = WWW::Mechanize::Firefox->new();
$mech->get('link_goes_here');
$mech->allow( javascript => 0 );
$mech->content_encoding();
$mech->save_content('source.html');
Ok. So you have a page that builds its content using Javascript. Presumably, you have chosen to use WWW::Mechanize::Firefox instead of WWW::Mechanize because it includes support for rendering pages that are built using Javascript.
And yet, when creating your Mechanize object, you explicitly turn off the Javascript support.
$mech->allow( javascript => 0 );
I can't test this theory because you haven't told us which URL you are using, but I bet you get a better result if you change that line to:
$mech->allow( javascript => 1 );
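Something like this (untested, since we don't have the real URL; I have also moved the allow() call before get() so the scripts are enabled while the page loads):
#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize::Firefox;

my $mech = WWW::Mechanize::Firefox->new();
$mech->allow( javascript => 1 );    # let Firefox run the page's scripts
$mech->get('link_goes_here');
$mech->save_content('source.html');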
I am trying to port some old web scraping scripts written using older Perl modules to work using only Mojolicious.
I have written a few basic scripts with Mojo, but am puzzled about how an authenticated login to a secure login site should be handled in a Mojo::UserAgent script. Unfortunately the only example I can see in the documentation is for basic authentication without forms.
The Perl script I am trying to convert to work with Mojo::UserAgent is as follows:
#!/usr/bin/perl
use LWP;
use LWP::Simple;
use LWP::Debug qw(+);
use LWP::Protocol::https;
use WWW::Mechanize;
use HTTP::Cookies;
# login first before navigating to pages
# Create our automated browser and set up to handle cookies
my $agent = WWW::Mechanize->new();
$agent->cookie_jar(HTTP::Cookies->new());
$agent->agent_alias( 'Windows IE 6' ); #tell the website who we are (old!)
# get login page
$agent->get("https://reg.mysite.com")
$agent->success or die $agent->response->status_line;
# complete the user name and password form
$agent->form_number (1);
$agent->field (username => "user1");
$agent->field (password => "pass1");
$agent->click();
#try to get member's only content page from main site on basis we are now "logged in"
$agent->get("http://www.mysite.com/memberpagesonly1");
$agent->success or die $agent->response->status_line;
$member_page = $agent->content();
print "$member_page\n";
So the above works fine. How to convert to do the same job in Mojolicious?
Mojolicious is a web application framework. While Mojo::UserAgent works well as a low-level HTTP user agent, and provides facilities that are unavailable from LWP (in particular, native support for asynchronous requests and IPv6), neither is as convenient to use as WWW::Mechanize for web scraping.
WWW::Mechanize subclasses LWP::UserAgent to interface with the internet, and uses HTML::Form to process the forms it finds. Mojo::UserAgent has no facility for processing HTML forms, and so building the corresponding HTTP requests is not at all straightforward. Information such as the HTTP method used (GET or POST), the names of the form fields, and the insertion of default values for hidden fields are all handled automatically by HTML::Form, and are left to the programmer if you restrict yourself to Mojo::UserAgent.
It seems to me that even trying to use Mojo::UserAgent in combination with HTML::Form is problematic, as the former requires a Mojo::Transaction::HTTP object to represent the submission of a filled-in form, whereas the latter generates HTTP::Request objects for use with LWP.
In short, unless you are willing to largely rewrite WWW::Mechanize, I think there is no way to reimplement your software using Mojolicious modules.
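To illustrate what "left to the programmer" means in practice, here is a rough, untested sketch of doing the login by hand with Mojo::UserAgent. The form's action URL and field names are guesses; you would have to read them out of the login page's HTML yourself:
use Mojo::UserAgent;

my $ua = Mojo::UserAgent->new;
$ua->max_redirects(5);    # cookies are kept automatically in $ua->cookie_jar

# POST the login form manually - action URL and field names are assumptions
my $tx = $ua->post('https://reg.mysite.com/login' => form => {
    username => 'user1',
    password => 'pass1',
});
die $tx->error->{message} if $tx->error;

# fetch the members-only page using the session cookies we just collected
my $member_page = $ua->get('http://www.mysite.com/memberpagesonly1')->res->body;
print "$member_page\n";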
You can use WWW::Mechanize to talk to web servers, and Mojo::DOM to benefit from Mojolicious' parsing. The best of two worlds... :)
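As a rough sketch of that combination (example.com stands in for the real site), you let WWW::Mechanize fetch the page and Mojo::DOM query it with CSS selectors:
use WWW::Mechanize;
use Mojo::DOM;

my $mech = WWW::Mechanize->new();
$mech->get('http://www.example.com/');

# parse the fetched HTML with Mojo::DOM and use CSS selectors on it
my $dom = Mojo::DOM->new( $mech->content );
print $dom->at('title')->text, "\n";
print $_->attr('href'), "\n" for $dom->find('a[href]')->each;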
Just for fun, I'm writing a Perl program to check if a given website exists. For my purposes, a website exists if I can go into my browser, punch in the url and get a meaningful webpage (meaning not an error or "failed to open page" message). What would be the best way to go about doing this? Eventually I would like to be able to give my program a list of hundreds of urls.
I'm thinking about just pinging each of the urls on my list to see if they exist; however, I don't really know too much about networking so is this the best way to do it?
Using Library for WWW in Perl (LWP):
#!/usr/bin/perl
use LWP::Simple;
my $url = 'http://www.mytestsite.com/';
if (head($url)) {
print "Page exists\n";
} else {
print "Page does not exist\n";;
}
There is no such protocol as "pinging web pages" for existence. You actually have to request the resource and if it's served up, it exists. There are several ways to go about it, here are a couple:
Retrieving web pages with LWP
Checking for an existing web page could be as simple as:
#!/usr/bin/env perl
use strict;
use warnings;
use LWP::Simple qw(head);
head('http://www.perlmeme.org') or die 'Unable to get page';
The same solution as a command-line tool is lwp-request (also installed as HEAD). A HEAD request returns only the resource headers, such as the content size, and will be quicker than fetching the whole page contents.
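If you want to check the hundreds of URLs mentioned in the question, the same head() call can simply be wrapped in a loop (the URLs below are placeholders):
#!/usr/bin/env perl
use strict;
use warnings;
use LWP::Simple qw(head);

# placeholder list - in practice, read the URLs from a file
my @urls = (
    'http://www.perlmeme.org',
    'http://www.example.com',
);

for my $url (@urls) {
    print head($url) ? "$url exists\n" : "$url does not seem to exist\n";
}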
I'm trying to crawl this page using Perl LWP:
http://livingsocial.com/cities/86/deals/138811-hour-long-photo-session-cd-and-more
I had code that used to be able to handle LivingSocial, but it seems to have stopped working. Basically the idea was to crawl the page once, get its cookie, set the cookie in the UserAgent, and then crawl it twice more. By doing this, you could get past the welcome page:
$response = $browser->get($url);
$cookie_jar->extract_cookies($response);
$browser->cookie_jar($cookie_jar);
$response = $browser->get($url);
$response = $browser->get($url);
This seems to have stopped working for normal LivingSocial pages, but still seems to work for LivingSocial Escapes, e.g.:
http://livingsocial.com/escapes/148029-cook-islands-hotel-+-airfare
Any tips on how to get past the welcome page?
It looks like this page only works with a JavaScript-enabled browser (which LWP::UserAgent is not). You could try WWW::Mechanize::Firefox instead:
use WWW::Mechanize::Firefox;
my $mech = WWW::Mechanize::Firefox->new();
$mech->get($url);
Note that you must have Firefox and the mozrepl extension installed for this module to work.
Can I get some help on how to submit a POST with the necessary variables using WWW::Mechanize::Firefox? I've installed all the Perl modules and the Firefox plugin, and tested that I can connect to a given host and get responses... my question is how to submit a POST request. In the documentation Corion says he may never implement it. This seems odd; I'm hoping I can use the inherited nature of Mechanize, but can't find any examples. A simple example would help me tremendously.
my $mech = WWW::Mechanize::Firefox->new();
$mech->allow( javascript =>1); # enable javascript
# http
$mech->get("http://www.example.com");
my $c = $mech->content;
Is there a mech->post() option I am simply missing?
many thanks in advance.
R
Normally you would just set the fields and submit the form like this:
$mech->get('http://www.website.com');
$mech->submit_form(
with_fields => {
user => 'me',
pass => 'secret',
}
);
Get a page, fill out the form, submit the form.
If you're going to skip those steps by POSTing directly, you don't need WWW::Mechanize::Firefox at all.
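Putting that together with the snippet from the question, an untested sketch might look like this (the URL and form field names are made up):
use WWW::Mechanize::Firefox;

my $mech = WWW::Mechanize::Firefox->new();
$mech->allow( javascript => 1 );    # enable javascript

$mech->get('http://www.example.com/login');

# filling in the form and submitting it makes Firefox issue the POST for you
$mech->submit_form(
    with_fields => {
        user => 'me',
        pass => 'secret',
    },
);

my $c = $mech->content;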