Perl Mechanize follow_link is failing

I'm trying to log in to a site using the following code:
use WWW::Mechanize;
use HTTP::Cookies;

my $mech = WWW::Mechanize->new( autocheck => 1 );    # 'autosave' is not a valid constructor option
$mech->cookie_jar( HTTP::Cookies->new() );
$mech->get($url);
$mech->follow_link( text => 'Sign In' );
$mech->field( UserName => $username );
$mech->field( Password => $password );
$mech->submit();
But the href of that link starts with two slashes, e.g. (//test/sso-login), so follow_link treats it as a protocol-relative URL and it fails as below:
Error GETing http://test/sso-login: Can't connect to test:80 (Bad hostname)
I can't change the href since it's out of my control. Is there a way to overcome this problem and make it build the full URL, appending this href to the current site?

Sure. You can modify the HTML that Mech is looking at just before you call follow_link():
my $html = $mech->content;
$html =~ s[//test/sso-login][http://example.com/test/sso-login]isg;
$mech->update_html( $html );
See the documentation for details. Search for "update_html" on that page.
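Alternatively, you can resolve the broken href by hand instead of rewriting the HTML. A minimal sketch, assuming the 'Sign In' link from the question and the URI module (which Mechanize already depends on): strip one of the leading slashes so the href becomes an absolute path, then resolve it against the current page:
use URI;

my $link = $mech->find_link( text => 'Sign In' )
    or die "No 'Sign In' link found";
( my $path = $link->url ) =~ s{^//}{/};         # "//test/sso-login" -> "/test/sso-login"
my $abs = URI->new_abs( $path, $mech->uri );    # e.g. "http://example.com/test/sso-login"
$mech->get($abs);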


How can I scrape images off of a webpage using Perl

I would like to scrape all images off of a webpage and am running into a problem I don't understand.
For instance, if I enter https://www.google.com/search?as_st=y&tbm=isch&hl=en&as_q=%22escape%20room%22+movie+poster&safe=images into my browser and then use the browser's "View Source" option, I get a massive amount of text/code. Using "Find" I get more than 400 instances of https://.
So the simple code I wrote (below) gets the content and writes the result to a file. But a grep search for https:// only returns 7 instances. So obviously I am doing something incorrectly; perhaps the page is dynamic and I can't access that part?
Is there a way I can get the same source, via Perl, that I get via View Source?
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request;

my $ua = LWP::UserAgent->new;
$ua->agent( "$0/0.1 " . $ua->agent );
$ua->agent("Mozilla/8.0");    # note: this overwrites the agent string set above

my $sstring = "https://www.google.com/search?as_st=y&tbm=isch&hl=en&as_q=%22escape%20room%22+movie+poster&safe=images";
my $req = HTTP::Request->new( GET => $sstring );
$req->header( Accept => 'text/html' );

my $res = $ua->request($req);
open( my $fh, '>', 'report.txt' ) or die "Can't open report.txt: $!";
print $fh $res->decoded_content;
close $fh;
Here's the example I got from WWW::Mechanize::Chrome:
use WWW::Mechanize::Chrome;

my $mech = WWW::Mechanize::Chrome->new();
my $sstring = "https://www.google.com/search?as_st=y&tbm=isch&hl=en&as_q=\"" . $arrayp . "\"" . "+movie+poster&safe=images";
$mech->get($sstring);
sleep 5;    # give the page's Javascript time to run
print $_->get_attribute('href'), "\n\t-> ", $_->get_attribute('innerHTML'), "\n"
    for $mech->selector('a.download');
The Google search uses Javascript to alter the page content after load. LWP::UserAgent does not support Javascript, so what you get is only the initial document. (Hint: an easy way to see in the browser what LWP::UserAgent "sees" is to use a browser add-on that disables Javascript.)
You will need to use what is called a "headless browser", for example WWW::Mechanize::Chrome.
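A minimal sketch of that approach, assuming WWW::Mechanize::Chrome is installed and can find a Chrome binary: fetch the page, let its Javascript run, then save the rendered DOM to the same report.txt as the LWP version above:
use strict;
use warnings;
use WWW::Mechanize::Chrome;

my $mech = WWW::Mechanize::Chrome->new( headless => 1 );
my $sstring = "https://www.google.com/search?as_st=y&tbm=isch&hl=en&as_q=%22escape%20room%22+movie+poster&safe=images";
$mech->get($sstring);
sleep 5;    # crude wait for the Javascript to finish

open my $fh, '>', 'report.txt' or die "Can't open report.txt: $!";
print {$fh} $mech->content;    # the live DOM, not just the initial document
close $fh;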

Following links using WWW::Mechanize

I am trying to access an internal webpage to start and stop an application using WWW::Mechanize. So far I am able to log in to the application successfully. My next action item is to identify a particular service from the list of services and stop it.
The problem I am facing is that I am unable to follow the link on the webpage. After looking at the HTML and the link object, it is evident that there isn't a URL but an onclick event.
Here is a snippet of the HTML:
<ul>
  <li>
    <a href="#" style="color:#3BB9FF;" id="j_id_id1:j_id_id9:2:j_id_id10"
       name="j_id_id1:j_id_id9:2:j_id_id10"
       onclick="A4J.AJAX.Submit(...); return false;">servicename</a>
  </li>
</ul>
The link object dump is:
$VAR1 = \bless( [
          '#',
          'servicename',
          'j_id_id1:j_id_id9:2:j_id_id10',
          'a',
          bless( do{\(my $o = 'http://blah.services.jsf')}, 'URI::http' ),
          {
            'href' => '#',
            'style' => 'color:#3BB9FF;',
            'name' => 'j_id_id1:j_id_id9:2:j_id_id10',
            'onclick' => 'A4J.AJAX.Submit(\'j_id_id1\',event,{\'similarityGroupingId\':\'j_id_id1:j_id_id9:2:j_id_id10\',\'parameters\':{\'j_id_id1:j_id_id9:2:j_id_id10\':\'j_id_id1:j_id_id9:2:j_id_id10\',\'ajaxSingle\':\'j_id_id1:j_id_id9:2:j_id_id10\'} ,\'containerId\':\'j_id_id0\',\'actionUrl\':\'/pages/services.jsf;jsessionid=NghBSoEJZKXbWcK0uVzcHvyebl8G_zSpf_Zu4uqrLI7xosHAnheK!1108773228\'} );return false;',
            'id' => 'j_id_id1:j_id_id9:2:j_id_id10'
          }
        ], 'WWW::Mechanize::Link' );
Here is my code so far:
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
use WWW::Mechanize;
my $username = 'myuser';
my $password = 'mypass';
my $url = 'myinternalurl';
my $mech = WWW::Mechanize->new();
$mech->credentials($username,$password);
$mech->get($url);
my $link = $mech->find_link( text => 'servicename' );
#print Dumper \$link;
#$mech->follow_link( url => $link->url_abs() );
$mech->get($link->url_abs());
print $mech->text();
If I use follow_link, I get "Link not found at log_in.pl line 16.". If I use get, then I get back the same page. The problem is that all these services appear to be hyperlinks but have the same URL as my main URL.
When I manually click a service, the Operations and Properties sections change, which allows the user to view the Operations and Properties of the service they just clicked. Every service has a different set of Operations and Properties.
How should I go about doing this using Perl? Is WWW::Mechanize the wrong tool for this one? Can anyone please suggest a solution or an alternate Perl module that could help? Installing any CPAN module is not an issue. Working with the latest version of Perl is not an issue either. I have just started automating with Perl and am currently unaware of all the modules that could get the job done.
Looking forward to your guidance and help.
Note: if you feel there is any pertinent information I may have missed, please leave a comment and I will update the question to add more details. I have modified proprietary information.
That link has a Javascript onclick handler, which will not run when using WWW::Mechanize.
Per the docs:
Please note that Mech does NOT support JavaScript, you need additional software for that. Please check "JavaScript" in WWW::Mechanize::FAQ for more.
One alternative that does support Javascript in forms is WWW::Mechanize::Firefox.
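A rough sketch of that route, assuming WWW::Mechanize::Firefox is installed and can reach a running Firefox (it drives a real browser, so the onclick handler actually fires); the element id is copied from the link dump above, and the click option syntax should be double-checked against the module's docs:
use WWW::Mechanize::Firefox;

my $mech = WWW::Mechanize::Firefox->new();
$mech->get($url);
# click the service link by its id; the browser then runs the A4J.AJAX.Submit handler
$mech->click({ xpath => q{//a[@id='j_id_id1:j_id_id9:2:j_id_id10']} });
print $mech->content;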

Mojolicious::Lite: how to redirect "not found" and server error pages to a user-defined error page

How can I redirect "not found" and server error pages to a user-defined error page in Mojolicious::Lite?
You can add a template for your custom page, named exception.html.ep or not_found.html.ep, at the end of your lite app.
For example:
use Mojolicious::Lite;

get '/' => sub {
    my $self = shift;
    $self->render( text => "Hello." );
};

app->start;

__DATA__

@@ not_found.html.ep
<!DOCTYPE html>
<html>
  <head><title>Page not found</title></head>
  <body>Page not found <%= $status %></body>
</html>
For a reference, see the Mojolicious rendering guide:
The renderer will always try to find exception.$mode.$format.* or not_found.$mode.$format.* before falling back to the built-in default templates.
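The same lookup covers server errors. A minimal sketch of a matching exception template, using the $exception object that Mojolicious places in the stash for error pages:
@@ exception.html.ep
<!DOCTYPE html>
<html>
  <head><title>Server error</title></head>
  <body>Something went wrong: <%= $exception->message %></body>
</html>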
I wanted to run some code in my 404 page, so borrowing from https://groups.google.com/forum/#!topic/mojolicious/0wzBRnetiHo I made a route that catches everything and placed it after all my other routes, so URLs that don't match any route fall through to this:
any '/*catchall' => sub {    # named wildcard placeholder; matches anything left over
    my $self = shift;
    $self->res->code(404);
    $self->res->message('Not Found');
    # 404
    $self->stash( {
        # ... my stuff in the stash ...
    } );
    $self->render( 'mytemplate', status => 404 );
};
I had an API that wanted to send back 404 errors as it would any other error: same JSON format and whatnot.
I had this snippet at the end of startup (full app, not lite). Since this is the last defined route, it only picks up anything not already handled. And since this handles all of that, Mojo never gets the chance to use its own 404 handling by looking for templates:
$self->routes->any('/*catchall')->to(
    controller => 'Base',
    action     => 'not_found',
);
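A hypothetical sketch of the matching controller action; the package name and the JSON shape are illustrative, not from the original post:
package MyApp::Controller::Base;
use Mojo::Base 'Mojolicious::Controller';

sub not_found {
    my $self = shift;
    # render the 404 in the same JSON shape the API uses for its other errors
    $self->render(
        json   => { error => 'not_found' },
        status => 404,
    );
}

1;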

WWW::Mechanize Form Select

I am attempting to log in to YouTube with WWW::Mechanize and use forms() to print out all the forms on the page after logging in. My script is logging in successfully, and also successfully navigating to youtube.com/inbox; however, for some reason Mechanize cannot see any forms at youtube.com/inbox. It just returns blank. Here is my code:
#!"C:\Perl64\bin\perl.exe" -T
use strict;
use warnings;
use CGI;
use CGI::Carp qw/fatalsToBrowser/;
use WWW::Mechanize;
use Data::Dumper;
my $q = CGI->new;
$q->header();
my $url = 'https://www.google.com/accounts/ServiceLogin?uilel=3&service=youtube&passive=true&continue=http://www.youtube.com/signin%3Faction_handle_signin%3Dtrue%26nomobiletemp%3D1%26hl%3Den_US%26next%3D%252Findex&hl=en_US&ltmpl=sso';
my $mechanize = WWW::Mechanize->new(autocheck => 1);
$mechanize->agent_alias( 'Windows Mozilla' );
$mechanize->get($url);
$mechanize->submit_form(
form_id => 'gaia_loginform',
fields => { Email => 'myemail',Passwd => 'mypassword' },
);
die unless ($mechanize->success);
$url = 'http://www.youtube.com/inbox';
$mechanize->get($url);
$mechanize->form_id('comeposeform');
my $page = $mechanize->content();
print Dumper($mechanize->forms());
Mechanize is unable to see any forms at youtube.com/inbox; however, like I said, I can print all of the forms from the initial link, no matter what I change it to.
Thanks in advance.
As always, one of the best debugging approaches is to print what you get and check that it is what you were expecting. This applies to your problem too.
In your case, if you print $mechanize->content() you'll see that you didn't get the page you're expecting. YouTube wants you to follow a JavaScript redirect in order to complete your cross-domain login action. You have multiple options here:
- parse the returned content manually, e.g. with /location\.replace\("(.+?)"/
- try to have your code parse JavaScript (have a look at WWW::Scripter)
- [recommended] use the YouTube API for managing your inbox
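A minimal sketch of the first option, assuming the redirect really does appear in the returned page as a location.replace("...") call:
if ( $mechanize->content =~ /location\.replace\("(.+?)"\)/ ) {
    my $target = $1;
    $target =~ s/\\x([0-9a-fA-F]{2})/chr hex $1/ge;    # unescape \xNN sequences, if present
    $mechanize->get($target);
}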

How can I download a file using WWW::Mechanize or any Perl module?

Is there a way in WWW::Mechanize or any Perl module to read a file after accessing a website? For example, I click a button 'Receive', and a file (.txt) appears containing a message. How will I be able to read the content? Answers are very much appreciated. I've been working on this for days and have tried all the possibilities. Can anyone help? Even an idea would be appreciated. :)
Here is a part of my code:
...
my $username = "admin";
my $password = "12345";
my $url = "http://...do_gsm_sms.cgi";

my $mech = WWW::Mechanize->new(
    autocheck   => 1,
    quiet       => 0,
    agent_alias => $login_agent,
    cookie_jar  => $cookie_jar,
);
$mech->credentials( $username, $password );
$mech->get($url);
$mech->success() or die "Can't fetch the requested page";
print "OK! \n";    # This works
$mech->form_number(1);
$mech->click();
After this, the 'Downloads' dialog box will appear so I can save the file (but I can also set the default to open it immediately instead of saving). Question is, how can I read the content of this file?
I take it you mean that the web site responds to the form submission by returning a non-HTML response (say a text/plain file) that you wish to save.
I believe you want $mech->save_content( $filename )
Added:
First you need to complete the form submission before saving the resulting (text) file.
The click is for clicking a button, whereas you want to submit a form, using $mech->submit() or $mech->submit_form( ... ):
#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;

my $username    = "admin";
my $password    = "12345";
my $login_agent = 'WWW::Mechanize login-agent';
my $cookie_jar;

#my $url = "http://localhost/cgi-bin/form_mech.pl";
my $url = "http://localhost/form_mech.html";

my $mech = WWW::Mechanize->new(
    autocheck   => 1,
    quiet       => 0,
    agent_alias => $login_agent,
    cookie_jar  => $cookie_jar,
);
$mech->credentials( $username, $password );
$mech->get($url);
$mech->success() or die "Can't fetch the requested page";
print "OK! \n";    # This works

$mech->submit_form( form_number => 1 );
die "Submit failed" unless $mech->success;

$mech->save_content('out.txt');
After the click (assuming that's doing what it's supposed to), the returned data should be stored in your $mech object. You should be able to get the file data with $mech->content(), perhaps after verifying success with $mech->status() and the type of response with $mech->content_type().
You may find it helpful to remember that WWW::Mechanize replaces the browser: anything a browser would have done, like bringing up a download window and saving a file, doesn't actually happen, but all the information the browser would have had is accessible through WWW::Mechanize's methods.
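A small sketch of that check; the text/plain content type is an assumption based on the .txt file described in the question:
if ( $mech->status == 200 && $mech->content_type eq 'text/plain' ) {
    my $file_data = $mech->content;    # the downloaded file, as one string
    print length($file_data), " bytes received\n";
}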
Dare I ask... have you tried this?
my $content = $mech->content();
Open the file (not the 'Downloads' window) as if you were viewing it within your browser; you can save it later with a few lines of code.
Provided you have HTML::TreeBuilder installed:
my $textFile = $mech->content( format => "text" );
should get you the text of the resulting window that opens.
Then open a filehandle to write your results to:
open my $fileHandle, ">", "results.txt" or die "Can't open results.txt: $!";
print $fileHandle $textFile;
close $fileHandle;
I do this all the time with LWP, but I'm sure it's equally possible with Mech.
I think where you might be going wrong is using Mech to request the page that has the button on it, when you actually want the content that clicking the button causes to be sent to the browser.
What you need to do is review the HTML source of the page with the button that initiates the download and see what the action associated with that button is. Most likely it will be a POST with some hidden fields, or a URL to GET.
The target URL of the click has the stuff you actually want, not the URL of the page with the button on it.
For problems like this, you often have to investigate the complete chain of events that the browser handles. Use an HTTP sniffer tool to see everything the browser is doing until it gets to the file. You then have to do the same thing in Mech.
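A hypothetical sketch of replaying what the sniffer shows; the field names below are made up for illustration, so substitute whatever request the sniffer actually records:
# replay the button's POST directly; WWW::Mechanize inherits post() from LWP::UserAgent
my $response = $mech->post(
    'http://...do_gsm_sms.cgi',                  # the target URL the sniffer shows
    { action => 'receive', format => 'txt' },    # hypothetical hidden fields
);
print $response->decoded_content if $mech->success;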