Finding broken links with Selenium Remote Driver - perl

I have a site with a login and I want to test all the links present on that site.
I tried finding the links and clicking each one to verify it with Selenium Remote Driver. The one problem I have is getting back to the previous URL and selecting the next link; the testing should be recursive.
How can we do this with Selenium Remote Driver?
Here is the program I tried for checking broken links:
sub traverse {
    my ($self) = @_;
    my $links = $self->find_elements("//a");
    foreach my $index (0 .. $#$links) {
        my $url    = $links->[$index]->get_attribute('href');
        my $result = $links->[$index]->click();
        if ($result) {
            $self->traverse();
        } else {
            print "url is broken $url\n";
        }
    }
}

I know it's possible to do this in C# by checking the returned status code. That way you don't actually click the link; you retrieve the header of the response that link would give. In this header you can find the HTTP status code, which you can check to see whether the link gives a valid response or not. Plus you're not leaving the current site!
In C#, a possible method to get the status code looks like this (checking the HTTP status code itself is not included):
private static HttpStatusCode GetStatusCode(string url)
{
    var result = default(HttpStatusCode);

    var request = WebRequest.Create(url);
    request.Method = "HEAD";

    HttpWebResponse response;
    try {
        response = request.GetResponse() as HttpWebResponse;
    } catch (WebException) {
        return HttpStatusCode.NotFound;
    }

    if (response != null)
    {
        result = response.StatusCode;
        response.Close();
        response.Dispose();
    }

    return result;
}
Although this is not Perl code, I hope it helps.
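For what it's worth, a rough Perl equivalent of the same idea, collecting the hrefs with Selenium::Remote::Driver and HEAD-checking each one with LWP::UserAgent instead of clicking it, might look like this sketch ($driver is assumed to be an already-connected driver object):
use strict;
use warnings;
use LWP::UserAgent;

sub check_links {
    my ($driver) = @_;                      # $driver: a connected Selenium::Remote::Driver
    my $ua = LWP::UserAgent->new( timeout => 10 );

    # collect all hrefs up front so we never navigate away from the page
    my $links = $driver->find_elements('//a', 'xpath');   # arrayref of WebElements
    my @urls  = grep { defined && /^https?:/ }
                map  { $_->get_attribute('href') } @$links;

    for my $url (@urls) {
        my $response = $ua->head($url);     # HEAD request only, no body download
        printf "%s %s\n", $response->code, $url;
        print "url is broken: $url\n" unless $response->is_success;
    }
}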

Why not try a dedicated tool? Your site can easily have 9000+ URLs, which is a lot of time and work; you can use Xenu:
Install it
In the options, check "use cookie"
Run IE and log in through it
Run Xenu
P.S. To test the private part of your site, you must log in through IE, because Xenu only uses IE's cookies.

Hmm, I've crossed this bridge before and here is how I solved it. I should say that I crossed it before WebDriver :) so this uses WWW::Selenium instead of S:R:D, but the concept is the same and still applies.
One of the most tedious tasks, IMO, for a test engineer is manually verifying links. We can automate most of the process, and as long as we have the URLs for where we expect to land after clicking each link, we can verify this functionality using Selenium and a little bit of JS.
In the example below we first navigate to our desired website and then use Selenium's getEval() function to execute JavaScript that gathers all the links on the page (anchors) and saves them in a comma-separated list. This list then gets split and pushed into an array. We then iterate through the links in the array, clicking each one and then navigating back to the starting page with go_back.
use strict;
use warnings;
use Time::HiRes qw(sleep);
use Test::WWW::Selenium;
use Test::More "no_plan";

my $sel = Test::WWW::Selenium->new( host        => "localhost",
                                    port        => 4444,
                                    browser     => "*iexplore",
                                    browser_url => "http://www.google.com/" );
$sel->open_ok("/", "true");
$sel->set_speed("1000");

my $javascript = "var allLinks = this.browserbot.getCurrentWindow().document.getElementsByTagName('a');
var separator = ',';
var all_links_texts = '';
for(var i = 0; i < allLinks.length; i++) {
    all_links_texts = all_links_texts+separator+allLinks[i].href;
}
all_links_texts;";

# Get all of the links in the page and, using a comma to separate each one, add them to the all_links_texts var.
my $link_list = $sel->get_eval($javascript);
my @link_array = split /,/, $link_list;

my $count = 0;

# Click on each link contained in the array and then go_back
# You can add other logic here, like capturing and storing a screenshot for example
foreach my $link_name (@link_array) {
    unless ($link_name =~ /^$/) {
        $sel->click_ok("css=a[href\$='$link_name']");
        $sel->wait_for_page_to_load("30000");
        print "Clicked Link href: $link_name \n";
        $sel->go_back();
        $count++;
    }
}
print "Clicked $count URLs";
pass;
This can easily be modified to do much more than just click on the links. And of course nothing beats a good pair of eyes on the intended landing pages for the links clicked. Implementing a similar solution in your organization might ease the manual testing. Here is how I have done it in the past:
Not everything can be automated, but we can certainly make it much easier to review large numbers of links. The logic above can easily be extended to capture a screenshot and add it to a queue of "to be reviewed" images. These properly tagged [by the software] images are what you use in the final phase of the test: the visual verification phase.
With this approach you'll know right away whether a link is broken or not (assuming you update the logic above to include this; again, the example can easily be extended with that functionality), and you'll also be able to visually verify the screenshots of the intended link landing pages.
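For example, the click loop in the script above could queue up a screenshot per link with something like this (capture_entire_page_screenshot is WWW::Selenium's call for this, though browser support varies; the directory and naming scheme are just placeholders):
# inside the foreach loop, after wait_for_page_to_load:
my $shot = sprintf "screenshots/link_%03d.png", $count;
$sel->capture_entire_page_screenshot( $shot, "" );   # save it for the visual review queue
print "Saved screenshot for $link_name to $shot\n";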
I actually have a blog post about this very same issue here: get all links and click on each one
Hope that helps.

Related

Sending an unbuffered response in Plack

I'm working on a section of a Perl module that creates a large CSV response. The server runs on Plack, on which I'm far from an expert.
Currently I'm using something like this to send the response:
$res->content_type('text/csv');

my $body = '';
query_data(
    parameters => \%query_parameters,
    callback   => sub {
        my $row_object = shift;
        $body .= $row_object->to_csv;
    },
);

$res->body($body);
return $res->finalize;
However, that query_data function is not a fast one and retrieves a lot of records. In there, I'm just concatenating each row into $body and, after all rows are processed, sending the whole response.
I don't like this for two obvious reasons: First, it takes a lot of RAM until $body is destroyed. Second, the user sees no response activity until that method has finished working and actually sends the response with $res->body($body).
I tried to find an answer to this in the documentation without finding what I need.
I also tried calling $res->body($row_object->to_csv) in my callback, but it seems that ends up sending only the content of the last call I made to $res->body, overriding all previous ones.
Is there a way to send a Plack response that flushes the content on each row, so the user starts receiving content in real time as the data is gathered, without having to accumulate all the data into a variable first?
Thanks in advance for any comments!
You can't use Plack::Response because that class is intended for representing a complete response, and you'll never have a complete response in memory at one time. What you're trying to do is called streaming, and PSGI supports it even if Plack::Response doesn't.
Here's how you might go about implementing it (adapted from your sample code):
my $env = shift;
if (!$env->{'psgi.streaming'}) {
    # do something else...
}

# Immediately start the response and stream the content.
return sub {
    my $responder = shift;
    my $writer = $responder->([200, ['Content-Type' => 'text/csv']]);
    query_data(
        parameters => \%query_parameters,
        callback   => sub {
            my $row_object = shift;
            $writer->write($row_object->to_csv);
            # TODO: Need to call $writer->close() when there is no more data.
        },
    );
};
Some interesting things about this code:
Instead of returning a Plack::Response object, you can return a sub. This subroutine will be called some time later to get the actual response. PSGI supports this to allow for so-called "delayed" responses.
The subroutine we return gets an argument that is a coderef (in this case, $responder) that should be called and passed the real response. If the real response does not include the "body" (i.e. what is normally the 3rd element of the arrayref), then $responder will return an object that we can write the body to. PSGI supports this to allow for streaming responses.
The $writer object has two methods, write and close, which both do exactly what their names suggest. Don't forget to call close to complete the response; the code above doesn't show this because where it should be called depends on how query_data and the rest of your code work (a minimal sketch of one option follows below).
Most servers support streaming like this. You can check $env->{'psgi.streaming'} to be sure that yours does.
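For example, assuming query_data() only returns after the callback has been invoked for the last row, the streaming sub above can simply close the writer right after it returns (a sketch, not the only way to do it):
return sub {
    my $responder = shift;
    my $writer = $responder->([ 200, [ 'Content-Type' => 'text/csv' ] ]);

    query_data(
        parameters => \%query_parameters,
        callback   => sub { $writer->write( shift->to_csv ) },
    );

    # safe here because query_data() has processed every row by now
    $writer->close;
};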
Plack is middleware. Are you using a web application framework on top of it, like Mojolicious or Dancer2, or a server below it, like Apache or Starman? That would affect how the buffering works.
Here is an example by Plack's author:
https://metacpan.org/source/MIYAGAWA/Plack-1.0037/eg/dot-psgi/echo-stream-sync.psgi
Or you can do it easily by using Dancer2 on top of Plack and Starman or Apache:
https://metacpan.org/pod/distribution/Dancer2/lib/Dancer2/Manual.pod#Delayed-responses-Async-Streaming
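For illustration, a Dancer2 route following that manual page might look roughly like this sketch; delayed, flush, content, and done are Dancer2's streaming keywords, and query_data is the function from the question:
use Dancer2;

get '/report.csv' => sub {
    content_type 'text/csv';
    delayed {
        flush;      # send the status line and headers right away
        query_data(
            parameters => \%query_parameters,
            callback   => sub { content shift->to_csv },   # stream each row
        );
        done;       # close the streamed response
    };
};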
Regards, Peter
Some reading material for you :)
https://metacpan.org/pod/PSGI#Delayed-Response-and-Streaming-Body
https://metacpan.org/pod/Plack::Middleware::BufferedStreaming
https://metacpan.org/source/MIYAGAWA/Plack-1.0037/eg/dot-psgi/echo-stream.psgi
https://metacpan.org/source/MIYAGAWA/Plack-1.0037/eg/dot-psgi/nonblock-hello.psgi
So copy/paste/adapt and report back please

Mojolicious not following redirection from webarchive.org

I'm using Mojolicious DOM and UserAgent to get the source of a page from Webarchive.org, parse it, and import it into a Dotclear database (using webarchive as a backup).
In the source, there are "Previous" and "Next" links that allow getting to the different posts originally made on the blog.
The Perl script I have developed is supposed to run through those links to import all pages of this blog's snapshot.
It first gets the source of the first post of the blog, parses it, puts the result in a local DB, and follows the link under "Next" to do the same thing on the next post, until there are no more "Next" posts.
So much for the basics.
But the trick is that the link I get from the source is not the link Webarchive has.
Webarchive's links to snapshots go like this:
http://web.archive.org/web/20131012182412/http://www.mytarget.com/post?mypost
The big number between "web" and the original URL is (I guess) the date the snapshot was made. The trick is that it changes with each snapshot, and although it may appear on one post, the next post may have been snapshotted on another date. So the URL won't fit.
When I click on the link I get from the source, it brings me to webarchive.org, which automatically searches for the page I pass and redirects me to it.
But when I try to get the source via the get() function of Mojolicious, it just gets the "Page not found" page of webarchive.
So, here is my question: is there a way to make Mojolicious follow webarchive's redirection? I activated max_redirects(5) on my UserAgent, but it still does the same.
Here is my code:
sub main {
    my ($url) = @_;
    my $ua = Mojo::UserAgent->new;
    $ua = $ua->max_redirects(5);
    my $dom = $ua->get($url)->res->dom;
    # ... treatment and parsing of the source ...
    return $nextUrl;
}

my $nextUrl = "http://web.archive.org/web/20131012182412/http://www.mytarget.com/post?mypost";
my $secondUrl;
while ($nextUrl) {
    $secondUrl = main($nextUrl);
    $nextUrl   = $secondUrl;
}
Thanks in advance...
I've finally found a way around it.
I use this piece of code to follow the URL and get the URL that is finally reached:
use LWP::UserAgent qw();
my $ua = LWP::UserAgent->new;
my $ret = $ua->get($url);
$url = $ret->request->uri ."";
print "URL returned: ".$url."\n";
Then I use that URL to fetch the source code and parse it.
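For what it's worth, Mojo::UserAgent can also tell you the URL that was finally reached once max_redirects is set, without pulling in LWP; a minimal sketch:
use Mojo::UserAgent;

my $ua = Mojo::UserAgent->new->max_redirects(5);
my $tx = $ua->get($url);

# the transaction returned is the last one in the redirect chain,
# so its request URL is the address actually reached
my $final_url = $tx->req->url->to_abs;
print "URL returned: $final_url\n";

my $dom = $tx->res->dom;   # parse the page we actually landed on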

Integrate solvemedia captcha using perl

I would like to know how to insert the Solve Media captcha into my script. I installed the module from their site (https://portal.solvemedia.com/media/download/WWW-SolveMedia-1.1.tar.gz) but don't know where to add this (their instructions):
Once the plugin is installed, you can start making calls to the Solve Media API.
Display the Widget
To display the Solve Media widget on one of your forms, instantiate the SolveMedia class, supplying it with your API keys. Then call the get_html function. You can find your API keys at My account:
use WWW::SolveMedia;

my $c = WWW::SolveMedia->new( 'my challenge key',
                              'my verification key',
                              'my hash key' );

# output widget
print $c->get_html();
Process Answer
You can check the user's response by calling SolveMedia.check_answer(...).
# check answer
my $result = $c->check_answer( $ENV{REMOTE_ADDR}, $challenge, $response );

if ( $result->{is_valid} ) {
    print "Yay!";
} else {
    print "Dang it :-(\n";
    print "Error: " . $result->{error};
}
And this is where I get stuck, because I don't have a clue how or where to insert that code. If any of you are willing to help, please respond. I'm willing to pay a few bucks.
You create the new object and either save the result of get_html into a variable, which you then stick into some web page, or you print it inline.
You put that Perl code in the subroutines that generate the pages where you want the captcha to appear.
And you put the call to check_answer in the code that processes the submission of the form on the page you printed the captcha into.
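Concretely, if your script is a plain CGI script, the glue might look roughly like this sketch; the keys are placeholders, and adcopy_challenge/adcopy_response are the parameter names the Solve Media widget normally posts back (double-check them against their docs):
#!/usr/bin/perl
use strict;
use warnings;
use CGI;
use WWW::SolveMedia;

my $q = CGI->new;
my $c = WWW::SolveMedia->new( 'my challenge key',
                              'my verification key',
                              'my hash key' );

if ( $q->param('adcopy_response') ) {
    # the form was submitted: verify the answer
    my $result = $c->check_answer( $ENV{REMOTE_ADDR},
                                   scalar $q->param('adcopy_challenge'),
                                   scalar $q->param('adcopy_response') );
    print $q->header('text/html');
    print $result->{is_valid} ? "Yay!" : "Dang it :-( Error: $result->{error}";
}
else {
    # first visit: print the form with the widget embedded in it
    print $q->header('text/html');
    print '<form method="post">', $c->get_html(), '<input type="submit"></form>';
}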

feedpp and session ID

We are using Perl and the CPAN module XML::FeedPP to parse RSS feeds.
The Perl script runs through the different items of the RSS feeds and saves the link to the database, like this:
my $response = $ua->get($url);
if ($response->is_success) {
    my $feed = XML::FeedPP->new( $response->content, -type => 'string' );
    foreach my $item ( $feed->get_item() ) {
        my $link = $item->link();
        [...]
$url contains the URL of an RSS feed, like http://my.domain/RSS/feeds.xml
In this case, $item->link() will contain links to the RSS articles, like http://my.domain/topic/myarticle.html
The problem is that some web servers (which provide the RSS feeds) do an HTTP redirect in order to add a session ID to the URL, like this: http://my.domain/RSS/feeds.xml;jsessionid=4C989B1DB91D706C3E46B6E30427D5CD
The strange thing is that FeedPP seems to add this session ID to the link of every item, so $item->link() contains links to the RSS articles like http://my.domain/topic/myarticle.html;jsessionid=4C989B1DB91D706C3E46B6E30427D5CD
even if the original link does not contain a session ID.
Is there a way to turn off that behavior of FeedPP?
Thank you for any kind of help.
I took a look through http://metacpan.org/pod/XML::FeedPP but didn't see any way to have the link() method trim those session IDs for you. (I'm using XML::FeedPP in one of my scripts, and the site I happen to be parsing doesn't use session IDs.)
So I think the answer is no, not currently. You could try contacting the author or filing a bug.
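Until then, a simple workaround is to strip the session ID from each link yourself after calling link(); a sketch (not FeedPP functionality):
my $link = $item->link();
# drop a ";jsessionid=..." path parameter while keeping any query string
$link =~ s/;jsessionid=[^?#]*//i;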
IMHO, the behavior is correct: URI components which follow a semicolon are defined as part of the path (a parameter for how that path segment is interpreted), so when the URI is used to turn a relative link into an absolute URI, they need to be copied as well.
You expect behavior consistent with '&' parameters, but they are not the same thing.
https://rt.cpan.org/Ticket/Display.html?id=73895

How do I use and debug WWW::Mechanize?

I am very new to Perl and I am learning on the fly while I try to automate some projects for work. So far it has been a lot of fun.
I am working on generating a report for a customer. I can get this report from a web page I can access.
First I need to fill in a form with my user name and password, choose a server from a drop-down list, and log in.
Second, I need to click a link for the report section.
Third, I need to fill in a form to create the report.
Here is what I have written so far:
my $mech = WWW::Mechanize->new();
my $url  = 'http://X.X.X.X/Console/login/login.aspx';

$mech->get( $url );

$mech->submit_form(
    form_number => 1,
    fields      => {
        'ctl00$ctl00$cphVeriCentre$cphLogin$txtUser'    => 'someone',
        'ctl00$ctl00$cphVeriCentre$cphLogin$txtPW'      => '12345',
        'ctl00$ctl00$cphVeriCentre$cphLogin$ddlServers' => 'Live',
    },
    button => 'Sign-In',
);

die unless ($mech->success);

$mech->dump_forms();
I don't understand why, but after this I look at what dump_forms outputs and I see the code for the login page, while I believe I should have reached the next page after my successful login.
Could there be something with a cookie that affects the login attempt?
Anything else I am doing wrong?
Appreciate your help,
Yaniv
This is several months after the fact, but I resolved the same issue based on a similar question I asked. See Is it possible to automate postback from the client side? for more info.
I used Python's mechanize instead of Perl, but the same principle applies.
Summarizing my earlier response:
ASP.NET pages need a hidden parameter called __EVENTTARGET in the form, which won't exist when you use mechanize normally.
When visited by a normal user, there is a __doPostBack('foo') function on these pages that supplies the relevant value to __EVENTTARGET via a JavaScript onclick event on each of the links, but since mechanize doesn't run JavaScript you'll need to set these values yourself.
The Python solution is below, but it shouldn't be too tough to adapt it to Perl.
def add_event_target(form, target):
    # Creates a new __EVENTTARGET control and adds the value specified
    # .NET doesn't generate this in mechanize for some reason -- suspect maybe it is
    # normally generated by javascript or some useragent thing?
    form.new_control('hidden', '__EVENTTARGET', attrs=dict(name='__EVENTTARGET'))
    form.set_all_readonly(False)
    form["__EVENTTARGET"] = target
You can only mechanize stuff that you know. Before you write any more code, I suggest you use a tool like Firebug and inspect what happens in your browser when you do this manually.
Of course there might be cookies involved. Or maybe you forgot a hidden form parameter? Only you can tell.
EDIT:
WWW::Mechanize should take care of cookies without any further intervention.
You should always check whether the methods you called were successful. Does the first get() work?
It might be useful to take a look at the server logs to see what is actually requested and what HTTP status code is sent as a response.
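One way to see exactly what is requested without going to the server logs is to hook LWP's handler phases, since WWW::Mechanize is an LWP::UserAgent subclass; a small sketch:
my $mech = WWW::Mechanize->new( autocheck => 1 );   # die on HTTP errors

# dump every outgoing request and every response as they happen
$mech->add_handler( request_send  => sub { print shift->dump, "\n"; return } );
$mech->add_handler( response_done => sub { print shift->dump( maxlength => 0 ), "\n"; return } );

$mech->get($url);
print "Status: ", $mech->status, "\n";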
If you are on Windows, use Fiddler to see what data is being sent when you perform this process manually, and then use Fiddler to compare it to the data captured when performed by your script.
In my experience, a web debugging proxy like Fiddler is more useful than Firebug when inspecting form posts.
I have found it very helpful to use the Wireshark utility when writing web automation with WWW::Mechanize. It will help you in a few ways:
Realize whether your HTTP request was successful or not.
See the reason for failure at the HTTP level.
Trace the exact data which you pass to the server and see what you receive back.
Just set an HTTP filter for the network traffic and start your Perl script.
The very short gist of aspx pages is that they hold all of the local session information within a couple of variables prefixed by "__" in the general aspxform. Usually this is a top-level form and all form elements will be part of it, but I guess that can vary by implementation.
For the particular implementation I was dealing with I needed to worry about 2 of these state variables, specifically:
__VIEWSTATE
__EVENTVALIDATION.
Your goal is to make sure that these variables are submitted with the form you are submitting, since they might be part of that main aspxform I mentioned above, and you are probably submitting a different form than that.
When a browser loads an aspx page, a piece of JavaScript passes this session information along within the ASP server/client interaction, but of course we don't have that luxury with Perl Mechanize, so you will need to manually post these yourself by adding the elements to the current form using Mechanize.
In the case that I just solved I basically did this:
my $browser = WWW::Mechanize->new();

# fetch the login page to get the initial session variables
my $login_page = 'http://www.example.com/login.aspx';
my $response = $browser->get( $login_page );

# very short way to find the fields so you can add them to your post
my $viewstate  = ($browser->find_all_inputs( type => 'hidden', name => '__VIEWSTATE' ))[0]->value;
my $validation = ($browser->find_all_inputs( type => 'hidden', name => '__EVENTVALIDATION' ))[0]->value;

# post back the form data you need along with the session variables
$response = $browser->post( $login_page, [
    username            => 'user',
    password            => 'password',
    '__VIEWSTATE'       => $viewstate,
    '__EVENTVALIDATION' => $validation,
]);

# finally get back the content and make sure it looks right
print $response->content();