Mojolicious not following redirection from webarchive.org - perl

I'm using Mojolicious DOM and UserAgent to get the source of a page from Webarchive.org, parse it, and import it into a Dotclear database (using Webarchive as a backup).
In the source, there are "Previous" and "Next" links that lead to the different posts originally made on the blog.
The Perl script I have developed is supposed to run through those links to import all pages of this blog's snapshot.
It first gets the source of the first post of the blog, parses it, puts the result in a local DB, and follows the link under "Next" to do the same thing on the next post, until there are no more "Next" posts.
So much for the basics.
But the trick is that the link I get from the source is not the link Webarchive has.
Webarchive's links to snapshots go like this:
http://web.archive.org/web/20131012182412/http://www.mytarget.com/post?mypost
The big number between "web" and the original URL is (I guess) the date the snapshot was made. The trick is that it changes with each snapshot, and although it may appear on one post, the next post may have been snapshotted on another date, so the URL won't match.
When I click on the link I get from the source, it brings me to webarchive.org, which automatically searches for the page I pass and redirects me to it.
But when I try to get the source via the get() function of Mojolicious, it just gets the "Page not found" page of Webarchive.
So here is my question: is there a way to make Mojolicious follow Webarchive's redirection? I activated max_redirects(5) on my UserAgent, but it still does the same.
Here is my code:
sub main {
    my ($url) = @_;
    my $ua = Mojo::UserAgent->new;
    $ua = $ua->max_redirects(5);
    my $dom = $ua->get($url)->res->dom;
    # ... treatment and parsing of the source ...
    return $nextUrl;
}
my $nextUrl = "http://web.archive.org/web/20131012182412/http://www.mytarget.com/post?mypost";
my $secondUrl;
while ($nextUrl) {
    $secondUrl = main($nextUrl);
    $nextUrl = $secondUrl;
}
Thanks in advance...

I've finally found a workaround.
I use this piece of code to follow the URL and get the final URL that is reached:
use LWP::UserAgent qw();
my $ua = LWP::UserAgent->new;
my $ret = $ua->get($url);
$url = $ret->request->uri ."";
print "URL returned: ".$url."\n";
Then I use that URL to fetch the source code and parse it.
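For completeness, here is a minimal sketch of that two-step approach (resolve the final URL with LWP, then fetch and parse it with Mojolicious); the helper name resolve_final_url is my own, and the parsing itself is omitted:

use LWP::UserAgent;
use Mojo::UserAgent;

# Resolve the URL actually reached after Webarchive's redirects.
sub resolve_final_url {
    my ($url) = @_;
    my $lwp = LWP::UserAgent->new;
    my $ret = $lwp->get($url);
    # The request object reflects the last request made after redirects.
    return $ret->request->uri->as_string;
}

my $final = resolve_final_url($nextUrl);
# Fetch and parse the snapshot source with Mojolicious as before.
my $dom = Mojo::UserAgent->new->get($final)->res->dom;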

Related

Finding broken links with Selenium Remote Driver

I have a site with a login and I want to test all the links present on that site.
I tried finding the links and clicking on each one to verify them with Selenium Remote Driver. But one problem I have is coming back to the previous URL and selecting the next link. This testing should be recursive.
How can we do this with Selenium Remote Driver?
Here is the program I tried to check for broken links:
sub traverse {
    my ($self) = @_;
    my $links = find_links("//a");
    foreach my $index (1..$#$links) {
        my $url = $links->[$index]->get_attribute('href');
        my $result = $links->[$index]->click();
        if ($result) {
            traverse();
        } else {
            print "url is broken $url\n";
        }
    }
}
I know it's possible to do this in C# by checking the returned status code. So you don't actually click on the link; instead you retrieve the header of the response that link would give. In that header you can find the HTTP status code, which you can check to see whether the link gives a valid response or not. Plus, you're not leaving the current site!
In C#, a possible method to get the status code would look like this (checking the HTTP status code itself is not included):
private static HttpStatusCode GetStatusCode(string url)
{
    var result = default(HttpStatusCode);
    var request = WebRequest.Create(url);
    request.Method = "HEAD";
    HttpWebResponse response;
    try {
        response = request.GetResponse() as HttpWebResponse;
    } catch (WebException) {
        return HttpStatusCode.NotFound;
    }
    if (response != null)
    {
        result = response.StatusCode;
        response.Close();
        response.Dispose();
    }
    return result;
}
Although this is not Perl code, I hope this helps.
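For anyone wanting the same idea in Perl, here is a rough sketch using LWP::UserAgent's head() method; the helper name head_status and the example URLs are mine:

use strict;
use warnings;
use LWP::UserAgent;

# Send a HEAD request so only the headers come back, not the page body.
sub head_status {
    my ($url) = @_;
    my $ua  = LWP::UserAgent->new( timeout => 10 );
    my $res = $ua->head($url);
    return $res->code;    # HTTP status code, e.g. 200, 404, 500
}

my @urls = ('http://example.com/', 'http://example.com/missing');
for my $url (@urls) {
    my $status = head_status($url);
    print "url is broken $url (status $status)\n" if $status >= 400;
}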
Why not use an existing tool? Your site may have over 9000 URLs, which is a lot of time and work to check by hand; you can use Xenu:
Install it
In the options, check "use Cookie"
Run IE and log in through it
Run Xenu
P.S. To test the private part of your site, you must log in through IE because Xenu only uses the IE cookie.
Hmm, I've crossed this bridge before, and here is how I solved it. I should say that I crossed this bridge before WebDriver :) so this uses WWW::Selenium instead of S::R::D, but the concept is the same and still applies.
One of the most tedious tasks, IMO, for a test engineer is manually verifying links. We can automate most of the process, and as long as we have the URLs for where we are expected to land after clicking the link, we can verify this functionality using Selenium and a little bit of JS.
In the example below we first navigate to our desired website and then use Selenium's getEval() function to execute JavaScript that gathers all the links on the page (anchors) and saves them in a comma-separated list. This list then gets split and pushed into an array. We then iterate through the links in the array, clicking on each one and then navigating back to the starting page using go_back.
use strict;
use warnings;
use Time::HiRes qw(sleep);
use Test::WWW::Selenium;
use Test::More "no_plan";

my $sel = Test::WWW::Selenium->new( host        => "localhost",
                                    port        => 4444,
                                    browser     => "*iexplore",
                                    browser_url => "http://www.google.com/" );
$sel->open_ok("/", "true");
$sel->set_speed("1000");

my $javascript = "var allLinks = this.browserbot.getCurrentWindow().document.getElementsByTagName('a');
var separator = ',';
var all_links_texts = '';
for(var i = 0; i < allLinks.length; i++) {
    all_links_texts = all_links_texts+separator+allLinks[i].href;
}
all_links_texts;";

# Get all of the links in the page and, using a comma to separate each one, add them to the all_links_texts var.
my $link_list = $sel->get_eval($javascript);
my @link_array = split /,/ , $link_list;
my $count = 0;

# Click on each link contained in the array and then go_back
# You can add other logic here like capture and store a screenshot for example
foreach my $link_name (@link_array) {
    unless ($link_name =~ /^$/) {
        $sel->click_ok("css=a[href \$= $link_name]");
        $sel->wait_for_page_to_load("30000");
        print "Clicked Link href: $link_name \n";
        $sel->go_back();
        $count++;
    }
}
print "Clicked $count URL's";
pass;
This can easily be modified to do much more than just click on the links. And of course nothing beats a good pair of eyes on the intended landing pages for the links clicked. Implementing a similar solution in your organization might ease the manual testing. Here is how I have done it in the past:
Not everything can be automated, but we can certainly make it much easier to review large amounts of links. The above logic can easily be extended to capture a screenshot and add it to a queue of "to be reviewed" images. These properly tagged [by the software] images are what you use in the final phase of the test: the visual verification phase.
With this approach you'll know right away if a link is broken or not (assuming you update the logic above to include this; again, the example can easily be extended to add that functionality). You will also be able to visually verify the screenshots of the intended link landing pages.
I actually have a blog post about this very same issue here: get all links and click on each one
Hope that helps.

How to redirect from one CGI to another

I am sending data from A.cgi to B.cgi. B.cgi updates the data in the database and is supposed to redirect back to A.cgi, at which point A.cgi should display the updated data. I added the following code to B.cgi to do the redirect, immediately after the database update:
$url = "http://Travel/cgi-bin/A.cgi/";
print "Location: $url\n\n";
exit();
After successfully updating the database, the page simply prints
Location: http://Travel/cgi-bin/A.cgi/
and stays on B.cgi, without ever getting redirected to A.cgi. How can I make the redirect work?
Location: is a header, and headers must come before all ordinary output; that's probably your problem. But doing this manually is unnecessarily complicated anyway; you would be better off using the redirect function of CGI.pm.
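To illustrate the header-order point, here is a minimal sketch of the manual version, assuming nothing at all has been printed before it (the Status line is optional but makes the 302 explicit):

my $url = "http://Travel/cgi-bin/A.cgi";
# Headers first, then a blank line; no other output may come before this.
print "Status: 302 Found\r\n";
print "Location: $url\r\n\r\n";
exit();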
Use CGI's redirect method:
my $url = "http://Travel/cgi-bin/A.cgi";
my $q = CGI->new;
print $q->redirect($url);

Joomla JError doesn't show but then appears

I have a 3rd-party component which sets a JError warning:
JError::raiseWarning( 99, "Set your name please" );
$app = JFactory::getApplication();
$app->redirect($r);
The redirect goes to a controller with this code:
function saveUserDetails(){
    //some code here
    //now I try to get the error which was set by raiseWarning
    $other_errors = JError::getErrors();
    print_r($other_errors);
    die;
}
It returns just an empty array. Why doesn't it contain that error?
OK, I tried checking the session variable holding the Joomla messages:
$session =& JFactory::getSession();
$mes = $session->get('application.queue');
print_r($mes);
die;
Again empty. I can't understand where that error went.
If there is a new request immediately after the redirect, you might be losing the session variable (the JError content). Inspect the fired requests with Firebug's Net tab and see what happens there. Post any information you find, but if it's not in JError's list, it shouldn't show on the site.
Can you give a link to the live site so I can test there and see the HTTP requests? That could help.

feedpp and session ID

We are using Perl and the CPAN module XML::FeedPP to parse RSS feeds.
The Perl script runs through the different items of the RSS feeds and saves the links to the database, like this:
use LWP::UserAgent;
use XML::FeedPP;
my $ua = LWP::UserAgent->new;
my $response = $ua->get($url);
if ($response->is_success) {
    my $feed = XML::FeedPP->new( $response->content, -type => 'string' );
    foreach my $item ( $feed->get_item() ) {
        my $link = $item->link();
        [...]
    }
}
$url contains the URL of an RSS feed, like http://my.domain/RSS/feeds.xml
In this case, $item->link() will contain links to the RSS articles, like http://my.domain/topic/myarticle.html
The problem is that some web servers (which provide the RSS feeds) do an HTTP redirect in order to add a session ID to the URL, like this: http://my.domain/RSS/feeds.xml;jsessionid=4C989B1DB91D706C3E46B6E30427D5CD.
The strange thing is that FeedPP seems to add this session ID to the link of every item, so $item->link() contains links to the RSS articles like http://my.domain/topic/myarticle.html;jsessionid=4C989B1DB91D706C3E46B6E30427D5CD,
even if the original link does not contain a session ID.
Is there a way to turn off that behavior of FeedPP?
Thank you for any kind of help.
I took a look through http://metacpan.org/pod/XML::FeedPP but didn't see any way to have the link() method trim those session IDs for you. (I'm using XML::FeedPP in one of my scripts, and the site I happen to be parsing doesn't use session IDs.)
So I think the answer is no, not currently. You could try contacting the author or filing a bug.
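As a workaround (not a FeedPP option, just a sketch following the loop from the question) you could strip the session ID from each link yourself before saving it; the regex assumes the jsessionid path parameter shown above:

foreach my $item ( $feed->get_item() ) {
    my $link = $item->link();
    # Drop a trailing ";jsessionid=..." path parameter, if present.
    $link =~ s/;jsessionid=[^;?#]*//i;
    # ... save $link to the database ...
}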
IMHO, the behavior is correct: URI components which follow a semicolon are defined as part of the path (a parameter whose interpretation is left to the application), so when the URI is used to turn a relative URL into an absolute URI they need to be copied as well.
You expect behavior compatible with '&' query parameters, but they are not equivalent.
https://rt.cpan.org/Ticket/Display.html?id=73895

How to read web page content which may itself be redirected to another URL?

I'm using this code to read the web page content:
my $ua = LWP::UserAgent->new;
my $response = $ua->post($url);
if ($response->is_success) {
    my $content = $response->content;
    ...
}
But if $url points to a moved page, then $response->is_success returns false. Now how do I easily get the content of the redirected page?
You need to chase the redirect itself.
if ($response->is_redirect()) {
    $url = $response->header('Location');
    # goto try_again
}
You may want to put this in a while loop and use "next" instead of "goto". You may also want to log it, limit the number of redirections you are willing to chase, etc.
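For example, something like this (a sketch only; it assumes $url is already set as in the question, that the Location header holds an absolute URL, and it gives up after five hops):

use LWP::UserAgent;

my $ua = LWP::UserAgent->new;
my $response;
for my $hop ( 1 .. 5 ) {    # limit the number of redirects we chase
    $response = $ua->post($url);
    last unless $response->is_redirect();
    $url = $response->header('Location');
}
if ($response->is_success) {
    my $content = $response->content;
    # ...
}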
[update]
OK I just noticed there is an easier way to do this. From the man page of LWP::UserAgent:
$ua->requests_redirectable
$ua->requests_redirectable( \#requests )
This reads or sets the object's list of request names that
"$ua->redirect_ok(...)" will allow redirection for. By default,
this is "['GET', 'HEAD']", as per RFC 2616. To change to include
'POST', consider:
push @{ $ua->requests_redirectable }, 'POST';
So yeah, maybe just do that. :-)
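In context that would look roughly like this (again assuming $url is set as in the question):

use LWP::UserAgent;

my $ua = LWP::UserAgent->new;
# Allow POST requests to be redirected automatically as well.
push @{ $ua->requests_redirectable }, 'POST';
my $response = $ua->post($url);
if ($response->is_success) {
    my $content = $response->content;
    # ...
}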