Perl - geturls with WWW::Mechanize - perl

I am trying to submit a form on http://bioinfo.noble.org/TrSSP/ and want to extract the result.
My query data looks like this
>ATCG00270
MTIALGKFTKDEKDLFDIMDDWLRRDRFVFVGWSGLLLFPCAYFALGGWFTGTTFVTSWYTHGLASSYLEGCNFLTAA VSTPANSLAHSLLLLWGPEAQGDFTRWCQLGGLWAFVALHGAFALIGFMLRQFELARSVQLRPYNAIAFSGPIAVFVSVFLIYPLGQSGWFFAPSFGVAAIFRFILFFQGFHNWTLNPFHMMGVAGVLGAALLCAIHGATVENTLFEDGDGANTFRAFNPTQAEETYSMVTANRFWSQIFGVAFSNKRWLHFFMLFVPVTGLWMSALGVVGLALNLRAYDFVSQEIRAAEDPEFETFYTKNILLNEGIRAWMAAQDQPHENLIFPEEVLPRGNAL
My script looks like this
use strict;
use warnings;
use File::Slurp;
use WWW::Mechanize;
my $mech = WWW::Mechanize->new;
my $sequence = $ARGV[0];
$mech->get( 'http://bioinfo.noble.org/TrSSP' );
$mech->submit_form( fields => { 'query_file' => $sequence, }, );
print $mech->content;
#sleep (10);
open( OUT, ">out.txt" );
my #a = $mech->find_all_links();
print OUT "\n", $a[$_]->url for ( 0 .. $#a );
print $mech->content gives a result like this
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<title>The job is running, please wait...</title>
<meta http-equiv="refresh" content="4;url=/TrSSP/?sessionid=1492435151653763">
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<link rel="stylesheet" href="interface/style.css" type="text/css">
</head>
<body>
<table width="90%" align="center" border="0" cellpadding="0" cellspacing="0" class="table1">
<tr align="center">
<td width="50"> </td>
<td></td>
<td> </td>
</tr>
<tr align="left" height="30" valign="middle">
<td width="30"> </td>
<td bgColor="#CCCCFF"> Your sequences have been submitted to backend pipeline, please wait for result:</td>
<td width="30"> </td>
</tr>
<tr align="left">
<td> </td>
<td>
<br><br><font color="#0000FF"><strong>
</strong></font>
<BR><BR><BR><BR><BR><BR><br><br><BR><br><br><hr>
If you don't want to wait online, please copy and keep the following link to retrieve your result later:<br>
<strong>http://bioinfo.noble.org/TrSSP/?sessionid=1492435151653763</strong>
<script language="JavaScript" type="text/JavaScript">
function doit()
{
window.location.href="/TrSSP/?sessionid=1492435151653763";
}
setTimeout("doit()",9000);
</script>
</td>
<td> </td>
</tr>
</table>
</body>
</html>
I want to extract this link
http://bioinfo.noble.org/TrSSP/?sessionid=1492435151653763
and download the result when the job is completed. But find_all_links() is recognizing /TrSSP/?sessionid=1492434554474809 as a link.

We don't know how long this is backend process there is going to take. If it's minutes, you could have your program wait. Even if it's hours, waiting is reasonable.
In a browser, the page is going to refresh on its own. There are two auto-refresh mechanisms implemented in the response you are showing.
<script language="JavaScript" type="text/JavaScript">
function doit()
{
window.location.href="/TrSSP/?sessionid=1492435151653763";
}
setTimeout("doit()",9000);
</script>
The javascript setTimeout takes an argument in milliseconds, so this will be done after 9 seconds.
There is also a meta tag that tells the browser to auto-refresh:
<meta http-equiv="refresh" content="4;url=/TrSSP/?sessionid=1492435151653763">
Here, the 4 in the content means 4 seconds. So this would be done earlier.
Of course we also don't know how long they keep the session around. It might be a safe approach to reload that page every ten seconds (or more often, if you want).
You can do that by building a simple while loop and checking if the refresh is still in the response.
# do the initial submit here
...
# assign this by grabbing it from the page
$mech->content =~ m{<strong>(\Qhttp://bioinfo.noble.org/TrSSP/?sessionid=\E\d+)</strong>};
my $url = $1; # in this case, regex on HTML is fine
print "Waiting for $url\n";
while (1) {
$mech->get($url);
last unless $mech->content =~ m/refresh/;
sleep 10; # or whatever number of seconds
}
# process the final response ...
We first submit the data. We then extract the URL that you're supposed to call until they are done processing. Since this is a pretty straight-forward document, we can safely use a pattern match. The URL is always the same, and it's clearly marked with the <strong> tag. In general it's not a good idea to use regex to parse HTML, but we're not really parsing, we are just screenscraping a single value. The \Q and \E are the same as quotemeta and make sure that we don't have to escape the . and ?, which is then easier to read than having a buch of backslashes \ in the pattern.
The script will sleep for ten seconds after every attempt before trying again. Once it matches, it breaks out of the endless loop, so you can put the processing of the actual response that has the data you wanted behind that loop.
It might make sense to add some output into the loop so you can see that it's still running.
Note that this needs to really keep running until it's done. Don't stop the process.

Related

need a good method to Page break and add Header

I am using itextsharp library.I design an HTML page and convert to PDF .in that case some table are not split perfectly and row also not split correctly.
I tried to put a comment in the HTML code writing based on the font how many rows the page can contains, then after reaching the limit i just add another page,
this works but doesnt seems a professional way to fix this.
After a search i found many discussions using this code
<style type="text/css">
table { page-break-inside:auto }
tr { page-break-inside:avoid; page-break-after:auto }
thead { display:table-header-group }
tfoot { display:table-footer-group }
</style>
But, what if i need to add a header on each page?
thanks
If you don't want to use events, a tricky method is to use this css style:
table.hrepeat {
repeat-header: yes;
}
and wrap all your content inside an html table adding header to repeat as head of the table.
Example:
<!DOCTYPE html>
<head>
<style type="text/css">
table.hrepeat {repeat-header: yes;}
</style>
</head>
<body>
<table class="hrepeat">
<thead>
<tr>
<th>Header on each page</th>
</tr>
</thead>
<tbody>
....
</tbody>
</table>
</body>

Downloading a image via web scraping with a Perl script

I'm a noob in perl, trying to download a IMDB movie poster image via perl script with the help of Mechanize framework. I'm not getting the 'id' attribute for 'td' tags so that I can find the specific place for the image. This is how the HTML of the image portion of IMDB page looks like:
<table id="title-overview-widget-layout" cellspacing="0" cellpadding="0" border="0">
<tbody>
<tr>
<td id="img_primary" rowspan="2">
<div class="image">
<a href="/media/rm419297536/tt2338151?ref_=tt_ov_i">
<img width="214" height="317" itemprop="image" src="http://ia.media-imdb.com/images/M/MV5BMTYzOTE2NjkxN15BMl5BanBnXkFtZTgwMDgzMTg0MzE#._V1_SY317_CR2,0,214,317_AL_.jpg" title="PK (2014) Poster" alt="PK (2014) Poster">
</a>
</div>
<div class="pro-title-link text-center">
</td>
<td id="overview-top">
</tr>
<tr>
</tbody>
</table>
And here is the perl script I'm trying to download with:
use strict;
use warnings;
use WWW::Mechanize;
use HTML::TokeParser;
#create a new instance of mechanize
my $agent = WWW::Mechanize->new();
#get the page we want.
$agent->get("http://www.imdb.com/title/tt2338151/");
#supply a reference to that page to TokeParser
my $stream = HTML::TokeParser->new(\$agent->{content});
my $c = 0;#to store the count of images and give the images names
#loop through all the td's
while (my $tag1 = $stream->get_tag("td")) {
$tag1->[1]->{id} ||= 'none';
my $asd = $tag1->[1]->{id};
print "$asd\n"; #shows none for all of the td's
if ($asd && $asd eq 'img_primary') {
while(my $tag = $stream->get_tag("div"))
{
# $tag will contain this array => [$tag, $attr, $attrseq, $text]
#get the class of the div tag from attr
my $cls = $tag->[1]{class};
#we're looking for div's with the class gallery-img2
if($cls && $cls eq "image") {
#get the content of the src tag
my $image = $stream->get_tag('img')->[1]{src};
#create a new mechanize to download the image
my $imgDown = WWW::Mechanize->new();
#give the image url and the local path to mechanize
$imgDown->get($image, ":content_file" => ".//image".$c.".jpg");
#update the count
$c++;
}
}
}
}
print "Total images scraped $c\n";
Any help will be much appropriated.
When JavaScript is involved, it's best to use a real browser to visit pages and query their contents.
You can do this with Selenium::Remote::Driver.

How can I remove an attribute from all DOM elements with Mojolicious?

I want to remove the bgcolor attribute from all elements of a page I am scraping via Mojolicious.
My attempt has been the following:
$dom->all_contents->each(sub { $_->attr('bgcolor' => undef) });
but this seems not to work.
How do I do it right?
The following uses Mojo::DOM to delete the bgcolor attribute for every node:
use strict;
use warnings;
use Mojo::DOM;
my $dom = Mojo::DOM->new(do {local $/; <DATA>});
for my $node ($dom->find('*')->each) {
delete $node->{bgcolor};
}
print $dom;
__DATA__
<html>
<head>
<title>Hello background color</title>
</head>
<body bgcolor="white">
<h1>Hello world</h1>
<table>
<tr><td bgcolor="blue">blue</td></tr>
<tr><td bgcolor="green">green</td></tr>
</table>
</body>
</html>
Outputs:
<html>
<head>
<title>Hello background color</title>
</head>
<body>
<h1>Hello world</h1>
<table>
<tr><td>blue</td></tr>
<tr><td>green</td></tr>
</table>
</body>
</html>
Notes:
It's possible to use CSS Selectors to limit the returned nodes to only those containing the specific attribute:
for my $node ($dom->find('[bgcolor]')->each) {
One can also let Mojo handle the iteration like the following:
$dom->find('*')->each(sub {
delete $_->{bgcolor};
});
As I understand it, the DOM attribute you're looking for isn't bgcolor but background-color, the css variety. bgcolor fell out of popularity a while ago, in favor of defining classes and using CSS to set the styling on an object (including its background color). Try background-color instead.

Write on unblessed reference in perl

Why am i getting an error saying "Can't call method 'write' on unblessed reference at /usr/..."
When i run on my ubuntu there was no error at all but when i run this code on my gentoo. This error pops out. I think the OS is not the problem here. but what is it?
Here is my code :
#!/usr/bin/perl
#index.cgi
require 'foobar-lib.pl';
ui_print_header(undef, $module_info{'desc'}, "", undef, 1, 1);
ui_print_footer("/", $text{'index'});
use CGI;
use Config::Tiny;
use Data::Dumper;
use CGI::Carp qw(fatalsToBrowser);
#location/directory of configuration file
my $file = "/home/admin_config.conf";
my $Config = Config::Tiny->read($file);
#reads the section, key and the value of the configuration file.
my $status_in_file = $Config->{"offline_online_status"}->{"offline_online_status.offline_online_state"};
print "Content-type:text/html\n\n";
print qq~<html>
<link rel="stylesheet" type="text/css" href="style4.css">
<body>
<div id="content">
<div id="bar">
<span><p>Controller Settings</p></span>
</div>
<div id="tab-container">
<ul>
<li><span>Offline / Online State</span></li>
</ul>
</div>
<div id="main-container">
<table border="0" width="100%" height="80%">
<tr>
<td align="left" width="20%">
<div id="title"><span>Offline/Online Status :</span></div>
</td>
<td width="25%">
<table border="0" style=\"text-align:right;font-family:Arial, Helvetica, sans-serif;\" cellpadding="5">
<tr>
<td width="30%"><div id="data">Offline Online State:</div></td>
</tr>
<tr>
<td width="30%"><div id="data">Data Mode:</div></td>
</tr>
</table>
</td>
<td align="left" width="20%">
<table border="1" style=\"text-align:center;font-family:Arial, Helvetica, sans-serif;\" cellpadding="5">
<tr>
<td width="20%"><div id="data">$status_in_file</div></td>
</tr>
</table>
</td>
<td width="50%"></td>
</tr>
<tr>
<td colspan="4">
<div id="description"><p><b>Description :</b></p>
<p>This <i>indication</i> is sent ..</p>
</div>
</td>
</tr>
</table>
</div>
</div>
</body>
</html>
~;
Can anybody please help me?
Here is my foobar-lib.pl
=head1 foobar-lib.pl
foreign_require("foobar", "foobar-lib.pl");
#sites = foobar::list_foobar_websites()
=cut
BEGIN { push(#INC, ".."); };
use WebminCore;
init_config();
=head2 get_foobar_config()
=cut
sub get_foobar_config
{
my $lref = &read_file_lines($config{'foobar_conf'});
my #rv;
my $lnum = 0;
foreach my $line (#$lref) {
my ($n, $v) = split(/\s+/, $line, 2);
if ($n) {
push(#rv, { 'name' => $n, 'value' => $v, 'line' => $lnum });
}
$lnum++;
}
return #rv;
}
i don't really understand about this foobar-lib.pl also. Maybe this whats caused my problem when i run the codes perhaps?
The code you've shown doesn't attempt to call a method called write on anything at all, let alone on an unblessed reference. So I assume the method call happens in some code you haven't shown. Perhaps in foobar-lib.pl?
Because I can't see the code causing the error, I can only hazard a guess based on the clue that the method is called write.
In Perl, it's kind of ambiguous as to whether filehandles classed as "objects" (and can thus have methods called on them), or unblessed references (and thus can't). The situation changed in Perl 5.12, and again in Perl 5.14. So if you've got different versions of Perl installed on each machine, then you might observe different behaviours when trying to do:
$fh->write($data, $length)
The Perl 5.14+ behaviour is probably what you want (as it's the most awesome), and luckily you can achieve that same behaviour on earlier versions of Perl by pre-loading a couple of modules. Add the following two lines to the top of your script:
use IO::Handle ();
use IO::File ();
Problem solved... perhaps???
It could be because of modules installed on to different location on different server. Please perform perl -V on both the server and check modules are located at identical location.
Also check what are you passing to that method.
Also check the permissions, Does your program has write access?

How to fetch the value of a HTML tag using HTML::Tree?

Lets say i have an array which holds the contents of the body tag like shown below:
print Dumper(\#array);
$VAR1 =
[
<body>
<table width=\'100%\' height=\'100%\'>
<tr>
<td width=\'100%\' height=\'100%\'
valign=\'top\'><div style=\'height:100%\' hrefmode=\'ajax-html\' id=\'a_tabbar\'
width=\'100%\' imgpath=\'../images/datagrid/\' skinColors=\'#FCFBFC,#F4F3EE\'/>
</td>
</tr>
</table>
<script>
tabbar=newdhtmlXTabBar(\'a_tabbar\',\'top\');
tabbar.setImagePath(\'../images/datagrid/\');
tabbar.setSkinColors(\'#FCFBFC\',\'#F4F3EE\');
tabbar.setHrefMode(\'ajax-html\');
</script>
<script>
tabbar.addTab(\'866346569493123700\',\'Details \',\'242px\');
tabbar.setContentHref(\'866346569493123700\',\'../reports/dhtmlgridqueryhandler.jsp?id=866346569493123700&oracleDb=read&htmlDataId=&Type=generic&queryCode=GetDetails\');
tabbar.setTabActive(\'866346569493123700\');
</script>
</body>
]
Lets say that i want to fetch the id of the "div" tag from the contents of the #array:
I do that by :
$tree=HTML::TreeBuilder->new_from_content(#array);
$first_match = $tree->find_by_attribute('hrefmode' => 'ajax-html');
$id = $first_match->attr('id');
This works fine for the cases where there is a single value for the attribute.
But how do i fetch 866346569493123700 from the script tag in #array?
Any help on this would be much appreciated as i have been trying to get this for hours
Your use of HTML::TreeBuilder for parsing HTML is very good. You're running into a problem though because you also want information from inside a <script> tag with contains JavaScript. Unfortunately, the above module isn't going to help you beyond isolating the JS.
Given the simplicity of your goal, I believe that I'd just use a regex to find the tab id. The final command tabbar.setTabActive is fairly simple and most likely won't change much since it's a function that only accepts one value and is integral to the functionality of creating and activating this new tab.
The below code demonstrates iterating over the script tags until it finds a match for tabid:
use HTML::TreeBuilder;
use strict;
use warnings;
my $root = HTML::TreeBuilder->new_from_content(<DATA>);
if (my $element = $root->look_down('_tag' => 'div', 'hrefmode' => 'ajax-html')) {
print "div.id = '" . $element->attr('id') . "'\n";
} else {
warn "div.id not found";
}
my $tabid = '';
for ($root->find_by_tag_name('script')) {
my $scripttext = $_->as_HTML;
if ($scripttext =~ /tabbar.setTabActive\('(\d+)'\);/) {
$tabid = $1;
print "TabID = '$tabid'";
last;
}
}
warn "Tab ID not found\n" if ! $tabid;
__DATA__
<body>
<table width='100%' height='100%'>
<tr>
<td width='100%' height='100%'
valign='top'><div style='height:100%' hrefmode='ajax-html' id='a_tabbar'
width='100%' imgpath='../images/datagrid/' skinColors='#FCFBFC,#F4F3EE'/>
</td>
</tr>
</table>
<script>
tabbar=newdhtmlXTabBar('a_tabbar','top');
tabbar.setImagePath('../images/datagrid/');
tabbar.setSkinColors('#FCFBFC','#F4F3EE');
tabbar.setHrefMode('ajax-html');
</script>
<script>
tabbar.addTab('866346569493123700','Details ','242px');
tabbar.setContentHref('866346569493123700','../reports/dhtmlgridqueryhandler.jsp?id=866346569493123700&oracleDb=read&htmlDataId=&Type=generic&queryCode=GetDetails');
tabbar.setTabActive('866346569493123700');
</script>
</body>