Scrape a table#id columns with Web::Scraper - perl

Have a html-page, with a structure:
have a table with an id="searchResult"
many rows
each contains 3 td - without any class
In each table cell contains one URL, and I need the URL from the second cell (column)
Tried different XPATH scrapers like:
my $links = scraper {
process '//table[id="searchResult"]', "lines[]" => scraper {
process "//tr/td[2]/a", text => 'TEXT';
process "//tr/td[2]/a", link => '#href';
};
};
my $res = $links->scrape($html);
But not works and the $res is an empty {}.
If someone needs, here is the full test code:
use 5.014;
use warnings;
use Web::Scraper;
use Data::Dumper;
my $links = scraper {
process '//table[id="searchResult"]', "lines[]" => scraper {
process "//tr/td[2]/a", text => 'TEXT';
process "//tr/td[2]/a", link => '#href';
};
};
my $html = do {local $/;<DATA>};
#say $html;
my $res = $links->scrape($html);
say Dumper $res;
__DATA__
<html>
<body>
<p>...</p>
<table id="searchResult">
<thead><th>x</th><th>x</th><th>x</th><th>x</th><th>x</th></thead>
<tr>
<td>cell11</td>
<td>cell12</td>
<td>cell13</td>
</tr>
<tr>
<td>cell21</td>
<td>cell22</td>
<td>cell23</td>
</tr>
<tr>
<td>cell31</td>
<td>cell32</td>
<td>cell33</td>
</tr>
</table>
</body>
</html>

My preferred scraper for this type of project is Mojo::DOM. For a helpful 8 minute introductory video, check out Mojocast Episode 5.
You also could probably use a pointer to a CSS Selector Reference.
The following performs the parsing you're trying to do with this module:
use strict;
use warnings;
use Mojo::DOM;
my $dom = Mojo::DOM->new(do {local $/; <DATA>});
for my $link ($dom->find('table[id=searchResult] > tr > td:nth-child(2) > a')->each) {
print $link->{href}, " - ", $link->text, "\n";
}
__DATA__
<html>
<body>
<p>...</p>
<table id="searchResult">
<thead><th>x</th><th>x</th><th>x</th><th>x</th><th>x</th></thead>
<tr>
<td>cell11</td>
<td>cell12</td>
<td>cell13</td>
</tr>
<tr>
<td>cell21</td>
<td>cell22</td>
<td>cell23</td>
</tr>
<tr>
<td>cell31</td>
<td>cell32</td>
<td>cell33</td>
</tr>
</table>
</body>
</html>
Outputs:
#12 - cell12
#22 - cell22
#32 - cell32

Related

Downloading a image via web scraping with a Perl script

I'm a noob in perl, trying to download a IMDB movie poster image via perl script with the help of Mechanize framework. I'm not getting the 'id' attribute for 'td' tags so that I can find the specific place for the image. This is how the HTML of the image portion of IMDB page looks like:
<table id="title-overview-widget-layout" cellspacing="0" cellpadding="0" border="0">
<tbody>
<tr>
<td id="img_primary" rowspan="2">
<div class="image">
<a href="/media/rm419297536/tt2338151?ref_=tt_ov_i">
<img width="214" height="317" itemprop="image" src="http://ia.media-imdb.com/images/M/MV5BMTYzOTE2NjkxN15BMl5BanBnXkFtZTgwMDgzMTg0MzE#._V1_SY317_CR2,0,214,317_AL_.jpg" title="PK (2014) Poster" alt="PK (2014) Poster">
</a>
</div>
<div class="pro-title-link text-center">
</td>
<td id="overview-top">
</tr>
<tr>
</tbody>
</table>
And here is the perl script I'm trying to download with:
use strict;
use warnings;
use WWW::Mechanize;
use HTML::TokeParser;
#create a new instance of mechanize
my $agent = WWW::Mechanize->new();
#get the page we want.
$agent->get("http://www.imdb.com/title/tt2338151/");
#supply a reference to that page to TokeParser
my $stream = HTML::TokeParser->new(\$agent->{content});
my $c = 0;#to store the count of images and give the images names
#loop through all the td's
while (my $tag1 = $stream->get_tag("td")) {
$tag1->[1]->{id} ||= 'none';
my $asd = $tag1->[1]->{id};
print "$asd\n"; #shows none for all of the td's
if ($asd && $asd eq 'img_primary') {
while(my $tag = $stream->get_tag("div"))
{
# $tag will contain this array => [$tag, $attr, $attrseq, $text]
#get the class of the div tag from attr
my $cls = $tag->[1]{class};
#we're looking for div's with the class gallery-img2
if($cls && $cls eq "image") {
#get the content of the src tag
my $image = $stream->get_tag('img')->[1]{src};
#create a new mechanize to download the image
my $imgDown = WWW::Mechanize->new();
#give the image url and the local path to mechanize
$imgDown->get($image, ":content_file" => ".//image".$c.".jpg");
#update the count
$c++;
}
}
}
}
print "Total images scraped $c\n";
Any help will be much appropriated.
When JavaScript is involved, it's best to use a real browser to visit pages and query their contents.
You can do this with Selenium::Remote::Driver.

How to fetch the value of a HTML tag using HTML::Tree?

Lets say i have an array which holds the contents of the body tag like shown below:
print Dumper(\#array);
$VAR1 =
[
<body>
<table width=\'100%\' height=\'100%\'>
<tr>
<td width=\'100%\' height=\'100%\'
valign=\'top\'><div style=\'height:100%\' hrefmode=\'ajax-html\' id=\'a_tabbar\'
width=\'100%\' imgpath=\'../images/datagrid/\' skinColors=\'#FCFBFC,#F4F3EE\'/>
</td>
</tr>
</table>
<script>
tabbar=newdhtmlXTabBar(\'a_tabbar\',\'top\');
tabbar.setImagePath(\'../images/datagrid/\');
tabbar.setSkinColors(\'#FCFBFC\',\'#F4F3EE\');
tabbar.setHrefMode(\'ajax-html\');
</script>
<script>
tabbar.addTab(\'866346569493123700\',\'Details \',\'242px\');
tabbar.setContentHref(\'866346569493123700\',\'../reports/dhtmlgridqueryhandler.jsp?id=866346569493123700&oracleDb=read&htmlDataId=&Type=generic&queryCode=GetDetails\');
tabbar.setTabActive(\'866346569493123700\');
</script>
</body>
]
Lets say that i want to fetch the id of the "div" tag from the contents of the #array:
I do that by :
$tree=HTML::TreeBuilder->new_from_content(#array);
$first_match = $tree->find_by_attribute('hrefmode' => 'ajax-html');
$id = $first_match->attr('id');
This works fine for the cases where there is a single value for the attribute.
But how do i fetch 866346569493123700 from the script tag in #array?
Any help on this would be much appreciated as i have been trying to get this for hours
Your use of HTML::TreeBuilder for parsing HTML is very good. You're running into a problem though because you also want information from inside a <script> tag with contains JavaScript. Unfortunately, the above module isn't going to help you beyond isolating the JS.
Given the simplicity of your goal, I believe that I'd just use a regex to find the tab id. The final command tabbar.setTabActive is fairly simple and most likely won't change much since it's a function that only accepts one value and is integral to the functionality of creating and activating this new tab.
The below code demonstrates iterating over the script tags until it finds a match for tabid:
use HTML::TreeBuilder;
use strict;
use warnings;
my $root = HTML::TreeBuilder->new_from_content(<DATA>);
if (my $element = $root->look_down('_tag' => 'div', 'hrefmode' => 'ajax-html')) {
print "div.id = '" . $element->attr('id') . "'\n";
} else {
warn "div.id not found";
}
my $tabid = '';
for ($root->find_by_tag_name('script')) {
my $scripttext = $_->as_HTML;
if ($scripttext =~ /tabbar.setTabActive\('(\d+)'\);/) {
$tabid = $1;
print "TabID = '$tabid'";
last;
}
}
warn "Tab ID not found\n" if ! $tabid;
__DATA__
<body>
<table width='100%' height='100%'>
<tr>
<td width='100%' height='100%'
valign='top'><div style='height:100%' hrefmode='ajax-html' id='a_tabbar'
width='100%' imgpath='../images/datagrid/' skinColors='#FCFBFC,#F4F3EE'/>
</td>
</tr>
</table>
<script>
tabbar=newdhtmlXTabBar('a_tabbar','top');
tabbar.setImagePath('../images/datagrid/');
tabbar.setSkinColors('#FCFBFC','#F4F3EE');
tabbar.setHrefMode('ajax-html');
</script>
<script>
tabbar.addTab('866346569493123700','Details ','242px');
tabbar.setContentHref('866346569493123700','../reports/dhtmlgridqueryhandler.jsp?id=866346569493123700&oracleDb=read&htmlDataId=&Type=generic&queryCode=GetDetails');
tabbar.setTabActive('866346569493123700');
</script>
</body>

parsing the html with HTML::TreeBuilder

I want to parse the html page.Extract the badge , description , and the badge type using
<div class="row">
<div class="span8">
<table id="badge-list">
<tr>
<td style="width: 25px;"></td>
<td style="width: 200px;" class="badge-cell">
<a class="badge-name" href="/badge/show/3/">
<span class="badge-icon bronze">•</span>
Editor
</a>
<span class="multiplier">x 3892</span></td>
<td class="earned False"> </td>
<td>First edit</td>
</tr>
my perl code is follows,
i am trying to extract a class="badge-name" and other detail using the below code
my $tree = HTML::TreeBuilder->new();
$tree->parse($content);
my ($h1) = $tree->look_down('_tag', 'table', 'id', 'badge-list');
my #tr = $h1->look_down('_tag', 'tr') ;
foreach my $tr (#tr) {
my #tdList = $tr->look_down('_tag','td');
foreach my $td ( #tdList) {
if (my $a = $td->look_down('_tag','a')) {
print $a->as_text , "\n";
my $span = $a->look_down('_tag','span', 'class');
print $span->attr('class');
}
else {
my $text = $td->as_text , "\n";
print "$text\n";
}
}
}
This code is throwing warning Wide character in print at ..
look_down requires pairs of attribute/value parameters.
$a->look_down('_tag','span', 'class')
should be just
$a->look_down('_tag','span')
I would suggest to add "use utf8;" at the start of the script to add support non ASCII symbols in the print. The symbol • is deferentially is wide.
use utf8;

How to extract a column of a table from html page using perl modules?

I have the following html code of a part of a webpage.
<h2 id="failed_process">Failed Process</h2>
<table border="1">
<thead>
<tr>
<th>
<b>pid</b>
</th>
<th>
<b>Priority</b>
</th>
</tr>
</thead>
<tbody>
<tr>
<td id="90">p_201211162334</td>
<td id="priority_90">NORMAL</td>
</tr>
<tr>
<td id="91">p_201211163423</td>
<td id="priority_91">NORMAL</td>
</tr>
<tr>
<td id="98">p_201211166543</td>
<td id="priority_98">NORMAL</td>
</tr>
</tbody>
</table>
<hr>
I need to extract the pid column . The output should look like
pid
p_201211162334
p_201211163423
p_201211166543
The table count for "Failed Process" table is 4. But the problem is if I mention the table count as 4 and if there are no failed tasks in the webpage, it'll go to the next table and fetch the pid's of next table resulting in wrong pid's.
I am using the below code to get the result.
#!/usr/bin/perl
use strict;
use warnings;
use lib qw(..);
use HTML::TableExtract;
my $content = get("URL");
my $te = HTML::TableExtract->new(
headers => [qw(pid)], attribs => { id => 'failed_process' },
);
$te->parse($content);
foreach my $col ($te->rows) {
print ("\t", #$col), "\n";
}
But I am getting the following error:
Can't call method "rows" on an undefined value
With my favourite DOM parser Mojo::DOM from the Mojolicious suite it would look like that:
#!/usr/bin/env perl
use strict;
use warnings;
use feature 'say';
use Mojo::DOM;
# instantiate with all DATA lines
my $dom = Mojo::DOM->new(do { local $/; <DATA> });
# extract all first column cells
$dom->find('table tr')->each(sub {
my $cell = shift->children->[0];
say $cell->all_text;
});
__DATA__
<h2 id="failed_process">Failed Process</h2>
<table border="1">
...
Output:
pid
p_201211162334
p_201211163423
p_201211166543
After $te->parse($html) you may add some like foreach my $table ($te->tables) .. then you can get rows $table->rows. You may also use Data::Dumper to analyze $te.

Extracting an HTML table with Perl

I have the following table :
<table cellpadding="4" cellspacing="0" border="0">
<tr>
<td>
<span class="label">Label1< /pan>
</td>
<td>
label1_value1
</td>
</tr>
<tr>
<td>
<span class="label">Label2</span>
</td>
<td>
Label2_value1 <br/>
Label2_value2 <br/>
</td>
</tr>
<tr valign="top">
<td>
<span class="label">Label3</span>
</td>
<td>
Result 1<br/>
Result 2<br/>
<span class="related"> -
Result 1 SP2<br/> </span>
</td>
<\tr>
</table>
I want to use HTML::TableExtract in order to extract this table
I use the following code in order to extract the table :
$te->parse($table_content);
foreach my $row ($te->rows) {
if (defined($row->[1])) {
$row->[1]=~s/^\s+//gm;
$row->[1]=~s/\s+$/;/gm;
print $row->[1],"\n";
}
}
I want the result on this format :
label1_value1,label1_value1;label1_value2,result1;result2-result3
but i get wrong results could someone help what the problem with my code or if its possilbe to parse spans with HTML::TableExtract
I get the following:
label1_value1
Label2_value1
Label2_value2;
result1
result2
-
reuslt1;
First, there are some errors in your HTML such as </pan> and <\tr> and an unclosed span tag. Once those are fixed, the code
#!/usr/bin/env perl
use warnings; use strict;
use HTML::TableExtract;
my $te = HTML::TableExtract->new(
attribs => {
cellpadding => '4', cellspacing => '0', border => '0'
}
);
my ($table) = $te->parse(do { local $/; <DATA>} );
for my $row ( $table->rows ) {
for my $cell (#$row) {
$cell =~ s/^\s+//;
$cell =~ s/\s+\z/;/;
$cell =~ s/\s+/ /g;
}
print join("|", #$row), "\n";
}
will give you:
Label1;|label1_value1;
Label2;|Label2_value1 Label2_value2;
Label3;|Result 1 Result 2 - Result 1 SP2;
Now, I do not know what logic would lead from that to your desired output of:
label1_value1,label1_value1;label1_value2,result1;result2-result3
Could you please provide more information on what you are trying to do?