Using Web::Scraper - perl

I'm trying to parse some HTML tags using the Perl module Web::Scraper, but it seems I'm inept at Perl. I wonder if anyone can spot the mistakes in my code:
This is my HTML to parse (2 urls inside li tags):
<more html above here>
<div class="span-48 last">
<div class="span-37">
<div id="zone-extract" class="123">
<h2 class="genres"></h2>
<li>1</li>
<li><a class="sel" href="**URL_TO_EXTRACT_2**">2</a></li>
<li class="first">Pàg</li>
</div>
</div>
</div>
<more stuff from here>
I'm trying to obtain:
ID:1 Link:URL_TO_EXTRACT_1
ID:2 Link:URL_TO_EXTRACT_2
With this perl code:
my $scraper = scraper {
process ".zone-extract > a[href]", urls => '@href', id => 'TEXT';
result 'urls';
};
my $links = $scraper->scrape($response);
This is one of the countless process combinations I tried, with two different results: an empty return, or all the URLs in the page (and I only need the links inside zone-extract).
Resolved with mob's contribution: #zone-extract instead of .zone-extract :)
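
For reference, here is a minimal sketch of a corrected scraper, assuming the goal is the ID/Link pairs listed in the question. It uses the id selector #zone-extract and a descendant combinator, because the a elements sit inside li and so are not direct children of the div. A 'links[]' key collects every match; that key name and the callback's hash layout are illustrative choices, not part of the original code. The sample HTML is trimmed from the question.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Web::Scraper;

# Trimmed copy of the HTML from the question.
my $html = q[
<div id="zone-extract" class="123">
<li>1</li>
<li><a class="sel" href="**URL_TO_EXTRACT_2**">2</a></li>
</div>
];

my $scraper = scraper {
    # '#zone-extract' matches id="zone-extract"; the space selects any
    # descendant <a> with an href, not only direct children of the div.
    process '#zone-extract a[href]', 'links[]' => sub {
        my $node = $_[0];    # an HTML::Element for the matched <a>
        return { id => $node->as_text, link => $node->attr('href') };
    };
};

# Pass a scalar reference to scrape an HTML string rather than a URL.
my $result = $scraper->scrape(\$html);

print "ID:$_->{id} Link:$_->{link}\n" for @{ $result->{links} || [] };
```

With the HTML as posted, only the second list item contains a link, so a single line is printed for ID 2.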

#!/usr/bin/env perl
use strict;
use warnings;
use Web::Scraper;
my $html = q[
<div class="span-48 last">
<div class="span-37">
<div id="zone-extract" class="123">
<h2 class="genres"></h2>
<li>1</li>
<li><a class="sel" href="**URL_TO_EXTRACT_2**">2</a></li>
<li class="first">Pàg</li>
</div>
</div>
</div>
]; # / (turn off wrong syntax highlighting)
my $parser = scraper {
    process '//div[@id="zone-extract"]//a', 'urls[]' => sub {
        my $url = $_[0]->attr('href');
        return $url;
    };
};
my $ref = $parser->scrape(\$html);
print "$_\n" for @{ $ref->{urls} };

Related

Pulling text from list items using WWW::Mechanize::Firefox

Given the following HTML:
<div class="chosen-drop">
<ul class="chosen-results">
<li>Stuff 1</li>
<li>Stuff 2</li>
<li>Stuff 3</li>
</ul>
</div>
How do I pull the text from the list items using the WWW::Mechanize::Firefox xpath function?
It seems like this should work, since it's basically pulled from the documentation, but it's coming up empty:
my @text = $mech->xpath('//div[@class="chosen-drop"]/ul/li/text()');
I must be missing something with the xpath.
With these files:
mech_xpath.pl:
#!perl -w
use strict;
use WWW::Mechanize::Firefox;
use Data::Dump qw/dump/;
my $mech = WWW::Mechanize::Firefox->new();
$mech->get_local('local.html');
my @text = $mech->xpath('//div[@class="chosen-drop"]/ul/li/text()');
warn dump \@text;
<>;
local.html:
<div class="chosen-drop">
<ul class="chosen-results">
<li>Stuff 1</li>
<li>Stuff 2</li>
<li>Stuff 3</li>
</ul>
</div>
Gives this output:
[
bless({
# tied MozRepl::RemoteObject::TiedHash
}, "MozRepl::RemoteObject::Instance"),
bless({
# tied MozRepl::RemoteObject::TiedHash
}, "MozRepl::RemoteObject::Instance"),
bless({
# tied MozRepl::RemoteObject::TiedHash
}, "MozRepl::RemoteObject::Instance"),
]
So everything looks to be working. How are you checking the contents of @text?

Parse html nested list to perl array

Input data: (some nested list with links )
<ul>
<li><a>1</a>
<ul>
<li><a>11</a>
<ul>
<li><a>111</a></li>
<li><a>112</a></li>
<li><a>113</a>
<ul>
<li><a>1131</a></li>
<li><a>1132</a></li>
<li><a>1133</a></li>
</ul></li>
<li><a>114</a></li>
<li><a>115</a></li>
</ul>
</li>
<li><a>12</a>
<ul>
<li><a>121</a>
<ul>
<li><a>1211</a></li>
<li><a>1212</a></li>
<li><a>1213</a></li>
</ul></li>
<li><a>122</a></li>
</ul>
</li>
</ul>
</li>
</ul>
Output array of strings:
1,11,111
1,11,112
1,11,113,1131
1,11,113,1132
1,11,113,1133
1,11,114
1,11,115
1,12,121,1211
1,12,121,1212
1,12,121,1213
1,12,122
Each line is the full path of link texts leading to an element that has no children.
What I tried:
1. XML::SAX::ParserFactory
https://gist.github.com/7266638 There are a lot of problems here: how to detect whether an li is the last one, how to save the path, etc. I think it's a bad approach.
It's definitely not a job for a regexp, because the real-life HTML is much worse: a lot of tags, divs, spans, etc.
DOM? But how?
You can try the XML::Twig module. The script below saves the text of each <a> element and prints the accumulated path only when a <li> element has no child <ul>.
#!/usr/bin/env perl
use warnings;
use strict;
use XML::Twig;
my (@li);
my $twig = XML::Twig->new(
    twig_handlers => {
        'a' => sub {
            if ( $_->prev_elt('li') ) {
                push @li, $_->text;
            }
        },
        'li' => sub {
            unless ( $_->children('ul') ) {
                printf qq|%s\n|, join q|,|, @li;
            }
            pop @li;
        },
    },
)->parsefile( shift );
Run it like:
perl script.pl xmlfile
That yields:
1,11,111
1,11,112
1,11,113,1131
1,11,113,1132
1,11,113,1133
1,11,114
1,11,115
1,12,121,1211
1,12,121,1212
1,12,121,1213
1,12,122

How to parse many div's using Simple HTML DOM?

I have this HTML code
<div id="first" class="first">
One
<div id="second" class="second">
Second
<div id="third" class="third">
Third
<div id="fourth" class="fourth">
Fourth
<div id="fifth" class="fifth">
Fifth
<div id="sixth" class="sixth">
Sixth
</div>
</div>
</div>
</div>
</div>
</div>
This code is from an external website.
I want to display 'Hi' using Simple HTML DOM from a URL.
Do you want to see something like this?
$el = $html->find("#first", 0);
while ($child = $el->children(0)) { // children() is 0-indexed and counts element nodes only
$el = $child;
}
echo $el->innertext;

Perl Find Web Link By DOM

I have been using Mechanize::Firefox in order to do some web scraping and I am currently in the process of going fully headless. I have been using this code:
my $link = [$firefox->find_link_dom(text_regex => qr/pdf/i)]->[0]->{href} . "\n";
which provides very good results in finding a .pdf file. Is there any way I can find link DOM nodes based on a text_regex using a different module that does not require the browser to be shown?
EDIT
With the Mechanize::Firefox I would get back a link out of code like:
<div id="rhc" xpathLocation="noDialog">
<div id="download" class="rhcBox_type1">
<div class="wrap">
<ul>
<li class="download icon"><strong>Download:</strong>
PDF |
Citation |
XML
</li>
<li class="print icon"><strong>Print article</strong></li>
<li class="reprint icon">EzReprint New & improved!</li>
</ul>
</div>
to download a pdf file. Any ideas why the HTML::Treebase will not get this link??
CSS3 a[href*="pdf"]: Web::Query, HTML::Query, Mojo::UserAgent
XPath //a[contains(@href, "pdf")]: HTML::TreeBuilder::XPath, XML::LibXML
use strict;
use warnings;
use HTML::TreeBuilder::XPath qw();
my $dom = HTML::TreeBuilder::XPath->new_from_content(q(...)); # ... your html code
my @nodes = $dom->findnodes('//a[contains(@href, "pdf")]');
foreach my $node (@nodes) {
    my $href = $node->findvalue('@href');
    ...
}

Is there an easier way to extract this data?

my $changelog = "/etc/webmin/Pserver_Panel/changelog.cgi";
my $Milestone;
open (PREFS, $changelog);
while (<PREFS>)
{
if ($_ =~ m/^<h1>(.*)[ ]Milestone.*$/g) {
$Milestone=$1;
last;
}
}
close(PREFS);
Here is an example of the data it is extracting from:
<h1>1.77 Milestone</h1>
<h3> 6/26/2009 </h3><ul style="margin-top:0px">
<li type=circle> Standard code house cleaning and added better compatbility for apache conversion.
</ul>
<h3> 6/21/2009 </h3><ul style="margin-top:0px">
<li type=square> Fixed Autofix so that it extracts to the right directory.
</ul>
<h3> 6/11/2009 </h3><ul style="margin-top:0px">
<li type=circle> Updated FTP link on index page to go to net2ftp, an online ftp file manager.
</ul>
<h1>1.76 Milestone</h1>
<h3> 4/14/2009 </h3><ul style="margin-top:0px">
<li type=square> Corrected a broken hyperlink to regular expressions in "View Chat Log"
<li type=circle> Changed the default number of lines back from 25 to 10 on both Chat and Pserver Logs.
<li type=circle> Noted in "View Pserver Log" search is case-sensitive and regular expression supported.
</ul>
<h3> 4/13/2009 </h3><ul style="margin-top:0px">
<li type=disc> Added AutoFix to the panel which will automatically fix prop errors.
<li type=circle> Updated error display to allow more detailed errors.
</ul>
<h3> 4/12/2009 </h3><ul style="margin-top:0px">
<li type=circle> Fixed start/stop/restart to be more reliable.
</ul>
Next, you are going to have to parse the items between milestones. Do yourself a favor, stop worrying about lines of code and use an HTML Parser, such as HTML::TokeParser:
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;
my $parser = HTML::TokeParser->new( \*DATA );
while ( my $token = $parser->get_token ) {
if ( $token->[0] eq 'S' ) {
if ( $token->[1] eq 'h1') {
my ($milestone) = split ' ', $parser->get_text('/h1');
print "Milestone is '$milestone'\n";
}
}
}
__DATA__
<h1>1.77 Milestone</h1>
...
C:\Temp> vbn
Milestone is '1.77'
Milestone is '1.76'
If you want a one-liner, how about this one?
perl -nle 'if (/^<h1>(.*)[ ]Milestone.*$/g){ print $1; last }' /etc/webmin/Pserver_Panel/changelog.cgi
Another one-liner to grab it from the command line:
perl -ne'print $1 and exit if /<h1>(.*?)\s+Milestone/' /etc/webmin/Pserver_Panel/changelog.cgi
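
If the value is needed inside a larger script rather than on the command line, the same match can be wrapped in a small function. This is only a sketch using core Perl: the name first_milestone and the in-memory sample are illustrative assumptions, and a real script would open the changelog path instead of the sample string.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Return the version from the first "<h1>X Milestone" line, or nothing
# if no such line is found.
sub first_milestone {
    my ($fh) = @_;
    while ( my $line = <$fh> ) {
        return $1 if $line =~ /^<h1>(\S+)\s+Milestone/;
    }
    return;
}

# A few lines of the sample data, read through an in-memory filehandle.
my $sample = <<'HTML';
<h1>1.77 Milestone</h1>
<h3> 6/26/2009 </h3><ul style="margin-top:0px">
<h1>1.76 Milestone</h1>
HTML

# Lexical filehandle, three-argument open, and an explicit error check.
open my $fh, '<', \$sample or die "Cannot open sample data: $!";
my $milestone = first_milestone($fh);
close $fh;

print "Milestone is '$milestone'\n" if defined $milestone;
```

Returning from the function as soon as the first heading matches plays the same role as last in the original loop.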