Using Web::Scraper


I'm trying to parse some HTML tags with the Perl module Web::Scraper, but I'm clearly inexperienced with Perl. Could anyone spot the mistakes in my code?
This is the HTML to parse (2 URLs inside li tags):
<more html above here>
<div class="span-48 last">
<div class="span-37">
<div id="zone-extract" class="123">
<h2 class="genres"></h2>
<li>1</li>
<li><a class="sel" href="**URL_TO_EXTRACT_2**">2</a></li>
<li class="first">Pàg</li>
</div>
</div>
</div>
<more stuff from here>
I'm trying to obtain:
ID:1 Link:URL_TO_EXTRACT_1
ID:2 Link:URL_TO_EXTRACT_2
With this Perl code:
my $scraper = scraper {
    process ".zone-extract > a[href]", urls => '@href', id => 'TEXT';
    result 'urls';
};
my $links = $scraper->scrape($response);
This is one of the countless process combinations I tried, with two different results: an empty return, or all the URLs in the page (and I only need the links inside zone-extract).
Resolved with mob's contribution: the selector must be #zone-extract (an id) instead of .zone-extract (a class) :)

#!/usr/bin/env perl
use strict;
use warnings;
use Web::Scraper;
my $html = q[
<div class="span-48 last">
<div class="span-37">
<div id="zone-extract" class="123">
<h2 class="genres"></h2>
<li>1</li>
<li><a class="sel" href="**URL_TO_EXTRACT_2**">2</a></li>
<li class="first">Pàg</li>
</div>
</div>
</div>
]; # / (turn off wrong syntax highlighting)
my $parser = scraper {
    process '//div[@id="zone-extract"]//a', 'urls[]' => sub {
        my $url = $_[0]->attr('href');
        return $url;
    };
};
my $ref = $parser->scrape(\$html);
print "$_\n" for @{ $ref->{urls} };
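
For comparison, the same extraction can be sketched with a CSS selector, which is what the fix noted in the question amounts to. This is only a minimal sketch, not the asker's final code: the example.com URLs and the extra link outside the div are made-up test data, and the links key is a name chosen here for illustration.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Web::Scraper;

# Hypothetical sample page: two links inside #zone-extract, one outside.
my $html = q[
<div id="zone-extract" class="123">
<ul>
<li><a href="http://example.com/page1">1</a></li>
<li><a class="sel" href="http://example.com/page2">2</a></li>
</ul>
</div>
<p><a href="http://example.com/elsewhere">outside</a></p>
];

my $scraper = scraper {
    # '#zone-extract' selects by id; the space matches descendants at any depth
    process '#zone-extract a[href]', 'links[]' => sub {
        my $node = shift;    # the matched <a> element
        return { id => $node->as_text, url => $node->attr('href') };
    };
};

my $res   = $scraper->scrape(\$html);
my @links = @{ $res->{links} || [] };
printf "ID:%s Link:%s\n", $_->{id}, $_->{url} for @links;
```

Note that the > child combinator from the original attempt would still match nothing on the question's HTML, because the links sit inside li elements rather than being direct children of the div.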

Related

Pulling text from list items using WWW::Mechanize::Firefox

Given the following HTML:
<div class="chosen-drop">
<ul class="chosen-results">
<li>Stuff 1</li>
<li>Stuff 2</li>
<li>Stuff 3</li>
</ul>
</div>
How do I pull the text from the list items using the WWW::Mechanize::Firefox xpath function?
It seems like this should work, since it's basically pulled from the documentation, but it's coming up empty:
my @text = $mech->xpath('//div[@class="chosen-drop"]/ul/li/text()');
I must be missing something with the XPath.

With these files:
mech_xpath.pl:
#!perl -w
use strict;
use WWW::Mechanize::Firefox;
use Data::Dump qw/dump/;
my $mech = WWW::Mechanize::Firefox->new();
$mech->get_local('local.html');
my @text = $mech->xpath('//div[@class="chosen-drop"]/ul/li/text()');
warn dump \@text;
<>;
local.html:
<div class="chosen-drop">
<ul class="chosen-results">
<li>Stuff 1</li>
<li>Stuff 2</li>
<li>Stuff 3</li>
</ul>
</div>
Gives this output:
[
bless({
# tied MozRepl::RemoteObject::TiedHash
}, "MozRepl::RemoteObject::Instance"),
bless({
# tied MozRepl::RemoteObject::TiedHash
}, "MozRepl::RemoteObject::Instance"),
bless({
# tied MozRepl::RemoteObject::TiedHash
}, "MozRepl::RemoteObject::Instance"),
]
So everything looks to be working: the call returns three node objects rather than plain strings. How are you checking the contents of @text?

Parse html nested list to perl array

Input data: (some nested list with links )
<ul>
<li><a>1</a>
<ul>
<li><a>11</a>
<ul>
<li><a>111</a></li>
<li><a>112</a></li>
<li><a>113</a>
<ul>
<li><a>1131</a></li>
<li><a>1132</a></li>
<li><a>1133</a></li>
</ul></li>
<li><a>114</a></li>
<li><a>115</a></li>
</ul>
</li>
<li><a>12</a>
<ul>
<li><a>121</a>
<ul>
<li><a>1211</a></li>
<li><a>1212</a></li>
<li><a>1213</a></li>
</ul></li>
<li><a>122</a></li>
</ul>
</li>
</ul>
</li>
</ul>
Output array of strings:
1,11,111
1,11,112
1,11,113,1131
1,11,113,1132
1,11,113,1133
1,11,114
1,11,115
1,12,121,1211
1,12,121,1212
1,12,121,1213
1,12,122
Each line is the full path of link texts down to an element that has no children.
What I tried:
1. XML::SAX::ParserFactory
https://gist.github.com/7266638 There are a lot of problems here: how to detect whether an li is the last one, how to save the path, etc. I think it's a bad approach.
Regular expressions are definitely not an option, because the real-life HTML is much worse: a lot of tags, divs, spans, etc.
DOM? But how?

You can try the XML::Twig module. This script saves the text of each <a> element and prints the accumulated path only when an <li> element has no child <ul>.
#!/usr/bin/env perl
use warnings;
use strict;
use XML::Twig;
my (@li);
my $twig = XML::Twig->new(
    twig_handlers => {
        'a' => sub {
            if ( $_->prev_elt('li') ) {
                push @li, $_->text;
            }
        },
        'li' => sub {
            unless ( $_->children('ul') ) {
                printf qq|%s\n|, join q|,|, @li;
            }
            pop @li;
        },
    },
)->parsefile( shift );
Run it like:
perl script.pl xmlfile
That yields:
1,11,111
1,11,112
1,11,113,1131
1,11,113,1132
1,11,113,1133
1,11,114
1,11,115
1,12,121,1211
1,12,121,1212
1,12,121,1213
1,12,122
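
On the "DOM? But how?" part of the question, a recursive walk over a parsed tree is one possible approach. The following is a rough sketch using HTML::TreeBuilder, a different technique from the XML::Twig handlers above. The leaf_paths helper and the shortened sample list are made up for illustration, and the sketch assumes each <a> and nested <ul> is a direct child of its <li>, which may not hold for messier real-world markup.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Shortened, hypothetical version of the nested list from the question.
my $html = q[
<ul>
<li><a>1</a>
<ul>
<li><a>11</a>
<ul>
<li><a>111</a></li>
<li><a>112</a></li>
</ul>
</li>
<li><a>12</a></li>
</ul>
</li>
</ul>
];

# Return one "a,b,c" string per <li> that has no nested <ul>.
sub leaf_paths {
    my ($ul, @prefix) = @_;
    my @paths;
    for my $li (grep { ref($_) && $_->tag eq 'li' } $ul->content_list) {
        my @kids   = grep { ref($_) } $li->content_list;   # element children only
        my ($link) = grep { $_->tag eq 'a' } @kids;
        my @path   = (@prefix, $link ? $link->as_trimmed_text : ());
        my @lists  = grep { $_->tag eq 'ul' } @kids;
        if (@lists) {
            # Not a leaf: descend, carrying the path collected so far
            push @paths, leaf_paths($_, @path) for @lists;
        }
        else {
            push @paths, join ',', @path;
        }
    }
    return @paths;
}

my $tree  = HTML::TreeBuilder->new_from_content($html);
my $top   = $tree->look_down(_tag => 'ul');    # first <ul> in document order
my @paths = leaf_paths($top);
print "$_\n" for @paths;
$tree->delete;
```

Because the path is passed down as an argument instead of being kept in a shared array, there is no push/pop bookkeeping to get wrong.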

How to parse many div's using Simple HTML DOM?

I have this HTML code
<div id="first" class="first">
One
<div id="second" class="second">
Second
<div id="third" class="third">
Third
<div id="fourth" class="fourth">
Fourth
<div id="fifth" clas="fifth">
Fifht
<div id="sixth" class="sixth">
Sixth
</div>
</div>
</div>
</div>
</div>
</div>
This code is from an external website.
I want to display 'Hi' using Simple HTML DOM from a URL.

Do you want to see something like this?
$el = $html->find("#first", 0);
// children(0) is the first child element; follow it down to the innermost div
while ($child = $el->children(0)) {
    $el = $child;
}
echo $el->innertext;

Perl Find Web Link By DOM

I have been using WWW::Mechanize::Firefox to do some web scraping and I am currently in the process of going fully headless. I have been using this code:
my $link = [$firefox->find_link_dom(text_regex => qr/pdf/i)]->[0]->{href} . "\n";
which gives very good results in finding a .pdf file. Is there any way I can find links via the DOM based on a text_regex, using a different module that does not require the browser to be shown?
EDIT
With WWW::Mechanize::Firefox I would get back a link out of code like:
<div id="rhc" xpathLocation="noDialog">
<div id="download" class="rhcBox_type1">
<div class="wrap">
<ul>
<li class="download icon"><strong>Download:</strong>
PDF |
Citation |
XML
</li>
<li class="print icon"><strong>Print article</strong></li>
<li class="reprint icon">EzReprint New & improved!</li>
</ul>
</div>
to download a PDF file. Any ideas why HTML::TreeBuilder will not get this link?

CSS3 a[href*="pdf"]: Web::Query, HTML::Query, Mojo::UserAgent
XPath //a[contains(@href, "pdf")]: HTML::TreeBuilder::XPath, XML::LibXML

use strict;
use warnings;
use HTML::TreeBuilder::XPath qw();
my $dom = HTML::TreeBuilder::XPath->new_from_content(q(...)); # ... your html code
my @nodes = $dom->findnodes('//a[contains(@href, "pdf")]');
foreach my $node (@nodes) {
    my $href = $node->findvalue('@href');
    ...
}

Is there an easier way to extract this data?

my $changelog = "/etc/webmin/Pserver_Panel/changelog.cgi";
my $Milestone;
open (PREFS, $changelog);
while (<PREFS>)
{
if ($_ =~ m/^<h1>(.*)[ ]Milestone.*$/g) {
$Milestone=$1;
last;
}
}
close(PREFS);
Here is an example of the data it's extracting from:
<h1>1.77 Milestone</h1>
<h3> 6/26/2009 </h3><ul style="margin-top:0px">
<li type=circle> Standard code house cleaning and added better compatbility for apache conversion.
</ul>
<h3> 6/21/2009 </h3><ul style="margin-top:0px">
<li type=square> Fixed Autofix so that it extracts to the right directory.
</ul>
<h3> 6/11/2009 </h3><ul style="margin-top:0px">
<li type=circle> Updated FTP link on index page to go to net2ftp, an online ftp file manager.
</ul>
<h1>1.76 Milestone</h1>
<h3> 4/14/2009 </h3><ul style="margin-top:0px">
<li type=square> Corrected a broken hyperlink to regular expressions in "View Chat Log"
<li type=circle> Changed the default number of lines back from 25 to 10 on both Chat and Pserver Logs.
<li type=circle> Noted in "View Pserver Log" search is case-sensitive and regular expression supported.
</ul>
<h3> 4/13/2009 </h3><ul style="margin-top:0px">
<li type=disc> Added AutoFix to the panel which will automatically fix prop errors.
<li type=circle> Updated error display to allow more detailed errors.
</ul>
<h3> 4/12/2009 </h3><ul style="margin-top:0px">
<li type=circle> Fixed start/stop/restart to be more reliable.
</ul>

Next, you are going to have to parse the items between milestones. Do yourself a favor, stop worrying about lines of code and use an HTML Parser, such as HTML::TokeParser:
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;
my $parser = HTML::TokeParser->new( \*DATA );
while ( my $token = $parser->get_token ) {
if ( $token->[0] eq 'S' ) {
if ( $token->[1] eq 'h1') {
my ($milestone) = split ' ', $parser->get_text('/h1');
print "Milestone is '$milestone'\n";
}
}
}
__DATA__
<h1>1.77 Milestone</h1>
...
C:\Temp> vbn
Milestone is '1.77'
Milestone is '1.76'

If you want a one-liner, how about this one?
perl -nle 'if (/^<h1>(.*)[ ]Milestone.*$/g){ print $1; last }' /etc/webmin/Pserver_Panel/changelog.cgi

Another one-liner to grab it from the command line:
perl -ne'print $1 and exit if /<h1>(.*?)\s+Milestone/' /etc/webmin/Pserver_Panel/changelog.cgi
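
If the extraction has to stay inside a larger script instead of a one-liner, the same regex approach can be wrapped in a small function. This is only a sketch using core Perl: first_milestone is a name made up here, and the demo reads an in-memory copy of a few sample lines instead of the real changelog file.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Return the version from the first "<h1>X Milestone</h1>" line, or undef.
sub first_milestone {
    my ($fh) = @_;
    while (my $line = <$fh>) {
        return $1 if $line =~ m{^<h1>(.*?)\s+Milestone};
    }
    return;
}

# Demo on an in-memory copy of a few lines of the sample data.
# For the real file, open it with a lexical handle and check for errors:
#   open(my $fh, '<', $changelog) or die "Cannot open $changelog: $!";
my $sample = "<h1>1.77 Milestone</h1>\n<h3> 6/26/2009 </h3>\n<h1>1.76 Milestone</h1>\n";
open(my $fh, '<', \$sample) or die "Cannot open in-memory file: $!";
my $milestone = first_milestone($fh);
close($fh);

print defined $milestone ? "$milestone\n" : "no milestone found\n";
```

Compared with the code in the question, the three-argument open with a lexical filehandle reports a missing file instead of silently reading nothing, and returning from the function replaces the explicit last.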
