Pulling text from list items using WWW::Mechanize::Firefox - perl

Given the following HTML:
<div class="chosen-drop">
<ul class="chosen-results">
<li>Stuff 1</li>
<li>Stuff 2</li>
<li>Stuff 3</li>
</ul>
</div>
How do I pull the text from the list items using the WWW::Mechanize::Firefox xpath function?
It seems like this should work, since it is basically taken from the documentation, but it comes up empty:
my @text = $mech->xpath('//div[@class="chosen-drop"]/ul/li/text()');
I must be missing something with the XPath expression.

With these files:
mech_xpath.pl:
#!perl -w
use strict;
use WWW::Mechanize::Firefox;
use Data::Dump qw/dump/;
my $mech = WWW::Mechanize::Firefox->new();
$mech->get_local('local.html');
my @text = $mech->xpath('//div[@class="chosen-drop"]/ul/li/text()');
warn dump \@text;
<>;
local.html:
<div class="chosen-drop">
<ul class="chosen-results">
<li>Stuff 1</li>
<li>Stuff 2</li>
<li>Stuff 3</li>
</ul>
</div>
Gives this output:
[
bless({
# tied MozRepl::RemoteObject::TiedHash
}, "MozRepl::RemoteObject::Instance"),
bless({
# tied MozRepl::RemoteObject::TiedHash
}, "MozRepl::RemoteObject::Instance"),
bless({
# tied MozRepl::RemoteObject::TiedHash
}, "MozRepl::RemoteObject::Instance"),
]
So everything looks to be working: the query returns three node objects, one per list item. How are you checking the contents of @text?

Related

Targeting individual elements in HTML using Perl and Mojo::DOM in well-formated HTML

I am a relative beginner with Perl, and this is my first question here. I am trying the following:
I am trying to retrieve certain information from a large online dataset (Eur-Lex), where each HTML document is well-formed HTML, with constant elements. Each HTML file is identified by its Celex number, which is supplied as the argument to the script (see my Perl code below).
The HTML data looks like this (showing only the part I'm interested in):
<!--
<blahblah>
< lots of stuff here, before the interesting part>
-->
<div id="PPClass_Contents" class="panel-collapse collapse in" role="tabpanel"
aria-labelledby="PP_Class">
<div class="panel-body">
<dl class="NMetadata">
<dt xmlns="http://www.w3.org/1999/xhtml">EUROVOC descriptor: </dt>
<dd xmlns="http://www.w3.org/1999/xhtml">
<ul>
<li>
<a href="./../../../search.html?type=advanced&DTS_DOM=ALL&DTS_SUBDOM=ALL_ALL&SUBDOM_INIT=ALL_ALL&DC_CODED=341&lang=en">
<span lang="en">descriptor_1</span>
</a>
</li>
<li>
<a href="./../../../search.html?type=advanced&DTS_DOM=ALL&DTS_SUBDOM=ALL_ALL&SUBDOM_INIT=ALL_ALL&DC_CODED=5158&lang=en">
<span lang="en">descriptor_2</span>
</a>
</li>
<li>
<a href="./../../../search.html?type=advanced&DTS_DOM=ALL&DTS_SUBDOM=ALL_ALL&SUBDOM_INIT=ALL_ALL&DC_CODED=7983&lang=en">
<span lang="en">descriptor_3</span>
</a>
</li>
<li>
<a href="./../../../search.html?type=advanced&DTS_DOM=ALL&DTS_SUBDOM=ALL_ALL&SUBDOM_INIT=ALL_ALL&DC_CODED=933&lang=en">
<span lang="en">descriptor_4</span>
</a>
</li>
</ul>
</dd>
<dt xmlns="http://www.w3.org/1999/xhtml">Subject matter: </dt>
<dd xmlns="http://www.w3.org/1999/xhtml">
<ul>
<li>
<a href="./../../../search.html?type=advanced&DTS_DOM=ALL&DTS_SUBDOM=ALL_ALL&SUBDOM_INIT=ALL_ALL&CT_CODED=BUDG&lang=en">
<span lang="en">Subject_1</span>
</a>
</li>
</ul>
</dd>
<dt xmlns="http://www.w3.org/1999/xhtml">Directory code: </dt>
<dd xmlns="http://www.w3.org/1999/xhtml">
<ul>
<li>01.60.20.00 <a href="./../../../search.html?type=advanced&DTS_DOM=ALL&DTS_SUBDOM=ALL_ALL&SUBDOM_INIT=ALL_ALL&CC_1_CODED=01&lang=en">
<span lang="en">Designation_level_1</span>
</a> / <a href="./../../../search.html?type=advanced&DTS_DOM=ALL&DTS_SUBDOM=ALL_ALL&SUBDOM_INIT=ALL_ALL&CC_2_CODED=0160&lang=en">
<span lang="en">Designation_level_2</span>
</a> / <a href="./../../../search.html?type=advanced&DTS_DOM=ALL&DTS_SUBDOM=ALL_ALL&SUBDOM_INIT=ALL_ALL&CC_3_CODED=016020&lang=en">
<span lang="en">Designation_level_3</span>
</a>
</li>
</ul>
</dd>
</dl>
</div>
</div>
</div>
<!--
<still more stuff here>
-->
I am interested in the info contained in "PPClass_Contents" div id, which consists of 3 elements:
- EUROVOC descriptor:
- Subject matter:
- Directory code:
Based on the above HTML, I would like to get the children of those 3 main elements using Perl and Mojo, producing a result like this (a single-line text file with 3 groups separated by tabs; multiple child elements within a group are separated by pipe characters):
CELEX_No "TAB" descriptor_1|descriptor_2|descriptor_3|descriptor_4|..|descriptor_n "TAB" Subject_1|..|Subject_n "TAB" Designation_level_1|Designation_level_2|Designation_level_3|..|Designation_level_n
"descriptors", "Subjects" and "Designation_levels" elements (children of those 3 main groups) can be from 1 to "n", the number is not fixed, and is not known in advance.
I have the following code, which does print out the plain text of the interesting part, but I need to address the individual elements and print them out in a new file as described above:
#!/usr/bin/perl
# returns "Classification" descriptors for given CELEX and Language
use strict;
use warnings;
use Mojo::UserAgent;
if ($#ARGV ne "1") {
print "Wrong number of arguments!\n";
print "Syntax: clookup.pl Lang_ID celex_No.\n";
exit -1;
}
my $lang = $ARGV[0];
my $celex = $ARGV[1];
my $lclang = lc $lang;
# fetch the eurlex page
my $ua = Mojo::UserAgent->new;
my $dom = $ua->get("https://eur-lex.europa.eu/legal-content/$lang/ALL/?uri=CELEX:$celex")->res->dom;
################ let's extract interesting parts:
my $text = $dom->at('#PPClass_Contents')->all_text;
print "$text\n";
EDIT (added):
You can try my Perl script using two arguments:
lang_code ("DE","EN","IT", etc.)
Celex number (e.g.: E2014C0303, 52015BP2212, 52015BP0930(48), 52015BP0930(36), 52015BP0930(41), E2014C0302, E2014C0301, E2014C0271, E2014C0134).
For example (if you name my script "clookup.pl"):
$ perl clookup.pl EN E2014C0303
So, how can I address individual elements (of unknown number) as described above, using Mojo::DOM?
Or, is there something simpler or faster (using Perl)?
You are on the right track. First, you need to understand the HTML inside your #PPClass_Contents. Each set of things is in a definition list. Since you only care about the definition texts, you can search directly for the <dd> elements.
$dom->at('#PPClass_Contents')->find('dd')
This will give you a Mojo::Collection, which you can iterate with ->each. We pass that an anonymous function, pretty much like a callback.
$dom->at('#PPClass_Contents')->find('dd')->each(sub {
$_; # this is the current element
});
Each element will be passed to that sub, and can be referenced using the topic variable $_. There is an <ul> inside, and each <li> contains a <span> element with the text you want. So let's find those.
$_->find('span')
We can directly build the column in your output at this stage. Let's use the other form of ->each, which turns the Mojo::Collection returned from ->find into a normal Perl list. We can then use a regular map operation to grab each <span>'s text node and join that into a string.
join '|', map { $_->text } $_->find('span')->each
To tie all that together, we declare an array outside this construct, and stick the $celex number in it as the first column.
my @columns = ($celex);
$dom->at('#PPClass_Contents')->find('dd')->each(sub {
push @columns, join '|', map { $_->text } $_->find('span')->each;
});
Producing the final tab-separated output is now trivial.
print join "\t", @columns;
I've done this with EN as the language and the $celex number 32006L0121, which the search used in its example tooltip. The result is this:
32006L0121 marketing standard|chemical product|approximation of laws|dangerous substance|scientific report|packaging|European Chemicals Agency|labelling Internal market - Principles|Approximation of laws|Technical barriers|Environment|Consumer protection Industrial policy and internal market|Internal market: approximation of laws|Dangerous substances
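Putting the pieces together against a static copy of the markup, a minimal self-contained sketch might look like the following. The inline HTML is a trimmed stand-in for the real page (assuming it keeps the same structure), the CELEX value is only a placeholder, and Mojo::DOM->new is used directly instead of fetching the page with Mojo::UserAgent.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Mojo::DOM;

# Trimmed stand-in for the real Eur-Lex page (assumption: same structure).
my $html = <<'HTML';
<div id="PPClass_Contents">
  <dl class="NMetadata">
    <dt>EUROVOC descriptor: </dt>
    <dd><ul>
      <li><a href="#"><span lang="en">descriptor_1</span></a></li>
      <li><a href="#"><span lang="en">descriptor_2</span></a></li>
    </ul></dd>
    <dt>Subject matter: </dt>
    <dd><ul>
      <li><a href="#"><span lang="en">Subject_1</span></a></li>
    </ul></dd>
  </dl>
</div>
HTML

my $celex = 'E2014C0303';    # placeholder; normally taken from @ARGV
my $dom   = Mojo::DOM->new($html);

# One column per <dd>, each column being the pipe-joined <span> texts.
my @columns = ($celex);
$dom->at('#PPClass_Contents')->find('dd')->each(sub {
    push @columns, join '|', map { $_->text } $_->find('span')->each;
});

my $line = join "\t", @columns;
print "$line\n";
```

Because the number of <dd> and <span> elements is discovered at run time, the same loop copes with any number of descriptors, subjects, or designation levels.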

Parse html nested list to perl array

Input data: (some nested list with links )
<ul>
<li><a>1</a>
<ul>
<li><a>11</a>
<ul>
<li><a>111</a></li>
<li><a>112</a></li>
<li><a>113</a>
<ul>
<li><a>1131</a></li>
<li><a>1132</a></li>
<li><a>1133</a></li>
</ul></li>
<li><a>114</a></li>
<li><a>115</a></li>
</ul>
</li>
<li><a>12</a>
<ul>
<li><a>121</a>
<ul>
<li><a>1211</a></li>
<li><a>1212</a></li>
<li><a>1213</a></li>
</ul></li>
<li><a>122</a></li>
</ul>
</li>
</ul>
</li>
</ul>
Output array of strings:
1,11,111
1,11,112
1,11,113,1131
1,11,113,1132
1,11,113,1133
1,11,114
1,11,115
1,12,121,1211
1,12,121,1212
1,12,121,1213
1,12,122
The output is the full path, with text, to each element that has no children.
What I tried:
1. XML::SAX::ParserFactory
https://gist.github.com/7266638 There are a lot of problems here: how to detect whether an li is the last one, how to save the path, etc. I think it is a bad approach.
Regular expressions are definitely not an option, because in the real-life example the HTML is much worse: lots of tags, divs, spans, etc.
DOM? But how?
You can try the XML::Twig module. The script below saves the text of every <a> element and prints the saved texts only when an <li> element has no child <ul>.
#!/usr/bin/env perl
use warnings;
use strict;
use XML::Twig;
my (@li);
my $twig = XML::Twig->new(
twig_handlers => {
'a' => sub {
if ( $_->prev_elt('li') ) {
push @li, $_->text;
}
},
'li' => sub {
unless ( $_->children('ul') ) {
printf qq|%s\n|, join q|,|, @li;
}
pop @li;
},
},
)->parsefile( shift );
Run it like:
perl script.pl xmlfile
That yields:
1,11,111
1,11,112
1,11,113,1131
1,11,113,1132
1,11,113,1133
1,11,114
1,11,115
1,12,121,1211
1,12,121,1212
1,12,121,1213
1,12,122
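For the "DOM? But how?" part of the question, a DOM-based variant is also possible. The following is only a sketch using Mojo::DOM (a different module from the XML::Twig answer above), run against a shortened copy of the list; label is a hypothetical helper name. It visits every <li> without a nested <ul> and builds the path from its <li> ancestors.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Mojo::DOM;

# Shortened version of the nested list from the question.
my $html = <<'HTML';
<ul>
  <li><a>1</a>
    <ul>
      <li><a>11</a>
        <ul>
          <li><a>111</a></li>
          <li><a>112</a></li>
        </ul>
      </li>
      <li><a>12</a></li>
    </ul>
  </li>
</ul>
HTML

my $dom = Mojo::DOM->new($html);

# Hypothetical helper: text of the <a> that is a direct child of an <li>.
sub label { $_[0]->children('a')->first->text }

my @paths;
for my $leaf ($dom->find('li')->each) {
    next if $leaf->at('ul');    # skip items that contain a nested list

    # ancestors() runs from the nearest <li> outwards, so reverse it
    # to get the path from the root down to the leaf.
    my @path = map { label($_) } reverse $leaf->ancestors('li')->each;
    push @paths, join ',', @path, label($leaf);
}

print "$_\n" for @paths;
```

Collecting the paths in an array rather than printing them directly gives the "array of strings" the question asks for.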

Perl Find Web Link By DOM

I have been using Mechanize::Firefox in order to do some web scraping and I am currently in the process of going fully headless. I have been using this code:
my $link = [$firefox->find_link_dom(text_regex => qr/pdf/i)]->[0]->{href} . "\n";
which works very well for finding a .pdf file. Is there any way I can find DOM links based on a text_regex using a different module that does not require the browser to be shown?
EDIT
With the Mechanize::Firefox I would get back a link out of code like:
<div id="rhc" xpathLocation="noDialog">
<div id="download" class="rhcBox_type1">
<div class="wrap">
<ul>
<li class="download icon"><strong>Download:</strong>
PDF |
Citation |
XML
</li>
<li class="print icon"><strong>Print article</strong></li>
<li class="reprint icon">EzReprint New & improved!</li>
</ul>
</div>
to download a PDF file. Any ideas why HTML::Treebase will not get this link?
CSS3 a[href*="pdf"]: Web::Query, HTML::Query, Mojo::UserAgent
XPath //a[contains(@href, "pdf")]: HTML::TreeBuilder::XPath, XML::LibXML
use strict;
use warnings;
use HTML::TreeBuilder::XPath qw();
my $dom = HTML::TreeBuilder::XPath->new_from_content(q(...)); # ... your html code
my @nodes = $dom->findnodes('//a[contains(@href, "pdf")]');
foreach my $node (@nodes) {
my $href = $node->findvalue('@href');
...
}
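The CSS selector route can be sketched with Mojo::DOM, the parser behind Mojo::UserAgent. The markup and href values below are made up for illustration, since the links in the question's HTML are not shown. The second query filters on the link text instead of the href, which is closer to what text_regex => qr/pdf/i does.

```perl
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical markup; the hrefs are invented for illustration.
my $html = <<'HTML';
<ul>
  <li><a href="/article/123.pdf">PDF</a></li>
  <li><a href="/article/123/citation">Citation</a></li>
  <li><a href="/article/123.xml">XML</a></li>
</ul>
HTML

my $dom = Mojo::DOM->new($html);

# Match on the href attribute ...
my @by_href = $dom->find('a[href*="pdf"]')->map(attr => 'href')->each;

# ... or on the link text, similar to text_regex => qr/pdf/i.
my @by_text = map  { $_->attr('href') }
              grep { $_->text =~ /pdf/i } $dom->find('a[href]')->each;

print "$_\n" for @by_href;
```

No browser is involved here, so this runs headless, but it also means JavaScript on the page is not executed.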

Using Web::Scraper

I'm trying to parse some HTML tags using the Perl module Web::Scraper, but it seems I am not very good with Perl. I wonder if anyone can spot the mistakes in my code:
This is my HTML to parse (2 urls inside li tags):
<more html above here>
<div class="span-48 last">
<div class="span-37">
<div id="zone-extract" class="123">
<h2 class="genres"></h2>
<li>1</li>
<li><a class="sel" href="**URL_TO_EXTRACT_2**">2</a></li>
<li class="first">Pàg</li>
</div>
</div>
</div>
<more stuff from here>
I'm trying to obtain:
ID:1 Link:URL_TO_EXTRACT_1
ID:2 Link:URL_TO_EXTRACT_2
With this perl code:
my $scraper = scraper {
process ".zone-extract > a[href]", urls => '@href', id => 'TEXT';
result 'urls';
};
my $links = $scraper->scrape($response);
This is one of the countless process combinations I tried, with two different results: an empty return, or all the URLs in the page (and I only need the links inside zone-extract).
Resolved with mob's contribution: #zone-extract instead of .zone-extract :)
#!/usr/bin/env perl
use strict;
use warnings;
use Web::Scraper;
my $html = q[
<div class="span-48 last">
<div class="span-37">
<div id="zone-extract" class="123">
<h2 class="genres"></h2>
<li>1</li>
<li><a class="sel" href="**URL_TO_EXTRACT_2**">2</a></li>
<li class="first">Pàg</li>
</div>
</div>
</div>
]; # / (turn off wrong syntax highlighting)
my $parser = scraper {
process '//div[@id="zone-extract"]//a', 'urls[]' => sub {
my $url = $_[0]->attr('href') ;
return $url;
};
};
my $ref = $parser->scrape(\$html);
print "$_\n" for @{ $ref->{urls} };

Is there an easier way to extract this data?

my $changelog = "/etc/webmin/Pserver_Panel/changelog.cgi";
my $Milestone;
open (PREFS, $changelog);
while (<PREFS>)
{
if ($_ =~ m/^<h1>(.*)[ ]Milestone.*$/g) {
$Milestone=$1;
last;
}
}
close(PREFS);
Here is an example of the data its extracting from:
<h1>1.77 Milestone</h1>
<h3> 6/26/2009 </h3><ul style="margin-top:0px">
<li type=circle> Standard code house cleaning and added better compatbility for apache conversion.
</ul>
<h3> 6/21/2009 </h3><ul style="margin-top:0px">
<li type=square> Fixed Autofix so that it extracts to the right directory.
</ul>
<h3> 6/11/2009 </h3><ul style="margin-top:0px">
<li type=circle> Updated FTP link on index page to go to net2ftp, an online ftp file manager.
</ul>
<h1>1.76 Milestone</h1>
<h3> 4/14/2009 </h3><ul style="margin-top:0px">
<li type=square> Corrected a broken hyperlink to regular expressions in "View Chat Log"
<li type=circle> Changed the default number of lines back from 25 to 10 on both Chat and Pserver Logs.
<li type=circle> Noted in "View Pserver Log" search is case-sensitive and regular expression supported.
</ul>
<h3> 4/13/2009 </h3><ul style="margin-top:0px">
<li type=disc> Added AutoFix to the panel which will automatically fix prop errors.
<li type=circle> Updated error display to allow more detailed errors.
</ul>
<h3> 4/12/2009 </h3><ul style="margin-top:0px">
<li type=circle> Fixed start/stop/restart to be more reliable.
</ul>
Next, you are going to have to parse the items between milestones. Do yourself a favor, stop worrying about lines of code and use an HTML Parser, such as HTML::TokeParser:
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;
my $parser = HTML::TokeParser->new( \*DATA );
while ( my $token = $parser->get_token ) {
if ( $token->[0] eq 'S' ) {
if ( $token->[1] eq 'h1') {
my ($milestone) = split ' ', $parser->get_text('/h1');
print "Milestone is '$milestone'\n";
}
}
}
__DATA__
<h1>1.77 Milestone</h1>
...
C:\Temp> vbn
Milestone is '1.77'
Milestone is '1.76'
If you want a one-liner, how about this one?
perl -nle 'if (/^<h1>(.*)[ ]Milestone.*$/g){ print $1; last }' /etc/webmin/Pserver_Panel/changelog.cgi
Another one-liner to grab it from the command line:
perl -ne'print $1 and exit if /<h1>(.*?)\s+Milestone/' /etc/webmin/Pserver_Panel/changelog.cgi
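If the extraction has to stay inside a larger script rather than a one-liner, the original loop can be tightened as well. The sketch below uses a lexical filehandle, three-argument open with an error check, and the same non-greedy pattern as the one-liner above. first_milestone is a hypothetical helper name, and an in-memory string stands in for changelog.cgi so that the example is self-contained.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Return the version from the first "<h1>... Milestone" line, or nothing.
sub first_milestone {
    my ($fh) = @_;
    while (my $line = <$fh>) {
        return $1 if $line =~ /<h1>(.*?)\s+Milestone/;
    }
    return;
}

# In-memory sample standing in for changelog.cgi.
my $sample = <<'HTML';
<h1>1.77 Milestone</h1>
<h3> 6/26/2009 </h3><ul style="margin-top:0px">
<h1>1.76 Milestone</h1>
HTML

open my $fh, '<', \$sample or die "Cannot open sample: $!";
my $milestone = first_milestone($fh);
close $fh;

print "Latest milestone: $milestone\n";
```

To read the real file, replace \$sample with the path held in $changelog.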