I have been using Mechanize::Firefox in order to do some web scraping and I am currently in the process of going fully headless. I have been using this code:
my $link = [$firefox->find_link_dom(text_regex => qr/pdf/i)]->[0]->{href} . "\n";
which provides very good results in finding a .pdf file. Is there anyway I can find link doom's based on a text_regex using a different module that does not require the browser to be shown?
EDIT
With the Mechanize::Firefox I would get back a link out of code like:
<div id="rhc" xpathLocation="noDialog">
<div id="download" class="rhcBox_type1">
<div class="wrap">
<ul>
<li class="download icon"><strong>Download:</strong>
PDF |
Citation |
XML
</li>
<li class="print icon"><strong>Print article</strong></li>
<li class="reprint icon">EzReprint New & improved!</li>
</ul>
</div>
to download a pdf file. Any ideas why the HTML::Treebase will not get this link??
CSS3 a[href*="pdf"]: Web::Query, HTML::Query, Mojo::UserAgent
XPath //a[contains(#href, "pdf")]: HTML::TreeBuilder::XPath, XML::LibXML
use strict;
use warnings;
use HTML::TreeBuilder::XPath qw();
my $dom = HTML::TreeBuilder::XPath->new_from_content(q(...)); # ... your html code
my #nodes = $dom->findnodes('//a[contains(#href, "pdf")]');
foreach my $node (#nodes) {
my $href = $node->findvalue('#href');
...
}
Related
Given the following HTML:
<div class="chosen-drop">
<ul class="chosen-results">
<li>Stuff 1</li>
<li>Stuff 2</li>
<li>Stuff 3</li>
</ul>
</div>
How do I pull the text from the list items using WWW::Mechanize::Firefox xpath function?
It seems like this should work, it's basically pulled from the documentation but it's coming up empty:
my #text = $mech->xpath('//div[#class="chosen-drop"]/ul/li/text()');
I must be missing something with the xpath.
With these files:
mech_xpath.pl:
#!perl -w
use strict;
use WWW::Mechanize::Firefox;
use Data::Dump qw/dump/;
my $mech = WWW::Mechanize::Firefox->new();
$mech->get_local('local.html');
my #text = $mech->xpath('//div[#class="chosen-drop"]/ul/li/text()');
warn dump \#text;
<>;
local.html:
<div class="chosen-drop">
<ul class="chosen-results">
<li>Stuff 1</li>
<li>Stuff 2</li>
<li>Stuff 3</li>
</ul>
</div>
Gives this output:
[
bless({
# tied MozRepl::RemoteObject::TiedHash
}, "MozRepl::RemoteObject::Instance"),
bless({
# tied MozRepl::RemoteObject::TiedHash
}, "MozRepl::RemoteObject::Instance"),
bless({
# tied MozRepl::RemoteObject::TiedHash
}, "MozRepl::RemoteObject::Instance"),
]
So everything looks to be working. How are you checking the contents of #text?
Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 7 years ago.
Improve this question
I'm migrating from LaTeX to PrinceXML. One of the things I need to do is to convert the bibliography. I've converted my .bib file to HTML. However, since LaTeX took care of sorting the entries for me, I haven't taken care to put them into the correct order - but in the HTML the order of declaration does matter.
So my problem is: using Linux command line tools (e.g. Perl is acceptable, but Javascript is not), how can I sort a source file like this:
<div id="references">
<h2>References</h2>
<ul>
<li id="reference-to-book-1">
<span class="ref-author">Sample, Peter</span>
<cite>Online Book 1</cite>
<span class="ref-year">2011</span>
</li>
<li id="reference-to-book-2">
<cite>Physical Book 2</cite>
<span class="ref-year">2012</span>
<span class="ref-author">Example, Sandy</span>
</li>
</ul>
</div><!-- references -->
to look like this:
<div id="references">
<h2>References</h2>
<ul>
<li id="reference-to-book-2">
<span class="ref-author">Example, Sandy</span>
<cite>Physical Book 2</cite>
<span class="ref-year">2012</span>
</li>
<li id="reference-to-book-1">
<span class="ref-author">Sample, Peter</span>
<cite>Online Book 1</cite>
<span class="ref-year">2011</span>
</li>
</ul>
</div><!-- references -->
The criteria being:
The <li> elements containing the entries are sorted alphabetically according to author (i.e. everything from one <li id=" to its corresponding </li> is to be moved as a single block).
Within each entry, the elements are in the following order:
line matches class="ref-author"
line matches <cite>
line matches class="ref-year"
There are more elements (e.g. class="publisher") I omitted from the example for purposes of clarity; also, I run across this sorting problem very often. So it would be helpful if the expressions to match could be specified freely (e.g. as an array declaration in the script).
The remainder of the file (outside /id="references"/,/-- references --/) is unchanged.
The result file should have each line unchanged except for its position in the file (this point added because I the XML parsers I tried broke my indentation).
I got 1, 3 and 4 solved using sed and sort, but can't get 2 to work that way.
I'd use Mojo for this. You might need to tidy up the XML afterwards.
use Mojo::Base -strict;
use Mojo::DOM;
use Mojo::Util 'slurp';
my $xml = slurp $ARGV[0] or die "I need a file";
my $dom = Mojo::DOM->new($xml);
my $list = $dom->at('#references ul');
my $refs = $dom->find('li');
$refs->each('remove');
$refs = $refs->sort( sub { $a->at('.ref-author')->text cmp $b->at('.ref-author')->text } );
for my $ref ( #{ $refs } ){
my $new = Mojo::DOM->new('<li></li>')->at('li');
$new->append_content($ref->at('.ref-author'));
$new->append_content($ref->at('cite'));
#KEEP APPENDING IN THE ORDER YOU WANT THEM
$list->append_content($new);
}
say $dom;
I suggest you use the XML::LibXML module and parse your data as HTML. Then you can manipulate the DOM as you wish and print the modified structure back out
Here's an example of how it might work
use strict;
use warnings;
use XML::LibXML;
my $dom = XML::LibXML->load_html(IO => \*DATA);
my ($refs) = $dom->findnodes('/html/body//div[#id="references"]/ul');
my #refs = $refs->findnodes('li');
$refs->removeChild($_) for #refs;
$refs->appendChild($_) for sort {
my ($aa, $bb) = map { $_->findvalue('span[#class="ref-author"]') } $a, $b;
$aa cmp $bb;
} #refs;
print $dom, "\n";
__DATA__
<html>
<head>
<title>Title</title>
</head>
<body>
<div id="references">
<h2>References</h2>
<ul>
<li id="reference-to-book-1">
<span class="ref-author">Sample, Peter</span>
<cite>Online Book 1</cite>
<span class="ref-year">2011</span>
</li>
<li id="reference-to-book-2">
<cite>Physical Book 2</cite>
<span class="ref-year">2012</span>
<span class="ref-author">Example, Sandy</span>
</li>
</ul>
</div><!-- references -->
</body>
</html>
output
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><head><title>Title</title></head><body>
<div id="references">
<h2>References</h2>
<ul>
<li id="reference-to-book-2">
<cite>Physical Book 2</cite>
<span class="ref-year">2012</span>
<span class="ref-author">Example, Sandy</span>
</li><li id="reference-to-book-1">
<span class="ref-author">Sample, Peter</span>
<cite>Online Book 1</cite>
<span class="ref-year">2011</span>
</li></ul></div><!-- references -->
</body></html>
Input data: (some nested list with links )
<ul>
<li><a>1</a>
<ul>
<li><a>11</a>
<ul>
<li><a>111</a></li>
<li><a>112</a></li>
<li><a>113</a>
<ul>
<li><a>1131</a></li>
<li><a>1132</a></li>
<li><a>1133</a></li>
</ul></li>
<li><a>114</a></li>
<li><a>115</a></li>
</ul>
</li>
<li><a>12</a>
<ul>
<li><a>121</a>
<ul>
<li><a>1211</a></li>
<li><a>1212</a></li>
<li><a>1213</a></li>
</ul></li>
<li><a>122</a></li>
</ul>
</li>
</ul>
</li>
</ul>
Output array of strings:
1,11,111
1,11,112
1,11,113,1131
1,11,113,1132
1,11,113,1133
1,11,114
1,11,115
1,12,121,1211
1,12,121,1212
1,12,121,1213
1,12,122
Full path with text of element which in without childs.
What I tried:
1. XML::SAX::ParserFactory
https://gist.github.com/7266638 Alot of problem here. How to detect if li last, how to save path etc. I think its bad way.
Its totaly not a regexp, cos in real life example html much worse. Alot of tags, divs, spans etc
Dom? But how?
You can try with XML::Twig module. It saves all text from <a> elements and only prints them when there is no child <ul> under one of <li> elements.
#!/usr/bin/env perl
use warnings;
use strict;
use XML::Twig;
my (#li);
my $twig = XML::Twig->new(
twig_handlers => {
'a' => sub {
if ( $_->prev_elt('li') ) {
push #li, $_->text;
}
},
'li' => sub {
unless ( $_->children('ul') ) {
printf qq|%s\n|, join q|,|, #li;
}
pop #li;
},
},
)->parsefile( shift );
Run it like:
perl script.pl xmlfile
That yields:
1,11,111
1,11,112
1,11,113,1131
1,11,113,1132
1,11,113,1133
1,11,114
1,11,115
1,12,121,1211
1,12,121,1212
1,12,121,1213
1,12,122
Im trying to parse some html tags using perl module Web::Scraper but seems Im an inept using perl. I wonder if anyone can look for mistakes in my code...:
This is my HTML to parse (2 urls inside li tags):
<more html above here>
<div class="span-48 last">
<div class="span-37">
<div id="zone-extract" class="123">
<h2 class="genres"></h2>
<li>1</li>
<li><a class="sel" href="**URL_TO_EXTRACT_2**">2</a></li>
<li class="first">Pàg</li>
</div>
</div>
</div>
<more stuff from here>
Im trying to obtain:
ID:1 Link:URL_TO_EXTRACT_1
ID:2 Link:URL_TO_EXTRACT_2
With this perl code:
my $scraper = scraper {
process ".zone-extract > a[href]", urls => '#href', id => 'TEXT';
result 'urls';
};
my $links = $scraper->scrape($response);
This is one of the infinite process combinations I tried, with two different results: An empty return, or all the urls inside code (and I only need links inside zone-extract).
Resolved with mob's contribution... #zone-extract instead .zone-extract :)
#!/usr/bin/env perl
use strict;
use warnings;
use Web::Scraper;
my $html = q[
<div class="span-48 last">
<div class="span-37">
<div id="zone-extract" class="123">
<h2 class="genres"></h2>
<li>1</li>
<li><a class="sel" href="**URL_TO_EXTRACT_2**">2</a></li>
<li class="first">Pàg</li>
</div>
</div>
</div>
]; # / (turn off wrong syntax highlighting)
my $parser = scraper {
process '//div[#id="zone-extract"]//a', 'urls[]' => sub {
my $url = $_[0]->attr('href') ;
return $url;
};
};
my $ref = $parser->scrape(\$html);
print "$_\n" for #{ $ref->{urls} };
my $changelog = "/etc/webmin/Pserver_Panel/changelog.cgi";
my $Milestone;
open (PREFS, $changelog);
while (<PREFS>)
{
if ($_ =~ m/^<h1>(.*)[ ]Milestone.*$/g) {
$Milestone=$1;
last;
}
}
close(PREFS);
Here is an example of the data its extracting from:
<h1>1.77 Milestone</h1>
<h3> 6/26/2009 </h3><ul style="margin-top:0px">
<li type=circle> Standard code house cleaning and added better compatbility for apache conversion.
</ul>
<h3> 6/21/2009 </h3><ul style="margin-top:0px">
<li type=square> Fixed Autofix so that it extracts to the right directory.
</ul>
<h3> 6/11/2009 </h3><ul style="margin-top:0px">
<li type=circle> Updated FTP link on index page to go to net2ftp, an online ftp file manager.
</ul>
<h1>1.76 Milestone</h1>
<h3> 4/14/2009 </h3><ul style="margin-top:0px">
<li type=square> Corrected a broken hyperlink to regular expressions in "View Chat Log"
<li type=circle> Changed the default number of lines back from 25 to 10 on both Chat and Pserver Logs.
<li type=circle> Noted in "View Pserver Log" search is case-sensitive and regular expression supported.
</ul>
<h3> 4/13/2009 </h3><ul style="margin-top:0px">
<li type=disc> Added AutoFix to the panel which will automatically fix prop errors.
<li type=circle> Updated error display to allow more detailed errors.
</ul>
<h3> 4/12/2009 </h3><ul style="margin-top:0px">
<li type=circle> Fixed start/stop/restart to be more reliable.
</ul>
Next, you are going to have to parse the items between milestones. Do yourself a favor, stop worrying about lines of code and use an HTML Parser, such as HTML::TokeParser:
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;
my $parser = HTML::TokeParser->new( \*DATA );
while ( my $token = $parser->get_token ) {
if ( $token->[0] eq 'S' ) {
if ( $token->[1] eq 'h1') {
my ($milestone) = split ' ', $parser->get_text('/h1');
print "Milestone is '$milestone'\n";
}
}
}
__DATA__
<h1>1.77 Milestone</h1>
...
C:\Temp> vbn
Milestone is '1.77'
Milestone is '1.76'
I you want a one-liner, how about this one?
perl -nle 'if (/^<h1>(.*)[ ]Milestone.*$/g){ print $1; last }' /etc/webmin/Pserver_Panel/changelog.cgi
Another one-liner to grab it from the command line:
perl -ne'print $1 and exit if /<h1>(.*?)\s+Milestone/' /etc/webmin/Pserver_Panel/changelog.cgi