Unable to Extract Link with Mojolicious - perl

I am trying to extract the link for the next page in a search results page using Mojo::DOM. However, I have a problem where instead of Mojo::DOM elements, I get a string after using ->find() on an existing element.
I have:
my $pagination_elements = $dom->find("div[class*=\"pagination-block\"]");
my $page_counter_text = $pagination_elements->find("div[class=\"page-of-pages\"]")->text();
$page_counter_text =~ /^Page (\d+) of (\d+)$/;
my $current_page = int($1);
my $last_page = int($2);
my $prev_next_elements = $pagination_elements->find("a[class*=\"prev-next\"]");
my $next_page_link = $prev_next_elements->last();
my $next_page_url = $next_page_link->attr("href");
On each page, there may be 2 link tags with a class of prev-next. Instead of getting the link for the last element, what I get is a string that contains the href for both of the tags (if both are available on the page).
Now, if instead of this I do:
my $next_page_link = $dom->find("div[class*=\"pagination-block\"] > ul > li > a[class*=\"prev-next\"]")->last();
my $next_page_url_rel = $next_page_link->attr("href");
I get the required link.
My question is, why does the second version work and not the first? Why do I have to start from the root DOM element to get a list of elements, and why does starting from a child of the root return a string containing all the link tags instead of just the one I want?
Edit
An example of the HTML I am parsing is:
<div class="pagination-block clearfix">
<div class="page-of-pages">
Page 2 of 100
</div>
<ul class="pagination-links">
<li>
.
.
.
</li>
<li>
<a class="page-option prev-next" href="PREV LINK">Prev</a>
</li>
<li>
<a class="page-option prev-next" href="NEXT LINK">Next</a>
</li>
</ul>
</div>

It would have helped a lot if you could have shown an example of the HTML you are processing. Instead I have imagined this, which I hope is close.
<html>
<head>
<title>Title</title>
</head>
<body>
<div class="pagination-block">
<div class="page-of-pages">Page 99 of 100</div>
<ul>
<li>
<a class="prev-next" href="/page98">Prev</a>
</li>
<li>
<a class="prev-next" href="/page100">Next</a>
</li>
</ul>
</div>
<div class="pagination-block">
<div class="page-of-pages">Page 99 of 100</div>
<ul>
<li>
<a class="prev-next" href="/page98">Prev</a>
</li>
<li>
<a class="prev-next" href="/page100">Next</a>
</li>
</ul>
</div>
</body>
</html>
Now let's look at your code
my $pagination_elements = $dom->find('div[class*="pagination-block"]')
This gives you a Mojo::Collection containing the two instances of div that have a pagination-block class.
my $prev_next_elements = $pagination_elements->find('a[class*="prev-next"]')
This does something like a map, replacing each member of the Mojo::Collection with the results of doing a find on them. Since find returns another Mojo::Collection, you now have a collection of two collections, each with two Mojo::DOM objects. To clarify
$prev_next_elements is a Mojo::Collection object with a size of 2
Both $prev_next_elements->[0] and $prev_next_elements->[1] are Mojo::Collection objects, each with a size of 2
$prev_next_elements->[0][0], $prev_next_elements->[0][1], $prev_next_elements->[1][0], and $prev_next_elements->[1][1] are all Mojo::DOM objects, each containing an <a> element from the HTML document
my $next_page_link = $prev_next_elements->last
This takes the second element of $prev_next_elements. It is the same as $prev_next_elements->[1], and so is a Mojo::Collection object containing the two Mojo::DOM elements that hold the last two <a> elements in the HTML document.
my $next_page_url = $next_page_link->attr('href')
Now you are doing another map operation: applying attr to both elements of the collection, and returning another collection containing the two href strings /page98 and /page100. Stringifying this Mojo::Collection just joins all of its elements with newlines and gives you "/page98\n/page100".
To fix all this, take the last of the $pagination_elements, giving you a Mojo::DOM object. Then do a find for the prev and next elements, giving you a Mojo::Collection of the "prev" and "next" <a> elements, and finally map those elements to links using attr('href'). You end up with a Mojo::Collection containing the href text of the "prev" and "next" links in the last pagination block.
my $pagination_elements = $dom->find('div[class*="pagination-block"]');
my $last_pagination_element = $pagination_elements->last;
my $prev_next_elements = $last_pagination_element->find('a[class*="prev-next"]');
my $prev_next_links = $prev_next_elements->attr('href');
my ($prev_page_link, $next_page_link) = ($prev_next_links->first, $prev_next_links->last);
say $prev_page_link;
say $next_page_link;
output
/page98
/page100
You can collapse all that to something more convenient, like this
my $pagination_elements = $dom->find('div[class*="pagination-block"]');
my $prev_next_links = $pagination_elements->last->find('a[class*="prev-next"]')->attr('href');
my ($prev_page_link, $next_page_link) = @$prev_next_links;
say $prev_page_link;
say $next_page_link;
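
Calling find or attr directly on a Mojo::Collection relies on the collection forwarding unknown methods to its members. As far as I know, later Mojolicious releases dropped that forwarding, and the same steps are then written explicitly with map and flatten. A minimal sketch under that assumption, reusing the imagined pagination markup (two identical blocks):

```perl
use strict;
use warnings;
use Mojo::DOM;

# Imagined markup: one pagination block, repeated twice below.
my $block = <<'HTML';
<div class="pagination-block">
  <div class="page-of-pages">Page 99 of 100</div>
  <ul>
    <li><a class="prev-next" href="/page98">Prev</a></li>
    <li><a class="prev-next" href="/page100">Next</a></li>
  </ul>
</div>
HTML
my $dom = Mojo::DOM->new($block x 2);

my $blocks = $dom->find('div[class*="pagination-block"]');

# map(find => ...) calls find on every block: a collection of collections
my $nested = $blocks->map(find => 'a[class*="prev-next"]');

# flatten merges the inner collections into one flat list of <a> elements
my $links = $nested->flatten;

my $next_page_url = $links->last->attr('href');
print "$next_page_url\n";
```

Here $nested holds 2 collections, $links holds 4 elements, and the script prints /page100.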

If you used Data::Dump (or some equivalent module) instead of print, you would get a clue as to what's going on:
use Data::Dump;
dd $next_page_url;
dd $next_page_url_rel;
Outputs:
bless(["PREV LINK", "NEXT LINK"], "Mojo::Collection")
"NEXT LINK"
As you can see, your first variable actually holds a collection, and not a string.
The problem arises because Mojo::DOM->find returns a Mojo::Collection:
my $pagination_elements = $dom->find('div[class*="pagination-block"]');
Doing a subsequent find on a collection returns a nested collection, which does not behave the way you expect with calls like last.
Here are three different solutions to fix your first attempt to find the link text:
Use the Mojo::DOM->at method to find the first element in DOM structure matching the CSS selector.
my $pagination_elements = $dom->at('div[class*="pagination-block"]');
Use Mojo::Collection->first or ->last to isolate a specific element in the collection before the subsequent find.
my $pagination_elements
= $dom->find('div[class*="pagination-block"]')->last();
Use Mojo::Collection->flatten to flatten the nested collections created by your subsequent find into a new collection with all elements:
my $pagination_elements = $dom->find('div[class*="pagination-block"]');
my $prev_next_elements
= $pagination_elements->find('a[class*="prev-next"]')->flatten();
All of these methods will make your script work as you intended:
use strict;
use warnings;
use Mojo::DOM;
use Data::Dump;
my $dom = Mojo::DOM->new(do { local $/; <DATA> });
# Fix 1
my $pagination_elements = $dom->at('div[class*="pagination-block"]');
# Fix 2
#my $pagination_elements
# = $dom->find('div[class*="pagination-block"]')->last();
# Fix 3
#my $pagination_elements = $dom->find('div[class*="pagination-block"]');
#my $prev_next_elements
# = $pagination_elements->find('a[class*="prev-next"]')->flatten();
my $prev_next_elements = $pagination_elements->find('a[class*="prev-next"]');
my $next_page_link = $prev_next_elements->last();
my $next_page_url = $next_page_link->attr("href");
dd $next_page_url;
$next_page_link = $dom->find('div[class*="pagination-block"] > ul > li > a[class*="prev-next"]')->last();
my $next_page_url_rel = $next_page_link->attr("href");
dd $next_page_url_rel;
__DATA__
<html>
<head>
<title>Paging Example</title>
</head>
<body>
<div class="pagination-block clearfix">
<div class="page-of-pages">
Page 2 of 100
</div>
<ul class="pagination-links">
<li>
.
.
.
</li>
<li>
<a class="page-option prev-next" href="PREV LINK">Prev</a>
</li>
<li>
<a class="page-option prev-next" href="NEXT LINK">Next</a>
</li>
</ul>
</div>
</body>
</html>
Outputs:
"NEXT LINK"
"NEXT LINK"

Related

Targeting individual elements in HTML using Perl and Mojo::DOM in well-formatted HTML

Relative beginner with Perl, with my first question here, trying the following:
I am trying to retrieve certain information from a large online dataset (Eur-Lex), where each HTML document is well-formed HTML, with constant elements. Each HTML file is identified by its Celex number, which is supplied as the argument to the script (see my Perl code below).
The HTML data looks like this (showing only the part I'm interested in):
<!--
<blahblah>
< lots of stuff here, before the interesting part>
-->
<div id="PPClass_Contents" class="panel-collapse collapse in" role="tabpanel"
aria-labelledby="PP_Class">
<div class="panel-body">
<dl class="NMetadata">
<dt xmlns="http://www.w3.org/1999/xhtml">EUROVOC descriptor: </dt>
<dd xmlns="http://www.w3.org/1999/xhtml">
<ul>
<li>
<a href="./../../../search.html?type=advanced&DTS_DOM=ALL&DTS_SUBDOM=ALL_ALL&SUBDOM_INIT=ALL_ALL&DC_CODED=341&lang=en">
<span lang="en">descriptor_1</span>
</a>
</li>
<li>
<a href="./../../../search.html?type=advanced&DTS_DOM=ALL&DTS_SUBDOM=ALL_ALL&SUBDOM_INIT=ALL_ALL&DC_CODED=5158&lang=en">
<span lang="en">descriptor_2</span>
</a>
</li>
<li>
<a href="./../../../search.html?type=advanced&DTS_DOM=ALL&DTS_SUBDOM=ALL_ALL&SUBDOM_INIT=ALL_ALL&DC_CODED=7983&lang=en">
<span lang="en">descriptor_3</span>
</a>
</li>
<li>
<a href="./../../../search.html?type=advanced&DTS_DOM=ALL&DTS_SUBDOM=ALL_ALL&SUBDOM_INIT=ALL_ALL&DC_CODED=933&lang=en">
<span lang="en">descriptor_4</span>
</a>
</li>
</ul>
</dd>
<dt xmlns="http://www.w3.org/1999/xhtml">Subject matter: </dt>
<dd xmlns="http://www.w3.org/1999/xhtml">
<ul>
<li>
<a href="./../../../search.html?type=advanced&DTS_DOM=ALL&DTS_SUBDOM=ALL_ALL&SUBDOM_INIT=ALL_ALL&CT_CODED=BUDG&lang=en">
<span lang="en">Subject_1</span>
</a>
</li>
</ul>
</dd>
<dt xmlns="http://www.w3.org/1999/xhtml">Directory code: </dt>
<dd xmlns="http://www.w3.org/1999/xhtml">
<ul>
<li>01.60.20.00 <a href="./../../../search.html?type=advanced&DTS_DOM=ALL&DTS_SUBDOM=ALL_ALL&SUBDOM_INIT=ALL_ALL&CC_1_CODED=01&lang=en">
<span lang="en">Designation_level_1</span>
</a> / <a href="./../../../search.html?type=advanced&DTS_DOM=ALL&DTS_SUBDOM=ALL_ALL&SUBDOM_INIT=ALL_ALL&CC_2_CODED=0160&lang=en">
<span lang="en">Designation_level_2</span>
</a> / <a href="./../../../search.html?type=advanced&DTS_DOM=ALL&DTS_SUBDOM=ALL_ALL&SUBDOM_INIT=ALL_ALL&CC_3_CODED=016020&lang=en">
<span lang="en">Designation_level_3</span>
</a>
</li>
</ul>
</dd>
</dl>
</div>
</div>
</div>
<!--
<still more stuff here>
-->
I am interested in the info contained in "PPClass_Contents" div id, which consists of 3 elements:
- EUROVOC descriptor:
- Subject matter:
- Directory code:
Based on the above HTML, I would like to get the children of those 3 main elements, using Perl and Mojo, with a result similar to this (a single-line text file with 3 groups separated by tabs; multiple child elements within a group are separated by pipe characters):
CELEX_No "TAB" descriptor_1|descriptor_2|descriptor_3|descriptor_4|..|descriptor_n "TAB" Subject_1|..|Subject_n "TAB" Designation_level_1|Designation_level_2|Designation_level_3|..|Designation_level_n
"descriptors", "Subjects" and "Designation_levels" elements (children of those 3 main groups) can be from 1 to "n", the number is not fixed, and is not known in advance.
I have the following code, which does print out the plain text of the interesting part, but I need to address the individual elements and print them out in a new file as described above:
#!/usr/bin/perl
# returns "Classification" descriptors for given CELEX and Language
use strict;
use warnings;
use Mojo::UserAgent;
if (@ARGV != 2) {
print "Wrong number of arguments!\n";
print "Syntax: clookup.pl Lang_ID celex_No.\n";
exit -1;
}
my $lang = $ARGV[0];
my $celex = $ARGV[1];
my $lclang = lc $lang;
# fetch the eurlex page
my $ua = Mojo::UserAgent->new;
my $dom = $ua->get("https://eur-lex.europa.eu/legal-content/$lang/ALL/?uri=CELEX:$celex")->res->dom;
################ let's extract interesting parts:
my $text = $dom->at('#PPClass_Contents')->all_text;
print "$text\n";
EDIT (added):
You can try my Perl script using two arguments:
lang_code ("DE","EN","IT", etc.)
Celex number (e.g.: E2014C0303, 52015BP2212, 52015BP0930(48), 52015BP0930(36), 52015BP0930(41), E2014C0302, E2014C0301, E2014C0271, E2014C0134).
For example (if you name my script "clookup.pl"):
$ perl clookup.pl EN E2014C0303
So, how can I address individual elements (of unknown number) as described above, using Mojo::DOM?
Or, is there something simpler or faster (using Perl)?
You are on the right track. First, you need to understand the HTML inside your #PPClass_Contents. Each set of things is in a definition list. Since you only care about the definition texts, you can search directly for the <dd> elements.
$dom->at('#PPClass_Contents')->find('dd')
This will give you a Mojo::Collection, which you can iterate with ->each. We pass that an anonymous function, pretty much like a callback.
$dom->at('#PPClass_Contents')->find('dd')->each(sub {
$_; # this is the current element
});
Each element will be passed to that sub, and can be referenced using the topic variable $_. There is a <ul> inside, and each <li> contains a <span> element with the text you want. So let's find those.
$_->find('span')
We can directly build the column in your output at this stage. Let's use the other form of ->each, which turns the Mojo::Collection returned from ->find into a normal Perl list. We can then use a regular map operation to grab each <span>'s text node and join that into a string.
join '|', map { $_->text } $_->find('span')->each
To tie all that together, we declare an array outside this construct, and stick the $celex number in it as the first column.
my @columns = ($celex);
$dom->at('#PPClass_Contents')->find('dd')->each(sub {
push @columns, join '|', map { $_->text } $_->find('span')->each;
});
Producing the final tab-separated output is now trivial.
print join "\t", @columns;
I've done this with EN as the language and the $celex number 32006L0121, which the search used in its example tooltip. The result is this:
32006L0121 marketing standard|chemical product|approximation of laws|dangerous substance|scientific report|packaging|European Chemicals Agency|labelling Internal market - Principles|Approximation of laws|Technical barriers|Environment|Consumer protection Industrial policy and internal market|Internal market: approximation of laws|Dangerous substances
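
To try the same steps without fetching a page, here is a minimal self-contained sketch. The markup is a trimmed, hypothetical stand-in for the Eur-Lex HTML from the question (the href="#" values are placeholders), and the Celex number is one of the samples given there:

```perl
use strict;
use warnings;
use Mojo::DOM;

# Trimmed, hypothetical stand-in for the Eur-Lex markup in the question.
my $html = <<'HTML';
<div id="PPClass_Contents">
  <dl class="NMetadata">
    <dt>EUROVOC descriptor: </dt>
    <dd><ul>
      <li><a href="#"><span lang="en">descriptor_1</span></a></li>
      <li><a href="#"><span lang="en">descriptor_2</span></a></li>
    </ul></dd>
    <dt>Subject matter: </dt>
    <dd><ul>
      <li><a href="#"><span lang="en">Subject_1</span></a></li>
    </ul></dd>
  </dl>
</div>
HTML

my $celex = 'E2014C0303';    # sample Celex number from the question
my $dom   = Mojo::DOM->new($html);

# One output column per <dd>; the <span> texts inside it are joined with pipes.
my @columns = ($celex);
for my $dd ($dom->at('#PPClass_Contents')->find('dd')->each) {
    push @columns, join '|', map { $_->text } $dd->find('span')->each;
}

my $line = join "\t", @columns;
print "$line\n";
```

On a real page, at('#PPClass_Contents') returns undef when the div is missing, so it is worth checking the result before calling find on it.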

How to find sibling element with behat/mink?

HTML:
<div id="my-id">
<li class="list_element">
<div class="my_class"></div>
</li>
<li class="list_element">
<div class="another_class"></div>
</li>
<li class="list_element">
<div class="class3"></div>
</li>
</div>
What I want to do with behat/mink:
$page = $this->getSession()->getPage();
$selector = $page->find('css', "#my-id .my_class"); //here I need anchor element located near to .my_class div.
I don't know which .list_element contains the .my_class div. I only know that the anchor is next to the .my_class element. Which selector should I use in the find() function?
Try one of these:
#my-id .my_class ~ a
#my-id .my_class + a
#my-id .list_element a
This is a very basic question. See the CSS selector reference on w3schools for more.
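
These selectors are plain CSS, so the difference between the sibling combinators can be checked outside behat. A minimal sketch using Perl's Mojo::DOM (picked only because it appears elsewhere on this page). The markup is hypothetical, since the question's HTML does not show where the anchors sit:

```perl
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical markup: anchors placed after the divs inside each list item.
my $dom = Mojo::DOM->new(<<'HTML');
<ul id="my-id">
  <li class="list_element">
    <div class="my_class"></div><a href="/first">first</a><a href="/second">second</a>
  </li>
  <li class="list_element">
    <div class="another_class"></div><a href="/other">other</a>
  </li>
</ul>
HTML

# Return the href values of all elements matching a selector, comma-separated.
sub hrefs {
    my ($selector) = @_;
    return join ',', map { $_->attr('href') } $dom->find($selector)->each;
}

my $adjacent = hrefs('#my-id .my_class + a');      # only the anchor right after the div
my $general  = hrefs('#my-id .my_class ~ a');      # every later sibling anchor
my $any      = hrefs('#my-id .list_element a');    # every anchor in any list item

print "$adjacent\n$general\n$any\n";
```

This prints /first, then /first,/second, then /first,/second,/other: the + combinator matches only the immediately following sibling, while ~ matches all following siblings.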

Sorting a structured text file [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 7 years ago.
I'm migrating from LaTeX to PrinceXML. One of the things I need to do is to convert the bibliography. I've converted my .bib file to HTML. However, since LaTeX took care of sorting the entries for me, I haven't taken care to put them into the correct order - but in the HTML the order of declaration does matter.
So my problem is: using Linux command line tools (e.g. Perl is acceptable, but Javascript is not), how can I sort a source file like this:
<div id="references">
<h2>References</h2>
<ul>
<li id="reference-to-book-1">
<span class="ref-author">Sample, Peter</span>
<cite>Online Book 1</cite>
<span class="ref-year">2011</span>
</li>
<li id="reference-to-book-2">
<cite>Physical Book 2</cite>
<span class="ref-year">2012</span>
<span class="ref-author">Example, Sandy</span>
</li>
</ul>
</div><!-- references -->
to look like this:
<div id="references">
<h2>References</h2>
<ul>
<li id="reference-to-book-2">
<span class="ref-author">Example, Sandy</span>
<cite>Physical Book 2</cite>
<span class="ref-year">2012</span>
</li>
<li id="reference-to-book-1">
<span class="ref-author">Sample, Peter</span>
<cite>Online Book 1</cite>
<span class="ref-year">2011</span>
</li>
</ul>
</div><!-- references -->
The criteria being:
1. The <li> elements containing the entries are sorted alphabetically according to author (i.e. everything from one <li id=" to its corresponding </li> is to be moved as a single block).
2. Within each entry, the elements are in the following order:
- line matches class="ref-author"
- line matches <cite>
- line matches class="ref-year"
There are more elements (e.g. class="publisher") that I omitted from the example for clarity; also, I run across this sorting problem very often. So it would be helpful if the expressions to match could be specified freely (e.g. as an array declaration in the script).
3. The remainder of the file (outside /id="references"/,/-- references --/) is unchanged.
4. The result file should have each line unchanged except for its position in the file (this point added because the XML parsers I tried broke my indentation).
I got 1, 3 and 4 solved using sed and sort, but can't get 2 to work that way.
I'd use Mojo for this. You might need to tidy up the XML afterwards.
use Mojo::Base -strict;
use Mojo::DOM;
use Mojo::Util 'slurp';
my $xml = slurp $ARGV[0] or die "I need a file";
my $dom = Mojo::DOM->new($xml);
my $list = $dom->at('#references ul');
my $refs = $dom->find('li');
$refs->each('remove');
$refs = $refs->sort( sub { $a->at('.ref-author')->text cmp $b->at('.ref-author')->text } );
for my $ref ( @{ $refs } ){
my $new = Mojo::DOM->new('<li></li>')->at('li');
$new->append_content($ref->at('.ref-author'));
$new->append_content($ref->at('cite'));
#KEEP APPENDING IN THE ORDER YOU WANT THEM
$list->append_content($new);
}
say $dom;
I suggest you use the XML::LibXML module and parse your data as HTML. Then you can manipulate the DOM as you wish and print the modified structure back out
Here's an example of how it might work
use strict;
use warnings;
use XML::LibXML;
my $dom = XML::LibXML->load_html(IO => \*DATA);
my ($refs) = $dom->findnodes('/html/body//div[@id="references"]/ul');
my @refs = $refs->findnodes('li');
$refs->removeChild($_) for @refs;
$refs->appendChild($_) for sort {
my ($aa, $bb) = map { $_->findvalue('span[@class="ref-author"]') } $a, $b;
$aa cmp $bb;
} @refs;
print $dom, "\n";
__DATA__
<html>
<head>
<title>Title</title>
</head>
<body>
<div id="references">
<h2>References</h2>
<ul>
<li id="reference-to-book-1">
<span class="ref-author">Sample, Peter</span>
<cite>Online Book 1</cite>
<span class="ref-year">2011</span>
</li>
<li id="reference-to-book-2">
<cite>Physical Book 2</cite>
<span class="ref-year">2012</span>
<span class="ref-author">Example, Sandy</span>
</li>
</ul>
</div><!-- references -->
</body>
</html>
output
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><head><title>Title</title></head><body>
<div id="references">
<h2>References</h2>
<ul>
<li id="reference-to-book-2">
<cite>Physical Book 2</cite>
<span class="ref-year">2012</span>
<span class="ref-author">Example, Sandy</span>
</li><li id="reference-to-book-1">
<span class="ref-author">Sample, Peter</span>
<cite>Online Book 1</cite>
<span class="ref-year">2011</span>
</li></ul></div><!-- references -->
</body></html>
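
Both answers re-serialise the document through a parser, which changes whitespace. If every line has to stay unchanged apart from its position, a line-based approach is an alternative. A minimal sketch in core Perl, assuming (as in the sample) that each tag of an entry sits on its own line and that entries open with <li id=" and close with </li>. The sub names and the @order array are illustrative, and entries are compared on the raw author line:

```perl
use strict;
use warnings;

# Patterns in the order the lines of each entry should appear (extend freely).
my @order = (qr/class="ref-author"/, qr/<cite>/, qr/class="ref-year"/);

# Rank of a line: index of the first matching pattern; unmatched lines go last.
sub rank {
    my ($line) = @_;
    for my $i (0 .. $#order) {
        return $i if $line =~ $order[$i];
    }
    return scalar @order;
}

# Return the collected entries sorted by their key, and empty the list.
sub flush_entries {
    my ($entries) = @_;
    my @sorted = sort { $a->{key} cmp $b->{key} } @$entries;
    @$entries = ();
    return map { ($_->{open}, @{ $_->{body} }, $_->{close}) } @sorted;
}

# Reorder lines inside the references section; all other lines pass through.
sub sort_references {
    my @lines = @_;
    my (@out, @entries, $entry);
    my $in_refs = 0;
    for my $line (@lines) {
        $in_refs = 1 if $line =~ /id="references"/;
        if ($in_refs && !$entry && $line =~ /<li id="/) {
            $entry = { open => $line, body => [], key => '' };
        }
        elsif ($entry && $line =~ m{</li>}) {
            my @body = @{ $entry->{body} };
            # Sort by rank, falling back to the original position for ties.
            my @idx = sort { rank($body[$a]) <=> rank($body[$b]) || $a <=> $b } 0 .. $#body;
            $entry->{body}  = [ @body[@idx] ];
            $entry->{close} = $line;
            # Sort key: the author line without its indentation.
            my ($author) = grep { $_ =~ $order[0] } @body;
            ($entry->{key} = $author // '') =~ s/^\s+//;
            push @entries, $entry;
            undef $entry;
        }
        elsif ($entry) {
            push @{ $entry->{body} }, $line;
        }
        else {
            # A line outside any entry: emit pending entries, then the line.
            push @out, flush_entries(\@entries), $line;
        }
        $in_refs = 0 if $line =~ /-- references --/;
    }
    return (@out, flush_entries(\@entries));
}

# Sample from the question, split into lines with their newlines kept.
my @sample = split /^/, <<'HTML';
<div id="references">
  <h2>References</h2>
  <ul>
    <li id="reference-to-book-1">
      <span class="ref-author">Sample, Peter</span>
      <cite>Online Book 1</cite>
      <span class="ref-year">2011</span>
    </li>
    <li id="reference-to-book-2">
      <cite>Physical Book 2</cite>
      <span class="ref-year">2012</span>
      <span class="ref-author">Example, Sandy</span>
    </li>
  </ul>
</div><!-- references -->
HTML

my @result = sort_references(@sample);
print @result;
```

To use it as a filter, replace @sample with the lines read from a file or from STDIN. Adding another pattern such as qr/class="publisher"/ to @order places those lines accordingly.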

Parse html nested list to perl array

Input data: (some nested list with links )
<ul>
<li><a>1</a>
<ul>
<li><a>11</a>
<ul>
<li><a>111</a></li>
<li><a>112</a></li>
<li><a>113</a>
<ul>
<li><a>1131</a></li>
<li><a>1132</a></li>
<li><a>1133</a></li>
</ul></li>
<li><a>114</a></li>
<li><a>115</a></li>
</ul>
</li>
<li><a>12</a>
<ul>
<li><a>121</a>
<ul>
<li><a>1211</a></li>
<li><a>1212</a></li>
<li><a>1213</a></li>
</ul></li>
<li><a>122</a></li>
</ul>
</li>
</ul>
</li>
</ul>
Output array of strings:
1,11,111
1,11,112
1,11,113,1131
1,11,113,1132
1,11,113,1133
1,11,114
1,11,115
1,12,121,1211
1,12,121,1212
1,12,121,1213
1,12,122
Each output line is the full path of link texts leading to an element that has no children.
What I tried:
1. XML::SAX::ParserFactory
https://gist.github.com/7266638 There are a lot of problems here: how to detect whether an <li> is the last one, how to save the path, and so on. I think this is a bad approach.
2. A regexp is definitely out, because the real-life HTML is much worse: lots of tags, divs, spans, etc.
3. DOM? But how?
You can try the XML::Twig module. The script below saves the text of each <a> element and prints the saved texts only when a closing <li> has no child <ul>.
#!/usr/bin/env perl
use warnings;
use strict;
use XML::Twig;
my (@li);
my $twig = XML::Twig->new(
twig_handlers => {
'a' => sub {
if ( $_->prev_elt('li') ) {
push @li, $_->text;
}
},
'li' => sub {
unless ( $_->children('ul') ) {
printf qq|%s\n|, join q|,|, @li;
}
pop @li;
},
},
)->parsefile( shift );
Run it like:
perl script.pl xmlfile
That yields:
1,11,111
1,11,112
1,11,113,1131
1,11,113,1132
1,11,113,1133
1,11,114
1,11,115
1,12,121,1211
1,12,121,1212
1,12,121,1213
1,12,122
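
For the "DOM? But how?" part of the question, a DOM-based variant is also possible. A minimal sketch using Mojo::DOM (not part of the answer above) on a shortened version of the list: every <li> without a nested <ul> is a leaf, and its path is the link text of each ancestor <li> followed by its own.

```perl
use strict;
use warnings;
use Mojo::DOM;

# Shortened version of the nested list from the question.
my $dom = Mojo::DOM->new(<<'HTML');
<ul>
  <li><a>1</a>
    <ul>
      <li><a>11</a>
        <ul>
          <li><a>111</a></li>
          <li><a>112</a></li>
        </ul>
      </li>
      <li><a>12</a></li>
    </ul>
  </li>
</ul>
HTML

my @paths;
for my $li ($dom->find('li')->each) {
    next if $li->at('ul');    # skip items that contain a nested list

    # ancestors() yields the nearest ancestor first, so reverse for root-first order
    my @chain = (reverse($li->ancestors('li')->each), $li);

    # at('a') picks the first <a> in document order, which is the item's own link
    push @paths, join ',', map { $_->at('a')->text } @chain;
}
print "$_\n" for @paths;
```

For this input it prints 1,11,111 then 1,11,112 then 1,12. It assumes every <li> starts with its own <a>, as in the sample.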

How to use "this" and not "this" selectors in jQuery

I have 4 divs with content like below:
<div class="prodNav-Info-Panel">content</div>
<div class="prodNav-Usage-Panel">content</div>
<div class="prodNav-Guarantee-Panel">content</div>
<div class="prodNav-FAQ-Panel">content</div>
And a navigation list like this:
<div id="nav">
<ul id="navigation">
<li><a class="prodNav-Info" ></a></li>
<li><a class="prodNav-Usage" ></a></li>
<li><a class="prodNav-Guarantee"></a></li>
<li><a class="prodNav-FAQ" ></a></li>
</ul>
</div>
When the page is first displayed I show all the content by executing this:
$('div.prodNav-Usage-Panel').fadeIn('slow');
$('div.prodNav-Guarantee-Panel').fadeIn('slow');
$('div.prodNav-FAQ-Panel').fadeIn('slow');
$('div.prodNav-Info-Panel').fadeIn('slow');
Now, when you click the navigation list item it reveals the clicked content and hides the others, like this:
$('.prodNav-Info').click( function() {
$('div.prodNav-Info-Panel').fadeIn('slow');
$('div.prodNav-Usage-Panel').fadeOut('slow');
$('div.prodNav-Guarantee-Panel').fadeOut('slow');
$('div.prodNav-FAQ-Panel').fadeOut('slow');
});
So what I have is 4 separate functions because I do not know which content is currently displayed. I know this is inefficient and can be done with a couple of lines of code. It seems like there is a way of saying: when this is clicked, hide the rest.
Can I do this with something like $(this) and $(not this)?
Thanks,
Erik
In your particular case you may be able to use the .siblings() method, something like this:
$(this).fadeIn().siblings().fadeOut()
Otherwise, let's say that you have a set of elements stored somewhere that points to all of your elements:
// contains 5 elements:
var $hiders = $(".prodNavPanel");
// somewhere later:
$hiders.not("#someElement").fadeOut();
$("#someElement").fadeIn();
Also, I would suggest changing the classes for your <div> and <a> to something more like:
<div class="prodNavPanel" id="panel-Info">content</div>
....
<a class="prodNavLink" href="#panel-Info">info</a>
This gives you a few advantages over your HTML. First: the links will have useful hrefs. Second: You can easily select all your <div>/<a> tags. Then you can do this with jQuery:
$(function() {
var $panels = $(".prodNavPanel");
$(".prodNavLink").click(function() {
var m = this.href.match(/(#panel.*)$/);
if (m) {
var panelId = m[1];
$panels.not(panelId).fadeOut();
$(panelId).fadeIn();
return false; // prevents browser from "moving" the page
}
});
});