Sorting a structured text file [closed] - perl

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 7 years ago.
I'm migrating from LaTeX to PrinceXML. One of the things I need to do is to convert the bibliography. I've converted my .bib file to HTML. However, since LaTeX took care of sorting the entries for me, I haven't taken care to put them into the correct order - but in the HTML the order of declaration does matter.
So my problem is: using Linux command line tools (e.g. Perl is acceptable, but Javascript is not), how can I sort a source file like this:
<div id="references">
<h2>References</h2>
<ul>
<li id="reference-to-book-1">
<span class="ref-author">Sample, Peter</span>
<cite>Online Book 1</cite>
<span class="ref-year">2011</span>
</li>
<li id="reference-to-book-2">
<cite>Physical Book 2</cite>
<span class="ref-year">2012</span>
<span class="ref-author">Example, Sandy</span>
</li>
</ul>
</div><!-- references -->
to look like this:
<div id="references">
<h2>References</h2>
<ul>
<li id="reference-to-book-2">
<span class="ref-author">Example, Sandy</span>
<cite>Physical Book 2</cite>
<span class="ref-year">2012</span>
</li>
<li id="reference-to-book-1">
<span class="ref-author">Sample, Peter</span>
<cite>Online Book 1</cite>
<span class="ref-year">2011</span>
</li>
</ul>
</div><!-- references -->
The criteria being:
1. The <li> elements containing the entries are sorted alphabetically according to author (i.e. everything from one <li id=" to its corresponding </li> is to be moved as a single block).
2. Within each entry, the elements are in the following order:
   - line matches class="ref-author"
   - line matches <cite>
   - line matches class="ref-year"
   There are more elements (e.g. class="publisher") that I omitted from the example for clarity; also, I run across this sorting problem very often. So it would be helpful if the expressions to match could be specified freely (e.g. as an array declaration in the script).
3. The remainder of the file (outside /id="references"/,/-- references --/) is unchanged.
4. The result file should have each line unchanged except for its position in the file (this point added because the XML parsers I tried broke my indentation).
I got 1, 3 and 4 solved using sed and sort, but can't get 2 to work that way.

I'd use Mojo for this. You might need to tidy up the XML afterwards.
use Mojo::Base -strict;
use Mojo::DOM;
use Mojo::Util 'slurp'; # newer Mojolicious releases moved this to Mojo::File: path($file)->slurp
my $xml = slurp $ARGV[0] or die "I need a file";
my $dom = Mojo::DOM->new($xml);
my $list = $dom->at('#references ul');
my $refs = $dom->find('li');
$refs->each('remove');
$refs = $refs->sort( sub { $a->at('.ref-author')->text cmp $b->at('.ref-author')->text } );
for my $ref ( @{ $refs } ){
my $new = Mojo::DOM->new('<li></li>')->at('li');
$new->append_content($ref->at('.ref-author'));
$new->append_content($ref->at('cite'));
#KEEP APPENDING IN THE ORDER YOU WANT THEM
$list->append_content($new);
}
say $dom;

I suggest you use the XML::LibXML module and parse your data as HTML. Then you can manipulate the DOM as you wish and print the modified structure back out.
Here's an example of how it might work
use strict;
use warnings;
use XML::LibXML;
my $dom = XML::LibXML->load_html(IO => \*DATA);
my ($refs) = $dom->findnodes('/html/body//div[@id="references"]/ul');
my @refs = $refs->findnodes('li');
$refs->removeChild($_) for @refs;
$refs->appendChild($_) for sort {
my ($aa, $bb) = map { $_->findvalue('span[@class="ref-author"]') } $a, $b;
$aa cmp $bb;
} @refs;
print $dom, "\n";
__DATA__
<html>
<head>
<title>Title</title>
</head>
<body>
<div id="references">
<h2>References</h2>
<ul>
<li id="reference-to-book-1">
<span class="ref-author">Sample, Peter</span>
<cite>Online Book 1</cite>
<span class="ref-year">2011</span>
</li>
<li id="reference-to-book-2">
<cite>Physical Book 2</cite>
<span class="ref-year">2012</span>
<span class="ref-author">Example, Sandy</span>
</li>
</ul>
</div><!-- references -->
</body>
</html>
output
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><head><title>Title</title></head><body>
<div id="references">
<h2>References</h2>
<ul>
<li id="reference-to-book-2">
<cite>Physical Book 2</cite>
<span class="ref-year">2012</span>
<span class="ref-author">Example, Sandy</span>
</li><li id="reference-to-book-1">
<span class="ref-author">Sample, Peter</span>
<cite>Online Book 1</cite>
<span class="ref-year">2011</span>
</li></ul></div><!-- references -->
</body></html>
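Since the question also asks for a fixed line order inside each entry, with every line left untouched apart from its position, a line-based approach without an HTML parser may be worth considering. The following is only a sketch using core Perl. It assumes one element per line, that the <li id="..."> entries follow each other without other lines in between, and that the block is delimited by the id="references" and -- references -- markers from the question. The names @order, $sort_key, rank, entry_lines and sort_references are made up for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Line patterns in the order they should appear inside each <li> entry.
# Extend freely, e.g. append qr/class="publisher"/.
my @order = (qr/class="ref-author"/, qr/<cite>/, qr/class="ref-year"/);

# Captures the text the entries are sorted by (here: the author name).
my $sort_key = qr/class="ref-author">([^<]*)</;

# Position of a line within an entry: index of the first matching pattern;
# lines matching no pattern go after all listed ones.
sub rank {
    my ($line) = @_;
    for my $i (0 .. $#order) {
        return $i if $line =~ $order[$i];
    }
    return scalar @order;
}

# Returns one entry's lines: opening <li>, reordered body, closing </li>.
sub entry_lines {
    my ($entry) = @_;
    my @body = @{ $entry->{body} };
    # The index tie-breaker keeps the original order for equal ranks.
    my @idx = sort { rank($body[$a]) <=> rank($body[$b]) or $a <=> $b } 0 .. $#body;
    return ($entry->{open}, @body[@idx], $entry->{close});
}

# Takes all lines of the file and returns them with the references sorted.
sub sort_references {
    my @lines = @_;
    my (@out, @entries, $entry);
    my $in_refs = 0;
    # Emits the collected entries sorted by key (assumption: contiguous entries).
    my $flush = sub {
        push @out, map { entry_lines($_) } sort { $a->{key} cmp $b->{key} } @entries;
        @entries = ();
    };
    for my $line (@lines) {
        $in_refs = 1 if $line =~ /id="references"/;
        if ($in_refs && !$entry && $line =~ /<li id="/) {
            $entry = { open => $line, body => [], key => '' };
        }
        elsif ($entry && $line =~ m{</li>}) {
            $entry->{close} = $line;
            push @entries, $entry;
            undef $entry;
        }
        elsif ($entry) {
            $entry->{key} = $1 if $line =~ $sort_key;
            push @{ $entry->{body} }, $line;
        }
        else {
            $flush->();          # first line after a run of entries
            push @out, $line;    # everything else passes through unchanged
        }
        $in_refs = 0 if $line =~ /-- references --/;
    }
    $flush->();
    return @out;
}

# Demo with the sample from the question.
my @demo = map { "$_\n" } split /\n/, <<'HTML';
<div id="references">
<h2>References</h2>
<ul>
<li id="reference-to-book-1">
<span class="ref-author">Sample, Peter</span>
<cite>Online Book 1</cite>
<span class="ref-year">2011</span>
</li>
<li id="reference-to-book-2">
<cite>Physical Book 2</cite>
<span class="ref-year">2012</span>
<span class="ref-author">Example, Sandy</span>
</li>
</ul>
</div><!-- references -->
HTML
print sort_references(@demo);
```

For a real file, something like print sort_references(<>) in place of the demo would read the file named on the command line. Because the lines are only moved and never re-serialised, the indentation stays as it was.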

Related

Targeting individual elements in HTML using Perl and Mojo::DOM in well-formatted HTML

I'm a relative beginner with Perl, and this is my first question here. I'm trying the following:
I am trying to retrieve certain information from a large online dataset (Eur-Lex), where each HTML document is well-formed HTML, with constant elements. Each HTML file is identified by its Celex number, which is supplied as the argument to the script (see my Perl code below).
The HTML data looks like this (showing only the part I'm interested in):
<!--
<blahblah>
< lots of stuff here, before the interesting part>
-->
<div id="PPClass_Contents" class="panel-collapse collapse in" role="tabpanel"
aria-labelledby="PP_Class">
<div class="panel-body">
<dl class="NMetadata">
<dt xmlns="http://www.w3.org/1999/xhtml">EUROVOC descriptor: </dt>
<dd xmlns="http://www.w3.org/1999/xhtml">
<ul>
<li>
<a href="./../../../search.html?type=advanced&DTS_DOM=ALL&DTS_SUBDOM=ALL_ALL&SUBDOM_INIT=ALL_ALL&DC_CODED=341&lang=en">
<span lang="en">descriptor_1</span>
</a>
</li>
<li>
<a href="./../../../search.html?type=advanced&DTS_DOM=ALL&DTS_SUBDOM=ALL_ALL&SUBDOM_INIT=ALL_ALL&DC_CODED=5158&lang=en">
<span lang="en">descriptor_2</span>
</a>
</li>
<li>
<a href="./../../../search.html?type=advanced&DTS_DOM=ALL&DTS_SUBDOM=ALL_ALL&SUBDOM_INIT=ALL_ALL&DC_CODED=7983&lang=en">
<span lang="en">descriptor_3</span>
</a>
</li>
<li>
<a href="./../../../search.html?type=advanced&DTS_DOM=ALL&DTS_SUBDOM=ALL_ALL&SUBDOM_INIT=ALL_ALL&DC_CODED=933&lang=en">
<span lang="en">descriptor_4</span>
</a>
</li>
</ul>
</dd>
<dt xmlns="http://www.w3.org/1999/xhtml">Subject matter: </dt>
<dd xmlns="http://www.w3.org/1999/xhtml">
<ul>
<li>
<a href="./../../../search.html?type=advanced&DTS_DOM=ALL&DTS_SUBDOM=ALL_ALL&SUBDOM_INIT=ALL_ALL&CT_CODED=BUDG&lang=en">
<span lang="en">Subject_1</span>
</a>
</li>
</ul>
</dd>
<dt xmlns="http://www.w3.org/1999/xhtml">Directory code: </dt>
<dd xmlns="http://www.w3.org/1999/xhtml">
<ul>
<li>01.60.20.00 <a href="./../../../search.html?type=advanced&DTS_DOM=ALL&DTS_SUBDOM=ALL_ALL&SUBDOM_INIT=ALL_ALL&CC_1_CODED=01&lang=en">
<span lang="en">Designation_level_1</span>
</a> / <a href="./../../../search.html?type=advanced&DTS_DOM=ALL&DTS_SUBDOM=ALL_ALL&SUBDOM_INIT=ALL_ALL&CC_2_CODED=0160&lang=en">
<span lang="en">Designation_level_2</span>
</a> / <a href="./../../../search.html?type=advanced&DTS_DOM=ALL&DTS_SUBDOM=ALL_ALL&SUBDOM_INIT=ALL_ALL&CC_3_CODED=016020&lang=en">
<span lang="en">Designation_level_3</span>
</a>
</li>
</ul>
</dd>
</dl>
</div>
</div>
</div>
<!--
<still more stuff here>
-->
I am interested in the info contained in "PPClass_Contents" div id, which consists of 3 elements:
- EUROVOC descriptor:
- Subject matter:
- Directory code:
Based on the above HTML, I would like to get the children of those 3 main elements, using Perl and Mojo, with a result similar to this: a single-line text file with 3 groups separated by tabs, where multiple child elements within a group are separated by pipe characters, something like this:
CELEX_No "TAB" descriptor_1|descriptor_2|descriptor_3|descriptor_4|..|descriptor_n "TAB" Subject_1|..|Subject_n "TAB" Designation_level_1|Designation_level_2|Designation_level_3|..|Designation_level_n
"descriptors", "Subjects" and "Designation_levels" elements (children of those 3 main groups) can be from 1 to "n", the number is not fixed, and is not known in advance.
I have the following code, which does print out the plain text of the interesting part, but I need to address the individual elements and print them out in a new file as described above:
#!/usr/bin/perl
# returns "Classification" descriptors for given CELEX and Language
use strict;
use warnings;
use Mojo::UserAgent;
if (@ARGV != 2) {
print "Wrong number of arguments!\n";
print "Syntax: clookup.pl Lang_ID celex_No.\n";
exit -1;
}
my $lang = $ARGV[0];
my $celex = $ARGV[1];
my $lclang = lc $lang;
# fetch the eurlex page
my $ua = Mojo::UserAgent->new;
my $dom = $ua->get("https://eur-lex.europa.eu/legal-content/$lang/ALL/?uri=CELEX:$celex")->res->dom;
################ let's extract interesting parts:
my $text = $dom->at('#PPClass_Contents')->all_text;
print "$text\n";
EDIT (added):
You can try my Perl script using two arguments:
lang_code ("DE","EN","IT", etc.)
Celex number (e.g.: E2014C0303, 52015BP2212, 52015BP0930(48), 52015BP0930(36), 52015BP0930(41), E2014C0302, E2014C0301, E2014C0271, E2014C0134).
For example (if you name my script "clookup.pl"):
$ perl clookup.pl EN E2014C0303
So, how can I address individual elements (of unknown number) as described above, using Mojo::DOM?
Or, is there something simpler or faster (using Perl)?
You are on the right track. First, you need to understand the HTML inside your #PPClass_Contents. Each set of things is in a definition list. Since you only care about the definition texts, you can search directly for the <dd> elements.
$dom->at('#PPClass_Contents')->find('dd')
This will give you a Mojo::Collection, which you can iterate with ->each. We pass that an anonymous function, pretty much like a callback.
$dom->at('#PPClass_Contents')->find('dd')->each(sub {
$_; # this is the current element
});
Each element will be passed to that sub, and can be referenced using the topic variable $_. There is a <ul> inside, and each <li> contains a <span> element with the text you want. So let's find those.
$_->find('span')
We can directly build the column in your output at this stage. Let's use the other form of ->each, which turns the Mojo::Collection returned from ->find into a normal Perl list. We can then use a regular map operation to grab each <span>'s text node and join that into a string.
join '|', map { $_->text } $_->find('span')->each
To tie all that together, we declare an array outside this construct, and stick the $celex number in it as the first column.
my @columns = ($celex);
$dom->at('#PPClass_Contents')->find('dd')->each(sub {
push @columns, join '|', map { $_->text } $_->find('span')->each;
});
Producing the final tab-separated output is now trivial.
print join "\t", @columns;
I've done this with EN as the language and the $celex number 32006L0121, which the search used in its example tooltip. The result is this:
32006L0121 marketing standard|chemical product|approximation of laws|dangerous substance|scientific report|packaging|European Chemicals Agency|labelling Internal market - Principles|Approximation of laws|Technical barriers|Environment|Consumer protection Industrial policy and internal market|Internal market: approximation of laws|Dangerous substances
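To try the steps above without fetching anything from Eur-Lex, they can be put together into one small script that works on an inline HTML snippet. This is a sketch only: the snippet is a trimmed-down, made-up stand-in for the real #PPClass_Contents block, and classification_row is a hypothetical helper name, not something Mojo::DOM provides.

```perl
use strict;
use warnings;
use Mojo::DOM;

# Trimmed-down stand-in for the real page (hypothetical data).
my $html = <<'HTML';
<div id="PPClass_Contents">
  <dl>
    <dt>EUROVOC descriptor: </dt>
    <dd><ul>
      <li><a href="#"><span lang="en">descriptor_1</span></a></li>
      <li><a href="#"><span lang="en">descriptor_2</span></a></li>
    </ul></dd>
    <dt>Subject matter: </dt>
    <dd><ul>
      <li><a href="#"><span lang="en">Subject_1</span></a></li>
    </ul></dd>
  </dl>
</div>
HTML

# Builds one tab-separated row: the Celex number, then one column per <dd>,
# with the <span> texts inside each <dd> joined by pipe characters.
sub classification_row {
    my ($celex, $markup) = @_;
    my $dom   = Mojo::DOM->new($markup);
    my $block = $dom->at('#PPClass_Contents') or return;   # block missing
    my @columns = ($celex);
    $block->find('dd')->each(sub {
        push @columns, join '|', map { $_->text } $_->find('span')->each;
    });
    return join "\t", @columns;
}

print classification_row('E2014C0303', $html), "\n";
```

The early return covers pages where the block is absent, in which case at returns undef and calling find on it would die.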

Unable to Extract Link with Mojolicious

I am trying to extract the link for the next page in a search results page using Mojo::DOM. However, I have a problem where instead of Mojo::DOM elements, I get a string after using ->find() on an existing element.
I have:
my $pagination_elements = $dom->find("div[class*=\"pagination-block\"]");
my $page_counter_text = $pagination_elements->find("div[class=\"page-of-pages\"]")->text();
$page_counter_text =~ /^Page (\d+) of (\d+)$/;
my $current_page = int($1);
my $last_page = int($2);
my $prev_next_elements = $pagination_elements->find("a[class*=\"prev-next\"]");
my $next_page_link = $prev_next_elements->last();
my $next_page_url = $next_page_link->attr("href");
On each page, there may be 2 link tags with a class of prev-next. Instead of getting the link for the last element, what I get is a string that contains the href for both of the tags (if both are available on the page).
Now, if instead of this I do:
my $next_page_link = $dom->find("div[class*=\"pagination-block\"] > ul > li > a[class*=\"prev-next\"]")->last();
my $next_page_url_rel = $next_page_link->attr("href");
I get the required link.
My question is, why does the second version work and not the first? Why do I have to start from the root DOM element to get a list of elements, and why starting from a child of the root returns a string containing all the link tags instead of just the one I want?
Edit
An example of the HTML I am parsing is:
<div class="pagination-block clearfix">
<div class="page-of-pages">
Page 2 of 100
</div>
<ul class="pagination-links">
<li>
.
.
.
</li>
<li>
<a class="page-option prev-next" href="PREV LINK">Prev</a>
</li>
<li>
<a class="page-option prev-next" href="NEXT LINK">Next</a>
</li>
</ul>
</div>
It would have helped a lot if you could have shown an example of the HTML you are processing. Instead I have imagined this, which I hope is close.
<html>
<head>
<title>Title</title>
</head>
<body>
<div class="pagination-block">
<div class="page-of-pages">Page 99 of 100</div>
<ul>
<li>
<a class="prev-next" href="/page98">Prev</a>
</li>
<li>
<a class="prev-next" href="/page100">Next</a>
</li>
</ul>
</div>
<div class="pagination-block">
<div class="page-of-pages">Page 99 of 100</div>
<ul>
<li>
<a class="prev-next" href="/page98">Prev</a>
</li>
<li>
<a class="prev-next" href="/page100">Next</a>
</li>
</ul>
</div>
</body>
</html>
Now let's look at your code
my $pagination_elements = $dom->find('div[class*="pagination-block"]')
This gives you a Mojo::Collection containing the two instances of div that have a pagination-block class.
my $prev_next_elements = $pagination_elements->find('a[class*="prev-next"]')
This does something like a map, replacing each member of the Mojo::Collection with the results of doing a find on them. Since find returns another Mojo::Collection, you now have a collection of two collections, each with two Mojo::DOM objects. To clarify
$prev_next_elements is a Mojo::Collection object with a size of 2
Both $prev_next_elements->[0] and $prev_next_elements->[1] are Mojo::Collection objects, each with a size of 2
$prev_next_elements->[0][0], $prev_next_elements->[0][1], $prev_next_elements->[1][0], and $prev_next_elements->[1][1] are all Mojo::DOM objects, each containing an <a> element from the HTML document
my $next_page_link = $prev_next_elements->last
This takes the second element of $prev_next_elements. It is the same as $prev_next_elements->[1], and so is a Mojo::Collection object containing the two Mojo::DOM elements that hold the last two <a> elements in the HTML document.
my $next_page_url = $next_page_link->attr('href')
Now you are doing another map operation: applying attr to both elements of the collection, and returning another collection containing the two href strings /page98 and /page100. Stringifying this Mojo::Collection just concatenates all of its elements and gives you "/page98\n/page100".
To fix all this, take the last of the $pagination_elements, giving you a Mojo::DOM object. Then do a find for the prev and next elements, giving you a Mojo::Collection of the "prev" and "next" <a> elements, and finally map those elements to links using attr('href'). You end up with a Mojo::Collection containing the href text of the "prev" and "next" links in the last pagination block.
my $pagination_elements = $dom->find('div[class*="pagination-block"]');
my $last_pagination_element = $pagination_elements->last;
my $prev_next_elements = $last_pagination_element->find('a[class*="prev-next"]');
my $prev_next_links = $prev_next_elements->attr('href');
my ($prev_page_link, $next_page_link) = ($prev_next_links->first, $prev_next_links->last);
say $prev_page_link;
say $next_page_link;
output
/page98
/page100
You can collapse all that to something more convenient, like this
my $pagination_elements = $dom->find('div[class*="pagination-block"]');
my $prev_next_links = $pagination_elements->last->find('a[class*="prev-next"]')->attr('href');
my ($prev_page_link, $next_page_link) = @$prev_next_links;
say $prev_page_link;
say $next_page_link;
If you used Data::Dump (or some equivalent module) instead of print, you would get a clue as to what's going on:
use Data::Dump;
dd $next_page_url;
dd $next_page_url_rel;
Outputs:
bless(["PREV LINK", "NEXT LINK"], "Mojo::Collection")
"NEXT LINK"
As you can see, your first variable actually holds a collection, and not a string.
The problem arises because Mojo::DOM->find returns a Mojo::Collection:
my $pagination_elements = $dom->find('div[class*="pagination-block"]');
Doing a subsequent find on a collection returns you a nested collection which is not going to perform the way you expect with calls like last.
Here are three different solutions to fix your first attempt to find the link text:
Use the Mojo::DOM->at method to find the first element in DOM structure matching the CSS selector.
my $pagination_elements = $dom->at('div[class*="pagination-block"]');
Use Mojo::Collection->first or ->last to isolate a specific element in the collection before the subsequent find.
my $pagination_elements
= $dom->find('div[class*="pagination-block"]')->last();
Use Mojo::Collection->flatten to flatten the nested collections created by your subsequent find into a new collection with all elements:
my $pagination_elements = $dom->find('div[class*="pagination-block"]');
my $prev_next_elements
= $pagination_elements->find('a[class*="prev-next"]')->flatten();
All of these methods will make your script work as you intended:
use strict;
use warnings;
use Mojo::DOM;
use Data::Dump;
my $dom = Mojo::DOM->new(do { local $/; <DATA> });
# Fix 1
my $pagination_elements = $dom->at('div[class*="pagination-block"]');
# Fix 2
#my $pagination_elements
# = $dom->find('div[class*="pagination-block"]')->last();
# Fix 3
#my $pagination_elements = $dom->find('div[class*="pagination-block"]');
#my $prev_next_elements
# = $pagination_elements->find('a[class*="prev-next"]')->flatten();
my $prev_next_elements = $pagination_elements->find('a[class*="prev-next"]');
my $next_page_link = $prev_next_elements->last();
my $next_page_url = $next_page_link->attr("href");
dd $next_page_url;
$next_page_link = $dom->find('div[class*="pagination-block"] > ul > li > a[class*="prev-next"]')->last();
my $next_page_url_rel = $next_page_link->attr("href");
dd $next_page_url_rel;
__DATA__
<html>
<head>
<title>Paging Example</title>
</head>
<body>
<div class="pagination-block clearfix">
<div class="page-of-pages">
Page 2 of 100
</div>
<ul class="pagination-links">
<li>
.
.
.
</li>
<li>
<a class="page-option prev-next" href="PREV LINK">Prev</a>
</li>
<li>
<a class="page-option prev-next" href="NEXT LINK">Next</a>
</li>
</ul>
</div>
</body>
</html>
Outputs:
"NEXT LINK"
"NEXT LINK"
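A note for anyone trying this on a current Mojolicious release: as far as I know, a Mojo::Collection there no longer forwards unknown methods such as find or attr to its members, so calling find directly on a collection fails instead of producing a nested collection. The same idea can be written out with map and flatten. The following is a sketch under that assumption, using class selectors in place of the attribute selectors from the question.

```perl
use strict;
use warnings;
use Mojo::DOM;

my $dom = Mojo::DOM->new(<<'HTML');
<div class="pagination-block clearfix">
  <div class="page-of-pages">Page 2 of 100</div>
  <ul class="pagination-links">
    <li><a class="page-option prev-next" href="PREV LINK">Prev</a></li>
    <li><a class="page-option prev-next" href="NEXT LINK">Next</a></li>
  </ul>
</div>
HTML

# map(find => ...) calls find on every matched <div>, which yields a
# collection of collections; flatten merges them into one flat collection.
my $links = $dom->find('div.pagination-block')
                ->map(find => 'a.prev-next')
                ->flatten;

# last now returns a single Mojo::DOM element, so attr returns a string.
my $next_page_url = $links->last->attr('href');
print "$next_page_url\n";
```

With several pagination blocks on the page this picks the last prev-next link of the last block, which matches what the single-selector version in the question does.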

Perl Find Web Link By DOM

I have been using Mechanize::Firefox in order to do some web scraping and I am currently in the process of going fully headless. I have been using this code:
my $link = [$firefox->find_link_dom(text_regex => qr/pdf/i)]->[0]->{href} . "\n";
which provides very good results in finding a .pdf file. Is there any way I can find link DOMs based on a text_regex using a different module that does not require the browser to be shown?
EDIT
With the Mechanize::Firefox I would get back a link out of code like:
<div id="rhc" xpathLocation="noDialog">
<div id="download" class="rhcBox_type1">
<div class="wrap">
<ul>
<li class="download icon"><strong>Download:</strong>
PDF |
Citation |
XML
</li>
<li class="print icon"><strong>Print article</strong></li>
<li class="reprint icon">EzReprint New & improved!</li>
</ul>
</div>
to download a pdf file. Any ideas why the HTML::Treebase will not get this link??
CSS3 a[href*="pdf"]: Web::Query, HTML::Query, Mojo::UserAgent
XPath //a[contains(@href, "pdf")]: HTML::TreeBuilder::XPath, XML::LibXML
use strict;
use warnings;
use HTML::TreeBuilder::XPath qw();
my $dom = HTML::TreeBuilder::XPath->new_from_content(q(...)); # ... your html code
my @nodes = $dom->findnodes('//a[contains(@href, "pdf")]');
foreach my $node (@nodes) {
my $href = $node->findvalue('@href');
...
}

Using Web::Scraper

I'm trying to parse some HTML tags using the Perl module Web::Scraper, but it seems I'm not very good with Perl. I wonder if anyone can spot the mistakes in my code:
This is my HTML to parse (2 urls inside li tags):
<more html above here>
<div class="span-48 last">
<div class="span-37">
<div id="zone-extract" class="123">
<h2 class="genres"></h2>
<li>1</li>
<li><a class="sel" href="**URL_TO_EXTRACT_2**">2</a></li>
<li class="first">Pàg</li>
</div>
</div>
</div>
<more stuff from here>
I'm trying to obtain:
ID:1 Link:URL_TO_EXTRACT_1
ID:2 Link:URL_TO_EXTRACT_2
With this perl code:
my $scraper = scraper {
process ".zone-extract > a[href]", urls => '@href', id => 'TEXT';
result 'urls';
};
my $links = $scraper->scrape($response);
This is one of the infinite process combinations I tried, with two different results: An empty return, or all the urls inside code (and I only need links inside zone-extract).
Resolved with mob's contribution... #zone-extract instead .zone-extract :)
#!/usr/bin/env perl
use strict;
use warnings;
use Web::Scraper;
my $html = q[
<div class="span-48 last">
<div class="span-37">
<div id="zone-extract" class="123">
<h2 class="genres"></h2>
<li>1</li>
<li><a class="sel" href="**URL_TO_EXTRACT_2**">2</a></li>
<li class="first">Pàg</li>
</div>
</div>
</div>
]; # / (turn off wrong syntax highlighting)
my $parser = scraper {
process '//div[@id="zone-extract"]//a', 'urls[]' => sub {
my $url = $_[0]->attr('href') ;
return $url;
};
};
my $ref = $parser->scrape(\$html);
print "$_\n" for @{ $ref->{urls} };

Is there an easier way to extract this data?

my $changelog = "/etc/webmin/Pserver_Panel/changelog.cgi";
my $Milestone;
open (PREFS, $changelog);
while (<PREFS>)
{
if ($_ =~ m/^<h1>(.*)[ ]Milestone.*$/g) {
$Milestone=$1;
last;
}
}
close(PREFS);
Here is an example of the data it's extracting from:
<h1>1.77 Milestone</h1>
<h3> 6/26/2009 </h3><ul style="margin-top:0px">
<li type=circle> Standard code house cleaning and added better compatbility for apache conversion.
</ul>
<h3> 6/21/2009 </h3><ul style="margin-top:0px">
<li type=square> Fixed Autofix so that it extracts to the right directory.
</ul>
<h3> 6/11/2009 </h3><ul style="margin-top:0px">
<li type=circle> Updated FTP link on index page to go to net2ftp, an online ftp file manager.
</ul>
<h1>1.76 Milestone</h1>
<h3> 4/14/2009 </h3><ul style="margin-top:0px">
<li type=square> Corrected a broken hyperlink to regular expressions in "View Chat Log"
<li type=circle> Changed the default number of lines back from 25 to 10 on both Chat and Pserver Logs.
<li type=circle> Noted in "View Pserver Log" search is case-sensitive and regular expression supported.
</ul>
<h3> 4/13/2009 </h3><ul style="margin-top:0px">
<li type=disc> Added AutoFix to the panel which will automatically fix prop errors.
<li type=circle> Updated error display to allow more detailed errors.
</ul>
<h3> 4/12/2009 </h3><ul style="margin-top:0px">
<li type=circle> Fixed start/stop/restart to be more reliable.
</ul>
Next, you are going to have to parse the items between milestones. Do yourself a favor, stop worrying about lines of code and use an HTML Parser, such as HTML::TokeParser:
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;
my $parser = HTML::TokeParser->new( \*DATA );
while ( my $token = $parser->get_token ) {
if ( $token->[0] eq 'S' ) {
if ( $token->[1] eq 'h1') {
my ($milestone) = split ' ', $parser->get_text('/h1');
print "Milestone is '$milestone'\n";
}
}
}
__DATA__
<h1>1.77 Milestone</h1>
...
C:\Temp> vbn
Milestone is '1.77'
Milestone is '1.76'
If you want a one-liner, how about this one?
perl -nle 'if (/^<h1>(.*)[ ]Milestone.*$/g){ print $1; last }' /etc/webmin/Pserver_Panel/changelog.cgi
Another one-liner to grab it from the command line:
perl -ne'print $1 and exit if /<h1>(.*?)\s+Milestone/' /etc/webmin/Pserver_Panel/changelog.cgi
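If the extraction has to stay inside a larger script, the loop from the question can also be wrapped in a small subroutine with a lexical filehandle and a checked three-argument open. This is a sketch only: first_milestone is a made-up name, and the demo reads from an in-memory string so that it runs without the changelog file.

```perl
use strict;
use warnings;

# Returns the version of the first "<h1>... Milestone" heading read from $fh,
# or undef if there is none.
sub first_milestone {
    my ($fh) = @_;
    while (my $line = <$fh>) {
        return $1 if $line =~ /<h1>(.*?)\s+Milestone/;
    }
    return;
}

# Demo on an in-memory filehandle. For the real file, open it like this:
#   open my $fh, '<', '/etc/webmin/Pserver_Panel/changelog.cgi'
#       or die "Cannot open changelog: $!";
my $sample = "<h1>1.77 Milestone</h1>\n<h3> 6/26/2009 </h3>\n<h1>1.76 Milestone</h1>\n";
open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
my $milestone = first_milestone($fh);
close $fh;
print "$milestone\n";
```

Returning as soon as the first heading matches does the same job as the last in the original loop, so the rest of the file is never read.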