Alphabetically sort special characters in pandoc-citeproc bibliography

I hope I'm not repeating a question, but I've searched high and low and haven't found any answers to something I'm sure other users of pandoc have encountered. (The references to this topic I have found are concerned with pandoc to LaTeX, whereas this is simple pandoc to .html / .rtf / etc. conversion.) It seems that the pandoc-citeproc processor does not know what to do with special characters and diacritics, either unusual ones (e.g. ʿ) or common ones (e.g. á), and places them all after "z" when creating its bibliographies. Since my typical bibliography bristles with special characters, it would be great to know if there's a workaround to manually modify the sorting order, or to neutralize the special characters -- in the MWE I have below, for example, to tell it to sort ʿAal as though it were "Aal" and Áberforth as though it were "Aberforth", so that they show up above Amber.
Here is my MWE:
---
csl: chicago-note-bibliography.csl
references:
- id: Amber2000
  type: book
  author:
  - family: Amber
    given: Rodrigo
  issued:
  - year: '2000'
  title: Book 1
- id: Aberf2000
  type: book
  author:
  - family: Áberforth
    given: Rodrigo
  issued:
  - year: '2000'
  title: Book 2
- id: Aal2000
  type: book
  author:
  - family: ʿAal
    given: Rodrigo
  issued:
  - year: '2000'
  title: Book 3
...
This is my MWE.[@Amber2000] I want to see how it handles diacritics[@Aberf2000] and special characters.[@Aal2000]
Running pandoc mwe.md -o test.html --filter=pandoc-citeproc, I get the following output:
<p>This is my MWE.<span class="citation"><sup>1</sup></span> I want to see how it handles diacritics<span class="citation"><sup>2</sup></span> and special characters.<span class="citation"><sup>3</sup></span></p>
<div id="refs" class="references">
<div id="ref-Amber2000">
<p>Amber, Rodrigo. <em>Book 1</em>, 2000.</p>
</div>
<div id="ref-Aberf2000">
<p>Áberforth, Rodrigo. <em>Book 2</em>, 2000.</p>
</div>
<div id="ref-Aal2000">
<p>ʿAal, Rodrigo. <em>Book 3</em>, 2000.</p>
</div>
</div>
<div class="footnotes">
<hr />
<ol>
<li id="fn1"><p>Amber, <em>Book 1</em>. ↩</p></li>
<li id="fn2"><p>Áberforth, <em>Book 2</em>. ↩</p></li>
<li id="fn3"><p>ʿAal, <em>Book 3</em>. ↩</p></li>
</ol>
</div>
As you can see, it sorts the bibliography Amber, Áberforth, ʿAal. Any ideas?
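To make the ordering I'm after concrete, here is a minimal sketch in Python of the kind of sort key I have in mind. It is not something pandoc-citeproc exposes as far as I know; the function name sort_key and the choice of which Unicode categories to keep are my own assumptions, and it only illustrates the expected result for the three names in the MWE.

```python
import unicodedata

def sort_key(name):
    """Hypothetical sort key: compare names with diacritics and special marks ignored."""
    # NFD splits an accented letter into its base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", name)
    kept = []
    for ch in decomposed:
        category = unicodedata.category(ch)
        # Keep plain letters, digits and spaces; drop combining marks,
        # modifier letters (category Lm, e.g. the half ring) and punctuation.
        if category[0] in "LNZ" and category != "Lm":
            kept.append(ch)
    return "".join(kept).casefold()

families = ["Amber", "Áberforth", "ʿAal"]
ordered = sorted(families, key=sort_key)
print(ordered)  # the keys compared are "aal", "aberforth", "amber"
```

With this key the three entries come out in the order ʿAal, Áberforth, Amber, which is the bibliography order I would like to get.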

Targeting individual elements in HTML using Perl and Mojo::DOM in well-formatted HTML

Relative beginner with Perl, with my first question here, trying the following:
I am trying to retrieve certain information from a large online dataset (Eur-Lex), where each HTML document is well-formed HTML, with constant elements. Each HTML file is identified by its Celex number, which is supplied as the argument to the script (see my Perl code below).
The HTML data looks like this (showing only the part I'm interested in):
<!--
<blahblah>
< lots of stuff here, before the interesting part>
-->
<div id="PPClass_Contents" class="panel-collapse collapse in" role="tabpanel"
aria-labelledby="PP_Class">
<div class="panel-body">
<dl class="NMetadata">
<dt xmlns="http://www.w3.org/1999/xhtml">EUROVOC descriptor: </dt>
<dd xmlns="http://www.w3.org/1999/xhtml">
<ul>
<li>
<a href="./../../../search.html?type=advanced&DTS_DOM=ALL&DTS_SUBDOM=ALL_ALL&SUBDOM_INIT=ALL_ALL&DC_CODED=341&lang=en">
<span lang="en">descriptor_1</span>
</a>
</li>
<li>
<a href="./../../../search.html?type=advanced&DTS_DOM=ALL&DTS_SUBDOM=ALL_ALL&SUBDOM_INIT=ALL_ALL&DC_CODED=5158&lang=en">
<span lang="en">descriptor_2</span>
</a>
</li>
<li>
<a href="./../../../search.html?type=advanced&DTS_DOM=ALL&DTS_SUBDOM=ALL_ALL&SUBDOM_INIT=ALL_ALL&DC_CODED=7983&lang=en">
<span lang="en">descriptor_3</span>
</a>
</li>
<li>
<a href="./../../../search.html?type=advanced&DTS_DOM=ALL&DTS_SUBDOM=ALL_ALL&SUBDOM_INIT=ALL_ALL&DC_CODED=933&lang=en">
<span lang="en">descriptor_4</span>
</a>
</li>
</ul>
</dd>
<dt xmlns="http://www.w3.org/1999/xhtml">Subject matter: </dt>
<dd xmlns="http://www.w3.org/1999/xhtml">
<ul>
<li>
<a href="./../../../search.html?type=advanced&DTS_DOM=ALL&DTS_SUBDOM=ALL_ALL&SUBDOM_INIT=ALL_ALL&CT_CODED=BUDG&lang=en">
<span lang="en">Subject_1</span>
</a>
</li>
</ul>
</dd>
<dt xmlns="http://www.w3.org/1999/xhtml">Directory code: </dt>
<dd xmlns="http://www.w3.org/1999/xhtml">
<ul>
<li>01.60.20.00 <a href="./../../../search.html?type=advanced&DTS_DOM=ALL&DTS_SUBDOM=ALL_ALL&SUBDOM_INIT=ALL_ALL&CC_1_CODED=01&lang=en">
<span lang="en">Designation_level_1</span>
</a> / <a href="./../../../search.html?type=advanced&DTS_DOM=ALL&DTS_SUBDOM=ALL_ALL&SUBDOM_INIT=ALL_ALL&CC_2_CODED=0160&lang=en">
<span lang="en">Designation_level_2</span>
</a> / <a href="./../../../search.html?type=advanced&DTS_DOM=ALL&DTS_SUBDOM=ALL_ALL&SUBDOM_INIT=ALL_ALL&CC_3_CODED=016020&lang=en">
<span lang="en">Designation_level_3</span>
</a>
</li>
</ul>
</dd>
</dl>
</div>
</div>
</div>
<!--
<still more stuff here>
-->
I am interested in the info contained in "PPClass_Contents" div id, which consists of 3 elements:
- EUROVOC descriptor:
- Subject matter:
- Directory code:
Based on the above HTML, I would like to get the children of those 3 main elements, using Perl and Mojo, with a result similar to this (a single-line text file with 3 groups separated by tabs, where multiple child elements within a group are separated by pipe characters):
CELEX_No "TAB" descriptor_1|descriptor_2|descriptor_3|descriptor_4|..|descriptor_n "TAB" Subject_1|..|Subject_n "TAB" Designation_level_1|Designation_level_2|Designation_level_3|..|Designation_level_n
"descriptors", "Subjects" and "Designation_levels" elements (children of those 3 main groups) can be from 1 to "n", the number is not fixed, and is not known in advance.
I have the following code, which does print out the plain text of the interesting part, but I need to address the individual elements and print them out in a new file as described above:
#!/usr/bin/perl
# returns "Classification" descriptors for given CELEX and Language
use strict;
use warnings;
use Mojo::UserAgent;
# expect exactly two arguments: language code and CELEX number
if (@ARGV != 2) {
    print "Wrong number of arguments!\n";
    print "Syntax: clookup.pl Lang_ID celex_No.\n";
    exit -1;
}
my $lang = $ARGV[0];
my $celex = $ARGV[1];
my $lclang = lc $lang;
# fetch the eurlex page
my $ua = Mojo::UserAgent->new;
my $dom = $ua->get("https://eur-lex.europa.eu/legal-content/$lang/ALL/?uri=CELEX:$celex")->res->dom;
################ let's extract interesting parts:
my $text = $dom->at('#PPClass_Contents')->all_text;
print "$text\n";
EDIT (added):
You can try my Perl script using two arguments:
lang_code ("DE","EN","IT", etc.)
Celex number (e.g.: E2014C0303, 52015BP2212, 52015BP0930(48), 52015BP0930(36), 52015BP0930(41), E2014C0302, E2014C0301, E2014C0271, E2014C0134).
For example (if you name my script "clookup.pl"):
$ perl clookup.pl EN E2014C0303
So, how can I address individual elements (of unknown number) as described above, using Mojo::DOM?
Or, is there something simpler or faster (using Perl)?
You are on the right track. First, you need to understand the HTML inside your #PPClass_Contents. Each set of things is in a definition list. Since you only care about the definition texts, you can search directly for the <dd> elements.
$dom->at('#PPClass_Contents')->find('dd')
This will give you a Mojo::Collection, which you can iterate with ->each. We pass that an anonymous function, pretty much like a callback.
$dom->at('#PPClass_Contents')->find('dd')->each(sub {
$_; # this is the current element
});
Each element will be passed to that sub, and can be referenced using the topic variable $_. There is a <ul> inside, and each <li> contains a <span> element with the text you want. So let's find those.
$_->find('span')
We can directly build the column in your output at this stage. Let's use the other form of ->each, which turns the Mojo::Collection returned from ->find into a normal Perl list. We can then use a regular map operation to grab each <span>'s text node and join that into a string.
join '|', map { $_->text } $_->find('span')->each
To tie all that together, we declare an array outside this construct, and stick the $celex number in it as the first column.
my @columns = ($celex);
$dom->at('#PPClass_Contents')->find('dd')->each(sub {
    push @columns, join '|', map { $_->text } $_->find('span')->each;
});
Producing the final tab-separated output is now trivial.
print join "\t", @columns;
I've done this with EN as the language and the $celex number 32006L0121, which the search used in its example tooltip. The result is this:
32006L0121 marketing standard|chemical product|approximation of laws|dangerous substance|scientific report|packaging|European Chemicals Agency|labelling Internal market - Principles|Approximation of laws|Technical barriers|Environment|Consumer protection Industrial policy and internal market|Internal market: approximation of laws|Dangerous substances
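If you want to compare against something outside Perl, the same idea (one column per <dd>, span texts joined with pipes, columns joined with tabs) can be sketched in Python using only the standard library's html.parser. This is a rough sketch and not a replacement for the Mojo::DOM code above: the names DdSpanCollector and build_row are made up for illustration, the sample HTML is a trimmed-down version of the snippet in the question, and unlike the Perl version it does not restrict itself to #PPClass_Contents.

```python
from html.parser import HTMLParser

class DdSpanCollector(HTMLParser):
    """Collect the text of every <span> found inside each <dd> element."""

    def __init__(self):
        super().__init__()
        self.groups = []      # one list of span texts per <dd>
        self.in_dd = False
        self.in_span = False

    def handle_starttag(self, tag, attrs):
        if tag == "dd":
            self.in_dd = True
            self.groups.append([])
        elif tag == "span" and self.in_dd:
            self.in_span = True
            self.groups[-1].append("")

    def handle_endtag(self, tag):
        if tag == "dd":
            self.in_dd = False
        elif tag == "span":
            self.in_span = False

    def handle_data(self, data):
        # Only text inside a <span> that sits inside a <dd> is recorded.
        if self.in_span:
            self.groups[-1][-1] += data

def build_row(celex, html):
    """Return one tab-separated line: celex, then one pipe-joined column per <dd>."""
    parser = DdSpanCollector()
    parser.feed(html)
    parser.close()
    columns = [celex] + ["|".join(t.strip() for t in g) for g in parser.groups]
    return "\t".join(columns)

sample = """
<dl>
<dt>EUROVOC descriptor: </dt>
<dd><ul><li><a href="#"><span lang="en">descriptor_1</span></a></li>
<li><a href="#"><span lang="en">descriptor_2</span></a></li></ul></dd>
<dt>Subject matter: </dt>
<dd><ul><li><a href="#"><span lang="en">Subject_1</span></a></li></ul></dd>
</dl>
"""
row = build_row("CELEX_No", sample)
print(row)
```

For the sample above this prints the CELEX placeholder, then descriptor_1|descriptor_2, then Subject_1, separated by tabs.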

Extract Few Values with Beautiful Soup

<p class="">
Teacher:
<a href="/name/nm12345/?ref_=adv_0"
>Scott</a>
<span class="ghost">|</span>
Students:
<a href="/name/nm12345/?ref_=adv_1"
>Benedict</a>,
<a href="/name/nm12345/?ref_=adv_2"
>Chiwetel</a>,
<a href="/name/nm12345/?ref_=adv_3"
>Rachel</a>,
<a href="/name/nm12345/?ref_=adv_4"
>Benedict Wong</a>
</p>
I would like to extract the teacher's name, "Scott", which appears under the "Teacher" label, and all the students' names under the "Students" label. I tried:
soup.find(lambda tag:tag) and it returned
<a href="/name/nm12345/?ref_=adv_0"
>Scott</a>
I don't think this is the right approach. What should the code look like to extract the names under both "Teacher" and "Students"?
Assuming that your HTML block won't change much when parsing other pages, you can find your p tag by class (your example has none) and verify that the Teacher text is present.
If it is, take .contents[1] from the p tag, which is the first a element inside it.
Next, find all a tags whose href attribute doesn't match the teacher's.
Example:
from bs4 import BeautifulSoup
example = """<p class="">
Teacher:
<a href="/name/nm12345/?ref_=adv_0"
>Scott</a>
<span class="ghost">|</span>
Students:
<a href="/name/nm12345/?ref_=adv_1"
>Benedict</a>,
<a href="/name/nm12345/?ref_=adv_2"
>Chiwetel</a>,
<a href="/name/nm12345/?ref_=adv_3"
>Rachel</a>,
<a href="/name/nm12345/?ref_=adv_4"
>Benedict Wong</a>
</p>"""
soup = BeautifulSoup(example, "html.parser")
Classroom = soup.find(lambda x: "Teacher" in x.get_text())
if Classroom is not None:
    Teacher = Classroom.contents[1]
    TeacherUrl = Teacher["href"]
    Students = Classroom.find_all(lambda tag: tag.has_attr('href') and TeacherUrl not in tag["href"])
    print(Teacher.text)
    for Student in Students:
        print(Student.text)
Which outputs:
Scott
Benedict
Chiwetel
Rachel
Benedict Wong

Form: How to properly connect Live Preview with controller? (AngularJS)

Like everyone else who has asked this question about Angular JS, my problem goes deeper than the simple "how to fix blank option in drop down menu". Basically, I am creating a live preview WITH the form that I am creating. Here is what I mean:
The initial blockquote in the HTML is basically the actual reviews that will be there. The second blockquote is the actual LIVE preview. Finally, the third part is the piece of dilemma I am having.
DILEMMA:
Here is the dilemma I am having:
By removing: ng-model="review.stars" from the select tag, my page will load the 5 stars as expected. However, since I need to bind the ratings already posted up, the live preview AND the initial selection of 5 stars, I have to use ng-model="review.stars" to bind everything together.
BUT, now what happens is that by adding ng-controller=starsController along with the ng-model, the whole thing simply doesn't work. I have tried using a few formulas (and one of those was looking promising, one using an orderProp), but because I NEED to bind the three aforementioned things, it breaks the code and that specific piece.
I can still select an option from the list, but the preview fails to show it. Furthermore, it will NOT automatically show up the 5th star in the drop down (I have to manually select it).
In case you would like a visual aid to understand better what I'm explaining, let me post a picture of the two situations I am referring to. This is the image via a link - I don't have enough rep to post it directly here :/
HTML
<blockquote ng-repeat="review in product.reviews">
<b>Stars: {{review.stars}}</b>
{{review.body}}
<cite>by: {{review.author}}</cite>
</blockquote>
<form name="reviewForm">
<blockquote>
<b> Stars: {{review.stars}}</b>
<br/>
<b> Review: {{review.body}}</b>
<br/>
<cite>by: {{review.author}}</cite>
</blockquote>
<select ng-model="review.stars" ng-controller="starsController" name="stars" id="stars">
<option style="display:none" value=""></option>
<optgroup label="Rate the product">
<option value="1 star" name="1 star">1 star</option>
<option value="2 stars" name="2 stars">2 stars</option>
<option value="3 stars" name="3 stars">3 stars</option>
<option value="4 stars" name="4 stars">4 stars</option>
<option value="5 stars" name="5 stars" selected="selected">5 stars
</option>
</optgroup>
</select>
<br/>
JS
app.controller('starsController', ['$scope', function($scope) {
$scope.options = [
{ name: '1 star', value: '1 star' },
{ name: '2 stars', value: '2 stars' },
{ name: '3 stars', value: '3 stars' },
{ name: '4 stars', value: '4 stars' },
{ name: '5 stars', value: '5 stars' }
];
$scope.orderProp = options[4];
}]);
Updated Plunker
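To fix this, add ng-init="review={}" to your form tag. The ng-controller on the select creates a child scope, so unless a review object already exists on the form's scope, review.stars gets written to the child scope and the preview never sees it.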
<form name="reviewForm" ng-init="review={}">
Or, even better - move ng-controller higher up the DOM tree (maybe on your form element)
<form name="reviewForm" ng-controller="starsController">
Or the best option, create a custom reviewStars directive:
app.directive('reviewStars', function() {...});

How to create a comma separated tag list in Docpad?

The tags for a post can be accessed within the post like this (using embedded Coffeescript):
<div class="tags">
Tags:
<ul>
<% for tag in @document.tags: %>
<li><a class="tag_item" href="<%= @getTagUrl(tag) %>"><%= tag %></a></li>
<% end %>
</ul>
</div>
This generates an unordered list of the tags for this specific topic, like this:
Tags:
tag1
tag2
tag3
How can I generate the list of tags as comma separated values on a single line, like this:
Tags: tag1, tag2, tag3
I do it like this on my blog:
<div class="post-tags">
Posted In: <%- ("<a href='#{@getTagUrl(tag)}'>#{tag}</a>" for tag in @tags).join(', ') %>
</div>
Note, @getTagUrl comes from the docpad-plugin-tagging plugin. If you don't want hyperlinks to a page for each tag, you could simplify this to the following:
<div class="post-tags">
Posted In: <%- (tag for tag in @tags).join(', ') %>
</div>

Html Dom with jsoup

I have this html code (part of a long html code) :
<div class="yt-lockup-content">
<h3 class="yt-lockup-ellipsize">
<a class="yt-uix-contextlink yt-uix-sessionlink yt-uix-tile-link result-item-translation-title"dir="ltr"title="Rihanna ft. Calvin Harris - We Found Love LYRICS"data-sessionlink="ved=CAoQvxs%3D&ei=CPLZjMDLwbICFUag3wod7Dm-pw%3D%3D"href="***/watch?v=1KzEu5hWmRY***">Rihanna ft. Calvin Harris - We Found Love LYRICS</a></h3><p class="description " dir="ltr">
<b>bla</b> ft. Calvin Harris -
<b>bla</b>
bla bla
<b>bla bla bla</b>
<b>...</b>
</p><div class="yt-lockup-meta">
<ul class="single-line-lego-list">
<li> <a href="/results?search_type=videos&search_query=rihanna+we+found+love&high_definition=1" class="yt-badge-std">HD
</a>
</h3>
And I want to get this text: href=/watch?v=1KzEu5hWmRY (the bold text).
I know how to get the whole line, but how can I grab just this text?
Thanks,
Or.
Here's an example:
String html = "..."; // your HTML here
Document doc = Jsoup.parse(html);
Element element = doc.select("h3[class=yt-lockup-ellipsize] > a").first();
String hrefLink = element.attr("href");
However if you want to parse a website, you should use Jsoup.connect("http://link.com").get() instead of Jsoup.parse(html)