I'm curious whether I can use a Xidel pattern stored in a file to convert an HTML list into an array of JSON objects.
Given this example HTML:
<div class="watch-sidebar-body">
<ul id="watch-related" class="video-list">
<li class="video-list-item">
<div class="content-wrapper">
<a href="/watch?v=XPt3uMaqG7c">
<span class="title">
10 Best Super Bowl 2018 Commercials
</span>
<span class="accessible-description">
- Duration: 11:07.
</span>
<span class="stat attribution"><span class="">Dalibor Truhlar</span></span>
<span class="stat view-count">546,346 views</span>
</a>
</div>
</li>
<li class="video-list-item">
...
By using this pattern from a file:
<ul id="watch-related" class="video-list">
<t:loop>
<li>
<a>
<span class="title">{$title}</span>
<span class="accessible-description">{$duration := extract(., "[0-9]+:[0-9]+")}</span>
<span>
<span>{$author}</span>
</span>
<span class="view-count">{$views := extract(., "[0-9,]+")}</span>
{url := @href}
</a>
</li>
</t:loop>
</ul>
I would like to produce the following json:
[{ "title" : "10 Best Super Bowl 2018 Commercials"
, "duration" : "11:07"
, "author" : "Dalibor Truhlar"
, "views" : "546,346"
, "url" : "/watch?v=XPt3uMaqG7c"
}
,{ "title" : "Funny Commercial Compilation"
, "duration" : "9:33"
, "author" : "Gaming Coyote"
, "views" : "9,449,290"
, "url" : "/watch?v=BTka0cgf99c"
}
...
The pattern properly matches the HTML and extracts the data, but I can't get the json output illustrated above.
When I run the command
curl -s https://www.youtube.com/watch\?v\=HE9nLWFZ6ac | xidel - --silent --extract-file=yt.xq
I'm just getting the dump of all variables on stdout:
title := 10 Best Super Bowl 2018 Commercials
duration := 11:07
author := Dalibor Truhlar
views := 548,710
url := /watch?v=XPt3uMaqG7c
title := Funny Commercial Compilation
duration := 9:33
author := Gaming Coyote
views := 9,451,516
url := /watch?v=BTka0cgf99c
But how do I go from this to an array of json objects?
--output-format=json-wrapped doesn't help, since it converts each variable into its own array instead of zipping them into objects.
I know it is possible to create the desired json output with an xpath expression on command line, but I'm specifically interested in learning how to get that output from a pattern stored in a file.
You can create an explicit object in the pattern:
<ul id="watch-related" class="video-list">
<t:loop>
<li>
<a>
{$out := {}}
<span class="title">{$out.title}</span>
<span class="accessible-description">{$out.duration := extract(., "[0-9]+:[0-9]+")}</span>
<span>
<span>{$out.author}</span>
</span>
<span class="view-count">{$out.views := extract(., "[0-9,]+")}</span>
{$out.url := @href}
</a>
</li>
</t:loop>
</ul>
and call it with | xidel - --silent --dot-notation=on --extract-file=yt.xq
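If the pattern cannot be changed, the flat variable dump can also be regrouped after the fact. The following is a minimal Python sketch, not a Xidel feature: it assumes the `name := value` dump format shown in the question and that every record starts with `title`. The helper name `group_records` is hypothetical.

```python
import json

def group_records(dump, first_key="title"):
    """Group 'name := value' lines into one dict per record.

    Assumption: each record begins with `first_key`, as in the dump
    shown in the question.
    """
    records = []
    for line in dump.splitlines():
        if " := " not in line:
            continue  # skip blank or unrelated lines
        name, value = line.split(" := ", 1)
        name = name.strip()
        if name == first_key or not records:
            records.append({})  # a new record begins here
        records[-1][name] = value.strip()
    return records

dump = """title := 10 Best Super Bowl 2018 Commercials
duration := 11:07
author := Dalibor Truhlar
views := 548,710
url := /watch?v=XPt3uMaqG7c
title := Funny Commercial Compilation
duration := 9:33
author := Gaming Coyote
views := 9,451,516
url := /watch?v=BTka0cgf99c"""

records = group_records(dump)
print(json.dumps(records, indent=2))
```

Building the object inside the pattern, as above, is still the more direct route, because it does not depend on the order in which variables are printed.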
I'm a relative beginner with Perl, and this is my first question here.
I am trying to retrieve certain information from a large online dataset (Eur-Lex), where each HTML document is well-formed HTML, with constant elements. Each HTML file is identified by its Celex number, which is supplied as the argument to the script (see my Perl code below).
The HTML data looks like this (showing only the part I'm interested in):
<!--
<blahblah>
< lots of stuff here, before the interesting part>
-->
<div id="PPClass_Contents" class="panel-collapse collapse in" role="tabpanel"
aria-labelledby="PP_Class">
<div class="panel-body">
<dl class="NMetadata">
<dt xmlns="http://www.w3.org/1999/xhtml">EUROVOC descriptor: </dt>
<dd xmlns="http://www.w3.org/1999/xhtml">
<ul>
<li>
<a href="./../../../search.html?type=advanced&DTS_DOM=ALL&DTS_SUBDOM=ALL_ALL&SUBDOM_INIT=ALL_ALL&DC_CODED=341&lang=en">
<span lang="en">descriptor_1</span>
</a>
</li>
<li>
<a href="./../../../search.html?type=advanced&DTS_DOM=ALL&DTS_SUBDOM=ALL_ALL&SUBDOM_INIT=ALL_ALL&DC_CODED=5158&lang=en">
<span lang="en">descriptor_2</span>
</a>
</li>
<li>
<a href="./../../../search.html?type=advanced&DTS_DOM=ALL&DTS_SUBDOM=ALL_ALL&SUBDOM_INIT=ALL_ALL&DC_CODED=7983&lang=en">
<span lang="en">descriptor_3</span>
</a>
</li>
<li>
<a href="./../../../search.html?type=advanced&DTS_DOM=ALL&DTS_SUBDOM=ALL_ALL&SUBDOM_INIT=ALL_ALL&DC_CODED=933&lang=en">
<span lang="en">descriptor_4</span>
</a>
</li>
</ul>
</dd>
<dt xmlns="http://www.w3.org/1999/xhtml">Subject matter: </dt>
<dd xmlns="http://www.w3.org/1999/xhtml">
<ul>
<li>
<a href="./../../../search.html?type=advanced&DTS_DOM=ALL&DTS_SUBDOM=ALL_ALL&SUBDOM_INIT=ALL_ALL&CT_CODED=BUDG&lang=en">
<span lang="en">Subject_1</span>
</a>
</li>
</ul>
</dd>
<dt xmlns="http://www.w3.org/1999/xhtml">Directory code: </dt>
<dd xmlns="http://www.w3.org/1999/xhtml">
<ul>
<li>01.60.20.00 <a href="./../../../search.html?type=advanced&DTS_DOM=ALL&DTS_SUBDOM=ALL_ALL&SUBDOM_INIT=ALL_ALL&CC_1_CODED=01&lang=en">
<span lang="en">Designation_level_1</span>
</a> / <a href="./../../../search.html?type=advanced&DTS_DOM=ALL&DTS_SUBDOM=ALL_ALL&SUBDOM_INIT=ALL_ALL&CC_2_CODED=0160&lang=en">
<span lang="en">Designation_level_2</span>
</a> / <a href="./../../../search.html?type=advanced&DTS_DOM=ALL&DTS_SUBDOM=ALL_ALL&SUBDOM_INIT=ALL_ALL&CC_3_CODED=016020&lang=en">
<span lang="en">Designation_level_3</span>
</a>
</li>
</ul>
</dd>
</dl>
</div>
</div>
</div>
<!--
<still more stuff here>
-->
I am interested in the info contained in "PPClass_Contents" div id, which consists of 3 elements:
- EUROVOC descriptor:
- Subject matter:
- Directory code:
Based on the above HTML, I would like to get the children of those 3 main elements, using Perl and Mojo, and write a result like the following: a single-line text file with 3 groups separated by tabs, where multiple child elements within a group are separated by pipe characters:
CELEX_No "TAB" descriptor_1|descriptor_2|descriptor_3|descriptor_4|..|descriptor_n "TAB" Subject_1|..|Subject_n "TAB" Designation_level_1|Designation_level_2|Designation_level_3|..|Designation_level_n
"descriptors", "Subjects" and "Designation_levels" elements (children of those 3 main groups) can be from 1 to "n", the number is not fixed, and is not known in advance.
I have the following code, which does print out the plain text of the interesting part, but I need to address the individual elements and print them out in a new file as described above:
#!/usr/bin/perl
# returns "Classification" descriptors for given CELEX and Language
use strict;
use warnings;
use Mojo::UserAgent;
if (@ARGV != 2) {
print "Wrong number of arguments!\n";
print "Syntax: clookup.pl Lang_ID celex_No.\n";
exit -1;
}
my $lang = $ARGV[0];
my $celex = $ARGV[1];
my $lclang = lc $lang;
# fetch the eurlex page
my $ua = Mojo::UserAgent->new;
my $dom = $ua->get("https://eur-lex.europa.eu/legal-content/$lang/ALL/?uri=CELEX:$celex")->res->dom;
################ let's extract interesting parts:
my $text = $dom->at('#PPClass_Contents')->all_text;
print "$text\n";
EDIT (added):
You can try my Perl script using two arguments:
lang_code ("DE","EN","IT", etc.)
Celex number (e.g.: E2014C0303, 52015BP2212, 52015BP0930(48), 52015BP0930(36), 52015BP0930(41), E2014C0302, E2014C0301, E2014C0271, E2014C0134).
For example (if you name my script "clookup.pl"):
$ perl clookup.pl EN E2014C0303
So, how can I address individual elements (of unknown number) as described above, using Mojo::DOM?
Or, is there something simpler or faster (using Perl)?
You are on the right track. First, you need to understand the HTML inside your #PPClass_Contents. Each set of things is in a definition list. Since you only care about the definition texts, you can search directly for the <dd> elements.
$dom->at('#PPClass_Contents')->find('dd')
This will give you a Mojo::Collection, which you can iterate with ->each. We pass that an anonymous function, pretty much like a callback.
$dom->at('#PPClass_Contents')->find('dd')->each(sub {
$_; # this is the current element
});
Each element will be passed to that sub, and can be referenced using the topic variable $_. There is a <ul> inside, and each <li> contains a <span> element with the text you want. So let's find those.
$_->find('span')
We can directly build the column in your output at this stage. Let's use the other form of ->each, which turns the Mojo::Collection returned from ->find into a normal Perl list. We can then use a regular map operation to grab each <span>'s text node and join that into a string.
join '|', map { $_->text } $_->find('span')->each
To tie all that together, we declare an array outside this construct, and stick the $celex number in it as the first column.
my @columns = ($celex);
$dom->at('#PPClass_Contents')->find('dd')->each(sub {
push @columns, join '|', map { $_->text } $_->find('span')->each;
});
Producing the final tab-separated output is now trivial.
print join "\t", @columns;
I've done this with EN as the language and the $celex number 32006L0121, which the search used in its example tooltip. The result is this:
32006L0121 marketing standard|chemical product|approximation of laws|dangerous substance|scientific report|packaging|European Chemicals Agency|labelling Internal market - Principles|Approximation of laws|Technical barriers|Environment|Consumer protection Industrial policy and internal market|Internal market: approximation of laws|Dangerous substances
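For comparison, the same traversal (one column per <dd>, one cell entry per <span>) can be sketched in Python with only the standard library. This is an illustration of the grouping logic, not a replacement for the Mojo::DOM code above; the class name `DdSpanCollector`, the helper `to_row`, and the shortened sample markup are all made up for this sketch.

```python
from html.parser import HTMLParser

class DdSpanCollector(HTMLParser):
    """Collect the text of every <span> inside each <dd>, one list per <dd>."""

    def __init__(self):
        super().__init__()
        self.columns = []     # one list of span texts per <dd>
        self.in_dd = False
        self.in_span = False

    def handle_starttag(self, tag, attrs):
        if tag == "dd":
            self.in_dd = True
            self.columns.append([])
        elif tag == "span" and self.in_dd:
            self.in_span = True
            self.columns[-1].append("")

    def handle_endtag(self, tag):
        if tag == "dd":
            self.in_dd = False
        elif tag == "span":
            self.in_span = False

    def handle_data(self, data):
        if self.in_span:
            self.columns[-1][-1] += data

def to_row(celex, html):
    """Return the tab-separated line: CELEX number, then one column per <dd>."""
    parser = DdSpanCollector()
    parser.feed(html)
    parser.close()
    cells = ["|".join(text.strip() for text in col) for col in parser.columns]
    return "\t".join([celex] + cells)

# Shortened, hypothetical sample in the shape of the question's markup.
sample = """
<dl class="NMetadata">
<dt>EUROVOC descriptor: </dt>
<dd><ul>
<li><a href="x"><span lang="en">descriptor_1</span></a></li>
<li><a href="x"><span lang="en">descriptor_2</span></a></li>
</ul></dd>
<dt>Subject matter: </dt>
<dd><ul>
<li><a href="x"><span lang="en">Subject_1</span></a></li>
</ul></dd>
</dl>
"""

row = to_row("E2014C0303", sample)
print(row)
```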
<p class="">
Teacher:
<a href="/name/nm12345/?ref_=adv_0"
>Scott</a>
<span class="ghost">|</span>
Students:
<a href="/name/nm12345/?ref_=adv_1"
>Benedict</a>,
<a href="/name/nm12345/?ref_=adv_2"
>Chiwetel</a>,
<a href="/name/nm12345/?ref_=adv_3"
>Rachel</a>,
<a href="/name/nm12345/?ref_=adv_4"
>Benedict Wong</a>
</p>
I would like to extract the teacher's name, "Scott", which follows the "Teacher:" label, and all the students' names that follow the "Students:" label. I tried:
soup.find(lambda tag:tag) and it returned
<a href="/name/nm12345/?ref_=adv_0"
>Scott</a>
I don't think this is the right approach. What should the code look like to extract the names under both the "Teacher" and "Students" labels?
Assuming that your HTML block won't change much when parsing other pages, you can find your p tag by class (your example has none) and verify that the Teacher text is present.
If it is, take .contents[1] from the p tag, which is the first a in the element.
Next, find all a tags whose href attribute doesn't match the teacher's.
Example:
from bs4 import BeautifulSoup
example = """<p class="">
Teacher:
<a href="/name/nm12345/?ref_=adv_0"
>Scott</a>
<span class="ghost">|</span>
Students:
<a href="/name/nm12345/?ref_=adv_1"
>Benedict</a>,
<a href="/name/nm12345/?ref_=adv_2"
>Chiwetel</a>,
<a href="/name/nm12345/?ref_=adv_3"
>Rachel</a>,
<a href="/name/nm12345/?ref_=adv_4"
>Benedict Wong</a>
</p>"""
soup = BeautifulSoup(example, "html.parser")
Classroom = soup.find(lambda x: "Teacher" in x.get_text())
if Classroom is not None:
    Teacher = Classroom.contents[1]
    TeacherUrl = Teacher["href"]
    Students = Classroom.find_all(lambda tag: tag.has_attr('href') and TeacherUrl not in tag["href"])
    print (Teacher.text)
    for Student in Students:
        print (Student.text)
Which outputs:
Scott
Benedict
Chiwetel
Rachel
Benedict Wong
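The approach above relies on the teacher's link having a different href from the students' links. As an alternative, here is a rough sketch that groups each link under the most recent "Label:" text instead. It uses only the standard library's html.parser rather than BeautifulSoup, the class name `LabelledLinks` is made up for this example, and it assumes labels are plain text ending in a colon, as in the sample.

```python
from html.parser import HTMLParser

class LabelledLinks(HTMLParser):
    """Group the text of <a> elements under the most recent 'Label:' text."""

    def __init__(self):
        super().__init__()
        self.groups = {}      # label -> list of link texts
        self.label = None     # most recent label seen outside a link
        self.in_link = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.in_link = True

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_link = False

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self.in_link:
            if self.label is not None:
                self.groups.setdefault(self.label, []).append(text)
        elif text.endswith(":"):
            # Assumption: labels are plain text ending in a colon.
            self.label = text.rstrip(":")

example = """<p class="">
Teacher:
<a href="/name/nm12345/?ref_=adv_0"
>Scott</a>
<span class="ghost">|</span>
Students:
<a href="/name/nm12345/?ref_=adv_1"
>Benedict</a>,
<a href="/name/nm12345/?ref_=adv_2"
>Chiwetel</a>,
<a href="/name/nm12345/?ref_=adv_3"
>Rachel</a>,
<a href="/name/nm12345/?ref_=adv_4"
>Benedict Wong</a>
</p>"""

parser = LabelledLinks()
parser.feed(example)
parser.close()
groups = parser.groups
print(groups)
```

This keeps working when two people share the same URL, but it breaks if the page stops using "Label:" text, so which variant is safer depends on how stable the markup is.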
I'm trying to set up a list view with tx_news and want to link a PDF file, which is also used as the thumbnail, directly in the list view rather than in the detail view.
Tried
<a href="{relatedFile.originalResource.publicUrl -> f:format.htmlspecialchars()}" target="_blank">
But the link remains empty... Any suggestions?
The snippet as is looks OK; maybe the loop around this link is not in order.
It should have structure like this:
<f:for each="{news}" as="newsItem">
<f:if condition="{newsItem.falMedia}">
<f:then>
<f:for each="{newsItem.falRelatedFiles}" as="falfile" iteration="i">
<li>
<span class="news-related-files-link">
<a href="{falfile.originalResource.publicUrl -> f:format.htmlspecialchars()}" target="_blank">
{falfile.originalResource.title}
</a>
</span>
<span class="news-related-files-size">
{falfile.originalResource.size -> f:format.bytes()}
</span>
<f:format.html>
falimage.uid : {falimage.uid}
identifier : {falimage.originalResource.identifier}
public_url : {falimage.originalResource.publicUrl}
name : {falimage.originalResource.name}
title : {falimage.originalResource.title}
alternative : {falimage.originalResource.alternative}
description : {falimage.originalResource.description}
extension : {falimage.originalResource.extension}
type : {falimage.originalResource.type}
mimeType : {falimage.originalResource.mimeType}
size : {falimage.originalResource.size}
creationTime : {falimage.originalResource.creationTime}
modificationTime : {falimage.originalResource.modificationTime}
</f:format.html>
</li>
</f:for>
</f:then>
<f:else>
</f:else>
</f:if>
</f:for>
I would like to generate IDs for an HTML list.
The list is generated dynamically from the database.
I can't use a for loop or the list.zipWithIndex function because my logic already contains a few loops for generating the list, in which the counter needs to be incremented too. I also tried the defining function, but reassigning values like this is not allowed: @{id = id + 1}
What's the best way to generate the IDs?
This is part of the template (uniqueId needs to be replaced with an integer):
<div id="tree">
<ul>
<li id="uniqueId">
<a class="dashboard" href="/">Dashboard</a>
</li>
<li id="uniqueId">
<b>Products</b>
<ul id="uniqueId">
@for(cat <- Application.allCategories()) {
<li id="uniqueId">
<a class="name" href="@routes.Categories.getd(cat.id).url">@cat.name</a>
<ul>
@for(prod <- Application.allProducts()) {
<li id="uniqueId">
<a class="name" href="@routes.Product.getById(prod.id).url">@prod.name</a>
</li>
@*more code and the closing tags...*@
Just use the object's own id with a prefix to make it unique. For the first listing:
@for(cat <- Application.allCategories()) {
<li id="cat_@cat.id">
For the second:
@for(prod <- Application.allProducts()) {
<li id="prod_@prod.id">
Or, if the same product can be displayed in several categories, prefix it with cat.id as well:
@for(cat <- Application.allCategories()) {
<li id="cat_@cat.id">
@for(prod <- Application.allProducts()) {
<li id="prod_@(cat.id)_@(prod.id)">
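To see why the combined prefix is needed when every product is listed under every category, here is a small Python sketch of the same naming scheme. It is outside the template and purely illustrative: the category and product ids are invented, and the helper names are hypothetical.

```python
def cat_element_id(cat_id):
    """Element id for a category <li>, e.g. 'cat_1'."""
    return f"cat_{cat_id}"

def prod_element_id(cat_id, prod_id):
    """Element id for a product <li> nested under a category, e.g. 'prod_1_10'."""
    return f"prod_{cat_id}_{prod_id}"

# Invented database ids: every product appears under every category,
# mirroring the nested loops in the template.
category_ids = [1, 2]
product_ids = [10, 11]

# Prefixing with the product id alone repeats ids across categories.
naive_ids = [f"prod_{p}" for c in category_ids for p in product_ids]

# Prefixing with both ids keeps every element id distinct.
ids = [cat_element_id(c) for c in category_ids]
ids += [prod_element_id(c, p) for c in category_ids for p in product_ids]

print(len(naive_ids), len(set(naive_ids)))
print(len(ids), len(set(ids)))
```

No counter is needed as long as the database ids are unique within each table.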
I'm new to hpple and XPath. For the HTML below, I want to get both the "title" and "tag" information.
From hpple's example code, I can get an array of titles and another array of tags. But if I'm interested in six properties, there will be six arrays.
Can I find the div[class="entry"] first, and then get its children, such as div[class="meta"]? (Can anybody share the code?)
Thanks.
<div class="content">
<div id="1" class="entry">
<h2 class="title"> title for entry 1 </h2>
<div class="meta"> tag:xxx </div>
</div>
<div id="2" class="entry">
<h2 class="title"> title for entry 2 </h2>
<div class="meta"> tag:xxx </div>
</div>
...
</div>
@"//div[@class='content']//div[@class='entry']//div[@class='meta']"
This returns tag:xxx for both entries.
I want to get both "title" and "tag" information
//div[@class='content']/div[@class='entry']/*[@class='meta' or @class='title']
This XPath selects every element with class title or meta that is a child of a div with class entry, which is itself a child of a div with class content.
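To avoid one array per property, the usual pattern is to select the entry nodes first and then run relative queries on each one. The sketch below shows that idea in Python with the standard library's xml.etree.ElementTree rather than hpple, so the API is different; it also assumes the markup is well-formed, and the sample is the question's HTML with the "..." removed.

```python
import xml.etree.ElementTree as ET

html = """<div class="content">
<div id="1" class="entry">
<h2 class="title"> title for entry 1 </h2>
<div class="meta"> tag:xxx </div>
</div>
<div id="2" class="entry">
<h2 class="title"> title for entry 2 </h2>
<div class="meta"> tag:xxx </div>
</div>
</div>"""

root = ET.fromstring(html)  # root is the div with class "content"

entries = []
# First select each entry, then query relative to that entry.
for entry in root.findall("div[@class='entry']"):
    title = entry.find("*[@class='title']")
    meta = entry.find("*[@class='meta']")
    entries.append({
        "id": entry.get("id"),
        "title": title.text.strip(),
        "tag": meta.text.strip(),
    })

print(entries)
```

Each dictionary keeps the properties of one entry together, so adding a sixth property means one more relative query, not one more parallel array.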