Extract Few Values with Beautiful Soup - tags

<p class="">
Teacher:
<a href="/name/nm12345/?ref_=adv_0"
>Scott</a>
<span class="ghost">|</span>
Students:
<a href="/name/nm12345/?ref_=adv_1"
>Benedict</a>,
<a href="/name/nm12345/?ref_=adv_2"
>Chiwetel</a>,
<a href="/name/nm12345/?ref_=adv_3"
>Rachel</a>,
<a href="/name/nm12345/?ref_=adv_4"
>Benedict Wong</a>
</p>
I would like to extract the teacher's name, "Scott", which follows the "Teacher:" label, and all of the students' names, which follow the "Students:" label. I tried:
soup.find(lambda tag:tag) and it returned
<a href="/name/nm12345/?ref_=adv_0"
>Scott</a>
I don't think this is the right approach. What should the code look like to extract the names under both the "Teacher" and "Students" labels?

Assuming that your HTML block won't change much when parsing other pages, you can find your p tag by class (your example has none) and verify that the Teacher text is present.
If it is, get .contents[1] from the p tag, which is the first a in the element (.contents[0] is the "Teacher:" text node).
Next, find all a tags whose href attribute doesn't match the teacher's.
Example:
from bs4 import BeautifulSoup
example = """<p class="">
Teacher:
<a href="/name/nm12345/?ref_=adv_0"
>Scott</a>
<span class="ghost">|</span>
Students:
<a href="/name/nm12345/?ref_=adv_1"
>Benedict</a>,
<a href="/name/nm12345/?ref_=adv_2"
>Chiwetel</a>,
<a href="/name/nm12345/?ref_=adv_3"
>Rachel</a>,
<a href="/name/nm12345/?ref_=adv_4"
>Benedict Wong</a>
</p>"""
soup = BeautifulSoup(example, "html.parser")
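# Find the first tag whose text mentions "Teacher" (here, the <p>).
Classroom = soup.find(lambda x: "Teacher" in x.get_text())
if Classroom is not None:
    # contents[0] is the "Teacher:" text node; contents[1] is the first <a>.
    Teacher = Classroom.contents[1]
    TeacherUrl = Teacher["href"]
    # Every other link in the paragraph is treated as a student.
    Students = Classroom.find_all(lambda tag: tag.has_attr('href') and TeacherUrl not in tag["href"])
    print(Teacher.text)
    for Student in Students:
        print(Student.text)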
Which outputs:
Scott
Benedict
Chiwetel
Rachel
Benedict Wong
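The approach above relies on the teacher's link being the first child and on the students' links having a different href. If that feels fragile, another option is to group each link under the most recent "Label:" text that precedes it. The following is only a rough sketch of that idea, written with the standard library's html.parser instead of Beautiful Soup so that it runs without extra packages. The class name LabelledLinks and the shortened sample markup are my own, not something from the question.

```python
from html.parser import HTMLParser


class LabelledLinks(HTMLParser):
    """Group link texts under the most recent 'Label:' text seen (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.groups = {}      # label -> list of link texts
        self.label = None     # most recent "Label:" text, without the colon
        self.in_link = False  # True while inside an <a> element

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.in_link = True
            if self.label is not None:
                # Start a new, empty name for this link.
                self.groups[self.label].append("")

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_link = False

    def handle_data(self, data):
        text = data.strip()
        if self.in_link and self.label is not None:
            # Text inside a link belongs to the current label.
            self.groups[self.label][-1] += text
        elif text.endswith(":"):
            # Text such as "Teacher:" outside a link starts a new group.
            self.label = text[:-1]
            self.groups[self.label] = []


# Shortened version of the markup from the question.
html = """<p class="">
Teacher:
<a href="/name/nm12345/?ref_=adv_0"
>Scott</a>
<span class="ghost">|</span>
Students:
<a href="/name/nm12345/?ref_=adv_1"
>Benedict</a>,
<a href="/name/nm12345/?ref_=adv_4"
>Benedict Wong</a>
</p>"""

parser = LabelledLinks()
parser.feed(html)
parser.close()
groups = parser.groups
print(groups["Teacher"])
print(groups["Students"])
```

This assumes every label is plain text ending in a colon and that the whole fragment is fed to the parser in one call; pages that differ from the sample would need adjustments.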

Related

Targeting individual elements in HTML using Perl and Mojo::DOM in well-formatted HTML

I'm a relative beginner with Perl, and this is my first question here. I'm trying the following:
I am trying to retrieve certain information from a large online dataset (Eur-Lex), where each HTML document is well-formed HTML, with constant elements. Each HTML file is identified by its Celex number, which is supplied as the argument to the script (see my Perl code below).
The HTML data looks like this (showing only the part I'm interested in):
<!--
<blahblah>
< lots of stuff here, before the interesting part>
-->
<div id="PPClass_Contents" class="panel-collapse collapse in" role="tabpanel"
aria-labelledby="PP_Class">
<div class="panel-body">
<dl class="NMetadata">
<dt xmlns="http://www.w3.org/1999/xhtml">EUROVOC descriptor: </dt>
<dd xmlns="http://www.w3.org/1999/xhtml">
<ul>
<li>
<a href="./../../../search.html?type=advanced&DTS_DOM=ALL&DTS_SUBDOM=ALL_ALL&SUBDOM_INIT=ALL_ALL&DC_CODED=341&lang=en">
<span lang="en">descriptor_1</span>
</a>
</li>
<li>
<a href="./../../../search.html?type=advanced&DTS_DOM=ALL&DTS_SUBDOM=ALL_ALL&SUBDOM_INIT=ALL_ALL&DC_CODED=5158&lang=en">
<span lang="en">descriptor_2</span>
</a>
</li>
<li>
<a href="./../../../search.html?type=advanced&DTS_DOM=ALL&DTS_SUBDOM=ALL_ALL&SUBDOM_INIT=ALL_ALL&DC_CODED=7983&lang=en">
<span lang="en">descriptor_3</span>
</a>
</li>
<li>
<a href="./../../../search.html?type=advanced&DTS_DOM=ALL&DTS_SUBDOM=ALL_ALL&SUBDOM_INIT=ALL_ALL&DC_CODED=933&lang=en">
<span lang="en">descriptor_4</span>
</a>
</li>
</ul>
</dd>
<dt xmlns="http://www.w3.org/1999/xhtml">Subject matter: </dt>
<dd xmlns="http://www.w3.org/1999/xhtml">
<ul>
<li>
<a href="./../../../search.html?type=advanced&DTS_DOM=ALL&DTS_SUBDOM=ALL_ALL&SUBDOM_INIT=ALL_ALL&CT_CODED=BUDG&lang=en">
<span lang="en">Subject_1</span>
</a>
</li>
</ul>
</dd>
<dt xmlns="http://www.w3.org/1999/xhtml">Directory code: </dt>
<dd xmlns="http://www.w3.org/1999/xhtml">
<ul>
<li>01.60.20.00 <a href="./../../../search.html?type=advanced&DTS_DOM=ALL&DTS_SUBDOM=ALL_ALL&SUBDOM_INIT=ALL_ALL&CC_1_CODED=01&lang=en">
<span lang="en">Designation_level_1</span>
</a> / <a href="./../../../search.html?type=advanced&DTS_DOM=ALL&DTS_SUBDOM=ALL_ALL&SUBDOM_INIT=ALL_ALL&CC_2_CODED=0160&lang=en">
<span lang="en">Designation_level_2</span>
</a> / <a href="./../../../search.html?type=advanced&DTS_DOM=ALL&DTS_SUBDOM=ALL_ALL&SUBDOM_INIT=ALL_ALL&CC_3_CODED=016020&lang=en">
<span lang="en">Designation_level_3</span>
</a>
</li>
</ul>
</dd>
</dl>
</div>
</div>
</div>
<!--
<still more stuff here>
-->
I am interested in the info contained in "PPClass_Contents" div id, which consists of 3 elements:
- EUROVOC descriptor:
- Subject matter:
- Directory code:
Based on the above HTML, I would like to get the children of those 3 main elements, using Perl and Mojo, and write them to a text file as a single line: 3 groups separated by tabs, with multiple child elements within a group separated by pipe characters, something like this:
CELEX_No "TAB" descriptor_1|descriptor_2|descriptor_3|descriptor_4|..|descriptor_n "TAB" Subject_1|..|Subject_n "TAB" Designation_level_1|Designation_level_2|Designation_level_3|..|Designation_level_n
The number of "descriptor", "Subject" and "Designation_level" elements (children of those 3 main groups) ranges from 1 to n; it is not fixed and is not known in advance.
I have the following code, which does print out the plain text of the interesting part, but I need to address the individual elements and print them out in a new file as described above:
#!/usr/bin/perl
# returns "Classification" descriptors for given CELEX and Language
use strict;
use warnings;
use Mojo::UserAgent;
if (@ARGV != 2) {
    print "Wrong number of arguments!\n";
    print "Syntax: clookup.pl Lang_ID celex_No.\n";
    exit -1;
}
my $lang = $ARGV[0];
my $celex = $ARGV[1];
my $lclang = lc $lang;
# fetch the eurlex page
my $ua = Mojo::UserAgent->new;
my $dom = $ua->get("https://eur-lex.europa.eu/legal-content/$lang/ALL/?uri=CELEX:$celex")->res->dom;
################ let's extract interesting parts:
my $text = $dom->at('#PPClass_Contents')->all_text;
print "$text\n";
EDIT (added):
You can try my Perl script using two arguments:
lang_code ("DE","EN","IT", etc.)
Celex number (e.g.: E2014C0303, 52015BP2212, 52015BP0930(48), 52015BP0930(36), 52015BP0930(41), E2014C0302, E2014C0301, E2014C0271, E2014C0134).
For example (if you name my script "clookup.pl"):
$ perl clookup.pl EN E2014C0303
So, how can I address individual elements (of unknown number) as described above, using Mojo::DOM?
Or, is there something simpler or faster (using Perl)?
You are on the right track. First, you need to understand the HTML inside your #PPClass_Contents. Each set of things is in a definition list. Since you only care about the definition texts, you can search directly for the <dd> elements.
$dom->at('#PPClass_Contents')->find('dd')
This will give you a Mojo::Collection, which you can iterate with ->each. We pass that an anonymous function, pretty much like a callback.
$dom->at('#PPClass_Contents')->find('dd')->each(sub {
    $_; # this is the current element
});
Each element will be passed to that sub, and can be referenced using the topic variable $_. There is a <ul> inside, and each <li> contains a <span> element with the text you want. So let's find those.
$_->find('span')
We can directly build the column in your output at this stage. Let's use the other form of ->each, which turns the Mojo::Collection returned from ->find into a normal Perl list. We can then use a regular map operation to grab each <span>'s text node and join that into a string.
join '|', map { $_->text } $_->find('span')->each
To tie all that together, we declare an array outside this construct, and stick the $celex number in it as the first column.
my @columns = ($celex);
$dom->at('#PPClass_Contents')->find('dd')->each(sub {
    push @columns, join '|', map { $_->text } $_->find('span')->each;
});
Producing the final tab-separated output is now trivial.
print join "\t", @columns;
I've done this with EN as the language and the $celex number 32006L0121, which the search used in its example tooltip. The result is this:
32006L0121 marketing standard|chemical product|approximation of laws|dangerous substance|scientific report|packaging|European Chemicals Agency|labelling Internal market - Principles|Approximation of laws|Technical barriers|Environment|Consumer protection Industrial policy and internal market|Internal market: approximation of laws|Dangerous substances
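For comparison, here is the same pipeline (one column per <dd>, span texts joined with pipes, columns joined with tabs) as a rough Python sketch using only the standard library. It is not a replacement for the Mojo::DOM code above. The class name DefinitionColumns, the trimmed sample markup and the href="#" placeholders are assumptions of mine, and the sketch does not restrict itself to the #PPClass_Contents div the way the Perl version does.

```python
from html.parser import HTMLParser


class DefinitionColumns(HTMLParser):
    """Collect the <span> texts found inside each <dd> element (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.columns = []     # one list of span texts per <dd>
        self.in_dd = False
        self.in_span = False

    def handle_starttag(self, tag, attrs):
        if tag == "dd":
            self.in_dd = True
            self.columns.append([])
        elif tag == "span" and self.in_dd:
            self.in_span = True
            self.columns[-1].append("")

    def handle_endtag(self, tag):
        if tag == "dd":
            self.in_dd = False
        elif tag == "span":
            self.in_span = False

    def handle_data(self, data):
        if self.in_span:
            self.columns[-1][-1] += data.strip()


# Trimmed stand-in for the definition list in the question (hrefs are placeholders).
sample = """<dl class="NMetadata">
<dt>EUROVOC descriptor: </dt>
<dd><ul>
<li><a href="#"><span lang="en">descriptor_1</span></a></li>
<li><a href="#"><span lang="en">descriptor_2</span></a></li>
</ul></dd>
<dt>Subject matter: </dt>
<dd><ul>
<li><a href="#"><span lang="en">Subject_1</span></a></li>
</ul></dd>
</dl>"""

parser = DefinitionColumns()
parser.feed(sample)
parser.close()

celex = "E2014C0303"  # example Celex number from the question
# First column is the Celex number, then one pipe-joined column per <dd>.
line = "\t".join([celex] + ["|".join(col) for col in parser.columns])
print(line)
```

The structure mirrors the Perl: the list per <dd> plays the role of the inner join '|', and the final "\t".join matches the last print.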

Show tags under their own specific tab

I'm creating a tag system on my website. So far it's all working great. But now I wish to create a page that shows all my tags under their own tab like you have with a portfolio page.
Example: http://www.don-zalmrol.be/tags?tag=Electronics
My page already displays the content for a specific tag under its own tab (i.e. electronics), but as you might guess I wish to populate the other tabs with their respective content as well.
So in short, you land on a view that displays the tag you've selected, but on the same page you can see the others as well.
Does anybody have any idea how I can do this? I don't think I'm far away from the solution, as I can already load the projects for a specific tag under its own tab. Now I only need to populate the remaining tabs!
Thanks!
This is my code so far:
http://pastebin.com/jwGW0NKZ
@inherits Umbraco.Web.Mvc.UmbracoTemplatePage
@{
    var portfolio = Umbraco.TagQuery.GetAllContentTags().OrderBy(t => t.Text);
    var tagList = Umbraco.TagQuery.GetAllContentTags().OrderBy(t => t.Text);
    string tag = Request.QueryString["tag"];
    if (!tag.IsNullOrWhiteSpace())
    {
        var publishedContent = Umbraco.TagQuery.GetContentByTag(tag);
        if (publishedContent.Count() > 0)
        {
            @* Show title *@
            <div class="media contact-info wow fadeInDown" data-wow-duration="1000ms" data-wow-delay="600ms">
                <center>
                    <div>
                        <i class="fa fa-tags"></i>
                    </div>
                    <br />
                    <div class="media-body">
                        <h2>Tags</h2>
                        <p>Browse content by tag</p>
                    </div>
                </center>
                <br />
            </div>
            @* Show tag titles in tabs *@
            <ul class="portfolio-filter text-center">
                <li><a class="btn btn-default" href="#" data-filter="*">All tags</a></li>
                @foreach (var tags in tagList)
                {
                    <!-- Create a selected tag -->
                    if(@tags.Text == @tag)
                    {
                        <li><a class="btn btn-default active" href="#" data-filter=".@tag">@tag</a></li>
                    }
                    @* Show all other tags *@
                    else
                    {
                        <li><a class="btn btn-default" href="#" data-filter=".@tags.Text">@tags.Text</a></li>
                    }
                }
            </ul>
            <div class="row">
                <div class="portfolio-items">
                    @* Start picture content *@
                    @foreach (var tags in tagList)
                    {
                        @* Put selected tag in the right tag tab *@
                        if(@tags.Text == @tag)
                        {
                            @* Show tag content *@
                            foreach (var item in publishedContent.OrderByDescending(i => i.CreateDate))
                            {
                                <div class='portfolio-item @tag col-xs-12 col-sm-4 col-md-3'>
                                    <div class="recent-work-wrap">
                                        @* IF the project has a picture *@
                                        @if(item.HasValue("pictureOfTheProject"))
                                        {
                                            var featureImage = Umbraco.TypedMedia((int)item.GetPropertyValue("pictureOfTheProject"));
                                            <img class="img-responsive" src="@featureImage.GetCropUrl(250, 250)" alt='@item.GetPropertyValue("titleOfTheProject")' />
                                            <div class="overlay">
                                                <div class="recent-work-inner">
                                                    <h3>@item.GetPropertyValue("titleOfTheProject")</h3>
                                                    <a class="preview" href="@featureImage.GetCropUrl(250, 250)" rel="prettyPhoto">
                                                        <i class="fa fa-eye"></i> View
                                                    </a>
                                                </div>
                                            </div>
                                        }
                                        @* Else when the project doesnt have a picture, show default one *@
                                        else
                                        {
                                            var noImage = "http://www.don-zalmrol.be/media/1440/no_image_available.png";
                                            <img class="img-responsive" src="@noImage.GetCropUrl(250, 250)" alt="No image" />
                                            <div class="overlay">
                                                <div class="recent-work-inner">
                                                    <h3>@item.GetPropertyValue("titleOfTheProject")</h3>
                                                    <a class="preview" href="@noImage.GetCropUrl(250, 250)" rel="prettyPhoto">
                                                        <i class="fa fa-eye"></i> View
                                                    </a>
                                                </div>
                                            </div>
                                        }
                                    </div>
                                </div>
                            }
                        }
                        @* Put the other tags under there own tab *@
                        else
                        {
                        }
                    }
                    @* End dynamic tags *@
                </div>
            </div>
        }
        @* No content matching the tag? *@
        else
        {
            <p>There isn't any content matching that tag.</p>
            @Html.Partial("TagList")
        }
    }
    @* Show the tag list with amount *@
    else
    {
        @Html.Partial("TagList")
    }
}
EDIT 27-03-2016
OK, so I now know that I need to play around with my tag query or use the IEnumerable. But I can't figure out how to do this without breaking the code...
@* Get all tags and order them by name *@
var tagList = Umbraco.TagQuery.GetAllContentTags().OrderBy(t => t.Text);
@* Get requested tag *@
string tag = Request.QueryString["tag"];
@* Show all content by requested tag *@
var publishedContent = Umbraco.TagQuery.GetContentByTag(tag);
The pieces of code above list all the tags I have in a var, get the tag name from the URL (i.e. Electronics), and then fetch all content that matches the queried tag.
So in short, I need to change the last part of the TagQuery to list all content that has a tag, and then filter it by the query string to display each item in its own category.
But how can I list all tagged content?
Cheers,
Don
Your issue is on line 76 of the pastebin where you loop through the published content, which is already filtered by the selected tag. Because of this, only content with the selected tag ever gets written to the template.
What you need to do is use the loop on line 70 where you're looping through all tags. The jQuery plugin you're using will handle the filtering by using the CSS class you add on line 78. You can get rid of the loop on line 76.
For clarity, I would also change the
@foreach (var tags in tagList)
to
@foreach (var item in tagList)
since each iteration of the foreach produces a single tag rather than the plural "tags" your variable name implies. This has the added bonus of allowing you to keep the rest of your code the same once you remove the loop on line 76.
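To make the idea concrete outside of Razor, here is a small Python sketch of one way to read the advice: write out every item once per tag it carries, with the tag name as a CSS class, and let the client-side filter decide what is visible. The projects list, its field names and render_items are made up for illustration; in the real template the data would come from Umbraco's tag query.

```python
# Hypothetical data standing in for the tagged Umbraco content.
projects = [
    {"title": "Amplifier", "tags": ["Electronics"]},
    {"title": "Blog theme", "tags": ["Web"]},
    {"title": "Sensor dashboard", "tags": ["Electronics", "Web"]},
]


def render_items(projects):
    """Emit one block per (tag, project) pair, using the tag as the CSS filter class."""
    all_tags = sorted({tag for project in projects for tag in project["tags"]})
    blocks = []
    for tag in all_tags:  # loop over ALL tags, not only the selected one
        for project in projects:
            if tag in project["tags"]:
                blocks.append(
                    f"<div class='portfolio-item {tag}'>{project['title']}</div>"
                )
    return blocks


blocks = render_items(projects)
for block in blocks:
    print(block)
```

The point is only that nothing is filtered by the selected tag on the server; which tab starts out active is a separate concern.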

Perl Scrappy select using class attribute

I was trying to scrape using Perl Scrappy. I would like to select html elements with class attribute using 'select'.
<p>
<h1>
<a href='http://test.com'>Test</a>
<a href='http://list.com'>List</a>
</h1>
</p>
<p class='parent-1'>
<h1>
<a class='child-1' href="http://sample.com">SampleLink</a>
<a class='child-2' href="http://list.com">List</a>
</h1>
</p>
I need to get the element ('a' tag) with class name 'child-1' which is a child node of <p class='parent-1'> using the select method.
I have tried this:
#!/usr/bin/perl
use Scrappy;
my $scraper = Scrappy->new;
$scraper->get($url);
$scraper->select('p a')->data;
But it will select the first 'p' tag also.
Could you please help me with this?
Bearing in mind choroba's warning, to select an <a> element with a class of child-1 that is a child of a <p> element with a class of parent-1 you would write
$scraper->select('p.parent-1 > a.child-1')
The problem is that in HTML, a <p> tag can't contain a <h1> tag. In fact, the HTML is parsed as
<p></p>
<h1>
<a href='http://test.com'>Test</a>
<a href='http://list.com'>List</a>
</h1>
<p class='parent-1'></p>
<h1>
<a class='child-1' href="http://sample.com">SampleLink</a>
<a class='child-2' href="http://list.com">List</a>
</h1>

How to create a comma separated tag list in Docpad?

The tags for a post can be accessed within the post like this (using embedded CoffeeScript):
<div class="tags">
Tags:
<ul>
<% for tag in @document.tags: %>
<li><a class="tag_item" href="<%= @getTagUrl(tag) %>"><%= tag %></a></li>
<% end %>
</ul>
</div>
This generates an unordered list of the tags for this specific topic, like this:
Tags:
tag1
tag2
tag3
How can I generate the list of tags as comma separated values on a single line, like this:
Tags: tag1, tag2, tag3
I do it like this on my blog:
<div class="post-tags">
Posted In: <%- ("<a href='#{@getTagUrl(tag)}'>#{tag}</a>" for tag in @tags).join(', ') %>
</div>
Note, @getTagUrl comes from the docpad-plugin-tagging plugin. If you don't want hyperlinks to a page for each tag, you could simplify this to the following:
<div class="post-tags">
Posted In: <%- (tag for tag in @tags).join(', ') %>
</div>
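The trick in both templates is the same: build a list of strings with a comprehension, then join it with ', ' so that separators only appear between items. Here is the same pattern as a small Python sketch, in case the CoffeeScript comprehension syntax is unfamiliar. get_tag_url is a hypothetical stand-in for the plugin's getTagUrl helper, and the /tags/ URL scheme is an assumption.

```python
def get_tag_url(tag):
    # Hypothetical stand-in for the tagging plugin's getTagUrl helper.
    return f"/tags/{tag}"


tags = ["tag1", "tag2", "tag3"]

# Linked version: one anchor per tag, joined with ", ".
linked = ", ".join(f"<a href='{get_tag_url(tag)}'>{tag}</a>" for tag in tags)

# Plain version: just the tag names.
plain = ", ".join(tags)

print("Posted In: " + plain)
print("Posted In: " + linked)
```

Because join only inserts the separator between elements, there is no trailing comma to strip afterwards.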

hpple html parse block by block or property by property?

I'm new to hpple and XPath. For the HTML code below, I want to get both the "title" and "tag" information.
From hpple's example code, I can get an array of titles and another array of tags. But if there are six properties I'm interested in, there will be six arrays.
Can I find the div[class="entry"], then get its children, such as div[class="meta"]? (Can anybody share the code?)
Thanks.
<div class="content">
<div id="1" class="entry">
<h2 class="title"> title for entry 1 </h2>
<div class="meta"> tag:xxx </div>
</div>
<div id="2" class="entry">
<h2 class="title"> title for entry 2 </h2>
<div class="meta"> tag:xxx </div>
</div>
...
</div>
@"//div[@class='content']//div[@class='entry']//div[@class='meta']"
This returns tag:xxx for both entries.
I want to get both "title" and "tag" information
//div[@class='content']/div[@class='entry']/*[@class='meta' or @class='title']
This XPath selects every element with class title or meta that is a child of a div with class entry, which in turn is a child of any div with class content.
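On the "block by block" part of the question: the usual pattern is to select the entry nodes first and then run short relative queries against each one, so that every entry yields a single record instead of six parallel arrays. The sketch below shows that pattern in Python with xml.etree.ElementTree rather than hpple, so treat it as an illustration of the idea and not as hpple code. It assumes the markup is well-formed enough to parse as XML, and the "..." placeholder from the question is left out of the sample.

```python
import xml.etree.ElementTree as ET

# The sample from the question, without the "..." placeholder.
markup = """<div class="content">
  <div id="1" class="entry">
    <h2 class="title"> title for entry 1 </h2>
    <div class="meta"> tag:xxx </div>
  </div>
  <div id="2" class="entry">
    <h2 class="title"> title for entry 2 </h2>
    <div class="meta"> tag:xxx </div>
  </div>
</div>"""

root = ET.fromstring(markup)  # root is the div with class "content"

entries = []
# Step 1: select each entry block.
for entry in root.findall("./div[@class='entry']"):
    # Step 2: query relative to that block only.
    title = entry.find("./h2[@class='title']")
    meta = entry.find("./div[@class='meta']")
    entries.append({
        "id": entry.get("id"),
        "title": title.text.strip() if title is not None else None,
        "meta": meta.text.strip() if meta is not None else None,
    })

for record in entries:
    print(record)
```

Adding a seventh property then means adding one more relative lookup inside the loop, and a missing child in one entry cannot shift the values of the others out of alignment.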