Perl Scrappy select using class attribute - perl

I am trying to scrape a page using Perl Scrappy. I would like to select HTML elements by their class attribute using 'select'.
<p>
<h1>
<a href='http://test.com'>Test</a>
<a href='http://list.com'>List</a>
</h1>
</p>
<p class='parent-1'>
<h1>
<a class='child-1' href="http://sample.com">SampleLink</a>
<a class='child-2' href="http://list.com">List</a>
</h1>
</p>
I need to get the element ('a' tag) with class name 'child-1' which is a child node of <p class='parent-1'> using the select method.
I have tried this:
#!/usr/bin/perl
use Scrappy;
my $scraper = Scrappy->new;
$scraper->get($url);
$scraper->select('p a')->data;
But this also selects the links under the first 'p' tag.
Could you please help me with this?

Bearing in mind choroba's warning, to select an <a> element with a class of child-1 that is a child of a <p> element with a class of parent-1 you would write
$scraper->select('p.parent-1 > a.child-1')

The problem is that in HTML, a <p> tag can't contain an <h1> tag. In fact, the HTML is parsed as
<p></p>
<h1>
<a href='http://test.com'>Test</a>
<a href='http://list.com'>List</a>
</h1>
<p class='parent-1'></p>
<h1>
<a class='child-1' href="http://sample.com">SampleLink</a>
<a class='child-2' href="http://list.com">List</a>
</h1>
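If the markup can be changed so that the wrapper is an element that may legally contain a heading, a class-qualified descendant selection picks out only the wanted link. The sketch below is a Python stand-in, not Scrappy: it assumes the <p> wrappers are replaced by <div> (that change is not in the original markup) and uses the XPath subset of Python's ElementTree. The CSS equivalent would be 'div.parent-1 a.child-1'.

```python
import xml.etree.ElementTree as ET

# Assumption: the outer <p> wrappers are replaced by <div>, which may contain <h1>.
doc = """
<body>
  <div>
    <h1>
      <a href="http://test.com">Test</a>
      <a href="http://list.com">List</a>
    </h1>
  </div>
  <div class="parent-1">
    <h1>
      <a class="child-1" href="http://sample.com">SampleLink</a>
      <a class="child-2" href="http://list.com">List</a>
    </h1>
  </div>
</body>
"""

root = ET.fromstring(doc)

# Any <a class="child-1"> that is a descendant of <div class="parent-1">.
matches = root.findall(".//div[@class='parent-1']//a[@class='child-1']")
hrefs = [a.get("href") for a in matches]
print(hrefs)
```

Links under the first, unclassed wrapper are not matched because the class test on the wrapper excludes them.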

Related

How to stop querying when it reaches a specific class with XPath?

Say I have the following:
<div class="data">
<h2 class="entry-contentH2">Preparation</h2>
<h2>Airplanes</h2>
<ul>
<li><strong>3 large</strong> wings</li>
<li><strong>2</strong>doors</li>
</ul>
<h2>Car</h2>
<ul>
<li><strong>4</strong> doors</li>
<li><strong>1 cup</strong> holder</li>
</ul>
<h2 class="stopHeader">Execution</h2>
<h2>Motorcycles</h2>
<ul>
<li>Easy to learn</li>
</ul>
</div>
I'm trying to query the text of all the <h2> tags after <h2>Preparation</h2>, but I want it to stop at the last <h2> before the stopHeader class.
This is the code that I came up with:
//h2[contains(.,"Preparation")]/following-sibling::h2/text()[not(preceding::h2[@class="stopHeader"])]
#and also
//h2[contains(.,"Preparation")]/following-sibling::h2/text()[not(preceding::h2[contains(., "Execution")])]
Try the XPath below to get the desired output:
//h2[.="Preparation"]/following-sibling::h2[./following-sibling::h2[.="Execution"]]/text()
This should return the text content of each header (h2) between "Preparation" and "Execution".
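The same "collect until the stop marker" logic can be sketched in plain Python. This is not the XPath itself: Python's ElementTree does not support the following-sibling axis, so the sketch walks the children of the container in document order instead. It assumes the markup from the question is well-formed enough to parse as XML.

```python
import xml.etree.ElementTree as ET

doc = """
<div class="data">
  <h2 class="entry-contentH2">Preparation</h2>
  <h2>Airplanes</h2>
  <ul><li><strong>3 large</strong> wings</li><li><strong>2</strong>doors</li></ul>
  <h2>Car</h2>
  <ul><li><strong>4</strong> doors</li><li><strong>1 cup</strong> holder</li></ul>
  <h2 class="stopHeader">Execution</h2>
  <h2>Motorcycles</h2>
  <ul><li>Easy to learn</li></ul>
</div>
"""

root = ET.fromstring(doc)

headers = []
collecting = False
for child in root:
    if child.tag != "h2":
        continue  # only headers are of interest
    if child.get("class") == "stopHeader":
        break  # stop at the marker header
    if collecting:
        headers.append(child.text)
    elif child.text == "Preparation":
        collecting = True  # start after the "Preparation" header

print(headers)
```

Headers after the stop marker, such as "Motorcycles", are never reached because the loop breaks first.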
Try this XPath.
//h2[text()='Preparation']/following::h2[not(@class='stopHeader')][following::h2[@class='stopHeader']]/text()

Selenium- Tag traversal Xpath for facebook First name not working

<div id="reg_form_box" class="large_form">
<div class="clearfix _58mh">
<div class="mbm _3-90 lfloat _ohe">
<div id="u_0_0" class="_5dbb">
<div class="uiStickyPlaceholderInput uiStickyPlaceholderEmptyInput">
<div class="placeholder" aria-hidden="true">First name</div>
<input id="u_0_1" class="inputtext _58mg _5dba _2ph-" data-type="text" name="firstname" aria-required="1" placeholder="" aria-label="First name" aria-controls="js_0" aria-haspopup="true" role="null" aria-describedby="js_w" aria-invalid="true" type="text"/>
</div>
<i class="_5dbc img sp_beZQzZ7Rg6Q sx_5ca7f2"/>
<i class="_5dbd img sp_beZQzZ7Rg6Q sx_9c246c"/>
</div>
Above is the markup for which I want to write an XPath using tag name traversal. Here is the XPath I have written:
"//div[@id='reg_form_box']/div[1]/div[1]/div[1]/div/input"
Please suggest what's wrong here and how I can correct it. The website is Facebook and the field is First name on the homepage.
Ideally, unless multiple nodes match the same XPath, you don't have to traverse the entire hierarchy.
This will work:
//input[@name='firstname']
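To illustrate why the attribute-based lookup is enough, here is a small Python sketch using ElementTree's XPath subset rather than Selenium. The markup is a trimmed, well-formed copy of the snippet above (the closing tags and the reduced attribute list are assumptions made so that it parses). Both the full positional path and the short attribute test land on the same element.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed version of the form markup (closing tags added).
doc = """
<div id="reg_form_box" class="large_form">
  <div class="clearfix _58mh">
    <div class="mbm _3-90 lfloat _ohe">
      <div id="u_0_0" class="_5dbb">
        <div class="uiStickyPlaceholderInput uiStickyPlaceholderEmptyInput">
          <div class="placeholder" aria-hidden="true">First name</div>
          <input id="u_0_1" name="firstname" aria-label="First name" type="text"/>
        </div>
        <i class="_5dbc img"/>
        <i class="_5dbd img"/>
      </div>
    </div>
  </div>
</div>
"""

root = ET.fromstring(doc)  # root is the reg_form_box div

# Full positional traversal, relative to the reg_form_box element.
by_path = root.find("./div[1]/div[1]/div[1]/div/input")

# Short lookup by a stable attribute, independent of nesting depth.
by_name = root.find(".//input[@name='firstname']")

print(by_path.get("id"), by_name.get("id"))
```

The attribute lookup keeps working if a wrapper div is added or removed, whereas the positional path breaks as soon as the nesting changes.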

How to find sibling element with behat/mink?

HTML:
<div id="my-id">
<li class="list_element">
<div class="my_class"></div>
</li>
<li class="list_element">
<div class="another_class"></div>
</li>
<li class="list_element">
<div class="class3"></div>
</li>
</div>
What I want to do with behat/mink:
$page = $this->getSession()->getPage();
$selector = $page->find('css', "#my-id .my_class"); //here I need anchor element located near to .my_class div.
I don't know which .list_element contains the .my_class div. I only know that the anchor is next to the .my_class element. Which selector should I use in the find() function?
Try one of these:
#my-id .my_class ~ a
#my-id .my_class + a
#my-id .list_element a
This is a very basic question. Please see more at w3schools.

Show tags under their own specific tab

I'm creating a tag system on my website. So far it's all working great. But now I wish to create a page that shows all my tags under their own tabs, like you have on a portfolio page.
Example: http://www.don-zalmrol.be/tags?tag=Electronics
My page already displays the content for a specific tag under its own tab (i.e. Electronics), but as you might guess I wish to populate the other tags in their respective tabs as well.
So in short, you land on a view that displays the tag you've selected, but on the same page you can see the others as well.
Does anybody have any idea how I can do this? I don't think I'm far from the solution, as I can already load the projects for a specific tag under its own tab. Now I only need to populate the remaining tabs with their content!
Thanks!
This is my code so far:
http://pastebin.com/jwGW0NKZ
@inherits Umbraco.Web.Mvc.UmbracoTemplatePage
@{
var portfolio = Umbraco.TagQuery.GetAllContentTags().OrderBy(t => t.Text);
var tagList = Umbraco.TagQuery.GetAllContentTags().OrderBy(t => t.Text);
string tag = Request.QueryString["tag"];
if (!tag.IsNullOrWhiteSpace())
{
var publishedContent = Umbraco.TagQuery.GetContentByTag(tag);
if (publishedContent.Count() > 0)
{
@* Show title *@
<div class="media contact-info wow fadeInDown" data-wow-duration="1000ms" data-wow-delay="600ms">
<center>
<div>
<i class="fa fa-tags"></i>
</div>
<br />
<div class="media-body">
<h2>Tags</h2>
<p>Browse content by tag</p>
</div>
</center>
<br />
</div>
@* Show tag titles in tabs *@
<ul class="portfolio-filter text-center">
<li><a class="btn btn-default" href="#" data-filter="*">All tags</a></li>
@foreach (var tags in tagList)
{
<!-- Create a selected tag -->
if(@tags.Text == @tag)
{
<li><a class="btn btn-default active" href="#" data-filter=".@tag">@tag</a></li>
}
@* Show all other tags *@
else
{
<li><a class="btn btn-default" href="#" data-filter=".@tags.Text">@tags.Text</a></li>
}
}
</ul>
<div class="row">
<div class="portfolio-items">
@* Start picture content *@
@foreach (var tags in tagList)
{
@* Put selected tag in the right tag tab *@
if(@tags.Text == @tag)
{
@* Show tag content *@
foreach (var item in publishedContent.OrderByDescending(i => i.CreateDate))
{
<div class='portfolio-item @tag col-xs-12 col-sm-4 col-md-3'>
<div class="recent-work-wrap">
@* IF the project has a picture *@
@if(item.HasValue("pictureOfTheProject"))
{
var featureImage = Umbraco.TypedMedia((int)item.GetPropertyValue("pictureOfTheProject"));
<img class="img-responsive" src="@featureImage.GetCropUrl(250, 250)" alt='@item.GetPropertyValue("titleOfTheProject")' />
<div class="overlay">
<div class="recent-work-inner">
<h3>@item.GetPropertyValue("titleOfTheProject")</h3>
<a class="preview" href="@featureImage.GetCropUrl(250, 250)" rel="prettyPhoto">
<i class="fa fa-eye"></i> View
</a>
</div>
</div>
}
@* Else when the project doesnt have a picture, show default one *@
else
{
var noImage = "http://www.don-zalmrol.be/media/1440/no_image_available.png";
<img class="img-responsive" src="@noImage.GetCropUrl(250, 250)" alt="No image" />
<div class="overlay">
<div class="recent-work-inner">
<h3>@item.GetPropertyValue("titleOfTheProject")</h3>
<a class="preview" href="@noImage.GetCropUrl(250, 250)" rel="prettyPhoto">
<i class="fa fa-eye"></i> View
</a>
</div>
</div>
}
</div>
</div>
}
}
@* Put the other tags under there own tab *@
else
{
}
}
@* End dynamic tags *@
</div>
</div>
}
@* No content matching the tag? *@
else
{
<p>There isn't any content matching that tag.</p>
@Html.Partial("TagList")
}
}
@* Show the tag list with amount *@
else
{
@Html.Partial("TagList")
}
}
EDIT 27-03-2016
OK, so I now know that I need to play around with my tag query or use the IEnumerable. But I can't seem to figure out how to do this without breaking the code...
@* Get all tags and order them by name *@
var tagList = Umbraco.TagQuery.GetAllContentTags().OrderBy(t => t.Text);
@* Get requested tag *@
string tag = Request.QueryString["tag"];
@* Show all content by requested tag *@
var publishedContent = Umbraco.TagQuery.GetContentByTag(tag);
The snippets above list all the tags I have in a var, get the requested tag name from the URL (i.e. Electronics), and fetch all content that matches the queried tag.
So in short I need to change the last part of the TagQuery to list all content that has a tag, and then filter it by the query string to display each item in its own category.
But how can I list all tagged content?
Cheers,
Don
Your issue is on line 76 of the pastebin where you loop through the published content, which is already filtered by the selected tag. Because of this, only content with the selected tag ever gets written to the template.
What you need to do is use the loop on line 70 where you're looping through all tags. The jQuery plugin you're using will handle the filtering by using the CSS class you add on line 78. You can get rid of the loop on line 76.
For clarity, I would also change the
@foreach (var tags in tagList)
to
@foreach (var item in tagList)
since the foreach is producing a single tag rather than plural "tags" as your code insinuates. This has the added bonus of allowing you to keep the rest of your code the same once you remove the loop on line 76.
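The idea of rendering every tag's content once, each item carrying its tag as a CSS class so the client-side filter can do the rest, can be sketched in plain Python. This is not Umbraco code: the content_by_tag dictionary and the render_items helper are hypothetical stand-ins for whatever a per-tag lookup such as GetContentByTag would return inside the loop over all tags.

```python
# Hypothetical stand-in data: tag name -> titles of the content carrying that tag.
content_by_tag = {
    "Electronics": ["Arduino clock", "LED cube"],
    "Woodwork": ["Bookshelf"],
}


def render_items(content_by_tag):
    """Emit one portfolio item per (tag, content) pair, classed by its tag."""
    items = []
    for tag in sorted(content_by_tag):          # outer loop: every tag
        for title in content_by_tag[tag]:       # inner loop: content for that tag
            items.append(f'<div class="portfolio-item {tag}">{title}</div>')
    return items


items = render_items(content_by_tag)
for line in items:
    print(line)
```

Because every item is written to the page with its tag as a class, a filter keyed on ".Electronics" or ".Woodwork" only has to show or hide existing elements.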

hpple html parse block by block or property by property?

I'm new to hpple and XPath. For the HTML below, I want to get both the "title" and "tag" information.
From hpple's example code, I can get an array of titles and another array of tags. But if there are six properties I'm interested in, there will be six arrays.
Can I find each div[class="entry"] first and then get its children, such as div[class="meta"]? (Can anybody share the code?)
Thanks.
<div class="content">
<div id="1" class="entry">
<h2 class="title"> title for entry 1 </h2>
<div class="meta"> tag:xxx </div>
</div>
<div id="2" class="entry">
<h2 class="title"> title for entry 2 </h2>
<div class="meta"> tag:xxx </div>
</div>
...
</div>
@"//div[@class='content']//div[@class='entry']//div[@class='meta']"
This returns tag:xxx for both entries.
I want to get both "title" and "tag" information
//div[@class='content']/div[@class='entry']/*[@class='meta' or @class='title']
This XPath selects every element with class title or meta that is a child of a div with class entry, which is itself a child of a div with class content.
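The block-by-block approach asked about in the question (find each entry first, then read its children) can be sketched as follows. This is a Python illustration with ElementTree, not hpple: it only shows the two-step lookup, and the elided "..." from the sample markup is left out so the snippet parses.

```python
import xml.etree.ElementTree as ET

doc = """
<div class="content">
  <div id="1" class="entry">
    <h2 class="title"> title for entry 1 </h2>
    <div class="meta"> tag:xxx </div>
  </div>
  <div id="2" class="entry">
    <h2 class="title"> title for entry 2 </h2>
    <div class="meta"> tag:xxx </div>
  </div>
</div>
"""

root = ET.fromstring(doc)  # root is the div with class "content"

entries = []
# Step 1: select each entry block.
for entry in root.findall("./div[@class='entry']"):
    # Step 2: query relative to that block, so the values stay paired.
    title = entry.find("./h2[@class='title']").text.strip()
    meta = entry.find("./div[@class='meta']").text.strip()
    entries.append((title, meta))

print(entries)
```

With six properties of interest this still yields one record per entry instead of six parallel arrays.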