Using xpath to parse out html attributes from webpage - powershell

I am having trouble extracting some attributes out of an html page and need some ideas to help me get unstuck.
I am using PowerShell with the HtmlAgilityPack to parse the HTML. I have a very crude version built with regex, but it doesn't always work, so I thought the better option would be to use XPath to parse the results. If regex is the way to go, please let me know.
So far I have been able to grab the page that I am interested in and split it apart by rows.
$results = $htmldoc.DocumentNode.SelectNodes("//p[@class='row']")
After the page is split up I am trying to iterate through each row using xpath to grab the information I am interested in.
ForEach ($item in $results) {
$ID=$null
$ID = $item.OuterHtml
}
This gets me close to what I want, but it grabs a bunch of other info that I don't want as well. Here is what $item.OuterHtml looks like at this point.
OuterHtml : <p class="row" data-latitude="41.5937565437255" data-longitude="-93.6437636649079" data-pid="4184719674">
<span class="star"></span> <span class="pl"> <span class="date">Nov 27</span> iPhone and other Cell Phone Unlocks
</span> <span class="l2"> <span class="pnr"> <small> (Des Moines)</small> <span class="px"> <span class="p"> <a href="#" class="maptag"
data-pid="4184719674">map</a></span></span> </span> <a class="gc" href="/mod/" data-cat="mod">cell phones - by dealer</a> </span> </p>
I just want the data-pid attribute.
I have tried a bunch of other ways to extract the data-pid attribute but haven't had any success. Here is one such method I have tried, but it keeps returning the same value over and over.
$ID = $Date.DocumentNode.SelectSingleNode("//p/@data-pid")
I have a feeling that this is something simple but have hit a roadblock. Let me know what other information I need to post.

In your foreach loop you should be able to get the attribute's value like this:
$ID = $item.GetAttributeValue("data-pid", "")
To walk all the attributes on that node try:
$item.Attributes | Select Name,Value

Related

How can I find a specific piece of a webpage using Mojo::DOM?

Below I put an extract from an IMDb page; I purposely kept it short. My end goal is to get the 2 links, but I can't even figure out how to get a specific div with an id, because the class below is obviously spread out all over the page. I've Googled, looking for an example using class and id, but still can't find the solution.
p.s. The only reason I have Dumper in there is so when I run it, I can instantly see how I still haven't got it.
my $ua = Mojo::UserAgent->new( max_redirects=>3, timeout => 30 );
my $dom = $ua->get($newrip)->result->dom;
my $module_list = $dom->find('div.article');
print Dumper $module_list;
exit;
<div class="article" id="titleDetails">
<span class=rightcornerlink>
Edit
</span>
<h2>Details</h2>
<div class="txt-block">
<h4 class="inline">Official Sites:</h4>
<a href="/offsite/?page-action=offsite-facebook&token=BCYtlGRCvvzcjTOrSRBQqjTPuEUBGkxbnkfQjRYZi0XJxm-A-A4vf0mzJF5WqH6HYLt2TZCuVR7c%0D%0A209QQMCwUe-51EwtDDYbNczYCnFRIRzctUhoXJCF2gsQJw6m050sV9g0sTJJEfiGP37rfeIIoXMS%0D%0ACfj2qgUNCaL2YaP_FeWVGCg39Bw-3dRsP5cB1Wk9FfobPd5tG8Q4WjVbUR2pTOvE0Pkc5QUK5E7U%0D%0AX7O9awNb0Kw%0D%0A&ref_=tt_pdt_ofs_offsite_0"
rel="nofollow">Official Facebook</a>
<span class="ghost">|</span>
Official site
<span class="see-more inline"></span>
</div>

Locators combination

I am working on protractor to test the AngularJs application. Here I came across one scenario where I want to click on image for different users. But the id for image is same for all (say 10) users. So I found one more element that is one unique number allocated to each user. The code for 2 different users are:
USER1:
<img id="searchPatientImgAdmittedM" class="img-circle picwidth" ng-click="getPatientVitalLabPharmacy(patient.patientId._id)" onclick="ShowHide(this)" src="icons/male.png" alt="" role="button" tabindex="0">
<span class="clearfloat ng-binding">12339</span>
USER2:
<img id="searchPatientImgAdmittedM" class="img-circle picwidth" ng-click="getPatientVitalLabPharmacy(patient.patientId._id)" onclick="ShowHide(this)" src="icons/male.png" alt="" role="button" tabindex="0">
<span class="clearfloat ng-binding">8841</span>
EDIT:
The full HTML code
<div class="col-md-10 col-sm-9 col-xs-9 skin-font-color paddingTop7">
<span class="skin-font-color">
<span class="name clearfloat ng-binding">KRISHA</span>
<span class="clearfloat ng-binding">12348</span>
<img id="searchPatientImgAdmittedF" class="img-circle picwidth" ng-click="getPatientVitalLabPharmacy(patient.patientId._id)" onclick="ShowHide(this)" src="icons/femaleImages.jpg" alt="" role="button" tabindex="0">
</div>
I tried to do :
element(by.id('searchPatientImgAdmittedF')).all(by.tagName('12348')).click();
// or
element(by.id('searchPatientImgAdmittedF')).element(by.tagName('12348')).click();
How can I combine locators to click on a specific user? Only the image part is clickable.
Thanks for your additions.
Now you're trying to click on a sibling element. There are several approaches to do so.
The one I'm usually using is:
element(by.cssContainingText('span.clearfloat','12348')).element(by.xpath('..')).$('#searchPatientImgAdmittedF').click();
//equal to
element(by.cssContainingText('span.clearfloat','12348')).element(by.xpath('..')).element(by.id('searchPatientImgAdmittedF')).click();
This evaluates first the identifiable tag with the unique number, then climbs up to its parent element, then from there gets the img-element with the ID.
The $() selector
The cssContainingText() selector
Another option would be to use isElementPresent(), which evaluates the existence of a child element. However, the code is (from my point of view) more complex, and I don't see how cssContainingText() could be used there, so I won't attempt it here.
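Since the full HTML in the question shows the image as a later sibling of the span that holds the patient number, a single XPath using the following-sibling axis might also work. This is only a sketch: the helper name imgXPathForPatient is made up for illustration, and it assumes the number is the span's entire text and that the image follows it under the same parent.

```javascript
// Hypothetical helper (not part of Protractor): builds an XPath that finds the
// span whose text is the patient number, then steps to the first <img> that
// follows it under the same parent.
function imgXPathForPatient(patientNumber) {
  // Only digits are accepted, so the value can be embedded in the expression safely.
  if (!/^\d+$/.test(String(patientNumber))) {
    throw new Error('patient number must contain digits only');
  }
  return "//span[normalize-space(.)='" + patientNumber + "']" +
         "/following-sibling::img[1]";
}

// Inside a Protractor spec (where `element` and `by` are globals) this could be
// used as:
//   element(by.xpath(imgXPathForPatient(12348))).click();
console.log(imgXPathForPatient(12348));
```

Because the expression is anchored on the number rather than on the shared id, it does not depend on the image id being unique.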
Thanks for your quick help in solving my issue. I want to add here that I found the answer to my problem and now I am able to click on the particular user I want from the list of many users. The code I am using is :
element(by.cssContainingText('span.clearfloat','12339'))
.element(by.xpath('/html/body/div[3]/div[1]/div[17]/div/div/table[4]/tbody/tr[3]/td[1]/div[1]/img'))
.click();
This finds the child element first and then the parent element. The id was the same for all the users, so it could not be used, and I used only the XPath along with the unique number.
Thanks again for the help.

AngularJS retrieve from object based on entry in ng-repeat input

This application is for running a writing contest.
Coordinators are assigning entries to judges for them to judge. I have three sets of data I retrieve from the server: a judge list, an entries list, and an assignment list that ties the two together. There can be a variable number of input fields: if a judge has agreed to judge 4 entries, there will be 4 inputs; if 7, then 7.
I have all of that working OK, but only insofar as the entry number can be input and the data updated.
Now I would like to confirm that the entryID IS a valid ID by checking the list, and also to show a field or two on the screen so the coordinator knows that they typed in the right entry.
The relevant section of the HTML
<div ng-app>
<div id="assignment" ng-controller="AssignData" ng-init="JudgeID=107;CategorySelect='MS';PublishSelect='P'">
<div ng-show="loaded">
<form class="entryform ng-cloak" name="assignform" ng-submit="sendForm()">
<p>Entry numbers assigned to this judge</p>
<p ng-repeat="assign in (formassigns =(assigns | filter:AssignedJudge))">
<input type="text" ng-model="assign.entryid" required/>
{{entries.authorname}} {{entries.entrytitle}}
</p>
<button type="submit">Save Assignments</button>
<p>This will keep the assignments attached to this judge.
You will be able to send all of your assignments to all
of your judges when you are finished.</p>
</form>
</div>
</div>
</div>
The part that I haven't been able to figure out is how to make entries.authorname and entries.entrytitle show up when the user types in an entryid that is in entries.entryid.
assigns and entries are both arrays of records using JSON
assigns is JSON made up of assigns.id, assigns.judgeid, assigns.entryid.
entries is JSON made up of entries.entryid, entries.entrytitle, entries.authorname
When assigns arrives, entryid is empty. The form is used to fill in the entryid and when it is filled in, I'd like to be able to show next to it the title and authorname for that entry.
NOTE: I've added some important information at the end of this answer. So please read to the end before you decide what you're going to do.
You're going to have to do something that does the look up.
Also a few other changes I'd add, mostly so you can actually validate the items in your repeat.
(There's a summary of what I did after the pseudocode below).
<div ng-app>
<div id="assignment" ng-controller="AssignData"
ng-init="JudgeID=107;CategorySelect='MS';PublishSelect='P'">
<div ng-show="loaded">
<form class="entryform ng-cloak" name="assignform" ng-submit="sendForm()">
<p>Entry numbers assigned to this judge</p>
<p ng-repeat="assign in (formassigns =(assigns | filter:AssignedJudge))"
ng-form="assignForm">
<input type="text" ng-model="assign.entryid"
ng-change="checkEntryId(assign, assignForm)"
name="entryid" required/>
<span ng-show="assignForm.entryid.$error.required">required</span>
<span ng-show="assignForm.$error.validEntry">
{{assignForm.$error.validEntry[0]}}</span>
{{assign.entry.authorname}} {{assign.entry.entrytitle}}
</p>
<button type="submit">Save Assignments</button>
<p>This will keep the assignments attached to this judge.
You will be able to send all of your assignments to all
of your judges when you are finished.</p>
</form>
</div>
</div>
</div>
Then in your controller, you'd add a function like so (be sure to inject $http or a service you wrote to pull the values from the server):
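$scope.checkEntryId = function(assign, form) {
    // $http.get returns a promise; the callbacks are passed to .then(), not to get() itself
    $http.get('/CheckEntry?id=' + encodeURIComponent(assign.entryid))
        .then(function(response) {
            // the response body is expected to hold the matching entry, or nothing
            var entry = response.data;
            if (entry) {
                assign.entry = entry;
                form.$setValidity('validEntry', true);
            } else {
                form.$setValidity('validEntry', false, 'No entry found with that id');
            }
        }, function() {
            form.$setValidity('validEntry', true, 'An error occurred during the request');
            console.log('an error occurred');
        });
};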
The basic idea above:
Use ng-form on your repeating elements to allow for validation of those dynamic parts.
Create a function that you can pass your item and your nested form to.
In that function, make your AJAX call to see if the entry is valid.
Check the validity based on the response, and call $setValidity on your nested form you passed to the function.
Use ng-show on a span (or something) in your nested form to show your validation messages.
Also, assign your checked entry to your repeated object for display purposes. (You could use a separate array if you want, I suppose, but that would probably get unnecessarily complicated.)
I hope that helps.
EDIT: Other thoughts
1. You might want to wrap your call in a $timeout or some sort of throttling function to prevent the entry id check from spamming your server. This is an implementation detail that's totally up to you.
2. If this is a check you do all over the place, you'll probably want to create a directive to do it. The idea would be very similar, but you'll do the check inside of a $parser on the ngModelController.
3. The method I showed above will still actually update the model's entryid, even if it's invalid. This is usually not a big deal. If it is, you'll want to go with what I suggested in "other thought #2", which is a custom validation directive.
If you need more information about validation via custom directives, I did a blog entry on that a while back.
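Since the question says the entries list is already loaded on the client, the check could perhaps be done locally instead of with a request per keystroke. The following is a minimal sketch under that assumption; the field names come from the question, while buildEntryIndex, findEntry and the sample data are made up for illustration.

```javascript
// Builds an index from entryid to entry so that each lookup is cheap.
// buildEntryIndex and findEntry are hypothetical helper names.
function buildEntryIndex(entries) {
  var index = {};
  entries.forEach(function (entry) {
    // Keys are stored as strings because text inputs yield strings.
    index[String(entry.entryid)] = entry;
  });
  return index;
}

// Returns the matching entry, or null when the id is unknown or empty.
function findEntry(index, entryid) {
  var key = String(entryid == null ? '' : entryid).trim();
  return Object.prototype.hasOwnProperty.call(index, key) ? index[key] : null;
}

// Sample data, invented for this example.
var entries = [
  { entryid: 12, entrytitle: 'The Long Road', authorname: 'A. Writer' },
  { entryid: 27, entrytitle: 'Night Shift', authorname: 'B. Author' }
];
var index = buildEntryIndex(entries);

console.log(findEntry(index, ' 27 ')); // the 'Night Shift' entry
console.log(findEntry(index, '99'));   // null
```

Inside checkEntryId, the result could be stored in assign.entry and its presence passed to form.$setValidity('validEntry', ...), in place of the server call.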

Trouble pinpointing child elements while using Mojo::DOM

I'm trying to extract text from an old vBulletin forum using WWW::Mechanize and Mojo::DOM.
vBulletin doesn't use HTML and CSS for semantic markup, and I'm having trouble using Mojo::DOM->children to get at certain elements.
These vBulletin posts are structured differently depending on their content.
Single message:
<div id="postid_12345">The quick brown fox jumps over the lazy dog.</div>
Single message quoting another user:
<div id="postid_12345">
<div>
<table>
<tr>
<td>
<div>Quote originally posted by Bob</div>
<div>Everyone knows the sky is blue.</div>
</td>
</tr>
</table>
</div>
I disagree with you, Bob. It's obviously green.
</div>
Single message with spoilers:
<div id="postid_12345">
<div class="spoiler">Yoda is Luke's father!</div>
</div>
Single message quoting another user, with spoilers:
<div id="postid_12345">
<div>
<table>
<tr>
<td>
<div>Quote originally posted by Fred</div>
<div class="spoiler">Yoda is Luke's father!</div>
</td>
</tr>
</table>
</div>
<div class="spoiler">No waaaaay!</div>
</div>
Assuming the above HTML and an array packed with the necessary post IDs:
for (@post_ids) {
$mech->get($full_url_of_specific_forum_post);
my $dom = Mojo::DOM->new($mech->content);
my $div_id = 'postid_' . $_;
say $dom->at($div_id)->children('div')->first;
say $dom->at($div_id)->text;
}
Using $dom->at($div_id)->all_text gives me everything in an unbroken line, which makes it difficult to tell what's quoted and what's original in the post.
Using $dom->at($div_id)->text skips all of the child elements, so quoted text and spoilers are not picked up.
I've tried variations of $dom->at($div_id)->children('div')->first, but this gives me everything, including the HTML.
Ideally, I'd like to be able to pick up all the text in each post, with each child element on its own line, e.g.
POSTID12345:
+ Quote originally posted by Bob
+ Everyone knows the sky is blue.
I disagree with you, Bob. It's obviously green.
I'm new to Mojo and rusty with Perl. I wanted to solve this on my own, but after looking over the documentation and fiddling with it for a few hours, my brain is mush and I'm at a loss. I'm just not getting how Mojo::DOM and Mojo::Collections work.
Any help will be greatly appreciated.
Looking at the source of Mojo::DOM, the all_text method basically walks the DOM recursively and extracts all text. Use that source as a starting point for your own DOM-walking function. Its recursive function depends on returning a single string; yours could return an array with whatever context you need.
EDIT:
After some discussion on IRC, the web scraping example has been updated; it might help guide you. http://mojolicio.us/perldoc/Mojolicious/Guides/Cookbook#Web_scraping
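The recursive walk suggested in this answer can be sketched independently of Mojo::DOM. The JavaScript version below works on a made-up plain-object tree (a text node is a string, an element is an object with a children array), not on Mojo::DOM's real data structure. It returns one line per text chunk and prefixes text found inside nested elements with "+ ", which is roughly the output format the question asks for.

```javascript
// Hypothetical tree shape: a text node is a string, an element is
// { tag: 'div', children: [...] }. This is NOT Mojo::DOM's internal format.
function collectLines(node, depth, lines) {
  if (typeof node === 'string') {
    var text = node.trim();
    if (text !== '') {
      // Text directly inside the post (depth 1) is kept plain;
      // text inside nested elements gets a '+ ' marker.
      lines.push((depth > 1 ? '+ ' : '') + text);
    }
    return lines;
  }
  node.children.forEach(function (child) {
    collectLines(child, depth + 1, lines);
  });
  return lines;
}

// The "single message quoting another user" post from the question, by hand.
var post = {
  tag: 'div',
  children: [
    { tag: 'div', children: [
      { tag: 'table', children: [
        { tag: 'tr', children: [
          { tag: 'td', children: [
            { tag: 'div', children: ['Quote originally posted by Bob'] },
            { tag: 'div', children: ['Everyone knows the sky is blue.'] }
          ] }
        ] }
      ] }
    ] },
    "I disagree with you, Bob. It's obviously green."
  ]
};

var lines = collectLines(post, 0, []);
console.log(lines.join('\n'));
```

Returning an array instead of one joined string is what keeps quoted text and original text distinguishable; the same idea should carry over to a Perl function that recurses over a node's children.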
There is a module to flatten an HTML tree, HTML::Linear.
The explanation of the purpose of flattening an HTML tree is a bit long and boring, so here's a picture showing the output of the xpathify tool, bundled with that module:
As you see, HTML tree nodes become single key/value list, where the key is the XPath for that node, and the value is the node's text attribute.
In a few keystrokes, this is how you use HTML::Linear:
#!/usr/bin/env perl
use strict;
use utf8;
use warnings;
use Data::Printer;
use HTML::Linear;
my $hl = HTML::Linear->new;
$hl->parse_file(q(vboard.html));
for my $el ($hl->as_list) {
my $hash = $el->as_hash;
next unless keys %{$hash};
p $hash;
}

MVC3 and Razor - How to place a dynamic value for hidden field?

I'm a beginner with Razor, and sometimes I get stuck on really simple things.
I have this foreach loop:
@foreach (dynamic item in ViewBag.EAList)
{
<li>
@using (@Html.BeginForm("Duplicate", "Daily"))
{
<p>@item.AuthorComment</p>
@Html.Hidden("EstadoDeAlmaID", @item.EAID)
@Html.Hidden("PosterID", Session["id"].ToString())
<input type="submit" value="Send" />
}
</li>
}
This line:
@Html.Hidden("EstadoDeAlmaID", @item.EAID)
doesn't work, and I don't know how to make it work. I tried many ways: without @, with (--), with @(--)...
Could someone help me to display the dynamic value in my hidden field?
In addition, if someone knows of a good website with Razor samples, I would be very thankful.
I had the same problem, found that a simple cast solved my problem.
@Html.Hidden("id", (string) ViewBag.ebook.isbn)
In Razor, once you are in "C# land", you no longer need to prefix values with the @ sign.
This should suffice:
@Html.Hidden("EstadoDeAlmaID", item.EAID)
Check out Scott Gu's article covering the syntax for more help.
Update
And I would also move your <li></li> within your using block, as Razor works better when you wrap HTML blocks inside code blocks.
Also, your Html.BeginForm should live outside of your loop.
@using (@Html.BeginForm("Duplicate", "Daily"))
{
<ul>
@foreach (? item in ViewBag.EAList)
{
<li>
<p>@item.AuthorComment</p>
@Html.Hidden("EstadoDeAlmaID", item.EAID)
@Html.Hidden("PosterID", Session["id"].ToString())
<input type="submit" value="Send" />
</li>
}
</ul>
}
Where ? in the foreach loop is the type of your items in EAList.
To avoid the "Extension methods cannot be dynamically dispatched" exception, use a model instead of ViewBag so you will not be using dynamic objects (this will avoid all the unnecessary casting in the View and is more in line with MVC style in general):
In your action when you return the view:
return View("ViewName", db.EAList.ToList());
In your view, the first line should be:
@model IEnumerable<EAListItem> //or whatever the type name is
Then just do:
@foreach(var item in Model)
You got the error, "Extension methods cannot be dynamically dispatched"... therein lies your trouble.
You should declare your loop variable not as type dynamic, but as the actual type in the collection. Then remove the @ from the item.EAID call inside the @Html.Hidden() call.
The simple solution for me was to use ViewData instead of ViewBag. ViewBag is just a dynamic wrapper around ViewData anyway.
@Html.Hidden("ReportID", ViewData["ReportID"])
But I don't know if this will help in your case, since you are creating dynamic items in your foreach loop.
I have found that when I want to use the ViewBag data in the HTML, getting back to basics has often worked for me:
<input type="hidden" name="Data" id="Data" value="@ViewBag.Data" />
This gave the same result.