Scrapy - Optional hxs.select - select

My issue is that, when I'm extracting data, a couple of entries on the page don't get selected because they have no link associated with them.
To better explain, here is the hxs.select statement that gets almost all of the data:
opening = hxs.select('//div[@id="body"]/div/table/tr/td/table/tr[2]/td/table[2]/tr/td[7]/font/a/text()').extract()
This statement gets all but 3 opening movie dates. The three missing dates, as I mentioned, don't have a link associated with them and are actually found at:
hxs.select('//div[@id="body"]/div/table/tr/td/table/tr[2]/td/table[2]/tr/td[7]/font/text()').extract()
*Notice: there is no /a found at the end.
I would just add an additional statement to get these, but I need all of the information in order. I also have statements that get a movie title and grossing amount. I then iterate through the results of these statements to pair them up where they belong, and I can't do this if I add another statement to deal with the missing dates separately. Any suggestions?
::::Data:::::
Here is the url of the data I'm trying to get BoxOfficeMojo
A quick note: if you use Firebug to view the XPath, it adds tbody elements that don't actually exist in the page source.
Here is what a normal opening date looks like:
<td bgcolor="#ffffff" align="right">
<font size="2">
6/11/2010
</font>
</td>
Here is what one of the 'problem' opening dates looks like:
<td bgcolor="#f4f4ff" align="right">
<font size="2">11/20/1981</font>
</td>

Just select all text nodes within that <font/> element using the descendant-or-self-axis step //.
//div[@id="body"]/div/table/tr/td/table/tr[2]/td/table[2]/tr/td[7]/font//text()
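To see why the descendant step keeps everything in order, here is a minimal sketch. It does not use Scrapy: it uses Python's standard-library ElementTree, whose limited XPath has no text() step, so itertext() stands in for //text(). The markup is a made-up, simplified pair of rows (one date inside a link, one bare), and opening_dates is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified rows: the first date is wrapped in a link,
# the second sits directly inside <font>.
SAMPLE = """
<table>
  <tr><td><font size="2"><a href="#">6/11/2010</a></font></td></tr>
  <tr><td><font size="2">11/20/1981</font></td></tr>
</table>
"""

def opening_dates(markup):
    """Collect the text under every td/font, whether or not it is linked."""
    root = ET.fromstring(markup.strip())
    dates = []
    for font in root.findall(".//td/font"):
        # itertext() yields the element's own text and all descendant text,
        # in document order, much like font//text() in full XPath.
        text = "".join(font.itertext()).strip()
        if text:
            dates.append(text)
    return dates

print(opening_dates(SAMPLE))
```

Because there is one result per font element, in document order, the dates should still line up with the titles and grosses selected from the same rows.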


Netsuite : HTML/Email Sublist of Transaction not looping correctly

I am looping the transaction.item to get the stockcodes which perfectly works like a charm.
But when I tried to get the link for each individual item, it populated every td tag. The link should exist ONLY on stockcode 100132, but the rest of the items get the link too. I also double-checked the database for links on the other stockcodes; one exists only on stockcode 100132.
This is definitely weird and doesn't make any sense to me. Here's my code for the list:
<#list transaction.item as sdsitem>
<tr style="text-align: center">
<td class="th-border stockcode">${sdsitem.item}</td>
<td class="th-border sdslink">
<#if (sdsitem.item.custitemabco_sds_email_link)??>
<a href="${sdsitem.item.custitemabco_sds_email_link}"
target="_blank">Link only exists on stockcode 100132</a>
</#if>
</td>
</tr>
</#list>
Thank you so much to anyone who takes the time to help me. I'm a beginner at Netsuite, and will really appreciate the answer. God bless!
I think the problem you're seeing can be answered by Suite Answer 98056. When an item is referenced on a transaction (Purchase/Sales/Work/Transfer Order etc), the fields that are found on the item record can not be directly accessed by using a dot to drill through the item.
Instead, you will need to create a new Transaction Item Field that is sourced from the item record field you're trying to use, i.e. custitemabco_sds_email_link, and reference that new field in the template.

Katalon IDE generated script gives error in eclipse

I am using Katalon IDE for generating a script.
As my application has a signout button on top right, when I click it, IDE will generate
xpath=(.//*[normalize-space(text()) and normalize-space(.)='S'])[3]/following::span[3]
When running this in Eclipse this line gives an error. On inspecting this element I found this:
<td id="titlebar_hyperlink_8-co_0" role="presentation" nowrap="nowrap" align="left" class=" verticalSpacer" style="vertical-align:top;">
<span id="titlebar_hyperlink_8-lbsignout" align="left" ctype="label" tabindex="0" targetid="titlebar_hyperlink_8-lbsignout" mxevent="click" accesskey="S" class="text powerwhite anchor" style="display:block;cursor:pointer;" title="Sign Out ALT+S" hotkey="83">
<img id="titlebar_hyperlink_8-lbsignout_image" src="btn_signout.gif" class="pwimg" border="0" style="vertical-align:top;margin:0px;margin-left:3px;margin-right:3px;" alt="Sign Out ALT+S">
<span><span></span><span class="text hl hlak">S</span><span>ign Out</span></span>
</span>
</td>
I'm new to selenium and all its related stuff. I would appreciate any kind of help on this. Thanks, Stack Overflow Community.
Try changing the xpath to
'.//*[@title="Sign Out ALT+S"]'
Explanation:
You want to uniquely locate an element. Usually, you would try with the id, but the id of your element seems dynamic so it might not work in every case.
* - this means any element, and inside of the brackets [ ] you put the attribute you want to find the element by. I chose title because it will probably be unique on a given page.
I recommend this cheatsheet for xpath reference.
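As a rough illustration of the attribute predicate, here is a small sketch using Python's standard-library ElementTree rather than Selenium or Katalon. The markup is a trimmed, well-formed version of the element from the question (the img is self-closed so it parses as XML), and find_by_title is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed version of the sign-out markup from the question.
SAMPLE = """
<td id="titlebar_hyperlink_8-co_0">
  <span id="titlebar_hyperlink_8-lbsignout" title="Sign Out ALT+S" accesskey="S">
    <img src="btn_signout.gif" alt="Sign Out ALT+S" />
    <span><span></span><span class="text hl hlak">S</span><span>ign Out</span></span>
  </span>
</td>
"""

def find_by_title(markup, title):
    """Return every descendant element whose title attribute equals `title`."""
    root = ET.fromstring(markup.strip())
    # * matches any element; [@title='...'] keeps only those with that title.
    return root.findall(f".//*[@title='{title}']")

matches = find_by_title(SAMPLE, "Sign Out ALT+S")
print([m.get("id") for m in matches])
```

In this sample only the outer span matches, because the img carries the same string in alt, not in title.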

Trouble pinpointing child elements while using Mojo::DOM

I'm trying to extract text from an old vBulletin forum using WWW::Mechanize and Mojo::DOM.
vBulletin doesn't use HTML and CSS for semantic markup, and I'm having trouble using Mojo::DOM->children to get at certain elements.
These vBulletin posts are structured differently depending on their content.
Single message:
<div id="postid_12345">The quick brown fox jumps over the lazy dog.</div>
Single message quoting another user:
<div id="postid_12345">
<div>
<table>
<tr>
<td>
<div>Quote originally posted by Bob</div>
<div>Everyone knows the sky is blue.</div>
</td>
</tr>
</table>
</div>
I disagree with you, Bob. It's obviously green.
</div>
Single message with spoilers:
<div id="postid_12345">
<div class="spoiler">Yoda is Luke's father!</div>
</div>
Single message quoting another user, with spoilers:
<div id="postid_12345">
<div>
<table>
<tr>
<td>
<div>Quote originally posted by Fred</div>
<div class="spoiler">Yoda is Luke's father!</div>
</td>
</tr>
</table>
</div>
<div class="spoiler">No waaaaay!</div>
</div>
Assuming the above HTML and an array packed with the necessary post IDs:
for (@post_ids) {
$mech->get($full_url_of_specific_forum_post);
my $dom = Mojo::DOM->new($mech->content);
my $div_id = '#postid_' . $_;  # at() takes a CSS selector, so the id needs a leading '#'
say $dom->at($div_id)->children('div')->first;
say $dom->at($div_id)->text;
}
Using $dom->at($div_id)->all_text gives me everything in an unbroken line, which makes it difficult to tell what's quoted and what's original in the post.
Using $dom->at($div_id)->text skips all of the child elements, so quoted text and spoilers are not picked up.
I've tried variations of $dom->at($div_id)->children('div')->first, but this gives me everything, including the HTML.
Ideally, I'd like to be able to pick up all the text in each post, with each child element on its own line, e.g.
POSTID12345:
+ Quote originally posted by Bob
+ Everyone knows the sky is blue.
I disagree with you, Bob. It's obviously green.
I'm new to Mojo and rusty with Perl. I wanted to solve this on my own, but after looking over the documentation and fiddling with it for a few hours, my brain is mush and I'm at a loss. I'm just not getting how Mojo::DOM and Mojo::Collections work.
Any help will be greatly appreciated.
Looking at the source of Mojo::DOM, basically the all_text method recursively walks the DOM and extracts all text. Use that source to write your own walking the DOM function. Its recursive function depends on returning a single string, in yours you might have it return an array with whatever context you need.
EDIT:
After some discussion on IRC, the web scraping example has been updated; it might help guide you. http://mojolicio.us/perldoc/Mojolicious/Guides/Cookbook#Web_scraping
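To make the idea of a custom recursive walk concrete, here is a minimal sketch in Python with the standard-library ElementTree, not Mojo::DOM. It assumes well-formed markup like the quoted-post example from the question, and text_lines is a hypothetical helper. The point is only the shape of the recursion: it returns a list with one entry per text chunk, marking chunks that come from child elements, instead of one joined string.

```python
import xml.etree.ElementTree as ET

# The "single message quoting another user" example from the question.
POST = """
<div id="postid_12345">
  <div><table><tr><td>
    <div>Quote originally posted by Bob</div>
    <div>Everyone knows the sky is blue.</div>
  </td></tr></table></div>
  I disagree with you, Bob. It's obviously green.
</div>
"""

def text_lines(element, nested=False):
    """Return one entry per text chunk; chunks inside child elements get '+ '."""
    prefix = "+ " if nested else ""
    lines = []
    if element.text and element.text.strip():
        lines.append(prefix + element.text.strip())
    for child in element:
        lines.extend(text_lines(child, nested=True))
        # Text that follows a child belongs to the current element, not the child.
        if child.tail and child.tail.strip():
            lines.append(prefix + child.tail.strip())
    return lines

root = ET.fromstring(POST.strip())
for line in text_lines(root):
    print(line)
```

The same structure should carry over to Perl: recurse over the child nodes and push onto an array instead of concatenating.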
There is a module to flatten an HTML tree, HTML::Linear.
The explanation of the purpose of flattening an HTML tree is a bit long and boring, so here's a picture showing the output of the xpathify tool, which is bundled with that module:
As you see, HTML tree nodes become single key/value list, where the key is the XPath for that node, and the value is the node's text attribute.
In a few keystrokes, this is how you use HTML::Linear:
#!/usr/bin/env perl
use strict;
use utf8;
use warnings;
use Data::Printer;
use HTML::Linear;
my $hl = HTML::Linear->new;
$hl->parse_file(q(vboard.html));
for my $el ($hl->as_list) {
my $hash = $el->as_hash;
next unless keys %{$hash};
p $hash;
}

xPath Groupings how?

OK So, I'm learning/using xpath for a basic application that's effectively ripping data off another website.
I need to find out each person's Country/Suburb/area.
In some instances you can get Australia/Victoria/Melbourne for instance.
Others may just be Australia/Melbourne.
Or even just Melbourne OR just Australia.
So I'm currently able to take the code below and rip all of the information with the XPath string //table/tr/td/table/tr/td/font/a. This returns every entry, but what I really want is to group each person's entries separately.
I hope someone out there on planet earth knows what I just tried to explain... and can help...
Good day!
The source document contains data like this:
<tr>
<td>
<font face="arial" size="2">
<strong>Location:</strong>
Australia,
<a href='http://maps.google.com/maps?q=Australia%20Victoria'target="mapblast" style='text-decoration:none'>Victoria</a>,
<a href='http://maps.google.com/maps?q=Australia%20Melbourne%20Victoria'target="mapblast" style='text-decoration:none'>Melbourne</a>
</font>
</td>
</tr>
To find each person's record, the XPath query is //table/tr/td/table/tr/td/font, or you could use //td/font[strong = 'Location:']. This will return a collection containing 1 element for each person.
To find the a elements under a particular font, evaluate the relative XPath a with that font element as the context node. This can also be done by iterating over the element's children collection.
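A minimal sketch of that two-step grouping, using Python's standard-library ElementTree since the question doesn't say which XPath library is in use. The markup is a made-up, well-formed reduction of the source data with two people, and locations_per_person is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed reduction of the source data: two people.
SAMPLE = """
<table>
  <tr><td><font><strong>Location:</strong> Australia,
    <a href="#">Victoria</a>,
    <a href="#">Melbourne</a></font></td></tr>
  <tr><td><font><strong>Location:</strong> Australia</font></td></tr>
</table>
"""

def locations_per_person(markup):
    """Return one list of location parts per matching font element."""
    root = ET.fromstring(markup.strip())
    groups = []
    # Step 1: one font element per person.
    for font in root.findall(".//td/font[strong='Location:']"):
        parts = []
        # The unlinked text right after <strong> (e.g. "Australia,").
        lead = (font.find("strong").tail or "").strip().rstrip(",")
        if lead:
            parts.append(lead)
        # Step 2: the relative path "a", evaluated from this font only.
        parts.extend(a.text.strip() for a in font.findall("a"))
        groups.append(parts)
    return groups

print(locations_per_person(SAMPLE))
```

Each inner list belongs to exactly one person, so a record with only a country or only a city simply yields a shorter list.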

php - splitting a string with HTML by the first instance of a table cell

I am checking on HTML content on my page, and I've got the split down to have the variable left with this content:
">
<td>Oklahoma City</td>
<td>Oklahoma</td>
<td>OK</td>
<td>405</td>
<td>CST</td>
</tr>
</table>
<div id="
Those are dynamic pages I'm checking, so the data will always be different, but the layout the same...
How can I get the value out of the second <td> if that html is in 1 variable(string)?
It was a full page. I've used explode twice to remove everything above a div field and everything below the last div field id, so it has some open HTML tags left, because I did not know how to get rid of those along the way and be left with just this:
<td>Oklahoma City</td>
<td>Oklahoma</td>
<td>OK</td>
<td>405</td>
<td>CST</td>
</tr>
</table>
Can you tell me how to get that out? I just need the second one because it is the county and that is what I'm checking on...