Are there reliable algorithms for generating robust DOM node selectors, given only the target node? - dom

In writing a scraper, we typically use some kind of selector to identify particular nodes of interest. Ideally the selectors should continue to work even as the page changes over time. A lot of the common approaches like grabbing nodes by id are fragile on frequently updated pages and impossible on some nodes. I'm trying to find good algorithms for generating robust selectors, but since there doesn't seem to be a standard terminology for this problem, it's hard to find everything that's out there.
Here are the selector DSLs I already know.
XPath selectors - Implemented everywhere from JS to the popular Python and Ruby scraping libraries. (A snippet after this list shows the first three styles side by side.)
CSS selectors - Found in many of the places where you can find XPath selectors.
High-level selectors - Here I'll give the example of Chickenfoot, which allows users to write click("begin tutorial") to find a link with the text "Begin Tutorial." Usually these are implemented on top of XPath and CSS selectors. I'd love to find out about more members of this language family.
Visual selectors - This would be the approach taken by, for instance, Sikuli, which makes it appear as though the program is calling a function on a screengrab of the relevant node. I don't know of any web-specific instances of this approach, but I imagine there are some.
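For concreteness, here is a browser-JavaScript sketch that locates the same hypothetical "Begin Tutorial" link with each of the first three styles (the class name and the click helper are invented for illustration):

// Target: a hypothetical <a class="tutorial-link">Begin Tutorial</a>.

// XPath selector, via the browser's document.evaluate API:
const byXPath = document.evaluate(
  '//a[normalize-space(text()) = "Begin Tutorial"]',
  document, null, XPathResult.FIRST_ORDERED_NODE_TYPE, null
).singleNodeValue;

// CSS selector (hypothetical class; CSS cannot match on text content):
const byCss = document.querySelector('a.tutorial-link');

// Chickenfoot-style high-level selector: a helper that finds a link by
// its visible text, implemented on top of the lower-level queries.
function click(linkText) {
  const link = Array.from(document.querySelectorAll('a')).find(
    a => a.textContent.trim().toLowerCase() === linkText.toLowerCase());
  if (link) link.click();
}
click("begin tutorial");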
Here are the selector generation algorithms I already know. By a selector generation algorithm I mean an algorithm that takes a node as input and produces a robust selector as output.
iMacros: Finds all elements with the same node type and text as the target element, then finds the target element's index in that list. Uses the node type, text, and index as the selector. Also includes the id for forms and form elements. (A sketch of this strategy follows the list.)
CoScripter: Uses the element's text if available. If not, uses the preceding text.
Selenium: Uses the id where available. Otherwise uses various other attributes, such as image alt text and the displayed text of links and buttons.
Wargo System: Uses element text.
Many systems: Use the XPath from the root to the target node, or some suffix of that XPath.
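To make the first strategy concrete, here is a minimal sketch of an iMacros-style generator (my own illustration of the described idea, not iMacros' actual code; the id handling for forms is omitted):

// Record the target's tag name, trimmed text, and its index among all
// elements with the same tag and text; resolve by repeating the scan.
function generateSelector(target) {
  const tag = target.tagName;
  const text = target.textContent.trim();
  const matches = Array.from(document.getElementsByTagName(tag))
    .filter(el => el.textContent.trim() === text);
  return { tag, text, index: matches.indexOf(target) };
}

function resolveSelector({ tag, text, index }) {
  const matches = Array.from(document.getElementsByTagName(tag))
    .filter(el => el.textContent.trim() === text);
  return matches[index] || null;
}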
All of these selector generation algorithms fail on some nodes. Are there better approaches out there? Or other approaches that I could combine with these algorithms to produce a better hybrid algorithm?

When I started investigating this for some work I am doing, I was also surprised by how little information is available on the topic.
I did find this 2003 paper, but unfortunately, I only have access to the abstract:
Abe, Mari, and Masahiro Hori. “Robust Pointing by XPath Language: Authoring Support and Empirical Evaluation.” In Proceedings of the 2003 Symposium on Applications and the Internet (SAINT ’03), 156–. Washington, DC, USA: IEEE Computer Society, 2003.
For my own use, I followed the approach in Tim Vasil's 50-line jQuery plugin. I won't reproduce the code, which is available at that link; instead I'll describe it:
It recursively traverses up the DOM tree from the element, building a selector "backwards". At each level:
If the node has an ID, just use that and skip all the parents; they aren't added to the selector.
If the node has a tag name or a set of classes that is unique among its siblings, use that as the selector. Otherwise, use :nth-child. (A minimal sketch of this algorithm follows.)
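In rough outline, the algorithm looks like this (my paraphrase in plain JavaScript, with the class-uniqueness check omitted for brevity; see the plugin for the full version):

// Walk up from the element, building a CSS selector backwards.
function robustSelector(el) {
  const parts = [];
  while (el && el.nodeType === Node.ELEMENT_NODE) {
    if (el.id) {
      // An ID is assumed unique: use it and skip all ancestors.
      parts.unshift('#' + el.id);
      break;
    }
    const tag = el.tagName.toLowerCase();
    const siblings = Array.from(el.parentNode ? el.parentNode.children : []);
    if (siblings.filter(s => s.tagName === el.tagName).length === 1) {
      parts.unshift(tag); // tag name is unique among siblings
    } else {
      parts.unshift(tag + ':nth-child(' + (siblings.indexOf(el) + 1) + ')');
    }
    el = el.parentNode;
  }
  return parts.join(' > ');
}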
Since I will be storing element contents between visits to a page, I'm thinking of implementing some "blunder detection" here, maybe using the percentage change since the last visit to detect whether the selector may be grabbing the wrong element.
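A minimal sketch of such blunder detection, assuming the element's text is stored between visits (the length-based metric and the threshold are arbitrary choices of mine):

// Flag the selector as suspect if the element's text length changed by
// more than a threshold fraction since the last visit.
function looksLikeBlunder(previousText, currentText, threshold = 0.5) {
  const change = Math.abs(currentText.length - previousText.length) /
    Math.max(previousText.length, 1);
  return change > threshold;
}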

Related

Which locator is more efficient in Protractor: by.css, by.xpath, or by.id?

Which one is better to use, in terms of performance: by.css, by.xpath, or by.id?
I have a really lengthy XPath:
by.xpath('//*[@id="logindiv"]/div[3]/div/div[1]/div/nav/div/div[1]/form/div/div/button')
which could also be expressed with other locators like by.css or by.id.
But it is not very clear which one is better.
Protractor uses selenium-webdriver underneath for element lookup/interaction etc., so this is not a Protractor-specific question, but rather a selenium-webdriver-specific one.
CSS selectors perform far better than XPath, and this is well documented in the Selenium community. Here are some reasons:
XPath engines are different in each browser, which makes them inconsistent.
Last time I checked, IE did not have a native XPath engine, so selenium-webdriver injects its own XPath engine for compatibility of its API. We therefore lose the advantage of using the native browser features that selenium-webdriver inherently promotes.
XPath expressions tend to become complex, like your example, which in my opinion makes them hard to read and maintain.
However, there are some situations where you need to use XPath, for example searching for a parent element or searching for an element by its text (I wouldn't recommend the latter).
You can read the blog post from Simon Stewart (the creator of selenium-webdriver) here. He also recommends CSS over XPath.
So I would recommend you use id, name, etc. for faster lookup. If those aren't available, use CSS, and finally fall back to XPath if nothing else suits your situation.
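For example, here are hypothetical alternatives to the long XPath above, in the recommended order (the id and the CSS path assume things about the page that you would need to verify):

// 1. By id: fastest, if the button itself had an id (hypothetical):
element(by.id('loginButton'));

// 2. By CSS: anchored on the nearest stable container:
element(by.css('#logindiv form button'));

// 3. By XPath: last resort, kept as short as possible:
element(by.xpath('//*[@id="logindiv"]//form//button'));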

REST: The same resource accessed from two different URLs

I've just started my journey with REST, so please be patient while answering my questions.
I have some doubts about how to cope with certain situations. For instance, I read that every resource/entity should have its own unique URL. But how to deal with examples like:
There is a list of objects.
Each object is described by some basic information and its state.
Basic info for object is for example its shape, color.
The state of object contains information like position in space, velocity etc.
Both pieces of information (basic + state) are rather big objects (only example properties were introduced above).
Now, we have the following questions:
Get all basic info about objects.
Get all object states.
Get basic info of object with ID=2.
Get state of object ID=5.
Get all info of object ID=7.
I tried to solve it in this way:
/rest/objects
/rest/states
/rest/objects/2/basic
/rest/objects/5/state (/rest/states/5)
/rest/objects/7
However, I have some doubts about point 4 - it doesn't look correct. There are two ways to access the same information/resource/entity.
How to deal with it?
Isn't the "basic" just part of the resource? It seems not from your description, but you should consider it. In fact, all of "basic" and "state" might just be part of the resource. It's difficult to be authoritative without more information. For instance, how large is "rather big"?
It also doesn't seem like /state should be an independent endpoint, since a state is an integral part of an object, not something that lives on its own.
I might start with something like:
1. GET /objects
2. GET /objects?expand=state
3. GET /objects/2?expand=basic -- or -- GET /objects/2/basic
4. GET /objects/5?expand=state -- or -- GET /objects/5/state
5. GET /objects/7?expand=basic,state
If you need /state to be top level, then you have two options:
2. GET /states -- or -- GET /objects/states
Neither of those is really ideal. It is acceptable to have multiple paths to the same resource, as long as there's only one canonical URI for it. The issue is that it's harder for the end user to learn an API that leans heavily on that style. The second style is also confusing: /objects/XXX is a specific object, unless XXX == states? That's a special case for your end user to have to remember.
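To make the ?expand= idea concrete, here is a minimal Express-style sketch (the data shape and field names are hypothetical):

const express = require('express');
const app = express();

// Hypothetical in-memory store: each object has "basic" and "state" parts.
const objects = {
  7: {
    basic: { shape: 'cube', color: 'red' },
    state: { position: [0, 0, 0], velocity: [1, 0, 0] },
  },
};

// GET /objects/:id?expand=basic,state
app.get('/objects/:id', (req, res) => {
  const obj = objects[req.params.id];
  if (!obj) return res.sendStatus(404);
  const expand = String(req.query.expand || '').split(',');
  const body = { id: req.params.id };
  if (expand.includes('basic')) body.basic = obj.basic;
  if (expand.includes('state')) body.state = obj.state;
  res.json(body);
});

app.listen(3000);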
From a caching point of view it makes sense to have "state" and "basic" in separate resources: "state" resources change frequently and are thus non-cachable, while "basic" changes seldom and can thus be cached for a long time. If this is your goal, then go ahead; it sure makes sense (and is an often-mentioned pattern). But usually, to keep complexity low, both sets of values are combined into one single resource (your "object" in this case).
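As a sketch of that caching split (Express-style; the lookups and the max-age value are placeholders):

const express = require('express');
const app = express();

// Placeholder lookups; wire these to real data in practice.
const basicOf = (id) => ({ shape: 'cube', color: 'red' });
const stateOf = (id) => ({ position: [0, 0, 0], velocity: [1, 0, 0] });

// "basic" changes seldom, so let caches keep it for a long time.
app.get('/rest/objects/:id/basic', (req, res) => {
  res.set('Cache-Control', 'public, max-age=86400'); // arbitrary lifetime
  res.json(basicOf(req.params.id));
});

// "state" changes frequently, so tell caches to revalidate every time.
app.get('/rest/objects/:id/state', (req, res) => {
  res.set('Cache-Control', 'no-cache');
  res.json(stateOf(req.params.id));
});

app.listen(3000);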
Your paths are pretty good, Eric's version is also fine - here is my suggestion:
/rest/objects (all complete objects info)
/rest/objects/basic (all basic info)
/rest/objects/state (all state info)
/rest/objects/7 (single complete object)
/rest/objects/2/basic (single basic info)
/rest/objects/5/state (single state info)
In the end it all boils down to a matter of personal preference as the exact URL doesn't matter much.
See for instance http://soabits.blogspot.dk/2013/10/url-structures-and-hyper-media-for-web.html for a longer discussion on this issue (disclaimer: that's my blog).
resource/entity
Don't confuse these concepts. REST has nothing to do with how you store your data, and resources map to entities only in CRUD applications.
I read that every resource/entity should have its own, unique URL.
You should read the IRI standard. IRIs are for identifying resources. Without them you cannot tell the server what you want to manipulate.
However, I have some doubts about point 4 - it doesn't look correct. There are two ways to access the same information/resource/entity.
How to deal with it?
It is up to you. You can choose either of them or both. There is no restriction about how many identifiers a resource can have or what IRI structure to choose.

Has anyone implemented the DITA 1.2 subject scheme in their work?

I would like to know if anyone has implemented the subjectScheme maps of DITA 1.2 in their work. If yes, can you please break up an example to show:
how to do it?
when not to use it?
I am aware of the theory behind it, but I have yet to implement it myself, and I wanted to know if there are things I must keep in mind during the planning and implementation phases.
An example is here:
How to use DITA subjectSchemes?
The DITA 1.2 spec also has a good example (3.1.5.1.1).
What you can currently do with subject scheme maps is:
define a taxonomy
bind the taxonomy to a profiling or flagging attribute, so that the attribute only takes values that you have defined
filter or flag elements that have a defined value with a DITAVAL file.
Advantage 1: Since you have a taxonomy, filtering a parent value also filters its children, which is convenient.
Advantage 2: You can fully define and thus control the list of values, which prevents tag bloat.
Advantage 3: You can reuse the subject scheme map in many topic maps, in the usual modular DITA way, so you can apply the same taxonomies anywhere.
These appear to be the main uses for a subject scheme map at present.
The only disadvantage I have found is that I can think of other hypothetical uses for subject scheme maps, such as faceted browsing, but I don't think any implementation exists. The DITA-OT doesn't have anything like that yet, anyway.

UIMA: Plug & Play Annotators for different teams' chains

Assume I have a UIMA toolchain that does something like this:
tokenize -> POS tagging -> assign my custom tags/annotations -> use the custom tags to assign more tags -> further processing.
Would it be possible to use a third-party annotator, say entity recognition (which uses POS tags but does not need much more), right after the POS tagging, in between the two custom steps, or afterwards?
I'm asking this question because I can see complications due to the type systems. In particular, the most difficult case may be plugging a third-party ER annotator in between the custom steps or right after them. The third-party annotator won't expect our custom tags to be there.
However, these are just additional annotations that have to be "passed through" the annotator without it looking at or modifying them. So, in principle, I'd assume this is possible. I just don't know whether UIMA supports this, or whether it is all about writing full chains on your own with strict typing everywhere.
If this isn't possible out of the box, could we write the custom annotators in such a way that they can be plugged in anywhere POS tags are available, independent of whether other annotations are present? That is, as authors of annotators, we would take care that there may be some required annotations, some annotations we add, and any number of annotations that may or may not be present, which we do not care about and simply pass through.
The third-party annotator won't expect our custom tags to be there.
If I understand correctly, you are concerned that your custom annotations might collide with the third-party NER, right? They won't, unless your code adds exactly the same annotations.
This is the strength of UIMA: every Analysis Engine (AE) is independent of the others; it only cares about the annotations that are passed in the CAS. For example, say you have an AE that expects annotations of type my.namespace.Token. It doesn't matter which AE created these annotations, as long as they are present in the CAS.
The price to pay for this flexibility is that you (as a developer) have to make sure that the required annotations for each AE are present. For example, if an AE expects annotations of type my.namespace.Sentence but none are present, this AE won't be able to do any processing.

HTML XPath tree dump using Ruby Watir?

Help! In carefully stepping through irb to control a browser (Firefox and Chrome) using the Watir library, it seems the XPath addresses are too shifty to rely on. E.g., one moment the XPath for one's balance seems consistent, so I use that address in my script. Sometimes it works, but too often it crashes with "element not found", although every time I manually step through, the webpage data is there (Firebug inspection confirms it).
Yes, these are Ajax-reliant sites, but not ones that change that much: bank websites that pretty much remain the same across visits.
So the question: is there some way watir-webdriver can simply give me a long, verbose dump of everything it sees in the DOM at the moment, in the form of an XPath tree? It would help me troubleshoot.
The big answer is to not use XPath, but instead to use Watir as the UI is intended to be used.
When it comes to specifying elements in browser automation, XPath is by and large evil: it is SLOW, the code it creates is often (as you are finding) very brittle, and it's nearly impossible to read and make sense of. Use it only as a means of last resort when nothing else will work.
If you are using a Watir API (as with Watir or Watir-webdriver), you want to identify the element based on its own attributes, such as class, name, id, text, etc. If that doesn't work, identify it based on the closest wrapping container that can be found uniquely. If that doesn't work, identify a sibling or sub-element and use the .parent method to walk 'up' the DOM to the parent container element.
To the point of brittleness and poor readability, compare the following (taken from the comments), and consider the code using element_by_xpath on it:
/html/body/form/div[6]/div/table/tbody/tr[2]/td[2]/table/tbody/tr[2]/td/p/table[2]/tbody/tr/td[2]/p/table/tbody/tr[3]/td[2]
and then compare it to this (where the entire line of code is shorter than the XPath alone):
browser.cell(:text => "Total Funds Avail. for Trading").parent.cell(:index => 1).text
or, to be a bit less brittle, replace the index with some attribute of the cell whose text you want
browser.cell(:text => "Total Funds Avail. for Trading").parent.cell(:class => "balanceSnapShotCellRight").text
The XPath example is very difficult to make any sense of; I have no idea what element you are after or why the code might be selecting that element. And since there are so many index values, any change to the page design, or just extra rows in the table above the one you want, will break that code.
The second is much easier to make sense of: I can tell just by reading it what the script is trying to find on the page, and how it is locating it. Extra rows in the table, or other changes to the page layout, will not break the code (with the exception of rearranging the columns in the table, and even that could be avoided if I were to make use of the class or some other characteristic of the target cell, as in the example below).
For that matter, if that class is unique to that element on the page, then
browser.cell(:class => 'balanceSnapShotCellRight').text
Would work just fine as long as there is only one cell with that class in the table.
Now, to be fair, I know there are ways to use XPath more elegantly to do something similar to what we are doing in the Watir code above. But while this is true, it's still not as easy to read and work with, and it is not how most people commonly (mis)use XPath to select objects, especially if they have used recorders that create brittle, cryptic XPath code similar to the sample above.
The answers to this SO question describe the three basic approaches to identifying elements in Watir. Each answer covers an approach, which one you would use depends on what works best in a given situation.
If you are finding a challenge on a given page, start a question here about it and include a sample of the HTML before/after/around the element you are trying to work with, and the folks here can generally point you the way.
If you've not done so, work through some of the tutorials in the Watir wiki, and notice how seldom XPath is used.
Lastly, you mention FireWatir. Don't use FireWatir; it's out of date, no longer being developed, and will not work with any recent version of Firefox. Instead use Watir-Webdriver to drive Firefox or Chrome (or IE).
You just need to output the "innerXml" (I don't know Watir) of the node selected by this XPath expression:
/
Update:
In case by "dump" you mean something different, such as a set of XPath expressions, each selecting one node, then have a look at the answer to this question:
https://stackoverflow.com/a/4747858/36305
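If neither of those fits, here is a sketch of a JavaScript function you could pass to watir-webdriver's execute_script to produce one absolute XPath per element (my own code, not a built-in Watir feature):

// Build a simple absolute XPath (with positional indexes) for every
// element on the page. From Ruby, something like:
//   puts browser.execute_script("#{js_source} return dumpXPaths();")
function dumpXPaths() {
  function xpathOf(el) {
    if (!el || el.nodeType !== 1) return '';
    let index = 1;
    let sibling = el.previousElementSibling;
    while (sibling) {
      if (sibling.tagName === el.tagName) index += 1;
      sibling = sibling.previousElementSibling;
    }
    return xpathOf(el.parentElement) +
      '/' + el.tagName.toLowerCase() + '[' + index + ']';
  }
  return Array.from(document.querySelectorAll('*'))
    .map(xpathOf)
    .join('\n');
}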