HTML xpath tree dump? using Ruby Watir - dom

Help! In carefully stepping through irb to control a browser (Firefox and Chrome) using the Watir library, it seems the xpath addresses are too shifty to rely on. Eg. one moment, the xpath for one's balance seems consistent, so I use that address in my script. Sometimes it works, but too often crashing with "element not found" although every time I manually step through, the webpage data is there (firebug inspect to confirm).
Yes, using Ajax-reliant sites, but not that changing....bank websites that pretty much remain the same across visits.
So question....is there some way watir-webdriver can simply give me a long, verbose dump of everything it sees in the DOM at the moment, in the form of an xpath tree? Would help me troubleshoot.

The big answer is to not use xpath, but instead use watir as the UI is intended to be used.
When it comes to a means to specify elements in browser automation, by and large Xpath is evil, it is SLOW, the code it creates is often (as you are finding) very very brittle, and it's nearly impossible to read and make sense of. Use it only as a means of last resort when nothing else will work.
If you are using a Watir API (as with Watir or Watir-webdriver) then you want to identify the element based on it's own attributes, such as class, name, id, text, etc If that doesn't work, then identify based on the closest container that wraps the element which has a way to find it uniquely. If that doesn't work identify by a sibling or sub-element and use the .parent method as a way to walk 'up' the dom to the 'parent container element.
To the point of being brittle and difficult readability, compare the following taken from the comments and consider the code using element_by_xpath on this:
/html/body/form/div[6]/div/table/tbody/tr[2]/td[2]/table/tbody/tr[2]/td/p/table[‌​2]/tbody/tr/td[2]/p/table/tbody/tr[3]/td[2]
and then compare to this (where the entire code is shorter than just the xpath alone)
browser.cell(:text => "Total Funds Avail. for Trading").parent.cell(:index => 1).text
or to be a bit less brittle replace index by some attribute of the cell who's text you want
browser.cell(:text => "Total Funds Avail. for Trading").parent.cell(:class => "balanceSnapShotCellRight").text
The xpath example is very difficult to make any sense of, no idea what element you are after or why the code might be selecting that element. And since there are so many index values, any change to the page design or just extra rows in the table above the one you want will break that code.
The second is much easier to make sense of, I can tell just by reading it what the script is trying to find on the page, and how it is locating it. Extra rows in the table, or other changes to page layout will not break the code. (with the exception of re-arranging the columns in the table, and even that could be avoided if I was to make use of class or some other characteristic of the target cell (as did an example in the comments below)
For that matter, if the use of the class is unique to that element on the page then
browser.cell(:class => 'balanceSnapShotCellRight').text
Would work just fine as long as there is only one cell with that class in the table.
Now to be fair I know there are ways to use xpath more elegantly to do something similar to what we are doing in the Watir code above, but while this is true, it's still not as easy to read and work with, and is not how most people commonly (mis)use xpath to select objects, especially if they have used recorders that create brittle cryptic xpath code similar to the sample above)
The answers to this SO question describe the three basic approaches to identifying elements in Watir. Each answer covers an approach, which one you would use depends on what works best in a given situation.
If you are finding a challenge on a given page, start a question here about it and include a sample of the HTML before/after/around the element you are trying to work with, and the folks here can generally point you the way.
If you've not done so, work through some of the tutorials in the Watir wiki, notice how seldom xpath is used.
Lastly, you mention Firewatir. Don't use Firewatir, it's out of date and no longer being developed and will not work with any recent version of FF. Instead use Watir-Webdriver to driver Firefox or Chrome (or IE).

You just need to output the "innerXml" (I don't know Watir) of the node selected by this XPath expression:
/
Update:
In case that by "dump" you mean something different, such as a set of the XPath expressions each selecting a node, then have a look at the answer of this question:
https://stackoverflow.com/a/4747858/36305

Related

Which one of the locator is efficient by.css or by.xpath or by.id in Protractor?

Which one is better, in terms of performance , to use : by.css or by.xpath or by.id.
I have a really lengthy xpath :
by.xpath('//*#id="logindiv"]/div[3]/div/div[1]/div/nav/div/div[1]/form/div/div/button')
which can be used with other selectors like by.css or by.id.
But it is not very clear which one is better.
Protractor uses selenium-webdriver underneath for element lookup/interaction etc, so this is not protractor specific question, but rather selenium-webdriver specific.
CSS selectors perform far better than Xpath and it is well documented in Selenium community. Here are some reasons,
Xpath engines are different in each browser, hence make them inconsistent.
Last time I checked, IE does not have a native xpath engine, therefore selenium-webdriver injects its own xpath engine for compatibility of its API. Hence we lose the advantage of using native browser features that selenium-webdriver inherently promotes.
Xpath tend to become complex like your example and hence make hard to read/maintain in my opinion.
However there are some situations where, you need to use xpath, for example, searching for a parent element or searching element by its text (I wouldn't recommend the later).
You can read blog from Simon(creator of selenium-webdriver) here . He also recommends CSS over Xpath.
So I would recommend you use id, name etc for faster lookup. If thats not available use css and finally use xpath if none other suite your situation.

Are there reliable algorithms for generating robust DOM node selectors, given only the target node?

In writing a scraper, we typically use some kind of selector to identify particular nodes of interest. Ideally the selectors should continue to work even as the page changes over time. A lot of the common approaches like grabbing nodes by id are fragile on frequently updated pages and impossible on some nodes. I'm trying to find good algorithms for generating robust selectors, but since there doesn't seem to be a standard terminology for this problem, it's hard to find everything that's out there.
Here are the selector DSLs I already know.
XPath selectors - Implemented everywhere from JS to the popular
Python and Ruby scraping libraries.
CSS selectors - Found in many of the places where you can find xpath
selectors.
High level selectors - Here I'll give the example of Chickenfoot,
which allows users to write click("begin tutorial") to find a link
with the text "Begin Tutorial." Usually these are implemented on top of
xpath and CSS selectors. I'd love to find out about more members of
this language family.
Visual selectors - This would be the approach taken by, for instance,
Sikuli, which makes it appear as though the program is calling a
function on a screengrab of the relevant node. I don't know any
web-specific instances of this approach, but I imagine there are
some.
Here are the selector generation algorithms I already know. By a selector generation algorithm I mean an algorithm that takes a node as input and produces a robust selector as output.
iMacros: Finds all elements with the same node type and text as the
target element, finds the target element's index in this list list. Uses
the node type, text, and index as the selector. Also includes id
for forms and form elements.
CoScripter: Uses element's text if available. If not, uses preceding
text.
Selenium: Uses id where available. Uses various other attributes
otherwise, such as image alt text, links' displayed texts, buttons'
displayed texts.
Wargo System: Uses element text.
Many systems: Many systems use the xpath from the root to the target node, or some
suffix of that xpath.
All of these selector generation algorithms fail on some nodes. Are there better approaches out there? Or other approaches that I could combine with these algorithms to produce a better hybrid algorithm?
When I started investigating this topic for some work I am doing, I was also surprised by how little information is available on this topic.
I did find this 2003 paper, but unfortunately, I only have access to the abstract:
Abe, Mari, and Masahiro Hori. “Robust Pointing by XPath Language: Authoring Support and Empirical Evaluation.” In Proceedings of the 2003 Symposium on Applications and the Internet, 156 – . SAINT ’03. Washington, DC, USA: IEEE Computer Society, 2003.
For my own use, I followed the approach in Tim Vasil's 50-line jquery plugin. I won't reproduce the code which is available at that link, but instead I'll describe it:
It recursively traverses up the DOM tree from the element, building a selector "backwards". At each level:
If the node has an ID, just use that and skip all the parents; they aren't added to the selector.
If node has a tag name or a set of classes that is unique among its siblings, use that as the selector. Otherwise, use :nth-child.
Since I will be storing element contents between visits to a page, I'm thinking of implementing some "blunder detection" here, maybe using a percentage change from last visit to detect if the selector may be grabbing the wrong element.

objective-c - Which lib I should use to parse HTML?

I am trying to parse some not-complicated RSS html content in iphone.
So I don't need a heavy HTML parser.
I have searched here and found these two:
https://github.com/topfunky/hpple
https://github.com/zootreeves/Objective-C-HMTL-Parser
Both are simple to use. But I guess they have their problems for my purpose.
For TFHpple, it is good, but for every element, it does not have the complete HTML <> with itself. for example, element doesn't have this complete tag string. I need this complete tag string, because I need to remove it from the whole HTML string. I would be more convenient for me if element has that.
For zootreeves HTML-Parser, it is also simple and good. And it has the complete tag string with every element. I am very happy. However, it seems to be a big memory-comsumer. I monitored it. If I try to parse a big number of HTML fragments (say, 1000), the memory it will cost and stays occupied is like 40MB. It is not applicable for ios devices. zootreeves is using pure C codes and linked-list to organise the tree structures of the HTML, I guess. and it uses pure malloc and free for memory. I don't know whether that will affect ios memory.
So, anyone can recommend a state-of-art better and fast and simple HTML parser for iOs for me?
Thanks
I'd use libxml2. It's not just for xml; it has an HTML parser too. It's fast and low-memory and is available in iOS. The only drawback is that it's a C-based API, but for all that it's not terribly difficult to work with.
Update
In response to the first comment below: It's been awhile, so I'm not sure, but I don't think so. What you get is a data structure with lots of information about the document structure, and each tag has a list of attribute/value pairs. Nowhere is the original html string stored (I presume that this is considered redundant and is not done to save memory).
However, it doesn't seem like you actually need it for what you want to do. It seems to me that you are using information from the parser to modify the original string, stripping out HTML tags. What you want to do instead is to rebuild the document using information from the parse tree, and when you do this, leave out the tags you want omitted.

Design - When to create new functions?

This is a general design question not relating to any language. I'm a bit torn between going for minimum code or optimum organization.
I'll use my current project as an example. I have a bunch of tabs on a form that perform different functions. Lets say Tab 1 reads in a file with a specific layout, tab 2 exports a file to a specific location, etc. The problem I'm running into now is that I need these tabs to do something slightly different based on the contents of a variable. If it contains a 1 I may need to use Layout A and perform some extra concatenation, if it contains a 2 I may need to use Layout B and do no concatenation but add two integer fields, etc. There could be 10+ codes that I will be looking at.
Is it more preferable to create an individual path for each code early on, or attempt to create a single path that branches out only when absolutely required.
Creating an individual path for each code would allow my code to be extremely easy to follow at a glance, which in turn will help me out later on down the road when debugging or making changes. The downside to this is that I will increase the amount of code written by calling some of the same functions in multiple places (for example, steps 3, 5, and 9 for every single code may be exactly the same.
Creating a single path that would branch out only when required will be a bit messier and more difficult to follow at a glance, but I would create less code by placing conditionals only at steps that are unique.
I realize that this may be a case-by-case decision, but in general, if you were handed a previously built program to work on, which would you prefer?
Edit: I've drawn some simple images to help express it. Codes 1/2/3 are the variables and the lines under them represent the paths they would take. All of these steps need to be performed in a linear chronological fashion, so there would be a function to essentially just call other functions in the proper order.
Different Paths
Single Path
Creating a single path that would
branch out only when required will be
a bit messier and more difficult to
follow at a glance, but I would create
less code by placing conditionals only
at steps that are unique.
Im not buying this statement. There is a level of finesse when deciding when to write new functions. Functions should be as simple and reusable as possible (but no simpler). The correct answer is almost never 'one big file that does a lot of branching'.
Less LOC (lines of code) should not be the goal. Readability and maintainability should be the goal. When you create functions, the names should be self documenting. If you have a large block of code, it is good to do something like
function doSomethingComplicated() {
stepOne();
stepTwo();
// and so on
}
where the function names are self documenting. Not only will the code be more readable, you will make it easier to unit test each segment of the code in isolation.
For the case where you will have a lot of methods that call the same exact methods, you can use good OO design and design patterns to minimize the number of functions that do the same thing. This is in reference to your statement "The downside to this is that I will increase the amount of code written by calling some of the same functions in multiple places (for example, steps 3, 5, and 9 for every single code may be exactly the same."
The biggest danger in starting with one big block of code is that it will never actually get refactored into smaller units. Just start down the right path to begin with....
EDIT --
for your picture, I would create a base-class with all of the common methods that are used. The base class would be abstract, with an abstract method. Subclasses would implement the abstract method and use the common functions they need. Of course, replace 'abstract' with whatever your language of choice provides.
You should always err on the side of generalization, with the only exception being early prototyping (where throughput of generating working stuff is majorly impacted by designing correct abstractions/generalizations). having said that, you should NEVER leave that mess of non-generalized cloned branches past the early prototype stage, as it leads to messy hard to maintain code (if you are doing almost the same thing 3 different times, and need to change that thing, you're almost sure to forget to change 1 out of 3).
Again it's hard to specifically answer such an open ended question, but I believe you don't have to sacrifice one for the other.
OOP techniques solves this issue by allowing you to encapsulate the reusable portions of your code and generate child classes to handle object specific behaviors.
Personally I think you might (if possible by your API) create inherited forms, create them on fly on master form (with tabs), pass agruments and embed in tab container.
When to inherit form and when to decide to use arguments (code) to show/hide/add/remove functionality is up to you, yet master form should contain only decisions and argument passing and embeddable forms just plain functionality - this way you can separate organisation from implementation.

POST/GET bindings in Racket

Is there a built-in way to get at POST/GET parameters in Racket? extract-binding and friends do what I want, but there's a dire note attached about potential security risks related to file uploads which concludes
Therefore, we recommend against their
use, but they are provided for
compatibility with old code.
The best I can figure is (and forgive me in advance)
(bytes->string/utf-8 (binding:form-value (bindings-assq (string->bytes/utf-8 "[field_name_here]") (request-bindings/raw req))))
but that seems unnecessarily complicated (and it seems like it would suffer from some of the same bugs documented in the Bindings section).
Is there a more-or-less standard, non-buggy way to get the value of a POST/GET-variable, given a field name and request? Or better yet, a way of getting back a collection of the POST/GET values as a list/hash/a-list? Barring either of those, is there a function that would do the same, but only for POST variables, ignoring GETs?
extract-binding is bad because it is case-insensitive, is very messy for inputs that return multiple times, doesn't have a way of dealing with file uploads, and automatically assumes everything is UTF-8, which isn't necessarily true. If you can accept those problems, feel free to use it.
The snippet you wrote works when the data is UTF-8 and when there is only one field return. You can define it is a function and avoid writing it many times.
In general, I recommend using formlets to deal with forms and their values.
Now your questions...
"Is there a more-or-less standard, non-buggy way to get the value of a POST/GET-variable, given a field name and request?"
What you have is the standard thing, although you wrongly assume that there is only one value. When there are multiple, you'll want to filter the bindings on the field name. Similarly, you don't need to turn the value into a string, you can leave it as bytes just fine.
"Or better yet, a way of getting back a collection of the POST/GET values as a list/hash/a-list?"
That's what request-bindings/raw does. It is a list of binding? objects. It doesn't make sense to turn it into a hash due to multiple value returns.
"Barring either of those, is there a function that would do the same, but only for POST variables, ignoring GETs?"
The Web server hides the difference between POSTs and GETs from you. You can inspect uri and raw post data to recover them, but you'd have to parse them yourself. I don't recommend it.
Jay