objective-c - Which lib I should use to parse HTML?

I am trying to parse some uncomplicated RSS HTML content on the iPhone.
So I don't need a heavyweight HTML parser.
I have searched here and found these two:
https://github.com/topfunky/hpple
https://github.com/zootreeves/Objective-C-HMTL-Parser
Both are simple to use, but I think each has a problem for my purpose.
TFHpple is good, but its elements do not carry their complete HTML tag string (the full <...> text). I need that complete tag string because I want to remove it from the whole HTML string, so it would be more convenient for me if each element had it.
The zootreeves HTML parser is also simple and good, and it does keep the complete tag string with every element, which I am very happy about. However, it seems to be a big memory consumer. I monitored it: if I parse a large number of HTML fragments (say, 1000), it uses about 40 MB of memory, and that memory stays occupied. That is not acceptable on iOS devices. As far as I can tell, zootreeves uses pure C code and a linked list to organise the tree structure of the HTML, and it allocates with plain malloc and free. I don't know whether that affects iOS memory management.
So, can anyone recommend a better, fast, and simple state-of-the-art HTML parser for iOS?
Thanks

I'd use libxml2. It's not just for XML; it has an HTML parser too. It's fast, uses little memory, and is available on iOS. The only drawback is that it's a C-based API, but for all that it's not terribly difficult to work with.
Update
In response to the first comment below: It's been a while, so I'm not sure, but I don't think so. What you get is a data structure with lots of information about the document structure, and each tag has a list of attribute/value pairs. Nowhere is the original HTML string stored (I presume this is considered redundant and is left out to save memory).
However, it doesn't seem like you actually need it for what you want to do. It seems to me that you are using information from the parser to modify the original string, stripping out HTML tags. What you want to do instead is to rebuild the document using information from the parse tree, and when you do this, leave out the tags you want omitted.
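The rebuild-instead-of-strip idea can be sketched in a few lines. This is only an illustration in Python's standard html.parser, not libxml2 or either Objective-C library; the names TagStripper, strip_tags and VOID_TAGS are made up for the example, and comments and doctypes are simply dropped.

```python
from html import escape
from html.parser import HTMLParser

# Elements that never have an end tag (a partial list, enough for the sketch).
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class TagStripper(HTMLParser):
    """Re-serialises the parsed document, leaving out the tags in `omit`."""

    def __init__(self, omit):
        super().__init__(convert_charrefs=True)
        self.omit = set(omit)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in self.omit:
            return  # skip the tag itself; its text content is still emitted
        rendered = "".join(
            f" {name}" if value is None else f' {name}="{escape(value, quote=True)}"'
            for name, value in attrs
        )
        self.out.append(f"<{tag}{rendered}>")

    def handle_endtag(self, tag):
        if tag not in self.omit and tag not in VOID_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Text arrives unescaped, so escape it again on the way out.
        self.out.append(escape(data, quote=False))


def strip_tags(html, omit):
    parser = TagStripper(omit)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


cleaned = strip_tags('<p>Hello <img src="a.png"> <b>world</b></p>', omit={"img"})
print(cleaned)
```

Because the output is built from parser events, the original tag string is never needed: an unwanted tag is just an event that is not written back.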

Related

Is there a way to use GWT static string i18n with server-provided properties?

I am looking for a solution to a tricky problem.
I would like to use GWT static string internationalization, that is, Constants, ConstantsWithLookup and Messages, but the strings must come from the server at runtime instead of at compile time.
Is there already a project that does such a thing, or should I write my own GWT generators?
Thanks to everyone that will help me.
UPDATE: The Dictionary is not an option, because the application is almost complete and I cannot change all the application for this.
UPDATE 2: In fact, Dictionary is an option if it is wrapped by a Constants-like or Messages-like interface.

What you ask for is not static i18n at all. Some of the reasons why GWT's i18n is virtually all static:
It is a synchronous API. Fetching resources from the server will either require an async API to spread throughout the entire application (ie, passing a Future to a widget telling it where to get its inner text once that string has been fetched from the server) or you will have to basically block execution of the app until the i18n resources have been downloaded at the beginning (which will give poor experience for users).
We can optimize the generated code to only include those formatters and associated data that are actually needed by the messages in the app. If you don't include any plural messages, we don't have to include that code, etc. Expressions can be inlined, dead code removed, and class references removed entirely in most cases.
We can make use of things at compile time that would be hard or expensive to do at runtime. For example, simply parsing the message format strings takes a fair amount of code, and none of that needs to be included in the compiled output. Let's say you fetch strings for your app from the server, and you find that one of them has {0,localtime,YMd} in it -- now you need ICU4J in order to localize that -- oops! Even if it could all be compiled to JS, it would be huge. Perhaps you can support a subset of GWT's i18n in this way, but you will have to include every formatter that might possibly be referenced from a message, even though most of them never would be.
If you really want dynamic i18n, then do as the other answers suggest and use Dictionary (note however that you won't be able to properly localize your app if it has any complexity to its messages). If you need more than can be provided by that, then bite the bullet and use static i18n.
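To make the trade-off concrete, here is a rough sketch of a Messages-like facade over strings that arrive at runtime. It is written in Python purely for illustration and is not GWT code; the Messages class, the server_strings dictionary and its keys are all hypothetical. Note that it only supports plain positional substitution, which is exactly the limitation described above: no plurals, no date or number formatters.

```python
class Messages:
    """Messages-like facade over strings loaded at runtime (hypothetical)."""

    def __init__(self, strings):
        self._strings = dict(strings)

    def __getattr__(self, key):
        # Only reached for names that are not real attributes, i.e. message keys.
        if key.startswith("_"):
            raise AttributeError(key)
        try:
            template = self._strings[key]
        except KeyError:
            raise AttributeError(f"no message named {key!r}") from None

        def render(*args):
            # Plain positional substitution: {0}, {1}, ...
            return template.format(*args)

        return render


# Stand-in for a dictionary fetched from the server before the UI is built.
server_strings = {"greeting": "Hello, {0}!", "ok_button": "OK"}
messages = Messages(server_strings)

print(messages.greeting("Ada"))
print(messages.ok_button())
```

Call sites keep the method-per-message shape, so existing code would not need to change much, but every lookup happens at runtime and nothing can be inlined or pruned at compile time.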

There are two options: Good and Less Good.
Good:
The standard way: static string i18n, where all language permutations are optimized and inlined where they are used (e.g. the Japanese company name is put straight into the HTML template for a button, column, or header).
This is preferable because the full i18n suite is rich, with support for pluralization, message builders, annotations, and automatic i18n. It is also the fastest option.
Less Good:
Often you need to work with a legacy system, so Good is not an option. Here, rather than all the rocket widgets, you just need to get text into boxes. In that case use dynamic string i18n and drop the strings into your page with something like an old-school Dictionary object.

HTML xpath tree dump? using Ruby Watir

Help! While carefully stepping through irb to control a browser (Firefox and Chrome) with the Watir library, I have found that XPath addresses seem too shifty to rely on. E.g., one moment the XPath for an account balance seems consistent, so I use that address in my script. Sometimes it works, but too often it crashes with "element not found", although every time I step through manually the data is there on the page (confirmed with Firebug's inspector).
Yes, these are Ajax-reliant sites, but they do not change much: bank websites that pretty much remain the same across visits.
So the question is: is there some way watir-webdriver can simply give me a long, verbose dump of everything it sees in the DOM at the moment, in the form of an XPath tree? That would help me troubleshoot.

The big answer is to not use xpath, but instead use watir as the UI is intended to be used.
When it comes to specifying elements in browser automation, XPath is by and large evil. It is SLOW, the code it creates is often (as you are finding) very brittle, and it's nearly impossible to read and make sense of. Use it only as a last resort when nothing else will work.
If you are using a Watir API (as with Watir or Watir-webdriver), then you want to identify the element by its own attributes, such as class, name, id, text, etc. If that doesn't work, identify it via the closest container that wraps the element and can itself be found uniquely. If that doesn't work, identify a sibling or sub-element and use the .parent method to walk 'up' the DOM to the 'parent' container element.
To illustrate the brittleness and poor readability, take the following XPath from the comments and consider code that uses element_by_xpath on it:
/html/body/form/div[6]/div/table/tbody/tr[2]/td[2]/table/tbody/tr[2]/td/p/table[2]/tbody/tr/td[2]/p/table/tbody/tr[3]/td[2]
and then compare to this (where the entire code is shorter than just the xpath alone)
browser.cell(:text => "Total Funds Avail. for Trading").parent.cell(:index => 1).text
or, to be a bit less brittle, replace the index with some attribute of the cell whose text you want
browser.cell(:text => "Total Funds Avail. for Trading").parent.cell(:class => "balanceSnapShotCellRight").text
The XPath example is very difficult to make any sense of: there is no way to tell which element you are after or why the code selects it. And since there are so many index values, any change to the page design, or just extra rows in a table above the one you want, will break that code.
The second is much easier to make sense of. I can tell just by reading it what the script is trying to find on the page and how it locates it. Extra rows in the table, or other changes to page layout, will not break the code. (The exception is re-arranging the columns in the table, and even that could be avoided by using the class or some other characteristic of the target cell, as an example in the comments below does.)
For that matter, if the use of the class is unique to that element on the page then
browser.cell(:class => 'balanceSnapShotCellRight').text
would work just fine, as long as there is only one cell with that class on the page.
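The difference between the two styles can be reproduced without a browser. The following is a small sketch using Python's standard xml.etree.ElementTree rather than Watir; the table markup, the "Notice" row and the balance figure are invented for the demonstration.

```python
import xml.etree.ElementTree as ET


def build_table(extra_row=False):
    """Return a tiny balance table, optionally with a new row inserted on top."""
    rows = [
        "<tr><td>Total Funds Avail. for Trading</td>"
        '<td class="balanceSnapShotCellRight">1,234.56</td></tr>'
    ]
    if extra_row:
        rows.insert(0, '<tr><td>Notice</td><td class="note">Maintenance tonight</td></tr>')
    return ET.fromstring("<table><tbody>" + "".join(rows) + "</tbody></table>")


POSITIONAL = "./tbody/tr[1]/td[2]"                  # index-based, recorder style
BY_CLASS = ".//td[@class='balanceSnapShotCellRight']"  # attribute-based

for extra_row in (False, True):
    table = build_table(extra_row)
    print(extra_row, table.find(POSITIONAL).text, "|", table.find(BY_CLASS).text)
```

Once the extra row appears, the positional path silently returns the wrong cell, while the attribute-based lookup still finds the balance.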
Now, to be fair, I know there are ways to use XPath more elegantly to do something similar to the Watir code above. But even so, it's still not as easy to read and work with, and it is not how most people commonly (mis)use XPath to select objects, especially if they have used recorders that create brittle, cryptic XPath code similar to the sample above.
The answers to this SO question describe the three basic approaches to identifying elements in Watir. Each answer covers one approach; which one you use depends on what works best in a given situation.
If you are finding a challenge on a given page, start a question here about it and include a sample of the HTML before/after/around the element you are trying to work with, and the folks here can generally point you the way.
If you've not done so, work through some of the tutorials in the Watir wiki and notice how seldom XPath is used.
Lastly, you mention Firewatir. Don't use Firewatir: it's out of date, no longer being developed, and will not work with any recent version of Firefox. Instead use Watir-Webdriver to drive Firefox or Chrome (or IE).

You just need to output the "innerXml" (I don't know Watir) of the node selected by this XPath expression:
/
Update:
If by "dump" you mean something different, such as a set of XPath expressions each selecting one node, then have a look at the answer to this question:
https://stackoverflow.com/a/4747858/36305
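As a rough idea of what such a dump looks like, here is a sketch that prints one positional XPath per element. It uses Python's standard xml.etree.ElementTree, so it assumes the page source has already been obtained from the browser and is well-formed; real-world HTML would need an HTML parser first. The function name xpath_dump and the sample markup are made up for the example.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def xpath_dump(element, path=""):
    """Yield (xpath, text) for `element` and every descendant element."""
    if not path:
        path = "/" + element.tag
    yield path, (element.text or "").strip()

    totals = Counter(child.tag for child in element)
    seen = Counter()
    for child in element:
        seen[child.tag] += 1
        step = child.tag
        if totals[child.tag] > 1:
            # Only add an index when siblings share the same tag name.
            step += f"[{seen[child.tag]}]"
        yield from xpath_dump(child, f"{path}/{step}")


page = ET.fromstring(
    "<html><body><div>Header</div>"
    "<div><p>Balance</p><p>1,234.56</p></div></body></html>"
)
for xpath, text in xpath_dump(page):
    print(xpath, repr(text))
```

Running the dump at two different moments and comparing the output shows which steps in a recorded XPath actually shift between page loads.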

Which is the fastest way to get a URL from an HTML tag (Regex, NSScanner, Hpple)?

I found three different ways to get the value of the src attribute of an img tag in an HTML string:
With a regex, using RegexKitLite.
With the TFHpple HTML parser.
With an NSScanner, scanning the HTML string.
So which one should I use to get the best performance in my iPhone app?

Maybe not the fastest, but a regex is IMHO the most versatile and portable method. And unless you're really doing hundreds of parses per second, you won't notice the performance hit you'll get by not using the fastest method around.
I use lots of regexes on iPhone for user input validation while the user is entering text (so a lag would certainly be seen). Never had any problems.
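For comparison, here is what the regex approach and the parser approach look like side by side. This is a sketch in Python's standard library, not RegexKitLite, NSScanner or TFHpple; the pattern, the helper names and the sample markup are assumptions made for the example, and the regex deliberately handles only quoted attribute values.

```python
import re
from html.parser import HTMLParser

# Matches src="..." or src='...' inside an <img ...> tag (quoted values only).
IMG_SRC = re.compile(
    r"""<img\b[^>]*?\bsrc\s*=\s*(["'])(.*?)\1""",
    re.IGNORECASE | re.DOTALL,
)


def src_by_regex(html):
    return [match.group(2) for match in IMG_SRC.finditer(html)]


class ImgSrcCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names are already lower-cased by the parser.
        if tag == "img":
            src = dict(attrs).get("src")
            if src is not None:
                self.sources.append(src)


def src_by_parser(html):
    collector = ImgSrcCollector()
    collector.feed(html)
    collector.close()
    return collector.sources


sample = '<p><img alt="logo" src="http://example.com/a.png"><IMG SRC=\'b.jpg\'></p>'
print(src_by_regex(sample))
print(src_by_parser(sample))
```

Both return the same list for this input. To decide which is faster for a particular workload, timing the two functions on representative data (for instance with the timeit module) is more reliable than guessing.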

Parsing HTML which is not valid XML

I need to parse a website which has a lot of nested <div>s all over. I tried XML::Simple to get a nice tree structure, but the parse fails every time because there seem to be two or three unclosed <p> tags somewhere. I tried HTML::Parser, but that only lets me define handler functions that are called for the right tags; it does not give me their nested elements.
Is there any way to get XML::Simple to accept non-valid XML, or HTML::Parser to give me a handy tree structure?

The HTML::TreeBuilder builds nice trees and gives tons of handy methods to traverse it.

An alternative to something based on HTML::TreeBuilder is XML::LibXML->load_html(...).

But is it valid HTML? If so, XML::LibXML will do a marvelous job if you use the HTML parsing functions. It is lightning fast and provides a great interface. It should even be able to handle some bad HTML using the recover option.
Alternatively, HTML::Parser (often used via HTML::TreeBuilder or HTML::TreeBuilder::XPath) is renowned for handling bad HTML. It won't be as fast, though.
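The step from an event-based parser to a tree is small, which is essentially the layer a tree builder adds on top of handler callbacks. Below is a minimal sketch of that idea using Python's standard html.parser instead of the Perl modules; Node, TolerantTreeBuilder, parse and outline are names invented for the example. It tolerates unclosed tags by closing back to the nearest matching open element, but it does not implement HTML's implied end-tag rules, so an unclosed <p> simply keeps collecting children.

```python
from html.parser import HTMLParser


class Node:
    def __init__(self, tag, attrs=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.children = []  # Node objects and text strings, in document order


class TolerantTreeBuilder(HTMLParser):
    VOID = {"br", "hr", "img", "input", "link", "meta"}

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs)
        self.stack[-1].children.append(node)
        if tag not in self.VOID:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close back to the nearest matching open element; ignore stray end tags.
        for depth in range(len(self.stack) - 1, 0, -1):
            if self.stack[depth].tag == tag:
                del self.stack[depth:]
                return

    def handle_data(self, data):
        if data.strip():
            self.stack[-1].children.append(data.strip())


def parse(html):
    builder = TolerantTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def outline(node, depth=0):
    """Return the tree as indented lines, one per element or text chunk."""
    lines = []
    for child in node.children:
        if isinstance(child, Node):
            lines.append("  " * depth + child.tag)
            lines.extend(outline(child, depth + 1))
        else:
            lines.append("  " * depth + repr(child))
    return lines


tree = parse("<div><p>first<div><p>second</div></div>")
print("\n".join(outline(tree)))
```

The input has two unclosed <p> tags, yet the parse completes and yields a nested structure that can be traversed.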

POST/GET bindings in Racket

Is there a built-in way to get at POST/GET parameters in Racket? extract-binding and friends do what I want, but there's a dire note attached about potential security risks related to file uploads which concludes
Therefore, we recommend against their use, but they are provided for compatibility with old code.
The best I can figure is (and forgive me in advance)
(bytes->string/utf-8 (binding:form-value (bindings-assq (string->bytes/utf-8 "[field_name_here]") (request-bindings/raw req))))
but that seems unnecessarily complicated (and it seems like it would suffer from some of the same bugs documented in the Bindings section).
Is there a more-or-less standard, non-buggy way to get the value of a POST/GET-variable, given a field name and request? Or better yet, a way of getting back a collection of the POST/GET values as a list/hash/a-list? Barring either of those, is there a function that would do the same, but only for POST variables, ignoring GETs?

extract-binding is bad because it is case-insensitive, is very messy for inputs that return multiple times, doesn't have a way of dealing with file uploads, and automatically assumes everything is UTF-8, which isn't necessarily true. If you can accept those problems, feel free to use it.
The snippet you wrote works when the data is UTF-8 and when the field is returned only once. You can define it as a function to avoid writing it out many times.
In general, I recommend using formlets to deal with forms and their values.
Now your questions...
"Is there a more-or-less standard, non-buggy way to get the value of a POST/GET-variable, given a field name and request?"
What you have is the standard thing, although you wrongly assume that there is only one value. When there are multiple, you'll want to filter the bindings on the field name. Similarly, you don't need to turn the value into a string, you can leave it as bytes just fine.
"Or better yet, a way of getting back a collection of the POST/GET values as a list/hash/a-list?"
That's what request-bindings/raw does. It is a list of binding? objects. It doesn't make sense to turn it into a hash due to multiple value returns.
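The multiple-value point is easy to see with a small analogy. The sketch below uses Python's urllib.parse, not the Racket web server API; values_for is a hypothetical helper and the query string is invented. It shows why a list of bindings filtered by field name is safer than collapsing everything into a hash.

```python
from urllib.parse import parse_qsl


def values_for(bindings, field):
    """Return every value submitted for `field`, in order (case-sensitive)."""
    return [value for name, value in bindings if name == field]


# A list of (name, value) pairs, comparable in spirit to a list of bindings.
bindings = parse_qsl("tag=html&tag=parser&q=racket", keep_blank_values=True)

print(bindings)
print(values_for(bindings, "tag"))   # both values survive
print(dict(bindings)["tag"])         # a plain hash keeps only the last one
```

Filtering the list keeps every occurrence of a repeated field, whereas the dictionary conversion silently discards all but one.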
"Barring either of those, is there a function that would do the same, but only for POST variables, ignoring GETs?"
The Web server hides the difference between POSTs and GETs from you. You can inspect the URI and the raw POST data to recover them, but you'd have to parse them yourself. I don't recommend it.
Jay