Parsing HTML which is not valid XML - perl

I need to parse a website which has a lot of nested <div>s all over. I tried XML::Simple to get a nice tree structure, but the parse fails every time because there seem to be two or three unclosed <p> tags somewhere. I tried HTML::Parser, but that only lets me define handler functions that give me the right tags, not their nested elements.
Is there any way to get XML::Simple to accept non-valid XML, or HTML::Parser to give me a handy tree structure?

HTML::TreeBuilder builds nice trees and gives you tons of handy methods to traverse them.
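A minimal sketch of that approach, assuming HTML::TreeBuilder is installed from CPAN. The sample markup (with deliberately unclosed <p> tags) and the variable names are made up for illustration:

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Illustrative input: nested <div>s and two <p> tags that are never closed.
my $html = '<div id="outer"><p>first<p>second<div class="inner">deep</div></div>';

# Build the tree; the parser copes with the missing </p> tags.
my $tree = HTML::TreeBuilder->new_from_content($html);

# look_down() in list context returns every matching element below the root.
my @divs  = $tree->look_down(_tag => 'div');
my @texts = map { $_->as_text } @divs;

print "$_\n" for @texts;

# Free the tree explicitly once it is no longer needed.
$tree->delete;
```

Each element returned by look_down() is an HTML::Element, so its children, attributes and parent are all reachable from it.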

An alternative to something based on HTML::TreeBuilder is XML::LibXML->load_html(...).

But is it valid HTML? If so, XML::LibXML will do a marvelous job if you use the HTML parsing functions. It is lightning fast and provides a great interface. It should even be able to handle some bad HTML using the recover option.
Alternatively, HTML::Parser (often used via HTML::TreeBuilder or HTML::TreeBuilder::XPath) is renowned for handling bad HTML. It won't be as fast, though.
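For comparison, here is a sketch of the XML::LibXML route with the recover option turned on. It assumes XML::LibXML (and the underlying libxml2) is installed; the sample markup and the XPath expression are illustrative only:

```perl
use strict;
use warnings;
use XML::LibXML;

# Illustrative input with unclosed <p> tags.
my $html = '<div id="outer"><p>first<p>second<div class="inner">deep</div></div>';

# recover => 1 asks the parser to keep going when it meets broken markup;
# any problems it finds may be reported as warnings on STDERR.
my $doc = XML::LibXML->load_html(string => $html, recover => 1);

# The result is a DOM document, so XPath queries work on it.
my @inner = $doc->findnodes('//div[@class="inner"]');

print $_->textContent, "\n" for @inner;
```

How much broken markup the parser can recover from depends on libxml2, so this is worth trying against the real page before committing to it.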

Related

iPhone - entering equations

I've been researching this topic for a few weeks now, but I'm still unsure as to what is the "best" way to approach this problem.
I am designing an app, and part of the input involves entering an equation (i.e. a mathematical function). I'm not looking for anything super complicated; it's single-variable, at least for now.
What is the best way to approach entry and parsing? Is there a parser that is very good for this? What about a graphical approach such as dragging/selecting parts and assembling a function by its components?
Thanks.
You should be able to use regular expressions to parse it out.
Check out NSRegularExpression and Google around for a regex that will parse out the equation into its different parts.
If you want to make your application extensible (for the future), you should read something about parser theory. There is a simple example on Wikipedia (here) from which you can start. It uses flex (to generate the lexer) and bison (to generate the parser), which can be easily integrated with Objective-C code.
If that example is more than you expected, you can start with a simpler one from the bison manual (here).
You can use MathML products like MathType and MathMagic.
For other products, see this.
If you want to use JavaScript for formatting, then use jqMath.

Perl module that works like Data::Dumper but allows data manipulation

Is there a popular Perl module that works like Data::Dumper but allows the user to write hooks to manipulate the data inside a complex structure or object?
There are a few modules showing up in Google, such as Data::Visitor or Data::Structure::Util, that might do the job, but I'm not sure if they are the popular ones.
I've written Data::Dmap to do this, but as mentioned, Data::Rmap, Data::Transformer and Data::Visitor are also relevant.
The basic idea of Data::Dmap is that it lets you transform anything in a nested data structure while still trying to behave like the built-in map function.
I am not sure it is what you mean, but Data::Dump supports hooks to filter dumped data. Similar hooks are also possible in Data::Printer.
Edit: If you need editing, I would look at Data::Rmap or Data::Transformer. Also, if your structure is simple (say only scalars, hashes and arrays), you can make simple recursive traversal yourself.
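For the simple case (only scalars, hashes and arrays), a hand-rolled recursive traversal could look like the sketch below. The transform() helper and the sample data are hypothetical and not taken from any of the modules named above:

```perl
use strict;
use warnings;

# Hypothetical helper: walk hashes and arrays recursively and return a
# transformed copy. $hook is called once for every leaf value.
sub transform {
    my ($data, $hook) = @_;
    my $type = ref $data;

    if ($type eq 'HASH') {
        my %copy;
        $copy{$_} = transform($data->{$_}, $hook) for keys %$data;
        return \%copy;
    }
    if ($type eq 'ARRAY') {
        return [ map { transform($_, $hook) } @$data ];
    }

    # Plain scalars (and any other reference type) are handed to the hook.
    return $hook->($data);
}

my $in  = { name => 'foo', tags => [ 'a', 'b' ], nested => { n => 1 } };

# Example hook: upper-case every defined leaf value.
my $out = transform($in, sub { my ($v) = @_; defined $v ? uc $v : $v });
```

Because transform() builds a copy, the original structure is left untouched. Objects and code references are passed to the hook as-is, which is where the CPAN modules mentioned above do more work.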
YAML is a nice serialization format, easy to edit string values and such. It might not handle all your objects, but it's worth a try, and it both serializes and reloads things easily.

objective-c - Which lib should I use to parse HTML?

I am trying to parse some uncomplicated RSS HTML content on the iPhone.
So I don't need a heavy HTML parser.
I have searched here and found these two:
https://github.com/topfunky/hpple
https://github.com/zootreeves/Objective-C-HMTL-Parser
Both are simple to use. But I guess they have their problems for my purpose.
TFHpple is good, but an element does not carry its complete HTML <> tag string. I need this complete tag string because I need to remove it from the whole HTML string. It would be more convenient for me if the element had that.
The zootreeves HTML parser is also simple and good, and it keeps the complete tag string with every element, which I am very happy about. However, it seems to be a big memory consumer. I monitored it: if I parse a large number of HTML fragments (say, 1000), the memory it uses, and keeps occupied, is about 40 MB. That is not acceptable on iOS devices. I guess zootreeves uses pure C code and linked lists to organise the tree structure of the HTML, and it uses plain malloc and free for memory. I don't know whether that affects iOS memory.
So, can anyone recommend a state-of-the-art, fast and simple HTML parser for iOS?
Thanks
I'd use libxml2. It's not just for xml; it has an HTML parser too. It's fast and low-memory and is available in iOS. The only drawback is that it's a C-based API, but for all that it's not terribly difficult to work with.
Update
In response to the first comment below: It's been a while, so I'm not sure, but I don't think so. What you get is a data structure with lots of information about the document structure, and each tag has a list of attribute/value pairs. Nowhere is the original HTML string stored (I presume it is considered redundant and is left out to save memory).
However, it doesn't seem like you actually need it for what you want to do. It seems to me that you are using information from the parser to modify the original string, stripping out HTML tags. What you want to do instead is to rebuild the document using information from the parse tree, and when you do this, leave out the tags you want omitted.

TTXMLParser Sample Code?

Is anybody familiar with how to use TTXMLParser? I can't find any documentation or sample code on it.
Is it SAX or DOM?
Does it support XPath?
Can I extract CDATA from elements?
I have an application that already uses several Three20 modules, so it would be a shame to have to use another parser.
The main documentation I've found for TTXMLParser is in the header file. The comment there gives an overview of what TTXMLParser does.
TTXMLParser shouldn't really be thought of as an XML parser in the way you are thinking of it -- in this sense, questions such as "is it SAX or DOM" and "does it support XPath" aren't directly applicable. Instead, think of TTXMLParser as a convenience class to take XML and turn it into a tree of Objective-C objects. For example, this XML node:
<myNode attr1="value1" attr2="value2" />
would be turned into an Objective-C NSDictionary node which maps the key "attr1" to the value "value1" and the key "attr2" to the value "value2".
TTXMLParser internally uses NSXMLParser (which is basically SAX) to build up its tree, but you, as the user of TTXMLParser, don't have to do any SAX-like stuff.
So, no, you will not end up with an XML document on which you can perform XPath queries. Instead, you will end up with an Objective-C tree of objects. If that's what you want, great; if you want a traditional XML parser with XPath, I'm currently working on a project that uses both Three20 and TouchXML. TouchXML supports XPath.
I agree it's hard to find sample code for TTXMLParser. Three20's TTTwitter sample used to use TTXMLParser (well, actually TTURLXMLResponse, which in turn uses TTXMLParser), but at some point it was changed to use TTURLJSONResponse instead, which is a shame, because this was their only XML sample.
You can still see the old XML-based sample code here. Specifically, look at the -[requestDidFinishLoad:] method near the bottom of the file for an example of code that takes a TTURLXMLResponse, queries its rootObject member, and then walks down the resulting tree of objects.

Should I use the function-oriented or object-oriented CGI interfaces?

I've been learning about the CGI module lately, and the book I'm using shows there are two ways you can use CGI: function-oriented or object-oriented. It says the only benefit of the object-oriented style is being able to create two CGI objects. First of all, is this true, and are there any other benefits? Secondly, what is an example of using two CGI objects?
When I need to put together a very simple CGI script, I use the CGI module's OO interface.
I use the OOP interface because the standard, imperative interface imports a ton of symbols that may conflict with my own symbols. I don't like this, so I always prevent symbol importation. I don't write "use CGI;". Instead, I write "use CGI ();".
I also limit my use to generating the header and parsing parameters. I always generate HTML as HTML or, better yet, use a template module like Template Toolkit.
I strictly avoid CGI's HTML generation functions. Why?
I (along with many other people) already know HTML, and I see no benefit in learning CGI's pseudo-html interface.
When a script grows up and needs to be used in another environment, it is easier to extract the HTML blocks or templates and reuse them.
Don't interpret what I've written as a blanket condemnation of CGI.pm. There's plenty to love about CGI.pm. It gets content type generation right. It makes parameter parsing trivial. It is a core module. It makes command line debugging and testing easy.
I think I have found the answer to my question
http://perldoc.perl.org/CGI.html#PROGRAMMING-STYLE
Reading through the FAQ, one example given for using multiple CGI objects is that I can save the state of a CGI object and later load previously saved CGI objects, which is quite useful.
Beyond the advantages you cite I'd also point out that OOP usage of CGI.pm is much cleaner to read (at least for me) and manage than the functional version.
I also suspect it is more common, so people who have to maintain your code after you (including you six months from now) will find it easier to work with.
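To make the object-oriented style concrete, here is a small sketch that imports no symbols and keeps two CGI objects side by side. It assumes CGI.pm is installed (it no longer ships with recent Perl releases); the query strings and variable names are invented for illustration:

```perl
use strict;
use warnings;
use CGI ();   # empty import list: nothing is exported into this namespace

# Two independent query objects, each built from a literal query string.
# In a real script one of them would normally come from the request itself.
my $previous = CGI->new('user=alice&page=1');
my $current  = CGI->new('user=alice&page=2');

# param() is called in scalar context to fetch a single value.
my $old_page = $previous->param('page');
my $new_page = $current->param('page');

# Header generation also goes through the object.
print $current->header(-type => 'text/plain');
print "moved from page $old_page to page $new_page\n";
```

With the function-oriented interface there is only one implicit query object, so comparing two request states like this would need extra bookkeeping.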