Best Way to Parse HTML to XML

I have an iPhone app that queries and parses an XML file on my server. Right now I have to update and upload that XML file by hand every morning so my users get current information. I would like to automate this, which would mean parsing various websites (NYTimes, iAmBored.com, etc.), writing the relevant information from each site to an XML file, and uploading that file to my server.
Does anyone know the best way to do this (turning HTML into an XML file)? Since I am a beginner, I'm not sure which languages this requires or what the best approach is.
Thanks a lot in advance!

You can try to translate the HTML to XHTML (XHTML is based on XML, so it is XML with some rules defined in a DTD).
You can also try to parse the HTML directly with an SGML parser (just as XHTML is based on XML, HTML is based on SGML).
The links are provided as inspiration.

If the content you need to scrape is XHTML, you can use XSLT to transform the original content into what you need in the XML you provide to your users.
Otherwise, any scraping and XML-producing solution will do; most programming languages have libraries for this. You could, for example, use XPath to select the elements you need from the page and then save them to the output file.
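
As a rough illustration of the XPath idea, here is a minimal Python sketch using only the standard library. The sample page, the `headline` class, and the output element names (`items`, `item`) are all made up for the example. Note that `xml.etree.ElementTree` supports only a subset of XPath and only accepts well-formed input such as XHTML, so real-world HTML would need cleaning up first.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed XHTML page used only for this example.
PAGE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <h2 class="headline"><a href="/story/1">First story</a></h2>
    <h2 class="headline"><a href="/story/2">Second story</a></h2>
    <h2 class="other">Not a headline</h2>
  </body>
</html>"""

# XHTML elements live in this namespace, so the query must name it.
NS = {"x": "http://www.w3.org/1999/xhtml"}


def extract_items(xhtml_text):
    """Select headline links with an XPath-style query and build output XML."""
    root = ET.fromstring(xhtml_text)
    out = ET.Element("items")  # assumed output format
    for link in root.findall(".//x:h2[@class='headline']/x:a", NS):
        item = ET.SubElement(out, "item", {"href": link.get("href", "")})
        item.text = link.text
    return out


if __name__ == "__main__":
    print(ET.tostring(extract_items(PAGE), encoding="unicode"))
```

The resulting element tree can be written to a file with `ET.ElementTree(out).write(...)` and uploaded to the server by whatever scheduled job runs the script.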

Can you get what you need from the RSS/Atom feeds? That will simplify things greatly because they are XML rather than HTML and can be parsed by a standard XML parser. Of course, descriptions embedded inside RSS feeds will be HTML, so depending on your application, that may be when you need to parse HTML.
XSLT is a domain-specific programming language designed for processing XML, but you can also use any programming language that includes an XML parser for the task.
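
To show how little is needed when a feed is available, here is a small Python sketch that reads an RSS 2.0 document with a standard XML parser. The feed content is invented for the example; a real script would download the feed first. The second assumption is that the feed is well-formed XML, which not every feed in the wild is.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 document; a real feed would be fetched over HTTP.
FEED = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>Story one</title>
      <link>http://example.com/1</link>
      <description>&lt;p&gt;Summary with &lt;b&gt;HTML&lt;/b&gt;&lt;/p&gt;</description>
    </item>
    <item>
      <title>Story two</title>
      <link>http://example.com/2</link>
      <description>Plain summary</description>
    </item>
  </channel>
</rss>"""


def read_items(feed_text):
    """Return a (title, link, description) tuple for each RSS item."""
    root = ET.fromstring(feed_text)
    items = []
    for item in root.iter("item"):
        items.append((
            item.findtext("title", default=""),
            item.findtext("link", default=""),
            # The description comes back unescaped, so it may contain HTML.
            item.findtext("description", default=""),
        ))
    return items


if __name__ == "__main__":
    for title, link, description in read_items(FEED):
        print(title, link, description)
```

The first item's description comes out as an HTML fragment, which is the case mentioned above where an HTML parser is still needed.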

TagSoup - Just Keep On Truckin'
...a SAX-compliant parser written in Java that, instead of parsing well-formed or valid XML, parses HTML as it is found in the wild: poor, nasty and brutish, though quite often far from short. TagSoup is designed for people who have to process this stuff using some semblance of a rational application design. By providing a SAX interface, it allows standard XML tools to be applied to even the worst HTML. TagSoup also includes a command-line processor that reads HTML files and can generate either clean HTML or well-formed XML that is a close approximation to XHTML.
Also, Taggle, a TagSoup in C++, available now

Related

Objective-C – Smart programming with dependencies

I have an iOS application that parses XML data from the web. I've set it up to parse some XML tags and then display some information in the application.
I do not own the XML data, so it's quite possible that the XML tags could change without my knowledge, rendering my iOS application useless because it can no longer parse the data.
So instead of having the application crash when (if) they change the XML tags, I was thinking of having the application send an e-mail in the background to alert me that the tags have changed, or something like that. Is that possible, and is it even a smart solution to my problem?

Why don't you parse the XML file on the server side, using whatever technology you prefer, and serve your own controlled XML file to your iOS application? That way you have full control over the XML tags your application expects. If the other party changes the tags, you just rewrite your server-side program to handle the changes gracefully.
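
A server-side step like this could look roughly like the Python sketch below. The tag names, the alias table, and the `entry` output element are all hypothetical; the point is only that the server maps whatever the third party currently sends onto a fixed schema and reports which expected fields it could not find, which is also a natural place to trigger an alert.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping: the tag the app expects -> tag names the third
# party is known to have used for that field.
TAG_ALIASES = {
    "title": ["title", "headline"],
    "body": ["body", "content", "text"],
}


def normalize(upstream_xml):
    """Rewrite third-party XML into the fixed schema the app expects.

    Returns (xml_string, missing), where missing lists the expected
    fields that were not found under any known alias.
    """
    source = ET.fromstring(upstream_xml)
    out = ET.Element("entry")
    missing = []
    for wanted, aliases in TAG_ALIASES.items():
        # Take the first element that matches any known alias.
        found = next(
            (el for name in aliases for el in source.iter(name)), None
        )
        if found is None:
            missing.append(wanted)  # candidate for an alert e-mail or log
            continue
        ET.SubElement(out, wanted).text = found.text
    return ET.tostring(out, encoding="unicode"), missing


if __name__ == "__main__":
    print(normalize("<doc><headline>Hi</headline></doc>"))
```

When `missing` is non-empty the server knows the upstream format has changed, while the app keeps receiving the same tags it always did.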

How safe is the data being parsed by RTF editors like TinyMCE?

I have a great concern about deploying the TinyMCE editor on a website. Looking at the code parsed by the editor, it does a great job, and I leave the HTML button off the toolbar configuration so users cannot inject their own source.
However, from what I read in the TinyMCE docs, it claims to degrade nicely to a regular textarea should JavaScript be disabled in a user's browser... and therein lies my concern. If it does revert to a normal textarea, then the user can easily inject their own HTML, and this leaves me with a security concern.
I just pass through data created with TinyMCE, and it is used within another page created by my script, so it poses no security risk to my server. The security concern arises over what malicious data may be passed to another user viewing the generated page.
I know many of you will tell me to just use regexes or to parse this data, but that itself could be a nightmare, as I would be trying to either:
a.) use regexes to clean up the HTML without breaking the generated page, when it is better to parse the data for that anyway, or
b.) reparse data that has already been parsed by the RTF editor, which would probably also end up breaking the generated page.
Anyone with any previous experience with this type of scenario, I would really appreciate a 'heads-up' as to any other risks that using an RTF editor for user data could entail.
I would really like to provide this as a user option, but not if it gives someone using the RTF editor a chance to take a whack at another user viewing the page generated by the script.
My gut feeling is to give the RTF editor a wide berth at this point.
Thanks for any direction you can give me with your own experiences.

You cannot have client-side security on the web. You simply can't trust the browser, because it's easy for a malicious user to substitute a replacement browser that does whatever he wants.
If you accept HTML from users (using TinyMCE or through any other method) and display it to other users, you must sanitize or validate the HTML in some way on the server. If you're using Perl, the leading package seems to be HTML::Scrubber (along with various other modules that help you plug it in to various frameworks). I haven't had occasion to try it myself.
The TinyMCE Security page mentions some ways to make it harder for people to submit arbitrary HTML, but you still need server-side checks.
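
To make the idea of a server-side check concrete, here is a minimal whitelist sanitizer sketched in Python with only the standard library. It is not HTML::Scrubber and not a vetted sanitizer; the allowed tag list is an assumption for the example. It keeps a few formatting tags, throws away every attribute, drops the contents of `script` and `style`, and escapes all remaining text. It does not balance unclosed tags.

```python
from html import escape
from html.parser import HTMLParser

# Assumed whitelist for the example; adjust to what the page should allow.
ALLOWED_TAGS = {"p", "b", "i", "em", "strong", "ul", "ol", "li"}
# Elements whose text content is dropped entirely.
DROP_CONTENT = {"script", "style"}


class _Sanitizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip = 0  # greater than zero while inside script/style

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self._skip += 1
        elif tag in ALLOWED_TAGS and not self._skip:
            # Attributes are discarded, which removes onclick=, style=, etc.
            self.parts.append("<%s>" % tag)

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            self._skip = max(0, self._skip - 1)
        elif tag in ALLOWED_TAGS and not self._skip:
            self.parts.append("</%s>" % tag)

    def handle_data(self, data):
        if not self._skip:
            # Text is always re-escaped before it is written out.
            self.parts.append(escape(data))


def sanitize(user_html):
    """Return user_html reduced to whitelisted tags and escaped text."""
    parser = _Sanitizer()
    parser.feed(user_html)
    parser.close()  # flush any buffered text
    return "".join(parser.parts)


if __name__ == "__main__":
    print(sanitize('<p onclick="x()">Hi <script>alert(1)</script><b>there</b></p>'))
```

Whatever tool is used, the check has to run on the server on every submission, regardless of whether the browser showed TinyMCE or a plain textarea.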

Regex is generally not considered good for parsing HTML (see "RegEx match open tags except XHTML self-contained tags"), but I have noted the "perl" tag :)
My advice when taking markup from users is to always run it through something that can accept malformed HTML and return well-formed HTML. These parsers generally produce something that can be queried and updated with some form of XPath.
In Python there is a module called BeautifulSoup, Ruby has Nokogiri, and for ASP.NET there is a project called HtmlAgilityPack; they all do this sort of thing. I'm not sure what library Perl has, but I'm sure there is something.

How to parse an HTML string on iPhone

I'm using a PHP file to supply data to my application. I post data to the server, and the response I get back from the server is in HTML format.
So the problem is that I have a string containing HTML tags. How do I use the data in that string, and how do I extract data from the HTML string?

Use the NSXMLParser class. It works for HTML too, at least when the markup is well-formed. There are three useful delegate methods.

If your HTML output is some simple data, maybe you can write a simple NSString-based parser yourself, as 'markhunte' mentioned; if you have large, complex data in HTML, you will have to go for one of the open source parsers.
Cocoa does not provide an HTML parser. Forum discussions claim that in some cases the XML parser itself will work for you, but I never got it working for my data.
In my case I had a very simple tag, which I handled with my own NSString-based parser.

I have used the code from "Flatten-html-content-ie-strip-tags-cocoaobjective-c".
There are also examples of its use on SO.

Just use NSScanner; it is great for searching between tags that do not change. If you post some of the page code, I can help you set up the scanner.
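
The scanning approach is easy to show in outline. The sketch below is Python rather than Objective-C and uses plain string searching as a stand-in for NSScanner: find a fixed opening marker, read up to a fixed closing marker, repeat. The markers and sample string are invented for the example.

```python
def scan_between(text, start_marker, end_marker):
    """Return every substring found between start_marker and end_marker."""
    results = []
    pos = 0
    while True:
        start = text.find(start_marker, pos)
        if start == -1:
            break  # no more opening markers
        start += len(start_marker)
        end = text.find(end_marker, start)
        if end == -1:
            break  # opening marker without a matching close
        results.append(text[start:end])
        pos = end + len(end_marker)
    return results


if __name__ == "__main__":
    # Hypothetical server response with fixed, known tags.
    sample = '<td class="price">10</td><td class="price">25</td>'
    print(scan_between(sample, '<td class="price">', "</td>"))
```

This only works while the surrounding tags stay exactly the same, which is the condition the answer above relies on.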

Parsing source of a webpage with Objective-C

Is there a way to parse a website's source on the iPhone to get the URL's of photos on that page? If so how would you do that?
Thanks

I'd say go for regular expressions - there is a one-page library that wraps C regexes that you can drop into your project.

I recommend regular expressions. There's a great open source Regex library for Cocoa called RegexKit. For the most part, you can just drop it in your code and it'll "just work".
Getting all the urls of images wouldn't be too difficult (less than 20 lines of code) if you assume that all images are going to be in <img> tags. You'd just grab all the image tags (something like: <img\s+[^>]+>), then iterate through those matches. For each match, you'd pull out whatever's in the src attribute: src\s*=\s*("|')?\s*([^\s"']+)(\s|"|')
You might need to tweak that a bit, but it shouldn't be too bad.
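
For reference, this is how the two patterns above behave when tried out in Python (used here only because it is quick to test in; RegexKit's API is not shown). The sample HTML is invented. One tweak that turns out to be needed: as written, the `src` pattern expects whitespace or a quote after the value, so an unquoted `src=pic.jpg` at the very end of a tag is not matched.

```python
import re

# The two patterns from the answer above, made case-insensitive.
IMG_TAG = re.compile(r"<img\s+[^>]+>", re.IGNORECASE)
SRC_ATTR = re.compile(r"""src\s*=\s*("|')?\s*([^\s"']+)(\s|"|')""", re.IGNORECASE)


def image_urls(html_text):
    """Return the src value of every <img> tag found in html_text."""
    urls = []
    for tag in IMG_TAG.findall(html_text):
        match = SRC_ATTR.search(tag)
        if match:
            urls.append(match.group(2))  # group 2 is the URL itself
    return urls


if __name__ == "__main__":
    sample = (
        '<p><img src="http://example.com/a.png" alt="A">'
        "<IMG class=\"x\" src='/b.jpg' ></p>"
    )
    print(image_urls(sample))
```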

There is no super easy way. When I had to do it I wrote a libxml2 SAX parser. libxml2 has an html reader that works fairly well with malformed html, and libxml2 is included with the base system.

You could try it using regular expressions, but I wouldn't recommend that. You should have a look at NSXMLParser, assuming the webpage is coded to be XHTML compliant. TouchXML is another good library.

Take a look at Event Driven XML Parsing in the iPhone reference library.

Are you OK with whatever approach you use not picking up images that are loaded dynamically via JavaScript?
The closest thing I could see working is to parse out any JavaScript imports, load those up too, and then use a regular expression across the whole file looking for anything that ends in ".jpg/.gif/.png" and grab the full URL out from that. The libxml approach would miss out on references to images not in img tags, but it might well be good enough.
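
The "regular expression across the whole file" part might look like the Python sketch below. It assumes absolute `http`/`https` URLs that end exactly in one of the three extensions, so relative paths and URLs with query strings are missed, and it does not fetch any JavaScript imports; the sample text is made up.

```python
import re

# Absolute URLs ending in .jpg/.gif/.png, followed by a delimiter or end of text.
IMAGE_URL = re.compile(
    r"""https?://[^\s"'<>()]+?\.(?:jpg|gif|png)(?=["'\s<>()]|$)""",
    re.IGNORECASE,
)


def find_image_urls(text):
    """Return image URLs found anywhere in text, in order, without repeats."""
    seen = []
    for url in IMAGE_URL.findall(text):
        if url not in seen:
            seen.append(url)
    return seen


if __name__ == "__main__":
    sample = (
        'var a = "http://example.com/x/pic.PNG"; '
        "body{background:url(http://example.com/bg.gif)} "
        '<a href="http://example.png.com/page.html">'
    )
    print(find_image_urls(sample))
```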

iPhone RSS Reader -- parseXML won't Load some XML feeds

I am using the SIMPLE RSS reading example found at http://theappleblog.com/2008/08/04/tutorial-build-a-simple-rss-reader-for-iphone/
It uses parseXML to load the RSS feeds.
Here is the problem I am having: for the following RSS feed, I cannot get it to load. It comes up with an error that it cannot connect. However, in my Mac RSS reader it works fine, so I know the link is good.
Any ideas on why it cannot load this particular feed but it can load others fine?
http://www.okstate.com/rss.dbml?db_oem_id=200&media=news
Thanks.

I've just released an open source RSS/Atom Parser for iPhone and hopefully it might be of some use.
I'd love to hear your thoughts on it too!

In my experience, HTML markup causes an RSS parser to fail in most cases. I've had this problem with a lot of the parser classes I've come across (in search of the ultimate one, which I didn't find).
My guess is that entities such as &#39;s are responsible for your crash. That was usually the case with my crashes. This also led to my decision to create a 'proxy server' to pre-parse the XML before sending it to the iPhone (which gives me the advantage of caching, scaling, and some other stuff). I do believe there are solid solutions out there, but it is always difficult to write a parser for so many RSS implementations.
P.S.: W3C validates this feed as 'valid', so it really is 'our' problem.

Your problem could lie with:
Unicode characters (e.g. I see some o's with two dots above them in the feed)
The code you have doesn't respect CDATA sections correctly
To find out which is the case, save the feed file to your local disk and load it via your code to make sure the error still happens.
Then do a binary search on the file to find out whether a particular RSS entry is causing the problem (i.e. remove all but the first RSS entry and see if the problem still exists. If it does, the problem is there; if it doesn't, put half of the RSS entries back in the file and repeat).
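
The halving procedure can be automated once the feed has been split into its entries. The Python sketch below assumes exactly one entry is at fault and uses Python's own XML parser as a stand-in for the app's parser; in practice `parses_ok` would call whatever code is actually failing. The sample entries are invented.

```python
import xml.etree.ElementTree as ET


def parses_ok(items):
    """True if the given <item> snippets form a parseable RSS document."""
    doc = "<rss><channel>%s</channel></rss>" % "".join(items)
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False


def find_bad_item(items):
    """Binary-search for the index of the single item that breaks parsing.

    Assumes exactly one bad item; returns None if everything parses.
    """
    if parses_ok(items):
        return None
    lo, hi = 0, len(items)  # the bad item is somewhere in items[lo:hi]
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if parses_ok(items[lo:mid]):
            lo = mid  # first half is clean, so look in the second half
        else:
            hi = mid  # first half already fails
    return lo


if __name__ == "__main__":
    entries = [
        "<item><title>One</title></item>",
        "<item><title>Two</title></item>",
        "<item><title>Tom & Jerry</title></item>",  # unescaped ampersand
        "<item><title>Four</title></item>",
    ]
    print(find_bad_item(entries))
```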

I've been experiencing a similar issue. I haven't yet pinned down the answer, but I've noticed that RSS 2 tends to parse more successfully than the rest.

There are many RSS feeds that contain invalid XML, usually because they were hacked together on the server side using HTML templates by somebody who didn't understand XML. I've seen improperly escaped (or non-escaped) HTML post contents, missing close tags, badly nested tags, and so on.
If you want to be able to parse arbitrary feeds, you have to clean up bad XML. The usual way is to use the "htmlTidy" library, which is included in the OS. This can clean up XML as well as HTML.
This example you're following uses NSXMLParser -- I have no idea why. It's a lower-level API and it doesn't support tidying. I would suggest using NSXMLDocument instead. There's a flag in that API that will tell it to use tidy when parsing the XML. This API also returns you the XML as a handy tree of elements that's easy to work with.
