How to extract data from a website and format it as raw text - iPhone Dev

I have been looking around for a while and have not found anything useful. I am also not sure I have worded the question in the clearest fashion, so apologies.
I have a section of an app I am building called 'Company News'. The company in question has a news page on their website which displays a title, an excerpt of text and a read more option.
At the moment the iPhone application just has a UIWebView which loads that URL and displays an error if no connection is available. However, if my user taps a story to read the news, it obviously opens up a new page. I want to avoid having to build in 'back' and 'forward' buttons, and to stay away from it looking like a browser within the app.
With that said, I am looking for a way to extract that data from the website and display it in my app as raw text. I am not particularly bothered about rich text formatting or anything fancy. I would just like the title and body of text.
Is this possible?

In essence, then, you are looking for an HTML parser.
Assuming the HTML you wish to parse has a predictable format, the approach I would take is to load the HTML via whatever URL loading system you want - e.g. NSURLConnection, ASIHTTPRequest, etc.
Then you will need to parse the raw HTML. I use XPath. It requires that you learn the syntax, but it should work.
For more details about how you might use XPath for parsing HTML, see the second response to this question. You will need to link to libxml2 in your project then use XPath to extract the nodes of interest.
Scraping web pages in this way is fragile, though, because it depends on the structure of a page you don't control and which could be changed unpredictably.
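As a rough illustration of the XPath approach described above, here is a minimal sketch in Python rather than Objective-C. It uses the limited XPath subset in the standard library's xml.etree.ElementTree, and it assumes the page is well-formed XHTML; real-world HTML usually needs a lenient parser such as libxml2's HTML parser. The sample markup, the class names news-item and excerpt, and the function name extract_stories are all made up for the example and are not taken from the company's actual page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample of a news page (an assumption for the sketch).
SAMPLE_PAGE = """<html><body>
<div class="news-item">
  <h2>Office move completed</h2>
  <p class="excerpt">We are now based in the new building.</p>
  <a href="/news/1">Read more</a>
</div>
<div class="news-item">
  <h2>Quarterly results</h2>
  <p class="excerpt">Revenue grew over the last quarter.</p>
  <a href="/news/2">Read more</a>
</div>
</body></html>"""


def extract_stories(page_source):
    """Return a list of (title, body) tuples, one per news item."""
    root = ET.fromstring(page_source)
    stories = []
    # Select every <div class="news-item"> anywhere in the document.
    for item in root.findall(".//div[@class='news-item']"):
        title = item.findtext("h2", default="").strip()
        body = item.findtext("p[@class='excerpt']", default="").strip()
        stories.append((title, body))
    return stories


if __name__ == "__main__":
    for title, body in extract_stories(SAMPLE_PAGE):
        print(title)
        print(body)
        print()
```

The XPath expressions are the part that has to track the page's structure, which is exactly why this kind of scraping breaks when the site is redesigned.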

Related

How can I "drill down" into a website using Perl's WWW::Mechanize

I have used the WWW::Mechanize Perl module on a number of projects and it's helped me out a lot.
I am trying to use it on a different site and I can't "drill down" into the content of the site.
The site is https://customer.bookingbug.com/?client=hantsrecyclingcentres#/services
I have tried to figure out what the URL would be to get the content shown in the resulting HTML, such as bb.d570283b87c834518ba9.css, bb.d570283b87c834518ba9.js and version.js
I tried to copy the resulting HTML into this posting, but used all sorts of quote and code sample combinations and it wouldn't display properly.
Does anyone have any idea how I "navigate" this site using this Perl module please?
WWW::Mechanize is a web client with some HTML parsing capabilities. But as you clearly noticed, the information you want is not in the HTML document you requested. Either download the correct document (whatever that might be), or do what the browser does and execute the JavaScript. This would require a JavaScript engine. The simplest way to achieve that is to remote-control a web browser (e.g. using Selenium::Chrome).

coldfusion show pdf on page

I know coldfusion has extensive pdf support, but I'm not sure if this is possible.
I was given a pdf form and told to make it so it is both filled out online, the data is captured, and the form can be printed.
Obviously, I can create an html page that looks like the document, save everything, generate the filled pdf form, etc.
Alternately, I think I can show the pdf, have them fill it, then grab the form data. I'm not entirely sure I can do this though, because I would need to detect when they are done filling it out.
But I was thinking it would be nice if I could do it this way - Show the pdf embedded on the webpage, let them fill it out and print it, then capture everything when they are done. I was looking through the CF documentation (cfpdf cfhttp, etc), but not finding exactly what I need. Is this an option?
You can extract the data from a PDF form either by having the form submit it as an HTTP POST or by using the cfpdfform tag. Here's a link to the docs on how to do that; which approach applies depends on how you set up the PDF itself. You can edit your PDF form so that it submits just the form data to a given CF page. The data arrives on the page in a struct tied to the form name (i.e. #form.form1.Fields.blah# etc.). Dump it out to decipher it (it's kind of convoluted). That way you could fire both print and submit from within the PDF.
The second way is to submit the PDF itself as a file. In this case you use the cfpdfform tag, which is not well documented or widely used. Both approaches are covered lightly in the link above. Good luck!
We can show the PDF on the page using the cfheader and cfdocument tags, as in the following example code.
<cfsavecontent variable="pdfcontent">
<!--- Put the content you want rendered into the PDF here --->
</cfsavecontent>
<!--- "inline" asks the browser to display the PDF rather than download it --->
<cfheader name="Content-Disposition" value="inline; filename=Mydocument.pdf">
<cfdocument format="pdf" orientation="landscape" bookmark="yes" marginleft=".25" marginright=".25" margintop=".25" marginbottom=".75" scale="90">
<cfoutput>#pdfcontent#</cfoutput>
</cfdocument>

Getting images by parsing constantly changing HTML

I'm in the process of developing an iOS app that retrieves images from a URL (http://m.9gag.com). It so happens that the HTML at that URL changes constantly, and whenever I have working code, the site's markup changes.
Is there any way I could pull those images from the HTML without having to worry about webpage changes? There is no public API at the moment so that's sadly not an option.
I look forward to hearing some options.
Also, if the page is set so that when the user scrolls to the bottom, it loads more content, how can I get more html to load based on how far down in the HTML parsing I've got? I'm not using a webview, I just need to update the HTML I initially retrieved.
It seems the simplest way in your case is to use a regular expression (for example http://[A-Za-z0-9_/\.]*\.jpg) to extract the image URLs, and to keep track of the images you have already pulled.
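To make that suggestion concrete, here is a minimal sketch in Python (the same idea carries over to NSRegularExpression on iOS). It uses the pattern from the answer as written, which means URLs containing hyphens, query strings or https would be missed. The function name new_image_urls and the example.com URLs are made up for the illustration.

```python
import re

# The pattern from the answer above, unchanged.
IMAGE_URL = re.compile(r"http://[A-Za-z0-9_/\.]*\.jpg")


def new_image_urls(html_text, seen):
    """Return image URLs found in html_text that are not yet in `seen`.

    `seen` is a set that is updated in place, so calling this again on a
    freshly downloaded copy of the page only yields images not pulled before.
    """
    fresh = []
    for url in IMAGE_URL.findall(html_text):
        if url not in seen:
            seen.add(url)
            fresh.append(url)
    return fresh


if __name__ == "__main__":
    seen = set()
    first_fetch = (
        '<img src="http://img.example.com/photo/a1.jpg">'
        '<img src="http://img.example.com/photo/b2.jpg">'
    )
    second_fetch = (
        '<img src="http://img.example.com/photo/b2.jpg">'
        '<img src="http://img.example.com/photo/c3.jpg">'
    )
    print(new_image_urls(first_fetch, seen))
    print(new_image_urls(second_fetch, seen))
```

Because the pattern ignores the surrounding markup entirely, it keeps working when the page layout changes, at the cost of also picking up any .jpg URL that is not one of the images you want.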

How safe is the data being parsed by RTF editors like TinyMCE?

I have a great concern about deploying the TinyMCE editor on a website. Looking at the code parsed by the editor, it does a great job, and I leave the HTML button off the toolbar configuration so users cannot inject their own source.
However, from what I read in the TinyMCE docs, it claims to degrade nicely to a regular textarea should JavaScript be disabled in a user's browser... and therein lies my concern. If it does revert to a normal textarea, the user can easily inject their own HTML, and this leaves me with a security concern.
I just pass through data created with TinyMCE, and it is used within another page created by my script, so it poses no security risk to my server. The security concern arises over what malicious data may be passed to another user viewing the generated page.
I know many of you will tell me to just use regexes, or parse this data, but that itself could be a nightmare, as I would be trying to either...
a.) Use regexes to try and clean up the HTML without breaking the generated page, and it is better to parse the data for that anyway.
b.) Reparse data that has already been parsed by the RTF editor, which would also probably end up breaking the generated page.
Anyone with any previous experience with this type of scenario, I would really appreciate a 'heads-up' as to any other risks that using an RTF editor for user data could entail.
I would really like to provide this as a user option, but not if it gives the user of the RTF editor a chance to take a whack at another user viewing the page that is generated by the script.
My gut feeling is to give the RTF editor a wide berth at this point.
Thanks for any direction you can give me with your own experiences.
You cannot have client-side security on the web. You simply can't trust the browser, because it's easy for a malicious user to substitute a replacement browser that does whatever he wants.
If you accept HTML from users (using TinyMCE or through any other method) and display it to other users, you must sanitize or validate the HTML in some way on the server. If you're using Perl, the leading package seems to be HTML::Scrubber (along with various other modules that help you plug it in to various frameworks). I haven't had occasion to try it myself.
The TinyMCE Security page mentions some ways to make it harder for people to submit arbitrary HTML, but you still need server-side checks.
Regex is generally not considered good for parsing HTML (see "RegEx match open tags except XHTML self-contained tags"), but I have noted the "perl" tag :)
My advice when taking markup from users is to always pass it through something that can accept malformed HTML and return well-formed HTML. These parsers generally produce something that can be queried and updated with some form of XPath.
In Python there is a module called BeautifulSoup, Ruby has Nokogiri and in ASP.NET there is a project called HtmlAgilityPack that all do this sort of thing. I'm not sure what library perl has, but I'm sure there would be something.
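To show roughly what server-side, parser-based sanitizing looks like, here is a minimal allowlist sketch using only Python's standard library html.parser. It is not how HTML::Scrubber, BeautifulSoup or any of the libraries named above work internally, and it is not a vetted security control: the allowed tag set, the names AllowlistSanitizer and sanitize, and the rule of keeping only http/https links are all assumptions made for the example. In production a maintained sanitizing library is the safer choice.

```python
from html import escape
from html.parser import HTMLParser
from urllib.parse import urlparse

# Assumed allowlist for the sketch; adjust to what your editor may emit.
ALLOWED_TAGS = {"p", "b", "i", "em", "strong", "ul", "ol", "li", "a", "br"}
VOID_TAGS = {"br"}
DROP_CONTENT = {"script", "style"}  # drop these tags and their contents


class AllowlistSanitizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0   # > 0 while inside <script> or <style>
        self.open_tags = []   # allowed tags opened but not yet closed

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self.skip_depth += 1
            return
        if self.skip_depth or tag not in ALLOWED_TAGS:
            return
        attr_text = ""
        if tag == "a":
            # Keep href only for http/https links; drop every other attribute.
            href = dict(attrs).get("href") or ""
            if urlparse(href).scheme in ("http", "https"):
                attr_text = f' href="{escape(href, quote=True)}"'
        self.out.append(f"<{tag}{attr_text}>")
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if self.skip_depth or tag not in self.open_tags:
            return
        # Close everything up to and including the matching open tag.
        while self.open_tags:
            open_tag = self.open_tags.pop()
            self.out.append(f"</{open_tag}>")
            if open_tag == tag:
                break

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data))


def sanitize(html_text):
    """Return html_text reduced to allowlisted tags with escaped text."""
    parser = AllowlistSanitizer()
    parser.feed(html_text)
    parser.close()
    while parser.open_tags:  # close anything the user left open
        parser.out.append(f"</{parser.open_tags.pop()}>")
    return "".join(parser.out)


if __name__ == "__main__":
    print(sanitize('<p onclick="x()">Hi <b>there</b></p><script>alert(1)</script>'))
```

The important design choice is that the output is rebuilt from scratch out of known-safe pieces instead of deleting known-bad pieces from the input, so anything the sanitizer does not recognise is dropped by default.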

iPhone: random "image of the day" type service?

The iPhone makes it really simple to snarf down an image from the web; you can turn a URL into a UIImage in one line of code. So I'd like to enable my app (an educational puzzle game... my first!) to download some random images to make it more interesting and dynamic.
I thought about using Kodak's image of the day RSS feed, but I'm having quite a time figuring out how to parse it. Rather than being a simple list of image URLs, it seems to reference a bunch of "jhtml" URLs, which run Javascript to display the images in your RSS reader. Is this intentionally obfuscated, or am I missing some basic step to parse this?
I also tried the Astronomy Picture of the Day, via this RSS feed, but it's just the original page's HTML stuffed into CDATA... ugh.
So I guess this is really two questions:
Is there a simple way to parse these feeds to actually get at the JPG URLs on the iPhone?
Is there a better source for "picture of the day" type images?
PS: I'm using NSXMLParser, which I learned to use here.
I would recommend going with something that has an API, perhaps the Flickr "Interestingness" feed:
http://www.flickr.com/services/api/flickr.interestingness.getList.html
There is an Objective-C library written to help with accessing Flickr, but I'm not sure if this API call is included:
http://github.com/lukhnos/objectiveflickr/tree/master