Getting images by parsing constantly changing HTML

I'm developing an iOS app that retrieves images from a URL (http://m.9gag.com). The HTML at that URL changes constantly, and whenever I get working code, the site's style changes again.
Is there any way I could pull those images from the HTML without having to worry about webpage changes? There is no public API at the moment so that's sadly not an option.
I look forward to hearing some options.
Also, if the page loads more content when the user scrolls to the bottom, how can I load more HTML based on how far my parsing has got? I'm not using a web view; I just need to update the HTML I initially retrieved.

It seems that the simplest way in your case is to use a regular expression (for example http://[A-Za-z0-9_/\.]*\.jpg) to extract the URLs, and to keep track of the images you have already pulled.
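
As a rough illustration of that idea, here is a minimal Python sketch (Python is used only to keep the example short; the same pattern should work with any regex engine, such as NSRegularExpression on iOS). The function name new_image_urls and the example.com URLs are made up for this example. The pattern is the one given above, so it only finds http:// URLs ending in .jpg and will miss file names containing characters such as "-".

```python
import re

# Pattern from the answer above. It only matches http:// URLs that end in .jpg
# and whose path uses letters, digits, '_', '/' and '.'.
JPG_URL = re.compile(r"http://[A-Za-z0-9_/\.]*\.jpg")


def new_image_urls(html, seen):
    """Return the image URLs in html that are not in seen yet, and record them."""
    fresh = []
    for url in JPG_URL.findall(html):
        if url not in seen:
            seen.add(url)
            fresh.append(url)
    return fresh


if __name__ == "__main__":
    seen = set()
    # Hypothetical markup; the real page structure does not matter to the regex.
    page = '<img src="http://img.example.com/a/1.jpg"><img src="http://img.example.com/a/2.jpg">'
    print(new_image_urls(page, seen))
    # Fetching the page again later only yields URLs that were not seen before.
    page += '<img src="http://img.example.com/a/3.jpg">'
    print(new_image_urls(page, seen))
```

Because the set of seen URLs persists between calls, the HTML can be downloaded again at any time and only the new images are returned, regardless of how the surrounding markup has changed.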

Related

How to get array of pixels from browser window without using canvas

I'm attempting to get an array of pixels of the screen (web page), but I know of no way of doing that without using canvas (either directly or by first converting HTML DOM elements into canvas). I need to capture every pixel on the screen, and I don't know what operating system will be used, so I can't request the display from the OS either. Is there a third-party tool, or a way to do this from the window object in the DOM?

I have only one idea: move this functionality to the server. You can use wkhtmltopdf (http://wkhtmltopdf.org/) to save the website as a PDF, convert the PDF file to an image, and read the pixel array from that image.

As web developers with no control of the client machine, there are two approaches to getting a screenshot of a webpage:
1. Open the webpage in a headless browser on the server and make the screenshot there. phantomjs is a popular one.
2. (I'm including this for completeness, though you said you don't want to take that route): Use the canvas element on the client. html2canvas is an interesting project that re-renders an entire HTML document into a canvas element so a screenshot can be made.
If your use case allows it, you could of course instruct your users to take a screenshot and paste it in an upload form that can handle images from the clipboard. On Windows, that's a matter of hitting "Print Screen" and CTRL+V.

Here is an API to generate images from online web pages: http://www.page2images.com/Create-Website-Screenshot-with-Javascript-API

coldfusion show pdf on page

I know ColdFusion has extensive PDF support, but I'm not sure if this is possible.
I was given a PDF form and told to make it so that it can be filled out online, the data is captured, and the form can be printed.
Obviously, I can create an html page that looks like the document, save everything, generate the filled pdf form, etc.
Alternately, I think I can show the pdf, have them fill it, then grab the form data. I'm not entirely sure I can do this though, because I would need to detect when they are done filling it out.
But I was thinking it would be nice if I could do it this way: show the PDF embedded on the webpage, let them fill it out and print it, then capture everything when they are done. I was looking through the CF documentation (cfpdf, cfhttp, etc.), but I'm not finding exactly what I need. Is this an option?

You can extract the data from a PDF using the cfpdfform tag or as an HTTP POST. Here's a link to the docs on how to do that, but it depends on how you set up the PDF itself. You can edit your PDF form so that it submits just the form data to a given CF page. It arrives on the page in a struct tied to the form name (e.g. #form.form1.Fields.blah#). Dump it out to decipher it (it's kind of convoluted). That way you could fire print and submit from within the PDF.
The second way is to submit the PDF itself as a file. In this case you use the cfpdfform tag, which is not well documented or widely used. Both approaches are covered lightly in the link above. Good luck!

You can display a PDF on the page using the cfheader and cfdocument tags. Note that the following example code only displays the PDF in the browser.
<cfsavecontent variable="pdfcontent">
<!--- Put the markup you want rendered in the PDF here --->
</cfsavecontent>
<!--- "inline" asks the browser to display the PDF rather than download it --->
<cfheader name="Content-Disposition" value="inline; filename=Mydocument.pdf">
<cfdocument format="pdf" orientation="landscape" bookmark="yes" marginleft=".25" marginright=".25" margintop=".25" marginbottom=".75" scale="90">
<cfoutput>#pdfcontent#</cfoutput>
</cfdocument>

How to extract data from a web site and format to raw text - iPhone Dev

I have been looking around for a while and have not found anything useful. I'm also not sure I have worded the question in the clearest fashion, so apologies.
I have a section of an app I am building called 'Company News'. The company in question has a news page on their website which displays a title, an excerpt of text and a read more option.
At the minute, the iPhone application just has a UIWebView that loads that URL and displays an error if no connection is available. However, if my user clicks a story to read the news, it opens a new page. I want to avoid building in 'back' and 'forward' buttons and keep it from looking like a browser within the app.
With that said, I am looking for a way to just extract that data from the website and just display it in my app as raw text. I am not particularly bothered about rich text formatting or anything fancy. I would just like the title and body of text.
Is this possible?

In essence, then, you are looking for an HTML parser.
Assuming the HTML you wish to parse has a predictable format, the approach I would take is to load the HTML via whatever URL loading system you want - e.g. NSURLConnection, ASIHTTPRequest, etc.
Then you will need to parse the raw HTML. I use XPath. It requires that you learn the syntax, but it should work.
For more details about how you might use XPath for parsing HTML, see the second response to this question. You will need to link to libxml2 in your project then use XPath to extract the nodes of interest.
Scraping web pages in this way is fragile, though, because it depends on the structure of a page you don't control and which could be changed unpredictably.
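
To show the general shape of XPath-based extraction, here is a minimal Python sketch using the standard library's ElementTree, which supports only a limited subset of XPath. Everything about the markup is an assumption: the div class "story", the h2 title and the p class "excerpt" are invented for illustration, and the sample is well-formed XHTML. Real pages are often not well formed and usually need a lenient HTML parser such as the one in libxml2.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample; the real news page's structure is unknown.
SAMPLE = """<html><body>
<div class="story"><h2>First headline</h2><p class="excerpt">First excerpt.</p></div>
<div class="story"><h2>Second headline</h2><p class="excerpt">Second excerpt.</p></div>
</body></html>"""


def extract_stories(xhtml):
    """Return a list of (title, excerpt) tuples found in the given XHTML string."""
    root = ET.fromstring(xhtml)
    stories = []
    # Select every div with class="story" anywhere below the root element.
    for node in root.findall(".//div[@class='story']"):
        title = node.findtext("h2", default="").strip()
        excerpt = node.findtext("p[@class='excerpt']", default="").strip()
        stories.append((title, excerpt))
    return stories


if __name__ == "__main__":
    for title, excerpt in extract_stories(SAMPLE):
        print(title, "|", excerpt)
```

The XPath expressions are the only part tied to the page layout, so when the site changes they are the only part that should need updating.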

Delete certain parts of html code on the iPhone after downloading it

I'm wondering if it's possible to edit out certain parts of the HTML code. It's really long, and as I parse it (with element parser), the deeper the parser goes into the code, the slower it runs. Any ideas? I'm using a 3G as well.
edit:
For example on this site I'd want the posts and the usernames. Let's say there are like 50 replies on this thread and assume it will take a long time for the 3G phone to parse thousands of lines.
I'd want to remove the right links, the ads, the links at the top and bottom of the page too. Then I'd get the revised html and push it into the parser.

If you are downloading the webpage using a UIWebView, you can use normal JavaScript (via the method stringByEvaluatingJavaScriptFromString:) to hide or remove any elements you don't want the user to see.

How to Intercept image load requests in WebView?

Is it possible to intercept image load requests in WebView before they are actually started and modify their URLs?
For example, I have
mWebView.loadUrl(myUrl);
In the onLoadResource event I can see the URLs, but I can't modify them.
The thing is, I am working on an application that loads HTML content from a remote location. For some reason the author excluded the image path, and the img src contains just the file name. An existing iPhone application uses this HTML content, and I assume the content is built in the way that works best for the iPhone. So I need to figure out how to alter these paths. For example, if I choose to download all images first, I would need to alter the path and add file:///... in front of the image.jpg name.
Thanks.

You can use onLoadResource, although it is called for every resource the page loads, such as JavaScript and CSS, not only images.
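
For the bare file names described in the question, one alternative to intercepting requests is to rewrite the src values in the downloaded HTML before handing it to the WebView. The Python sketch below only illustrates the string rewriting; the function name prefix_image_paths and the base path file:///data/images/ are hypothetical, and the regex assumes double-quoted src attributes.

```python
import re

# Matches <img ... src="NAME"> where NAME is a bare file name
# (no scheme and no directory part, so no ':' or '/').
BARE_SRC = re.compile(r'(<img\b[^>]*?\bsrc=")([^"/:]+)(")', re.IGNORECASE)


def prefix_image_paths(html, base):
    """Prepend base to every bare image file name found in html."""
    # A function is used as the replacement so base is inserted literally.
    return BARE_SRC.sub(lambda m: m.group(1) + base + m.group(2) + m.group(3), html)


if __name__ == "__main__":
    html = '<img src="image.jpg"><img alt="x" src="http://host/a.png">'
    # Only the first src is a bare file name, so only it is rewritten.
    print(prefix_image_paths(html, "file:///data/images/"))
```

URLs that already contain a scheme or a path are left untouched, so the rewrite can be applied to the whole document.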
