Processing Non-Well-Formed HTML with DOM - dom

I have an HTML document taken from a web site's source code. I send data with the POST method from my page to the web site, and the response is the page source. I need some text from that source.
The document is not well-formed, so I cannot use DOM to separate tags from data.
How can I separate tags from data and get only the data?
I'm using PHP.
Thanks.

I found something about getting data out of HTML source code. As I said in the question, I'm using PHP.
I will use the preg_match_all function with a regular expression. Hopefully that will get me past this ;)
Thanks to everyone who took an interest ;)
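For comparison with the regular-expression route, here is a minimal sketch of separating tags from text with a lenient parser that does not require well-formed markup. It is written in Python rather than PHP, purely as an illustration; the names TextExtractor and extract_text are made up for this example and come from no answer in this thread.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes and ignores tags; unclosed tags do not stop parsing."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Keep only non-empty text that is not script or style content.
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def extract_text(html):
    """Return the visible text pieces of an HTML string, in document order."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks
```

Calling extract_text on a fragment with unclosed <p> and <b> tags still returns the text pieces, because this kind of parser reports tags and data as it meets them instead of building a strict tree first.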

Related

How can I "drill down" into a website using Perl's WWW::Mechanize

I have used the WWW::Mechanize Perl module on a number of projects and it's helped me out a lot.
I am trying to use it on a different site and I can't "drill down" into the content of the site.
The site is https://customer.bookingbug.com/?client=hantsrecyclingcentres#/services
I have tried to figure out what the URL would be to get the content shown in the resulting HTML, such as bb.d570283b87c834518ba9.css, bb.d570283b87c834518ba9.js and version.js
I tried to copy the resulting HTML into this posting, but used all sorts of quote and code sample combinations and it wouldn't display properly.
Does anyone have any idea how I can "navigate" this site using this Perl module, please?
WWW::Mechanize is a web client with some HTML parsing capabilities. But as you clearly noticed, the information you want is not in the HTML document you requested. Either download the correct document (whatever that might be), or do what the browser does and execute the JavaScript. This would require a JavaScript engine. The simplest way to achieve that is to remote-control a web browser (e.g. using Selenium::Chrome).

textbox/form saved and reloaded

I have a question about textboxes or forms. I don't have any experience with them.
I would like to have a textbox/form where the user can type or paste text.
There should be a save button and the saved text should be loaded and be editable again.
This isn't an internet application, so I don't need to specify a database of users.
Searching the web got me a lot of partial ASP/.NET/PHP solutions. I don't really know much about any of these.
My question is, would this be possible? And where should I start?
Thanks
I would use PHP for this. If you're unfamiliar with it, there are some great learning resources in the W3Schools PHP tutorial as well as the documentation at PHP.net. You'll want to write an HTML form that submits to a PHP script. In this script you can save text to a file using the fwrite() function, and load the saved text by opening the file with fopen() and reading it back (for example with fread()). Definitely have a look through the W3Schools PHP guide if you're new to this.
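The save-and-reload part of that idea can be sketched independently of the form handling. The following is a minimal illustration in Python, not PHP; the file name saved_text.txt and the function names save_text and load_text are assumptions made for this example.

```python
from pathlib import Path

NOTE_FILE = Path("saved_text.txt")  # hypothetical file name for the saved text


def save_text(text, path=NOTE_FILE):
    """Overwrite the file with the text the user submitted."""
    path.write_text(text, encoding="utf-8")


def load_text(path=NOTE_FILE):
    """Return the previously saved text, or an empty string on first use."""
    if not path.exists():
        return ""
    return path.read_text(encoding="utf-8")
```

The form handler would call save_text with the submitted value when the save button is pressed, and fill the textbox with load_text() when the page is shown again.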

How parse html string in iphone

I am using a PHP file to supply data to my application.
Through this file I post data to the server, and the data I get back from the server
is in HTML format.
So the problem is that I have a string containing HTML tags. How do I use the data in that string?
How do I extract the data from an HTML string?
Use the NSXMLParser class. It works for HTML too. There are three useful delegate methods.
If your HTML output is some simple data, you may be able to write a simple NSString parser yourself, as 'markhunte' mentioned. If you have large, complex data in HTML, then you will have to go for one of the open source parsers.
Cocoa does not provide an HTML parser. Forum discussions claim that in some cases the XML parser itself will work for you, but I never got it working for my data.
In my case I had a very simple tag, which I handled with my own parser built on NSString.
I have used the code from --> Flatten-html-content-ie-strip-tags-cocoaobjective-c.
There are also examples of its use on SO.
Just use NSScanner. It is great for searching between tags that are always present. If you post some of the page code, I can help you set up the scanner.

Parsing source of a webpage with Objective-C

Is there a way to parse a website's source on the iPhone to get the URL's of photos on that page? If so how would you do that?
Thanks
I'd say go for regular expressions - there is a one-page library that wraps C regexes that you can drop into your project.
I recommend regular expressions. There's a great open source Regex library for Cocoa called RegexKit. For the most part, you can just drop it in your code and it'll "just work".
Getting all the urls of images wouldn't be too difficult (less than 20 lines of code) if you assume that all images are going to be in <img> tags. You'd just grab all the image tags (something like: <img\s+[^>]+>), then iterate through those matches. For each match, you'd pull out whatever's in the src attribute: src\s*=\s*("|')?\s*([^\s"']+)(\s|"|')
You might need to tweak that a bit, but it shouldn't be too bad.
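As a rough sketch of the two-step approach described above, the same two patterns can be applied as follows. This is Python rather than Objective-C, and the function name image_urls is made up for this illustration.

```python
import re

# Step 1: find every <img ...> tag (pattern taken from the answer above).
IMG_TAG = re.compile(r"<img\s+[^>]+>", re.IGNORECASE)

# Step 2: pull the value of the src attribute out of one tag.
SRC_ATTR = re.compile(r"""src\s*=\s*("|')?\s*([^\s"']+)(\s|"|')""", re.IGNORECASE)


def image_urls(html):
    """Return the src values of all <img> tags found in the HTML string."""
    urls = []
    for tag in IMG_TAG.findall(html):
        match = SRC_ATTR.search(tag)
        if match:
            urls.append(match.group(2))  # group 2 is the URL itself
    return urls
```

One tweak that may be needed: as written, the src pattern expects whitespace or a quote after the URL, so an unquoted src that runs straight into the closing > of the tag is not matched.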
There is no super easy way. When I had to do it I wrote a libxml2 SAX parser. libxml2 has an html reader that works fairly well with malformed html, and libxml2 is included with the base system.
You could try it using regular expressions, but I wouldn't recommend that. You should have a look at NSXMLParser, assuming the webpage is coded to be XHTML compliant. TouchXML is another good library.
Take a look at Event Driven XML Parsing in the iPhone reference library.
Are you OK with whatever approach you use not picking up on images loaded dynamically via JavaScript?
The closest thing I could see working is to parse out any JavaScript imports, load those up too, and then use a regular expression across the whole file looking for anything that ends in ".jpg/.gif/.png" and grab the full URL out from that. The libxml approach would miss out on references to images not in img tags, but it might well be good enough.

iPhone RSS Reader -- parseXML won't Load some XML feeds

I am using the SIMPLE RSS reading example found at http://theappleblog.com/2008/08/04/tutorial-build-a-simple-rss-reader-for-iphone/
It uses parseXML to load the RSS feeds.
Here is the problem I am having. For the following RSS feed example, I am having trouble getting it to load the feed. It comes up with an error that it cannot connect. However, on my Mac RSS reader it works fine, so I know the link is good.
Any ideas on why it cannot load this particular feed but it can load others fine?
http://www.okstate.com/rss.dbml?db_oem_id=200&media=news
Thanks.
I've just released an open source RSS/Atom Parser for iPhone and hopefully it might be of some use.
I'd love to hear your thoughts on it too!
In my experience, HTML markup causes an RSS parser to fail in most cases. I've experienced a problem like this with a lot of parser classes I've come across (in search of the ultimate one, which I didn't find)
My guess is that entities such as
&#39;s
are responsible for your crash. That was usually the case with my crashes. This also led to my decision to create a 'proxy server' to pre-parse the XML before sending it to the iPhone (which gives me the advantage of caching, scaling, and some other stuff). I do believe there are solid solutions out there, but it is always difficult to write a parser for so many RSS implementations.
P.S.: W3C validates this feed as 'valid', so it really is 'our' problem.
Your problem could lie with:
Unicode characters (i.e. I see some o's with two dots above them in the feed)
The code you have doesn't respect CDATA sections correctly
To find out which is the case, save the feed file to your local disk and load it via your code to make sure the error happens.
Do a binary search on the file to find out if a particular RSS entry is causing the problem (i.e. remove all but the first RSS entry and see if the problem exists. If it does, then the problem is there; if it doesn't, put half the RSS entries back in the file and repeat).
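The bisection idea can be sketched like this, in Python rather than Objective-C. It assumes that the feed has already been split into a list of entry strings, that an empty feed loads fine, and that loads_ok stands in for "load these entries with your own code and report whether it worked". The names first_bad_entry and well_formed are made up for this illustration, and well_formed is only a stand-in check for well-formed XML, not the parser from the tutorial.

```python
import xml.etree.ElementTree as ET


def first_bad_entry(entries, loads_ok):
    """Return the index of the first entry that makes loading fail, or None."""
    if loads_ok(entries):
        return None
    # Invariant: entries[:lo] loads fine, entries[:hi] fails.
    lo, hi = 0, len(entries)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if loads_ok(entries[:mid]):
            lo = mid
        else:
            hi = mid
    return hi - 1


def well_formed(entries):
    """Stand-in check: wrap the entries in a channel and try to parse them."""
    document = "<rss><channel>" + "".join(entries) + "</channel></rss>"
    try:
        ET.fromstring(document)
    except ET.ParseError:
        return False
    return True
```

With a few hundred entries this needs only a handful of load attempts, where testing entries one at a time would need one attempt per entry.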
I've been experiencing a similar issue. I haven't yet pinned down the answer, but I've noticed that RSS 2 tends to parse more successfully than the rest.
There are many RSS feeds that contain invalid XML, usually because they were hacked together on the server side using HTML templates by somebody who didn't understand XML. I've seen improperly escaped (or non-escaped) HTML post contents, missing close tags, badly nested tags, and so on.
If you want to be able to parse arbitrary feeds, you have to clean up bad XML. The usual way is to use the "htmlTidy" library, which is included in the OS. This can clean up XML as well as HTML.
This example you're following uses NSXMLParser -- I have no idea why. It's a lower-level API and it doesn't support tidying. I would suggest using NSXMLDocument instead. There's a flag in that API that will tell it to use tidy when parsing the XML. This API also returns you the XML as a handy tree of elements that's easy to work with.