libXML relaxed HTML parsing - iphone

I am trying to scrape some content from an HTML page. I'm using libxml2 and htmlReadMemory to get an xmlDocPtr. The HTML is simple, but it has one problem. It's basically the following:
<tr><td><tr><td>Some content</td></tr></td></tr>
libxml doesn't like the nested tr and td elements. It keeps giving me the following error:
HTML parser error : Unexpected end tag : td
</TD>
^
HTML parser error : Unexpected end tag : tr
</TR>
I am using the following option: HTML_PARSE_RECOVER.
At this point nothing I do gets libxml to parse the HTML. I can't change the HTML because I have no access to it.
Does anyone have any clues how I can get libxml to parse this sort of HTML?
Thanks

What's the exact call you're using to parse? I'd suggest combining these options if you don't want any errors/warnings:
HTML_PARSE_RECOVER|HTML_PARSE_NOERROR|HTML_PARSE_NOWARNING

Related

XMLParser encounter invalid tags

I'm writing an RSS reader app using Swift. I use the built-in class XMLParser to do the parsing job.
XMLParser stops when it encounters certain tags, for instance <figure> (which is matched by a closing </figure>). The error code is 76 (tagNameMismatchError).
Here is the part of the XML that causes the tagNameMismatchError:
<figure tabindex="0" draggable="false" class="ss-img-wrapper" contenteditable="false"><img src="https://cdn.sspai.com/2019/08/19/34d2340bbf2cbc3b08ffe4fe1594168d.png" alt=""><figcaption class="ss-image-caption">图 / iHelpBR</figcaption></figure>
Why does this error (tagNameMismatchError) occur? Is <figure> an invalid tag, or is it something else?
Besides, I can't predict which tags might appear in the feeds I parse.
The problem is the img tag, which is not terminated. This is not valid XML. HTML is more lax regarding closing tags than XML is. Insert a </img> or change the img tag to be <img src=... /> and it will work.
If you ever need to confirm that the content is valid XML, you can also save it to a file and then use the command line xmllint which will report (emphasis added):
parser error : Opening and ending tag mismatch: img line 1 and figure
Bottom line, you’ll need to fix the XML, or use a HTML parser (such as Hpple or NDHpple) instead.

how to pass richtext generated html content from jsp to js file

I have a richtext component. I gave it "foo" as input, and it generated
<p>foo</p>. I'm trying to pass this generated content from JSP to JS using the following code.
<script>
var jsvariable = '<%=jspvariable%>'
</script>
The line above throws an "unterminated string literal" error, because the JS variable contains
<p>foo</p>
I'm setting the value in JS because I need this variable in other pages as well.
How can I get rid of this error?
From what you wrote, it seems that your jspvariable string contains </script>. The HTML parser treats that as the end of the script block, so you end up with an invalid script block.
You can check the source of your page to confirm that I am right.
As Thomas suggested, you can escape your content. But since this content is provided by the user, I would use XssApi to prevent XSS attacks as well.
So it would be something like:
var jsvariable = '<%=xssApi.encodeForJSString(jspvariable)%>'
Or:
var jsvariable = '<%=xssApi.filterHTML(jspvariable)%>'
In the first case you will get the <script> block from the richtext component in your JS variable. It will be encoded, so you will not get this error, but I don't think you need it.
In the second case, you should get only the text value from your component.
UPDATE 1
Also, as I wrote in the comments, it would be nice to see how you extract the content from your richtext component, because I think there is a better way of doing this that gives you only the text and nothing else.
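To illustrate why encoding fixes the error, here is a minimal sketch of what an encoder for inline script content has to do. It is written in JavaScript only for illustration; toInlineScriptLiteral is a hypothetical helper, not part of XssApi, and in the JSP case the equivalent work would happen on the server before the page is sent.

```javascript
// Hypothetical helper (not part of XssApi): turns arbitrary text into a
// JavaScript string literal that can sit inside an inline <script> block.
function toInlineScriptLiteral(text) {
  return JSON.stringify(text)        // adds quotes, escapes ", \ and newlines
    .replace(/</g, "\\u003c")        // keeps "</script>" from ending the block
    .replace(/\u2028/g, "\\u2028")   // line separator, a newline to older engines
    .replace(/\u2029/g, "\\u2029");  // paragraph separator, same concern
}

// Sample rich text containing both a raw newline and a script tag.
const richText = '<p>foo</p>\n<script>alert(1)</script>';
const literal = toInlineScriptLiteral(richText);

// The literal has no "<" and no raw newline, yet it still decodes
// back to the original text.
console.log(literal);
console.log(JSON.parse(literal) === richText);
```

A raw newline or an unescaped quote in the value is enough to produce "unterminated string literal", which is why the sketch escapes those as well as the "<" character.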

Safe to use Regex for this? (HTML)

I'm parsing some HTML, and I need to get all html in the body tag. My target string will always look something like this:
<body><div><img src="" />text etc</div></body>
However, I just need:
<div><img src="" />text etc</div>
My target string will always begin and end with those body tags. However, I keep seeing the warning not to use Regex to parse HTML, and at the moment I don't have any viable alternative to Regex.
Question: Is there a safe Regex to use in this case? Or should I just forget it?
You didn't show us what your regex is, but it's not as safe as using DOM parsing if it's as simple as:
<body>(.*?)</body>
...because it's possible that </body> is contained in an attribute string or comment. If you're willing to take that risk, then you'll be fine. There's no reason you shouldn't be able to use DOM parsing and just get the text of the body, though, except it would probably be less efficient.
You could also skip the regex and just find the string indices of <body> and </body> and get the substring between them. That should be even faster.
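The index-based approach could look like the sketch below. extractBody is a hypothetical helper name, and it assumes the opening tag is exactly <body> with no attributes, as in the question.

```javascript
// Hypothetical helper: returns the markup between <body> and </body>,
// or null when either tag is missing. Assumes <body> has no attributes.
function extractBody(html) {
  const openTag = "<body>";
  const closeTag = "</body>";
  const start = html.indexOf(openTag);
  // Use the last closing tag so a "</body>" inside a comment or
  // attribute earlier in the content does not cut the result short.
  const end = html.lastIndexOf(closeTag);
  if (start === -1 || end === -1 || end < start + openTag.length) {
    return null;
  }
  return html.slice(start + openTag.length, end);
}

console.log(extractBody('<body><div><img src="" />text etc</div></body>'));
// prints: <div><img src="" />text etc</div>
```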
By the way, this is not parsing the HTML; you're just extracting from the HTML
It's fine to use a RegEx in this case.
Having said that there are much easier ways to get the innerHTML of the body tag.
alert(document.body.innerHTML);
should give you exactly that with no RegEx...
or if you're using jQuery
$("body").html();

Need to find the tags under a tag in an XML using jQuery

I have this xml as part of the responseXml of an Ajax call:
<banner-ad>
<title><span style="color:#ffff00;"><strong>Title</strong></span></title>
</banner-ad>
When I used this jQuery(responseXml).find("title").text(); the result is "Title".
I also tried jQuery(responseXml).find("title:first-child") but the result is [object Object].
I want to get the result:
<span style="color:#ffff00;"><strong>Title</strong></span>
Please let me know how to do this in jQuery.
Thanks in advance for any help.
Regards,
Racs
Your problem is that you cannot simply append nodes from one document (the XML response) to another (your HTML page). The issue is two-fold:
You can use jQuery to append nodes from the XML document to the HTML page. This works; the nodes appear in the HTML DOM, but they stay XML nodes and therefore the browser ignores the style attribute, for example. Consequently the text will not be yellow (#ffff00).
As far as I can see, jQuery offers no built-in way to get the XML string (i.e. a serialized node) from an XML node. jQuery can handle XML documents quite well, but there is no equivalent to what .html() does in HTML documents.
So to make this work we need to extract the XML string from the XML document. Some browsers support the .xml property on XML nodes (namely, IE), the others come with an XMLSerializer object:
// find the proper XML node
var $title = $(doc).find("title");
// either use .xml or, when unavailable, an XMLSerializer
var html = $title[0].xml || (new XMLSerializer()).serializeToString($title[0]);
// result:
// '<title><span style="color:#ffff00;"><strong>Title</strong></span></title>'
Then we have to feed this HTML string to jQuery so new, real HTML elements can be created from it:
$("#target").append(html);
There is a fiddle to show this in action: http://jsfiddle.net/Tomalak/QWHj8/. This example also gets rid of the superfluous <title> element.
Anyway. If you have a chance to influence the XML itself, it would make sense to change it:
<banner-ad>
<title>&lt;span style="color:#ffff00;"&gt;&lt;strong&gt;Title&lt;/strong&gt;&lt;/span&gt;</title>
</banner-ad>
Just XML-encode the payload of <title> and you can do this in jQuery:
$("#target").append( $(doc).find("title").text() );
This would probably work:
$(responseXml).find("title").html();

Ignore CDATA while xml parsing

I am new to iPhone development. I want to ignore the CDATA tag while parsing, because the parser treats the HTML tags following it as text. Since I want to display the content alone, I want my parser to ignore the CDATA tag. My source is:
<![CDATA[<br /><p class="author"><span class="by">By: </span>By Sydney Ember</p><br><p>In the week since an </p>]]>
Is there any way to ignore the CDATA tag?
Is there any way to parse my source twice so that it displays only the content?
Please give me some sample code. Thanks.
If you treat the CDATA content as XML instead of CDATA then your parser will throw an error (since your HTML is a weird mix of XHTML and HTML and is not well formed).
If you want to get the HTML, then parse the XML, extract the text content of the node, then parse that text as HTML.
There is no way to ignore the CDATA tag - it's part of the xml spec and parsers should honour it.
If you don't like the idea of this answer to your earlier question, you could get the contents of the CDATA section and parse it as XML again. However, this is strongly discouraged! You don't know that the contents of the CDATA are going to be valid XML (they're probably not).
If you can 100% guarantee that the CDATA section always has the form you show above, you could probably use some string manipulation to get the data out (i.e. string replace '<span class="by">By: </span>' with ''), but again, this will almost certainly break if the CDATA contents change.
Where is the XML coming from? A better idea is to talk to the owner of the service and get them to send you, instead of the current description, something like
<description>
<author>By Sydney Ember</author>
<text>In the week since an </text>
</description>