I'm writing an RSS reader app using Swift. I use the built-in class XMLParser to do the parsing job.
XMLParser stops when it encounters certain tags, for instance <figure> (which is matched by an end tag </figure>). The error code is 76 (tagNameMismatchError).
I extracted the part of the XML that causes the tagNameMismatchError:
<figure tabindex="0" draggable="false" class="ss-img-wrapper" contenteditable="false"><img src="https://cdn.sspai.com/2019/08/19/34d2340bbf2cbc3b08ffe4fe1594168d.png" alt=""><figcaption class="ss-image-caption">图 / iHelpBR</figcaption></figure>
Why does this error (tagNameMismatchError) occur? Is <figure> an invalid tag, or is it something else?
Besides, I can't predict what possible tags could come from possible feeds.
The problem is the img tag, which is not terminated. This is not valid XML. HTML is more lax regarding closing tags than XML is. Insert a </img> or change the img tag to be <img src=... /> and it will work.
If you ever need to confirm whether the content is valid XML, you can also save it to a file and run the command-line tool xmllint on it, which will report:
parser error : Opening and ending tag mismatch: img line 1 and figure
Bottom line, you'll need to fix the XML, or use an HTML parser (such as Hpple or NDHpple) instead.
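To see the difference between the two forms of the img tag without going through the app, here is a minimal sketch in Python (not Swift) that runs a strict XML parser over both. The shortened sample strings and the helper name xml_error are made up for illustration; the real feed content is longer.

```python
import xml.etree.ElementTree as ET

def xml_error(fragment):
    """Return None if the fragment is well-formed XML, else the parser's message."""
    try:
        ET.fromstring(fragment)
        return None
    except ET.ParseError as err:
        return str(err)

# Same shape as the fragment in the question: <img> is never closed.
broken = '<figure><img src="a.png" alt=""><figcaption>caption</figcaption></figure>'
# Self-closing <img ... /> makes the fragment well-formed.
fixed = '<figure><img src="a.png" alt="" /><figcaption>caption</figcaption></figure>'

print(xml_error(broken))  # an error message from the parser
print(xml_error(fixed))   # None
```

Any strict XML parser should reject the first string for the same reason XMLParser does: the </figure> end tag arrives while <img> is still open.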
Related
I have this xml as part of the responseXml of an Ajax call:
<banner-ad>
<title><span style="color:#ffff00;"><strong>Title</strong></span></title>
</banner-ad>
When I used this jQuery(responseXml).find("title").text(); the result is "Title".
I also tried jQuery(responseXml).find("title:first-child") but the result is [object Object].
I want to get the result:
<span style="color:#ffff00;"><strong>Title</strong></span>
Please let me know how to do this in jQuery.
Thanks in advance for any help.
Regards,
Racs
Your problem is that you cannot simply append nodes from one document (the XML response) to another (your HTML page). The issue is two-fold:
You can use jQuery to append nodes from the XML document to the HTML page. This works; the nodes appear in the HTML DOM, but they stay XML nodes and therefore the browser ignores the style attribute, for example. Consequently the text will not be yellow (#ffff00).
As far as I can see, jQuery offers no built-in way to get the XML string (i.e. a serialized node) from an XML node. jQuery can handle XML documents quite well, but there is no equivalent to what .html() does in HTML documents.
So to make this work we need to extract the XML string from the XML document. Some browsers support the .xml property on XML nodes (namely, IE), the others come with an XMLSerializer object:
// find the proper XML node
var $title = $(doc).find("title");
// either use .xml or, when unavailable, an XMLSerializer
var html = $title[0].xml || (new XMLSerializer()).serializeToString($title[0]);
// result:
// '<title><span style="color:#ffff00;"><strong>Title</strong></span></title>'
Then we have to feed this HTML string to jQuery so new, real HTML elements can be created from it:
$("#target").append(html);
There is a fiddle to show this in action: http://jsfiddle.net/Tomalak/QWHj8/. This example also gets rid of the superfluous <title> element.
Anyway. If you have a chance to influence the XML itself, it would make sense to change it:
<banner-ad>
<title>&lt;span style="color:#ffff00;"&gt;&lt;strong&gt;Title&lt;/strong&gt;&lt;/span&gt;</title>
</banner-ad>
Just XML-encode the payload of <title> and you can do this in jQuery:
$("#target").append( $(doc).find("title").text() );
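The reason the encoded variant is simpler: an XML parser decodes the entities, so the text content of <title> is already the HTML string you want. A small sketch of that behaviour, using Python's standard library in place of jQuery (the variable names are illustrative):

```python
import xml.etree.ElementTree as ET

# <title> carries its HTML payload XML-encoded, so it has no child elements.
response_xml = (
    "<banner-ad>"
    "<title>&lt;span style=\"color:#ffff00;\"&gt;"
    "&lt;strong&gt;Title&lt;/strong&gt;&lt;/span&gt;</title>"
    "</banner-ad>"
)

root = ET.fromstring(response_xml)
html = root.find("title").text  # entities are decoded by the parser
print(html)  # <span style="color:#ffff00;"><strong>Title</strong></span>
```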
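This would probably work (although the jQuery documentation states that .html() is not available on XML documents, so test it in your target browsers):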
$(responseXml).find("title").html();
I am trying to scrape some content from an HTML page. I'm using libxml2 and htmlReadMemory to get a xmlDocPtr. The HTML is simple, but it has a problem. It's basically the following:
<tr><td><tr><td>Some content</td></tr></td></tr>
libxml doesn't like the nested tr and td elements. It keeps giving me the following error:
HTML parser error : Unexpected end tag : td
</TD>
^
HTML parser error : Unexpected end tag : tr
</TR>
I am using the following option: HTML_PARSE_RECOVER.
At this point nothing I do allows libxml to parse the HTML because of this. I can't change the HTML because I have no access to it.
Anyone have any clues how I can get libxml to parse this sort of HTML?
Thanks
What's the exact call you're using to parse? I'd suggest combining these options if you don't want any errors/warnings:
HTML_PARSE_RECOVER|HTML_PARSE_NOERROR|HTML_PARSE_NOWARNING
I am new to iPhone development. I want to parse an XML page. The source contains some HTML tags, and these tags are displayed in my simulator. I want to filter out the tags and display only the content. The XML source looks like this:
<description>
<![CDATA[<br /><p class="author"><span class="by">By: </span>By Sydney Ember</p><br><p>In the week since an earthquake devastated Haiti ...</p>]]>
</description>
I want "In the week since an ..." to be displayed, not the HTML tags. Please help me out. Thanks.
As said before in other answers, the data in your xml is inside a CDATA block - this means that when you get the contents of the tag, the XML parser won't be able to get rid of the 'By:' bit for you - as far as it's concerned, it's all just text.
However, if you're going to display it as HTML inside a UIWebView (instead of a UILabel etc.), you can add a style sheet to the start of the string that hides the 'By:'. Something like
NSString *cssString = @"<style type='text/css'>span.by { display:none; }</style>";
NSString *html = [NSString stringWithFormat:@"<html><head>%@</head><body>%@</body></html>", cssString, descriptionString];
[webView loadHTMLString:html baseURL:nil];
where descriptionString is the contents of the <description> tag in your xml.
However this approach is a little heavy handed, I would try very hard to get some cleaner xml from your server!
As for actually parsing the xml, try the NSXMLParser object.
The contents of a CDATA block are treated as text (XML-specific characters like <, &, > etc. are not interpreted as markup and are passed through as plain characters). If the text view you're using to display the text accepts HTML, read the text node of the description tag and assign it to that view's equivalent of innerHTML.
I see that all the tags are HTML. In addition, there is a CDATA section, which means its content should be treated as text and not as XML. As for the XML parsing, there are a few XML parsers available for iPhone:
TouchXML
XPathQuery
I prefer the latter.
I'm not sure how the parsers will treat the CDATA.
Maybe you will have to parse twice - first time for getting the CDATA contents and second time for parsing the content...
I want to parse an XML file. The file structure is as follows:
<?xml version="1.0" encoding="utf-8"?>
<Level>
<p id='327'>
<Item>
<Id>5877</Id>
<Type>0</Type>
<Icon>---</Icon>
<Title>Btn1Item1</Title>
</Item>
<Item>
<Id>5925</Id>
<Type>0</Type>
<Icon>---</Icon>
<Title>Btn1Item4</Title>
</Item>
</p>
<p id='328'>
<Item>
<Id>5878</Id>
<Type>0</Type>
<Icon>---</Icon>
<Title>Btn2Item1</Title>
</Item>
<Item>
<Id>5926</Id>
<Type>0</Type>
<Icon>---</Icon>
<Title>Btn2Item4</Title>
</Item>
</p>
</Level>
In the code above there are only two <p> tags, but the actual file has many more. I want to find the specific <p> tag whose id attribute has a given value (say 327).
One way is to parse the XML file from the start until I reach the desired result. Is there any other method that lets me locate the desired tag directly? For example, if I search the XML above for the <p> tag with id=328, the parser should skip id=327 and return only the items that belong to id=328.
Please suggest an approach.
Depends how you define "parse".
A "quick & dirty" (and potentially buggy) way would be to find the fragment using a regex search (or a custom parser) first, then feed the fragment to a true XML parser. I don't know of anything that would do this for you, you'd have to roll it yourself. I would suggest that it's not the way to go.
The next level is to feed it through a SAX-like parser (which NSXMLParser is a form of).
In your handler for the <p> element, check the id attribute and if it matches your value (or values), set a flag to indicate if child elements should be interpreted.
In your child element handlers, just check that flag first (in a raw NSXMLParser handler all elements would go to the same method, of course).
So it's true that NSXMLParser would be parsing the whole document - but just to do the minimal work to establish the correct XML parser context. The real work of handling the elements would be deferred until the value is met. I don't see any way around that without something hacky like the regex suggestion.
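The flag approach described above can be sketched with any SAX-style parser. Here is a rough version using Python's xml.sax rather than NSXMLParser; the class name ItemsForId, the choice to collect only <Title> text, and the trimmed sample document are assumptions made for illustration.

```python
import xml.sax

class ItemsForId(xml.sax.ContentHandler):
    """Collect <Title> text only inside the <p> whose id matches."""

    def __init__(self, wanted_id):
        super().__init__()
        self.wanted_id = wanted_id
        self.active = False    # True while inside the matching <p>
        self.in_title = False  # True while inside a <Title> we care about
        self.titles = []

    def startElement(self, name, attrs):
        if name == "p":
            # Set the flag once per <p>; children only check it.
            self.active = attrs.get("id") == self.wanted_id
        elif self.active and name == "Title":
            self.in_title = True
            self.titles.append("")

    def characters(self, content):
        # Text may arrive in several pieces, so append to the current entry.
        if self.in_title:
            self.titles[-1] += content

    def endElement(self, name):
        if name == "p":
            self.active = False
        elif name == "Title":
            self.in_title = False

doc = b"""<Level>
<p id='327'><Item><Id>5877</Id><Title>Btn1Item1</Title></Item></p>
<p id='328'><Item><Id>5878</Id><Title>Btn2Item1</Title></Item>
<Item><Id>5926</Id><Title>Btn2Item4</Title></Item></p>
</Level>"""

handler = ItemsForId("328")
xml.sax.parseString(doc, handler)
print(handler.titles)  # ['Btn2Item1', 'Btn2Item4']
```

The parser still walks the whole document, but for every <p> that does not match, the handlers do nothing beyond checking the flag.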
If this is too much overhead I'd reconsider whether XML is the right serialization format for you (assuming you have any control over that)?
If you do stick with NSXMLParser, my blog article here might help to at least make the experience nicer.
The libxml2 library can evaluate XPath queries with the following extensions. With these extensions you might issue the XPath query /Level/p[@id = "328"] to select that specific <p> node and then read its children.
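To illustrate selecting a node by attribute value with a path expression, here is a small sketch using Python's xml.etree.ElementTree. Note that this is not libxml2 and supports only a limited XPath subset; the sample document is trimmed from the question and the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

doc = """<Level>
<p id='327'><Item><Id>5877</Id><Title>Btn1Item1</Title></Item></p>
<p id='328'><Item><Id>5878</Id><Title>Btn2Item1</Title></Item>
<Item><Id>5926</Id><Title>Btn2Item4</Title></Item></p>
</Level>"""

root = ET.fromstring(doc)  # root is the <Level> element

# Attribute predicate: the <p> child of the root whose id equals "328".
p = root.find("./p[@id='328']")
titles = [item.findtext("Title") for item in p.findall("Item")]
print(titles)  # ['Btn2Item1', 'Btn2Item4']
```

The whole document is still parsed into a tree first; the path expression only saves you from writing the search loop yourself.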
I am new to iPhone development. I want to ignore the CDATA tag while parsing, because the parser treats the HTML tags that follow it as text. Since I want to display only the content, I want my parser to ignore the CDATA tag. My source is
[CDATA[<br /><p class="author"><span class="by">By: </span>By Sydney Ember</p><br><p>In the week since an </p>]].
Is there any way to ignore CDATA tag?
Is there any way to parse my source twice so it displays only the content?
Please give me some sample code. Please help me out. Thanks.
If you treat the CDATA content as XML instead of CDATA then your parser will throw an error (since your HTML is a weird mix of XHTML and HTML and is not well formed).
If you want to get the HTML, then parse the XML, extract the text content of the node, then parse that text as HTML.
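A minimal sketch of that two-pass idea, written in Python with the standard library rather than NSXMLParser: pass 1 parses the XML and reads the text of <description>, pass 2 runs a lenient HTML parser over that text and keeps only the text nodes. The class name BodyText is made up, and skipping the <p class="author"> paragraph is an assumption about what should be hidden.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

class BodyText(HTMLParser):
    """Collect text nodes, skipping everything inside <p class="author">."""

    def __init__(self):
        super().__init__()
        self.skip = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            # attrs is a list of (name, value) pairs
            self.skip = ("class", "author") in attrs

    def handle_endtag(self, tag):
        if tag == "p":
            self.skip = False

    def handle_data(self, data):
        text = data.strip()
        if text and not self.skip:
            self.chunks.append(text)

feed = """<description>
<![CDATA[<br /><p class="author"><span class="by">By: </span>By Sydney Ember</p><br><p>In the week since an earthquake devastated Haiti ...</p>]]>
</description>"""

# Pass 1: XML. The CDATA content comes back as plain text, tags included.
html = ET.fromstring(feed).text

# Pass 2: HTML. The lenient parser tolerates the unclosed <br>.
parser = BodyText()
parser.feed(html)
parser.close()
print(parser.chunks)  # ['In the week since an earthquake devastated Haiti ...']
```

The second pass has to use an HTML parser rather than an XML parser because the content is not well formed (the bare <br>, for example).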
There is no way to ignore the CDATA tag - it's part of the xml spec and parsers should honour it.
If you don't like the idea of this answer to your earlier question, you could get the contents of the CDATA section and parse it as XML again. However, this is strongly discouraged! You don't know that the contents of the CDATA are going to be valid XML (they're probably not).
If you can 100% guarantee that the CDATA section always has the form shown above, you could probably use some string manipulation to get the data out (i.e. string-replace '<span class="by">By: </span>' with ''), but again, this will almost certainly break if the CDATA contents change.
Where is the XML coming from? A better idea is to talk to the owner of the service and get them to send you, instead of the current description, something like
<description>
<author>By Sydney Ember</author>
<text>In the week since an </text>
</description>