xml Tableview nsxmlparsing [duplicate] - iphone

I think I read every single web page relating to this problem but I still cannot find a solution to it, so here I am.
I have an HTML web page which is not under my control and I need to parse it from my iPhone application. Here is a sample of the web page I'm talking about:
<HTML>
<HEAD>
<META http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
</HEAD>
<BODY>
<LI class="bye bye" rel="hello 1">
<H5 class="onlytext">
<A name="morning_part">morning</A>
</H5>
<DIV class="mydiv">
<SPAN class="myclass">something about you</SPAN>
<SPAN class="anotherclass">
Bye Bye &egrave; un saluto
</SPAN>
</DIV>
</LI>
</BODY>
</HTML>
I'm using NSXMLParser and it works well until it finds the &egrave; HTML entity. It calls foundCharacters: for "Bye Bye" and then it calls resolveExternalEntityName:systemID: with an entityName of "egrave".
In this method I'm just returning the character "è" transformed into an NSData. foundCharacters: is then called again, appending the string "è" to the previous "Bye Bye ", and then the parser raises the NSXMLParserUndeclaredEntityError error.
I have no DTD and I cannot change the html file I'm parsing. Do you have any ideas on this problem?
Update (12/03/2010). After the suggestion of Griffo I ended up with something like this:
data = [self replaceHtmlEntities:data];
NSXMLParser *parser = [[NSXMLParser alloc] initWithData:data];
[parser setDelegate:self];
[parser parse];
where replaceHtmlEntities: is something like this:
- (NSData *)replaceHtmlEntities:(NSData *)data {
    NSString *htmlCode = [[NSString alloc] initWithData:data encoding:NSISOLatin1StringEncoding];
    NSMutableString *temp = [NSMutableString stringWithString:htmlCode];
    [temp replaceOccurrencesOfString:@"&amp;" withString:@"&" options:NSLiteralSearch range:NSMakeRange(0, [temp length])];
    [temp replaceOccurrencesOfString:@"&nbsp;" withString:@" " options:NSLiteralSearch range:NSMakeRange(0, [temp length])];
    ...
    [temp replaceOccurrencesOfString:@"&Agrave;" withString:@"À" options:NSLiteralSearch range:NSMakeRange(0, [temp length])];
    NSData *finalData = [temp dataUsingEncoding:NSISOLatin1StringEncoding];
    return finalData;
}
But I am still looking for the best way to solve this problem. I will try TouchXML in the next few days, but I still think there should be a way to do this using the NSXMLParser API, so if you know how, feel free to write it here.

After exploring several alternatives, it appears that NSXMLParser will not support entities other than the standard entities &lt;, &gt;, &apos;, &quot; and &amp;.
The code below fails, resulting in an NSXMLParserUndeclaredEntityError.
// Create a dictionary to hold the entities and NSString equivalents
// A complete list of entities and unicode values is described in the HTML DTD
// which is available for download http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent
NSDictionary *entityMap = [NSDictionary dictionaryWithObjectsAndKeys:
    [NSString stringWithFormat:@"%C", (unichar)0x00E8], @"egrave",
    [NSString stringWithFormat:@"%C", (unichar)0x00E0], @"agrave",
    ...
    ,nil];
NSXMLParser *parser = [[NSXMLParser alloc] initWithData:data];
[parser setDelegate:self];
[parser setShouldResolveExternalEntities:YES];
[parser parse];
// NSXMLParser delegate method
- (NSData *)parser:(NSXMLParser *)parser resolveExternalEntityName:(NSString *)entityName systemID:(NSString *)systemID {
return [[entityMap objectForKey:entityName] dataUsingEncoding: NSUTF8StringEncoding];
}
Attempts to declare the entities by prepending the HTML document with ENTITY declarations parse without error; however, the expanded entities are not passed back to parser:foundCharacters: and the è and à characters are dropped.
<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"
[
<!ENTITY agrave "à">
<!ENTITY egrave "è">
]>
In another experiment, I created a completely valid xml document with an internal DTD
<?xml version="1.0" standalone="yes" ?>
<!DOCTYPE author [
<!ELEMENT author (#PCDATA)>
<!ENTITY js "Jo Smith">
]>
<author>&lt; &js; &gt;</author>
I implemented the parser:foundInternalEntityDeclarationWithName:value: delegate method and it is clear that the parser is getting the entity data; however, parser:foundCharacters: is only called for the predefined entities.
2010-03-20 12:53:59.871 xmlParsing[1012:207] Parser Did Start Document
2010-03-20 12:53:59.873 xmlParsing[1012:207] Parser foundElementDeclarationWithName: author model:
2010-03-20 12:53:59.873 xmlParsing[1012:207] Parser foundInternalEntityDeclarationWithName: js value: Jo Smith
2010-03-20 12:53:59.874 xmlParsing[1012:207] didStartElement: author type: (null)
2010-03-20 12:53:59.875 xmlParsing[1012:207] parser foundCharacters Before:
2010-03-20 12:53:59.875 xmlParsing[1012:207] parser foundCharacters After: <
2010-03-20 12:53:59.876 xmlParsing[1012:207] parser foundCharacters Before: <
2010-03-20 12:53:59.876 xmlParsing[1012:207] parser foundCharacters After: <
2010-03-20 12:53:59.877 xmlParsing[1012:207] parser foundCharacters Before: <
2010-03-20 12:53:59.878 xmlParsing[1012:207] parser foundCharacters After: <
2010-03-20 12:53:59.879 xmlParsing[1012:207] parser foundCharacters Before: <
2010-03-20 12:53:59.879 xmlParsing[1012:207] parser foundCharacters After: < >
2010-03-20 12:53:59.880 xmlParsing[1012:207] didEndElement: author with content: < >
2010-03-20 12:53:59.880 xmlParsing[1012:207] Parser Did End Document
I found a link to a tutorial on Using the SAX Interface of LibXML. The xmlSAXHandler that is used by NSXMLParser allows for a getEntity callback to be defined. After calling getEntity, the expansion of the entity is passed to the characters callback.
NSXMLParser is missing functionality here. What should happen is that the NSXMLParser or its delegate store the entity definitions and provide them to the xmlSAXHandler getEntity callback. This is clearly not happening. I will file a bug report.
In the meantime, the earlier answer of performing a string replacement is perfectly acceptable if your documents are small. Check out the SAX tutorial mentioned above along with the XMLPerformance sample app from Apple to see if implementing the libxml parser on your own is worthwhile.
This has been fun.

A possibly less hacky solution is to replace the DTD with a locally modified copy in which every external entity declaration is replaced with a local one.
This is how I do it:
First, find the document's DTD declaration and point it at a local file. For example, replace this:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html><body><a href='a.html'>hi!</a><br><p>Hello</p></body></html>
with this:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "file://localhost/Users/siuying/Library/Application%20Support/iPhone%20Simulator/6.1/Applications/17065C0F-6754-4AD0-A1EA-9373F6476F8F/App.app/xhtml1-transitional.dtd">
<html><body><a href='a.html'>hi!</a><br><p>Hello</p></body></html>
Download the DTD from the W3C URL and add it to your app bundle. You can find the path of the file with the following code:
NSBundle *bundle = [NSBundle bundleForClass:[self class]];
NSString *path = [[bundle URLForResource:@"xhtml1-transitional" withExtension:@"dtd"] absoluteString];
Open the DTD file, find any external entity reference:
<!ENTITY % HTMLlat1 PUBLIC
"-//W3C//ENTITIES Latin 1 for XHTML//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent">
%HTMLlat1;
Replace it with the content of the entity file (http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent in the above case).
After replacing all external references, NSXMLParser should handle the entities properly without needing to download every remote DTD or external entity file each time it parses an XML file.

You could do a string replace within the data before you parse it with NSXMLParser. NSXMLParser is UTF-8 only as far as I know.
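A minimal sketch of such a pre-processing step, assuming ARC. The helper name ReplaceNamedEntities is hypothetical and not part of any Apple API. Rather than substituting raw characters, it rewrites each named entity as a numeric character reference such as &#232;, which an XML parser should accept without a DTD. The caller supplies the name-to-code-point map, for example built from the xhtml-lat1.ent list.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: rewrites named HTML entities such as &egrave; as numeric
// character references (&#232;). Names missing from the map, including the
// predefined XML entities like &amp;, are left untouched.
static NSString *ReplaceNamedEntities(NSString *html,
                                      NSDictionary<NSString *, NSNumber *> *codePoints) {
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:@"&([A-Za-z][A-Za-z0-9]*);"
                                                  options:0
                                                    error:NULL];
    NSMutableString *result = [html mutableCopy];
    NSArray<NSTextCheckingResult *> *matches =
        [regex matchesInString:html options:0 range:NSMakeRange(0, html.length)];
    // Walk backwards so the ranges of earlier matches stay valid while editing.
    for (NSTextCheckingResult *match in [matches reverseObjectEnumerator]) {
        NSString *name = [html substringWithRange:[match rangeAtIndex:1]];
        NSNumber *codePoint = codePoints[name];
        if (codePoint != nil) {
            NSString *reference =
                [NSString stringWithFormat:@"&#%u;", codePoint.unsignedIntValue];
            [result replaceCharactersInRange:match.range withString:reference];
        }
    }
    return result;
}
```

The result can then be converted back to NSData and handed to NSXMLParser as before. This only addresses the entities; it does nothing about HTML that is not well-formed XML.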

I think you're going to run into another problem with this example, as it isn't valid XML, which is what NSXMLParser expects.
The exact problem in the above is that the META, LI, HTML and BODY tags aren't closed, so the parser looks all the way through the rest of the document for each closing tag.
The only way around this that I know of, if you can't change the HTML, is to mirror it with the closing tags inserted.

I would try using a different parser, like libxml2 - in theory I think that one should be able to handle poor HTML.

Since I've just started doing iOS development I've been searching for the same thing and found a related mailing list entry: http://www.mail-archive.com/cocoa-dev@lists.apple.com/msg17706.html
- (NSData *)parser:(NSXMLParser *)parser resolveExternalEntityName:(NSString *)entityName systemID:(NSString *)systemID {
    NSAttributedString *entityString = [[[NSAttributedString alloc] initWithHTML:[[NSString stringWithFormat:@"&%@;", entityName] dataUsingEncoding:NSUTF8StringEncoding] documentAttributes:NULL] autorelease];
    NSLog(@"resolved entity name: %@", [entityString string]);
    return [[entityString string] dataUsingEncoding:NSUTF8StringEncoding];
}
This is fairly similar to your original solution and also causes a parser error NSXMLParserErrorDomain error 26; but it does continue parsing after that. The problem is, of course, that it's harder to tell real errors apart ;-)

Related

How to remove tags in html in objective-c?

When I call the web service, the response shown in my text field looks like this.
This is the response I get when I hit the web service in a browser:
catid = 38;
created = "May 02 2013";
"created_by" = 588;
fulltext = "<p>
But I must explain to you how all this mistaken idea of denouncing
leasure and praising pain was born and I will give you a complete
account of the system, and expound the actual teachings of the great
xplorer of the truth, the master-builder of human happiness. No one
rejects, dislikes, or avoids pleasure itself, because it is pleasure
?<\p> <\p>But I must explain to you how all this mistaken idea of
denouncing leasure and praising pain was born and I will give you a
complete account of the system, and expound the actual teachings of
the great explorer of the truth, the master-builder of human
happiness.
But I want to replace <\p> <\p> with a paragraph break.
This is the response I want in my iPhone app:
catid = 38;
created = "May 02 2013";
"created_by" = 588;
fulltext = "<p>
But I must explain to you how all this mistaken idea of denouncing
leasure and praising pain was born and I will give you a complete
account of the system, and expound the actual teachings of the great
xplorer of the truth, the master-builder of human happiness. No one
rejects, dislikes, or avoids pleasure itself, because it is pleasure
? But I must explain to you how all this mistaken idea of
denouncing leasure and praising pain was born and I will give you a
complete account of the system, and expound the actual teachings of
the great explorer of the truth, the master-builder of human
happiness.
Try this function to remove the HTML tags.
- (NSString *)flattenHTML:(NSString *)html {
    NSScanner *theScanner;
    NSString *text = nil;
    theScanner = [NSScanner scannerWithString:html];
    while ([theScanner isAtEnd] == NO) {
        // find start of tag
        [theScanner scanUpToString:@"<" intoString:NULL];
        // find end of tag
        [theScanner scanUpToString:@">" intoString:&text];
        // replace the found tag with a space
        // (you can filter multi-spaces out later if you wish)
        html = [html stringByReplacingOccurrencesOfString:
                    [NSString stringWithFormat:@"%@>", text]
                                               withString:@" "];
        html = [html stringByReplacingOccurrencesOfString:@"&nbsp;" withString:@" "];
    } // while
    return html;
}
I'm not sure why the </p> tags always come in threes, because valid HTML should have the tags in pairs, e.g. <p></p>. But if you are really sure they will always come in threes, you can use the NSString instance method stringByReplacingOccurrencesOfString:withString:
- (NSString *)replaceThreeParagraphTags:(NSString *)html {
    return [html stringByReplacingOccurrencesOfString:@"</p></p></p>" withString:@"\n\n"];
}
Basically we just replace every occurrence of three </p> tags with two line breaks (\n), because it seems you want a blank line between the paragraphs.

CDATA Block Parsing

I have searched for this and my brain is on fire.
I am getting
<description><![CDATA[<img src='http://behance.vo.llnwd.net/profiles22/700504/projects/2335700.jpg' style='float:left; margin-right:15px;' /><br /> NIL]]></description>
I don't know how to parse out the particular link (http://behance.vo.llnwd.net/profiles22/700504/projects.jpg).
I have even tried to use
- (void)parser:(NSXMLParser *)parser foundCDATA:(NSData *)CDATABlock
{
    if ([sElementName isEqualToString:@"description"])
    {
        // Decode the raw CDATA bytes and log the resulting string
        NSMutableString *someString = [[NSMutableString alloc] initWithData:CDATABlock encoding:NSUTF8StringEncoding];
        NSLog(@"%@", someString);
    }
}
It gets printed like this:
<img src='http://behance.vo.llnwd.net/profiles22/700504/projects/2335700.jpg' style='float:left; margin-right:15px;' /><br /> NIL
Please help me extract that particular link. Any links or answers would help.
Thanks in advance.
A CDATA section exists for exactly this purpose: embedding markup in an XML document as plain text (as opposed to nested XML that changes the document's structure). So, after obtaining this particular string containing the <img> tag, you can run it through another XML parser to obtain the value of the src attribute.
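One way that second parsing pass might look, as a sketch assuming ARC. ImageSourceCollector and ImageSourcesInFragment are hypothetical names introduced here. Because the CDATA text contains several top-level nodes, the sketch wraps it in a made-up <root> element so that it forms a single well-formed document. It will still fail on fragments that are not valid XML.

```objc
#import <Foundation/Foundation.h>

// Hypothetical delegate that records the src attribute of every <img> element.
@interface ImageSourceCollector : NSObject <NSXMLParserDelegate>
@property (nonatomic, strong) NSMutableArray<NSString *> *sources;
@end

@implementation ImageSourceCollector

- (instancetype)init {
    if ((self = [super init])) {
        _sources = [NSMutableArray array];
    }
    return self;
}

- (void)parser:(NSXMLParser *)parser
didStartElement:(NSString *)elementName
  namespaceURI:(NSString *)namespaceURI
 qualifiedName:(NSString *)qName
    attributes:(NSDictionary<NSString *, NSString *> *)attributeDict {
    if ([elementName caseInsensitiveCompare:@"img"] == NSOrderedSame) {
        NSString *src = attributeDict[@"src"];
        if (src != nil) {
            [self.sources addObject:src];
        }
    }
}

@end

// Hypothetical helper: parses an XML fragment (such as the decoded CDATA text)
// and returns the src values of the <img> elements it contains.
static NSArray<NSString *> *ImageSourcesInFragment(NSString *fragment) {
    // Wrap the fragment so the parser sees exactly one root element.
    NSString *wrapped = [NSString stringWithFormat:@"<root>%@</root>", fragment];
    NSXMLParser *parser =
        [[NSXMLParser alloc] initWithData:[wrapped dataUsingEncoding:NSUTF8StringEncoding]];
    ImageSourceCollector *collector = [[ImageSourceCollector alloc] init];
    parser.delegate = collector;
    [parser parse];
    return [collector.sources copy];
}
```

In the foundCDATA: callback from the question, the decoded string could be passed to ImageSourcesInFragment and the first element of the result used as the link.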

NSXMLParser processing complex units of information

I'm successfully processing a response from the server using NSXMLParser.
It looks something like this:
<data>
<company id="">
<name>XXX</name>
<latitude></latitude>
<longitude></longitude>
</company>
<company id="">
<name>XXX</name>
<latitude></latitude>
<longitude></longitude>
</company>
</data>
I've been using the following methods:
didStartElement:namespaceURI: ... to detect when a new company needs to be parsed, at which point I allocate a new instance. It also detects when one of the company's fields starts.
foundCharacters: to process the content of each field.
didEndElement: ... to detect that the company has been parsed completely and can be added to the internal list. It also detects when a field has been processed, and then sets the value collected in foundCharacters:.
Now I also need to get the complete XML for one company and store it in a local cache. Does anybody know if there is a way to get all the content for just one company using NSXMLParser? Or maybe without using NSXMLParser; I don't know.
<company id="">
<name>XXX</name>
<latitude></latitude>
<longitude></longitude>
</company>
Thank you,
Finally I decided to re-create the XML using the SAX methods:
parser:didStartElement:
// Adding the initial TAG of the xml
accountXML = [[NSString alloc] initWithFormat:@"<%@", elementName];
for (NSString *key in [attributeDict allKeys]) {
    accountXML = [accountXML stringByAppendingFormat:@" %@=\"%@\"", key,
                  [attributeDict valueForKey:key]];
}
accountXML = [accountXML stringByAppendingString:@">\n"];
and
parser:didEndElement:
// Add the xml to the account and release it
accountXML = [accountXML stringByAppendingFormat:@"</%@>\n", elementName];
[account setCompleteXML:accountXML];

How to get a url substring from a string in objective-c

<p><a href="http://vimeo.com/23486376" title="Rebecca Black's Friday on Rock Band"><img src="http://b.vimeocdn.com/ts/152/946/152946954_200.jpg" alt="Rebecca Black's Friday on Rock Band" /></a></p><p></p><p>Cast: <a href="http://vimeo.com/thenerdery" style="color: #2786c2; text-decoration: none;">The Nerdery</a></p>
I got a string like above, I am wondering what's the best way in objective-c to get the "http://b.vimeocdn.com/ts/152/946/152946954_200.jpg" substring? NSScanner? NSString methods? Thanks!
update: the string is actually:
<p><a href="http://vimeo.com/23333305" title="Ad League Bowling Championship"><img src="http://b.vimeocdn.com/ts/151/787/151787049_200.jpg" alt="Ad League Bowling Championship" /></a></p><p></p><p>Cast: <a href="http://vimeo.com/thenerdery" style="color: #2786c2; text-decoration: none;">The Nerdery</a></p>
Here's another approach:
Transform the string into actual HTML. In other words, convert the &lt; and &gt; stuff into < and >
Run it through an NSXMLParser
In the parser delegate, check the attributes dictionary passed into the -parser:didStartElement:namespaceURI:qualifiedName:attributes: method.
If the element name is @"img", then the attributes dictionary should have a key called @"src" that maps to the string @"http://b.vimeocdn.com/ts/152/946/152946954_200.jpg".
Once you have the string, it's trivial to transform it into an NSURL using +[NSURL URLWithString:]
This will work regardless of how the source HTML changes over time. The other approaches suggested are extremely fragile, because they rely on things like src being all lowercase and there only being a single src attribute anywhere in the string (what if you have 2?). You don't want to parse HTML; you want to get an attribute out of an XML element. So use the built-in way of doing it! :)
NSString *strComplete = @"<p><a href=\"http://vimeo.com/23486376\" title=\"Rebecca Black's Friday on Rock Band\"><img src=\"http://b.vimeocdn.com/ts/152/946/152946954_200.jpg\" alt=\"Rebecca Black's Friday on Rock Band\" /></a></p><p></p><p>Cast: <a href=\"http://vimeo.com/thenerdery\" style=\"color: #2786c2; text-decoration: none;\">The Nerdery</a></p>";
// Split at the opening of the src attribute, then at its closing quote,
// so the surrounding quotes are not part of the result.
NSArray *arrComplete = [strComplete componentsSeparatedByString:@"src=\""];
NSString *strSecond = [arrComplete objectAtIndex:1];
NSArray *arrSecond = [strSecond componentsSeparatedByString:@"\""];
NSString *strURLImage = [arrSecond objectAtIndex:0];
strURLImage will be your desired string.
You can use a framework for URL detection and parsing, such as AutoHyperlinks. On iOS, you will, of course, have to build it statically or build the source directly into your app.
Alternatively, for iOS only (currently), use NSDataDetector. Data detectors can find URLs, physical addresses, phone numbers, etc.; you tell it what you'll want from the string, then use the methods of NSRegularExpression to obtain its findings.
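A rough sketch of the NSDataDetector route, assuming ARC. URLsInText is a hypothetical helper name. It asks the detector for link matches only and collects the URL of each result.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: returns every URL that NSDataDetector finds in the text.
static NSArray<NSURL *> *URLsInText(NSString *text) {
    NSError *error = nil;
    NSDataDetector *detector =
        [NSDataDetector dataDetectorWithTypes:NSTextCheckingTypeLink error:&error];
    if (detector == nil) {
        return @[];
    }
    NSMutableArray<NSURL *> *urls = [NSMutableArray array];
    NSArray<NSTextCheckingResult *> *matches =
        [detector matchesInString:text options:0 range:NSMakeRange(0, text.length)];
    for (NSTextCheckingResult *match in matches) {
        if (match.URL != nil) {
            [urls addObject:match.URL];
        }
    }
    return urls;
}
```

On the string from the question this would presumably return the href links as well as the image URL, so some filtering (for example on the path extension) would still be needed to single out the thumbnail.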
Assuming the complete string is called myString, you can do:
// Assumes both the src attribute and its closing quote are present
NSRange srcRange = [myString rangeOfString:@"src=\""];
// The URL starts right after the opening quote of the src attribute
NSUInteger start = NSMaxRange(srcRange);
NSRange endRange = [myString rangeOfString:@"\"" options:0 range:
    NSMakeRange(start, [myString length] - start)];
NSString *url = [myString substringWithRange:
    NSMakeRange(start, endRange.location - start)];
I haven't tested this, so if the result has a character more or fewer than you want, adjust the offsets in the last line.

NSXMLParser rss issue NSXMLParserInvalidCharacterError

NSXMLParserInvalidCharacterError # 9
This is the error I get when I hit a weird character (like quotes copied and pasted from Word into the web form, which end up in the feed). The feed I am using does not declare an encoding, and there is no hope of getting them to change that. This is all I get in the header:
<?xml version="1.0"?>
<rss version="2.0">
What can I do about illegal characters when parsing feeds? Do I sweep the data prior to the parse? Is there something I am missing in the API? Has anyone dealt with this issue?
NSString *dataString = [[[NSString alloc] initWithData:webData encoding:NSASCIIStringEncoding] autorelease];
NSData *data = [dataString dataUsingEncoding:NSUTF8StringEncoding allowLossyConversion:YES];
NSXMLParser *parser = [[NSXMLParser alloc] initWithData:data];
Fixed my problems...
The NSString -initWithData:encoding: method returns nil if it fails, so you can try one encoding after another until you find one that converts. This doesn't guarantee that you'll convert all the characters correctly, but if your feed source isn't sending you correctly encoded XML, then you'll probably have to live with it.
The basic idea is:
// try the most likely encoding
NSString *xmlString = [[NSString alloc] initWithData:xmlData
                                            encoding:NSUTF8StringEncoding];
if (xmlString == nil) {
// try the next likely encoding
xmlString = [[NSString alloc] initWithData:xmlData
encoding:NSWindowsCP1252StringEncoding];
}
if (xmlString == nil) {
// etc...
}
To be generic and robust, you could do the following until successful:
1.) Try the encoding specified in the Content-Type header of the HTTP response (if any)
2.) Check the start of the response data for a byte order mark and if found, try the indicated encoding
3.) Look at the first two bytes; if you find a whitespace character or '<' paired with a nul/zero character, try UTF-16 (similarly, you can check the first four bytes to see if you have UTF-32)
4.) Scan the start of the data looking for the <?xml ... ?> processing instruction and look for encoding='something' inside it; try that encoding.
5.) Try some common encodings. Definitely check Windows Latin-1, Mac Roman, and ISO Latin-1 if your data source is in English.
6.) If none of the above work, you could try removing all bytes greater than 127 (or substitute '?' or another ASCII character) and convert the data using the ASCII encoding.
If you don't have an NSString by this point, you should fail. If you do have an NSString, you should look for the encoding declaration in the <?xml ... ?> processing instruction (if you didn't already in step 4). If it's there, you should convert the NSString back to NSData using that encoding; if it's not there, you should convert back using UTF-8 encoding.
Also, the CFStringConvertIANACharSetNameToEncoding() and CFStringConvertEncodingToNSStringEncoding() functions can help get the NSStringEncoding that goes with the encoding name from the Content-Type header or the <?xml ... ?> processing instruction.
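A condensed sketch of steps 1 and 5 combined with those two functions. EncodingForIANAName and DecodeFeedData are hypothetical names, and the order of the fallback encodings is only an assumption. The byte order mark, UTF-16 and <?xml ... ?> checks from steps 2 to 4 are left out.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: maps an IANA charset name (e.g. "ISO-8859-1") to an
// NSStringEncoding, or returns 0 when the name is not recognised.
static NSStringEncoding EncodingForIANAName(NSString *name) {
    CFStringEncoding cfEncoding =
        CFStringConvertIANACharSetNameToEncoding((__bridge CFStringRef)name);
    if (cfEncoding == kCFStringEncodingInvalidId) {
        return 0;
    }
    return CFStringConvertEncodingToNSStringEncoding(cfEncoding);
}

// Hypothetical helper: tries the declared encoding first (declaredName may be
// nil), then a fixed list of common encodings, and returns the first string
// that converts successfully, or nil if none does.
static NSString *DecodeFeedData(NSData *data, NSString *declaredName) {
    NSMutableArray<NSNumber *> *candidates = [NSMutableArray array];
    NSStringEncoding declared = declaredName ? EncodingForIANAName(declaredName) : 0;
    if (declared != 0) {
        [candidates addObject:@(declared)];
    }
    [candidates addObjectsFromArray:@[ @(NSUTF8StringEncoding),
                                       @(NSWindowsCP1252StringEncoding),
                                       @(NSMacOSRomanStringEncoding),
                                       @(NSISOLatin1StringEncoding) ]];
    for (NSNumber *candidate in candidates) {
        NSString *decoded =
            [[NSString alloc] initWithData:data
                                  encoding:candidate.unsignedIntegerValue];
        if (decoded != nil) {
            return decoded;
        }
    }
    return nil;
}
```

The returned string can then be converted back to UTF-8 data for NSXMLParser, as described above. Note that the single-byte encodings accept almost any input, so a successful conversion does not prove the guess was right.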
You can also remove the encoding declaration from the XML like this:
NSUInteger length = str.length > 100 ? 100 : str.length;
NSString *mystr = [str stringByReplacingOccurrencesOfString:@"encoding=\".*?\""
                                                  withString:@""
                                                     options:NSRegularExpressionSearch
                                                       range:NSMakeRange(0, length)];